Docker and R: Rocker

Docker is a popular tool for building and deploying applications in containers. It packages an application together with its dependencies into a single unit. Compared to virtual machines, Docker containers are lightweight and fast to deploy, because they share the kernel of the host instead of virtualizing a complete operating system.

Docker is useful for data science because an analysis and all its dependencies can be packaged in one container. An analysis created with R that runs on one computer then also runs on other computers, since everything it needs is shipped with it. One can also create a Docker image with older versions of R packages. This makes the analysis reproducible and prevents it from breaking once a package is updated. Another possibility is to offload computationally intensive tasks to another machine. This is especially handy for Bayesian models estimated with Stan, where a complicated model can take hours to estimate.

Working with Docker involves three parts:

A Dockerfile, which describes how a Docker image is created
A Docker image, which is built from the Dockerfile
A Docker container, which is a running instance of a Docker image

In this post I will create a simple Dockerfile that includes the analysis and the data, build the corresponding Docker image, and run it as a container.

To install Docker on Manjaro we use pacman:

sudo pacman -S docker

The most widely used Docker images for R are provided by the Rocker project and are based on stable Debian releases. A Dockerfile starts from a base image. In this case we use a geospatial image and install some additional packages.

FROM rocker/geospatial:3.4.0

RUN apt-get update && apt-get install -y libssl-dev libxml2-dev

RUN R -e "options(repos = list(CRAN = 'http://mran.revolutionanalytics.com/snapshot/2019-03-15'), Ncpus = 6L); \
     install.packages(c('rstan', 'brms', 'dplyr'), dependencies = TRUE)"

COPY Bayes.R  R-Code/  data/  /home/rstudio/

#CMD R -e "source('/home/rstudio/Bayes.R')"

FROM specifies the base image on which our Docker image is built. As we need an image with the spatial packages already installed, we use rocker/geospatial with R version 3.4.0. RUN executes shell commands while the image is built. The Rocker image is based on Debian, so we use apt-get install to add system packages. Here we install two of them, libssl and libxml2, both in the development version. These packages are needed for the shinystan package, which is a suggested dependency of the brms package. We can then install additional R packages by calling R with the option -e, which evaluates the given expression.

Afterwards we copy the data that we need into the home directory of the Rocker image, which is /home/rstudio in this case. The COPY instruction is similar to cp in the shell: it copies files from the build directory on the host into the given directory of the image. At the end we could run the complete analysis with the CMD instruction.

The difference between RUN and CMD is the following. RUN executes a command inside the image once, during the build phase, and the result is written into a new layer of the image. CMD defines the command that is run each time a container starts. There are additional Dockerfile instructions, such as ENV, but for this example the ones above are sufficient.
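The difference between build time and start time can be made visible with a tiny Dockerfile. The following is only a sketch: the scratch directory, the marker file /built_at.txt and the image tag run_vs_cmd used further down are made up for illustration, and the base image is simply the one from this post.

```shell
# Scratch build context (hypothetical directory name).
demo_dir="${TMPDIR:-/tmp}/run_vs_cmd_demo"
mkdir -p "$demo_dir"

# RUN is executed once while the image is built; its result ends up in a layer.
# CMD is only recorded at build time and executed whenever a container starts.
# The quoted 'EOF' keeps the host shell from expanding $(...) in the heredoc.
cat > "$demo_dir/Dockerfile" <<'EOF'
FROM rocker/geospatial:3.4.0
RUN date > /built_at.txt
CMD ["sh", "-c", "echo built at: $(cat /built_at.txt); echo started at: $(date)"]
EOF

# Show the generated file.
cat "$demo_dir/Dockerfile"
```

After building it with docker build -t run_vs_cmd "$demo_dir", starting it twice with docker run --rm run_vs_cmd should report the same build time but a different start time on each run.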

To use Docker, we first need to start the Docker daemon using systemctl:

sudo systemctl start docker

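By default, docker commands have to be run as root, for example with sudo. Alternatively, the user can be added to the docker group with usermod, after which docker works without sudo. The change takes effect at the next login.

sudo usermod -aG docker username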
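Group membership is only picked up by new login sessions, so it can be useful to check whether the current session already sees the docker group. A small sketch; the helper name in_group is my own and not part of Docker:

```shell
# Succeeds (exit status 0) if user $1 belongs to group $2.
in_group() {
    # id -nG prints the group names separated by spaces; put one per line
    # and look for an exact match.
    id -nG "$1" | tr ' ' '\n' | grep -qx "$2"
}

if in_group "$(id -un)" docker; then
    echo "docker can be used without sudo"
else
    echo "not in the docker group (yet); log out and back in after usermod"
fi
```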

Once the Dockerfile is finished, we need to build the image.

docker build -t name_image .

With docker build we create an image from a Dockerfile. The option -t specifies the tag, that is, the name of the image. The last argument is the path to the directory that contains the Dockerfile. As we are already in that directory, we can just use a dot (.).

To see all available Docker images, use docker image ls:

docker image ls

To finally start a container from the image, we run the following command:

docker run -e USER=user -e PASSWORD=password -p 8787:8787 name_image

The option -e sets an environment variable and has to be given once per variable. For the Rocker image these are the username and the password. -p maps a port of the container to a port on the host, here 8787, on which RStudio Server is available.
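For longer sessions it can be convenient to start the container in the background and give it a name. A minimal sketch, assuming the image was built as name_image; the helper start_rstudio and the container name rstudio_analysis are made up for illustration:

```shell
# Hypothetical helper: start the RStudio Server container in the background.
# $1 = password for the login, $2 = port on the host (defaults to 8787).
start_rstudio() {
    # --rm    remove the container once it stops
    # -d      run detached, i.e. in the background
    # --name  fixed name, so the container is easy to address later
    docker run --rm -d \
        --name rstudio_analysis \
        -e PASSWORD="$1" \
        -p "${2:-8787}:8787" \
        name_image
}
```

Calling start_rstudio rstudioaccess 8788 would make RStudio Server available on localhost:8788. The container can then be inspected with docker logs rstudio_analysis and stopped with docker stop rstudio_analysis; because of --rm it is removed as soon as it stops.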

The results in this container are not persistent and are lost once the container is removed. To access the results from the host, we share a folder between the host and the container. The option -v mounts a folder of the host into the container, so that it can be used from both sides.

docker run -v ~/Docker_dateien/:/home/rstudio/analysis -e PASSWORD=rstudioaccess -p 8787:8787 name_image

In this example the folder ~/Docker_dateien on the host is used for sharing the data. Inside the container it appears as /home/rstudio/analysis; results copied there remain available on the host after the container is gone.
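One detail is worth handling before the first run: if the host folder does not exist yet, Docker creates it for a -v mount, and it then typically belongs to root. A small sketch that creates the folder as the current user first; the helper name run_with_share is hypothetical, and name_image is assumed to be the image built earlier:

```shell
# Hypothetical helper: make sure the shared folder exists, then start the
# container with that folder mounted at /home/rstudio/analysis.
# $1 = absolute path of the shared folder on the host.
run_with_share() {
    # Create the folder as the current user so it is not created by the daemon.
    mkdir -p "$1" &&
    docker run --rm -d \
        -v "$1":/home/rstudio/analysis \
        -e PASSWORD=rstudioaccess \
        -p 8787:8787 \
        name_image
}
```

With the folder from this post the call would be run_with_share "$HOME/Docker_dateien". The path has to be absolute, because docker treats a bare name in -v as a named volume rather than as a host folder.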

To access RStudio Server, we open localhost:8787 in the browser. A login screen is presented and we log in with the username rstudio and the password we set before, in this case rstudioaccess. An RStudio session opens with all the data we copied before. We can create an R script and run it in the container. Once we are finished with the analysis, we copy the scripts to the shared folder, which is /home/rstudio/analysis inside the container and ~/Docker_dateien on the host.
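The analysis can also be run without the browser by passing a command to docker run, which replaces the default command of the image for that one container. This is only a sketch under a few assumptions: the image was built as name_image, Bayes.R was copied to /home/rstudio as in the Dockerfile above, the script writes its results into the analysis subfolder, and the image does not define an entrypoint that ignores the command. The helper name run_analysis is made up.

```shell
# Hypothetical helper: run Bayes.R once in a fresh container and exit.
# $1 = absolute path of the shared folder on the host.
run_analysis() {
    # -w sets the working directory, so relative paths in the script
    # resolve against /home/rstudio.
    docker run --rm \
        -w /home/rstudio \
        -v "$1":/home/rstudio/analysis \
        name_image \
        Rscript Bayes.R
}
```

A call such as run_analysis "$HOME/Docker_dateien" would leave whatever the script saved under analysis/ in the shared folder on the host.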