Docker in a Nutshell

Docker has been a really convenient tool for me to play with, and it has helped me both in doing research and in having fun. Recently I have been advertising Docker to my friends, but my explanations were sometimes vague and ill-defined. I therefore decided to refresh my own understanding by writing this post, which I hope will be useful to others as well.

1. Motivation: environments

Configuring development environments can be a huge headache. How do you ensure that your code behaves exactly the way it is supposed to on any machine, regardless of the operating system? Even on the same OS, different people use different versions of software - Python 2 versus Python 3, for example. It helps to list the required packages and dependencies in a requirements.txt or a makefile, but unexpected bugs may still occur, as such a list can never be general enough and error-free on every user's computer.

Therefore, it is desirable to make a verbatim copy of your development environment available to users, so that they can use it directly without worrying about configuration. This is the motivation for using Docker.

2. Why not use virtual machines?

You may wonder why we cannot just use a virtual machine, which would also give users a verbatim copy. Indeed, we can. But VMs are usually quite large: I used to use an Ubuntu VM that was more than 10GB in size. Copying such a big file across machines, or downloading it from the internet, is time consuming. Also, when you create a VM, part of your machine's resources is allocated to it whether the VM actually uses them or not, which means resources are wasted. In addition, VMs are slow to start, as you are effectively powering up a full-fledged operating system.

3. Docker

Now we are ready to appreciate the convenience Docker brings. A Docker container is indeed a “container”. The pieces of code that power your app or your analysis, together with their dependencies, are packaged into a Docker image. For example, you can find “Dockerized” TensorFlow, PyTorch, RStudio, and many other tools. When you run an image, a container is created, inside which the code runs. You can interact with the container via the command line, or use a web interface, usually accessed by typing localhost:port in the address bar of your web browser (a sketch is given below). Unlike a virtual machine, whose processes run inside a separate guest operating system, the processes in a Docker container run directly on your host machine, sharing the host's kernel. This saves you the wait for a VM's operating system to boot.
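For the web-interface case, you map a container port to a host port with -p when running the image. A minimal sketch, assuming the rocker/rstudio image, which serves RStudio Server on port 8787 (the image, port, and password here are illustrative and not used elsewhere in this post):

## run a containerized RStudio Server and expose it on the host
$ docker run --rm -p 8787:8787 -e PASSWORD=secret rocker/rstudio
## then visit localhost:8787 in your web browser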

Also, a container wastes fewer resources - it only uses whatever it needs at the moment. If your Docker container is idle, it takes up few resources, which you can watch for yourself as shown below.
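The built-in docker stats command streams per-container CPU and memory usage in real time:

## live resource usage of running containers (Ctrl-C to quit)
$ docker stats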

One more thing - a Docker image is usually much smaller than a VM. For example, the Docker version of Ubuntu only takes around 100MB of space - what a saving!

4. Using Docker images built by others

We stand on the shoulders of giants - the same principle applies to Docker. We start by using images built by others, and gradually learn how things work. Let us demonstrate with a simple example.

Suppose I want to compare the results of the sample() function in R 3.5.1 and R 3.6.1, without messing up my local R installation. The first thing I do, after installing Docker CE for Mac, is to pull the two corresponding images.

$ docker pull r-base:3.5.1
$ docker pull r-base:3.6.1

Now if we run docker images, we will see a list of all docker images, together with their tags.

$ docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
r-base              3.6.1               4e55790c88ae        2 weeks ago         642MB
r-base              3.5.1               bd9edc1a85ed        8 months ago        712MB

Now, to launch an interactive R session with a specific version (-ti attaches an interactive terminal, and --rm removes the container once the session exits), do

$ docker run -ti --rm r-base:3.5.1

Everything works just as if you had opened an R session locally. To make the comparison, I use two terminal windows and run both versions side by side. And we do see that the sample() function, even with the random seed set, gives different results! Upon further investigation, I found that in R 3.5.1, a random number between 0 and 1 is generated first, multiplied by 1000, and then truncated to an integer and shifted by one. It can be verified that, after setting the seed to 1, runif(1) gives 0.2655087. In R 3.6.0 and up, the default sampling algorithm changed to a rejection-based method, which removes a slight non-uniformity of the old approach, so the results differ.
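Concretely, this is the kind of comparison I ran in each container (a minimal sketch; the 266 follows from the rounding scheme above, floor(0.2655087 * 1000) + 1):

## run this in each of the two containers
set.seed(1)
sample(1000, 1)
## R 3.5.1 returns 266; R 3.6.1 returns a different value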

Besides the command-line interactive mode, there are other ways to operate a Docker container - through a Jupyter notebook interface, in batch mode, or silently in the background. These modes serve different purposes, so we will not go into detail, but the sketch below gives a small taste.
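These commands are illustrative only, reusing the r-base image pulled above:

## batch mode: run one expression, print the result, and exit
$ docker run --rm r-base:3.5.1 Rscript -e 'set.seed(1); print(sample(1000, 1))'

## background (detached) mode: -d returns control to your shell immediately
$ docker run -d --name rsleeper r-base:3.5.1 R -e 'Sys.sleep(3600)'
$ docker stop rsleeper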

5. Docker for statisticians: reproducible research

Suppose now I want to run some simulations and build everything into a Docker image. The minimal piece of code below will be used in the demonstration.

library(ggplot2)

## generate points uniformly distributed over the unit circle;
## taking sqrt() of a uniform variable makes the density uniform in area
set.seed(1)
r <- sqrt(runif(1000, 0, 1))
theta <- runif(1000, 0, 2 * pi)
x <- r * cos(theta)
y <- r * sin(theta)
mydata <- data.frame(x = x, y = y)

myplot <- ggplot(mydata, aes(x = x, y = y)) + geom_point() + theme_bw()
## pass the plot explicitly, so that ggsave() does not rely on the
## plot having been displayed first
ggsave(filename = "unifpoints.pdf", plot = myplot)

Say I have saved this piece of code as example.R and put it in a ~/mydocker folder. To build an image, we need to write a Dockerfile, in which the environment configuration is specified.

First, let us pick a specific version of R. The base image, r-base (available on Docker Hub), is loaded first. From the wide choice of tags, let us choose 3.5.1 for the demonstration. Now we can start the Dockerfile. We want the script to run in a folder called myplot and the plot to end up in a results folder. Note that the package ggplot2 is required but the base image does not come with it, so we install it first.

FROM r-base:3.5.1

## build-time argument recording when the analysis was run; storing it in a
## label (an addition, one possible use of WHEN) makes it inspectable later
ARG WHEN
LABEL built_on=${WHEN}

## folders for the script and for the output
RUN mkdir /home/myplot
RUN mkdir /home/results

## the base image does not ship ggplot2, so install it into the image
RUN R -e "install.packages('ggplot2', repos = 'http://cran.rstudio.com/')"

## copy the script from the host into the image
COPY example.R /home/myplot/example.R

## the command executed when a container starts from this image
CMD cd /home/myplot/ \
	&& R -e "source('/home/myplot/example.R')" \
	&& mv /home/myplot/unifpoints.pdf /home/results/unifpoints.pdf

Now, with a finished Dockerfile, we are ready to build our image. The WHEN build argument records when the analysis was done; -t example names the image; and the final . is the build context, i.e., the build happens in the current working directory.

$ docker build --build-arg WHEN=2019-07-27 -t example .

## now see a list of docker images I have
(base) xueyishu@Tiger:~/mydocker $ docker images
REPOSITORY              TAG                 IMAGE ID            CREATED             SIZE
example                 latest              fd453879b1e6        28 seconds ago      712MB
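If you kept the LABEL line suggested in the Dockerfile above, the build date can be read back from the image metadata:

## print the WHEN value recorded at build time
$ docker inspect -f '{{ index .Config.Labels "built_on" }}' example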

Finally, we can launch our example. We want the final results exported to a results folder under the current directory, so we create it and mount it into the container with -v (host path on the left of the colon, container path on the right):

$ mkdir results
$ docker run -v ~/mydocker/results:/home/results example

Now we check the subfolder:

(base) xueyishu@Tiger:~/mydocker/results $ ls -l
total 32
-rw-r--r--@ 1 xueyishu  staff  12552 Jul 27 22:01 unifpoints.pdf

A minor comment - you might have noticed that the size of the Docker image we just built is 712MB, which does not seem worth it - why not just use my local machine? That is because our example is overly simple. When you want a whole, complicated research project to be fully reproducible, Docker will be your best friend.
