Docker in a Nutshell
Docker has been a really convenient tool for me to play with, and it has helped me both in doing research and in having fun. Recently I have been advertising Docker to my friends, but my explanations are sometimes vague and ill-defined. Therefore I decided to refresh my own understanding by writing this post, and hopefully it will be useful to others as well.
1. Motivation: environments
Configuring development environments can be a huge headache. How do you ensure that your code will perform exactly the same way it is supposed to on any machine, regardless of the operating system? Even on the same OS, different people use different versions of software – like Python 2 and Python 3. Sometimes it helps to list all the required packages and dependencies in a requirements.txt or makefile, but unexpected bugs may still happen, as you cannot make such files general enough and error-free on all users' computers.
Therefore, it is desirable that a verbatim copy of your development environment can be made available to users, so that they can use it directly without worrying about configuration. This is the motivation for using Docker.
2. Why not use virtual machines
You may wonder why we cannot just use a virtual machine, which would also ensure that the copy is a verbatim one. Indeed, we can use a VM. But VMs are usually quite large. I used to use an Ubuntu VM that was more than 10GB in size. Copying such a big item across machines, or downloading it from the internet, can be time-consuming. Also, when you create a VM, part of your machine's resources is allocated to it, whether the VM actually uses them or not, which means resources are wasted. In addition, VMs are quite slow to start, as you are effectively powering up a full-fledged operating system.
3. Docker
Now we are ready to appreciate the convenience brought to us by Docker. A Docker container is indeed a “container”. The pieces of code that power your app or your analysis, together with their dependencies, are packaged into one Docker image. For example, you can find “Dockerized” Tensorflow, Pytorch, Rstudio, and many other tools. When you run the image, a container is created, within which the code runs. You can interact with the container either via the command line or via a web interface, which is usually accessed by typing localhost:port in the address bar of your web browser. Unlike a virtual machine, whose processes run independently of those on your host machine, the processes in a Docker container run directly on your host machine. This saves you the time spent waiting for a VM's OS to power up.
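As an illustration of the web-interface mode, here is a minimal sketch; jupyter/base-notebook is just one example of a public image that serves a web UI, and -p maps a container port onto a host port:
## the notebook interface becomes reachable at localhost:8888
$ docker run -p 8888:8888 jupyter/base-notebook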
Also, a container wastes fewer resources - it only uses whatever is needed. If your Docker container is idle, it takes up very little of your machine.
One more thing - a Docker image is usually much smaller than a VM. For example, the Docker version of Ubuntu only requires around 100MB of space - what a saving!
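This is easy to check yourself (the exact number in the SIZE column depends on the tag you pull):
$ docker pull ubuntu
## the SIZE column shows something on the order of 100MB, nowhere near a VM's 10GB
$ docker images ubuntu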
4. Using Docker images built by others
We stand on the shoulders of giants - the same principle applies to the usage of Docker. We start by using images built by others, and gradually learn how everything works. Let us go through a simple example.
Suppose I want to compare the results of the sample() function in R 3.5.1 and R 3.6.1, without messing up my local R installation. The first thing I do, after installing Docker CE for Mac, is to pull the two corresponding images.
$ docker pull r-base:3.5.1
$ docker pull r-base:3.6.1
Now if we run docker images, we will see a list of all Docker images, together with their tags.
$ docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
r-base              3.6.1               4e55790c88ae        2 weeks ago         642MB
r-base              3.5.1               bd9edc1a85ed        8 months ago        712MB
Now, to launch an interactive R session of a specific version (-ti attaches an interactive terminal, and --rm removes the container once it exits), do
$ docker run -ti --rm r-base:3.5.1
Everything will work just as if you had opened an R session locally. To make the comparison, I can use two terminal windows and have both versions running side by side. And we do see that the sample() function, even with the random seed set, gives different results! Upon further investigation, I found that in R 3.5.1 a random number between 0 and 1 is generated first, then multiplied by 1000, and finally truncated to an integer (plus one, for 1-based indexing). It can be verified that, after setting the seed to 1, runif(1) gives 0.2655087. In R 3.6.0 and up, the default mechanism has changed (the old one was slightly non-uniform).
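The comparison can also be scripted without opening two interactive sessions. Here is a sketch, with the expected outputs reasoned from the discussion above:
## R 3.5.1: floor(0.2655087 * 1000) + 1 = 266
$ docker run --rm r-base:3.5.1 R -e 'set.seed(1); sample(1000, 1)'
## R 3.6.1: a different value, produced by the new mechanism
$ docker run --rm r-base:3.6.1 R -e 'set.seed(1); sample(1000, 1)'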
Besides the command-line interactive mode, a Docker container can also be operated in other modes - via a Jupyter notebook interface, in batch mode, or in a silent, background mode. We will skip most of the details here, as these modes are tailored to different purposes.
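As a small taste, though, here is a sketch of the background (detached) mode; the container name myjob is arbitrary:
## -d detaches the container, which keeps running in the background
$ docker run -d --name myjob r-base:3.6.1 R -e 'Sys.sleep(60)'
$ docker ps           ## list running containers
$ docker logs myjob   ## inspect its output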
5. Docker for statisticians: reproducible research
Suppose now I want to run a small simulation and build everything needed into a Docker image. The minimal piece of code shown below will be used in the demonstrations that follow.
library(ggplot2)
## generate uniform points in the unit circle
set.seed(1)
r <- sqrt(runif(1000, 0, 1))
theta <- runif(1000, 0, 2 * pi)
x <- r * cos(theta); y <- r * sin(theta);
mydata <- data.frame(x = x, y = y)
myplot <- ggplot(mydata, aes(x = x, y = y)) + geom_point() + theme_bw()
## pass the plot explicitly: when the script is source()'d non-interactively,
## ggsave() has no "last displayed plot" to fall back on
ggsave(filename = "unifpoints.pdf", plot = myplot)
Say I have saved this piece of code as example.R and put it in a ~/mydocker folder. To build an image, we need to write a Dockerfile, where the entire configuration of the environment is specified.
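As a quick sanity check, the script can first be run locally (assuming your own machine has R and ggplot2 installed - exactly the dependence on local configuration that the image will remove):
$ Rscript example.R
$ ls unifpoints.pdf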
First, let us pick a specific version of R. The base image, r-base on Docker Hub, is loaded first. From the wide choice of tags, let us choose 3.5.1 for demonstration. Now we can start the Dockerfile. We want the script to live in a folder called myplot, and the plot it produces to end up in a folder called results.
Note that the package ggplot2 is requested by the script but the base image does not come with it, so we should install it first.
FROM r-base:3.5.1
## build-time argument: the date of the analysis, used below to pin package versions
ARG WHEN
RUN mkdir /home/myplot
RUN mkdir /home/results
## install ggplot2 as it was on the WHEN date, using the daily MRAN snapshot of CRAN
RUN R -e "options(repos = list(CRAN = 'http://mran.revolutionanalytics.com/snapshot/${WHEN}')); install.packages('ggplot2')"
## copy the script from the host into the docker image
COPY example.R /home/myplot/example.R
CMD cd /home/myplot/ \
    && R -e "source('/home/myplot/example.R')" \
    && mv /home/myplot/unifpoints.pdf /home/results/unifpoints.pdf
Now, with a finished Dockerfile, we are ready to build our image. The WHEN argument records the date the analysis is done, and the Dockerfile above uses it to pin install.packages() to the CRAN snapshot of that date; -t example names the image; the trailing . is the build context, i.e., the build process happens in the current working directory.
$ docker build --build-arg WHEN=2019-07-27 -t example .
## now see a list of docker images I have
(base) xueyishu@Tiger:~/mydocker $ docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
example             latest              fd453879b1e6        28 seconds ago      712MB
Finally, we can launch our example. We also want the final results to be exported to a results folder under the current directory, so we mount that folder onto /home/results inside the container with the -v flag. Therefore we do
$ mkdir results
$ docker run -v ~/mydocker/results:/home/results example
Now we check the subfolder:
(base) xueyishu@Tiger:~/mydocker/results $ ls -l
total 32
-rw-r--r--@ 1 xueyishu staff 12552 Jul 27 22:01 unifpoints.pdf
A minor comment - you might have noticed that the size of the Docker image we just built is 712MB, which may not seem worthwhile - why can't I just use my local machine? This is because our example is oversimplified - when you want a whole, complicated research project to be fully reproducible, Docker will be your best friend.
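And once the image is built, making it available to others is a one-liner. Here is a sketch, where yourname/example is a hypothetical repository on Docker Hub:
## tag the image and push it to a registry, so that others can simply docker pull it
$ docker tag example yourname/example:2019-07-27
$ docker push yourname/example:2019-07-27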