Parallel Computing on UConn HPC
1. Before we get started
I assume you have:
- knowledge of how to use `ssh` to connect to remote servers, via the terminal on Mac or via PuTTY on Windows
- knowledge of at least one of `vim`, `emacs`, or `nano` to modify files on the command line
- an account on UConn's HPC
2. What’s special about parallelizing on a cluster
When running jobs only on our local machine, we are, in "computer science" terminology, using cores on a single node. Such parallelization can be done easily via packages like `parallel` on Unix systems and `doParallel` on Windows. The `parallel` package, however, cannot break the node barrier when used on clusters: no matter how many cores you request, your code will still only run on cores within a single node.
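To make the node-local case concrete, here is a minimal sketch of squaring numbers on several cores of a single machine with the `parallel` package (the `sq` function and the core count are illustrative; `mclapply` works by forking and is Unix-only):

```r
library(parallel)

sq <- function(x) x * x

## fork the computation across 4 local cores -- still a single node;
## on Windows, use makeCluster() + parLapply() instead of mclapply()
result <- mclapply(1:100, sq, mc.cores = 4)
head(unlist(result))  # 1 4 9 16 25 36
```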
3. Break the barrier
3.1 Load MPI
MPI, short for "message passing interface", is designed to run on huge clusters and to enable the exchange of information between nodes. To use MPI on the HPC:

```sh
ssh netid@login.storrs.hpc.uconn.edu
echo "module load null gcc/5.4.0-alt r/3.5.1-openblas-gcc540 mpi/openmpi/1.10.1-gcc" >> ~/.bashrc
echo "export OMPI_MCA_mpi_warn_on_fork=0" >> ~/.bashrc
```
Adding the above lines to `.bashrc` ensures that every time you log in to the cluster, the MPI module is loaded automatically.
3.2 R Code!
For illustration we use this R code snippet:

```r
sq <- function(x) {
  x * x
}
```
Setting vectorization aside, computing squares for, say, 100,000 numbers would be time-consuming in a plain loop. Let's now make a cluster with as many cores as the system allows!
```r
options(echo = TRUE)  # have outputs saved to RLog
library(parallel)

## SLURM_NTASKS is set by the scheduler; makeCluster() needs a number, not a string
cl <- makeCluster(as.numeric(Sys.getenv("SLURM_NTASKS")), type = "MPI")
clusterExport(cl, varlist = ls())

## pass the 100,000 calculations to the worker cores!
result <- clusterApply(cl, 1:100000, fun = function(x) {
  sq(x)
})

stopCluster(cl)
save.image("squares.RData")
```
The `clusterExport` command exports objects in your workspace to each core so that all cores can use them. The result will be a large list whose length equals the number of replicates, with the i-th element being the result of your i-th replicate. You can then use functions in the `purrr` package (part of the tidyverse) to manipulate and process the results into your final desired output. I recommend `purrr` because it makes list operations simple, fast, and tractable.
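As a sketch of that post-processing step (the stand-in `result` list below mimics what `clusterApply` returns for the squares example; `map_dbl` is from `purrr`):

```r
library(purrr)

## stand-in for the list returned by clusterApply() above
result <- as.list((1:10)^2)

## collapse the one-number-per-replicate list into a numeric vector
squares <- map_dbl(result, 1)
squares  # 1 4 9 ... 100
```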
That covers the R side. As UConn's HPC uses SLURM to manage all jobs, we also need a `submit.sh` script that submits the job to the cluster. A typical script in this case would be:
```sh
#!/bin/bash
#SBATCH --partition=general
#SBATCH -n 100
#SBATCH --mail-type=END
#SBATCH --mail-user=first.last@uconn.edu
#SBATCH --mem 128000
#SBATCH -t 12:00:00

R CMD BATCH code.R
```
Finally, we run `sbatch submit.sh` to submit our job.
3.3 Run jobs that take command line arguments
It is sometimes desirable to change some of the parameter inputs in order to run different simulation settings, especially when the project involves tuning. Of course, one could copy the same code several times, hard-code the parameters, and use many different bash scripts to submit jobs. This would be tedious and error-prone.
As an example, suppose I now have a function that depends not only on `x` but also on other optional arguments:

```r
qd <- function(x, a, b, c) {
  a * x^2 + b * x + c
}
```
To make my code able to read command line arguments:

```r
args <- commandArgs(trailingOnly = TRUE)
a <- as.numeric(args[1])
b <- as.numeric(args[2])
c <- as.numeric(args[3])
## the usual makeCluster step follows...
```
And the `submit.sh` file should be modified accordingly:
```sh
#!/bin/bash
#SBATCH --partition=general
#SBATCH -n 100
#SBATCH --mail-type=END
#SBATCH --mail-user=first.last@uconn.edu
#SBATCH --mem 128000
#SBATCH -t 12:00:00

R CMD BATCH --vanilla --slave "--args $1 $2 $3" code.R RLog_$1_$2_$3
```
Now, in the terminal, we type `sbatch submit.sh 5 4 3` to submit a job. The three arguments are passed to R, and the console output will be written to `RLog_5_4_3`.
3.4 Passing a vector as an optional argument
In the previous section's example, we saw how to take command line inputs as arguments. These arguments, however, were all single numbers. What if we want to pass an entire vector as an argument? The solution is to pass the vector as one string and parse it inside R.
For example, I want to be able to specify my vector of true coefficients. Then:
```r
args <- commandArgs(trailingOnly = TRUE)
mybeta <- as.numeric(unlist(strsplit(args[1], split = ",")))
```
Everything else remains the same. When submitting, I would do:

```sh
sbatch submit.sh 2,0,0,4,8
```
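The parsing step is easy to check interactively. Assuming the first command line argument arrives in R as the single string "2,0,0,4,8", the `strsplit` line recovers the numeric vector:

```r
## what args[1] would hold after `sbatch submit.sh 2,0,0,4,8`
arg <- "2,0,0,4,8"
mybeta <- as.numeric(unlist(strsplit(arg, split = ",")))
mybeta  # 2 0 0 4 8
```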
3.5 Copy the results to the local machine
On a Windows machine, this can easily be done with a tool like WinSCP. On Unix/Linux systems, the command line is simpler. Take our previous `squares.RData` as an example: it now sits in the home directory of our HPC account. To copy it to the local Downloads folder, we can do:

```sh
scp netid@login.storrs.hpc.uconn.edu:~/squares.RData ~/Downloads/
```
4. Easy login
It can be a pain having to type `login.storrs.hpc.uconn.edu` every time we copy something to or from the remote machine. An easy solution is to save the user and host information in `~/.ssh/config`. To do that, type `vi ~/.ssh/config` in the terminal, press `i` to enter insert mode, and add the following (OpenSSH does not allow trailing comments, so the netid note goes on its own line):

```
Host hpc
    # replace net19001 with your own netid
    User net19001
    HostName login.storrs.hpc.uconn.edu
```
Then press `Esc` and type `:wq` to save and exit vim. Now, to copy `squares.RData`, we only need to type

```sh
scp hpc:~/squares.RData ~/Downloads/
```

and to connect to the HPC, we simply type

```sh
ssh hpc
```
5. Quick reference sheet
To see all jobs submitted by yourself:

```sh
sjobs
```

To cancel a specific job:

```sh
scancel jobid  # the job id can be found in sjobs
```

To cancel all jobs submitted by yourself:

```sh
scancel -u netid
```

To see your priority on the HPC:

```sh
sprio -u netid
```
Three factors influence your priority on the HPC: `AGE` (how long your job has been pending), `FAIRSHARE` (how much resource you have used within a certain period), and `PARTITION` (which depends on the partition; most people just submit to "general"). Intuitively, the longer you have been waiting, the less resource you have used recently, and the higher priority your partition has, the higher your job's priority will be.
6. Debug
The R package `Rmpi` needs to be installed for us to be able to use MPI. However, errors may occur when trying to install it. One common reason is that the MPI module is not loaded; even when it is, R can sometimes miss the location of MPI. In that case, the path to MPI needs to be entered manually. In an R session, type

```r
install.packages("Rmpi",
                 configure.args = c("--with-Rmpi-include=/apps2/openmpi/1.10.1-gcc/include/",
                                    "--with-Rmpi-libpath=/apps2/openmpi/1.10.1-gcc/lib/",
                                    "--with-Rmpi-type=OPENMPI"))
```

and `Rmpi` should install without error. This works on both R 3.4.1 and 3.5.1; other versions, however, have not been tested.