ClusterJob (CJ) is an automation system for high-throughput reproducible computations. Using CJ, one can conduct large-scale computational studies on remote clusters in a hassle-free and reproducible manner. CJ builds 'reproducible' computational packages that are easy to share with others. It is written mainly in perl and has simple commands that are easy to learn. The project started at Stanford University by Hatef Monajemi and his PhD advisor David L. Donoho with the goal of making large-scale scientific computing simpler and more efficient. Current implementation allows submission of MATLAB and Python jobs. In future, we hope to include other heavily used programming languages such as R and Julia.
CJ on Github
You may access the CJ Github repository at https://github.com/monajemi/clusterjob
This book is a work in progress. You may find the AsciiDoc source code for this book at https://github.com/monajemi/CJ-book.
A PDF copy of this documentation can be found at http://clusterjob.org/documentation/book.pdf.
Setting up CJ
Installing a CJ agent is very simple. Follow the two steps below to set one up!
Step 1: Setting up authentication key
CJ assumes that the secure shell (ssh) to cluster is handled in a password-free manner. There are various ways to achieve this. Some clusters use Kerberos and some others might use ssh-keygen. We will explain how you could setup a key using ssh-keygen in this document. ssh-keygen is a Unix utility that is used to generate, manage, and convert authentication keys for ssh authentication. If you already have a password-free connection to your cluster, you may skip this step.
Check SSH setup
You can check to see if you already set up the public ssh key. Open a terminal and enter:
$ ls -al ~/.ssh
if you see any of the following, you probably have already setup the ssh-keygen
Generate a SSH key on your machine
If you have not setup ssh key, do not worry. It is very simple to set up. Open terminal, and enter,
$ ssh-keygen -t rsa -C "email@example.com" # This will generate a SSH key. Just use the default setting if # you are asked questions in the process of key generation
This will generate a key in '~/.ssh/id_rsa.pub'
Copy the key to remote server
The last step is to copy the content of your
$HOME/.ssh/authorized_keyslocated on the remote server
You should check your authentication keys by trying to connect to server. In your terminal enter,
$ ssh firstname.lastname@example.org
Some clusters use the so-called Multi-Factor Authentication (e.g., Duo Two-Factor). In this
case, each time you log on, you need to provide an additional code. In order to avoid it, you need to
add the following lines for each cluster (here
Step 2: Installing CJ
Clone ClusterJob from GitHub
Clone ClusterJob to a directory where you would like to install it, say
cd ~/ rm -rf monajemi-clusterjob-* CJ_install curl -sL https://github.com/monajemi/clusterjob/tarball/master | tar -zx - && mv monajemi-clusterjob-* CJ_install
You may also use the following alternative command to clone CJ with the entire history
git clone https://github.com/monajemi/clusterjob.git ~/CJ_install
Install perl dependencies
ClusterJob is written in perl and so depends on other perl modules. You can install CJ dependencies via
cpanm. Copy and paste the following lines into your terminal:
sudo cpan -i Data::Dumper Data::UUID FindBin File::chdir File::Basename File::Spec IO::Socket::INET IO::Socket::SSL Getopt::Declare Term::ReadLine JSON::PP JSON::XS Digest::SHA Time::Local Time::Piece Moo HTTP::Thin HTTP::Request::Common JSON URI
Provide configuration info
You will need to update the contents of
~/CJ_install/cj_configfile to reflect your own information.
This file contains the following information:
CJID <YOUR CJ ID> CJKEY <YOUR CJ KEY> SYNC_TYPE auto SYNC_INTERVAL 300
To use CJ remotely, you need to provide your unique CJID and CJKEY obtained from http://clusterjob.org.
If you plan to use CJ locally without syncing to CJ remote database, you must provide an arbitrary CJID.
If you do not include a CJKEY, your meta data will not sync to CJ remote database even if you have registered an acount on http://clusterjob.org.
You must keep your CJKEY private.
Provide cluster info
You will need to update the contents of
/CJ_install/ssh_configfile to reflect your own server setup.
The CJ convention for each remote machine is
[MACHINE_ALIAS] Host your_host User your_username Bqs batch_queuing_system Repo CJ_remote_repo MAT Matlab version MATlib Matlab libraries [MACHINE_ALIAS]
You can add as many machines as you want to
Build an alias for CJ
For easy use, you may want to add an alias for calling
alias cj='perl ~/CJ_install/src/CJ.pl';
Initialize your CJ agent
You may now initialize your CJ agent by
$ cj init
You may check if the agent is installed by
$ cj who
This should print out something like
You may install as many CJ agents as needed on a single machine or different machines.
After you have successfully started a CJ agent, it is time to have some fun! In what follows, we will demonstrate how you can use CJ to make a reproducible computational package from your MATLAB code and run it on a remote cluster in a hassle-free way.
A simple Matlab example
We will now go over a simple example of running a Matlab code on a cluster to demonstrate the usage of ClusterJob.
Suppose we would like to run the following simple matlab
code. The code includes two
for loops indexed by
j, and for each such combination it write a line
to the file
% This is a test Matlab script for CJ % Author: Hatef Monajemi June 28 2016 file = 'results.txt'; for i = 1:3 for j = 1:5 % write to a text file for testing reduce fid = fopen(file,'at'); fprintf(fid, '%i,%i,%i\n', i,j,i+j); fclose(fid) end end
This code runs on our personal machine without any error. We now wish to run this code
on a cluster named sherlock using ClusterJob. To run this code serially, once we are
in the folder containing
simpleExample.m, we simply type
the following command in a terminal.
$ cj run simpleExample.m sherlock -m "This is a simple test for run command"
This command should result in the following output:
CJmessage::runing [simpleExample.m] on [sherlock] CJmessage::Sending from: /Users/hatef/github_projects/clusterjob/example/MATLAB CJmessage::Creating reproducible script(s) reproduce_simpleExample.m CJmessage::Sending package 07264a5d CJmessage::Submitting job CJmessage::1 job(s) submitted (10097640)
run command uses only one processing core. We could however run our code in parallel by simply changing the command to
$ cj parrun simpleExample.m sherlock -m "This is a simple test for parrun command"
This time, you should see the following output informing you of submitting 15 jobs instead.
CJmessage::parruning [simpleExample.m] on [sherlock] CJmessage::Sending from: /Users/hatef/github_projects/clusterjob/example/MATLAB CJmessage::Creating reproducible script(s) reproduce_simpleExample.m CJmessage::Sending package 0ed00c68 CJmessage::Submitting job(s) CJmessage::15 job(s) submitted (10097772-10097786)
|The parallelization of your code happens automatically with no further effort from you.|
As you can see, each instance of the
run command produces a reproducible package with a distinct PID (for Package IDentifier). A PID is a SHA1 code, which is essentially a 40 digits long hexadecimal number. Though
PIDs are 40 digits long, when using CJ commands you can provide only a short version of it, which contains the initial 8 characters only.
|A PID is valid as long as it is a hexadecimal number of length 8 or more|
The full PID can be retrieved using
You can check the status of your jobs using
$ cj state 07264a5d pid 07264a5d33bab71c1463f651b1ff920f6d32bb1c remote_account: email@example.com job_id: 10109624 state: COMPLETED
Or for a parallel case,
$ cj state 0ed00c68 pid 0ed00c6851c504af7d8064a954aba44cf1da40f2 remote_account: firstname.lastname@example.org 1 10097772 COMPLETED 2 10097773 COMPLETED 3 10097774 COMPLETED 4 10097775 COMPLETED 5 10097776 COMPLETED 6 10097777 COMPLETED 7 10097778 COMPLETED 8 10097779 COMPLETED 9 10097780 COMPLETED 10 10097781 COMPLETED 11 10097782 COMPLETED 12 10097783 COMPLETED 13 10097784 COMPLETED 14 10097785 COMPLETED 15 10097786 COMPLETED
You can pull your computational package entirely or partially using
this command will pull the package associated with a particular PID into a temporary directory
on your local machine:
$ cj get 07264a5d CJmessage::Getting results from 'sherlock' CJmessage::Please see your last results in /Users/hatef/CJ_get_tmp/07264a5d33bab71c1463f651b1ff920f6d32bb1c
If we now look at the contents of this package, we would see the following. You may want to take a moment to see what is inside each file.
$ ls -lt /Users/hatef/CJ_get_tmp/07264a5d33bab71c1463f651b1ff920f6d32bb1c total 96 -rwxr-xr-x 1 hatef staff 2297 Oct 6 16:19 CJ_CONFIRMATION.TXT -rw-r--r-- 1 hatef staff 2865 Oct 6 16:18 CJrandState.mat -rw-r--r-- 1 hatef staff 90 Oct 6 16:18 results.txt drwxr-xr-x 5 hatef staff 170 Oct 6 16:18 logs drwxr-xr-x 3 hatef staff 102 Oct 6 16:18 scripts -rw-r--r-- 1 hatef staff 29 Oct 6 16:17 qsub.info -rw-r--r-- 1 hatef staff 1606 Oct 6 16:17 bashMain.sh -rw-r--r-- 1 hatef staff 901 Oct 6 16:17 master.sh -rw-r--r-- 1 hatef staff 623 Oct 6 16:17 reproduce_simpleExample.m -rw-r--r-- 1 hatef staff 308 Oct 6 16:17 simpleExample.m
contains confirmation that this package was run by ClusterJob.
contains the information about the random seed that generated the results so that other people can reproduce your results.
is the output of your code
reproduces these results upon execution.
For a parallel package, if we use
get, we will see the following output:
$ cj get 0ed00c6851c504 CJmessage::Getting results from 'sherlock' CJmessage::Run REDUCE before GET for reducing parrun packages CJmessage::Please see your last results in /Users/hatef/CJ_get_tmp/0ed00c6851c504af7d8064a954aba44cf1da40f2
For parallel packages, one typically needs to use
Clearly, we are asked to use
get for a parallel package. We will discuss
in the next section. If we now look at the folder containing this parallel package, we see:
$ ls -lt /Users/hatef/CJ_get_tmp/0ed00c6851c504af7d8064a954aba44cf1da40f2 total 80 -rwxr-xr-x 1 hatef staff 2429 Oct 7 16:06 CJ_CONFIRMATION.TXT drwxr-xr-x 9 hatef staff 306 Oct 6 22:28 13 drwxr-xr-x 9 hatef staff 306 Oct 6 22:26 15 drwxr-xr-x 9 hatef staff 306 Oct 6 22:26 14 drwxr-xr-x 9 hatef staff 306 Oct 6 22:25 11 drwxr-xr-x 9 hatef staff 306 Oct 6 22:25 10 drwxr-xr-x 9 hatef staff 306 Oct 6 22:25 6 drwxr-xr-x 9 hatef staff 306 Oct 6 22:25 7 drwxr-xr-x 9 hatef staff 306 Oct 6 22:25 8 drwxr-xr-x 9 hatef staff 306 Oct 6 22:25 3 drwxr-xr-x 9 hatef staff 306 Oct 6 22:25 5 drwxr-xr-x 9 hatef staff 306 Oct 6 22:25 9 drwxr-xr-x 9 hatef staff 306 Oct 6 22:25 4 drwxr-xr-x 9 hatef staff 306 Oct 6 22:25 12 drwxr-xr-x 9 hatef staff 306 Oct 6 22:25 2 drwxr-xr-x 9 hatef staff 306 Oct 6 22:24 1 -rw-r--r-- 1 hatef staff 435 Oct 6 22:23 qsub.info -rw-r--r-- 1 hatef staff 11591 Oct 6 22:23 master.sh -rw-r--r-- 1 hatef staff 308 Oct 6 22:23 simpleExample.m
We now see 15 directories for 15 jobs submitted. If we further look into one of these directories:
$ ls -lt /Users/hatef/CJ_get_tmp/0ed00c6851c504af7d8064a954aba44cf1da40f2/1 total 40 -rw-r--r-- 1 hatef staff 2865 Oct 6 22:24 CJrandState.mat -rw-r--r-- 1 hatef staff 6 Oct 6 22:24 results.txt drwxr-xr-x 5 hatef staff 170 Oct 6 22:23 logs drwxr-xr-x 3 hatef staff 102 Oct 6 22:23 scripts -rw-r--r-- 1 hatef staff 1953 Oct 6 22:23 bashMain.sh -rw-r--r-- 1 hatef staff 515 Oct 6 22:23 reproduce_simpleExample.m -rw-r--r-- 1 hatef staff 196 Oct 6 22:23 simpleExample.m
reduce command is designed for reducing the results of a parallel run into a single
file. This is because the
parrun command executes the
for loops in the main script in parallel by
generating independent sub-directories and submitting a separate job for each index combination.
If we look at the
results.txt file in sub-directories 1 and 2 for instance, we see:
$ cat /Users/hatef/CJ_get_tmp/0ed00c6851c504af7d8064a954aba44cf1da40f2/1/results.txt 1,1,2 $ cat /Users/hatef/CJ_get_tmp/0ed00c6851c504af7d8064a954aba44cf1da40f2/2/results.txt 1,2,3
We certainly want these results to be reduced to one single
results.txt file. This is done
$ cj reduce results.txt 0ed00c68 CJmessage::Checking progress of runs... CJmessage::Reducing results... CJmessage::Do you want to submit the reduce script to the queue via srun?(recommneded for big jobs) Y/N? n SubPackage 1 Collected (6.67%) SubPackage 2 Collected (13.33%) SubPackage 3 Collected (20.00%) SubPackage 4 Collected (26.67%) SubPackage 5 Collected (33.33%) SubPackage 6 Collected (40.00%) SubPackage 7 Collected (46.67%) SubPackage 8 Collected (53.33%) SubPackage 9 Collected (60.00%) SubPackage 10 Collected (66.67%) SubPackage 11 Collected (73.33%) SubPackage 12 Collected (80.00%) SubPackage 13 Collected (86.67%) SubPackage 14 Collected (93.33%) SubPackage 15 Collected (100.00%) CJmessage::Reducing results done! Please use "CJ get 0ed00c68 " to get your results.
This will produce the new file
results.txt in the main directory, which contains:
$ cat /Users/hatef/CJ_get_tmp/0ed00c6851c504af7d8064a954aba44cf1da40f2/results.txt 1,1,2 1,2,3 1,3,4 1,4,5 1,5,6 2,1,3 2,2,4 2,3,5 2,4,6 2,5,7 3,1,4 3,2,5 3,3,6 3,4,7 3,5,8
The file to reduce can be any acceptable MATLAB output, for instance
sanity command is used to check the output of your experiments,
and making sure that they are as expected.
Consider the previous
simpleExample.m script that you ran through CJ.
There are a number of checks one can conduct on a result file. For instance,
one may be interested to check the number of lines in a specific file
that belongs to a particular experiments (say
0ed00c68). In ClusterJob,
this can easily be achived via the following command:
$ cj sanity lines 0ed00c68
This command results in the following output:
What file (e.g., results.txt | */results.txt)? */results.txt #subPackages: 15 Min # lines: 0 Max # lines: 1 Mode # lines: 1 ❌ Following subPackages have different # lines than 1 (mode of # lines) 3 5 9 ❌ Following subPackages are missing 'results.txt': 3 5 9
As you see, CJ is giving us the statistcs of the number of lines present in
results.txt in each subpackage.
* sign tells CJ to look into all subpackages.
one can then rerun the jobs for these subpackages that are missing results files:
$ cj rerun 0ed00c68 3 5 9 CJmessage::Sending rerun script over to email@example.com CJmessage::Submitting job(s) CJmessage::job-id: 3746149-3746151
Another example that checks the existance of a
$ cj sanity exists What file (e.g., results.txt | */results.txt)? */StudyGradientOnGrid.mat ❌ Following subPackages are missing 'StudyGradientOnGrid.mat': 8 9 10 11 24 25 32 33 34 35 36 49 50 51 55 56 70 71 72 76 77 96 97 98 102 103 127 128 129 133 134 144 145 146 157 158 172 173 174 179 180 201 202 203 211 212 233 234 235 244 245 264 265 266 288 289 302 303 304 319 320 341 342 343 365 366 377 378 379 392 393 415 416 417 460 461 462 482 483 503 504 505 530 531 547 548 549 581 604 605 606 647 648 649 674 675 696 697 698 716 717 731 732 733 762 763 785 786
To see a full list of options
In what follows, we review important commands.
cj run <script> <machine> -dep <DEPENDENCY_FOLDER> -m <MESSAGE> -mem <MEM_REQUESTED>
cj deploy <script> <machine> -dep <DEPENDENCY_FOLDER> -m <MESSAGE> -mem <MEM_REQUESTED>
parellel run for embarssingly parallel problems.
cj parrun <script> <machine> -dep <DEPENDENCY_FOLDER> -m <MESSAGE> -mem <MEM_REQUESTED>
To rerun a previously failed package.
cj rerun <PACKAGE> <FOLDER_NUMBER>
To see the state of the last job submitted through CJ,
To see the state of a particular package,
cj state <PACKAGE>
To see the state of a particular folder in a parrun package,
cj state <PACKAGE> <FOLDER_NUMBER>
To see the last N instances of CJ run,
cj log <N>
To see detailed log of a particular package,
cj log <PACKAGE>
To see all the packages
cj log all
To see the information of the last call to CJ,
To see the information of a particular package,
cj info <PACKAGE>
To reduce the results of the last parrun call,
cj reduce <RESULTS_FILENAME>
To reduce the results of a particular parrun package,
cj reduce <RESULTS_FILENAME> <PACKAGE>
To get the results of the last call back,
To get the results of a particular package call back,
cj get <PACKAGE>
To remove the last package and its associated jobs,
To remove a particular package and its associated jobs,
cj clean <PACKAGE>
To show the program run by CJ at the last call,
To show the program run by CJ for a particular package,
cj show <PACKAGE>