FAQ new config

General

What is OAR ?

OAR is a versatile resource and task manager (also called a batch scheduler) for HPC clusters and other computing infrastructures (such as distributed computing testbeds, where versatility is key).

What are the most commonly used commands ? (see official docs)

  • oarsub : to submit a job
  • oarstat : to see the state of the queues (Running/Waiting jobs)
  • oardel : to cancel a job
  • oarpeek : to show the stdout of a job while it is running
  • oarhold : to hold a job while it is Waiting
  • oarresume : to resume jobs in the Hold or Suspended states

Can I use a web interface rather than the command line ?

Yes, connect to the Kali web portal : if you have an Inria account, use Sign in/up with CAS ; if you have no Inria account, use Sign in/up.


Job submission

How do I submit a job ?

With OAR you can directly use a binary as the submission argument on the command line, or even an inline script. You can also create a submission script; the script includes the command line to execute and the resources needed for the job. Do not forget the -S flag of oarsub if you want the OAR parameters in the script to be parsed and honored.

Once the script exists, use it as the argument of the oarsub command: oarsub -S ./myscript.sh
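
You can also submit directly without a script; a minimal sketch (the binary name and resource values are illustrative only):

oarsub -l /nodes=1/core=4,walltime=01:00:00 "./mybinary input.dat"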

How to choose the node type and properties ?

The cluster has several kinds of nodes.

To view all defined OAR properties :

  • graphical : connect to Monika and click on the node name.
  • command line : use oarnodes, example for nef085: oarnodes nef085.inria.fr

If you want all the cores from a single node :

oarsub -l /nodes=1

If you want 48 cores from any type of nodes :

oarsub -l /core=48

In this case the 48 cores may be spread over several nodes; your application must handle this case (using MPI or another framework). A multithreaded application cannot use all the reserved cores if they are spread over several nodes.

If you need to reserve a given amount of cores from a single node, use :

oarsub -l /nodes=1/core=2

If you want all the cores of 2 nodes from xeon nodes:

oarsub -p "cputype='xeon'" -l /nodes=2

If you want 96 cores as 12 cores from 8 nodes from xeon nodes:

oarsub -p "cputype='xeon'" -l /nodes=8/core=12

You can make more specific reservations using additional resource tags. This job reserves a total of 16 cores, as 8 cores from a single node on each of 2 different Infiniband network switches:

oarsub -l /ibswitch=2/node=1/core=8


What is the format of a submission script ?

The script should not only include the path to the program to execute, but also information about the needed resources (you can also specify resources using oarsub options on the command line). Simple example : helloWorld.sh

#!/bin/bash
# Submission script for the helloWorld program
#
# Comments starting with #OAR are used by the
# resource manager if "oarsub -S"
#
#OAR -l /nodes=8/core=1,walltime=00:10:00
#OAR -p cputype='xeon'
# The job reserves 8 nodes with one core per node,
# only on xeon nodes ;
# job duration is less than 10 min
#
#OAR -q default
# The job is submitted to the default queue


# Path to the binary
./helloWorld

You can mix resource specifications in the submission script and on the command line. In this example the -p on the command line takes precedence over the one in the script, while the -l options from the script and the command line are combined (multiple -l options define a moldable job) :

oarsub -p "cputype='xeon'" -l /nodes=4/core=2 -S ./helloWorld.sh
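
A moldable job can likewise be declared entirely on the command line by passing several -l options; OAR then schedules whichever alternative it can start first. A minimal sketch (resource values are illustrative):

oarsub -l /nodes=4/core=2,walltime=02:00:00 -l /nodes=2/core=2,walltime=04:00:00 -S ./helloWorld.sh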

Several examples can be downloaded here :

#TODO

All scripts must specify the walltime of the job (maximum expected duration), e.g. #OAR -l /nodes=1/core=1,walltime=HH:MM:SS. This is needed by the scheduler. If a job uses more time than requested, it will be killed.


What are the available queues ?

The limits and parameters of the queues are listed below :

queue name   max user resources   max duration (days)   priority   max user (hours*resources)
----------   ------------------   -------------------   --------   ---------------------------
default      256                  30                    10         21504
big          1024                 30                    5          2000
besteffort   no limit             30                    0          no limit

This means all jobs of a user running in the default queue at a given time can use at most 256 resources (eg 256 cores, or 128 cores with twice the default memory per core) with a cumulated reservation of 21504 hours*resources. Maximum walltime of each job is 30 days.

In other words, at any given time a user's running jobs in the default queue can have a cumulated resource reservation of at most

  • 32 cores for 28 days with the default memory per core ;
  • or 128 cores for 7 days with the default memory per core ;
  • or 128 cores for 3.5 days with twice the default memory per core ;
  • or 256 cores for 3.5 days with the default memory per core ;
  • etc.
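
Each combination above saturates the 21504 hours*resources budget of the default queue; you can check a planned request the same way. A quick shell sanity check (values are illustrative):

# cores * memory-factor * days * 24 must stay <= 21504
echo $(( 128 * 1 * 7 * 24 ))    # prints 21504 : exactly fits the budget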

How are the jobs scheduled and prioritized ?

Jobs are scheduled based on queue priority (jobs in higher priority queues are served first), and then based on the user's Karma (jobs of users with lower Karma are served first).

The user's fair share scheduling Karma measures his/her resource consumption during the last 30 days. Resource consumption takes into account both the used resources and the reserved (but unused) resources. When you reserve or consume resources on the cluster, your priority relative to other users decreases (as your Karma increases).


Jobs in the default and big queues wait until the requested resources can be reserved.

Jobs in the besteffort queue run without resource reservation : they are allowed to run as soon as there are available resources on the cluster (they are not subject to per-user limits, etc.) but can be killed by the scheduler at any time while running if a non-besteffort job requests the resources they use.

Using the besteffort queue enables a user to use more resources at a time than the per-user limits allow and permits efficient cluster resource usage. Using the besteffort queue is thus encouraged for short jobs, jobs that can easily be resubmitted, etc.

How do I submit a job in the "big" queue ?

Use oarsub -q big or use the equivalent option in your submission script (see submission script examples).

How do I submit a besteffort job ?

To submit a job to the best effort queue just use oarsub -t besteffort or use the equivalent option in your submission script (see submission script examples).

Your jobs will be rescheduled automatically with the same behaviour if you additionally use the idempotent mode : oarsub -t besteffort -t idempotent

The OAR checkpoint facility may be useful for besteffort jobs, but it requires support by the running code.
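
As an illustration of what "support by the running code" means: OAR can send a signal to the job some time before killing it (oarsub --checkpoint <seconds>; the signal can be chosen with --signal, SIGUSR2 being, to our knowledge, the default), and the job must catch it and save its state. A minimal bash sketch, where save_state is a hypothetical checkpoint routine of your own:

#!/bin/bash
#OAR -t besteffort
#OAR -t idempotent
# Submit with : oarsub --checkpoint 600 -S ./myscript.sh
# so that OAR sends SIGUSR2 600 seconds before killing the job
trap 'save_state; exit 0' SIGUSR2   # save_state is your own (hypothetical) routine
./mycode &                          # run in the background so the shell
wait                                # can handle the signal during wait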

How do I reserve resources in advance ?

Submit a job with oarsub -r "YYYY-MM-DD HH:MM:SS". A user can have at most 2 scheduled advance reservations at a given time.
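
For example, to reserve 2 full nodes for 12 hours starting at a given date (date and resources are illustrative):

oarsub -r "2016-04-01 08:00:00" -l /nodes=2,walltime=12:00:00 ./myscript.sh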

How much memory (RAM) is allocated to my job ?

OAR takes the total amount of RAM of a node and divides it by the number of cores.

So for instance, if a node has 96GB of RAM and 12 cores, each reserved core will have 8GB of RAM dedicated by OAR. If you reserve only one core on this type of node, your job will be limited to 8GB of RAM.

How can I change the memory (RAM) allocated to my job ?

You can explicitly use the mem_core property of OAR. If you want cores with a minimum amount of RAM per core, you can do (at least 8GB per core in this example) :

oarsub  -l '{mem_core > 8000}/nodes=1/core=3'

In this case, you will have 3 cores with at least 3x8GB = 24GB of RAM.

If you need a single core, but more than the amount of RAM dedicated per core, you need to reserve more than one core. Since our cluster is heterogeneous (memory per core differs between sub-clusters), it is not easy to have a single syntax to get the needed amount of memory.

For this use case (reserving a given amount of RAM on a single node, whatever the number of cores), we have developed a small wrapper around oarsub, called oarsub_mem (warning : still alpha, works only in simple cases). This wrapper understands a mem=XXg syntax. You can use it like this:

oarsub_mem -l mem=20g,walltime=1:0:0

How can I check the resources really used by a running or terminated job ?

Use the Colmet tool to view the CPU and RAM usage profile of your job during its execution.

Can I submit hundreds/thousands of jobs ?

You can have up to 500 jobs submitted at a time (includes jobs in all states : Waiting, Running, etc.).

Please consider using array jobs when submitting a large number of similar jobs to the queueing system. One obvious, but inefficient, way to do this would be to prepare a prototype job script, and then to use shell scripting to call oarsub on this (possibly modified) job script the required number of times.

Alternatively, the current version of OAR provides a feature known as job arrays which allows the creation of multiple similar jobs with one oarsub command. This feature introduces a job naming convention that allows users either to reference the entire set of jobs as a unit or to reference one particular job from the set.

To submit a job array use either

oarsub --array <array_number>

or

oarsub --array-param-file <param_file>

or equivalently insert a directive

#OAR --array <array_number>
##OR
#OAR --array-param-file <param_file>

in your batch script. In each case an array is created consisting of a set of jobs, each of which is assigned a unique index number (available to each job as the value of the environment variable OAR_ARRAY_INDEX), with each job using the same jobscript and running in a nearly identical environment. Here array_number is the number of jobs in the array; if not specified, it will be the number of lines in your parameter file. Example:

oarsub --array-param-file ./param_file -S ./jobscript
[ADMISSION RULE] Modify resource description with type constraints
  [ARRAY COUNT] You requested 6 job array
  [CORES COUNT] You requested 3066 cores
  [CPUh] You requested 3066 total cpuh (cores * walltime)
  [JOB QUEUE] Your job is in default queue
Generate a job key...
Generate a job key...
Generate a job key...
Generate a job key...
Generate a job key...
Generate a job key...
OAR_JOB_ID=1839
OAR_JOB_ID=1840
OAR_JOB_ID=1841
OAR_JOB_ID=1842
OAR_JOB_ID=1843
OAR_JOB_ID=1844
OAR_ARRAY_ID=1839


oarstat --array
Job id    A. id     A. index  Name       User     Submission Date     S Queue
--------- --------- --------- ---------- -------- ------------------- - --------
1839      1839      1         TEST_OAR   rmichela 2015-08-21 17:49:08 R default 
1840      1839      2         TEST_OAR   rmichela 2015-08-21 17:49:09 W default 
1841      1839      3         TEST_OAR   rmichela 2015-08-21 17:49:09 W default 
1842      1839      4         TEST_OAR   rmichela 2015-08-21 17:49:09 W default 
1843      1839      5         TEST_OAR   rmichela 2015-08-21 17:49:09 W default 
1844      1839      6         TEST_OAR   rmichela 2015-08-21 17:49:09 W default      

Note that each job keeps its own job id (stored as usual in the environment variable OAR_JOB_ID), while all jobs of the array share the array id 1839 (stored in OAR_ARRAY_ID). Each job is distinguished by a unique array index (stored in the environment variable OAR_ARRAY_INDEX). Thus each job can perform slightly different actions based on the value of OAR_ARRAY_INDEX (e.g. using different input or output files, or different options).

When using a parameter file containing two or more lines, each subjob is given the items of the line corresponding to its index as arguments. You can also use shell syntax in it. Example:

 foo 'a b'     # Your subjob will receive two arguments, foo, a b
 bar $HOME y   # 3 arguments, bar, <the path of your homedir>, y

Note that you shouldn't use a parameter file with only a single line: the parameters in this line will be ignored. In other words OAR doesn't like arrays of size 1 :-(
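
A minimal sketch of a jobscript consuming the per-line parameters (the file name and echo body are illustrative):

#!/bin/bash
# Each array subjob receives the items of its parameter file line
# as positional arguments
echo "first argument: $1, second argument: $2"

submitted with oarsub --array-param-file ./param_file -S ./jobscript as shown above.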

How can I pass command line arguments to my job ?

oarsub does not have a command line option for this, but you can pass parameters directly to your job, eg :

oarsub [-S] "./mycode abcde xyzt"

and then in ./mycode check the positional parameters $1 (abcde) and $2 (xyzt), using your language's syntax. Example :

#!/bin/bash
# Submission script ./mycode
#
# Comments starting with #OAR are used by the resource manager if "oarsub -S"
#OAR -p cputype='xeon'

# pick first argument (abcde) in VAR1
VAR1=$1
# pick second argument (xyzt) in VAR2
VAR2=$2

# Place here your submission script body
echo "var1=$VAR1 var2=$VAR2"

Another syntax for that :

oarsub [-S] "./mycode --VAR1 abcde --VAR2 xyzt"

and then in ./mycode use option parsing in your language's syntax.


If you do not use the -S option of oarsub then you may prefer to use shell environment variables, eg :

oarsub -l /nodes=2/core=4  "env VAR1=abcde VAR2=xyzt  ./myscript.sh"


How to run a MPICH 2 application ?

mvapich2 is an Infiniband-optimized version of mpich2. You should use the mvapich2 module in your script.

Submission script for MVAPICH2 : monAppliMPICH2.sh

#!/bin/bash
# File : monAppliMPICH2.sh
#OAR -l /nodes=3/core=1
source /etc/profile.d/modules.sh
module load mpi/mvapich2-x86_64
mpirun -machinefile $OAR_NODEFILE -launcher-exec oarsh monAppliMPICH2

In this case, mpirun will launch MPI on 3 nodes with one core per node.

How to run an OpenMPI application?

The mpirun binary included in openmpi runs the application using the resources reserved by the job :

Submission script for OpenMPI : monAppliOpenMPI.sh

The openmpi 1.10.1 version installed on nef is patched to automatically discover the resources of your job, so you don't have to specify a machinefile.

#!/bin/bash
# File : monAppliOpenMPI.sh
#OAR -l /nodes=3/core=1
source /etc/profile.d/modules.sh
module load mpi/openmpi-1.10.1-gcc
mpirun --prefix $MPI_HOME  monAppliOpenMPI

In this case, mpirun will start the MPI application on 3 nodes with a single core per node.

If you are using the main openmpi module (mpi/openmpi-x86_64), you have to add -machinefile $OAR_NODEFILE :

module load mpi/openmpi-x86_64
mpirun --prefix $MPI_HOME  -machinefile $OAR_NODEFILE monAppliOpenMPI

How to run an Intel MPI application?

The Intel compiler and MPI implementation are installed on nef. To run an MPI job:

#!/bin/bash
#OAR -l /nodes=3/core=1
source /etc/profile.d/modules.sh
module load mpi/intel64-5.1.1.109   
mpirun -machinefile $OAR_NODEFILE monAppliIntelMPI

How can I run caffe ?

First you have to use a node with a GPU (it should be much faster with a GPU), for example:

oarsub -I -p "GPU='YES'" -l /nodes=1

Then you have to load the cuda and caffe modules:

source /etc/profile.d/modules.sh
module load cuda/7.0
module load caffe/caffe-0.13
$CAFFE_HOME/build/tools/caffe

How can I run a graphical application on a node ?

First, connect to the nef frontend with ssh using the -X option, then submit an interactive job like this ; OAR will do the necessary setup for X11 forwarding:

oarsub -I ...
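
A typical session might look like this (hostname and resources are illustrative; nef-frontal.inria.fr is the frontend mentioned elsewhere on this page):

ssh -X login@nef-frontal.inria.fr
oarsub -I -l /nodes=1/core=1,walltime=1:0:0
# once the interactive shell opens on the node, launch the graphical application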


You can also use VirtualGL on GPU nodes ; see this blog post.

How to interact with a job when it is running

How do I connect to the nodes of my running job ?

Use oarsub -C <jobid> to start an interactive shell on the master node of the job, and then use oarsh <hostname> to connect to the other job nodes.

Alternatively you can use OAR_JOB_ID=<jobid> oarsh <hostname> to connect to any job node.

Other useful commands : oarcp to copy files between the nodes' local filesystems, and oarprint to query resources allocated to the job (eg oarprint host for the list of hostnames your job is running on).

Please note that ssh to the nodes is not allowed ; oarsh is a wrapper around ssh.
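
A sketch of a typical session (job id, hostnames and file names are illustrative):

oarsub -C 1839                    # shell on the master node of job 1839
oarprint host                     # list the hosts of the job
oarsh nef012 hostname             # run a command on another job node
oarcp results.dat nef012:/tmp/    # copy a file to that node's local disk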

How can I use ssh between nodes of my job ?
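
Direct ssh between nodes is not allowed; within a job, use oarsh (and oarcp) instead, which accept the usual ssh syntax. If a tool insists on calling ssh, you can usually point it to oarsh instead, as done above with mpirun's -launcher-exec oarsh option. To reach the nodes of a job from outside it, set OAR_JOB_ID as shown above.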

How can I use spark ?

Let's say you want to use spark on 4 nodes :

oarsub -I -l /nodes=4,walltime=3:0:0

This will reserve 4 nodes and start a shell on the first one (say nef107).

Then start the master:

./sbin/start-master.sh

Then you can start the slaves on the three other nodes using oarsh (the master URL in this case is spark://nef107.inria.fr:7077), like this:

for i in `uniq $OAR_NODEFILE | grep -v nef107`; do
  oarsh $i $HOME/spark-1.6.0-bin-hadoop2.6/sbin/start-slave.sh spark://nef107.inria.fr:7077
done

Then you can use spark, for example to run the SparkPi example:

export MASTER=spark://nef107.inria.fr:7077
./bin/run-example  SparkPi

To connect remotely to the WebUI you need to start the Inria VPN (with vpn.inria.fr/all) or use SSH tunneling through nef-frontal.inria.fr.
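
A hedged tunneling example (8080 is the default Spark master WebUI port; adjust host and port to your job):

ssh -L 8080:nef107.inria.fr:8080 login@nef-frontal.inria.fr
# then browse http://localhost:8080 on your workstation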

In which state is my job ?

The oarstat -j <JOBID> command lets you see the state of your job and the queue it has been scheduled in.

Examples of using oarstat

-bash-4.1$ oarstat -j 1839
Job id     Name           User           Submission Date     S Queue
---------- -------------- -------------- ------------------- - ----------
1839       TEST_OAR       rmichela       2015-08-21 17:49:08 T default   

The S column gives the current state (Waiting, Running, Launching, Terminating).

The Queue column shows the job's queue.

You can get full information about your job with -f, and array-specific information with --array.

You can use SQL syntax to look up which jobs you launched previously:

oarstat --sql "job_user='rmichela' and state='Terminated'"

When will my job be executed ?

You can use oarstat -fj <jobid> | grep scheduledStart to get an estimate of when your job will start.

How can I get the stderr or stdout of my job during its execution ?

You can use the oarpeek command : oarpeek <JOBID> shows the stdout of the job, and oarpeek -e <JOBID> shows its stderr.

By default the stdout and stderr files are created in the directory you submitted the job from, for example OAR.TEST_ARRAY.518.stderr and OAR.TEST_ARRAY.518.stdout.

How can I cancel a job ?

The oardel <JOBID> command lets you cancel a job (man oardel).

How to know my Karma priorities ?

The command oarstat -u <login> --accounting "YYYY-MM-DD, yyyy-mm-dd" shows your resource consumption and associated Karma (fair share coefficient) between two dates. The higher your Karma, the lower your priority.

To see your approximate current Karma use :

  • yyyy-mm-dd = tomorrow
  • YYYY-MM-DD = ( yyyy-mm-dd - 30 days )
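
For example, on 2016-03-18 you would use something like (dates are illustrative):

oarstat -u <login> --accounting "2016-02-17, 2016-03-19"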

To see the Karma associated to one of your currently running jobs :

  • use oarstat -f -j <jobid> | grep Karma
  • or use Monika and click <jobid> to view the job details


Troubleshooting

Why is my job rejected at submission ?

The job system may refuse a job submission due to the admission rules ; an explicit error message is then displayed. If in doubt, contact the cluster admin team.

Most of the time it indicates that the requested resources are not available, which may be caused by a typo (eg -p "cluster='dell6220'" instead of -p "cluster='dellc6220'").

Sometimes it may also be caused by some nodes being temporarily out of service. This may be verified by typing oarnodes -s, which lists all nodes and their state.

Another cause may be that the job requested more resources than the total resources existing on the cluster.

Can I use tensorflow ?

Yes, you can install the CPU version of tensorflow (the current GPUs on nef are too old and not compatible with the GPU version of tensorflow). There is a conflict with the protobuf library, so you have to use virtualenv:

virtualenv ~/tensorflow
cd ~/tensorflow
source bin/activate
pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl

Why is my job still Waiting while other jobs go Running ?

Many possible (normal) explanations include :

  • another job may have higher priority : queue priority, user Karma
  • your job requests currently unavailable resources (eg : only dellc6220 nodes, while the other job accepts any node type)
  • your job requests more resources than currently available and a lower priority job can run before yours without delaying it (best fit). Eg : you requested 4 nodes, only 2 are currently available, and the 2 others will be available in 3 hours ; a job requesting 2 nodes for at most 3 hours can run before yours.
  • the other job made an advance reservation of resources
  • etc.

Why is my job still Waiting while there are unused resources ?

Many possible (normal) explanations include :

  • you have reached the maximum resource reservation per user at a given time and your job is not besteffort
  • resources are reserved for a higher priority job. Eg: a higher priority job requests 3 nodes, 2 are currently available, 1 will be available in 1 hour. Your job requests 1 node during 2 hours. Running your job would result in delaying a higher priority job.
  • resources are reserved by an advance reservation (same example as above).
  • etc.

What are the best practices for Matlab jobs ?

If launching many Matlab jobs at the same time, please launch them on as few nodes as possible : Matlab uses one floating licence per {node,user} couple. Eg :

  • 10 jobs for user foo on 10 different cores of node nef012 use 1 floating license,
  • 1 job for user foo on each of the nef01[0-9] nodes uses 10 floating licenses.
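
For example, instead of 10 single-core submissions that may scatter across nodes, you can reserve 10 cores on a single node within one job and launch the Matlab instances from there (a sketch; the launcher script is hypothetical):

oarsub -l /nodes=1/core=10 "./launch_10_matlab_instances.sh"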