FAQ new config
Contents
- 1 General
- 2 Job submission
- 2.1 How do I submit a job?
- 2.2 How to choose the node type?
- 2.3 What is the format of a submission script?
- 2.4 What are the available queues?
- 2.5 How do I submit a besteffort job?
- 2.6 How much memory (RAM) is allocated to my job?
- 2.7 How can I change the memory (RAM) allocated to my job?
- 2.8 How can I check the resources really used by a running job?
- 2.9 How can I check the resources really used by a terminated job?
- 2.10 Can I submit hundreds/thousands of jobs?
- 2.11 How to run an MPICH2 application?
- 2.12 How to run an OpenMPI application?
- 2.13 How can I run a graphical application on a node?
- 3 How to interact with a job while it is running
- 4 Troubleshooting
General
What is OAR?
OAR is a versatile resource and task manager (also called a batch scheduler) for HPC clusters and other computing infrastructures (such as distributed computing experimental testbeds, where versatility is key).
- The official User Documentation is here: http://oar.imag.fr/docs/2.5/#ref-user-docs
- The Inria Rennes Tutorial: http://igrida.gforge.inria.fr/tutorial.html
What are the most commonly used commands? (see official docs)
- oarsub : to submit a job
- oarstat : to see the state of the queues (Running/Waiting jobs)
- oardel : to cancel a job
- oarhold : to hold a job while it is Waiting
- oarresume : to resume jobs in the states Hold or Suspended
Can I use a web interface rather than the command line?
Yes, connect to the Kali web portal: if you have an Inria account, just Sign in/up with CAS; if you have no Inria account, use Sign in/up.
Job submission
How do I submit a job?
With OAR you can't directly use a binary as a submission argument on the command line. The first thing to do is to create a submission script; the script includes the command line to execute and the resources needed for the job.
Once the script exists, use it as the argument of the oarsub command: oarsub -S ./myscript.sh
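A minimal sketch of this workflow, assuming a hypothetical program ./myprogram in the submission directory. It writes the script with a heredoc, makes it executable (the script is run on the reserved node, so it should be executable), and calls oarsub only where the OAR client is installed, typically the nef frontend.

```shell
#!/bin/bash
# Write a minimal submission script. "./myprogram" is a hypothetical binary.
cat > myscript.sh <<'EOF'
#!/bin/bash
#OAR -l /nodes=1/core=1,walltime=00:10:00
# Command line executed on the reserved core:
./myprogram
EOF

# The script is executed on the node, so make it executable.
chmod +x myscript.sh

# Submit it, but only where the OAR client exists (e.g. the frontend).
if command -v oarsub >/dev/null 2>&1; then
    oarsub -S ./myscript.sh
fi
```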
How to choose the node type?
The cluster has 2 kinds of nodes, with the following properties:
- Dell R900 nodes
  - nef, cluster=dellr900, cputype=xeon, cpuarch=x86_64, gpu=NO, mem=65536, mem_core=2731
- Dell C6100 nodes
  - nef, cluster=dellc6100, cputype=xeon, cpuarch=x86_64, gpu=NO, mem=98304, mem_core=8192
Use Monika to view all defined OAR properties
If you want 48 cores from the Dell R900 cluster:
oarsub -p "cluster='dellr900'" -l /nodes=2/core=24
If you want 96 cores from xeon nodes:
oarsub -p "cputype='xeon'" -l /nodes=8/core=12
You can also ask for just a number of cores; the scheduler will reserve this number of cores, possibly spread over several nodes:
oarsub -l /core=64
You can also use the switch resource to ask for one or more switches for your job. This example reserves one node with 8 cores under each of 2 switches, so 16 cores in total:
oarsub -l switch=2/nodes=1/core=8
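The total number of cores in a hierarchical -l request is the product of the counts at each level. The helper below is hypothetical (it is not an OAR command). It only does this arithmetic, which can help double-check a request before submitting it.

```shell
#!/bin/bash
# Hypothetical helper: multiply the counts of a hierarchical -l request
# (e.g. "/nodes=2/core=24") to get the total number of cores requested.
total_cores() {
    local spec=${1%%,*}    # drop a trailing ",walltime=..." if present
    spec=${spec#/}         # drop the leading "/"
    local total=1 level
    local IFS=/            # split the levels on "/"
    for level in $spec; do
        total=$(( total * ${level#*=} ))   # keep the value after "="
    done
    echo "$total"
}

total_cores "/nodes=2/core=24"          # 48
total_cores "/nodes=8/core=12"          # 96
total_cores "switch=2/nodes=1/core=8"   # 16
```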
What is the format of a submission script?
The script should include not only the path to the program to execute, but also information about the needed resources (resources can also be specified using oarsub options on the command line). Simple example: helloWorld.sh
#!/bin/bash
# Submission script for the helloWorld program
#
# Comments starting with #OAR are used by the
# resource manager
#
#OAR -l /nodes=8/core=1,walltime=00:10:00
#OAR -p "cputype='xeon'"
# The job uses 8 nodes with one processor (core) per node,
# only on Xeon nodes
# Job duration is less than 10 min
#
#OAR -q default
# The job is submitted to the default queue

# Path to the binary
./helloWorld
Resources can also be specified on the command line (overriding the resources specified in the script), e.g. oarsub -p "cputype='xeon'" -l /nodes=8/core=2 -S ./helloWorld.sh will run helloWorld on 8 nodes using 2 cores per node, on Xeon nodes only.
Several examples can be downloaded here :
#TODO
All scripts must specify the walltime of the job (maximum expected duration), e.g. #OAR -l /nodes=1/core=1,walltime=HH:MM:SS. The scheduler needs this value. If a job runs longer than requested, it will be killed.
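Since a missing or malformed walltime is a common mistake, here is a small hypothetical helper (not an OAR command) that converts a walltime written as H:M:S into seconds. It rejects malformed values and values above the 30-day maximum of the default queue. It only accepts the H:M:S form used in this FAQ, which is an assumption; OAR may accept other forms.

```shell
#!/bin/bash
# Hypothetical helper: convert an H:M:S walltime to seconds, refusing
# malformed values and values above 30 days (default-queue maximum).
MAX_WALLTIME_S=$(( 30 * 24 * 3600 ))

walltime_seconds() {
    if [[ ! $1 =~ ^([0-9]+):([0-9]+):([0-9]+)$ ]]; then
        echo "malformed walltime '$1' (expected H:M:S)" >&2
        return 1
    fi
    # 10# forces base 10, so fields such as "08" are not read as octal.
    local secs=$(( 10#${BASH_REMATCH[1]} * 3600 + 10#${BASH_REMATCH[2]} * 60 + 10#${BASH_REMATCH[3]} ))
    if (( secs > MAX_WALLTIME_S )); then
        echo "walltime '$1' exceeds 30 days" >&2
        return 1
    fi
    echo "$secs"
}

walltime_seconds 00:10:00   # 600
walltime_seconds 1:0:0      # 3600
```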
What are the available queues?
Users can submit jobs to two different queues:
- the default queue;
- the besteffort queue.
Jobs in the default queue wait until the requested resources can be reserved.
Jobs in the besteffort queue run without resource reservation: they are allowed to run as soon as resources are available on the cluster (they are not subject to per-user limits, etc.), but the scheduler kills them while they are running when a non-besteffort job requests the resources they use.
Using the besteffort queue enables a user to use more resources at a time than the per-user limits allow, and makes cluster resource usage more efficient. Thus using the besteffort queue is encouraged for short jobs, jobs that can easily be resubmitted, etc.
The limits and parameters of the queues are listed below :
| queue name | max user resources | min dur. | max dur. | prio | max user (hours*resources) |
|------------+--------------------+----------+----------+------+----------------------------|
| default    | 256                |          | 30d      | 10   | 21504                      |
| besteffort |                    |          |          | 0    |                            |
This means that, at a given time, a user's jobs running in the default queue can use at most 256 resources (e.g. 256 cores, or 128 cores with twice the default memory per core), with a cumulated reservation of at most 21504 hours*resources; the maximum walltime of each job is 30 days.
For example, at a given time a user can have running jobs in the default queue with a cumulated resource reservation of at most:
- 32 cores for 28 days;
- 128 cores for 7 days;
- 256 cores for 3.5 days;
- etc.
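The examples above follow from dividing 21504 hours*resources by the number of cores. A single job is also capped at 30 days (720 hours). The sketch below reproduces that arithmetic for a single reservation. The function name is hypothetical and the limits are copied from the table above.

```shell
#!/bin/bash
# Sketch of the default-queue arithmetic: at most 256 resources at a time,
# 21504 hours*resources in total, and 30 days (720 h) per job.
MAX_RESOURCES=256
MAX_RESOURCE_HOURS=21504
MAX_JOB_HOURS=720

# Print the longest walltime, in hours, for one reservation of N cores.
max_hours_for() {
    local cores=$1
    if (( cores < 1 || cores > MAX_RESOURCES )); then
        echo "error: $cores cores is outside 1..$MAX_RESOURCES" >&2
        return 1
    fi
    local hours=$(( MAX_RESOURCE_HOURS / cores ))
    (( hours > MAX_JOB_HOURS )) && hours=$MAX_JOB_HOURS
    echo "$hours"
}

for cores in 32 128 256; do
    hours=$(max_hours_for "$cores")
    # awk prints the number of days with one decimal.
    awk -v c="$cores" -v h="$hours" 'BEGIN { printf "%d cores: %d h (%.1f days)\n", c, h, h / 24 }'
done
```

This prints 672 h (28.0 days), 168 h (7.0 days) and 84 h (3.5 days), matching the list above.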
Jobs are scheduled based on queue priority (higher priority first), and then based on Karma (lower Karma first).
A user's fair-share scheduling Karma measures his/her resource consumption during the last 30 days. Resource consumption takes into account both the used resources and the reserved (but unused) resources. When you consume resources on the cluster, your priority with regard to other users decreases (and your Karma increases).
How do I submit a besteffort job?
To submit a job to the besteffort queue, just use oarsub -t besteffort or the equivalent option in your submission script (see the submission script examples).
If you additionally use the idempotent mode, your jobs will be automatically rescheduled with the same behaviour after being killed: oarsub -t besteffort -t idempotent
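Because a killed idempotent job is rescheduled and its script runs again from the top, it helps if the work can be resumed. A minimal sketch, assuming the work splits into independent steps. process_chunk and the done.<n> marker files are hypothetical placeholders, and the #OAR lines mirror the command-line options above.

```shell
#!/bin/bash
#OAR -t besteffort
#OAR -t idempotent
#OAR -l /core=1,walltime=02:00:00
# Restartable best-effort job: each finished step leaves a marker file,
# so a rescheduled run skips the steps that were already done.

process_chunk() {
    echo "processing chunk $1"   # hypothetical placeholder for the real work
}

run_steps() {
    local n
    for n in 1 2 3 4; do
        [[ -e "done.$n" ]] && continue   # finished before a previous kill
        process_chunk "$n" || return 1   # stop at the first failing step
        touch "done.$n"                  # mark the step as finished
    done
}

run_steps
```

Running the script a second time prints nothing, because every step already has its marker.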
How much memory (RAM) is allocated to my job?
TO_BE_DONE
How can I change the memory (RAM) allocated to my job?
If you need a single core but more than the amount of RAM dedicated to each core, you need to reserve more than one core. Since our cluster is heterogeneous (memory per core is not the same on each sub-cluster), it is not easy to have a single syntax to get the needed amount of memory.
For this use case (reserving a given amount of RAM, whatever the number of cores), we have developed a small wrapper around oarsub, called oarsub_mem. You can use it like this:
oarsub_mem -l mem=20g,walltime=1:0:0 <...>
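This sketch shows the arithmetic behind reserving enough cores to get enough RAM. It uses the mem_core property of each sub-cluster listed above and assumes the values are in MB. It is not how oarsub_mem is implemented. The function names are hypothetical.

```shell
#!/bin/bash
# Memory per core (MB) of each sub-cluster, from the node properties above.
mem_core_mb() {
    case $1 in
        dellr900)  echo 2731 ;;
        dellc6100) echo 8192 ;;
        *)         return 1 ;;
    esac
}

# Print the number of cores to reserve on CLUSTER to get NEED_MB of RAM.
cores_for_mem() {
    local per_core
    per_core=$(mem_core_mb "$1") || { echo "unknown cluster '$1'" >&2; return 1; }
    echo $(( ($2 + per_core - 1) / per_core ))   # integer ceiling division
}

for cluster in dellr900 dellc6100; do
    echo "$cluster: $(cores_for_mem "$cluster" 20480) cores for 20 GB"
done
```

For 20 GB this gives 8 cores on the Dell R900 nodes but only 3 on the Dell C6100 nodes, which is why a single core-based syntax does not fit both.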
How can I check the resources really used by a running job?
TO_BE_DONE
How can I check the resources really used by a terminated job?
TO_BE_DONE
Can I submit hundreds/thousands of jobs?
You can easily submit hundreds of jobs, but you should not try to submit more than 500 jobs at a time; please consider using array jobs for this.
Sometimes it is desired to submit a large number of similar jobs to the queueing system. One obvious, but inefficient, way to do this would be to prepare a prototype job script, and then to use shell scripting to call oarsub on this (possibly modified) job script the required number of times.
Alternatively, the current version of OAR provides a feature known as job arrays, which allows the creation of multiple similar jobs with one oarsub command. This feature introduces a new job naming convention that allows users either to reference the entire set of jobs as a unit or to reference one particular job from the set.
To submit a job array use either
oarsub --array <array_number> --array-param-file <param_file>
or equivalently insert a directive
#OAR --array <array_number>
#OAR --array-param-file <param_file>
in your batch script. In each case an array is created consisting of a set of jobs, each of which is assigned a unique index number (available to each job as the value of the environment variable OAR_ARRAY_INDEX), with each job using the same job script and running in a nearly identical environment. Here array_number is the number of jobs in the array; if it is not specified, the number of jobs is the number of lines in your parameter file. Example:
oarsub --array 6 --array-param-file ./param_file -S ./jobscript
[ADMISSION RULE] Modify resource description with type constraints
[ARRAY COUNT] You requested 6 job array
[CORES COUNT] You requested 3066 cores
[CPUh] You requested 3066 total cpuh (cores * walltime)
[JOB QUEUE] Your job is in default queue
Generate a job key...
Generate a job key...
Generate a job key...
Generate a job key...
Generate a job key...
Generate a job key...
OAR_JOB_ID=1839
OAR_JOB_ID=1840
OAR_JOB_ID=1841
OAR_JOB_ID=1842
OAR_JOB_ID=1843
OAR_JOB_ID=1844
OAR_ARRAY_ID=1839
oarstat --array
Job id    A. id     A. index  Name       User     Submission Date     S Queue
--------- --------- --------- ---------- -------- ------------------- - --------
1839      1839      1         TEST_OAR   rmichela 2015-08-21 17:49:08 R default
1840      1839      2         TEST_OAR   rmichela 2015-08-21 17:49:09 W default
1841      1839      3         TEST_OAR   rmichela 2015-08-21 17:49:09 W default
1842      1839      4         TEST_OAR   rmichela 2015-08-21 17:49:09 W default
1843      1839      5         TEST_OAR   rmichela 2015-08-21 17:49:09 W default
1844      1839      6         TEST_OAR   rmichela 2015-08-21 17:49:09 W default
Note that each job keeps its own job id, stored as usual in the environment variable OAR_JOB_ID (1839 to 1844 above), while all the jobs of the array share the same array id, stored in OAR_ARRAY_ID (1839 above). Each job is distinguished by a unique array index (stored in the environment variable OAR_ARRAY_INDEX). Thus each job can perform slightly different actions based on the value of OAR_ARRAY_INDEX (e.g. using different input or output files, or different options).
When using a parameter file containing two or more lines, each subjob is given the items in the line corresponding to its index as arguments. You can also use the shell syntax in it. Example:
foo 'a b'       # Your subjob will receive two arguments: foo, a b
bar $HOME y     # 3 arguments: bar, <the path of your homedir>, y
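A minimal sketch of preparing an array submission. It builds a parameter file with one line per subjob and a job script that uses the items of its line as arguments. The data_*.txt inputs and ./myprogram are hypothetical names.

```shell
#!/bin/bash
# Build the parameter file: one line = the argument list of one subjob.
: > param_file                       # start from an empty file
for input in data_01.txt data_02.txt data_03.txt; do
    # Arguments of this subjob: the input file and an output file name.
    printf '%s %s\n' "$input" "${input%.txt}.out" >> param_file
done

cat > jobscript <<'EOF'
#!/bin/bash
#OAR -l /core=1,walltime=00:30:00
# $1 and $2 are the items of this subjob's line in param_file.
./myprogram "$1" > "$2"
EOF
chmod +x jobscript

# Without --array, the number of subjobs is the number of lines (3 here).
if command -v oarsub >/dev/null 2>&1; then
    oarsub --array-param-file ./param_file -S ./jobscript
fi
```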
Note that you shouldn't use a parameter file with only a single line: the parameters on that line will be ignored. In other words, OAR doesn't like arrays of size 1 :-(
How to run an MPICH2 application?
mvapich2 is an InfiniBand-optimized version of mpich2. You should use the mvapich2 module in your script.
Submission script for MVAPICH2: monAppliMPICH2.sh
#!/bin/bash
# File: monAppliMPICH2.sh
#OAR -l /nodes=3/core=1
module load mpi/mvapich2-x86_64
mpirun -machinefile "$OAR_NODEFILE" -launcher-exec oarsh monAppliMPICH2
In this case, mpirun will launch MPI on 3 nodes with one core per node.
How to run an OpenMPI application?
The mpirun binary included in OpenMPI runs the application using the resources reserved by the job:
Submission script for OpenMPI: monAppliOpenMPI.sh
The OpenMPI 1.8.8 version installed on nef is patched to automatically discover the resources of your job, so you don't have to specify a machinefile.
#!/bin/bash
# File: monAppliOpenMPI.sh
#OAR -l /nodes=3/core=1
module load mpi/openmpi-1.8.8-gcc
mpirun --prefix "$MPI_HOME" --mca plm_rsh_agent oarsh monAppliOpenMPI
In this case, mpirun will start the MPI application on 3 nodes with a single core per node.
If you are using the main OpenMPI module (mpi/openmpi-x86_64), you have to add -machinefile $OAR_NODEFILE:
module load mpi/openmpi-x86_64
mpirun --prefix "$MPI_HOME" --mca orte_rsh_agent oarsh -machinefile "$OAR_NODEFILE" monAppliOpenMPI
How can I run a graphical application on a node?
First, connect to the nef frontend with ssh using the -X option, then submit an interactive job like this; OAR will do what is necessary to set up X11 forwarding:
oarsub -I ...
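If the frontend session was opened without -X, the graphical application will fail to open a display. This sketch checks that DISPLAY is set (ssh -X sets it) before submitting the interactive job. The function name is hypothetical. Add your usual resource options to oarsub -I.

```shell
#!/bin/bash
# Check on the frontend that X11 forwarding is active before submitting.
x11_ready() {
    if [[ -n ${DISPLAY:-} ]]; then
        echo "X11 forwarding active (DISPLAY=$DISPLAY)"
    else
        echo "DISPLAY is not set: reconnect with ssh -X" >&2
        return 1
    fi
}

# Submit the interactive job only when forwarding works and OAR is present.
if x11_ready && command -v oarsub >/dev/null 2>&1; then
    oarsub -I
fi
```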
How to interact with a job while it is running
In which state is my job?
The oarstat -j JOBID command shows the state of your job and the queue in which it has been scheduled.
Examples of using oarstat
-bash-4.1$ oarstat -j 1839
Job id     Name           User           Submission Date     S Queue
---------- -------------- -------------- ------------------- - ----------
1839       TEST_OAR       rmichela       2015-08-21 17:49:08 T default
The S column gives the current state (e.g. W for Waiting, L for Launching, R for Running, T for Terminated).
The Queue column shows the job's queue.
You can get full information about your job with -f, and array-specific information with --array.
When will my job be executed?
You can use oarstat -fj JOBID to get an estimate of when your job will be started.
How can I get the stderr or stdout of my job during its execution?
By default, the stdout and stderr files are created in the directory from which you submitted the job, for example OAR.TEST_ARRAY.518.stderr and OAR.TEST_ARRAY.518.stdout.
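To watch a running job's output, you can follow both files with tail. This hypothetical helper builds the file names. It assumes the default naming OAR.<job name>.<job id>.stdout/.stderr inferred from the example above.

```shell
#!/bin/bash
# Hypothetical helper: print the names of a job's output files, assuming
# the naming OAR.<job name>.<job id>.stdout/.stderr seen in the example.
job_output_files() {
    local name=$1 id=$2
    printf 'OAR.%s.%s.stdout\nOAR.%s.%s.stderr\n' "$name" "$id" "$name" "$id"
}

# Follow both files while the job runs (Ctrl-C to stop), e.g.:
#   tail -F $(job_output_files TEST_ARRAY 518)
job_output_files TEST_ARRAY 518
```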
How can I cancel a job?
The oardel <JOBID> command lets you cancel a job (see man oardel).
How can I know my Karma priority?
The oarstat -u <login> --accounting "YYYY-MM-DD, YYYY-MM-DD" command lets you see your Karma (fair share coefficient). The second date must be tomorrow's date. The higher your Karma, the lower your priority.
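Typing the two dates by hand is error-prone. This sketch computes them with GNU date, covering the last 30 days (the Karma window described earlier) and ending tomorrow as required. It prints the command and runs it only where oarstat is available.

```shell
#!/bin/bash
# Build the --accounting period: from 30 days ago until tomorrow (GNU date).
start=$(date -d '30 days ago' +%F)
end=$(date -d 'tomorrow' +%F)
login=${USER:-$(id -un)}

echo "oarstat -u $login --accounting \"$start, $end\""
if command -v oarstat >/dev/null 2>&1; then
    oarstat -u "$login" --accounting "$start, $end"
fi
```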
Troubleshooting
Why is my job rejected at submission?
The job system may refuse a job submission due to the admission rules; an explicit error message will be displayed. In case of doubt, contact the cluster admin team.
Most of the time it indicates that the requested resources are not available, which may be caused by a typo (e.g. -p "cluster='dell6220'" instead of -p "cluster='dellc6220'").
Sometimes it may also be caused by some nodes being temporarily out of service. You can check this by typing oarnodes -s, which lists the state of each node.
Another cause may be that the job requested more resources than the total resources existing on the cluster.
Why is my job blocked in a queue while there are no other jobs currently running?
Most likely a node of the cluster has a problem; please contact the administrators.
What are the best practices for Matlab jobs?
If launching many Matlab jobs at the same time, please launch them on as few nodes as possible. Matlab uses one floating license per {node, user} pair. E.g.:
- 10 jobs for user foo on 10 different cores of the nef012 node use 1 floating license,
- 1 job for user foo on each of the nef01[0-9] nodes uses 10 floating licenses.
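Under this rule, the number of licenses a user's jobs consume equals the number of distinct nodes they run on. Inside a job, $OAR_NODEFILE lists one line per reserved core, so a node name can repeat. This sketch counts distinct host names across one or more node files. The function name is hypothetical.

```shell
#!/bin/bash
# Count the Matlab floating licenses used by one user's jobs: the number of
# distinct host names across the given node files.
matlab_licenses() {
    awk '!seen[$0]++ { n++ } END { print n + 0 }' "$@"
}

# Inside a job:
#   matlab_licenses "$OAR_NODEFILE"
```

With the two cases above, 10 cores of nef012 give 1, while one core on each of nef010 to nef019 gives 10.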