FAQ
This documentation describes the obsolete "legacy nef" configuration, which will be shut down permanently on 17 April 2016. Please use the new nef configuration.
Contents
- 1 General
- 2 Job submission
- 2.1 How do I submit a job ?
- 2.2 How to choose the node type ?
- 2.3 What is the format of a submission script ?
- 2.4 What are the available queues ?
- 2.5 Can I submit hundreds/thousands of jobs ?
- 2.6 How to run a MPICH 2 application ?
- 2.7 How to run an OpenMPI application?
- 2.8 How can I run a LAM MPI application ?
- 2.9 How can I run a PVM application ?
- 2.10 How can I run a graphical application on a node ?
- 3 How to Interact with a job when it is running
- 4 Troubleshooting
- 4.1 Why is my job rejected at submission ?
- 4.2 Why is my job blocked in a queue while there are no other jobs currently running ?
- 4.3 What does this error message mean ?
- 4.4 I received a mail that says: job violates resource utilization policies, what does it mean ?
- 4.5 What are the best practices for Matlab jobs ?
General
What is TORQUE ?
TORQUE is a free resource manager based on OpenPBS (2.3.12). It is compatible with OpenPBS but adds several features (scalability, fault tolerance, ...). It is available from Adaptive Computing.
What are the most commonly used commands ?
- qsub : to submit a job
- qstat : to see the state of the queues (running/waiting jobs)
- qdel : to cancel a job
- qpeek : to show the stdout of a job while it is running
- showq : to show jobs and resource usage
- showstart : to show the scheduled start time of a job
Job submission
How do I submit a job ?
With Torque you cannot pass a binary directly as the argument of qsub on the command line. First create a submission script that contains the command line to execute and the resources needed for the job.
Once the script exists, use it as the argument of the qsub command: qsub myscript.pbs
How to choose the node type ?
The cluster has 6 kinds of nodes, with the following properties:
- Dell 1950 nodes
- nef, xeon, dell, nogpu
- Dell R815 nodes
- nef, opteron, dellr815, nogpu
- Dell C6220 nodes
- nef, xeon, dellc6220, nogpu
- Dell C6145 nodes
- nef, opteron, dellc6145, nogpu
- HP nodes
- nef, xeon, gpu, hpdl160g5, T10
- Carri nodes
- nef, xeon, carri5600XLR8, gpu, C2050
If you want 32 cores from the Dell cluster:
qsub -l "nodes=4:dell:ppn=8"
If you want 96 cores from the Dell R815 cluster:
qsub -l "nodes=2:opteron:ppn=48"
If you want a node with a GPU and 8 cores :
qsub -l "nodes=1:gpu:ppn=8"
You can mix several types of nodes in a single job, for example 19 Xeon nodes (dual quad-core, 8 cores each) and 6 Opteron nodes (48 cores each):
qsub -l "nodes=19:xeon:ppn=8+6:opteron:ppn=48"
You can also let the job run on any mix of node types by requesting only the nef property, here with ppn=2:
qsub -l "nodes=20:nef:ppn=2"
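To check how many cores a nodes= request adds up to before submitting it, the arithmetic can be sketched in a small shell function. The helper name total_cores is hypothetical and not part of Torque; it only handles the count:property:ppn=N form used in the examples above, and it assumes that a part without ppn counts as one core per node.

```shell
# total_cores SPEC: sum nodes*ppn over the '+'-separated parts of a
# "nodes=..." resource specification (hypothetical helper, not a Torque tool).
total_cores() {
    spec="${1#nodes=}"                 # drop the leading "nodes="
    total=0
    for part in $(printf '%s' "$spec" | tr '+' ' '); do
        count="${part%%:*}"            # node count is the first ':' field
        case "$part" in
            *ppn=*)
                ppn="${part##*ppn=}"   # text after "ppn="
                ppn="${ppn%%:*}"       # drop any property that follows
                ;;
            *)  ppn=1 ;;               # assumption: no ppn means 1 core per node
        esac
        total=$((total + count * ppn))
    done
    echo "$total"
}

total_cores "nodes=4:dell:ppn=8"                      # prints 32
total_cores "nodes=19:xeon:ppn=8+6:opteron:ppn=48"    # prints 440
```

The second call corresponds to the mixed Xeon/Opteron request: 19 x 8 + 6 x 48 = 440 cores.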
What is the format of a submission script ?
The script should include not only the path to the program to execute, but also information about the resources needed (resources can also be specified with qsub options on the command line). Simple example : helloWorld.pbs
# Submission script for the helloWorld program
#
# Comments starting with #PBS are used by the
# resource manager
#
#PBS -l nodes=8:ppn=1:opteron
# The job uses 8 nodes with one processor (core) per node,
# only on opteron nodes
# (replace opteron by xeon to start the job
# on dell/Xeon machines)
#
#PBS -l walltime=10:00
# job duration is less than 10min
#PBS -l pmem=600mb
# The job needs 600MB per core

# cd to the directory where the job was submitted
# ! By default, torque uses the user's home directory
cd $PBS_O_WORKDIR

# Path to the binary
./helloWorld
Resources can also be specified on the command line, overriding the resources specified in the script. For example, qsub -l nodes=8:nef:ppn=2 helloWorld.pbs will run helloWorld on 8 nodes using 2 cores per node, on nef nodes (Opteron or Xeon).
Several examples can be downloaded here :
- example.pbs : sequential job
- example-openmpi.pbs : OpenMPI job
- example-mpich.pbs : MPICH job
- example-mpich2.pbs : MPICH2 job
All scripts must specify the walltime of the job, i.e. its maximum expected duration (#PBS -l walltime=HH:MM:SS). The scheduler needs this value. If a job runs longer than requested, it is killed.
You can specify the amount of memory needed by the job per core (-l pmem=XXXmb) or per node (-l mem=XXXXmb). This helps prevent the nodes from swapping when your processes use more than the memory available per core (2GB on Dell nodes, 2.5GB on Opteron nodes). If one of the processes uses more memory than requested, the job is killed.
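Since the walltime directive expects the HH:MM:SS form, a duration estimated in minutes has to be converted first. A minimal sketch, assuming a hypothetical helper named walltime_from_minutes (not a Torque command):

```shell
# walltime_from_minutes MINUTES: print the duration as HH:MM:SS
# (hypothetical helper for building a "#PBS -l walltime=" line).
walltime_from_minutes() {
    minutes="$1"
    printf '%02d:%02d:00\n' $((minutes / 60)) $((minutes % 60))
}

walltime_from_minutes 10                                  # prints 00:10:00
echo "#PBS -l walltime=$(walltime_from_minutes 90)"       # prints #PBS -l walltime=01:30:00
```

It is usually safer to round the estimate up, since a job that exceeds its walltime is killed.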
You can find more information in the pbs_resources manpage
What are the available queues ?
Each queue has a maximum number of running jobs per user, and a maximum number of jobs running.
There are two sets of queues, one for parallel jobs and one for sequential jobs. Parallel jobs have a higher priority.
| queue name   | max user cores | max cores | min dur. | max dur. | prio  |
|--------------+----------------+-----------+----------+----------+-------|
| parshort     |                |           |          | 3h       | 10000 |
| par          | 512            | 768       | 3h       | 6h       | 7000  |
| parlong      | 376            | 512       | 6h       | 12h      | 5000  |
| parverylong  | 256            | 376       | 12h      | 48h      | 2500  |
| parextralong | 128            | 256       | 48h      | 7d       | 100   |
| seqshort     | 512            | 768       |          | 3h       | 1000  |
| seq          | 512            | 768       | 3h       | 6h       | 500   |
| seqlong      | 376            | 512       | 6h       | 12h      | 10    |
| seqverylong  | 256            | 376       | 12h      | 48h      | -1000 |
| seqextralong | 128            | 256       | 48h      | 7d       | -5000 |
| interactive  | 1 node         | 1 node    |          | 4h       | 15000 |
| gpu          |                |           |          | 48h      | 50000 |
For example, a user can submit several jobs in the parlong queue, using at most 376 cores in total and lasting no longer than 12 hours each. Another user can submit many sequential jobs in the seqshort queue, but at most 512 of them will run at the same time.
Used cores are counted in processor equivalents (PE): for example, a sequential job requesting 10 times the default pmem counts as 10 cores.
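The processor-equivalent rule can be sketched as a small calculation. This is only an illustration of the rule stated above, not the scheduler's actual code; the function name and the 1000 MB default pmem used in the calls are assumptions, and the memory ratio is rounded up.

```shell
# processor_equivalent CORES PMEM_MB DEFAULT_PMEM_MB
# Hypothetical helper: each core counts once per multiple of the default
# pmem that it requests, rounded up, with a minimum of one.
processor_equivalent() {
    cores="$1"
    pmem="$2"
    default_pmem="$3"
    factor=$(( (pmem + default_pmem - 1) / default_pmem ))   # ceiling division
    if [ "$factor" -lt 1 ]; then
        factor=1
    fi
    echo $((cores * factor))
}

# Assumed default pmem of 1000 MB, for illustration only.
processor_equivalent 1 10000 1000    # sequential job, 10x default pmem: prints 10
processor_equivalent 4 500 1000      # below the default: prints 4
```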
Fair share scheduling decreases each user's priority based on their recent resource consumption (current evaluation window : 4 days).
If you don't specify the queue, jobs are automatically routed to one of the parallel queues based on the walltime.
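The routing by walltime can be approximated from the maximum durations in the queue table. The sketch below is not the server's routing configuration: the function name is hypothetical, and it assumes that each upper bound is inclusive (a 3h job lands in parshort).

```shell
# parallel_queue_for MINUTES: print the parallel queue whose duration
# range contains the given walltime (hypothetical helper; upper bounds
# are assumed to be inclusive).
parallel_queue_for() {
    minutes="$1"
    if   [ "$minutes" -le 180 ];   then echo parshort        # up to 3h
    elif [ "$minutes" -le 360 ];   then echo par             # up to 6h
    elif [ "$minutes" -le 720 ];   then echo parlong         # up to 12h
    elif [ "$minutes" -le 2880 ];  then echo parverylong     # up to 48h
    elif [ "$minutes" -le 10080 ]; then echo parextralong    # up to 7 days
    else
        echo "walltime exceeds the 7 day maximum" >&2
        return 1
    fi
}

parallel_queue_for 10      # prints parshort
parallel_queue_for 600     # prints parlong
```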
To use the sequential queues, please specify the queue name explicitly (qsub -q <queuename> ...).
The gpu queue is dedicated to jobs running on GPU nodes. With this queue you are the only user running on the reserved nodes, which avoids GPU usage conflicts.
Can I submit hundreds/thousands of jobs ?
You can easily submit hundreds of jobs, but you should not try to submit more than 500 jobs at a time. Please consider using array jobs for this:
Sometimes it is desired to submit a large number of similar jobs to the queueing system. One obvious, but inefficient, way to do this would be to prepare a prototype job script, and then to use shell scripting to call qsub on this (possibly modified) job script the required number of times.
Alternatively, the current version of Torque provides a feature known as job arrays which allows the creation of multiple, similar jobs with one qsub command. This feature introduces a new job naming convention that allows users either to reference the entire set of jobs as a unit or to reference one particular job from the set.
To submit a job array use either
qsub -t <array_request> <jobscript>
or equivalently insert a directive
#PBS -t <array_request>
in your batch script. In each case an array is created consisting of a set of jobs, each of which is assigned a unique index number (available to each job as the value of the environment variable PBS_ARRAYID), with each job using the same jobscript and running in a nearly identical environment. Here array_request is a comma-separated list consisting of either single numbers (specifying particular index values), or a pair of numbers separated by a dash (representing a range of index values). Example:
qsub -t 0,1,4-6 jobscript
220531.cluster.inria.fr
qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
220531-0.cluster          MPI-0            sjr20                  0 Q default
220531-1.cluster          MPI-1            sjr20                  0 Q default
220531-4.cluster          MPI-4            sjr20                  0 Q default
220531-5.cluster          MPI-5            sjr20                  0 Q default
220531-6.cluster          MPI-6            sjr20                  0 Q default
Note that each job is assigned a composite job id of the form 220531-x (stored as usual in the environment variable PBS_JOBID). Each job is distinguished by a unique array index x (stored in the environment variable PBS_ARRAYID), which is also added to the value of the PBS_JOBNAME variable. Thus each job can perform slightly different actions based on the value of PBS_ARRAYID (e.g. using different input or output files, or different options). The values of x used are those specified in the argument to the -t option.
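To make the array_request syntax concrete, the sketch below expands a request such as 0,1,4-6 into the index values it covers, then shows how a job script might pick a per-index input file. The function name and the input-N.dat naming scheme are hypothetical; only PBS_ARRAYID comes from Torque.

```shell
# expand_array_request REQUEST: print one array index per line for a
# comma-separated list of numbers and dash-separated ranges.
expand_array_request() {
    for item in $(printf '%s' "$1" | tr ',' ' '); do
        case "$item" in
            *-*)
                i="${item%-*}"          # start of the range
                last="${item#*-}"       # end of the range
                while [ "$i" -le "$last" ]; do
                    echo "$i"
                    i=$((i + 1))
                done
                ;;
            *)  echo "$item" ;;         # single index
        esac
    done
}

expand_array_request 0,1,4-6    # prints 0, 1, 4, 5 and 6, one per line

# Inside a job, Torque sets PBS_ARRAYID; it defaults to 0 here so that the
# sketch also runs outside a job. The file name is a hypothetical scheme.
index="${PBS_ARRAYID:-0}"
echo "this job would read input-${index}.dat"
```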
How to run a MPICH 2 application ?
mpiexec is available to run applications compiled with MVAPICH (MPICH2 for InfiniBand) :
Submission script for MPICH2 : monAppliMPICH2.pbs
# File : monAppliMPICH2.pbs
#PBS -l "nodes=3:ppn=1:nef"
cd $PBS_O_WORKDIR
mpiexec monAppliMPICH2
In this case, mpiexec will launch MPI on 3 nodes with one core per node.
How to run an OpenMPI application?
The mpirun binary included in OpenMPI runs the application using the resources reserved by the job :
Submission script for OpenMPI : monAppliOpenMPI.pbs
# File : monAppliOpenMPI.pbs
#PBS -l "nodes=3:ppn=1:nef"
export LD_LIBRARY_PATH=/opt/openmpi/current/lib64
# bind each MPI process to a CPU; the Linux scheduler is not well suited to MPI
export OMPI_MCA_mpi_paffinity_alone=1
cd $PBS_O_WORKDIR
# For PGI openmpi:
/opt/openmpi/current/bin/mpirun monAppliOpenMPI
# For gcc openmpi:
#/opt/openmpi-gcc/current/bin/mpirun monAppliOpenMPI
# or simply:
#mpirun monAppliOpenMPI
In this case, mpirun will start the MPI application on 3 nodes with a single core per node.
How can I run a LAM MPI application ?
LAM is no longer installed on the cluster. Use OpenMPI or MPICH2 instead.
How can I run a PVM application ?
PVM is no longer installed on the cluster.
How can I run a graphical application on a node ?
First, connect to nef with ssh using the -X option, then submit your job like this (add -I for an interactive job):
qsub -X ...
How to Interact with a job when it is running
In which state is my job ?
The qstat JOBID command shows the state of your job and the queue in which it has been scheduled.
Examples of using qstat
cluster:~/$ qstat 898
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
898.cluster      hello.sh         sgeorget         00:00:00 R normal
The S column gives the current state (Queued, Running, Ending, Suspended).
The Queue column shows the job's queue.
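For scripts that need to react to the state of a job, the S column can be extracted from the qstat output. This is a minimal sketch under two assumptions: the output has the two header lines shown in the example, and the state letter is the second-to-last column. The function names are hypothetical, and only the four states listed above are mapped.

```shell
# state_name LETTER: translate a state letter from the S column.
state_name() {
    case "$1" in
        Q) echo Queued ;;
        R) echo Running ;;
        E) echo Ending ;;
        S) echo Suspended ;;
        *) echo "Unknown ($1)" ;;      # letters not listed in this FAQ
    esac
}

# job_states: read qstat output on stdin, print "<job id> <state>" per job.
job_states() {
    # Skip the two header lines; the state is the second-to-last column.
    awk 'NR > 2 && NF >= 2 { print $1, $(NF - 1) }' |
    while read -r id letter; do
        echo "$id $(state_name "$letter")"
    done
}

# Sample copied from the qstat example; on the cluster: qstat 898 | job_states
job_states <<'EOF'
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
898.cluster      hello.sh         sgeorget         00:00:00 R normal
EOF
```

With the sample input, this prints "898.cluster Running".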
When will my job be executed ?
You can use showstart to get an estimate of when your job will start.
How can I get the stderr or stdout of my job during its execution ?
You can use the qpeek command. qpeek JOBID shows the stdout of JOBID, and qpeek -e shows the stderr.
How can I cancel a job ?
The qdel <JOBID> command lets you cancel a job (see man qdel).
Troubleshooting
Why is my job rejected at submission ?
The job system may refuse a job submission, which usually causes this error message :
qsub: Job rejected by all possible destinations
Most of the time it indicates that the requested resources are not available, which may be caused by a typo (e.g. -l nodes=3:dell6220 instead of -l nodes=3:dellc6220).
Sometimes it may also be caused by some nodes being temporarily out of service. You can check this by typing pbsnodes -a, which lists all nodes and their state.
Another possible cause is that the job requested more resources than exist on the whole cluster.
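A typo in a node property can be caught before submission by comparing it with the properties listed in "How to choose the node type ?". The sketch below hard-codes that list; the function name is hypothetical and the list has to be kept in sync with the cluster by hand.

```shell
# Node properties copied from the "How to choose the node type ?" section.
KNOWN_PROPERTIES="nef xeon opteron dell dellr815 dellc6220 dellc6145 nogpu gpu hpdl160g5 T10 carri5600XLR8 C2050"

# check_property NAME: succeed if NAME is a known node property.
check_property() {
    for known in $KNOWN_PROPERTIES; do
        if [ "$1" = "$known" ]; then
            return 0
        fi
    done
    echo "unknown node property: $1" >&2
    return 1
}

check_property dellc6220 && echo "dellc6220 is valid"
check_property dell6220 || echo "dell6220 would be rejected"
```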
Why is my job blocked in a queue while there are no other jobs currently running ?
A node of the cluster has a problem. Please contact the administrators.
What does this error message mean ?
- cannot execute binary file : qsub was launched without a submission script as its argument. qsub requires a script that launches the binary, even if you don't need to pass specific parameters.
- p0_XXXX: p4_error: : 8262 : the binary was compiled with MPICH and is being run with LAM MPI.
I received a mail that says: job violates resource utilization policies, what does it mean ?
Most of the time, it means that your job used too much memory and Maui had to kill it. Try changing the pmem or mem value requested by your job.
What are the best practices for Matlab jobs ?
If you launch many Matlab jobs at the same time, please run them on as few nodes as possible. Matlab uses one floating licence per {node,user} pair. For example :
- 10 jobs for user foo on 10 different cores of the nef012 node use 1 floating licence,
- 1 job for user foo on each of the nef01[0-9] nodes uses 10 floating licences.
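The licence count follows directly from the number of distinct {node,user} pairs, which can be sketched as follows. The function name and the "user node" input format (one line per running Matlab job) are assumptions made for the illustration.

```shell
# count_matlab_licences: read "user node" lines on stdin, one per running
# Matlab job, and print the number of distinct {node,user} pairs.
count_matlab_licences() {
    sort -u | wc -l | tr -d ' '
}

# 10 jobs for foo on the same node: prints 1
for i in 1 2 3 4 5 6 7 8 9 10; do echo "foo nef012"; done | count_matlab_licences

# 1 job for foo on each of nef010 to nef019: prints 10
for i in 0 1 2 3 4 5 6 7 8 9; do echo "foo nef01$i"; done | count_matlab_licences
```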