Contents

1 General
2 Job submission
3 How to Interact with a job when it is running
4 Software
5 Troubleshooting
6 Disks and filesystems
7 Other
- 7.1 Guidelines for hardware support of multi GPU scaling ?

General

Who can have an account on the cluster ?

Inria users : nef is a platform of the Inria Sophia Antipolis - Méditerranée research center, open to everyone with an Inria account for as long as the account is valid.
Academic and industrial partners of Inria, under agreement.

For account application, extension, renewal please follow the first steps procedure.

How do I authenticate on the cluster ?

Connect using ssh with a public/private keypair for OpenSSH 2.

The public/private keypair is generated by each user. Reminder: the private key must be kept by the user, properly protected, and never disclosed to anyone.

The initial public key for authentication is provided with the account request. Users can later add or remove authentication public keys by connecting to the cluster and editing their ~/.ssh/authorized_keys file.

When does my cluster account expire ?

Type nef-user -l your_nef_login on nef-devel2 or nef-frontal. The Expire date is the first day on which the account will be deactivated.

What is OAR ?

OAR is a versatile resource and task manager (also called a batch scheduler) for HPC clusters and other computing infrastructures (like distributed computing experimental testbeds where versatility is key).

OAR is the way you reserve resources (nodes, cores) on the cluster by submitting a job.

The official User Documentation is here : http://oar.imag.fr/docs/2.5/#ref-user-docs
The Inria Rennes Tutorial : http://igrida.gforge.inria.fr/tutorial.html

What are the most commonly used OAR commands ? (see official docs)

oarsub : to submit a job
oarstat : to see the state of the queues (Running/Waiting jobs)
oardel : to cancel a job
oarpeek : to show the stdout of a job while it is running
oarhold : to hold a job while it is Waiting
oarresume : to resume jobs in the states Hold or Suspended

Can I use a web interface rather than command line ?

Warning : Kali support is now discontinued. It is recommended to switch to the command line. Kali is only partly functional.

For Kali, connect to the Kali web portal : if you have an Inria account just Sign in/up with CAS ; if you have no Inria account use Sign in/up.

Job submission

How do I submit a job ?

Use the oarsub command. With OAR you can directly use a binary as a submission argument on the command line, or even an inline script. You can also create a submission script. The script includes the command line to execute and the resources needed for the job. Do not forget to use the -S option of oarsub if you want the OAR parameters in the script to be parsed and honored (oarsub -S ./myscript).

How to choose the node type and properties ?

The cluster has several kinds of nodes.

To view all defined OAR properties :

graphical : connect to Monika and click on the node name.
command line : use oarnodes, example for nef085: oarnodes nef085.inria.fr

If you want all the cores from a single node :

oarsub -l /nodes=1

If you want 48 cores from any type and any number of nodes :

oarsub -l /core=48

In this case, the 48 cores can be spread over several nodes, so your application must handle this case (using MPI or another framework). A multithreaded application cannot use all the reserved cores if they are spread over several nodes.

If you need to reserve a given amount of cores from a single node, use :

oarsub -l /nodes=1/core=2

If you want all the cores of 2 nodes from xeon nodes with more than 8GB RAM per core each during 10 hours:

oarsub -p "cputype='xeon' and mem_core > 8000" -l /nodes=2,walltime=10:00:00

If you want 48 cores as 12 cores from 4 nodes from the first C6220 cluster (c6220a) :

oarsub -p "cluster='c6220a'" -l /nodes=4/core=12

You can make more specific reservations using additional resource tags. This job reserves a total of 16 cores as 8 cores from the same node on 2 different Infiniband network switches

oarsub -l /ibswitch=2/node=1/core=8

Reserve either 6 cores during 1 hour or 3 cores during 2 hours (moldable job, with an either-or choice between the two requests) :

oarsub -l /core=6,walltime=1 -l /core=3,walltime=2

How do I reserve GPU resources ?

To reserve a single gpu, do:

oarsub -p "gpu='YES'" -l /gpunum=1

Several CPU cores may be attached to a GPU, so, for example, on nefgpu18 you will get 5 cores reserved with 1 gpu.

To reserve a single gpu with compute capability 5.0 or more, do:

oarsub -p "gpu='YES' and gpucapability>='5.0'" -l /gpunum=1

To reserve 2 gpus with compute capability 5.0 or more, do:

oarsub -p "gpu='YES' and gpucapability>='5.0'" -l /nodes=1/gpunum=2

This request is often a bad idea. You may be allocated 2 GPUs from different hosts. Unless your code can handle it, you will then use only one of the GPUs and keep the other one blocked and idle :

oarsub -p "gpu='YES' and gpucapability>='5.0'" -l /gpunum=2

If you want more GPUs on a single node, say 4:

oarsub -p "gpu='YES'" -l /nodes=1/gpunum=4

If you want all the gpus on a node, during 4 hours

oarsub -p "gpu='YES'" -l /nodes=1,walltime=4

If you reserve a single core (-l /nodes=1/core=1), you will NOT have exclusive access to the gpu attached to it.

Remember: to check the available gpus and monitor them, use nvidia-smi

What is the format of a submission script ?

The script should include not only the path to the program to execute, but also information about the needed resources (you can also specify resources using oarsub options on the command line). Simple example : helloWorldScript

#!/bin/bash
#
# Submission script for the helloWorld program
#
# Comments starting with #OAR are used by the resource manager if using "oarsub -S"
#
# The job reserves 8 nodes with one processor (core) per node,
# only on xeon nodes from cluster dellc6145, job duration is less than 10min
# Note : quoting style of parameters matters, follow the example
#OAR -l /nodes=8/core=1,walltime=00:10:00
#OAR -p cputype='xeon' and cluster='dellc6145'
#
# The job is submitted to the default queue
#OAR -q default
# 
# Path to the binary to run
./helloWorld

Script must be executable (chmod u+rx helloWorldScript).

You can mix parameters in the submission script and on the command line, but take care about how they combine. In this example the -p on the command line takes precedence over the script, while the -l from the script and the command line are combined (moldable jobs when using multiple -l options) :

oarsub -p "cputype='opteron'" -l /nodes=4/core=2 -S ./helloWorldScript

What are the available queues ?

The limits and parameters of the queues are listed below :

queue name   max user resources   max user running jobs   max duration (days)   priority   max user (hours*resources)
dedicated    576                  -                       30                    15         32256
default      576                  -                       30                    10         32256
big          1536                 2                       30                    5          6144
besteffort   -                    -                       3                     0          -

A core is either 1 resource on a non-hyperthreaded node, or 2 resources on a hyperthreaded node (1 core has 2 hardware threads).

This means all the jobs of a user running in the default queue at a given time can use at most 576 resources (eg 576 non-hyperthreaded cores, or 288 hyperthreaded cores, or 288 non-hyperthreaded cores with twice the default memory per core, etc.) with a cumulated reservation of 32256 hours*resources. Maximum walltime of each job is 30 days. Number of running jobs is not limited (thus can be up to 576).

In other words, at any given time a user can have running jobs in the default queue using a cumulated resource reservation of at most

either 32 resources during 28 days with the default memory per core ;
or 128 resources during 7 days with the default memory per core ;
or 128 resources during 3 days 1/2 with twice the default memory per core ;
or 256 resources during 3 days 1/2 with the default memory per core ;
etc.
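
The limits can be checked with simple arithmetic before submitting. The following is a minimal sketch, assuming the default queue figures from the table above (576 resources, 30 days, 32256 hours*resources); the function name fits_default_queue is hypothetical, and a request with twice the default memory per core should be counted with twice its number of resources.

```shell
#!/bin/bash
# Hypothetical helper: does a cumulated request of N resources during H hours
# fit the per-user limits of the default queue listed in the table above?
MAX_RESOURCES=576          # max user resources
MAX_HOURS=$((30 * 24))     # max duration: 30 days
MAX_BUDGET=32256           # max user (hours*resources)

fits_default_queue() {
    local resources=$1 hours=$2
    if [ "$resources" -le "$MAX_RESOURCES" ] &&
       [ "$hours" -le "$MAX_HOURS" ] &&
       [ $((resources * hours)) -le "$MAX_BUDGET" ]; then
        echo yes
    else
        echo no
    fi
}

fits_default_queue 128 $((7 * 24))    # 128 resources during 7 days
fits_default_queue 576 $((7 * 24))    # 576 resources during 7 days
```

The second request exceeds the hours*resources budget even though each individual limit (resources, duration) is respected.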

A user can have at most 2 jobs running in the big queue at a given time, using a total of at most 1536 resources with a cumulated reservation of 6144 hours*resources. Maximum walltime of each job is 30 days. The big queue should be used only for jobs that need more than the max user resource of the default queue.

The dedicated queue can use only dedicated resources. Its advantage is that your default queue Karma does not increase.

Specific for STARS team members : interactive jobs are limited to 8 hours, by request of the team.

How are the jobs scheduled and prioritized ?

Jobs are scheduled :

based on queue priority (jobs in higher priority queues are served first),
and then based on the user Karma (for jobs of equal queue priority, jobs with lower Karma users are served first).

The user's fair-share scheduling Karma measures his/her recent resource consumption during the last 30 days in a given queue. Resource consumption takes into account both the used resources and the requested (but unused) resources in a given queue with the same formula as detailed here. When you request or consume resources on the cluster, your priority relative to other users decreases (as your Karma increases).

Jobs in the dedicated, default and big queues wait until the requested resources can be reserved.

Jobs in the besteffort queue run without resource reservation : they are allowed to run as soon as there is available resource on the cluster (they are not subject to per user limits, etc.) but can be killed by the scheduler at any time when running if a non-besteffort job requests the used resource.

Using the besteffort queue enables a user to use more resources at a time than the per user limits and permits efficient cluster resource usage. Thus using the besteffort queue is encouraged for short jobs (several hours) that can easily be resubmitted.

Why does my long job not run on dellc6100 or dellr900 nodes ?

dellc6100 and dellr900 nodes are shared between NEF and GRID5000, alternating 1 week on each platform. A job with a walltime over ~167 hours (1 week minus reconfiguration time) that specifically requests these nodes will never run and will stay in the Waiting (W) state forever. Please oardel it if it was submitted by mistake.
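
A walltime can be checked against this limit before submission. The sketch below is only an illustration: the function name walltime_fits is hypothetical, and it assumes the walltime is written as H, H:M or H:M:S.

```shell
#!/bin/bash
# Hypothetical helper: convert a walltime (H, H:M or H:M:S) to whole hours,
# rounded up, and compare it to the ~167 hour limit quoted above.
LIMIT_HOURS=167

walltime_fits() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    m=${m:-0}; s=${s:-0}
    # 10# forces base 10 so that values such as "08" are not read as octal
    local seconds=$((10#$h * 3600 + 10#$m * 60 + 10#$s))
    local hours=$(( (seconds + 3599) / 3600 ))
    if [ "$hours" -le "$LIMIT_HOURS" ]; then echo ok; else echo too-long; fi
}

walltime_fits 160:00:00
walltime_fits 168
```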

How do I submit a job in the "big" queue ?

Use oarsub -q big or use the equivalent option in your submission script (see submission script examples).

How do I submit a besteffort job ?

To submit a job to the best effort queue just use oarsub -t besteffort or use the equivalent option in your submission script (see submission script examples).

Your jobs will be rescheduled automatically with the same behaviour if you additionally use the idempotent mode : oarsub -t besteffort -t idempotent

OAR checkpoint facility may be useful for besteffort jobs but requires support by the running code.
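
A resubmitted job restarts its command from the beginning, so the job body benefits from being restartable. The following is a minimal application-level sketch, not the OAR checkpoint facility itself: STATE_FILE, process_step and run_steps are hypothetical names, and in a real job the state file should live on a filesystem shared by all nodes.

```shell
#!/bin/bash
# Sketch of a restartable job body: each completed step is recorded in a
# state file, so a resubmitted job skips the work already done.
# Here the state file defaults to a temporary file for demonstration only.
STATE_FILE=${STATE_FILE:-$(mktemp)}

process_step() {
    echo "processing step $1"     # replace with the real computation
}

run_steps() {
    local total=$1 done_steps=0 step
    # resume after the last recorded step, if any
    if [ -s "$STATE_FILE" ]; then
        done_steps=$(cat "$STATE_FILE")
    fi
    for ((step = done_steps + 1; step <= total; step++)); do
        process_step "$step"
        echo "$step" > "$STATE_FILE"   # record progress after each step
    done
}

run_steps 3
```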

How do I reserve resources in advance ?

Submit a job with oarsub -r "YYYY-MM-DD HH:MM:SS". A user can have at most 2 scheduled advance reservations at a given time.

Example 1 :

# No command specified : 1 node is reserved for 2 hours on 2017-12-10 08:00:00
# Running job remains idle until the user connects to the node with oarsub -C job_number
# and interactively launches commands.
oarsub -r "2017-12-10 08:00:00" -l /nodes=1,walltime=2

Example 2 :

# Command specified : 2 cores are reserved for 3 hours on 2017-12-10 14:00:00
# /path/to/my/script script is launched at that time,
# and resources are released when the script finishes or walltime is reached
oarsub -r "2017-12-10 14:00:00" -l /core=2,walltime=3 /path/to/my/script

How much memory (RAM) is allocated to my job ?

OAR takes the total amount of RAM of a node and divides it by the number of cores (minus a small amount for the system).

So for instance, if a node has 96GB of RAM and 12 cores, each reserved core will have ~8GB of RAM allocated by OAR. If you reserve only one core on this type of node, your job will be limited to ~8GB of RAM. RAM is counted for RSS (physical memory really used) not for VSZ (virtual memory allocated).

How can I change the memory (RAM) allocated to my job ?

If you need a single core, but more than the amount of RAM allocated per core, you need to reserve more than one core. Since our cluster is heterogeneous (memory per core is not the same on each sub-cluster), there is no single simple syntax to get the needed amount of memory.

You can explicitly use the mem_core property of OAR. If you want cores with a minimum amount of RAM per core, you can do (at least 8GB per core in this example) :

oarsub  -l '{mem_core > 8000}/nodes=1/core=3'

In this case, you will have 3 cores on the same node with at least 3x8GB = 24GB of RAM.
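
The number of cores to reserve for a given memory need follows from a rounded-up division. A small sketch, assuming mem_core is expressed in MB as in the request above; the function name cores_for_memory is hypothetical and the script only prints the command it would suggest.

```shell
#!/bin/bash
# Hypothetical helper: number of cores to reserve on one node to obtain a
# given amount of RAM, given the memory per core (both in MB).
cores_for_memory() {
    local needed_mb=$1 mem_core_mb=$2
    echo $(( (needed_mb + mem_core_mb - 1) / mem_core_mb ))   # round up
}

# A job needing about 20GB on nodes with more than 8GB per core
cores=$(cores_for_memory 20000 8000)
echo "oarsub -l '{mem_core > 8000}/nodes=1/core=$cores'"
```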

In this example you reserve a full node with at least 150GB of RAM :

oarsub -p 'mem > 150000' -l /nodes=1

How can I check the resources actually used by a running or terminated job ?

Use the Colmet tool to view the CPU and RAM usage profile of your job during or after its execution.

warning : bug in Colmet, it crashes when the sampling density reaches 1 point per 5 seconds or more (eg: request no more than 5 points for a 30 second window)
warning : bug in Colmet, we observed that the reported RSS (RAM) is sometimes wrong

Colmet can be accessed :

for Inria users : from the Inria Sophia enterprise network ; or through the Inria VPN with the vpn.inria.fr/all profile
for all users : by ssh tunneling through nef-frontal.inria.fr (eg: ssh -L 5000:nef-devel2:5000 nef-frontal.inria.fr and browsing http://localhost:5000)

Alternatively, connect to a node while your job is running and check your process physical memory (RSS) usage and virtual memory (VSZ) usage with :

ps -o pid,command,vsz,rss -u yourlogin
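
To get a single figure rather than a per-process list, the RSS values can be summed. This is a sketch only: sum_rss_mb is a hypothetical helper, and it assumes ps reports RSS in KB (as procps ps does on Linux).

```shell
#!/bin/bash
# Sketch: total resident memory (RSS) of a user's processes, in MB.
# sum_rss_mb reads one RSS value in KB per line, as printed by "ps -o rss=".
sum_rss_mb() {
    awk '{ total += $1 } END { printf "%d\n", total / 1024 }'
}

# Sum the RSS of all processes of the current user on this node
ps -o rss= -u "$(id -un)" | sum_rss_mb
```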

How can I submit hundreds/thousands of jobs ?

You can have up to 10000 jobs submitted at a time (includes jobs in all states : Waiting, Running, etc.).

We have raised the limit up to 10000 (20.06.2016). This is experimental and we may lower this limit at anytime if a problem occurs.

OAR provides a feature called array job which allows the creation of multiple, similar jobs with one oarsub command.

Please consider using array jobs when submitting a large number of similar jobs to the queueing system. The obvious but inefficient alternative is to prepare a prototype job script and write a shell loop that calls oarsub on this (possibly modified) job script the required number of times.

To submit an array comprised of array_number jobs use :

oarsub --array array_number

To submit an array comprised of array_number jobs with distinct parameters passed to each job use :

oarsub --array-param-file param_file

where param_file is a text file with array_number lines. Each line contains the arguments passed to the job with the corresponding index in the array, using shell syntax. Example for an array of 3 jobs :

 foo 'a b'     # First job receives 2 arguments : 'foo', 'a b'
 bar $HOME y   # Second job receives 3 args : 'bar', the path to your homedir, y
 hi `hostname` $MYVAR # Third job receives 3 args : 'hi', result of hostname command, value of $MYVAR variable

Variables and commands are evaluated when launching the job not when running the oarsub command (thus in the user's context on the execution node, not on the submission frontend).

Don't use a parameter file with only one single line: the parameters in this line will be ignored. In other words OAR doesn't like arrays of size 1 :-(
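
A quick check of the parameter file avoids the single-line pitfall. The sketch below only prints the command it would run; the file content, the count_params helper and the ./runme program are illustrative, and it assumes every non-empty line is one array member.

```shell
#!/bin/bash
# Sketch: build a parameter file with one line per array member and refuse
# to submit if it has fewer than 2 lines.
param_file=$(mktemp)
cat > "$param_file" <<'EOF'
input1.dat 10
input2.dat 20
input3.dat 30
EOF

count_params() {
    # count the non-empty lines of the file
    grep -c . "$1"
}

n=$(count_params "$param_file")
if [ "$n" -ge 2 ]; then
    echo "would run: oarsub --array-param-file $param_file ./runme"
else
    echo "refusing to submit: $n line(s) in $param_file" >&2
fi
```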

When using a submission script, array job can be specified with a directive in the script :

#OAR --array array_number
##OR
#OAR --array-param-file param_file

OAR creates one different job per member in the array, with the following environment variables :

$OAR_JOB_ID : unique jobid for each member of the array
$OAR_ARRAY_ID : common value for all members of the array (equal to the jobid of the first array member)
$OAR_ARRAY_INDEX : unique index for each member of the array (first job has index 1, second job has index 2, etc.)
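
A typical use of these variables is to select the work item of each array member from its index. A minimal sketch, where the INPUTS list and the input_for_index function are hypothetical; outside OAR the variables are unset, so defaults are used.

```shell
#!/bin/bash
# Sketch: pick the input for this array member from its index.
INPUTS=(alpha.dat beta.dat gamma.dat)

input_for_index() {
    local index=$1
    # OAR_ARRAY_INDEX starts at 1, bash arrays start at 0
    echo "${INPUTS[index - 1]}"
}

index=${OAR_ARRAY_INDEX:-1}
echo "job ${OAR_JOB_ID:-none} of array ${OAR_ARRAY_ID:-none} handles $(input_for_index "$index")"
```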

Example :

nef-devel2$ oarsub --array 2 ./runme
Generate a job key...
Generate a job key...
OAR_JOB_ID=235542
OAR_JOB_ID=235543
OAR_ARRAY_ID=235542
nef-devel2$ oarstat --array 235542
Job id    A. id     A. index  Name       User     Submission Date     S Queue
--------- --------- --------- ---------- -------- ------------------- - --------
235542    235542    1                    mvesin   2016-04-01 15:49:27 R default 
235543    235542    2                    mvesin   2016-04-01 15:49:27 R default 
nef-devel2$

When using oarsub -t besteffort -t idempotent jobs with arrays, a job in the array may be killed while running and automatically resubmitted. In this case in the resubmitted job : $OAR_JOB_ID is the new jobid, $OAR_ARRAY_INDEX and $OAR_ARRAY_ID are unchanged.

Example of besteffort array member automatic resubmission with $OAR_ARRAY_ID = 235524, and job 235525 (array index 2) killed by OAR and resubmitted as 235527 :

nef-devel2$ oarstat --array 235524
Job id    A. id     A. index  Name       User     Submission Date     S Queue
--------- --------- --------- ---------- -------- ------------------- - --------
235524    235524    1                    mvesin   2016-04-01 14:07:38 R besteffo
235525    235524    2                    mvesin   2016-04-01 14:07:38 E besteffo
235527    235524    2                    mvesin   2016-04-01 14:15:55 R besteffo
nef-devel2$ oarstat -fj235527 | grep resubmit
   resubmit_job_id = 235525

How can I pass command line arguments to my job ?

oarsub does not have a command line option for this but you can pass parameters directly to your job, eg :

oarsub [-S] "./mycode abcde xyzt"

and then in ./mycode check $1 (abcde) and $2 (xyzt) variables, in the language specific syntax. Example :

# Submission script ./mycode
#
# Comments starting with #OAR are used by the resource manager if "oarsub -S"
#OAR -p cputype='xeon'

# pick first argument (abcde) in VAR1
VAR1=$1
# pick second argument (xyzt) in VAR2
VAR2=$2

# Place here your submission script body
echo "var1=$VAR1 var2=$VAR2"

Another syntax for that :

oarsub [-S] "./mycode --VAR1 abcde --VAR2 xyzt"

and then in ./mycode use options parsing in the language specific syntax.
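
For a bash script, the option parsing can be sketched as follows. This is one possible approach, not a required convention: the parse_args function is hypothetical and handles only the two options of the example above.

```shell
#!/bin/bash
# Sketch of option parsing for "./mycode --VAR1 abcde --VAR2 xyzt".
parse_args() {
    VAR1="" VAR2=""
    while [ $# -gt 0 ]; do
        case "$1" in
            --VAR1|--VAR2)
                # each option needs a value after it
                [ $# -ge 2 ] || { echo "missing value for $1" >&2; return 1; }
                if [ "$1" = "--VAR1" ]; then VAR1=$2; else VAR2=$2; fi
                shift 2 ;;
            *)
                echo "unknown option: $1" >&2
                return 1 ;;
        esac
    done
}

parse_args "$@" && echo "var1=$VAR1 var2=$VAR2"
```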

If you do not use the -S option of oarsub then you may prefer to use shell environment variables, eg :

oarsub -l /nodes=2/core=4  "env VAR1=abcde VAR2=xyzt  ./myscript.sh"

What is a dedicated node ?

A dedicated node is a node for which a limited number of cluster users (eg: a research team) has privileged access (usually because it funded the node). Other cluster users can only submit besteffort jobs to this node and cannot use the additional local storage (under /local).

Check the node properties to see whether a node is dedicated :

property dedicated has value NO for a common node
property dedicated has value groupname for a node dedicated to groupname

How do I use a dedicated node ?

No specific option is required, just describe the requested resources. For example, to submit an interactive besteffort queue job reserving one gpu with GPU capability 5.0 or higher :

oarsub -p "gpu='YES' and gpucapability>='5.0'" -t besteffort -l /gpunum=1 -I

To specifically request the dedicated resources of groupname use -p "dedicated='groupname'". In this case you may prefer to use the dedicated queue rather than the default queue. For example, to submit an interactive dedicated queue job reserving one gpu node from the asclepios team use :

oarsub -q dedicated -p "gpu='YES' and dedicated='asclepios'" -l /nodes=1 -I

If you use -q dedicated and don't have access to matching dedicated resources, you'll (of course) get a Not enough resources error message at job submission.

How do I share a node reservation with other users ?

A timesharing ( -t timesharing ) job reserves resources that can be accessed at the same time by all authorized users (hint : reserve full nodes to avoid unexpected behaviours unless you're an expert user).

All users of the timesharing node have access to all the reserved resources (cores, memory, GPUs) and should coordinate to avoid conflict or over-usage.

Who are the authorized users to a timesharing node ?

standard case : all the cluster users
dedicated node : only the privileged users for this node (no simultaneous besteffort)

When all the jobs sharing the node have finished the resources are freed.

Example 1 :

# shared interactive use of a node during 4 hours for a hands on session :
# check for a free node, and name the node to be sure you all access the same node
# 
# simplest case: all users access with the same command
oarsub -t 'timesharing=*,*' -p "host='nef111.inria.fr'" -l /nodes=1,walltime=4:0:0 -I
# the standard node and resources can be accessed by all cluster users during this time

Example 2 :

# advance reservation of nefgpu09 dedicated node during one week starting
# at 2017/03/11 8AM, for shared usage :
oarsub -r "2017-03-11 8:00:00" -p "gpu='YES' and host='nefgpu09.inria.fr'" -l /nodes=1,walltime=7:0:0:0 -t 'timesharing=*,*'
#
# the dedicated node and resources can be accessed by privileged users during this time eg :
oarsub -t 'timesharing=*,*' -p "gpu='YES' and host='nefgpu09.inria.fr'" -l /nodes=1,walltime=2 -I
oarsub -t 'timesharing=*,*' -p "gpu='YES' and host='nefgpu09.inria.fr'" -l /nodes=1,walltime=4 /path/to/script
# etc.

Pay attention to the walltime in the subsequent oarsub : if requesting longer access than currently reserved, the job will start only if the reservation can be extended. The simplest way is to submit subsequent jobs with a walltime expiring before the end of the initial job.

Do not confuse timesharing (several jobs have simultaneous access to a set of reserved resources) and container (kind of scheduling-in-scheduling, but each inner job has dedicated resources).

How to choose GPU resources for multi GPU jobs ?

A multi-GPU job has better performance when the reserved GPUs are connected by a high speed data path. On a GPU node, check the data path between GPUs with :

nvidia-smi topo -m

Recommended usage is :

choose GPUs from the same host and
with a high speed connection (eg: PHB PXB PIX but not SOC).

Minimal reasonable multi-GPU request example :

# Request 2 GPUs from same host
oarsub -p "gpu='YES'" -l /nodes=1/gpunum=2 -I

Advanced resource request example :

# A person from the STARS team requests a pair of GPU cards from one of their dedicated nodes.
# Wants either the gpudevice pair 0/1 or 2/3 which are on same PCIe host bridge (PHB).
oarsub -p "gpu='YES' and dedicated='stars'" -l "{ gpudevice=0 or gpudevice=1 }/nodes=1/gpunum=2" -l "{ gpudevice=2 or gpudevice=3 }/nodes=1/gpunum=2" -I

How to use CPU hyperthreading ?

CPU hyperthreading is available only on the newest nef nodes (see hardware description for node details). Hyperthreading is permanently enabled on these nodes.

When reserving a CPU core with OAR, you are always assigned both threads from this CPU core, without a specific OAR request syntax. Each thread appears as one OAR resource, so you are assigned two resources per core.

Example : reserve 1 core (2 threads) from nefgpu12 in besteffort :

[nef-frontal $] oarsub -t besteffort -p "host='nefgpu12.inria.fr'" -l /core=1 -I
# we are assigned 2 threads from core number 2261
[nefgpu12 ~]$ cat $OAR_RESOURCE_PROPERTIES_FILE | sed -e 's/.*\(thread = .[0-9]*. \).*\(core = .[0-9]*.\).*/\1\2/g'
thread = '0' core = '2261'
thread = '1' core = '2261'

From the developer's point of view, each thread appears to programs as a logical processor.

How to Interact with a job when it is running

How do I connect to the nodes of my running job ?

Use oarsub -C jobid to start an interactive shell on the master node of the job jobid, or use OAR_JOB_ID=jobid oarsh hostname to connect to any node of the job.

To get the list of the job nodes do a cat $OAR_NODE_FILE and then use oarsh hostname to connect to other job nodes.
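
The node file can also be summarized per host. The sketch below assumes the node file contains one line per reserved resource, so that a host appears several times; hosts_summary is a hypothetical helper, and outside OAR a sample file with illustrative host names is used instead.

```shell
#!/bin/bash
# Sketch: list the hosts of the job with the number of lines each one has
# in the node file.
if [ -z "${OAR_NODE_FILE:-}" ]; then
    # not inside an OAR job: fall back to a sample file
    OAR_NODE_FILE=$(mktemp)
    printf '%s\n' nef085.inria.fr nef085.inria.fr nef086.inria.fr > "$OAR_NODE_FILE"
fi

hosts_summary() {
    # prints "<host> <count>" for each distinct host
    sort "$1" | uniq -c | awk '{print $2, $1}'
}

hosts_summary "$OAR_NODE_FILE"
```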

Other useful commands : oarcp to copy files between the local filesystems of nodes, oarprint to query the resources allocated to the job (eg : oarprint host for the list of the hostnames your job is running on).

Please note ssh to the nodes is not allowed, but oarsh is a wrapper around ssh.

In which state is my job ?

The oarstat -j jobid command shows the state of job jobid and the queue in which it has been scheduled.

Example for jobid 1839 :

nef-frontal$ oarstat -j 1839
Job id     Name           User           Submission Date     S Queue
---------- -------------- -------------- ------------------- - ----------
1839       TEST_OAR       rmichela       2015-08-21 17:49:08 T default

the S column gives the current state (Waiting, Running, Launching, Terminating).
the Queue column shows the job's queue

-f gives full information about the job, --array prints information for a whole array

You can use SQL syntax for advanced queries, example :

oarstat --sql "job_user='rmichela' and state='Terminated'"

When will my job be executed ?

oarstat -fj jobid | grep scheduledStart gives an estimation on when your job will be started

How can I get the stderr or stdout of my job during its execution ?

oarpeek jobid shows the stdout of jobid and oarpeek -e jobid shows the stderr.

How can I cancel a job ?

oardel jobid cancels job jobid.

How to know my Karma priorities ?

To see the Karma associated to one of your currently running jobs :

use oarstat -f -j jobid | grep Karma
or use Monika and click on jobid to view the job details

This gives your Karma for this job's queue at the time of the job submission.

If you want more details, the command oarstat -u login --accounting "YYYY-MM-DD, yyyy-mm-dd" shows your resource consumption between two dates. The indicated Karma is the one of your last submitted job. To see the details of your resource consumption for a given queue use oarstat -u login --sql "queue_name = 'queue' " --accounting "YYYY-MM-DD, yyyy-mm-dd"

To see your time window used for Karma calculation use :

yyyy-mm-dd = tomorrow
YYYY-MM-DD = ( yyyy-mm-dd - 30 days )
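
The two dates can be computed rather than typed by hand. A minimal sketch, assuming GNU date (option -d) is available; karma_window is a hypothetical helper and the script only prints the oarstat command to run.

```shell
#!/bin/bash
# Sketch: compute the accounting window described above with GNU date.
# karma_window takes the end date (yyyy-mm-dd) and prints "start, end".
karma_window() {
    local end=$1
    local start
    # -u keeps this a pure calendar computation, unaffected by DST changes
    start=$(date -u -d "$end 30 days ago" +%F)
    echo "$start, $end"
}

tomorrow=$(date -d tomorrow +%F)
echo "oarstat -u ${USER:-login} --accounting \"$(karma_window "$tomorrow")\""
```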

Software

How to use an environment module in a job ?

To use an environment module module_name in a batch job, add the following lines in your submission script (the script used in oarsub MyScript) :

# Submission script MyScript - excerpt
source /etc/profile.d/modules.sh
module load module_name
# Commands using the module are after loading the module

Typing module load module_name on a frontend node or in an interactive job session sets the environment module for this session only (not for submitted jobs).

How to run an OpenMPI application?

The mpirun binary included in openmpi runs the application using the resources reserved by the job :

Submission script for OpenMPI : monAppliOpenMPI.sh

The openmpi 2.0.0 version installed on nef is patched to automatically discover the resources of your job, so you don't have to specify a machinefile.

#!/bin/bash
# File : monAppliOpenMPI.sh
#OAR -l /nodes=3/core=1
source /etc/profile.d/modules.sh
module load mpi/openmpi-2.0.0-gcc
mpirun --prefix $MPI_HOME  monAppliOpenMPI

In this case, mpirun will start the MPI application on 3 nodes with a single core per node.

If you are using the main openmpi module (mpi/openmpi-x86_64) you have to add parameters manually :

module load mpi/openmpi-x86_64
mpirun -mca btl_openib_pkey 0x8108 -mca plm_rsh_agent oarsh --prefix $MPI_HOME  -machinefile $OAR_NODEFILE monAppliOpenMPI

How to run an Intel MPI application?

The Intel compiler and MPI implementation are installed on nef. To run an MPI job:

#!/bin/bash
#OAR -l /nodes=3/core=1
source /etc/profile.d/modules.sh
module load mpi/intel64-5.1.1.109   
mpirun -machinefile $OAR_NODEFILE monAppliIntelMPI

How can I use BLAS (ATLAS, OPENBLAS ...) ?

The recommended version is Openblas (atlas or netlib blas are much slower) or the MKL from Intel. Several versions of openblas are available: sequential (-l openblas64), pthread (-l openblasp64) or openmp (-l openblaso64)

For example to use the sequential version of openblas (recommended if your application is already multithreaded/parallel):

gcc -I/usr/include/openblas myblas.c -l openblas64

How can I use FreeFem++ ?

A more complete version than the default system version of FreeFem++ is available :

module load mpi/openmpi-x86_64
module load freefem++/3.62
# example program
mpirun -mca btl_openib_pkey 0x8108 -mca plm_rsh_agent oarsh --prefix $MPI_HOME -machinefile $OAR_NODE_FILE -np 4 $FREEFEM_PATH/bin/FreeFem++-mpi  $FREEFEM_PATH/share/freefem++/3.62/examples++-mpi/testsolver_MUMPS.edp

How to use gcc GPU offloading for better performance ?

gcc compiler supports OpenACC and OpenMP offloading to Nvidia GPUs (NVPTX targets).

PGI GPU offloading performance is often better than gcc GPU offloading, by up to a factor of 10.

For OpenACC :

extend your code for OpenACC support with #pragma acc clauses
compile with offloading support :

module load gcc-nvptx/9.2.0
g++ -fopenacc -fopt-info-optimized-omp -foffload="-O3" your_compile_commands_and_options

For OpenMP :

extend your code for OpenMP offloading support with #pragma omp target clauses
compile with offloading support :

module load gcc-nvptx/9.2.0
g++ -fopenmp your_compile_commands_and_options

Then for OpenACC and OpenMP :

run on a GPU node with GPU memory ECC enabled (important) :

oarsub -p "gpu= 'YES' and gpuecc='YES' and gpucapability>='5.0'" -I
module load gcc-nvptx/9.2.0
your_test_code

How to use PGI GPU offloading for better performance ?

PGI compiler supports OpenACC offloading to Nvidia GPUs (NVPTX targets) :

extend your code for OpenACC support with #pragma acc clauses
compile with offloading support :

module load pgi/19.10
pg++ -acc -Minfo=all your_compile_commands_and_options

run on a GPU node with GPU memory ECC enabled (important) :

oarsub -p "gpu= 'YES' and gpuecc='YES' and gpucapability>='5.0'" -I
module load pgi/19.10
your_test_code

How to use AVX for better performance ?

Vector instructions on recent Intel CPUs increase performance by executing each instruction on wider data operands (256 bits for AVX/AVX2 and 512 bits for AVX-512, versus 64/128 bits for the base instruction set or previous CPUs).

To use a vector instruction acceleration function you need :

hardware support from the node (eg: check with lscpu | tr ' ' '\n' | grep avx)
support from the libraries used by your code (eg: OpenBlas 0.2 supports up to AVX2, Intel MKL supports up to AVX-512)
support from your compiler (eg: gcc >= 4.6 for AVX2, gcc >= 4.9 for AVX-512)
support from your code (eg: -mfma -mavx512f -mavx512cd compiler options for automatic AVX-512 vectorization by gcc for Skylake nodes, more info here)
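
The hardware check can be wrapped in a small helper. This is a sketch under assumptions: widest_avx is a hypothetical function, and it relies on the flag names avx, avx2 and avx512f that Linux exposes on the flags line of /proc/cpuinfo on x86 nodes.

```shell
#!/bin/bash
# Sketch: report the widest AVX family found in a CPU flags string.
widest_avx() {
    local flags=" $1 "      # pad with spaces to match whole words only
    case "$flags" in
        *" avx512f "*) echo avx512 ;;
        *" avx2 "*)    echo avx2 ;;
        *" avx "*)     echo avx ;;
        *)             echo none ;;
    esac
}

# On a Linux node, use the flags of the first CPU listed
if [ -r /proc/cpuinfo ]; then
    widest_avx "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
fi
```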

Vector instruction sets reduce backward portability, eg : a code compiled with AVX-512 instructions will fail on a non AVX-512 capable node (Illegal instruction).

Performance tradeoffs for vectorization include :

a newer vector extension usually means better performance for vector computation oriented workloads (perf AVX-512 > perf AVX/AVX2 > perf legacy)
however exceptions exist, such as the Xeon Silver C6420 on NEF where AVX2 gives optimal performance due to hardware capabilities (2 AVX2 FMA, 1 AVX-512 FMA)
a rough gain estimate in a typical scenario when doubling the instruction size (legacy to AVX/AVX2, AVX/AVX2 to AVX-512) would be ~50% rather than 100%, due to reduced CPU frequency and hardware bottlenecks

How can I run a docker container ?

Docker is not available as it requires (in)direct user root privileges, but singularity is the alternative container technology proposed.

How can I run a singularity container ?

Singularity is a container technology which can import a docker image and adds native GPU support.

Please read this blog post with quickstart guidelines for using singularity on Nef or the full singularity documentation.

Example : download a container image from docker hub (docker://). Image is newest (Ubuntu) tensorflow with GPU support (tensorflow/tensorflow:latest-gpu). Convert to singularity image and save to file (pull). Launch singularity image from saved file with interactive console (shell), requesting GPU support in singularity (--nv) and /data filesystem availability (-B /data) :

module load singularity/2.5.2
singularity pull docker://tensorflow/tensorflow:latest-gpu
singularity shell -B /data --nv ./tensorflow-latest-gpu.simg

You can use pre-built containers from several registries : the docker hub (singularity pull docker://), the singularity hub (singularity pull shub://), Nvidia NGC, etc. NGC provides containers built and optimized by Nvidia for Nvidia GPUs with usage guidelines for singularity. To use NGC :

follow the NGC getting started guide for signup and API key generation (credentials to access containers)
keep your API key safe (eg : save to ~/.ssh/ngcapikey and chmod go-rwx ~/.ssh/ngcapikey)
configure your environment for NGC access and download a container (example : tensorflow 19.01 for python3)

module load singularity/2.5.2
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=$(cat ~/.ssh/ngcapikey)
# one time : download the container
singularity pull docker://nvcr.io/nvidia/tensorflow:19.01-py3
# each run : launch the downloaded container
singularity shell -B /data --nv ./tensorflow-19.01-py3.simg

On your laptop with root privileges you can also build your own singularity container files, copy them to Nef and run them. If converting your own docker container to singularity container file on your laptop, you can :

either create a local docker registry on your laptop, pull from that registry
or use docker2singularity

Building containers on a public container registry is technically possible but raises privacy issues for non-public data or software.

On your laptop you can also convert back singularity containers to docker, eg with singularity2docker.

How can i install a python package with pip ?

You can install a python package with pip specifying --user option (installation for your account only, no admin privileges are required) :

pip install --user package

How can i use a specific python environment ?

You can create a python virtual environment matching your needs and work into it.

Python virtual environments are useful for choosing default python version (eg: python3 by default), easily installing additional python packages per project, using reproducible python environments, etc. Moreover, some python features from the system default installation are broken : problem is solved by using a virtual environment.

Create a python2 virtual environment named virt_py2 :

virtualenv virt_py2 # use : -p /path/to/python for a specific python version
cd virt_py2
source ./bin/activate 
# now in python2 virtual environment
# eg : upgrade pip to the last version
pip install -U pip
# now install packagename for the virtual environment
pip install packagename
#
# leave the created python2 virtual environment
deactivate
# re-use the created python2 virtual environment :
cd ~/virt_py2
source ./bin/activate
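Before installing packages, it can be useful to confirm that a virtual environment is really active, since pip otherwise targets the system or user installation. A minimal sketch, relying on the VIRTUAL_ENV variable exported by the activate script (the function name in_virtualenv is hypothetical):

```shell
# Hypothetical helper: succeed only if a virtualenv is active.
# The activate script exports VIRTUAL_ENV; deactivate unsets it.
in_virtualenv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if in_virtualenv; then
    echo "active virtual environment : $VIRTUAL_ENV"
else
    echo "no virtual environment active"
fi
```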

Create a python3 virtual environment named virt_py3 :

# need to update virtualenv for proper python3 support
pip install -U --user virtualenv
# nota : currently need to specify 3.6 as obsolete 3.4 is the system default
virtualenv -p python3.6 virt_py3
cd virt_py3
source ./bin/activate 
# now in python3 virtual environment
# useful in some cases : pip install --upgrade pip
#
# now install packagename for the virtual environment
pip3 install packagename

Create a python3 virtual environment without upgrading virtualenv for other software : nested virtualenv

# python2 virtualenv for upgrading virtualenv
virtualenv virt_virtualenv
cd virt_virtualenv
source ./bin/activate
pip install -U virtualenv
source bin/activate # update virtualenv version used
# python3 nested environment
virtualenv -p python3.6 virt_py3
cd virt_py3
source ./bin/activate 
#
# leave the created nested python3 virtual environments
deactivate ; cd .. ; deactivate
#
# re-use the created python3 nested virtual environment
cd ~/virt_virtualenv
source ./bin/activate
cd virt_py3
source ./bin/activate

How can i install a package and use a specific environment with conda ?

Conda is an alternative tool to pip/virtualenv for installing your own packages and managing virtual environments. Conda is not limited to python packages and includes environment exporting/sharing.

Create and use a conda virtual environment named virt_conda :

module load conda/5.0.1-python2.7
# create and use a conda virtual environment
conda create --name virt_conda
source activate virt_conda
# install a package in virtual environment
conda install package
# leave the conda virtual environment
source deactivate virt_conda

Create and use a conda virtual environment named virt_conda_py3 using python 3.6 by default :

module load conda/5.0.1-python3.6
conda create --name virt_conda_py3 python=3.6
source activate virt_conda_py3

How can i use caffe ?

First you have to use a node with a GPU (it should be much faster with a GPU), for example:

oarsub -I -p "gpu='YES' and gpucapability>='5.0'" -l /gpunum=1

Method 1 : Then you have to load the cuda and caffe modules:

source /etc/profile.d/modules.sh
module load cuda/10.0
module load cudnn/7.4-cuda-10.0
module load caffe/0.17-cuda-10.0
$CAFFE_HOME/build/tools/caffe

For importing caffe in python you need to install additional packages (numpy, scikit-image, protobuf) eg in a virtual environment.

Method 2 : Alternatively you can use a container based distribution of caffe. Please read the guidelines to setup your singularity environment and setup your NGC environment if needed.

Example : NVidia caffe container from NGC

module load singularity/2.5.2
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=$(cat ~/.ssh/ngcapikey)
# one time : download the container
singularity pull docker://nvcr.io/nvidia/caffe:19.01-py2
# each run : launch the downloaded container
singularity shell -B /data --nv ./caffe-19.01-py2.simg

How can i use torch ?

Method 1 : To use the system pre-installed version of torch, load the torch module :

module load torch/7
# launch the torch interactive session, etc.
th

You can install your own additional packages in your homedir to extend the system installation of torch :

luarocks --tree=~/.luarocks install packagename

Method 2 : Alternatively, if the system pre-installed version of torch does not match your customization needs, you can install your own torch version as explained in the torch installation process, except you need to ignore the install-deps step.

Method 3 : Alternatively you can use a container based distribution of torch. Please read the guidelines to setup your singularity environment and setup your NGC environment if needed.

Example : torch container from NGC

module load singularity/2.5.2
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=$(cat ~/.ssh/ngcapikey)
# one time : download the container
singularity pull docker://nvcr.io/nvidia/torch:18.08-py2
# each run : launch the downloaded container
singularity shell -B /data --nv ./torch-18.08-py2.simg

How can i use pytorch ?

Method 1 : You can tailor pytorch conda installation guidelines.

Example of a conda based installation (tested for pytorch 1.3.1 with cuda 9.2 and torchvision 0.4.2) :

module load conda/5.0.1-python3.6
conda create --name virt_pytorch_conda python=3.6
source activate virt_pytorch_conda
conda install pytorch torchvision cudatoolkit=9.2 -c pytorch

Method 2 : Alternatively you can use a container based distribution of pytorch. Please read the guidelines to setup your singularity environment and setup your NGC environment if needed.

Example : container 19.02 (pytorch 1.1.0) from NGC

module load singularity/2.5.2
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=$(cat ~/.ssh/ngcapikey)
# one time : download the container
singularity pull docker://nvcr.io/nvidia/pytorch:19.02-py3
# each run : launch the downloaded container
singularity shell -B /data --nv ./pytorch-19.02-py3.simg

Caveat: container 19.02 is the newest version supporting nvidia driver version 410, which was installed on Nef when this FAQ was written. Container 19.02 provides pytorch 1.1.0, thus newer versions of pytorch are currently not supported on Nef with this method.

Method 3 : Alternatively you can use a pytorch 1.4.0 version compiled for nef, built with GPU support, recent CPU support (avx-512), python 3.6, cuda 9.2

module load conda/5.0.1-python3.6
conda create --name virt_pytorch
source activate virt_pytorch
module load cuda/9.2
module load cudnn/7.1-cuda-9.2
module load gcc/7.3.0
module load mpi/openmpi-2.0.0-gcc
module load pytorch/1.4.0

Method 4 : Alternatively you can build your own pytorch version from sources by tuning the pytorch install from sources documentation.

Example guidelines for building 1.4.0 with cuda 9.2

module load conda/5.0.1-python3.6
conda create --name virt_pytorch_source
source activate virt_pytorch_source
module load cuda/9.2
module load cudnn/7.1-cuda-9.2
module load gcc/7.3.0
module load cmake/3.10.1
module load mpi/openmpi-2.0.0-gcc
conda install -c pytorch magma-cuda90
git clone --recursive https://github.com/pytorch/pytorch
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
cd pytorch
export CMAKE_LIBRARY_PATH=$CUDNN_LIB_DI
python setup.py install --prefix=/path/to/my/install/dir
# then export PYTHONPATH=/path/to/my/install/dir/lib/python3.6/site-packages

How can i use tensorflow?

Method 1 : A tensorflow version is available on nef, built from sources with GPU support, recent CPU support (eg avx2) and python3. To set it up, run the following on a node with a recent GPU and CPU support (oarsub -p "gpucapability >= '5.0'") for tensorflow 1.10 with cuda 9.2 :

pip install -U --user virtualenv
virtualenv -p python3.6 virt_tf
cd virt_tf
source ./bin/activate
module load cuda/9.2
module load cudnn/7.1-cuda-9.2
module load tensorflow/1.10.1-python3-cuda9.2
pip install numpy

Method 2 : Alternatively you can install a google-built version of tensorflow. Older Nef GPUs may not be supported depending on the tensorflow version (eg: 1.4 requires -p "gpucapability >= '5.0'"). Tensorflow may not use all GPU and CPU hardware capabilities depending on the nodes, with a performance impact (eg: recent CPU nodes capabilities such as sse3/4,avx,avx2,fma are not used by tensorflow 1.0.1).

Example for tensorflow 2.0.0 :

pip install -U --user virtualenv
virtualenv -p python3.6 virt_tf
cd virt_tf
source ./bin/activate
module load cuda/10.0
module load cudnn/7.4-cuda-10.0
#
pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl

Method 3 : Alternatively you can use a container based distribution of tensorflow. Please read the guidelines to setup your singularity environment and setup your NGC account if needed ; they contain examples for tensorflow containers from docker hub and NGC.

How can i use theano ?

Method 1 : You can tailor the theano CentOS6 installation instructions.

Example of a conda based installation :

module load conda/5.0.1-python2.7
# make it clean : conda remove --name virt_theano --all
conda create --name virt_theano
source activate virt_theano
conda install numpy scipy mkl nose sphinx pydot-ng
module load cuda/10.0
module load cudnn/7.4-cuda-10.0
conda install theano pygpu

Using this installation from a GPU node :

module load conda/5.0.1-python2.7
source activate virt_theano
module load cuda/10.0
module load cudnn/7.4-cuda-10.0
# create gpu_tutorial1.py from example on http://deeplearning.net/software/theano/tutorial/using_gpu.html#gpuarray-backend
# use .theanorc for permanent flags
export MKL_THREADING_LAYER=GNU
THEANO_FLAGS="device=cuda0,floatX=float32,dnn.base_path=/misc/opt/cudnn/7.4-cuda-10.0" python gpu_tutorial1.py

Method 2 : Alternatively you can use a container based distribution of theano. Please read the guidelines to setup your singularity environment and setup your NGC environment if needed.

Example : theano container from NGC

module load singularity/2.5.2
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=$(cat ~/.ssh/ngcapikey)
# one time : download the container
singularity pull docker://nvcr.io/nvidia/theano:18.08
# each run : launch the downloaded container
singularity shell -B /data --nv ./theano-18.08.simg
# run an example in the container
export MKL_THREADING_LAYER=GNU
THEANO_FLAGS="device=cuda0,floatX=float32" python gpu_tutorial1.py

How can i use spark ?

Let's say you want to use spark on 4 nodes :

oarsub -I -l /nodes=4,walltime=3:0:0

This will reserve 4 nodes and start a shell on the first one (say nef107)

Then start the master:

./sbin/start-master.sh

Then you can start the slaves on three other nodes using oarsh ( the server URL in this case is spark://nef107.inria.fr:7077 ), like this:

for i in `uniq $OAR_NODEFILE | grep -v nef107`; do  
oarsh $i $HOME/spark-1.6.0-bin-hadoop2.6/sbin/start-slave.sh spark://nef107.inria.fr:7077 ; done
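The loop above relies on uniq, which only merges adjacent duplicate lines, and on a substring match for the master name. A slightly more defensive sketch follows; the function name spark_worker_nodes is hypothetical, and it assumes the node file lists hostnames in the same form as the master name you pass in.

```shell
# Hypothetical helper: print each distinct node of a node file once,
# excluding the node chosen as Spark master.
spark_worker_nodes() {
    local nodefile=$1 master=$2
    # sort -u removes duplicates even when they are not adjacent;
    # grep -F -x -v drops the exact master hostname only
    sort -u "$nodefile" | grep -F -x -v -- "$master"
}

# Usage sketch inside an OAR job (paths as in the example above, not run here):
# master=nef107.inria.fr
# for node in $(spark_worker_nodes "$OAR_NODEFILE" "$master"); do
#     oarsh "$node" "$HOME/spark-1.6.0-bin-hadoop2.6/sbin/start-slave.sh" "spark://$master:7077"
# done
```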

Then you can use spark, for ex. to run the sparkPi example:

export MASTER=spark://nef107.inria.fr:7077
./bin/run-example  SparkPi

To connect remotely to the WebUI you need to start Inria VPN (with vpn.inria.fr/all) or use SSH tunneling through nef-frontal.inria.fr

How can i run a graphical software on a node using my laptop's screen ?

Method 1 : if connecting from a client machine outside Inria Sophia network : setup automatic ssh tunneling to nef-devel/nef-devel2 on your laptop by adding in ~/.ssh/config :

Host   nef-devel*.inria.fr
       ProxyCommand    ssh -q nef-frontal.inria.fr nc %h %p

Then use the virtual desktop available on nef-devel/nef-devel2 for each user login :

prerequisite on client laptop : a vncviewer client software
if your vncviewer supports -via option (eg tigervnc) : connect to the virtual desktop dedicated for your login :

[user@laptop $] vncviewer -via nef-devel.inria.fr vnc-login:0

if your vncviewer does not support -via option : setup a ssh tunnel first

[user@laptop $] ssh -N -L 5901:vnc-login:5900 nef-devel.inria.fr
[user@laptop $] vncviewer localhost:1

a basic X11 graphical desktop appears, right click to launch a text terminal
in terminal start an interactive job (eg oarsub -I) or connect to an existing job (eg oarsub -C jobid)
when job starts, launch a X11 graphical command (eg firefox)

Method 2 : an alternative (simplest) way to launch a graphical application :

prerequisite on client laptop : a X11 server (not native for Windows or Mac)
drawback : very slow thus only adapted to light applications
connect to a nef frontend with X11 forwarding enabled (eg ssh -X nef-devel.inria.fr)
start an interactive job (eg oarsub -I) or connect to an existing job (eg oarsub -C jobid)
launch a X11 graphical command (eg xterm)
nota : 3 steps in 1 with ssh -X -t nef-devel.inria.fr 'oarsub -C jobid '

Method 2bis : X11 tunneling using jobkey (same as tunneling a port, but LocalForward is not needed)

How can i tunnel a port from my laptop to a node ?

Tunneling a port from your laptop/workstation to a node can help connect to a server launched by your job on the node :

if the server listens only on localhost (eg for security reasons)
etc.

Example below describes forwarding of port 8080 from your client laptop to a node, and connecting to a web server :

launched on node by the user's job
listening on localhost:8080

Step 0 : (once before first use) add to ~/.ssh/config on your client laptop (replace 8080 with the port used by your job)

Host *.neforward
    ProxyCommand ssh nef-frontal.inria.fr -W "$(basename %h .neforward):6667"
    LocalForward 8080 127.0.0.1:8080
    User oar
    Port 6667
    IdentityFile ~/.ssh/jobkey
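The ProxyCommand above derives the real node name by stripping the .neforward pseudo-suffix from the host alias typed on the command line. The following sketch only illustrates that string manipulation; the node name nef012 is an example, not a value from your job.

```shell
# Illustration only: how the ProxyCommand turns the alias given to ssh
# into the target passed to "ssh -W" (node name is an example).
alias_host="nef012.neforward"
node=$(basename "$alias_host" .neforward)
echo "ssh nef-frontal.inria.fr -W ${node}:6667"
```

So ssh nef012.neforward connects through nef-frontal to port 6667 of nef012, and the LocalForward line then maps port 8080 of the laptop to port 8080 of that node.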

Step 1 : submit your job generating a job key

[user@nef-devel $] oarsub -k -e ~/.ssh/jobkey -I  # job can be interactive or batch

Step 2 : copy the job key on your client laptop

[user@laptop $] scp nef-frontal:~/.ssh/jobkey ~/.ssh/
[...]
Connect to OAR job jobid via the node node.inria.fr # note the node name
[user@node $]

Step 3 : launch tunnel (using node from step 2)

[user@laptop $] ssh node.neforward

Step 4 : use tunnel. In this example we access a web server on port 8080 on node

in a browser on your laptop, connect to http://localhost:8080

Another example using tunnel to use VirtualGL on GPU nodes can be found on this blog post.

How can i use DIGITS ?

A version of DIGITS is available on the cluster. You need to setup your environment before first use :

# load required modules
module load cuda/7.5
module load cudnn/5.1-cuda-7.5 
module load caffe/0.14 
module load torch/7
module load digits/6.1
# install required python packages in a virtual environment
virtualenv --system-site-packages virt_digits
cd virt_digits
source ./bin/activate
pip install scikit-image
pip install -r $DIGITS_ROOT/requirements.txt
pip install -U numpy

Then launch a server on a reserved node before each use :

# load required modules
module load cuda/7.5
module load cudnn/5.1-cuda-7.5 
module load caffe/0.14 
module load torch/7
module load digits/6.1
# enter virtual environment
cd virt_digits
source ./bin/activate
# launch a DIGITS dev server
cd $DIGITS_ROOT
./digits-devserver

Launch a browser on the reserved node and connect to http://localhost:5000 to connect to the DIGITS server.

The pre-installed version of DIGITS is not full featured (no Torch support, no data and visualization plugins). Alternatively you can install your own DIGITS version adapting the build documentation for a non-root install in a python virtualenv.

What are the Matlab licences available ?

Matlab community licenses from Inria Sophia can be used on the cluster. They are shared with all the site's desktops and laptops.

How to compile my Matlab program and run it ?

Matlab compilation produces an application (or a standalone package that includes the application and a Matlab runtime) from a Matlab program, using either Matlab GUI or mcc command line.

The application can then run using a Matlab runtime. Matlab runtimes are installed in /opt/matlab<version>_runtime. You need to use the same Matlab version for the compiler and the runtime (eg: use a 2017a runtime for a 2017a compiled program).

# compile myprogram.m using CLI mcc
[user@nef012 $] /opt/matlab2018a/bin/mcc -m ./myprogram.m
#
# run the wrapper created by the compiler for the application
[user@nef012 $] ./run_myprogram.sh /opt/matlab2018a_runtime myprogram_params

"Licensing error: -4,132" error message means compiler licence is currently used, retry later.

Caution : compiling with CLI mcc reserves the licence for the user for 30 minutes (linger time) ; it cannot be released sooner. This is not the case with GUI compilation.

More generally, Matlab runtime can also be downloaded from Mathworks, so a compiled Matlab program can be run and distributed to people and platforms that do not have access to Matlab licenses.

Some toolbox functions are not supported by Matlab compilation.

What are the best practices for Matlab jobs ?

Before running long Matlab jobs or many Matlab jobs using the same code over time (eg: parameter sweeping), compile your matlab program and run the compiled program. Running a Matlab compiled program does not require Matlab license so that license tokens remain available for development activity.

If your Matlab program cannot be compiled : when launching many Matlab jobs at the same time, please launch them on as few nodes as possible. Matlab uses a floating licence per {node,user} couple. Eg :

10 jobs for user foo on 10 different cores of nef012 node use 1 floating license,
1 job for user foo on each of nef01[0-9] nodes use 10 floating licenses.
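The counting rule above can be sketched as follows. This is only an illustration under an assumed input format (one node name per running Matlab job of a single user, one per line); the function name matlab_licenses_used is hypothetical and nothing here queries the license server.

```shell
# Hypothetical helper: given one node name per running Matlab job of a
# single user (one per line on stdin), print the number of floating
# licenses used, ie the number of distinct nodes.
matlab_licenses_used() {
    sort -u | grep -c .
}

# 10 jobs on the same node : 1 license
for i in 1 2 3 4 5 6 7 8 9 10; do echo nef012; done | matlab_licenses_used
# 1 job on each of 10 nodes : 10 licenses
for i in 0 1 2 3 4 5 6 7 8 9; do echo "nef01$i"; done | matlab_licenses_used
```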

OAR container jobs may be useful.

Example : make a long reservation of a full node and launch many short mono-core jobs

# one day reservation of a full node (/path/to/loop-script is an idle wait/loop script)
-bash-4.2$ oarsub -t container  -l /node=1,walltime=24 /path/to/loop-script
[...]
OAR_JOB_ID=3303953
[...]
# launch 200 matlab jobs on the reserved node
-bash-4.2$ oarsub --array 200 -t inner=3303953 -l /core=1,walltime=1 /path/to/matlab/job
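When scripting this pattern, the container job id can be captured from the oarsub output instead of being copied by hand. A minimal sketch, assuming oarsub prints a line of the form OAR_JOB_ID=<number> as in the example above (the function name extract_job_id is hypothetical):

```shell
# Hypothetical helper: print the job id found in oarsub output on stdin.
extract_job_id() {
    sed -n 's/^OAR_JOB_ID=\([0-9][0-9]*\)$/\1/p'
}

# Demo on output shaped like the example above
printf '[...]\nOAR_JOB_ID=3303953\n[...]\n' | extract_job_id

# Usage sketch on the cluster (not run here):
# container_id=$(oarsub -t container -l /node=1,walltime=24 /path/to/loop-script | extract_job_id)
# oarsub --array 200 -t inner="$container_id" -l /core=1,walltime=1 /path/to/matlab/job
```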

Troubleshooting

Why is my job rejected at submission ?

The job system may refuse a job submission due to the admission rules. An explicit error message is displayed ; if in doubt, contact the cluster admin team.

Most of the time it indicates that the requested resources are not available, which may be caused by a typo (eg -p "cluster='dell6220'" instead of -p "cluster='dellc6220'").

Job is also rejected if you submit a non-besteffort job to a dedicated node of another team. Add -t besteffort to your oarsub command to check this point.

Sometimes it may also be caused by some nodes being temporarily out of service. This may be verified by typing oarnodes -s to list all nodes in service.

Another cause may be the job requested more resources than the total resources existing on the cluster.

Why is my job still Waiting while other jobs go Running ?

Many possible (normal) explanations include :

other job may have higher priority : queue priority, user Karma
your job requests currently unavailable resources (eg : only dellc6220 nodes while the other job accepts any node type)
your job requests more resources than currently available and a lower priority job can be run before without delaying your job (best fit). Eg : you requested 4 nodes, only 2 are currently available, the 2 others will be available in 3 hours. A job requesting 2 nodes during at most 3 hours can be run before yours.
the other job made an advance reservation of resources
etc.

Why is my job still Waiting while there are some unused resources ?

Many possible (normal) explanations include :

you have reached maximum resource reservation per user at a given time and your job is not besteffort
resources are reserved for a higher priority job. Eg: a higher priority job requests 3 nodes, 2 are currently available, 1 will be available in 1 hour. Your job requests 1 node during 2 hours. Running your job would result in delaying a higher priority job.
resources are reserved by an advance reservation (same example as above).
etc.

I see several nodes in the StandBy state in Monika, are they available ?

Yes; it's because we have enabled the Energy Savings feature of OAR.

It means that when no jobs are waiting, OAR can decide to shut down nodes to save energy. As soon as a new job is queued, OAR automatically restarts some nodes if not enough nodes are alive. Usually, the nodes can boot in 2 minutes, so the job will wait at most a few minutes before starting.

Why did my job get killed ?

Your job can be killed by the scheduler in several ways; you can check what happens using oarstat -fj <JOBID>

Your script uses more memory than requested:

If your main process uses too much memory (see also How much memory is allocated to my job), it is killed by OAR; its state is 'Terminated' and it has received the kill signal (9)

   state = Terminated
   exit_code = 9 (0,9,0)
2017-02-14 14:35:53> SWITCH_INTO_TERMINATE_STATE:[bipbip 3321148] Ask to change the job state

One of the processes started by your script uses more memory than requested:

If you use a bash script to start your main process, and it uses too much memory, then the bulkiest process is killed by OAR, and the bash script ends with an exit code of 128+9 = 137 (if your script correctly handles and returns the error code). Its state is 'Terminated'

   state = Terminated
   exit_code = 35072 (137,0,0)
2017-02-14 14:35:53> SWITCH_INTO_TERMINATE_STATE:[bipbip 3321148] Ask to change the job state
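The two exit_code values above can be decoded by hand. The sketch below assumes the triple printed by oarstat is (exit status, signal number, core dump flag), which matches the standard Unix wait status layout and both examples; the function name decode_status is hypothetical.

```shell
# Sketch: split a raw wait status as shown by oarstat into
# (exit status, signal number, core dump flag).
decode_status() {
    local status=$1
    local exit_status=$(( (status >> 8) & 255 ))
    local signal=$(( status & 127 ))
    local core=$(( (status >> 7) & 1 ))
    echo "($exit_status,$signal,$core)"
}

decode_status 9       # main process killed by signal 9
decode_status 35072   # wrapper script exited with status 137 = 128 + 9
```

An exit status above 128 in the first field usually means the wrapper script reported that one of its children was killed by signal (status - 128).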

Your job has exceeded its walltime:

In this case, the state is Error and OAR tells you what happens (killed by root because of WALLTIME)

   state = Error
2017-02-14 15:01:34> SWITCH_INTO_ERROR_STATE:[bipbip 3321314] Ask to change the job state
2017-02-14 15:01:31> SEND_KILL_JOB:[Leon] Send the kill signal to oarexec on nef012.inria.fr for job 3321314
2017-02-14 15:01:30> WALLTIME:[sarko] Job [3321314] from 1487080849 with 15; current time=1487080890 (Elapsed)
2017-02-14 15:01:30> FRAG_JOB_REQUEST:User root requested to frag the job 3321314

Your besteffort job has been killed to start a regular job:

In this case, the state is Error and OAR tells you what happens (killed by root because of BESTEFFORT_KILL)

   state = Error
2017-02-14 16:01:50> SCHEDULER_PRIORITY_UPDATED_STOP:Scheduler priority for job 3321820 updated (network_address/resource_id)
2017-02-14 16:01:50> SWITCH_INTO_ERROR_STATE:[bipbip 3321820] Ask to change the job state
2017-02-14 16:01:47> SEND_KILL_JOB:[Leon] Send the kill signal to oarexec on nefgpu04.inria.fr for job 3321820
2017-02-14 16:01:46> FRAG_JOB_REQUEST:User root requested to frag the job 3321820
2017-02-14 16:01:46> BESTEFFORT_KILL:[MetaSched] kill the besteffort job 3321820

How can i clean my user environment ?

Failures may come from problems in your user environment (user specific customization and caches). Environment problems often appear after system or application upgrades. A good hint for a user environment problem : it does not occur under someone else's identity.

Cleaning your environment is very user and application specific. A few hints/recipes with classical problems :

if using a virtual environment, container, etc. : re-create from scratch
de-activate all your initializations (eg: mv ~/.bashrc ~/.bashrc.save for bash), logout, login again
- check LD_LIBRARY_PATH is empty (eg: echo $LD_LIBRARY_PATH) : session-long configuration is usually a bad idea unless you really know what you're doing
clear cache files/directories eg :
- mv ~/.local ~/.local.save (python, gnome, etc.)
- mv ~/.conda ~/.conda.save (conda cache)
- mv ~/.nv ~/.nv.save (cuda)
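The mv recipes above fail if the directory does not exist and overwrite nothing useful if a previous .save copy is already there. A small sketch that moves a directory aside under a timestamped name, so that it can be restored if it was not the culprit (the function name set_aside is hypothetical):

```shell
# Hypothetical helper: move a cache/config directory aside instead of
# deleting it, using a timestamped backup name.
set_aside() {
    local path=$1
    local backup="$path.save.$(date +%Y%m%d-%H%M%S)"
    if [ -e "$path" ]; then
        mv -- "$path" "$backup" && echo "moved $path to $backup"
    else
        echo "nothing to do for $path"
    fi
}

# Usage sketch (not run here):
# for d in ~/.local ~/.conda ~/.nv; do set_aside "$d"; done
```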

Disks and filesystems

How can i access files on the cluster using sshfs ?

With sshfs you can access files on the cluster as a mounted filesystem on your client laptop/desktop running Linux or MacOs.

On Linux, you should first install the fuse-sshfs package. On MacOs, install OSXFUSE and SSHFS from http://osxfuse.github.io/.

Example for a machine connected on INRIA-sophia network:

Linux:

mylaptop$ mkdir -p /workspaces/nef
mylaptop$ sshfs -o transform_symlinks nef-devel2.inria.fr:/ /workspaces/nef

MacOs:

mylaptop$ mkdir -p /Volumes/nef
mylaptop$ sshfs -o transform_symlinks nef-devel2.inria.fr:/ /Volumes/nef

Mounting / with the transform_symlinks option gives access to all nef storage with a single mount and properly handles any symbolic links you may encounter (ex: a symbolic link in your nef homedir pointing to /data/...).

It is better not to do such a network mount on a subdirectory of your homedir, to prevent your session from freezing in case of network problem or when you disconnect your laptop.

If you want to make shortcuts using symbolic links in your homedir, it is better to do them in a subdirectory. For example (on Linux):

mkdir $HOME/nef.d
ln -s /workspaces/nef/home/LOGIN $HOME/nef.d/myhome
ln -s /workspaces/nef/data/TEAM/user/LOGIN $HOME/nef.d/mydata
ln -s /workspaces/nef/data/TEAM $HOME/nef.d/teamdata

where LOGIN is your nef login name, TEAM the name of your team.

You unmount this filesystem with:

Linux:

mylaptop$ fusermount -u /workspaces/nef

MacOs:

mylaptop$ umount -f /Volumes/nef

For a machine outside of Inria network :

configure ssh tunneling through nef-frontal
or mount on nef-frontal instead of nef-devel2 (lower performance)

What to do with my data before my account expires ?

When your user account expires, all files in /home/user and /data/team/user/user are removed after a grace delay (currently : 8 months).

Thus before account expiration you should sort your data :

ensure retention of data still needed by the team :
- move data to /data/team/share
- tag data to long term storage
- set access rights for other team members if needed
if possible, delete un-needed user data (in /home/user, /data/team/user/user, local nodes storage, etc.) in anticipation of automatic removal

See also section on disk space management.

How do i tag files on /data to the scratch or long term storage ?

Files in /data belong either to long term storage or to scratch storage. This is based on the Unix group of files, not on the path hierarchy.

Use the standard Unix file group commands and rules eg :

chgrp scratch /path/to/file : tag /path/to/file to the scratch group (so /path/to/file is now on scratch storage)
chgrp my_team_group /path/to/file : tag /path/to/file to the my_team_group group (so /path/to/file is now on long term storage of my team)
chmod g+s /path/to/dir : files created under /path/to/dir from now on inherit same Unix group as /path/to/dir
sg scratch : current process now uses scratch as effective group id, so files are now created belonging to scratch group by default (unless a path inheritance rule takes precedence)
etc.
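Since the storage class depends only on the Unix group, listing what is currently tagged to a group is a matter of filtering on group ownership. A minimal sketch, where the function name files_in_group is hypothetical and the demo uses a temporary directory with your primary group rather than a cluster group:

```shell
# Hypothetical helper: list regular files under a directory that are
# tagged to a given Unix group (eg scratch, or your team group).
files_in_group() {
    local dir=$1 group=$2
    find "$dir" -type f -group "$group"
}

# Self-contained demo in a temporary directory, using the current
# user's primary group instead of a cluster group.
demo_dir=$(mktemp -d)
touch "$demo_dir/result.dat"
chgrp "$(id -gn)" "$demo_dir/result.dat"   # tag the file to the group
files_in_group "$demo_dir" "$(id -gn)"
rm -rf "$demo_dir"

# Usage sketch on the cluster (not run here):
# files_in_group /data/TEAM/user/LOGIN scratch
```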

So files can be moved from one storage to another without copying them (quicker with TB of data).

Why do the /data quota usages for users and groups not match ?

The group numbers indicate the long term storage quota usage by all the members of a group.
The user numbers indicate the total disk usage of a user, long term storage plus scratch storage.

There is currently no simple way to get the long term storage quota usage by a single user.

Example :

semir group is currently using 128.810 GiB out of its 1024 GiB long term storage usage quota which is the default quota for a team.
user mvesin from group semir currently uses 210 GiB (mix of long term storage and scratch storage).

nef-devel2$ sudo nef-getquota -g semir
Group quotas under /data, restricted to the given groups (sizes in GiB):
  Group               Used       Hard   Declared 
  semir            128.810   1024.000   1024.000 $default_data_quota

Disk usage by user under /data for the semir group (sizes in GiB):
  User                Used
  mvesin           210.000
  fm                 44.100

What is the performance of the different filesystems ?

This is a complex question that needs to be considered case by case :

depends on the type of access (read/write/mix, long sequential/short random chunks, etc.)
for /home and /data : overall performance is shared between jobs on all nodes of the cluster
for /tmp and /dev/shm : overall performance is shared between jobs on the node
etc.

Results of a test for big sequential write access (with caching disabled) :

~200 MB/s for /home access (shared between jobs on all nodes)
~9000 MB/s for /data access (shared between jobs on all nodes)
- 1x access : up to ~1100 MB/s ; 5x access on 1x node : up to ~2500 MB/s ; 25x access shared on 5x nodes : up to ~6000 MB/s ; etc.
~100-200 MB/s for /tmp access (shared between jobs on this node)
~400-500 MB/s for /local on a SATA SSD disk (shared between jobs on this node)
~1000 MB/s for /local on a SATA SSD RAID-{0,5} disk array (shared between jobs on this node)
~2000-3000 MB/s for /dev/shm access (shared between jobs on this node)

Why shouldn't i use many small files ?

Using small files (aka ZOTfiles, zillions of tiny files) consumes more filesystem metadata (finite) resources. Metadata exhaustion can prevent new file creation even with a filesystem not full. Using many small files or reading/writing small chunks of data also reduces file access performance for yourself and other users (lower data and metadata access efficiency).

Good practices for /data :

avoid using many small files (rule of thumb : try using files over 1MB when using more than 100k files and links)
avoid reading/writing many small chunks of data (rule of thumb : when doing intensive read/write try grouping requests by chunks over 1MB)
do not create too many entries (files, directories, links) in the same directory (rule of thumb : 1-5k files per directory maximum).
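To check the last rule on your own data, you can count the direct entries of each directory. The sketch below is an illustration only: the function name crowded_dirs and the default threshold of 5000 are assumptions, and since walking a large tree is itself metadata intensive, it is best run sparingly and on a limited subtree.

```shell
# Hypothetical helper: print "count directory" for each directory under
# the given root holding more than a threshold of direct entries.
crowded_dirs() {
    local root=$1 max=${2:-5000} dir n
    find "$root" -type d | while IFS= read -r dir; do
        # count direct entries (files, directories, links) of this directory
        n=$(( $(find "$dir" -mindepth 1 -maxdepth 1 | wc -l) ))
        if [ "$n" -gt "$max" ]; then
            echo "$n $dir"
        fi
    done
}

# Usage sketch on the cluster (not run here):
# crowded_dirs /data/TEAM/user/LOGIN/myproject 5000
```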

Example : check metadata usage for user mylogin:

$ sudo beegfs-ctl --getquota --uid mylogin
      user/group    ||           size          ||    chunk files    
    name     |  id  ||    used    |    hard    ||  used   |  hard   
--------------|------||------------|------------||---------|---------
     mylogin |  1234||   2.49 TB  |      0 Byte|| 37525834|        0

User mylogin uses 37525834 chunk files (metadata entries) for 2.49TB, thus an average of 71KB per chunk file (average file size is slightly over this average). Rule of thumb : mylogin should create files at least 15 times bigger on average (~ 1MB average).
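The average above can be recomputed from the two numbers reported by beegfs-ctl. This sketch assumes the reported size uses binary units (1 TB read as 1024^3 KiB), which is what makes the 71KB figure come out; the function name avg_kib_per_chunk is hypothetical.

```shell
# Sketch: average size per chunk file in KiB, given the used size in TiB
# and the number of chunk files (values from the example above).
avg_kib_per_chunk() {
    awk -v tib="$1" -v chunks="$2" \
        'BEGIN { printf "%.0f\n", tib * 1024 * 1024 * 1024 / chunks }'
}

avg_kib_per_chunk 2.49 37525834
```

With these values the result is 71, so reaching the ~1MB target means files roughly 15 times bigger on average.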

Draft squashfs/mountimg

How can I use many small files efficiently?

You can gain in performance and minimize the pressure on /data in the following cases:

case1 your jobs are only reading under the directories where your zotfiles reside
case2 your jobs are reading your zotfiles but add only new files or directories in them
case3 your jobs generate zotfiles, but they will be accessed only for reading or adding new files afterwards

For case1:

convert your zotfiles directories to squashfs images
in your jobs:
- mount those images using sudo mountimg
- use those mounted directories for processing

For case2:

convert your zotfiles directories to squashfs images
in your jobs:
- mount those images using sudo mountimg
- use those mounted directories for processing but generate new files on the local filesystems of the node (ex: /tmp)
- unmount the images with sudo umountimg
- add the new files to the images with mksquashfs-no-compression

For case3:

in your jobs:
- generate your zotfiles on the local filesystems of the node (ex: /tmp)
- convert them to squashfs images under /data with mksquashfs-no-compression
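
The case3 steps could look like the following at the end of a job script. This is only a sketch: make_image and pack_results are illustrative helper names, and mksquashfs-no-compression is called with a source directory and a destination image as in the other examples of this page. If the image already exists, the files are appended to it (see the mksquashfs hints).

```shell
# Wrapper around the site-provided tool:
#   mksquashfs-no-compression <source directory> <image>
make_image() {
    mksquashfs-no-compression "$1" "$2"
}

# Pack a node-local directory into a squashfs image, then remove the local copy.
# $1: local directory (e.g. under /tmp), $2: destination image (e.g. under /data)
pack_results() {
    src=$1
    image=$2
    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    # Keep the local files if the image could not be written.
    make_image "$src" "$image" || return 1
    rm -rf -- "$src"
}

# Hypothetical use in a job script:
#   workdir=$(mktemp -d /tmp/myjob.XXXXXX)
#   ... generate the many small files under "$workdir" ...
#   pack_results "$workdir" /data/.../results.squashfs
```

Removing the local directory only after the image was written successfully avoids losing results when /data is unavailable or over quota.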

Creating squashfs images

You can convert your zotfiles on nef-devel or nef-devel2.

To convert your zotfiles to images, choose first the granularity appropriate to your case.

sudo mountimg currently allows mounting at most 4000 images on a node.

If you have for example a really big directory /data/.../DDD/DD/ containing hundreds of sub-directories D1 D2 ... DN, you may prefer to make one image per such sub-directory.

Example (in bash):

 cd /data/.../DDD || exit
 # Build a separate directory for the images and the mountpoints
 mkdir DD-img DD-mnt
 cd DD || exit
 for i in D*; do
   # Create the image (quotes protect names containing spaces)
   mksquashfs-no-compression "$i" "../DD-img/$i.squashfs"
   # Create the mountpoint for your future jobs
   mkdir "../DD-mnt/$i"
 done

mksquashfs-no-compression is a simple wrapper around mksquashfs that disables any kind of compression to focus on speed. Feel free to try mksquashfs directly with other options like -comp lzo to save disk space.

You can also use the convert-to-squashfs command to safely convert your directories to squashfs images. See the online help: convert-to-squashfs --help

The commands sudo mountimg, mksquashfs-no-compression and convert-to-squashfs are provided by the fstools-sop RPM, which is also installed by default on the Fedora machines starting with Fedora-26.

Some mksquashfs hints:

If the destination image exists, the source files/directories will be added (appended) to the image.
- In addition, if a file/directory with the same name already exists in the image, the new file/directory will be added with the name xxx_1, xxx_2, etc., where xxx is the original name.
If a single directory is specified (i.e. mksquashfs source output.squashfs), the squashfs filesystem will consist of that directory, with the top-level root directory corresponding to the source directory.
- Use the -keep-as-directory option to tell mksquashfs to keep the basename of the directory in its output.
If multiple source directories or files are specified, mksquashfs will merge the specified sources into a single filesystem, with the root directory containing each of the source files/directories. The name of each directory entry will be the basename of the source path. If more than one source entry maps to the same name, the conflicting entries are named xxx_1, xxx_2, etc., where xxx is the original name.

Mounting squashfs images

To mount one image, simply call: sudo mountimg <image path> <directory>

To unmount: sudo umountimg <directory>

Example: mount every squashfs image of /data/.../DDD/DD-img/ on the corresponding sub-directory under /data/.../DDD/DD-mnt/

 cd /data/.../DDD/DD-mnt || exit
 for i in *; do
   # Stop at the first image that fails to mount
   sudo mountimg "../DD-img/$i.squashfs" "$i" || exit
 done

In an oar job, a mount done with mountimg will be automatically unmounted when the job terminates.

Such a mount can also be shared by more than one oar job and by more than one user. In this case, the unmount will be done when all the jobs have terminated. Beware that every job has to perform this mount itself, so that it is registered in the list of processes needing it.

mountimg currently allows mounting at most 4000 images on a node.
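
A job script may want to check the number of images against this limit before mounting them. The following is a sketch under assumptions: mount_image and mount_all_images are illustrative helper names, sudo mountimg is called with an image path and a directory as documented above, and the check only counts the images of the given directory, not the mounts already present on the node.

```shell
# Wrapper around the site-provided command:
#   sudo mountimg <image path> <directory>
mount_image() {
    sudo mountimg "$1" "$2"
}

# Mount every *.squashfs image of $1 on a same-named sub-directory of $2.
# Refuses to start if there are more images than $3 (default: 4000).
mount_all_images() {
    img_dir=$1
    mnt_dir=$2
    limit=${3:-4000}
    count=0
    for img in "$img_dir"/*.squashfs; do
        if [ -f "$img" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -gt "$limit" ]; then
        echo "$count images exceed the limit of $limit mounts" >&2
        return 1
    fi
    for img in "$img_dir"/*.squashfs; do
        [ -f "$img" ] || continue    # the pattern matched no image
        name=$(basename "$img" .squashfs)
        mkdir -p "$mnt_dir/$name" || return 1
        mount_image "$img" "$mnt_dir/$name" || return 1
    done
}

# Hypothetical use:
#   mount_all_images /data/.../DDD/DD-img /data/.../DDD/DD-mnt
```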

What size can i use on the RAM filesystem ?

One limit is that the RAM filesystem (/dev/shm) space used by a job can be at most the RAM allocated to the job on this node (it is part of the resources allocated to the job).

The other limit is that the system of each node is configured with a total limit for the RAM filesystem (around 50% of the node RAM).
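
The space currently available on the RAM filesystem of a node can be read with df. A minimal sketch, assuming a POSIX df and awk; avail_mib is an illustrative name. df reports the filesystem as configured on the node (the second limit above); it presumably does not reflect the per-job limit, which depends on the RAM allocated to the job.

```shell
# Print the space still available on a filesystem, in MiB.
# $1: path on the filesystem to inspect (default: /dev/shm)
avail_mib() {
    df -Pk "${1:-/dev/shm}" | awk 'NR == 2 { printf "%d\n", $4 / 1024 }'
}

# Hypothetical use inside a job:
#   avail_mib            # RAM filesystem
#   avail_mib /tmp       # node-local /tmp, for comparison
```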

Other

Guidelines for hardware support of multi GPU scaling ?

Multi GPU scaling of GPU computation relies on many factors (type of computation, optimization, framework, data) including GPU node hardware.

Important hardware elements include GPU-GPU and CPU-GPU interconnect technology, CPU and storage resources.

nvidia-smi topo -m describes GPU connections for the current node. A GPU-GPU connection may be direct via a PCIe switch or root, or indirect via PCIe plus CPU (and QPI/UPI).
node hardware is presented here
guidelines can help in choosing a filesystem. A local SSD disk may be a good option when available.

Dell T630 nodes have at most 2 GPU cards per PCIe root, thus P2P GPU communication can scale well up to 2 GPUs. In theory, PCIe 3.0 P2P interconnects cards at 16GB/s each way. A P2P data transfer test shows ~20GB/s (2x 10GB/s) effective total and ~6 us latency between a pair of cards on the same PCIe root.
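
The theoretical figure can be reproduced from the PCIe 3.0 link parameters: 8 GT/s per lane with 128b/130b encoding. The following worked computation assumes a x16 link, since the lane count is not stated above; pcie3_gbs is an illustrative name.

```shell
# Theoretical one-way bandwidth of a PCIe 3.0 link, in GB/s.
# $1: number of lanes (x16 is an assumption for the GPU slots)
pcie3_gbs() {
    # 8 GT/s per lane, 128 payload bits per 130 transferred bits, 8 bits per byte
    awk -v lanes="$1" 'BEGIN { printf "%.2f\n", 8 * lanes * 128 / 130 / 8 }'
}

pcie3_gbs 16    # prints 15.75
```

Under this assumption, the measured ~10GB/s each way is a bit less than two thirds of the theoretical rate.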

The Asus ESC8000 node has a single root PCIe topology, enabling better P2P GPU scaling up to 8 GPUs. A P2P data transfer test shows ~25GB/s total and ~6.5 us latency between any pair amongst the 8 GPUs. On the other hand, a job with a bottleneck on data transfers between CPU RAM/disks <=> GPU RAM may not benefit from PCIe single root and multi GPU.

node reservation example : oarsub -p "gpu='YES' and cluster='esc8kgpu'" -l /nodes=1 -I

Even in P2P, multi GPU data transfer is much slower than data transfer local to a GPU. Example : for a GTX1080 Ti, a device-local transfer test shows a ~350GB/s transfer rate and ~4 us latency.

Multi-node scaling introduces other potential bottlenecks : IB network (40Gb/s or 56 Gb/s), CPU

FAQ new config