FAQ new config

General

Who can have an account on the cluster ?

  • Inria users : nef is an Inria Sophia Antipolis - Méditerranée research center platform open to all people holding an Inria account, during the validity period of the account
  • Academic and industrial partners of Inria, under agreement.

For account application, extension or renewal, please follow the first steps procedure.


When does my cluster account expire ?

Type nef-user -l your_nef_login on nef-devel2 or nef-frontal. The Expire date is the first day the account will be deactivated.


What is OAR ?

OAR is a versatile resource and task manager (also called a batch scheduler) for HPC clusters, and other computing infrastructures (like distributed computing experimental testbeds where versatility is key).

OAR is the way you reserve resources (nodes, cores) on the cluster by submitting a job.


What are the most commonly used OAR commands ? (see official docs)

  • oarsub : to submit a job
  • oarstat : to see the state of the queues (Running/Waiting jobs)
  • oardel : to cancel a job
  • oarpeek : to show the stdout of a job while it is running
  • oarhold : to hold a job while it is in the Waiting state
  • oarresume : to resume jobs in the states Hold or Suspended


Can I use a web interface rather than the command line ?

Yes, connect to the Kali web portal : if you have an Inria account just Sign in/up with CAS ; if you have no Inria account use Sign in/up.

Warning : Kali is now only partly functional.

Job submission

How do I submit a job ?

Use the oarsub command. With OAR you can directly use a binary as a submission argument on the command line, or even an inline script. You can also create a submission script. The script includes the command line to execute and the resources needed for the job. Do not forget to use the -S option of oarsub if you want the OAR parameters in the script to be parsed and honored (oarsub -S ./myscript.sh).
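
For example (a minimal sketch : ./myprogram, ./myscript.sh and the requested resources are placeholders to adapt to your case) :

# run a binary directly, on 1 core during 30 minutes
oarsub -l /nodes=1/core=1,walltime=0:30:00 "./myprogram arg1 arg2"
# run a submission script, parsing its #OAR directives
oarsub -S ./myscript.sh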


How to choose the node type and properties ?

The cluster has several kinds of nodes.

To view all defined OAR properties :

  • graphical : connect to Monika and click on the node name.
  • command line : use oarnodes, example for nef085: oarnodes nef085.inria.fr

If you want all the cores from a single node :

oarsub -l /nodes=1

If you want 48 cores from any type and any number of nodes :

oarsub -l /core=48

In this case, the 48 cores can be spread over several nodes; your application must handle this case (using MPI or another framework) ! A multithreaded application won't be able to use all the reserved cores if they are spread over several nodes.

If you need to reserve a given amount of cores from a single node, use :

oarsub -l /nodes=1/core=2

If you want all the cores of 2 nodes from xeon nodes with more than 80GB RAM each during 10 hours:

oarsub -p "cputype='xeon' and mem > 80000" -l /nodes=2,walltime=10:00:00

If you want 96 cores as 12 cores on each of 8 xeon nodes:

oarsub -p "cputype='xeon'" -l /nodes=8/core=12

You can make more specific reservations using additional resource tags. This job reserves a total of 16 cores, as 8 cores from a single node on each of 2 different Infiniband network switches :

oarsub -l /ibswitch=2/node=1/core=8

Reserve either 6 cores during 1 hour or 3 cores during 2 hours (moldable jobs, with an either-or choice made by OAR) :

oarsub -l /core=6,walltime=1 -l /core=3,walltime=2

How do I reserve GPU resources ?

To reserve a single gpu, do:

oarsub -p "gpu='YES'" -l /gpunum=1

Several cores may be attached to a GPU, so, for example, on nefgpu05/06 you will get 3 cores and 1 gpu; on nefgpu03/04 you will get one or two cores and 1 gpu.

If you want more gpus on a single node, say 4:

oarsub -p "gpu='YES'" -l /nodes=1/gpunum=4

If you want all the gpus of a node during 4 hours :

oarsub -p "gpu='YES'" -l /nodes=1,walltime=4

If you reserve a single core (-l /nodes=1/core=1), you will NOT have exclusive access to the gpu attached to it.

Remember: to check the available gpus and monitor them, use nvidia-smi
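
For example, a minimal sketch of an interactive session on a single GPU (the 2 hours walltime is arbitrary) :

# reserve 1 gpu and the attached cores, interactively, during 2 hours
oarsub -I -p "gpu='YES'" -l /gpunum=1,walltime=2:0:0
# once logged on the node, check the assigned gpu
nvidia-smi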

What is the format of a submission script ?

The script should not only include the path to the program to execute, but also information about the needed resources (you can also specify resources using oarsub options on the command line). Simple example : helloWorld.sh

# Submission script for the helloWorld program
#
# Comments starting with #OAR are used by the resource manager if using "oarsub -S"
#
# The job reserves 8 nodes with one processor (core) per node,
# only on xeon nodes, job duration is less than 10min
#OAR -l /nodes=8/core=1,walltime=00:10:00
#OAR -p cputype='xeon'
#
# The job is submitted to the default queue
#OAR -q default
# 
# Path to the binary to run
./helloWorld

You can mix parameters in the submission script and on the command line, but take care about how they combine. In this example the -p on the command line takes precedence over the script, while the -l from the script and the command line are combined (moldable jobs when using multiple -l options) :

oarsub -p "cputype='opteron'" -l /nodes=4/core=2 -S ./helloWorld.sh


What are the available queues ?

The limits and parameters of the queues are listed below :

queue name    max user resources    max user running jobs    max duration (days)    priority    max user (hours*resources)
default       384                   -                        30                     10          21504
big           1024                  2                        30                     5           4096
besteffort    -                     -                        3                      0           -


This means all the jobs of a user running in the default queue at a given time can use at most 384 resources (eg 384 cores, or 192 cores with twice the default memory per core) with a cumulated reservation of 21504 hours*resources. Maximum walltime of each job is 30 days. Number of running jobs is not limited (thus can be up to 384).

In other words, at any given time a user's running jobs in the default queue can use a cumulated resource reservation of at most :

  • 32 cores during 28 days with the default memory per core ;
  • or 128 cores during 7 days with the default memory per core ;
  • or 128 cores during 3 and a half days with twice the default memory per core ;
  • or 256 cores during 3 and a half days with the default memory per core ;
  • etc.

A user can have at most 2 jobs running in the big queue at a given time, using a total of at most 1024 resources with a cumulated reservation of 4096 hours*resources. Maximum walltime of each job is 30 days. The big queue should be used only for jobs that need more than the max user resource of the default queue.


How are the jobs scheduled and prioritized ?

Jobs are scheduled :

  • based on queue priority (jobs in higher priority queues are served first),
  • and then based on the user Karma (for jobs of equal queue priority, jobs with lower Karma users are served first).

The user's fair share scheduling Karma measures their resource consumption in a given queue over the last 30 days. Resource consumption takes into account both the used resources and the requested (but unused) resources in that queue, with the same formula as detailed here. When you request or consume resources on the cluster, your priority relative to other users decreases (as your Karma increases).


Jobs in the default and big queues wait until the requested resources can be reserved.

Jobs in the besteffort queue run without resource reservation : they are allowed to run as soon as there is available resource on the cluster (they are not subject to per user limits, etc.) but can be killed by the scheduler at any time when running if a non-besteffort job requests the used resource.

Using the besteffort queue enables a user to use more resources at a time than the per user limits and permits efficient cluster resource usage. Thus using the besteffort queue is encouraged for short jobs (several hours) that can easily be resubmitted.


Why does my long job not execute on dellc6100 or dellr900 nodes ?

dellc6100 or dellr900 nodes are shared between NEF and GRID5000, alternating 1 week on each platform. A job with a walltime over ~167 hours (1 week minus reconfiguration time) specifically requesting these nodes will never run and will stay forever in the Waiting (W) state. Please oardel it if submitted by mistake.


How do I submit a job in the "big" queue ?

Use oarsub -q big or use the equivalent option in your submission script (see submission script examples).
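
For example (a sketch : ./myscript.sh and the requested resources are placeholders ; remember the big queue limits listed above) :

# more than the 384 resources allowed in the default queue, within the big queue limits
oarsub -q big -l /core=400,walltime=10:0:0 ./myscript.sh
# or, in a submission script used with "oarsub -S" :
#OAR -q big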


How do I submit a besteffort job ?

To submit a job to the best effort queue just use oarsub -t besteffort or use the equivalent option in your submission script (see submission script examples).

Your jobs will be rescheduled automatically with the same behaviour if you additionally use the idempotent mode oarsub -t besteffort -t idempotent

OAR checkpoint facility may be useful for besteffort jobs but requires support by the running code.
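
A minimal sketch of a besteffort idempotent submission (./myscript.sh and the requested resources are placeholders) :

# command line submission
oarsub -t besteffort -t idempotent -l /core=1,walltime=2:0:0 ./myscript.sh
# or the equivalent directives in a submission script used with "oarsub -S" :
#OAR -t besteffort
#OAR -t idempotent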


How do I reserve resources in advance ?

Submit a job with oarsub -r "YYYY-MM-DD HH:MM:SS". A user can have at most 2 scheduled advance reservations at a given time.

Example 1 :

# No command specified : 1 node is reserved for 2 hours on 2017-12-10 08:00:00
# The running job remains idle until the user connects to the node with oarsub -C job_number
# and interactively launches commands.
oarsub -r "2017-12-10 08:00:00" -l /nodes=1,walltime=2 

Example 2 :

# Command specified : 2 cores are reserved for 3 hours on 2017-12-10 14:00:00
# /path/to/my/script script is launched at that time,
# and resources are released when the script finishes or the walltime is reached
oarsub -r "2017-12-10 14:00:00" -l /core=2,walltime=3 /path/to/my/script

How much memory (RAM) is allocated to my job ?

OAR takes the total amount of RAM of a node (minus a small amount reserved for the system) and divides it by the number of cores.

So for instance, if a node has 96GB of RAM and 12 cores, each reserved core will have ~8GB of RAM allocated by OAR. If you reserve only one core on this type of node, your job will be limited to ~8GB of RAM. RAM is counted for RSS (physical memory really used) not for VSZ (virtual memory allocated).


How can I change the memory (RAM) allocated to my job ?

If you need a single core but more than the amount of RAM allocated per core, you need to reserve more than one core. Since our cluster is heterogeneous (memory per core is not the same on each sub-cluster), there is no single syntax to get the needed amount of memory.

You can explicitly use the mem_core property of OAR. If you want cores with a minimum amount of RAM per core, you can do (at least 8GB per core in this example) :

oarsub  -l '{mem_core > 8000}/nodes=1/core=3'

In this case, you will have 3 cores on the same node with at least 3x8GB = 24GB of RAM.

In this example you reserve a full node with at least 150GB of RAM :

oarsub -p 'mem > 150000' -l /nodes=1


For simple use cases (need to reserve a given amount of RAM, whatever the number of cores, on a single node), we have written a small wrapper around oarsub, called oarsub_mem (warning : still alpha, works only with simple cases). This wrapper understands a mem=XXg syntax. You can use it like this:

oarsub_mem -l mem=20g,walltime=1:0:0


How can I check the resources really used by a running or terminated job ?

Use the Colmet tool to view CPU and RAM usage profile of your job during or after its execution.

  • warning : bug in Colmet, it crashes if you request a resolution of 1 point per 5 seconds or finer (eg: request no more than 5 points for a 30 seconds period)
  • warning : bug in Colmet, we observed that the reported RSS (RAM) is sometimes false

Colmet can be accessed :

  • for Inria users : from the Inria Sophia corporate network ; or through the Inria VPN with the vpn.inria.fr/all profile
  • for all users : by ssh tunneling through nef-frontal.inria.fr (eg: ssh -L 5000:nef-devel2:5000 nef-frontal.inria.fr and browsing http://localhost:5000)

Alternatively, connect to a node while your job is running and check your process physical memory (RSS) usage and virtual memory (VSZ) usage with :

ps -o pid,command,vsz,rss -u yourlogin


How can I submit hundreds/thousands of jobs ?

You can have up to 10000 jobs submitted at a time (includes jobs in all states : Waiting, Running, etc.).

We have raised the limit up to 10000 (20.06.2016). This is experimental and we may lower this limit at anytime if a problem occurs.

OAR provides a feature called array job which allows the creation of multiple, similar jobs with one oarsub command.

Please consider using array jobs when submitting a large number of similar jobs to the queueing system. The obvious but inefficient alternative would be to prepare a prototype job script and write a shell loop that calls oarsub on this (possibly modified) job script the required number of times.


To submit an array comprised of array_number jobs use :

oarsub --array array_number

To submit an array comprised of array_number jobs with distinct parameters passed to each job use :

oarsub --array-param-file param_file

where param_file is a text file with array_number lines. Each line contains the arguments passed to the job with the corresponding index in the array, using shell syntax. Example for an array of 3 jobs :

 foo 'a b'     # First job receives 2 arguments : 'foo', 'a b'
 bar $HOME y   # Second job receives 3 args : 'bar', the path to your homedir, y
 hi `hostname` $MYVAR # Third job receives 3 args : 'hi', result of hostname command, value of $MYVAR variable

Variables and commands are evaluated when launching the job not when running the oarsub command (thus in the user's context on the execution node, not on the submission frontend).

Don't use a parameter file with only one single line: the parameters in this line will be ignored. In other words OAR doesn't like arrays of size 1 :-(


When using a submission script, array job can be specified with a directive in the script :

#OAR --array array_number
##OR
#OAR --array-param-file param_file


OAR creates one different job per member in the array, with the following environment variables :

  • $OAR_JOB_ID : unique jobid for each member of the array
  • $OAR_ARRAY_ID : common value for all members of the array (equal to the jobid of the first array member)
  • $OAR_ARRAY_INDEX : unique index for each member of the array (first job has index 1, second job has index 2, etc.)


Example :

nef-devel2$ oarsub --array 2 ./runme
Generate a job key...
Generate a job key...
OAR_JOB_ID=235542
OAR_JOB_ID=235543
OAR_ARRAY_ID=235542
nef-devel2$ oarstat --array 235542
Job id    A. id     A. index  Name       User     Submission Date     S Queue
--------- --------- --------- ---------- -------- ------------------- - --------
235542    235542    1                    mvesin   2016-04-01 15:49:27 R default 
235543    235542    2                    mvesin   2016-04-01 15:49:27 R default 
nef-devel2$
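
For instance, a minimal sketch of an array job script using these variables (./runme, ./myprogram and the input file naming are hypothetical) :

# Submission script ./runme - excerpt
#OAR -l /core=1,walltime=1:0:0
# each member of the array processes a different input file
INPUT="input_${OAR_ARRAY_INDEX}.dat"
echo "array ${OAR_ARRAY_ID} member ${OAR_ARRAY_INDEX} (job ${OAR_JOB_ID}) processing ${INPUT}"
./myprogram "${INPUT}"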


When using oarsub -t besteffort -t idempotent jobs with arrays, a job in the array may be killed while running and automatically resubmitted. In this case in the resubmitted job : $OAR_JOB_ID is the new jobid, $OAR_ARRAY_INDEX and $OAR_ARRAY_ID are unchanged.

Example of besteffort array member automatic resubmission with $OAR_ARRAY_ID = 235524, and job 235525 (array index 2) killed by OAR and resubmitted as 235527 :

nef-devel2$ oarstat --array 235524
Job id    A. id     A. index  Name       User     Submission Date     S Queue
--------- --------- --------- ---------- -------- ------------------- - --------
235524    235524    1                    mvesin   2016-04-01 14:07:38 R besteffo
235525    235524    2                    mvesin   2016-04-01 14:07:38 E besteffo
235527    235524    2                    mvesin   2016-04-01 14:15:55 R besteffo
nef-devel2$ oarstat -fj235527 | grep resubmit
   resubmit_job_id = 235525

How can I pass command line arguments to my job ?

oarsub does not have a command line option for this but you can pass parameters directly to your job, eg :

oarsub [-S] "./mycode abcde xyzt"

and then in ./mycode check $1 (abcde) and $2 (xyzt) variables, in the language specific syntax. Example :

# Submission script ./mycode
#
# Comments starting with #OAR are used by the resource manager if "oarsub -S"
#OAR -p cputype='xeon'

# pick first argument (abcde) in VAR1
VAR1=$1
# pick second argument (xyzt) in VAR2
VAR2=$2

# Place here your submission script body
echo "var1=$VAR1 var2=$VAR2"

Another syntax for that :

oarsub [-S] "./mycode --VAR1 abcde --VAR2 xyzt"

and then in ./mycode use options parsing in the language specific syntax.
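
For instance, a sketch of option parsing in a bash submission script (the option names reuse the hypothetical example above) :

# Submission script ./mycode - excerpt
while [ $# -gt 0 ]; do
  case "$1" in
    --VAR1) VAR1="$2"; shift 2 ;;
    --VAR2) VAR2="$2"; shift 2 ;;
    *) echo "unknown option: $1"; shift ;;
  esac
done
echo "var1=$VAR1 var2=$VAR2"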


If you do not use the -S option of oarsub then you may prefer to use shell environment variables, eg :

oarsub -l /nodes=2/core=4  "env VAR1=abcde VAR2=xyzt  ./myscript.sh"
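
and then read them from the environment in the script, eg (a sketch of the hypothetical ./myscript.sh) :

# Submission script ./myscript.sh - excerpt
# the variables are inherited from the environment set on the oarsub command line
echo "var1=$VAR1 var2=$VAR2"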


What is a dedicated node ?

A dedicated node is a node for which a limited number of cluster users (eg: a research team) has privileged access (usually because it funded the node). Other cluster users can only submit besteffort jobs to this node and cannot use the additional local storage (under /local).

Check the node properties to see whether a node is dedicated :

  • property dedicated has value NO for a common node
  • property dedicated has value groupname for a node dedicated to groupname
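
For example, a sketch to query this property from the command line (nef085 is just the node used in the oarnodes example above) :

# show the dedicated property of node nef085
oarnodes nef085.inria.fr | grep dedicated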


How do I use a dedicated node ?

No specific option is required, just describe the requested resources. For example, to submit an interactive besteffort queue job reserving one gpu on a dellt630gpu node :

oarsub -p "gpu='YES' and cluster='dellt630gpu'" -t besteffort -l /gpunum=1 -I

To specifically request the dedicated resources of groupname use -p "dedicated='groupname'". For example to submit an interactive default queue job reserving one gpu node from asclepios team use :

oarsub -p "gpu='YES' and dedicated='asclepios'" -l /nodes=1 -I


How do I share a node reservation with other users ?

A timesharing ( -t timesharing ) job reserves resources that can be accessed at the same time by all authorized users (hint : reserve full nodes to avoid unexpected behaviours unless you're an expert user).

All users of the timesharing node have access to all the reserved resources (cores, memory, GPUs) and should coordinate to avoid conflict or over-usage.

Who are the users authorized to access a timesharing node ?

  • standard case : all the cluster users
  • dedicated node : only the privileged users for this node (no simultaneous besteffort)

When all the jobs sharing the node have finished the resources are freed.

Example 1 :

# shared interactive use of a node during 4 hours for a hands on session :
# check for a free node, and name the node to be sure you all access the same node
# 
# simplest case: all users access with the same command
oarsub -t 'timesharing=*,*' -p "host='nef111.inria.fr'" -l /nodes=1,walltime=4:0:0 -I
# the standard node and resources can be accessed by all cluster users during this time

Example 2 :

# advance reservation of nefgpu09 dedicated node during one week starting
# at 2017/03/11 8AM, for shared usage :
oarsub -r "2017-03-11 8:00:00" -p "gpu='YES' and host='nefgpu09.inria.fr'" -l /nodes=1,walltime=7:0:0:0 -t 'timesharing=*,*'
#
# the dedicated node and resources can be accessed by privileged users during this time eg :
oarsub -t 'timesharing=*,*' -p "gpu='YES' and host='nefgpu09.inria.fr'" -l /nodes=1,walltime=2 -I
oarsub -t 'timesharing=*,*' -p "gpu='YES' and host='nefgpu09.inria.fr'" -l /nodes=1,walltime=4 /path/to/script
# etc.

Pay attention to the walltime in the subsequent oarsub : if requesting longer access than currently reserved, the job will start only if the reservation can be extended. The simple way is to submit subsequent jobs with a walltime expiring before the initial job.

Do not confuse timesharing (several jobs have simultaneous access to a set of reserved resources) and container (kind of scheduling-in-scheduling, but each inner job has dedicated resources).


How to choose GPU resources for multi GPU jobs ?

A multi-GPU job has better performance when the reserved GPUs are connected by a high speed data path. On a GPU node, check the data path between GPUs with :

nvidia-smi topo -m

Recommended usage is :

  • choose GPUs from the same host and
  • with a high speed connection (eg: PHB PXB PIX but not SOC).

Request resources accordingly, eg :

# A person from the STARS team requests a pair of GPU cards from one of their dedicated nodes.
# Wants either the gpudevice pair 0/1 or 2/3 which are on same PCIe host bridge (PHB).
oarsub -p "gpu='YES' and dedicated='stars'" -l "{ gpudevice=0 or gpudevice=1 }/nodes=1/gpunum=2" -l "{ gpudevice=2 or gpudevice=3 }/nodes=1/gpunum=2" -I



How to use CPU hyperthreading ?

CPU hyperthreading is available only on the newest nef nodes (see hardware description for node details). Hyperthreading is permanently enabled on these nodes. From the developer's point of view, each hardware thread appears to programs as a logical processor.

When reserving a CPU core with OAR, you are always assigned both threads from this CPU core, without any specific OAR request syntax. Each thread appears as one OAR resource, so you are assigned two resources per core.

Example : reserve 1 core (2 threads) from nefgpu12 in besteffort :

[nef-frontal $] oarsub -t besteffort -p "host='nefgpu12.inria.fr'" -l /core=1 -I
# we are assigned 2 threads from core number 2261
[nefgpu12 ~]$ cat $OAR_RESOURCE_PROPERTIES_FILE | sed -e 's/.*\(thread = .[0-9]*. \).*\(core = .[0-9]*.\).*/\1\2/g'
thread = '0' core = '2261'
thread = '1' core = '2261'

Interacting with a running job

How do I connect to the nodes of my running job ?

Use oarsub -C jobid to start an interactive shell on the master node of the job jobid, or use OAR_JOB_ID=jobid oarsh hostname to connect to any node of the job.

To get the list of the job nodes do a cat $OAR_NODE_FILE and then use oarsh hostname to connect to other job nodes.

Other useful commands : oarcp to copy files between the nodes' local filesystems, oarprint to query the resources allocated to the job (eg : oarprint host for the list of hostnames your job is running on).

Please note ssh to the nodes is not allowed, but oarsh is a wrapper around ssh.
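
For example, once job jobid is running (a sketch ; nefXXX stands for one of the hostnames listed in $OAR_NODE_FILE) :

# open a shell on the master node of the job
oarsub -C jobid
# from inside the job : list the job nodes and connect to another one
cat $OAR_NODE_FILE
oarsh nefXXX.inria.fr
# list the distinct hosts allocated to the job
oarprint host
# copy a file to another job node's local filesystem
oarcp /tmp/myfile nefXXX.inria.fr:/tmp/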


In which state is my job ?

The oarstat -j jobid command lets you see the state of job jobid and the queue in which it has been scheduled.

Example for jobid 1839 :

nef-frontal$ oarstat -j 1839
Job id     Name           User           Submission Date     S Queue
---------- -------------- -------------- ------------------- - ----------
1839       TEST_OAR       rmichela       2015-08-21 17:49:08 T default   
  • the S column gives the current state (Waiting, Running, Launching, Terminating).
  • the Queue column shows the job's queue

-f gives full information about the job, --array prints information for a whole array

You can use SQL syntax for advanced queries, example :

oarstat --sql "job_user='rmichela' and state='Terminated'"


When will my job be executed ?

oarstat -fj jobid | grep scheduledStart gives an estimate of when your job will be started.


How can I get the stderr or stdout of my job during its execution ?

oarpeek jobid shows the stdout of jobid and oarpeek -e jobid shows the stderr.


How can I cancel a job ?

oardel jobid cancels job jobid.


How to know my Karma priorities ?

To see the Karma associated to one of your currently running jobs :

  • use oarstat -f -j jobid | grep Karma
  • or use Monika and click on the jobid to view the job details

This gives your Karma for this job's queue at the time of the job submission.


If you want more details, the command oarstat -u login --accounting "YYYY-MM-DD, yyyy-mm-dd" shows your resource consumption between two dates. The indicated Karma is the one of your last submitted job. To see the details of your resource consumption for a given queue use oarstat -u login --sql "queue_name = 'queue' " --accounting "YYYY-MM-DD, yyyy-mm-dd"

To see your time window used for Karma calculation use :

  • yyyy-mm-dd = tomorrow
  • YYYY-MM-DD = ( yyyy-mm-dd - 30 days )
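
For example, a sketch computing this window with GNU date (adapt the queue name ; login is your nef login as above) :

# window from tomorrow minus 30 days (= today minus 29 days) to tomorrow
oarstat -u login --sql "queue_name = 'default'" --accounting "$(date -d '-29 days' +%Y-%m-%d), $(date -d tomorrow +%Y-%m-%d)"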


Software

How to use an environment module in a job ?

To use an environment module module_name in a batch job, add the following lines to your submission script (here MyScript, the script passed to oarsub) :

# Submission script MyScript - excerpt
source /etc/profile.d/modules.sh
module load module_name
# Commands using the module are after loading the module

Typing module load module_name on a frontend node or in an interactive job session sets up the environment module for this session only (not for submitted jobs).

How to run an OpenMPI application?

The mpirun binary included in openmpi runs the application using the resources reserved by the job :

Submission script for OpenMPI : monAppliOpenMPI.sh

The openmpi 2.0.0 version installed on nef is patched to automatically discover the resources of your job, so you don't have to specify a machinefile.

#!/bin/bash
# File : monAppliOpenMPI.sh
#OAR -l /nodes=3/core=1
source /etc/profile.d/modules.sh
module load mpi/openmpi-2.0.0-gcc
mpirun --prefix $MPI_HOME  monAppliOpenMPI

In this case, mpirun will start the MPI application on 3 nodes with a single core per node.

If you are using the main openmpi module (mpi/openmpi-x86_64) you have to add -machinefile $OAR_NODEFILE

module load mpi/openmpi-x86_64
mpirun --prefix $MPI_HOME  -machinefile $OAR_NODEFILE monAppliOpenMPI


How can I use BLAS (ATLAS, OPENBLAS ...) ?

The recommended libraries are OpenBLAS (atlas or netlib blas are much slower) or Intel's MKL. Several versions of openblas are available: sequential (-l openblas64), pthread (-l openblasp64) or openmp (-l openblaso64).

For example to use the sequential version of openblas (recommended if your application is already multithreaded/parallel):

gcc -I/usr/include/openblas -l openblas64 myblas.c 


How to run an Intel MPI application?

The Intel compiler and MPI implementation are installed on nef. To run an MPI job:

#!/bin/bash
#OAR -l /nodes=3/core=1
source /etc/profile.d/modules.sh
module load mpi/intel64-5.1.1.109   
mpirun -machinefile $OAR_NODEFILE monAppliIntelMPI


How can I install a python package with pip ?

You can install a python package with pip, specifying the --user option (installation for your account only, no admin privileges are required) :

pip install --user package


How can I use a specific python environment ?

You can create a python virtual environment matching your needs and work inside it.

Python virtual environments are useful for choosing the default python version (eg: python3 by default), easily installing additional python packages per project, using reproducible python environments, etc. Moreover, some python features from the system default installation are broken; this problem is solved by using a virtual environment.

Create a python2 virtual environment named virt_py2 :

virtualenv virt_py2 # use : -p /path/to/python for a specific python version
cd virt_py2
source ./bin/activate 
# now in python2 virtual environment
# eg : upgrade pip to the latest version
pip install -U pip
# now install packagename for the virtual environment
pip install packagename
#
# leave the created python2 virtual environment
deactivate
# re-use the created python2 virtual environment :
cd ~/virt_py2
source ./bin/activate 

Create a python3 virtual environment named virt_py3 :

# need to update virtualenv for proper python3 support
pip install -U --user virtualenv
virtualenv -p python3 virt_py3
cd virt_py3
source ./bin/activate 
# now in python3 virtual environment
pip3 install -U pip
# now install packagename for the virtual environment
pip3 install packagename

Create a python3 virtual environment without upgrading virtualenv for other software (nested virtualenv) :

# python2 virtualenv for upgrading virtualenv
virtualenv virt_virtualenv
cd virt_virtualenv
source ./bin/activate
pip install -U virtualenv
source bin/activate # update virtualenv version used
# python3 nested environment
virtualenv -p python3 virt_py3
cd virt_py3
source ./bin/activate 
#
# leave the created nested python3 virtual environments
deactivate ; cd .. ; deactivate
#
# re-use the created python3 nested virtual environment
cd ~/virt_virtualenv
source ./bin/activate
cd virt_py3
source ./bin/activate 


How can I install a package and use a specific environment with conda ?

Conda is an alternative tool to pip/virtualenv for installing your own packages and managing virtual environments. Conda is not limited to python packages and includes environment exporting/sharing.

Create and use a conda virtual environment named virt_conda :

module load conda/5.0.1-python2.7
# create and use a conda virtual environment
conda create --name virt_conda
source activate virt_conda
# install a package in virtual environment
conda install package
# leave the conda virtual environment
source deactivate virt_conda

Create and use a conda virtual environment named virt_conda_py3 using python 3.6 by default :

module load conda/5.0.1-python3.6
conda create --name virt_conda_py3 python=3.6
source activate virt_conda_py3


How can I use caffe ?

First you have to use a node with a GPU (it should be much faster with a GPU), for example:

oarsub -I -p "gpu='YES'" -l /gpunum=1

Then you have to load the cuda and caffe modules:

source /etc/profile.d/modules.sh
module load cuda/9.1
module load cudnn/7.0-cuda-9.1
module load caffe/0.16-cuda-9.1
$CAFFE_HOME/build/tools/caffe


How can I use torch ?

To use the system pre-installed version of torch, load the torch module :

module load torch/7
# launch the torch interactive session, etc.
th

You can install your own additional packages in your homedir to extend the system installation of torch :

luarocks --tree=~/.luarocks install packagename

Alternatively, if the system pre-installed version of torch does not match your customization needs, you can install your own torch version as explained in the torch installation process, except you need to ignore the install-deps step.

How can I use tensorflow?

A tensorflow version is available on nef, built from sources with GPU support, recent CPU support (eg avx2) and python3. To set it up, please run the following on a node with recent GPU and CPU support (oarsub -p "gpucapability >= '5.0'") :

pip install -U --user virtualenv
virtualenv -p python3 virt_tf
cd virt_tf
source ./bin/activate
pip3 install -U pip
module load cuda/9.1
module load cudnn/7.0-cuda-9.1
module load tensorflow/1.6-python3-cuda9.1
pip install numpy google googleapis-common-protos

Alternatively you can install a google-built version of tensorflow. This method currently supports tensorflow <= 1.4.x (1.5/1.6 require cuda 9.0 which is not installed on nef). Nef's older GPUs may not be supported depending on the tensorflow version (eg: 1.0.1 requires -p "gpucapability >= '5.0'"). Tensorflow may not use all GPU and CPU hardware capabilities depending on the nodes, with a performance impact (eg: recent CPU node capabilities such as sse3/4, avx, avx2, fma are not used by tensorflow 1.0.1).

Example for tensorflow 1.0.1 with GPU support, cuda 8.0, cudnn5.1, python3.4 :

pip install -U --user virtualenv
virtualenv -p python3 virt_tf
cd virt_tf
source ./bin/activate
pip3 install -U pip
module load cuda/8.0
module load cudnn/5.1-cuda-8.0
pip3 install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.0.1-cp34-cp34m-linux_x86_64.whl


How can I use spark ?

Let's say you want to use spark on 4 nodes :

oarsub -I -l /nodes=4,walltime=3:0:0

This will reserve 4 nodes and start a shell on the first one (say nef107)

Then start the master:

./sbin/start-master.sh

Then you can start the slaves on three other nodes using oarsh ( the server URL in this case is spark://nef107.inria.fr:7077 ), like this:

for i in `uniq $OAR_NODEFILE | grep -v nef107`; do  
oarsh $i $HOME/spark-1.6.0-bin-hadoop2.6/sbin/start-slave.sh spark://nef107.inria.fr:7077 ; done

Then you can use spark, for ex. to run the sparkPi example:

export MASTER=spark://nef107.inria.fr:7077
./bin/run-example  SparkPi

To connect remotely to the WebUI you need to start Inria VPN (with vpn.inria.fr/all) or use SSH tunneling through nef-frontal.inria.fr


How can I run a graphical application on a node ?

First, connect to the nef frontend with ssh using the -X option, then submit an interactive job like this; OAR will do what is necessary to set up X11 forwarding:

oarsub -I ...
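
A sketch of a full session (xclock is just an example ; any X11 application available on the node works) :

# on your workstation : connect to a frontend with X11 forwarding
ssh -X nef-devel2.inria.fr
# on the frontend : submit an interactive job, OAR forwards the display to the node
oarsub -I -l /nodes=1/core=1,walltime=1:0:0
# on the node : launch the graphical application
xclock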


You can also use VirtualGL on GPU nodes, see this blog post


How can I use DIGITS ?

A version of DIGITS is available on the cluster. You need to setup your environment before first use :

# load required modules
module load cuda/7.5
module load cudnn/5.1-cuda-7.5 
module load caffe/0.14 
module load torch/7
module load digits/6.1
# install required python packages in a virtual environment
virtualenv --system-site-packages virt_digits
cd virt_digits
source ./bin/activate
pip install scikit-image
pip install -r $DIGITS_ROOT/requirements.txt
pip install -U numpy

Then launch a server on a reserved node before each use :

# load required modules
module load cuda/7.5
module load cudnn/5.1-cuda-7.5 
module load caffe/0.14 
module load torch/7
module load digits/6.1
# enter virtual environment
cd virt_digits
source ./bin/activate
# launch a DIGITS dev server
cd $DIGITS_ROOT
./digits-devserver

Launch a browser on the reserved node and connect to http://localhost:5000 to connect to the DIGITS server.

The pre-installed version of DIGITS is not full featured (no Torch support, no data and visualization plugins). Alternatively you can install your own DIGITS version adapting the build documentation for a non-root install in a python virtualenv.


What are the Matlab licences available ?

Matlab community licenses from Inria Sophia can be used on the cluster. They are shared with all the site's desktops and laptops. Please find here the complete license list.


What are the best practices for Matlab jobs ?

If launching many Matlab jobs at the same time, please launch them on as few nodes as possible. Matlab uses a floating licence per {node,user} couple. Eg :

  • 10 jobs for user foo on 10 different cores of the nef012 node use 1 floating license,
  • 1 job for user foo on each of the nef01[0-9] nodes uses 10 floating licenses.

OAR container jobs may be useful.

Example : make a long reservation of a full node and launch many short mono-core jobs

# one day reservation of a full node (/path/to/loop-script is an idle wait/loop script)
-bash-4.2$ oarsub -t container  -l /node=1,walltime=24 /path/to/loop-script
[...]
OAR_JOB_ID=3303953
[...]
# launch 200 matlab jobs on the reserved node
-bash-4.2$ oarsub --array 200 -t inner=3303953 -l /core=1,walltime=1 /path/to/matlab/job

Troubleshooting

Why is my job rejected at submission ?

The job system may refuse a job submission due to the admission rules; an explicit error message will be displayed. In case of doubt, contact the cluster admin team.

Most of the time it indicates that the requested resources are not available, which may be caused by a typo (eg -p "cluster='dell6220'" instead of -p "cluster='dellc6220'").

Job is also rejected if you submit a non-besteffort job to a dedicated node of another team. Add -t besteffort to your oarsub command to check this point.

Sometimes it may also be caused by some nodes being temporarily out of service. This may be verified by typing oarnodes -s to list all nodes in service.

Another cause may be that the job requested more resources than the total existing on the cluster.


Why is my job still Waiting while other jobs go Running ?

Many possible (normal) explanations include :

  • other job may have higher priority : queue priority, user Karma
  • your job requests currently unavailable resources (eg : only dellc6220 nodes while the other job accepts any node type)
  • your job requests more resources than currently available and a lower priority job can be run before without delaying your job (best fit). Eg : you requested 4 nodes, only 2 are currently available, the 2 others will be available in 3 hours. A job requesting 2 nodes during at most 3 hours can be run before yours.
  • the other job made an advance reservation of resources
  • etc.


Why is my job still Waiting while there are unused resources ?

Many possible (normal) explanations include :

  • you have reached maximum resource reservation per user at a given time and your job is not besteffort
  • resources are reserved for a higher priority job. Eg: a higher priority job requests 3 nodes, 2 are currently available, 1 will be available in 1 hour. Your job requests 1 node during 2 hours. Running your job would result in delaying a higher priority job.
  • resources are reserved by an advance reservation (same example as above).
  • etc.


I see several nodes in the StandBy state in Monika, are they available ?

Yes; it's because we have enabled the Energy Savings feature of OAR.

It means that when no jobs are waiting, OAR can decide to shut down nodes to save energy. As soon as a new job is queued, OAR will automatically restart some nodes if not enough nodes are alive. Usually, the nodes boot in about 2 minutes, so the job will wait at most a few minutes before starting.

Why did my job get killed ?

Your job can be killed by the scheduler in several ways; you can check what happened using oarstat -fj <JOBID>

  • Your script uses more memory than requested:

If your main process uses too much memory (see also How much memory is allocated to my job), it is killed by OAR; its state is 'Terminated' and it has received the kill signal (9)

   state = Terminated
   exit_code = 9 (0,9,0)
2017-02-14 14:35:53> SWITCH_INTO_TERMINATE_STATE:[bipbip 3321148] Ask to change the job state
  • One of the processes started by your script uses more memory than requested:

If you use a bash script to start your main process, and it uses too much memory, then the bulkiest process is killed by OAR, and the bash script ends with an exit code of 128+9 = 137 (if your script correctly handles and returns the error code). Its state is 'Terminated'

   state = Terminated
   exit_code = 35072 (137,0,0)
2017-02-14 14:35:53> SWITCH_INTO_TERMINATE_STATE:[bipbip 3321148] Ask to change the job state
  • Your job has exceeded its walltime:

In this case, the state is Error and OAR tells you what happened (killed by root because of WALLTIME)

   state = Error
2017-02-14 15:01:34> SWITCH_INTO_ERROR_STATE:[bipbip 3321314] Ask to change the job state
2017-02-14 15:01:31> SEND_KILL_JOB:[Leon] Send the kill signal to oarexec on nef012.inria.fr for job 3321314
2017-02-14 15:01:30> WALLTIME:[sarko] Job [3321314] from 1487080849 with 15; current time=1487080890 (Elapsed)
2017-02-14 15:01:30> FRAG_JOB_REQUEST:User root requested to frag the job 3321314
  • Your besteffort job has been killed to start a regular job:

In this case, the state is Error and OAR tells you what happened (killed by root because of BESTEFFORT_KILL)

   state = Error
2017-02-14 16:01:50> SCHEDULER_PRIORITY_UPDATED_STOP:Scheduler priority for job 3321820 updated (network_address/resource_id)
2017-02-14 16:01:50> SWITCH_INTO_ERROR_STATE:[bipbip 3321820] Ask to change the job state
2017-02-14 16:01:47> SEND_KILL_JOB:[Leon] Send the kill signal to oarexec on nefgpu04.inria.fr for job 3321820
2017-02-14 16:01:46> FRAG_JOB_REQUEST:User root requested to frag the job 3321820
2017-02-14 16:01:46> BESTEFFORT_KILL:[MetaSched] kill the besteffort job 3321820

Disks and filesystems

How can I access files on the cluster using sshfs ?

With sshfs you can access files on the cluster as a mounted filesystem on your client laptop/desktop running Linux or MacOs.

On Linux, you should first install the fuse-sshfs package. On MacOs, install OSXFUSE and SSHFS from http://osxfuse.github.io/.

Example for a machine connected to the INRIA-sophia network:

  • Linux:
mylaptop$ mkdir -p /workspaces/nef
mylaptop$ sshfs -o transform_symlinks nef-devel2.inria.fr:/ /workspaces/nef
  • MacOs:
mylaptop$ mkdir -p /Volumes/nef
mylaptop$ sshfs -o transform_symlinks nef-devel2.inria.fr:/ /Volumes/nef

Mounting / and using the transform_symlinks option permits access to all the nef storage areas with a single mount and properly handles any symbolic links you may encounter (eg: a symbolic link in your nef homedir pointing to /data/...).

It is better not to make such a network mount on a subdirectory of your homedir, to prevent your session from freezing in case of a network problem or when you disconnect your laptop.

If you want to make shortcuts using symbolic links in your homedir, it is better to put them in a subdirectory. For example (on Linux):

  • mkdir $HOME/nef.d
  • ln -s /workspaces/nef/home/LOGIN $HOME/nef.d/myhome
  • ln -s /workspaces/nef/data/TEAM/user/LOGIN $HOME/nef.d/mydata
  • ln -s /workspaces/nef/data/TEAM $HOME/nef.d/teamdata

where LOGIN is your nef login name, TEAM the name of your team.

You unmount this filesystem with:

  • Linux:
mylaptop$ fusermount -u /workspaces/nef
  • MacOs:
mylaptop$ umount -f /Volumes/nef

For a machine outside of Inria network :

  • configure ssh tunneling through nef-frontal
  • or mount on nef-frontal instead of nef-devel2 (lower performance)

How do I tag files on /data to the scratch or long term storage ?

Files in /data belong either to long term storage or to scratch storage. This is based on the Unix group of the files, not on the path hierarchy.

Use the standard Unix file group commands and rules eg :

  • chgrp scratch /path/to/file  : tag /path/to/file to the scratch group (so /path/to/file is now on scratch storage)
  • chgrp my_team_group /path/to/file  : tag /path/to/file to the my_team_group group (so /path/to/file is now on long term storage of my team)
  • chmod g+s /path/to/dir  : files created under /path/to/dir from now on inherit same Unix group as /path/to/dir
  • sg scratch : the current process now uses scratch as its effective group id, so files are now created in the scratch group by default (unless a path inheritance rule takes precedence)
  • etc.

So files can be moved from one storage to another without copying them (quicker with TB of data).
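
For example, a sketch moving an existing directory tree to scratch storage (/data/TEAM/user/LOGIN/mydir is a hypothetical path, with TEAM and LOGIN as in the sshfs example above) :

# tag an existing tree to the scratch storage
chgrp -R scratch /data/TEAM/user/LOGIN/mydir
# make new files created in this directory inherit the scratch group
chmod g+s /data/TEAM/user/LOGIN/mydir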

Why do the /data quota usages for users and groups not match ?

  • The group numbers indicate the long term storage quota usage by all the members of a group.
  • The user numbers indicate the total disk usage of a user, long term storage plus scratch storage.

There is currently no simple way to get the long term storage quota usage by a single user.


Example :

  • semir group is currently using 128.810 GiB out of its 1024 GiB long term storage usage quota which is the default quota for a team.
  • user mvesin from group semir currently uses 210 GiB (mix of long term storage and scratch storage).
nef-devel2$ sudo nef-getquota -g semir
Group quotas under /data, restricted to the given groups (sizes in GiB):
  Group               Used       Hard   Declared 
  semir            128.810   1024.000   1024.000 $default_data_quota

Disk usage by user under /data for the semir group (sizes in GiB):
  User                Used
  mvesin           210.000
  fm                 44.100


What is the performance of the different filesystems ?

This is a complex question that needs to be considered case by case :

  • depends on the type of access (read/write/mix, long sequential/short random chunks, etc.)
  • for /home and /data : overall performance is shared between jobs on all nodes of the cluster
  • for /tmp and /dev/shm : overall performance is shared between jobs on the node
  • etc.

Results of a test for big sequential write access (with caching disabled) :

  • ~200 MB/s for /home access (shared between jobs on all nodes)
  • ~3500 MB/s for /data access (shared between jobs on all nodes)
    • 1x access 800-1100 MB/s ; 4x access on 1x node 2000-2500 MB/s ; 4x access on 4x nodes 2500-3000 MB/s ; etc.
  • ~100-200 MB/s for /tmp access (shared between jobs on this node)
  • ~400-500 MB/s for /local on a SSD disk (shared between jobs on this node)
  • ~2000-3000 MB/s for /dev/shm access (shared between jobs on this node)


Why shouldn't I use many small files ?

Using small files (aka ZOTfiles, zillions of tiny files) consumes more filesystem metadata (finite) resources. Metadata exhaustion can prevent new file creation even with a filesystem not full. Using many small files or reading/writing small chunks of data also reduces file access performance for yourself and other users (lower data and metadata access efficiency).

Good practices for /data :

  • avoid using many small files (rule of thumb : try using files over 1MB when using more than 100k files and links)
  • avoid reading/writing many small chunks of data (rule of thumb : when doing intensive read/write try grouping requests by chunks over 1MB)
  • do not create too many files in the same directory (rule of thumb : try limiting to 1k files per directory maximum).

Example : check metadata usage for user mylogin:

$ sudo beegfs-ctl --getquota --uid mylogin
      user/group    ||           size          ||    chunk files    
    name     |  id  ||    used    |    hard    ||  used   |  hard   
--------------|------||------------|------------||---------|---------
     mylogin |  1234||   2.49 TB  |      0 Byte|| 37525834|        0

User mylogin uses 37525834 chunk files (metadata entries) for 2.49TB, thus an average of 71KB per chunk file (the average file size is slightly over this average). Rule of thumb : mylogin should create files that are on average at least 15 times bigger (~ 1MB average).
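
For example, a sketch to check the number of files and the total size of one of your directory trees (the path is hypothetical) :

# number of files and total size of a directory tree
find /data/TEAM/user/LOGIN/mydataset -type f | wc -l
du -sh /data/TEAM/user/LOGIN/mydataset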


What size can I use on the RAM filesystem ?

One limit is that the RAM filesystem (/dev/shm) space used by a job can be at most the RAM allocated to the job on this node (it is part of the resources allocated to the job).

The other limit is that the system of each node is configured with a total limit for the RAM filesystem (around 50% of the node RAM).


Other

Guidelines for hardware support of multi GPU scaling ?

Multi GPU scaling of GPU computation depends on many factors (type of computation, optimization, framework, data), including GPU node hardware.

Important hardware elements include GPU-GPU and CPU-GPU interconnect technology, CPU and storage resources.

  • nvidia-smi topo -m describes GPU connections for the current node. GPU-GPU connection may be direct via PCIe switch or root, or indirect via PCIe plus CPU (and QPI/UPI).
  • node hardware is presented here
  • guidelines can help choosing a filesystem. Local SSD disk may be a good option when available.


Dell T630 nodes have at most 2 GPU cards per PCIe root, thus P2P GPU can scale well up to 2 GPU. PCIe 3.0 P2P interconnects cards at 16GB/s each way in theory. A P2P data transfer test shows ~20GB/s (2x 10GB/s) effective total and ~6 us latency between a pair of cards on the same PCIe root.


Asus ESC8000 node has a single root PCIe topology, enabling better P2P GPU scaling up to 8 GPU. P2P data transfer test shows ~25GB/s total and ~6.5 us latency between any pair amongst the 8 GPUs. On the other hand, a job with a bottleneck on data transfers between CPU RAM/disks <=> GPU RAM may not benefit from PCIe single root and multi GPU.


Even in P2P, multi GPU data transfer is much slower than data transfer local to a GPU. Example : for a GTX1080 Ti, a test shows ~350GB/s transfer rate and ~4 us latency.


Multi-node scaling introduces other potential bottlenecks : IB network (40Gb/s or 56 Gb/s), CPU