FAQ new config : Difference between revisions
Revision as of 14 February 2017 at 16:28
Contents
- 1 General
- 2 Job submission
- 2.1 How do I submit a job ?
- 2.2 How to choose the node type and properties ?
- 2.3 How do I reserve GPU resources ?
- 2.4 What is the format of a submission script ?
- 2.5 What are the available queues ?
- 2.6 How are the jobs scheduled and prioritized ?
- 2.7 Why does my long job not execute on dellc6100 or dellr900 nodes ?
- 2.8 How do I submit a job in the "big" queue ?
- 2.9 How do I submit a besteffort job ?
- 2.10 How do I reserve resources in advance ?
- 2.11 How much memory (RAM) is allocated to my job ?
- 2.12 How can I change the memory (RAM) allocated to my job ?
- 2.13 How can I check the resources really used by a running or terminated job ?
- 2.14 How can I submit hundreds/thousands of jobs ?
- 2.15 How can I pass command line arguments to my job ?
- 2.16 What is a dedicated node ?
- 2.17 How do I use a dedicated node ?
- 3 How to Interact with a job when it is running
- 4 Software
- 4.1 How to use an environment module in a job ?
- 4.2 How to run an OpenMPI application?
- 4.3 How can I use BLAS (ATLAS, OPENBLAS ...) ?
- 4.4 How to run an Intel MPI application?
- 4.5 How can I install a python package with pip ?
- 4.6 How can I run caffe ?
- 4.7 How can I use spark ?
- 4.8 How can I run a graphical application on a node ?
- 4.9 Can I use tensorflow?
- 4.10 What are the Matlab licences available ?
- 4.11 What are the best practices for Matlab jobs ?
- 5 Troubleshooting
- 6 Disks and filesystems
General
Who can have an account on the cluster ?
- Inria users : nef is an Inria Sophia Antipolis - Méditerranée research center platform open for all people with an Inria account during the validity period of the account
- Academic and industrial partners of Inria, under agreement.
For account application, extension, renewal please follow the first steps procedure.
When does my cluster account expire ?
Type nef-user -l your_nef_login
on nef-devel2 or nef-frontal. The Expire date is the first day on which the account will be deactivated.
What is OAR ?
OAR is a versatile resource and task manager (also called a batch scheduler) for HPC clusters and other computing infrastructures (such as experimental testbeds for distributed computing, where versatility is key).
OAR is the way you reserve resources (nodes, cores) on the cluster by submitting a job.
- The official User Documentation is here : http://oar.imag.fr/docs/2.5/#ref-user-docs
- The Inria Rennes Tutorial : http://igrida.gforge.inria.fr/tutorial.html
What are the most commonly used OAR commands ? (see official docs)
- oarsub : to submit a job
- oarstat : to see the state of the queues (Running/Waiting jobs)
- oardel : to cancel a job
- oarpeek : to show the stdout of a job while it is running
- oarhold : to hold a job while it is Waiting
- oarresume : to resume jobs in the states Hold or Suspended
Can I use a web interface rather than command line ?
Yes, connect to the Kali web portal : if you have an Inria account just Sign in/up with CAS ; if you have no Inria account use Sign in/up.
Job submission
How do I submit a job ?
Use the oarsub command.
With OAR you can directly use a binary as a submission argument on the command line, or even an inline script. You can also create a submission script. The script includes the command line to execute and the resources needed for the job. Do not forget to use the -S option of oarsub if you want the OAR parameters in the script to be parsed and honored (oarsub -S ./myscript.sh).
How to choose the node type and properties ?
The cluster has several kinds of nodes.
To view all defined OAR properties :
- graphical : connect to Monika and click on the node name.
- command line : use oarnodes, for example for nef085 : oarnodes nef085.inria.fr
If you want all the cores from a single node :
oarsub -l /nodes=1
If you want 48 cores from any type and any number of nodes :
oarsub -l /core=48
In this case, the 48 cores can be spread over several nodes; your application must handle this case (using MPI or another framework) ! A multithreaded application won't be able to use all the reserved cores if they are spread over several nodes.
If you need to reserve a given amount of cores from a single node, use :
oarsub -l /nodes=1/core=2
If you want all the cores of 2 nodes from xeon nodes with more than 80GB RAM each during 10 hours:
oarsub -p "cputype='xeon' and mem > 80000" -l /nodes=2,walltime=10:00:00
If you want 96 cores as 12 cores from 8 nodes from xeon nodes:
oarsub -p "cputype='xeon'" -l /nodes=8/core=12
You can make more specific reservations using additional resource tags. This job reserves a total of 16 cores, as 8 cores from a single node on each of 2 different Infiniband network switches :
oarsub -l /ibswitch=2/node=1/core=8
Reserve either 6 cores during 1 hour or 3 cores during 2 hours (moldable job, with an either-or choice between the two requests) :
oarsub -l /core=6,walltime=1 -l /core=3,walltime=2
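The resource expressions above follow a regular pattern. As an illustration only, the sketch below assembles the argument of -l from a node count, a per-node core count and a walltime, and computes the total number of cores requested. The names oar_request and oar_total_cores are hypothetical helpers written for this FAQ; they are not OAR commands.

```shell
# Hypothetical helper (not part of OAR): build the argument of "oarsub -l"
# for N nodes with C cores each and a walltime given as H:MM:SS.
oar_request() {
    nodes=$1; cores=$2; walltime=$3
    echo "/nodes=${nodes}/core=${cores},walltime=${walltime}"
}

# Hypothetical helper: total number of cores reserved by such a request.
oar_total_cores() {
    echo $(( $1 * $2 ))
}

# 96 cores as 12 cores from 8 nodes, here for 10 hours
oar_request 8 12 10:00:00      # prints /nodes=8/core=12,walltime=10:00:00
oar_total_cores 8 12           # prints 96
```

The printed string can then be passed to oarsub, for instance oarsub -p "cputype='xeon'" -l "$(oar_request 8 12 10:00:00)" ./myscript.sh (myscript.sh being a placeholder for your own script).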
How do I reserve GPU resources ?
To reserve a single GPU :
oarsub -p "gpu='YES'" -l /gpunum=1
Several cores may be attached to a GPU. For example, on nefgpu05/06 you will get 3 cores and 1 GPU; on nefgpu03/04 you will get one or two cores and 1 GPU.
If you want more GPUs on a single node, say 4 :
oarsub -p "gpu='YES'" -l /nodes=1/gpunum=4
If you want all the GPUs of a node during 4 hours :
oarsub -p "gpu='YES'" -l /nodes=1,walltime=4
If you reserve a single core (-l /nodes=1/core=1) , you will NOT have exclusive access to the gpu attached to it
Remember: to check the available gpus and monitor them, use nvidia-smi
What is the format of a submission script ?
The script should include not only the path to the program to execute, but also information about the needed resources (you can also specify resources using oarsub options on the command line). Simple example : helloWorld.sh
# Submission script for the helloWorld program
#
# Comments starting with #OAR are used by the resource manager if using "oarsub -S"
#
# The job reserves 8 nodes with one processor (core) per node,
# only on xeon nodes, job duration is less than 10min
#OAR -l /nodes=8/core=1,walltime=00:10:00
#OAR -p cputype='xeon'
#
# The job is submitted to the default queue
#OAR -q default
#
# Path to the binary to run
./helloWorld
You can mix parameters in the submission script and on the command line, but take care about how they combine. In this example the -p on the command line takes precedence over the one in the script, while the -l from the script and the -l from the command line are combined (giving a moldable job, as when using multiple -l options) :
oarsub -p "cputype='opteron'" -l /nodes=4/core=2 -S ./helloWorld.sh
What are the available queues ?
The limits and parameters of the queues are listed below :
queue name | max user resources | max user running jobs | max duration (days) | priority | max user (hours*resources)
default | 384 (*) | - | 30 | 10 | 21504
big | 1024 | 2 | 30 | 5 | 2000
besteffort | - | - | 3 | 0 | -
(*) 2016-10 : raised from 256, experimental (will be confirmed or reversed)
This means all the jobs of a user running in the default queue at a given time can use at most 256 resources (eg 256 cores, or 128 cores with twice the default memory per core) with a cumulated reservation of 21504 hours*resources. Maximum walltime of each job is 30 days. Number of running jobs is not limited (thus can be up to 256).
In other words, the jobs that a user has running at a given time in the default queue can hold a cumulated resource reservation of at most :
- either 32 cores during 28 days with the default memory per core ;
- or 128 cores during 7 days with the default memory per core ;
- or 128 cores during 3 days 1/2 with twice the default memory per core ;
- or 256 cores during 3 days 1/2 with the default memory per core ;
- etc.
A user can have at most 2 jobs running in the big queue at a given time, using a total of at most 1024 resources with a cumulated reservation of 2000 hours*resources. Maximum walltime of each job is 30 days. The big queue should be used only for jobs that need more than the max user resource of the default queue.
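The figures above all come from dividing the hours*resources budget by the number of resources held at the same time. A minimal sketch of that arithmetic follows; max_hours is a hypothetical helper name, and the default budget is the 21504 hours*resources of the default queue.

```shell
# Hypothetical helper: longest duration (in hours, rounded down) that fits in
# an hours*resources budget when RESOURCES are reserved at the same time.
# Usage: max_hours RESOURCES [BUDGET]   (BUDGET defaults to 21504)
max_hours() {
    resources=$1
    budget=${2:-21504}
    echo $(( budget / resources ))
}

max_hours 32          # prints 672 (28 days)
max_hours 128         # prints 168 (7 days)
max_hours 256         # prints 84  (3 days 1/2)
max_hours 1024 2000   # prints 1   (big queue budget with all 1024 resources)
```

The result is only an upper bound derived from the budget: the maximum walltime of each job (30 days) still applies.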
How are the jobs scheduled and prioritized ?
Jobs are scheduled :
- based on queue priority (jobs in higher priority queues are served first),
- and then based on the user Karma (for jobs of equal queue priority, jobs with lower Karma users are served first).
The user's fair share scheduling Karma measures their recent resource consumption during the last 30 days in a given queue. Resource consumption takes into account both the used resources and the requested (but unused) resources in a given queue, with the same formula as detailed here. When you request or consume resources on the cluster, your priority relative to other users decreases (as your Karma increases).
Jobs in the default and big queues wait until the requested resources can be reserved.
Jobs in the besteffort queue run without resource reservation : they are allowed to run as soon as there is available resource on the cluster (they are not subject to per user limits, etc.) but can be killed by the scheduler at any time when running if a non-besteffort job requests the used resource.
Using the besteffort queue enables a user to use more resources at a time than the per user limits and permits efficient cluster resource usage. Thus using the besteffort queue is encouraged for short jobs (several hours) that can easily be resubmitted.
Why does my long job not execute on dellc6100 or dellr900 nodes ?
dellc6100 and dellr900 nodes are shared between NEF and GRID5000, alternating 1 week on each platform. A job with a walltime over ~167 hours (1 week minus reconfiguration time) submitted specifically for these nodes will never run and will stay forever in the waiting (W) state. Please oardel it if it was submitted by mistake.
How do I submit a job in the "big" queue ?
Use oarsub -q big or use the equivalent option in your submission script (see submission script examples).
How do I submit a besteffort job ?
To submit a job to the besteffort queue just use oarsub -t besteffort or use the equivalent option in your submission script (see submission script examples).
If a besteffort job is killed, it is resubmitted automatically with the same behaviour if you additionally use the idempotent mode : oarsub -t besteffort -t idempotent
The OAR checkpoint facility may be useful for besteffort jobs but requires support by the running code.
How do I reserve resources in advance ?
Submit a job with oarsub -r "YYYY-MM-DD HH:MM:SS". A user can have at most 2 scheduled advance reservations at a given time.
How much memory (RAM) is allocated to my job ?
OAR takes the total amount of RAM of a node (minus a small amount for the system) and divides it by the number of cores.
So for instance, if a node has 96GB of RAM and 12 cores, each reserved core will have ~8GB of RAM allocated by OAR. If you reserve only one core on this type of node, your job will be limited to ~8GB of RAM. RAM is counted for RSS (physical memory really used) not for VSZ (virtual memory allocated).
How can I change the memory (RAM) allocated to my job ?
If you need a single core, but more than the amount of RAM dedicated to each core, you need to reserve more than one core. Since our cluster is heterogeneous (memory per core is not the same on each sub-cluster), it is not easy to have a single syntax to get the needed amount of memory.
You can explicitly use the mem_core property of OAR. If you want cores with a minimum amount of RAM per core, you can do (at least 8GB per core in this example) :
oarsub -l '{mem_core > 8000}/nodes=1/core=3'
In this case, you will have 3 cores on the same node with at least 3x8GB = 24GB of RAM.
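Since memory is allocated per reserved core, the number of cores to request for a given amount of RAM is a ceiling division. The sketch below shows the arithmetic; cores_for_mem is a hypothetical helper name, and both values are expressed in MB like the mem_core property used above.

```shell
# Hypothetical helper: number of cores to reserve on a single node to get at
# least NEED_MB of RAM when each core comes with MEM_CORE_MB (ceiling division).
# Usage: cores_for_mem NEED_MB MEM_CORE_MB
cores_for_mem() {
    need_mb=$1; mem_core_mb=$2
    echo $(( (need_mb + mem_core_mb - 1) / mem_core_mb ))
}

cores_for_mem 24000 8000   # prints 3
cores_for_mem 20000 8000   # prints 3 (2 cores would only give 16000 MB)
cores_for_mem 8000 8000    # prints 1
```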
In this example you reserve a full node with at least 150GB of RAM :
oarsub -p 'mem > 150000' -l /nodes=1
For simple use cases (need to reserve a given amount of RAM, whatever the number of cores, on a single node), we have written a small wrapper around oarsub, called oarsub_mem (warning : still alpha, works only with simple cases). This wrapper understands a mem=XXg syntax. You can use it like this:
oarsub_mem -l mem=20g,walltime=1:0:0
How can I check the resources really used by a running or terminated job ?
Use the Colmet tool to view CPU and RAM usage profile of your job during or after its execution.
- warning : bug in Colmet, it crashes if you use 1 point per 5 seconds or more (eg: no more than 5 points for 30 seconds)
- warning : bug in Colmet, we observed that the reported RSS (RAM) is sometimes false
Colmet can be accessed :
- for Inria users : from the Inria Sophia enterprise network ; or through the Inria VPN with the vpn.inria.fr/all profile
- for all users : by ssh tunneling through nef-frontal.inria.fr (eg: ssh -L 5000:nef-devel2:5000 nef-frontal.inria.fr and then browsing http://localhost:5000)
Alternatively, connect to a node while your job is running and check your process physical memory (RSS) usage and virtual memory (VSZ) usage with :
ps -o pid,command,vsz,rss -u yourlogin
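To compare the physical memory used by all your processes on a node with the amount allocated to the job, the RSS column can be summed. This is a sketch only: sum_rss_mb is a hypothetical helper name, and it assumes that ps reports RSS in kilobytes, as the Linux procps version does.

```shell
# Hypothetical helper: sum RSS values given in kilobytes (one per line on
# stdin) and print the total in megabytes, rounded down.
sum_rss_mb() {
    awk '{ total += $1 } END { print int(total / 1024) }'
}

# On a node of your job (yourlogin is a placeholder), "rss=" removes the header:
#   ps -o rss= -u yourlogin | sum_rss_mb
printf '1024\n2048\n' | sum_rss_mb    # prints 3
```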
How can I submit hundreds/thousands of jobs ?
You can have up to 10000 jobs submitted at a time (includes jobs in all states : Waiting, Running, etc.).
We have raised the limit up to 10000 (20.06.2016). This is experimental and we may lower this limit at anytime if a problem occurs.
OAR provides a feature called array job which allows the creation of multiple, similar jobs with one oarsub command.
Please consider using array jobs when submitting a large number of similar jobs to the queueing system. The obvious but inefficient way to do this would be to prepare a prototype job script and shell scripting a loop to call oarsub on this (possibly modified) job script the required number of times.
To submit an array comprised of array_number jobs use :
oarsub --array array_number
To submit an array comprised of array_number jobs with distinct parameters passed to each job use :
oarsub --array-param-file param_file
where param_file is a text file with array_number lines. Each line contains the arguments passed to the job with the corresponding index in the array, using shell syntax. Example for an array of 3 jobs :
foo 'a b' # First job receives 2 arguments : 'foo', 'a b'
bar $HOME y # Second job receives 3 args : 'bar', the path to your homedir, y
hi `hostname` $MYVAR # Third job receives 3 args : 'hi', result of hostname command, value of $MYVAR variable
Variables and commands are evaluated when launching the job not when running the oarsub command (thus in the user's context on the execution node, not on the submission frontend).
Don't use a parameter file with only one single line: the parameters in this line will be ignored. In other words OAR doesn't like arrays of size 1 :-(
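A parameter file is usually generated rather than written by hand. The sketch below writes one line per input file and refuses to create a one-line file, for the reason given above. The helper name make_param_file, the file names and the ".out" naming are illustrative assumptions; file names containing spaces would need quoting in the generated lines.

```shell
# Hypothetical helper: write one line "INPUT INPUT.out" per input file into a
# parameter file, and fail if fewer than 2 inputs are given (OAR ignores the
# parameters of a one-line file).
# Usage: make_param_file PARAM_FILE INPUT1 INPUT2 [...]
make_param_file() {
    out=$1; shift
    if [ "$#" -lt 2 ]; then
        echo "make_param_file: need at least 2 array members" >&2
        return 1
    fi
    : > "$out"                       # create or empty the parameter file
    for input in "$@"; do
        printf '%s %s.out\n' "$input" "$input" >> "$out"
    done
}

# Example (file names are placeholders):
#   make_param_file params.txt data1.csv data2.csv data3.csv
#   oarsub --array-param-file params.txt ./myscript.sh
```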
When using a submission script, array job can be specified with a directive in the script :
#OAR --array array_number
##OR
#OAR --array-param-file param_file
OAR creates one different job per member in the array, with the following environment variables :
- $OAR_JOB_ID : unique jobid for each member of the array
- $OAR_ARRAY_ID : common value for all members of the array (equal to the jobid of the first array member)
- $OAR_ARRAY_INDEX : unique index for each member of the array (first job has index 1, second job has index 2, etc.)
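With a plain oarsub --array, a common pattern is to let each member choose its own input from $OAR_ARRAY_INDEX. The sketch below picks line N of a list file for the member with index N; pick_input and the list file are hypothetical, chosen for illustration.

```shell
# Hypothetical helper: print the line of LIST_FILE whose number equals the
# array index. OAR_ARRAY_INDEX starts at 1, so index 1 selects the first line.
# Usage: pick_input LIST_FILE [INDEX]   (INDEX defaults to $OAR_ARRAY_INDEX)
pick_input() {
    list_file=$1
    index=${2:-$OAR_ARRAY_INDEX}
    if [ -z "$index" ]; then
        echo "pick_input: no array index available" >&2
        return 1
    fi
    sed -n "${index}p" "$list_file"
}

# In the job script (inputs.txt and ./mycode are placeholders):
#   input=$(pick_input inputs.txt)
#   ./mycode "$input"
```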
Example :
nef-devel2$ oarsub --array 2 ./runme
Generate a job key...
Generate a job key...
OAR_JOB_ID=235542
OAR_JOB_ID=235543
OAR_ARRAY_ID=235542
nef-devel2$ oarstat --array 235542
Job id    A. id     A. index  Name       User     Submission Date     S Queue
--------- --------- --------- ---------- -------- ------------------- - --------
235542    235542    1                    mvesin   2016-04-01 15:49:27 R default
235543    235542    2                    mvesin   2016-04-01 15:49:27 R default
nef-devel2$
When using oarsub -t besteffort -t idempotent with array jobs, a job in the array may be killed while running and automatically resubmitted. In this case, in the resubmitted job : $OAR_JOB_ID is the new jobid, while $OAR_ARRAY_INDEX and $OAR_ARRAY_ID are unchanged.
Example of besteffort array member automatic resubmission with $OAR_ARRAY_ID = 235524, and job 235525 (array index 2) killed by OAR and resubmitted as 235527 :
nef-devel2$ oarstat --array 235524
Job id    A. id     A. index  Name       User     Submission Date     S Queue
--------- --------- --------- ---------- -------- ------------------- - --------
235524    235524    1                    mvesin   2016-04-01 14:07:38 R besteffo
235525    235524    2                    mvesin   2016-04-01 14:07:38 E besteffo
235527    235524    2                    mvesin   2016-04-01 14:15:55 R besteffo
nef-devel2$ oarstat -fj235527 | grep resubmit
resubmit_job_id = 235525
How can I pass command line arguments to my job ?
oarsub does not have a command line option for this but you can pass parameters directly to your job, eg :
oarsub [-S] "./mycode abcde xyzt"
and then in ./mycode check $1 (abcde) and $2 (xyzt) variables, in the language specific syntax. Example :
# Submission script ./mycode
#
# Comments starting with #OAR are used by the resource manager if "oarsub -S"
#OAR -p cputype='xeon'
# pick first argument (abcde) in VAR1
VAR1=$1
# pick second argument (xyzt) in VAR2
VAR2=$2
# Place here your submission script body
echo "var1=$VAR1 var2=$VAR2"
Another syntax for that :
oarsub [-S] "./mycode --VAR1 abcde --VAR2 xyzt"
and then in ./mycode use options parsing in the language specific syntax.
If you do not use the -S option of oarsub then you may prefer to use shell environment variables, eg :
oarsub -l /nodes=2/core=4 "env VAR1=abcde VAR2=xyzt ./myscript.sh"
What is a dedicated node ?
A dedicated node is a node to which a limited number of cluster users (eg: a research team) have privileged access (usually because they funded the node). Other cluster users can only submit besteffort jobs to this node.
Check the node properties to see whether a node is dedicated :
- property dedicated has value NO for a common node
- property dedicated has value groupname for a node dedicated to groupname
How do I use a dedicated node ?
No specific option is required, just describe the requested resources. For example, to submit an interactive besteffort queue job reserving one gpu on a dellt630gpu node (currently all nodes of this type are dedicated) use :
oarsub -p "gpu='YES' and cluster='dellt630gpu'" -t besteffort -l /gpunum=1 -I
To specifically request the dedicated resources of groupname use -p "dedicated='groupname'". For example to submit an interactive default queue job reserving one gpu node from asclepios team use :
oarsub -p "gpu='YES' and dedicated='asclepios'" -l /nodes=1 -I
How to Interact with a job when it is running
How do I connect to the nodes of my running job ?
Use oarsub -C jobid to start an interactive shell on the master node of the job jobid, or use OAR_JOB_ID=jobid oarsh hostname to connect to any node of the job.
To get the list of the job nodes, run cat $OAR_NODE_FILE and then use oarsh hostname to connect to other job nodes.
Other useful commands : oarcp to copy files between the local filesystems of the nodes, oarprint to query the resources allocated to the job (eg : oarprint host for the list of the hostnames your job is running on).
Please note that plain ssh to the nodes is not allowed; use oarsh instead, which is a wrapper around ssh.
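The node file can also be summarized per node. The sketch below assumes that the node file contains one line per reserved core (which is why the Spark example in this FAQ runs uniq on it); cores_per_node is a hypothetical helper name.

```shell
# Hypothetical helper: print "hostname cores" for each node listed in an OAR
# node file, assuming the file holds one line per reserved core.
# Usage: cores_per_node NODE_FILE
cores_per_node() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a job:
#   cores_per_node "$OAR_NODE_FILE"
```

For a job holding 2 cores on nef012 and 1 core on nef013, this would print "nef012.inria.fr 2" and "nef013.inria.fr 1" on two lines.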
In which state is my job ?
The oarstat -j jobid command shows the state of job jobid and the queue in which it has been scheduled.
Example for jobid 1839 :
nef-frontal$ oarstat -j 1839
Job id     Name           User           Submission Date     S Queue
---------- -------------- -------------- ------------------- - ----------
1839       TEST_OAR       rmichela       2015-08-21 17:49:08 T default
- the S column gives the current state (eg : Waiting, Running, Launching, Terminating).
- the Queue column shows the job's queue
- the -f option gives full information about the job, --array prints information for a whole array
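In a script, the state letter can be extracted from this tabular output. The sketch below relies only on the column layout shown in the example above (state in the next-to-last column); job_state is a hypothetical helper name.

```shell
# Hypothetical helper: read the output of "oarstat -j JOBID" on stdin and
# print the one-letter state of JOBID. The state is the next-to-last column,
# which remains true when the Name column is empty.
# Usage: oarstat -j JOBID | job_state JOBID
job_state() {
    awk -v id="$1" '$1 == id { print $(NF - 1) }'
}

# Example with the line shown above:
printf '%s\n' '1839 TEST_OAR rmichela 2015-08-21 17:49:08 T default' | job_state 1839   # prints T
```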
You can use SQL syntax for advanced queries, example :
oarstat --sql "job_user='rmichela' and state='Terminated'"
When will my job be executed ?
oarstat -fj jobid | grep scheduledStart gives an estimate of when your job will start.
How can I get the stderr or stdout of my job during its execution ?
oarpeek jobid shows the stdout of jobid and oarpeek -e jobid shows the stderr.
How can I cancel a job ?
oardel jobid cancels job jobid.
How to know my Karma priorities ?
To see the Karma associated to one of your currently running jobs :
- use oarstat -f -j jobid | grep Karma
- or use Monika and click on jobid to view the job details
This gives your Karma for this job's queue at the time of the job submission.
If you want more details, the command oarstat -u login --accounting "YYYY-MM-DD, yyyy-mm-dd" shows your resource consumption between two dates. The indicated Karma is the one of your last submitted job. To see the details of your resource consumption for a given queue use oarstat -u login --sql "queue_name = 'queue' " --accounting "YYYY-MM-DD, yyyy-mm-dd"
To see your time window used for Karma calculation use :
- yyyy-mm-dd = tomorrow
- YYYY-MM-DD = ( yyyy-mm-dd - 30 days )
Software
How to use an environment module in a job ?
To use an environment module module_name in a batch job, add the following lines in your submission script (the script used in oarsub MyScript) :
# Submission script MyScript - excerpt
source /etc/profile.d/modules.sh
module load module_name
# Commands using the module are after loading the module
Typing module load module_name on a frontend node or in an interactive job session sets the environment module for this session only (not for submitted jobs).
How to run an OpenMPI application?
The mpirun binary included in openmpi runs the application using the resources reserved by the job.
Submission script for OpenMPI : monAppliOpenMPI.sh
The openmpi 1.10.1 version installed on nef is patched to automatically discover the resources of your job, so you don't have to specify a machinefile.
#!/bin/bash
# File : monAppliOpenMPI.sh
#OAR -l /nodes=3/core=1
source /etc/profile.d/modules.sh
module load mpi/openmpi-1.10.1-gcc
mpirun --prefix $MPI_HOME monAppliOpenMPI
In this case, mpirun will start the MPI application on 3 nodes with a single core per node.
If you are using the main openmpi module (mpi/openmpi-x86_64) you have to add -machinefile $OAR_NODEFILE :
module load mpi/openmpi-x86_64
mpirun --prefix $MPI_HOME -machinefile $OAR_NODEFILE monAppliOpenMPI
How can I use BLAS (ATLAS, OPENBLAS ...) ?
The recommended version is Openblas (atlas or netlib blas are much slower) or the MKL from Intel. Several versions of openblas are available: sequential (-l openblas64), pthread (-l openblasp64) or openmp (-l openblaso64)
For example to use the sequential version of openblas (recommended if your application is already multithreaded/parallel):
gcc -I/usr/include/openblas -l openblas64 myblas.c
How to run an Intel MPI application?
The Intel compiler and MPI implementation are installed on nef. To run an MPI job:
#!/bin/bash
#OAR -l /nodes=3/core=1
source /etc/profile.d/modules.sh
module load mpi/intel64-5.1.1.109
mpirun -machinefile $OAR_NODEFILE monAppliIntelMPI
How can I install a python package with pip ?
Since you don't have the rights to install software in the system, you have to add the --user option:
pip install --user packagename
How can I run caffe ?
First you have to use a node with a GPU (it should be much faster with a GPU), for example:
oarsub -I -p "gpu='YES'" -l /nodes=1
Then you have to load the cuda and caffe modules:
source /etc/profile.d/modules.sh
module load cuda/7.5
module load cudnn/5.0
module load caffe/0.14
$CAFFE_HOME/build/tools/caffe
How can I use spark ?
Let's say you want to use spark on 4 nodes :
oarsub -I -l /nodes=4,walltime=3:0:0
This will reserve 4 nodes and start a shell on the first one (say nef107)
Then start the master:
./sbin/start-master.sh
Then you can start the slaves on three other nodes using oarsh ( the server URL in this case is spark://nef107.inria.fr:7077 ), like this:
for i in `uniq $OAR_NODEFILE | grep -v nef107`; do oarsh $i $HOME/spark-1.6.0-bin-hadoop2.6/sbin/start-slave.sh spark://nef107.inria.fr:7077 ; done
Then you can use spark, for ex. to run the sparkPi example:
export MASTER=spark://nef107.inria.fr:7077
./bin/run-example SparkPi
To connect remotely to the WebUI you need to start Inria VPN (with vpn.inria.fr/all) or use SSH tunneling through nef-frontal.inria.fr
How can I run a graphical application on a node ?
First, connect to the nef frontend with ssh using the -X option, then submit an interactive job like this (OAR will take care of setting up X11 forwarding) :
oarsub -I ...
You can also use VirtualGL on GPU nodes, see this blog post
Can I use tensorflow?
Yes, you can install the CPU version of tensorflow. There is a conflict with the protobuf library, so you have to use virtualenv, like this:
virtualenv ~/tensorflow
cd ~/tensorflow
source bin/activate
pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl
You can also use the GPU version of tensorflow; it will only work on the latest GPUs (-p "gpucapability = '5.2'"). For python3, you can use pip3 directly, otherwise you should use virtualenv (see the CPU installation example just before).
module load cuda/8.0
module load cudnn/5.1-cuda-8.0
export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.11.0rc2-cp34-cp34m-linux_x86_64.whl
pip3 install --user $TF_BINARY_URL
What are the Matlab licences available ?
Matlab community licenses from Inria Sophia can be used on the cluster. They are shared with all the sites desktops and laptops. Please find here the complete licenses list.
What are the best practices for Matlab jobs ?
If launching many Matlab jobs at the same time, please launch them on as few nodes as possible. Matlab uses one floating licence per {node,user} pair. Eg :
- 10 jobs for user foo on 10 different cores of the nef012 node use 1 floating license,
- 1 job for user foo on each of the nef01[0-9] nodes uses 10 floating licenses in total.
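The license count for a given job placement is therefore the number of distinct {node,user} pairs. A minimal sketch of that count follows, assuming the placement is available as lines of the form "node user" (one per running Matlab job); matlab_licenses is a hypothetical helper name.

```shell
# Hypothetical helper: read "node user" pairs (one per running Matlab job)
# on stdin and print the number of distinct pairs, i.e. the number of
# floating licenses in use under the rule described above.
matlab_licenses() {
    sort -u | wc -l | tr -d ' '
}

# 3 jobs of user foo on the same node count as a single license
printf 'nef012 foo\nnef012 foo\nnef012 foo\n' | matlab_licenses   # prints 1
# 1 job of user foo on each of 2 nodes counts as 2 licenses
printf 'nef010 foo\nnef011 foo\n' | matlab_licenses               # prints 2
```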
OAR container jobs may be useful.
Example : make a long reservation of a full node and launch many short mono-core jobs
# one day reservation of a full node (/path/to/loop-script is an idle wait/loop script)
-bash-4.2$ oarsub -t container -l /node=1,walltime=24 /path/to/loop-script
[...]
OAR_JOB_ID=3303953
[...]
# launch 200 matlab jobs on the reserved node
-bash-4.2$ oarsub --array 200 -t inner=3303953 -l /core=1,walltime=1 /path/to/matlab/job
Troubleshooting
Why is my job rejected at submission ?
The job system may refuse a job submission because of the admission rules. An explicit error message is displayed; if in doubt, contact the cluster admin team.
Most of the time it indicates that the requested resources are not available, which may be caused by a typo (eg -p "cluster='dell6220'" instead of -p "cluster='dellc6220'").
A job is also rejected if you submit a non-besteffort job to a dedicated node of another team. Add -t besteffort to your oarsub command to check this point.
Sometimes it may also be caused by some nodes being temporarily out of service. This can be verified by typing oarnodes -s to list all nodes in service.
Another cause may be that the job requested more resources than the total resources existing on the cluster.
Why is my job still Waiting while other jobs go Running ?
Many possible (normal) explanations include :
- the other job may have a higher priority : queue priority, user Karma
- your job requests currently unavailable resources (eg : only dellc6220 nodes while the other job accepts any node type)
- your job requests more resources than currently available and a lower priority job can be run before without delaying your job (best fit). Eg : you requested 4 nodes, only 2 are currently available, the 2 others will be available in 3 hours. A job requesting 2 nodes during at most 3 hours can be run before yours.
- the other job made an advance reservation of resources
- etc.
Why is my job still Waiting while there are some unused resources ?
Many possible (normal) explanations include :
- you have reached maximum resource reservation per user at a given time and your job is not besteffort
- resources are reserved for a higher priority job. Eg: a higher priority job requests 3 nodes, 2 are currently available, 1 will be available in 1 hour. Your job requests 1 node during 2 hours. Running your job would result in delaying a higher priority job.
- resources are reserved by an advance reservation (same example as above).
- etc.
I see several nodes in the StandBy state in Monika, are they available ?
Yes; it's because we have enabled the Energy Savings feature of OAR.
It means that when no jobs are waiting, OAR can decide to shut down nodes to save energy. As soon as a new job is queued, OAR will automatically restart some nodes if not enough nodes are alive. Usually, the nodes can boot in 2 minutes, so the job will wait at most a few minutes before starting.
Why did my job got killed ?
Your job can be killed by the scheduler in several ways; you can check what happens using oarstat -fj <JOBID>
- Your script uses more memory than requested:
If your main process uses too much memory (see also How much memory is allocated to my job), it is killed by OAR; its state is 'Terminated' and it has received the kill signal (9)
  state = Terminated
  exit_code = 9 (0,9,0)
  2017-02-14 14:35:53> SWITCH_INTO_TERMINATE_STATE:[bipbip 3321148] Ask to change the job state
- One of the processes started by your script uses more memory than requested:
If you use a bash script to start your main process and that process uses too much memory, then the main process is killed by OAR and the bash script ends with an exit status of 128+9 = 137. Its state is 'Terminated'
  state = Terminated
  exit_code = 35072 (137,0,0)
  2017-02-14 14:35:53> SWITCH_INTO_TERMINATE_STATE:[bipbip 3321148] Ask to change the job state
- Your job has exceeded its walltime:
In this case, the state is Error and OAR tells you what happened (killed by root because of WALLTIME)
  state = Error
  2017-02-14 15:01:34> SWITCH_INTO_ERROR_STATE:[bipbip 3321314] Ask to change the job state
  2017-02-14 15:01:31> SEND_KILL_JOB:[Leon] Send the kill signal to oarexec on nef012.inria.fr for job 3321314
  2017-02-14 15:01:30> WALLTIME:[sarko] Job [3321314] from 1487080849 with 15; current time=1487080890 (Elapsed)
  2017-02-14 15:01:30> FRAG_JOB_REQUEST:User root requested to frag the job 3321314
- Your besteffort job has been killed to start a regular job:
In this case, the state is Error and OAR tells you what happened (killed by root because of BESTEFFORT_KILL)
  state = Error
  2017-02-14 16:01:50> SCHEDULER_PRIORITY_UPDATED_STOP:Scheduler priority for job 3321820 updated (network_address/resource_id)
  2017-02-14 16:01:50> SWITCH_INTO_ERROR_STATE:[bipbip 3321820] Ask to change the job state
  2017-02-14 16:01:47> SEND_KILL_JOB:[Leon] Send the kill signal to oarexec on nefgpu04.inria.fr for job 3321820
  2017-02-14 16:01:46> FRAG_JOB_REQUEST:User root requested to frag the job 3321820
  2017-02-14 16:01:46> BESTEFFORT_KILL:[MetaSched] kill the besteffort job 3321820
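The exit_code values shown in the memory cases above can be decoded by hand. The sketch below assumes that the raw number is a POSIX-style wait status and that the triple in parentheses is (exit status, signal, core dump flag); this is consistent with the two examples, 9 (0,9,0) and 35072 (137,0,0), but it is an assumption and not taken from the OAR documentation. The function names are hypothetical.

```python
import signal


def decode_exit_code(raw):
    """Split a raw wait status into (exit_status, signal, core_dumped).

    Assumed layout (POSIX wait status): high byte = exit status,
    low 7 bits = terminating signal, bit 7 = core dump flag.
    """
    exit_status = (raw >> 8) & 0xFF
    term_signal = raw & 0x7F
    core_dumped = bool(raw & 0x80)
    return exit_status, term_signal, core_dumped


def signal_name(number):
    """Return the symbolic name of a signal number, or 'unknown'."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return "unknown"


def describe(raw):
    """Give a short human-readable reading of a raw exit_code value."""
    exit_status, term_signal, _ = decode_exit_code(raw)
    if term_signal:
        return f"main process killed by signal {term_signal} ({signal_name(term_signal)})"
    if exit_status > 128:
        # Shell convention: 128 + N means a child was killed by signal N.
        child_signal = exit_status - 128
        return (f"script exited with status {exit_status}: a child process was "
                f"probably killed by signal {child_signal} ({signal_name(child_signal)})")
    return f"exited with status {exit_status}"


print(decode_exit_code(9))      # (0, 9, False)
print(decode_exit_code(35072))  # (137, 0, False)
print(describe(9))
print(describe(35072))
```

In both memory cases the signal is 9 (SIGKILL); the only difference is whether it hit the main process directly or a child started by a bash script.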
Disks and filesystems
How can I access files on the cluster using sshfs ?
With sshfs you can access files on the cluster as a mounted filesystem on your client laptop/desktop running Linux or MacOs.
On Linux, you should first install the fuse-sshfs package. On MacOs, install OSXFUSE and SSHFS from http://osxfuse.github.io/.
Example for a machine connected to the INRIA-sophia network:
- Linux:
  mylaptop$ mkdir -p /workspaces/nef
  mylaptop$ sshfs -o transform_symlinks nef-devel2.inria.fr:/ /workspaces/nef
- MacOs:
  mylaptop$ mkdir -p /Volumes/nef
  mylaptop$ sshfs -o transform_symlinks nef-devel2.inria.fr:/ /Volumes/nef
Mounting / with the transform_symlinks option gives access to all the storage areas of nef through a single mount, and properly handles any symbolic links you may encounter (ex: a symbolic link in your nef homedir pointing to /data/...).
It is better not to do such a network mount on a subdirectory of your homedir, so that your session does not freeze in case of a network problem or when you disconnect your laptop.
If you want to make shortcuts using symbolic links in your homedir, it is better to put them in a subdirectory. For example (on Linux):
- mkdir $HOME/nef.d
- ln -s /workspaces/nef/home/LOGIN $HOME/nef.d/myhome
- ln -s /workspaces/nef/data/TEAM/user/LOGIN $HOME/nef.d/mydata
- ln -s /workspaces/nef/data/TEAM $HOME/nef.d/teamdata
where LOGIN is your nef login name, TEAM the name of your team.
You unmount this filesystem with:
- Linux:
mylaptop$ fusermount -u /workspaces/nef
- MacOs:
mylaptop$ umount -f /Volumes/nef
For a machine outside the Inria network :
- configure ssh tunneling through nef-frontal
- or mount on nef-frontal instead of nef-devel2 (lower performance)
How do I tag files on /data for scratch or long term storage ?
Files in /data belong either to long term storage or to scratch storage. This is based on the Unix group of the files, not on the path hierarchy.
Use the standard Unix file group commands and rules, eg :
- chgrp scratch /path/to/file : tag /path/to/file to the scratch group (so /path/to/file is now on scratch storage)
- chgrp my_team_group /path/to/file : tag /path/to/file to the my_team_group group (so /path/to/file is now on the long term storage of my team)
- chmod g+s /path/to/dir : files created under /path/to/dir from now on inherit the same Unix group as /path/to/dir
- sg scratch : the current process now uses scratch as its effective group id, so files are now created belonging to the scratch group by default (if no path inheritance rule takes precedence)
- etc.
So files can be moved from one storage to another without copying them (quicker with TB of data).
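To check which storage a file currently counts against, it is enough to look at its Unix group. The following is a minimal sketch, assuming (as in the commands above) that the scratch group is literally named scratch and that any other group means long term storage; the function names are hypothetical.

```python
from pathlib import Path

SCRATCH_GROUP = "scratch"  # group name used in the chgrp example above


def storage_for_group(group_name):
    """Map a Unix group name to the kind of /data storage it implies."""
    return "scratch" if group_name == SCRATCH_GROUP else "long term"


def storage_of(path):
    """Return the kind of storage a file belongs to, from its Unix group."""
    # Path.group() returns the name of the group owning the file (Unix only).
    return storage_for_group(Path(path).group())


print(storage_for_group("scratch"))        # scratch
print(storage_for_group("my_team_group"))  # long term
```

On the cluster, storage_of("/path/to/file") would report the change right after a chgrp, since no data is copied.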
Why do the /data quota usages for users and groups not match ?
- The group numbers indicate the long term storage quota usage by all the members of a group.
- The user numbers indicate the total disk usage of a user, long term storage plus scratch storage.
There is currently no simple way to get the long term storage quota usage by a single user.
Example :
- semir group is currently using 128.810 GiB out of its 1024 GiB long term storage usage quota which is the default quota for a team.
- user mvesin from group semir currently uses 210 GiB (mix of long term storage and scratch storage).
  nef-devel2$ sudo nef-getquota -g semir
  Group quotas under /data, restricted to the given groups (sizes in GiB):
  Group   Used      Hard       Declared
  semir   128.810   1024.000   1024.000 $default_data_quota
  Disk usage by user under /data for the semir group (sizes in GiB):
  User     Used
  mvesin   210.000
  fm       44.100
What is the performance of the different filesystems ?
This is a complex question that needs to be considered case by case :
- depends on the type of access (read/write/mix, long sequential/short random chunks, etc.)
- for /home and /data : overall performance is shared between jobs on all nodes of the cluster
- for /tmp and /dev/shm : overall performance is shared between jobs on the node
- etc.
Results of a test for big sequential write access (with caching disabled) :
- ~140 MB/s for /home access (shared between jobs on all nodes)
- ~3500 MB/s for /data access (shared between jobs on all nodes) -- yet another performance increase expected Q3 2016 with scale out from 5 to 8 storage servers
  - 1x access 800-1100 MB/s ; 4x access on 1x node 2000-2500 MB/s ; 4x access on 4x nodes 2500-3000 MB/s ; etc.
- ~100-200 MB/s for /tmp access (shared between jobs on the node)
- ~2000-3000 MB/s for /dev/shm access (shared between jobs on the node)
For better performance, use big files and read/write big data chunks (rule of thumb: over 1 MB) rather than small ones.
What size can I use on the RAM filesystem ?
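As an illustration of the chunk size advice, here is a minimal sketch of a file copy that reads and writes in large blocks. The 4 MiB value is an arbitrary choice above the 1 MB rule of thumb, not a tuned figure for the cluster, and the function name is hypothetical.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per read/write, an arbitrary choice above 1 MB


def copy_in_big_chunks(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst using large sequential reads and writes.

    Returns the number of bytes copied.
    """
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            fout.write(chunk)
            copied += len(chunk)
    return copied
```

The same idea applies to your own code: issue a few large sequential reads or writes rather than many small ones, especially on the shared /home and /data filesystems.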
One limit is that the RAM filesystem (/dev/shm) space used by a job can be at most the RAM allocated to the job on this node (it is part of the resources allocated to the job).
The other limit is that the system of each node is configured with a total limit for the RAM filesystem (around 50% of the node RAM).
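The two limits combine as a simple minimum. The sketch below assumes the node-wide limit is exactly 50% of the node RAM (the text only says "around 50%"); the function names and the example sizes are hypothetical.

```python
import shutil


def shm_limit_gib(job_ram_gib, node_ram_gib, node_shm_fraction=0.5):
    """Upper bound on the /dev/shm space a job can use on one node.

    node_shm_fraction is an assumption ("around 50%" of the node RAM).
    The node-wide limit is shared between all jobs on the node, so the
    space really available can be lower than this bound.
    """
    return min(job_ram_gib, node_ram_gib * node_shm_fraction)


def node_shm_total_gib(path="/dev/shm"):
    """Read the total size configured for the RAM filesystem on this node."""
    return shutil.disk_usage(path).total / 2**30


# A job with 8 GiB of RAM on a 64 GiB node is limited by its own allocation.
print(shm_limit_gib(8, 64))   # 8
# A job with 48 GiB of RAM on the same node is limited by the node-wide cap.
print(shm_limit_gib(48, 64))  # 32.0
```

Calling node_shm_total_gib() inside a job shows the value actually configured on the node, which is more reliable than the assumed fraction.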