FAQ new config
Contents
- 1 General
- 1.1 Who can have an account on the cluster ?
- 1.2 How can an Inria user of the cluster be accredited for OPAL ?
- 1.3 How can an Inria OPAL user connect to an OPAL platform ?
- 1.4 How can an OPAL user connect to Nef ?
- 1.5 How do i authenticate on the cluster ?
- 1.6 When does my cluster account expire ?
- 1.7 What is OAR ?
- 1.8 What are the most commonly used OAR commands ? (see official docs)
- 2 Job submission
- 2.1 How do i submit a job ?
- 2.2 How to choose the node type and properties ?
- 2.3 How do i reserve GPU resources ?
- 2.4 What is the format of a submission script ?
- 2.5 What are the available queues ?
- 2.6 How are the jobs scheduled and prioritized ?
- 2.7 Why does my long job not run on dellc6100 or dellr900 nodes ?
- 2.8 How do i submit a job in the "big" queue ?
- 2.9 How do i submit a besteffort job ?
- 2.10 How do i reserve resources in advance ?
- 2.11 How much memory (RAM) is allocated to my job ?
- 2.12 How can i change the memory (RAM) allocated to my job ?
- 2.13 How can i check the resources really used by a running or terminated job ?
- 2.14 How can i submit hundreds/thousands of jobs ?
- 2.15 How can i pass command line arguments to my job ?
- 2.16 What is a dedicated node ?
- 2.17 How do i use a dedicated node ?
- 2.18 How do i share a node reservation with other user ?
- 2.19 How to choose GPU resources for multi GPU jobs ?
- 2.20 How to use CPU hyperthreading ?
- 3 How to Interact with a job when it is running
- 3.1 How do i submit a job and wait for it to terminate ?
- 3.2 How do i connect to the nodes of my running job ?
- 3.3 In which state is my job ?
- 3.4 When will my job be executed ?
- 3.5 How can i get the stderr or stdout of my job during its execution ?
- 3.6 How can i cancel a job ?
- 3.7 How to know my Karma priorities ?
- 4 Software
- 4.1 How to use an environment module in a job ?
- 4.2 How to run an OpenMPI application?
- 4.3 How to run an Intel MPI application?
- 4.4 How can i use BLAS (ATLAS, OPENBLAS ...) ?
- 4.5 How can i use FreeFem++ ?
- 4.6 How to use gcc GPU offloading for better performance ?
- 4.7 How to use Nvidia HPC SDK GPU offloading for better performance ?
- 4.8 What has the PGI compiler become ?
- 4.9 How to use AVX for better performance ?
- 4.10 How can i run a docker container ?
- 4.11 How can i run a singularity container ?
- 4.12 Speed up singularity image creation using gitlab
- 4.13 Speed up singularity downloads using caches
- 4.14 How can i install a python package with pip ?
- 4.15 How can i use a specific environment with conda ?
- 4.16 How can i use a specific environment with an older conda version ?
- 4.17 How can i use a specific conda and python environment ?
- 4.18 How can i use a specific python environment ?
- 4.19 How can i use caffe ?
- 4.20 How can i use legacy torch ?
- 4.21 How can i use pytorch ?
- 4.22 How can i use LibTorch ?
- 4.23 How can i use tensorflow?
- 4.24 How can i use theano ?
- 4.25 How can i use spark ?
- 4.26 How can i run a graphical software on a node using my laptop's screen ?
- 4.27 How can i tunnel a port from my laptop to a node ?
- 4.28 How can i use DIGITS ?
- 4.29 What are the Matlab licences available ?
- 4.30 How to compile my Matlab program and run it ?
- 4.31 What are the best practices for Matlab jobs ?
- 4.32 How can i measure my python job energy consumption ?
- 5 Troubleshooting
- 5.1 Why is my job rejected at submission ?
- 5.2 Why is my job still Waiting while other jobs go Running ?
- 5.3 Why is my job still Waiting while there are unused resources ?
- 5.4 I see several nodes in the StandBy state in Monika, are they available ?
- 5.5 Why did my job get killed ?
- 5.6 How can i clean my user environment ?
- 6 Disks and filesystems
- 6.1 How can i access files on the cluster using sshfs ?
- 6.2 What to do with my data before my account expires ?
- 6.3 How do i tag files on /data to the scratch or long term storage ?
- 6.4 Why the /data quota usage for users and groups do not match ?
- 6.5 What are the performance of the different filesystems ?
- 6.6 Why shouldn't i use many small files ?
- 6.7 How can I use many small files efficiently?
- 6.8 What size can i use on the RAM filesystem ?
- 7 Other
General
Who can have an account on the cluster ?
- all Inria users : nef is an Inria Sophia Antipolis - Méditerranée research center platform open for all people with an Inria account during the validity period of the account
- all OPAL users : people from a lab or a project eligible for OPAL who have obtained an OPAL accreditation
- Academic and industrial partners of Inria, under agreement.
For account application, extension, renewal please follow the first steps procedure.
How can an Inria user of the cluster be accredited for OPAL ?
Inria users that are eligible to use OPAL are people who belong to an Inria Sophia Antipolis Méditerranée team.
All Inria Sophia Antipolis Méditerranée users that have an active account on the Nef cluster are automatically accredited for OPAL for the duration of their Nef account, so they can directly request an account on an OPAL platform as indicated on the OPAL website.
How can an Inria OPAL user connect to an OPAL platform ?
After requesting and obtaining an account for an OPAL platform, follow the platform specific instructions.
To connect to the OCA cluster (licallo) :
- ssh from a Nef frontend machine (nef-devel.inria.fr or nef-devel2.inria.fr) or from ssh-sop.inria.fr
- in your account request use case b) from the form ( for Contact Informatique just mention filled in OPAL directory) and do not request specific unfiltering
When connecting from ssh-sop.inria.fr to Licallo you may have to force password authentication :
- ssh -o PasswordAuthentication=yes -o PubkeyAuthentication=no [login@host]
How can an OPAL user connect to Nef ?
Request an account on Nef through the first steps procedure and ssh to Nef as indicated.
All OPAL users can connect to nef-frontal.inria.fr from the Internet.
OCA users can also connect either from licallo.oca.eu or through the OCA VPN.
How do i authenticate on the cluster ?
Connect using ssh with a public/private keypair for OpenSSH 2.
The public/private keypair is generated by each user. Reminder: the private key must be kept by the user, not disclosed to anyone and protected properly.
The initial public key for authentication is provided with the account request.
Users can later add or remove authentication public keys by connecting to the cluster and editing their ~/.ssh/authorized_keys file.
Warning : the following requirements are needed:
- You should give the proper login name. By default ssh uses the one you have on your machine
- The ~ and ~/.ssh directories and the ~/.ssh/authorized_keys and ~/.ssh/id_rsa files should be properly protected on the client and the server. For example, they should NOT be writable by the group you are a member of.
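The permission requirement above can be checked from a shell. The following is a minimal sketch, not an official Nef tool: the helper name not_group_writable is made up for this illustration, and it only inspects the permission string printed by ls -ld.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the path is NOT writable by group or others.
# It reads the permission string printed by `ls -ld`, e.g. "drwxr-xr-x":
# character 6 is the group write bit, character 9 the write bit for others.
not_group_writable() {
    case "$(ls -ld -- "$1")" in
        ?????w*|????????w*) return 1 ;;
        *) return 0 ;;
    esac
}

# Check the paths used by ssh key authentication, skipping those that do not exist
for path in "$HOME" "$HOME/.ssh" "$HOME/.ssh/authorized_keys" "$HOME/.ssh/id_rsa"; do
    [ -e "$path" ] || continue
    if not_group_writable "$path"; then
        echo "ok        $path"
    else
        echo "too open  $path (try: chmod go-w $path)"
    fi
done
```

Run it both on your own machine and on the cluster, since the requirement applies to the client and to the server.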
When does my cluster account expire ?
Type nef-user -l your_nef_login on nef-devel2 or nef-frontal. The Expire date is the first day on which the account will be deactivated.
What is OAR ?
OAR is a versatile resource and task manager (also called a batch scheduler) for HPC clusters, and other computing infrastructures (like distributed computing experimental testbeds where versatility is a key).
OAR is the way you reserve resources (nodes, cores) on the cluster by submitting a job.
- The official User Documentation is here : http://oar.imag.fr/docs/2.5/#ref-user-docs
- The Inria Rennes Tutorial : https://igrida.gitlabpages.inria.fr/userdocs/guide/tutorials.html
What are the most commonly used OAR commands ? (see official docs)
- oarsub : to submit a job
- oarstat : to see the state of the queues (Running/Waiting jobs)
- oardel : to cancel a job
- oarpeek : to show the stdout of a job while it is running
- oarhold : to hold a job while it is Waiting
- oarresume : to resume jobs in the states Hold or Suspended
Job submission
How do i submit a job ?
Use the oarsub command.
With OAR you can directly pass a binary as a submission argument on the command line, or even an inline script. You can also create a submission script. The script includes the command line to execute and the resources needed for the job. Do not forget to use the -S option of oarsub if you want the OAR parameters in the script to be parsed and honored (oarsub -S ./myscript).
How to choose the node type and properties ?
The cluster has several kinds of nodes.
To view all defined OAR properties :
- graphical : connect to Monika and click on the node name.
- command line : use oarnodes, for example for nef085 : oarnodes nef085.inria.fr
If you want all the cores from a single node :
oarsub -l /nodes=1
If you want 48 cores from any type and any number of nodes :
oarsub -l /core=48
In this case, the 48 cores can be spread over several nodes; your application must handle this case (using MPI or other frameworks) ! A multithreaded application won't be able to use all the reserved cores if they are spread over several nodes.
If you need to reserve a given amount of cores from a single node, use :
oarsub -l /nodes=1/core=2
If you want all the cores of 2 nodes from xeon nodes with more than 8GB RAM per core each during 10 hours:
oarsub -p "cputype='xeon' and mem_core > 8000" -l /nodes=2,walltime=10:00:00
If you want 48 cores as 12 cores from 4 nodes from the first C6220 cluster (c6220a) :
oarsub -p "cluster='c6220a'" -l /nodes=4/core=12
You can make more specific reservations using additional resource tags. This job reserves a total of 16 cores as 8 cores from the same node on 2 different Infiniband network switches
oarsub -l /ibswitch=2/node=1/core=8
Reserve either 6 cores for 1 hour or 3 cores for 2 hours (a moldable job, with an either-or choice between the two -l requests) :
oarsub -l /core=6,walltime=1 -l /core=3,walltime=2
How do i reserve GPU resources ?
To reserve a single gpu, do:
oarsub -p "gpu='YES'" -l /gpunum=1
Several CPU cores may be attached to a GPU, so, for example, on nefgpu18 you will get 5 cores reserved with 1 gpu.
To reserve a single gpu with compute capability 5.0 or more, do:
oarsub -p "gpu='YES' and gpucapability>='5.0'" -l /gpunum=1
To reserve 2 gpus with compute capability 5.0 or more, do:
oarsub -p "gpu='YES' and gpucapability>='5.0'" -l /nodes=1/gpunum=2
The following request is often a bad idea : you may be allocated 2 GPUs from different hosts. Unless your code can handle it, you will then use only one of the GPUs and keep the other one blocked and idle :
oarsub -p "gpu='YES' and gpucapability>='5.0'" -l /gpunum=2
If you want more gpus on a single node, say 4:
oarsub -p "gpu='YES'" -l /nodes=1/gpunum=4
If you want all the gpus on a node, during 4 hours:
oarsub -p "gpu='YES'" -l /nodes=1,walltime=4
If you reserve a single core (-l /nodes=1/core=1), you will NOT have exclusive access to the gpu attached to it.
Remember: to check the available gpus and monitor them, use nvidia-smi
What is the format of a submission script ?
The script should include not only the path to the program to execute, but also information about the needed resources (you can also specify resources using oarsub options on the command line). Simple example : helloWorldScript
#!/bin/bash
#
# Submission script for the helloWorld program
#
# Comments starting with #OAR are used by the resource manager if using "oarsub -S"
#
# The job reserves 8 nodes with one processor (core) per node,
# only on xeon nodes from cluster dellc6145, job duration is less than 10min
# Note : quoting style of parameters matters, follow the example
#OAR -l /nodes=8/core=1,walltime=00:10:00
#OAR -p cputype='xeon' and cluster='dellc6145'
#
# The job is submitted to the default queue
#OAR -q default
#
# Path to the binary to run
./helloWorld
The script must be executable (chmod u+rx helloWorldScript).
You can mix parameters in the submission script and on the command line, but take care about how they combine. In this example the -p on the command line takes precedence over the script, while the -l from the script and the command line are combined (moldable jobs when using multiple -l options) :
oarsub -p "cputype='opteron'" -l /nodes=4/core=2 -S ./helloWorldScript
What are the available queues ?
The limits and parameters of the queues are listed below :
queue name | max user resources | max user running jobs | max duration (days) | priority | max user (hours*resources) |
dedicated | 576 | | 30 | 15 | 32256 |
default | 576 | | 30 | 10 | 32256 |
big | 1536 | 2 | 30 | 5 | 6144 |
besteffort | | | 3 | 0 | |
A core is either 1 resource on a non-hyperthreaded node, or 2 resources on a hyperthreaded node (1 core has 2 hardware threads).
This means all the jobs of a user running in the default queue at a given time can use at most 576 resources (eg 576 non-hyperthreaded cores, or 288 hyperthreaded cores, or 288 non-hyperthreaded cores with twice the default memory per core, etc.) with a cumulated reservation of 32256 hours*resources. Maximum walltime of each job is 30 days. Number of running jobs is not limited (thus can be up to 576).
In other words, at a given time a user can have running jobs in the default queue using a cumulated resource reservation of at most :
- either 32 resources during 28 days with the default memory per core ;
- or 128 resources during 7 days with the default memory per core ;
- or 128 resources during 3 days 1/2 with twice the default memory per core ;
- or 256 resources during 3 days 1/2 with the default memory per core ;
- etc.
A user can have at most 2 jobs running in the big queue at a given time, using a total of at most 1536 resources with a cumulated reservation of 6144 hours*resources. Maximum walltime of each job is 30 days. The big queue should be used only for jobs that need more than the max user resource of the default queue.
The dedicated queue can use only dedicated resources. Its interest is that your default queue Karma won't increase.
Specific for STARS team members : interactive jobs are limited to 8 hours, by request of the team.
How are the jobs scheduled and prioritized ?
Jobs are scheduled :
- based on queue priority (jobs in higher priority queues are served first),
- and then based on the user Karma (for jobs of equal queue priority, jobs with lower Karma users are served first).
The user's fair share scheduling Karma measures his/her recent resource consumption during the last 30 days in a given queue. Resource consumption takes into account both the used resources and the requested (but unused) resources in a given queue, with the same formula as detailed here. When you request or consume resources on the cluster, your priority with regard to other users decreases (as your Karma increases).
Jobs in the dedicated, default and big queues wait until the requested resources can be reserved.
Jobs in the besteffort queue run without resource reservation : they are allowed to run as soon as there is available resource on the cluster (they are not subject to per user limits, etc.) but can be killed by the scheduler at any time when running if a non-besteffort job requests the used resource.
Using the besteffort queue enables a user to use more resources at a time than the per user limits and permits efficient cluster resource usage. Thus using the besteffort queue is encouraged for short jobs (several hours) that can easily be resubmitted.
Why does my long job not run on dellc6100 or dellr900 nodes ?
dellc6100 and dellr900 nodes are shared between NEF and GRID5000, alternating one week on each platform. A job with a walltime over ~167 hours (1 week minus reconfiguration time) that specifically requests these nodes will never run and will stay forever in the Waiting (W) state. Please oardel it if it was submitted by mistake.
How do i submit a job in the "big" queue ?
Use oarsub -q big or use the equivalent option in your submission script (see submission script examples).
How do i submit a besteffort job ?
To submit a job to the besteffort queue just use oarsub -t besteffort or use the equivalent option in your submission script (see submission script examples).
Your jobs will be automatically resubmitted with the same behaviour if you additionally use the idempotent mode : oarsub -t besteffort -t idempotent
The OAR checkpoint facility may be useful for besteffort jobs but requires support by the running code.
How do i reserve resources in advance ?
Submit a job with oarsub -r "YYYY-MM-DD HH:MM:SS". A user can have at most 2 scheduled advance reservations at a given time.
Example 1 :
# No command specified : 1 node is reserved for 2 hours on 2017-12-10 08:00:00
# Running job remains idle until the user connects to the node with oarsub -C job_number
# and interactively launches commands.
oarsub -r "2017-12-10 08:00:00" -l /nodes=1,walltime=2
Example 2 :
# Command specified : 2 cores are reserved for 3 hours on 2017-12-10 14:00:00
# /path/to/my/script script is launched at that time,
# and resources are released when the script finishes or walltime is reached
oarsub -r "2017-12-10 14:00:00" -l /core=2,walltime=3 /path/to/my/script
How much memory (RAM) is allocated to my job ?
OAR takes the total amount of RAM of a node and divides it by the number of cores (minus a small amount for the system).
So for instance, if a node has 96GB of RAM and 12 cores, each reserved core will have ~8GB of RAM allocated by OAR. If you reserve only one core on this type of node, your job will be limited to ~8GB of RAM. RAM is counted for RSS (physical memory really used) not for VSZ (virtual memory allocated).
How can i change the memory (RAM) allocated to my job ?
If you need a single core, but more than the default amount of RAM per core, you need to reserve more than one core. Since our cluster is heterogeneous (memory per core is not the same on each sub-cluster), it is not easy to have a single syntax to get the needed amount of memory.
You can explicitly use the mem_core property of OAR. If you want cores with a minimum amount of RAM per core, you can do (at least 8GB per core in this example) :
oarsub -l '{mem_core > 8000}/nodes=1/core=3'
In this case, you will have 3 cores on the same node with at least 3x8GB = 24GB of RAM.
In this example you reserve a full node with at least 150GB of RAM :
oarsub -p 'mem > 150000' -l /nodes=1
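The arithmetic behind these requests can be scripted. The sketch below assumes that mem_core is expressed in MB (as the 8000 and 150000 values above suggest) and ignores the small amount of RAM kept for the system; the helper name cores_for_mem is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: number of cores to reserve so that the job gets at least
# need_mb of RAM, given mem_core_mb of RAM allocated per core (ceiling division).
cores_for_mem() {
    need_mb=$1
    mem_core_mb=$2
    echo $(( (need_mb + mem_core_mb - 1) / mem_core_mb ))
}

# 24 GB needed on nodes providing 8 GB per core
cores_for_mem 24000 8000
# 20 GB needed on the same nodes: the result is rounded up to whole cores
cores_for_mem 20000 8000
```

The result can then be used in the request, for example oarsub -l "{mem_core > 8000}/nodes=1/core=$(cores_for_mem 24000 8000)".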
How can i check the resources really used by a running or terminated job ?
Use the Colmet tool to view CPU and RAM usage profile of your job during or after its execution.
- warning : bug in Colmet, it crashes if you use 1 point per 5 seconds or more (eg: no more than 5 points for 30 seconds)
- warning : bug in Colmet, we observed that the reported RSS (RAM) is sometimes false
Colmet can be accessed :
- for Inria users : from the Inria Sophia enterprise network, or through the Inria VPN with the vpn.inria.fr/all profile
- for all users : by ssh tunneling through nef-frontal.inria.fr (eg: ssh -L 5000:nef-devel2:5000 nef-frontal.inria.fr and browsing http://localhost:5000)
Alternatively, connect to a node while your job is running and check your process physical memory (RSS) usage and virtual memory (VSZ) usage with :
ps -o pid,command,vsz,rss -u yourlogin
How can i submit hundreds/thousands of jobs ?
You can have up to 10000 jobs submitted at a time (includes jobs in all states : Waiting, Running, etc.).
We have raised the limit up to 10000 (20.06.2016). This is experimental and we may lower this limit at anytime if a problem occurs.
OAR provides a feature called array job which allows the creation of multiple, similar jobs with one oarsub command.
Please consider using array jobs when submitting a large number of similar jobs to the queueing system. The obvious but inefficient way to do this would be to prepare a prototype job script and shell scripting a loop to call oarsub on this (possibly modified) job script the required number of times.
To submit an array comprised of array_number jobs use :
oarsub --array array_number
To submit an array comprised of array_number jobs with distinct parameters passed to each job use :
oarsub --array-param-file param_file
where param_file is a text file with array_number lines. Each line contains the arguments passed to the job with the corresponding index in the array, using shell syntax. Example for an array of 3 jobs :
foo 'a b' # First job receives 2 arguments : 'foo', 'a b'
bar $HOME y # Second job receives 3 args : 'bar', the path to your homedir, y
hi `hostname` $MYVAR # Third job receives 3 args : 'hi', result of hostname command, value of $MYVAR variable
Variables and commands are evaluated when launching the job not when running the oarsub command (thus in the user's context on the execution node, not on the submission frontend).
Don't use a parameter file with only one single line: the parameters in this line will be ignored. In other words OAR doesn't like arrays of size 1 :-(
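A parameter file for a parameter sweep can be generated rather than written by hand. This is a minimal sketch: the helper name write_param_file and the --alpha option are invented for the illustration, ./runme stands for your own job script, and the oarsub command is only printed, not executed.

```shell
#!/bin/bash
# Hypothetical helper: write one line of job arguments per value given,
# and refuse to produce a file with fewer than 2 lines, since a one-line
# parameter file is ignored by OAR.
# The "--alpha" option name is an example, not something OAR defines.
write_param_file() {
    out=$1
    shift
    if [ "$#" -lt 2 ]; then
        echo "need at least 2 parameter sets, got $#" >&2
        return 1
    fi
    : > "$out"
    for value in "$@"; do
        printf '%s %s\n' --alpha "$value" >> "$out"
    done
}

# Build a 3-line file, then show the matching submission command
write_param_file params.txt 0.1 0.2 0.5 &&
    echo "oarsub --array-param-file params.txt ./runme"
```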
When using a submission script, array job can be specified with a directive in the script :
#OAR --array array_number
##OR
#OAR --array-param-file param_file
OAR creates one different job per member in the array, with the following environment variables :
- $OAR_JOB_ID : unique jobid for each member of the array
- $OAR_ARRAY_ID : common value for all members of the array (equal to the jobid of the first array member)
- $OAR_ARRAY_INDEX : unique index for each member of the array (first job has index 1, second job has index 2, etc.)
Example :
nef-devel2$ oarsub --array 2 ./runme
Generate a job key...
Generate a job key...
OAR_JOB_ID=235542
OAR_JOB_ID=235543
OAR_ARRAY_ID=235542
nef-devel2$ oarstat --array 235542
Job id    A. id     A. index  Name       User     Submission Date     S Queue
--------- --------- --------- ---------- -------- ------------------- - --------
235542    235542    1                    mvesin   2016-04-01 15:49:27 R default
235543    235542    2                    mvesin   2016-04-01 15:49:27 R default
nef-devel2$
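Inside the job script, $OAR_ARRAY_INDEX is a convenient way to select the work of each array member. The following sketch assumes a hypothetical list file inputs.txt holding one input path per line; the nth_line helper and the INPUT_LIST variable are also made up for this example.

```shell
#!/bin/bash
# Sketch of an array job script: each array member processes one line of a
# list file. "inputs.txt" is a hypothetical file with one input path per line.
nth_line() {
    # print line number $1 of file $2
    sed -n "${1}p" "$2"
}

# OAR_ARRAY_INDEX starts at 1; default to 1 so the script also runs outside OAR
index=${OAR_ARRAY_INDEX:-1}
list=${INPUT_LIST:-inputs.txt}

if [ -r "$list" ]; then
    input=$(nth_line "$index" "$list")
    echo "job ${OAR_JOB_ID:-none} (array ${OAR_ARRAY_ID:-none}, index $index) processes: $input"
fi
```

With a list of N lines, such a script would be submitted with oarsub --array N.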
When using arrays with oarsub -t besteffort -t idempotent jobs, a job in the array may be killed while running and automatically resubmitted. In this case, in the resubmitted job : $OAR_JOB_ID is the new jobid, $OAR_ARRAY_INDEX and $OAR_ARRAY_ID are unchanged.
Example of besteffort array member automatic resubmission with $OAR_ARRAY_ID = 235524, and job 235525 (array index 2) killed by OAR and resubmitted as 235527 :
nef-devel2$ oarstat --array 235524
Job id    A. id     A. index  Name       User     Submission Date     S Queue
--------- --------- --------- ---------- -------- ------------------- - --------
235524    235524    1                    mvesin   2016-04-01 14:07:38 R besteffo
235525    235524    2                    mvesin   2016-04-01 14:07:38 E besteffo
235527    235524    2                    mvesin   2016-04-01 14:15:55 R besteffo
nef-devel2$ oarstat -fj235527 | grep resubmit
    resubmit_job_id = 235525
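Because $OAR_ARRAY_ID and $OAR_ARRAY_INDEX survive a resubmission, they can serve as a key to avoid redoing work that already completed. This is one possible approach, not an OAR feature: the run_once helper and the marker directory are hypothetical, and the directory should live on a filesystem shared by all nodes.

```shell
#!/bin/bash
# Sketch: skip work already completed by an earlier run of the same array member.
# The marker directory and the run_once name are hypothetical.
MARKER_DIR=${MARKER_DIR:-$HOME/.oar_done}

run_once() {
    # $OAR_ARRAY_ID and $OAR_ARRAY_INDEX are unchanged after a resubmission,
    # unlike $OAR_JOB_ID, so they identify the unit of work.
    marker="$MARKER_DIR/${OAR_ARRAY_ID:-0}.${OAR_ARRAY_INDEX:-0}.done"
    if [ -e "$marker" ]; then
        echo "already done, skipping: $*"
        return 0
    fi
    # The marker is written only if the command succeeds,
    # so a killed or failed run is retried after resubmission.
    mkdir -p "$MARKER_DIR" && "$@" && touch "$marker"
}

# Usage inside the job script, with your own command:
#   run_once ./mycode abcde xyzt
```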
How can i pass command line arguments to my job ?
oarsub does not have a command line option for this but you can pass parameters directly to your job, eg :
oarsub [-S] "./mycode abcde xyzt"
and then in ./mycode check $1 (abcde) and $2 (xyzt) variables, in the language specific syntax. Example :
# Submission script ./mycode
#
# Comments starting with #OAR are used by the resource manager if "oarsub -S"
#OAR -p cputype='xeon'
# pick first argument (abcde) in VAR1
VAR1=$1
# pick second argument (xyzt) in VAR2
VAR2=$2
# Place here your submission script body
echo "var1=$VAR1 var2=$VAR2"
Another syntax for that :
oarsub [-S] "./mycode --VAR1 abcde --VAR2 xyzt"
and then in ./mycode use options parsing in the language specific syntax.
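For a shell script, the options parsing could look like the sketch below. It handles only the two option names used in the example above; the parse_args function name is an assumption of this illustration.

```shell
#!/bin/bash
# Sketch of ./mycode parsing "--VAR1 value --VAR2 value" style arguments.
parse_args() {
    VAR1=
    VAR2=
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --VAR1|--VAR2)
                # each option needs a value after it
                if [ "$#" -lt 2 ]; then
                    echo "missing value for $1" >&2
                    return 1
                fi
                if [ "$1" = "--VAR1" ]; then VAR1=$2; else VAR2=$2; fi
                shift 2
                ;;
            *)
                echo "unknown argument: $1" >&2
                return 1
                ;;
        esac
    done
}

parse_args "$@" && echo "var1=$VAR1 var2=$VAR2"
```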
If you do not use the -S option of oarsub then you may prefer to use shell environment variables, eg :
oarsub -l /nodes=2/core=4 "env VAR1=abcde VAR2=xyzt ./myscript.sh"
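On the script side, the variables are then read from the environment. A minimal sketch of what ./myscript.sh could contain, assuming VAR1 is mandatory and VAR2 optional (that split, the show_params name and the "default" fallback value are choices made for this example):

```shell
#!/bin/bash
# Sketch of ./myscript.sh reading its parameters from the environment.
show_params() {
    # VAR1 is treated as mandatory in this example
    if [ -z "${VAR1:-}" ]; then
        echo "VAR1 is not set" >&2
        return 1
    fi
    # ${VAR2:-default} falls back to a default value when VAR2 is unset or empty
    echo "var1=$VAR1 var2=${VAR2:-default}"
}

show_params || echo "usage: env VAR1=... [VAR2=...] $0" >&2
```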
What is a dedicated node ?
A dedicated node is a node for which a limited number of cluster users (eg: a research team) has privileged access (usually because it funded the node). Other cluster users can only submit besteffort jobs to this node and cannot use the additional local storage (under /local).
Check the node properties to see whether a node is dedicated :
- property dedicated has value NO for a common node
- property dedicated has value groupname for a node dedicated to groupname
How do i use a dedicated node ?
No specific option is required, just describe the requested resources. For example, to submit an interactive besteffort queue job reserving one gpu with GPU capability 5.0 or higher :
oarsub -p "gpu='YES' and gpucapability>='5.0'" -t besteffort -l /gpunum=1 -I
To specifically request the dedicated resources of groupname use -p "dedicated='groupname'". In this case you may prefer to use the dedicated queue rather than the default queue. For example, to submit an interactive dedicated queue job reserving one gpu node from the asclepios team use :
oarsub -q dedicated -p "gpu='YES' and dedicated='asclepios'" -l /nodes=1 -I
If you use -q dedicated and don't have access to matching dedicated resources, you'll (of course) get a Not enough resources error message at job submission.
How do i share a node reservation with other users ?
A timesharing (-t timesharing) job reserves resources that can be accessed at the same time by all authorized users (hint : reserve full nodes to avoid unexpected behaviours unless you're an expert user).
All users of the timesharing node have access to all the reserved resources (cores, memory, GPUs) and should coordinate to avoid conflict or over-usage.
Who are the authorized users to a timesharing node ?
- standard case : all the cluster users
- dedicated node : only the privileged users for this node (no simultaneous besteffort)
When all the jobs sharing the node have finished the resources are freed.
Example 1 :
# shared interactive use of a node during 4 hours for a hands on session :
# check for a free node, and name the node to be sure you all access the same node
#
# simplest case: all users access with the same command
oarsub -t 'timesharing=*,*' -p "host='nef111.inria.fr'" -l /nodes=1,walltime=4:0:0 -I
# the standard node and resources can be accessed by all cluster users during this time
Example 2 :
# advance reservation of nefgpu09 dedicated node during one week starting
# at 2017/03/11 8AM, for shared usage :
oarsub -r "2017-03-11 8:00:00" -p "gpu='YES' and host='nefgpu09.inria.fr'" -l /nodes=1,walltime=7:0:0:0 -t 'timesharing=*,*'
#
# the dedicated node and resources can be accessed by privileged users during this time eg :
oarsub -t 'timesharing=*,*' -p "gpu='YES' and host='nefgpu09.inria.fr'" -l /nodes=1,walltime=2 -I
oarsub -t 'timesharing=*,*' -p "gpu='YES' and host='nefgpu09.inria.fr'" -l /nodes=1,walltime=4 /path/to/script
# etc.
Pay attention to the walltime in the subsequent oarsub : if requesting longer access than currently reserved, the job will start only if the reservation can be extended. The simple way is to submit subsequent jobs with a walltime expiring before the initial job.
Do not confuse timesharing (several jobs have simultaneous access to a set of reserved resources) and container (kind of scheduling-in-scheduling, but each inner job has dedicated resources).
How to choose GPU resources for multi GPU jobs ?
A multi-GPU job has better performance when the reserved GPUs are connected by a high speed data path. On a GPU node, check the data path between GPUs with :
nvidia-smi topo -m
Recommended usage is :
- choose GPUs from the same host and
- with a high speed connection (eg: PHB PXB PIX but not SOC).
Minimal reasonable multi-GPU request example :
# Request 2 GPUs from same host
oarsub -p "gpu='YES'" -l /nodes=1/gpunum=2 -I
Advanced resource request example :
# A person from the STARS team requests a pair of GPU cards from one of their dedicated nodes.
# Wants either the gpudevice pair 0/1 or 2/3 which are on same PCIe host bridge (PHB).
oarsub -p "gpu='YES' and dedicated='stars'" -l "{ gpudevice=0 or gpudevice=1 }/nodes=1/gpunum=2" -l "{ gpudevice=2 or gpudevice=3 }/nodes=1/gpunum=2" -I
How to use CPU hyperthreading ?
CPU hyperthreading is available only on the newest nef nodes (see hardware description for node details). Hyperthreading is permanently enabled on these nodes.
When reserving a CPU core with OAR, you are always assigned both threads from this CPU core, without a specific OAR request syntax. Each thread appears as one OAR resource, so you are assigned two resources by core.
Example : reserve 1 core (2 threads) from nefgpu12 in besteffort :
[nef-frontal $] oarsub -t besteffort -p "host='nefgpu12.inria.fr'" -l /core=1 -I
# we are assigned 2 threads from core number 2261
[nefgpu12 ~]$ cat $OAR_RESOURCE_PROPERTIES_FILE | sed -e 's/.*\(thread = .[0-9]*. \).*\(core = .[0-9]*.\).*/\1\2/g'
thread = '0' core = '2261'
thread = '1' core = '2261'
From the developer's point of view, each thread appears to programs as a logical processor.
How to Interact with a job when it is running
How do i submit a job and wait for it to terminate ?
From nef-devel or nef-devel2, use the oarctl sub command instead of oarsub to submit your job.
With array jobs, this command waits for all the jobs of the array to terminate.
It uses internally the exec notifications of oarsub. The notifications received are prefixed by the date.
As for every long running command, it is safer to protect it from a disconnection by starting it with either nohup or from within a screen, tmux or VNC session.
Examples:
- A simple array job:
oarctl sub --array 2 -l /core=1,walltime=00:01:00 /bin/true
[ADMISSION RULE] Modify resource description with type constraints
[ADMISSION RULE] Automatically add constraint to go on nodes permitted for the user.
Simple array job submission is used
OAR_JOB_ID=12319955
OAR_JOB_ID=12319956
OAR_ARRAY_ID=12319955
2022-03-28 13:28:57: Waiting job notifications
2022-03-28 13:29:06: 12319955 RUNNING Job is running.
2022-03-28 13:29:06: 12319956 RUNNING Job is running.
2022-03-28 13:29:53: 12319956 END Job stopped normally.
2022-03-28 13:29:53: 12319955 END Job stopped normally.
fm@nef-devel 2022-03-28 13:29:53 ~
- The same using nohup:
nohup oarctl sub --array 2 -l /core=1,walltime=00:01:00 /bin/true >& oarctl.out &
- The same with the output of oarctl sent by mail. $LOGNAME is a valid mail destination on NEF.
nohup bash -c "oarctl sub --array 2 -l /core=1,walltime=00:01:00 /bin/true |& mail -s 'oarctl test' $LOGNAME " >& /dev/null &
A real use case has been to launch through nohup a script that:
- submits an array job with oarctl sub, using some initial parameters
- iterates:
  - analyses the results of the previous jobs to compute new parameters
  - submits an array job with oarctl sub, using those new parameters
This command is specific to NEF.
See also the --anterior option of oarsub.
How do i connect to the nodes of my running job ?
Use oarsub -C jobid to start an interactive shell on the master node of the job jobid, or use OAR_JOB_ID=jobid oarsh hostname to connect to any node of the job.
To get the list of the job nodes, run cat $OAR_NODE_FILE and then use oarsh hostname to connect to the other job nodes.
Other useful commands : oarcp to copy files between the nodes' local filesystems, oarprint to query the resources allocated to the job (eg : oarprint host for the list of the hostnames your job is running on).
Please note that direct ssh to the nodes is not allowed; use oarsh, which is a wrapper around ssh.
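The node file can also be summarised per host. The sketch below rests on an assumption that is not stated above: that the file holds one hostname per line, repeated once per reserved resource. The helper name cores_per_host is hypothetical.

```shell
#!/bin/bash
# Sketch: summarise the resources of a job from its node file.
# Assumes one hostname per line, repeated once per reserved resource.
cores_per_host() {
    # prints "hostname count" for each distinct host in file $1
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a job, print the summary; outside OAR the variable is unset and nothing happens
if [ -r "${OAR_NODE_FILE:-}" ]; then
    cores_per_host "$OAR_NODE_FILE"
fi
```

The first column of the output gives the hostnames that can be passed to oarsh.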
In which state is my job ?
The oarstat -j jobid command shows the state of job jobid and the queue in which it has been scheduled.
Example for jobid 1839 :
nef-frontal$ oarstat -j 1839
Job id     Name           User           Submission Date     S Queue
---------- -------------- -------------- ------------------- - ----------
1839       TEST_OAR       rmichela       2015-08-21 17:49:08 T default
- the S column gives the current state (Waiting, Running, Launching, Terminating).
- the Queue column shows the job's queue
The -f option gives full information about the job, and --array prints information for a whole array.
You can use SQL syntax for advanced queries, example :
oarstat --sql "job_user='rmichela' and state='Terminated'"
When will my job be executed ?
oarstat -fj jobid | grep scheduledStart gives an estimate of when your job will start.
How can i get the stderr or stdout of my job during its execution ?
oarpeek jobid shows the stdout of jobid and oarpeek -e jobid shows the stderr.
How can i cancel a job ?
oardel jobid cancels job jobid.
How to know my Karma priorities ?
To see the Karma associated to one of your currently running jobs :
- use
oarstat -f -j jobid | grep Karma
- or use Monika and click on jobid to view the job details
This gives your Karma for this job's queue at the time of the job submission.
If you want more details, the command oarstat -u login --accounting "YYYY-MM-DD, yyyy-mm-dd"
shows your resource consumption between two dates. The indicated Karma is the one of your last submitted job. To see the details of your resource consumption for a given queue use oarstat -u login --sql "queue_name = 'queue' " --accounting "YYYY-MM-DD, yyyy-mm-dd"
To see your time window used for Karma calculation use :
- yyyy-mm-dd = tomorrow
- YYYY-MM-DD = ( yyyy-mm-dd - 30 days )
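The two dates can be computed rather than typed by hand. The sketch below assumes GNU date (for the -d option); karma_window is a hypothetical helper name, and its optional argument (an end date) exists only to make the example reproducible.

```shell
# Hypothetical helper: print "YYYY-MM-DD, yyyy-mm-dd" for the Karma window,
# i.e. (end - 30 days) and end, where end defaults to tomorrow.
# Assumes GNU date.
karma_window() (
  end=$(date -d "${1:-tomorrow}" +%F)
  start=$(date -d "$end -30 days" +%F)
  printf '%s, %s\n' "$start" "$end"
)
```

The result can be passed directly to the accounting option, eg oarstat -u login --accounting "$(karma_window)".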
Software
How to use an environment module in a job ?
To use an environment module module_name in a batch job, add the following lines in your submission script (the script used in oarsub MyScript
) :
# Submission script MyScript - excerpt
source /etc/profile.d/modules.sh
module load module_name
# Commands using the module are after loading the module
Typing module load module_name
on a frontend node or in an interactive job session sets up the environment module for this session only (not for submitted jobs).
How to run an OpenMPI application?
The mpirun
binary included in openmpi runs the application using the resources reserved by the job.
Submission script for OpenMPI : monAppliOpenMPI.sh
The openmpi 2.0.0 version installed on nef is patched to automatically discover the resources of your job, so you don't have to specify a machinefile.
#!/bin/bash
# File : monAppliOpenMPI.sh
#OAR -l /nodes=3/core=1
source /etc/profile.d/modules.sh
module load mpi/openmpi-2.0.0-gcc
mpirun --prefix $MPI_HOME monAppliOpenMPI
In this case, mpirun
will start the MPI application on 3 nodes with a single core per node.
If you are using the main openmpi module (mpi/openmpi-x86_64), you have to add the parameters manually :
module load mpi/openmpi-x86_64
mpirun -mca btl_openib_pkey 0x8108 -mca plm_rsh_agent oarsh --prefix $MPI_HOME -machinefile $OAR_NODEFILE monAppliOpenMPI
How to run an Intel MPI application?
The Intel compiler and MPI implementation are installed on nef. To run an MPI job:
#!/bin/bash
#OAR -l /nodes=3/core=1
source /etc/profile.d/modules.sh
module load mpi/intel64-5.1.1.109
mpirun -machinefile $OAR_NODEFILE monAppliIntelMPI
How can i use BLAS (ATLAS, OPENBLAS ...) ?
The recommended version is Openblas (atlas or netlib blas are much slower) or the MKL from Intel. Several versions of openblas are available: sequential (-l openblas64), pthread (-l openblasp64) or openmp (-l openblaso64)
For example to use the sequential version of openblas (recommended if your application is already multithreaded/parallel):
gcc -I/usr/include/openblas myblas.c -l openblas64
How can i use FreeFem++ ?
A more complete version than the default system version of FreeFem++ is available :
module load mpi/openmpi-x86_64
module load freefem++/3.62
# example program
mpirun -mca btl_openib_pkey 0x8108 -mca plm_rsh_agent oarsh --prefix $MPI_HOME -machinefile $OAR_NODE_FILE -np 4 $FREEFEM_PATH/bin/FreeFem++-mpi $FREEFEM_PATH/share/freefem++/3.62/examples++-mpi/testsolver_MUMPS.edp
How to use gcc GPU offloading for better performance ?
The gcc compiler supports OpenACC and OpenMP offloading to Nvidia GPUs (NVPTX targets).
Nvidia HPC SDK GPU offloading often performs better than gcc GPU offloading, by up to a factor of 10.
For OpenACC :
- extend your code for OpenACC support with
#pragma acc
clauses
- compile with offloading support :
module load gcc-nvptx/9.2.0
g++ -fopenacc -fopt-info-optimized-omp -foffload="-O3" your_compile_cmd_opts
For OpenMP :
- extend your code for OpenMP offloading support with
#pragma omp target
clauses
- compile with offloading support :
module load gcc-nvptx/9.2.0
g++ -fopenmp your_compile_cmd_opts
Then for OpenACC and OpenMP :
- run on a GPU node with GPU memory ECC enabled (important) :
oarsub -p "gpu='YES' and gpuecc='YES' and gpucapability>='5.0'" -I
module load gcc-nvptx/9.2.0
your_test_code
How to use Nvidia HPC SDK GPU offloading for better performance ?
Nvidia HPC SDK compiler supports GPU offloading for acceleration.
Nvidia HPC SDK supports OpenACC offloading to Nvidia GPUs (NVPTX targets) :
- extend your code for OpenACC support with
#pragma acc
clauses
- compile with offloading support :
module load nvhpc/20.11
nvc++ -acc -Minfo=all your_compile_commands_and_options
- run on a GPU node with GPU memory ECC enabled (important) :
oarsub -p "gpu='YES' and gpuecc='YES' and gpucapability>='5.0'" -I
module load nvhpc/20.11
your_test_code
The nvc++
compiler's OpenACC strategies have changed since PGI 19.10. In some cases this may result in a performance decrease. Hint : in this case, look at the PGI 19.10 compiler optimization messages and add these optimizations to your code.
As an alternative to OpenACC, the Nvidia HPC SDK compiler also supports GPU offloading of C++17 parallel code using -stdpar
:
module load nvhpc/20.11
nvc++ -stdpar your_compile_commands_and_options
What has the PGI compiler become ?
PGI compiler is now obsolete and replaced by Nvidia HPC SDK. For example nvc++
replaces pgc++
.
If you need to test PGI for legacy purpose, here is an example of OpenACC offloading to Nvidia GPUs (NVPTX targets) :
- extend your code for OpenACC support with
#pragma acc
clauses
- compile with offloading support :
module load pgi/19.10
pgc++ -acc -Minfo=all your_compile_commands_and_options
- run on a GPU node with GPU memory ECC enabled (important) :
oarsub -p "gpu='YES' and gpuecc='YES' and gpucapability>='5.0'" -I
module load pgi/19.10
your_test_code
How to use AVX for better performance ?
Vector instructions on recent Intel CPUs increase performance by executing each instruction on a wider data operand (256 bits for AVX/AVX2 and 512 bits for AVX-512, versus 64/128 bits for the base instruction set or previous CPUs).
To use vector instruction acceleration you need :
- hardware support from the node (eg: check with
lscpu | tr ' ' '\n' | grep avx
)
- support from the libraries used by your code (eg: OpenBlas 0.2 supports up to AVX2, Intel MKL supports up to AVX-512)
- support from your compiler (eg: gcc >= 4.6 for AVX2, gcc >= 4.9 for AVX-512)
- support from your code (eg:
-mfma -mavx512f -mavx512cd
compiler options for automatic AVX-512 vectorization by gcc for Skylake nodes, more info here)
Vector instruction sets reduce backward portability, eg : a code compiled with AVX-512 instructions will fail on a non AVX-512 capable node (Illegal instruction
).
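To avoid an Illegal instruction at run time, a job script can check the node before choosing a binary. The sketch below is one possible approach, assuming a Linux /proc/cpuinfo with a "flags" line; avx_level is a hypothetical helper name.

```shell
# Hypothetical helper: print the widest AVX family advertised by the CPU
# (avx512, avx2, avx or none). Reads a cpuinfo-style file, /proc/cpuinfo by default.
avx_level() (
  flags=$(grep -m1 '^flags' "${1:-/proc/cpuinfo}")
  # Pad with spaces so that each flag can be matched as a whole word.
  case " $flags " in
    *" avx512f "*) echo avx512 ;;
    *" avx2 "*)    echo avx2 ;;
    *" avx "*)     echo avx ;;
    *)             echo none ;;
  esac
)
```

A submission script could then pick, for example, an AVX-512 build only when avx_level prints avx512, and fall back to a more portable build otherwise.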
Performance tradeoffs for vectorization include :
- a newer vector extension usually means better performance for vector computation oriented workloads (perf AVX-512 > perf AVX/AVX2 > perf legacy)
- however exceptions exist, such as the Xeon Silver C6420 on NEF where AVX2 gives optimal performance due to hardware capabilities (2 AVX2 FMA, 1 AVX-512 FMA)
- a rough gain estimate in a typical scenario when doubling instruction size (legacy to AVX/AVX2, AVX/AVX2 to AVX-512) would be ~50% rather than 100%, due to reduced CPU frequency and hardware bottlenecks
How can i run a docker container ?
Docker is not available as it requires (in)direct user root privileges, but singularity is the alternative container technology proposed.
How can i run a singularity container ?
Singularity is a container technology which can import a docker image and adds native GPU support.
Please read this blog post with quickstart guidelines for using singularity on Nef or the full singularity documentation.
Example : download a container image from docker hub (docker://). Image is newest (Ubuntu) tensorflow with GPU support (tensorflow/tensorflow:latest-gpu). Convert to singularity image and save to file (pull). Launch singularity image from saved file with interactive console (shell), requesting GPU support in singularity (--nv) and /data
filesystem availability (-B /data) :
module load singularity/3.5.2
singularity pull docker://tensorflow/tensorflow:latest-gpu
singularity shell -B /data --nv ./tensorflow-latest-gpu.simg
You can use pre-built containers from several registries : the docker hub (singularity pull docker://), the singularity hub (singularity pull shub://), Nvidia NGC, etc. NGC provides containers built and optimized by Nvidia for Nvidia GPUs with usage guidelines for singularity. To use NGC :
- follow the NGC getting started guide for signup and API key generation (credentials to access containers)
- keep your API key safe (eg : save it to
~/.ssh/ngcapikey
and chmod go-rwx ~/.ssh/ngcapikey
)
- configure your environment for NGC access and download a container (example : tensorflow 19.01 for python3)
module load singularity/3.5.2
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=$(cat ~/.ssh/ngcapikey)
# one time : download the container
singularity pull docker://nvcr.io/nvidia/tensorflow:19.01-py3
# each run : launch the downloaded container
singularity shell -B /data --nv ./tensorflow-19.01-py3.simg
On your laptop with root privileges you can also build your own singularity container files, copy them to Nef and run them. If converting your own docker container to singularity container file on your laptop, you can :
- either create a local docker registry on your laptop, pull from that registry
- or use docker2singularity
Building containers on a public container registry is technically possible but raises privacy issues for non-public data or software.
On your laptop you can also convert singularity containers back to docker, eg with singularity2docker.
Speed up singularity image creation using gitlab
A "comfortable" option is to create the image via the gitlab CI/CD. It is an investment that pays for itself.
0a) Setting up the project
- Go to https://gitlab.inria.fr/ and create a project
- In the general settings (visibility, project features, permissions) :
  - activate the CI/CD of the project
  - activate the container registry
- Create an access token with the maintainer role and all the permissions (don't forget to note it)
- In the CI/CD variables, create an API_TOKEN variable with the value noted above
- In the CI/CD settings, under runners :
  - among the available specific runners, choose the runner tagged docker and kaniko
  - press "activate for this project"
- In settings -> package and registry, activate the cleanup :
  - run it every day
  - keep the 5 most recent tags by name
  - keep the .*-stable tags
  - delete tags older than 14 days
  - delete tags that match .*
0b) Create the dockerfile
In the repository create a folder "images" and inside it a folder "myapp"
In the myapp folder, create a file named Dockerfile
Writing a Dockerfile is easier than some people think. For example, with the following Dockerfile :
FROM quay.io/centos/centos:stream9
RUN dnf -y install R-core
RUN dnf -y install git
COPY myapp /usr/local/bin/myapp
RUN chmod ug+x /usr/local/bin/myapp
the image is based on a CentOS Stream 9 distribution, R and git are installed in it, then the myapp file from images/myapp (the application you are developing) is copied into it.
To note:
- You could also start on an ubuntu, but the package names may be different.
- It's a bad practice to copy the data in the image (a bind mount is better).
- If you use apptainer because you have dependency problems or want to change the base system (eg ubuntu), it may not even be necessary to copy the application you have developed into the container.
0c) Setting up the ci/cd
In the CI/CD editor, copy :
stages:
  - build-myapp

variables:
  DOCKER_HOST: tcp://dockerdaemon:2375/
  DOCKER_DRIVER: overlay2
  DOCKER_TLS_CERTDIR: ""

build-myapp:
  stage: build-myapp
  rules:
    - if: "$CI_COMMIT_TAG"
  image:
    name: registry-sam.inria.fr:5000/hub.docker.com/library/docker:20.10.22-dind
  services:
    - name: docker:20.10.22-dind
      alias: dockerdaemon
  # line breaks are important for output readability
  script:
    - docker login -u "private_token" -p "$API_TOKEN" "$CI_REGISTRY"
    - docker build -t "${CI_REGISTRY_IMAGE}/myapp:${CI_COMMIT_TAG}" -t "${CI_REGISTRY_IMAGE}/myapp:latest" --cache-from "${CI_REGISTRY_IMAGE}/myapp:latest" images/myapp/
    - docker push "${CI_REGISTRY_IMAGE}/myapp:${CI_COMMIT_TAG}"
    - docker push "${CI_REGISTRY_IMAGE}/myapp:latest"
0d) Various things to note
The rule - if: "$CI_COMMIT_TAG" causes the image to be rebuilt and pushed only when you create a new tag (in Repository -> Tags -> New tag).
I recommend using incremental numbers, and adding -stable after the number if the image works very well.
1) Getting an image
From the gitlab inria registry :
singularity pull docker://registry.inria.fr/myteam/myapp:latest
From dockerhub (this is the default repository, no need to specify the server) :
singularity pull docker://tensorflow/tensorflow:latest-gpu
Via the proxy cache registry of Sophia (you have to prefix the project/image with registry-sam.inria.fr:5000/hub.docker.com/) :
singularity pull docker://registry-sam.inria.fr:5000/hub.docker.com/tensorflow/tensorflow:latest-gpu
If no project is specified for the image (ie caddy instead of caddy/caddy), the project name is library (in dockerhub, the default project is library) :
singularity pull docker://registry-sam.inria.fr:5000/hub.docker.com/library/caddy
Speed up singularity downloads using caches
Two caches can be used to speed up image downloads.
1) The proxy cache
Use the proxy cache registry of Sophia : the image is downloaded only once from the internet, then it is available on the local network for all nodes (you have to prefix the project/image with registry-sam.inria.fr:5000/hub.docker.com/) :
singularity pull docker://registry-sam.inria.fr:5000/hub.docker.com/tensorflow/tensorflow:latest-gpu
2) The file cache
Once the image is on the network, each node must have a local copy. It is stored in $HOME/.apptainer/cache by default.
However, it is possible to share the cache within a team :
echo "export APPTAINER_CACHEDIR=/data/$(id -gn $USER)/share/.apptainer/cache" >> ~/.bashrc
Note : by default the cache never expires, so it is recommended to clean it regularly :
apptainer cache clean --days 30
Cache performance
With a ~3Gb image :
- 7m14.828s : without proxy cache
- 4m43.576s : cache on /home and proxy cache
- 3m25.574s : cache on /data and with proxy cache (shared cache)
- 0m0.182s : image already present
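To compare such timings, they can be converted to seconds and divided. This is a small sketch using awk; to_seconds and speedup are hypothetical helper names, and the input format is assumed to be the XmY.ZZZs form shown above.

```shell
# Hypothetical helper: convert a duration such as "7m14.828s" to seconds.
to_seconds() {
  awk -v t="$1" 'BEGIN {
    split(t, parts, "m")        # parts[1] = minutes, parts[2] = seconds plus "s"
    sub(/s$/, "", parts[2])
    printf "%.3f\n", parts[1] * 60 + parts[2]
  }'
}

# Hypothetical helper: how many times faster the second duration is than the first.
speedup() {
  awk -v b="$(to_seconds "$1")" -v n="$(to_seconds "$2")" 'BEGIN { printf "%.2f\n", b / n }'
}
```

With the figures above, speedup 7m14.828s 3m25.574s shows that the shared cache on /data combined with the proxy cache is roughly twice as fast as a download without any proxy cache.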
How can i install a python package with pip ?
You can install a python package with pip specifying --user option (installation for your account only, no admin privileges are required) :
pip install --user package
It is recommended to always use a specific conda or python environment.
How can i use a specific environment with conda ?
Conda is a widespread tool for installing your own packages and managing virtual environments. Conda is not limited to python packages and includes environment exporting/sharing.
Create and use a conda virtual environment named virt_conda using python 3.9 by default :
module load conda/2021.11-python3.9
eval "$(conda shell.${0#-} hook)"
# create and use a conda virtual environment
conda create --name virt_conda
conda activate virt_conda
# install a package in virtual environment
conda install package
# leave the conda virtual environment
conda deactivate
For subsequent uses :
# activate the conda environment
module load conda/2021.11-python3.9
eval "$(conda shell.${0#-} hook)"
conda activate virt_conda
# leave the conda virtual environment
conda deactivate
The conda/2020.11-python3.8 module is also available.
How can i use a specific environment with an older conda version ?
Conda is a widespread tool for installing your own packages and managing virtual environments. Older conda versions like 5.0.1 are for legacy support.
Create and use a conda virtual environment named virt_conda using python 3.6 by default :
module load conda/5.0.1-python3.6
# create and use a conda virtual environment
conda create --name virt_conda python=3.6
source activate virt_conda
# install a package in virtual environment
conda install package
# leave the conda virtual environment
source deactivate virt_conda
How can i use a specific conda and python environment ?
Create a pip/python environment inside a conda environment.
module load conda/2020.11-python3.8
eval "$(conda shell.${0#-} hook)"
# create and use a conda virtual environment
conda create --name virt_conda
conda activate virt_conda
# install the pip and update python
conda install pip
To choose an alternate version of pip and python replace the last line with :
# require pip 20.2.3 and python 3.8
conda install pip=20.2.3=py38_0
For subsequent uses :
# activate the conda environment
module load conda/2020.11-python3.8
eval "$(conda shell.${0#-} hook)"
conda activate virt_conda
# install a conda package in the environment
conda install package
# install a pip package in the environment
pip install package
# leave the conda virtual environment
conda deactivate
Check the good practices for pip in conda environments in particular :
- no
pip --user
in conda environments
- as many requirements as possible with conda then use pip
- to install additional conda packages, it is best to recreate the environment
How can i use a specific python environment ?
Create a python3 virtual environment named virt_py3 :
python3 -m venv virt_py3
cd virt_py3
source ./bin/activate
# install a pip package in the environment
pip install package
# leave the environment
deactivate
How can i use caffe ?
First you have to use a node with a GPU (it should be much faster with a GPU), for example:
oarsub -I -p "gpu='YES' and gpucapability>='5.0'" -l /gpunum=1
Method 1 : Then you have to load the cuda and caffe modules:
source /etc/profile.d/modules.sh
module load cuda/10.0
module load cudnn/7.4-cuda-10.0
module load caffe/0.17-cuda-10.0
$CAFFE_HOME/build/tools/caffe
For importing caffe in python you need to install additional packages (numpy, scikit-image, protobuf), eg in a virtual environment.
Method 2 : Alternatively you can use a container based distribution of caffe.
Please read the guidelines to setup your singularity environment and setup your NGC environment if needed.
Example : NVidia caffe container from NGC
module load singularity/3.5.2
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=$(cat ~/.ssh/ngcapikey)
# one time : download the container
singularity pull docker://nvcr.io/nvidia/caffe:19.01-py2
# each run : launch the downloaded container
singularity shell -B /data --nv ./caffe-19.01-py2.simg
How can i use legacy torch ?
Method 1 : To use the system pre-installed version of torch, load the torch module :
module load torch/7
# launch the torch interactive session, etc.
th
You can install your own additional packages in your homedir to extend the system installation of torch :
luarocks --tree=$HOME/.luarocks install packagename
Method 2 : Alternatively, if the system pre-installed version of torch does not match your customization needs, you can install your own torch version as explained in the torch installation process, except you need to ignore the install-deps
step.
Method 3 : Alternatively you can use a container based distribution of torch. Please read the guidelines to setup your singularity environment and setup your NGC environment if needed.
Example : torch container from NGC
module load singularity/3.5.2
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=$(cat ~/.ssh/ngcapikey)
# one time : download the container
singularity pull docker://nvcr.io/nvidia/torch:18.08-py2
# each run : launch the downloaded container
singularity shell -B /data --nv ./torch-18.08-py2.simg
How can i use pytorch ?
Method 1 : You can tailor pytorch conda installation guidelines.
Example of a conda based installation (tested for pytorch 1.3.1 with cuda 9.2 and torchvision 0.4.2) :
module load conda/5.0.1-python3.6
conda create --name virt_pytorch_conda python=3.6
source activate virt_pytorch_conda
conda install pytorch torchvision cudatoolkit=9.2 -c pytorch
Method 2 : Alternatively you can use a container based distribution of pytorch. Please read the guidelines to setup your singularity environment and setup your NGC environment if needed.
Example : container 19.02 with pytorch 1.1.0 from NGC
module load singularity/3.5.2
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=$(cat ~/.ssh/ngcapikey)
# one time : download the container
singularity pull docker://nvcr.io/nvidia/pytorch:19.02-py3
# each run : launch the downloaded container
singularity shell -B /data --nv ./pytorch-19.02-py3.simg
Caveat: container 19.02 is the newest version supporting nvidia driver version 410, installed on Nef when writing this FAQ. Container 19.02 provides pytorch 1.1.0, thus newer versions of pytorch are currently not supported on Nef with this method.
Method 3 : Alternatively you can use a pytorch 1.4.0 version compiled for nef, built with GPU support, recent CPU support (avx-512), python 3.6, cuda 9.2
module load conda/5.0.1-python3.6
conda create --name virt_pytorch
source activate virt_pytorch
module load cuda/9.2
module load cudnn/7.1-cuda-9.2
module load gcc/7.3.0
module load mpi/openmpi-2.0.0-gcc
module load pytorch/1.4.0
Method 4 : Alternatively you can build your own pytorch version from sources by tuning the pytorch install from sources documentation.
Example guidelines for building 1.4.0 with cuda 9.2
module load conda/5.0.1-python3.6
conda create --name virt_pytorch_source
source activate virt_pytorch_source
module load cuda/9.2
module load cudnn/7.1-cuda-9.2
module load gcc/7.3.0
module load cmake/3.10.1
module load mpi/openmpi-2.0.0-gcc
conda install -c pytorch magma-cuda90
git clone --recursive https://github.com/pytorch/pytorch
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
cd pytorch
export CMAKE_LIBRARY_PATH=$CUDNN_LIB_DI
python setup.py install --prefix=/path/to/my/install/dir
# then
export PYTHONPATH=/path/to/my/install/dir/lib/python3.6/site-packages
How can i use LibTorch ?
LibTorch is the pytorch C++ distribution. You can tailor the pytorch installation guidelines.
Example of an installation (tested for pytorch 1.4.0 with gpu support for cuda 9.2) :
wget https://download.pytorch.org/libtorch/cu92/libtorch-shared-with-deps-1.4.0%2Bcu92.zip
unzip libtorch-shared-with-deps-1.4.0+cu92.zip
# these modules are needed at compile time
module load cmake/3.10.1 cuda/9.2
# these modules are needed at compile time (and runtime depending on linking opts)
module load cudnn/7.1-cuda-9.2 gcc/7.3.0
# compile and run your code eg the test from
# https://pytorch.org/cppdocs/installing.html
How can i use tensorflow?
Method 1 : A tensorflow version is available on nef, built from sources with GPU support, recent CPU support (eg avx2) and python3.
To set it up, run the following on a node with recent GPU and CPU support (oarsub -p "gpucapability >= '5.0'"
) for tensorflow 1.10 with cuda 9.2 :
pip install -U --user virtualenv
virtualenv -p python3.6 virt_tf
cd virt_tf
source ./bin/activate
module load cuda/9.2
module load cudnn/7.1-cuda-9.2
module load tensorflow/1.10.1-python3-cuda9.2
pip install numpy
Method 2 : Alternatively you can install a google-built version of tensorflow.
Older Nef GPUs may not be supported depending on the tensorflow version (eg: 1.4 requires -p "gpucapability >= '5.0'"
).
Tensorflow may not use all GPU and CPU hardware capabilities depending on the nodes, with a performance impact (eg: recent CPU node capabilities such as sse3/4,avx,avx2,fma are not used by tensorflow 1.0.1).
Example for tensorflow 2.0.0 :
pip install -U --user virtualenv
virtualenv -p python3.6 virt_tf
cd virt_tf
source ./bin/activate
module load cuda/10.0
module load cudnn/7.6-cuda-10.0
#
pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl
Method 3 : Alternatively you can use a container based distribution of tensorflow.
Please read the guidelines to setup your singularity environment and setup your NGC account if needed ; they contain example for tensorflow container from docker hub and NGC.
Method 4 : Alternatively you can follow tensorflow documentation for installation via conda/pip.
Example of installation for tensorflow 2.11 with GPU support:
# interactive login to a GPU node:
# oarsub -p "gpu='YES' and gpucapability>='5.0'" -l /gpunum=1,walltime=4 -I
conda create -n mytestenv python=3.9
conda activate mytestenv
conda install -c conda-forge cudatoolkit=11.2.2 cudnn=8.1.0
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
pip install --upgrade pip
pip install tensorflow==2.11.*
# quick check
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Example of usage of previous install:
# interactive login to a GPU node:
# oarsub -p "gpu='YES' and gpucapability>='5.0'" -l /gpunum=1,walltime=4 -I
conda activate mytestenv
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
# quick check
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
How can i use theano ?
Method 1 : You can tailor the theano CentOS6 installation.
Example of a conda based installation :
module load conda/5.0.1-python2.7
# make it clean : conda remove --name virt_theano --all
conda create --name virt_theano
source activate virt_theano
conda install numpy scipy mkl nose sphinx pydot-ng
module load cuda/10.0
module load cudnn/7.4-cuda-10.0
conda install theano pygpu
Using this installation from a GPU node :
module load conda/5.0.1-python2.7
source activate virt_theano
module load cuda/10.0
module load cudnn/7.4-cuda-10.0
# create gpu_tutorial1.py from example on http://deeplearning.net/software/theano/tutorial/using_gpu.html#gpuarray-backend
# use .theanorc for permanent flags
export MKL_THREADING_LAYER=GNU
THEANO_FLAGS="device=cuda0,floatX=float32,dnn.base_path=/misc/opt/cudnn/7.4-cuda-10.0" python gpu_tutorial1.py
Method 2 : Alternatively you can use a container based distribution of theano. Please read the guidelines to setup your singularity environment and setup your NGC environment if needed.
Example : theano container from NGC
module load singularity/3.5.2
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=$(cat ~/.ssh/ngcapikey)
# one time : download the container
singularity pull docker://nvcr.io/nvidia/theano:18.08
# each run : launch the downloaded container
singularity shell -B /data --nv ./theano-18.08.simg
# run an example in the container
export MKL_THREADING_LAYER=GNU
THEANO_FLAGS="device=cuda0,floatX=float32" python gpu_tutorial1.py
How can i use spark ?
Let's say you want to use spark on 4 nodes :
oarsub -I -l /nodes=4,walltime=3:0:0
This will reserve 4 nodes and start a shell on the first one (say nef107)
Then start the master:
./sbin/start-master.sh
Then you can start the slaves on three other nodes using oarsh ( the server URL in this case is spark://nef107.inria.fr:7077 ), like this:
for i in `uniq $OAR_NODEFILE | grep -v nef107`; do oarsh $i $HOME/spark-1.6.0-bin-hadoop2.6/sbin/start-slave.sh spark://nef107.inria.fr:7077 ; done
Then you can use spark, for ex. to run the sparkPi example:
export MASTER=spark://nef107.inria.fr:7077
./bin/run-example SparkPi
To connect remotely to the WebUI you need to start Inria VPN (with vpn.inria.fr/all) or use SSH tunneling through nef-frontal.inria.fr
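When building the list of nodes that should run the slaves, note that grep -v nef107 also discards any host whose name merely contains nef107 (eg nef1070). The sketch below is one way to exclude only the master; worker_hosts is a hypothetical helper name, and it assumes the node file holds one hostname per line, either short or fully qualified.

```shell
# Hypothetical helper: print the distinct hosts of a node file, except the master.
#   $1 = node file, $2 = short hostname of the master (eg nef107)
worker_hosts() {
  sort -u "$1" | awk -v m="$2" '{ split($0, p, "."); if (p[1] != m) print }'
}
```

In the loop above, the host list could then be obtained with worker_hosts "$OAR_NODEFILE" nef107 instead of the uniq and grep pipeline.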
How can i run a graphical software on a node using my laptop's screen ?
Method 1 : if connecting from a client machine outside Inria Sophia network : setup automatic ssh tunneling to nef-devel/nef-devel2 on your laptop by adding in ~/.ssh/config
:
Host nef-devel*.inria.fr
    ProxyCommand ssh -q nef-frontal.inria.fr nc %h %p
Then use the virtual desktop available on nef-devel/nef-devel2 for each user login :
- prerequisite on client laptop : a vncviewer client software
- if your vncviewer supports the
-via
option (eg tigervnc) : connect to the virtual desktop dedicated to your login :
[user@laptop $] vncviewer -via nef-devel.inria.fr vnc-login:0
- if your vncviewer does not support the
-via
option : set up a ssh tunnel first
[user@laptop $] ssh -N -L 5901:vnc-login:5900 nef-devel.inria.fr
[user@laptop $] vncviewer localhost:1
- a basic X11 graphical desktop appears, right click to launch a text terminal
- in the terminal, start an interactive job (eg
oarsub -I
) or connect to an existing job (eg oarsub -C jobid
)
- when the job starts, launch a X11 graphical command (eg
firefox
)
Method 2 : an alternative (simplest) way to launch a graphical application :
- prerequisite on client laptop : a X11 server (not native for Windows or Mac)
- drawback : very slow, thus only suited to light applications
- connect to a nef frontend with ssh, tunneling X11 graphics (eg
ssh -X nef-devel.inria.fr
)
- start an interactive job (eg
oarsub -I
) or connect to an existing job (eg oarsub -C jobid
)
- launch a X11 graphical command (eg
xterm
)
- nota : 3 steps in 1 with
ssh -X -t nef-devel.inria.fr 'oarsub -C jobid '
Method 2bis : X11 tunneling using jobkey (same as tunneling a port, but LocalForward
is not needed)
How can i tunnel a port from my laptop to a node ?
Tunneling a port from your laptop/workstation to a node can help connect to a server launched by your job on the node :
- if the server listens only on localhost (eg for security reasons)
- etc.
Example below describes forwarding of port 8080 from your client laptop to a node, and connecting to a web server :
- launched on node by the user's job
- listening on localhost:8080
Step 0 : (once before first use) add to ~/.ssh/config
on your client laptop (replace 8080 with the port used by your job)
Host *.neforward
    ProxyCommand ssh nef-frontal.inria.fr -W "$(basename %h .neforward):6667"
    LocalForward 8080 127.0.0.1:8080
    User oar
    Port 6667
    IdentityFile ~/.ssh/jobkey
Step 1 : submit your job generating a job key
[user@nef-devel $] oarsub -k -e ~/.ssh/jobkey -I # job can be interactive or batch
Step 2 : copy the job key on your client laptop
[user@laptop $] scp nef-frontal:~/.ssh/jobkey ~/.ssh/
[...]
Connect to OAR job jobid via the node node.inria.fr # note the node name
[user@node $]
Step 3 : launch tunnel (using node from step 2)
[user@laptop $] ssh node.neforward
Step 4 : use tunnel. In this example we access a web server on port 8080 on node
- in a browser on your laptop, connect to
http://localhost:8080
Another example using tunnel to use VirtualGL on GPU nodes can be found on this blog post.
How can i use DIGITS ?
A version of DIGITS is available on the cluster. You need to setup your environment before first use :
# load required modules
module load cuda/7.5
module load cudnn/5.1-cuda-7.5
module load caffe/0.14
module load torch/7
module load digits/6.1
# install required python packages in a virtual environment
virtualenv --system-site-packages virt_digits
cd virt_digits
source ./bin/activate
pip install scikit-image
pip install -r $DIGITS_ROOT/requirements.txt
pip install -U numpy
Then launch a server on a reserved node before each use :
# load required modules
module load cuda/7.5
module load cudnn/5.1-cuda-7.5
module load caffe/0.14
module load torch/7
module load digits/6.1
# enter virtual environment
cd virt_digits
source ./bin/activate
# launch a DIGITS dev server
cd $DIGITS_ROOT
./digits-devserver
Launch a browser on the reserved node and open http://localhost:5000 to connect to the DIGITS server.
The pre-installed version of DIGITS is not full featured (no Torch support, no data and visualization plugins). Alternatively you can install your own DIGITS version, adapting the build documentation for a non-root install in a python virtualenv.
What are the Matlab licences available ?
Matlab community licenses from Inria Sophia can be used on the cluster. They are shared with all the site's desktops and laptops.
How to compile my Matlab program and run it ?
Matlab compilation produces an application (or a standalone package that includes the application and a Matlab runtime) from a Matlab program, using either Matlab GUI or mcc command line.
The application can then run using a Matlab runtime. Matlab runtimes are installed in /opt/matlab<version>_runtime. You need to use the same Matlab version for the compiler and the runtime (eg: use a 2017a runtime for a 2017a compiled program).
# compile myprogram.m using CLI mcc
[user@nef012 $] /opt/matlab2018a/bin/mcc -m ./myprogram.m
#
# run the wrapper created by the compiler for the application
[user@nef012 $] ./run_myprogram.sh /opt/matlab2018a_runtime myprogram_params
A "Licensing error: -4,132" error message means the compiler licence is currently in use; retry later.
Caution : compiling with CLI mcc reserves the licence for the user during 30 minutes (linger time), and it cannot be released sooner. This is not the case with GUI compilation.
More generally, Matlab runtime can also be downloaded from Mathworks, so a compiled Matlab program can be run and distributed to people and platforms that do not have access to Matlab licenses.
Some toolbox functions are not supported by Matlab compilation.
What are the best practices for Matlab jobs ?
Before running long Matlab jobs or many Matlab jobs using the same code over time (eg: parameter sweeping), compile your Matlab program and run the compiled program. Running a compiled Matlab program does not require a Matlab license, so license tokens remain available for development activity.
If your Matlab program cannot be compiled : when launching many Matlab jobs at the same time, please launch them on as few nodes as possible. Matlab uses a floating licence per {node,user} couple. Eg :
- 10 jobs for user foo on 10 different cores of nef012 node use 1 floating license,
- 1 job for user foo on each of nef01[0-9] nodes use 10 floating licenses.
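The {node,user} rule above can be checked with a short Python sketch. The function name and the job list format are illustrative assumptions, not part of any Matlab or OAR tool; the sketch only counts distinct (node, user) couples.

```python
def floating_licenses(jobs):
    """Count floating licences for a list of (node, user) job placements.

    Assumes one floating licence per distinct {node,user} couple, as
    described above. `jobs` is a hypothetical list of tuples.
    """
    return len(set(jobs))


# 10 jobs for user foo on 10 cores of nef012: a single couple
same_node = [("nef012", "foo")] * 10
# 1 job for user foo on each of nef010..nef019: ten couples
spread = [(f"nef01{i}", "foo") for i in range(10)]

print(floating_licenses(same_node))
print(floating_licenses(spread))
```

Grouping jobs on one node keeps the count low, which is why packing jobs on as few nodes as possible is recommended.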
OAR container jobs may be useful.
Example : make a long reservation of a full node and launch many short mono-core jobs
# one day reservation of a full node (/path/to/loop-script is an idle wait/loop script)
-bash-4.2$ oarsub -t container -l /node=1,walltime=24 /path/to/loop-script
[...]
OAR_JOB_ID=3303953
[...]
# launch 200 matlab jobs on the reserved node
-bash-4.2$ oarsub --array 200 -t inner=3303953 -l /core=1,walltime=1 /path/to/matlab/job
How can i measure my python job energy consumption ?
experiment-impact-tracker (code and article) estimates energy consumption and carbon impact of a python program. It measures Intel CPU, Nvidia GPU and RAM energy consumption on a machine during program execution. If the machine is shared between multiple computations, it estimates the share of CPU and RAM consumption of a specific computation, and identifies its GPUs consumption. It derives a carbon impact depending on the carbon intensity of the geographic zone of the computation.
Before first execution, configure your environment for using experiment-impact-tracker :
export CONDA_ENV_NAME=impact-tracker
module load impact-tracker/0.1.4_2020-11
module load conda/5.0.1-python3.6
conda create -y --name $CONDA_ENV_NAME
source activate $CONDA_ENV_NAME
# 2020-11 : python 3.9 conflict with experiment-impact-tracker
conda install -y pip=20.2.4=py38_0
# optional : install additional conda packages eg :
# conda install tensorflow-gpu
pip install geopy bootstrapped ujson py-cpuinfo joblib numpy requests deepdiff scipy geocoder psutil bs4 pandas arrow seaborn shapely jinja2 pylatex progiter matplotlib pytest
# 2020-11 : manually patch broken link in geocoder pip package
ln -s /usr/lib64/libgeos_c.so.1 ~/.conda/envs/$CONDA_ENV_NAME/lib/libgeos_c.so
Before subsequent executions, activate the environment :
export CONDA_ENV_NAME=impact-tracker
module load impact-tracker/0.1.4_2020-11
module load conda/5.0.1-python3.6
source activate $CONDA_ENV_NAME
Then add these lines at the beginning of your program to track its energy consumption :
from experiment_impact_tracker.compute_tracker import ImpactTracker
import os
tracker = ImpactTracker(os.getcwd())
tracker.launch_impact_monitor()
Launch your program : tracking information is saved in the impacttracker subdirectory.
Generate an html report in the create-compute-appendix subdirectory, with the sample leaderboard_example.json format or your custom format file :
create-compute-appendix --site_spec $IMPACT_TRACKER/examples/leaderboard_example.json --output_dir ./create-compute-appendix ./impacttracker
Tips if you install your own version of experiment-impact-tracker :
- use a git clone rather than the pip package (not up to date, packaging glitches)
- check the git log of the installed version, we corrected a few bugs
Troubleshooting
Why is my job rejected at submission ?
The job system may refuse a job submission due to the admission rules. An explicit error message is then displayed; if in doubt, contact the cluster admin team.
Most of the time it indicates that the requested resources are not available, which may be caused by a typo (eg -p "cluster='dell6220'" instead of -p "cluster='dellc6220'").
A job is also rejected if you submit a non-besteffort job to a dedicated node of another team. Add -t besteffort to your oarsub command to check this point.
Sometimes it may also be caused by some nodes being temporarily out of service. This can be verified by typing oarnodes -s to list all nodes in service.
Another cause may be that the job requested more resources than the total resources existing on the cluster.
Why is my job still Waiting while other jobs go Running ?
Many possible (normal) explanations include :
- the other job may have a higher priority : queue priority, user Karma
- your job requests currently unavailable resources (eg : only dellc6220 nodes while the other job accepts any node type)
- your job requests more resources than currently available and a lower priority job can be run before without delaying your job (best fit). Eg : you requested 4 nodes, only 2 are currently available, the 2 others will be available in 3 hours. A job requesting 2 nodes during at most 3 hours can be run before yours.
- the other job made an advance reservation of resources
- etc.
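The best fit case above can be illustrated with a simplified Python sketch. This is not OAR's scheduling algorithm; the function and its parameters are hypothetical and only encode the rule that a lower priority job may start first if it fits in the currently free nodes and ends before the waiting job could have started anyway.

```python
def can_run_before(free_nodes, job_nodes, job_walltime_h, hours_until_waiting_job_can_start):
    """Return True if a lower priority job can start now without
    delaying the higher priority job that is waiting for more nodes."""
    fits_now = job_nodes <= free_nodes
    ends_in_time = job_walltime_h <= hours_until_waiting_job_can_start
    return fits_now and ends_in_time


# Situation from the text: your job needs 4 nodes, 2 are free now,
# the 2 others will be free in 3 hours.
print(can_run_before(free_nodes=2, job_nodes=2, job_walltime_h=3,
                     hours_until_waiting_job_can_start=3))
# The same request with a 5 hour walltime would delay your job.
print(can_run_before(free_nodes=2, job_nodes=2, job_walltime_h=5,
                     hours_until_waiting_job_can_start=3))
```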
Why is my job still Waiting while there are some unused resources ?
Many possible (normal) explanations include :
- you have reached maximum resource reservation per user at a given time and your job is not besteffort
- resources are reserved for a higher priority job. Eg: a higher priority job requests 3 nodes, 2 are currently available, 1 will be available in 1 hour. Your job requests 1 node during 2 hours. Running your job would result in delaying a higher priority job.
- resources are reserved by an advance reservation (same example as above).
- etc.
I see several nodes in the StandBy state in Monika, are they available ?
Yes; it's because we have enabled the Energy Savings feature of OAR.
It means that when no jobs are waiting, OAR can decide to shut down nodes to save energy. As soon as a new job is queued, OAR automatically restarts some nodes if not enough nodes are alive. Usually, the nodes can boot in 2 minutes, so the job will wait at most a few minutes before starting.
Why did my job get killed ?
Your job can be killed by the scheduler in several ways; you can check what happened using oarstat -fj <JOBID>
- Your script uses more memory than requested:
If your main process uses too much memory (see also How much memory is allocated to my job), it is killed by OAR; its state is 'Terminated' and it has received the kill signal (9)
state = Terminated
exit_code = 9 (0,9,0)
2017-02-14 14:35:53> SWITCH_INTO_TERMINATE_STATE:[bipbip 3321148] Ask to change the job state
- One of the processes started by your script uses more memory than requested:
If you use a bash script to start your main process, and it uses too much memory, then the bulkiest process is killed by OAR, and the bash script ends with an exit code of 128+9=137 (if your script correctly handles and returns the error code). Its state is 'Terminated'
state = Terminated
exit_code = 35072 (137,0,0)
2017-02-14 14:35:53> SWITCH_INTO_TERMINATE_STATE:[bipbip 3321148] Ask to change the job state
- Your job has exceeded its walltime:
In this case, the state is Error and OAR tells you what happens (killed by root because of WALLTIME)
state = Error
2017-02-14 15:01:34> SWITCH_INTO_ERROR_STATE:[bipbip 3321314] Ask to change the job state
2017-02-14 15:01:31> SEND_KILL_JOB:[Leon] Send the kill signal to oarexec on nef012.inria.fr for job 3321314
2017-02-14 15:01:30> WALLTIME:[sarko] Job [3321314] from 1487080849 with 15; current time=1487080890 (Elapsed)
2017-02-14 15:01:30> FRAG_JOB_REQUEST:User root requested to frag the job 3321314
- Your besteffort job has been killed to start a regular job:
In this case, the state is Error and OAR tells you what happens (killed by root because of BESTEFFORT_KILL)
state = Error
2017-02-14 16:01:50> SCHEDULER_PRIORITY_UPDATED_STOP:Scheduler priority for job 3321820 updated (network_address/resource_id)
2017-02-14 16:01:50> SWITCH_INTO_ERROR_STATE:[bipbip 3321820] Ask to change the job state
2017-02-14 16:01:47> SEND_KILL_JOB:[Leon] Send the kill signal to oarexec on nefgpu04.inria.fr for job 3321820
2017-02-14 16:01:46> FRAG_JOB_REQUEST:User root requested to frag the job 3321820
2017-02-14 16:01:46> BESTEFFORT_KILL:[MetaSched] kill the besteffort job 3321820
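The exit_code values 9 and 35072 in the two memory related cases above are consistent with the usual POSIX wait status layout. The Python sketch below decodes them under that assumption. Reading the triple in parentheses as (exit status, signal, core dump flag) is an interpretation inferred from these two examples, not something taken from the OAR documentation.

```python
def decode_wait_status(status):
    """Split a 16-bit POSIX-style wait status into its parts."""
    exit_status = (status >> 8) & 0xFF   # high byte: exit status of the process
    signal = status & 0x7F               # low 7 bits: terminating signal
    core = (status >> 7) & 0x1           # bit 7: core dump flag
    return exit_status, signal, core


# main process killed directly by signal 9
print(decode_wait_status(9))
# wrapper script that exited with 128+9 after its child was killed
print(decode_wait_status(35072))
```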
How can i clean my user environment ?
Failures may come from problems in your user environment (user specific customization and caches). Environment problems are often triggered by system or application upgrades. A good hint for a user environment problem : it does not occur under someone else's identity.
Cleaning your environment is very user and application specific. A few hints/recipes with classical problems :
- if using a virtual environment, container, etc. : re-create from scratch
- de-activate all your initializations (eg: mv ~/.bashrc ~/.bashrc.save for bash), logout, login again
- check LD_LIBRARY_PATH is empty (eg: echo $LD_LIBRARY_PATH) : session-long configuration is usually a bad idea unless you really know what you're doing
- clear cache files/directories eg :
- mv ~/.local ~/.local.save (python, gnome, etc.)
- mv ~/.conda ~/.conda.save (conda cache)
- mv ~/.nv ~/.nv.save (cuda)
Disks and filesystems
How can i access files on the cluster using sshfs ?
With sshfs you can access files on the cluster as a mounted filesystem on your client laptop/desktop running Linux or MacOs.
On Linux, you should first install the fuse-sshfs package. On MacOs, install OSXFUSE and SSHFS from http://osxfuse.github.io/.
Example for a machine connected on INRIA-sophia network:
- Linux:
mylaptop$ mkdir -p /workspaces/nef
mylaptop$ sshfs -o transform_symlinks nef-devel2.inria.fr:/ /workspaces/nef
- MacOs:
mylaptop$ mkdir -p /Volumes/nef
mylaptop$ sshfs -o transform_symlinks nef-devel2.inria.fr:/ /Volumes/nef
Mounting / with the transform_symlinks option gives access to all the storage of nef through a single mount, and properly handles any symbolic links you may encounter (ex: a symbolic link in your nef homedir pointing to /data/...).
Avoid doing such a network mount on a subdirectory of your homedir, so that your session does not freeze in case of network problem or when you disconnect your laptop.
If you want to make shortcuts using symbolic links in your homedir, it is better to create them in a subdirectory. For example (on Linux):
- mkdir $HOME/nef.d
- ln -s /workspaces/nef/home/LOGIN $HOME/nef.d/myhome
- ln -s /workspaces/nef/data/TEAM/user/LOGIN $HOME/nef.d/mydata
- ln -s /workspaces/nef/data/TEAM $HOME/nef.d/teamdata
where LOGIN is your nef login name, TEAM the name of your team.
You unmount this filesystem with:
- Linux:
mylaptop$ fusermount -u /workspaces/nef
- MacOs:
mylaptop$ umount -f /Volumes/nef
For a machine outside of Inria network :
- configure ssh tunneling through nef-frontal
- or mount on nef-frontal instead of nef-devel2 (lower performance)
What to do with my data before my account expires ?
When your user account expires, all files in /home/user and /data/team/user/user are removed after a grace delay (currently : 8 months).
Thus before account expiration you should sort your data :
- ensure retention of data still needed by the team :
- move data to /data/team/share
- tag data to long term storage
- set access rights for other team members if needed
- if possible, delete un-needed user data (in /home/user, /data/team/user/user, local nodes storage, etc.) in anticipation of automatic removal
See also section on disk space management.
How do i tag files on /data to the scratch or long term storage ?
Files in /data belong either to long term storage or to scratch storage. This is based on the Unix group of the files, not on the path hierarchy.
Use the standard Unix file group commands and rules eg :
- chgrp scratch /path/to/file : tag /path/to/file to the scratch group (so /path/to/file is now on scratch storage)
- chgrp my_team_group /path/to/file : tag /path/to/file to the my_team_group group (so /path/to/file is now on long term storage of my team)
- chmod g+s /path/to/dir : files created under /path/to/dir from now on inherit same Unix group as /path/to/dir
- sg scratch : current process now uses scratch as effective group id, so files are now created belonging to scratch group by default (if no path inherit rules takes precedence)
- etc.
So files can be moved from one storage to another without copying them (quicker with TB of data).
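Since the storage type depends only on the Unix group, it can be read back from a file's metadata. The following Python sketch does this with the standard os and grp modules. The function name is hypothetical, and it assumes the scratch group is literally named scratch and that any other group means long term storage.

```python
import grp
import os


def storage_class(path, scratch_group="scratch"):
    """Return 'scratch' or 'long term' from the Unix group of `path`."""
    gid = os.stat(path).st_gid
    try:
        group_name = grp.getgrgid(gid).gr_name
    except KeyError:
        # gid without a group name on this machine
        group_name = str(gid)
    return "scratch" if group_name == scratch_group else "long term"


# On the cluster, pass a path under /data. The current directory is
# used here only so that the sketch runs anywhere.
print(storage_class("."))
```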
Why the /data quota usage for users and groups do not match ?
- The group numbers indicate the long term storage quota usage by all the members of a group.
- The user numbers indicate the total disk usage of a user, long term storage plus scratch storage.
There is currently no simple way to get the long term storage quota usage by a single user.
Example :
- semir group is currently using 128.810 GiB out of its 1024 GiB long term storage usage quota which is the default quota for a team.
- user mvesin from group semir currently uses 210 GiB (mix of long term storage and scratch storage).
nef-devel2$ sudo nef-getquota -g semir
Group quotas under /data, restricted to the given groups (sizes in GiB):
Group   Used      Hard       Declared
semir   128.810   1024.000   1024.000 $default_data_quota
Disk usage by user under /data for the semir group (sizes in GiB):
User     Used
mvesin   210.000
fm       44.100
What are the performance of the different filesystems ?
This is a complex question that needs to be considered case by case :
- depends on the type of access (read/write/mix, long sequential/short random chunks, etc.)
- for /home and /data : overall performance is shared between jobs on all nodes of the cluster
- for /tmp and /dev/shm : overall performance is shared between jobs on the node
- etc.
Results of a test for big sequential write access (with caching disabled) :
- ~200 MB/s for /home access (shared between jobs on all nodes)
- ~9000 MB/s for /data access (shared between jobs on all nodes)
- 1x access : up to ~1100 MB/s ; 5x access on 1x node : up to ~2500 MB/s ; 25x access shared on 5x nodes : up to ~6000 MB/s ; etc.
- ~100-200 MB/s for /tmp access (shared between jobs on this node)
- ~400-500 MB/s for /local on a SATA SSD disk (shared between jobs on this node)
- ~1000 MB/s for /local on a SATA SSD RAID-{0,5} disk array (shared between jobs on this node)
- ~2000-3000 MB/s for /dev/shm access (shared between jobs on this node)
Why shouldn't i use many small files ?
Using small files (aka ZOTfiles, zillions of tiny files) consumes more filesystem metadata (finite) resources. Metadata exhaustion can prevent new file creation even with a filesystem not full. Using many small files or reading/writing small chunks of data also reduces file access performance for yourself and other users (lower data and metadata access efficiency).
Good practices for /data :
- avoid using many small files (rule of thumb : try using files over 1MB when using more than 100k files and links)
- avoid reading/writing many small chunks of data (rule of thumb : when doing intensive read/write try grouping requests by chunks over 1MB)
- do not create too many entries (files, directories, links) in the same directory (rule of thumb : 1-5k files per directory maximum).
Example : check metadata usage for user mylogin:
$ sudo beegfs-ctl --getquota --uid mylogin
      user/group     ||           size          ||    chunk files
     name     |  id  ||    used    |    hard    ||  used   |  hard
--------------|------||------------|------------||---------|---------
       mylogin|  1234||    2.49 TB |      0 Byte|| 37525834|        0
User mylogin uses 37525834 chunk files (metadata entries) for 2.49TB, thus an average of 71KB per chunk file (the average file size is slightly over this value). Rule of thumb : mylogin should create files at least 15 times bigger on average (~ 1MB average).
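The arithmetic behind these figures can be reproduced with a few lines of Python. This sketch assumes the sizes reported by beegfs-ctl are in binary units (1 TB read as 2^40 bytes), which is what makes the 71KB average come out; the variable names are illustrative.

```python
import math

used_bytes = 2.49 * 2**40   # 2.49 TB as reported, read as binary units (assumption)
chunk_files = 37525834      # metadata entries reported for the user

avg_bytes = used_bytes / chunk_files
avg_kib = avg_bytes / 1024
factor_to_1mib = 2**20 / avg_bytes

print(f"average chunk file size: {avg_kib:.0f} KiB")
print(f"growth factor to reach 1 MiB: {math.ceil(factor_to_1mib)}")
```

The same two numbers from your own beegfs-ctl output can be substituted to check whether your average file size is near the 1MB rule of thumb.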
Draft squashfs/mountimg
How can I use many small files efficiently?
You can gain performance and reduce the pressure on /data in the following cases:
- case1 your jobs are only reading under the directories where your zotfiles reside
- case2 your jobs are reading your zotfiles but add only new files or directories in them
- case3 your jobs generate zotfiles, but they will be accessed only for reading or adding new files afterwards
For case1:
- convert your zotfiles directories to squashfs images
- in your jobs:
- mount those images using sudo mountimg
- use those mounted directories for processing
For case2:
- convert your zotfiles directories to squashfs images
- in your jobs:
- mount those images using sudo mountimg
- use those mounted directories for processing but generate new files on the local filesystems of the node (ex: /tmp)
- unmount the images with sudo umountimg
- add the new files to the images with mksquashfs-no-compression
For case3:
- in your jobs:
- generate your zotfiles on the local filesystems of the node (ex: /tmp)
- convert them to squashfs images under /data with mksquashfs-no-compression
Creating squashfs images
You can convert your zotfiles on nef-devel or nef-devel2.
To convert your zotfiles to images, choose first the granularity appropriate to your case.
sudo mountimg currently allows mounting at most 4000 images on a node.
If you have for example a really big directory /data/.../DDD/DD/ containing hundreds of sub-directories D1 D2 ... DN, you may prefer to make one image per such sub-directory.
Example (in bash):
cd /data/.../DDD
# Build a separate directory for the images and the mountpoints
mkdir DD-img DD-mnt
cd DD
for i in D*; do
    # Create the image
    mksquashfs-no-compression $i ../DD-img/$i.squashfs
    # Create the mountpoint for your future jobs
    mkdir ../DD-mnt/$i
done
mksquashfs-no-compression is a simple wrapper around mksquashfs that disables any kind of compression to focus on speed. Feel free to try mksquashfs directly with other options like -comp lzo to save disk space.
You can also use the convert-to-squashfs command to safely convert your directories to squashfs images. See the online help: convert-to-squashfs --help
The commands sudo mountimg, mksquashfs-no-compression and convert-to-squashfs are provided by the fstools-sop RPM, installed also by default on the Fedora machines starting with Fedora-26.
Some mksquashfs hints:
- if the destination image exists, the source files/directories will be added (appended) to the image.
- In addition, if a file/directory with the same name already exists in the image, the new file/directory will be added with the name xxx_1, xxx_2, etc., where xxx is the original name.
- If a single directory is specified (i.e. mksquashfs source output.squashfs) the squashfs filesystem will consist of that directory, with the top-level root directory corresponding to the source directory.
- use the -keep-as-directory option to tell mksquashfs to keep the basename of the directory in its output.
- If multiple source directories or files are specified, mksquashfs will merge the specified sources into a single filesystem, with the root directory containing each of the source files/directories. The name of each directory entry will be the basename of the source path. If more than one source entry maps to the same name, the conflicts are named xxx_1, xxx_2, etc. where xxx is the original name.
Mounting squashfs images
To mount one image, simply call: sudo mountimg <image path> <directory>
To unmount: sudo umountimg <directory>
Example: mount every squashfs image of /data/.../DDD/DD-img/ on the corresponding sub-directory under /data/.../DDD/DD-mnt/
cd /data/.../DDD/DD-mnt || exit
for i in *; do
    sudo mountimg ../DD-img/$i.squashfs $i || exit
done
In an oar job, a mount done with mountimg will be automatically unmounted when the job terminates.
Such a mount can also be shared by more than one oar job and by more than one user. In this case, the unmount will be done when all the jobs terminate. Beware that every job has to do this mount to register to the list of processes needing it.
mountimg currently allows mounting at most 4000 images on a node.
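When a job mounts many images, it can help to build the list of mount commands first and check it against the per-node limit before running anything. The Python sketch below does only that planning step. The helper name and directory layout are hypothetical; it relies solely on the sudo mountimg <image path> <directory> syntax and the 4000 image limit given above, and it does not execute the commands.

```python
from pathlib import Path
import tempfile

MAX_IMAGES_PER_NODE = 4000  # limit stated above


def plan_mounts(img_dir, mnt_dir, limit=MAX_IMAGES_PER_NODE):
    """Build one 'sudo mountimg <image> <directory>' command per image."""
    images = sorted(Path(img_dir).glob("*.squashfs"))
    if len(images) > limit:
        raise ValueError(f"{len(images)} images exceed the limit of {limit}")
    return [["sudo", "mountimg", str(img), str(Path(mnt_dir) / img.stem)]
            for img in images]


# Demonstration on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    img_dir = Path(tmp, "DD-img")
    img_dir.mkdir()
    for name in ("D1", "D2"):
        (img_dir / f"{name}.squashfs").touch()
    commands = plan_mounts(img_dir, Path(tmp, "DD-mnt"))
    for cmd in commands:
        print(" ".join(cmd))
```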
What size can i use on the RAM filesystem ?
One limit is that the RAM filesystem (/dev/shm) space used by a job can be at most the RAM allocated to the job on this node (it is part of the resources allocated to the job).
The other limit is that the system of each node is configured with a total limit for the RAM filesystem (around 50% of the node RAM).
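The two limits combine as a simple minimum, as in this Python sketch. The function name is hypothetical, the 0.5 fraction reflects the approximate "around 50%" figure above, and the sketch ignores /dev/shm space already used by other jobs on the same node.

```python
def shm_limit_gb(job_ram_gb, node_ram_gb, node_shm_fraction=0.5):
    """Upper bound on /dev/shm space usable by a job on one node.

    node_shm_fraction is approximate (around 50% of the node RAM);
    the real per-node setting may differ.
    """
    return min(job_ram_gb, node_shm_fraction * node_ram_gb)


# job with 16 GB allocated on a 128 GB node: bounded by the job RAM
print(shm_limit_gb(job_ram_gb=16, node_ram_gb=128))
# job with 100 GB allocated on the same node: bounded by the node limit
print(shm_limit_gb(job_ram_gb=100, node_ram_gb=128))
```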
Other
Guidelines for hardware support of multi GPU scaling ?
Multi GPU scaling of GPU computation relies on many factors (type of computation, optimization, framework, data) including GPU node hardware.
Important hardware elements include GPU-GPU and CPU-GPU interconnect technology, CPU and storage resources.
- nvidia-smi topo -m describes GPU connections for the current node. GPU-GPU connection may be direct via PCIe switch or root, or indirect via PCIe plus CPU (and QPI/UPI).
- nodes hardware are presented here
- guidelines can help choosing a filesystem. Local SSD disk may be a good option when available.
Dell T630 nodes have at most 2 GPU cards per PCIe root, thus P2P GPU can scale well up to 2 GPU. PCIe 3.0 P2P interconnects cards at 16GB/s each way in theory. A P2P data transfer test shows ~20GB/s (2x 10GB/s) effective total and ~6 us latency between a pair of cards on the same PCIe root.
Asus ESC8000 node has a single root PCIe topology, enabling better P2P GPU scaling up to 8 GPU. P2P data transfer test shows ~25GB/s total and ~6.5 us latency between any pair amongst the 8 GPUs.
On the other hand, a job with a bottleneck on data transfers between CPU RAM/disks <=> GPU RAM may not benefit from PCIe single root and multi GPU.
- node reservation example :
oarsub -p "gpu='YES' and cluster='esc8kgpu'" -l /nodes=1 -I
Even in P2P, multi GPU data transfer is much slower than data transfer local to a GPU. Example : for a GTX1080 Ti, a test shows a ~350GB/s transfer rate and ~4 us latency for local transfers.
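To get a feel for the gap, the Python sketch below compares a P2P transfer with a transfer local to the GPU, using a rough latency plus size/bandwidth model. The model and the 1 GB payload are illustrative assumptions; the bandwidth and latency figures are the measured values quoted above (about 10GB/s one way and 6 us between two cards on the same PCIe root of a T630, about 350GB/s and 4 us local to a GTX1080 Ti).

```python
def transfer_time_us(size_bytes, bandwidth_gb_s, latency_us):
    """Rough transfer time in microseconds: latency + size / bandwidth."""
    return latency_us + size_bytes / (bandwidth_gb_s * 1e9) * 1e6


size = 10**9  # 1 GB payload (hypothetical)

# one way P2P between two cards on the same PCIe root
p2p = transfer_time_us(size, bandwidth_gb_s=10, latency_us=6)
# copy local to one GPU
local = transfer_time_us(size, bandwidth_gb_s=350, latency_us=4)

print(f"P2P: {p2p / 1000:.1f} ms, local: {local / 1000:.1f} ms, "
      f"ratio: {p2p / local:.0f}x")
```

For large payloads the ratio is dominated by bandwidth, so keeping data resident on one GPU and limiting inter-GPU exchanges matters more than latency.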
Multi-node scaling introduces other potential bottlenecks : IB network (40Gb/s or 56 Gb/s), CPU