Until now:
- access the cluster
- copy data to/from the cluster
- create parallel software
- compile code and use optimized libraries
- how to run the software on the full cluster
tl;dr:
Job scheduler/Resource manager:
Piece of software which:
● manages and allocates resources;
● manages and schedules jobs;
● sets up the environment for parallel and distributed computing.
Two computers are available for 10h
You go, then you go. You wait.
Resources:
CPU cores
Memory
Disk space
Network
Accelerators
Software
Licenses
Slurm
Free and open-source
Mature
Very active community
Many success stories
Runs 50% of TOP10 systems, including the 1st
Other job schedulers
PBSpro
Torque/Maui
Oracle (ex Sun) Grid Engine
Condor
You will learn how to:
Create a job
Monitor the jobs
Control your own job
Get job accounting info
1. Make up your mind
● resources you need, e.g. 1 core, 2GB RAM for 1 hour (the job parameters);
● operations you need to perform, e.g. launch 'myprog'.
2. Write a submission script
It is a shell script (Bash)
[Figure: annotated submission script. Callouts: regular Bash comment; lines that Bash sees as comments and Slurm takes as commands; job step creation; regular Bash commands.]
Other useful parameters
You want                              You ask
To set a job name                     --job-name=MyJobName
To attach a comment to the job        --comment="Some comment"
To get emails                         --mail-type=BEGIN|END|FAIL
                                      --mail-user=<your email address>
To set the name of the output file    --output=result-%j.txt
                                      --error=error-%j.txt
To delay the start of your job        --begin=16:00
                                      --begin=now+1hour
                                      --begin=2010-01-20T12:34:00
To specify an ordering of your jobs   --dependency=after(ok|notok|any):jobids
                                      --dependency=singleton
To control failure options            --no-kill
                                      --no-requeue
                                      --requeue
Constraints and resources
You want                                                              You ask
To choose a specific feature (e.g. a processor type or a NIC type)    --constraint
To use a specific resource (e.g. a gpu)                               --gres
To reserve a whole node for yourself                                  --exclusive
To choose a partition                                                 --partition
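As a sketch of steps 1 and 2 put together, the example request from step 1 (1 core, 2GB RAM for 1 hour, launch 'myprog') could be written as the submission script below. The file name submit.sh and the job name are arbitrary choices, --ntasks, --mem-per-cpu (in megabytes) and --time are the assumed ways of expressing that request, and the here-document is only a convenient way to create the file from the command line.

```bash
# Write the submission script to submit.sh (file name chosen for illustration).
cat > submit.sh <<'EOF'
#!/bin/bash
# Regular Bash comment: ignored by both Bash and Slurm.
#SBATCH --job-name=MyJobName
#SBATCH --output=result-%j.txt
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=2048
#SBATCH --time=01:00:00

# Regular Bash commands.
echo "Job $SLURM_JOB_ID starting on $(hostname)"

# srun creates a job step that launches the program ('myprog' is a placeholder).
srun ./myprog
EOF

# Check the script for Bash syntax errors without executing it.
bash -n submit.sh && echo "submit.sh has no Bash syntax errors"
```

The #SBATCH lines must come before the first regular command; Slurm stops looking for them after that point.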
3. Submit the script
[Figure: terminal session. Callouts: I submit with 'sbatch'; one more job parameter; Slurm gives me the JobID.]
So you can play
Download
http://www.cism.ucl.ac.be/Services/Formations/slurm.tgz
with wget and untar it on hmem
compile the 'stress' program
you can use it to burn cputime and memory:
./stress --cpu 1 --vm-bytes 128M --timeout 30s
Write a job script
Submit a job
See it running
Cancel it
4. Monitor your job
● squeue
● sprio
● sstat
A word about backfill
The rule: a job with a lower priority can
start before a job with a higher priority
if it does not delay that job's start time.
[Figure: jobs drawn on a resources vs. time chart, each labelled with the job's priority (60, 100, 80, 70, 10).]
The low-priority job has a short maximum run time and smaller requirements; it starts before the higher-priority job.
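The backfill rule can be illustrated with a deliberately simplified check. This is not Slurm's scheduling algorithm: it assumes the low-priority job already fits in the currently idle resources and only tests the time condition. The function name can_backfill and all the numbers are hypothetical.

```bash
# can_backfill NOW LIMIT RESERVED_START
# Succeeds (exit status 0) when a job started at NOW with a maximum run time
# of LIMIT ends no later than RESERVED_START, i.e. it cannot delay the
# higher-priority job. All values are in minutes.
can_backfill() {
    local now=$1 limit=$2 reserved_start=$3
    [ $(( now + limit )) -le "$reserved_start" ]
}

# Hypothetical situation: the high-priority job is expected to start in 120 minutes.
if can_backfill 0 60 120; then
    echo "60-minute job: can be backfilled"
fi
if ! can_backfill 0 180 120; then
    echo "180-minute job: has to wait"
fi
```

This is why requesting a realistic, short maximum run time can get a job started sooner.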
4. Monitor your job
● squeue
● sprio
● sstat
● sview
http://www.schedmd.com/slurmdocs/slurm_ug_2011/sview-users-guide.pdf
5. Control your job
● scancel
● scontrol
● sview
6. Job accounting
● sacct
● sreport
● sshare
The rules of fairshare
● A share is allocated to you: 1/nbusers
● If your actual usage is above that share, your fairshare value is decreased towards 0.
● If your actual usage is below that share, your fairshare value is increased towards 1.
● The actual usage taken into account decreases over time.
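The slides do not state the formula behind these rules. One formula that behaves as described, assumed here for illustration, is the classic form F = 2^(-usage/share); the cluster's actual configuration may use a different algorithm or add damping factors. The function name fairshare is hypothetical.

```bash
# fairshare SHARE USAGE
# Prints 2^(-USAGE/SHARE); both arguments are fractions of the cluster.
# Assumption: classic fairshare formula, not necessarily what your site runs.
fairshare() {
    LC_ALL=C awk -v s="$1" -v u="$2" 'BEGIN { printf "%.3f\n", 2 ^ (-u / s) }'
}

# Three users, so each share is about one third of the cluster.
fairshare 0.333 0.333   # usage equal to the share
fairshare 0.333 0.667   # usage above the share: value moves towards 0
fairshare 0.333 0.000   # no recent usage: value is 1
```

With this formula a user who consumes exactly their share sits at 0.5, halfway between the two extremes.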
A word about fairshare
● Assume 3 users, 3-cores cluster
● Red uses 1 core for a certain period of time
● Blue uses 2 cores for half that period
● Red uses 2 cores afterwards
Getting cluster info
● sinfo
● sjstat
Interactive work
● salloc
Summary
● Explore the environment
  ● Get node features (sinfo --Node --long)
  ● Get node usage (sinfo --summarize)
● Submit a job:
  ● Define the resources you need
  ● Determine what the job should do
  ● Submit the job script (sbatch)
● View the job status (squeue)
● Get accounting information (sacct)
You will learn how to:
Create a parallel job
Request distributed resources
with
Concurrent - Parallel - Distributed
Master/slave vs SPMD
Synchronous vs asynchronous
Typical resource request
You want                                                      You ask
16 independent processes (no communication)                   --ntasks=16
MPI and do not care about where cores are distributed         --ntasks=16
cores spread across distinct nodes                            --ntasks=16 --nodes=16
cores spread across distinct nodes and nobody else around     --ntasks=16 --nodes=16 --exclusive
16 processes to spread across 8 nodes                         --ntasks=16 --ntasks-per-node=2
16 processes on the same node                                 --ntasks=16 --ntasks-per-node=16
one process that can use 16 cores for multithreading          --ntasks=1 --cpus-per-task=16
4 processes that can use 4 cores                              --ntasks=4 --cpus-per-task=4
Use case 1: Random sampling
● Your program draws random numbers and processes them sequentially
● Parallelism is obtained by launching the same program multiple times simultaneously
● Every process does the same thing
● No inter-process communication
● Results appended to one common file
You want                                       You ask
16 independent processes (no communication)    --ntasks=16
You want                                       You ask
16 independent processes (no communication)    --array=1-16 --output=res%a
You merge with cat res*
Use case 2: Multiple datafiles
● Your program processes data from one datafile
● Parallelism is obtained by launching the same program multiple times on distinct data files
● Everybody does the same thing on distinct data stored in different files
● No inter-process communication
You want                                       You ask
16 independent processes (no communication)    --ntasks=16
You use srun ./myprog
Useful commands: xargs and find/ls:
Single node:
ls data* | xargs -n1 -P $SLURM_NPROCS myprog
Multiple nodes:
ls data* | xargs -n1 -P $SLURM_NTASKS srun -n1 -c1 myprog
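To see the xargs pattern at work without a cluster, the sketch below builds a few made-up data files and processes them in parallel. Here `wc -c` is only a stand-in for myprog, the scratch directory and file contents are invented for the demonstration, and the process count falls back to 2 when $SLURM_NTASKS is not set.

```bash
# Create a scratch directory with four small, made-up data files.
workdir=$(mktemp -d)
for i in 1 2 3 4; do
    echo "sample $i" > "$workdir/data$i"
done

# Use the number of tasks granted by Slurm, or 2 when run outside a job.
nprocs=${SLURM_NTASKS:-2}

# Stand-in for myprog: count the bytes of one data file and store the result
# next to it. xargs starts one process per file (-n1), running at most
# $nprocs of them at the same time (-P).
printf '%s\n' "$workdir"/data* |
    xargs -n1 -P "$nprocs" sh -c 'wc -c < "$1" > "$1.out"' _

ls "$workdir"
```

Each data file gets its own output file, so the parallel processes never write to the same place.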
You want                                       You ask
16 independent processes (no communication)    --array=1-16
Use case 3: Parameter sweep
● Your program tests something for one particular value of a parameter
● Parallelism is obtained by launching the same program multiple times with a distinct identifier
● Everybody does the same thing except for a given parameter value based on the identifier
● No inter-process communication
You want                                       You ask
16 independent processes (no communication)    --ntasks=16
You use srun ./myprog
You want                                       You ask
16 independent processes (no communication)    --array=1-16 --output=res%a
You use $SLURM_ARRAY_TASK_ID
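One possible way to turn $SLURM_ARRAY_TASK_ID into a parameter value is a lookup in a Bash array, sketched below. The parameter values, the helper name param_for_task and the shortened 1-4 array range are all made up for illustration; the default of 1 lets the script be tried by hand outside Slurm.

```bash
#!/bin/bash
#SBATCH --array=1-4
#SBATCH --output=res%a

# param_for_task ID: print the parameter value for array task ID.
# The values are hypothetical; Bash arrays are 0-based while the job
# array above starts at 1, hence the "- 1".
param_for_task() {
    local values=(0.1 0.5 1.0 2.0)
    echo "${values[$(( $1 - 1 ))]}"
}

# Slurm sets SLURM_ARRAY_TASK_ID inside an array job; default to 1 otherwise.
task_id=${SLURM_ARRAY_TASK_ID:-1}
param=$(param_for_task "$task_id")

echo "array task $task_id runs with parameter $param"
# On the cluster, the program would be launched here, e.g.:
#   srun ./myprog "$param"
```

Each array task writes to its own file (res1, res2, ...) thanks to the %a pattern in --output.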
Useful command: GNU Parallel
Single node:
parallel -j $SLURM_NPROCS myprog ::: {1..5} ::: {A..D}
Multiple nodes:
parallel -j $SLURM_NTASKS srun -n1 -c1 myprog ::: {1..5} ::: {A..D}
Useful: parallel --joblog runtask.log --resume for checkpointing
parallel echo data_{1}_{2}.dat ::: 1 2 3 ::: 1 2 3
Use case 4: Multithread
● Your program uses OpenMP or TBB
● Parallelism is obtained by launching a multithreaded program
● One program spawns itself on the node
● Inter-process communication by shared memory
● Results managed in the program which outputs a summary
You want                                                You ask
one process that can use 16 cores for multithreading    --ntasks=1 --cpus-per-task=16
You use OMP_NUM_THREADS=16 srun myprog
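Rather than repeating the number 16 in two places, the thread count can be taken from the environment Slurm provides. The sketch below assumes the job was submitted with --cpus-per-task, in which case Slurm sets $SLURM_CPUS_PER_TASK; the fallback to 1 thread is an added convenience for running outside a job.

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16

# Inside a job, SLURM_CPUS_PER_TASK holds the value requested with
# --cpus-per-task; fall back to 1 thread when it is unset (outside a job).
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "OpenMP will use $OMP_NUM_THREADS thread(s)"

# On the cluster, the multithreaded program would be launched here:
#   srun ./myprog
```

Changing the request then only requires editing the #SBATCH line.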
Use case 5: Message passing
● Your program uses MPI
● Parallelism is obtained by launching a multi-process program
● One program spawns itself on several nodes
● Inter-process communication by the network
● Results managed in the program which outputs a summary
You want                         You ask
16 processes for use with MPI    --ntasks=16
You use
module load openmpi
mpirun myprog
Use case 6: Master slave
● You have two types of programs: master and slave
● Parallelism is obtained by launching several slaves, managed by the master
● The master launches several slaves on distinct nodes
● Inter-process communication by the network or the disk
● Results managed in the master program which outputs a summary
You want                    You ask
16 processes, 16 threads    --ntasks=16 --cpus-per-task=16
Summary
● Choose number of processes: --ntasks
● Choose number of threads: --cpus-per-task
● Launch processes with srun or mpirun
● Set multithreading with OMP_NUM_THREADS
● You can use $SLURM_PROCID
Try
● Download MPI hello world on Wikipedia, compile it, write job script and submit it
● Rewrite 'Multiple files' examples using xargs