(1)

(2)

Until now:

- access the cluster
- copy data to/from the cluster
- create parallel software
- compile code and use optimized libraries
- how to run the software on the full cluster

tl;dr:

- submit a job to the scheduler

(3)

(4)

(5)

Job scheduler/Resource manager:

Piece of software which:

● manages and allocates resources;
● manages and schedules jobs;

and sets up the environment for parallel and distributed computing.

Two computers are available for 10h

You go, then you go. You wait.

(6)

Resources:

CPU cores

Memory

Disk space

Network

Accelerators

Software

Licenses

(7)

(8)

(9)

(10)

Slurm

Free and open-source

Mature

Very active community

Many success stories

Runs 50% of TOP10 systems, including the 1st

(11)

Other job schedulers

PBSpro

Torque/Maui

Oracle (ex Sun) Grid Engine

Condor

(12)

You will learn how to:

Create a job

Monitor the jobs

Control your own job

Get job accounting info

(13)

1. Make up your mind

● resources you need;
  e.g. 1 core, 2GB RAM for 1 hour (job parameters)
● operations you need to perform.
  e.g. launch 'myprog'

(14)

2. Write a submission script

It is a shell script (Bash). The script shown on the slide carries these annotations:

● Regular Bash comment
● Bash sees these as comments, Slurm takes them as commands
● Job step creation
● Regular Bash commands
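As a hedged sketch of what such a script might look like, the snippet below writes a minimal submission script to a file. The file name submit.sh, the job name and the resource values are illustrative assumptions (they mirror the "1 core, 2GB RAM for 1 hour" example), and 'myprog' is the placeholder program used earlier.

```shell
# Write a minimal submission script to submit.sh (hypothetical file name).
cat > submit.sh <<'EOF'
#!/bin/bash
# Regular Bash comment. The lines below start with "#SBATCH": Bash sees
# them as comments, Slurm takes them as job parameters.
#SBATCH --job-name=MyJobName
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=01:00:00

# Regular Bash commands
echo "Running on $(hostname)"

# Job step creation: srun launches the program as a job step
srun ./myprog
EOF

# Syntax check only; nothing is submitted here.
bash -n submit.sh && echo "submit.sh is syntactically valid"
```

On a cluster, the file would then be handed to the scheduler with sbatch.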

(15)

Other useful parameters

You want                                You ask

To set a job name                       --job-name=MyJobName
To attach a comment to the job          --comment="Some comment"
To get emails                           --mail-type=BEGIN|END|FAIL
                                        --mail-user=[email protected]
To set the name of the output file      --output=result-%j.txt
                                        --error=error-%j.txt
To delay the start of your job          --begin=16:00
                                        --begin=now+1hour
                                        --begin=2010-01-20T12:34:00
To specify an ordering of your jobs     --dependency=after(ok|notok|any):jobids
                                        --dependency=singleton
To control failure options              --no-kill
                                        --no-requeue
                                        --requeue

(16)

Constraints and resources

You want                                        You ask

To choose a specific feature                    --constraint
(e.g. a processor type or a NIC type)
To use a specific resource (e.g. a gpu)         --gres
To reserve a whole node for yourself            --exclusive
To choose a partition                           --partition

(17)

3. Submit the script

The session shown on the slide carries these annotations:

● I submit with 'sbatch'
● One more job parameter
● Slurm gives me the JobID
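The JobID is needed by most of the later commands, so it is handy to capture it. On success sbatch prints a line of the form "Submitted batch job <JobID>"; the sketch below extracts the number from such a line. The helper name parse_jobid and the sample JobID are made up for illustration.

```shell
# Extract the numeric JobID from the line sbatch prints on success.
parse_jobid() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ {print $4}'
}

# On a cluster you would capture the real output, for example:
#   output=$(sbatch submit.sh)
# Here a sample line stands in for it (the number is made up).
output="Submitted batch job 123456"
jobid=$(parse_jobid "$output")
echo "JobID: $jobid"
```

sbatch also has a --parsable option that prints the JobID without the surrounding text, which makes the parsing step unnecessary.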

(18)

So you can play

Download

http://www.cism.ucl.ac.be/Services/Formations/slurm.tgz

with wget and untar it on hmem

compile the 'stress' program

you can use it to burn cputime and memory:

./stress --cpu 1 --vm-bytes 128M --timeout 30s

Write a job script

Submit a job

See it running

Cancel it

(19)

4. Monitor your job

● squeue
● sprio
● sstat
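A minimal sketch of how the three commands might be used for one job. The helper name monitor_job is hypothetical, and the chosen sstat fields are just one possible selection.

```shell
# Hypothetical helper wrapping the three monitoring commands for one job.
# Usage: monitor_job <JobID>
monitor_job() {
    jobid=$1
    # State of the job in the queue (pending, running, ...)
    squeue -j "$jobid"
    # Priority components of a pending job
    sprio -j "$jobid"
    # Resource usage of a running job's steps
    sstat -j "$jobid" --format=JobID,MaxRSS,AveCPU
}

# List all of your own jobs, if Slurm is installed on this machine.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER"
else
    echo "squeue not found: run this on a cluster login node"
fi
```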

(23)

A word about backfill

The rule: a job with a lower priority can start before a job with a higher priority if it does not delay that job's start time.

[Figure: jobs drawn on a resources vs. time chart, each labelled with the job's priority: 60, 100, 80, 70, 10]

The low-priority job has a short maximum run time and smaller requirements; it starts before the higher-priority job.
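The rule can be illustrated with a deliberately simplified check that looks at time only: it assumes the candidate job fits in resources that are idle right now, and asks whether it would finish before the higher-priority job is planned to start. The function name and the numbers are illustrative; a real scheduler also takes the resource requirements into account.

```shell
# can_backfill NOW WALLTIME RESERVED_START
# All values in minutes. Succeeds if a job starting at NOW with a maximum
# run time of WALLTIME ends no later than RESERVED_START, the planned
# start time of the higher-priority job.
can_backfill() {
    now=$1 walltime=$2 reserved_start=$3
    [ $((now + walltime)) -le "$reserved_start" ]
}

# The high-priority job is planned to start at t=120.
if can_backfill 0 60 120; then
    echo "60-minute job: can start now"
fi
if ! can_backfill 0 180 120; then
    echo "180-minute job: must wait"
fi
```

This is also why a realistic --time value helps: the shorter the declared maximum run time, the more gaps the job can fit into.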

(24)

(27)

4. Monitor your job

● squeue
● sprio
● sstat
● sview

http://www.schedmd.com/slurmdocs/slurm_ug_2011/sview-users-guide.pdf

(28)

5. Control your job

● scancel
● scontrol
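A short sketch of typical uses of the two commands, wrapped in small helpers. The helper names are hypothetical; the underlying scancel and scontrol subcommands are standard.

```shell
# Hypothetical helpers around scancel and scontrol; each takes a JobID.
cancel_job()  { scancel "$1"; }
hold_job()    { scontrol hold "$1"; }      # keep a pending job from starting
release_job() { scontrol release "$1"; }   # let a held job be scheduled again
show_job()    { scontrol show job "$1"; }  # print the job's full description

# Change the time limit of a job, e.g. shorten_job 123456 00:30:00.
# Regular users can typically lower the limit but not raise it.
shorten_job() { scontrol update JobId="$1" TimeLimit="$2"; }

echo "helpers defined: cancel_job hold_job release_job show_job shorten_job"
```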

(32)

5. Control your job

● scancel
● scontrol
● sview

(33)

6. Job accounting

● sacct
● sreport
● sshare
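As an illustration, sacct can print machine-readable, pipe-separated output that is easy to post-process. The sketch below summarises such output; the sample lines and the helper name summarise are made up, and the selected fields are only one possible choice.

```shell
# On a cluster the input would come from:
#   sacct -j <JobID> --parsable2 --noheader --format=JobID,Elapsed,State
# Here a made-up sample stands in for it.
sample='123456|00:10:02|COMPLETED
123456.batch|00:10:02|COMPLETED'

# Turn each "JobID|Elapsed|State" line into a readable sentence.
summarise() {
    awk -F'|' '{ printf "%s ran for %s and ended as %s\n", $1, $2, $3 }'
}

printf '%s\n' "$sample" | summarise
```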

(39)

The rules of fairshare

● A share is allocated to you: 1/nbusers
● If your actual usage is above that share, your fairshare value is decreased towards 0.
● If your actual usage is below that share, your fairshare value is increased towards 1.
● The actual usage taken into account decreases over time.
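One formula with exactly this behaviour is F = 2^(-U/S), where U is the usage and S the share; it is the one documented for Slurm's classic multifactor fair-share, but whether a given cluster uses it (or a damped or tree-based variant) is an assumption here. The sketch evaluates it for a few made-up usage values.

```shell
# fairshare USAGE SHARE: prints 2^(-USAGE/SHARE) with three decimals.
fairshare() {
    awk -v u="$1" -v s="$2" 'BEGIN { printf "%.3f\n", 2 ^ (-u / s) }'
}

share=0.25                 # 1/nbusers with 4 users
fairshare 0    "$share"    # no usage at all        -> 1.000
fairshare 0.25 "$share"    # usage equal to share   -> 0.500
fairshare 0.75 "$share"    # usage 3 times the share -> 0.125
```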

(40)

(41)

A word about fairshare

● Assume 3 users, 3-core cluster
● Red uses 1 core for a certain period of time
● Blue uses 2 cores for half that period
● Red uses 2 cores afterwards

[Figure; axis label: #nodes]

(43)

(44)

(45)

Getting cluster info

● sinfo
● sjstat

(47)

Interactive work

● salloc

(49)

Summary

● Explore the environment
    ● Get node features (sinfo --Node --long)
    ● Get node usage (sinfo --summarize)
● Submit a job:
    ● Define the resources you need
    ● Determine what the job should do
    ● Submit the job script (sbatch)
● View the job status (squeue)
● Get accounting information (sacct)

(50)

(51)

(52)

You will learn how to:

Create a parallel job

Request distributed resources with Slurm

(53)

Concurrent - Parallel - Distributed

Master/slave vs SPMD

Synchronous vs asynchronous

(54)

Typical resource request

You want                                                    You ask

16 independent processes (no communication)                 --ntasks=16
MPI and do not care about where cores are distributed       --ntasks=16
cores spread across distinct nodes                          --ntasks=16 --nodes=16
cores spread across distinct nodes and nobody else around   --ntasks=16 --nodes=16 --exclusive
16 processes to spread across 8 nodes                       --ntasks=16 --ntasks-per-node=2
16 processes on the same node                               --ntasks=16 --ntasks-per-node=16
one process that can use 16 cores for multithreading        --ntasks=1 --cpus-per-task=16
4 processes that can use 4 cores                            --ntasks=4 --cpus-per-task=4

(55)

Use case 1: Random sampling

● Your program draws random numbers and processes them sequentially
● Parallelism is obtained by launching the same program multiple times simultaneously
● Every process does the same thing
● No inter-process communication
● Results appended to one common file

(56)

Use case 1: Random sampling

You want                                        You ask

16 independent processes (no communication)     --ntasks=16

(57)

Use case 1: Random sampling

You want                                        You ask

16 independent processes (no communication)     --array=1-16 --output=res%a

You merge with: cat res*
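A hedged sketch of a job-array script for this use case. The file names sampling.sh and merged.txt are assumptions, and 'myprog' stands for the sampling program; %a in the output name is replaced by the array index of each task.

```shell
# Write a job-array submission script to sampling.sh (hypothetical name).
cat > sampling.sh <<'EOF'
#!/bin/bash
#SBATCH --array=1-16
#SBATCH --output=res%a
#SBATCH --ntasks=1

# Each array task runs the same program; its output lands in res<index>.
srun ./myprog
EOF

# Once all array tasks have finished, the per-task files can be merged:
#   cat res* > merged.txt
bash -n sampling.sh && echo "sampling.sh is syntactically valid"
```

The merged file is given a name that does not start with "res" so that a second run of the merge does not pick up its own output.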

(58)

Use case 2: Multiple datafiles

● Your program processes data from one datafile
● Parallelism is obtained by launching the same program multiple times on distinct data files
● Everybody does the same thing on distinct data stored in different files
● No inter-process communication

(59)

Use case 2: Multiple datafiles

You want You ask

You use srun ./myprog

(60)

Use case 2: Multiple datafiles

Useful commands: xargs and find/ls:

Single node:

ls data* | xargs -n1 -P $SLURM_NPROCS myprog

Multiple nodes:

ls data* | xargs -n1 -P $SLURM_NTASKS srun -n1 -c1 myprog
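The xargs pattern can be tried without a cluster. The sketch below creates a stand-in for 'myprog' and a few empty data files in a temporary directory (all names are illustrative), then processes one file per invocation with a bounded number of parallel workers. It uses find with null-separated names instead of ls so that unusual file names are handled safely.

```shell
# Stand-in for 'myprog' (hypothetical): prints the name of the file it gets.
workdir=$(mktemp -d)
cat > "$workdir/myprog" <<'EOF'
echo "processed $(basename "$1")"
EOF

# Four empty sample data files.
for i in 1 2 3 4; do : > "$workdir/data$i"; done

# Inside a job Slurm sets SLURM_NTASKS; fall back to 2 when run elsewhere.
nprocs=${SLURM_NTASKS:-2}

# One file per invocation (-n1), at most $nprocs invocations at a time (-P).
result=$(find "$workdir" -name 'data*' -print0 \
    | xargs -0 -n1 -P "$nprocs" sh "$workdir/myprog" | sort)
echo "$result"

rm -rf "$workdir"
```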

(61)

Use case 2: Multiple datafiles

You want You ask

16 independent processes (no communication) --array=1-16

(62)

Use case 3: Parameter sweep

● Your program tests something for one particular value of a parameter
● Parallelism is obtained by launching the same program multiple times with a distinct identifier
● Everybody does the same thing except for a given parameter value based on the identifier
● No inter-process communication

(63)

Use case 3: Parameter sweep

You want You ask

You use srun ./myprog

(64)

Use case 3: Parameter sweep

You want                                        You ask

16 independent processes (no communication)     --array=1-16 --output=res%a

You use: $SLURM_ARRAY_TASK_ID
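One common way to turn the identifier into a parameter value is to keep the values in a text file, one per line, and let each array task read its own line. The file name params.txt, the values, and the idea that 'myprog' takes the parameter as an argument are all assumptions made for this sketch.

```shell
# Hypothetical parameter list: one value per line, line N belongs to task N.
printf '%s\n' 0.1 0.2 0.5 1.0 > params.txt

# Slurm sets SLURM_ARRAY_TASK_ID inside an array job; default to 3 here so
# the snippet can also be tried outside a job.
task_id=${SLURM_ARRAY_TASK_ID:-3}

# Pick the line of params.txt that matches this task.
param=$(sed -n "${task_id}p" params.txt)
echo "task $task_id uses parameter $param"

# In the job script this would become:  srun ./myprog "$param"
```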

(65)

Use case 3: Parameter sweep

Useful command: GNU Parallel

Single node:

parallel -j $SLURM_NPROCS myprog ::: {1..5} ::: {A..D}

Multiple nodes:

parallel -j $SLURM_NTASKS srun -n1 -c1 myprog ::: {1..5} ::: {A..D}

Useful for checkpointing: parallel --joblog runtask.log --resume

parallel echo data_{1}_{2}.dat ::: 1 2 3 ::: 1 2 3
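To see which combinations the last command generates, here is a plain-shell equivalent using nested loops. It produces the same nine names, but runs sequentially; GNU Parallel runs the commands concurrently, so its output order may differ.

```shell
# Plain-shell equivalent of:
#   parallel echo data_{1}_{2}.dat ::: 1 2 3 ::: 1 2 3
# The outer loop plays the role of {1}, the inner loop the role of {2}.
names=$(
    for a in 1 2 3; do
        for b in 1 2 3; do
            echo "data_${a}_${b}.dat"
        done
    done
)
echo "$names"
```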

(66)

Use case 4: Multithread

● Your program uses OpenMP or TBB
● Parallelism is obtained by launching a multithreaded program
● One program spawns itself on the node
● Inter-process communication by shared memory
● Results managed in the program which outputs a summary

(67)

You want                                                You ask

one process that can use 16 cores for multithreading    --ntasks=1 --cpus-per-task=16

You use: OMP_NUM_THREADS=16 srun myprog
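Rather than hard-coding 16, the thread count can follow the request. A minimal sketch, assuming the job was submitted with --cpus-per-task so that Slurm exports SLURM_CPUS_PER_TASK; the fallback to 1 is an assumption for runs outside a job.

```shell
# Inside a job Slurm exports SLURM_CPUS_PER_TASK when --cpus-per-task is
# given; fall back to 1 thread when the variable is not set.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "OpenMP will use $OMP_NUM_THREADS thread(s)"

# In the job script the program is then launched with:  srun ./myprog
```

This keeps the number of threads and the number of reserved cores in sync when the request is changed later.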

(68)

Use case 5: Message passing

● Your program uses MPI
● Parallelism is obtained by launching a multi-process program
● One program spawns itself on several nodes
● Inter-process communication by the network
● Results managed in the program which outputs a summary

(69)

Use case 5: Message passing

You want                          You ask

16 processes for use with MPI     --ntasks=16

You use:

module load openmpi
mpirun myprog
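Put together as a submission script, this might look like the sketch below. The file name mpi.sh and the time limit are assumptions, and the module name is the one from the slide; it may differ on other clusters.

```shell
# Write an MPI submission script to mpi.sh (hypothetical file name).
cat > mpi.sh <<'EOF'
#!/bin/bash
#SBATCH --ntasks=16
#SBATCH --time=00:10:00

# Module name as on the slide; it may differ on other clusters.
module load openmpi

# With an MPI library built with Slurm support, mpirun takes the number
# of processes from the allocation, so no -np option is given here.
mpirun ./myprog
EOF

bash -n mpi.sh && echo "mpi.sh is syntactically valid"
```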

(70)

Use case 6: Master slave

● You have two types of programs: master and slave
● Parallelism is obtained by launching several slaves, managed by the master
● The master launches several slaves on distinct nodes
● Inter-process communication by the network or the disk
● Results managed in the master program which outputs a summary

(71)

Use case 6: Master slave

You want        You ask

16 processes    --ntasks=16
16 threads      --cpus-per-task=16

(73)

Summary

● Choose number of processes: --ntasks
● Choose number of threads: --cpus-per-task
● Launch processes with srun or mpirun
● Set multithreading with OMP_NUM_THREADS
● You can use $SLURM_PROCID

(74)

(75)

(76)

Try

● Download MPI hello world from Wikipedia, compile it, write job script and submit it
● Rewrite 'Multiple files' examples using xargs

●