Efficient cluster computing
Introduction to the Sun Grid Engine (SGE) queuing system
Markus Rampp (RZG, MIGenAS)
MPI for Evolutionary Anthropology
Leipzig, Feb. 16, 2007
Outline
. Introduction
. Basic concepts:
  · queues, jobs, scripts
  · essential SGE commands and options
. Advanced topics
  · Job chains
  · Array jobs
  · DRMAA API
. Tips & Tricks, References
Introduction
I Sun Grid Engine (SGE): a popular batch-queuing system
Software like SGE is typically used on a computer farm or computer cluster and is responsible for accepting, scheduling, dispatching, and managing the remote execution of large numbers of standalone, parallel or interactive user jobs. It also manages and schedules the allocation of distributed resources such as processors, memory, disk space, and software licenses. (taken from Wikipedia)
I Popular batch systems (distributed resource managers, DRMs)
. Sun Grid Engine (open source)
. LoadLeveler (IBM)
. NQS (Cray, NEC)
. DQS (open source)
. ...
Introduction (2)
I Why should one use a DRM? ⇒ Increase efficiency
I Operator’s perspective:
. transparent resource management
  · clustering of compute resources
  · load balancing, optimization of resource usage
  · “fair” (policy-based) distribution of resources
  · accounting
I User’s perspective:
. shared usage of system resources
. optimize throughput
. organize/simplify handling of (“large”) computational tasks
. enhanced stability (survive system crashes, maintenance, ...)
. well-defined resource allocation (⇒ benchmarking)
Basic concepts
I Queues:
[Figure: queues (Queue 1, Queue 2, Queue 3) mapped onto compute resources (Resource A, Resource B)]
queuename         qtype used/tot. load_avg arch        states
[email protected]  BIP   0/2       0.00     lx26-x86    d
[email protected]  BIP   0/2       0.00     lx26-x86
[email protected]  BIP   0/2       0.00     lx26-x86
[email protected]  BIP   0/2       0.00     lx26-x86
[email protected]  BIP   0/2       0.00     lx26-x86
[email protected]  BIP   0/2       0.00     lx26-x86
[email protected]  BIP   0/2       0.00     lx26-x86
[email protected]  B     0/0       0.09     lx26-amd64
[email protected]  B     0/4       0.00     lx26-amd64
[email protected]  B     0/4       0.00     lx26-amd64
[email protected]  B     0/4       0.00     lx26-amd64
[email protected]  B     0/4       0.00     lx26-amd64
Basic concepts (2)
I Jobs & scripts
1. prepare script of executable commands
2. specify resources and meta information
3. submit to batch system (returns a job ID)
4. use the job ID for job control (query status, cancel, ...)
#$ -S /bin/sh
#$ -cwd
#$ -M [email protected]
#$ -m e
#$ -N example
# begin executable commands (shell specified by #$ -S)
# note: starting here, a leading '#' starts
#       a comment, whereas in the above
#       SGE header it does NOT
echo "starting job ..."
blastall -p blastp -d nr -i query_1.fa -o blastout_1.txt
blastall -p blastp -d nr -i query_2.fa -o blastout_2.txt
echo "...done"
~> qsub example_1.sge
Your job 10404 ("example") has been submitted.
~> qstat
job-ID  prior    name     user  state  submit/start at      queue  slots  ja-task-ID
---
10404   0.00000  example  mjr   qw     02/12/2007 11:43:42         1
~> qstat
job-ID  prior    name     user  state  submit/start at      queue  slots  ja-task-ID
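Step 4 depends on the job ID that qsub reports in step 3. The following is a minimal sketch, assuming the confirmation message has exactly the form shown above ("Your job <ID> (...) has been submitted."). The helper name job_id_from_qsub is hypothetical and not part of SGE, and parsing command output in this way is fragile, so it is meant as an illustration only.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): extract the numeric job ID from
# the confirmation message printed by qsub, e.g.
#   Your job 10404 ("example") has been submitted.
# Assumption: the message starts with "Your job <digits>".
job_id_from_qsub() {
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\).*/\1/p'
}

# Usage on the cluster (commented out, requires a working qsub):
#   msg=$(qsub example_1.sge)
#   id=$(job_id_from_qsub "$msg")
#   qstat -j "$id"     # query status
#   qdel "$id"         # cancel
```

If the message does not match the assumed form, the helper prints nothing, which a calling script can treat as a failed submission.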
SGE commands & options
I Interacting with the queuing system: SGE’s q-commands (cf. ll-commands of LoadLeveler)
. qsub          submit job                                               ↔ llsubmit
. qstat         query queue/job status                                   ↔ llq, llstatus
. qdel          delete job                                               ↔ llcancel
. qhold         hold (“suspend”) job (note: user/operator/system holds)  ↔ llhold
. qrls          release holds                                            ↔ llhold -r
. qalter, qmod  modify job                                               ↔ llmodify
. qhost         provide concise system overview
SGE commands & options (2)
I Specify qsub options in the script header and/or on the command line (the command line overrides the script)
I Essential options for qsub:
-S: path to shell
-m b|e|a|s|n|...: send mail at beginning (b), end (e), abort (a) or suspension (s) of the job; n: no mail
-M: e-mail address for notification
-N: name of job
-j y: join stdout and stderr
I Additional options for qsub:
-q: queue
-p: priority (default 0; users may only decrease)
-P: name of project
-a: earliest date/time at which a job is eligible for execution
...: cf. man qsub
SGE commands & options (3)
I Commonly used options for qstat:
qstat: displays list of jobs only
qstat -u <user> -j <job ID>: displays list of jobs for specified user/job
qstat -f: full format display
qstat -r: extended display (incl. resource requirements, scheduling info)
~> qstat -f
queuename         qtype used/tot. load_avg arch      states
[email protected]  BIP   0/2       0.04     lx26-x86  d
[email protected]  BIP   0/2       0.00     lx26-x86
---...
[email protected]  BIP   0/2       0.00     lx26-x86
[email protected]  BIP   2/2       3.67     lx26-x86
    10422 0.56000 megaBLAST hfz r 02/13/2007 20:34:12 1 882
    10422 0.56000 megaBLAST hfz r 02/13/2007 20:34:12 1 883
[email protected]  BIP   2/2       4.85     lx26-x86
    10422 0.56000 megaBLAST hfz r 02/13/2007 20:28:27 1 864
    10422 0.56000 megaBLAST hfz r 02/13/2007 20:31:57 1 875
...
############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
SGE commands & options (4)
~> qstat -r -j 10422
==============================================================
job_number:        10422
exec_file:         job_scripts/10422
submission_time:   Tue Feb 13 16:34:29 2007
owner:             hfz
uid:               1553
group:             rzb
gid:               4131
sge_o_home:        /afs/ipp/home/h/hfz
sge_o_log_name:    hfz
sge_o_path:        /opt/SGE6/bin/lx26-x86:/usr/local/bin:/opt/gnome/bin:/usr/games:/usr/bin/X11:/usr/bin:/bin
sge_o_shell:       /bin/tcsh
sge_o_workdir:     /bio/tmp/hfz/Sargossa
sge_o_host:        e01
account:           sge
cwd:               /bio/tmp/hfz/Sargossa
path_aliases:      /tmp_mnt/ * * /
mail_list:         [email protected]
notify:            FALSE
job_name:          megaBLAST
jobshare:          0
shell_list:        /bin/sh
env_list:
script_file:       /afs/ipp/home/h/hfz/MySQL/Sequenzen/e01/submit_megablast_test.sge
project:           gendb
job-array tasks:   1-5000:1
usage 808:         cpu=00:08:46, mem=595.19275 GBs, io=0.00000, vmem=1.377G, maxvmem=1.566G
. .
usage 875:         cpu=00:00:24, mem=28.67529 GBs, io=0.00000, vmem=1.251G, maxvmem=1.251G
scheduling info:   queue instance "[email protected]" dropped because it is disabled
                   queue instance "[email protected]" dropped because it is
                   queue instance "[email protected]" dropped because it is full
                   queue instance "[email protected]" dropped because it is full
                   queue instance "[email protected]" dropped because it is full
                   queue instance "[email protected]" dropped because it is full
                   queue instance "[email protected]" dropped because it is full
                   queue instance "[email protected]" dropped because it is full
                   queue instance "[email protected]" dropped because it is full
                   (project gendb) is not allowed to run in host "e07.bc.rzg.mpg.de" based on the excluded project list
Input/Output
I Output
. stdout: <job name>.o<job ID>
. stderr: <job name>.e<job ID>
. path: can be specified by qsub -o <stdout path> -e <stderr path>
  paths relative to
  · current working directory at submission (with qsub -cwd option)
  · user’s home directory (if -cwd option is not specified):
~> ls
example.e10404
example.o10404
example_1.sge
I Input
. arguments: qsub [ options ] [ command | -- [ command_args ]]
Advanced topics
. Job chains: sets of consecutive interdependent jobs
. Job arrays: sets of similar and independent (parallel) jobs
. DRMAA: Distributed Resource Management Application API
Job chains: sets of consecutive jobs
I Solution 1 (trivial)
~> cat allInOne.sge
#$ -S /bin/sh
#$ -N allInOne
./doFormatDB
./doBlastAll
./doPostprocessing
~> qsub allInOne.sge
Your job 10411 ("allInOne") has been submitted.
I Solution 2 (modular, nested qsub)
~> cat formatDB.sge
#$ -S /bin/sh
#$ -N FormatDB
./doFormatDB
qsub blastAll.sge
~> cat blastAll.sge
#$ -S /bin/sh
#$ -N BlastAll
./doBlastAll
qsub postprocessing.sge
...
~> qsub formatDB.sge
Job chains: sets of consecutive jobs (2)
I Solution 3 (optimized, uses -hold_jid <job id|job name>)
~> cat formatDB.sge
#$ -S /bin/sh
#$ -N FormatDB
./doFormatDB
~> cat blastAll.sge
#$ -S /bin/sh
#$ -N BlastAll
#$ -hold_jid FormatDB
./doBlastAll
...
~> qsub formatDB.sge
Your job 10451 ("formatDB") has been submitted.
~> qsub blastAll.sge
Your job 10452 ("blastAll") has been submitted.
~> qsub postprocessing.sge
Your job 10453 ("postprocessing") has been submitted.
. Advantage: accumulates “waiting time”
. Note: -hold_jid <job_name> can only be used to reference jobs of the same user (-hold_jid <job_id> can be used to reference any job)
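The -hold_jid approach of Solution 3 can also be driven from the command line instead of the script headers. This is a minimal sketch under stated assumptions: the function submit_chain and the QSUB variable are hypothetical, job names are derived from the script file names, and a command-line -N takes precedence over a "#$ -N" line in the script. Since dependencies are given by name, the names have to be unique among the user's jobs.

```shell
#!/bin/sh
# Sketch: submit job scripts as a chain in which every job waits for its
# predecessor. QSUB can be overridden, e.g. for a dry run.
QSUB=${QSUB:-qsub}

# Usage: submit_chain first.sge second.sge ...
# Assumption: job name = script file name without the .sge suffix.
submit_chain() {
    prev=""
    for script in "$@"; do
        name=$(basename "$script" .sge)
        if [ -z "$prev" ]; then
            # first job of the chain: no dependency
            $QSUB -N "$name" "$script"
        else
            # later jobs are held until the predecessor has finished
            $QSUB -N "$name" -hold_jid "$prev" "$script"
        fi
        prev=$name
    done
}
```

For a dry run that only prints the commands, set `QSUB="echo qsub"` before calling `submit_chain formatDB.sge blastAll.sge postprocessing.sge`.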
Array jobs
I Submit sets of similar and independent “tasks”:
. qsub -t 1-500:1 example_3.sge submits 500 instances of the same script
. each instance (“task”) is executed independently
. all instances are subsumed under a single job ID
. variable $SGE_TASK_ID discriminates between instances
. task numbering scheme: -t <first>-<last>:<stepsize>
. related: $SGE_TASK_FIRST, $SGE_TASK_LAST, $SGE_TASK_STEPSIZE
I Example:
#$ -S /bin/sh
#$ -cwd
#$ -N blastArray
#$ -t 1-500:1
QUERY=query_${SGE_TASK_ID}.fa
OUTPUT=blastout_${SGE_TASK_ID}.txt
echo "processing query $QUERY ..."
blastall -p blastn -d nt -i $QUERY -o $OUTPUT
echo "...done"
Array jobs (2)
~> qsub example_3.sge
Your job 10420.1-500:1 ("blastArray") has been submitted.
~> qstat
job-ID  prior    name        user  state  submit/start at      queue             slots  ja-task-ID
---
10420   0.56000  blastArray  mjr   r      02/13/2007 15:05:56  [email protected]  1      198
10420   0.56000  blastArray  mjr   r      02/13/2007 15:05:56  [email protected]  1      199
10420   0.56000  blastArray  mjr   r      02/13/2007 15:07:11  [email protected]  1      202
10420   0.56000  blastArray  mjr   r      02/13/2007 15:07:11  [email protected]  1      203
10420   0.56000  blastArray  mjr   r      02/13/2007 15:05:41  [email protected]  1      196
10420   0.56000  blastArray  mjr   r      02/13/2007 15:05:41  [email protected]  1      197
10420   0.55241  blastArray  mjr   r      02/13/2007 15:08:41  [email protected]  1      208
10420   0.55241  blastArray  mjr   r      02/13/2007 15:08:41  [email protected]  1      209
10420   0.56000  blastArray  mjr   r      02/13/2007 15:08:11  [email protected]  1      204
10420   0.56000  blastArray  mjr   r      02/13/2007 15:08:11  [email protected]  1      206
10420   0.56000  blastArray  mjr   r      02/13/2007 15:02:11  [email protected]  1      176
10420   0.56000  blastArray  mjr   r      02/13/2007 15:02:11  [email protected]  1      177
10420   0.56000  blastArray  mjr   r      02/13/2007 15:03:26  [email protected]  1      182
10420   0.56000  blastArray  mjr   r      02/13/2007 15:03:26  [email protected]  1      183
10420   0.56000  blastArray  mjr   r      02/13/2007 15:07:11  [email protected]  1      200
10420   0.56000  blastArray  mjr   r      02/13/2007 15:07:11  [email protected]  1      201
10420   0.56000  blastArray  mjr   r      02/13/2007 15:05:11  [email protected]  1      193
10420   0.56000  blastArray  mjr   r      02/13/2007 15:05:11  [email protected]  1      194
10420   0.56000  blastArray  mjr   r      02/13/2007 15:04:41  [email protected]  1      190
10420   0.56000  blastArray  mjr   r      02/13/2007 15:04:41  [email protected]  1      191
10420   0.56000  blastArray  mjr   r      02/13/2007 15:03:41  [email protected]  1      184
10420   0.56000  blastArray  mjr   r      02/13/2007 15:03:41  [email protected]  1      185
10420   0.56000  blastArray  mjr   r      02/13/2007 15:08:11  [email protected]  1      205
10420   0.56000  blastArray  mjr   r      02/13/2007 15:08:11  [email protected]  1      207
10420   0.56000  blastArray  mjr   r      02/13/2007 15:05:11  [email protected]  1      192
10420   0.56000  blastArray  mjr   r      02/13/2007 15:05:26  [email protected]  1      195
10420   0.56000  blastArray  mjr   r      02/13/2007 15:04:26  [email protected]  1      188
10420   0.56000  blastArray  mjr   r      02/13/2007 15:04:26  [email protected]  1      189
10420   0.56000  blastArray  mjr   r      02/13/2007 15:03:56  [email protected]  1      186
10420   0.56000  blastArray  mjr   r      02/13/2007 15:03:56  [email protected]  1      187
10420   0.55242  blastArray  mjr   qw     02/13/2007 14:28:34                    1      210-500:1
Array jobs (3)
I Benefits:
. simple organization
. simple interaction with job (single job ID)
. optimized throughput (see, e.g. qconf -sconf for jobs-per-user limits, etc.)
⇒ powerful tool for (trivially) parallel applications
I Notes:
. one stdout/stderr file per task
  stdout: <job name>.o<job ID>.<task ID>
  stderr: <job name>.e<job ID>.<task ID>
. task-specific $TMPDIR
. $SGE_TASK_ID (and its relatives) are undefined for non-array jobs
. allocate reasonable chunks of work to tasks
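One way to follow the last note is to use the step size of -t so that each task handles several work items. This is a sketch under assumptions: the function name task_items is hypothetical, each task is assumed to own the items $SGE_TASK_ID to $SGE_TASK_ID + $SGE_TASK_STEPSIZE - 1 (capped at $SGE_TASK_LAST), and non-array jobs are assumed to see the variable either unset or set to the literal string "undefined", which should be verified on the local installation.

```shell
#!/bin/sh
# Hypothetical helper: print the work-item indices handled by this task,
# one per line.
task_items() {
    id=${SGE_TASK_ID:-undefined}
    if [ "$id" = "undefined" ]; then
        # assumption: not an array job, process a single item
        echo 1
        return 0
    fi
    step=${SGE_TASK_STEPSIZE:-1}
    last=${SGE_TASK_LAST:-$id}
    end=$((id + step - 1))
    if [ "$end" -gt "$last" ]; then
        end=$last          # the last chunk may be shorter
    fi
    i=$id
    while [ "$i" -le "$end" ]; do
        echo "$i"
        i=$((i + 1))
    done
}

# Usage inside a job script submitted with, e.g., qsub -t 1-500:10
# (commented out, requires blastall and the query files):
#   for i in $(task_items); do
#       blastall -p blastn -d nt -i query_$i.fa -o blastout_$i.txt
#   done
```

With -t 1-500:10 the task IDs are 1, 11, ..., 491, so 50 tasks process 10 queries each instead of 500 tasks processing one query each.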
Excursus: load balancing
[Figure: timeline of the total work, split into chunks 1-5 and distributed over PE 1, PE 2 and PE 3, showing idle time, per-chunk overhead and the total time t_tot]
⇒ number of PEs ≪ number of chunks
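The trade-off between idle time and overhead can be explored with a toy model. This is a minimal sketch, assuming a greedy scheduler that hands each chunk to the PE that becomes free first and a constant overhead per chunk. The function name makespan and its arguments are hypothetical, and the model is far simpler than the actual SGE scheduler.

```shell
#!/bin/sh
# Toy model: total time t_tot for a list of chunks on a number of PEs.
# Usage: makespan <number of PEs> <overhead per chunk> <chunk duration>...
makespan() {
    npe=$1
    overhead=$2
    shift 2
    printf '%s\n' "$@" | awk -v n="$npe" -v ov="$overhead" '
        {
            # give the chunk to the PE that becomes free first
            best = 1
            for (p = 2; p <= n; p++)
                if (busy[p] + 0 < busy[best] + 0) best = p
            busy[best] += $1 + ov
        }
        END {
            # t_tot is the finishing time of the busiest PE
            tot = 0
            for (p = 1; p <= n; p++)
                if (busy[p] + 0 > tot) tot = busy[p]
            print tot
        }'
}
```

For a total work of 12 time units on 3 PEs, `makespan 3 0 12` gives 12 (a single chunk leaves two PEs idle), `makespan 3 0 3 3 2 2 2` gives 5, and twelve chunks of 1 unit give the ideal value 4. With an overhead of 1 unit per chunk the same twelve chunks take 8, which is why chunks should be neither too large nor too small.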
DRMAA
I Distributed Resource Management Application API:
. API specification for the submission and control of jobs to one or more DRM systems (see http://drmaa.org)
. Purpose: integration with applications
I Advantages:
. Portability, vendor independence
. Reliability: avoids error-prone parsing of output from qsub, qstat, ...
. Efficiency: avoids expensive (and intricate: e.g. Perl) system calls
I Implementations:
. SGE
. Bindings for Java, C/C++
. Modules for Perl, Python ...
DRMAA (2)
I Java example (fragment)
package de.mpg.rzg.drmaa.queue;

import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.ggf.drmaa.DrmaaException;
import org.ggf.drmaa.JobTemplate;
import org.ggf.drmaa.Session;
import org.ggf.drmaa.SessionFactory;

public class DrmaaQueueScheduler {
    ...
    public String submitJob() {
        String jobId = null;
        try {
            /* create DRMAA session */
            Session session = SessionFactory.getFactory().getSession();
            session.init(null);
            /* setup job template */
            JobTemplate jt = session.createJobTemplate();
            jt.setRemoteCommand("blastall");
            jt.setArgs(new String[]{"-p", "blastp", "-d", "nr"});
            jt.setJobName("blast");
            /* submit an array job; the job ID is the part of the first task ID before the '.' */
            List<String> taskIds = session.runBulkJobs(jt, 1, numJobTasks, chunkSize);
            jobId = taskIds.isEmpty() ? null : taskIds.get(0).split("[.]")[0];
        } catch (DrmaaException e) {
            logger.error("submitting DRMAA job failed: " + e.getMessage());
        }
        return jobId;
    }
}