Efficient cluster computing

Introduction to the Sun Grid Engine (SGE) queuing system

Markus Rampp (RZG, MIGenAS)

MPI for Evolutionary Anthropology

Leipzig, Feb. 16, 2007

Outline

• Introduction
• Basic concepts:
  ◦ queues, jobs, scripts
  ◦ essential SGE commands and options
• Advanced topics
  ◦ Job chains
  ◦ Array jobs
  ◦ DRMAA API
• Tips & Tricks, References

Introduction

• Sun Grid Engine (SGE): a popular batch-queuing system

  Software like SGE is typically used on a computer farm or computer cluster and is responsible for accepting, scheduling, dispatching, and managing the remote execution of large numbers of standalone, parallel or interactive user jobs. It also manages and schedules the allocation of distributed resources such as processors, memory, disk space, and software licenses. (taken from Wikipedia)

• Popular batch systems (DRMs)
  ◦ Sun Grid Engine (open source)
  ◦ LoadLeveler (IBM)
  ◦ NQS (Cray, NEC)
  ◦ DQS (open source)
  ◦ ...

Introduction (2)

• Why should one use a DRM? → Increase efficiency
• Operator's perspective:
  ◦ transparent resource management
    · clustering of compute resources
    · load balancing, optimization of resource usage
    · "fair" (policy-based) distribution of resources
    · accounting
• User's perspective:
  ◦ shared usage of system resources
  ◦ optimize throughput
  ◦ organize/simplify handling of ("large") computational tasks
  ◦ enhanced stability (survive system crashes, maintenance, ...)
  ◦ well-defined resource allocation (→ benchmarking)

Basic concepts

• Queues:

[Figure: queues (Queue 1, Queue 2, Queue 3) mapped onto compute resources (Resource A, Resource B)]

    queuename            qtype used/tot. load_avg arch        states
    [email protected]     BIP   0/2       0.00     lx26-x86    d
    [email protected]     BIP   0/2       0.00     lx26-x86
    [email protected]     BIP   0/2       0.00     lx26-x86
    [email protected]     BIP   0/2       0.00     lx26-x86
    [email protected]     BIP   0/2       0.00     lx26-x86
    [email protected]     BIP   0/2       0.00     lx26-x86
    [email protected]     BIP   0/2       0.00     lx26-x86
    [email protected]     B     0/0       0.09     lx26-amd64
    [email protected]     B     0/4       0.00     lx26-amd64
    [email protected]     B     0/4       0.00     lx26-amd64
    [email protected]     B     0/4       0.00     lx26-amd64
    [email protected]     B     0/4       0.00     lx26-amd64

Basic concepts (2)

• Jobs & scripts
  1. prepare script of executable commands
  2. specify resources and meta information
  3. submit to batch system (returns a job ID)
  4. use the job ID for job control (query status, cancel, ...)

    #$ -S /bin/sh
    #$ -cwd
    #$ -M [email protected]
    #$ -m e
    #$ -N example

    # begin executable commands (shell specified by #$ -S)
    # note: starting here, a leading '#' starts
    #       a comment, whereas in the above
    #       SGE header it does NOT
    echo "starting job ..."
    blastall -p blastp -d nr -i query_1.fa -o blastout_1.txt
    blastall -p blastp -d nr -i query_2.fa -o blastout_2.txt
    echo "...done"

    ~> qsub example_1.sge
    Your job 10404 ("example") has been submitted.
    ~> qstat
    job-ID  prior   name       user  state submit/start at     queue  slots ja-task-ID
    ------------------------------------------------------------------------------------
      10404 0.00000 example    mjr   qw    02/12/2007 11:43:42        1
    ~> qstat
    job-ID  prior   name       user  state submit/start at     queue  slots ja-task-ID
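Step 4 above needs the job ID that qsub reports in step 3. The following is a minimal sketch of how a wrapper script might pick that number out of the confirmation message shown above. The function name parse_job_id is a hypothetical helper (it is not an SGE command), and the pattern assumes the message format printed in this example.

```shell
#!/bin/sh
# Extract the numeric job ID from a qsub confirmation message of the form
#   Your job 10404 ("example") has been submitted.
# For array jobs the ID is followed by a task range (e.g. 10420.1-500:1);
# only the leading job number is returned.
# parse_job_id is a hypothetical helper, not part of SGE.
parse_job_id() {
    printf '%s\n' "$1" |
        sed -n 's/^Your job\(-array\)\{0,1\} \([0-9][0-9]*\).*$/\2/p'
}
```

A possible use is `jobid=$(parse_job_id "$(qsub example_1.sge)")`, followed by `qstat -j "$jobid"` or `qdel "$jobid"`. Parsing command output in this way is fragile, because it breaks as soon as the message format changes.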

SGE commands & options

• Interacting with the queuing system: SGE's q-commands (cf. ll-commands of LoadLeveler)

    SGE command    purpose                                                   LoadLeveler
    qsub           submit job                                                llsubmit
    qstat          query queue/job status                                    llq, llstatus
    qdel           delete job                                                llcancel
    qhold          hold ("suspend") job (note: user/operator/system holds)   llhold
    qrls           release holds                                             llhold -r
    qalter, qmod   modify job                                                llmodify
    qhost          provide concise system overview

SGE commands & options (2)

• Specify qsub options in the script header and/or on the command line (command-line options override those in the script)
• Essential options for qsub:

    -S                  path to shell
    -m b|e|a|s|n|...    send mail at beginning|end|... of job
    -M                  e-mail address for notification
    -N                  name of job
    -j y                join stdout and stderr

• Additional options for qsub:

    -q                  queue
    -p                  priority (default 0; users may only decrease)
    -P                  name of project
    -a                  earliest date/time at which a job is eligible for execution
    ...                 cf. man qsub

SGE commands & options (3)

• Commonly used options for qstat:

    qstat                          displays list of jobs only
    qstat -u <user> -j <job ID>    displays list of jobs for specified user/job
    qstat -f                       full format display
    qstat -r                       extended display (incl. resource requirements, scheduling info)

    ~> qstat -f
    queuename            qtype used/tot. load_avg arch        states
    [email protected]     BIP   0/2       0.04     lx26-x86    d
    [email protected]     BIP   0/2       0.00     lx26-x86
    ---...
    [email protected]     BIP   0/2       0.00     lx26-x86
    [email protected]     BIP   2/2       3.67     lx26-x86
      10422 0.56000 megaBLAST  hfz   r     02/13/2007 20:34:12     1 882
      10422 0.56000 megaBLAST  hfz   r     02/13/2007 20:34:12     1 883
    [email protected]     BIP   2/2       4.85     lx26-x86
      10422 0.56000 megaBLAST  hfz   r     02/13/2007 20:28:27     1 864
      10422 0.56000 megaBLAST  hfz   r     02/13/2007 20:31:57     1 875
    . . .
    ############################################################################
     - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
    ############################################################################

SGE commands & options (4)

    ~> qstat -r -j 10422
    ==============================================================
    job_number:       10422
    exec_file:        job_scripts/10422
    submission_time:  Tue Feb 13 16:34:29 2007
    owner:            hfz
    uid:              1553
    group:            rzb
    gid:              4131
    sge_o_home:       /afs/ipp/home/h/hfz
    sge_o_log_name:   hfz
    sge_o_path:       /opt/SGE6/bin/lx26-x86:/usr/local/bin:/opt/gnome/bin:/usr/games:/usr/bin/X11:/usr/bin:/bin
    sge_o_shell:      /bin/tcsh
    sge_o_workdir:    /bio/tmp/hfz/Sargossa
    sge_o_host:       e01
    account:          sge
    cwd:              /bio/tmp/hfz/Sargossa
    path_aliases:     /tmp_mnt/ * * /
    mail_list:        [email protected]
    notify:           FALSE
    job_name:         megaBLAST
    jobshare:         0
    shell_list:       /bin/sh
    env_list:
    script_file:      /afs/ipp/home/h/hfz/MySQL/Sequenzen/e01/submit_megablast_test.sge
    project:          gendb
    job-array tasks:  1-5000:1
    usage 808:        cpu=00:08:46, mem=595.19275 GBs, io=0.00000, vmem=1.377G, maxvmem=1.566G
    . .
    usage 875:        cpu=00:00:24, mem=28.67529 GBs, io=0.00000, vmem=1.251G, maxvmem=1.251G
    scheduling info:  queue instance "[email protected]" dropped because it is disabled
                      queue instance "[email protected]" dropped because it is
                      queue instance "[email protected]" dropped because it is full
                      queue instance "[email protected]" dropped because it is full
                      queue instance "[email protected]" dropped because it is full
                      queue instance "[email protected]" dropped because it is full
                      queue instance "[email protected]" dropped because it is full
                      queue instance "[email protected]" dropped because it is full
                      queue instance "[email protected]" dropped because it is full
                      queue instance "[email protected]" dropped because it is full
                      (project gendb) is not allowed to run in host "e07.bc.rzg.mpg.de" based on the excluded project list

Input/Output

• Output
  ◦ stdout: <job name>.o<job ID>
  ◦ stderr: <job name>.e<job ID>
  ◦ path: can be specified by qsub -o <stdout path> -e <stderr path>
    paths are relative to
    · current working directory at submission (with qsub -cwd option)
    · user's home directory (if -cwd option is not specified)

    ~> ls
    example.e10404
    example.o10404
    example_1.sge

• Input
  ◦ arguments: qsub [ options ] [ command | -- [ command_args ]]

Advanced topics

• Job chains: sets of consecutive interdependent jobs
• Job arrays: sets of similar and independent (parallel) jobs
• DRMAA: Distributed Resource Management Application API

Job chains: sets of consecutive jobs

• Solution 1 (trivial)

    ~> cat allInOne.sge
    #$ -S /bin/sh
    #$ -N allInOne
    ./doFormatDB
    ./doBlastAll
    ./doPostprocessing
    ~> qsub allInOne.sge
    Your job 10411 ("allInOne") has been submitted.

• Solution 2 (modular, nested qsub)

    ~> cat formatDB.sge
    #$ -S /bin/sh
    #$ -N FormatDB
    ./doFormatDB
    qsub blastAll.sge
    ~> cat blastAll.sge
    #$ -S /bin/sh
    #$ -N BlastAll
    ./doBlastAll
    qsub postprocessing.sge
    . . .
    ~> qsub formatDB.sge

Job chains: sets of consecutive jobs (2)

• Solution 3 (optimized, uses -hold_jid <job id|job name>)

    ~> cat formatDB.sge
    #$ -S /bin/sh
    #$ -N FormatDB
    ./doFormatDB
    ~> cat blastAll.sge
    #$ -S /bin/sh
    #$ -N BlastAll
    #$ -hold_jid FormatDB
    ./doBlastAll
    . . .
    ~> qsub formatDB.sge
    Your job 10451 ("FormatDB") has been submitted.
    ~> qsub blastAll.sge
    Your job 10452 ("BlastAll") has been submitted.
    ~> qsub postprocessing.sge
    Your job 10453 ("postprocessing") has been submitted.

  ◦ Advantage: all jobs of the chain are queued at once, so the later jobs already accumulate "waiting time" while their predecessors run
  ◦ Note: -hold_jid <job_name> can only be used to reference jobs of the same user (-hold_jid <job_id> can be used to reference any job)
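The -hold_jid approach can also be driven entirely from the command line. The sketch below only prints the qsub commands for a chain (a dry run), so it can be inspected without a cluster. The function chain_commands is a hypothetical helper, and deriving each job name from the script file name is an assumption made for this illustration; -N and -hold_jid are the qsub options described above.

```shell
#!/bin/sh
# Print one qsub command line per script of a job chain. Every job except
# the first is held until the job named after the previous script finishes.
# chain_commands is a hypothetical helper; job names are derived from the
# script file names (an assumption made for this sketch).
chain_commands() {
    prev=""
    for script in "$@"; do
        name=$(basename "$script" .sge)
        if [ -z "$prev" ]; then
            echo "qsub -N $name $script"
        else
            echo "qsub -N $name -hold_jid $prev $script"
        fi
        prev=$name
    done
}
```

Calling `chain_commands formatDB.sge blastAll.sge postprocessing.sge` prints three command lines that can be reviewed and then executed, for example by piping them to sh.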

Array jobs

• Submit sets of similar and independent "tasks":
  ◦ qsub -t 1-500:1 example_3.sge submits 500 instances of the same script
  ◦ each instance ("task") is executed independently
  ◦ all instances are subsumed under a single job ID
  ◦ variable $SGE_TASK_ID discriminates between instances
  ◦ task numbering scheme: -t <first>-<last>:<stepsize>
  ◦ related: $SGE_TASK_FIRST, $SGE_TASK_LAST, $SGE_TASK_STEPSIZE
• Example:

    #$ -S /bin/sh
    #$ -cwd
    #$ -N blastArray
    #$ -t 1-500:1
    QUERY=query_${SGE_TASK_ID}.fa
    OUTPUT=blastout_${SGE_TASK_ID}.txt
    echo "processing query $QUERY ..."
    blastall -p blastn -d nt -i $QUERY -o $OUTPUT
    echo "...done"
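To see which values $SGE_TASK_ID takes for a given -t <first>-<last>:<stepsize> specification, the numbering scheme can be reproduced in a few lines of shell. This is only an illustrative sketch; task_ids is a hypothetical helper and not part of SGE.

```shell
#!/bin/sh
# List the task IDs generated by "qsub -t <first>-<last>:<stepsize>":
# start at <first> and add <stepsize> while the ID does not exceed <last>.
# task_ids is a hypothetical helper for illustration only.
task_ids() {
    first=$1
    last=$2
    step=$3
    id=$first
    while [ "$id" -le "$last" ]; do
        echo "$id"
        id=$((id + step))
    done
}
```

For instance, `task_ids 1 500 1` lists the 500 IDs of the example above, while `task_ids 1 10 3` lists 1, 4, 7 and 10.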

Array jobs (2)

    ~> qsub example_3.sge
    Your job 10420.1-500:1 ("blastArray") has been submitted.
    ~> qstat
    job-ID  prior   name       user  state submit/start at     queue             slots ja-task-ID
    -----------------------------------------------------------------------------------------------
      10420 0.56000 blastArray mjr   r     02/13/2007 15:05:56 [email protected]  1     198
      10420 0.56000 blastArray mjr   r     02/13/2007 15:05:56 [email protected]  1     199
      10420 0.56000 blastArray mjr   r     02/13/2007 15:07:11 [email protected]  1     202
      10420 0.56000 blastArray mjr   r     02/13/2007 15:07:11 [email protected]  1     203
      10420 0.56000 blastArray mjr   r     02/13/2007 15:05:41 [email protected]  1     196
      10420 0.56000 blastArray mjr   r     02/13/2007 15:05:41 [email protected]  1     197
      10420 0.55241 blastArray mjr   r     02/13/2007 15:08:41 [email protected]  1     208
      10420 0.55241 blastArray mjr   r     02/13/2007 15:08:41 [email protected]  1     209
      10420 0.56000 blastArray mjr   r     02/13/2007 15:08:11 [email protected]  1     204
      10420 0.56000 blastArray mjr   r     02/13/2007 15:08:11 [email protected]  1     206
      10420 0.56000 blastArray mjr   r     02/13/2007 15:02:11 [email protected]  1     176
      10420 0.56000 blastArray mjr   r     02/13/2007 15:02:11 [email protected]  1     177
      10420 0.56000 blastArray mjr   r     02/13/2007 15:03:26 [email protected]  1     182
      10420 0.56000 blastArray mjr   r     02/13/2007 15:03:26 [email protected]  1     183
      10420 0.56000 blastArray mjr   r     02/13/2007 15:07:11 [email protected]  1     200
      10420 0.56000 blastArray mjr   r     02/13/2007 15:07:11 [email protected]  1     201
      10420 0.56000 blastArray mjr   r     02/13/2007 15:05:11 [email protected]  1     193
      10420 0.56000 blastArray mjr   r     02/13/2007 15:05:11 [email protected]  1     194
      10420 0.56000 blastArray mjr   r     02/13/2007 15:04:41 [email protected]  1     190
      10420 0.56000 blastArray mjr   r     02/13/2007 15:04:41 [email protected]  1     191
      10420 0.56000 blastArray mjr   r     02/13/2007 15:03:41 [email protected]  1     184
      10420 0.56000 blastArray mjr   r     02/13/2007 15:03:41 [email protected]  1     185
      10420 0.56000 blastArray mjr   r     02/13/2007 15:08:11 [email protected]  1     205
      10420 0.56000 blastArray mjr   r     02/13/2007 15:08:11 [email protected]  1     207
      10420 0.56000 blastArray mjr   r     02/13/2007 15:05:11 [email protected]  1     192
      10420 0.56000 blastArray mjr   r     02/13/2007 15:05:26 [email protected]  1     195
      10420 0.56000 blastArray mjr   r     02/13/2007 15:04:26 [email protected]  1     188
      10420 0.56000 blastArray mjr   r     02/13/2007 15:04:26 [email protected]  1     189
      10420 0.56000 blastArray mjr   r     02/13/2007 15:03:56 [email protected]  1     186
      10420 0.56000 blastArray mjr   r     02/13/2007 15:03:56 [email protected]  1     187
      10420 0.55242 blastArray mjr   qw    02/13/2007 14:28:34                   1     210-500:1

Array jobs (3)

• Benefits:
  ◦ simple organization
  ◦ simple interaction with job (single job ID)
  ◦ optimized throughput (see, e.g., qconf -sconf for jobs-per-user limits, etc.)
  → powerful tool for (trivially) parallel applications
• Notes:
  ◦ one stdout/stderr file per task
    stdout: <job name>.o<job ID>.<task ID>
    stderr: <job name>.e<job ID>.<task ID>
  ◦ task-specific $TMPDIR
  ◦ $SGE_TASK_ID (and its relatives) are not defined for non-array jobs (SGE sets them to the string "undefined")
  ◦ allocate reasonable chunks of work to tasks

Excursus: load balancing

[Figure: the total work is split into chunks 1-5, which are distributed over processing elements PE 1-3 along a time axis; idle time and overhead are marked]

⇒ number of PEs ≪ number of chunks ≪ t_tot

DRMAA

• Distributed Resource Management Application API:
  ◦ API specification for the submission and control of jobs to one or more DRM systems (see http://drmaa.org)
  ◦ Purpose: integration with applications
• Advantages:
  ◦ Portability, vendor independence
  ◦ Reliability: avoids error-prone parsing of output from qsub, qstat, ...
  ◦ Efficiency: avoids expensive (and intricate: e.g. Perl) system calls
• Implementations:
  ◦ SGE
  ◦ Bindings for Java, C/C++
  ◦ Modules for Perl, Python, ...

DRMAA (2)

• Java example (fragment)

    package de.mpg.rzg.drmaa.queue;

    import java.util.List;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.ggf.drmaa.DrmaaException;
    import org.ggf.drmaa.JobTemplate;
    import org.ggf.drmaa.Session;
    import org.ggf.drmaa.SessionFactory;

    public class DrmaaQueueScheduler {
        ...
        public String submitJob() {
            String jobId = null;
            try {
                /* create DRMAA session */
                Session session = SessionFactory.getFactory().getSession();
                session.init(null);

                /* setup job template */
                JobTemplate jt = session.createJobTemplate();
                jt.setRemoteCommand("blastall");
                jt.setArgs(new String[]{"-p", "blastp", "-d", "nr"});
                jt.setJobName("blast");

                /* submit as array job; the job ID is the part of a task ID before the '.' */
                List<String> taskIds = session.runBulkJobs(jt, 1, numJobTasks, chunkSize);
                jobId = taskIds.isEmpty() ? null : taskIds.get(0).split("[.]")[0];
            } catch (DrmaaException e) {
                logger.error("submitting DRMAA job failed: " + e.getMessage());
            }
            return jobId;
        }

Tips & Tricks

• Submit scripts
  ◦ do not hard-wire SGE logic into your application
  ◦ instead, use SGE scripts only as simple wrappers
  ◦ example:

    #$ -S /bin/sh
    #$ -t 1-1000:10
    perl ${HOME}/doMegablastChunk.pl $SGE_TASK_ID $SGE_TASK_STEPSIZE $TMPDIR

  → facilitates:
    · (interactive) testing
    · code maintenance
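With a step size larger than 1, as in the wrapper example above, each task receives only its starting index and the step size and has to work out for itself which items it owns. A minimal sketch of that calculation follows; chunk_range and the explicit n_items argument (total number of work items) are hypothetical names introduced for this illustration, and what doMegablastChunk.pl actually does is not shown in the slides.

```shell
#!/bin/sh
# Compute the first and last work item handled by one array task when the
# task IDs are spaced by a step size, e.g. "#$ -t 1-1000:10".
# chunk_range and n_items are hypothetical names used for illustration.
chunk_range() {
    task_id=$1
    stepsize=$2
    n_items=$3
    first=$task_id
    last=$((task_id + stepsize - 1))
    # the final chunk may be shorter than the step size
    if [ "$last" -gt "$n_items" ]; then
        last=$n_items
    fi
    echo "$first $last"
}
```

Inside a job script this could be called as `chunk_range "$SGE_TASK_ID" "$SGE_TASK_STEPSIZE" 1000`. Because the function takes plain arguments, it can also be tested interactively without submitting a job.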

Tips & Tricks (2)

• Misc
  ◦ do not rely on checkpointing: implement restart capability instead
  ◦ do not rely on the (interactive) environment (e.g. $PATH)
  ◦ choose an appropriate location for stdout, stderr
  ◦ redirect (wanted) stdout to a separate file
  ◦ use reasonable partitioning of total computational work:
    · avoid very short jobs/tasks (≲ 1 minute): scheduling overhead
    · avoid very long jobs/large arrays (≳ several days, ≳ 10000 tasks): manageability
• RZG specific
  ◦ issue save-password (AFS/Kerberos) before submitting your first job or after a change of your RZG password
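The advice to implement restart capability can be followed with very little code. The sketch below records each completed step with a marker file, so that a resubmitted job skips work that already finished. The function run_once and the marker-file convention are assumptions made for this illustration, not an SGE feature.

```shell
#!/bin/sh
# Run a command only if its marker file does not exist yet and create the
# marker after the command succeeded. A failed command leaves no marker,
# so the step is repeated when the job is resubmitted.
# run_once is a hypothetical helper, not part of SGE.
run_once() {
    marker=$1
    shift
    if [ -f "$marker" ]; then
        echo "skip: $marker"
        return 0
    fi
    "$@" && touch "$marker"
}
```

In a job script, a step could then be written as `run_once query_1.done blastall -p blastp -d nr -i query_1.fa -o blastout_1.txt`. The marker files should live on a file system that survives the job, not in $TMPDIR.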

Tips & Tricks (3)

• References and further reading
  ◦ Wikipedia: http://en.wikipedia.org/wiki/Sun_Grid_Engine
  ◦ SGE homepage: http://gridengine.sunsource.net/
  ◦ SGE documentation: http://gridengine.sunsource.net/documentation.html
  ◦ SGE man pages
  ◦ SGE documentation on the RZG homepage (section "Computing"): http://www.rzg.mpg.de/
  ◦ SGE configuration on the SUN Linux Cluster of the MPI-EVAn
