Sun Powers the Grid
SUPERCOMPUTING 2001
Grid Software Stack

● Platform: Solaris
● Development Tools: Forte
● Infrastructure Administration: iPlanet, JXTA
● Systems Administration: SunMC, iChange, Solaris Management Console
● Distributed Resource Management: Sun Grid Engine
● Distributed Process Management: ClusterTools
● Storage Management: JIRO, QFS, SAM-FS, HPC-SAN
● Campus Grid: SGE Broker, SGE Enterprise Edition
● Global Grid: Avaki, Cactus, Globus, PUNCH
Sun Grid Engine

● http://www.sun.com/gridware
  – Product Overview, FAQs, Web Based Training
● http://supportforum.sun.com/gridengine
  – Support forum on Installation / Compute Farms / HPC
  – Application Notes, FAQs
● http://gridengine.sunsource.net
  – Open Source Project
  – Mailing Lists
  – Courtesy Binaries
Sun Grid Engine Focus:
Large, long-running batch jobs

[Chart: number of users (10 to 10m) against job duration (msec to days);
transaction processing occupies the short-duration, many-users region,
batch processing the long-duration region.]
Resource-Selection Overview

[Diagram: a user submits jobs to Sun Grid Engine, which dispatches them
to resources (CS, TCF, WS) and returns the results.]

Dispatch takes into account:
● Resource allocation policy
● Job and resource priority
● Systems' characteristics
Design Approach: Bank Analogy

[Diagram: customers in a waiting area (pending list) are served at
counters staffed by tellers for Loans and Mortgages.]

In the analogy, customers correspond to jobs, counters to queues, and
the waiting area to the pending job list.
Simple Installation

[Diagram: a master host holds $CODINE_ROOT/default/ and its
subdirectories, shared with the exec hosts, a submit host, and an
admin host.]
High Availability Installation

[Diagram: an HA file server holds $CODINE_ROOT/default/ and the other
directories; a master host and a shadow master host both access it,
alongside an exec host, a submit host, and an admin host.]
Architecture

[Diagram: hosts run Solaris, Linux, etc. Each exec host runs execd and
commd; the master host additionally runs qmaster and schedd.]
Host Types and Daemons

Host type     Mandatory daemons        Optional daemons
Master-Host   masterd, schedd, commd   execd, shadowd
Exec-Host     execd, commd             shadowd
Submit-Host   –                        –
Admin-Host    –                        –
Queues

A queue is a job container. Jobs acquire attributes of the queue, e.g.:
● CPU and memory limits
● Priority
● Suspension (scheduled, manual)
● Auxiliary scripts
● Interactive, batch, parallel

Job slots allow multiple jobs with the same characteristics to run on
the same host.

[Diagram: a host carries several queues, each providing job slots.]
Information Flow

[Diagram: (1) qsub submits the job to qmaster, which (2)–(3) records it
and involves schedd; (4) qmaster dispatches the job to an execd; (5)
execds send load reports; (7) the execd informs qmaster when the job is
done; (8) accounting is recorded.]
Scheduling

A job script can carry embedded resource requests, e.g.:

  # STARHPC
  # Version 1
  #$ -l cdlic=1
  cp xyz abc

schedd and qmaster perform:

job selection
● user sort
● first come, first served

queue selection
● resource match
● sequence number
● load formula

load formula (configurable by admin), e.g.:
● load_avg
● slots
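As an illustrative sketch (not the actual schedd implementation; host and queue names are made up), a configurable load formula can be thought of as a weighted sum of load attributes that ranks candidate queues:

```python
# Illustrative sketch of load-formula queue ranking (not the real schedd).
# Each queue reports load values; an admin-configurable formula combines
# them into a single score, and jobs go to the lowest-scoring queue.

def load_score(load_values, formula):
    """Evaluate a simple 'weight*attribute + ...' formula string."""
    score = 0.0
    for term in formula.split("+"):
        weight, attr = term.strip().split("*")
        score += float(weight) * load_values[attr]
    return score

queues = {
    "hostA.q": {"load_avg": 0.5, "slots": 2},   # made-up load reports
    "hostB.q": {"load_avg": 2.0, "slots": 4},
}
formula = "1.0*load_avg"  # e.g. rank purely by load average
ranked = sorted(queues, key=lambda q: load_score(queues[q], formula))
print(ranked[0])  # least-loaded queue gets the next job
```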
Job Execution Event Chain

[State diagram: on START, the queue/host prolog, parallel start, job
script, parallel stop, and queue/host epilog run in order, ending at
END (EXIT). SUSPEND triggers the suspend method (and later the resume
method); DELETE triggers the terminate method and clean command;
MIGRATE triggers the checkpoint command and migration command, runs the
parallel stop and queue/host epilog, and requeues the job. The
checkpoint command can also run at specified intervals.]
Complexes

A complex is a related group of resources.

Resource types (examples):
● Host: num_proc, load_avg, mem_total, swap_free
● Queue: mem limit, tmp dir, slots, susp sched
● Global: floating licenses, user-defined

Attributes can be:
● consumable (swap_free) or fixed (num_proc)
● requestable (license, mem limit)

Example:
qsub -l job_type=synth,license=1,mem_limit=1G
Complexes: Resource Attributes

Variable   Comment
name       must be unique across all complexes
shortcut   unique; may be used as a replacement everywhere
type       STRING, INT, BOOL, MEMORY, DOUBLE, TIME, CSTRING, HOST
value      default value; may be overridden by actual value or load report
relop      relational operator: <= < == > >= !=
request    if YES, user can request; if FORCED, user MUST request
consum     attribute is consumable
default    default request if resource is marked consumable
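For illustration, one attribute line in a complex definition might look like the following (hypothetical values for the `cdlic` license attribute used earlier; the exact file format depends on the SGE version, see the complex(5) man page):

```
#name   shortcut  type  value  relop  requestable  consumable  default
cdlic   cd        INT   0      <=     YES          YES         1
```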
Load Sensors: Tracking Arbitrary Resources

● Host-specific and cluster-wide information. Examples:
  – "free disk space on an arbitrary partition" (host-specific)
  – "number of licenses of a particular SW in use" (cluster-wide)
● Custom script/program queried periodically
● Information used for:
  – resource requests
  – load thresholds
  – load formula

[Diagram: load sensors run on each compute host in the cluster.]
Resource Matching

Can a job run in a particular queue?
if (user-request RELOP actual-value) then YES

Example 1:
qsub -l arch=solaris64 myjob.sh
● actual arch=glinux
● RELOP for arch is ==
if (solaris64 == glinux) then schedule job.
Result: job does NOT get scheduled

Example 2:
qsub -l h_vmem=256m myjob.sh
● actual h_vmem=512m
● RELOP for h_vmem is <=
if (256m <= 512m) then schedule job.
Result: job gets scheduled
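The two examples above boil down to one rule. A minimal sketch of it in Python (illustrative only; the attribute values mirror the examples):

```python
import operator

# Sketch of the resource-matching rule: a job may run in a queue
# if (user-request RELOP actual-value) holds.
RELOPS = {"==": operator.eq, "<=": operator.le, ">=": operator.ge}

def matches(request, actual, relop):
    return RELOPS[relop](request, actual)

# Example 1: arch is compared with "==", so solaris64 vs. glinux fails.
print(matches("solaris64", "glinux", "=="))   # False -> not scheduled
# Example 2: h_vmem uses "<=", and 256 MB fits within 512 MB.
print(matches(256, 512, "<="))                # True -> scheduled
```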
Inheritance of Resources

[Diagram: resources attach at three levels. A global resource
(resA=val1) holds its value everywhere; a host resource (resB=val2)
holds on all queues of that host; a queue resource (resC=val3) holds
for all jobs in that queue. A user-defined resource (resD=val4) holds
wherever it is attached. A host-level setting (resA=val6) overrides
the global value.]
Consumables

A SPECIAL TYPE OF RESOURCE

● Can use a user-defined resource, built-in load values, or a value
  from a load sensor:
  – "free memory"
  – "amount of space in a scratch directory"
  – "number of licenses"
● Consumption of resources is determined in two ways:
  – If REQUESTABLE is YES or FORCED, individual jobs "consume" the
    specified amount of resources (either the amount requested or the
    DEFAULT amount)
  – If REQUESTABLE is NO, the amount "consumed" must be given by a
    built-in load value or load sensor
● Jobs requesting unavailable resources will wait until they are freed
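The bookkeeping described above can be sketched as follows (illustrative only, not SGE source; the license counts are made up):

```python
# Sketch of consumable accounting: each dispatched job subtracts its
# request from the available amount; a job waits while not enough is free.
class Consumable:
    def __init__(self, total):
        self.available = total

    def try_consume(self, amount):
        if amount <= self.available:
            self.available -= amount
            return True      # job can be dispatched
        return False         # job waits until resources are freed

    def release(self, amount):
        self.available += amount

licenses = Consumable(total=2)
print(licenses.try_consume(1))  # True: 1 of 2 licenses taken
print(licenses.try_consume(2))  # False: only 1 left, job must wait
licenses.release(1)
print(licenses.try_consume(2))  # True: both licenses free again
```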
Example: Consumable

[Screenshots: initial global and queue complex values; jobs requesting
resources; updated global and queue complex values.]
Parallel and Checkpointing Environments

OVERVIEW

An environment is a set of queues that is used to support parallel or
checkpointing jobs.

[Diagram: queues q1–q7, with subsets grouped into Env A and a second
environment.]
PE Configuration
Parallel Environment

codine_pe(5); qconf -sp <pe name>

allocation rule

Setting        Comment
<integer>      allocate exactly this many slots per host
$pe_slots      allocate as many slots on a single host as stated on the
               command line: qsub -pe <pe name> <slot range>
$fill_up       fill up one host, move to another, continue until range
               filled
$round_robin   do round-robin allocation over all suitable hosts until
               range filled
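To make the difference between the last two rules concrete, here is an illustrative sketch (not SGE source; hosts and slot counts are made up):

```python
# Sketch of two of the allocation rules above: $fill_up packs hosts one
# after another; $round_robin cycles one slot at a time across hosts.
def fill_up(free_slots, needed):
    alloc = {}
    for host, free in free_slots.items():
        take = min(free, needed)
        if take:
            alloc[host] = take
            needed -= take
        if needed == 0:
            break
    return alloc

def round_robin(free_slots, needed):
    assert needed <= sum(free_slots.values())  # sketch: no waiting logic
    alloc = {h: 0 for h in free_slots}
    while needed:
        for host, free in free_slots.items():
            if needed and alloc[host] < free:
                alloc[host] += 1
                needed -= 1
    return {h: n for h, n in alloc.items() if n}

free = {"hostA": 4, "hostB": 4}
print(fill_up(free, 5))      # {'hostA': 4, 'hostB': 1}
print(round_robin(free, 5))  # {'hostA': 3, 'hostB': 2}
```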
Checkpoint Configuration
Load Sensor: Free Disk Space

A load sensor can be a script or a binary. It loops until told to quit
and reports host-specific information in a specific format:

#!/bin/sh
# Report free space on /var as a host-specific load value.
myhost=`$CODINE_ROOT/utilbin/solaris64/gethostname -name`

end=false
while [ $end = false ]; do
    # Loop until the input pipe closes or qmaster sends "quit".
    read input
    result=$?
    if [ $result != 0 ]; then
        end=true
        break
    fi
    if [ "$input" = "quit" ]; then
        end=true
        break
    fi
    # Output the information in the expected begin/end format.
    echo "begin"
    dfoutput=`df -k /var | tail -1`
    varfree=`echo $dfoutput | awk '{ print $4 }'`
    echo "$myhost:varfree:${varfree}k"
    echo "end"
done
Consumables: License Management

Problem:
For a particular application, there are four floating licenses: two for
long jobs, two for short jobs.

Requirements:
● The application should only run on two particular hosts out of the
  whole cluster
● Jobs should only run if a license is available
● At any one time there can only be four instances of the job running:
  two short jobs and two long
Consumables: License Management

● Create global consumables: long=2, short=2
● Set them to zero explicitly wherever they're not desired:
  – Host1 and Host2: queueA sets long=0 (short jobs run here),
    queueB sets short=0 (long jobs run here)
  – All other hosts: long=0, short=0 (neither runs here)
● Jobs naturally go to whatever queue remains
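The zero-override pattern can be sketched as follows (illustrative Python; queue names mirror the slide, not a real cluster):

```python
# Sketch of the zero-override pattern: a queue's effective consumable
# value is its own setting if present, otherwise the global value.
global_consumables = {"long": 2, "short": 2}

queue_overrides = {
    "host1.queueA": {"long": 0},              # only short jobs may run here
    "host1.queueB": {"short": 0},             # only long jobs may run here
    "other.queueX": {"long": 0, "short": 0},  # neither type runs here
}

def effective_value(queue, resource):
    return queue_overrides.get(queue, {}).get(resource,
                                              global_consumables[resource])

print(effective_value("host1.queueA", "short"))  # 2: short jobs fit here
print(effective_value("host1.queueA", "long"))   # 0: long jobs are shut out
print(effective_value("other.queueX", "short"))  # 0: nothing runs here
```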
Queues: Resource Sharing

Problem: set a base configuration to match needs.

Hardware:
● 24 x Sun Fire 280R, 2 CPUs, 1 GB RAM

Requirements:
● Interactive + fast jobs: no more than 16 at a time, runtime <= 1 hr,
  mem <= 256 MB, up to 2 jobs per CPU
● Long jobs: no time limit, mem <= 512 MB, only one job per CPU
● Suspend long jobs, if needed, to run short jobs, but try to avoid
  suspending whenever possible
Queues: Resource Sharing

Complexes: add a new queue complex resource

Name     Type    Value  Relop  Req  Cons.  Default
jobtype  string  NONE   ==     YES  NO     long

Queues: on each host, set up two queues, long and short:
● batch-only queue: 2 slots, set h_vmem to 512M, set jobtype=long
● batch + interactive queue: 4 slots, make the other queue subordinate,
  set h_rt to 600 seconds, h_vmem to 256M, set jobtype=short
● on all machines, number the queues in opposite ascending order, e.g.:

  queue       hostA  hostB  hostC  hostD
  batch only  1      2      3      4
  batch+int   4      3      2      1

Cluster:
● set user-sort
● set queue_sort_method to seqno
● disable batch+int queues on all but 4 systems (enable on others,
  e.g., if a system goes down)

To run a short job:        qsub -l jobtype=short shortjob.sh
To run an interactive job: qrsh /usr/local/bin/myjob
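Why number the queues in opposite ascending order? With queue_sort_method=seqno, the queue with the lowest sequence number is tried first, so the two job types start filling the cluster from opposite ends. A small illustrative sketch (host names mirror the table above):

```python
# Sketch of seqno-based queue selection: lower sequence number is
# preferred, so opposite orderings spread the two job types apart.
batch_only = {"hostA": 1, "hostB": 2, "hostC": 3, "hostD": 4}
batch_int  = {"hostA": 4, "hostB": 3, "hostC": 2, "hostD": 1}

first_batch_host = min(batch_only, key=batch_only.get)
first_int_host   = min(batch_int, key=batch_int.get)
print(first_batch_host, first_int_host)  # long jobs start at hostA,
# short/interactive jobs at hostD, so they rarely collide until the
# cluster fills up
```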