The Quality of Service on the Grid with User Level Scheduling

Enabling Grids for E-sciencE

www.eu-egee.org

Jakub T. Moscicki CERN / IT

Informatics Institute Seminar, UvA, Amsterdam, September 1st, 2006

Q/A: answers to some of the questions during the seminar

Context

• ARDA Project

– enabling applications and users on the Grid

• Users

– High Energy Physics: LHC experiments

– EGEE: biomed, special activities

• Grid

– LCG and EGEE Grid

▪ the largest Grid infrastructure to date

▪ over 200 sites

▪ over 20K worker nodes

▪ over 5 PB of storage

Main Grid Characteristics

• Grid

– federation of heterogeneous computing and storage resources

▪ effects of scale:

• efficiency

o far less efficient than in the local, optimized, smaller systems

• reliability and failures

o brokering

o site/application configuration

▪ user viewpoint

• looks like a large batch system

• jobs are monolithic and mostly unrelated

o Q/A: this is NOT an application assumption: currently the LCG resource brokers do not support any dependencies between jobs; the new gLite middleware supports bulk-job submission, which optimizes the submission of jobs that share a common sandbox

o some CEs, however (such as Condor-based ones), support DAGs

Grid Job Submission Chain

• Grid job submission chain

– UI -> RB -> CE -> BS -> WN

– scheduling decisions based on the monitoring system

– simple failure recovery: JDL retry count

Submission performance

Credits to H.C.Lee and ARDA team

Reliability and Failures: Brokering

• Example:

– Reliability and Failures

▪ brokering and info system

▪ site instability

▪ ...

Credits to P.Saiz and ARDA team

Configuration/application errors

• configuration/application errors

▪ compiler, tools, ... version mismatch

▪ wrong environment

▪ missing application files

▪ installed application software not up to date

– Example: 5 sites, 5 different problems

Problems related to a Geant4 application on WNs

gridce0.desy.de:2119/jobmanagerlcgpbsgeant
problem with setenv DIANE_PYTHON_INCLUDE_DIR /usr/include/python2.2

tbn20.nikhef.nl:2119/jobmanagerpbsqlong
problem of aidaconfig: Operating system version 'CentOS release 3.7 (Final)' is not supported, sorry.

grid109.kfki.hu:2119/jobmanagerlcgpbsgeant4
g++: Internal error: Segmentation fault (program cc1plus)

ce102.cern.ch:2119/jobmanagerlcglsfgrid_gea
GNUmakefile:28: /afs/cern.ch/project/gd/apps/geant4/dirInstallations/dirGeant47.1.p01/config/binmake.gmk: No such file or directory
The problem is that at CERN version 4.7.1.p01 is NOT available.

lcgce0.ifh.de:2119/jobmanagerlcgpbsgeant4
swig: error while loading shared libraries: libstdc++libc6.22.so.3: cannot open shared object file: No such file or directory

Applications

• Some applications from our work environment

– data analysis

▪ extraction of (statistical) parameters from data: event loop

▪ (quasi-)interactivity

• you may want to see the evolution of a parameter (e.g. a histogram) and decide to change the input cuts, ...

– Monte Carlo simulation

▪ obtaining parameters or building images by generating a large number of independent events

▪ radiotherapy example:

• you may want to see the energy deposits in the tissues for a given geometry approximation, radioactive dose, etc.

– testing activities: Geant4 regression tests

▪ running a large number of jobs in various configurations (parameter sweep)

– avian flu: check a number of drug candidates by docking (-> details later)

– special activities: ITU frequency analysis (-> details later)

Class of Applications

• Class of Applications

– loose to moderate coupling

▪ no or few synchronization points

▪ e.g.: computing intensive

▪ e.g.: data intensive, if data movements are not considered

• it is assumed that the data is available locally

o either by policy (e.g. the LHCb computing model for analysis assumes all data is replicated at the Tier-1 sites) -> data management tends to get simpler with time in HEP...

o or on demand, before the job starts

– low communication-to-computing ratio

▪ more real work than communication

• Type of parallelism

– iterative decomposition (including parameter sweep)

– embarrassingly parallel applications

Definition of the Quality of Service

The computing system provides an appropriate QoS if it responds in an acceptable way to the user and is capable of automatically maintaining the processing goals defined by the user (measured by metrics).

• keep in mind:

• in the Grid the basic interaction of a user is sending jobs

● either directly or indirectly via a portal

• response

(depends on the nature of the application: interactive, batch, mixed)

• obtaining (maybe partial) results of the processing

• e.g. for the on-the-fly analysis

• estimation/prediction of how the processing will evolve

QoS Metrics

• QoS metrics (measure of user-defined goals)

• turnaround time (typically: minimize the total execution time of the job)

• response latency (time for the first N partial results to arrive)

• the feedback curve (e.g. filling histograms with events -> the significance of individual partial results decreases with time)

• response order

• i.e. the order of arrival of the partial results (application specific)

• output rate stability

• make sure that partial results arrive at predictable intervals (jitter)

• failure rate

• automatic and efficient coping with failures

recall data analysis

recall MC simulation
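
The metrics listed on this slide can be made concrete in a few lines of Python. This is a minimal sketch and not part of DIANE: the function name `qos_metrics`, its arguments, and the use of the standard deviation of inter-arrival gaps as a jitter measure are all assumptions made for illustration. It also assumes that at least one partial result has arrived.

```python
from statistics import pstdev

def qos_metrics(submit_time, arrival_times, n_failed, first_n=10):
    """Summarize one run from the arrival times of its partial results.

    All names here are illustrative assumptions, not a DIANE API.
    submit_time   -- when the job was submitted (seconds)
    arrival_times -- arrival time of each successful partial result (seconds)
    n_failed      -- number of tasks that never produced a result
    first_n       -- N used for the response-latency metric
    """
    arrivals = sorted(arrival_times)
    total = len(arrivals) + n_failed
    # gaps between consecutive partial results
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    return {
        # time until the last result is in
        "turnaround": arrivals[-1] - submit_time,
        # time until the first N partial results are in
        "response_latency": arrivals[min(first_n, len(arrivals)) - 1] - submit_time,
        # spread of the gaps: a simple stand-in for jitter
        "jitter": pstdev(gaps) if gaps else 0.0,
        "failure_rate": n_failed / total,
    }

# Made-up numbers: four results arrive steadily, the fifth is a straggler.
metrics = qos_metrics(0.0, [12.0, 14.0, 16.0, 18.0, 40.0], n_failed=1, first_n=3)
```

In this made-up run the straggler dominates both the turnaround time and the jitter, while the response latency for the first three results is unaffected.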

Mechanisms for better QoS

• In general QoS is NOT implemented on the Grid

• Techniques for performance-related metrics

– dedication of resources (wasteful)

– advance reservations

▪ difficult for users who cannot plan their interactive work ahead

– better scheduling: fast/slow queues (site configuration)

– preemption: suspend a lower-priority job

– migration: suspend and migrate elsewhere

– better brokering: forecasting using monitoring systems (e.g. NWS)

• Techniques for failure-related metrics

– metascheduling (JDL retry count, Condor)

• Techniques for application-specific metrics

– metascheduling (not generally implemented, e.g. out of scope of DAGs)

QoS Implementation Choices

• QoS implementation

– site service modifications

▪ faster queues, scheduler modifications, e.g. virtualization schemes with MAUI

– middleware modification

▪ checkpointing/migration, special services (e.g. GARA), Virtual Machines

– system-level modifications (Unix kernel modules, special I/O)

– user-level overlay schedulers (pilot jobs, agents, ...)

• Boundary conditions

– acceptance/deployment of middleware changes (very slow due to organizational constraints)

– resource providers' constraints (site changes)

▪ many sites cannot freely change their software (they also serve non-grid users)

▪ sysadmins do not like sudo-like programs

– interfacing applications

▪ including legacy ones

Related Work

• User level (Grid Overlays)

– production systems of LHC experiments:

▪ DIRAC, AliEn: pilot jobs with a central task queue

• permanent, VO-specific services, VOBoxes on the sites, ...

– specialized schedulers embedded into the applications

▪ gPTM3D: interactive medical analysis application (DICOM images)

– AppLeS APST: parameter sweep templates

▪ brokering based on NWS

– Condor M/W

• System level

– VRes: time sharing with virtual reservations with the MAUI scheduler

– SLAs

– GARA (General-purpose Architecture for Reservation and Allocation)

▪ attempt to create a middleware standard (not deployed on the Grid)

– Virtual Machines (PlanetLab, ...)

Parametric task model

• parametric tasks

– automatic job splitting into (parametrized) tasks

– master sends task parameters to the workers

– migration and checkpointing may be emulated easily, at the granularity of atomic tasks

– master may take scheduling decisions

[Diagram: a Master, configured with an ErrorRecoveryPolicy and a TaskDispatchingPolicy, sends task parameters to Worker A, Worker B and Worker C]

task parameter list:
(a1,b1,c1) done
(a2,b2,c2) assigned C
(a3,b3,c3) assigned A
(a4,b4,c4) failed
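
The task model above can be sketched as a toy master that keeps the parameter list and a status per task. This is an illustration only: the class `Master`, the function `run` and the sequential simulation loop are assumptions, not the DIANE implementation, which exchanges messages with remote workers.

```python
from collections import deque

class Master:
    """Toy master for the parametric task model (illustrative only)."""

    def __init__(self, parameters):
        # one task per parameter tuple; status mirrors the slide's table
        self.status = {p: "unassigned" for p in parameters}
        self.queue = deque(parameters)

    def next_task(self, worker):
        """Hand the next waiting parameter tuple to a worker, or None."""
        if not self.queue:
            return None
        p = self.queue.popleft()
        self.status[p] = "assigned " + worker
        return p

    def report(self, p, ok):
        """Record a result; a failed task is requeued at task granularity."""
        if ok:
            self.status[p] = "done"
        else:
            self.status[p] = "failed"
            self.queue.append(p)   # emulated migration: rerun elsewhere


def run(master, workers, execute):
    """Drive the master until no task is waiting (sequential simulation)."""
    while master.queue:
        for w in workers:
            p = master.next_task(w)
            if p is None:
                break
            master.report(p, execute(w, p))


# Example: worker "B" fails on everything, "A" and "C" succeed.
master = Master([(1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4)])
run(master, ["A", "B", "C"], lambda w, p: w != "B")
```

Because a failed task simply goes back to the queue, another worker picks it up later. The loop would not terminate if every worker failed on some task, so a real error recovery policy needs a stop condition.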

DIANE Application Interfacing

• distributed component framework

▪ transparent to applications

• behind the scenes: message passing, synchronization, heartbeat checks, ...

• application components loaded in a plugin style and called back only when needed

• very easy application integration via Python adapter classes

[Diagram labels: “original” application]

Runtime Policies

• Customizable algorithms

– Master behaviour may be modified on a per-application or per-job basis: insert Python functions into the “JDL”

▪ application-specific error recovery

▪ task dispatching (e.g. for data access, redundant scheduling, time synchronization)

▪ other synchronization patterns

• divide-and-conquer

• pipeline

• master serves as a synchronization point

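    from functools import reduce  # needed on Python 3; a builtin on Python 2

    # Policy 1: ignore every failed task and carry on.
    def failRecoveryIgnore(self):
        for t in self.master.tasks.failed():
            t.ignore()
        return 1

    # Policy 2: retry the failed tasks, but report failure once 10% of all
    # tasks have failed. add_failures is not shown on the slide.
    def failRecoveryThreshold(self):
        failure_no = reduce(add_failures, self.master.tasks.failed(), 0)
        for t in self.master.tasks.failed():
            t.retry()
        return failure_no < self.master.tasks.number * 0.1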
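
To try a threshold policy of this kind outside the framework, a stand-in for the master's task list is enough. The classes `FakeTask` and `FakeTasks` below are hypothetical scaffolding, not DIANE classes. The sketch also assumes that the return value tells the master whether to continue, and it counts failures with `len` instead of the slide's `add_failures` helper, which is not shown.

```python
class FakeTask:
    """Hypothetical stand-in for a task object."""

    def __init__(self, name, failed=False):
        self.name, self.is_failed, self.retries = name, failed, 0

    def retry(self):
        # pretend the task is resubmitted
        self.retries += 1


class FakeTasks:
    """Hypothetical stand-in for the master's task list."""

    def __init__(self, tasks):
        self.all = tasks
        self.number = len(tasks)

    def failed(self):
        return [t for t in self.all if t.is_failed]


def fail_recovery_threshold(tasks, max_fraction=0.1):
    """Retry failed tasks; return False once too many have failed."""
    failed = tasks.failed()
    for t in failed:
        t.retry()
    return len(failed) < tasks.number * max_fraction


# 20 tasks, 1 failed: 1 < 2.0, so processing continues
few = FakeTasks([FakeTask(i, failed=(i == 0)) for i in range(20)])
# 20 tasks, 5 failed: 5 < 2.0 is false, so the policy reports failure
many = FakeTasks([FakeTask(i, failed=(i < 5)) for i in range(20)])
continue_few = fail_recovery_threshold(few)
continue_many = fail_recovery_threshold(many)
```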

Policy example: backup tasks

    def matchTasksToWorkers(self):
        ready = self.master.workers.ready
        # copy, so that the master's own list is not modified
        unassigned = self.master.tasks.unassigned[:]
        # more idle workers than waiting tasks: also hand out the
        # unfinished tasks a second time, as backups
        if len(ready) > len(unassigned):
            unassigned += self.master.tasks.unfinished()
        return zip(unassigned, ready)

policy function inserted into the “JDL”

similar strategy used by Google to improve performance of search and indexing algorithms (MapReduce)

[Plot annotation: task was reassigned to other nodes]
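
The backup-task idea can be tried on plain lists. This is a minimal sketch under stated assumptions: the function name and its three arguments are illustrative and not the DIANE interface, and the master is assumed to keep whichever copy of a task finishes first and to discard the other.

```python
def match_tasks_to_workers(unassigned, unfinished, ready):
    """Pair tasks with idle workers, adding backup copies when workers are spare.

    unassigned -- tasks not yet given to anybody
    unfinished -- tasks already running somewhere but not yet done
    ready      -- idle workers
    """
    candidates = list(unassigned)
    if len(ready) > len(candidates):
        # spare workers: schedule running tasks a second time
        candidates += list(unfinished)
    # zip stops at the shorter list, so no worker gets two tasks
    return list(zip(candidates, ready))


pairs = match_tasks_to_workers(
    unassigned=["t7"],
    unfinished=["t3", "t5"],
    ready=["w1", "w2", "w3", "w4"],
)
```

With one waiting task and four idle workers, the two running tasks get backup copies and the fourth worker stays idle. Backups cost duplicate CPU time, which is why they are only issued when workers would otherwise be unused.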

Performance Profile

startup overhead (submission)

transient network problem or worker crash

tail effects: CPU / task length differences

Operation modes

• Operation modes

– on-the-fly virtual M/W network for each job

▪ fair-share, default

▪ only soft QoS

– preallocation of resources

▪ have many workers available at once at a given moment (peak performance)

▪ currently blocks the resources

• but hard QoS is possible if you are willing to “pay” for it

Achieved goals

• Achieved goals

– improved soft QoS characteristics

▪ automatic load balancing at a very fine-grained level

▪ shorter turnaround time (avoiding brokering overhead)

▪ more stable and predictable job output rate

▪ faster error recovery

– used by several applications -> details later

▪ easy to interface to existing applications

– customizable on a per-application basis

– self-contained, running within single user Grid jobs

▪ very easy to deploy -> no modifications to the middleware, services, ...

– worker brokering done externally

▪ flexible: possible to mix grid and non-grid resources

– resource conservation in on-the-fly mode

▪ fair-share in a multi-user environment

QoS Metric: output rate stability

Comparison of G4 Production on LCG: DIANE and direct submission

• 6 sites / 173 CPUs / 100 VO-shared, 70 VO-dedicated

• 207 tasks; direct submission: 1 task = 1 job; DIANE workers: 1/3 of the shared CPUs, 1/2 of the dedicated CPUs

Credits to P. Mendez and ARDA team

Performance

• Scalability/performance of current implementation

– turnaround time: 40K ultra-short jobs (tasks) in 1 hour

– failure rate: < 3e-4 (mostly 0)

– max number of simultaneous workers: 400 (for occupancy > 99%)

– in/out task rate 110Hz

▪ equivalent to 4 seconds task length for 400 workers

– longest sustained processing period: 3 weeks

– max number of registered workers: ~1000

Task length overhead

Test environment: 16 CPU hours, cluster

max: 289 workers

data msg test (next slide)

Message Size Overhead (1)

10Kb

100Kb

200Kb

~100 WNs, 30 s / task, 1000 tasks

Message Size Overhead (2)

<=10Kb

100Kb

200Kb

~100 WNs, 30 s / task, 1000 tasks

ITU RRC06

• International Telecommunication Union – ITU: oldest UN agency (17 May 1865)

– ITU/BR: Radiocommunication Sector

▪ “management of the radio-frequency spectrum and satellite orbits for fixed, mobile, broadcasting and other communication services”

• RRC06 (15 May–16 June 2006)

– 120 countries (~1200 delegates) negotiated the new digital frequency plan

– a part of a new international agreement

– introduction of digital broadcasting

▪ UHF (470-862 MHz)

▪ VHF (174-230 MHz)

– preceded by RRC04 and other international meetings

Frequency Planning

• Goal: validate the frequency plan

– around 200K “requirements” (corresponding to “jobs”/“events”)

▪ individual transmitters, service areas in all countries

▪ both digital and analogue

• Planning cycle during the conference

– compatibility analysis

▪ detect frequency clashes between requirements

▪ the part handled by our activity, using distributed resources

– 4 major iterations for the global plan (weekends)

▪ a few minor cycles for certain regions (midweek)

• Computing requirements

– each run ~500 CPU hours

– 12h hard time limit

– dependability: correctness (cross-checking numerical results), availability

Computation Structure

– Total plan: max 425 CPUh (was 750 CPUh)

– Hard time limit: 12 h

– Unbalanced execution time of the requirements

▪ from seconds to 1.5 hours

▪ unpredictably changing with every planning iteration

[Figure: execution time of tasks (first iteration). Annotation: the shape of the country makes a difference to the spatial complexity of the frequency optimization]
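
The figures above give a quick lower bound on the number of workers, and the unbalanced task lengths explain why that bound is optimistic. The sketch below is a back-of-the-envelope illustration: the functions `min_workers` and `makespan` and the synthetic workload are assumptions invented for the example, and only the 425 CPUh and 12 h figures come from the slide.

```python
import heapq
import math

def min_workers(total_cpu_hours, deadline_hours):
    """Fewest workers that could finish in time with perfect balance."""
    return math.ceil(total_cpu_hours / deadline_hours)

def makespan(task_hours, n_workers):
    """Finish time when each idle worker pulls the next task in list order."""
    free_at = [0.0] * n_workers          # min-heap of worker free times
    heapq.heapify(free_at)
    finish = 0.0
    for t in task_hours:
        start = heapq.heappop(free_at)   # earliest idle worker
        heapq.heappush(free_at, start + t)
        finish = max(finish, start + t)
    return finish

workers_needed = min_workers(425, 12)    # figures from the slide

# Synthetic workload (invented for illustration):
# 30 short tasks followed by a single 1.5 h task.
tasks = [0.1] * 30 + [1.5]
span = makespan(tasks, 3)
```

With perfect balance the synthetic workload of 4.5 CPUh would take 1.5 h on three workers, but the long task arriving last stretches the run to about 2.5 h. Since task lengths are not known in advance here, fine-grained splitting and spare workers are the practical remedies.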

Operations: major iterations

• Summary of major iterations

run  #req  #task  time   start  stop   CPUh  #WN  comment
1    243K  26K    6.40h  21:20  04:00  425h  190  lost <10 tasks (3e-4)
2    237K  23K    6.30h  19:10  01:25  332h  125  lost 1 task (4e-5)
3    224K  40K    3.05h  22:50  00:26  192h  210  OK
4    218K  39K    1.05h  22:18  23:23  151h  320  OK

Some observations:

● CPUh and #req decreasing:

● subsequent plans have fewer clashes

● runs 3 and 4: high-granularity splitting

● variable #WN

● high reliability

● late Friday afternoon ;)

variable #WN example: iteration 4

Example of ITU run

ITU job example:

● 116 LCG workers

● 3470 tasks

● ~130 CPU h

large span of task length, not known a priori!

[Plot labels: 27%, 64%]

Operations: EGEE Grid

• Usage of the sites

– Example: iteration 2

– CERN largest contributor (38%)

– 62% from the Grid

Successes

• Main goal

– provide enough computing power to ensure the successful achievement of the international broadcasting agreement: fully accomplished!

▪ talk of the IT department head at the RRC06 plenary session

• Other goals:

– enabled a new user community on the EGEE Grid

▪ users trained to be autonomous (despite extended consultancy)

▪ in a short time and relatively easily

– demonstrated that with User Level Scheduling EGEE Grid may be used for applications with QoS requirements

– we have learned a lot!

Performance Model

• Formalize the performance model

– resources are available in time slots (latency, duration)

– failure rate

• Check how fair sharing is affected

– user-level scheduling tends to monopolize the resources

– use shorter slots?

[Plot annotations: startup overhead (submission); transient network problem or worker crash; tail effects: CPU / task length differences]

QoS Enforcement

• Formalize the QoS metrics

– fuzzy metrics (specify the range of acceptable values)

• improve the QoS runtime support

– dynamic decomposition (vary the splitting granularity at runtime)

▪ example: to reduce the impact of the message rate, allocate bunches of tasks which may then be decomposed if needed (“simulated checkpointing”)

▪ more autonomous workers

– measurements of the runtime performance

▪ do application benchmarking on small samples

▪ apply static application performance knowledge

– forecasting of the QoS accomplishment with the defined metrics
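
The dynamic decomposition idea can be sketched with two small helpers. The slide describes this as a future direction, so this is only one possible reading: the names `make_bunches` and `split_remaining` and the halving rule are assumptions made for illustration.

```python
def make_bunches(task_ids, bunch_size):
    """Group tasks into bunches so that one message carries several tasks."""
    return [task_ids[i:i + bunch_size] for i in range(0, len(task_ids), bunch_size)]

def split_remaining(bunch, n_done):
    """Split the not-yet-processed part of a bunch in two.

    Returns (kept, given_away): the owner keeps the first half of the
    remaining tasks and the second half can go to an idle worker.
    Finished tasks are never redone, which is the "simulated checkpoint".
    """
    remaining = bunch[n_done:]
    half = (len(remaining) + 1) // 2
    return remaining[:half], remaining[half:]

bunches = make_bunches(list(range(10)), bunch_size=4)
# A worker has finished one task of the first bunch when another worker goes idle.
kept, given_away = split_remaining(bunches[0], n_done=1)
```

Larger bunches mean fewer messages through the master, and splitting on demand recovers the fine-grained load balancing near the end of a run.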

Other directions

• hard QoS requirements

– floating pool of workers shared by users

– each user contributes part of the resources to the pool

– workers may be drawn from the pool for minimal latency

• framework as the general processing control vehicle

– workflows and pipelines (maybe)

– multi-masters (master tree)

▪ mapping naturally to the Grid structure (CEs)...

▪ multiple task synchronization points

▪ should scale better

▪ sub-masters could be more autonomous

Feedback

DIANE download and more information:

http://cern.ch/diane

Contact: Jakub.Moscicki AT cern.ch

...please do not hesitate to get in touch if you have questions or ideas...
