GC3: Grid Computing Competence Center Cluster computing, I Batch-queueing systems

(1)


Riccardo Murri, Sergio Maffioletti Grid Computing Competence Center, Organisch-Chemisches Institut, University of Zurich

(2)

Today’s topic

Batch job processing clusters

– purpose: batch job processing

(3)

What is a cluster? I

[Figure: cluster layout. Labels: internet; ssh access to frontend.node.uzh.ch; local network fabric; compute nodes compute-0-0.local, compute-0-1.local, ..., compute-0-27.local]

A cluster is a group of computers with a direct network interconnect, centralized management and distributed execution facilities.

(4)

What is a cluster? II

Centralized:

– Authorization and authentication

– Shared filesystem

– Application execution and management

Distributed:

– Execution of jobs

– Multiple units of the same parallel job may reside on separate resources

(5)

What is an HPC cluster?

A cluster is a group of computers with a direct network interconnect, centralized management and distributed execution facilities.

An HPC cluster is a cluster with a fast local network interconnect, specialized for execution of parallel distributed-memory programs.

A supercomputer is (currently) a very large HPC cluster.

(6)

What’s batch job processing?

Asynchronous execution of shell commands.

Wikipedia: Asynchronous actions are actions executed in a non-blocking scheme, allowing the main program flow to continue processing.
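A minimal Python sketch of the idea, assuming a single local machine rather than a batch system: `subprocess.Popen` starts a command without blocking, the main program continues, and the result is collected later. The command itself is an arbitrary placeholder.

```python
import subprocess
import sys

# Start a command without waiting for it (non-blocking "submission").
# The command is a placeholder: a child Python process that prints a message.
proc = subprocess.Popen(
    [sys.executable, "-c", "print('job done')"],
    stdout=subprocess.PIPE,
    text=True,
)

# The main program flow continues while the command runs.
main_flow_result = sum(range(10))

# Later, collect the output and exit status (this call does block).
output, _ = proc.communicate()
exit_code = proc.returncode
```

A batch system follows the same pattern at cluster scale: submission returns immediately, and the user checks the job state later.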

(7)

Lifecycle of a batch job

1. A command to run is submitted to the batch processing system

2. The batch job scheduler selects appropriate resources to run the job

3. The resource manager executes the job

4. Users monitor the job execution state

(8)

Functional components of a batch job system

Resource Manager

Monitors compute infrastructure, launches and supervises jobs, cleans up after termination.

Job manager / scheduler

Allocates resources and time slots (scheduling)

Workload Manager

Policy and orchestration at “job collection” level: fair share, workflow orchestration, QoS, SLA, etc.

(9)

Architecture of a batch job system

[Figure: architecture of a batch job system. Labels: client (frontend); server with scheduler and resource manager (master); job launch & execution and machine status monitoring on compute nodes compute-0-0.local, compute-0-1.local, ..., compute-0-27.local. Steps: 1. submit job; 2. allocate resources; 3. start job; 4. monitor execution]

(10)

Grid Engine

Sun Grid Engine (GE) is a batch-queuing system produced by Sun Microsystems; made open-source in 2001.

After acquisition by Oracle, the product forked:

– Open Grid Scheduler (OGS) and Son of Grid Engine (SGE), independent open-source versions.

– Oracle Grid Engine, commercial and focused on enterprise technical computing.

– Univa Grid Engine is a commercial-only version, developed by the core Sun Grid Engine engineering team from Sun.

(11)

GE architecture, I

sge_qmaster

– Runs on master node

– Accepts client requests (job submission, job/host state inspection)

– Schedules jobs on compute nodes (formerly separate sge_schedd process)

Client programs qhost, qsub, qstat

– Run by user on submit node

– Clients for sge_qmaster

– Master daemon has a list of authorized submit nodes

(12)

GE architecture, II

sge_execd

– Runs on every compute node

– Accepts job start requests from sge_qmaster

– Monitors node status (load average, free memory, etc.) and reports back to sge_qmaster

sge_shepherd

– Spawned by sge_execd when starting a job

– Monitors the execution of a single job

(13)

GE architecture, III

[Figure: GE architecture. Labels: qsub and qstat (frontend); ge_master with scheduler and resource manager (master); ge_execd and ge_shepherd with machine status monitoring on compute nodes compute-0-0.local, compute-0-1.local, ..., compute-0-27.local. Steps: 1. submit job; 2. allocate resources; 3. start job; 4. monitor execution]

(14)

Lifecycle of a Job: user perspective

1. Prepare job script (normally shell script)

2. Define resource requirements

3. Submit job and record job ID

4. Monitor status of job (using job ID)

5. When done, inspect results

(15)

Prepare job script

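#!/bin/bash
# MZXML_NAME is expected to be set in the job environment.
MZXMLSEARCH="./MzXML2Search"

"${MZXMLSEARCH}" -dta "${MZXML_NAME}.mzXML"
status=$?
if [ "$status" -ne 0 ]; then
    echo "[FATAL]"
    # Propagate the failure so the batch system records a failed job.
    exit "$status"
fi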

(16)

Submit job and monitor using jobID

# qsub test.sh

534.localhost

# qstat 534

Job id Name S Queue

- - -

(17)

Lifecycle of a Job: system perspective

1. Job submission from DRM client

2. Resource Manager stores job in a queue

– Queue is selected by inspecting DRM policies and the job’s description

3. Scheduler starts scheduling cycle

– Collects resource information from exec hosts

– Inspects jobs in queues

– Applies scheduling policies to sort jobs in queues

– Sends run request to Resource Manager

4. Resource Manager sends job to exec host to run

5. Exec host receives payload and runs it

– Job is executed using user credentials

– Exec host periodically reports resource utilization to the Resource Manager

(18)

(19)

Implementation issues

I/O

how to provide input data to the job and collect output data from it

Scheduling

When should the job start?

Resource allocation

On what computer(s) should it run? How to cope with heterogeneous resource pools?

Job monitoring and accounting

(20)

I/O management in HPC clusters

Two main ways:

1. Shared file system

2. Data staging

Reference: O. Richard, Batch Scheduler and File Management, The third workshop of the INRIA-Illinois joint-laboratory on Petascale Computing, June 21-24, 2010, Bordeaux, France

(21)

Shared file systems

Used on most cluster systems

Parallel filesystem (e.g., Lustre, GPFS, PVFS, NFSv4.1, . . . ) for performance and scalability

Often separate filesystems based on features:

– a filesystem for persistent / longer-term data (e.g., /home)

– another one for ephemeral I/O (deleted after the job has finished running)

– responsibility is on the user to move data into the appropriate filesystem

(22)

Data staging

Job data requirements are identified and provided by user in submitted script.

Stage-in

Input files are transferred to the local disk of compute nodes before job start.

Stage-out

Output files are transferred from the nodes to mass storage after execution.
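The stage-in / run / stage-out sequence can be sketched in Python. This is an illustration only: a temporary directory stands in for the node-local disk, and `run_with_staging`, `payload` and the `output.dat` file name are hypothetical names, not part of any batch system.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_staging(input_file, storage_dir, payload):
    """Stage in `input_file`, run `payload` on the local copy, stage out the result.

    `payload` is a hypothetical callable taking (local_input, local_output) paths.
    """
    storage_dir = Path(storage_dir)
    # The temporary directory stands in for the compute node's local disk.
    with tempfile.TemporaryDirectory() as scratch:
        local_in = Path(scratch) / Path(input_file).name
        local_out = Path(scratch) / "output.dat"
        shutil.copy(input_file, local_in)        # stage-in
        payload(local_in, local_out)             # job execution on local data
        staged_out = storage_dir / local_out.name
        shutil.copy(local_out, staged_out)       # stage-out to mass storage
    return staged_out
```

The scratch copy disappears when the job ends; only what is staged out survives.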

(23)

Scheduling

Long-term scheduler.

– Jobs may last hours, days, even months!

HPC job scheduling is usually non-preemptive.

– Compute resources are fully utilized; there is little room for sharing.

Common scheduling algorithms are usually variations of FCFS or priority-based scheduling.

(24)

Scheduling: terminology

Turnaround time

The total time elapsed from the moment a job is submitted to the moment it terminates running.

Wait time

The time elapsed from submission until a job actually starts running.

Wall-time

The time elapsed from job start to end. (Abbreviation of wall-clock time.)
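The three definitions are related by turnaround time = wait time + wall-time. A small Python sketch, with a hypothetical `JobTimes` record and timestamps expressed in hours:

```python
from dataclasses import dataclass


@dataclass
class JobTimes:
    submitted: float  # all timestamps in hours
    started: float
    ended: float

    @property
    def wait_time(self):
        # From submission until the job actually starts running.
        return self.started - self.submitted

    @property
    def wall_time(self):
        # From job start to job end.
        return self.ended - self.started

    @property
    def turnaround_time(self):
        # From submission until the job terminates.
        return self.ended - self.submitted


# Example: submitted at 10:00, started at 12:00, ended at 16:00.
job = JobTimes(submitted=10.0, started=12.0, ended=16.0)
```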

(25)

Scheduling: FCFS, I

First come, first served

Job requests are kept in a queue.

New job requests (submissions) append to the back of the queue.

Each time a suitable execution slot is freed, the job at the front of the queue is run.
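A minimal FCFS simulation in Python, assuming that every slot is suitable for every job and that each job needs exactly one slot. The function name and the `(name, submit, runtime)` tuple layout are illustrative choices.

```python
import heapq
from collections import deque


def fcfs_schedule(jobs, slots):
    """Return {name: start_time} for jobs given as (name, submit, runtime)."""
    queue = deque(sorted(jobs, key=lambda j: j[1]))  # front = earliest submission
    free_at = [0.0] * slots      # min-heap: time at which each slot becomes free
    heapq.heapify(free_at)
    starts = {}
    while queue:
        name, submit, runtime = queue.popleft()  # always the job at the front
        slot_free = heapq.heappop(free_at)       # next slot to be freed
        start = max(slot_free, submit)           # cannot start before submission
        starts[name] = start
        heapq.heappush(free_at, start + runtime)
    return starts


starts = fcfs_schedule([("A", 0, 5), ("B", 1, 2), ("C", 2, 1)], slots=1)
```

With a single slot, the short jobs B and C wait behind the long job A, which is the main weakness of bare FCFS.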

(26)

Scheduling: FCFS, II

Issues with bare FCFS:

1. Average waiting time might be long:

– e.g., a user submits a large number of very long jobs; other users have to wait a long time before their shorter jobs can run.

– Solutions: separate queues, backfill, priority-based scheduling

2. When there are parallel jobs spanning multiple execution units, the scheduler has to keep some nodes idle to allocate enough resources.

(27)

Scheduling: separate queues

Create separate job queues.

– Submission queue may be explicitly chosen by user, or selected by scheduler based on job characteristics.

Each queue is associated with a different set of execution nodes.

Each queue has different run features – e.g., different maximum run time

(28)

Scheduling: backfill

Jobs jump ahead in the queue and are executed on “reserved” nodes if they will be finished by the time the job holding the reservation is scheduled to start.
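The backfill condition can be written as a simple check. This sketch assumes that each waiting job declares a node count and a run-time limit, and that the reservation start time is already known; `pick_backfill` and its parameters are hypothetical names.

```python
def pick_backfill(queue, now, reservation_start, free_nodes):
    """Return the first waiting job that can run on the idle reserved nodes, or None.

    `queue` holds (name, nodes, runtime) tuples in queue order, excluding the
    job that holds the reservation; `runtime` is the user-declared limit.
    """
    for name, nodes, runtime in queue:
        fits = nodes <= free_nodes
        done_in_time = now + runtime <= reservation_start
        if fits and done_in_time:
            return name
    return None


waiting = [("long", 2, 8.0), ("wide", 16, 1.0), ("short", 2, 1.5)]
chosen = pick_backfill(waiting, now=10.0, reservation_start=12.0, free_nodes=4)
```

Here `long` would overrun the reservation and `wide` needs more nodes than are idle, so `short` jumps ahead. The check is only as good as the declared run times, which is why backfill relies on users giving realistic limits.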

(29)

Scheduling: SJF, I

Shortest job first

Job queue is sorted according to duration: shortest jobs are moved to the front.

(30)

Scheduling: SJF, II

If all jobs are known in advance, it can be proved to deliver the optimal average wait time.

Otherwise, may delay long jobs indefinitely:

– At 10 am, Job X with expected runtime 4 hours is submitted; it has to wait 2 hours in the queue.

– At 11 am, 10 jobs of 2 hours runtime are submitted; they jump ahead in the queue and delay job X by 20 hours.

– At 12 pm (noon), 5 more jobs of 1 hour runtime are submitted; they delay job X by another 5 hours.
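The starvation example above can be replayed with a small simulation. This is a sketch under assumptions not stated in the slides: a single execution slot, non-preemptive scheduling, times in hours since midnight, and a hypothetical `blocker` job that occupies the slot until 12:00 (which accounts for job X's initial 2-hour wait).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    submit: float   # submission time, hours since midnight
    runtime: float  # expected run time, hours


def sjf_start_times(jobs):
    """Non-preemptive shortest-job-first on a single execution slot."""
    pending = sorted(jobs, key=lambda j: j.submit)
    clock = 0.0
    starts = {}
    while pending:
        ready = [j for j in pending if j.submit <= clock]
        if not ready:
            clock = pending[0].submit  # slot idles until the next submission
            continue
        # Shortest job goes first; ties are broken by submission time.
        job = min(ready, key=lambda j: (j.runtime, j.submit))
        starts[job.name] = clock
        clock += job.runtime
        pending.remove(job)
    return starts


jobs = [Job("blocker", 8, 4), Job("X", 10, 4)]
jobs += [Job(f"two_hour_{i}", 11, 2) for i in range(10)]
jobs += [Job(f"one_hour_{i}", 12, 1) for i in range(5)]
starts = sjf_start_times(jobs)
```

Under these assumptions job X starts 25 hours after the slot frees at 12:00: 5 hours for the 1-hour jobs plus 20 hours for the 2-hour jobs. If short jobs keep arriving, X never runs.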

(31)

Priority-based scheduling

Sort job queues according to some priority function

The “priority” function is usually a weighted sum of various contributions, e.g.:

– Requested run time

– Number of processors

– Wait time in queue

– Recent usage by same user/group/department (fair share)

– Administrator-set QoS
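A weighted-sum priority function could look like the following Python sketch. The weights, attribute names and sample jobs are all hypothetical; real values are a matter of site policy.

```python
# Hypothetical weights: positive values raise priority, negative values lower it.
WEIGHTS = {
    "requested_hours": -1.0,
    "processors": -0.5,
    "wait_hours": 2.0,
    "recent_usage_hours": -0.1,  # fair-share contribution
    "qos": 10.0,                 # administrator-set
}


def priority(job):
    """Weighted sum of the job's attributes."""
    return sum(WEIGHTS[key] * job[key] for key in WEIGHTS)


jobs = [
    {"name": "a", "requested_hours": 24, "processors": 64,
     "wait_hours": 1, "recent_usage_hours": 500, "qos": 0},
    {"name": "b", "requested_hours": 1, "processors": 1,
     "wait_hours": 3, "recent_usage_hours": 0, "qos": 0},
]

# Highest priority first.
ordered = sorted(jobs, key=priority, reverse=True)
```

Because wait time carries a positive weight, every job's priority grows while it sits in the queue, which limits the starvation seen with plain shortest-job-first.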

(32)

Fair-share scheduling

Fair-share prioritization assigns higher priorities to users/groups/etc. that have not used all of their resource quota (usually expressed in CPU time).

Important parameters in defining a fair-share policy:

– window length: how much historical information is kept and used for calculating resource usage

– interval: how often resource utilization is computed

– decay: weights applied to resource usage in the past (e.g., 2 hours of CPU time one week ago might weigh less than 2 hours of CPU time today)
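One way the three parameters could combine is an exponentially decayed sum over the window. This is an assumed formula for illustration; actual schedulers define their own.

```python
def decayed_usage(usage_per_interval, decay, window):
    """Weighted CPU usage over the last `window` intervals.

    usage_per_interval[0] is the most recent interval, [1] the one before, etc.
    Usage that is `age` intervals old is weighted by decay ** age.
    """
    recent = usage_per_interval[:window]  # older history is ignored
    return sum(cpu * decay ** age for age, cpu in enumerate(recent))


# Hypothetical users, CPU hours per interval, most recent first.
alice = decayed_usage([2.0, 0.0, 0.0, 8.0], decay=0.5, window=3)
bob = decayed_usage([0.0, 4.0, 4.0], decay=0.5, window=3)
```

Alice's 8 hours fall outside the window and no longer count, so her decayed usage is lower than Bob's, and a fair-share policy would rank her jobs higher.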

(33)

Resource allocation, I

Resource allocation is the act of selecting execution units out of the available pool for running a job.

Over time, clusters tend to grow inhomogeneously: new nodes are added that differ from the older ones.

Jobs are different in computational and hardware requirements, e.g.:

– short jobs vs long-running jobs

– large-memory jobs, so fewer jobs fit in a single multi-core node

(34)

Resource allocation, II

General resource allocation algorithm (match-making):

1. user specifies resource requirements during job submission

2. filtering: scheduler filters resources based on evaluation of a boolean formula

– usually, logical AND of resource requirements

3. ranking: matching resources are sorted and the first-ranking one gets the job

Normally the filtering and ranking functions are fixed or can only be modified by the cluster admin.
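The filter-then-rank procedure can be sketched in Python. The attribute names, node data and the "least loaded first" ranking are hypothetical; here each requirement is treated as a minimum value.

```python
def match(requirements, nodes, rank):
    """Pick a node for a job.

    requirements: {attribute: minimum value}; nodes: list of dicts;
    rank: key function, lower values rank first.
    """
    # Filtering: logical AND of all resource requirements.
    candidates = [
        node for node in nodes
        if all(node.get(attr, 0) >= minimum
               for attr, minimum in requirements.items())
    ]
    # Ranking: sort the matching resources; the first-ranking one gets the job.
    candidates.sort(key=rank)
    return candidates[0] if candidates else None


nodes = [
    {"name": "compute-0-0", "cores": 8, "mem_gb": 16, "load": 0.9},
    {"name": "compute-0-1", "cores": 16, "mem_gb": 64, "load": 0.5},
    {"name": "compute-0-27", "cores": 16, "mem_gb": 32, "load": 0.1},
]
best = match({"cores": 16, "mem_gb": 32}, nodes, rank=lambda n: n["load"])
```

Two nodes pass the filter and the less loaded one is selected. A job whose requirements no node satisfies gets `None` and stays in the queue.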

(35)

Example: resource requirements in SGE

Grid Engine allows specifying resource requirements within a job script.

#!/bin/bash
#$ -q all.q         # queue name
#$ -l s_vmem=300M   # memory
#$ -l s_rt=60       # walltime
#$ -l gpu=1         # require 1 GPGPU
#$ -pe mpich 32     # CPU cores

MZXMLSEARCH="./MzXML2Search"
...

(36)

Condor

[Figure: Condor architecture. Labels: condor_submit; condor_agent; condor_master; condor_resource (three instances); batch system server; local 1Gb/s ethernet network]

(37)

Condor overview

Agents (client-side software) and Resources (cluster-side software) advertise their requests and capabilities to the Condor Master.

The Master performs match-making between Agents’ requests and Resources’ offerings.

An Agent sends its computational job directly to the matching Resource.

Reference: Thain, D., Tannenbaum, T. and Livny, M. (2005): “Distributed computing in practice: the Condor experience.” Concurrency and Computation: Practice and Experience, 17:323–356.

(38)

(39)

Matchmaking, I

Same idea in Condor, except the schema is not fixed. Agents and Resources report their requests and offers using the “ClassAd” format (an enriched key=value format).

No prescribed schema, hence a Resource is free to advertise any “interesting feature” it has, and to represent it in any way that fits the key=value model.

(40)

Matchmaking, II

1. Agents specify a Requirements constraint: a boolean expression that can use any value from the Agent’s own (self) ClassAd or the Resource’s (other).

2a. Resources whose offered ClassAd does not satisfy the Requirements constraint are discarded.

2b. Conversely, if the Agent’s ClassAd does not satisfy the Resource’s Requirements, the Resource is discarded.

3. Surviving Resources are sorted according to the value of the Rank expression in the Agent’s ClassAd,
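The two-sided check can be mimicked in Python by modelling each ClassAd as a dictionary whose `Requirements` and `Rank` entries are functions of (self, other). This is a simplified illustration, not the ClassAd language or Condor's implementation; the sample attributes and the choice to put higher Rank values first are assumptions.

```python
def matchmake(agent, resources):
    """Return the resources acceptable to both sides, best Rank first."""
    survivors = [
        res for res in resources
        if agent["Requirements"](agent, res)   # step 2a: Agent accepts Resource
        and res["Requirements"](res, agent)    # step 2b: Resource accepts Agent
    ]
    # Step 3: sort by the Agent's Rank expression.
    return sorted(survivors, key=lambda res: agent["Rank"](agent, res), reverse=True)


def not_rival(my, other):
    return other["Owner"] != "rival"


agent = {
    "Owner": "raman",
    "Requirements": lambda my, other: (
        other["Arch"] == "x86_64" and other["OpSys"] == "LINUX"
    ),
    "Rank": lambda my, other: other["Memory"] + other["Mips"],
}
resources = [
    {"Name": "r1", "Arch": "x86_64", "OpSys": "LINUX",
     "Memory": 2048, "Mips": 1000, "Requirements": not_rival},
    {"Name": "r2", "Arch": "x86_64", "OpSys": "LINUX",
     "Memory": 8192, "Mips": 1500, "Requirements": not_rival},
    {"Name": "r3", "Arch": "sparc", "OpSys": "SOLARIS",
     "Memory": 16384, "Mips": 900, "Requirements": lambda my, other: True},
]
ranked = [res["Name"] for res in matchmake(agent, resources)]
```

Because neither side has a fixed schema, a Resource can add any attribute it likes, and only Agents whose expressions mention that attribute are affected.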

(41)

Example: Job ClassAd

Select 64-bit Linux hosts, and sort them preferring hosts with larger memory and CPU speed.

Requirements = Arch == "x86_64" && OpSys == "LINUX"

Rank = TARGET.Memory + TARGET.Mips

Reference:http:

(42)

Example: Resource ClassAd

A complex access policy, giving priority to users from the owner research group, then other “friend” users, and then the rest. . .

Friend = Owner == "tannenba"
ResearchGroup = (Owner == "jbasney" || Owner == "raman")
Trusted = Owner != "rival"
Requirements = Trusted && ( ResearchGroup || LoadAvg < 0.3 && KeyboardIdle > 15*60 )

(43)

Resource allocation, III

Problem: How do you submit a job that requires 200GB of local scratch space? Or 16 cores in a single node?

(44)

Resource allocation, IV

The names and types of resource requirements vary from cluster to cluster

– Defaults change with batch system software release

– Custom requirements depend on local system administrator

Job management software must be adapted to the local cluster

– When you get access to a new cluster, you must rewrite a large portion of your submission scripts.

– Applies to Condor as well: since ClassAds are

(45)

All these job management systems are based on a push model (you send the job to an execution cluster).