GC3: Grid Computing Competence Center
Cluster computing, I
Batch-queueing systems
Riccardo Murri, Sergio Maffioletti Grid Computing Competence Center, Organisch-Chemisches Institut, University of Zurich
Today’s topic
purpose
z }| {
Batch job processing clusters| {z }
What is a cluster? I
compute−0−0.local compute−0−1.local compute−0−27.local
internet
local network fabric
ssh [email protected] frontend.node.uzh.ch 00 00 11 11 00 00 11 11
A cluster is a group of computers with a direct network interconnect,
What is a cluster? II
Centralized: – Authorization and Authentication
– Shared filesystem
– Application execution and
management
Distributed: – Execution of jobs
– Multiple units of the same parallel
job may reside on separate resources
What is an HPC cluster?
A cluster is a group of computers with a direct network interconnect,
centralized management and distributed execution facilites.
An HPC cluster is a cluster with a fast local network interconnect,
specialized for execution of parallel distributed-memory programs. A supercomputer is (currently)
a very large HPC cluster
What’s batch job processing?
Asynchronous execution of shell commands.
Wikipedia: Asynchronous actions are actions executed in a non-blocking scheme, allowing the main program flow to continue processing.
Lifecycle of a batch job
1. A command to run is submitted to the batch processing system
2. The batch job scheduler selects appropriate resources to run the job
3. The resource manager executes the job 4. Users monitor the job execution state
Functional components of a batch job system
Resource Manager
Monitors compute infrastructure, launches and supervise jobs, cleans up after termination.
Job manager / scheduler
Allocates resources and time slots (scheduling)
Workload Manager
Policy and orchestration at “job collection” level: fair share, workflow orchestration, QoS, SLA, etc.
Architecture of a batch job system
compute−0−0.local compute−0−1.local compute−0−27.local scheduler resource manager server client job launch & execution frontend master 2. allocate resources 4. monitor execution 3. start job 4. monitor execution 1. submit job
machine status monitoring
monitor monitor
Grid Engine
Sun Grid Engine(GE) is a batch-queuing system produced by Sun Microcomputers; made open-source in 2001.
After acquisition by Oracle, the product forked:
– Open Grid Scheduler (OGS) and Son of Grid Engine (SGE), independent open-source versions.
– Oracle Grid Engine, commercial and focused on enterprise technical computing.
– Univa Grid Engine is a commercial-only version, developed by the core SGE engineer team from Sun.
GE architecture, I
sge qmaster
– Runs on master node
– Accepts client requests (job submission, job/host state inspection)
– Schedules jobs on compute nodes (formerly separate sge schedd process)
Client programs qhost, qsub, qstat – Run by user on submit node – Clients for sge qmaster
– Master daemon has a list of authorized submit nodes
GE architecture, II
sge execd
– Runs on every compute node
– Accepts job start requests from sge qmaster – Monitors node status (load average, free memory,
etc.) and reports back to sge qmaster sge shepherd
– Spawned by sge execd when starting a job – Monitors the execution of a single job
GE architecture, III
compute−0−0.local compute−0−1.local compute−0−27.local scheduler resource manager ge_master frontend master 2. allocate resources 4. monitor execution 3. start job 4. monitor execution 1. submit job
machine status monitoring
ge_execd qstat qsub ge_execd ge_execd ge_shepherd
Lifecycle of a Job: user perspective
1. Prepare job script (normally shell script) 2. Define resource requirements
3. Submit job and record jobID
4. Monitor status of job (using JobID) 5. When done, inspect results
Prepare job script
#!/bin/bash
MZXMLSEARCH="./MzXML2Search"
${MZXMLSEARCH} -dta ${MZXML_NAME}.mzXML
if [ ! $? -eq 0 ]; then
echo "[FATAL]"
exit $1 fi
Submit job and monitor using jobID
# qsub test.sh
534.localhost
# qstat 534
Job id Name S Queue
- - -
Lifecycle of a Job: system perspective
1. Job submission form DRM client
2. Resource Manager stores job in a queue
– Queue selected inspecting DRM policies and job’s
description
3. Scheduler starts scheduling cycle
– Collects resource information from exec hosts
– Inspects jobs in queues
– Applies scheduling policies to sort jobs in queues
– Sends run request to Resource Manager
4. Resource Manager sends job to exec host to run 5. Exec host receives payload and runs it
– Job executed using user credentials
– Periodically report to Resource Manager resource
utilization
Implementation issues
I/O
how to provide input data to the job and collect output data from it
Scheduling
When should the job start?
Resource allocation
On what computer(s) should it run? How to cope with heterogeneous resource pools?
Job monitoring and accounting
I/O management in HPC clusters
Two main ways:
1. Shared file system 2. Data staging
Reference:O. Richard, Batch Scheduler and File Management, The third workshop of the INRIA-Illinois joint-laboratory on Petascale Computing, June 21-24, 2010, Bordeaux, France
Shared file systems
Used on most cluster systems
Parallel filesystem (e.g., Lustre, GPFS, PVFS, NFSv4.1, . . . ) for performance and scalability
Often separate filesystems based on features:
– a filesystem for persistent / longer-term data (e.g., /home)
– another one for ephemeral I/O (deleted after the job has finished running)
– responsibility is on the user to move data into the appropriate filesystem
Data staging
Job data requirements are identified and provided by user in submitted script.
Stage-in
Input Files are transfered to local disk of compute nodes before job start.
Stage-out
Output Files are transfered from nodes to mass storage after execution.
Scheduling
Long-term scheduler.
– Jobs may last hours, days, even months! HPC job scheduling is usually non-preemptive.
– Compute resources are fully utilized, there’s little room for sharing.
Common scheduling algorithms are usually variations of FCFS or priority-based scheduling.
Scheduling: terminology
Turnaround time
The total time elapsed from the moment a job is submitted to the moment it terminates running.
Wait time
The time elapsed from submission until a job actually starts running.
Wall-time
The time elapsed from job start to end. (Abbreviation of wall-clock time.)
Scheduling: FCFS, I
First come, first served
Job requests are kept in a queue.
New job requests (submissions) append to the back of the queue.
Each time a suitable execution slot is freed, the job at the front of the queue is run.
Scheduling: FCFS, II
Issues with bare FCFS:
1. Average waiting time might be long:
– e.g., a user submits a large number of very long
jobs; other users have to wait a lot in order to have shorter jobs running.
– Solutions: separate queues, backfill,
priority-based scheduling
2. When there are parallel jobs spanning multiple execution units, the scheduler has to keep some nodes idle to allocate enough resources.
Scheduling: separate queues
Create separate job queues.
– Submission queue may be explicitly chosen by user, or selected by scheduler based on job characteristics.
Each queue is associated with a different set of execution nodes.
Each queue has different run features – e.g., different maximum run time
Scheduling: backfill
Jobs jump ahead in the queue and are executed on “reserved” nodes if they will be finished by the time the job holding the reservation is scheduled to start.
Scheduling: SFJ, I
Shortest job first
Job queue is sorted according to duration: shortest jobs are moved to the front.
Scheduling: SFJ, II
If all jobs are known in advance, it can be proved to deliver the optimal average wait time.
Otherwise, may delay long jobs indefinitely:
– At 10 am, Job X with expected runtime 4 hours is submitted; it has to wait 2 hours in the queue. – At 11 am, 10 jobs of 2 hours runtime are
submitted; they jump ahead in the queue and delay job X by 20 hours.
– At 12 am, 5 more jobs of 1 hour runtime are submitted; they delay job X by another 5 hours.
Priority-based scheduling
Sort job queues according to some priority function
The “priority” function is usually a weighted sum of various contributions, e.g.:
– Requested run time – Number of processors – Wait time in queue
– Recent usage by same user/group/department (fair share)
– Administrator-set QoS
Fair-share scheduling
Fair-share prioritization assigns higher priorities to users/groups/etc. that have not used all of their resource quota (usually expressed in CPU time). Important parameters in defining a fair-share policy:
– window length: how much historical information is kept and used for calculating resource usage – interval: how often is resource utilization
computed
– decay: weights applied to resource usage in the past (e.g., 2 hours of CPU time one week ago might weigh less than 2 hours of CPU time today)
Resource allocation, I
Resource allocation is the act of selecting execution units out of the available pool for running a job. Over time, clusters tend to grow inhomogeneously: new nodes are added, that are different from the older ones.
Jobs are different in computational and hardware requirements, e.g.:
– short jobs vs long-running jobs
– large memory hence less jobs fit in a single multi-core node
Resource allocation, II
General resource allocation algorithm (match-making): 1. user specifies resource requirements during job
submission
2. filtering: scheduler filters resources based on evaluation of a boolean formula
– usually, logical AND of resource requirements
3. ranking: matching resources are sorted and the first-ranking one gets the job
Normally the filtering and ranking functions are fixed or can only be modified by the cluster admin.
Example: resource requirements in SGE
Grid Engine allows specifying resource requirements within a job script.
#$/bin/bash
#$ -q all.q # queue name
#$ -l s_vmem=300M # memory
#$ -l s_rt=60 # walltime
#$ -l gpu=1 # require 1 GPGPU
#$ -pe mpich 32 # CPU cores
MZXMLSEARCH="./MzXML2Search" ...
Condor
local 1Gb/s ethernet network
batch system server
local 1Gb/s ethernet network
batch system server
local 1Gb/s ethernet network
batch system server
condor_master condor_submit condor_resource condor_resource condor_resource condor_agent 00000 00000 00000 00000 00000 00000 00000 11111 11111 11111 11111 11111 11111 11111 0000 0000 0000 0000 1111 1111 1111 1111 00000 00000 00000 11111 11111 11111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 00 11
Condor overview
Agents(client-side software) and Resources
(cluster-side software) advertise their requests and capabilities to the Condor Master.
The Master performs match-making between Agents’ requests and Resources’ offerings.
An Agent sends its computational job directly to the matching Resource.
Reference:Thain, D., Tannenbaum, T. and Livny, M. (2005): “Distributed computing in practice: the Condor experience.” Concurrency and Computation: Practice and Experience, 17:323–356.
Matchmaking, I
Same idea in Condor, except the schema is not fixed. Agents and Resources report their requests and offers using the “ClassAd” format (an enriched key=value format).
No prescribed schema, hence a Resource is free to advertise any “interesting feature” it has, and to represent it in any way that fits the key=value model.
Matchmaking, II
1. Agents specify a Requirements constraint: a boolean expression that can use any value from the Agents’ own (self) ClassAd or the Resource’s (other). 2a. Resources whose offered ClassAd does not satisfy the Requirements constraint are discarded.
2b. Conversely, if the Agents’ ClassAd does not satisfy the Resource Requirements, the Resource is
discarded.
3. Surviving Resources are sorted according to the value of the Rank expression in the Agent’s ClassAd,
Example: Job ClassAd
Select 64-bit Linux hosts, and sort them preferring hosts with larger memory and CPU speed.
Requirements = Arch=="x86_64" && OpSys == "LINUX"
Rank = TARGET.Memory + TARGET.Mips
Reference:http:
Example: Resource ClassAd
A complex access policy, giving priority to users from the owner research group, then other “friend” users, and then the rest. . .
Friend = Owner == "tannenba" ResearchGroup = (Owner == "jbasney"
|| Owner == "raman") Trusted = Owner != "rival"
Requirements = Trusted && ( ResearchGroup || LoadAvg < 0.3 && KeyboardIdle > 15*60 )
Resource allocation, III
Problem: How do you submit a job that requires 200GB of local scratch space? Or 16 cores in a single node?
Resource allocation, IV
The names and types of resource requirements vary from cluster to cluster
– Defaults change with batch system software release
– Custom requirements depend on local system administrator
Job management software must adapted to the local cluster
– When you get access to a new cluster, you must rewrite a large portion of your submission scripts. – Applies to Condor as well: since ClassAds are
All these job management systems are based on a pushmodel (you send the job to an execution cluster).