
Utilization Driven Power-Aware Parallel Job Scheduling


Academic year: 2021


(1)

Utilization Driven Power-Aware Parallel Job Scheduling

Maja Etinski, Julita Corbalan, Jesus Labarta, Mateo Valero

(2)

Motivation

Performance increases have been accompanied by even larger increases in power consumption

Top500

(3)

Power reduction approaches in HPC

Application level:

- Runtime systems:
  - exploit certain application characteristics (load imbalance, communication-intensive regions)
  - based on very fine-grained application of DVFS

System level:

- Turning off idle nodes:
  - resource allocation such that there are more completely idle nodes
  - determining the number of online nodes
- Operating system power management via DVFS:
  - Linux governors: per core, unaware of the rest of the system

(4)

Power-Aware Parallel Job Scheduling

The job scheduler has a global view of the whole system.

[Diagram: Job submission (job description with requirements) feeds the HPC job scheduler, which holds the queued jobs and comprises job scheduling, the resource manager, and the power-aware module. The power-aware module runs a frequency assignment algorithm that uses DVFS based on current load.]

Job performance: wait time and run time

(5)

Outline

Parallel job scheduling:

- the EASY backfilling policy
- frequency assignment

Power and run time modeling:

- how does frequency scaling affect power dissipation and execution time?

Evaluation:

- experimental methodology (simulator, workloads, policy parameters)
- results

(6)

The EASY backfilling policy

[Figure: Schedule of CPUs versus time with Jobs 1 to 4 running. On the arrival of Job 5, MakeJobReservation(Job5) is called; on the arrival of Job 6, BackfillJob(Job6) is called.]

Jobs are executed in FCFS order except when the first job in the wait queue cannot start.

Users have to submit an estimate of the job's run time (the requested time).

When the first job in the wait queue (WQ) cannot start, a reservation is made for it based on the requested times of the running jobs.

A job is executed before previously arrived ones only if it does not delay the first job in the queue
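The rules above can be sketched as a single scheduling pass. This is a minimal illustration rather than the simulator's implementation: the `Job` class, the `easy_backfill` function, and the representation of running jobs as `(job, expected_end)` pairs are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int
    requested: float  # user-supplied run time estimate (requested time)


def easy_backfill(now, total_cpus, running, queue):
    """Return the queued jobs that may start at time `now` (hypothetical helper).

    running: list of (job, expected_end) pairs; expected_end uses requested times.
    queue:   waiting jobs in arrival (FCFS) order.
    """
    running = list(running)
    queue = list(queue)
    free = total_cpus - sum(job.cpus for job, _ in running)
    started = []

    # FCFS phase: start jobs from the head of the queue while they fit.
    while queue and queue[0].cpus <= free:
        job = queue.pop(0)
        free -= job.cpus
        running.append((job, now + job.requested))
        started.append(job)
    if not queue:
        return started

    # Reservation for the first job that cannot start: find the earliest time
    # at which enough CPUs are released, based on requested times.
    head = queue[0]
    avail = free
    reservation = None
    for job, end in sorted(running, key=lambda pair: pair[1]):
        avail += job.cpus
        if avail >= head.cpus:
            reservation = end
            break
    if reservation is None:  # head job is larger than the whole machine
        return started
    extra = avail - head.cpus  # CPUs left over once the head job starts

    # Backfill phase: a later job may start only if it does not delay the head.
    for job in queue[1:]:
        if job.cpus > free:
            continue
        ends_in_time = now + job.requested <= reservation
        if ends_in_time or job.cpus <= extra:
            free -= job.cpus
            if not ends_in_time:
                extra -= job.cpus
            started.append(job)
    return started
```

With 10 CPUs, a 6-CPU job running until t = 100, and a queue holding an 8-CPU job followed by a 3-CPU job requesting 60 time units, the 3-CPU job is backfilled because it finishes before the reservation at t = 100.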

(7)

When to use DVFS? Which frequency?

Use of DVFS during period of low system activity

When utilization is low, impact on performance is minimal (normally there are no queued jobs)

Majority of workloads have average systems utilizations in range 45% - 75% (Parallel Workload Archive)

Transient periods of low load (over night and holidays) Two levels of control:

system utilization

number of jobs in the wait queue

Frequency assignment algorithm can be applied with any parallel job scheduling policy (Industrial strength schedulers are usually based on backfilling policies)

(8)

Frequency assignment

Frequency is assigned once (at the job's start time) for the entire job execution.

Utilization $U_k$ is computed for each interval of length $T$.

If there are more than $WQ_{threshold}$ jobs in the wait queue, no frequency scaling will be applied.

Otherwise, a job started during interval $k$ runs at frequency $F_k$:

$F_k = f_{lower}$ if $U_{k-1} < U_{lower}$
$F_k = f_{upper}$ if $U_{lower} \le U_{k-1} < U_{upper}$
$F_k = f_{top}$ otherwise
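The utilization formula itself did not survive on this slide. As a rough sketch, one plausible definition (an assumption, not taken from the source) is the busy CPU time within the interval divided by the total CPU time available in it. The function name `interval_utilization` and the `(cpus, start, end)` job tuples are hypothetical.

```python
def interval_utilization(jobs, total_cpus, start, length):
    """Fraction of CPU time in [start, start + length) used by running jobs.

    jobs: iterable of (cpus, job_start, job_end) tuples.
    Assumed definition: busy CPU-seconds / (total_cpus * length).
    """
    end = start + length
    busy = 0.0
    for cpus, job_start, job_end in jobs:
        # Only the part of the job that overlaps the interval counts.
        overlap = min(job_end, end) - max(job_start, start)
        if overlap > 0:
            busy += cpus * overlap
    return busy / (total_cpus * length)
```

For example, on a 200-CPU system with T = 600 s, a 50-CPU job spanning the whole interval and a 100-CPU job covering its second half give a utilization of 0.5.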

(9)

UPAS

Inputs: WQ size, utilization $U_{k-1}$

1. WQ size > $WQ_{threshold}$? YES: $F = f_{top}$. NO: go to step 2.
2. $U_{k-1} < U_{lower}$? YES: $F = f_{lower}$. NO: go to step 3.
3. $U_{k-1} < U_{upper}$? YES: $F = f_{upper}$. NO: $F = f_{top}$.
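The UPAS decision flow can be written as a small function. This is a sketch only: the threshold and frequency values are the ones listed under the policy parameters and the DVFS gear set elsewhere in these slides, while the function name `assign_frequency` and the use of `None` to model the "NO" wait queue setting are assumptions.

```python
F_TOP, F_UPPER, F_LOWER = 2.3, 2.0, 1.4  # GHz
U_LOWER, U_UPPER = 0.50, 0.80            # utilization thresholds


def assign_frequency(wq_size, prev_utilization, wq_threshold=4):
    """Pick the frequency for a job starting now (hypothetical helper).

    wq_size:          number of jobs currently in the wait queue
    prev_utilization: utilization of the previous interval, in [0, 1]
    wq_threshold:     None is assumed to mean no wait queue limit ("NO")
    """
    if wq_threshold is not None and wq_size > wq_threshold:
        return F_TOP  # too many queued jobs: no frequency scaling
    if prev_utilization < U_LOWER:
        return F_LOWER
    if prev_utilization < U_UPPER:
        return F_UPPER
    return F_TOP
```

The frequency is chosen once when the job starts, so a caller would invoke this at job start time and keep the result for the whole execution.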

(10)

Power Model

CPU power represents a major portion of total system power. It consists of dynamic and static power:

$P_{cpu} = P_{dynamic} + P_{static}$
$P_{dynamic} = A c f V^2$
$P_{static} = \alpha V$

The fraction of static power in total CPU power is a model parameter:

$P_{static}(V_{top}) = X (P_{static}(V_{top}) + P_{dynamic}(f_{top}, V_{top}))$

(X = 25% in our experiments)

Two scenarios for idle CPUs:

- idle processors do not consume power
- idle CPUs are at the lowest frequency with a low activity factor

The average activity factor is assumed to be the same for all jobs.

The activity factor of idle processors is 2.5 times lower than the running activity factor.

DVFS gear set:

f (GHz)  0.80  1.10  1.40  1.70  2.00  2.30
V (V)    1.00  1.10  1.20  1.30  1.40  1.50
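As a worked illustration of the power model, the sketch below derives $\alpha$ from the static fraction X and evaluates CPU power for each gear. The product A·c is set to 1 (arbitrary units), which is an assumption made here so that only relative power matters; the function and constant names are hypothetical.

```python
# (frequency in GHz, voltage) pairs from the DVFS gear set
GEARS = [(0.80, 1.00), (1.10, 1.10), (1.40, 1.20),
         (1.70, 1.30), (2.00, 1.40), (2.30, 1.50)]
F_TOP, V_TOP = GEARS[-1]
X = 0.25   # fraction of static power in total CPU power at the top gear
AC = 1.0   # A * c in arbitrary units (assumption)

# From P_static(V_top) = X * (P_static(V_top) + P_dynamic(f_top, V_top)):
# alpha * V_top = X / (1 - X) * AC * f_top * V_top**2
ALPHA = X / (1.0 - X) * AC * F_TOP * V_TOP


def dynamic_power(f, v, activity_scale=1.0):
    """P_dynamic = A c f V^2, with the activity factor scaled if needed."""
    return activity_scale * AC * f * v ** 2


def static_power(v):
    """P_static = alpha V."""
    return ALPHA * v


def cpu_power(f, v, activity_scale=1.0):
    return dynamic_power(f, v, activity_scale) + static_power(v)


def idle_power_low():
    """Idle CPU at the lowest gear with a 2.5 times lower activity factor."""
    f, v = GEARS[0]
    return cpu_power(f, v, activity_scale=1.0 / 2.5)


if __name__ == "__main__":
    top = cpu_power(F_TOP, V_TOP)
    for f, v in GEARS:
        print(f"{f:.2f} GHz: {cpu_power(f, v) / top:.3f} of top-gear power")
```

Printing power relative to the top gear makes the effect of each gear visible without committing to absolute wattage.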

(11)

Time Model

Execution time dependence on frequency is captured by the following model:

$F(f, \beta) = T(f) / T(f_{top}) = \beta (f_{top}/f - 1) + 1$

[Hsu, Feng, SC05: A Power-Aware Run Time System for High-Performance Computing]

β is assumed to have the following distributions:

[Figure: Execution time penalty due to frequency scaling. Normalized execution time (0 to 2.5) versus frequency (0.8 to 2.3 GHz) for β = 0.7, β = 0.5, and β = 0.2.]
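The time model is a one-line formula, so a short sketch is enough to evaluate it across the gear set. The function name `time_penalty` is hypothetical, and the default top frequency of 2.3 GHz is taken from the DVFS gear set.

```python
def time_penalty(f, beta, f_top=2.3):
    """Return T(f) / T(f_top) = beta * (f_top / f - 1) + 1."""
    return beta * (f_top / f - 1.0) + 1.0


if __name__ == "__main__":
    for beta in (0.2, 0.5, 0.7):
        row = ", ".join(f"{time_penalty(f, beta):.2f}"
                        for f in (0.8, 1.1, 1.4, 1.7, 2.0, 2.3))
        print(f"beta = {beta}: {row}")
```

At the top frequency the penalty is 1 for any β, and a larger β means the job's run time is more sensitive to frequency reduction.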

(12)

Evaluation

The Alvio simulator, a C++ event-driven parallel job scheduling simulator, has been upgraded.

Policy parameters:

- utilization thresholds: $U_{lower}$ = 50%, $U_{upper}$ = 80%
- reduced frequencies: $f_{lower}$ = 1.4 GHz, $f_{upper}$ = 2.0 GHz
- utilization computation interval: $T$ = 10 min
- wait queue length threshold: $WQ_{threshold}$ = 0, 4, 16, NO (no limit)

Metric of job performance: Bounded Slowdown (BSLD), computed at frequency f
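The BSLD formula did not survive on this slide. The sketch below uses the commonly cited bounded slowdown definition, with the run time in the numerator stretched by the frequency-scaling penalty; both that form and the 600-second threshold are assumptions for illustration, not values confirmed by the slides. The function name `bsld` is hypothetical.

```python
def bsld(wait_time, run_time_top, penalty=1.0, threshold=600.0):
    """Bounded slowdown of a job run at a reduced frequency (assumed form).

    wait_time:    seconds spent in the wait queue
    run_time_top: run time in seconds at the top frequency
    penalty:      T(f) / T(f_top) from the time model (1.0 at top frequency)
    threshold:    lower bound on the denominator, so that very short jobs
                  do not dominate the average (value assumed here)
    """
    penalized_run = run_time_top * penalty
    return max((wait_time + penalized_run) / max(threshold, run_time_top), 1.0)
```

Under this definition a job that never waits and runs at the top frequency has a BSLD of 1, and any frequency reduction or queueing delay pushes the value above 1.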

(13)

Workloads

Five workloads from production use have been simulated:

Workload     #CPUs  Avg Util  Avg LR  %T below Uupper  %T below Ulower
CTC            430       70%    1.61              50%              28%
SDSC           128       85%    8.17              26%               5%
SDSCBlue      1152       69%    2.31              55%              26%
LLNLThunder   4008       80%    0.80              29%              11%
LLNLAtlas     9216       75%    0.94              26%              19%

- CTC (Cornell Theory Center): large jobs with a relatively low level of parallelism
- SDSC (San Diego Supercomputing Center): fewer sequential jobs than CTC, similar runtime distribution
- SDSCBlue (San Diego Supercomputing Center): no sequential jobs
- LLNLThunder (Lawrence Livermore National Lab): small to medium size jobs
- LLNLAtlas (Lawrence Livermore National Lab)

Parallel workload archive

http://www.cs.huji.ac.il/labs/parallel/workload

(14)

Results: Energy - Original System Size

[Figure: Normalized energy (%E, 80% to 100%) per workload (CTC, SDSC, SDSCBlue, LLNLThunder, LLNLAtlas) for WQthreshold = 0, 4, 16, NO. Two panels: idle = 0 and idle = low.]

short wait queues

very similar results for both energy scenarios

(15)

Results: Performance – Original System Size

[Figure: Two panels per workload (CTC, SDSC, SDSCBlue, LLNLThunder, LLNLAtlas) for WQthreshold = 0, 4, 16, NO: normalized average BSLD (0.8 to 1.5) and number of reduced jobs (0 to 2500).]

high penalty in the least conservative case for highly loaded workloads

(16)
(17)

DVFS system size

Frequency scaling is applied when load/utilization is low

More CPUs -> lower load/utilization -> more opportunities for DVFS application

DVFS scaling leads to lower power

More CPUs -> lower CPU energy

More CPUs -> better job performance due to lower wait times

Is it possible to achieve both (lower energy and higher performance)?

The following system size increases have been considered in the evaluation process: 10%, 20%, 50%, 75%

(18)

System Oversizing: Performance

[Figure: Normalized average BSLD (0 to 1.4) per workload (CTC, SDSC, SDSCBlue, LLNLThunder, LLNLAtlas) for system size increases of 10%, 20%, 50%, and 75%. Two panels: WQthreshold = 0 and WQthreshold = NO.]

LLNLThunder has perfect BSLD

CTC, SDSC, and SDSCBlue achieve better performance than the original system with only a 10% increase in system size

(19)

System Oversizing: Energy

[Figure: Normalized energy versus system size increase (10%, 20%, 50%, 75%) per workload, for both idle scenarios (idle = 0 and idle = low) and for WQthreshold = 0 and WQthreshold = NO. One pair of panels (0.7 to 1) covers all five workloads; the other (0.85 to 1) covers only CTC, SDSC, and SDSCBlue.]

(20)

Conclusions

Use of DVFS at the level of parallel job scheduling has been proposed

A power-aware parallel job scheduling policy based on system utilization has been evaluated

Trade-off between job performance and energy

For less loaded workloads it is possible to save up to 12% of energy without affecting average BSLD significantly

Modest energy savings in highly loaded workloads result in a high performance penalty

An analysis of system size has been performed, showing that bigger DVFS systems can result in lower CPU energy consumption and higher job performance
