
Heuristic Static Load-Balancing Algorithm Applied to CESM


Academic year: 2021


(1)

Heuristic Static Load-Balancing Algorithm

Applied to CESM

Yuri Alexeev¹, Sheri Mickelson¹, Sven Leyffer¹, Robert Jacob¹, Anthony Craig²

¹ Argonne National Laboratory, 9700 S. Cass Avenue, Argonne, IL 60439, USA
² CCSM Software Engineering Group, NCAR, Boulder, CO 80305, USA

(2)

Argonne National Laboratory supercomputers

Intrepid (IBM Blue Gene/P)

• 40,960 nodes / 163,840 cores
• 557 teraflops peak
• PowerPC 450 with 4 cores/node at 850 MHz
• Double FPU: 2-wide double-precision SIMD
• 512 MB per core

Mira (IBM Blue Gene/Q)

• 49,152 nodes / 786,432 cores
• 10 petaflops peak
• PowerPC A2 with 16 cores/node at 1.6 GHz
• Quad FPU: 4-wide double-precision SIMD
• 1 GB per core

(3)

CESM setup

CESM fully coupled active components, 1 degree resolution: f09_g16.B

Calculations were run on Intrepid (40 racks of Blue Gene/P)

Goal: minimize total execution time

(4)

Heuristic Static Load-Balancing (HSLB) Algorithm

(1) Gather data: Run CESM D times, each time with a different total number of cores. Collect the running time $y_{ij}$ of each component $i$ in each run $j$.

(2) Fit: Solve a least squares problem for each component $i$ to determine the coefficients $a_i$, $b_i$, $c_i$, and $d_i$ of its performance model.

(3) Solve: Determine the best allocation by solving the MINLP, obtaining the optimal size $n_i$ for each component $i$.

(4) Execute: Run the CESM simulations using the subgroup sizes determined in step (3).

(5)

Gather data for step (1)

(6)

Performance model for step (2)

$$T_i(n_i) = T_i^{\mathrm{scal}}(n_i) + T_i^{\mathrm{nonlin}}(n_i) + T_i^{\mathrm{serial}} = \frac{a_i}{n_i} + b_i n_i^{c_i} + d_i, \qquad i = 1, \ldots, C$$

- $T_i(n_i)$: the wall-clock time to compute the $i$th component as a function of the number of cores $n_i$ allocated to process it
- $T_i^{\mathrm{scal}}(n_i) = a_i / n_i$: time spent in the perfectly scalable portion of the component
- $T_i^{\mathrm{serial}} = d_i$: time spent in the non-parallelized portion of the component
- $T_i^{\mathrm{nonlin}}(n_i) = b_i n_i^{c_i}$: time spent in the partially parallelized portion: initialization, communication, synchronization, etc. (anything nonlinear and not serial)

The model makes sense both mathematically and from the viewpoint of Amdahl's law.
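As a rough illustration of the performance model, the sketch below evaluates $T(n) = a/n + b n^{c} + d$ for one component. The function name `component_time` and the coefficient values are hypothetical and chosen only for illustration; they are not fitted CESM values. Setting $b = 0$ shows the link to Amdahl's law: the speedup over one core is then bounded by $(a + d)/d$.

```python
def component_time(n, a, b, c, d):
    """Modelled wall-clock time T(n) = a/n + b*n**c + d of one component."""
    if n <= 0:
        raise ValueError("n must be positive")
    scalable = a / n        # perfectly scalable portion, T_scal
    nonlinear = b * n ** c  # partially parallelized portion, T_nonlin
    serial = d              # non-parallelized portion, T_serial
    return scalable + nonlinear + serial


# Hypothetical coefficients, chosen only for illustration.
a, b, c, d = 6000.0, 0.01, 1.0, 40.0
for n in (16, 64, 256, 1024):
    print(n, round(component_time(n, a, b, c, d), 3))

# With b = 0 the model reduces to Amdahl's law: the speedup over one core
# stays below (a + d) / d however many cores are added.
speedup = component_time(1, a, 0.0, c, d) / component_time(10**6, a, 0.0, c, d)
amdahl_limit = (a + d) / d
print(round(speedup, 3), amdahl_limit)
```

With $b > 0$ the nonlinear term eventually dominates, so the modelled time has a minimum at a finite core count rather than decreasing forever.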

(7)

Fitting data for step (2)

Obtain the best fit by solving the least squares problem

$$\min_{a_i, b_i, c_i, d_i} \; \sum_{j=1}^{D} \left( y_{ij} - \frac{a_i}{n_{ij}} - b_i n_{ij}^{c_i} - d_i \right)^2 \quad \text{subject to} \quad a_i, b_i, c_i, d_i \ge 0$$
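One simple way to approximate this fit without an optimization library is sketched below. It is not the authors' solver: it swaps in a grid search over the exponent $c$ and, for each $c$, solves the nonnegative linear least squares problem in $(a, b, d)$ by trying every choice of which coefficients are nonzero. The names `fit_component` and `solve_linear`, the grid, and the synthetic timings are all assumptions made for the example.

```python
import itertools


def solve_linear(A, rhs):
    """Solve A x = rhs by Gauss-Jordan elimination; return None if singular."""
    m = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(m):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][m] / M[i][i] for i in range(m)]


def fit_component(ns, ys, c_grid):
    """Fit y ~ a/n + b*n**c + d with a, b, c, d >= 0.

    Returns (sse, a, b, c, d) for the best grid value of c.
    """
    best = None
    for c in c_grid:
        # Basis columns for the linear coefficients a, b, d.
        cols = [[1.0 / n for n in ns], [n ** c for n in ns], [1.0] * len(ns)]
        for k in range(1, 4):
            for free in itertools.combinations(range(3), k):
                # Normal equations restricted to the free coefficients.
                A = [[sum(u * v for u, v in zip(cols[p], cols[q])) for q in free]
                     for p in free]
                rhs = [sum(u * y for u, y in zip(cols[p], ys)) for p in free]
                sol = solve_linear(A, rhs)
                if sol is None or min(sol) < 0:
                    continue  # singular system or violates nonnegativity
                coef = [0.0, 0.0, 0.0]
                for p, v in zip(free, sol):
                    coef[p] = v
                sse = sum((y - sum(coef[p] * cols[p][j] for p in range(3))) ** 2
                          for j, y in enumerate(ys))
                if best is None or sse < best[0]:
                    best = (sse, coef[0], coef[1], c, coef[2])
    return best


# Synthetic timings generated from hypothetical coefficients (not CESM data).
ns = [16, 32, 64, 128, 256]
ys = [6000.0 / n + 0.01 * n + 40.0 for n in ns]
c_grid = [k / 10 for k in range(1, 21)]
sse, a_fit, b_fit, c_fit, d_fit = fit_component(ns, ys, c_grid)
print(a_fit, b_fit, c_fit, d_fit)
```

The grid resolution limits how accurately $c$ is recovered; a dedicated nonlinear least squares solver would treat $c$ as a continuous variable.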

(8)

Formulating the Optimization Problem

Problem: optimize the number of nodes, $n_i$, to be allocated to each component $i \in \{1, \ldots, C\}$

• minimize the total wall time over all components: $\min_{n} \sum_{i=1}^{C} T_i(n_i)$

• minimize the maximum wall time used by a component: $\min_{n} \max_{i} T_i(n_i)$

• maximize the minimum wall time used by a component: $\max_{n} \min_{i} T_i(n_i)$

[Figure: time versus number of nodes]

(9)

Formulating the mathematical problem for step (3)

Given:

$\mathbb{Z}^{+}$ - set of positive integers
$\mathbb{R}^{+}$ - set of positive real numbers
$C = \{\mathrm{ice}, \mathrm{lnd}, \mathrm{atm}, \mathrm{ocn}\} = \{i, l, a, o\}$ - set of components
$N \in \mathbb{Z}^{+}$ - total number of nodes available for allocation
$O = \{2, 4, \ldots, 480, 768\} = \{O_1, \ldots, O_m\}$ - possible allocations for ocn
$A = \{1, 2, \ldots, 1638, 1664\} = \{A_1, \ldots, A_m\}$ - possible allocations for atm

Variables:

$T \in \mathbb{R}^{+}$ - wall-clock time obtained by solving the allocation problem
$T_{\mathrm{icelnd}} \in \mathbb{R}^{+}$ - wall-clock time to balance lnd and ice
$T_{\mathrm{sync}} \in \mathbb{R}^{+}$ - synchronization tolerance to balance lnd and ice
$n_j \in \mathbb{Z}^{+}$ - number of nodes allocated
$T_j(n_j) \in \mathbb{R}^{+}$ - (fitted) performance function modeling the time taken to run on $n_j$ nodes
$z_k \in \{0, 1\}$ - binary variables to model the selection of the number of nodes, $n_o$

Minimize: $T$

Constraints for layout (1)

Subject to:

$T_{\mathrm{icelnd}} \ge T_i(n_i)$
$T_{\mathrm{icelnd}} \ge T_l(n_l)$
$T \ge T_{\mathrm{icelnd}} + T_a(n_a)$
$T \ge T_o(n_o)$
$T_l(n_l)$ and $T_i(n_i)$ balanced to within $T_{\mathrm{sync}}$
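For small node counts, the layout (1) problem can be explored by plain enumeration, which makes the role of each constraint visible. The sketch below is not the AMPL formulation: the node budget ($n_a + n_o \le N$, with ice and lnd sharing the atm nodes), the form of the tolerance test ($|T_l - T_i| \le T_{\mathrm{sync}}$), the function names, and all model coefficients are assumptions made for illustration.

```python
def model_time(n, coeffs):
    """Evaluate T(n) = a/n + b*n**c + d for coefficients (a, b, c, d)."""
    a, b, c, d = coeffs
    return a / n + b * n ** c + d


def solve_layout1(models, total_nodes, atm_choices, ocn_choices, t_sync):
    """Enumerate allocations for layout (1); return (T, allocation) or None."""
    best = None
    for n_o in ocn_choices:
        t_o = model_time(n_o, models["ocn"])
        for n_a in atm_choices:
            if n_a + n_o > total_nodes:      # assumed node budget
                continue
            t_a = model_time(n_a, models["atm"])
            for n_i in range(1, n_a):        # assumed: ice + lnd use the atm nodes
                n_l = n_a - n_i
                t_i = model_time(n_i, models["ice"])
                t_l = model_time(n_l, models["lnd"])
                if abs(t_l - t_i) > t_sync:  # assumed form of the tolerance
                    continue
                t_icelnd = max(t_i, t_l)     # T_icelnd >= T_i and T_icelnd >= T_l
                total = max(t_icelnd + t_a, t_o)  # T >= T_icelnd + T_a and T >= T_o
                if best is None or total < best[0]:
                    best = (total, {"ice": n_i, "lnd": n_l, "atm": n_a, "ocn": n_o})
    return best


# Hypothetical coefficients (a, b, c, d) per component, not fitted CESM values.
models = {
    "ice": (8000.0, 0.0, 1.0, 10.0),
    "lnd": (1500.0, 0.0, 1.0, 5.0),
    "atm": (30000.0, 0.0, 1.0, 20.0),
    "ocn": (8000.0, 0.0, 1.0, 30.0),
}
best = solve_layout1(models, 128, range(8, 129, 8), range(2, 129, 2), t_sync=5.0)
print(best)
```

Enumeration scales poorly as the node count grows, which is one reason to hand the problem to an MINLP solver instead.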

(10)

Solving MINLP problem

Formulation is written in AMPL

• Classical branch-and-bound [Dakin, 1965] implemented in MINOTAUR: http://wiki.mcs.anl.gov/minotaur

• Solve the relaxed NLP (continuous relaxation); the solution value provides a lower bound

• Branch on $y_i$

• Solve NLP and branch until one of the following holds:

- Node infeasible
- Node integer feasible (gives an upper bound)
- Lower bound ≥ upper bound

• The tree search is exhaustive but is not a complete enumeration

• The method is guaranteed to find a globally optimal solution or to show that none exists

• Solution time is ≤ 10 seconds on a single core (155 components)
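The three stopping rules can be seen in a toy branch-and-bound for a simplified min-max allocation problem. This is only a sketch and differs from MINOTAUR in an important way: the lower bound here is a simple per-component bound rather than the solution of a continuous NLP relaxation. The problem (minimize the largest component time subject to a node budget), the function name `branch_and_bound`, and the timing functions are hypothetical.

```python
def branch_and_bound(time_fns, total_nodes):
    """Minimize max_i time_fns[i](n_i) s.t. sum(n_i) <= total_nodes, n_i >= 1."""
    k = len(time_fns)
    best = {"value": float("inf"), "alloc": None, "visited": 0}

    def lower_bound(idx, used, current_max):
        # Each unassigned component can get at most the nodes left after
        # reserving one node for every other unassigned component.
        unassigned = k - idx
        cap = total_nodes - used - (unassigned - 1)
        if unassigned and cap < 1:
            return None
        lb = current_max
        for i in range(idx, k):
            lb = max(lb, min(time_fns[i](n) for n in range(1, cap + 1)))
        return lb

    def visit(idx, alloc, used, current_max):
        best["visited"] += 1
        lb = lower_bound(idx, used, current_max)
        if lb is None:           # prune: node infeasible
            return
        if lb >= best["value"]:  # prune: lower bound >= upper bound
            return
        if idx == k:             # integer feasible: new upper bound
            best["value"], best["alloc"] = current_max, tuple(alloc)
            return
        cap = total_nodes - used - (k - idx - 1)
        for n in range(1, cap + 1):  # branch on the next component's node count
            t = time_fns[idx](n)
            visit(idx + 1, alloc + [n], used + n, max(current_max, t))

    visit(0, [], 0, 0.0)
    return best


# Hypothetical component timing functions, for illustration only.
time_fns = [
    lambda n: 120.0 / n + 2.0,
    lambda n: 300.0 / n + 0.5 * n + 1.0,
    lambda n: 60.0 / n + 4.0,
]
result = branch_and_bound(time_fns, 12)
print(result["value"], result["alloc"], result["visited"])
```

Because subtrees are discarded only when their bound cannot beat the incumbent, the search still covers every allocation implicitly without visiting each one.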

(11)

MINLP Tree

(12)

Results

CESM fully coupled active components, 1 degree resolution: f09_g16.B
Calculations were run on Intrepid (40 racks of Blue Gene/P)

1° resolution, 128 nodes

                     Manual                 HSLB
Component            # nodes   Time, sec    Predicted # nodes   Predicted time, sec   Actual time, sec
lnd                  24        63.766       15                  100.951               100.202
ice                  80        109.054      89                  102.972               116.472
atm                  104       306.952      104                 307.651               308.699
ocn                  24        362.669      24                  365.649               365.853
Total time, sec                416.006                          410.623               425.171
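The reported totals appear consistent with layout (1): lnd and ice run side by side, atm follows, and ocn runs concurrently with all three, so the total is $\max(\max(T_l, T_i) + T_a,\ T_o)$. The short check below recomputes the three totals from the per-component times in the table; the function name `layout1_total` is hypothetical, and the interpretation of the layout is an inference from the numbers rather than something stated on the slide.

```python
def layout1_total(t):
    """Total time: lnd and ice side by side, then atm, with ocn concurrent."""
    return max(max(t["lnd"], t["ice"]) + t["atm"], t["ocn"])


# Per-component times in seconds, copied from the table.
manual = {"lnd": 63.766, "ice": 109.054, "atm": 306.952, "ocn": 362.669}
predicted = {"lnd": 100.951, "ice": 102.972, "atm": 307.651, "ocn": 365.649}
actual = {"lnd": 100.202, "ice": 116.472, "atm": 308.699, "ocn": 365.853}

totals = {
    "manual": round(layout1_total(manual), 3),
    "predicted": round(layout1_total(predicted), 3),
    "actual": round(layout1_total(actual), 3),
}
print(totals)

# Relative gap between the HSLB prediction and the measured HSLB run.
rel_err = (totals["actual"] - totals["predicted"]) / totals["actual"]
print(round(100 * rel_err, 2), "%")
```

In all three columns the lnd/ice/atm chain, not ocn, determines the total.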

(13)

Results

(14)

Prediction of Optimal Layout

(15)

Prediction of Efficiency

$$E(n) = \frac{T(64) / T(n)}{n / 64}$$
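The efficiency formula compares the measured speedup relative to the 64-unit baseline, $T(64)/T(n)$, with the ideal speedup $n/64$. A minimal sketch follows; the function names and the timing model are hypothetical and serve only to show how the ratio behaves.

```python
def efficiency(time_at, n, base=64):
    """Parallel efficiency E(n) = (T(base) / T(n)) / (n / base)."""
    return (time_at(base) / time_at(n)) / (n / base)


# Hypothetical timing model with a serial part (illustrative coefficients).
def model(n):
    return 6000.0 / n + 40.0


for n in (64, 128, 256, 512):
    print(n, round(efficiency(model, n), 3))
```

A perfectly scalable code, with $T(n) \propto 1/n$, would give $E(n) = 1$ for every $n$; any serial or nonlinear cost pulls the value below 1 as $n$ grows.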

(16)

Future work

• Convert the AMPL code to C++ to make it more portable

• Create scripts that will automate the load-balancing process

- The first script will gather timing data for the scaling curve by creating and running 4-5 test layouts

- The second script will analyze the timing files and produce a load-balanced layout based on the number of cores the user would like to run on

(17)

Acknowledgments

Thank you

• Dr. Ray Loy and ALCF team members (Argonne National Laboratory)

• Jim Edwards and Mariana Vertenstein for encouraging this work and for helpful discussions

Funding was provided by
