• No results found

A nice little scheduling problem

N/A
N/A
Protected

Academic year: 2022

Share "A nice little scheduling problem"

Copied!
37
0
0

Loading.... (view fulltext now)

Full text

(1)

A nice little scheduling problem

Yves Robert

ENS Lyon & Institut Universitaire de France University of Tennessee Knoxville

[email protected]

CCDSC 2014 – Dareiz´e, September 3, 2014

(2)

A nice little scheduling problem

Lame motivation

(3)

A nice little scheduling problem

Theorem 1 Theorem 2 Theorem 3 Theorem 4 Theorem 5 Theorem 6 Theorem 8 Theorem 9

(4)

A nice little scheduling problem

Conclusion: proving Theorem 7 would be nice

(5)

Algorithms for coping with silent errors

Yves Robert

ENS Lyon & Institut Universitaire de France University of Tennessee Knoxville

[email protected]

CCDSC 2014 – Dareiz´e, September 3, 2014

(6)

Exascale platforms

Hierarchical

• 105 or 106 nodes

• Each node equipped with 104 or 103 cores Failure-prone

MTBF – one node 10 years 120 years

MTBF – platform 5mn 1h

of 106 nodes

More nodes ⇒ Shorter MTBF (Mean Time Between Failures)

(7)

Definitions

Instantaneous error detection ⇒ fail-stop failures, e.g. resource crash

Silent errors (data corruption) ⇒ detection latency

Silent error detected only when the corrupt data is activated

Includes some software faults, some hardware errors (soft errors in L1 cache), double bit flip

Cannot always be corrected by ECC memory

(8)

Quotes

Soft Error: An unintended change in the state of an electronic device that alters the information that it stores without destroying its functionality, e.g. a bit flip caused by a cosmic-ray-induced neutron. (Hengartner et al., 2008) SDC occurs when incorrect data is delivered by a computing system to the user without any error being logged (Cristian Constantinescu, AMD)

Silent errors are the black swan of errors(Marc Snir)

(9)

Should we be afraid? (courtesy Al Geist)

(10)

Probability distributions for silent errors

?

Theorem: µp= µind

p for arbitrary distributions

(11)

Probability distributions for silent errors

?

Theorem: µp= µind

p for arbitrary distributions

(12)

Lesson learnt for fail-stop failures

(Not so) Secret data

• Tsubame 2: 962 failures during last 18 months so µ = 13 hrs

• Blue Waters: 2-3 node failures per day

• Titan: a few failures per day

• Tianhe 2: wouldn’t say

Topt=p

2µC ⇒ Wasteopt ≈ s

2C µ

Petascale: C = 20 min µ = 24 hrs ⇒ Wasteopt = 17%

Scale by 10: C = 20 min µ = 2.4 hrs ⇒ Wasteopt = 53%

Scale by 100: C = 20 min µ = 0.24 hrs ⇒ Wasteopt = 100%

(13)

Lesson learnt for fail-stop failures

(Not so) Secret data

• Tsubame 2: 962 failures during last 18 months so µ = 13 hrs

• Blue Waters: 2-3 node failures per day

• Titan: a few failures per day

• Tianhe 2: wouldn’t say

Topt=p

2µC ⇒ Wasteopt ≈ s

2C µ

Petascale: C = 20 min µ = 24 hrs ⇒ Wasteopt = 17%

Scale by 10: C = 20 min µ = 2.4 hrs ⇒ Wasteopt = 53%

Scale by 100: C = 20 min µ = 0.24 hrs ⇒ Wasteopt = 100%

Exascale 6= Petascale ×1000

Need more reliable components

Need to checkpoint faster

(14)

Lesson learnt for fail-stop failures

(Not so) Secret data

• Tsubame 2: 962 failures during last 18 months so µ = 13 hrs

• Blue Waters: 2-3 node failures per day

• Titan: a few failures per day

• Tianhe 2: wouldn’t say

Topt=p

2µC ⇒ Wasteopt ≈ s

2C µ

Petascale: C = 20 min µ = 24 hrs ⇒ Wasteopt = 17%

Scale by 10: C = 20 min µ = 2.4 hrs ⇒ Wasteopt = 53%

Scale by 100: C = 20 min µ = 0.24 hrs ⇒ Wasteopt = 100%

Silent errors:

detection latency ⇒ additional problems

(15)

Outline

1 General-purpose approach

2 Checkpointing and verification

3 Application-specific methods

(16)

Outline

1 General-purpose approach

2 Checkpointing and verification

3 Application-specific methods

(17)

General-purpose approach

Xe Xd Time Error

Detection

Error and detection latency

Last checkpoint may have saved an already corrupted state

Saving k checkpoints (Lu, Zheng and Chien):

¬ Critical failure when all live checkpoints are invalid

­ Which checkpoint to roll back to?

(18)

General-purpose approach

Xe Xd Time Error

Detection

Error and detection latency

Last checkpoint may have saved an already corrupted state Saving k checkpoints (Lu, Zheng and Chien):

¬ Critical failure when all live checkpoints are invalid Assume unlimited storage resources

­ Which checkpoint to roll back to?

Assume verification mechanism

(19)

Optimal period?

Xe Xd Time Error

Detection

Error and detection latency

Xe inter arrival time between errors; mean time µe Xd error detection time; mean time µd

Assume Xd and Xe independent

(20)

Arbitrary distribution

WasteFF= C T WasteFail=

T

2 + R +µd

µe

Only valid if T2 + R + µd  µe

Theorem

Best period is Topt ≈√ 2µeC Independent of Xd

(21)

Limitation of the model

It is not clear how to detect when the error has occurred (hence to identify the last valid checkpoint)

/ / /

Need a verification mechanism to check the correctness of the checkpoints. This has an additional cost!

(22)

Outline

1 General-purpose approach

2 Checkpointing and verification

3 Application-specific methods

(23)

Coupling checkpointing and verification

Verification mechanism of cost V

Silent errors detected only when verification is executed Approach agnostic of the nature of verification mechanism (checksum, error correcting code, coherence tests, etc) Fully general-purpose

(application-specific information, if available, can always be used to decrease V )

(24)

Base pattern (and revisiting Young/Daly)

W W Time

Error Detection

V C V C V C

Fail-stop (classical) Silent errors

Pattern T = W + C S = W + V + C

WasteFF C T

V +C S

Wastefail 1

µ(D + R +W2 ) µ1(R +W + V ) Optimal Topt =√

2C µ Sopt =p(C + V )µ Wasteopt

q2C

µ 2q

C +V µ

(25)

With p = 1 checkpoint and q = 3 verifications

Time

w w w w w w

Error Detection

V C V V V C V V V C

Base Pattern p = 1, q = 1 Wasteopt = 2q

C +V µ

New Pattern p = 1, q = 3 Wasteopt = 2

q4(C +3V )

(26)

BalancedAlgorithm

2w 2w w w 2w 2w Time

V C V V C V V V C

p checkpoints and q verifications, p ≤ q p = 2, q = 5, S = 2C + 5V + W W = 10w , six chunks of size w or 2w

May store invalid checkpoint (error during third chunk) After successful verification in fourth chunk, preceding checkpoint is valid

Keep only two checkpoints in memory and avoid any fatal failure

(27)

BalancedAlgorithm

2w 2w w w 2w 2w Time

V C V V C V V V C

¬ ( proba 2w /W ) Tlost= R + 2w + V

­ ( proba 2w /W ) Tlost= R + 4w + 2V

® ( proba w /W ) Tlost= 2R + 6w + C + 4V

¯ ( proba w /W ) Tlost= R + w + 2V

° ( proba 2w /W ) Tlost= R + 3w + 2V

± ( proba 2w /W ) Tlost= R + 5w + 3V

Wasteopt≈ 2 s

7(2C + 5V ) 20µ

(28)

Outline

1 General-purpose approach

2 Checkpointing and verification

3 Application-specific methods

(29)

Literature

ABFT: dense matrices / fail-stop, extended to sparse / silent.

Limited to one error detection and/or correction in practice Asynchronous (chaotic) iterative methods (old work) Partial differential equations: use lower-order scheme as verification mechanism (detection only, Benson, Schmit and Schreiber)

FT-GMRES: inner-outer iterations (Hoemmen and Heroux) PCG: orthogonalization check every k iterations,

re-orthogonalization if problem detected (Sao and Vuduc) . . . Many others

(30)

On-line ABFT scheme for PCG

Zizhong Chen, PPoPP’13 Iterate PCG

Cost: SpMV, preconditioner solve, 5 linear kernels Detect soft errors by checking orthogonality and residual

Verification every d iterations Cost: scalar product+SpMV Checkpoint every c iterations Cost: three vectors, or two vectors + SpMV at recovery

Experimental method to choose c and d

(31)

Conclusion

Soft errors difficult to cope with, even for divisible workloads Investigate graphs of computational tasks

Combine checkpointing and application-specific techniques Multi-criteria optimization problem

execution time/energy/reliability

best resource usage (performance trade-offs)

Several challenging algorithmic/scheduling problems

,

(32)

A little game?

Framework

Compute something

Energy cost C1 = 10 and failure probability f1 = 0.2 Energy cost C2 = 8 and failure probability f2 = 0.3 . . . (many other modes) . . .

Problem

You win when you get twice the same result (no false positive) Find optimal strategy and compute expected cost

(33)

A mutlicore game?

Framework

Now you have p cores for each trial

Can freely run each core in a different mode (including idle) Each configuration has a cost C , and several probabilities:

- ptom = two or more successes (then you know you won) - pone = exactly one success (but you don’t know it) - of course f = 1 − ptom− pone

Problem

You win when you get twice the same result (no false positive) Find optimal strategy and compute expected cost

(34)

Back to task graphs?

Framework

You’re given a (very big) task graph

Each task produces files that you can save (checkpoint) or not Each task can choose from different execution speeds, with different error probabilites

You can replicate some tasks, either for verification or for faster execution of successor tasks

You may also be able to verify results by some application-specific mechanism

Problem

Given energy budget or power cap, minimize execution time For each task, many things to be decided by schedule

/

(35)

Back to task graphs?

Framework

You’re given a (very big) task graph

Each task produces files that you can save (checkpoint) or not Each task can choose from different execution speeds, with different error probabilites

You can replicate some tasks, either for verification or for faster execution of successor tasks

You may also be able to verify results by some application-specific mechanism

Problem

Given energy budget or power cap, minimize execution time For each task, many things to be decided by schedule

/

(36)

Back to task graphs?

Framework

You’re given a (very big) task graph

Each task produces files that you can save (checkpoint) or not Each task can choose from different execution speeds, with different error probabilites

You can replicate some tasks, either for verification or for faster execution of successor tasks

You may also be able to verify results by some application-specific mechanism

Problem

Given energy budget or power cap, minimize execution time For each task, many things to be decided by schedule

/

The fox wants to save the polar bears . . .

(37)

Thanks

INRIA & ENS Lyon Anne Benoit Fr´ed´eric Vivien

PhD students (Guillaume Aupy,Dounia Zaidouni) Univ. Tennessee Knoxville

George Bosilca Aur´elien Bouteiller Jack Dongarra Thomas H´erault Others

Franck Cappello, Argonne National Lab.

Henri Casanova, Univ. Hawai‘i

Saurabh K. Raina, Jaypee IIT, Noida, India Marc Snir, Argonne National Lab.

References

Related documents

Click Add and enter the following information for each Dampening Profile that you want to configure, select Enable , and click OK :8. • Profile Name —Enter a name to identify

1. Retailer-related human capital plays a strong role in the online consumer's store choice decision. Further, the key factors that bring customers back to an online retailer

The hypothetical examples are in no way an official recommendation of a particular insurer, nor is it an exclusive list of the carriers available through Answer Financial...

Stenting and bypass surgery are measures that buy you that time, but the true “fix” for coronary artery disease comes with living a heart healthy lifestyle. This means losing

There are still ways by w/c the President may exercise control over the NICP such as the requirement for his approval for contracting loans for financing its projects,

With the aim of finding esterification procedures with high efficiency and minimum environmental impact, we report here the results obtained in the synthesis of diestersII to

Este proceso/método puede utilizarse para crear un mortero adhesivo de color para facilitar la instalación de mosaicos de vidrio traslúcido, transparente u opaco, así como de

You can save money, reduce risk, and improve system reliability with application managed services, which eliminate the headaches of managing your financial and ERP systems. By