Performance metrics for parallelism

Academic year: 2021

Albert-Jan Yzelman

Sources

Rob H. Bisseling; Parallel Scientific Computing, Oxford Press.

Grama, Gupta, Karypis, Kumar; Parallel Computing, Addison Wesley.

Measuring performance

Definition (Parallel overhead)

Let $T_{\mathrm{seq}}$ be the time taken by a sequential algorithm, and let $T_p$ be the time taken by a parallelisation of that algorithm using $p$ processes.

Then the parallel overhead $T_o$ is given by

$$T_o = p T_p - T_{\mathrm{seq}}.$$

(Effort is proportional to the number of workers multiplied by the duration of their work, that is, equal to $p T_p$.)
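As a minimal sketch, the overhead definition can be written as a small Python function. The function name and the timings below are hypothetical, chosen only to illustrate the arithmetic.

```python
def parallel_overhead(t_seq: float, t_p: float, p: int) -> float:
    """Return T_o = p * T_p - T_seq: total parallel effort minus sequential time."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return p * t_p - t_seq


# Hypothetical timings: 12.0 s sequentially, 4.0 s when using 4 processes.
# The four workers together spend 16.0 s, so 4.0 s of effort is overhead.
print(parallel_overhead(t_seq=12.0, t_p=4.0, p=4))
```

An overhead of zero would mean the parallel run wastes no effort compared with the sequential run.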

Definition (Speedup)

Let $T_{\mathrm{seq}}$, $p$, and $T_p$ be as before. Then the speedup $S$ is given by

$$S(p) = \frac{T_{\mathrm{seq}}}{T_p}.$$

• Target: $S = p$ (no overhead; $T_o = 0$).

• Best case: $S > p$ (superlinear speedup).

• Worst case: $S < 1$ (slowdown).
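The three cases above can be sketched in Python as follows. The function names are hypothetical, and the label "sublinear" for the remaining range $1 \le S < p$ is an assumption added here for completeness; it does not appear in the list above.

```python
def speedup(t_seq: float, t_p: float) -> float:
    """Return S = T_seq / T_p."""
    return t_seq / t_p


def classify_speedup(s: float, p: int) -> str:
    """Classify a speedup s obtained with p processes."""
    if s > p:
        return "superlinear"
    if s == p:
        # Exact equality is rare with measured timings; in practice one
        # would compare within a tolerance.
        return "perfect"
    if s < 1:
        return "slowdown"
    return "sublinear"  # assumed label for 1 <= s < p


# Hypothetical timings on 4 processes.
for t_p in (2.0, 3.0, 6.0, 24.0):
    s = speedup(12.0, t_p)
    print(t_p, s, classify_speedup(s, 4))
```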

What is $T_{\mathrm{seq}}$?

There are usually many sequential algorithms that solve the same problem.

When determining the speedup $S$, compare against the best sequential algorithm (that is available on your architecture).

When determining the overhead $T_o$, compare against the most similar algorithm.

Definition (strong scaling)

$$S(p) = \frac{T_{\mathrm{seq}}}{T_p} = \Omega(p) \quad \text{(i.e., } \limsup_{p \to \infty} |S(p)/p| > 0\text{)}$$

[Figure: log2 speedup versus log2 p. MulticoreBSP for Java: FFT on a Sun UltraSparc T2. Series: perfect scaling, 512k-length vector, 32M-length vector.]

Question: is it reasonable to expect strong scalability for (good) parallel algorithms?

Answer: not as p → ∞. You cannot efficiently clean a table with 50 people, or paint a wall with 500 painters.
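One way to see why the answer is negative is Amdahl's model, which is an assumption brought in here rather than something defined at this point in the slides: if a fraction `serial_fraction` of the sequential run time cannot be parallelised, the speedup is $1 / (s + (1 - s)/p)$, which is bounded by $1/s$. The sketch below uses a hypothetical serial fraction of 5% and shows $S(p)/p$ shrinking as $p$ grows.

```python
def amdahl_speedup(serial_fraction: float, p: int) -> float:
    """Speedup under Amdahl's model: 1 / (s + (1 - s) / p)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


# Hypothetical serial fraction of 5%: the speedup never exceeds 1 / 0.05 = 20,
# so the ratio S(p) / p tends to zero as p grows.
for p in (1, 8, 64, 512, 4096):
    s = amdahl_speedup(0.05, p)
    print(p, round(s, 2), round(s / p, 4))
```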

Definition (weak scaling)

$$S(n) = \frac{T_{\mathrm{seq}}(n)}{T_p(n)} = \Omega(1), \quad \text{with } p \text{ fixed.}$$

[Figure: speedup versus log2 n. BSPlib: FFT on two different parallel machines. Series: perfect speedup, Lynx (distributed-memory), HP DL980 (shared-memory).]

For large enough problems, we do expect to make maximum use of our parallel computer.

Measuring performance: example

[Figure: speedup versus log2 n. MulticoreBSP for Java: FFT on a Sun UltraSparc T2. Series: perfect scaling, 64 processes, 32 processes.]

This processor advertises 64 processors, but only has 32 FPUs. We oversubscribed! Be careful with oversubscription (including hyperthreading)!

Q: would you say this algorithm scales on the Sun Ultrasparc T2?

A: if the speedup stabilises around 16x, then yes, since the relative efficiency is stable.

Definition (Parallel efficiency)

Let $T_{\mathrm{seq}}$, $p$, $T_p$, and $S$ be as before. The parallel efficiency $E$ equals

$$E = \frac{T_{\mathrm{seq}}}{p T_p} = \frac{S}{p}.$$
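A minimal sketch of the efficiency computation, with a hypothetical function name and made-up timings. It mirrors the example above: a speedup of 16 corresponds to an efficiency of 0.5 on 32 processes and 0.25 on 64 processes.

```python
def parallel_efficiency(t_seq: float, t_p: float, p: int) -> float:
    """Return E = T_seq / (p * T_p), which equals S / p."""
    return t_seq / (p * t_p)


# Hypothetical timings giving a speedup of 32.0 / 2.0 = 16 in both cases.
print(parallel_efficiency(t_seq=32.0, t_p=2.0, p=32))  # 16 / 32
print(parallel_efficiency(t_seq=32.0, t_p=2.0, p=64))  # 16 / 64
```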

If there is no overhead ($T_o = p T_p - T_{\mathrm{seq}} = 0$), the efficiency $E = 1$; decreasing the overhead increases the efficiency.

Weak scalability: what happens if the problem size increases?

But what is a sensible definition of the ‘problem size’?

Consider the following applications: inner-product calculation; binary search; sorting; FFT.

Problem          Size          Run-time
Inner-product    Θ(n) bytes    Θ(n) flops
Binary search    Θ(n) bytes    Θ(log2 n) comparisons
Sorting          Θ(n) bytes    Θ(n log2 n) swaps
FFT              Θ(n) bytes    Θ(n log2 n) flops

Hence the problem size is best identified by $T_{\mathrm{seq}}$.

Question:

How should the ratio To/Tseq behave as Tseq→ ∞, for the

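If $T_o / T_{\mathrm{seq}} = c$, with $c \in \mathbb{R}_{\geq 0}$ constant, then

$$\frac{p T_p - T_{\mathrm{seq}}}{T_{\mathrm{seq}}} = p S^{-1} - 1 = c, \quad \text{so} \quad S = \frac{p}{c + 1},$$

which is constant when $p$ is fixed. Note that here, $E = S/p = \frac{1}{c+1}$.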
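The relation between a constant overhead ratio and the resulting speedup can be checked numerically. The sketch below builds a parallel run time from an assumed ratio $c$ and confirms that speedup and efficiency match $p/(c+1)$ and $1/(c+1)$. All names and numbers are hypothetical.

```python
def parallel_time_for_ratio(t_seq: float, p: int, c: float) -> float:
    """Return the T_p for which T_o / T_seq = c, i.e. p * T_p = (c + 1) * T_seq."""
    return (c + 1.0) * t_seq / p


# Hypothetical values: overhead equal to the sequential time (c = 1), 8 processes.
t_seq, p, c = 100.0, 8, 1.0
t_p = parallel_time_for_ratio(t_seq, p, c)

s = t_seq / t_p  # speedup, expected to equal p / (c + 1)
e = s / p        # efficiency, expected to equal 1 / (c + 1)
print(s, e)
```

Changing `t_seq` while keeping `c` and `p` fixed leaves both values unchanged, which is the weak-scaling statement above.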

Question:

How should the ratio To/Tseq behave as p → ∞, for the

As before, if $T_o / T_{\mathrm{seq}} = c$ is constant, then $S = \frac{p}{c + 1}$ and $E = S/p = \frac{1}{c+1}$, which is still constant!

Answer:

Exactly the same! Both strong and weak scalability are iso-efficiency constraints (E remains constant).

Definition (iso-efficiency)

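Let $E$ be as before. Suppose $T_o = f(T_{\mathrm{seq}}, p)$ is a known function.

Then the iso-efficiency relation is given by

$$T_{\mathrm{seq}} = \frac{1}{\frac{1}{E} - 1} f(T_{\mathrm{seq}}, p).$$

This follows from the definition of $E$:

$$E^{-1} = \frac{p T_p}{T_{\mathrm{seq}}} + 1 - \frac{T_{\mathrm{seq}}}{T_{\mathrm{seq}}} = 1 + \frac{p T_p - T_{\mathrm{seq}}}{T_{\mathrm{seq}}} = 1 + \frac{T_o}{T_{\mathrm{seq}}},$$

so $T_{\mathrm{seq}}(E^{-1} - 1) = T_o$.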
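Because $T_{\mathrm{seq}}$ appears on both sides of the iso-efficiency relation, one simple way to evaluate it is fixed-point iteration. The sketch below is only an illustration: the overhead function `example_overhead` ($p \sqrt{T_{\mathrm{seq}}}$) is a made-up model, and the iteration is only expected to converge when the overhead grows more slowly than linearly in $T_{\mathrm{seq}}$.

```python
import math
from typing import Callable


def iso_efficiency_t_seq(overhead: Callable[[float, int], float], p: int,
                         efficiency: float, t_start: float = 1.0,
                         iterations: int = 200) -> float:
    """Iterate T_seq <- f(T_seq, p) / (1/E - 1) a fixed number of times."""
    if not 0.0 < efficiency < 1.0:
        raise ValueError("efficiency must lie strictly between 0 and 1")
    factor = 1.0 / (1.0 / efficiency - 1.0)
    t_seq = t_start
    for _ in range(iterations):
        t_seq = factor * overhead(t_seq, p)
    return t_seq


def example_overhead(t_seq: float, p: int) -> float:
    """Hypothetical overhead model T_o = p * sqrt(T_seq); not from the slides."""
    return p * math.sqrt(t_seq)


# For this model the relation can be solved by hand: T_seq = (p / (1/E - 1))**2.
for p in (8, 16, 32):
    print(p, iso_efficiency_t_seq(example_overhead, p, efficiency=0.5))
```

For this particular model, doubling $p$ requires a four times larger $T_{\mathrm{seq}}$ to keep the efficiency at 0.5.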

Questions

Does this scale? Yes! (It’s again an FFT)

[Figure: execution time (logarithmic) versus log2 n. Series: measured running time, theoretical optimum (asymptotic).]

Better use speedups when investigating scalability.

[Figure: speedup versus log2 n on the HP DL980.]

In summary...

A BSP algorithm is scalable when

$$T = O(T_{\mathrm{seq}}/p + p).$$

This considers scalability of the speedup and includes parallel overhead.

It does not include memory scalability:

$$M = O(M_{\mathrm{seq}}/p + p),$$

where $M$ is the memory taken by one BSP process and $M_{\mathrm{seq}}$ the memory requirement of the best sequential algorithm.
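To get a feel for the bound $T = O(T_{\mathrm{seq}}/p + p)$, the sketch below takes the expression inside the $O$ literally, with all hidden constants set to 1. That choice is an assumption made purely for illustration. Under it, adding processes helps only up to about $p = \sqrt{T_{\mathrm{seq}}}$, after which the additive $p$ term dominates.

```python
def model_time(t_seq: float, p: int) -> float:
    """Model run time T_seq / p + p, with all constants assumed to be 1."""
    return t_seq / p + p


def best_process_count(t_seq: float, max_p: int) -> int:
    """Return the p in 1..max_p that minimises the model run time."""
    return min(range(1, max_p + 1), key=lambda p: model_time(t_seq, p))


# Hypothetical sequential cost of 10,000 time units.
t_seq = 10_000.0
p_best = best_process_count(t_seq, max_p=1000)
print(p_best, model_time(t_seq, p_best))
```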

Question:

It does indeed make sense:

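$$E^{-1} = \frac{p T_p}{T_{\mathrm{seq}}} = 1 + \frac{p T_p - T_{\mathrm{seq}}}{T_{\mathrm{seq}}} = 1 + \frac{T_o}{T_{\mathrm{seq}}}.$$

If $T_o = \Theta(p)$, then

$$E = \frac{1}{1 + p/T_{\mathrm{seq}}} = \frac{T_{\mathrm{seq}}}{T_{\mathrm{seq}} + p}.$$

Note $\lim_{p \to \infty} E = 0$ (strong scalability, Amdahl’s Law) and $\lim_{T_{\mathrm{seq}} \to \infty} E = 1$ (weak scalability). For iso-efficiency, the ratio $T_o$ and $p$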
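The two limits can be illustrated numerically with the model $E = T_{\mathrm{seq}} / (T_{\mathrm{seq}} + p)$, that is, taking $T_o = p$ exactly (constant 1, an assumption of this sketch). The function name and the values are hypothetical. Since $E = 1 / (1 + p/T_{\mathrm{seq}})$, the efficiency in this model depends only on the quotient $p / T_{\mathrm{seq}}$.

```python
def model_efficiency(t_seq: float, p: int) -> float:
    """Efficiency T_seq / (T_seq + p), assuming the overhead is exactly p."""
    return t_seq / (t_seq + p)


# Fixed problem, growing p: the efficiency drops towards 0.
print([round(model_efficiency(1000.0, p), 3) for p in (10, 100, 1000, 10000)])

# Fixed p, growing problem: the efficiency climbs towards 1.
print([round(model_efficiency(t, 64), 3) for t in (1e2, 1e4, 1e6, 1e8)])

# Scaling T_seq and p together keeps the efficiency unchanged.
print(model_efficiency(1000.0, 10), model_efficiency(2000.0, 20))
```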

Practical session

Coming Thursday, see http://people.cs.kuleuven.be/~albert-jan.yzelman/education/parco13/

Subjects:

1 Amdahl’s Law

2 Parallel Sorting

Insights from the practical session

This intuition leads to a definition of how ‘parallel’ certain algorithms are:

Definition (Parallelism)

Consider a parallel algorithm that runs in $T_p$ time. Let $T_{\mathrm{seq}}$ be the time taken by the best sequential algorithm that solves the same problem. Then the parallelism is given by

$$\frac{T_{\mathrm{seq}}}{T_\infty} = \lim_{p \to \infty} \frac{T_{\mathrm{seq}}}{T_p}.$$
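As an illustration of this definition, consider summing $n$ numbers with a balanced reduction tree. This example and its unit-cost model are assumptions of the sketch, not taken from the slides: every addition costs one time unit, the sequential algorithm needs $n - 1$ additions, and with unboundedly many processes the run time $T_\infty$ is the tree depth $\lceil \log_2 n \rceil$.

```python
def tree_sum_work(n: int) -> int:
    """Additions needed to sum n numbers sequentially (taken as T_seq)."""
    return n - 1


def tree_sum_span(n: int) -> int:
    """Depth of a balanced reduction tree over n numbers (taken as T_inf).

    For n >= 1, (n - 1).bit_length() equals ceil(log2(n)).
    """
    return (n - 1).bit_length()


def parallelism(t_seq: float, t_inf: float) -> float:
    """Return T_seq / T_inf."""
    return t_seq / t_inf


for n in (16, 1024, 2**20):
    print(n, parallelism(tree_sum_work(n), tree_sum_span(n)))
```

Under these assumptions the parallelism grows like $n / \log_2 n$, so larger inputs can keep more processes busy.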

This kind of analysis is fundamental for fine-grained parallelisation schemes.

Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. 1995. Cilk: an efficient multithreaded runtime system. SIGPLAN Not. 30, 8 (August 1995), pp. 207-216.
