Dense Matrix Algorithms

8.3 Quick Sort

8.3.2 Efficient Parallel Quicksort

The parallel quicksort formulation [87] works as follows. Let $N$ be the number of elements to be sorted and $P = 2^b$ be the number of cores available. Each core is assigned a block of $N/P$ elements, and the labels of the cores $\{1, \ldots, P\}$ define the global order of the sorted sequence. For simplicity, we assume that the initial distribution of elements in each core is uniform. The algorithm starts with all cores sorting their own set of elements (sequential quicksort). Then Core 1 broadcasts the median of its elements to each of the other cores. This median acts as the pivot for partitioning elements at all cores. Upon receiving the pivot, each core partitions its elements into those smaller than the pivot and those larger than the pivot. Next, each Core $i$, where $i \in \{1, \ldots, P/2\}$, exchanges elements with Core $i + P/2$ such that Core $i$ retains all the elements smaller than the pivot and Core $i + P/2$ retains all the elements larger than the pivot. After this step, each Core $i$ with $i \in \{1, \ldots, P/2\}$ stores elements smaller than the pivot, and the remaining cores $\{P/2 + 1, \ldots, P\}$ store elements greater than the pivot. Upon receiving the elements, each core merges them with its own set of elements so that all elements at the core remain sorted. The above procedure is performed recursively for both sets of cores, splitting the elements further. After $b$ recursions, all the elements are sorted with respect to the global ordering imposed on the cores.
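The steps above can be sketched as a sequential simulation, in which each "core" is just a Python list. This is a minimal illustration under stated assumptions, not the implementation from [87]: the function name `parallel_quicksort_sim` is hypothetical, cores are indexed from 0 rather than 1, elements equal to the pivot are kept on the lower half, and an empty leading core (a case the description does not cover) falls back to the first non-empty core of its group.

```python
from bisect import bisect_right
from heapq import merge
from statistics import median_low


def parallel_quicksort_sim(blocks):
    """Simulate the scheme on len(blocks) = 2**b cores (0-based indices here)."""
    p = len(blocks)
    if p == 0 or p & (p - 1):
        raise ValueError("number of cores must be a power of two")
    blocks = [sorted(b) for b in blocks]      # each core sorts its own block
    group = p                                 # size of the current core groups
    while group > 1:
        half = group // 2
        for start in range(0, p, group):
            # The first core of the group supplies its median as the pivot.
            # Assumption: if that core is empty, fall back to the first
            # non-empty core in the group (the text does not cover this case).
            donor = next(
                (blocks[i] for i in range(start, start + group) if blocks[i]),
                None,
            )
            if donor is None:
                continue
            pivot = median_low(donor)
            for i in range(start, start + half):
                low, high = blocks[i], blocks[i + half]
                cut_low = bisect_right(low, pivot)    # low[:cut_low] <= pivot
                cut_high = bisect_right(high, pivot)
                # Exchange and merge so both cores stay locally sorted.
                blocks[i] = list(merge(low[:cut_low], high[:cut_high]))
                blocks[i + half] = list(merge(low[cut_low:], high[cut_high:]))
        group = half                          # recurse into both halves
    return blocks


cores = parallel_quicksort_sim([[9, 3, 7], [1, 8, 2], [6, 5, 4], [12, 11, 10]])
```

Concatenating the returned blocks in core order gives the sorted sequence. The blocks generally end up with different sizes, which is why the analysis relies on the uniform-distribution assumption.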

Because all of the cores are busy all of the time, the critical path of this parallel algorithm is the execution path of any one of the cores. The total number of computation steps in the critical path is $(n/p)\log p + k_s\,(n/p)\log(n/p)$, where $k_s \approx 1.4$ is the quicksort constant. The total communication time in the critical path is $3\,(t_s + t_w\,(n/p))\sqrt{p}$.

The total energy spent on message transfers for this parallel algorithm running on $p$ cores is approximately $e_{wh}\, n \sqrt{p}$.

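Moreover, the total number of computation steps summed over all cores is approximately $k_s\, n \log n$. In summary,

\[
\begin{aligned}
\pi_{ncomp} &= \left( k_s\,\frac{n}{p}\log\frac{n}{p} + \frac{n}{p}\log p \right)\beta, &
\pi_{tcomm} &= 3\left( t_s + t_w\,\frac{n}{p} \right)\sqrt{p}, \\
S_{comp} &= k_s\, n \log n\, \beta, &
E_{comm} &= e_{wh}\, n \sqrt{p},
\end{aligned}
\]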

where $\beta$ is the number of cycles required for a single comparison.
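To make the summary expressions concrete, the following sketch evaluates them numerically. It is an illustration only: the function name `quicksort_costs` and all default parameter values are placeholders, logarithms are assumed to be base 2, and the communication-energy coefficient is read as a single constant `e_wh`.

```python
import math


def quicksort_costs(n, p, k_s=1.4, beta=1.0, t_s=1.0, t_w=1.0, e_wh=1.0):
    """Return (critical-path cycles, critical-path comm time,
    total cycles over all cores, communication energy)."""
    # Critical-path computation: local sort plus log p merge rounds.
    crit_comp = (k_s * (n / p) * math.log2(n / p) + (n / p) * math.log2(p)) * beta
    # Critical-path communication time on the mesh.
    crit_comm = 3 * (t_s + t_w * n / p) * math.sqrt(p)
    # Computation summed over all cores.
    total_comp = k_s * n * math.log2(n) * beta
    # Energy spent on message transfers.
    comm_energy = e_wh * n * math.sqrt(p)
    return crit_comp, crit_comm, total_comp, comm_energy


crit_comp, crit_comm, total_comp, comm_energy = quicksort_costs(n=1024, p=4)
```

Evaluating the function for several values of `p` at fixed `n` shows the trade-off that drives the rest of the analysis: the critical-path computation shrinks with `p`, while the communication energy grows as $\sqrt{p}$.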

Energy Scalability under Iso-performance

We now compute the energy scalability under iso-performance of the efficient parallel quicksort algorithm on a 2-dimensional mesh interconnect. We first scale the computation steps of the critical path so that the performance of the parallel algorithm matches the specified performance requirement. Taking the performance target to be the time taken by the best sequential algorithm on a single core (processor) operating at maximum frequency $F$, the new reduced frequency at which all $p$ cores should run is given by:

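\[
X = \frac{\pi_{ncomp}}{\text{Performance Target} - \pi_{tcomm}}
  = \frac{\left( k_s\,\frac{n}{p}\log\frac{n}{p} + \frac{n}{p}\log p \right)\beta}
         {n \log n\,\beta\,\frac{1}{F} - 3\left( t_s + t_w\,\frac{n}{p} \right)\sqrt{p}}
\]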

The total active time across all the cores at the new frequency is $p\, n \log n\, \beta / F$. Therefore, as per Equation 5.1, the energy consumption of the efficient parallel quicksort algorithm is given by

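\[
\begin{aligned}
E &= E_{comp} + E_{comm} + E_{leak} \\
  &= E_d\, k_s\, n \log n\, \beta\, X^2 + e_{wh}\, n \sqrt{p} + E_l\, p\, n \log n\, \beta\, \frac{X}{F}
\end{aligned}
\]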
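As a rough numerical companion to the frequency and energy expressions above, the sketch below computes $X$ and then $E$ for given $n$ and $p$. The function name `iso_performance_energy` and every default constant are illustrative assumptions, logarithms are taken base 2, and the communication-energy coefficient is read as a single constant `e_wh`. The explicit check for a non-positive denominator is an addition of this sketch: it flags configurations where communication alone exceeds the time budget.

```python
import math


def iso_performance_energy(n, p, F, k_s=1.4, beta=1.0, t_s=1.0, t_w=1.0,
                           E_d=1.0, E_l=1.0, e_wh=1.0):
    """Return (X, E): scaled frequency and total energy at iso-performance."""
    # Performance target: time budget taken from the sequential run at F.
    target = n * math.log2(n) * beta / F
    crit_comp = (k_s * (n / p) * math.log2(n / p) + (n / p) * math.log2(p)) * beta
    crit_comm = 3 * (t_s + t_w * n / p) * math.sqrt(p)
    slack = target - crit_comm          # time left for computation
    if slack <= 0:
        raise ValueError("communication alone exceeds the time budget")
    X = crit_comp / slack               # frequency that just meets the target
    e_comp = E_d * k_s * n * math.log2(n) * beta * X ** 2
    e_comm = e_wh * n * math.sqrt(p)
    e_leak = E_l * p * n * math.log2(n) * beta * X / F
    return X, e_comp + e_comm + e_leak


X, E = iso_performance_energy(n=1024, p=4, F=1.0, t_s=0.0, t_w=0.0)
```

Setting `t_s` and `t_w` to zero, as in the usage line, isolates the computation term and makes it easy to see how $X$ falls as `p` grows.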

Asymptotic Analysis: Note that if $n \gg p$, then $X \approx F/p$. Thus, the energy consumed by the efficient parallel quicksort algorithm on a 2-dimensional mesh interconnect with $p$ cores running at frequency $X$ is given by:

\[
E = E_d\, k_s\, n \log n\, \beta\, \frac{F^2}{p^2} + e_{wh}\, n \sqrt{p} + E_l\, n \log n\, \beta \tag{8.10}
\]

The optimal number of cores required for minimum energy consumption is given by

\[
p_{opt} = O\!\left( (\log n)^{2/5} \right) = O\!\left( (\log W)^{2/5} \right)
\]

Thus, the asymptotic energy scalability under iso-performance of the efficient parallel quicksort algorithm on a 2-dimensional mesh interconnect is $O((\log W)^{2/5})$. Note that $n$ must be greater than $p$ for this asymptotic result to apply.
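The exponent $2/5$ can be checked by differentiating the asymptotic energy expression (8.10) with respect to $p$ and setting the derivative to zero. The following is a sketch of that step, treating $p$ as a continuous variable and reading the communication-energy coefficient as a single constant $e_{wh}$; only the first two terms of (8.10) depend on $p$.

```latex
\begin{align*}
\frac{\partial E}{\partial p}
  &= -\,\frac{2\, E_d\, k_s\, n \log n\, \beta\, F^2}{p^{3}}
     + \frac{e_{wh}\, n}{2\sqrt{p}} = 0 \\
\Longrightarrow\quad p^{5/2}
  &= \frac{4\, E_d\, k_s\, \beta\, F^2}{e_{wh}}\, \log n \\
\Longrightarrow\quad p_{opt}
  &= \left( \frac{4\, E_d\, k_s\, \beta\, F^2}{e_{wh}} \right)^{2/5} (\log n)^{2/5}
   = O\!\left( (\log n)^{2/5} \right)
\end{align*}
```

The factor $n$ cancels from both terms, which is why the optimal core count grows only polylogarithmically with the input size.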

Energy Bounded Scalability

We now evaluate the energy bounded scalability of the efficient parallel quicksort algorithm on a 2-dimensional mesh interconnect. The total active time ($T_{active}$) across all the cores as a function of the frequency of the cores is given by

Tactive = p(πncomp 1

Using the energy model, the energy consumed by the parallel algorithm as a function of the frequency of the cores is given by

\[
E_{comp} = E_d\, k_s\, n \log n\, \beta\, X^2
\]

Here, we fix the energy budget $E$ to be the energy required by the sequential algorithm running on a single core at maximum frequency $F$ ($E_{seq}$). Given the energy budget $E_{seq}$, the frequency $X$ at which the cores should run is given by

The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is

Time Taken = 3

Asymptotic Analysis: Note that if $n \gg 2p$, then $X \approx F$. Thus, the time taken by the efficient parallel quicksort algorithm running on $p$ cores is given by:

Time Taken = 3

The optimal number of cores required for maximum performance under the energy budget is given by

\[
p_{opt} = O\!\left( (\log n)^{2} \right) = O\!\left( (\log W)^{2} \right)
\]

Thus, the asymptotic energy bounded scalability of the parallel algorithm is $O((\log W)^{2})$. Note that $n$ must be greater than $2p$ for this asymptotic result to apply.

Utility Based Scalability

We now evaluate the utility based scalability of the efficient parallel quicksort algorithm on a 2-dimensional mesh interconnect. Using the energy model, the energy consumed by the parallel algorithm as a function of the frequency of the cores is given by

\[
E_{comp} = E_d\, k_s\, n \log n\, \beta\, X^2
\]

The time taken (inverse of performance) by the parallel algorithm as a function of frequency is

T (p, X) = 3

We now frame an expression for the cost function $C(p, X)$ of the parallel algorithm using Eq. 5.5:

\[
C(p, X) = \alpha\,(E_{comp} + E_{comm} + E_{leak}) + T(p, X)
\]

Asymptotic Analysis: If $n \gg p$, the cost expression of the efficient parallel quicksort algorithm on a 2-dimensional mesh interconnect with $p$ cores at frequency $X$ can be approximated as

C(p, X) ≈ O

The optimal number of cores and the optimal frequency required for minimum cost vary with problem size as follows:

\[
p_{opt} = O\!\left( (\log W)^{6/7} \right), \qquad F_{opt} = O\!\left( (\log W)^{-2/7} \right)
\]