Checkerboard Partitioning - Matrix Transposition

Dense Matrix Algorithms

7.1 Matrix Transposition

7.1.1 Checkerboard Partitioning

In this section, we consider an n×n matrix mapped onto a logical square mesh of processors by using checkerboarding.

Assume that an n × n matrix is stored in an n × n mesh of processors so that each processor holds a single element of the matrix. To obtain the transpose, the matrix elements located below the main diagonal must move to the corresponding diametrically opposite locations above the diagonal, and vice versa. An element located below the diagonal first moves up to the diagonal, and then right to its destination processor. Similarly, an element above the diagonal moves down to the diagonal and then left to its destination processor.

Now consider the case in which the number of processors p is less than n², and the matrix is distributed among the processors by using a uniform block-checkerboard partitioning. The transpose of the entire matrix can be computed in two phases. In the first phase, the square matrix blocks are treated as indivisible units, and the two-dimensional array of blocks is transposed. In the second phase, all blocks are transposed locally within their respective processors.
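The two phases can be illustrated with a small sequential sketch. It is not the thesis's implementation: the function name `block_checkerboard_transpose` and the parameter `q` (standing for √p) are chosen here for illustration, and the "processors" are simply entries of a nested list.

```python
def block_checkerboard_transpose(matrix, q):
    """Transpose an n x n matrix viewed as a q x q grid of square blocks.

    q plays the role of sqrt(p); n must be divisible by q.
    """
    n = len(matrix)
    if n % q != 0:
        raise ValueError("n must be divisible by q")
    b = n // q  # side length of each local block

    # Split the matrix into a q x q grid of b x b blocks (one per processor).
    blocks = [[[row[c * b:(c + 1) * b] for row in matrix[r * b:(r + 1) * b]]
               for c in range(q)] for r in range(q)]

    # Phase 1: treat blocks as indivisible units and transpose the block grid.
    moved = [[blocks[c][r] for c in range(q)] for r in range(q)]

    # Phase 2: every processor transposes its own block locally.
    local = [[[list(col) for col in zip(*moved[r][c])] for c in range(q)]
             for r in range(q)]

    # Reassemble the full matrix from the block grid.
    result = []
    for r in range(q):
        for i in range(b):
            row = []
            for c in range(q):
                row.extend(local[r][c][i])
            result.append(row)
    return result
```

In a real mesh, phase 1 is the only step that needs communication; phase 2 is purely local work.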

During the communication phase, the matrix blocks initially residing on the bottom-left and top-right processors cover the longest distances to swap their locations. These paths, covering approximately 2√p links each, determine the total time spent in the communication phase. Since a block containing n²/p elements takes t_s + t_w n²/p time to move across a single link, it takes a total of 2(t_s + t_w n²/p)√p time for all the blocks to move to their final destinations in the mesh of processors. The total energy spent during the communication phase is given by

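$$E_{comm} = \sum_{i=1}^{\sqrt{p}-1} 4\,e_{wh}\left(\frac{n^2}{p}\right)(\sqrt{p}-i)\,i = 4\,e_{wh}\,\frac{n^2}{p}\cdot\frac{\sqrt{p}\,(p-1)}{6} \approx \frac{2}{3}\,e_{wh}\,n^2\sqrt{p}$$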
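The closed form of the sum can be checked numerically. The sketch below writes q for √p; the function names are hypothetical, and e_wh is treated as a plain numeric constant.

```python
def hop_weighted_sum(q):
    """Sum of (q - i) * i for i = 1 .. q-1, where q stands for sqrt(p)."""
    return sum((q - i) * i for i in range(1, q))


def closed_form(q):
    """q * (q**2 - 1) / 6, i.e. sqrt(p) * (p - 1) / 6 with p = q**2."""
    # q * (q**2 - 1) is a product of three consecutive integers,
    # so the integer division by 6 is exact.
    return q * (q * q - 1) // 6


def comm_energy(n, q, e_wh):
    """E_comm = 4 * e_wh * (n**2 / p) * hop_weighted_sum, with p = q**2."""
    return 4 * e_wh * (n * n / (q * q)) * hop_weighted_sum(q)
```

The exact value differs from the approximation (2/3) e_wh n² √p by the factor (p − 1)/p, which tends to 1 as p grows.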

In the computation phase, each processor performs approximately n²/(2p) exchanges in transposing its (n/√p) × (n/√p) local submatrix. Thus, the total number of exchanges at all processors during this phase is approximately n²/2.
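The exchange count can be seen by transposing a block in place and counting swaps. This is a minimal sketch with a hypothetical function name; for an m × m block it performs exactly m(m − 1)/2 exchanges, which approaches m²/2 = n²/(2p) when m = n/√p is large.

```python
def transpose_in_place(block):
    """Transpose a square block in place and return the number of exchanges."""
    m = len(block)
    exchanges = 0
    for i in range(m):
        # Only elements above the diagonal are swapped with their mirror images.
        for j in range(i + 1, m):
            block[i][j], block[j][i] = block[j][i], block[i][j]
            exchanges += 1
    return exchanges
```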

In summary,

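$$\pi_{ncomp} = \frac{n^2}{2p}\,\beta, \qquad \pi_{tcomm} = 2\left(t_s + t_w\,\frac{n^2}{p}\right)\sqrt{p}$$

$$S_{comp} = \frac{n^2}{2}\,\beta, \qquad E_{comm} = \frac{2}{3}\,e_{wh}\,n^2\sqrt{p}$$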

where β is the number of cycles required for a local exchange.
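For concrete numbers, the four summary quantities can be evaluated together. The sketch below is only a calculator for the formulas in this section; the function name, the dictionary keys, and the requirement that p be a perfect square are choices made here.

```python
import math


def transpose_costs(n, p, beta, t_s, t_w, e_wh):
    """Evaluate the summary quantities for an n x n matrix on p processors."""
    q = math.isqrt(p)
    if q * q != p:
        raise ValueError("p must be a perfect square for a square mesh")
    return {
        # Computation cycles on the critical path (one processor).
        "pi_ncomp": n * n * beta / (2 * p),
        # Communication time on the critical path (about 2*sqrt(p) links).
        "pi_tcomm": 2 * (t_s + t_w * n * n / p) * q,
        # Total computation cycles summed over all processors.
        "S_comp": n * n * beta / 2,
        # Approximate total communication energy.
        "E_comm": (2 / 3) * e_wh * n * n * q,
    }
```

Note that S_comp equals p times pi_ncomp, since every processor does the same amount of local work.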

Energy Scalability under Iso-performance

We now compute the energy scalability under iso-performance of the parallel matrix transposition algorithm on a 2-dimensional mesh interconnect with checkerboard partitioning. We first scale the computation steps on the critical path so that the performance of the parallel algorithm matches the specified performance requirement. Taking the performance target to be the time taken by the best sequential algorithm on a single core (processor) operating at maximum frequency F, the new reduced frequency at which all p cores should run is given by:

X = πncomp

The total active time of all the cores at the new frequency is pn²β/(2F). Therefore, by Equation 5.1, the energy consumption of the parallel transposition algorithm with checkerboard partitioning is given by

E = E_comp + E_comm + E_leak

Asymptotic Analysis: Note that if n ≫ p, then X ≈ F/p. Thus, the energy consumed by the parallel matrix transposition algorithm on a 2-dimensional mesh interconnect with checkerboard partitioning, with p cores running at frequency X, is given by:

E = Ed

The optimal number of cores required for minimum energy consumption is given by

$$p_{opt} = \left(\frac{3E_d\,\beta F^2}{e_{wh}}\right)^{2/5} = O(1)$$

Thus, the asymptotic energy scalability under iso-performance of the parallel matrix transposition algorithm on a 2-dimensional mesh interconnect with checkerboard partitioning is O(1). In other words, the optimal number of cores for minimal energy under the performance budget (corresponding to the best sequential algorithm) is constant irrespective of the problem size, so the algorithm is not scalable in this sense. Note that n must be much greater than p (n ≫ p) for this asymptotic result to apply.
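The closed form for the optimal core count can be sanity-checked numerically. The sketch below rests on an assumption that is not quoted from Equation 5.1: it takes the energy to be E_d · (n²β/2) · X² + (2/3) e_wh n² √p with X = F/p and leakage omitted, a form that is consistent with the stated optimum. All function names and parameter values are hypothetical.

```python
def assumed_energy(p, n, beta, F, E_d, e_wh):
    """Assumed energy model: dynamic computation energy plus communication.

    Assumption for illustration only: E_d * cycles * X**2 with X = F / p,
    leakage ignored.
    """
    x = F / p  # asymptotic iso-performance frequency, valid for n >> p
    computation = E_d * (n * n * beta / 2) * x * x
    communication = (2 / 3) * e_wh * n * n * p ** 0.5
    return computation + communication


def optimal_cores(beta, F, E_d, e_wh):
    """Closed form (3 * E_d * beta * F**2 / e_wh) ** (2/5); no dependence on n."""
    return (3 * E_d * beta * F ** 2 / e_wh) ** 0.4
```

Because n² multiplies both terms of the assumed energy, changing n rescales the curve without moving its minimum, which is why the optimum does not depend on the problem size.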

Energy Bounded Scalability

We now evaluate the energy bounded scalability of the parallel matrix transposition algorithm on a 2-dimensional mesh interconnect with checkerboard partitioning. The total active time (T_active) of all the cores, as a function of the frequency of the cores, is given by

Tactive = p(πncomp

Using the energy model, the energy consumed by the parallel algorithm as a function of the frequency of the cores is given by

Ecomp = Ed·n²

Here, we fix the energy budget E to be the energy required by the sequential algorithm running on a single core at maximum frequency F (E_seq). Given the energy budget E_seq, the frequency X at which the cores should run is given by

The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is

Time Taken = 2

Asymptotic Analysis: For X to have a valid solution, E − E_comm must be greater than zero, which simplifies to

$$p < \frac{3E_d\,\beta F^2}{4\,e_{wh}}$$

If n ≫ p, the time taken by the parallel algorithm decreases with p. Thus, the optimal number of cores required for maximum performance under the energy budget is given by

$$p_{opt} = \frac{3E_d\,\beta F^2}{4\,e_{wh}} = O(1)$$

Thus, the asymptotic energy bounded scalability of this parallel algorithm is O(1). In other words, optimal number of cores for maximum performance under the energy budget (corresponding to the best sequential algorithm) is constant irrespective of the problem size. Thus, the parallel algorithm is not energy bounded scalable.

Utility Based Scalability

We now evaluate the utility based scalability of the parallel matrix transposition algorithm on a 2-dimensional mesh interconnect with checkerboard partitioning. Using the energy model, the energy consumed by the parallel algorithm as a function of the frequency of the cores is given by

Ecomp = Ed·n²

The time taken (inverse of performance) by the parallel algorithm as a function of frequency is

T (p, X) = 2

We now frame an expression for the cost function C(p, X) of the parallel algorithm using Eq. 5.5:

$$C(p, X) = \alpha\,(E_{comp} + E_{comm} + E_{leak}) + T(p, X)$$

Asymptotic Analysis: Note that if n ≫ p, then the cost of the parallel matrix transposition algorithm on a 2-dimensional mesh interconnect with checkerboard partitioning, running on p cores at frequency X, can be approximated by

$$C(p, X) \approx O\!\left(n^2X^2 + n^2\sqrt{p} + \frac{n^2}{\sqrt{p}} + \frac{n^2}{pX}\right)$$

Since each term in the cost expression includes square of the input size, the optimal number of cores and frequency required for minimum cost do not vary with problem size.
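This independence from the problem size can be illustrated with a small grid search over the four asymptotic terms. The sketch assumes unit coefficients on every term, which is a simplification made here; the real constants (α, E_d, e_wh, t_w, β and so on) would change where the minimum lies but not the fact that n² factors out. The function names and the candidate grids are hypothetical.

```python
def cost_terms(p, x):
    """The four asymptotic terms with n**2 factored out (unit coefficients)."""
    return x * x + p ** 0.5 + 1 / p ** 0.5 + 1 / (p * x)


def cost(n, p, x):
    """Asymptotic cost shape: n**2 times a function of p and X only."""
    return n * n * cost_terms(p, x)


def best_configuration(n, core_counts, frequencies):
    """Return the (p, X) pair with the lowest cost among the candidates."""
    candidates = ((p, x) for p in core_counts for x in frequencies)
    return min(candidates, key=lambda c: cost(n, c[0], c[1]))
```

Calling `best_configuration` with the same candidate grids and two different values of n returns the same (p, X) pair, because n² only rescales every candidate's cost by the same factor.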

In document Towards energy-performance trade-off analysis of parallel applications (Page 72-76)