Dense Matrix Algorithms
7.1 Matrix Transposition
7.1.1 Checkerboard Partitioning
In this section, we consider an n×n matrix mapped onto a logical square mesh of processors by using checkerboarding.
Assume that an n × n matrix is stored in an n × n mesh of processors so that each processor holds a single element of the matrix. To obtain the transpose, the matrix elements located below the main diagonal must move to the diametrically opposite locations above the diagonal, and vice versa. An element located below the diagonal first moves up to the diagonal and then right to its destination processor. Similarly, an element above the diagonal moves down to the diagonal and then left to its destination processor.
Now consider the case in which the number of processors p is less than $n^2$, and the matrix is distributed among the processors by using a uniform block-checkerboard partitioning. The transpose of the entire matrix can be computed in two phases. In the first phase, the square matrix blocks are treated as indivisible units, and the two-dimensional array of blocks is transposed. In the second phase, all blocks are transposed locally within their respective processors.
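The two-phase scheme can be sketched sequentially as follows. This is a minimal illustration, not a parallel implementation: the function name `block_transpose` and the parameter `q` (the side of the processor mesh, so p = q * q) are hypothetical, and the code assumes that q divides n so that all blocks have the same size.

```python
def block_transpose(a, q):
    """Transpose an n x n matrix (list of lists) via a q x q grid of blocks.

    Sequential sketch of the two-phase scheme; q is the mesh side (p = q * q).
    """
    n = len(a)
    assert n % q == 0, "sketch assumes q divides n (uniform blocks)"
    b = n // q  # block side length, n / sqrt(p)

    # blocks[r][c] is the b x b block held by the processor at mesh position (r, c).
    blocks = [[[row[c * b:(c + 1) * b] for row in a[r * b:(r + 1) * b]]
               for c in range(q)] for r in range(q)]

    # Phase 1 (communication): transpose the grid of blocks as indivisible units.
    moved = [[blocks[c][r] for c in range(q)] for r in range(q)]

    # Phase 2 (local computation): transpose each block within its processor.
    local = [[[list(col) for col in zip(*blk)] for blk in row] for row in moved]

    # Reassemble the full matrix from the grid of blocks.
    out = []
    for r in range(q):
        for i in range(b):
            out.append([x for c in range(q) for x in local[r][c][i]])
    return out
```

For a 4 × 4 matrix and q = 2, the result agrees with the ordinary transpose of the matrix.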
During the communication phase, the matrix blocks initially residing on the bottom-left and top-right processors cover the longest distances to swap their locations. These paths, covering approximately $2\sqrt{p}$ links each, determine the total time spent in the communication phase. Since a block containing $n^2/p$ elements takes $t_s + t_w n^2/p$ time to move across a single link, it takes a total of $2(t_s + t_w n^2/p)\sqrt{p}$ time for all the blocks to move to their final destinations in the mesh of processors. The total energy spent during the communication phase is given by
\[
E_{comm} = \sum_{i=1}^{\sqrt{p}-1} 4 e_{wh} \frac{n^2}{p} (\sqrt{p} - i)\, i
= 4 e_{wh} \frac{n^2}{p} \cdot \frac{\sqrt{p}\,(p - 1)}{6}
\approx \frac{2}{3} e_{wh} n^2 \sqrt{p}
\]
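The communication time and the three forms of the communication energy can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the parameter values in the usage note are arbitrary, and p is assumed to be a perfect square.

```python
import math

def comm_time(n, p, t_s, t_w):
    """Communication-phase time 2 (t_s + t_w n^2 / p) sqrt(p)."""
    per_link = t_s + t_w * n * n / p  # time for one block to cross one link
    return 2 * per_link * math.sqrt(p)

def comm_energy_sum(n, p, e_wh):
    """Communication energy as the explicit sum over i = 1 .. sqrt(p) - 1."""
    q = math.isqrt(p)
    assert q * q == p, "sketch assumes p is a perfect square"
    return sum(4 * e_wh * (n * n / p) * (q - i) * i for i in range(1, q))

def comm_energy_closed(n, p, e_wh):
    """Closed form 4 e_wh (n^2 / p) sqrt(p) (p - 1) / 6 of the same sum."""
    q = math.isqrt(p)
    return 4 * e_wh * (n * n / p) * q * (p - 1) / 6

def comm_energy_approx(n, p, e_wh):
    """Large-p approximation (2/3) e_wh n^2 sqrt(p)."""
    return (2 / 3) * e_wh * n * n * math.sqrt(p)
```

With n = 64, p = 16 and e_wh = 1, the explicit sum and the closed form agree, and their ratio to the approximation is (p - 1)/p, which approaches 1 as p grows.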
In the computation phase, each processor performs approximately $n^2/(2p)$ exchanges in transposing its $(n/\sqrt{p}) \times (n/\sqrt{p})$ local submatrix. Thus, the total number of exchanges at all processors during this phase is approximately $n^2/2$.
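The exchange count can be illustrated with an in-place transposition of a single local block. This is a minimal sketch with a hypothetical function name; for a block of side m = n/√p it performs exactly m(m - 1)/2 exchanges, which is approximately n²/(2p) when m is large.

```python
def transpose_in_place(block):
    """Transpose a square block in place and return the number of exchanges."""
    m = len(block)
    exchanges = 0
    for i in range(m):
        for j in range(i + 1, m):
            # Swap the element above the diagonal with its mirror image below it.
            block[i][j], block[j][i] = block[j][i], block[i][j]
            exchanges += 1
    return exchanges
```

Diagonal elements stay in place, which is why the count is m(m - 1)/2 rather than m²/2.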
In summary,
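\[
\pi n_{comp} = \frac{n^2}{2p}\beta, \qquad \pi t_{comm} = 2\left(t_s + t_w \frac{n^2}{p}\right)\sqrt{p}
\]
\[
S_{comp} = \frac{n^2}{2}\beta, \qquad E_{comm} = \frac{2}{3} e_{wh} n^2 \sqrt{p}
\]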
where β is the number of cycles required for a local exchange.
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the parallel matrix transposition algorithm on a 2-dimensional mesh interconnect with checkerboard partitioning. We first scale the computation steps of the critical path so that the performance of the parallel algorithm matches the specified performance requirement. Taking the performance target to be the time taken by the best sequential algorithm on a single core (processor) operating at maximum frequency F, the new reduced frequency at which all p cores should run is given by:
X = πncomp
The total active time at all the cores at the new frequency is given by $p n^2 \beta/(2F)$. Therefore, the expression for the energy consumption of the parallel transposition algorithm with checkerboard partitioning, as per equation 5.1, is given by
\[
E = E_{comp} + E_{comm} + E_{leak}
\]
Asymptotic Analysis: Note that if $n \gg p$, then $X \approx F/p$. Thus, the energy consumed by the parallel matrix transposition algorithm on a 2-dimensional mesh interconnect with checkerboard partitioning, with p cores running at frequency X, is given by:
E = Ed
The optimal number of cores required for minimum energy consumption is given by
\[
p_{opt} = \left(\frac{3 E_d \beta F^2}{e_{wh}}\right)^{2/5} = O(1)
\]
Thus, the asymptotic energy scalability under iso-performance of the parallel matrix transposition algorithm on a 2-dimensional mesh interconnect with checkerboard partitioning is O(1). In other words, the optimal number of cores for minimal energy under the performance budget (corresponding to the best sequential algorithm) is constant irrespective of the problem size, so the algorithm is not scalable in this sense. Note that n should be greater than p for this asymptotic result to apply.
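The stated optimum can be recovered by a short calculation. The sketch below assumes that the computation energy has the form $E_d \cdot S_{comp} \cdot X^2$ with $X \approx F/p$ and that the leakage energy is negligible. Both are assumptions made here for illustration, chosen because they reproduce the $p_{opt}$ stated above; they are not taken from the energy model itself.

```latex
% Requires amsmath. Assumes E_comp = E_d * S_comp * X^2 and negligible leakage.
\begin{align*}
E(p) &\approx E_d \, \frac{n^2 \beta}{2} \left(\frac{F}{p}\right)^2
        + \frac{2}{3} e_{wh} n^2 \sqrt{p} \\
\frac{dE}{dp} &= -\frac{E_d \, n^2 \beta F^2}{p^3}
        + \frac{e_{wh} n^2}{3 \sqrt{p}} = 0 \\
p^{5/2} &= \frac{3 E_d \beta F^2}{e_{wh}}
\quad\Longrightarrow\quad
p_{opt} = \left(\frac{3 E_d \beta F^2}{e_{wh}}\right)^{2/5}
\end{align*}
```

The factor $n^2$ multiplies both terms and cancels when the derivative is set to zero, which is why the optimum does not depend on the problem size.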
Energy Bounded Scalability
We now evaluate the energy bounded scalability of the parallel matrix transposition algorithm on a 2-dimensional mesh interconnect with checkerboard partitioning. The total active time (Tactive) at all the cores as a function of the frequency of the cores is given by
Tactive = p(πncomp
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using the energy model, is given by
Ecomp = Ed·n2
Here, we fix the energy budget E to be that of the energy required for the sequential algorithm, running on a single core at maximum frequency F (Eseq). Given the energy budget Eseq, the frequency X with which the cores should run is given by
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
Time Taken = 2
Asymptotic Analysis: For X to have a valid solution, $E - E_{comm}$ must be greater than zero, which simplifies to
\[
p < \left(\frac{3 E_d \beta F^2}{4 e_{wh}}\right)^2
\]
If $n \gg p$, the time taken by the parallel algorithm decreases with p. Thus, the optimal number of cores required for maximum performance under the energy budget is given by
\[
p_{opt} = \left(\frac{3 E_d \beta F^2}{4 e_{wh}}\right)^2 = O(1)
\]
Thus, the asymptotic energy bounded scalability of this parallel algorithm is O(1). In other words, optimal number of cores for maximum performance under the energy budget (corresponding to the best sequential algorithm) is constant irrespective of the problem size. Thus, the parallel algorithm is not energy bounded scalable.
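The bound on p can be checked numerically. The sketch below assumes that the sequential energy budget is $E_d (n^2 \beta/2) F^2$ with leakage neglected. That form is an assumption made here because it reproduces the bound stated above, and the function names and parameter values are hypothetical.

```python
import math

def seq_energy(n, e_d, beta, f):
    """Assumed sequential budget E_d * (n^2 beta / 2) * F^2 (leakage neglected)."""
    return e_d * (n * n * beta / 2) * f ** 2

def comm_energy(n, p, e_wh):
    """Communication energy (2/3) e_wh n^2 sqrt(p)."""
    return (2 / 3) * e_wh * n * n * math.sqrt(p)

def core_limit(e_d, beta, f, e_wh):
    """Upper bound (3 E_d beta F^2 / (4 e_wh))^2 on the number of cores."""
    return (3 * e_d * beta * f ** 2 / (4 * e_wh)) ** 2

def budget_left(n, p, e_d, beta, f, e_wh):
    """Energy remaining for computation after paying for communication."""
    return seq_energy(n, e_d, beta, f) - comm_energy(n, p, e_wh)
```

With the arbitrary values E_d = 1, β = 2, F = 2 and e_wh = 1 the limit is 36 cores: the remaining budget is positive for p = 25 and negative for p = 49, and this holds for any n because $n^2$ scales both energies equally.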
Utility Based Scalability
We now evaluate the utility based scalability of the parallel matrix transposition algorithm on a 2-dimensional mesh interconnect with checkerboard partitioning. The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using the energy model, is given by
Ecomp = Ed·n2
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
T (p, X) = 2
We now frame an expression for the cost function C(p, X) of the parallel algorithm using Eq. 5.5:
\[
C(p, X) = \alpha \left(E_{comp} + E_{comm} + E_{leak}\right) + T(p, X)
\]
Asymptotic Analysis: Note that if $n \gg p$, then the cost expression of the parallel matrix transposition algorithm on a 2-dimensional mesh interconnect with checkerboard partitioning, running on p cores at frequency X, can be approximated as
\[
C(p, X) \approx O\left(n^2 X^2 + n^2 \sqrt{p} + \frac{n^2}{\sqrt{p}} + \frac{n^2}{pX}\right)
\]
Since each term in the cost expression includes square of the input size, the optimal number of cores and frequency required for minimum cost do not vary with problem size.
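This independence from the problem size can be illustrated with a small grid search. The sketch below replaces every constant in the cost expression with 1, which is purely an assumption for illustration, so the particular minimizer it finds has no physical meaning. The function names, the grid of mesh sides, and the frequency steps are likewise hypothetical.

```python
import math

def cost_shape(n, p, x):
    """Asymptotic cost shape with all constants set to 1 (illustrative assumption)."""
    return n * n * (x * x + math.sqrt(p) + 1 / math.sqrt(p) + 1 / (p * x))

def best_configuration(n, max_side=32, steps=100):
    """Grid search over perfect-square p and normalized frequency x in (0, 1]."""
    candidates = [(q * q, k / steps)
                  for q in range(1, max_side + 1)
                  for k in range(1, steps + 1)]
    return min(candidates, key=lambda c: cost_shape(n, c[0], c[1]))
```

Calling `best_configuration(64)` and `best_configuration(4096)` returns the same (p, x) pair, since $n^2$ is a common factor of every term and only rescales the cost.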