Dense Matrix Algorithms
7.2 Matrix-Vector Multiplication
7.2.1 Rowwise Striping
We first describe the case in which the n × n matrix is striped among n processors so that each processor store one complete row of the matrix. The n × 1 vector x is distributed such that each processor stores one of its elements.
Processor Piinitially stores x[i] and A[i, 0], A[i, 1], ..., A[i, n − 1] and is responsible for computing y[i]. Vector x is multiplied with each row of the matrix; hence every processors needs the entire vector. Since each processor starts with only one element of x, an all-to-all broadcast is required to distribute all the elements to all the processors. After the vector x is distributed among the processors processor Picomputes y[i] = Σn−1j=0(A[i, j] × x[j]). The result vector y is stored exactly the way the starting vector x was stored.
Consider the case in which p processors are used such p < n, and the matrix is partitioned among the processors by using the block stripping. Each processor initially stores n/p complete rows of the matrix and a portion of the vector x of size n/p. Since the vector x has to be multiplied with each row of the matrix, every processor needs the entire vector. This again requires an all-to-all broadcast. The all-to-all broadcast takes place among the p processors and involves messages of size n/p. As derived earlier in 3.2.3, the total time and energy spent during this phase with message size m = n/p are 2ts(√
p − 1) + twn and ewhn(p − 1) respectively. After this communication step, each processor multiples its n/p rows with the vector x to produce n/p elements of the result vector. Each processor performs n2/p multiplications and additions during the computation phase. In summary,
πncomp = n2 pβ πtcomm = (2ts(√
p − 1) + twn) Scomp = n2β
Ecomm = ewhn(p − 1)
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of parallel matrix-vector multiplication algorithm on 2-dimensional mesh interconnect with striped partitioning. We first Scale the computation steps of critical path so that the
performance of parallel algorithm matches the specified performance requirement. Assuming the performance target to be the time taken by the best sequential algorithm on a single core (processor) operating at maximum frequency F , the new reduced frequency at which all p cores should run is given by:
X = πncomp
Performance Target − πtcomm
=
n2 p
n2βF1 − 2ts(√
p − 1) + twn
The total active time at all the cores at new frequency is given by pn2β/F . Therefore, expression for energy consumption of the parallel matrix-vector multiplication algorithm with striped partitioning as per equation 5.1 is given by
E = Ecomp+ Ecomm+ Eleak
= Edn2βX2+ ewhn(p − 1) + El
pn2β
F X
Asymptotic Analysis: Note that, If n p then X ≈ F/p. Thus, the energy consumed by the parallel matrix-vector multiplication algorithm on 2-dimensional mesh interconnect (with stripped partitioning running) with p cores running at frequency X is given by:
E = Edn2βF2
p2 + ewhn(p − 1) + Eln2β (7.4)
The optimal number of cores required for minimum energy consumption is given by
popt = 2EdβF2n ewh
1/3
= O(n13)
= O(W16)
Thus, the asymptotic energy scalability under iso-performance of the parallel matrix-vector multiplication algorithm on 2-dimensional mesh interconnect with stripped partitioning running is O(W1/6). Note that, n should be greater than p for this asymptotic result to apply.
Energy Bounded Scalability
We now evaluate the energy bounded scalability of the parallel matrix-vector multiplication algorithm on 2-dimensional mesh interconnect with row wise partitioning. The total active time (Tactive) at all the cores as a function of the fre-quency of the cores is given by
Tactive = p(πncomp 1
X + πtcomm)
= p n2 pβ 1
X + 2ts(√
p − 1) + twn
Expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using the energy model is given by
Ecomp = Edn2βX2 Eleak = El· Tactive· X
= Elp n2
pβ + (2ts(√
p − 1) + twn) X
Here, we fix the energy budget E to be that of the energy required for the sequential algorithm, running on a single core at maximum frequency F (Eseq). Given the energy budget Eseq, the frequency X with which the cores should run is given by
X = −Elpπtcomm+pEl2p2πtcomm2 + 4EdScomp(E − Ecomm− Elpπncomp) 2EdScomp
= −Elp 2ts(√
p − 1) + twn +q
El2p2 2ts(√
p − 1) + twn2
+ 4Edn2β(Edn2βF2− ewhn(p − 1) − Eln2β) 2Edn2β
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
Time Taken = (2ts(√
p − 1) + twn) +n2 p β 1
X (7.5)
Asymptotic Analysis: Note that, If n p, then X ≈ F . Thus, the time taken by the parallel matrix-vector multiplication algorithm running on p cores is given by:
Time Taken = (2ts(√
p − 1) + twn) +n2 pβ1
F
= O√
p + n +n2 p
The optimal number of cores required for maximum performance under the energy budget is given by
popt = O n43
= O
W23
Thus, the asymptotic energy bounded scalability of the parallel algorithm is O(W23). Note that, n should be greater than p for this asymptotic result to apply.
Utility Based Scalability
We now evaluate the utility based scalability of the parallel matrix-vector multiplication algorithm on 2-dimensional mesh interconnect with row wise partitioning. Expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using the energy model is given by
Ecomp = Edn2βX2 Eleak = El· Tactive· X
= Elp n2
pβ + (2ts(√
p − 1) + twn) X
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
T (p, X) = (2ts(√
p − 1) + twn) +n2 pβ 1
X (7.6)
we now frame an expression for the cost function C(p, X) of the parallel algorithm using Eq. 5.5:
C(p, X) = α(Ecomp+ Ecomm+ Eleak) + T (p, X)
= α(Edn2βX2+ ewhn(p − 1)) + α(Elp n2
pβ + (2ts(√
p − 1) + twn) X
) + (2ts(√
p − 1) + twn) +n2 p β 1
X
Asymptotic Analysis: Note that, If n p the cost expression of the parallel matrix-vector multiplication algorithm on 2-dimensional mesh interconnect with row wise partitioning running on p cores at frequency X can be approximated
as
The optimal number of cores and frequency required for minimum cost varies with problem size as follows
popt = O
In this section, we analyze parallel matrix-vector multiplication for the case in which the matrix is distributed among the processors by using a block-checkerboard partitioning. Consider a 2-dimensional mesh of p processors in which each processor stores an (n/√
p)times(n/√
p) block of the matrix. The vector is distributed in portions of n/√ p elements in the last column of processors only. The entire vector must be distributed on each row of processors before the multiplication can be performed. First the vector is aligned along the main diagonal. For this, each processor in the rightmost column sends its n/√
p vector elements to the diagonal processors in its row. Then a columnwise one-to-all broadcast of these n/√
p elements takes place. Each processor then performs n2/p multiplications and locally adds the n/√
p sets of products. At the end of this step, each processor has n/√
p partial sums that must be accumulated along each row to obtain the result vector. Hence, the last step of the algorithm is a single node accumulation of the n/√
p values in each row, with the rightmost processor of the row as the destination.
The first step of sending a message of size n/√
p from the rightmost processor of a row to the diagonal processor takes the maximum time of ts+ twn/√
p + th
√p for the first row of processors in the mesh. Furthermore, energy spent during this steps is (1/2)ewhn(√
p + 1). We can perform the columnwise one-to-all broadcast in approximately (ts+ twn/√
p) log(√ p) + th
√p time and (1/4)ewhn√
p log penergy by using the procedure described in section ref-patterns for a ring with cut-through routing. Ignoring the time to perform additions, the final row wise single-node accumulation takes approximately (ts+ twn/√
p) log(√
p) + th√
p time and (1/4)ewhn√
p log p energy for com-munication. Moreover, each processor performs approximately n2/p multiplication and additions in the computation phase. In summary,
πncomp = n2 pβ
πtcomm = ts+ twn/√
We now compute the energy scalability under iso-performance of parallel matrix-vector multiplication algorithm on 2-dimensional mesh interconnect with checkerboard partitioning. We first Scale the computation steps of critical path so that the performance of the parallel algorithm matches the specified performance requirement. Assuming the performance target to be the time taken by the best sequential algorithm on a single core (processor) operating at maximum frequency F , the new reduced frequency at which all p cores should run is given by:
X = πncomp
The total active time at all the cores at new frequency is given by pn2β/F . Therefore, expression for energy consumption of the parallel matrix-vector multiplication algorithm with checkerboard partitioning as per equation 5.1 is given by
Asymptotic Analysis: Note that, If n p then X ≈ F/p. Thus, the energy consumed by the parallel matrix-vector multiplication algorithm on 2-dimensional mesh interconnect (with checkerboard partitioning running) with p cores running at frequency X is given by:
E = Edn2βF2 p2 +1
2ewhn√
p log p + Eln2β (7.7)
The optimal number of cores required for minimum energy consumption is given by
Note that, n should be greater than p for this asymptotic result to apply.
Energy Bounded Scalability
We now evaluate the energy bounded scalability of the parallel matrix-vector multiplication algorithm on 2-dimensional mesh interconnect with checkerboard partitioning. The total active time (Tactive) at all the cores as a function of the frequency of the cores is given by
Tactive = p(πncomp
Expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using the energy model is given by
Ecomp = Edn2βX2
Here, we fix the energy budget E to be that of the energy required for the sequential algorithm, running on a single core at maximum frequency F (Eseq). Given the energy budget Eseq, the frequency X with which the cores should run is given by
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
Asymptotic Analysis: Note that, If n p log p, then X ≈ F . Thus, the time taken by the parallel matrix-vector multiplication algorithm running on p cores is given by:
Time Taken =
The optimal number of cores required for maximum performance under the energy budget is given by
popt = O n43
= O
W23
Thus, the asymptotic energy bounded scalability of the parallel algorithm is O(W23). Note that, n should be greater than p log p for this asymptotic result to apply.
Utility Based Scalability
We now evaluate the utility based scalability of the parallel matrix-vector multiplication algorithm on 2-dimensional mesh interconnect with checkerboard partitioning. Expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using the energy model is given by
Ecomp = Edn2βX2
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
T (p, X) =
we now frame an expression for the cost function C(p, X) of the parallel algorithm using Eq. 5.5:
Asymptotic Analysis: Note that, If n p log p the cost expression of the parallel matrix-vector multiplication algorithm on 2-dimensional mesh interconnect with checkerboard partitioning running on p cores at frequency X can be approximated as
The optimal number of cores and frequency required for minimum cost varies with problem size as follows
popt = O
In this section we analyze parallel algorithms for multiplying two n × n dense, square matrices A and B to yield the product matrix C = A × B. The sequential algorithm for matrix computation requires n3multiplication and addition pairs. Thus, the problem size W = O(n3).
Consider two n×n matrices A and B partitioned into p blocks Ai,jand Bi,j(o ≤ i, j <√
p) of size n/√ p×n/√
p each. These blocks are mapped onto a √
p ×√
p mesh of processors. The processors are labeled from P0,0 to P√p−1,√p−1. Processor Pi,jinitially stores Ai,jand Bi,j and computes block Ci,jof the result matrix. Computing the submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √
p. To acquire all the required blocks, an all-to-all broadcast of matrix A’s blocks is performed in each row of processors, and an all-to-all broadcast of matrix