A High Performance Two Dimensional Scalable Parallel
Algorithm for Solving Sparse Triangular Systems
Mahesh V. Joshi
yAnshul Gupta
zGeorge Karypis
yVipin Kumar
yAbstract
Solving a system of equations of the form Tx=y,
whereT is a sparse triangular matrix, is required after
the factorization phase in the direct methods of solving systems of linear equations. A few parallel formula-tions have been proposed recently. The common belief in parallelizing this problem is that the parallel
formu-lation utilizing a two dimensional distribution of T is
unscalable. In this paper, we propose the rst known ecient scalable parallel algorithm which uses a two
dimensional block cyclic distribution of T. The
algo-rithm is shown to be applicable to dense as well as sparse triangular solvers. Since most of the known highly scalable algorithms employed in the
factoriza-tion phase yield a two dimensional distribufactoriza-tion of T,
our algorithm avoids the redistribution cost incurred by the one dimensional algorithms. We present the paral-lel runtime and scalability analyses of the proposed two dimensional algorithm. The dense triangular solver is shown to be scalable. The sparse triangular solver is shown to be at least as scalable as the dense solver. We also show that it is optimal for one class of sparse systems. The experimental results of the sparse tri-angular solver show that it has good speedup charac-teristics and yields high performance for a variety of sparse systems.
1 Introduction
Many direct and indirect methods of solving large sparse linear systems, need to compute the solution
to the systems of the form Tx = y, where T is a
sparse triangular matrix. The most frequent
This work was supported by NSF CCR-9423082, by
Army Research Oce contract DA/DAAH04-95-1-0538, by Army High Performance Computing Research Center coop-erative agreement number DAAH04-95-2-0003/contract num-ber DAAH04-95-C-0008, by the IBM Partnership Award, and by the IBM SUR equipment grant. Access to computing fa-cilities was provided by AHPCRC, Minnesota Supercomputer Institute. Related papers are available via WWW at URL:
http://www.cs.umn.edu/~kumar.
yDepartment of Computer Science, University of Minnesota,
Minneapolis, MN 55455.
zP.O.Box 218, IBM Thomas J. Watson Research Center,
Yorktown Heights, NY 10598.
tion is found in the domain of direct methods, which are based on the factorization of a matrix into
trian-gular factors using a scheme likeLU,LLT (Cholesky)
orLDLT. Most of the research in the domain of direct
methods has been concentrated on developing fast and scalable algorithms for factorization, since it is compu-tationally the most expensive phase of a direct solver. Recently some very highly scalable parallel algorithms have been proposed for sparse matrix factorizations [1], which require the matrix to be distributed among processors using a two dimensional mapping. This results in a two dimensional distribution of the trian-gular factor matrices used by the triantrian-gular solvers to get the nal solution.
With the emergence of powerful parallel comput-ers and the need of solving very large problems, com-pletely parallel direct solvers are rapidly emerging [2, 4]. There is a need to nd ecient and scalable parallel algorithms for triangular solvers, so that they do not form a performance bottleneck when the direct solver is used to solve large problems on large num-ber of processors. A few attempts have been made recently to formulate such algorithms [3, 4, 6]. Most of the attempts until now, rely on redistributing the factor matrix from a two dimensional mapping to a one dimensional mapping. This is because it was be-lieved till now, that the parallel formulations based on two dimensional mapping are unscalable [3]. But, as we show in this paper, even a simple two dimensional block cyclic mapping of data can be utilized by our parallel algorithm to achieve as much scalability as achieved by the algorithms based on one dimensional mapping.
The challenge in formulating a scalable parallel al-gorithm lies in the sequential data dependency of the computation and relatively small amount of compu-tation to be distributed among processors. We show that, the data dependency in fact exhibits concurrency in both the dimensions and an appropriate two dimen-sional mapping of the processors can achieve a scalable formulation. We elucidate this further in Section 2 where we describe our two dimensional parallel algo-rithm for a dense trapezoidal solver.
Scalable and ecient formulation of the parallel al-gorithms for solving large sparse triangular systems is more challenging than the dense case. This is mainly because of the inherently unstructured computations. We present, in section 3, an ecient parallel mul-tifrontal algorithm for solving sparse triangular sys-tems. It traverses the elimination tree of the given system and employs the dense trapezoidal algorithm at various levels of the elimination tree.
In Section 4, we analyze the proposed parallel algo-rithms for their parallel runtime and scalability. We rst show that the parallel dense trapezoidal solver is scalable. Analyzing a general sparse triangular solver is a dicult problem. Hence we present the analysis for two dierent classes of sparse systems. For these classes, we show that the sparse triangular solver is at least as scalable as the dense triangular solver. Al-though it is not as scalable as the best factorization algorithm based on the same data mapping, its paral-lel runtime is asymptotically of lower order than that of the factorization phase. This makes a strong case for the utilization of our algorithm in a parallel direct solver.
We have implemented a sparse triangular solver based on the proposed algorithm and it is a part of the recently announced high performance direct solver [2]. The experimental results presented in the sec-tion 5 show the good speedup characteristics and a high performance on a variety of sparse matrices. Our algorithm for sparse backward substitution achieves a performance of 4.575 GFLOPS on 128 processors of an IBM SP2 for solving a system with multiple
num-ber of right hand sides (nrhs = 16), for which the
single processor algorithm performed at around 200
MFLOPS. For nrhs = 1, it achieved 1630 MFLOPS
on 128 processors, whereas the single processor algo-rithm performed at around 70 MFLOPS. To the best of our knowledge, these are the best performance and speedup numbers reported till now, for solving sparse triangular systems.
2 Two Dimensional Scalable
Formula-tion
In this section, we build a framework for the two di-mensional scalable formulation by analyzing the data dependency and describing the algorithm to extract concurrency.
Consider a system of linear equations of the form
Tx=b, whereT = [T1T2] T;x= [x 1x2] T;b= [b 1b2] T;
andT1is a lower triangular matrix. We dene the
pro-cess of computing the solution vectorx1that satises
T1x1 =b1 followed by the computation of the update
vector x2 using x2 =b2 ,T 2x, as a dense trapezoidal n m b 1 22 22 23 24 24 23 22 22 23 21 21 20 20 20 20 19 19 2 3 4 5 6 7 8 9 11 12 13 14 15 16 17 18 19 17 3 4 4 5 5 5 6 6 6 7 7 7 8 8 9 9 10 9 10 11 12 12 12 13 13 13 14 14 15 15 18 18 18 17 16 15 11 11 2 4 17 16 19 21 14 10 8 3 P0 P1 P2 P3 v h blks blks (c) solution flow update flow (a) solution flow flow update (b)
Figure 1: (a). Data Dependency in Dense
Trape-zoidal Forward Elimination. (b). Data Dependency
in Dense Trapezoidal Backward Substitution. (c).
Two Dimensional Parallel Dense Trapezoidal Forward Elimination.
forward elimination. We refer to b1 as a right hand
side vector. The data dependency structure of this
process is shown in Figure 1(a). WhenT = [TT
1 T T 2 ],
we call the process of computing the update vector
z=b1 ,T
T 2 x
2 followed by the solution of T
T 1x
1 =z
a dense trapezoidal backward substitution. The data dependency structure of this process is shown in Fig-ure 1(b). As can be seen from FigFig-ures 1(a) and 1(b), the data dependency structures of two processes are identical. Henceforth, we describe the design and anal-ysis only for the dense trapezoidal forward elimination. Also, we assume that the system has single right hand side, but the discussion is easily extensible to multiple right hand sides.
Examine the data dependency structure of Fig-ure 1(a). It can be observed that after the solution is computed for the diagonal element, the computations of updates due to the elements in that column can proceed independently. Similarly the computations of the updates due to the elements in a given row can proceed independently before all of them are applied at the diagonal element and then the solution can be computed for that row. Thus, the available concur-rency is identical in both the dimensions. The com-putation in each of the dimensions can be pipelined to propagate updates horizontally and the computed solutions vertically.
For the parallel formulation, consider the two di-mensional block and cyclic approaches of mapping processors to the elements. The two dimensional block mapping will not give a scalable formulation as the processors owning the blocks towards the lower right
hand corner of the matrix will be idle for more time as the size of the matrix increases. But, the two dimen-sional cyclic mapping solves this problem by localizing the pipeline to a xed portion of the matrix indepen-dent of the size of the matrix. This is a very desirable property for achieving scalability. The unit grain of computation for each processor in a cyclic mapping is too small to get good performance out of most of today's multiprocessor parallel computers. Hence the cyclic mapping can be extended to a block-cyclic map-ping to increase the grain of computation.
We now describe the parallel algorithm for dense trapezoidal forward elimination. Consider a block
ver-sion of the equations. Let T be as shown in
Fig-ure 1(c). We distribute T in the block cyclic
fash-ion using a two dimensfash-ional grid of processors. The
blocks of b1 reside on the processors owning the
diag-onal blocks of T. Each processor traverses the blocks
assigned to it in a predetermined order. The Figure shows the time step in which each block is processed for the column major order. The Figure also shows the shaded region, where the vertical and horizontal pipelines are operated to propagate computed solu-tions and accumulated updates. The height and the width of this region are dened by the processor grid dimensions. The processing applied to a block is deter-mined by its location in the matrix. For the triangular blocks on the diagonal (except for those in the rst col-umn), the processor receives the accumulated updates in that block-row, from its left neighbor and applies them to the right hand side vector. Then it computes the solution vector and sends it to its neighbor below. The processor accumulates the update contributions due to those blocks in a block-row which are not in the shaded region. For the rectangular blocks in the shaded region, the processor receives the solution vec-tor from the top neighbor. It also receives the update vector from the left neighbor except when the block lies on the boundary of the shaded region. Then it adds the update contribution due to the block being processed and sends the accumulated update vector to its right neighbor. In the end, the solution vector is distributed among the diagonal processors and the update vector is distributed among the last column processors.
3 Parallel Algorithm for Sparse
Trian-gular Solver
Solving a triangular system Tx = y eciently in
parallel is challenging when T is sparse. In this
sec-tion, we propose a parallel multifrontal algorithm for solving such systems.
We rst give the description of the serial multi-0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 16 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 6 7 8 15 16 17 0 1 3 4 9 10 12 13 2 5 11 14 18 : Supernode : Node : Subtree
Figure 2: (a) An Example Sparse Matrix. Lower Tri-angular part shows the ll-ins ("o") that occur during the factorization process. (b) Supernodal Tree for this matrix.
frontal algorithm for forward elimination which closely follows the algorithm described in [6]. The algorithm
is guided by the structure of elimination tree ofT. We
refer to a collection of consecutive nodes in an elimi-nation tree each with only one child, as a supernode. The nodes in a supernode are collapsed to form the su-pernodal tree. An example matrix and its associated supernodal tree as shown in Figure 2. The columns
of T corresponding to the the individual nodes in a
supernode, form the supernodal matrix. It can be eas-ily veried that the supernodal matrix is dense and trapezoidal. Given two vectors and their correspond-ing sets of indices, an extend-add operation is dened as extending each of the vectors to the union of two index sets by lling in zeros if needed, followed by adding them up index-wise.
The serial multifrontal algorithm for forward elim-ination does a postorder traversal of the supernodal tree associated with the given triangular matrix. At each supernode, the update vectors resulting from the dense supernodal forward elimination of its children nodes are collected. They are extend-added together and the resulting vector is subtracted from the right hand side vector corresponding to the row indices of the supernodal matrix. Then the process of dense trapezoidal forward elimination as dened in Section 2 is applied.
The parallel multifrontal algorithm was developed keeping in mind that our sparse triangular solver is a part of a parallel direct solver described in [2]. Our triangular solver uses the same distribution of the
fac-tor matrix, T, as used by the highly scalable
algo-rithm employed in the numerical factorization phase [1]. Thus, there is no cost incurred in redistributing
the data. Distribution of T is described in the next
paragraph.
the top logplevels 1, where p is the number of
pro-cessors used to solve the problem. The portions of this binary supernodal tree are assigned to processors using a subtree-to-subcube strategy illustrated in Fig-ure 3, where eight processors are used to solve the example matrix of Figure 2. The processor assigned to a given element of a supernodal matrix is deter-mined by the row and column indices of the element and a bitmask determined by the depth of the supern-ode in the supernodal tree [1]. This method achieves a two dimensional block cyclic distribution of the su-pernodal matrix among the logical grid of processors in the subcube as shown in Figure 3.
In the parallel multifrontal algorithm for forward elimination, each processor traverses the part of the supernodal tree assigned to it in a bottom up fashion. The subtree at the lowest level is solved using the serial multifrontal algorithm. After that at each supernode, a parallel extend-add operation is done followed by the parallel dense trapezoidal forward elimination. In the parallel extend-add operation, the processors own-ing the rst block-column of the distributed supern-odal matrix at the parent supernode collect the update vectors resulting from the trapezoidal forward elimina-tions of its children supernodes. The bitmask based data distribution ensures that this can be achieved with at most two steps of point-to-point communica-tions among pairs of processors. Then each processor performs a serial extend-add operation on the update vectors it receives, the result of which is used as an ini-tial update vector in the dense algorithm described in Section 2. The process continues until the topmost su-pernode is reached. The entire process is illustrated in Figure 3, where the processors communicating in the parallel extend-add operation at each level are shown. In the parallel multifrontal backward substitution, each processor traverses its part of the supernodal tree in a top down fashion. At each supernode in the top
logplevels, a parallel dense trapezoidal backward
sub-stitution algorithm is applied. The solution vector computed at the parent supernode propagates to the child supernodes. This is done with at most one step of point-to-point communication among pairs of pro-cessors from the two subcubes of the children. Each processor adjusts the indices of the received solution vector to correspond to the row indices of its child supernode.
1The separator tree obtained from the recursive nested
dis-section based ordering algorithms yields such a binary tree.
4 Parallel Runtime and Scalability
Analysis
We rst derive an expression for the runtime of the parallel dense trapezoidal forward elimination algo-rithm. We present a conservative analysis assuming a naive implementation of the algorithm. We consider the case where the processors compute the blocks in a column major order. Same results hold true for the row major order of computation.
Denetc as the unit time for computation. Lettw
be the unit time for communication, assuming a model where communication time is proportional to the size of the data, which holds true for most of today's paral-lel processors when a moderate size of the data is
com-municated. Consider a block trapezoidal matrix,T, of
dimensionsmnas shown in Figure 1(c). The
block-size isb. Let the problem be solved using a square grid
ofqprocessors (h=v=p
q). We assume that bothm
andn are divisible byp
q. For the purpose of
analy-sis, we also assume that all the processors synchronize after each processor nishes processing one column as-signed to it. The actual implementation does not use
an explicit synchronization. The matrix T now
con-sists ofn=bp
qvertical strips of widthp
q. We number
these strips from left to right starting from 1. The
parallel runtime for ith vertical strip consists of two
components. The pipelined component (in the shaded
region of the strip), takes (p
q)(btw+b 2t
c) time.
The other component is the time needed for each
pro-cessor to compute update vectors for the (m=bp
q,i)
blocks, each of which takes (b2t
c) time. In the end,
we need to add the time required for the last vertical
strip, where (m,n)=(b
p
q) horizontal pipelines
oper-ate to propagoper-ate the updoper-ates to the processors in the last block-column. After summing up all the terms for all the vertical strips and simplifying, the
paral-lel runtime, Tp, can be expressed asymptotically as,
Tp= 1=p(mn ,n
2=2)t
c+ (m+n)tw+ (n)tc.
The overhead due to communication and pipelining
can be obtained asTo=qTp
,T
s, whereTs=Wtc is
the serial runtime. W is size of the problem expressed
in terms of number of computational steps needed to solve the problem on a uniprocessor machine. For
the trapezoidal forward elimination problem, W is
(mn,n
2=2). Thus, T
o= ((m+n)q)tw+ (nq)tc.
Eciency of the parallel system is expressed as E =
Ts=(Ts+To). The scalability is expressed in terms of
an isoeciency function, which is dened as the rate
at whichW has to increase when the number of
pro-cessors are increased in order to maintain a constant eciency. The system is more scalable, when this rate
4 6 5 18 3 3 3 3 4 10 11 15 16 5 5 5 5 10 12 14 17 18 6 6 6 6 6 6 6 6 12 3 5 8 18 2 2 2 2 3 1 2 6 7 1 1 1 1 1 0 7 8 2 0 0 0 0 13 14 15 18 7 7 7 7 13 9 11 16 17 4 4 4 4 9 2 2 0 3 5 7 6 7 0 18 18 0 P P1 P2 P3 P4 P5 P6 0 1 P P 6 7
T : Dense Supernodal Matrix
4 5 P 2 3 P 7 P 8 7 6 2 3 9 16 17 15 18 5 11 14 0 T T T T T T T T 0 2 3 1 0 0 0 1 0 6 7 8 18 6 7 8 T 5 7 5 4 4 6 7 5 15 16 17 18 15 16 17 T 0 0 0 6 7 8 3 3 3 3 6 7 8 18 5 5 5 15 16 17 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 0 0 0 2 6 7 5 8 18 6 5 18 11 16 17 11 15 16 14 17 18 7 8 2 14 15 18 0 U U U U U U U U U U U U U 0 18 185 0 1 3 4 5 6 0 3 5 2 7 7 7 2 3 2 4 5 7 6 0 1 0 0 5 6 6 6 6 15 16 17 18 U T 6 6 6 6 15 16 17 18 14 14 6 0 0 0 0 2 6 7 8 2 T 3 3 3 3 3 5 6 7 8 18 5 T 5 5 5 11 5 11 15 16 17 T T 0 2 1 3 P 4 5 7 6
and the computed solution for index k :
k
p
k
U : Update vector
processor p owns the right-hand side
P : Subcube as the Logical Grid
6 7 4 5 P 0 1 2 3 P 12 13 1 4 10
Figure 3: Supernodal tree guided Parallel Multifrontal Forward Elimination. expressions, we get the expressions for the triangular
solver. The overhead To becomes (nq) and Ts
be-comesO(n2). These expressions can be used to arrive
at the isoeciency function ofO(q2). This shows that
the two dimensional block cyclic algorithm proposed for the dense trapezoidal solution is scalable.
Analyzing a general sparse triangular solver is a dicult problem. We present the analysis for two wide classes of sparse systems arising out of two-dimensional and three-two-dimensional constant node-degree graphs. We refer to these problems as 2-D and 3-D problems, respectively. Let the problem of
dimensionN be solved usingpprocessors. Denelas
a level in the supernodal tree with l= 0 for the
top-most level. Thus,q, the number of processors assigned
to level l, is p=2l. With a nested-dissection based
ordering scheme, the number of nodes, n, at each
level isp
N=2lfor the 2-D problems and(N=2
l)2=3
for the 3-D problems, where is a small constant
[3]. The row dimension of a supernodal matrix, m,
can be assumed to be a constant times n, for a
bal-anced supernodal tree. The overall computation is proportional to the number of nonzero elements in
T, which is O(NlogN) and O(N4=3) for the 2-D
and 3-D problems, respectively. Assuming that the computation is uniformly distributed among all the
processors and summing up the overheads at all the levels, the asymptotic parallel runtime expression for
the 2-D problems isTp =O((NlogN)=p) +O(
p
N). For the 3-D problems, the asymptotic expression is
Tp = O(N
4=3=p) +O(N2=3). An analysis similar to
that of dense case, shows that the the isoeciency
function for the 2-D problems is O(p2=logp) and for
the 3-D problems, it is O(p2). Refer to [5] for the
detailed derivation of the expressions above.
The parallel formulation is thus, more scalable than the corresponding dense formulation for the class of 2-D problems and it is as scalable as the dense formu-lation for the class of 3-D problems. In the 3-D prob-lems, the asymptotic complexity of sparse formulation is the same as that of the dense solver operating on the
topmost supernode of dimensionN2=3
N
2=3. This
shows that our sparse formulation is optimally scal-able for the class of 3-D problems. The comparison of the isoeciency expressions to those of the solver described in [3] shows that our two dimensional par-allel formulation is as scalable as the one dimensional formulation.
5 Performance Results and Discussion
We have implemented our sparse triangular solver as part of a direct solver. We use MPI for
communi-cation to make it portable to a wide variety of parallel computers. The BLAS are used to achieve high com-putational performance, especially on the platforms having vendor-tuned BLAS libraries. We tested the performance on an IBM SP2 using IBM's ESSL for the sparse matrices taken from various domains of ap-plications. The nature of the matrices is shown in the
Table, where N is the dimension of T and jTj is the
number of nonzero elements inT. Figure 4 shows the
speedup in terms of time needed for combined forward substitution and backward elimination phases for the
number of right hand sides (nrhs) of 1 and 16
re-spectively. Figure 5 shows the performance of just the backward substitution phase in terms of the MFLOPS. The performance was observed to vary with the
blocksize, b and nrhs, as expected. For a given b,
as nrhs increases, we start getting the better
perfor-mance because of the level-3 BLAS. For givennrhs, as
b increases, the grain of computation increases, since
the ratio of computation to communication per block
is proportional to b. The combined eect of
bet-ter BLAS performance, reduced number of message startup costs and better bandwidth obtained from the communication subsystem, increases the overall
per-formance. But, after a threshold onb, the distribution
gets closer to the block distribution and processors start idling. This threshold depends on the character-istics of the problem and the parallel machine. The detailed results exhibiting these trends can be found in [5]. For most of the matrices we tested, we found that a blocksize of 64 gives better performance, on an IBM SP2 machine.
As can be seen from Figure 4, the parallel sparse triangular solver based on our two dimensional formu-lation of the algorithm, achieves good speedup charac-teristics for a variety of sparse systems. The speedup and performance numbers cited in Section 1 were ob-served for the 144pf3D matrix. For the hsct2 matrix, the entire triangular solver (both phases) delivered a speedup of 20.9 on 128 processors and of 17.5 on 64
processors for nrhs = 1. The performance curves
in Figure 5(b) show that the backward substitution phase achieves a performance of 4.575 GFLOPS for on 128 processors and 3.656 GFLOPS on 64
proces-sors for the 144pf3D matrix for nrhs = 16. To the
best of our knowledge, these are the best speedup and performance numbers reported till now for sparse tri-angular solvers.
References
[1] Anshul Gupta, George Karypis and Vipin Ku-mar, Highly Scalable Parallel Algorithms for Sparse Matrix Factorization, Technical Report
94-63, Department of Computer Science, Univer-sity of Minnesota, Minneapolis, MN, 1994. [2] Anshul Gupta, Fred Gustavson, Mahesh Joshi,
George Karypis and Vipin Kumar, Design and Implementation of a Scalable Parallel Direct Solver for Sparse Symmetric Positive Denite Systems, Proc. Eighth SIAM Conference on Par-allel Processing for Scientic Computing, Min-neapolis, 1997.
[3] Anshul Gupta and Vipin Kumar, Parallel Algo-rithms for Forward Elimination and Backward Substitution in Direct Solution of Sparse Linear Systems, Proc. Supercomputing'95, San Diego, 1995.
[4] Michael T. Heath and Padma Raghavan, LA-PACK Working Note 62: Distributed Solution of
Sparse Linear Systems , Technical Report
UT-CS-93-201, University of Tennessee, Knoxville, TN, 1993.
[5] Mahesh V. Joshi, Anshul Gupta, George Karypis, and Vipin Kumar, Two Dimensional Scalable Parallel Algorithms for Solution of Triangular Systems, Technical Report TR 97-024, Depart-ment of Computer Science, University of Min-nesota, Minneapolis, MN, 1997.
[6] Chunguang Sun, Ecient Parallel Solutions of Large Sparse SPD Systems on Distributed
Mem-ory Multiprocessors , Technical Report 92-102,
Matrix Application Domain N jTj
bcsstk15 Structural Engineering 3948 474921
bcsstk30 Structural Engineering 28924 4293227
copter2 3D Finite Element Methods 55476 10218255
hsct2 High Speed Commercial Transport 88404 18744255
144pf3D 3D Finite Element Methods 144649 48898387
0 2 4 6 8 10 0 20 40 60 80 100 120 140 Time (sec.) Number of Processors bcsstk15 bcsstk30 copter2 hsct2 144pf3D 0 0.5 1 1.5 2 0 20 40 60 80 100 120 140 Time (sec.) Number of Processors bcsstk15 bcsstk30 copter2 hsct2 144pf3D (a) (b)
Figure 4: Timing Results on the Parallel Sparse Triangular Solver (Forward Elimination and Backward
Substi-tution) (a)nrhs= 1. (b)nrhs= 16. (a) (b) 0 200 400 600 800 1000 1200 1400 1600 0 20 40 60 80 100 120 140 Performance (MFLOPS) Number of Processors bcsstk15 bcsstk30 copter2 hsct2 144pf3D 0 500 1000 1500 2000 2500 3000 3500 4000 4500 0 20 40 60 80 100 120 140 Performance (MFLOPS) Number of Processors bcsstk15 bcsstk30 copter2 hsct2 144pf3D