
A High Performance Two Dimensional Scalable Parallel Algorithm for Solving Sparse Triangular Systems

Mahesh V. Joshi†, Anshul Gupta‡, George Karypis†, Vipin Kumar†

Abstract

Solving a system of equations of the form Tx = y, where T is a sparse triangular matrix, is required after the factorization phase in the direct methods of solving systems of linear equations. A few parallel formulations have been proposed recently. The common belief in parallelizing this problem is that the parallel formulation utilizing a two dimensional distribution of T is unscalable. In this paper, we propose the first known efficient scalable parallel algorithm which uses a two dimensional block cyclic distribution of T. The algorithm is shown to be applicable to dense as well as sparse triangular solvers. Since most of the known highly scalable algorithms employed in the factorization phase yield a two dimensional distribution of T, our algorithm avoids the redistribution cost incurred by the one dimensional algorithms. We present the parallel runtime and scalability analyses of the proposed two dimensional algorithm. The dense triangular solver is shown to be scalable. The sparse triangular solver is shown to be at least as scalable as the dense solver. We also show that it is optimal for one class of sparse systems. The experimental results of the sparse triangular solver show that it has good speedup characteristics and yields high performance for a variety of sparse systems.

1 Introduction

Many direct and indirect methods of solving large sparse linear systems need to compute the solution to systems of the form Tx = y, where T is a sparse triangular matrix. The most frequent application is found in the domain of direct methods, which are based on the factorization of a matrix into triangular factors using a scheme like $LU$, $LL^T$ (Cholesky) or $LDL^T$. Most of the research in the domain of direct methods has been concentrated on developing fast and scalable algorithms for factorization, since it is computationally the most expensive phase of a direct solver. Recently some very highly scalable parallel algorithms have been proposed for sparse matrix factorizations [1], which require the matrix to be distributed among processors using a two dimensional mapping. This results in a two dimensional distribution of the triangular factor matrices used by the triangular solvers to get the final solution.

This work was supported by NSF CCR-9423082, by Army Research Office contract DA/DAAH04-95-1-0538, by Army High Performance Computing Research Center cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-0008, by the IBM Partnership Award, and by the IBM SUR equipment grant. Access to computing facilities was provided by AHPCRC, Minnesota Supercomputer Institute. Related papers are available via WWW at URL: http://www.cs.umn.edu/~kumar.

† Department of Computer Science, University of Minnesota, Minneapolis, MN 55455.

‡ P.O. Box 218, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598.

With the emergence of powerful parallel computers and the need to solve very large problems, completely parallel direct solvers are rapidly emerging [2, 4]. There is a need to find efficient and scalable parallel algorithms for triangular solvers, so that they do not form a performance bottleneck when the direct solver is used to solve large problems on a large number of processors. A few attempts have been made recently to formulate such algorithms [3, 4, 6]. Most of the attempts until now rely on redistributing the factor matrix from a two dimensional mapping to a one dimensional mapping. This is because it was believed until now that the parallel formulations based on two dimensional mapping are unscalable [3]. But, as we show in this paper, even a simple two dimensional block cyclic mapping of data can be utilized by our parallel algorithm to achieve as much scalability as achieved by the algorithms based on one dimensional mapping.

The challenge in formulating a scalable parallel algorithm lies in the sequential data dependency of the computation and the relatively small amount of computation to be distributed among processors. We show that the data dependency in fact exhibits concurrency in both the dimensions and that an appropriate two dimensional mapping of the processors can achieve a scalable formulation. We elucidate this further in Section 2, where we describe our two dimensional parallel algorithm for a dense trapezoidal solver.


Scalable and efficient formulation of the parallel algorithms for solving large sparse triangular systems is more challenging than the dense case. This is mainly because of the inherently unstructured computations. We present, in Section 3, an efficient parallel multifrontal algorithm for solving sparse triangular systems. It traverses the elimination tree of the given system and employs the dense trapezoidal algorithm at various levels of the elimination tree.

In Section 4, we analyze the proposed parallel algorithms for their parallel runtime and scalability. We first show that the parallel dense trapezoidal solver is scalable. Analyzing a general sparse triangular solver is a difficult problem. Hence we present the analysis for two different classes of sparse systems. For these classes, we show that the sparse triangular solver is at least as scalable as the dense triangular solver. Although it is not as scalable as the best factorization algorithm based on the same data mapping, its parallel runtime is asymptotically of lower order than that of the factorization phase. This makes a strong case for the utilization of our algorithm in a parallel direct solver.

We have implemented a sparse triangular solver based on the proposed algorithm and it is a part of the recently announced high performance direct solver [2]. The experimental results presented in Section 5 show good speedup characteristics and high performance on a variety of sparse matrices. Our algorithm for sparse backward substitution achieves a performance of 4.575 GFLOPS on 128 processors of an IBM SP2 for solving a system with multiple right hand sides (nrhs = 16), for which the single processor algorithm performed at around 200 MFLOPS. For nrhs = 1, it achieved 1630 MFLOPS on 128 processors, whereas the single processor algorithm performed at around 70 MFLOPS. To the best of our knowledge, these are the best performance and speedup numbers reported until now for solving sparse triangular systems.

2 Two Dimensional Scalable Formulation

In this section, we build a framework for the two dimensional scalable formulation by analyzing the data dependency and describing the algorithm to extract concurrency.

Figure 1: (a) Data Dependency in Dense Trapezoidal Forward Elimination. (b) Data Dependency in Dense Trapezoidal Backward Substitution. (c) Two Dimensional Parallel Dense Trapezoidal Forward Elimination. [Panels (a) and (b) mark the solution flow and the update flow; panel (c) shows the time step at which each block is processed on processors P0 to P3.]

Consider a system of linear equations of the form $Tx = b$, where $T = [T_1\ T_2]^T$, $x = [x_1\ x_2]^T$, $b = [b_1\ b_2]^T$, and $T_1$ is a lower triangular matrix. We define the process of computing the solution vector $x_1$ that satisfies $T_1 x_1 = b_1$, followed by the computation of the update vector $x_2$ using $x_2 = b_2 - T_2 x_1$, as a dense trapezoidal forward elimination. We refer to $b_1$ as a right hand side vector. The data dependency structure of this process is shown in Figure 1(a). When $T = [T_1^T\ T_2^T]$, we call the process of computing the update vector $z = b_1 - T_2^T x_2$ followed by the solution of $T_1^T x_1 = z$ a dense trapezoidal backward substitution. The data dependency structure of this process is shown in Figure 1(b). As can be seen from Figures 1(a) and 1(b), the data dependency structures of the two processes are identical. Henceforth, we describe the design and analysis only for the dense trapezoidal forward elimination. Also, we assume that the system has a single right hand side, but the discussion is easily extensible to multiple right hand sides.
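The dense trapezoidal forward elimination defined above can be illustrated with a small serial sketch. It is not the parallel algorithm of this paper; the function name `trapezoidal_forward_elimination` and the list-of-rows storage are assumptions made only for illustration. The matrix is m x n with m >= n, its top n x n part plays the role of T1 and the remaining rows play the role of T2.

```python
def trapezoidal_forward_elimination(T, b):
    """Serial sketch: solve T1 x1 = b1, then form x2 = b2 - T2 x1.

    T is a list of m rows with n entries each (m >= n); the first n rows
    form the lower triangular block T1, the remaining rows form T2.
    b is a list of length m holding b1 followed by b2.
    """
    m, n = len(T), len(T[0])
    work = list(b)              # running right hand side / update vector
    x1 = [0.0] * n
    for j in range(n):
        # Solution at the diagonal element of column j.
        x1[j] = work[j] / T[j][j]
        # Updates due to the elements below the diagonal in column j;
        # these are independent of each other.
        for i in range(j + 1, m):
            work[i] -= T[i][j] * x1[j]
    return x1, work[n:]


if __name__ == "__main__":
    T = [[2.0, 0.0],
         [1.0, 4.0],
         [3.0, 5.0]]
    b = [2.0, 9.0, 10.0]
    print(trapezoidal_forward_elimination(T, b))
```

The column-oriented loop mirrors the dependency structure of Figure 1(a): a solution value flows down its column, and updates accumulate along each row.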

Examine the data dependency structure of Figure 1(a). It can be observed that after the solution is computed for the diagonal element, the computations of updates due to the elements in that column can proceed independently. Similarly, the computations of the updates due to the elements in a given row can proceed independently before all of them are applied at the diagonal element, and then the solution can be computed for that row. Thus, the available concurrency is identical in both the dimensions. The computation in each of the dimensions can be pipelined to propagate updates horizontally and the computed solutions vertically.

For the parallel formulation, consider the two dimensional block and cyclic approaches of mapping processors to the elements. The two dimensional block mapping will not give a scalable formulation, as the processors owning the blocks towards the lower right hand corner of the matrix will be idle for more time as the size of the matrix increases. But the two dimensional cyclic mapping solves this problem by localizing the pipeline to a fixed portion of the matrix, independent of the size of the matrix. This is a very desirable property for achieving scalability. However, the unit grain of computation for each processor in a cyclic mapping is too small to get good performance out of most of today's multiprocessor parallel computers. Hence the cyclic mapping can be extended to a block-cyclic mapping to increase the grain of computation.

We now describe the parallel algorithm for dense trapezoidal forward elimination. Consider a block version of the equations. Let T be as shown in Figure 1(c). We distribute T in the block cyclic fashion using a two dimensional grid of processors. The blocks of $b_1$ reside on the processors owning the diagonal blocks of T. Each processor traverses the blocks assigned to it in a predetermined order. The figure shows the time step in which each block is processed for the column major order. The figure also shows the shaded region, where the vertical and horizontal pipelines are operated to propagate computed solutions and accumulated updates. The height and the width of this region are defined by the processor grid dimensions.

The processing applied to a block is determined by its location in the matrix. For the triangular blocks on the diagonal (except for those in the first column), the processor receives the accumulated updates in that block-row from its left neighbor and applies them to the right hand side vector. Then it computes the solution vector and sends it to its neighbor below. The processor accumulates the update contributions due to those blocks in a block-row which are not in the shaded region. For the rectangular blocks in the shaded region, the processor receives the solution vector from the top neighbor. It also receives the update vector from the left neighbor, except when the block lies on the boundary of the shaded region. Then it adds the update contribution due to the block being processed and sends the accumulated update vector to its right neighbor. In the end, the solution vector is distributed among the diagonal processors and the update vector is distributed among the last column processors.
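The block cyclic ownership and the column major traversal described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `owner` and `blocks_in_order` are hypothetical, processors are identified by (row, column) coordinates in a v x h grid, and only block indices are modeled, not the pipelined communication.

```python
def owner(block_row, block_col, v, h):
    """Grid coordinates of the processor owning a block under a
    two dimensional block cyclic mapping on a v x h processor grid."""
    return (block_row % v, block_col % h)


def blocks_in_order(proc_row, proc_col, mb, nb, v, h):
    """Blocks of an mb x nb block trapezoidal matrix owned by one processor,
    listed in column major order. Blocks strictly above the block diagonal
    are zero in a lower trapezoidal matrix and are skipped."""
    return [(i, j)
            for j in range(proc_col, nb, h)
            for i in range(proc_row, mb, v)
            if i >= j]


if __name__ == "__main__":
    # 4 x 4 blocks on a 2 x 2 grid: the blocks visited by processor (0, 0).
    print(blocks_in_order(0, 0, 4, 4, 2, 2))
    # On a square grid, diagonal blocks land on diagonal processors.
    print([owner(k, k, 2, 2) for k in range(4)])
```

With a square grid, every diagonal block (k, k) maps to a processor whose row and column coordinates are equal, which is consistent with the solution vector ending up on the diagonal processors.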

3 Parallel Algorithm for Sparse Triangular Solver

Solving a triangular system Tx = y efficiently in parallel is challenging when T is sparse. In this section, we propose a parallel multifrontal algorithm for solving such systems.

Figure 2: (a) An Example Sparse Matrix. Lower triangular part shows the fill-ins ("o") that occur during the factorization process. (b) Supernodal Tree for this matrix.

We first give the description of the serial multifrontal algorithm for forward elimination, which closely follows the algorithm described in [6]. The algorithm is guided by the structure of the elimination tree of T. We refer to a collection of consecutive nodes in an elimination tree, each with only one child, as a supernode. The nodes in a supernode are collapsed to form the supernodal tree. An example matrix and its associated supernodal tree are shown in Figure 2. The columns of T corresponding to the individual nodes in a supernode form the supernodal matrix. It can be easily verified that the supernodal matrix is dense and trapezoidal. Given two vectors and their corresponding sets of indices, an extend-add operation is defined as extending each of the vectors to the union of the two index sets by filling in zeros if needed, followed by adding them up index-wise.
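The extend-add operation defined above can be written down directly. The following is a minimal sketch; the function name `extend_add` and the choice of returning the union in sorted order are assumptions made for illustration.

```python
def extend_add(vec_a, idx_a, vec_b, idx_b):
    """Extend both vectors to the union of their index sets (filling in
    zeros where an index is missing) and add them index-wise.

    Returns the summed vector together with the sorted union of indices.
    """
    union = sorted(set(idx_a) | set(idx_b))
    position = {k: i for i, k in enumerate(union)}
    result = [0.0] * len(union)
    for value, k in zip(vec_a, idx_a):
        result[position[k]] += value
    for value, k in zip(vec_b, idx_b):
        result[position[k]] += value
    return result, union


if __name__ == "__main__":
    print(extend_add([1.0, 2.0], [2, 5], [10.0, 20.0, 30.0], [5, 7, 8]))
```

In the example, index 5 appears in both inputs, so its two values are added, while indices 2, 7 and 8 each receive a single contribution.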

The serial multifrontal algorithm for forward elimination does a postorder traversal of the supernodal tree associated with the given triangular matrix. At each supernode, the update vectors resulting from the dense supernodal forward elimination of its children nodes are collected. They are extend-added together and the resulting vector is subtracted from the right hand side vector corresponding to the row indices of the supernodal matrix. Then the process of dense trapezoidal forward elimination as defined in Section 2 is applied.
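The postorder traversal described above can be sketched serially as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the solver: the name `multifrontal_forward`, the dictionary-based tree, and the storage of each supernode as a pair (row indices, dense trapezoidal matrix) are all hypothetical. Each child's update vector is subtracted directly at the matching row indices of the parent, which has the same effect as extend-adding the children's vectors first and subtracting the sum.

```python
def multifrontal_forward(supernodes, children, root, b):
    """Serial multifrontal forward elimination sketch.

    supernodes[s] = (rows, F): rows are the global row indices of the dense
    trapezoidal supernodal matrix F; the first len(F[0]) of them are the
    supernode's own columns. children[s] lists the children of s.
    """
    x = [0.0] * len(b)

    def solve(s):
        rows, F = supernodes[s]
        ncols = len(F[0])
        position = {k: i for i, k in enumerate(rows)}
        # Right hand side on the supernode's own columns; the rows below
        # the triangular part start at zero and only collect updates.
        rhs = [b[k] if i < ncols else 0.0 for i, k in enumerate(rows)]
        # Postorder: finish each child, then subtract its update vector.
        for c in children.get(s, []):
            update, update_rows = solve(c)
            for value, k in zip(update, update_rows):
                rhs[position[k]] -= value
        # Dense trapezoidal forward elimination on F.
        for j in range(ncols):
            rhs[j] /= F[j][j]
            x[rows[j]] = rhs[j]
            for i in range(j + 1, len(rows)):
                rhs[i] -= F[i][j] * rhs[j]
        # Contributions accumulated for ancestor rows, with the sign
        # flipped so that the parent can subtract them.
        return [-r for r in rhs[ncols:]], rows[ncols:]

    solve(root)
    return x


if __name__ == "__main__":
    # Lower triangular matrix [[2,0,0],[0,3,0],[1,1,4]] as three supernodes.
    supernodes = {0: ([0, 2], [[2.0], [1.0]]),
                  1: ([1, 2], [[3.0], [1.0]]),
                  2: ([2], [[4.0]])}
    children = {2: [0, 1]}
    print(multifrontal_forward(supernodes, children, 2, [2.0, 3.0, 6.0]))
```

The sketch relies on the usual property that the below-diagonal row indices of a child supernode are contained in the row indices of its parent.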

The parallel multifrontal algorithm was developed keeping in mind that our sparse triangular solver is a part of a parallel direct solver described in [2]. Our triangular solver uses the same distribution of the factor matrix, T, as used by the highly scalable algorithm employed in the numerical factorization phase [1]. Thus, there is no cost incurred in redistributing the data. Distribution of T is described in the next paragraph.

We assume that the supernodal tree is binary in the top $\log p$ levels¹, where p is the number of processors used to solve the problem. The portions of this binary supernodal tree are assigned to processors using a subtree-to-subcube strategy illustrated in Figure 3, where eight processors are used to solve the example matrix of Figure 2. The processor assigned to a given element of a supernodal matrix is determined by the row and column indices of the element and a bitmask determined by the depth of the supernode in the supernodal tree [1]. This method achieves a two dimensional block cyclic distribution of the supernodal matrix among the logical grid of processors in the subcube, as shown in Figure 3.
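The subtree-to-subcube assignment can be illustrated with a short sketch. It assumes that p is a power of two, that the top of the tree is a complete binary tree, and that each subcube is a contiguous range of processor ranks; the function name `subcube` and the (level, index) numbering of supernodes are hypothetical. The bitmask-based mapping of individual matrix elements from [1] is not reproduced here.

```python
def subcube(level, index, p):
    """Processor ranks assigned to the supernode at a given tree level.

    level 0 is the topmost supernode; index counts the supernodes of that
    level from left to right. Each level halves the subcube of its parent.
    """
    size = p >> level           # p / 2**level processors per supernode
    start = index * size
    return list(range(start, start + size))


if __name__ == "__main__":
    p = 8
    for level in range(4):
        print(level, [subcube(level, i, p) for i in range(2 ** level)])
```

With eight processors, the topmost supernode is shared by all eight, its two children by four processors each, and so on down to single processors, each of which handles its subtree serially.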

In the parallel multifrontal algorithm for forward elimination, each processor traverses the part of the supernodal tree assigned to it in a bottom up fashion. The subtree at the lowest level is solved using the serial multifrontal algorithm. After that, at each supernode, a parallel extend-add operation is done, followed by the parallel dense trapezoidal forward elimination. In the parallel extend-add operation, the processors owning the first block-column of the distributed supernodal matrix at the parent supernode collect the update vectors resulting from the trapezoidal forward eliminations of its children supernodes. The bitmask based data distribution ensures that this can be achieved with at most two steps of point-to-point communications among pairs of processors. Then each processor performs a serial extend-add operation on the update vectors it receives, the result of which is used as an initial update vector in the dense algorithm described in Section 2. The process continues until the topmost supernode is reached. The entire process is illustrated in Figure 3, where the processors communicating in the parallel extend-add operation at each level are shown.

In the parallel multifrontal backward substitution, each processor traverses its part of the supernodal tree in a top down fashion. At each supernode in the top $\log p$ levels, a parallel dense trapezoidal backward substitution algorithm is applied. The solution vector computed at the parent supernode propagates to the child supernodes. This is done with at most one step of point-to-point communication among pairs of processors from the two subcubes of the children. Each processor adjusts the indices of the received solution vector to correspond to the row indices of its child supernode.

¹ The separator tree obtained from the recursive nested dissection based ordering algorithms yields such a binary tree.

4 Parallel Runtime and Scalability Analysis

We first derive an expression for the runtime of the parallel dense trapezoidal forward elimination algorithm. We present a conservative analysis assuming a naive implementation of the algorithm. We consider the case where the processors compute the blocks in a column major order. The same results hold true for the row major order of computation.

Define $t_c$ as the unit time for computation. Let $t_w$ be the unit time for communication, assuming a model where communication time is proportional to the size of the data, which holds true for most of today's parallel processors when a moderate size of the data is communicated. Consider a block trapezoidal matrix, $T$, of dimensions $m \times n$ as shown in Figure 1(c). The block size is $b$. Let the problem be solved using a square grid of $q$ processors ($h = v = \sqrt{q}$). We assume that both $m$ and $n$ are divisible by $\sqrt{q}$. For the purpose of analysis, we also assume that all the processors synchronize after each processor finishes processing one column assigned to it. The actual implementation does not use an explicit synchronization. The matrix $T$ now consists of $n/(b\sqrt{q})$ vertical strips of width $\sqrt{q}$ blocks. We number these strips from left to right starting from 1. The parallel runtime for the $i$th vertical strip consists of two components. The pipelined component (in the shaded region of the strip) takes $\Theta(\sqrt{q})\,(b\,t_w + b^2 t_c)$ time. The other component is the time needed for each processor to compute update vectors for the $(m/(b\sqrt{q}) - i)$ blocks, each of which takes $\Theta(b^2 t_c)$ time. In the end, we need to add the time required for the last vertical strip, where $(m - n)/(b\sqrt{q})$ horizontal pipelines operate to propagate the updates to the processors in the last block-column. After summing up all the terms for all the vertical strips and simplifying, the parallel runtime, $T_p$, can be expressed asymptotically as

$T_p = \Theta(1/q)\,(mn - n^2/2)\,t_c + \Theta(m + n)\,t_w + \Theta(n)\,t_c$.
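As a rough check of the summation over strips, the two per-strip components can be added up explicitly. This is only a sketch: it drops the constants hidden in the $\Theta$ terms, writes $S = n/(b\sqrt{q})$ for the number of strips (a symbol introduced here for convenience), treats $b$ as a constant, and does not expand the last-strip pipelines, which contribute the rest of the $t_w$ term.

```latex
\begin{align*}
\sum_{i=1}^{S}\Big[\sqrt{q}\,(b\,t_w + b^2 t_c)
      + \Big(\frac{m}{b\sqrt{q}} - i\Big) b^2 t_c\Big]
 &= n\,t_w + n b\,t_c
    + b^2 t_c\Big(\frac{S\,m}{b\sqrt{q}} - \frac{S(S+1)}{2}\Big) \\
 &= n\,t_w + n b\,t_c
    + \Big(\frac{mn}{q} - \frac{n^2}{2q} - \frac{n b}{2\sqrt{q}}\Big) t_c \\
 &\approx \frac{1}{q}\Big(mn - \frac{n^2}{2}\Big) t_c
    + \Theta(n)\,t_w + \Theta(n)\,t_c .
\end{align*}
```

The leading term is the serial work $(mn - n^2/2)\,t_c$ divided by the number of processors $q$, and the remaining terms grow only linearly in the matrix dimensions.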

Figure 3: Supernodal tree guided Parallel Multifrontal Forward Elimination. [Legend: T = dense supernodal matrix; U = update vector; P = subcube as the logical grid; an index k labeled with p indicates that processor p owns the right-hand side and the computed solution for index k.]

The overhead due to communication and pipelining can be obtained as $T_o = qT_p - T_s$, where $T_s = W t_c$ is the serial runtime. $W$ is the size of the problem expressed in terms of the number of computational steps needed to solve the problem on a uniprocessor machine. For the trapezoidal forward elimination problem, $W$ is $\Theta(mn - n^2/2)$. Thus, $T_o = \Theta((m+n)q)\,t_w + \Theta(nq)\,t_c$. Efficiency of the parallel system is expressed as $E = T_s/(T_s + T_o)$. The scalability is expressed in terms of an isoefficiency function, which is defined as the rate at which $W$ has to increase when the number of processors is increased in order to maintain a constant efficiency. The system is more scalable when this rate is lower. Substituting $m = n$ in the above expressions, we get the expressions for the triangular solver. The overhead $T_o$ becomes $\Theta(nq)$ and $T_s$ becomes $O(n^2)$. These expressions can be used to arrive at the isoefficiency function of $O(q^2)$. This shows that the two dimensional block cyclic algorithm proposed for the dense trapezoidal solution is scalable.
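The step from the overhead and serial runtime of the triangular solver to its isoefficiency function can be spelled out as follows. This is a sketch that treats $t_c$ and $t_w$ as constants and uses $W \propto n^2$ for the triangular case.

```latex
\begin{align*}
E = \frac{T_s}{T_s + T_o} = \frac{1}{1 + T_o/T_s}
  &\;\Longrightarrow\; E \text{ is constant when } T_s \propto T_o, \\
n^2 \propto n q
  &\;\Longrightarrow\; n \propto q, \\
W \propto n^2
  &\;\Longrightarrow\; W \propto q^2 .
\end{align*}
```

In words, the matrix dimension has to grow linearly with the number of processors, so the problem size has to grow quadratically.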

Analyzing a general sparse triangular solver is a difficult problem. We present the analysis for two wide classes of sparse systems arising out of two-dimensional and three-dimensional constant node-degree graphs. We refer to these problems as 2-D and 3-D problems, respectively. Let the problem of dimension $N$ be solved using $p$ processors. Define $l$ as a level in the supernodal tree with $l = 0$ for the topmost level. Thus, $q$, the number of processors assigned to level $l$, is $p/2^l$. With a nested-dissection based ordering scheme, the number of nodes, $n$, at each level is $\alpha\sqrt{N/2^l}$ for the 2-D problems and $\alpha(N/2^l)^{2/3}$ for the 3-D problems, where $\alpha$ is a small constant [3]. The row dimension of a supernodal matrix, $m$, can be assumed to be a constant times $n$, for a balanced supernodal tree. The overall computation is proportional to the number of nonzero elements in $T$, which is $O(N \log N)$ and $O(N^{4/3})$ for the 2-D and 3-D problems, respectively. Assuming that the computation is uniformly distributed among all the processors and summing up the overheads at all the levels, the asymptotic parallel runtime expression for the 2-D problems is $T_p = O((N \log N)/p) + O(\sqrt{N})$. For the 3-D problems, the asymptotic expression is $T_p = O(N^{4/3}/p) + O(N^{2/3})$. An analysis similar to that of the dense case shows that the isoefficiency function for the 2-D problems is $O(p^2/\log p)$ and for the 3-D problems, it is $O(p^2)$. Refer to [5] for the detailed derivation of the expressions above.
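The isoefficiency functions quoted above follow from balancing the problem size against the overhead $T_o = p\,T_p - T_s$. The following is a sketch of that balance, not the detailed derivation of [5]; it takes $W$ proportional to the nonzero counts given above and uses $\log N = \Theta(\log p)$ along the isoefficiency curve.

```latex
\begin{align*}
\text{2-D:}\quad & T_o = O(p\sqrt{N}),\qquad W = O(N\log N), \\
 & N\log N \propto p\sqrt{N}
   \;\Longrightarrow\; \sqrt{N}\,\log N \propto p
   \;\Longrightarrow\; N\log^2 N \propto p^2, \\
 & W \propto N\log N \propto \frac{p^2}{\log N}
   \;\Longrightarrow\; W = O\!\Big(\frac{p^2}{\log p}\Big); \\[4pt]
\text{3-D:}\quad & T_o = O(p\,N^{2/3}),\qquad W = O(N^{4/3}), \\
 & N^{4/3} \propto p\,N^{2/3}
   \;\Longrightarrow\; N^{2/3} \propto p
   \;\Longrightarrow\; W \propto \big(N^{2/3}\big)^2 = O(p^2).
\end{align*}
```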

The parallel formulation is thus more scalable than the corresponding dense formulation for the class of 2-D problems, and it is as scalable as the dense formulation for the class of 3-D problems. In the 3-D problems, the asymptotic complexity of the sparse formulation is the same as that of the dense solver operating on the topmost supernode of dimension $N^{2/3} \times N^{2/3}$. This shows that our sparse formulation is optimally scalable for the class of 3-D problems. The comparison of the isoefficiency expressions to those of the solver described in [3] shows that our two dimensional parallel formulation is as scalable as the one dimensional formulation.

5 Performance Results and Discussion

We have implemented our sparse triangular solver as part of a direct solver. We use MPI for communication to make it portable to a wide variety of parallel computers. The BLAS are used to achieve high computational performance, especially on the platforms having vendor-tuned BLAS libraries. We tested the performance on an IBM SP2 using IBM's ESSL for sparse matrices taken from various application domains. The nature of the matrices is shown in the table, where N is the dimension of T and |T| is the number of nonzero elements in T. Figure 4 shows the speedup in terms of time needed for the combined forward elimination and backward substitution phases for the number of right hand sides (nrhs) of 1 and 16, respectively. Figure 5 shows the performance of just the backward substitution phase in terms of MFLOPS.

The performance was observed to vary with the block size, b, and nrhs, as expected. For a given b, as nrhs increases, we start getting better performance because of the level-3 BLAS. For a given nrhs, as b increases, the grain of computation increases, since the ratio of computation to communication per block is proportional to b. The combined effect of better BLAS performance, reduced number of message startup costs and better bandwidth obtained from the communication subsystem increases the overall performance. But, after a threshold on b, the distribution gets closer to the block distribution and processors start idling. This threshold depends on the characteristics of the problem and the parallel machine. The detailed results exhibiting these trends can be found in [5]. For most of the matrices we tested, we found that a block size of 64 gives better performance on an IBM SP2 machine.

As can be seen from Figure 4, the parallel sparse triangular solver based on our two dimensional formulation of the algorithm achieves good speedup characteristics for a variety of sparse systems. The speedup and performance numbers cited in Section 1 were observed for the 144pf3D matrix. For the hsct2 matrix, the entire triangular solver (both phases) delivered a speedup of 20.9 on 128 processors and of 17.5 on 64 processors for nrhs = 1. The performance curves in Figure 5(b) show that the backward substitution phase achieves a performance of 4.575 GFLOPS on 128 processors and 3.656 GFLOPS on 64 processors for the 144pf3D matrix for nrhs = 16. To the best of our knowledge, these are the best speedup and performance numbers reported until now for sparse triangular solvers.

References

[1] Anshul Gupta, George Karypis and Vipin Kumar, Highly Scalable Parallel Algorithms for Sparse Matrix Factorization, Technical Report 94-63, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994.

[2] Anshul Gupta, Fred Gustavson, Mahesh Joshi, George Karypis and Vipin Kumar, Design and Implementation of a Scalable Parallel Direct Solver for Sparse Symmetric Positive Definite Systems, Proc. Eighth SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, 1997.

[3] Anshul Gupta and Vipin Kumar, Parallel Algorithms for Forward Elimination and Backward Substitution in Direct Solution of Sparse Linear Systems, Proc. Supercomputing'95, San Diego, 1995.

[4] Michael T. Heath and Padma Raghavan, LAPACK Working Note 62: Distributed Solution of Sparse Linear Systems, Technical Report UT-CS-93-201, University of Tennessee, Knoxville, TN, 1993.

[5] Mahesh V. Joshi, Anshul Gupta, George Karypis, and Vipin Kumar, Two Dimensional Scalable Parallel Algorithms for Solution of Triangular Systems, Technical Report TR 97-024, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1997.

[6] Chunguang Sun, Efficient Parallel Solutions of Large Sparse SPD Systems on Distributed Memory Multiprocessors, Technical Report 92-102,

Matrix     Application Domain                N        |T|
bcsstk15   Structural Engineering            3948     474921
bcsstk30   Structural Engineering            28924    4293227
copter2    3D Finite Element Methods         55476    10218255
hsct2      High Speed Commercial Transport   88404    18744255
144pf3D    3D Finite Element Methods         144649   48898387

[Figure 4 plots: Time (sec.) versus Number of Processors for bcsstk15, bcsstk30, copter2, hsct2 and 144pf3D; panels (a) and (b).]

Figure 4: Timing Results on the Parallel Sparse Triangular Solver (Forward Elimination and Backward Substitution). (a) nrhs = 1. (b) nrhs = 16.

[Figure 5 plots: Performance (MFLOPS) versus Number of Processors for bcsstk15, bcsstk30, copter2, hsct2 and 144pf3D; panels (a) and (b).]