A Distributed-Memory Randomized Structured Multifrontal Method for Sparse Direct Solutions

(1)

A DISTRIBUTED-MEMORY RANDOMIZED STRUCTURED MULTIFRONTAL METHOD FOR SPARSE DIRECT SOLUTIONS∗

ZIXING XIN†, JIANLIN XIA‡, MAARTEN V. DE HOOP§, STEPHEN CAULEY¶, AND VENKATARAMANAN BALAKRISHNANk

Abstract. We design a distributed-memory randomized structured multifrontal solver for large sparse matrices. Two layers of hierarchical tree parallelism are used. A sequence of innovative parallel methods are developed for randomized structured frontal matrix operations, structured update ma-trix computation, skinny extend-add operation, selected entry extraction from structured matrices, etc. Several strategies are proposed to reuse computations and reduce communications. Unlike an earlier parallel structured multifrontal method that still involves large dense intermediate matrices, our parallel solver performs the major operations in terms of skinny matrices and fully structured forms. It thus significantly enhances the efficiency and scalability. Systematic communication cost analysis shows that the numbers of words are reduced by factors of aboutO(√n/r) in two dimen-sions and aboutO(n2/3_{/r) in three dimensions, where}_n_{is the matrix size and}_r_{is an off-diagonal}

numerical rank bound of the intermediate frontal matrices. The efficiency and parallel performance are demonstrated with the solution of some large discretized PDEs in two and three dimensions. Nice scalability and significant savings in the cost and memory can be observed from the weak and strong scaling tests, especially for some 3D problems discretized on unstructured meshes.

Key words. distributed memory, tree parallelism, fast direct solver, randomized multifrontal method, rank structure, skinny matrices

AMS subject classifications. 15A23, 65F05, 65F30, 65Y05, 65Y20

DOI. 10.1137/16M1079221

1. Introduction. The solution of large sparse linear systems plays a critically important role in modern numerical computations and simulations. Generally, there are two types of sparse solvers, iterative ones and direct ones. Direct solvers are robust and suitable for solving linear systems with multiple right-hand sides but usually take more memory to store the factors that are often much denser. The focus of this work is on a parallel sparse direct solution that uses low-rank approximations to reduce floating-point operations and decrease memory usage.

An important type of direct solvers is the multifrontal method proposed in [9]. Multifrontal methods perform the factorization of a sparse matrix via a sequence of smaller dense factorizations organized following a tree called an elimination or assembly tree. This results in nice data locality and great potential for parallelization. Matrix reordering techniques such as nested dissection [10] are commonly used before the factorization to reduce fill-in.

∗_{Submitted to the journal’s Software and High-Performance Computing section June 9, 2016;} accepted for publication (in revised form) March 13, 2017; published electronically August 17, 2017.

http://www.siam.org/journals/sisc/39-4/M107922.html

Funding: The work of the second author was supported in part by NSF CAREER Award DMS-1255416.

†_{Department of Mathematics, Purdue University, West Lafayette, IN 47907 ([email protected]).} ‡_{Department of Mathematics and Department of Computer Science, Purdue University, West} Lafayette, IN 47907 ([email protected]).

§_{Department of Computational and Applied Mathematics, Rice University, Houston, TX 77005} ([email protected]).

¶_{Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts} General Hospital, Harvard University, Charlestown, MA 02129 ([email protected]).

k_{School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907} ([email protected]).

(2)

In recent years, structured multifrontal methods [33, 30, 27, 2] have been devel-oped to utilize a certain low-rank property of the intermediate dense matrices that arise in the factorization of some discretized PDEs. This low-rank property has been found in many problems, such as the ones arising from the discretization of elliptic PDEs. Hierarchical structured matrices like H/H2_{-matrices [5, 15, 17] and}

hierar-chically semiseparable (HSS) matrices [6, 34] are often used to take advantage of the low-rank property. Other rank structured representations are also used in multifrontal and similar solvers [2, 14].

To enhance the performance and flexibility of the structured matrix operations, some recent work integrates randomization into structured multifrontal methods [31]. Randomized sampling enables the conversion of a large rank-revealing problem into a much smaller one after matrix-vector multiplications [18, 21]. This often greatly accelerates the construction of structured forms and also makes the processing of the data much simpler [24, 35].

In this work, we are interested in the parallelization of the randomized struc-tured multifrontal (RSMF) method in [31] for the factorization of large-scale sparse matrices. The work is a systematic presentation of our developments in the technical reports [36, 37, 38]. The parallelism is essential to speed up the algorithms and make the algorithms available to large problems by the exploitation of more processes and memory. Earlier work in [28] gives a parallel multifrontal solver based on a simpli-fied structured multifrontal method in [30] that involves dense intermediate matrices. Some dense matrices called frontal matrices are approximated by HSS forms and then factorized. The resulting Schur complements called update matrices are still dense and are used to assemble later frontal matrices. The use of dense update matrices is due to the lack of an effective structured data assembly strategy. All these dense matrices tend to be a memory bottleneck if their sizes are large. Moreover, the dense forms make some major operations more costly than necessary, including the struc-tured approximations of the frontal matrices, the computation of the update matrices, and the assembly of the frontal matrices. In addition, the parallel solver in [28] relies on the geometry of the mesh, which is required to be a regular mesh. This limits the applicability of that solver.

Here, we seek to build a systematic framework for parallelizing theRSMFmethod in [31] using distributed memory. The randomized approach avoids the use of dense frontal and update matrices and also makes the parallelization significantly more convenient and efficient. We also allow more general matrix types. Our main results include the following:

• We design all the mechanisms for a distributed-memory parallel implemen-tation of the RSMF method. A static mapping parallel model is designed to handle two layers of parallelism (one for the assembly tree and another for the structured frontal/update matrices) as well as new parallel structured operations. Novel parallel algorithms for all the major steps are developed, such as the intermediate structured matrix operations, structured submatrix extraction, skinny matrix assembly, and information propagation.

• A sequence of strategies is given to reduce the communication costs, such as the combination of process grids for different tree parallelism, a compact storage of HSS forms, the assembly and passing of data between processes in compact forms, heavy data reuse, the collection of data for BLAS3 op-erations, and the storage of a small amount of the same data in multiple processes.

(3)

• The parallel structured multifrontal method is not restricted to a specific mesh geometry. Graph partitioning techniques and packages are used to efficiently produce nested dissection ordering of a general mesh. The method can be applied to much more general sparse matrices where the low-rank property exists. For problems without the low-rank property, we can also use the method as an effective preconditioner.

• Our nested dissection ordering and symbolic factorization well respect the local geometric connectivity and thus naturally preserve the inherent low-rank property in the intermediate dense matrices.

• As compared with the parallel structured multifrontal method in [28],

– our parallel method can be applied to sparse matrices corresponding to more general graphs instead of only regular meshes;

– we do not need to store large dense frontal/update matrices or convert between dense and HSS matrices, and instead, the HSS constructions and factorizations are fully structured;

– the parallel data assembly is performed through skinny matrix-vector products instead of large dense matrices.

We give a detailed analysis of the communication cost (which is also missing for the method in [28]). The communication cost of our solver includes O(√P) + O(rlog3P) messages andO(r

√ n ˆ P ) +O( r2_√log3P ˆ P

) words for two-dimensional (2D) prob-lems orO(rn 2 3 ˆ P ) +O( r2_log3_P √ ˆ P

) words for 3D problems, where nis the matrix size, P is the total number of processes, r is the maximum off-diagonal numerical rank of the frontal matrices, and ˆP is the minimum size of all the process grids. There is a significant reduction in the communication cost as compared with the solver in [28]. The numbers of words of the new solver are smaller by factors about O(

√ n r ) and O(n 2 3

r ) for two and three dimensions, respectively.

Extensive numerical tests in terms of some discretized PDEs in two and three dimensions are used to show the parallel performance. Both weak and strong scaling tests are done. The nice scalability and significant savings in the cost and memory can be clearly observed, especially for some 3D PDEs discretized on unstructured meshes. For example, for a discretized 3D Helmholtz equation withn≈9.7×106_{, the}

new solver takes about 1/4 of the flops of the standard multifrontal solver and about 1/2 of the storage for the factors. With up to 256 processes, when the number of processes doubles, the parallel factorization time reduces by a factor about 1.7∼1.8. The new solver works well for largerneven if the standard parallel multifrontal solver runs out of memory.

The remaining sections are organized as follows. We discuss nested dissection, parallel symbolic factorization, and graph connectivity preservation in section 2. Sec-tion 3 reviews the basic framework of theRSMFmethod. Section 4 details our parallel randomized structured multifrontal (PRSMF) method. Section 5 shows the analysis of the communication cost. The numerical results are shown in section 6. Finally, we draw our conclusions in section 7.

For convenience, we first explain the basic notation to be used throughout the paper.

• For a matrix A and an index setI, A|I denotes the submatrix ofA formed by the rows with the row index setI.

• For a matrix A and index sets I,J, A|I×J denotes the submatrix of Awith row index setIand column index setJ.

(4)

• If T is a postordered binary tree with nodes i = 1,2, . . ., we use sib(i) and par(i) to denote the sibling and parent of a node i, respectively, and use root(T) to denote the root node ofT.

• Bold and calligraphic letters are usually used for symbols related to the mul-tifrontal process.

2. Nested dissection for general graphs, parallel symbolic factorization, and connectivity preservation. For an n×n sparse matrix A, let G(A) be its corresponding adjacency graph, which has n vertices and has an edge connecting verticesi andj ifAij 6= 0. G is a directed graph ifA is nonsymmetric. In this case,

we consider the adjacency graphG(A+AT) instead ofG(A) (see, e.g., [3, 20]). In the factorization ofA, vertices in the adjacency graph are eliminated, and their neighbors are connected, which yields fill-in in the factors.

To reduce fill-in, reordering techniques are usually applied toGbefore the factor-ization. Nested dissection [10] is one of the most commonly used reordering techniques. The basic idea is to divide a graph recursively into disjoint parts using separators which are small subsets of vertices. Then upper level separators are ordered after lower level ones. Treating each separator as a supernode, we can construct a tree called assembly tree T to organize the later factorization. This provides a natural mechanism of parallelism. See Figure 1. For convenience, the nodes and the corre-sponding separators are labeled asi= 1,2, . . .in postorder. Denote the set of vertices within separator/nodeibysi. Suppose the root node is at levell= 0.

We apply nested dissection to G and, in the mean time, collect relevant con-nectivity information for the sake of later structured matrix operations. During the later elimination stage (via the multifrontal method which will be reviewed in the next section), it will require the reordering of the vertices within a separator together with those within some neighbor separators. Here, roughly speaking, a separator j

is aneighbor of a separator iifj is an ancestor of iand the L factor in the LU fac-torization of A has nonzero entries in L|sj×si. We try to keep nearby vertices close to each other after reordering, so as to preserve the inherent low-rank structures of the intermediate Schur complements [30]. The details are as follows.

• Nested dissection is performed with the package Scotch [25]. A multilevel scheme [19] is applied to determine the vertex separators. The assembly tree T is built with each nonleaf node corresponding to a separator. The

8 9 12 1 4 5 11 2 3 10 13 6 14 7 15 15 4 3 6 1 7 1 0 13 1 2 4 5 8 9 11 12

(i) Nested dissection and separator ordering (ii) Corresponding assembly tree Fig. 1. Nested dissection of a mesh and the corresponding assembly tree, where the mesh illustration is based onMeshpart[13].

(5)

Gibbs–Poole–Stockmeyer method [12] is used to order the vertices within each separator so that the bandwidth of the corresponding adjacency submatrix of Ais small. Geometrically, this keeps nearby vertices within a separator close to each other under the new ordering and may benefit the low-rank structure of the corresponding frontal matrix. The reason is as follows. For some discretized elliptic PDEs, the frontal matrices are related to (the inverses of) the discretized Green’s functions, and the low-rank property of the frontal matrices is related to the geometric separability of the mesh points where the Green’s function is evaluated [7, 33]. Well-separated points correspond to blocks with small numerical ranks. Thus intuitively, it is often beneficial to use an appropriate ordering that respects the geometric connectivity.

• In a symbolic factorization, following a bottom-up traversal of the assembly treeT, we collect the neighbor information of each node ofT. The neighbor information reflects how earlier eliminations create fill-in. The connectivity of the vertices within a separator iand its neighbors is also collected. Such connectivity information is accumulated from child nodes to parent nodes and is used to preserve the graph connectivity after reordering.

• The symbolic factorization is performed in parallel. Under a certain paral-lel level lp, each process performs a local symbolic factorization of a private

subtree and stores the corresponding neighboring information. At levellp, an

all-to-all communication between all processes is triggered to exchange neigh-boring information of nodes at lp. For all the nodes above lp, the symbolic

factorization is performed within each processor, so that each process can conveniently access relevant nodes without extra communications.

• Following the nice data locality of the assembly tree, we adopt a subtree-to-subcube static mapping model [11] to preallocate computational tasks to each process. Each process is assigned a portion of the computational tasks associated with a subtree of T or a subgraph of G. Starting from a certain level, processes form disjoint process groups form process grids to work to-gether. We try to evenly divide the adjacency graph so as to maintain the load balance of processes. See also section 4.1. Static unbalances may still exist as a result of unbalanced partitioning of the graph. Dynamic unbal-ances may also arise in the factorization stage since the HSS ranks and HSS generator sizes are generally unknown in advance. It will be interesting to in-vestigate the use of dynamic scheduling in our solver. Other static mapping models such as the strict proportional mapping [26] are also used in some sparse direct solvers and may help improve the static balance of our solver. These are not the primary focus of the current work and will be studied in the future.

3. Review of the randomized structured multifrontal method. After nested dissection, the matrix factorization can be performed via the multifrontal method [9, 22], where partial factors are produced along the traversal of the assembly treeT. For simplicity, we useAi,j to denoteA|si×sj for nodesi,j ofT and useNito denote the set of neighbors ofi.

For each leafi,Ni includes all the separators that are connected toiin G. Form an initialfrontal matrix

(3.1) Fi≡ Fi0= Ai,i Ai,Ni ANi,i 0 .

(6)

Compute an LU factorization ofFi: (3.2) Fi= Li,i LNi,i I Ki,i Ki,Ni Ui ,

where Ui is the Schur complement and is called an update matrix. In the adjacency graph, this corresponds to the elimination ofi.

For each nonleaf nodei,Niincludes all the ancestor separators that are originally connected to i in G or get connected due to lower level eliminations. Suppose the child nodes ofiarec1 andc2. Form afrontal matrix

(3.3) Fi=Fi0↔l Uc1↔l Uc2,

where the symbol↔ldenotes theextend-addoperation [9, 22]. This operation performs matrix permutation and addition by matching the index sets following the ordering of the vertices inG. Then partitionFias

(3.4) Fi= Fi;1,1 Fi;1,2 Fi;2,1 Fi;2,2 ,

whereFi;1,1corresponds tosi, and perform an LU factorization as in (3.2).

The above procedure continues along the assembly tree until the root is reached. This is known as the standard multifrontal method. A bottleneck of the method is the storage and cost associated with the local dense frontal and update matrices.

The class of structured multifrontal methods [30, 31, 33] is based on the idea that, for certain matrices A, the local dense frontal and update matrices are rank structured. For some cases, Fi and Ui may be approximated by rank structured forms such as HSS forms. An HSS matrix F is a hierarchical structured form that can be defined via a postordered binary treeT called anHSS tree. Each nodei ofT is associated with some local matrices called generators that are used to defineF. A diagonal generatorDi is recursively defined, so thatDk≡F for the root kofT, and

for a nodei with childrenc1 andc2,

Di = Dc1 Uc1Bc1V T c2 Uc2Bc2V T c1 Dc2 ,

where the off-diagonal basis generatorsU, V are also recursively defined as Ui= Uc1Rc1 Uc2Rc2 , Vi= Vc1Wc1 Vc2Wc2 .

SupposeDi corresponds to an index setsi⊂ {1 :N}, where N is the size of F; then

F_i−≡F|si×({1:N}\si)andF |

i ≡F|({1:N}\si)×siare called HSS blocks, whose maximum (numerical) rank is called theHSS rank ofF. Also for convenience, suppose the root node ofT is at levell= 0.

In the structured multifrontal method in [33], HSS approximations to the frontal matrices are constructed. The factorization of an HSS frontal matrixFiyields an HSS update matrix Ui. This process may be costly and complex. A simplified version in [28] uses denseUi instead. To enhance the efficiency and flexibility, randomized HSS construction [24, 35] is used in [31]. The basic idea is to compress the off-diagonal blocks of Fi via randomized sampling [21]. Suppose Fi has size N ×N and HSS

(7)

rank r. Generate an N ×(r+µ) Gaussian random matrix X, where µ is a small integer. Compute

(3.5) Y =FiX.

The compression of an HSS blockF_i− ≡ Fi|si×({1:N}\si)is done via the compression of

(3.6) Si=Y −DiX|si(=F

−

i X|{1:N}\si).

Applying a strong rank-revealing factorization [16] to the above quantity yields Si ≈UiSi|ˆsi with Ui= Πi I Ei ,

where ˆsi is a subset ofsi and the entries ofEi have magnitudes around 1. This gives

theU generator in the HSS construction, so thatF_i−≈UiFi−|sˆi with high probability [21]. Due to the feature that the row basis matrixF_i−|sˆi is a submatrix ofF

−

i , this compression is also referred to as a structure-preserving rank-revealing factorization [35]. More details for generating the other HSS generators can be found in [24, 35]. The overall HSS construction follows a bottom-up sweep of the HSS tree.

In theRSMFmethod [31], the product (3.5) and selected entries ofFiare used to construct an HSS approximation toFi. A partial ULV factorization [6, 34] of the HSS form is then computed. Unlike other multifrontal methods, here the update matrixUi does not directly participate in the extend-add operation. Instead, its products with random vectors are used in a skinny extend-add [31] that only involves vectors and their row permutation and addition. This significantly enhances the flexibility and efficiency of extend-add. We sketch the main framework of the method, which slightly generalizes the method in [31] to nonsymmetric matrices. A pictorial illustration is given in Figure 2. In the following, we assumeXito be a skinny random matrix that plays the role ofX in (3.5) and is used for the HSS construction forFi.

=

Y

c1

Y

c1 c1 c1

X

i i0 Skinny extend-add =

Y

c2

Y

c2 c2 c2 =

Y

i Selected entries

Y

i i i HSS construction, factorization

HSS construction, factorization HSS construction, factorization

Selected

entries Selectedentries

~

Fig. 2. Illustration of theRSMFmethod in[31]. For simplicity, only a symmetric version is used for illustration.

(8)

1. For nodesi below a certain switching level ls of T, perform the traditional

multifrontal factorization.

2. For nodesi above level ls, construct an HSS approximation to Fi using se-lected entries of Fi and matrix-vector productsYi =FiXi and Zi =FiTXi. Here,Yi andZiare obtained from (3.8) below.

3. Partially ULV factorizeFi and compute Ui with the fast Schur complement update in HSS form. Compute

(3.7) Y˜i=UiX˜i, Z˜i=UiTX˜i,

where ˜Xiis a submatrix ofXi corresponding to the indicesUi as in (3.2). 4. For an upper level nodeiwith childrenc1andc2, compute skinny extend-add

operations (3.8) Yi= (Fi0Xi)−lY˜c1 −l Y˜c2, Zi= ((F 0 i) T_X i)−lZ˜c1 −lZ˜c2.

5. Form selected entries ofFibased onFi0, Uc1, and Uc2, and go to step 2. 4. Distributed-memory parallel randomized multifrontal method. Here, we present our distributed-memory parallel sparse direct solver based on the RSMF method, denotedPRSMF. The primary significance includes the following:

• We design a static mapping parallel model that (1) has two hierarchical layers of parallelism with good load balance and (2) can conveniently handle both HSS matrices and intermediate skinny matrices. ThePRSMFmethod signif-icantly extends the one in [28] by allowing more flexible parallel structured operations and also avoiding large dense intermediate frontal and update ma-trices.

• We develop some innovative parallel algorithms for a sequence of HSS opera-tions, such as a parallel HSS Schur complement computation in partial ULV factorization, a parallel extraction of selected entries from an HSS form, and a parallel skinny extend-add algorithm for assembling local data.

• The main ideas to reduce the communication costs in the new parallel al-gorithms are (1) taking advantage of the tree parallelism and store HSS generators in an appropriate way; (2) assembling and passing data between processes in compact forms so that the number of messages is reduced; (3) reusing data whenever possible; (4) collecting data into messages to facilitate BLAS3 dense operations; and (5) storing a small amount of the same data in multiple processes.

• The graph reordering strategies and the convenient data assembly make the method much more general than the parallel structured multifrontal method in [28] which is restricted to a specific mesh geometry.

In this section, an overview of the parallel model is given, followed by details of the parallel algorithms.

4.1. Overview of the parallel model and data distribution. In theRSMF method, both the assembly tree and the HSS tree have nice data locality. For a distributed memory architecture, a static mapping model usually requires less com-munication than a dynamic scheduling model. To avoid idling as much as possible, it is vital to keep work load balance among all processes. When generating the assembly tree and local HSS trees, we try to make the trees as balanced as possible. A roughly balanced assembly tree can be obtained withScotch[25] with appropriate strategies. More specifically, a multilevel graph partition method in [19] is used to generate the

(9)

3 1 2 7 0 6 4 5 0 1 15 0 1 10 8 9 14 1 13 11 12 Switching 30 31 Processes Process grid Parallel level l_p level l_s 0 1 2 3 3 2 2 3 3 2 0 1 3 2 18 16 17 22 2 21 19 20 25 23 24 29 3 28 26 27 0 1 2 3 0 1 0 1 23 0 1 2 3 3 2 0 1 3 2 0 1 0 1 0 1 23 23

Fig. 3.Static mapping model for ourPRSMFmethod, where the outer-layer tree is the assembly tree, and the inner-layer trees are HSS trees.

separators of the adjacency graph ofA. In this method, each time we use the vertex greedy-graph-growing method to obtain a rough separator, and refine this separator with some strategies so that the resulting lower level subgraphs are roughly balanced. For the HSS trees, we try to keep them balanced via the recursive bipartition of the index set of a frontal matrix.

Our parallel model (Figure 3) is described as follows. The basic idea is inher-ited from [28], and we further incorporate more structured forms to avoid dense frontal/update matrices and also allow more flexible choices of processes for the struc-tured operations. Below a parallel levellp, one process is used to handle the operations

corresponding to one subtree, called alocal tree. (The method also involves a switch-ing level ls, at which the local operations switch from dense to HSS ones so as to

optimize the complexity and also to avoid structured operations for very small ma-trices at lower levels [31].) Starting from levellp, processes are grouped into process

grids or contexts to work on larger matrices. The standard 2D block cyclic storage scheme in ScaLAPACK [4] is adopted to store dense blocks (such as HSS generators and skinny matrices). Process grids associated with the nodes of the assembly tree

T are generated as follows. Let Gi denote the set of processes assigned to a node i. Then,

• Gi∩Gj=∅ifiandjare distinct nodes at the same level;

• Gi = Gc1 ∪Gc2 for a parent node iwith child nodes c1 and c2. Usually,

the process grids associated withGc1 andGc2 are concatenated vertically or

horizontally to form the process grid associated withGi. With this approach, only pairwise exchanges between the processes are needed when redistributing matrices from the children process grids to the parent process grids. This can help reduce the number of messages exchanged and save the communication cost. See [28, 29] for more discussions. Here, since the algorithm frequently involves skinny matrices, the process grids are not necessarily square. This is different from those in [28, 29], where large local matrices are involved. In this setting, each process can only access a subset of nodes in the assembly tree. We call these nodes accessible nodes of a process. The accessible nodes of a process are those in the associated local tree and in the path connecting the root of the local tree and the root of the assembly tree. For example, in Figure 3, the

(10)

accessible nodes of process 0 are nodes 1,2, . . . ,7,15,31. A process only gets involved in the computations associated with its accessible nodes.

Along the assembly tree, child nodes pass information to parent nodes, and com-munication between two process groups is involved. Unlike the traditional multifrontal method and the structured one in [28] that pass dense update matrices, here, only skinny matrix-vector products and a small number of matrix entries are passed. On the process grid of a parent node, we redistribute these skinny matrices to the corre-sponding processes. We design a distribution scheme and a message passing method to perform the skinny extend-add operation in parallel. The result will be used in the parallel randomized HSS construction.

For a nodeiabovelp, if HSS operations are involved, smaller process subgroups

are formed by the processes in Gi to accommodate the parallel HSS operations for

Fi, Ui, etc. For each node i of the HSS tree T of Fi, a process group Gi ⊂ Gi is formed. Initially, for the root k of T, Gk ≡Gi, which is used for both Fi and Ui. Unlike the scheme in [23], Gk is also used for both children of k, so as to facilitate

the computation of Ui. (See the inner-layer trees in Figure 3.) For nodes at lower levels, the same strategies to group processes as above are used along the HSS tree to form the process subgroups. That is, for any nonleaf node i of T below level 1, Gi=Gc1∪Gc2, where childrenc1andc2are the children ofi. Each process subgroup

forms a process grid which stores the dense blocks (such as HSS generators and skinny matrices) in a 2D block cyclic scheme. The operations (such as multiplication and redistribution) on the dense blocks are performed with ScaLAPACK. As in [23], the HSS generatorsRi, Wi(and alsoDi, Uifor a leafi) are stored inGi, andBi is stored

in Gpar(i). The details of these grouping and distribution schemes will become clear

in the subsequent subsections.

We would also like to mention that the nonzero entries of the original sparse matrixA are needed in the formation of frontal matrices or in frontal matrix-vector multiplications if a frontal matrix is not formed explicitly. They are also needed in the formation of theD,B HSS generators. It is usually not memory efficient to store a copy of the data in each process. On the other hand, distributing the entries ofA without overlapping among processes may increase communication cost dramatically. Although we know what entries ofAare needed in the current front, we do not know in advance the specific destination processes of selected entries ofAthat are needed in the formation of theBgenerators in the randomized HSS construction algorithm. As a result, communications of the entries ofAcannot be avoided with a nonoverlapping mapping strategy, in general. Thus, we choose a trade-off between the memory and communication costs. That is, each process within a process groupGistores, besides its private data, some overlapping data. The overlapping data corresponds to the entries ofAassociated with the accessible nodes of the process. If the processes are in the same process groupGi, then they store the same overlapping data. An illustration of this storage scheme is given in Figure 4. This idea can also be applied to store the neighbor information of a separator in the symbolic factorization as well as solutions in forward and backward substitutions.

To summarize, two layers of tree parallelism are naturally combined in this parallel model: the parallelism of the assembly tree and the HSS trees. Within the distributed assembly tree and HSS trees, distributed dense operations for HSS generators and skinny matrices are facilitated by ScaLAPACK. The parallel struc-tured matrix operations and skinny matrix operations will be discussed in detail next.

(11)

Private data Overlapping data Process 1 Process 2 l=lp

...

l=1l=0

...

Process ... l=lp

...

l=1l=0 l=lp

...

l=1l=0

Fig. 4.Storage scheme for the sparse matrixA. Below the parallel levellp, each process stores a private portion ofA. Abovelp, each process stores overlapping data associated with its accessible nodes at each levell= 0,1, . . . ,lp.

4.2. Parallel randomized HSS construction, factorization, and Schur complement computation. Above the switching level ls, all frontal and update

matrices are in HSS forms. The method in [28] is only partially structured in the sense that dense frontal matrices are formed first and then converted into HSS forms, and the updated matrices are also dense. Here, the HSS construction for a frontal matrix

Fiis done via randomization, and a partial ULV factorization ofFialso produces an HSS form Schur complement/update matrixUiin (3.2).

To involve as many processes as possible, we assign all the processes in the pro-cess groupGito perform the HSS construction, partial ULV factorization, and Schur complement computation. The HSS treeT forFi has two subtrees. The left subtree T1corresponds to the pivot blockFi;1,1 in (3.4), and the right subtreeT2corresponds

to Fi;2,2 and thus the Schur complement Ui. Note thatGi is used for the HSS con-structions of bothFi;1,1 andFi;2,2. This is different from the storage scheme in [23].

See the inner-layer tree associated with node i = 31 in Figure 3 for an illustration. The individual parallel randomized HSS constructions forFi;1,1andFi;2,2follow [23].

Initially, each process stores a piece of the matrix-vector products (3.5), which are used to compress the leaf-level HSS blocks. These products are updated when the compression moves up along the HSS treeT. Intergrid redistributions are involved for pairs of siblings in the HSS tree. After the HSS constructions forFi;1,1 andFi;2,2, we

continue to useGi to construct the remaining generatorsBk,( Rk₁

Rk₂) fork= root(T1)

with children k1, k2 and Bq,( Wq1

Wq2) for q = root(T2) with children q1, q2. Figure 5

shows how the HSS generators are stored among the processes.

The parallel partial ULV factorization follows a scheme in [23], except that it stops after all the nodes inT1 are traversed. The ULV factors associated with each

tree node are stored in the corresponding processes. The partial ULV factorization reducesFi;1,1to a final reduced matrix ˜Dk [30].

At this point, we need to form the Schur complement Ui in an HSS form. The blocks ˜Dk, Rk1 Rk₂ , Wq₁ Wq₂

are stored in the process groupGi, which is also the process group Gq for node q. To obtain the HSS form of Ui, we only need to update the D, B generators ofFi;2,2, which are stored within the subgroups ofGq. This is done

via a fast update method in [31]. (The method in [31] is symmetric and here a

(12)

( )

Rk1 0 1 2 3 0 1 Rk2 Bk (Wq1,Wq2) T T V1T U2 0

0

1 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

1

2

3

0

1

2

3

22 33 Pivot block F_i;1,1 D1 B2 0 1 2 3 0 1 2 3 0 1 0 1 D2 F_i;2,2

Fig. 5. Distributed memory storage for an HSS form frontal matrix Fi given four processes, wherek1, k2are the children ofk(root of the HSS treeT1 forFi;1,1) andq1, q2are the children ofq (root of the HSS treeT2 forFi;2,2). See the inner-layer tree associated with nodei=31in Figure3.

nonsymmetric variation is used.) The computations are performed in disjoint process groups at each level in a top-down traversal of T2. A sequence of products Si are

computed for the update. EachSi is computed inGi and then redistributed for reuse

at lower levels. Its submatrices are used to update the generatorsDi, Bi. The details

are given in Algorithm 1. An important observation from this update process is that the HSS rank ofUi is bounded by that ofFi[30].

After this, we then compute ˜Yiand ˜Ziin (3.7) in parallel. This may be based on direct HSS matrix-vector multiplications, or an indirect procedure [31] that updates Yi. The detailed parallelization is omitted here. ˜Yi and ˜Zi are passed to the parent node to participate in the skinny extend-add operation.

Remark 4.1. In our solver, the nested HSS structure has a very useful feature that significantly simplifies one major step: the formation of the Schur complement/update matrix (the mathematical scheme behind Algorithm 1). That is, the nested HSS structure enables the Schur complement computation to be conveniently done via some reduced matrices, so that only certain generators of the frontal matrix need to be quickly updated. Such a mathematical scheme was designed and proved in [30, 31]. For 3D problems, the HSS structure is indeed less efficient than more advanced structures such asH2_{and multilayer hierarchical structures [32]. The HSS form serves}

as a compromise between the efficiency and the implementation simplicity.

4.3. Parallel skinny extend-add operation. The dense updated matrices used in the parallel partially structured multifrontal method in [28] need not only more computations but also more communications. In fact, the extend-add oper-ation is a major communicoper-ation bottleneck of the method in [28], just like in the standard parallel multifrontal method. In the 2D block cyclic setting, the standard dense extend-add in (3.3) requires extensive intragrid communications. Both row and column permutations are performed. Entries in the same row/column of an update

(13)

Algorithm 1. Parallel computation of the HSS Schur complement/update matrixUi. 1: procedure PSchur 2: Sq ←Bq WqT1 W T q2 _˜ D−_k1 Rk1 Rk2 Bk

.Performed in the process groupGq;q: root of the HSS treeT2 forFi;2,2 3: foreach levell= 0,1, . . .ofT2 do .Top-down traversal of T2 4: foreach nodeiat levell do .Performing operations in the process groupGi and redistributing information 5: if iis a nonleaf nodethen

6: Si=

Si;1,1 Si;1,2

Si;2,1 Si;2,2

. Conformable partition following

Rc1

Rc2

7: RedistributeSi;1,1 andSi;2,2 toGc1 andGc2, respectively

. Gc1, Gc2: children process groups

8: Sc1←W T c1Si;1,1Rc1,Sc2 ←W T c2Si;2,2Rc2 .Performed in Gc1, Gc2 9: Bc1←Bc1−Si;1,2,Bc2 ←Bc2−Si;2,1 . Bc1, Bc2: stored inGi 10: else

11: Di ←Di−Si .Leaf levelD generator update

12: end if

13: end for 14: end for 15: end procedure

matrix are added to one same row/column of a frontal matrix after permutations. For parallel permutations on a square process grid with P processes, each process com-municates withO(√P) processes in the same column, as well asO(√P) processes in the same row.

Our parallel randomized multifrontal solver uses, instead, the skinny extend-add operations (3.8) and significantly saves the communication cost. The matrices to be permuted are tall and skinny ones instead of large square ones, and moreover, only row permutation is needed. Figure 6 illustrates the storage of the skinny matrices on a process grid.

To further reduce the communication overhead, we use a message passing method [3, 20] to reduce the number of messages. Instead of passing information row by row, we group rows according to their destination processes. The rows needed to be sent

0 1 2 3 4 5 6 7 8 9 10 11 1213 14 15 0 1 2 3 4 5 6 7 8 9 10 11 1213 14 15 0 1 2 3 4 5 6 7 8 9 10 11 1213 14 15 0 1 2 3 4 5 6 7 8 9 10 11 1213 14 15 0 1 2 3 4 5 6 7 8 9 10 11 1213 14 15 0 1 4 5 8 9 1213 0 1 2 3 4 5 6 7 0 1 4 5 c1 c2 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 Yc1 ~ 0 1 2 3 0 1 2 3 0 1 2 3 Yc2 ~

(i) Standard parallel extend-add (ii) Skinny parallel extend-add Fig. 6. Parallel skinny extend-add operation as compared with the standard dense one, where the matrices are stored on a process grid. The skinny one deals with much less data and also needs only row permutations.

(14)

Algorithm 2. Parallel skinny extend-add operation.

1: procedure PSExtAdd . Yi= (Fi0Xi)−l Y˜c1 −lY˜c2 2: Yi← Fi0Xi(stored in the process gridG)

3: forY˜ = ˜Yc1,Y˜c2 do . ProcessingY˜c1 andY˜c2 individually 4: forcolumnj = 1, . . . ,ncoldo . ncol: number of process columns in G

5: foreach processpin columnj ofGdo .Packing message

6: pˆ←row coordinate ofpin G

7: forrowi= 1, . . . ,locrdo

.locr: number of local rows of Y˜ stored in p

8: qˆ←row coordinate of the destination process for ˜Y|i in G 9: Insert ˜Y|iand its destination row index inYiinto messageMpˆ→qˆ

10: end for

11: end for

12: foreach processpin columnj ofGdo.Sending & receiving messages

13: pˆ←row coordinate ofpin G

14: forqˆ= 1, . . . ,nrowdo . nrow: number of process rows in G

15: if pˆ6= ˆqthen .Locally blocking send operation

16: Send messageMpˆ→qˆifMpˆ→qˆ6=∅

17: end if

18: end for

19: forqˆ= 1, . . . ,nrowdo

20: if pˆ6= ˆqthen . Globally blocking receive operation

21: Receive messageMqˆ→pˆifMqˆ→pˆ6=∅

22: end if

23: Add rows inMqˆ→pˆto Yi according to their

destination row indices ifMqˆ→pˆ6=∅

24: end for

25: end for

from one process to another are first wrapped into a message, which is sent later. Since the update and frontal matrices are both distributed in a block-cyclic layout, it is not hard to figure out the target process of each local row by using functionsindxl2gand

indxg2pin ScaLAPACK. The messages are sent and received with BLACS [8] routines

for point to point communications. The send operations are locally blocking. The receive operations are globally blocking. The parallel skinny extend-add algorithm is outlined in Algorithm 2.

4.4. Extracting selected entries from HSS update matrices. The matrix-vector productsYi (and also Zi) obtained in the previous subsection are then used to construct an HSS approximation toFi. This still needs selected entries ofFithat will be used to form theB generators and leaf levelDgenerators [24, 35]. Given the row and column index sets, we can extract the entries ofFifromFi0,Uc1, andUc2 in

(3.3). F0

i is immediately available, and we focus onUc1 andUc2.

The indices of the desired entries of Fi are mapped to those of Uc1 and Uc2

(Figure 7). Thus, we consider how to extract in parallel selected entriesU |I×J of an HSS matrixU, as specified by a row index setI and a column index setJ. Suppose

(15)

c1 c2

i

Fig. 7.Mapping the row and column index sets of selected entries ofFito those ofUc1 andUc2.

U has HSS generatorsDi, Ui, Ri, Bi, etc. The basic idea of the entry extraction in [31]

is to reuse intermediate computations while performing matrix-vector multiplications. Here, instead of using matrix-vector multiplications like in [31], we group indices as much as possible so as to enable BLAS3 operations.

Among the leaf nodes of the HSS tree ofU, identify all the leaves i and j that correspond to I and J, respectively. Suppose i and j are associated with subsets

Ii⊂ I and Jj⊂ J and local index sets ˜Ii and ˜Jj ofUiand Vj, respectively. Let the

path connecting nodesiandj in the HSS tree be

(4.1) i−i1−i2− · · · −id−jd− · · · −j2−j1−j,

whered≡distance(i, j) is the number of ancestors ofiorj in the path and is called thedistance betweeniandj [31]. Then obviously,

(4.2) U |Ii×Ij =Ui|_Ii˜ Ri1Ri2· · ·Rid−1BidW T jd−1· · ·W T j2W T j1V T j |_Ij˜ .

By grouping the indices asIiandIj, we reduce the number of paths to be traversed

fromO(N) in [31] toO(N_r), whereN is the size ofU andris its HSS rank.

To save computational costs, some intermediate matrix-matrix products can be precomputed for all the entries inU |I×J in the same rows/columns. That is, in (4.2), a portion of the following product is stored if the corresponding subpathi−i1−i2− · · · −id−1 is shared by multiple paths like (4.1):

(4.3) Ωdi ≡Ui|_Ii˜Ri1Ri2· · ·Rid−1.

This is similar for Θd

j ≡ VjWj1Wj2· · ·Wjd−1. Ω d

i and Θdj are computed in process

gridsGid−1 and Gjd−1, respectively. Then the stored results Ω d

i and Θdj are used to

compute the entries as

U |Ii×Jj = Ω d iBid(Θ d j) T_.

If it is necessary to extend the subpathi−i1−i2− · · · −id−1 for largerd, then Ωdi

and Θd

j will be redistributed to the parent process grids for additional computations.

Thus, Ωd

i and Θdj are computed progressively. See Algorithm 3.

Since the HSS generators are distributed on different process grids, this technique also helps to reduce the number and volume of messages in the parallel implementa-tion. This is especially attractive if many entries to be extracted are from the same rows or columns.

(16)

Algorithm 3. Parallel extraction ofU |I×J from an HSS matrixU.

1: procedure PHSSIJ

2: foreach leafiofT do . T: HSS tree ofU

3: Find the corresponding subsetsIi ⊂ I andJi⊂ J

4: Find the corresponding local index sets ˜Ii ofUi and ˜Ji ofVi 5: end for

6: foreach leafi(withIi6=∅) ofT do . Precomputation

7: d←sorted vector of nonzero distance(i, j) for all leavesj6=iwithJj6=∅ 8: Ω0_i ←Ui|_Ii˜ , c←0, k←i . Performed in the process groupGi 9: forι= 1,2, . . . ,length(d)do

10: d←dι, Ωdi ←Ω c

i . Ω

d

i is for the partial product as in(4.3)

11: ford˜=dι−1, . . . ,dι do .Assuming d0= 1

12: Ωd_i ←Ωd_iRk. Computing(4.3)step by step in the process groupGk 13: Redistribute Ωdi from the process groupGk to Gpar(k)

14: k←par(k)

15: end for

16: c←d

17: end for 18: end for

19: Repeat steps 6–18 withIi,Jj,I˜i, Rk,Ωdi replaced byJi,Ij,J˜i, Wk,Θdi,

respectively

20: foreach leafi(withIi6=∅) ofT do .Extracting the entries 21: foreach leaf j (withJj 6=∅) ofT do

22: if i==j then .Diagonal block extraction

23: U |Ii×Jj ←Di|Ii×˜ _Jj˜ . Performed in the process groupGi

24: else . Off-diagonal block extraction

25: d←distance(i, j)

26: k←the largest ancestor ofi in the path connectingi toj

27: U |Ii×Jj ←ΩdiBk(Θdj)T .Performed in the process groupGk

28: end if

5. Analysis of the communication cost. The parallel framework together with the individual parallel structured operations in the previous section give our

PRSMF method. The parallel solution method is similar to that in [28]. We can

conveniently study the performance of the PRSMF method. The factorization cost and storage are given in [31]. For discretized sparse matrices, when certain rank conditions are satisfied (such as in elliptic problems), the factorization costs roughly O(n) flops in two dimensions and roughlyO(n) toO(n4/3_{) in three dimensions. The}

storage for both two dimensions and three dimensions is near O(n). Here, some polylogarithmic terms ofnmay be omitted.

In the following, we estimate the communication cost. For convenience, suppose the maximum HSS rank of all the frontal matrices isr. In practice, the HSS ranks often depend on the applications and the matrix size. Our analysis here then gives a reason-able upper bound. For simplicity, assume the switching levellsis below the parallel

levellp in the assembly treeT. Thus, HSS operations are used at every parallel level.

(17)

Let P be the total number of processes, ˆP be the minimum number of process within the process grids. The number of processes of each process grid at levellofT

is

Pl= P 2l−1.

Suppose a process grid withPlprocesses hasO(

√

Pl) rows and columns. Also suppose all the HSS generators of the structured frontal matrices are of sizesO(r)×O(r).

We first review the communication costs of some basic operations. Use #messages and #words to denote the numbers of messages and words, respectively.

• For ak×kmatrix, redistribution between a process grid and its parent process requires #messages =O(1) and #words at mostO(k_ˆ2

P).

• On the same process grid withP processes, the multiplication of twoO(r)×

O(r) matrices costs #messages = O(r_b) and #words = O(√r2 ˆ P

), where b is the block size used in ScaLAPACK.

Suppose a frontal matrixF is at levellofT and has sizenl. F corresponds to a process grid withPlprocess. Then at level lof the HSS treeT forF, the number of processes is Pl

2l−1. The number of parallel levels of the HSS tree isO(logPl).

To analyze the communication costs at the factorization stage, we investigate the communication at each levellof the assembly tree.

1. Randomized HSS construction and ULV factorization. In [23], it is shown that both operations cost

#messages =O rlog2Pl , #words =O r 2_log2_P l p ˆ P ! .

2. Structured Schur complement computation and multiplication of the Schur complement and random vectors.

For each process, the information is passed in a top-down order along the HSS tree. The major operations are matrix redistributions and multiplica-tions, since the additions on the same process grid involve no communication. The redistribution costsO(logPl) messages andO(r

2_log_P

l

ˆ

P ) words. The

mul-tiplications between HSS generators at each level cost O(r_b) messages and O(_√r2 ˆ P ) words. Thus #messages =O(logPl) + O(logPl) X l=1 Or b =Or blogPl , #words =O _r2_log_P l ˆ P + O(logPl) X l=1 O r 2 p ˆ P ! =O r 2_log_P l p ˆ P ! . The sampling of the structured Schur complement is performed via HSS matrix-vector multiplications. The communication costs are similar to the estimates above.

3. Skinny extend-add operation and redistribution of the skinny sampling ma-trices.

The skinny matrices are of size O(nl)×O(r). Each process communicates with processes in the same column on the process grid. Thus

#messages =OpPl , #words =O _rn l Pl .

(18)

The redistribution of the skinny sampling matrices costs #messages =O(1), #words =O _rn l ˆ P . 4. Extracting selected entries from HSS matrices.

For each process, the information is passed in a bottom-up order along an HSS tree. The major operations are matrix redistribution and multiplications. Similar to the communication costs of the structured Schur complement com-putation, #messages =O(logPl) + O(logPl) X l=1 O(r b) =O(rlogPl), #words =O _r2_log_P l ˆ P + O(logPl) X l=1 O r 2 p ˆ P ! =O r 2_log_P l p ˆ P ! . The total communication costs can be estimated by summing up the above costs. The total number of messages is

#messages = O(logP) X l=1 OpPl +O rlog2Pl =O √ P+O rlog3P. The total number of words is

#words = O(logP) X l=1 O _rn l ˆ P +O r 2_log2 Pl p ˆ P !! ,

which depends on specific forms ofnl. For 2D and 3D problems,nl=O(

√

n/2bl/2c) andO(n2/3/2bl/2c), respectively. Thus, the total number of words for 2D problems is

#words =O _r√_n ˆ P +O r 2_log3 P p ˆ P !

and for 3D problems is

#words =O rn 2 3 ˆ P ! +O r 2_log3 P p ˆ P ! .

In comparison, the parallel solver in [28] has communication costs of #messages = O(√P) +O(rlog3P) and #words =O _n ˆ P +O r √ nlog2P p ˆ P ! for 2D problems, #words =O n 4 3 ˆ P ! +O rn 2 3log2P p ˆ P !

for 3D problems. In particular, the numbers of words of our new parallel solver are smaller by factors of aboutO(

√ n

r ) andO( n23

r ), respectively.

6. Numerical experiments. In this section, we show some performance results of ourPRSMFsolver. For convenience, we simply refer to it asPRSMFin this section. A sequences of tests are performed for Poisson’s equation, a linear elasticity equation,

(19)

Helmholtz equation, and a elastic wave equation in two or three dimensions. We carried out our experiments on a Cray XC30 cluster at TOTAL E&P Research & Technology USA. Each node has 20 cores and 64 GB memory. The number of nodes available for our tests is 52. The peak performance of each core is 23.30 Gflops. We show both the weak scalability and the strong scalability of the solver, as compared with the standard parallel multifrontal solver (by settingls= 0 inPRSMF), denoted

PMF. We report the following performance measurements:

• Time: the runtime for the factorization ofAand the solution of Ax=b.

• Flops and flop rate: the number of floating point operations in the factoriza-tion and the corresponding flop rate.

• Factor size: the number of nonzero entries for storing the factors.

• Peak memory per process: the maximum memory occupied by one process for any process.

• Accuracy: the relative residual ||Ax−b||2

||b||2 , where b is generated via a random

exact solution.

In PRSMF, the maximum HSS rank of all the frontal matrices is also reported.

We also discuss how the variation of some parameters (the switching level and the sampling size) impacts the performance. In our current implementation, the number of processes used is equal to 2lp−1_{, where}_l

p is the parallel switching level. Thus, the

strong scaling tests below indicate the dependency of the code on lp. The following

notation will be used for convenience:

• r˜: the sampling size in randomized compression (the column size ofX in 3.5).

• τ: the relative tolerance of rank-revealing factorizations of the matrix-vector products as in (3.6).

• lmax: the total number of levels in the assembly tree.

6.1. Poisson’s equation and linear elasticity equation in two dimen-sions. We first look at two PDEs in two dimensions, Poisson’s equation and a linear elasticity equation, which are known to be suitable for structured multifrontal methods [7, 33]. That is, in the multifrontal factorization, the frontal matrices have relatively small off-diagonal numerical ranks.

For 2D Poisson’s equation discretized with the five-point stencil, we show the weak scalability by letting both the matrix size n and the number of processes P increase by a factor of 2. Accordingly,lmax is increased by 1 every time so that the

sizes of the smallest submeshes after nested dissection remain almost the same. The number of levels (lmax−ls) below the switching level ls remains the same, so that

the factorization costs below and above ls are nearly the same and the total cost is

minimized [31]. (See section 6.4.) This also implies that all frontal matrices larger than a certain size are approximated by HSS forms. Here, we setlmax−ls= 9, which

roughly gives the optimal factorization complexity in the tests, as shown in section 6.4. More details on the selection oflmax−lsfor structured multifrontal methods can

be found in [30, 33].

In randomized compression, the sampling size is set to be ˜r = 200, and the compression tolerance is set toτ = 10−5_{. Since our solver aims for relatively general}

sparse linear systems, in all these tests, it does not specifically exploit the positive definiteness of the matrices. The numerical results are shown in Table 1 and Figure 8. The CPU time for factorization and solution is shown. A weak scaling pattern can be observed. PRSMFalso has a clear advantage in terms of the memory. For the matrix withn= 256×106_, _PRSMF_{needs less than 1}_/_{2 of the parallel factorization time of}

PMF and about 1/3 of the flops. Let the upper bound of the total memory at peak

(20)

Table 1

Parallel weak scaling test for discretized matrices from2D Poisson’s equation.

Mesh sizen 16×106 ₃₂_×₁₀6 ₆₄_×₁₀6 ₁₂₈_×₁₀6 ₂₅₆_×₁₀6

lmax 21 22 23 24 25

Number of processesP 32 64 128 256 512

PMF

Factorization time (s) 39.80 59.85 108.20 108.41 113.58 Factorization flops 1.99E12 6.11E12 1.81E13 4.05E13 1.17E14 Flop rate (Gflops) 50.03 102.09 167.28 373.58 1030.13 Factor size (GB) 28.65 60.70 125.45 253.89 526.51 Peak memory 3.05 3.24 3.38 3.65 4.12 per process (GB) Solution time (s) 1.53 2.21 2.25 2.78 3.15 PRSMF Factorization time (s) 32.44 37.91 50.82 49.03 55.83 Factorization flops 1.67E12 3.64E12 7.14E12 1.46E13 2.93E13 Flop rate (Gflops) 51.48 96.02 140.50 297.78 524.81 Factor size (GB) 17.51 34.59 67.46 163.97 351.60 Peak memory 2.23 2.27 2.36 2.59 3.38 per process (GB) Maximum HSS rank 97 101 113 163 190 Solution time (s) 1.58 2.05 2.89 2.75 3.39 Relative residual 8.34E−6 9.21E−6 5.48E−6 8.44E−5 1.20E−5

107 108 1012 1013 1014 n Flops PMF PRSMF O(n) reference line

107 108 101 102 103 n Factor size (GB) PMF PRSMF O(n) reference line

(i) Factorization flops (ii) Factor sizes

Fig. 8. Factorization flops and factor sizes in Table 1 for discretized matrices from the2D Poisson’s equation.

be the product of peak memory per process and the number of processes P. Then PMFneeds nearly 380 GB more memory thanPRSMFwhen both are at peak.

Similarly, we apply the solver to a linear elasticity equation near the incompress-ible limit:

−(µ∆u+ (λ+µ)∇∇ ·u) =f in Ω, u= 0 on∂Ω,

whereuis the displacement vector field on a rectangular domain Ω, andλ, µare Lam´e constants. Here, λ/µ= 105_{, which leads to ill-conditioned discretized matrices. We}

perform a weak scaling test similarly to the Poisson case. InPRSMF, the parameters ˜

r= 200 and τ = 10−5 _{are used, and the number of levels below the switching level}

islmax−ls= 12. The test results are shown in Table 2. For the largest matrix, the

(21)

Table 2

Parallel weak scaling test for discretized matrices from the2D linear elasticity equation.

Matrix sizen 7,992,002 17,988,002 31,984,002 71,976,002 127,968,002

lmax 20 21 22 23 24

Number of processesP 16 32 64 128 256

PMF

Factorization time (s) 19.47 39.20 56.93 69.79 68.42 Factorization flops 0.58E12 3.11E12 5.35E12 1.23E13 3.27E13 Flop rate (Gflops) 29.79 79.34 93.98 176.24 477.93 Factor size (GB) 11.88 31.65 56.15 137.89 250.38 Peak memory 2.34 3.01 3.37 3.85 4.15 per process (GB) Solution time (s) 0.92 1.51 2.59 2.94 3.58 PRSMF Factorization time (s) 17.20 27.00 31.75 44.04 42.98 Factorization flops 0.53E12 1.61E12 2.68E12 6.27E12 1.13E13 Flop rate (Gflops) 30.81 59.63 84.40 142.36 262.93 Factor size (GB) 9.46 22.81 43.40 95.69 180.25 Peak memory 2.02 2.45 2.50 2.73 2.96 per process (GB) Maximum HSS rank 126 113 122 149 158 Solution time (s) 0.83 1.56 2.07 3.04 3.42 Relative residual 2.23E−4 4.30E−4 3.95E−4 2.88E−4 2.93E−4

Number of processes 16 32 64 128 256 Fa ctorization time (s) 101 102 n = 49,980,002 n = 31,984,002 Number of processes 16 32 64 128 256 Sol ut ion time (s) 2 3 4 5 6 n = 49,980,002 n = 31,984,002

(i) Strong scaling of factorization time (ii) Strong scaling of solution time Fig. 9. Strong scaling tests withPRSMFfor two discretized matrices from the 2D linear elas-ticity equation.

parallel factorization time ofPRSMFis less than 2/3 of the time of PMF, and the flop count is less than 1/2.

For two matrices of size 31,984,002 and 49,980,002, we display the strong scaling results of the PRSMFby changing the number of processes. Strong scaling patterns can be observed in both the factorization stage and the solution stage. See Figure 9. When the number of processesP doubles, the parallel factorization time reduces by factors close to 1.7. We also display the parallel efficiency of PRSMFin Figure 10 for matrices of sizes 7,992,002 and 17,988,002. With 16 processes, the parallel efficiency is slightly over 70%.

6.2. 3D Helmholtz equation. Next, consider the numerical solution of the 3D Helmholtz equation −∆−ω 2 v2 u=f,

(22)

100 101 102 103 0 20 40 60 80 100 Number of processes Parall e l efficiency (%) n = 7,992,002 n = 17,988,002

Fig. 10.Parallel efficiency ofPRSMFfor two discretized matrices from the2D linear elasticity equation.

where ω is the angular frequency, v is the seismic velocity field, andf is the forcing term. The equation is discretized on 3D tetrahedron meshes with the continuous Galerkin method. The number of sample points per wavelength is fixed to be 5. The perfectly matched layer boundary condition is used. The P wave speed is 3.4 km/s, and ω slowly increases from about 6 Hz to 10 Hz. The meshes have 20, 40, 60, and 100 million elements, respectively. The sizes of the corresponding discretized matrices range from 3,233,795 to 15,794,036. The numbers of nonzero entries in the matrices are also reported. In seismic imaging, the Helmholtz equation is often solved with relatively low accuracies. Here, we conduct the tests in single precision. The compression tolerance inPRSMFis τ= 10−3_{. The sampling size ˜}_r _{is set to be 1200,}

1500, 1800, and 2100 for these four matrices, respectively. The number (lmax−ls) of

levels below the switching level is 9 or 10. Table 3 shows the results.

PRSMFhas a significant advantage in the flop count and storage. This can also be observed from Figure 11, and the results are consistent with the theoretical estimates in [31]. For the third matrix, PRSMF needs about 1/4 of the flops and about 1/2 of the storage for the factors. The factorization time of the new solver is always shorter. The great memory advantage of the new solver is very attractive since the memory issue plays a critical role in the design of 3D direct PDE solvers. For the largest size, PMF runs out of memory. The flop rate of PRSMF is lower due to the large number of small dense matrices (HSS generators and skinny matrices). The communication pattern of the new solver is also more complicated. On the other hand,PMFworks mainly on larger dense matrices and exploits the BLAS3 operations intensively. Iterative refinements can be used to improve the accuracy of the solver. For the largest matrix, 10 steps of iterative refinements reduce the relative residual to around 10−6_.

For two of the matrices, we also display the strong scaling results of PRSMF in Figure 12. For each matrix, when the number of processes P doubles, the parallel factorization time reduces by factors about 1.7∼1.8.

6.3. 3D elastic wave equation. Then consider the solution of the 3D elastic wave equation

−ρω2u− ∇ ·(c:∇u) =f, ν·(c:∇u) =g,

(23)

Table 3

Parallel weak scaling tests for discretized matrices from the3D Helmholtz equation.

Matrix sizen 3,233,795 6,387,657 9,754,486 15,794,036 Number of nonzeros 50,233,223 99,696,607 152,290,784 247,634,526 lmax 14 15 16 17 Number of processesP 64 128 256 512 PMF Factorization time (s) 430.46 871.28 1078.77 Factorization flops 1.79E14 5.32E14 1.35E15 Flop rate (Gflops) 415.81 610.69 1251.48 Factor size (GB) 103.74 256.34 420.78 Peak memory

6.05 7.14 7.24 per process (GB)

Solution time(s) 8.67 17.93 24.82 Relative residual 4.50E−6 4.97E−5 4.98E−5

PRSMF

Factorization time (s) 376.53 655.93 817.22 981.67 Factorization flops 9.96E13 1.84E14 2.83E14 5.13E14 Flop rate (Gflops) 264.52 280.58 346.78 522.88 Factor size (GB) 62.39 138.85 212.96 350.32 Peak memory 5.58 5.70 5.66 6.09 per process (GB) Maximum HSS rank 1077 1342 1653 2021 Solution time (s) 7.82 15.58 25.32 33.04 Relative residual 2.48E−3 9.46E−3 1.76E−2 1.38E−2 Number of iterative

3 7 9 10

refinement steps Relative residual

1.50E−6 2.86E−6 1.76E−6 1.20E−6 after refinements n _×107 0.4 0.6 0.8 1 1.2 1.4 1.6 Flops 1014 1015 PMF PRSMF O(n4/3_{) reference line}

n _×107 0.4 0.6 0.8 1 1.2 1.4 1.6 Factor size (GB) 102 103 PMF PRSMF O(n) reference line

Fig. 11. Factorization flops and factor sizes in Table3 for discretized matrices from the3D Helmholtz equation.

where ρ is the density, ω is the frequency, c is the elastic stiffness tensor, uis the displacement,ν is the outward normal to the boundary of the domain, and the colon represents tensor contraction. The equation is discretized on meshes with 2, 5, 10, and 20 million elements in the finite element method. The sample points per wavelength are fixed to be 5 for all these meshes. Single precision is used. PRSMFuses a compres-sion tolerance τ = 10−3 _{and a uniform sampling size ˜}_r _{= 1000. Also,}_l

max−ls= 9.

The results are given in Table 4, where we can observe clear advantages of the new solver in terms of the speed and memory. For the matrix withn= 9,701,385,PRSMF takes less than 1/3 of the flops of PMF, less than 3/5 of the parallel factorization

(24)

32 64 128 256 250 500 1000 2000 4000 Number of processes Factorization time (s) n = 9,754,486 n = 6,387,657 Number of processes 32 64 128 256 Sol ut ion time (s) 10 20 40 n = 9,754,486 n = 6,387,657

(i) Strong scaling of factorization time (ii) Strong scaling of solution time Fig. 12. Strong scaling tests withPRSMFfor two discretized matrices from the3D Helmholtz equation.

Table 4

Parallel weak scaling tests for discretized matrices from the3D elastic wave equation.

Matrix size 919,683 2,496,888 4,907,088 9,701,385 Number of nonzeros 41,798,079 114,950,592 227,587,086 452,099,007 lmax 12 13 14 15 Number of processesP 64 128 256 512 PMF Factorization time (s) 144.80 509.05 1045.07 1853.94 Factorization flops 3.34E13 2.45E14 8.80E14 2.59E15 Flop rate (Gflops) 230.68 481.34 842.08 1397.02 Factor size (GB) 33.93 130.48 288.27 583.44 Peak memory

2.11 4.37 5.24 5.47 per process (GB)

Solution time (s) 3.85 9.61 21.50 30.38 Relative residual 3.46E−7 4.84E−7 1.59E−7 1.87E−7

PRSMF

Factorization time(s) 109.76 304.15 457.04 680.51 Factorization flops 2.42E13 1.19E14 2.12E14 4.68E14 Flop rate (Gflops) 220.45 391.26 463.85 687.72 Factor size (GB) 24.34 87.64 183.38 382.23 Peak memory 1.61 3.29 3.65 3.94 per process (GB) Maximum HSS rank 477 813 901 976 Solution time (s) 3.79 9.43 18.61 28.40 Relative residual 2.81E−6 4.33E−6 8.05E−6 1.35E−6

time, and about 2/3 of the storage. Let the upper bound of the total memory at peak be the product of the peak memory per process and the number of processesP. Then PMF may need nearly 780 GB more memory than PRSMF when both are at peak. Also, when nincreases from 2,496,888 to 9,701,385, the flop count of PRSMF increases by a factor around 2, while the count of PMFincreases by a factor around 3.

6.4. Varying the switching level and sampling size. As mentioned in sec-tion 6.1, the switching levellsis a user specified level beyond which HSS methods are

applied. As shown in [31],lsis chosen so thatlmax−lsis nearly constant. This means

that we can use a small sizento roughly determine the optimal switching level ls.

For example, for the 3D Helmholtz equation in section 6.2, consider the matrix of sizen=3,233,795. With 64 processes,τ = 10−3_{, ˜}_r_{= 1200, and}_l

max= 14, we vary

(25)

106 107 1014 1015 n Flops PMF PRSMF

O(n4/3) reference line

106 107 102 103 n Factor size (GB) PMF PRSMF

O(n) reference line

Fig. 13. Factorization flops and factor sizes in Table4 for discretized matrices from the3D elastic wave equation.

Table 5

Performance ofPRSMFfor the discretized3D Helmholtz equation in section6.2with different switching levelsls forn= 3,233,795.

ls 3 4 5 6 7

Factorization flops 1.25E14 1.06E14 9.96E13 1.03E14 1.09E14 Factor size (GB) 81.31 71.45 62.39 63.01 69.28 Relative residual 1.60E−3 3.64E−3 2.48E−3 1.46E−2 4.81E−2

Table 6

Performance ofPRSMFfor the discretized3D Helmholtz equation in section6.2with different

˜

randτforn=3,233,795.

Sampling size ˜r 1000 1200 1400 1600 Compression toleranceτ 1E−2 1E−3 1E−4 1E−5

Factorization flops 8.66E13 9.96E13 1.08E14 1.13E14 Factor size (GB) 53.87 62.39 72.51 81.62 Maximum HSS rank 658 1077 1317 1544 Relative residual 5.08E−2 2.48E−3 9.62E−4 1.82E−5

ls and compare the cost of the factorization and the storage. See Table 5. Clearly, ls= 5 orlmax−ls= 9 gives the optimal complexity and storage.

The sampling size ˜r is usually set to accommodate the desired accuracy of the randomized HSS approximation. An increase in ˜rwill lead to the increase of the cost for matrix-vector multiplications and RRQR factorizations. When the compression tolerance τ reduces accordingly, higher accuracies can be achieved. To see this, we test the same matrix as in Table 5, but with varying ˜r andτ and with fixedls= 5.

The number of processes is 64. The results are given in Table 6. Hence, ˜r andτ can be used to control the accuracy of the solution. If we use lower accuracies for faster factorization and solution, we may use iterative refinement to improve the accuracy, or use the solver as a preconditioner.

7. Conclusions. In this paper, we have presented a distributed-memory alge-braic randomized structured multifrontal solver with two levels of tree parallelism. Hierarchical structured methods are shown to be naturally scalable. Unlike previous efforts in [28], the new solver enhances both the computation and the communication

(26)

by using randomization to avoid dense intermediate matrices. It can also handle more general matrices. Several parallel HSS algorithms have been developed for the solver. The solver achieves the desired complexity and shows nice scalability in testing some discretized PDEs.

The implementation of this solver is relatively complex. It is built upon multiple components such as multifrontal methods, HSS algorithms, and randomized methods. The efficient implementations of some components are extensively studied by the scientific computing community. Here, we build a systematic fast parallel structured sparse solver, and also provide some new parallel implementations of several useful strategies in sparse and structured solutions. In the near future, we hope to make the individual components and even the entire solver work as black-box codes so that other researchers can conveniently use them for different tasks without knowing the technical details. Such codes will be made publicly available. It is also possible to simplify the implementation using simpler structures such as the HODLR form [1].

Due to the complex scale of the project, no pivoting is integrated yet. Another reason for this is that it is not clear how pivoting affects the structures since the structures depend on proper ordering. This will be thoroughly studied in future work. Out future work will also include an adaptive scheme for choosing sampling sizes at different levels of the assembly tree, the replacement of intermediate HSS forms by multilayer structures [32], the use of batched BLAS for multiple small BLAS operations, as well as more extensive direct solution and preconditioning tests.

Acknowledgments. We thank Xiao Liu, Fabien Peyruss, and Jia Shi for helping with the numerical tests and thank TOTAL E&P Research & Technology USA for providing the computing resource. We are also grateful to the three anonymous referees for the valuable suggestions.

REFERENCES

[1] S. Ambikasaran and E. F. Darve,AnO(NlogN)fast direct solver for partial hierarchically semi-separable matrices, J. Sci. Comput., 57 (2013), pp. 477–501.

[2] P. Amestoy, C. Ashcraft, O. Boiteau, A. Buttari, J.-Y. L’Excellent, and C. Weis-becker,Improving multifrontal methods by means of block low-rank representations, SIAM

J. Sci. Comput., 37 (2015), pp. A1451–A1474.

[3] P. Amestoy, I. S. Duff, and J. Y. L’Excellent,Multifrontal parallel distributed symmetric and unsymmetric solvers, Comput. Methods Appl. Mech. Engrg. 184 (2000), pp. 501–520. [4] L. S. Blackford, J. Choi, A. Cleary, E. D’Azeuedo, J. Demmel, I. Dhillon, et al.,

ScaLAPACK User’s Guide, SIAM, Philadelphia, PA, 1997.

[5] S. B¨orm and W. Hackbusch,Data-sparse approximation by adaptiveH2_-matrices_,

Comput-ing, 69 (2002), pp. 1–35.

[6] S. Chandrasekaran, P. Dewilde, M. Gu, and T. Pals,A fast ULV decomposition solver for hierarchically semiseparable representations, SIAM J. Matrix Anal. Appl., 28 (2006), pp. 603–622.

[7] S. Chandrasekaran, P. Dewilde, M. Gu, and N. Somasunderam,On the numerical rank of the off-diagonal blocks of Schur complements of discretized elliptic PDEs, SIAM J. Matrix Anal. Appl. 31 (2010), pp. 2261–2290.

[8] R. Clint Whaley,Basic Linear Algebra Communication Subprograms: Analysis and Imple-mentation Across Multiple Parallel Architectures, LAPACK Working Note 73, University of Tennessee, 1994.

[9] I. S. Duff and J. K. Reid,The multifrontal solution of indefinite sparse symmetric linear, ACM Trans. Math. Software, 9 (1983), pp. 302–325.

[10] J. A. George,Nested dissection of a regular finite element mesh, SIAM J. Numer. Anal., 10 (1973), pp. 345–363.

[11] J. A. George, J. W. H. Liu, and E. Ng,Communication results for parallel sparse Cholesky factorization on a hypercube, Parallel Comput., 10 (1989), pp. 287–298.

[12] N. Gibbs, W. Poole, and P. Stockmeyer,An algorithm for reducing the bandwidth and profile of a sparse matrix, SIAM J. Sci. Comput., 13 (1976), pp. 236–250.