coarse grids guaranteed to represent smooth error accurately. Two parallel compatible relaxation algorithms using concepts from earlier chapters are examined and implemented in Chapter 6.
This research and the research of others have produced a large and growing field of coarsening algorithms. These algorithms are introduced in several publications and are, in many cases, not tested against one another. By providing a single forum for all of the algorithms, this thesis contains a wealth of experiments and data examining the performance of many parallel independent set-based coarsening algorithms; it represents the largest set of coarsening algorithms tested simultaneously. In Chapter 5, attention turns to the design and efficiency of the coarsening algorithms themselves. Coarse-grid selection algorithms contain routines for searching a graph to identify new coarse-grid points. The weight initialization in CLJP and PMIS forces a brute-force search, which involves large numbers of comparisons between vertex weights. The algorithms using graph coloring have a theoretical advantage in terms of the search methods available. An algorithm called Bucket Sorted Independent Sets (BSIS) is developed to use a bucket algorithm for sorting and identifying new coarse points without requiring comparisons between vertex weights. This novel application of comparison-free sorting produces a coarsening algorithm with much lower search costs. BSIS is the first coarse-grid selection algorithm developed with the explicit goal of reducing the cost of coarsening. In addition to presenting the new algorithm, theory is developed to prove that the changes made from CLJP-c to BSIS do not affect the selected coarse grid.
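The core bucket idea behind such a comparison-free search can be sketched as follows. This is an illustrative simplification, not the BSIS algorithm itself, and it assumes small integer weights such as those produced by a coloring-based initialization:

```python
# Minimal sketch of comparison-free selection via bucket sort.
# Each vertex is dropped into the bucket indexed by its weight
# (an O(1) operation with no weight-to-weight comparisons), and
# candidates are then drawn from the highest non-empty bucket down.

def bucket_select(weights, max_weight):
    """Return vertex indices ordered from highest to lowest weight
    without any pairwise weight comparisons."""
    buckets = [[] for _ in range(max_weight + 1)]
    for v, w in enumerate(weights):
        buckets[w].append(v)                # comparison-free placement
    order = []
    for w in range(max_weight, -1, -1):     # sweep buckets top-down
        order.extend(buckets[w])
    return order

# Example: 6 vertices with coloring-based weights in {0, ..., 3}
print(bucket_select([2, 0, 3, 1, 3, 2], 3))  # -> [2, 4, 0, 5, 3, 1]
```

Because the weights are bounded small integers, the total cost is linear in the number of vertices plus the number of buckets, which is the source of the reduced search cost discussed above.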
Parallel Implementation of the Coordinates-Partitioning Based Multigrid Preconditioner

Though there are several ways to parallelize a given algorithm, only the message-passing version is considered here. The Message Passing Interface (MPI) is used, and the parallel algorithms are based on the domain decomposition method. Assume that the linear system (1) has been distributed to P processors, with an approximately equal number of rows of the coefficient matrix A assigned to each processor. For simplicity, we assume that the rows assigned to each processor have contiguous indices. The components of all the vectors on each level are assigned to the processors according to the rows of the coefficient matrix. In the parallelization of the PCG iteration and the multigrid algorithm, the results of the inner products must repeatedly be available on all processors; therefore, the inner products are computed with the MPI function MPI_Allreduce. The remaining components needed to complete the parallel implementation are the partitioning algorithm ACRP, the setup process described in Figure 2, the smoothing steps in the application of multigrid, and the matrix-vector multiplications.
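The distributed inner-product pattern described above can be sketched as follows. The function names are illustrative, and the final summation over the per-processor partial results stands in for a call to MPI_Allreduce with the MPI_SUM operation:

```python
# Sketch of the distributed inner product: vector components are split
# into contiguous blocks, one per "processor"; each processor forms a
# local partial dot product, and an all-reduce sums the partials so
# that every processor holds the global result.

def split_contiguous(v, nprocs):
    """Distribute components in contiguous blocks of near-equal size."""
    n = len(v)
    blocks, start = [], 0
    for p in range(nprocs):
        size = n // nprocs + (1 if p < n % nprocs else 0)
        blocks.append(v[start:start + size])
        start += size
    return blocks

def parallel_dot(x, y, nprocs=4):
    xs, ys = split_contiguous(x, nprocs), split_contiguous(y, nprocs)
    partials = [sum(a * b for a, b in zip(xb, yb))
                for xb, yb in zip(xs, ys)]       # local work, no communication
    return sum(partials)     # stands in for MPI_Allreduce(..., MPI_SUM)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 4.0, 3.0, 2.0, 1.0]
print(parallel_dot(x, y))    # -> 35.0, matching the serial dot product
```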
The FE method, although more complicated from the implementation point of view, has also been adapted to use GPUs, even prior to the appearance of CUDA hardware and software. Turek et al. attempted to use GPUs through the FEAST finite-element library, which they develop. Initially, a single-precision iterative solver is implemented on the GPU to serve as a preconditioner for an outer iterative solver running in double precision on the CPU. A 2D Laplacian problem is solved on a regular Cartesian grid. This approach, using OpenGL, is approximately 3.5 times faster than a CPU implementation. A later development by the same group is described in . The FEAST library is used to solve a non-linear steady-state Navier-Stokes problem. The linearized subproblems of the non-linear solver are solved with a global BiCGStab preconditioned with a Schur complement matrix. The advection-diffusion problem is solved with a global multigrid solver that uses, as its smoother, multigrid solvers on the local domains running on the GPU. To ensure the regular access patterns suitable for the GPU, this strategy uses a 2D unstructured mesh composed of a small number of quadrilateral domains, while the domains themselves, on which the local multigrid GPU solvers operate, are discretized with regular generalized tensor-product grids. The components that are ported to the GPU are up to an order of magnitude faster than the original CPU version. These components represent only a fraction of the total solver code, so the total simulation time is only decreased by a factor of two, as can be expected from Amdahl's law.
To give an illustration of the numerical performance of C-AMG, consider again the example problem in (3) and Figure 2. Table 1 shows single-processor results on an Intel Pentium workstation. The coarse grids for the 31 × 31 problem are shown in the figure. We see that the convergence factors are almost uniform, independent of problem size, that the growth in both setup and solve time is essentially linear in problem size, and that the number of grid levels grows logarithmically with problem size. These are expected characteristics of multigrid methods. We also see that the grid and operator complexities stay nicely bounded for this problem (growth in operator complexity is often an issue for AMG, especially in parallel). Note that in practice, it is usually better to use AMG as a preconditioner for a Krylov method such as conjugate gradients (CG). To precondition CG, we first must ensure that the AMG cycle is symmetric. If we do that for this problem by using C-F Jacobi, the resulting AMG-CG method takes 8 or 9 iterations for all problem sizes. See [6, 8] for a more extensive set of numerical experiments.
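As a minimal sketch of how a symmetric preconditioner enters CG, consider the following PCG iteration. Here a diagonal (Jacobi) preconditioner stands in for the symmetrized AMG cycle, and the matrix is a small SPD model problem, so this illustrates only the structure of AMG-CG, not its performance:

```python
# Preconditioned conjugate gradients with a symmetric preconditioner M.
# In AMG-CG, applying M^{-1} would be one symmetric AMG cycle; here a
# Jacobi (diagonal) preconditioner plays that role for illustration.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(A, b, tol=1e-10, maxit=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                          # r = b - A*0
    z = [ri / A[i][i] for i, ri in enumerate(r)]      # z = M^{-1} r
    p, rz = z[:], dot(r, z)
    for it in range(maxit):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            return x, it + 1
        z = [ri / A[i][i] for i, ri in enumerate(r)]  # precondition again
        rz_new = dot(r, z)
        beta, rz = rz_new / rz, rz_new
        p = [zi + beta * pi for zi, pi in zip(z, p)]
    return x, maxit

# Small SPD 1D Poisson-type system with exact solution [1, 1, 1]:
A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
b = [1.0, 0.0, 1.0]
x, iters = pcg(A, b)
print(x)  # -> [1.0, 1.0, 1.0] after a few iterations
```

Symmetry of M is what makes the preconditioned operator self-adjoint in the M-inner product, which is why an unsymmetric AMG cycle cannot be used here directly.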
Many independent set-based coarsening algorithms have been developed. RS coarsening is introduced in  and is the inspiration for many algorithms that follow. The first parallel independent set-based coarsening algorithms appeared a little more than ten years ago and include RS3, CLJP, BC-RS and Falgout, PMIS and HMIS, CLJP-c, PMIS-c1 and PMIS-c2, and PMIS Greedy, to name a few. Independent set-based coarsening algorithms are defined by a set of policies that specify how weights are initialized, the neighborhood used in selecting C-points (see selection neighborhood below), and the way in which vertex weights are updated. We investigate the nature of these components in the following section.
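The selection step these algorithms share can be sketched as follows. This is a simplified illustration with deterministic weights and a plain distance-one neighborhood; the weight-initialization and weight-update policies that distinguish the algorithms are left out:

```python
# One selection sweep of a CLJP/PMIS-style scheme: every vertex whose
# weight is a strict local maximum over its selection neighborhood
# joins the independent set of new C-points. Real algorithms add
# random tie-breakers and measure-based weight updates between sweeps.

def select_round(adj, weights):
    """Return vertices whose weight beats all of their neighbors'."""
    return [v for v in adj
            if all(weights[v] > weights[u] for u in adj[v])]

# A 6-vertex path graph 0-1-2-3-4-5 with hand-picked weights:
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
weights = {0: 5, 1: 2, 2: 4, 3: 1, 4: 6, 5: 3}
print(select_round(adj, weights))  # -> [0, 2, 4], an independent set
```

Because no two adjacent vertices can both be strict local maxima, each sweep yields an independent set by construction, and the sweeps can be executed concurrently over the vertices.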
In addition, Tables 1 to 4 show that the more complex the coefficient matrix of the initial linear system is, the relatively lower the setup time and the more efficient the scheme aggva becomes. The more points each point connects to on average, the more efficient the scheme aggcq becomes. The scheme agg8p always needs the least setup time. Though aggst, agg4p, and agg8p are all very cheap to set up, their iteration performance is also very good, and in most cases the best scheme is one of them. Even when they are not the best, their performance is not far from the best. It should be mentioned that, although only the results from three test examples are given, many other tests give similar results.
experiments have shown that, in most cases, the best results are achieved when using only 1+1 Jacobi iterations and 500–1,000 groups. However, if a model is very complex and has high conductivity contrasts, it will probably be necessary to create up to 5,000 groups and to use more basic iterations. Similarly, SSOR relaxations have proved to be more efficient than Jacobi iterations in dealing with high contrasts between conductivities. In our examples, the systems that have been solved have between 0.5 and 2 million unknowns, which means that the numbers of groups that proved to be the most efficient choices are three or four orders of magnitude smaller. We remark that these ratios may be used as guidance when choosing the number of groups. In addition, the results obtained have shown that there is no need to introduce more than one level of coarsening, which would considerably increase the cost of each BiCGStab iteration.
We have implemented our saddle point AMG using the hypre software package [hyp] [CCF98]. An important component of this parallel linear solver suite is the BoomerAMG algebraic multigrid solver and preconditioner for positive definite matrices. The ingredients of BoomerAMG include smoothers (Jacobi, Gauss–Seidel, SOR, polynomial), parallel coarse-grid generation techniques (third pass coarsening, CLJP, Falgout's scheme, PMIS, HMIS, CGC(-E), compatible relaxation, ...), and interpolation setup routines (direct, modified classical, extended(+i), Jacobi, and many more). Furthermore, for systems of elliptic PDEs both the unknown-based (UAMG) and the point-based (PAMG) approach are supported. For the latter, block smoothers and block interpolation routines can be chosen.
8 Conclusions and Future Work
Overall, there are many efficient parallel implementations of algebraic multigrid and multilevel methods. Various parallel coarsening schemes, interpolation procedures, and parallel smoothers, as well as several parallel software packages, have been briefly described. There has truly been an explosion of research and development in the area of algebraic multilevel techniques for parallel computers with distributed memories. Even though we have tried to cover as much information as possible, there are still various interesting approaches that have not been mentioned. One approach that shows a lot of promise is the concept of compatible relaxation, originally suggested by Achi Brandt. Much research has been done in this area. Although many theoretical results have been obtained [22, 35], we are not aware of an efficient implementation of this algorithm to this date. However, once this has been formulated, compatible relaxation holds much promise for parallel computation. Since the smoother is used to build the coarse grid, use of a completely parallel smoother (e.g., C-F Jacobi relaxation) will lead to a parallel coarsening algorithm.
Algebraic multigrid (AMG) methods [1–4] are among the most efficient solution and preconditioning algorithms for large and sparse systems of linear equations that arise in a wide range of scientific and engineering applications governed by elliptic partial differential equations (PDEs). Multigrid methods combine the effects of smoothing and coarse-level correction. The smoothing operation employs relaxation schemes such as the Jacobi or Gauss–Seidel iterations and attempts to suppress the oscillatory components of the error, while the coarse-level correction is designed to eliminate the complementary smooth components of the error, which are not efficiently reduced by smoothing. The latter consists of transferring the smoothed error to a coarser level with fewer degrees of freedom (DOF), solving the associated residual equation, and interpolating the computed correction back to update the current fine-level solution. If the coarse-level problem is still too large, the same scheme can be applied recursively, leading to a hierarchy of levels of decreasing size. In contrast to geometric methods [2, 5], AMG constructs the multigrid hierarchy entirely by algebraic means, without an a priori connection to the underlying continuous problem. AMG methods can consequently be employed in a 'black-box' fashion and are applicable even to problems defined on complicated geometric domains, where it might be difficult, if at all possible, to develop a suitable geometric multigrid method.
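The complementary roles of smoothing and coarse-level correction can be seen in a small experiment on the 1D Poisson model problem. The parameters (weighted Jacobi with weight 2/3, 63 unknowns) are illustrative choices; the point is that the smoother damps an oscillatory error mode rapidly but barely touches a smooth one, which is exactly the component the coarse-level correction must remove:

```python
import math

def jacobi_sweeps(e, sweeps, w=2.0/3.0):
    """Weighted Jacobi applied to the error equation A e = 0 for the
    1D Poisson stencil (-1, 2, -1) with zero boundary values."""
    n = len(e)
    for _ in range(sweeps):
        Ae = [2*e[i] - (e[i-1] if i > 0 else 0.0)
                      - (e[i+1] if i < n-1 else 0.0) for i in range(n)]
        e = [ei - w*aei/2.0 for ei, aei in zip(e, Ae)]
    return e

n, h = 63, 1.0/64
smooth = [math.sin(math.pi*(i+1)*h) for i in range(n)]     # k = 1 mode
oscill = [math.sin(63*math.pi*(i+1)*h) for i in range(n)]  # k = 63 mode
norm = lambda v: max(abs(x) for x in v)

print(norm(jacobi_sweeps(oscill, 3)) / norm(oscill))  # about 0.04: rapidly damped
print(norm(jacobi_sweeps(smooth, 3)) / norm(smooth))  # about 0.998: barely reduced
```

The smooth mode, hardly reduced by relaxation, is well represented on a coarser grid, which is why transferring the residual equation to the coarse level and interpolating the correction back eliminates it efficiently.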
Abstract. An algebraic multigrid method is presented to solve large systems of linear equations. The coarsening is obtained by aggregation of the unknowns. The aggregation scheme uses two passes of a pairwise matching algorithm applied to the matrix graph, resulting in most cases in a decrease of the number of variables by a factor slightly less than four. The matching algorithm favors the strongest negative coupling(s), inducing a problem-dependent coarsening. This aggregation is combined with piecewise constant (unsmoothed) prolongation, ensuring low setup cost and memory requirements. Compared with previous aggregation-based multigrid methods, the scalability is enhanced by using a so-called K-cycle multigrid scheme, providing Krylov subspace acceleration at each level. This paper is the logical continuation of [SIAM J. Sci. Comput., 30 (2008), pp. 1082–1103], where the analysis of an anisotropic model problem shows that aggregation-based two-grid methods may have optimal-order convergence, and of [Numer. Lin. Alg. Appl., 15 (2008), pp. 473–487], where it is shown that K-cycle multigrid may provide optimal or near-optimal convergence under mild assumptions on the two-grid scheme. Whereas in these papers only model problems with geometric aggregation were considered, here a truly algebraic method is presented and tested on a wide range of discrete second-order scalar elliptic PDEs, including nonsymmetric and unstructured problems. Numerical results indicate that the proposed method may be significantly more robust as a black-box solver than the classical AMG method as implemented in the code AMG1R5 by K. Stüben. The parallel implementation of the method is also discussed. Satisfactory speedups are obtained on a medium-size multi-processor cluster typical of today's computer market. A code implementing the method is freely available for download, both as a FORTRAN program and as a MATLAB function.
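A single pass of pairwise matching can be sketched as follows. This simplified version greedily matches each still-unaggregated node with the neighbor to which it has the strongest (most negative) coupling, roughly halving the number of variables; applying a second pass to the resulting pairs yields the factor of about four mentioned above:

```python
# One greedy pass of pairwise aggregation driven by the strongest
# negative coupling. Quality controls of the actual matching
# algorithm are omitted; this only illustrates the mechanism.

def pairwise_aggregate(A):
    """A: dense symmetric matrix as a list of rows."""
    n = len(A)
    aggregate_of = [None] * n
    aggregates = []
    for i in range(n):
        if aggregate_of[i] is not None:
            continue
        # unaggregated neighbors with a negative coupling to node i
        candidates = [j for j in range(n)
                      if j != i and aggregate_of[j] is None and A[i][j] < 0]
        if candidates:
            j = min(candidates, key=lambda j: A[i][j])   # most negative entry
            aggregate_of[i] = aggregate_of[j] = len(aggregates)
            aggregates.append((i, j))
        else:
            aggregate_of[i] = len(aggregates)            # singleton aggregate
            aggregates.append((i,))
    return aggregates

# 1D Poisson matrix on 4 nodes: neighbors couple with -1.
A = [[ 2, -1,  0,  0],
     [-1,  2, -1,  0],
     [ 0, -1,  2, -1],
     [ 0,  0, -1,  2]]
print(pairwise_aggregate(A))  # -> [(0, 1), (2, 3)]
```

With piecewise constant prolongation, each aggregate then contributes one coarse variable, which is what keeps the setup cost and memory requirements low.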
This raises, however, some design and implementation issues. As mentioned in Section 4.3.2, CUSP performs not only the solve phase of the linear solver but also the setup phase on the GPU. This does, however, require a version of AMG that can be constructed with fine-grained parallelism. To our knowledge, it is not clear whether it is possible to implement the BSSA algorithm efficiently on a GPU. A more general point can be made from this. The design decision to construct the preconditioner on the GPU makes sense, given that it enables CUSP to offer a complete black-box GPU solver; however, it excludes preconditioners that are not feasible for efficient setup on GPUs. Given CUSP's and THRUST's functionality for generic data structures and for copying between host and device, an interesting question is whether it is possible to construct the preconditioner on the CPU and transfer it to the GPU for the solve phase of the algorithm. If so, one could consider aggregation schemes that have been successfully implemented for CPUs, such as the ones used in AGMG, or even classical AMG, possibly allowing approaches more tailored to anisotropic problems. This requires, however, that the preconditioner can be constructed in a way that is compatible with the data structures used in CUSP and allows for efficient V-cycle execution.
CHAPTER 4. IMPLEMENTATION 4.3. SETUP PHASE
Direct interpolation uses only direct C-F point connections for interpolation. This can be implemented using either only strongly connected C points or all C points with at least a weak connection to the respective F point. The decision was to implement the first approach because this usually leads to sparser prolongation matrices and also to sparser operator matrices via the Galerkin operation, improving complexity and performance. The disadvantage is that interpolation might be worse, although experiments have shown that the effect is negligible. The interpolation is otherwise quite straightforward. To improve performance, the summations of the C point coefficients and the coefficient row sum in Equation 2.1 are calculated on a per-row level. Furthermore, the rows of the prolongation matrix are computed in parallel. If chosen by the user (interpolweight > 0), each row is truncated to further reduce the number of non-zero entries per row (see Section 2.4.1). The truncation operation is quite straightforward and therefore not described.
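A plausible reading of this per-row computation, using the classical direct-interpolation formula, is sketched below. The exact formula and sign conventions vary between AMG variants, so this dense-row version is an illustration rather than the implementation described above:

```python
# Sketch of direct interpolation for one F-point row: the weight to an
# interpolatory C-point j is w_ij = -(a_ij / a_ii) * (row sum of
# off-diagonal entries) / (sum over the interpolatory C-point entries).
# Both sums are formed once per row, as in the per-row scheme above.

def direct_interpolation_row(row, i, coarse):
    """row: dense matrix row of F-point i; coarse: its interpolatory
    (strongly connected) C-points."""
    a_ii = row[i]
    row_sum = sum(a for k, a in enumerate(row) if k != i)  # off-diagonal sum
    c_sum = sum(row[j] for j in coarse)                    # C-point sum
    return {j: -(row[j] / a_ii) * (row_sum / c_sum) for j in coarse}

# F-point 1 with C-points {0, 2} in a 1D Poisson row (-1, 2, -1):
print(direct_interpolation_row([-1.0, 2.0, -1.0], 1, [0, 2]))
# -> {0: 0.5, 2: 0.5}, i.e. linear interpolation, as expected
```

Truncation would then simply drop the smallest entries of the returned dictionary and rescale the remainder so the row sum is preserved.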
sorting the intermediate matrix in global memory was identified as the primary performance-limiting factor of our implementation. We therefore proposed and evaluated a selective segmented processing approach that utilizes a lightweight analysis procedure to order the processing of the intermediate matrix into uniform subsets. This regularization proved to be most beneficial for workloads with a moderate number of products per row, allowing efficient processing entirely within fast shared on-chip memory. Finally, we considered an alternative SpGEMM implementation that ignored segmentation and utilized a two-phase sorting scheme to reduce the total processing time of the ESC algorithm. We showed that this strategy maintained the performance stability of the ESC algorithm while decreasing the overall processing time dramatically in some cases. There are numerous lines of future work concerning SpGEMM and other sparse matrix operations on GPUs. In this work we focused on analyzing and accelerating SpGEMM using traditional storage formats, such as COO and CSR, but many computational packages support specialized SpMV storage formats. It is unclear how the use of these formats will impact the implementation and performance of SpGEMM on GPUs. In the context of AMG we have considered performing the coarse matrix construction using two SpGEMM operations. It is an open question whether it is possible to perform both SpGEMM operations simultaneously to reduce the memory overhead associated with forming the intermediate matrix and to improve the performance.
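The expand-sort-compress (ESC) pipeline underlying this discussion can be sketched in a few lines for COO inputs. This serial version only illustrates the three stages; the GPU segmentation and shared-memory machinery discussed above are omitted:

```python
from itertools import groupby

# ESC formulation of SpGEMM on COO matrices: expand every partial
# product a_ik * b_kj into an (i, j, value) triple, sort the triples
# by output coordinate, and compress duplicates by summation. The
# sort of this intermediate matrix is the expensive step on the GPU.

def esc_spgemm(A, B):
    """A, B: sparse matrices as lists of (row, col, val) triples."""
    # expand: one triple per matching inner index k
    intermediate = [(i, j, va * vb)
                    for (i, k, va) in A
                    for (k2, j, vb) in B if k == k2]
    # sort by output coordinate (i, j)
    intermediate.sort(key=lambda t: (t[0], t[1]))
    # compress: sum duplicates sharing the same (i, j)
    return [(i, j, sum(v for _, _, v in group))
            for (i, j), group in groupby(intermediate,
                                         key=lambda t: (t[0], t[1]))]

# [[1, 2], [0, 3]] times [[4, 0], [1, 5]] in COO form:
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
B = [(0, 0, 4.0), (1, 0, 1.0), (1, 1, 5.0)]
print(esc_spgemm(A, B))
# -> [(0, 0, 6.0), (0, 1, 10.0), (1, 0, 3.0), (1, 1, 15.0)]
```

The size of `intermediate` before compression is exactly the memory overhead that motivates the question of fusing the two SpGEMM operations in the AMG coarse-operator product.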
Abstract. Algebraic multigrid (AMG) is one of the most efficient and scalable parallel algorithms for solving sparse linear systems on unstructured grids. However, for large three-dimensional problems, the coarse grids that are normally used in AMG often lead to growing complexity in terms of memory use and execution time per AMG V-cycle. Sparser coarse grids, such as those obtained by the Parallel Modified Independent Set coarsening algorithm (PMIS), remedy this complexity growth but lead to non-scalable AMG convergence factors when traditional distance-one interpolation methods are used. In this paper we study the scalability of AMG methods that combine PMIS coarse grids with long-distance interpolation methods. AMG performance and scalability are compared for previously introduced interpolation methods as well as new variants of them for a variety of relevant test problems on parallel computers. It is shown that the increased interpolation accuracy largely restores the scalability of AMG convergence factors for PMIS-coarsened grids, and that in combination with complexity-reducing methods, such as interpolation truncation, one obtains a class of parallel AMG methods that enjoy excellent scalability properties on large parallel computers. Key words. Algebraic multigrid, long-range interpolation, parallel implementation, reduced complexity, truncation
Smoothed aggregation-based (SA) [74, 72] algebraic multigrid (AMG) [17, 64] is a popular and effective solver for systems of linear equations that arise from discretized partial differential equations (PDEs). While SA has been effective over a broad class of problems, it has several limitations and weaknesses that this thesis is intended to address. These include the development of a more robust strength-of-connection measure, which guides coarsening and the choice of interpolation sparsity patterns. Unfortunately, the classic strength measure is only well-founded for M-matrices, leading us to develop a new measure based on local knowledge of both algebraically smooth error and the behavior of interpolation. Another limitation is that classic SA is only formally defined for Hermitian positive definite (HPD) problems. For non-Hermitian operators, the operator-induced energy norm does not exist, which impacts the complementary relationship between relaxation and interpolation. This requires a redesign of SA, such that the restriction and prolongation operators approximate the left and right near null-spaces, respectively. As a result, we develop general SA setup algorithms for both the HPD and the non-Hermitian cases. To realize these algorithms, we develop general prolongation smoothing methods so that restriction and prolongation target the left and right near null-spaces, respectively. The right near null-space is loosely defined for a matrix A as the set of all vectors v such that Av ≈ 0. Likewise, the left near null-space is loosely defined as the set of all vectors u such that A∗u ≈ 0. For Hermitian matrices, the left and right near null-spaces are identical. Overall, the proposed methods do not assume any user input beyond what standard SA does, and the result is a new direction for multigrid methods for non-Hermitian systems.
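A common way to expose a near null-space candidate in practice (the idea behind adaptive SA setups) is to relax on Av = 0 from a random start, which damps every component outside the near null-space. The sketch below does this for a 1D Laplacian with Neumann-type boundary rows, whose near null-space is the constant vector; the function names and parameters are illustrative:

```python
import random

def neumann_laplacian_matvec(v):
    """Apply the 1D Laplacian with Neumann-type rows [1,-1], [-1,2,-1],
    ..., [-1,1]; its exact null-space is the constant vector."""
    n = len(v)
    out = [0.0] * n
    for i in range(n):
        if i > 0:
            out[i] += v[i] - v[i-1]
        if i < n - 1:
            out[i] += v[i] - v[i+1]
    return out

def relax_candidate(n, sweeps=200, w=2.0/3.0, seed=0):
    """Weighted Jacobi on A v = 0 from a random start: non-null-space
    components decay, leaving a vector with A v ~ 0."""
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    diag = [1.0 if i in (0, n - 1) else 2.0 for i in range(n)]
    for _ in range(sweeps):
        Av = neumann_laplacian_matvec(v)
        v = [vi - w * avi / di for vi, avi, di in zip(v, Av, diag)]
    return v

v = relax_candidate(8)
# v is now nearly constant, and neumann_laplacian_matvec(v) is nearly zero.
```

For a non-Hermitian operator, the same relaxation applied to the adjoint system A∗u = 0 would produce a left near null-space candidate for the restriction operator.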
Step 2i takes O(log n) time using Θ(n) processors. Step 2ii is very efficient: it takes constant time using a linear number of processors on an EREW PRAM. Step 2iii takes O(log n) time using n² processors on an EREW PRAM. Finally, the recursive steps take log n stages, since each new subproblem is at most half the size of the previous problem; further, the sum of the sizes of the new problems is less than the size of the previous problem, and hence the processor count is dominated by the first stage. Thus the algorithm takes O(log² n) time using Θ(n) processors.
For the rapid computation of the features at many levels, we introduce the concept of the integral image representation. The integral image can be computed from an image with only a few operations per pixel. Once computed, any one of these Haar-like features can be computed at any scale or location in constant time. Another contribution of the research work is a process for developing a classifier through the selection of a small number of features using LDA. The weak classifier is constrained so that each weak classifier returned can depend on only a single feature. The third contribution is a method that combines successively simple to complex classifiers in a parallel and cascaded structure, which increases the speed of the detector by focusing attention on promising regions of the image. In the domain of face detection, it is possible to achieve less than 1 per cent false negatives and 40 per cent false positives with a classifier developed from five Haar-like features.
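The integral-image construction and the resulting constant-time box sums can be sketched as follows:

```python
# Integral image: ii[y][x] holds the sum of all pixels above and to
# the left of (y, x), built in a single pass. Any rectangular (box)
# sum, and hence any Haar-like feature, then costs four table lookups
# regardless of scale or position.

def integral_image(img):
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]   # zero-padded row/column
    for y in range(h):
        for x in range(w):
            ii[y+1][x+1] = (img[y][x] + ii[y][x+1]
                            + ii[y+1][x] - ii[y][x])
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of img[top:top+height][left:left+width] in O(1)."""
    return (ii[top+height][left+width] - ii[top][left+width]
            - ii[top+height][left] + ii[top][left])

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(box_sum(ii, 1, 1, 2, 2))  # -> 28 (= 5 + 6 + 8 + 9)
```

A Haar-like feature is then just a signed combination of a few such box sums, which is what makes evaluation at any scale or location constant-time.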
The PRO model is inspired by the Bulk Synchronous Parallel (BSP) model introduced by Valiant and the Coarse Grained Multicomputer (CGM) model of Dehne et al. In the BSP model a parallel algorithm is organized as a sequence of supersteps with distinct computation and communication phases. The emergence of the BSP model marked an important milestone in parallel computation. The model introduced a desirable structure to parallel programming, and was accompanied by the definition and implementation of communication infrastructure libraries due to Bonorden et al. and Hill et al. Recently, Bisseling has written a textbook on scientific parallel computation using the BSP model. From an algorithmic (as opposed to a programming) point of view, we believe that the relatively many and machine-specific parameters involved in the BSP model make the design and analysis of algorithms somewhat cumbersome. The CGM model partially addresses this limitation, as it involves only two parameters: the input size and the number of processors. The CGM model is a specialization of the BSP model in that the communication phase of a superstep is required to consist of single long messages rather than multiple short ones. A drawback of the CGM model is the lack of an accurate performance measure; the number of communication rounds (supersteps) is usually used as a quality measure, but as we shall see later in this paper, this measure is sometimes inaccurate.
means of achieving higher performance, and this brought about the advent of multi-core technology and mainstream parallel computation. Multi-core processor technology was particularly attractive and promising because manufacturers were able to more than double performance without necessarily increasing the operating frequency, simply by adding more processing cores. This meant that devices were able to do more as more processing power became readily available, and this led to an age of ubiquitous computing, as these chips powered almost everything, ranging from small devices such as mobile phones and tablets to home computers, enterprise server systems, and supercomputers. However, the problem of energy consumption soon re-surfaced, and it became highly imperative that system designers and developers tackle the issue, for both technical and economic reasons, in order to prolong the sustainability of multi-core and parallel systems.