coarse grids guaranteed to represent smooth error accurately. Two parallel compatible relaxation algorithms using concepts from earlier chapters are examined and implemented in Chapter 6.
This research, together with the research of others, has produced a large and growing field of coarsening algorithms. These algorithms are introduced across several publications and, in many cases, are not tested against one another. By providing a single forum for all of the algorithms, this thesis contains a wealth of experiments and data examining the performance of many parallel independent set-based coarsening algorithms; it represents the largest set of coarsening algorithms tested simultaneously. In Chapter 5, attention turns to the design and efficiency of the coarsening algorithms themselves. Coarse-grid selection algorithms contain routines for searching a graph to identify new coarse-grid points. The weight initialization in CLJP and PMIS forces a brute-force search, which involves large numbers of comparisons between vertex weights. The algorithms using graph coloring have a theoretical advantage in terms of the search methods available. An algorithm called Bucket Sorted Independent Sets (BSIS) is developed that uses a bucket algorithm to sort and identify new coarse points without requiring comparisons between vertex weights. This novel application of comparison-free sorting produces a coarsening algorithm with much lower search costs. BSIS is the first coarse-grid selection algorithm developed with the explicit goal of reducing the cost of coarsening. In addition to presenting the new algorithm, theory is developed to prove that the changes made from CLJP-c to BSIS do not affect the selected coarse grid.
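The comparison-free idea behind bucket sorting can be illustrated independently of BSIS itself. The sketch below is a hypothetical, standalone example (the function name `bucket_select` and the weight values are invented for illustration): integer vertex weights index into buckets, so vertices can be visited in descending weight order with a single bucket scan rather than pairwise weight comparisons.

```python
from collections import defaultdict

def bucket_select(weights):
    """Visit vertices in descending weight order using buckets,
    avoiding pairwise weight comparisons (the general idea exploited
    by BSIS; this sketch is not the thesis algorithm itself)."""
    buckets = defaultdict(list)              # integer weight -> vertices
    for v, w in weights.items():
        buckets[w].append(v)
    order = []
    for w in range(max(buckets), min(buckets) - 1, -1):  # one bucket scan
        order.extend(buckets.get(w, []))
    return order

# vertices with integer weights (as a coloring-based initialization allows)
print(bucket_select({0: 3, 1: 1, 2: 3, 3: 2}))  # -> [0, 2, 3, 1]
```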


The FE method, although more complicated from the implementation point of view, has also been adapted to use GPUs, even prior to the appearance of CUDA hardware and software. Turek et al. [65] attempted to use GPUs through the FEAST finite-element library, which they developed. Initially, a single-precision iterative solver is implemented on the GPU to serve as a preconditioner for an outer iterative solver running in double precision on the CPU. A 2D Laplacian problem is solved on a regular Cartesian grid. This approach, using OpenGL, is approximately 3.5 times faster than a CPU implementation. A later development by the same group is described in [66]. The FEAST library is used to solve a non-linear steady-state Navier-Stokes problem. The linearized subproblems of the non-linear solver are solved with a global BiCGStab preconditioned with a Schur complement matrix. The advection-diffusion problem is solved with a global multigrid solver that uses, as its smoother, multigrid solvers on the local domains running on the GPU. To ensure the regular access patterns suitable for the GPU, this strategy uses a 2D unstructured mesh composed of a small number of quadrilateral domains, while the domains themselves, on which the local multigrid GPU solvers operate, are discretized with regular generalized tensor product grids. The components that are ported to the GPU are up to an order of magnitude faster than the original CPU version. These components represent only a fraction of the total solver code, so the total simulation time is only decreased by a factor of two, as can be expected due to Amdahl's law.
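The factor-of-two observation follows directly from Amdahl's law. A minimal worked example (the function name and the illustrative 50%/10x figures are assumptions, not numbers from [66]):

```python
def amdahl_speedup(ported_fraction, component_speedup):
    """Overall speedup when only a fraction of the runtime is accelerated:
    S = 1 / ((1 - f) + f / s)."""
    return 1.0 / ((1.0 - ported_fraction) + ported_fraction / component_speedup)

# Accelerating roughly half the runtime by ~10x yields under 2x overall,
# consistent with the factor-of-two total improvement described above.
print(round(amdahl_speedup(0.5, 10.0), 2))  # -> 1.82
```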


Numerical performance
To give an illustration of the numerical performance of C-AMG, consider again the example problem in (3) and Figure 2. Table 1 shows single-processor results on an Intel Pentium workstation. The coarse grids for the 31 × 31 problem are shown in the figure. We see that the convergence factors are almost uniform independent of problem size, the growth in both setup and solve time is essentially linear with problem size, and the number of grid levels grows logarithmically with problem size. These are expected characteristics of multigrid methods. We also see that the grid and operator complexities stay nicely bounded for this problem (growth in operator complexity is often an issue for AMG, especially in parallel). Note that in practice, it is usually better to use AMG as a preconditioner for a Krylov method such as conjugate gradients (CG). To precondition CG, we first must ensure that the AMG cycle is symmetric. If we do that for this problem by using C-F Jacobi, the resulting AMG-CG method takes 8 or 9 iterations for all problem sizes. See [6, 8] for a more extensive set of numerical experiments.
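The preconditioned-CG structure described above can be sketched in a few lines. The snippet below is an illustrative stand-in, not the paper's code: it builds a small 5-point 2D Laplacian and runs CG with a simple Jacobi (diagonal) preconditioner in place of a symmetrized AMG V-cycle; `poisson2d` and `pcg` are hypothetical helper names. The key requirement from the text is preserved: the preconditioner must be symmetric positive definite for CG theory to apply.

```python
import numpy as np

def poisson2d(n):
    """Standard 5-point 2D Laplacian on an n-by-n grid (dense, for illustration)."""
    N = n * n
    A = np.zeros((N, N))
    for i in range(n):
        for j in range(n):
            k = i * n + j
            A[k, k] = 4.0
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    A[k, ii * n + jj] = -1.0
    return A

def pcg(A, b, precond, tol=1e-8, maxiter=500):
    """Preconditioned conjugate gradients; precond(r) applies M^{-1} r and
    must be symmetric positive definite (as a symmetrized AMG cycle is)."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for it in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, it + 1
        z = precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxiter

A = poisson2d(15)
b = np.ones(A.shape[0])
d = np.diag(A)
x, iters = pcg(A, b, lambda r: r / d)    # Jacobi preconditioner as stand-in
print(iters, np.linalg.norm(b - A @ x) < 1e-6)
```

With an AMG preconditioner in place of the diagonal, the iteration count would be essentially independent of problem size, as the table in the text illustrates.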


Many independent set-based coarsening algorithms have been developed. RS coarsening is introduced in [24] and is the inspiration for many algorithms that follow. The first parallel independent set-based coarsening algorithms appeared a little more than ten years ago and include RS3 [19], CLJP [12], BC-RS and Falgout [19], PMIS and HMIS [14], CLJP-c [1], PMIS-c1 and PMIS-c2 [2], and PMIS Greedy [10], to name a few. Independent set-based coarsening algorithms are defined by a set of policies that specify how weights are initialized, the neighborhood used in selecting C-points (see selection neighborhood below), and the way in which vertex weights are updated. We investigate the nature of these components in the following section.
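The three policies named above can be seen in a generic Luby-style selection loop. The sketch below is an assumption-laden composite, not any one of the published algorithms: weights are initialized as (degree, random tie-breaker), the selection neighborhood is distance-one, and the update marks neighbors of new C-points as F-points.

```python
import random

def independent_set_coarsen(adj, seed=0):
    """Generic independent-set coarsening sketch: a vertex whose weight
    exceeds that of all undecided neighbors becomes a C-point; its
    neighbors become F-points. Repeats until every vertex is decided."""
    rng = random.Random(seed)
    # weight initialization policy: degree plus a random tie-breaker
    w = {v: (len(nbrs), rng.random()) for v, nbrs in adj.items()}
    undecided = set(adj)
    C, F = set(), set()
    while undecided:
        # selection neighborhood policy: distance-1 neighbors
        new_C = {v for v in undecided
                 if all(w[v] > w[u] for u in adj[v] if u in undecided)}
        C |= new_C
        undecided -= new_C
        for v in new_C:           # weight update policy: neighbors become F
            for u in adj[v]:
                if u in undecided:
                    F.add(u)
                    undecided.discard(u)
    return C, F

# 1D chain 0-1-2-3-4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
C, F = independent_set_coarsen(adj)
print(sorted(C), sorted(F))
```

Varying any one of the three policies while keeping this loop structure yields different members of the family listed above.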


In addition, Tables 1 through 4 show that the more complex the coefficient matrix of the initial linear system, the less setup time scheme aggva requires and the more efficient it becomes. The more points each point connects to on average, the more efficient scheme aggcq becomes, and agg8p always needs the least setup time. Although aggst, agg4p, and agg8p are all very cheap to set up, their iteration performance is also very good, and one of them is often the best in most cases; even when they are not the best, their performance is not far from it. It should be mentioned that although only results from three test examples are given here, many other tests give similar results.


Experiments have shown that, in most cases, the best results are achieved using only 1+1 Jacobi iterations and 500 – 1,000 groups. However, if a model is very complex and has high conductivity contrasts, it will probably be necessary to create up to 5,000 groups and to use more basic iterations. Similarly, SSOR relaxations have proved more efficient than Jacobi iterations at dealing with high contrasts between conductivities. In our examples, the systems solved have between 0.5 and 2 million unknowns, which means that the numbers of groups that proved to be the most efficient choices are three or four orders of magnitude smaller than the numbers of unknowns. We remark that these ratios may be used as guidance when choosing the number of groups. In addition, the results obtained have shown that there is no need to introduce more than one level of coarsening, which would considerably increase the cost of each BiCGStab iteration.


We have implemented our saddle point AMG using the hypre software package [hyp] [CCF98]. An important component of this parallel linear solver suite is the BoomerAMG algebraic multigrid solver and preconditioner for positive definite matrices. The ingredients of BoomerAMG include smoothers (Jacobi, Gauss–Seidel, SOR, polynomial), parallel coarse grid generation techniques (third pass coarsening, CLJP, Falgout's scheme, PMIS, HMIS, CGC(-E), compatible relaxation, . . . ), and interpolation setup routines (direct, modified classical, extended(+i), Jacobi, and many more). Furthermore, for systems of elliptic PDEs both the unknown-based (UAMG) and the point-based (PAMG) approach are supported. For the latter, block smoothers and block interpolation routines can be chosen.


8 Conclusions and Future Work
Overall, there are many efficient parallel implementations of algebraic multigrid and multilevel methods. Various parallel coarsening schemes, interpolation procedures, and parallel smoothers, as well as several parallel software packages, have been briefly described. There has truly been an explosion of research and development in the area of algebraic multilevel techniques for parallel computers with distributed memories. Even though we have tried to cover as much information as possible, there are still various interesting approaches that have not been mentioned. One approach that shows a lot of promise is the concept of compatible relaxation, originally suggested by Achi Brandt [6]. Much research has been done in this area. Although many theoretical results have been obtained [22, 35], we are not aware of an efficient implementation of this algorithm to this date. However, once this has been formulated, compatible relaxation holds much promise for parallel computation. Since the smoother is used to build the coarse grid, use of a completely parallel smoother (e.g. C-F Jacobi relaxation) will lead to a parallel coarsening algorithm.


Abstract. An algebraic multigrid method is presented to solve large systems of linear equations. The coarsening is obtained by aggregation of the unknowns. The aggregation scheme uses two passes of a pairwise matching algorithm applied to the matrix graph, resulting in most cases in a decrease of the number of variables by a factor slightly less than four. The matching algorithm favors the strongest negative coupling(s), inducing a problem-dependent coarsening. This aggregation is combined with piecewise constant (unsmoothed) prolongation, ensuring low setup cost and memory requirements. Compared with previous aggregation-based multigrid methods, the scalability is enhanced by using a so-called K-cycle multigrid scheme, providing Krylov subspace acceleration at each level. This paper is the logical continuation of [SIAM J. Sci. Comput., 30 (2008), pp. 1082–1103], where the analysis of an anisotropic model problem shows that aggregation-based two-grid methods may have optimal order convergence, and of [Numer. Lin. Alg. Appl., 15 (2008), pp. 473–487], where it is shown that K-cycle multigrid may provide optimal or near optimal convergence under mild assumptions on the two-grid scheme. Whereas in these papers only model problems with geometric aggregation were considered, here a truly algebraic method is presented and tested on a wide range of discrete second order scalar elliptic PDEs, including nonsymmetric and unstructured problems. Numerical results indicate that the proposed method may be significantly more robust as a black box solver than the classical AMG method as implemented in the code AMG1R5 by K. Stüben. The parallel implementation of the method is also discussed. Satisfactory speedups are obtained on a medium-size multi-processor cluster that is typical of today's computer market. A code implementing the method is freely available for download both as a FORTRAN program and a MATLAB function.
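One pass of the pairwise idea can be sketched as a greedy matching on the matrix graph. This is a simplified illustration under stated assumptions (row-dictionary input, greedy order, no quality control on pairs), not the paper's exact matching algorithm: each still-free vertex is paired with its strongest negative neighbor, roughly halving the number of variables, so two passes give the factor of nearly four noted above.

```python
def pairwise_aggregate(A_rows):
    """Greedy pairwise aggregation sketch: pair each unaggregated vertex
    with its strongest negative coupling (most negative off-diagonal entry)
    among still-free neighbors; unmatched vertices form singleton aggregates."""
    n = len(A_rows)
    aggregate = [-1] * n                  # aggregate index per vertex
    next_agg = 0
    for i in range(n):
        if aggregate[i] != -1:
            continue
        candidates = [(a_ij, j) for j, a_ij in A_rows[i].items()
                      if j != i and a_ij < 0 and aggregate[j] == -1]
        aggregate[i] = next_agg
        if candidates:
            _, j = min(candidates)        # most negative coupling wins
            aggregate[j] = next_agg
        next_agg += 1
    return aggregate

# 1D Laplacian on 4 points, rows stored as {column: value}
rows = [{0: 2, 1: -1},
        {0: -1, 1: 2, 2: -1},
        {1: -1, 2: 2, 3: -1},
        {2: -1, 3: 2}]
print(pairwise_aggregate(rows))  # -> [0, 0, 1, 1]: pairs (0,1) and (2,3)
```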


This raises, however, some design and implementation issues. As mentioned in Section 4.3.2, CUSP performs not only the solve phase of the linear solver but even the setup phase on the GPU. This does, however, require a version of AMG that can be constructed with fine-grained parallelism. To our knowledge, it is not clear whether it is possible to implement the BSSA algorithm efficiently on a GPU. A more general point can be made from this. The design decision to construct the preconditioner on the GPU makes sense given that it enables CUSP to offer a complete black box GPU solver; however, it excludes preconditioners that are not feasible for efficient setup on GPUs. Given CUSP's and THRUST's functionality for generic data structures and copying between host and device, an interesting question is whether it is possible to construct the preconditioner on the CPU and transfer it to the GPU for the solve phase of the algorithm. If so, one could consider aggregation schemes that have been successfully implemented for CPUs, such as the ones used in AGMG [17], or even classical AMG [25], possibly allowing approaches more tailored to anisotropic problems. This requires, however, that the preconditioner can be constructed in a way that is compatible with the data structures used in CUSP and allows for efficient V-cycle execution.


CHAPTER 4. IMPLEMENTATION 4.3. SETUP PHASE
Direct Interpolation
Direct interpolation uses only direct C-F point connections for interpolation. This can be implemented using either only strongly connected C points or all C points with at least a weak connection to the respective F point. The decision was to implement the first approach because this usually leads to sparser prolongation matrices and also to sparser operator matrices via the Galerkin operation, improving complexity and performance. The disadvantage is that interpolation might be worse, although experiments have shown that the effect is negligible. The interpolation is otherwise quite straightforward. To improve performance, the summations of the C point coefficients and coefficient row sum in equation 2.1 are calculated on a per-row level. Furthermore, the rows of the prolongation matrix are computed in parallel. If chosen by the user (interpolweight > 0), each row is truncated to further reduce the number of non-zero points per row (see section 2.4.1). The truncation operation is quite straightforward and therefore not described.
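The per-row computation can be illustrated with one common textbook form of direct interpolation weights (an assumption on our part; equation 2.1 in the original text may differ in detail, e.g. by restricting the sums to strong connections as the chapter's chosen variant does):

```python
def direct_interp_weights(row, i, coarse):
    """Direct interpolation weights for F-point i from its C-neighbors:
        w_ij = -alpha * a_ij / a_ii,
        alpha = (sum of all off-diagonal a_ik) / (sum of a_ik over k in C).
    `row` maps column index to matrix value for row i."""
    a_ii = row[i]
    offdiag = {j: a for j, a in row.items() if j != i}
    alpha = sum(offdiag.values()) / sum(a for j, a in offdiag.items()
                                        if j in coarse)
    return {j: -alpha * a / a_ii for j, a in offdiag.items() if j in coarse}

# F-point 1 between C-points 0 and 2 in a 1D Laplacian row [-1, 2, -1]
print(direct_interp_weights({0: -1.0, 1: 2.0, 2: -1.0}, 1, {0, 2}))
# -> {0: 0.5, 2: 0.5}
```

The row sum and the C-point coefficient sum appear once per row, which is why computing them at the per-row level, as the text describes, is the natural optimization.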


sorting the intermediate matrix in global memory was identified as the primary performance-limiting factor of our implementation. We therefore proposed and evaluated a selective segmented processing approach that utilizes a lightweight analysis procedure to order processing of the intermediate matrix into uniform subsets. This regularization proved to be most beneficial for workloads with a moderate number of products per row, allowing efficient processing completely within fast shared on-chip memory. Finally, we considered an alternative SpGEMM implementation that ignored segmentation and utilized a two-phase sorting scheme to reduce the total processing time of the ESC algorithm. We showed that this strategy maintained the performance stability of the ESC algorithm while decreasing the overall processing time dramatically in some cases. There are numerous lines of future work concerning SpGEMM and other sparse matrix operations on GPUs. In this work we focused on analyzing and accelerating SpGEMM using traditional storage formats, such as COO and CSR, but many computational packages support specialized SpMV storage formats. It is unclear how the use of these formats will impact the implementation and performance of SpGEMM on GPUs. In the context of AMG we have considered performing the coarse matrix construction using two SpGEMM operations. It is an open question whether it is possible to perform both SpGEMM operations simultaneously to reduce the memory overhead associated with forming the intermediate matrix and improve the performance.
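The Expansion-Sort-Compression (ESC) structure referred to above can be made concrete with a small serial sketch on COO triples. This illustrates the three phases and the role of the intermediate matrix; it is not the GPU implementation, and the function name is ours:

```python
from collections import defaultdict

def spgemm_esc(A, B):
    """ESC SpGEMM on COO triples (row, col, val): expand all partial
    products a_ik * b_kj, sort them by (row, col), then compress
    duplicate coordinates by summation."""
    # expansion: one partial product per (a_ik, b_kj) pair -> the
    # intermediate matrix whose sorting dominates GPU runtime
    B_by_row = defaultdict(list)
    for k, j, v in B:
        B_by_row[k].append((j, v))
    products = [(i, j, va * vb) for i, k, va in A for j, vb in B_by_row[k]]
    # sort: gather all duplicates of each output entry together
    products.sort(key=lambda t: (t[0], t[1]))
    # compression: sum runs with equal (row, col)
    out = []
    for i, j, v in products:
        if out and out[-1][0] == i and out[-1][1] == j:
            out[-1][2] += v
        else:
            out.append([i, j, v])
    return [tuple(t) for t in out]

# [[1, 2], [0, 3]] @ [[4, 0], [1, 1]] = [[6, 2], [3, 3]]
A = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
B = [(0, 0, 4), (1, 0, 1), (1, 1, 1)]
print(spgemm_esc(A, B))  # -> [(0, 0, 6), (0, 1, 2), (1, 0, 3), (1, 1, 3)]
```

The segmented and two-phase-sorting variants discussed above change only how the sort step is organized; expansion and compression are unchanged.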


Abstract. Algebraic multigrid (AMG) is one of the most efficient and scalable parallel algorithms for solving sparse linear systems on unstructured grids. However, for large three-dimensional problems, the coarse grids that are normally used in AMG often lead to growing complexity in terms of memory use and execution time per AMG V-cycle. Sparser coarse grids, such as those obtained by the Parallel Modified Independent Set coarsening algorithm (PMIS) [7], remedy this complexity growth, but lead to non-scalable AMG convergence factors when traditional distance-one interpolation methods are used. In this paper we study the scalability of AMG methods that combine PMIS coarse grids with long distance interpolation methods. AMG performance and scalability is compared for previously introduced interpolation methods as well as new variants of them for a variety of relevant test problems on parallel computers. It is shown that the increased interpolation accuracy largely restores the scalability of AMG convergence factors for PMIS-coarsened grids, and in combination with complexity reducing methods, such as interpolation truncation, one obtains a class of parallel AMG methods that enjoy excellent scalability properties on large parallel computers. Key words: algebraic multigrid, long range interpolation, parallel implementation, reduced complexity, truncation


Chapter 1
Introduction
1.1 Introduction
Smoothed aggregation-based (SA) [74, 72] algebraic multigrid (AMG) [17, 64] is a popular and effective solver for systems of linear equations that arise from discretized partial differential equations (PDEs). While SA has been effective over a broad class of problems, it has several limitations and weaknesses that this thesis is intended to address. This includes the development of a more robust strength-of-connection measure, which guides coarsening and the choice of interpolation sparsity patterns. Unfortunately, the classic strength measure is only well-founded for M-matrices, leading us to develop a new measure based on local knowledge of both algebraically smooth error and the behavior of interpolation. Another limitation is that classic SA is only formally defined for Hermitian positive definite (HPD) problems. For non-Hermitian operators, the operator-induced energy-norm does not exist, which impacts the complementary relationship between relaxation and interpolation. This requires a redesign of SA, such that restriction and prolongation operators approximate the left and right near null-spaces, respectively. As a result, we develop general SA setup algorithms for both the HPD and the non-Hermitian cases. To realize these algorithms, we develop general prolongation smoothing methods so that restriction and prolongation target the left and right near null-spaces, respectively. The right near null-space is loosely defined for a matrix A as all vectors v such that Av ≈ 0. Likewise, the left near null-space is loosely defined as all vectors u such that A∗u ≈ 0. For Hermitian matrices, the left and right near null-spaces are identical. Overall, the proposed methods do not assume any user input beyond what standard SA does, and the result is a new direction for multigrid methods for non-Hermitian systems.
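The left/right near null-space definitions above can be checked numerically via the smallest singular triplet: the right singular vector v minimizes ||Av|| and the corresponding left singular vector u minimizes ||A∗u||. A small illustration (our own example matrix, not one from the thesis):

```python
import numpy as np

def near_nullspaces(A):
    """Return (u, sigma, v) for the smallest singular triplet of A,
    so that ||A v|| = ||A^* u|| = sigma (the near null-space quality)."""
    U, s, Vh = np.linalg.svd(A)          # singular values in descending order
    return U[:, -1], s[-1], Vh[-1, :]

# nonsymmetric example: left and right near null-spaces differ
A = np.array([[1.0, 100.0], [0.0, 1.0]])
u, sigma, v = near_nullspaces(A)
print(np.linalg.norm(A @ v), np.linalg.norm(A.conj().T @ u))  # both ~ sigma
```

For a Hermitian matrix, u and v coincide (up to sign), recovering the statement in the text that the two near null-spaces are identical.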


Step 2i takes O(log n) time using Θ(n) processors. Step 2ii is very efficient: it takes constant time using a linear number of processors on an EREW PRAM. Step 2iii takes O(log n) time using n² processors on an EREW PRAM. Finally, the recursive steps take log n stages since each new subproblem is at most half the size of the previous problem; further, the sum of the sizes of the new problems is less than the size of the previous problem, and hence the processor count is dominated by the first stage. Thus the algorithm takes O(log² n) time using Θ(n) processors.


For the rapid computation of the features at many levels we introduce the concept of the integral image representation. The integral image can be computed from an image with only a few operations per pixel. Once computed, any one of these Haar-like features can be computed at any scale or location in constant time. Another contribution of the research work is a process for developing a classifier with selection of a small number of features using LDA. The weak classifier is constrained so that each weak classifier returned can depend on only a single feature. The third contribution is a method that combines successively simple to complex classifiers in a parallel and cascaded structure, which increases the speed of the detector by focusing attention on promising regions of the image. In the domain of face detection it is possible to achieve less than 1 per cent false negatives and 40 per cent false positives with a classifier developed from five Haar-like features.
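The constant-time property comes from the fact that any rectangular sum reduces to four lookups in the integral image. A minimal sketch (the helper names are ours; a zero border is prepended so no edge cases are needed):

```python
import numpy as np

def integral_image(img):
    """Cumulative 2D sum with a zero border row/column prepended:
    ii[r, c] = sum of img[:r, :c]."""
    ii = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in constant time via four corner lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 3), img[1:3, 1:3].sum())  # both 30
```

A Haar-like feature is just a signed combination of a few such box sums, which is why its cost is independent of scale and location.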

The PRO model is inspired by the Bulk Synchronous Parallel (BSP) model introduced by Valiant [1990] and the Coarse Grained Multicomputer (CGM) model of Dehne et al. [1996]. In the BSP model a parallel algorithm is organized as a sequence of supersteps with distinct computation and communication phases. The emergence of the BSP model marked an important milestone in parallel computation. The model introduced a desirable structure to parallel programming, and was accompanied by the definition and implementation of communication infrastructure libraries due to Bonorden et al. [1999] and Hill et al. [1998]. Recently, Bisseling [2004] has written a textbook on scientific parallel computation using the BSP model. From an algorithmic (as opposed to a programming) point of view, we believe that the relatively many and machine-specific parameters involved in the BSP model make the design and analysis of algorithms somewhat cumbersome. The CGM model partially addresses this limitation as it involves only two parameters, the input size and the number of processors. The CGM model is a specialization of the BSP model in that the communication phase of a superstep is required to consist of single long messages rather than multiple short ones. A drawback of the CGM model is the lack of an accurate performance measure; the number of communication rounds (supersteps) is usually used as a quality measure, but as we shall see later in this paper, this measure is sometimes inaccurate.


means of achieving higher performance, and this brought about the advent of multi-core technology and mainstream parallel computation. Multi-core processor technology was particularly attractive and promising, especially because manufacturers were able to more than double performance without necessarily increasing the operating frequency, simply by adding more processing cores. This means that devices are able to do more as more processing power becomes readily available, and this led to an age of ubiquitous computing as these chips powered almost everything, ranging from small devices such as mobile phones and tablets to home computers to enterprise server systems and supercomputers. However, the problem of energy consumption soon re-surfaced, and it became highly imperative that system designers and developers tackle the issue for both technical and economic reasons in order to prolong the sustainability of multi-core and parallel systems.
