PERFORMANCE ANALYSIS AND OPTIMIZATION OF LARGE-SCALE SCIENTIFIC APPLICATIONS JINGJIN WU

(1)

SCIENTIFIC APPLICATIONS

BY JINGJIN WU

Submitted in partial fulﬁllment of the requirements for the degree of Doctor of Philosophy in Computer Science

in the Graduate College of the Illinois Institute of Technology

Approved

Advisor

Chicago, Illinois July 2013

(2)

(3)

First and foremost, I would like to thank my advisor Prof. Zhiling Lan. With-out her guidance and support throughWith-out my Ph.D. study, this dissertation would not have been accomplished. I appreciate all her research directions, technical criticisms, as well as valuable feedbacks from innumerable oﬃce discussions. Her intelligence, wisdom and diligence are contagious and motivational for me. I will always be in her debt and admire her for the great person she is. I would also like to thank Prof. Nick-olay Y. Gnedin, Prof. Andrey V. Kravtsov, Dr. Douglas H. Rudd and Dr. Roberto E. Gonz´alez, who have also guided me and worked with me on research related to this thesis. I am also thankful to my thesis committee: Prof. Xian-He Sun, Prof. Ioan Raicu, and Prof. Jia Wang for their time, interest, suggestions and comments. My special thanks go to my friends and colleagues in the SCS group for their friendship and encouragement. Last, but not least, I would like to thank my husband Xuanxing and our parents for their support at all times, and I hope that I can make them proud.

(4)

Page ACKNOWLEDGEMENT . . . iii LIST OF TABLES . . . vi LIST OF FIGURES . . . ix LIST OF SYMBOLS . . . x ABSTRACT . . . xi CHAPTER 1. INTRODUCTION . . . 1 1.1. Motivation . . . 1 1.2. Contributions . . . 4 1.3. Thesis Organization . . . 7 2. COSMOLOGY SIMULATIONS . . . 8

2.1. Adaptive Mesh Reﬁnement . . . 8

2.2. The Adaptive Reﬁnement Tree (ART) Code . . . 10

2.3. Performance Analysis . . . 17

3. PERFORMANCE EMULATION . . . 23

3.1. Proposed Approach . . . 23

3.2. Load Balancing Schemes . . . 28

3.3. Experiments . . . 30

4. OVERVIEW OF TOPOLOGY MAPPING . . . 38

4.1. Background . . . 38

4.2. The Topology Mapping Problem . . . 39

4.3. Related Works . . . 42

5. HIERARCHICAL TOPOLOGY MAPPING FOR ART . . . 48

5.1. Problem Statement . . . 48

5.2. Proposed Approach . . . 49

5.3. Performance Evaluation . . . 57

6. NETWORK AND MULTICORE AWARE TOPOLOGY MAPPING 66 6.1. Proposed Methodology . . . 66

(5)

7. ANALYTICAL TOPOLOGY MAPPING . . . 87

7.1. Preliminary . . . 87

7.2. Algorithm Overview . . . 89

7.3. Global Mapping . . . 90

7.4. Legalization . . . 94

8. TOPOMAP: A TOPOLOGY MAPPING LIBRARY . . . 107

9. CONCLUSION AND FUTURE WORK . . . 109

BIBLIOGRAPHY . . . 111

(6)

Table Page 3.1 Overall Load Balance Ratio of Diﬀerent Load Balancing Schemes . 34 5.1 Average Intra-Socket and Inter-Socket Communication Time

(Ping-Ping) . . . 58 6.1 Properties of Sparse Matrices . . . 76 6.2 Overhead of Topology Mapping on Stampede (Time in Seconds) . . 82 6.3 Overhead of Topology Mapping on Kraken by Using the Recursive

Bipartitioning Mapping Algorithm (Time in Seconds) . . . 85 6.4 Overhead of Topology Mapping on Kraken by Using the Recursive

Tree Mapping Algorithm (Time in Seconds) . . . 85 7.1 The Relations between Topology Mapping and VLSI Placement . . 88 7.2 Topology of the Blue Gene/P Supercomputer . . . 98 7.3 Overhead of the Proposed Analytical Mapping Algorithm. # Proc.:

the number of processes; Time: the total runtime in seconds; # Iter.:

the number of iterations in the global mapping stage. . . 105

(7)

Figure Page 2.1 A 2D block-structured adaptive mesh reﬁnement example with a

reﬁnement factor r = 2. . . . 9

2.2 A 2D cell-based adaptive mesh reﬁnement example: a quad tree with a reﬁnement factor r = 2. . . . 10

2.3 An example of large-scale ART simulation. Three panels show the dark matter (left), cosmic gas (middle) and stars (right), respectively. 11 2.4 Flow of the ART code. The four steps in the dotted region are the major steps for evolving a time step at each level. . . 12

2.5 The execution order for evolving time steps on three levels (reﬁne-ment factor r = 2). . . . 13

2.6 A space-ﬁlling curve traversing a 2D mesh (left), and a parallel par-tition of the 2D adaptive mesh into four parts (right). The cells with the same color are assigned to the same process. . . 14

2.7 The communication pattern of ART during a large-scale cosmology simulation with 1536 processes. . . 16

2.8 Execution time of the ART code on Abe for simulating a single iteration of gf25CV1 at a late physical epoch with medium amount of particles. . . 20

2.9 Execution time of the ART code on Ranger for simulating a single iteration of gf25CV1 at an early physical epoch with large amount of particles. . . 20

2.10 Total parallel scaling of the ART code on Abe. . . . 21

2.11 Total parallel scaling of the ART code on Ranger. . . . 22

3.1 Flow of performance emulator of ART. . . 26

3.2 Part of the time axes of two processes. . . 27

3.3 Comparison of actual runtime and emulated runtime of ART. . . . 32

3.4 Comparison of emulated runtime by using diﬀerent load balancing schemes for coarse resolution case. . . 33

3.5 Comparison of emulated runtime by using diﬀerent load balancing schemes for ﬁne resolution case. . . 34

(8)

3.7 Load Balance Ratio of each level for ﬁne resolution case. . . 36 3.8 Average communication time of each level for coarse resolution case. 37 3.9 Average communication time of each level for ﬁne resolution case. 37 5.1 Interconnected multiprocessor clusters with multicore CPUs on each

node. . . 49 5.2 The recursive bipartitioning algorithm for inter-node mapping. . . 51 5.3 The algorithm for intra-node mapping by minimizing the maximum

inter-socket message size (MIMS). . . 55 5.4 The intra-node topology of Ranger (from TACC Ranger website [79]). 60 5.5 Comparison of inter-node mapping and default mapping on Kraken

in terms of hop-bytes. . . 60 5.6 Comparison of intra-node mapping and default mapping on Kraken

in terms of MIMS (the maximum inter-socket message size). . . . 61 5.7 Comparison of hierarchical mapping and inter-node mapping on

Kraken in terms of MIMS (the maximum inter-socket message size). 61 5.8 Communication time reduction of diﬀerent mapping mechanisms

compared to default mapping on Kraken. . . 62 5.9 Comparison of inter-node mapping and default mapping on Ranger

in terms of hop-bytes. . . 63 5.10 Comparison of intra-node mapping and default mapping on Ranger

in terms of MIMS (the maximum inter-socket message size). . . . 64 5.11 Comparison of hierarchical mapping and inter-node mapping on

Ranger in terms of MIMS (the maximum inter-socket message size). 64 5.12 Communication time reduction of diﬀerent mapping mechanisms

compared to default mapping on Ranger. . . 65 6.1 High performance computing systems with multicore CPU(s) on

each compute node. . . 66 6.2 Topology mapping framework. . . 67 6.3 A fat-tree network topology, and the neighbor joining tree

represent-ing the topology of the nodes allocated to a user application (light

green). . . 69

(9)

6.5 A 2D mesh/torus topology, and the neighbor joining tree represent-ing the topology of the nodes allocated to a user application (light

green). . . 71

6.6 Recursive bipartitioning of a 2D mesh/torus topology. The nodes allocated to the user application are light green. . . 72

6.7 The recursive bipartitioning mapping algorithm for inter-node map-ping on mesh/torus topology. . . 73

6.8 A machine architecture and its corresponding topology tree. . . . 74

6.9 Hop-bytes reduction of sparse matrix tests on Stampede. . . 78

6.10 Communication time reduction of sparse matrix tests on Stampede. 79 6.11 Hop-bytes reduction of sparse matrix tests on Kraken. . . 80

6.12 Communication time reduction of sparse matrix tests on Kraken. 81 6.13 Performance results of ART tests on Stampede. . . 83

6.14 Performance results of ART tests on Kraken. . . 84

7.1 The analytical mapping algorithm. . . 89

7.2 The process migration algorithm. . . 94

7.3 The communication pattern of SP, CG and ART (1024 processes). “nz” is the number of blue dots, which indicates the number of inter-process communication. . . 99

7.4 NPB benchmark results – SP. . . 101

7.5 NPB benchmark results – CG. . . 102

7.6 Cosmology application results. . . 104

8.1 Flow of the topology mapping library TOPOMap. . . 108

(10)

Symbol Deﬁnition

c(u, v) The amount of communication in bytes between task u and v Ep The set of edges in the topology graph

Et The set of edges in the task graph Gp The topology graph

Gt The task graph Vp The set of processors Vt The set of tasks

ϕ The mapping from tasks in Vt to processors in Vp

(11)

Scientific applications are critical for solving complex problems in many areas of research, and often require a large amount of computing resources in terms of both runtime and memory. Massively parallel supercomputers with ever increasing com-puting power are being built to satisfy the need of large-scale scientific applications. With the advent of petascale era, there is an enlarged gap between the computing power of supercomputers and the parallel scalability of many applications. To take full advantage of the massive parallelism of supercomputers, it is indispensable to im-prove the scalability of large-scale scientific applications through performance analysis and optimization.

This thesis work is motivated by cell-based AMR (Adaptive Mesh Refine-ment) cosmology simulations, in particular, the Adaptive Refinement Tree (ART) application. Performance analysis is performed to identify its scaling bottleneck, a performance emulator is designed for efficient evaluation of different load balancing schemes, and topology mapping strategies are explored for performance improve-ments. More importantly, the exploration of topology mapping mechanisms leads to a generic methodology for network and multicore aware topology mapping, and a set of efficient mapping algorithms for popular topologies. These have been imple-mented in a topology mapping library – TOPOMap, which can be used to support MPI topology functions.

(12)

CHAPTER 1 INTRODUCTION

1.1 Motivation

In the recent decades, numerical simulation has become critical for scientific and engineering research. As an important complement to theory and experiment, it has been successfully utilized to study complex phenomena in many areas, such as computational fluid dynamics, electromagnetics, materials science, meteorology, cosmology, and so on. The resultant scientific applications are often highly compli-cated, and require a large amount of computing resources in terms of both runtime and memory, especially for large-scale simulations. To satisfy the need of large-scale scientific applications, massively parallel high performance computing (HPC) systems with ever increasing computing power are being built. In the current TOP500 list [81] (June 2013), all the supercomputers have at least several thousands of cores. The No. 1 supercomputer Tianhe-2 has 3, 120, 000 cores, achieving a Linpack performance (Rmax) of 33, 862.7 TFlop/s. More petascale machines will be available in the near future. In order to take full advantage of the massive parallelism of supercomputers, it is a must to improve the scalability of scientific applications through performance optimization. However, this is very challenging.

One important issue is load balancing, which becomes increasingly important as the system size scales up. In practice, the scaling bottleneck of applications is often attributable to the imbalanced workload distribution among processes. Although load balancing schemes has been extensively studied in both theory and application during the past decades, it is still highly non-trivial to achieve ideal load balance on large-scale systems due to the following reasons. First, ﬁne-grained load balancing should be adopted to reduce the imbalance, but this often increases the time complexity of load balancing routines. Second, many scientiﬁc simulations are highly dynamic, with the

(13)

workload distribution changing dynamically throughout the execution. Hence, highly efficient and scalable (dynamic) load balancing algorithms should be integrated into applications. To achieve this goal, we need to explore and experiment different load balancing algorithms. However, load balancing schemes are tightly related to the problem decomposition and data structure of the application. Developers are often confined to a few choices, which block the exploration of more efficient solutions. To explore new schemes, major code changes are often required, which is costly and risky if the expected performance is not evaluated before hand. Besides, it often takes a long runtime to experiment the performance of different schemes through real runs. As a result, we need a smarter way to evaluate various load balancing schemes accurately and efficiently.

Communication optimization is another important issue. If the workload dis-tribution is well-balanced, then the communication time often determines the overall scaling of the application. There are at least three approaches to reduce the com-munication cost. First, we can aggregate multiple small messages into comparatively large messages to fully utilize the bandwidth of the interconnection network. This can substantially reduce the overall communication latency. Second, communication should be overlapped with computation whenever it is possible, so that the com-munication overhead can be hid. These two approaches have been widely used by application developers to improve the communication performance. The third ap-proach is topology-aware task mapping [12], or simply topology mapping, which is relatively under-exploited, but critical for scaling on petascale systems and beyond.

According to the topology of the interconnection network and the communica-tion pattern of the applicacommunica-tion, topology-aware task mapping techniques map parallel application tasks (i.e. processes) onto processors (i.e. nodes or cores) properly to min-imize the communication cost. The motivation is that message latencies are highly

(14)

dependent on the distance between communicating processors and the network con-tention. From the HPC system perspective, the interconnection networks are always sparse, e.g., fat-tree [55], 3D mesh or 3D torus [1, 2]. As the system scales up, the diameter of the interconnection network (i.e. the maximum distance between two nodes) increases, and the bisection bandwidth (i.e. the minimum total bandwidth of links connecting one half of the HPC system and the other) often decreases, making the communication increasingly expensive. Consider that parallel scientiﬁc applica-tions usually have sparse communication pattern, it is critical to map processes onto nodes properly, so that the traﬃc in the network will be localized, leading to better communication performance.

There are other aspects for performance optimization of scientiﬁc applications, e.g. I/O optimization [97], fault tolerance [56], and so on. This thesis work focuses on load balancing and communication optimization as they are the dominating factors for the performance of many applications.

Speciﬁcally, we consider cosmology simulations, and study the adaptive reﬁne-ment tree (ART) code [47], which is an advanced “hydro+N-body” simulation tool for cosmological research. Simulations are critical to making quantum leaps in cos-mological research as they provide insight for the evolution of the universe, e.g., the formation of stars and galaxies. There are mainly two categories of cosmology simula-tion tools: those that only simulate the dark matter (often referred to as “N-body”), and those that model gas dynamics (often called “hydro”). Since cosmologically rele-vant scales are mainly dependent on the dark matter, a hydro simulation tool always includes an N-body component for modeling the dark matter. Typically, hydro sim-ulations are much more compute-intensive than purely N-body simsim-ulations. Modern hydro simulations become even more computationally demanding in terms of both runtime and memory as more and more physical processes are included, e.g., gas

(15)

cooling, star formation and feedback, radiative transfer, and so on. Adaptive mesh reﬁnement (AMR) [8, 68] has been widely applied to model the dynamics of cosmic baryons (gas and stars) for cosmology simulation, since it can follow the fragmenta-tion of gas down to virtually unlimited small scales. Enzo [32] and the ART code are representative cosmology simulation tools using AMR. In particular, the ART code uses the cell-based AMR algorithm [46, 88], and incorporates many physical calculations for cosmology research, making it unique in its capabilities.

As cosmology simulations usually consume a large amount of computing re-sources in terms of both runtime and memory, they are typically carried out on massively parallel high performance computing (HPC) platforms, e.g., HPC clusters. Production simulations using the ART code often involve physics computations for thousands of time steps, and can take several weeks or even months using hundreds of processing cores on production HPC platforms. It is important to further improve the ART code, so that it can scale up to petascale systems, and enable cosmologists to solve complex problems more eﬃciently.

1.2 Contributions

To improve the ART code, we analyze its performance on production HPC systems to identify its performance bottleneck, design a performance emulator for eﬃcient evaluation of load balancing schemes, and explore topology-aware task map-ping techniques for performance optimization. Moreover, we also propose a generic methodology for network and multicore aware topology mapping, and develop a set of eﬃcient mapping algorithms for popular topologies. The major contributions of this dissertation are as follows.

• Performance Emulation. A performance emulator is designed for cell-based AMR cosmology simulations. The emulator follows the ﬂow of the original

(16)

application, i.e. the ART code, while the major physical computation and interprocess communication are replaced by runtime estimates provided by per-formance models. We demonstrate the eﬀectiveness of the emulator by means of realistic cosmology simulation data on production systems. Experiments in-dicate that the emulator achieves good accuracy. Given the dynamic feature of AMR and the wide range of workload per cell, load balancing is an extremely challenging problem for cell-based cosmology simulations. We further evaluate three load balancing schemes for cell-based AMR cosmology simulations via the performance emulator. The use of the emulator enables us to quickly identify the issues associated with diﬀerent load balancing schemes.

• Hierarchical Topology Mapping for ART. A hierarchical topology map-ping algorithm is proposed to map ART processes onto multiprocessor clusters, where each node contains several multicore CPUs. In order to exploit the ar-chitectural properties of multiprocessor clusters (the performance gap between node and intra-node communication, as well as the gap between inter-socket and intra-inter-socket communication), we propose to perform topology map-ping in a hierarchical manner. First, the mapmap-ping of processes onto nodes (i.e., inter-node mapping) is obtained by using the recursive bipartitioning technique to minimize the amount of traﬃc in the network. Second, for each node, the mapping of processes onto multicore CPUs (i.e., intra-node mapping) is de-rived by minimizing the maximum size of messages transmitted between CPU sockets. Experiments on production HPC systems show that signiﬁcant commu-nication time reduction can be achieved by using the optimized mappings. This hierarchical approach has a wide applicability for cell-based AMR cosmology simulations, and the general methodology of performing hierarchical mapping can be used for many parallel applications.

(17)

• Network and Multicore Aware Topology Mapping. A generic topology mapping methodology, which considers both the topology of the interconnection network and the hierarchical architecture of multicore compute nodes, is pro-posed for communication optimization of parallel applications on HPC systems. It is the broad extension of the hierarchical topology mapping of ART. Specifi-cally, the mapping is still performed in two phases. In the first phase, the map-ping of processes onto compute nodes is derived by utilizing efficient algorithms, which exploit the structure of the interconnection network to reduce inter-node message traffic. In the second phase, the mapping of processes onto logical pro-cessors is determined by partitioning the communication graph according to the intra-node hierarchical architecture, which is represented as a tree. We consider supercomputers with popular fat-tree and mesh/torus topologies, and develop efficient mapping algorithms, including a recursive tree mapping algorithm for generic tree topologies, and a recursive bipartitioning algorithm, which par-titions the compute nodes of mesh/torus topologies by using their geometric coordinates. These have been integrated in the topology mapping library – TOPOMap, which can be employed to support MPI topology functions. Ex-periments on production HPC systems show that significant performance gain can be achieved by using TOPOMap to find optimized mappings.

• Analytical Topology Mapping. An analytical mapping algorithm is pro-posed for topology mapping onto 3D mesh/torus-connected supercomputers. The design of this algorithm is motivated by mapping applications with irreg-ular communication patterns onto regirreg-ular 3D mesh/torus topology, and it is inspired by the analytical placement technique [84, 85] for VLSI design. Specif-ically, a two-stage strategy is used to determine the mapping. The ﬁrst stage derives a rough global mapping solution by using quadratic programming, which intends to minimize the amount of traﬃc in the network. The second stage

(18)

le-galizes the rough global mapping solution to get a proper mapping by using a diﬀusion-like migration algorithm to move the processes between nodes. The resulting mapping is highly optimized due to the global optimization through quadratic programming. Experiments with popular benchmarks and the ART application on IBM Blue Gene/P system [41] show that the analytical map-ping algorithm derives highly optimized mapmap-pings, which lead to signiﬁcant performance gains.

1.3 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 introduces the Adap-tive Reﬁnement Tree (ART) application, and analyzes its performance. Chapter 3 elaborates the design of the performance emulator and the evaluation of diﬀerent load balancing schemes. Chapter 4 presents an overview of topology mapping techniques for communication optimization of HPC applications. Chapter 5 proposes the hierar-chical topology mapping algorithm for ART. Chapter 6 introduces the methodology and algorithms for network and multicore aware topology mapping. Chapter 7 pro-poses the analytical topology mapping algorithm for regular 3D mesh/torus topology. Chapter 8 introduces the topology mapping library – TOPOMap. Finally, Chapter 9 summarizes this thesis and discusses the future work.

(19)

CHAPTER 2

COSMOLOGY SIMULATIONS

This chapter introduces a cosmology simulation application, the Adaptive Re-ﬁnement Tree (ART) code, which is based on adaptive mesh reRe-ﬁnement (AMR) [8,68], and presents its performance analysis on production HPC systems.

2.1 Adaptive Mesh Refinement

Numerical simulations of many multiscale physical phenomena consume enor-mous computing resources in terms of both runtime and memory, because their mul-tiple spatial and(or) temporal scales are discretized into the finest resolution for so-lution accuracy. This excessive resource requirement usually makes large-scale nu-merical simulations prohibitive. However, some resources are often underutilized for performing computation of the spatial and(or) temporal regions which do not require the finest resolution. To achieve better computing efficiency, AMR [8, 68] has been proposed to employ high resolution only in those required regions, so that we can focus resources on compute-intensive regions. AMR allows the user to perform simulations that are completely intractable on a uniform mesh. It has been much studied for more than two decades, and has proved to be successful in modeling multiscale phe-nomena for a variety of disciplines, including computational fluid dynamics, thermal dynamics, material science, geophysics, meteorology, cosmology, astrophysics, and so on.

Generally, there are two kinds of AMR strategies: structured and unstructured. Structured AMR uses unions of regular meshes or cells to cover the computational domain, while unstructured AMR utilizes mesh distortion, which provides greater geometric ﬂexibility at the cost of storing all neighborhood relations explicitly. In this dissertation, we focus on structured AMR, which is often implemented in two

(20)

Figure 2.1. A 2D block-structured adaptive mesh reﬁnement example with a reﬁne-ment factor r = 2.

ways: block-structured AMR [8,18] and cell-based AMR [46,88]. The former achieves high spatial resolution by inserting smaller meshes (“blocks”) at places where high resolution is needed. The latter instead refines the computational domain on a cell-by-cell basis. In practice, these two methods use different data structures and very different methods for distributing the computational load across a large number of processes. In this section, we briefly review these two structured AMR approaches.

2.1.1 Block-structured AMR. The basic principle of block-structured AMR is

straightforward. Initially, a uniform mesh is adopted for the entire computational domain. In the regions which require higher resolution, finer meshes (“blocks”) are added. If some regions still need more resolution, even finer meshes are added. The computational domain is refined recursively in this way, and turns into a tree of meshes. The initial uniform mesh, as the tree’s root, is at the top level. Each finer level decreases the mesh size by a factor r, which is called the refinement factor. Figure 2.1 shows a block-structured AMR example on the 2D mesh, including both mesh hierarchy and the overall structure. As each mesh block is regular, the sequential

(21)

Figure 2.2. A 2D cell-based adaptive mesh reﬁnement example: a quad tree with a reﬁnement factor r = 2.

code of regular meshes can be reused for block-structured AMR implementations. Load balancing can be achieved by evenly distributing mesh blocks to processors.

2.1.2 Cell-based AMR. The cell-based AMR implements the AMR algorithm by

performing grid refinement based on each cell. If higher spatial resolution is required for a cell, then it is refined into smaller cells, which are at the finer level of the overall grid hierarchy. With each level up, the cell size is decreased by the refinement factor r. A simple example of cell-based AMR on the 2D mesh is shown Figure 2.2. For simplicity, there are only 3 levels of cells from level 0 to level 2. Initially, a uniform grid on level 0 covers the overall computational domain. The dotted cells need higher resolution, so they are further refined. Throughout the execution of the AMR application, the grid hierarchy changes adaptively, and the cells are organized in refinement trees [46,88]. For each refinement tree, its root is the corresponding cell at level 0. The cell with children is a non-leaf cell. Otherwise, it is a leaf cell. Compared with block-structured AMR, cell-based AMR is more flexible, and can easily adapt to high resolution in localized regions.

(22)

Figure 2.3. An example of large-scale ART simulation. Three panels show the dark matter (left), cosmic gas (middle) and stars (right), respectively.

The ART code is an advanced “hydro+N-body” simulation tool for cosmo-logical research. It simulates the evolution of the universe, or more speciﬁcally, the formation of stars and galaxies. It employs a combination of multi-level particle-mesh and shock-capturing Eulerian methods for simulating the evolution of dark matter and cosmic gas, respectively. High dynamic range is achieved by applying adaptive mesh reﬁnement to both gas dynamics and gravity calculations. The ART code is distinguished from the rest of cosmological simulation tools in the large number of physical processes it includes, which enable comprehensive simulation of cosmological phenomena and provide deep insight for cosmologists. Figure 2.3 shows the visual-ization of a large-scale ART simulation, including dark matter, cosmic gas and stars.

In particular, ART is a hybrid “MPI+OpenMP” C program, with Fortran functions for compute-intensive routines. The MPI parallelization is used between separate compute nodes and the OpenMP parallelization is used inside a multi-core node. This mixed mode parallelization enables us to take full advantage of modern multi-core architectures. The ART code employs the cell-based AMR algorithm, per-forms refinement locally on individual cells and organizes cells in refinement trees. In order to model the universe, it adopts a cubic computational volume with a refinement factor of 2. For each cubic cell, the refinement operation evenly subdivides the cell

(23)

Figure 2.4. Flow of the ART code. The four steps in the dotted region are the major steps for evolving a time step at each level.

into 8 cells, namely an oct. Hence, the refinement tree is also called oct-tree [46, 88]. In a typical cosmology simulation, the highest resolution regions can reach 7 to 10 refinement levels, resulting in a large range of dynamic multidimensional regions. With cell-based AMR, the ART code is able to control the computational mesh on the level of individual cells, such that the refinement mesh can easily be built and modified, and therefore, can effectively match the complex geometry of cosmologically interesting regions. In the following subsections, we briefly review its flow, domain decomposition, and communication pattern. Additional details about the ART code can be found in [47, 48, 97].

2.2.1 Flow of ART. Figure 2.4 shows the basic ﬂow of the ART code. First, it reads

input files, including parameter files and cosmology data. Second, it initializes oct-tree and cell buffer. Then it checks whether the simulation time reaches the user specified time limit. If yes, the simulation stops, otherwise the ART code performs load balance

(24)

Level 0 Level 1 Level 2 7th 3rd 6th 1st 2nd 4th 5th Time Steps

Figure 2.5. The execution order for evolving time steps on three levels (reﬁnement factor r = 2).

and simulates another iteration by evolving time steps for the overall computational domain, including the cells at all the levels. For each level, the evolution of a time step mainly consists of four steps: collect boundary information, perform physics computation, project physics data to the coarser level, and adaptively refine/derefine the cells. At the end of each iteration, the ART code generates output files, including log files and cosmology data files if any.

Specifically, in each iteration of the simulation, ART evolves a time step dt for the overall computational domain, which is adaptively refined into multiple levels of cells during simulation. This refinement in the spatial domain is accompanied by a refinement in the time domain, where finer level meshes or cells evolve with a smaller time step according to the refinement factor. Since the ART code uses a refinement factor of 2, the time step size at level ℓ is 2−ℓdt. As a result, ART evolves 2ℓtime steps at level ℓ in each iteration. Figure 2.5 shows the recursive execution order for evolving these time steps on three levels. Basically, finer levels evolve first, then coarser levels follow. Except the finest level, each level ℓ evolves a new time step when and only when level ℓ + 1 has evolved two time steps ahead.

2.2.2 Domain Decomposition. The computational domain decomposition is critical to the performance of cosmology applications using AMR. In order to take advantage of the cell-based AMR and exploit the spatial locality, the ART code

(25)

Figure 2.6. A space-ﬁlling curve traversing a 2D mesh (left), and a parallel partition of the 2D adaptive mesh into four parts (right). The cells with the same color are assigned to the same process.

adopts a domain decomposition scheme [65] based on Hilbert’s space-filling curve (SFC) [21]. The SFC is identified by a traversal of all the root cells according to their spatial coordinates. A parallel partition of root cells is obtained by dividing the curve into Np (number of processes) equal parts, where each part is weighted by the total workload of the corresponding computational domain. It is to be noted that each root cell keeps all its child cells at finer refinement levels as a single composite unit, thus being the basic object for domain decomposition. Figure 2.6 presents a space-filling curve on a 2D mesh and a parallel partition of cells into four parts. Because of the spatial locality preserved by the curve, each part is a continuous domain consisting of nearby cells, and this property also holds for the 3D case of parallel partition of the cubic universe in ART. This SFC-based domain decomposition scheme enables efficient partitioning of the adaptive mesh by transforming a multidimensional problem (e.g., 2D or 3D) into a unidimensional one, and it has been widely employed in parallel AMR implementations [19, 20, 35, 47, 75].

Since the computational meshes and cosmological objects evolve dynamically during a cosmology simulation, the workload distribution between processes changes. In order to ensure load balance, ART regularly examines the workload distribution

(26)

during simulation, and performs domain decomposition to re-balance workload when necessary.

2.2.3 Communication Pattern. With the SFC-based domain decomposition, each process of ART mainly performs computation for its local computational domain, and communicates with other processes to get the boundary information, which are the data associated with the external boundary cells of each process. It is worth noting that each process only keeps the data associated with its local computational domain, enabling a fully parallel solution for both computation and memory. Up-dating the boundary information is the dominating communication routine of ART. Specifically, in order to exchange boundary information with other processes, each process first posts non-blocking receives (MPI Irecv), then sends data by non-blocking sends (MPI Isend), and finalizes the communication by an MPI Waitall for all sends and receives. Such MPI communication is performed per time step at each level, and cannot be overlapped with computation to reduce runtime, because the physics computation of each process depends on updated boundary data.

Generally, each process only communicates with relatively small number of processes whose computational domain is nearby, and the amount of communication between two processes is mainly dependent on the number of boundary cells between their computational domains. Figure 7.3 shows the communication pattern of ART for a production simulation with 1536 processes. Each blue dot at (i, j) represents the communication between process i and j, and “nz” denotes the total number of blue dots, i.e., the total number of communicating process pairs. Clearly, each process only communicates with a few other processes, and most communication is between neighboring processes since most blue dots are along the diagonal. The communication pattern may vary for diﬀerent mesh structure and diﬀerent number of processes, yet the sparse and diagonal dominant property always holds due to the

(27)

0 500 1000 1500 0 500 1000 1500 nz = 12950

Figure 2.7. The communication pattern of ART during a large-scale cosmology sim-ulation with 1536 processes.

(28)

spatial locality provided by the SFC-based domain decomposition.

2.3 Performance Analysis

In order to study the scalability of the ART code, we instrument it with performance counters and timers, and perform simulations with practical cosmology datasets on various production HPC systems. In this section, we present the perfor-mance analysis of ART on two HPC systems.

2.3.1 Two Cluster Systems. The two cluster systems are the Intel 64 cluster Abe at the National Center for Supercomputing Applications (NCSA), and the Sun cluster Ranger at the Texas Advanced Computing Center (TACC). These systems are chosen because they have different hardware/software configurations, and they are representative HPC clusters widely used for scientific computing.

The first system is an Intel 64 cluster called Abe which is located at NCSA. It was ranked the 82nd in the top 500 list [81] (June 2010) with a peak performance of 89.47 TFLOPS. Abe is equipped with InfiniBand network and Lustre parallel filesys-tem. It contains 9600 processing cores, arranged as 1200 symmetric multiprocessing (SMP) nodes. Each node has either 8GB or 16GB memory, and two quadcore CPUs running at 2.33 GHz. Each CPU has 4MB L2 cache. The operating system on Abe is Red Hat Enterprise Linux 4 (Linux 2.6.18).

The second system is a Sun constellation Linux cluster called Ranger, which is located at TACC. It was ranked the 11th in the top 500 list [81] (June 2010) with a peak performance of 579.4TFlops. It is composed of 3936 nodes, with a total of 62976 processing cores. The nodes are connected by a full-CLOS InﬁniBand interconnect, which provides a 1GB/sec point-to-point bandwidth. Each node has 32GB memory, and 4 quad-core AMD 64-bit Barcelona CPUs. Each CPU has a clock frequency of 2.3 GHz and three levels of cache for fast memory access: 2MB L3 cache shared by 4

(29)

cores, 512KB L2 cache dedicated to each core, and 64 KB L1 cache. Also, Ranger is equipped with the Lustre ﬁle system and a 2.6.18.8 Linux kernel.

2.3.2 Cosmology Dataset. A practical cosmology dataset of ART called gf25CV1

is employed in our experiments. This cosmological simulation is used in the box of 36 comoving Mpc on a side, covered with the uniform top level grid of 2563 cells. A small fraction of the volume (containing Lagrangian regions of 5 galaxies with masses between 1012 and 1013 M_⊙) is further resolved in the initial conditions by additional 3 levels, to the eﬀective 20483 grid. This high resolution region is allowed to reﬁne dynamically by another 6 levels (10 levels total, counting the top grid).

This dataset is chosen because it is a practical dataset used in ongoing cosmo-logical research, and it is similar to the datasets, which will be used in the ultimate large-scale cosmology simulations in future. In fact, the computational domain of a large-scale simulation can be 100 times the box size of gf25CV1, and these 100 box pieces will be made essentially independent of each other. Therefore, this dataset is representative for both current cosmology simulations and future large-scale simula-tions. The scalability analysis and performance optimization based on this dataset will be beneﬁcial for general simulations.

In practice, the simulation of this dataset evolves thousands of iterations, which take large amount of runtime (up to a few months using hundreds of cores), thus it is prohibitive to simulate all the iterations. In our experiments, we selectively simulate this dataset for a few iterations at diﬀerent physical epoches with diﬀerent grid resolutions for performance evaluation.

2.3.3 Experimental Setup. On Abe, we use the ART code to simulate gf25CV1

for a few number of iterations at a late physical epoch with medium amount of particles. Here only medium amount of particles are employed for the experiments

(30)

on Abe, because the quota of disk in the home directory of each user on Abe is only 50GB, and it cannot accommodate the data files for simulating more particles. The experiments are carried out with 256, 512, and 1024 processors (i.e., cores), respectively. As each node of Abe has 8 processors, so actually there are 32, 64 and 128 nodes used in these experiments. Smaller number of nodes are not adopt for experimentation because they cannot accommodate the data for simulating this dataset. As ART is a hybrid “MPI+OpenMP” code and there are 8 processors available on each node of Abe, we assign an MPI process with 8 OpenMP threads for each node. This kind of configuration is employed for two reasons. First, this configuration often provides good performance, and it is adopt in most production simulations. Second, to simulate the practical cosmology dataset, each process of the ART code needs a large amount of memory. In practice, each node of the Abe machine cannot accommodate two or more MPI processes of ART. Therefore, we use one MPI process and 8 OpenMP threads on each node for performance analysis.

As Ranger has suﬃcient quota of disk per user, we simulate the dataset at an early physical epoch with large amount of particles. A series of experiments with 256, 512, 1024, and 2048 processors (i.e., cores) have been performed. Similarly, we assign one MPI process for each node, and one OpenMP thread for each processor. Recall that each node of Ranger has 16 processors, so there are 16 OpenMP threads on each node.

2.3.4 Results. Figures 2.8 and 2.9 show the average runtime per process of the ART code on Abe and Ranger, respectively. The overall runtime of the ART code is divided into three parts: physics computation, MPI communication, and others. The physics computation time includes all the runtime spent on solving the physics equations and managing the cells; the MPI communication time is the time spent on message passing function calls; and the other runtime mainly includes IO and

(31)

256 512 1024 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 Number of Processors Time(s)

Execution Time On Abe

Physics Computation MPI Communication Others

Figure 2.8. Execution time of the ART code on Abe for simulating a single iteration of gf25CV1 at a late physical epoch with medium amount of particles.

256 512 1024 2048 0 500 1000 1500 2000 2500 Number of Processors Time(s)

Execution Time On Ranger

Physics Computation MPI Communication Others

Figure 2.9. Execution time of the ART code on Ranger for simulating a single itera-tion of gf25CV1 at an early physical epoch with large amount of particles.

(32)

256 512 1024 0.8 1 1.5 2 2.5 3 3.5 4 Number of Processors Speedup

Speedup Relative to 256 Processors on Abe

Physics Computation Time MPI Communication Time Total Runtime

Figure 2.10. Total parallel scaling of the ART code on Abe.

load balance time. Since diﬀerent simulations are performed on these two clusters, the absolute runtime of these two sets of experiments are not comparable. As the “others” only consumes less than 10% of the total runtime, we focus on the analysis of the physics computation time and MPI communication time.

The scaling curves of these experiments are illustrated in Figures 2.10 and 2.11, respectively. For both experiments on Abe and Ranger, the physics computa-tion time has a linear speedup as the number of processors increases. In comparison with the physics computation time, the MPI communication time has an inferior scaling trend on both clusters, and this inferior scaling results in the poor scaling of the total runtime. Basically, it is expected that the average physics computation time has a good scaling, because the overall workload of simulation is dependent on the given cosmology data. It is observed that the physics computation time of differ-ent processes often varies significantly, which causes large amount of synchronization time (i.e. waiting time) between processes. As such synchronization time constitutes a large part of MPI communication time, it is a must to explore more efficient load balancing schemes, so that the physics computation time (i.e. workload) of different

(33)

2561 512 1024 2048 2 3 4 5 6 7 8 Number of Processors Speedup

Speedup Relative to 256 Processors on Ranger

Physics Computation Time MPI Communication Time Total Runtime

Figure 2.11. Total parallel scaling of the ART code on Ranger.

processes can be well balanced, and the synchronization time can be reduced. More-over, it would also be beneﬁcial to explore topology-aware mapping techniques for reducing communication cost (i.e. data transmission time).

(34)

CHAPTER 3

PERFORMANCE EMULATION

In this chapter, we present the design of the performance emulator for cell-based AMR cosmology simulations, and evaluate three load balancing schemes via the performance emulator.

3.1 Proposed Approach

The execution time of a cosmology simulation can be divided into three parts: physics computation, MPI communication, and others. Here, physics computation time includes the runtime spent on solving physics equations and managing the cells; the MPI communication time is the time spent on MPI function calls; and the other runtime mainly includes IO and load balance time. As the sum of physics computation and MPI communication time accounts for more than 95% of the overall runtime, we focus on these two parts during the design of the emulator. In the following, we ﬁrst present our performance models to estimate physics computation time and MPI communication time, and then describe the performance emulator.

3.1.1 Performance Models. Runtime performance models are built to estimate physics computation time and MPI communication time. Considering that different levels have different resolutions and time step sizes, our focus is to build models to es-timate these runtime of each time step per level. During simulation, each application process stores two types of cells: the cells within its computational domain, namely local cells; and the external neighboring cells of its computational domain, namely buffer cells. Each process conducts physical calculation on local cells, and communi-cates with other processes to get buffer cell data, which serves as important boundary information for simulating its local computational domain. These two types of cells are the main indicators of physics computation time and MPI communication time,

(35)

respectively.

3.1.1.1 Physics Computation Time. In order to characterize the physics computation time accurately, we use principle component analysis (SPCA) [73] to analyze the runtime, and observe that the number of local cells and particles are the dominant terms in determining the physics computation time. It is expected because the ART code solves two kinds of physics equations at each level: the physics equations for hydro dynamics, and the physics equations for N-body simulation. The former is only solved for leaf local cells, and the solution is projected to obtain the solution of non-leaf local cells at the coarser level, while the latter is solved for particles. Thus, we use the following linear model for the physics computation time TP of each level.

TP = TPnllc+ T llc P + T particle P (3.1) = w1× Nnllc+ w2× Nllc+ w3× Nparticle,

where wi (i = 1, 2, 3) are constant coefficients for the level of interest. TPnllc, TPllc and T_Pparticle denote the computation time of non-leaf local cells, leaf local cells and particles, respectively. Nnllc, Nllc, and Nparticle denote the number of non-leaf local cells, leaf local cells and particles, respectively. Note that Req.(3.1) is defined for each level of each process, while all the processes share common constant coefficients of each level.

To extract these coefficients, we use the cell and particle counts along with the physics computation time at each level of each process to formulate Req.(3.1), and then solve a linear system in the form of Ax = b for each level. As the number of equations is usually larger than the number of coefficients, we apply linear regression to compute the least square fit solution of these coefficients.

(36)

further divided into two parts: data transmission time and synchronization time.

Data transmission time is simply the runtime spent on transmitting data be-tween processes. As updating the boundary information is a major communication routine and the amount of boundary data for each process is proportional to its buﬀer cell counts, the data transmission time is largely dependent on the number of buﬀer cells. In our model, data transmission time is modeled as follows.

Ttrans = ts+ n× tc, (3.2)

where ts can be considered as the latency for message passing, tc is the inverse of the bandwidth and n is the data size for one time data transmission. The latency and bandwidth can be obtained by using Intel MPI Benchmarks (IBM) [43], and the number of transmitted bytes can be calculated using the number of buﬀer cells.

Synchronization time is incurred when processes do not start their communi-cation routines simultaneously. For example, consider two processes with a maximum refinement level of 3 and 6, respectively, and assume that these two processes need to exchange boundary information from level 0 to 3. The first process starts evolving time steps at level 3, while the second process starts at level 6. Once the first process finishes a time step at level 3, it sits idle until the second process catches up to it, and then they can communicate to exchange boundary information. Even if two processes have the same refinement level, when there exists load imbalance, a process may still have to sit idle waiting for the boundary information from the other process. Such extra waiting time is recorded as part of the MPI communication time, because the process is stalled when executing MPI function calls, and we refer to it as synchro-nization time. The MPI communication time including both data transmission time and synchronization time can be characterized by emulating time steps as detailed in the next subsection.

(37)

Grid Structure and Model Coefficients Analyze Communication Relationships Iter < MaxIter YES NO Exit Load Balance

Emulate Time Steps

Estimate Physics Computation Time and Data Transmission Time Using the Performance Models

Estimate Total Runtime

Figure 3.1. Flow of performance emulator of ART.

To build these models, we need performance data so as to extract model coeﬃ-cients. In practice, we can either use the performance data from previous simulations, or conduct a few iterations of the simulation to collect performance data for model construction.

3.1.2 Emulator Design. Figure 3.1 presents the design of our emulator. Com-paring Figure 3.1 and Figure 2.4, we can see that the emulator uses the performance models to estimate physics computation and MPI communication time for each time step, rather than conducting the actual simulation. During each iteration, we use a load balancer to determine workload distribution among processes, and then emulate time steps in exactly the same order as the ART code. For each time step, the emula-tor ﬁrst analyzes the communication relationships among processes. Speciﬁcally, for each process, the emulator derives the amount of data that needs to be transmitted to and received from all the other processes. Next, according to the cell and particle counts of each process and the communication relationships, the emulator estimates

(38)

Figure 3.2. Part of the time axes of two processes.

physics computation time and data transmission time using the performance models shown in equation (3.1) and (3.2), respectively. Finally, the emulator estimates the total runtime. It is achieved by maintaining a time axis for each process to record the computation and communication intervals.

Figure 3.2 shows the time axes of two processes P0 and P1. The length of computation time intervals are the estimated physics computation times, while the arrows represent data transmission. Clearly, MPI communication time intervals are determined by computation time intervals and data transmissions. For example, the second MPI communication time interval of P0 can be computed as Tc = t2− t1 =

(t0 + Ttrans)− t1, and the start time of P0 for the communication in the next time

step is t3 = t2 + TP. Therefore, using the estimated physics computation time and data transmission time for each time step of each process, the emulator is able to estimate the MPI communication time including both data transmission time and synchronization time without evaluating synchronization time separately.

In summary, by emulating time steps with the performance models to charac-terize runtime components, the emulator can estimate the performance of the ART code without executing physics solvers.

(39)

3.2 Load Balancing Schemes

One of the major design goals of our performance emulator is to evaluate the performance of different load balancing schemes, thus avoiding time-consuming and complicated implementation in code without knowing potential effects of the modifi-cation. The ART code performs cosmology simulations using a cubic computational domain, which represents the universe. The computational domain is initially divided into many uniform cubic cells at level 0. We refer to such cells at level 0 as root cells since they are the roots of oct-trees. Each root cell keeps all its child cells at finer refinement levels as a single composite unit, thus being the basic unit for load bal-ancing. In this section, we present three representative load balancing schemes which will be evaluated later in Section 3.3.

3.2.1 SFC-Based Load Balancing Scheme (SCAB). Currently, the ART code

employs a load balancing scheme based on Hilbert space-filling curve (SFC) [21]. We denote this scheme as SCAB. It assigns a unique SFC ID for each root cell according to their spatial coordinates, then generates an SFC curve by connecting root cells with continuous SFC IDs, and finally divides the SFC curve into Np (Np is the number of processes) segments with similar amount of workload. Specifically, this scheme considers the total workload of each root cell, and adopts a greedy algorithm to split the SFC curve, so that the workload can be evenly distributed among all the processes. Currently, SCAB is widely used for parallel AMR [19, 35, 75]. One salient feature of this SCAB scheme is its good spatial locality, where each process gets root cells with continuous SFC IDs resulting in a continuous computational domain. However, SCAB restricts the assignments of root cells to processes by the SFC curve, and does not consider the communication among processes when splitting the SFC curve.

(40)

partitioning is an alternative approach for the load balancing of cell-based AMR ap-plications. We denote it as Graph. To apply the graph method, we need to map the load balancing problem into a graph partitioning problem. One straightforward mapping is to use vertices to represent root cells, and edges to represent communi-cation relationships between neighboring root cells. The weight of each vertex is the workload of the corresponding root cell. Although this mapping does make sense, it results in a large graph, which is difficult to partition using acceptable amount of runtime and memory. For example, in a medium-sized cosmology simulation, a cubic computational domain of 2563 root cells maps into a graph with 2563 vertices and about 6×2563 edges. Partitioning such a large graph is almost prohibitive. Therefore, we must reduce the graph size for efficient partitioning. In our preliminary exper-iments, it is observed that only the root cells in a few localized regions are deeply refined to finer levels, while most root cells are not refined. Thus, we use a single vertex to represent the unrefined root cells with continuous SFC IDs, and create the edges accordingly in order to generate a manageable graph.

The assignments of root cells to processes can be obtained by partitioning the vertices in the graph into Np partitions. Note that graph partitioning algorithms typically minimize the total edge-cuts subject to the constraints that the partitions are of similar size. When they are applied for the load balancing of cell-based AMR applications, we are actually minimizing the amount of communication among pro-cesses while ensuring that the workload is well-balanced. In our implementation, we use the graph partitioning tool MENTIS [60] to partition the graph.

3.2.3 Group-Based Load Balancing Scheme (Group). The aforementioned two load balancing schemes only consider the total workload of each root cell. How-ever, as the ART code evolves time steps for each level, it is also critical to balance the workload of each level in order to reduce the synchronization time. Besides, it

(41)

is observed that the communication at deep reﬁnement levels usually results in large synchronization cost, so it is important to minimize the communication at deep re-ﬁnement levels. To meet such requirements, we design a new load balancing scheme called Group.

This scheme first assigns neighboring root cells into groups, where each group has the lowest possible boundary level and satisfies a set of group workload constraints to control the granularity. The assignment of root cells to groups is based on the Friends-of-Friends algorithm [29]. Second, Group assigns root cell groups to processes by solving a constrained bin packing problem to balance the workload of each level, where each bin corresponds to a process. Specifically, we sort the groups in non-increasing order according to their workload, and pack them into bins sequentially. To achieve good spatial locality, we compute the distances between groups using Coronoid tessellations [86], and try to assign each group to the process which holds its neighboring groups. In this way, Group is able to achieve good level-by-level load balance and spatial locality.

3.3 Experiments

In this section, we present two sets of experiments. In the ﬁrst experiment, we assess the accuracy of the emulator. In the second experiment, we examine the load balancing schemes using the emulator with realistic cosmology data, and discuss their advantages and shortcomings.

3.3.1 Experimental Setup. We instrument the ART code with performance counters and timers for analysis. We use a real cosmological simulation of a box of 36 comoving Mpc on a side, covered with the uniform top level grid of 2563 _{root cells.}

This simulation represents a scaled-down version of our future petascale cosmological simulations. In fact, the petascale simulation is 100 times the volume of this dataset,

(42)

and these 100 box pieces are essentially independent of each other.

Our tested is the Intel 64 cluster Abe located at NCSA [77]. It is equipped with InﬁniBand network and Lustre parallel ﬁle system. Each node has either 8GB or 16GB memory, and two quad-core CPUs running at 2.33 GHz. As ART is a hybrid “MPI+OpenMP” code and there are 8 processors available on each node of Abe, we assign an MPI process with 8 OpenMP threads to each node.

3.3.2 Accuracy of Performance Emulator. To measure the accuracy of the emulator, we conduct experiments using the emulator with the current load balancing scheme of ART – SFCLB, and compare the emulated performance results with the actual runtime of the ART code. As the emulator emulates time steps without running physics solvers, it only consumes a few minutes to estimate the performance of ART.

Figure 3.3 presents the comparison of actual runtime and emulated runtime. Here, the runtime includes physics computation time and MPI communication time. The diﬀerence between the actual runtime and emulated runtime is within 12%. The error is mainly due to some extra communication routines in the ART code except for exchanging boundary information between processes. The emulator does not model such extra communications because they will be removed in our next version of ART. Therefore, the emulator is accurate. More importantly, we notice that both curves in the ﬁgure have exactly the same trend. This further indicates that our emulator can clearly predict the performance and scalability of realistic cosmology simulations, and the performance analysis based on this emulator is reliable.

3.3.3 Evaluation of Load Balancing Schemes. In this set of experiments, we assess three load balancing schemes presented in Section 3.2 for cosmology sim-ulations by means of the emulator. For the same cosmology dataset described in Section 3.3.1, we test two diﬀerent resolution cases: the coarse resolution case and

(43)

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 192 256 512 1024 T im e (s ) Number of Processors

Comparison of Actual Runtime and Emulated Runtime

Actual Runtime Emulated Runtime

Figure 3.3. Comparison of actual runtime and emulated runtime of ART.

the fine resolution case. The coarse resolution case represents a simulation with an intermediate resolution which reaches to a maximum refinement level of 6. The fine resolution case represents a simulation with an extremely high resolution, which is allowed to refine dynamically to level 9.

We use three metrics, namely execution time, load balance ratio, and commu-nication time per level, for evaluation. Speciﬁcally, load balance ratio represents the quality of workload distributions among processes. It is deﬁned as

Load Balance Ratio =∆

1 Np ∑Np−1 i=0 Wi max0≤i≤Np−1Wi × 100%,

where Wi is the workload of process i and Np is the number of processes. Note that Load Balance Ratio is always smaller than or equal to 100%, and a much closer value to 100% indicates a better load balance quality.

Figures 3.4 and 3.5 present emulated runtime by using different load balancing schemes for coarse resolution and fine resolution case, respectively. In both figures, the runtime using GroupLB scheme is smaller than that of the other two schemes, especially in the coarse resolution case. In Figure 3.5, we notice that as the number

(44)

0 500 1000 1500 2000 2500 3000 192 256 512 1024 T im e (s ) Number of Processors SFCLB GraphLB GroupLB Emulated Runtime of Coarse Resolution Case

Figure 3.4. Comparison of emulated runtime by using diﬀerent load balancing schemes for coarse resolution case.

of processors increases, three curves are trending toward the same value. Such trend is caused by granularity, which is determined by the maximum workload of root cells. In the ﬁne resolution case, there are several root cells whose individual workload is much larger than the average workload of each process when running on 512 and 1024 processors, thus introducing signiﬁcant synchronization cost.

Table 3.1 presents the overall Load Balance Ratio among processes for different load balancing schemes. The closer the metric is to 100%, the better load balance is achieved. Obviously, both SFCLB and GroupLB achieve better load balance for the coarse resolution case in comparison with the fine resolution case. This is because finer grid resolution increases the workload of root cells and results in larger granularity, which makes load balancing more challenging. GraphLB behaves differently since its primal objective is to minimize the communication among processes instead of balancing the workload. Comparing these schemes, GroupLB achieves the best load balance for the coarse resolution case and the first two tests of the fine resolution case. For the fine resolution case with 512 and 1024 processors, all of these three schemes fail to provide good load balance because of granularity problem.

(45)

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 192 256 512 1024 T im e (s ) Number of Processors SFCLB GraphLB GroupLB Emulated Runtime of Fine Resolution Case

Figure 3.5. Comparison of emulated runtime by using diﬀerent load balancing schemes for ﬁne resolution case.

Table 3.1. Overall Load Balance Ratio of Diﬀerent Load Balancing Schemes Number of Coarse Resolution Case Fine Resolution Case Processors SFCLB GraphLB GroupLB SFCLB GraphLB GroupLB

192 92.83% 64.89% 96.63% 90.21% 85.48% 91.17% 256 91.16% 73.54% 96.43% 71.35% 78.81% 88.51% 512 82.51% 53.76% 95.84% 53.74% 54.02% 51.73% 1024 70.49% 42.74% 92.21% 26.98% 27.01% 26.75%

(46)

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 0 1 2 3 4 5 6 SFCLB GraphLB GroupLB Load_Balance_Ratio of Each Level

(192 Processors, Coarse Resolution)

L o a d _ B a la n c e _ R a ti o Level 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 0 1 2 3 4 5 6 SFCLB GraphLB GroupLB Load_Balance_Ratio of Each Level

L o a d _ B a la n c e _ R a ti o Level 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 0 1 2 3 4 5 6 SFCLB GraphLB GroupLB Level L o a d _ B a la n c e _ R a ti o

Load_Balance_Ratio of Each Level (512 Processors, Coarse Resolution)

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 0 1 2 3 4 5 6 SFCLB GraphLB GroupLB Load_Balance_Ratio of Each Level

Level L o a d _ B a la n c e _ R a ti o

Figure 3.6. Load Balance Ratio of each level for coarse resolution case.

Figures 3.6 and 3.7 show the Load Balance Ratio of each level for coarse res-olution case and fine resres-olution case, respectively. In the coarse resres-olution tests, the Load Balance Ratio for almost all the levels of GraphLB is obviously smaller than that of the other two schemes; SFCLB and GroupLB provide similar load balance at deep refinement levels 4 to 6; GroupLB also delivers well-balanced workload distribu-tion at level 0 to 3. In the fine resoludistribu-tion tests, these schemes achieve comparable load balance at level 6 to 9, while GroupLB is more effective in balancing the workload at lower levels. In general, GroupLB achieves better level-by-level load balance than SFCLB and GraphLB for both coarse and fine resolution cases.

Figures 3.8 and 3.9 illustrate the average communication time of each level for the two resolution cases, respectively. In Figure 3.8, as the number of processors increases, the level-by-level communication time is decreasing. Generally, GroupLB introduces smaller level-by-level communication time than SFCLB and GraphLB be-cause it achieves better load balance at each level. In Figure 3.9, the

(47)

communica-0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 0 1 2 3 4 5 6 7 8 9 SFCLB GraphLB GroupLB Load_Balance_Ratio of Each Level

(192 Processors, Fine Resolution)

L o a d _ B a la n c e _ R a ti o Level 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 0 1 2 3 4 5 6 7 8 9 SFCLB GraphLB GroupLB Load_Balance_Ratio of Each Level

Level L o a d _ B a la n c e _ R a ti o 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 0 1 2 3 4 5 6 7 8 9 SFCLB GraphLB GroupLB Load_Balance_Ratio of Each Level

(1024Processors, Fine Resolution)

L o a d _ B a la n c e _ R a ti o Level

Figure 3.7. Load Balance Ratio of each level for ﬁne resolution case.

tion time is not scaling down with the increasing number of processors because the granularity problem results in poor load balance. GroupLB and GraphLB have much smaller communication time than SFCLB at level 7 to 9 since both methods try to re-duce communication. However, for 512 and 1024 processors, GroupLB and GraphLB still have large communication time due to the granularity.

In summary, by comparing these three load balancing schemes, we conclude that GroupLB provides the best performance. It achieves a good load balance quality by balancing both overall and level-by-level workload, and minimizes communication cost by preserving spatial locality. While SFCLB maintains an overall load balance and keeps spatial locality using the SFC curve, it does not take into consideration of level-by-level load balance, thereby introducing non-trivial synchronization cost. Although GraphLB minimizes communication cost, it does not provide satisfactory load balance quality.

(48)

0 50 100 150 200 250 300 350 0 1 2 3 4 5 6 SFCLB GraphLB GroupLB

Average Communication time for Each Level (192 Processors, Coarse Resolution)

T im e (s ) Level 0 50 100 150 200 250 300 350 0 1 2 3 4 5 6 SFCLB GraphLB GroupLB

Level T im e (s ) 0 50 100 150 200 250 300 350 0 1 2 3 4 5 6 SFCLB GraphLB GroupLB

Level Level T im e (s ) 0 50 100 150 200 250 300 350 0 1 2 3 4 5 6 SFCLB GraphLB GroupLB

T im e (s ) Level

Figure 3.8. Average communication time of each level for coarse resolution case.

0 200 400 600 800 1000 1200 1400 1600 0 1 2 3 4 5 6 7 8 9 SFCLB GraphLB GroupLB

Averagae Communication Time for Each Level (192 Processors, Fine Resoulution)

Level T im e (s ) 0 200 400 600 800 1000 1200 1400 1600 0 1 2 3 4 5 6 7 8 9 SFCLB GraphLB GroupLB

T im e (s ) Level 0 200 400 600 800 1000 1200 1400 1600 0 1 2 3 4 5 6 7 8 9 SFCLB GraphLB GroupLB T im e (s )

Level 0 200 400 600 800 1000 1200 1400 1600 0 1 2 3 4 5 6 7 8 9 SFCLB GraphLB GroupLB

Averagae Communication Time for Each Level (1024Processors, Fine Resoulution)

Level T im e (s )

(49)

CHAPTER 4

OVERVIEW OF TOPOLOGY MAPPING

Supercomputers with up to hundreds of thousands of cores have been built to fulfill the demands of large-scale and complex scientific applications. These machines usually have sparse network topologies with large network diameters (i.e., the maxi-mum distance between two nodes). As the system size scales up, the diameter of the interconnection network increases, and the bisection bandwidth (i.e., the minimum total bandwidth of links connecting one half of the supercomputer and the other) often decreases. As a result, communication in the network becomes increasingly expensive due to large distance between nodes and network contention, leading to the scaling bottleneck of parallel applications. Topology-aware task mapping is an essential technique for communication optimization of parallel applications on high performance computing platforms. According to the topology of the interconnection network and the communication pattern of the application, it maps parallel appli-cation tasks onto processors to reduce communiappli-cation cost. The following sections briefly review topology-aware task mapping techniques.

4.1 Background

Today’s supercomputers often use fat-tree, mesh or torus network topologies. A fat-tree network uses switches to connect compute nodes into a tree structure [55]. The compute nodes are at the leaves, while intermediate nodes represent switches. From the leaves to the root, the available bandwidth of links increases, i.e. the links become “fatter”. Inﬁniband networks are representative examples of fat-tree topology. To optimize communication on fat-tree, we prefer local communication to global communication for smaller latency.