ABSTRACT
SRIPATHI, VAMSI KIRAN. Performance Analysis and Optimization of Parallel I/O in a Large Scale Groundwater Application on Petascale Architectures. (Under the direction of G.(Kumar) Mahinthakumar.)
This thesis attempts to characterize and optimize the performance of a large scale
ground-water application, PFLOTRAN, on leading petascale architectures, Cray XT5 and IBM
Blue-Gene/P (BG/P). PFLOTRAN is a highly scalable groundwater simulation code that solves multi-phase groundwater flow and multicomponent reactive transport in three-dimensional
porous media. It uses MPI (Message Passing Interface) for interprocess communication, PETSc
(Portable, Extensible Toolkit for Scientific Computation) framework for solving numerical equa-tions and parallel HDF5 (Hierarchical Data Format 5) library for I/O. We performed detailed
performance analysis of PFLOTRAN and its associated libraries on Cray XT5 and IBM BG/P
using Cray Performance Analysis Tool(CrayPAT) and Tuning and Analysis Utilities(TAU). Our profiling results indicated that IBM BG/P delivers superior MPI Allreduce and I/O
performance when compared to Cray XT5 for PFLOTRAN. Differences between the two
ar-chitectures in terms of computational performance, communication and I/O sub-systems are discussed. We designed a custom vector dot-product benchmark to accurately characterize
MPI Allreduce performance at scale on both petascale architectures. We experimented with
different communication schemes to optimize the benchmark performance on Cray XT5 and obtained an improvement of 20-40% with a hybrid MPI-OpenMP implementation.
A strong scaling study with a 270 million cell test problem of PFLOTRAN indicated that a high volume of independent I/O disk access requests and file access operations would severely
limit I/O performance scalability on Cray XT5. To mitigate the performance penalty at higher
processor counts, we implemented a two phase I/O approach at the application layer by splitting the MPI global communicator into multiple sub-communicators. With this approach we were
able to achieve 25X improvement in HDF5 read I/O and 3X improvement in HDF5 write I/O
Performance Analysis and Optimization of Parallel I/O in a Large Scale Groundwater Application on Petascale Architectures
by
Vamsi Kiran Sripathi
A thesis submitted to the Graduate Faculty of North Carolina State University
in partial fulfillment of the requirements for the Degree of
Master of Science
Computer Science
Raleigh, North Carolina
2010
APPROVED BY:
Xiaosong Ma
Co-chair of Advisory Committee
Richard T. Mills
Frank Mueller G.(Kumar) Mahinthakumar
DEDICATION
BIOGRAPHY
Vamsi Sripathi was born in Bhimavaram, in the state of Andhra Pradesh in India. He had a
fun filled childhood attending various schools across the state. In April 2007, he graduated with a Bachelor of Technology in Information Technology degree from V.R.Siddhartha Engineering
College in Vijayawada, AP, India. He came to USA in Fall 2007 to pursue graduate studies
ACKNOWLEDGEMENTS
I would like to express my heartfelt gratitude to my advisor, Dr. Mahinthakumar for his
in-valuable guidance during my graduate studies. His thought provoking suggestions and insighful observations were pivotal to the success of my thesis research. His confidence in my abilities
motivated me to pursue challenging goals. His insights in High Performance Computing helped
me comprehend this broad area of research. I would also like to thank him for his availability to discuss my work.
I’m thankful to Dr. Richard Mills for his steadfast support while mentoring me during
my summer internship at Oak Ridge National Laboratory. His encouragement and substantial feedback helped crystallize my thesis topic and was instrumental to this thesis study.
I would like to thank Dr. Xiaosong Ma, co-chair of my thesis committee for her crucial
feedback and flexibility. I enhanced my knowledge of Parallel I/O when I attended her “Parallel Systems” course. I’m grateful to Dr. Frank Mueller for accepting my request to be a member
of the thesis committee and for providing valuable observations.
I’m thankful to the U.S Department of Energy (DOE) Office of Science, Scientific Discov-ery through Advanced Computing (SciDAC-2) research program and Performance Engineering
Research Institute (PERI) for funding this research study.
I’m grateful to my parents for their love and unwavering support. They provided me the freedom and encouragement to pursue a career of my choice. I’m greatly indebted to my brother,
Sarat Sreepathi who first introduced me to Computer Science and mentored me throughout my life. His patience and support during my thesis preparation were indispensable. I tremendously
enjoyed our passionate discussions on a wide range of topics. I’m happy to acknowledge the
support of my long time friend, Shivakar Vulli for sharing his perspectives on my research. I would like to thank my uncle, Sharma Vemuri and his kids, Rujula and Rishabh for fun filled
weekends that provided a much needed respite from graduate school.
This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of
Energy under Contract No. DE-AC05-0OR22725, and the resources of the Argonne Leadership
TABLE OF CONTENTS
List of Tables . . . vi
List of Figures . . . vii
Chapter 1 Introduction . . . 1
1.1 Motivation . . . 2
1.2 Computational Platforms . . . 2
1.2.1 Cray XT5 (JaguarPF) . . . 3
1.2.2 IBM BlueGene/P (Intrepid) . . . 3
1.3 Thesis Organization . . . 4
Chapter 2 Performance Analysis . . . 6
2.1 Description of PFLOTRAN . . . 6
2.1.1 Benchmark Problems . . . 8
2.2 Cray XT5 and IBM BG/P Compute Nodes . . . 8
2.3 Scalability Analysis . . . 11
2.3.1 Observations . . . 11
2.4 Profiling Tools . . . 17
2.4.1 CrayPAT . . . 18
2.4.2 TAU . . . 19
2.4.3 Performance Results . . . 21
2.5 Communication Analysis . . . 29
2.5.1 Communication sub-systems of Cray XT5 and IBM BG/P . . . 29
2.5.2 MPI Allreduce in PFLOTRAN . . . 33
Chapter 3 Custom MPI Allreduce Benchmark . . . 39
3.1 Motivation . . . 39
3.2 Benchmark Description . . . 40
3.3 Performance Results . . . 41
Chapter 4 Parallel I/O Analysis and Optimization . . . 48
4.1 Lustre and GPFS . . . 49
4.2 Analysis of file read . . . 51
4.3 Analysis of file write . . . 56
4.4 Optimization of I/O performance . . . 58
4.4.1 Improvement in file read . . . 59
4.4.2 Improvement in file write . . . 65
Chapter 5 Summary and Conclusions . . . 69
LIST OF TABLES
Table 2.1 Distribution of MPI Allreduce calls into various message size bins at 8,184
cores of Cray XT5 . . . 35
Table 3.1 Performance of CMB on Cray XT5 with PAT RT MPI SYNC = 1 . . . 42
Table 3.2 Performance of CMB on Cray XT5 with PAT RT MPI SYNC = 0 . . . 42
Table 3.3 Custom MPI Allreduce benchmark performance on IBM BG/P . . . 43
Table 3.4 Cray XT5: Comparison of MPI Allreduce performance in CMB . . . 47
LIST OF FIGURES
Figure 1.1 Architecture of Cray XT5 (JaguarPF) . . . 3
Figure 1.2 Architecture of IBM BlueGene/P (Intrepid) . . . 4
Figure 2.1 Architecture of PFLOTRAN . . . 7
Figure 2.2 PFLOTRAN: Hanford 300 benchmark problem . . . 9
Figure 2.3 Cray XT5 compute node . . . 10
Figure 2.4 BlueGene/P compute node . . . 10
Figure 2.5 PFLOTRAN Flow Chart . . . 12
Figure 2.6 PFLOTRAN on Cray XT5 and IBM BlueGene/P: 1 billion DoF problem, comparison of INITIALIZATION stage . . . 12
Figure 2.7 PFLOTRAN on Cray XT5 and IBM BlueGene/P: Comparison of 68 mil-lion DoF FLOW stage . . . 13
Figure 2.8 PFLOTRAN on Cray XT5 and IBM BlueGene/P: Comparison of 1 billion DoF TRANSPORT stage . . . 13
Figure 2.9 PFLOTRAN on Cray XT5 and IBM BlueGene/P: 1 billion DoF problem, FLOW + TRANSPORT stages . . . 14
Figure 2.10 PFLOTRAN on Cray XT5: Comparison of 68 million DoF FLOW stage . 14 Figure 2.11 PFLOTRAN on Cray XT5: Comparison of 1 billion DoF TRANSPORT stage . . . 15
Figure 2.12 PFLOTRAN on Cray XT5: 1 billion DoF problem, FLOW + TRANS-PORT stages . . . 15
Figure 2.13 PFLOTRAN on Cray XT5: CrayPAT - Apprentice tool . . . 19
Figure 2.14 PFLOTRAN on Cray XT5: 1 billion DoF - call path structure of routines in PETSc . . . 20
Figure 2.15 PFLOTRAN on Cray XT5: 1 billion DoF - call path structure of routines in FLOW and TRANSPORT phases . . . 20
Figure 2.16 PFLOTRAN on IBM BG/P: TAU - Mean time profiling of routines . . . 21
Figure 2.17 PFLOTRAN on Cray XT5: Percentage contribution of USER, MPI and MPI SYNC group routines to wall clock time . . . 22
Figure 2.18 PFLOTRAN on Cray XT5: Percentage contribution of user routines to wall clock time . . . 23
Figure 2.19 PFLOTRAN on Cray XT5: Percentage contribution of MPI routines to wall clock time. . . 23
Figure 2.20 PFLOTRAN on Cray XT5: Timings for user and MPI routines . . . 24
Figure 2.21 PFLOTRAN on Cray XT5: Percentage of theoretical peak performance. Peak performance of 1 core of Cray XT5 = 2.6 GHz∗4 ops/cycle = 10.4 GFlops/second. . . 24
Figure 2.22 PFLOTRAN on IBM BG/P:Breakdown of dominant user routines . . . . 25
Figure 2.24 PFLOTRAN on Cray XT5: Time spent in MatMult SeqBAIJ N routine by each process for a 8,184 core run. . . 26 Figure 2.25 PFLOTRAN on Cray XT5: Time spent in MatSolve SeqBAIJ N routine
by each process for a 8,184 core run. . . 27 Figure 2.26 PFLOTRAN on Cray XT5: Time spent in Rtotal routine by each process
for a 8,184 core run. . . 27 Figure 2.27 PFLOTRAN on Cray XT5: Time spent in synchronizing for MPI
Allre-duce by each process for a 8,184 core run. . . 28 Figure 2.28 PFLOTRAN on Cray XT5: Floating point operations (PAPI FP OPS)
executed by each process for a 8184 core run. . . 28 Figure 2.29 PFLOTRAN on Cray XT5: Total number of instructions (PAPI TOT
-INS) executed by each process for a 8184 core run. . . 29 Figure 2.30 PFLOTRAN on IBM BG/P: Time spent in Rtotal routine by each process
for a 8,184 core run. . . 30 Figure 2.31 PFLOTRAN on IBM BG/P: Time spent in Rmultiratesorption routine
by each process for a 8,184 core run. . . 30 Figure 2.32 PFLOTRAN on IBM BG/P: Time spent in Rkineticmineral routine by
each process for a 8,184 core run. . . 31 Figure 2.33 PFLOTRAN on IBM BG/P: Time spent in MatLUFactorNumeric
Se-qBAIJ N routine by each process for a 8,184 core run. . . 31 Figure 2.34 PFLOTRAN on IBM BG/P: Time spent in MPI Allreduce by each
pro-cess for a 8,184 core run. . . 32 Figure 2.35 Variation in run-time on Cray XT5 with the same simulation parameters. 34 Figure 2.36 PFLOTRAN on Cray XT5: 1 DoF problem - MPI Allreduce
synchro-nization time . . . 35
Figure 2.37 PFLOTRAN on IBM BG/P: 1 DoF problem - Total MPI Allreduce time 36
Figure 2.38 PFLOTRAN on Cray XT5: 1 DoF problem - Box-plot of MPI Allreduce timings (includes synchronization also) . . . 37 Figure 2.39 PFLOTRAN on IBM BG/P: 1 DoF problem - Box-plot of MPI Allreduce
timings (includes synchronization also) . . . 37 Figure 2.40 PFLOTRAN on Cray XT5 and IBM BG/P: 1 DoF problem - Total time
spent in MPI Allreduce . . . 38
Figure 3.1 CMB (direct method): MPI Allreduce comparison on Cray XT5 and IBM
BG/P . . . 43
Figure 3.2 Cray XT5: Custom MPI Allreduce benchmark at 16,380 cores - Wall
clock time . . . 44
Figure 3.3 Cray XT5: Custom MPI Allreduce benchmark at 32,760 cores - Wall
clock time . . . 44
Figure 3.4 Cray XT5: Custom MPI Allreduce benchmark at 65,532 cores - Wall
clock time . . . 44
Figure 3.5 Cray XT5: Custom MPI Allreduce benchmark at 98,292 cores - Wall
clock time . . . 45
Figure 3.6 Cray XT5: Custom MPI Allreduce benchmark at 131,064 cores - Wall
Figure 4.1 I/O software stack involved in high performance computing environments 48 Figure 4.2 Overview ofSpider center-wide Lustre file system at ORNL [10] . . . 50 Figure 4.3 PFLOTRAN: 270 million DoF - Default read access patterns . . . 51
Figure 4.4 PFLOTRAN on Cray XT5 and IBM BG/P: 270 million DoF problem
-Comparison of Initialization phase . . . 52
Figure 4.5 PFLOTRAN on Cray XT5 (quadcore): 270 million DoF - CrayPAT
pro-filing of MPI subroutines in Initialization phase . . . 55
Figure 4.6 PFLOTRAN on Cray XT4: 270 million DoF - CrayPAT profiling of MPI
subroutines in Initialization phase . . . 55 Figure 4.7 PFLOTRAN: 270 million DoF - Default write access pattern . . . 57
Figure 4.8 PFLOTRAN on Cray XT4: 25 million DoF - comparison between HDF5
collective and independent write performance . . . 58
Figure 4.9 PFLOTRAN on Cray XT5 and IBM BlueGene/P: 270 million DoF
-Comparison of HDF5 write performance . . . 59 Figure 4.10 PFLOTRAN: 270 million DoF - Comparison between Default and
Mod-ified read access pattern - 1 . . . 61 Figure 4.11 PFLOTRAN: 270 million DoF - Comparison between Default and
Mod-ified read access pattern - 2 . . . 63 Figure 4.12 PFLOTRAN on Cray XT5 (quadcore): 270 million DoF - Impact of read
group size on Initialization phase . . . 64 Figure 4.13 PFLOTRAN on Cray XT: 270 million DoF - Comparison of Initialization
phase . . . 64 Figure 4.14 PFLOTRAN on IBM BG/P: 270 million DoF - Initialization phase using
Independent and Collective I/O modes . . . 65 Figure 4.15 PFLOTRAN on Cray XT5 and IBM BlueGene/P: 270 million DoF
-Comparison of best implementations of Initialization phase . . . 66 Figure 4.16 PFLOTRAN: 270 million DoF - Modified write access pattern . . . 67 Figure 4.17 PFLOTRAN on Cray XT5(quadcore): 270 million DoF - Comparison of
default and improved HDF5 write performance . . . 67 Figure 4.18 PFLOTRAN on Cray XT5(quadcore): 270 million DoF - Impact of
Chapter 1
Introduction
High performance computing systems span a vast array of architectures, from erstwhile vector machines to the presently widespread distributed memory model systems. The top three
ma-chines on the current TOP500 list [13] are capable of performing more than 1 petaflop/second
(1012 floating point operations per second). These machines are composed of complex sub-systems and this complexity is bound to increase in coming years because of the advances in
processor, memory, disk and inter-connect technologies. The rapid advances in computational
power can be attributed to Moore’s law, which states that the number of components on an integrated circuit (IC) would double for every 18 months.
With the availability of powerful supercomputers, complex computational models were
con-structed to study important scientific phenomena that yield solutions to high-impact real world problems. As such, high performance computing systems have played a key role in helping
to solve grand challenge problems in critical scientific areas such as climate sciences,
astro-physics, fusion energy and biological sciences. By providing enormous computational power, supercomputers have established simulation as the third pillar of science along with theory and
experiment.
Low latency and high bandwidth communication and I/O sub-systems are necessary in building large scale systems that can offer sustainable performance to scientific applications.
Scalable programming models and lightweight software interfaces are pivotal in designing
com-putational models that can effectively utilize the current high performance computing systems. This thesis research focuses on the performance analysis and optimization of a large scale
scientific application, PFLOTRAN on current petascale architectures. PFLOTRAN is a highly
scalable groundwater simulation code that solves multi-phase groundwater flow and multicom-ponent reactive transport in three-dimensional porous media.
used in this study. Section 1.3 describes the the rest of the thesis organization.
1.1
Motivation
Current state-of-the-art computational models typically consists of a large code base and
in-teract with a wide variety of external software components such as mathematical libraries, I/O libraries and graph partitioning tools. Building scientific application models that can
effec-tively utilize the computing capabilities of existing and future supercomputers is a challenging
task. The evolution of high performance computing systems has been rapid over the last few decades. However, the different components of computer architecture have evolved at different
rates which introduced contention among various sub-systems. Performance analysis is the
process of evaluation and characterization of an application performance on a target architec-ture. This process can be very helpful in the development of computational models on high
performance computing systems. Performance analysis is a non-trivial task because it involves understanding the interactions between the various hardware components such as CPU,
mem-ory, communication and I/O subsystems and the software components involved in scientific
applications. Performance analysis is a critical component in scientific application development life cycle. In addition to helping to achieve effective utilization of current high performance
computing systems, the benefits of performance analysis to scientific computing teams can be
summarized as follows
• Allows computational scientists to characterize the impact of various design choices to
make informed decisions.
• Identify scalability challenges and use design insights to ameliorate such problems.
• Identify performance bottlenecks to direct optimization strategy.
• Performance optimization of real world scientific applications would enable us to solve
problems at larger scale and with better resolution thereby achieving better predictive capabilities.
• Understand the impact of computer architecture on application performance.
• Compare the performance of various architectures for target applications and provide
recommendations.
1.2
Computational Platforms
In the rest of the document, the words processor orsocket are used to describe the processor
chip whereas the wordsprocessor cores orcores are used to describe the multiple cores within a single chip.
1.2.1 Cray XT5 (JaguarPF)
At the time of this writing, the Cray XT5 (JaguarPF) system at Oak Ride National Laboratory
(ORNL) is the world’s fastest supercomputer [13]. Figure 1.1 shows the overall architecture of the system. It has a total of 224,256 processor cores with 300 terabytes(TB) of memory resulting
in a theoretical peak performance of 2.3 petaflop/s (PF/s). The 224,256 processor cores are arranged in 18,688 compute nodes. In addition to the compute nodes, there are dedicated login
and I/O nodes. Each compute node consists of two hex-core AMD Opteron 2435 “Istanbul”
processors running at 2.6 GHz, 16 GB memory and SeaStar 2+ routers that are connected in a 3D torus topology to provide very high bandwidth and low latency communication network.
Cray XT5 uses the Lustre file system to provide a total disk space of 5 petabytes(PB).
Figure 1.1: Architecture of Cray XT5 (JaguarPF)
1.2.2 IBM BlueGene/P (Intrepid)
The IBM BlueGene/P (BG/P, Intrepid) system at Argonne National Laboratory (ANL) is the
world’s ninth fastest supercomputer at the time of this writing in June 2010 [13]. Figure 1.2 shows the overall architecture of the system. Intrepid has a total of 163,840 processor cores
with 80 TB of memory resulting in a theoretical peak performance of 557.06 teraflop/s (TF/s).
in 40 racks. In addition to the compute nodes, there are dedicated login and I/O nodes. Each
compute node consists of a single quad-core PowerPC 450 processor running at 850MHz with 2 GB of shared memory. The compute nodes are connected by 3D torus and tree networks
providing very efficient communication sub-system. IBM BG/P uses General Parallel File
System (GPFS) to provide a total disk space of 3 petabytes(PB).
Figure 1.2: Architecture of IBM BlueGene/P (Intrepid)
1.3
Thesis Organization
Chapter 2 introduces the application PFLOTRAN and presents detailed performance results on Cray XT and IBM BG/P architectures. Section 2.1 describes the architecture of
PFLO-TRAN and benchmark problems used in this study. Section 2.2 describes the compute node architecture of Cray XT5 and IBM BG/P systems. Section 2.3 presents the scalability analysis
of PFLOTRAN on both architectures. Section 2.4 introduces the performance analysis tools
employed and identifies the dominant computational routines in PFLOTRAN. Section 2.5 dis-cusses the communication sub-systems of Cray XT5 and IBM BG/P and presents detailed
analysis of MPI Allreduce performance on both architectures.
In chapter 3, section 3.1 articulates the motivation for developing the custom MPI Allre-duce benchmark. Section 3.2 explains the various communication mechanisms designed for
mechanism for MPI Allreduce on Cray XT5.
In chapter 4, section 4.1 describes the file systems on Cray XT and IBM BG/P architectures. Sections 4.2 and 4.3 present the performance analysis of HDF5 read and write I/O access
patterns in PFLOTRAN respectively. Section 4.4 describes the I/O scalability bottlenecks of
PFLOTRAN and our optimization strategy for improving the I/O performance and present our results for Cray XT architecture.
Chapter 5 summarizes the major activities conducted, results achieved and conclusions that
Chapter 2
Performance Analysis
Chapter 2 introduces the application PFLOTRAN and presents detailed performance results on Cray XT and IBM BG/P architectures. Section 2.1 describes the architecture of
PFLO-TRAN and benchmark problems used in this study. Section 2.2 describes the compute node
architecture of Cray XT5 and IBM BG/P systems. Section 2.3 presents the scalability analysis of PFLOTRAN on both the architectures. Section 2.4 introduces the performance analysis
tools employed and identifies the dominant computational routines in PFLOTRAN. Section
2.5 discusses the communication sub-systems of Cray XT5 and IBM BG/P and presents de-tailed analysis of MPI Allreduce performance on both the architectures. Dede-tailed I/O analysis
is described in Chapter 4
2.1
Description of PFLOTRAN
The U.S Department of Energy (DOE) is interested in studying the effects of geologic
seques-tration of CO2 in deep reservoirs and migration of radionuclides in groundwater. Modeling of subsurface flow and reactive transport is necessary to understand these problems.
PFLOTRAN [26][32] is a highly scalable groundwater simulation code that solves
multi-phase groundwater flow and multicomponent reactive transport in three-dimensional porous media. It is written in Fortran 90 and has certain degree of abstraction through the use of
Fortran 90 features such as modules and derived data types. PFLOTRAN extensively uses
PETSc (Portable, Extensible Toolkit for Scientific Computation) [16, 17, 15] which is a widely used numerical software package for solving system of equations. PETSc is being developed at
Argonne National Laboratory (ANL) and provides powerful set of tools that can be used by
scientific applications written in C, C++ and Fortran. In addition to PETSc, PFLOTRAN uses MPI (Message Passing Interface)[8] for interprocess communication and parallel HDF5
PFLOTRAN uses domain-decomposition parallelism in which each processor is assigned a
sub-domain of the problem and a parallel solve is implemented over all processors. An inexact Newton method is employed inside of which the BiConjugate Gradient Stabilized (BiCGStab)
linear solver is used. A block-Jacobi preconditioner is used that applies an incomplete
LU-decomposition solver with zero fill-in (ILU(0)) on each block. PETSc provides the nonlinear and linear solvers, and preconditioners that are used in PFLOTRAN. PFLOTRAN uses the
PETSc Distributed Array (DA) framework to handle ghost point updates that arise at node
boundaries in domain decomposition. Various other functionality such as sparse matrix and vector data structures, basic performance logging framework and runtime options control in
PFLOTRAN are also provided by PETSc. The architecture diagram of PFLOTRAN is shown
in Figure 2.1
2.1.1 Benchmark Problems
In this section, we describe the PFLOTRAN benchmark problems used in our study. We
performed strong scaling performance studies with two test problems that are described below.
• 270 million DoF (Hanford 300): This benchmark problem is from a model of a hypothetical Uranium plume at the Hanford 300 Area in southeastern Washington state [27]. The
computational domain of the problem measures 1350*2500*20 meters (x,y,z) and utilizes
complex stratigraphy shown in Figure 2.2 mapped from the Hanford EarthVision database [2], with material properties provided by [12]. A 1350*2500*80 cell flow-only version of
the problem (270 million total degrees of freedom) is used.
• 1 billion DoF: This benchmark is a coupled flow and transport problem from the Hanford
300 Area. This problem consists of 850*1000*80 cells and has 15 chemical components. This accounts to a total of 68 million degrees of freedom in the flow solve and 1.02 billion
degrees of freedom in the transport solve.
2.2
Cray XT5 and IBM BG/P Compute Nodes
The Cray XT5 (JaguarPF) system at ORNL has a total of 18,688 compute nodes. Each XT5
compute node shown in Figure 2.3 has two AMD Opteron 2435 “Istanbul” processors linked
with dual HyperTransport connections. Each Opteron processor has 6 cores running at 2.6 GHz clock speed and directly attached to 8 GB of DDR2 800 MHz memory. Each core has
dedicated L1 cache of 64 KB and L2 cache of size 512 KB. All cores share the on-chip L3
Figure 2.2: PFLOTRAN: Hanford 300 benchmark problem
shared memory with a peak performance of 124.8 Gflop/s. JaguarPF supports MPI, OpenMP, SHMEM, and PGAS programming models. PGI, PathScale and Cray compiler environments
are available on the machine. Applications can be launched in various configurations depending on the programming model. The number of cores or NUMA sockets allocated to an application
can be controlled through the various options available to PBS (Portable Batch System).
The IBM BG/P (Intrepid) system at ANL has a total of 40,960 compute nodes arranged in 40 racks. Figure 2.4 depicts a BG/P compute node. Each compute node has four PowerPC
450 cores running at 850 MHz clock speed, with 2 GB of shared memory. Each compute core
has a double precision, dual pipe, floating point unit, typically called Double Hummer that can perform two fused multiply adds (FMA) per cycle giving it a capability to perform 4 floating
point operations per cycle. The peak performance of each core is 3.4 GFlop/s and for the
compute node it is equal to 13.6 GFlop/s. Each core has a private L1 cache of 32KB and a private L2 stream prefetching engine. All cores share the on-chip L3 cache of 8 MB. The cores
also share DDR-2 memory controllers that access 2 GB of DDR-2 memory. The Intrepid system
has IBM XL suite of compilers along with support for OpenMP within a compute node. Applications can be launched in 3 different modes on Intrepid. These are described below
• Symmetric Multiprocessor mode (SMP): In this mode, single MPI task is allocated to a
compute node. This task can spawn a maximum of four threads. All the compute node resources are used by the single MPI task.
• Dual Node mode (DUAL): In this mode, each compute node executes two MPI tasks
Figure 2.3: Cray XT5 compute node
• Virtual Node mode (VN): This mode allows four MPI tasks with one thread each to
run on the Compute Node. All tasks share the memory and network resources on the compute node. Optimizations in the system software allow tasks within a compute node
to communicate via shared memory.
2.3
Scalability Analysis
We primarily used the 1 billion DoF problem for computation and communication analysis.
The 270 million DoF problem is used for I/O analysis. In this section, we discuss the strong
scaling results with the 1 billion DoF problem on Cray XT5 and IBM BG/P. For the results presented in this chapter, we have set PFLOTRAN to run for 30 timesteps. Each time step
involves a coupled flow and transport time step execution. All file output is turned off to
avoid any interaction with the I/O sub-system. There is significant difference between flow and transport phases. When compared with a transport time step, a flow time step involves
less number of Newton (nonlinear) iterations (with each iteration involving a linear solve), but
performs significantly more number of BiCGSTAB iterations per each linear solve.
PFLOTRAN uses the PETSc logging framework to obtain basic profiling of the code. The
instrumentation overhead introduced by the PETSc logging framework is very minimal when
the-log summaryprofiling option is used. PETSc provides two other options,infoand-logtrace
that capture details about algorithms, data structures and tracing events. We used the log
-summary option for the results presented in this section.
The control flow of PFLOTRAN can be divided into 4 different phases, namely Initialization, Flow, Transport and Output phases. Figure 2.5 shows the execution model of PFLOTRAN
involving the 4 phases.
Figure 2.6 to Figure 2.9 show the scaling of the Initialization, Flow, Transport and combined Flow and Transport phases from 4,092 to 131,064 cores of Cray XT5 and IBM BG/P. For the
runs on Cray XT5, we used all six available cores per processor socket. In this scheme, each core is assigned a MPI task that can access approximately 1 GB of main memory. On IBM
BG/P, we had to use only two cores out of the available four cores per node because of the
large memory requirements for the 1 billion DoF problem. In this setup, each processor core is assigned a MPI task with 1 GB memory.
2.3.1 Observations
It can be seen from Figure 2.6 to Figure 2.9, PFLOTRAN exhibits different scaling behavior on
Cray XT5 and IBM BG/P. Observations from the scaling plots and the characteristics of the
Figure 2.5: PFLOTRAN Flow Chart
Figure 2.7: PFLOTRAN on Cray XT5 and IBM BlueGene/P: Comparison of 68 million DoF FLOW stage
Figure 2.9: PFLOTRAN on Cray XT5 and IBM BlueGene/P: 1 billion DoF problem, FLOW + TRANSPORT stages
Figure 2.11: PFLOTRAN on Cray XT5: Comparison of 1 billion DoF TRANSPORT stage
• Initialization and Output phases: These phases involve HDF5 I/O routines that read
the simulation initial conditions from an input file and write the simulation results to an output file. Since these operations are I/O bound, the I/O sub-system of the target
archi-tecture determines the performance of these phases. For the I/O intensive Initialization
phase (Figure 2.6), the IBM BG/P has far superior scaling behavior when compared to the Cray XT5. More detailed analysis of this phase including improvement on the Cray
XT5 is described in Chapter 4
• Flow phase: This phase of PFLOTRAN involves a large number of linear solves. Each
linear iteration is communication bound because it involves global reduction operations that require communication across processes. The flow stage is communication intensive
(95% of MPI Allreduce’s are encountered here). The communication sub-system plays a
key role in determining the performance of this phase. Flow phase does not scale beyond 16,380 cores on Cray XT5 and 65,532 cores on IBM BG/P. It is to be noted that at
higher processor counts, the computation to communication ratio is low for the 68 million
DoF Flow problem. For the communication intensive Flow phase, the IBM BG/P beats Cray XT5 around 17K cores (Figure 2.7) due to its superior collective communication
performance.
• Transport phase: This stage of PFLOTRAN is computation intensive because of the
nonlinear reactions involving 15 DoF per cell. This phase is computation intensive (90% of total flops). Its performance is dependent upon the peak performance of the processor
and the memory sub-system. For the computation intensive transport stage (Figure 2.8),
the Cray XT5 performs nearly twice as fast as the IBM BG/P because Cray XT5 has faster nodes (peak of 10.4 Gflop/sec for single core) compared to BG/P (peak of 3.4
GFlops/sec for a single core).
• Combined Flow and Transport phases: Figure 2.9 shows that the superior communication
performance on the IBM BG/P compensates for its slower computing power, while the computation advantage of Cray XT5 diminishes and the communication cost increases at
higher processor counts.
With the unique characteristics of its various phases, PFLOTRAN provides us an excellent
opportunity to study the impact of architecture on performance of a large scale application. Figure 2.10 to Figure 2.12 show the impact of using fewer cores per Cray XT5 socket. These runs
used only 3 cores per processor socket. In this setup, each MPI task can access approximately
largely due to the improved performance of MPI Allreduce and less pressure on the memory
sub-system.
Although -log summary option in PETSc gives a high level overview of the performance
of PFLOTRAN, it does not provide detailed information such as the hardware performance
metrics, call path structures, low-level subroutines called in different PETSc interfaces, and ex-clusive timings of individual routines. This motivated us to instrument PFLOTRAN along with
its associated libraries (PETSc, MPI and parallel HDF5) with different performance analysis
tools. These details are presented in the next section.
2.4
Profiling Tools
Performance analysis tools or profiling tools are software libraries that can collect detailed per-formance metrics of a computational application. Because of the inherent complexity involved
in high performance computer architectures and scientific applications, profiling tools act as
a valuable asset to computational scientists and are commonly used to better understand an application’s performance.
Profiling tools can provide data that could be used to identify potential performance
bottle-necks. They also help us better characterize the interaction between the application with the architecture. The ease of use of profiling tools depends upon the complexity of the application
as well as the tool chosen for instrumenting the application. Most profiling tools provide an
application programming interface (API) to directly interact with the tool and a graphical user interface (GUI) to visualize the large amounts of performance data. Some profiling tools can
instrument applications without making any source code changes. There are different ways of
instrumenting an application, these are described below.
• Compilers provide support for instrumenting the source code. For example, gcc provides
-pg flag that can be used at compile and link time.
• Binary: This process involves tools that can insert timing routines to an already built binary. CrayPAT [1] is an example of this class.
• Runtime: Involves invoking the tool at execution time. Examples of these include Valgrind
suite [14].
• Automatic source-level: This involves tools that automatically add timing routines to the
source code. TAU [35, 25, 22] belongs to this class.
the tools try to measure a function that is called a large number of times and time spent in each
instance of the function is very minute. In order to obtain more detailed, low-level performance metrics of PFLOTRAN we used different profiling tools to instrument PFLOTRAN on Cray
XT5 and IBM BG/P.
2.4.1 CrayPAT
Cray Performance Analysis Tool (CrayPAT) [1] is a binary instrumentation based profiling tool. It is developed by Cray and thus can only be used on Cray architectures. CrayPAT
can intercept routines from a wide range of libraries such as MPI, OpenMP, UPC, BLAS, LAPACK, HDF5 and many others and is relatively easy to use when compared to other source
level instrumentation tools such as TAU. It uses Performance API (PAPI) underneath to capture
hardware performance counter data.
We used CrayPAT to profile PFLOTRAN, PETSc and MPI libraries on the Cray XT5.
In order to minimize the profiling overhead, we have performed selective instrumentation of
PETSc and PFLOTRAN routines. We initially performed a sampling experiment to identify the most dominant routines and then use those routines to perform a tracing experiment. An
instrumented binary is built by using the pat build command. pat report generates text report
from the performance data that was dumped during run time. A wide choice of options are accepted by bothpat build and pat report. We have various options ofpat build and pat report
to capture various performance metrics on the Cray XT5.
CrayPAT provides various environment variables to control its run-time behavior. By de-fault, runtime summarization is enabled and the data collected is aggregated. The runtime
variable PAT RT HWPC specifies the hardware performance counter events monitored for a
program counter tracing experiment. The environment flag, PAT RT SYNC controls the re-porting behavior of MPI collective communication routines. By enabling this flag, CrayPAT
reports two separate timing groups for MPI collective routines, namely MPI and MPI SYNC.
MPI SYNC group represents the time spent in synchronization and the MPI group represents the time spent in the actual collective communication operations. CrayPAT achieves this
func-tionality by placing a MPI Barrier before each MPI collective call. It reports the time spent in
MPI Barrier as MPI SYNC time and the time spent in the MPI collective routine is reported as MPI group time. This functionality is helpful in identifying possible synchronization
bottle-necks in applications; however, this may incur an additional overhead in some applications due
to the explicit barrier call placed before each MPI collective function. CrayPAT provides a GUI called Apprentice to view the application performance data. Figure 2.13 depicts the Apprentice
shows the call path structure of routines in Flow and Transport phases of PFLOTRAN.
Figure 2.13: PFLOTRAN on Cray XT5: CrayPAT - Apprentice tool
2.4.2 TAU
Tuning and Analysis Utilities (TAU) is an automatic source level instrumentation tool. TAU [11]
is capable of gathering performance information through instrumentation of functions, methods, basic blocks, and statements. It is developed at the University of Oregon. We have used TAU
to collect performance metrics primarily on IBM BG/P, but also used it on Cray XT5. We
instrumented PFLOTRAN and its associated libraries (PETSc, MPI and parallel HDF5) using TAU on the IBM BG/P. In order to instrument with TAU, themakefileof PFLOTRAN has to be
altered to make use oftau compiler script. In addition to performing selective instrumentation,
TAU THROTTLE environment variable is used to minimize the overhead. We had to alter the TAU provided configuration script to build PETSc with selective instrumentation. The
following changes were made to the PFLOTRAN makefile
include /soft/apps/tau/tau-2.19/bgp/lib/Makefile.tau-mpi-pdt
FC = $(TAU_COMPILER) -optVerbose -optPreProcess -optKeepFiles bgxlf90_r FC_LINKER = $(TAU_COMPILER) mpixlf90_r
When compiling PFLOTRAN with TAU, we observed that TAU was unable to instrument
Figure 2.14: PFLOTRAN on Cray XT5: 1 billion DoF - call path structure of routines in PETSc
Database Toolkit (PDT) fails for some of the source files because it inserts variable
declara-tions after an executable statement (pointer assignment, if statements etc) in some subroutines. This can be seen in the intermediate files (*.pp.inst.F90) generated during compilation. Moving
the TAU variables above the executable statements and re-compilation fixes this problem and
instruments the source file. TAU provides a profile visualization tool, Paraprof. It provides graphical displays of all the performance analysis results, in aggregate and single
node/contex-t/thread forms. Figure 2.16 depicts a screenshot of mean time of routines in PFLOTRAN with
Paraprof.
Figure 2.16: PFLOTRAN on IBM BG/P: TAU - Mean time profiling of routines
2.4.3 Performance Results
In this section, we present detailed performance results of the 1 billion DoF problem of
PFLO-TRAN obtained with CrayPAT and TAU tools. Recall that for the results presented in this
chapter, we have set PFLOTRAN to run for 30 time steps and all file output is turned off to avoid any interaction with the I/O sub-system. The profiling overhead is very minimal (less
than 5% of runtime) for the results presented. Figure 2.17 shows the percentage of time spent
in different groups of routines from 4,092 to 32,760 cores of Cray XT5. The User group shows good scalability, whereas the time spent inMPI andMPI SYNC group increases with processor
count.
Figure 2.18 and Figure 2.19 depict a breakdown of the routines in theUser andMPI groups.
They show the percentage contribution of different User and MPI routines to wall clock time.
Figure 2.17: PFLOTRAN on Cray XT5: Percentage contribution of USER, MPI and MPI -SYNC group routines to wall clock time
the other two belong to PFLOTRAN. From the MPI routines (Figure 2.19), it is clear that MPI Allreduce is the most significant routine followed by MPI File open. A brief description
about the dominant PETSc, PFLOTRAN routines is given below.
• MatSolve SeqBAIJ N: This routine solves the system A x = b, given a factored matrix “A”
• MatLUFactorNumeric SeqBAIJ N: This routine performs numeric LU factorization of a matrix.
• MatMult SeqBAIJ N: This routine computes the matrix-vector product, y = A x.
• RTotal: This routine calculates the contribution of aqueous equilibrium complexation to the residual and Jacobian functions for Newton-Raphson.
Figure 2.20 shows the time spent in variousUser andMPI routines. MPI Allreduce sync is
the most dominant routine at all the processor counts. More detailed analysis of MPI Allreduce
performance in PFLOTRAN is described in section 2.5. There is an increase in the time spent in the MPI file routines after 16,380 cores. These file operations happen in the Initialization phase
of PFLOTRAN. More detailed I/O analysis is presented in Chapter 4. Figure 2.21 shows the
peak performance of the dominant computational routines on Cray XT5. The PETSc routine
MatLUFactorNumeric SeqBAIJ Ngives the highest peak performance of close to 18 %, followed
Figure 2.18: PFLOTRAN on Cray XT5: Percentage contribution of user routines to wall clock time
Figure 2.20: PFLOTRAN on Cray XT5: Timings for user and MPI routines
Figure 2.22 shows the break down of dominant user routines on IBM BG/P. Although the
same user routines that appear on Cray XT5 are also seen on IBM BG/P, the three most dominant routines are native PFLOTRAN routines. These routines involve logarithm and
exponentiation functions which are expensive.
Figure 2.22: PFLOTRAN on IBM BG/P:Breakdown of dominant user routines
Figure 2.23, Figure 2.24, Figure 2.25 and Figure 2.26 show the time spent in the most
dominant user routines by each process for a 8,184 processor core run on Cray XT5. It can be observed that a small percentage of processes spent very little time in these routines indicating
that they do very little work. Consequently, these same processes spent idle time waiting
at different synchronization points through the program execution. This idle time or load imbalance experienced by this small fraction of processors can be largely attributed to the
inactive cells contained in some regions of the problem being solved - the Columbia river portion
of the Hanford problem domain does not contain any porous media and PFLOTRAN uses inactive cells in these regions. This synchronization time is captured in the MPI Allreduce
calls. The time spent in synchronizing for MPI Allreduce by each process at 8,184 cores of
Cray XT5 is shown in Figure 2.27.
Measuring the number of floating point operations and instructions executed by each
pro-cess would help us in identifying the work load imbalance across the propro-cesses. Figure 2.28
and Figure 2.29 show the PAPI FP OPS and PAPI TOT INS hardware counter data for each process at 8,184 cores. Figure 2.28 indicates that there are five clusters of processes with similar
Figure 2.23: PFLOTRAN on Cray XT5: Time spent in MatLUFactorNumeric SeqBAIJ N routine by each process for a 8,184 core run.
Figure 2.25: PFLOTRAN on Cray XT5: Time spent in MatSolve SeqBAIJ N routine by each process for a 8,184 core run.
Figure 2.27: PFLOTRAN on Cray XT5: Time spent in synchronizing for MPI Allreduce by each process for a 8,184 core run.
Figure 2.29: PFLOTRAN on Cray XT5: Total number of instructions (PAPI TOT INS) exe-cuted by each process for a 8184 core run.
Figure 2.30 to Figure 2.34 show the time spent in the most dominant user routines by each process for a 8,184 processor core run on IBM BG/P. The time spent in synchronizing for
MPI Allreduce by each process at 8,184 cores of IBM BG/P is shown in Figure 2.34. From the
figures, it is clear that the same processes that are spending very little time in the dominant user routines have higher MPI Allreduce synchronization time. This pattern is similar to what
we have observed on Cray XT5.
2.5
Communication Analysis
2.5.1 Communication sub-systems of Cray XT5 and IBM BG/P
The IBM BG/P architecture has five networks that are used for performing various tasks on
the Intrepid system. We briefly describe the functionality of the various networks below. More detailed description is given in [5]
• Three-dimensional Torus: This network is used for general-purpose, point-to-point mes-sage passing and multicast operations between the IBM BG/P compute nodes. On the
IBM BG/P, a single compute node is connected to six nearest neighbor compute nodes.
The torus is constructed with point-to-point, serial links between routers embedded within the Blue Gene/P ASICs. The target hardware bandwidth for each torus link is 425 MB/s
in each direction of the link for a total of 5.1 GB/s bidirectional bandwidth per node.
com-Figure 2.30: PFLOTRAN on IBM BG/P: Time spent in Rtotal routine by each process for a 8,184 core run.
Figure 2.32: PFLOTRAN on IBM BG/P: Time spent in Rkineticmineral routine by each process for a 8,184 core run.
Figure 2.34: PFLOTRAN on IBM BG/P: Time spent in MPI Allreduce by each process for a 8,184 core run.
munication operations and to move application data from the I/O nodes to the compute nodes. Each compute and I/O node has three links to the global collective network at 850
MB/s per direction for a total of 5.1 GB/s bidirectional bandwidth per node. Collective
operations such as broadcast and reduction are performed on this network.
• Global Interrupt: This network based on asynchronous logic and is used for signaling
global interrupts and barriers.
• 10 gigabit Ethernet: This network consists of all I/O nodes and discrete nodes that are
connected to a standard 10 gigabit Ethernet switch. The compute nodes are not directly connected to this network.
• Control network has direct access to every compute and I/O nodes. This network is used for system boot and monitoring.
The MPI implementation on BG/P supports three different protocols, these are described
below.
• MPI short protocol is used for short messages.
• MPI eager protocol used for medium-sized messages. No negotiations are made on both ends before sending a message in this protocol.
The Cray XT5 uses a three dimensional torus network for communications between the
compute nodes. Each compute node has a 6-port Cray SeaStar2+ network interface controller that is connected to the processor with Hypertransport. Each SeaStar2+ port is capable of 9.6
GB/s peak bi-directional bandwidth and 6 GB/s of sustained bandwidth. There is no separate
network for connecting the compute nodes to the I/O nodes. The MPI implementation on Cray XT5 uses two device drivers for communications on the machine. A shared memory (SMP)
driver for communicating with processes that share a single compute node and the portal device
driver for communications across the nodes.
Cray’s implementation of the MPI standard is contained in Cray Message Passing Toolkit
(MPT). It is based on the MPICH2 implementation [8]. MPT uses two device drivers for
communications on the machine. A shared memory (SMP) driver for communicating with pro-cesses that share a single compute node and the portal device driver for communications across
the nodes. Portals is a low-level software interface for inter-node communication. Like many
message passing libraries, Portals uses eager protocol for sending short messages wherein short messages are sent with the assumption that the receiving process has the resources to store the
message, and the receiver is responsible for buffering the message if the matching receive has not
yet been posted. The data is copied the receives buffer if the matching receive has been posted. Otherwise the data is copied to an unexpected buffer and two entries (put start and put end
events) are generated in theunexpected event queue that track incoming unexpected messages. Evidently, the unexpected event queue and buffer can run out of resources causing
scalabil-ity issues. These issues can be mitigated by increasing the buffer size and maximum queue
length using MPICH UNEX BUFFER SIZE and MPICH PTL UNEX EVENTS environment variables respectively. Alternately, a flow control mechanism can prevent the unexpected event
queue from being exhausted in any situation, and may cause performance degradation. This
mechanism can be enabled by setting the environment variable MPICH PTL SEND CREDITS to “-1” Mills et al. [31] discuss the scalablity implications of MPT on real applications.
2.5.2 MPI Allreduce in PFLOTRAN
PFLOTRAN heavily uses the MPI subroutine, MPI Allreduce during its execution. In this
section, we discuss our observations corresponding to the performance of MPI Allreduce in PFLOTRAN on Cray XT5 and IBM BG/P. MPI Allreduce is a collective subroutine that
combines values from all the participating MPI processes and distribute the result back to all
of the processes. Since this is a collective operation all processes need to synchronize before participating in the call.Different MPI implementations use different algorithms to gather and
factors.
1. Performance of the inter-connect because it determines the latency and bandwidth avail-able between the compute nodes.
2. Topology in which the compute nodes are connected to one another and the mapping of
MPI tasks to compute nodes.
3. Number of participating MPI processes.
4. Size of the message that has to be distributed.
5. Time spent in the process of synchronization before the reduction operation.
Figure 2.35: Variation in run-time on Cray XT5 with the same simulation parameters.
Figure 2.35 shows the time spent in the Flow and Transport phase of PFLOTRAN at 65,536
cores of Cray XT5. Two different solvers were used for these runs. As can be seen from the
figure, there is significant variation in the runtime of Flow stage between the five trial runs. Recall that the Flow phase is communication intensive and that 95% of MPI Allreduce calls
happen in this phase. The observed variation in runtime of Flow phase can be attributed
to the variability of MPI Allreduce performance on Cray XT5. This shows that the state of the machine when the simulation is run also plays a role in determining the performance of
MPI Allreduce. We did not observe this phenomenon on the IBM BG/P.
Table 2.1: Distribution of MPI Allreduce calls into various message size bins at 8,184 cores of Cray XT5
Bin Count Call site
0B <Message size<16B 113,070 VecDot MPI, VecNorm MPI etc.,
16B <= Message size<256B 32,725 VecDotNorm2
4KB <= Message size <64KB 943 MatZeroRows MPIBAIJ, MatZeroRows
-MPIAIJ, MatAssemblyBegin MPIBAIJ,
MatAssemblyBegin MPIAIJ
are invoked to perform the dot product and norm calculations. The functions VecDot MPI
andVecNorm MPI perform a reduction operation on a single double variable and the function
VecDotNorm2performs a reduction operation on two double variables. It must be noted that the number of MPI Allreduce calls executed depends up on the number of time steps PFLOTRAN
is set to run.
Figure 2.36 shows the time spent by each process in synchronizing for MPI Allreduce at 8,184 cores of Cray XT5. It is clearly evident that a small percentage of processes spent more
time in MPI Allreduce synchronization when compared to other processes which indicates that
some processes arrive at the synchronization point much earlier than others. Similar pattern can be observed from Figure 2.37 for IBM BG/P. The reasons for this behavior of MPI Allreduce
has already been discussed in the previous section.
Figure 2.37: PFLOTRAN on IBM BG/P: 1 DoF problem - Total MPI Allreduce time
Box plots are commonly used in descriptive statistics to visualize distribution of data,
es-pecially to identify outliers in a distribution. Box plots use five number summaries to describe
the data, namely the sample minimum, the lower quartile (25th percentile), median. upper quartile (75th percentile) and the sample maximum. In our case, box plots would be
help-ful in identifying the outlier processes with respect to MPI Allreduce time. Figure 2.38 and Figure 2.39 show the box plot timings of total MPI Allreduce timings on Cray XT5 and IBM
BG/P. The bottom and top of the box are respectively Q1 (25th percentile) and Q3 (75th
percentile). The mid-point in the box represents the median of the distribution. Whiskers are marked atQ1−(1.5∗IQR) andQ3 + (1.5∗IQR). IQR is interquartile range and is defined as (Q3−Q1) Points which cross the whiskers are marked with ’+’ symbols and represent outliers.
Figure 2.40 shows the total time spent in MPI Allreduce for a 30 time-step 1 billion DoF test
problem on Cray XT5 and IBM BG/P. This includes both the time spent in synchronization
and the actual reduction operation. For the same simulation parameters on both architectures, we observe that IBM BG/P gives a much better MPI Allreduce performance at 16,384 cores
when compared to Cray XT5. This is due to the presence of the “Global Collective” network
Figure 2.38: PFLOTRAN on Cray XT5: 1 DoF problem - Box-plot of MPI Allreduce timings (includes synchronization also)
Chapter 3
Custom MPI Allreduce Benchmark
This chapter is organized as follows: Section 3.1 provides background and motivation for de-veloping a custom MPI Allreduce benchmark. Section 3.2 explains the various communication
mechanisms developed for the benchmark. Finally, section 3.3 compares the benchmark
perfor-mance on Cray XT5 and IBM BG/P architectures. It also provides recommendations on the optimal communication mechanism for MPI Allreduce on Cray XT5.
3.1
Motivation
In the previous chapter, we discussed the differences in the communication sub-systems between the Cray XT5 and IBM BG/P architectures. Alam et al. [20] discussed the Intel MPI
bench-mark (IMB) [6] results on IBM BG/P and Cray XT4. For the IMB MPI Allreduce benchbench-mark,
they measured latency with varying message sizes and latency for a fixed message size for a range of processor counts. They concluded that both the architectures show good scalability in
terms of message sizes and processor counts with IBM BG/P performing exceptionally well in terms of scalability. Alam et al. [21] also discuss the impact of Message Passing Toolkit (MPT)
on the performance of IMB on Cray XT4. Characterizing the performance of MPI Barrier
would be helpful in understanding the performance of MPI Allreduce because MPI Allreduce involves a synchronization operation. Worley et al. [18] measured the performance of MPI
-Barrier on Cray XT4/5 and IBM BG/P. They concluded that the barrier performance on IBM
BG/P is superior to that of Cray XT4/5. They also observed small performance preference for power-of-two processor counts and significant performance variability on Cray XT4/5.
The aforementioned studies were performed upto a maximum of 16,384 processor cores and
included relatively fewer MPI Allreduce calls within a single program instance. Hence, we developed a custom MPI Allreduce benchmark to investigate performance of programs that
believe that this study facilitates accurate assessment of performance impact of architecture on
real scientific applications like PFLOTRAN.
3.2
Benchmark Description
The custom MPI Allreduce benchmark (CMB) is written in C programming language and uses
both MPI and OpenMP (shared-memory programming interface) [9] programming models. It performs a simple dot product operation on a vector of doubles that is distributed across the
processor cores. Each process computes local dot product on its allocated vector and then all processes participate in an MPI Allreduce call with summation operation to compute the global
dot product. The global dot product operation is repeated 150,000 times in a single program
instance, this is equivalent to performing the MPI Allreduce 150,000 times.
In order to have sufficient computation between each reduction operation, we implemented
a loop over the local dot product computation. The number of times this loop is iterated is
user configurable and specified as a command line argument to the benchmark. Similarly the vector size and number of MPI Allreduce operations (global dot product iterations) can also
be set through command line arguments. Each process is allocated a constant vector of length
61,050 elements, which makes the total vector size approximately equal to 1 billion elements when 16,380 processor cores are used.
We have chosen 150,000 MPI Allreduce calls because it approximately represents the total
number of MPI Allreduce calls in PFLOTRAN for the 1 billion DoF test problem at 16,380 processor cores. Unlike the test problem of PFLOTRAN, this benchmark does not have any
load imbalance in its computation because each process is allocated a constant vector size
of 61,050 elements. This implies that any MPI synchronization time observed is due to the inherent synchronization time on the target architecture and not due to computational load
imbalance in the benchmark. Several methods of performing MPI Allreduce were implemented
in the benchmark. These are described below.
• Direct: In this method, all MPI processes participate in the MPI Allreduce operation.
MPI COMM WORLD is used as the communicator handle in MPI Allreduce call. This is the base MPI Allreduce implementation available on the target machines and it is used
without any modifications.
• Async: In this method, only one MPI process per compute node participates in the
MPI Allreduce operation. A MPI sub-communicator is created consisting of only the
-global communicator, MPI COMM WORLD. The root process on each node gathers the
local dot product of the rest of the processes within its node by using asynchronous point-to-point receive (MPI Irecv) operation. After completing the MPI Allreduce operation
with the process belonging to the sub-communicator, the root process in each node sends
back the global dot product result to the rest of the processes in their respective nodes using an asynchronous point-to-point send (MPI Isend) operation. Asynchronous MPI
routines are known as non-blocking message passing communications because these
oper-ations return without waiting for the communication to finish on the other end. So the operations following the communication call can proceed even though the message
deliv-ery/receipt may not have been completed. One can wait for a non-blocking operation
to finish using MPI Wait or MPI Waitall function so as to avoid corrupting the message buffer or accessing a non-existing buffer.
• Sub-comms: This method involves creation of two MPI sub-communicators namely,
intra-node and inter-node. The intra-node sub-communicator is a collection of all
pro-cesses within a node, while inter-node sub-communicator is a collection of all the root processes among all compute nodes. A MPI Reduce operation within a node involving
the intra-node communicator computes the dot product among the processes within a
node. The result is only stored at the root process because the computed result does not yet represent the global dot product. All the root processes among the compute nodes
participate in MPI Allreduce operation involving the inter-node communicator to
com-pute the global dot product. This is followed by a MPI Bcast operation involving the intra-node communicator to propagate the final result to the rest of the processes in each
node.
• Hybrid MPI-OpenMP: This method involves using both MPI and OpenMP
function-ality to compute the vector dot product. A single MPI task is requested per compute node which spawns 12 OpenMP threads corresponding to 12 cores available on each node. The
local dot product in a node is computed by using the OpenMP directiveparallel for with
thereduction clause. The MPI task in each node participates in MPI Allreduce involving the global communicator to compute the global dot product.
3.3
Performance Results
The Cray XT5 system used for the experiments described in this section has 6 cores per
proces-sor socket, with two such sockets per XT5 node. We used both CrayPAT and our custom timing
Table 3.1: Performance of CMB on Cray XT5 with PAT RT MPI SYNC = 1
Cray XT5 cores MPI Allreduce MPI Allreduce (sync) Total MPI Allreduce
16,380 58.18 77.59 135.76
32,760 51.32 74.87 126.2
65,532 156.5 152.33 308.83
Table 3.2: Performance of CMB on Cray XT5 with PAT RT MPI SYNC = 0
Cray XT5 cores MPI Allreduce
16,380 74.33
32,760 86.24
65,532 117.88
on CMB performance withdirect method. Table 3.1 shows the time spent in MPI Allreduce at different processor counts on Cray XT5. These runs were executed with CrayPAT flag PAT
-RT MPI SYNC set to 1. Recall that setting this flag would capture the synchronization time of
MPI collective routines by explicitly placing an MPI Barrier before each MPI collective func-tion. Table 3.2 shows the timings of the same runs with PAT RT MPI SYNC set to 0. From
tables 3.1 and 3.2, it is clearly evident that CrayPAT introduced an additional overhead when profiling with PAT RT MPI SYNC enabled for this benchmark. For runs below 16,380
pro-cessor count we did not observe any overhead in using the PAT RT MPI SYNC environment
variable.
Table 3.3 shows the timings of CMB with direct method on IBM BG/P. We could have
used TAU for profiling the benchmark but we wanted to explicitly capture the MPI Allreduce
synchronization time and since TAU does not provide this functionality, we had to use our custom timing routines. We placed a MPI Barrier before MPI Allreduce to explicitly capture
the MPI Allreduce synchronization time. We did not observe any significant overhead with the
introduction of the MPI Barrier function in our profiling.
From Figure 3.1, it can be observed that IBM BG/P outperforms Cray XT5 by a factor
of 7-10 in MPI Allreduce performance. It can be observed from the results on IBM BG/P
that the performance of MPI Allreduce remains almost constant from 16,380 to 98,304 cores which is quite remarkable for a collective communication routine. This is largely due to the
availability of a special tree network on the IBM BG/P that is used for collective operations
Table 3.3: Custom MPI Allreduce benchmark performance on IBM BG/P
BG/P cores MPI Allreduce MPI Allreduce (sync) Total MPI Allreduce
16,380 1.61 9.67 11.28
32,760 1.61 9.74 11.35
65,532 1.68 9.64 11.32
98,304 1.75 8.20 9.95
operations.
To avoid skewing of performance results at higher processor counts, we ran the benchmark
without CrayPAT profiling. We could have used CrayPAT with PAT RT MPI SYNC set to 0, but we used our custom timing routines without any MPI Barrier call. The remaining results
presented in this chapter do not include any explicit barrier operations.
Figure 3.1: CMB (direct method): MPI Allreduce comparison on Cray XT5 and IBM BG/P
Since the performance of MPI Allreduce on Cray XT5 was less than ideal when compared with IBM BG/P, it prompted us to develop and experiment with different methods of
imple-menting MPI Allreduce on Cray XT5. Figure 3.2 to Figure 3.6 show the wall clock time of
running CMB with various methods at different processor counts on Cray XT5. These figures show that the OpenMP-MPI hybrid model results in an improvement of 20-40% over other
implementations. Due to the observed performance variability on Cray XT5, we conducted
multiple runs at each processor count.
Further investigation indicated that the improvement in the hybrid OpenMP-MPI model
Figure 3.2: Cray XT5: Custom MPI Allreduce benchmark at 16,380 cores - Wall clock time
Figure 3.5: Cray XT5: Custom MPI Allreduce benchmark at 98,292 cores - Wall clock time
Although the Async and Sub-comms approaches involve performing MPI Allreduce operation
with only one MPI process per node, these implementations require creation of MPI sub-communicators which is an expensive operation. In the hybrid method, there was no need
of creating a sub-communicator because each node is assigned a single MPI task and so the
global communicator can be used. Also, the OpenMP compiler in PGI was able to vectorize the dot product loop whereas the PGI compiler could not vectorize that loop with MPI only
implementations. We derive our conclusions regarding vectorization by enabling the-MinfoPGI
flag during compilation. When this flag is enabled, the compiler prints out information regarding the various optimizations performed on the source code. When compiled with OpenMP, the
following information is printed out with -Minfo. Line 331 in the source file is the OpenMP
parallel for dot product loop. It can be seen that in addition to generating SSE (Streaming SIMD Extensions) instructions for the loop, it also produced alternate variations of the loop
and generated prefetch instructions.
line 331, Parallel loop activated with static block-cyclic schedule
Generated 3 alternate loops for the loop
Generated vector sse code for the loop
Generated 2 prefetch instructions for the loop
When compiled without using any OpenMP directives, the following information is printed
out. Line 359 is the dot productfor loop
line 359, Loop not vectorized: data dependency
Loop unrolled 2 times
The PGI compiler is not able to vectorize the loop because it assumes that the two vectors which
are accessed through pointers in the code could overlap with one another. This phenomenon is known as pointer aliasing and is the default behavior of the PGI compiler. In our code, both
the pointers reference to unique memory locations and so we could safely instruct the compiler
to vectorize the loop. PGI compiler provides -Msafeptr flag that can used during compilation to specify that it is safe to assume that the pointers in the source code do not overlap. When
compiled with-Msafeptr enabled, the following is printed out
line 359, Generated 3 alternate loops for the loop
Generated vector sse code for the loop
Generated 2 prefetch instructions for the loop
The same result can also be achieved by declaring the pointers with therestrict C keyword and
Table 3.4: Cray XT5: Comparison of MPI Allreduce performance in CMB
Cray XT5 cores direct - trial 1 direct - trial-2 hybrid - trial 1 hybrid - trial 2
16,380 59.76 55.15 25.39 25.87
32,760 75.81 85.58 57.67 60.43
65,532 85.10 86.17 38.20 34.99
98,304 108.84 98.05 34.91 76.18
131,064 227.79 268.46 219.14 147.63
Table 3.4 shows the time spent in MPI Allreduce when direct and hybrid methods are
used. In all the cases, the hybrid method outperforms the direct method. Even though the
hybrid method of CMB results in better performance on Cray XT5, it is still far inferior to the performance of MPI Allreduce on IBM BG/P.
Large scale scientific applications can take advantage of these performance gains by adopting
the hybrid MPI-OpenMP model. This usually involves significant development work. However, with the current trend of increasing number of cores per processor socket the hybrid model
may be worth exploring. We were not not able to incorporate the approaches discussed in this
chapter into PFLOTRAN because the current PETSc does not support OpenMP. Recently Oral et al. [19] have investigated the impact of OS jitter on MPI Allreduce performance on
quadcore per socket Cray XT5 system. They aggregated the OS noise sources onto a single
core for each node and then executed the scientific application on the remaining cores in each node. Their results indicate an improvement of over two orders of magnitude in MPI Allreduce