Performance Analysis and Optimization of Parallel I/O in a Large Scale Groundwater Application on Petascale Architectures.

(1)

ABSTRACT

SRIPATHI, VAMSI KIRAN. Performance Analysis and Optimization of Parallel I/O in a Large Scale Groundwater Application on Petascale Architectures. (Under the direction of G.(Kumar) Mahinthakumar.)

This thesis attempts to characterize and optimize the performance of a large scale

ground-water application, PFLOTRAN, on leading petascale architectures, Cray XT5 and IBM

Blue-Gene/P (BG/P). PFLOTRAN is a highly scalable groundwater simulation code that solves multi-phase groundwater flow and multicomponent reactive transport in three-dimensional

porous media. It uses MPI (Message Passing Interface) for interprocess communication, PETSc

(Portable, Extensible Toolkit for Scientific Computation) framework for solving numerical equa-tions and parallel HDF5 (Hierarchical Data Format 5) library for I/O. We performed detailed

performance analysis of PFLOTRAN and its associated libraries on Cray XT5 and IBM BG/P

using Cray Performance Analysis Tool(CrayPAT) and Tuning and Analysis Utilities(TAU). Our profiling results indicated that IBM BG/P delivers superior MPI Allreduce and I/O

performance when compared to Cray XT5 for PFLOTRAN. Differences between the two

ar-chitectures in terms of computational performance, communication and I/O sub-systems are discussed. We designed a custom vector dot-product benchmark to accurately characterize

MPI Allreduce performance at scale on both petascale architectures. We experimented with

different communication schemes to optimize the benchmark performance on Cray XT5 and obtained an improvement of 20-40% with a hybrid MPI-OpenMP implementation.

A strong scaling study with a 270 million cell test problem of PFLOTRAN indicated that a high volume of independent I/O disk access requests and file access operations would severely

limit I/O performance scalability on Cray XT5. To mitigate the performance penalty at higher

processor counts, we implemented a two phase I/O approach at the application layer by splitting the MPI global communicator into multiple sub-communicators. With this approach we were

able to achieve 25X improvement in HDF5 read I/O and 3X improvement in HDF5 write I/O

(2)

Performance Analysis and Optimization of Parallel I/O in a Large Scale Groundwater Application on Petascale Architectures

by

Vamsi Kiran Sripathi

A thesis submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Master of Science

Computer Science

Raleigh, North Carolina

2010

APPROVED BY:

Xiaosong Ma

Co-chair of Advisory Committee

Richard T. Mills

Frank Mueller G.(Kumar) Mahinthakumar

(3)

DEDICATION

(4)

BIOGRAPHY

Vamsi Sripathi was born in Bhimavaram, in the state of Andhra Pradesh in India. He had a

fun filled childhood attending various schools across the state. In April 2007, he graduated with a Bachelor of Technology in Information Technology degree from V.R.Siddhartha Engineering

College in Vijayawada, AP, India. He came to USA in Fall 2007 to pursue graduate studies

(5)

ACKNOWLEDGEMENTS

I would like to express my heartfelt gratitude to my advisor, Dr. Mahinthakumar for his

in-valuable guidance during my graduate studies. His thought provoking suggestions and insighful observations were pivotal to the success of my thesis research. His confidence in my abilities

motivated me to pursue challenging goals. His insights in High Performance Computing helped

me comprehend this broad area of research. I would also like to thank him for his availability to discuss my work.

I’m thankful to Dr. Richard Mills for his steadfast support while mentoring me during

my summer internship at Oak Ridge National Laboratory. His encouragement and substantial feedback helped crystallize my thesis topic and was instrumental to this thesis study.

I would like to thank Dr. Xiaosong Ma, co-chair of my thesis committee for her crucial

feedback and flexibility. I enhanced my knowledge of Parallel I/O when I attended her “Parallel Systems” course. I’m grateful to Dr. Frank Mueller for accepting my request to be a member

of the thesis committee and for providing valuable observations.

I’m thankful to the U.S Department of Energy (DOE) Office of Science, Scientific Discov-ery through Advanced Computing (SciDAC-2) research program and Performance Engineering

Research Institute (PERI) for funding this research study.

I’m grateful to my parents for their love and unwavering support. They provided me the freedom and encouragement to pursue a career of my choice. I’m greatly indebted to my brother,

Sarat Sreepathi who first introduced me to Computer Science and mentored me throughout my life. His patience and support during my thesis preparation were indispensable. I tremendously

enjoyed our passionate discussions on a wide range of topics. I’m happy to acknowledge the

support of my long time friend, Shivakar Vulli for sharing his perspectives on my research. I would like to thank my uncle, Sharma Vemuri and his kids, Rujula and Rishabh for fun filled

weekends that provided a much needed respite from graduate school.

This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of

Energy under Contract No. DE-AC05-0OR22725, and the resources of the Argonne Leadership

(6)

TABLE OF CONTENTS

List of Tables . . . vi

List of Figures . . . vii

Chapter 1 Introduction . . . 1

1.1 Motivation . . . 2

1.2 Computational Platforms . . . 2

1.2.1 Cray XT5 (JaguarPF) . . . 3

1.2.2 IBM BlueGene/P (Intrepid) . . . 3

1.3 Thesis Organization . . . 4

Chapter 2 Performance Analysis . . . 6

2.1 Description of PFLOTRAN . . . 6

2.1.1 Benchmark Problems . . . 8

2.2 Cray XT5 and IBM BG/P Compute Nodes . . . 8

2.3 Scalability Analysis . . . 11

2.3.1 Observations . . . 11

2.4 Profiling Tools . . . 17

2.4.1 CrayPAT . . . 18

2.4.2 TAU . . . 19

2.4.3 Performance Results . . . 21

2.5 Communication Analysis . . . 29

2.5.1 Communication sub-systems of Cray XT5 and IBM BG/P . . . 29

2.5.2 MPI Allreduce in PFLOTRAN . . . 33

Chapter 3 Custom MPI Allreduce Benchmark . . . 39

3.1 Motivation . . . 39

3.2 Benchmark Description . . . 40

3.3 Performance Results . . . 41

Chapter 4 Parallel I/O Analysis and Optimization . . . 48

4.1 Lustre and GPFS . . . 49

4.2 Analysis of file read . . . 51

4.3 Analysis of file write . . . 56

4.4 Optimization of I/O performance . . . 58

4.4.1 Improvement in file read . . . 59

4.4.2 Improvement in file write . . . 65

Chapter 5 Summary and Conclusions . . . 69

(7)

LIST OF TABLES

Table 2.1 Distribution of MPI Allreduce calls into various message size bins at 8,184

cores of Cray XT5 . . . 35

Table 3.1 Performance of CMB on Cray XT5 with PAT RT MPI SYNC = 1 . . . 42

Table 3.2 Performance of CMB on Cray XT5 with PAT RT MPI SYNC = 0 . . . 42

Table 3.3 Custom MPI Allreduce benchmark performance on IBM BG/P . . . 43

Table 3.4 Cray XT5: Comparison of MPI Allreduce performance in CMB . . . 47

(8)

LIST OF FIGURES

Figure 1.1 Architecture of Cray XT5 (JaguarPF) . . . 3

Figure 1.2 Architecture of IBM BlueGene/P (Intrepid) . . . 4

Figure 2.1 Architecture of PFLOTRAN . . . 7

Figure 2.2 PFLOTRAN: Hanford 300 benchmark problem . . . 9

Figure 2.3 Cray XT5 compute node . . . 10

Figure 2.4 BlueGene/P compute node . . . 10

Figure 2.5 PFLOTRAN Flow Chart . . . 12

Figure 2.6 PFLOTRAN on Cray XT5 and IBM BlueGene/P: 1 billion DoF problem, comparison of INITIALIZATION stage . . . 12

Figure 2.7 PFLOTRAN on Cray XT5 and IBM BlueGene/P: Comparison of 68 mil-lion DoF FLOW stage . . . 13

Figure 2.8 PFLOTRAN on Cray XT5 and IBM BlueGene/P: Comparison of 1 billion DoF TRANSPORT stage . . . 13

Figure 2.9 PFLOTRAN on Cray XT5 and IBM BlueGene/P: 1 billion DoF problem, FLOW + TRANSPORT stages . . . 14

Figure 2.10 PFLOTRAN on Cray XT5: Comparison of 68 million DoF FLOW stage . 14 Figure 2.11 PFLOTRAN on Cray XT5: Comparison of 1 billion DoF TRANSPORT stage . . . 15

Figure 2.12 PFLOTRAN on Cray XT5: 1 billion DoF problem, FLOW + TRANS-PORT stages . . . 15

Figure 2.13 PFLOTRAN on Cray XT5: CrayPAT - Apprentice tool . . . 19

Figure 2.14 PFLOTRAN on Cray XT5: 1 billion DoF - call path structure of routines in PETSc . . . 20

Figure 2.15 PFLOTRAN on Cray XT5: 1 billion DoF - call path structure of routines in FLOW and TRANSPORT phases . . . 20

Figure 2.16 PFLOTRAN on IBM BG/P: TAU - Mean time profiling of routines . . . 21

Figure 2.17 PFLOTRAN on Cray XT5: Percentage contribution of USER, MPI and MPI SYNC group routines to wall clock time . . . 22

Figure 2.18 PFLOTRAN on Cray XT5: Percentage contribution of user routines to wall clock time . . . 23

Figure 2.19 PFLOTRAN on Cray XT5: Percentage contribution of MPI routines to wall clock time. . . 23

Figure 2.20 PFLOTRAN on Cray XT5: Timings for user and MPI routines . . . 24

Figure 2.21 PFLOTRAN on Cray XT5: Percentage of theoretical peak performance. Peak performance of 1 core of Cray XT5 = 2.6 GHz∗4 ops/cycle = 10.4 GFlops/second. . . 24

Figure 2.22 PFLOTRAN on IBM BG/P:Breakdown of dominant user routines . . . . 25

(9)

Figure 2.24 PFLOTRAN on Cray XT5: Time spent in MatMult SeqBAIJ N routine by each process for a 8,184 core run. . . 26 Figure 2.25 PFLOTRAN on Cray XT5: Time spent in MatSolve SeqBAIJ N routine

by each process for a 8,184 core run. . . 27 Figure 2.26 PFLOTRAN on Cray XT5: Time spent in Rtotal routine by each process

for a 8,184 core run. . . 27 Figure 2.27 PFLOTRAN on Cray XT5: Time spent in synchronizing for MPI

Allre-duce by each process for a 8,184 core run. . . 28 Figure 2.28 PFLOTRAN on Cray XT5: Floating point operations (PAPI FP OPS)

executed by each process for a 8184 core run. . . 28 Figure 2.29 PFLOTRAN on Cray XT5: Total number of instructions (PAPI TOT

-INS) executed by each process for a 8184 core run. . . 29 Figure 2.30 PFLOTRAN on IBM BG/P: Time spent in Rtotal routine by each process

for a 8,184 core run. . . 30 Figure 2.31 PFLOTRAN on IBM BG/P: Time spent in Rmultiratesorption routine

by each process for a 8,184 core run. . . 30 Figure 2.32 PFLOTRAN on IBM BG/P: Time spent in Rkineticmineral routine by

each process for a 8,184 core run. . . 31 Figure 2.33 PFLOTRAN on IBM BG/P: Time spent in MatLUFactorNumeric

Se-qBAIJ N routine by each process for a 8,184 core run. . . 31 Figure 2.34 PFLOTRAN on IBM BG/P: Time spent in MPI Allreduce by each

pro-cess for a 8,184 core run. . . 32 Figure 2.35 Variation in run-time on Cray XT5 with the same simulation parameters. 34 Figure 2.36 PFLOTRAN on Cray XT5: 1 DoF problem - MPI Allreduce

synchro-nization time . . . 35

Figure 2.37 PFLOTRAN on IBM BG/P: 1 DoF problem - Total MPI Allreduce time 36

Figure 2.38 PFLOTRAN on Cray XT5: 1 DoF problem - Box-plot of MPI Allreduce timings (includes synchronization also) . . . 37 Figure 2.39 PFLOTRAN on IBM BG/P: 1 DoF problem - Box-plot of MPI Allreduce

timings (includes synchronization also) . . . 37 Figure 2.40 PFLOTRAN on Cray XT5 and IBM BG/P: 1 DoF problem - Total time

spent in MPI Allreduce . . . 38

Figure 3.1 CMB (direct method): MPI Allreduce comparison on Cray XT5 and IBM

BG/P . . . 43

Figure 3.2 Cray XT5: Custom MPI Allreduce benchmark at 16,380 cores - Wall

clock time . . . 44

clock time . . . 45

(10)

Figure 4.1 I/O software stack involved in high performance computing environments 48 Figure 4.2 Overview ofSpider center-wide Lustre file system at ORNL [10] . . . 50 Figure 4.3 PFLOTRAN: 270 million DoF - Default read access patterns . . . 51

Figure 4.4 PFLOTRAN on Cray XT5 and IBM BG/P: 270 million DoF problem

-Comparison of Initialization phase . . . 52

Figure 4.5 PFLOTRAN on Cray XT5 (quadcore): 270 million DoF - CrayPAT

pro-filing of MPI subroutines in Initialization phase . . . 55

Figure 4.6 PFLOTRAN on Cray XT4: 270 million DoF - CrayPAT profiling of MPI

subroutines in Initialization phase . . . 55 Figure 4.7 PFLOTRAN: 270 million DoF - Default write access pattern . . . 57

Figure 4.8 PFLOTRAN on Cray XT4: 25 million DoF - comparison between HDF5

collective and independent write performance . . . 58

Figure 4.9 PFLOTRAN on Cray XT5 and IBM BlueGene/P: 270 million DoF

-Comparison of HDF5 write performance . . . 59 Figure 4.10 PFLOTRAN: 270 million DoF - Comparison between Default and

Mod-ified read access pattern - 1 . . . 61 Figure 4.11 PFLOTRAN: 270 million DoF - Comparison between Default and

Mod-ified read access pattern - 2 . . . 63 Figure 4.12 PFLOTRAN on Cray XT5 (quadcore): 270 million DoF - Impact of read

group size on Initialization phase . . . 64 Figure 4.13 PFLOTRAN on Cray XT: 270 million DoF - Comparison of Initialization

phase . . . 64 Figure 4.14 PFLOTRAN on IBM BG/P: 270 million DoF - Initialization phase using

Independent and Collective I/O modes . . . 65 Figure 4.15 PFLOTRAN on Cray XT5 and IBM BlueGene/P: 270 million DoF

-Comparison of best implementations of Initialization phase . . . 66 Figure 4.16 PFLOTRAN: 270 million DoF - Modified write access pattern . . . 67 Figure 4.17 PFLOTRAN on Cray XT5(quadcore): 270 million DoF - Comparison of

default and improved HDF5 write performance . . . 67 Figure 4.18 PFLOTRAN on Cray XT5(quadcore): 270 million DoF - Impact of

(11)

Chapter 1

Introduction

High performance computing systems span a vast array of architectures, from erstwhile vector machines to the presently widespread distributed memory model systems. The top three

ma-chines on the current TOP500 list [13] are capable of performing more than 1 petaflop/second

(1012 floating point operations per second). These machines are composed of complex sub-systems and this complexity is bound to increase in coming years because of the advances in

processor, memory, disk and inter-connect technologies. The rapid advances in computational

power can be attributed to Moore’s law, which states that the number of components on an integrated circuit (IC) would double for every 18 months.

With the availability of powerful supercomputers, complex computational models were

con-structed to study important scientific phenomena that yield solutions to high-impact real world problems. As such, high performance computing systems have played a key role in helping

to solve grand challenge problems in critical scientific areas such as climate sciences,

astro-physics, fusion energy and biological sciences. By providing enormous computational power, supercomputers have established simulation as the third pillar of science along with theory and

experiment.

Low latency and high bandwidth communication and I/O sub-systems are necessary in building large scale systems that can offer sustainable performance to scientific applications.

Scalable programming models and lightweight software interfaces are pivotal in designing

com-putational models that can effectively utilize the current high performance computing systems. This thesis research focuses on the performance analysis and optimization of a large scale

scientific application, PFLOTRAN on current petascale architectures. PFLOTRAN is a highly

scalable groundwater simulation code that solves multi-phase groundwater flow and multicom-ponent reactive transport in three-dimensional porous media.

(12)

used in this study. Section 1.3 describes the the rest of the thesis organization.

1.1 Motivation

Current state-of-the-art computational models typically consists of a large code base and

in-teract with a wide variety of external software components such as mathematical libraries, I/O libraries and graph partitioning tools. Building scientific application models that can

effec-tively utilize the computing capabilities of existing and future supercomputers is a challenging

task. The evolution of high performance computing systems has been rapid over the last few decades. However, the different components of computer architecture have evolved at different

rates which introduced contention among various sub-systems. Performance analysis is the

process of evaluation and characterization of an application performance on a target architec-ture. This process can be very helpful in the development of computational models on high

performance computing systems. Performance analysis is a non-trivial task because it involves understanding the interactions between the various hardware components such as CPU,

mem-ory, communication and I/O subsystems and the software components involved in scientific

applications. Performance analysis is a critical component in scientific application development life cycle. In addition to helping to achieve effective utilization of current high performance

computing systems, the benefits of performance analysis to scientific computing teams can be

summarized as follows

• Allows computational scientists to characterize the impact of various design choices to

make informed decisions.

• Identify scalability challenges and use design insights to ameliorate such problems.

• Identify performance bottlenecks to direct optimization strategy.

• Performance optimization of real world scientific applications would enable us to solve

problems at larger scale and with better resolution thereby achieving better predictive capabilities.

• Understand the impact of computer architecture on application performance.

• Compare the performance of various architectures for target applications and provide

recommendations.

1.2 Computational Platforms

(13)

In the rest of the document, the words processor orsocket are used to describe the processor

chip whereas the wordsprocessor cores orcores are used to describe the multiple cores within a single chip.

1.2.1 Cray XT5 (JaguarPF)

At the time of this writing, the Cray XT5 (JaguarPF) system at Oak Ride National Laboratory

(ORNL) is the world’s fastest supercomputer [13]. Figure 1.1 shows the overall architecture of the system. It has a total of 224,256 processor cores with 300 terabytes(TB) of memory resulting

in a theoretical peak performance of 2.3 petaflop/s (PF/s). The 224,256 processor cores are arranged in 18,688 compute nodes. In addition to the compute nodes, there are dedicated login

and I/O nodes. Each compute node consists of two hex-core AMD Opteron 2435 “Istanbul”

processors running at 2.6 GHz, 16 GB memory and SeaStar 2+ routers that are connected in a 3D torus topology to provide very high bandwidth and low latency communication network.

Cray XT5 uses the Lustre file system to provide a total disk space of 5 petabytes(PB).

Figure 1.1: Architecture of Cray XT5 (JaguarPF)

1.2.2 IBM BlueGene/P (Intrepid)

The IBM BlueGene/P (BG/P, Intrepid) system at Argonne National Laboratory (ANL) is the

world’s ninth fastest supercomputer at the time of this writing in June 2010 [13]. Figure 1.2 shows the overall architecture of the system. Intrepid has a total of 163,840 processor cores

with 80 TB of memory resulting in a theoretical peak performance of 557.06 teraflop/s (TF/s).

(14)

in 40 racks. In addition to the compute nodes, there are dedicated login and I/O nodes. Each

compute node consists of a single quad-core PowerPC 450 processor running at 850MHz with 2 GB of shared memory. The compute nodes are connected by 3D torus and tree networks

providing very efficient communication sub-system. IBM BG/P uses General Parallel File

System (GPFS) to provide a total disk space of 3 petabytes(PB).

Figure 1.2: Architecture of IBM BlueGene/P (Intrepid)

1.3 Thesis Organization

Chapter 2 introduces the application PFLOTRAN and presents detailed performance results on Cray XT and IBM BG/P architectures. Section 2.1 describes the architecture of

PFLO-TRAN and benchmark problems used in this study. Section 2.2 describes the compute node architecture of Cray XT5 and IBM BG/P systems. Section 2.3 presents the scalability analysis

of PFLOTRAN on both architectures. Section 2.4 introduces the performance analysis tools

employed and identifies the dominant computational routines in PFLOTRAN. Section 2.5 dis-cusses the communication sub-systems of Cray XT5 and IBM BG/P and presents detailed

analysis of MPI Allreduce performance on both architectures.

In chapter 3, section 3.1 articulates the motivation for developing the custom MPI Allre-duce benchmark. Section 3.2 explains the various communication mechanisms designed for

(15)

mechanism for MPI Allreduce on Cray XT5.

In chapter 4, section 4.1 describes the file systems on Cray XT and IBM BG/P architectures. Sections 4.2 and 4.3 present the performance analysis of HDF5 read and write I/O access

patterns in PFLOTRAN respectively. Section 4.4 describes the I/O scalability bottlenecks of

PFLOTRAN and our optimization strategy for improving the I/O performance and present our results for Cray XT architecture.

Chapter 5 summarizes the major activities conducted, results achieved and conclusions that

(16)

Chapter 2

Performance Analysis

Chapter 2 introduces the application PFLOTRAN and presents detailed performance results on Cray XT and IBM BG/P architectures. Section 2.1 describes the architecture of

PFLO-TRAN and benchmark problems used in this study. Section 2.2 describes the compute node

architecture of Cray XT5 and IBM BG/P systems. Section 2.3 presents the scalability analysis of PFLOTRAN on both the architectures. Section 2.4 introduces the performance analysis

tools employed and identifies the dominant computational routines in PFLOTRAN. Section

2.5 discusses the communication sub-systems of Cray XT5 and IBM BG/P and presents de-tailed analysis of MPI Allreduce performance on both the architectures. Dede-tailed I/O analysis

is described in Chapter 4

2.1 Description of PFLOTRAN

The U.S Department of Energy (DOE) is interested in studying the effects of geologic

seques-tration of CO2 in deep reservoirs and migration of radionuclides in groundwater. Modeling of subsurface flow and reactive transport is necessary to understand these problems.

PFLOTRAN [26][32] is a highly scalable groundwater simulation code that solves

multi-phase groundwater flow and multicomponent reactive transport in three-dimensional porous media. It is written in Fortran 90 and has certain degree of abstraction through the use of

Fortran 90 features such as modules and derived data types. PFLOTRAN extensively uses

PETSc (Portable, Extensible Toolkit for Scientific Computation) [16, 17, 15] which is a widely used numerical software package for solving system of equations. PETSc is being developed at

Argonne National Laboratory (ANL) and provides powerful set of tools that can be used by

scientific applications written in C, C++ and Fortran. In addition to PETSc, PFLOTRAN uses MPI (Message Passing Interface)[8] for interprocess communication and parallel HDF5

(17)

(18)

PFLOTRAN uses domain-decomposition parallelism in which each processor is assigned a

sub-domain of the problem and a parallel solve is implemented over all processors. An inexact Newton method is employed inside of which the BiConjugate Gradient Stabilized (BiCGStab)

linear solver is used. A block-Jacobi preconditioner is used that applies an incomplete

LU-decomposition solver with zero fill-in (ILU(0)) on each block. PETSc provides the nonlinear and linear solvers, and preconditioners that are used in PFLOTRAN. PFLOTRAN uses the

PETSc Distributed Array (DA) framework to handle ghost point updates that arise at node

boundaries in domain decomposition. Various other functionality such as sparse matrix and vector data structures, basic performance logging framework and runtime options control in

PFLOTRAN are also provided by PETSc. The architecture diagram of PFLOTRAN is shown

in Figure 2.1

2.1.1 Benchmark Problems

In this section, we describe the PFLOTRAN benchmark problems used in our study. We

performed strong scaling performance studies with two test problems that are described below.

• 270 million DoF (Hanford 300): This benchmark problem is from a model of a hypothetical Uranium plume at the Hanford 300 Area in southeastern Washington state [27]. The

computational domain of the problem measures 1350*2500*20 meters (x,y,z) and utilizes

complex stratigraphy shown in Figure 2.2 mapped from the Hanford EarthVision database [2], with material properties provided by [12]. A 1350*2500*80 cell flow-only version of

the problem (270 million total degrees of freedom) is used.

• 1 billion DoF: This benchmark is a coupled flow and transport problem from the Hanford

300 Area. This problem consists of 850*1000*80 cells and has 15 chemical components. This accounts to a total of 68 million degrees of freedom in the flow solve and 1.02 billion

degrees of freedom in the transport solve.

2.2 Cray XT5 and IBM BG/P Compute Nodes

The Cray XT5 (JaguarPF) system at ORNL has a total of 18,688 compute nodes. Each XT5

compute node shown in Figure 2.3 has two AMD Opteron 2435 “Istanbul” processors linked

with dual HyperTransport connections. Each Opteron processor has 6 cores running at 2.6 GHz clock speed and directly attached to 8 GB of DDR2 800 MHz memory. Each core has

dedicated L1 cache of 64 KB and L2 cache of size 512 KB. All cores share the on-chip L3

(19)

Figure 2.2: PFLOTRAN: Hanford 300 benchmark problem

shared memory with a peak performance of 124.8 Gflop/s. JaguarPF supports MPI, OpenMP, SHMEM, and PGAS programming models. PGI, PathScale and Cray compiler environments

are available on the machine. Applications can be launched in various configurations depending on the programming model. The number of cores or NUMA sockets allocated to an application

can be controlled through the various options available to PBS (Portable Batch System).

The IBM BG/P (Intrepid) system at ANL has a total of 40,960 compute nodes arranged in 40 racks. Figure 2.4 depicts a BG/P compute node. Each compute node has four PowerPC

450 cores running at 850 MHz clock speed, with 2 GB of shared memory. Each compute core

has a double precision, dual pipe, floating point unit, typically called Double Hummer that can perform two fused multiply adds (FMA) per cycle giving it a capability to perform 4 floating

point operations per cycle. The peak performance of each core is 3.4 GFlop/s and for the

compute node it is equal to 13.6 GFlop/s. Each core has a private L1 cache of 32KB and a private L2 stream prefetching engine. All cores share the on-chip L3 cache of 8 MB. The cores

also share DDR-2 memory controllers that access 2 GB of DDR-2 memory. The Intrepid system

has IBM XL suite of compilers along with support for OpenMP within a compute node. Applications can be launched in 3 different modes on Intrepid. These are described below

• Symmetric Multiprocessor mode (SMP): In this mode, single MPI task is allocated to a

compute node. This task can spawn a maximum of four threads. All the compute node resources are used by the single MPI task.

• Dual Node mode (DUAL): In this mode, each compute node executes two MPI tasks

(20)

Figure 2.3: Cray XT5 compute node

(21)

• Virtual Node mode (VN): This mode allows four MPI tasks with one thread each to

run on the Compute Node. All tasks share the memory and network resources on the compute node. Optimizations in the system software allow tasks within a compute node

to communicate via shared memory.

2.3 Scalability Analysis

We primarily used the 1 billion DoF problem for computation and communication analysis.

The 270 million DoF problem is used for I/O analysis. In this section, we discuss the strong

scaling results with the 1 billion DoF problem on Cray XT5 and IBM BG/P. For the results presented in this chapter, we have set PFLOTRAN to run for 30 timesteps. Each time step

involves a coupled flow and transport time step execution. All file output is turned off to

avoid any interaction with the I/O sub-system. There is significant difference between flow and transport phases. When compared with a transport time step, a flow time step involves

less number of Newton (nonlinear) iterations (with each iteration involving a linear solve), but

performs significantly more number of BiCGSTAB iterations per each linear solve.

PFLOTRAN uses the PETSc logging framework to obtain basic profiling of the code. The

instrumentation overhead introduced by the PETSc logging framework is very minimal when

the-log summaryprofiling option is used. PETSc provides two other options,infoand-logtrace

that capture details about algorithms, data structures and tracing events. We used the log

-summary option for the results presented in this section.

The control flow of PFLOTRAN can be divided into 4 different phases, namely Initialization, Flow, Transport and Output phases. Figure 2.5 shows the execution model of PFLOTRAN

involving the 4 phases.

Figure 2.6 to Figure 2.9 show the scaling of the Initialization, Flow, Transport and combined Flow and Transport phases from 4,092 to 131,064 cores of Cray XT5 and IBM BG/P. For the

runs on Cray XT5, we used all six available cores per processor socket. In this scheme, each core is assigned a MPI task that can access approximately 1 GB of main memory. On IBM

BG/P, we had to use only two cores out of the available four cores per node because of the

large memory requirements for the 1 billion DoF problem. In this setup, each processor core is assigned a MPI task with 1 GB memory.

2.3.1 Observations

It can be seen from Figure 2.6 to Figure 2.9, PFLOTRAN exhibits different scaling behavior on

Cray XT5 and IBM BG/P. Observations from the scaling plots and the characteristics of the

(22)

Figure 2.5: PFLOTRAN Flow Chart

(23)

Figure 2.7: PFLOTRAN on Cray XT5 and IBM BlueGene/P: Comparison of 68 million DoF FLOW stage

(24)

Figure 2.9: PFLOTRAN on Cray XT5 and IBM BlueGene/P: 1 billion DoF problem, FLOW + TRANSPORT stages

(25)

Figure 2.11: PFLOTRAN on Cray XT5: Comparison of 1 billion DoF TRANSPORT stage

(26)

• Initialization and Output phases: These phases involve HDF5 I/O routines that read

the simulation initial conditions from an input file and write the simulation results to an output file. Since these operations are I/O bound, the I/O sub-system of the target

archi-tecture determines the performance of these phases. For the I/O intensive Initialization

phase (Figure 2.6), the IBM BG/P has far superior scaling behavior when compared to the Cray XT5. More detailed analysis of this phase including improvement on the Cray

XT5 is described in Chapter 4

• Flow phase: This phase of PFLOTRAN involves a large number of linear solves. Each

linear iteration is communication bound because it involves global reduction operations that require communication across processes. The flow stage is communication intensive

(95% of MPI Allreduce’s are encountered here). The communication sub-system plays a

key role in determining the performance of this phase. Flow phase does not scale beyond 16,380 cores on Cray XT5 and 65,532 cores on IBM BG/P. It is to be noted that at

higher processor counts, the computation to communication ratio is low for the 68 million

DoF Flow problem. For the communication intensive Flow phase, the IBM BG/P beats Cray XT5 around 17K cores (Figure 2.7) due to its superior collective communication

performance.

• Transport phase: This stage of PFLOTRAN is computation intensive because of the

nonlinear reactions involving 15 DoF per cell. This phase is computation intensive (90% of total flops). Its performance is dependent upon the peak performance of the processor

and the memory sub-system. For the computation intensive transport stage (Figure 2.8),

the Cray XT5 performs nearly twice as fast as the IBM BG/P because Cray XT5 has faster nodes (peak of 10.4 Gflop/sec for single core) compared to BG/P (peak of 3.4

GFlops/sec for a single core).

• Combined Flow and Transport phases: Figure 2.9 shows that the superior communication

performance on the IBM BG/P compensates for its slower computing power, while the computation advantage of Cray XT5 diminishes and the communication cost increases at

higher processor counts.

With the unique characteristics of its various phases, PFLOTRAN provides us an excellent

opportunity to study the impact of architecture on performance of a large scale application. Figure 2.10 to Figure 2.12 show the impact of using fewer cores per Cray XT5 socket. These runs

used only 3 cores per processor socket. In this setup, each MPI task can access approximately

(27)

largely due to the improved performance of MPI Allreduce and less pressure on the memory

sub-system.

Although -log summary option in PETSc gives a high level overview of the performance

of PFLOTRAN, it does not provide detailed information such as the hardware performance

metrics, call path structures, low-level subroutines called in different PETSc interfaces, and ex-clusive timings of individual routines. This motivated us to instrument PFLOTRAN along with

its associated libraries (PETSc, MPI and parallel HDF5) with different performance analysis

tools. These details are presented in the next section.

2.4 Profiling Tools

Performance analysis tools or profiling tools are software libraries that can collect detailed per-formance metrics of a computational application. Because of the inherent complexity involved

in high performance computer architectures and scientific applications, profiling tools act as

a valuable asset to computational scientists and are commonly used to better understand an application’s performance.

Profiling tools can provide data that could be used to identify potential performance

bottle-necks. They also help us better characterize the interaction between the application with the architecture. The ease of use of profiling tools depends upon the complexity of the application

as well as the tool chosen for instrumenting the application. Most profiling tools provide an

application programming interface (API) to directly interact with the tool and a graphical user interface (GUI) to visualize the large amounts of performance data. Some profiling tools can

instrument applications without making any source code changes. There are different ways of

instrumenting an application, these are described below.

• Compilers provide support for instrumenting the source code. For example, gcc provides

-pg flag that can be used at compile and link time.

• Binary: This process involves tools that can insert timing routines to an already built binary. CrayPAT [1] is an example of this class.

• Runtime: Involves invoking the tool at execution time. Examples of these include Valgrind

suite [14].

• Automatic source-level: This involves tools that automatically add timing routines to the

source code. TAU [35, 25, 22] belongs to this class.

(28)

the tools try to measure a function that is called a large number of times and time spent in each

instance of the function is very minute. In order to obtain more detailed, low-level performance metrics of PFLOTRAN we used different profiling tools to instrument PFLOTRAN on Cray

XT5 and IBM BG/P.

2.4.1 CrayPAT

Cray Performance Analysis Tool (CrayPAT) [1] is a binary instrumentation based profiling tool. It is developed by Cray and thus can only be used on Cray architectures. CrayPAT

can intercept routines from a wide range of libraries such as MPI, OpenMP, UPC, BLAS, LAPACK, HDF5 and many others and is relatively easy to use when compared to other source

level instrumentation tools such as TAU. It uses Performance API (PAPI) underneath to capture

hardware performance counter data.

We used CrayPAT to profile PFLOTRAN, PETSc and MPI libraries on the Cray XT5.

In order to minimize the profiling overhead, we have performed selective instrumentation of

PETSc and PFLOTRAN routines. We initially performed a sampling experiment to identify the most dominant routines and then use those routines to perform a tracing experiment. An

instrumented binary is built by using the pat build command. pat report generates text report

from the performance data that was dumped during run time. A wide choice of options are accepted by bothpat build and pat report. We have various options ofpat build and pat report

to capture various performance metrics on the Cray XT5.

CrayPAT provides various environment variables to control its run-time behavior. By de-fault, runtime summarization is enabled and the data collected is aggregated. The runtime

variable PAT RT HWPC specifies the hardware performance counter events monitored for a

program counter tracing experiment. The environment flag, PAT RT SYNC controls the re-porting behavior of MPI collective communication routines. By enabling this flag, CrayPAT

reports two separate timing groups for MPI collective routines, namely MPI and MPI SYNC.

MPI SYNC group represents the time spent in synchronization and the MPI group represents the time spent in the actual collective communication operations. CrayPAT achieves this

func-tionality by placing a MPI Barrier before each MPI collective call. It reports the time spent in

MPI Barrier as MPI SYNC time and the time spent in the MPI collective routine is reported as MPI group time. This functionality is helpful in identifying possible synchronization

bottle-necks in applications; however, this may incur an additional overhead in some applications due

to the explicit barrier call placed before each MPI collective function. CrayPAT provides a GUI called Apprentice to view the application performance data. Figure 2.13 depicts the Apprentice

(29)

shows the call path structure of routines in Flow and Transport phases of PFLOTRAN.

Figure 2.13: PFLOTRAN on Cray XT5: CrayPAT - Apprentice tool

2.4.2 TAU

Tuning and Analysis Utilities (TAU) is an automatic source level instrumentation tool. TAU [11]

is capable of gathering performance information through instrumentation of functions, methods, basic blocks, and statements. It is developed at the University of Oregon. We have used TAU

to collect performance metrics primarily on IBM BG/P, but also used it on Cray XT5. We

instrumented PFLOTRAN and its associated libraries (PETSc, MPI and parallel HDF5) using TAU on the IBM BG/P. In order to instrument with TAU, themakefileof PFLOTRAN has to be

altered to make use oftau compiler script. In addition to performing selective instrumentation,

TAU THROTTLE environment variable is used to minimize the overhead. We had to alter the TAU provided configuration script to build PETSc with selective instrumentation. The

following changes were made to the PFLOTRAN makefile

include /soft/apps/tau/tau-2.19/bgp/lib/Makefile.tau-mpi-pdt

FC = $(TAU_COMPILER) -optVerbose -optPreProcess -optKeepFiles bgxlf90_r FC_LINKER = $(TAU_COMPILER) mpixlf90_r

When compiling PFLOTRAN with TAU, we observed that TAU was unable to instrument

(30)

Figure 2.14: PFLOTRAN on Cray XT5: 1 billion DoF - call path structure of routines in PETSc

(31)

Database Toolkit (PDT) fails for some of the source files because it inserts variable

declara-tions after an executable statement (pointer assignment, if statements etc) in some subroutines. This can be seen in the intermediate files (*.pp.inst.F90) generated during compilation. Moving

the TAU variables above the executable statements and re-compilation fixes this problem and

instruments the source file. TAU provides a profile visualization tool, Paraprof. It provides graphical displays of all the performance analysis results, in aggregate and single

node/contex-t/thread forms. Figure 2.16 depicts a screenshot of mean time of routines in PFLOTRAN with

Paraprof.

Figure 2.16: PFLOTRAN on IBM BG/P: TAU - Mean time profiling of routines

2.4.3 Performance Results

In this section, we present detailed performance results of the 1 billion DoF problem of

PFLO-TRAN obtained with CrayPAT and TAU tools. Recall that for the results presented in this

chapter, we have set PFLOTRAN to run for 30 time steps and all file output is turned off to avoid any interaction with the I/O sub-system. The profiling overhead is very minimal (less

than 5% of runtime) for the results presented. Figure 2.17 shows the percentage of time spent

in different groups of routines from 4,092 to 32,760 cores of Cray XT5. The User group shows good scalability, whereas the time spent inMPI andMPI SYNC group increases with processor

count.

Figure 2.18 and Figure 2.19 depict a breakdown of the routines in theUser andMPI groups.

They show the percentage contribution of different User and MPI routines to wall clock time.

(32)

Figure 2.17: PFLOTRAN on Cray XT5: Percentage contribution of USER, MPI and MPI -SYNC group routines to wall clock time

the other two belong to PFLOTRAN. From the MPI routines (Figure 2.19), it is clear that MPI Allreduce is the most significant routine followed by MPI File open. A brief description

about the dominant PETSc, PFLOTRAN routines is given below.

• MatSolve SeqBAIJ N: This routine solves the system A x = b, given a factored matrix “A”

• MatLUFactorNumeric SeqBAIJ N: This routine performs numeric LU factorization of a matrix.

• MatMult SeqBAIJ N: This routine computes the matrix-vector product, y = A x.

• RTotal: This routine calculates the contribution of aqueous equilibrium complexation to the residual and Jacobian functions for Newton-Raphson.

Figure 2.20 shows the time spent in variousUser andMPI routines. MPI Allreduce sync is

the most dominant routine at all the processor counts. More detailed analysis of MPI Allreduce

performance in PFLOTRAN is described in section 2.5. There is an increase in the time spent in the MPI file routines after 16,380 cores. These file operations happen in the Initialization phase

of PFLOTRAN. More detailed I/O analysis is presented in Chapter 4. Figure 2.21 shows the

peak performance of the dominant computational routines on Cray XT5. The PETSc routine

MatLUFactorNumeric SeqBAIJ Ngives the highest peak performance of close to 18 %, followed

(33)

Figure 2.18: PFLOTRAN on Cray XT5: Percentage contribution of user routines to wall clock time

(34)

Figure 2.20: PFLOTRAN on Cray XT5: Timings for user and MPI routines

(35)

Figure 2.22 shows the break down of dominant user routines on IBM BG/P. Although the

same user routines that appear on Cray XT5 are also seen on IBM BG/P, the three most dominant routines are native PFLOTRAN routines. These routines involve logarithm and

exponentiation functions which are expensive.

Figure 2.22: PFLOTRAN on IBM BG/P:Breakdown of dominant user routines

Figure 2.23, Figure 2.24, Figure 2.25 and Figure 2.26 show the time spent in the most

dominant user routines by each process for a 8,184 processor core run on Cray XT5. It can be observed that a small percentage of processes spent very little time in these routines indicating

that they do very little work. Consequently, these same processes spent idle time waiting

at different synchronization points through the program execution. This idle time or load imbalance experienced by this small fraction of processors can be largely attributed to the

inactive cells contained in some regions of the problem being solved - the Columbia river portion

of the Hanford problem domain does not contain any porous media and PFLOTRAN uses inactive cells in these regions. This synchronization time is captured in the MPI Allreduce

calls. The time spent in synchronizing for MPI Allreduce by each process at 8,184 cores of

Cray XT5 is shown in Figure 2.27.

Measuring the number of floating point operations and instructions executed by each

pro-cess would help us in identifying the work load imbalance across the propro-cesses. Figure 2.28

and Figure 2.29 show the PAPI FP OPS and PAPI TOT INS hardware counter data for each process at 8,184 cores. Figure 2.28 indicates that there are five clusters of processes with similar

(36)

Figure 2.23: PFLOTRAN on Cray XT5: Time spent in MatLUFactorNumeric SeqBAIJ N routine by each process for a 8,184 core run.

(37)

Figure 2.25: PFLOTRAN on Cray XT5: Time spent in MatSolve SeqBAIJ N routine by each process for a 8,184 core run.

(38)

Figure 2.27: PFLOTRAN on Cray XT5: Time spent in synchronizing for MPI Allreduce by each process for a 8,184 core run.

(39)

Figure 2.29: PFLOTRAN on Cray XT5: Total number of instructions (PAPI TOT INS) exe-cuted by each process for a 8184 core run.

Figure 2.30 to Figure 2.34 show the time spent in the most dominant user routines by each process for a 8,184 processor core run on IBM BG/P. The time spent in synchronizing for

MPI Allreduce by each process at 8,184 cores of IBM BG/P is shown in Figure 2.34. From the

figures, it is clear that the same processes that are spending very little time in the dominant user routines have higher MPI Allreduce synchronization time. This pattern is similar to what

we have observed on Cray XT5.

2.5 Communication Analysis

2.5.1 Communication sub-systems of Cray XT5 and IBM BG/P

The IBM BG/P architecture has five networks that are used for performing various tasks on

the Intrepid system. We briefly describe the functionality of the various networks below. More detailed description is given in [5]

• Three-dimensional Torus: This network is used for general-purpose, point-to-point mes-sage passing and multicast operations between the IBM BG/P compute nodes. On the

IBM BG/P, a single compute node is connected to six nearest neighbor compute nodes.

The torus is constructed with point-to-point, serial links between routers embedded within the Blue Gene/P ASICs. The target hardware bandwidth for each torus link is 425 MB/s

in each direction of the link for a total of 5.1 GB/s bidirectional bandwidth per node.

(40)

com-Figure 2.30: PFLOTRAN on IBM BG/P: Time spent in Rtotal routine by each process for a 8,184 core run.

(41)

Figure 2.32: PFLOTRAN on IBM BG/P: Time spent in Rkineticmineral routine by each process for a 8,184 core run.

(42)

Figure 2.34: PFLOTRAN on IBM BG/P: Time spent in MPI Allreduce by each process for a 8,184 core run.

munication operations and to move application data from the I/O nodes to the compute nodes. Each compute and I/O node has three links to the global collective network at 850

MB/s per direction for a total of 5.1 GB/s bidirectional bandwidth per node. Collective

operations such as broadcast and reduction are performed on this network.

• Global Interrupt: This network based on asynchronous logic and is used for signaling

global interrupts and barriers.

• 10 gigabit Ethernet: This network consists of all I/O nodes and discrete nodes that are

connected to a standard 10 gigabit Ethernet switch. The compute nodes are not directly connected to this network.

• Control network has direct access to every compute and I/O nodes. This network is used for system boot and monitoring.

The MPI implementation on BG/P supports three different protocols, these are described

below.

• MPI short protocol is used for short messages.

• MPI eager protocol used for medium-sized messages. No negotiations are made on both ends before sending a message in this protocol.

(43)

The Cray XT5 uses a three dimensional torus network for communications between the

compute nodes. Each compute node has a 6-port Cray SeaStar2+ network interface controller that is connected to the processor with Hypertransport. Each SeaStar2+ port is capable of 9.6

GB/s peak bi-directional bandwidth and 6 GB/s of sustained bandwidth. There is no separate

network for connecting the compute nodes to the I/O nodes. The MPI implementation on Cray XT5 uses two device drivers for communications on the machine. A shared memory (SMP)

driver for communicating with processes that share a single compute node and the portal device

driver for communications across the nodes.

Cray’s implementation of the MPI standard is contained in Cray Message Passing Toolkit

(MPT). It is based on the MPICH2 implementation [8]. MPT uses two device drivers for

communications on the machine. A shared memory (SMP) driver for communicating with pro-cesses that share a single compute node and the portal device driver for communications across

the nodes. Portals is a low-level software interface for inter-node communication. Like many

message passing libraries, Portals uses eager protocol for sending short messages wherein short messages are sent with the assumption that the receiving process has the resources to store the

message, and the receiver is responsible for buffering the message if the matching receive has not

yet been posted. The data is copied the receives buffer if the matching receive has been posted. Otherwise the data is copied to an unexpected buffer and two entries (put start and put end

events) are generated in theunexpected event queue that track incoming unexpected messages. Evidently, the unexpected event queue and buffer can run out of resources causing

scalabil-ity issues. These issues can be mitigated by increasing the buffer size and maximum queue

length using MPICH UNEX BUFFER SIZE and MPICH PTL UNEX EVENTS environment variables respectively. Alternately, a flow control mechanism can prevent the unexpected event

queue from being exhausted in any situation, and may cause performance degradation. This

mechanism can be enabled by setting the environment variable MPICH PTL SEND CREDITS to “-1” Mills et al. [31] discuss the scalablity implications of MPT on real applications.

2.5.2 MPI Allreduce in PFLOTRAN

PFLOTRAN heavily uses the MPI subroutine, MPI Allreduce during its execution. In this

section, we discuss our observations corresponding to the performance of MPI Allreduce in PFLOTRAN on Cray XT5 and IBM BG/P. MPI Allreduce is a collective subroutine that

combines values from all the participating MPI processes and distribute the result back to all

of the processes. Since this is a collective operation all processes need to synchronize before participating in the call.Different MPI implementations use different algorithms to gather and

(44)

factors.

1. Performance of the inter-connect because it determines the latency and bandwidth avail-able between the compute nodes.

2. Topology in which the compute nodes are connected to one another and the mapping of

MPI tasks to compute nodes.

3. Number of participating MPI processes.

4. Size of the message that has to be distributed.

5. Time spent in the process of synchronization before the reduction operation.

Figure 2.35: Variation in run-time on Cray XT5 with the same simulation parameters.

Figure 2.35 shows the time spent in the Flow and Transport phase of PFLOTRAN at 65,536

cores of Cray XT5. Two different solvers were used for these runs. As can be seen from the

figure, there is significant variation in the runtime of Flow stage between the five trial runs. Recall that the Flow phase is communication intensive and that 95% of MPI Allreduce calls

happen in this phase. The observed variation in runtime of Flow phase can be attributed

to the variability of MPI Allreduce performance on Cray XT5. This shows that the state of the machine when the simulation is run also plays a role in determining the performance of

MPI Allreduce. We did not observe this phenomenon on the IBM BG/P.

(45)

Table 2.1: Distribution of MPI Allreduce calls into various message size bins at 8,184 cores of Cray XT5

Bin Count Call site

0B <Message size<16B 113,070 VecDot MPI, VecNorm MPI etc.,

16B <= Message size<256B 32,725 VecDotNorm2

4KB <= Message size <64KB 943 MatZeroRows MPIBAIJ, MatZeroRows

-MPIAIJ, MatAssemblyBegin MPIBAIJ,

MatAssemblyBegin MPIAIJ

are invoked to perform the dot product and norm calculations. The functions VecDot MPI

andVecNorm MPI perform a reduction operation on a single double variable and the function

VecDotNorm2performs a reduction operation on two double variables. It must be noted that the number of MPI Allreduce calls executed depends up on the number of time steps PFLOTRAN

is set to run.

Figure 2.36 shows the time spent by each process in synchronizing for MPI Allreduce at 8,184 cores of Cray XT5. It is clearly evident that a small percentage of processes spent more

time in MPI Allreduce synchronization when compared to other processes which indicates that

some processes arrive at the synchronization point much earlier than others. Similar pattern can be observed from Figure 2.37 for IBM BG/P. The reasons for this behavior of MPI Allreduce

has already been discussed in the previous section.

(46)

Figure 2.37: PFLOTRAN on IBM BG/P: 1 DoF problem - Total MPI Allreduce time

Box plots are commonly used in descriptive statistics to visualize distribution of data,

es-pecially to identify outliers in a distribution. Box plots use five number summaries to describe

the data, namely the sample minimum, the lower quartile (25th percentile), median. upper quartile (75th percentile) and the sample maximum. In our case, box plots would be

help-ful in identifying the outlier processes with respect to MPI Allreduce time. Figure 2.38 and Figure 2.39 show the box plot timings of total MPI Allreduce timings on Cray XT5 and IBM

BG/P. The bottom and top of the box are respectively Q1 (25th percentile) and Q3 (75th

percentile). The mid-point in the box represents the median of the distribution. Whiskers are marked atQ1−(1.5∗IQR) andQ3 + (1.5∗IQR). IQR is interquartile range and is defined as (Q3−Q1) Points which cross the whiskers are marked with ’+’ symbols and represent outliers.

Figure 2.40 shows the total time spent in MPI Allreduce for a 30 time-step 1 billion DoF test

problem on Cray XT5 and IBM BG/P. This includes both the time spent in synchronization

and the actual reduction operation. For the same simulation parameters on both architectures, we observe that IBM BG/P gives a much better MPI Allreduce performance at 16,384 cores

when compared to Cray XT5. This is due to the presence of the “Global Collective” network

(47)

Figure 2.38: PFLOTRAN on Cray XT5: 1 DoF problem - Box-plot of MPI Allreduce timings (includes synchronization also)

(48)

(49)

Chapter 3

Custom MPI Allreduce Benchmark

This chapter is organized as follows: Section 3.1 provides background and motivation for de-veloping a custom MPI Allreduce benchmark. Section 3.2 explains the various communication

mechanisms developed for the benchmark. Finally, section 3.3 compares the benchmark

perfor-mance on Cray XT5 and IBM BG/P architectures. It also provides recommendations on the optimal communication mechanism for MPI Allreduce on Cray XT5.

3.1 Motivation

In the previous chapter, we discussed the differences in the communication sub-systems between the Cray XT5 and IBM BG/P architectures. Alam et al. [20] discussed the Intel MPI

bench-mark (IMB) [6] results on IBM BG/P and Cray XT4. For the IMB MPI Allreduce benchbench-mark,

they measured latency with varying message sizes and latency for a fixed message size for a range of processor counts. They concluded that both the architectures show good scalability in

terms of message sizes and processor counts with IBM BG/P performing exceptionally well in terms of scalability. Alam et al. [21] also discuss the impact of Message Passing Toolkit (MPT)

on the performance of IMB on Cray XT4. Characterizing the performance of MPI Barrier

would be helpful in understanding the performance of MPI Allreduce because MPI Allreduce involves a synchronization operation. Worley et al. [18] measured the performance of MPI

-Barrier on Cray XT4/5 and IBM BG/P. They concluded that the barrier performance on IBM

BG/P is superior to that of Cray XT4/5. They also observed small performance preference for power-of-two processor counts and significant performance variability on Cray XT4/5.

The aforementioned studies were performed upto a maximum of 16,384 processor cores and

included relatively fewer MPI Allreduce calls within a single program instance. Hence, we developed a custom MPI Allreduce benchmark to investigate performance of programs that

(50)

believe that this study facilitates accurate assessment of performance impact of architecture on

real scientific applications like PFLOTRAN.

3.2 Benchmark Description

The custom MPI Allreduce benchmark (CMB) is written in C programming language and uses

both MPI and OpenMP (shared-memory programming interface) [9] programming models. It performs a simple dot product operation on a vector of doubles that is distributed across the

processor cores. Each process computes local dot product on its allocated vector and then all processes participate in an MPI Allreduce call with summation operation to compute the global

dot product. The global dot product operation is repeated 150,000 times in a single program

instance, this is equivalent to performing the MPI Allreduce 150,000 times.

In order to have sufficient computation between each reduction operation, we implemented

a loop over the local dot product computation. The number of times this loop is iterated is

user configurable and specified as a command line argument to the benchmark. Similarly the vector size and number of MPI Allreduce operations (global dot product iterations) can also

be set through command line arguments. Each process is allocated a constant vector of length

61,050 elements, which makes the total vector size approximately equal to 1 billion elements when 16,380 processor cores are used.

We have chosen 150,000 MPI Allreduce calls because it approximately represents the total

number of MPI Allreduce calls in PFLOTRAN for the 1 billion DoF test problem at 16,380 processor cores. Unlike the test problem of PFLOTRAN, this benchmark does not have any

load imbalance in its computation because each process is allocated a constant vector size

of 61,050 elements. This implies that any MPI synchronization time observed is due to the inherent synchronization time on the target architecture and not due to computational load

imbalance in the benchmark. Several methods of performing MPI Allreduce were implemented

in the benchmark. These are described below.

• Direct: In this method, all MPI processes participate in the MPI Allreduce operation.

MPI COMM WORLD is used as the communicator handle in MPI Allreduce call. This is the base MPI Allreduce implementation available on the target machines and it is used

without any modifications.

• Async: In this method, only one MPI process per compute node participates in the

MPI Allreduce operation. A MPI sub-communicator is created consisting of only the

(51)

-global communicator, MPI COMM WORLD. The root process on each node gathers the

local dot product of the rest of the processes within its node by using asynchronous point-to-point receive (MPI Irecv) operation. After completing the MPI Allreduce operation

with the process belonging to the sub-communicator, the root process in each node sends

back the global dot product result to the rest of the processes in their respective nodes using an asynchronous point-to-point send (MPI Isend) operation. Asynchronous MPI

routines are known as non-blocking message passing communications because these

oper-ations return without waiting for the communication to finish on the other end. So the operations following the communication call can proceed even though the message

deliv-ery/receipt may not have been completed. One can wait for a non-blocking operation

to finish using MPI Wait or MPI Waitall function so as to avoid corrupting the message buffer or accessing a non-existing buffer.

• Sub-comms: This method involves creation of two MPI sub-communicators namely,

intra-node and inter-node. The intra-node sub-communicator is a collection of all

pro-cesses within a node, while inter-node sub-communicator is a collection of all the root processes among all compute nodes. A MPI Reduce operation within a node involving

the intra-node communicator computes the dot product among the processes within a

node. The result is only stored at the root process because the computed result does not yet represent the global dot product. All the root processes among the compute nodes

participate in MPI Allreduce operation involving the inter-node communicator to

com-pute the global dot product. This is followed by a MPI Bcast operation involving the intra-node communicator to propagate the final result to the rest of the processes in each

node.

• Hybrid MPI-OpenMP: This method involves using both MPI and OpenMP

function-ality to compute the vector dot product. A single MPI task is requested per compute node which spawns 12 OpenMP threads corresponding to 12 cores available on each node. The

local dot product in a node is computed by using the OpenMP directiveparallel for with

thereduction clause. The MPI task in each node participates in MPI Allreduce involving the global communicator to compute the global dot product.

3.3 Performance Results

The Cray XT5 system used for the experiments described in this section has 6 cores per

proces-sor socket, with two such sockets per XT5 node. We used both CrayPAT and our custom timing

(52)

Table 3.1: Performance of CMB on Cray XT5 with PAT RT MPI SYNC = 1

Cray XT5 cores MPI Allreduce MPI Allreduce (sync) Total MPI Allreduce

16,380 58.18 77.59 135.76

32,760 51.32 74.87 126.2

65,532 156.5 152.33 308.83

Table 3.2: Performance of CMB on Cray XT5 with PAT RT MPI SYNC = 0

Cray XT5 cores MPI Allreduce

16,380 74.33

32,760 86.24

65,532 117.88

on CMB performance withdirect method. Table 3.1 shows the time spent in MPI Allreduce at different processor counts on Cray XT5. These runs were executed with CrayPAT flag PAT

-RT MPI SYNC set to 1. Recall that setting this flag would capture the synchronization time of

MPI collective routines by explicitly placing an MPI Barrier before each MPI collective func-tion. Table 3.2 shows the timings of the same runs with PAT RT MPI SYNC set to 0. From

tables 3.1 and 3.2, it is clearly evident that CrayPAT introduced an additional overhead when profiling with PAT RT MPI SYNC enabled for this benchmark. For runs below 16,380

pro-cessor count we did not observe any overhead in using the PAT RT MPI SYNC environment

variable.

Table 3.3 shows the timings of CMB with direct method on IBM BG/P. We could have

used TAU for profiling the benchmark but we wanted to explicitly capture the MPI Allreduce

synchronization time and since TAU does not provide this functionality, we had to use our custom timing routines. We placed a MPI Barrier before MPI Allreduce to explicitly capture

the MPI Allreduce synchronization time. We did not observe any significant overhead with the

introduction of the MPI Barrier function in our profiling.

From Figure 3.1, it can be observed that IBM BG/P outperforms Cray XT5 by a factor

of 7-10 in MPI Allreduce performance. It can be observed from the results on IBM BG/P

that the performance of MPI Allreduce remains almost constant from 16,380 to 98,304 cores which is quite remarkable for a collective communication routine. This is largely due to the

availability of a special tree network on the IBM BG/P that is used for collective operations

(53)

Table 3.3: Custom MPI Allreduce benchmark performance on IBM BG/P

BG/P cores MPI Allreduce MPI Allreduce (sync) Total MPI Allreduce

16,380 1.61 9.67 11.28

32,760 1.61 9.74 11.35

65,532 1.68 9.64 11.32

98,304 1.75 8.20 9.95

operations.

To avoid skewing of performance results at higher processor counts, we ran the benchmark

without CrayPAT profiling. We could have used CrayPAT with PAT RT MPI SYNC set to 0, but we used our custom timing routines without any MPI Barrier call. The remaining results

presented in this chapter do not include any explicit barrier operations.

Figure 3.1: CMB (direct method): MPI Allreduce comparison on Cray XT5 and IBM BG/P

Since the performance of MPI Allreduce on Cray XT5 was less than ideal when compared with IBM BG/P, it prompted us to develop and experiment with different methods of

imple-menting MPI Allreduce on Cray XT5. Figure 3.2 to Figure 3.6 show the wall clock time of

running CMB with various methods at different processor counts on Cray XT5. These figures show that the OpenMP-MPI hybrid model results in an improvement of 20-40% over other

implementations. Due to the observed performance variability on Cray XT5, we conducted

multiple runs at each processor count.

Further investigation indicated that the improvement in the hybrid OpenMP-MPI model

(54)

Figure 3.2: Cray XT5: Custom MPI Allreduce benchmark at 16,380 cores - Wall clock time

(55)

Figure 3.5: Cray XT5: Custom MPI Allreduce benchmark at 98,292 cores - Wall clock time

(56)

Although the Async and Sub-comms approaches involve performing MPI Allreduce operation

with only one MPI process per node, these implementations require creation of MPI sub-communicators which is an expensive operation. In the hybrid method, there was no need

of creating a sub-communicator because each node is assigned a single MPI task and so the

global communicator can be used. Also, the OpenMP compiler in PGI was able to vectorize the dot product loop whereas the PGI compiler could not vectorize that loop with MPI only

implementations. We derive our conclusions regarding vectorization by enabling the-MinfoPGI

flag during compilation. When this flag is enabled, the compiler prints out information regarding the various optimizations performed on the source code. When compiled with OpenMP, the

following information is printed out with -Minfo. Line 331 in the source file is the OpenMP

parallel for dot product loop. It can be seen that in addition to generating SSE (Streaming SIMD Extensions) instructions for the loop, it also produced alternate variations of the loop

and generated prefetch instructions.

line 331, Parallel loop activated with static block-cyclic schedule

Generated 3 alternate loops for the loop

Generated vector sse code for the loop

Generated 2 prefetch instructions for the loop

When compiled without using any OpenMP directives, the following information is printed

out. Line 359 is the dot productfor loop

line 359, Loop not vectorized: data dependency

Loop unrolled 2 times

The PGI compiler is not able to vectorize the loop because it assumes that the two vectors which

are accessed through pointers in the code could overlap with one another. This phenomenon is known as pointer aliasing and is the default behavior of the PGI compiler. In our code, both

the pointers reference to unique memory locations and so we could safely instruct the compiler

to vectorize the loop. PGI compiler provides -Msafeptr flag that can used during compilation to specify that it is safe to assume that the pointers in the source code do not overlap. When

compiled with-Msafeptr enabled, the following is printed out

line 359, Generated 3 alternate loops for the loop

Generated vector sse code for the loop

Generated 2 prefetch instructions for the loop

The same result can also be achieved by declaring the pointers with therestrict C keyword and

(57)

Table 3.4: Cray XT5: Comparison of MPI Allreduce performance in CMB

Cray XT5 cores direct - trial 1 direct - trial-2 hybrid - trial 1 hybrid - trial 2

16,380 59.76 55.15 25.39 25.87

32,760 75.81 85.58 57.67 60.43

65,532 85.10 86.17 38.20 34.99

98,304 108.84 98.05 34.91 76.18

131,064 227.79 268.46 219.14 147.63

Table 3.4 shows the time spent in MPI Allreduce when direct and hybrid methods are

used. In all the cases, the hybrid method outperforms the direct method. Even though the

hybrid method of CMB results in better performance on Cray XT5, it is still far inferior to the performance of MPI Allreduce on IBM BG/P.

Large scale scientific applications can take advantage of these performance gains by adopting

the hybrid MPI-OpenMP model. This usually involves significant development work. However, with the current trend of increasing number of cores per processor socket the hybrid model

may be worth exploring. We were not not able to incorporate the approaches discussed in this

chapter into PFLOTRAN because the current PETSc does not support OpenMP. Recently Oral et al. [19] have investigated the impact of OS jitter on MPI Allreduce performance on

quadcore per socket Cray XT5 system. They aggregated the OS noise sources onto a single

core for each node and then executed the scientific application on the remaining cores in each node. Their results indicate an improvement of over two orders of magnitude in MPI Allreduce