
A Parallel Software Infrastructure for Dynamic Block-Irregular Scientific Calculations

Scott R. Kohn

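[Cover figure: layered overview of the software infrastructure. Layer labels: Message Passing Layer, Implementation Abstractions, LPARX, Adaptive Mesh API, Particle API, and the applications LDA, AMG, SPH3D, and MD. Side labels: Machine, Programming Language (C, Fortran, C++, Cobol), API, Application, and Computational Scientist.]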


A Parallel Software Infrastructure for Dynamic Block-Irregular Scientific Calculations

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in the Department of Computer Science and Engineering

by Scott R. Kohn

Committee in charge:

Professor Scott B. Baden, Chair
Professor Francine D. Berman
Professor William G. Griswold
Professor Keith Marzullo
Professor Maria Elizabeth G. Ong
Professor John H. Weare


Copyright Scott R. Kohn, 1995


The dissertation of Scott R. Kohn is approved, and it is acceptable in quality and form for publication on microfilm:

University of California, San Diego 1995


TABLE OF CONTENTS

Signature Page
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita and Publications
Abstract

1 Introduction
  1.1 Parallel Scientific Computation
  1.2 Dynamic Block-Irregular Calculations
  1.3 A Parallel Software Infrastructure
    1.3.1 LPARX
    1.3.2 Implementation Abstractions
    1.3.3 Adaptive Mesh API
    1.3.4 Particle API
  1.4 Organization of the Dissertation

2 Parallelization Abstractions
  2.1 Introduction
  2.2 The LPARX Abstractions
    2.2.1 Philosophy
    2.2.2 Data Types
    2.2.3 Coarse-Grain Data Parallel Computation
    2.2.4 The Region Calculus
    2.2.5 Data Motion
    2.2.6 LPARX Implementation
    2.2.7 Summary
  2.3 LPARX Programming Examples
    2.3.1 Jacobi Relaxation
    2.3.2 Decomposing the Problem Domain
    2.3.3 Parallel Computation
    2.3.4 Communicating Boundary Values
    2.3.5 Dynamic and Irregular Computations
  2.4 Related Work
    2.4.1 Structural Abstraction
    2.4.2 Parallel Languages
    2.4.3 Run-Time Support Libraries
  2.5 Analysis and Discussion
    2.5.1 Structural Abstraction
    2.5.2 Limitations of the Abstractions
    2.5.3 Shared Memory
    2.5.4 Coarse-Grain Data Parallelism
    2.5.5 Language Interoperability
    2.5.6 Communication Model
    2.5.7 Future Work

3 Implementation Methodology
  3.1 Introduction
    3.1.1 Motivation
    3.1.2 Related Work
  3.2 Implementation Abstractions
    3.2.1 Message Passing Layer
    3.2.2 Asynchronous Message Streams
    3.2.3 Distributed Parallel Objects
    3.2.4 Communication Example
  3.3 Implementation and Performance
    3.3.1 Interrupts versus Polling
    3.3.2 DPO and AMS Overheads
    3.3.3 Application Performance
  3.4 Analysis and Discussion
    3.4.1 Flexibility
    3.4.2 Portability
    3.4.3 Implementation Mistakes

4 Adaptive Mesh Applications
  4.1 Introduction
    4.1.1 Motivation
    4.1.2 Related Work
  4.2 Structured Adaptive Mesh Algorithms
  4.3 Adaptive Mesh API
    4.3.1 Software Infrastructure Overview
    4.3.2 Data Structures
    4.3.3 Error Estimation
    4.3.4 Grid Generation
    4.3.5 Load Balancing and Processor Assignment
    4.3.6 Numerical Computation
    4.3.7 Communication
  4.4 Adaptive Eigensolvers in Materials Design
    4.4.1 A Model Problem
    4.4.2 Adaptive Framework
    4.4.3 Eigenvalue Algorithm
    4.4.4 Multigrid
    4.4.5 Finite Difference Discretizations
    4.4.6 Computational Results
  4.5 Performance Analysis
    4.5.1 Performance Comparison
    4.5.2 Execution Time Analysis
    4.5.3 Uniform Grid Patches
  4.6 Analysis and Discussion
    4.6.1 Parallelization Requirements
    4.6.2 Future Research Directions

5 Particle Calculations
  5.1 Introduction
    5.1.1 Motivation
    5.1.2 Related Work
  5.2 Application Programmer Interface
    5.2.1 Balancing Non-Uniform Workloads
    5.2.2 Caching Off-Processor Data
    5.2.3 Writing Back Particle Information
    5.2.4 Repatriating Particles
    5.2.5 Implementation Details
  5.3 Smoothed Particle Hydrodynamics
    5.3.1 Numerical Background
    5.3.2 Performance Comparison
    5.3.3 Execution Time Analysis
    5.3.4 Exploiting Force Law Symmetry
    5.3.5 Communication Optimizations
  5.4 Analysis and Discussion
    5.4.1 Parallelization Requirements
    5.4.2 Unstructured Partitionings
    5.4.3 Future Research Directions

6 Conclusions
  6.1 Research Contributions
  6.2 Outstanding Research Issues
    6.2.1 Implementation Strategies for APIs
    6.2.2 Language Interoperability
  6.3 The Scientific Computing Community

Appendix A: Machine Characteristics
Bibliography


LIST OF FIGURES

1.1 Current design trends in parallel architecture favor machines that resemble tightly coupled networks of workstations
1.2 An overview of our parallel software infrastructure
2.1 The LPARX layer of our software infrastructure provides parallelization mechanisms on which we build application-specific APIs
2.2 LPARX applications logically consist of three components: partitioning routines, LPARX code, and serial numerical kernels
2.3 The XArray of Grids structure provides a common framework for implementing various block-irregular decompositions of data
2.4 Examples of LPARX's region calculus operations
2.5 The computational domain for a simple finite difference problem
2.6 The main routine for the parallel Jacobi application
2.7 The relaxation routine for the parallel Jacobi application
2.8 Subroutine FillPatch manages all interprocessor communication
3.1 The LPARX run-time system is built on a message passing library, Asynchronous Message Streams, and Distributed Parallel Objects
3.2 LPARX programs are modeled as a collection of objects (Grids) with asynchronous and unpredictable communication patterns
3.3 Asynchronous communication facilities of the AMS layer
3.4 An example of AMS's message stream abstractions
3.5 Primary and secondary objects in the DPO model
3.6 LPARX function XAlloc supplies a Region and a processor assignment when creating a Grid
3.7 Each LPARX Grid is a DPO object
3.8 Coarse-grain execution in DPO employs the owner-computes rule
3.9 FillPatch will be used to illustrate how the various implementation layers interact in interprocessor communication
3.10 A time-line view of the transmission of data to another processor
3.11 A time-line view of the reception of data from another processor
4.1 The adaptive mesh API provides application-specific facilities for structured adaptive mesh methods
4.2 A comparison of unstructured and structured adaptive mesh methods
4.3 Structured adaptive mesh methods represent the numerical solution to a partial differential equation using a hierarchy of grid levels
4.4 A sample 2d structured adaptive mesh hierarchy for a materials design problem
4.5 Organization of the structured adaptive mesh API library
4.6 A composite grid is represented using a Grid, an IrregularGrid, and a CompositeGrid
4.7 Error estimation and grid generation
4.8 Grid generation using the signature algorithm
4.9 Two grid generation strategies for uniform refinement regions
4.10 A simple load balancing algorithm for grid patches
4.11 An improved load balancing strategy
4.12 Coarse-grain numerical computation over the individual Grids within an IrregularGrid
4.13 A comparison of coarse-grain and fine-grain data parallel execution
4.14 Intralevel communication between grids at the same level
4.15 Interlevel communication between grids at different levels
4.16 Materials design seeks to understand the chemical properties of molecules such as this C20H20 ring
4.17 Outline of the adaptive eigenvalue solver
4.18 An iterative multigrid-based eigenvalue algorithm
4.19 The Full Approximation Storage (FAS) multigrid algorithm
4.20 Computational results for hydrogen
4.21 Computational results for the hydrogen molecular ion
4.22 Computational results were gathered for this 3d synthetic eigenvalue problem
4.23 Adaptive eigenvalue solver execution times
4.24 A level-by-level accounting of the execution time for the eigenvalue algorithm
4.25 Execution time breakdown on the Intel Paragon and IBM SP2
4.26 These graphs illustrate the performance overheads of uniform grid patches as compared to non-uniform patches
5.1 Our particle API provides computational scientists with high-level facilities targeted towards particle applications
5.2 A framework for a generic particle calculation
5.3 Snapshots of a 2d vortex dynamics application with a non-uniform workload distribution
5.4 A parallelized version of the generic particle code
5.5 An irregular decomposition of the computational domain using the XArray
5.6 API function BalanceWorkloads redistributes computational effort across the processors
5.7 FetchParticles locally caches copies of off-processor particle information needed for particle interactions
5.8 WriteBack updates force information for particles owned by other processors
5.9 The API definition for C++ class ChainMesh
5.10 Application C++ code to compute local interactions
5.11 Customizations for C++ class ParticleList
5.12 Our SPH3D application simulates the evolution of a 3d disk galaxy
5.13 SPH3D execution times on a Cray C-90, Intel Paragon, IBM SP2, and an Alpha workstation farm running PVM
5.14 Execution time summary for one SPH3D timestep on the Intel Paragon and the IBM SP2
5.15 A comparison of the SPH3D code with a restricted version that does not fully exploit force law symmetry
5.16 A comparison of the SPH3D code to a "naive" implementation that does not attempt to minimize interprocessor communication
5.17 A comparison of structured and unstructured partitions
A.1 Alpha workstation cluster message passing performance
A.2 IBM SP2 message passing performance
A.3 Intel Paragon message passing performance


LIST OF TABLES

2.1 A brief description of the four LPARX data types: Point, Region, Grid, and XArray
2.2 A summary of LPARX operations
3.1 A summary of the facilities provided by DPO, AMS, and MP++
3.2 A summary of the asynchronous communication facilities provided by the Asynchronous Message Stream layer
3.3 A summary of the object management mechanisms defined by the Distributed Parallel Objects layer
3.4 The implementation of communication between Grids depends on whether they are primary or secondary objects
3.5 Message length and memory overheads for AMS, DPO, and LPARX
3.6 LPARX overheads for a 3d Jacobi application
4.1 A breakdown of the eleven thousand lines of code that constitute the adaptive mesh API library
4.2 Descriptions of Grid, IrregularGrid, and CompositeGrid
4.3 Unknowns and mesh spacing for the adaptive mesh hierarchy used to solve the eigenvalue problem
4.4 Software version numbers and compiler optimization flags for the structured adaptive mesh performance results
4.5 Adaptive eigenvalue solver execution times
4.6 Execution time breakdown on the Intel Paragon
4.7 Execution time breakdown on the IBM SP2
4.8 Average interprocessor communication volume
4.9 Uniform grid patches require additional memory resources as compared to non-uniform patches
5.1 A survey of the computational structure for various N-body approximation methods
5.2 Variables and functions of the smoothed particle hydrodynamics equations
5.3 Software version numbers and compiler optimization flags for the SPH3D computational results
5.4 SPH3D execution times on a Cray C-90, Intel Paragon, IBM SP2, and an Alpha workstation farm running PVM
5.5 Execution time breakdown of one SPH3D timestep on the Intel Paragon and the IBM SP2
5.6 Execution time summary for one SPH3D timestep on the Intel Paragon and the IBM SP2
5.7 A comparison of the SPH3D code with a restricted version that does not fully exploit force law symmetry
5.8 A comparison of the SPH3D code to a "naive" implementation that does not attempt to minimize interprocessor communication
A.1 Software version numbers and compiler optimization flags
A.2 A summary of machine characteristics


ACKNOWLEDGEMENTS

As one of those rare individuals destined for true greatness, this record of my thoughts and convictions will provide invaluable insight into budding genius. Think of it! A priceless historical document in the making!

-- Calvin, "Calvin and Hobbes"

Many people contribute to the completion of a dissertation, and I would like to thank everyone who has contributed to mine.

I have had the privilege and pleasure to work with Scott Baden for the last five years. I am indebted to him for his support and encouragement. I have enjoyed our numerous "lively discussions," which have greatly contributed to my work. I would also like to thank my committee members (Fran Berman, Bill Griswold, Keith Marzullo, Beth Ong, and John Weare) for offering criticisms and comments.

Special thanks go to Steve Fink and Val Donaldson. Their keen insights, thoughtful comments, and honest criticisms are the stuff of good science. Sharing a lab with them has been a pleasure that I will miss. I doubt that they know how much I have valued their input.

I appreciate the many useful suggestions from Greg Cook, Steve Fink, Chris Myers, and Charles Rendleman on how to improve LPARX and the adaptive mesh software. I also thank Eric Bylaska, Alan Edelman, Ryoichi Kawai, Beth Ong, and John Weare for numerous valuable discussions on numerical methods in materials design.

I would like to thank my family for their support, encouragement, and love. In particular, I thank my father for his sense of curiosity and my mother for trying to make me read Dr. Seuss when I only wanted to read science books. I also have to thank my two cats for putting things in perspective; they do not recognize the importance of writing a dissertation, and they have never hesitated to let me know that their needs (e.g. being fed on time) should come first and foremost.

Finally, I would like to dedicate this dissertation to my wife, Kristin. I find it difficult to express in words what I feel for her in my heart. I thank her for always being there when I needed her and for always reminding me what is most important in my life.

Generous financial support has been provided by a General Atomics fellowship, NSF contract ASC-9110793, and ONR contract N00014-93-1-0152. Access to the Cray C-90, IBM SP2, Intel Paragon, and DEC Alpha workstation farm has been provided by the San Diego Supercomputer Center (through a UCSD School of Engineering Block Grant) and the Cornell Theory Center.


VITA

1990 B.S. Electrical Engineering
     Additional major in Mathematics
     Additional major in Computer Science
     University of Wisconsin at Madison

1992 M.S. Computer Science
     University of California at San Diego

1995 Ph.D. Computer Science
     University of California at San Diego

PUBLICATIONS

Submitted for Publication

S. R. Kohn and S. B. Baden, A Parallel Software Infrastructure for Structured Adaptive Mesh Methods, submitted to Supercomputing '95.

Journals

S. R. Kohn and S. B. Baden, Irregular Coarse-Grain Data Parallelism Under LPARX, to appear, Journal of Scientific Programming.

S. B. Baden and S. R. Kohn, Portable Parallel Programming of Numerical Problems Under the LPAR System, Journal of Parallel and Distributed Computation, May, 1995.

Conferences

S. R. Kohn and S. B. Baden, The Parallelization of an Adaptive Multigrid Eigenvalue Solver with LPARX, Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February, 1995.

E. J. Bylaska, S. R. Kohn, S. B. Baden, A. Edelman, R. Kawai, M. E. Ong, and J. H. Weare, Scalable Parallel Numerical Methods and Software Tools for Material Design, Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, CA, February, 1995.

S. B. Baden, S. R. Kohn, and S. J. Fink, Programming with LPARX, Proceedings of the 1994 Intel Supercomputer User's Group, San Diego, CA, June, 1994.

S. R. Kohn and S. B. Baden, A Robust Parallel Programming Model for Dynamic Non-Uniform Scientific Computations, Proceedings of the 1994 Scalable High Performance Computing Conference, Knoxville, TN, May, 1994.

S. R. Kohn and S. B. Baden, An Implementation of the LPAR Parallel Programming Model for Scientific Computations, Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, Norfolk, VA, March, 1993.

S. B. Baden and S. R. Kohn, Lattice Parallelism: A Parallel Programming Model for Manipulating Non-Uniform Structured Scientific Data Structures, Proceedings of the Workshop on Languages, Compilers, and Run-Time Environments for Distributed Memory Multiprocessors, Boulder, CO, October, 1992.

S. B. Baden and S. R. Kohn, A Comparison of Load Balancing Strategies for Particle Methods Running on MIMD Multiprocessors, Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, Houston, TX, March, 1991.

Technical Reports

S. R. Kohn and S. B. Baden, Blobs: Visualization of Particle Methods on Multiprocessors, Tech. Rep. CS92-241, University of California, San Diego, May, 1992.

S. B. Baden and S. R. Kohn, The Reference Guide to GenMP -- The Generic Multiprocessor, Tech. Rep. CS92-243, University of California, San Diego, June, 1992.


ABSTRACT OF THE DISSERTATION

A Parallel Software Infrastructure for Dynamic Block-Irregular Scientific Calculations

by

Scott R. Kohn

Doctor of Philosophy in Computer Science

University of California, San Diego, 1995

Professor Scott B. Baden, Chair

Dear Sir or Madam, will you read my book? It took me years to write, will you take a look?

-- John Lennon and Paul McCartney, "Paperback Writer"

The accurate solution of many problems in science and engineering requires the resolution of unpredictable, localized physical phenomena. Such applications may involve the solution of complicated, time-dependent partial differential equations such as those in materials design, computational fluid dynamics, astrophysics, and molecular dynamics. The important feature of these numerical problems is that some portions of the computational domain require higher resolution, and thus more computational effort, than others.

Parallel supercomputers offer the power to solve many of these computationally intensive tasks; however, these applications are particularly challenging to implement on parallel architectures because they rely on dynamic, complicated, irregular structures with dynamic and irregular communication patterns. Current parallel software technology does not yet afford a solution, and new programming abstractions, along with the accompanying run-time support, are needed.

We have developed a parallel software infrastructure to simplify the implementation of dynamic, irregular, block-structured scientific computations on high-performance parallel supercomputers. Our software infrastructure provides computational scientists with high-level, domain-specific tools that hide low-level details of the parallel hardware. It is portable across a wide range of parallel architectures.

At the center of our infrastructure is the LPARX parallel programming system. LPARX introduces the concept of "structural abstraction," which enables applications to dynamically manipulate irregular data decompositions as language-level objects. LPARX provides a framework for creating decompositions that may be tailored to meet the needs of a particular application.

Building on the LPARX abstractions, we have developed application programmer interfaces (APIs) for two important classes of applications: structured adaptive mesh methods and particle calculations. These APIs enable scientists to concentrate on the mathematics and the physics of their application; APIs provide high-level software tools that hide underlying implementation details. Our parallel software infrastructure has enabled computational scientists to explore new approaches to solving a variety of problems, and it has reduced the development time of challenging numerical applications. Indeed, we have applied our structured adaptive mesh API to the adaptive solution of eigenvalue problems in materials design and our particle API to a 3d smoothed particle hydrodynamics application in astrophysics.


Chapter 1

Introduction

I realized that the purpose of writing is to inflate weak ideas, obscure poor reasoning, and inhibit clarity. With a little practice, writing can be an intimidating and impenetrable fog ... Academia, here I come!

-- Calvin, "Calvin and Hobbes"

1.1 Parallel Scientific Computation

Parallel supercomputers offer the power to solve many of the computationally intensive problems that arise in science and engineering. Unfortunately, this potential has only been partially realized due to the difficulty of implementing scientific applications on parallel platforms. To put it bluntly, today's parallel computers are hard to use, and parallel software technology does not yet afford the performance and ease of use that computational scientists have come to expect from sequential and vector supercomputers.

It would not be an overstatement to say that parallel software is in a state of crisis. The vast majority of scientific programmers find that current parallel software support is inadequate [53]. In fact, they are more likely to develop their own in-house software support rather than use existing products [113]. The most commonly used parallel programming paradigm today is message passing. Standardization efforts have resulted in a portable message passing library called MPI (Message Passing Interface) [105]. Unfortunately, programming with message passing is tedious, as the programmer must explicitly manage low-level details of data placement and interprocessor communication.
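To give a flavor of the bookkeeping that explicit data placement involves, the following is a minimal C++ sketch of a one-dimensional block distribution. It is only an illustration under stated assumptions: the names `BlockRange`, `block_range`, `owner_of`, and `local_index` are hypothetical, and the code is neither taken from MPI nor from the software described in this dissertation.

```cpp
#include <cassert>
#include <cstddef>

// Half-open index range [lo, hi) owned by one process (hypothetical helper type).
struct BlockRange {
    std::size_t lo;
    std::size_t hi;
};

// Split n elements over p processes as evenly as possible:
// the first (n % p) ranks each receive one extra element.
inline BlockRange block_range(std::size_t n, std::size_t p, std::size_t rank) {
    assert(p > 0 && rank < p);
    const std::size_t base = n / p;
    const std::size_t extra = n % p;
    const std::size_t lo = rank * base + (rank < extra ? rank : extra);
    const std::size_t len = base + (rank < extra ? 1 : 0);
    return BlockRange{lo, lo + len};
}

// Rank that owns global index i under the same distribution.
inline std::size_t owner_of(std::size_t n, std::size_t p, std::size_t i) {
    assert(p > 0 && i < n);
    const std::size_t base = n / p;
    const std::size_t extra = n % p;
    const std::size_t big = extra * (base + 1);  // elements held by the larger blocks
    if (i < big) return i / (base + 1);
    return extra + (i - big) / base;
}

// Translate a global index into an offset within its owner's local array.
inline std::size_t local_index(std::size_t n, std::size_t p, std::size_t i) {
    return i - block_range(n, p, owner_of(n, p, i)).lo;
}
```

For 10 elements on 4 processes, rank 1 owns the range [3, 6), and global index 7 maps to local offset 1 on rank 2. In a message passing program, every access to an element owned by another rank would additionally require a matching send and receive written by hand.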

Developments in High Performance Fortran¹ (HPF) [83] are promising; unfortunately, HPF will require improvements before it becomes a general purpose parallel language. For example, HPF does not adequately address dynamic and irregular problems [84], and these limitations have prompted a second HPF standardization effort. The HPF-2 committee is currently investigating enhancements to HPF, but it will be some time before we know what strategies will be effective and how difficult they will be to support in the compiler. Improvements in HPF for dynamic and irregular scientific applications will likely require new parallel programming abstractions and run-time support libraries.

Parallel computers are difficult to use because they require the explicit and low-level management of data locality. Current design trends in high-performance parallel architectures favor machines constructed with commodity components. More than anything else, today's parallel computers resemble tightly coupled networks of workstations (see Figure 1.1). The programmer, compiler, or run-time system must distribute data carefully because access to remote data (through the interconnection network) is typically several orders of magnitude more expensive than access to local data. Some parallel computers, such as the Intel Paragon and the IBM SP2, provide very little hardware support for managing data distributed across processor memories. Other machines, such as the Stanford FLASH [97] and the Wisconsin COW [122], contain hardware for the automatic caching of remote data. However, recent studies with these "distributed shared memory" machines [65, 109] indicate that such hardware caching mechanisms are inadequate for dynamic scientific applications. In fact, these studies conclude that efficient distributed shared memory applications require the same attention to data management and the same implementation techniques as message passing applications.

¹ High Performance Fortran is a data parallel Fortran language that is quickly becoming accepted

[Figure 1.1 diagram: five processor (P) and memory (M) pairs attached to an interconnection network.]

Figure 1.1: Current design trends in parallel architecture favor machines built with commodity components. Today's parallel computers resemble tightly coupled networks of workstations and typically contain a few tens to a few hundreds of powerful processing nodes connected by an interconnection network. Each processor P is tied to a local memory M, and remote data is accessed through the interconnection network. At any one time, several parallel applications share the machine, with a single application generally using a few tens of dedicated processors. Appendix A summarizes the machine characteristics for the parallel architectures used in this dissertation.

Another concern when writing parallel applications is portability. Parallel platforms obsolesce at an alarming rate, and portability is essential so that applications will run on the next generation of architectures. In just the two years spent developing our software infrastructure, four parallel computers have become obsolete (nCUBE nCUBE/2, Intel iPSC/860, Kendall Square Research KSR-1, and Thinking Machines CM-5), two manufacturers have declared bankruptcy (Kendall Square Research and Thinking Machines), and two manufacturers have entered the parallel scientific computing market (IBM and Silicon Graphics). This trend is likely to continue in the near future due to the rapidly changing microprocessor and interconnect technology used to build parallel machines.

The key to portability is hiding low-level, machine-dependent details. For sequential programs, this can be easily achieved through the use of a standard programming language such as Fortran. However, parallel programs typically contain a considerable amount of hardware-dependent code to manage data distribution and interprocessor communication. Such hardware dependencies hamper portability. To be portable, parallelization mechanisms must hide these low-level, architecture-dependent implementation details.

Implementing portable parallel programs without high-level software support is a difficult task. Computational scientists would rather address the mathematics and the physics of their problems than worry about efficient parallel implementation techniques. Low-level, machine-dependent details reduce portability and obscure the algorithms underlying an application. Appropriate software support is essential for developing architecture-independent, high-performance parallel scientific applications.

1.2 Dynamic Block-Irregular Calculations

My own interests are in using computers as God intended -- to do arithmetic.

-- Cleve Moler

Many scientific computations involve the study of dynamic, irregular, locally structured physical phenomena. Such applications may involve the solution of complicated, time-dependent partial differential equations such as those in materials design [37], computational fluid dynamics [22], or localized deformations in geophysical systems. Also included are particle methods in molecular dynamics [50], astrophysics [108], and vortex dynamics [12]. More recently, adaptive methods have been applied to the study of entire ecosystems through satellite imagery at multiple resolutions [135]. These applications are particularly challenging to implement on parallel computers owing to their dynamic, irregular decompositions of data.

Our research addresses the programming abstractions and the accompanying software support required for dynamic, irregular, block-structured scientific computations running on MIMD [72] parallel computers. The distinguishing characteristics of this class of problems are that (1) numerical work is non-uniformly distributed over space, (2) the workload distribution changes as the computation progresses, and (3) the workload exhibits a local structure. Such applications employ dynamic, irregular (but locally structured) meshes to represent the changing numerical computation. They spend considerably more effort in some portions of the problem space than in others. The distribution of computational effort is not known at compile-time, and the application must adapt to the evolving calculation at run-time. Numerical work tends to be localized in regions irregularly distributed across the problem domain. This localization property is especially important on multiprocessors, since we can exploit data locality to reduce interprocessor communication costs and improve parallel performance.

We focus on two important classes of dynamic, block-irregular applications:

structured adaptive mesh methods [22, 23, 94], and

particle methods based on link-cell techniques [12, 87].

Such applications can be difficult to implement without advanced software support because they rely on dynamic, complicated irregular array structures with irregular communication patterns². The programmer is burdened with the responsibility of managing dynamically changing data distributed across processor memories and orchestrating interprocessor communication and synchronization. Little information is available at compile-time to guide a parallel compiler because numerical workloads change in response to the dynamics of the particular problem being solved.

Current parallel programming languages provide little support for dynamic, block-irregular applications. Data parallel Fortran languages such as High Performance Fortran typically focus on regular, static problems such as dense linear algebra. HPF defines a set of built-in, uniform data decompositions specified through compile-time directives; however, it provides few mechanisms for dynamically changing irregular data. Extending HPF will require developments in (1) parallel programming abstractions and (2) run-time support libraries. New parallelization abstractions are needed because the current compile-time data distribution mechanisms are inadequate for dynamic problems. Such applications will also require sophisticated run-time support to manage changing data distributions and communication patterns.

A number of run-time support systems have already been developed, including CHAOS (formerly called PARTI) [60], multiblock PARTI [3], and Multipol [40]. Both CHAOS and multiblock PARTI have been used as run-time support for data parallel Fortran compilers. CHAOS has been very successful in addressing unstructured problems such as sparse linear algebra and finite elements [58]. Multiblock PARTI has been employed in the parallelization of applications with a small number of large, static blocks [45]; its support for dynamic block structured problems is unclear. The Multipol library provides a collection of distributed non-array data structures such as graphs, unstructured grids, hash tables, sets, trees, and queues. However, none of these systems directly address the dynamic, block-irregular problems that are the focus of our research³.

1.3 A Parallel Software Infrastructure

Solving a problem is similar to building a house. We must collect the right material, but collecting the material is not enough; a heap of stones is not yet a house. To construct the house or the solution, we must put together the parts and organize them into a purposeful whole.

-- George Polya

We have developed a parallel software infrastructure to simplify the implementation of dynamic, irregular, block-structured scientific computations on high-performance parallel supercomputers. Our software infrastructure has enabled computational scientists to explore new approaches to solving applied problems. It has reduced the development time of challenging numerical applications. Our infrastructure has been implemented as a C++ class library and consists of approximately thirty thousand lines of C++ and Fortran code⁴. It has been employed by researchers at the University of California at San Diego, George Mason University, Lawrence Livermore National Laboratories, Sandia National Laboratories, and the Cornell Theory Center for applications in gas dynamics [141], smoothed particle hydrodynamics, particle simulation studies [68], adaptive eigenvalue solvers in materials design [37, 94], genetics algorithms [81], adaptive multigrid methods in numerical relativity, and the dynamics of earthquake faults (see Section 6.3 for a complete list).

Our parallel software infrastructure addresses two goals of software support for scientific applications [112]:

it hides low-level details of the hardware, and

it provides high-level, efficient mechanisms that match the scientist's view of the computation.

The first goal is necessary for portability. Software that exposes too much of the underlying hardware will not run efficiently on parallel platforms with different hardware characteristics. Our software meets this goal through the use of high-level parallelization mechanisms that assume very little about the underlying hardware architecture. The second goal is necessary for ease-of-use and simplified code development. Our software infrastructure provides the programmer with high-level tools appropriate for the task at hand through domain-specific application programmer interfaces (APIs) built upon our parallelization mechanisms.

Figure 1.2 illustrates the organization of our parallel software infrastructure. At the very top of the infrastructure lie applications, and at the bottom lies a portable message passing layer. Each level provides more powerful (and also more specific) abstractions. The infrastructure consists of four primary components: (1) implementation support, (2) a set of parallelization abstractions called LPARX ("ell-par-eks"), (3) a structured adaptive mesh API, and (4) a particle API.

⁴ The current software distribution may be obtained through the World Wide Web at address

[Figure 1.2 diagram: layered infrastructure. From top to bottom: applications (LDA, AMG, SPH3D, MD); Adaptive Mesh API (Chapter 4) and Particle API (Chapter 5); LPARX (Chapter 2); Implementation Abstractions (Chapter 3); Message Passing Layer.]

Figure 1.2: This figure shows our parallel software infrastructure for dynamic, irregular, block-structured scientific computations. It consists of four primary components, each of which has been labeled with the chapter of this dissertation that describes that particular component. Higher levels of the infrastructure provide more powerful (and more specialized) abstractions. At the top lie applications (LDA, AMG, SPH3D, and MD) and at the bottom lies a portable message passing layer. See the text for a brief description of each component.

There are three advantages to a layered software infrastructure as compared to a single, monolithic application: portability, code reusability, and extensibility. Because applications do not directly rely on the message passing layer and instead employ the mechanisms provided by their application-specific APIs, low-level changes in the implementation do not directly affect the applications. For example, applications are completely shielded from low-level changes in the LPARX implementation. Code reuse is achieved because multiple application libraries share the same parallelization mechanisms. Optimizations in the LPARX implementation are realized by both particle and adaptive mesh applications. Finally, because the infrastructure provides tools, and not canned solutions, computational scientists can tailor and extend our abstractions to match their applications.

The following sections briefly describe the four main components of our parallel software infrastructure.

1.3.1 LPARX

At the center of the software infrastructure is the LPARX parallel programming system, which defines high-level, efficient mechanisms for data distribution, partitioning and mapping, parallel execution, and interprocessor communication. It provides a common set of parallelization facilities on which we built the particle and adaptive mesh APIs.

LPARX introduces the concept of "structural abstraction," which enables applications to dynamically manipulate irregular data decompositions. Instead of forcing the programmer to choose from a small set of predefined decompositions, LPARX provides a framework for creating decompositions that may be tailored to meet the needs of a particular application. To our knowledge, LPARX is the first and only system that efficiently supports arbitrary dynamic, user-defined, block-irregular data distribution on parallel architectures.

LPARX assumes only basic message passing support and is therefore portable to a variety of high-performance computing platforms. Our current implementation runs on the Cray C-90 (single processor), IBM SP2, Intel Paragon, single processor workstations (for code development and debugging), and networks of workstations connected via PVM [132].

1.3.2 Implementation Abstractions

At the very bottom of our software infrastructure is a portable message passing layer called MP++ (our own version of MPI [105]). To simplify the implementation [...] abstraction between LPARX and the message passing layer: Asynchronous Message Streams (AMS) and Distributed Parallel Objects (DPO). AMS and DPO provide support for parallel programs consisting of a relatively small number of large, complicated objects with asynchronous and unpredictable communication patterns. They build on ideas from the concurrent object oriented programming community.

AMS defines a "message stream" abstraction that greatly simplifies the communication of complicated data structures between processors. Its mechanisms combine ideas from asynchronous remote procedure calls [7, 110], Active Messages [138], and the C++ I/O stream library [106, 131]. DPO provides object oriented mechanisms for manipulating objects that are physically distributed across processor memories and is based on communicating object models from the distributed systems community [61].
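As a rough illustration of the message stream idea, the Python sketch below packs a variable-sized structure into a byte stream and unpacks it on the receiving side, in the spirit of stream insertion and extraction. The class name MessageStream and its methods are hypothetical and are not the AMS interface; the actual transmission of the bytes is left out.

```python
import struct


class MessageStream:
    """Hypothetical byte stream used to flatten and rebuild structured data."""

    def __init__(self, payload=b""):
        self._buf = bytearray(payload)
        self._pos = 0                      # read cursor

    def put_int(self, value):
        self._buf += struct.pack("<q", value)
        return self                        # allow chained insertions

    def put_doubles(self, values):
        self.put_int(len(values))          # length prefix for variable-size data
        self._buf += struct.pack(f"<{len(values)}d", *values)
        return self

    def get_int(self):
        (value,) = struct.unpack_from("<q", self._buf, self._pos)
        self._pos += 8
        return value

    def get_doubles(self):
        count = self.get_int()
        values = struct.unpack_from(f"<{count}d", self._buf, self._pos)
        self._pos += 8 * count
        return list(values)

    def to_bytes(self):
        return bytes(self._buf)


# Sender side: a cell id followed by a variable-length list of coordinates.
out = MessageStream().put_int(42).put_doubles([0.5, 1.5, 2.5])
wire = out.to_bytes()      # bytes that a message passing layer would transmit

# Receiver side: extract the fields in the same order they were inserted.
inp = MessageStream(wire)
cell_id = inp.get_int()
coords = inp.get_doubles()
```

The point of the sketch is that the sender and receiver agree only on the order of insertions and extractions, so the application never computes buffer offsets by hand.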

1.3.3 Adaptive Mesh API

Our adaptive mesh API defines specialized, high-level facilities tailored to structured multilevel adaptive mesh refinement applications [23, 94]. Such numerical methods dynamically refine the local representation of a problem in "interesting" portions of the computational domain, such as shock regions in computational fluid dynamics [22]. They are difficult to implement because refinement regions vary in size and location, resulting in complicated geometries and irregular communication patterns.

Computational scientists using our adaptive mesh API can concentrate on their numerical applications rather than being concerned with low-level implementation details. The API library, built upon the parallelization and communication abstractions of LPARX, provides mechanisms for automatic error estimation, grid generation, load balancing, and grid hierarchy management. All details associated with parallelism are completely hidden from the programmer.

We have used our software infrastructure to develop a parallel adaptive eigenvalue solver (LDA) and an adaptive multigrid solver (AMG) for problems arising in materials design [37, 94]. By exploiting adaptivity, we have reduced memory consumption and computation time by more than two orders of magnitude over an equivalent non-adaptive method. To our knowledge, this is the first time that structured adaptive mesh techniques have been used to solve eigenvalue problems in materials design.

1.3.4 Particle API

Our particle API provides computational scientists high-level tools that simplify the implementation of particle applications [12, 87] on parallel computers. Particle methods are difficult to parallelize because they require dynamic, irregular data decompositions to balance changing non-uniform workloads. Built on top of the LPARX mechanisms, our particle API defines facilities specifically tailored towards particle methods. The use of the LPARX abstractions enabled us to provide functionality and explore performance optimizations that would have been difficult had the library been implemented using only a primitive message passing layer. Using our software infrastructure, we have developed a 3d smoothed particle hydrodynamics [108] code (SPH3D) that simulates the evolution of galactic bodies in astrophysics, and we are currently developing a 3d molecular dynamics application (MD) to study fracture dynamics in solids [1].

1.4 Organization of the Dissertation

Sixty minutes of thinking of any kind is bound to lead to confusion and unhappiness.

-- James Thurber

This dissertation is organized into six chapters. Each of Chapters 2 through 5 covers a portion of the software infrastructure shown in Figure 1.2. Each chapter is self-contained, with its own introduction, motivation, related work, and analysis and conclusions. The discussion of LPARX in Chapter 2 is a starting point for all further chapters; otherwise, Chapter 3 (Implementation Mechanisms), Chapter 4 (Adaptive Mesh Applications), and Chapter 5 (Particle Calculations) may be read independently of the others. We conclude with the contributions of this work in Chapter 6. The parallel architectures used in this dissertation are described in Appendix A.


Chapter 2

Parallelization Abstractions

Fundamental definitions do not arise at the start but at the end of the exploration, because in order to define a thing you must know what it is and what it is good for.

-- Hans Freudenthal, "Developments in Mathematical Education"

If at first you do succeed -- try to hide your astonishment.

-- Harry F. Banks

2.1 Introduction

The LPARX parallel programming system [93, 95] provides portable facilities for the efficient implementation of dynamic, non-uniform scientific applications on MIMD architectures. Such applications are typically difficult to implement without sophisticated software support. The LPARX mechanisms hide low-level implementation details and provide powerful tools for data distribution, partitioning and mapping, parallel execution, and interprocessor communication. LPARX requires only basic message passing support and is therefore portable to a variety of high-performance computing platforms. Our current implementation runs on the Cray C-90 (single processor), IBM SP2, Intel Paragon, and networks of workstations connected via PVM [132]. LPARX applications may be developed and debugged on a single processor workstation.

[Figure 2.1 diagram: the layered infrastructure of Figure 1.2 (applications LDA, AMG, SPH3D, and MD; Adaptive Mesh API and Particle API; LPARX; Implementation Abstractions; Message Passing Layer).]

Figure 2.1: The LPARX layer of our software infrastructure provides parallelization facilities designed for scientific applications that employ dynamic, irregular, structured representations. Based on the LPARX mechanisms, we have developed application-specific APIs for particle computations and structured adaptive mesh methods.

Building on the LPARX mechanisms described in this chapter (see Figure 2.1), we have developed application-specific support libraries for two important classes of applications: multilevel structured adaptive mesh methods [94] and particle calculations [87]. In Chapters 4 and 5, we describe how LPARX provides the parallelization support infrastructure needed to efficiently and easily implement these re-usable APIs.

This chapter is organized as follows. We begin with a description of the LPARX abstractions in Section 2.2. Section 2.3 illustrates how these abstractions are used to parallelize a simple application. We compare our approach with other related work in Section 2.4. Finally, we conclude with an analysis of the advantages and limitations of the LPARX approach.


2.2 The LPARX Abstractions

A breakthrough is not a breakthrough unless you coin a term for it.

-- Sidney Harris, "Einstein Simplified"

I think you've done it. All we need now is a trademark and a theme song.

-- Sidney Harris, "From Personal Ads to Cloning Labs"

LPARX [93, 95] is a coarse-grain, domain-specific parallel programming model that provides high-level abstractions for representing and manipulating dynamic, irregular block-structured data on MIMD distributed memory architectures. Dynamic irregular block decompositions are not currently supported by programming languages such as High Performance Fortran (HPF) [83], Fortran D [77], Vienna Fortran [43], or Fortran 90D [33, 143]. They arise in two important classes of scientific computations:

multilevel structured adaptive finite difference methods [22, 23, 94], which represent refinement regions using block-irregular data structures, and

parallel computations such as particle methods [87] that require an irregular data decomposition [12, 21] to balance non-uniform workloads across parallel processors.

We have used the LPARX mechanisms to implement domain-specific APIs and representative applications from each of these two problem classes.

LPARX should not be thought of as a "language" but rather as a set of data distribution and parallel coordination abstractions which may be implemented in a library (as we have done) or added to a language. The design goals of LPARX are as follows:

Express irregular data decompositions, layouts, and data dependencies at run-time using high-level, intuitive abstractions.

Require only basic message passing support and give portable performance across diverse parallel architectures.

Separate parallel control and communication from numerical computation.

Provide the basis for an expandable software infrastructure of application-specific APIs.

Implementing dynamic, irregular computations on parallel computers is a difficult task. To achieve reasonable parallel performance, the application must explicitly manage low-level details of data locality and communication, even on shared memory multiprocessors [109, 127]. This burden soon becomes unmanageable and can obscure the salient features of the algorithm. LPARX hides many of these implementation details and provides high-level coordination mechanisms to manage data locality within the memory hierarchy and minimize communication costs. The software support provided by LPARX greatly simplifies the development of high-performance, portable, parallel applications software.

The following sections describe LPARX's facilities. We begin with an overview of the philosophy underlying the LPARX model. Section 2.2.2 introduces the LPARX data types and its representation of irregular block decompositions. We then present LPARX's model of coarse-grain data parallel execution. Sections 2.2.4 and 2.2.5 describe LPARX's region calculus and data motion primitives which express data decompositions and dependencies in geometric terms. We briefly discuss the LPARX implementation in Section 2.2.6 and then conclude with a summary.

2.2.1 Philosophy

The LPARX parallel programming model separates the expression of data decomposition, communication, and parallel execution from numerical computation. As shown in Figure 2.2, LPARX applications are logically organized into three separate pieces: partitioners, LPARX code, and serial numerical kernels.

The LPARX layer provides facilities for the coordination and control of parallel execution. LPARX is a coarse-grain data parallel programming model; it gives the illusion of a single global address space and a single logical thread of control. On a MIMD parallel computer, the underlying run-time system executes in Single Program Multiple Data (SPMD) mode.

[Figure 2.2 diagram: three boxes labeled Partitioning Routines, LPARX, and Serial Numerical Kernels.]

Figure 2.2: The logical organization of an LPARX application consists of three components: partitioning routines, LPARX code, and serial numerical kernels.

Computations are divided into a relatively small number of coarse-grain pieces. Each work unit represents a substantial computation with thousands or tens of thousands of floating point operations executing on a single logical processing node. Parallel execution is expressed using a coarse-grain loop; each iteration of the loop executes as if on its own processor. The computation for each piece is performed by a numerical kernel, and the computations proceed independently of one another. Numerical routines may be written in any language, such as C++, C, or Fortran. The advantage of this approach is that LPARX can leverage serial compiler technology and existing sequential code. Heavily optimized numerical routines need not be re-implemented to parallelize an application. Furthermore, numerical code can be optimized for a processing node without regard to the higher level parallelization. LPARX does not define what constitutes a single logical node; a node may correspond to a single processor, a processing cluster, or a processor subset. Thus, kernels may be tuned to take advantage of low-level node characteristics, such as vector units, cache sizes, or multiple processors.

An important part of the LPARX philosophy is that data partitioning for dynamic, non-uniform scientific computations is extremely problem-dependent and therefore is best left to the application (or the API). No specific data decomposition strategies have been built into the LPARX model. Rather, all data decomposition is performed at run-time under the direct control of the application. LPARX provides the application a uniform framework for representing and manipulating block-irregular decompositions. Although our implementation supplies a standard library of decomposition routines, the programmer is free to write others.

Our approach to data decomposition differs from most parallel languages, such as HPF [83], which require the programmer to choose from a small number of predefined decomposition methods. Vienna Fortran [43] provides some facilities for irregular user-defined data decompositions but limits them to tensor products of irregular one dimensional decompositions. Block-irregular decompositions may be constructed using the pointwise mapping arrays of Fortran D [77]; however, pointwise decompositions are inappropriate and unnatural for calculations which exhibit block structures. Because pointwise decompositions have no knowledge of the block structure, mapping information must be maintained for each individual array element at a substantial cost in memory and communication overheads. By comparison, coarse-grain partitionings incur a cost proportional to the number of blocks, which is typically three or four orders of magnitude smaller than the number of array elements. Once a decomposition has been specified, the details of the data partitioning are hidden from the application. The programmer can change partitioning strategies without affecting the correctness of the underlying code. Thus, LPARX views partitioners as interchangeable, and the application may change decomposition strategies by simply invoking a different partitioning routine.
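To make the cost comparison concrete, here is a small worked example in Python. The mesh size and block count are assumed, illustrative values chosen for the arithmetic; they are not measurements from this work.

```python
import math

# Assumed, illustrative sizes (not measurements from this work).
mesh_extent = 1024                   # points per dimension of a 2D mesh
num_elements = mesh_extent ** 2      # a pointwise map stores one owner per element
num_blocks = 256                     # a block map stores one owner per block

ratio = num_elements // num_blocks   # how many times smaller the block map is
orders_of_magnitude = math.log10(ratio)

print(f"pointwise entries: {num_elements}")
print(f"block entries:     {num_blocks}")
print(f"ratio: {ratio} (about {orders_of_magnitude:.1f} orders of magnitude)")
```

With these assumed sizes the block map holds 256 entries against roughly a million for the pointwise map, which falls in the three to four orders of magnitude range cited above.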

At the core of LPARX is the concept of structural abstraction. Structural abstraction enables an application to express the logical structure of data and its decomposition across processors as first-class, language-level objects. The key idea is that the structure of the data (the "floorplan" describing how the data is decomposed and where the data is located) is represented and manipulated separately from the data itself. LPARX expresses communication and operations on data decompositions using intuitive geometric operations, such as intersection, instead of explicit indexing. Interprocessor communication is hidden by the run-time system, and the application is completely unaware of low-level details. Although the current LPARX implementation is limited to representing irregular, block-structured decompositions, the concept of structural abstraction is general and extends to other classes of applications, such as unstructured finite element meshes [13].
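The idea of deriving communication from geometry rather than from explicit indexing can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the LPARX interface: the Region class, the grow and intersect helpers, and the one-cell ghost width are all hypothetical names and choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Region:
    lwb: Tuple[int, ...]   # lower corner (inclusive)
    upb: Tuple[int, ...]   # upper corner (inclusive)


def intersect(a: Region, b: Region) -> Optional[Region]:
    """Return the overlap of two Regions, or None if they are disjoint."""
    lo = tuple(max(x, y) for x, y in zip(a.lwb, b.lwb))
    hi = tuple(min(x, y) for x, y in zip(a.upb, b.upb))
    if any(l > h for l, h in zip(lo, hi)):
        return None
    return Region(lo, hi)


def grow(r: Region, width: int) -> Region:
    """Extend a Region by 'width' cells in every direction."""
    return Region(tuple(l - width for l in r.lwb),
                  tuple(u + width for u in r.upb))


def dependencies(floorplan, ghost=1):
    """List (source, destination, overlap) triples implied by the floorplan."""
    deps = []
    for i, (dst_region, _) in enumerate(floorplan):
        halo = grow(dst_region, ghost)
        for j, (src_region, _) in enumerate(floorplan):
            if i != j:
                overlap = intersect(halo, src_region)
                if overlap is not None:
                    deps.append((j, i, overlap))
    return deps


# The floorplan holds only structure: (Region, owning processor) pairs.
floorplan = [(Region((0, 0), (3, 7)), 0), (Region((4, 0), (7, 7)), 1)]
deps = dependencies(floorplan)
```

Note that no array data appears anywhere in the sketch: the list of required copies is computed from the floorplan alone, which is the separation of structure from data described above.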

2.2.2 Data Types

LPARX provides the following four basic data types:

Point: an integer n-tuple representing a point in Z^n,

Region: an object representing a rectangular subset of array index space,

Grid: a dynamic array instantiated over a Region, and

XArray: a dynamic array of Grids distributed over processors.

The Point is a simple, auxiliary data type used to define and manipulate Regions. Element-wise addition and scalar multiplication are defined over Points in the obvious way.
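A minimal Python sketch of these two operations follows. Representing a Point as a tuple of integers and the function names point_add and point_scale are assumptions made for illustration.

```python
from typing import Tuple

Point = Tuple[int, ...]   # assumed representation: an integer n-tuple


def point_add(p: Point, q: Point) -> Point:
    """Element-wise addition of two Points of the same dimension."""
    if len(p) != len(q):
        raise ValueError("Points must have the same dimension")
    return tuple(a + b for a, b in zip(p, q))


def point_scale(k: int, p: Point) -> Point:
    """Multiply every coordinate of a Point by the integer scalar k."""
    return tuple(k * a for a in p)


shifted = point_add((1, 2, 3), (4, 5, 6))
doubled = point_scale(2, (1, 2, 3))
```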

The Region provides the basis for structural abstraction. An n-dimensional Region represents a subset of Z^n, the space of n-dimensional integer vectors. The Region does not contain data elements, as an array, but rather represents a portion of index space. In the current implementation of LPARX, we restrict Regions to be rectangular; however, the concepts described here apply to arbitrary subsets of Z^n [13]. A Region is uniquely defined by the two Points at its lower and upper corners. We denote the lower bound of a Region R by lwb(R) and its upper bound by upb(R). Although there is no identical construct in Fortran or C, the Region is related to array section specifiers found in Fortran-90. Unlike Fortran-90 array section specifiers, however, the Region is a first-class object and may be assigned and manipulated at run-time. The concept of first-class array section objects (called domains) was introduced in the FIDIL programming language [85].
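The following Python sketch shows what a first-class rectangular Region might look like. Only the lwb and upb names come from the text; the inclusive-bounds convention and the extents, size, and shift methods are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Region:
    lwb: Tuple[int, ...]   # lower corner (assumed inclusive)
    upb: Tuple[int, ...]   # upper corner (assumed inclusive)

    def extents(self):
        """Number of index points along each dimension."""
        return tuple(u - l + 1 for l, u in zip(self.lwb, self.upb))

    def size(self):
        """Total number of index points; an empty Region has size 0."""
        count = 1
        for e in self.extents():
            count *= max(e, 0)
        return count

    def shift(self, offset):
        """Return a new Region translated by the given offset."""
        return Region(tuple(l + o for l, o in zip(self.lwb, offset)),
                      tuple(u + o for u, o in zip(self.upb, offset)))


# Regions are ordinary values: they can be stored, passed, and transformed.
interior = Region((1, 1), (8, 4))
moved = interior.shift((10, 0))
```

Because the Region holds no array data, operations such as shift cost a handful of integer operations regardless of how many index points the Region covers.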

The Grid is a dynamic array defined over an arbitrary rectangular index set specified by a Region. The Grid is similar to a Fortran 90 allocatable array. Each Grid remembers its associated Region, which can be queried at run-time, a convenience that greatly reduces bookkeeping for dynamically defined Grids¹. All Grid elements must have the same type; they may be integers, floating point numbers, or any user-defined type or class. For example, in addition to representing a mesh of floating point numbers, the Grid may also be used to implement the spatial data structures [87] common in particle calculations. Grids may be manipulated using high-level block copy operations, described in Section 2.2.5.
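A Grid that carries its own bounds can be sketched as follows. This Python class is an illustrative stand-in and not the LPARX Grid: the constructor signature, the row-major list storage, and the region method name are assumptions.

```python
class Grid:
    """A dense array defined over an inclusive rectangular index range."""

    def __init__(self, lwb, upb, fill=0.0):
        self.lwb, self.upb = tuple(lwb), tuple(upb)
        self.extents = tuple(u - l + 1 for l, u in zip(self.lwb, self.upb))
        count = 1
        for e in self.extents:
            count *= e
        self.data = [fill] * count        # row-major storage

    def region(self):
        """The bounds can be queried at run time instead of tracked by hand."""
        return self.lwb, self.upb

    def _offset(self, index):
        offset = 0
        for i, l, e in zip(index, self.lwb, self.extents):
            if not 0 <= i - l < e:
                raise IndexError(f"{index} lies outside {self.region()}")
            offset = offset * e + (i - l)
        return offset

    def __getitem__(self, index):
        return self.data[self._offset(index)]

    def __setitem__(self, index, value):
        self.data[self._offset(index)] = value


g = Grid((-2, 5), (1, 8))      # indices -2..1 by 5..8, not 0-based
g[(-2, 5)] = 1.5
g[(1, 8)] = 2.5
```

Indexing uses the global coordinates of the Region, so code that receives a Grid does not need separate variables describing where the block sits in the overall index space.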

LPARX is targeted towards applications with irregular, block structures. To support such structures, it provides a special array, the XArray, for organizing a dynamic collection of Grids. Each Grid in an XArray is arbitrarily assigned to a single processor; individual Grids are not subdivided across processors. The XArray can be viewed as a coarse-grain analogue of a Fortran D array decomposed via mapping arrays except that XArray elements are themselves arrays (Grids).

The Grids in an XArray may have different origins, sizes, and index sets, but all Grids must have the same spatial dimension. To allocate an XArray, the application invokes the LPARX operation XAlloc with an array of Regions representing the structure of the Grids and a corresponding array of processor assignments (i.e. the floorplan). LPARX provides a default assignment of Grids to processors if none is given. An XArray is intended to implement coarse-grain irregular decompositions; thus, each processor is typically assigned only a few Grids.
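The allocation step can be pictured with the Python sketch below. It is not the LPARX XAlloc: the function name xalloc, its parameters, the dictionary representation, and in particular the round-robin default assignment are assumptions, since the text does not say how the default is chosen.

```python
def xalloc(regions, owners=None, nprocs=1):
    """Build a floorplan from Regions given as (lwb, upb) tuples.

    If no owners are supplied, fall back to a round-robin assignment
    (an assumed default policy chosen only for this sketch).
    """
    if owners is None:
        owners = [i % nprocs for i in range(len(regions))]
    if len(owners) != len(regions):
        raise ValueError("need exactly one owner per Region")
    dimensions = {len(lwb) for lwb, _ in regions}
    if len(dimensions) > 1:
        raise ValueError("all Grids must have the same spatial dimension")
    return [{"region": r, "owner": p} for r, p in zip(regions, owners)]


# Three blocks with different origins and sizes, all two-dimensional.
regions = [((0, 0), (3, 7)), ((4, 0), (9, 2)), ((4, 3), (9, 7))]
default_plan = xalloc(regions, nprocs=2)
explicit_plan = xalloc(regions, owners=[1, 1, 0])
```

Each entry keeps a whole Region on one owner, reflecting the rule that individual Grids are never subdivided across processors.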

¹ Compare this to C, which requires the programmer to keep track of bounds for dynamically

[Figure 2.3 diagram: two XArrays of Grids and a mapping of their elements onto processors 1 through 4.]

Figure 2.3: Two examples of an XArray of Grids data structure. The recursive bisection decomposition on the far left is usually employed to balance non-uniform workloads in particle calculations. The structure in the middle is typical of a single level mesh refinement in structured adaptive mesh methods. On the right, we show one possible mapping of XArray elements to processors. Note that the XArray is a container for the Grids and its elements are Grids, not pointers.

LPARX defines a coarse grain looping construct, forall, which iterates concurrently over the Grids of an XArray. The semantics of forall are similar to HPF's INDEPENDENT forall [83]; each loop iteration is executed as if an atomic operation. In writing a forall loop, the programmer is unaware of the assignment of Grids to processors (each XArray element is treated as if it were assigned to its own processor), and the LPARX run time system correctly manages the parallelism. LPARX also defines a for loop, a sequential version of the forall.
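The forall semantics described above can be imitated in Python as a rough sketch. Threads stand in for processors here, which is an assumption made purely for illustration; the function names forall, for_each, and scale_kernel are hypothetical, and plain lists stand in for Grids.

```python
from concurrent.futures import ThreadPoolExecutor


def forall(xarray, kernel):
    """Apply kernel to every element as if each had its own processor."""
    with ThreadPoolExecutor(max_workers=max(1, len(xarray))) as pool:
        return list(pool.map(kernel, xarray))   # results keep element order


def for_each(xarray, kernel):
    """Sequential counterpart: same result, one element at a time."""
    return [kernel(grid) for grid in xarray]


def scale_kernel(grid):
    """A serial numerical kernel that touches only its own grid."""
    return [2.0 * value for value in grid]


xarray = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]   # three grids of unequal size
parallel = forall(xarray, scale_kernel)
serial = for_each(xarray, scale_kernel)
```

Because each iteration reads and writes only its own element, the concurrent and sequential versions produce the same result, which is the property that lets the loop body ignore where each Grid lives.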

The XArray of Grids structure provides a common framework for implementing various block-irregular decompositions of data. This framework is used by standard load balancing routines such as recursive bisection [21] (see Chapter 5) and also by application-specific routines, such as the grid generator for an adaptive mesh calculation [94] (see Chapter 4). Figure 2.3 shows decompositions arising in two different applications. In each case, the data has been divided into Grids, each representing a different portion of the computational domain, which have been assigned to an XArray. The following section provides more detail about how XArrays are used to organize a parallel computation.

2.2.3 Coarse-Grain Data Parallel Computation

Recall that an LPARX application consists of three components: partitioning routines, LPARX code, and serial numerical kernels. Here we show how these pieces work together in an application. LPARX provides the programmer with a simple model of coarse-grain parallel computation:


1. Decompose the computational structure into an array of Regions.

2. Specify an assignment of each Region in (1) to a processor.

3. Create an XArray of Grids representing the data decomposition floorplan generated by steps (1) and (2).

4. Satisfy data dependencies between Grids in the XArray using LPARX's communication facilities (described in the following sections).

5. Perform calculations on the Grids in the XArray in parallel using the coarse-grain forall loop.

The decomposition in (1) may be managed explicitly by the application, such as in generating refinement regions, or by load balancing utilities that implement partitioning strategies. The LPARX implementation provides a standard library of partitioners that implement recursive coordinate bisection [21] and uniform block partitioning.

The assignment of Regions to processors in (2) provides applications the flexibility to delegate work to processors. In general, this information will be returned by the routine which renders the partitions. This step may be omitted, in which case LPARX generates a default assignment.

In step (3), the application invokes an LPARX operation called XAlloc which, using the partitioning and the processor assignment information, instantiates an XArray of Grids implementing the data decomposition. LPARX creates Grids based on the supplied Region information and assigns them to the appropriate processors.

After the decomposition and allocation of data, applications typically alternate between steps (4) and (5). In (4), data dependencies between the Grids in the XArray are satisfied using LPARX's region calculus and block copy operations, described in the following sections. Finally, the application computes in parallel on the Grids in the XArray using a forall loop (5). For each Grid, a numerical routine is called to perform the computation; the computation executes on a single logical processing node, which may actually consist of many physical processors. The execution of forall assumes the Grids are decoupled; they are processed independently and asynchronously.
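The five-step model above can be sketched end to end in a few lines of C++. The example below is a 1-D illustration under stated assumptions: Interval, Block, partition_uniform, assign_owners, allocate, kernel, and run are hypothetical names, the wrap-around assignment policy is invented for the example, and step (4) is left out because the communication operations are described in later sections. A plain loop stands in for forall.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Interval { int lo, hi; };   // inclusive 1-D Region (hypothetical)

struct Block {                     // stands in for a Grid
    Interval region;
    int owner;
    std::vector<double> data;
};

// Step 1: uniform block partitioning of the index range [0, n).
std::vector<Interval> partition_uniform(int n, int parts) {
    assert(n > 0 && parts > 0 && parts <= n);
    std::vector<Interval> regions;
    int lo = 0;
    for (int p = 0; p < parts; ++p) {
        int len = n / parts + (p < n % parts ? 1 : 0);
        Interval r = {lo, lo + len - 1};
        regions.push_back(r);
        lo += len;
    }
    return regions;
}

// Step 2: assign Region i to processor i mod num_procs (an assumed policy).
std::vector<int> assign_owners(std::size_t count, int num_procs) {
    assert(num_procs > 0);
    std::vector<int> plan(count);
    for (std::size_t i = 0; i < count; ++i)
        plan[i] = static_cast<int>(i % static_cast<std::size_t>(num_procs));
    return plan;
}

// Step 3: build the container of Blocks from the Regions and the floorplan.
std::vector<Block> allocate(const std::vector<Interval>& regions,
                            const std::vector<int>& plan) {
    assert(regions.size() == plan.size());
    std::vector<Block> x;
    for (std::size_t i = 0; i < regions.size(); ++i) {
        Block b;
        b.region = regions[i];
        b.owner = plan[i];
        b.data.assign(
            static_cast<std::size_t>(regions[i].hi - regions[i].lo + 1), 0.0);
        x.push_back(b);
    }
    return x;
}

// Step 5: a serial numerical kernel, called once per Block. Here it simply
// stores each cell's global index.
void kernel(Block& b) {
    for (std::size_t k = 0; k < b.data.size(); ++k)
        b.data[k] = static_cast<double>(b.region.lo) + static_cast<double>(k);
}

// Steps 1-3 and 5 in order; step 4 (communication) is omitted in this sketch.
std::vector<Block> run(int n, int parts, int num_procs) {
    std::vector<Interval> regions = partition_uniform(n, parts);
    std::vector<int> plan = assign_owners(regions.size(), num_procs);
    std::vector<Block> x = allocate(regions, plan);
    for (std::size_t i = 0; i < x.size(); ++i) kernel(x[i]);  // forall stand-in
    return x;
}
```

The kernel sees only one Block at a time and knows nothing about owners, which is the separation between serial numerical code and parallel coordination that the model relies on.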

2.2.4 The Region Calculus

LPARX defines a region calculus which enables the programmer to manipulate index sets (Regions) in high-level geometric terms. In this section, we describe the most important region calculus operations²: shift, intersection, and grow.

Given a Region R and a Point P, shift(R,P) denotes the Region R translated by a distance P, as shown in Figure 2.4a. The intersection of two Regions is simply the set of points which the two have in common. The dark shaded area in Figure 2.4b represents the intersection of Regions R and S, written as R * S. Regions are closed under intersection: the intersection of two Regions is always another Region. If two Regions do not overlap, the resulting intersection is said to be empty.

Grow surrounds a Region with a boundary layer of a specified width. It takes two arguments, a Region and a Point, and returns a new Region which has been extended (or shrunk for negative widths) in each coordinate direction by the specified amount. The second argument to grow may also be an integer, in which case each dimension is grown by the same amount. Figure 2.4c shows the Region resulting from grow(R,1).
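For rectangular index sets, the three operations reduce to simple arithmetic on corner coordinates. The following C++ sketch shows one possible 2-D formulation; Point2, Region2, and the free functions are hypothetical names chosen for the example, and the representation by inclusive lower and upper corners is an assumption rather than a statement about LPARX internals.

```cpp
#include <algorithm>
#include <cassert>

struct Point2 { int x, y; };

// Hypothetical rectangular Region with inclusive lower and upper corners.
struct Region2 {
    Point2 lo, hi;
    bool empty() const { return hi.x < lo.x || hi.y < lo.y; }
};

// shift(R, P): translate R by the offset P.
Region2 shift(const Region2& r, const Point2& p) {
    Region2 s = {{r.lo.x + p.x, r.lo.y + p.y}, {r.hi.x + p.x, r.hi.y + p.y}};
    return s;
}

// Intersection (written R * S in the text): the points common to both.
// The result is again a rectangle, possibly an empty one.
Region2 intersect(const Region2& r, const Region2& s) {
    Region2 t = {{std::max(r.lo.x, s.lo.x), std::max(r.lo.y, s.lo.y)},
                 {std::min(r.hi.x, s.hi.x), std::min(r.hi.y, s.hi.y)}};
    return t;
}

// grow(R, W): extend R by W.x and W.y cells on each side; negative widths
// shrink the Region and may leave it empty.
Region2 grow(const Region2& r, const Point2& w) {
    Region2 g = {{r.lo.x - w.x, r.lo.y - w.y}, {r.hi.x + w.x, r.hi.y + w.y}};
    return g;
}

// Integer form: the same width in every coordinate direction.
Region2 grow(const Region2& r, int w) {
    Point2 p = {w, w};
    return grow(r, p);
}
```

Closure under intersection falls out of the representation: taking the componentwise maximum of the lower corners and minimum of the upper corners always yields another corner pair, and an empty result is detected by an upper corner that lies below the lower one.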

2.2.5 Data Motion

LPARX coordinates data motion between Grids using two types of block copy operations: copy-on-intersect and general block copy. Copy-on-intersect copies data from a Grid into the corresponding elements of another where their Regions overlap in the underlying integer coordinate system. Of course, all Grids and Regions

[Figure 2.4 panel labels: (a) Shift(R, [7,-1]); (b) R * S; (c) Grow(R,1); (d) Grow(R,1) with S]

Figure 2.4: Four examples of LPARX's region calculus. Although shown in 2d for simplicity, these operations generalize readily to higher dimensions. (a) Shift takes a Region and a Point and returns a Region translated by the specified distance. (b) Intersection returns the set of points shared by two Regions. (c) Grow adds a boundary layer to a Region. (d) Data dependencies for a ghost cell region can be calculated easily using the grow and intersection operations. In this example, the

in the same copy statement must have the same spatial dimension. For Grids G and H, the statement:

copy into G from H

copies data from H into G where the Regions of G and H intersect. Another form:

copy into G from H on R

where R is a Region, limits the copy to the index space in which all three Regions intersect. General block copy is similar to copy-on-intersect except that it allows a shift between the source and destination Regions. The statement:

copy into G on R from H on S

copies data from Region S of Grid H into Region R of Grid G.
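To make the three copy statements concrete, here is a single-address-space C++ sketch of what each one computes. Box, Grid, copy_on_intersect, and block_copy are hypothetical names, and no interprocessor communication is modelled. For the general block copy, the sketch assumes that R and S have equal extents and that points falling outside either Grid are skipped; the text does not specify these details.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2-D Region with inclusive bounds.
struct Box {
    int lox, loy, hix, hiy;
    bool contains(int x, int y) const {
        return x >= lox && x <= hix && y >= loy && y <= hiy;
    }
};

Box intersect(const Box& a, const Box& b) {
    Box c = {std::max(a.lox, b.lox), std::max(a.loy, b.loy),
             std::min(a.hix, b.hix), std::min(a.hiy, b.hiy)};
    return c;
}

// Hypothetical Grid: row-major storage addressed by global (x, y) indices.
struct Grid {
    Box box;
    std::vector<double> data;
    Grid(const Box& b, double init) : box(b) {
        assert(b.lox <= b.hix && b.loy <= b.hiy);
        data.assign(static_cast<std::size_t>(b.hix - b.lox + 1) *
                        static_cast<std::size_t>(b.hiy - b.loy + 1),
                    init);
    }
    std::size_t index(int x, int y) const {
        assert(box.contains(x, y));
        return static_cast<std::size_t>(y - box.loy) *
                   static_cast<std::size_t>(box.hix - box.lox + 1) +
               static_cast<std::size_t>(x - box.lox);
    }
    double& at(int x, int y) { return data[index(x, y)]; }
    double at(int x, int y) const { return data[index(x, y)]; }
};

// "copy into G from H on R": copy where all three Regions intersect.
// An empty intersection makes the loops run zero times.
void copy_on_intersect(Grid& g, const Grid& h, const Box& r) {
    Box c = intersect(intersect(g.box, h.box), r);
    for (int y = c.loy; y <= c.hiy; ++y)
        for (int x = c.lox; x <= c.hix; ++x)
            g.at(x, y) = h.at(x, y);
}

// "copy into G from H": the same, restricted only by the two Grids.
void copy_on_intersect(Grid& g, const Grid& h) {
    copy_on_intersect(g, h, g.box);
}

// "copy into G on R from H on S": the point of S at a given offset from
// S's lower corner goes to the point of R at the same offset.
void block_copy(Grid& g, const Box& r, const Grid& h, const Box& s) {
    assert(r.hix - r.lox == s.hix - s.lox && r.hiy - r.loy == s.hiy - s.loy);
    const int dx = r.lox - s.lox, dy = r.loy - s.loy;
    for (int y = s.loy; y <= s.hiy; ++y)
        for (int x = s.lox; x <= s.hix; ++x)
            if (h.box.contains(x, y) && g.box.contains(x + dx, y + dy))
                g.at(x + dx, y + dy) = h.at(x, y);
}
```

In the two copy-on-intersect forms the caller never writes a subscript; all index arithmetic is derived from the Regions themselves.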

The default behavior of both data motion operations is to simply copy data from the source into the destination. LPARX also provides a reduction form:

copy into [...] from [...] using combine

where combine is a specified commutative, associative reduction function. Instead of copying the data, LPARX applies combine elementwise to combine corresponding source and destination data values. For example,

copy into G from H using sum

adds corresponding elements of Grid H to Grid G; portions of G that do not intersect with H remain unchanged. Section 5.2.3 illustrates how this reduction variant is used to sum force information in a particle application.
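The reduction form differs from a plain copy only in how each destination value is updated. A minimal 1-D C++ sketch, with Grid1 and copy_using as hypothetical names and std::plus<double> standing in for sum, might look like this:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical 1-D Grid covering the inclusive index range [lo, hi].
struct Grid1 {
    int lo, hi;
    std::vector<double> data;
    Grid1(int lo_, int hi_, double init) : lo(lo_), hi(hi_) {
        assert(lo_ <= hi_);
        data.assign(static_cast<std::size_t>(hi_ - lo_ + 1), init);
    }
    double& at(int i) { return data[static_cast<std::size_t>(i - lo)]; }
    double at(int i) const { return data[static_cast<std::size_t>(i - lo)]; }
};

// "copy into G from H using combine": on the intersection of the two index
// ranges, replace each destination value by combine(destination, source).
// Values of G outside the intersection are left untouched.
template <class Combine>
void copy_using(Grid1& g, const Grid1& h, Combine combine) {
    const int lo = std::max(g.lo, h.lo);
    const int hi = std::min(g.hi, h.hi);
    for (int i = lo; i <= hi; ++i) g.at(i) = combine(g.at(i), h.at(i));
}
```

With this sketch, `copy_using(g, h, std::plus<double>())` plays the role of `copy into G from H using sum`. Requiring combine to be commutative and associative means the result does not depend on the order in which contributions from several source Grids arrive.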

We now show how these simple but powerful operations are used to calculate data dependencies. One common communication operation in scientific codes is the transmission of data to fill ghost cells, boundary elements added to each processor's local data partition (see Figure 2.5b). The region calculus represents the processor's local partition as a Region. We grow the Region to define ghost cells and then use intersection to calculate the Region of data required from another processor. Finally, a copy updates the ghost region's data values (see Figure 2.4d). Recall that copy-on-intersect copies values that lie in the intersection of the ghost region and interacting blocks. The calculation of data dependencies involves no explicit computations involving subscripts, as copy-on-intersect manages all bookkeeping details. The region calculus is independent of the Grid dimension, and the same operations work for any problem dimension. All interprocessor communication is managed by the run-time system and is completely hidden from the user.
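The ghost cell recipe (grow, then intersect) can be sketched as a small C++ routine that lists which cells each partition needs from each of the others. Box, Dependency, and ghost_dependencies are hypothetical names, and the sketch assumes the interior Regions do not overlap one another. It computes only the dependency Regions; the data transfer itself would be a copy-on-intersect over each listed Region.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2-D Region with inclusive bounds.
struct Box {
    int lox, loy, hix, hiy;
    bool empty() const { return hix < lox || hiy < loy; }
};

Box intersect(const Box& a, const Box& b) {
    Box c = {std::max(a.lox, b.lox), std::max(a.loy, b.loy),
             std::min(a.hix, b.hix), std::min(a.hiy, b.hiy)};
    return c;
}

Box grow(const Box& b, int w) {
    Box g = {b.lox - w, b.loy - w, b.hix + w, b.hiy + w};
    return g;
}

// One ghost cell dependency: partition `dst` needs `cells` from `src`.
struct Dependency {
    std::size_t dst, src;
    Box cells;
};

// For every partition, grow its interior by the ghost width and intersect
// the result with every other interior. Non-empty intersections are the
// Regions of data that must be fetched.
std::vector<Dependency> ghost_dependencies(const std::vector<Box>& interiors,
                                           int width) {
    std::vector<Dependency> deps;
    for (std::size_t i = 0; i < interiors.size(); ++i) {
        const Box ghost = grow(interiors[i], width);
        for (std::size_t j = 0; j < interiors.size(); ++j) {
            if (i == j) continue;
            const Box needed = intersect(ghost, interiors[j]);
            if (!needed.empty()) {
                Dependency d = {i, j, needed};
                deps.push_back(d);
            }
        }
    }
    return deps;
}
```

Nothing in the routine depends on how the partitions are arranged, so the same code serves a uniform block decomposition and an irregular one.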

2.2.6 LPARX Implementation

In this section, we briefly describe the LPARX implementation; further details are provided in Chapter 3.

LPARX has been implemented as a C++ run-time library consisting of approximately fifteen thousand lines of code (excluding the application-specific API libraries described in Chapters 4 and 5). The implementation does not require a special compiler other than a standard C++ compiler, and LPARX code may be freely

mixed with calls to othe
