Modeling, Predicting, And Optimizing Parallel Performance Of Grid Stuctured Problems

(1)

Abstract

SMITH, FRANK ANDERSON. Modeling, Predicting, And Optimizing Parallel Performance Of Grid Stuctured Problems. (Under the direction of Robert E. Funderlic.)

The performance of parallel computers can be greatly affected by a user’s choices of data

distribution and logical processor configuration. Selecting optimal choices for such user

speci-fiable parameters may be easier if the performance of the target machine can be predicted by a

performance model. Models for parallel performance on the IBM SP for grid structured

prob-lems are considered. Such probprob-lems are ubiquitous in scientific computing and frequently are

characterized by a nearest neighbor communication pattern. Bounds are derived for the size

of the solution space of data distributions and logical processor configurations for problems

with nearest neighbor communication. Proofs are derived that exclude a substantial number

of non-optimal choices of data distribution and logical processor configuration. Algorithms

are given that are intended to predict parallel performance for a model application and allow

the user to select optimal choices for parameters that can be specified. Experimental evidence

is presented that suggests that performance on the SP is characterized fairly accurately by

a specific model. Experimental evidence also suggests that an algorithm exists to optimize

a user’s choices of data distribution and logical processor configuration for grid structured

(2)

MODELING, PREDICTING, AND OPTIMIZING

PARALLEL PERFORMANCE OF

GRID STRUCTURED PROBLEMS

by

FRANK ANDERSON SMITH

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Computer Science

Raleigh

2004

APPROVED BY:

(3)

Biography

Andy Smith grew up in Raleigh NC. After high school, he attended UNC-Chapel Hill intermittently over several years. Andy met Louise Hern and they married four years later. Soon after Louise graduated from nursing school, Andy began a constuction business. He was the qualifying examinee for a corporate general contractor’s license and built a successful business over a period of years.

Andy enrolled as a full time non-traditional student at NCSU in the spring of 1993. He completed the BS degree in Computer Science in May 1995, graduating summa cum laude

with a minor in Mathematics. He did undergraduate research under the direction of Dr. Rex Dwyer, which led to an interest in graduate school and Honorable Mention in the 1995 Computing Research Association Award for Outstanding Undergraduate. Andy chose to attend graduate school at NCSU rather than Duke or UNC. He was awarded an Andrews Fellowship, a Dean’s Fellowship, and an NCSU Alumni Association Fellowship.

Andy began work in 1995 at Data General Corporation in Research Triangle Park through the Computer Science Department’s industrial assistant program. He became an employee of Data General in 1996 and worked there until the merger with EMC Corporation in 1999. He became an EMC employee after the merger.

(4)

Acknowledgements

I would like to express my deep appreciation and heartfelt thanks to those whose help has been invaluable in completing this work. Dr. Robert Fundelic has served as my committee chairman and mentor. His guidance and friendship are gifts that will always be remembered and treasured. All of my committee members have been generous with their time and helpful with their advice and insight. Each has contributed significantly to my knowledge, both through their classes and through helpful discussions. Each of them has been an inspiration to me. I am indebted to Drs. Edward Davis, Rex Dwyer, Carla Savage, and Jeffrey Scroggs. I must mention specifically Rex Dwyer’s comments about the convex hull algorithm, as his comments guided me to find the correct approach for a particular algorithm.

My wife Louise has encouraged me and been helpful in ways too numerous to enumerate. However, I must thank her again for proofreading many of the text portions of this work.

Dr. David Wong’s Ph.D. dissertation served as the starting point for the work herein. I am grateful to him for helpful discussions and honored to have been a co-author with him.

Kenny Purcell and Dr. Chuck Harner have earned my gratitude for their help in checking some of the more tedious equations. I am very appreciative of the time Sujuan Upshaw and Dr. Pierre Gremaud each spent in checking my translation of Souriau’s paper and helping me with phrases that are easier for those who know the French language.

It has been my good fortune to complete all my degrees at North Carolina State University. I feel fortunate to have encountered so many of the fine people on the faculties of the Computer Science and Mathematics Departments. Their instruction provided the foundation that made this work possible. My experience also suggests that the Department of Computer Science is blessed with outstanding staff members without whom the Department could not function.

(5)

List of Tables

4.1 A bound for and the number of leaves ingenerate’s call tree. . . 31

6.1 Scaled execution time as a function ofb1. . . 61

6.2 Bounds for and the number of candidate solutions. . . 65

9.1 Experimental results in one dimension . . . 94

9.2 Estimates ofα in 1-D experiments . . . 96

9.3 Estimates ofC andR in 1-D experiments . . . 96

9.4 Experimental results in two dimensions . . . 99

9.5 Experimental results in three dimensions . . . 102

10.1 Estimates ofα in modified 1-D experiments . . . 106

11.1 Estimates for the computation parameterR . . . 122

11.2 Estimates for communication parameters in 2-D . . . 123

11.3 Estimates for communication parameters in 3-D . . . 124

11.4 Summary of results for 2-D experiments. . . 126

11.5 Summary of results for 3-D experiments. . . 131

16.1 Statistics for intervals between successive calls to MPI Wtime(switch). . . 157

16.2 Statistics for intervals between successive calls to MPI Wtime(PE). . . 157

16.3 Statistics for subsets of intervals between successive calls toMPI Wtime. . . 159

16.4 Scaled statistics for assignments. . . 163

16.5 Scaled statistics for arithmetic operations. . . 165

16.6 Minimum times for arithmetic operations with doubles. . . 166

16.7 Times for overhead and operations for vector operations. . . 166

16.8 Times per grid cell for mixed operations. . . 167

16.9 Times for overhead and mixed operations with doubles. . . 167

(10)

List of Figures

2.1 A discrete grid in one dimension. . . 4

2.2 Five and nine point stencils in two dimensions . . . 8

2.3 Seven and nineteen point stencils in three dimensions . . . 10

2.4 Red-black scheme on a one dimensional grid. . . 13

2.5 Red-black scheme on a two dimensional grid. . . 13

3.1 The HPF data mapping model. . . 16

3.2 The effect of the HPF ALIGN directive. . . 17

3.3 Block, cyclic and block cyclic data partitions. . . 19

4.1 Examples of processor grids with 4 and 8 processors. . . 21

4.2 Thegenerate function. . . 25

4.3 Portions ofgenerate’s call tree whenP = 19 and D= 3. . . 26

4.4 The ratio of non-maximal to maximal grids ingenerate’s call tree in 3-D. . . . 32

5.1 Two possible cases occur for the overlapped model. . . 37

5.2 The pipe function,P(x) =x/(x+ 1). . . 39

5.3 Examples ofei. . . 44

6.1 Examples of cases in the definition ofψ. . . 52

6.2 Block sizes equal to and less thandw/de. . . 57

6.3 Partitioning evenly if more than two processors are used. . . 59

6.4 Partitioning evenly if two processors are used. . . 60

6.5 The solution space for candidate solutions in 2-D whenP = 32. . . 63

6.6 The solution space for candidate solutions in 3-D whenP = 27. . . 64

6.7 Bounds for the number of candidate solutions in two and three dimensions. . . 66

6.8 A geometric view of Example 6.24 andF(α). . . 67

(11)

7.2 Linked list of candidate solutions after sorting. . . 73

8.1 Distributing data to a logical processor grid. . . 80

8.2 Ghost cells in a two dimensional grid. . . 81

8.3 Scalability of run times for two dimensional programs . . . 88

8.4 Scalability of minimum run times for three dimensional programs . . . 89

8.5 Scalability of median run times for three dimensional programs . . . 90

9.1 Predicted results in one dimension . . . 95

9.2 Typical experimental results in two dimensions. . . 100

9.3 Experimental results in three dimensions. . . 103

9.4 Perspectives of results of a three dimensional experiment. . . 103

10.1 Two layers of ghost cells in a two dimensional grid. . . 116

11.1 Predicted and actual run times for 150x150 grid. . . 127

11.2 Predicted and actual run times for 108x108 grid. . . 128

11.3 Predicted and actual run times for rectangular 2-D grids. . . 129

11.4 Predicted and actual run times for 3-D grids of equal size. . . 132

11.5 Run times for 3-D grids with equalx and y grid dimensions. . . 133

11.6 Run times for 3-D grids with equalx and z grid dimensions. . . 134

11.7 Run times for 3-D grids with equaly and z grid dimensions. . . 135

11.8 Run times for 3-D grids with all grid dimensions unequal. . . 136

16.1 Histograms for time intervals for MPI Wtime. . . 158

16.2 Duration of time intervals forMPI Wtime. . . 160

16.3 Excerpt of output from theps command. . . 160

16.4 Timing results for MPI Send and MPI Recv (linear scale). . . 169

16.5 Timing results for MPI Send and MPI Recv (log scale). . . 169

16.6 Timing results for MPI Send and MPI Recv (step size of 32). . . 170

(12)

Chapter 1

Introduction

The performance of parallel computers can be greatly affected by a user’s choices of data distribution and logical processor configuration. The demand for massively parallel computers is largely driven by the needs of users whose problems require more resources than can be met with a single machine. Such problems often are scientific applications and frequently are grid structured. The grid structure may arise from the discretization of a physical domain or from the use of large matrices in the computation. Many problems involving the discretization of a physical domain are characterized by a nearest neighbor communication pattern.

The research described herein concerns optimizing the choices for data distribution and logical processor configuration. The focus is on optimizing those choices for grid structured problems with nearest neighbor communication. The research includes both theoretical com-ponents and experimental comcom-ponents. Experiments are performed on an IBM SP.

The notation used herein is too extensive to summarize here. Chapter 2 includes a review of some common mathematical notation. Other notation is defined in the chapter or section where it is used, since we reuse symbols at times where the context ensures there can be no confusion about the meaning. Chapter 14 provides both an alphabetical and functional summary of notation. When the work of other authors is described, their notation is sometimes modified for consistency with other notation or to avoid ambiguity.

Chapter 2 also includes a review of other mathematical topics that are relevant to our discussion. Chapter 3 provides a brief overview of parallel computing including a discussion of data distribution. Chapter 4 describes logical processor grids and introduces the concept of maximal grids. Chapter 5 includes descriptions of and details about selected performance models by other authors.

(13)

logical processor grid dimensions and an optimal data distribution. We extend Wong’s nota-tion for two dimensional problems to higher dimensions. We show how the solunota-tion space for the problem of optimizing certain user specifiable parameters can be substantially restricted under certain conditions.

In Chapter 7, two optimization algorithms are proposed for the problem of choosing a logical processor grid and a data distribution. Chapter 8 describes the framework for experi-ments intended to test the validity of the model proposed in Chapter 6 and an optimization algorithm from Chapter 7. The results of some preliminary experiments are also described in Chapter 8. The results of experiments with the model and algorithm are described in Chapter 9, and analysis of those experiements is included in Chapter 10.

Chapter 11 describes revisions to the model proposed in Chapter 6. An algorithm based on the revised model is proposed, and experiments to test the validity of the revised model and algorithm are described. Chapter 12 summarizes the results of our research.

Supplemental material is included in several appendices. Chapter 14 provides both an alphabetical and functional listing of notation, as noted above. Chapter 15 lists function prototypes for all the Message Passing Interface (MPI) functions used in our code. Chapter 16 describes in detail the results of preliminary experiments that are summarized in Chapter 8. Chapter 17 includes listings of the most significant portions of the code used in experiments. Certain portions of this paper are numbered. Within each chapter, figures and tables are numbered as separate sequences. Thus, Table 2 in a particular chapter may appear either before or after Figure 3. Tables and figures are numbered so as to indicate both the chapter and the position within the chapter, as in Table 6.2 and Figure 6.3.

(14)

Chapter 2

Mathematical review

This chapter provides a brief overview of the numerical applications used in experiments to test the validity of our performance models. We also discuss certain aspects of the mathematical problems that underlie the numerical applications.

Finite difference schemes produce approximations to derivatives using function values at discrete points. These schemes can be used to derive iteration formulas foriterative methods, some of which are described in Section 2.3. We use the term finite difference method to describe an iterative method where the iteration formula is derived from a finite difference scheme.

We use Poisson’s equation on a rectangular domain as a model problem representing the class of problems which are often solved using finite difference methods. This problem produces a banded sparse matrix problem that generates a particular communication pattern.

2.1 Basic mathematical notation

We consider functions of one or more variables in Section 2.2 and later chapters. Our notation may indicate the dependence of function on the independent variables, as in f(x, y), or may omit them, as in f, especially when the dependence has been shown earlier or where the number of independent variables is not essential to the discussion. Although we sometimes use lowercase letters for scalars, the context should allow the reader to distinguish functions from scalars.

(15)

indicate function values at positions on a discrete spatial grid or discrete times, as in ui,j. No confusion should ensue, as the variables indicate whether we refer to a vector or a function.

We use || · ||to denote a norm on <n _{as well as the corresponding induced matrix norm,} where ||A|| = max||x||=1||Ax||. The condition number of A, κ(A), relative to a given norm

|| · || is defined as infinity ifA is singular and asκ(A) =||A|| ||A−1|| otherwise.

2.2 Finite difference schemes for Poisson’s equation

We consider Poisson’s equation, ∇2_u ₌_f _{in one, two and three spatial dimensions.}

Specifi-cally, we consider finding approximate solutions to Poisson’s equation at discrete points of a Cartesian grid by approximating derivatives with finite differences. The special case where

f = 0 is called Laplace’s equation, ∇2_u_{= 0. Let} _u _{be defined on a domain Ω. Appropriate}

boundary conditions on the boundary∂Ω of the region Ω must be specified to be able to com-pletely determine the solution to ∇2_u ₌ _f_{. Only one boundary condition may be specified}

at each point of the boundary. Dirichlet boundary conditions specify the value of u on the boundary; Neumannboundary conditions specify the value of a derivative. More detail may be found in any standard text on partial differential equations (PDEs), such as [30].

PDEs with two independent variables involving second derivatives but not third or higher derivatives can be classified as elliptic, parabolic, or hyperbolic. Poisson’s equation is a pro-totypical elliptic equation. The methods we describe below to approximate partial derivatives for Poisson’s equation can be adapted for hyperbolic equations, such as the wave equation

utt−auxx = 0, and parabolic PDEs, such as the heat equation ut = buxx, where a and b are nonnegative. Such time dependent equations require that initial values be specified in addition to boundary conditions.

In the discussion that follows, we assume that Ω is a rectangular region defined by one or more of the inequalities xlo ≤ x ≤xhi, ylo ≤y ≤ yhi, and zlo ≤ z ≤zhi. By defining a grid on the region Ω, we seek to approximate the solution to Poisson’s equation at the grid points of the region. The notation of Strikwerda [31] is adapted and extended in the discussion that follows. We divide the interval [xlo, xhi] into n equal subintervals and denote the grid points by xi, i= 0, . . . , n, as shown in Figure 2.1. Similarly, in higher dimensions, we partition the

X 0 X 1 X 2 X i−1 X i X i+1 X n−2 X n−1 X n

(16)

other dimensions and use the notationyj andzk. An approximation tou(xi) in one dimension is denoted byvi and analogously in higher dimensions.

On the discrete domain, we use difference operators in a fashion akin to the use of differen-tial operators on the continuous domain. The forward and backward first difference operators are defined, respectively, by the equations

δ+vi= vi+1_h−vi

and

δ−vi = vi−_hvi−1.

The central first difference operator, denoted δ0 is defined by

δ0vi = 1₂(δ+vi+δ−vi) = vi+1₂−_hvi−1.

The central second difference operator, denoted δ2, is defined byδ2 = (δ+−δ−)/h, and thus

δ2vi = vi+1−2_hv2i+vi−1.

We sometimes useδ_x2to denote the central difference approximation for the operatord2/dx2

in one dimension. In two dimensions, letδ2_xandδ2_ydenote the second difference approximations for the operators ∂2/∂x2 and∂2/∂y2, respectively.

Strikwerda [31] uses the notation above in deriving fourth order accurate approximations to the first and second derivative operators. We reproduce portions of his work below in deriving equations 2.5-2.9. The operators δ0 and δ2 are themselves second order accurate

approximations to the first and second derivative operators, respectively. Consider the Taylor series for u(x) at xi. We have

u(x) =u(xi) +u0(xi)(x−xi) +u00(xi)

(x−xi)2

2 +u

000 (xi)

(x−xi)3

6 +· · · (2.1) By lettingx=xi+1 andx=xi−1 in equation 2.1, where|x−xi|=h, and subtracting, observe that

u(xi+1)−u(xi−1) = 2u0(xi)h+u000(xi)

h3

3 +O(h

5₎_, _(2.2)

where O(h5) denotes terms containing fifth and higher powers of h. Thus, δ0 = _dxd +O(h2)

is second order accurate. By letting xassume the valuesxi+1, xi, xi−1 in equation 2.1, we see

that δ2= _dxd22 +O(h2) since

(17)

Now, since xi and h are arbitrary in equation 2.2, we have

δ0u =

du

dx+

h2

6

d3u

dx3 +O(h

4_{) = (1 +} h2

6

d2 dx2)

du dx+O(h

4₎

= (1 +h

2

6 δ

2₎du

dx+O(h

4₎_. _(2.4)

It follows that

(1 + h

2

6 δ

2₎−1_δ 0u=

du dx+O(h

4₎_. _(2.5)

An alternative approximation can be derived from equation 2.2. We have

δ0u=

du

dx+

h2

6

d3u dx3 +O(h

4_{) =} du

dx+

h2

6 δ

2_δ

0u+O(h4),

and thus

(1− h

2

6 δ

2₎−1_δ 0u=

du dx+O(h

4₎_. _(2.6)

A similar approach yields two fourth order accurate formulas for the second derivative,

d2

dx2 = (1 +

h2

12δ

2₎−1_δ2₊_O₍_h4₎_, _(2.7)

and

d2

dx2 = (1−

h2

12δ

2₎−1_δ2₊_O₍_h4₎_.

In one dimension, Poisson’s equation is d_dx2u2 = uxx = f(x). We approximate uxx = f(x) using finite differences on a grid with grid points xi, i= 0, . . . , n, wherexi−xi−1 =h,i >0.

By Taylor series, we have δ_x2ui = fi +O(h2), where ui = u(xi) and fi = f(xi). Letting vi denote an approximation to u(xi) and dropping the last term on the right gives us the second order accurate discrete approximation

vi+1−2vi+vi−1

h2 =fi. (2.8)

By rewriting equation 2.8 to isolate vi, one derives a system of equations appropriate for an iterative method such as Jacobi, Gauss-Seidel, or SOR. These methods are described in Section 2.3. Another approximation is found by applying equation 2.7 touxx =f(x), dropping fourth order terms and multiplying by (1 + h₁₂2δ2), yielding δ2u= (1 +h₁₂2δ2)f, or

vi+1−2vi+vi−1

h2 =fi+

h2

(18)

to replace∂2/∂x2and∂2/∂y2. Assuming the spacing of the grid is the same in each dimension, with ∆x= ∆y=h, we have

δ_x2ui,j+δy2ui,j =f(xi,j) +O(h2), or

vi+1,j+vi−1,j+vi,j+1+vi,j−1−4vi,j

h2 =fi,j+O(h

2₎_. _(2.10)

If the grid spacing differs by dimension, with ∆x=h, ∆y=k, and ∆ = max{∆x,∆y}, then

vi+1,j−2vi,j+vi−1,j

h2 +

vi,j+1−2vi,j +vi,j−1

k2 =fi,j+O(∆

2₎_. _(2.11)

Thus, this second order approximation to Poisson’s equation in two dimensions at each grid point (xi, yj) involves function approximations at the point itself and immediately adjacent points. These points compose a five point stencil, as illustrated in Figure 2.2.

Rosser [24] derives a fourth order accurate finite difference formula for the two dimensional version of Poisson’s equation. Let ∆ = max{∆x,∆y}. By equation 2.7, uxx+uyy =f(x, y) can be approximated by

1 +∆₁₂x2δ2_x−1δ_x2v+1 +∆₁₂y2δ_y2−1δ2_yv=f+O(∆4),

which yields

1 +∆₁₂y2δ2_yδ_x2v+1 +∆₁₂x2δ2_xδ_y2v = 1 +∆₁₂x2δ_x2 1 +∆₁₂y2δ_y2f+O(∆4) = h1 +₁₂1(∆x2δ2_x+ ∆y2δ_y2)if +O(∆4).

Dropping fourth order terms produces the fourth order accurate scheme

δ_x2+δ2_y+₁₂1 (∆x2+ ∆y2)δ_x2δ_y2v=f +₁₂1(∆x2δ_x2+ ∆y2δ2_y)f. (2.12)

Under the assumption that ∆x= ∆y=h, Rosser’s scheme can be written

1

6(vi−1,j−1+vi−1,j+1+vi+1,j−1+vi+1,j+1) + 2

3(vi,j−1+vi,j+1+vi−1,j+vi+1,j)− 10

3vi,j

= h₁₂2(fi,j−1+fi,j+1+fi−1,j+fi+1,j+ 8fi,j). (2.13)

(19)

x₀ x₁ x_i−1 x_i x_i+1 x_n−1 x_n y_m

y_m−1

y_j+1

y_j

y_j−1

y₁

y₀ k h i,j i+1,j i−1,j i,j+1 i,j−1

x₀ x₁ x_i−1 x_i x_i+1 x_n−1 x_n

y_m

y_m−1

y_j+1

y_j

y_j−1

y₁

y₀ k h i,j i+1,j i−1,j i,j+1 i,j−1 i−1,j+1 i−1,j−1 i+1,j+1 i+1,j−1

Figure 2.2: Five and nine point stencils in two dimensions foruxx+uyy =f at (xi, yj).

To use equation 2.13 in an implementation of an iterative method such as Jacobi, Gauss-Seidel, or SOR, we isolate vi,j by writing

vi,j = ₂₀1 (vi−1,j−1+vi−1,j+1+vi+1,j−1+vi+1,j+1)+ 1

5(vi,j−1+vi,j+1+vi−1,j+vi+1,j)− 10

3 vi,j+

h2

40(fi,j−1+fi,j+1+fi−1,j+fi+1,j+ 8fi,j). (2.14)

We observe that equation 2.12 takes a more complex form if ∆x =h 6= k= ∆y. In this case, we have

h2₊_k2

12 (vi−1,j−1+vi−1,j+1+vi+1,j−1+vi+1,j+1)+ 10h2−2k2

12 (vi,j−1+vi,j+1) +

−2h2+10k2

12 (vi−1,j+vi+1,j)−

5(h2₊_k2₎

3 vi,j

= h2₁₂k2(fi,j−1+fi,j+1+fi−1,j+fi+1,j+ 8fi,j). (2.15)

To use equation 2.15 in an implementation of an iterative method, we isolatevi,jby writing

vi,j = ₂₀1 (vi−1,j−1+vi−1,j+1+vi+1,j−1+vi+1,j+1)+ 10h2−2k2

20(h2₊_k2₎(vi,j−1+vi,j+1) +−2h

2₊₁₀_k2

20(h2₊_k2₎(vi−1,j+vi+1,j)− h2_k2

(20)

We approximate Poisson’s equation in three dimensions,uxx+uyy+uzz =f(x, y, z), by

1 +∆₁₂x2δ2_x−1δ2_xu+1 +∆₁₂y2δ_y2−1δ_y2u+1 +∆₁₂z2δ_z2−1δ2_zu=f+O(∆4),

which yields

1+∆₁₂y2δ2_y1+∆₁₂z2δ_z2δ_x2u + 1+∆₁₂x2δ_x21+∆₁₂z2δ2_zδ_y2u+1+∆₁₂x2δ_x21+∆₁₂y2δ2_yδ_z2u

= 1 +∆₁₂x2δ_x2 1 +∆₁₂y2δ_y2 1 +∆₁₂x2δ_x2f+O(∆4)

= h1 +₁₂1 (∆x2δ_x2+ ∆y2δ_y2+ ∆z2δ_z2)if+O(∆4). (2.17)

Dropping fourth order terms in the expansion of the products on the left side of equation 2.17 and the O(∆4) term on the right gives us the fourth order accurate scheme

δ2_x+δ_y2+δ_z2 + ₁₂1(∆y2δ_y2+ ∆z2δ_z2)δ2_xv

+ ₁₂1(∆x2δ2_x+ ∆z2δ_z2)δ_y2v

+ ₁₂1(∆x2δ2_x+ ∆y2δ_y2)δ2_zv =f+₁₂1(∆x2δ2_x+ ∆y2δ_y2+ ∆z2δ_z2)f. (2.18)

Assuming ∆x= ∆y= ∆z=h, this scheme can be written

1

3 (vi−1,j,k+vi+1,j,k+vi,j−1,k+vi,j+1,k+vi,j,k−1+vi,j,k+1) + 1

6 (vi−1,j−1,k+vi−1,j+1,k+vi+1,j−1,k+vi+1,j+1,k+vi−1,j,k−1+vi−1,j,k+1+

vi+1,j,k−1+vi+1,j,k+1+vi,j−1,k−1+vi,j−1,k+1+vi,j+1,k−1+vi,j+1,k+1)−4vi,j,k = h₁₂2 (fi,j,k−1+fi,j,k+1+fi,j−1,k+fi,j+1,k+fi−1,j,k+fi+1,j,k+ 6fi,j,k). (2.19)

This scheme produces a nineteen point stencil for v and a seven-point stencil for f, as illus-trated in Figure 2.3. We can isolate vi,j,k for use in an iterative method in the same fashion as in equations 2.14 and 2.16.

If it is not the case that ∆x = ∆y = ∆z = h, then by letting hx = ∆x, hy = ∆y, and

hz = ∆z, equation 2.18 can be written

[−6 +2₃(h2_x+h_y2+h2_z)]vi,j,k+ [1−1₆(2h2x+h2y+h2z)](vi+1,j,k+vi−1,j,k) +

[1−1₆(h2_x+ 2h2_y +h2_z)](vi,j+1,k+vi,j−1,k) + [1−1₆(h2x+h2y+ 2h2z)](vi,j,k+1+vi,j,k−1) + 1

12(h 2

x+h2y)(vi+1,j+1,k+vi+1,j−1,k+vi−1,j+1,k+vi−1,j−1,k) +

1 12(h

2

x+h2z)(vi+1,j,k+1+vi+1,j,k−1+vi−1,j,k+1+vi−1,j,k−1) + 1

12(h 2

y+h2z)(vi,j+1,k+1+vi,j+1,k−1+vi,j−1,k+1+vi,j−1,k−1)

= (1−1 6(h

2

x+h2y+h2z))fi,j,k+h

2 x

12(fi+1,j,k+fi−1,j,k) +

h2 y

12(fi,j+1,k+fi,j−1,k) +

h2 z

(21)

Figure 2.3: Seven and nineteen point stencils in three dimensions foruxx+uyy+uzz =f.

Subtracting the first term on the left side and the entire right side from each side of equation 2.20 and dividing through by 6− 2

3(h 2

x+h2y+h2z) produces a form suitable for an iterative method.

2.3 Stationary iterative methods

An iterative method is one where successive iterates are computed using previous iterates, beginning with some initial data. Stationary iterative methods are those of the formvnew =

M vold+w. Stationary methods are so named because the matrixM is fixed throughout the iteration. As the iteration progresses, we expect to converge to a fixed point of the iteration, where vnew = vold, under appropriate conditions. Successive iterates are increasingly better approximations of an exact solution.

Stationary iterative methods can be used to find a numerical solutionuto the linear system

Au = f. The iteration matrix M is derived from some splitting of A, where A =A1+A2.

Thus,A1u=f−A2u, andu=A1−1(f−A2u) =A−11f−A

−1

1 A2u=A−11f+M u. The splitting

of Ais done so that systems of the form A1y=z are easy to solve. Typically,A1 is chosen so

that its inverse is easy to compute.

(22)

the communication patterns they produce in parallel implementations. Additional details on iterative methods can be found in many texts on numerical methods such as [2] and [28].

LetAu=f be a system of linear equations andA=L−D+U, whereLis lower triangu-lar,D is diagonal, andU is upper triangular. Let v be an approximation tou. We define the

error,e, ase=u−v. Ifu is unknown, then we are unable to computee. However, we define the residual,r, asr =f−Av, which is a computable measure of how wellv approximatesu. Observe that since Au=f, thenr =Au−Av=Ae.

In the iterative methods we consider here, we compute an approximation v at each suc-cessive iteration. We distinguish the sucsuc-cessive iterates using the notation v(0), v(1), v(2),· · ·, where v(0) is an initial guess. Similarly, we use the notation e(m) and r(m) to denote the error and residual at the mth iteration. The notation {v(m)} denotes the sequence of vectors

v(0)_{, . . . , v}(m)_{. The methods discussed below are applicable to matrices without special}

struc-ture, although the one dimensional Poisson equation we use to illustrate them is tridiagonal. We observed that equation 2.8 for the discrete one dimensional Poisson problem could be rewritten for use in an iterative method. Specifically, we write

v(_im+1) = 1₂(v_i(₊₁m)+v(_i₋m₁)−h2fi). (2.21) By isolating vi,j in equations 2.10 and 2.11 on the left side, we can derive iteration equations analogous to equation 2.21. Similarly, we can derive new iterates on the left side of equations 2.14 and 2.16 by using the previous iterate on the right.

The Jacobi method uses formulations such as those just described to produce successive iterates. In matrix form, Au=f can be written (L−D+U)u=f, or Du= (L+U)u−f. The Jacobi method can be written in matrix form as

v(m+1)=D−1(L+U)v(m)−D−1f. (2.22)

The Gauss-Seidel method is closely related to the Jacobi method. Consider equation 2.21 above. Clearly, the order in which the unknowns vi are computed is arbitrary, although lexicographic ordering is typically used. Suppose that lexicographic ordering is used so that the unknowns are computed from left to right in Figure 2.1. When we compute v(_im+1), the new value of v(_i₋m₁+1) has already been computed. The Gauss-Seidel method uses the most recent information, so for the discrete one dimensional Poisson problem, we use

(23)

Recalling thatA=L−D+U, we can writeAu=fin matrix form as (L−D)u=f−U u. The Gauss-Seidel method has the matrix representation

v(m+1) = (D−L)−1U v(m)−(D−L)−1f . (2.24)

Note that a computer implementation of the Jacobi method requires that two successive iterates be stored. In contrast, since the Gauss-Seidel method uses the most recently computed information, each element of v can be overwritten as it is computed. Thus, the Gauss-Seidel method reduces storage requirements for the iterates by half as compared with the Jacobi method.

A variant of the Jacobi method is the weightedJacobi method. We use the calculation in equation 2.21, but only as an intermediate value. Let

v_i∗= 1₂(v(_i₊₁m)+v_i(₋m₁)−h2fi). (2.25) Then, the weighted Jacobi method computes the new iterate using

v_i(m+1)= (1−ω)v(m)+ωv∗=v(m)+ω(v∗−v(m)), (2.26) whereω ∈ <is arelaxation parameter. If we letPJ =D−1(L+U), andPω = (1−ω)I+ωPJ, then the weighted Jacobi method can be written in matrix form as

v(m+1)=Pωv(m)−ωD−1f. (2.27)

The SOR method is derived in a similar fashion from the Gauss-Seidel method. We use the Gauss-Seidel iteration formula in equation 2.23 as an intermediate value, defining

v∗_i = 1₂(v(_i₊₁m)+v_i(₋m₁+1)−h2fi). (2.28) The SOR method uses the same calculation, equation 2.26, as the weighted Jacobi method, albeit with a different definition ofv_i∗. If we letPG = (D−L)−1U andPω= (1−ω)I+ωPG, then the SOR method can be written in matrix form as

v(m+1)=Pωv(m)−ω(D−L)−1f. (2.29)

(24)

Figure 2.4: Red-black scheme on a one dimensional grid. Black points are filled; red points are open.

and black, and the iteration sweeps first through the red points and then the black. We illustrate the color pattern for one and two dimensions in Figures 2.4 and 2.5, respectively.

In two dimensions, a red-black scheme provides local uncoupling of the unknowns if a five point stencil is used. If a nine point stencil is used, a four color scheme can provide local uncoupling of the unknowns. The iteration would then sweep through each color in turn. In three dimensions, two colors suffice if a seven point stencil is used; eight colors suffice for a nineteen point stencil. The structure of the domain and the stencils noted above for Poisson’s equation makes identification of the color patterns evident. For arbitrary discretizations, finding the minimum number of colors for decoupling of unknowns is equivalent to the graph coloring problem, which has been shown to be NP-complete.

(25)

Chapter 3

Overview of parallel computing

The development of parallel machines is motivated largely by the demand of users whose programs cannot be executed in a reasonable time on a sequential machine. The demand for supercomputers is driven by programs, typically scientific and engineering applications, that use large data sets. Parallel machines provide both additional processors and more memory, providing the potential to make applications tractable that otherwise would not be. However, parallel execution often implies time for communication not required when a program is executed sequentially. On a distributed memory machine, when the computation on one processor depends on data in a remote memory, communication is required. Programs that require no interprocessor communication are classified as trivially parallel or embarrassingly parallel. Most parallel programs are not trivially parallel and are characterized by data dependencies requiring communication.

There are distinct types of parallelism, including data parallelism and task parallelism. Data parallelism exists when the same operations can be performed by different processors on different subsets of an application’s data. Task parallelism exists when a program contains two or more tasks where the order of execution does not affect the results of the program. Such tasks can be executed in parallel by different processors.

(26)

While some commercial programs exhibit task parallelism or even trivial parallelism, such as searching databases, supercomputing is primarily driven by scientific and engineering ap-plications. Commercial software for scientific and engineering applications exists, of course. For example, fluid dynamics in aircraft design, simulation and modeling in petroleum explo-ration, and many others have commercial application. Our research does not consider task parallelism. Instead, we consider those applications that exhibit data parallelism where data distribution is a significant concern.

Given that some effort, sometimes significant effort, is required to adapt a sequential pro-gram to run on a parallel machine, it is not surprising that propro-grams run on supercomputers tend to share certain characteristics. First, they tend to be large programs, either in terms of the number of computations performed, the size of the data set, or both. In addition, the computations tend to be dominated by floating point operations. Moreover, these programs tend to be highly iterative, with much of their computation consisting of the kernels of loops, often nested loops. Programs run on supercomputers are often regular in some sense; fre-quently the data for such programs consists of arrays whose elements correspond to points in space and/or time on a uniform grid or a grid whose intervals are defined by a formula. Although some parallel machines can support independent instruction streams, it is common for parallel programs to execute the same instructions on different data. Programs of this type fit the single program, multiple data (SPMD) model. Parallel programs also tend to be characterized by relatively low I/O, frequently reading all data initially and writing results only after all computations are complete. These characteristics are common to many parallel programs because they describe factors that allow parallelism to be worthwhile. A parallel version of a sequential program is faster only if the time for communication, synchronization and other overhead is less than that saved by partitioning the computation.

3.1 Data distribution

We use the term data distribution frequently in the following chapters. Data must be parti-tioned among processors to exploit parallelism in a program and allow execution on a parallel machine. By partitioning we mean mapping a p dimensional array of data to an abstract q

(27)

and mapping the logical processor grid to physical processors. We use the termdata distribu-tion occasionally to indicate the entire mapping process but more usually to indicate the first two steps of the mapping process. We briefly examine the first two steps below, describing data alignment and then outlining common data partitioning schemes.

other objects

Group of Arrays or

aligned objects

Abstract processors in Cartesian grid

Physical Processors

Figure 3.1: The HPF data mapping model.

3.2 Data alignment

Data alignment is a technique used to associate elements of distinct arrays prior to distri-bution. Alignment is motivated by the owner computes rule, i.e. the owner of the left side element executes the computation for an assignment. It is desirable that objects used in a computation reside in memory on the same processor. A primary consideration is the relation-ship of variables in loops, a focus Gallivan [9] justifies by the importance of loops in scientific computation and the availability of tools for data dependence analysis.

Consider the code below. The HPF directives reflect the offsets between elements of A, B, and C used in the loop kernel. In a given iteration, the elements of A and B that are used are offset by 3 rows, the elements of A and C by 3 columns. The effect of the ALIGN directives is shown in Figure 3.2, where elements of B and C are concealed by the element of A with which they are aligned.

$HPF ALIGN B(I+3,J) WITH A(I,J)

$HPF ALIGN C(I,J+3) WITH A(I,J)

DO I = 1, N-3 DO J = 1, N-3

A(I,J) = B(I+3,J) + C(I,J+3) END DO

(28)

1,1 1,2 1,3

2,1 2,2 2,3

3,1 3,2 3,3

4,1 4,2 4,3

5,2 5,3 5,1 1,1 1,2 1,3 1,3 2,1 2,1 2,2 2,2 2,3 2,3 3,1 3,1 3,2 3,2 3,3 3,3 4,1

5,1 5,2 5,3 4,2 4,3

1,1 1,2 1,4 1,5 1,6 1,7 1,8 1,9

2,4 2,5 2,6 2,7 2,8 2,9

3,4 3,5 3,6 3,7 3,8 3,9

4,4 4,5 4,6 4,7 4,8 4,9

5,4 5,5 5,6 5,7 5,8 5,9 1,5 1,6 1,7 1,8 1,9

2,9 3,9 3,8 3,7 2,7 3,6 2,6 3,5 2,5 3,4 2,4 1,4 2,8

A

B

C

Figure 3.2: The effect of the HPF ALIGN directive.

If the relationship between elements of arrays changes during the program, it may be desirable to modify an initial (static) alignment. One might realign arrays within a subroutine. Lee [20] shows that a static distribution is not optimal when communication costs exceed a threshold level. A similar analysis could be applied to alignment. HPF provides the REALIGN directive to handle such problems.

3.3 Data partitioning

Several authors, e.g. [19], [7], [32], classify data partitioning schemes into three major types, i.e. block, cyclic, and block cyclic. These partitioning schemes are important since they are supported in the HPF language specification [21] and in other parallel Fortran versions such as CRAFT and Vienna Fortran. The schemes are oriented toward the arrays of data that are pervasive in scientific and engineering applications. The data partition can be specified independently for each dimension of a multidimensional array. A programmer can implement these schemes in parallel programs where support is not provided by the language.

(29)

the definition to higher dimensions is then considered. We assume that both the data and pro-cessor arrays are indexed from 0. The definitions of data partitions use the following notation.

p: the number of processors in the processor array.

w: the number of elements in the data array.

b: the number of elements in a data partition block.

Definition 3.1 Block – A group of contiguous elements in a data array.

Definition 3.2 Block cyclic partition – A mapping from a data array of w elements to an

abstract array of p processors, where contiguous elements of the data array are mapped to processors in blocks of size b. IfB =pb,b >0, then elementiof the data array is mapped to processor b(imod B)/bc.

Definition 3.3 Block partition –A special case of the block cyclic partition whereb=dw/pe.

For 0≤i < w, element iof the data array is mapped to processorbi/bc.

Definition 3.4 Cyclic partition – A special case of the block cyclic partition where b = 1.

Then, for 0 ≤i < w, element iof the data array is mapped to processorimod p.

In Figure 3.3, we illustrate the difference in these schemes for an abstract one dimensional array of 4 processors, 0, 1, 2, 3, and a one dimensional data array of 17 elements, 0, . . . , 16. Observe that if b is not a divisor of w, then bdw/be > w. In such a case, the number of data elements assigned to each processor is not identical. This occurs in each of the cases illustrated in Figure 3.3. In general, the number of elements in the last block may be less than the number of elements in other blocks, as in the block cyclic and block partitions in Figure 3.3. Observe that in every case, no processor is assigned more elements than PE0.

It is possible, of course, to not partition the data at all, but assign all the data to a single processor. This corresponds to choosing a block size of w or more. In the multidimensional case where the data array has more dimensions than the processor array, the data in some dimensions would not be partitioned.

A block cyclic data partition mapping an n dimensional array of data to an abstract m

(30)

Cyclic, b = 1

Block Cyclic, b = 3 Block, b = 5

PE 3 PE 2

PE 1 PE 0

Figure 3.3: Block, cyclic, and block cyclic data partitions for a logical array of four processors and a data array of 17 elements, with block sizes 5, 1, and 3, respectively.

one dimensional case above. If the dimensionality of the data array is greater than that of the abstract array of processors, we can create an abstract array of processors consisting of

q1×. . .×qm elements, where each pi = qj for some j ≥ i by inserting n−m dimensions

of size 1. Then the data and processor arrays contain the same number of dimensions, and we can construct a mapping from the ith dimension of the data array to theith dimension of the processor array as in the one dimensional case. The construction is reversed if the dimensionality of the data array is less than that of the abstract array of processors.

3.4 Message Passing Interface

A standard for a portable message passing library called Message Passing Interface (MPI) was developed in 1993 by the Message Passing Interface Forum (MPIF), a collaboration of vendors, scientists and others with an interest in parallel computing. It is a voluntary standard, as is the High Performance Fortran (HPF) standard, and has not yet been adopted by standards bodies such as ANSI or IEEE. Like HPF, the collaboration of contributors produced a definition that allowed vendors flexibility in implementation, while providing a framework that allows development of portable programs. The standard for MPI was published in 1994 in [22].

(31)

(32)

Chapter 4

Logical processor grids

A logical processor grid is an abstract Ddimensional Cartesian grid. (Hereafter, we useDto denote the number of dimensions to avoid confusion, as the more commonnandkare used in different contexts.) Figure 4.1 illustrates two possible grids (a. and b.) for four processors in two dimensions, and two possible grids (c. and d.) for eight processors in three dimensions. Given P processors, a maximal grid is one where no dimension may be increased without the product of the dimensions exceeding P.

c. d.

a.

b.

Figure 4.1: Examples of processor grids with 4 and 8 processors.

Mapping data for a grid structured problem to a logical processor grid is a natural way to implement a parallel program using the single program, multiple data (SPMD) model. In this chapter, we consider the number of maximal logical grids and how to enumerate them given

(33)

4.1 The number of maximal grids for

P

processors

Souriau [27] considered functions that allow computing the number of ordered factorizations of an integer using D parts. Our problem is more general. Given an integer P, how many

D-tuples with integer elements are there where the product of the elements is less than or equal toP, but no element may be increased without the product exceedingP? Equivalently, given P processors, how many maximal processor grids are there in Ddimensions?

Let P and D be given and let di be the extent of a processor grid in the ith dimension. We require that P ≥ QD

i=1di and thus min{d1, . . . , dn} ≤

D√

P. In one dimension, there is, trivially, one maximal processor grid. In two dimensions, either d1 ∈ [1,b

√

Pc] or d2 ∈

[1,b√Pc]. A grid with dimensions d1 and bP/d1c is maximal in the first case. A grid with

dimensions bP/d2c and d2 is maximal in the second case. Thus, there are 2b

√

Pc maximal grids in two dimensions, or 2√P−1 ifbP/b√Pc c=b√Pc.

As the number of dimensions increases, we want to bound the number of maximal grids in terms of P. We first prove a result that allows us to bound a sum we will encounter with an integral.

Lemma 4.1. If c≥1 and 0< δ < c, then √2δ c <

Rc+δ

c−δ x

−1/2_dx_.

Proof. Let c ≥1 and 0< δ < c. For all x ≥0, let √x denote the positive root of x. Note

that, since c >0, √c >0 and thus √1

c >0. Since f(x) =x

2 _{is continuous and increasing on}

[0, c] and [c,2c], there exists an1>0 such that

c−δ = (√c−1)2 =c−21

√

c+₁2,

and an 2>0 such that

c+δ = (√c+2)2 =c+ 22

√

c+₂2.

Thus,

δ = 22

√

c+₂2 = 21

√

c−₁2. (4.1)

Then

1−2 = (22+12)

1 2√c >0.

(34)

It is easy to see that Rc+δ

c−δ √1cdx=

2δ √

c. Thus, the desired result will follow if we show that the difference between the integrals Rc+δ

c−δ x

−1/2_dx _andRc+δ

c−δ √1cdxis positive.

Z c+δ

c−δ

x−1/2dx−

Z c+δ

c−δ 1 √

cdx

=

2√xc+δ

c−δ−

_x

√

c c+δ

c−δ = 2(2+1)−

2δ

√

c

= 2(2+1)−

1 √

c(22

√

c+₂2+ 21

√

c−₁2) (by (4.1) ) = √1

c(

2

1 −22)>0. 2

Corollary 4.2. If c≥1, then √1

c <

Rc+1/2

c−1/2 x

−1/2_dx_.

We use Corollary 4.2 to bound the number of maximal grids in three dimensions.

Lemma 4.3. Given P processors, the number of maximal three dimensional processor grids

is O(P2/3).

Proof. IfP = 1, the result is immediate, so suppose thatP >1. Ifd1 = 1, there are at most

2b√Pc maximal grids, by the analysis for two dimensions above. Similarly, there are at most 2b√Pc maximal grids if either d2 = 1 or d3 = 1. Excluding the double counting of the cases

where exactly two of d1,d2 and d3 equal 1, we have at most 6b

√

Pc −3 maximal grids when at least one and at most two of d1,d2 and d3 equals 1.

A similar counting argument shows that there are at most 6bp

P/2c −9 distinct maximal grids when min{d1, d2, d3} = 2, after excluding cases counted twice. We have at most

6bp

P/3c−15 distinct maximal grids when min{d1,d2,d3}= 3, again excluding cases counted

twice. An easy inductive argument shows that, for 3 ≤ k < bP1/3c, there are at most 6bp

P/kc −6k+ 3 maximal grids when min{d1, d2, d3} = k. When k = bP1/3c, we have

only one maximal grid when P is a cube or when bP/(bP1/3c2₎_c ₌ _b_P1/3_c_{, and at most}

(35)

bP1/3_c X

k=1

(6bqP/kc −6k+ 3)

≤ 6√P

bP1/3_c X

k=1

q

1/k−6 bP1/3_c

X

k=1

k+ 3bP1/3c

< 6√P

Z bP1/3c+1/2

1/2

x−1/2dx−3bP1/3c(bP1/3c+ 1) + 3bP1/3c

= 12√Phx1/2ibP

1/3_c₊₁_/₂

1/2 −3bP 1/3_c2

< 12 √

PP1/6+1₄P−1/6−√2/2−3bP1/3c2 = 12P2/3−3bP1/3c2−6

√

2P + 3P1/3. 2

A similar approach allows us to derive a crude bound for the number of maximal grids in higher dimensions. LetND(P) denote the number of maximal grids in Ddimensions usingP

processors.

Theorem 4.4. GivenP processors, the number of maximal processor grids inDdimensions

is OD!P(D−1)/D.

Proof. We use induction on D. From above, the proposition holds for positive D ≤ 3.

Assume that ND−1(P) =O

(D−1)!P(D−2)/(D−1). Recall that min{d1, . . . , dD} ≤ D √

P inD

dimensions. For each di, we consider the number of (D−1) dimensional grids using bP/dic processors as di assumes the values 1, . . . ,bD

√

Pc. Note that, for any integer m > 1 and any

c > 0, Rc+1

c x−(m−1)/m > (c+ 1)−(m−1)/m since x−(m−1)/m is decreasing on (0,∞). We may thus bound the number of maximal grids as follows, although some are counted more than once.

ND(P) ≤ D

bD√Pc

X

k=1

O(D−1)!bP/kc(D−2)/(D−1)

≤ D! O P(D−2)/(D−1)+

Z bD

√ Pc

1

_P x

(D−2)/(D−1) dx

!

≤ D! O P(D−2)/(D−1)+P(D−2)/(D−1)h(D−1)x1/(D−1)ib

D√

Pc

1

!

≤ D! OP(D−2)/(D−1)+ (D−1)P(D−2)/(D−1)(P1/(D(D−1))−1) = D! OP(D−1)/D+ (D−2)P(D−1)/D−P(D−2)/(D−1)

(36)

The bound derived in Theorem 4.4 is not tight, since we do not exclude grids counted more than once. However, the bound seems to accurately characterize the dependence of the number of maximal grids on P. Our work in the next section shows that we can generate all maximal grids for any P andD. We consider empirical data on the number of maximal grids thereafter.

4.2 Enumerating maximal grids for

P

processors

A more practical concern than the number of maximal grids is the computational complexity of an algorithm to enumerate them. Figure 4.2 shows the kernel of a program written in C that generates the dimensions of maximal grids in lexicographic order. We show that it is correct and bound its running time below.

As implemented, the code uses global variables P and D for the number of processors and dimensions. The program has three functions not shown here, print em, max of, and

main. The function print em prints the D grid dimensions in order and is called only from

void generate(int idx, int p, int *d, int d_prod, int max_d) /* 1 */

{ /* 2 */

int j, f_rt_p, start; /* 3 */

if (idx == D-1){ /* is this the final dimension? */ /* 4 */

d[idx] = p; /* 5 */

print_em(d); /* 6 */

} /* 7 */

else{ /* iterate over possible d[idx]s */ /* 8 */

f_rt_p = floor(sqrt(p)); /* 9 */

for (j = 1; j <= f_rt_p; j++){ /* 10 */

d[idx] = j; /* 11 */

if (((d_prod/max_d)*(max_d+1))*j*(p/j) > P) /* 12 */

generate(idx+1, p/j, d, d_prod*j, max_of(j, max_d)); /* 13 */

} /* 14 */

start = f_rt_p; /* don’t repeat case if */ /* 15 */

if (p/f_rt_p == f_rt_p) /* floor( p/(floor(sqrt(p)) ) */ /* 16 */ start = f_rt_p - 1; /* is equal to floor(sqrt(p)) */ /* 17 */

for (j = start; j > 0; j--){ /* 18 */

d[idx] = p/j; /* 19 */

if (((d_prod/max_d)*(max_d+1))*j*(p/j) > P) /* 20 */

generate(idx+1, j, d, d_prod*(p/j), max_of(p/j, max_d)); /* 21 */

} /* 22 */

} /* 23 */

} /* 24 */

(37)

line 6 in generate. The functionmax ofsimply returns the greater of its two arguments. The functionmaininitializes P and D, allocates space for the arraydand callsgenerate. The call togenerate frommainpasses 0 toidx,P top, a pointer to the array d, and 1 for the last two arguments. Note that di is stored ind[i−1] in the C language.

Correctness is easy to show if the number of dimensions is one. The sole call to generate

satisfies the condition on line 4. Thus d1 is correctly set to P and print emis called.

If the number of dimensions is greater than one, dimensions are considered in order by recursive calls to generate. One may represent the calls to generate in a tree whose leaves represent either maximal grids or nodes that do not have children because the condition in line 12 or 20 is false. Figure 4.3 illustrates a portion of such a tree for the case where P = 19 and D = 3. Numbers preceding parentheses represent grid dimensions already generated at the beginning of each call; numbers inside parentheses represent the value of the argument

p at the beginning of each call. The root of the tree is the call to generate from main. The lowest level of the tree and some interior subtrees are not shown.

When D > 1, eitherd1 ≤

√

P orQD

i=2di ≤ √

P. The initial call togenerate handles both cases. The first loop handles the first case; the second loop handles the second case. The conditions in lines 12 and 20, which ensure that no dimension can be increased without the product exceeding P, are satisfied during the initial call to generate since both d prod and

max dequal one. The loops passbP/d1cfor the new value ofpand the correct value (idx+ 1)

of the next index. Each iteration passes the product of thedis (i= 1,2,· · ·, idx+1) tod prod and the maximum dimension thus far tomax d. (During the initial call,d1 is passed tod prod

and max d.) Recursive calls to generatebehave similarly while idx < D−1.

Whenever D > 1 andidx= D−1 in line 4 of generate, we are assured that the call was made after the condition in either line 12 or 20 was satisfied. These conditions ensure that no dimension can be increased without their product exceeding P. Thus, we eliminate grids such

(19)

6,1,(3) 6,3,(1) 9,1,(2) 9,2,(1) 19,1,(1) 1,1,(19) 1,2,(9) 1,3,(6) 1,4,(4) 1,6,(3) 1,9,(2) 1,19,(1)

19,(1) 9,(2)

1,(19) 2,(9) 3,(6) 4,(4) 6,(3)

(38)

as 19×2×2 whenP = 96, which is smaller than 24×2×2. (We would encounter 19×2×2 because b96/5c = 19 in line 19 when j = 5 in line 18.) It is not difficult to see that lines 12 and 20 eliminate only grids that are not maximal. Then, an easy inductive argument shows that we set the value of dD in line 5 and call print em in line 6 only when the dimensions generated are those of a maximal grid.

It remains to show that all maximal grids are generated. It suffices to show that we consider

D dimensions in total, and all possible grids in D−idx dimensions when generate is called with p processors and correct values for its other arguments. On the initial call to generate,

idx= 0. Sinceidx is incremented on recursive calls in lines 13 and 21, and the value ofidx is checked in line 4, grids are generated (printed by print em) with D dimensions. Clearly, on each call, the first loop iterates over all possible values ford[idx] that are less than or equal to √

p. Note that the second loop ingeneratesetsd[idx] only to values greater thanb√pcwhich can possibly produce a maximal grid, but that it iterates over all such values. The value of

idx is correctly incremented for recursive calls, as noted above. Also, values of d prod and

max d consistent with their descriptions above are passed.

Since the loops in generate iterate over all possible values of di for a maximal grid on the ith recursive call, and lines 12 and 20 do not eliminate maximal grids, we generate all maximal grids. Since lines 12 and 20 eliminate grids which are not maximal, we conclude that we generate all maximal grids and only maximal grids.

Now consider the running time for the program described above. The function print em

called in line 6 requiresO(D) time to printDgrid dimensions. Excluding the call togenerate,

main requires O(D) time for allocation of the array d, and constant time to initialize P and

D. Calls tomax of require constant time per call. The remaining work is done ingenerate. In one dimension, idx = D−1 in generate. Then, generate requires constant time for the assignment in line 5 and O(D) =O(1) time for the call to print em. In two dimensions,

generate takes constant time to compute startin lines 15-17 and constant time per iteration for the assignments in lines 11 and 19. Evaluation of the conditions in lines 12 and 20 also requires constant time per iteration. Each recursive call requires constant time for computing arguments plus the time needed in the one dimensional case. Since there are at most 2b√Pc recursive calls, generaterequires O(2D√P) time in all in two dimensions.

In three dimensions, generate makes at most 2b√Pc recursive calls when it is called from

(39)

values for p. We use Corollary 4.2 and the facts that (P1/2+ 1)3<(P3/4+ 3/2P1/4+ 1)2 for

P ≥1, andRc+1

c x1/2dx > c1/2 forc≥1, to bound the running time. Ignoring work requiring constant time, and observing that the deepest recursive calls each require O(D) time for the call to print em, we sum to find that the worst case running time (when the conditions in lines 12 and 20 are always true) is

b√Pc

X

i=1

2 b√P/ic

X

j=1

O(D) + b√Pc

X

i=1

2 b√ic

X

j=1

O(D)

= 2O(D) b√Pc

X

i=1

bqP/ic + 2O(D) b√Pc

X

i=1

b√ic

≤ 2 √

P O(D) b√Pc

X

i=1

q

1/i + 2O(D) b√Pc

X i=1 √ i < 2 √

P O(D)

Z b

√ Pc+1/2 1/2

x−1/2dx + 2O(D)

Z b

√ Pc+1 1

x1/2dx

= 4√P O(D)hx1/2ib

√ Pc+1/2 1/2 +

4 3O(D)

h x3/2ib

√ Pc+1 1

< 4√P O(D)

P1/4+1 4P

−1/4₋₁_/√₂ ₊ 4

3O(D)

P3/4+3 2P

1/4_{+ 1}₋₁

= O(D)

₁₆

3 P

3/4₋₂√₂_P _{+ 3}_P1/4_.

InDdimensions, observe that each call toprint emrequires thatgenerateproceed through

Dlevels of recursion. Since each level of recursion typically results in multiple calls to the next level, there are, on average, fewer thanDcalls to generatefor each call toprint em. As noted above, each call toprint emrequiresO(D) time. Thus, we bound the running time ofgenerate

by bounding the number of calls to print em. Our analysis assumes that the conditions in lines 12 and 20 are always true, although this worst case assumption is true only for relatively small values of P orD.

Our analysis will require that we bound b√Pc+ 1 to a fractional power. To this end, we show informally that for any positive integer P and integerk >1, there exists a c such that

√

P + 12

k₋₁

<P1−(12) k

+cP12−( 1 2)

k

+ 12

(k−1)

. (4.2)

(The inequality (P1/2+ 1)3 <(P3/4+ 3/2P1/4+ 1)2 used in the three dimensional case has exponents of the form shown above.) We wish to show that ccan be chosen for a givenk so that the inequality in 4.2 holds for all P. Observe that

P1−(12)

k

+cP12−( 1 2)

k2(k−1)

+cP12−( 1 2)

k

+ 12

(k−1)

<P1−(12) k

+cP12−( 1 2)

k

+ 12

(40)

When (√P+1)2k−1and (P1−(12) k

+cP12−( 1 2)

k

)2(k−1)are expanded, the powers ofP are identical for the first 2(k−1)+ 1 terms. When (cP12−(

1 2)

k

+ 1)2(k−1) is expanded, the powers of P in its first 2(k−1) terms are at least equal to those in the last 2(k−1) terms of (√P+ 1)2k−1. As an example, when k= 3, we have

√

P + 17 = P7/2 + 7P3 + 21P5/2 + 35P2 + 35P3/2 + 21P + 7P1/2 + 1;

P7/8+cP3/84 = P7/2+ 4cP3+ 6c2P5/2+ 4c3P2+c4P3/2;

cP3/8+ 14 = c4P3/2+ 4c3P9/8+ 6c2P3/4+ 4cP3/8+ 1.

Clearly, inequality 4.2 holds ifcis chosen large enough so that the coefficients in the expansions of the two binomials involving c equal or exceed the coefficients of the expansion of (√P + 1)2k−1. A bit of scratch work reveals that the maximal value ofcis required to equal or exceed the coefficient of the middle terms of (√P + 1)2k−1_{. It is not difficult to see that choosing} _c

so that

2c2(k−1) ≥ (2 k₋_{1) !}

2(k−1)_{! (2}(k−1)₋_{1) !} (4.3)

suffices to establish the result in inequality 4.2. In the example above withk= 3, this requires that 2c4 _≥ _7!_/_(4!_·_{3!) = 35. Solving equation 4.3 for} _c _{produces values that do not exceed}

2.02 lnk. For example, solving equation 4.3 gives c ≥ 2.05 when k = 3, and c ≥ 3.96 when

k= 10. We use inequality 4.2 in the proof that follows.

Theorem 4.5. For P processors and D dimensions, the function generate in Figure 4.2

requires OD a P1−2(1−D) time, where a = QD

k=2 2

(2k−3)

2(k−1)₋₁, and where P and D > 1 are positive integers.

Proof. We use induction on D. If D = 2, then D a P1−2(1−D) = 2D√P, which agrees with

our previous result. Let GD(P) be the number of calls made toprint emforP processors and

D dimensions. Assume thatGD(P)≤a P1−2

(1−D)

forD >1, and considerGD+1(P). During

the initial call to generate, as i assumes the values 1,2, . . . ,b√Pc, we make recursive calls with bP/icprocessors and with iprocessors. The total number of calls to print emis thus

b√Pc

X

i=1

GD(bP/ic) + b√Pc

X

i=1

GD(i)

≤ b√Pc

X

i=1

abP/ic1−(12) (D−1)

+ b√Pc

X

i=1

a i1−(12) (D−1)

< aP1−(12) (D−1)

1 +

Z b

√ Pc

1

x−(1−(12) (D−1)₎

dx !

+ a Z b

√ Pc+1 1

x1−(12) (D−1)

(41)

= aP1−(12) (D−1)

1 +h2(D−1)x(12) (D−1)ib

√ Pc 1 ! +a

2−(1₂)(D−1)−1x2−(12) (D−1)b

√ Pc+1 1

< a P1−(12) (D−1)

1 + 2(D−1)P(12) D

−1+a2−(1₂)(D−1) −1

P1−(12) D

+cP12−( 1 2)

D

= a 2

D

2−(1₂)(D−1)

P1−(12) D

+a1−2(D−1)P1−(12) (D−1)

+a c

2−(1₂)(D−1)P

1 2−(

1 2)

D

= 2

(2D−1)

2D₋₁ D

Y

k=2

2(2k−3) 2(k−1)₋₁P

1−(1₂)D

+a1−2(D−1)P1−(12) (D−1)

+a c

2−(1₂)(D−1)P

1 2−(

1 2) D = O D+1 Y k=2

2(2k−3) 2(k−1)₋₁P

1−(1₂)D !

.

The proposition follows from the observation that each call toprint emrequiresO(D) time.2

The product in the bound above grows rapidly withD. It is not hard to see that forD≥2 D

Y

k=2

2(2k−3) 2(k−1)₋₁ =O

D−2

Y

k=0

2k

!

=O2(D2−3D+2)/2. (4.4)

This bound on the computational complexity ofgeneratesuggests that the algorithm is compu-tationally infeasible for large values ofDorP. However, if we evaluatef(D) =QD

k=2 2

(2k−3)

2(k−1)₋₁,

we find that f(2) = 2, f(3) = 16/3, and f(4) = 512/21. Then, for D ≤ 4, it is not un-reasonable to treat f(D) as a constant, in which case generate requires O(P1−2(1−D)) time. Moreover, the seeming infeasibility of the algorithm for large P and D may derive from the number of maximal grids rather than any defect in the algorithm. It is difficult to conceive of an algorithm to generate any combinatorial object in less than constant time per element. The actual computational complexity of generate seems to be much less than suggested by the bound above. The bound in Theorem 4.5 relies on bounding sums with integrals and bounding bP/ic by P(1/i). In addition, no consideration is given to the number of non-maximal grids eliminated by the conditions in lines 12 and 20.

Table 4.1 lists the number of leaves in a tree like that in Figure 4.3 representing calls to

generatefor selected values ofP andD. It compares the number of leaves to a function of the bound found in Theorem 4.5. We use FD to denote the number of leaves. We deviate slightly from our previous use of GD, using it to denote the value of bDQDk=2 2

(2k−3)

2(k−1)₋₁ P

1−2(1−D)

c, i.e. the bound for rather than the actual number of calls to print em. We compare FD to bGD/(D−1)!c. We do not tabulateF2 andG2, since they differ by at most 1.

Modeling, Predicting, And Optimizing Parallel Performance Of Grid Stuctured Problems

MODELING, PREDICTING, AND OPTIMIZING

PARALLEL PERFORMANCE OF

GRID STRUCTURED PROBLEMS

Contents

List of Tables

List of Figures

Chapter 1

Introduction

Chapter 2

Mathematical review

2.1

Basic mathematical notation

2.2

Finite difference schemes for Poisson’s equation

2.3

Stationary iterative methods

Chapter 3

Overview of parallel computing

3.1

Data distribution

3.2

Data alignment

A

B

C

3.3

Data partitioning

3.4

Message Passing Interface

Chapter 4

Logical processor grids

4.1

The number of maximal grids for

P

processors

4.2

Enumerating maximal grids for

P

processors