Design checklist.
4.6 Case Study: Matrix Multiplication
4.6.3 A Systolic Algorithm
We still have not said the last word about the ideal data distribution for matrix-matrix multiplication! An alternative algorithm allows the broadcast operations used in the preceding algorithm to be replaced with regular, nearest-neighbor (``systolic'') communications. However, data must be distributed among tasks in a different fashion. As before, we assume that A, B, and C are decomposed into submatrices. Each task (i,j) contains submatrices , , and
, where . This data layout is illustrated in Figure 4.14. Computation proceeds in steps. In each step, contributions to C are accumulated in each task, after which values of A move down and values of B move right. The entire computation requires a total of messages per task, each of size , for a cost of
Communication costs are less by a factor of about than in Equation 4.4. Again, this benefit must be weighed against the cost of converting matrices A, B, and C into the layout required by this algorithm. This analysis is left as an exercise.
4.7 Summary
Modular design techniques are fundamental to good software engineering practice. In this chapter, we have shown how these techniques can be applied to the design of parallel programs. The major points are as follows:
1. The central tenets of modular design, such as simple interfaces and information hiding, apply in parallel programming just as in sequential programming.
2. Data distribution is an important implementation detail that, if abstracted out of a module interface, can facilitate code reuse.
3. It is useful to distinguish between sequential, parallel, and concurrent composition of parallel modules. Sequential composition is simple but inflexible. Parallel composition can be used to improve locality and scalability. Concurrent composition is the most general form.
4. Performance models can be composed, but care must be taken to account for communication costs at interfaces, overlapping of computation and communication, and other factors.
In Part II, we show how the modular design techniques introduced in this chapter can be applied when developing programs using a range of parallel programming tools.
Exercises
1. Discuss ways in which modular design techniques are used in the design of an automobile engine, house, or suspension bridge.
2. Identify ways in which modularity ideas might apply to the management of an orchestra, educational institution, or company.
3. Develop analytic models for the maximum throughput (in requests processed per second) supported by the centralized and distributed tuple space implementations outlined in Section 4.5.
4. Using a language that supports concurrent composition, such as CC++ \ or Fortran M, implement centralized and distributed tuple space modules. Study the performance of these modules in a simple parameter study problem, and relate performance to the models of Exercise 3.
5. Discuss ways in which the database search algorithm of Section 4.5 could be modified to avoid the central bottleneck inherent in a single manager.
6. An alternative parallel algorithm for the 2-D FFT considered in Section 4.4.1 assumes a fixed 2-D decomposition of data structures; hence, communication is required within each 1-D FFT. Use a performance model to contrast this algorithm with those described in the text. Assume that the communication costs for a 1-D FFT algorithm operating on N points
on P processors are .
7. A problem comprises two components, A and B. A can be solved in 1000 seconds on computer C1 and in 5000 seconds on computer C2 ; B requires 4000 and 2000 seconds on C1 and C2, respectively. The two computers are connected by a 1000-km optical fiber link that can transfer data at 100 MB/sec with a latency of 10 msec. The two components can execute concurrently but must transfer 10 MB of data 10,000 times during execution. Is it cheapest (in terms of computational resources consumed) to solve the problem on computer C1, on computer C2, or on both computers together?
8. A problem similar to that of Exercise 7 is found to use fewer computational resources when run on the two networked computers than on either computer alone. Your public relations office proposes to promote this as an example of superlinear speedup. Do you agree?
9. A climate model consists of an atmosphere and ocean model. At each step, the models both perform a finite difference computation and exchange five 2-D arrays. Develop performance models for two alternative parallel implementations, based on sequential and parallel composition respectively. Discuss the relative performance of the two implementations.
10. Determine the problem size and processor count regimes in which each of the three convolution algorithms described in Example 4.4 would be faster, assuming machine parameters characteristic of (a) a multicomputer and (b) an Ethernet-connected LAN. 11. The performance analysis of the pipelined algorithms considered in Section 4.4 did not
take into account idle time incurred when starting up and shutting down pipelines. Refine the performance models to account for these costs. How many images must be processed before efficiency reaches 95 percent of that predicted in the absence of these costs?
12. Execute by hand for N=P=4 the matrix multiplication algorithm based on a 2-D decomposition described in Section 4.6.1.
13. A simpler variant of the multiplication algorithm for matrices decomposed in two dimensions uses a ring rather than a tree for the broadcast operations. Use performance models to compare the two algorithm variants. (Note that a broadcast can be performed on a P -node bidirectional ring in approximately P/2 steps.)
14. Extend the analysis and comparison of Exercise 13 to account for competition for bandwidth in a 2-D mesh architecture.
15. Develop a performance model for the systolic matrix multiplication algorithm of Section 4.6, and use this to identify regimes in which this algorithm is faster than the simpler algorithm based on regular 2-D decompositions. Account for data reorganization costs.
16. Another matrix multiplication algorithm collects the matrices that are to be multiplied on a single processor. Use performance models to identify regimes in which this algorithm might be competitive. Conduct empirical studies to validate your analytic results.
Chapter Notes
The merits of modular design are described in landmark papers by Parnas [220,221,223] and Wirth [295]. The book by McConnell [198] provides an excellent survey of the software construction process. Booch [40] and Cox and Novobilski [66] provide good introductions to modularity and object-oriented programming. Milner [210], Hoare [154], and Chandy and Misra [54] provide abstract treatments of modularity and program composition in parallel programming. Foster and Taylor [107] explore the use of modular design techniques in concurrent logic programming. Mead and Conway [199] and Ullman [286] provide introductions to VLSI design, another area in which modular design techniques are used extensively.
Gropp and Smith emphasize the importance of data distribution neutrality in the design of SPMD libraries [127]. This principle is applied extensively in their Portable Extensible Tools for Scientific computing (PETSc) package. ScaLAPACK is described by Choi, Dongarra, and Walker [59], Dongarra and Walker [85], and Dongarra, van de Geign, and Walker [84]. Dongarra, Pozo, and Walker [83] describe the C++ interface. Other parallel SPMD libraries include Lemke and Quinlan's [188] P++ library for grid applications, Skjellum's [262] Multicomputer Toolbox, and Thinking Machine's [283] CMSSL. Skjellum et al. [265] discuss the use of the MPI message- passing standard to develop parallel libraries. A variety of other issues relating to parallel SPMD libraries are discussed in workshop proceedings edited by Skjellum [263,264].
The tuple space module discussed in Section 4.5 forms the basis for the Linda parallel programming model of Carriero and Gelernter [47,48,49]. The tuple space used in Linda is more general than that described here. In particular, any field can be used as a key when retrieving tuples. The tuple space solution to the database search problem is based on a Linda program in [48]. The convolution problem and the performance results in Section 4.4 are taken from a paper by Foster et al. [101]. Foster and Worley [110] describe parallel algorithms for the fast Fourier transform.
Part II: Tools
The second part of this book comprises five chapters that deal with the implementation of parallel programs. In parallel as in sequential programming, there are many different languages and programming tools, each suitable for different classes of problem. Because it would be neither feasible nor useful to describe them all, we restrict our attention to four systems---Compositional C++ (CC++), Fortran M (FM), High Performance Fortran (HPF), and the Message Passing Interface (MPI)---and explain how each can be used to implement designs developed using the techniques of Part I. We also describe, in Chapter 9, tools that aid in the collection and analysis of performance data.
Except where material is explicitly cross-referenced, each chapter in Part II is self-contained. Hence, it is quite feasible to base a practical study of parallel programming on just one of the four tools described here. However, while each of these tools is of broad utility, each also is most
appropriate for different purposes, and we recommend that you become familiar with several systems.
CC++, described in Chapter 5, is a small set of extensions to C++. These extensions provide the programmer with explicit control over locality, concurrency, communication, and mapping and can be used to build libraries that implement tasks, channels, and other basic parallel programming abstractions. Designs developed using the techniques of Part I are easily expressed as CC++ programs.
FM, described in Chapter 6, is a small set of extensions to Fortran. These extensions provide explicit support for tasks and channels and hence can implement designs developed using the techniques of Part I directly. A distinguishing feature of FM is that programs can be guaranteed to be deterministic, meaning that two executions with the same input will produce the same output. HPF, described in Chapter 7, is an example of a data-parallel language and has emerged as a de facto standard for scientific and engineering computation. Parallelism is expressed in terms of array operations---statements that apply to many elements of an array at once. Communication operations are inferred by the compiler, and need not be specified by the programmer.
MPI, described in Chapter 8, is a library of standard subroutines for sending and receiving messages and performing collective operations. Like HPF, MPI has emerged as a standard and hence supersedes earlier message-passing libraries such as PARMACS, p4, PVM, and Express. When building parallel programs, our choice of tool will depend on the nature of the problem to be solved. HPF is particularly appropriate for numeric algorithms based on regular domain decompositions (for example, finite difference computations). CC++ and FM are better suited for applications involving dynamic task creation, irregular communication patterns, heterogeneous and irregular computation structures, and concurrent composition. They can also be used to build data-parallel libraries for regular problems. MPI is a lower-level approach to parallel programming than CC++, FM, or HPF and is particularly appropriate for algorithms with a regular SPMD structure.
CC++, FM, HPF, and MPI represent very different approaches to parallel programming. Nevertheless, we shall see that good design is independent of our choice of implementation language. The design techniques introduced in Part I apply regardless of language. Issues of concurrency, scalability, locality, and modularity must be addressed in any parallel program.
5 Compositional C++
In this chapter, we describe Compositional C++ (CC++), a small set of extensions to C++ for parallel programming. CC++ provides constructs for specifying concurrent execution, for managing locality, and for communication. It allows parallel programs to be developed from simpler components using sequential, parallel, and concurrent composition. Hence, algorithms designed using the techniques described in Part I can be translated into CC++ programs in a straightforward manner.
Since the CC++ extensions are simple, we are able in this chapter to provide both a complete language description and a tutorial introduction to important programming techniques. We also provide a brief review of those C++ constructs used in this chapter, so as to make the presentation
intelligible to readers familiar with C but not C++. In the process, we show how the language is used to implement various algorithms developed in Part I.
After studying this chapter, you should be able to write simple CC++ programs. You should know how to create tasks; how to implement structured, unstructured, and asynchronous communication patterns; and how to control the mapping of tasks to processors. You should also know both how to write deterministic programs and when it is useful to introduce nondeterministic constructs. Finally, you should understand how CC++ supports the development of modular programs, and you should know how to specify both sequential and parallel composition.
5.1 C++ Review
We first review some of the basic C++ constructs used in the rest of this chapter, so as to make subsequent material understandable to readers familiar with C but not C++. Readers familiar with C++ can skip this section.
With a few exceptions, C++ is a pure extension of ANSI C. Most valid ANSI C programs are also valid C++ programs. C++ extends C by adding strong typing and language support for data abstraction and object-oriented programming.