Shortest-Path Algorithms Summary - Parallel Dijkstra 2.

Parallel Dijkstra 2.

3.9.3 Shortest-Path Algorithms Summary

Table 3.7 summarizes the performance models developed for the four all-pairs shortest-path algorithms. Clearly, Floyd 2 will always be more efficient that Floyd 1. Both algorithms have the same computation costs and send the same number of messages, but Floyd 2 communicates considerably less data. On the other hand, Floyd 1 is easier to implement. Algorithms Dijkstra 1 and 2 will be more efficient than Floyd 2 in certain circumstances. For example, Dijkstra 1 is more efficient than Floyd 2 if P N and

Table 3.7: Performance of four parallel shortest-path algorithms.

In addition to these factors, we must consider the fact that algorithms Dijkstra 1 and Dijkstra 2 replicate the graph P and P/N times, respectively. This replication may compromise the scalability of these algorithms. Also, the cost of replicating an originally distributed graph must be considered if (as is likely) the shortest-path algorithm forms part of a larger program in which the graph is represented as a distributed data structure.

Clearly, the choice of shortest-path algorithm for a particular problem will involve complex tradeoffs between flexibility, scalability, performance, and implementation complexity. The performance models developed in this case study provide a basis for evaluating these tradeoffs.

3.10 Summary

In this chapter, we have seen how to develop mathematical performance models that characterize the execution time, efficiency, and scalability of a parallel algorithm in terms of simple parameters such as problem size, number of processors, and communication parameters. We have also seen how these models can be used throughout the parallel program design and implementation cycle:

• Early in the design process, we characterize the computation and communication

requirements of our parallel algorithms by building simple performance models. These models can be used to choose between algorithmic alternatives, to identify problem areas in the design, and to verify that algorithms meet performance requirements.

• Later in the design process, we refine our performance models and conduct simple

experiments to determine unknown parameters (such as computation time or communication costs) or to validate assumptions. The refined models can be used to increase our confidence in the quality of our design before implementation.

• During implementation, we compare the performance of the parallel program with its

performance model. Doing this can help both to identify implementation errors and to improve the quality of the model.

A performance model gives information about one aspect of an algorithm design: its expected parallel performance. We can use this information, when it is combined with estimates of implementation cost, etc., to make informed choices between design alternatives.

Exercises

The exercises in this chapter are designed to provide experience in the development and use of performance models. When an exercise asks you to implement an algorithm, you should use one of the programming tools described in Part II.

1. Discuss the relative importance of the various performance metrics listed in Section 3.1 when designing a parallel floorplan optimization program for use in VLSI design.

2. Discuss the relative importance of the various performance metrics listed in Section 3.1 when designing a video server that uses a parallel computer to generate many hundreds of thousands of concurrent video streams. Each stream must be retrieved from disk, decoded, and output over a network.

3. The self-consistent field (SCF) method in computational chemistry involves two operations: Fock matrix construction and matrix diagonalization. Assuming that diagonalization accounts for 0.5 per cent of total execution time on a uniprocessor computer, use Amdahl's law to determine the maximum speedup that can be obtained if only the Fock matrix construction operation is parallelized.

4. You are charged with designing a parallel SCF program. You estimate your Fock matrix construction algorithm to be 90 percent efficient on your target computer. You must choose between two parallel diagonalization algorithms, which on five hundred processors achieve speedups of 50 and 10, respectively. What overall efficiency do you expect to achieve with these two algorithms? If your time is as valuable as the computer's, and you expect the more efficient algorithm to take one hundred hours longer to program, for how many hours must you plan to use the parallel program if the more efficient algorithm is to be worthwhile?

5. Some people argue that in the future, processors will become essentially free as the cost of computers become dominated by the cost of storage and communication networks. Discuss how this situation may affect algorithm design and performance analysis.

6. Generate an execution profile similar to that in Figure 3.8 for an implementation of a parallel finite difference algorithm based on a 2-D decomposition. Under which circumstances will message startups contribute more to execution time than will data transfer costs?

7. Derive expressions that indicate when a 2-D decomposition of a finite difference computation on an grid will be superior to a 1-D decomposition and when a 3- D decomposition will be superior to a 2-D decomposition. Are these conditions likely to apply in practice? Let sec, sec, sec, and P=1000. For what values of N does the use of a 3-D decomposition rather than a 2-D decomposition reduce execution time by more than 10 percent?

8. Adapt the analysis of Example 3.4 to consider 1-D and 2-D decompositions of a 2-D grid. Let N=1024, and fix other parameters as in Exercise 7. For what values of P does the use of a 2-D decomposition rather than a 1-D decomposition reduce execution time by more than 10 percent?

9. Implement a simple ``ping-pong'' program that bounces messages between a pair of processors. Measure performance as a function of message size on a workstation network and on one or more parallel computers. Fit your results to Equation 3.1 to obtain values for

and . Discuss the quality of your results and of the fit.

10. Develop a performance model for the program constructed in Exercise 5 in Chapter 2 that gives execution time as a function of N, P, , , and . Perform empirical studies to determine values for , , and on different parallel computer systems. Use the results of these studies to evaluate the adequacy of your model.

11. Develop performance models for the parallel algorithms developed in Exercise 10 in Chapter 2. Compare these models with performance data obtained from implementations of these algorithms.

12. Determine the isoefficiency function for the program developed in Exercise 10. Verify this experimentally.

13. Use the ``ping-pong'' program of Exercise 9 to study the impact of bandwidth limitations on performance, by writing a program in which several pairs of processors perform exchanges concurrently. Measure execution times on a workstation network and on one or more parallel computers. Relate observed performance to Equation 3.10.

14. Implement the parallel summation algorithm of Section 2.4.1. Measure execution times as a function of problem size on a network of workstations and on one or more parallel computers. Relate observed performance to the performance models developed in this chapter.

15. Determine the isoefficiency function for the butterfly summation algorithm of Section 2.4.1, with and without bandwidth limitations.

16. Design a communication structure for the algorithm Floyd 2 discussed in Section 3.9.1. 17. Assume that a cyclic mapping is used in the atmosphere model of Section 2.6 to

compensate for load imbalances. Develop an analytic expression for the additional communication cost associated with various block sizes and hence for the load imbalance that must exist for this approach to be worthwhile.

18. Implement a two-dimensional finite difference algorithm using a nine-point stencil. Use this program to verify experimentally the analysis of Exercise 17. Simulate load imbalance by calling a ``work'' function that performs different amounts of computation at different grid points.

19. Assume that , , and sec. Use the performance models summarized in Table 3.7 to determine the values of N and P for which the various shortest-path algorithms of Section 3.9 are optimal.

20. Assume that a graph represented by an adjacency matrix of size is distributed among P tasks prior to the execution of the all-pairs shortest-path algorithm. Repeat the analysis of Exercise 19 but allow for the cost of data replication in the Dijkstra algorithms.

21. Extend the performance models developed for the shortest-path algorithms to take into account bandwidth limitations on a 1-D mesh architecture.

22. Implement algorithms Floyd 1 and Floyd 2, and compare their performance with that predicted by Equations 3.12 and 3.13. Account for any differences.

23. In so-called nondirect Fock matrix construction algorithms, the integrals of Equation 2.3 are cached on disk and reused at each step. Discuss the performance issues that may arise when developing a code of this sort.

24. The bisection width of a computer is the minimum number of wires that must be cut to divide the computer into two equal parts. Multiplying this by the channel bandwidth gives the bisection bandwidth. For example, the bisection bandwidth of a 1-D mesh with bidirectional connections is 2/ . Determine the bisection bandwidth of a bus, 2-D mesh, 3- D mesh, and hypercube.

Figure 3.29: Parallel matrix transpose of a matrix A decomposed by column, with P=4.

The components of the matrix allocated to a single task are shaded black, and the components required from other tasks are stippled.

25. An array transpose operation reorganizes an array partitioned in one dimension so that it is partitioned in the second dimension (Figure 3.29). This can be achieved in P-1 steps, with each processor exchanging of its data with another processor in each step. Develop a performance model for this operation.

26. Equation 3.1 can be extended to account for the distance D between originating and destination processors:

The time per hop typically has magnitude comparable to . Under what circumstances might the term be significant?

27. Develop a performance model for the matrix transpose algorithm on a 1-D mesh that takes into account per-hop costs, as specified by Equation 3.14. Assume that and , and identify P and N values for which per-hop costs make a significant (>5 percent) difference to execution time.

28. Demonstrate that the transpose algorithm's messages travel a total of hops on a 1-D mesh. Use this information to refine the performance model of Exercise 25 to account for competition for bandwidth.

29. In the array transpose algorithm of Exercise 25, roughly half of the array must be moved from one half of the computer to the other. Hence, we can obtain a lower time bound by dividing the data volume by the bisection bandwidth. Compare this bound with times predicted by simple and bandwidth-limited performance models, on a bus, one-dimensional mesh, and two-dimensional mesh.

30. Implement the array transpose algorithm and study its performance. Compare your results to the performance models developed in preceding exercises.

Chapter Notes

The observation commonly referred to as Amdahl's law was first formulated in [12]. Asymptotic analysis of parallel algorithms is discussed in many computer science texts, such as those by Akl [8], Leighton [187], and Smith [267]. Cook [64] discusses problems for which no efficient parallel algorithms have been discovered.

Many different approaches to performance modeling have been proposed, each appropriate for different purposes. See, for example, the papers by Culler et al. [67], Eager et al. [89], Flatt and Kennedy [97], Karp and Flatt [167], and Nussbaum and Agarwal [216]. Patel [224] discusses the modeling of shared-memory computers. The book by Kumar et al. [179] provides many example models and a more detailed treatment of the concept of isoefficiency. Gustafson et al. [129,130] introduce the concept of scaled speedup. Singh, Hennessy, and Gupta [259], Sun and Ni [274], and Worley [297,298] discuss various constraints on the scalability of parallel programs. Lai and Sahni [183] and Quinn and Deo [237] discuss speedup anomalies in search problems. Faber et al. [93] argue against the concept of superlinear speedup. Fromm et al. [115], Harrison and Patel [134], and Thomasian and Bay [284] use queuing models to study performance of parallel systems. Kleinrock [173] reviews techniques used for performance analysis of networks and discusses issues that arise in high-speed (gigabit/sec) WANs.

The chapter notes in Chapter 1 provide references on parallel computer architecture. Feng [94] provides a tutorial on interconnection networks. Hypercube networks have been used in a variety of multicomputers such as the Cosmic Cube [254], nCUBE-2 [212], Intel iPSC, and Thinking Machines CM2 [281]. The Intel DELTA and Intel Paragon [276] use two-dimensional mesh networks. The Cray T3D and MIT J machine [72] use a three-dimensional torus. Adams, Agrawal, and Siegel [2] survey multistage interconnection networks, and Harrison [133] discusses the analytic modeling of these networks. Various forms of multistage network have been used in the BBN Butterfly [31], NYU Ultracomputer [123], IBM RP3 [226], and IBM SP [271]. The IBM SP uses a bidirectional multistage network constructed from 4 4 crossbars (a modified 4-ary n -fly) similar to that illustrated in Figure 3.18. Seitz [253,255] provides an introduction to multicomputers and their interconnection networks. Dally [69] discusses networks and the concept of bisection width, while Leighton [187] provides a more detailed and theoretical treatment. Dally and Seitz [70,71] discuss routing techniques. The material in Example 3.8 is based on work by Foster and Worley [110]. Ethernet was designed by Metcalfe and Boggs [205]; Shoch, Dalal, and Redell [257] describe its evolution.

Miller and Katz [208], Foster, Henderson, and Stevens [103], and Pool et al. [229] discuss the I/O requirements of scientific and engineering applications. Del Rosario and Choudhary [76] discuss

problems and prospects for parallel I/O. Henderson, Nickless, and Stevens [145] discuss application I/O requirements and describe a flexible I/O architecture for parallel computers. Plank and Li [228] discuss checkpointing. Bordawekar, del Rosario, and Choudhary [41] explain the utility of a two-phase I/O strategy. A special issue of the Journal of Parallel and Distributed Computing [60] discusses various aspects of parallel I/O, as do Aggarwal and Vitter [4] and Katz, Gibson, and Patterson [168]. DeWitt and Gray [79] discuss parallel database machines. Gibson [120] examines the design and performance analysis of redundant disk arrays (RAID disks). Hennessy and Patterson [134] provide a good description of I/O system performance analysis and design.

The parallel versions of Floyd's shortest-path algorithm [98] are due to Jenq and Sahni [158], while the parallel version of Dijkstra's single-source algorithm [80] is described by Paige and Kruskal [217]. Our analysis of these algorithms follows Kumar and Singh [182], who also present analyses that take into account bandwidth limitations on hypercube and two-dimensional mesh architectures. Bertsekas and Tsitsiklis [35] describe a pipelined variant of Floyd 2 that improves performance by allowing iterations to proceed concurrently, subject only to dataflow constraints. Aho, Hopcroft, and Ullman [7] and Cormen, Leiserson, and Rivest [65] provide good introductions to sequential graph algorithms. Quinn and Deo [236] and Das, Deo, and Prasad [73,74] describe parallel graph algorithms. Ranka and Sahni's [238] book on parallel algorithms for image processing and pattern recognition includes relevant material.

In document Designing and Building Parallel Programs Ian Foster (Page 126-131)