Graph Algorithms
9.3 All-Pairs Shortest Paths: Floyd’s Algorithm
Instead of finding the shortest paths from a single vertex $v$ to every other vertex, we are sometimes interested in finding the shortest paths between all pairs of vertices. Formally, given a weighted graph $G(V, E, w)$, the all-pairs shortest paths problem is to find the shortest paths between all pairs of vertices $v_i, v_j \in V$ such that $i \neq j$. For a graph with $n$ vertices, the output of an all-pairs shortest paths algorithm is an $n \times n$ matrix $D = (d_{i,j})$ such that $d_{i,j}$ is the cost of the shortest path from vertex $v_i$ to vertex $v_j$.
Floyd’s algorithm for solving the all-pairs shortest paths problem is based on the following observation. Let $G = (V, E, w)$ be the weighted graph, and let $V = \{v_1, \ldots, v_n\}$ be the vertices of $G$. Consider a subset $\{v_1, \ldots, v_k\}$ of vertices for some $k$ where $k \le n$. For any pair of vertices $v_i, v_j \in V$, consider all paths from $v_i$ to $v_j$ whose intermediate vertices belong to the set $\{v_1, \ldots, v_k\}$. Let $P^{(k)}_{i,j}$ be the minimum-weight path among them, and let $d^{(k)}_{i,j}$ be the weight of $P^{(k)}_{i,j}$. If vertex $v_k$ is not in the shortest path from $v_i$ to $v_j$, then $P^{(k)}_{i,j}$ is the same as $P^{(k-1)}_{i,j}$. However, if $v_k$ is in $P^{(k)}_{i,j}$, then we can break $P^{(k)}_{i,j}$ into two paths: one from $v_i$ to $v_k$ and one from $v_k$ to $v_j$. Each of these paths uses intermediate vertices from $\{v_1, v_2, \ldots, v_{k-1}\}$. Thus, in this case, $d^{(k)}_{i,j} = d^{(k-1)}_{i,k} + d^{(k-1)}_{k,j}$. Combining the two cases gives the recurrence $d^{(k)}_{i,j} = \min\{d^{(k-1)}_{i,j},\; d^{(k-1)}_{i,k} + d^{(k-1)}_{k,j}\}$ for $k \ge 1$, with $d^{(0)}_{i,j} = w(v_i, v_j)$. The length of the shortest path from $v_i$ to $v_j$ is given by $d^{(n)}_{i,j}$. In general, the solution is a matrix $D^{(n)} = (d^{(n)}_{i,j})$. Floyd’s algorithm solves the recursion bottom up in the order of increasing values of $k$. The run time complexity of the sequential algorithm is $\Theta(n^3)$. Thus, problem size $W = O(n^3)$. Note that only matrix $D^{(k-1)}$ is needed while computing matrix $D^{(k)}$.
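The bottom-up evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the formulation analyzed in this chapter: the function name `floyd_all_pairs`, the adjacency-matrix input convention (`math.inf` for a missing edge, 0 on the diagonal), and the example graph are all assumptions made here. The sketch updates a single matrix in place, which is safe when the graph has no negative-weight cycles because row $k$ and column $k$ do not change during iteration $k$.

```python
import math

def floyd_all_pairs(weights):
    """Return the matrix of shortest-path costs between all vertex pairs.

    weights[i][j] is the weight of edge (i, j), math.inf if the edge is
    absent, and 0 when i == j (assumed input convention).
    """
    n = len(weights)
    # D(0): no intermediate vertices are allowed yet, only direct edges.
    d = [row[:] for row in weights]
    for k in range(n):                  # admit vertex k as an intermediate
        for i in range(n):
            d_ik = d[i][k]
            for j in range(n):
                via_k = d_ik + d[k][j]  # cost of going i -> k -> j
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d

INF = math.inf
example = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]
result = floyd_all_pairs(example)
```

For this hypothetical four-vertex graph, `result[1][0]` is 5 (the path 1 → 2 → 3 → 0), which is cheaper than the direct edge of weight 8.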
A generic parallel formulation of Floyd’s algorithm assigns the task of computing matrix $D^{(k)}$ for each value of $k$ to a set of processes. Let $p$ be the number of processes available. Matrix $D^{(k)}$ is partitioned into $p$ parts, and each part is assigned to a process. Each process computes the $D^{(k)}$ values of its partition. To accomplish this, a process must access the corresponding segments of the $k$th row and column of matrix $D^{(k-1)}$. One way to partition matrix $D^{(k)}$ is to use the block checkerboard mapping. Specifically, matrix $D^{(k)}$ is divided into $p$ squares of size $(n/\sqrt{p}) \times (n/\sqrt{p})$, and each square is assigned to one of the $p$ processors. These $p$ processors are arranged on a 2-d mesh of size $\sqrt{p} \times \sqrt{p}$.
We refer to the processor in the $i$th row and the $j$th column of the mesh as $P_{i,j}$. Each processor updates its part of the matrix during each iteration. During the $k$th iteration of the algorithm, each processor $P_{i,j}$ needs certain segments of the $k$th row and $k$th column of the $D^{(k-1)}$ matrix. Segments are transferred as follows. During the $k$th iteration of the algorithm, each of the $\sqrt{p}$ processors containing part of the $k$th row sends it to the $\sqrt{p} - 1$ processors in the same column. Similarly, each of the $\sqrt{p}$ processors containing part of the $k$th column sends it to the $\sqrt{p} - 1$ processors in the same row.
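To make the data dependencies of the checkerboard mapping concrete, the following sketch simulates the $\sqrt{p} \times \sqrt{p}$ mesh sequentially in Python. It is only an illustration under stated assumptions: no real message passing takes place (the broadcasts are modeled as snapshot copies of row $k$ and column $k$), $p$ is assumed to be a perfect square whose root divides $n$, and the name `floyd_checkerboard` is hypothetical.

```python
import math

def floyd_checkerboard(weights, p):
    """Sequentially simulate Floyd's algorithm with block checkerboard mapping.

    Assumes p is a perfect square and sqrt(p) divides n = len(weights).
    """
    n = len(weights)
    q = math.isqrt(p)                   # mesh side length sqrt(p)
    if q * q != p or n % q != 0:
        raise ValueError("p must be a perfect square whose root divides n")
    b = n // q                          # block side length n / sqrt(p)
    d = [row[:] for row in weights]
    for k in range(n):
        # Stand-in for the two broadcasts: snapshot row k and column k of D(k-1).
        row_k = d[k][:]
        col_k = [d[i][k] for i in range(n)]
        for pi in range(q):             # mesh row of the simulated processor
            for pj in range(q):         # mesh column of the simulated processor
                # Processor (pi, pj) sees only the n/sqrt(p)-element segments
                # of row k and column k that line up with its own block.
                row_seg = row_k[pj * b:(pj + 1) * b]
                col_seg = col_k[pi * b:(pi + 1) * b]
                for li in range(b):
                    i = pi * b + li
                    for lj in range(b):
                        j = pj * b + lj
                        d[i][j] = min(d[i][j], col_seg[li] + row_seg[lj])
    return d

INF = math.inf
graph = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]
blocked = floyd_checkerboard(graph, 4)  # 2 x 2 mesh, 2 x 2 blocks
```

Each simulated processor touches only its own block plus two segments of length $n/\sqrt{p}$, which is exactly the data the row and column broadcasts deliver in the parallel formulation.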
During each iteration of the algorithm, the processors holding the $k$th row and the $k$th column each perform a one-to-all broadcast along a column or a row of $\sqrt{p}$ processors. Each such processor has $n/\sqrt{p}$ elements of the $k$th row or column, so it sends $n/\sqrt{p}$ elements. Thus, the total time taken and the energy spent during the communication phases are $n\left(2\left(t_s + t_w \frac{n}{\sqrt{p}}\right)\log\sqrt{p} + 2t_h(\sqrt{p} - 1)\right)$ and $\frac{1}{2} e_{wh} n^2 \sqrt{p} \log p$, respectively. Since each processor is assigned $n^2/p$ elements of the $D^{(k)}$ matrix, the number of computation steps required to compute its part of the $D^{(k)}$ matrix is $n^2/p$.
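In summary,

$$\pi_{ncomp} = \frac{n^3}{p}\beta, \qquad \pi_{tcomm} = n\left[\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h(\sqrt{p} - 1)\right]$$

$$S_{comp} = n^3\beta, \qquad E_{comm} = \frac{1}{2} e_{wh} n^2 \sqrt{p} \log p$$

where $\beta$ is the number of cycles required for a single addition.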
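The four summary quantities are easy to evaluate numerically, which helps when comparing configurations. The sketch below is illustrative only: the container `MeshParams` and the function `floyd_costs` are hypothetical names, the logarithm is assumed to be base 2, and reading $t_s$, $t_w$, $t_h$ as startup, per-word, and per-hop times (and $e_{wh}$ as a per-word, per-hop energy) is an assumption, since this section does not restate their definitions.

```python
import math
from dataclasses import dataclass

@dataclass
class MeshParams:
    t_s: float   # assumed: message startup time
    t_w: float   # assumed: per-word transfer time
    t_h: float   # assumed: per-hop time
    e_wh: float  # assumed: energy per word per hop
    beta: float  # cycles required for a single addition

def floyd_costs(n, p, m):
    """Return (pi_ncomp, pi_tcomm, S_comp, E_comm) from the summary above."""
    sqrt_p = math.sqrt(p)
    log_p = math.log2(p)                # assumption: log is base 2
    pi_ncomp = n**3 * m.beta / p
    pi_tcomm = n * ((m.t_s + m.t_w * n / sqrt_p) * log_p
                    + 2 * m.t_h * (sqrt_p - 1))
    s_comp = n**3 * m.beta
    e_comm = 0.5 * m.e_wh * n**2 * sqrt_p * log_p
    return pi_ncomp, pi_tcomm, s_comp, e_comm

unit = MeshParams(t_s=1.0, t_w=1.0, t_h=1.0, e_wh=1.0, beta=1.0)
costs = floyd_costs(8, 4, unit)
```

With all parameters set to 1, $n = 8$ and $p = 4$, the call returns `(128.0, 96.0, 512.0, 128.0)`.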
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the parallel Floyd’s all-pairs shortest paths algorithm on a 2-dimensional mesh interconnect. We first scale the computation steps of the critical path so that the performance of the parallel algorithm matches the specified performance requirement. Assuming the performance target to be the time taken by the best sequential algorithm on a single core (processor) operating at maximum frequency $F$, the new reduced frequency at which all $p$ cores should run is given by:
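$$X = \frac{\pi_{ncomp}}{\text{Performance Target} - \pi_{tcomm}} = \frac{\frac{n^3}{p}\beta}{\frac{n^3\beta}{F} - n\left[\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h(\sqrt{p} - 1)\right]}$$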
The total active time at all the cores at the new frequency is given by $pn^3\beta(1/F)$. Therefore, the expression for the energy consumption of the parallel Floyd’s all-pairs shortest paths algorithm as per Equation 5.1 is given by

$$E = E_{comp} + E_{comm} + E_{leak} = E_d n^3 \beta X^2 + \frac{1}{2} e_{wh} n^2 \sqrt{p} \log p + E_l\, p\, n^3 \beta \frac{X}{F}$$
Asymptotic Analysis: Note that if $n \gg p^{1/4}$, then $X \approx F/p$. Thus, the energy consumed by the parallel Floyd’s all-pairs shortest paths algorithm on a 2-dimensional mesh interconnect with $p$ cores running at frequency $X$ is given by:

$$E = \frac{E_d n^3 \beta F^2}{p^2} + \frac{1}{2} e_{wh} n^2 \sqrt{p} \log p + E_l n^3 \beta \tag{9.4}$$
The optimal number of cores required for minimum energy consumption is given by
popt = O
Thus, the asymptotic energy scalability under iso-performance of the parallel Floyd’s all-pairs shortest paths algorithm on a 2-dimensional mesh interconnect is O(). Note that $n$ should be greater than $p^{1/4}$ for this asymptotic result to apply.
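Equation 9.4 can also be explored numerically: the first term falls with $p$ while the communication term grows, so the energy has a minimum at some intermediate core count. The following Python sketch evaluates Equation 9.4 and picks the cheapest core count from a candidate list. It is a rough aid rather than a derivation of $p_{opt}$: the function names, the base-2 logarithm, and all parameter values in the example are assumptions chosen purely for illustration.

```python
import math

def iso_performance_energy(n, p, F, E_d, E_l, e_wh, beta):
    """Evaluate Eq. 9.4, the asymptotic energy with X ~ F/p."""
    e_comp = E_d * n**3 * beta * F**2 / p**2
    e_comm = 0.5 * e_wh * n**2 * math.sqrt(p) * math.log2(p)  # log base 2 assumed
    e_leak = E_l * n**3 * beta
    return e_comp + e_comm + e_leak

def best_core_count(n, candidates, **params):
    """Return the candidate p with the lowest Eq. 9.4 energy."""
    return min(candidates, key=lambda p: iso_performance_energy(n, p, **params))

# Hypothetical unit-valued parameters, for illustration only.
params = dict(F=1.0, E_d=1.0, E_l=1.0, e_wh=1.0, beta=1.0)
p_best = best_core_count(100, [1, 4, 16, 64], **params)
```

With these unit parameters and $n = 100$, the scan selects $p = 4$; realistic hardware constants would move the minimum, so the value itself carries no significance.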
Energy Bounded Scalability
We now evaluate the energy bounded scalability of the parallel Floyd’s all-pairs shortest paths algorithm on 2-dimensional mesh interconnect. The total active time (Tactive) at all the cores as a function of the frequency of the cores is given by
Expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using the energy model is given by
$$E_{comp} = E_d n^3 \beta X^2$$
Here, we fix the energy budget $E$ to be the energy required by the sequential algorithm running on a single core at maximum frequency $F$ ($E_{seq}$). Given the energy budget $E_{seq}$, the frequency $X$ at which the cores should run is given by
$$X \approx \frac{-E_l\, p\, \pi_{tcomm} + \sqrt{E_l^2 p^2 \pi_{tcomm}^2 + 4 E_d S_{comp}\left(E - E_{comm} - E_l\, p\, \pi_{ncomp}\right)}}{2 E_d S_{comp}}$$
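The expression for $X$ has the form of the positive root of a quadratic, $E_d S_{comp} X^2 + E_l\, p\, \pi_{tcomm} X - (E - E_{comm} - E_l\, p\, \pi_{ncomp}) = 0$. A minimal Python sketch of that computation follows; the function name `frequency_for_budget` and the numbers in the example are hypothetical, and the quadratic form itself is inferred here from the shape of the root rather than quoted from the energy model.

```python
import math

def frequency_for_budget(E, E_d, E_l, p, s_comp, e_comm, pi_ncomp, pi_tcomm):
    """Positive root X of a*X^2 + b*X - slack = 0 (form inferred, see lead-in)."""
    a = E_d * s_comp
    b = E_l * p * pi_tcomm
    slack = E - e_comm - E_l * p * pi_ncomp   # budget left for dynamic energy
    if slack < 0:
        raise ValueError("energy budget too small for any positive frequency")
    return (-b + math.sqrt(b * b + 4 * a * slack)) / (2 * a)

# Hypothetical numbers chosen so the root is easy to check by hand:
# X^2 + 3X - 4 = 0 has the positive root X = 1.
x = frequency_for_budget(E=10.0, E_d=1.0, E_l=1.0, p=2,
                         s_comp=1.0, e_comm=2.0, pi_ncomp=2.0, pi_tcomm=1.5)
```

The guard on `slack` reflects that a budget smaller than the communication and leakage terms leaves no room for computation at any positive frequency.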
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
Time Taken = n
Asymptotic Analysis: Note that if $n \gg \sqrt{p} \log p$, then $X \approx F$. Thus, the time taken by the parallel Floyd’s all-pairs shortest paths algorithm running on $p$ cores is given by:
Time Taken = n
The optimal number of cores required for maximum performance under the energy budget is given by
$$p_{opt} = O\left(n^{4/3}\right) = O\left(W^{4/9}\right)$$
Thus, the asymptotic energy bounded scalability of the parallel algorithm is $n^{4/3}$. Note that $n$ should be greater than $\sqrt{p} \log p$ for this asymptotic result to apply.
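One way to see where the exponent $4/3$ comes from is sketched below. The derivation rests on an assumption not spelled out in the surviving text, namely that the parallel time is $\pi_{tcomm} + \pi_{ncomp}/X$ with $X \approx F$; it also keeps only the two terms that dominate for large $n$, so it should be read as a plausibility check rather than as the original derivation.

```latex
% Assumed form of the parallel time with X \approx F:
T(p) \approx n\left[\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p
      + 2t_h(\sqrt{p} - 1)\right] + \frac{n^3\beta}{pF}

% Dominant terms for large n: the per-hop term grows with p,
% the computation term shrinks with p. Setting the derivative to zero:
\frac{d}{dp}\left(2 n t_h \sqrt{p} + \frac{n^3\beta}{pF}\right)
  = \frac{n t_h}{\sqrt{p}} - \frac{n^3\beta}{p^2 F} = 0
\;\Longrightarrow\;
p^{3/2} = \frac{n^2\beta}{t_h F}
\;\Longrightarrow\;
p_{opt} = \left(\frac{\beta}{t_h F}\right)^{2/3} n^{4/3} = O\left(n^{4/3}\right)

% With problem size W = O(n^3), n^{4/3} = (n^3)^{4/9}, hence p_{opt} = O(W^{4/9}).
```

At $p = n^{4/3}$ the neglected term $t_w n^2 \log p / \sqrt{p}$ is of order $n^{4/3} \log n$, which is below the $n^{5/3}$ order of the two retained terms, so dropping it does not change the exponent.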
Utility Based Scalability
We now evaluate the utility based scalability of the parallel Floyd’s all-pairs shortest paths algorithm on 2-dimensional mesh interconnect. Expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using the energy model is given by
$$E_{comp} = E_d n^3 \beta X^2$$
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
T (p, X) = n
We now frame an expression for the cost function $C(p, X)$ of the parallel algorithm using Eq. 5.5:
Asymptotic Analysis: The cost expression of the parallel Floyd’s all-pairs shortest paths algorithm on 2-dimensional mesh interconnect with p cores at frequency X can be approximated as
C(p, X) ≈ O
The optimal number of cores and frequency required for minimum cost varies with problem size as follows
popt = O
The results of this chapter are summarized in Table 9.1.
Parallel Algorithm | Energy scalability under iso-performance | Energy bounded scalability | Utility based scalability
Minimum Spanning Tree: Prim’s Algorithm | O
Single-Source Shortest Paths: Dijkstra’s Algorithm | O
All-Pairs Shortest Paths: Floyd’s Algorithm | O
Table 9.1: Scalability metrics of graph algorithms on 2D mesh interconnect