All-Pairs Shortest Paths: Floyd’s Algorithm

Graph Algorithms

9.3 All-Pairs Shortest Paths: Floyd’s Algorithm

Instead of finding the shortest paths from a single vertex v to every other vertex, we are sometimes interested in finding the shortest paths between all pairs of vertices. Formally, given a weighted graph G(V, E, w), the all-pairs shortest paths problem is to find the shortest paths between all pairs of vertices v_i, v_j ∈ V such that i ≠ j. For a graph with n vertices, the output of an all-pairs shortest paths algorithm is an n × n matrix D = (d_i,j) such that d_i,j is the cost of the shortest path from vertex v_i to vertex v_j.

Floyd’s algorithm for solving the all-pairs shortest paths problem is based on the following observation. Let G = (V, E, w) be the weighted graph, and let V = {v_1, ..., v_n} be the vertices of G. Consider a subset {v_1, ..., v_k} of vertices for some k ≤ n. For any pair of vertices v_i, v_j ∈ V, consider all paths from v_i to v_j whose intermediate vertices belong to the set {v_1, ..., v_k}. Let P_i,j^(k) be the minimum-weight path among them, and let d^(k)_i,j be the weight of P_i,j^(k). If vertex v_k is not on this path, then P_i,j^(k) is the same as P_i,j^(k−1). However, if v_k is on P_i,j^(k), then we can break P_i,j^(k) into two paths: one from v_i to v_k and one from v_k to v_j. Each of these paths uses intermediate vertices only from {v_1, v_2, ..., v_k−1}, so in this case d^(k)_i,j = d^(k−1)_i,k + d^(k−1)_k,j. Taking the better of the two cases gives d^(k)_i,j = min(d^(k−1)_i,j, d^(k−1)_i,k + d^(k−1)_k,j), where d^(0)_i,j is the weight of the edge from v_i to v_j. The length of the shortest path from v_i to v_j is given by d^(n)_i,j, and the overall solution is the matrix D^(n) = (d^(n)_i,j). Floyd’s algorithm solves this recurrence bottom-up, in order of increasing values of k. The run time complexity of the sequential algorithm is Θ(n³). Thus, the problem size is W = O(n³). Note that only matrix D^(k−1) is needed while computing matrix D^(k).
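The recurrence above can be sketched directly in Python. This is a minimal illustration, not the thesis's own code: the function name `floyd_warshall`, the 0-based vertex numbering, and the use of `math.inf` for missing edges are assumptions made for the example, and the sketch assumes the graph has no negative-weight cycles.

```python
import math

def floyd_warshall(w):
    """Return D^(n), the matrix of shortest-path costs, for weight matrix w.

    w[i][j] is the weight of edge (v_i, v_j), math.inf if the edge is absent,
    and 0 on the diagonal. Vertices are 0-indexed here (the text uses 1..n).
    """
    n = len(w)
    d = [row[:] for row in w]      # D^(0): direct edges only
    for k in range(n):             # now allow v_k as an intermediate vertex
        row_k = d[k]
        for i in range(n):
            d_ik = d[i][k]
            for j in range(n):
                # d^(k)_ij = min(d^(k-1)_ij, d^(k-1)_ik + d^(k-1)_kj)
                candidate = d_ik + row_k[j]
                if candidate < d[i][j]:
                    d[i][j] = candidate
    return d
```

Updating a single matrix in place is safe here because row k and column k do not change during iteration k, which mirrors the remark that only D^(k−1) is needed to compute D^(k). The three nested loops over n vertices give the Θ(n³) running time.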

A generic parallel formulation of Floyd’s algorithm assigns the task of computing matrix D^(k) for each value of k to a set of processes. Let p be the number of processes available. Matrix D^(k) is partitioned into p parts, and each part is assigned to a process. Each process computes the D^(k) values of its partition. To accomplish this, a process must access the corresponding segments of the k-th row and column of matrix D^(k−1). One way to partition matrix D^(k) is to use the block checkerboard mapping. Specifically, matrix D^(k) is divided into p squares of size (n/√p) × (n/√p), and each square is assigned to one of the p processors. These p processors are arranged on a 2-D mesh of size √p × √p.
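As a small illustration of the block checkerboard mapping, the following Python sketch computes which processor owns a given matrix element and which index ranges a processor holds. The helper names `block_owner` and `block_range` are hypothetical, indices are 0-based, and the sketch assumes that p is a perfect square and that √p divides n.

```python
import math

def _block_size(n, p):
    """Side length n/sqrt(p) of each square block (assumes it is an integer)."""
    q = math.isqrt(p)
    if q * q != p or n % q != 0:
        raise ValueError("sketch assumes p is a perfect square and sqrt(p) divides n")
    return n // q

def block_owner(i, j, n, p):
    """Return (mesh_row, mesh_col) of the processor owning element (i, j)."""
    b = _block_size(n, p)
    return (i // b, j // b)

def block_range(r, c, n, p):
    """Return the (row indices, column indices) held by processor (r, c)."""
    b = _block_size(n, p)
    return (range(r * b, (r + 1) * b), range(c * b, (c + 1) * b))
```

For example, with n = 8 and p = 16 each block is 2 × 2, so element (5, 2) belongs to the processor in mesh row 2 and mesh column 1.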

We refer to the processor in the i-th row and the j-th column as P_i,j. Each processor updates its part of the matrix during each iteration. During the k-th iteration of the algorithm, each processor P_i,j needs certain segments of the k-th row and k-th column of the D^(k−1) matrix. Segments are transferred as follows. Each of the √p processors containing part of the k-th row sends it to the √p − 1 processors in the same column. Similarly, each of the √p processors containing part of the k-th column sends it to the √p − 1 processors in the same row.
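The data movement described above can be simulated sequentially to see that the block formulation computes the same result as the sequential algorithm. This is only a sketch: it runs in a single Python process, models each broadcast as a list slice, and uses a hypothetical function name `floyd_checkerboard`. It assumes p is a perfect square and that √p divides n.

```python
import math

def floyd_checkerboard(w, p):
    """Simulate Floyd's algorithm with a sqrt(p) x sqrt(p) block mapping."""
    n = len(w)
    q = math.isqrt(p)
    if q * q != p or n % q != 0:
        raise ValueError("sketch assumes p is a perfect square and sqrt(p) divides n")
    b = n // q                                  # block side length n/sqrt(p)
    d = [row[:] for row in w]
    for k in range(n):
        # Snapshot row k and column k of D^(k-1); these are the values
        # that the row and column broadcasts deliver in iteration k.
        row_k = d[k][:]
        col_k = [d[i][k] for i in range(n)]
        for r in range(q):
            for c in range(q):
                # Processor P_{r,c} receives only the segments matching its block.
                row_seg = row_k[c * b:(c + 1) * b]
                col_seg = col_k[r * b:(r + 1) * b]
                for li in range(b):
                    for lj in range(b):
                        i, j = r * b + li, c * b + lj
                        candidate = col_seg[li] + row_seg[lj]
                        if candidate < d[i][j]:
                            d[i][j] = candidate
    return d
```

Each simulated processor touches n²/p elements per iteration and needs only n/√p elements of row k and n/√p elements of column k, which are the quantities used in the cost analysis.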

During each iteration of the algorithm, the processors holding the k-th row and the k-th column each perform a one-to-all broadcast along a column or a row of √p processors, respectively. Each such processor holds n/√p elements of the k-th row or column, so it sends n/√p elements. Thus, the total time taken and energy spent during the communication phases are n(2(t_s + t_w(n/√p)) log(√p) + 2t_h(√p − 1)) and (1/2)e_wh n²√p log p, respectively. Since each processor is assigned n²/p elements of the D^(k) matrix, the number of computation steps required to compute its part of each D^(k) matrix is n²/p, or n³/p steps over all n iterations.

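In summary,

\[
\pi_{ncomp} = \frac{n^3}{p}\,\beta, \qquad
\pi_{tcomm} = n\left(\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h\left(\sqrt{p}-1\right)\right),
\]
\[
S_{comp} = n^3\beta, \qquad
E_{comm} = \frac{1}{2}\,e_{wh}\,n^2\sqrt{p}\,\log p,
\]

where β is the number of cycles required for a single addition.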
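The four summary quantities can be evaluated numerically with a short Python sketch. The `MeshParams` container and the function name `floyd_costs` are hypothetical, and taking the logarithm to base 2 is an assumption, since the text does not state the base.

```python
import math
from dataclasses import dataclass

@dataclass
class MeshParams:
    t_s: float    # message start-up time
    t_w: float    # per-word transfer time
    t_h: float    # per-hop time
    e_wh: float   # communication energy coefficient used in E_comm
    beta: float   # cycles required for a single addition

def floyd_costs(n, p, m):
    """Evaluate the summary cost terms for problem size n on p processors."""
    sqrt_p = math.sqrt(p)
    log_p = math.log2(p)          # assumption: base-2 logarithm
    return {
        # computation cycles on the critical path (per processor)
        "pi_ncomp": n**3 * m.beta / p,
        # communication time on the critical path
        "pi_tcomm": n * ((m.t_s + m.t_w * n / sqrt_p) * log_p
                         + 2 * m.t_h * (sqrt_p - 1)),
        # total computation cycles over all processors
        "S_comp": n**3 * m.beta,
        # total communication energy
        "E_comm": 0.5 * m.e_wh * n**2 * sqrt_p * log_p,
    }
```

With all parameters set to 1, n = 4 and p = 4, this gives 16 computation cycles per processor, a communication time of 32, 64 total computation cycles, and a communication energy of 32.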

Energy Scalability under Iso-performance

We now compute the energy scalability under iso-performance of the parallel Floyd’s all-pairs shortest paths algorithm on a 2-dimensional mesh interconnect. We first scale the computation steps of the critical path so that the performance of the parallel algorithm matches the specified performance requirement. Taking the performance target to be the time taken by the best sequential algorithm on a single core (processor) operating at maximum frequency F, the new reduced frequency at which all p cores should run is given by:

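\[
X = \frac{\pi_{ncomp}}{\text{Performance Target} - \pi_{tcomm}}
  = \frac{\dfrac{n^3}{p}\,\beta}{\dfrac{n^3\beta}{F} - n\left(\left(t_s + t_w\dfrac{n}{\sqrt{p}}\right)\log p + 2t_h\left(\sqrt{p}-1\right)\right)}
\]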
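The frequency scaling step can be illustrated with a few lines of Python. The function name `iso_performance_frequency` is hypothetical; the sketch assumes that computation cycles and times are expressed in consistent units (cycles and seconds), and it rejects cases where communication alone already exceeds the performance target.

```python
def iso_performance_frequency(pi_ncomp, pi_tcomm, target_time):
    """Return the frequency X at which the parallel run just meets target_time.

    pi_ncomp    : computation cycles on the critical path
    pi_tcomm    : communication time on the critical path
    target_time : performance target (sequential time at frequency F)
    """
    slack = target_time - pi_tcomm      # time left over for computation
    if slack <= 0:
        raise ValueError("communication time alone exceeds the performance target")
    return pi_ncomp / slack
```

For instance, with n = 100, β = 1, p = 4 and F = 10⁹ cycles per second, the target is n³β/F = 10⁻³ s and π_ncomp = 2.5 × 10⁵ cycles. If communication were free, the sketch returns 2.5 × 10⁸, which is F/p; a communication time of 5 × 10⁻⁴ s doubles the required frequency to 5 × 10⁸.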

The total active time across all the cores at the new frequency is p n³β (1/F). Therefore, by Equation 5.1, the energy consumption of the parallel Floyd’s all-pairs shortest paths algorithm is given by

\[
E = E_{comp} + E_{comm} + E_{leak}
  = E_d\,n^3\beta\,X^2 + \frac{1}{2}\,e_{wh}\,n^2\sqrt{p}\,\log p + E_l\,p\,n^3\beta\,\frac{X}{F}
\]

Asymptotic Analysis: Note that if n ≫ p^(1/4), then X ≈ F/p. Thus, the energy consumed by the parallel Floyd’s all-pairs shortest paths algorithm on a 2-dimensional mesh interconnect with p cores running at frequency X is given by:

\[
E = \frac{E_d\,n^3\beta\,F^2}{p^2} + \frac{1}{2}\,e_{wh}\,n^2\sqrt{p}\,\log p + E_l\,n^3\beta \tag{9.4}
\]
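To see how the three terms of Equation 9.4 trade off against each other, the expression can be evaluated for several core counts. The following Python sketch is illustrative only: the function name `iso_performance_energy` is hypothetical, the logarithm is assumed to be base 2, and the approximation is meaningful only in the regime where X ≈ F/p holds. It is a numerical exploration, not a substitute for the closed-form optimum.

```python
import math

def iso_performance_energy(n, p, E_d, E_l, e_wh, beta, F):
    """Evaluate Equation 9.4 for problem size n on p cores."""
    e_comp = E_d * n**3 * beta * F**2 / p**2                   # dynamic energy at X = F/p
    e_comm = 0.5 * e_wh * n**2 * math.sqrt(p) * math.log2(p)   # communication energy
    e_leak = E_l * n**3 * beta                                 # leakage energy
    return e_comp + e_comm + e_leak

def best_core_count(n, candidates, **params):
    """Return the candidate p with the smallest Equation 9.4 energy."""
    return min(candidates, key=lambda p: iso_performance_energy(n, p, **params))
```

The computation term falls as 1/p² while the communication term grows as √p log p, so the minimum lies where the two balance; the leakage term does not depend on p in this approximation.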

The optimal number of cores required for minimum energy consumption is given by

popt = O

Thus, the asymptotic energy scalability under iso-performance of the parallel Floyd’s all-pairs shortest paths algorithm on a 2-dimensional mesh interconnect is O(). Note that n should be greater than p^(1/4) for this asymptotic result to apply.

Energy Bounded Scalability

We now evaluate the energy bounded scalability of the parallel Floyd’s all-pairs shortest paths algorithm on a 2-dimensional mesh interconnect. The total active time (T_active) at all the cores as a function of the frequency of the cores is given by

The energy consumed by the parallel algorithm as a function of the frequency of the cores, using the energy model, is given by

\[
E_{comp} = E_d\,n^3\beta\,X^2
\]

Here, we fix the energy budget E to be the energy required by the sequential algorithm running on a single core at maximum frequency F (E_seq). Given the energy budget E_seq, the frequency X at which the cores should run is given by

\[
X \approx \frac{-E_l\,p\,\pi_{tcomm} + \sqrt{E_l^2\,p^2\,\pi_{tcomm}^2 + 4\,E_d\,S_{comp}\left(E - E_{comm} - E_l\,p\,\pi_{ncomp}\right)}}{2\,E_d\,S_{comp}}
\]
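The expression for X has the form of the positive root of a quadratic, a·X² + b·X − c = 0, with a = E_d·S_comp, b = E_l·p·π_tcomm and c = E − E_comm − E_l·p·π_ncomp. Reading it this way is an inference from the formula itself, not something stated in the text. A minimal Python sketch, with the hypothetical function name `energy_bounded_frequency`:

```python
import math

def energy_bounded_frequency(E, E_d, E_l, p, S_comp, pi_ncomp, pi_tcomm, E_comm):
    """Return the positive root X of a*X^2 + b*X - c = 0 for the given budget E."""
    a = E_d * S_comp                       # coefficient of X^2
    b = E_l * p * pi_tcomm                 # coefficient of X
    c = E - E_comm - E_l * p * pi_ncomp    # budget left after fixed costs
    if c <= 0:
        raise ValueError("energy budget does not cover the frequency-independent terms")
    return (-b + math.sqrt(b * b + 4 * a * c)) / (2 * a)
```

As a quick check, with a = 1, b = 1 and c = 6 the root is X = 2, and indeed 1·2² + 1·2 = 6.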

The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is

Time Taken = n

Asymptotic Analysis: Note that if n ≫ √p log p, then X ≈ F. Thus, the time taken by the parallel Floyd’s all-pairs shortest paths algorithm running on p cores is given by:

Time Taken = n

The optimal number of cores required for maximum performance under the energy budget is given by

\[
p_{opt} = O\left(n^{4/3}\right) = O\left(W^{4/9}\right)
\]

Thus, the asymptotic energy bounded scalability of the parallel algorithm is O(n^(4/3)), or equivalently O(W^(4/9)), since the problem size W grows as n³. Note that n should be greater than √p log p for this asymptotic result to apply.

Utility Based Scalability

We now evaluate the utility based scalability of the parallel Floyd’s all-pairs shortest paths algorithm on a 2-dimensional mesh interconnect. The energy consumed by the parallel algorithm as a function of the frequency of the cores, using the energy model, is given by

\[
E_{comp} = E_d\,n^3\beta\,X^2
\]

The time taken (inverse of performance) by the parallel algorithm as a function of frequency is

T (p, X) = n

We now frame an expression for the cost function C(p, X) of the parallel algorithm using Eq. 5.5:

Asymptotic Analysis: The cost expression of the parallel Floyd’s all-pairs shortest paths algorithm on 2-dimensional mesh interconnect with p cores at frequency X can be approximated as

C(p, X) ≈ O

The optimal number of cores and frequency required for minimum cost varies with problem size as follows

p_opt = O

The results of this chapter are summarized in Table 9.1.

Parallel Algorithm | Energy scalability under iso-performance | Energy bounded scalability | Utility based scalability
Minimum Spanning Tree: Prim’s Algorithm | O
Single-Source Shortest Paths: Dijkstra’s Algorithm | O
All-Pairs Shortest Paths: Floyd’s Algorithm | O

Table 9.1: Scalability metrics of graph algorithms on 2D mesh interconnect


In document Towards energy-performance trade-off analysis of parallel applications (Page 119-124)