2.2 Graph partitioning on Distributed Systems
3.1.1 Parallel and Distributed Breadth-First Search
Formally, given a source vertex s, the BFS algorithm systematically explores the edges of G to discover every vertex that is reachable from s. During the traversal, all vertices at level k are visited before any vertex at level k + 1 is discovered. A typical BFS output therefore consists of a BFS tree rooted at s, in which the vertices at level k are the predecessors of the vertices at level k + 1. Optionally, BFS is also used to compute the distance from s to each reachable vertex. The BFS frontier is defined as the set of vertices in the current level. A sequential BFS algorithm based on a first-in first-out (FIFO) queue takes O(n+m) time on a graph with n vertices and m edges. The fastest known algorithm for parallel BFS represents the graph as an incidence matrix and repeatedly squares this matrix, with the element-wise operations carried out in the min-plus semiring [GM88]. This solution computes the BFS ordering of the vertices
3.1 Background and Related Work 21
algorithm is not suitable for traversing sparse large-scale graphs. Prior work on large-scale BFS implementations is mainly based on extensions of two parallel BFS algorithms. In the first approach, vertices are visited level by level (top-down BFS), and the graph (its vertices or its edges) is partitioned, either implicitly or explicitly, among the processors. The running time increases linearly with the number of traversed levels. As far as correctness is concerned, a given vertex v can have more than one candidate predecessor, so the algorithm can return different (valid) BFS trees as output (idempotent property), depending on the consistency model of the architecture. This nondeterministic behavior can be avoided by careful management of memory contention. In the second approach, referred to as bottom-up, each unvisited vertex tries to find any parent among its neighbors, unlike the top-down approach, where the vertices in the current frontier look for their neighbors. The advantage is that once a vertex has found a parent, it does not need to check the rest of its neighbors. As a result, this approach is very effective on low-diameter graphs. Furthermore, the bottom-up algorithm also removes the need for some atomic operations in parallel implementations [BAP13]. A hybrid approach, called direction-optimizing, switches between the top-down and the bottom-up traversal. When the frontier is small, the top-down approach is faster, since the bottom-up approach would struggle to find valid parents and thus do unnecessary work. On the other hand, when the frontier is large, the bottom-up approach turns out to be more efficient [BAP13, BBM+15].
Detailed discussions are provided in [DNM14, YBD14, HT13]. Finally, a different algorithm was proposed by Ullman and Yannakakis [UY91]. Instead of a level-synchronized search, the graph is explored using multiple path-limited searches, and these searches are finally merged to obtain the proper BFS tree from the source vertex. However, the implementation is more complicated than the simple level-synchronous approach, and there is no experimental evidence about its performance.
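The top-down, bottom-up and direction-optimizing traversals described above can be illustrated with a minimal sequential sketch. The function names, the dictionary-based adjacency lists and the `switch_fraction` threshold are assumptions made for illustration only; in particular, the switching rule is a simplified placeholder and not the heuristic used in the cited works.

```python
from typing import Dict, List

Graph = Dict[int, List[int]]  # adjacency lists; vertex ids are non-negative ints


def top_down_step(graph: Graph, frontier: List[int], parent: Dict[int, int]) -> List[int]:
    """Each frontier vertex claims all of its still-unvisited neighbors."""
    next_frontier = []
    for u in frontier:
        for v in graph[u]:
            if parent[v] == -1:
                parent[v] = u
                next_frontier.append(v)
    return next_frontier


def bottom_up_step(graph: Graph, frontier: List[int], parent: Dict[int, int]) -> List[int]:
    """Each unvisited vertex searches its neighbors for any parent in the frontier."""
    in_frontier = set(frontier)
    next_frontier = []
    for v in graph:
        if parent[v] != -1:
            continue
        for u in graph[v]:
            if u in in_frontier:
                parent[v] = u
                next_frontier.append(v)
                break  # one parent is enough: skip the remaining neighbors
    return next_frontier


def hybrid_bfs(graph: Graph, source: int, switch_fraction: float = 0.05) -> Dict[int, int]:
    """Return a predecessor map (-1 means unreachable); the source is its own parent."""
    parent = {v: -1 for v in graph}
    parent[source] = source
    frontier = [source]
    while frontier:
        # Hypothetical rule: go bottom-up once the frontier is a large share of the graph.
        if len(frontier) > switch_fraction * len(graph):
            frontier = bottom_up_step(graph, frontier, parent)
        else:
            frontier = top_down_step(graph, frontier, parent)
    return parent
```

Both steps produce a valid BFS tree, although the parent chosen for a vertex with several candidate predecessors may differ between them, which is the idempotent property mentioned above.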
A number of studies have focused on the BFS algorithm on both shared-memory and distributed-memory architectures. Agarwal et al. [APPB10] demonstrated poor scaling due to atomic operations on multi-socket systems. To mitigate this, the authors proposed combining a fine-grained approach (edge partitioning among the sockets) with an accumulation-based approach to edge traversal. In detail, each socket atomically updates, in a bitmap, only the information of its local vertices. They achieve good scaling on up to four sockets. On a single GPU, several studies have addressed the data-thread mapping strategy. The simplest approach assigns one element of the BFS queue to each thread. On power-law graphs, this approach suffers from thread load imbalance and poor performance [JLH+11, MB13, MB14a]. Hong et al. overcame the difficulties of vertex parallelism by adopting a warp-centric programming model [HKOO11, HOO11]. In their implementation, each warp is responsible for a subset of the vertices in the BFS queue.
The approach proposed by Merrill et al. [MGG12] assigns a chunk of data to a CTA (a CUDA thread block). The threads of the CTA work in parallel to inspect the vertices in its chunk. Furthermore, they use heuristics to avoid redundant vertex discovery (warp culling). A similar approach was adopted efficiently by Gunrock for its direction-optimizing BFS implementation [WDP+16].
On distributed-memory systems, several works based on an algebraic approach have been proposed [BM11, CPW+12, US13, BBM+15]. Satish et al. [SKCD12] implemented a distributed BFS with 1-D partitioning that shows good scaling on up to 1024 nodes. They described a technique to postpone the exchange of predecessors until the end of the traversal step. A similar technique was also reported and validated in several works [BCMV15, BCM+15].
Ueno et al. [US13] presented a hybrid CPU-GPU implementation of the Graph500 benchmark, using the 2D partitioning proposed by Yoo et al. [YCH+05b]. Their implementation uses the technique introduced by Merrill et al. [MGG12] to create the edge frontier. Furthermore, to reduce the size of the messages, they used a novel compression technique. Finally, they also implemented a sophisticated method to overlap communication and computation in order to reduce the working memory size of the GPUs.
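As a rough illustration of the difference between the 1-D and 2D partitioning schemes mentioned above, the following sketch computes which processor owns a vertex or an edge. It assumes a plain block distribution of vertices and a simple checkerboard decomposition of the adjacency matrix; the function names and the layout are hypothetical and do not necessarily match the exact distributions used in [SKCD12] or [YCH+05b].

```python
def owner_1d(v: int, n: int, p: int) -> int:
    """Rank owning vertex v, and all edges leaving v, with n vertices on p processors."""
    block = -(-n // p)  # ceil(n / p) vertices per processor
    return v // block


def owner_2d(u: int, v: int, n: int, rows: int, cols: int) -> tuple:
    """Grid coordinates (row, col) of the processor owning edge (u, v).

    Assumes the n x n adjacency matrix is cut into rows x cols rectangular blocks.
    """
    row_block = -(-n // rows)  # matrix rows per processor row
    col_block = -(-n // cols)  # matrix columns per processor column
    return (u // row_block, v // col_block)
```

For example, with 10 vertices on 4 processors, `owner_1d(9, 10, 4)` places vertex 9 on rank 3, while on a 2 x 5 grid `owner_2d(5, 9, 10, 2, 5)` places edge (5, 9) on processor (1, 4).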
Recently, Edmonds et al. provided the first hybrid-parallel 1D BFS implementation that uses active messages [EWHL10].
Petrini et al. [CP14] implemented a distributed-memory parallelization of BFS for BlueGene/P architectures. Their results show that the combination of the underlying architecture and the System Programming Interface (SPI) makes it possible to achieve significant performance on R-MAT graphs. They also reported that the SPI implementation outperforms the MPI one by a factor of 5. Bisson et al. [BBM16] developed a fast 2D BFS implementation that exploits NVIDIA Kepler capabilities. Their code achieved 830 billion edges per second on an R-MAT graph using 4096 GPUs.
Bernaschi et al. [BCM+15] provided an efficient BFS implementation on multi-GPU systems based on:
• a modified CSR data structure that allows an efficient data-thread mapping strategy based on prefix sum and binary search. This technique exploits the idempotent property in order to avoid atomic operations;
• 1-D partitioning of the graph among GPUs;
• a new communication pattern for exchanging predecessors, similar to the one proposed in [SKCD12].
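The data-thread mapping based on prefix sum and binary search can be sketched as follows. This is an illustrative sequential emulation, assuming a plain CSR layout (`row_offsets`, `col_indices`) rather than the modified structure of [BCM+15]; each loop iteration stands in for one GPU thread, and all names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def map_threads_to_edges(row_offsets: List[int], col_indices: List[int],
                         frontier: List[int]) -> List[Tuple[int, int]]:
    """Assign exactly one frontier edge to each (emulated) thread id."""
    degrees = [row_offsets[u + 1] - row_offsets[u] for u in frontier]
    # Exclusive prefix sum: offsets[i] = number of edges of frontier[0..i-1].
    offsets = [0] + list(accumulate(degrees))
    total_edges = offsets[-1]
    work = []
    for tid in range(total_edges):  # one iteration plays the role of one thread
        # Binary search for the frontier vertex whose edge range contains tid;
        # zero-degree vertices are skipped automatically.
        i = bisect_right(offsets, tid) - 1
        u = frontier[i]
        edge = row_offsets[u] + (tid - offsets[i])
        work.append((u, col_indices[edge]))
    return work
```

Because every thread handles a single edge, the work per thread does not depend on the degree of the vertex, which is what makes the mapping attractive on power-law graphs.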
Our st-connectivity implementation is based on such techniques. Formally, given an undirected graph G and two vertices s and t, st-connectivity determines whether or not there is a path from s to t in G. If BFS finds such a path, it is also a shortest path between s and t. Often, in real cases, the corresponding search problem of finding a path from s to t must be solved as well. All the BFS implementations described above can easily be extended to solve st-connectivity. There are not many studies on bidirectional search for solving st-connectivity (ST-CON) on large-scale graphs.
The first ST-CON implementation based on concurrent bidirectional search, on the Cray MTA-2 architecture, was provided by Bader et al. [BM06]. Their approach introduces a race condition when the frontiers of both BFSs expand the same vertex. To solve this, they also provided a different strategy, which performs one BFS at a time: the algorithm first performs the BFS with the smaller current frontier. Besta et al. [BH15] proposed using Atomic Active Messages (AAM), based on hardware transactional memory, to solve the race condition on Intel Haswell and IBM Blue Gene/Q.
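The strategy of performing one BFS at a time, always advancing the search with the smaller frontier, can be sketched sequentially as follows. This is a minimal illustration under assumed names and data structures, not the implementation of [BM06]; since it runs on a single thread it has no race condition, and it reports only whether s and t are connected, without reconstructing the path.

```python
from typing import Dict, List


def st_connected(graph: Dict[int, List[int]], s: int, t: int) -> bool:
    """Bidirectional BFS on an undirected graph given as adjacency lists."""
    if s == t:
        return True
    visited_s, visited_t = {s}, {t}
    frontier_s, frontier_t = [s], [t]
    while frontier_s and frontier_t:
        # Advance only the search whose current frontier is smaller.
        expand_s = len(frontier_s) <= len(frontier_t)
        frontier = frontier_s if expand_s else frontier_t
        visited = visited_s if expand_s else visited_t
        other = visited_t if expand_s else visited_s
        next_frontier = []
        for u in frontier:
            for v in graph[u]:
                if v in other:
                    return True  # the two searches have met
                if v not in visited:
                    visited.add(v)
                    next_frontier.append(v)
        if expand_s:
            frontier_s = next_frontier
        else:
            frontier_t = next_frontier
    # One search exhausted its connected component without meeting the other.
    return False
```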