Graph Clustering: SSCA2 Kernel-4 - Transaction Benchmarking

5.3 Resilient Application Frameworks

5.4.2 Transaction Benchmarking

5.4.2.2 Graph Clustering: SSCA2 Kernel-4

TheSSCAbenchmark suite is used for studying the performance of computations on graphs with arbitrary structures and arbitrary data access patterns. In order

0 1000 2000 3000 4000 5000 6000 7000 8000 64 128 256 512 throughput (operations/ms)

producer threads (cores=2X, places=X/4) (a) RL_EA_UL 2x8 u=0% non-resilient O-p0 O-dist 0 1000 2000 3000 4000 5000 6000 64 128 256 512 throughput (operations/ms)

producer threads (cores=2X, places=X/4) (d) RV_LA_WB 2x8 u=0% non-resilient O-p0 O-dist 0 500 1000 1500 2000 2500 3000 3500 4000 64 128 256 512 throughput (operations/ms)

producer threads (cores=2X, places=X/4) (b) RL_EA_UL 4x4 u=0% non-resilient O-p0 O-dist 0 500 1000 1500 2000 2500 3000 64 128 256 512 throughput (operations/ms)

producer threads (cores=2X, places=X/4) (e) RV_LA_WB 4x4 u=0% non-resilient O-p0 O-dist 0 500 1000 1500 2000 2500 64 128 256 512 throughput (operations/ms)

producer threads (cores=2X, places=X/4) (c) RL_EA_UL 8x2 u=0% non-resilient O-p0 O-dist 0 200 400 600 800 1000 1200 1400 1600 64 128 256 512 throughput (operations/ms)

producer threads (cores=2X, places=X/4) (f) RV_LA_WB 8x2 u=0% non-resilient

O-p0 O-dist

§5.4 Performance Evaluation 147 0 200 400 600 800 1000 1200 64 128 256 512 throughput (operations/ms)

producer threads (cores=2X, places=X/4) (a) RL_EA_UL 2x8 u=50% non-resilient O-p0 O-dist 0 500 1000 1500 2000 2500 3000 64 128 256 512 throughput (operations/ms)

producer threads (cores=2X, places=X/4) (d) RV_LA_WB 2x8 u=50% non-resilient O-p0 O-dist 0 100 200 300 400 500 600 700 64 128 256 512 throughput (operations/ms)

producer threads (cores=2X, places=X/4) (b) RL_EA_UL 4x4 u=50% non-resilient O-p0 O-dist 0 200 400 600 800 1000 1200 1400 1600 1800 64 128 256 512 throughput (operations/ms)

producer threads (cores=2X, places=X/4) (e) RV_LA_WB 4x4 u=50% non-resilient O-p0 O-dist 0 50 100 150 200 250 300 350 400 450 500 64 128 256 512 throughput (operations/ms)

producer threads (cores=2X, places=X/4) (c) RL_EA_UL 8x2 u=50% non-resilient O-p0 O-dist 0 200 400 600 800 1000 1200 64 128 256 512 throughput (operations/ms)

producer threads (cores=2X, places=X/4) (f) RV_LA_WB 8x2 u=50% non-resilient

O-p0 O-dist

to evaluate the performance of distributed transactions in a realistic application, we implemented kernel-4 of the SSCA benchmark number 2 [Bader and Madduri, 2005]. SSCA2-k4 is a graph clustering problem that aims to create clusters of a certain size within a graph. For graph generation, we used the SSCA2 graph gen- erator provided in X10’s benchmarking suite [X10 Benchmarks], located at x10- benchmarks/PERCS/SSCA2/Rmat.x10. The graph generation algorithm is based on the R-Mat model [Chakrabarti et al.,2004], which uses four parameters to config- ure the structure of the graph (a, b, c, and d) such thata+b+c+d = 1. Each of these parameters represents the probability that a vertex will be located in a specific area in the graph. We partitioned the graph vertices evenly among the places; however, the graph structure itself is replicated at all places to speed up neighbor-vertex queries, which are extensively used in this benchmark.

Using the parallel workers framework, we developed a lock-based implementation and a transactional implementation of SSCA2-k4. The source code of these implementations is available in x10/x10.dist/samples/ssca2. The two implementations process the graph usingN ∗t parallel threads, whereN is the number of places, and

t is the number of clustering threads in each place. Each thread is assigned a partition of the graph vertices, where a vertex represents a local potential root for a new cluster. A thread tries to create a cluster by first allocating the root vertex, then allocating all neighboring vertices of that root that are not already allocated. After allocating the neighbors, a heuristic is applied to pick one of them and use it for further expanding the cluster by allocating the neighbors of the picked vertex. This process continues until the cluster reaches a target size, or no more vertices are available for allocation. In that case, the thread switches to the next root vertex to create a new cluster.

The lock-based implementation uses non-blockingtryLockReadandtryLockWrite functions provided by the same lock class used by transactions (see Section5.2.3.2

for a description of the lock specification). The clustering thread first attempts to read-lock a target vertex to check if it is part of a cluster. If locking is successful and the vertex is free, it attempts to lock the vertex for writing. If locking is successful and the vertex is still free, it marks the vertex with its cluster id and moves on to another vertex. All acquired vertices remain exclusively locked until cluster creation completes. Failure to acquire a lock for read or write forces the thread to clear all allocated vertices, unlock the acquired locks, and restart processing the same root vertex. Similar to [Bocchino et al., 2008], our lock-based implementation does not protect against livelocks. The transactional implementation avoids explicit locking by handling each cluster creation as a transaction. The runtime system implicitly checks for conflicts and performs the consequent rollback of obtained vertices.

The lock-based implementation is not resilient; however, we use it as an indicative baseline for comparison with non-resilient transactions. Although adding resilience to the lock-based version is possible, we found that the implementation would be complex as it would require handling data replication and lock tracking at the application level. With transactions, however, we receive these features transparently by the runtime system. In the transaction implementation, we don’t checkpoint the progress of each thread. Therefore, the threads of a recovering worker restart process-

§5.4 Performance Evaluation 149

ing their assigned vertices from the beginning. However, because the root vertices of the clusters are locally located, checking the status of previously locked vertex is not expensive as it generates a local read-only transaction.

We performed our experiment on a balanced graph of size 218vertices, with equal probability for the four graph parameters (a = 0.25, b = 0.25, c = 0.25, d = 0.25), and a target cluster size of 10. We started four clustering threads at each worker (place) and allocated two cores per clustering thread, for the same reason described in theResilientTxBenchexperiment (Section5.4.2.1). Therefore, each X10 place was configured with 8 worker threads using X10 NTHREADS=8. Experiments running in resilient mode use the optimistic distributed finish implementation (O-dist).

We report the median of five measurements for each configuration scenario for this application. The 95% confidence interval of the reported processing time is less than 8% of the mean for experiments running in non-resilient mode, less than 4% of the mean for experiments running in resilient mode without failure, and less than 14% of the mean for experiments running in resilient mode with failure.

Table5.9compares the performance of the application using different concurrency control mechanisms. Because this application is a write-intensive application, the lock- based implementation and RL EA UL suffer significantly more conflicts than RV LA WB. Assuming the conflict cost is the ratio of processing time to the number of conflicts: in non-resilient mode with 64 threads, the conflict cost is 0.08, 0.57, and 6.93 for lock-based, RL EA UL, andRV LA WB, respectively. The conflict cost in RV LA WBis much higher because conflict detection is delayed to the commit preparation time, after the transaction has fully expanded. The other two mechanisms detect conflicts much faster, but the negative consequence of their low conflict cost is retrying the acquisition of the same lock within a short time interval, which results in more repeated conflicts thanRV LA WB. Overall, in non-resilient mode, the number of conflicts offsets the conflict cost for the two transactional implementations, and their performance is comparable. Locking outperforms the transactional mechanisms with small numbers of threads. However, this advantage is gradually lost as the number of threads increases.

In resilient mode, the cost of the transaction resilience overhead and the large gap in the total number of processed transactions increase the performance gap between

RL EA UL and RV LA WB. Overall, RV LA WB achieves better scalability and lower resilience overhead than RL EA ULin this application. With 64 threads, the resilience overhead is 117% and 141% for RV LA WBandRL EA UL, respectively. With 512 threads, the overhead ofRV LA WBreduces to only 75%, while it increases to 242% withRL EA ULdue to the massive increase in the number of conflicts.

Using the transactional implementation, we performed an experiment using RV LA WB to evaluate the overhead of failure recovery for this application. We configured the middle place as a victim and forced its first clustering thread to kill the process after completing half of its assigned vertices. Table5.10shows the impact of the failure on the total processing time and the number of conflicts. It also shows the time consumed for replication recovery by the resilient store. As expected, the failure results in generating more conflicts by the transactions that interact with the

Table 5.9: SSCA2-k4 performance with different concurrency control mechanisms.

Conflicts Processing time

threads 64 128 256 512 64 128 256 512 Non-res. lock-based 82831 147793 310540 986013 6.3s 4.2s 3.4s 3.6s RL EA UL 15848 42284 70933 207199 9.1s 5.2s 3.3s 3.1s RV LA WB 1414 2557 5881 13803 9.8s 5.0s 3.4s 3.6s Res. (O-dist) RL EA UL 22924 50858 186940 592674 21.9s 12.7s 11.6s 10.6s RV LA WB 1127 2382 5543 12745 21.3s 10.9s 8.7s 6.3s

dead place. The number of conflicts is proportional to the replication recovery time. With small numbers of places (16 and 32), the recovery time ranged between 1.2–3.4 seconds, and with larger numbers of places (64 and 128), the recovery time ranged between 0.3–0.4 seconds. In this strong scaling experiment, as the number of places increases, the data assigned to each place shrinks and the transfer time of replicas reduces. However, there are factors other than the replica size that can also impact the recovery performance and cause high performance variability (the 95% confidence interval of the processing time under failure is between 5%–14% of the mean). These factors include the possible delay in allocating the spare place due to high processing load at the master of the dead place and the wait time experienced by the master and the slave replicas to reach a migratable state (see Section5.3.1.2). We expect the load at the master place to be a determining factor for the recovery performance due to performing the recovery process asynchronously with normal application tasks.

To summarize, as shown in Figure5.13our transactional finish implementation results in good strong-scaling performance for the SSCA2-k4 application in failure-free and failure scenarios. The performance is comparable to the lock-based implementation in non-resilient mode. We counted the number of lines of the code that performs the actual cluster generation function in both implementations, excluding the main function, definition of classes, debug messages, and code related to killing the victim places. The lock-based implementation takes 171 lines of code without any fault tolerance support, while the resilient transactional implementation uses only 80 lines of code2.

In document Resilience in high level parallel programming languages (Page 165-170)