4.10 Finish Resilient Store Implementations
5.1.2 Double In-Memory Replication
The storage medium used by the resilient store is a critical factor in determining its survivability and performance. Based on the type of storage medium, resilient stores can be classified into disk-based or diskless stores.
A disk-based resilient store for an HPC environment typically writes its data to a Parallel File System (PFS), which is available in most large-scale infrastructures. It can survive the failure of the entire computation thanks to the durability of disk storage. However, the high I/O latency of the PFS can slow down data access operations and make the resilient store a performance bottleneck at scale. As the load of disk access increases, the PFS itself can become a source of failures, reducing the reliability of the application [Sato et al., 2014].
A diskless resilient store leverages the low latency of memory access by writing the data in memory, and protects against data loss by employing a data redundancy mechanism, such as replication or data encoding [Cappello, 2009]. The data redundancy mechanism places an upper bound on the survivability of the store. For example, replicating the data at two processes (i.e. double in-memory replication) enables the store to survive only those failures that do not kill both processes at the same time. Increasing the number of replicas improves the store's survivability, at the price of higher performance and memory costs. Because, in practice, most failures impact one or a few nodes simultaneously [Lifflander et al., 2013; Sato et al., 2012; Meneses et al., 2012; Moody et al., 2010], a double in-memory resilient store can protect the application from the majority of failures.
We designed the X10 resilient store based on double in-memory replication. As shown in Figure 5.1, each place owns a master replica of its data and a slave replica of its left neighbor's data.
[Figure: Place 0 holds Master 0 and Slave 3; Place 1 holds Master 1 and Slave 0; Place 2 holds Master 2 and Slave 1; Place 3 holds Master 3 and Slave 2.]
Figure 5.1: Resilient store replication.
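The replica placement described above (each place holds its own master replica and the slave replica of its left neighbor) can be sketched in a few lines. This is only an illustration written in Python rather than X10; the function names `replica_layout` and `lost_data` are hypothetical and are not part of the store's API.

```python
def replica_layout(num_places):
    """Return, for each place, whose master and whose slave replica it holds.

    Place p holds the master replica of its own data and the slave replica
    of its left neighbor (p - 1), wrapping around at place 0.
    """
    return {p: {"master": p, "slave": (p - 1) % num_places}
            for p in range(num_places)}


def lost_data(num_places, failed):
    """Return the owners whose master and slave replicas were both lost.

    The slave replica of place p lives on its right neighbor (p + 1), so the
    data of p is lost only if p and p + 1 fail at the same time.
    """
    failed = set(failed)
    lost = []
    for owner in range(num_places):
        slave_holder = (owner + 1) % num_places
        if owner in failed and slave_holder in failed:
            lost.append(owner)
    return lost
```

With four places, a single failure or the simultaneous failure of two non-adjacent places loses no data, whereas the simultaneous failure of two adjacent places loses the data owned by the left one of the pair.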
5.1.3 Non-Shrinking Recovery
As previously described in our taxonomy in Section 2.2.3.1, resilient runtime systems can support shrinking and/or non-shrinking recovery. For statically partitioned applications, which represent a wide class of HPC applications, shrinking recovery is challenging to use because it requires the programmer not only to adjust the communication topology to accommodate fewer processes, but also to handle possible load imbalance when the workload of the failed process shifts to other processes. Non-shrinking recovery avoids these complexities by resuming the computation on the same number of processes. See Figure 5.2 for a demonstration of shrinking and non-shrinking recovery strategies for a statically partitioned 2D domain.
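The difference between the two strategies can be made concrete with a small block-to-place mapping. The sketch below is illustrative Python, not X10; the helper names, the choice of the place that takes over the failed place's blocks, and the id of the spare place are all assumptions made for the example.

```python
def initial_mapping(num_places, blocks_per_place):
    """Assign contiguous block ids to places: place p owns blocks_per_place blocks."""
    return {p: [p * blocks_per_place + i for i in range(blocks_per_place)]
            for p in range(num_places)}


def shrink_without_repartitioning(mapping, failed, takeover):
    """Give all blocks of the failed place to one surviving place (assumed: takeover)."""
    new_mapping = {p: list(blocks) for p, blocks in mapping.items() if p != failed}
    new_mapping[takeover] = sorted(new_mapping[takeover] + mapping[failed])
    return new_mapping


def non_shrinking(mapping, failed, spare):
    """Hand the failed place's blocks, unchanged, to a newly added spare place."""
    return {(spare if p == failed else p): list(blocks)
            for p, blocks in mapping.items()}


def max_load(mapping):
    """Largest number of blocks held by any single place."""
    return max(len(blocks) for blocks in mapping.values())
```

For three places with two blocks each and a failure of place 1, the shrinking variant leaves one place with four blocks while the other keeps two, whereas the non-shrinking variant keeps every place at two blocks.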
Our earliest work with RX10, as published in [Hamouda et al., 2015], evaluates the performance of shrinking and non-shrinking recovery for three machine learning benchmarks: linear regression, logistic regression, and PageRank. Linear regression and logistic regression store the input classification examples in a dense matrix partitioned into N horizontal blocks, where N is the number of places. Each block holds 50K classification examples, with 500 features each. PageRank stores the input document graph in a sparse matrix and follows the same partitioning strategy. Each block holds 2M edges of the graph. The applications achieved data resilience by checkpointing each block both in local memory and in the memory of one neighboring place. Figure 5.3 shows the performance of the three applications after experiencing one place failure using an old version of X10 (v2.5.2).
Non-shrinking recovery results in the fastest performance for the three applications. It maintains the balance between the places and limits the need for communication during recovery to the new place only. Shrinking recovery without repartitioning, in our implementation, assigns the blocks of the failed process to only one live process (similar to the example in Figure 5.2-b). As in non-shrinking recovery, only one process needs remote communication for recovery. The rest of the processes recover their blocks from their local memory. Unlike non-shrinking recovery, the load between the places is not balanced, which results in slower performance. Finally,
§5.1 A Resilient Data Store for the APGAS Model 109
[Figure 5.2 panels, starting from an initial matrix partitioning into blocks b0 to b5: a) Initial partition mapping; b) Shrinking recovery without repartitioning (load balancing is sacrificed); c) Shrinking recovery with repartitioning for load balancing; d) Non-shrinking recovery.]
Figure 5.2: Shrinking recovery versus non-shrinking recovery for a 2-dimensional data grid.
shrinking recovery with load balancing results in the slowest performance and the highest performance variability. Changing the block partitioning to achieve load balancing means that the content of a new block may have been distributed among multiple processes before the failure (similar to the example in Figure 5.2-c). During recovery, each place communicates with a group of other places to restore the contents of its blocks, which explains the resulting performance overhead. Many optimizations have been made to RX10 and the iterative framework since this paper was published. However, the results are still useful as a demonstration of the benefit of non-shrinking recovery for statically balanced bulk-synchronous applications.
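To see why repartitioning spreads the recovery traffic, consider rows split evenly over the places before and after the failure. The following Python sketch is a simplified model and not the benchmark code: it assumes contiguous, even row ranges, and the function names are hypothetical. For each new block, it lists which old blocks overlap it.

```python
def partition(total_rows, num_parts):
    """Split [0, total_rows) into num_parts contiguous ranges, as evenly as possible."""
    base, extra = divmod(total_rows, num_parts)
    ranges, start = [], 0
    for i in range(num_parts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def sources_after_repartitioning(total_rows, old_parts, new_parts):
    """For each new block, list the indices of the old blocks that overlap it."""
    old = partition(total_rows, old_parts)
    new = partition(total_rows, new_parts)
    sources = []
    for n_start, n_end in new:
        overlapping = [i for i, (o_start, o_end) in enumerate(old)
                       if o_start < n_end and n_start < o_end]
        sources.append(overlapping)
    return sources
```

Going from four blocks to three over 12 rows, every new block draws on two old blocks, so every surviving place has to fetch part of its data from somewhere else. When the partitioning is unchanged, each block maps to exactly one old block.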
Our store supports non-shrinking recovery. Replicas lost due to a place failure are recovered on a spare place using the redundancy available in the store, as shown in Figure 5.4. The PlaceManager class provides a logical group of places, named the active places, that maintains a fixed size despite failures as long as spare resources are available. By replacing a failed place with a new place at the same location, it automatically supports place virtualization: the program can use the order of a place within the group as its identifier, rather than the physical place id (see Section 2.4.5.1). Our store spans the group of active places and recovers the data with the awareness that a failed place will be replaced with a new place in the same order.
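The idea of a fixed-size group of active places backed by spares can be illustrated as follows. This is a Python sketch under stated assumptions, not the X10 PlaceManager API: the class `ActivePlaces`, its methods, and `recovery_plan` are hypothetical names, and the choice of copy sources is inferred from the ring layout of master and slave replicas described earlier.

```python
class ActivePlaces:
    """Hypothetical stand-in for a fixed-size logical group of places."""

    def __init__(self, active, spares):
        self.active = list(active)  # list index = logical order, value = physical id
        self.spares = list(spares)

    def replace(self, failed_id):
        """Put a spare at the failed place's position; return (index, new_id)."""
        if not self.spares:
            raise RuntimeError("no spare place available")
        index = self.active.index(failed_id)
        new_id = self.spares.pop(0)
        self.active[index] = new_id
        return index, new_id


def recovery_plan(group, index):
    """Name the surviving replicas that the new place at `index` copies from.

    The lost master is rebuilt from its slave on the right neighbor; the lost
    slave is rebuilt from the master on the left neighbor.
    """
    n = len(group.active)
    left, right = (index - 1) % n, (index + 1) % n
    return {
        "master": {"data": index, "copy_from_slave_at": group.active[right]},
        "slave": {"data": left, "copy_from_master_at": group.active[left]},
    }
```

With active places 0 to 3 and spare place 4, a failure of place 2 puts place 4 at logical position 2. The new place then recreates Master 2 from the slave held at place 3 and Slave 1 from the master held at place 1, while the program keeps addressing it as position 2.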
[Figure 5.3 plots: three weak-scaling panels on Native X10 v2.5.2, each showing total time (s) against the number of places for four configurations: shrinking with load balancing, shrinking, non-shrinking, and non-resilient (no failure). Panels: Linear Regression (50K examples per place, 500 features); Logistic Regression (50K examples per place, 500 features); PageRank (2M edges per place).]
Figure 5.3: Weak scaling performance for three GML benchmarks with non-shrinking and shrinking recovery (see [Hamouda et al., 2015] for more details).
[Figure: the active places are Place 0 (Master 0, Slave 3), Place 1 (Master 1, Slave 0), Place 2 (Master 2, Slave 1), and Place 3 (Master 3, Slave 2); Place 4 is a spare place, on which Master 2 and Slave 1 are created.]