
1.3 Lock-free Data Structures

1.3.2 Throughput

A common metric for measuring the performance of lock-free data structures is throughput, defined as the number of successful operations per unit of time. For generic lock-free algorithms, the execution time of a single operation cannot be bounded. It is then more natural to consider sequences of operations instead, since not all the operations in a sequence will encounter bad executions. In this context, performance is often measured as the average system throughput over a sequence of operations.
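As a minimal sketch of this metric, the helper below computes the average throughput over a sequence of operations from their individual durations. The name `average_throughput` and the sample durations are illustrative assumptions, and the operations are assumed to run back to back on a single process.

```python
def average_throughput(durations):
    """Successful operations per unit of time over a whole sequence.

    `durations` holds the time taken by each successful operation
    (hypothetical sample data; operations are assumed to run back to back).
    """
    total_time = sum(durations)
    if total_time <= 0:
        raise ValueError("total time must be positive")
    return len(durations) / total_time


# One operation hits a bad execution (8 time units); the others take 1 each.
sample = [1.0, 1.0, 8.0, 1.0, 1.0]
print(average_throughput(sample))  # 5 operations completed in 12 time units
```

The single slow operation is unbounded in principle, but its effect is averaged out over the sequence, which is why the sequence-level metric is the more natural one.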

We are interested in the throughput of concurrent lock-free data structures and in the underlying factors that drive this throughput. These impacting factors are relevant to the performance of all lock-free data structures that we consider in this thesis, but their significance differs based on the characteristics of the data structure and on the context in which it is used.

Retry loop and hardware conflicts: Lock-free operations cannot be blocked, but some parts of an operation can be repeated due to conflicting concurrent operations within the retry loops. Under high contention, retry loop conflicts occur, and these retry loop conflicts might lead to a second type of conflict that we refer to as hardware conflicts. Retry loops contain atomic primitives that, while being executed, can stall other memory accesses (atomic primitives and reads/writes that access the same memory word). When multiple atomic primitives are issued in the same time interval, they serialize (the latency of memory accesses expands due to stall time), and this leads to significant performance degradation.
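The structure of a retry loop can be sketched as follows. Python has no hardware Compare-And-Swap, so the hypothetical `AtomicCell` class emulates the atomicity of the primitive with a lock; this is an assumption made only so that the sketch runs, and the names `increment` and `run` are illustrative as well.

```python
import threading


class AtomicCell:
    """Emulates one memory word with Compare-And-Swap.

    The lock stands in for the hardware atomicity of the primitive only;
    it is an assumption of this sketch, not a real lock-free primitive.
    """

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def increment(cell):
    """Retry-loop increment; returns the number of failed iterations."""
    failures = 0
    while True:
        old = cell.load()                        # read the shared word
        if cell.compare_and_swap(old, old + 1):  # try to publish the update
            return failures
        failures += 1                            # a concurrent operation won: retry


def run(num_threads=4, ops_per_thread=1000):
    """Run concurrent increments; returns (final value, failed iterations)."""
    cell = AtomicCell(0)
    failed = [0] * num_threads

    def worker(index):
        for _ in range(ops_per_thread):
            failed[index] += increment(cell)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load(), sum(failed)
```

Every call to `increment` eventually succeeds, so the final value equals the number of operations; the count of failed iterations varies from run to run and reflects the retry loop conflicts. The guard in the emulation also loosely mirrors how atomic primitives on the same word serialize.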

Under high contention but in the absence of hardware conflicts, failing retry loop iterations introduce additional useless work to the failing (repeating) process, but they often do not decrease the system performance. This is because two successful retry loop iterations cannot overlap in time, and the successful one cannot be obstructed by failing retry loop iterations if there are no hardware conflicts. Therefore, increasing the number of processes in the retry loop would merely increase the number of failed retry loop iterations, but would not harm the system performance. Hardware conflicts, however, not only introduce useless work (through waiting time) to the failing process but also harm the system performance. Think of a sequence of serialized Compare-And-Swap instructions: while one process will execute a successful Compare-And-Swap (due to the progress guarantee), the rest of the processes in the retry loop are doomed to fail. If they are scheduled to execute their Compare-And-Swap while the potentially successful one is pending, the system performance is reduced. Failing Compare-And-Swaps do not change the content of the memory word; they only obstruct the successful one. This impact can escalate as the number of processes in the retry loop increases. It gets harder for a successful process to get out of the retry loop (i.e. the ratio of failing to successful Compare-And-Swaps increases), and the additional delay of the successful operation leaves more room for new processes to arrive at the retry loop, which increases the contention further. This interplay might create hot spots.

In such cases, back-off strategies can be used to convert this harmful work (failing Compare-And-Swaps) into harmless, though still useless, work. Failing processes can back off, instead of retrying immediately, to let the others succeed with less obstruction. Back-off would increase the system performance, but its amount should be tuned, since a small amount might be ineffective and a large amount might lead to underutilization of the resources.
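A minimal sketch of a retry loop with back-off is given below. It assumes an exponential schedule with a cap, which is one common choice and is not prescribed by the text; the function name `retry_with_backoff` and the parameters `base`, `cap` and `sleep` are hypothetical.

```python
import time


def retry_with_backoff(try_once, base=1e-6, cap=1e-3, sleep=time.sleep):
    """Call `try_once` until it returns True, backing off after each failure.

    The delay doubles after every failed attempt and is capped at `cap`
    (exponential back-off is an assumption of this sketch).
    Returns the list of delays that were applied.
    """
    applied = []
    attempt = 0
    while not try_once():
        delay = min(cap, base * (2 ** attempt))
        sleep(delay)            # step aside instead of retrying immediately
        applied.append(delay)
        attempt += 1
    return applied


# Usage: an attempt that fails three times before succeeding.
outcomes = iter([False, False, False, True])
recorded = []
delays = retry_with_backoff(lambda: next(outcomes),
                            base=1, cap=3, sleep=recorded.append)
print(delays)  # delays applied after the three failures
```

Here `base` and `cap` are the tuning knobs mentioned above: a `base` that is too small barely reduces the obstruction, while a `cap` that is too large leaves processes idle although the memory word is free.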

Lock-free data structures that have inherent sequential bottlenecks are more prone to retry loop conflicts, and thus to hardware conflicts. For such data structures, accesses are concentrated on a small number of memory words. For example, a plain stack is accessed via its top pointer by all of its operations; in the same way, queue operations access either the head or the tail of the queue. Regardless of the size of the stack (the number of elements inside), all operations access the top pointer. This characteristic might lead to contention in the form of hot spots whose severity is determined by the number and access rate of the processes that are performing the operations.
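The concentration of accesses on one memory word can be seen in the following sketch of a stack in the style of the well-known Treiber stack. The atomicity of Compare-And-Swap is emulated with a lock, which is an assumption made so that the sketch runs in plain Python; the class and method names are illustrative.

```python
import threading


class Node:
    def __init__(self, value, next_node):
        self.value = value
        self.next = next_node


class TreiberStack:
    """Stack whose operations all go through the single `top` word."""

    def __init__(self):
        self._top = None
        self._guard = threading.Lock()  # emulates CAS atomicity only

    def _cas_top(self, expected, new):
        with self._guard:
            if self._top is expected:
                self._top = new
                return True
            return False

    def push(self, value):
        while True:                      # retry loop on `top`
            old_top = self._top
            if self._cas_top(old_top, Node(value, old_top)):
                return

    def pop(self):
        while True:                      # retry loop on `top`
            old_top = self._top
            if old_top is None:
                return None              # empty stack
            if self._cas_top(old_top, old_top.next):
                return old_top.value
```

Both `push` and `pop` read `top` and then try to replace it, so every concurrent operation competes for the same word no matter how many elements the stack holds.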

Number of Loads/Stores and Cache Misses: The previously mentioned factors (retry loop conflicts and hardware conflicts) are specific to use cases with high concurrency. There are also performance impacting factors that are not related to concurrency and that appear in both sequential and concurrent executions. For example, consider a binary tree (or a skip list, or a hash table), which might lead to accesses to a large number of different memory words over a sequence of operations. Even in the absence of concurrency-related conflicts, one needs to estimate the number of memory word accesses per operation and, connected to this in the practical domain, the number of cache capacity misses. This estimation might not be trivial for some data structures, such as a binary tree, in contrast to simpler data structures such as stacks or queues.
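To make this estimation concrete, the sketch below counts the node accesses and cache misses of random searches in a balanced binary search tree. It rests on several simplifying assumptions that are not taken from the text: the tree is implicit and perfectly balanced over the keys 0..num_keys-1, each node occupies one cache line, keys are drawn uniformly at random, and the cache is fully associative with LRU replacement. All names are hypothetical.

```python
import random
from collections import OrderedDict


def search_path(num_keys, key):
    """Node indices touched when searching `key` in a balanced BST.

    The tree is implicit: the node for a key range is its middle element.
    """
    low, high, path = 0, num_keys - 1, []
    while low <= high:
        mid = (low + high) // 2
        path.append(mid)            # one memory word (node) accessed
        if key == mid:
            break
        if key < mid:
            high = mid - 1
        else:
            low = mid + 1
    return path


def count_misses(num_keys, cache_lines, num_searches, seed=0):
    """Return (node accesses, misses) under a fully associative LRU cache.

    Misses include both cold misses and capacity misses.
    """
    rng = random.Random(seed)
    cache = OrderedDict()
    accesses = misses = 0
    for _ in range(num_searches):
        for node in search_path(num_keys, rng.randrange(num_keys)):
            accesses += 1
            if node in cache:
                cache.move_to_end(node)        # refresh the LRU position
            else:
                misses += 1
                cache[node] = True
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict the least recently used
    return accesses, misses


print(count_misses(num_keys=1023, cache_lines=1023, num_searches=2000))
print(count_misses(num_keys=1023, cache_lines=16, num_searches=2000))
```

With a cache that holds the whole tree, only cold misses remain; with a small cache, the nodes near the root tend to stay cached while the deeper nodes are evicted between searches, which is where the capacity misses come from.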

On the bright side, this characteristic (accesses are not concentrated on a small number of memory words) might turn out to be an advantage in concurrent executions (i.e. leading to good scalability), because the processes might spread over different shared memory words (for example, over the different branches of a binary tree); this reduces the possibility of retry loop and hardware conflicts and, in turn, the possibility of hot spots. If we assume that the number of memory words is much bigger than the number of processes (excluding extremely imbalanced access patterns), the retry loop and hardware conflicts would have a negligible impact on the performance of such data structures.
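A rough feeling for this effect can be obtained from the sketch below, which computes the probability that at least two processes target the same memory word at a given instant. It assumes, as a simplification not made in the text, that every process picks one word uniformly and independently; the function name is hypothetical.

```python
def conflict_probability(num_processes, num_words):
    """Probability that at least two processes target the same memory word,
    assuming each process picks one word uniformly and independently."""
    if num_processes > num_words:
        return 1.0                  # pigeonhole: some word is shared
    no_conflict = 1.0
    for k in range(num_processes):
        # the k-th process must avoid the k words already taken
        no_conflict *= (num_words - k) / num_words
    return 1.0 - no_conflict


print(conflict_probability(8, 1))          # stack-like: a single hot word
print(conflict_probability(8, 1_000_000))  # many words, few processes
```

When all processes share one word, as for the top pointer of a stack, a conflict is certain; when the words vastly outnumber the processes, the probability becomes very small.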

This does not mean that these data structures are immune to contention, since every modification still requires a consensus. On the logical side, this consensus leads to a consistent view of the lock-free data structure that is accessed and modified by multiple processes concurrently in a non-blocking manner. On the practical side, achieving this consensus and spreading the information during and after its achievement impacts the performance of all processes in the system. This impact is small compared to the other mentioned impacts.

The struggle of processes executing the same retry loop is often viewed as the major source of contention when they try to propose different values for the same consensus object within the same time frame (which leads to retry loop conflicts and hardware conflicts). The impact of contention on the learners (the processes that read the modified memory word) is less apparent, since the contending events may not occur close in time. More precisely, consider two consecutive accesses to a memory word j by a process i that happen at times t0 and t1, respectively. For the access at t1, process i would experience a coherence cache miss if memory word j is modified by another process between t0 and t1. Search data structures, e.g. hash tables, skip lists and trees, contain multiple consensus objects (nodes), and this characteristic dramatically shifts the relative weight of the two effects: the impact of retry loop contention shrinks compared with that of coherence contention on the learners.
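The condition for a coherence miss can be sketched on a small access trace. The trace format and the function name `coherence_misses` are assumptions made for illustration; cold misses on the first access of a process are deliberately not counted.

```python
def coherence_misses(trace, process, word):
    """Count the coherence misses of `process` on `word`.

    `trace` is a time-ordered list of (process_id, word_id, kind) events,
    where kind is "read" or "write". An access by `process` counts as a
    coherence miss when another process wrote `word` since the previous
    access by `process`.
    """
    misses = 0
    seen_before = False   # has `process` accessed `word` yet?
    invalidated = False   # written by another process since that access?
    for pid, wid, kind in trace:
        if wid != word:
            continue
        if pid == process:
            if seen_before and invalidated:
                misses += 1
            seen_before = True
            invalidated = False
        elif kind == "write":
            invalidated = True
    return misses


# Process 0 reads word "j" at t0, process 1 modifies it, process 0 reads at t1.
trace = [
    (0, "j", "read"),   # t0
    (1, "j", "write"),  # modification by another process
    (0, "j", "read"),   # t1: coherence miss for process 0
    (0, "j", "read"),   # no intervening write: no miss
]
print(coherence_misses(trace, process=0, word="j"))
```

Note that the write and the later read may be far apart in time, which is why this form of contention is less visible than conflicts inside a retry loop.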

Throughout this thesis, we address these performance impacting factors in various configurations. We focus on the retry loop conflicts (and their subsequent performance impactor, hardware conflicts) for data structures that have sequential bottlenecks (e.g. stack, queue, priority queue, counter). We set parameters for our models to analyze the congestion points so as to cover a large set of possible lock-free data structure designs, contention levels, and use cases. For search data structures, we focus on the main impacting factors, the most significant of which are the number of memory accesses and the capacity and coherence cache misses. We construct a model based on these impacting factors and show that it can be instantiated with different abstract data types (e.g. skip list, hash table, binary tree).

In document Throughput and energy efficiency of lock-free data structures: Execution Models and Analyses (Page 35-38)