Facilitating the Implementation of Concurrent Data Structures

1.2 Applying Transactional Memory in Real-World Applications

1.2.2 Facilitating the Implementation of Concurrent Data Structures

Concurrent data structures are fundamental building blocks in modern programs. In many applications, the performance of concurrent data structures determines the scalability of the program. TM facilitates the implementation of concurrent data structures for the following reasons:

• Synchronization in concurrent data structures is relatively straightforward; • The size of critical sections fit within HTM capacity in most cases;

• Data structures rarely contain complex concurrency control like irrevocable instructions or condition variables;

• TM is particularly appealing for data structures and applications with irregular or hard-to-predict memory accesses (e.g., the rebalancing operations of a balanced binary search tree mutation), which are difficult to implement efficiently using locks.

Based on that, there are many research papers leveraging TM to craft concurrent data structures with better scalability [27, 32, 63, 75, 120].

Balanced Binary Search Trees (BST) are commonly used in a wide range of applications because they provide a logarithmic bound on search operations. In a concurrent environ- ment, it is notoriously difficult to achieve satisfactory performance by using locks. Two reasons cause this dilemma: rebalancing and indirect deletion. Rebalancing refers to the

problem where an insert or remove operation causes imbalance in the tree, and the operation must restore balance before returning in order to guarantee asymptotic complexity for future operations. Indirect deletion happens when a target node has two children. The target node has to be replaced by its successor node (based on value) and the real delete happens at a leaf node. Fine-grained locking is difficult for balanced BST because when rebalancing happens, the thread traverses upward, which is opposite to the direction the thread traverses to search the elements from root to leaf. This behavior makes lock acquires and releases difficult, because it is impossible to prevent deadlock by acquiring locks in a canonical order. Lock-free approaches are similarly difficult. Existing ways to solve the problem include relaxing the balancing conditions or using coarse-grained synchronization. Another way is to use Read-Copy-Update(RCU) [79]. However, it increases the overhead of copying existing subtrees when changes are needed, and does not allow concurrent writes. Wrapping entire BST functions in HTM transactions is one option. The scalability for this approach will largely depend on the size of the BST and the number of write operations. Siakavaras [32] proposed a new method, which combines HTM and RCU, to implement concurrent balanced BST. The algorithm achieves several appealing results: It allows concurrent update, and it introduces negligible synchronization overhead on reading operations. Although the algorithm suffers from many problems, such as live locking and memory leaks, it provides the best performance among existing algorithms for concurrent, strictly balanced BST.

In previous work [75], we also discovered that HTM can be applied in many cases to accelerate existing non-blocking data structures. For example, hardware instructions such as CAS are directly supported by x86 processors, but only for one machine word. Software emulations of CAS can support multiple words, such as k-compare-and-swap (KCAS) [76]. Wait-free KCAS is expensive but can significantly simplify the design of non blocking data structures. In this case, HTM can be applied as the simple and fast path for KCAS (atomi- cally updating multiple memory locations). Another example is that update operations for non-blocking data structures usually are implemented by copy-on-write. Updating a bucket in nonblocking hash map [74] would require copying the whole bucket. It is expensive if the bucket contains many elements or the memory capacity is limited. HTM can provide a fast

path for threads to update the bucket in-place. In both cases, synchronization between the fast path and fall-back path is straightforward.

Memory Reclamation Problems Most non-blocking concurrent data structures in re-

search papers do not provide memory management in their implementation [11, 21, 32, 54, 75, 84, 95]. One possible reason is that memory management incurs latency. Hazard pointers [83] allow a data structure node to be reserved by a thread before it is accessed. If another threads intends to delete the node, it has to make sure no one is accessing it by checking the hazard pointer. This method introduces significant overhead because each node access will incur a write fence. Hazard pointers do not allow immediate memory reclamation when there are other threads accessing the candidate node. Epochs [41] have less latency but will delay the reclamation if one thread is blocked in an earlier epoch. It is difficult to bound the time between logical removal and physical reclamation [34], and many scalable techniques accept unbounded worst-case delay for a bounded [83] or unbounded [26] number of items. To avoid these delays, a system might fall back to complex or expensive measures when the amount of unreclaimed memory becomes too great [5, 10, 12, 19]. However, there will al- ways remain programs whose correctness depends on memory being reclaimed immediately, hence the need for precise memory reclamation.

In section 1.4, we discuss the incompatibility between non-blocking techniques and precise memory management. By leveraging the immediate abort property of HTM when accessing de-allocated memory locations, we propose revocable reclamation to bridge the gap between them. A detailed solution and implementation are described in Chapter 2.

In document Tailoring Transactional Memory to Real-World Applications (Page 32-34)