
5.2 Applications of HTM

5.2.1 Direct HTM Usage

Microbenchmarks. Several concurrent data structures lend themselves readily to HTM, which can simplify and accelerate the algorithm. In my thesis work, I have used, extended, and evaluated concurrent integer set implementations that are based on various data structures: linked lists, skip lists, RB-trees, and hash sets.

At a high level, there are three ways to employ HTM in concurrent data structures. Direct usage wraps the entire traversal and update of the underlying data structure in a single transaction. This approach ensures that all data structure accesses are properly synchronised: the strict serialisability of hardware transactions provides linearisability.

Figure 5.1: The VELOX stack showing the different components created and modified for evaluating Hardware Transactional Memory.

Especially for linear data structures, this can cause performance problems, because every transaction accumulates a read set of size O(N) for an N-entry structure. Therefore, as a second option, I have split the operation for some data structures into a non-transactional scan followed by a transactional modification. The scan traverses the linked data structure without acquiring locks or using transactions until it finds the region that the operation will modify. The modification then proceeds as follows: SPECULATE; check data structure consistency, e.g., that the elements are still present and linked; perform the modifications; COMMIT.
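The scan-then-modify scheme described above can be sketched for a sorted linked list as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: tx_begin and tx_commit are hypothetical stand-ins for SPECULATE and COMMIT and are emulated with a global spin lock so that the code runs without HTM hardware, and the node layout with its deleted flag is likewise assumed. Memory reclamation and the atomic loads a real concurrent scan would need are left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* ASSUMPTION: hypothetical stand-ins for SPECULATE / COMMIT. A global spin
 * lock emulates atomicity here; real HTM detects conflicts in hardware. */
static atomic_flag tx_emulation = ATOMIC_FLAG_INIT;
static void tx_begin(void)  { while (atomic_flag_test_and_set(&tx_emulation)) { } }
static void tx_commit(void) { atomic_flag_clear(&tx_emulation); }

typedef struct node {
    int key;
    bool deleted;          /* a remove operation (not shown) would set this */
    struct node *next;
} node_t;

/* Create an empty sorted list bounded by INT_MIN / INT_MAX sentinels. */
static node_t *list_new(void)
{
    node_t *head = malloc(sizeof *head);
    node_t *tail = malloc(sizeof *tail);
    if (!head || !tail) { free(head); free(tail); return NULL; }
    *tail = (node_t){ .key = INT_MAX, .deleted = false, .next = NULL };
    *head = (node_t){ .key = INT_MIN, .deleted = false, .next = tail };
    return head;
}

/* Insert key (strictly between the sentinels); false if already present. */
static bool list_insert(node_t *head, int key)
{
    node_t *fresh = malloc(sizeof *fresh);
    if (!fresh) return false;
    fresh->key = key;
    fresh->deleted = false;

    for (;;) {
        /* Phase 1: scan without locks or transactions. */
        node_t *prev = head;
        node_t *curr = prev->next;
        while (curr->key < key) { prev = curr; curr = curr->next; }

        /* Phase 2: short transaction that validates, then modifies. */
        tx_begin();
        bool valid = !prev->deleted && !curr->deleted && prev->next == curr;
        bool present = valid && curr->key == key;
        if (valid && !present) {
            fresh->next = curr;
            prev->next = fresh;
        }
        tx_commit();

        if (!valid) continue;              /* region changed: rescan */
        if (present) { free(fresh); return false; }
        return true;
    }
}
```

The transaction touches only prev, curr, and the new node, so its read set stays constant in size regardless of the list length, which is the point of splitting off the scan.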

This algorithm borrows from conventional fine-grained locking approaches. One difference is that the scan phase aborts concurrent modifications whose transactions have already performed writes but have not yet committed. Readers therefore do not need to check for concurrent modifications (by scanning for locks, acquiring traversal locks, or read-locking per-node read-write locks), which improves their performance.

While it generally improves performance, this algorithm does not compose easily, despite using HTM, if the composition is to retain the performance-preserving non-transactional scanning phase. Such composition is possible with ASF, where the scanning phase can be located inside a transaction (the one used for composing the two data structure operations) but still be performed non-transactionally (with unmarked loads in non-inverted ASF).

Finally, a third option for using HTM in concurrent data structures is to synthesise a lower-level primitive from HTM, such as DCAS (double compare-and-swap), and to use it in a lock-free mechanism that updates the data structure. In the linked-list case, DCAS can monitor the next pointer of the to-be-deleted element and at the same time swing the previous element’s next pointer past it.
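As a rough illustration of the DCAS-based deletion, the sketch below compares two pointers and updates both only if neither has changed. The names dcas, list_remove, tx_begin, and tx_commit are hypothetical; the transaction that would synthesise DCAS from HTM is emulated with a global spin lock so that the code runs anywhere. The victim's next pointer is only monitored (rewritten with its old value); a complete lock-free list needs more than this, for example protection for concurrent inserts behind the victim and safe memory reclamation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* ASSUMPTION: stand-ins for a hardware transaction, emulated by a spin lock. */
static atomic_flag tx_emulation = ATOMIC_FLAG_INIT;
static void tx_begin(void)  { while (atomic_flag_test_and_set(&tx_emulation)) { } }
static void tx_commit(void) { atomic_flag_clear(&tx_emulation); }

typedef struct node {
    int key;
    struct node *next;
} node_t;

/* Double compare-and-swap: both locations must hold their expected values,
 * otherwise nothing is written and false is returned. */
static bool dcas(node_t **addr1, node_t *old1, node_t *new1,
                 node_t **addr2, node_t *old2, node_t *new2)
{
    tx_begin();
    bool ok = (*addr1 == old1) && (*addr2 == old2);
    if (ok) {
        *addr1 = new1;
        *addr2 = new2;
    }
    tx_commit();
    return ok;
}

/* Remove key from a sorted list whose head is a sentinel node.
 * Returns false if the key is absent. The unlinked node is not freed. */
static bool list_remove(node_t *head, int key)
{
    for (;;) {
        node_t *prev = head;
        node_t *curr = head->next;
        while (curr != NULL && curr->key < key) {
            prev = curr;
            curr = curr->next;
        }
        if (curr == NULL || curr->key != key)
            return false;

        node_t *succ = curr->next;
        /* Monitor curr->next (must still be succ) while swinging
         * prev->next from curr to succ. Retry from the head on failure. */
        if (dcas(&curr->next, succ, succ, &prev->next, curr, succ))
            return true;
    }
}
```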

Large Applications - STAMP. Microbenchmarks are a great way of debugging and testing the performance of an HTM implementation. They also reflect the use of transactions in small-scale, mostly homogeneous settings. The STAMP benchmark suite [166] is designed to be characteristic of large-scale TM usage and therefore presents larger transactions with a more diverse set of operations. In STAMP, the global memory accesses and the beginning and end of transactions have been identified by hand, which allows direct usage of transactional primitives.

Language- and Library-Level Integration of HTM. While hand annotation of transactional memory applications is possible for small benchmarks, the manual effort required to instrument all the right memory accesses and to check for escaping function calls and other exits is significant. This work should ideally be performed by a compiler that understands programming-language-level transactional constructs, adds the required transactional memory accesses for data, and inserts the instructions to start and end a transaction.

The most typical way of describing transactions in C-like programming languages is a special block, tm_atomic { ... }, that wraps an arbitrary sequence of code in a transaction.

During the course of this work, our partners in the VELOX research project created multiple transaction-aware compilers: Deuce for Java [226]; DTMC (the Dresden TM Compiler), based on LLVM, for C++ [150, 210]; and gcc-tm, a fork of GCC that adds support for atomic blocks (now part of mainline GCC).

Figure 5.1 shows the entire VELOX stack with different applications, languages, compilers, and libraries for the evaluation of transactional memory.

Conceptually, these compilers work in a very similar fashion. They instrument the entry and exit paths of the atomic block and add calls to a transactional-memory library. In addition, the compiler invokes read / write barrier functions of the TM library for every access to (potentially) shared memory. Finally, the compiler instruments called functions, checks that they do not perform operations that can escape the transactional memory mechanisms (such as I/O or system calls), and adds the code required for the control flow of a transaction abort.
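To make the instrumentation concrete, the sketch below shows by hand roughly what such a compiler might emit for a small atomic block. All names (tm_begin, tm_commit, tm_read, tm_write, transfer_instrumented) are hypothetical and do not correspond to the ABI of any of the compilers mentioned. The TM library is reduced to a global spin lock with barrier counters, and the abort control flow is omitted.

```c
#include <assert.h>
#include <stdatomic.h>

/* ASSUMPTION: a toy TM library. A global spin lock provides atomicity; the
 * counters only make the inserted barriers observable. */
static atomic_flag tm_global_lock = ATOMIC_FLAG_INIT;
static int tm_read_barriers;
static int tm_write_barriers;

static void tm_begin(void)  { while (atomic_flag_test_and_set(&tm_global_lock)) { } }
static void tm_commit(void) { atomic_flag_clear(&tm_global_lock); }

/* Read barrier: called for every load from potentially shared memory. */
static long tm_read(const long *addr)
{
    tm_read_barriers++;
    return *addr;
}

/* Write barrier: called for every store to potentially shared memory. */
static void tm_write(long *addr, long value)
{
    tm_write_barriers++;
    *addr = value;
}

/* What the programmer writes (language-level construct):
 *
 *     tm_atomic {
 *         *to   += amount;
 *         *from -= amount;
 *     }
 *
 * What the compiler conceptually generates: */
static void transfer_instrumented(long *from, long *to, long amount)
{
    tm_begin();                               /* entry path of the block */
    tm_write(to,   tm_read(to)   + amount);   /* one read + one write barrier */
    tm_write(from, tm_read(from) - amount);
    tm_commit();                              /* exit path of the block */
}
```

With an HTM backend, each barrier would shrink to little more than the memory access itself, which is why the per-barrier call overhead matters.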

The transactional memory functionality is provided by the TM library, which can implement various TM algorithms. For HTM, the TM library mainly acts as a thin proxy layer, translating the compiler-identified memory accesses and transaction start / end into the right use of the HTM primitives. In addition, the TM library may provide fallback paths in case a hardware transaction repeatedly fails. The simplest such fallback is to grab a global lock after repeated transaction aborts.
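The global-lock fallback can be sketched as a retry loop. This is an illustration under assumptions: htm_try_begin and htm_commit are hypothetical and simulated in software (the variable simulated_aborts_left decides how many attempts fail), MAX_HTM_ATTEMPTS is an arbitrary choice, and run_atomically is not the interface of any particular TM library.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_HTM_ATTEMPTS 3      /* ASSUMPTION: arbitrary retry budget */

/* ASSUMPTION: simulated hardware transactions. Each call to htm_try_begin
 * "aborts" while simulated_aborts_left is positive. */
static int simulated_aborts_left;
static bool htm_try_begin(void)
{
    if (simulated_aborts_left > 0) {
        simulated_aborts_left--;
        return false;           /* transaction aborted */
    }
    return true;                /* transaction is running */
}
static void htm_commit(void) { }

static atomic_flag fallback_lock = ATOMIC_FLAG_INIT;

typedef void (*critical_fn)(void *arg);

/* Run body atomically. Returns true if it ran as a hardware transaction,
 * false if the global-lock fallback was taken. */
static bool run_atomically(critical_fn body, void *arg)
{
    for (int attempt = 0; attempt < MAX_HTM_ATTEMPTS; attempt++) {
        if (htm_try_begin()) {
            /* A real implementation would also read the fallback lock here,
             * so that a concurrent lock holder aborts this transaction. */
            body(arg);
            htm_commit();
            return true;
        }
    }

    /* Repeated aborts: serialise on the global lock instead. */
    while (atomic_flag_test_and_set(&fallback_lock)) { }
    body(arg);
    atomic_flag_clear(&fallback_lock);
    return false;
}

/* Example critical section: increment a counter. */
static void increment(void *arg) { (*(int *)arg)++; }
```

A single simulated abort is absorbed by a retry, whereas a run of aborts that exhausts the budget sends the operation down the locked path; in both cases the body executes exactly once.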

If transactions are mapped onto HTM, the overheads of the transactional read / write barriers have to be small. In the small language integration layer introduced in Chapter 3 (Section 3.3), this is achieved by using only single-instruction inline assembly sequences for the barriers and short, hand-crafted assembly sequences for transaction start and end.

For compilers that support TM language primitives through a TM library, the work per read / write barrier might be more significant, including a function call / return and parameter passing. For STMs, which have to perform more work per transactional access, these overheads may not be too significant. For ASF, where the barrier is essentially a single LOCK MOV, these overheads have to be removed. Fortunately, the DTMC compiler uses LLVM, which can perform aggressive inlining of functions at link time and thereby significantly reduces these overheads.

For most of the work here, I have used and extended TinySTM with various ASF backends. Our joint publication at EuroSys 2010 describes the stack and presents results for various HTM implementations and benchmarks [213].

In document Interaction of Hardware Transactional Memory and Microprocessor Microarchitecture (Page 126-128)