Industry Adoption - Interaction of Hardware Transactional Memory and Microprocessor Microarchit

2.3.1 Early Industry Approaches

Arguably, some of the first work on transactional memory started in industry. Herlihy and Moss’ widely recognised 1993 paper lists Herlihy’s affiliation as DEC [34]; in parallel, Stone, et al, from IBM worked on the “Oklahoma Update” and published their work in the same year [35].

Similar to Herlihy and Moss, the Oklahoma Update was an attempt at turning an N-element CAS operation into a more RISC-like structure. Stone, et al, extend the known LL/SC primitives by allowing multiple locations to be loaded (and then operated on), and then conditionally stored to. One key requirement for N-CAS is, however, that either all stores successfully become the next value in the coherence order, or none do. Therefore, having multiple independent store-conditional operations would not be correct. Instead, the authors propose essentially a two-phase commit protocol that performs (1) load-register, (2) store-contingent, and eventually (3) write-if-reserved. The third operation gets all lines in the exclusive cache state and performs an uninterruptible update. Similar to much later work also from IBM, Stone, et al, propose an automatic retry and exponential back-off in hardware. One further key observation of the Oklahoma Update is that when addresses are acquired in sorted order, the algorithm is deadlock-free, even though other write requests are held off.
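The all-or-nothing requirement and the sorted-order observation can be illustrated in software. The following is a minimal sketch, not the hardware protocol of Stone, et al: the per-word lock stands in for an exclusive cache-line reservation, and the names `Word` and `multi_word_update` as well as the synthetic `addr` field are assumptions made for this illustration. It further assumes that all words passed in one call are distinct.

```python
import itertools
import threading


class Word:
    """One memory word; the lock stands in for an exclusive cache-line reservation."""

    _addresses = itertools.count()

    def __init__(self, value):
        self.value = value
        self.addr = next(Word._addresses)  # hypothetical address, used only for ordering
        self.reservation = threading.Lock()


def multi_word_update(words, expected, new_values):
    """All-or-nothing N-word compare-and-swap over distinct Word objects."""
    # Phase 1: reserve every word, always in ascending address order.
    # Because all callers use the same order, no cycle of waiters can form.
    in_order = sorted(words, key=lambda w: w.addr)
    for w in in_order:
        w.reservation.acquire()
    try:
        # Phase 2: commit only if every word still holds its expected value.
        if any(w.value != e for w, e in zip(words, expected)):
            return False  # nothing has been written
        for w, v in zip(words, new_values):
            w.value = v
        return True
    finally:
        for w in reversed(in_order):
            w.reservation.release()
```

Two callers that name the same words in opposite argument order still reserve them in the same address order, which is what rules out deadlock; a failed comparison leaves every word untouched.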

2.3.2 First Industrial Silicon: Sun Rock and Azul

With the big wave of transactional memory research starting in 2004, two companies published work on silicon with transactional memory: Sun with their SPARC Rock CPU [188, 201], and Azul with their proprietary CPU used for running Java [199].

Sun Rock Sun’s Rock was a radically redesigned CPU, based on the concepts of scout threading and deferred execution of stalled instructions. The latter (also dubbed execute-ahead) keeps most of the core in-order, and on a long-latency operation (such as a cache miss), puts the corresponding instruction and all its dependents into a deferred queue, while continuing to execute (and retire) independent instructions.

Stores will buffer their data until the long-latency operation has been resolved. Once that is done, the deferred instructions are executed until they have caught up with the independent instructions. If the structures holding deferred operations are exhausted, the execute-ahead mechanism becomes non-architectural and only warms caches and branch predictors. When the long-latency operation completes, architectural execution resumes from a checkpoint created at that point.

Scout threading (or simultaneous speculative threading) uses a separate hardware thread to execute instructions simultaneously when the long-latency operation resolves: one thread executes the independent front, while the other executes the instructions from the deferred queue.

Rock provides many mechanisms needed for HTM: register checkpoints exist so that execution can turn non-architectural and be rolled back; stores need buffering until the deferred queue has caught up because of the strong memory model of SPARC (TSO [40]). Similarly, loads need to mark their cachelines as speculative so the core gets notified if a concurrent store changes the loaded value. In summary, adding HTM support to Rock mostly exposes the underlying hardware features to applications directly.

Therein lies, however, the biggest weakness of the Rock TM implementation: while the microarchitectural mechanisms need to be fast and correct, they are free to fail and abort the speculation in many cases. Because the microarchitectural speculation mechanism and transactional speculation are closely coupled, these microarchitectural events cause a high number of transaction failures. For example, branch mispredictions, TLB misses, and register window overflows (for function calls) can all cause transaction aborts. As a result, programmers have to perform many tweaks to their code to get reasonable TM performance [201].

Overall, Rock is an impressive design; unfortunately, it was never released commercially. According to the publications (the technical report has many more details), the biggest challenges were in validating the design (both the baseline and the HTM), because many logically parallel instructions execute from widely separated parts of the application instruction stream. For the HTM, the biggest challenge was actually to define the logical commit point [188].

Looking back at Rock, it becomes clear that (1) decoupling the microarchitectural ILP speculation method and the transactional speculation is advisable for usability and verification reasons, (2) usability of transactions is crucial (including meaningful abort codes), and (3) even support for small transactions (32 stores) can be very useful.

Azul Vega Azul’s Vega system is a specialised system designed to be massively parallel, with custom-made multi-core CPUs and specialised to run large-scale Java workloads [234]. For scalability, Vega systems support HTM for Java’s synchronized methods, effectively performing lock (or rather “monitor”) elision; the aim is to scale applications that were written for small core counts to their massively parallel systems. In 2009, Click gave a presentation on Azul’s design of, and experience with, their HTM solution [199]. Their cores are simple in-order cores, but up to 54 of them are on a single die. The HTM is implemented through tx-read and tx-written bits in the L1 data cache; the memory system, including the L2 and beyond, is unchanged. The ISA offers the usual instructions, and all loads / stores inside a transaction are transactional. One interesting point is that the register checkpoint taken at the beginning of the transaction is kept by software rather than hardware. Click specifically mentions that their design is unaffected by TLB misses or branch mispredictions, a clear side-remark at the Sun Rock design. Several applications lend themselves well to HTM usage, yet, according to Azul, the heuristics for when to acquire the lock and when to try the transaction are hard; they resort to profiling at runtime and switching the mechanism.

An additional complication is typical “anti-patterns” that artificially limit transactional throughput, especially with binary code that would otherwise lend itself well to transactional execution: single “number of elements” counters, and centralised performance counters. Often, when the code is rewritten, adding fine-grained locking has better performance characteristics and less erratic performance than HTM, according to Click’s experience.

2.3.3 IBM’s HTM: A Bouquet of Architectures and Microarchitectures

HTM in the L2 Cache: IBM BlueGene/Q An even more parallel and specialised market is served by IBM’s BlueGene/Q system and CPU (BGQ): high-performance computing [267]. The main cores are relatively simple in-order cores that achieve throughput by being 4-way threaded. The interesting aspect about BGQ is not only that it is the first commercially available CPU with HTM support, but more importantly, that transactions are implemented entirely outside of the core: a multi-version L2 cache keeps track of the read / write sets and speculative writes of all the connected cores. Transactions are started and committed through a special interface that is hidden behind a system call API and implemented through memory-mapped I/O to the L2 controller [281].

Implementing TM only in the L2 simplifies core design, but has several significant consequences: transaction start and end are costly operations, and hiding transactional stores from other threads on the core requires either flushing (and subsequent bypassing) of the L1, or remapping of locations to different physical addresses per hardware thread. A significant software layer deals with detecting and handling aborts, register checkpointing with liveness analysis, and other bookkeeping. In a follow-on publication, Wang, et al, dissect TM performance even further and find that the high overheads (118 clock cycles) of just entering and exiting transactions, together with the large L2 capacity (20 MB for transactional data) and its versioning abilities, make BGQ useful mainly for larger transactions, and they advocate the use of STMs for small transactions on this system [345]. Another interesting aspect is that small transactions cause the L2 to run out of version numbers for newly written data faster than it can recycle older, aborted / committed version numbers.

Suspend / Resume and Small Transactions: IBM Power 8 In their server-line Power series, IBM released support for HTM with Power 8 [287, 340, 353]. The ISA extensions are noteworthy in two respects: first, IBM chose to include support for suspending / resuming transactions, for example to support short switches to the OS kernel during a transaction; and second, TM is integrated as a strong synchronisation primitive into a weak baseline memory model. Because Power is not multi-copy atomic, and also permits a lot of local reordering of memory accesses, Cain, et al, needed to specify many behaviours such as fencing behaviour and transitivity preservation explicitly. Furthermore, Power 8 also specifies rollback-only transactions, which do not check for conflicts but allow local rollback of all modifications inside the transaction; this can be useful for trace compilation and optimisation techniques.

In the microarchitecture, Power 8 uses a write-through L1 data cache, with the point of coherence being in the L2 (similar to BGQ). Similarly, most of the tracking and conflict detection happens in the L2; the L1 caches therefore have to forward transactional read hits, and the LSU is used for tracking conflicts during the transit time. For simplicity, Power 8 adds an additional CAM (content-addressable memory) next to the L2 to track the transactional properties of the accessed cache lines, rather than extending the entire L2 with transactional memory tracking hardware. The size of that CAM limits the overall transaction footprint to 64 cache lines, which can become a limitation for larger transactions. In this work, I have experimented with similar additional buffers and found that 256 entries can be too small, particularly for the read set of transactions.
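To make the footprint limit concrete, the following minimal sketch models a fixed-size tracking structure in the spirit of the CAM described above. It is a simplified software model, not IBM’s implementation: the class name `FootprintTracker` and its interface are assumptions for illustration, and the 128-byte line size is taken from the POWER8 granularity listed in Table 2.1.

```python
class FootprintTracker:
    """Fixed number of entries, one per distinct cache line touched by a transaction."""

    def __init__(self, entries=64, line_bytes=128):
        self.entries = entries
        self.line_bytes = line_bytes
        self.lines = set()
        self.aborted = False

    def access(self, addr):
        """Record a transactional access; return False once capacity is exceeded."""
        if self.aborted:
            return False
        line = addr // self.line_bytes
        if line not in self.lines:
            if len(self.lines) == self.entries:
                self.aborted = True  # no free entry left: capacity abort
                return False
            self.lines.add(line)
        return True


# A dense walk over 8-byte elements fits 64 * 128 / 8 = 1024 elements ...
dense = FootprintTracker()
dense_ok = all(dense.access(i * 8) for i in range(1024))

# ... whereas a walk with a 4 KB stride exhausts the entries after 64 accesses.
sparse = FootprintTracker()
sparse_ok = all(sparse.access(i * 4096) for i in range(65))
```

With these assumptions, 64 entries of 128 bytes correspond to 8 KB of tracked data, and the access pattern rather than the number of bytes touched determines how quickly the limit is reached.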

While useful, the suspend / resume feature introduces a significant amount of complexity in the microarchitecture and nuances in the architecture, as well. One challenge is how to deal with transactions that abort while they are suspended and which values can be seen by code running while the transaction is suspended. Similarly, communicating values back into the resuming transaction can be tricky. For ASF, similar challenges arise due to the mixing of transactional and non-transactional accesses; these can be particularly unexpected when they are to two different words of the same cacheline, a case of transactional / non-transactional false sharing.

Configurable and Guaranteed Progress: IBM z-Series Finally, the z-Series marks the third HTM design and implementation in IBM’s processor families. According to IBM, the zEC12 is the first commercially available CPU with HTM [270, 276].

Again, the z-Series HTM architecture differs from those in BGQ and Power 8: it adds constrained transactions with guaranteed progress, has a configurable policy for register checkpointing (the programmer can decide not to checkpoint), and offers an instruction that assists with waiting after an abort has happened, using hardware knowledge about the size of the system. Additionally, applications can decide whether or not to forward in-transaction exceptions to the operating system.

The constrained transactions provide a limited progress guarantee even under contention, as long as the transactions observe stringent limitations for size (instructions and memory operations), and structure (no backward branches). The hardware will try these transactions until they succeed, and can use heavy hammer mechanisms such as full bus locks which effectively stall all other cores in the system if all other methods are unsuccessful.

On the microarchitectural level, zEC12 also features write-through L1 and L2 data caches that do not store dirty lines; instead, there is a store-gathering cache that can sink overlapping stores and feeds data into both the L2 and the L3. Transactions use this buffer as the versioning widget; therefore, transactions are limited to 64 128-byte cache lines (8 KB) of written data. The L1 also temporarily stores transactionally written data so that the transaction can observe its own writes from there. This data is cleared from the L1 on transaction aborts, and the values then have to be fetched from the L2.

Conflicts are avoided by sending a limited number of NACKs per transaction so that the transaction currently holding the line has a higher chance to complete. Another interesting detail that Jacobi, et al, mention is that they mark read set entries speculatively (based on branch prediction), because they do not want to add a second access to the L1 data cache when the load becomes non-speculative. I will show more detail for this transactional overmarking later in the thesis. Finally, zEC12 uses a clever trick (first found in VTM [86]) to extend the reach of the read set tracking in the L1: they mark an entire set as transactional when a transactional read set entry is evicted and abort the transaction if any remote store hits in this set. Due to the inclusive L2 cache, however, this only extends transactions up to the size of the L2, as entries evicted from the L2 will back-invalidate the L1 and thus hit the set-matching overflow mechanism.
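The set-level overflow marking can be sketched as a small software model. This is an illustration under stated assumptions, not the zEC12 design: the class `L1ReadSet` is hypothetical, the geometry (64 sets, 6 ways, 256-byte lines) is derived from the L1 parameters listed for zEC12 in Table 2.1, LRU replacement is assumed, and the model holds only transactionally read lines.

```python
class L1ReadSet:
    """Per-line tx-read tracking plus one overflow bit per cache set."""

    def __init__(self, num_sets=64, ways=6, line_bytes=256):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # Each set lists its tx-read lines, least recently used first.
        self.sets = [[] for _ in range(num_sets)]
        self.overflow = [False] * num_sets
        self.aborted = False

    def _locate(self, addr):
        line = addr // self.line_bytes
        return line, line % self.num_sets

    def tx_load(self, addr):
        line, idx = self._locate(addr)
        resident = self.sets[idx]
        if line in resident:
            resident.remove(line)        # refresh LRU position
        elif len(resident) == self.ways:
            resident.pop(0)              # evict a tx-read line ...
            self.overflow[idx] = True    # ... and remember that at set granularity
        resident.append(line)

    def remote_store(self, addr):
        """Return True if the store from another core aborts the transaction."""
        line, idx = self._locate(addr)
        if line in self.sets[idx] or self.overflow[idx]:
            self.aborted = True
        return self.aborted
```

Once a set has overflowed, any remote store that maps to it aborts the transaction, including stores to lines the transaction never read; the mechanism trades precision for reach.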

2.3.4 Intel TSX: (Semi-)Transparent Hardware Lock Elision

Intel also released both ISA extensions and CPUs with support for HTM with their fourth-generation “Haswell” CPU design [303, 367]. Architecturally, Intel’s RTM design implements a typical best-effort transactional memory system with register checkpoints, transaction start / end primitives, user-visible abort, and no option to poke through the transactions or any form of guarantees. An interesting addition is the hardware lock elision (HLE) extension, which is undoubtedly informed by Rajwar’s earlier work on lock elision [58, 62]. The idea there is that the application consists of normal lock acquire / release operations and the hardware converts them transparently into transactions (with some additional logic to control the lock variable itself and retry). That way, the same application binary can run on newer systems using TM speculation, and will fall back to standard locks on older systems without any additional code paths.

Unfortunately, there is not a single standard locking instruction or code sequence, and it is furthermore hard to differentiate locks from other uses of similar instructions (such as atomic increment of a ticket lock vs a stats value), and in some cases, determining the polarity of the lock is not straightforward (as the unlock path also performs complex operations). During my work at AMD on ASF, I have found these issues when attempting fully transparent lock elision on arbitrary binaries. Intel apparently faced similar issues, and decided to require (backwards compatible) programmer annotation: XACQUIRE and XRELEASE prefixes in front of the acquiring / releasing instructions are ignored by legacy CPUs and will cause the switch to transactional execution on newer CPUs.

Intel has not released much detail about their microarchitecture; they do employ the L1 data cache for transactional read and write set tracking, and also employ a secondary, fuzzy structure to track read set elements that were evicted from the L1 cache. There is also an additional buffer that holds the lock value during HLE without making it globally visible, so that multiple concurrent HLE critical sections on the same lock can execute while each local thread sees the lock as taken.

Despite the simple implementation and the L1 data cache performing most of the work, transaction entry / exit latency is higher than that of normal lock acquisitions and single-instruction atomics; therefore, Intel suggests batching multiple updates together (lock coarsening), and replacing the acquisition of multiple locks with a single transaction (lockset elision). Further performance improvements can be achieved by implementing simpler, faster algorithms, or algorithms which are more clever but did not have a fine-grain locking implementation before.

Unfortunately, Intel had issues in the first three generations of their HTM implementation [365, 366], and had to switch the feature off. Little is known about the actual details, but presumably the transactional isolation properties do not hold in specific corner cases.

In summary, Intel provides support for transactional memory in mainstream x86 architectures, and support for it is being integrated into standard locking libraries such as glibc [273, 317, 318], and runtime environments such as Java [315, 324, 330]. Interestingly, all public software changes prefer the explicit transactional elision mode with RTM, rather than the HLE mode. Reasons cited are better visibility of aborts, more careful control over the number of restarts, and the need to patch the code in any case. Furthermore, it seems that even with a straightforward TM system, a commercially very successful CPU manufacturer struggles to provide a fully correct HTM implementation; this contrasts with those academic publications that keenly add significant complexity to both ISA and microarchitecture.
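The control over restarts that explicit elision gives to software can be sketched as follows. This is a simplified model and not the glibc or JVM code: `AbortingHardware` is a hypothetical stand-in for best-effort HTM, and `run_critical_section` with its `max_retries` parameter is an assumption for illustration. Real RTM-based elision reads the lock word inside the transaction so that a later lock acquisition aborts it; the model only checks the lock before each attempt.

```python
import threading


class AbortingHardware:
    """Stand-in for best-effort HTM: the first `aborts` attempts fail without side effects."""

    def __init__(self, aborts):
        self.remaining_aborts = aborts

    def attempt(self, body):
        if self.remaining_aborts > 0:
            self.remaining_aborts -= 1
            return False  # aborted: the body had no visible effect
        body()
        return True       # committed


def run_critical_section(lock, body, hardware, max_retries=3):
    """Run body exactly once; report whether it committed as 'transaction' or under the 'lock'."""
    for _ in range(max_retries):
        if lock.locked():
            break  # fallback lock is held: stop speculating
        if hardware.attempt(body):
            return "transaction"
    # Retry budget exhausted (or lock observed as taken): use the real lock.
    with lock:
        body()
    return "lock"


counter = {"n": 0}

def bump():
    counter["n"] += 1

lock = threading.Lock()
first = run_critical_section(lock, bump, AbortingHardware(aborts=1))
second = run_critical_section(lock, bump, AbortingHardware(aborts=5))
```

In this model, the first call commits transactionally on its second attempt, while the second call exhausts its three attempts and takes the lock. With HLE, the equivalent of `max_retries` is fixed by the hardware and not visible to software.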

2.3.5 Comparison of Commercial HTMs

Given the size of the design space covered by the different HTM implementations, in terms of both ISA and microarchitecture, an obvious question is: who has got it right? Qualitatively, the core transactional functionality is very similar across all the designs; however, each implementation adds its own architectural features, which are useful if they actually simplify software development. For now, there is no clear verdict on that; however, first indications suggest that Intel’s HLE extensions are not that useful in practice, as their added value is small and they hide useful transactional characteristics [317].

Due to the similar core functionality, yet different implementation choices, the quantitative comparison is also interesting: how do the different design decisions affect which transactions benefit, and how do they limit the overall parallelism that can be extracted in different workloads? Nakaike, et al, compare all four major HTM implementations (BGQ, zEC12, RTM, POWER8) [343], and also look at some of the qualitatively different features. In short, they find that there is no clear scalability winner, but HTM as a feature is overall useful. A summary of the different designs analysed can be found in Table 2.1.

Comparing the details of the different implementations shows that BGQ in particular has significant single-threaded overheads, but thanks to its large transactional working sets can support applications with a larger footprint. Furthermore, zEC12 delivers the largest speedups, while the smaller capacity of POWER8 can sometimes be a limiting factor.

Processor type      Blue Gene/Q                         zEC12         Intel Core i7-4770  POWER8
Granularity         8 - 128 bytes                       256 bytes     64 bytes            128 bytes
TX load capacity    20 MB (1.25 MB per core)            1 MB          4 MB                8 KB
TX store capacity   20 MB (1.25 MB per core)            8 KB          22 KB               8 KB
L1 data cache       16 KB, 8-way                        96 KB, 6-way  32 KB, 8-way        64 KB
L2 data cache       32 MB, 16-way (shared by 16 cores)  1 MB, 8-way   256 KB              512 KB, 8-way
SMT level           4                                   None          2                   8
#abort reasons      -                                   14            6                   11

Table 2.1: Comparison of the characteristics of commercially available HTMs. From [343].

Interesting conclusions are (1) tuning the retry policy is not trivial, and sometimes retrying even though hardware suggests otherwise is beneficial, (2) microarchitectural interactions can limit performance, such as on Intel RTM, where the prefetcher can push out transactional data or cause additional aborts, and (3) high abort rates (80% - 95%) can still produce speedups.

Finally, the additional features can be useful: zEC12 constrained transactions provide similar throughput without requiring careful tuning of retry policies for small data structures; Intel HLE suffers from the non-tunable retry / abort policies; suspend / resume in POWER8 can give small performance advantages by allowing transactions to spin on taken locks.

Nakaike, et al, give suggestions for future HTM systems: conflicts should be detected as precisely as possible, and the interaction with prefetchers in particular can be crucial; SMT significantly reduces the HTM resources available per thread, and thus needs to be carefully controlled; non-transactional loads / stores can be useful for debugging and thread-level speculation; most transactions are small: 10 KB fits most transactions, with a few outliers using up to 32 KB.

In document Interaction of Hardware Transactional Memory and Microprocessor Microarchitecture (Page 44-49)