HTM Product Experiences - Interaction of Hardware Transactional Memory and Microprocessor Micro

Most of my work in this PhD was performed before HTM implementations were commercially available. Even then, a debate over the actual value of the feature arose, and it was clear that HTM is not a “silver bullet” for all synchronisation and parallelisation challenges [155, 246, 252].

13 USPTO 8,943,278 and 8,914,586

14 USPTO application 20140181480

IBM and Intel are currently leading in terms of having HTM implementations available, and have had these for considerable amounts of time (since 2013). Intel especially are pushing HTM as a differentiator to competing platforms that do not have HTM (most notably AMD and ARM), and claim significant performance gains in SAP HANA’s radix tree and DPDK’s hash table, of 2.2x and 11x respectively [335, 348]. However, on closer inspection, most (if not all) of these spectacular gains can be achieved by using different, already concurrent data structures, and / or converting the existing single-global-lock version to use fine-grained locking.

[Figure 7.8 placeholder: bar chart of throughput (million reqs per sec, axis 0 to 40) for cuckoo+ with HTM, cuckoo+ with fine-grained locking, Intel TBB concurrent_hash_map, optimistic concurrent cuckoo, C++11 std::unordered_map, and Google dense_hash_map. Workload: 64 bit key/value pairs, read-to-write ratio = 1:1, 120 million keys.]

Figure 7.8: Throughput of different hash tables on a 4-core system. From [320].

Li et al. investigate the DPDK cuckoo-hash table in more detail in [320]. They break out the benefits of the different optimisations separately, and also convert the data structure to fine-grained locking. Figure 7.8 shows that performance varies significantly between different hash tables. Comparing the performance of the fine-grained version and the optimised TSX version, it becomes clear that the most significant (and arguably only) benefit of HTM is that it simplifies or avoids the work required to convert the data structure to fine-grained locking or lock-free operation. The authors note on that point:

Our results about TSX can be interpreted in two ways. On one hand, in almost all of our experiments, hardware transactional memory provided a modest but significant speedup over either global locking or our best-engineered fine-grained locking, and it was easy to use. . . .

On the other hand, the benefits of data structure engineering for efficient concurrent access contributed substantially more to improving performance, but also required deep algorithmic changes to the point of being a research contribution on their own.
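The conversion from a single global lock to fine-grained locking that this comparison rests on can be sketched in a few lines. The following is a minimal illustration only, assuming a simple lock-striping scheme; it is not the cuckoo+ design from [320], and the names `StripedMap`, `kStripes`, and `fill_concurrently` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: instead of one global lock around one table, the keys
// are partitioned into stripes, each protected by its own mutex. Operations
// on keys in different stripes no longer serialise against each other.
class StripedMap {
public:
    void put(std::uint64_t key, std::uint64_t value) {
        Stripe& s = stripe_for(key);
        std::lock_guard<std::mutex> guard(s.lock);
        s.table[key] = value;
    }

    // Returns true and writes the stored value to `out` if the key exists.
    bool get(std::uint64_t key, std::uint64_t& out) {
        Stripe& s = stripe_for(key);
        std::lock_guard<std::mutex> guard(s.lock);
        auto it = s.table.find(key);
        if (it == s.table.end()) return false;
        out = it->second;
        return true;
    }

    // Sums the stripe sizes one stripe at a time; this is not an atomic
    // snapshot while writers are active.
    std::size_t size() {
        std::size_t total = 0;
        for (Stripe& s : stripes_) {
            std::lock_guard<std::mutex> guard(s.lock);
            total += s.table.size();
        }
        return total;
    }

private:
    static constexpr std::size_t kStripes = 16;  // assumed stripe count

    struct Stripe {
        std::mutex lock;
        std::unordered_map<std::uint64_t, std::uint64_t> table;
    };

    Stripe& stripe_for(std::uint64_t key) {
        return stripes_[std::hash<std::uint64_t>{}(key) % kStripes];
    }

    std::array<Stripe, kStripes> stripes_;
};

// Inserts disjoint key ranges from several threads; each key maps to 2 * key.
inline void fill_concurrently(StripedMap& map, unsigned threads,
                              std::uint64_t keys_per_thread) {
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&map, t, keys_per_thread] {
            for (std::uint64_t i = 0; i < keys_per_thread; ++i) {
                const std::uint64_t key = t * keys_per_thread + i;
                map.put(key, key * 2);
            }
        });
    }
    for (std::thread& w : workers) w.join();
}
```

Even in this toy form, the striped version needs a partitioning decision and per-stripe reasoning that the global-lock version avoids. That engineering effort is what HTM-based lock elision promises to save.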

With absolute performance removed from the list of HTM benefits (as opposed to the very valid trade-off between performance and application engineering effort), the other key remaining aspect of TM is composability. Going back to the detailed work in [320], however, we find that HTM performance of the application improved significantly only after substantial application code had been moved out of the critical section / transaction. Such a modification, however, loses its benefit when the entire macro data structure operation, including the out-of-transaction prefix / suffix, is part of an outer transaction. In my earlier work on ASF 1, I have speculated about possible ways to compose such operations by extracting and merging the prefix and transactional part of the operations separately [158, 159]. In the meantime, several authors have refined and formalised that concept [240, 314, 346], under names such as Consistency Oblivious Programming, Partitioned Transactions, and Optimistic Transactional Boosting.
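The pattern of moving work out of the critical section can be sketched as an unsynchronised prefix followed by a short section that validates and commits. The example below is a generic illustration under stated assumptions, not the code from [320]: a mutex stands in for the hardware transaction, a version counter provides the validation, and the names `VersionedCell`, `update_with_prefix`, and `run_increments` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared cell: every committed write bumps `version`.
struct VersionedCell {
    std::mutex lock;  // stands in for the transaction / critical section
    std::atomic<std::uint64_t> version{0};
    std::atomic<std::int64_t> value{0};
};

// Runs `compute` on the current value outside the critical section (the
// prefix), then enters a short critical section that only validates the
// version and commits. Returns the number of attempts that were needed.
template <typename Fn>
unsigned update_with_prefix(VersionedCell& cell, Fn compute) {
    unsigned attempts = 0;
    for (;;) {
        ++attempts;
        // Prefix: read the version first, then the value, and do the
        // (potentially expensive) computation without holding the lock.
        const std::uint64_t seen = cell.version.load();
        const std::int64_t result = compute(cell.value.load());

        // Short critical section: commit only if nobody committed since
        // the version was read; otherwise retry the whole operation.
        std::lock_guard<std::mutex> guard(cell.lock);
        if (cell.version.load() == seen) {
            cell.value.store(result);
            cell.version.fetch_add(1);
            return attempts;
        }
    }
}

// Increments one cell from several threads and returns the final value.
inline std::int64_t run_increments(unsigned threads, unsigned per_thread) {
    VersionedCell cell;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&cell, per_thread] {
            for (unsigned i = 0; i < per_thread; ++i) {
                update_with_prefix(cell, [](std::int64_t v) { return v + 1; });
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return cell.value.load();
}
```

If such an operation is called from inside an enclosing transaction, the prefix executes transactionally as well, so the short critical section no longer reduces the transaction's footprint. This is the composability problem that the prefix-merging approaches cited above try to address.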

With the remaining valid feature of TM being simpler development of fine-grained concurrent data structures, one should note that the overall complexity in the system remains, but is pushed into the hardware layer. This can be seen, for example, in the fact that Intel had to disable their HTM in the first three generations of products due to errors in the implementation [365, 366].

7.6 Summary

Despite significant work from both industry and academic researchers, and several commercially available implementations, transactional memory still has many open questions and room for further optimisation. In this chapter, I presented a small selection of my work in three areas: understanding semantic challenges when using HTM together with reasoning based on observing synchronised clocks; additional features and regularised ISAs that ease manual synthesis of flexible lock elision primitives; and further ISA extensions (nested abort handlers, roll-forward mode) and microarchitectural improvements that provide better performance and larger capacities for HTM implementations.

In accordance with the industry research pipeline (research, patent, productise, publish), some of these ideas are only available publicly through patent applications and granted patents, while others are still in the patenting process and consequently cannot be discussed in this public thesis document. Despite a lot of work by me and by other academic and industry researchers in the field, mostly since 2006, there are still unexplored or unpublished extensions and optimisations twelve years later in 2018. My employers (both AMD and ARM) continue the investigation, and I am sure that other companies do too.

Conclusions

The key thesis of this work has been that one needs a detailed hardware substrate and ISA description to understand corner cases, feasibility, cost, and value of HTM implementations. Furthermore, I stipulated that, despite and because of the higher level of detail, interesting solutions to both instruction set design and microarchitecture for HTM would be possible and new challenges could be uncovered, leading to solutions different from those presented in the academic state of the art.

8.1 Summary

In the previous chapters, I have presented a summary of the state of the art in transactional memory and simulation (Chapter 2), followed by a production-level ISA extension for HTM, AMD’s Advanced Synchronization Facility (ASF), with background information and justifications for design decisions, challenges, and changes made due to experience from iterating through a full executable model, compiler, and real application stack (Chapter 3). ASF provides full HTM functionality, but adds some features (limited capacity / progress guarantee, non-transactional accesses) and differs from “textbook” designs (no full register checkpoint). I then showed various implementation options for that ISA extension in realistic and complex CPU cores that take into account interactions with the relevant CPU features, such as out-of-order execution, misspeculation, and complex memory hierarchies (Chapter 4), followed by a summary of ASF use cases and performance evaluation experiments, and a detailed tour through the simulator and the challenges unique to the simulator implementation of ASF (Chapter 5). Finally, I showed new use cases and extensions to ASF, namely building communication channels on top of ASF and a mechanism to use the limited register checkpoint for transactions that can be resurrected after they have been aborted (Chapter 6). I also presented further extensions (decomposing lock elision primitives, roll-forward mode, nested abort handlers, two-dimensional conflict tracking, and using the TLB to track larger objects) and a further challenge: handling time sources as an implicit communication channel between elided critical sections (Chapter 7).

My main contribution to the state of the art is the detailed level of microarchitectural, ISA, and system-level understanding for HTM that I have gained and made available to the research community through my work on the PTLsim and Marss86 simulation platforms [262, 304]. A further software artifact that I contributed to is the DTMC compiler toolchain, mainly through testing and bug fixing, helped by the detailed visibility into the whole system state in simulation. Additionally, my extensions to the Oracle Hotspot JVM uncovered several challenging real-world interactions that would break typical lock elision approaches [290]. These artifacts have been one leg of the foundation of the VELOX project, which has taken a holistic application-to-transistor view of transactional memory.

From a commercial perspective, my main contributions are certainly the significant number of microarchitectural HTM implementation variants, their evaluation, and associated patent filings with AMD (27 patents pending, 19 patents granted as of writing), and the detailed ASF ISA extension that we published [186] (also available in Appendix A).

Academically, many of my contributions are part of papers at top-tier conferences and high-class relevant workshops. On top of those publications that I (co-)authored, several other publications use the ASF implementation in the simulator for further experimentation. The full list of publications that I have worked on is:

• Hardware Acceleration for Lock-Free Data Structures and Software-Transactional Memory (EPHAM 2008 [158], Appendix B.1)

• ASF: AMD64 Extension for Lock-free Data Structures and Transactional Memory (MICRO 2010 [214])

• Evaluation of AMD’s Advanced Synchronization Facility Within a Complete Transactional Memory Stack (EuroSys 2010 [213])

• The Velox Transactional Memory Stack (IEEE Micro Journal [210])

• Implementing AMD’s Advanced Synchronization Facility in an Out-of-Order x86 Core (TRANSACT 2010 [220], Appendix B.2)

• Compilation of Thoughts about AMD Advanced Synchronization Facility and First-Generation Hardware Transactional Memory Support (TRANSACT 2010 [215], Appendix B.3)

• Sane Semantics of Best-effort Hardware Transactional Memory (WTTM 2010 [221], Appendix B.5)

• From Lightweight Hardware Transactional Memory to Lightweight Lock Elision (TRANSACT 2011 [254], Appendix B.4)

• Delegation and Nesting in Best Effort Hardware Transactional Memory (SPAA 2012 [274])

• Safely Accessing Time Stamps in Transactions (WTTM 2012 [263], Appendix B.6)

• Between All and Nothing - Versatile Aborts in Hardware Transactional Memory (SPAA 2013 and TRANSACT 2015 [289, 337], Appendix B.7)

As already reviewed in Section 1.4, these are largely split into three phases: (1) baseline ISA, microarchitecture, and evaluation with the full TM stack in 2010; (2) use cases and extensions (several of which did not turn into academic papers) until 2012, when the AMD Research office dissolved; and (3) further extensions with improved simulator models and extended use cases (2013 to 2015).

Comparing that to the trend in the field, my work came after several of the seminal STM and HTM papers. That time was not squandered, however: it allowed me and my collaborators to gain deeper insight into TM and related CPU architecture and microarchitecture concepts and challenges. Around 2013, academic interest in HTM shifted and commercial TM implementations became available, yet these were usually quite simple (often offering a single differentiating feature on top). Overall, most of my contributions coincide with the peak of academic interest in TM, see Figure 2.2.
