
4.7 High-level Interaction Between Microarchitecture and HTM

4.7.3 Influence of ASF-specific Features

Some of ASF’s features influence and complicate hardware design. Because ASF provides a limited capacity guarantee (effectively obstruction freedom for transactions with a read / write set of up to four cachelines), the hardware has to honour it. Hardware therefore cannot arbitrarily abort transactions, but must attempt to run small transactions in earnest. The guarantee does not extend to progress under contention, and it does not protect against external abort causes such as interrupts. Instead, the core must carefully handle branch prediction and how it adds entries to the transactional working set, especially with an associativity-limited tracking structure. In our chosen implementations, different structures effectively permit speculation, and the limiting structures only track the working set once the accesses are no longer speculative. Furthermore, our designs support cache and TLB misses while staying in transactional mode, and generally do not contain any unnecessary aborts.
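To illustrate why an associativity-limited tracking structure matters for the capacity guarantee, the following is a minimal toy model, not the simulator implementation. The class name, the 64-byte line size, and the modulo set-indexing function are assumptions made purely for illustration. It shows that the guaranteed capacity of such a structure is bounded by its associativity, because in the worst case all lines of a transaction map to the same set.

```python
LINE_BYTES = 64  # assumed cache-line size (illustrative)


class SetAssociativeTracker:
    """Toy model of an associativity-limited transactional working-set tracker."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [set() for _ in range(num_sets)]

    def track(self, address):
        """Record a non-speculative access; return False on a capacity abort."""
        line = address // LINE_BYTES
        entries = self.sets[line % self.num_sets]  # assumed index function
        if line in entries:
            return True  # line is already part of the working set
        if len(entries) == self.ways:
            return False  # set is full: the transaction would have to abort
        entries.add(line)
        return True


def run(tracker, addresses):
    """Return True if every access fits into the tracker."""
    return all(tracker.track(a) for a in addresses)


if __name__ == "__main__":
    # Four lines that all map to set 0 of a 4-set, 4-way tracker still fit.
    four_way = SetAssociativeTracker(num_sets=4, ways=4)
    print(run(four_way, [line * LINE_BYTES for line in (0, 4, 8, 12)]))
    # The same total capacity organised as 8 sets x 2 ways fails on the third
    # conflicting line, so it could not back a four-line guarantee.
    two_way = SetAssociativeTracker(num_sets=8, ways=2)
    print(run(two_way, [line * LINE_BYTES for line in (0, 8, 16)]))
```

In this model, a four-line guarantee requires at least four ways (or an additional overflow structure), regardless of the total number of entries.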

Another influence on the microarchitecture comes from ASF’s non-speculative accesses and only partial register checkpoints. Both of these allow state to leak from the transaction; the first challenge was to specify what behaviour programmers could expect to see from these accesses. Secondly, hardware must carefully adhere to the specification and not leak speculative transactional data through (falsely) aliasing non-transactional stores, and must also ensure that the register file state at abort time corresponds to an appropriate position in the code.

Non-transactional accesses must of course be distinguished from transactional ones, complicating the flow and verification space of the load / store path of the core.

Together, however, ASF’s limited amount of register checkpointing reduces the need for complex register stashing or renaming schemes; allows fast transaction entry and abort; and does not reduce ILP inside the transaction. Furthermore, non-transactional accesses allow more careful selective annotation and so a smaller tracking structure can be useful to more code that performs many operations on private memory.

Similar to overlapping tx and non-tx accesses, the RELEASE instruction requires careful sequencing in the instruction stream in order to not release later accesses that were hoisted before it due to out-of-order execution.

4.8 Summary

In this chapter, I have presented various options for implementing a BeHTM ISA extension with the example of AMD ASF. While conceptually simple on the ISA layer, implementations are complex because of the cross-cutting responsibility of different hardware mechanisms to coordinate transaction execution safely. Key challenges for any HTM implementation are the integration with other present layers of speculation, most prominently out-of-order execution in high-performance cores, and the integration with the specifics of the cache hierarchy and memory system. There, ensuring that conflicts are tracked through the life-time of transactions in spatially separate structures without any windows of vulnerability due to transition between mechanisms is crucial for correct transaction execution. Furthermore, choosing the right location for data versioning and making all stores of a successful transaction visible to the system at once while still watching for conflicts can be challenging depending on the layout and strength of the underlying memory substrate.

Additionally, there is a strong relationship between the chosen microarchitecture and easily observable application characteristics, in some cases resulting in a step-wise change for small input perturbations. On the other hand, the features that distinguish ASF from other “run-of-the-mill” BeHTMs also need careful consideration in the microarchitectural realisation.

Overall, however, it is possible to implement BeHTM with no, or very simple, changes to the overall memory system architecture and cache coherence protocol. The required changes to the baseline CPU core and caches are, however, still complex for a relatively “vanilla” BeHTM; therefore several of the proposals from the literature that require more invasive modifications seem prohibitive due to the required complexity and verification cost – especially for first-generation systems.

Applications and Evaluation of HTM

5.1 Introduction

Transactional memory, like any other microprocessor feature, does not live in a vacuum: both the ISA design and the characteristics of the microarchitectural implementation need to work well with the applications that use them. One of the first steps in the design and implementation of such a processor extension is the analysis of workloads and an understanding of the requirements and characteristics of potential uses of the feature. Therefore, after introducing the ISA of a BeHTM (Chapter 3) and microarchitectural implementation options and details (Chapter 4) in the previous chapters, this chapter will highlight use cases for BeHTM mechanisms and performance results of the BeHTM implementations described there. After understanding requirements and usage, we build architectural and implementation prototypes to study interactions on both architectural and microarchitectural levels.

The architectural level analysis leads to an understanding of usability of the feature and allows testing of prototype software with necessary changes to support the proposed feature. In the case of BeHTM, the architectural component is important, because of new control flow interactions (transaction aborts), new instructions (transaction start / end, marked / unmarked memory accesses), and as a vehicle to test compiler backends, transactional memory libraries and hand-crafted use of BeHTM in concurrent data-structures and higher-level primitives (such as DCAS).
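As a reminder of what a higher-level primitive such as DCAS has to provide, the following minimal sketch models its semantics only. It is not ASF code: the class `ToyMemory` is hypothetical, and a single lock stands in for the atomicity that a transaction would provide. A BeHTM-based version would instead wrap both accesses in one transaction and retry after an abort.

```python
import threading


class ToyMemory:
    """Word-addressed toy memory; one lock stands in for transactional atomicity."""

    def __init__(self):
        self.words = {}
        self._atomic = threading.Lock()

    def dcas(self, addr1, addr2, old1, old2, new1, new2):
        """Double compare-and-swap: update both words only if both match."""
        with self._atomic:  # models transaction start ... commit
            if self.words.get(addr1) != old1 or self.words.get(addr2) != old2:
                return False  # comparison failed; memory is left untouched
            self.words[addr1] = new1
            self.words[addr2] = new2
            return True


if __name__ == "__main__":
    memory = ToyMemory()
    memory.words.update({0: 1, 8: 2})
    print(memory.dcas(0, 8, 1, 2, 10, 20))  # both words match, both updated
    print(memory.dcas(0, 8, 1, 20, 5, 5))   # first word no longer matches
```

The point of the sketch is the all-or-nothing update of two independent locations, which is what makes DCAS a natural fit for small, bounded transactions.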

During my work on ASF, I was very fortunate to collaborate with our partners in the EU-funded VELOX project, who contributed compiler, language, and library support on top of ASF, and used this thesis’ simulation infrastructure for testing and performance evaluation of new compiler techniques. This collaboration resulted in multiple joint publications [158, 210, 213, 214, 220, 254, 274, 289, 337]; some of which (workshop papers without formal proceedings) I have attached to this thesis (Appendix B) and will use / paraphrase here for illustration.

Building an actual microprocessor for this thesis work was infeasible, so I implemented all described microarchitectural mechanisms inside detailed simulators (PTLsim [135] and Marss86 [253]). In fact, only close to the end of my thesis have products with enabled BeHTM actually started shipping (Intel - Q3-2013 [278, 286], IBM - Q3-2012 [270, 281, 340]) and been disabled again due to implementation bugs (Intel - Haswell, Broadwell, Skylake [366]). Through my exposure to actual proprietary microarchitectures (mainly at AMD 2006 - 2012), I have tried to keep the simulator implementations as realistic as possible, without sharing proprietary microarchitectural implementation details.

Despite a faithful simulator implementation, several fundamental differences remain between simulators and actual implementations in RTL / silicon. Since I have not published about these, I will use some space in this chapter (Section 5.4) to generally introduce high-level differences / characteristics of simulation environments (Section 5.4.1), show where this simplified my implementation (Section 5.4.3), and finally present cases where the simulator actually complicated the design significantly (Section 5.4.4).

The remainder of this chapter is organised as follows: Section 5.2 will briefly introduce usage scenarios for BeHTM and summarise characteristics of the applications studied. Section 5.3 will present highlights of the numerous performance studies undertaken with my ASF implementation and will discuss additional background information, not present in the selected publications.

Section 5.4 will present backgrounds and detailed analysis of the actual simulator implementations, simplifications and complications; and I will summarise and conclude this chapter in Section 5.5.