5.4 Simulator Implementation Details

5.4.4 Simulator Extensions and Complications

Cycle-approximate simulators operate at a higher level of abstraction than microprocessor RTL. One therefore expects them to be easier to extend and to prototype features in, while still giving accurate performance predictions and hinting at interaction patterns between components.

During my work on both PTLsim and Marss86, I did, however, face several challenges specific to the nature of the simulator implementations. The main concern is the general immaturity of these simulators. Compared to a commercial microarchitecture, the simulators often contain more bugs because they are not as rigorously validated and tested. I have fixed a significant number of bugs in both PTLsim and Marss86, in different areas of the design, often in tricky areas such as pipeline misspeculation recovery and the coherence protocol.

The second challenge in simulators is the lack of microarchitectural features. If a feature is not considered relevant to model, it is often left out of the simulator, either due to a lack of awareness or development time, or as a conscious decision, such as reducing simulation complexity to obtain faster execution or simpler simulator code.

In PTLsim, the lack of a cache coherence protocol required me to add a first-order approximation: a simple MSI coherence protocol with constant cache-to-cache propagation delays. The PTLsim model has zero-cycle cache-line invalidations and no bandwidth limitations for either normal memory requests / responses or snoop messages.
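To illustrate what such a first-order approximation might look like, the following is a minimal sketch of an MSI state tracker for a single cache line. It is not the PTLsim code: the type `MsiLine`, the constants `kCacheToCacheDelay` and `kMemoryDelay`, and their values are assumptions chosen for illustration only. Invalidations take effect in zero cycles and no bandwidth is modelled, matching the simplifications described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical latencies in cycles; the values are assumptions.
constexpr uint64_t kCacheToCacheDelay = 20;  // constant cache-to-cache transfer
constexpr uint64_t kMemoryDelay = 100;       // fill from memory

enum class State { Invalid, Shared, Modified };

// Tracks the MSI state of one cache line across all cores.
struct MsiLine {
    std::vector<State> state;  // one entry per core

    explicit MsiLine(std::size_t cores) : state(cores, State::Invalid) {}

    // Returns the miss latency charged to the requesting core (0 on a hit).
    uint64_t load(std::size_t core) {
        assert(core < state.size());
        if (state[core] != State::Invalid) return 0;
        bool remote_copy = false;
        for (std::size_t c = 0; c < state.size(); ++c) {
            if (c == core || state[c] == State::Invalid) continue;
            remote_copy = true;
            if (state[c] == State::Modified) state[c] = State::Shared;  // downgrade owner
        }
        state[core] = State::Shared;
        return remote_copy ? kCacheToCacheDelay : kMemoryDelay;
    }

    // Returns the latency charged to the storing core.
    uint64_t store(std::size_t core) {
        assert(core < state.size());
        if (state[core] == State::Modified) return 0;
        const bool had_local_copy = state[core] != State::Invalid;
        bool remote_copy = false;
        for (std::size_t c = 0; c < state.size(); ++c) {
            if (c == core || state[c] == State::Invalid) continue;
            remote_copy = true;
            state[c] = State::Invalid;  // zero-cycle invalidation, no bandwidth limit
        }
        state[core] = State::Modified;
        if (had_local_copy) return 0;  // upgrade: invalidations are free in this model
        return remote_copy ? kCacheToCacheDelay : kMemoryDelay;
    }
};
```

In this sketch, a store by a core that already holds the line in Shared state completes without any delay, which is exactly the kind of optimism that a zero-cycle invalidation model introduces.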

In both simulators, the sweeping simplification of a single global flat memory space makes it hard to deliberately keep multiple locally visible versions of data in the system. Adding such a mechanism requires laborious and error-prone detection of all places that silently rely on the simplification, and changing each of them to the new mechanism. The single flat memory view also permits zero-cycle communication between cores. This happens in a window where the flat view already holds the new value of the producing store, while the microarchitectural view has not yet propagated the invalidation to the consumer. The consuming load therefore still hits in its cache and should read the old data, but it can already observe the freshly produced data through the flat memory view.
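The window can be made concrete with a small sketch. It models one memory word, a producer, and a consumer whose cached copy is invalidated only after a propagation delay. All names (`TwoCoreModel`, `kInvalidationDelay`, `LoadResult`) and the delay value are hypothetical; the sketch only contrasts the value the simulator returns from the flat view with the value a data-carrying cache would have returned.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kInvalidationDelay = 10;  // assumed propagation delay in cycles

struct TwoCoreModel {
    uint64_t flat_value = 0;           // the single global flat memory (one word)
    uint64_t consumer_copy = 0;        // what a data-carrying consumer cache would hold
    bool consumer_line_valid = true;   // timing-only cache state of the consumer
    uint64_t invalidation_arrives = UINT64_MAX;
    uint64_t last_cycle = 0;

    // The producer's store is functionally visible at once; the invalidation is not.
    void producer_store(uint64_t now, uint64_t value) {
        flat_value = value;
        invalidation_arrives = now + kInvalidationDelay;
    }

    struct LoadResult {
        bool hit;            // timing outcome in the consumer cache
        uint64_t simulated;  // value the simulator returns (flat view)
        uint64_t expected;   // value a data-carrying cache would return
    };

    LoadResult consumer_load(uint64_t now) {
        assert(now >= last_cycle);  // simulated time must not run backwards
        last_cycle = now;
        if (now >= invalidation_arrives) consumer_line_valid = false;
        if (consumer_line_valid) return {true, flat_value, consumer_copy};
        // Miss: refill the line with the current data.
        consumer_copy = flat_value;
        consumer_line_valid = true;
        invalidation_arrives = UINT64_MAX;
        return {false, flat_value, consumer_copy};
    }
};
```

A load issued inside the window reports a hit but returns the new value, so `simulated` and `expected` differ; once the invalidation has arrived, the load misses and both values agree again.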

The availability of such shortcuts and safety nets is great for rapid prototyping, but when a feature requires the proper functionality, retrofitting a detailed model can be hard. This holds in particular for interconnect networks and coherence protocols, where both liveness and correctness are of paramount concern for commercial designs and which are notorious areas for bugs and research [70, 122]. There, the reliance on a safety net leads to a potentially large number of issues lurking in the corner cases of the actual mechanism.

For the simulator implementation of ASF, this meant that a significant amount of development, debugging, and bug-fixing effort was spent on existing features and functionality. In particular, I spent several man-months debugging deadlocks and correctness issues in Marss86’s new cache coherence protocol. These had not been found because the coherence protocol was never responsible for the actual data delivered between loads and stores, so its functional bugs had gone unnoticed.

Another missing feature that I added to both simulators was proper support for the AMD64 memory model, in particular in-order loads. Both simulators aggressively execute loads from different addresses out of order, which can be visible to carefully crafted litmus tests. The AMD64 memory model (and the Intel equivalent) [209] does not allow loads to be reordered with other loads. I therefore had to add the required logic to detect and fix cases in which the effects of out-of-order loads were visible to the application [27].

Another unfortunate simplification in the Marss86 simulator is the direct support of misaligned / cache-line straddling loads / stores. In real microarchitectures for ISAs that support them, these accesses induce very subtle corner cases, as they require splitting / merging of multiple cache lines and need delicate tracking logic. This logic is essential in real microarchitectures and had been present in PTLsim, which split these misaligned loads / stores in the cores and exposed them as multiple aligned loads / stores (with valid bit masks) to the other components in the system. For the Marss86 simulator, this logic had been removed to speed up simulation through reduced simulator complexity. Since caches, LSQs, etc., never transport any actual data, this is only a small issue, because the actual access to the global flat memory view can happen in a misaligned fashion. For the ASF implementation, however, this complicated the logic for marking and tracking transactional cache lines.
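As an illustration of the splitting step, the sketch below breaks a possibly straddling access into aligned pieces, each carrying a byte-valid mask. It is not the PTLsim implementation: the function `split_access`, the struct `AlignedPiece`, and the choice of a 64-byte line as the alignment granularity are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint64_t kLineBytes = 64;  // assumed cache-line size

struct AlignedPiece {
    uint64_t line_addr;  // line-aligned base address
    uint64_t byte_mask;  // bit i set => byte i of the line is accessed
};

// Splits the access [addr, addr + size) into one piece per touched cache line.
std::vector<AlignedPiece> split_access(uint64_t addr, unsigned size) {
    assert(size >= 1 && size <= kLineBytes);
    std::vector<AlignedPiece> pieces;
    const uint64_t end = addr + size;  // one past the last accessed byte
    while (addr < end) {
        const uint64_t line = addr & ~(kLineBytes - 1);
        const uint64_t offset = addr - line;
        const uint64_t chunk = std::min(end - addr, kLineBytes - offset);
        // Build `chunk` consecutive one-bits without shifting by 64.
        const uint64_t ones = (chunk == 64) ? ~0ULL : ((1ULL << chunk) - 1);
        pieces.push_back({line, ones << offset});
        addr += chunk;
    }
    return pieces;
}
```

With this representation, marking a transactional access amounts to marking every returned `line_addr`, so a straddling access naturally marks two lines instead of being treated as a single one.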

Finally, the C++ structure with its heavy use of templates and encapsulation sometimes made it challenging to connect the right components with one another. In particular, reaching into the coherence logic from inside the CPU core on a transaction abort / commit required poking holes through several layered interfaces. This puts into perspective the earlier observation that in C++ everything is just a method call away. Similar problems would be expected in a real implementation due to the distance between the decision making and the mechanism that is invoked; there, they manifest as timing violations / routing problems if a plain wire is used to connect the remote components.

ASF-specific Simulator Challenges

For the ASF implementation, a few things were crucial and tricky to get right in the simulator environment. I have already mentioned the general unreliability of the coherence protocol implementation, and the effects of the single flat global memory view.

For ASF, these complicated the implementation, because I effectively had to track conflicts in two layers: in the architectural memory view where an undo-log is kept per core for both versioning and conflict-detection with stores, and in the microarchitectural realm for both timing and correctness.

Initially, I had hoped that the architectural view could provide a safety net so that the transactions would be sound even in the face of a buggy coherency protocol. The issue is, however, that the outcome of most conflicts depends strongly on timing, which is not faithfully modelled in the architectural view. As a result, the two layers often disagreed on when a conflict would happen and on the order of concurrent conflicting accesses.

I therefore restricted the architectural view to data versioning and to conflict detection between transactional stores and other loads and stores. Even then, a large number of conflicts still disagreed in timing and abort decision between the microarchitectural and the architectural layer. I eventually reduced the conflict detection performed by the architectural layer to a minimum by no longer detecting architectural read-after-write conflicts, in which a load tries to read from a location that is being transactionally written. Instead, I forwarded the old data from the undo-log to the load, effectively performing selective lazy versioning of these stores. This required, however, carefully tying the forwarding of the old data into the path that handles misaligned loads.
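The forwarding scheme can be sketched as follows. This is a simplified illustration rather than the actual ASF simulator code: the class `UndoLog`, its methods, and the use of a hash map as a stand-in for the flat memory view are assumptions, and the sketch handles only aligned 8-byte words, so a misaligned access would have to be split before reaching it.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using Addr = uint64_t;
using FlatMemory = std::unordered_map<Addr, uint64_t>;  // stand-in for the flat view

// Hypothetical per-core undo log for transactional stores.
class UndoLog {
public:
    // Transactional store: remember the pre-transaction value, then write in place.
    void tx_store(FlatMemory& mem, Addr a, uint64_t value) {
        assert(a % 8 == 0);       // sketch assumes aligned 8-byte words
        old_.emplace(a, mem[a]);  // emplace keeps the first (oldest) value per address
        mem[a] = value;
    }

    // Load from another core: forward the old value instead of raising a
    // read-after-write conflict in the architectural layer.
    uint64_t remote_load(const FlatMemory& mem, Addr a) const {
        auto logged = old_.find(a);
        if (logged != old_.end()) return logged->second;
        auto current = mem.find(a);
        return current == mem.end() ? 0 : current->second;
    }

    // Commit: the in-place values become the only version.
    void commit() { old_.clear(); }

    // Abort: restore the old values exactly once.
    void abort(FlatMemory& mem) {
        for (const auto& entry : old_) mem[entry.first] = entry.second;
        old_.clear();  // a repeated abort is then a no-op and cannot clobber later stores
    }

private:
    std::unordered_map<Addr, uint64_t> old_;
};
```

Clearing the log on abort is a deliberate choice in this sketch: it guards against a second undo overwriting memory that has been legitimately stored to in the meantime.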

The architectural layer is then only responsible for detecting conflicts between transactional (and non-transactional) stores and will influence abort decisions in these cases.

In all other cases, the correctness of ASF depends on the proper function of the coherency protocol which previously was only relevant for effects on simulated performance rather than correctness. For that reason I had to rework and repair the coherency protocol in the Marss86 implementation.

In addition to the conceptual who-aborts-whom-when issue of having two mechanisms for conflict detection, the general coordination between the two layers was challenging, very similar to the issues mentioned in Chapter 4 about overlapping conflict detection intervals when moving data and the tracking responsibility. Examples of errors were double undo operations overwriting stored memory, and abort decisions that had been made but could only take effect at the boundaries of x86 instructions and cycles.

5.5 Summary

In the course of my work on ASF, I have added support for the instruction set extension to two state-of-the-art simulators for complex CPU cores executing the x86/AMD64 instruction set: PTLsim and Marss86. Both of these provide a significantly more detailed core model than the typical one-IPC in-order pipeline models used in most HTM research. As a result, I have obtained a much deeper understanding of microarchitectural interactions, not least thanks to the significant amount of debugging work performed. The resulting HTM implementation (ASF) is thus more believable and realistic than other feature-loaded proposals.

As a second result of my deepened understanding and the improved functionality, I have been the maintainer of PTLsim. All changes to the simulator and the extensions to the Hotspot JVM are publicly available as open source [262, 290, 304].

ASF has been widely evaluated by myself, collaborators in the VELOX project and related joint research, and has been used as the HTM of choice in other TM publications.

Extensions and New Use-Cases

6.1 Introduction

The published ASF instruction set extension was mainly designed to provide resource-efficient execution of transactions with explicitly marked accesses. The inverted mode, already described in Section 3.4.1, acknowledges the need to support unmodified binary code inside a transaction and to execute it transactionally.

The non-transactional accesses were mainly envisioned to bypass / ease capacity restrictions for accesses that did not require transactional conflict detection and versioning. In this chapter, I will introduce various extensions to the ISA and the usage of non-transactional execution for new programming patterns.

The next section (Section 6.2) will introduce the general concept of using non-transactional accesses for communication between transactions and the challenges such a model introduces. In Section 6.3, this ad-hoc communication is put on structured foundations and embedded into a more generic notion of parallel nesting of transactions. Finally, in Section 6.4, I will present how non-transactional memory accesses and light-weight extensions to the ASF ISA can be used to resurrect a transaction if it is aborted. Section 6.5 will conclude this chapter.

The work in this chapter has been presented at SPAA 2012: communicating transactions (joint work with Yujie Liu, and Michael Spear) [274], SPAA 2013 and TRANSACT 2015: resurrecting transactions (joint work with Martin Nowack, Michael Spear, and Christof Fetzer) [289, 337].

In document Interaction of Hardware Transactional Memory and Microprocessor Microarchitecture (Page 161-165)