
4.7 High-level Interaction Between Microarchitecture and HTM

4.7.2 Progress

Overall, guaranteeing progress in hardware transactions is complex for three main reasons: (1) spurious aborts, which occur even without contention, (2) depletion of transactional resources, and (3) the difficulty of guaranteeing global progress with local policies.

The first category comprises all transaction aborts that are caused by imperfections of the underlying microarchitecture. Speculation itself is a best-effort feature, and microprocessors will fall back to non-speculative execution in extraordinary (and hopefully not performance-critical) circumstances. Depending on the coupling between OoO speculation and TX speculation, a failure in OoO speculation may cause a failure of TX speculation. The Sun Rock processor [188], for example, will abort transactions on branch mispredicts and TLB misses. Notably, these events are not visible at the architectural level, i.e., applications cannot directly observe them happening.

Even on microarchitectures that decouple the OoO and HTM mechanisms, such as the ones described previously in this chapter, architecturally invisible events can still obstruct transaction progress, for example scheduling issues or corner cases in the coherence protocol. Several examples were presented as implementation challenges / bugs in Sections 4.4.3 and 4.4.6; their workarounds could also have involved aborting the transaction upon detecting such a corner case. It is conceivable that real designs will err on the side of aborting too many transactions rather than putting transactional safety at risk.

Events that have a large, visible effect on the architectural state may also abort transactions, for example a call into the operating system / hypervisor, made either explicitly or implicitly through an exception (e.g., a page fault) or an interrupt.

Broadly, the second category, resource depletion, encompasses cases in which a transaction requires more resources than the transactional facilities provide. Depending on the implementation, transactions may be limited in instruction count (size of the reorder buffer), number of loads / stores (size of the load / store queue), amount of transactional data used (size of the tracking structure, e.g., the L1 data cache), and related metrics, such as address patterns in indexed associative tracking structures. The latter is particularly important, because the worst-case capacity of a cache-based implementation may be limited by the associativity of the cache. While there are academic proposals that bypass even the capacity constraints of caches, they are firmly outside of what is deemed practically implementable as of today [80, 86].
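The associativity limit can be made concrete with a small model. The following is a minimal sketch, not a description of any particular processor: the line size, the number of sets, the associativity, and the function names are all assumptions chosen for illustration (they correspond to a 32 KiB, 8-way cache). It checks whether every line touched by a transaction can be tracked at the same time.

```python
LINE_SIZE = 64   # bytes per cache line (assumed)
NUM_SETS = 64    # number of sets in the tracking cache (assumed)
WAYS = 8         # associativity (assumed)

def set_index(addr):
    """Set that the line containing addr maps to (simple modulo indexing)."""
    return (addr // LINE_SIZE) % NUM_SETS

def fits_in_cache(addresses):
    """True if all touched lines can be held simultaneously."""
    lines_per_set = {}
    for line in {addr // LINE_SIZE for addr in addresses}:
        idx = line % NUM_SETS
        lines_per_set[idx] = lines_per_set.get(idx, 0) + 1
    return all(count <= WAYS for count in lines_per_set.values())

# 512 consecutive lines fill every way of every set exactly.
sequential = [i * LINE_SIZE for i in range(NUM_SETS * WAYS)]

# Only 9 lines, but the stride makes all of them map to set 0.
stride = NUM_SETS * LINE_SIZE
strided = [i * stride for i in range(WAYS + 1)]

print(fits_in_cache(sequential))  # True
print(fits_in_cache(strided))     # False
```

Under these assumptions the best case holds 512 lines, while the worst case overflows after WAYS + 1 = 9 lines, which is why the address pattern matters as much as the raw data volume.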

Finally, one of the paramount principles outlined in this work is to not adversely affect the original microarchitectural substrate and protocols, but instead to have a minimally invasive HTM implementation. Conflicts are therefore resolved by local policies, and adverse conflict patterns between two or more concurrent transactions may cause progress issues. Even simple overlapping access patterns, in which the second transaction accesses the same data in reverse order, may cause livelock: each transaction conflicts on its second access with the first access of the other, both abort, and both simply retry (in lockstep).
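The lockstep livelock can be reproduced in a toy discrete-time model. This is a sketch under stated assumptions, not the behaviour of any specific design: the "requester wins" policy (an incoming access aborts the transaction that already holds the line), the one-access-per-tick timing, and the function name run are all illustrative. The randomized backoff is likewise an assumption added only for contrast, to show that the lockstep retry is what sustains the livelock.

```python
import random

def run(backoff, max_ticks=1000, seed=0):
    """Return the tick at which both transactions have committed, or None."""
    rng = random.Random(seed)
    order = [("A", "B"), ("B", "A")]  # second transaction uses reverse order
    held = [set(), set()]             # lines each transaction currently tracks
    wait = [0, 0]                     # ticks left before a (re)start
    committed = [False, False]
    for tick in range(max_ticks):
        if all(committed):
            return tick
        # Every runnable transaction issues its next access in the same tick.
        requests = {}
        for t in (0, 1):
            if committed[t]:
                continue
            if wait[t] > 0:
                wait[t] -= 1
                continue
            requests[t] = order[t][len(held[t])]
        # Requester wins: an access aborts the other transaction if that
        # transaction already holds the requested line.
        aborted = {1 - t for t, line in requests.items() if line in held[1 - t]}
        for t, line in requests.items():
            if t not in aborted:
                held[t].add(line)
        for t in aborted:
            held[t].clear()
            wait[t] = rng.randint(0, 3) if backoff else 0
        # A transaction commits once it has performed both accesses.
        for t in (0, 1):
            if not committed[t] and len(held[t]) == 2:
                committed[t] = True
                held[t].clear()
    return None

print(run(backoff=False))  # None: both abort each other on every second access
print(run(backoff=True) is not None)
```

Without backoff, the two transactions abort each other at every second tick and never commit within the simulated window. Once the restarts are desynchronized, one transaction completes before the other issues its conflicting access.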

Unfortunately, other, less obvious effects can complicate the conflict scenario. False sharing of different data in identical cache lines can hide address patterns from programmers. Furthermore, conflict-inducing coherence messages may be exchanged for reasons unknown / invisible to the application. Hardware prefetches will speculatively pull in data in order to reduce latency on future accesses. These prefetches may cause conflicts with concurrent transactions’ read and write sets. While most stream prefetchers will prefetch data for reading, processors may use exclusive prefetchers for OoO speculative stores, thus coupling local misspeculation (on a branch containing stores) with remote spurious transaction failure.
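False sharing follows directly from conflict detection at cache-line granularity. A minimal sketch, assuming 64-byte lines and using hypothetical addresses and function names: two transactions that touch disjoint variables still conflict when those variables share a line.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)

def cache_line(addr):
    """Index of the cache line that contains addr."""
    return addr // LINE_SIZE

def conflicts(write_set, other_accesses):
    """Line-granular conflict check between a write set and another
    transaction's accesses; the exact addresses within a line are ignored."""
    written_lines = {cache_line(a) for a in write_set}
    return any(cache_line(b) in written_lines for b in other_accesses)

counter_x = 0x1000         # hypothetical address of one counter
counter_y = 0x1008         # different variable, same 64-byte line
counter_y_padded = 0x1040  # same variable moved to the next line

print(conflicts([counter_x], [counter_y]))         # True: false sharing
print(conflicts([counter_x], [counter_y_padded]))  # False
```

The programmer sees two unrelated variables; the hardware sees one line. Padding or aligning hot data to line boundaries removes this particular source of conflicts, though not the prefetch-induced ones.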

Additionally, specific coherence messages may convert resource limitations elsewhere in the system into transaction aborts, for example forced evictions due to limited snoop filter / directory capacity.

In summary, there are many reasons for progress-hindering aborts of transactions. In the common case, such aborts are expected to be rare and to disappear after a small number of retries, once the predictors are warmed up and the working set is paged in. Persistent corner cases may exist, however, among them resource limitations and byzantine patterns of actual conflicts.

For all these reasons, it is hard to give general progress guarantees, such as wait-free / lock-free execution. Best-effort HTM (BeHTM) systems are thus favoured, because they impose fewer constraints on the underlying microarchitecture, speculation mechanisms, and misspeculation recovery. These fewer constraints, however, do not make the feature trivial to implement; BeHTM implementations remain hard, as shown both by the bugs observed in my designs and by the issues Intel has had with its first commercially available HTM implementations [365, 366].
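For software, a best-effort system means that every transaction needs a path that does not depend on the transaction ever succeeding. The following is a minimal sketch of the common retry-then-fallback pattern, written as a simulation: TransactionAborted, run_best_effort, and make_flaky_tx are hypothetical names, and the "hardware" is a stand-in that aborts a fixed number of times. It is not an interface of ASF or of any real HTM.

```python
import threading

class TransactionAborted(Exception):
    """Stands in for a hardware abort reported to software (hypothetical)."""

fallback_lock = threading.Lock()

def run_best_effort(body, attempt_tx, max_retries=4):
    """Try body transactionally; after max_retries aborts, run it under a lock.

    Returns (result, path, attempts)."""
    for attempt in range(max_retries):
        try:
            return attempt_tx(body), "transactional", attempt + 1
        except TransactionAborted:
            continue  # transient causes are expected to vanish on retry
    with fallback_lock:
        return body(), "fallback", max_retries

def make_flaky_tx(aborts_before_success):
    """Simulated hardware: abort a fixed number of times, then succeed."""
    remaining = [aborts_before_success]
    def attempt_tx(body):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransactionAborted()
        return body()
    return attempt_tx

print(run_best_effort(lambda: 42, make_flaky_tx(2)))      # (42, 'transactional', 3)
print(run_best_effort(lambda: 42, make_flaky_tx(10**9)))  # (42, 'fallback', 4)
```

The first call models transient aborts that disappear after warm-up; the second models a persistent corner case, such as a capacity overflow, that no number of retries can resolve. A real implementation would additionally have to make running transactions observe the fallback lock, which this sketch omits.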

In ASF, the architecture attempts to give a very limited progress guarantee: essentially obstruction-freedom when the number of transactional cache lines is smaller than or equal to four. In this case, exceptions and interrupt events are also treated as obstructions. Thus, the guarantee means that small transactions that do not fault and that execute with interrupts disabled will succeed eventually. In real life, the reasonable expectation is that page faults will not persist indefinitely (because the OS page tables will eventually hold all required data) and that interrupt rates will allow short transactions to complete between two interrupt events.
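Whether a transaction falls under the four-line bound depends on the cache lines it touches, not on the number of variables. The sketch below counts distinct lines for a list of (address, size) accesses; the 64-byte line size, the function names, and the example addresses are assumptions for illustration only, and the check covers just the size condition, not faults or interrupts.

```python
def lines_touched(accesses, line_size=64):
    """Set of cache-line indices covered by (address, size) accesses."""
    lines = set()
    for addr, size in accesses:
        first = addr // line_size
        last = (addr + size - 1) // line_size
        lines.update(range(first, last + 1))
    return lines

def within_guarantee(accesses, limit=4, line_size=64):
    """True if the accesses stay within the assumed line limit."""
    return len(lines_touched(accesses, line_size)) <= limit

aligned = [(0, 8), (64, 8), (128, 8), (192, 8)]
# Same number of accesses, but the first one straddles a line boundary.
straddling = [(60, 8), (128, 8), (192, 8), (256, 8)]

print(within_guarantee(aligned))     # True: 4 lines
print(within_guarantee(straddling))  # False: 5 lines
```

The second example shows how an unaligned access can silently push an otherwise small transaction over the bound.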

For the microarchitectural implementation, even such a weak guarantee will severely restrict design freedom, as described earlier. Even though the microarchitectures inside the simulators are relatively regular and cannot expose all quirks and corner cases present in product microarchitectures, the challenges introduced and solved in previous sections are indicative of the class of problems expected in real CPUs, although they may well represent only the tip of the iceberg.

In document Interaction of Hardware Transactional Memory and Microprocessor Microarchitecture (Page 120-122)