Further HTM Ideas - Interaction of Hardware Transactional Memory and Microprocessor Microarchit

Industrial research has to strike a careful balance between publishing findings and keeping them proprietary. One route to publication is to patent ideas and mechanisms first, and then publish them in the research literature.

During my time at AMD, my collaborators and I developed several ideas that have not been published at academic conferences. Some of them have been published as patents and patent applications; others are still being evaluated for patenting or drafted as patents.

This section briefly discusses the scientific problem and core of the solution in plain terms, and links to the respective patent documents for reference.

7.4.1 Roll-forward Mode

During the design of the initial ASF specification [186] in 2010, we already identified the synchronous nature of HTM aborts as an unfortunate side effect, especially for sequences of non-tx operations that should be completed before the abort is handled. The solution presented earlier in this document (transactional resurrection, Section 6.4) takes the synchronous abort and then continues the aborted transaction / sequence of non-tx code until the abort can be handled more conveniently, asynchronously.

Our original proposal for such cases was a roll-forward mode for ASF⁹, which we planned to publish as an update to the existing ASF specification. The roll-forward mode is started with a special flavour of the SPECULATE instruction; inside the roll-forward transaction, aborts only set a flag, and execution of the transaction continues. Transactional stores check the flag and are not performed if an earlier abort was detected. There is also a new instruction (VALIDATE) that checks for a recorded abort and allows applications to defer handling of the abort to a more suitable time.

Having the transactional stores check the flag simplifies the application logic. Otherwise there would be a race between the abort happening and the application checking the validity of the transaction, because software can only check before or after the store, but not at the same time.
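The behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class `RollForwardTx`, its method names, and the buffering of stores until commit are hypothetical choices made for illustration and are not taken from the ASF specification or the roll-forward proposal.

```python
class RollForwardTx:
    """Toy software model of a roll-forward transaction (hypothetical names)."""

    def __init__(self, memory):
        self.memory = memory          # shared dict: address -> value
        self.write_set = {}           # assumption: stores stay speculative until commit
        self.abort_recorded = False   # flag set by an abort instead of rolling back

    def signal_abort(self):
        # In roll-forward mode an abort only records a flag; execution continues.
        self.abort_recorded = True

    def tx_store(self, addr, value):
        # The store itself checks the flag, so no abort can slip in between
        # a software validity check and the store.
        if self.abort_recorded:
            return False              # store suppressed
        self.write_set[addr] = value
        return True

    def validate(self):
        # Models VALIDATE: True as long as no abort has been recorded.
        return not self.abort_recorded

    def commit(self):
        # Publish the speculative stores only if no abort was recorded.
        if self.abort_recorded:
            self.write_set.clear()
            return False
        self.memory.update(self.write_set)
        return True


memory = {0x100: 1}
tx = RollForwardTx(memory)
first = tx.tx_store(0x100, 2)    # performed speculatively: no abort so far
tx.signal_abort()                # e.g. a conflicting remote access arrives
second = tx.tx_store(0x108, 3)   # suppressed: the abort flag is set
valid = tx.validate()            # software checks at a convenient point
committed = tx.commit()          # nothing becomes visible
```

In this model the application may run any amount of non-tx code between `signal_abort` and `validate` without a transactional store becoming visible in the meantime.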

One challenge in this mode is the mixed nesting of roll-forward / roll-back transactions and tracking their state when a nested transaction ends: is the outer transaction of the roll-forward or roll-back kind?

7.4.2 Nested Abort Handlers

Nesting transactions in ASF (and most other BeHTMs) is conceptually simple flat nesting, which keeps a nesting count and flattens all nested transactions into the outermost enclosing transaction. As a side effect, only the outermost transaction will see the abort.

This is a nuisance for keeping statistics of transaction aborts. It is also a functional limitation: if the innermost transaction is responsible for the conflict and could provide a separate, non-conflicting code path, it needs to learn of the abort in order to switch to that code path.

Figure 7.7: Tracking transactional read sets in two dimensions using a combination of different granularity mechanisms for additional capacity and precision. Precise tracking uses the normal cache-line-granularity HTM mechanism; overflow tracking marks the entire set as transactional when a read set entry was evicted from it; in combination with coarse tracking, precision loss is reduced because both structures need to indicate a conflict.

In addition, if nested transactions perform non-tx operations, they might want to undo them with a registered undo-action in case of an abort. The outermost handler, however, does not know of the undo-actions of the nested transaction which may be behind a library interface.

In our work on nesting abort handlers¹⁰, we show how to reverse the handling of aborts: instead of invoking only the outermost abort handler, hardware invokes the innermost handler, and a combination of hardware and software tracking (similar in technique to the transactional resurrection proposal described in Section 6.4) ensures that the innermost handler knows about the nesting hierarchy of abort handlers. To that end, nested SPECULATEs push link information to the previous abort handler onto the stack, so that software can perform a simple POP / RET sequence and link from the inner handler to the outer one. Pushing the link information out to memory means that hardware does not have to provide a large buffer for tracking this information, and that the information persists across a context switch, so that the abort handler hierarchy can still be walked when the application is later scheduled again.
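The linking of handlers can be sketched in software as follows. This is a toy model only: the class `AbortHandlerStack`, the Python list standing in for the program stack, and the handler return values are hypothetical and do not reflect the actual hardware mechanism or the patent application.

```python
class AbortHandlerStack:
    """Toy model: nested SPECULATEs push links to the enclosing abort handler."""

    def __init__(self):
        self.stack = []       # stands in for the in-memory program stack
        self.current = None   # handler that hardware would invoke on an abort

    def speculate(self, handler):
        # Nested SPECULATE: push the link to the previous handler, then make
        # the new (inner) handler the current one.
        self.stack.append(self.current)
        self.current = handler

    def commit(self):
        # Leaving a nesting level restores the enclosing handler.
        self.current = self.stack.pop()

    def abort(self):
        # Hardware invokes only the innermost handler; software then pops
        # the saved links to reach each enclosing handler in turn.
        invoked = []
        handler = self.current
        while handler is not None:
            invoked.append(handler())
            handler = self.stack.pop()
        self.current = None
        return invoked


handlers = AbortHandlerStack()
handlers.speculate(lambda: "outer: restart transaction")
handlers.speculate(lambda: "inner: undo library side effects")
order = handlers.abort()
```

Here `order` lists the inner handler before the outer one, so an undo action registered behind a library interface runs without the outermost handler having to know about it.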

7.4.3 Two-dimensional Tracking of Large Objects

One technique to extend the HTM’s capacity for tracking the read set is to use a simple overflow mechanism: once a tx-read cache line is displaced from the conflict detection hardware structure (for example the L1 data cache), the entire set of the cache is marked as TX.R. That way, all remote conflicting memory snoops that access this set will cause a transactional abort. The addresses that map into the set depend on the cache geometry and are distributed regularly in memory: usually¹¹ every cache_capacity / cache_associativity bytes. For example, for a two-way set-associative L1 data cache with 32 kB capacity, the overflown set would (falsely) detect conflicts with all memory locations at offsets that are integer multiples of 16 kB.
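The aliasing distance can be checked with a few lines of arithmetic. The sketch below assumes 64-byte cache lines and indexing with the lowest possible address bits; the helper names `set_stride` and `set_index` are made up for this illustration.

```python
LINE_SIZE = 64  # bytes per cache line (assumption: 64-byte lines)


def set_stride(capacity, associativity):
    # Distance in bytes between addresses that map to the same cache set.
    return capacity // associativity


def set_index(addr, capacity, associativity, line_size=LINE_SIZE):
    # Index taken from the lowest address bits above the line offset.
    num_sets = capacity // (associativity * line_size)
    return (addr // line_size) % num_sets


CAPACITY = 32 * 1024   # 32 kB L1 data cache
WAYS = 2               # two-way set-associative

stride = set_stride(CAPACITY, WAYS)
base = 0x1040
aliases = [base + k * stride for k in range(4)]
indices = [set_index(a, CAPACITY, WAYS) for a in aliases]
```

All entries of `indices` are identical, so once the set holding `base` overflows, a remote write to any address in `aliases` would be reported as a conflict.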

Overall, this mechanism trades tracking precision for tracking capacity. Another mechanism that makes the same trade-off differently is to use a coarser tracking granularity; for example, instead of tracking cache-line-sized memory regions (64 bytes), one could track larger blocks of 4 kB to increase the total tracked capacity.

In our invention¹², we propose to combine the two tracking mechanisms: once a cache line is displaced from the cache and the entire index will conflict, a secondary tracking structure with larger granularity (such as the TLB with 4 kB page sizes) can be used to reduce the number of false positives caused by the aliasing of the equidistant addresses.

As is apparent from Figure 7.7, this is effectively tracking the transactional read set in two dimensions.
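A minimal software sketch of the combination follows. It assumes 64-byte lines, 256 cache sets (the 32 kB two-way example), 4 kB pages, and that the coarse structure records the page of a line at the moment the line is evicted; the class `TwoDimReadSet` and its methods are hypothetical and do not describe the patented hardware.

```python
LINE = 64        # bytes per cache line
PAGE = 4096      # bytes per page (coarse tracking granularity)
NUM_SETS = 256   # 32 kB / (2 ways * 64 B)


class TwoDimReadSet:
    """Toy model combining precise, overflow and coarse read-set tracking."""

    def __init__(self):
        self.precise = set()           # line numbers tracked exactly
        self.overflowed_sets = set()   # set indices marked TX.R after an eviction
        self.coarse_pages = set()      # page numbers of evicted lines

    def track_read(self, addr):
        self.precise.add(addr // LINE)

    def evict(self, addr):
        # The line leaves the precise structure; fall back to the two
        # imprecise dimensions (assumption: page recorded at eviction time).
        line = addr // LINE
        self.precise.discard(line)
        self.overflowed_sets.add(line % NUM_SETS)
        self.coarse_pages.add(addr // PAGE)

    def conflicts(self, remote_addr):
        line = remote_addr // LINE
        if line in self.precise:
            return True
        # Both imprecise structures must indicate a conflict.
        return (line % NUM_SETS in self.overflowed_sets
                and remote_addr // PAGE in self.coarse_pages)


rs = TwoDimReadSet()
rs.track_read(0x1040)
rs.evict(0x1040)
same_line = rs.conflicts(0x1040)                 # real conflict, still detected
alias_16k = rs.conflicts(0x1040 + 16 * 1024)     # same set, different page
same_page = rs.conflicts(0x1000)                 # same page, different set
```

With overflow tracking alone, the access at a 16 kB offset would abort the transaction; requiring both dimensions to match filters it out.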

¹⁰ USPTO patent application 20140181480

¹¹ If the lowest possible bits are used for indexing into the cache.

¹² USPTO 8,612,694

7.4.4 Tracking Large Objects in the TLB / Page Tables

Using a separate, purpose-built tracking structure with different granularity is straightforward, but requires an additional hardware widget. Using the TLB directly is challenging because it is usually queried with virtual addresses and is not consulted on external snoop messages. In two separate inventions¹³, we have shown how to exploit the actual page-table data structure in memory, together with the fact that the AMD64 architecture maintains accessed and dirty bits in hardware for every page-table entry.

In a nutshell, the algorithm uses the normal ASF conflict detection mechanism (e.g. the L1 data cache), but instead of tracking conflicts on (all cache lines of) the accessed object, the conflict detection mechanism monitors changes to the respective entry in the page table (the last-level PTE). Because the accessed and dirty bits are maintained by hardware, a remote writer needs to write to the PTE to update the dirty bit, and thus causes a proxy conflict on the PTE with the original transactional reader.

In summary, this mechanism can track accesses at page granularity and thus supports huge read sets with very little actual conflict detection hardware, at the cost of losing tracking precision and potentially inducing false conflicts and aborts. The two inventions deal in more detail with challenges such as the exact marking / unmarking of the accessed / dirty bits.
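The proxy-conflict idea can be mimicked in a toy model. Everything here is an illustrative assumption: the flat single-level page table, the 8-byte PTE size, the table base address, the class `PteTrackingModel`, and in particular the choice to clear the dirty bit when a page is marked are not taken from the two inventions.

```python
PAGE = 4096     # bytes per page
PTE_SIZE = 8    # bytes per page-table entry (assumption: flat table)


class PteTrackingModel:
    """Toy model: a transaction monitors PTEs instead of the data lines."""

    def __init__(self, page_table_base=0x100000):
        self.base = page_table_base
        self.dirty = set()       # pages whose PTE dirty bit is currently set
        self.monitored = set()   # PTE addresses in the transactional read set
        self.aborted = False

    def pte_address(self, vaddr):
        return self.base + (vaddr // PAGE) * PTE_SIZE

    def tx_read(self, vaddr):
        # Assumption: the dirty bit is cleared when the page is marked, so
        # that the next write to the page must update the PTE.
        self.dirty.discard(vaddr // PAGE)
        self.monitored.add(self.pte_address(vaddr))

    def remote_write(self, vaddr):
        page = vaddr // PAGE
        if page not in self.dirty:
            # Hardware sets the dirty bit by writing to the PTE; this write
            # hits the monitored PTE and acts as the proxy conflict.
            self.dirty.add(page)
            if self.pte_address(vaddr) in self.monitored:
                self.aborted = True


model = PteTrackingModel()
model.tx_read(0x20010)          # transactional read of an object in page 0x20
model.remote_write(0x30000)     # different page: no conflict
unrelated = model.aborted
model.remote_write(0x20FF8)     # same page, possibly a different object
same_page = model.aborted
```

The second remote write aborts the transaction even though it may touch a different object in the same page, which is the loss of precision mentioned above.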

7.4.5 Reducing Live-Lock by Ordering Transaction Memory Accesses

Two transactions can have mutually overlapping working sets, so that one of them may abort because of a conflict, restart, and then abort the other transaction, leading to the same cycle with the roles mirrored. A system in such a condition is making progress on some level (the CPUs keep executing instructions), but fails to make progress on a higher level of abstraction (the transactions / operations they represent never complete): a livelock.

There is a large body of related work on solving these issues, usually with specific contention management policies, delaying the restart of conflicting transactions probabilistically and also mechanisms for stricter scheduling of transactions.

In HTM implementations that use the cache coherence protocol for conflict detection, one option is to stall answers to incoming conflicting snoop messages in order to throttle the conflict, abort, restart, conflict loop and increase the chances of the local transaction committing in time. Unfortunately, stalling snoop responses in such a way may impede progress guarantees of the underlying coherence protocol. Sometimes this happens through long, obscure, system-wide resource dependencies; in other cases it is more obvious, for example when two transactions decide not to respond to each other's snoop requests and thus deadlock the system.

In our invention¹⁴, we carefully establish a non-cyclic order between memory accesses (for example by physical address), and allow transactions / CPUs to block snoop responses only if their accesses follow that order. That way, in the mutual conflict case, one of the two transactions will be able to “lock” its working set and commit, clearing the livelock situation.
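One possible reading of the ordering rule, sketched in software: a transaction may stall snoop responses while waiting for a line only if that line is ordered after every line it already holds. The function `may_stall_snoops` and this specific rule are assumptions made for illustration; the invention itself may define the order and the stalling condition differently.

```python
def may_stall_snoops(held_lines, requested_line):
    """Return True if the pending request is ordered after all held lines.

    Assumed rule: lines are ordered by physical address, and only a
    transaction that acquires lines in ascending order may stall snoops.
    """
    return all(line < requested_line for line in held_lines)


# Mutual conflict: T1 holds A and wants B, T2 holds B and wants A.
A, B = 0x1000, 0x2000
t1_may_stall = may_stall_snoops({A}, B)   # T1's accesses follow the order
t2_may_stall = may_stall_snoops({B}, A)   # T2's accesses violate the order
```

Under this rule only T1 may hold on to its working set; T2 has to answer the snoop and abort, so the two transactions cannot wait for each other in a cycle.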

In document Interaction of Hardware Transactional Memory and Microprocessor Microarchitecture (Page 191-193)