4.2 Key Microarchitectural HTM Mechanisms
4.2.1 Data Versioning
Transaction semantics require that transactional stores are made visible to other cores only when the enclosing transaction commits successfully. If the transaction does not commit, the transactional stores must be discarded; that is, the locations stored to must return to the values they had before the transaction. For the duration of the transaction, both the pre-transactional version and the speculative, transactional version of the memory location must therefore exist in the system. The speculative copy becomes the authoritative copy upon commit, while the pre-transactional version is the fall-back in case of a transaction abort.
Data versioning describes this coexistence of two versions of the same memory location. Usually, one version is stored at its normal place in the memory hierarchy, and buffering is employed to keep the other version available alongside it.
The first decision is whether to reuse or augment an existing, similar structure (LSQ, DC) for buffering, or to create a dedicated, new buffering structure. Reusing an existing structure for data versioning usually restricts flexibility in sizing and organisation, may complicate existing interfaces, and may slow down timing on critical paths. Furthermore, the extended unit may need a complete re-validation pass during the design of the CPU, even for operation in which transactions are not used. On the positive side, reusing similar logic and storage cells may lower the overall cost in silicon real estate, and the closer proximity of the backup storage to the unit that stores the definitive copy of the data or performs the store may yield higher performance.
Creating a new buffering structure, on the other hand, may permit better tuning to the needs of transactional code and leave existing interfaces and control paths unencumbered. On the negative side, the new buffer still needs to interact with the existing components; it therefore requires careful integration into existing flows, and verification effort to ensure that non-transactional operation remains correct. An additional hurdle may be placing the new structure in an existing, tight floor plan without impeding timing and routing of signals. A dedicated structure will therefore usually be smaller and may have higher access latencies.
Regardless of the choice, the data versioning mechanism must be able to supply the updated transactional values to loads inside the ongoing transaction, while hiding them from global visibility and allowing a revert to the pre-transactional values. Reusing a structure that is queried on local loads (LSQ, DC) therefore has the advantage of automatically forwarding transactional writes to loads inside the transaction. Using a new structure to store the transactionally written data, on the other hand, requires an additional check of that structure upon every load inside the transaction. Together with the idea of optimising for transaction commit, dedicated structures for data versioning should therefore track the old version of the modified data for restore upon abort.
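The recommendation above (speculative data in the normal load source, old values in a dedicated buffer) can be illustrated with a minimal software sketch. This is a toy single-core model, not the hardware design of this thesis: the class `UndoBufferedCache` and all of its method names are hypothetical, global visibility is not modelled, and unwritten locations are assumed to read as 0.

```python
class UndoBufferedCache:
    """Toy single-core model (hypothetical names): speculative data lives
    in the cache, old values live in a dedicated undo buffer."""

    def __init__(self, initial):
        self.lines = dict(initial)  # address -> value; normal source for loads
        self.undo = {}              # address -> pre-transactional value
        self.in_tx = False

    def begin(self):
        self.undo.clear()
        self.in_tx = True

    def store(self, addr, value):
        if self.in_tx and addr not in self.undo:
            # Log the old value only on the first transactional store.
            self.undo[addr] = self.lines.get(addr, 0)
        self.lines[addr] = value

    def load(self, addr):
        # No extra lookup: the cache already holds the newest local value.
        return self.lines.get(addr, 0)

    def commit(self):
        # Cheap commit: the speculative data is already in place.
        self.undo.clear()
        self.in_tx = False

    def abort(self):
        # Restore every modified location from the undo buffer.
        self.lines.update(self.undo)
        self.undo.clear()
        self.in_tx = False


cache = UndoBufferedCache({0x40: 1})
cache.begin()
cache.store(0x40, 2)
cache.store(0x40, 3)
seen_inside = cache.load(0x40)  # the transaction sees its own latest store
cache.abort()
seen_after = cache.load(0x40)   # the pre-transactional value is back
```

In this sketch, loads never consult the undo buffer, and commit only clears it; the cost of the dedicated structure is paid on the first store to a location and on abort.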
The concept of eager / lazy versioning [102] (and similarly, undo / redo logging) originates from STM systems, but does not adequately address the different trade-offs and complexities in hardware systems. In particular, it fails to distinguish between the mechanism of data versioning, the location of global visibility, and the time when store probes are sent to the system. For a better analysis, three components need to be considered: (1) whether transactional data is stored in the local source for loads (i.e. the local L1 DC or the LSQ); (2) whether transactional data is stored in (or beyond) the global point of visibility; and (3) whether transactional data is snooped at the time of the write or at the end of the transaction.
Eager versioning for stores would likely refer to a system that stores transactional data in the coherent and globally visible L1 cache, and keeps pre-transactional values either in a separate undo log / buffer, or more easily in lower levels (L2, DRAM) of the memory hierarchy (see below). Lazy versioning, on the other hand, could be implemented in exactly the same way in a system that has the L2 cache as the point of visibility and thus requires transactional stores to be pushed out to the L2 at the end of the transaction. Note that in this lazy versioning scheme the usual cost for transactional loads in STMs does not occur: there is no additional buffer that the transactional loads need to consult in order to read from earlier transactional stores. Instead, the L1 already provides this functionality.
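The point that the same L1-based mechanism reads as "eager" or "lazy" depending only on the point of visibility can be sketched in a few lines. The model below is an illustrative assumption, not a description of any real design: `Core`, its `visibility` parameter, and `remote_read` are hypothetical names, coherence traffic is reduced to dictionary lookups, and unwritten locations read as 0.

```python
class Core:
    """Toy model (hypothetical names): the L1 holds speculative data;
    only the point of visibility differs between the two variants."""

    def __init__(self, l2, visibility="L1"):
        self.l2 = l2                  # address -> value, shared outer level
        self.l1 = {}                  # address -> value, local load source
        self.spec = set()             # addresses written transactionally
        self.visibility = visibility  # "L1" or "L2"

    def tx_store(self, addr, value):
        self.l1[addr] = value
        self.spec.add(addr)

    def tx_load(self, addr):
        # In both variants the L1 forwards earlier transactional stores.
        return self.l1.get(addr, self.l2.get(addr, 0))

    def remote_read(self, addr):
        # What another core would observe for this address.
        if self.visibility == "L1" and addr in self.l1 and addr not in self.spec:
            return self.l1[addr]
        return self.l2.get(addr, 0)

    def commit(self):
        # Returns the number of stores pushed out to the L2.
        pushed = 0
        if self.visibility == "L2":
            for addr in self.spec:
                self.l2[addr] = self.l1[addr]
                pushed += 1
        self.spec.clear()
        return pushed


eager = Core({0: 7}, visibility="L1")
lazy = Core({0: 7}, visibility="L2")
for core in (eager, lazy):
    core.tx_store(0, 9)

local_views = (eager.tx_load(0), lazy.tx_load(0))
hidden_views = (eager.remote_read(0), lazy.remote_read(0))
pushed = (eager.commit(), lazy.commit())
committed_views = (eager.remote_read(0), lazy.remote_read(0))
```

Both variants serve transactional loads from the L1 without an extra buffer; they differ only in the work done at commit, where the variant with the L2 as point of visibility has to push its stores outward.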
One particularly light-weight choice for data versioning uses an outer-level cache (e.g. L2 cache) as a storage for pre-transactional data and holds transactionally modified data in the local L1 cache.
Upon transaction commit, the copy in the L2 cache is made non-authoritative, either by invalidating / updating it, or by making sure that the L1 copy (which is often consulted in parallel with the L2) responds to remote read requests. Transaction aborts invalidate the transactionally written data in the local L1 cache, so that later instructions read the pre-transactional state from the L2. Care has to be taken that the outer cache / main memory contains the correct pre-transactional copy. For example, in most modern (MSI, MESI, MOESI) cache coherence protocols [136] the most recent store (non-transactional or from a committed transaction) may not be present in main memory, but may instead live only in a single cache or cache hierarchy. A transactional store to such a line requires writing back the pre-transactional data to the L2 cache first. Additionally, exclusive cache hierarchies explicitly forbid duplicate cache lines in different levels of a single hierarchy. In these cases, a transactional store may still need to write back the line prior to modifying the transactional copy, temporarily bypassing the exclusive regime.
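The write-back requirement described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the implementation used in this thesis: `Line`, `L1WithL2Backup`, and the `writebacks` counter are hypothetical names, cache lines hold a single integer, there is only one core, and non-transactional `store` is assumed to be called only outside a transaction.

```python
from dataclasses import dataclass


@dataclass
class Line:
    value: int
    dirty: bool = False  # L1 copy is newer than the L2 copy
    spec: bool = False   # line was written inside the running transaction


class L1WithL2Backup:
    """Toy model (hypothetical names): L1 holds speculative data,
    the L2 holds the pre-transactional copy."""

    def __init__(self, l2):
        self.l2 = l2          # address -> value
        self.l1 = {}          # address -> Line
        self.writebacks = 0

    def store(self, addr, value):
        # Non-transactional store: the line turns dirty, the L2 goes stale.
        self.l1[addr] = Line(value, dirty=True)

    def tx_store(self, addr, value):
        line = self.l1.get(addr)
        if line is not None and line.dirty and not line.spec:
            # The L1 holds the only up-to-date pre-transactional copy:
            # write it back before overwriting it speculatively.
            self.l2[addr] = line.value
            self.writebacks += 1
        self.l1[addr] = Line(value, dirty=True, spec=True)

    def load(self, addr):
        line = self.l1.get(addr)
        if line is None:
            # Miss: refill a clean copy from the L2.
            line = self.l1[addr] = Line(self.l2.get(addr, 0))
        return line.value

    def commit(self):
        # Speculative lines become ordinary dirty lines.
        for line in self.l1.values():
            line.spec = False

    def abort(self):
        # Drop speculative lines; later loads refill from the L2.
        for addr in [a for a, line in self.l1.items() if line.spec]:
            del self.l1[addr]


l2 = {0x80: 1}
cache = L1WithL2Backup(l2)
cache.store(0x80, 2)     # dirty in L1, the L2 still holds 1
cache.tx_store(0x80, 3)  # forces a write-back of 2 to the L2
cache.tx_store(0x80, 4)  # line is already speculative: no second write-back
cache.abort()
restored = cache.load(0x80)
```

Without the write-back in `tx_store`, the abort in this example would expose the stale L2 value 1 instead of the correct pre-transactional value 2.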
Chosen Implementation. For my thesis, I implemented multiple variants of data versioning. The first uses a special buffer (the locked line buffer) that holds pre-transactional data next to the L1 data cache. The second uses lower levels of the cache hierarchy for storing pre-transactional data, as discussed here. Finally, I also implemented a variant that provides additional buffering logic in the LSQ of the core, to work around pathological capacity cases in which indexed data structures exhibit very low usable capacity due to index thrashing.