Extended Transactional Instruction Implementation

4.5 Enhanced Pipeline Integration

4.5.2 Extended Transactional Instruction Implementation

Beginning a Speculative Region. The SPECULATE instruction starts a speculative region and is microcoded in the microcode ROM. On detecting a SPECULATE, the extended instruction decoder sets the InSP (in speculation) bit to remember the beginning of a speculative region. Making the decoder aware of the boundaries of the speculative region allows earlier marking of speculative accesses at a point where the instruction stream is still processed in-order; see the complications described in Section 4.4.3. Similar to the baseline proposal, the decoder signals the instruction dispatcher to read the SPECULATE microcode. The microcode (1) computes the next rIP so that rIP is restored to point to the instruction following the SPECULATE at an abort, (2) saves the next rIP and the current rSP in the shadow register file, and (3) executes an mfence (memory fence) micro-op. The mfence generates a dependency between SPECULATE and later LOCK MOVs, which prevents the LOCK MOVs from being executed ahead of the SPECULATE in the out-of-order execution stage (Section 4.4.2). The shadow register file is carved out of the existing micro-architectural register file used only by micro-ops.

[Figure 4.19 block diagram. Components shown: instruction fetcher, instruction decoder (5 ASF instructions, InSP bit), instruction dispatcher, microcode ROM (SPECULATE microcode, COMMIT microcode, ABORT handler, prohibited-op handler), SPECULATE/COMMIT/ABORT/RELEASE logic, scheduler, EX (execute) unit, LS unit with load queue and store queue, L1 cache, conflict detectors with NACK support, capacity checkers, shadow register file, microarchitectural register file, cHT fabric, ASF_Exception_IP, and ROB with the flag bits OV (overflow), CF (conflict), AI (ABORT), and PB (prohibited).]

Figure 4.19: Extending an out-of-order core for transactional memory with focus on changing the flow of instructions.
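The three SPECULATE microcode steps described above can be illustrated with a minimal behavioural sketch. It is not the actual microcode: the `Core` class, the `insn_len` parameter, and the dictionary used as shadow register file are hypothetical modelling choices, and clearing InSP on abort is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    rip: int = 0                 # address of the instruction being decoded
    rsp: int = 0                 # current stack pointer
    in_sp: bool = False          # InSP bit kept by the instruction decoder
    shadow: dict = field(default_factory=dict)      # shadow register file
    micro_ops: list = field(default_factory=list)   # micro-ops issued so far


def speculate(core: Core, insn_len: int) -> None:
    """Model the decoder and microcode actions for one SPECULATE.

    insn_len is a hypothetical parameter: the encoded length of SPECULATE.
    """
    core.in_sp = True                    # decoder remembers the region start
    next_rip = core.rip + insn_len       # (1) compute the next rIP
    core.shadow["rIP"] = next_rip        # (2) save next rIP and current rSP
    core.shadow["rSP"] = core.rsp
    core.micro_ops.append("mfence")      # (3) fence orders later LOCK MOVs
    core.rip = next_rip


def abort(core: Core) -> None:
    """Restore the checkpoint so execution resumes after the SPECULATE."""
    core.rip = core.shadow["rIP"]
    core.rsp = core.shadow["rSP"]
    core.in_sp = False                   # assumption: the region ends on abort
```

Saving the next rIP rather than the address of the SPECULATE itself is what lets an abort resume at the instruction following the SPECULATE.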

Speculative Accesses. To track speculative accesses, two bits are added per cache line: the SW (speculative write) bit for speculative stores and the SR (speculative read) bit for speculative loads, as shown in Figure 4.19. The SW bit is also added per store queue entry, and the SR bit per load queue entry. A transactional access is issued to the LS (load/store) unit and sets the SW bit of the store queue entry for a store operation and the SR bit of the load queue entry for a load operation. The data movement operation executes in the same way as a normal access. The AMD64 TLB refill hardware allows a speculative region to survive a possible TLB miss during address translation by handling the miss in hardware. When a speculative load retires, the SR bit of the load queue entry is cleared and the corresponding SR bit in the L1 cache is set, carefully observing the handover principles established in Section 4.4.5. Although sending another access to the L1 data cache at retire time can cause additional delay, it reduces transactional overmarking and allows for less complicated miss-buffer handling logic. The SW bit of the store queue entry is cleared when the speculative data are transferred from the store queue to the L1 cache, at which point the SW bit in the L1 cache is set. If speculative data are written to a cache line that contains non-speculative dirty data (i.e., the D (Dirty) bit is set, but the SW bit is not set), the cache line is written back first to make sure that the last committed data are preserved in the L2/L3 caches or the main memory.
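The handover of the SR and SW marks from the queues to the L1 cache, including the write-back of committed dirty data, can be sketched as follows. The class and function names, the `backing` dictionary standing in for the L2/L3 caches or main memory, and the handling of the D bit after the write-back are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    dirty: bool = False   # D bit: holds committed data newer than lower levels
    sw: bool = False      # SW bit: holds speculatively written data
    sr: bool = False      # SR bit: has been speculatively read
    data: int = 0


@dataclass
class QueueEntry:
    addr: int
    spec: bool = False    # SW bit (store queue) or SR bit (load queue)
    data: int = 0


def retire_load(entry: QueueEntry, line: CacheLine) -> None:
    """At retirement, move the SR mark from the load queue entry to the line."""
    if entry.spec:
        line.sr = True
        entry.spec = False


def drain_store(entry: QueueEntry, line: CacheLine, backing: dict) -> None:
    """Transfer a store from the store queue into the L1 line."""
    if entry.spec:
        if line.dirty and not line.sw:
            # Preserve the last committed data before overwriting it.
            backing[entry.addr] = line.data
            line.dirty = False   # assumption: line is clean after write-back
        line.sw = True
        entry.spec = False
    else:
        line.dirty = True
    line.data = entry.data
```

Writing the committed data back before the first speculative overwrite means an abort can simply discard the SW-marked line, because the lower levels still hold the pre-transaction value.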

Increased Worst-Case Capacity. As a fall-back, the LS unit can assist in buffering speculative data post-retire if the L1 cache is out of capacity for the particular index. More precisely, the fall-back applies when the transfer of the SW/SR bits from the load/store queues to the L1 cache meets two conditions: (1) the access misses in the cache (i.e., no cache line to retain the bits) and (2) all cache lines of the indexed cache set have their SW and/or SR bits set (i.e., no cache line to evict without triggering a capacity overflow). In this case, the entry in the load/store queue is not deallocated, even though the associated instruction has retired. While the total capacity increase is small, this scheme helps to handle unfavourable access patterns that exceed the capacity of a few cache indices. Figure 4.20 depicts such an interaction.
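The two conditions can be written as a small predicate over one cache set. This is an illustrative sketch only: the `Line` class and the function name `must_stay_in_lsq` are hypothetical, and an invalid way is modelled as a line whose tag is `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    tag: Optional[int] = None   # None models an invalid (free) way
    sw: bool = False
    sr: bool = False


def must_stay_in_lsq(cache_set: List[Line], tag: int) -> bool:
    """True if a retired speculative access has to remain in the LSQ."""
    # Condition (1): the access misses, so no line can take over the bits.
    misses = all(line.tag != tag for line in cache_set)
    # Condition (2): every way is speculative, so nothing can be evicted
    # without causing a capacity overflow.
    set_full = all(line.sw or line.sr for line in cache_set)
    return misses and set_full
```

An invalid or non-speculative way makes condition (2) false, so the access can be placed in the cache and its queue entry is released as usual.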

If a non-speculative access meets the two conditions above, the L1 cache handles it as if the access were of uncacheable type (through a dedicated buffer outside the L1) to avoid a capacity overflow, and the L2 cache handles it directly. In order to hold as much speculative data as possible, the L1 cache eviction policy evicts cache lines without the speculative bits set first. Similarly, a cache line prefetched by a hardware prefetcher is inserted into the L1 cache only when it does not cause eviction of transactional data.

Figure 4.20: The load/store queue can be used as an additional storage container when a cache index is filled with transactional tracking data. If a request accesses the cache (1) and the corresponding set is full with transactional data (2), the LSQ will use the uncacheable path for getting the data (3) and (4), and keep the access present for conflict detection if it was transactional (5).
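The routing of a missing access under this policy can be sketched as below. The function names and the returned strings are hypothetical labels. The choice among several non-speculative candidates is not specified in the text, so the sketch simply takes the first one as a stand-in for the regular replacement policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    sw: bool = False
    sr: bool = False


def choose_victim(cache_set: List[Line]) -> Optional[int]:
    """Return the index of a way without speculative bits, or None."""
    for index, line in enumerate(cache_set):
        if not (line.sw or line.sr):
            return index          # assumption: first candidate wins
    return None


def route_miss(cache_set: List[Line], speculative: bool,
               is_prefetch: bool = False) -> str:
    """Decide how a miss to this set is serviced."""
    if choose_victim(cache_set) is not None:
        return "fill_l1"                    # normal fill, evicting a plain line
    if is_prefetch:
        return "drop_prefetch"              # never evict transactional data
    if speculative:
        return "uncacheable_keep_in_lsq"    # data via bypass, entry stays
    return "uncacheable"                    # non-speculative access bypasses L1
```

For a set whose ways are all marked, a transactional load therefore takes the uncacheable path and keeps its queue entry, which corresponds to steps (3) to (5) in Figure 4.20.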

Extending Data Versioning Capacity. Using the store queue to buffer speculative stores instead of the L1 cache needs careful integration with the logic for store visibility. Stores in the store queue are only locally visible in the AMD64 memory model [209]. Therefore, a store is visible to the rest of the system only after it is transferred to the L1 cache. To broadcast the existence of a buffered speculative store that cannot be transferred to the L1 cache without triggering a capacity overflow, an exclusive permission request for the store is sent directly to the coherent interconnect when the store retires in the store queue. This enables the other cores to detect a conflict against the store.

Once the exclusive permission is acquired, the store queue entry remembers the acquisition through an exclusive permission (EP) bit. The COMMIT instruction later checks the EP bit to make sure that the store has been seen by the rest of the system for conflict detection before starting the commit procedure. This is an example of the complication introduced in Section 4.2.3, and we will investigate details in the following Section 4.5.3.
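The role of the EP bit can be illustrated with a small sketch of the retire, grant, and commit-check steps. The names are hypothetical, the `pending` list stands in for requests in flight on the coherent interconnect, and the commit procedure itself (Section 4.5.3) is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoreEntry:
    addr: int
    sw: bool = False   # speculative store still buffered in the store queue
    ep: bool = False   # exclusive permission acquired


def retire_store(entry: StoreEntry, pending: List[StoreEntry],
                 l1_can_accept: bool) -> None:
    """On retirement, request exclusivity for a store that must stay buffered."""
    if entry.sw and not l1_can_accept:
        pending.append(entry)   # request goes directly to the interconnect


def grant(entry: StoreEntry) -> None:
    """The interconnect grants exclusive permission; remember it in EP."""
    entry.ep = True


def commit_may_start(store_queue: List[StoreEntry]) -> bool:
    """COMMIT proceeds only if every buffered speculative store has EP set."""
    return all(entry.ep for entry in store_queue if entry.sw)
```

Until the grant arrives, other cores cannot have observed the store, so a COMMIT at that point could miss a conflict; the check above models why COMMIT inspects the EP bit first.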

Capacity Overflow. An overflow exception is triggered when the load/store queues do not have an available entry for an incoming transactional memory instruction (i.e., the SW/SR bits of all entries are set in the queue the instruction needs to go to). A tricky problem is that non-speculative accesses should always make forward progress regardless of the number of speculative accesses executed in a speculative region. Our design reserves one entry per queue for non-speculative accesses so that they can execute even when the rest of the load/store queue entries are filled with speculative data. As outlined earlier, OoO misspeculation may trigger a false overflow exception. To address this problem, we add a tracking bit per ROB entry: the overflow (OV) bit is set for a speculative access when it would cause a capacity overflow. If the speculative access is on a mis-speculated path, the ROB entry and the hardware resources associated with the entry will be discarded by the existing branch misprediction recovery mechanism. A true capacity overflow is serviced when the instruction becomes the oldest in the reorder buffer (the ROB entry reaches the head of the ROB) and triggers an overflow exception (assuming no other abort conditions exist before the speculative access). Before the exception is raised, the instruction is given one last chance and tries to allocate again in the load/store queue. The reasoning is that earlier, OoO-speculative entries could have used up a slot that was later freed because those instructions were on a wrongly speculated branch, and thus affected an instruction on the correct path. Such misspeculation will have been rectified once the OV-marked instruction has reached the head of the ROB.
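The OV-bit scheme, the reserved non-speculative entry, and the retry at the head of the ROB can be sketched with a simple counter model. All names are hypothetical, and the model is deliberately simplified: it tracks only slot counts for a single queue and ignores retirement and the hand-over of entries to the L1 cache.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    speculative: bool
    ov: bool = False          # OV bit: allocation would have overflowed
    has_slot: bool = False    # entry owns a load/store queue slot


class LsQueue:
    def __init__(self, size: int) -> None:
        self.size = size
        self.spec_used = 0
        self.nonspec_used = 0

    def allocate(self, speculative: bool) -> bool:
        if self.spec_used + self.nonspec_used >= self.size:
            return False
        if speculative:
            # One entry stays reserved for non-speculative accesses.
            if self.spec_used >= self.size - 1:
                return False
            self.spec_used += 1
        else:
            self.nonspec_used += 1
        return True

    def free(self, speculative: bool) -> None:
        if speculative:
            self.spec_used -= 1
        else:
            self.nonspec_used -= 1


def dispatch(entry: RobEntry, queue: LsQueue) -> None:
    """Try to allocate; a failing speculative access is only marked OV."""
    entry.has_slot = queue.allocate(entry.speculative)
    entry.ov = entry.speculative and not entry.has_slot


def squash(entry: RobEntry, queue: LsQueue) -> None:
    """Branch misprediction recovery releases the slot of a wrong-path entry."""
    if entry.has_slot:
        queue.free(entry.speculative)
        entry.has_slot = False


def at_rob_head(entry: RobEntry, queue: LsQueue) -> str:
    """Give an OV-marked access one last allocation attempt."""
    if entry.ov:
        if not queue.allocate(True):
            return "overflow_exception"
        entry.ov = False
        entry.has_slot = True
    return "proceed"
```

Deferring the exception to the head of the ROB means that a slot released by squashing a wrong-path access is seen by the retry, so only a true overflow raises the exception.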

The combination of marking transactional lines in the cache only at the instruction’s retire time (which is bound to have resolved all earlier misspeculations) and the freeing of load/store queue entries of instructions on any misspeculated branch allows for precise capacity tracking at the expense of an additional round-trip to the cache and additional logic in the OoO execution supporting structures (ROB and LSQ, here). The solution outlined in the previous section is, in contrast, designed to minimise changes to the processor’s core mechanisms.

In document Interaction of Hardware Transactional Memory and Microprocessor Microarchitecture (Page 111-114)