3.5 Capacity and Architectural Progress
4.1.1 Processor Microarchitecture
Although the exact implementation details of contemporary out-of-order CPUs vary, the generic functionality and structure are very similar (see also Figure 4.1): a front end fetches and decodes instructions, the execution units perform the operations, and finally the back end removes completed instructions from the core.
Front End The core fetches instructions in the native AMD64 instruction-set architecture (ISA) from memory (the instruction cache) and decodes the instructions and operand information. A number of the instructions are not executed directly in the core, but are instead split up into multiple smaller instructions, so-called microoperations (uops). These flow through the pipeline independently and also retire in sequence.
Conditional branches make the code sequence dependent on data, which is usually produced only near the branch instruction and may be subject to long-latency operations, such as complex arithmetic and cache misses. To maintain a sufficiently large look-ahead instruction window, modern microprocessors employ branch prediction to forecast the instruction stream, decoupling code flow from data and allowing further look-ahead. If a conditional branch is predicted the wrong way (predicted taken vs. resolved not taken, and vice versa), instructions on the wrong path have already been executed. These instructions have to be removed from the core, their effects have to be undone (annulled), and architectural state needs to be restored to a previous, known-good configuration.
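As a minimal sketch of how such a forecast can work, the following uses a table of two-bit saturating counters. This predictor type, the table size, and the names `TwoBitPredictor` and `count_mispredictions` are illustrative assumptions; the text does not state which predictor the baseline core uses.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (assumed scheme, for illustration only).

    Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
    """

    def __init__(self, table_size=1024):
        self.table_size = table_size
        self.counters = [1] * table_size  # start in "weakly not taken"

    def _index(self, pc):
        return pc % self.table_size

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Count how often the prediction differs from the resolved outcome."""
    mispredicted = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            mispredicted += 1  # a real core would squash wrong-path work here
        predictor.update(pc, taken)
    return mispredicted


# A loop branch at a hypothetical address: taken nine times, then not taken,
# with the whole loop executed three times.
loop_outcomes = ([True] * 9 + [False]) * 3
mispredictions = count_mispredictions(TwoBitPredictor(), 0x400, loop_outcomes)
```

In this toy run, the only mispredictions are the cold first prediction and each loop exit; the remaining branch instances are forecast correctly, which is what lets the front end keep fetching ahead of the data that resolves the branch.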
Other predictions, such as predicting intra-thread data dependencies (or their absence) for pairs of stores and loads with unresolved addresses (store-load aliases), or optimistic assumptions for scheduling conflicts and late resource shortages, may also cause re-execution of instructions.
A central data structure called the reorder buffer (ROB) keeps track of in-flight instructions, their states, and required input operands. Dependencies among instructions are formed through producer-consumer relationships between instructions: operands required by one instruction are produced as results by an earlier one and are usually conveyed through registers.
Because architecturally visible registers may be used by multiple independent in-flight instruction pairs, register renaming is used to separate these aliases. Register renaming happens early in the pipeline and maps a small set of architectural registers to a larger number of entries in a physical register file. The mapping resolves anti-dependencies (write-after-read hazards) and overwrites (write-after-write hazards) by directing each write to a different physical register than the one currently mapped to the architectural register it overwrites.
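The mapping step can be sketched as follows, assuming a simple map table plus a free list of physical registers. The function name `rename`, the register counts, and the three-operand instruction format are hypothetical, and the sketch does not model returning physical registers to the free list at retirement.

```python
def rename(instructions, num_arch_regs=4, num_phys_regs=16):
    """Rename (dest, src1, src2) triples of architectural register numbers.

    Returns the same instructions expressed in physical register numbers.
    """
    # Initially, architectural register i lives in physical register i.
    map_table = list(range(num_arch_regs))
    free_list = list(range(num_arch_regs, num_phys_regs))
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the current mapping, preserving true dependencies.
        p_src1, p_src2 = map_table[src1], map_table[src2]
        if not free_list:
            raise RuntimeError("no free physical register (a real core would stall)")
        # Every write gets a fresh physical register, which removes
        # write-after-read and write-after-write hazards.
        p_dest = free_list.pop(0)
        map_table[dest] = p_dest
        renamed.append((p_dest, p_src1, p_src2))
    return renamed


program = [
    (1, 2, 3),  # r1 = r2 op r3
    (2, 1, 1),  # r2 = r1 op r1  (reads r1; overwrites r2, which the first instruction reads)
    (1, 3, 3),  # r1 = r3 op r3  (overwrites r1 again)
]
renamed_program = rename(program)
```

After renaming, the two writes to `r1` target different physical registers, so the third instruction no longer has to wait for the second one to read the old value of `r1`.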
Execution Once an instruction has all input dependencies fulfilled, it is considered for execution and is eventually issued on one of the functional units of the core. At this point, execution no longer depends on program order, but can proceed out of order: later instructions with fulfilled dependencies may execute before earlier instructions with unmet dependencies. Once the instructions complete execution, they forward the results to dependent in-flight instructions. The final pipeline step retires instructions from the core. In contrast to previous pipeline stages, this stage processes completed in-flight instructions strictly in program order and thus maintains the sequential semantics of the code.
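The contrast between out-of-order completion and in-order retirement can be illustrated with a small cycle-based model. It is a sketch under simplifying assumptions that are not from the text: unlimited functional units, an instruction may retire in the cycle it completes, and the latencies in the example are made up.

```python
def simulate(program):
    """program: list of (name, latency, producers) in program order.

    producers holds indices of earlier instructions whose results are needed.
    Returns (completion order, retirement order) as lists of names.
    """
    n = len(program)
    done = [None] * n  # cycle in which each result becomes available
    retired = []
    head = 0           # oldest instruction not yet retired (ROB head)
    cycle = 0
    while head < n:
        cycle += 1
        # Issue every waiting instruction whose producers finished earlier.
        for i, (_, latency, producers) in enumerate(program):
            if done[i] is None and all(
                done[p] is not None and done[p] < cycle for p in producers
            ):
                done[i] = cycle + latency - 1
        # Retire strictly in program order, starting at the ROB head.
        while head < n and done[head] is not None and done[head] <= cycle:
            retired.append(program[head][0])
            head += 1
    completed = [program[i][0] for i in sorted(range(n), key=lambda i: (done[i], i))]
    return completed, retired


example = [
    ("load", 5, []),   # long-latency load
    ("add", 1, [0]),   # needs the load result
    ("mul", 2, []),    # independent of the load
    ("sub", 1, [2]),   # needs the mul result
]
completed_order, retired_order = simulate(example)
```

In this model the independent `mul` and `sub` complete while the load is still outstanding, yet they leave the core only after `load` and `add` have retired.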
One source of long-latency operations is load instructions that access memory: memory latency has not kept up with CPU clock frequency scaling, so a memory access takes on the order of hundreds of clock cycles. A hierarchy of caches exploits locality of access patterns and stores parts of the working set in faster SRAM cells on the CPU die. Generally, caches closer to the CPU core are smaller but offer shorter access latencies (two to twenty clock cycles), while larger last-level caches on die aim to minimise off-chip traffic and offer several megabytes of capacity at higher access latency (forty cycles and more). At the level closest to the core (L1), the caches distinguish between instructions (L1i cache) and data (L1d cache, DC), due to different access patterns and spatial layout. Despite the cache hierarchy, memory accesses may miss in the data cache(s). OoO execution helps because the core can issue multiple independent cache-missing loads at once and execute other independent instructions (such as arithmetic), thereby effectively overlapping the latencies of the loads and the computation. Several data structures keep track of in-flight memory operations: the load and store queue(s) (LSQ) of the core handle individual load and store instructions before they retire. An additional miss buffer keeps track of the pending cache lines, which may be referenced by multiple in-flight memory operations.
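A back-of-the-envelope model shows why overlapping independent misses matters. The numbers and function names below are hypothetical, and the model deliberately ignores bandwidth limits, dependent loads, and partial overlap; it only assumes that the number of concurrent misses is bounded by the miss buffer entries.

```python
import math


def stall_cycles_blocking(num_misses, miss_latency):
    """Each miss is served to completion before the next one starts."""
    return num_misses * miss_latency


def stall_cycles_overlapped(num_misses, miss_latency, max_outstanding):
    """Independent misses proceed in groups limited by the miss buffer size."""
    return math.ceil(num_misses / max_outstanding) * miss_latency


# Hypothetical example: 8 independent misses, 200 cycles each,
# and room for 4 outstanding cache-line requests.
serial = stall_cycles_blocking(8, 200)
overlapped = stall_cycles_overlapped(8, 200, 4)
```

With these assumed values, the serial case costs 1600 cycles and the overlapped case 400 cycles; the actual benefit depends on how many independent loads the instruction window exposes.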
Executing memory instructions out of order interferes with the global order of memory accesses in multiprocessor systems, impacting memory consistency guarantees. To free the application programmer from reasoning about the actual complex interactions, our baseline core maintains stronger (simpler to reason about) guarantees by locally checking for consistency violations and selectively replaying memory instructions [27]. Other commercial implementations take varying positions on whether to employ such mechanisms and provide a strong memory model [209, 367] with stricter ISA-level rules, or a weak memory model [351, 353] for performance and / or energy reasons.
Back End In the back end of the core, instructions that executed out of order are serialised again, mainly by retiring (or committing) them from the ROB in order once they have completed. At this point, the architectural state of the register file is reconstructed (by directly updating either a separate architectural register file or a separate remapping table). Furthermore, completed instructions are checked for exceptions (such as permission errors or missing mappings for memory instructions, and arithmetic exceptions); these require a clean architectural state and must not be actioned while they might be happening on a mis-speculated branch. Because the back end retires instructions in order, branches must be fully resolved before subsequent instructions can be retired. Checking for exceptions in the back end is therefore an easy way to ensure that they are indeed part of the actual application instruction stream.
Similarly, other instructions that need to wait for preceding instructions to finish (sometimes called pipeline serialising) will wait until the retire stage to take effect: fences, CPUID, privilege level changes, and modifications to the segment registers. Another class of instructions that typically waits until the back end are stores. In most ISAs, speculative stores must not be visible to the memory system. Sending them out only at the retire stage ensures that; often, they are then held in a separate write buffer, or post-retire store queue; in some designs, a part of the unified LSQ is used for that purpose.
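The handling of stores can be sketched with a small queue that keeps speculative stores away from memory until they retire. The class `StoreQueue`, its method names, and the use of a dictionary as the memory system are assumptions for illustration; forwarding of pending store data to younger loads is omitted.

```python
class StoreQueue:
    """Holds executed but not yet retired stores in program order."""

    def __init__(self, memory):
        self.memory = memory  # dict address -> value, stands in for the memory system
        self.pending = []     # entries: (sequence number, address, value)

    def execute_store(self, seq, addr, value):
        # The store has executed, but its value is not yet visible to memory.
        self.pending.append((seq, addr, value))

    def squash_after(self, seq):
        # Mis-speculation: drop all stores younger than sequence number seq.
        self.pending = [s for s in self.pending if s[0] <= seq]

    def retire_up_to(self, seq):
        # Retirement: the oldest stores become visible, in program order.
        while self.pending and self.pending[0][0] <= seq:
            _, addr, value = self.pending.pop(0)
            self.memory[addr] = value


memory = {}
queue = StoreQueue(memory)
queue.execute_store(1, 0x10, 7)
queue.execute_store(2, 0x20, 9)    # assume this store lies on a mispredicted path
visible_before_retire = dict(memory)
queue.squash_after(1)              # the branch resolves; store 2 is annulled
queue.retire_up_to(1)
```

In this run, memory is still empty before retirement, and only the store at sequence number 1 ever reaches it; the squashed store leaves no trace.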
Types of Speculation In total, modern out-of-order microprocessors execute instructions speculatively and need to have mechanisms to deal with wrong speculation and reset the processor state to a known good configuration, that is, a valid architectural state. During speculation, however, speculative state is advanced and only promoted to architectural state once all predictions have been validated successfully (at an in-order retirement stage), or remedied.

Figure 4.1: Abstract pipeline diagram of an out-of-order microprocessor. (Blocks shown: instruction cache & TLB, instruction fetch, branch predictor, instruction decoder, reorder buffer, load / store unit, data cache & TLB, and several ALUs.)
In this thesis, I will refer to this collection of speculation as employed by current OoO microprocessors as out-of-order speculation (OoO-spec). In contrast, I will refer to speculation used in the transactional context (for example when entering an ASF speculative region) as ASF speculation (ASF-spec).
These speculation levels do not have to be distinct; their scope, however, is different. Out-of-order speculation speculates on the instructions inside the reorder / instruction window, which contains tens to hundreds of instructions, and supports a large number of different speculation mechanisms and failure scenarios. Transactional speculation is a higher-level concept, conceptually executing local transactions sequentially, but speculating on (lack of) interference from remote memory accesses. The horizon of transactional speculation is the size of the transactions.
Adding support for transactional memory to an existing microarchitecture requires changes in a large fraction of the units: clearly, the new instructions required to start and end a transaction necessitate changes in the decoder, while the actual transactional memory functionality of conflict detection and versioning requires changes to the load-store unit and memory subsystem. The following sections outline the technical details, looking at the features one by one.