Dezső Sima
2.2.5 Layout of the Register Mapping
2.2.5.1 Overview
Register mapping includes three main tasks, as depicted in Fig. 2.16:
1. The processor needs to allocate rename buffers to the destination registers of the dispatched instructions.
2. It also must keep track of the valid mappings for two reasons:
a. To forward generated results to the right rename buffers.
b. To deliver the correct operand values when they are needed in the course of instruction processing.
3. It needs to deallocate no longer needed rename buffers.
2.2.5.2 Allocation Scheme of Rename Buffers
To simplify the logic, processors usually allocate a rename buffer to every dispatched instruction rather than only to those that have a destination register. Although a rename buffer is not needed until the instruction result is generated in the last execution cycle, rename buffers are typically allocated as early as instruction dispatch. This early allocation wastes rename register space. Delaying the allocation of rename buffers until the instructions finish [39] saves rename register space. Various schemes have been proposed for this, such as virtual renaming [39–42] and others [43]. In fact, a virtual allocation scheme has already been introduced in the POWER3 [39].
2.2.5.3 Method of Keeping Track of Actual Register Mapping
There are three possibilities for keeping track of the actual mapping of the architectural registers to the allocated rename buffers: (1) the processor can use a mapping table for this, (2) it can simplify the tracking task by means of a future file, or (3) it can track register renames within the rename buffers themselves. In the following we outline these methods, which are illustrated in Fig. 2.17.
A mapping table has as many entries as there are architectural registers in the instruction set architecture (ISA), usually 32 for RISCs. Each entry holds a status bit (called the entry valid bit in the figure), which indicates whether the associated architectural register is renamed. Operands from renamed registers are accessed from the rename buffers, whereas operands from registers that are not renamed are accessed from the architectural register file. Each valid entry supplies the index of the rename buffer that is allocated to the architectural register belonging to that entry (called the RB-index). For instance, the left-hand side of Fig. 2.17 shows that the mapping table holds a valid entry for architectural register r7, which contains the RB-index of 12, indicating that the architectural register r7 is actually renamed to rename buffer number 12. Conversely, rename buffer RB 12 will hold or already holds the generated value for r7, depending on the value-valid bit of the associated rename buffer. To prepare operand access, source registers of dispatched instructions are renamed simply by accessing the mapping table with the register numbers as indices and fetching the associated rename buffer identifiers (RB-indices), assuming that there is a valid renaming for that particular register, indicated by the “entry valid” field of the entry, as shown in Fig. 2.17.

[Fig. 2.16 (not reproduced): Layout of the register mapping, comprising the allocation scheme of the rename buffers, the method of keeping track of actual mappings, and the deallocation scheme of rename buffers.]
As already mentioned in the previous section, usually each entry is set up during instruction dispatch when new rename buffers are allocated to the destination registers of the dispatched instructions. A new entry is created by setting the “entry valid” bit and writing the index of the allocated rename buffer into the field “RB index.” A valid mapping is updated when the architectural register belonging to that entry is renamed again, and it will be invalidated when the instruction associated with the actual renaming completes. In this way, the mapping table continuously holds the latest allocations.
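The life cycle of a mapping table entry described above can be sketched as follows. This is only an illustrative model, not the logic of any particular processor: the names `MapEntry`, `MappingTable`, `rename_dest`, `rename_source`, and `complete` are hypothetical, and the check in `complete` (invalidate only if the completing instruction still owns the mapping) is our reading of "the instruction associated with the actual renaming."

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MapEntry:
    entry_valid: bool = False  # is the architectural register currently renamed?
    rb_index: int = 0          # index of the allocated rename buffer (RB-index)


class MappingTable:
    """Hypothetical model: one entry per architectural register."""

    def __init__(self, num_arch_regs: int = 32):
        self.entries = [MapEntry() for _ in range(num_arch_regs)]

    def rename_dest(self, reg: int, rb_index: int) -> None:
        # At dispatch: create the entry, or update it if reg is renamed again.
        self.entries[reg] = MapEntry(entry_valid=True, rb_index=rb_index)

    def rename_source(self, reg: int) -> Optional[int]:
        # Return the RB-index if reg is renamed; None means the operand
        # comes from the architectural register file instead.
        entry = self.entries[reg]
        return entry.rb_index if entry.entry_valid else None

    def complete(self, reg: int, rb_index: int) -> None:
        # Assumption: invalidate only if the completing instruction's
        # rename buffer is still the current mapping for reg.
        entry = self.entries[reg]
        if entry.entry_valid and entry.rb_index == rb_index:
            entry.entry_valid = False


table = MappingTable()
table.rename_dest(7, 12)         # r7 is renamed to RB 12, as in Fig. 2.17
print(table.rename_source(7))    # RB-index for r7
print(table.rename_source(6))    # None: r6 is not renamed
```

A later `table.rename_dest(7, 14)` would overwrite the entry, so that a subsequent `table.complete(7, 12)` leaves the newer mapping to RB 14 in place.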
We note that split architectural register files need separate FX- and FP-mapping tables. Mapping tables should provide one read port for each source operand that may be fetched in any one cycle, and one write port for each rename buffer that may be allocated in any one cycle (as discussed earlier in Section 2.2.4.4).
The second option for tracking register allocations is based on the future file concept. Originally introduced in the mid-1980s for implementing precise interrupts in pipelined processors [46], the future file has the same number of entries as the architectural register file and holds the most recent values produced so far for the architectural registers. In connection with renaming, the future file holds the latest values of the renamed (temporarily buffered) registers and delivers those values (or, if they are not yet available, their tags) when the operands are accessed. Below, with reference to Fig. 2.18, we describe the operation of the future file in more detail, assuming that the processor makes use of shelving, accesses operands dispatch-bound, and holds instruction results (that is, renamed values) in the ROB.
As instructions are dispatched to the reservation stations (RS), the processor clears, in the future file, the “value-valid” bits belonging to the destination registers (Rd) of the dispatched instructions in order to indicate a not yet available value, and it marks each missing value with the tag (Tag) of the instruction to prepare for a later update when one of the execution units generates the result. Furthermore, the future file is accessed by the referenced source operand identifiers (Rs1, Rs2) and delivers the requested operand values to the RS if they are available (i.e., their value-valid bits are set); otherwise it delivers their tags. Dispatched instructions remain in the RS and wait for their missing operands. The scheduler checks the instructions held in the RS in each clock cycle and issues the oldest instruction that has all required operand values to the associated execution unit. The generated result is then written into the associated ROB entry, that is, into the entry carrying the same tag as the result. In addition, both the future file and the RS are updated with the generated result. The RS needs associative access to update all operands waiting for the particular result, that is, those holding the matching tag. The future file is updated only if the referenced register entry has the same tag as the forwarded result; otherwise the forwarded result is not the latest one belonging to the referenced register, since a subsequent instruction has already overwritten that entry.

[Fig. 2.17 (not reproduced): Method for keeping track of the actual register mapping. Panels: using a mapping table (lookup for r7 yields RB index = 12); using a future file (lookup for r7 yields value = 70); mapping within the rename buffers (associative lookup for r7 yields RB index = 12 or value = 70).]
FIGURE 2.18 Principle of using a future file for keeping track of actual register mappings, assuming shelving with dispatch-bound operand fetching and a ROB for holding instruction results temporarily. [Figure not reproduced; its blocks are the dispatched instructions, the future file, the RS, the scheduler, the execution unit, the ROB, and the architectural register file.]
When instructions complete in program order, their results are written into the architectural register file to update the program state. In case of mispredicted branches, misspeculated loads, etc., or accepted exceptions, the future file is flushed and restored with the content of the architectural register file.
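The future file behavior described above (clearing the value-valid bit and recording a tag at dispatch, delivering values or tags for source operands, tag-checked updates, and restoring from the architectural register file) can be sketched as a small model. All names here (`FFEntry`, `FutureFile`, `dispatch`, `read`, `update`, `restore`) are hypothetical, and the reservation stations, scheduler, and ROB are deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Operand = Tuple[str, Optional[int]]  # ("value", v) or ("tag", t)


@dataclass
class FFEntry:
    value: int = 0
    value_valid: bool = True     # set when the latest value is available
    tag: Optional[int] = None    # tag of the instruction that will produce it


class FutureFile:
    """Hypothetical model: one entry per architectural register."""

    def __init__(self, arch_regs: List[int]):
        self.entries = [FFEntry(value=v) for v in arch_regs]

    def read(self, reg: int) -> Operand:
        # Deliver the value if available, otherwise the tag to wait for.
        e = self.entries[reg]
        return ("value", e.value) if e.value_valid else ("tag", e.tag)

    def dispatch(self, rd: int, rs1: int, rs2: int, tag: int):
        # Read the sources first, so an instruction that reads its own
        # destination register still sees the older producer.
        operands = (self.read(rs1), self.read(rs2))
        e = self.entries[rd]
        e.value_valid = False    # result not yet available
        e.tag = tag              # remember who will produce it
        return operands

    def update(self, rd: int, tag: int, result: int) -> bool:
        # Accept the result only if the entry still carries the same tag.
        e = self.entries[rd]
        if not e.value_valid and e.tag == tag:
            e.value, e.value_valid = result, True
            return True
        return False             # stale: a later instruction owns rd

    def restore(self, arch_regs: List[int]) -> None:
        # Flush and reload after a misprediction or an accepted exception.
        self.entries = [FFEntry(value=v) for v in arch_regs]


ff = FutureFile([10 * i for i in range(8)])
print(ff.dispatch(rd=7, rs1=1, rs2=2, tag=100))  # both operands available
print(ff.dispatch(rd=7, rs1=7, rs2=2, tag=101))  # r7 delivered as tag 100
```

In this sketch, a result forwarded with tag 100 is rejected by `update` once the second instruction has been dispatched, because the entry for r7 then carries tag 101.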
The third, fundamentally different alternative for keeping track of the actual register mappings relies on an associative mechanism (see the right-hand side of Fig. 2.17) and is called mapping within the rename buffers. In this case, no mapping table exists; instead, each rename buffer holds the identifier of the associated architectural register (usually the register number of the renamed destination register) as well as additional status bits. These entries are usually set up during instruction dispatch, when a particular rename buffer is allocated to a specified destination register. As Fig. 2.17 shows, in this case each rename buffer holds five pieces of information: (1) a status bit, which indicates that this rename buffer is actually allocated (called the entry valid bit in the figure); (2) the identifier of the associated architectural register (Dest. reg. no.); (3) a further status bit, called the “latest bit,” whose role is explained below; (4) another status bit, called the “value-valid” bit, which shows whether the actual value of the associated architectural register has already been generated; and finally (5) the value itself (value), provided that the value-valid bit signifies an already produced result. The latest bit is needed to mark the last allocation of a given architectural register if it has more than one valid allocation due to repeated renaming. For instance, in our example, architectural register r7 has two subsequent allocations. Of these, entry number 12 is the latest one, as its latest bit has been set. Thus, in our figure, renaming of the source register r7 would yield the RB-index of 12. We point out that in this method source registers are renamed by an associative lookup for the latest allocation of the given source register.

If operands are fetched dispatch bound, source registers are both renamed and accessed during the dispatch process. In this case, processors usually integrate renaming and operand accessing, and therefore maintain register mapping within the rename buffers or use a future file. For issue-bound operand fetching, however, these tasks are separated. Source registers are usually renamed during instruction dispatch, whereas the source operands are accessed while the processor issues the instructions to the execution units. Therefore, in this case, processors typically use mapping tables.
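The associative lookup used for mapping within the rename buffers can be sketched as follows. The five fields mirror those listed above; the function names `allocate` and `lookup` are hypothetical, and clearing the latest bit of earlier allocations inside `allocate` is an assumption about how the bit is maintained, since the text only states what the bit marks. Hardware would compare all entries in parallel, whereas this sketch scans them sequentially.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RenameBuffer:
    entry_valid: bool = False   # buffer is currently allocated
    dest_reg: int = 0           # architectural register it renames
    latest: bool = False        # marks the last allocation for dest_reg
    value_valid: bool = False   # result has already been generated
    value: int = 0


def allocate(buffers: List[RenameBuffer], rb_index: int, dest_reg: int) -> None:
    # Assumption: earlier valid allocations of dest_reg lose their latest bit.
    for rb in buffers:
        if rb.entry_valid and rb.dest_reg == dest_reg:
            rb.latest = False
    buffers[rb_index] = RenameBuffer(entry_valid=True, dest_reg=dest_reg,
                                     latest=True)


def lookup(buffers: List[RenameBuffer], src_reg: int) -> Optional[int]:
    # Associative search for the latest valid allocation of src_reg.
    for index, rb in enumerate(buffers):
        if rb.entry_valid and rb.latest and rb.dest_reg == src_reg:
            return index
    return None  # not renamed: read the architectural register file


buffers = [RenameBuffer() for _ in range(16)]
allocate(buffers, 10, 7)     # first allocation for r7
allocate(buffers, 12, 7)     # r7 renamed again; RB 12 is now the latest
print(lookup(buffers, 7))    # RB-index of the latest allocation for r7
```

With dispatch-bound operand fetching, the same lookup could also return the value field when the value-valid bit is set, which is why renaming and operand access are easily integrated in this scheme.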
2.2.5.4 Deallocation Scheme of Rename Buffers
If rename buffers are no longer needed, they should be reclaimed (deallocated). The actual scheme of reclaiming depends on key aspects of the overall rename process. In particular, it depends on the allocation scheme of the rename buffers, the type of rename buffers used, the method of keeping track of actual allocations, and even whether operands are fetched dispatch bound or issue bound. Here, we do not go into details, but refer to Section 2.2.6.1 for a few examples of how processors reclaim rename registers.
2.2.5.5 Rename Rate
As its name suggests, the rename rate is the maximum number of renames that a processor is able to perform in a cycle. Basically, the processor should be able to rename all instructions dispatched in the same cycle to avoid performance degradation. Thus, the rename rate should equal the dispatch rate. This is easier said than done, because implementing a high rename rate (four or higher) is not at all easy, for two reasons. First, for higher rename rates, the detection and handling of inter-instruction dependencies during renaming (as discussed later in Section 2.2.6.1) becomes more complex. Second, higher rename rates require a larger number of read and write ports on register files and on mapping tables.

For instance, the four-way superscalar R10000 can dispatch any combination of 4 FX- and FP-instructions. Accordingly, its FX-mapping table needs 12 read and 4 write ports, and its FP-table requires 16 read and 4 write ports. This number of ports is needed since FX-instructions can refer to up to three and FP-instructions to up to four source operands in this processor. Another example worth looking at is the PM1, also called SPARC64. This four-way superscalar processor can dispatch any combination of 4 FX- and 2 FP-instructions, up to a maximum of four instructions. In this case, both the FX-mapping table and the merged register file have 10 read and 4 write ports, while its FP-counterpart has 6 read and 3 write ports. According to Asato et al. [44], this 14-port, 116-word, 64-bit merged register file needs 371 K transistors, far more than the entire Intel 8086 processor (about 30 K transistors) and more than the i386 (about 275 K transistors) [45].
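The R10000 port counts quoted above follow from a simple product: one read port per source operand that may be renamed in a cycle, and one write port per rename buffer that may be allocated in a cycle. The helper below, whose name `mapping_table_ports` is hypothetical, only reproduces that arithmetic. It is a worst-case estimate and does not reproduce the PM1 figures, which evidently rest on other design constraints.

```python
from typing import Tuple


def mapping_table_ports(rename_rate: int, max_sources: int) -> Tuple[int, int]:
    """Worst-case (read_ports, write_ports) of a mapping table.

    rename_rate: instructions renamed per cycle.
    max_sources: maximum source operands per instruction.
    """
    read_ports = rename_rate * max_sources   # one per source operand
    write_ports = rename_rate                # one per allocated rename buffer
    return read_ports, write_ports


# R10000 figures from the text: four-way, up to three FX and four FP sources.
print(mapping_table_ports(4, 3))   # FX-mapping table
print(mapping_table_ports(4, 4))   # FP-mapping table
```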