Trace Cache - The Multi2Sim Simulation Framework. A CPU-GPU Model for Heterogeneous Computing (

Multi2Sim models a trace cache with a design similar to that originally proposed in [4]. The aim of the trace cache is to provide a sequence of pre-decoded x86 microinstructions with intermingled branches.

This increases the fetch width by enabling instructions from different basic blocks to be fetched in the same cycle. Section [TraceCache]in the x86 configuration file (option––x86-config) is used to set up the trace cache model, with the following allowed variables:

• Present (defaults toFalse). If set toTrue, the trace cache model is activated.

• Sets (defaults to 64). Number of sets in the trace cache. The default value is 64.

• Assoc (defaults to 4). Trace cache associativity, or number of ways in each trace cache set. The product Sets×Assocdetermines the total number of traces that can be stored in the trace cache.

Figure 2.5: Timing diagram for the execution of therep movsbx86 string operation.

• TraceSize(defaults to 16). Maximum size of a trace in number of micro-instructions. A trace always groups micro-instructions at macro-instruction boundaries, which causes them to often become shorter than TraceSize.

• BranchMax(defaults to 3). Maximum number of branches allowed in a trace.

• QueueSize(defaults to 32). Size of the trace queue size in micro-instructions. See Section 2.9 for more details on the trace queue management.

Besides micro-instructions, each trace cache line contains the following attached fields:

• valid bit: bit indicating whether the line contains a valid trace.

• tag : address of the first microinstruction in the line. This is always the first one from the set of microinstructions belonging to the same macroinstruction. If a macroinstruction is decoded into more than trace_size microinstructions, they cannot be stored in the trace cache.

• uop_count: number of microinstructions in the trace.

• branch_count: number of branches in the trace, not including the last microinstruction if it is a branch.

• branch_flags: bit mask with predictions for branches.

• branch_mask: bit mask indicating which bits in branch_flags are valid.

• fall_trough: address of the next trace to fetch.

• target: address of the next trace in case the last microinstruction is a branch and it is predicted taken.

Creation of traces

Traces are created non-speculatively at the so-called fill unit, after the commit stage. In this unit, a temporary trace is created and dumped into the trace cache when the number of stored

microinstructions exceeds trace_size or the number of branches reaches branch_max. The following algorithm is used every time an instruction I commits.

If the trace is empty, the address of I is stored as the trace tag. If I is a branch, the fall_through and target fields are stored as the address of the contiguous next instruction, and the branch target if it is taken, respectively. This information is needed only in case this branch is the last microinstruction stored in the trace line. Then, the predicted direction is added to branch_flags in the position corresponding to bit branch_count. The mask branch_mask indicates which bits in branch_flags are valid. Thus, the bit position branch_count in branch_mask is set in this case. Finally, the counters uop_count and branch_count are incremented.

If I is not a branch, the fall_through field is updated, and the target field is cleared. Likewise, the counter uop_count is incremented.

Before adding I the temporary trace, it is checked whether I actually fits. If the maximum number of branches of microinstructions is exceeded, the temporary trace is first dumped into a new allocated trace cache line. When this occurs, the target field is checked. If it contains a value other than 0, it means that the last instruction in the trace is a branch. The prediction of this branch must not be included in the branch flags, since the next basic block is not present in the trace. In contrast, it will be used to fetch either the contiguous or the target basic block. Thus, the last stored bit in both branch_flags and branch_mask is cleared, and branch_count is decremented.

Trace cache lookups

The trace cache is complementary to the instruction cache, i.e., it does not replace the traditional instruction fetch mechanism. To the contrary, the trace cache simply has priority over the instruction cache in case of a hit, since it is able to supply a higher number of instructions along multiple basic blocks.

The trace cache modeled in Multi2Sim is indexed by the eip register (i.e., the instruction pointer). If the trace cache associativity is greater than 1, different ways can hold traces for the same address and different branch prediction chains. The condition to extract a microinstruction sequence into the pipeline, i.e. to consider a trace cache hit, is evaluated as follows.

First, the two-level adaptive branch predictor is used to obtain a multiple prediction of the following branch_max branches, using the algorithm described in Section 2.6. This provides a bit mask of branch_max bits, called pred, where the least significant bit corresponds to the primary branch direction. To check if a trace is valid to be fetched, pred is combined with the trace branch_mask field. If the result is equal to the trace branch_flags field, a hit is detected.

On a trace cache hit, the next address to be fetched is updated as follows. If the trace target field is 0, the fall_through field is used as next address. If target is non-0, the last predecoded

microinstruction in the trace is a branch. In this case, the bit branch_count+1 in the pred bit mask is used to choose between fall_through and target for the next fetch address.

Trace cache statistics

When the trace cache is activated, a set of statistics are attached to the pipeline report (option

––x86-report). Since each hardware thread has its private trace cache, there is a different set of statistics prefixed with theTraceCacheidentifier in each thread section ([c0t0],[c0t1], etc). The following list gives the meaning of each reported statistic related with the trace cache:

• TraceCache.Accesses. Number of cycles when the trace cache is looked up for a valid trace in the fetch stage. This is not necessarily the same as the number of execution cycles, since trace cache accesses are avoided, for example, when the trace queue is full.

• TraceCache.Hits. Number of accesses to the trace cache that provided a hit for the given fetch address and branch sequence prediction. Notice that this does not imply that the fetched micro-instructions are in the correct path, since the branch prediction used to access the trace cache might have been incorrect.

• TraceCache.Fetched. Number of micro-instructions fetched from traces stored in the trace cache.

• TraceCache.Dispatched. Number of micro-instructions fetched from the trace cache and dispatched to the reorder buffer.

• TraceCache.Issued. Number of micro-instructions fetched from the trace cache and issued to the functional units for execution.

• TraceCache.Committed. Number of micro-instructions in the correct path that came from a trace cache access and have committed. These are a subset of the micro-instructions included in the

Commit.Totalstatistic for the same hardware thread section.

• TraceCache.Squashed. Number of micro-instructions fetched from the trace cache that where dispatched and later squashed upon a branch misprediction detection. This number is equal to

TraceCache.Dispatched –TraceCache.Commited.

• TraceCache.TraceLength. Average length of the traces stored in the trace cache. This is a value equal or (probably) lower than the trace cache line size. Since micro-instructions from the same

Figure 2.6: The fetch stage.

macro-instruction cannot be split among traces, a trace may be dumped into the trace cache before being full. The fact that there is a maximum number of allowed branches in a trace is an additional reason for having traces shorter than the maximum trace length.

In document The Multi2Sim Simulation Framework. A CPU-GPU Model for Heterogeneous Computing (For Multi2Sim v. 4.2) (Page 30-34)