Statistical profiling - Overview and Discussion


3.1.1 Statistical profiling

In statistical profiling we make a distinction between microarchitecture-independent characteristics and microarchitecture-dependent characteristics. The microarchitecture-independent characteristics can be used across microarchitectures during design space exploration. The microarchitecture-dependent characteristics on the other hand are particular to a specific microarchitecture component configuration. Ideally, the profile should contain only microarchitecture-independent characteristics, such that it applies to many microarchitectures and thus needs to be determined only once. Figure 3.2 illustrates what a statistical profile looks like; we now discuss each component in more detail.


Figure 3.2: Illustration of the statistical profile. The notation ‘A|ABB’ represents basic block A with its history of three preceding basic blocks ABB.

Microarchitecture-independent characteristics

The key structure in the statistical profile is the statistical flow graph (SFG) [19], which represents a program’s control flow behavior in a statistical manner. In an SFG, the nodes are the basic blocks along with their basic block history, i.e., the basic blocks executed prior to the given basic block. The order of the SFG is defined as the length of the basic block history, i.e., the number of predecessors to a basic block in each node of the SFG. The order of an SFG is denoted by the symbol k throughout this dissertation; in this chapter we consider third-order SFGs unless stated otherwise. For example, consider the basic block sequence ‘ABBAABAABBA’. The third-order SFG then makes a distinction between basic block ‘A’ given its basic block history ‘ABB’, ‘BBA’, ‘AAB’ or ‘ABA’; the SFG will thus contain the following nodes: ‘A|ABB’, ‘A|BBA’, ‘A|AAB’ and ‘A|ABA’. The edges interconnecting the nodes of the SFG represent transition probabilities between the nodes. Figure 3.2 gives an example third-order SFG for the aforementioned basic block sequence.
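The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the profiler used in this dissertation: the function names `build_sfg` and `label` are hypothetical, and the sketch assumes that the first k blocks of the trace, which have an incomplete history, are simply skipped.

```python
from collections import Counter, defaultdict

def build_sfg(blocks, k=3):
    """Build a k-th order SFG from a sequence of basic block IDs.

    A node is a pair (block, history), where history is the tuple of the
    k blocks executed just before it. Returns (node_counts, probs), with
    probs[src][dst] the transition probability from node src to node dst.
    Assumption: the first k blocks (incomplete history) are skipped.
    """
    nodes = Counter()
    edges = defaultdict(Counter)
    prev = None
    for i in range(k, len(blocks)):
        node = (blocks[i], tuple(blocks[i - k:i]))
        nodes[node] += 1
        if prev is not None:
            edges[prev][node] += 1
        prev = node
    # Normalize outgoing edge counts into transition probabilities.
    probs = {
        src: {dst: n / sum(out.values()) for dst, n in out.items()}
        for src, out in edges.items()
    }
    return nodes, probs

def label(node):
    """Render a node in the 'A|ABB' notation of Figure 3.2."""
    block, history = node
    return f"{block}|{''.join(history)}"

nodes, probs = build_sfg("ABBAABAABBA", k=3)
a_nodes = sorted(label(n) for n in nodes if n[0] == "A")
print(a_nodes)  # ['A|AAB', 'A|ABA', 'A|ABB', 'A|BBA']
```

For the example sequence, the sketch yields the four ‘A’ nodes listed in the text; node ‘B|BAA’ has two successors, ‘A|AAB’ and ‘B|AAB’, each with transition probability 0.5.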

The idea behind the SFG is to model all the other program characteristics along the nodes of the SFG. This allows for modeling program characteristics that are correlated with (or dependent on) execution path behavior. For a given basic block, different statistics are computed for different basic block histories; e.g., we collect separate statistics for basic block ‘A’ given its history ‘AAB’ and given its history ‘ABB’. For example, the probability of a cache miss for a given load in basic block ‘A’ might differ depending on its basic block history. On the other hand, if the correlation between program characteristics spans a number of basic blocks that is larger than the SFG’s order, it is impossible to model such correlations within the SFG unless the order of the SFG is increased. However, in the next chapter we will show that it is possible to decouple cache miss correlation from the order of the SFG, i.e., we capture correlation that spans a number of basic blocks larger than the SFG’s order.
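As a hedged illustration of collecting a statistic per SFG node, the sketch below computes a cache miss probability for each (block, history) pair. The function name `miss_prob_per_node`, the trace format (one load per basic block execution) and the miss flags are all made up for the example; they are not measurements or interfaces from this work.

```python
from collections import defaultdict

def miss_prob_per_node(trace, k=3):
    """trace: list of (block_id, missed) pairs, one per executed block,
    where `missed` says whether that block's load missed in the cache
    (hypothetical format). Returns {(block, history): miss probability}.
    """
    stats = defaultdict(lambda: [0, 0])  # node -> [executions, misses]
    for i in range(k, len(trace)):
        block, missed = trace[i]
        history = tuple(b for b, _ in trace[i - k:i])
        entry = stats[(block, history)]
        entry[0] += 1
        entry[1] += int(missed)
    return {node: m / n for node, (n, m) in stats.items()}

# Made-up miss flags: the load in 'A' misses only after history 'ABB'.
flags = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
trace = list(zip("ABBAABAABBA", flags))
miss_probs = miss_prob_per_node(trace)
print(miss_probs[("A", ("A", "B", "B"))])  # 1.0
print(miss_probs[("A", ("A", "A", "B"))])  # 0.0
```

A profile keyed on the basic block alone would report a single miss probability of 2/5 for ‘A’ in this made-up trace, hiding the dependence on the execution path.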

The second microarchitecture-independent characteristic is the instruction mix. We classify the instruction types into 16 classes according to their semantics: nop, trap, load, store, software prefetch, write hint, integer conditional branch, floating-point conditional branch, indirect branch, integer arithmetic and logical operation, integer multiply, integer divide, floating-point arithmetic and logical operation, floating-point multiply, floating-point divide and floating-point square root. This distinction is made based on the instruction’s semantics and its execution latencies. For each instruction we also record the number of input registers or source operands. Note that some instruction types, although classified within the same instruction class, may have a different number of source operands.
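One possible encoding of the 16 classes and of the recorded mix is sketched below. The identifiers (`InsnClass`, `instruction_mix`) and the sample instructions are illustrative assumptions; the sketch only shows that the class and the number of source operands are recorded together.

```python
from collections import Counter
from enum import Enum, auto

class InsnClass(Enum):
    """The 16 instruction classes listed in the text (names are ours)."""
    NOP = auto()
    TRAP = auto()
    LOAD = auto()
    STORE = auto()
    SW_PREFETCH = auto()
    WRITE_HINT = auto()
    INT_COND_BRANCH = auto()
    FP_COND_BRANCH = auto()
    INDIRECT_BRANCH = auto()
    INT_ALU = auto()
    INT_MUL = auto()
    INT_DIV = auto()
    FP_ALU = auto()
    FP_MUL = auto()
    FP_DIV = auto()
    FP_SQRT = auto()

def instruction_mix(insns):
    """insns: iterable of (InsnClass, num_source_operands) pairs.
    Returns the fraction of instructions per (class, operand count)."""
    counts = Counter(insns)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

# Made-up sample: two INT_ALU instructions differ in operand count.
sample = [
    (InsnClass.LOAD, 1),
    (InsnClass.INT_ALU, 2),
    (InsnClass.INT_ALU, 1),
    (InsnClass.INT_ALU, 2),
]
mix = instruction_mix(sample)
print(mix[(InsnClass.INT_ALU, 2)])  # 0.5
```

In a full profile this distribution would presumably be kept per SFG node, in line with the idea of attaching all characteristics to the nodes.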

For each operand we also record the dependence distance, which is the number of dynamically executed instructions between the production of a register value (register write) and its consumption (register read). We only consider read-after-write (RAW) dependences since our focus is on out-of-order architectures in which write-after-write (WAW) and write-after-read (WAR) dependences are dynamically removed through register renaming as long as enough physical registers are available. Note that recording the dependence distance requires storing a distribution since multiple dynamic versions of the same static instruction could result in multiple dependence distances. Although very large dependence distances can occur in real program traces, for our purposes we can limit the dependence distances in the distribution to the maximum reorder buffer size we want to consider during statistical simulation. In our study, we limit the dependence distance to 512, which allows for modeling a wide range of microprocessors.
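The RAW dependence distance measurement can be sketched as follows. This is a simplified illustration under stated assumptions: the trace format and the name `raw_distance_histogram` are hypothetical, all operands are merged into a single histogram (the profile keeps a distribution per operand), and distances beyond the limit are dropped, which is one possible reading of "limiting" the distribution.

```python
from collections import Counter

MAX_DISTANCE = 512  # limit used in the text (maximum reorder buffer size)

def raw_distance_histogram(trace, max_distance=MAX_DISTANCE):
    """trace: list of (dest_reg, src_regs) per dynamic instruction;
    dest_reg may be None for instructions that write no register.
    Returns a Counter mapping RAW dependence distance -> occurrences.
    Assumption: distances above max_distance are dropped.
    """
    last_writer = {}  # register -> index of its most recent writer
    hist = Counter()
    for i, (dest, srcs) in enumerate(trace):
        for reg in srcs:
            if reg in last_writer:
                distance = i - last_writer[reg]
                if distance <= max_distance:
                    hist[distance] += 1
        # Update after reading, so an instruction never depends on itself.
        if dest is not None:
            last_writer[dest] = i
    return hist

trace = [
    ("r1", []),            # 0: writes r1
    ("r2", ["r1"]),        # 1: reads r1 (distance 1)
    ("r3", ["r1", "r2"]),  # 2: reads r1 (distance 2), r2 (distance 1)
    ("r1", ["r3"]),        # 3: reads r3 (distance 1), rewrites r1
    (None, ["r1"]),        # 4: reads the new r1 (distance 1)
]
hist = raw_distance_histogram(trace)
print(dict(hist))  # {1: 4, 2: 1}
```

Only the most recent writer of each register is tracked, so WAW and WAR orderings never contribute to the histogram, matching the RAW-only choice motivated above.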

Microarchitecture-dependent characteristics

In addition to the microarchitecture-independent characteristics mentioned above, we also measure a number of microarchitecture-dependent characteristics that are related to locality events. The reason for choosing to model these events in a microarchitecture-dependent way is that locality events are hard to model using microarchitecture-independent metrics. We therefore take a pragmatic approach and collect cache miss and branch miss information for particular cache configurations and branch predictors.

For the branch statistics we consider (i) the probability of a taken branch, (ii) the probability of a fetch redirection (target misprediction in conjunction with a correct taken/not-taken prediction for conditional branches), and (iii) the probability of a branch misprediction. When measuring the branch statistics we consider a FIFO buffer as described in [19] in order to model delayed branch predictor update.
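To illustrate why delayed update matters when measuring branch statistics, the sketch below profiles a stream of conditional branches with a FIFO of pending predictor updates. It is not the mechanism of [19]: the predictor (a small table of 2-bit counters), the `delay` and `table_bits` parameters and the function name are all assumptions, and fetch redirections are omitted because they would require a target predictor.

```python
from collections import deque

def profile_branches(outcomes, delay=4, table_bits=4):
    """outcomes: non-empty list of (pc, taken) pairs for conditional
    branches. The predictor is a hypothetical table of 2-bit saturating
    counters whose update is applied `delay` branches after the lookup,
    modeled with a FIFO of pending updates.
    """
    table = [1] * (1 << table_bits)  # counters start weakly not-taken
    pending = deque()                # FIFO of (index, outcome) updates
    taken_count = 0
    mispredicts = 0
    for pc, taken in outcomes:
        idx = pc & (len(table) - 1)
        predicted_taken = table[idx] >= 2
        taken_count += int(taken)
        mispredicts += int(predicted_taken != taken)
        pending.append((idx, taken))
        if len(pending) > delay:
            old_idx, old_taken = pending.popleft()
            if old_taken:
                table[old_idx] = min(3, table[old_idx] + 1)
            else:
                table[old_idx] = max(0, table[old_idx] - 1)
    n = len(outcomes)
    return {"p_taken": taken_count / n, "p_mispredict": mispredicts / n}

# A single always-taken branch executed 20 times.
loop = [(0, True)] * 20
immediate = profile_branches(loop, delay=0)
delayed = profile_branches(loop, delay=4)
print(immediate["p_mispredict"], delayed["p_mispredict"])  # 0.05 0.25
```

With immediate update the counter is trained after the first execution; with a delay of four branches the stale counter is consulted five times before the first update lands, so the measured misprediction probability is higher.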

The cache statistics consist of the following six probabilities: (i) the L1 I-cache miss rate, (ii) the L2 cache miss rate due to instructions only, (iii) the L1 D-cache miss rate, (iv) the L2 cache miss rate due to data accesses only, (v) the I-TLB miss rate and (vi) the D-TLB miss rate.

In document Fast simulation techniques for microprocessor design space exploration (Page 50-53)