5.2 A Study on the Potential of Custom Instructions
5.2.1 Crossing the Basic Block Boundaries
Most research on candidate pattern identification analyzes basic blocks in isolation. The only exception is the work by Arnold and Corporaal [6], which identifies patterns based on the dynamic execution trace of the program. The dynamic execution trace is the record of the program’s complete run-time execution sequence. As we are interested in identifying the performance potential of customization, our identification process is also based on the dynamic execution trace. This way we can identify patterns and their frequencies across basic block boundaries. Using the execution trace, we can group operations across branches in the
CHAPTER 5. CUSTOM INSTRUCTION SELECTION 96
[Figure: two control flow graphs, (a) and (b), over the same basic blocks with execution counts BB0(100), BB1(50), BB2(50), BB3(100), BB4(50), BB5(50), BB6(100) and edge labels of 50.]
Figure 5.4: Possible correlations of branches. (a) Left (right) side of the 1st branch is always followed by the left (right) side of the 2nd one, (b) Left (right) side of the 1st branch is always followed by the right (left) side of the 2nd one.
execution sequence more accurately. This is because certain dynamic behavior cannot be deduced from profiles of basic block execution counts. Figure 5.4 shows an example control flow graph where the correlation of biased branches cannot be correctly inferred from the execution counts alone. However, Arnold [6] constructs a huge dataflow graph for the entire trace and builds patterns incrementally by traversing this graph multiple times. This approach is computationally expensive and is therefore limited to small patterns. Instead, we base our study on a compact representation of the dynamic execution trace called Whole Program Path (WPP) [48], which allows efficient identification of patterns within and across basic blocks.
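The point made with Figure 5.4 can be checked with a small sketch. The two traces below are hypothetical: they assume the region BB0..BB6 is executed 100 times and that the first branch simply alternates between its two sides, which the text does not state. The function name `trace` and the helper `path_counts` are illustrative only.

```python
from collections import Counter

def trace(correlated, iterations=100):
    """Build a hypothetical block trace for the graph of Figure 5.4."""
    blocks = []
    for i in range(iterations):
        left_first = (i % 2 == 0)  # assumption: first branch alternates sides
        # Case (a): second branch takes the same side; case (b): the opposite side.
        left_second = left_first if correlated else not left_first
        blocks += ["BB0",
                   "BB1" if left_first else "BB2",
                   "BB3",
                   "BB4" if left_second else "BB5",
                   "BB6"]
    return blocks

def path_counts(blocks):
    """Count each complete five-block path through one pass of the region."""
    return Counter(tuple(blocks[i:i + 5]) for i in range(0, len(blocks), 5))

trace_a = trace(correlated=True)    # Figure 5.4(a)
trace_b = trace(correlated=False)   # Figure 5.4(b)

counts_a, counts_b = Counter(trace_a), Counter(trace_b)
paths_a, paths_b = path_counts(trace_a), path_counts(trace_b)

print(counts_a == counts_b)  # per-block execution counts are identical
print(paths_a == paths_b)    # the executed paths are not
```

Both traces give every block the counts shown in the figure, so a block-count profile cannot tell them apart; only the sequence information in the trace separates the two cases.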
Whole Program Path (WPP)
Larus developed the notion of the Whole Program Path (WPP) [48], which captures the entire execution trace of a program. The storage overhead of the trace is reduced drastically by employing an on-line string compression technique called SEQUITUR [60]. The SEQUITUR algorithm represents a finite string σ (the control flow
[Figure: WPP drawn as a DAG for the grammar S → A A C C 7, A → B B, B → 0 1 3 4 6, C → 0 2 3 5 6. Interior nodes (basic block sequences): S(1), A(2), B(4), C(2). Leaf nodes (basic blocks): 0(6), 1(4), 2(2), 3(6), 4(4), 5(2), 6(6), 7(1).]
Figure 5.5: WPP for basic block sequence 0134601346013460134602356023567 with execution count annotations.
trace in our case) as a context-free grammar whose language is the singleton set {σ}. The grammar is synthesized on the fly with time complexity linear in the length of the input string. It works by appending symbols from the input string, in order, to the end of the grammar’s start production. Upon each addition, SEQUITUR adjusts the grammar to preserve the following two invariants. The first invariant is the Digram Uniqueness property: a pair of consecutive symbols, called a digram, should occur at most once in the rules of the grammar. If adding a symbol from the input string introduces a recurring digram, its occurrences are replaced with the non-terminal symbol of a rule (possibly already constructed) that has the digram as its right-hand side. This first invariant constructs the rules and builds the hierarchy that expresses the redundancy. The second invariant is the Rule Utility property: every non-terminal symbol of the grammar (except the start symbol) must be referenced more than once by other rules; otherwise, its rule is eliminated. The reference count of a non-terminal symbol may drop when its occurrences are replaced by other non-terminal symbols higher in the hierarchy. The second invariant eliminates useless rules.
of basic blocks. The grammar produced by SEQUITUR can be represented as a directed acyclic graph (DAG), called the WPP. Figure 5.5 shows an example of a WPP. Each node of the WPP is annotated with the execution count of the sub-DAG rooted at that node. The leaf nodes of the WPP are the basic blocks; an interior node represents a sequence of basic blocks appearing in the execution trace. This example illustrates how the correlations of the two branches in Figure 5.4(a) can be captured in the WPP (by the non-terminal symbols B and C).
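As a rough illustration, the grammar of Figure 5.5 can be written down directly and checked against the caption and the node annotations. This is only a sketch over a hand-written grammar, not the on-line SEQUITUR construction itself; the dictionary layout and the function names (`expand`, `execution_counts`, `digram_uniqueness_holds`, `rule_utility_holds`) are assumptions made for the example.

```python
from collections import Counter

# Grammar read off Figure 5.5; digits are basic block ids (terminals).
GRAMMAR = {
    "S": ["A", "A", "C", "C", "7"],
    "A": ["B", "B"],
    "B": ["0", "1", "3", "4", "6"],
    "C": ["0", "2", "3", "5", "6"],
}

def expand(symbol):
    """Return the basic block sequence derived from a symbol."""
    if symbol not in GRAMMAR:
        return [symbol]
    blocks = []
    for child in GRAMMAR[symbol]:
        blocks.extend(expand(child))
    return blocks

def execution_counts(start="S"):
    """Count how often each node is reached while deriving the whole trace."""
    counts = Counter()
    def visit(symbol):
        counts[symbol] += 1
        for child in GRAMMAR.get(symbol, []):
            visit(child)
    visit(start)  # walks the full derivation; fine for a small example
    return counts

def digram_uniqueness_holds():
    """Check that no digram occurs twice (overlapping repeats are tolerated)."""
    first_seen = {}  # digram -> (rule, position) of its first occurrence
    for rule, rhs in GRAMMAR.items():
        for pos, digram in enumerate(zip(rhs, rhs[1:])):
            if digram not in first_seen:
                first_seen[digram] = (rule, pos)
            elif first_seen[digram] != (rule, pos - 1):
                return False
    return True

def rule_utility_holds(start="S"):
    """Check that every rule except the start rule is referenced at least twice."""
    uses = Counter(s for rhs in GRAMMAR.values() for s in rhs if s in GRAMMAR)
    return all(uses[rule] >= 2 for rule in GRAMMAR if rule != start)

print("".join(expand("S")))
print(dict(execution_counts()))
print(digram_uniqueness_holds(), rule_utility_holds())
```

Expanding S reproduces the block sequence in the caption of Figure 5.5, and the counts match the annotations on the nodes (for example B is reached 4 times and C twice).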
During candidate pattern enumeration, we start with the basic blocks and identify subgraphs within them. To identify subgraphs across basic block boundaries, we look at frequently occurring interior nodes in the WPP and treat the sequence of basic blocks corresponding to such a node as the unit for the pattern identification process.
5.2.2 Experimental Setup
Table 5.2 shows the benchmark programs used in this study. All the benchmarks, except for md5, are from MiBench [31], a free, representative embedded benchmark suite. We have selected benchmark programs from all the different categories, such as security, network, and telecomm. We consider integer-intensive benchmarks here, as including floating-point operations in patterns seldom results in speedup. Table 5.2 also shows the total number of basic blocks and the number of hot basic blocks for each program. We define hot basic blocks as the ones whose aggregate contribution exceeds 95% of the total execution time of the program. The ISE identification methodology explores only these hot basic blocks and the basic block sequences involving them. Including patterns from the rest of the basic blocks has a negligible effect on performance improvement. The average size of hot basic blocks varies from very small (2.6 instructions) to very large (495.7 instructions).
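One way to read the 95% rule is sketched below. It assumes that blocks are ranked by their share of execution time and added greedily until the aggregate exceeds the threshold; the text does not spell out the selection order, and the profile numbers and the name `hot_basic_blocks` are hypothetical.

```python
def hot_basic_blocks(cycles_per_block, threshold=0.95):
    """Pick the most expensive blocks until they exceed `threshold` of all cycles."""
    total = sum(cycles_per_block.values())
    ranked = sorted(cycles_per_block.items(), key=lambda kv: kv[1], reverse=True)
    hot, covered = [], 0
    for block, cycles in ranked:
        if covered > threshold * total:
            break  # the blocks chosen so far already exceed the threshold
        hot.append(block)
        covered += cycles
    return hot

# Hypothetical profile: cycles spent in each basic block (not measured data).
profile = {"BB0": 600, "BB1": 250, "BB2": 120, "BB3": 20, "BB4": 10}
hot = hot_basic_blocks(profile)
print(hot)
```

With this made-up profile the three most expensive blocks account for 970 of 1000 cycles, so the remaining two are left out of the exploration.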
Benchmark    Class       Total BB  Hot BB  Avg. Hot BB Size
rawcaudio    Telecomm          68      22               2.6
rawdaudio    Telecomm          66      18               2.6
fft          Telecomm         129      24               6.8
sha          Security          76       6              17.2
strsearch    Office           148       4                 6
qsort        Automotive        30      26               4.9
bitcnts      Automotive        79      13              12.4
basicmath    Automotive        94      28                 6
patricia     Network          203      37               2.8
dijkstra     Network           77       6                 5
djpeg        Consumer         317      96               6.8
rijndael     Security         168       7             184.3
blowfish     Security          81      13              30.3
sha(unroll)  Security          68       3             495.7
cjpeg        Consumer        3756     145               7.8
md5          Security         107      39              29.6
Table 5.2: Characteristics of benchmark programs
The execution traces of the programs are generated using the SimpleScalar tool set [12], which is a cycle-accurate simulation platform for RISC-like processor architectures. The benchmarks are compiled by gcc version 2.7.2.3 with -O3 optimization. We build the Whole Program Path (WPP) from the execution traces using a modified version of the Sequitur grammar [59]. DFGs for the hot basic blocks and paths (internal nodes of the WPP) are constructed to identify custom instructions within and across multiple basic blocks. Only connected candidate subgraphs are enumerated. The ILP formulations for custom instruction selection are solved using ILOG CPLEX (v9.1). In a few cases (for sha(unroll) and md5), CPLEX cannot return the optimal cycle reduction within 2 hours on a 3 GHz Pentium 4 Linux workstation. For these cases, we use the best cycle reduction CPLEX has achieved within the 2-hour running time, which is provably at most 5% less than the optimal one.
Evaluation of latencies and area of custom instructions is the same as that described in Section 5.1.2. Similarly, under the assumption of a single-issue, in-order
pipelined architecture with 100% cache hit rate, the percentage of cycle reduction is given as:
\[
\text{Reduction\%} = \frac{\text{Reduced cycles by custom instructions}}{\text{Original execution cycles of the benchmark}} \times 100
\]
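A small worked example of this formula follows. The way the reduced cycles are accumulated here (execution frequency times the difference between software and custom-instruction latency) is an assumption made for illustration, since the latency evaluation itself is defined in Section 5.1.2; all numbers and names are hypothetical.

```python
def reduced_cycles(selected):
    """Sum the cycles saved by each selected custom instruction.

    Each entry is (frequency, software_cycles, custom_cycles); this per-pattern
    accounting is an illustrative assumption, not taken from the text.
    """
    return sum(freq * (sw - hw) for freq, sw, hw in selected)

def reduction_percent(saved_cycles, original_cycles):
    """Percentage of cycle reduction as defined by the formula above."""
    return 100.0 * saved_cycles / original_cycles

# Hypothetical selection: two custom instructions in a 20000-cycle program.
selected = [(1000, 4, 1), (500, 3, 1)]
saved = reduced_cycles(selected)
print(saved, reduction_percent(saved, 20000))
```

Here the two patterns save 3000 and 1000 cycles respectively, so the reduction is 4000 out of 20000 cycles.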