7.3.1 Initial Implementation of Stack Operations Folding
Instruction pattern based folding was introduced by SUN Microsystems as part of its first implementation of a hardware Java machine, the picoJava-I processor [124]. It relies on a classification of the Java bytecode in which instructions are grouped by their purpose, e.g. local variable load/store operations or arithmetic operations.
The execution stage of the underlying virtual machine has to be aware of this classification, as well as of every instruction’s implicit group. This allows scanning of the instruction stream for groups of bytecodes which may be folded into a single RISC-like instruction, in order to eliminate stack transfers and thus reduce execution time.
7.3. FOLDING BASED ON INSTRUCTION TYPE PATTERN 117
Table 7.2: Instruction Groups Implemented in picoJava-II [123]
Group 1: LV LV OP MEM
Group 2: LV LV OP
Group 3: LV LV BG2
Group 4: LV OP MEM
Group 5: LV BG2
Group 6: LV BG1
Group 7: LV OP
Group 8: LV MEM
Group 9: OP MEM
PicoJava-I has a four-stage pipeline that contains a data cache. This allows for quick access to recently used data, which is most often stack data, as almost every bytecode accesses the stack. The folding mechanism implemented in picoJava-I makes use of this fact. In order ...
“ ... to boost performance, picoJava-I relies on a folding operation that takes advantage of random, single-cycle access to the stack cache. Frequently, an instruction that copies data from a local variable to the top of the stack immediately precedes an instruction that consumes that data. The instruction decoder detects this situation and folds these two instructions together. This compound instruction performs the operation as if the local variable were already located at the top of the stack.
Since, on average, the variables area is within 15 entries of the top of the stack and the stack cache is designed to contain nearly 64 valid entries, the local variable requested is almost always in the stack cache. In the unlikely event that the local variable is not contained on the stack cache, folding cannot occur, and picoJava-I suppresses it.” [124]
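The folding rule described in the quotation can be illustrated with a minimal Python sketch. It does not reproduce picoJava-I's decoder logic: the opcode sets LOCAL_LOADS, CONSTANT_PUSHES and CONSUMERS, the Instr type and the cached_locals parameter are assumptions introduced for illustration only.

```python
from dataclasses import dataclass

# ASSUMPTION: heavily reduced, hypothetical opcode classification.
LOCAL_LOADS = {"iload", "aload"}          # copy a local variable to the top of stack
CONSTANT_PUSHES = {"bipush", "sipush"}    # push a constant onto the stack
CONSUMERS = {"iadd", "isub", "imul"}      # consume the value on top of the stack


@dataclass(frozen=True)
class Instr:
    opcode: str
    operand: int = 0   # local variable index or constant value


def fold_pairs(code, cached_locals):
    """Fold each load or constant push that immediately precedes a consumer.

    cached_locals is the (assumed) set of local variable indices currently
    held in the stack cache. A local variable load is only folded when its
    variable is in that set; otherwise folding is suppressed.
    Returns a list of tuples, one tuple per issued (possibly compound)
    instruction.
    """
    issued = []
    i = 0
    while i < len(code):
        cur = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        is_load = cur.opcode in LOCAL_LOADS or cur.opcode in CONSTANT_PUSHES
        available = cur.opcode in CONSTANT_PUSHES or cur.operand in cached_locals
        if is_load and available and nxt is not None and nxt.opcode in CONSUMERS:
            issued.append((cur, nxt))   # one compound instruction
            i += 2
        else:
            issued.append((cur,))       # executed unfolded
            i += 1
    return issued


# c = a + b, with a, b, c in local variables 1, 2, 3
example = [Instr("iload", 1), Instr("iload", 2), Instr("iadd"), Instr("istore", 3)]
print(len(fold_pairs(example, cached_locals={1, 2})))  # second load folds with iadd
print(len(fold_pairs(example, cached_locals={1})))     # variable 2 not cached: no fold
```

In this sketch only the load directly in front of the consuming instruction is folded, which mirrors the two-instruction compound described in the quotation; the first load and the store are still issued on their own.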
118 CHAPTER 7. INSTRUCTION FOLDING
[Figure residue: decoders (DEC) over I-Buffer bytes 0 to 6; recoverable signal labels: l0 to l6, len0 to len2, RS1, RS2, OP, RD]
Figure 7.3: PicoJava-II’s Folding Logic for 2-, 3- and 4-Foldable Bytecode Sequences [123]
This folding mechanism supports folding of local variable load and push constant operations only [124]. As a result, a total of 14.9% of all executed instructions can be folded with this approach. This means that a large share of load and constant operations remains unfolded, as these two instruction groups together account for ≈ 41.3% of the instructions executed in the chosen benchmarks.
7.3.2 A More Fine Grained Folding Logic
An improved folding mechanism has been implemented in the picoJava-II processor. It relies on a 16-byte-wide instruction buffer, as well as on an increased number of six bytecode groups, which are shown in table 7.1. Furthermore, the processor’s Instruction Folding Logic (IFU) handles nine patterns, which are shown in table 7.2 and differ not only in the number of instructions they contain, but also in their actual length in bytes. Thus, a large amount of additional detection logic is necessary for the picoJava-II folding logic, which
“ ... examines the top 7 bytes in the instruction buffer (I-Buffer) to determine how many instructions can be folded (up to a maximum of four). [It furthermore] decodes the instructions and provides the result to the R stage and sends the shift signal, which indicates the number of bytes consumed, to the I-Buffer.” [123]
[Figure residue: fold decoders (fdec) over instruction bytes i0 to i6; recoverable labels: 2-to-1, 3-to-1, Fold Logic, Group 2, Group 9, acc_len0 to acc_len2]
Figure 7.4: Detection and Folding Logic for 2- and 3-Foldable Sequences (derived from) [125]
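The detection step quoted above can be approximated by a short Python sketch that matches the group patterns of table 7.2 against the head of the instruction stream. It is an illustration under stated assumptions, not the picoJava-II circuit: the OPCODES excerpt (group assignment per opcode) is hypothetical, and the sketch assumes that a sequence is only foldable if it fits completely into the 7-byte window.

```python
# Patterns of table 7.2, ordered longest first so that the widest fold wins.
PATTERNS = [
    ("LV", "LV", "OP", "MEM"),
    ("LV", "LV", "OP"),
    ("LV", "LV", "BG2"),
    ("LV", "OP", "MEM"),
    ("LV", "BG2"),
    ("LV", "BG1"),
    ("LV", "OP"),
    ("LV", "MEM"),
    ("OP", "MEM"),
]

WINDOW_BYTES = 7  # bytes of the I-Buffer examined by the folding logic

# ASSUMPTION: illustrative excerpt mapping opcode -> (group, length in bytes).
# The lengths follow the JVM instruction formats; the group assignment is a
# guess for this sketch, the actual classification is given in table 7.1.
OPCODES = {
    "iload": ("LV", 2), "iload_1": ("LV", 1), "iload_2": ("LV", 1),
    "bipush": ("LV", 2), "sipush": ("LV", 3),
    "iadd": ("OP", 1), "isub": ("OP", 1),
    "istore": ("MEM", 2), "istore_3": ("MEM", 1),
    "ifeq": ("BG1", 3), "if_icmplt": ("BG2", 3),
    "invokevirtual": ("NF", 3),
}


def match_fold(opcodes):
    """Return how many leading instructions can be folded (1 means no fold)."""
    groups = [OPCODES[op][0] for op in opcodes[:4]]
    for pattern in PATTERNS:
        n = len(pattern)
        if tuple(groups[:n]) != pattern:
            continue
        # Assumed constraint: the whole sequence must fit into the window.
        if sum(OPCODES[op][1] for op in opcodes[:n]) <= WINDOW_BYTES:
            return n
    return 1


def fold_stream(opcodes):
    """Greedily split an opcode list into folded sequences."""
    folded = []
    i = 0
    while i < len(opcodes):
        n = match_fold(opcodes[i:])
        folded.append(tuple(opcodes[i:i + n]))
        i += n   # corresponds to shifting the consumed bytes out of the buffer
    return folded


print(fold_stream(["iload_1", "iload_2", "iadd", "istore_3",
                   "iload_1", "ifeq", "invokevirtual"]))
```

Because the patterns differ in both instruction count and byte length, every candidate has to be checked against both criteria, which hints at why the text speaks of a large amount of detection logic.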
A schematic of picoJava-II’s folding logic is sketched in figure 7.3. SUN Microsystems has not published any performance numbers of picoJava-II’s folding mechanism. Nonetheless, picoJava-II has been used as a reference in several papers on different folding schemes. According to those numbers, picoJava-II is able to fold 42.32% of all stack operations [130], regardless of the related instruction’s type. The number of folded instructions may differ slightly, as some instructions execute more than a single stack operation. However, it can be stated that picoJava-II folds almost three times as many instructions as its predecessor.
7.3.3 A Hardware Saving Approach With Reduced Detection Logic
In order to reduce the large amount of detection logic required for the picoJava-II scheme, further evaluation and design have led to a slightly simplified but hardware-saving folding scheme [125]. It is stated that the primary . . .
“ . . . source of complexity in the PicoJava-II’s folding mechanism is the variable length of the bytecode, especially, the length of the LV type bytecodes that varies from one to three bytes. To reduce this complexity, we modified the PicoJava-II’s scheme in the following two points. First, we limit the number of folding bytecodes to three (i. e. Group 1 is excluded). Next, we exclude SIPUSH, which is the only three byte long LV type bytecode and handle it as an NF bytecode.” [125]
Reducing the length of foldable bytecode sequences from four to three instructions significantly reduces the amount of hardware required for the implementation. Additional hardware reduction is gained from the elimination of the sipush instruction from the folding process. This step simplifies the detection logic for local variable type instructions, which account for up to 35% of an application’s overall instructions [124].
Additionally, the processor’s critical path is shortened, and thus the overall execution time of an application should decrease. Nonetheless, an acceptable slowdown of the application was measured, caused by the reduced number of folding operations. Even so, execution was still faster than without folding.
“ [. . . ] the delay of the longest path [. . . ] was 2.50ns, which is a reduction of 11%. We also compared the logic circuit areas and it was found that the proposed scheme occupied 35% less area than that of PicoJava-II [. . . and . . . ] the proposed scheme achieved 74.0% to 99.7% of the PicoJava-II’s scheme for seven benchmarks.” [125]
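The two restrictions of the reduced scheme (no four-instruction group, sipush treated as NF) and their effect on the number of folds can be sketched in Python as follows. This is a simplified model rather than the design of [125]: the opcode-to-group excerpt PICOJAVA2_GROUPS and the helper names are hypothetical, and byte lengths are ignored so that only the pattern restriction is visible.

```python
# Patterns of table 7.2 as group tuples, longest first.
PICOJAVA2_PATTERNS = [
    ("LV", "LV", "OP", "MEM"),
    ("LV", "LV", "OP"), ("LV", "LV", "BG2"), ("LV", "OP", "MEM"),
    ("LV", "BG2"), ("LV", "BG1"), ("LV", "OP"), ("LV", "MEM"), ("OP", "MEM"),
]

# ASSUMPTION: hypothetical excerpt of the opcode-to-group classification.
PICOJAVA2_GROUPS = {
    "iload_1": "LV", "iload_2": "LV", "sipush": "LV",
    "iadd": "OP", "istore_3": "MEM",
}


def reduce_scheme(patterns, groups):
    """Apply the two modifications quoted above to a folding scheme."""
    reduced_patterns = [p for p in patterns if len(p) <= 3]   # excludes Group 1
    reduced_groups = dict(groups, sipush="NF")                # sipush is not folded
    return reduced_patterns, reduced_groups


def issue_slots(opcodes, patterns, groups):
    """Count issued instructions after greedy, longest-first folding."""
    slots = 0
    i = 0
    while i < len(opcodes):
        window = tuple(groups[op] for op in opcodes[i:i + 4])
        # Length of the first (longest) matching pattern, or 1 if none matches.
        n = next((len(p) for p in patterns if window[:len(p)] == p), 1)
        i += n
        slots += 1
    return slots


code = ["iload_1", "iload_2", "iadd", "istore_3", "sipush", "istore_3"]
reduced_patterns, reduced_groups = reduce_scheme(PICOJAVA2_PATTERNS, PICOJAVA2_GROUPS)

print(issue_slots(code, PICOJAVA2_PATTERNS, PICOJAVA2_GROUPS))  # prints 2
print(issue_slots(code, reduced_patterns, reduced_groups))      # prints 4
```

For this six-instruction example the reduced scheme still folds the first three bytecodes, but it issues four instructions instead of two, which illustrates the trade of folding opportunities for simpler detection logic.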