Java Bytecode Folding Based on Behavioural Pattern

Table 7.3: Instruction Types Defined by the POC Folding Model [113]

P    An operation that pushes constant or loads local variable to operand stack.
OE   An operation that will be executed in execution units.
OB   An operation that conditionally branches or jumps to target address.
OC   An operation that will be executed in microcoded ROM or trapped as a sequence of instructions.
OT   An operation that will force the folding check to be terminated for the difficulty in performing folding.
C    An operation that pops the value from stack and stores it into local variable.

7.4 Java Bytecode Folding Based on Behavioural Pattern

7.4.1 The Producer-Operator-Consumer Folding Approach

As already mentioned, the second major approach to stack operation folding is based on the stack behaviour of bytecodes. Depending on its role in an examined bytecode sequence, an instruction is classified as a Producer, an Operator or a Consumer. Hence, such folding mechanisms are called POC-folding.

“The basic concept of the POC model is that it checks the instructions N and N + 1 to see whether they can be folded together (based on the instruction type, operand source, operand destination, data type and width). If they are foldable, the folded result instruction will become the new instruction N, and will be checked with the new following instruction N + 1, repetitively, until the end of folding.” [113]
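The repeated check of N against N + 1 can be illustrated with a minimal sketch. It assumes a reduced type table (only P, OE and C) and a hypothetical subset of folding patterns. The names `TYPES`, `PATTERNS`, `type_of` and `fold` are illustrative and do not come from [113], and the operand source, data type and width checks mentioned in the quote are left out.

```python
# Assumed, simplified type table: a subset of the classes in Table 7.3.
TYPES = {"iload": "P", "iconst": "P", "iadd": "OE", "isub": "OE", "istore": "C"}

# Hypothetical subset of complete folding patterns (not the published rule set).
PATTERNS = {
    ("P", "C"), ("P", "OE"), ("OE", "C"),
    ("P", "P", "OE"), ("P", "OE", "C"),
    ("P", "P", "OE", "C"),
}
# Every leading part of a pattern; decides whether N and N + 1 may still fold.
PREFIXES = {p[:k] for p in PATTERNS for k in range(1, len(p) + 1)}


def type_of(instr):
    """Map 'iload 4' or 'iload_2' to its instruction type."""
    return TYPES[instr.split()[0].split("_")[0]]


def fold(trace):
    """Split a trace into folded groups by repeatedly checking N against N + 1."""
    groups, i = [], 0
    while i < len(trace):
        pattern, best, j = (type_of(trace[i]),), 1, i + 1
        while j < len(trace):
            candidate = pattern + (type_of(trace[j]),)
            if candidate not in PREFIXES:
                break                      # N + 1 cannot be folded: stop checking
            pattern, j = candidate, j + 1  # the folded result becomes the new N
            if pattern in PATTERNS:
                best = j - i               # remember the longest complete pattern
        groups.append(trace[i:i + best])
        i += best
    return groups


print(fold(["iload_1", "iload_2", "iadd", "istore_3", "iload 4", "istore 5"]))
```

In this sketch the first four instructions collapse into one group, which would correspond to a single register-style addition, and the remaining load/store pair folds into a move. Instructions that match no pattern are issued on their own.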

The first implementation of POC-folding [131] defined one producer type, one consumer type and three distinct operator types. These instruction types are combined into a total of ten folding patterns (five 2-foldable, four 3-foldable and one 4-foldable). This scheme folds 76% of all stack operations and speeds up application execution by a factor of 1.26.

An improved version of this first POC-folding [113] introduces a fourth type of operational instruction and raises the number of patterns to a total of 22. This large number of folding rules can be expressed by a state machine, which is shown in figure 7.6. The larger rule set comes with increased hardware requirements, but it also raises the share of folded stack operations to 84% and yields a speedup of 1.34. The six different instruction types and their meaning are shown in table 7.3.

Figure 7.5: Stack Operation Folding on Different Bytecode Groups
(a) Sequential Trace Execution: iload 4, iload 6, iadd, istore_3 (group A); iload_2, iload 7, iadd, istore 8 (group B)
(b) Interleaved Trace Execution: iload_2, iload 4, iload 6, iadd, istore_3, iload 7, iadd, istore 8 (groups A and B interleaved)

7.4.2 Enhancements of the POC Folding Model

Plenty of research has been done on improvements of the basic POC-folding model. The basic pattern matching algorithm excludes large parts of the bytecode and is furthermore limited to recognizing ideal folding situations. Thus, the share of foldable stack operations does not exceed 85% with the basic POC-folding scheme [113]. Figure 7.5 shows two sample bytecode sequences that have the same semantics. Nonetheless, the basic POC-folding model recognizes only the sequential one. The interleaved bytecode sequence is foldable as well, provided the detection logic is able to match it and issues the resulting native instructions in the correct order.
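That the two traces of figure 7.5 have the same semantics can be checked with a small stack-machine sketch. The instruction order below is an assumption (the figure read in program order), the start values in `START` are made up for illustration, and `parse` and `run` are hypothetical helpers. The 32-bit wrap-around of `iadd` is ignored.

```python
# Traces of figure 7.5, assumed to be listed here in program order.
SEQUENTIAL = ["iload 4", "iload 6", "iadd", "istore_3",
              "iload_2", "iload 7", "iadd", "istore 8"]
INTERLEAVED = ["iload_2", "iload 4", "iload 6", "iadd", "istore_3",
               "iload 7", "iadd", "istore 8"]

# Illustrative start values for the local variables that are read.
START = {2: 20, 4: 1, 6: 2, 7: 5}


def parse(instr):
    """Split 'iload 4' or 'iload_2' into (mnemonic, local variable index)."""
    parts = instr.replace("_", " ").split()
    return parts[0], int(parts[1]) if len(parts) > 1 else None


def run(trace, local_vars):
    """Execute a trace on an operand stack and return the final local variables."""
    local_vars, stack = dict(local_vars), []
    for instr in trace:
        mnemonic, index = parse(instr)
        if mnemonic == "iload":
            stack.append(local_vars[index])
        elif mnemonic == "iadd":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)   # simplification: no 32-bit overflow
        elif mnemonic == "istore":
            local_vars[index] = stack.pop()
    return local_vars


print(run(SEQUENTIAL, START) == run(INTERLEAVED, START))
```

In the interleaved trace the value loaded by `iload_2` simply stays on the operand stack while group A executes, which is why both orders leave the same values in local variables 3 and 8.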

Minimizing the Number of Stack Operations

Improvement [129] and iteration [130, 132] of the basic POC folding model have led to an enhanced folding mechanism called EPOC-folding.


Figure 7.6: Folding Rules for the POC Folding Model [129]
(state diagram; states: Start Folding Rule Check, State_P, State_O, State_C, End Folding Rule Check)

“The main improvement of the EPOC model over the POC model is the capability of folding the discontinuous Ps. As shown [...], the P Counting state will record how much Ps are there before the O or C type instructions. If there is no O or C type instruction in the instruction buffer, the Ps will be issued sequentially to the execution unit like the POC does. The C Counting state will check whether the preceding state is OE state or P Counting state. If the preceding state is OE, C Counting state will fold the Cs into OE. If it is P Counting state, then Ps are folded into Cs according to the number of Cs. Otherwise, if the C type instruction is the first instruction in instruction buffer, the EPOC issues the C sequentially.” [129]
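The counting rules in this quote can be sketched in a strongly simplified form. The sketch assumes that only P, OE and C types occur, that every OE is a binary operation, that at most one C is folded into an OE, and that Ps not consumed by an operation remain recorded for a later one. The function `epoc_fold` and its data structures are hypothetical and are not taken from [129]. The sketch also ignores the hazard of a store overwriting a local variable that a deferred P still has to read.

```python
# Assumed, simplified type table: a subset of the classes in Table 7.3.
TYPES = {"iload": "P", "iconst": "P", "iadd": "OE", "isub": "OE", "istore": "C"}
OPERANDS = 2  # assumption: every OE in this sketch is a binary operation


def type_of(instr):
    """Map 'iload 4' or 'iload_2' to its instruction type."""
    return TYPES[instr.split()[0].split("_")[0]]


def epoc_fold(buffer):
    """Group an instruction buffer along the quoted P Counting / C Counting rules."""
    issued = []        # folded groups in issue order
    pending = []       # Ps recorded by the P Counting state
    after_oe = False   # True while the preceding state is OE
    for instr in buffer:
        kind = type_of(instr)
        if kind == "P":
            pending.append(instr)
            after_oe = False
        elif kind == "OE":
            # Fold the most recent Ps into the operation; earlier Ps stay recorded.
            operands = pending[-OPERANDS:]
            del pending[-OPERANDS:]
            issued.append(operands + [instr])
            after_oe = True
        else:  # kind == "C"
            if after_oe:
                issued[-1].append(instr)               # fold the C into the OE
            elif pending:
                issued.append([pending.pop(), instr])  # fold one P into the C
            else:
                issued.append([instr])                 # C issued on its own
            after_oe = False
    # Ps without a matching O or C are issued one by one, as in POC.
    issued.extend([p] for p in pending)
    return issued


# Interleaved trace of figure 7.5, assumed to be listed in program order.
INTERLEAVED = ["iload_2", "iload 4", "iload 6", "iadd", "istore_3",
               "iload 7", "iadd", "istore 8"]
print(epoc_fold(INTERLEAVED))
```

Under these assumptions the interleaved trace yields two folded groups, because the discontinuous `iload_2` is kept until the second `iadd` consumes it.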

The improved pattern matching unit of EPOC-folding is shown in figure 7.7. The EPOC-folding approach eliminates up to 99% of all stack operations and provides an improved issuing rate of almost two bytecodes per cycle [129]. Nonetheless, this does not translate into a speedup of two, because non-foldable and unfolded instructions still remain.

Figure 7.7: Folding Rules for the EPOC Folding Model [129]
(state diagram; states: Start EPOC Folding Rules Check, P Counting, O, C Counting, Issue FBI, End EPOC Folding Rules Check)

Folding of Interleaved Bytecode Sequences

A different direction of research focusses on interleaved bytecode patterns [119, 120, 121] (an example is shown in figure 7.5b), as . . .

“ . . . it would be an ideal situation if a foldable instruction sequence is followed by another foldable sequence. However, many times foldable instructions are separated by other bytecode instructions. Five distinct types of relationships between a foldable instruction group and its adjacent instructions have been found.” [121]

Adding four interleaved instruction patterns to the existing POC model significantly improves the folding performance, as ≈ 95% of all stack operations can be eliminated [119]. Its application to the EPOC folding model leads to an overall application speedup of 1.74. Despite this runtime acceleration, bytecode reordering for the purpose of instruction folding has the drawback of increased hardware complexity. An implementation of the proposed reordering algorithm works on a large 73-byte instruction reorder buffer [132], while the speedup over normal POC folding is just 1.2.

In document Performance Improvement of Adaptive Processors: Hardware Synthesis, Instruction Folding and Microcode Assembly (Page 125-129)