3. Experimental Framework

3.3 Translation Optimization Layer

The Translation Optimization Layer (TOL) is the software layer that executes on top of the host RISC processor. It is responsible for translating the guest x86 code to the host ISA, and it does so in three execution modes: 1) interpretation mode (IM), 2) basic block translation mode (BBM), and 3) superblock translation and optimization mode (SBM).

TOL starts by interpreting the guest x86 instruction stream in IM. When a basic block has been executed more than a predetermined number of times, TOL switches to BBM. In this mode, the whole basic block is translated and stored in the code cache, and all subsequent executions of this basic block are served from the code cache. In addition, branch profiling information on the direction and target of branches is collected. Once the execution count of a basic block exceeds a second predetermined threshold, TOL creates a bigger optimization region, called a superblock, using the branch profiling information collected during BBM. The superblock goes through several optimizations and is stored in the code cache. A high-level view of the execution flow of TOL is shown in Figure 3.3.
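The mode transitions described above can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the threshold values, the `ModeSelector` class, and its method name are hypothetical and do not come from DARCO.

```python
from collections import defaultdict

# Hypothetical thresholds; the text only says they are "predetermined".
BB_THRESHOLD = 3  # executions after which a block is translated (IM -> BBM)
SB_THRESHOLD = 6  # executions after which a superblock is built (BBM -> SBM)


class ModeSelector:
    """Tracks per-basic-block execution counts and picks the execution mode."""

    def __init__(self):
        self.exec_count = defaultdict(int)

    def next_mode(self, block_addr):
        """Record one execution of the block and return the mode used for it."""
        self.exec_count[block_addr] += 1
        count = self.exec_count[block_addr]
        if count > SB_THRESHOLD:
            return "SBM"
        if count > BB_THRESHOLD:
            return "BBM"
        return "IM"


selector = ModeSelector()
modes = [selector.next_mode(0x400000) for _ in range(8)]
print(modes)
```

With these assumed thresholds, a block that keeps executing moves from IM to BBM and then to SBM, while a block that runs only a few times never leaves IM.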

3.3.1 Interpretation

TOL begins the execution of the application in IM. In this mode, x86 instructions are interpreted one by one and the x86 state is updated accordingly. IM guarantees forward progress of the application and also serves as a safety net when instructions cannot be included in basic block translations or superblocks. Interpretation is also needed to make forward progress after speculation failures in superblocks caused by aggressive optimizations.

There is one caveat concerning the interpretation method employed in DARCO. Because building an interpreter is complex and time consuming, we decided to use the translator provided by QEMU, modified to translate one instruction at a time instead of one basic block at a time. Since QEMU’s translator was designed for portability (it supports translation between various guest and host ISAs), using it to translate just one instruction introduces high overhead. To offset the high cost of this interpretation method, an interpretation cache is used to store the interpretations. The interpretation cache is a typical code cache as used in HW/SW co-designed processors; the only difference is that, instead of storing whole basic block translations or superblocks, it stores translations of individual instructions. Once the translation of an x86 instruction has been stored in the interpretation cache, its subsequent executions are served from this cache. This modification significantly reduced the cost of interpreting an x86 instruction. Also note that no chaining is done between interpretations.
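A minimal sketch of the interpretation cache idea follows. It assumes a dictionary keyed by the guest instruction address; `InterpretationCache` and `toy_translate` are hypothetical names, and `toy_translate` merely stands in for the modified QEMU translator.

```python
class InterpretationCache:
    """Maps a guest instruction address to its cached one-instruction translation."""

    def __init__(self, translate_one):
        self._translate_one = translate_one  # expensive per-instruction translator
        self._cache = {}
        self.misses = 0

    def execute(self, eip, state):
        """Run the instruction at eip, translating it only on the first visit."""
        handler = self._cache.get(eip)
        if handler is None:
            self.misses += 1
            handler = self._translate_one(eip)
            self._cache[eip] = handler
        # No chaining: control returns to the caller after every instruction.
        return handler(state)


def toy_translate(eip):
    """Hypothetical stand-in for the translator: every instruction increments eax."""
    def handler(state):
        state["eax"] += 1
        return eip + 1  # address of the next instruction in this toy model
    return handler


cache = InterpretationCache(toy_translate)
state = {"eax": 0}
for _ in range(5):
    cache.execute(0x1000, state)
print(cache.misses, state["eax"])
```

Only the first execution of an address pays the translation cost; the remaining executions reuse the stored handler.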

3.3.2 Basic Block Translation

During IM, profiling information is collected for execution frequency of the basic blocks using software repetition counters. When the repetition counter of a basic block reaches the BB_translation_threshold, TOL switches to BBM in order to translate the corresponding basic block.

Figure 3.3: Translation Optimization Layer (TOL) execution flow. The left path is followed in IM, the middle in BBM and the right in SBM.

[Flowchart not reproduced. Node labels: x86 eip; In Code $?; In Intr $?; Interpret; Store in Intr $; Execute from Intr $; > BB_th?; BB translate; Store in Code $; Chain; Execute from Code $; Create SB; Optimize SB.]

Note that since we use a modified version of the QEMU translator and code generator, we also inherit some of the nomenclature. The intermediate representation of the instructions in DARCO is called qOps.

Figure 3.4 shows an abstract version of a typical translation of an x86 basic block. The original code is translated into an equivalent set of qOps. TOL translates all x86 memory operations in a special way: we introduced new qOps and host instructions for all loads and stores so that, during execution, a memory access made by the application can be distinguished from one made by TOL. There are two reasons for doing this. The first concerns functionality: the PPC component needs to know when the x86 memory space is accessed and, in the uncommon case that the data page has not been communicated before, requests the page from the controller, as explained before. The second concerns evaluation, since we would like to know the performance characteristics of each translation.

At the end of the translation, a branch instruction and two exit stubs are inserted. The outcome of the branch instruction decides which of the two exit stubs to execute. Each exit stub consists of an empty position where the chaining will be patched later during the execution, an update of the program counter and a branch to TOL where the basic block starting at the new program counter will be interpreted or translated. When the chain position is patched, the execution will not return to TOL, but instead the next basic block will be executed directly from the code cache. Finally, a new PPC instruction, eob_x86, is introduced. The purpose of this instruction is strictly for synchronization. In terms of timing, this instruction has no effect.
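The exit-stub and chaining mechanism can be illustrated with a toy model. This sketch assumes that a chain position is a reference that is empty until patched; the classes `ExitStub` and `TranslatedBlock` and the function `execute` are hypothetical, and the block bodies are omitted.

```python
class ExitStub:
    """One exit of a translated block: a chain slot plus the next guest eip."""

    def __init__(self, target_eip):
        self.target_eip = target_eip
        self.chained_block = None  # empty chain position, patched later


class TranslatedBlock:
    def __init__(self, taken_eip, not_taken_eip, branch_taken):
        self.taken = ExitStub(taken_eip)
        self.not_taken = ExitStub(not_taken_eip)
        self.branch_taken = branch_taken  # callable(state) -> bool

    def run(self, state):
        """Execute the block body (omitted) and return the exit stub reached."""
        return self.taken if self.branch_taken(state) else self.not_taken


def execute(code_cache, eip, state, max_blocks):
    """Run up to max_blocks blocks; return how often control went back to TOL."""
    tol_entries = 0
    block = None
    for _ in range(max_blocks):
        if block is None:
            tol_entries += 1         # TOL looks the block up by eip
            block = code_cache[eip]  # a real TOL would interpret or translate on a miss
        stub = block.run(state)
        if stub.chained_block is None:
            eip = stub.target_eip    # "update eip", then "branch to TOL"
            stub.chained_block = code_cache.get(eip)  # patch if the target is translated
            block = None
        else:
            block = stub.chained_block  # chained: stay inside the code cache
    return tol_entries


always = lambda state: True
code_cache = {0x10: TranslatedBlock(0x20, 0x14, always),
              0x20: TranslatedBlock(0x10, 0x24, always)}
tol_entries = execute(code_cache, 0x10, {}, max_blocks=6)
print(tol_entries)
```

In this two-block loop, control returns to TOL only until both taken stubs have been patched; afterwards the blocks jump to each other directly.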

x86 Basic Block:
    mov eax, [ebx]
    jne label1

qOps for x86 BB:
    q_ld_x86 env->eax, [env->ebx]
    bne taken_exit_stub
    Taken exit stub and not taken exit stub (each): eob_x86; chain position; update eip; branch to TOL

Host code for x86 BB:
    ld_x86 r19, [r20]
    Jne taken_exit_stub
    Taken exit stub and not taken exit stub (each): eob_x86; chain position; update eip; branch to TOL

Figure 3.4: Abstract translation of an x86 basic block to host ISA. The eob_x86 instruction is used by DARCO for execution synchronization and special ld_x86 instructions to point out accesses to x86 memory space.

The qOps are forwarded to the code generator. There, they undergo some basic optimizations like dead code elimination and constant propagation, which contribute towards reducing the number of generated instructions. Finally, the qOps are translated to PPC instructions and stored in the code cache from where they are dispatched for execution.

3.3.3 Superblocks and Optimizations

During basic block translation mode (BBM), profiling information is gathered for all the basic blocks in BBM using software counters. This information consists of execution and edge counters. The execution counter provides the execution frequency of a basic block, while the edge counters record how often each branch direction is followed and therefore which direction is biased. Once the execution count of a basic block exceeds another predetermined threshold, TOL creates a bigger optimization region, called a superblock, using the branch profiling information collected during BBM.

In Superblock translation and optimization mode (SBM), TOL generates a new superblock starting from the triggering basic block. A superblock generally includes multiple basic blocks following the biased direction of branches. A superblock ends at one of the following conditions:

1) The last basic block included in the superblock ends with an indirect branch, call, or return instruction.

2) The last basic block included in the superblock ends with an unbiased branch or the probability of reaching the last basic block from the beginning of the superblock falls below a predetermined threshold.

3) The number of instructions in the superblock exceeds a predetermined threshold.

4) The number of basic blocks included in the superblock exceeds a predetermined threshold.
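The four termination conditions above can be sketched as a walk along the biased branch directions. This is a simplified illustration: all threshold values, the `BasicBlock` fields, and `form_superblock` are assumptions made for the example, and a real implementation may check the size limits before adding a block rather than after.

```python
from dataclasses import dataclass

# Hypothetical limits; the text only calls them "predetermined".
BIAS_THRESHOLD = 0.9   # below this a branch counts as unbiased
REACH_THRESHOLD = 0.5  # minimum probability of reaching the end of the superblock
MAX_INSNS = 200
MAX_BLOCKS = 8


@dataclass
class BasicBlock:
    n_insns: int
    end_kind: str            # "cond", "indirect", "call", or "ret"
    taken_target: int = 0
    fall_target: int = 0
    taken_count: int = 0     # edge counters collected in BBM
    not_taken_count: int = 0


def form_superblock(blocks, start):
    """Return the addresses of the basic blocks placed in the superblock."""
    path, insns, reach = [], 0, 1.0
    addr = start
    while addr in blocks:
        bb = blocks[addr]
        path.append(addr)
        insns += bb.n_insns
        if bb.end_kind != "cond":                # condition 1
            break
        total = bb.taken_count + bb.not_taken_count
        if total == 0:
            break                                # no profile, treat as unbiased
        p_taken = bb.taken_count / total
        bias = max(p_taken, 1.0 - p_taken)
        reach *= bias
        if bias < BIAS_THRESHOLD or reach < REACH_THRESHOLD:  # condition 2
            break
        if insns > MAX_INSNS or len(path) > MAX_BLOCKS:       # conditions 3 and 4
            break
        addr = bb.taken_target if p_taken >= 0.5 else bb.fall_target
    return path


blocks = {
    0x10: BasicBlock(5, "cond", 0x20, 0x30, taken_count=95, not_taken_count=5),
    0x20: BasicBlock(4, "cond", 0x40, 0x50, taken_count=2, not_taken_count=98),
    0x50: BasicBlock(3, "ret"),
}
superblock = form_superblock(blocks, 0x10)
print([hex(a) for a in superblock])
```

Here the walk follows the taken edge of the first block, the fall-through edge of the second, and stops at the block ending in a return.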

Moreover, the branches inside a superblock are converted to “asserts” so that the superblock can be treated as a single-entry, single-exit sequence of instructions. This gives the freedom to reorder and optimize instructions across multiple basic blocks. Asserts are similar to branches in that both check a condition. A branch determines the next instruction to be executed based on the condition; an assert has no such effect. If the condition is true, the assert does nothing. If the condition evaluates to false, the assert “fails” and execution is restarted in IM from a previously saved checkpoint. Furthermore, if the number of assert failures in a superblock exceeds a predetermined limit, the superblock is recreated without converting branches to asserts. As a result, the recreated superblock has to be treated as a single-entry, multiple-exit sequence of instructions. Having multiple exits also reduces the available optimization opportunities, because instructions across different exit paths cannot be reordered as freely as before.
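The assert semantics can be modelled in software as follows. This is a sketch only: the checkpoint is modelled as a copy of a state dictionary, the failure limit is a made-up value, and `Superblock`, `AssertFailed`, and `run_superblock` are hypothetical names.

```python
import copy

ASSERT_FAILURE_LIMIT = 2  # hypothetical; the text only says "predetermined limit"


class AssertFailed(Exception):
    """Raised by superblock code when an assert condition evaluates to false."""


class Superblock:
    def __init__(self, body):
        self.body = body          # optimized code, callable(state)
        self.failures = 0
        self.use_asserts = True   # False means: recreate with multiple exits


def run_superblock(sb, state, interpret):
    """Run the superblock; on assert failure roll back and interpret instead."""
    checkpoint = copy.deepcopy(state)
    try:
        sb.body(state)
        return "committed"
    except AssertFailed:
        state.clear()
        state.update(checkpoint)  # discard all speculative updates
        interpret(state)          # forward progress in IM
        sb.failures += 1
        if sb.failures > ASSERT_FAILURE_LIMIT:
            sb.use_asserts = False
        return "rolled back"


def body(state):
    state["eax"] += 10            # speculative work done before the assert
    if state["ebx"] != 0:         # assert: the biased branch direction was followed
        raise AssertFailed()


def interpret(state):
    state["eax"] += 1             # toy stand-in for the non-speculative path


sb = Superblock(body)
state = {"eax": 0, "ebx": 1}
results = [run_superblock(sb, state, interpret) for _ in range(3)]
print(results, state["eax"], sb.use_asserts)
```

Every run in this example fails its assert, so the speculative update to eax is discarded each time and the third failure marks the superblock for recreation with multiple exits.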

Furthermore, if a loop is detected while creating a superblock, it is unrolled. Currently, we unroll only loops consisting of a single basic block, as they are the ones that provide maximum benefit [77]. To detect and unroll loops without control flow, the following steps are followed.

1) The target address of the first branch instruction in the superblock is compared against the address of the first instruction of the superblock. In case of a loop, the addresses will match.

2) The execution and edge counters are used to determine the loop trip count.

3) The loop unroll factor is determined from the data types in the loop, the SIMD accelerator width, and the loop trip count determined in the previous step. For example, if a loop contains only single-precision floating-point data types, then for a 128-bit wide SIMD accelerator the loop is unrolled 4 times if the loop trip count is greater than or equal to 4.

Moreover, the unrolled version of the loop is followed by the original loop (without unrolling). During execution, a runtime check determines whether to execute the unrolled version or the original loop. If the number of iterations left is less than the loop unroll factor, the original loop is executed instead of the unrolled loop.
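The unroll-factor choice and the runtime check can be sketched as below. The single-precision case (32-bit elements, 128-bit SIMD, factor 4) follows the example in the text; generalizing it to "SIMD width divided by element width" and estimating the trip count as executions per loop exit are assumptions of this sketch, as are all function names.

```python
def estimate_trip_count(exec_count, exit_count):
    """Average iterations per loop entry (assumed formula, not from the source)."""
    return exec_count // max(exit_count, 1)


def unroll_factor(element_bits, simd_bits, trip_count):
    """Unroll by the number of SIMD lanes if the loop runs at least that often."""
    lanes = simd_bits // element_bits
    return lanes if trip_count >= lanes else 1


def run_loop(n_iters, factor, body):
    """Run the unrolled loop while enough iterations remain, then the original loop."""
    i = 0
    unrolled_passes = 0
    while factor > 1 and n_iters - i >= factor:  # runtime check
        for k in range(factor):                  # stands for `factor` body copies
            body(i + k)
        i += factor
        unrolled_passes += 1
    while i < n_iters:                           # original loop for the remainder
        body(i)
        i += 1
    return unrolled_passes


trip = estimate_trip_count(exec_count=1000, exit_count=100)
factor = unroll_factor(element_bits=32, simd_bits=128, trip_count=trip)
visited = []
passes = run_loop(10, factor, visited.append)
print(trip, factor, passes, visited)
```

For ten iterations and a factor of 4, the unrolled version runs twice and the original loop handles the last two iterations.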

The optimizer applies several transformations to the superblock. Figure 3.5 shows the optimizations performed by the optimizer. First, the qOps are transformed into Static Single Assignment (SSA) form. This transformation removes anti and output dependences and significantly reduces the complexity of subsequent optimizations. Second, a forward pass applies a set of conventional single-pass optimizations: constant folding, constant propagation, copy propagation, and common subexpression elimination. Third, a backward pass applies dead code elimination.
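A reduced version of the forward and backward passes is sketched below on a toy IR. It assumes the operations are already in SSA form and are represented as `(dest, op, a, b)` tuples with only `const` and `add` operations; this IR is invented for the example and is not the qOps format. Copy propagation, common subexpression elimination, and operations with side effects are left out.

```python
def forward_pass(ops):
    """Constant propagation and constant folding in a single forward pass."""
    consts = {}
    out = []
    for dest, op, a, b in ops:
        a = consts.get(a, a)  # replace operands whose value is a known constant
        b = consts.get(b, b)
        if op == "const":
            consts[dest] = a
        elif op == "add" and isinstance(a, int) and isinstance(b, int):
            consts[dest] = a + b              # fold the addition
            op, a, b = "const", a + b, None
        out.append((dest, op, a, b))
    return out


def backward_pass(ops, live_out):
    """Dead code elimination: keep only ops whose result is (transitively) used."""
    live = set(live_out)
    kept = []
    for dest, op, a, b in reversed(ops):
        if dest in live:
            kept.append((dest, op, a, b))
            live.discard(dest)
            live.update(x for x in (a, b) if isinstance(x, str))
    kept.reverse()
    return kept


ops = [
    ("t1", "const", 4, None),
    ("t2", "const", 8, None),
    ("t3", "add", "t1", "t2"),
    ("t4", "add", "t3", "r_in"),
    ("t5", "add", "t1", "r_in"),  # result never used
]
optimized = backward_pass(forward_pass(ops), live_out={"t4"})
print(optimized)
```

The forward pass turns `t3` into a constant and substitutes it into `t4`; the backward pass then drops every operation that no longer feeds the live-out value.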

After the basic optimizations, the Data Dependence Graph (DDG) is prepared. To create the DDG, the input and output registers of the instructions are inspected and the corresponding dependences are added. During DDG creation, we perform memory disambiguation analysis. If the analysis cannot prove that a pair of memory operations never aliases or always aliases, the pair is marked as “may alias”. In case of reordering, the original memory instructions are converted to speculative memory operations. Apart from this, Redundant Load Elimination and Store Forwarding are also applied during the DDG phase so that redundant memory operations are removed. The DDG is then fed to the instruction scheduler, which uses a conventional list scheduling algorithm. Afterwards, the resulting schedule is used by the register allocator, which implements the linear scan register allocation algorithm. Finally, the qOps are translated to PPC instructions and the code is stored in the code cache. The previous entry in the code cache that corresponds to the first basic block of the current superblock is invalidated and freed for use by subsequent translations.
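As an illustration of the scheduling step, the following is a textbook-style list scheduler over a small DDG. The priority function (longest latency path to the end of the graph), the issue width, and the latencies are assumptions chosen for the example; the text does not state which heuristics DARCO uses.

```python
def list_schedule(latency, deps, width=2):
    """Cycle-by-cycle list scheduling.

    latency: node -> cycles until its result is available
    deps:    node -> set of predecessor nodes in the DDG (must be acyclic)
    Returns node -> issue cycle.
    """
    succs = {n: [] for n in latency}
    for node, preds in deps.items():
        for p in preds:
            succs[p].append(node)

    prio = {}

    def height(n):
        # Priority: longest latency path from n to the end of the DDG.
        if n not in prio:
            prio[n] = latency[n] + max((height(s) for s in succs[n]), default=0)
        return prio[n]

    for n in latency:
        height(n)

    done_at = {}  # node -> cycle in which its result becomes available
    schedule = {}
    remaining = set(latency)
    cycle = 0
    while remaining:
        ready = [n for n in remaining
                 if all(p in done_at and done_at[p] <= cycle
                        for p in deps.get(n, ()))]
        ready.sort(key=lambda n: (-prio[n], n))  # highest priority first
        for n in ready[:width]:
            schedule[n] = cycle
            done_at[n] = cycle + latency[n]
            remaining.discard(n)
        cycle += 1
    return schedule


latency = {"ld1": 3, "ld2": 3, "add": 1, "st": 1, "mov": 1}
deps = {"add": {"ld1", "ld2"}, "st": {"add"}}
schedule = list_schedule(latency, deps, width=2)
print(schedule)
```

The two loads sit on the critical path and are issued first; the independent `mov` fills a cycle in which the `add` is still waiting for the loads.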

In document Optimizing SIMD execution in HW/SW co-designed processors (Page 55-60)