MULTIPLE INSTRUCTION FETCHING - Design and evaluation of a VLIW processor for real-time systems

The pipelines described in the previous section reach a theoretical speed-up factor of k, where k is the number of stages, and their maximum throughput approaches one instruction per cycle (IPC). Instead of dividing the work into k simpler stages, the same performance can be reached by replicating the work in k copies, which allows a different type of instruction-level parallelism (ILP). Standard pipelines provide temporal parallelism, while executing multiple instructions at once provides spatial parallelism. If temporal and spatial parallelism are combined, processors can reach more than one instruction per cycle. Pipelining combined with wider instruction fetch constitutes superscalar and VLIW (Very Long Instruction Word) machines. Both types of machine provide higher ILP, and the basic difference between them lies in instruction scheduling: in superscalar machines the hardware is responsible for instruction scheduling, while in VLIW machines instructions are statically scheduled by the compiler.
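The speed-up and IPC figures above can be checked with a short calculation. The following is a minimal sketch that assumes an ideal pipeline (equal stage delays, no stalls, at least one instruction); the function names and the `width` parameter are illustrative and do not come from the thesis.

```python
def unpipelined_cycles(n: int, k: int) -> int:
    """Cycles for n instructions when each one occupies all k stage-times alone."""
    return n * k


def pipelined_cycles(n: int, k: int) -> int:
    """Cycles for n >= 1 instructions on an ideal k-stage pipeline.

    The first instruction needs k cycles to traverse the pipeline; each
    following instruction completes one cycle after its predecessor.
    """
    return k + (n - 1)


def speedup(n: int, k: int) -> float:
    """Speed-up of the ideal pipeline over unpipelined execution."""
    return unpipelined_cycles(n, k) / pipelined_cycles(n, k)


def ipc(n: int, k: int, width: int = 1) -> float:
    """Ideal instructions per cycle when `width` instructions enter per cycle."""
    groups = -(-n // width)  # ceiling division: number of issue groups
    return n / pipelined_cycles(groups, k)
```

Under these assumptions, `speedup(1000, 5)` is 5000/1004, which is just below the theoretical factor k = 5. `ipc(1000, 5)` stays below 1, while `ipc(1000, 5, width=2)` exceeds 1, which is the effect of combining temporal and spatial parallelism.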

Figure 5 shows a dual-issue pipeline where the execution stage E is composed of different types of execution units. This machine can, for instance, execute one ALU instruction and one memory instruction at the same time. In this design, each functional unit can be customized to a particular instruction type, resulting in more efficient hardware. Each type of instruction has its own latency and uses all stages of its functional unit. If inter-instruction dependencies are resolved before instructions are forwarded to the functional units, there will be no pipeline stalls. This scheme allows distributed and independent control of each execution sub-pipeline. Replicated units such as ALUs can be added to this design, allowing two arithmetic instructions to execute at once.
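The dual-issue condition described here can be sketched as a small check. This is an illustration only: the `Instr` record, the unit names and the `can_issue_together` function are hypothetical, and the dependency test is limited to register read-after-write and write-after-write conflicts between the two instructions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    unit: str                    # functional unit class: "ALU", "MEM", "FP" or "BR"
    dst: Optional[str]           # destination register, if any
    srcs: Tuple[str, ...] = ()   # source registers


def can_issue_together(first: Instr, second: Instr, units: Dict[str, int]) -> bool:
    """Return True if both instructions may enter stage E in the same cycle.

    `units` maps each functional unit class to the number of copies available.
    """
    # Structural check: enough functional units of each required class.
    needed = Counter([first.unit, second.unit])
    if any(units.get(unit, 0) < count for unit, count in needed.items()):
        return False
    # Read-after-write: the second instruction reads what the first one writes.
    if first.dst is not None and first.dst in second.srcs:
        return False
    # Write-after-write: both instructions write the same register.
    if first.dst is not None and first.dst == second.dst:
        return False
    return True
```

With one unit of each class, an ALU instruction and an independent memory instruction pass the check, while two ALU instructions do not. Setting `units["ALU"]` to 2 models the replicated ALU mentioned above and lets two independent arithmetic instructions issue together.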

2.2.1 Out-of-order execution

If out-of-order execution is allowed in the execution stage of a diversified pipeline, the result is a superscalar machine with a dynamic pipeline. This is one of the most successful designs, and it is the type of machine used in general-purpose devices such as smartphones, desktop computers and laptops. Figure 6 shows an example of this design.

[Figure 5 diagram: stages F, D, E and WB, with E split into the sub-pipelines M1 M2, ALU, FP1 FP2 FP3 and Br1 Br2]

Figure 5 – Diversified pipeline example: Mx are memory sub-stages, FPx are floating-point stages and Brx are branch stages.

Dynamic pipelines have two special buffers (dispatch and reorder) and a hardware instruction scheduler. Instructions are fetched and decoded in a multi-issue fashion, and the hardware scheduler assigns each instruction to the correct execution sub-pipeline. Instruction dependencies are also checked: when some instructions would cause a stall, independent ones are issued into the execution stage ahead of them. Once all their dependencies are resolved, the stalled instructions waiting in the dispatch buffer are executed. At the end of the pipeline, a reorder buffer guarantees that instructions complete in the correct semantic (program) order.
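The interplay of the dispatch and reorder buffers can be sketched with a toy simulation. This is a simplified model under stated assumptions, not a description of any real processor: issue width is unlimited, only true (read-after-write) dependencies are tracked as if registers were renamed, and the `Op` record and `simulate` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Op:
    name: str
    dst: str                 # destination register
    srcs: Tuple[str, ...]    # source registers
    latency: int             # cycles spent in the execution sub-pipeline


def simulate(program: List[Op]) -> Tuple[List[str], List[str]]:
    """Return (issue order, commit order) for a toy dynamic pipeline."""
    # For each op, find the earlier ops that produce its source registers.
    last_writer: Dict[str, int] = {}
    deps: List[List[int]] = []
    for i, op in enumerate(program):
        deps.append([last_writer[s] for s in op.srcs if s in last_writer])
        last_writer[op.dst] = i

    dispatch = list(range(len(program)))  # ops waiting in the dispatch buffer
    done_at: Dict[int, int] = {}          # reorder buffer: op -> completion cycle
    issue_order: List[str] = []
    commit_order: List[str] = []
    next_commit = 0
    cycle = 0
    while next_commit < len(program):
        # Out-of-order issue: any waiting op whose producers have finished.
        for i in list(dispatch):
            if all(p in done_at and done_at[p] <= cycle for p in deps[i]):
                done_at[i] = cycle + program[i].latency
                dispatch.remove(i)
                issue_order.append(program[i].name)
        # In-order commit: retire finished ops strictly in program order.
        while next_commit < len(program) and done_at.get(next_commit, cycle + 1) <= cycle:
            commit_order.append(program[next_commit].name)
            next_commit += 1
        cycle += 1
    return issue_order, commit_order
```

For a three-instruction program in which `add` depends on a three-cycle `load` and `mul` is independent, the sketch issues `mul` before `add` but still commits in the order `load`, `add`, `mul`:

```python
program = [
    Op("load", "r1", ("r0",), 3),
    Op("add", "r2", ("r1", "r0"), 1),
    Op("mul", "r3", ("r4", "r5"), 2),
]
issued, committed = simulate(program)
# issued    == ["load", "mul", "add"]
# committed == ["load", "add", "mul"]
```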

Superscalar processors are complex machines intended to extract maximum performance while hiding latency details from the compilation phase. This abstraction also provides better code compatibility, allowing different processor implementations and realizations to be binary compatible. Complex hardware is necessary to provide out-of-order execution; more details of this design are presented in Shen et al. (2005). However, out-of-order execution leads to timing anomalies, and the high hardware complexity leads to unbounded states during WCET analysis.

[Figure 6 diagram: in-order stage D, dispatch buffer, out-of-order execution stage E with the sub-pipelines M1 M2, ALU, FP1 FP2 FP3 and Br1 Br2, reorder buffer, in-order stage WB]

Figure 6 – Dynamic pipeline example: Mx are memory sub-stages, FPx are floating-point stages and Brx are branch stages.

2.2.2 In-order execution

Multiple instruction fetching with in-order execution and without hardware instruction scheduling leads to a VLIW (Very Long Instruction Word) design. The VLIW philosophy exposes the hardware not only at the ISA level but also at the level of instruction-level parallelism (ILP). Superscalar machines extract ILP using hardware features that rearrange operations in ways not directly specified in the code. VLIW processors do not perform any instruction scheduling, so ILP must be provided by the compiler. Instruction rearranging (scheduling) is performed offline during compilation, and the processor executes instructions exactly as defined by the compiler. The main idea is not to let the hardware do things that cannot be seen at the programming level, in order to reduce processor complexity, to have only simple instructions and to increase predictability when needed. Table 6 shows the basic differences between superscalar and VLIW designs.

Table 6 – Basic differences between VLIW and superscalar designs (FISHER; FARABOSCHI; YOUNG, 2005).

| | Superscalar | VLIW |
|---|---|---|
| Instruction flow | Multiple scalar operations are fetched | Instructions are fetched sequentially but with multiple operations |
| Scheduling | Hardware dynamic scheduler | Compiler static scheduler |
| Fetch width | Hardware dynamically determines the number of fetched instructions | Compiler determines statically the number of fetched instructions |
| Instruction ordering | Dynamic fetching allowing out-of-order and in-order execution | Static fetching with only in-order execution |
| Implications | Superscalar design is related to microarchitecture techniques | VLIW is an architectural technique. Hardware details are exposed to the compiler |
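The compile-time scheduling that a VLIW compiler performs can be illustrated with a greedy packing of operations into long instructions (bundles). This is a minimal sketch, not the scheduler of any particular compiler: the `Op` record, the `slots` description and the `schedule` function are hypothetical, only read-after-write dependencies are honoured, and each register is assumed to be written once.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Op:
    name: str
    unit: str                    # functional unit class, e.g. "ALU" or "MEM"
    dst: str                     # destination register (assumed written once)
    srcs: Tuple[str, ...] = ()   # source registers
    latency: int = 1             # cycles until the result can be read


def schedule(ops: List[Op], slots: Dict[str, int]) -> List[List[str]]:
    """Pack ops into bundles; bundle c is the long instruction issued at cycle c.

    `slots` maps each unit class to the number of operations of that class
    one bundle can hold. Bundles left empty are emitted as explicit nops.
    """
    last_writer: Dict[str, int] = {}
    issue: List[int] = []              # issue cycle chosen for each op
    bundles: List[List[str]] = []
    used: List[Dict[str, int]] = []    # used[c][unit] = slots taken in bundle c
    for i, op in enumerate(ops):
        if slots.get(op.unit, 0) < 1:
            raise ValueError(f"no slot for unit class {op.unit!r}")
        producers = [last_writer[s] for s in op.srcs if s in last_writer]
        # Earliest cycle at which every source operand has been produced.
        cycle = max((issue[p] + ops[p].latency for p in producers), default=0)
        while True:
            while len(bundles) <= cycle:
                bundles.append([])
                used.append({})
            if used[cycle].get(op.unit, 0) < slots[op.unit]:
                break
            cycle += 1  # slot taken: try the next bundle
        bundles[cycle].append(op.name)
        used[cycle][op.unit] = used[cycle].get(op.unit, 0) + 1
        issue.append(cycle)
        last_writer[op.dst] = i
    return [bundle if bundle else ["nop"] for bundle in bundles]
```

Because the processor executes the bundles exactly as emitted, latencies that the hardware would otherwise hide appear as explicit nop bundles in the generated code. For example, an `add` that consumes the result of a three-cycle load is placed three bundles after it, with two nop bundles in between.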

VLIW designs are not commonly used in general-purpose computing, but they are very popular in embedded applications. Several DSPs (Digital Signal Processors) and embedded cores, such as Texas Instruments C6x, Agere/Motorola StarCore, Sun's MAJC, Fujitsu's FR-V, ST's HP/ST Lx (ST200), Philips' Trimedia, Silicon Hive Avispa, Tensilica Xtensa Lx and Analog Devices TigerSHARC, are VLIW processors. Intel's Itanium architecture for servers also uses this design philosophy.

VLIW compilers are more sophisticated, since they must perform operations that are usually executed by the superscalar hardware; a superscalar control unit avoids the need for a sophisticated compiler. However, the fact that the hardware repeats all ILP extraction every time an instruction is fetched by the processor control unit suggests that this work could easily be done once, by the compiler. The code reordering performed around out-of-order execution is relatively simple, so compilers can handle it easily, and compilers can evolve faster than new processors can be built.

For real-time systems, VLIW is a very interesting solution: hardware modeling is considerably less complex, in-order execution avoids timing anomalies, and higher performance can still be achieved through multi-issue instruction fetching.
