Instruction level parallelism - Key algorithms for future 4G SDR systems


3.1.3 Instruction level parallelism

The term instruction level parallelism (ILP) is used for processor architectures that are able to issue multiple instructions per clock cycle. On a SIMD architecture, ILP support may hide the overhead for memory access and vector permutations. Furthermore, ILP can improve the resource utilization of the SIMD processing units.

In principle, architectures that support ILP can be classified into superscalar and long instruction word (LIW) or very long instruction word (VLIW) architectures, with explicitly parallel instruction computing (EPIC) and dynamic VLIW as hybrid types [Smo02]. Superscalar architectures perform all control tasks for issuing multiple instructions in parallel in hardware. These control tasks are the grouping of independent instructions, which can potentially execute in parallel without interfering with each other, the assignment of instructions to functional units (FUs), and the actual initiation of instructions. On a LIW or VLIW architecture, all these tasks have to be performed by the programmer or, if available, the compiler. The difference between LIW and superscalar architectures is illustrated in figure 3.2. LIW and VLIW architectures differ only in the number of instructions issued in parallel, with no clearly defined boundary between the two terms; the term VLIW was introduced by Fisher in 1983 [Fis83].

[Figure 3.2 (image not reproduced). Columns: superscalar and LIW. Stages shown: code generation, instruction grouping, assignment to FUs, initiation timing, execution. Each stage is attributed either to the compiler/programmer or to the hardware.]

Figure 3.2: Visualization of ILP architectures based on [Smo02]. Superscalar architectures perform all control tasks for instruction parallelization in hardware, while LIW architectures require these tasks to be done by the programmer or compiler.

3.1 Development of the SIMD processor architecture based on algorithm requirements

Due to the hardware overhead of superscalar architectures, only LIW and VLIW architectures are of interest for modern signal processors. Compared to sequential processor architectures, LIW architectures require additional register file ports to support several functional units, as well as wider instructions that contain multiple slots with operations on different units. In a fixed-length LIW architecture (figure 3.3 on the left-hand side), slots simply occupy consecutive segments of the instruction word. Although the instruction decoding for such an architecture is of low complexity, the code size grows significantly if not all available slots can be filled with useful operations. If no useful operation can be scheduled in a slot, the slot has to be filled with a no-operation (nop). Variable-length LIW architectures avoid this issue by explicitly encoding the number of used slots in the instruction. The number of slots can be defined by a header or by differential encoding of slots (see figure 3.3 on the right-hand side). Differential encoding requires one additional stop bit for each slot except for the last one. The value of the stop bit defines whether another slot follows the current slot or the current slot is the last slot. The code word length can be further reduced by applying code compression techniques, for example based on Huffman codes [WC92, BNW98], Markov models [XWL02, XWL06], or lookup tables [RS03]. Due to the complexity of these code compression techniques, they have not been considered for the proposed scalable SIMD processor architecture.

[Figure 3.3 (image not reproduced). Left panel: fixed-length LIW with slots 1 to 4 at bit offsets 0, N, 2N, and 3N, 4N bits in total; worst case: only nops. Right panel: variable-length LIW with differential encoding, each slot extended by a stop bit, with slot boundaries at bit offsets 0, N+1, 2N+2, 3N+3, and 4N+4; worst case: only nops.]

Figure 3.3: Fixed-length and variable-length LIW encoding examples for a LIW architecture with four slots
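The code-size trade-off between the two encodings can be illustrated with a small calculation. The following is a minimal sketch, not the encoding tool used for the architecture: the function names, the example program, and the payload width N = 23 are hypothetical, and one stop bit is charged to every emitted slot, as the bit offsets N+1 to 4N+4 in figure 3.3 suggest.

```python
def fixed_length_bits(ops_per_instruction, slot_bits, max_slots=4):
    """Code size if every instruction carries max_slots slots (unused slots hold nops)."""
    return len(ops_per_instruction) * max_slots * slot_bits


def variable_length_bits(ops_per_instruction, slot_bits):
    """Code size if only used slots are emitted, each with one extra stop bit.

    An instruction without any useful operation still needs a single nop slot.
    """
    return sum(max(ops, 1) * (slot_bits + 1) for ops in ops_per_instruction)


# Hypothetical program: number of useful operations in each instruction.
program = [4, 1, 2, 0, 3, 1]
N = 23  # hypothetical payload width per slot in bits

fixed = fixed_length_bits(program, N)
variable = variable_length_bits(program, N)
print(f"fixed-length: {fixed} bits, variable-length: {variable} bits")
```

For this sparsely filled example program the variable-length encoding needs roughly half the bits; for code in which every instruction uses all four slots, the stop bits would make it slightly larger than the fixed-length encoding.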

As mentioned above, LIW architectures also require an increased number of register file ports to support multiple functional units in parallel. Both the area and the power consumption of a register file with p ports have an asymptotic complexity of $O(p^2)$ [RDK+00] (see section 3.1.4). Hence, the maximum number of LIW slots should be set to a moderate value to limit the number of ports.
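To give a feeling for the quadratic growth, the following sketch compares register file cost for different slot counts. It assumes a purely quadratic cost model without lower-order terms and, as a further assumption that is not taken from the text, two read ports and one write port per slot; the function name is hypothetical.

```python
PORTS_PER_SLOT = 3  # assumption: two read ports and one write port per slot


def relative_register_file_cost(ports, reference_ports):
    """Area or power relative to a reference register file, assuming cost ~ ports**2."""
    return (ports / reference_ports) ** 2


for slots in (1, 2, 4, 8):
    ports = slots * PORTS_PER_SLOT
    cost = relative_register_file_cost(ports, PORTS_PER_SLOT)
    print(f"{slots} slots -> {ports} ports -> {cost:.0f}x the single-slot cost")
```

Under these assumptions, doubling the number of slots quadruples the register file cost, which is why a moderate slot count is attractive.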

In order to select an appropriate number of instruction slots, the algorithm implementations on the EVP have been analyzed. The EVP supports up to six vector and four scalar operations in one VLIW instruction. Yet, for all considered algorithms, the average number of parallel operations per instruction is significantly smaller than this limit. The results of the analysis are listed in table 3.5; they refer to the best known implementations of these algorithms. For the radix-2 FFT, similar results have been obtained by other researchers [SM06].

Chapter 3 Scalable SIMD processor architecture

The average number of parallel operations per instruction N_{par. ∅} and the peak value N_{par. peak} have been measured for the inner loops of the algorithms. Furthermore, the resource utilization values of both SIMD arithmetic units (R_VALU and R_VMAC) have been calculated. Resource utilization describes the relative amount of time the unit has been active. The resource utilization indicates whether a speedup is possible: if any processing unit is utilized all the time, no speedup by LIW is possible without adding further processing units of the same type.
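How such metrics can be derived from a loop schedule is sketched below. This is an illustration under stated assumptions rather than the measurement procedure used on the EVP: one instruction is assumed to issue per cycle, the schedule is invented, and the unit names LOAD, STORE, and SHUFFLE as well as the function name are hypothetical.

```python
from collections import Counter


def ilp_metrics(schedule, units):
    """Return (average ops per instruction, peak ops, utilization per unit).

    schedule is a list of instructions; each instruction is the list of
    functional units that execute an operation in that instruction.
    """
    ops = [len(instruction) for instruction in schedule]
    busy = Counter(unit for instruction in schedule for unit in set(instruction))
    average = sum(ops) / len(schedule)
    utilization = {unit: busy[unit] / len(schedule) for unit in units}
    return average, max(ops), utilization


# Invented four-instruction inner loop.
schedule = [
    ["VALU", "VMAC", "LOAD"],
    ["VALU", "VMAC"],
    ["VMAC", "LOAD", "STORE", "SHUFFLE"],
    ["VALU", "VMAC"],
]

average, peak, utilization = ilp_metrics(schedule, ["VALU", "VMAC"])
print(average, peak, utilization)
```

In this invented loop the VMAC is busy in every instruction, so wider instructions alone could not shorten the loop; a second unit of the same type would be needed.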

Table 3.5: Measured ILP on the EVP for inner loops of baseband algorithms

| Measured kernel | N_{par. ∅} | N_{par. peak} | R_VALU | R_VMAC |
|---|---|---|---|---|
| 3 radix-2 FFT stages (with permutations) | 3.654 | 5 | 92.33 % | 61.54 % |
| 3 radix-2 FFT stages (without permutations) | 2.808 | 5 | 92.33 % | 92.33 % |
| Radix-3 FFT stage | 2.467 | 5 | 66.67 % | 93.33 % |
| Radix-6 FFT stage | 2.045 | 4 | 72.73 % | 72.73 % |
| Radix-5 FFT stage | 2.619 | 6 | 61.90 % | 85.71 % |
| W-CDMA channel estimation | 2.600 | 6 | 15 % | 100 % |
| HSDPA channel combiner | 2.542 | 5 | 25 % | 58.33 % |
| HSDPA sub frame generation | 3.625 | 5 | 100 % (CGU) | 100 % |

On average, all measured inner loops achieve between two and four parallel operations per instruction. Except for the HSDPA channel combiner, all inner loops achieve maximum resource utilization values close to or equal to 100 percent. The HSDPA sub frame generation kernel does not use the VALU; however, the CGU is active all the time. Based on these results, the scalable SIMD processor architecture was designed as a LIW architecture with four parallel slots per instruction, which is more than all achieved values for N_{par. ∅}. The peak number of parallel operations has only been achieved for one or two cycles in each loop, which suggests that similar performance can be achieved by moving one (or two) operations into subsequent instructions. Hence, a slot number smaller than the peak number of parallel operations has been selected. A variable-length LIW encoding based on differential encoding has been chosen to guarantee a small code size. The implementation is based on design examples in [CoW09b]. The number of ports for the register files will be discussed in the following section.
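The argument that isolated peaks can be absorbed by neighbouring instructions can be made concrete with a toy rescheduling pass. The sketch below is a simplification: it ignores data dependencies and functional unit conflicts, the per-instruction operation counts are invented, and the function name is hypothetical.

```python
def cap_slots(ops_per_instruction, max_slots):
    """Greedily push operations that exceed max_slots into the following instruction."""
    capped = []
    carry = 0
    for ops in ops_per_instruction:
        total = ops + carry
        issued = min(total, max_slots)
        capped.append(issued)
        carry = total - issued
    while carry > 0:  # append extra instructions only if operations are left over
        issued = min(carry, max_slots)
        capped.append(issued)
        carry -= issued
    return capped


# Invented inner loop with a single peak of six parallel operations.
loop = [2, 6, 2, 3, 2, 2]
capped = cap_slots(loop, max_slots=4)
print(capped, "instructions:", len(loop), "->", len(capped))
```

Here the two surplus operations of the peak fit into free slots of the next instruction, so the loop keeps its length with four slots. Whether this also holds for a real kernel depends on the dependencies that this sketch leaves out.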

The instructions are encoded with a slot size of 24 bits and a variable instruction word length between 24 and 96 bits. The implemented slot encoding formats for the different operation types are displayed in figure 3.4.


[Figure 3.4 (image not reproduced). Slot formats over bit positions 23 to 0 for the following operation types: binary 16-bit operations, unary 16-bit operations, comparison operations, binary 1-bit operations, unary 1-bit operations, move immediate operations, load/store operations, branch/call operations, hardware loop operations, moves between register files, and permutation operations (single-vector network / double-vector network). Fields shown include opcode, src, src1, src2, dst, src/dst, cond, wen, immediate, ptr, ofs_reg, branch_address, dst_reg, imm./reg, loop_end_address, and pattern. Legend: variable-length LIW slot stop bit; don't care bit; don't care or register address; wen = writeback enable bit.]

Figure 3.4: Encoding of 24-bit slots. The first bit contains the differential encoding of the instruction length; the remaining bits contain the instruction.
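The stop-bit framing of 24-bit slots can be sketched as a pair of pack and unpack routines. Several details are assumptions and not taken from the text: the stop bit is placed in the most significant bit of the slot, a value of 1 is taken to mean that another slot follows, and all names are hypothetical. The payloads are opaque 23-bit values; the individual slot formats are not modelled.

```python
SLOT_BITS = 24
STOP_BIT = 1 << (SLOT_BITS - 1)  # assumption: stop bit is the most significant bit
PAYLOAD_MASK = STOP_BIT - 1      # remaining 23 bits carry the operation
MAX_SLOTS = 4


def encode_instruction(payloads):
    """Turn 1 to 4 payloads into 24-bit slots; stop bit 1 means 'another slot follows'."""
    if not 1 <= len(payloads) <= MAX_SLOTS:
        raise ValueError("an instruction holds between 1 and 4 slots")
    slots = []
    for index, payload in enumerate(payloads):
        if payload & ~PAYLOAD_MASK:
            raise ValueError("payload does not fit into 23 bits")
        more_follow = index < len(payloads) - 1
        slots.append(payload | (STOP_BIT if more_follow else 0))
    return slots


def decode_instructions(slot_stream):
    """Split a flat sequence of 24-bit slots into instructions using the stop bits."""
    instructions, current = [], []
    for slot in slot_stream:
        current.append(slot & PAYLOAD_MASK)
        if not slot & STOP_BIT:  # last slot of the current instruction
            instructions.append(current)
            current = []
    if current:
        raise ValueError("slot stream ends in the middle of an instruction")
    return instructions


stream = encode_instruction([0x1, 0x2, 0x3]) + encode_instruction([0x7FFFFF])
print(decode_instructions(stream), len(stream) * SLOT_BITS, "bits")
```

With this framing, the decoder only has to inspect one bit per slot to find the instruction boundary, and an instruction occupies between 24 and 96 bits.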


In document Exploration of the scalability of SIMD processing for software defined radio (Page 62-66)