Research on the scalability of SIMD processing for SDR

2.3 Wide SIMD processor architectures and research on the scalability of SIMD processing

2.3.6 Research on the scalability of SIMD processing for SDR

While SIMD processing is a technique that has been thoroughly investigated, few research results have been published that discuss the scalability of wide SIMD processor architectures for SDR algorithms. [WLS+08a] describes the development of the Ardbeg architecture from SODA. A limited analysis of different SIMD widths and permutation networks has been done before selecting the final SIMD width for Ardbeg. [WLS+08b] investigates the scalability of SODA for four baseband algorithms.

SIMD scalability analysis during the Ardbeg development

[WLS+08a] includes a SIMD width analysis for the Ardbeg processor in 90 nm technology. Ardbeg has been synthesized for SIMD widths ranging from 128 to 1024 bit (8 to 64 16-bit lanes). The synthesized permutation network is not mentioned. Energy consumption, delay, and area have been measured for a mix of 3G baseband algorithms, including FIR filtering, FFT, W-CDMA searcher (based on auto-correlation), and Viterbi decoding. Algorithm parameters (e. g. filter length, FFT size) are not mentioned. Results are reported as an average over all algorithms and are normalized to the 128-bit SIMD processor. The normalized delay figures show approximately linear speedup.² The energy consumption decreases with an increasing SIMD width (60 percent for 64 lanes). The area more than doubles with a doubling of the SIMD vector length (approximately 10× for 64 lanes). Due to the significant increase in area, the SIMD width has been set to 32 lanes and not 64 lanes.

Chapter 2 Overview of software defined radio principles and architectures
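The relation between SIMD bit width, lane count, and ideal speedup in the width sweep above can be tabulated in a few lines. The following is only an illustrative sketch: the names `lanes`, `ideal_speedup`, and `AREA_64_LANES` are introduced here and are not from [WLS+08a]. Only the 16-bit lane width, the 128-bit baseline, and the approximate 10× area figure for 64 lanes are taken from the text; the speedup is the ideal linear one, not a measured value.

```python
# Sketch: relate SIMD bit width, 16-bit lane count, and ideal (linear) speedup.
LANE_BITS = 16        # lane width used in the text
BASELINE_BITS = 128   # results are normalized to the 128-bit processor


def lanes(simd_bits: int, lane_bits: int = LANE_BITS) -> int:
    """Number of lanes for a given SIMD width in bits."""
    if simd_bits % lane_bits != 0:
        raise ValueError("SIMD width must be a multiple of the lane width")
    return simd_bits // lane_bits


def ideal_speedup(simd_bits: int, baseline_bits: int = BASELINE_BITS) -> float:
    """Linear speedup: doubling the SIMD width doubles the performance."""
    return simd_bits / baseline_bits


widths = [128, 256, 512, 1024]
table = [(w, lanes(w), ideal_speedup(w)) for w in widths]

# Approximate normalized area for 64 lanes as quoted in the text (about 10x).
AREA_64_LANES = 10.0
# Ideal speedup per unit of normalized area; a value below 1 means that
# area grows faster than the best-case performance.
speedup_per_area_64 = ideal_speedup(1024) / AREA_64_LANES

for w, n, s in table:
    print(f"{w:5d} bit -> {n:2d} lanes, ideal speedup {s:.0f}x")
print(f"ideal speedup per normalized area at 64 lanes: {speedup_per_area_64:.1f}")
```

Even with ideal linear speedup, the 64-lane configuration delivers less performance per unit of area than the baseline, which is in line with the reported decision to stop at 32 lanes.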

After fixing the SIMD width, four different permutation networks have been implemented and synthesized, and normalized energy and energy-delay-product have been measured for 64-point and 2048-point radix-2 and radix-4 FFTs and a Viterbi decoder for constraint length 9. The implemented permutation networks are SODA's single stage perfect shuffle exchange / inverse perfect shuffle exchange network with a width of one vector (enabling permutations on one input vector), the same network topology with a width of two vectors (enabling permutations on pairs of vectors), a banyan network (multistage interconnect network - MIN) on two vectors, and a crossbar network on two vectors. More information on permutation network topologies can be found in chapter 3.1.5. The analysis shows that the algorithm implementations with permutation networks with a width of two vectors consume less energy and have a better delay than the implementations with a single-vector network. The banyan network and the crossbar network achieve similar results, except for the 64-point radix-2 FFT, which has a much higher energy consumption using the crossbar network. The double-vector perfect shuffle exchange / inverse perfect shuffle exchange network achieves the best results for the 64-point radix-2 FFT, as the FFT algorithm is optimized for this network architecture. The banyan network and the crossbar network attain the best results for all remaining algorithms. Based on the analysis, Ardbeg has been realized with a double-vector banyan network.
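To make the difference between a single-vector and a double-vector network more concrete, the following sketch shows the textbook perfect shuffle and its inverse on 8-lane vectors. It is a minimal illustration under stated assumptions: the function names are hypothetical, only the shuffle part is modeled (not the exchange stage), and the actual wiring of the SODA and Ardbeg networks is not reproduced here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def perfect_shuffle(v: Sequence[T]) -> List[T]:
    """Interleave the lower and the upper half of v."""
    n = len(v)
    if n % 2 != 0:
        raise ValueError("vector length must be even")
    half = n // 2
    out: List[T] = []
    for i in range(half):
        out.append(v[i])         # element from the lower half
        out.append(v[half + i])  # element from the upper half
    return out


def inverse_perfect_shuffle(v: Sequence[T]) -> List[T]:
    """Undo perfect_shuffle: even positions first, then odd positions."""
    if len(v) % 2 != 0:
        raise ValueError("vector length must be even")
    return list(v[0::2]) + list(v[1::2])


a = list(range(8))      # first 8-lane vector
b = list(range(8, 16))  # second 8-lane vector

# Single-vector network: the permutation only sees one vector.
single = perfect_shuffle(a)
# Double-vector network: the permutation sees a pair of vectors,
# so elements of a and b can be interleaved in one pass.
double = perfect_shuffle(a + b)

print(single)
print(double)
```

In this model, the double-vector shuffle interleaves the elements of two vectors in one pass, which a single-vector network cannot do directly.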

SIMD scalability analysis based on SODA

The SIMD scalability analysis in [WLS+08b] based on SODA considers four SDR algorithms for MIMO-OFDM for SIMD widths ranging from 512 bit (32 16-bit lanes) to 4096 bit (256 16-bit lanes). The implemented algorithms are a 1024-point radix-2 FFT, space time block coding (STBC) based on Alamouti for a 2 × 2 MIMO system [Ala98, Bau01], the vertical Bell laboratories layered space-time (V-BLAST) detection algorithm for 4 × 4 MIMO [Fos96, WFGV98], and a decoder for a WiMAX LDPC code (z = 96, R = 5/6) [IEE09b, SMZC07].

First, available data parallelism and workload have been analyzed; the results are summarized in table 2.2. According to Woh et al., the SIMD width should be increased to be as large as the FFT size N_DFT for the maximum performance. STBC and V-BLAST both operate on small vectors, with one vector per OFDM sub-carrier. As sub-carriers are orthogonal to each other and can be processed in parallel, data parallelism is only limited by the number of data carriers in an OFDM symbol, which here is assumed to be the FFT size. The LDPC decoder operates on z × z sub-matrices of the LDPC matrix; hence, at most z elements can be processed in parallel.

² The term linear speedup or ideal speedup means that a doubling of the SIMD width leads to a doubling of the performance.

Table 2.2: Analysis of data parallelism [WLS+08b]

Algorithm        Overhead        Scalar          SIMD            Maximum vector
                 workload [%]    workload [%]    workload [%]    parallelism
FFT/IFFT         61              5               34              N_DFT = 1024
2 × 2 STBC       14              5               81              4 · N_DFT
4 × 4 V-BLAST    24              6               70              4 · N_DFT
LDPC             3               18              49              z = 96

The workload results categorize the workload on the SIMD processor into scalar workload, workload for computational SIMD operations on the ALU, multiplier, or shifter (denoted as SIMD workload), and overhead workload for memory access and vector permutations. They show a high utilization of the computational SIMD units for STBC and V-BLAST, whereas the FFT is dominated by overhead workload.

The speedup and energy consumption have been measured and normalized to the results for 32 16-bit lanes. Normalized speedup results show linear speedup for STBC and slightly less than linear speedup for FFT. The V-BLAST implementation apparently requires more scalar operations for larger SIMD widths; hence, the speedup increases slowly (approximately 5.5× for 256 lanes). The speedup for LDPC decoding also increases slowly and does not increase at all if the SIMD width is increased from 128 to 256 lanes, as at most 96 elements can be processed in parallel. The maximum LDPC speedup is 3.0, which is reached for 128 or more parallel lanes. If linear or close to linear speedup can be attained, the energy consumption stays almost constant. The LDPC decoder requires more energy on wider SIMD architectures, because most of the SIMD lanes perform useless computations.
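The saturation of the LDPC speedup can be illustrated with a simple step-count model: if an algorithm offers P independent elements and the processor has W lanes, at least ceil(P / W) vector operations are needed per pass. The sketch below applies this idealized model to the maximum vector parallelism values of table 2.2. The function names are hypothetical, and the model ignores scalar and overhead workload, so it only yields a best case and is not the measurement method of [WLS+08b].

```python
import math

# Maximum vector parallelism per algorithm from table 2.2 (N_DFT = 1024, z = 96).
N_DFT = 1024
MAX_PARALLELISM = {
    "FFT/IFFT": N_DFT,
    "2x2 STBC": 4 * N_DFT,
    "4x4 V-BLAST": 4 * N_DFT,
    "LDPC": 96,
}
BASELINE_LANES = 32  # results in the text are normalized to 32 lanes


def vector_steps(parallelism: int, lanes: int) -> int:
    """Vector operations needed to cover `parallelism` independent elements."""
    return math.ceil(parallelism / lanes)


def speedup_bound(parallelism: int, lanes: int,
                  baseline: int = BASELINE_LANES) -> float:
    """Best-case speedup over the baseline if only the step count matters."""
    return vector_steps(parallelism, baseline) / vector_steps(parallelism, lanes)


def lane_utilization(parallelism: int, lanes: int) -> float:
    """Fraction of lane slots that do useful work."""
    return parallelism / (vector_steps(parallelism, lanes) * lanes)


for name, p in MAX_PARALLELISM.items():
    bounds = [speedup_bound(p, w) for w in (32, 64, 128, 256)]
    util = lane_utilization(p, 256)
    print(f"{name:12s} bounds {bounds}, utilization at 256 lanes {util:.3f}")
```

For the LDPC row, the bound saturates at 3.0 from 128 lanes on, which agrees with the maximum speedup quoted above, and only 37.5 percent of the lanes are used at 256 lanes. For the other algorithms, the model merely confirms that the available parallelism does not limit the speedup up to 256 lanes.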

Differences to the present thesis

The analysis of the scalability of SIMD processing in this thesis differs from the work in [WLS+08b, WLS+08a] concerning the analyzed algorithms and the considered SIMD architecture.

SODA supports only one vector operation per clock cycle, while Ardbeg supports a restricted LIW instruction format with at most two parallel vector operations. Therefore, much of the processing time is spent either on scalar operations or on memory access and vector alignment operations, as can be seen in table 2.2. This thesis proposes a SIMD processor architecture with LIW support. Parallel processing of computational SIMD operations (e. g. addition, multiplication) and memory access and/or vector permutation operations increases the performance of the SIMD architecture, as overhead operations can be hidden by LIW execution. One prominent example is the FFT with an overhead workload of 61 percent on SODA. On the proposed SIMD architecture, the overhead operations can be completely or mostly performed in parallel to useful computational operations (see chapter 4.6.3).

The SIMD processor architecture has also been implemented with four different permutation network configurations (see chapter 3.1.5), which enables a systematic analysis of the complexity of permutations for different SIMD widths.

From an algorithm perspective, the present thesis considers different algorithm parameters and a more recent MIMO detection algorithm, and in part achieves different results than the work in [WLS+08b, WLS+08a].

Woh et al. only implemented one FFT size (N_DFT = 1024) and one LDPC code (z = 96, R = 5/6). Hence, the influence of algorithm parameters could not be investigated. This thesis analyzes different FFT sizes, including mixed-radix FFT sizes, and different LDPC codes. The results show that the scalability indeed depends on algorithm parameters.

In [WLS+08b], 2 × 2 STBC and 4 × 4 V-BLAST are implemented as examples for MIMO algorithms. STBC is an approach that increases the signal quality at the receiver by sending a signal on multiple transmit antennas in a space-time code [Ala98]. V-BLAST is a detection algorithm for spatial multiplexing, i. e. multiple data streams are transmitted in parallel [WFGV98]. This thesis analyzes 4 × 4 sphere decoding, which is a class of detection algorithms for spatial multiplexing. V-BLAST has a lower computational complexity than sphere decoding, but has a poor BER performance, as it does not exploit the full MIMO diversity [BBW+05]. Sphere decoding algorithms achieve a BER performance close to the optimum maximum likelihood (ML) solution. Therefore, sphere decoding is a better choice for 4G systems and more challenging due to the greater computational complexity.

Woh et al. also arrive at different conclusions concerning the scalability of FFT and LDPC decoding than this thesis. [WLS+08b] claims that the SIMD width should be increased to the FFT size, while the LDPC decoder can process at most z elements in parallel. In chapter 4, it is shown that the FFT size should be at least twice the SIMD width for radix-2 FFTs. Chapter 6, which describes the LDPC decoder implementation, shows that LDPC decoding may also be efficiently done for SIMD widths greater than z, yet a different implementation is required than for SIMD widths less than or equal to z.
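One plausible way to see why the ratio of FFT size to SIMD width matters for radix-2 FFTs is to count butterflies: each radix-2 stage of an N-point FFT consists of N/2 independent butterflies. The sketch below assumes, purely for illustration, that every SIMD lane computes one butterfly per vector operation. This mapping and the function name `radix2_stage_utilization` are assumptions made here and are not taken from the thesis, whose own argument is given in chapter 4 and may differ.

```python
import math


def radix2_stage_utilization(n_fft: int, lanes: int) -> float:
    """Lane utilization of one radix-2 stage, assuming one butterfly per lane."""
    if n_fft < 2 or n_fft & (n_fft - 1) != 0:
        raise ValueError("n_fft must be a power of two and at least 2")
    butterflies = n_fft // 2                  # independent butterflies per stage
    steps = math.ceil(butterflies / lanes)    # vector operations per stage
    return butterflies / (steps * lanes)


n_fft = 1024
for lanes in (256, 512, 1024):
    util = radix2_stage_utilization(n_fft, lanes)
    print(f"N = {n_fft}, {lanes:4d} lanes -> utilization {util:.2f}")
```

Under this assumption, all lanes are busy as long as the FFT size is at least twice the SIMD width, whereas a SIMD width equal to the FFT size leaves half of the lanes idle in every stage.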

In document Exploration of the scalability of SIMD processing for software defined radio (Page 47-51)