5.4 SIMD implementation of the FSD for MIMO-OFDM
5.5.4 Improving the FSD performance
The performance results for the FSD algorithm on the proposed scalable SIMD processor architecture have been obtained without any application-specific modifications or extensions to the processor architecture. Yet, during the implementation of the FSD algorithm, performance bottlenecks that can be fixed with small extensions to the instruction set have been identified. Below, these bottlenecks and potential solutions are briefly discussed.
The calculation of the squared Euclidean norm ($\|\cdot\|_2^2$) is a major performance bottleneck, as the calculation requires multiple operations and lies in the critical path of the FSD algorithm. The hard-decision FSD algorithm requires at least 16 squared Euclidean norm computations and the soft-decision algorithm requires at least 72 squared Euclidean norm computations for 16-QAM modulation. Each norm computation requires three consecutive operations on the scalable SIMD processor architecture (see figure 5.15): First, the real and imaginary parts are squared using a multiplication operation. Next, the real and imaginary parts are swapped using a permutation operation. In the last step, both values are accumulated.
vmul_f16_rdnsat v0 v0 v0
vswap v1 v0
vaddsat v0 v0 v1
Figure 5.15: Assembly code fragment for the calculation of the squared Euclidean distance
A specialized squared Euclidean norm operation could possibly perform the same operation on the VMAC unit in one clock cycle by first squaring the real and imaginary parts using the multipliers and then accumulating the results for the real part and the imaginary part.
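The three-step sequence of figure 5.15 and a fused alternative can be modelled at lane level as in the following Python sketch. It is only an illustration under stated assumptions: lanes are taken to hold interleaved real and imaginary parts in a 16-bit Q15 fixed-point format, and the rounding and saturation behaviour of the multiply is inferred from the mnemonic rather than from an instruction set reference. The function norm_fused is a hypothetical operation and not part of the implemented processor.

```python
# Lane-level model of the three-instruction sequence and of a hypothetical
# fused operation. The Q15 semantics are an assumption, not taken from the ISA.

INT16_MIN, INT16_MAX = -32768, 32767


def sat16(x):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, x))


def vmul_q15(a, b):
    """Lane-wise fractional multiply with rounding and saturation (assumed)."""
    return [sat16((x * y + (1 << 14)) >> 15) for x, y in zip(a, b)]


def vswap(a):
    """Swap the two lanes of every (real, imaginary) pair."""
    out = list(a)
    out[0::2], out[1::2] = a[1::2], a[0::2]
    return out


def vaddsat(a, b):
    """Lane-wise saturating addition."""
    return [sat16(x + y) for x, y in zip(a, b)]


def norm_three_ops(v):
    """Multiply, swap, add: both lanes of a pair end up holding re^2 + im^2."""
    sq = vmul_q15(v, v)
    return vaddsat(sq, vswap(sq))


def norm_fused(v):
    """Hypothetical single operation: square both parts, then accumulate."""
    out = []
    for re, im in zip(v[0::2], v[1::2]):
        re_sq = sat16((re * re + (1 << 14)) >> 15)
        im_sq = sat16((im * im + (1 << 14)) >> 15)
        s = sat16(re_sq + im_sq)
        out += [s, s]
    return out


# Two complex values, interleaved: (0.5 + 0.25j) and (-0.5 + 0.5j) in Q15.
v = [16384, 8192, -16384, 16384]
print(norm_three_ops(v))  # [10240, 10240, 16384, 16384]
print(norm_fused(v))      # same result from a single modelled operation
```

If the fused operation occupied a single issue slot, each norm computation would save two operations. With the counts given above, this would amount to at least 32 operations for hard-decision and at least 144 operations for soft-decision detection with 16-QAM modulation.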
The computation of the best symbol candidate by thresholding (see figure 5.13) during the single expansion stages of the FSD is a further performance bottleneck, especially for 16-QAM modulation.
Assembly code for 16-QAM thresholding is displayed in figure 5.16: First, the absolute value of the input vector v0 is calculated and the symbol value sym2 is broadcast to v4. Next, the absolute value is compared to the threshold for the amplitude of the symbol vector and sym1 is broadcast to v5. Then, the amplitude is updated, while the sign is determined by comparing v0 to zero. In the fourth clock cycle, the sign of the symbol vector is updated. The VALU is occupied all the time.
1: vabs v2 v0 || vbcst16 v4 r2
2: vcmpgte m1 v2 v1 || vbcst16 v5 r1
3: vcmpgte m2 v2 v3 || vmov_vmac v4 v5 m1
4: vneg v4 v4 m2
Figure 5.16: Assembly code fragment for the thresholding operation during SE stages for 16-QAM modulation
The thresholding is a performance bottleneck if no other useful operations can be done in parallel, e. g. PED computations on the VMAC. During the soft-decision FSD, PED computations cannot always be done in parallel to the thresholding; hence, the performance can be improved by speeding up the thresholding operation. The thresholding can be realized using small programmable lookup tables (LUTs) that are distributed to the 16-bit SIMD lanes. Each LUT contains the possible symbol values; the address is generated from the MSBs of the data values.
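The LUT idea can be illustrated for a single 16-bit lane and one signal component with the Python sketch below. The scaling is an assumption chosen for the example: the 16-QAM amplitude levels are placed at ±8192 and ±24576, so that the decision thresholds (0 and ±16384) coincide with the two MSBs of the two's complement value. The names SYM1, SYM2, THRESHOLD, slice_compare and slice_lut are illustrative and do not appear in the implementation.

```python
# Per-lane 16-QAM slicer for one signal component (in-phase or quadrature).
# Assumed scaling: levels at +-8192 and +-24576, thresholds at 0 and +-16384.

SYM1, SYM2 = 8192, 24576   # inner and outer amplitude (assumed values)
THRESHOLD = 16384          # amplitude threshold between SYM1 and SYM2


def slice_compare(x):
    """Compare-and-select version, loosely following figure 5.16."""
    amplitude = SYM2 if abs(x) >= THRESHOLD else SYM1
    return -amplitude if x < 0 else amplitude


# Four-entry table indexed by the two MSBs of the 16-bit two's complement value:
#   00 -> [0, 16383]         01 -> [16384, 32767]
#   10 -> [-32768, -16385]   11 -> [-16384, -1]
LUT = [SYM1, SYM2, -SYM2, -SYM1]


def slice_lut(x):
    """LUT version: the address is formed from the two MSBs of the lane."""
    return LUT[(x & 0xFFFF) >> 14]


for x in (-30000, -5000, 100, 20000):
    print(x, slice_compare(x), slice_lut(x))
```

In this model, both versions agree for every 16-bit input except x = -16384, where the tie at the threshold is resolved towards the inner symbol by the table and towards the outer symbol by the comparison. Thresholds that are not aligned with a power of two would presumably require more address bits or a scaling of the data before the lookup.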
The thresholding requires two operations on the VALU for QPSK modulation; hence, there is less room for improvement. Yet, QPSK thresholding could also be implemented using small LUTs.
A further performance bottleneck is the minimum-search during the LLR calculation for the soft-decision FSD, which requires one vector minimum operation per pair of bits (in the in-phase and quadrature signal components) and soft-decision list element. The performance could be improved by computing the required minimum for multiple bits in parallel. As the channel decoding algorithm (e. g. a turbo decoder or an LDPC decoder) does not require the LLR values in 16-bit precision, the LLR calculation could be performed on 8-bit data types, with two 8-bit elements stored in one 16-bit vector element, potentially leading to a runtime reduction by 50 percent.
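The effect of packing two 8-bit values into one 16-bit vector element can be sketched in Python as follows. The sketch assumes that the distance metrics have already been quantized to unsigned 8-bit values, and vmin_2x8 stands for a hypothetical packed minimum operation; neither the quantization nor the operation is taken from the implementation.

```python
def pack(hi, lo):
    """Store two unsigned 8-bit values in one 16-bit vector element."""
    return ((hi & 0xFF) << 8) | (lo & 0xFF)


def unpack(word):
    """Return the two 8-bit halves (high, low) of a 16-bit element."""
    return (word >> 8) & 0xFF, word & 0xFF


def vmin_2x8(a, b):
    """Hypothetical lane-wise minimum on two 8-bit halves per 16-bit element."""
    out = []
    for x, y in zip(a, b):
        (xh, xl), (yh, yl) = unpack(x), unpack(y)
        out.append(pack(min(xh, yh), min(xl, yl)))
    return out


def running_min(metric_vectors):
    """Fold the list of packed metric vectors, one vmin_2x8 per list element."""
    acc = metric_vectors[0]
    for vec in metric_vectors[1:]:
        acc = vmin_2x8(acc, vec)
    return acc


# One lane, three list elements; each element carries the metrics of two bits.
metrics = [[pack(40, 200)], [pack(35, 90)], [pack(60, 10)]]
print([unpack(w) for w in running_min(metrics)])  # [(35, 10)]
```

Each call to vmin_2x8 updates the minima of two bits at once, so the minimum search over the soft-decision list needs half as many vector operations as with one 16-bit metric per element.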
Chapter 5 Sphere decoding for MIMO detection
5.6 Conclusion
Sphere decoding can be efficiently realized on arbitrarily wide SIMD processor architectures if two prerequisites are satisfied. Firstly, a breadth-first search strategy has to be applied to the tree search instead of the original sequential depth-first sphere search algorithm. The FSD algorithm fulfills this requirement and still achieves close-to-ML bit error rates. Secondly, parallel processing of multiple sphere searches has to be enabled, as parallelism in one sphere search is limited by the size of the modulation symbol alphabet (e. g. four different symbols for QPSK modulation). Future MIMO systems will probably use OFDM block modulation. Hence, this requirement is fulfilled, as orthogonal OFDM sub-carriers can be processed in parallel.
The hard-decision FSD implementation can meet the throughput requirements of 4 × 4 MIMO systems based on the LTE frame structure (up to 278.77 Mbps for 16-QAM modulation on a 1024-bit SIMD processor). Due to the significantly increased complexity, the implemented soft-decision FSD algorithm based on bit-flipping achieves approximately half the throughput of the hard-decision FSD algorithm.
The FSD implementation on the scalable SIMD processor architecture achieves approximately 32 percent of the soft-decision throughput of the best known hardware implementation [FLN+09], while also consuming more power and requiring more area. Compared to other SDR implementations, the achieved performance for both hard-decision and soft-decision MIMO detection is very good.
Chapter 6
Decoding of quasi-cyclic low-density parity check codes
The decoding of quasi-cyclic low-density parity check (LDPC) codes on the proposed scalable SIMD processor architecture is evaluated in this chapter. Section 6.1 describes the basics of LDPC coding, such as the representation by Tanner graphs and the properties of quasi-cyclic LDPC codes. The following section (section 6.2) explains the decoding of LDPC codes by message-passing algorithms. SIMD implementations of WiMAX LDPC codes are discussed in section 6.3 and their performance is evaluated in section 6.4. Finally, conclusions are drawn in section 6.5.
6.1 Fundamentals
LDPC codes are parity check codes based on very sparse matrices. The BER performance of LDPC codes can be close to the Shannon limit [MN97, RSU01, CFRU01]. LDPC codes were invented by Gallager and first published in his dissertation in 1960 [Gal63]. Yet, interest in LDPC codes only developed after the advent of turbo codes, which were invented by Berrou and Glavieux in 1993 [BGT93]. LDPC codes were finally rediscovered independently by MacKay and Neal [MN95] and by Wiberg [Wib96].