
5.4 SIMD implementation of the FSD for MIMO-OFDM

5.4.1 Channel ordering

The FSD channel ordering requires calculating the matrix A = H^H · H. The channel ordering is then computed iteratively; each iteration requires a matrix inversion and a reduction of A by removing one row and one column (see equation (5.20)). Afterwards, the channel matrix needs to be reordered based on the new channel ordering. As the channel matrix is a 4 × 4 matrix and the iterations of the ordering algorithm further reduce the matrix size, there are only limited opportunities for parallel processing within the channel ordering of a single matrix. However, in an OFDM system, there is one channel matrix, one received vector, and one transmit vector for each OFDM sub-carrier. As the sub-carriers are orthogonal to each other, they may be processed in parallel during the channel ordering stage, as well as during the QR-decomposition and the tree search. The principal approach is depicted in figure 5.9. Parallelism in this case is limited only by the number of data carriers in an OFDM symbol, which is sufficiently large for wide SIMD processing (e.g. 1200 data carriers for 20 MHz bandwidth in LTE).
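The per-sub-carrier parallelism described above can be pictured as a data layout in which each matrix element of several sub-carriers is gathered into one SIMD vector. The following is only a minimal sketch of that idea; the function names `to_simd_vectors` and `batches` and the list-based representation are assumptions made for illustration and do not come from the processor's tool chain.

```python
def to_simd_vectors(matrices):
    """Gather W per-sub-carrier matrices into per-element vectors.

    matrices: list of W square matrices (one per sub-carrier, hypothetical layout).
    Returns vec with vec[r][c] = list of element (r, c) over the W sub-carriers,
    so that one vector operation handles the same element of W sub-carriers.
    """
    n = len(matrices[0])
    return [[[m[r][c] for m in matrices] for c in range(n)] for r in range(n)]


def batches(num_carriers, width):
    """Number of SIMD batches needed to cover all data carriers (ceiling division)."""
    return -(-num_carriers // width)
```

For the 1200 data carriers mentioned above, a SIMD width of four elements would need `batches(1200, 4)` batches per OFDM symbol, and wider vectors reduce that count proportionally.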

The main challenges of the channel ordering implementation are reducing the complexity of the matrix operations and avoiding round-off errors due to the limited 16-bit fixed-point precision, especially during the matrix inversions.

3 The proposed soft-decision FSD algorithm with bit-flipping requires a minimum search in the single expansion stages of the algorithm.

Figure 5.9: Parallel processing of the FSD algorithm by parallel processing of OFDM sub-carriers for a SIMD width of four elements. (Figure content: sub-carriers i, i+1, i+2, and i+3 each pass through channel ordering, QRD, and FSD tree search, with one sub-carrier per element of the SIMD data vector.)

Calculation of A = H^H · H

The first step of the channel ordering, the calculation of matrix A, can be simplified due to the properties of the matrix. A is a Hermitian matrix, which means that the matrix is equal to its own conjugate transpose. The diagonal elements of A are real-valued and positive, as is the determinant of A.

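$$
A = H^H \cdot H =
\begin{pmatrix}
a_{1,1} & a^*_{2,1} & a^*_{3,1} & a^*_{4,1} \\
a_{2,1} & a_{2,2} & a^*_{3,2} & a^*_{4,2} \\
a_{3,1} & a_{3,2} & a_{3,3} & a^*_{4,3} \\
a_{4,1} & a_{4,2} & a_{4,3} & a_{4,4}
\end{pmatrix},
\qquad a_{i,i} \in \mathbb{R}^+, \quad a_{i,j} \in \mathbb{C} \;\; \forall\, i \neq j
\tag{5.21}
$$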

Hence, only the diagonal elements and the elements below the diagonal need to be computed; the remaining six elements are given by conjugate symmetry. This reduces the computational complexity of the matrix product and the memory required for storing A, as only ten matrix elements need to be saved. Furthermore, because the diagonal elements are always real-valued, complex-valued multiplications can be replaced by real-valued multiplications in operations involving one of these elements. This improves performance, as a real-valued multiplication or MAC operation takes only one clock cycle, while a complex-valued operation requires two clock cycles.
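The saving from the Hermitian structure can be illustrated with a short floating-point sketch that computes only the ten stored elements of A. The names `gram_lower` and `element` and the dictionary storage are hypothetical choices for this illustration; a fixed-point SIMD implementation would look different.

```python
def gram_lower(H):
    """Compute the diagonal and lower-triangle elements of A = H^H * H.

    H is a list of rows of complex numbers. The result is a dict keyed by
    0-based (i, j) with i >= j, i.e. ten entries for a 4x4 matrix.
    """
    n = len(H[0])
    A = {}
    for i in range(n):
        # Diagonal element: sum of squared magnitudes, real-valued by construction.
        A[(i, i)] = sum(row[i].real ** 2 + row[i].imag ** 2 for row in H)
        for j in range(i):
            # Element below the diagonal: sum over k of conj(h_{k,i}) * h_{k,j}.
            A[(i, j)] = sum(row[i].conjugate() * row[j] for row in H)
    return A


def element(A, i, j):
    """Look up any element, using a_{i,j} = conj(a_{j,i}) above the diagonal."""
    return A[(i, j)] if i >= j else A[(j, i)].conjugate()
```

Only real arithmetic is used for the four diagonal elements, and the six elements above the diagonal are never computed or stored.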

Chapter 5 Sphere decoding for MIMO detection

Determining the channel ordering based on matrix-inversion

The next step of the channel ordering is calculating the index of the next channel matrix column from equation (5.20). The equation requires computing the position of the minimum or the maximum of the diagonal elements of A^{-1}. Therefore, there is neither a need to compute the complete matrix inverse nor a need to compute exact values, as only the relative ordering of the diagonal elements of the inverse matrix is required to determine the minimum or maximum position. Hence, equation (5.20) can be solved by computing the diagonal elements of the adjugate matrix of A instead of the inverse matrix:

$$
A^{-1} = \frac{1}{\det(A)} \cdot \operatorname{adj}(A)
\quad \Rightarrow \quad
\arg\min_j \left[ A^{-1} \right]_{j,j} = \arg\min_j \left[ \operatorname{adj}(A) \right]_{j,j}
\tag{5.22}
$$

Because det(A) is positive, dividing by it does not change the relative ordering of the diagonal elements. Computing the diagonal elements of the adjugate matrix instead of the inverse of A reduces the computational complexity of the channel ordering, because the division by the matrix determinant is avoided. Furthermore, the dynamic range of values is reduced, which reduces the impact of errors due to rounding and especially the saturation of values. The adjugate matrix of an n × n matrix is defined by the following equation, where A_ij denotes the sub-matrix that is generated by removing the ith row and jth column from A:

$$
\operatorname{adj}(A) =
\begin{pmatrix}
\det(A_{11}) & -\det(A_{21}) & \cdots & (-1)^{n+1} \det(A_{n1}) \\
-\det(A_{12}) & \det(A_{22}) & \cdots & (-1)^{n+2} \det(A_{n2}) \\
\vdots & \vdots & \ddots & \vdots \\
(-1)^{n+1} \det(A_{1n}) & (-1)^{n+2} \det(A_{2n}) & \cdots & \det(A_{nn})
\end{pmatrix}
\tag{5.23}
$$

Consequently, the channel ordering requires calculating and comparing four 3 × 3 determinants (4 × 4 input matrix), three 2 × 2 determinants (A reduced to 3 × 3), and one scalar comparison (A reduced to 2 × 2). The reduction of matrix A into smaller matrices by removing rows and columns for index k is visualized in figure 5.10. The required determinants can be computed directly, e.g. using the rule of Sarrus for 3 × 3 sub-matrices.

As the input data has limited precision due to the 16-bit word length, the proposed algorithm implementation has been tested for saturation and rounding errors. Rounding errors are insignificant, as the computed sub-matrix determinants are only used in comparison operations. Comparison errors will only occur if the values of two sub-matrix determinants are very close to each other. In this case, the relative ordering of the corresponding channel matrix columns is unimportant, as both columns suffer from a similar amount of post-processing noise amplification.
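The determinant-based selection can be sketched in floating-point Python as follows. All function names (`det2`, `det3`, `remove`, `adj_diag`, `selection_sequence`) and the `pick_max` parameter are hypothetical. Whether an iteration takes the minimum or the maximum is governed by equation (5.20), which is not reproduced in this section, so the sketch leaves that choice to the caller as an explicit assumption.

```python
def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def det3(m):
    """Determinant of a 3x3 matrix by the rule of Sarrus."""
    return (m[0][0] * m[1][1] * m[2][2] + m[0][1] * m[1][2] * m[2][0]
            + m[0][2] * m[1][0] * m[2][1] - m[0][2] * m[1][1] * m[2][0]
            - m[0][0] * m[1][2] * m[2][1] - m[0][1] * m[1][0] * m[2][2])


def remove(A, k):
    """Return a copy of A without row k and column k."""
    return [[v for c, v in enumerate(row) if c != k]
            for r, row in enumerate(A) if r != k]


def adj_diag(A):
    """Diagonal of adj(A) for a Hermitian 4x4, 3x3, or 2x2 matrix.

    The principal minors of a Hermitian matrix are real, so only the
    real part of each determinant is kept.
    """
    n = len(A)
    if n == 2:
        return [A[1][1].real, A[0][0].real]
    det = det3 if n == 4 else det2
    return [det(remove(A, j)).real for j in range(n)]


def selection_sequence(A, pick_max):
    """Select one index per iteration and reduce A accordingly.

    pick_max[i] is True if iteration i takes the maximum, False for the
    minimum. This rule is an assumption supplied by the caller.
    """
    remaining = list(range(len(A)))
    chosen = []
    for use_max in pick_max:
        d = adj_diag(A)
        k = d.index(max(d) if use_max else min(d))
        chosen.append(remaining.pop(k))
        A = remove(A, k)
    return chosen + remaining
```

For a 4 × 4 input and three iterations, `adj_diag` evaluates four 3 × 3 determinants, three 2 × 2 determinants, and finally compares two scalars, which matches the operation count given above.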

Figure 5.10: Example for the reduction of matrix A during the channel ordering. The relabeling of matrix elements is done to simplify the figure. (Figure content: removing row and column k = 3 from the 4 × 4 matrix with elements a_ij gives a 3 × 3 matrix relabeled b_ij; removing k = 2 gives a 2 × 2 matrix relabeled c_ij; removing k = 1 leaves the single element c22.)

Errors due to saturation are more significant, because a comparison of two saturated values is impossible. Furthermore, saturation of intermediate results (e.g. during the calculation of matrix A) also leads to significant errors. Saturation can be avoided by scaling the input values before or during the calculation of matrix A and by avoiding the division by a matrix determinant as in equation (5.22). Simulation results obtained from Matlab show that right-shifting the input values by two bits is sufficient to prevent saturation under the assumption that the input channel matrix requires the full dynamic range provided by a 16-bit word length. If the channel estimation produces a channel matrix estimate with fewer than 15 bits of precision, the scaling by right-shifting may be avoided.
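The effect of the two-bit right shift can be illustrated with a small model of one diagonal element of A. The sketch assumes a Q15 number format and saturation after every accumulation step; neither detail is stated in the text, which only specifies a 16-bit fixed-point word length. The names `sat16`, `q15_mul`, and `diag_element` are hypothetical.

```python
INT16_MAX, INT16_MIN = 32767, -32768


def sat16(x):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, x))


def q15_mul(a, b):
    """Product of two Q15 values, rescaled to Q15 and saturated (assumed format)."""
    return sat16((a * b) >> 15)


def diag_element(column, shift=0):
    """Fixed-point a_ii = sum of |h|^2 over one channel matrix column.

    column: list of (re, im) pairs of 16-bit integers.
    shift:  number of bits by which the inputs are right-shifted first.
    """
    acc = 0
    for re, im in column:
        re >>= shift
        im >>= shift
        acc = sat16(acc + q15_mul(re, re))   # real-valued MAC on the real part
        acc = sat16(acc + q15_mul(im, im))   # real-valued MAC on the imaginary part
    return acc


full_scale = [(32767, 32767)] * 4
print(diag_element(full_scale))      # 32767: the accumulator has saturated
print(diag_element(full_scale, 2))   # 16376: below the 16-bit limit
```

In this model, full-scale inputs saturate the accumulator after the second product, while inputs shifted by two bits leave the diagonal element at roughly half of the available range. This covers only the calculation of A under the stated assumptions and does not replace the simulation results cited above.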

Reordering of channel matrix elements

The reordering of the channel matrix elements describes the removal of elements from A, as in figure 5.10, and the ordering of the columns of the channel matrix H after the new column indices have been computed.


On a scalar processor architecture, both ordering operations can be efficiently implemented by memory access: the matrix elements are read from memory and stored in a different order, using the computed channel ordering for offset addressing. The overall complexity is one memory read and one memory write access per matrix element.
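A minimal sketch of the scalar variant is shown below. The function name `reorder_columns_scalar` and the convention that `order[j]` holds the old index of new column j are assumptions for illustration; the access counter is only there to make the cost visible.

```python
def reorder_columns_scalar(H, order):
    """Permute the columns of H so that new column j is old column order[j].

    Returns the reordered matrix and the number of memory accesses
    (one read plus one write per matrix element).
    """
    n = len(H)
    out = [[None] * n for _ in range(n)]
    accesses = 0
    for r in range(n):
        for j in range(n):
            value = H[r][order[j]]   # read, offset-addressed by the ordering
            out[r][j] = value        # write to the new position
            accesses += 2
    return out, accesses
```

For a 4 × 4 matrix this gives 16 reads and 16 writes, independent of the ordering.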

On a SIMD processor architecture, a different approach is necessary, as multiple OFDM sub-carriers with potentially differing orderings are processed in parallel in a vector. Hence, the ordering has to be done by conditionally swapping elements of two vectors, with masks defining the desired order of the channel matrix columns. The ordering of one matrix row with four elements requires six consecutive swapping operations. Each swapping operation can be realized by a pair of parallel vector move operations (see figure 5.11).

vmov_vmac v0 v1 m1 || vmov_valu v1 v0 m1

Figure 5.11: Swapping of data vectors v0 and v1 based on vector mask m1 using two parallel masked move operations
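The masked swapping can be emulated in Python to show how six conditional swaps per row realize a different permutation in every vector element. The text does not say how the masks are derived or in which sequence the vector pairs are swapped. This sketch assumes the comparator sequence of a four-input bubble-sort network and derives the masks by sorting the target positions; `NETWORK`, `masked_swap`, `build_masks`, and `reorder_row` are hypothetical names.

```python
# Assumed swap sequence: the six comparators of a 4-input bubble-sort network.
NETWORK = [(0, 1), (1, 2), (2, 3), (0, 1), (1, 2), (0, 1)]


def masked_swap(v0, v1, mask):
    """Emulate the pair of masked vector moves: swap elements where mask is set."""
    new_v0 = [b if m else a for a, b, m in zip(v0, v1, mask)]
    new_v1 = [a if m else b for a, b, m in zip(v0, v1, mask)]
    return new_v0, new_v1


def build_masks(order):
    """Derive six swap masks from per-sub-carrier column orderings.

    order[s][j] is the old column index that becomes new column j
    for the sub-carrier in vector element s.
    """
    width = len(order)
    # keys[c][s]: target position of old column c for sub-carrier s.
    keys = [[order[s].index(c) for s in range(width)] for c in range(4)]
    masks = []
    for p, q in NETWORK:
        mask = [keys[p][s] > keys[q][s] for s in range(width)]
        keys[p], keys[q] = masked_swap(keys[p], keys[q], mask)
        masks.append(mask)
    return masks


def reorder_row(row, masks):
    """Reorder one matrix row; row[c] is the vector holding column c."""
    row = [list(v) for v in row]
    for (p, q), mask in zip(NETWORK, masks):
        row[p], row[q] = masked_swap(row[p], row[q], mask)
    return row
```

The masks are computed once per batch of sub-carriers and reused for all four matrix rows, so a 4 × 4 matrix needs 4 × 6 = 24 swap operations in this scheme.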

Although channel reordering on vectors requires more operations than channel reordering on scalars, the execution time on the scalable SIMD processor architecture is the same, because the overhead for swapping values can be efficiently hidden by LIW execution (see table 5.3). The runtime is determined by the memory access operations.

Table 5.3: Complexity comparison of scalar and vector channel matrix reordering for a 4 × 4 matrix

Description                              Load/Store     Swap           Runtime
                                         operations     operations     [Cycles]

Scalar: reordering by memory access      16+16          -              32

Vector: reordering by swapping values    16+16          24             32

In document Exploration of the scalability of SIMD processing for software defined radio (Page 150-154)