Algorithmic Modifications for VLSI Implementation


After each iteration of SA, the relation $T T^{\#} = I$ and, if computed with infinite precision, also the relation $G G^{\#} = I$ still hold true.

5.1.3 Algorithmic Modifications for VLSI Implementation

This section proposes two groups of algorithmic modifications to the original algorithm proposed in [Sey93], both targeting dedicated VLSI implementations. The first group reduces the computational complexity of SA-based LR compared to [Sey93]. The second group improves the BER performance of LRALD with minimal computational overhead. Some of the proposed modifications belong to both groups.

MMSE Preprocessing

The first modification enhances the BER performance and, furthermore, reduces the complexity of SA. Several publications [WS07, WSJM11] show that, similar to the LD, exploiting knowledge of the noise variance significantly enhances the BER performance of LRALD. Two principal possibilities to apply MMSE regularization exist. Either the basis transformation is calculated first and the MMSE filter matrix is then calculated for the new basis, or the basis transformation is calculated for an already regularized channel matrix. In [WS07] it was further shown that the latter scheme, which applies the MMSE regularization before the LR steps, outperforms detection schemes in which LR is computed without taking the presence of noise into account.

Furthermore, it was shown in [BSB10] that regularization reduces the computational complexity of LR itself. As for the LLL algorithm [WBKK04], MMSE LR and equalization using SA-based LRALD can be performed by augmenting the channel matrix $H$ with a weighted identity matrix, resulting in the augmented matrix $\bar{H}$, given by

$$\bar{H} = \begin{bmatrix} H \\ \sqrt{N_{tx}}\,\sigma I \end{bmatrix}. \qquad (5.12)$$

Fig. 5.2 shows the cumulative distribution of Seysen’s metric for regularized and non-regularized channel matrices of a 4 × 4 MIMO system at an SNR of 20 dB, before lattice reduction is applied. Since SA iteratively reduces Seysen’s metric, fewer iterations are required if the initial Seysen’s metric is already small. Therefore, applying MMSE regularization before SA significantly reduces the average computational complexity. In addition, computing SA based on the Gram matrix, as described in (5.6)-(5.11), reduces the computational overhead of regularization (i.e., applying MMSE) to $N_{rx}$ real-valued additions. Furthermore, contrary to the regularization of the QR decomposition presented in Section 3.2, no additional memory for the regularization is required if SA is computed based on the Gram matrix.

Figure 5.2 – Comparison of the cumulative distribution of Seysen’s metric for regularized and non-regularized channel matrices at an SNR of 20 dB.
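The low cost of regularization on the Gram matrix can be checked numerically. The following sketch assumes that the Gram matrix is formed as $G = H^H H$ and that the augmented matrix has the form given in (5.12). Under these assumptions, regularization amounts to one real-valued addition of $N_{tx}\sigma^2$ per diagonal entry of $G$. The function names and the small example matrix are illustrative and do not stem from the original work.

```python
import math

def hermitian_gram(h):
    """Return G = H^H H for a matrix given as a list of rows of complex numbers."""
    rows, cols = len(h), len(h[0])
    return [[sum(h[r][i].conjugate() * h[r][j] for r in range(rows))
             for j in range(cols)] for i in range(cols)]

def augment(h, sigma):
    """Stack sqrt(N_tx) * sigma * I below H (assumed form of (5.12))."""
    n_tx = len(h[0])
    weight = math.sqrt(n_tx) * sigma
    eye = [[weight if i == j else 0.0 for j in range(n_tx)] for i in range(n_tx)]
    return [list(row) for row in h] + eye

def regularize_gram(g, sigma):
    """Add N_tx * sigma^2 to each diagonal entry of an unregularized Gram matrix."""
    n_tx = len(g)
    out = [list(row) for row in g]
    for i in range(n_tx):
        out[i][i] += n_tx * sigma ** 2   # one real-valued addition per diagonal entry
    return out

# Illustrative 2 x 2 example (values chosen arbitrarily).
h = [[1 + 1j, 0.5j], [2 - 1j, -1 + 0j]]
sigma = 0.1
g_from_augmented = hermitian_gram(augment(h, sigma))
g_from_diagonal = regularize_gram(hermitian_gram(h), sigma)
```

Both routes yield the same regularized Gram matrix up to rounding, so the augmented matrix itself never has to be stored.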

Unit Lambda Updates

In order to evaluate the dynamic range of the update values $\lambda_{i,j}$, SA was applied to $10^6$ random 4 × 4 channel matrices, and the magnitudes of the real and imaginary parts of all possible $\lambda_{i,j}$ in each iteration of SA were evaluated. This analysis allows an accurate estimate of the distribution of the magnitudes of the update values. The result is shown in Fig. 5.3: more than 85.6% of the values have a magnitude of zero, and another 14.0% have a magnitude of one.

Based on the distribution of the candidate update values, we proposed in [BSB10] to reduce the computational complexity per iteration of SA by restricting the dynamic range of the update coefficient $\lambda_{i,j}$ to $a + b\sqrt{-1}$ with $a, b \in \{-1, 0, 1\}$. Only 0.35% of the real or imaginary parts of all possible $\lambda_{i,j}$ values are affected by this unit lambda limitation. The limitation of the dynamic range, however, makes it possible to avoid all divisions in (5.6) by first evaluating the inequality

$$\left| \Re\left\{ G^{\#}_{j,i} G_{j,j} - G_{j,i} G^{\#}_{i,i} \right\} \right| \le G^{\#}_{i,i} G_{j,j} \qquad (5.13)$$

and setting $\Re\{\lambda_{i,j}\} = 0$ if (5.13) holds true, otherwise setting the real part of $\lambda_{i,j}$ to

$$\Re\{\lambda_{i,j}\} = \operatorname{sign}\left\{ \Re\left\{ G^{\#}_{j,i} G_{j,j} - G_{j,i} G^{\#}_{i,i} \right\} \right\} \times \operatorname{sign}\left\{ G^{\#}_{i,i} G_{j,j} \right\}. \qquad (5.14)$$

Figure 5.3 – Distribution of the magnitude of the real and imaginary part of all possible $\lambda_{i,j}$.

An analogous evaluation has to be performed for the imaginary part of $\lambda_{i,j}$ [BSB10]. Additionally, restricting the dynamic range of $\lambda_{i,j}$ also simplifies the computation of $\Delta_{i,j}$, given in (5.7), and of the update steps performed in (5.9), (5.10), and (5.11), as all multiplications with $\lambda_{i,j}$ can be replaced by conditional additions or subtractions.
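The division-free evaluation of (5.13) and (5.14) can be sketched as follows. Since (5.6) is not repeated here, the reference function assumes the commonly used form $\lambda_{i,j} = \lfloor \tfrac{1}{2} ( G^{\#}_{j,i} / G^{\#}_{i,i} - G_{j,i} / G_{j,j} ) \rceil$ and clips each part to {−1, 0, 1}. This form, all function names, and the numerical values are assumptions made for illustration only.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of a real number."""
    return (value > 0) - (value < 0)

def unit_lambda_part(numerator, denominator):
    """One real-valued part of the unit lambda update, computed without division."""
    if abs(numerator) <= denominator:            # inequality (5.13): part is zero
        return 0
    return sign(numerator) * sign(denominator)   # equation (5.14)

def unit_lambda(g_dual_ji, g_dual_ii, g_ji, g_jj):
    """Division-free unit lambda update for one index pair (i, j)."""
    numerator = g_dual_ji * g_jj - g_ji * g_dual_ii
    denominator = g_dual_ii * g_jj
    return complex(unit_lambda_part(numerator.real, denominator),
                   unit_lambda_part(numerator.imag, denominator))

def clipped_reference(g_dual_ji, g_dual_ii, g_ji, g_jj):
    """Assumed form of (5.6) with two divisions, clipped to the unit range."""
    x = 0.5 * (g_dual_ji / g_dual_ii - g_ji / g_jj)
    clip = lambda v: max(-1, min(1, round(v)))
    return complex(clip(x.real), clip(x.imag))

# Illustrative values: complex off-diagonal and real diagonal Gram entries.
example = (0.9 + 0.1j, 1.0, -0.8 - 0.2j, 2.0)
lam = unit_lambda(*example)
```

For the example values, both functions return the same update. The division-free variant needs only the products that already appear in (5.13), plus a comparison and two sign evaluations.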

The restriction of the dynamic range of the candidate update values reduces the computational complexity of the update value calculation itself. It also eliminates the need for complex-valued multiplications during the basis updates in each iteration of SA. However, the number of iterations required to reduce the lattice basis may increase if unit lambda updates are used. Therefore, the impact of restricting the dynamic range of $\lambda_{i,j}$ on the total complexity of SA can only be evaluated for a specific index selection scheme. We evaluated the magnitude of the real and imaginary parts of the selected update value for a greedy index selection scheme. The resulting distribution, shown in Fig. 5.4, differs slightly from the distribution of all update values shown in Fig. 5.3. For the greedy index selection scheme, only 1.43% of the magnitudes of the real or imaginary parts of the chosen update values are larger than the restricted range. Nevertheless, we show in Section 5.1.3 and [BSB10] that for practical systems the increased number of LR iterations is more than compensated, in terms of computational complexity, by the reduction of multiplications and the complete removal of the otherwise required complex-valued division operations.

Figure 5.4 – Distribution of the magnitude of the real and imaginary part of chosen $\lambda_{i,j}$.
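The replacement of multiplications by conditional additions and subtractions can be illustrated with a short sketch. It assumes an update coefficient whose real and imaginary parts are restricted to {−1, 0, 1}. The function name `times_unit_lambda` is hypothetical and only mirrors what a hardware datapath with adders and sign selection would do.

```python
def times_unit_lambda(lam_re, lam_im, z):
    """Return (lam_re + j*lam_im) * z for lam_re, lam_im in {-1, 0, 1}.

    Only conditional additions and subtractions are used, no multiplication.
    """
    acc_re, acc_im = 0.0, 0.0
    if lam_re == 1:          # add z
        acc_re += z.real
        acc_im += z.imag
    elif lam_re == -1:       # subtract z
        acc_re -= z.real
        acc_im -= z.imag
    if lam_im == 1:          # add j*z = -Im(z) + j*Re(z)
        acc_re -= z.imag
        acc_im += z.real
    elif lam_im == -1:       # subtract j*z
        acc_re += z.imag
        acc_im -= z.real
    return complex(acc_re, acc_im)

product = times_unit_lambda(1, -1, 3 - 2j)   # same value as (1 - 1j) * (3 - 2j)
```

Each real-valued part of the coefficient selects between adding, subtracting, or skipping an operand, which is why the basis updates no longer require complex-valued multipliers.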

Index Selection Scheme

The index selection scheme significantly impacts both the BER performance and the computational complexity of SA. Obtaining the smallest possible Seysen’s metric with SA is only guaranteed if all possible sequences of selected index pairs are evaluated. In [ZMS10], a tree-search based approach was presented to reduce the enormous computational complexity of this exhaustive search. Note that there is no known upper bound on the number of iterations of SA. The authors of [ZMS10] further propose to perform data detection with multiple reduced lattice bases, which result from the tree-search approach, in order to compute approximate reliability information (i.e., soft outputs) for a subsequent channel decoder. However, computing multiple LRs and performing multiple detections significantly increases the computational cost of LRALD and entails large memory requirements. In summary, the additional hardware complexity of the tree-search approach seems economically unfavorable for most communication standards.

Therefore, in this section we focus on simpler local index selection schemes, which evaluate only one LR per OFDM tone. In [Sey93], two index selection schemes were proposed, which are based either on the possible update values $\lambda_{i,j}$ or on the corresponding potential reduction of Seysen’s metric $\Delta_{i,j}$. The first index selection scheme, called the greedy scheme and shown in (5.8), chooses the index pair $\{s, t\}$ that maximizes $\Delta_{i,j}$ among all calculated candidate updates. The other proposed index selection scheme, named “lazy” [Sey93], chooses a random index pair $\{s, t\}$ such that $\lambda_{s,t} \neq 0$. Hence, by omitting the calculation of $\Delta_{i,j}$, the “lazy” index selection scheme reduces the computational cost per iteration of SA. However, [BSB10] showed that the total number of iterations of SA required to reach a certain performance target is significantly larger for the “lazy” selection scheme than for the greedy index selection scheme.

Figure 5.5 – Total number of arithmetic operations per LR.
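The difference between the two selection rules can be sketched as follows. The candidate values are made up for illustration, and the dictionaries `lambdas` and `deltas` as well as both function names are hypothetical. Returning `None` when no non-zero update exists is an assumed convention for signaling termination, not something taken from [Sey93].

```python
import random

def greedy_select(deltas):
    """Return the index pair with the largest potential metric reduction."""
    return max(deltas, key=lambda pair: deltas[pair])

def lazy_select(lambdas, rng=random):
    """Return a random index pair with a non-zero update value, or None."""
    nonzero = [pair for pair, lam in lambdas.items() if lam != 0]
    return rng.choice(nonzero) if nonzero else None

# Illustrative candidates for three index pairs (i, j).
lambdas = {(0, 1): 1 + 0j, (1, 2): 0j, (2, 1): -1 + 1j}
deltas = {(0, 1): 0.3, (1, 2): 0.0, (2, 1): 0.8}

greedy_pair = greedy_select(deltas)                 # needs all delta values
lazy_pair = lazy_select(lambdas, random.Random(0))  # needs only lambda values
```

The greedy rule requires every $\Delta_{i,j}$ to be computed, whereas the lazy rule only inspects the update values, which reflects the trade-off between cost per iteration and number of iterations described above.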

Besides the choice of the index selection scheme itself, the number of candidate update values can also be restricted to reduce the complexity per iteration. This restriction was initially proposed in [BSB10]. In this thesis, we refer to the restriction of the number of candidate update values from $2N_{tx}$ to K as K-SA. Similar to the “lazy” selection scheme, K-SA significantly reduces the complexity per iteration but possibly increases the required number of iterations. This is because the locally optimal update value may not be among the evaluated index pairs, in which case the locally best update step is not performed.
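A minimal sketch of the K-SA idea is given below. How the K evaluated index pairs are chosen is not specified here, so the cyclic window starting at `offset` is purely a hypothetical choice, as are the function name and the example values.

```python
def k_greedy_select(pairs, k, offset, deltas):
    """Evaluate only K index pairs (cyclic window, assumed) and pick the best one."""
    window = [pairs[(offset + n) % len(pairs)] for n in range(k)]
    return max(window, key=lambda pair: deltas[pair])

# Illustrative candidates: the globally best pair (2, 1) lies outside the window.
pairs = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
deltas = {(0, 1): 0.1, (0, 2): 0.4, (1, 0): 0.1,
          (1, 2): 0.1, (2, 0): 0.1, (2, 1): 0.9}

restricted_pair = k_greedy_select(pairs, 3, 0, deltas)
full_pair = k_greedy_select(pairs, len(pairs), 0, deltas)
```

With K = 3, the selected pair is only the best within the window, which illustrates why fewer evaluations per iteration can lead to more iterations overall.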

The overall computational complexity in terms of arithmetic operations per LR is shown in Fig. 5.5 for the “lazy” index selection scheme and for several K-SA variants with greedy index selection, for both unrestricted and unit lambda updates. While the number of additions and multiplications using the unit lambda approach is slightly increased compared to the original algorithm with unrestricted update values, no multiplications with λ and no divisions are necessary at all. Therefore, the unit lambda approach is clearly beneficial on dedicated hardware. When comparing the “lazy” index selection scheme with the greedy index selection scheme for unit lambda updates, it is evident that the higher computational complexity per iteration of the greedy scheme is compensated by the lower number of iterations. Further, a comparison of K-SA variants with unit lambda updates shows that choosing K smaller than $2N_{tx}$ would be beneficial in terms of computational complexity. However, reducing K also results in longer run-times and in a more complicated termination scheme. The latter is due to the missing information on whether the index pairs that have not been evaluated would result in a further reduction of the lattice basis.

Figure 5.6 – Iteration limit induced BER performance loss.

Fixed Iteration Limit

Practical implementations of LR-based preprocessing circuits have to provide a certain guaranteed throughput to meet the latency constraints. Unfortunately, the number of iterations of SA, and thereby its throughput, varies. To our knowledge, there is no upper bound on the number of iterations of SA. Therefore, a maximum number of allowed iterations per LR has to be defined to ensure a minimum throughput of an SA-based LRALD VLSI implementation. For several antenna configurations, [ZMS10] presents the mean number of iterations of SA. The BER performance loss due to a run-time limitation for different selection schemes for a 4 × 4 MIMO communication system is shown in Fig. 5.6. It can be seen that increasing the run-time limit reduces the implementation loss. If K-SA is used, the value of K significantly determines the number of iterations required to achieve a certain performance goal. While the run-time limit evaluated in Fig. 5.6 is fixed for all OFDM tones, scheduling algorithms providing a guaranteed mean throughput can also be considered. Nevertheless, such scheduling schemes increase the complexity of the hardware without a significant gain in BER performance. Therefore, a fixed iteration limit for SA-based LRALD is proposed in this thesis.

Figure 5.7 – Coded bit error rate performance at 64-QAM modulation for non-punctured and punctured outliers with different code rates.
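The control flow of a fixed iteration limit can be sketched generically. The callback `step` stands for one SA iteration and is a placeholder: it is assumed to return the new state together with a flag indicating whether an update was performed. The toy step function below only counts down and serves to exercise the loop.

```python
def run_limited(step, state, max_iterations):
    """Apply step until it reports no update or the iteration budget is spent."""
    for iteration in range(max_iterations):
        state, updated = step(state)
        if not updated:
            return state, iteration, True    # terminated within the budget
    return state, max_iterations, False      # stopped by the fixed limit

def toy_step(remaining):
    """Placeholder for one SA iteration: performs an update while work remains."""
    if remaining > 0:
        return remaining - 1, True
    return remaining, False

finished = run_limited(toy_step, 3, 10)   # terminates on its own
truncated = run_limited(toy_step, 5, 2)   # cut off by the limit
```

Because the loop bound is known in advance, the worst-case latency per LR is fixed, at the cost of occasionally returning a basis that is not fully reduced.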

Impact and Mitigation of Finite-Constellation Effects

As described in Section 5.1.1, for LRALD the symbol constellation for the transmitted vector $x$ has to be relaxed to $x \in \mathbb{CZ}^{N_{tx}}$. When remapping the relaxed estimated transmit vector $\hat{x}$ to $\hat{s} \in \mathcal{X}^{N_{tx}}$, some entries of $\hat{s} = T\hat{x}$ may not be valid constellation points due to noise effects. The work in [SSB08] presents two different approaches to handle such elements. The most common one, also mentioned in [SSB08], maps such elements to the nearest constellation point by means of quantization and is suboptimal [WSJM11] in terms of detection error probability.

[...] reduce the impact of the finite-constellation issue on the BER performance. Usually, LRALDs do not provide any reliability information (i.e., soft information) about the detected bits. However, since the detection is performed in a transformed basis, it is known that all bits of a receive symbol vector have a significantly reduced reliability if one or more of its elements are outside the constellation. To account for this reliability loss, we propose to puncture all bits demapped from a receive symbol vector with at least one element outside the constellation. This puncturing essentially corresponds to associating them with an LLR of zero, indicating that the detector has no preference whether these bits are more likely to be zero or one. It should be noted that puncturing can be implemented with no additional overhead.

All non-punctured bits are forwarded, together with the punctured ones, to the Viterbi decoder. A hard-input Viterbi decoder has to support puncturing. If a soft-input Viterbi decoder is employed, all remaining bits are mapped to LLRs with equal absolute values. Fig. 5.7 shows the coded BER performance gain that is available with this technique.
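The puncturing rule can be sketched as follows for a soft-input decoder. The sketch assumes an unnormalized 64-QAM constellation with odd-integer levels from −7 to 7 per real dimension and an LLR sign convention in which a positive value favors a one. Both conventions, the function names, and the example vectors are assumptions for illustration.

```python
QAM64_LEVELS = {-7, -5, -3, -1, 1, 3, 5, 7}   # assumed unnormalized 64-QAM levels

def in_constellation(symbol, levels=QAM64_LEVELS):
    """Check whether both parts of a remapped symbol are valid levels."""
    return symbol.real in levels and symbol.imag in levels

def bits_to_llrs(hard_bits, s_hat, magnitude=1.0):
    """Map the hard bits of one symbol vector to LLRs, puncturing outliers.

    If any element of s_hat lies outside the constellation, all bits of the
    vector get an LLR of zero. Otherwise every bit gets the same magnitude.
    """
    if not all(in_constellation(symbol) for symbol in s_hat):
        return [0.0] * len(hard_bits)
    return [magnitude if bit else -magnitude for bit in hard_bits]

valid_llrs = bits_to_llrs([1, 0, 1], [3 - 5j, 1 + 1j])
punctured_llrs = bits_to_llrs([1, 0, 1], [9 - 5j, 1 + 1j])   # 9 is not a valid level
```

The only additional operation is the range check on the remapped symbols, which is needed for the quantization step in any case.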

In document Energy Efficient VLSI Circuits for MIMO-WLAN (Page 128-135)