Limitations - A MIMO Simulation Testbed - Efficiency and flexibility trade-offs for soft input

4.3 A MIMO Simulation Testbed

4.3.6 Limitations

The simulator briefly summarized in this chapter is dedicated to the investigation of iterative MIMO demapping and decoding. It realizes a coherent baseband model of the physical layer of a MIMO transmission and thus idealizes, for instance, analog components and neglects timing and frequency synchronization errors. Furthermore, neither dedicated RF effects such as transmitter-side impairments [169, 170, 193] nor real RF air interfaces are considered so far, since these topics are wide research areas of their own. Similarly, no MAC functionality is included. The current implementation of the configurable Matlab/Simulink model has a further limitation when the co-simulation involves not just a single hardware component, such as a demapper or a decoder, but a hardware simulation consisting of both units. In such a case, the simulation model needs to be structurally modified, since the data and control handling between the demapper and the decoder is currently fixed in the simulator.

Chapter 5

A Flexible ASIC for SISO STS Sphere Decoding

The overview of sphere-decoding algorithms and hardware architectures in Table 3.1 indicates that a wide variety of MIMO demapping architectures is already available. However, area and energy efficiency as well as flexibility are still serious issues. When this work started, the feasibility of a SISO sphere-decoding architecture was an open question, which is answered by the SISO STS sphere-decoding architecture this chapter focuses on. For such a SISO architecture, the various architectural and algorithmic efficiency metrics as well as flexibility can be traded off against each other in a much larger parameter space than for soft-output architectures, due to the effects of demapping/decoding iterations. In order to prove the feasibility of a SISO sphere-decoding architecture and to provide a reasonable bound for the trade-off between efficiency and flexibility, a first mandatory step is the design of a flexible-as-necessary and efficient-as-possible ASIC hardware architecture.

Promising work in the domain of hard-output and non-iterative soft-output depth-first sphere-decoding architectures is published in [24, 167]. Due to the superior error-rate performance of depth-first algorithms, the single tree-search (STS) approach proposed in [167] is adopted as the tree-traversal strategy for the SISO sphere-decoding architecture. As a first step towards this SISO architecture, a non-iterative soft-output architecture competitive with the one published in [167] is designed such that it can be extended in a second step, without major structural changes, to support soft-input information. The major challenge of adding SISO capabilities is the enumeration problem in the presence of a priori information. As discussed in Section 3.5.3.2, a valuable algorithm proposal for such an enumeration is the hybrid enumeration approach developed in [107]. Since the SISO sphere-decoding VLSI architecture proposed in this chapter has a major focus on soft-input processing and the efficient implementation of the hybrid enumeration, its recursively defined name is “Cae²sar, an efficient enumeration soft-input architecture”.

5.1 Overview on Design Principles of Sphere Decoder VLSI Architectures

The architectural design principles vary considerably depending on the underlying sphere-decoding algorithm, particularly in terms of parallelism and pipelining. The most significant differences in terms of parallelism can be identified between depth-first and breadth-first sphere-decoding approaches. Further minor differences between published architectures are related to whether the MIMO detection problem is formulated and implemented with complex numbers or with an equivalent real-valued representation. Other architectural implementation options, such as modifying the norms used for metric computations for the sake of more efficient VLSI implementations, have been explored for instance in [24]. Since the focus of this work is on the feasibility of a SISO sphere-decoding architecture rather than on the ultimate optimization of a single VLSI architecture, only the most significant differences between depth-first and breadth-first architectures are discussed in the following.

In a breadth-first tree search, no dependencies are present between the computations of tree-node metrics on a single tree level. Therefore, this operation can be parallelized very well, as demonstrated in various VLSI implementations [63, 161, 192, 207]. Furthermore, breadth-first approaches have a deterministic runtime which is proportional to the number of tree levels. Therefore, the computation of the partial metrics M_P(s_i) can be pipelined very well on the basis of a systolic array, with one cell processing one antenna level in a fixed number of cycles. However, the parallelism is limited at the point where a sorter unit needs to identify the K best candidates in K-best approaches. This dependency issue is eliminated by the FSD tree-search approaches, leading to more efficient VLSI architectures [14, 211]. The fine-grained parallelism achievable with breadth-first approaches on the levels of tree nodes and antennas provides a reasonable way to improve the performance of a MIMO detector. However, the overall area efficiency and energy efficiency are mostly independent of the degree of parallelism, since the performance gain is paid for by a proportional increase in area. Furthermore, it can be expected that the high number of nodes that are computed but later discarded implies area- and energy-efficiency penalties in breadth-first approaches.
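The level-by-level structure described above can be illustrated with a minimal software sketch of a K-best search. It assumes an upper-triangular system ỹ = R s + n, as obtained after a QR decomposition, and a partial metric that accumulates |ỹ_i − Σ_j r_i,j s_j|² per level. The function name `k_best_detect` and the list-based data layout are hypothetical and chosen for illustration only; this is not the architecture of any of the cited implementations.

```python
def k_best_detect(R, y, alphabet, K):
    """Breadth-first K-best search on an upper-triangular system y = R s + n.

    R is a list of rows, y a list of complex values, and alphabet holds the
    scalar constellation points. Returns (metric, symbols) of the best survivor.
    """
    n = len(y)
    # Each survivor is (partial_metric, symbols of the levels processed so far).
    survivors = [(0.0, ())]
    for i in range(n - 1, -1, -1):            # one tree level per antenna
        children = []
        for metric, tail in survivors:        # no dependencies: parallelizable
            for s in alphabet:
                cand = (s,) + tail            # symbols of levels i .. n-1
                interference = sum(R[i][i + k] * cand[k] for k in range(len(cand)))
                children.append((metric + abs(y[i] - interference) ** 2, cand))
        # The sorter is the only step that couples the nodes of one level.
        children.sort(key=lambda c: c[0])
        survivors = children[:K]
    return survivors[0]
```

The number of computed nodes depends only on K, the alphabet size and the number of levels, which reflects the deterministic runtime, while the sort is the step that limits node-level parallelism.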

Depth-first sphere-decoder implementations follow a different approach. Fine-granular parallelism on a node or antenna level is hardly achievable due to the data and control-flow dependencies of depth-first tree searches, which change tree levels in an unpredictable way. In this context, the most efficient approach is to sequentially examine one (tree) node per cycle (ONPC), as proposed in [24] for a hard-output depth-first VLSI architecture and in [167] for a soft-output STS VLSI architecture. An acceptable guaranteed worst-case runtime can be achieved by a combination of suitable constraints set by a simple additional scheduler unit, such as a maximum number of examined nodes, the sphere constraint r² and/or the clipping value Γ. Furthermore, such a scheduler can distribute the received symbol vectors to multiple parallel depth-first SD units in order to improve the throughput. This is a more coarse-grained level of parallelism than the node-level parallelism applied in K-best implementations, but it allows very similar throughput improvements. As for the fine-granular parallelism in K-best architectures, significant changes of the area- and energy-efficiency measures are not expected from this kind of coarse-grained parallelism.
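As an illustration of the sequential traversal and of a scheduler-style runtime constraint, the following sketch performs a hard-output depth-first search that examines one node per loop iteration and stops once a node budget is exhausted. It is a simplified software model under stated assumptions: the function name `depth_first_detect` and the parameter `max_nodes` are hypothetical, the children of a node are sorted exhaustively as a software shortcut for the enumeration, and neither soft outputs nor the clipping value Γ are modeled.

```python
def depth_first_detect(R, y, alphabet, max_nodes=None):
    """Hard-output depth-first sphere search on y = R s + n (R upper triangular).

    Returns (best_metric, best_symbols, examined_nodes). max_nodes mimics a
    scheduler constraint that bounds the worst-case runtime.
    """
    n = len(y)
    best_metric, best_symbols = float("inf"), None   # squared radius and its leaf
    examined = 0

    def children(i, metric, tail):
        """Child nodes on level i, sorted by ascending partial metric."""
        out = []
        for s in alphabet:
            cand = (s,) + tail
            interference = sum(R[i][i + k] * cand[k] for k in range(len(cand)))
            out.append((metric + abs(y[i] - interference) ** 2, cand))
        out.sort(key=lambda c: c[0])
        return out

    # Explicit stack: each entry is (level, iterator over the sorted children).
    stack = [(n - 1, iter(children(n - 1, 0.0, ())))]
    while stack:
        if max_nodes is not None and examined >= max_nodes:
            break                          # early termination by the scheduler
        level, it = stack[-1]
        node = next(it, None)
        if node is None:
            stack.pop()                    # level exhausted: step up
            continue
        examined += 1
        metric, cand = node
        if metric >= best_metric:
            stack.pop()                    # sorted order: siblings are worse too
        elif level == 0:
            best_metric, best_symbols = metric, cand   # leaf: shrink the radius
            stack.pop()                    # remaining siblings cannot improve
        else:
            stack.append((level - 1, iter(children(level - 1, metric, cand))))
    return best_metric, best_symbols, examined
```

The number of examined nodes varies with the channel and noise realization, which is why a budget such as `max_nodes` is needed to guarantee a worst-case runtime. If the budget is too small, no leaf may have been reached when the search stops.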

General area- and energy-efficiency comparisons between depth-first and breadth-first architectures that are based on the literature are very difficult. The reasons are the inconsistent error rates, channel codes, channel models, etc. used throughout the available publications. Furthermore, the variable runtime of depth-first MIMO detectors is often used in literature comparisons to tilt the result in one direction or the other, rarely with consistently defined points of operation. Therefore, such a comparison is skipped at this point. However, an approach for a fair MIMO detector analysis and comparison is developed and presented in Chapter 7 as a major contribution of this work. Based on this methodology, selected MIMO detectors will be analyzed and compared.

5.2 Arithmetic and Fixed-Point Implementation Aspects

In order to allow for an efficient hardware implementation, several numerical aspects (fixed-point representation, value ranges, etc.) and RTL design-style decisions play an important role. Furthermore, the soft-output base architecture is intended to provide a reasonably efficient basis, not the most highly optimized base architecture. Therefore, established concepts are selected such that a well maintainable and regular architecture can be implemented. Sophisticated implementation considerations are reserved for the soft-input extensions introduced later in Section 5.4 in order to prove the feasibility of an efficient depth-first SISO MIMO detector. Furthermore, this chapter only focuses on the implementation of a single detector instance and its characterization. Aspects of parallel MIMO detector units as described in Section 5.1 are considered in the analysis in Chapter 7.

Aside from these general aspects, several numerical considerations have to be taken into account. In the algorithmic domain, the constellation diagram is often normalized such that E[|s_i|] = 1 or E[‖s‖] = 1. However, this leads to irrational real and imaginary parts of the scalar complex constellation points, which require a significant number of fractional bits in fixed-point notation. Multiplications with such values result in unnecessarily complex hardware. Thus, it is more efficient to define constellation points on an integer grid, such as

\[
\mathrm{Re}\{s_i\},\ \mathrm{Im}\{s_i\} \in
\begin{cases}
\{-7,-5,-3,-1,+1,+3,+5,+7\} & \text{for 64 QAM} \\
\{-3,-1,+1,+3\} & \text{for 16 QAM} \\
\{-1,+1\} & \text{for QPSK.}
\end{cases}
\tag{5.1}
\]

This allows multiplications with constellation points to be replaced by a few simple add/subtract and constant-shift operations.
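The following sketch illustrates why the integer grid is cheap: every grid coordinate in (5.1) is of the form 2^k ± 1, so a product with it needs one constant shift, at most one addition or subtraction, and an optional negation. The function names are hypothetical, and the operand r is assumed to be an integer, for example the raw value of a fixed-point number.

```python
def mul_by_grid_coordinate(r, s):
    """Multiply integer r by a grid coordinate s in {±1, ±3, ±5, ±7}
    using only a constant shift, one add/sub and an optional negation."""
    mag = abs(s)
    if mag == 1:
        p = r
    elif mag == 3:
        p = (r << 1) + r      # 2r + r
    elif mag == 5:
        p = (r << 2) + r      # 4r + r
    elif mag == 7:
        p = (r << 3) - r      # 8r - r
    else:
        raise ValueError("not a grid coordinate of (5.1)")
    return -p if s < 0 else p


def cmul_by_symbol(r_re, r_im, s_re, s_im):
    """Complex product r * s for a constellation point s on the integer grid."""
    m = mul_by_grid_coordinate
    return (m(r_re, s_re) - m(r_im, s_im), m(r_im, s_re) + m(r_re, s_im))
```

For example, `cmul_by_symbol(3, 4, -5, 7)` evaluates (3 + 4j)(−5 + 7j) without a general multiplier.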

Data type use            QPSK 4×4   16 QAM 4×4   64 QAM 4×4   16 QAM 2×2   16 QAM 8×8
Re{r_i,j}, Im{r_i,j}     4.7        4.7          4.7          4.7          4.7
Re{ỹ_i}, Im{ỹ_i}         5.7        6.7          7.7          6.7          6.7
M_P, M_C, M_A            7.6        9.6          11.6         8.6          10.6
L^A_i,b, L^E_i,b         7.5        9.5          11.5         8.5          10.5

Table 5.1: Exemplary fixed-point word widths used for the SISO STS SD ASIC. The notation x.y corresponds to x integer and y fractional bits. In general, a QAM-order increase by a factor of four requires one more integer bit for ỹ_i per real/imaginary part and two more integer bits for M_P(s_i), M_A(s_i), M_C(s_i), L^A_i,b and L^E_i,b. Doubling M_T requires one more integer bit for M_A(s_i), M_C(s_i), L^A_i,b and L^E_i,b.

Furthermore, the division by N₀, or alternatively the multiplication with the inverse of N₀, in the computation of M_C(s_i) in (3.35) raises both complexity issues (area, critical path) and numerical stability issues. However, this division can be eliminated inside the demapper by scaling all metric and LLR computations by N₀, so that only the input and output LLR values are scaled by N₀. Therefore, the computation of M_C(s_i) in (3.35) is changed to

\[
M_C(s_i) = \left| \tilde{y}_i - \sum_{j=i}^{M_T} r_{i,j}\, s_j \right|^2
\tag{5.2}
\]

and input and output LLRs are redefined by:

\[
L^{A}_{i,b} = N_0\, \tilde{L}^{A}_{i,b}
\tag{5.3}
\]

\[
L^{E}_{i,b} = N_0\, \tilde{L}^{E}_{i,b}
\tag{5.4}
\]

\[
L^{E}_{\max} = M_T E_s \Gamma
\tag{5.5}
\]

with L̃^A_i,b and L̃^E_i,b now being the unmodified LLRs as used in Chapter 3. All derived metrics (M_A, M_P, L^D_i,b, etc.) change accordingly. Although this shifts the division or inverse multiplication outside the sphere decoder, this strategy can be advantageous, for instance, if the channel decoder is insensitive to a general scaling of LLR values.
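The effect of the N₀ scaling can be checked numerically with a small sketch for a single-antenna max-log demapper. Several points here are assumptions made only for illustration and are not taken from Chapter 3: the a priori metric is taken as M_A = Σ_b (|L^A_b| − x_b L^A_b)/2 with bit labels x_b ∈ {−1, +1}, the extrinsic LLR is formed as the difference of the two minimum metrics minus the a priori LLR, and the function name `max_log_llrs` and the QPSK labeling are hypothetical. Because both metric terms are linear in the scaling factor, the scaled demapper returns N₀ times the conventional LLRs without using N₀ internally.

```python
def max_log_llrs(y, r, symbols, la, inv_n0=1.0):
    """Max-log extrinsic LLRs for y = r * s + n.

    symbols maps each constellation point to its bit labels in {-1, +1};
    la holds one a priori LLR per bit. inv_n0 = 1.0 selects the scaled variant.
    """
    def metric(s, bits):
        m_c = abs(y - r * s) ** 2 * inv_n0                           # channel metric
        m_a = sum((abs(l) - x * l) / 2 for x, l in zip(bits, la))    # assumed M_A
        return m_c + m_a

    llrs = []
    for b in range(len(la)):
        m_minus = min(metric(s, x) for s, x in symbols.items() if x[b] == -1)
        m_plus = min(metric(s, x) for s, x in symbols.items() if x[b] == +1)
        llrs.append(m_minus - m_plus - la[b])                        # extrinsic part
    return llrs


QPSK = {1 + 1j: (+1, +1), 1 - 1j: (+1, -1), -1 + 1j: (-1, +1), -1 - 1j: (-1, -1)}
n0 = 0.5
y, r = 0.8 - 0.3j, 1.1
la_tilde = [0.4, -1.2]

# Conventional demapper: needs the multiplication with 1/N0 internally.
reference = max_log_llrs(y, r, QPSK, la_tilde, inv_n0=1.0 / n0)
# Scaled demapper: a priori LLRs are pre-scaled by N0, no N0 inside.
scaled = max_log_llrs(y, r, QPSK, [n0 * l for l in la_tilde])
```

In this sketch, each entry of `scaled` equals N₀ times the corresponding entry of `reference` up to floating-point rounding, so the division reappears only if the consumer of the LLRs requires unscaled values.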

Additionally, fixed-point number representations need to be carefully determined. In order to obtain a reasonably well maintainable RTL design, fixed-point operations including saturation and rounding or truncation are realized by means of the VHDL 2008 standard fixed-float library [17, 76]. On the basis of this library, a set of fixed-point data types has been defined as given in Table 5.1. The integer and fractional word widths of these data types are determined empirically by extensive algorithmic fixed-point simulation such that the BER performance degradation is negligible.
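For algorithmic fixed-point simulations of this kind, the x.y notation can be modeled by a simple quantizer with rounding and saturation. The sketch below is a generic software model and not the behavior of the VHDL library itself; the function name `to_fixed` is hypothetical, and it is an assumption that the sign bit of a signed type is counted among the x integer bits.

```python
import math


def to_fixed(value, int_bits, frac_bits, signed=True):
    """Quantize value to an x.y fixed-point number (x = int_bits, y = frac_bits)
    with round-half-up and saturation.

    Assumption: for signed types the sign bit is counted among the integer bits.
    """
    scale = 1 << frac_bits
    total = int_bits + frac_bits
    lo = -(1 << (total - 1)) if signed else 0
    hi = (1 << (total - 1)) - 1 if signed else (1 << total) - 1
    raw = math.floor(value * scale + 0.5)   # round to the nearest representable step
    raw = max(lo, min(hi, raw))             # saturate instead of wrapping around
    return raw / scale
```

With the 4.7 format, for example, `to_fixed(0.3, 4, 7)` yields 0.296875, and values beyond the representable range are clamped to 7.9921875 and −8.0 under the stated sign-bit assumption.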

Fixed-point saturation and rounding has been employed very carefully due to the major effects on the critical path or on logic optimizations during gate-level synthe-
