Sphere Decoder Algorithm - Advanced Transceiver Processing for Large MIMO Systems

The sphere decoder (SD) is a very efficient implementation of a maximum likelihood (ML) detector. It was introduced in the early eighties [103, 32] and was first adopted in the context of wireless communication in [130]. Schnorr and Euchner significantly reduced the related computational complexity by introducing an improved enumeration [110]. Still, the complexity of the sphere decoder increases exponentially with the number of mutually interfering information streams and the size of the modulation alphabet [63]. In addition, the processing complexity and latency of the SD can vary significantly, which presents another major challenge when implementing such approaches.

2.3.1 Schnorr-Euchner Sphere Decoder

This subsection revisits the principles of the depth-first Schnorr-Euchner SD with radius update, which is one of the most efficient forms of ML detection in the context of MIMO communication systems [110, 98]. The ML detection problem can be described as

$$s_{ML} = \arg\min_{s \in \mathcal{Q}^{N_t}} \| y - H s \|^2, \tag{2.10}$$

the solution of which would require an exhaustive search over all possible vectors s. The SD simplifies that problem by transforming the ML problem into an equivalent tree search [110]. In particular, by decomposing the MIMO channel matrix as H = QR, where Q is an orthonormal matrix and R is an upper triangular matrix, the ML problem can be transformed into

$$s_{ML} = \arg\min_{s \in \mathcal{Q}^{N_t}} \| \bar{y} - R s \|^2, \tag{2.11}$$

with $\bar{y} = Q^{*} y$. The tree has a height of $N_t$ and a branching factor of $|\mathcal{Q}|$. Each level $l$ of the tree is related to the symbol transmitted from a specific antenna. In addition, each node of a specific level $l$ is associated with a partial symbol vector $s_l = [s(l), \ldots, s(N_t)]$ containing all potential transmitted symbols down to this level, and it is characterized by its Partial Euclidean Distance (PD), defined as

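$$c(s_l) = \sum_{k=l}^{N_t} \left| \bar{y}(k) - \sum_{p=k}^{N_t} R(k, p) \cdot s_l(p) \right|^2 \tag{2.12}$$

$$c(s_l) = \left| \bar{y}(l) - \sum_{p=l}^{N_t} R(l, p) \cdot s_l(p) \right|^2 + \sum_{k=l+1}^{N_t} \left| \bar{y}(k) - \sum_{p=k}^{N_t} R(k, p) \cdot s_l(p) \right|^2 \tag{2.13}$$

$$c(s_l) = \left| \bar{y}(l) - \sum_{p=l}^{N_t} R(l, p) \cdot s_l(p) \right|^2 + c(s_{l+1}), \tag{2.14}$$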

with $R(k, p)$ being the element of $R$ at the $k$th row and the $p$th column. Then, the ML problem is translated into finding the leaf node which has the minimum $c(s_1)$. For depth-first sphere decoders with Schnorr-Euchner enumeration and radius reduction [110], the squared radius $r^2$ is initially set to infinity. Then, whenever a leaf $s_1$ is reached with a PD less than $r^2$, the squared radius is updated to $c(s_1)$. Upon meeting a node $s_l$, if $c(s_l) > r^2$, this node, its children, and this node’s siblings that have not been visited are all pruned. To define the search order, according to the Schnorr-Euchner enumeration, the children of a parent node are visited in ascending order of their PD [110]. The SD finds the ML solution at a substantially lower complexity than the exhaustive search. However, while it is very efficient for small numbers of interfering streams, it becomes impractical for large numbers of interfering streams for two reasons. First, candidate solutions (leaf nodes) found prior to finding the ML solution can have very large $r^2$ values, which prevents efficient tree pruning. Second, after finding the ML solution, the SD spends a significant amount of processing on verifying it before the tree search is terminated. In particular, the expected value of the PD of the ML solution is given by $E(c(s_{ML,1})) = N_t \cdot \sigma^2$. As a result, nearly all nodes residing at the higher layers of large SD trees have smaller PD values than the ML solution and will therefore be visited before the tree search is terminated.

[Figure 2.1 (plot): complexity in visited nodes (logarithmic axis, $10^2$ to $10^8$) versus SNR (25 to 30 dB) for three curves: SD-SE (indicative), SD-SE with genie initial radius, and SD-SE with genie initial radius and stopping criterion.]

Figure 2.1: Simulation results for the sphere decoding complexity of a 32 × 32 MIMO system with 16-QAM. The triangle denotes the complexity of the traditional sphere decoder; the solid line denotes the complexity of the sphere decoder with “genie” radius; the dashed line denotes the complexity of a sphere decoder with “genie” radius and ideal stopping criterion.

In Figure 2.1, the average number of visited nodes is depicted when detecting a 32 × 32 MIMO system with 16-QAM. At an SNR of 28 dB, three cases are considered: first, a traditional SD; second, a traditional SD helped by a “genie” that provides the radius of the ML solution; third, a traditional SD helped by a “genie” that provides the ML radius and stops the search after finding the ML solution. The results show that the vast majority of the SD’s processing complexity results from inefficient pruning prior to finding the ML solution. Still, sphere decoding spends a significant amount of processing complexity on validating the ML solution after finding it.
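The steps of this subsection (QR decomposition, PD recursion (2.14), Schnorr-Euchner ordering, and radius update) can be sketched in a few lines of Python. This is a minimal illustration and not the implementation evaluated in Figure 2.1. It assumes a real-valued square channel and a 4-PAM alphabet for brevity, uses 0-based layer indices (so the leaf level is `l == 0`), and all function names (`qr_gram_schmidt`, `sphere_decode`, `exhaustive_ml`) are hypothetical.

```python
import itertools
import math
import random

def qr_gram_schmidt(H):
    """QR of a real matrix (list of rows) via modified Gram-Schmidt.
    Returns the columns of Q and the upper triangular R."""
    n_rows, n_cols = len(H), len(H[0])
    Q_cols, R = [], [[0.0] * n_cols for _ in range(n_cols)]
    for j in range(n_cols):
        v = [H[i][j] for i in range(n_rows)]
        for k in range(j):
            R[k][j] = sum(q * x for q, x in zip(Q_cols[k], v))
            v = [x - R[k][j] * q for x, q in zip(v, Q_cols[k])]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        Q_cols.append([x / R[j][j] for x in v])
    return Q_cols, R

def sphere_decode(R, ybar, alphabet):
    """Depth-first search with Schnorr-Euchner ordering and radius update."""
    n = len(R)
    best = {"r2": math.inf, "s": None, "visited": 0}
    s = [0.0] * n

    def search(l, pd_parent):
        # Remove the interference of the symbols already fixed on this path.
        resid = ybar[l] - sum(R[l][p] * s[p] for p in range(l + 1, n))
        # Schnorr-Euchner enumeration: children in ascending order of PD.
        for a in sorted(alphabet, key=lambda a: abs(resid - R[l][l] * a)):
            pd = pd_parent + (resid - R[l][l] * a) ** 2  # recursion (2.14)
            if pd > best["r2"]:
                break  # prune this child and all not yet visited siblings
            best["visited"] += 1
            s[l] = a
            if l > 0:
                search(l - 1, pd)
            elif pd < best["r2"]:
                best["r2"], best["s"] = pd, s[:]  # leaf reached: shrink radius

    search(n - 1, 0.0)
    return best["s"], best["r2"], best["visited"]

def exhaustive_ml(H, y, alphabet):
    """Reference ML detector: evaluate (2.10) for every candidate vector."""
    n = len(H[0])
    def cost(s):
        return sum((y[i] - sum(H[i][j] * s[j] for j in range(n))) ** 2
                   for i in range(len(H)))
    s_best = min(itertools.product(alphabet, repeat=n), key=cost)
    return list(s_best), cost(s_best)

random.seed(1)
n = 4
alphabet = [-3.0, -1.0, 1.0, 3.0]
H = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
s_tx = [random.choice(alphabet) for _ in range(n)]
y = [sum(H[i][j] * s_tx[j] for j in range(n)) + random.gauss(0, 0.1)
     for i in range(n)]

Q_cols, R = qr_gram_schmidt(H)
ybar = [sum(q * v for q, v in zip(Q_cols[k], y)) for k in range(n)]  # Q^T y

s_sd, r2, visited = sphere_decode(R, ybar, alphabet)
s_ml, cost_ml = exhaustive_ml(H, y, alphabet)
total_nodes = sum(len(alphabet) ** k for k in range(1, n + 1))
print("SD:", s_sd, "exhaustive:", s_ml)
print("visited nodes:", visited, "of", total_nodes)
```

Because the children are sorted by PD, the first child that exceeds the current squared radius implies that all remaining siblings exceed it as well, which is what justifies the `break`. The `visited` counter corresponds to the complexity measure plotted in Figure 2.1.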

The problem becomes even more challenging when likelihood values need to be estimated to enable an APP channel decoder. Then, typically, two constrained SD problems need to be solved per transmitted bit [56], which renders the corresponding computational complexity even more impractical. Simplifications such as the list sphere decoder [56], single-tree structures [120], and LLR clipping [97] can reduce the related processing load.
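For reference, the two constrained problems per bit can be written out under the common max-log approximation. The following is a sketch under assumptions that are not stated in the text: equiprobable bits (no a priori information), circularly symmetric complex Gaussian noise of variance $\sigma^2$ per receive antenna, a sign convention in which positive values favor bit value 1, and the hypothetical symbol $\mathcal{X}_i^{(b)}$ for the subset of $\mathcal{Q}^{N_t}$ whose $i$th bit equals $b$:

$$L(b_i \mid y) \approx \frac{1}{\sigma^2} \left( \min_{s \in \mathcal{X}_i^{(0)}} \| y - H s \|^2 - \min_{s \in \mathcal{X}_i^{(1)}} \| y - H s \|^2 \right)$$

One of the two minima is the ML metric of (2.10); the other is the best metric among the vectors with the opposite bit value, which is why each bit calls for its own constrained search.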

2.3.2 Fixed Complexity Sphere Decoder

The FCSD [13, 10] is an approximate sphere decoder that, by design, enables highly parallel processing and guarantees a fixed processing throughput. To this end, it employs a static pruning approach that excludes the large majority of tree branches a priori, without knowledge of the received signal. FCSD separates the SD search tree into two parts. Specifically, the top $L$ layers of the tree belong to the full enumeration phase, and the lower $N_t - L$ layers belong to the single enumeration phase. In the full enumeration phase, the FCSD’s search tree is equivalent to that of the Schnorr-Euchner SD. Hence, here all nodes are fully expanded, which results in a branching factor of $|\mathcal{Q}|$. In contrast, in the single enumeration phase, only the node with the smallest PD is expanded and all other nodes are instantly pruned regardless of their actual PD. This aggressive pruning reduces the branching factor to one, but at the expense that the ML solution might be excluded from the search. To minimize the probability of excluding the ML solution, the authors of FCSD utilize a modified version of the detection ordering proposed in [134]. The aim of the proposed ordering is to maximize the post-processing SNR¹ of the layers belonging to the single enumeration phase by intentionally minimizing the post-processing SNR of the layers belonging to the full enumeration phase. The motivation behind this ordering is that the likelihood of excluding the ML solution in any given layer of the single enumeration phase decreases exponentially with the corresponding post-processing SNR, whereas in the full enumeration phase all possible symbol combinations are evaluated and therefore the ML solution cannot be pruned regardless of the post-processing SNR.

The number of layers in the full enumeration phase, here denoted as $L$, determines the performance-complexity trade-off of FCSD. If $L = 0$, FCSD reduces to the traditional SIC [134]; if $L = N_t$, FCSD is equivalent to an exhaustive search. In general, the FCSD tree has a total of $|\mathcal{Q}|^L$ different branches. Therefore, if the branches are processed independently (e.g., in parallel), FCSD’s total processing complexity is $O(N_t^2 \cdot |\mathcal{Q}|^L)$.
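The two phases can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions and not the design of [13, 10]: the system is real-valued with a 4-PAM alphabet, the upper triangular `R` and the rotated observation `ybar` are hand-picked example values, the FCSD detection ordering is omitted, layers are indexed from 0, and the function name `fcsd` is hypothetical.

```python
import itertools

def fcsd(R, ybar, alphabet, L):
    """Full enumeration on the top L layers, single enumeration below.
    Returns the best candidate, its PD, and the number of branches."""
    n = len(R)
    best_s, best_pd, branches = None, float("inf"), 0
    # Every combination of the top L symbols starts one independent branch,
    # so the outer loop could be distributed over parallel workers.
    for top in itertools.product(alphabet, repeat=L):
        branches += 1
        s, pd = [0.0] * n, 0.0
        for l in range(n - 1, -1, -1):
            resid = ybar[l] - sum(R[l][p] * s[p] for p in range(l + 1, n))
            if l >= n - L:
                s[l] = top[n - 1 - l]  # full enumeration phase
            else:
                # single enumeration phase: keep only the smallest-PD child
                s[l] = min(alphabet, key=lambda a: abs(resid - R[l][l] * a))
            pd += (resid - R[l][l] * s[l]) ** 2
        if pd < best_pd:
            best_s, best_pd = s, pd
    return best_s, best_pd, branches

# Hand-picked example values (assumptions, not taken from the text).
alphabet = [-3.0, -1.0, 1.0, 3.0]
R = [[2.0, 0.3, -0.5],
     [0.0, 1.5, 0.4],
     [0.0, 0.0, 0.6]]
ybar = [-0.2, -3.0, 1.1]
n = len(R)

for L in range(n + 1):
    s_hat, pd, branches = fcsd(R, ybar, alphabet, L)
    print("L =", L, "branches =", branches, "candidate =", s_hat, "PD =", pd)
```

With `L = 0` the loop runs once and performs plain successive interference cancellation; with `L = n` every leaf is evaluated, which matches the two limiting cases described above.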

¹ The post-processing SNR of layer l is defined as R(l, l)

The effect of $L$ on the uncoded BER of FCSD in Rayleigh fading MIMO channels has been investigated in [62]. In particular, the authors show that the error rate of FCSD converges to the error rate of the optimal SD when $L \geq \sqrt{N_t} - 1$. FCSD has been extended in [12] to provide likelihood information. In particular, the candidate solutions calculated in parallel are leveraged to calculate LLR values. Field Programmable Gate Array (FPGA) implementations of the FCSD are presented in [11, 135, 6, 74].
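To get a feeling for what the condition $L \geq \sqrt{N_t} - 1$ implies, the following short Python sketch computes the smallest integer $L$ that satisfies it, together with the resulting number of branches $|\mathcal{Q}|^L$. The helper name and the example system sizes are illustrative assumptions.

```python
import math

def min_full_enumeration_layers(n_t):
    """Smallest integer L with L >= sqrt(n_t) - 1, i.e. (L + 1)**2 >= n_t."""
    root = math.isqrt(n_t)      # floor(sqrt(n_t)), exact integer arithmetic
    if root * root < n_t:
        root += 1               # round up to ceil(sqrt(n_t))
    return root - 1

# (number of streams, alphabet size): example values only
for n_t, q in [(4, 4), (16, 16), (32, 16)]:
    L = min_full_enumeration_layers(n_t)
    print("N_t =", n_t, "|Q| =", q, "L =", L, "branches =", q ** L)
```

For a 32-stream system with 16-QAM this gives $L = 5$ and $16^5 = 1048576$ branches, which shows how quickly the branch count grows with the system size.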

2.3.3 Overview of Parallel Sphere Decoder Implementation

The exploitation of parallelism to reduce processing latency is not limited to variations of the FCSD: indeed, implementations of all kinds of sphere decoders (e.g., both breadth-first and depth-first) involve some level of parallelism. However, existing approaches either exploit a limited or inflexible level of parallelism, or the parallel processing takes place in a suboptimal, heuristic manner, without accounting for the actual transmission channel conditions.

Parallelism at a distance calculation level. Both depth-first [17, 53, 133] and breadth-first sphere decoder implementations, including the K-Best sphere decoders [18, 45, 81, 93, 111, 112, 132, 131], calculate multiple Euclidean distances in parallel any time they change tree level. In addition, after performing the parallel operations, the node or list of nodes with the minimum Euclidean distance needs to be found, which requires a significant synchronization overhead between the parallel processes. This level of exploited parallelism is fixed, predetermined, and inflexible, with strong dependencies that are related to the specific architectural design. In addition, in K-Best sphere decoders the value of K, which is predetermined, needs to increase for dense constellations and large numbers of antennas, making K-Best detection inappropriate for dense constellations and large MIMO systems.
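The per-level structure of a K-Best search, and the point at which the parallel distance calculations have to be synchronized, can be seen in the following Python sketch. It is a generic, sequential illustration and not any of the cited implementations; the real-valued system, the 4-PAM alphabet, the example values of `R` and `ybar`, the 0-based layer indices, and the function name `k_best` are all assumptions.

```python
def k_best(R, ybar, alphabet, K):
    """Breadth-first search that keeps the K smallest-PD nodes per level."""
    n = len(R)
    # Each survivor is (PD, path), where path[i] holds the symbol s[n-1-i].
    survivors = [(0.0, [])]
    for l in range(n - 1, -1, -1):
        candidates = []
        # The distance calculations below are mutually independent and are
        # the part that hardware implementations evaluate in parallel.
        for pd, path in survivors:
            resid = ybar[l] - sum(R[l][p] * path[n - 1 - p]
                                  for p in range(l + 1, n))
            for a in alphabet:
                candidates.append((pd + (resid - R[l][l] * a) ** 2, path + [a]))
        # Synchronization point: all candidates of this level must be
        # available before the K best can be selected.
        candidates.sort(key=lambda c: c[0])
        survivors = candidates[:K]
    best_pd, best_path = survivors[0]
    return best_path[::-1], best_pd  # reverse so that index 0 is layer 0

# Hand-picked example values (assumptions, not taken from the text).
alphabet = [-3.0, -1.0, 1.0, 3.0]
R = [[2.0, 0.3, -0.5],
     [0.0, 1.5, 0.4],
     [0.0, 0.0, 0.6]]
ybar = [-0.2, -3.0, 1.1]
n = len(R)

for K in (1, 4, len(alphabet) ** (n - 1)):
    s_hat, pd = k_best(R, ybar, alphabet, K)
    print("K =", K, "candidate =", s_hat, "PD =", pd)
```

Choosing `K` at least as large as $|\mathcal{Q}|^{N_t - 1}$ keeps every path alive and recovers the exhaustive search, while small values of `K` may discard the ML path early. This is the reason why `K` has to grow with the constellation size and the number of streams.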

Parallelism at a higher than a Euclidean distance level. Khairy et al. [73] use GPUs to run multiple low-dimensional (4 × 4) sphere decoders in parallel, but without parallelizing the tasks or the data processing involved in each. Jósza et al. [69] describe a hybrid ad hoc depth-first, breadth-first GPU implementation for low-dimensional sphere decoders. However, their approach lacks a theoretical basis and cannot prevent visiting unnecessary tree paths that are not likely to include the correct solution. In addition, since the authors do not propose a specific tree search methodology, their approach is not extendable to large MIMO systems. Yang et al. [142, 141] propose a Very-Large-Scale Integration (VLSI)/Complementary Metal Oxide Semiconductor (CMOS) multicore combined distributed/shared memory approach for high-dimensional SDs, where SD partitioning is performed by splitting the SD tree into subtrees. However, the partitioning is heuristic, and their approach requires interaction between the parallel trees, which makes it inflexible. In addition, the required communication overhead among the parallel elements makes the approach inefficient for very dense constellations and inappropriate for a GPU implementation.
