Integration of the STS SD into the Complete PHY Layer ASIC

4.2 Regularized and Sorted QR Decomposition for Tree-Search Based MIMO

4.2.2 Integration of the STS SD into the Complete PHY Layer ASIC

To relate the sorted QR decomposition to the remaining circuitry required for tree-search based MIMO detection, an RxSTProcessing module with a tree-search based MIMO detector is integrated into the PHY layer ASIC discussed in Section 2.2. A block diagram of such a space-time processing circuit is shown in Fig. 4.3.

Similar to the RxSTProcessing module with a linear MMSE detector, shown in Section 3.2.5, the first module of the RxSTProcessing module with a tree-search based MIMO detector separates the symbols corresponding to the training sequences and the symbols related to actual data detection. The training sequence related symbols are forwarded to the channel estimation, where the MIMO channel matrix is estimated, based on the known training sequence at the receiver and the received symbols of the transmitted training sequence.

The subsequent module performs the matrix decomposition. Contrary to the matrix decomposition module used for linear MMSE detection, the one used here performs the sorted QR decomposition introduced above. Its computational complexity is slightly higher than that of the unsorted QR decomposition, because the norm calculation and the sorting steps are not compensated by removing the SINR module.

The cache is modified as well, as no SINR information has to be stored for tree-search based MIMO detection. A post sorting module has to be instantiated in addition to the actual QR decomposition, because the first few rows of the matrix R are stored in the cache before the final sorting of the spatial streams is known. Hence, the sorting steps have to be applied to these rows as well. After the post sorting unit, the matrices R and Q are forwarded to the subsequent MIMO detector.

The tree-search based MIMO detection itself starts with multiplying the received symbol vector with the matrix Q in the nulling module in order to compute the vector ŷ. Subsequently, the received vector is scaled, such that the resulting constellation points are placed on an integer grid. This scaling of the receive vector reduces the computational complexity of the tree-search based MIMO detector.
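The nulling and scaling steps can be illustrated with a minimal sketch. It assumes that nulling computes ŷ = Q^H y and that the constellations are normalized to unit average energy, so that the scaling factors are 1, √2, √10, and √42 for BPSK, QPSK, 16-QAM, and 64-QAM; the function names and the toy matrices are hypothetical and not taken from the ASIC.

```python
import math

# Assumed scaling factors for unit-average-energy constellations:
# multiplying a normalized symbol by the factor puts it on odd integers.
GRID_SCALE = {
    "BPSK": 1.0,
    "QPSK": math.sqrt(2.0),
    "16-QAM": math.sqrt(10.0),
    "64-QAM": math.sqrt(42.0),
}


def matvec(A, x):
    """Plain matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def nulling(Q, y):
    """Return y_hat = Q^H y (conjugate transpose of Q applied to y)."""
    return [
        sum(complex(Q[i][k]).conjugate() * y[i] for i in range(len(Q)))
        for k in range(len(Q[0]))
    ]


def scale_to_grid(y_hat, modulation):
    """Scale the nulled vector so that the symbols lie on an integer grid."""
    c = GRID_SCALE[modulation]
    return [c * v for v in y_hat]


# Noise-free toy example with a hypothetical 2x2 decomposition H = Q R.
a = 1.0 / math.sqrt(2.0)
Q = [[a, a], [a, -a]]                    # unitary
R = [[2.0, 1.0 + 1.0j], [0.0, 1.5]]      # upper triangular
x = [3 - 1j, -1 + 3j]                    # 16-QAM points on the integer grid
c = GRID_SCALE["16-QAM"]
s = [v / c for v in x]                   # normalized transmit symbols
y = matvec(Q, matvec(R, s))              # received vector without noise
y_scaled = scale_to_grid(nulling(Q, y), "16-QAM")
```

In this noise-free case the scaled vector equals R x, so the detector can search directly over integer-valued candidates.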

As tree-search based MIMO detector, the soft-output STS SD core presented in [Wen10, SWB12] was used. The SD supports BPSK, QPSK, 16-QAM, and 64-QAM modulation and up to four spatial streams. For the enumeration of the nodes, the ordered ℓ∞-norm scheme is used [Wen10, SWB12]. In order to increase the AT efficiency of the sphere core, a pipeline interleaved processing scheme with three pipeline stages is used. The integrated sphere core performs the tree pruning scheme described above to reduce the average computational complexity of the detection algorithm. Furthermore, a runtime constraint can be imposed on the sphere core that terminates the processing in order to achieve a certain target throughput. Of course, early termination based on a runtime constraint can degrade the error rate performance of the SD.
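The interplay of tree pruning and a runtime constraint can be sketched in software. The following is a deliberately simplified hard-output depth-first search, not the soft-output STS algorithm of [Wen10, SWB12]: it sorts the children by partial distance instead of using the ℓ∞-norm enumeration, and the node budget `max_nodes` merely stands in for the runtime constraint. All names are hypothetical.

```python
def sphere_detect(R, y_hat, symbols, max_nodes):
    """Depth-first tree search over an upper-triangular R with radius pruning.

    Returns (best_vector, best_distance, visited_nodes). The search stops once
    max_nodes nodes have been visited, but never before a first leaf is found.
    """
    n = len(R)
    best = {"x": None, "dist": float("inf")}
    visited = 0
    x = [0] * n

    def search(level, partial_dist):
        nonlocal visited
        # Interference of the streams that were already decided (levels above).
        interference = sum(R[level][j] * x[j] for j in range(level + 1, n))
        children = []
        for s in symbols:
            err = y_hat[level] - interference - R[level][level] * s
            children.append((partial_dist + abs(err) ** 2, s))
        children.sort(key=lambda child: child[0])
        for dist, s in children:
            if visited >= max_nodes and best["x"] is not None:
                return  # early termination: node budget exhausted
            if dist >= best["dist"]:
                break   # pruning: all remaining children are at least as far
            visited += 1
            x[level] = s
            if level == 0:
                best["x"], best["dist"] = list(x), dist  # leaf: shrink radius
            else:
                search(level - 1, dist)

    search(n - 1, 0.0)
    return best["x"], best["dist"], visited


# Hypothetical real-valued 2x2 example with BPSK-like symbols.
R = [[2.0, 0.5], [0.0, 1.0]]
y_hat = [1.4, -0.8]
result, distance, nodes = sphere_detect(R, y_hat, [-1.0, 1.0], max_nodes=100)
```

With a small budget the function returns the best leaf found so far, which is where the potential error rate degradation of early termination comes from.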

Unfortunately, for a larger number of spatial streams and higher order modulation schemes, a single soft-output STS SD core [Wen10, SWB12] has too low a throughput to sustain the required system throughput (for four streams and 64-QAM modulation, up to 64^4 = 16,777,216 candidate vectors potentially have to be evaluated). Therefore, multiple sphere cores that process different OFDM tones in parallel have to be instantiated in order to achieve the system throughput with an appropriate error rate performance. Using N SD cores allows up to κ nodes to be visited per OFDM tone. The maximum number of visited nodes κ is proportional to

\[
\kappa \propto \frac{N f_{\mathrm{clk}}}{N_{\mathrm{sd}}}, \tag{4.19}
\]

where f_clk corresponds to the clock frequency and N_sd to the total number of OFDM tones per OFDM symbol (which is related to the system bandwidth). Equation (4.19) shows that increasing the number of sphere cores linearly increases the number of guaranteed visited nodes per OFDM tone. The error rate performance penalty of early termination by a runtime constraint can be reduced by increasing the number of guaranteed visited nodes.
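A short numerical sketch illustrates both the size of the search tree and the linear scaling in (4.19). The clock frequency and the tone count below are hypothetical placeholders; they cancel when two configurations are compared, so only the ratio is meaningful.

```python
def leaf_count(constellation_size, num_streams):
    """Number of candidate vectors (leaves) in the search tree."""
    return constellation_size ** num_streams


def node_budget_scale(num_cores, f_clk_hz, num_tones):
    """Quantity that kappa is proportional to in (4.19): N * f_clk / N_sd."""
    return num_cores * f_clk_hz / num_tones


# Hypothetical operating point, used only to form ratios.
F_CLK_HZ = 100e6
NUM_TONES = 52

leaves = leaf_count(64, 4)
gain_five_vs_two = (node_budget_scale(5, F_CLK_HZ, NUM_TONES)
                    / node_budget_scale(2, F_CLK_HZ, NUM_TONES))
gain_ten_vs_two = (node_budget_scale(10, F_CLK_HZ, NUM_TONES)
                   / node_budget_scale(2, F_CLK_HZ, NUM_TONES))
```

Going from two to five or ten cores multiplies the guaranteed node budget per tone by 2.5 or 5, which remains tiny compared to the number of leaves for four streams with 64-QAM.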

In order to distribute the workload on N sphere cores, a scheduler is instantiated prior to the actual sphere cores. The scheduler has two functions. First, it instantiates a FIFO in order to decouple the varying processing rate of the SD from the overall receiver. The FIFO also helps to instantly provide new OFDM tones to free sphere cores. The FIFO implemented in this thesis is based on flip-flops in order to have a high access bandwidth, which allows new OFDM tones to be assigned to several sphere cores in the same clock cycle. Second, besides decoupling the instantaneous processing rate of the SD from the remaining PHY layer, the average processing speed of the SD cores has to be matched to that of the overall receiver. To this end, the scheduler calculates a runtime constraint for each OFDM tone, given as a maximum number of clock cycles κ_rconst that a sphere core has available to visit the candidate nodes of that specific OFDM tone in order to find the leaf with the smallest ED, according to

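\[
\kappa_{\mathrm{rconst}}(n) = \min\Biggl(
\underbrace{\frac{N\,T_{\mathrm{OFDM}}}{T_c}}_{\text{total processing cycles available per OFDM tone}}
- \underbrace{\sum_{i\in\mathcal{F}} \kappa_{\mathrm{used}}(i)}_{\text{used processing cycles}}
- \underbrace{\sum_{l\in\mathcal{G}} \kappa_{\mathrm{rconst}}(l)}_{\text{potentially used processing cycles}}
- \underbrace{\kappa_{\min}\bigl(M-(n+1)\bigr)}_{\text{reserved processing cycles}}
\,;\; \kappa_{\max}\Biggr), \tag{4.20}
\]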

The number of clock cycles assigned to a sphere core to process an OFDM tone is upper bounded by κ_max. In addition, this approach differs from the scheduling approach in [Stu05] in that the SD core used in this implementation cannot be stopped once the decoder has started to process an OFDM tone. Hence, the runtime constraints have to determine when an OFDM tone is assigned to a sphere core. In the implementation of this thesis, the two values κ_min and κ_max can be configured over an AMBA bus to fine-tune the scheduler to the environment.
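The runtime constraint of (4.20) can be sketched as a small function. The interpretation of the sets is an assumption based on the labels in the equation: F is taken to hold the tones whose detection has finished, G the tones currently assigned to a core, and M the total number of tones to schedule. All parameter names and the numbers in the example are hypothetical.

```python
def runtime_constraint(n, num_cores, cycles_per_symbol, used_cycles,
                       pending_constraints, num_tones, kappa_min, kappa_max):
    """Cycle budget for OFDM tone n following the structure of (4.20).

    used_cycles         -- cycles actually spent on finished tones (set F, assumed)
    pending_constraints -- budgets of tones still being processed (set G, assumed)
    """
    total = num_cores * cycles_per_symbol          # total processing cycles
    used = sum(used_cycles)                        # used processing cycles
    pending = sum(pending_constraints)             # potentially used cycles
    reserved = kappa_min * (num_tones - (n + 1))   # reserved for later tones
    return min(total - used - pending - reserved, kappa_max)


# Hypothetical setup: 2 cores, 400 cycles per OFDM symbol, 52 tones.
first_tone = runtime_constraint(0, 2, 400, [], [], 52, 4, 64)
late_tone = runtime_constraint(40, 2, 400, [17] * 39, [64], 52, 4, 64)
```

In this example the first tone is limited by κ_max (64 cycles), whereas tone 40 only receives the 29 cycles that remain after accounting for used, pending, and reserved cycles.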

Subsequent to the scheduler, multiple parallel soft-output STS SD cores are instantiated in the design. Depending on the targeted error rate performance, the number of SD cores may differ. In this thesis, we implemented the RxSTProcessing module with two, five, and ten soft-output STS SD cores [Wen10, SWB12].

In parallel to the sphere cores, a dedicated single-stream BPSK core is integrated to perform detection of the header symbol. This BPSK core computes the estimated transmitted symbol twice for each OFDM tone: once without rotating the symbol prior to computing the estimate and once with the symbol rotated by 90 degrees. Both estimates are later used in the HT detection unit to distinguish between HT and non-HT frame formats.

The output of the sphere cores is then gathered in a collector module, which ensures that a sphere core can always forward its output. Being able to always output the computed estimates reduces the number of idle cycles in the sphere cores. Additionally, the collector circuit reports to the scheduler module the number of cycles used for each OFDM tone received from an SD core. Based on this feedback, the scheduler updates κ_rconst(n) for the subsequent OFDM tones. The collector circuit is further responsible for reverting the stream ordering resulting from the sorted QR decomposition for all OFDM tones. However, contrary to the implementation in [Wen10], the collector module does not reorder the OFDM tones itself, as the sorting of the OFDM tones is performed in the subsequent deinterleaver of the channel decoder.
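Reverting the stream ordering amounts to applying the inverse of the permutation chosen by the sorted QR decomposition. A minimal sketch, assuming the permutation is available as a list `order` in which `order[k]` is the original stream index detected at sorted position k (a hypothetical representation, not the ASIC's data format):

```python
def revert_stream_order(sorted_llrs, order):
    """Put per-stream outputs back into the original stream order."""
    restored = [None] * len(order)
    for k, original_index in enumerate(order):
        restored[original_index] = sorted_llrs[k]
    return restored


# Example: stream 2 was detected first, then stream 0, then stream 1.
restored = revert_stream_order(["llr_s2", "llr_s0", "llr_s1"], [2, 0, 1])
```

The tones themselves stay in the order in which the cores finish them, matching the statement that tone reordering is left to the deinterleaver.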

To decouple the varying processing speed of the sphere cores from the remaining circuits once more, a buffer provides the subsequent circuits with a constant arrival rate of the LLR values calculated by the sphere cores.

Table 4.1 – Implementation results of the RxSTProcessing module for a soft-output STS SD

Unit                              Area [kGE]                Area [µm²]       Memory [kBits]
Preprocessing
  Training/data separation        1                         3’143.8          -
  Channel estimation              75                        233’963.4        100.4
  QR decomposition feeder         5                         13’054.4         -
  QR decomposition                178                       557’471.0        -
  Cache feeder                    6                         17’073.2         -
  Cache                           51                        161’257.4
    Q & diag R & order cache                                                 50.2
    R cache                                                                  17.5
Detection
  Post sorting                    12                        37’741.0         -
  Nulling                         20                        63’856.0         -
  Constellation scaler            14                        44’762.5         -
  Scheduler                       17                        52’615.8         -
  STS SD cores                    203^a / 507^b / 1’026^c   642’477.8^a      -
  BPSK core                       3                         8’971.3          -
  Collector                       12^a / 26^b / 52^c        38’413.6^a       -
  Buffer                          42                        183’945.7        -
  HT detection                    15                        42’991.9         0.5
Controlling and monitoring        19                        59’456.2         -
Total                             673^a / 991^b / 1’536^c   2’109’231.8^a    168.6

a Two soft-output STS SD cores
b Five soft-output STS SD cores
c Ten soft-output STS SD cores

The final module of the space-time processing circuit is the HT detection circuit that is a reused sub-circuit of the LLR computation module from the linear detection implementation, presented in Section 3.2. The task of this HT detection unit is to distinguish, based on the header OFDM symbol, between HT and non-HT transmissions. This is achieved based on the accumulated LLRs for an entire detected OFDM symbol that are calculated by a dedicated BPSK core once rotated and once non-rotated. After the decision, the chosen format is forwarded, based on the cached values.
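The decision between the rotated and the non-rotated hypothesis can be sketched as follows. The sketch assumes that the BPSK LLR magnitude is proportional to the absolute real part of the (de-rotated) symbol and that the hypothesis with the larger accumulated magnitude wins; the exact metric used in the ASIC is not stated here, so this decision rule and the function name are assumptions.

```python
def detect_ht(header_symbols):
    """Return True if the header symbol looks like 90-degree rotated BPSK."""
    # Non-rotated hypothesis: the information sits on the real axis.
    metric_non_rotated = sum(abs(s.real) for s in header_symbols)
    # Rotated hypothesis: de-rotate by 90 degrees first (multiply by -j),
    # which maps the imaginary axis onto the real axis.
    metric_rotated = sum(abs((s * -1j).real) for s in header_symbols)
    return metric_rotated > metric_non_rotated


# Hypothetical equalized tones of one header OFDM symbol.
non_ht_header = [0.9 + 0.1j, -1.1 - 0.05j, 1.0 + 0.0j]
ht_header = [0.1 + 0.9j, -0.05 - 1.1j, 0.0 + 1.0j]
is_ht_for_non_ht_header = detect_ht(non_ht_header)
is_ht_for_ht_header = detect_ht(ht_header)
```

Accumulating over all tones of the OFDM symbol makes the decision robust against individual noisy tones.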

Area Results

In Tbl. 4.1, the size of the different components of the RxSTProcessing module of the PHY layer ASIC is shown for the example of two, five, and ten soft-output STS SD cores. It is clear that, already for two sphere cores, the total area of the space-time processing is increased by 68% compared to the implementation with a linear MMSE detector. Using five or even ten sphere cores, the area of the space-time processing increases by 148% and 284%, respectively.
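As a consistency check, the quoted percentages can be inverted to obtain the area of the linear MMSE based space-time processing that they imply. This baseline is derived here from the rounded totals and percentages and is not a figure stated in the text.

```python
# Totals in kGE from Table 4.1 and the quoted relative increases.
total_area_kge = {2: 673, 5: 991, 10: 1536}
increase_percent = {2: 68, 5: 148, 10: 284}

# Implied area of the linear MMSE based module for each configuration.
implied_baseline_kge = {
    cores: total_area_kge[cores] / (1 + increase_percent[cores] / 100)
    for cores in total_area_kge
}
```

All three configurations imply a baseline of roughly 400 kGE, so the three percentages are mutually consistent.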

In addition to the increase in total area, the area that processes data during the data detection phase increases significantly, by 189%, 415%, and 801%, respectively. Therefore, for long frames, the active silicon area, which determines the dynamic power consumption, rises significantly.

Figure 4.4 – Floorplan of the PHY layer ASIC implemented in 90 nm CMOS technology with two parallel soft-output STS SD cores and a sorted QR decomposition as preprocessing circuit.

Floorplans of the entire PHY layer ASIC with the proposed sorted QR decomposition based soft-output STS SD with two, five, and ten cores in the RxSTProcessing module are shown in Fig. 4.4, Fig. 4.5, and Fig. 4.6, respectively. Due to the larger detector and preprocessing circuit area, the area of the entire ASIC rose to 8.98 mm², 10.21 mm², and 12.51 mm², respectively. The RxSTProcessing module occupies 38.3%, 47.8%, and 58.6% of the core area of the ASIC for two, five, and ten sphere cores, respectively.

Compared to the linear detector, the area increase of the soft-output STS SD based PHY layer ASIC is substantial. In the next section, we evaluate whether this substantial area growth also results in a substantial error rate performance improvement.

Error Rate Performance of an Entire PHY Layer ASIC with a Sorted QR Decomposition Based Soft-Output STS SD

The same frame formats used in Section 3.2.5 to evaluate the performance of the linear detector based PHY layer have been used to evaluate the error rate performance of the entire PHY layer ASIC with two and with five soft-output STS SD cores¹.

¹ Ten soft-output STS SD cores no longer fit on the two Xilinx Virtex-5 XC5VLX330 FPGAs used to evaluate the

Figure 4.5 – Floorplan of the PHY layer ASIC implemented in 90 nm CMOS technology with five parallel soft-output STS SD cores and a sorted QR decomposition as preprocessing circuit.

Figure 4.6 – Floorplan of the PHY layer ASIC implemented in 90 nm CMOS technology with ten parallel soft-output STS SD cores and a sorted QR decomposition as preprocessing circuit.

Figure 4.7 – Packet error rate performance of the PHY layer ASIC with two STS SD cores in the MIMO detector using a TGnD channel model and a packet length of 1000 Bytes.

In Fig. 4.7, the error rate performance of the PHY layer ASIC with two soft-output STS SD cores is shown. Already for two sphere cores, a large improvement of the error rate compared to the linear detector can be seen for transmission schemes with high spatial stream order.

The same error rate analysis was performed with five soft-output STS SD cores; the results are shown in Fig. 4.8. As expected, for a low number of streams or a low modulation order, the error rate using five sphere cores is only slightly improved compared to the implementation with two STS SD cores. This minor benefit of additional sphere cores results from the fact that, for these transmission schemes, the search tree is small enough that two sphere cores already find the leaf with the smallest ED.

Contrary to the transmission schemes with a small number of leaves in the search tree, the error rate performance of the PHY layer ASIC is noticeably improved using five sphere cores for transmission schemes with high modulation order and a large number of spatial streams. This improvement results from the larger number of evaluated candidate nodes of the search tree. Nevertheless, five soft-output STS SD cores are insufficient to achieve ML diversity for 4×4 MIMO transmission using 64-QAM with code rate 5/6.

Figure 4.8 – Packet error rate performance of the PHY layer ASIC with five STS SD cores in the MIMO detector using a TGnD channel model and a packet length of 1000 Bytes.

In document Energy Efficient VLSI Circuits for MIMO-WLAN (Page 105-112)