
4.2 VLSI Architecture

4.2.3 LDPC Decoder Architecture

The core architecture of the LDPC decoder used in this thesis, shown in Figure 4.16, was first introduced by C. Roth in [128] and [129]. This design is specialised to decode the IEEE 802.11n LDPC codes [78], which are also utilised in the WiMAX standard [79]. The structure and properties of these quasi-cyclic codes are described in Section 2.3.1. In summary, three different sub-block sizes Z ∈ {27, 54, 81} and four code rates R ∈ {1/2, 2/3, 3/4, 5/6} are supported. Each of these twelve possible codes corresponds to a different prototype matrix Hp.

The basic computational entity of the decoder is the node computation unit (NCU), which implements the message update according to the layered OMS algorithm introduced in Section 2.3.1. Z parallel NCU instances are required to process one element of the matrix Hp per cycle. To this end, the corresponding Z LLR values are first read in parallel from the internal LLR memory and cyclically shifted by the value specified in the Hp entry. Then, the NCU applies the OMS algorithm and stores the new messages rc,v and the temporary updates for the variable nodes into dedicated memories. Once all the columns in one row of Hp have been processed, the updated LLRs λp,v = qv are written back to the internal LLR memory.
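The processing of one Hp row described above can be illustrated with a minimal functional sketch. It is not the NCU implementation of [128]: the hardware handles one Hp element per cycle, whereas the sketch processes a whole row at once. The function names, the left-rotation convention for the cyclic shift, the integer offset `beta` and the data layout are all assumptions made for illustration only.

```python
def cyclic_shift(values, shift):
    """Rotate a list left by `shift` positions (assumed shift convention)."""
    shift %= len(values)
    return values[shift:] + values[:shift]


def oms_check_update(q, beta):
    """Offset min-sum update for one check node with incoming messages q."""
    mags = [abs(x) for x in q]
    sign_prod = 1
    for x in q:
        if x < 0:
            sign_prod = -sign_prod
    # Smallest magnitude, its position, and the second smallest magnitude.
    idx_min = min(range(len(q)), key=lambda i: mags[i])
    min1 = mags[idx_min]
    min2 = min(m for i, m in enumerate(mags) if i != idx_min)
    out = []
    for i, x in enumerate(q):
        # Each edge sees the minimum over all *other* edges, minus the offset.
        mag = max((min2 if i == idx_min else min1) - beta, 0)
        sign = sign_prod if x >= 0 else -sign_prod
        out.append(sign * mag)
    return out


def process_layer(llr, r_old, row, Z, beta=1):
    """Update `llr` in place for one Hp row.

    row   : list of (block_column, shift) pairs, one per non-zero Hp entry
    llr   : list of block columns, each a list of Z LLR values
    r_old : previous check-to-variable messages, one length-Z list per entry
    """
    # Read, shift and remove the old check contribution (q = lambda - r_old).
    q_blocks = []
    for k, (col, shift) in enumerate(row):
        shifted = cyclic_shift(llr[col], shift)
        q_blocks.append([l - r for l, r in zip(shifted, r_old[k])])
    # Z independent check nodes, one per lane (one per NCU in hardware).
    r_new = [[0] * Z for _ in row]
    for z in range(Z):
        msgs = oms_check_update([blk[z] for blk in q_blocks], beta)
        for k, m in enumerate(msgs):
            r_new[k][z] = m
    # Write back the updated LLRs after undoing the shift.
    for k, (col, shift) in enumerate(row):
        updated = [q + r for q, r in zip(q_blocks[k], r_new[k])]
        llr[col] = cyclic_shift(updated, -shift)
    return r_new
```

With Z = 3 and a toy row of two entries, `process_layer(llr, r_old, [(0, 1), (1, 0)], 3)` returns the new messages and leaves the updated LLRs in `llr`; the new messages would be kept for the next iteration over the same row.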

The schedule of the decoding process goes through the matrix Hp row by row, at a rate of one element per cycle. To increase the throughput, the message update is split into two phases, named MIN and SEL; for details about the actual steps included in these two phases the reader is referred to [128]. The MIN and SEL operations can be computed in parallel within the NCUs. To minimise the data dependencies, the MIN phase operates on the Hp row that follows the one currently handled by the SEL phase; the remaining dependencies are dealt with by stalling the MIN operation when necessary.

The detailed schedule of these operations, which depends on the entries of Hp, is optimised at design time to minimise the total cycle count by avoiding as many dependencies as possible, possibly by processing the columns of Hp out of order. The precomputed execution sequence is then loaded into a small memory located in the control unit of the decoder and is simply stepped through sequentially at runtime. By loading the proper schedule, the decoder can process any quasi-cyclic LDPC code that fits into the available hardware resources.

Figure 4.16: LDPC decoder and writeback unit architecture.

The main challenge in the implementation of LDPC decoding is the design of an efficient memory subsystem, since large amounts of data are frequently moved to and from the computational units, which on the other hand involve rather simple combinational logic. This issue is first addressed by optimising the execution sequence to reduce the memory accesses. Secondly, to achieve the required bandwidth while keeping the power consumption low, all the memories are split into three banks, which are selectively activated depending on the parameter Z currently in use. If Z = 81, all three banks are active, whereas for Z = 54 and Z = 27 one and two banks, respectively, are turned off by means of clock gating; the same technique is applied to the NCUs. Furthermore, all the memories are standard-cell based [101] so that the clock gating can be extended to the granularity of a single bit. Finally, sign-magnitude arithmetic is extensively used in the decoder to reduce the switching activity and hence the power consumption.
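The bank selection rule can be summarised in a few lines. The sketch below assumes that each of the three banks serves 27 of the Z parallel lanes, which is consistent with the supported sub-block sizes but is not stated explicitly in the text; the names `BANK_WIDTH` and `active_banks` are hypothetical.

```python
BANK_WIDTH = 27   # assumption: each bank serves 27 parallel lanes
NUM_BANKS = 3


def active_banks(Z):
    """Return one clock-enable flag per bank for sub-block size Z."""
    if Z not in (27, 54, 81):
        raise ValueError("unsupported sub-block size")
    needed = Z // BANK_WIDTH
    # Banks beyond the needed count are clock gated (flag False).
    return [bank < needed for bank in range(NUM_BANKS)]
```

For example, `active_banks(54)` enables the first two banks and gates the third, matching the case in which one bank is turned off.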

Around this core architecture, a suitable interface was designed to fit the LDPC decoder in the IDD receiver. The main issue that has to be considered at this level is that the decoder computes the a posteriori LLRs λp,dec instead of the extrinsic LLRs λe,dec required by the SD-based detector to optimise the communication performance.

For this reason, the shared LLR memory between the detector and the decoder is not used to store intermediate decoding results, so that the a priori LLRs λa,dec are not overwritten and can be used to compute the extrinsic LLRs λe,dec once the decoding is complete. Therefore, in the initial decoding iteration the read operations requested by the NCUs are redirected to the shared LLR memory to fetch the λa,dec values, while the internal memories of the decoder are subsequently used to store temporary results. In the last iteration, the a posteriori LLRs output by the NCUs are picked up by the LDPC writeback unit. This component, also shown in Figure 4.16, first undoes the cyclic shift introduced at the input of the NCUs and requests the corresponding a priori LLRs λa,dec from the shared LLR memory. In the following two clock cycles, the extrinsic LLRs are first computed by subtracting λa,dec from λp,dec and then written back to the shared LLR memory.
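The data path of the writeback unit can be sketched functionally as follows. The sketch ignores the cycle-level timing and the memory interface, and it assumes that the NCU input shift is a left rotation, so that undoing it is a right rotation by the same amount; the function name and argument layout are hypothetical.

```python
def writeback_extrinsic(lambda_p_shifted, lambda_a, shift):
    """Compute extrinsic LLRs for one block of Z values.

    lambda_p_shifted : a posteriori LLRs as output by the NCUs (still shifted)
    lambda_a         : a priori LLRs fetched from the shared LLR memory
    shift            : cyclic shift that was applied at the NCU input
    """
    Z = len(lambda_p_shifted)
    # Undo the input shift (assumed left rotation) with a right rotation.
    k = (Z - shift % Z) % Z
    lambda_p = lambda_p_shifted[k:] + lambda_p_shifted[:k]
    # Extrinsic LLR = a posteriori LLR minus a priori LLR.
    return [p - a for p, a in zip(lambda_p, lambda_a)]
```

Because the a priori values are read from the shared memory only at this point, the subtraction is possible without keeping a second copy of λa,dec inside the decoder.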

Although not shown in Figure 4.16 for simplicity, the LDPC writeback unit also takes care of most operations related to selective IDD (see Section 4.1.2). Each a posteriori LLR output by the decoder is compared with the threshold Λp set for symbol-wise on-demand detection and the single-bit result is stored in a dedicated memory. The signs of the a priori LLRs λa,dec, which are the old extrinsic LLRs λe,det^old from the detector point of view, are also saved in a separate memory before being overwritten by the new extrinsic LLRs λe,dec computed by the decoder. Later on, the detector reads the result of the comparison for all the bits belonging to the symbol vector that is currently being processed. If condition (4.5) is satisfied, SD is not applied to the symbol vector and the new LLRs are computed according to (4.6) and output after a single cycle. Therefore, the hardware costs of symbol-wise on-demand detection amount to Z 5-bit comparators in the LDPC writeback unit and two memories of 1944 bits (i.e., the maximum Nc specified by the IEEE 802.11n LDPC codes [78]) each, which represent a small area overhead in the context of the MIMO IDD receiver.
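The two single-bit memories filled by the writeback unit can be illustrated with the sketch below. It assumes that the comparison is performed on the LLR magnitude and that a stored 1 means the threshold is reached; the exact form of the comparison follows from condition (4.5), which is not reproduced here. The function name and the bit encoding are hypothetical.

```python
N_C_MAX = 1944  # maximum codeword length Nc of the IEEE 802.11n LDPC codes


def record_selective_idd_bits(lambda_p, lambda_a, threshold):
    """Return (threshold flags, a priori sign bits) for a block of LLRs."""
    # Assumption: the magnitude of the a posteriori LLR is compared.
    flags = [1 if abs(p) >= threshold else 0 for p in lambda_p]
    # Sign bit of the a priori LLR: 1 for negative, 0 otherwise (assumed).
    signs = [1 if a < 0 else 0 for a in lambda_a]
    return flags, signs


# Storage for both single-bit memories, in bits.
OVERHEAD_BITS = 2 * N_C_MAX
```

With one bit per coded bit in each memory, the storage overhead is 2 × 1944 = 3888 bits.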