2.3 Preprocessing Architecture for MIMO-OFDM Detectors
2.3.2 Application Specific Processor Versus Pipelined Architecture
In the literature, two main architectures have been proposed to implement the preprocessing circuits of MIMO detectors. Many implementations use processor-like architectures [Lü10, Stu09, Bur06, Wen10, SSLF08] to perform MIMO detection and the corresponding preprocessing algorithms. Other publications favor pipelined architectures for MIMO detectors [SYG13, SSB10, SB13].
In this subsection, we evaluate the general properties of these two architectures in the context of the PHY layer ASIC described in Section 2.2.2. Note again that the OFDM tones arrive at the RXSTProcessing module sequentially but in bit-reversed order, and that all channel matrices are assumed to be independent of each other.
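The bit-reversed arrival order mentioned above can be illustrated with a minimal sketch. It assumes that the number of tones is a power of two; the function names and the example size of 8 tones are illustrative choices and are not taken from the ASIC described in Section 2.2.2.

```python
def bit_reverse(index: int, num_bits: int) -> int:
    """Reverse the lowest `num_bits` bits of `index`."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(num_tones: int) -> list[int]:
    """Return the tone indices in bit-reversed order (assumes a power of two)."""
    if num_tones <= 0 or num_tones & (num_tones - 1):
        raise ValueError("num_tones must be a positive power of two")
    num_bits = num_tones.bit_length() - 1
    return [bit_reverse(i, num_bits) for i in range(num_tones)]


print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Consecutive arrivals therefore belong to tones that are far apart in frequency, which fits the assumption that each channel matrix is processed independently of the others.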
Application specific processor architectures: A common class of architectures for MIMO detectors and the related preprocessing circuits proposed in the literature are application specific instruction-set processor (ASIP) architectures. Several QR decompositions [Lü10], SVDs [SSLF08], and even lattice reduction implementations [BSS+10] have been proposed with this type of architecture. A processor-like circuit computes all tasks required to decompose the channel matrix of one OFDM tone. To this end, a program (which may be hard-coded) is executed that specifies all operations required by the preprocessing algorithm. In an initial step, the entire channel matrix is loaded into a dedicated data memory. In the subsequent processing steps, the processor fetches the data required by each instruction of the program from the data memory and forwards them through an interconnection network to dedicated processing elements (PEs). The output of the PEs is then stored back into the data memory. When all operations associated with the preprocessing algorithm have been executed, the resulting matrices are output by the ASIP.
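The load, fetch, execute, and write-back flow described above can be mimicked with a toy model. This is a sketch under simplifying assumptions, not one of the cited designs: the PE set, the instruction format, the memory addresses, and the real-valued 2x2 example are all hypothetical, and the hard-coded program only computes the norm of the first column and normalizes that column (the first step of a Gram-Schmidt-style QR decomposition).

```python
import math

# Hypothetical processing elements (PEs); real ASIPs use fixed-point, complex-valued units.
PES = {
    "mac": lambda acc, a, b: acc + a * b,  # multiply-accumulate
    "sqrt": lambda a: math.sqrt(a),
    "div": lambda a, b: a / b,
}

# Hard-coded program: (PE name, source addresses, destination address).
PROGRAM = [
    ("mac", ("zero", "h00", "h00"), "acc"),
    ("mac", ("acc", "h10", "h10"), "acc"),
    ("sqrt", ("acc",), "r00"),
    ("div", ("h00", "r00"), "q00"),
    ("div", ("h10", "r00"), "q10"),
]


def run_asip(program, channel_matrix):
    """Execute `program` on one channel matrix and return the final data memory."""
    # Initial step: load the entire channel matrix into the data memory.
    memory = {
        f"h{r}{c}": value
        for r, row in enumerate(channel_matrix)
        for c, value in enumerate(row)
    }
    memory["zero"] = 0.0
    for pe_name, sources, dest in program:
        operands = [memory[s] for s in sources]  # fetch operands from data memory
        memory[dest] = PES[pe_name](*operands)   # execute on a PE, store the result
    return memory


result = run_asip(PROGRAM, [[3.0, 1.0], [4.0, 2.0]])
print(result["r00"], result["q00"], result["q10"])
```

Every instruction reuses the same memory and the same small set of PEs, which is why a single core is compact but needs many cycles per matrix.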
A common strategy for optimizing ASIPs that solve many independent problems, such as the decomposition of all channel matrices in IEEE 802.11n, is illustrated in Fig. 2.14 [SBFB07, SSLF08]. An initial implementation is optimized such that its area-time (AT) efficiency is enhanced. This optimization relies on architectural transformations (e.g., time-sharing, iterative decomposition, replication) as well as on special PEs dedicated to specific operations. The optimized ASIP core is then replicated in order to meet the target throughput required by the PHY layer implementation. Replication is possible because the problems to be solved are independent of each other and can hence be processed in parallel.
In Fig. 2.14a, the optimization strategy is illustrated for a “good” case. We assume that a large set of independent problems has to be solved and that the AT-optimized ASIP core has a small area. Under such conditions, replicating the ASIP core allows the target throughput to be met with minimal area overhead.
Contrary to the case shown in Fig. 2.14a, it may not always be possible to achieve a small area by optimizing the AT-efficiency of an ASIP core. As illustrated in Fig. 2.14b, replication then offers only a coarse granularity. In the worst case, shown in Fig. 2.14b, a single ASIP falls just short of the system specification, so that replicating the ASIP core results in a large area overhead.
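The granularity argument can be made concrete with a small calculation. The sketch below is illustrative only: the function name is hypothetical and the throughput and area figures are made-up numbers in arbitrary units, not values from Fig. 2.14.

```python
import math


def replicate_cores(target_throughput, core_throughput, core_area):
    """Return (number of cores, total area, relative excess capacity)."""
    # Cores come in whole units, so round up to reach the target.
    num_cores = math.ceil(target_throughput / core_throughput)
    total_area = num_cores * core_area
    # Excess capacity of the replicated cores relative to the target.
    excess = num_cores * core_throughput / target_throughput - 1.0
    return num_cores, total_area, excess


# "Good" case (cf. Fig. 2.14a): small core, fine replication granularity.
print(replicate_cores(target_throughput=100, core_throughput=9, core_area=1))
# "Bad" case (cf. Fig. 2.14b): one large core falls just short of the target.
print(replicate_cores(target_throughput=100, core_throughput=95, core_area=10))
```

With these assumed numbers, the small core needs 12 instances and overshoots the target by about 8%, whereas the large core needs a second instance and overshoots by about 90%, so nearly half of the instantiated area is unused.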
Pipelined Architectures: Another class of architectures proposed for preprocessing circuits and MIMO detectors are pipelined architectures. Many different names have been used in the literature for this type of architecture. Some publications refer to it as pipelined [SSG08, SG09], others call it macro-pipeline [BSG07, SSB10] or systolic array [Kun82, GK82, HC92, HK95, KCD05, WBY11]. Common to all these architectures is that they divide the algorithm to be implemented into multiple tasks and assign the individual tasks to dedicated modules that are separated by pipeline stages. The granularity at which the algorithm is divided varies between the different types of pipelined architectures. Each module communicates only with a limited number of “neighbor” modules. No module performs the entire matrix decomposition alone, as a single ASIP core does.
In Fig. 2.15, two high-level examples of pipelined architectures are illustrated. On the left, Fig. 2.15a illustrates a linear pipelined architecture. The circuit is composed of several concatenated modules, named M-Al, M-Bl, …, M-Nl. Each module performs a dedicated task on the data stored in its internal memory. After completing its task, the module exchanges the memory content with its neighbors. In some implementations, each module computes its task within one clock cycle; in others, each module takes several clock cycles. Depending on the algorithm performed by the entire architecture, some of the modules may be of the same type.
Figure 2.15 – Examples of pipelined architectures: (a) linear pipelined architecture; (b) mesh systolic array architecture.
constructed based on this principle, as illustrated in Fig. 2.15b. Similar to the modules in the linear architecture, some of the modules M-Am, M-Bm, …, M-Nm in the mesh-like architecture may share a common type.
While the ASIP solves the entire problem (i.e., a matrix decomposition) on the same hardware, the pipelined architecture is composed of dedicated modules, each optimized for its own sub-task. Hence, the problem of decomposing a matrix is divided into “simpler” sub-problems, each assigned to a module that solves it with dedicated hardware. The PEs within each module are optimized for the specific sub-problem assigned to that module.
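The division into sub-tasks can be sketched as a cycle-by-cycle model of a linear pipeline. This is a minimal illustration under stated assumptions: each module holds a single item in its internal memory, completes its task in one step, and hands the result to its right-hand neighbor; the three arithmetic stages are placeholders and do not implement a matrix decomposition.

```python
def run_linear_pipeline(stages, inputs):
    """Stream `inputs` through `stages`, advancing all modules once per step."""
    registers = [None] * len(stages)  # internal memory of each module
    outputs = []
    # Feed all inputs, then insert empty slots (None) to flush the pipeline.
    for item in list(inputs) + [None] * len(stages):
        # The result held by the last module leaves the pipeline.
        if registers[-1] is not None:
            outputs.append(registers[-1])
        # Update back to front so each module sees its neighbor's old content.
        for i in range(len(stages) - 1, 0, -1):
            previous = registers[i - 1]
            registers[i] = stages[i](previous) if previous is not None else None
        registers[0] = stages[0](item) if item is not None else None
    return outputs


# Placeholder sub-tasks; in a real design each stage would be a dedicated module.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(run_linear_pipeline(stages, [1, 2, 3]))  # [1, 3, 5]
```

Once the pipeline is full, one result leaves per step even though each item passes through all modules, which is where the throughput advantage over a single ASIP core comes from.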
Pipelined implementations of the matrix decompositions used for MIMO detector preprocessing usually occupy a larger area than a single ASIP instance running the same algorithm. Nevertheless, if the number of matrices to decompose is large enough, a pipelined architecture can meet the system requirements with a smaller area than replicated ASIPs. Unfortunately, designing a pipelined architecture that meets the system requirements exactly is challenging.