Over the last 15 years, Low-Density Parity-Check (LDPC) codes have assumed increasing importance in the channel coding arena, mainly because their error correction capability allows efficient coding close to the Shannon limit. They were invented by Robert Gallager (MIT) in the early sixties [46] but were never fully exploited because of the overwhelming computational requirements at that time. Naturally, the fact that their patent has now expired has shifted the attention of the scientific community and industry away from Turbo codes [9,10] towards LDPC codes. Mainly for this reason, and also because advances in microelectronics have allowed the development of hardware solutions for real-time LDPC decoding, they have been adopted by modern communication standards. Important examples of these standards are: the WiMAX IEEE 802.16e, used in wireless communication systems in Metropolitan Area Networks (MAN) [64]; the Digital Video Broadcasting – Satellite 2 (DVB-S2), used in long distance wireless communications [29]; the WiFi 802.11n; the ITU-T G.hn standard for wired home networking technologies; the 10 Gigabit Ethernet IEEE 802.3an; and the 3GPP2 Ultra Mobile Broadband (UMB), part of the evolution of the 3G mobile system, where the introduction of LDPC codes in 4G systems has recently been proposed, as opposed to the use of Turbo codes in 3G.
Until very recently, solutions for LDPC decoding were exclusively based on dedicated hardware such as Application Specific Integrated Circuits (ASIC), which represent non-flexible and non-reprogrammable approaches [18,70]. Moreover, the development of ASIC solutions for LDPC decoding consumes significant resources, carries a high Non-Recurring Engineering (NRE) penalty, and involves long development periods (penalizing time-to-market). An additional restriction in currently developed ASICs for LDPC decoding is the use of fixed-point arithmetic. This introduces quantization effects [101] that limit coding gains, error floors and Bit Error Rate (BER). Furthermore, the complexity of Very Large Scale Integration (VLSI) parallel LDPC decoders increases significantly for long codes, such as those used in the DVB-S2 standard. In this case, the routing complexity grows to proportions that make the design of such architectures very difficult.
In recent years, we have seen processors scaling up to hundreds of millions of transistors. Memory and power walls have shifted the paradigm of computer architectures into the multi-core era [59]. The integration of multiple cores into a single chip has become the new trend to increase processor performance. Multi-core architectures [12] have evolved from dual or quad-core to many-core systems, supporting multi-threading, a powerful technique to hide memory latency, while at the same time providing larger Single Instruction Multiple Data (SIMD) units for vector processing. The number of cores per processor is already significant and is expected to increase even further in the future [16,58]. This new context motivates the investigation of more flexible approaches based on multi-cores to solve challenging computational problems that require intensive processing and that, in the past, could typically only be handled by dedicated VLSI accelerators.
Widely available worldwide, many current parallel computing platforms [12] provide low-cost, high-performance massive computation, supported by convenient programming languages, interfaces, tools and libraries [71,89]. The advantages of using software-based rather than hardware-based approaches are essentially related to programmability, reconfigurability, scalability and adjustable data precision. The advantage is clear if NRE is used as the figure of merit to compare both approaches. However, developing programs for platforms with multiple cores is not trivial. Exploiting the full potential of multi-core based computers often requires expertise in parallel computation and specific skills which, at the moment, still compel the software developer to deal with low-level architectural and software details.
In the last decade we have seen a vast set of multi-core architectures emerging [12], delivering previously unmatched processing performance. General-purpose multi-core processors replicate a single core in a homogeneous way, typically with an x86 instruction set, and provide shared memory hardware mechanisms. They support multi-threading, share data at a certain level of the memory hierarchy, and can be programmed at a high level by using different software technologies [71]. The popularity of these architectures has made multi-core processing power widely available.
Mainly due to demands for visualization technology in the games industry, the performance of Graphics Processing Units (GPU) has increased steadily over the last decade. Even a commodity personal computer can now produce high quality graphics in real time, thanks to the high performance GPUs it includes. With many cores driven by a considerable memory bandwidth and a shared memory architecture, recent GPUs are targeted at compute-intensive, multi-threaded, highly-parallel computation, and researchers in high performance computing fields are exploiting GPUs for General Purpose Computing on GPUs (GPGPU) [20,54,99].
Also driven by the audiovisual needs of the games industry, the Sony, Toshiba and IBM (STI) Cell Broadband Engine (Cell/B.E.) Architecture (CBEA) emerged [24,60]. It is characterized by a heterogeneous vectorized SIMD multi-core architecture composed of one main PowerPC Processor that communicates efficiently with several Synergistic Processors. It is based on a distributed memory architecture where fast communications between processors are supported by high bandwidth dedicated buses.
Motivated by the evolution of these multi-core systems, programmable parallel approaches for the computationally intensive processing of LDPC decoding are desirable. Multi-cores allow a low level of NRE, and the flexibility they introduce can be used to exploit other levels of efficiency. However, many challenges have to be successfully addressed to foster these approaches, namely the efficiency of data communications and synchronization between cores, the reduction of latency and minimization of congestion problems that may occur in parallel memory accesses, cache coherency and memory hierarchy issues, and the scalability of the algorithms, to name just a few. The investigation of efficient parallel algorithms for LDPC decoding has to consider the characteristics of the different architectures, such as the homogeneity or heterogeneity of the parallel architecture, and memory models that can support data sharing or be distributed (models that exploit both types also exist). The diversity of architectures imposes different strategies in order to exploit parallelism conveniently. Another important issue is the adoption of data representations suitable for parallel computing engines. All these aspects have a significant impact on the achieved performance.
The development of programs for the considered parallel computing architectures allows the performance of software-based solutions for LDPC decoding to be assessed. The difficulty of developing accurate models for predicting the computational performance of these architectures is associated with the large variety of parameters that can be manipulated within each architecture. Advances in this area can help in understanding how far parallel algorithms are from their theoretical peak performance. This information can be used to tune an algorithm's execution time for a given multi-core architecture. More importantly, it can help to show how compilers and parallel programming tools should evolve towards the massive exploitation of the full potential of parallel computing architectures.