High Performance Turbo Decoder on CELL BE for WiMAX System

Huili Guo*, Juntao Zhao*, Jianwen Chen§, Xiang Chen*, Jing Wang*

* Department of Electronic Engineering, State Key Laboratory on Microwave and Digital Communications and Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, 100084, China,

{[email protected], [email protected], [email protected], [email protected]}

§ IBM China Research Laboratory, Beijing, China, [email protected]

Abstract-Turbo codes are widely used in many radio systems due to their superior performance, and Software Radio (SR) is an emerging paradigm for wireless communication system design due to its flexibility and adaptability. However, since turbo decoding is computationally intensive, an SR implementation of turbo decoding is challenging. In this paper, an efficient software implementation of the double-binary turbo decoder for the WiMAX SR baseband system on the IBM CELL Broadband Engine (BE) is presented. After parallelization and optimization of the decoder structure, the implemented turbo decoder achieves a throughput of up to 1.36 Mbps with a single Synergistic Processor Element (SPE) running at 3.2 GHz. With eight SPEs working in parallel, the decoder achieves a throughput of more than 10 Mbps, which meets the WiMAX system requirement in the 5 MHz bandwidth mode.

Keywords-Turbo decoder, WiMAX, Software Radio, Multi-core Processor

I. INTRODUCTION

Software Radio (SR) technology brings flexibility, cost efficiency and lower power consumption to drive communications forward. SR has wide-reaching benefits for service providers, product developers and end users. However, SR applications are always restricted by the performance of the hardware platform on which they are developed. In recent years, multi-core technology has developed rapidly and is currently the trend in microprocessor development. Multi-core processors, with high frequency and low power consumption, are able to provide a complete wireless SR system solution with high performance and good adaptability [1, 2].

Worldwide Interoperability for Microwave Access (WiMAX), as a broadband wireless access technology, can provide high-quality "last-mile" wireless access service and offer Internet connections to mobile clients. In particular, WiMAX was recently adopted as one of the international 3G standards [3].

In this paper, a basic WiMAX baseband SR system based on the CELL Broadband Engine (BE) is considered [4], which is also based on multi-core technology. The system structure is shown in Fig. 1. In [4], the Convolutional Code (CC) with tail-biting is adopted. However, from the system performance point of view, the CC scheme cannot meet all the system requirements, especially in multi-path fading channels. Accordingly, in this paper we apply the double-binary Convolutional Turbo Code (CTC) [3] to the WiMAX baseband SR system on CELL BE.

As described in [2, 4], the CELL BE is a single-chip multiprocessor with nine processor elements operating on a shared, coherent memory [5]. Although all processor elements share memory, their functions are specialized into two types: the Power Processor Element (PPE) and the Synergistic Processor Element (SPE). The architecture of CELL BE is shown in Fig. 2.

Initially, a turbo decoder without any optimization was integrated into the WiMAX baseband SR system. The computational complexity of all the modules in the system is shown in Table I. From this table, we can see that the turbo decoder is the most computationally intensive module and the throughput bottleneck of the system. Designing a highly efficient turbo decoder is therefore the most challenging part of the whole WiMAX SR system.

Fig. 1. WiMAX baseband system structure

Fig. 2. CELL BE architecture

TABLE I. WIMAX MODULES COMPUTATION COMPLEXITY

This work is partially supported by IBM OCR and SUR programs, National Basic Research Program of China (2007CB31060), PCSIRT, and International Science and Technology Cooperation Program (2008DFA12160)


Module                          CPU cycles
Channel coding                  1927
Interleaving                    1225
IFFT                            1551
Channel Estimation              783
SFBC                            1484
De-Interleaving                 1287
Turbo Decoding (Max-log-MAP)    6663
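To make the bottleneck claim concrete, the following short sketch sums the cycle counts of Table I and computes the share taken by turbo decoding. The numbers are copied from the table; the dictionary name and the idea of expressing the result as a fraction of the listed total are illustrative assumptions, and the table does not state what unit of work the cycle counts refer to.

```python
# CPU-cycle counts copied from Table I (the unit of work per count is not
# stated in the table, so only the relative shares are meaningful here).
cycles = {
    "Channel coding": 1927,
    "Interleaving": 1225,
    "IFFT": 1551,
    "Channel Estimation": 783,
    "SFBC": 1484,
    "De-Interleaving": 1287,
    "Turbo Decoding (Max-log-MAP)": 6663,
}

total = sum(cycles.values())
heaviest = max(cycles, key=cycles.get)
share = cycles[heaviest] / total

print(f"total cycles over listed modules: {total}")
print(f"heaviest module: {heaviest} ({share:.1%} of the listed total)")
```

Under this reading, the unoptimized turbo decoder accounts for a little under half of the listed cycles, more than three times the next most expensive module.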

In this paper, the turbo decoder on CELL BE will be studied. Then two parallel decoding methods, referred to as Parallel Block Decoding (PBD) and Parallel State Decoding (PSD), are presented to achieve high throughput and high performance based on the CELL BE platform. In addition, the decoder is also optimized based on the programming characteristic of SPE.

The rest of the paper is organized as follows. The turbo decoder with the MAX-log-MAP algorithm is first described in Section II. Section III covers the implementation of the two parallel decoding methods and the relevant multi-core programming optimizations. In Section IV, the decoding throughput results are presented and analyzed. The BER simulation results are described in detail in Section V. Finally, Section VI concludes the paper.

II. MAX-LOG-MAP DECODING ALGORITHM

The iterative turbo decoder consists of two component Soft-Input Soft-Output (SISO) decoders serially concatenated via an interleaver, identical to the one in the turbo encoder, as shown in Fig. 3 [6].

Fig. 3. Structure of the turbo decoder

When the Maximum A Posteriori (MAP) algorithm is applied to each SISO decoder, the Log-Likelihood Ratio (LLR) for each double-binary pair can be expressed as follows:

$$
L(A_k, B_k) = \ln\frac{P(A_k = a, B_k = b \mid y)}{P(A_k = 0, B_k = 0 \mid y)}
= \ln\frac{\sum_{(s_k, s_{k+1}),\, A_k = a,\, B_k = b} \alpha_k(s_k) \cdot \gamma_{k+1}(s_k, s_{k+1}) \cdot \beta_{k+1}(s_{k+1})}{\sum_{(s_k, s_{k+1}),\, A_k = 0,\, B_k = 0} \alpha_k(s_k) \cdot \gamma_{k+1}(s_k, s_{k+1}) \cdot \beta_{k+1}(s_{k+1})}
$$

where (a,b) are (0,1), (1,0) or (1,1).

However, the MAP decoding algorithm requires a large amount of memory and a large number of operations involving exponentiations and multiplications, which makes it difficult to implement, especially in the SR system on CELL BE. We therefore replace the MAP algorithm with the MAX-log-MAP algorithm, which offers acceptable performance with much lower computational complexity and memory consumption [6].
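The saving comes from replacing the logarithm of a sum of exponentials by a plain maximum. The sketch below is only an illustration of that approximation, not the authors' code; the function names and the sample metric values are hypothetical.

```python
import math

def log_sum_exp(values):
    """Exact log-domain sum used by log-MAP: ln(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def max_log(values):
    """Max-log approximation: keep only the largest term."""
    return max(values)

# Hypothetical path metrics competing for the same decision.
metrics = [2.0, -1.0, 0.5, -3.0]
exact = log_sum_exp(metrics)
approx = max_log(metrics)
error = exact - approx  # always between 0 and ln(len(metrics))

print(f"exact={exact:.4f} approx={approx:.4f} error={error:.4f}")
```

The approximation never overestimates, and its error is bounded by the logarithm of the number of competing terms, which is why it needs only additions and comparisons.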

In the MAX-log-MAP algorithm, the output of each SISO decoder, representing the extrinsic LLR, is expressed as follows:

$$
\lambda(A_k, B_k) = \max_{(s_k, s_{k+1}, z)} \{\alpha_k(s_k) + \gamma_{k+1}(s_k, s_{k+1}) + \beta_{k+1}(s_{k+1})\} - \max_{(s_k, s_{k+1}, 00)} \{\alpha_k(s_k) + \gamma_{k+1}(s_k, s_{k+1}) + \beta_{k+1}(s_{k+1})\}, \tag{1}
$$

where $z \in \phi = \{01, 10, 11\}$.

The extrinsic LLR $\lambda(A_k, B_k)$ of the SISO decoder is interleaved or de-interleaved and then fed to the next SISO decoder as the a priori information $L_{e,IN}^{(z)}$.

The computation of the LLR can be broken into the calculation of three metrics: the forward state metric $\alpha_k(s_k)$, the branch metric $\gamma_{k+1}(s_k, s_{k+1})$, and the backward state metric $\beta_{k+1}(s_{k+1})$. Denote $s_k$ and $s_{k+1}$ as the start and the end states, respectively. Then $\alpha_k(s_k)$, $\beta_k(s_k)$ and $\gamma_{k+1}(s_k, s_{k+1})$ can be defined as follows:

$$
\alpha_k(s_k) = \max_{s_{k-1} \in s^1} \{\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k)\}, \tag{2}
$$

$$
\beta_k(s_k) = \max_{s_{k+1} \in s^2} \{\beta_{k+1}(s_{k+1}) + \gamma_{k+1}(s_k, s_{k+1})\}, \tag{3}
$$

$$
\gamma_{k+1}(s_k, s_{k+1}) = \ln[P(y_k \mid x_k) \cdot P(A_k, B_k)] = \frac{L_c}{2}\left(x_k^{s1} y_k^{s1} + x_k^{s2} y_k^{s2} + x_k^{p1} y_k^{p1} + x_k^{p2} y_k^{p2}\right) + L_{e,IN}^{(z)}, \tag{4}
$$

where:

• $s^1$ is the set of states at time k-1 connected to the state $s_k$. $s^2$ is the set of states at time k+1 connected to the state $s_k$.

• $z \in \phi = \{00, 01, 10, 11\}$.

• $(A_k, B_k)$ is the input symbol consisting of two bits. $P(A_k, B_k)$ is the a priori probability of $(A_k, B_k)$.

• $x_k$ and $y_k$ are the transmitted and received codewords, respectively, associated with $(A_k, B_k)$.

• Superscripts $p$ and $s$ denote the parity bits and systematic bits, respectively.

• $L_{e,IN}^{(z)}$ is the a priori information obtained from the other SISO decoder.

• The code is assumed to be modulated by BPSK and transmitted through an AWGN channel with noise variance $\sigma^2$. In this case, turbo decoding based on the Max-log-MAP algorithm is independent of the SNR, so $L_c = 2/\sigma^2$ can usually be set to a constant value.
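The recursions (2) and (3) and the max-difference of (1) can be sketched in a few lines. This is a minimal, scalar illustration under stated assumptions, not the decoder described in this paper: the 4-state trellis below is a hypothetical toy (it is not the WiMAX CTC trellis), the branch metrics are supplied directly rather than computed from (4), and the boundary metrics are simply set to zero.

```python
NEG_INF = float("-inf")

def max_log_map(num_states, branches, gammas, alpha_init, beta_init):
    """branches: list of (s, s_next, z); gammas[k][i] is the metric of
    branches[i] for the transition from time k to time k+1."""
    steps = len(gammas)
    # Forward recursion, eq. (2): best metric of any path entering each state.
    alpha = [list(alpha_init)]
    for k in range(steps):
        nxt = [NEG_INF] * num_states
        for i, (s, s_next, _z) in enumerate(branches):
            nxt[s_next] = max(nxt[s_next], alpha[k][s] + gammas[k][i])
        alpha.append(nxt)
    # Backward recursion, eq. (3): best metric of any path leaving each state.
    beta = [None] * (steps + 1)
    beta[steps] = list(beta_init)
    for k in range(steps - 1, -1, -1):
        cur = [NEG_INF] * num_states
        for i, (s, s_next, _z) in enumerate(branches):
            cur[s] = max(cur[s], beta[k + 1][s_next] + gammas[k][i])
        beta[k] = cur
    # Eq. (1): best metric per symbol z, relative to the symbol 00.
    llrs = []
    for k in range(steps):
        best = {}
        for i, (s, s_next, z) in enumerate(branches):
            metric = alpha[k][s] + gammas[k][i] + beta[k + 1][s_next]
            best[z] = max(best.get(z, NEG_INF), metric)
        llrs.append({z: best[z] - best["00"] for z in ("01", "10", "11")})
    return alpha, beta, llrs

# Hypothetical 4-state toy trellis: symbol z moves state s to (s + z) mod 4.
SYMBOLS = ("00", "01", "10", "11")
toy_branches = [(s, (s + int(z, 2)) % 4, z) for s in range(4) for z in SYMBOLS]

def toy_gammas(per_symbol_metrics):
    """Expand one metric per symbol and step into one metric per branch."""
    return [[step[z] for (_s, _n, z) in toy_branches] for step in per_symbol_metrics]

example = toy_gammas([
    {"00": 0.0, "01": 1.5, "10": -2.0, "11": 0.25},
    {"00": 0.5, "01": -1.0, "10": 2.0, "11": 0.0},
])
alpha, beta, llrs = max_log_map(4, toy_branches, example, [0.0] * 4, [0.0] * 4)
print(llrs)
```

Because the toy branch metrics depend only on the symbol, each output is just the difference between that symbol's metric and the metric of 00, which makes the result easy to check by hand.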

III. TURBO DECODER IMPLEMENTATION ON CELL

In this section, we describe how the MAX-log-MAP decoding algorithm is implemented on a single SPE. The SPE supports 128-bit-wide Single Instruction Multiple Data (SIMD) operations. To make good use of this SIMD feature, we present two parallel decoding methods for the implementation of the SISO decoder: Parallel Block Decoding (PBD) and Parallel State Decoding (PSD).

A. PBD Implementation

First, we introduce the PBD implementation method. In the PBD method, the frame size is assumed to be M bits, and the frame is divided into N sub-blocks of equal length. The sub-blocks of the frame can be decoded independently and in parallel [7].

For the implementation, without loss of generality, each soft input is quantized to 8 bits and stored in two bytes, so one 128-bit-wide vector can hold at most eight soft inputs. The data mapping of the PBD algorithm is shown in Fig. 4, where data[i] denotes the soft input of the decoder, i=0,1,…,2*M/N. data[i] of each sub-block is read into one vector and processed in parallel. The values of α and β for each sub-block are calculated simultaneously according to (2), (3) and (4) in Section II. Finally, the LLR is obtained from (1), in parallel and independently for each sub-block.
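The data mapping can be pictured as a transpose: vector i collects data[i] from every sub-block, one sub-block per SIMD lane. The sketch below emulates this with Python lists; the function name `pack_pbd` and the flat input layout are assumptions made for illustration, and the limit of eight lanes reflects the eight 16-bit elements of one 128-bit vector.

```python
LANES_PER_VECTOR = 8  # eight 16-bit soft inputs fit in one 128-bit vector

def pack_pbd(soft_inputs, num_blocks):
    """Return vectors such that vectors[i][n] is data[i] of sub-block n."""
    if num_blocks > LANES_PER_VECTOR:
        raise ValueError("more sub-blocks than lanes in one vector")
    if len(soft_inputs) % num_blocks != 0:
        raise ValueError("frame does not split into equal sub-blocks")
    block_len = len(soft_inputs) // num_blocks
    return [
        [soft_inputs[n * block_len + i] for n in range(num_blocks)]
        for i in range(block_len)
    ]

# Hypothetical frame of 16 soft inputs split into 8 sub-blocks of 2.
frame = list(range(16))
vectors = pack_pbd(frame, 8)
print(vectors)
```

With this layout, one vector addition or maximum advances the same trellis step of all sub-blocks at once.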

Since the calculations of the α and β metrics may start somewhere in the middle of the frame, they must be initialized. Fig. 5 shows the initialization value passing scheme for the PBD algorithm. For simplicity, only two iterations are shown in Fig. 5. In practice, the number of iterations is chosen as a trade-off between the Bit Error Rate (BER) performance and the decoding throughput.

Fig. 4. Data mapping for the PBD algorithm

Assume that each sub-block contains k bit couples. As shown in Fig. 5, during the parallel decoding, α_k of sub-block n (n=1, 2,…, N-1) is saved as α_0 of sub-block n+1 for the next iteration. For the last sub-block N, α_k is saved as α_0 of sub-block 1 for the next iteration. Similarly, β_0 of sub-block n (n=2, 3,…, N) is saved as β_k of sub-block n-1 for the next iteration, and for sub-block 1, β_0 is saved as β_k of sub-block N for the next iteration.

Fig. 5. Metric initialization value passing scheme for PBD algorithm
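The passing scheme amounts to a circular rotation of the boundary metrics between neighbouring sub-blocks. The following sketch shows only that bookkeeping, with 0-based sub-block indices; the function name and the use of string labels in place of real metric vectors are illustrative assumptions.

```python
def pass_boundary_metrics(alpha_end, beta_start):
    """alpha_end[n]: final alpha of sub-block n in this iteration.
    beta_start[n]: beta at the start of sub-block n in this iteration.
    Returns the initial alpha and the initial (end-of-block) beta that each
    sub-block uses in the next iteration."""
    n_blocks = len(alpha_end)
    # Sub-block n starts from the final alpha of sub-block n-1 (wrapping around).
    next_alpha_init = [alpha_end[(n - 1) % n_blocks] for n in range(n_blocks)]
    # Sub-block n ends with the starting beta of sub-block n+1 (wrapping around).
    next_beta_init = [beta_start[(n + 1) % n_blocks] for n in range(n_blocks)]
    return next_alpha_init, next_beta_init

# Labels stand in for the metric vectors of three sub-blocks.
next_alpha, next_beta = pass_boundary_metrics(["a0", "a1", "a2"], ["b0", "b1", "b2"])
print(next_alpha, next_beta)
```

The wrap-around between the last and the first sub-block mirrors the circular treatment of the frame described above.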

B. PSD Implementation

Besides the PBD algorithm, which parallelizes decoding over the sub-blocks of one frame, another parallel structure based on the decoding states, referred to as the PSD algorithm, is presented in this section. Let D be the number of states of the turbo encoder. In the PSD algorithm, the α and β metrics of all D states are calculated in parallel. Since the CTC encoder used in the WiMAX system has three registers [3], there are eight states in total, i.e., D=8. The state transition diagram of this CTC is given in Fig. 6 [6].

From Fig. 6, we can see that, at each time k, both α and β have eight states. With the α or β value of each state represented by 16 bits, one 128-bit-wide vector can hold the α or β values of all eight states. The data mapping scheme for α in the PSD algorithm is shown in Fig. 7. The mapping scheme for β is similar.

The values of α and β for the eight states are then calculated simultaneously, according to (2), (3) and (4) in Section II. To calculate α and β in parallel, the Synergistic Processor Unit (SPU) shuffle instruction is used frequently to rearrange the eight elements of a vector at each trellis step.
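One way to picture the role of the permutation is the lane-wise update below: for each input symbol, the previous α vector is rearranged so that every lane sees its predecessor state, the branch metrics are added, and a lane-wise maximum implements (2). This is a sketch under stated assumptions, emulated with Python lists; the predecessor patterns belong to a hypothetical 8-state toy trellis, not to the CTC trellis of Fig. 6, and the helper names are made up for illustration.

```python
NUM_STATES = 8

def shuffle(vec, pattern):
    """Emulate a vector permutation: lane i receives vec[pattern[i]]."""
    return [vec[p] for p in pattern]

def psd_alpha_step(alpha, gammas, patterns):
    """One forward step for all states at once.
    patterns[z][lane]: predecessor state of `lane` under symbol z.
    gammas[z][lane]: branch metric of that transition."""
    best = None
    for pattern, gamma in zip(patterns, gammas):
        candidate = [a + g for a, g in zip(shuffle(alpha, pattern), gamma)]
        if best is None:
            best = candidate
        else:
            best = [max(x, y) for x, y in zip(best, candidate)]
    return best

# Hypothetical toy trellis: symbol z (0..3) moves state s to (s + z) mod 8,
# so the predecessor of `lane` under symbol z is (lane - z) mod 8.
toy_patterns = [[(lane - z) % NUM_STATES for lane in range(NUM_STATES)] for z in range(4)]
alpha = [0, 1, 2, 3, 4, 5, 6, 7]
zero_gammas = [[0] * NUM_STATES for _ in range(4)]
alpha_next = psd_alpha_step(alpha, zero_gammas, toy_patterns)
print(alpha_next)
```

Each symbol costs one permutation, one addition and one maximum on a whole vector, independent of the number of states held in the vector.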

Since the local store, i.e., the local memory of a single SPE, is limited to 256 KBytes (KB), the extrinsic LLR is calculated during the calculation of α to save memory. Consequently, for a frame length of 2*N bits, the memory used for the metric α is reduced from (N+1)*8*2 bytes to 8*2 bytes.
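As a worked example of this saving, the sketch below evaluates both expressions for a 4800-bit frame, i.e., N=2400 bit couples. The function name and the choice of frame size are illustrative; the formulas are the ones stated above, with eight states and two bytes per metric.

```python
def alpha_storage_bytes(num_couples, num_states=8, bytes_per_metric=2):
    """Memory for alpha when every time step is stored versus when only the
    current step is kept and the LLR is computed on the fly."""
    full = (num_couples + 1) * num_states * bytes_per_metric
    streaming = num_states * bytes_per_metric
    return full, streaming

# A frame of 2*N = 4800 bits contains N = 2400 bit couples.
full, streaming = alpha_storage_bytes(2400)
print(f"store all steps: {full} bytes, keep current step only: {streaming} bytes")
```

For this frame size the α storage shrinks from roughly 37.5 KB to a single 16-byte vector.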


Fig. 7. Data mapping for the PSD algorithm

C. Implementation Optimization on CELL BE

To achieve the best throughput, the application code is optimized according to the following CELL BE characteristics.

• Firstly, scalar instruction execution is very inefficient on the SPE. Each scalar needs to be converted to a vector before execution and converted back afterwards. To take full advantage of the SPE performance, the number of scalar instructions is reduced to a minimum.

• Secondly, branching is cycle-consuming and inefficient on the SPE [5]. Various methods, such as loop unrolling, branch prediction and function inlining, are adopted to avoid branches as much as possible.

• Thirdly, the SPE has two separate pipelines, named even (pipeline 0) and odd (pipeline 1) [5], each responsible for executing different instruction types. A doubleword-aligned instruction pair is called a fetch group, and the SPE issues instructions in program order, one fetch group at a time. Dual-issue occurs when a fetch group contains two dual-issuable instructions: the first can be executed on the even pipeline, the second can be executed on the odd pipeline, and there is no dependency between them. Improving the dual-issue rate brings better throughput. In our implementation on CELL BE, instructions that execute on different pipelines are therefore paired, and dependencies between instructions are reduced.

• Last but not least, we use more memory to speed up the calculations. The local store of the SPE is only 256KB and holds the program, stack, local data structures, and DMA buffers. In our single-SPE implementation, the program and global data of the turbo decoder occupy about 135KB, so the remaining 121KB is available for run-time memory use (stack allocation, etc.). We confirmed that, when the frame size is equal to or less than 4800 bits, the maximum SPE local store consumption during program execution does not exceed 256KB.
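The branch-avoidance point can be illustrated with the maximum operation that dominates (1) to (3). Instead of an if/else, a comparison produces an all-ones or all-zeros mask and a bitwise select picks the result. On the SPE this pattern would typically map to a compare followed by a select (for example the `spu_cmpgt` and `spu_sel` intrinsics), although the source does not state which instructions the authors used. The Python below only emulates the idea on 16-bit values, and all names in it are hypothetical.

```python
MASK16 = 0xFFFF

def to_u16(x):
    """Two's-complement bit pattern of a signed 16-bit value."""
    return x & MASK16

def from_u16(u):
    """Signed value of a 16-bit pattern, computed without a conditional."""
    return (u ^ 0x8000) - 0x8000

def branchless_max(a, b):
    # Comparison result as a mask: 0xFFFF when a > b, otherwise 0x0000.
    mask = -(a > b) & MASK16
    # Bitwise select: take a where the mask is set, b elsewhere.
    selected = (to_u16(a) & mask) | (to_u16(b) & ~mask & MASK16)
    return from_u16(selected)

print(branchless_max(3, -5), branchless_max(-7, -2))
```

Applied lane by lane to a whole vector, the same compare-and-select sequence evaluates eight maxima without any data-dependent branch.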

IV. DECODING THROUGHPUT RESULTS

To demonstrate the efficacy of the presented parallel decoding algorithms, some throughput results are given in this section. Fig. 8 shows the throughput of the two parallel turbo decoding methods for different frame lengths. As a trade-off between BER performance and throughput, the iteration number is set to 6 when the frame length is equal to or shorter than 192 bits, and to 5 otherwise.

As shown in Fig. 8, when the frame length is equal to or longer than 192 bits, the PBD algorithm has clearly higher throughput than the PSD algorithm. However, as the frame length decreases, the throughput gap between the two algorithms becomes smaller. This is because the PBD algorithm is constrained by the BER performance. When the frame length gets shorter, the number of sub-blocks has to be smaller to keep the BER performance acceptable, which reduces the efficiency of the SIMD operation and thus reduces the throughput.

Therefore, when the frame length is less than 192 bits, the PSD algorithm is preferred; otherwise, the PBD algorithm is preferred.

[Plot: throughput (Mbps) versus frame size (bits) for the PSD and PBD methods]

Fig. 8. Throughput comparison of the two parallel methods

V. BER PERFORMANCE RESULTS

In this section, some BER performance results of the PBD algorithm are shown. The results of the PSD algorithm are omitted because its BER performance is similar to that of the non-block algorithm. Fig. 9 shows the BER performance of the PBD algorithm and the non-block algorithm for different frame and block lengths.

From Fig. 9, we can see that, for the same frame length, the BER performance gets worse as the block length becomes shorter. The PBD algorithm offers BER performance similar to that of the non-block algorithm when the frame length is equal to or longer than 192 bits.

Furthermore, the PBD turbo decoder with BPSK modulation is integrated into the WiMAX SR system on CELL BE, as shown in Fig. 1. Fig. 10 compares the system BER performance in the AWGN channel between the (2, 1, 3) system CC with tail-biting and the CTC. The CC is decoded by the Viterbi algorithm, and the CTC is decoded by the PBD algorithm. The frame length of the CTC code is 480 bits, the coding rate is 1/2, and the number of decoding iterations is 5. From Fig. 10, we can see that the coding gain of the CTC over the CC is about 4 dB at a BER of $10^{-5}$.


Fig. 9. BER performance for the PBD algorithm

Fig. 10. WiMAX system BER performance on CELL BE

VI. CONCLUSION

In this paper, two parallel decoding algorithms of the turbo decoder for the WiMAX SR baseband system on CELL BE are presented. We focused on the implementation details of the two parallel decoding methods and on the vector programming optimization on CELL BE. The test results show that, with a single SPE running at 3.2 GHz, the decoding throughput of the turbo decoder reaches up to 1.36 Mbps. With eight SPEs working in parallel, the decoder achieves a throughput of more than 10 Mbps, which meets the WiMAX system requirement in the 5 MHz bandwidth mode.

REFERENCES

[1] Kun Tan, Jiansong Zhang, Ji Fang, et al., "Sora: High Performance Software Radio Using General Purpose Multi-core Processors," in 6th USENIX Symposium on Networked Systems Design & Implementation 2009, USENIX, 2009.

[2] Jianwen Chen, Qing Wang, Zhenbo Zhu, Yonghua Lin, “An Efficient Software Radio Framework for WiMAX Physical Layer on Cell Multicore Platform,” in ICC2009, in Dresden, Germany.

[3] IEEE Std 802.16e-2005, “Part 16: Air Interface for Fixed, Mobile Broadband Wireless Access Systems Amendment2:Physical, Medium Access Control Layer for Combined Fixed, and Mobile Operation in Licensed Bands,” Feb., 2006.

[4] Junjie Lai, Jianwen Chen, “High Performance Viterbi Decoder on Cell BE,” Proc. the 1st International Workshop on Software Radio Technology (SRT2008), October 16-17, 2008.

[5] IBM, Cell Broadband Engine Programming Handbook, Version 1.1, April 24, 2007.

[6] Xinmei Wang, Guoqi Xiao, “Error-correcting codes,”(in Chinese) Xidian University publishing house, pp 506-513, April 2001.

[7] Jonathan Roth and Naraig Manjikian, Subramania Sudharsanan, “Performance Optimization and Parallelization of Turbo Decoding for Software-defined Radio,” Electrical and Computer Engineering, 2009 CCECE '09 Canadian Conference, pp. 804 – 809, May 3-6, 2009