Digital Object Identifier (DOI): 10.1007/s11741-007-0412-2
A reordered first fit algorithm based novel storage scheme for parallel turbo decoder
ZHANG Le, HE Xiang, XU You-yun, LUO Han-wen
Department of Electronic Engineering, Shanghai Jiaotong University, Shanghai 200240, P. R. China
Abstract In this paper we discuss a novel storage scheme for simultaneous memory access in a parallel turbo decoder. The new scheme employs vertex coloring from graph theory. Compared to a similar method that also uses an unnatural storage order, our scheme requires 2∼5 more memory blocks but allows a simpler configuration for variable code lengths that can be implemented on-chip. Experiments show that for a moderate to high decoding throughput (40∼100 Mbps), the hardware cost is still affordable for 3GPP’s (3rd generation partnership project) interleaver.
Keywords turbo codes, parallel turbo decoding, interleaver, vertex coloring, reordered first fit algorithm (RFFA), field programmable gate array (FPGA).
1 Introduction
During field programmable gate array (FPGA) implementation of a turbo decoder, a substantial amount of memory is assigned to store channel information and extrinsic information. A decoder using the parallel maximum a posteriori (MAP) algorithm contains multiple soft-input soft-output (SISO) modules [1], so parallel access to these memories is required. In hardware terms, this means that data required by different SISOs at the same time must not be stored in the same RAM block. Fig.1(a) and Fig.1(b) illustrate memory access during one iteration of turbo decoding, which is conceptually divided into 2 phases: (1) decoding against the 1st component code, and (2) decoding against the 2nd component code. During each phase, the trellis of the component code is divided into three segments, each handled by one SISO module. Suppose we avoid memory access contention in the 1st phase by storing the SISOs’ outputs physically in 3 different RAM blocks. During the 2nd phase, however, the previously separate write addresses from the SISOs are translated by the interleaver Π. They can potentially end up in the same RAM block, so memory access contention still exists.
Fig.1 Memory access in parallel turbo decoder: (a) decoding the first component code; (b) decoding the second component code
Designing the interleaving pattern wisely can prevent such collisions [1], and empirical results show that these contention-free interleavers can yield performance similar to that of conventional interleavers designed for serial implementation [2,3]. However, we notice that they are all constructed in a semi-random way, so that at least one part of the interleaver pattern must be stored explicitly, which can be inconvenient when support for different code lengths is required. A solution is provided in [4], but it still requires explicit storage of the interleaver of the longest size. Interleavers of other lengths are then obtained by pruning this longest interleaver pattern. Other works try to solve this problem without redesigning the interleaver pattern, partly because in some applications the interleaving pattern is predefined as part of the standard [5]. Also, many excellent variable-length real-time addressable interleavers exist, but they are not contention free [6]. In [7], an architecture was proposed which buffers memory access requests when they point to the same address, but it requires a special hardware structure unavailable in today’s FPGAs. The idea of resolving memory access contention by storing data in an unnatural order was first introduced in [8], which proves that, in this way, P RAM blocks are adequate to support P concurrent memory accesses. It provides the lower bound on the memory requirement and an ‘anneal procedure’ which computes the memory storage order offline.
Received Oct.25, 2005; Revised Feb.22, 2006
Project supported by the National High-Technology Research and Development Program of China (Grant No.2003AA123310), and the National Natural Science Foundation of China (Grant Nos.60332030, 60572157)
Corresponding author ZHANG Le, PhD Candidate, E-mail: [email protected]
In this paper we follow the idea of an unnatural storage order proposed in [8] but try to make the calculation of the storage order simple enough to be implemented on-chip. The new storage order calculation borrows its idea from graph theory and is in essence a serial vertex coloring algorithm using greedy heuristics [9]. Simulation with 3GPP’s interleaver pattern shows that the solution given by this method requires a few more memory blocks than [8], but the storage order calculation module can be implemented with simple logic, which reconfigures the decoder within O(10L) clock cycles when the code length changes, where L is the number of information bits. We also find that the major bottleneck of this unnatural order storage scheme is interconnection, in that the number of required tri-state buffers increases significantly for high throughput. However, we verify that for moderate decoding throughput (40∼50 Mbps), the required number of tri-state buffers is still affordable with 5 iterations and an 80∼100 MHz system clock. For application in a WCDMA system, the peak throughput can even reach 100 Mbps.
The rest of the paper is organized as follows. Section 2 models concurrent memory access as a vertex coloring problem. Section 3 explains the resultant architecture of the turbo decoder. Section 4 discusses the design of the vertex coloring algorithm in the light of the interconnection bottleneck. We also show how to implement this algorithm in hardware. Finally, to test the viability of our scheme, we implement it on a Xilinx Virtex II Pro xc2vp70 FPGA.
2 Memory access as a vertex-coloring problem
Graph coloring has been used to solve the register allocation problem in compiler design [10]. The same principle can be borrowed to solve the memory access problem described in Section 1, as follows.
(1) Every stage of the trellis of the component code is modeled as a vertex. Any two vertices are connected with an (undirected) edge if and only if data related to these two stages are accessed simultaneously by different SISO processors [2] in the ‘parallel turbo decoding algorithm’.
(2) Let each color represent a RAM block. As long as any two adjacent vertices are labeled with different colors, memory access will be collision free. This is exactly the ‘vertex coloring’ problem in graph theory.
(3) To use as few RAM blocks as possible, the number of colors in use should be minimized.
Principles (2) and (3) should be obvious. The graph construction of principle (1) is demonstrated with an example in Fig.2, where there are 2 SISO processors and the trellis length is 8. Assume the interleaving pattern π(x) is 5, 3, 7, 8, 1, 4, 6, 2 for x = 1, 2, ···, 8. During the 1st phase of one iteration, trellis stages requiring simultaneous access are paired as {1, 5}, {2, 6}, {3, 7}, {4, 8} according to Fig.2(a). During the 2nd phase, they are paired as {5, 1}, {8, 7}, {2, 3}, {6, 4}. In the following, we call edges resulting from the 1st phase, which are drawn below the numbers, ‘low edges’, and edges from the 2nd phase, which are drawn above the numbers, ‘high edges’.
Fig.2 Construction of the graph from interleaving pattern when 2 SISO processors are used and trellis has 8 stages: (a) before interleaving (stage order 1 2 3 4 5 6 7 8); (b) after interleaving (stage order 5 8 2 6 1 7 3 4); (c) the graph
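The graph construction of principle (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `conflict_edges` and the list representation of π (1-based values, `pi[x-1]` = π(x)) are hypothetical choices made here, and P is assumed to divide the trellis length. Position j after interleaving is taken to hold stage π⁻¹(j), which reproduces the pairs quoted for the Fig.2 example.

```python
from itertools import combinations


def conflict_edges(pi, num_siso):
    """Return (low_edges, high_edges) for a 1-based interleaver list pi.

    pi[x - 1] is the interleaved index of stage x (assumed convention).
    """
    length = len(pi)
    n = length // num_siso  # trellis stages handled by one SISO

    # inverse[j - 1] is the stage found at position j after interleaving.
    inverse = [0] * length
    for stage, target in enumerate(pi, start=1):
        inverse[target - 1] = stage

    low, high = set(), set()
    for m in range(n):
        # Stages touched at the same time step m by the different SISOs.
        natural = [p * n + m + 1 for p in range(num_siso)]
        interleaved = [inverse[p * n + m] for p in range(num_siso)]
        low.update(frozenset(e) for e in combinations(natural, 2))
        high.update(frozenset(e) for e in combinations(interleaved, 2))
    return low, high


low_edges, high_edges = conflict_edges([5, 3, 7, 8, 1, 4, 6, 2], 2)
```

Edges are stored as unordered pairs, so a pair such as {1, 5} that occurs in both phases appears once in each set.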
3 Hardware design
For ease of explanation, the channel and extrinsic information of one trellis stage is simply called an ‘element’.
Let the number of stages in the trellis be M, the number of SISO processors be P, and the number of colors used to color the graph be χ. Then access collision can be avoided by storing the M elements in χ memory blocks, each with a capacity of L/P elements, as follows:
Stage i’s data element (i = 0, ···, M − 1) is stored in the cth RAM at position (i mod L/P), where c is the stage’s color.
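As a small illustration of this placement rule, the sketch below maps each stage to a (RAM block, position) pair and checks that no two stages land in the same cell. The function names are hypothetical, M = L is assumed, and the coloring used in the example is a hand-picked proper 2-coloring of the 8-stage graph of Fig.2 (0-based stage indices), not a result taken from the paper.

```python
def storage_map(colors, num_siso):
    """Map stage i to (RAM block, position) = (colors[i], i mod L/P)."""
    n = len(colors) // num_siso  # capacity L/P of one RAM block
    return {i: (c, i % n) for i, c in enumerate(colors)}


def cells_are_distinct(placement):
    """True if no two stages share the same (RAM block, position) cell."""
    return len(set(placement.values())) == len(placement)


# Hand-picked proper coloring of the Fig.2 example, stages 0..7 (assumption).
example_colors = [0, 0, 1, 0, 1, 1, 0, 1]
placement = storage_map(example_colors, 2)
```

Stages that share the same value of i mod L/P are read together in the 1st phase, so they are adjacent in the graph and receive different colors; this is why the position (i mod L/P) never clashes inside one RAM block.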
Fig.3 shows the turbo decoder’s overall architecture. Only the extrinsic values of the information bits are stored with the new scheme, while the channel values of the parity check bits are still stored in the ordinary sequential way. The channel values of the information bits, which remain unchanged over the decoding period, are stored twice, in natural and interleaved orders respectively (see Fig.8 in [3]). ‘x’ indicates the part of a RAM which is occupied by a data element. The P tables store the c values of the data elements used by the SISOs. Since every SISO must support two kinds of input order, every table has a size of $2\lceil\log_2\chi\rceil\cdot 2^{\lceil\log_2(L/P)\rceil}$, which is 2 K when χ=10 and L/P=256. During the ‘learning period’ [8], each SISO consults its neighbor’s table, so the connections between SISO processors and tables may switch to the ‘dash-dot’ line in the figure.
Fig.3 Turbo decoder architecture (P=4, χ=6, L=20): SISO 1 to 4 with Tables 1 to 4 and the config module, RAM blocks of channel values, the selection network, and the RAM stack for information bits’ extrinsic values (χ=6 RAM blocks, ⌈L/P⌉ elements per block)
Data exchange between the SISO processors and the RAM stack is done via a ‘selection network’, whose internals are shown in Fig.4. The switches are implemented with tri-state buffers. We define $\chi_p$ as the number of output ports of the pth switch. The input is copied to the port indicated by the control signal (drawn in dots), and all other ports are left in the high-impedance state. The port width of switch p for read is $\log_2(L/P)$. The port width of switch p for write is $\log_2(L/P) + w$, where w is the word length of the data. Thus the overall number of tri-state buffers can be calculated as $\sum_{p=1}^{P}\left(2\log_2(L/P) + w\right)\chi_p$.
Fig.4 Selection network with concurrent read and write support (for each p = 1, ···, P: table, delays, selector, and read and write switches connecting to the $\chi_p$ RAM blocks, with read/write addresses generated by counters and interleaving/deinterleaving)
In practice we find that the number of tri-state buffers can be overwhelming: it equals 1600 when L=2048, P=8, w=9 and $\chi_p \equiv P$. This is about 10% of all available tri-state buffers on a Xilinx xc2vp70 FPGA with 7 million gates. We also notice that if the decoder’s latency is fixed, which means that $2\log_2(L/P) + w$ remains constant, the number of tri-state buffers grows as $O(P^2)$. This rapid increase in hardware consumption is mainly due to the ‘time varying’ nature of the switches and is called the ‘interconnection bottleneck’ [7].
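The figure of 1600 can be checked directly from the tri-state buffer formula. The helper below is a minimal sketch; its name and argument layout are hypothetical, and it assumes L/P is a power of two so that the address width is exactly log2(L/P).

```python
def tri_state_buffers(length, num_siso, word_bits, chi):
    """Sum over p of (2*log2(L/P) + w) * chi_p.

    chi is the list of chi_p values, one per SISO processor.
    Assumes L/P is a power of two (for other sizes the width is rounded up).
    """
    n = length // num_siso
    address_bits = (n - 1).bit_length()  # equals log2(n) for a power of two
    per_port = 2 * address_bits + word_bits
    return sum(per_port * chi_p for chi_p in chi)


# L=2048, P=8, w=9, chi_p = P for every p
count = tri_state_buffers(2048, 8, 9, [8] * 8)  # 8 * (2*8 + 9) * 8 = 1600
```

With L/P held at 256 and $\chi_p \equiv P$, doubling P to 16 gives 16 · 25 · 16 = 6400, four times as many buffers, which matches the quadratic growth noted above.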
Designing the coloring scheme properly can alleviate this problem. We do this by restricting the total number of colors seen by an SISO processor when it decodes the two component codes. If $\chi_p$ can be restricted, the total tri-state buffer consumption will decrease. The resultant ‘reordered first fit’ algorithm is described as follows.
Let n = L/P, and let π(x) be the interleaved index of x. For simplicity, we assume P divides L. Let $A_p$ be the set of indices of the elements processed by the pth SISO processor, which is defined as follows:
$A_p = \{pn + m,\ m = 0, 1, \cdots, n-1\} \cup \{\pi^{-1}(pn + m),\ m = 0, 1, \cdots, n-1\}.$
For p = 0, 1, ···, P−1, color the vertices in set $A_p$ as follows.
Look at every vertex adjacent to the current vertex and record all colors (if any) already used by them. Let the smallest color not used by the adjacent vertices be the vertex’s color.
The difference between our algorithm and the canonical first fit algorithm [9], which may also be used here, is that the latter colors vertices in order from index 0 to L−1. Here we color the vertices in set $A_0$ first, then the vertices in set $A_1$, and so on. Every vertex appears twice across the sets $A_p$ and is thus colored twice. This, in theory, will not increase coloring latency significantly if some additional complexity is introduced to jump over colored vertices.
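The procedure above can be sketched in software as follows. This is an illustrative sketch rather than the hardware implementation: all function names are hypothetical, indices are 0-based, the order inside each set $A_p$ (natural part first, then the interleaved part) is an assumption because the text does not fix it, and already colored vertices are skipped instead of being colored a second time. The inverse interleaver `pi_inv` used in the example is the Fig.2 pattern converted to 0-based indices.

```python
def conflict_graph(pi_inv, num_siso):
    """Adjacency sets: stages accessed at the same time step are neighbors."""
    length = len(pi_inv)
    n = length // num_siso
    adjacent = {v: set() for v in range(length)}
    for m in range(n):
        natural = [p * n + m for p in range(num_siso)]
        interleaved = [pi_inv[p * n + m] for p in range(num_siso)]
        for group in (natural, interleaved):
            for a in group:
                adjacent[a].update(b for b in group if b != a)
    return adjacent


def first_fit(adjacent, order):
    """Greedy coloring: smallest color not used by already colored neighbors."""
    color = {}
    for v in order:
        if v in color:  # assumption: jump over vertices colored earlier
            continue
        used = {color[u] for u in adjacent[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


def reordered_order(pi_inv, num_siso):
    """Visit A_0, A_1, ...: natural indices of SISO p, then pi^{-1} of them."""
    n = len(pi_inv) // num_siso
    order = []
    for p in range(num_siso):
        order += [p * n + m for m in range(n)]
        order += [pi_inv[p * n + m] for m in range(n)]
    return order


pi_inv = [4, 7, 1, 5, 0, 6, 2, 3]  # Fig.2 example, 0-based
graph = conflict_graph(pi_inv, 2)
reordered = first_fit(graph, reordered_order(pi_inv, 2))
canonical = first_fit(graph, range(len(pi_inv)))
```

Passing `range(L)` as the order gives the canonical first fit, so the same `first_fit` routine serves both variants and only the visiting order differs.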
The reordered first fit algorithm and its canonical version are compared in terms of the number of tri-state buffers and χ in Fig.5, in which L/P is kept constant at 256 and P is increased from 4 to 16. The number of tri-state buffers is calculated as $\sum_{p=1}^{P}\left(2\log_2(L/P) + w\right)\chi_p$ with w=9 and L/P=256. We also provide the estimated number of tri-state buffers of [8], whose results imply $\chi_p \equiv P$. According to Fig.5(a), the reordered first fit algorithm saves 12%∼20% of the tri-state buffers compared to the canonical first fit algorithm. It uses even fewer tri-state buffers than [8] for P=16. According to Fig.5(b), the coloring scheme uses 2∼5 more RAM blocks than [8]. In practice, this additional cost can be alleviated by implementing the last RAM block with slices, as it usually hosts fewer than 16 elements.
The reordered first fit algorithm makes the connection network ‘irregular’, in that the $\chi_p$ differ from one another. When different code lengths are supported, we need connections saved under one code length to be unlikely to reappear under another code length. This is ensured by the following 2 observations: (1) color indices are assigned to $A_p$ sequentially, that is, if color index i appears in $A_p$, all colors with index smaller than i must appear in $A_p$ as well; (2) $\chi_p$ generally increases monotonically with p because of the greedy nature of our algorithm. This is verified in Fig.6. Here the $\chi_p$ for each p is its maximum value over the code lengths listed in the caption of Fig.6. Calculation with $\sum_{p=1}^{P}\left(2\log_2(L/P) + w\right)\chi_p$ shows that 15% of the tri-state buffers are saved.
Fig.5 Tri-state buffers and RAM block simulation when different vertex coloring schemes are used (L/P = 256): (a) tri-state buffer consumption (number of tri-state buffers versus P); (b) RAM block consumption (χ versus P). Curves: canonical first fit, reordered first fit, Ref.[4]
Fig.6 Comparison of $\chi_p$ (plotted versus p) of different coloring schemes (canonical first fit and reordered first fit) when supporting code lengths L=512, 1024, 2048, 4096 with P=8
The ‘config’ module in Fig.3 computes the data for the P tables when the code length changes. At this time, the ports of the tables are switched to the connection shown in Fig.7. Assume the configuration of one table entry takes N clock cycles. The first two cycles are used to read the colors of neighboring vertices from the tables, one for neighbors seen from ‘high edges’, the other for neighbors connected via ‘low edges’. The ‘read’ ports are then idle during the next (N−2) cycles. This is because the calculation for a new vertex can only start after the previous result has been written back into the tables. The ‘color calculator’ picks the color for the vertex, which is written to the table indicated by the ‘write control’ module if necessary. For simplicity we just color every vertex twice, thus the computation for all table entries takes O(2NL) clock cycles.
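The address buses $\mathrm{addr}_a$ and $\mathrm{addr}_b$ are calculated as follows.
for p = 1, ···, P
    for k ∈ $A_p$
        for t = 0, ···, N − 1
            $\mathrm{addr}_a = \begin{cases} k \bmod (L/P), & t = 0, \\ \pi(k) \bmod (L/P) + 2^{\lceil\log_2(L/P)\rceil}, & \text{otherwise}, \end{cases}$
            $\mathrm{addr}_b = \begin{cases} \lfloor k/(L/P) \rfloor, & t = 0, \\ \lfloor \pi(k)/(L/P) \rfloor, & \text{otherwise}, \end{cases}$
        end for
    end for
end for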