Worcester Polytechnic Institute
Digital WPI
Doctoral Dissertations (All Dissertations, All Years)
Electronic Theses and Dissertations
2012-01-09
High-Performance Decoder Architectures For
Low-Density Parity-Check Codes
Kai Zhang
Worcester Polytechnic Institute
Follow this and additional works at:
https://digitalcommons.wpi.edu/etd-dissertations
Repository Citation
Zhang, K. (2012).
High-Performance Decoder Architectures For Low-Density Parity-Check Codes
. Retrieved from
https://digitalcommons.wpi.edu/etd-dissertations/17
FOR LOW-DENSITY PARITY-CHECK CODES
by
Kai Zhang
A Dissertation
Submitted tothe faulty of the
WORCESTER POLYTECHNIC INSTITUTE
In partialfulllmentof the requirements for the
Degree of Dotor of Philosophy
in
Eletrialand Computer Engineering
by
August,2011
APPROVED:
Prof. Xinming Huang Prof. Berk Sunar
Thesis Advisor Thesis Committee
Worester Polytehni Institute Worester Polytehni Institute
Prof. James Dukworth Dr. ZhongfengWang
Thesis Committee Thesis Committee
TheLow-Density Parity-Chek (LDPC) odes, whihwere invented by Gallager
bakin1960s,haveattratedonsiderableattentionsreently. Comparedwithother
error orretion odes, LDPC odes are well suited for wireless, optial, and
mag-neti reording systems due to their near-Shannon-limit error-orreting apaity,
high intrinsi parallelism and high-throughput potentials. With these remarkable
harateristis, LDPC odes have been adopted in several reent ommuniation
standards suh as 802.11n(Wi-Fi), 802.16e(WiMax),802.15.3 (WPAN), DVB-S2
and CMMB.
This dissertation is devoted to exploring eient VLSI arhitetures for
high-performaneLDPC deoders andLDPC-likedetetors insparse inter-symbol
inter-ferene (ISI) hannels. The performane of an LDPC deoder is mainly evaluated
by area eieny, error-orreting apability, throughputand rate exibility. With
this workweinvestigatetradeosbetweenthefourperformaneaspetsanddevelop
several deoder arhitetures to improve one or several performane aspets while
maintainingaeptable valuesfor other aspets.
Firstly,we present ahigh-throughput deoder design for the Quasi-Cyli (QC)
LDPC odes. Twonew tehniques are proposed forthe rsttime,inludingparallel
layered deoding arhiteture (PLDA) and ritial path splitting. Parallel layered
deoding arhiteture enablesparallel proessingfor alllayers by establishing
dedi-ated messagepassingpaths amongthem. Thedeoder avoids rossbar-based large
interonnet network. Critial path splitting tehnique is based on artiulate
ieny. Asaasestudy,arate-1/22304-bitirregularLDPC deoder isimplemented
using ASIC design in 90 nm CMOS proess. The deoder an ahieve an input
throughput of 1.1 Gbps, that is, 3 or4 times improvement over state-of-art LDPC
deoders, while maintaininga omparablehip size of 2.9mm
2
.Seondly,wepresentahigh-throughputdeoderarhitetureforrate-ompatible
(RC) LDPC odes whih supports arbitrary ode rates between the rate of mother
odeand1. WhiletheoriginalPLDA islak ofrateexibility,the problemissolved
graefully by inorporating the punturing sheme. Simulation results show that
our seleted punturing sheme only introdues the BER performane degradation
of less than 0.2dB, ompared with the dediated odes for dierent rates speied
in the IEEE 802.16e(WiMax) standard. Subsequently, PLDA isemployed forhigh
throughput deoder design. As a ase study, a RC-LDPC deoder based on the
rate-1/2WiMax LDPCode isimplementedinCMOS 90nmproess. The deoder
an ahieve an input throughput of 975 Mbps and supports any rate between 1/2
and 1.
Thirdly, wedevelop alow-omplexityVLSI arhitetureand implementationfor
LDPC deoder used in China Multimedia Mobile Broadasting (CMMB) systems.
Anarea-eientlayereddeodingarhiteturebasedonmin-sumalgorithmis
inor-poratedin the design. A novelsplit-memoryarhiteture is developed to eiently
handle the weight-2 submatries that are rarely seen in onventional LDPC
de-oders. In addition,the hek-node proessingunit ishighlyoptimizedtominimize
omplexityand omputinglatenywhilefailitatingareongurable deoding ore.
Finally,we propose an LDPC-deoder-likehanneldetetor forsparse ISI
han-nels using belief propagation (BP). The BP-based detetion omputationally
interferers. Layereddeodingalgorithm,whihispopularinLDPCdeoding,isalso
adopted in this paper. Simulation results show that the layered deoding doubles
the onvergene speed of the iterative belief propagation proess. Exploring the
speial strutureof theonnetionsbetweenthe heknodes andthe variablenodes
on thefator graph, wepropose aneetivedetetor arhiteture forgeneri sparse
ISI hannels to failitate the pratial appliation of the proposed detetion
algo-rithm. The proposed arhiteture is also reongurable in order to swith exible
Firstof all,I wish to thank my advisor,Professor Xinming Huang, forhis
guid-ane and support through all stages of my studies and researh at the Worester
PolytehniInstitute. I amverygratefulforhisreognition,hisinspiration,andthe
exposure and opportunities that I havereeived duringthe ourse of my study.
IamalsoindebtedtotoProfessorBerk Sunar,ProfessorJames Dukworth, and
Dr. ZhongfengWangfortheirvaluablesupportsasmembersofmythesisommittee.
I amgratefulto my fellowgraduate students inthe Embedded ComputingLab,
Cao Liang, WenxuanGuo, YanjiePeng, Chen Shen and WeiWang for their
friend-ship and support.
I would also like to thank all the sta in ECE department, inluding Robert
Brown, CatherineEmmerton, ColleenSweeney,Brenda MDonald and Staie
Mur-ray fortheir kind assistane and oordinations.
This thesis is dediated to my wonderful family. I am forever grateful to my
parentsShanfengZhangandQingfangZhao,my wifeWanningJiang, fortheirlove,
support, and enouragement. I shall not try to put my appreiation and love for
1 Introdution 1
1.1 Bakground . . . 1
1.2 Related Works . . . 3
1.3 Summary of Motivationsand Contributions . . . 4
1.3.1 Ultra-High-Throughput LDPCDeoder Arhiteture Design . 5 1.3.2 Flexible-Rate High-Throughput LDPC Deoder Arhiteture Design . . . 6
1.3.3 Eient Deoder Arhiteture Design for LDPC Codes with Speial Struture . . . 7
1.3.4 Eient Arhiteture Design in LDPC-Like BP-Based Cir-umstanes . . . 8
1.4 Outline. . . 8
2 LDPC Codes and Deoding Algorithms 10 2.1 Introdutionof LDPC Codes . . . 10
2.2 Quasi-Cyli LDPC Codes . . . 11
2.3 Belief Propagation Algorithm . . . 12
2.4 Min-Sum Algorithm . . . 13
2.7 Layered Deoding Algorithm . . . 18
3 High-Throughput LDPC Deoder Arhiteture with Parallel Lay-ered Deoding 22 3.1 HighThroughput Strategies . . . 22
3.2 ParallelLayered Deoding Arhiteture . . . 25
3.3 Critial Path Splitting . . . 29
3.4 Proposed Deoder Arhiteture . . . 30
3.4.1 Overall Deoder Arhiteture . . . 31
3.4.2 PipelinedArhiteture for CNU . . . 33
3.4.3 Deision Units. . . 35
3.5 Implementation Results . . . 36
3.6 Summary . . . 39
4 High-Throughput Rate-Compatible LDPC Deoder Arhiteture 40 4.1 Introdution . . . 40
4.2 Punturing Shemes for Rate-CompatibleLDPC Codes . . . 42
4.2.1 Quasi-CyliLDPCCodes withDual-DiagonalParityStruture 42 4.2.2 Rate-Compatible LDPC Codes . . . 43
4.2.3 Seleted Punturing Sheme . . . 45
4.3 High-Throughput Rate-Compatible LDPC Deoder Arhiteture . . . 50
4.3.1 Summary of the Parallel Layered Deoding Arhiteture . . . 50
4.3.2 RCLDPC Deoder Design . . . 54
4.4 ExperimentalResults . . . 57
4.4.1 Simulation Resultsfor Puntured WiMax Codes . . . 57
5 Low-ComplexityLDPCDeoderArhiteturefor CMMB Systems 62
5.1 Introdution . . . 62
5.2 QC-LDPC Codes in CMMB Standard. . . 64
5.3 Dual-rate Deoder Design . . . 65
5.3.1 Overall Arhiteture . . . 66
5.3.2 Dual-Rate CNU Design. . . 67
5.3.3 Memory Aess for Partially Parallel Layered Deoding . . . . 71
5.3.4 Split-MemoryArhiteture for Weight-2Sub-matries . . . 72
5.3.5 Numberof Pipeline Stages . . . 74
5.4 Area-Eient Design Tehniques . . . 76
5.4.1 Memory Redution . . . 76
5.4.2 Read/Write Networks . . . 76
5.5 Implementation results . . . 79
5.6 Summary . . . 81
6 Designof Belief-PropagationBased Detetorsfor Sparse ISI Chan-nels 82 6.1 Introdution . . . 82
6.2 Channel Modeland Deoding Algorithms. . . 84
6.2.1 Channel Modeland Fator Graph Representation . . . 84
6.2.2 Belief-Propagation Algorithm . . . 85
6.2.3 Layered Deoding Algorithm. . . 87
6.3 SimulationResults . . . 87
6.4 Detetor Arhiteture Design . . . 90
6.4.3 Cahe-Like Arhiteture . . . 94
6.5 Implementation Results . . . 95
6.6 Summary . . . 96
2.1 Regular (8,4) LDPC odes: (a)Parity-hek matrix (b) Tanner graph 11
2.2 Implementation of earlytermination strategy. . . 18
2.3 Parity-hek matrix for the seleted rate-1/2 LDPC ode in 802.16e
standard. . . 19
2.4 Message passing owin horizontallayered deoding. . . 19
2.5 Arhiteture of horizontal layered deoding with loosely oupled
al-gorithm. . . 20
3.1 Proessing status atfour dierent lok yles. . . 26
3.2 Variablesummationspassingdiretions ofthe
H
base matrix: (a)forolumn
1
(b) for olumn6
. . . 283.3 Oset-modiedparity-hekmatrixforrate-1/2LDPCodein802.16e. 29
3.4 Timing diagramfor parallellayered deoding arhiteture. . . 29
3.5 BER performane omparison ofdierent deoding algorithms. . . 31
3.6 Overall parallellayered deoding arhiteture forQC-LDPC odes. . 32
3.7 CNU arhiteture with ritialpath splittinginto4 pipelined stages.. 34
3.8 Register alloationin eah setion of the hard deisions.. . . 35
3.9 Layout of the deoder ore area. . . 37
4.3 BERs of the three puntured odes and the dediated ode at rate
2/3 over AWGN hannels. . . 47
4.4 Overall arhiteture of arate-ompatible LDPC deoder. . . 53
4.5 Arhiteture ofLLR InitializationBlok. . . 55
4.6 Arhiteture ofthe Address Generator. . . 56
4.7 BERs of the puntured LDPC odes over AWGN hannels. . . 58
4.8 Layout of the proposed deoder hip. . . 59
5.1 Strutureofthe parityhek matrixforLDPCodes inCMMB stan-dard: (a) rate-1/2;(b) rate-3/4. . . 64
5.2 BER performane fordierent rates and quantizationshemes. . . 66
5.3 Overall arhiteture of the CMMB LDPC deoder. . . 67
5.4 Arhiteture ofCNU for dual-rate (1/2, 3/4) CMMB LDPC odes. . 68
5.5 The design of anPROF-basedomparator . . . 69
5.6 Element orrespondene relations between CTV memory and APP memory. . . 71
5.7 Split-memory design forhandling the weight-2sub-matries. . . 73
5.8 Forrate-1/2odes (a)Arhitetureofread network; (b)Arhiteture of write network. . . 77
6.1 Fator graph of anISI hannel. . . 85
6.2 BER omparison between originalBP algorithmand LDA-basedBP algorithm. . . 87
6.3 BER performane forvarious quantization shemes of reeived data. . 88
6.4 BER performane for various quantizationshemes of extrinsi mes-sages. . . 89
6.6 Arhiteture ofthe CNU.. . . 93
3.1 Overall omparison between proposed deoder and other existing
LDPC deoders. . . 38
4.1 Threepunturingshemesforahievingrate2/3fromrate1/2mother
ode . . . 47
4.2 Index of puntured bloks at dierent desiredrates . . . 48
4.3 Overall omparison between proposed deoder and other existing
WiMax LDPC deoders . . . 60
5.1 Connetions of input and output ports in the read network for the
rate-1/2odes . . . 77
5.2 Connetions of input and output ports in the read network for the
rate-3/4odes . . . 78
5.3 Overall omparison between proposed deoder and other existing
Introdution
This hapter rst introdues the history of error orretionodes and LDPC odes
and then desribes the motivation,related works, ontributions and outline of this
dissertation.
1.1 Bakground
A major onern in designing reliable data transmission and storage systems is
the ontrolof data errors indued by a noisy hannel or storage medium. In 1948,
Shannon demonstratedthatiftheinformationrateislessthanthehannelapaity,
introdued errors an be orreted by proper enoding and deoding of the
infor-mation [63℄. From then on, various types of error orretion odes were designed
by adding some redundanies to the original data (known as enoding), whih the
reeiver an use to hek onsisteny of the delivered message, and toreoverdata
determined to be erroneous (known as deoding). Classial and widely used error
orretion odes inlude Hamming Code (1950), BCH Code (1959) [34, 8℄,
limit-approahing odes in whih two innovativeideas were exploited: iterative
de-oding and onstrained random ode onstrution. However, LDPC odes were
mostly negleted by ode theorists for more than 30 years due to the tremendous
omputational eorts in implementing their enoder and deoder and the
intro-dution of Reed-Solomon odes [59℄. Shortly after the debut of another
Shannon-limit-approahingode, Turboode [6℄,inwhihboth the aboveideasare explored,
LDPCodes wereredisoveredbyMaKay andNealin1996[50℄. LDPCodeshave
several advantages over Turbo odes: no error oors at high SNRs, more inherent
deoding parallelism,higher throughputpotentials, lower hardware omplexity due
to the eliminationof long interleavers.
LDPC odes are speied by sparse parity hek matries
H
and an be fullyrepresented by bipartite graphs (known as Tanner graph) with two sets of nodes,
alled the hek nodes and the variable nodes. LDPC odes an be eetively
deodedusingthestandardBPalgorithm,alsoalledsum-produtalgorithm(SPA).
Twophasesof messages, thehek-to-variable (CTV)messages and the
variable-to-hek (VTC) messages, are transmittedalongthe edges ofthe Tanner graph[69℄ to
update eah other iteratively.
Good LDPC odes that have been found are largely long, random and
om-puter generated odes with BP-based iterative deoding and an ahieve an
er-ror performane as lose as 0.0045 dB away from the theoretial Shannon limit
[50, 49, 60, 61, 14℄. In despite of exellent error performane, these odes are
im-pratial for hardware implementation due to the following reasons: (1) they are
very long (up to 100,000 bits) and thus a large memory is required to store the
iterative messages, as well as the onguration of the
H
matrix; (2) theirrandom Tanner graph; (4) randomly onstruted LDPC odes also require a
om-plex routing network, whih not only onsumes a large amount of hip area, but
also signiantlyinreases the omputationdelay.
In order to make LDPC odes hardware-friendly, strutured LDPC odes that
haveelegantregularityinthe strutureof their
H
matriesand anprovideompa-rable oreven bettererrorperformane were introduedinbyalgebraionstrution
[40, 39, 75, 35, 73, 53, 19, 57℄. It has been demonstrated in [48℄ that strutured,
algebrai-onstruted odes an outperform randomones formediumlength LDPC
odes (up to a few thousand bits). Besides, their enoding an be simply
imple-mented by linear shift registers. That is why LDPC odes are widely adopted by
severalreentommuniationstandardssuhas802.11n(Wi-Fi),802.15.3(WPAN)
802.16e (WiMax), DVB-S2, CMMB, 802.3an (10G-BaseT) and et. Quasi-Cyli
(QC) LDPC odes, a speial lass of strutured LDPC odes, deompose the large
H
matrixintosmallsub-matrieswhihareeitheridentitymatrixoritspermutation[22, 43,11℄. Therefore, QC LDPC odes are best suited for hardware
implementa-tion sinethe regularstruturesof their
H
matriesmakesimplemessageswithingand memoryaessing possible.
1.2 Related Works
This dissertation is devoted to high-performane VLSI arhiteture design and
im-plementation for LDPC odes. Generally,state-of-art LDPC deoder arhitetures
an bedividedintothreeategories: fullyparallelmethod,serialmethodandpartly
parallel method.
imple-throughput, requires huge amount of interonnetions to omplete the messages
passing within a random Tanner graph, leading to the average net length of 3nm
and the total hip area of 52.5 mm
2
. On the other side, Yeo et al. [80℄ designed a
serial deoder by sharing the omputationunits and updating messages in
sequen-tial. Serial method results in very simple hardware arhiteture but low deoding
throughput, whih is usually unaeptable by real-world appliations, due to the
large lateny of ten-thousand yles needed fordeoding a odeword.
Sine strutured LDPC odes are arhiteture-aware, their deoders an be
im-plemented by partly parallel method to explore potential parallelismand improve
the throughput[85,12,51,86,37,87,77,36℄. ComparedwithrandomLDPCodes,
the
H
matrixof struturedLDPC an beeasilystored by storingonlypermutationvalues of eah small sub-matrixinstead of the entire
H
matrix. Also, regularity inthe
H
matrix of strutured LDPC odes enables easy ontrol over the read/writeoperationsofVTCandCTVmessageswhendeodingbyemployingasimplebitwise
ounter, thusavoidingthe huge interonnetionnetwork in[7℄. QCLDPC deoders
proposed in [77, 79, 78, 17, 66, 28, 45, 70, 9, 25, 27, 67℄ also belong to the partly
parallel ategory.
1.3 Summary of Motivations and Contributions
This researh is motivated by desires to explore high-performane LDPC deoder
arhitetures intermsof area eieny, error-orretingapability,throughputand
rate exibility. Although LDPC deoder has been an extensive researh topi for
almost a deade, there are still many unresolved problems. With this dissertation,
ar-LDPC deoder.
The spei motivations for the work and the orresponding ontributions will
besummarized asfollows.
1.3.1 Ultra-High-Throughput LDPC Deoder Arhiteture
Design
Motivation: Withthe inreasing demand for high-data-ratewireless appliations,
many reent ommuniation systems employ ultra-high throughput hannel odes
to math the data-rate requirements. For example, 802.15.3 standard is targeted
for the data rateof multi-gigabits perseond (Gbps). However, onventional
high-throughput LDPCdeoders proposed in[7,46,54℄were implemented by employing
morehardwareresoures tooperateatahigherparallelism,whihonsumed alarge
amountof hip area. Forexample, the deoder in [46℄ ahieves anequivalent input
throughput of 1 Gbps with a lok rate of only 200 MHz, an undoubtedly high
parallelismof 512 and a large hip area of 14.5 mm
2
under 90 nm proess. Thus it
is desirableto investigateother methods toimprovethroughput.
Contribution: We proposed a parallel layered deoding arhiteture to ahieve
ultra high throughput. Parallel layered deoding arhiteture not only keeps the
advantage of onventional layered deoding algorithm that an redue the
onver-gene required number of iterations by half and thus double the throughput, but
alsoavoidsthe serialoperationsexistinginonventionallayered deodingalgorithm
whih limits overall throughput. Two other tehniques were also applied: ritial
pathsplittingandxedmessagepassing. Theformeroptimizesthedeoder'sritial
anLDPCdeoderwithoutinreasingparallelismand hardwareresoures. Asaase
study, a rate-1/2 2304-bit irregular LDPC deoder was implemented using ASIC
design in 90 nm CMOS proess. The deoder an ahieve an input throughput of
1.1Gbps, thatis,3 or4timesimprovementoverstate-of-artLDPCdeoders, while
maintaininga omparable hip size of2.9 mm
2
.1.3.2 Flexible-Rate High-Throughput LDPC Deoder
Arhi-teture Design
Motivation: In wireless ommuniation systems, it is desirable to adjust ECC
ode rate aording to the hannelstate information(CSI) to meet various servie
requirements and hannel onditions. For example, WiMax standard provides 6
dierentLDPCode patterns in4dierentrates(
1
/
2
,2
/
3
A
,2
/
3
B
,3
/
4
A
,3
/
4
B
,5
/
6
)[4℄.Al-though there have been some researh works on the design of exible rate LDPC
deoders [68, 84, 52, 46, 66℄, none of them an provide arbitrary ode rate.
More-over, these deoders usually employ ompliated interonnetion network toswith
between dierent ode patterns [84, 46, 66℄, suh as Benes networks used in [52℄,
whih leads to signal ongestion problem,large delay and low throughput. How to
design a exible rate LDPC deoder with highthroughput remains hallenging.
Contribution: With the dissertation we proposed a rate-ompatible (RC) LDPC
deoder arhiteture that an provide arbitrary ode rate between the rate of the
motherode and1. Codeswith exibleratesweregenerated bypunturingthe
par-ity bits of the mother ode with negligible error performane degradation. Parallel
layered deoding arhiteture was also adopted to ahieve high throughput where
ompliated interonnet network was replaed by xed wires. The proposed RC
on the rate-1/2WiMax LDPC ode is implemented in CMOS 90 nm proess. The
deoderanahieveaninputthroughputof975Mbpsandsupportsanyratebetween
1/2 and 1.
1.3.3 Eient Deoder Arhiteture Design for LDPC Codes
with Speial Struture
Motivation: Although urrent LDPC deoders an handle QC LPDC odes
ef-fetively and eiently, it is still a problem when dealing with some speially
on-struted odes, suh as the LDPC odes in CMMB standard whose sub-matries
are weigh-2 superimposed matries instead of identity matries. The design of an
area-eient LDPC deoder that supports multiple ode patterns is essential for
wireless appliations suh as CMMB.
Contribution: Weproposed alow-omplexityLDPCdeoder forCMMBsystems.
An area-eient layered deoding arhiteture based on min-sum algorithm is
in-orporated in the design. A reongurable arhiteture, whih an support dual
rate LDPC odes speied in the CMMB standard, is onstruted with minimal
overhead. A novel split-memoryarhiteture is developed to eiently handle the
weight-2 submatries that are rarely seen in onventional LDPC deoders. In
ad-dition, the hek-node proessing unit is highly optimized to minimize omplexity
Cirumstanes
Motivation: Besides LDPC deoding, alarge varietyof algorithmsin
ommunia-tion and signal proessingan be viewed asinstanes ofBP algorithm,whih
oper-ates by iterativemessage passingon abipartitegraph [47℄. Forexample, reent
re-searhhasfousedontheappliationoftheBP algorithmtodetetingoveraknown
ISI hannel[15,62,5℄. The omputationalomplexityofthe BP-baseddetetion
in-reasesexponentiallyonlywiththenumberofthenonzerointerferers,withrespetto
the optimalmaximum a posterior (MAP)algorithmsor maximum-likelihood(ML)
algorithmswhoseomputationalomplexityareexponentialinhannellength.
BP-basedhanneldetetorisneeded espeiallyinthesparseISI hannelwithlongdelay
spreads andonly afewnonzero interferers, suh asthe underwater aousti (UWA)
hannels [62℄. Thus it is desirable to investigate LDPC-like VLSI arhitetures for
appliations under these irumstanes.
Contribution: Inthisdissertation,weprovidedafeasiblesolutionofdetetor
arhi-teture forrandomsparseISIhannels. Areongurablearhiteturewasdeveloped
in order to swith exible onnetions on the fator graph in the time-varying ISI
hannels. LayereddeodingandtaskreshedulingoneusedinLDPCdeodingwere
alsoinorporatedtoaeleratetheiterativeproess. Allofthetasksaremanagedby
the ontrolunit implementedasa nitestate mahine(FSM). Theproposed
dete-torisimplemented usingASICdesignin90nm CMOSproess and alsoprototyped
on anFPGA.
1.4 Outline
rithms that are used throughout this dissertation. Besides standard BP, other
de-odingalgorithmssuhaslooselyoupledalgorithm,min-sumalgorithmandlayered
deoding algorithmare also presented.
Chapter 3 develops a ultra-high-throughput LDPC deoder arhiteture
imple-mentation with parallel layered deoding. We rst explain how parallellayered
de-oding worksand thenpresentshowitan helpsplit the ritialpath andeliminate
the interonnet network.
InChapter 4,a punturing-basedRCLDPC deoder arhiteture that supports
any ode rate 1/2 to 1 is designed and implemented. First we investigate various
punturing shemes and pik up the optimal one in term of error performane.
Then we modify the parallel layered deoding arhiteture developed in Chapter 3
by addinga ontrolblok to arry out punturing.
Chapter 5 presents alow-omplexity LDPCdeoder for CMMB systems.
Split-memory arhiteture is proposed toeiently handle the weight-2sub-matries. A
reongurable arhiteture, whih an support dual rate LDPC odes speied in
the CMMB standard, is then onstrutedwith minimaloverhead;
Chapter 6 presents VLSI implementation of a BP-based detetor for sparse ISI
hannels, suh asunderwateraoustihannel. Cahe-likearhiteture isdeveloped
by storingonly the messages that urrentnode interferes with.
LDPC Codes and Deoding
Algorithms
InthisChapter,wereviewssomefundamentalsofLDPCodes,QCLDPCodesand
their deoding algorithms, inluding standard BP algorithm, min-sum algorithm,
looselyoupled algorithmand layered deoding algorithm.
2.1 Introdution of LDPC Codes
The LDPC odes an be desribed by a
M
×
N
sparse parity hek matrixH
, inwhih most of the elements are 0's and only a few are 1's.
M
denotes the numberof parity hek equations, that is,number of the hek nodes, while
N
is the bloklength, that is, the number of variable nodes. Fig. 2.1(a) shows a rate-1/2 4Ö8
parity hek matrix of aregular (8,4)LDPC ode. In the deodingaspet, a parity
hekmatrixanbemappedintoabipartitegraph,alledtheTannergraph,withall
of thevariablenodes ononesideand heknodes onanother. Loationsofthe
! " #$% #&% ' ()* +,, $-+ )( '* +,, $-+
Figure2.1: Regular (8,4) LDPC odes: (a)Parity-hek matrix (b) Tanner graph
nodes and heknodes,asillustrated inFig. 2.1(b). Theseonnetionsan bealso
onsidered asthe messages (CTV messages and VTCmessages) transmittingpaths
when deoding using the iterativeBP algorithm.
2.2 Quasi-Cyli LDPC Codes
QC LDPC odes are a speial lass of the LDPC odes with strutured
H
matrixwhih an be generated froman
m
b
×
n
b
base matrixH
b
.H
b
=
P
0
,
0
P
0
,
1
· · ·
P
0
,
n
b
−
1
P
1
,
0
P
1
,
1
· · ·
P
1
,
n
b
−
1
. . . . . . . . . . . .P
m
b
−
1
,
0
P
m
b
−
1
,
1
· · ·
P
m
b
−
1
,
n
b
−
1
(2.1)Eah nonzero element
P
i
,
j
in the base matrix is az
×
z
submatrix that anbe expanded by irularly right-shifting an identity matrix with the shift value
CNUs and VNUs now beome well-regulated and easy to handle. Therefore,
QC-LDPC odes are welomed by several advaned ommuniation standards, suh as
802.11n, 802.15.3 and 802.16e.
2.3 Belief Propagation Algorithm
The BP algorithm provides an eient and powerful method to deoding LDPC
odes. StandardBPalgorithmin[24℄isusuallytransformedintologarithmidomain
whereadditionsanbeusedinsteadoftheomplexmultipliation. Beforepresenting
the BP algorithm,werst makesome denitionsasfollows: Let
c
n
denote then
-thbitof aodewordand
y
n
denotethe orrespondingreeived valuefromthe hannel.Let
r
mn
[
k
]
(q
mn
[
k
]
)bethe CTV(VTC)messagefromhek nodemtovariablenoden
at thek
-th iteration. LetN
(
m
)
denote the set of variables that partiipate inhek
m
andM
(
n
)
denote the set of heks that partiipatein variablen
. The setN
(
m
)
withoutvariablen
is denoted asN
(
m
)
\
n
and the setM
(
n
)
withouthekm
isdenoted asM
(
n
)
\
m
.1. Initialization:
Under the assumption of equal priori probability, ompute the hannel
prob-ability
p
n
(intrinsi information)of the variable noden
,by:p
n
=
log
P
(
y
n
|
c
n
= 0)
P
(
y
n
|
c
n
= 1)
(2.2)
The CTV message
r
mn
isset tobe zero.2. Iterative Deoding:
q
mn
[
k
] =
p
n
+
X
m
′
∈{
M
(
n
)
\
m
}
r
m
′
n
[
k
−
1]
(2.3)Meanwhile, the deoder an make a hard deision by alulating the APP
(a-posterior probability)by
Λ
n
[
k
] =
p
n
+
X
m
′
∈
M
(
n
)
r
m
′
n
[
k
−
1]
(2.4)Deide the
n
-th bit of the deoded odewordx
n
= 0
ifΛ
n
>
0
andx
n
= 1
otherwise. The deoding proess terminates when the entire odeword
x
=
[
x
1
, x
2
,
· · · ·
x
N
]
satisfy alloftheM
parity hek equations:H
x
= 0
,orthepreset maximum number of iterationis onsumed.
If the deoding proess does not stop, then, alulate the CTV message
r
mn
for the hek node
m
, byr
mn
[
k
] =
Y
n
′
∈{
N
(
m
)
\
n
}
sign
(
q
mn
′
[
k
])
×
Ψ
−
1
X
n
′
∈{
N
(
m
)
\
n
}
Ψ (
|
q
mn
′
[
k
]
|
)
(2.5)Ψ (
x
) = Ψ
−
1
(
x
) =
log
1 +
e
−
x
1
−
e
−
x
(2.6) 2.4 Min-Sum AlgorithmThe nonlinear
log
−
tanh
funtion in the hek node updating step is usuallyim-plemented by Look-Up Table (LUT), whih seriously inreases the omplexity and
bal-gorithm to lower the CNU omplexity by approximating the CTV message with a
minimum operation,as shown in (2.7).
r
mn
[
k
] =
Y
n
′
∈{
N
(
m
)
\
n
}
sign
(
q
mn
′
[
k
])
×
min
n
′
∈{
N
(
m
)
\
n
}
{|
q
mn
′
[
k
]
|} ×
α
(2.7)Here,anormalizedfator
α
isintroduedtoompensatefortheperformanelossexiting inthe min-sumalgorithmwithoutompensationompared toBPalgorithm
[23, 26,10℄. In this dissertation,
α
is set tobe0.75.Using the min-sum algorithm,the look-up tables (LUTs) whih implement the
intriate non-linear funtion in standard BP algorithmare now replaed by rather
simple omparators, resultinginsimpler omputationomplexity of CNU.Besides,
storage resoures an also be redued, as only the minimum and seond minimum
value of allof the VTC messages within ahek node need tobestored.
2.5 Loosely Coupled Algorithm
In the BP algorithm, messages are transmitted between hek nodes and variable
nodes iterativelytoupdateeahother. VTCmessagesarerenewed byhannel
prob-abilities and CTV messages from the hek nodes. Then, these renewed messages
needtobepassedagaintoupdatetheVTCmessages. EveryVTCandCTVmessage
should be transmitted immediately after they are renewed along the edges on the
Tanner graph. This large amount of transmitted messages auses an
interonne-tion problem. Large hip area isoupied by interonnetions and an area-eient
tionproblem. Thedeoder doesnotexhangetheCTV andVTCmessages between
the hek nodes and variable nodes. Instead,it delivers onlythe hek and variable
summation
∆
m
andΛ
n
. Atavariable node,given the hek summationvalues∆
m
,aVNU wouldrst reoverindividualCTV message
r
mn
and thenalulatethe nextvariable summation
Λ
n
whih will be transmitted toCNUs. As a result, (2.3) and(2.4) an nowbe modied as:
r
mn
[
k
−
1] = (
sign
(∆
m
[
k
−
1])
×
sign
(
q
mn
[
k
−
1]))
×
Ψ
−
1
(Ψ (
|
∆
m
[
k
−
1]
|
)
−
Ψ (
|
q
mn
[
k
−
1]
|
))
(2.8)Λ
n
[
k
] =
p
n
+
X
m
′
∈
M
(
n
)
r
m
′
n
[
k
−
1]
(2.9)q
mn
[
k
] =
p
n
+
X
m
′
∈{
M
(
n
)
\
m
}
r
m
′
n
[
k
−
1]
(2.10)Similarly, at a hek node, given the variable summation values
Λ
n
, a CNU wouldrst reoverindividualVTCmessage
q
mn
and thenalulatethenext heksumma-tion
∆
m
whih willbetransmitted toVNUs. Asa result,(2.5) an nowbe modiedas:
q
mn
[
k
] = Λ
n
[
k
]
−
r
mn
[
k
−
1]
(2.11)∆
m
[
k
] =
Y
n
′
∈
N
(
m
)
sign
(
q
mn
′
[
k
])
×
Ψ
−
1
X
n
′
∈
N
(
m
)
Ψ (
|
q
mn
′
[
k
]
|
)
(2.12)r
mn
[
k
] =
Y
n
′
∈{
N
(
m
)
\
n
}
sign
(
q
mn
′
[
k
])
×
Ψ
−
1
X
n
′
∈{
N
(
m
)
\
n
}
Ψ (
|
q
mn
′
[
k
]
|
)
(2.13)Ifthe min-sumalgorithmisalsoapplied,thenweonlyneedtotransmitthe variable
summation
Λ
n
, leadingto further simpliationininteronnetion omplexity [70℄.2.6 Early Termination Strategy
As presented inthe BP algorithm,the deoding proess an be terminated by two
general onditions: one is the parity hek equations
H
x
= 0
,
another is the atualnumber of iterationsexeeds the predened maximum number. For regular LDPC
ode deoding in paper[85℄, parity hek equations beome easy to verify, beause
everylokylesthe deodedbits fromtheVNUsare withinasamehekand
par-ityhekanbeompletedimmediatelyafterthedeodedbitsaredeided. However,
forirregularodessuhparityhekmethodwouldleadtolargerhardwareresoure,
longer deoding lateny and lower deoding throughput, beause extra storage
re-soures and lok yles are required tosavethe harddeision bits and toverify to
see if the parity hek equations are satised.
Another onventional termination method is to only set the maximum number
of iterationsand stop atthe preset number withoutonsidering if the parity hek
equations are satised. The method will have little hurt to the error-orretion
performane of the LDPC odes if the maximum number if suiently large. In
iruitleveldesign,asimpleounterwillbenetofullyimplementthis termination
method. Thus, less hardware omplexity is employed, ompared with the parity
num-environmentswhere fewer errors our.
In order to address the drawbaks of the two existing termination riteria, we
propose a more onvenient and robust earlytermination strategy [66℄ to balane
the throughput requirements and hardware usage. The early termination strategy
an not only dynamially adjust the number of iterationswhen dealing with
om-muniation hannels of dierent SNRs, but also be suient for the low-ost and
low-powerhardware implementation.
The pivot of the early termination strategy lies in that hard deisions from
previous iterationare storedand ompared with newly generatedhard deisions. If
all of the deoded bits at urrent iterationare idential with those of the previous
iteration,then the deoder indiatesasuessful deoding of urrent odeword and
jumps out of the iterativeproess. Otherwise, the deoding proess would ontinue
until two suessive hard deisions beome the same or the maximum number of
iterations is satised. As indiated in [66℄, for the LDPC ode used in this paper,
we only need to hek 1152 information bits instead of 2304 bits, beause it is a
rate-1/2 systematilinear blok ode.
Fig. 2.2 shows the hardware arhiteture of the early termination strategy. It
mainly onsists of XOR gates, OR gates and registers whih are used to save the
hard deisions. At every iteration yle, hard deisions of previous iteration are
retrieved from registers and XORed with urrent deisions to generate an array
of intermediate signals
same
_dif f
. Then eah bits of thesame
_dif f
signal areORed to generate the
f lag
signal. A high-levelf lag
signal indiates dierene ofhard deisions between the twosuessive iterationsand then the deoding proess
willontinueifthe predenednumberofiterationisnot satised. Otherwise,a
!" #! $$%& $$%& $$%& '( )* +, -. . ./( 0
Figure2.2: Implementationof early terminationstrategy
will stop.
2.7 Layered Deoding Algorithm
In BP algorithm [24℄, the two-phase mutual messages, namely VTC messages
q
mn
andCTVmessages
r
mn
,areupdatedbyseparateproessingunitsandpassedtoeahotheriteratively.
q
mn
updateswillnotstartuntilallofther
mn
arepreparedandvieversa. In horizontallayered deoding,the CTV messagefromthe urrent layerwill
bepassedvertiallytoallother unproessedlayers thatbelongtothesame variable
node. In eah iteration, the horizontal layers are proessed sequentially from the
top tothe bottomlayer.
As an example, the
H
matrix of rate-1/2LDPC ode from802.16e standard isquasi-yli, whihonsistsof sub-matriesthatare generated byirularlyshifting
anidentitymatrix, asshowninFig. 2.3. Suhstruture iswellsuitedforhorizontal
layered deodingaseahsub-matrixhas the olumnweight ofone andwean treat
eahrowofthe base matrixasalayer. Messagepassingowoflayered deoding for
0
0
7
7
26
26
41
41
66
66
43
43
12
12
0
0
0
0
49
49
39
39
65
65
7
7
11
11
0
0
0
0
72
72
70
70
59
59
94
94
10
10
0
0
0
0
51
51
43
43
24
24
83
83
12
12
9
9
0
0
0
0
47
47
2
2
73
73
11
11
8
8
0
0
0
0
18
18
14
14
53
53
95
95
7
7
0
0
0
0
0
0
79
79
82
82
40
40
46
46
6
6
0
0
0
0
72
72
41
41
84
84
39
39
5
5
0
0
0
0
25
25
65
65
47
47
61
61
4
4
0
0
0
0
0
0
33
33
81
81
22
22
24
24
3
3
0
0
0
0
12
12
9
9
79
79
22
22
27
27
2
2
0
0
7
7
83
83
55
55
73
73
94
94
1
1
24
24
23
23
22
22
21
21
20
20
19
19
18
18
17
17
16
16
15
15
14
14
13
13
12
12
11
11
10
10
9
9
8
8
7
7
6
6
5
5
4
4
3
3
2
2
1
1
Figure 2.3: Parity-hek matrix for the seleted rate-1/2 LDPC ode in 802.16e
standard.
! ! "# $% & ' ( )* +, -. / ! 0 123 0 123 0 123 ! 4 4 4 5 1 6 7 38 9 3 6 6 7 38 9 3 6 6 7 38 ! 4 4 4 :;; < =>?@A 4 4 4 4 4 4 B 31C D31 6 C123E2 C123 FG H I 32 J K L M O P Q R S F G H I 32 T UV W X Y Z[ \ ] FG H I 32 ! ^_ `a b c d ef gh i j k 4 4 4 4 4 4 9 3 6 6 7 38 5 1 6 7 38 5 1 6 7 38 !
Figure 2.5: Arhiteture of horizontal layered deoding with loosely oupled
netion omplexity. As illustrated in Fig. 2.5, at the
j
-th layer, the VTCmes-sages are rst reovered from the variable summations
Λ
n
and the CTV messagefrom memories, as in (6). At the output of CNU
j
, the updated CTV messagesr
j,n
j
|
n
j
∈
N
(
j
)
in (7)are storedbak tothe CTV memories. Variablesumma-tion
Λ
n
j
|
n
j
∈
N
(
j
)
are renewed as in (8) and sent bak to the APP memory.The entire owan beexpressed as follows:
q
j,n
j
= Λ
n
j
−
r
j,n
j
(2.14)r
j,n
j
=
Y
n
′
j
∈{
N
(
j
)
\
n
j
}
sign
q
j,n
′
j
×
min
n
′
j
∈{
N
(
j
)
\
n
j
}
q
j,n
j
×
α
!
(2.15)Λ
n
j
=
q
j,n
j
+
r
j,n
j
(2.16)Compared with onventional layered deoding algorithm, it an be observed
that loosely oupled algorithmdoes not require variable node operations. In other
words, VNUs an be eliminated sine their main funtion of updating the variable
summations
Λ
n
an be done by CNUs using VTC messagesr
j,n
j
[
k
]
from previousHigh-Throughput LDPC Deoder
Arhiteture with Parallel Layered
Deoding
3.1 High Throughput Strategies
With the inreasing demand for high-data-rate wireless appliations, many reent
ommuniation systems employ ultra-highthroughput hannel odes to math the
data-rate requirements. For example, 802.15.3 standard is targeted for the data
rate ofmulti-gigabits perseond(Gbps),thusLDPCodes arepreferredompared
withonvolutionalodesandTurboodes. However, itisagreathallengetodesign
a high-through LDPC deoder due to the omplexity of deoding algorithm. Sine
QC-LDPCodes areinreasinglypopularinemergingommuniationstandards,we
T hroughput
=
F req
×
Block Length
Cycles per Iter
×
Num of Iter
(3.1)Therefore, three strategies an be attempted in order to improve the throughput:
reduing the number of iterations required for onvergene, reduing the
deod-ing lateny per iterationand improvingthe operating frequeny. Correspondingly,
three arhiteture-aware shemes are studied in this paper, inluding layered
de-oding algorithm,parallellayered deoding arhiteture, and ritial path splitting
tehnique.
First, layered deoding algorithm is adopted to redue the required number of
iterations by a fator of two for any given SNR, ompared with the standard BP
algorithm. Hene, the deoding throughput is supposed to be doubled without
any biterror performane loss. Generally,there are two layered deoding methods:
horizontal layered deoding [51, 33, 45, 9, 25, 28℄ and vertial layered deoding
[82℄. It has been proved these two methods are theoretially equivalent and both
an onverge twie as fast as the BP algorithm [65℄. In this paper, we employ
the horizontal layered deoding strategy beause it is favorable for the min-sum
algorithm.
Seondly, we propose a novel sheme namely parallel layered deoding
arhi-teture that enables onurrent proessing among all layers. In traditionallayered
deodingarhiteture,thelayers areproessedsequentiallywhihleadstolonger
de-oding latenyperiteration[33℄. Inparallellayered deodingarhiteture, preisely
sheduled messagepassing among dierent layers guarantees that all updated
mes-sages are passedtotheir designatedloationsinonneted layers. The parity-hek
iently large for message passing. Moreover, parallel layered deoding arhiteture
supports the deoding arhiteture in whih hek node updates unit (CNU) and
variablenode updatesunit(VNU)are ombinedintoasinglefuntionalunit. Thus,
no extra lok yles are needed toomplete the variable node updateas they have
been merged into the CNU. As a result, the number of lok yles per iteration
in parallellayered deodingarhiteture an beredued by75% ompared withthe
existing arhitetures.
Finally,the tehniqueof reduing ritialpath delay isproposedto improvethe
operating frequeny at iruit-level. The ombination of CNU and VNU results a
long ritial path in the deoder implementation [66℄, whih limits the maximum
lok speed. Fortunately,there are idletime intervals amongdierent layers whih
allows the iterative messages to be proessed and passed to the next layer within
several lok yles. Therefore, we an insert registers tosplit the CNU (inluding
the VNU) intoseveral pipelinedstages. Consequently, the ritial pathis split into
multiple stages and the lok speed an be improved dramatially, on ondition
that the numberof stages does not exeed the idletime intervals. In pratie, this
ritial path splitting method an inrease the maximum frequeny of the deoder
by afator of
3
or higher.To demonstrate the aforementioned three tehniques, a rate-1/2 2304-bit
QC-LDPC ode isseleted from802.16estandard asa ase study. In addition, existing
tehniques suh as min-sum algorithm and loosely oupled algorithm are also
em-ployed to simplify deoding omplexity and to redue the hip area of the deoder
As disussedabove,the essentialreasonthatlayereddeodingalgorithman redue
the number of iterations is that the latest extrinsi messages are passed to and
employed by the subsequent layers within the urrent iteration. Therefore, layered
deodingrequires layerstobeproessed sequentially,whihresults alarge deoding
lateny periteration. A methodof inreasing parallelisminside a layeris proposed
in[51,45℄,butalllayersare stillproessedinseries. Althoughdeodingthroughput
an be improved, this method introdues rossbar-based interonnetion networks
that inrease the hardware omplexity.
Motivated by the partly parallel mehanism mentioned in [85℄, we propose the
parallel layered deoding arhiteture that allows all layers tobeproessed
onur-rently. Eah layer generates and sends updated messages and at the same time it
also reeivesthe updated messages fromother layers. Unlike the method proposed
in [51, 45℄, parallellayered deoding arhiteture uses parallelproessing amongall
layers and serial proessing within eah layer. Detailed message proessing ow at
the
j
-thlayer (CNU)an besummarized as1. Feth the orresponding variable summations
Λ
n
j
|
n
j
∈
N
(
j
)
from theAPPmemoryandCTVmessages
r
j,n
j
|
n
j
∈
N
(
j
)
fromloalCTVmemory.2. Calulatethe VTCmessages
q
j,n
j
|
n
j
∈
N
(
j
)
inthesamerowusing(2.14).3. Calulate horizontallytoobtainnew CTV messages
r
j,n
j
|
n
j
∈
N
(
j
)
,asin(2.15).
4. Immediatelyupdatethe variablesummations
Λ
n
j
|
n
j
∈
N
(
j
)
using(2.16).5. Deliverthenewvariablesummation
Λ
n
!" " # $% & & ' ' !" " # $% ' ( ' )* !" " # $% + )* , )* +( ( !" " # $% * ( && ( . / 0 1 2 3 . / 0 1 2 45 . / 0 1 2 67 . / 0 1 2 8 4 9: : ; < < =< 9: : ; < < =< 9: : ; << = < 3.1: Pro essing status at four dieren t lo k yles. 26
Hereby, we explainhow topass
Λ
n
j
messages in parallellayered deodingarhi-teture. Instead of passing
Λ
n
j
toall unproessed layers as in onventional layereddeoding, parallel layered deoding arhiteture only sends
Λ
n
j
to the layer thatwill use it next. Let us take the rst olumn of the
H
base matrix as an example.None-zero entries are at the
4
th,9
th and12
th layer whose permutation numbersare
61
,12
and43
, respetively. Now we suppose that atyle0
eah of these threeorresponding CNUs starts to proess from the rst row of the sub-matrix,
orre-spondingtothe
62
th,13
thand44
th olumnindies. In general,orresponding rowand olumnindies of the entry being proessed atyle
l
an be alulated asr
index
=
l
(3.2)c
index
=
mod
(
l
+ 1 +
s
(
i, j
)
,
96)
(3.3)InFig. 3.1, itillustratesthe messagepassing routesinparallellayered deoding
arhiteture using the the rst olumn of the parity hek matrix in Fig. 1 as an
example. The operation sequene in time of these three layers is desribed as the
following:
1. At yle
0
, CNUs at layers4
,9
and12
start to proess simultaneously fromrow
1
of eahsub-matrix,orrespondingtoolumn62
,olumn13
andolumn44
aording to(3.3).2. Atyle
18
,allthreelayers areproessingatrow19
,orrespondingtoolumn80
(in layer4), olumn31
(inlayer 9) and olumn62
(inlayer 12). Layer12
alls for the latest summation message of olumn
62
whih has already beenupdated by layer
4
. Therefore, the updated variable summations at layer4
Figure 3.2: Variable summations passing diretions of the
H
base matrix: (a) forolumn
1
(b) for olumn6
.3. Atyle
31
,allthreelayers areproessingatrow32
,orrespondingtoolumn93
(in layer 4), olumn44
(in layer 9) and olumn75
(in layer 12). Layer9
alls for the latest summation message of olumn
44
whih has already beenupdated by layer
12
. Therefore, the updated variable summationsat layer12
should be sent tolayer
9
(Layer12→
Layer9).4. Similarly,at yle
47
,all threelayers are proessing row48
,orresponding toolumn
13
(in layer 4), olumn60
(in layer 9) and olumn91
(in layer 12).Layer
4
alls for latest summation message of olumn13
whih has alreadybeenupdatedbylayer
9
. Therefore, theupdatedvariablesummationsatlayer9
should besent to layer4
(Layer 9→
Layer4).Based on the desription above, message passing routes for olumn
1
of basematrix is shown in Fig. 3.2(a). An additionalexample is illustrated in Fig. 3.2(b)
for olumn 6of the base matrix. In general,we an determine the message passing
routes among layers based on the their permutation values. For eah olumn, we
an sort the permutation values of all layers in desending order. Eah layer then
12
12
11
11
10
10
9
9
8
8
7
7
6
6
5
5
4
4
3
3
2
2
1
1
Layer
Layer
80
80
87
87
10
10
25
25
50
50
27
27
+80
+80
0
0
0
0
49
49
39
39
65
65
7
7
+0
+0
16
16
16
16
88
88
86
86
75
75
14
14
+16
+16
0
0
0
0
51
51
43
43
24
24
83
83
12
12
+0
+0
8
8
8
8
55
55
10
10
81
81
27
27
+8
+8
88
88
88
88
10
10
6
6
45
45
87
87
+88
+88
0
0
0
0
0
0
79
79
82
82
40
40
46
46
+0
+0
84
84
84
84
60
60
29
29
72
72
27
27
+84
+84
12
12
12
12
37
37
77
77
59
59
73
73
+12
+12
0
0
0
0
0
0
33
33
81
81
22
22
24
24
+0
+0
8
8
8
8
20
20
17
17
87
87
30
30
35
35
+8
+8
0
0
7
7
83
83
55
55
73
73
94
94
+0
+0
24
24
23
23
22
22
21
21
20
20
19
19
18
18
17
17
16
16
15
15
14
14
13
13
12
12
11
11
10
10
9
9
8
8
7
7
6
6
5
5
4
4
3
3
2
2
1
1
Offset
Offset
Figure3.3: Oset-modied parity-hek matrix forrate-1/2LDPCode in802.16e.
! " #
Figure3.4: Timingdiagram forparallel layered deoding arhiteture.
layerwiththe smallestpermutationvalueloopsbakand onnetstothe layerwith
the largest value. This message passing sheme guarantees the updated messages
among alllayered are proessed progressively within eah iteration.
3.3 Critial Path Splitting
Aordingtothe messagepassingmehanismexplainedabove,wesuppose CNUsin
alllayersstartatrow
1
ofeahsub-matrix. Atually,the CNUsanstarttooperatefromdierentrows, whihisequivalenttoaddinganosettothe permutationvalue
of eah layer. For example, Fig. ?? shows a modied
H
base matrix with a setof oset values added for dierent layers. The oset values are arefully seleted,
atleast
5
. It meanseahlayerneedstoread, update,store andpass the messagetothe next onneted layer with 5 lok yles. The detail steps for the proessing in
a layerare shown in inFig. 3.4. Inother words, eahCNU has aperiodof
4
ylesto omplete the task of reading oldmessage and update new message, asindiated
in (2.14) to (2.16). The message passing rule remains the same, exept that the
updated permutationvaluesshouldbeused todeterminethe onnetionsinPLDA.
FromFig. 3.3,weanseethattherowweightofeahlayeriseither
6
or7
,whihrequires
6
or7
messages to be ompared in the CNU. Combined with otherne-essary funtions suh as adding, rounding and memory read/write, CNU beomes
the ritialpaththat limitsthe maximumoperatingfrequeny ofthe deoder. T
ak-ing advantage of the 4-yle time intervals, we an split the ritial path of CNU
into
4
pipelined stages by inserting3
levels of registers. Anoptimal splittingyieldsbalaned delays among the pipeline stages. The implementation of ritial path
splitting willbe given inSetion 3.4.2.
3.4 Proposed Deoder Arhiteture
In ordertoprovethe onept ofour proposedparallellayereddeodingarhiteture
and high-throughput strategies, a rate-1/2 2304-bit QC-LDPC ode seleted from
802.16e standard is designed based on the oset-modied
H
matrix. The size ofeah sub-matrix is hosen to be
96
×
96
. Before onstruting the funtional unitsof the deoder, we rst perform xed-point analysis to quantize the word-length of
messages. In fat,word-lengthquantizationisatradeobetween memoryresoures
and bit-error-rateperformane. Fig. 3.5showsthe BERperformaneoftheseleted
rate-1/2 LDPC ode using message word-length
(3
,
2)
. As demonstrated in [?℄,5
0.5
1
1.5
2
2.5
10
−7
10
−6
10
−5
10
−4
10
−3
10
−2
10
−1
10
0
SNR(dB)
Bit Error Rate
Floating Standard BP @ 20 iter
Floating Min−Sum @ 20 iter
Floating Min−Sum @ 10 iter
Floating Proposed @ 10 iter
Fixed(3,2) Proposed @ 10 iter
Figure 3.5: BER performane omparison of dierentdeoding algorithms.
for representing the absolute values of the extrinsi messages. Considering one
additional sign bit, we hoose
6
bits xed-point representation for messages in thisimplementation.
3.4.1 Overall Deoder Arhiteture
The overall arhiteture of the QC-LDPC deoder is shown inFig. 3.6. It onsists
of hek node proessing units (CNUs), APP memory banks, CTV memory banks
and hard-deision units. The entire
H
matrix is divided into12
layers and eahlayer employs a dediated CNU. APP memory banks and CTV memory banks
are used to store variable summations and CTV messages. Eah non-zero
sub-matrixorrespondstooneAPPmemoryunitandoneCTVmemoryunit. Inlayered
deoding,eahAPPmemoryexportsasummationmessagetotheCNUandimports
a new summation message from the CNU of another layer. Similarly, eah CTV
! " # $ % % & & & ' ( ) * + , -. / 0 1 2 3 4 5 6 7 8 9 : % %& & & & ;< = > ? @ A BC DE F G H IJ KL M N O PQ RS T U V % % & WX X Y Z [ \ ] ^ _ ` ` ` ` ` ` a b a b & a b & ` ` ` cd d e & e f e g e h e i e f j k cd d & & e & & e l & e m & e h & e n & e & & e o cd d & & e & e l & e n & e & h & e & & e f ` ` ` ` ` j k & j k & ` ` ` ` ` ` ` p 3.6: Ov erall parallel la y ered deo ding ar hiteture for QC-LDPC 32
onurrentreadand writeoperationsare usedinthe design. Sinethereare
76
sub-matriesin total,the APP memory bank and the CTV memorybank eah onsists
of
76
smallsingle-portmemory units.Eah layer in the parallel layered deoding arhiteture orresponds to
6
or7
APP memory units, whih means that the updated variable summations of the
j
-th layer
Λ
j,n
j
will be delivered to APP memory units in other layers based on themessage passing sheme desribed in Setion ?? and Fig. 3.2. Therefore, these
variablesummationsare passedtotheirdestinations throughxed onnetionwires
instead of rossbar-based networks.
3.4.2 Pipelined Arhiteture for CNU
For hek node update, eah layer employs a CNU to perform a series of funtions
inluding subtration, omparison and addition. Consequently, the CNU beomes
the longest delay path in timing. In min-sum algorithm,a CNU needs to ompare
6
or7
numbers to nd the minimum value and its loation as well as the seondminimum value. Thus, the omparator beomes a key omponent in a CNU. A
2
-input omparator onsists of an adder and a multiplexer. A3
-inputompara-tor an be implemented with three adders, ve multiplexers and some basi logi
gates. Basedonomparisonunitsof2-inputand 3-inputomparators,
6
-inputor7
-inputomparatoranbeonstrutedbyombiningthreelevelsof2-inputor3-input
omparators, asshown inFig. 3.7.
Fig. 3.7shows detailedarhitetureof CNUanditsfuntionalblokinside. The
subtrator andadder bloksfulllthefuntionsdened in(2.14)and(2.16),
respe-tively. Two quantizers are inserted to prevent the CTV and variable summations
! " # $$ %& ' %& ' & & & & & & & ! " # $$ $$ $$ $$ ! $$ ! " $$ " # $$ # # ! " # $ $ $ $ $ $ $ $ ! $ $ " $ $ # () *+ , -. / 0 12 34 56 7 12 89 :) ; <) += .> ? 12 @1 A6 +.B >C) 12 @8 D 7/ 129 8 EF ; GH 8 129 8 EF ; GH 8 129 8 EF ; GH 8 12 84 EF ; G +I) J 5) )*BK 12 @1 A6 +.B >C) 12 3L D== )I 12 1M /> ? . 5 B + ? ) @ 12M 8 5 B + ? ) 3 12MM 5 B + ? ) 8 12N M 5B + ? ) 9 12N OP QR QST UVT RW XYPY 3.7: CNU ar hiteture with ritial path splitting in to 4 pip elined 34
!" #$ %& !" #$ %& !" #$ %&
Figure3.8: Registeralloationineah setionof the hard deisions.
messages.
AsexplainedinSetion3.3,inordertoreduetheritialpathdelayandimprove
lokspeed,aCNUanbesplitinto
4
pipelinedstagesbyinserting3
setofregisters.Registers are arefully inserted suh that all 4 pipeline stages have approximately
the same delay, sine maximum frequeny is determined by the longest delay path.
Delays of eah funtional unit and the pipeline stages are alsoshown inFig. 3.7.
3.4.3 Deision Units
The outputbits of the LDPC deoder are deided by the signs of the variable
sum-mations. In traditional horizontal layered deoding, deisions an be made during
the variable node proessing at the bottom layer. Parallel layered deoding
arhi-teture operates in a slightly dierent way in whih every layer generates variable
onur-rently. Asillustrated inFig. 3.8, letustakethe rstolumnofthemodied
H
basematrix with oset as an example. CNU
4
, CNU9
and CNU12
start to operatefrom row
13
, row1
and row81
, respetively, as dened by the oset values showninFig. 3.3, whihorrespond toolumn
74
,olumn13
and olumn28
,respetively.AordingtothemessagepassingrulederivedinSetion3.4.2,variablesummations
passing routes of the rst olumn is Layer
4
→
Layer12
→
Layer9
→
Layer4
, alsoshown inFig. 3.8. Afterompletionofeahiteration,thenal variablesummations
for olumns13through 27are stored atlayer12. Variablesummations forolumns
28
through73
are stored at layer4
. Similarly, variable summations for olumn74
through
12
are stored at layer9
. Therefore, the hard deision bits for olumns 1through 96are stored in the memoriesdistributed in3 dierent layers.
In this design, we employ a onvenient and robust earlytermination strategy,
similar to[64,66℄. The pivot of the earlyterminationstrategy liesin thatthe hard
deisionbits frompreviousiterationarestoredand omparedwith thedeisionbits
from urrent iteration. If alldeoded bits are idential, then the deoder indiates
suessfuldeodingofaodewordandtheiterativedeodingterminates. Otherwise,
the deoding proess ontinues until the maximumnumberof iterationsis reahed.
3.5 Implementation Results
In order to evaluate the performane of our proposed parallel layered deoding
ar-hiteture, we implement a rate-1/2 2304-bit QC-LDPC deoder in TSMC
90
nm1
.
0
V CMOS tehnology with 8-layer metals. We omplete synthesis and ore areaplae and route using Synopsys tools.
Implementation results show that the deoder an operate at a maximum
using
10
iterations. A totalof152 single-portmemoryunitseahwith sizeof96
×
6
bits (inluding one sign bit for eah message) are employed, whih sums up to
87
,
752
bits of memory use and oupies more than 75% of the ore area. Parallellayereddeodingarhitetureonlyneeds
(
p
+ 3)
×
Iter
lokylesforthe deodingproess and
p
is the size of the sub-matrix. In [66℄ and [70℄, this number rises top
×
(4
×
Iter
+ 1)
andp
×
(5
×
Iter
) + 12
, respetively. Hene, under the samenumberofiterations, parallellayereddeodingarhiteture ouldreduethe
deod-inglatenyby approximately
75
%. Moreover, theimplementationresultsshowthatthe maximumoperatingfrequenieswithandwithoutritialpathsplittingmethod
are
950
MHz and305
MHz, respetively, whihdemonstrates animprovementof thedeoder speed by a fator of
3
. Combined with layered deoding algorithm thatdoubles the onvergene speed, our proposed arhiteture an signiantly improve
the throughputof the QC-LDPC deoder up tomulti-Gbps.
Fig. 3.9 shows the layout view of the deoder ore area of
1
.
8
mm
×
1
.
6
mm
and the logi density of
70
%. The design does not have a rossbar interonnetionnetwork, sineparallellayered deoding arhiteture employs xed messagepassing
C. Liu [45℄ X. Shih [66℄ T. Brak [9℄ G.Gentile[25℄ Y. Ueng [70℄ M.Karkooti[?℄ Proposed Deoder Code Length 576~2304 576~2304 576~2304 576~2304 2304 1944 2304 Frequeny 150MHz 83.3MHz 333MHz 400MHz 200MHz 412MHz 950MHz Iterations 20 2~8 10,15 15 4.6(average) 15 10 Throughput 105Mbps 60~220Mbps 133-928Mbps 128-746Mbps 106Mbps 736MHz 2.2Gbps Tehnology 90nm 130nm 130nm 65nm 180nm 130nm 90nm Area 6.25
mm
2
8.29mm
2
3.83mm
2
0.59mm
2
- 2.4mm
2
2.9mm
2
Power 264mW 52mW - - - 502mW 870mW 38deoder is
2
.
9
mm
2
and the estimated power onsumptionis
870
mW.Table 3.1shows the deoder implementationresults ompared withthe existing
QC-LDPC deoders. Note that the throughput values from [9,25℄ are realulated
basedonthedeodedbitsforfairomparisontootherimplementationslistedinT
a-ble3.1. Weshowthattheproposeddeoderanahievehigherdeodingthroughput
with omparable or smaller hip area. The deoder onsumes more power mainly
due toitshigh operatingfrequeny.
3.6 Summary
LDPC odesare widelyused inreentommuniation systemsdue totheir superior
error-orretion performane. In this hapter, we proposed a new arhiteture to
improvethe throughput of QC-LDPC deoders. Withparallel layered deoding
ar-hiteture and ritialpath splittingtehnique, the deoder implementation,whih
usesTSMC90nmtehnology,anahieve
2
.
2
Gbpsdeodingthroughputforseletedrate-1/2irregularQC-LDPCodes. In addition,min-sumand looselyoupled
algo-rithms are employed for area eieny and the ore size is
2
.
9
mm
2
. The proposed
parallellayereddeoding arhiteture issuitableforthenear-apaityhannelodes
High-Throughput Rate-Compatible
LDPC Deoder Arhiteture
4.1 Introdution
Reently, thereisagrowinginterestinrate-ompatible(RC) LDPCodes and their
appliations. On time-varyinghannels, itis desirableto adjust the FEC ode rate
aording to the hannel s