High-Performance Decoder Architectures For Low-Density Parity-Check Codes

(1)

Worcester Polytechnic Institute

Digital WPI

Doctoral Dissertations (All Dissertations, All Years)

Electronic Theses and Dissertations

2012-01-09

High-Performance Decoder Architectures For

Low-Density Parity-Check Codes

Kai Zhang

Worcester Polytechnic Institute

Follow this and additional works at:

https://digitalcommons.wpi.edu/etd-dissertations

Repository Citation

Zhang, K. (2012).

High-Performance Decoder Architectures For Low-Density Parity-Check Codes

. Retrieved from

https://digitalcommons.wpi.edu/etd-dissertations/17

(2)

FOR LOW-DENSITY PARITY-CHECK CODES

by

Kai Zhang

A Dissertation

Submitted tothe faulty of the

WORCESTER POLYTECHNIC INSTITUTE

In partialfulllmentof the requirements for the

Degree of Dotor of Philosophy

in

Eletrialand Computer Engineering

by

August,2011

APPROVED:

Prof. Xinming Huang Prof. Berk Sunar

Thesis Advisor Thesis Committee

Worester Polytehni Institute Worester Polytehni Institute

Prof. James Dukworth Dr. ZhongfengWang

Thesis Committee Thesis Committee

(3)

TheLow-Density Parity-Chek (LDPC) odes, whihwere invented by Gallager

bakin1960s,haveattratedonsiderableattentionsreently. Comparedwithother

error orretion odes, LDPC odes are well suited for wireless, optial, and

mag-neti reording systems due to their near-Shannon-limit error-orreting apaity,

high intrinsi parallelism and high-throughput potentials. With these remarkable

harateristis, LDPC odes have been adopted in several reent ommuniation

standards suh as 802.11n(Wi-Fi), 802.16e(WiMax),802.15.3 (WPAN), DVB-S2

and CMMB.

This dissertation is devoted to exploring eient VLSI arhitetures for

high-performaneLDPC deoders andLDPC-likedetetors insparse inter-symbol

inter-ferene (ISI) hannels. The performane of an LDPC deoder is mainly evaluated

by area eieny, error-orreting apability, throughputand rate exibility. With

this workweinvestigatetradeosbetweenthefourperformaneaspetsanddevelop

several deoder arhitetures to improve one or several performane aspets while

maintainingaeptable valuesfor other aspets.

Firstly,we present ahigh-throughput deoder design for the Quasi-Cyli (QC)

LDPC odes. Twonew tehniques are proposed forthe rsttime,inludingparallel

layered deoding arhiteture (PLDA) and ritial path splitting. Parallel layered

deoding arhiteture enablesparallel proessingfor alllayers by establishing

dedi-ated messagepassingpaths amongthem. Thedeoder avoids rossbar-based large

interonnet network. Critial path splitting tehnique is based on artiulate

(4)

ieny. Asaasestudy,arate-1/22304-bitirregularLDPC deoder isimplemented

using ASIC design in 90 nm CMOS proess. The deoder an ahieve an input

throughput of 1.1 Gbps, that is, 3 or4 times improvement over state-of-art LDPC

deoders, while maintaininga omparablehip size of 2.9mm

2

.

Seondly,wepresentahigh-throughputdeoderarhitetureforrate-ompatible

(RC) LDPC odes whih supports arbitrary ode rates between the rate of mother

odeand1. WhiletheoriginalPLDA islak ofrateexibility,the problemissolved

graefully by inorporating the punturing sheme. Simulation results show that

our seleted punturing sheme only introdues the BER performane degradation

of less than 0.2dB, ompared with the dediated odes for dierent rates speied

in the IEEE 802.16e(WiMax) standard. Subsequently, PLDA isemployed forhigh

throughput deoder design. As a ase study, a RC-LDPC deoder based on the

rate-1/2WiMax LDPCode isimplementedinCMOS 90nmproess. The deoder

an ahieve an input throughput of 975 Mbps and supports any rate between 1/2

and 1.

Thirdly, wedevelop alow-omplexityVLSI arhitetureand implementationfor

LDPC deoder used in China Multimedia Mobile Broadasting (CMMB) systems.

Anarea-eientlayereddeodingarhiteturebasedonmin-sumalgorithmis

inor-poratedin the design. A novelsplit-memoryarhiteture is developed to eiently

handle the weight-2 submatries that are rarely seen in onventional LDPC

de-oders. In addition,the hek-node proessingunit ishighlyoptimizedtominimize

omplexityand omputinglatenywhilefailitatingareongurable deoding ore.

Finally,we propose an LDPC-deoder-likehanneldetetor forsparse ISI

han-nels using belief propagation (BP). The BP-based detetion omputationally

(5)

interferers. Layereddeodingalgorithm,whihispopularinLDPCdeoding,isalso

adopted in this paper. Simulation results show that the layered deoding doubles

the onvergene speed of the iterative belief propagation proess. Exploring the

speial strutureof theonnetionsbetweenthe heknodes andthe variablenodes

on thefator graph, wepropose aneetivedetetor arhiteture forgeneri sparse

ISI hannels to failitate the pratial appliation of the proposed detetion

algo-rithm. The proposed arhiteture is also reongurable in order to swith exible

(6)

Firstof all,I wish to thank my advisor,Professor Xinming Huang, forhis

guid-ane and support through all stages of my studies and researh at the Worester

PolytehniInstitute. I amverygratefulforhisreognition,hisinspiration,andthe

exposure and opportunities that I havereeived duringthe ourse of my study.

IamalsoindebtedtotoProfessorBerk Sunar,ProfessorJames Dukworth, and

Dr. ZhongfengWangfortheirvaluablesupportsasmembersofmythesisommittee.

I amgratefulto my fellowgraduate students inthe Embedded ComputingLab,

Cao Liang, WenxuanGuo, YanjiePeng, Chen Shen and WeiWang for their

friend-ship and support.

I would also like to thank all the sta in ECE department, inluding Robert

Brown, CatherineEmmerton, ColleenSweeney,Brenda MDonald and Staie

Mur-ray fortheir kind assistane and oordinations.

This thesis is dediated to my wonderful family. I am forever grateful to my

parentsShanfengZhangandQingfangZhao,my wifeWanningJiang, fortheirlove,

support, and enouragement. I shall not try to put my appreiation and love for

(7)

1 Introdution 1

1.1 Bakground . . . 1

1.2 Related Works . . . 3

1.3 Summary of Motivationsand Contributions . . . 4

1.3.1 Ultra-High-Throughput LDPCDeoder Arhiteture Design . 5 1.3.2 Flexible-Rate High-Throughput LDPC Deoder Arhiteture Design . . . 6

1.3.3 Eient Deoder Arhiteture Design for LDPC Codes with Speial Struture . . . 7

1.3.4 Eient Arhiteture Design in LDPC-Like BP-Based Cir-umstanes . . . 8

1.4 Outline. . . 8

2 LDPC Codes and Deoding Algorithms 10 2.1 Introdutionof LDPC Codes . . . 10

2.2 Quasi-Cyli LDPC Codes . . . 11

2.3 Belief Propagation Algorithm . . . 12

2.4 Min-Sum Algorithm . . . 13

(8)

2.7 Layered Deoding Algorithm . . . 18

3 High-Throughput LDPC Deoder Arhiteture with Parallel Lay-ered Deoding 22 3.1 HighThroughput Strategies . . . 22

3.2 ParallelLayered Deoding Arhiteture . . . 25

3.3 Critial Path Splitting . . . 29

3.4 Proposed Deoder Arhiteture . . . 30

3.4.1 Overall Deoder Arhiteture . . . 31

3.4.2 PipelinedArhiteture for CNU . . . 33

3.4.3 Deision Units. . . 35

3.5 Implementation Results . . . 36

3.6 Summary . . . 39

4 High-Throughput Rate-Compatible LDPC Deoder Arhiteture 40 4.1 Introdution . . . 40

4.2 Punturing Shemes for Rate-CompatibleLDPC Codes . . . 42

4.2.1 Quasi-CyliLDPCCodes withDual-DiagonalParityStruture 42 4.2.2 Rate-Compatible LDPC Codes . . . 43

4.2.3 Seleted Punturing Sheme . . . 45

4.3 High-Throughput Rate-Compatible LDPC Deoder Arhiteture . . . 50

4.3.1 Summary of the Parallel Layered Deoding Arhiteture . . . 50

4.3.2 RCLDPC Deoder Design . . . 54

4.4 ExperimentalResults . . . 57

4.4.1 Simulation Resultsfor Puntured WiMax Codes . . . 57

(9)

5 Low-ComplexityLDPCDeoderArhiteturefor CMMB Systems 62

5.1 Introdution . . . 62

5.2 QC-LDPC Codes in CMMB Standard. . . 64

5.3 Dual-rate Deoder Design . . . 65

5.3.1 Overall Arhiteture . . . 66

5.3.2 Dual-Rate CNU Design. . . 67

5.3.3 Memory Aess for Partially Parallel Layered Deoding . . . . 71

5.3.4 Split-MemoryArhiteture for Weight-2Sub-matries . . . 72

5.3.5 Numberof Pipeline Stages . . . 74

5.4 Area-Eient Design Tehniques . . . 76

5.4.1 Memory Redution . . . 76

5.4.2 Read/Write Networks . . . 76

5.5 Implementation results . . . 79

5.6 Summary . . . 81

6 Designof Belief-PropagationBased Detetorsfor Sparse ISI Chan-nels 82 6.1 Introdution . . . 82

6.2 Channel Modeland Deoding Algorithms. . . 84

6.2.1 Channel Modeland Fator Graph Representation . . . 84

6.2.2 Belief-Propagation Algorithm . . . 85

6.2.3 Layered Deoding Algorithm. . . 87

6.3 SimulationResults . . . 87

6.4 Detetor Arhiteture Design . . . 90

(10)

6.4.3 Cahe-Like Arhiteture . . . 94

6.5 Implementation Results . . . 95

6.6 Summary . . . 96

(11)

2.1 Regular (8,4) LDPC odes: (a)Parity-hek matrix (b) Tanner graph 11

2.2 Implementation of earlytermination strategy. . . 18

2.3 Parity-hek matrix for the seleted rate-1/2 LDPC ode in 802.16e

standard. . . 19

2.4 Message passing owin horizontallayered deoding. . . 19

2.5 Arhiteture of horizontal layered deoding with loosely oupled

al-gorithm. . . 20

3.1 Proessing status atfour dierent lok yles. . . 26

3.2 Variablesummationspassingdiretions ofthe

H

base matrix: (a)for

olumn

1

(b) for olumn

6

. . . 28

3.3 Oset-modiedparity-hekmatrixforrate-1/2LDPCodein802.16e. 29

3.4 Timing diagramfor parallellayered deoding arhiteture. . . 29

3.5 BER performane omparison ofdierent deoding algorithms. . . 31

3.6 Overall parallellayered deoding arhiteture forQC-LDPC odes. . 32

3.7 CNU arhiteture with ritialpath splittinginto4 pipelined stages.. 34

3.8 Register alloationin eah setion of the hard deisions.. . . 35

3.9 Layout of the deoder ore area. . . 37

(12)

4.3 BERs of the three puntured odes and the dediated ode at rate

2/3 over AWGN hannels. . . 47

4.4 Overall arhiteture of arate-ompatible LDPC deoder. . . 53

4.5 Arhiteture ofLLR InitializationBlok. . . 55

4.6 Arhiteture ofthe Address Generator. . . 56

4.7 BERs of the puntured LDPC odes over AWGN hannels. . . 58

4.8 Layout of the proposed deoder hip. . . 59

5.1 Strutureofthe parityhek matrixforLDPCodes inCMMB stan-dard: (a) rate-1/2;(b) rate-3/4. . . 64

5.2 BER performane fordierent rates and quantizationshemes. . . 66

5.3 Overall arhiteture of the CMMB LDPC deoder. . . 67

5.4 Arhiteture ofCNU for dual-rate (1/2, 3/4) CMMB LDPC odes. . 68

5.5 The design of anPROF-basedomparator . . . 69

5.6 Element orrespondene relations between CTV memory and APP memory. . . 71

5.7 Split-memory design forhandling the weight-2sub-matries. . . 73

5.8 Forrate-1/2odes (a)Arhitetureofread network; (b)Arhiteture of write network. . . 77

6.1 Fator graph of anISI hannel. . . 85

6.2 BER omparison between originalBP algorithmand LDA-basedBP algorithm. . . 87

6.3 BER performane forvarious quantization shemes of reeived data. . 88

6.4 BER performane for various quantizationshemes of extrinsi mes-sages. . . 89

(13)

6.6 Arhiteture ofthe CNU.. . . 93

(14)

3.1 Overall omparison between proposed deoder and other existing

LDPC deoders. . . 38

4.1 Threepunturingshemesforahievingrate2/3fromrate1/2mother

ode . . . 47

4.2 Index of puntured bloks at dierent desiredrates . . . 48

WiMax LDPC deoders . . . 60

5.1 Connetions of input and output ports in the read network for the

rate-1/2odes . . . 77

5.2 Connetions of input and output ports in the read network for the

rate-3/4odes . . . 78

(15)

Introdution

This hapter rst introdues the history of error orretionodes and LDPC odes

and then desribes the motivation,related works, ontributions and outline of this

dissertation.

1.1 Bakground

A major onern in designing reliable data transmission and storage systems is

the ontrolof data errors indued by a noisy hannel or storage medium. In 1948,

Shannon demonstratedthatiftheinformationrateislessthanthehannelapaity,

introdued errors an be orreted by proper enoding and deoding of the

infor-mation [63℄. From then on, various types of error orretion odes were designed

by adding some redundanies to the original data (known as enoding), whih the

reeiver an use to hek onsisteny of the delivered message, and toreoverdata

determined to be erroneous (known as deoding). Classial and widely used error

orretion odes inlude Hamming Code (1950), BCH Code (1959) [34, 8℄,

(16)

limit-approahing odes in whih two innovativeideas were exploited: iterative

de-oding and onstrained random ode onstrution. However, LDPC odes were

mostly negleted by ode theorists for more than 30 years due to the tremendous

omputational eorts in implementing their enoder and deoder and the

intro-dution of Reed-Solomon odes [59℄. Shortly after the debut of another

Shannon-limit-approahingode, Turboode [6℄,inwhihboth the aboveideasare explored,

LDPCodes wereredisoveredbyMaKay andNealin1996[50℄. LDPCodeshave

several advantages over Turbo odes: no error oors at high SNRs, more inherent

deoding parallelism,higher throughputpotentials, lower hardware omplexity due

to the eliminationof long interleavers.

LDPC odes are speied by sparse parity hek matries

H

and an be fully

represented by bipartite graphs (known as Tanner graph) with two sets of nodes,

alled the hek nodes and the variable nodes. LDPC odes an be eetively

deodedusingthestandardBPalgorithm,alsoalledsum-produtalgorithm(SPA).

Twophasesof messages, thehek-to-variable (CTV)messages and the

variable-to-hek (VTC) messages, are transmittedalongthe edges ofthe Tanner graph[69℄ to

update eah other iteratively.

Good LDPC odes that have been found are largely long, random and

om-puter generated odes with BP-based iterative deoding and an ahieve an

er-ror performane as lose as 0.0045 dB away from the theoretial Shannon limit

[50, 49, 60, 61, 14℄. In despite of exellent error performane, these odes are

im-pratial for hardware implementation due to the following reasons: (1) they are

very long (up to 100,000 bits) and thus a large memory is required to store the

iterative messages, as well as the onguration of the

H

matrix; (2) their

(17)

random Tanner graph; (4) randomly onstruted LDPC odes also require a

om-plex routing network, whih not only onsumes a large amount of hip area, but

also signiantlyinreases the omputationdelay.

In order to make LDPC odes hardware-friendly, strutured LDPC odes that

haveelegantregularityinthe strutureof their

H

matriesand anprovide

ompa-rable oreven bettererrorperformane were introduedinbyalgebraionstrution

[40, 39, 75, 35, 73, 53, 19, 57℄. It has been demonstrated in [48℄ that strutured,

algebrai-onstruted odes an outperform randomones formediumlength LDPC

odes (up to a few thousand bits). Besides, their enoding an be simply

imple-mented by linear shift registers. That is why LDPC odes are widely adopted by

severalreentommuniationstandardssuhas802.11n(Wi-Fi),802.15.3(WPAN)

802.16e (WiMax), DVB-S2, CMMB, 802.3an (10G-BaseT) and et. Quasi-Cyli

(QC) LDPC odes, a speial lass of strutured LDPC odes, deompose the large

H

matrixintosmallsub-matrieswhihareeitheridentitymatrixoritspermutation

[22, 43,11℄. Therefore, QC LDPC odes are best suited for hardware

implementa-tion sinethe regularstruturesof their

H

matriesmakesimplemessageswithing

and memoryaessing possible.

1.2 Related Works

This dissertation is devoted to high-performane VLSI arhiteture design and

im-plementation for LDPC odes. Generally,state-of-art LDPC deoder arhitetures

an bedividedintothreeategories: fullyparallelmethod,serialmethodandpartly

parallel method.

(18)

imple-throughput, requires huge amount of interonnetions to omplete the messages

passing within a random Tanner graph, leading to the average net length of 3nm

and the total hip area of 52.5 mm

2

. On the other side, Yeo et al. [80℄ designed a

serial deoder by sharing the omputationunits and updating messages in

sequen-tial. Serial method results in very simple hardware arhiteture but low deoding

throughput, whih is usually unaeptable by real-world appliations, due to the

large lateny of ten-thousand yles needed fordeoding a odeword.

Sine strutured LDPC odes are arhiteture-aware, their deoders an be

im-plemented by partly parallel method to explore potential parallelismand improve

the throughput[85,12,51,86,37,87,77,36℄. ComparedwithrandomLDPCodes,

the

H

matrixof struturedLDPC an beeasilystored by storingonlypermutation

values of eah small sub-matrixinstead of the entire

H

matrix. Also, regularity in

the

H

matrix of strutured LDPC odes enables easy ontrol over the read/write

operationsofVTCandCTVmessageswhendeodingbyemployingasimplebitwise

ounter, thusavoidingthe huge interonnetionnetwork in[7℄. QCLDPC deoders

proposed in [77, 79, 78, 17, 66, 28, 45, 70, 9, 25, 27, 67℄ also belong to the partly

parallel ategory.

1.3 Summary of Motivations and Contributions

This researh is motivated by desires to explore high-performane LDPC deoder

arhitetures intermsof area eieny, error-orretingapability,throughputand

rate exibility. Although LDPC deoder has been an extensive researh topi for

almost a deade, there are still many unresolved problems. With this dissertation,

(19)

ar-LDPC deoder.

The spei motivations for the work and the orresponding ontributions will

besummarized asfollows.

1.3.1 Ultra-High-Throughput LDPC Deoder Arhiteture

Design

Motivation: Withthe inreasing demand for high-data-ratewireless appliations,

many reent ommuniation systems employ ultra-high throughput hannel odes

to math the data-rate requirements. For example, 802.15.3 standard is targeted

for the data rateof multi-gigabits perseond (Gbps). However, onventional

high-throughput LDPCdeoders proposed in[7,46,54℄were implemented by employing

morehardwareresoures tooperateatahigherparallelism,whihonsumed alarge

amountof hip area. Forexample, the deoder in [46℄ ahieves anequivalent input

throughput of 1 Gbps with a lok rate of only 200 MHz, an undoubtedly high

parallelismof 512 and a large hip area of 14.5 mm

2

under 90 nm proess. Thus it

is desirableto investigateother methods toimprovethroughput.

Contribution: We proposed a parallel layered deoding arhiteture to ahieve

ultra high throughput. Parallel layered deoding arhiteture not only keeps the

advantage of onventional layered deoding algorithm that an redue the

onver-gene required number of iterations by half and thus double the throughput, but

alsoavoidsthe serialoperationsexistinginonventionallayered deodingalgorithm

whih limits overall throughput. Two other tehniques were also applied: ritial

pathsplittingandxedmessagepassing. Theformeroptimizesthedeoder'sritial

(20)

anLDPCdeoderwithoutinreasingparallelismand hardwareresoures. Asaase

study, a rate-1/2 2304-bit irregular LDPC deoder was implemented using ASIC

design in 90 nm CMOS proess. The deoder an ahieve an input throughput of

1.1Gbps, thatis,3 or4timesimprovementoverstate-of-artLDPCdeoders, while

maintaininga omparable hip size of2.9 mm

2

.

1.3.2 Flexible-Rate High-Throughput LDPC Deoder

Arhi-teture Design

Motivation: In wireless ommuniation systems, it is desirable to adjust ECC

ode rate aording to the hannelstate information(CSI) to meet various servie

requirements and hannel onditions. For example, WiMax standard provides 6

dierentLDPCode patterns in4dierentrates(

1

/

2

,

2

/

3

A

,

2

/

3

B

,

3

/

4

A

,

3

/

4

B

,

5

/

6

)[4℄.

Al-though there have been some researh works on the design of exible rate LDPC

deoders [68, 84, 52, 46, 66℄, none of them an provide arbitrary ode rate.

More-over, these deoders usually employ ompliated interonnetion network toswith

between dierent ode patterns [84, 46, 66℄, suh as Benes networks used in [52℄,

whih leads to signal ongestion problem,large delay and low throughput. How to

design a exible rate LDPC deoder with highthroughput remains hallenging.

Contribution: With the dissertation we proposed a rate-ompatible (RC) LDPC

deoder arhiteture that an provide arbitrary ode rate between the rate of the

motherode and1. Codeswith exibleratesweregenerated bypunturingthe

par-ity bits of the mother ode with negligible error performane degradation. Parallel

layered deoding arhiteture was also adopted to ahieve high throughput where

ompliated interonnet network was replaed by xed wires. The proposed RC

(21)

on the rate-1/2WiMax LDPC ode is implemented in CMOS 90 nm proess. The

deoderanahieveaninputthroughputof975Mbpsandsupportsanyratebetween

1/2 and 1.

1.3.3 Eient Deoder Arhiteture Design for LDPC Codes

with Speial Struture

Motivation: Although urrent LDPC deoders an handle QC LPDC odes

ef-fetively and eiently, it is still a problem when dealing with some speially

on-struted odes, suh as the LDPC odes in CMMB standard whose sub-matries

are weigh-2 superimposed matries instead of identity matries. The design of an

area-eient LDPC deoder that supports multiple ode patterns is essential for

wireless appliations suh as CMMB.

Contribution: Weproposed alow-omplexityLDPCdeoder forCMMBsystems.

An area-eient layered deoding arhiteture based on min-sum algorithm is

in-orporated in the design. A reongurable arhiteture, whih an support dual

rate LDPC odes speied in the CMMB standard, is onstruted with minimal

overhead. A novel split-memoryarhiteture is developed to eiently handle the

weight-2 submatries that are rarely seen in onventional LDPC deoders. In

ad-dition, the hek-node proessing unit is highly optimized to minimize omplexity

(22)

Cirumstanes

Motivation: Besides LDPC deoding, alarge varietyof algorithmsin

ommunia-tion and signal proessingan be viewed asinstanes ofBP algorithm,whih

oper-ates by iterativemessage passingon abipartitegraph [47℄. Forexample, reent

re-searhhasfousedontheappliationoftheBP algorithmtodetetingoveraknown

ISI hannel[15,62,5℄. The omputationalomplexityofthe BP-baseddetetion

in-reasesexponentiallyonlywiththenumberofthenonzerointerferers,withrespetto

the optimalmaximum a posterior (MAP)algorithmsor maximum-likelihood(ML)

algorithmswhoseomputationalomplexityareexponentialinhannellength.

BP-basedhanneldetetorisneeded espeiallyinthesparseISI hannelwithlongdelay

spreads andonly afewnonzero interferers, suh asthe underwater aousti (UWA)

hannels [62℄. Thus it is desirable to investigate LDPC-like VLSI arhitetures for

appliations under these irumstanes.

Contribution: Inthisdissertation,weprovidedafeasiblesolutionofdetetor

arhi-teture forrandomsparseISIhannels. Areongurablearhiteturewasdeveloped

in order to swith exible onnetions on the fator graph in the time-varying ISI

hannels. LayereddeodingandtaskreshedulingoneusedinLDPCdeodingwere

alsoinorporatedtoaeleratetheiterativeproess. Allofthetasksaremanagedby

the ontrolunit implementedasa nitestate mahine(FSM). Theproposed

dete-torisimplemented usingASICdesignin90nm CMOSproess and alsoprototyped

on anFPGA.

1.4 Outline

(23)

rithms that are used throughout this dissertation. Besides standard BP, other

de-odingalgorithmssuhaslooselyoupledalgorithm,min-sumalgorithmandlayered

deoding algorithmare also presented.

Chapter 3 develops a ultra-high-throughput LDPC deoder arhiteture

imple-mentation with parallel layered deoding. We rst explain how parallellayered

de-oding worksand thenpresentshowitan helpsplit the ritialpath andeliminate

the interonnet network.

InChapter 4,a punturing-basedRCLDPC deoder arhiteture that supports

any ode rate 1/2 to 1 is designed and implemented. First we investigate various

punturing shemes and pik up the optimal one in term of error performane.

Then we modify the parallel layered deoding arhiteture developed in Chapter 3

by addinga ontrolblok to arry out punturing.

Chapter 5 presents alow-omplexity LDPCdeoder for CMMB systems.

Split-memory arhiteture is proposed toeiently handle the weight-2sub-matries. A

reongurable arhiteture, whih an support dual rate LDPC odes speied in

the CMMB standard, is then onstrutedwith minimaloverhead;

Chapter 6 presents VLSI implementation of a BP-based detetor for sparse ISI

hannels, suh asunderwateraoustihannel. Cahe-likearhiteture isdeveloped

by storingonly the messages that urrentnode interferes with.

(24)

LDPC Codes and Deoding

Algorithms

InthisChapter,wereviewssomefundamentalsofLDPCodes,QCLDPCodesand

their deoding algorithms, inluding standard BP algorithm, min-sum algorithm,

looselyoupled algorithmand layered deoding algorithm.

2.1 Introdution of LDPC Codes

The LDPC odes an be desribed by a

M

×

N

sparse parity hek matrix

H

, in

whih most of the elements are 0's and only a few are 1's.

M

denotes the number

of parity hek equations, that is,number of the hek nodes, while

N

is the blok

length, that is, the number of variable nodes. Fig. 2.1(a) shows a rate-1/2 4Ö8

parity hek matrix of aregular (8,4)LDPC ode. In the deodingaspet, a parity

hekmatrixanbemappedintoabipartitegraph,alledtheTannergraph,withall

of thevariablenodes ononesideand heknodes onanother. Loationsofthe

(25)

! " #$% #&% ' ()* +,, $-+ )( '* +,, $-+

Figure2.1: Regular (8,4) LDPC odes: (a)Parity-hek matrix (b) Tanner graph

nodes and heknodes,asillustrated inFig. 2.1(b). Theseonnetionsan bealso

onsidered asthe messages (CTV messages and VTCmessages) transmittingpaths

when deoding using the iterativeBP algorithm.

2.2 Quasi-Cyli LDPC Codes

QC LDPC odes are a speial lass of the LDPC odes with strutured

H

matrix

whih an be generated froman

m

b

×

n

b

base matrix

H

b

.

H

b

=







P

₀

,

₀

P

₀

,

1

· · ·

P

₀

,

n

b

−

1

P

1

,

₀

P

1

,

1

· · ·

P

1

,

n

b

−

1

. . . . . . . . . . . .

P

m

b

−

1

,

0

P

m

b

−

1

,

1

· · ·

P

m

b

−

1

,

n

b

−

1







(2.1)

Eah nonzero element

P

i

,

j

in the base matrix is a

z

×

z

submatrix that an

be expanded by irularly right-shifting an identity matrix with the shift value

(26)

CNUs and VNUs now beome well-regulated and easy to handle. Therefore,

QC-LDPC odes are welomed by several advaned ommuniation standards, suh as

802.11n, 802.15.3 and 802.16e.

2.3 Belief Propagation Algorithm

The BP algorithm provides an eient and powerful method to deoding LDPC

odes. StandardBPalgorithmin[24℄isusuallytransformedintologarithmidomain

whereadditionsanbeusedinsteadoftheomplexmultipliation. Beforepresenting

the BP algorithm,werst makesome denitionsasfollows: Let

c

n

denote the

n

-th

bitof aodewordand

y

n

denotethe orrespondingreeived valuefromthe hannel.

Let

r

mn

[

k

]

(

q

mn

[

k

]

)bethe CTV(VTC)messagefromhek nodemtovariablenode

n

at the

k

-th iteration. Let

N

(

m

)

denote the set of variables that partiipate in

hek

m

and

M

(

n

)

denote the set of heks that partiipatein variable

n

. The set

N

(

m

)

withoutvariable

n

is denoted as

N

(

m

)

\

n

and the set

M

(

n

)

withouthek

m

isdenoted as

M

(

n

)

\

m

.

1. Initialization:

Under the assumption of equal priori probability, ompute the hannel

prob-ability

p

n

(intrinsi information)of the variable node

n

,by:

p

n

=

log

P

(

y

n

|

c

n

= 0)

P

(

y

n

|

c

n

= 1)

(2.2)

The CTV message

r

mn

isset tobe zero.

2. Iterative Deoding:

(27)

q

mn

[

k

] =

p

n

+

X

m

′

∈{

M

(

n

)

\

m

}

r

m

′

n

[

k

−

1]

(2.3)

Meanwhile, the deoder an make a hard deision by alulating the APP

(a-posterior probability)by

Λ

n

[

k

] =

p

n

+

X

m

′

∈

M

(

n

)

r

m

′

n

[

k

−

1]

(2.4)

Deide the

n

-th bit of the deoded odeword

x

n

= 0

if

Λ

n

>

0

and

x

n

= 1

otherwise. The deoding proess terminates when the entire odeword

x

=

[

x

1

, x

2

,

· · · ·

x

N

]

satisfy allofthe

M

parity hek equations:

H

x

= 0

,orthe

preset maximum number of iterationis onsumed.

If the deoding proess does not stop, then, alulate the CTV message

r

mn

for the hek node

m

, by

r

mn

[

k

] =

Y

n

′

∈{

N

(

m

)

\

n

}

sign

(

q

mn

′

[

k

])

×

Ψ

−

1





X

n

′

∈{

N

(

m

)

\

n

}

Ψ (

|

q

mn

′

[

k

]

|

)





(2.5)

Ψ (

x

) = Ψ

−

1

(

x

) =

log

1 +

e

−

x

1

−

e

−

x

(2.6) 2.4 Min-Sum Algorithm

The nonlinear

log

−

tanh

funtion in the hek node updating step is usually

im-plemented by Look-Up Table (LUT), whih seriously inreases the omplexity and

(28)

bal-gorithm to lower the CNU omplexity by approximating the CTV message with a

minimum operation,as shown in (2.7).

r

mn

[

k

] =

Y

n

′

∈{

N

(

m

)

\

n

}

sign

(

q

mn

′

[

k

])

×

min

n

′

∈{

N

(

m

)

\

n

}

{|

q

mn

′

[

k

]

|} ×

α

(2.7)

Here,anormalizedfator

α

isintroduedtoompensatefortheperformaneloss

exiting inthe min-sumalgorithmwithoutompensationompared toBPalgorithm

[23, 26,10℄. In this dissertation,

α

is set tobe0.75.

Using the min-sum algorithm,the look-up tables (LUTs) whih implement the

intriate non-linear funtion in standard BP algorithmare now replaed by rather

simple omparators, resultinginsimpler omputationomplexity of CNU.Besides,

storage resoures an also be redued, as only the minimum and seond minimum

value of allof the VTC messages within ahek node need tobestored.

2.5 Loosely Coupled Algorithm

In the BP algorithm, messages are transmitted between hek nodes and variable

nodes iterativelytoupdateeahother. VTCmessagesarerenewed byhannel

prob-abilities and CTV messages from the hek nodes. Then, these renewed messages

needtobepassedagaintoupdatetheVTCmessages. EveryVTCandCTVmessage

should be transmitted immediately after they are renewed along the edges on the

Tanner graph. This large amount of transmitted messages auses an

interonne-tion problem. Large hip area isoupied by interonnetions and an area-eient

(29)

tionproblem. Thedeoder doesnotexhangetheCTV andVTCmessages between

the hek nodes and variable nodes. Instead,it delivers onlythe hek and variable

summation

∆

m

and

Λ

n

. Atavariable node,given the hek summationvalues

∆

m

,

aVNU wouldrst reoverindividualCTV message

r

mn

and thenalulatethe next

variable summation

Λ

n

whih will be transmitted toCNUs. As a result, (2.3) and

(2.4) an nowbe modied as:

r

mn

[

k

−

1] = (

sign

(∆

m

[

k

−

1])

×

sign

(

q

mn

[

k

−

1]))

×

Ψ

−

1

(Ψ (

|

∆

m

[

k

−

1]

|

)

−

Ψ (

|

q

mn

[

k

−

1]

|

))

(2.8)

Λ

n

[

k

] =

p

n

+

X

m

′

∈

M

(

n

)

r

m

′

n

[

k

−

1]

(2.9)

q

mn

[

k

] =

p

n

+

X

m

′

∈{

M

(

n

)

\

m

}

r

m

′

n

[

k

−

1]

(2.10)

Similarly, at a hek node, given the variable summation values

Λ

n

, a CNU would

rst reoverindividualVTCmessage

q

mn

and thenalulatethenext hek

summa-tion

∆

m

whih willbetransmitted toVNUs. Asa result,(2.5) an nowbe modied

as:

q

mn

[

k

] = Λ

n

[

k

]

−

r

mn

[

k

−

1]

(2.11)

∆

m

[

k

] =

Y

n

′

∈

N

(

m

)

sign

(

q

mn

′

[

k

])

×

Ψ

−

1





X

n

′

∈

N

(

m

)

Ψ (

|

q

mn

′

[

k

]

|

)





(2.12)

(30)

r

mn

[

k

] =

Y

n

′

∈{

N

(

m

)

\

n

}

sign

(

q

mn

′

[

k

])

×

Ψ

−

1





X

n

′

∈{

N

(

m

)

\

n

}

Ψ (

|

q

mn

′

[

k

]

|

)





(2.13)

Ifthe min-sumalgorithmisalsoapplied,thenweonlyneedtotransmitthe variable

summation

Λ

n

, leadingto further simpliationininteronnetion omplexity [70℄.

2.6 Early Termination Strategy

As presented inthe BP algorithm,the deoding proess an be terminated by two

general onditions: one is the parity hek equations

H

x

= 0

,

another is the atual

number of iterationsexeeds the predened maximum number. For regular LDPC

ode deoding in paper[85℄, parity hek equations beome easy to verify, beause

everylokylesthe deodedbits fromtheVNUsare withinasamehekand

par-ityhekanbeompletedimmediatelyafterthedeodedbitsaredeided. However,

forirregularodessuhparityhekmethodwouldleadtolargerhardwareresoure,

longer deoding lateny and lower deoding throughput, beause extra storage

re-soures and lok yles are required tosavethe harddeision bits and toverify to

see if the parity hek equations are satised.

Another onventional termination method is to only set the maximum number

of iterationsand stop atthe preset number withoutonsidering if the parity hek

equations are satised. The method will have little hurt to the error-orretion

performane of the LDPC odes if the maximum number if suiently large. In

iruitleveldesign,asimpleounterwillbenetofullyimplementthis termination

method. Thus, less hardware omplexity is employed, ompared with the parity

(31)

num-environmentswhere fewer errors our.

In order to address the drawbaks of the two existing termination riteria, we

propose a more onvenient and robust earlytermination strategy [66℄ to balane

the throughput requirements and hardware usage. The early termination strategy

an not only dynamially adjust the number of iterationswhen dealing with

om-muniation hannels of dierent SNRs, but also be suient for the low-ost and

low-powerhardware implementation.

The pivot of the early termination strategy lies in that hard deisions from

previous iterationare storedand ompared with newly generatedhard deisions. If

all of the deoded bits at urrent iterationare idential with those of the previous

iteration,then the deoder indiatesasuessful deoding of urrent odeword and

jumps out of the iterativeproess. Otherwise, the deoding proess would ontinue

until two suessive hard deisions beome the same or the maximum number of

iterations is satised. As indiated in [66℄, for the LDPC ode used in this paper,

we only need to hek 1152 information bits instead of 2304 bits, beause it is a

rate-1/2 systematilinear blok ode.

Fig. 2.2 shows the hardware arhiteture of the early termination strategy. It

mainly onsists of XOR gates, OR gates and registers whih are used to save the

hard deisions. At every iteration yle, hard deisions of previous iteration are

retrieved from registers and XORed with urrent deisions to generate an array

of intermediate signals

same

_

dif f

. Then eah bits of the

same

_

dif f

signal are

ORed to generate the

f lag

signal. A high-level

f lag

signal indiates dierene of

hard deisions between the twosuessive iterationsand then the deoding proess

willontinueifthe predenednumberofiterationisnot satised. Otherwise,a

(32)

!" #! $$%& $$%& $$%& '( )* +, -. . ./( 0

Figure2.2: Implementationof early terminationstrategy

will stop.

2.7 Layered Deoding Algorithm

In BP algorithm [24℄, the two-phase mutual messages, namely VTC messages

q

mn

andCTVmessages

r

mn

,areupdatedbyseparateproessingunitsandpassedtoeah

otheriteratively.

q

mn

updateswillnotstartuntilallofthe

r

mn

arepreparedandvie

versa. In horizontallayered deoding,the CTV messagefromthe urrent layerwill

bepassedvertiallytoallother unproessedlayers thatbelongtothesame variable

node. In eah iteration, the horizontal layers are proessed sequentially from the

top tothe bottomlayer.

As an example, the

H

matrix of rate-1/2LDPC ode from802.16e standard is

quasi-yli, whihonsistsof sub-matriesthatare generated byirularlyshifting

anidentitymatrix, asshowninFig. 2.3. Suhstruture iswellsuitedforhorizontal

layered deodingaseahsub-matrixhas the olumnweight ofone andwean treat

eahrowofthe base matrixasalayer. Messagepassingowoflayered deoding for

(33)

0

7

26

41

66

43

12

0

49

39

65

7

11

0

72

70

59

94

10

0

51

43

24

83

12

9

0

47

2

73

11

8

0

18

14

53

95

7

0

79

82

40

46

6

0

72

41

84

39

5

0

25

65

47

61

4

0

33

81

22

24

3

0

12

9

79

22

27

2

0

7

83

55

73

94

1

24

23

22

21

20

19

18

17

16

15

14

13

12

11

10

9

8

7

6

5

4

3

2

1

Figure 2.3: Parity-hek matrix for the seleted rate-1/2 LDPC ode in 802.16e

standard.

(34)

! ! "# $% & ' ( )* +, -. / ! 0 123 0 123 0 123 ! 4 4 4 5 1 6 7 38 9 3 6 6 7 38 9 3 6 6 7 38 ! 4 4 4 :;; < =>?@A 4 4 4 4 4 4 B 31C D31 6 C123E2 C123 FG H I 32 J K L M O P Q R S F G H I 32 T UV W X Y Z[ \ ] FG H I 32 ! ^_ `a b c d ef gh i j k 4 4 4 4 4 4 9 3 6 6 7 38 5 1 6 7 38 5 1 6 7 38 !

Figure 2.5: Arhiteture of horizontal layered deoding with loosely oupled

(35)

netion omplexity. As illustrated in Fig. 2.5, at the

j

-th layer, the VTC

mes-sages are rst reovered from the variable summations

Λ

n

and the CTV message

from memories, as in (6). At the output of CNU

j

, the updated CTV messages

r

j,n

j

|

n

j

∈

N

(

j

)

in (7)are storedbak tothe CTV memories. Variable

summa-tion

Λ

n

j

|

n

j

∈

N

(

j

)

are renewed as in (8) and sent bak to the APP memory.

The entire owan beexpressed as follows:

q

j,n

j

= Λ

n

j

−

r

j,n

j

(2.14)

r

j,n

j

=





Y

n

′

j

∈{

N

(

j

)

\

n

j

}

sign

q

j,n

′

j





×

min

n

′

j

∈{

N

(

j

)

\

n

j

}

q

j,n

j

×

α

!

(2.15)

Λ

n

j

=

q

j,n

j

+

r

j,n

j

(2.16)

Compared with onventional layered deoding algorithm, it an be observed

that loosely oupled algorithmdoes not require variable node operations. In other

words, VNUs an be eliminated sine their main funtion of updating the variable

summations

Λ

n

an be done by CNUs using VTC messages

r

j,n

j

[

k

]

from previous

(36)

High-Throughput LDPC Deoder

Arhiteture with Parallel Layered

Deoding

3.1 High Throughput Strategies

With the inreasing demand for high-data-rate wireless appliations, many reent

ommuniation systems employ ultra-highthroughput hannel odes to math the

data-rate requirements. For example, 802.15.3 standard is targeted for the data

rate ofmulti-gigabits perseond(Gbps),thusLDPCodes arepreferredompared

withonvolutionalodesandTurboodes. However, itisagreathallengetodesign

a high-through LDPC deoder due to the omplexity of deoding algorithm. Sine

QC-LDPCodes areinreasinglypopularinemergingommuniationstandards,we

(37)

T hroughput

=

F req

×

Block Length

Cycles per Iter

×

Num of Iter

(3.1)

Therefore, three strategies an be attempted in order to improve the throughput:

reduing the number of iterations required for onvergene, reduing the

deod-ing lateny per iterationand improvingthe operating frequeny. Correspondingly,

three arhiteture-aware shemes are studied in this paper, inluding layered

de-oding algorithm,parallellayered deoding arhiteture, and ritial path splitting

tehnique.

First, layered deoding algorithm is adopted to redue the required number of

iterations by a fator of two for any given SNR, ompared with the standard BP

algorithm. Hene, the deoding throughput is supposed to be doubled without

any biterror performane loss. Generally,there are two layered deoding methods:

horizontal layered deoding [51, 33, 45, 9, 25, 28℄ and vertial layered deoding

[82℄. It has been proved these two methods are theoretially equivalent and both

an onverge twie as fast as the BP algorithm [65℄. In this paper, we employ

the horizontal layered deoding strategy beause it is favorable for the min-sum

algorithm.

Seondly, we propose a novel sheme namely parallel layered deoding

arhi-teture that enables onurrent proessing among all layers. In traditionallayered

deodingarhiteture,thelayers areproessedsequentiallywhihleadstolonger

de-oding latenyperiteration[33℄. Inparallellayered deodingarhiteture, preisely

sheduled messagepassing among dierent layers guarantees that all updated

mes-sages are passedtotheir designatedloationsinonneted layers. The parity-hek

(38)

iently large for message passing. Moreover, parallel layered deoding arhiteture

supports the deoding arhiteture in whih hek node updates unit (CNU) and

variablenode updatesunit(VNU)are ombinedintoasinglefuntionalunit. Thus,

no extra lok yles are needed toomplete the variable node updateas they have

been merged into the CNU. As a result, the number of lok yles per iteration

in parallellayered deodingarhiteture an beredued by75% ompared withthe

existing arhitetures.

Finally,the tehniqueof reduing ritialpath delay isproposedto improvethe

operating frequeny at iruit-level. The ombination of CNU and VNU results a

long ritial path in the deoder implementation [66℄, whih limits the maximum

lok speed. Fortunately,there are idletime intervals amongdierent layers whih

allows the iterative messages to be proessed and passed to the next layer within

several lok yles. Therefore, we an insert registers tosplit the CNU (inluding

the VNU) intoseveral pipelinedstages. Consequently, the ritial pathis split into

multiple stages and the lok speed an be improved dramatially, on ondition

that the numberof stages does not exeed the idletime intervals. In pratie, this

ritial path splitting method an inrease the maximum frequeny of the deoder

by afator of

3

or higher.

To demonstrate the aforementioned three tehniques, a rate-1/2 2304-bit

QC-LDPC ode isseleted from802.16estandard asa ase study. In addition, existing

tehniques suh as min-sum algorithm and loosely oupled algorithm are also

em-ployed to simplify deoding omplexity and to redue the hip area of the deoder

(39)

As disussedabove,the essentialreasonthatlayereddeodingalgorithman redue

the number of iterations is that the latest extrinsi messages are passed to and

employed by the subsequent layers within the urrent iteration. Therefore, layered

deodingrequires layerstobeproessed sequentially,whihresults alarge deoding

lateny periteration. A methodof inreasing parallelisminside a layeris proposed

in[51,45℄,butalllayersare stillproessedinseries. Althoughdeodingthroughput

an be improved, this method introdues rossbar-based interonnetion networks

that inrease the hardware omplexity.

Motivated by the partly parallel mehanism mentioned in [85℄, we propose the

parallel layered deoding arhiteture that allows all layers tobeproessed

onur-rently. Eah layer generates and sends updated messages and at the same time it

also reeivesthe updated messages fromother layers. Unlike the method proposed

in [51, 45℄, parallellayered deoding arhiteture uses parallelproessing amongall

layers and serial proessing within eah layer. Detailed message proessing ow at

the

j

-thlayer (CNU)an besummarized as

1. Feth the orresponding variable summations

Λ

n

j

|

n

j

∈

N

(

j

)

from the

APPmemoryandCTVmessages

r

j,n

j

|

n

j

∈

N

(

j

)

fromloalCTVmemory.

2. Calulatethe VTCmessages

q

j,n

j

|

n

j

∈

N

(

j

)

inthesamerowusing(2.14).

3. Calulate horizontallytoobtainnew CTV messages

r

j,n

j

|

n

j

∈

N

(

j

)

,asin

(2.15).

4. Immediatelyupdatethe variablesummations

Λ

n

j

|

n

j

∈

N

(

j

)

using(2.16).

5. Deliverthenewvariablesummation

Λ

n

(40)

!" " # $% & & ' ' !" " # $% ' ( ' )* !" " # $% + )* , )* +( ( !" " # $% * ( && ( . / 0 1 2 3 . / 0 1 2 45 . / 0 1 2 67 . / 0 1 2 8 4 9: : ; < < =< 9: : ; < < =< 9: : ; << = < 3.1: Pro essing status at four dieren t lo k yles. 26

(41)

Hereby, we explainhow topass

Λ

n

j

messages in parallellayered deoding

arhi-teture. Instead of passing

Λ

n

j

toall unproessed layers as in onventional layered

deoding, parallel layered deoding arhiteture only sends

Λ

n

j

to the layer that

will use it next. Let us take the rst olumn of the

H

base matrix as an example.

None-zero entries are at the

4

th,

9

th and

12

th layer whose permutation numbers

are

61

,

12

and

43

, respetively. Now we suppose that atyle

0

eah of these three

orresponding CNUs starts to proess from the rst row of the sub-matrix,

orre-spondingtothe

62

th,

13

thand

44

th olumnindies. In general,orresponding row

and olumnindies of the entry being proessed atyle

l

an be alulated as

r

index

=

l

(3.2)

c

index

=

mod

(

l

+ 1 +

s

(

i, j

)

,

96)

(3.3)

InFig. 3.1, itillustratesthe messagepassing routesinparallellayered deoding

arhiteture using the the rst olumn of the parity hek matrix in Fig. 1 as an

example. The operation sequene in time of these three layers is desribed as the

following:

1. At yle

0

, CNUs at layers

4

,

9

and

12

start to proess simultaneously from

row

1

of eahsub-matrix,orrespondingtoolumn

62

,olumn

13

andolumn

44

aording to(3.3).

2. Atyle

18

,allthreelayers areproessingatrow

19

,orrespondingtoolumn

80

(in layer4), olumn

31

(inlayer 9) and olumn

62

(inlayer 12). Layer

12

alls for the latest summation message of olumn

62

whih has already been

updated by layer

4

. Therefore, the updated variable summations at layer

4

(42)

Figure 3.2: Variable summations passing diretions of the

H

base matrix: (a) for

olumn

1

(b) for olumn

6

.

3. Atyle

31

,allthreelayers areproessingatrow

32

,orrespondingtoolumn

93

(in layer 4), olumn

44

(in layer 9) and olumn

75

(in layer 12). Layer

9

alls for the latest summation message of olumn

44

whih has already been

updated by layer

12

. Therefore, the updated variable summationsat layer

12

should be sent tolayer

9

(Layer12

→

Layer9).

4. Similarly,at yle

47

,all threelayers are proessing row

48

,orresponding to

olumn

13

(in layer 4), olumn

60

(in layer 9) and olumn

91

(in layer 12).

Layer

4

alls for latest summation message of olumn

13

whih has already

beenupdatedbylayer

9

. Therefore, theupdatedvariablesummationsatlayer

9

should besent to layer

4

(Layer 9

→

Layer4).

Based on the desription above, message passing routes for olumn

1

of base

matrix is shown in Fig. 3.2(a). An additionalexample is illustrated in Fig. 3.2(b)

for olumn 6of the base matrix. In general,we an determine the message passing

routes among layers based on the their permutation values. For eah olumn, we

an sort the permutation values of all layers in desending order. Eah layer then

(43)

12

11

10

9

8

7

6

5

4

3

2

1

Layer

80

87

10

25

50

27

+80

0

49

39

65

7

+0

16

88

86

75

14

+16

0

51

43

24

83

12

+0

8

55

10

81

27

+8

88

10

6

45

87

+88

0

79

82

40

46

+0

84

60

29

72

27

+84

12

37

77

59

73

+12

0

33

81

22

24

+0

8

20

17

87

30

35

+8

0

7

83

55

73

94

+0

24

23

22

21

20

19

18

17

16

15

14

13

12

11

10

9

8

7

6

5

4

3

2

1

Offset

Figure3.3: Oset-modied parity-hek matrix forrate-1/2LDPCode in802.16e.

! " #

Figure3.4: Timingdiagram forparallel layered deoding arhiteture.

layerwiththe smallestpermutationvalueloopsbakand onnetstothe layerwith

the largest value. This message passing sheme guarantees the updated messages

among alllayered are proessed progressively within eah iteration.

3.3 Critial Path Splitting

Aordingtothe messagepassingmehanismexplainedabove,wesuppose CNUsin

alllayersstartatrow

1

ofeahsub-matrix. Atually,the CNUsanstarttooperate

fromdierentrows, whihisequivalenttoaddinganosettothe permutationvalue

of eah layer. For example, Fig. ?? shows a modied

H

base matrix with a set

of oset values added for dierent layers. The oset values are arefully seleted,

(44)

atleast

5

. It meanseahlayerneedstoread, update,store andpass the messageto

the next onneted layer with 5 lok yles. The detail steps for the proessing in

a layerare shown in inFig. 3.4. Inother words, eahCNU has aperiodof

4

yles

to omplete the task of reading oldmessage and update new message, asindiated

in (2.14) to (2.16). The message passing rule remains the same, exept that the

updated permutationvaluesshouldbeused todeterminethe onnetionsinPLDA.

FromFig. 3.3,weanseethattherowweightofeahlayeriseither

6

or

7

,whih

requires

6

or

7

messages to be ompared in the CNU. Combined with other

ne-essary funtions suh as adding, rounding and memory read/write, CNU beomes

the ritialpaththat limitsthe maximumoperatingfrequeny ofthe deoder. T

ak-ing advantage of the 4-yle time intervals, we an split the ritial path of CNU

into

4

pipelined stages by inserting

3

levels of registers. Anoptimal splittingyields

balaned delays among the pipeline stages. The implementation of ritial path

splitting willbe given inSetion 3.4.2.

3.4 Proposed Deoder Arhiteture

In ordertoprovethe onept ofour proposedparallellayereddeodingarhiteture

and high-throughput strategies, a rate-1/2 2304-bit QC-LDPC ode seleted from

802.16e standard is designed based on the oset-modied

H

matrix. The size of

eah sub-matrix is hosen to be

96

×

96

. Before onstruting the funtional units

of the deoder, we rst perform xed-point analysis to quantize the word-length of

messages. In fat,word-lengthquantizationisatradeobetween memoryresoures

and bit-error-rateperformane. Fig. 3.5showsthe BERperformaneoftheseleted

rate-1/2 LDPC ode using message word-length

(3

,

2)

. As demonstrated in [?℄,

5

(45)

0.5

1

1.5

2

2.5

10

−7

10

−6

10

−5

10

−4

10

−3

10

−2

10

−1

10

0

SNR(dB)

Bit Error Rate

Floating Standard BP @ 20 iter

Floating Min−Sum @ 20 iter

Floating Min−Sum @ 10 iter

Floating Proposed @ 10 iter

Fixed(3,2) Proposed @ 10 iter

Figure 3.5: BER performane omparison of dierentdeoding algorithms.

for representing the absolute values of the extrinsi messages. Considering one

additional sign bit, we hoose

6

bits xed-point representation for messages in this

implementation.

3.4.1 Overall Deoder Arhiteture

The overall arhiteture of the QC-LDPC deoder is shown inFig. 3.6. It onsists

of hek node proessing units (CNUs), APP memory banks, CTV memory banks

and hard-deision units. The entire

H

matrix is divided into

12

layers and eah

layer employs a dediated CNU. APP memory banks and CTV memory banks

are used to store variable summations and CTV messages. Eah non-zero

sub-matrixorrespondstooneAPPmemoryunitandoneCTVmemoryunit. Inlayered

deoding,eahAPPmemoryexportsasummationmessagetotheCNUandimports

a new summation message from the CNU of another layer. Similarly, eah CTV

(46)

! " # $ % % & & & ' ( ) * + , -. / 0 1 2 3 4 5 6 7 8 9 : % %& & & & ;< = > ? @ A BC DE F G H IJ KL M N O PQ RS T U V % % & WX X Y Z [ \ ] ^ _ ` ` ` ` ` ` a b a b & a b & ` ` ` cd d e & e f e g e h e i e f j k cd d & & e & & e l & e m & e h & e n & e & & e o cd d & & e & e l & e n & e & h & e & & e f ` ` ` ` ` j k & j k & ` ` ` ` ` ` ` p 3.6: Ov erall parallel la y ered deo ding ar hiteture for QC-LDPC 32

(47)

onurrentreadand writeoperationsare usedinthe design. Sinethereare

76

sub-matriesin total,the APP memory bank and the CTV memorybank eah onsists

of

76

smallsingle-portmemory units.

Eah layer in the parallel layered deoding arhiteture orresponds to

6

or

7

APP memory units, whih means that the updated variable summations of the

j

-th layer

Λ

j,n

j

will be delivered to APP memory units in other layers based on the

message passing sheme desribed in Setion ?? and Fig. 3.2. Therefore, these

variablesummationsare passedtotheirdestinations throughxed onnetionwires

instead of rossbar-based networks.

3.4.2 Pipelined Arhiteture for CNU

For hek node update, eah layer employs a CNU to perform a series of funtions

inluding subtration, omparison and addition. Consequently, the CNU beomes

the longest delay path in timing. In min-sum algorithm,a CNU needs to ompare

6

or

7

numbers to nd the minimum value and its loation as well as the seond

minimum value. Thus, the omparator beomes a key omponent in a CNU. A

2

-input omparator onsists of an adder and a multiplexer. A

3

-input

ompara-tor an be implemented with three adders, ve multiplexers and some basi logi

gates. Basedonomparisonunitsof2-inputand 3-inputomparators,

6

-inputor

7

-inputomparatoranbeonstrutedbyombiningthreelevelsof2-inputor3-input

omparators, asshown inFig. 3.7.

Fig. 3.7shows detailedarhitetureof CNUanditsfuntionalblokinside. The

subtrator andadder bloksfulllthefuntionsdened in(2.14)and(2.16),

respe-tively. Two quantizers are inserted to prevent the CTV and variable summations

(48)

! " # $$ %& ' %& ' & & & & & & & ! " # $$ $$ $$ $$ ! $$ ! " $$ " # $$ # # ! " # $ $ $ $ $ $ $ $ ! $ $ " $ $ # () *+ , -. / 0 12 34 56 7 12 89 :) ; <) += .> ? 12 @1 A6 +.B >C) 12 @8 D 7/ 129 8 EF ; GH 8 129 8 EF ; GH 8 129 8 EF ; GH 8 12 84 EF ; G +I) J 5) )*BK 12 @1 A6 +.B >C) 12 3L D== )I 12 1M /> ? . 5 B + ? ) @ 12M 8 5 B + ? ) 3 12MM 5 B + ? ) 8 12N M 5B + ? ) 9 12N OP QR QST UVT RW XYPY 3.7: CNU ar hiteture with ritial path splitting in to 4 pip elined 34

(49)

!" #$ %& !" #$ %& !" #$ %&

Figure3.8: Registeralloationineah setionof the hard deisions.

messages.

AsexplainedinSetion3.3,inordertoreduetheritialpathdelayandimprove

lokspeed,aCNUanbesplitinto

4

pipelinedstagesbyinserting

3

setofregisters.

Registers are arefully inserted suh that all 4 pipeline stages have approximately

the same delay, sine maximum frequeny is determined by the longest delay path.

Delays of eah funtional unit and the pipeline stages are alsoshown inFig. 3.7.

3.4.3 Deision Units

The outputbits of the LDPC deoder are deided by the signs of the variable

sum-mations. In traditional horizontal layered deoding, deisions an be made during

the variable node proessing at the bottom layer. Parallel layered deoding

arhi-teture operates in a slightly dierent way in whih every layer generates variable

(50)

onur-rently. Asillustrated inFig. 3.8, letustakethe rstolumnofthemodied

H

base

matrix with oset as an example. CNU

4

, CNU

9

and CNU

12

start to operate

from row

13

, row

1

and row

81

, respetively, as dened by the oset values shown

inFig. 3.3, whihorrespond toolumn

74

,olumn

13

and olumn

28

,respetively.

AordingtothemessagepassingrulederivedinSetion3.4.2,variablesummations

passing routes of the rst olumn is Layer

4

→

Layer

12

→

Layer

9

→

Layer

4

, also

shown inFig. 3.8. Afterompletionofeahiteration,thenal variablesummations

for olumns13through 27are stored atlayer12. Variablesummations forolumns

28

through

73

are stored at layer

4

. Similarly, variable summations for olumn

74

through

12

are stored at layer

9

. Therefore, the hard deision bits for olumns 1

through 96are stored in the memoriesdistributed in3 dierent layers.

In this design, we employ a onvenient and robust earlytermination strategy,

similar to[64,66℄. The pivot of the earlyterminationstrategy liesin thatthe hard

deisionbits frompreviousiterationarestoredand omparedwith thedeisionbits

from urrent iteration. If alldeoded bits are idential, then the deoder indiates

suessfuldeodingofaodewordandtheiterativedeodingterminates. Otherwise,

the deoding proess ontinues until the maximumnumberof iterationsis reahed.

3.5 Implementation Results

In order to evaluate the performane of our proposed parallel layered deoding

ar-hiteture, we implement a rate-1/2 2304-bit QC-LDPC deoder in TSMC

90

nm

1

.

0

V CMOS tehnology with 8-layer metals. We omplete synthesis and ore area

plae and route using Synopsys tools.

Implementation results show that the deoder an operate at a maximum

(51)

using

10

iterations. A totalof152 single-portmemoryunitseahwith sizeof

96

×

6

bits (inluding one sign bit for eah message) are employed, whih sums up to

87

,

752

bits of memory use and oupies more than 75% of the ore area. Parallel

layereddeodingarhitetureonlyneeds

(

p

+ 3)

×

Iter

lokylesforthe deoding

proess and

p

is the size of the sub-matrix. In [66℄ and [70℄, this number rises to

p

×

(4

×

Iter

+ 1)

and

p

×

(5

×

Iter

) + 12

, respetively. Hene, under the same

numberofiterations, parallellayereddeodingarhiteture ouldreduethe

deod-inglatenyby approximately

75

%. Moreover, theimplementationresultsshowthat

the maximumoperatingfrequenieswithandwithoutritialpathsplittingmethod

are

950

MHz and

305

MHz, respetively, whihdemonstrates animprovementof the

deoder speed by a fator of

3

. Combined with layered deoding algorithm that

doubles the onvergene speed, our proposed arhiteture an signiantly improve

the throughputof the QC-LDPC deoder up tomulti-Gbps.

Fig. 3.9 shows the layout view of the deoder ore area of

1

.

8

mm

×

1

.

6

mm

and the logi density of

70

%. The design does not have a rossbar interonnetion

network, sineparallellayered deoding arhiteture employs xed messagepassing

(52)

C. Liu [45℄ X. Shih [66℄ T. Brak [9℄ G.Gentile[25℄ Y. Ueng [70℄ M.Karkooti[?℄ Proposed Deoder Code Length 576~2304 576~2304 576~2304 576~2304 2304 1944 2304 Frequeny 150MHz 83.3MHz 333MHz 400MHz 200MHz 412MHz 950MHz Iterations 20 2~8 10,15 15 4.6(average) 15 10 Throughput 105Mbps 60~220Mbps 133-928Mbps 128-746Mbps 106Mbps 736MHz 2.2Gbps Tehnology 90nm 130nm 130nm 65nm 180nm 130nm 90nm Area 6.25

mm

2

8.29

mm

2

3.83

mm

2

0.59

mm

2

- 2.4

mm

2

2.9

mm

2

Power 264mW 52mW - - - 502mW 870mW 38

(53)

deoder is

2

.

9

mm

2

and the estimated power onsumptionis

870

mW.

Table 3.1shows the deoder implementationresults ompared withthe existing

QC-LDPC deoders. Note that the throughput values from [9,25℄ are realulated

basedonthedeodedbitsforfairomparisontootherimplementationslistedinT

a-ble3.1. Weshowthattheproposeddeoderanahievehigherdeodingthroughput

with omparable or smaller hip area. The deoder onsumes more power mainly

due toitshigh operatingfrequeny.

3.6 Summary

LDPC odesare widelyused inreentommuniation systemsdue totheir superior

error-orretion performane. In this hapter, we proposed a new arhiteture to

improvethe throughput of QC-LDPC deoders. Withparallel layered deoding

ar-hiteture and ritialpath splittingtehnique, the deoder implementation,whih

usesTSMC90nmtehnology,anahieve

2

.

2

Gbpsdeodingthroughputforseleted

rate-1/2irregularQC-LDPCodes. In addition,min-sumand looselyoupled

algo-rithms are employed for area eieny and the ore size is

2

.

9

mm

2

. The proposed

parallellayereddeoding arhiteture issuitableforthenear-apaityhannelodes

(54)

High-Throughput Rate-Compatible

LDPC Deoder Arhiteture

4.1 Introdution

Reently, thereisagrowinginterestinrate-ompatible(RC) LDPCodes and their

appliations. On time-varyinghannels, itis desirableto adjust the FEC ode rate

aording to the hannel s