Methodology and optimizing of multiple frame format buffering within FPGA H.264/AVC decoder with FRExt.


Rochester Institute of Technology

RIT Scholar Works


8-2007

Methodology and optimizing of multiple frame format buffering within FPGA H.264/AVC decoder with FRExt.

Timothy Aaron Stotts


This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please [email protected].


Methodology and optimizing of multiple frame format buffering within FPGA H.264/AVC decoder with FRExt.

by

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science in Computer Engineering

Approved By:

Supervised by

Assistant Professor Dr. Marcin Lukowiak

Department of Computer Engineering

Kate Gleason College of Engineering

Rochester Institute of Technology

Rochester, New York

August 2007


Dr. Marcin Lukowiak

Assistant Professor, RIT, Department of Computer Engineering

Primary Adviser

Dr. Ken W. Hsu

Professor, RIT, Department of Computer Engineering

Secondary Adviser

Mark Grabosky


Thesis Release Permission Form

Rochester Institute of Technology

Kate Gleason College of Engineering

Title: Methodology and optimizing of multiple frame format buffering within FPGA H.264/AVC decoder with FRExt.

I, Timothy Aaron Stotts, hereby grant permission to the Wallace Memorial Library to reproduce my thesis in whole or in part.

Dedication

To Christ Jesus, my one true source of peace.

"Peace I leave with you, my peace I give unto you: not as the world giveth,

Acknowledgments

A special thank you to each of my advisers for sharing their time and experience; and especially to Dr. Lukowiak for his patient guidance, and Mark Grabosky at Xelic, Inc. for encouraging and equipping me to pull through. Thank you also to Thomas Warsaw for many

Abstract

Digital representation of video data is an inherently resource demanding problem that continues to necessitate the development and refinement of coding methods. The H.264/AVC standard, along with its recent Fidelity Range Extensions amendment (FRExt), is quickly being adopted as the standard codec for broadcast and distribution of high definition video. The FRExt amendment, while not necessarily affecting the overall decoder architecture, presents an added complexity of providing efficient memory management for buffering intermediate frames of various pixel color samplings and depths.

This thesis evaluated the role of designing the frame buffer of a hardware video decoder, with integrated support for the H.264/AVC codec plus FRExt. With focus on organizing external memory data access, the frame buffer was designed to provide intermediate data storage for the decoder, while using an efficient store and load scheme that takes into consideration each frame pixel format of the video data.

VHDL was used to model the frame buffer. Exploitation of reconfigurability and post-synthesis FPGA simulations were used to evaluate behavior, scalability and power consumption, while providing an analysis of approaches to adding FRExt to the memory management. Real-time buffer performance was achieved for two common frame formats at 1080 HD resolution; and an innovative pipeline design provides dynamic switching of formats between video sequences. As an additional consequence of verifying the model, a preexisting Baseline H.264/AVC decoder testbench was augmented to support testing of

Contents

Dedication
Acknowledgments
Abstract
Glossary

1 Introduction
  1.1 Background
  1.2 Thesis objective
  1.3 Thesis chapter overview

2 Video Coding
  2.1 Y'CbCr color model
    2.1.1 Y'CbCr sub-sampling
  2.2 H.264/AVC overview
    2.2.1 H.264/AVC coding summary
    2.2.2 H.264/AVC Fidelity Range Extensions summary
  2.3 Thesis relevance and specifics
    2.3.1 H.264/AVC data buffering flow
    2.3.2 H.264/AVC data buffering organization
    2.3.3 Macroblock pixel types

3 H.264/AVC Research
  3.1 Decoder memory case studies and research
    3.1.1 Identification of memory components
    3.1.2 Optimization techniques
  3.2 Analysis of published results

4 Requirements and Modeling
  4.1 Augmentation of decoder system
  4.2 Algorithms used by the frame buffer
    4.2.1 Intra prediction buffering requirements
    4.2.2 Deblocking filter buffering requirements
    4.2.4 Combining of buffering mechanisms
    4.2.5 Reference picture management

5 Synthesizable Implementation
  5.1 External memory storage and control
    5.1.1 DDR memory control
    5.1.2 Block data controller
    5.1.3 Implementation hierarchy
    5.1.4 External DDR interface
  5.2 Frame organization and addressing
    5.2.1 Macroblock identification and frame slotting
    5.2.2 Macroblock address mapping with FRExt
    5.2.3 Frame storage marking
    5.2.4 Sliding window implementation
  5.3 Synthesis parameters
  5.4 Frame buffer interface and pipelining
    5.4.1 Frame buffer RTL interface
    5.4.2 Pipeline semantics
  5.5 Dual RAM frame buffer
    5.5.1 Dual DDR SDRAM design

6 Verification - HDL Model Functionality
  6.1 Unit testing
  6.2 In-system verification
    6.2.1 Augmentation of the decoder system
    6.2.2 Testbench redesign
    6.2.3 Video sequences
    6.2.4 Functional simulation
    6.2.5 Post-synthesis simulation

7 Results and Analysis
  7.1 Implementation analysis
  7.2 Synthesis resource analysis
  7.3 DDR timing analysis
  7.4 H.264/AVC timing analysis
  7.5 Power consumption analysis
  7.6 Cost analysis

8 Conclusions
  8.1 Synthesizable models
  8.2 Proposed system interfacing

A Software Tools and Deliverables
  A.1 Software tools
    A.1.1 Video processing and display
    A.1.2 FPGA design and simulation

List of Figures

1.1 Digital representation of a picture in terms of data size
1.2 Digital representation of uncompressed video in terms of data size
1.3 The extra dimension of pixel size upon total picture data size
1.4 Internal partitioning of frame buffer design
1.5 Video decoder system partitioning augmented for testing frame buffer
2.1 RGB vs. Y'CbCr decomposition of a "foreman" test frame
2.2 Y'CbCr sub-sampling 4:4:4
2.6 Scope of H.264/AVC Standard: only decoding [24]
2.7 The correlation between source (uncoded) picture frames and encoded slices
2.8 Pixel sampling and depth, increasingly stacked by FRExt profile [24]
2.9 Buffering within the H.264/AVC Hypothetical Reference Decoder [9]
2.10 DPB operation: macroblock in, macroblock out
2.11 Buffer a row of macroblocks to retain neighbor MBs
2.12 Organization of the reference frame buffer
3.1 FPGA hybrid on-chip, off-chip decoder architecture proposed in [21]
4.1 Video decoder system architecture and data flow
4.2 Intra Prediction macroblock neighbor permutations
4.3 Main data path "tap" locations for buffering
5.1 Vendor-supplied Xilinx Spartan 3E DDR SDRAM controller
5.2 Customized Xilinx Spartan 3E DDR SDRAM controller
5.3 Block data controller state machines
5.4 Internal partitioning of frame buffer design
5.5 External DDR interface
5.6 Binary 16x8 sub-macroblock memory maps for each 8-bit sub-sampling
5.7 Single RAM frame buffer inner and outer interfaces
5.8 Dual RAM frame buffer inner and outer interfaces
6.1 Testbench flow with emphasis on data processing
6.2 Testbench flow with emphasis on storage operations

List of Tables

2.1 Compression ratios for various Y'CbCr sub-samplings
2.2 H.264/AVC standard drafting by year [19]
2.3 Some of the alternate names given to the H.264/AVC standard [19]
2.4 The three basic slice types specified by H.264/AVC
2.5 Sub-sampling factor for all sub-sampling ratios
2.6 Binary sizes of a macroblock and frame
4.1 Macroblock stages of operation while passing through the decoder system
4.2 Example buffering size requirements for intra prediction
4.3 Example of reference picture list updates [16]
5.1 External DDR pins
5.2 Unique addresses required to store 16x8 sub-macroblock for all pixel types
5.3 Example (exact) ranges of macroblock numbers
5.4 Example (arbitrary) ranges of slot IDs
5.5 Addressing combinations loaded by block controller
5.6 Interpretation of frame marking booleans
5.7 Example contents of frame buffer metadata
5.8 Structural frame buffer synthesis parameters
5.9 Digital patterns for marking slots
6.1 Performance modifications to the original decoder model
6.2 Performance enhancements to the original decoder model
6.3 Functional corrections to the original decoder model
6.4 Key H.264/AVC test sequences
6.5 Typical simulation CPU time and memory for some sequences
7.1 Single RAM frame buffer synthesis, full pin-out
7.2 Dual RAM frame buffer synthesis, full pin-out
7.3 Comparison of frame buffer synthesis
7.4 Single x16 DDR SDRAM bandwidth
7.5 Striped Dual x16 DDR SDRAM bandwidth
7.6 x16 DDR SDRAM bandwidth variance
7.7 Single x16 DDR SDRAM frames per second
7.8 Striped x16 DDR SDRAM frames per second
7.9 FPGA device power consumption of frame buffer post-synthesis
7.10 Estimated unit device cost in purchase quantity of 100

Listings

5.1 RTL pseudo code for store operation
5.2 RTL pseudo code for worst-case load operation
5.3 RTL pseudo code for best-case load operation
5.4 Example incorrect use of data striping
5.5 RTL pseudo code for striped store operation

Glossary

C++    C++ or C Plus Plus. A widely used object-oriented software programming language.

CAVLC    Context Adaptive Variable Length Coding. An improved, context-adaptive version of VLC used in the H.264/AVC Baseline Profile.

D

DVD    Digital Video Disk or Digital Versatile Disk. A popular optical disk storage technology used for videos and other applications that require large amounts of storage.

F

FRExt    Fidelity Range Extensions. An amendment to H.264/AVC approved in 2004 providing "professional" coding tools and four new "High" profiles.

H

H.264/AVC    ITU-T H.264 and ISO/IEC 14496-10. Video coding standard approved in 2003 jointly by ITU-T and ISO/IEC. Delivers significantly better compression

HDTV    High Definition Television. A number of high-quality resolutions standardized for television use. Includes 1280x720 and 1920x1080 resolutions, and two different forms of pixel arrangement (progressive and interlaced).

I

IDR    Instantaneous Decoding Refresh. A coded frame composed of only I or SI slices. The decoding of an IDR frame signals the reference picture list to mark its entire list of frames as no longer needed for reference.

ISO/IEC    International Organization for Standardization/International Electrotechnical Commission. ISO is an international body responsible for developing and maintaining a range of standards across many disciplines. IEC is the commission specifically responsible for electrical technologies, including MPEG video compression standards.

ITU-R    International Telecommunications Union (ITU) Radiocommunication Sector. Responsible for regulating the radio frequency spectrum used for wireless communications by industry and government worldwide.

ITU-T    International Telecommunications Union (ITU) Telecommunications Standardization Sector. Responsible for developing and maintaining joint industry and government standards for worldwide telecommunications technology.

M

Q

QCIF    Quarter-resolution Common Image Format. Defines an image size of 176 pixels wide by 144 pixels high.

R

RAM    Random Access Memory. Type of reusable data storage for which its contents can be accessed in any order, and without any physical moving parts.

RIT    Rochester Institute of Technology. The author's primary university at the time of publishing.

VCEG    Video Coding Experts Group. A group from the ITU-T responsible for adopting and defining video compression standards.

VCL    Video Coding Layer. The layer in the H.264/AVC standard that contains actual video information.

Verilog    Verilog. A popular computer language used for modeling and describing hardware.

VHDL    Very High Speed Integrated Circuit (VHSIC) Hardware Description Language (HDL). A popular computer language used for modeling and describing hardware.

Y'CbCr    Y'CbCr or YCC or YPbPr. A digital equivalent of the YUV color model, containing one luma, one blue chrominance and one red chrominance value. Al

Y'CbCr is specified by a different set of formulas. Analog component signals which carry the Y'CbCr data are sometimes termed YPbPr.

YUV    YUV. A three component color model defined in terms of one luma and two chrominance values. YUV is commonly used within analog video broadcast formats to lossy compress RGB pixels by discarding a significant portion of the color data, while retaining much of the human perceptible image quality.

Chapter 1

Introduction

1.1 Background

The atomic unit of digital graphics technology is the pixel, or "picture element". A digital image, when rendered for display, whether a still picture, a printed graphic, or an individual frame of a video sequence, consists of a finite, two-dimensional array of points. Each point, or pixel, is represented by a sequence of binary data that describes the intensity and color of that individual point. A single image typically consists of a uniform pixel type; viz., each point is described in exactly the same manner as all of the other points within that image. The digital representation of a single image is depicted in Figure 1.1.

Figure 1.1: Digital representation of a picture in terms of data size.

As shown, the quantitative metrics that determine the total binary data size of an image are: the number of bits necessary to represent a single pixel, and the total number of pixels. Considering only pictures, or rectangular images, the total binary size of a picture is computed as follows:

$$\mathit{PictureBits} = (\mathit{WidthInPixels} \times \mathit{HeightInPixels}) \times \mathit{BitsPerPixel} \tag{1.1}$$

When considering digital video technology, the appearance of continuous motion is facilitated by rapid display of a sequence of pictures, with most pictures being a slight change in appearance from the previous. The data size of an uncompressed video sequence then increases multiplicatively from the above equation, as shown in Figure 1.2.

$$\mathit{VideoBits} = \mathit{PictureCount} \times \mathit{PictureBits} \tag{1.2}$$

Figure 1.2: Digital representation of uncompressed video in terms of data size.
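Equations (1.1) and (1.2) can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names, the 1920x1080 picture size, the 12 bits per pixel, and the 30-picture sequence are assumptions chosen for the example, not values taken from this chapter.

```python
def picture_bits(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Equation (1.1): total bits in one uncompressed picture."""
    return (width_px * height_px) * bits_per_pixel


def video_bits(picture_count: int, bits_per_picture: int) -> int:
    """Equation (1.2): total bits in an uncompressed sequence of pictures."""
    return picture_count * bits_per_picture


if __name__ == "__main__":
    # Hypothetical example values (assumptions, not from the thesis).
    one_picture = picture_bits(1920, 1080, 12)
    thirty_pictures = video_bits(30, one_picture)
    print(one_picture)      # 24883200
    print(thirty_pictures)  # 746496000
```

Under these assumed values, thirty uncompressed pictures already occupy roughly 750 megabits, which illustrates why the size of each stored picture matters to a hardware decoder.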

Video coding standards such as H.264/AVC provide a set of tools by which the binary data of a source video sequence may be modified and compressed into a much smaller binary representation, while retaining a measure of the human-perceptible visual quality. The compression operates primarily on the assumption that neighboring pictures contain a large quantity of similar (redundant) pixels. The compressed, and thus smaller, representation is then stored or transmitted in various manners. However, when decoding this compressed representation, some pictures are stored within a frame buffer² for later reference by the decoding algorithms. This intermediate storage of select pictures produces a potential performance concern for hardware implementations due to the memory store and load of a significant quantity of data.

Very specific to H.264/AVC is the recent FRExt amendment applied to this standard. While many video coding standards allow for only a single pixel type of a specific bit size, the FRExt amendment introduces the option to encode a video sequence with one of a variety of pixel types. Each pixel type uses a different number of sample bits and a different Y'CbCr sub-sampling to represent itself, thus significantly impacting the binary sizes of both the compressed and uncompressed data, irrespective of the total number of pixels. The impact of changing pixel representation upon the total picture data size is depicted in Figure 1.3. This additional "dimension" by which the picture data representation may differ between video sequences introduces an additional complexity to both the data quantity and access behavior of the decoder memory management.

Figure 1.3: The extra dimension of pixel size upon total picture data size.

² Frame buffer is a more generic term for this type of hardware component. The H.264/AVC standard

1.2 Thesis objective

This thesis provides an initial methodology for implementing and optimizing the frame buffer of a hardware H.264/AVC decoder with FRExt. A frame buffer with external memory was implemented with the functionality of the H.264/AVC codec plus support for each of the Y'CbCr pixel formats of the FRExt "High" profiles. The frame buffer was designed to be a single component, scalable to various memory capacities and frame resolutions, and capable of efficiently switching frame format mode (pixel type) in-hardware. Additionally, the organizational and access schemes of the frame buffer were tailored to handle each of the decoded frame pixel formats, with considerations toward optimization. Figure 1.4 depicts the internal partitioning of the frame buffer component according to functionality and logical interface.

[Diagram labels: Block Controller, RAM Command, Address Counter, Sliding Window Control, FRExt Scaling, H.264/AVC Mapper, External Memory Controller, External Memory, Intra Prediction, Inter Prediction, Deblocking Filter, Frame Buffer]

Figure 1.4: Internal partitioning of frame buffer design.

The frame buffer was modeled using the VHDL hardware description language in three different forms: a zero-time simulation-only behavioral model, and two different synthesizable descriptions for implementing in hardware. Both hardware models specifically target Xilinx FPGA technology with external DDR SDRAM memory; one using a single memory chip, and the other striping data between two memory chips. All three models were verified against each other for identical functionality within a full simulation-only decoder

netlist forms. Finally, each hardware model was verified as parameterizable pre-synthesis to support any combination of the three H.264/AVC frame buffering needs: intra prediction, inter prediction, and deblocking filter; to optionally support multiple frame formats; and to optionally support the sliding window algorithm.

As an additional contribution, a simulation-only VHDL model of an H.264/AVC Baseline decoder was augmented to support simulation of HD resolutions, and to emulate memory support for multiple pixel formats of the FRExt "High" profiles. These additions included a new queuing testbench design that would exercise the frame buffer according to the behavior of a real video sequence. The preexisting, incomplete Baseline software buffer model was one such component augmented within the decoder, and its operation within the full decoder system is depicted in Figure 1.5.

[Diagram labels: Video Header, Control, Stream Parser, Inverse Quantizer, Inverse Transform, Intra Prediction, Inter Prediction, Deblocking Filter, Frame Buffer; input: compressed video; output: uncompressed video]

Figure 1.5: Video decoder system partitioning augmented for testing frame buffer.

The decoder operation with respect to these additional coding tools was verified using reference H.264/AVC codec software [18] written in C++. After several video sequences were used to demonstrate sufficiently correct behavior of the VHDL decoder in comparison with the reference software, the decoder was then used as a basis for in-system simulation and validation of the synthesizable frame buffer description, both pre- and

1.3 Thesis chapter overview

This thesis begins with a discussion of video coding concepts and the H.264/AVC standard in Chapter 2. An overview of basic video compression and color models is presented along with a synopsis of the H.264/AVC standard. Chapter 3 then provides an overview of published research on the topic of memory management within a hardware H.264/AVC decoder. Potential methods of optimizing FRExt within a hardware decoder are conjectured. Chapter 4 discusses the conceptual modeling of the frame buffer implemented by this thesis, including requirements, algorithms, and data flow.

The actual frame buffer implementations performed are presented in Chapter 5, with a detailed look at synthesizable descriptions, and considerations toward functionality and optimization. Chapter 6 discusses the verification of each model of the frame buffer component, and how they were verified against each other with representative video sequences. The results of the verified descriptions are presented in Chapter 7, with simulated performance analysis for the hardware frame buffer, taking into consideration trade-offs between speed, power consumption, and complexity. Finally, Chapter 8 concludes the thesis with a summation of results and potential improvements. It also proposes an interfacing scheme to integrate the frame buffer into a full pipelined H.264/AVC plus FRExt decoder

Chapter 2

Video Coding

This chapter briefly discusses the Y'CbCr pixel color model used by many video codecs, and also provides background on the history, concepts, and application of H.264/AVC.

2.1 Y'CbCr color model

Y'CbCr is the predominant color model used within digital video coding standards, including H.264/AVC. Some popular color models, such as RGB and YMCK (commonly used within displays and printers respectively), produce a range of colors by mixture of three or four linear channels. These color channels are similar in effect to the mixing of primary paints on an artist's palette. Unlike linear color models, Y'CbCr represents pixel intensity as its own component, but with some residual interdependence with the color components. "Y'" represents the luma component, and "Cb" and "Cr" represent the blue and red chroma components respectively. The "luma" component is gamma-corrected luminosity, and the "chroma" components are gamma-corrected chrominance. A comparison of linear RGB decomposition and gamma Y'CbCr decomposition is shown in Figure 2.1.

The goal of the Y'CbCr model is to represent R'G'B' (gamma-corrected RGB) data in a compressed form by discarding some of the less essential sub-pixel color resolution. Since the human eye is predominantly sensitive to brightness, and also the color green, compress

[Figure panels: (a) Original. (b) R, G, B channels respectively. (c) Y', Cb, Cr components respectively.]

Figure 2.1: RGB vs. Y'CbCr decomposition of a "foreman" test frame.

bit size while retaining much of the original appearance. Multiple sub-sampling formulas exist for Y'CbCr, and several are standardized by the ITU. One conventional Y'CbCr form used to transform source video data just prior to encoding with H.264/AVC and other codecs is:

$$
\begin{aligned}
Y' &= K_R \cdot R + (1 - K_R - K_B) \cdot G + K_B \cdot B \\
C_b &= \frac{1}{2} \left( \frac{B - Y'}{1 - K_B} \right) \\
C_r &= \frac{1}{2} \left( \frac{R - Y'}{1 - K_R} \right)
\end{aligned}
\tag{2.1}
$$

The gamma values K_R and K_B are left unspecified until a specific display technology is determined; but K_R + K_B < 0.5 may be assumed. A common choice for current displays is: K_R = 0.2126, K_B = 0.0722 [19]. As shown, the luma component does depend on the original red and blue channels, but is primarily influenced by the green. The choice of gamma values is dependent upon the intended video display technology; for example, CRT and LCD displays have somewhat different ideal gamma values, as do conventional and high definition televisions.
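The transform of Equation (2.1) can be sketched in a few lines. This is a minimal illustration rather than the thesis implementation: the function name is hypothetical, inputs are assumed to be normalized to the range 0.0 to 1.0, and the coefficient values K_R = 0.2126 and K_B = 0.0722 used in the example are an assumption (the ITU-R BT.709 pair).

```python
def rgb_to_ycbcr(r: float, g: float, b: float, kr: float, kb: float):
    """Apply Equation (2.1) to one pixel with channels normalized to 0.0-1.0."""
    y = kr * r + (1.0 - kr - kb) * g + kb * b   # luma: weighted sum of R, G, B
    cb = 0.5 * (b - y) / (1.0 - kb)             # blue-difference chroma
    cr = 0.5 * (r - y) / (1.0 - kr)             # red-difference chroma
    return y, cb, cr


if __name__ == "__main__":
    KR, KB = 0.2126, 0.0722  # assumed coefficients for this sketch
    for name, rgb in [("gray", (0.5, 0.5, 0.5)),
                      ("green", (0.0, 1.0, 0.0)),
                      ("blue", (0.0, 0.0, 1.0))]:
        print(name, rgb_to_ycbcr(*rgb, KR, KB))
```

A neutral gray input yields zero chroma, and a saturated blue or red input drives Cb or Cr to its extreme of 0.5, which shows how the 1/2 factor and the (1 - K) denominators bound the chroma range. Pure green contributes the largest share of luma, matching the observation that luma is primarily influenced by the green channel.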

With respect to digital video coding, sample values are typically stored and operated upon as integers. ITU-R BT.601 specifies a form of 8-bit integer matrix multiplication that can be used to perform transformations between R'G'B' and Y'CbCr [6, 8]. Equation (2.2) demonstrates transformation of 8-bit per channel R'G'B' to 8-bit per sample Y'CbCr, and Equation (2.3) demonstrates transformation of 8-bit per sample Y'CbCr to 8-bit per channel R'G'B'. When the matrix coefficients are chosen properly for the target technology, the transformations alone will incur very little data loss; only so much as results from precision errors internal to the mathematical expressions.

$$
\begin{bmatrix} Y' \\ C_b \\ C_r \end{bmatrix}
= \frac{1}{256}
\begin{bmatrix} 77 & 150 & 29 \\ -44 & -87 & 131 \\ 131 & -110 & -21 \end{bmatrix}
\begin{bmatrix} R' \\ G' \\ B' \end{bmatrix}
+ \begin{bmatrix} 16 \\ 128 \\ 128 \end{bmatrix}
\tag{2.2}
$$

$$
\begin{bmatrix} R' \\ G' \\ B' \end{bmatrix}
= \frac{1}{256}
\begin{bmatrix} 256 & 0 & 351 \\ 256 & -86 & -179 \\ 256 & 444 & 0 \end{bmatrix}
\begin{bmatrix} Y' - 16 \\ C_b - 128 \\ C_r - 128 \end{bmatrix}
\tag{2.3}
$$
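The integer transformations of Equations (2.2) and (2.3) can be exercised with a short round-trip sketch. The coefficients below are read from those two equations; the function names, the rounding of each division by 256, and the clamp to the 8-bit range are assumptions of this sketch, not details confirmed by the text.

```python
# Coefficients as read from Equations (2.2) and (2.3).
FORWARD = ((77, 150, 29), (-44, -87, 131), (131, -110, -21))
OFFSET = (16, 128, 128)
INVERSE = ((256, 0, 351), (256, -86, -179), (256, 444, 0))


def _clamp8(value: int) -> int:
    """Clamp to the 8-bit sample range (an assumption of this sketch)."""
    return max(0, min(255, value))


def rgb_to_ycbcr8(r: int, g: int, b: int):
    """Equation (2.2): 8-bit R'G'B' to 8-bit Y'CbCr."""
    out = []
    for row, offset in zip(FORWARD, OFFSET):
        acc = row[0] * r + row[1] * g + row[2] * b
        out.append(_clamp8(round(acc / 256) + offset))
    return tuple(out)


def ycbcr8_to_rgb(y: int, cb: int, cr: int):
    """Equation (2.3): 8-bit Y'CbCr back to 8-bit R'G'B'."""
    v = (y - 16, cb - 128, cr - 128)
    return tuple(
        _clamp8(round(sum(c * x for c, x in zip(row, v)) / 256))
        for row in INVERSE
    )


if __name__ == "__main__":
    pixel = (200, 100, 50)
    coded = rgb_to_ycbcr8(*pixel)
    print(coded)                  # (140, 85, 183)
    print(ycbcr8_to_rgb(*coded))  # (199, 100, 49)
```

The round trip returns each channel to within one count of its original value for this pixel, which is the kind of small precision error the paragraph above describes.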

2.1.1 Y'CbCr sub-sampling

By itself, transforming a picture from RGB to Y'CbCr does not reduce the size of the binary representation. Due to the near separation of luminosity and color, sub-sampling can be used during or following transform to reduce the binary size of the chroma samples. This form of reduction requires that the picture be partitioned into "macroblocks." Each macroblock is a base square unit of pixels, possibly 8 by 8 pixels or 16 by 16 pixels in dimension, depending on the intended use. It is also necessary to specify sub-sampling with a three value ratio in the form of: f:m:n. Using this ratio, the RGB is transformed with a relative sampling rate for each component of the Y'CbCr color model, according to the following rules:

f is defined as an integer greater than 0.

When n is greater than 0:

f is the horizontal sampling frequency of the luma.
m is the horizontal sampling frequency of the first (blue) chroma.
n is the horizontal sampling frequency of the second (red) chroma.
The vertical sampling frequencies of luma and each chroma are the same.

When n is equal to 0:

f is the horizontal sampling frequency of the luma.
m is the horizontal sampling frequency of each chroma.
The vertical sampling of each chroma is half the vertical sampling of the luma.

With respect to digital video coding standards, ITU-R BT.601 specifies the luma ratio f as constant at 4, representing an analog-to-digital sampling rate of 13.5 MHz as used by NTSC and PAL in US and European television respectively [8]. Thus, a direct transformation of RGB data to Y'CbCr without sub-sampling would be represented by the ratio 4:4:4, and is depicted in Figure 2.2.

[Figure panels: (a) Y'  (b) Cb  (c) Cr; legend: pixel, sample]

Figure 2.2: Y'CbCr sub-sampling 4:4:4.

As shown, each macroblock is the same digital size with a one-to-one mapping between samples and pixels. Performing professional quality sub-sampling of 4:2:2 yields a reduction in the number of chroma samples, and thus compresses the macroblock, as depicted in Figure 2.3. The number of pixels within the macroblock does not change; only the number of samples, reducing the internal resolution or quality.

Consumer-grade compression, including Digital Video Broadcast, typically uses a sub-sampling of 4:2:0, with a horizontal sampling alignment similar to that

[Figure panels: (a) Y'  (b) Cb  (c) Cr; legend: pixel, sample]

Figure 2.3: Y'CbCr sub-sampling 4:2:2.

with Figure 2.4. Very old standards such as MPEG-1 used a different alignment for the sampling process; but newer standards attempt to reduce re-sampling losses between 4:2:2 and 4:2:0.

[Figure panels: (a) Y'  (b) Cb  (c) Cr; legend: pixel, sample]

To represent the video stream in gray scale for "black and white" content, the chroma samples are simply discarded. The gray scale sampling ratio of 4:0:0 is depicted in Figure 2.5. Depending on the intended use, the gamma ratios used during transformation may be slightly different than with other sampling ratios to preserve subjective monochrome quality.

As shown with each of these examples, the macroblock remains a 16 by 16 block of pixels after transformation and after sub-sampling. However, the macroblock can be represented by a smaller number of samples than there are pixels, compressing the internal data size. The effects and intended applications of the aforementioned sub-sampling ratios are summarized in Table 2.1.

[Figure panels: (a) Y'  (b) Cb  (c) Cr; legend: pixel, sample]

Table 2.1: Compression ratios for various Y'CbCr sub-samplings.

Sampling   % bits of original size           Intent
4:4:4      (1.0 + 1.0 + 1.0)/3 = 100%        Near lossless R'G'B'.
4:2:2      (1.0 + 0.5 + 0.5)/3 = 67%         Professional video editing.
4:2:0      (1.0 + 0.25 + 0.25)/3 = 50%       Commercial video distribution.
4:0:0      (1.0 + 0.0 + 0.0)/3 = 33%         Gray scale video.
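The percentages in Table 2.1 follow directly from the f:m:n sampling rules of Section 2.1.1, and can be reproduced with the short sketch below. The function names are hypothetical, and the 16 by 16 macroblock used for the sample counts is taken from the macroblock size discussed above; exact fractions are used so that the ratios are not blurred by floating-point rounding.

```python
from fractions import Fraction


def chroma_fractions(f: int, m: int, n: int):
    """Return (Cb, Cr) sample counts relative to luma for the ratio f:m:n."""
    if n > 0:
        # Chroma is reduced horizontally only; vertical sampling matches luma.
        return Fraction(m, f), Fraction(n, f)
    # n == 0: both chroma share the horizontal rate m/f at half the vertical rate.
    each = Fraction(m, f) * Fraction(1, 2)
    return each, each


def relative_size(f: int, m: int, n: int) -> Fraction:
    """Sub-sampled size as a fraction of the 4:4:4 size (Table 2.1)."""
    cb, cr = chroma_fractions(f, m, n)
    return (1 + cb + cr) / 3


def samples_per_macroblock(f: int, m: int, n: int, side: int = 16) -> int:
    """Total Y', Cb and Cr samples in one side-by-side macroblock."""
    cb, cr = chroma_fractions(f, m, n)
    return int(side * side * (1 + cb + cr))


if __name__ == "__main__":
    for ratio in [(4, 4, 4), (4, 2, 2), (4, 2, 0), (4, 0, 0)]:
        print(ratio, float(relative_size(*ratio)), samples_per_macroblock(*ratio))
```

For a 16 by 16 macroblock this gives 768, 512, 384 and 256 samples for 4:4:4, 4:2:2, 4:2:0 and 4:0:0 respectively, while the number of pixels stays at 256 in every case.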

2.2 H.264/AVC overview

In addition to pixel-level compression, video coding standards provide computational tools by which to significantly reduce video binary size. H.264/AVC is one such video coding standard, recognized internationally, and put forth jointly by the ITU and ISO standards organizations. More specifically, the standard was drafted in cooperation between ITU-T and IEC, which are respectively the ITU sector and ISO commission responsible for video coding standards. First drafted by the ITU-T in 2002 as H.26L, and approved in 2003 as H.264, the standard has undergone a series of revisions and approvals by multiple organizations. Table 2.2 details the progress of the H.264/AVC video coding standard by year from the perspective of the ITU-T drafting. Table 2.3 lists the various names by which the H.264/AVC standard is sometimes referred. Henceforth, the standard is referred to by the name H.264/AVC as an abbreviated merge of the ITU and ISO names, commonly used in literature [19].

The primary goal of the H.264/AVC standard is to provide video compression similar

Table 2.2: H.264/AVC standard drafting by year. [19]

Date         ITU event
2002         H.26L drafted.
May 2003     ITU-T H.264 Version 1 approved with Baseline, Main, Extended profiles.
May 2004     ITU-T Corrigendum containing minor corrections.
March 2005   ITU-T H.264 Version 2, with added High, High 10, High 4:2:2, High 4:4:4 profiles (FRExt).
Sept. 2005   ITU-T Corrigendum containing minor corrections and three aspect ratio indicators.
June 2006    ITU-T Amendment, removal of High 4:4:4 profile, and addition of extended-gamut color space.
TBD          ITU-T Replacement of High 4:4:4 with High 4:4:4 Predictive.

Table 2.3: Some of the alternate names given to the H.264/AVC standard. [19]

H.26L               H.264
ISO/IEC 14496-10    JVT
MPEG-4 Part 10      MPEG-4 AVC

usage scenarios and coding efficiency. Currently, H.264/AVC is starting to become the common standard for use within cable television Digital Video Broadcast (DVB) and High Definition (HD) media [19]. H.264/AVC aims to succeed the MPEG-2 format with the following changes:

• improved network quality of service for mobile and LAN/Internet
• increased visual quality-to-binary size ratio, especially at very low and very high resolutions
• improved visual precision with respect to motion prediction, reducing visual artifacts
• more ideal coding format for HD-DVD movies and Internet TV

With the initial drafting of the H.264/AVC specification, a goal was established to achieve an improved visual quality to compressed bit size ratio over MPEG-2 and H.263

implementations suggest that the standard does generally provide such compression by use of its advanced coding tools [11, 24, 23, 19].

As is convention with many video coding standards, especially those released by ITU-T, the specification is constrained in scope to the data format of the full video processing system. The scope is limited to the algorithms and functionality of the decoder, omitting implementation and architectural details; thus allowing for maximum flexibility of both decoder and encoder implementations. Figure 2.6 depicts the full video processing system, from source content to display rendering. As shown, the scope of the H.264/AVC standard is limited to the decoding stage of the full video processing system. Details of the encoding and outer processing stages are considered out-of-scope, and omitted.

[Diagram of the video processing system: source, Pre-Processing, Encoding, Decoding, Post-Processing & Error Recovery, destination. The scope of the standard is marked around Decoding only.]

Figure 2.6: Scope of H.264/AVC Standard: only decoding [24].

2.2.1 H.264/AVC coding summary.

The H.264/AVC standard specifies high-level organization of operating on raw video frames to reduce their binary representation to a compressed format with a configurable ratio of visual quality to size. This reduction works by a combination of lossy and lossless compression. The lossy compression discards redundant data while retaining much of the original visual quality.

Whereas H.264/AVC does not specify the implementation details of the block trans

storage of picture data. Conceptually, each picture, or frame, can be coded as one of the following:

A complete, self-contained frame.

Differences from one past or future reference frame.

Differences from two past or future reference frames.

To organize this complex frame differencing into a more flexible arrangement, a sequence of frames is cut into a sequence of slices, as depicted with Figure 2.7. Additionally, the coding features are grouped into different subsets, or profiles. Depending on the profile and configuration, the slices do not have to be exact in size to the actual picture frames, although a one-to-one mapping is common. Additionally, the slices are not necessarily in the same raster-scan order as the pictures, even if organized into a one-to-one mapping. This implies that frame decoding order may not always be identical with the final visual display order.

Figure 2.7: The correlation between source (uncoded) picture frames and encoded slices.

While H.264/AVC provides for five different slice types, only three slice types are used

Table 2.4: The three basic slice types specified by H.264/AVC.

Type      Name                        Prediction Modes       Profiles not supporting
I-slice   independent slice           intra-
P-slice   predictive slice            intra-, inter- (x1)
B-slice   bidirectional pred. slice   intra-, inter- (x2)    Baseline

slices, each slice is divided into a set of macroblocks (MBs), a 16-by-16 pixel base data unit. Nearly all computational efforts are performed directly on a single MB, with potential reference to other MBs. Each MB may be of a different type, referring to groupings of pixels belonging to neighbor MBs.

An I-slice contains only macroblocks that use spatial (intra) prediction, possibly referencing near-by macroblocks within the same slice. A P-slice contains a mixture of both spatial (intra) prediction and temporal (inter) prediction macroblocks, where each macroblock uses only one type of prediction for itself. For P-slices, each temporal prediction vector can have only one previously decoded reference. B-slices are similar to P-slices, except that those macroblocks using temporal prediction may have two references per prediction vector.
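The slice rules above can be summarized as a small lookup. The following is a minimal sketch, not part of the standard or of this thesis' model; the names `Prediction`, `SLICE_RULES`, and `macroblock_allowed` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Prediction(Enum):
    INTRA = "spatial"    # predicted from neighbours within the same slice
    INTER = "temporal"   # predicted from previously decoded frames


# Allowed prediction kinds per slice type, and the maximum number of
# reference frames that one temporal prediction vector may use.
SLICE_RULES = {
    "I": ({Prediction.INTRA}, 0),
    "P": ({Prediction.INTRA, Prediction.INTER}, 1),
    "B": ({Prediction.INTRA, Prediction.INTER}, 2),
}


def macroblock_allowed(slice_type, prediction, refs=0):
    """Return True if a macroblock with this prediction fits the slice type."""
    allowed, max_refs = SLICE_RULES[slice_type]
    if prediction not in allowed:
        return False
    if prediction is Prediction.INTRA:
        return refs == 0          # spatial prediction uses no reference frame
    return 1 <= refs <= max_refs
```

For instance, `macroblock_allowed("P", Prediction.INTER, refs=2)` is rejected, while the same macroblock is accepted inside a B-slice.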

The slice mapping and slice types have a direct impact upon the effective lossy compression of the video data. Quantization is used to discard the least important pixel data; with larger coefficients producing more intense compression at the expense of visual quality. Temporal prediction provides more effective quantization; and thus those slices using temporal prediction compress more easily than those using spatial prediction.

The net effect of quantization is that the data is stored with an additional lossless compression tailored specifically to the format of the video data. One of two forms of entropy coding, CAVLC and CABAC, can be used to increase the average compressibility of the already highly compressible data when performing lossless block compression [9, 24].

To address the many usage scenarios, H.264/AVC contains an extensive list of features available to the coding processes, as required to be supported by the video decoder. To organize these features into a small quantity of permutations, the specification groups them

Baseline: contains all but the most complex and least common coding features of the specification. The intended applications include streaming video, teleconferencing, and other more general purpose uses. CAVLC is the only option for entropy coding, and B-slices are unsupported.

Main: contains a primary subset of features, mostly overlapping with the Baseline profile, and adding those features most useful for on-demand commercial services. The intended applications include DVB, distribution of video media, and others that use high resolutions and data rates. CABAC is the default option for entropy coding, and B-slices are supported.

Extended: contains most features of the Baseline profile, plus multiple additional features that are complex, uncommon, and useful only in a subset of situations. The intended applications include mobile and wireless devices.

Each of these H.264/AVC profiles uses a consumer-grade video depth of 4:2:0 chroma formatting and 8 bits per sample [9, 24]. Version 1 of H.264/AVC without the FRExt amendment only supports the same sample depth and sub-sampling ratio as its predecessor codecs, including H.263 and MPEG-2.

2.2.2 H.264/AVC Fidelity Range Extensions summary.

The FRExt amendment to the H.264/AVC specification essentially augments the feature set of the original H.264/AVC to support multiple sampling depths, sub-sampling ratios, color spaces, larger frame resolutions, and other additional coding tools. While the primary objective of this amendment was to introduce features necessary for editing and production of "professional"-grade video, the new features provide additional flexibility in managing the quality and format of distributed media. As one example, the High profile has already been adopted to succeed the Main profile for use within some High Definition media, including HD-DVD, BD-ROM, and some forms of DVB. For sake of brevity, the term Fidelity Range

Each of the additional decoder profiles specified by FRExt is an incremental increase of features from the Main profile due to their primarily commercial application. In summary, the additional decoder profiles are:

High: contains all of the coding tools of the Main profile. Adds several coding efficiency tools and monochrome 4:0:0 video. This profile easily replaces the Main profile for many applications.

High 10: contains all of the coding tools of the High profile. Adds sampling depth of up to 10 bits per luma and 10 bits per chroma.

High 4:2:2: contains all of the coding tools of the High 10 profile, while adding professional-grade tools, very high resolutions, and the sub-sampling ratio of 4:2:2.

High 4:4:4: contains all of the coding tools of the High 4:2:2 profile, while adding sub-sampling of near-lossless 4:4:4. It also adds extremely high data rates and resolution, with some limited lossless encoding capabilities.

[Diagram of chroma sampling (4:0:0, 4:2:0, 4:2:2, 4:4:4) against sample depth for the Main, High, High 10, High 4:2:2, and High 4:4:4 profiles.]

Figure 2.8: Pixel sampling and depth, increasingly stacked by FRExt profile. [24]

With respect to frame formatting alone, the profiles' supported Y'CbCr pixel formats stack increasingly as shown in Figure 2.8. The chroma sampling ratios are representative

to 4:4:4 near-lossless R'G'B'. The higher chroma ratios increase the color accuracy of the video. An individual video sequence may also be configured to represent its luma and chroma samples uniformly with a sample size between 8 and 12 bits inclusive. The larger sample sizes increase the overall precision of the video.
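As a rough illustration of how the pixel formats stack by profile, the sketch below encodes each FRExt profile as a pair of upper limits. The table name `PROFILE_LIMITS` and the function `profile_supports` are hypothetical. The depth limits for High 4:2:2 and High 4:4:4 are assumptions inferred from the profile descriptions and the 8 to 12 bit range stated above; they are not quoted from the standard.

```python
# Chroma ratios ordered from least to most chroma information.
CHROMA_ORDER = ["4:0:0", "4:2:0", "4:2:2", "4:4:4"]

# (highest chroma ratio, highest sample depth in bits) per profile.
PROFILE_LIMITS = {
    "High":       ("4:2:0", 8),
    "High 10":    ("4:2:0", 10),
    "High 4:2:2": ("4:2:2", 10),   # assumed: inherits the High 10 depth
    "High 4:4:4": ("4:4:4", 12),   # assumed: the 12-bit ceiling applies here
}


def profile_supports(profile, chroma, bits):
    """Return True if the profile can carry this chroma ratio and depth."""
    max_chroma, max_bits = PROFILE_LIMITS[profile]
    chroma_ok = CHROMA_ORDER.index(chroma) <= CHROMA_ORDER.index(max_chroma)
    return chroma_ok and 8 <= bits <= max_bits
```

Because each profile is a superset of the previous one, a format accepted by High is also accepted by every later entry in the table.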

One anticipated use of FRExt within future consumer products, both software and embedded hardware, is use of the High and High 10 profiles, which may be used to select several color depths and monochrome for HD video, without changing the frame resolution; thus providing an extra degree of subjective visual quality to the video stream. Other possible embedded applications include professional-grade encoder/decoders for use in video production; especially those concerned with real-time operation. [19]

2.3 Thesis relevance and specifics.

While H.264/AVC does not specify any memory architecture, it does detail the data flow of buffering intermediate data through the video decoder. It also specifies how the various Y'CbCr macroblock formats are processed.

2.3.1 H.264/AVC data buffering flow.

In Annex C of the H.264/AVC standard [9], a Hypothetical Reference Decoder (HRD) is detailed for sake of providing an example conceptual implementation of the standard in software. This decoder contains two distinct data buffers, the Coded Picture Buffer (CPB) and the Decoded Picture Buffer (DPB). The CPB is not considered by this thesis as it is only a receiving cache of the bitstream and not associated with actual decoding processes or the frame buffer itself. (It is shown for completeness.) The DPB and its position in the data flow are shown with Figure 2.9.

Conceptually, the DPB is a random access block of memory where buffered macroblocks may be stored to and loaded from. Each of these macroblocks are partially or fully

[Diagram of the data flow: Hypothetical stream scheduler (HSS), bitstream, Coded picture buffer (CPB), access units, Decoding process (instantaneous), frames, Decoded picture buffer (DPB) with reference frames fed back to the decoding process, Output cropping, output.]

Figure 2.9: Buffering within the H.264/AVC Hypothetical Reference Decoder. [9]

With a software implementation, the DPB is a section of system RAM allocated on the operating system heap, and a few pointers provide indexing of various locations within the DPB. When implementing in hardware, using a single external SDRAM memory for the DPB may be sufficient; but a memory hierarchy may also be necessary to obtain a higher level of performance. A few possible hardware architectures are discussed in Chapter 3.

2.3.2 H.264/AVC data buffering organization.

Data entering and exiting the DPB is always on a basis of a macroblock of Y'CbCr samples. While the internal storage may or may not map the three luma and chroma components together within the same address space, the interface to the DPB always groups them on a macroblock basis, as depicted with Figure 2.10.

H.264/AVC specifies three macroblock sizes, each having a luma dimension of: 16x16, 8x8, 4x4. Conceptually, all data stored within the DPB is on a 16x16 or 4x4 macroblock

[Diagram of the DPB with store and load paths, each carrying the Y', Cb, and Cr components of a macroblock.]

Figure 2.10: DPB operation: macroblock in, macroblock out.

Adjacent macroblock buffering.

Two processes within the decoder system, the Deblocking Filter (DF) process and the Intra Prediction (IntraP) process, require load and/or store access to a currently selected MB and also to each of its four already processed neighbor MBs. The currently selected MB moves according to raster-scan order as a frame is decoded. The MB positions are depicted with Figure 2.11.

[Diagram of one raster-scan row of MBs ("MBs per Raster-scan Row") showing the Current MB and its neighbor positions A, B, C, and D.]

Figure 2.11: Buffer a row of macroblocks to retain neighbor MBs.

For example, while a macroblock is processed by the DF and output into the Current MB location, an MB from positions A or D may be loaded to assist the filtering calculations. Similarly, the IntraP process may need to load an MB from any of positions A, B, C, or D.
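To make the neighbor relationship concrete, the sketch below computes the raster-scan indices of the four neighbor MBs. It assumes the conventional labeling (A left, B above, C above-right, D above-left), which may differ from the labeling used in Figure 2.11; the function name `neighbor_mbs` is hypothetical.

```python
def neighbor_mbs(mb_index, mbs_per_row):
    """Return raster-scan indices of neighbors A, B, C, D (None if absent).

    Assumed labeling: A = left, B = above, C = above-right, D = above-left.
    """
    row, col = divmod(mb_index, mbs_per_row)
    candidates = {
        "A": (row, col - 1),
        "B": (row - 1, col),
        "C": (row - 1, col + 1),
        "D": (row - 1, col - 1),
    }
    result = {}
    for name, (r, c) in candidates.items():
        inside = r >= 0 and 0 <= c < mbs_per_row
        result[name] = r * mbs_per_row + c if inside else None
    return result
```

Under this labeling the oldest neighbor, D, is never more than `mbs_per_row + 1` macroblocks behind the current one, which is why retaining roughly one raster-scan row of MBs is enough to serve all four positions.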

their buffering operations could be considered as either combined with the other decoder processes, or perhaps using an independent section within the DPB [21].

Frame buffering.

As each frame is decoded, it is potentially stored in its entirety within the DPB for later reference, depending on the values of metadata and memory instructions parsed from the bitstream. Typically, an H.264/AVC profile will specify a default frame history depth of four (4) to six (6) frames to store within the DPB for reference purposes. The maximum frame buffer length permitted by the standard is fifteen (15) reference frames. The maximum reference frame size of the DPB for a specific video stream depends on both the H.264/AVC profile and the IDC Level of the current video stream. For example, the standard does not permit more than five (5) reference frames for a resolution of 1080 HD (1920x1080).

[Diagram of the Frame Buffer holding reference frames F, connected to Inter Prediction and the Deblocking Filter, with the reference Lists.]

Figure 2.12: Organization of the reference frame buffer.

The organization of reference frames within the frame buffer section of the DPB is shown with Figure 2.12. As each frame is added to the buffer, the slice header information

long-term use. Each frame has an index number for identifying itself within the frame buffer. Two lists are maintained between frames, lists L0 and L1. The first list is used with both P-slices and B-slices, whereas the latter is only for B-slices.

When beginning the addition of a frame to the buffer, the current capacity is first checked. If the buffer is full then a "sliding window" algorithm is used to discard the oldest short-term frame. A frame may be marked for long-term storage by an explicit memory storage instruction in the bitstream. Once a frame is marked for long-term storage, an explicit memory instruction from the bitstream is required to flush the frame out of the buffer.
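The sliding window behaviour described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the standard's full memory management process: frames are represented only by an identifier, and the class and method names (`ReferenceFrameBuffer`, `add`, `mark_long_term`, `flush_long_term`) are hypothetical.

```python
from collections import deque


class ReferenceFrameBuffer:
    """Simplified sliding-window reference frame store."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.short_term = deque()   # oldest frame at the left
        self.long_term = set()

    def __len__(self):
        return len(self.short_term) + len(self.long_term)

    def add(self, frame_id):
        """Store a decoded frame; return the discarded frame id, or None."""
        discarded = None
        if len(self) >= self.capacity:
            if not self.short_term:
                raise RuntimeError("only long-term frames left; explicit flush required")
            discarded = self.short_term.popleft()   # oldest short-term frame
        self.short_term.append(frame_id)
        return discarded

    def mark_long_term(self, frame_id):
        """Explicit bitstream instruction: keep this frame until flushed."""
        self.short_term.remove(frame_id)
        self.long_term.add(frame_id)

    def flush_long_term(self, frame_id):
        """Explicit bitstream instruction: release a long-term frame."""
        self.long_term.remove(frame_id)
```

With a capacity of three, marking frame 0 as long-term and then adding further frames discards frames 1, 2, and so on, while frame 0 stays resident until it is explicitly flushed.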

One process within the decoder system, the Inter Prediction (InterP) process, requires load access to arbitrary MBs from previously decoded reference frames. Noting the slice types from Table 2.4, each individual macroblock makes use of the IntraP or InterP process, but not both. This suggests that the buffering operations of IntraP and InterP could be combined to overlap in timing. [21]

2.3.3 Macroblock pixel types.

With additional pixel types permitted by the FRExt amendment, the binary size of a macroblock is relative to two additional factors beyond the pixel dimensions of the macroblock itself: chroma sub-sampling, and sampling depth. The single pixel data size for a set of Y'CbCr samples is computed by the sub-sampling factor times the luma sampling depth. The sub-sampling factor for each ratio is shown with Table 2.5.

Table 2.5: Sub-sampling factor for all sub-sampling ratios.

sub-sampling f:m:n   factor F_ss
4:0:0                1.0 + 0.0 + 0.0 = 1.0
4:2:0                1.0 + 0.25 + 0.25 = 1.5
4:2:2                1.0 + 0.5 + 0.5 = 2.0
4:4:4                1.0 + 1.0 + 1.0 = 3.0

Using the sub-sampling factor $F_{ss}$, an equal bit size of the luma and chroma samples $luma_{bits} = chroma_{bits}$, and the pixel dimensions of a macroblock $M_w \times M_h$, the binary size of a macroblock can be computed as such:

$$MB_{bits} = (M_w \times M_h) \times (luma_{bits} \times F_{ss}) \qquad (2.4)$$

As an example, for a 4x4 macroblock with 4:2:0 sub-sampling and 8 bits of sample depth, the binary size of the macroblock is:

$$MB_{bits} = (4 \times 4) \times (8 \times 1.5) = 192 \text{ bits} \qquad (2.5)$$
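Equation 2.4 translates directly into code. The following sketch is only an illustration; the names `FSS` and `mb_bits` are hypothetical, and the factors are copied from Table 2.5.

```python
# Sub-sampling factors F_ss from Table 2.5.
FSS = {"4:0:0": 1.0, "4:2:0": 1.5, "4:2:2": 2.0, "4:4:4": 3.0}


def mb_bits(width, height, luma_bits, sub_sampling):
    """Binary size of one macroblock in bits, per Equation 2.4."""
    return int(width * height * luma_bits * FSS[sub_sampling])


# The worked example of Equation 2.5: a 4x4 block, 4:2:0, 8 bits per sample.
example = mb_bits(4, 4, 8, "4:2:0")
```

Here `example` evaluates to 192, and a 16x16 macroblock at 4:4:4 with 12-bit samples comes to 9216 bits.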

The binary sizes of all possible 16x16 macroblocks considered by this thesis are detailed with Table 2.6(a). As an example of how the macroblock size affects frame storage capacity within the DPB, Table 2.6(b) details the frame sizes in megabytes for the largest HD-TV resolution: 1080 HD. Note that while the visible resolution is 1920x1080, the H.264/AVC coded luma resolution is actually 1920x1088 due to internal cropping constraints of the codec. As shown, the largest pixel type produces a 1080 HD frame that is 4.5 times the size of the smallest pixel type. This significant variability in data size is unique to the H.264/AVC plus FRExt codec.

Table 2.6: Binary sizes of a macroblock and frame.

(a) 16x16 macroblock, bits

sample depth   chroma sub-sampling
(bits)         4:0:0   4:2:0   4:2:2   4:4:4
8              2048    3072    4096    6144
9              2304    3456    4608    6912
10             2560    3840    5120    7680
11             2816    4224    5632    8448
12             3072    4608    6144    9216

(b) 1080 HD frame, megabytes

sample depth   chroma sub-sampling
(bits)         4:0:0   4:2:0   4:2:2   4:4:4
8              1.99    2.99    3.98    5.98
9              2.24    3.36    4.48    6.72
10             2.49    3.74    4.98    7.47
11             2.74    4.11    5.48    8.22
12             2.99    4.48    5.98    8.96
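The frame sizes of Table 2.6(b) and their effect on DPB capacity can be reproduced with a short calculation. This is a sketch under stated assumptions: sizes are expressed in units of 2^20 bytes, frame dimensions are padded up to whole 16x16 macroblocks, and the 64 MB RAM size used in the usage note is a hypothetical figure, not one taken from this thesis.

```python
FSS = {"4:0:0": 1.0, "4:2:0": 1.5, "4:2:2": 2.0, "4:4:4": 3.0}
MIB = 2 ** 20


def frame_bytes(width, height, luma_bits, sub_sampling):
    """Coded frame size in bytes, padding each dimension to a multiple of 16."""
    coded_w = -(-width // 16) * 16     # ceiling to a whole macroblock
    coded_h = -(-height // 16) * 16    # 1080 becomes 1088
    return coded_w * coded_h * luma_bits * FSS[sub_sampling] / 8


def frames_that_fit(ram_bytes, width, height, luma_bits, sub_sampling):
    """Whole frames of the given pixel type that fit in a RAM of ram_bytes."""
    return int(ram_bytes // frame_bytes(width, height, luma_bits, sub_sampling))


smallest = frame_bytes(1920, 1080, 8, "4:0:0") / MIB
largest = frame_bytes(1920, 1080, 12, "4:4:4") / MIB
```

With a hypothetical 64 MB external RAM, `frames_that_fit(64 * MIB, 1920, 1080, 8, "4:2:0")` gives 21 frames, whereas the 12-bit 4:4:4 pixel type leaves room for only 7.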


Chapter 3

H.264/AVC Research

This chapter discusses published investigations and conclusions related to implementing the frame buffer of an H.264/AVC decoder, and also presents several inferences made by this thesis.

3.1 Decoder memory case studies and research.

Since its approval in 2003, the H.264/AVC coding standard [7, 9] has seen an extensive degree of published research, including the performance optimization of both software and hardware implementations. The extent of open publishing is possibly due to two major factors. First, the baseline specification of H.264/AVC is royalty free and open to academia and industry alike without charge beyond purchasing¹ the specification document, from either ISO or ITU. This is different from the preceding MPEG-2 standard, for which ISO charges significant royalties against each implementation [20]. Second, even during the drafting phase, the scalability of the standard was found to be suitable for applications beyond the original focus of videoconferencing, including the up-and-coming technology of HDTV. H.264/AVC was found to perform well at both low and high bit rates and resolutions [20]. These significant factors, in addition to others, including industry and academia trends, have contributed to significant publishing of H.264/AVC research and development [24].

¹ As of early 2007, ITU-T is now providing free download of the entire H.264.X group of documents and software. Implementations of profiles other than Baseline, including some of the work performed by this

[Block diagram of the H.264/AVC video decoder: Encoded Video Data, Stream Parser (Entropy & Run Length Decoding), Header Information, Transform Unit, Inter Prediction, Intra Prediction, Macroblock Buffer, Deblocking Filter, Macroblock Row Buffers, Frame Buffer, External Memory Driver, External Memory, Video Driver.]

Figure 3.1: FPGA hybrid on-chip, off-chip decoder architecture proposed in [21].

In this section, a few supporting works are discussed for the purpose of depicting documented approaches to implementing the external memory of a hardware H.264/AVC decoder without FRExt.

3.1.1 Identification of memory components.

Before discussing techniques for optimizing memory within the H.264/AVC decoder, it is necessary to first identify the memory requirements of the processing blocks, and also a realistic generic architecture for hardware implementation. Considering an FPGA implementation of the DPB, a few distinct independent sections of buffering become apparent, as already detailed in Section 2.3.2. Proposed in [21] is a mixed on-chip and off-chip memory architecture for an FPGA decoder, as depicted with Figure 3.1.

From the figure, the independent buffering needs of the decoder are as follows:

1. Two ping-pong buffers to FIFO macroblock data between processing stages. These

enough to potentially fit on-chip in FPGA block RAM. One ping-pong buffer is placed between the stream parser and the transform unit. The other is placed between the prediction units and the deblocking filter.

2. A row buffer for data feedback to the intra-prediction unit. For small resolutions, this should fit on-chip for an FPGA. However, for large resolutions it should only fit on-chip for very large FPGAs with a significant quantity of block RAM.

3. A row buffer for data feedback to the deblocking filter. For small resolutions, this should fit on-chip for an FPGA. However, for large resolutions it should only fit on-chip for very large FPGAs with a significant quantity of block RAM. Also, it is conceivable that resources are constrained to only allow instantiation of one row buffer on-chip; in such a case, this row buffer can be combined with the reference frame buffer.

4. A reference frame buffer for storing an N depth of fully decoded frames. This memory component is certain to require a memory controller with an external RAM chip with a significant capacity.
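The ping-pong buffers of item 1 can be illustrated with a minimal software model. This is a sketch of the general double-buffering idea only; it is not taken from [21], and the class name `PingPongBuffer` and its methods are hypothetical.

```python
class PingPongBuffer:
    """Two banks: the producer fills one while the consumer drains the other."""

    def __init__(self):
        self._banks = [None, None]
        self._write = 0               # index of the bank being written

    def write(self, macroblock):
        """Producer stage stores its output into the current write bank."""
        self._banks[self._write] = macroblock

    def swap(self):
        """Exchange roles once both stages have finished their macroblock."""
        self._write ^= 1

    def read(self):
        """Consumer stage loads from the bank that is not being written."""
        return self._banks[self._write ^ 1]
```

Each call to `swap` hands the macroblock just written to the next stage while freeing the other bank for the following macroblock, so the two stages never access the same bank at the same time.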

Each of these conceptual buffering stages could be combined into a single buffering unit with off-chip memory, or could be distributed as described. The greater the distribution of the memory buffer on-chip, the lower the bandwidth requirements of the external RAM interface.
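Whether a row buffer can be kept on-chip depends on the frame width and the pixel type. The estimate below is a rough sketch: it assumes that the row buffer retains one full raster-scan row of complete 16x16 macroblocks, which is a worst case, and the 512 Kbit block RAM budget in the usage note is a hypothetical figure rather than the capacity of any particular FPGA.

```python
FSS = {"4:0:0": 1.0, "4:2:0": 1.5, "4:2:2": 2.0, "4:4:4": 3.0}
MB_SIZE = 16


def row_buffer_bits(frame_width, luma_bits, sub_sampling):
    """Bits needed to hold one row of complete 16x16 macroblocks."""
    mbs_per_row = -(-frame_width // MB_SIZE)          # ceiling division
    bits_per_mb = int(MB_SIZE * MB_SIZE * luma_bits * FSS[sub_sampling])
    return mbs_per_row * bits_per_mb


def fits_on_chip(frame_width, luma_bits, sub_sampling, block_ram_bits):
    """True if the row buffer fits within the given block RAM budget."""
    return row_buffer_bits(frame_width, luma_bits, sub_sampling) <= block_ram_bits
```

For a 1920-pixel-wide frame, 8-bit 4:2:0 data needs 368,640 bits and would fit in a 512 Kbit budget, while 12-bit 4:4:4 data needs 1,105,920 bits and would not.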

3.1.2 Optimization techniques.

Custom SDRAM memory controller for H.264/AVC.

One approach to optimizing the frame buffer performance is to implement a custom memory controller with bus access specifically tailored to the H.264/AVC decoder architecture and frame data. Rather than just optimize the data flow logic around the memory, the mem

comparable data access at lower clock rates.

Described in [26], an HDTV H.264/AVC decoder was implemented with a custom SDRAM controller for off-chip buffering of frame data. First a standard SDRAM memory controller was used to control the read and write accesses. For the maximum H.264/AVC Baseline resolution of 1080 HD (1920x1080), the clock speed requirement of the memory controller was determined to be 193 MHz. The controller was then reimplemented to improve performance. Reducing the SDRAM page-active cycle in 2-dimensional read and write accesses provided a one-third performance enhancement over traditional page-per-line architectures. Memory bandwidth was significantly conserved with the additional benefit of increased flexibility with the implementation of other decoding blocks.

Specifically, the clock speed requirement for the new memory controller was determined to be 121 MHz, hence an approximate one-third performance improvement of the controller. The reduction in clock speed requirement also potentially reduces power consumption and device cost.
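The "approximate one-third" figure follows from the two clock requirements reported for [26]; as a quick check of the arithmetic:

$$\frac{193\ \text{MHz} - 121\ \text{MHz}}{193\ \text{MHz}} = \frac{72}{193} \approx 0.37$$

That is, the required controller clock drops by roughly 37%, slightly more than one third.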

Multiple channel memory architecture.

Another possible approach, depending on the target device technology, is to use two memory controllers with one RAM each for buffering different portions of frame data within the decoder. In [10], an ASIC H.264/AVC decoder architecture utilizes dual memory controllers combined with an ARM CPU, system bus, and local bus. The architecture demonstrates use of two buses and two memory controllers to facilitate a high level of controlled parallelism between processing blocks. The performance was sufficient to process real-time 1080 HD frames.

Extending this technique, a deblocking filter architecture is described in [11] that uses a 2-dimensional array of memory modules. Specifically, the architecture uses eight dual-port SRAM modules, facilitating parallel access of eight pixels. The pixels within a 4x4 macroblock are mapped in a linear shift-rotate manner. This allows an 8-way parallel load

select conflicts.
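One common way to obtain conflict-free parallel access in both directions is a skewed (shift-rotate) bank assignment. The sketch below shows that generic scheme only; it is an assumption for illustration and is not claimed to be the exact mapping used in [11]. The names `bank_of`, `banks_for_run`, and `conflict_free` are hypothetical.

```python
NUM_BANKS = 8


def bank_of(x, y, num_banks=NUM_BANKS):
    """Skewed mapping: each row is rotated by one bank relative to the last."""
    return (x + y) % num_banks


def banks_for_run(x0, y0, horizontal, length=NUM_BANKS, num_banks=NUM_BANKS):
    """Banks touched by a run of `length` pixels starting at (x0, y0)."""
    if horizontal:
        coords = [(x0 + i, y0) for i in range(length)]
    else:
        coords = [(x0, y0 + i) for i in range(length)]
    return [bank_of(x, y, num_banks) for x, y in coords]


def conflict_free(banks):
    """True if every pixel of the run lands in a different bank."""
    return len(set(banks)) == len(banks)
```

With this mapping, any eight consecutive pixels along a row or along a column fall into eight different banks, whereas a plain `x % 8` assignment would place an entire column in a single bank.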

Implementable greedy search algorithm for frame history.

To further increase the hardware performance of the frame buffer (as measured by reducing the necessary clock speed and RAM capacity), the frame buffer data management may discard an additional quantity of frame data. This form of lossy optimization potentially increases performance by a reduction of data transfer across the memory bus at the expense of reducing the subjective video output quality.

To render the decoded video within "acceptable" levels, an efficient search algorithm may be used to determine what video data, if any, to discard for each frame buffer operation. Described in [5], the technique employs a greedy search heuristic to discard the least important reference frames. In doing this, the prediction performance is increased over the conventional sliding window memory management, which simply discards the oldest frame without as much regard to context.

While the implementation complexity of this approach is small, the algorithm requires experimental fine-tuning specific to the decoder architecture, including the memory accesses. Additionally, the subjective video output quality may be compromised if the algorithm is not carefully tuned [5]. This approach is notably of less general application than the H.264/AVC-specified sliding window algorithm.

Compressed memory bus accesses.

Another popular technique for optimizing decoder power and performance is use of embedded compression (EC) within the memory architecture. All data is compressed with a lossless block compression algorithm just prior to memory store, and decompressed immediately following load from memory. Considering that the majority of data flow that utilizes the memory architecture is decoded macroblocks, the potential savings in memory capacity and data bus operations are significant. The reduction in physical memory operations

Described in [4] is an EC technique used to optimize an H.264/AVC decoder for real-time operation at low speed and low power consumption. While any form of EC was found to be adequate for reducing the memory capacity requirements of a general video decoder, three main constraints were determined as pivotal for reducing power consumption:

1. Use a block-based compression, and independently encode each block.

2. Set a fixed compression ratio for all blocks and use this fixed ratio for the memory mapping.

3. Store the luminance and chrominance planes jointly in memory.
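The practical value of a fixed compression ratio (constraint 2) is that the address of any block can be computed directly, with no lookup table. The sketch below illustrates this under stated assumptions: the ratio is an integer N meaning N:1, blocks are stored back to back from a base address, and each block holds the jointly stored Y'CbCr data of one macroblock. The function names are hypothetical and the scheme is not taken from [4].

```python
def compressed_block_bytes(raw_block_bits, ratio):
    """Fixed storage slot size in bytes for one block at an N:1 ratio."""
    return -(-raw_block_bits // (8 * ratio))      # ceiling division


def block_address(base, block_index, raw_block_bits, ratio):
    """Byte address of a block; valid because every slot has the same size."""
    return base + block_index * compressed_block_bytes(raw_block_bits, ratio)
```

For example, a 3072-bit macroblock (16x16, 4:2:0, 8 bits) at a 2:1 ratio occupies a 192-byte slot, so block 10 starts 1920 bytes past the base address. Because each block is encoded independently (constraint 1), it can be loaded and decompressed without touching its neighbors.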

3.2 Analysis of published results.

Conjecturing from the above research, several techniques may be employed to improve external memory performance within a video decoder with a single decoded frame format.

Experimentally tune the memory controller parameters such that access bursts and the topography of the address mapping are improved specifically for the store and load of macroblocks.

Replace the "sliding window" algorithm with a more lossy algorithm, reducing the total number of store and load operations; possibly also reducing the frequency of memory accesses.

Compress data as the first stage of a store operation, and decompress data as the final stage of a load operation, using a lossless block compression algorithm tailored for Y'CbCr data. This would reduce memory bandwidth and capacity requirements at the expense of additional on-chip logic.

Use a dedicated memory controller for the frame buffer, with all intra prediction data

Extrapolating from the above research, several plausible approaches to optimizing FRExt memory performance include:

Re-sample data as the first stage of a store operation, and re-sample data as the final stage of a load operation. This would reduce memory bandwidth and capacity requirements at the expense of additional on-chip logic and reduction of inter prediction quality.

Configure the memory mapping and data pipeline to efficiently handle variable data

Chapter 4

Requirements and Modeling

This chapter provides an overview of the requirements, algorithms and data flow used for designing the frame buffer model. The addition of Y'CbCr pixel type variability to the frame buffer model is also discussed.

4.1 Augmentation of decoder system.

A preexisting Baseline H.264/AVC decoder testbench system was obtained from [21]. While much of the functionality of the H.264/AVC Baseline profile was implemented, the system was only capable o