Video Processing & Communications - Wang


Errata for

VIDEO PROCESSING AND COMMUNICATIONS

Yao Wang, Joern Ostermann, and Ya-Qin Zhang

(©2002 by Prentice-Hall, ISBN 0-13-017547-1)

Updated 6/12/2002

Symbols Used

Ti = i-th line from top; Bi = i-th line from bottom; Fi = Figure i; TAi = Table i;

Pi = Problem i; E(i) = Equation (i); X -> Y = replace X with Y

Page | Line/Fig/Tab | Corrections
16 | F1.5 | Add an output from the demultiplexing box to a microphone at the bottom of the figure.
48 | B6, E(2.4.4)-E(2.4.6) | Replace “v_x”, “v_y” by “\tilde v_x”, “\tilde v_y”
119 | E(5.2.7) | C(X) -> C(X,t), r(X) -> r(X,t), E(N) -> E(N,t)
125 | F5.11 | Caption: “cameras” -> “a camera”, “diffuse” -> “ambient”
126 | T7 | “diffuse illumination” -> “ambient illumination”
133 | B10 | T_x,T_y,T_z -> T_x,T_y,T_z, and Z
 | B4 | Delete “when there is no translational motion in the Z direction, or”
 | B2 | “aX+bY+cZ=1” -> “Z=aX+bY+c”
 | Before E(5.5.13) | Add “(see Problem 5.3)” after “before and after the motion”
138 | P5.3 | “a planar patch” -> “any 3-D object”, “projective mapping” -> “Equation (5.5.13)”
 | P5.4 | “Equation 5.5.14” -> “Equation (5.5.14)”, “aX+bY+cZ=1” -> “Z=aX+bY+c”
143 | T4 | After “true 2-D motion.” add “Optical flow depends on not only 2-D motion, but also illumination and object surface texture.”
159 | T6 | After “block size is 16x16” add “, and the search range is 16x16”
189 | P6.1 | “global” -> “global-based”
190 | P6.12 | Add at the end “Choose two frames that have sufficient motion in between, so that it is easier to observe the effect of motion estimation inaccuracy. If necessary, choose frames that are not immediate neighbors.”
199 | T9 | “Equation (7.1.11) defines a linear dependency … straight line.” -> “Equation (7.1.11) says that the possible positions x’ of a point x after motion lie on a straight line. The actual position depends on the Z-coordinate of the original 3-D point.”
200 | B8 | “[A]” -> “[A]^T [A]”
214 | P7.5 | “Derive” -> “Equation (7.1.5) describes”; add at the end “(assuming F=1)”
 | P7.6 | Replace “\delta” with “\bf \delta”
218 | F8.1 | “Parameter statistics” -> “Model parameter statistics”
247 | F8.9 | Add a box with words “Update previous distortion \\ D_0=D_1” in the line with the word “No”.
255 | F8.14 | Same as for F8.9
261 | P8.13(a) | “B_l={f_k, k=1,2,… ,K_l}” -> “B_l, which consists of K_l vectors in {\cal F}”
416 | TA13.2 | Item “4CIF/H.263” should be “Opt.”
421 | TA13.3 | Item “Video/Non-QoS LAN” should be “H.261/3”
436 | T13 | “MPEG-2, defined” -> “MPEG-2 defined”
443 | T10 | “I-VOP” -> “I-VOPs”, “B-VOP” -> “B-VOPs”
575 | P1.3 | “red+green=blue” -> “red+green=black”
 | P1.4 | “(1.4.4)” -> “(1.4.3)”, “(1.4.2)” -> “(1.4.1)”


wang-50214

wang˙fm

August 23, 2001

14:22

Preface

In the past decade or so, there have been fascinating developments in multimedia

representation and communications. First of all, it has become very clear that all aspects

of media are “going digital”; from representation to transmission, from processing to

retrieval, from studio to home. Second, there have been significant advances in digital

multimedia compression and communication algorithms, which make it possible to

deliver high-quality video at relatively low bit rates in today’s networks. Third, the

advancement in VLSI technologies has enabled sophisticated software to be

implemented in a cost-effective manner. Last but not least, the establishment of half a dozen

international standards by ISO/MPEG and ITU-T laid the common groundwork for

different vendors and content providers.

At the same time, the explosive growth in wireless and networking technology

has profoundly changed the global communications infrastructure. It is the confluence

of wireless, multimedia, and networking that will fundamentally change the way people

conduct business and communicate with each other. The future computing and

communications infrastructure will be empowered by virtually unlimited bandwidth, full

connectivity, high mobility, and rich multimedia capability.

As multimedia becomes more pervasive, the boundaries between video, graphics,

computer vision, multimedia database, and computer networking start to blur, making

video processing an exciting field with input from many disciplines. Today, video

processing lies at the core of multimedia. Among the many technologies involved, video

coding and its standardization are definitely the key enablers of these developments.

This book covers the fundamental theory and techniques for digital video processing,

with a focus on video coding and communications. It is intended as a textbook for a

graduate-level course on video processing, as well as a reference or self-study text for


researchers and engineers. In selecting the topics to cover, we have tried to achieve

a balance between providing a solid theoretical foundation and presenting complex

system issues in real video systems.

SYNOPSIS

Chapter 1 gives a broad overview of video technology, from analog color TV

system to digital video. Chapter 2 delineates the analytical framework for video analysis

in the frequency domain, and describes characteristics of the human visual system.

Chapters 3–12 focus on several very important sub-topics in digital video technology.

Chapters 3 and 4 consider how a continuous-space video signal can be sampled to

retain the maximum perceivable information within the affordable data rate, and how

video can be converted from one format to another. Chapter 5 presents models for

the various components involved in forming a video signal, including the camera, the

illumination source, the imaged objects and the scene composition. Models for the

three-dimensional (3-D) motions of the camera and objects, as well as their projections

onto the two-dimensional (2-D) image plane, are discussed at length, because these

models are the foundation for developing motion estimation algorithms, which are

the subjects of Chapters 6 and 7. Chapter 6 focuses on 2-D motion estimation, which

is a critical component in modern video coders. It is also a necessary preprocessing

step for 3-D motion estimation. We provide both the fundamental principles governing

2-D motion estimation, and practical algorithms based on different 2-D motion

representations. Chapter 7 considers 3-D motion estimation, which is required for various

computer vision applications, and can also help improve the efficiency of video coding.

Chapters 8–11 are devoted to the subject of video coding. Chapter 8 introduces

the fundamental theory and techniques for source coding, including information theory

bounds for both lossless and lossy coding, binary encoding methods, and scalar and

vector quantization. Chapter 9 focuses on waveform-based methods (including

transform and predictive coding), and introduces the block-based hybrid coding framework,

which is the core of all international video coding standards. Chapter 10 discusses

content-dependent coding, which has the potential of achieving extremely high

compression ratios by making use of knowledge of scene content. Chapter 11 presents

scalable coding methods, which are well-suited for video streaming and

broadcasting applications, where the intended recipients have varying network connections and

computing powers. Chapter 12 introduces stereoscopic and multiview video processing

techniques, including disparity estimation and coding of such sequences.

Chapters 13–15 cover system-level issues in video communications. Chapter 13

introduces the H.261, H.263, MPEG-1, MPEG-2, and MPEG-4 standards for video

coding, comparing their intended applications and relative performance. These

standards integrate many of the coding techniques discussed in Chapters 8–11. The MPEG-7

standard for multimedia content description is also briefly described. Chapter 14 reviews

techniques for combating transmission errors in video communication systems, and

also describes the requirements of different video applications, and the characteristics


of various networks. As an example of a practical video communication system, we

end the text with a chapter devoted to video streaming over the Internet and wireless

networks. Chapter 15 discusses the requirements and representative solutions for the

major subcomponents of a streaming system.

SUGGESTED USE FOR INSTRUCTION AND SELF-STUDY

As prerequisites, students are assumed to have finished undergraduate courses in signals

and systems, communications, probability, and preferably a course in image

processing. For a one-semester course focusing on video coding and communications, we

recommend covering the two beginning chapters, followed by video modeling

(Chapter 5), 2-D motion estimation (Chapter 6), video coding (Chapters 8–11), standards

(Chapter 13), error control (Chapter 14) and video streaming systems (Chapter 15).

On the other hand, for a course on general video processing, the first nine chapters,

including the introduction (Chapter 1), frequency domain analysis (Chapter 2), sampling

and sampling rate conversion (Chapters 3 and 4), video modeling (Chapter 5), motion

estimation (Chapters 6 and 7), and basic video coding techniques (Chapters 8 and 9),

plus selected topics from Chapters 10–13 (content-dependent coding, scalable coding,

stereo, and video coding standards) may be appropriate. In either case, Chapter 8 may

be skipped or only briefly reviewed if the students have finished a prior course on

source coding. Chapters 7 (3-D motion estimation), 10 (content-dependent coding),

11 (scalable coding), 12 (stereo), 14 (error-control), and 15 (video streaming) may also

be left for an advanced course in video, after covering the other chapters in a first course

in video. In all cases, sections denoted by asterisks (*) may be skipped or left for further

exploration by advanced students.

Problems are provided at the end of Chapters 1–14 for self-study or as

homework assignments for classroom use. Appendix D gives answers to selected problems.

The website for this book (www.prenhall.com/wang) provides MATLAB scripts used to

generate some of the plots in the figures. Instructors may modify these scripts to generate

similar examples. The scripts may also help students to understand the underlying

operations. Sample video sequences can be downloaded from the website, so that

students can evaluate the performance of different algorithms on real sequences. Some

compressed sequences using standard algorithms are also included, to enable instructors

to demonstrate coding artifacts at different rates by different techniques.

ACKNOWLEDGMENTS

We are grateful to the many people who have helped to make this book a reality. Dr.

Barry G. Haskell of AT&T Labs, with his tremendous experience in video coding

standardization, reviewed Chapter 13 and gave valuable input to this chapter as well as other

topics. Prof. David J. Goodman of Polytechnic University, a leading expert in wireless

communications, provided valuable input to Section 14.2.2, part of which summarizes

characteristics of wireless networks. Prof. Antonio Ortega of the University of Southern


California and Dr. Anthony Vetro of Mitsubishi Electric Research Laboratories, then

a Ph.D. student at Polytechnic University, suggested what topics to cover in the

section on rate control, and reviewed Sections 9.3.3–4. Mr. Dapeng Wu, a Ph.D. student

at Carnegie Mellon University, and Dr. Yiwei Hou from Fujitsu Labs helped to draft

Chapter 15. Dr. Ru-Shang Wang of Nokia Research Center, Mr. Fatih Porikli of

Mitsubishi Electric Research Laboratories, also a Ph.D. student at Polytechnic University,

and Mr. Khalid Goudeaux, a student at Carnegie Mellon University, generated several

images related to stereo. Mr. Haidi Gu, a student at Polytechnic University, provided

the example image for scalable video coding. Mrs. Dorota Ostermann provided the

brilliant design for the cover.

We would like to thank the anonymous reviewers who provided valuable

comments and suggestions to enhance this work. We would also like to thank the students

at Polytechnic University, who used draft versions of the text and pointed out many

typographic errors and inconsistencies. Solutions included in Appendix D are based on

their homework. Finally, we would like to acknowledge the encouragement and

guidance of Tom Robbins at Prentice Hall. Yao Wang would like to acknowledge research

grants from the National Science Foundation and New York State Center for Advanced

Technology in Telecommunications over the past ten years, which have led to some of

the research results included in this book.

Most of all, we are deeply indebted to our families, for allowing and even

encouraging us to complete this project, which started more than four years ago and took away

a significant amount of time we could otherwise have spent with them. The arrival of

our new children Yana and Brandon caused a delay in the creation of the book but also

provided an impetus to finish it. This book is a tribute to our families, for their love,

affection, and support.

YAO WANG

Polytechnic University, Brooklyn, NY, USA

JÖRN OSTERMANN

AT&T Labs—Research, Middletown, NJ, USA

YA-QIN ZHANG

Microsoft Research, Beijing, China

VIDEO FORMATION, PERCEPTION, AND REPRESENTATION

In this first chapter, we describe what a video signal is, how it is captured and perceived, how it is stored and transmitted, and what the important parameters are that determine the quality and bandwidth (which in turn determines the data rate) of a video signal. We first present the underlying physics for color perception and specification (Sec. 1.1). We then describe the principles and typical devices for video capture and display (Sec. 1.2). As will be seen, analog videos are captured, stored, and transmitted in a raster scan format, using either progressive or interlaced scans. As an example, we review the analog color television (TV) system (Sec. 1.4), and give insights as to how certain critical parameters, such as frame rate and line rate, are chosen, what the spectral content of a color TV signal is, and how different components of the signal can be multiplexed into a composite signal. Finally, Section 1.5 introduces the ITU-R BT.601 video format (formerly CCIR601), the digitized version of the analog color TV signal. We present some of the considerations that have gone into the selection of various digitization parameters. We also describe several other digital video formats, including high-definition TV (HDTV). The compression standards developed for different applications and their associated video formats are summarized.

The purpose of this chapter is to give the readers background knowledge about analog and digital video, and to provide insights into common video system design problems. As such, the presentation is intentionally made more qualitative than quantitative. In later chapters, we will come back to certain problems mentioned in this chapter and provide more rigorous descriptions and solutions.

1.1 Color Perception and Specification

A video signal is a sequence of two dimensional (2D) images projected from a three dimensional (3D) scene. The color value at any point in a video frame records the emitted or reflected light at a particular 3D point in the observed scene. To understand what the color value means physically, we review in this section the basics of light physics and describe the attributes that characterize light and its color. We will also describe the principle of human color perception and different ways to specify a color signal.

1.1.1 Light and Color

Light is an electromagnetic wave with wavelengths in the range of 380 to 780 nanometers (nm), to which the human eye is sensitive. The energy of light is measured by flux, with a unit of watt, which is the rate at which energy is emitted. The radiant intensity of a light, which is directly related to the brightness of the light we perceive, is defined as the flux radiated into a unit solid angle in a particular direction, measured in watt/solid-angle. A light source usually can emit energy in a range of wavelengths, and its intensity can vary in both space and time. In this book, we use $C(\mathbf{X}, t, \lambda)$ to represent the radiant intensity distribution of a light, which specifies the light intensity at wavelength $\lambda$, spatial location $\mathbf{X} = (X, Y, Z)$, and time $t$.

The perceived color of a light depends on its spectral content (i.e., the wavelength composition). For example, a light that has its energy concentrated near 700 nm appears red. A light that has equal energy in the entire visible band appears white. In general, a light that has a very narrow bandwidth is referred to as a spectral color. On the other hand, a white light is said to be achromatic.

There are two types of light sources: the illuminating source, which emits an electromagnetic wave, and the reflecting source, which reflects an incident wave.^1 The illuminating light sources include the sun, light bulbs, television (TV) monitors, etc. The perceived color of an illuminating light source depends on the wavelength range in which it emits energy. The illuminating light follows an additive rule, i.e., the perceived color of several mixed illuminating light sources depends on the sum of the spectra of all light sources. For example, combining red, green, and blue lights in the right proportions creates the white color.

^1 The illuminating and reflecting light sources are also referred to as primary and secondary light sources, respectively. We do not use those terms to avoid confusion with the primary colors associated with light. In other places, illuminating and reflecting lights are also called additive and subtractive colors, respectively.

Figure 1.1. Solid line: Frequency responses of the three types of cones on the human retina. The blue response curve is magnified by a factor of 20 in the figure. Dashed line: The luminous efficiency function. From [10, Fig. 1].

The reflecting light sources are those that reflect an incident light (which could itself be a reflected light). When a light beam hits an object, the energy in a certain wavelength range is absorbed, while the rest is reflected. The color of a reflected light depends on the spectral content of the incident light and the wavelength range that is absorbed. A reflecting light source follows a subtractive rule, i.e., the perceived color of several mixed reflecting light sources depends on the remaining, unabsorbed wavelengths. The most notable reflecting light sources are color dyes and paints. For example, if the incident light is white, a dye that absorbs the wavelengths near 700 nm (red) appears as cyan. In this sense, we say that cyan is the complement of red (or white minus red). Similarly, magenta and yellow are complements of green and blue, respectively. Mixing cyan, magenta, and yellow dyes produces black, which absorbs the entire visible spectrum.
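The subtractive rule can be illustrated with a small numerical sketch. It is not taken from the text: the three idealized bands, their edges at 500 and 600 nm, and the helper names `band`, `complement`, and `reflect` are assumptions made only for illustration. Each dye is modeled by the fraction of light it reflects at each wavelength, and mixing dyes multiplies those fractions.

```python
# Wavelength grid over the visible band described in the text (380 to 780 nm).
WAVELENGTHS = list(range(380, 781, 5))

def band(lo, hi):
    """Idealized 0/1 mask for wavelengths in [lo, hi) nm."""
    return [1.0 if lo <= w < hi else 0.0 for w in WAVELENGTHS]

# Hypothetical band edges, chosen only for illustration.
BLUE, GREEN, RED = band(380, 500), band(500, 600), band(600, 781)

def complement(mask):
    """Reflectance of an ideal dye that absorbs exactly the given band."""
    return [1.0 - m for m in mask]

CYAN_DYE = complement(RED)       # absorbs red
MAGENTA_DYE = complement(GREEN)  # absorbs green
YELLOW_DYE = complement(BLUE)    # absorbs blue

def reflect(incident, *dyes):
    """Subtractive rule: each dye removes its absorbed band from the spectrum."""
    out = list(incident)
    for dye in dyes:
        out = [o * d for o, d in zip(out, dye)]
    return out

white = [1.0] * len(WAVELENGTHS)
cyan_light = reflect(white, CYAN_DYE)                      # white minus red
black = reflect(white, CYAN_DYE, MAGENTA_DYE, YELLOW_DYE)  # nothing is left
```

Under these assumptions, `cyan_light` keeps only the blue and green bands, and `black` is zero at every wavelength, which matches the statement that the three dyes together absorb the entire visible spectrum.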

1.1.2 Human Perception of Color

The perception of a light in the human being starts with the photoreceptors located in the retina (the surface of the rear of the eyeball). There are two types of receptors: cones, which function under bright light and can perceive the color tone, and rods, which work under low ambient light and can only extract the luminance information. The visual information from the retina is passed via optic nerve fibers to the brain area called the visual cortex, where visual processing and understanding is accomplished. There are three types of cones, which have overlapping pass-bands in the visible spectrum with peaks at red (near 570 nm), green (near 535 nm), and blue (near 445 nm) wavelengths, respectively, as shown in Figure 1.1. The responses of these receptors to an incoming light distribution $C(\lambda)$ can be described by:

$$C_i = \int C(\lambda)\, a_i(\lambda)\, d\lambda, \quad i = r, g, b, \tag{1.1.1}$$

where $a_r(\lambda)$, $a_g(\lambda)$, $a_b(\lambda)$ are referred to as the frequency responses or relative absorption functions of the red, green, and blue cones. The combination of these three types of receptors enables a human being to perceive any color. This implies that the perceived color only depends on three numbers, $C_r, C_g, C_b$, rather than the complete light spectrum $C(\lambda)$. This is known as the tri-receptor theory of color vision.
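As a rough numerical sketch of Eq. (1.1.1), the integral can be approximated by a sum over a wavelength grid. The peak wavelengths below come from the text, but the Gaussian shape of the absorption functions, their 40 nm width, and the names `absorption`, `cone_responses`, and `narrowband` are assumptions for illustration only; the real curves are those plotted in Figure 1.1.

```python
import math

STEP = 1.0  # nm, integration step
WAVELENGTHS = [380 + STEP * i for i in range(401)]  # 380 to 780 nm

# Peak wavelengths from the text; Gaussian shape and width are assumptions.
CONE_PEAKS = {"r": 570.0, "g": 535.0, "b": 445.0}
CONE_WIDTH = 40.0

def absorption(peak, wavelength):
    """Hypothetical relative absorption function a_i(lambda)."""
    return math.exp(-0.5 * ((wavelength - peak) / CONE_WIDTH) ** 2)

def cone_responses(spectrum):
    """Approximate C_i = integral of C(lambda) a_i(lambda) d(lambda) by a sum."""
    return {
        name: sum(spectrum(w) * absorption(peak, w) for w in WAVELENGTHS) * STEP
        for name, peak in CONE_PEAKS.items()
    }

def narrowband(center, width=5.0):
    """A light whose energy is concentrated near `center` nm."""
    return lambda w: math.exp(-0.5 * ((w - center) / width) ** 2)

reddish = cone_responses(narrowband(650.0))
bluish = cone_responses(narrowband(450.0))
```

With these assumed curves, a light near 650 nm excites the red cones most and the blue cones least, while a light near 450 nm does the opposite. In both cases the whole spectrum is reduced to three numbers.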

There are two attributes that describe the color sensation of a human being: luminance and chrominance. The term luminance refers to the perceived brightness of the light, which is proportional to the total energy in the visible band. The term chrominance describes the perceived color tone of a light, which depends on the wavelength composition of the light. Chrominance is in turn characterized by two attributes: hue and saturation. Hue specifies the color tone, which depends on the peak wavelength of the light, while saturation describes how pure the color is, which depends on the spread or bandwidth of the light spectrum. In this book, we use the word color to refer to both the luminance and chrominance attributes of a light, although it is customary to use the word color to refer to the chrominance aspect of a light only.

Experiments have shown that there exists a secondary processing stage in the human visual system (HVS), which converts the three color values obtained by the cones into one value that is proportional to the luminance and two other values that are responsible for the perception of chrominance. This is known as the opponent color model of the HVS [3, 9]. It has been found that the same amount of energy produces different sensations of brightness at different wavelengths, and this wavelength-dependent variation of the brightness sensation is characterized by a relative luminous efficiency function, $a_y(\lambda)$, which is also shown (in dashed line) in Fig. 1.1. It is essentially the sum of the frequency responses of all three types of cones. We can see that the green wavelength contributes most to the perceived brightness, the red wavelength the second, and the blue the least. The luminance (often denoted by $Y$) is related to the incoming light spectrum by:

$$Y = \int C(\lambda)\, a_y(\lambda)\, d\lambda. \tag{1.1.2}$$

In the above equations, we have neglected the time and space variables, since we are only concerned with the perceived color or luminance at a fixed spatial and temporal location. We also neglected the scaling factor commonly associated with each equation, which depends on the desired unit for describing the color intensities and luminance.
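Eq. (1.1.2) can be sketched numerically in the same spirit. The code below is only an illustration: it replaces the measured luminous efficiency curve with a single Gaussian peaked at 555 nm with a 45 nm width, which is an assumption, and it ignores the scaling factor, as the text does. The names `a_y`, `luminance`, and `narrowband` are hypothetical.

```python
import math

STEP = 1.0  # nm
WAVELENGTHS = [380 + STEP * i for i in range(401)]  # 380 to 780 nm

def a_y(wavelength, peak=555.0, width=45.0):
    """Hypothetical Gaussian stand-in for the relative luminous efficiency."""
    return math.exp(-0.5 * ((wavelength - peak) / width) ** 2)

def luminance(spectrum):
    """Approximate Y = integral of C(lambda) a_y(lambda) d(lambda), unscaled."""
    return sum(spectrum(w) * a_y(w) for w in WAVELENGTHS) * STEP

def narrowband(center, width=5.0):
    """Test light with the same energy, concentrated near `center` nm."""
    return lambda w: math.exp(-0.5 * ((w - center) / width) ** 2)

y_green = luminance(narrowband(550.0))
y_red = luminance(narrowband(650.0))
y_blue = luminance(narrowband(450.0))
```

Under this assumed curve, three lights of equal energy give the largest luminance for green, a smaller one for red, and the smallest for blue, in line with the ordering stated above.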

1.1.3 The Trichromatic Theory of Color Mixture

A very important finding in color physics is that most colors can be produced by mixing three properly chosen primary colors. This is known as the trichromatic theory of color mixture, first demonstrated by Maxwell in 1855 [9, 13]. Let $C_k$, $k = 1, 2, 3$, represent the colors of three primary color sources, and $C$ a given color. Then the theory essentially says

$$C = \sum_{k=1,2,3} T_k C_k, \tag{1.1.3}$$

where the $T_k$'s are the amounts of the three primary colors required to match color $C$. The $T_k$'s are known as tristimulus values. In general, some of the $T_k$'s can be negative. Assuming only $T_1$ is negative, this means that one cannot match color $C$ by mixing $C_1, C_2, C_3$, but one can match color $C + |T_1| C_1$ with $T_2 C_2 + T_3 C_3$.
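The meaning of a negative tristimulus value can be made concrete with a small sketch. It assumes, purely for illustration, that each color is represented by a triple of numbers (for example, three cone responses), so that Eq. (1.1.3) becomes a 3x3 linear system. The primaries, the target color, and the function names `det3` and `tristimulus` are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def tristimulus(primaries, color):
    """Solve C = T1*C1 + T2*C2 + T3*C3 for (T1, T2, T3) by Cramer's rule."""
    # Columns of the system matrix are the three primaries.
    a = [[primaries[k][i] for k in range(3)] for i in range(3)]
    d = det3(a)
    if d == 0:
        raise ValueError("primaries must be linearly independent")
    result = []
    for k in range(3):
        ak = [row[:] for row in a]
        for i in range(3):
            ak[i][k] = color[i]  # replace column k by the target color
        result.append(det3(ak) / d)
    return result

# Hypothetical primaries and target color, each a triple of responses.
C1, C2, C3 = [1.0, 0.2, 0.0], [0.3, 1.0, 0.1], [0.0, 0.1, 1.0]
C = [0.1, 0.9, 0.3]
T = tristimulus([C1, C2, C3], C)

# T[0] is negative here, so C itself cannot be mixed from the primaries,
# but C + |T1| C1 equals T2 C2 + T3 C3.
lhs = [C[i] + abs(T[0]) * C1[i] for i in range(3)]
rhs = [T[1] * C2[i] + T[2] * C3[i] for i in range(3)]
```

For this particular choice the first value comes out negative while the other two are positive, and `lhs` and `rhs` agree up to rounding.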

In practice, the primary colors should be chosen so that most natural colors can be reproduced using positive combinations of primary colors. The most popular primary set for the illuminating light source contains red, green, and blue colors, known as the RGB primary. The most common primary set for the reflecting light source contains cyan, magenta, and yellow, known as the CMY primary. In fact, the RGB and CMY primary sets are complements of each other, in that mixing two colors in one set will produce one color in the other set. For example, mixing red with green will yield yellow. This complementary information is best illustrated by a color wheel, which can be found in many image processing books, e.g., [9, 4].

For a chosen primary set, one way to determine the tristimulus values of any color is by first determining the color matching functions, $m_i(\lambda)$, for primary colors $C_i$, $i = 1, 2, 3$. These functions describe the tristimulus values of a spectral color with wavelength $\lambda$, for various $\lambda$ in the entire visible band, and can be determined by visual experiments with controlled viewing conditions. Then the tristimulus values for any color with a spectrum $C(\lambda)$ can be obtained by [9]:

$$T_i = \int C(\lambda)\, m_i(\lambda)\, d\lambda, \quad i = 1, 2, 3. \tag{1.1.4}$$

To produce all visible colors with positive mixing, the matching functions associated with the primary colors must be positive.

The above theory forms the basis for color capture and display. To record the color of an incoming light, a camera needs to have three sensors that have frequency responses similar to the color matching functions of a chosen primary set. This can be accomplished by optical or electronic filters with the desired frequency responses. Similarly, to display a color picture, the display device needs to emit three optical beams of the chosen primary colors with appropriate intensities, as specified by the tristimulus values. In practice, electron beams that strike phosphors with the red, green, and blue colors are used. All present display systems use an RGB primary, although the standard spectra specified for the primary colors may be slightly different. Likewise, a color printer can produce different colors by mixing three dyes with the chosen primary colors in appropriate proportions. Most color printers use the CMY primary. For a more vivid and wide-range color rendition, some color printers use four primaries, by adding black (K) to the CMY set. This is known as the CMYK primary, which can render the black color more truthfully.

1.1.4 Color Specification by Tristimulus Values

Tristimulus Values: We have introduced the tristimulus representation of a color in Sec. 1.1.3, which specifies the proportions, i.e., the $T_k$'s in Eq. (1.1.3), of the three primary colors needed to create the desired color. In order to make the color [...] should be normalized so that $T_k = 1$, $k = 1, 2, 3$, for a reference white color (equal energy in all wavelengths) with a unit energy. When we use an RGB primary, the tristimulus values are usually denoted by $R$, $G$, and $B$.

Chromaticity Values: The above tristimulus representation mixes the luminance and chrominance attributes of a color. To measure only the chrominance information (i.e., the hue and saturation) of a light, the chromaticity coordinate is defined as:

$$t_k = \frac{T_k}{T_1 + T_2 + T_3}, \quad k = 1, 2, 3. \tag{1.1.5}$$

Since $t_1 + t_2 + t_3 = 1$, two chromaticity values are sufficient to specify the chrominance of a color.
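A minimal sketch of Eq. (1.1.5) follows. The function name `chromaticity` and the two sample colors are hypothetical; the point is only that scaling all three tristimulus values by the same factor leaves the chromaticity unchanged.

```python
def chromaticity(t1, t2, t3):
    """Return (t_1, t_2, t_3) with t_k = T_k / (T_1 + T_2 + T_3), Eq. (1.1.5)."""
    total = t1 + t2 + t3
    if total == 0:
        raise ValueError("chromaticity is undefined when T1 + T2 + T3 is zero")
    return (t1 / total, t2 / total, t3 / total)

# Two hypothetical colors that differ only in overall intensity.
dim = chromaticity(0.2, 0.1, 0.1)
bright = chromaticity(0.8, 0.4, 0.4)
```

Both calls give approximately (0.5, 0.25, 0.25), and because the three values always sum to one, storing any two of them is enough.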

Obviously, the color value of an imaged point depends on the primary colors used. To standardize color description and specification, several standard primary color systems have been specified. For example, the CIE,^2 an international body of color scientists, defined a CIE RGB primary system, which consists of colors at 700 ($R_0$), 546.1 ($G_0$), and 435.8 ($B_0$) nm.

Color Coordinate Conversion: One can convert the color values based on one set of primaries to the color values for another set of primaries. Conversion of the (R,G,B) coordinate to the (C,M,Y) coordinate is, for example, often required for printing color images stored in the (R,G,B) coordinate. Given the tristimulus representation of one primary set in terms of another primary, one can determine the conversion matrix between the two color coordinates. The principle of color conversion and the derivation of the conversion matrix between two sets of color primaries can be found in [9].

1.1.5 Color Specification by Luminance and Chrominance Attributes

The RGB primary commonly used for color display mixes the luminance and chrominance attributes of a light. In many applications, it is desirable to describe a color in terms of its luminance and chrominance content separately, to enable more efficient processing and transmission of color signals. Towards this goal, various three-component color coordinates have been developed, in which one component reflects the luminance and the other two collectively characterize hue and saturation. One such coordinate is the CIE XYZ primary, in which $Y$ directly measures the luminance intensity. The $(X, Y, Z)$ values in this coordinate are related to the $(R, G, B)$ values in the CIE RGB coordinate by [9]:

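$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 2.365 & -0.515 & 0.005 \\ -0.897 & 1.426 & -0.014 \\ -0.468 & 0.089 & 1.009 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}. \tag{1.1.6}$$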

^2 CIE stands for Commission Internationale de l'Eclairage (International Commission on Illumination).

In addition to separating the luminance and chrominance information, another advantage of the CIE XYZ system is that almost all visible colors can be specified with non-negative tristimulus values, which is a very desirable feature. The problem is that the X, Y, Z colors so defined are not realizable by actual color stimuli. As such, the XYZ primary is not directly used for color production; rather, it is mainly introduced for defining other primaries and for numerical specification of color. As will be seen later, the color coordinates used for transmission of color TV signals, such as YIQ and YUV, are all derived from the XYZ coordinate.

There are other color representations in which the hue and saturation of a color are explicitly specified, in addition to the luminance. One example is the HSI coordinate, where H stands for hue, S for saturation, and I for intensity (equivalent to luminance). Although this color coordinate clearly separates different attributes of a light, it is nonlinearly related to the tristimulus values and is difficult to compute. The book by Gonzalez has a comprehensive coverage of various color coordinates and their conversions [4].

1.2 Video Capture and Display

1.2.1 Principle of Color Video Imaging

Having explained what light is and how it is perceived and characterized, we are now in a position to understand the meaning of a video signal. In short, a video records the emitted and/or reflected light intensity, i.e., $C(\mathbf{X}, t, \lambda)$, from the objects in the scene that is observed by a viewing system (a human eye or a camera). In general, this intensity changes both in time and space. Here, we assume that there are some illuminating light sources in the scene. Otherwise, there will be no emitted or reflected light and the image will be totally dark. When observed by a camera, only those wavelengths to which the camera is sensitive are visible. Let the spectral absorption function of the camera be denoted by $a_c(\lambda)$; then the light intensity distribution in the 3D world that is "visible" to the camera is:

$$\bar{\psi}(\mathbf{X}, t) = \int_0^\infty C(\mathbf{X}, t, \lambda)\, a_c(\lambda)\, d\lambda. \tag{1.2.1}$$

The image function captured by the camera at any time $t$ is the projection of the light distribution in the 3D scene onto a 2D image plane. Let $P(\cdot)$ represent the camera projection operator, so that the projected 2D position of the 3D point $\mathbf{X}$ is given by $\mathbf{x} = P(\mathbf{X})$. Furthermore, let $P^{-1}(\cdot)$ denote the inverse projection operator, so that $\mathbf{X} = P^{-1}(\mathbf{x})$ specifies the 3D position associated with a 2D point $\mathbf{x}$. Then the projected image is related to the 3D image by

$$\psi(P(\mathbf{X}), t) = \bar{\psi}(\mathbf{X}, t) \quad \text{or} \quad \psi(\mathbf{x}, t) = \bar{\psi}\left(P^{-1}(\mathbf{x}), t\right). \tag{1.2.2}$$

The function $\psi(\mathbf{x}, t)$ is what is known as a video signal. We can see that it describes the radiant intensity at the 3D position $\mathbf{X}$ that is projected onto $\mathbf{x}$ in the image plane at time $t$. In general, the video signal has a finite spatial and temporal range. The spatial range depends on the camera viewing area, while the temporal range depends on the duration in which the video is captured. A point in the image plane is called a pixel (meaning picture element) or simply pel. For most camera systems, the projection operator $P(\cdot)$ can be approximated by a perspective projection. This is discussed in more detail in Sec. 5.1.

If the camera absorption function is the same as the relative luminous efficiency function of the human being, i.e., $a_c(\lambda) = a_y(\lambda)$, then a luminance image is formed. If the absorption function is non-zero over a narrow band, then a monochrome (or monotone) image is formed. To perceive all visible colors, according to the trichromatic color vision theory (see Sec. 1.1.2), three sensors are needed, each with a frequency response similar to the color matching function for a selected primary color. As described before, most color cameras use red, green, and blue sensors for color acquisition.

If the camera has only one luminance sensor, $\psi(\mathbf{x}, t)$ is a scalar function that represents the luminance of the projected light. In this book, we use the word gray-scale to refer to such a video. The term black-and-white will be used strictly to describe an image that has only two colors: black and white. On the other hand, if the camera has three separate sensors, each tuned to a chosen primary color, the signal is a vector function that contains three color values at every point. Instead of specifying these color values directly, one can use other color coordinates (each consisting of three values) to characterize light, as explained in the previous section.

Note that for special purposes, one may use sensors that work in a frequency range that is invisible to the human being. For example, in X-ray imaging, the sensor is sensitive to the spectral range of the X-ray. On the other hand, an infra-red camera is sensitive to the infra-red range and can function under very low ambient light. These cameras can "see" things that cannot be perceived by the human eye. Yet another example is the range camera, in which the sensor emits a laser beam and measures the time it takes for the beam to reach an object and then be reflected back to the sensor. Because the round trip time is proportional to the distance between the sensor and the object surface, the image intensity at any point in a range image describes the distance or range of its corresponding 3D point from the camera.
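The proportionality between round trip time and range can be sketched in a few lines. This is only an illustration, assuming the beam travels at the vacuum speed of light and that the sensor reports one round trip time per pixel; the function names are hypothetical and not taken from the text.

```python
# Assumption: the laser beam propagates at the vacuum speed of light (m/s).
SPEED_OF_LIGHT = 299_792_458.0


def range_from_round_trip(round_trip_seconds):
    """Range to the surface: the beam covers the distance twice, so halve the path."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0


def range_image(round_trip_times):
    """Convert a 2D list of per-pixel round trip times (s) into ranges (m)."""
    return [[range_from_round_trip(t) for t in row] for row in round_trip_times]
```

For instance, a round trip time of 20 ns corresponds to a surface roughly 3 m from the sensor.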

1.2.2 Video Cameras

All the analog cameras of today capture a video in a frame by frame manner with a certain time spacing between the frames. Some cameras (e.g. TV cameras and consumer video camcorders) acquire a frame by scanning consecutive lines with a certain line spacing. Similarly, all the display devices present a video as a consecutive set of frames, and with TV monitors, the scan lines are played back sequentially as separate lines. Such capture and display mechanisms are designed to take advantage of the fact that the HVS cannot perceive very high frequency changes in time and space. This property of the HVS will be discussed more extensively in Sec. 2.4.

There are basically two types of video imagers: (1) tube-based imagers such as vidicons, plumbicons, or orthicons, and (2) solid-state sensors such as charge-coupled devices (CCD). The lens of a camera focuses the image of a scene onto a photosensitive surface of the imager of the camera, which converts optical signals into electrical signals. The photosensitive surface of the tube imager is typically scanned line by line (known as raster scan) with an electron beam or other electronic methods, and the scanned lines in each frame are then converted into an electrical signal representing variations of light intensity as variations in voltage. Different lines are therefore captured at slightly different times in a continuous manner. With progressive scan, the electron beam scans every line continuously, while with interlaced scan, the beam scans every other line in one half of the frame time (a field) and then scans the other half of the lines. We will discuss raster scan in more detail in Sec. 1.3. With a CCD camera, the photosensitive surface is comprised of a 2D array of sensors, each corresponding to one pixel, and the optical signal reaching each sensor is converted to an electronic signal. The sensor values captured in each frame time are first stored in a buffer and then read out sequentially, one line at a time, to form a raster signal. Unlike the tube-based cameras, all the read-out values in the same frame are captured at the same time. With an interlaced scan camera, alternate lines are read out in each field.
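The interlaced read-out described above, in which alternate lines go to alternate fields, can be sketched as follows. This is a minimal illustration in which a frame is simply a list of scan lines; the choice that the field starting with line 0 comes first is an assumption, and the function names are hypothetical.

```python
def split_into_fields(frame):
    """Split a frame (a list of scan lines) into two fields of alternating lines."""
    first_field = frame[0::2]   # lines 0, 2, 4, ... (assumed to be read out first)
    second_field = frame[1::2]  # lines 1, 3, 5, ...
    return first_field, second_field


def weave_fields(first_field, second_field):
    """Re-interleave two fields into a full frame, line by line."""
    frame = []
    for i in range(len(first_field) + len(second_field)):
        source = first_field if i % 2 == 0 else second_field
        frame.append(source[i // 2])
    return frame
```

With a tube imager the two fields are scanned at different times, so weaving them back together reproduces the original frame only when the scene is static; with the CCD read-out described above, both fields come from the same captured frame.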

To capture color, there are usually three types of photosensitive surfaces or CCD sensors, each with a frequency response that is determined by the color matching function of the chosen primary color, as described previously in Sec. 1.1.3. To reduce the cost, most consumer cameras use a single CCD chip for color imaging. This is accomplished by dividing the sensor area for each pixel into three or four sub-areas, each sensitive to a different primary color. The three captured color signals can be either converted to one luminance signal and two chrominance signals and sent out as a component color video, or multiplexed into a composite signal. This subject is explained further in Sec. 1.2.4.
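The conversion from three primary color signals to one luminance and two chrominance signals is a fixed linear transform. The sketch below is only an illustration: this section does not specify the coefficients, so the commonly used ITU-R BT.601 luminance weights and the usual YUV scaling of the color differences are assumed here, with RGB values normalized to [0, 1].

```python
# Assumed coefficients (not given in this section): BT.601 luminance weights
# and the conventional YUV scale factors for the two color differences.
LUMA_R, LUMA_G, LUMA_B = 0.299, 0.587, 0.114
U_SCALE, V_SCALE = 0.492, 0.877


def rgb_to_yuv(r, g, b):
    """Convert normalized RGB to one luminance (Y) and two chrominance (U, V) values."""
    y = LUMA_R * r + LUMA_G * g + LUMA_B * b
    u = U_SCALE * (b - y)  # scaled blue color difference
    v = V_SCALE * (r - y)  # scaled red color difference
    return y, u, v
```

Any gray input (r = g = b) yields zero chrominance, which is why a gray-scale display can use the luminance signal alone.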

Many cameras of today are CCD-based because they can be made much smaller and lighter than the tube-based cameras, to acquire the same spatial resolution. Advancement in CCD technology has made it possible to capture a very high resolution image array in a very small chip size. For example, 1/3-in CCDs with 380K pixels are commonly found in consumer-use camcorders, whereas a 2/3-in CCD with 2 million pixels has been developed for HDTV. The tube-based cameras are more bulky and costly, and are only used in special applications, such as those requiring very high resolution or high sensitivity under low ambient light. In addition to the circuitry for color imaging, most cameras also implement color coordinate conversion (from RGB to luminance and chrominance) and compositing of luminance and chrominance signals. For digital output, analog-to-digital (A/D) conversion is also incorporated. Figure 1.2 shows the typical processing involved in a professional video camera. [...] image quality, digital processing is introduced within the camera. For an excellent exposition of the video camera and display technologies, see [6].

Figure 1.2. Schematic Block Diagram of a Professional Color Video Camera. From [6, Fig. 7(a)].

1.2.3 Video Display

To display a video, the most common device is the cathode ray tube (CRT). With a CRT monitor, an electron gun emits an electron beam across the screen line by line, exciting phosphors with intensities proportional to the intensity of the video signal at corresponding locations. To display a color image, three beams are emitted by three separate guns, exciting red, green, and blue phosphors with the desired intensity combination at each location. To be more precise, each color pixel consists of three elements arranged in a small triangle, known as a triad.

The CRT can produce an image having a very large dynamic range so that the displayed image can be very bright, sufficient for viewing during daylight or from a distance. However, the thickness of a CRT needs to be about the same as the width of the screen, for the electrons to reach the side of the screen. A large screen monitor is thus too bulky, unsuitable for applications requiring thin and portable devices. To circumvent this problem, various flat panel displays have been developed. One popular device is the Liquid Crystal Display (LCD). The principal idea behind the LCD is to change the optical properties and consequently the brightness/color of the liquid crystal by an applied electric field. The electric field can be generated/adapted by either an array of transistors, such as in LCDs using active matrix thin-film transistors (TFT), or by using plasma. The plasma technology eliminates the need for TFT and makes large-screen LCDs possible. There are also new designs for flat CRTs. A more comprehensive description of video display technologies can be found in [6].


frame instant is completely recorded on the film. For display, consecutive recorded frames are played back using an analog optical projection system.

1.2.4 Composite vs. Component Video

Ideally, a color video should be specified by three functions or signals, each describing one color component, in either a tristimulus color representation, or a luminance-chrominance representation. A video in this format is known as component video. Mainly for historical reasons, various composite video formats also exist, wherein the three color signals are multiplexed into a single signal. These composite formats were invented when the color TV system was first developed and there was a need to transmit the color TV signal in a way so that a black-and-white TV set can extract from it the luminance component. The construction of a composite signal relies on the property that the chrominance signals have a significantly smaller bandwidth than the luminance component. By modulating each chrominance component to a frequency that is at the high end of the luminance component, and adding the resulting modulated chrominance signals and the original luminance signal together, one creates a composite signal that contains both luminance and chrominance information. To display a composite video signal on a color monitor, a filter is used to separate the modulated chrominance signals and the luminance signal. The resulting luminance and chrominance components are then converted to red, green, and blue color components. With a gray-scale monitor, the luminance signal alone is extracted and displayed directly.
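The construction of a composite signal by modulation and addition can be illustrated with a toy example. The sketch below assumes that the two chrominance components are placed in quadrature on a single subcarrier, which is one common way to modulate both to the same frequency but is not stated in this section. The separation step replaces the filter mentioned above with an average over whole subcarrier cycles, which is valid only for a patch where luminance and chrominance are constant. All function and parameter names are hypothetical.

```python
import math


def composite_samples(luma, chroma_u, chroma_v, subcarrier_hz, sample_rate_hz):
    """Add quadrature-modulated chrominance to the luminance, sample by sample."""
    out = []
    for n, (y, u, v) in enumerate(zip(luma, chroma_u, chroma_v)):
        phase = 2.0 * math.pi * subcarrier_hz * n / sample_rate_hz
        out.append(y + u * math.cos(phase) + v * math.sin(phase))
    return out


def separate_constant_patch(composite, subcarrier_hz, sample_rate_hz):
    """Recover (Y, U, V) from a patch where all three are constant.

    Averaging over an integer number of subcarrier cycles acts as a crude
    low-pass filter: it removes the carrier terms and keeps the constants.
    """
    count = len(composite)
    y = sum(composite) / count
    u = 0.0
    v = 0.0
    for n, s in enumerate(composite):
        phase = 2.0 * math.pi * subcarrier_hz * n / sample_rate_hz
        u += 2.0 * s * math.cos(phase) / count
        v += 2.0 * s * math.sin(phase) / count
    return y, u, v
```

In a real signal the luminance also has energy near the subcarrier frequency, so the separation is imperfect; that leakage is the source of the cross-talk between luminance and chrominance in composite video.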

All present analog TV systems transmit color TV signals in a composite format. The composite format is also used for video storage on some analog tapes (such as the VHS tape). In addition to being compatible with a gray-scale signal, the composite format eliminates the need for synchronizing different color components when processing a color video. A composite signal also has a bandwidth that is significantly lower than the sum of the bandwidths of the three component signals, and therefore can be transmitted or stored more efficiently. These benefits are, however, achieved at the expense of video quality: there often exist noticeable artifacts caused by cross-talk between color and luminance components.

As a compromise between the data rate and video quality, S-video was invented, which consists of two components: the luminance component and a single chrominance component which is the multiplex of the two original chrominance signals. Many advanced consumer-level video cameras and displays enable recording/display of video in S-video format. Component format is used only in professional video equipment.

1.2.5 Gamma Correction