Errata for
VIDEO PROCESSING AND COMMUNICATIONS
Yao Wang, Joern Ostermann, and Ya-Qin Zhang
(©2002 by Prentice-Hall, ISBN 0-13-017547-1)
Updated 6/12/2002
Symbols Used
Ti = i-th line from top; Bi = i-th line from bottom; Fi = Figure i; TAi = Table i;
Pi = Problem i; E(i) = Equation (i); X -> Y = replace X with Y
Page | Line/Fig/Tab | Corrections
16 | F1.5 | Add an output from the demultiplexing box to a microphone at the bottom of the figure.
48 | B6, E(2.4.4)-E(2.4.6) | Replace “v_x”, “v_y” by “\tilde v_x”, “\tilde v_y”
119 | E(5.2.7) | C(X) -> C(X,t), r(X) -> r(X,t), E(N) -> E(N,t)
125 | F5.11 | Caption: “cameras” -> “a camera”, “diffuse” -> “ambient”
126 | T7 | “diffuse illumination” -> “ambient illumination”
133 | B10 | T_x,T_y,T_z -> T_x,T_y,T_z, and Z
133 | B4 | Delete “when there is no translational motion in the Z direction, or”
133 | B2 | “aX+bY+cZ=1” -> “Z=aX+bY+c”
133 | Before E(5.5.13) | Add “(see Problem 5.3)” after “before and after the motion”
138 | P5.3 | “a planar patch” -> “any 3-D object”, “projective mapping” -> “Equation (5.5.13)”
138 | P5.4 | “Equation 5.5.14” -> “Equation (5.5.14)”, “aX+bY+cZ=1” -> “Z=aX+bY+c”
143 | T4 | After “true 2-D motion.” add “Optical flow depends on not only 2-D motion, but also illumination and object surface texture.”
159 | T6 | After “block size is 16x16” add “, and the search range is 16x16”
189 | P6.1 | “global” -> “global-based”
190 | P6.12 | Add at the end “Choose two frames that have sufficient motion in between, so that it is easier to observe the effect of motion estimation inaccuracy. If necessary, choose frames that are not immediate neighbors.”
199 | T9 | “Equation (7.1.11) defines a linear dependency … straight line.” -> “Equation (7.1.11) says that the possible positions x’ of a point x after motion lie on a straight line. The actual position depends on the Z-coordinate of the original 3-D point.”
200 | B8 | “[A]” -> “[A]^T [A]”
214 | P7.5 | “Derive” -> “Equation (7.1.5) describes”; add at the end “(assuming F=1)”
214 | P7.6 | Replace “\delta” with “\bf \delta”
218 | F8.1 | “Parameter statistics” -> “Model parameter statistics”
247 | F8.9 | Add a box with words “Update previous distortion \\ D_0=D_1” in the line with the word “No”.
255 | F8.14 | Same as for F8.9
261 | P8.13(a) | “B_l={f_k, k=1,2,… ,K_l}” -> “B_l, which consists of K_l vectors in {\cal F}”
416 | TA13.2 | Item “4CIF/H.263” should be “Opt.”
421 | TA13.3 | Item “Video/Non-QoS LAN” should be “H.261/3”
436 | T13 | “MPEG-2, defined” -> “MPEG-2 defined”
443 | T10 | “I-VOP” -> “I-VOPs”, “B-VOP” -> “B-VOPs”
575 | P1.3 | “red+green=blue” -> “red+green=black”
575 | P1.4 | “(1.4.4)” -> “(1.4.3)”, “(1.4.2)” -> “(1.4.1)”
wang-50214
wang˙fm
August 23, 2001
14:22
Contents
PREFACE  xxi
GLOSSARY OF NOTATIONS  xxv
1 VIDEO FORMATION, PERCEPTION, AND REPRESENTATION  1
  1.1 Color Perception and Specification  2
    1.1.1 Light and Color, 2
    1.1.2 Human Perception of Color, 3
    1.1.3 The Trichromatic Theory of Color Mixture, 4
    1.1.4 Color Specification by Tristimulus Values, 5
    1.1.5 Color Specification by Luminance and Chrominance Attributes, 6
  1.2 Video Capture and Display  7
    1.2.1 Principles of Color Video Imaging, 7
    1.2.2 Video Cameras, 8
    1.2.3 Video Display, 10
    1.2.4 Composite versus Component Video, 11
    1.2.5 Gamma Correction, 11
  1.3 Analog Video Raster  12
    1.3.1 Progressive and Interlaced Scan, 12
    1.3.2 Characterization of a Video Raster, 14
  1.4 Analog Color Television Systems  16
    1.4.1 Spatial and Temporal Resolution, 16
    1.4.2 Color Coordinate, 17
    1.4.3 Signal Bandwidth, 19
    1.4.4 Multiplexing of Luminance, Chrominance, and Audio, 19
    1.4.5 Analog Video Recording, 21
  1.5 Digital Video  22
    1.5.1 Notation, 22
    1.5.2 ITU-R BT.601 Digital Video, 23
    1.5.3 Other Digital Video Formats and Applications, 26
    1.5.4 Digital Video Recording, 28
    1.5.5 Video Quality Measure, 28
  1.6 Summary  30
  1.7 Problems  31
  1.8 Bibliography  32
2 FOURIER ANALYSIS OF VIDEO SIGNALS AND FREQUENCY RESPONSE OF THE HUMAN VISUAL SYSTEM  33
  2.1 Multidimensional Continuous-Space Signals and Systems  33
  2.2 Multidimensional Discrete-Space Signals and Systems  36
  2.3 Frequency Domain Characterization of Video Signals  38
    2.3.1 Spatial and Temporal Frequencies, 38
    2.3.2 Temporal Frequencies Caused by Linear Motion, 40
  2.4 Frequency Response of the Human Visual System  42
    2.4.1 Temporal Frequency Response and Flicker Perception, 43
    2.4.2 Spatial Frequency Response, 45
    2.4.3 Spatiotemporal Frequency Response, 46
    2.4.4 Smooth Pursuit Eye Movement, 48
  2.5 Summary  50
  2.6 Problems  51
  2.7 Bibliography  52
3 VIDEO SAMPLING  53
  3.1 Basics of the Lattice Theory  54
  3.2 Sampling over Lattices  59
    3.2.1 Sampling Process and Sampled-Space Fourier Transform, 60
    3.2.2 The Generalized Nyquist Sampling Theorem, 61
    3.2.4 Implementation of the Prefilter and Reconstruction Filter, 65
    3.2.5 Relation between Fourier Transforms over Continuous, Discrete, and Sampled Spaces, 66
  3.3 Sampling of Video Signals  67
    3.3.1 Required Sampling Rates, 67
    3.3.2 Sampling Video in Two Dimensions: Progressive versus Interlaced Scans, 69
    3.3.3 Sampling a Raster Scan: BT.601 Format Revisited, 71
    3.3.4 Sampling Video in Three Dimensions, 72
    3.3.5 Spatial and Temporal Aliasing, 73
  3.4 Filtering Operations in Cameras and Display Devices  76
    3.4.1 Camera Apertures, 76
    3.4.2 Display Apertures, 79
  3.5 Summary  80
  3.6 Problems  80
  3.7 Bibliography  83
4 VIDEO SAMPLING RATE CONVERSION  84
  4.1 Conversion of Signals Sampled on Different Lattices  84
    4.1.1 Up-Conversion, 85
    4.1.2 Down-Conversion, 87
    4.1.3 Conversion between Arbitrary Lattices, 89
    4.1.4 Filter Implementation and Design, and Other Interpolation Approaches, 91
  4.2 Sampling Rate Conversion of Video Signals  92
    4.2.1 Deinterlacing, 93
    4.2.2 Conversion between PAL and NTSC Signals, 98
    4.2.3 Motion-Adaptive Interpolation, 104
  4.3 Summary  105
  4.4 Problems  106
  4.5 Bibliography  109
5 VIDEO MODELING  111
  5.1 Camera Model  112
    5.1.1 Pinhole Model, 112
    5.1.2 CAHV Model, 114
    5.1.3 Camera Motions, 116
  5.2 Illumination Model  116
    5.2.2 Radiance Distribution under Differing Illumination and Reflection Conditions, 117
    5.2.3 Changes in the Image Function Due to Object Motion, 119
  5.3 Object Model  120
    5.3.1 Shape Model, 121
    5.3.2 Motion Model, 122
  5.4 Scene Model  125
  5.5 Two-Dimensional Motion Models  128
    5.5.1 Definition and Notation, 128
    5.5.2 Two-Dimensional Motion Models Corresponding to Typical Camera Motions, 130
    5.5.3 Two-Dimensional Motion Corresponding to Three-Dimensional Rigid Motion, 133
    5.5.4 Approximations of Projective Mapping, 136
  5.6 Summary  137
  5.7 Problems  138
  5.8 Bibliography  139
6 TWO-DIMENSIONAL MOTION ESTIMATION  141
  6.1 Optical Flow  142
    6.1.1 Two-Dimensional Motion versus Optical Flow, 142
    6.1.2 Optical Flow Equation and Ambiguity in Motion Estimation, 143
  6.2 General Methodologies  145
    6.2.1 Motion Representation, 146
    6.2.2 Motion Estimation Criteria, 147
    6.2.3 Optimization Methods, 151
  6.3 Pixel-Based Motion Estimation  152
    6.3.1 Regularization Using the Motion Smoothness Constraint, 153
    6.3.2 Using a Multipoint Neighborhood, 153
    6.3.3 Pel-Recursive Methods, 154
  6.4 Block-Matching Algorithm  154
    6.4.1 The Exhaustive Block-Matching Algorithm, 155
    6.4.2 Fractional Accuracy Search, 157
    6.4.3 Fast Algorithms, 159
    6.4.4 Imposing Motion Smoothness Constraints, 161
    6.4.5 Phase Correlation Method, 162
    6.4.6 Binary Feature Matching, 163
  6.5 Deformable Block-Matching Algorithms  165
    6.5.1 Node-Based Motion Representation, 166
  6.6 Mesh-Based Motion Estimation  169
    6.6.1 Mesh-Based Motion Representation, 171
    6.6.2 Motion Estimation Using the Mesh-Based Model, 173
  6.7 Global Motion Estimation  177
    6.7.1 Robust Estimators, 177
    6.7.2 Direct Estimation, 178
    6.7.3 Indirect Estimation, 178
  6.8 Region-Based Motion Estimation  179
    6.8.1 Motion-Based Region Segmentation, 180
    6.8.2 Joint Region Segmentation and Motion Estimation, 181
  6.9 Multiresolution Motion Estimation  182
    6.9.1 General Formulation, 182
    6.9.2 Hierarchical Block Matching Algorithm, 184
  6.10 Application of Motion Estimation in Video Coding  187
  6.11 Summary  188
  6.12 Problems  189
  6.13 Bibliography  191
7 THREE-DIMENSIONAL MOTION ESTIMATION  194
  7.1 Feature-Based Motion Estimation  195
    7.1.1 Objects of Known Shape under Orthographic Projection, 195
    7.1.2 Objects of Known Shape under Perspective Projection, 196
    7.1.3 Planar Objects, 197
    7.1.4 Objects of Unknown Shape Using the Epipolar Line, 198
  7.2 Direct Motion Estimation  203
    7.2.1 Image Signal Models and Motion, 204
    7.2.2 Objects of Known Shape, 206
    7.2.3 Planar Objects, 207
    7.2.4 Robust Estimation, 209
  7.3 Iterative Motion Estimation  212
  7.4 Summary  213
  7.5 Problems  214
  7.6 Bibliography  215
8 FOUNDATIONS OF VIDEO CODING  217
  8.1 Overview of Coding Systems  218
    8.1.1 General Framework, 218
  8.2 Basic Notions in Probability and Information Theory  221
    8.2.1 Characterization of Stationary Sources, 221
    8.2.2 Entropy and Mutual Information for Discrete Sources, 222
    8.2.3 Entropy and Mutual Information for Continuous Sources, 226
  8.3 Information Theory for Source Coding  227
    8.3.1 Bound for Lossless Coding, 227
    8.3.2 Bound for Lossy Coding, 229
    8.3.3 Rate-Distortion Bounds for Gaussian Sources, 232
  8.4 Binary Encoding  234
    8.4.1 Huffman Coding, 235
    8.4.2 Arithmetic Coding, 238
  8.5 Scalar Quantization  241
    8.5.1 Fundamentals, 241
    8.5.2 Uniform Quantization, 243
    8.5.3 Optimal Scalar Quantizer, 244
  8.6 Vector Quantization  248
    8.6.1 Fundamentals, 248
    8.6.2 Lattice Vector Quantizer, 251
    8.6.3 Optimal Vector Quantizer, 253
    8.6.4 Entropy-Constrained Optimal Quantizer Design, 255
  8.7 Summary  257
  8.8 Problems  259
  8.9 Bibliography  261
9 WAVEFORM-BASED VIDEO CODING  263
  9.1 Block-Based Transform Coding  263
    9.1.1 Overview, 264
    9.1.2 One-Dimensional Unitary Transform, 266
    9.1.3 Two-Dimensional Unitary Transform, 269
    9.1.4 The Discrete Cosine Transform, 271
    9.1.5 Bit Allocation and Transform Coding Gain, 273
    9.1.6 Optimal Transform Design and the KLT, 279
    9.1.7 DCT-Based Image Coders and the JPEG Standard, 281
    9.1.8 Vector Transform Coding, 284
  9.2 Predictive Coding  285
    9.2.1 Overview, 285
    9.2.2 Optimal Predictor Design and Predictive Coding Gain, 286
    9.2.3 Spatial-Domain Linear Prediction, 290
  9.3 Video Coding Using Temporal Prediction and Transform Coding  293
    9.3.1 Block-Based Hybrid Video Coding, 293
    9.3.2 Overlapped Block Motion Compensation, 296
    9.3.3 Coding Parameter Selection, 299
    9.3.4 Rate Control, 302
    9.3.5 Loop Filtering, 305
  9.4 Summary  308
  9.5 Problems  309
  9.6 Bibliography  311
10 CONTENT-DEPENDENT VIDEO CODING  314
  10.1 Two-Dimensional Shape Coding  314
    10.1.1 Bitmap Coding, 315
    10.1.2 Contour Coding, 318
    10.1.3 Evaluation Criteria for Shape Coding Efficiency, 323
  10.2 Texture Coding for Arbitrarily Shaped Regions  324
    10.2.1 Texture Extrapolation, 324
    10.2.2 Direct Texture Coding, 325
  10.3 Joint Shape and Texture Coding  326
  10.4 Region-Based Video Coding  327
  10.5 Object-Based Video Coding  328
    10.5.1 Source Model F2D, 330
    10.5.2 Source Models R3D and F3D, 332
  10.6 Knowledge-Based Video Coding  336
  10.7 Semantic Video Coding  338
  10.8 Layered Coding System  339
  10.9 Summary  342
  10.10 Problems  343
  10.11 Bibliography  344
11 SCALABLE VIDEO CODING  349
  11.1 Basic Modes of Scalability  350
    11.1.1 Quality Scalability, 350
    11.1.2 Spatial Scalability, 353
    11.1.3 Temporal Scalability, 356
    11.1.4 Frequency Scalability, 356
    11.1.5 Combination of Basic Schemes, 357
    11.1.6 Fine-Granularity Scalability, 357
  11.2 Object-Based Scalability  359
  11.3 Wavelet-Transform-Based Coding  361
    11.3.1 Wavelet Coding of Still Images, 363
    11.3.2 Wavelet Coding of Video, 367
  11.4 Summary  370
  11.5 Problems  370
  11.6 Bibliography  371
12 STEREO AND MULTIVIEW SEQUENCE PROCESSING  374
  12.1 Depth Perception  375
    12.1.1 Binocular Cues—Stereopsis, 375
    12.1.2 Visual Sensitivity Thresholds for Depth Perception, 375
  12.2 Stereo Imaging Principle  377
    12.2.1 Arbitrary Camera Configuration, 377
    12.2.2 Parallel Camera Configuration, 379
    12.2.3 Converging Camera Configuration, 381
    12.2.4 Epipolar Geometry, 383
  12.3 Disparity Estimation  385
    12.3.1 Constraints on Disparity Distribution, 386
    12.3.2 Models for the Disparity Function, 387
    12.3.3 Block-Based Approach, 388
    12.3.4 Two-Dimensional Mesh-Based Approach, 388
    12.3.5 Intra-Line Edge Matching Using Dynamic Programming, 391
    12.3.6 Joint Structure and Motion Estimation, 392
  12.4 Intermediate View Synthesis  393
  12.5 Stereo Sequence Coding  396
    12.5.1 Block-Based Coding and MPEG-2 Multiview Profile, 396
    12.5.2 Incomplete Three-Dimensional Representation of Multiview Sequences, 398
    12.5.3 Mixed-Resolution Coding, 398
    12.5.4 Three-Dimensional Object-Based Coding, 399
    12.5.5 Three-Dimensional Model-Based Coding, 400
  12.6 Summary  400
  12.7 Problems  402
13 VIDEO COMPRESSION STANDARDS  405
  13.1 Standardization  406
    13.1.1 Standards Organizations, 406
    13.1.2 Requirements for a Successful Standard, 409
    13.1.3 Standard Development Process, 411
    13.1.4 Applications for Modern Video Coding Standards, 412
  13.2 Video Telephony with H.261 and H.263  413
    13.2.1 H.261 Overview, 413
    13.2.2 H.263 Highlights, 416
    13.2.3 Comparison, 420
  13.3 Standards for Visual Communication Systems  421
    13.3.1 H.323 Multimedia Terminals, 421
    13.3.2 H.324 Multimedia Terminals, 422
  13.4 Consumer Video Communications with MPEG-1  423
    13.4.1 Overview, 423
    13.4.2 MPEG-1 Video, 424
  13.5 Digital TV with MPEG-2  426
    13.5.1 Systems, 426
    13.5.2 Audio, 426
    13.5.3 Video, 427
    13.5.4 Profiles, 435
  13.6 Coding of Audiovisual Objects with MPEG-4  437
    13.6.1 Systems, 437
    13.6.2 Audio, 441
    13.6.3 Basic Video Coding, 442
    13.6.4 Object-Based Video Coding, 445
    13.6.5 Still Texture Coding, 447
    13.6.6 Mesh Animation, 447
    13.6.7 Face and Body Animation, 448
    13.6.8 Profiles, 451
    13.6.9 Evaluation of Subjective Video Quality, 454
  13.7 Video Bit Stream Syntax  454
  13.8 Multimedia Content Description Using MPEG-7  458
    13.8.1 Overview, 458
    13.8.2 Multimedia Description Schemes, 459
    13.8.3 Visual Descriptors and Description Schemes, 461
  13.9 Summary  465
  13.10 Problems  466
  13.11 Bibliography  467
14 ERROR CONTROL IN VIDEO COMMUNICATIONS  472
  14.1 Motivation and Overview of Approaches  473
  14.2 Typical Video Applications and Communication Networks  476
    14.2.1 Categorization of Video Applications, 476
    14.2.2 Communication Networks, 479
  14.3 Transport-Level Error Control  485
    14.3.1 Forward Error Correction, 485
    14.3.2 Error-Resilient Packetization and Multiplexing, 486
    14.3.3 Delay-Constrained Retransmission, 487
    14.3.4 Unequal Error Protection, 488
  14.4 Error-Resilient Encoding  489
    14.4.1 Error Isolation, 489
    14.4.2 Robust Binary Encoding, 490
    14.4.3 Error-Resilient Prediction, 492
    14.4.4 Layered Coding with Unequal Error Protection, 493
    14.4.5 Multiple-Description Coding, 494
    14.4.6 Joint Source and Channel Coding, 498
  14.5 Decoder Error Concealment  498
    14.5.1 Recovery of Texture Information, 500
    14.5.2 Recovery of Coding Modes and Motion Vectors, 501
    14.5.3 Syntax-Based Repair, 502
  14.6 Encoder–Decoder Interactive Error Control  502
    14.6.1 Coding-Parameter Adaptation Based on Channel Conditions, 503
    14.6.2 Reference Picture Selection Based on Feedback Information, 503
    14.6.3 Error Tracking Based on Feedback Information, 504
    14.6.4 Retransmission without Waiting, 504
  14.7 Error-Resilience Tools in H.263 and MPEG-4  505
    14.7.1 Error-Resilience Tools in H.263, 505
    14.7.2 Error-Resilience Tools in MPEG-4, 508
  14.8 Summary  509
  14.9 Problems  511
  14.10 Bibliography  513
15 STREAMING VIDEO OVER THE INTERNET AND WIRELESS IP NETWORKS  519
  15.1 Architecture for Video Streaming Systems  520
  15.2 Video Compression  522
  15.3 Application-Layer QoS Control for Streaming Video  522
    15.3.1 Congestion Control, 522
    15.3.2 Error Control, 525
  15.4 Continuous Media Distribution Services  529
    15.4.1 Network Filtering, 529
    15.4.2 Application-Level Multicast, 531
    15.4.3 Content Replication, 532
  15.5 Streaming Servers  533
    15.5.1 Real-Time Operating System, 534
    15.5.2 Storage System, 537
  15.6 Media Synchronization  539
  15.7 Protocols for Streaming Video  542
    15.7.1 Transport Protocols, 543
    15.7.2 Session Control Protocol: RTSP, 545
  15.8 Streaming Video over Wireless IP Networks  546
    15.8.1 Network-Aware Applications, 548
    15.8.2 Adaptive Service, 549
  15.9 Summary  554
  15.10 Bibliography  555
APPENDIX A: DETERMINATION OF SPATIAL–TEMPORAL GRADIENTS  562
  A.1 First- and Second-Order Gradient  562
  A.2 Sobel Operator  563
  A.3 Difference of Gaussian Filters  563
APPENDIX B: GRADIENT DESCENT METHODS  565
  B.1 First-Order Gradient Descent Method  565
  B.2 Steepest Descent Method  566
  B.3 Newton’s Method  566
  B.4 Newton-Raphson Method  567
  B.5 Bibliography  567
APPENDIX C: GLOSSARY OF ACRONYMS  568
Preface
In the past decade or so, there have been fascinating developments in multimedia
representation and communications. First of all, it has become very clear that all aspects
of media are “going digital”; from representation to transmission, from processing to
retrieval, from studio to home. Second, there have been significant advances in digital
multimedia compression and communication algorithms, which make it possible to
deliver high-quality video at relatively low bit rates in today’s networks. Third, the
advancement in VLSI technologies has enabled sophisticated software to be
implemented in a cost-effective manner. Last but not least, the establishment of half a dozen
international standards by ISO/MPEG and ITU-T laid the common groundwork for
different vendors and content providers.
At the same time, the explosive growth in wireless and networking technology
has profoundly changed the global communications infrastructure. It is the confluence
of wireless, multimedia, and networking that will fundamentally change the way people
conduct business and communicate with each other. The future computing and
communications infrastructure will be empowered by virtually unlimited bandwidth, full
connectivity, high mobility, and rich multimedia capability.
As multimedia becomes more pervasive, the boundaries between video, graphics,
computer vision, multimedia database, and computer networking start to blur, making
video processing an exciting field with input from many disciplines. Today, video
processing lies at the core of multimedia. Among the many technologies involved, video
coding and its standardization are definitely the key enablers of these developments.
This book covers the fundamental theory and techniques for digital video processing,
with a focus on video coding and communications. It is intended as a textbook for a
graduate-level course on video processing, as well as a reference or self-study text for
researchers and engineers. In selecting the topics to cover, we have tried to achieve
a balance between providing a solid theoretical foundation and presenting complex
system issues in real video systems.
SYNOPSIS
Chapter 1 gives a broad overview of video technology, from the analog color TV
system to digital video. Chapter 2 delineates the analytical framework for video analysis
in the frequency domain, and describes characteristics of the human visual system.
Chapters 3–12 focus on several very important sub-topics in digital video technology.
Chapters 3 and 4 consider how a continuous-space video signal can be sampled to
retain the maximum perceivable information within the affordable data rate, and how
video can be converted from one format to another. Chapter 5 presents models for
the various components involved in forming a video signal, including the camera, the
illumination source, the imaged objects and the scene composition. Models for the
three-dimensional (3-D) motions of the camera and objects, as well as their projections
onto the two-dimensional (2-D) image plane, are discussed at length, because these
models are the foundation for developing motion estimation algorithms, which are
the subjects of Chapters 6 and 7. Chapter 6 focuses on 2-D motion estimation, which
is a critical component in modern video coders. It is also a necessary preprocessing
step for 3-D motion estimation. We provide both the fundamental principles governing
2-D motion estimation, and practical algorithms based on different 2-D motion
representations. Chapter 7 considers 3-D motion estimation, which is required for various
computer vision applications, and can also help improve the efficiency of video coding.
Chapters 8–11 are devoted to the subject of video coding. Chapter 8 introduces
the fundamental theory and techniques for source coding, including information theory
bounds for both lossless and lossy coding, binary encoding methods, and scalar and
vector quantization. Chapter 9 focuses on waveform-based methods (including
transform and predictive coding), and introduces the block-based hybrid coding framework,
which is the core of all international video coding standards. Chapter 10 discusses
content-dependent coding, which has the potential of achieving extremely high
compression ratios by making use of knowledge of scene content. Chapter 11 presents
scalable coding methods, which are well-suited for video streaming and
broadcasting applications, where the intended recipients have varying network connections and
computing powers. Chapter 12 introduces stereoscopic and multiview video processing
techniques, including disparity estimation and coding of such sequences.
Chapters 13–15 cover system-level issues in video communications. Chapter 13
introduces the H.261, H.263, MPEG-1, MPEG-2, and MPEG-4 standards for video
coding, comparing their intended applications and relative performance. These
standards integrate many of the coding techniques discussed in Chapters 8–11. The MPEG-7
standard for multimedia content description is also briefly described. Chapter 14 reviews
techniques for combating transmission errors in video communication systems, and
also describes the requirements of different video applications, and the characteristics
of various networks. As an example of a practical video communication system, we
end the text with a chapter devoted to video streaming over the Internet and wireless
network. Chapter 15 discusses the requirements and representative solutions for the
major subcomponents of a streaming system.
SUGGESTED USE FOR INSTRUCTION AND SELF-STUDY
As prerequisites, students are assumed to have finished undergraduate courses in signals
and systems, communications, probability, and preferably a course in image
processing. For a one-semester course focusing on video coding and communications, we
recommend covering the two beginning chapters, followed by video modeling
(Chapter 5), 2-D motion estimation (Chapter 6), video coding (Chapters 8–11), standards
(Chapter 13), error control (Chapter 14) and video streaming systems (Chapter 15).
On the other hand, for a course on general video processing, the first nine chapters,
including the introduction (Chapter 1), frequency domain analysis (Chapter 2), sampling
and sampling rate conversion (Chapters 3 and 4), video modeling (Chapter 5), motion
estimation (Chapters 6 and 7), and basic video coding techniques (Chapters 8 and 9),
plus selected topics from Chapters 10–13 (content-dependent coding, scalable coding,
stereo, and video coding standards) may be appropriate. In either case, Chapter 8 may
be skipped or only briefly reviewed if the students have finished a prior course on
source coding. Chapters 7 (3-D motion estimation), 10 (content-dependent coding),
11 (scalable coding), 12 (stereo), 14 (error-control), and 15 (video streaming) may also
be left for an advanced course in video, after covering the other chapters in a first course
in video. In all cases, sections denoted by asterisks (*) may be skipped or left for further
exploration by advanced students.
Problems are provided at the end of Chapters 1–14 for self-study or as
homework assignments for classroom use. Appendix D gives answers to selected problems.
The website for this book (www.prenhall.com/wang) provides MATLAB scripts used to
generate some of the plots in the figures. Instructors may modify these scripts to generate
similar examples. The scripts may also help students to understand the underlying
operations. Sample video sequences can be downloaded from the website, so that
students can evaluate the performance of different algorithms on real sequences. Some
compressed sequences using standard algorithms are also included, to enable instructors
to demonstrate coding artifacts at different rates by different techniques.
ACKNOWLEDGMENTS
We are grateful to the many people who have helped to make this book a reality. Dr.
Barry G. Haskell of AT&T Labs, with his tremendous experience in video coding
standardization, reviewed Chapter 13 and gave valuable input to this chapter as well as other
topics. Prof. David J. Goodman of Polytechnic University, a leading expert in wireless
communications, provided valuable input to Section 14.2.2, part of which summarizes
characteristics of wireless networks. Prof. Antonio Ortega of the University of Southern
California and Dr. Anthony Vetro of Mitsubishi Electric Research Laboratories, then
a Ph.D. student at Polytechnic University, suggested what topics to cover in the
section on rate control, and reviewed Sections 9.3.3–4. Mr. Dapeng Wu, a Ph.D. student
at Carnegie Mellon University, and Dr. Yiwei Hou from Fujitsu Labs helped to draft
Chapter 15. Dr. Ru-Shang Wang of Nokia Research Center, Mr. Fatih Porikli of
Mitsubishi Electric Research Laboratories, also a Ph.D. student at Polytechnic University,
and Mr. Khalid Goudeaux, a student at Carnegie Mellon University, generated several
images related to stereo. Mr. Haidi Gu, a student at Polytechnic University, provided
the example image for scalable video coding. Mrs. Dorota Ostermann provided the
brilliant design for the cover.
We would like to thank the anonymous reviewers who provided valuable
comments and suggestions to enhance this work. We would also like to thank the students
at Polytechnic University, who used draft versions of the text and pointed out many
typographic errors and inconsistencies. Solutions included in Appendix D are based on
their homework. Finally, we would like to acknowledge the encouragement and
guidance of Tom Robbins at Prentice Hall. Yao Wang would like to acknowledge research
grants from the National Science Foundation and New York State Center for Advanced
Technology in Telecommunications over the past ten years, which have led to some of
the research results included in this book.
Most of all, we are deeply indebted to our families, for allowing and even
encouraging us to complete this project, which started more than four years ago and took away
a significant amount of time we could otherwise have spent with them. The arrival of
our new children Yana and Brandon caused a delay in the creation of the book but also
provided an impetus to finish it. This book is a tribute to our families, for their love,
affection, and support.
YAO WANG
Polytechnic University, Brooklyn, NY, USA
[email protected]
JÖRN OSTERMANN
AT&T Labs—Research, Middletown, NJ, USA
[email protected]
YA-QIN ZHANG
Microsoft Research, Beijing, China
VIDEO FORMATION,
PERCEPTION, AND
REPRESENTATION
In this first chapter, we describe what a video signal is, how it is captured and
perceived, how it is stored and transmitted, and what the important parameters are
that determine the quality and bandwidth (which in turn determines the data rate)
of a video signal. We first present the underlying physics for color perception
and specification (Sec. 1.1). We then describe the principles and typical devices
for video capture and display (Sec. 1.2). As will be seen, analog videos are
captured/stored/transmitted in a raster scan format, using either progressive or
interlaced scans. As an example, we review the analog color television (TV) system
(Sec. 1.4), and give insights as to how certain critical parameters, such as frame
rate and line rate, are chosen, what the spectral content of a color TV signal is,
and how different components of the signal can be multiplexed into a composite
signal. Finally, Section 1.5 introduces the ITU-R BT.601 video format (formerly
CCIR601), the digitized version of the analog color TV signal. We present some of
the considerations that have gone into the selection of various digitization
parameters. We also describe several other digital video formats, including
high-definition TV (HDTV). The compression standards developed for different
applications and their associated video formats are summarized.
The purpose of this chapter is to give the readers background knowledge about
analog and digital video, and to provide insights to common video system design
problems. As such, the presentation is intentionally made more qualitative than
quantitative. In later chapters, we will come back to certain problems mentioned
in this chapter and provide more rigorous descriptions/solutions.
1.1 Color Perception and Specification
A video signal is a sequence of two-dimensional (2D) images projected from a
dynamic three-dimensional (3D) scene onto the image plane of a video camera. The
color value at any point in a video frame records the emitted or reflected light
at a particular 3D point in the observed scene. To understand what the color value
means physically, we review in this section the basics of light physics and
describe the attributes that characterize light and its color. We will also
describe the principle of human color perception and different ways to specify a
color signal.
1.1.1 Light and Color
Light is an electromagnetic wave with wavelengths in the range of 380 to 780
nanometers (nm), to which the human eye is sensitive. The energy of light is
measured by flux, with a unit of watt, which is the rate at which energy is
emitted. The radiant intensity of a light, which is directly related to the
brightness of the light we perceive, is defined as the flux radiated into a unit
solid angle in a particular direction, measured in watt/solid-angle. A light
source usually can emit energy in a range of wavelengths, and its intensity can
vary in both space and time. In this book, we use $C(\mathbf{X}, t, \lambda)$ to
represent the radiant intensity distribution of a light, which specifies the light
intensity at wavelength $\lambda$, spatial location $\mathbf{X} = (X, Y, Z)$, and
time $t$.
The perceived color of a light depends on its spectral content (i.e., the
wavelength composition). For example, a light that has its energy concentrated
near 700 nm appears red. A light that has equal energy in the entire visible band
appears white. In general, a light that has a very narrow bandwidth is referred to
as a spectral color. On the other hand, a white light is said to be achromatic.
There are twotypes of light sources: the illuminating source, which emits an
electromagnetic wave, and there ecting source, which re ects an incident wave.
1
The illuminating light sources include the sun, light bulbs, the television (TV)
monitors,etc. Theperceivedcolorof anilluminating lightsourcedepends onthe
wavelengthrangeinwhichitemitsenergy. Theilluminatinglightfollowsanadditive rule,i.e. theperceivedcolorofseveralmixedilluminatinglightsourcesdependson thesumofthespectraofalllightsources. Forexample,combiningred,green,and bluelightsinrightproportionscreatesthewhitecolor.
Footnote 1: The illuminating and reflecting light sources are also referred to as primary and secondary light sources, respectively. We do not use those terms to avoid the confusion with the primary colors associated with light. In other places, illuminating and reflecting lights are also called additive
Figure 1.1. Solid line: Frequency responses of the three types of cones on the human retina. The blue response curve is magnified by a factor of 20 in the figure. Dashed line: The luminous efficiency function. From [10, Fig. 1].
The reflecting light sources are those that reflect an incident light (which could itself be a reflected light). When a light beam hits an object, the energy in a certain wavelength range is absorbed, while the rest is reflected. The color of a reflected light depends on the spectral content of the incident light and the wavelength range that is absorbed. A reflecting light source follows a subtractive rule, i.e. the perceived color of several mixed reflecting light sources depends on the remaining, unabsorbed wavelengths. The most notable reflecting light sources are the color dyes and paints. For example, if the incident light is white, a dye that absorbs the wavelength near 700 nm (red) appears as cyan. In this sense, we say that cyan is the complement of red (or white minus red). Similarly, magenta and yellow are complements of green and blue, respectively. Mixing cyan, magenta, and yellow dyes produces black, which absorbs the entire visible spectrum.
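The complement relation and the subtractive rule can be sketched numerically. The snippet below is only an idealized illustration: it assumes a color is summarized by three band energies (R, G, B) normalized so that white is (1, 1, 1), and that each dye passes a fixed fraction of each band. The names `complement` and `mix_dyes` are hypothetical, and real dyes do not absorb this cleanly.

```python
WHITE = (1.0, 1.0, 1.0)

def complement(rgb):
    """Return white minus the given color, band by band."""
    return tuple(w - c for w, c in zip(WHITE, rgb))

def mix_dyes(*dyes):
    """Idealized subtractive mixing under white light.

    Each dye is described by the fraction of each band it reflects,
    so stacking dyes multiplies the fractions band by band.
    """
    out = list(WHITE)
    for dye in dyes:
        out = [o * d for o, d in zip(out, dye)]
    return tuple(out)

RED, GREEN, BLUE = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
CYAN = complement(RED)       # absorbs the red band
MAGENTA = complement(GREEN)  # absorbs the green band
YELLOW = complement(BLUE)    # absorbs the blue band

if __name__ == "__main__":
    print("cyan:", CYAN)
    print("cyan + magenta + yellow:", mix_dyes(CYAN, MAGENTA, YELLOW))
```

Under these assumptions, stacking all three dyes leaves no band unabsorbed, which matches the statement that the mixture appears black.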
1.1.2 Human Perception of Color
The perception of a light in the human being starts with the photoreceptors located in the retina (the surface of the rear of the eyeball). There are two types of receptors: cones that function under bright light and can perceive the color tone, and rods that work under low ambient light and can only extract the luminance information. The visual information from the retina is passed via optic nerve fibers to the brain area called the visual cortex, where visual processing and understanding is accomplished. There are three types of cones which have overlapping pass-bands in the visible spectrum with peaks at red (near 570 nm), green (near 535 nm), and blue (near 445 nm) wavelengths, respectively, as shown in Figure 1.1. The responses of these receptors to an incoming light distribution $C(\lambda)$ can be described by:
$$C_i = \int C(\lambda)\, a_i(\lambda)\, d\lambda, \quad i = r, g, b, \qquad (1.1.1)$$
where $a_r(\lambda)$, $a_g(\lambda)$, $a_b(\lambda)$ are referred to as the frequency responses or relative absorption functions of the red, green, and blue cones. The combination of these three types of receptors enables a human being to perceive any color. This implies that the perceived color only depends on three numbers, $C_r$, $C_g$, $C_b$, rather than the complete light spectrum $C(\lambda)$. This is known as the tri-receptor theory of color
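To make Eq. (1.1.1) concrete, the following sketch approximates the three integrals by a midpoint Riemann sum. The cone curves used here are an assumption for illustration only: Gaussians centered at the peak wavelengths quoted above, with an arbitrarily chosen width of 40 nm. They are not the measured responses of Figure 1.1, and the function names are hypothetical.

```python
import math

# Peak wavelengths (nm) quoted in the text; the Gaussian shape and
# the 40 nm width are assumptions made only for this illustration.
CONE_PEAKS_NM = {"r": 570.0, "g": 535.0, "b": 445.0}
CONE_WIDTH_NM = 40.0

def cone_sensitivity(lam, peak):
    """Assumed relative absorption function a_i(lambda)."""
    return math.exp(-0.5 * ((lam - peak) / CONE_WIDTH_NM) ** 2)

def cone_responses(spectrum, lam_min=380.0, lam_max=780.0, step=1.0):
    """Approximate C_i = integral of C(lambda) a_i(lambda) d(lambda)."""
    n = int(round((lam_max - lam_min) / step))
    responses = {}
    for name, peak in CONE_PEAKS_NM.items():
        total = 0.0
        for k in range(n):
            lam = lam_min + (k + 0.5) * step  # midpoint of the k-th bin
            total += spectrum(lam) * cone_sensitivity(lam, peak) * step
        responses[name] = total
    return responses

def flat_white(lam):
    """Equal energy at every wavelength."""
    return 1.0

def narrowband_red(lam):
    """Energy concentrated near 700 nm."""
    return 1.0 if 690.0 <= lam <= 710.0 else 0.0

if __name__ == "__main__":
    print("white:", cone_responses(flat_white))
    print("red  :", cone_responses(narrowband_red))
```

Whatever the shape of $C(\lambda)$, the output is always just three numbers, which is the point of the theory: two different spectra that yield the same triple would be indistinguishable to these receptors.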
There are two attributes that describe the color sensation of a human being: luminance and chrominance. The term luminance refers to the perceived brightness of the light, which is proportional to the total energy in the visible band. The term chrominance describes the perceived color tone of a light, which depends on the wavelength composition of the light. Chrominance is in turn characterized by two attributes: hue and saturation. Hue specifies the color tone, which depends on the peak wavelength of the light, while saturation describes how pure the color is, which depends on the spread or bandwidth of the light spectrum. In this book, we use the word color to refer to both the luminance and chrominance attributes of a light, although it is customary to use the word color to refer to the chrominance aspect of a light only.
Experiments have shown that there exists a secondary processing stage in the human visual system (HVS), which converts the three color values obtained by the cones into one value that is proportional to the luminance and two other values that are responsible for the perception of chrominance. This is known as the opponent color model of the HVS [3, 9]. It has been found that the same amount of energy produces different sensations of the brightness at different wavelengths, and this wavelength-dependent variation of the brightness sensation is characterized by a relative luminous efficiency function, $a_y(\lambda)$, which is also shown (in dashed line) in Fig. 1.1. It is essentially the sum of the frequency responses of all three types of cones. We can see that the green wavelength contributes most to the perceived brightness, the red wavelength the second, and the blue the least. The luminance (often denoted by $Y$) is related to the incoming light spectrum by:
$$Y = \int C(\lambda)\, a_y(\lambda)\, d\lambda. \qquad (1.1.2)$$
In the above equations, we have neglected the time and space variables, since we are only concerned with the perceived color or luminance at a fixed spatial and temporal location. We also neglected the scaling factor commonly associated with each equation, which depends on the desired unit for describing the color intensities and luminance.
1.1.3 The Trichromatic Theory of Color Mixture
A very important finding in color physics is that most colors can be produced by mixing three properly chosen primary colors. This is known as the trichromatic theory of color mixture, first demonstrated by Maxwell in 1855 [9, 13]. Let $C_k$, $k = 1, 2, 3$ represent the colors of three primary color sources, and $C$ a given color. Then the theory essentially says
$$C = \sum_{k=1,2,3} T_k C_k, \qquad (1.1.3)$$
where $T_k$'s are the amounts of the three primary colors required to match color
negative. Assuming only $T_1$ is negative, this means that one cannot match color $C$ by mixing $C_1$, $C_2$, $C_3$, but one can match color $C + |T_1| C_1$ with $T_2 C_2 + T_3 C_3$.
In practice, the primary colors should be chosen so that most natural colors can be reproduced using positive combinations of primary colors. The most popular primary set for the illuminating light source contains red, green, and blue colors, known as the RGB primary. The most common primary set for the reflecting light source contains cyan, magenta, and yellow, known as the CMY primary. In fact, the RGB and CMY primary sets are complements of each other, in that mixing two colors in one set will produce one color in the other set. For example, mixing red with green will yield yellow. This complementary information is best illustrated by a color wheel, which can be found in many image processing books, e.g., [9, 4].
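Equation (1.1.3) can be read as a 3x3 linear system in the unknowns $T_1$, $T_2$, $T_3$. The sketch below is a minimal illustration under a strong assumption: that each color is already summarized by three numbers in some common linear coordinate (for instance, three sensor responses). The primaries and the target color are made-up values chosen so that one tristimulus value comes out negative, and the function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def tristimulus(primaries, target):
    """Solve target = sum_k T_k * primaries[k] by Cramer's rule."""
    # Build the matrix whose k-th column is the k-th primary.
    m = [[primaries[k][row] for k in range(3)] for row in range(3)]
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("primaries are not linearly independent")
    t = []
    for k in range(3):
        mk = [row[:] for row in m]
        for row in range(3):
            mk[row][k] = target[row]  # replace column k by the target
        t.append(det3(mk) / d)
    return tuple(t)

# Hypothetical primaries and target color (illustrative numbers only).
C1 = (1.0, 0.2, 0.0)
C2 = (0.6, 1.0, 0.2)
C3 = (0.0, 0.2, 1.0)
C = (0.1, 1.1, 1.2)

if __name__ == "__main__":
    T1, T2, T3 = tristimulus([C1, C2, C3], C)
    print("T =", (T1, T2, T3))  # T1 is negative for this target
```

For this target the solution has a negative $T_1$, so $C$ itself cannot be mixed from the three primaries, but adding $|T_1| C_1$ to $C$ gives a color that $T_2 C_2 + T_3 C_3$ does match.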
For a chosen primary set, one way to determine the tristimulus values of any color is by first determining the color matching functions, $m_i(\lambda)$, for primary colors, $C_i$, $i = 1, 2, 3$. These functions describe the tristimulus values of a spectral color with wavelength $\lambda$, for various $\lambda$ in the entire visible band, and can be determined by visual experiments with controlled viewing conditions. Then the tristimulus values for any color with a spectrum $C(\lambda)$ can be obtained by [9]:
$$T_i = \int C(\lambda)\, m_i(\lambda)\, d\lambda, \quad i = 1, 2, 3. \qquad (1.1.4)$$
To produce all visible colors with positive mixing, the matching functions associated with the primary colors must be positive.
The above theory forms the basis for color capture and display. To record the color of an incoming light, a camera needs to have three sensors that have frequency responses similar to the color matching functions of a chosen primary set. This can be accomplished by optical or electronic filters with the desired frequency responses. Similarly, to display a color picture, the display device needs to emit three optical beams of the chosen primary colors with appropriate intensities, as specified by the tristimulus values. In practice, electronic beams that strike phosphors with the red, green and blue colors are used. All present display systems use an RGB primary, although the standard spectra specified for the primary colors may be slightly different. Likewise, a color printer can produce different colors by mixing three dyes with the chosen primary colors in appropriate proportions. Most of the color printers use the CMY primary. For a more vivid and wide-range color rendition, some color printers use four primaries, by adding black (K) to the CMY set. This is known as the CMYK primary, which can render the black color more truthfully.
1.1.4 Color Specification by Tristimulus Values
Tristimulus Values: We have introduced the tristimulus representation of a color in Sec. 1.1.3, which specifies the proportions, i.e. the $T_k$'s in Eq. (1.1.3), of the three primary colors needed to create the desired color. In order to make the color
should be normalized so that $T_k = 1$, $k = 1, 2, 3$ for a reference white color (equal energy in all wavelengths) with a unit energy. When we use an RGB primary, the tristimulus values are usually denoted by $R$, $G$, and $B$.
Chromaticity Values: The above tristimulus representation mixes the luminance and chrominance attributes of a color. To measure only the chrominance information (i.e. the hue and saturation) of a light, the chromaticity coordinate is defined as:
$$t_k = \frac{T_k}{T_1 + T_2 + T_3}, \quad k = 1, 2, 3. \qquad (1.1.5)$$
Since $t_1 + t_2 + t_3 = 1$, two chromaticity values are sufficient to specify the chrominance of a color.
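Equation (1.1.5) is a simple normalization, sketched below. The function name `chromaticity` is ours, and the input values in the example are arbitrary.

```python
def chromaticity(t1, t2, t3):
    """Return (t_1, t_2, t_3) as defined in Eq. (1.1.5)."""
    total = t1 + t2 + t3
    if total == 0:
        raise ValueError("chromaticity is undefined when T1 + T2 + T3 = 0")
    return (t1 / total, t2 / total, t3 / total)

if __name__ == "__main__":
    dim = chromaticity(2.0, 1.0, 1.0)
    bright = chromaticity(4.0, 2.0, 2.0)  # same color, twice the intensity
    print(dim, bright)
```

Scaling all three tristimulus values by the same factor changes the intensity but leaves the chromaticity values unchanged, which is why they carry only the chrominance information.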
Obviously, the color value of an imaged point depends on the primary colors used. To standardize color description and specification, several standard primary color systems have been specified. For example, the CIE, an international body of color scientists, defined a CIE RGB primary system, which consists of colors at 700 ($R_0$), 546.1 ($G_0$), and 435.8 ($B_0$) nm.
Color Coordinate Conversion: One can convert the color values based on one set of primaries to the color values for another set of primaries. Conversion of the (R, G, B) coordinate to the (C, M, Y) coordinate is, for example, often required for printing color images stored in the (R, G, B) coordinate. Given the tristimulus representation of one primary set in terms of another primary, one can determine the conversion matrix between the two color coordinates. The principle of color conversion and the derivation of the conversion matrix between two sets of color primaries can be found in [9].
1.1.5 Color Specification by Luminance and Chrominance Attributes
The RGB primary commonly used for color display mixes the luminance and chrominance attributes of a light. In many applications, it is desirable to describe a color in terms of its luminance and chrominance content separately, to enable more efficient processing and transmission of color signals. Towards this goal, various three-component color coordinates have been developed, in which one component reflects the luminance and the other two collectively characterize hue and saturation. One such coordinate is the CIE XYZ primary, in which $Y$ directly measures the luminance intensity. The $(X, Y, Z)$ values in this coordinate are related to the $(R, G, B)$ values in the CIE RGB coordinate by [9]:
$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 2.365 & 0.515 & 0.005 \\ 0.897 & 1.426 & 0.014 \\ 0.468 & 0.089 & 1.009 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}. \qquad (1.1.6)$$
In addition to separating the luminance and chrominance information, another advantage of the CIE XYZ system is that almost all visible colors can be specified with non-negative tristimulus values, which is a very desirable feature. The problem is that the X, Y, Z colors so defined are not realizable by actual color stimuli. As such, the XYZ primary is not directly used for color production, rather it is mainly introduced for defining other primaries and for numerical specification of color. As will be seen later, the color coordinates used for transmission of color TV signals, such as YIQ and YUV, are all derived from the XYZ coordinate.
There are other color representations in which the hue and saturation of a color are explicitly specified, in addition to the luminance. One example is the HSI coordinate, where H stands for hue, S for saturation, and I for intensity (equivalent to luminance). Although this color coordinate clearly separates different attributes of a light, it is nonlinearly related to the tristimulus values and is difficult to compute. The book by Gonzalez has a comprehensive coverage of various color coordinates and their conversions [4].
1.2 Video Capture and Display
1.2.1 Principle of Color Video Imaging
Having explained what light is and how it is perceived and characterized, we are now in a position to understand the meaning of a video signal. In short, a video records the emitted and/or reflected light intensity, i.e. $C(\mathbf{X}, t, \lambda)$, from the objects in the scene that is observed by a viewing system (a human eye or a camera). In general, this intensity changes both in time and space. Here, we assume that there are some illuminating light sources in the scene. Otherwise, there will be no emitted or reflected light and the image will be totally dark. When observed by a camera, only those wavelengths to which the camera is sensitive are visible. Let the spectral absorption function of the camera be denoted by $a_c(\lambda)$, then the light intensity distribution in the 3D world that is "visible" to the camera is:
$$\bar{\psi}(\mathbf{X}, t) = \int_0^{\infty} C(\mathbf{X}, t, \lambda)\, a_c(\lambda)\, d\lambda. \qquad (1.2.1)$$
The image function captured by the camera at any time $t$ is the projection of the light distribution in the 3D scene onto a 2D image plane. Let $P(\cdot)$ represent the camera projection operator so that the projected 2D position of the 3D point $\mathbf{X}$ is given by $\mathbf{x} = P(\mathbf{X})$. Furthermore, let $P^{-1}(\cdot)$ denote the inverse projection operator, so that $\mathbf{X} = P^{-1}(\mathbf{x})$ specifies the 3D position associated with a 2D point $\mathbf{x}$. Then the projected image is related to the 3D image by
$$\psi(P(\mathbf{X}), t) = \bar{\psi}(\mathbf{X}, t) \quad \text{or} \quad \psi(\mathbf{x}, t) = \bar{\psi}\left(P^{-1}(\mathbf{x}), t\right). \qquad (1.2.2)$$
The function $\psi(\mathbf{x}, t)$ is what is known as a video signal. We can see that it describes the radiant intensity at the 3D position $\mathbf{X}$ that is projected onto $\mathbf{x}$ in the image plane at time $t$. In general the video signal has a finite spatial and temporal range. The spatial range depends on the camera viewing area, while the temporal range depends on the duration in which the video is captured. A point in the image plane is called a pixel (meaning picture element) or simply pel.
Footnote 4: For most camera systems, the projection operator $P(\cdot)$ can be approximated by a perspective projection. This is discussed in more detail in Sec. 5.1.
If the camera absorption function is the same as the relative luminous efficiency function of the human being, i.e. $a_c(\lambda) = a_y(\lambda)$, then a luminance image is formed. If the absorption function is non-zero over a narrow band, then a monochrome (or monotone) image is formed. To perceive all visible colors, according to the trichromatic color vision theory (see Sec. 1.1.2), three sensors are needed, each with a frequency response similar to the color matching function for a selected primary color. As described before, most color cameras use the red, green, and blue sensors for color acquisition.
If the camera has only one luminance sensor, $\psi(\mathbf{x}, t)$ is a scalar function that represents the luminance of the projected light. In this book, we use the word gray-scale to refer to such a video. The term black-and-white will be used strictly to describe an image that has only two colors: black and white. On the other hand, if the camera has three separate sensors, each tuned to a chosen primary color, the signal is a vector function that contains three color values at every point. Instead of specifying these color values directly, one can use other color coordinates (each consists of three values) to characterize light, as explained in the previous section.
Note that for special purposes, one may use sensors that work in a frequency range that is invisible to the human being. For example, in X-ray imaging, the sensor is sensitive to the spectral range of the X-ray. On the other hand, an infra-red camera is sensitive to the infra-red range, which can function at very low ambient light. These cameras can "see" things that cannot be perceived by the human eye. Yet another example is the range camera, in which the sensor emits a laser beam and measures the time it takes for the beam to reach an object and then be reflected back to the sensor. Because the round trip time is proportional to the distance between the sensor and the object surface, the image intensity at any point in a range image describes the distance or range of its corresponding 3D point from the camera.
1.2.2 Video Cameras
All the analog cameras of today capture a video in a frame by frame manner with a certain time spacing between the frames. Some cameras (e.g. TV cameras and consumer video camcorders) acquire a frame by scanning consecutive lines with a certain line spacing. Similarly, all the display devices present a video as a consecutive set of frames, and with TV monitors, the scan lines are played back sequentially as separate lines. Such capture and display mechanisms are designed to take advantage of the fact that the HVS cannot perceive very high frequency changes in time and space. This property of the HVS will be discussed more extensively in Sec. 2.4.
There are basically two types of video imagers: (1) tube-based imagers such as vidicons, plumbicons, or orthicons, and (2) solid-state sensors such as charge-coupled devices (CCD). The lens of a camera focuses the image of a scene onto a photosensitive surface of the imager of the camera, which converts optical signals into electrical signals. The photosensitive surface of the tube imager is typically scanned line by line (known as raster scan) with an electron beam or other electronic methods, and the scanned lines in each frame are then converted into an electrical signal representing variations of light intensity as variations in voltage. Different lines are therefore captured at slightly different times in a continuous manner. With progressive scan, the electronic beam scans every line continuously; while with interlaced scan, the beam scans every other line in one half of the frame time (a field) and then scans the other half of the lines. We will discuss raster scan in more detail in Sec. 1.3. With a CCD camera, the photosensitive surface is comprised of a 2D array of sensors, each corresponding to one pixel, and the optical signal reaching each sensor is converted to an electronic signal. The sensor values captured in each frame time are first stored in a buffer, which are then read out sequentially one line at a time to form a raster signal. Unlike the tube-based cameras, all the read-out values in the same frame are captured at the same time. With an interlaced scan camera, alternate lines are read out in each field.
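The line partition behind interlaced scan can be sketched as follows. This is only an illustration of which lines belong to which field, assuming a frame is stored as a list of scan lines; it does not model the time offset between the two fields. The names `split_fields` and `weave` are hypothetical.

```python
def split_fields(frame):
    """Split a frame (list of scan lines) into its two fields.

    The first field holds lines 0, 2, 4, ... and the second field
    holds lines 1, 3, 5, ...
    """
    return frame[0::2], frame[1::2]

def weave(first_field, second_field):
    """Rebuild a frame by alternating lines from the two fields."""
    frame = []
    for i in range(len(first_field) + len(second_field)):
        source = first_field if i % 2 == 0 else second_field
        frame.append(source[i // 2])
    return frame

if __name__ == "__main__":
    # A toy frame with 6 lines of 4 samples each.
    frame = [[10 * row + col for col in range(4)] for row in range(6)]
    first, second = split_fields(frame)
    print("first field :", first)
    print("second field:", second)
    print("rebuilt OK  :", weave(first, second) == frame)
```

Each field carries half of the lines, so a field can be delivered in half of the frame time while the full line count is recovered once both fields are combined.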
To capture color, there are usually three types of photosensitive surfaces or CCD sensors, each with a frequency response that is determined by the color matching function of the chosen primary color, as described previously in Sec. 1.1.3. To reduce the cost, most consumer cameras use a single CCD chip for color imaging. This is accomplished by dividing the sensor area for each pixel into three or four sub-areas, each sensitive to a different primary color. The three captured color signals can be either converted to one luminance signal and two chrominance signals and sent out as a component color video, or multiplexed into a composite signal. This subject is explained further in Sec. 1.2.4.
Many cameras of today are CCD-based because they can be made much smaller and lighter than the tube-based cameras, to acquire the same spatial resolution. Advancement in CCD technology has made it possible to capture in a very small chip size a very high resolution image array. For example, 1/3-in CCD's with 380 K pixels are commonly found in consumer-use camcorders, whereas a 2/3-in CCD with 2 million pixels has been developed for HDTV. The tube-based cameras are more bulky and costly, and are only used in special applications, such as those requiring very high resolution or high sensitivity under low ambient light. In addition to the circuitry for color imaging, most cameras also implement color coordinate conversion (from RGB to luminance and chrominance) and compositing of luminance and chrominance signals. For digital output, analog-to-digital (A/D) conversion is also incorporated. Figure 1.2 shows the typical processings involved in a professional
Figure 1.2. Schematic Block Diagram of a Professional Color Video Camera. From [6, Fig. 7(a)].
image quality, digital processing is introduced within the camera. For an excellent exposition of the video camera and display technologies, see [6].
1.2.3 Video Display
To display a video, the most common device is the cathode ray tube (CRT). With a CRT monitor, an electron gun emits an electron beam across the screen line by line, exciting phosphors with intensities proportional to the intensity of the video signal at corresponding locations. To display a color image, three beams are emitted by three separate guns, exciting red, green, and blue phosphors with the desired intensity combination at each location. To be more precise, each color pixel consists of three elements arranged in a small triangle, known as a triad.
The CRT can produce an image having a very large dynamic range so that the displayed image can be very bright, sufficient for viewing during daylight or from a distance. However, the thickness of a CRT needs to be about the same as the width of the screen, for the electrons to reach the side of the screen. A large screen monitor is thus too bulky, unsuitable for applications requiring thin and portable devices. To circumvent this problem, various flat panel displays have been developed. One popular device is the Liquid Crystal Display (LCD). The principal idea behind the LCD is to change the optical properties and consequently the brightness/color of the liquid crystal by an applied electric field. The electric field can be generated/adapted by either an array of transistors, such as in LCD's using active matrix thin-film-transistors (TFT), or by using plasma. The plasma technology eliminates the need for TFT and makes large-screen LCD's possible. There are also new designs for flat CRT's. A more comprehensive description of video display technologies can be found in [6].
frame instant is completely recorded on the film. For display, consecutive recorded frames are played back using an analog optical projection system.
1.2.4 Composite vs. Component Video
Ideally, a color video should be specified by three functions or signals, each describing one color component, in either a tristimulus color representation, or a luminance-chrominance representation. A video in this format is known as component video. Mainly for historical reasons, various composite video formats also exist, wherein the three color signals are multiplexed into a single signal. These composite formats were invented when the color TV system was first developed and there was a need to transmit the color TV signal in a way so that a black-and-white TV set can extract from it the luminance component. The construction of a composite signal relies on the property that the chrominance signals have a significantly smaller bandwidth than the luminance component. By modulating each chrominance component to a frequency that is at the high end of the luminance component, and adding the resulting modulated chrominance signals and the original luminance signal together, one creates a composite signal that contains both luminance and chrominance information. To display a composite video signal on a color monitor, a filter is used to separate the modulated chrominance signals and the luminance signal. The resulting luminance and chrominance components are then converted to red, green, and blue color components. With a gray-scale monitor, the luminance signal alone is extracted and displayed directly.
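The multiplexing idea can be illustrated with a toy one-dimensional example. The sketch below makes several assumptions that are not stated in the text above: the luminance and the two chrominance values are constant over the analysis window, the two chrominance values share one carrier frequency in quadrature (cosine and sine), and the separating filter is replaced by synchronous detection averaged over a whole number of carrier periods. The function names and the numerical values are hypothetical.

```python
import math

def modulate(y, u, v, fc, times):
    """Toy composite signal: luminance plus two chrominance values
    placed on a carrier of frequency fc (cosine and sine phases)."""
    return [
        y
        + u * math.cos(2 * math.pi * fc * t)
        + v * math.sin(2 * math.pi * fc * t)
        for t in times
    ]

def demodulate(samples, fc, times):
    """Recover (y, u, v) by averaging over the window.

    The plain average removes the carrier terms and leaves y;
    multiplying by each carrier phase before averaging isolates
    the corresponding chrominance value.
    """
    n = len(samples)
    y = sum(samples) / n
    u = 2 * sum(s * math.cos(2 * math.pi * fc * t)
                for s, t in zip(samples, times)) / n
    v = 2 * sum(s * math.sin(2 * math.pi * fc * t)
                for s, t in zip(samples, times)) / n
    return y, u, v

if __name__ == "__main__":
    # 64 samples over one time unit; the carrier completes 8 periods.
    times = [k / 64 for k in range(64)]
    composite = modulate(0.6, 0.1, -0.2, 8.0, times)
    print(demodulate(composite, 8.0, times))
```

In this idealized setting the three values are recovered exactly. With real signals the luminance also has energy near the carrier frequency, so the separation is imperfect.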
All present analog TV systems transmit color TV signals in a composite format. The composite format is also used for video storage on some analog tapes (such as the VHS tape). In addition to being compatible with a gray-scale signal, the composite format eliminates the need for synchronizing different color components when processing a color video. A composite signal also has a bandwidth that is significantly lower than the sum of the bandwidths of the three component signals, and therefore can be transmitted or stored more efficiently. These benefits are however achieved at the expense of video quality: there often exist noticeable artifacts caused by cross-talk between color and luminance components.
As a compromise between the data rate and video quality, S-video was invented, which consists of two components, the luminance component and a single chrominance component which is the multiplex of two original chrominance signals. Many advanced consumer level video cameras and displays enable recording/display of video in S-video format. Component format is used only in professional video equipment.
1.2.5 Gamma Correction