DCU
Dublin City University
School of Electronic Engineering
Semi-automatic Video Object
Segmentation for Multimedia Applications
A thesis submitted for the award of
M. Eng. Degree in Electronic Engineering at
Dublin City University
by
Saman Hemantha Cooray, B.Sc. (Eng)
Supervised by Dr. Noel O'Connor
DECLARATION
I hereby certify that this material, which I now submit for assessment on the
programme of study leading to the award of M. Eng. in Electronic Engineering, is
entirely my own work and has not been taken from the work of others save and to the
extent that such work has been cited and acknowledged within the text of my work.
Signed:
ID No:
Saman Hemantha Cooray
ACKNOWLEDGEMENTS
First of all, I wish to express my sincere gratitude to Dr. Noel O'Connor for his excellent guidance, affectionate counselling and encouragement throughout the course of this research and the writing of this thesis. It would not have been possible to complete this work without his guidance and stimulating suggestions.
I would like to thank Dr. Tommy Curran for providing me with his kind assistance and the other facilities necessary to complete this work.
I also wish to thank Dr. Sean Marlow and Dr. Noel Murphy for the encouragement and suggestions they offered me throughout this work. My thanks also go to all other members of the VMPG group.
Special thanks go to my colleague Sean Murphy, who used to work in the same lab, J119, for his help and many fruitful discussions. I am also grateful to the other colleagues who have helped me in numerous ways to complete this work.
My heartiest thanks are offered to my parents and family, who are thousands of miles away at this moment, for their encouragement and sacrifice, and for what they are yet to experience in the coming years.
I must give my special thanks to all my Sri Lankan friends for their lovely and entertaining e-mails, which kept me in good spirits. I can remember that there was a time after April the 27th 1999 when I quietly sang some Sri Lankan songs during your absence.
ABSTRACT
Author: Saman Hemantha Cooray
A semi-automatic video object segmentation tool is presented for segmenting both still pictures and image sequences. The approach comprises both automatic segmentation algorithms and manual user interaction. The still image segmentation component comprises a conventional spatial segmentation algorithm (Recursive Shortest Spanning Tree (RSST)), a hierarchical segmentation representation method (Binary Partition Tree (BPT)), and user interaction. An initial segmentation partition of homogeneous regions is created using RSST. The BPT technique is then used to merge these regions and hierarchically represent the segmentation in a binary tree. The semantic objects are then manually built by selectively clicking on image regions. A video object-tracking component enables image sequence segmentation, and this subsystem is based on motion estimation, spatial segmentation, object projection, region classification, and user interaction. The motion between the previous frame and the current frame is estimated, and the previous object is then projected onto the current partition. A region classification technique is used to determine which regions in the current partition belong to the projected object. User interaction is allowed for object re-initialisation when the segmentation results become inaccurate. The combination of all these components enables offline video sequence segmentation. The results presented on standard test sequences illustrate the potential use of this system for object-based coding and representation of multimedia.
TABLE OF CONTENTS
1. INTRODUCTION
1.1 Overview
1.2 Introduction
1.3 Objectives of the Research
1.4 Structure of the Thesis
2. STILL PICTURE AND MOVING PICTURE CODING STANDARDS
2.1 Overview
2.2 Overview of Data Compression
2.3 Different Video Formats
2.4 Different Colour Spaces
2.5 Block-based Compression and Segmentation-based Compression
2.6 Still Image Compression Standards
2.6.1 ISO/IEC JPEG Coding
2.6.2 ISO/IEC JPEG2000 Coding
2.7 Moving Picture Compression and Description Standards
2.7.1 ITU-T H.261
2.7.2 ITU-T H.263
2.7.3 ISO/IEC MPEG-1
2.7.4 ISO/IEC MPEG-2
2.7.5 ISO/IEC MPEG-4
2.7.6 ISO/IEC MPEG-7
2.8 Discussion
3. IMAGE SEGMENTATION
3.1 Overview
3.2 General Description of Segmentation
3.2.1 Regions vs Objects
3.2.2 Automatic vs Semi-automatic
3.2.3 How Image Segmentation is achieved
3.2.4 Segmentation by Generic Region Merging
3.2.5 Hierarchical Segmentation
3.3 Automatic Segmentation Algorithms
3.3.1 Watershed Algorithm
3.3.2 Recursive Shortest Spanning Tree (RSST)
3.4 Segmentation Representation
3.4.1 Binary Partition Tree (BPT)
3.5 Discussion
4.1 Overview
4.2 Introduction
4.3 Tracking Algorithms
4.3.1 Edge-based Tracking Algorithms
4.3.1.1 Active Contours (Snakes) Tracking Algorithms
4.3.2 Region-based Tracking Algorithms
4.3.2.1 Partition Projection Method
4.3.2.2 Region Classification
4.4 Discussion
5. A SEMI-AUTOMATIC VIDEO OBJECT SEGMENTATION TOOL
5.1 Overview
5.2 System Architecture
5.3 The Graphical User Interface
5.4 Initial Object Segmentation using RSST, BPT and User Interaction
5.5 Object Tracking
5.6 Various features supported to allow interactivity
5.6.1 Forming the objects
5.6.2 Browsing the tree
5.6.3 Simplification of browsing process
5.6.4 Object Extraction using EXTRACT button
5.6.5 Save to disk functionality
5.6.6 Reset functionality
5.6.7 Refinement by region splitting - further interaction
5.6.8 Track
5.6.9 Re-initialise
5.6.10 Other text-fields
5.7 Software Implementation
5.8 Discussion
6. EXPERIMENTAL RESULTS AND DISCUSSIONS
6.1 Overview
6.2 Spatial Object Segmentation Results
6.3 Object Tracking Results
6.4 Discussion
7. CONCLUSION
7.1 Overview
7.2 Review of the work carried out
7.3 Recommendations for future work
REFERENCES
APPENDIX A - SOME SPATIAL SEGMENTATION RESULTS ON NON-STANDARD IMAGES
TABLE OF FIGURES
Figure 2-1 JPEG Coding Model
Figure 2-2 ROI coding
Figure 2-3 H.261 coding system
Figure 2-4 MPEG-4 Encoder
Figure 3-1 Example of image segmentation (a) original image (b) segmented image represented in mean grey-level regions (c) segmented image represented as contours being superimposed on the original image
Figure 3-2 Examples of semantic objects
Figure 3-3 Examples of automatically extracted regions
Figure 3-4 Hierarchical segmentation representation
Figure 3-5 Illustration of watershed immersion
Figure 3-6 Mapped image with 4-way connectivity
Figure 3-7 Graph of nodes being generated during merging process
Figure 3-8 A BPT with 9 levels
Figure 4-1 Examples of moving objects for tracking
Figure 4-2 Occlusion problem
Figure 4-3 Example of video object tracking
Figure 4-4 Contour matching
Figure 4-5 Snakes model
Figure 5-1 System Block Diagram
Figure 5-2 A screen-shot of the GUI
Figure 5-3 Initial Object Segmentation
Figure 5-4 Simplified browsing process
Figure 5-5 Processing of tracked objects
Figure 5-6 Region classification
Figure 5-7 Object tracking method
Figure 5-8 Relationship between slider-bar and tree hierarchy
Figure 5-9 Image partitions corresponding to different levels in the hierarchy
Figure 5-10 Browsing simplification step
Figure 5-11 Pixel identification for splitting process
Figure 5-12 Results from automatic segmentation and splitting technique
Figure 6-1 Automatic region-based segmentations for the Claire image
Figure 6-2 Automatic region-based segmentations for the Foreman image
Figure 6-3 Automatic region-based segmentations for the Mother and Daughter image
Figure 6-4 Automatic region-based segmentations for the Table Tennis image
Figure 6-5 Automatic region-based segmentations for the Children image
Figure 6-6 Interactive object segmentation for Claire image
Figure 6-7 Interactive object segmentation for Foreman image
Figure 6-9 Interactive object segmentation for “Table Tennis” image
Figure 6-10 Interactive object segmentation for “Children” image
Figure 6-11 Interactive object segmentation for Mother and Daughter sequence with an initial partition of 300 regions
Figure 6-12 Object tracking results for “Claire” sequence
Figure 6-13 Object tracking results for Foreman sequence
Figure 6-14 Object tracking results for Mother and Daughter sequence
Figure 6-15 Object tracking results for Children sequence
Figure 6-16 Object tracking results for Foreman sequence with 0.6 threshold value
Figure 6-17 Object tracking results for Foreman sequence with 0.8 threshold value
Figure 6-18 Object tracking results for Foreman sequence with 0.8 threshold value and 900 region spatial segmentation
Figure 6-19 Object tracking results for Mother and Daughter sequence with 15x15 structuring set for hole-filling
1. INTRODUCTION
1.1 Overview
In this chapter, the general context of the research undertaken is first described, highlighting the evolution of multimedia technologies in recent years. Current and future trends in multimedia research are briefly described in order to justify the choice of this research as a Masters project. The rest of the chapter is then devoted to describing the objectives of the research and the structure of the thesis. The main objectives of the research are explained according to their degree of importance in section 1.3. The structure of the thesis is then described in section 1.4.
1.2 Introduction
The conversion of thoughts into words, words into actions, and actions into pictures has brought today's multimedia services to an astonishing level. It is remarkable how much the technology has improved in the field of multimedia, particularly in recent years. People can nowadays consume combined media such as graphics, text, audio, and video in an interactive manner, and many more new functionalities can be expected in the future.
The most performance-demanding component of multimedia is undoubtedly video, which is addressed throughout this thesis. Video has entered various spheres of modern life, in particular through the current convergence of televisions and computers into new systems that can be used for entertainment and work at the same time. However, one would not have expected such milestones to come to pass within such a short time. The credit should be given to the researchers and system developers, working in different fields such as computing, telecommunications, and consumer electronics, who have done an excellent job of delivering the required services to the end users efficiently and quickly.
Clearly, the most well known and important application of video is television broadcasting. Television broadcasting became widespread after the Second World War. It has been developed in various phases, and the most important revolution was the concept of digital transmission of television, i.e. migration from analogue to digital, leading to various applications hitherto impossible. Most importantly, the merging of video and audio with computers and networks was seen in the mid nineties, leading to the revolutionary concept of multimedia. These issues are currently being further developed by the research community in an attempt to bring the whole world to a state called “virtual reality”.
The worldwide penetration of desktop computing in homes and businesses in recent years has stimulated many more interactive multimedia applications. These trends have been largely fuelled by the growing success of the Internet, which enables a large variety of applications. The largest part of the multimedia data stream is, however, occupied by the video information that needs to be transmitted along with other content. Delivering video looked almost impossible until recently, when new data compression and network technology issues were resolved to a certain extent. However, certain complications still remain if streamed video is to be delivered over a data network. This is due to the fact that streaming digital video data to the end users takes about one to three Megabits per second, which is currently beyond the bandwidth available for personalised user applications.
Whilst the bandwidth constraint remains to be resolved in World Wide Web (WWW) based multimedia applications, television service providers are dealing with the issues of realising two-way communication, which is needed if interactivity is to be provided at the end-user terminals. In other words, whatever transmission medium (i.e. cable TV, satellite TV, terrestrial TV) is used to deliver the service, a back channel is necessary to carry the control signals if new interactive services are to be brought to home television viewers. Some of the new interactive services could be movies on demand, news on demand, and interactive sports such as personalised sports viewing. In interactive TV services, a phone line is connected to set-top boxes so as to provide a back channel. However, the introduction of these interactive services causes the system to be highly client-server oriented, and it further increases the bandwidth and processing power requirements. The Asynchronous Transfer Mode (ATM) protocol and the Integrated Services Digital Network (ISDN) support the transport of all kinds of service data. Even with ATM, the amount of uncompressed video data is too high to handle, and therefore a mechanism for reducing the amount of data is necessary. To this end, efficient techniques for compressing audio and video, mostly developed within the Moving Picture Experts Group (MPEG) arena, are considered a viable solution. In this context, MPEG-1, MPEG-2 and MPEG-4 have largely contributed to resolving the compression issues to facilitate the high bandwidth demanding and interactive services in multimedia applications.
Recently, the provision of multimedia services has been carried over to mobile telephone applications. Due to the high bandwidth requirements of these services, currently available mobile communication services have not proved very successful in providing full services. However, third generation mobile, also known as the Universal Mobile Telecommunications System (UMTS), will support novel services such as multimedia messaging services, games, and virtual tourism for mobile phone users. Nevertheless, due to current problems such as the display, storage and processing limitations of mobile phones, it is still too early to predict the success of this innovation.
Unlike MPEG-1 and MPEG-2, MPEG-4 and MPEG-7 address content-based functionalities to facilitate numerous applications (see chapter 2). These content-based functionalities require some means of effectively describing the audio-visual content. In this context, it is necessary to facilitate easy access to the image data or the content itself if further processing (interaction, manipulation) needs to be performed to efficiently describe and encode it. This task is efficiently achieved by representing the content in the form of objects.
Keeping in mind the importance of rapidly growing multimedia applications, and particularly content-based functionalities, a project on “Semi-Automatic Video Object Segmentation for Multimedia Applications” was undertaken to fulfil a requirement for a research Masters degree.
1.3 Objectives of the Research
The primary objective of this research is to develop a tool for interactive video object segmentation. The main subject of the research work is how to extract semantically meaningful video objects from scenes, in order to facilitate future content-based multimedia applications. In general, the difficulty of extracting a particular object from a scene is application-dependent. Thus, this subject area can be discussed under two categories called “on-line” and “off-line” segmentation. On-line methods are also called unsupervised segmentation (i.e. no user interaction is involved), and off-line methods are also called supervised segmentation (i.e. user interaction is provided) (see also section 3.2.2). To this end, an attempt is made in our research to determine how difficult or easy it is to extract semantic objects from a given image in terms of the amount of user interaction involved. Furthermore, the task of segmenting video sequences is investigated by incorporating an object-tracking algorithm into the spatial object segmentation tool.
In order to achieve the primary goals, an ancillary objective of the research is to carry out a survey of various current and future trends in multimedia applications. The author's research is targeted at content-based functionalities coming under the auspices of standards such as MPEG-4 and MPEG-7, where video objects have been identified as the core element for content manipulation.
Other research objectives include studying and identifying various segmentation methods in both the spatial and temporal domains. Efforts are also made to assess the suitability of automatic and semi-automatic segmentation approaches. In this context, automatic spatial segmentation methods are first identified. Secondly, methods to hierarchically represent the automatically generated results are identified. Video object-tracking algorithms are also studied to identify a suitable method of extending this system to a complete video object segmentation tool.
The final objective of the research is to identify areas for future work. In this context, possible improvements in various components, such as the automatic segmentation process, the handling of user interaction, and region classification for object tracking, are considered to be important for further study.
In order to achieve the above goals, it was decided to develop the front-end system as a Graphical User Interface (GUI). The GUI allows the user to open image files and to run segmentation algorithms. Further operations such as segmentation tree browsing, region selection, region deletion, region splitting, and object extraction are provided in the form of simple buttons, mouse-clicks, mouse-drags, menus, and a slider bar. The entire implementation is carried out in the Java programming language under the Microsoft Windows operating system.
1.4 Structure of the Thesis
The thesis is organised as follows: Chapter 2 focuses on still picture and moving picture coding standards. This chapter starts with an introduction to the fundamentals of video coding in section 2.2, outlining the need for data compression and the means of achieving this. Various video formats and colour spaces are described in sections 2.3 and 2.4. Block-based video compression and segmentation-based video compression are then addressed in section 2.5, describing the manner in which the segmentation process is utilised in later MPEG standards, such as MPEG-4 and MPEG-7. Subsequently, still image compression standards, such as JPEG and JPEG2000, are described in section 2.6, outlining the evolution of the image coding standards. Moving picture compression and description standards, such as H.261, H.263, MPEG-1, MPEG-2, MPEG-4, and MPEG-7, are described in detail in section 2.7.
In chapter 3, a detailed discussion of still image segmentation is presented, covering most of the theoretical and conceptual aspects of segmentation. Starting with a brief introduction highlighting the need for segmentation in various applications, some basic definitions, such as regions vs objects and automatic segmentation vs semi-automatic segmentation, are given in sections 3.2.1 and 3.2.2. Some of the more well-known segmentation methods are then presented in section 3.2.3. In section 3.2.4, segmentation based on a region-merging technique is described, followed by a description of hierarchical segmentation in section 3.2.5. Automatic segmentation algorithms such as morphological watershed and RSST are described in sections 3.3.1 and 3.3.2 respectively. A segmentation representation method, BPT, is described in section 3.4.1.
The fourth chapter is devoted to video object tracking. This chapter is organised in such a way that the main emphasis is on describing various object-tracking algorithms, illustrating their suitability under various scene constraints. Hence, basic concepts along with some of the tracking difficulties encountered in practice are first described in section 4.1. This is then followed by a detailed discussion of well-known tracking algorithms, such as edge-based and region-based approaches, in section 4.2.
Implementation details are presented in chapter 5. Starting with an overview of the system, design details, illustrated by a system block diagram, are given in section 5.2. The implemented system, broken down into two stages, is further described in sections 5.2 and 5.3 under the headings Initial Object Segmentation and Object Tracking respectively. The GUI through which the user interaction is provided is described in section 5.4. Various graphical components of the system are individually addressed in section 5.5 to describe their importance within this system. The rest of the chapter, presented in sections 5.6 and 5.7, is a description of the software architecture of the system and some concluding remarks.
Experimental results and discussions are presented in chapter 6. Results obtained for several MPEG-4 test sequences are presented both for spatial object segmentation and image sequence segmentation in sections 6.2 and 6.3 respectively. For spatial object segmentation, results are discussed in the form of both automatically and semi-automatically generated segmentations. Object-tracking results obtained from several experiments, presented in section 6.3, are subject to a detailed discussion in order to evaluate the performance of the tracking algorithm.
Conclusions are drawn in chapter 7. A review of the work carried out within this thesis is given in section 7.2. This is followed by some recommendations for future work in areas that warrant further research and development.
Finally, publications arising from the research or associated with similar work are listed along with some appendices, which the reader might find useful.
2. STILL PICTURE AND MOVING PICTURE CODING STANDARDS
2.1 Overview
This chapter presents an overview of the most commonly used still image and video compression standards, such as JPEG, JPEG2000, H.261, H.263, MPEG-1, MPEG-2, MPEG-4, and MPEG-7. The chapter starts with an overview of data compression followed by a description of the different video formats and colour spaces which are commonly used in these standards. Block-based compression and segmentation-based compression aspects are then discussed in section 2.5. Still image compression standards, such as JPEG and JPEG2000, are described in section 2.6. A detailed description of moving picture compression and description standards is then given in section 2.7. Finally, in section 2.8, a discussion reviews these standards and briefly outlines the use of different standards for various applications.
MPEG standards are generally called audio/video compression standards. However, due to the recent emphasis on content-based functionalities (e.g. MPEG-4 and MPEG-7), a different terminology may be required. In this respect, it would be more appropriate to call them either multimedia coding and description standards or simply multimedia standards.
2.2 Overview of Data Compression
The efficient representation of digital image and video data has been the objective of a great research effort for more than two decades. The results achieved, in particular the wide set of international standards that have emerged, have provided a valuable basis for a large number of applications in several fields, including advanced personal communications (teleconferencing, video-telephony), remote operation (teleworking, distance learning), and interactive digital services (multimedia applications, TV programme production). In this framework, the growing availability of transmission links, the drastically improved data carrying capacity of computer networks, the enormous progress in digital signal processing, and the development of VLSI technology with application to image and video compression make visual communications more feasible than ever.
In this process, a fundamental role is played by efficient data representations. The idea of data compression is to remove redundant data from the data stream. Only efficient data compression techniques can reduce the data rate to a level suitable for cost-effective applications. The precise characteristics of such techniques depend on the available bandwidth, the application requirements, and the allowable complexity or cost of the processing equipment.
Data compression techniques can be divided into two major families: lossy and lossless. Lossy data compression concedes a certain loss of accuracy in exchange for greatly increased compression. Lossless compression consists of those techniques guaranteed to generate an exact replica of the input data stream after an encode/decode cycle. This type of compression is mandatory when transmitting or storing database records, spreadsheets, or word processing files, where a single bad bit could lead to disaster. Lossless compression uses two different types of modelling: statistical-based and dictionary-based. Shannon-Fano, Huffman, and arithmetic coding are examples of statistical-based methods, whereas LZ77, LZ78, LZSS and LZW are examples of dictionary-based methods. A detailed discussion of these data compression techniques is outside the scope of this thesis, but their details can be found in [Nelson and Gailly 1995].
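The lossless property described above can be illustrated with a short round trip. This is only a sketch and not part of the thesis software: it uses Python's standard `zlib` module, whose DEFLATE format combines an LZ77-style dictionary stage with Huffman coding, and the sample record is invented for illustration.

```python
import zlib

def lossless_round_trip(data):
    """Compress and decompress data, returning sizes and an equality flag."""
    compressed = zlib.compress(data, 9)      # 9 = strongest compression level
    restored = zlib.decompress(compressed)   # exact inverse of compress()
    return len(data), len(compressed), restored == data

# Hypothetical, highly repetitive record (illustrative data only).
record = b"name=Claire;frame=0001;regions=300\n" * 200
original_size, compressed_size, identical = lossless_round_trip(record)
print(original_size, compressed_size, identical)
```

The decoded bytes are identical to the input, which is exactly the guarantee that a lossy image or video coder gives up in exchange for a higher compression ratio.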
In the television broadcasting industry, the availability of video data in digital form and the ability to compress it to a data rate convenient for cost-effective transmission have facilitated bringing new services to television viewers. For example, a digitised raw PAL colour picture (720 x 576) at 8 bits per sample produces 1215 kBytes of data. Transmission of such frames at 25 frames/s requires about 158 Mbit/s of bandwidth. This situation is worse for High Definition Television (HDTV), which needs even higher resolution pictures and higher frame rates. Therefore, efficient video data compression is a much-needed requirement in this industry. MPEG-2 targets coding algorithms for HDTV applications at 4-9 Mbit/s [Sadka 2002].
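The arithmetic behind these figures can be checked with a few lines. The sketch below is an interpretation rather than something stated in the text: the 1215 kByte figure is reproduced with three full-resolution 8-bit samples per pixel, while the 158 Mbit/s figure is reproduced when 4:2:2 sampling (two samples per pixel on average) and binary prefixes are assumed.

```python
def raw_frame_bytes(width, height, samples_per_pixel, bits_per_sample=8):
    """Size in bytes of one uncompressed frame."""
    return width * height * samples_per_pixel * bits_per_sample // 8

def raw_bit_rate(frame_bytes, frames_per_second):
    """Uncompressed bit rate in bit/s."""
    return frame_bytes * 8 * frames_per_second

# PAL frame with three full-resolution samples per pixel (assumed 4:4:4).
frame_444 = raw_frame_bytes(720, 576, 3)
# PAL frame with 4:2:2 sampling: two samples per pixel on average (assumption).
frame_422 = raw_frame_bytes(720, 576, 2)

rate_444 = raw_bit_rate(frame_444, 25)
rate_422 = raw_bit_rate(frame_422, 25)

print(frame_444 / 1024)   # frame size in kBytes (1 kByte = 1024 bytes)
print(rate_444 / 1e6)     # Mbit/s with decimal prefixes
print(rate_422 / 2**20)   # Mbit/s with binary prefixes
```

Whichever sampling is assumed, the raw rate is well over an order of magnitude above the 4-9 Mbit/s range quoted for MPEG-2, which is the point the paragraph makes.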
The need for effective data compression is evident in almost all applications where storage and transmission of digital images are involved. For example, a 640 pixel x 480 line VGA colour image generates 921600 bytes of data if each colour channel is coded in one byte (8 bits). This requires about 1.9 minutes for transmission over a 64 kbit/s medium without compression [Netravali and Haskell 1995]. In terms of storage, a 700 MByte CD can hold only about 760 such images.
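The VGA example can be reproduced as follows. This is a minimal sketch; it assumes decimal prefixes (1 kbit = 1000 bits, 1 MByte = 10^6 bytes), which is the reading that matches the quoted numbers.

```python
WIDTH, HEIGHT, CHANNELS = 640, 480, 3   # VGA image, one byte per colour channel

image_bytes = WIDTH * HEIGHT * CHANNELS

# Transmission time over a 64 kbit/s link (assumed 64 000 bit/s).
link_bits_per_second = 64_000
transmission_minutes = image_bytes * 8 / link_bits_per_second / 60

# Number of raw images that fit on a 700 MByte CD (assumed 700 * 10**6 bytes).
cd_bytes = 700 * 10**6
images_per_cd = cd_bytes / image_bytes

print(image_bytes, round(transmission_minutes, 2), round(images_per_cd))
```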
Luckily, compression of image data without significant degradation of the visual quality is usually possible because images contain high degrees of:
• Spatial and temporal redundancy
- This is due to correlation between neighbouring pixels within the frame, and between frames in the case of video sequences;
• Psychovisual redundancy
- The eye-brain mechanism creates what is known as perceptual redundancy.
By exploiting these properties, the visibility of impairments caused by the bit reduction can be minimised. The higher the redundancy, the higher the achievable compression. The performance of any compression system can be evaluated by considering its apparent image quality, data rate, and system cost.
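Spatial redundancy can be made concrete with a toy example. The sketch below is illustrative only and is not a method used in the thesis: it takes a synthetic image row containing a smooth ramp and compares the first-order entropy of the raw pixel values with that of the differences between neighbouring pixels.

```python
import math
from collections import Counter

def first_order_entropy(symbols):
    """Shannon entropy in bits per symbol, ignoring inter-symbol dependence."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Synthetic row of pixels: a smooth ramp from 0 to 255 (illustrative data).
row = list(range(256))

# Predict each pixel from its left neighbour and keep only the prediction error.
deltas = [row[0]] + [b - a for a, b in zip(row, row[1:])]

raw_bits = first_order_entropy(row)       # every value occurs exactly once
delta_bits = first_order_entropy(deltas)  # almost every difference equals 1

print(raw_bits, delta_bits)
```

Because neighbouring pixels are strongly correlated, the differences are far more predictable than the pixel values themselves, and an entropy coder needs correspondingly fewer bits per sample.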
2.3 Different Video Formats
In image processing, digitised video is represented in various different formats. The fact that the human visual system is more sensitive to changes in luminance than to changes in chrominance was taken into consideration when the CCIR-601 recommendation was created to define digitisation parameters for video in YCbCr component 4:2:2 format. In 4:2:2 format, only two samples of Cb and two samples of Cr are used for every four samples of Y, thereby reducing the horizontal resolution of the chrominance components by half. The following video formats were later derived from CCIR-601: Source Input Format (SIF) for video frames in digital television, and Common Intermediate Format (CIF) for frames in video conferencing. For applications at even lower bit rates, Quarter-CIF (QCIF) has been defined.
Table 1 shows different video formats used in digital video processing. An in-depth discussion of this topic is not provided since it is beyond the scope of this thesis.

Image Format         Resolution (Luminance)   Sub-sampling (YCbCr)
CCIR-601 (NTSC)      720 x 480                4:2:2
CCIR-601 (PAL)       720 x 576                4:2:2
SIF (NTSC)           360 x 240                4:2:0
SIF (PAL)            360 x 288                4:2:0
CIF (NTSC & PAL)     352 x 288                4:2:0
QCIF (NTSC & PAL)    176 x 144                4:2:0

Table 1 Different video formats
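The effect of the sub-sampling column can be quantified by counting samples per frame. The following is a small sketch under the usual interpretation of the notation, which is assumed here: 4:2:2 halves the chrominance resolution horizontally, and 4:2:0 halves it both horizontally and vertically. The function and dictionary names are illustrative.

```python
# Horizontal and vertical chroma decimation factors for each scheme (assumed).
CHROMA_DIVISORS = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}

def samples_per_frame(width, height, scheme):
    """Return (luminance samples, total Cb + Cr samples) for one frame."""
    h_div, v_div = CHROMA_DIVISORS[scheme]
    luma = width * height
    chroma = 2 * (width // h_div) * (height // v_div)  # Cb plane + Cr plane
    return luma, chroma

ccir_pal = samples_per_frame(720, 576, "4:2:2")
cif = samples_per_frame(352, 288, "4:2:0")
qcif = samples_per_frame(176, 144, "4:2:0")

print(ccir_pal, cif, qcif)
```

With 4:2:2 the two chrominance planes together carry as many samples as the luminance plane, whereas with 4:2:0 they carry only half as many.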
2.4 Different Colour Spaces
Colour images and videos are usually displayed in a basic colour space model called RGB. However, in image processing, colour images are represented in a number of different colour-space models for compression and representation purposes. The selection of a particular colour space depends on its effectiveness for the application (video coding, video processing, content representation, etc.) for which the image data is to be used. For this reason, different colour transformations are often used in research in order to optimise the performance of the application [Manjunath et al 2001]. Of the several models available, the most important and widely used ones, such as {R, G, B}, {Y, Cb, Cr}, {H, S, V}, {C, M, Y, K}, and {H, M, M, D}, are briefly described below.
RGB: This represents the three primary colours Red (R), Green (G) and Blue (B). This colour model is mainly used for display and capture purposes.
YCbCr: This is the most commonly used ITU standard colour space for compression and transmission purposes in video coding systems. “Y” represents the brightness (luminance) and “Cb”, “Cr” represent colour (chrominance). In this model, the brightness component is decoupled from the colour components. YUV is another very similar colour space (the two names are often used interchangeably), and is used in the PAL TV system [Vasudev and Konstantinos 1997]. Both these models are linear transformations of the RGB components.
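As a minimal sketch of such a linear transformation, the following converts one RGB pixel to YCbCr. The thesis does not state which variant of the transform is meant; the coefficients below are an assumption, namely the full-range 8-bit form with the CCIR-601 luma weights as commonly used with JPEG files.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (assumed variant)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


# A neutral grey carries no colour information: Cb and Cr sit at mid-scale.
grey = rgb_to_ycbcr(128, 128, 128)
```

Because every output is a weighted sum of R, G and B plus a constant offset, the whole conversion can equally be written as a 3 x 3 matrix multiplication.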
HSV: The HSV colour space is another very commonly used colour model in image retrieval applications, but the transformation associated with it is more complicated than those above. It consists of three components called Hue (H), Saturation (S), and Value (V). This space is also known as HSI, where “I” stands for Intensity. “H” and “S” correspond to colour information, while “V” or “I” corresponds to brightness. Hue specifies the pure colour, while S specifies the amount of white light mixed with the pure colour. Once again, colour and intensity are independent, as in the YCbCr system.
CMYK: This is a somewhat strange acronym representing Cyan (C), Magenta (M), Yellow (Y), and Black (K). However, it is a rather simple linear transformation that represents the secondary colours, which are also called subtractive colours. The model was originally CMY, and a fourth component K was introduced later because the CMY system fails to produce black [Efford 2000]. This is a useful model for printing colour images [Efford 2000].
HMMD: This is a similar and competitive model to HSV. There are four components in this non-linear transformation model, representing Hue (H), Min (M), Max (M), and Distance (D). Min indicates the tint property, giving an idea of how much white the image contains, whereas Max indicates the shade property, giving an idea of how much black the image contains. Distance indicates how much grey the image is composed of. This space may be useful in image retrieval applications, for example, in MPEG-7.
2.5 Block-based Compression and Segmentation-based Compression
In video coding, compression can be achieved either in a block-based manner or in a segmentation-based (non-block-based) manner. In this section, the two approaches are discussed with regard to their use in compression, also highlighting their role in past and current MPEG standards.
The level of abstraction introduced within these standards enables us to categorise them into two groups, revealing that the MPEG standards are now migrating from issues of conventional compression to new issues of content-based functionality. It is well known that the MPEG-1 and MPEG-2 standards mostly follow the same coding techniques, despite the fact that they are targeted at different applications. They work on small square blocks of pixels, efficiently performing compression by using motion estimation/compensation along with transform coding and entropy coding techniques. Consequently, these block-based compression approaches cause a type of image distortion called "blocking artefacts" due to imperfect reconstruction. Perfect reconstruction cannot always be guaranteed, particularly at high compression ratios. Blocking artefacts are visible as spatial discontinuities, which are highly undesirable in some applications.
On the other hand, the next generation of multimedia standards, such as MPEG-4 and MPEG-7, mostly work on the basis of arbitrarily shaped regions and objects instead of small square blocks. It should be noted that, in MPEG-7, such regions and objects are not used in the context of compression, as the MPEG-7 standard has a different scope from that of earlier standards. MPEG-4 based applications exploit segmentation as a non-normative part of the standard in order to extract and encode arbitrarily shaped objects. The shape encoding within MPEG-4 is carried out using a type of bitmap-based coder, which classifies each pixel of the block to be coded as belonging to the object or not. Hence, in MPEG-4, compression between frames is achieved by combining features of both segmentation and conventional block-based techniques. A detailed discussion on image and video segmentation is given in chapters 3 and 4.
In a segmentation-based compression approach, since the scene can be separated into several entities, each entity can be encoded according to the requirements of the user. This method of encoding entities separately can also improve the subjective coding efficiency, since the coding can be performed according to the user’s best interest instead of using one single technique for the entire process. However, an extra cost has to be paid for encoding the shape of the entity compared to a fixed-shape square block. One method of shape encoding is known as contour-based coding, where the outline or boundary of the object is coded by using techniques such as chain coders or polygon approximations. The higher cost involved in contour coding is considered to be a bottleneck in segmentation-based coding approaches, since the contour coding occupies a larger bit-stream compared to that of the texture coding. This highlights the fact that segmentation-based compression techniques need a powerful shape-coding tool.
2.6 Still Image Compression Standards
Still image compression standards were developed for coding single-frame colour images. Committees such as the ISO/ITU Joint Photographic Experts Group (JPEG) and the ISO/ITU Joint Bi-level Image Group (JBIG) have contributed to this task over the last few decades. The term “Joint” arises because of the collaboration of the two groups, ISO and ITU. In this thesis, only a short description of the JPEG standard is given, in order to outline the historical development of image coding standards. JPEG was published as an international standard in 1992 [Wallace 1992]. In JPEG, compression is achieved either in a lossy or lossless manner. There are several modes defined within JPEG, called baseline, lossless, progressive, and hierarchical. Our discussion is only concerned with the most popular mode, i.e. baseline mode, which is based on lossy coding only. Furthermore, the latest JPEG standard, JPEG2000, is also described. Other recent image compression standards, such as JPEG-LS, MPEG-4 Visual Texture Coding (VTC), and Portable Network Graphics (PNG), are considered to be outside the context of our discussion.
2.6.1 ISO/IEC JPEG Coding
JPEG can be considered to be one of the most successful and well-recognised generic image compression standards to come into being. Due to its successful performance, it has been used in a wide range of transmission and storage applications to date. This image format is widely used for digital photography, desktop publishing, the Internet, and so on. In JPEG, transform coding is based on the Discrete Cosine Transform (DCT), which is used to convert image data from the spatial domain to the frequency domain. The DCT is also an orthogonal transformation technique, in the sense that its inverse transformation produces an exact replica of its input data. Compression can be achieved due to its ability to pack most of the energy into a few coefficients, enabling the other DCT coefficients to be discarded without significant quality loss. The DCT is widely used by other well-established standards, such as H.261, H.263, MPEG-1, MPEG-2, and MPEG-4. The actual compression ratio obtainable from this standard can vary from 100:1 to 2:1 depending on the specific application and encoder/decoder complexity [Rao and Hwang 1996].
Although colour conversion is part of the redundancy removal process, it is not part of the JPEG standard. JPEG handles colours as separate components in commonly used colour spaces, such as RGB, YCbCr, and CMYK. For each separate colour component, the full image is decomposed into 8 x 8 blocks of pixels, which form the input to the DCT. Typically, within an 8 x 8 block the pixel values vary slowly, and hence the energy is concentrated at low spatial frequencies. Thus, a 2D DCT is quite capable of concentrating the energy into a few coefficients, thereby facilitating a high degree of compression. The block diagram of the JPEG coding process is shown in Figure 2-1.
[Block diagram. Encoder: IN -> Image De-correlator -> Quantiser -> Entropy Encoder -> Storage/Tx Medium. Decoder: Storage/Tx Medium -> Entropy Decoder -> Dequantiser -> Image Correlator.]
Figure 2-1 JPEG Coding Model
The operation of the JPEG algorithm can be divided into three basic stages:
• The removal of spatial redundancy by means of the DCT, i.e. decorrelation;
• The quantisation of the DCT coefficients using weighting functions optimised for the human visual system;
• The encoding of the data to minimise the entropy of the quantised DCT coefficients using Huffman variable length coding.
The image decorrelator is the first step in the JPEG coding method. This process generates the same number of coefficients as pixels, but the upper left corner coefficient is now called the DC coefficient while the rest are called the AC coefficients. The AC coefficients represent increasing horizontal and vertical spatial frequencies. An ideal transform should completely decorrelate the data in a block, i.e. it should pack the greatest amount of energy into the smallest number of coefficients. The DCT has been found to be extremely efficient for highly correlated data. The number of non-zero AC coefficients, and hence the overall compression, varies with the uniformity of the block. In the limiting case of a perfectly uniform block, only the DC coefficient will be non-zero.
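The behaviour for a perfectly uniform block can be checked with a small sketch of the 8 x 8 forward DCT. This is a direct, unoptimised evaluation of the DCT-II with the normalisation used in JPEG; the level shift that JPEG applies to the samples before the transform is omitted here for simplicity, and practical encoders use fast factorisations instead.

```python
import math

N = 8


def dct_2d(block):
    """Direct 2-D DCT-II of an N x N block (JPEG-style normalisation)."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = 0.25 * c(u) * c(v) * total
    return out


# A perfectly uniform block: every pixel has the same value.
uniform = [[100] * N for _ in range(N)]
coeffs = dct_2d(uniform)
# coeffs[0][0] is the DC coefficient; all AC coefficients vanish
# (up to floating-point rounding).
```

With this normalisation the DC coefficient equals eight times the mean pixel value of the block.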
The second step is the quantisation of the frequency coefficients. The input to the DCT consists of eight-bit pixel values, but the coefficient values can range from a low of -1,024 to a high of 1,023, occupying eleven bits. The action used to reduce the number of bits required for storage of the DCT matrix is referred to as “Quantisation”. The JPEG algorithm implements quantisation by using a quantisation matrix. For every element position in the DCT matrix, a corresponding value in the quantisation matrix gives a quantum value ranging from 1 to 255, by which the coefficient is divided before rounding.
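A minimal sketch of this step follows. The quantisation matrix used here is hypothetical (it simply grows with frequency) and is not one of the example tables given in the JPEG standard; the helper names are likewise illustrative.

```python
def make_quant_matrix(base=16, step=4, n=8):
    """Hypothetical matrix: coarser quanta at higher frequencies, capped at 255."""
    return [[min(255, base + step * (u + v)) for v in range(n)]
            for u in range(n)]


def quantise(coeffs, q):
    """Divide each DCT coefficient by its quantum and round to an integer."""
    return [[int(round(c / s)) for c, s in zip(crow, qrow)]
            for crow, qrow in zip(coeffs, q)]


def dequantise(levels, q):
    """Decoder side: multiply each level by its quantum (the loss remains)."""
    return [[lv * s for lv, s in zip(lrow, qrow)]
            for lrow, qrow in zip(levels, q)]


q = make_quant_matrix()
coeffs = [[0.0] * 8 for _ in range(8)]
coeffs[0][0] = 800.0   # DC coefficient
coeffs[0][1] = -33.0   # a low-frequency AC coefficient
coeffs[7][7] = 5.0     # a small high-frequency AC coefficient

levels = quantise(coeffs, q)
restored = dequantise(levels, q)
```

Small high-frequency coefficients are rounded to zero, which is where most of the compression (and the loss) comes from; the reconstruction error per coefficient is at most half the corresponding quantum.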
Next, the coding model rearranges the quantised frequency coefficients into a zigzag pattern, with the lowest frequency first and the highest frequency last. The reason for using the zigzag pattern is to increase the run-length of zero coefficients found in the block. Since the DC coefficients of successive blocks often differ only slightly, they are coded as the difference between the quantised DC coefficient of the current block and the quantised DC coefficient of the previous block, which conforms to the simplest form of predictive coding, called Differential Pulse Code Modulation (DPCM). The quantised AC coefficients usually contain runs of consecutive zeros, which can be exploited by run-length coding to achieve more compression.
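The zigzag scan, the DPCM coding of DC coefficients and the run-length coding of AC coefficients can be sketched as follows. The (run, value) pairs and the "EOB" (end of block) marker are a simplified representation chosen for illustration; they are an assumption and do not reproduce the exact symbol format of the JPEG bit-stream.

```python
def zigzag_order(n=8):
    """(row, col) positions of an n x n block in zigzag scan order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))


def zigzag_scan(block):
    """Flatten a square block, lowest frequencies first."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


def dpcm_encode(dc_values):
    """Code each DC value as the difference from the previous block's DC."""
    diffs, prev = [], 0
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs


def run_length_encode(ac):
    """Code AC values as (zeros_before, value) pairs, then an end marker."""
    pairs, run = [], 0
    for v in ac:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append("EOB")  # any trailing zeros are implied by the marker
    return pairs


block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 50, 3, -1

scan = zigzag_scan(block)
ac_pairs = run_length_encode(scan[1:])   # scan[0] is the DC coefficient
dc_diffs = dpcm_encode([50, 52, 49])     # DC values of three successive blocks
```

In this example the 63 AC coefficients collapse to two pairs and an end marker, which illustrates why the zigzag ordering is applied before run-length coding.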
The last step in JPEG coding is the entropy coding. The block codes from the DPCM and run-length models can be further compressed using entropy encoding, given a conversion table. Huffman coding is used for this purpose.
Despite its use in a large number of applications, the JPEG image format has not proved to be useful in certain applications due to some of its drawbacks, such as blocking effects at high compression (or low bit-rates), poor support for error resilience, lack of scalability, no support for region-based or object-based representation of the content, etc. Taking these issues into consideration has led to the development of the next generation image-coding standard, called JPEG2000.
Scalability (which is also explained in section 2.7.4 for MPEG-2) is an important feature that the JPEG2000 standard aims to support. The scalability feature is useful for extracting images of a quality suited to the application by decoding only a part of the bit stream. This is supported in two forms: quality (SNR) scalability and resolution (spatial) scalability. Resolution scalability allows end-users to extract images of different sizes, which can reduce the time taken to receive the image. SNR scalability allows images of the same spatial resolution to be extracted at different qualities.
2.6.2 ISO/IEC JPEG2000 Coding
A very interesting and promising still image-coding standard, which has recently been published, is described in this section. Its superiority over JPEG is not limited to better performance in terms of subjective image quality and compression efficiency; it also offers many other functionalities and features that were only partly covered (or totally impossible) in the existing system. The JPEG2000 standard consists of seven parts (i.e. part I to VII) [Grosbois et al. 2001]¹. The image coding system, or baseline algorithm, is described in part I of the standard, whereas extensions to the core coding system are described in part II. The overall performance of this standard is compared with other recent standards, such as PNG, JPEG-LS and MPEG-4 VTC, in the literature [Christopoulos 2000], [Santa-Cruz et al. 2000]. However, such an exhaustive discussion is not given in this thesis.
This coding system is based on the use of wavelet technology rather than on DCT technology. This system is being developed to satisfy current and future multimedia requirements, such as client-server based applications, real-time applications, and other applications that demand high image quality. It is expected that the use of the original JPEG standard will be overtaken by JPEG2000 in the future, in application areas ranging from portable digital video cameras to advanced medical imaging.
1 It should be noted that some new parts, i.e. part VIII to X, have just been started, and the part VII has been omitted, leaving the standard to contain nine distinct parts. This news is only recently published on the JPEG2000 web site http://www.jpeg.org/JPEG2000.htm.
However, the intention is not to replace the original JPEG standard but to complement it [Christopoulos et al. 1999].
The system block diagram of JPEG2000 remains the same as that illustrated in Figure 2-1. There are two main differences between the two systems. First, the image decorrelator is based on the forward Discrete Wavelet Transform (DWT) instead of the forward DCT used in the original JPEG standard. Second, the entropy coding method in the JPEG2000 system is based on binary arithmetic coding as opposed to the Huffman coding of the original JPEG standard. The basic encoding process can be explained as outlined below, while the decoding process corresponds to the exact reverse process. As in JPEG, images are first transformed into colour components, and each colour component is decomposed into rectangular blocks of pixels.
The forward DWT algorithm for image decorrelation is applied to basic units of the image components. These basic units are called “tiles” in the JPEG2000 standard. These tiles are different from the 8 x 8 blocks used in the JPEG standard: tiles may be of arbitrary size, up to and including the entire image [Christopoulos et al. 1999]. The wavelet transformation decomposes the tiles into different resolution levels, providing a number of subbands of coefficients. These coefficients represent the horizontal and vertical spatial frequency characteristics of the tile component.
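The idea of decomposing a signal into resolution levels and subbands can be illustrated with a one-dimensional sketch. Note the assumption: for brevity this uses the integer Haar (S) transform, which is not one of the wavelet filters specified by JPEG2000, and it operates on a single row of samples rather than on a two-dimensional tile. All function names are illustrative.

```python
def haar_forward(x):
    """One level of the integer Haar (S) transform on an even-length list."""
    low = [(a + b) >> 1 for a, b in zip(x[0::2], x[1::2])]   # local averages
    high = [a - b for a, b in zip(x[0::2], x[1::2])]         # local details
    return low, high


def haar_inverse(low, high):
    """Exactly undo haar_forward."""
    out = []
    for s, d in zip(low, high):
        a = s + ((d + 1) >> 1)
        out.extend([a, a - d])
    return out


def decompose(x, levels):
    """Repeatedly split the low band; length must be divisible by 2**levels."""
    low, subbands = list(x), []
    for _ in range(levels):
        low, high = haar_forward(low)
        subbands.append(high)      # finest detail band first
    return low, subbands


def reconstruct(low, subbands):
    for high in reversed(subbands):
        low = haar_inverse(low, high)
    return low


row = [10, 12, 14, 200, 202, 198, 9, 7]
coarse, details = decompose(row, levels=2)
restored = reconstruct(coarse, details)
```

Decoding only `coarse` gives a quarter-resolution approximation of the row, while adding the detail bands back restores it exactly, which is the mechanism behind resolution scalability.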
As in the JPEG standard, the second step is the quantisation of the frequency coefficients. In this step, scalar quantisation is used, after which the quantised coefficients are arranged into rectangular arrays called “code-blocks”.
In the last step, the code-blocks are fed to the arithmetic coding stage to achieve further compression. The coded data is arranged into layers and output as the code-stream in packets.
The JPEG2000 standard provides additional features to support specific applications. Some of the most promising and important features are:
• Ability to specify Regions of Interest (ROI): This allows a user to encode a certain part of the image at better quality or even losslessly, which is very important in high-end applications, such as medical imaging [Christopoulos et al. 1999]. In other words, it enables a non-uniform distribution of the image quality within the image being coded [Grosbois et al. 2001];
• Random code-stream access: The user-defined ROI can be randomly accessed or processed according to the user’s interest. This is possible due to the individual coding of the blocks and the packetised structure of the code-stream;
• Error resilience: This is achieved by using “resync” markers to make the system reliable in the presence of high transmission error rates;
• The ability to include image security and content metadata: This corresponds to the protection of the image by encryption and watermarking. Inclusion of content metadata is very useful for today’s e-commerce applications.
In JPEG2000, ROI coding can be achieved in two ways: ROI Maxshift and ROI scaling. The ROI Maxshift method, defined in part I of the standard, enables ROI shape coding by downshifting the background coefficients below the ROI coefficients according to a scaling value (S1). This is shown in Figure 2-2a. The scaling value is chosen in such a way that the smallest non-zero ROI coefficient becomes larger than the largest non-zero background coefficient, thereby avoiding an overlap between the ROI coefficients and background coefficients [Boliek et al. 2000a]². Therefore, no ROI mask is required to separate the ROI coefficients and background coefficients at the decoder end. This is because ROI coefficients can be separated from the background coefficients by comparing them with a threshold, which is derived from the scaling value encoded in the code-stream. A very important advantage of this method is that it enables coding of any arbitrarily shaped region in the image. One disadvantage of this method compared with the ROI scaling method is that downshifting the background coefficients by the Maxshift scaling value increases the number of bit-planes, and hence there is more data to encode (i.e. high overhead) [Grosbois et al. 2001].
The ROI scaling method, on the other hand, is defined in part II of the standard, and is known to be a generic method for ROI coding [Grosbois et al. 2001]. It is generic because any scaling value can be used to shift the background coefficients, allowing an overlap between the ROI coefficients and background coefficients. As depicted in Figure 2-2b, these coefficients are positioned by downshifting the background coefficients towards the least significant bit-planes according to a scaling value (S2) [Grosbois et al. 2001]. The disadvantage of this method is that a bit mask needs to be derived to define, in each subband, which coefficients belong to the ROI, and these bit masks need to be transmitted to explicitly define the shape of the ROI. This is because it is not otherwise possible to distinguish ROI coefficients from background coefficients at the decoder. The high cost involved in the bit mask encoding process imposes the constraint that the ROIs be a combination of rectangular and elliptical regions only [Grosbois et al. 2001]. Also, this type of ROI coding requires a JPEG2000 part II decoder at the receiver side. A detailed description of these ROI coding methods can be found in [Boliek et al. 2000a] and [Boliek et al. 2000b].
[Bit-plane diagrams, MSB to LSB, of ROI coefficients and background coefficients. a. ROI Maxshift operation: background shifted by scaling value S1, no overlap with the ROI coefficients. b. ROI scaling operation: background shifted by scaling value S2, with overlap.]
Figure 2-2 ROI coding
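The Maxshift principle, i.e. that a threshold derived from the scaling value is enough to recover the ROI without a mask, can be sketched on a flat list of integer quantised coefficients. This is an illustration under stated assumptions rather than the procedure of the standard: the ROI coefficients are multiplied by 2^s here, which gives the same relative bit-plane separation as downshifting the background, and the function names are hypothetical.

```python
def maxshift_encode(coeffs, roi_mask):
    """Separate ROI from background by a power-of-two scaling value s."""
    bg_max = max((abs(c) for c, m in zip(coeffs, roi_mask) if not m),
                 default=0)
    s = bg_max.bit_length()          # smallest s with 2**s > bg_max
    shifted = [c * (1 << s) if m else c
               for c, m in zip(coeffs, roi_mask)]
    return shifted, s


def maxshift_decode(shifted, s):
    """Recover coefficients and the ROI mask from the scaling value alone."""
    threshold = 1 << s
    out, mask = [], []
    for c in shifted:
        if abs(c) >= threshold:      # only ROI coefficients reach this level
            magnitude = abs(c) >> s
            out.append(magnitude if c >= 0 else -magnitude)
            mask.append(True)
        else:
            out.append(c)
            mask.append(False)
    return out, mask


coeffs = [5, -3, 12, 0, 7]
roi = [False, True, True, False, True]

shifted, s = maxshift_encode(coeffs, roi)
decoded, recovered_mask = maxshift_decode(shifted, s)
```

Every non-zero ROI coefficient ends up at or above the threshold 2^s while every background coefficient stays below it, so no overlap occurs; the price is the s extra bit-planes noted in the text. A ROI coefficient that is exactly zero cannot be told apart from background, but its value is still decoded correctly.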
2.7 Moving Picture Compression and Description Standards
The Moving Picture Experts Group (MPEG) was established in 1988, with the mandate to develop standards for the coded representation of moving pictures and audio. Within the initial phases of this framework, various algorithms were developed to compress moving pictures for efficient storage and transmission on various digital media.
The MPEG-1 and MPEG-2 audio and video coding standards have attracted much attention over the last several years, while an increasing number of hardware and software implementations of these standards are becoming commercially available. MPEG-1 has made a remarkable impact on audio/visual CD-ROM applications, with various implementations in both software and hardware. The MPEG-2 standard forms the basic element of existing and future digital TV chip sets, and has been adopted for emerging HDTV applications. Another recent MPEG standard, MPEG-4, is targeted at content-based multimedia applications. The most recent MPEG standard, MPEG-7, is to provide a Multimedia Content Description Interface that will further help to develop computer-based multimedia applications and e-commerce applications.
Our discussion is not limited to MPEG. Standards such as H.261 and H.263, which are recommended by the ITU-T for low bit-rate videoconferencing applications, are also considered. Some of the earlier MPEG standards borrowed technologies from JPEG, and more directly from H.261. This means that the MPEG coding systems exploit most of the methodologies of the video conferencing standards as well as some other new techniques, most notably motion compensated interpolation.
A video sequence can be considered as a sequence of still images to be coded individually. However, temporal redundancy cannot be exploited if the image frames in a video sequence are coded individually. In video coding, there are two main modes, called intra-frame and inter-frame. The intra-frame mode follows the same techniques as JPEG. More compression is then achieved in the inter-frame mode by exploiting the temporal redundancy available between frames.
The use of temporal and spatial redundancies in video compression involves both lossy and lossless transformations. Video coding is generally based on a Group of Pictures (GOP), which consists of three types of frames called I-frame, P-frame and B-frame (see section 2.7.3). I-frames are self-contained and compressed in a similar manner to JPEG. P-frames and B-frames are compressed using motion compensated prediction or interpolation between two reference frames. Therefore, coding a video sequence generally incorporates spatial compression (transform coding, quantisation and entropy coding) performed on the first frame of the GOP, calculation of motion vectors for every 16 x 16 block (macroblock) of the P and B-frames, and compression of the block-difference calculated between the actual and predicted blocks for every P and B-frame (similar to the spatial compression mentioned above). On the other hand, the decoding process, or reconstruction of a video sequence, basically corresponds to decoding the I-frame, motion vectors and block-difference information of each GOP, which were coded at the encoder side. Reconstruction of the P and B-frames at the receiver end involves motion compensating each macroblock using the decoded motion vectors.
The following discussion describes the use of various redundancies inherent in natural image sequences, and how they can be utilised to suit different application requirements.
2.7.1 ITU-T H.261
The ITU-T expert group on visual telephony produced the H.261 standard in 1990. It deals with a video codec (encoder and decoder) for audio-visual services at p x 64 kbit/s. It should be noted that H.261 is not the first video conferencing standard for coding digital video. The first video conferencing standard, H.120, has been overshadowed by the H.261 standard due to the better quality and compression efficiency of the latter. The H.120 standard was developed by a smaller research group, COST 211 [Whybray et al. 1997], at data rates close to 2 Mbit/s. The system was based on Conditional Replenishment Coding (CRC) [Netravali and Haskell 1995]. CRC was found to be appropriate at that time for low bit rate video conferencing and videophone applications, due to the fact that the activity between two successive frames was relatively small. This is due to the stationary cameras used and the slowly changing scenes encountered in those applications. However, our discussion is mainly devoted to the more advanced H.261 and H.263, which were the later standards targeted at the above applications.
The H.261 standard is a superior technology to the H.120 standard, and it was a basis for H.263 and some MPEG standards. The data rate within H.261 can be set to vary in integer multiples of 64 kbit/s, according to the well-known expression p x 64. The parameter "p" can range from 1 to 30, providing a minimum data rate of 64 kbit/s and a maximum of 1920 kbit/s. The selection of p (1, 2, 3 and so on) depends on the rate/distortion requirement of the application. The video formats used in H.261 are CIF and QCIF. A block diagram of the H.261 coding system, extracted from [Whybray et al. 1997], is shown in Figure 2-3.
Figure 2-3 H.261 coding system
The three main components in the system are the DCT for transform coding, Huffman coding for Variable Length Coding (VLC), and motion estimation/compensation for predictive coding. When compared with the JPEG system, the main difference is the presence of the third component, i.e. predictive coding, which is used to exploit the inter-frame redundancy, while intra-frame redundancy is exploited using the first two components. The input picture data is fed to the subtracter in 8 x 8 blocks of pixels, and then to the DCT, as in JPEG. A coding unit always consists of four 8 x 8 blocks of luminance, one 8 x 8 block of U, and one 8 x 8 block of V, which is referred to as a “macroblock” in video coding.
In H.261, only two types of pictures, namely Intra (I) and Predicted (P), are used. P-pictures are generated using forward predictive coding based on motion estimation and compensation. Prediction of the future frame is facilitated by the decoding module, which is incorporated in the feedback loop of the encoder. This feedback module consists of the inverse DCT (IDCT) and the inverse quantiser (IQ), but not the VLC. This is because only the DCT and quantiser processes are lossy, whereas the VLC is not. The prediction method based on motion estimation is very costly in terms of computation, accounting for close to 60% of the overall computational load [Vasudev and Konstantinos 1997]. The motion estimation process is, however, a non-normative part of the standard. Further research on motion estimation is therefore generally considered valuable as a means of improving overall efficiency.
The system functionality can be briefly described as follows. The first picture in the sequence is always coded as intra, and the actual pixels are subject to DCT, quantisation, run-length, and VLC coding. In inter-frame mode, the difference between the actual frame and the motion compensated frame is coded using the same techniques. In each picture, the DCT coded and quantised blocks are subject to the reverse transformation, and the previous picture is regenerated by adding the inverse transformed data to the compensated block. These blocks are organised in the picture store (memory) for the subsequent motion estimation and compensation processes. For this, the current picture is fed to the motion estimation block. Motion Vectors (MVs), calculated in the motion estimation process, are fed to the motion compensation block and the VLC block. The motion-compensated macroblocks are then fed to the loop filter to remove unwanted high frequency noise, which, in turn, reduces the visibility of the blocking effects. The MVs fed to the VLC are entropy coded and sent in the coded datastream so that the decoder can determine the correct location of the prediction of the current macroblock. A constant bit-rate of the coded bitstream can be maintained by monitoring the state of the output buffer and using the rate control feedback loop, which, in turn, controls the quantisation process. This process sets relevant quantisation levels for different macroblocks depending on the amount of data being generated in the coding process. In the encoder, macroblocks can be encoded as INTRA or INTER. The ability to switch between these modes makes it possible to minimise the propagation of coding errors in the encoding chain.
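Since motion estimation is a non-normative part of the standard, encoders are free to choose their own search strategy. The sketch below assumes one common choice, an exhaustive (full-search) block matching with the sum of absolute differences (SAD) as the matching cost; the function names, the small block size and the synthetic frames are illustrative only.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between a block and a displaced reference."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n=16, search=7):
    """Return the motion vector (dx, dy) with the lowest SAD, and that SAD."""
    h, w = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that fall outside the reference picture.
            if by + dy < 0 or by + dy + n > h or bx + dx < 0 or bx + dx + n > w:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Synthetic 16 x 16 frames: the current frame shows the reference content
# displaced by 2 pixels horizontally and 1 pixel vertically.
ref = [[x + 100 * y for x in range(16)] for y in range(16)]
cur = [[(x + 2) + 100 * (y + 1) for x in range(16)] for y in range(16)]

mv, cost = full_search(cur, ref, bx=4, by=4, n=4, search=3)
```

The nested loops over every candidate displacement and every pixel indicate why this stage dominates the computational load of the encoder; the prediction error that is subsequently transform coded is the pixel-wise difference behind the reported `cost`.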
H.261 has several interesting characteristics:
• It defines only the decoder, while ensuring that the encoder is compatible with the decoder. For example, past implementations have shown that motion estimation/compensation, the rate control feedback mechanism, the loop filter, pre-processing, and post-processing are open to variations [Vasudev and Konstantinos