Using contour information and segmentation for object registration, modeling and retrieval


USING CONTOUR INFORMATION AND SEGMENTATION FOR OBJECT REGISTRATION, MODELING AND RETRIEVAL

by

Tomasz Adamek, M.Sc.

A thesis submitted in partial fulfilment of the requirements for the degree of Ph.D in Electronic Engineering

Supervisor: Dr. Noel E. O'Connor

School of Electronic Engineering, Dublin City University

June, 2006


Declaration

I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Ph.D in Electronic Engineering is entirely my own work and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.

Signed:______


ACKNOWLEDGEMENTS

I wish to extend my heartfelt gratitude to the many people who have enabled and supported the work behind this thesis. Special thanks go to my supervisor, Dr. Noel E. O'Connor, for his valuable guidance and expertise as well as for his kind advice, encouragement, and constant support. Similar thanks are due to Dr. Noel Murphy and Professor Alan Smeaton for their individual advice, support and suggestions at different stages of the work.

I would like to thank all the colleagues at the Centre for Digital Video Processing for the pleasant work atmosphere. I'm indebted to many members of the group for the interesting and fruitful discussions, especially fellow PhD students Orla Duffner, Sorin Sav, and Ciarán Ó Conaire, and also Dr. Hervé Le Borgne.

I would like to thank Christan Ferran Bennström and Dr. Josep R. Casas from the Universitat Politècnica de Catalunya for the interesting discussions which motivated many experiments presented in this thesis. Christan Ferran Bennström devoted a significant amount of his time to the creation of the image collection used in this thesis.

Different parts of the research work published in this thesis were financially supported by the Informatics Research Initiative of Enterprise Ireland, by Science Foundation Ireland under grant 03/IN.3/I361, and by the European Commission under contract FP6-001765 aceMedia.

Dr. Miroslaw Bober of the Visual Information Laboratory of Mitsubishi Electric Information Technology Center Europe is acknowledged for generating and providing the CSS results. The Centre for Vision, Speech, and Signal Processing of the University of Surrey is acknowledged for making the database of shapes of marine creatures available, and the Center for Intelligent Information Retrieval of the University of Massachusetts is acknowledged for providing the annotated collection of George Washington's manuscripts.

On a personal note, special thanks go to all my friends, especially to the Polish community at Dublin City University.

I wish to extend my gratitude to my parents Henryka and Julian (for life and upbringing) and my aunt Alicja and uncle Boleslaw (for always warmly welcoming me under their roof).

Finally, I would like to thank Ms. Cristina Blanco (for her constant support and encouragement, and for reminding me that there are more important things in this life than being called 'Doctor'; also for providing photos of herself, which significantly improve the aesthetic value of this thesis).


ABSTRACT

USING CONTOUR INFORMATION AND SEGMENTATION FOR OBJECT REGISTRATION, MODELING AND RETRIEVAL

by Tomasz Adamek, M.Sc.

This thesis considers different aspects of the utilization of contour information and syntactic and semantic image segmentation for object registration, modeling and retrieval in the context of content-based indexing and retrieval in large collections of images. Target applications include retrieval in collections of closed silhouettes, holistic word recognition in handwritten historical manuscripts and shape registration. Also, the thesis explores the feasibility of contour-based syntactic features for improving the correspondence of the output of bottom-up segmentation to semantic objects present in the scene and discusses the feasibility of different strategies for image analysis utilizing contour information, e.g. segmentation driven by visual features versus segmentation driven by shape models or semi-automatic segmentation, in selected application scenarios. There are three contributions in this thesis. The first contribution considers structure analysis based on the shape and spatial configuration of image regions (so-called syntactic visual features) and their utilization for automatic image segmentation. The second contribution is the study of novel shape features, matching algorithms and similarity measures. Various applications of the proposed solutions are presented throughout the thesis, providing the basis for the third contribution, which is a discussion of the feasibility of different recognition strategies utilizing contour information. In each case, the performance and generality of the proposed approach have been analyzed based on extensive rigorous experimentation using test collections that are as large as possible.


LIST OF PUBLICATIONS

T. Adamek, N. E. O'Connor, and A. F. Smeaton, "Word matching using single closed contours for indexing handwritten historical documents," accepted for publication in Special Issue on Analysis of Historical Documents, Int'l Journal on Document Analysis and Recognition, 2006.

T. Adamek and N. E. O'Connor, "Interactive object contour extraction for shape modeling," in Proc. 1st Int'l Workshop on Shapes and Semantics, Tokyo, June 2006.

N. E. O'Connor, E. Cooke, H. Le Borgne, M. Blighe, and T. Adamek, "The AceToolbox: low-level audiovisual feature extraction for retrieval and classification," in Proc. 2nd IEE European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies, London, U.K., Dec. 2005.

A. F. Smeaton, H. Le Borgne, N. E. O'Connor, T. Adamek, O. Smyth, and S. De Burca, "Coherent segmentation of video into syntactic regions," in Proc. 9th Irish Machine Vision and Image Processing Conf. (IMVIP'05), Belfast, Northern Ireland, Aug. 2005.

T. Adamek, N. E. O'Connor, G. Jones, and N. Murphy, "An integrated approach for object shape registration and modeling," in Proc. 28th Annual Int'l ACM SIGIR Conference (SIGIR'05), Workshop on Multimedia Information Retrieval, Salvador, Brazil, Aug. 2005.

T. Adamek, N. E. O'Connor, and N. Murphy, "Multi-scale representation and optimal matching of non-rigid shapes," in Proc. 4th Int'l Workshop on Content-Based Multimedia Indexing, CBMI'05, Riga, Latvia, June 2005.

T. Adamek, N. E. O'Connor, and N. Murphy, "Region-based segmentation of images using syntactic visual features," in Proc. 6th Int'l Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'05), Montreux, Switzerland, Apr. 2005.

T. Adamek and N. E. O'Connor, "A multi-scale representation method for non-rigid shapes with a single closed contour," IEEE Trans. Circuits Syst. Video Technol., special issue on Audio and Video Analysis for Multimedia Interactive Services, no. 5, May 2004.

T. Adamek and N. E. O'Connor, "Efficient contour-based shape representation and matching," in Proc. 5th ACM SIGMM Int'l Workshop on Multimedia Informa

N. E. O'Connor, S. Sav, T. Adamek, V. Mezaris, I. Kompatsiaris, T. Y. Lui, E. Izquierdo, C. F. Bennstrom, and J. R. Casas, "Region and object segmentation algorithms in the QIMERA segmentation platform," in Proc. Int'l Workshop on Content-Based Multimedia Indexing (CBMI'03), Rennes, France, Sep. 2003.

N. E. O'Connor, T. Adamek, S. Sav, N. Murphy, and S. Marlow, "QIMERA: A software platform for video object segmentation and tracking," in Proc. 4th Workshop on Image Analysis for Multimedia Interactive Services, London, U.K. (WIAMIS'03),

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 INTRODUCTION
  1.1 The New "Digital Age"
  1.2 Challenges Brought by the Explosion of Cheap Digital Media
  1.3 Retrieval by Content
  1.4 The Importance of Image Segmentation and Shape Analysis in CBIR
    1.4.1 Image Segmentation
    1.4.2 Shape Matching
  1.5 Objectives of this Thesis
  1.6 The Main Contributions
  1.7 Thesis Overview

2 IMAGE SEGMENTATION, APPLICATIONS AND EVALUATION: A REVIEW
  2.1 Introduction
    2.1.1 Image Segmentation
    2.1.2 Taxonomy
    2.1.3 Chapter Structure
  2.2 Relevance to Content-Based Image Retrieval
    2.2.1 CBIR Systems Utilizing Segmentation
    2.2.2 Automatic Annotation of Image Regions
  2.3 Grouping Cues
    2.3.1 Colour
    2.3.2 Texture
    2.3.3 General Geometric Cues
    2.3.4 Application Specific Models
    2.3.5 User Interactions

  2.4 Selected Methods for Bottom-up Automatic Segmentation: A Review
    2.4.1 Clustering
    2.4.2 Mathematical Morphology
    2.4.3 Graph Theoretic Algorithms
  2.5 Selected Methods for Top-down Segmentation: A Review
    2.5.1 Semi-Automatic Segmentation
    2.5.2 Shape Driven Segmentation
  2.6 Evaluation
    2.6.1 Evaluation Strategies
    2.6.2 Relative Evaluation Methods
    2.6.3 Adopted Evaluation Method
    2.6.4 Ground-Truth Creation
  2.7 Discussion
  2.8 Conclusion

3 SEGMENTATION USING SYNTACTIC VISUAL FEATURES
  3.1 Introduction
    3.1.1 Motivation
  3.2 RSST Segmentation Framework
  3.3 Overview of the Proposed Approach
  3.4 Methodology Used for Training and Testing
    3.4.1 Training and Test Corpus
    3.4.2 Development Methodology
  3.5 Colour Homogeneity
    3.5.1 Colour Spaces
    3.5.2 Optimized Colour Homogeneity Criterion
    3.5.3 Extended Colour Representation
  3.6 Syntactic Visual Features
    3.6.1 Adjacency
    3.6.2 Regularity (low complexity)
  3.7 Integration of Evidence from Different Sources
    3.7.1 Belief Theory
    3.7.2 Dempster-Shafer's Theory: Background
    3.7.3 Application of Dempster-Shafer's Theory to the Region Merging Problem
    3.7.4 Designing Belief Structures for each Source of Information
    3.7.5 Evaluation of the New Framework for Combining Multiple Features

  3.8 Stopping Criterion
    3.8.1 Stopping Criterion Based on PSNR
    3.8.2 A New Stopping Criterion
  3.9 Future Work
  3.10 Discussion
  3.11 Semi-Automatic Segmentation
    3.11.1 Motivation
    3.11.2 User Interaction
    3.11.3 Implementation Details
    3.11.4 Results
    3.11.5 Discussion
  3.12 Conclusion

4 SHAPE REPRESENTATION AND MATCHING: A REVIEW
  4.1 Introduction
    4.1.1 Shape Analysis
    4.1.2 2D Shapes Versus a 3D World
    4.1.3 Chapter Structure
  4.2 Important Aspects of Retrieval by Shape
    4.2.1 Computational Shape Analysis
    4.2.2 Shape Representation and Description
    4.2.3 Shape Matching and Similarity Estimation
  4.3 Taxonomy of Shape Representations
    4.3.1 Taxonomy
    4.3.2 Contour-Based Versus Region-Based Techniques
    4.3.3 Transform Domain Versus Spatial Domain Representations
    4.3.4 Shape Modeling
  4.4 The Most Characteristic Methods Utilizing Shape Information Embedded Into a Low Dimensional Vector Space
    4.4.1 Simple Global Geometric Shape Features
    4.4.2 Fourier Transform
    4.4.3 Moments
  4.5 The Most Characteristic Curve Matching Techniques
    4.5.1 Medial Axis Transform
    4.5.2 Dense Matching
    4.5.3 Proximity Matching
    4.5.4 Spread Primitive Matching
    4.5.5 Syntactical Matching

5 MULTI-SCALE REPRESENTATION AND MATCHING OF CLOSED CONTOURS
  5.1 Introduction
    5.1.1 Motivation
  5.2 Multi-scale Convexity Concavity Representation
    5.2.1 Definition
    5.2.2 Properties of MCC representation
  5.3 Matching and Dissimilarity Measure
    5.3.1 Optimal Correspondence Between Contour Points
    5.3.2 Distances Between Contour Points
    5.3.3 Dynamic Programming Formulation
    5.3.4 The Complete Matching Algorithm
    5.3.5 Shape Dissimilarity
    5.3.6 Matching Examples
  5.4 Results
    5.4.1 Retrieval of Marine Creatures
    5.4.2 Simulation of MPEG-7 Core Experiment
  5.5 Computational Complexity
  5.6 Alternative Representation and Matching Optimization
    5.6.1 An Optimization Framework to Facilitate Evaluation
    5.6.2 Reduction of Redundancy Between Contour Point Features
    5.6.3 Optimization Methodology
    5.6.4 Optimization Criterion
    5.6.5 Optimization Results
  5.7 Relation to Existing Methods
  5.8 Discussion and Future Work
    5.8.1 Alternative Approaches to Contour Evolution and Convexity/Concavity Measurement
    5.8.2 Different Quantization Schemes for a
    5.8.3 Redundancy Between Contour Points
    5.8.4 Limitations of the Optimization Framework
  5.9 Conclusion

6 CASE STUDY 1: WORD MATCHING IN HANDWRITTEN DOCUMENTS
  6.1 Introduction
    6.1.1 Motivation
    6.1.2 Previous Work
    6.1.3 Chapter Structure

  6.2 Contour Extraction
    6.2.1 Binarization
    6.2.2 Position Estimation
    6.2.3 Connected Component Labelling
    6.2.4 Connecting Disconnected Letters
    6.2.5 Contour Tracing
  6.3 Contour Matching
  6.4 Experiments
    6.4.1 Overall Results
    6.4.2 Towards Fast Recognition
    6.4.3 Comparison with CSS
  6.5 Future work
  6.6 Conclusion

7 CASE STUDY 2: SHAPE REGISTRATION AND MODELING
  7.1 Introduction
    7.1.1 Motivation
    7.1.2 Previous Work
  7.2 Point Distribution Model
    7.2.1 Pairwise Alignment
    7.2.2 Generalized Procrustes Analysis
    7.2.3 Modeling Shape Variation
  7.3 Semi-Automatic Segmentation
  7.4 Landmark Identification
    7.4.1 Pair-wise Matching
    7.4.2 Correspondence for a Set of Curves
    7.4.3 Modeling Symmetrical Shapes
  7.5 Results
    7.5.1 Cross-sections of Pork Carcasses
    7.5.2 Head & Shoulders
    7.5.3 Synthetic Example
  7.6 Discussion
  7.7 Conclusion

8 GENERAL CONCLUSIONS AND FUTURE WORK
  8.1 General Discussion
  8.2 Thesis Summary & Research Contributions

    8.3.1 Image Segmentation
    8.3.2 Shape Matching
  8.4 Final Word

A THE HUMAN VISUAL SYSTEM AND OBJECT RECOGNITION
  A.1 Gestalt Principles of Visual Organization Of Perception
  A.2 Structural-Description vs. Image-Based Models
    A.2.1 Reconstruction of the Outside World from the Retinal Image
    A.2.2 Exemplar-Based Multiple-Views Mechanism
    A.2.3 Which Model Explains All Facts of Human Visual Perception?

B IMAGE COLLECTION USED IN CHAPTER 3

LIST OF TABLES

3.1 Combination of two BBA using the Dempster's combination rule
3.2 Parameters Tl, Tc, Tr, and α for colour homogeneity measures BBA
3.3 Discounting factor α_adj
3.4 Discounting factor α_cpx
3.5 Results of cross-validation for different combinations of features
3.6 Results of cross-validation of selected merging criteria used with the PSNR-based stopping criterion
3.7 Results of cross-validation of selected merging criteria with the proposed stopping criterion
5.1 Average retrieval rates for different configurations of the contour matching algorithm
6.1 Effect of pruning of word pairs based on contour complexity
6.2 Effect of pruning of word pairs based on number of descenders
6.3 Effect of pruning of word pairs based on number of ascenders
6.4 WER (excluding OOV words) for the discussed methods

LIST OF FIGURES

2.1 Main segmentation strategies
2.2 Weighting functions used for evaluation of segmentation
2.3 Tool for manual creation of ground-truth segmentation
3.1 Influence of parameter qavg on the spatial segmentation error
3.2 Example of ADCS representation
3.3 Influence of parameter qext on the spatial segmentation error
3.4 Segmentations obtained by different colour homogeneity criteria (the required number of regions is known)
3.5 Over-segmentation of occluded background caused by favouring of merging of adjacent regions
3.6 General form of BBA functions
3.7 Selected segmentation results obtained by merging criteria integrating colour homogeneity with different combinations of syntactic features
3.8 Segmentations obtained by merging costs integrating geometric measures with two different colour homogeneity criteria: Cavg and Cext
3.9 Influence of Tpsnr on the average spatial segmentation error
3.10 Stopping criterion based on accumulated merging cost (Ccum)
3.11 Influence of Tcum on the average spatial segmentation error
3.12 Results of fully automatic segmentation
3.13 Upwards label propagation
3.14 Downwards label propagation
3.15 Example illustrating limitations of the solution for labelling of the entire image proposed by Salembier and Garrido
3.16 Results of semi-automatic segmentation
4.1 Problems addressed by shape analysis
4.2 Shape representation taxonomy

4.4 Taxonomy of curve matching in spatial domain
4.5 CSS extraction
4.6 Radical change in a CSS image caused by a minor change in shape
4.7 Matching approach proposed by Petrakis et al.
5.1 Contour displacement
5.2 Extraction of MCC representation
5.3 Influence of non-rigid deformations on MCC representation
5.4 Examples of MCC representation
5.5 Matching
5.6 Matching examples
5.7 Retrieval of marine creatures
5.8 Examples from MPEG-7 collection (part B of "CE-Shape-1")
5.9 Retrieval rates for all classes from the MPEG-7 collection
5.10 Precision versus Recall for the proposed and the CSS approach
5.11 Distances between shapes computed using the proposed approach
5.12 MCC-DCT representation
5.13 Optimization algorithm
5.14 Optimal combinations of parameters pi
5.15 Class retrieval rates for different configurations of the approach
5.16 Average retrieval ratios vs. number of DCT coefficients
6.1 Manuscript sample
6.2 Binarization using dynamic threshold
6.3 Position estimation
6.4 Extraction of word contours
6.5 Matching words
6.6 Selection strategy for starting point
6.7 Word Error Rate as a function of allowed path deviation Tj
6.8 Identification of descenders and ascenders
7.1 Multiple assignments of points
7.2 Modeling of the pork carcasses
7.3 Modeling of the head & shoulders
7.4 Modeling of the "bump"


Chapter 1

INTRODUCTION

This chapter briefly discusses the main motivations for the research carried out by the author and reported in this thesis. Also, it briefly outlines the main objectives of the work together with the author's contributions and an overview of the structure of the thesis.

1.1 The New "Digital Age"

In recent years, computer technology has transformed almost every aspect of our life and culture. The advent and popularization of the personal computer, in conjunction with the continuous evolution of digital networks and the emergence of multimedia technology, particularly the proliferation of cheap digital media acquisition tools, have enabled a worldwide trend towards a new "digital age". The new models of content production, distribution and consumption have resulted in fast growth of the amount of digital material available.

However, the wealth of information available nowadays has also induced a major problem of information overload. In other words, the new "digital age" challenges us to develop the ability to find helpful and useful information in a vast sea of otherwise useless information. The need for efficient tools to organize, manipulate, search, filter and browse through the huge amounts of digital information is evident from the spectacular success of text-based Web Search Engines such as Yahoo or Google [1].

1.2 Challenges Brought by the Explosion of Cheap Digital Media

The emergence of multimedia technology, particularly the explosion of cheap digital media acquisition tools such as scanners and digital cameras, as well as the rapid reduction of storage cost, has resulted in accelerated growth of digital audiovisual media collections, both proprietary and freely available on the World Wide Web. Applications in areas like manufacturing, medicine, entertainment and education all make use of immense amounts of audiovisual data. As the amount of information available in visual form continuously increases, the necessity for the development of human-centered tools for the effective processing, storing, managing, browsing and retrieval of audiovisual data becomes evident.

A large amount of visual material is available in the form of still images, either as a photograph representing a real scene or as a graphic containing a non-real image (sketched by a human or synthesized by computer). Images can be stored either in local databases or distributed databases (e.g. the World Wide Web) and either embedded in documents or available as stand-alone objects.

The application areas most often listed in the literature where the use of images, and their retrieval in particular, nowadays plays a crucial role are: law and crime prevention, medicine, fashion and graphic design, publishing, electronic commerce, architectural and engineering design, and historical research. A detailed study of image users and the possible uses of images was provided by Eakins and Graham in [2]. Due to the volume of visual material in the form of images there is a clear need for image search engines for both private and professional use.

1.3 Retrieval by Content

Currently, techniques for image retrieval (or visual media generally) are one of the most eagerly sought multimedia technologies, with potentially great commercial significance. However, visual information systems are radically different from conventional information systems, and retrieval of data in such systems requires addressing many novel issues before they can truly become a new fertile area for innovative products.

Specifically, retrieval based on the text usually associated with visual content is no longer sufficient. In fact, in many cases this is not even possible. Often there is no textual information associated with images and manual keyword annotation is labor intensive. Therefore, retrieval by content has recently been suggested as an alternative retrieval model for audiovisual media. In Content-Based Image Retrieval (CBIR) systems images are indexed (and subsequently browsed and retrieved) based on features extracted directly from their representation rather than by any associated text.

Building modern systems for content-based retrieval of visual data requires consideration of several key issues such as system design, feature extraction, similarity measures, indexing structures, semantic analysis, inferring content abstraction from low-level features, designing user interfaces, querying models and relevance feedback [3].

Notably, CBIR systems require indices to order the data. However, given a visual signal there is little information readily available for indexing. Therefore in recent years a significant research effort has been dedicated to the problem of extraction of visual descriptors which can later be exploited for content-based browsing and retrieval. The descriptors may be visual features such as colour, texture, edge direction, shape, scene layout, spatial relationships, or semantic primitives. Also they can be global (one feature describes the whole image) or local (each feature describes an object or a part of the image). Another crucial component of CBIR systems is establishing similarity between visual entities (e.g. between the query and the target).
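To make the distinction between global and local descriptors, and the role of a similarity measure, more concrete, the following is a minimal sketch only. It assumes a coarsely quantised RGB colour histogram as the descriptor and histogram intersection as the similarity measure; both are chosen purely for illustration and are not the features or measures proposed in this thesis. All function names and the toy pixel data are hypothetical.

```python
from collections import Counter

def colour_histogram(pixels, bins_per_channel=4):
    """Normalised, coarsely quantised RGB histogram of (r, g, b) pixels.

    Illustrative global descriptor: one feature describes the whole pixel set.
    """
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {bin_: n / total for bin_, n in counts.items()}

def region_histograms(pixels, labels):
    """Illustrative local descriptors: one histogram per labelled region.

    The labels are assumed to come from some prior segmentation step.
    """
    regions = {}
    for px, lab in zip(pixels, labels):
        regions.setdefault(lab, []).append(px)
    return {lab: colour_histogram(px) for lab, px in regions.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical normalised histograms."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())

def rank_by_similarity(query_pixels, collection):
    """Order image names from most to least similar to the query."""
    q = colour_histogram(query_pixels)
    scores = {name: histogram_intersection(q, colour_histogram(px))
              for name, px in collection.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (hypothetical): three tiny "images" given as flat pixel lists.
red = [(250, 10, 10)] * 8
mostly_red = [(250, 10, 10)] * 6 + [(10, 10, 250)] * 2
blue = [(10, 10, 250)] * 8

ranking = rank_by_similarity(red, {"blue": blue, "mostly_red": mostly_red})
print(ranking)
```

In this toy query by example, the mostly red image is ranked ahead of the blue one. The per-region variant shows how a segmentation (here an assumed label per pixel) turns the same feature into a set of local descriptors.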

In order to allow inter-operability between devices and applications attempting to solve various aspects of the query by content problem, as well as to provide a common terminology and test material, the Moving Picture Experts Group (MPEG) introduced a standard known as MPEG-7 [4, 5]. MPEG-7, formally called the Multimedia Content Description Interface, is a standard for describing multimedia content, facilitating sophisticated management, browsing, searching, indexing, filtering, and accessing of that content. Unlike previous ISO standards, it is intended to provide representations of audiovisual (AV) information that aim beyond compression (such as the MPEG-1 and MPEG-2 standards) or even object-based functionalities (such as those supported by the MPEG-4 standard), and support some degree of interpretation of the AV content. MPEG-7 offers a comprehensive set of standardized audiovisual description tools by specifying four types of normative elements: Descriptors, Description Schemes, a Description Definition Language (DDL), and coding schemes. For a more detailed description of the standard the reader can refer to [4, 5].

This thesis contributes to two key research areas enabling some of the above challenges to be addressed: image segmentation and shape matching. The remainder of this chapter briefly discusses the main motivation for the research carried out by the author in this thesis by discussing the role of image segmentation and shape matching in CBIR. This is followed by an outline of the main objectives of the work together with the author's contributions and an overview of the structure of the thesis.

1.4 The Importance of Image Segmentation and Shape Analysis in CBIR

This section briefly discusses the role of image segmentation and shape matching technologies in addressing some of the major challenges which must be faced when building CBIR systems.

1.4.1 Image Segmentation

Image segmentation refers to a very broad set of techniques for image partitioning. Typically, the goal is to partition an image using a chosen criterion so that the created partition reflects the structure of the scene at a level suitable for a given application, e.g. homogeneous regions or semantic objects. In other words, the term segmentation can be used to refer to any process that allows certain knowledge about the structure of the scene to be discovered, e.g. the shape of the objects present. Therefore, the problem of partitioning an image into a set of homogeneous regions or semantic entities is a fundamental enabling technology for understanding scene structure and identifying relevant objects. Identification of areas of an image that correspond to important regions, e.g. semantic objects, is often considered to be the first step of many object-based applications.

The problem of segmentation is fundamental to many challenges encountered in the area of computer vision. This thesis focuses primarily on region and object based segmentation of images in the context of visual content indexing and retrieval applications.

Many of the low-level descriptors, including the ones defined by the MPEG-7 standard, can be used to describe entire images as well as parts of images (e.g. image regions, which ideally would correspond to real objects present in the scene). Utilization of local descriptors, representing images at region or object granularity, allows indexing and retrieval in a manner closer to human perception and scene understanding. Image segmentation is a vital tool for extraction of such localized descriptors of the image content. In other words, the ability to partition the image into regions corresponding to objects present in the scene (or at least to parts of the object that are homogeneous according to certain criteria) is central for extraction of local low-level features (e.g. colour, texture, shape) where each feature partially or completely describes an object. However, MPEG-7 does not specify normative tools for image segmentation as this is not necessary for inter-operability. Nevertheless, in practice, image segmentation tools are integral parts of many feature extraction toolboxes (see for example [6]). The author believes that in many cases, the success of MPEG-7, both commercially and as a tool facilitating research, will depend on the ability to easily produce a partitioning of the visual content into regions meaningful in a given application scenario.

It is commonly known that even limited capabilities for filtering the AV material based on small sets of particularly important semantic objects/concepts can greatly improve the performance of CBIR systems, e.g. the user may wish to browse only images containing certain types of objects (e.g. faces) or scenes (e.g. indoor/outdoor); see for example the Fischlar system [7]. Often, such functionalities require automatic detection, extraction and recognition of semantically meaningful entities from visual data.

Recently it has been show n that textual labels can be assigned (inferred) to hom ogenous im age regions autom atically d u rin g an off-line training/indexing process [8, 9]. Such labels can be then used for indexing and subsequently for textual querying. Clearly, the ability of autom atically segm ent images based on colour and texture hom ogeneity criteria is a crucial underpinning technology in such a scenario.

Finally, semi-automatic (often also referred to as supervised) segmentation technologies are becoming sufficiently mature for integration with CBIR, opening an interesting possibility of object-based rather than image-based queries. Surprisingly, to the best of the author's knowledge, so far only a few studies have considered such a possibility [10]. Also, semi-automatic tools enable rapid semi-automatic annotation of images where labels are manually assigned to image parts [11]. Other interesting approaches to CBIR utilizing image segmentation are discussed in the next chapter (section 2.2).

1.4.2 Shape Matching

Shape plays an important role in allowing humans to recognize and classify objects. In fact, shapes often represent the abstractions (archetypes) of objects belonging to the same semantic class. Not surprisingly, some user surveys regarding cognition aspects of image retrieval indicate that users are more interested in retrieval by shape than by colour and texture [12]. Therefore shape analysis can play an important role in addressing some of the key problems which must be faced when building CBIR systems, such as detection of semantic objects in unseen images, establishing similarity between objects, and extracting indices to index the data.

Shape matching (in this case often referred to as template matching) plays an important role in the detection and segmentation of semantic objects in unseen images. Once prototypical objects are extracted, shape analysis can make explicit some important information about the object's shape which can then be exploited to recognize that object or distinguish it from other objects. Estimation of similarity between shapes of objects (e.g. between the query and the target objects) is a key technology enabling querying by example. However, estimation of shape similarity can also be indirectly important for textual querying of images, e.g. text recognition in scanned documents using Optical Character Recognition (OCR). Clearly the shape of characters and words is crucial for mapping them into text. One such application is discussed in chapter 6 of this thesis, which considers contour matching for word recognition in historically significant handwritten documents.

Not surprisingly, among several low-level visual features characterizing the AV content, MPEG-7 defines two standardized descriptors specifically for 2D shape: a contour-based descriptor relying on the concept of Curvature Scale Space (CSS) [13, 14, 15, 16, 17] and a region-based descriptor based on a 2D complex transform defined with polar coordinates on the unit disk, called the Angular Radial Transform (ART) [17] - both are discussed in detail in chapter 4. To facilitate the comparison between different methods for shape-based retrieval, a complete evaluation methodology and common test collections were specified. This methodology, among others, is adopted for extensive evaluation of the method proposed by the author in chapter 5. Also, weaknesses of the standardized contour-based descriptor (CSS) are identified and discussed based on several application examples. In particular, it will be shown that the method for estimating similarity between shapes proposed in this thesis produces better results than the CSS-based method in two applications: retrieval in collections of closed silhouettes and holistic word recognition in handwritten historical manuscripts.

Summarizing, image segmentation and shape matching are key technologies essential to provide the content-based functionalities required by future multimedia applications. Not surprisingly, they have therefore been hot research topics for a large number of groups around the world. Contributions to both of these key research areas are made and reported in this thesis together with a discussion of several applications of the proposed solutions.

1.5 Objectives of this Thesis

This thesis considers different aspects of the utilization of contour information for image segmentation, similarity estimation between objects, and also object registration and modeling in the context of content-based retrieval in large collections of images.

The first objective of this thesis is to place the research in context and provide justification for the investigation carried out by the author. This is achieved by presenting various applications of the proposed solutions throughout the thesis and demonstrating that image segmentation and shape matching are key technologies enabling the content-based functionalities required by future multimedia applications.

The second objective is to outline the current state of the art in image segmentation and shape matching by reviewing the most characteristic categories of techniques in both areas and briefly discussing the most interesting and promising techniques from each category. This thesis does not attempt to cover all aspects of image segmentation and shape analysis, nor does it represent a detailed literature review of the vast amount of work published in these fields. Rather, it restricts itself to a discussion of these technologies in the context of CBIR.

The third, and most important, objective is to investigate new approaches to solving the problems related to image segmentation, contour matching and similarity estimation and to discuss their practicability in various applications. One of the main goals is to explore the feasibility of utilizing the spatial configuration of image regions and structural analysis of their contours (so-called "syntactic visual features") for improving the usefulness of the output of bottom-up image segmentation, e.g. for feature extraction, by creating a more meaningful segmentation of the image corresponding more closely to semantic objects present in the scene and improving the perceptual quality of the segmentation. This can be achieved by investigating new ways of utilizing evidence from multiple sources of information, each with its own accuracy and reliability. Another goal is to investigate selected issues of shape analysis, in particular contour representation, matching, and estimation of similarity between silhouettes. The aim is also to demonstrate the usefulness of the proposed approach in a variety of applications.

In each case, the performance and generality of the proposed techniques must be analyzed based on rigorous experimentation using large test collections. Moreover, presentation of selected applications of the proposed solutions should provide a suitable basis for a discussion of the practicability of different recognition strategies utilizing contour information, e.g. bottom-up (automatic segmentation followed by recognition) vs. top-down (segmentation driven by shape models or semi-automatic).

The final objective of this thesis is to indicate directions for further research, namely to consider possibilities for further improvement of the proposed solutions and to discuss the prospects of using them as a basis for addressing various other challenges in the field.

1.6 The Main Contributions

There are three main contributions in this thesis. The first contribution is a feasibility study of utilizing spatial configuration of regions and structural analysis of their contours (so-called syntactic visual features [18]) for improving the correspondence of fully automatic image segmentation to semantic objects present in the scene and for improving the perceptual quality of the segmentation. Several extensions to the well-known Recursive Shortest Spanning Tree (RSST) algorithm [19] are proposed, among which the most important is a novel framework for the integration of evidence from multiple sources of information, each with its own accuracy and reliability. The new integration framework is based on the Theory of Belief (BeT) [20, 21, 22, 23]. The "strength" of the evidence provided by the geometric properties of regions and their spatial configurations is assessed and compared with the evidence provided solely by the colour homogeneity criterion. Other contributions in this area include a new colour model and colour homogeneity criteria, practical solutions to structure analysis based on shape and spatial configuration of image regions, and a new simple stopping criterion aimed at producing partitions containing the most salient objects present in the scene.

Additionally, the application of the automatic segmentation method to the problem of semi-automatic segmentation is also discussed. An easy to use and intuitive semi-automatic segmentation tool utilizing the automatic approach is described in detail. This example demonstrates that syntactic features can also be useful in supervised scenarios, where they can facilitate more intuitive user interactions, e.g. by allowing the segmentation process to make more "intelligent" decisions whenever the information provided by the user is insufficient (or ambiguous).

The second contribution is the proposal of a novel rich multi-scale representation for non-rigid shapes with a single closed contour and a study of its properties in the context of efficient silhouette matching and similarity estimation. An initial approach for optimization of the matching in order to achieve higher performance in discriminating between shape classes is also proposed. The efficiency of the proposed silhouette matching approach is demonstrated in a variety of applications.

In particular, the retrieval results are compared to those of the Curvature Scale Space (CSS) [13, 14, 15] approach (adopted by the MPEG-7 standard [16, 17]) and it is shown that the proposed scheme performs better than CSS in two applications: retrieval in collections of closed silhouettes and holistic word recognition in handwritten historical manuscripts. In fact, it is shown that the proposed contour-based approach outperforms all techniques for word matching proposed in the literature when tested on a set of 20 pages from the George Washington collection at the Library of Congress [24, 25]. Additionally, it is shown that the matching approach can be extended with some success to the problem of matching multiple contours, thereby facilitating the establishment of dense correspondence between multiple silhouettes which can then be employed for contour registration.

Various applications of the proposed techniques are presented throughout the thesis, providing the basis for the third contribution, which is a discussion of the feasibility of different recognition strategies utilizing contour information.


1.7 Thesis Overview

The first chapters of the thesis consider segmentation of images of real world scenes while the later chapters focus on estimation of similarities between objects' silhouettes. The final chapters present case studies demonstrating the usefulness of the proposed solutions in selected applications.

Chapter 2 presents the state of the art in image segmentation in the CBIR context. The chapter discusses the relevance of image segmentation in the context of content-based image retrieval, outlines the main categories of approaches, briefly describes the grouping cues most commonly utilized in segmentation and reviews the most characteristic approaches to automatic and semi-automatic image segmentation. The chapter also briefly discusses evaluation methodologies for assessment of segmentation quality proposed thus far in the literature and describes in some detail the evaluation method adopted throughout this thesis.

Chapter 3 explores the feasibility of utilizing the spatial configuration of regions and structural analysis of their contours (so-called syntactic visual features) for improving the correspondence of the output of bottom-up segmentation by region merging to semantic objects present in the scene and improving its perceptual quality. A new framework for utilizing evidence from multiple sources of information, each with its own accuracy and reliability, for meaningful region merging is proposed. Two practical measures for analyzing the spatial configuration of image regions and structural analysis of their contours are proposed together with several improvements to colour representations used during the region merging process and a stopping criterion aimed at producing partitions containing the most salient objects present in the scene. Finally, the chapter demonstrates the utilization of the approach in a semi-automatic scenario.

Chapter 4 outlines the most important issues related to shape analysis in the context of content-based image retrieval. It focuses on the major challenges related to 2D shape representation, matching and similarity estimation and gives an overview of selected techniques available in the literature in order to provide a context for the research carried out in the remainder of the thesis, particularly in chapter 5.

Chapter 5 looks at the development of a new representation and matching technique devised by the author for non-rigid shapes with a single closed contour. An initial approach for optimization of the matching in order to achieve higher performance in discriminating between shape classes is also presented. The efficiency of the proposed approaches is demonstrated on two collections and the retrieval results are compared to those of the method based on Curvature Scale Space (CSS) [13, 14, 15], which has been adopted by the MPEG-7 standard [16, 17].

The later chapters discuss applications of the techniques proposed in the earlier chapters. Chapter 6 examines the application of the silhouette matching method proposed in chapter 5 to holistic word recognition in historically significant handwritten manuscripts. Chapter 7 discusses the utilization of both the silhouette matching method presented in chapter 5 and the semi-automatic segmentation approach discussed in chapter 3 in an integrated tool allowing intuitive extraction and registration of contour examples from a set of images for construction of statistical shape models.

The final chapter summarizes the thesis by reviewing the achieved research objectives and recalling the main conclusions drawn throughout this thesis. It also indicates directions for further research by discussing possibilities for further improvement of the proposed techniques and looks at future prospects of using them to address related challenges in the field.


Chapter 2

IMAGE SEGMENTATION, APPLICATIONS AND EVALUATION: A REVIEW

This chapter discusses the importance of image segmentation in the context of content-based image retrieval, outlines the main categories of approaches, and reviews the most characteristic segmentation methods. The subject of objective evaluation of segmentation quality is also discussed in some detail as it has received relatively little attention in the literature, especially when compared to the large number of publications on the topic of image segmentation itself.

2.1 Introduction

2.1.1 Image Segmentation

The term image segmentation refers to a broad class of processes for partitioning an image into disjoint connected regions which are homogeneous with respect to certain criteria such as low-level features (e.g. colour or texture), general prior knowledge about the world (e.g. boundary smoothness), high-level knowledge (e.g. semantic models) or even user interactions.
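The definition above can be made concrete with a small check. The following is a minimal sketch, assuming that a segmentation is stored as a 2D label map (one integer label per pixel) and that "connected" means 4-connected; the function name `is_valid_partition` is hypothetical. Disjointness holds by construction, since each pixel carries exactly one label, so only the connectedness of each region is tested.

```python
from collections import deque

def is_valid_partition(labels):
    """Return True if every label in the 2D map forms one 4-connected region."""
    h, w = len(labels), len(labels[0])
    seen = [[False] * w for _ in range(h)]
    finished = set()  # labels whose connected component was already traced
    for y in range(h):
        for x in range(w):
            if seen[y][x]:
                continue
            label = labels[y][x]
            if label in finished:
                # A second, separate component carrying the same label.
                return False
            finished.add(label)
            # Breadth-first flood fill over pixels sharing this label.
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and labels[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return True
```

Homogeneity is deliberately left out of the check, since the criterion (colour, texture, models, user input) differs from method to method.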

Image segmentation is a fundamental enabling technology for understanding scene structure and identifying relevant objects. Identification of areas of an image that correspond to homogeneous and/or important regions, e.g. semantic objects, is often considered to be the first step of many object-based applications such as content-based image indexing and retrieval (e.g. using the recently introduced MPEG-7 [26, 4, 5] standard) or region-of-interest coding using the JPEG2000 standard [27]. This thesis focuses primarily on region and object based segmentation of images in the context of visual content indexing and retrieval applications.

Figure 2.1: Main segmentation strategies. [Diagram labels: data driven (bottom-up): colour homogeneity, texture homogeneity, geometric homogeneity; model driven (top-down): models (e.g. shape templates), semi-automatic (supervised).]

2.1.2 Taxonomy

A plethora of approaches to image segmentation have been proposed in the literature [28, 29, 30, 31, 32, 33, 34]. Exhaustive surveys can be found in [35, 36, 37, 38, 39]. The techniques available in the literature can be classified according to many different criteria. Many authors divide the methods into two distinct groups: (i) region-based approaches relying on the homogeneity of spatially localized features (e.g. colour or texture) and (ii) boundary-based methods using mostly gradient information to locate object boundaries. Others classify methods according to the mathematical methodology they employ, e.g. (i) variational, (ii) statistical, (iii) graph-based, and (iv) morphological.

Another important criterion, adopted for discussion in the remainder of this thesis, is the information level used for grouping. According to this criterion the methods can be broadly classified as either visual features-driven (bottom-up) or model-driven (top-down) [18] - see Figure 2.1. Approaches from the first category attempt to infer meaningful entities (ideally objects) solely from analysis of visual features, e.g. colour or texture homogeneity. Approaches from the second category attempt to segment the image into objects whose visual features best match the visual features of the models used. It should be noted that the term bottom-up is also often used to refer to methods iteratively merging pixels into more complex entities (e.g. [19]) and the term top-down often refers to region splitting techniques (e.g. [40]) or to approaches where segmentation of the down-sampled image is used to initialize the segmentation of the image at finer resolution (e.g. [30]). However, in this thesis the terms bottom-up and top-down are mainly used to refer to the level of information used and not necessarily to the order in which images at different scales are analyzed.

Alternatively, approaches can also be classified as either object-based (resulting in semantic objects) or region-based (producing regions with homogeneous colour and/or texture). Clearly, there is a close relationship between the last two criteria, i.e. typically semantically meaningful objects can be produced only by making use of object models, prior knowledge about a particular application or user guidance. The approaches driven by visual features typically rely on a certain homogeneity criterion, e.g. colour or texture. They are attractive in many applications as they do not require models of individual semantic objects or user interactions. However, they can rarely be used to extract complete semantic objects, as the problem of visual features-driven object-based segmentation is under-constrained in many applications. In contrast, the model-driven approaches, although requiring prior knowledge about the objects to be extracted, lead to well defined detection problems. Typically, in the case of general content, the visual features-driven approaches can only partition the input image into homogeneous regions, although in specific application scenarios they can also produce objects, while the model-driven approaches are used mainly to extract the important semantic objects from the irrelevant structures present in the scene. In the case of systems carrying out shape analysis, the segmentation algorithms typically aim at object-based segmentation in order to extract/recover the shape of the relevant objects present in the scene. However, the two approaches (visual features-driven and model-driven) are not mutually exclusive and they could/should both be used for segmentation and object extraction. For example, in [41] bottom-up segmentation was used to obtain a single partitioning of the image and also to build its hierarchical representation in the form of a Binary Partition Tree (BPT). The BPT was then used to guide the search for the optimum match between a reference contour and the contours of the partition, leading to object segmentation and detection.

2.1.3 Chapter Structure

The remainder of this chapter is organized as follows: the next section discusses the relevance of image segmentation in the context of Content-Based Image Retrieval (CBIR), focusing primarily on reviewing selected CBIR systems making use of segmentation and describing some recently emerged approaches to image retrieval utilizing automatic segmentation. Section 2.3 briefly discusses cues which can be used for image segmentation. Selected bottom-up and top-down approaches are reviewed in sections 2.4 and 2.5 respectively. Evaluation methods are discussed briefly in section 2.6. A short discussion of the prospects of image segmentation in the context of CBIR is given in section 2.7 and conclusions are formulated in section 2.8.

2.2 Relevance to Content-Based Image Retrieval

Utilizing segmentation in CBIR systems allows representation of images at the level of homogeneous regions, which in an ideal case correspond to semantic objects or at least to significant parts of the objects. Such region or object granularity allows utilization of local features for indexing and retrieval in a manner closer to human perception and scene understanding.

State of the art CBIR systems make use of image segmentation to extract a separate set of indexing features for each region (or object) [42, 43, 29, 44]. According to [45], the systems utilizing segmentation can be divided into two categories depending on the strategy used for matching regions from the query and the target image: (i) individual region matching - where a single region selected by the user from the query image is matched to all regions in the collection [46, 43] or alternatively retrieval results obtained by using several queries with individual regions are merged [29] and (ii) frame region matching - where information from all regions composing the images is used [47, 45]. Recently, the possibility of another category of systems utilizing image segmentation emerged when it was shown that keywords can be associated with image regions automatically during an off-line training/indexing process [8]. Such keywords can then be used for indexing and subsequently for textual querying.
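The difference between the two matching strategies can be illustrated with a toy example. This is a minimal sketch under stated assumptions: each region is reduced to a short feature vector, the Euclidean distance stands in for the region dissimilarity, and the frame-level score is a simple symmetric average of best-match distances. All names are hypothetical, and the sketch does not reproduce the measures actually used in [47, 45].

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def individual_region_score(query_region, image_regions):
    """Individual region matching: distance from one user-selected
    query region to its best-matching region in the target image."""
    return min(euclidean(query_region, r) for r in image_regions)

def frame_region_score(query_regions, image_regions):
    """Frame region matching (assumed form): symmetric average of
    best-match distances, so all regions of both images contribute."""
    forward = sum(individual_region_score(q, image_regions)
                  for q in query_regions) / len(query_regions)
    backward = sum(individual_region_score(r, query_regions)
                   for r in image_regions) / len(image_regions)
    return 0.5 * (forward + backward)

# Toy data: lower scores mean more similar images.
query = [(0.0, 0.0), (10.0, 10.0)]
image_a = [(0.0, 0.0), (5.0, 5.0)]
image_b = [(0.0, 1.0), (10.0, 10.0)]
```

With the first query region selected, individual matching prefers `image_a` (an exact match for that region), whereas the frame-level score prefers `image_b`, whose regions jointly resemble the whole query image more closely.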

The remainder of this section aims at highlighting the importance of image segmentation in CBIR by reviewing selected approaches utilizing some form of image partitioning.

2.2.1 CBIR Systems Utilizing Segmentation

Recent years have witnessed an explosion of image retrieval systems, developed as platforms facilitating research or even as commercial products. Extensive reviews of the major CBIR systems can be found in [2, 48, 3]. This section briefly overviews those systems utilizing some form of image segmentation.

One of the earliest CBIR systems, and also one of the first to provide limited object-based functionalities, is the Query By Image Content (QBIC) system developed by IBM Almaden Research Center [49, 50, 11]. In this approach simple low-level features (colour, texture and simple shape descriptors) are extracted from entire images or from objects semi-automatically segmented in the database population step. The segmentation is based on the active contours technique, which aligns the approximate object outline provided by the user with the closest image edges. Each extracted object can be annotated by a textual description. All images are also represented by a reduced binary map of edge points to allow retrieval based on rough sketches. The system allows combination of text-based keyword queries with visual queries, i.e. queries-by-example, by rough sketches and selected colour and texture patterns.

One of the first image retrieval systems utilizing automatic image segmentation is Picasso, developed by the Visual Information Processing Lab, University of Florence [51, 52]. At the core of the system is a pyramidal colour segmentation of images (i.e. at the lowest level each region consists of only one pixel, while the top of the pyramid corresponds to the entire image) obtained by iterative merging of adjacent regions. Each level of the pyramid corresponds to a resolution level of the segmentation. The above representation is stored in the form of a multilayered graph in which nodes representing regions are characterized by colour (a binary 187-dimensional colour vector), position of the centroid, size, and shape (elongation and the major axis orientation). Additionally, during database population the objects of interest in each image are bounded with their minimum enclosing rectangle and a Canny edge detector is used within each of these rectangular areas to extract edges. The system allows querying by colour regions (by drawing a region with a specific colour or by tracing the contour of some relevant object in an example image), by texture, or by shape (drawn sketches).
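The idea of a segmentation pyramid built by iterative merging of adjacent regions can be sketched on a tiny grey-level image. This is an illustrative sketch only: the merging cost (absolute difference of region mean grey levels), the 4-connectivity, and the function name `merge_pyramid` are assumptions made here, not the criteria used by Picasso.

```python
def merge_pyramid(image):
    """Build a stack of label maps, finest first, by repeatedly merging
    the two adjacent regions with the most similar mean grey level.

    image: 2D list of grey levels. Level 0 has one region per pixel and
    the last level has a single region covering the whole image.
    """
    h, w = len(image), len(image[0])
    labels = [[y * w + x for x in range(w)] for y in range(h)]
    # Per-region statistics: (sum of grey levels, pixel count).
    stats = {y * w + x: (float(image[y][x]), 1)
             for y in range(h) for x in range(w)}
    levels = [[row[:] for row in labels]]
    while len(stats) > 1:
        best = None
        # Scan all 4-connected neighbouring pixel pairs for the cheapest merge.
        for y in range(h):
            for x in range(w):
                for ny, nx in ((y + 1, x), (y, x + 1)):
                    if ny < h and nx < w:
                        a, b = labels[y][x], labels[ny][nx]
                        if a != b:
                            cost = abs(stats[a][0] / stats[a][1]
                                       - stats[b][0] / stats[b][1])
                            key = (cost, min(a, b), max(a, b))
                            if best is None or key < best:
                                best = key
        _, a, b = best
        # Merge region b into region a and relabel its pixels.
        sum_a, count_a = stats[a]
        sum_b, count_b = stats.pop(b)
        stats[a] = (sum_a + sum_b, count_a + count_b)
        for row in labels:
            for x in range(w):
                if row[x] == b:
                    row[x] = a
        levels.append([row[:] for row in labels])
    return levels
```

The exhaustive rescan at every step keeps the sketch short; a practical implementation would maintain a region adjacency graph with a priority queue of merge costs.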

Netra, developed by the Department of Electrical and Computer Engineering, University of California, Santa Barbara, is another CBIR system utilizing image segmentation [53, 46]. In this system images are automatically segmented into regions of homogeneous colour using an edge flow segmentation technique [54]. Each region is characterized by colour (represented by a colour codebook of 256 colours), texture (represented by a feature vector containing the normalized mean and standard deviation of a series of Gabor wavelet transforms), shape (represented by three different feature vectors: curvature at each point on the contour, centroid distance function, and Fourier descriptors) and spatial location. The query is formulated by selecting one of the regions from the query image or alternatively, if the example image is not available, directly by specifying the colour and spatial location.
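Two of the shape features named above, the centroid distance function and Fourier descriptors, can be sketched in a few lines. This is one common textbook formulation, given under stated assumptions rather than as the exact definition used in Netra: the centroid is taken as the mean of the contour points, and the descriptors are DFT magnitudes of the centroid distance function divided by the DC term. The function names are hypothetical.

```python
import cmath
import math

def centroid_distance(contour):
    """Distance of each contour point (x, y) from the mean of the points."""
    cx = sum(x for x, _ in contour) / len(contour)
    cy = sum(y for _, y in contour) / len(contour)
    return [math.hypot(x - cx, y - cy) for x, y in contour]

def fourier_descriptors(signal, n_coeffs):
    """Magnitudes of DFT coefficients 1..n_coeffs divided by the DC magnitude.

    Taking magnitudes discards the phase (start-point dependence) and
    dividing by the DC term removes the dependence on scale.
    """
    n = len(signal)
    coeffs = [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
              for k in range(n_coeffs + 1)]
    dc = abs(coeffs[0])
    return [abs(c) / dc for c in coeffs[1:]]

# A square sampled at 8 contour points, and a scaled copy traced
# from a different start point.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
scaled = [(3 * x, 3 * y) for x, y in square[2:] + square[:2]]
fd_square = fourier_descriptors(centroid_distance(square), 4)
fd_scaled = fourier_descriptors(centroid_distance(scaled), 4)
```

For these two contours `fd_square` and `fd_scaled` agree up to rounding, which illustrates why such descriptors are convenient for comparing regions regardless of size and contour start point.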

iPure is a CBIR system developed by IBM India Research Lab, New Delhi [42]. In this system images are segmented into regions with homogeneous colour using the Mean-Shift algorithm [55, 56, 33]. Each region is represented by colour (average colour in CIE LUV space), texture (coefficients of the Wold decomposition of the image viewed as a random field), and shape (size, orientation axes, and Fourier descriptors). Additionally, the spatial lay-out is characterized by the centroid, the minimum bounding box, and contiguity. The query is formulated by selecting one or more regions from the example image.

Istorama is an image retrieval system developed by the Informatics and Telematics Institute, Centre for Research and Technology Hellas, Greece [43]. In this system all images are automatically segmented by a variant of the K-Means algorithm called KMCC [30]. Each region is characterized by its colour, size and location. Similarly to the Blobworld system, the user formulates the query by selecting a single region from the query image. Additionally, the user may stress or diminish the importance of a specific feature by adjusting weights associated with each feature.


Blobworld is a CBIR system developed at the Computer Science Division, University of California, Berkeley [29]. In this approach images are segmented into regions with uniform colour and texture (the so-called blobs) by a variant of the Expectation Maximization (EM) algorithm. Each region is characterized by its colour (represented by a histogram of the colour coordinates in the CIE LAB colour space), texture (characterized by mean contrast and anisotropy over the region), and simple global shape descriptors (size, eccentricity, and orientation). The query-by-example is formulated by selecting one or more regions from one of the query images followed by specification of the importance of the blob itself and also its colour, texture, location, and shape.

An interesting methodology for image retrieval was presented in [44]. In this approach images are automatically segmented by a variant of the K-Means algorithm called KMCC [30] and the resulting regions are represented by low-level features such as colour, position, size and shape. The main innovation of this methodology is automatic association of these descriptors with qualitative intermediate-level descriptors, which form a simple vocabulary termed object ontology. The object ontology allows the qualitative definition of the high-level concepts and their relations in a human-centered fashion and therefore facilitates querying using semantically meaningful concepts. Applicability to generic collections without requiring manual definition of the correspondences between regions and relevant identifiers is ensured by the simplicity of the employed ontology. Utilization of the object ontology allows narrowing down the search to a set of potentially relevant images. Finally, a relevance feedback mechanism, based on Support Vector Machines and using the low-level descriptors, is employed for further refinement of retrieval results.

2.2.2 Automatic Annotation of Image Regions

Some user studies suggest that in practice queries based on global low-level features (e.g. colour histograms, texture) are surprisingly rare while at the same time text associated with images is particularly useful [57]. Interestingly, a number of works [8, 9] have shown that keywords can be associated with image regions automatically during an off-line training/indexing process. Such labels can then be used for indexing and subsequently for textual querying.

One of the most promising approaches so far to predicting words from segmented images was described in [9]. In this approach a statistical model links words and image data, without explicit encoding of correspondence between words and regions. This model combines the aspect model with a soft clustering model. It is assumed that images and co-occurring words are generated by nodes arranged in a tree structure. The joint probability of words and image regions is modelled as being generated by a collection of nodes, each of which has a probability distribution over words and regions. The region probability distributions are Gaussians over feature vectors and the word probabilities are provided by a simple frequency table. A region's features imply a probability of being generated from each node. These probabilities are then used to weight the nodes for word emission. Parameters for the conditional probabilities linking words and regions are estimated from the word-region co-occurrence data using an Expectation-Maximization algorithm. While the approach was shown to label only some regions reasonably well (e.g. sky, water, snow, people, fish, planes), it certainly represents an extremely interesting approach to object recognition. In [58] three classes of image segmentation algorithms (Blobworld [29], Normalized Cuts [28] and Mean-Shift [33]) were evaluated in the context of the above annotation approach, e.g. in terms of the word prediction performance. It was found that the Normalized Cuts approach provides the best support of all three methods for word predictions, closely followed by the Mean-Shift segmenter.
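The word emission step described above (region features weight the nodes, and the weighted nodes emit words) can be sketched as follows. This is a simplified illustration, not the model of [9]: the nodes form a flat list rather than a tree, the Gaussians have diagonal covariance, the parameters are invented toy values instead of being estimated with Expectation-Maximization, and all names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian at feature vector x."""
    log_p = 0.0
    for xi, m, v in zip(x, mean, var):
        log_p += -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
    return math.exp(log_p)

def predict_words(region, nodes):
    """Return word probabilities for one region.

    Each node has a prior, a Gaussian over region features, and a word
    frequency table. The region's features give a posterior over nodes,
    which is then used to weight each node's word table.
    """
    weights = [n["prior"] * gaussian_pdf(region, n["mean"], n["var"])
               for n in nodes]
    total = sum(weights)
    posterior = [w / total for w in weights]
    words = {}
    for p, n in zip(posterior, nodes):
        for word, freq in n["words"].items():
            words[word] = words.get(word, 0.0) + p * freq
    return words

# Toy nodes over hypothetical (red, green, blue) mean-colour features.
toy_nodes = [
    {"prior": 0.5, "mean": (0.2, 0.4, 0.9), "var": (0.01, 0.01, 0.01),
     "words": {"sky": 0.8, "water": 0.2}},
    {"prior": 0.5, "mean": (0.2, 0.7, 0.2), "var": (0.01, 0.01, 0.01),
     "words": {"grass": 0.9, "sky": 0.1}},
]
prediction = predict_words((0.25, 0.45, 0.85), toy_nodes)
```

For the bluish toy region the first node dominates the posterior, so "sky" receives the highest probability in `prediction`.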

2.3 Grouping Cues

This section looks at possible features (grouping cues) which are relevant to the image segmentation problem.

2.3.1 Colour

Colour is the primary low-level feature in practically all image segmentation techniques which can be found in the literature [29, 30, 31, 59, 18]. One of the main issues related to colour is the choice of an appropriate colour space. Typically it is advantageous for the colour space used to guarantee a low correlation among components and to be perceptually uniform, i.e. the numerical distance is proportional to the perceived colour difference. Due to the above requirements many recent publications advocate utilization of the CIE LUV colour space and its improved version CIE LAB, which for similar colours are approximately perceptually uniform [28, 29, 30].
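To make the notion of a numerical colour distance concrete, the sketch below converts 8-bit sRGB values to CIE LAB and measures the Euclidean distance between two colours. It is a minimal sketch under stated assumptions: the input is assumed to be sRGB, the reference white is assumed to be D65, and the function names are hypothetical. It is not the colour model proposed later in this thesis.

```python
import math

# D65 reference white in XYZ (an assumption of this sketch).
WHITE = (0.95047, 1.0, 1.08883)

def srgb_to_lab(r, g, b):
    """Convert an 8-bit sRGB colour to CIE LAB under a D65 white point."""
    def linearize(c):
        # Undo the sRGB transfer function.
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    # Linear sRGB to CIE XYZ.
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        # Cube root with the linear segment used by CIE LAB near zero.
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = f(x / WHITE[0]), f(y / WHITE[1]), f(z / WHITE[2])
    return (116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz))

def lab_distance(rgb1, rgb2):
    """Euclidean distance in CIE LAB between two sRGB colours."""
    lab1, lab2 = srgb_to_lab(*rgb1), srgb_to_lab(*rgb2)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(lab1, lab2)))
```

As a sanity check, `lab_distance((0, 0, 0), (255, 255, 255))` is approximately 100, the full lightness range. As the text notes, the correspondence with perceived difference is only approximate and holds best for similar colours.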

2.3.2 Texture

Texture is arguably the second most important low-level feature after colour in the image partitioning problem. Several approaches to texture characterization have been shown to be useful for segmentation [30] including multi-orientation filter banks [60] and the second-moment matrix [61, 29].

However, utilization of texture in image segmentation should not be considered simply as a thoughtless extension of the colour feature. There are several issues which must be addressed when making use of texture information in order to achieve satisfactory results for natural images often depicting both textured