Temporally consistent region based video segmentation

(1)

LEABHARLANN CHOLAISTE NA TRIONOIDE, BAILE ATHA CLIATH TRINITY COLLEGE LIBRARY DUBLIN

OUscoil Atha Cliath The University of Dublin

Terms and Conditions of Use of Digitised Theses from Trinity College Library Dublin

Copyright statement

All material supplied by Trinity College Library is protected by copyright (under the Copyright and Related Rights Act, 2000 as amended) and other relevant Intellectual Property Rights. By accessing and using a Digitised Thesis from Trinity College Library you acknowledge that all Intellectual Property Rights in any Works supplied are the sole and exclusive property of the copyright and/or other I PR holder. Specific copyright holders may not be explicitly identified. Use of materials from other sources within a thesis should not be construed as a claim over them.

A non-exclusive, non-transferable licence is hereby granted to those using or reproducing, in whole or in part, the material for valid purposes, providing the copyright owners are acknowledged using the normal conventions. Where specific permission to use material is required, this is identified and such permission must be sought from the copyright holder or agency cited.

Liability statement

By using a Digitised Thesis, I accept that Trinity College Dublin bears no legal responsibility for the accuracy, legality or comprehensiveness of materials contained within the thesis, and that Trinity College Dublin accepts no liability for indirect, consequential, or incidental, damages or losses arising from use of the thesis for whatever reason. Information located in a thesis may be subject to specific use constraints, details of which may not be explicitly described. It is the responsibility of potential and actual users to be aware of such constraints and to abide by them. By making use of material from a digitised thesis, you accept these copyright and disclaimer provisions. Where it is brought to the attention of Trinity College Library that there may be a breach of copyright or other restraint, it is the policy to withdraw or take down access to a thesis while the issue is being resolved.

Access Agreement

By using a Digitised Thesis from Trinity College Library you are bound by the following Terms & Conditions. Please read them carefully.

(2)

Tem porally Consistent Region Based

Video Segm entation

A thesis subm itted to University of DubHn, Trinity College

in partial fulfilment of th e requirements for the degree of

Doctor of Philosophy

School of C o m pu ter Science and Statistics,

University of Dublin, Trinity College

A ugust 2014

(3)

rt^ T R W IT Y C O L L E G E ^

5 JAN 2015

(4)

D eclaration

I, the undersigned, declare th a t this work has not previously been subm itted as an cxercise

for a degree at this, or any other University, and th a t unless otherwise stated, is my own work.

Atul Nautiyal

(5)

P erm ission to Lend a n d /o r Copy

I, the undersigned, agree th a t Trinity College Library may lend or copy this thesis upon request.

(6)

Dedicated to

(7)

Acknowledgments

T h is thesis would n o t have been possible w ith o u t th e guidance an d su p p o rt of my supervisor K en n eth Dawson-Howe. I would sincercly like to th a n k him for his guidance, continuous encouragem ent, co n stru ctiv e criticism , generous help, endless patien ce an d enorm ous freedom he gave m e d u ring the PhD .

My colleagues in G V 2 over p ast 4 years played a v ital role in th is work. I am grateful to th em all for th eir su p p o rt, advice, proof reading and general good com pany. A special th a n k s goes to C lau d ia A rellano, Zbigniew E dm Zdziarski, D onghoon Kim and Sayandeep P u rk ay asth for consistent m oral su p p o rt th ro u g h o iit and excellent advice.

I would like to express m y deepest appreciation to Bac:han Singh who gave me innnensc su p p o rt since th e day I step p ed in D ublin to pursue my education. I would also th a n k my friends for providing su p p o rt and friendship th a t I needed. I th a n k A n iab . M ithileash, Amit an d Soum ya for m aking my sta y in D ublin w onderful and th e ir belief in me. I especially th a n k A ru n im a G u lati for her su p p o rt, patience and co n sta n t encouragem ent d u ring my P h D . Last b u t n o t th e least; I th a n k my p aren ts B huw an N autiyal and P re m la ta N autiyal, m y sister A rchana an d m y b ro th e r Lava for th eir su p p o rt and u nconditional love.

At u l Na u t i y a l

(8)

A bstract

T h is thesis addresses th e problem of segm enting a video sequence in a tem p o rally consistent fashion, so th a t th e labels assigned to p artic u la r region rem ain th e sam e th ro u g h o u t the

sequence.

Energy m inim isation, and in p artic u la r o-expansion based label p ro p ag atio n , is a p o p u lar approach for addressing video segm entation. In th is approach, labels of m ost sim ilar objcct classcs arc assigned to pixels such th a t th e energy of final libelling is m inim ised. Two m ajo r lim itatio n s of th is approach arc n o t being able to handle non class based segm entation and ap p earan ce of new o b jects in th e cu rren t fram e. In th is thesis, we developed a novel m eth o d for label p ro p ag atio n in th e forw ard direction using th e cv-cxpansion m ethod. O ur spatio-tem poral displacem ent constraint enables th e a-ex p an sio n to handle b o th non class an d physical class based segm entation w ith lesser pixel to label com parisons. A ppearance of new o b jects is handled by com paring colour properties of pixels and th e existing o b jects d u rin g th e label propagation. extend our approach to learn th e physical explanations (like occlusion or deocclusion) behind any changes in th e segm entation results over tim e. L earned physical explanations also help us to d eterm ine occlusion boundaries.

(9)

(10)

List of Tables

3.1 User defined variables... 49

6.1 Spatial and size consistency scores... 113

6.2 Temporal consistency scores... 118

7.1 Spatial and tem poral consistency scores of test video sequences... 154

(15)

List of Figures

1.1 A n ill posed pro b lem ... 1

2.1 H y p c r g r a p l i ... 13

2.2 Belief p ro p ag atio n ... 20

3.1 s t g rap h for com puting the optim al m oves... 32

3.2 st g raph an d pairw ise p o ten tials after Q -expansion... 34

3.3 st grap h an d higher order clique p o ten tials after a-ex p a n sio n ... 3C 3.4 P hysical class based seg m en tatio n ... 37

3.5 S p atio -tem p o ral n eig h b o u rh o o d ... 38

3.6 st g rap h co n stru ctio n w ith spatio -tem p o ral displacem ent c o n s tra in t... 38

3.7 T ex tu re m a p ... 44

3.8 P airw ise p o te n tia l... 46

3.9 M anually specified segm entation for th e first processed fram e of th e flower- and garden video sequence... 50

3.10 L abel p ro p ag atio n resu lts for five fram es of th e flo w er and garden sequence. Fram es are selected a t regular interval of 5 fram es... 50

3.11 M anually specified segm entation for th e first processed fram e of th e boy video sequence [1|... 52

3.12 L abel p ro p ag atio n resu lts for five fram es of th e boy video sequence. Fram es arc selected a t regular interval of 40 fram es... 52

3.13 A p p earan ce of new o b je c ts... 54

3.14 st g rap h w ith d a ta c o st... 57

3.13 H andling ap p earan ce of new o b je c ts... 59

4.-1 E x am p le of in accu ratc segm ents m erging... 64

(16)

L I S T O F F I G U R E S xiv

4.0 M ulti-label m a s k ... C7

4.1 E xam ple c l u s t e r ... 69

4.2 Segment m erging... 75

4.2 Im proved label p ropagation re su lts... 81

5.1 Occlusion and deocclusion... 84

5.2 L earning physical ex p lan atio n s... 87

5.3 D etecting occlusion boundaries using segm entation result of th e current fram e 90 5.4 D etecting occlusion boundaries using segm entation results of th e previous fram e. 93 5.5 U p d atin g occlusion boundaries using tem poral consistency m e tric ... 94

5.6 L earned physical ex p lan atio n an d occlusion b oundaries for ten fram es of th e flow er and garden video sequence... 97

5.7 L earned physical explanation and occlusion boundaries for ten fram es of th e boy video sequence... 98

6.1 O v e r - s ('g m e n ta tio n ... 100

6.2 Exam ples of spatial and size consistency scores using different synth etic d a ta . 102 6.3 Sjjatio-tem poral neighbourhood of a pixel in block of fram es... 107

G.4 Paths for pixel in forw ard d irectio n ... 108

6.5 Tcm jioral consistency score 50 fram es of th e boy video sequence... 110

6.6 G round tr u th and segm entation resu lts of a-ex p an sio n and our m eth o d for five fram es of th e boy sequence... 112

6.7 T em poral consistency scorcs... 114

6.8 Inconsistent label p ro p a g a tio n ... 114

6.9 T em poral consistency score in case of inconsistent label p ro p a g a tio n ... 115

6.10 T em poral consistency m etric exam ples w ith m axim um consistency scores. . . 116

6.11 T em poral inconsistency m etric exam ples w ith inconsistent changes... 117

7.1 Key fram es for th e flower and garden video s e q u e n c e ... 121

7.2 Label p ro p ag atio n resu lts for ten fram es of th e flower and garden sequence. Fram es are selected a t a regular interval of five fram es... 123

7.3 Si)atial and size consistency m etrics scores for flow er and garden sequence. . . 124

(17)

L IS T OF F IG U RES xv

7.6 Label propagation results for ten frames of the boy video sequence |1|. Frames

are selected at a regular interval of twenty fram es... 127

7.7 Spatial and size consistency metrics scores for boy video sequence... 128

7.9 Key frames for a P E T S 2001 video sequ en ce... 130

7.10 Label propagation results for ten frames of the P E T S 2001 video sequence. Frames are selected at a regular interval of two frames... 131

7.11 Spatial and size consistency metrics scores for a P E T S 2001 video sequence. . 132

7.13 Key frames for the carl video sequence ... 134

7.14 Label propagation results for ten frames of the carl s e q u e n c e ... 135

7.15 Spatial and size consistency metrics scores for the carl video sequence... 136

7.17 Key frames for the yunakim video sequence |1]... 138

7.18 Label propagation results for ten frames of the yunakim sequence... 139

7.19 Spatial and size consistency metrics scores for the y^makim video sequence. . 140

7.21 Key frames for the dance video sequence [I]... 142

7.22 Label propagation results for ten frames of the dance video sequence [1|. . . . 143

7.23 Spatial and size consistency metrics scores for the dance video seciuence. . . . 144

7.25 Key frames for the hands2 video sequence |2 |... 146

7.26 Label propagation results for ten frames of the hands2 sequence... 147

7.27 Spatial and size consistency metrics scores for the hands2 video sc'quence. . . 148

7.29 Key frames for the hands video sequence... 150

7.30 Label propagation results for ten frames of the hands sequence... 151

(18)

Chapter 1 Introduction

In co m p u ter vision, seg m en tatio n is th e process of dividing a scene (im age or video fram e) into different m eaningful regions which can fu rth e r be used for scene in te rp re ta tio n . Segm en ta tio n is an ill posed problem b(!cause seg m en tatio n decisions are influenced by th e d e p th of know ledge th a t any seg m en tatio n m eth o d has a b o u t th e scene |3|. As a resu lt, it is possi ble th a t different seg m en tatio n m e th o d s or hu m an observers pro d u ce different b u t a c c u ra te

seg m en tatio n s of th e sam e scene (F igure 1.1).

(a) Image (b) A nnotator id: 1123 c) A n n o tato r id: 1130 (d) A nnotator id: 1132

F ig u re 1.1: S eg m en tatio n is an ill posed problem . T h re e h u m an a n n o ta to rs c re a te differ en t seg m en tatio n s for th e sanu; im age. Im age an d seg m en tatio n re su lts are ta k e n from th e Berkeley D a ta b a se |4|.

T em porally co n sisten t seg m en tatio n of video sequences is th e basis for a n u m b er of a p p li catio n s in different dom ains, such as video track in g , video com pression, video c o n te n t based indexing, surveillance, a rtistic sty lisatio n an d video forensics. T hese a p p licatio n s rely on b o th

[image:18.538.44.495.32.556.2]

(19)

dif-ferent labels to the same object or some parts of it. This makes it difficult for higher level applications to correctly in terpret the scenes in a given video.

In video segm entation, accuracy and tem poral consistency are two im portant require ments. To date, accurate segm entation of individual frames has recc'ived more attention th an tem porally consistent segm entation of video. A common belief is th a t acc urate sjiatial seg m entation of image regions will autom atically enforce tem poral consistency of regions in video frames. However, tem poral consistency inform ation can provide cues for video object detec tion and can also be used to improve video segm entation accuracy. T he main objective of this research is to improve the state of the a rt in tem porally consistent region based video segm entation.

There are two ty|)es of tem poral consistency: short term and long term. Video segm enta

tion results are short term tem porally consistent if they are tem porally consistent only for a small num ber of frames |5, 6, 7, 8, 9|. For short term tenijjoral consistency, labels assigned to pixels belonging to the objects change quite often over the whole vid('o .sequence. Most of the block based video segm entation approaches deliver short term tem poral consistency, i.e. over frames of the sam e block. For long term tem poral consistency, segm entation results of the objects shotild rem ain tem i)orally consistent while any part of the object is visible in the video sequence. T here are two popular approaches th a t try to achieve long term tem poral consis tency; (i) tracking based |10, 11| and (ii) label propagation based m ethods |12, 13, 14, 15, 7|. Tracking based m ethods com pute tem poral inform ation in advance. Segm entation is then perform ed by clustering spatially and tem porally coherent pixels together. Label propagation based m ethods form ulate image segm entation as an energy optim isation jjroblem where seg m entation is achieved by m inim ising the defined energy functions. In general, these m ethods

produce b e tte r tem porally consistent segm entation results in comparison with the segmen tatio n m ethods based on two other approaches as detailed in chapter 2. Therefore, in our research we have developed an approach using label propagation methodologies to achieve long term tem porally consistent segm entation results.

(20)

1.1 C ontributions

T h e novel co n trib u tio n s from th is research work are as follows.

1.1.1 S p atio-tem p oral d isp lacem ent constraint

L abel p ro p ag atio n is a two ste p process, hi th e first step, a set of learned labels is p roduced by- learning th e p ro p erties, e.g., colour, te x tu re , etc., of o b jects present in video using tra in in g on e ith e r te st videos or ground tr u th d a ta of a few (1 or 2) previous fram es. N orm ally, each label of th e learned label set represents a class of physical objects, for exam ple, sky, car, grass, road, b o a t and sheep. We call th is ty p e of label p ro p ag atio n physical class based label p ro p ag atio n . In th e physical class based label p ro p ag atio n m eth o d s, cach label class has unique p ro p erties an d represents a class of physical objects. T herefore, segm ents of a video sequence represent physical objects and in any fram e, m ultiple in stan ces (e.g., presence of m any cows) of th e sam e o b je c t class gets th e sam e label. If different labels are used to represent m ultiple in stan ces of a physical class th e n it can be referred to as non class based label p ro p ag atio n . In th e second step, th e b est labelling is achieved by m inim ising a defined energy function. E nergy m inim isation m eth o d s often select local m ininm m due to inefficiency of co m p u tin g th e global m inim um . In g('neral. local m inim isation techniques arc sensitive to th e in itial estim a te s and p ro d u ce a local m inim um far from th e global m inim um , a-ex p an sio n is a powerful g ra p h cut based algorithm for energy m inim isation [16|. Q-expansion does not d epend on in itialisatio n and g u aran tees local m inim um w ithin two factors of th e global m inim um . A fter s ta rtin g from an initial labelling th e algorithm changes th e labelling iteratively. A change which leads tow ards the m inim um local energy is selected a t each iteratio n . F inal labelling is said to be achieved w hen fu rth e r m inim isation of energy is not possible. In th is research, we use a -ex p an sio n for m inim ising our energy function (C h a p te r 3).

a-ex p an sio n based label p ro p ag atio n m eth o d s p ro p ag ate labels by com paring each pixel of th e cu rren t fram e w ith each label of th e learned label set. Hence, any label of th e learned label set can be assigned to any pixel of th e c u rren t fram e. In general, in a video sequence, any o b ject present in th e c u rrcn t fram e can only move by a sm all d istan ce in th e n ex t fram e. T herefore, label p ro p ag atio n m eth o d s should n o t com pare each pixel of th e cu rren t fram e w ith each label of th e learned label set. T hey should only com pare cach pixel w ith th e labels of o b jects which were present in th e pixel’s sp atial neighbourhood in th e previous fram e. We

(21)

call this constraint spatio-temporal displacement constraint and propose a novel m ethod th a t enables label propagation m ethods to produce robust results with lesser num ber of com par isons (pixel to label). The spatio-tem poral displacement constraint is a novel contribution

which allows Q-expansion |16| to:

• handle non class based segmentation, and

• produce results with lesser num ber of comparisons (pixel to label).

Spatio-tem poral displacement constraint is one of the m ajor contributions of this thesis and is discussed in detail in section 3.3.

1.1.2 H andling new o b jects

Label propagation uses the learned label set for assigning labels to pixels of the current frame If. Hence, in the case of appearance of new (not present in the learned label set) objects in It, label propagation will assign incorrect labels to the pixels belonging to the new objects |17, 18].

We have developed a novel pre-processing step to deal with new objects. Our approach is a four step approach. In the first step, a new label void is added in the karncd label set to represent new objects. In the second step, an energy function (unary potential) is com puted such th a t during the optim isation it favours assigning void label to a pixel if the colour of pixel is not statistically similar to the colour of atleast one label present in the learned label set. In this research, we have limited our determ ination of new objects to those which have different colour distributions from the existing objects. In the th ird step, pixels marked as new objects are clustered in Colour X Y space. In order to avoid segm entation errors, only clusters bigger than a m inim um area are considered as new objects. In the last step, our algorithm learns properties of these clusters and re-runs the label propagation m ethod for the current frame with the new learned set of labels. Section 3.6 describes this contribution comprehensively.

1.1.3 Im proving label prop agation resu lts for rigid o b jects

Our label propagation m ethod uses properties, i.e., colour and texture of labels present in the learned set for assigning labels to each pixel of the current frame. If the properties of pixels

(22)

belonging to the two neighbouring segments are similar, our label propagation may misclassify some or all of the pixels from these segments.

T he segmentation results achieved using the label propagation m ethod may be incorrect bu t can be used to learn new cues such as m otion and shape of the segments. These new cues can further be used to improve the label propagation results. Wc have developed a novel m ethod which improves the label propagation results for rigid objects using shape and motion cucs learned with the help of the segm entation results of the current and previous frames. This contribution is covered in chapter 4.

1.1.4 Learning p hysical exp lan ation s b ehind changes in segm en ts shapes

and occlu sion boundaries

Rigid objccts present in a video may exhibit motion, such as translation, rotation and scaling. These moving objects can occlude or deocclude themselves and other objccts. Occlusion is an event where, in the view of an observer, objccts disappear (completely or partially) behind other objects. Deocculsion is an event where new parts of the objects appear from behind the other objects. Both occlusion and deocclusion changes the visible shape of objects in the view of an observer. We have developed a m ethod for categorising changcs in the shape of segments as occlusion or deocclusion. We refer to it £is learning the physical explanation. Along with the physical ex])lanation, we have developed a novel algorithm for detecting occlusion boundaries present in any frame. Occlusion boundaries are boundaries present between the foreground and the background segments. This contribution is detailed in chapter 5.

1.1.5 E valuation m etrics

(23)

over-segm ented regions to m easure th e over-segm entation resu lts accurately. C onsidering these req u ire m en ts, we have developed two m etrics: sp atial consistency and size consistc'ncy for m easuring th e accuracy of th e seg m en tatio n results.

A n o th er m etric, th e tem p o ral consistency m etric is developed for m easuring th e te m p o ral consistency of seg m en tatio n resu lts by com paring label consistency of the pixels in b o th forw ard and backw ard directions. E xisting evaluation m etrics are d etailed in Section 2.4.

S p a tia l c o n s is te n c y m etric

T h e sp a tia l consistency m etric m easures th e accuracy of seg m en tatio n resu lts ag ain st th e g ro u n d tru th . Norm ally, hu m an a n n o ta te d ground tr u th is a high level (objc'ct based) segm en ta tio n and consequently, over-segm entation is exp ccted w ith seg m en tatio n m eth o d s. T hus, th e re is a need for over-segm ented regions to be clu stered to g eth er before com parison w ith groim d tr u th objects. 0>ir m etric developed for m easuring th e sp atial consistency is d etailed in section 6.1.

S ize c o n s is te n c y m etric

T h e sp a tia l consistency m etric docs not consider th e size an d n u m b er of over-segm ented segm ents. Let us assum e th a t a segm entation algorithm divides a ground tr u th o b ject into segm ents of size 1 pixel each. O u r sp atial consistency m etric will clu ster th e segm ents to g eth er w ith o u t considering th e ir sizes. Hence, th is seg m en tatio n will get m axiim m i score for th e s p a tia l consistency. T herefore, th e sp a tia l consistency m etric alone is n o t enough to m easure th e seg m en tatio n quality. We therefore in tro d u ce an o th e r m etric called size consistency to p enalise th e seg m en tatio n q u ality based on th e num ber and th e sizes of regions. T h is m etric is d etailed in section 6.2.

T em p o ral c o n s is te n c y m etric

From a review of existing lite ra tu re , it is a p p a re n t th a t th ere is a lack of ro b u st m e th o d s and m etrics for m easuring th e tem p o ral consistency of video segm entation. T em poral consistency of video seg m en tatio n resu lts can be m easured using gro u n d tr u th w ith pixel to pixel corre spondence inform ation. Such correspondence in fo rm atio n enables evalu atio n m etrics to check th e consistency of labels assigned to pixels over tim e. However, creatin g such d etailed gro u n d tr u th is a trem endously difficult task. We have developed a novel m e th o d to m easu re th e

(24)

tem poral consistency of the segm entation results. Our m ethod compares the segm entation results of the current frame only with the past and future segm entation results w ithout need of the ground tru th data. The developed algorithm is discussed in section 6.3. Temporal consistency m etric is another of the m ajor contributions of this thesis.

1.2 R eport outline

In chapter 2, we present a review of the state of the art for three types of region based video segm entation m ethods (block based, tracking based and energy minimisation based label propagation methods) and for evaluation metrics. We present the basics of the a-expansion algorithm along with the algorithm s developed for label propagation in the forward direction and those for handling new objccts in chapter 3. Our algorithm s for improving the label propagation for rigid objccts is presented in chapter 4. In chaptcr 5, we present our algorithm s for learning physical explanations and occlusion boundaries. In chapter 6, we present three new m etrics for measuring the accuracy and the tem poral consistency of segm entation results. We present our results in chapter 7. A sum m ary of this research w'ork and directions for future work arc presented in chapter 8.

(25)

Chapter 2 State of the art

T h is c h ap tcr presents s ta te of th e a rt of th e region based te m p o ra l consistent video seg m en ta tio n and evaluation m etrics (Section 2.4). In our review of region based te m p o ra l consistent video segm entation, we focused on th ree ty p es of video seg m en tatio n m ethods; block based (Section 2.1), tracking based (Section 2.2) an d energy m in im iza tio n based label propagation m ethods (Section 2.3). \'id e o is a collection of im age fram es. T herefore, th e sim plest way of segm enting any video is to segm ent each fram e individually. However, th is tcchnicjuc lacks te m p o ra l coherence bccausc each fram e gets segm ented independently. A possible solution is to segm ent th e im age fram es into groups. Block based video seg m en tatio n m e th o d s try to achieve b o th sp atial an d tem p o ral coherence by segm enting m ore th a n one fram e sim u lta n e ously. T hese m e th o d s first divide th e video in sm all blocks of tem p o ral an d sp atially co n sisten t fram es and th e n segm ent fram es of each block. Block based video seg m en tatio n m e th o d s are u n ab le to jjroduce tem p o rally consistent resu lts over long spans of th e tim e (Section 2.1.4).

Tracking based video seg m en tatio n m eth o d s a tte m p t to im prove tem p o ral consistency of seg m en tatio n by co m p u tin g te m p o ra l inform ation in advance. In th e first step , these m e th o d s co m p u te th e m otion tra je c to ries of featu re p o in ts or pixels over th e whole video sequence. In th e next step , seg m en tatio n is achieved by clustering th e pixels aro u n d th e tra je c to rie s of sim ilar featu re points. T em poral consistency of these m eth o d s is discussed in section 2.2.1.

(26)

m ethods produce more accurate and more tem porally consistent results than the block based video segm entation methods. These m ethods also have more general purpose applicability in comparison to the tracking based approaches.

2.1 B lock based vid eo segm en tation m eth od s

Video segm entation m ethods rely on spatial and tem poral cohcrcnce of objects present in video to produce more accurate and robust segm entation results. Spatial segm entation of a frame is same as image segmentation. Therefore, video segm entation m ethods inherit algorithm s from the image segm entation domain for spatial segm entation of frames. The objects in a video scquencc experience m otion, occlusion and changes in illumination. Therefore, it is very hard to enforce spatial and tem poral coherencc of objects over a long period of time. To overcome these limits, volumetric techniques try to achicvc a short term spatio tem poral cohcrcnce by decomposing the video sequence into smaller blocks of frames. In the simplest case, each block can hold a single frame |19|. However, lack of tem poral inform ation in the one frame approach may causc unstable segm entation results. A two frame per block approach tries to address this issue by segmenting two frames sinmltaneously [20, 21, 5, 8|. On the negative side, this model only uses one directional (cither backward or forward) tem poral inform ation for segm entation of frames. A common approach is allocating three frames per block where central frame works as a reference frame |G, 7|. The three frame approach uses motion and localization cues from both the directions (backw'ard and forward) to produce segm entation results.

A few initial segm entation approaches do not follow the above described fixed number of frames per block allocation mechanism. For example, Huang ct al. [22| used tw'O frame approach to com pute optical flows for patches and then used three frames to create initial spatial-tem poral segm entation, hi another example, after generating the particles (feature points) trajectories using w'hole video sequence, Silva and Scharcanski |10] used a 2-frame approach for spatial clustering of the trajectories.

2.1.1 S egm en tation o f blocks o f fram es

\'id eo segm entation is a process of labelling pixels of frames based on cues like colour, texture, motion, shape and higher order co-occurrences. Generally cues used in the image and video

(27)

seg m en tatio n process can be divided into th re e levels: low, m edium and high level cues. Low level cues, such as colotir and b rightness consider pixels directly; m edium level cues, such as region continuity and shape consider th e p ro p erties of w hole regions and; high level cues like co-occurrence relationships, for exam ple, b o at-w ater and airplane-sky consider scene p ro p e rtie s for labelling th e pixels |23].

A fter dividing video in blocks of fram es (Section 2.1), video seg m en tatio n m eth o d s perform seg m en tatio n a t th e block of fram es. N orm ally, each block of fram es is segm ented in tw'o steps; in itial segm entation and final segm entation. T he in itial seg m en tatio n divides th e block of fram es into sm all spatio tem p o ral hom ogeneous regions using low level cues like, colour and b rig h tn ess, and th en th e final seg m en tatio n m eth o d s m e rg e /sp lit these in itial segm ents using m otion coherencc. How'cver, th ere are som e video seg m en tatio n m eth o d s which do not use th e in itial segm entation ste p and directly go for th e final segm entation of blocks.

T h e m o tivation behind th e initial segm entation is to create sm all local, coherent and b o u n d a ry preserving groups of pixels to cnhance th e overall seg m en tatio n q uality and to reduce th e required co m p u tatio n pow'cr for th e subsequent video processing. In itial seg m en tatio n divides th e fram es of a video block into sm all hom ogeneous sp a tio -tem p o ral over-segm ented regions based on a set of fe a tu res while preserving o b jects b o u n d a ry inform ation. N orm ally, th is in itial seg m en tatio n is o b ta in e d by generalising th e im age seg m en tatio n m eth o d s into th e

(s})atial — tim e) dom ain.

In itial seg m en tatio n of a video block produces result as a 3Z? (sp atial ^ tim e) g rap h . In itial seg m en tatio n should preserve th e region b oundaries b u t it is q u ite possible to lose region b o u n d aries in this process. T herefore, an o th e r o ption is to skip th e initial seg m en tatio n step an d th e n use pixels d irectly as nodes of th e g rap h for final segm entation. For exam ple, Shi an d M alik |20] initialised a weighted sp atio tem p o ral g rap h using pixels of two consecutive fram es as nodes. In an o th er work, Z hang and J i |7| used pixels to co n stru c t a 3D conditional ran d o m field (C R F ) m odel as th ey suggest th e initial seg m en tatio n techniques m ay fail to find a c c u ra te o b jects bou n d aries. How'cver, considering pixels as initial segm ents for fu rth e r seg m en tatio n , d em ands enorm ous processing power. T he goal in th e final seg m en tatio n is to p ro d u ce m ore m eaningful sp atio -tem p o ral coherent regions by p artitio n in g th is in p u t 3Z) graph.

Block based m eth o d s extensively rely on th e n eighbourhood of pixel in fo rm atio n for c a te gorising it into b o th sp atial an d tem p o ral directions. G rap h s provide a wonderful framew^ork

(28)

to represent this neighbourhood information. Therefore, these m ethods use graphs to repre sent the input frames. We can divide the block based m ethod in two categories; pair wise graph (Section 2.1.2) and hypcrgraph (Section 2.1.3) based methods.

2.1.2 P airw ise graph based m eth o d s

Pairwise graphs arc very popular in the image and video segm entation bccause of their sim plicity. These graphs arc called pairwise because they capture only the pairwise relationship between its nodes. Let Q = ( V, £) be a undirected graph where vertices Vj G V represents the set of elements (pixels or superpixels) to be segmented and (vj, vj) G £ represents the set of edges between neighbouring vertices. A non negative weight which captures the dissim ilarity between neighbouring vertex pair \ \ and Vj, is assigned to each edge (vj, Vj) G £. A segm entation process divides this graph into different clusters of vertices such th a t elements w ithin a cluster should have similar feature values and elements in different clusters should have dissimilar feature values.

Generalised Felzenszwalb and H uttenlochcr’s [24| image segm entation m ethod provides a simple and effective graph based 3D (XY — time) segm entation framework |5, 9|. Felzenszwalb and H uttenlocher used a simple, intensity based predicate to merge graph nodes |24| and proposed feature points to deal with occlusion. Generally, graph based clustering m ethods l)rovide no mechanism to control the cluster size. To address this behaviour, Galm er and Huet |5l used a size dependent threshold with the graph based m ethods for segmenting the first frame. This threshold regulates region size by allowing fewer merges as region size increases.

W ith the graph representation, segm entation problem can be treated as graph partition problem. Wu and Leahy [25] used graph cuts for the image segm entation. A cut partitions the nodes V of graph Q into two disjoint subsets. Their m ethod try to minimise the sum of edge weights across cut boundaries. However, this graph cut m ethod was biased towards small regions. Shi and Malik (26| addressed this biased behaviour of graph cut by taking a norm alised weight across the cut. Their m ethod Normalised Cut {NCut) is the most popular graph cut m ethod. Many other segm entation m ethods use N C ut as a basis. Popular graph cut based m ethods use eigen vectors of an affinity (edge weights) m atrix to partition the graph. In NCut, Shi and Malik |26| used eigen vectors with first k smallest eigenvalues to p artition the 2D graph into k partitions, by finding the splitting points such th a t the N C ut is minimised. Each partition is further repartitioned if N Cut is less th an a pre-specified stability

(29)

value. L ater Shi and M alik generalised N C u t for video dom ain |20|.

T h e Superj)ixel algorithm of R en an d M alik |27| is a j)opular N C u t based seg m en tatio n al g o rith m to pro d u ce a large luim ber of sm all, com pact qtiasi-uniforni segm ented im age regions. In th eir m eth o d , edge afhnity is calcu lated by connecting p ix el’s contour and te x tu re cues. L evinshtein ct al. |28| poin ted out m ost of th e initial seg m en tatio n algorithm s, like local vari a tio n s of g rap h -b ased alg o rith m s |24|, have no control over th e num b ers of segm ents and also p ro d u cc segm ents of irreg u lar sh ap e an d sizes. To overcome these lim itatio n s th e y proposed TtirboPixels alg o rith m th a t produces lattice-like s tru c tu re of co m pact regions by d ilatin g seed pixels to a d a p t th e local s tru c tu re w ith roughly linear tim e com plexity. Using geom etric flow b ased alg o rith m s, TurboPixels assures uniform sized, co m pact, connectivc, edge preserving (tu rb o p ix cls d o n ’t have any o b jcct b o u n d aries inside them ) an d non overlapping segm ents. However, Turbopixel’s control over n u m b er of segm ents comes w ith a new lim itatio n , as the n u m b er of required initial segm ents m ust b e know n in advance.

M ultiscale seg m en tatio n is a n o th e r app ro ach to increase th e accuracy and th e speed of th e seg m en tatio n m e th o d s |29, 30, 311. In th e m ultiscale approach, seg m en tatio n a t lower resolu tio n tries to c a p tu re th e global inform ation and seg m en tatio n a t higher resolution c a p tu re s th e local in fo rm atio n very accurately. F u rth erm o re, seg m en tatio n resu lts of tlu ' lower resolutions a re p ro p a g a te d hierarchically to th e higher levels to producc th e final results. For exam ple, C o u r ct al. |30| proposed a linear tim e, m ultiscale g ra p h decom position w ith norm alised cu ts for im age seg m en tatio n . T h is m ultiscalc g ra p h decom position m e th o d is a d o p ted in video seg m e n ta tio n for in itial seg m en tatio n because of its accuracy an d linear tim e com plexity [22, 311. In ste a d of g rap h cu ts som e m eth o d s use sim ple m erging p red icates to p ro d u cc th e final seg m en tatio n |5, 9|. In these m eth o d s, th e m erging p red icate an d th e o rd er of m erging are tw o im p o rta n t p a ram eters. T h e m erging pred icate is a s ta tistic a l te st to decide th e m erg in g /s p littin g o f th e g rap h nodes. O n th e o th e r h a n d , th e m erging o rd er decidcs th e order o f m erging for th e nodes [32]. M cD iarm id ’s in eq u ality has been co m m onh' used for m erging p re d ic a te calcu latio n [32, 8, 33|. G alm ar an d H uet [5[ used a global th resh o ld w ith a local m erging p re d ic a te to control th e size of sp a tio tem p o ral segm ents. For sim plicity, each edge w eight is com pared ag ain st th e in tern al m ean w eights of th e p a rtic ip a tin g co m p o n en ts ra th e r th a n th eir m axinm m weights |24|. T h is m e th o d also categorises segm ents as s ta tic or m oving. However, all p a ra m e te rs of th is seg m en tatio n m e th o d s including th e global th resh o ld , are m an u ally selected. In th eir w ork, H assani ct al. [8[ used a in te n sity /c o lo u r based sta tistic a l

(30)

te s t to p a rtitio n a fram e into in itial p atches.

S tein a n d H cl)ert |2| proposed a w atersh ed based o v er-seg m en tatio n , ciriven by tlie o u tp u t of a learn ed , sta tistic a l an d m ulti-cue edge d e te c to r a fte r non-local m ax in u im sui)pression. An edge m aj) of th e referc'nce fram e is c o n stru c te d using th e Pb edge d e te c to r [34|. T h e Pb edge d e te c to r is selected b ecause it considers colour, b rig h tn ess an d te x tu re cues sim ultaneously. T h e w atersh ed is preferred over m e th o d s relying on n o rm alised c u ts for its sim plicity, speed

an d m ore regularly shap(^d segments.

2 .1 .3

H yp ergrap h b ased m eth o d s

G ra p h b ased m e th o d s n orm ally co m p u te edge affinity by com b in in g m any w eighted featu res. However, edge affinity calcu lated as a w eighted sum of fe a tu res m ay lead to loss of som e vital in fo rm atio n . M oreover, w hen affinity relatio n s are n o t p a ir wise b u t ra th e r of higher ord ers, sim ple graj)hs are u n ab le to c reate good clusters. In clu sterin g , so m etim es th re e or m ore d a ta p o in ts should be an aly sed to g e th e r to d e te rm in e if th e y belong to th e sam e clu ste r or n o t |35|. A special kind of g ra p h , h yp('rgraph, re p resen ts th is higher o rd e r re la tio n sh ip of d a ta .

F ig u re 2.1: E x am p le of a h y p erg rajjh w ith eight vertices V = {vi,V2,V3,V4,V5,x!e,x>T,vg} an d four h y p er edges £ = {61,62,63,64} }, {^ 2 ,u.3,'*^8}, {«4, U 5 , '*^5, E ach hy p er edge is h ig hlighted in different colour.

[image:30.538.50.522.296.620.2]

(31)

hyperedges. T h e fu n ctio n h associates non -n eg ativ e w eights w ith each hyperedge. F ig u re 2.1, show s a n exam ple of h y p e rg ra p h w ith eight vertices V = {vi , v2,v-,i,v4, vr.„vti, vr,vt^} an d four

h y p e r edges £ = {6 1,6 2,6 3,^4} = {{ui}, {^^2,-(^3, {^4, ^8, ?'5, {i^a,'’5, w?}}- A garw al et al. |35] were h rs t to in tro d u ce h y p e rg ra p h s in th e c o m p u te r vision for th e im age s('gm entation. How ever, to p a rtitio n th e hyi)ergraph, th eir m eth o d api>roxim ates a w f'ighted grapii from the

h y p e rg ra p h . L a te r, Zhou e t al. [36] ex ten d ed sp e c tra l c lu sterin g m ethodologies (like N C u t) of u n d ire c te d g ra p h s for th e h y p erg rap h s. T h e ir norm alised h y p erg ap h c u t is a p o p u la r alg o rith m for p a rtitio n in g h y p e rg ra p h s [22, 31[.

H y p e rg ra p h s are gain in g p o p u la rity in video seg m en tatio n in recent years because of th e ir ab ility to c a p tu re higher o rd er relatio n sh ip s. For ex am ple, H u an g et al. [22[ m odel tlu ' s|)a tio te m p o ra l over-segm ented im age p atch es using h y p erg rap h s. T h is m eth o d |)erform s s p e c tra l analysis [36[ on o p tical flows a n d m o tio n profiles of p atch es to set th e h yperedge w eights. F u rth erm o re, H uang e t al. [22[ em phasise hyperedges th a t c o n ta in m oving o b je c ts by assig n in g th e m larger w eights. G enerally, h y p erg rap h c o n stru c tio n an d clu sterin g alg o rith m s suffer from higher tim e com plexity in com parison to pairw ise g rap h based alg o rith m s.

To deal w ith higher tim e com plexity nniltilevel fram ew ork for h y p e rg ra p h s h as been p ro posed I37j. In th e nniltilevel ap p ro ach , in itial ])artitio n in g h as been co m p u ted over th e co arsest h y p e rg ra p h using eigen vectors. A fterw ards, j)a rtitio n s of th e coarser h y p e rg rap h are p ro je c te d to th e n e x t finest level of th e h y p erg rap h .

2 .1 .4

T em p oral co n siste n c y issu es w ith b lock b ased v id eo seg m en ta tio n

m eth o d s

We have identified following m a jo r issues w hen te m p o ra l consistency of block b ased m e th o d s b re a k down:

(32)

C o m p u ta tio n a l re strictio n s: Accuracy and tem poral consistency of these m ethods depend on the num ber of frames used per block. Normally better segm entation results are achieved by increasing the num ber of frames per block. However, there are memory and processing power restrictions on the num ber of frames th a t a segm entation m ethod can process simultaneously.

R e g io n m e r g e /s p lit: Region merge is an event when two or more spatially disconnected regions/objects of sim ilar properties get merged at some point of the tim e t and then stay together for considerable am ount of time. Region split is an opposite event of region merge. In the region split, pixels of any region/object split into two or more m otion coherent re gions/objects. Lets assume at tim e t region s (label) splits into two regions s (label) and s i

(label). In the j)aradigm of tem poral consistency, segm entation m ethod must assign different labels to the pixels belonging to both the objects from start. This can be achieved only if tem poral inform ation over the whole video sequence is available in advance. Block based m ethods use limited tem poral information for segm entation of the block of frames. Therefore, change happening due to the region m erge/split outside the current block do not make any effect on the segm entation results.

2.2 Tracking based vid eo segm en tation m eth od s

Lack of tem poral information beyond the boundaries of a block is one of the main problems for th e block based methods. A possible work around is to collect the tem poral inform ation jjresent in all frame of the video in advancc and then use it to for segm entation of each frame. Tracking based m ethods work on this principle.

Tracking is following any object of interest over time. In the video domain, tracking of an object can be made possible in two ways; tracking of all pixels belonging to an object or traekhig only some specific features belonging to the object. To track the objects present in a sccnc, both approaches com pute the motions of objects and establish a correspondence of every pixel or feature point respectively. Tracking m ethods which consider all pixels for tracking are called dense or direct m ethods |38|. Direct m ethods perform both motion com putation of objects and establishing correspondence of every pixel simultaneously. In contrast, feature based m ethods used only a lesser num ber of points (in comparison to the num ber of pixels) per object for tracking. Therefore, these m ethods are often called sparse m ethods. Sparse or

(33)

feature based m ethods first extract a sparse set of features from each frame iiidepeiideutly and then com pute their motion by estabhshing correspondence for every feature point [SOj.

Segm entation m ethods based on direct tracking approaches m ethods use all pixels for tracking process. Com puting m otion of pixels especially in the presence of occlusion is most critical task for any tracking based approach. Optical flow is a popular choice in block based video based segm entation |6, 20, 9, 22, 2|. However, segm entation approaches based on direct m ethods generally avoid optical flow for m otion com putation because results of optical flow based m ethods are very ambiguous near object boundaries. Normally, tracking m ethods prefer motion models with higher degrees of freedom like affine m otion models to model pixels motions |21, 10, 11|. In their work K um ar et al. |11|, divided a fram e into patches of 3 X 3 pixels and used consecutive frames to find param eters of affine models of every patch. In consecutive frames of a video sequence large motion is not expected. Therefore authors restrict the motion model to a small num ber of transform ations. A joint probability distribution is solved to find out the motion param eters of each patch. A final segm entation is achieved by clustering neighbouring patches which are moving together. In general, direct tracking approaches produce broken (unlinked) trajectories for moving objects which remains static for a long time before moving again. An unlinked trajectory is a trajectory of pixels or features which is not linked with any other trajectory present outside its lifespan. To handle this situation Zhong and Chang |40| compare regions present in unlinked trajectory with the regions of other trajectories. An unlinked trajectory gets linked with a trajectory which shares the maxinmm num ber of common regions.

In contrast to the direct m ethods, feature based m ethods track a lesser num ber of points over frames. F eature based m ethods represent any object by its shape and a])pearance features like points, geometric shape, contours, colour, texture and probability density |41|. These m ethods choose features in each frame and then feature trajectories are created by solving feature correspondence. Temporal consistency of feature based segm entation m ethods depend highly on the quality of both feature points and feature trackers. In their work Marc et al. |21] used an affine m otion model and a mesh m otion model for tracking of pixels and feature points. The affine model learns motion of pixels and then a probability function is used to cluster i)ixels of sim ilar m otion in different classes. In any class, pixels with high probability arc called seed pixels. A mesh model then tracks the seed pixels within a set of frames and produces seed trajectories. In the last step, a clustering function m erge/split the previous

(34)

d u ste rs into groups of coherent affine motion models and trajectories. This model only uses m otion inform ation for video segm entation so its performance m ay suffer in textured regions. Silva and Scharcanski |10| deal with this problem by considering the pixel neighbourhood’s spatial and m otion coherence as well. Their m ethod uses particle tracking to create initial d u sters. A hypergraph uses these initial clusters as its nodes and then produce final spatial and coherent clusters. Although this approach performs well in textured regions bu t the complexity of hvpergraph will increase rapidly with the increase in num ber of frames used for

particle tracking.

2.2.1 T em poral co n sisten cy issu es w ith tracking based v id eo seg m en ta tio n

m eth od s

Tracking based segm entation m ethods have tem poral inform ation over the whole video se quence in advance. Therefore, these m ethods can deal with region m erge/split issues. Low level cues like colour and brightness arc used with tem poral information of pixel feature points for clustering similar pixels. As a result, tracking based m ethods are also affected by changes happening at higher levels (Section 2.1.4). T he biggest problem with these m ethods is th a t they arc only suitable for highly textured videos where each segment can have significant num ber of feature points in each frame. Performance of tracking based m ethods decline rapidly in non textured regions where the num ber of feature points is near to zero.

In section 2.3, we discuss label propagation based m ethods which use all; low, medium and higher level of cues gathered by training either on previous segm entation results or on training video set for labelling the pixels of video frames.

2.3 Label propagation based vid eo seg m en ta tio n m eth od s

Video segm entation can be seen as a label propagation problem where labels existing in previous frames are propagated to label each pixel p of current frame I(. A good labelling is achieved when final segments are spatially sm ooth and differences among segments and classes })roperties, e.g., colour, texture and shape, are minimum. The energy m inim isation framework can provide a natural solution for this kind of optim isation problem and the most favourable labelling can be achieved by minimising the defined energy function.

Energy m inim isation m ethods often select local mininmm due to inefficiency of com puting

(35)

the global minimum. In general, local m inimisation techniques are sensitive to the initial estim ates and choices of energy functions. As a result, a local minimum can he arbitrarily far from the optim um not convoying any of the global image properties |16|. Many optim isation techniques like Iterated Conditional Modes (ICM) |42| and sim ulated annealing [431 move towards the local/global minimum by changing label of a single pixel at a time. ICM is a greedy m ethod and it convergences to a local minimum by assigning the label for which value of the selected energy function is minimum to each pixel. Simulated annealing approach searches the global optim um in the entire search space. Sim ulated annealing inspiration come from the process of slow'ly cooling molecules to form a perfect crystal. It requires exponential tim e for minimising an arb itrary function thus making it im practical for energy functions with several variables (labels).

T he a:-cxpansion and the a/3-sw'ap are the tw'o most popular move making algorithm s for energy m inim isation |16|. After starting from an initial labelling both algorithm s change (move) the labelling iteratively. These algorithms can change label of more than one pixel in any move. A change w'hich leads tow'ard lesser energy is selected at cach iteration. Final labelling is said to be achieved w'hen further m inim isation of energy is not possible, a-expansion does not depend on initialisation and guarantees local minimum within tw'o factors of the global minimum in polynomial time. Therefore, we have used the a-expansion to minimise our energy function.

T he general form of energy function E{x) is composed of tw'o energies, d a ta energy {E^ata) and sm oothness energy [Esmooth] |16, 44, 45, 46, 47, 48, 49, 14, 12|:

E{x) — £"rfa£a(^) Egjjioothi^) (^-l)

where, Edata{x) measures the disagreem ent betw'een labelling x and the observed data; and Esmooth measures the extent to w'hich x is not piecewise sm ooth (spatially). B oth Edatai'x) and Esmooth{x) can be defined by the unary and pairw'ise term s of the Gibbs energy function. Hence, Edata can be w ritten as the sum of d a ta costs over all pixels;

E d a t a { x ) = ^ T > p ( X p ) ( 2 . 2 ) p € V

where Vp{xp) is a function defined on observed d a ta th a t measures the cost of assigning label Xp to any pixel p and V represents all pixels of the frame.

(36)

In term s of the Gibbs energy, Egrnoothix) can be w ritten as the sum of pairwise potential functions:

where N defines the neighbourhood of pixels and T>p q{xp,xq) measures the cost of assign ing labels Xp and Xq to pixels p and q, i.e., {p, q} G Af. Vp q term encourages spatial sm oothness by assigning the same label to both the pixels p and q. However, at borders where neighbour ing pixels often represent different physical objects, Vp_q should avoid this behaviour. Such functionality can be achieved if Vp q has the discontinuity preserving property. In the next sections, wc present review of the two main m ethods used for energy m inim isation which are loopy belief propagation and graph cut based methods.

2.3.1 Loopy b e lie f prop agation

Loopy belief propagation (LBP) minimises the energy equation 2.1 by passing messages around the graph defined by the image pixels. Image pixels represent the graph nodes and the edge comieeting any two nodes represents the com patibility between them . At each iteration, each node sends and receives messages from its neighbours. The length of each message can be given by the num ber of possible labels. Belief propagation m ethods com pute belief at each node using neighbours messages and the local d a ta (e.g, intensity, colour etc.). Finally, the labelling th a t minimises the energy function E{x) at each node is selected. There are two m ain variants of the loopy belief propagation m ethods m ax-product based |50, 51, 52] and sum -product based [53|. The m ax-product based loopy belief propagation m ethod tries to find out the labelling x th a t maximises the global energy function. The other variant, sum -product based loopy belief propagation m ethod com putes the negative logarithm of the marginal probability of each node to find out the optim al labelling x.

Loopy belief propagation m ethods com pute local marginal probabilities using factor graphs. A factor graph Q = { X , F , E } consists of variable vertices X , factor vertices F and edges E. Factor graph helps in factorisation of a function [54|. Factor graph shown in figure 2.2a factoriscs a function p{x) as

where, Z is partition function used for norm alisation. For each variable vertex, local E s m o o t h { x ) — 'y ' V p q [ X p , X q )

{ p . q € M}

(2.3)

(2.4)

(37)

X , )--- ( X 2 )---1 ---( X 3 ''l /

(a) Factor graph

(

)

1 (,

1 {

(b) M essage passing in the forward direction. x,i

(c) M essage passing in the reverse direction.

F igu re 2.2; B e lie f p rop agation u sin g an exam p le factor graph o f three variable X — { .x i, X2, .rs} and tw o factor F = { / i , /2} vertices.

m argin al p rob ab ility is co m p u ted u sing all m essages p assin g through th e vertex. T h ere arc tw o ty p e s o f m essages. T h e first ty p e o f m essages p ass from th e variable vertices ( x) to the factor vertices ( / ) , i.e., T h e second ty p e o f m essages p ass from th e factor v ertices ( / ) to th e variable vertices (x ), i.e., For exam p le, in th e case o f sum p rod uct m essages p assed in th e forward d irection arc

/* /l—>X2

Z '^ /2 —>3:3 X !] * / * ( ^ 2 i ^ 3 ) ^ X 2 —> /2

and m essages p assed in th e reverse d irection are

1 ^ x 3 ^ f 2 — 1

M /2—>X2 — f i t ' l l ^ 3 )

> / l / * /2—

/ * / i —> x i — / ( ^ l 1 ^ 2 ) ^ x 2 —> / l

in b o th d irection s th e first m essages are in itia lised w ith 1. f ( x p , x g ) is a w eigh t assigned to th e edge co n n ectin g tw o neigh bou ring variable vertices. In th e case o f im ages, th is fu n ction m ay b e co m p u ted u sin g p rop erties such as colour, tex tu r e and b righ tn ess o f th e n eigh bou ring p ixels. U sin g th ese m essages m arginal p rob ab ility o f each v ertex is co m p u ted w hich further

(38)

leads to assigning the best labelling to each variable vertex of the factor graph.

Belief propagation algorithm s were initially developed for the simple graphs by Pearl [50|. Pearl showed th a t their algorithm for finding M axim um-a-Posteriori (MAP) assignments is guaranteed to converge and it will assign the optim al assignment values to each node. Simple graphs (w ithout loops) used by Pearl were not able to capture complex pixel neighbourhood relations properly. Weiss and Freeman |51| extended the belief propagation algorithm for graphs of arbitrary topology.

In their work, Felzenszwalab and H uttenlochcr [53| used a new message passing algorithm s on m ultiscale grid graphs to reduce the com putation tim e of basic loopy belief propagation algorithm . Their m ethod captures long range pixel interaction by shorter paths at coarse levels. This inform ation is further used at the higher levels to save com putation time. Szeliski et al. [52] further improved com putation time by introducing a new m ethod for message passing.

2 .3 .2

G raph cut based m eth o d s

Greig et al. [oo] were the first to show th a t graph cut (m in-cut/m ax-flow ) can be used to minimise energy function (Equation 2.1) for pixel labelling of a binary image. T heir proposed framework has been extensively used in image and video segm entation |26, 20, 56, 28, 22, 57, 16, 44, 45]. G raph cuts can only work with certain kind of energy fimctions otherwise m inim isation is NP hard. Greig et al. showed th a t the globally optimal solution of a two term inal, st graph (Section 3.2.1) can be achieved if the pairwise potentials are semi-metric. T>p^q is called a semi-metric if:

Vp^gia, p) = 0 ^ a = 0 (2.5)

V p ^ , { a , 0 ) = V p , , { 0 , a ) > O (2.6)

where, a and /3 are labels. Boykov et al. [16] extended the framework of Greig et al. to minimise the energy function (Equation 2.1) for pixel labelling of colour or grey scale images. Their move making algorithms, the a/3-swap and the a-expansion are the most popular techniques for minimising the energy functions of form (Equation 2.1). The global minim um of a two term inal st graph can be com puted using «/3-swap or a-expansion if pairwise potentials arc semi-metric or metric respectively. Vp q is called a metric if it satisfies

Temporally consistent region based video segmentation

Tem porally Consistent Region Based

Video Segm entation

A thesis subm itted to University of DubHn, Trinity College

in partial fulfilment of th e requirements for the degree of

Doctor of Philosophy

School of C o m pu ter Science and Statistics,

University of Dublin, Trinity College

A ugust 2014

D eclaration

P erm ission to Lend a n d /o r Copy

Dedicated to

Acknowledgments

A bstract

Contents

List of Tables

List of Figures

Chapter 1

Introduction

1.1

C ontributions

1.1.1

S p atio-tem p oral d isp lacem ent constraint

1.1.2

H andling new o b jects

1.1.3

Im proving label prop agation resu lts for rigid o b jects

1.1.4

Learning p hysical exp lan ation s b ehind changes in segm en ts shapes

and occlu sion boundaries

1.1.5

E valuation m etrics

1.2

R eport outline

Chapter 2

State of the art

2.1

B lock based vid eo segm en tation m eth od s

2.1.1

S egm en tation o f blocks o f fram es

2.1.2

P airw ise graph based m eth o d s

2 .1 .3

H yp ergrap h b ased m eth o d s

2 .1 .4

T em p oral co n siste n c y issu es w ith b lock b ased v id eo seg m en ta tio n

m eth o d s

2.2

Tracking based vid eo segm en tation m eth od s

2.2.1

T em poral co n sisten cy issu es w ith tracking based v id eo seg m en ta tio n

m eth od s

2.3

Label propagation based vid eo seg m en ta tio n m eth od s

2.3.1

Loopy b e lie f prop agation

(

)

1

(,

1

{

2 .3 .2

G raph cut based m eth o d s