Expert Object Recognition in video

(1)

Rochester Institute of Technology

RIT Scholar Works

Theses

Thesis/Dissertation Collections

10-2005

Expert Object Recognition in video

Matthew S. McEuen

Follow this and additional works at:

http://scholarworks.rit.edu/theses

This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please [email protected].

Recommended Citation

(2)

Expert Object Recognition in Video

Matt McEuen

In

partial fulfillment of the requirements

for

the Degree

of Master of Science

Department of Computer Science

Rochester Institute of Technology

Rochester, New York

October,

2005

Roger Gaborski

~~~~~~~~~-Professor Roger S.

Gaborski, Chairperson

Carl Reynolds

Professor

Carl Reynolds, Reader

Edith Hemaspaandra

Professor Edith Hemaspaandra, Observer

H. Bischof

(3)

Thesis/Dissertation Author Permission Statement

Title of thesis or dissertation:

L.x

p(' 1+

u~/

e

c

'f

R

f ( OQ.. Y1

1

·

i '

0

0"i IV)

v

,

·

J.

e

o

J (/

Nameofauthor:

----:~'--+-'...,--"+-'-'-:<--~S~

-~

-

~

;l-

-l..

1

~c=--=~:....::::

'-"

::...J>..<~

~

1

---

Degree:

Program: College:

I understand that I must submit a print copy of my thesis or dissertation to the RIT Archives, per current RIT guidelines for the completion of my degree. I hereby grant to the Rochester Institute of Technology and its agents the non-exclusive license to archive and make accessible my thesis or dissertation in whole or in part in all forms of media in perpetuity. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

Print Reproduction Permission Granted:

I, .

.ft

c-..

it

4

e

L-'

s

i-1

c

F

IA

el-'\

'hereby grant permission to the Rochester Institute

Technology to reproduce my print thesis or dissertation in whole or in part. Any reproduction will not be for commercial use or profit.

Signature of Author:

Matthew S. McEuen

_Date:

t

o

l2/o7

I I

Print Reproduction Permission Denied:

I, , hereby deny permission to the RIT Library of the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part.

Signature of Author: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ Date: _ _ _ _ _

Inclusion in the RIT Digital Media Library Electronic Thesis

&

Dissertation (ETD) Archive

I,

fo

b

l/}1e

v

$.

A1

c

Cl

....

Pvz , additionally grant to the Rochester Institute of Technology Digital Media Library (RIT DML) the non-exclusive license to archive and provide electronic access to my thesis or dissertation in whole or in part in all forms of media in perpetuity.

I understand that my work, in addition to its bibliographic record and abstract, will be available to the world-wide community of scholars and researchers through the RIT DML. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I am aware that the Rochester Institute of Technology does not require registration of copyright for ETDs.

I hereby certify that, if appropriate, I have obtained and attached written permission statements from the owners of each third party copyrighted matter to be included in my thesis or dissertation. I certify that the version I submitted is the same as that approved by my committee.

(4)

oc;-Abstract

Arecent computer vision techniqueforobject classificationinstillimages

is the biologically-inspired Expert Object Recognition (EOR). This thesis

adapts and extends the EOR approach for use with segmented video data.

Properties ofthis _data, such as segmentation masks and the _visibility ofan

object over multiple_frames, are exploitedto decrease humansupervision and

increase accuracy. _Several types of runtime

learning

are facilitated: class-level

learning

in which object types that are not included in the_training set

are given artificial _classes; viewpoint-level

learning

in which novel views of

trainingobjects are associated with _{existing classes;} andinstance-level learn ing of images that are somewhat similar to _training images. The architec

ture of_EOR, _consisting offeature _{extraction, clustering,} and cluster-specific

principal component _{analysis, is} retained. _However, the K-means _clustering

algorithm used inEORis replacedinthissystem_by an augmentedversion of

Fuzzy K-means. This algorithm is _{incrementally} run overthe lifetime ofthe system, and _{automatically} determines an appropriate number of partitions

based on the data _{in memory} and on a system parameter. In addition, the

edge and line-based feature extraction of EORis replaced with a global _ap

plication ofthe principalcomponent_analysis, which increases accuracywhen

usedwithsegmented videodata. Classificationoutputforthesystem consists

of a multi-class hypothesis foreach _{tracked object,} fromwhich a single-class

"hard"

hypothesis _may be determined. _{The system,} named VEOR (video

expert object _{recognition),} is designed for and tested with _noisy, automati

callysegmentedreal-_world

(5)

Acknowledgements

This thesis would not have been possible without the vast amounts of

guidance, insight, and support that were _freely offered _by _my _chairperson,

Dr. Roger Gaborski. I would also like to thank the other members of _my

committee, Dr. Carl Reynolds and Dr. Edith_{Hemaspaandra,} for theirfeed back and support. I am grateful to Dr. Bruce Draper and Dr. Kyungim

Baekfor their helpfulcorrespondence and assistance.

(6)

Figures

2.1 EOR system architecture ₁₃

2.2 Gabor energies of cat faces. The _top row is the unfiltered in

put. The _second, _third, and fourth rows are the faces filtered

with even and odd Gabor filters of sizes _{7, 15,} and _31, respec

tively, in false color. Theinput images werefirst scaledto the

range

[-1,1]

and smoothed 15

2.3 Clustered cat and _dog faces. Each row represents a different

cluster ₂₀

2.4 Two-dimensional illustrationofthemajorsteps of_PCA, cour

tesy of

[16]

23

4.1 Architecture of the VEOR system. Arrows indicate the flow

(9)

4.2 Example of VEOR _training sample. The first two pieces of

data associated with this sample are a video frame and its

corresponding mask, shown above. The third and fourth are

theclass name

"car"

andthe

(row,column)

coordinate

(87,79),

which specifies the location of the object of interest. The

upper-left corner of the images are considered to be the co

ordinate _origins,

(1,1)

32

4.3 Stages ofinitial input processingon asample,

(a)

and

(b)

are

the sample's input video and mask frames, respectively,

(c)

and

(d)

aretheextracted and resized video and maskpatches.

(e)

is the masked object if Gabor

filtering

is not _used, while

(f),

shown in false _{color, is} the masked object ifa 7 x 7 filter

is applied ₄₂

4.4 Some cars and vans clustered _using global PCA for feature

extraction. Rows represent clusters. Note that vehicles tend

to beclustered both_by theirclass and _by theirgrayscale color. 45

(10)

5.2 Cat and

dog

_database _{classification performance}

using global

PCAfeature extraction. Thenumberoflocaland global PCA

subspace dimensionsare varied

(top

left and_top_right, respec

tively), as well as _Fuzzy K-means' _fuzziness coefficient (bot tom). The default values for the number of global and local

subspaces is _5, and the default fuzziness is 1.3 77 5.3 Examples of videoframesfromthevehicle_database, and their

corresponding masks ₈₀

5.4 Vehicle database classification performance _using global PCA

feature extraction. The number oflocal and global PCA

sub-spacedimensions arevaried (the_topleftand _topright _graphs,

respectively), as well as _Fuzzy K-means' fuzziness coefficient

(bottom)

82

5.5 Frames from the classified output video. Each row shows a

different car _approaching the camera (column _one), _turning (column two), and_{continuing in} profile (column three). Ared

bounding box indicates a"car" _{classification,} green indicates

(11)

Chapter

1

Introduction

Object recognition is a difficult and important problem in computer _vision,

and is integral _{to the understanding} ofvideo events by a computer. An en

couraging fact, however, is that humans can perform this task with great

speed, accuracy, and ease. The _biologically inspired system proposed by

Bruce _Draper, Kyungim _Baek, and Jeff_Boody, called expert object _recog

nition

(EOR)

[1,

_{2, 3],} yields results that surpass those ofthe system's con

stituent algorithms. _However, their approach is engineered for use with still

images, and while _potentially ofgreat use to video _analysis, has limitations

which prevent it from _being applied to that domain. An _{EOR-based ap}

proach called VEOR (video expert object _{recognition),} which attempts to

address these _limitations, is proposed in this thesis. VEOR is specializedfor

the needs of video _{understanding} and tuned for use with segmented video

(12)

ap-plies to real world _data, and is tested with a database of segmented _cars,

trucks, and vans.

The outline of this thesis is as follows. Chapter 2 gives background in

formation on the EOR system and its incorporated algorithms. Chapter 3

presents problemsthat prevent the_existingEORimplementationfrom_being

applied to video _data, and describes some features that would be desirable in an object recognition system for video. The architecture of VEOR and

the detailsofits subsystems and algorithms arefound inchapter 4. Chapter 5describes experiments withdifferent data sets and theirresults. Chapter 6

presents conclusions.

Becausecontinuedresearchinthisareais_planned,implementation-specific

notesthat describe_functions, their arguments and return_values, systempa

rameters, etc., are included. These notes are _by no means an exhaustive guide to the VEOR _{implementation;} they are _merely meant as a _starting

point, so that code and comments _{may be} put into context. The implemen tation consists of MATLAB _code, and all system parameters are settable

(13)

Chapter

2

The Expert Object

Recognition

approach

Expert object recognition is aspecifickind ofhuman object recognition that

is concerned with fast recognition and discernment offamiliar types of ob

jects. Objects are recognized on both instance and _{category levels.} For

example, while a familiar face _may be associated with a specific individual,

an unfamiliar facewould still be recognized as human.

In a series oftwo papers [3, 1] and one doctoral dissertation

[2],

Draper,

Baek, and_Boody proposed a _biologicallyinspired approach to object _recog

nition based on the human expert object recognition _pathway, dubbed the

expert object recognition

(EOR)

system. EOR has been _actively _developed,

and the algorithmic choices and parameterization differ somewhat from pa

(14)

of the system to be quite _good, _{outperforming} all of the constituent algo

rithms used [1]. Furtherresearch is

being

donein allofthe modulesof_EOR,

particularly in the areas of_feature _{extraction and exemplar} matching.

The expert object recognition _{pathway in} the human _brain, modeled _by

the EOR _system, consists of three main stages which _sequentially extract

features _from, _categorize, and match viewed imagesto visual memories.

The EOR lifecycle isdivided into_trainingand runtime_{phases, the}latter

of which can beused as a _testingphase. The same general stages are found

in both the _trainingand _testingphases: feature _extraction, which computes

feature vectors for image _data; _{categorization,} which partitions images into

groupsbased on the_similarity oftheirfeature _vectors; and exemplar match

ing, which is responsible for_determining thebest matchforaruntimeimage

from within its assigned partition. The flow ofdata among these stages can

beseen at ahigh level infigure2.1. The stages are described in detail inthe

followingsections.

An implementationofEORwas developed as a_startingpointfor_VEOR,

and so that the two approaches _may be _{quantitatively} compared. In this implementation, feature extraction is applied to each_trainingimage serially,

generatingaset offeaturevectors. The categorizationstageis applied to the

featurevectors in asingle _step, and calculations _necessary to allow exemplar

matching are performed _sequentially on each generated partition of_training data. _During _testing, images are fully processed sequentially. Where the

(15)

Hand-_cropped

Images

EOR

Feature Extraction

Categorization System

Exemplar matching System

Non- symbolic

[image:15.528.86.444.112.233.2]

S Classification

Figure 2.1: EORsystem architecture.

specialnote willbe made.

2.1

Edge Detection

The VI area of the brain consists _primarily of edge _detecting simple _cells,

andofcomplex cellswhich sumthesimple cell outputs [2]. Gabor_functions,

the products of Gaussian and sinusoidal _functions, are the accepted model

of simple cell receptive fields. Two types of Gabor _filters, even and _odd,

are _particularly sensitive to bars and _{to edges,} respectively. The sinusoid

used in the creation of even filters is a cosine _function, while that used for

odd filters is a sine function. Basic even and odd filters are generated in

this implementation with a script from theVENUS

[5]

system. Those filters

are then rotated to _{0, 45, 90,} and 135 degrees to detect edges and bars

(16)

sensitive toedges and bars of various sizes. _However, the performanceofthe

implementation as a whole was _{slightly better} when _only thesmallest filters

(seven _by seven _pixels) were _used, than when the three sizes

(7,

_15, and 31

square pixels

[2])

_found in the literature were used. _Using_only the smallest

filtersize alsodecreased_processingtimesignificantly. Recentcorrespondence

with Dr. Draper confirms that the _working version ofthe EORsystem uses

only the smallest filter size aswell.

ComplexVI cells are modeled _by_summingthesquared output fromsim

plecells. Thisservestocreate a rectified_energy _map _denoting edgestrength

acrossthe filteredimage [2]. Some differenceexists _amongthedifferent EOR

proposals as to whether the rectified energies of the individual orientations

are tobe summed into an average

[1],

or kept separate asinput for the next

stage of _processing [2]. The approach used in this implementation was to

sum the outputs and normalize the result to the range [0,1].

Twosmallalgorithmic additionsthatincreasedperformanceasmall amount

were to smooth images usingan _averagingfilterprior to Gabor_filtering, and

to normalize theirvalues to [-1,1]. The latter step serves to_remedy a short

comingof_{applying Gabor filters}toimagedataratherthantoa contrastmap

(as in [5]). This problem is evidenced by the fact that a white object on a

black background will generate a _very different edge response than a black

object on a white background. Normalizing the input to

[-1,1],

combined

with the rectification _step, produces identical edge responses for both. In

(17)

apply-'" _*

J xSt

Figure 2.2: Gabor energies of cat faces. The _top row is the unfiltered input.

The _second, _third, and fourth rows are the faces filtered with even and odd

Gabor filters of sizes _{7, 15,} and _31, _{respectively, in} false color. The input

[image:17.528.73.457.235.459.2]

(18)

ing Gabor filters to a contrast map. A contrast _map is generated _using a

filter with a shape similar to that of a Gabor _filter, such as a difference of

Gaussians filter [5]. _{This similarity} of _results, _however, calls into question

the _necessity ofthe extrafiltering. The effects ofGabor _filtering at different

resolutions, after _smoothingand _{normalizing, may} be seen in figure 2.2.

Inthis _{implementation,} the sizes of_filters, and the number offilter_sizes,

are _determined _by _the _{system parameter gabor} _(a _vector _of _in

tegers); the number oforientations _by _{gabor_orientCt;} and the number of

phases (1 for odd_filters, 2 for both even and odd

filters)

_by gabor

2.2

Line

Detection

In the human _brain, further feature extraction is performed _by the lateral

occipital cortex

(LOC),

the cells of which respond _{to structural,} or

edge-based, properties. _{Specifically,} non-accidental_properties, such as_{collinearity,}

parallelism, symmetry, and _antisymmetry are detected. The EOR system

performs a subset ofthefunctionsofthe LOC_by _usingtheHough transform

to discover collinearity [1]. This algorithm finds, for _every non-zero pixel in

an _image, _along what lines that pixel _{may lie.} Response valuesare collected

in bins in "Hough space", a discrete two-dimensional space whose axes are

the orientation and perpendicular distance from the origin ofpossible lines.

Thus, binsrepresentindividual linesacrossthe input_image, and thevalue of

(19)

standard image processing, Hough space responses arethresholded to detect

lines. _However, the goal of this stage of the EOR system is to produce a

signaturethat identifiesan image_byits non-accidentalfeatures, so allofthe

response dataare kept.

Unfortunately, all attempts to gain performance in this implementation

using the Hough transform were futile. Under any parameterization with

Hough, performancewas _{very significantly}lowerthanwithout. This _may be

because ofsomeincorrect assumption that is not _specifically detailed in the

literature, because, in the _literature, Hough added _greatly to performance

[1]. _Thus, under the default parameterization, this implementation follows

the earliest specification of EOR

[3],

in which Hough was not used. The

system _flag useHough determines whether or not the transform is included infeatureextraction. Theimplementation oftheHough transformused here

is from [19].

2.3

Categorization

Whilethenexttwostagesinthebiologicalexpert object recognition_pathway

are not as well understood as those _previously _discussed, there is evidence

that thefusiformgyrus and rightinferior frontalgyrus perform categorization

(20)

centers, centroids, or _prototypes, _{are calculated as part of} _{categorization.}

Runtime images are then assigned to clusters whose centroids are closest to

their feature vectors. Exemplar matching for an image is performed _using

the subset of_training_data associated with its cluster.

Categorization in EORis modeled _by the K-means _clustering algorithm.

K-means is a simple iterative algorithm that can _quickly categorize multi

dimensional _{input into clusters,} while _{simultaneously calculating} the mean

value for each cluster [7].

Clustering

does not improve the performance of

thesystem muchin terms of_accuracy

[1];

_however, itallows visual memories

to be stored at ahigh level of_compression, as described in the next section.

The number ofclusters is determined beforeruntime _by the parameter k.

K-means belongs toa_category of_clusteringalgorithms that seekto min

imize a distance (or _error) metric between data and their assigned cluster

centroids. The objective function that the algorithm seeksto minimize is

jKM{x,

c)

=

__.

N

(2-i)

j=li=l

Xisavectorofdatatobepartitioned; eachdatumx G X is itselfafeature

vector, xfis read "the ft1 datum in the

f1

cluster", and

n3-is the number of

datain clusterj. Cis amatrix of cluster centroids, and each centroid ceC

has the same dimensionality

(length)

as the vectors in X

||

*

||2

denotes

a particular distance metric [7]. Squared Euclidean distance is the most

(21)

Algorithm 1 The K-means algorithm

[8]

1. Initialize k (a _parameter) cluster centroids in C to the values of k

randomdata in X.

2. Assign _membership of each datum to the cluster centroid to which it

is closest.

3. Recompute the values ofthe centroids to be the mean values oftheir

memberdata.

4. Ifa convergence criterion is not _met, go to _{step 2.}

"city block" and _{Mahalanobis distances. K-means} _is _implemented

using the

technique of _alternating _{optimization, in} which optimal values for C and

cluster _membership are calculated in turn until some convergence criterion

(or _criteria) is met [8]. Common convergence strategies are to _stop when

no change in cluster _{membership occurs,} or when total change of centroid

locations is less than a threshold. The steps ofK-means are as follows.

Sometypicalclusters generated whenK-meansisapplied

(following

Gabor-based feature extraction) to the cat and _dog _database _(see _section

5.1)

are

shown in figure 2.3.

Draperandhiscolleagues foundthat performance was increased_by over

estimatingthe value for _k,

possibly because the actual classes in the datado

not reflect a _{Gaussian distribution [1].} _The _{results of}

this implementation

support this.

Although a new _{implementation} of_K-means was created for this imple

(22)

Figure 2.3: Clustered cat and _dog faces. Each row represents a different

cluster.

implementation found in the MATLAB statistical toolbox. The reason for

this is unknown. The _differing results _may be due to the fact that the per

formance ofK-means varies _greatly with thechoice ofinitial cluster centers.

However, use of the statistical toolbox version did not improve exemplar

matchingperformance ofthesystemsignificantlywhencompared tothenew K-means implementation. The statistical toolbox implementation may be

enabled _by _settingthe system_flag clustering_matlab_hkm,and thenew im

plementation by _clustering

2.4

Exemplar

_matching

Once an imagehasbeen assignedto acluster atruntime, exemplar_matching

[image:22.528.72.439.92.314.2]

(23)

have hypothesized that visual memories are retained in the brain as "com

pressed _images", and some experimental data support this idea [1]. Com

pressioninthe EORsystem is facilitatedby theuse of asubspace projection

algorithm, theprincipal componentanalysis (PCA).PCA isastatisticaltech

nique that exploits similarities in data. It has proved useful in the field of

computer vision, and has been used for years as a face _matching algorithm

[!]

PCA finds orthogonal vectors _along which the maximum variance in a

dataset _lies, and captures that variance in new variables _by_making the vec

tors axes in a subspace. In _{this way,} most of the variance ofa dataset can

be represented by a _relatively small number ofdimensions. Details of PCA

are found in the _following section. _By

forcing

similar images into clusters, and_generating adifferent subspace foreach _cluster, the EORsystem allows

PCA to _{very effectively} compress a dataset _consisting of dissimilar objects

[2]. Another advantage to _clustering and subspace _{matching is} _that, instead

of

finding

the nearest match to an image fromthe entire _trainingdataset in high-dimensional_{image space,} EOR finds the nearest match to the image in

a_{low-dimensional}subspace from a clustered subset ofthe data.

In

[3],

PCA isappliedtoimage_data, whilein

[1],

it isappliedtothe same feature vectors upon which K-means operates. Both strategies were tested in this

implementation,

with the former _achieving better performance. The

(24)

The Principal Component Analysis

As mentioned _previously, PCA projects datainto an artificial _subspace, and

can function _{as a variable reduction procedure.} _{PCA differs from}

subspace

projection algorithms such asthe Fouriertransform in that the basisvectors

forthe subspace arederived fromthedatatobetransformed. Theprojection

matrix consists ofbasis vectors (subspace _axes) that are orthonormal toone

another [16]. _Thus, the vectors are _mutually uncorrected [17].

When PCA is performed on a_dataset,the mean value issubtractedfrom

the _data, and a covariance matrix is calculated from the result [18]. Infor

mation in data _{is usually} encoded as _variance, and the _ability to capture

most of that variance in a small number of variables is how PCA accom

plishes variable reduction. PCA works best when there exists a high level of _redundancy

(covariance)

_{among input} variables [17]. When performed on

raw image _data, the _redundancy exploited _by PCA is that of pixels whose

values _{vary together.} A simplified example: if a particular region ofpixels

sharethe same graylevelvalue withineach image ina_dataset,but that value varies _among the different _images, then a single subspace variable _{may rep}

resent the _intensityof all ofthose pixels. _Likewise, a subspace variable _may

encode thepresence and degree ofafeature common _{to some,} but not _all, of the dataset. _However, because PCA applied in this _{fashion is pixel-based,} a

spatialtranslationofaparticular featurecould not beso_cleanly represented. When _calculating the projection _matrix, a basis vector is found that ac

(25)

Transformation Reduce dim.

^: n o

T>o

o_o_________-_.v. 00 00o O000

Yj

^xl X]

Figure 2.4: Two-dimensional illustrationofthemajor stepsof_PCA, _courtesy of[16].

principal component. Another vector is_found, orthogonalto the _first, which

capturesthe maximum _remainingvariance in the data, and so _on; each sub

sequent component accounts for the maximal amount of"residual" variance

not captured _bythe previous components. The method _by which these ba

sis vectors are found is to calculate the eigenvectors and eigenvalues ofthe

covariance matrix ofthe data [16]. The order of principal components is the

same as the _descending order oftheir _{corresponding}_{eigenvalues; that} _is, the

eigenvector withthe largest eigenvalue is the first principal component [18]. As _many principal components_may be found fora dataset as exist vari

ables in its input vectors. Each principal component represents an axis in

the PCA_subspace; _therefore, whendataaretransformed _bytheir"full" PCA

projection _{matrix, the} _{dimensionality} _(and

size) ofthe output is identical to that of the input. _However, because most of the information in the data is

captured in the initial principal _components, the lattercomponents _may be

ignored, and _only a small amount ofinformation is lost. There are _many

[image:25.528.94.458.97.208.2]

(26)

com-ponents to _retain, such as the eigenvalue-one _criterion, the scree _test, and

setting athreshold on the proportion of variancefor which retained compo

nents must account [17]. These methods make their determination based on

the dataset at hand. _However, EOR takes a simpler _approach, and retains

a number of eigenvectors specified _by a system parameter. This value was a

variable in the experiments ofDraper et al. [1].

Once the transformation matrix for adataset is _calculated, data _{may be}

projected through it [18]. The size of the projected data is proportional to the number of retained basis vectors. In _addition, a reconstruction of the

original dataset _may be produced _by _projecting subspace data through the transpose of the transformation matrix. If as _many principal components

exist as variables in the _input, then no information is _lost, and a perfect

reconstruction _may be created. _However, _only an approximation can be recovered ifsome ofthe eigenvectors are ignored [18].

2.5

Summary

Now that the _building blocks of EOR have been described, the end to end

procedure will be shown for the sake of comprehension. The steps followed

(27)

Algorithm 2 _EOR, _trainingphase

1. Extract features from _trainingimages using Gabor filters.

2. Cluster the feature vectors into k partitions _using K-means.

3. For each cluster, create a transformation matrix _by _performing PCA

on the images associated with that cluster. Project each datum into

its cluster's subspace.

Algorithm 3 _{EOR, testing} phase

1. Extract features from the _testing _{image using} Gaborfilters.

2. Findthecluster centroid (ascomputedin_training) closestto thefeature

vector.

3. Project the image into the PCA subspacefor that _cluster, and find its

closest matchfromthe projected clustermembers. Thisistheexemplar

(28)

Chapter

3

Considerations

for

video

Gaborskiet al. fromtheRIT departmentofcomputer sciencehave developed

anovel videoeventrecognition _system, calledVENUS

[5],

whichiscapableof

detectingnovelty in avideoscene basedon habituation tocolor and motion.

However, the long-term goal for the system is to be able to describe what

happens in a scene, and object recognition plays a _key role in that regard.

The prospect of_applying the EORapproach in this way is appealing. EOR

may be trained on _any number of _{arbitrary classes;} that _is, it is a generic

approach that can be specialized for various applications _by _{choosing differ}

ent _trainingsets. In the areaof video _{understanding,} _trainingdata could be

selected basedonwhat objectsareexpectedtoappearina_scene, and also on

those objects whose presence should be cause for alarm. In _addition, EOR

lends itself to use with video data. The attention _window, a part of visual

(29)

back-ground detection-based object segmentation (see section 4.2). Also, while EOR associates images with _previously seen _images, an associative _memory

that would label those images

(along

with other reasoning-related _tasks) is deemed to be outside the scope of EOR [2]. Plans to introduce a robust

associative_memory into VENUScouldfillthisroleand achievemorereliable

classification results.

However, significant issues stand in the _way of _providing an EOR im

plementation suitable for use in a video system, especially with regards to

runtime learning. Learning would be a _highly advantageous feature for a

video-based system, especially one engineered for the detection of novelty.

In a realistic setting, an object _may look differently from one frame to the

next, because of _rotation, _shadow, partial _occlusion, or noise. Knowledge,

gained through _learning, that a class _may appear in views different from

those trainedupon would aidfurther identificationofinstancesofthat class.

Objects whose classes were not included in the _training set _may also be en

countered, and should be classified as unknown objects. _However, if two

such objects appearin avideo _sequence, an intelligent response wouldbe to

classify the objects as the same _type, even though the system does know a

symbolic label for that type.

The current EOR system makes a _strong distinction between _training

and _testing_phases, which precludes onlinelearning. This distinction is com

plicated _by the fact that two of the _algorithms, K-means and PCA, would

(30)

memory. This highamount of_{processing may}beredundant andunnecessary.

Likewise, the number and selection of retainedimages is animportant _issue,

as this determines how often

learning

must occur and in what _way visual

memory is increased. The K-means clustering algorithm is too inflexible of

an approach to_partitioning forthis environment, because it requires a spec

ification of _k, the number of clusters. In a realistic _setting, an appropriate

number of clusters _may not be determinable before runtime and _may need

(31)

Chapter 4

The

VEOR

system

The VEOR (Video Expert Object _Recognition) system seeks to _apply the

EOR approach to video _input, and addresses the concerns found in chap

ter 3 through algorithmic _{additions, substitutions,} and modifications. Its

responsibilities are a superset of those of _EOR, because it performs object

tracking and symbolic _labeling in addition to visual matching. Runtime vi

sual_learningisfacilitatedin multiple ways. Instances ofknown _classes,from

both known

(trained)

and unknown

(novel)

_{perspectives,} can be introduced

intomemory. Artificialclasses are created forunclassifiable input. Thetime

requiredto add an image to _{memory is}kept to a minimum _by _reusingexist

ing clusterinformationwhen repartitioning. This keeps cluster_membership

fairly stable, in addition to_{reducing clustering} time. _Thus, local PCA needs

tobe applied_only toclusters whose _membershiphas _changed, and then_only

(32)

Video Input

1

Foreground Segmentation

VEOR

Object_Tracking ? EOR Subsystem ? _MembershipSubsystem

Subsystem _I

ObjectClassifications

Figure 4.1: Architecture of the VEOR system. Arrows indicate the flow of

information.

4.1

VEOR System Architecture

VEOR is divided into logicalsubsystems. _{Functionality} modeled after EOR

is found in the EOR subsystem. A separate _Membership subsystem cal

culates class _membership hypotheses for _objects, while the Object _Tracking

subsystemisresponsibleforfollowingobjectsthroughsuccessiveframes.

Seg

mentation ofvideo data is a _necessary _{pre-processing step} for VEOR input.

The relationships between these modules are shown in figure 4.1.

The EORsubsystemin VEORconsistsofthesame stages asEOR: feature

extraction, clustering, and exemplar matching, but substitutions and modifi

[image:32.528.71.463.99.312.2]

(33)

has been replaced with a global application of _PCA, and the K-means clus

tering algorithm has been replacedwith alearning-enabled algorithm based

on _Fuzzy K-means [8]. Exemplar matching in VEOR happens in much the

same _wayas in EOR.

The distinction between trainingand runtime

(testing)

phases has been

carried over from EOR; however, this division may be somewhat mislead

ing, because_learningcontinues to occur _duringthe runtime phase. Training

data are extracted from segmented video footage, and consist offour parts:

a video frame _containing the _trainingobject, a _{corresponding} segmentation

frame that masks the object, the class ofthe object, and a coordinate pair

specifying a point which lies within the object's segmentation mask. Be

cause the _training data are taken from real-world video footage, which is

segmented based _solely on _motion, a segmentation mask used in _training

mayfind multipleobjects within its correspondingvideo frame. This is_why

thecoordinatepairisnecessary: itspecifieswhichmasked region corresponds

to the _trainingobject. Runtime input consists of avideo and a correspond

ingsegmentation_video, each frameofwhichis a segmentationmask. Classes

for runtime objects are _unknown, and coordinate pairs are _unnecessary be

cause _every segmented object (that meets certain _constraints) is classified.

Multiple objects within a single frame are classified simultaneously.

The major steps of the VEOR system

during

_training and _testing are

described _by algorithms 4 and 5.

(34)

simi-Figure 4.2: Example of VEOR _training sample. The first two pieces of

data associated with this sample are a video frame and its corresponding

mask, shown above. The third and fourth are the class name "car" and the

(row,column)

coordinate

(87,79),

which specifies the location of the object of interest. The upper-left corner of the images are considered to be the

coordinate _origins, (1,1).

Algorithm 4 _{VEOR, training} phase

1. Extract segmented image patches from _trainingdata.

2. Assign _membership of all _training images to their respective classes.

3. Extract coarse features from _training images _by _{performing PCA} and

projecting the datainto the global PCA subspace. This generates fea ture vectors.

4. Initializeasmall number of clustercentroids to thelocationsofrandom

feature vectors.

5. Perform _Fuzzy K-means. _{If any} cluster does not meet certain "good

ness"

criteria, add a new centroid that is themean ofa random subset

(35)

Algorithm 5 _VEOR, testing phase

For each tracked object in each frame:

1. Extract a segmented image patch from theframe.

2. Extract coarse features from the image patch _by _{projecting it into} the

global PCA space.

3. Find the cluster centroid closest to the generated feature vector.

4. Ifthe local PCA transformation matrix for that cluster is not _up to

date, compute it _by _{performing PCA} on the cluster's _{corresponding} image data.

5. Project the image patch into the local PCA subspace for _{its cluster,}

and find theclosest match fromthe projected cluster members.

6. If the distance (in global PCA _{space) from} the _testing image to its

exemplar match is below a _threshold, contribute the _membership of the exemplar match to the _membership of the test _image, and _skip

steps 7-10.

7. _Otherwise, introduce the _testing image into the _training set and per

form steps 2-4 of the _training _sequence, with the exception that the

initial cluster centroids _{in step 3} are set to the current cluster cen

troids.

8. Findtheexemplar matchforthe_testingimagefromwithinitsnew clus

ter _by _performingsteps 4-5 of the _testing _sequence, while _disallowing

the exemplar match from _being the _testingimage itself.

9. If the image is partitioned into a cluster _{by itself,} then a new class is

created for_it, and _{membership is} assigned to that class.

10.

Otherwise,

_contribute _{the membership} _of _the _{exemplar match} _{to the}

(36)

lar to themirror image (horizontal

reflection) ofanother. For example, acar

facing

totheright _probably_{looks like}themirrorimageofthesame car_facing

to theleft.

Adding

mirrorimagesoftheentire_trainingdatabaseinto_memory

would be _redundant, so a different method is used. _{If mirroring is} _enabled,

both an object's image and its reflection are exemplar matched (see section

4.6), and whichever match is _"better", or _closer, is used forclassification (or is learned see section 4.8.1).

Implementation notes

All persistent data are stored in a "VEOR data _structure", which is a sim

ple MATLAB structure. Examples ofsuch data are cluster _centroids, PCA

transformation matrices, and _membership information. This is done so that

multiple

"instances"

of_VEOR, trained _separately and with different _states,

maybe run side-by-sidein an application. Several VEOR functions take one

ofthe data structures as an argument.

The parameter flag EOR_mirrorlmages determines whether image mir

roring is used at runtime.

4.2

Foreground

segmentation

Beforean object canbeclassified, it mustfirstbe _located; that _is, pixels cor

responding toobjects must be distinguishedfrom thosecorresponding toun

(37)

mixtureofGaussians background subtraction technique

[6],

implemented by

Jason Roberts. The input for this stage of_processing is simple video data, and theoutput is a maskvideo. Eachframe in theoutput video corresponds

to aframe inthe input, and the presence ofan object in the input frame is noted _by a connected _group of white pixels in its mask. All other pixels in

the mask video are black. Objects in a scene _may then be separated from background _by_performing alogical"AND" operation between avideo frame and its mask.

Segmentation isachievedthroughdetection and_modelingofbackground. When an object moves through a_scene, it does not fit into the background model and isthus detected as foreground. The particular method described in

[6]

_{is noteworthy in}thatit adapts to a_scene, is robustto _lightingchanges and multimodal_backgrounds, and can assimilate new objects intothe back ground if _they become _stationary for a period of time. Pixel color values are clustered into any number ofGaussian distributions (a _parameter), and

these distributions are ranked _according to their likelihood of_belonging to

background. The heuristics used for this _ranking are based on a Gaussian's variance (a_movingobject tends tocause more variationina given pixel than

does

background)

and persistence(thecolors associatedwith a_movingobject

(38)

within those distributions

(by

_default, within 2.5 standard _deviations) are

exclusivelylabeled as background.

The default number of

Gaussians,

_three, allows for the segmentation of

foreground from abimodal background (the third Gaussian is used for fore

ground) under _noisy conditions. When segmentation was performed for the

testing data described in sections 5.2 and _5.3, _however, _only two Gaussians

were _used, because the short lengths of the involved video clips rendered

multimodal background detection unneccesary. Multimodal background de

tection should be used when _providingVEOR with longer input videos.

After_{detecting background,} morphological_{processing is}performed tore

movefrom maskssmall patches ofwhitepixelsthat represent noise orincon

sequentiallysmall _movingobjects. Inaddition, the morphological_processing

serves to smooth the outlines of _objects, fill holes in _masks, and connect

foreground components that are close to one another. Components in such

close _proximity often belong to the same object. The results offoreground

detection are shown in figure 4.2.

4.3

Object

tracking

After foreground pixels have been identified, they may be interpreted as

belonging to objects. Regions of connected foreground pixels are _found, and

region properties (centroid, area, and _bounding

box)

are calculated. Regions

(39)

In thecurrent VEORimplementation, regions must meet two criteria to

betracked. _First, _anyregionthattouches an edge of aframe is ignored. This

is because EOR is an appearance-based _technique, and the _similaritythat it

would find between an object and aportion ofthat object decreases as less

ofthe object is visible. For _example, it is _{unlikely that} an image of a car

wouldbe the exemplar match ifone halfofthe same car image wasused as

input. _Thus, anobject is nottracked until it isfully within the fieldof_view,

and is no longer tracked when it begins to leave. Unfortunately, this same

situation _may occur if an object becomes occluded while fully within the

frame, bybackground oranother_object, for_instance, andthisis alimitation

of the current implementation. The second region acceptance criterion is

thatatrackedregion must_occupyat least a minimumnumberofpixels. One

reasonforthisis_that, althoughthe_majorityof segmentation noiseisremoved

through morphological _processing, some is not. _Any foreground region of

substantial size cannot be considered _noise, so _only objects are tracked. A

second reason is to eliminate the problem that _far-away (low _resolution)

objects aredifficultto_{distinguish from}one_another,and arethereforedifficult

to classify.

All regions _meetingthe acceptance criteria are tracked _among frames as

objects. The actual _tracking mechanism _currently used in VEOR is very

limited, and is a stand-in for some more state-of-the-art object _tracking al

gorithm. _{Implementation} _{of a robust}

trackingscheme is beyond thescope of

this _thesis, _{and work on such a system}

(40)

inthe RITDepartmentofComputer_{Science. The}object _trackingsubsystem

in VEOR considers foreground regions in two consecutive frames to be the

same object _{if there is any overlap in}their

bounding

boxes. _Trackingbehav

ioris undefined if there is a greaterthanone-to-one correspondence between

such regions.

Resulting

limitationsincludethe_inabilityto track objectsthat

disappear and _reappear, objects that _overlap one _another, and objects that

movefast enough relativeto the frame rate of video input such that thereis

no _{overlap in} regions between consecutive frames. In _addition, if an object

occupies more than one _region, each constituent region will be tracked _sep

arately. For _example, ifa car drives behind a street _{sign, it} will be tracked

as two separate objects while _visually divided. However, this form of object

tracking serves as a proof of_concept, and works well with real-_world _input

that avoids _{these admittedly}realistic situations.

The object _tracking subsystem is split into two separate functions:

findOb-jects, which identifies foreground regions that meet the acceptance criteria,

and _{trackObjects,}whichtracksobjects_amongframes. FindObjectstakes_only

a mask frame as a _parameter, and returns allfound objects as a MATLAB

struct array. ThefindObjects_minSize system parameter sets the minimum

number of pixels of which a region must consist for it to be considered an

object; ifunspecified, the parameter defaults to 40. TrackObjects takes two

(41)

current frame and one for the previous frame. The function uses these to

determine which ofthe foundregionsare new objects and which are objects

thathavepersistedfromtheprevious _frame,and relaysthisinformationwith

threereturn arrays: oneforobjectsthatfirstappearinthenew_frame,onefor

objects that have been trackedfromthe previous _frame, and one forobjects

that werein the previous frame but have disappeared. The structures used

asinputand output forthesefunctions areidenticaltothosereturned _by the

regionprops MATLAB _function, with the exception that _they are assigned

an additional id field which specifies a unique numerical identifier for each

tracked object.

4.4

Feature

extraction

Feature extraction in the EOR system consists of edge and line _detection,

implemented as Gabor

filtering

and the Hough _transform, respectively. Al

though thesealgorithms are incorporated into_VEOR, adifferent featureex

traction_{pathway is}used_bydefault. Gabor

filtering

andtheHoughtransform

may be enabled _by _setting system _{flags. In} _{addition, processing is} _required

to transform input data into masked image patches. This _{processing is} not

(42)

Initial _processing

Each tracked object within a frame ofvideo must undergo several transfor

mations before features_may_be_extracted_from_it. _Data_provided_{to this}_stage

of_processing consist ofthe video frame and mask frame in which an object

appears, as well as thecoordinates ofa point that lieswithin the object. At

runtime, these coordinates are retrievedfrom the region propertiesfound _by

the object _tracking subsystem. _During _training, _they are provided as input

data, in partbecause it is impossibleto trackanobject across asingle frame.

An object in a video occupies occupies a region whose size in pixels de

pends on the resolution ofthe video and on the distance ofthe object from

the camera. To compare pixel-based features _among objects from different

frames, however, these regions must be rescaled to a common image patch

size, whichin VEORis square. _Usingthe coordinatepairas alocatorforthe

appropriate _{region, the}centerand _boundingboxoftheobject are calculated

from the mask frame. All other foreground regions of the mask are zeroed

out to ensure that masks _{corresponding to} other objects do not affect the

masking ofthe object at hand. Next, the major axis ofthe object's _region,

verticalor horizontal, is determined. Thereasonforthisisthat some objects

(such as _pedestrians) are taller than _they are _wide, and other objects (such

as _vehicles) are wider than tall. To most _effectively scale the _object, tall

objects should occupy the full height of the final image patch, while wide

objects should occupy the full width. The minor axis ofthe objects should

(43)

be to scale the axes _separately, _allowing all objects to _{occupy both} the full

widthand heightofthefinal image patch; however, in doingthis, important

implicit information regarding the aspect ratio ofthe object would be lost.

Ifanobject liesnear the borderofthe framein such a_waythat thepatch to

be extracted extends beyond the frame's borders, then the frame is padded

symmetrically. Resizing, _{using bilinear} interpolation, is then performed on

therelevant part ofthe video frame. Nearest-neighbor interpolation is used

when_{resizing the} maskframe, because bilinear interpolation could cause in

appropriate_graypixels toappearbetweentheblack and white regionsin the

mask.

Ifedgedetection is_{enabled, then}convolutionwithGabor filtersisapplied.

When_{resizing is} _performed, extra pixels in all directions are also resized to

provide _padding for _{the convolution,} so that edge effects are avoided. The

object itself is extracted from the _padding after convolution _{occurs, if it}

occurs.

Finally, masking occurs through a logical "OR" operation between the

image and mask patches. If enabled, the Hough transform is applied to the

resulting image patch. The _resulting _data, be _they grayscale _image, edge

map, or hough _space, are vectorized so that each pixel (or hough

bin)

is

(44)

(a)

(b)

(c)

(d)

(e)

(f)

Figure 4.3: Stages of _{initial input processing} on a sample,

(a)

and

(b)

are

the sample's input video and mask _frames, respectively,

(c)

and

(d)

are the

extracted and resized video and mask patches,

(e)

is the masked object if

Gabor _filtering is not _used, while

(f),

shown in false _{color, is} the masked [image:44.528.86.440.142.510.2]

(45)

Principal Component Analysis

PCA has _long been used as a feature extraction tool in face recognition, in

which the variations _among _many different faces _may be well-represented

within a PCA subspace [2]. In the _{EOR system, PCA is} used in the exem

plar _matching phase to _{discriminate among} what the _clustering algorithm

determines tobe similar images. Thecreators ofEOR note that PCAworks

bestwhenappliedtoimages drawnfroma normal distribution

[3],

and while

images ofa certain type ofobject (such asthe human

face)

_maydrawfroma

normal_{distribution,} adataset _consistingofdifferent objects does not. _Thus,

clusteringallows localsubspaces tobe generatedforthemultiple normaldis

tributions that are mixed in such a dataset. In

[2],

PCA is tested as single

(global)

classification_step, and is found to be inferior to EORwhen applied

to the cat and _dog database (see section 5.1). _However, Draper et al. show

that when _{EOR feature} extraction is applied to aerial images of Fort Hood

before PCA _{is applied,} performance is as good as the entire EOR _system,

given a high enough number of retained dimensions [1]. Experiments with

VEOR_usingsegmentedvehicle _{data, however,}show that performance_{is sig}

nificantly worse when _clustering is left _out, no matter how _many subspaces

are retained.

InthedefaultVEOR_{configuration,}acombinationofthese twoapproaches

to PCA_{is used,} andthe algorithmisapplied at botha global and local level.

The_{global application} ofPCAfunctionsas coarsefeature _extraction, results

from _{which are used} _for_clustering. _The

(46)

thesamefunctionasit doesin_EOR, which canbeconsideredfine featureex

traction. Thus PCA extractsthe largest features that it can from a dataset:

inter-class features from a multi-class _mixture, and inter-instance features

from a cluster. PCA is _commonly usedtoreduce the _{dimensionality} ofdata

before clustering is performed. _Ding and He

[15]

show that PCA itself per

forms data _{clustering according} to K-means' objective _{function, improving}

the performance of_clustering when the two techniquesare used insequence.

ExperimentswithVEORalso showthat theincrease inexecutiontimecaused

by the global PCA application is more than made _{up for} _by the decrease in

clusteringtime, and overall executiontime _{is significantly decreased.}

Replacingedge-basedfeatureextraction withPCA isnot without_caveats,

however. Asnotedin

[3],

anadvantageto_{using both Gabor}_filteringandPCA

is that different properties ofthe data are exploited _by the two techniques:

boundary and _{intensity information,} respectively. The negative effects of

applying PCA to grayscale images at all levels is evidenced by the clusters

shown in figure 4.4. Dark cars tend to be clustered _together, as do grayish

cars and whitish _cars, whereas under edge-based feature extraction all such

cars would tend to be clustered together. In fact, when _using the cat and

dog database (see section _5.1) as _input, VEOR performs worse with PCA

feature extraction than with Gabor filtering. Despite these issues,

PCA-basedfeature extraction yields greater performanceintermsofaccuracyand

(47)

[image:47.528.133.391.95.289.2]

Figure 4.4: Some cars and vans clustered _using global PCA for feature ex

traction. Rows represent clusters. Note that vehicles tend to be clustered both _bytheir class and _by their grayscale color.

Image scalingand_masking, aswell asoptionalGabor

filtering

andtheHough

transform, ishandled_bythe MATLAB function EORPreProcessing. Thear guments to thisfunctionare avideo_frame, a mask _frame, and atwo-element

location vector. An argument _providing Gabor filters _may be used iffilter ingis enabled, and is ignoredotherwise. Two values are returned: an image patch ofthemasked_object,and afeaturevector. IfneitherHoughnorGabor

filtering

are _enabled, then that vector is simply the vectorized form of the

returned masked object. Theparameters imgSizeCand imgSizeR specifythe

(48)

filtering

is implemented in the functions genGabor and applyGabor, and the

Hough transform in hough.

The global application ofPCA is in the _{EORClassify function,} which is

also responsible for _calling categorization and exemplar _matching routines.

Theparameter_{EOR_globalC} specifiesthenumber of subspacedimensionsto

retain (aseparate parameter specifiesthe number oflocalretained subspace

dimensions). The principal component analysis itself is implemented in the

pea function. AglobalPCAtransformationmatrix isrecalculated_everytime

that an image is introduced into _{memory through} learning.

4.5

Clustering

4.5.1 Requirements

The major requirements for VEORclustering, other than the need for

high-quality partitioning, are _firstly _{the ability to} _{automatically} determine, from

input _data, an appropriate number ofclusters for those data; and _secondly,

to efficientlyfacilitate learning. Theneedfortheformerina_{general-purpose,}

dynamic environment is immediately apparent. However, the latter require

mentbears explanation. Everytime thesystemlearns, thatis, whenthesys

tem introduces new image data into memory, several _{time-consuming} steps

occur. If PCA is used for feature extraction, then a new transformation

matrix must be calculated from the expanded data set.

Clustering

is then

(49)

clusterofthelearned_image, acluster-specific transformationmatrix is com puted. These three steps are unavoidable. What we would like to avoid,

however, are _unnecessary changes in the membership of other _clusters, be causeachangein_anycluster requiresthatanewPCAtransformationmatrix

becalculatedfor it when animage is matchedto the cluster. So, the needto efficiently facilitate learning may be rephrased as an issue ofthe _stability of

clusters'

membershipbetween partitionings. Another way in which learning

is facilitated is_that, whenalearnedobjectis partitionedinto itsown _cluster, an artificial classis createdforit. Thisis described in detail in section 4.8.1.

4.5.2

Promoting

_stability of cluster _membership

Asstated_previously, significant_processingtime_maybesavedifcluster mem

berships change as little as possible when _adding new image data to VEOR memory. Unfortunately, the locally optimum _partitioning reached by K-means is _highly dependent upon the initial centroid values

(seeds)

_chosen, and the most common methods for _choosing seeds are _by _assigning them to the locations ofrandom data _members, or to random locations [7]. This meansthat _{in practice,} clusters are _{rarely identical} from one _partitioningto the next when the natural groupings ofthedata are not well separated.

The K-Harmonic Means

(KHM)

algorithm

[13]

has been found to be

largely insensitive to seed locations when compared to K-means

[11],

and was_implemented_in_the_hope_of

preventing unnecessaryapplications ofPCA.

(50)

conver-gence of multiple centroids to the same _locations, _in effect _preventing the

data from

being

partitionedinto morethan a small number ofclusters. This

occurred when _{clustering both} _{high-dimensional} _raw _image _data _and

lower-dimensional PCA subspace _data, and under _any parameterization ofKHM.

A different approach to cluster _{membership stability is} taken from Un

supervised _Fuzzy

Clustering

Optimal Number of Partitions

(UFC-ONP)

algorithm

[10],

which is described in further detail in section 4.5.4. _Every

clusteringof UFC-ONP is seeded with the centroids from theprevious clus

tering,whichencouragesthealgorithmtoconvergetoalocaloptimumsimilar

to that ofthe previous clustering. In addition, because this approach uses

the results ofwork that _{has already been} _done, _{it significantly} reduces the

amount of time _necessary to converge to a local optimum. This method is

used in VEOR, andthe introductionofa new _{image into memory generally}

changes _{the membership} of a small number of_clusters, and often _only of a

single cluster.

4.5.3

Determining

an appropriate number of clusters

For a_clustering schemeto be useful in areal-world _system, where the input

data could _{vary significantly} from one invocation to the next, and from one

particularapplicationofthesystemto another, itmust not_relyonan apriori

specification ofthe number of clustersto be found. Users of_clusteringsoft

ware often pickthevalue ofaparameter, k, using knowledgeoftheinput data

(51)

newimagesextractedfromavideostream, thenitshouldbeabletocopewith

the fact that the number of natural groups within the data might _actually

increase as time progresses. In addition, Draper et al. find that the number

of actual classes in an image dataset does not necessarily correspond to the

best value ofk for the data, and get better results when _{overestimating the}

parameter [1]. They attribute this to the possibility that within each _class,

more cohesive logical groupings_may exist. _Thus, domain-specific knowledge

regarding thenumberofclasses_amongwhichdiscriminationmustoccur does

not translate into knowledge ofhow _many clusters should be found.

One method _{that may} be used to estimate the relative goodness ofdif

ferent partitions is to _apply a cluster _validity index [8]. When applied to

partitionings of the same data into different numbers of _clusters, these in

dices may, in theory, show which numb

Expert Object Recognition in video

Rochester Institute of Technology

Recommended Citation

Carl Reynolds

Print Reproduction Permission Granted:

Inclusion in the RIT Digital Media Library Electronic Thesis

learning

Contents

Figures

filtering

Chapter

Chapter

being

Edge Detection

filters)

Categorization

Clustering

(following

Exemplar

forcing

Summary

Chapter

(along

learning

Chapter 4

VEOR System Architecture

during

Otherwise,

Foreground

background)

Object

bounding

Feature

filtering

filtering

Clustering

Clustering

Promoting

being

Determining