Multiresolution Gaussian mixture models : theory and applications

(1)

Original citation:

Wilson, Roland, 1949- (2000) Multiresolution Gaussian mixture models : theory and

application

s. University of Warwick. Department of Computer Science. (Department of

Computer Science Research Report). (Unpublished) CS-RR-371

Permanent WRAP url:

http://wrap.warwick.ac.uk/61131

Copyright and reuse:

The Warwick Research Archive Portal (WRAP) makes this work by researchers of the

University of Warwick available open access under the following conditions. Copyright ©

and all moral rights to the version of the paper presented here belong to the individual

author(s) and/or other copyright owners. To the extent reasonable and practicable the

material made available in WRAP has been checked for eligibility before being made

available.

Copies of full items can be used for personal research or study, educational, or

not-for-profit purposes without prior permission or charge. Provided that the authors, title and

full bibliographic details are credited, a hyperlink and/or URL is given for the original

metadata page and the content is not changed in any way.

A note on versions:

(2)

and Appliations

Roland Wilson

February 28, 2000

1 Introdution

Multiresolution image representations have been suessfully applied to many

prob-lemsinimageanalysis [26,3,11,5℄beauseoftheirabilitytotradeospatial

resolu-tionagainstfrequeny domainresolutionandtheirsymmetry withrespet to

transla-tionsand dilations,twoof the mostimportantsouresof variationinimages[4℄. Not

every multiresolutionrepresentationis equallyeetiveinthis regard: shemes using

dyadi deimation, suh aspyramids and orthonormal wavelets, sarie translation

invariane to redue redundany [6, 25, 12℄. While this is desirable in image

om-pression, it is not ideal for problems in omputer vision, where ompatness is less

importantthanutilityinarangeofproblems-fromsegmentationtomotionanalysis.

In those ases, itis the ombination of statistialinferene, usually from inomplete

(eg. projeted) data and symmetry undermotion whih are most signiant.

Another interesting development in reent years has been the use of Gaussian

Mixture models to ope with statistial problems for whih no simple parametri

model exists [20, 27℄. While it is well known that algorithms suh as

Expetation-Maximisation an lead to eetive approximations in terms of a nite number of

omponents, the general problemof mixture modelling is diÆult when the number

of omponents isunknown [13, 17℄.

This paperdesribes anew multiresolutionrepresentation whih taklesthe

mix-ture modelling problem head-on. The general approah shares important features

withthe Classiationand RegressionTrees (CART)system of[9℄and itsderivatives

[17℄. Itmayalsobeseenasageneralisationofsale-spae[26℄,inthatitusesGaussian

funtions and inludes spatial o-ordinates,but unlikea onventional basis set, they

are adapted to the data and used statistially. Moreover, they are dened ina spae

whose dimension reets the inferene problem,not simplythe image data. Thusin

dealing with olour images, a 5 D spae is required (two spatial and three olour

dimensions);for inferring3 D struturefrom motion,typiallyninedimensions are

required (three spatial, three olour and three motion axes). Yet MGMM has no

diÆulty inprinipleinmovingseamlesslybetweenthese spaes. The next setionof

the paper outlines the theory underlying MGMM as a method of approximating an

(3)

Thisisfollowedby apresentationof somesimpleexperiments,whihillustratehowit

maybeused inimportant`early vision'problems: segmentationandmotionanalysis.

The paper is onluded with some remarks on the impliations and potential of the

(4)

Three main elements ditate the form of representation alled MGMM: ability to

approximate any probabilitydensityina spaeof arbitrarydimension; losureunder

aÆne motionsand amultiresolutionstruture, whih an be used tomake

omputa-tion eÆient.

Before outlining the theory of MGMM, it may be useful to explain the need for

suh anapproah. It is widely reognised that statistialmethodsare akey element

in vision and image analysis eg. [7, 22℄. While Gaussian models work adequately

in many ases, more general distributions pose onsiderable diÆulties. The rst is

simplyestimatingthedensity,aproblemwhihhasbesetstatistiiansformaydeades

[23,24℄. Therearetwowidelyusedalternatives: non-parametri,orkernelestimation

and parametri estimation. Kernel methods are a generalisation of histogramming,

in whih, ratherthan use anon-overlapping set of intervals,the density is estimated

usingasmoothkernelto`ironout'theutuationswhihsamplinginevitablyauses.

Ineet,thisrepresentsaonvolutionwhihisthestatistialanalogueofanti-aliasing:

itiswellknownthatrawhistogramsarewoefullyinadequateasdensityestimates[23℄.

The alternative, whih is learly preferable if there is suÆient prior informationor

knowledge about the proesses produing the data, is to use a parametri model. In

reent years, this has been extended through the use of Bayesian methods for model

seletion and the estimation of so-alled `hyper-parameters', whih haraterise the

model, eg. [19, 15℄, but while useful inpartiular problems, these hardly represent a

generalapproahtothe infereneproblem. It seemsthat whatisneeded isamethod

whih ombines the generality of kernel estimation with the power of parametri

methods. This isone of the key motivations behind the MGMM approah.

To dene the MGMM approah ina general way, westart by observing that any

probability density funtion an be approximated to an arbitrary preision by a set

of Gaussian funtions: it is well known that the Gaussian funtions are a omplete

set on L 2

(R n

). However, we want to make a stronger laim - namely that an L 1

approximation of anarbitrarydensity funtionneed only involvepositive oeÆients

in the expansion. To this end, we state the theorem:

Theorem 1: Let f(:) : R n

! R be any nonnegative real integrable funtion on R n

with Z

R n

dxf(x)=1 (1)

Then for any Æ > 0 there exists an approximation of f(:) by a stritly positive sum

of Gaussianfuntions of the form

^ f(x)=

X

i f

i g

i

(x

i

) (2)

of means i

and ovarianes i

, suhthat f i

>0;8iand

Æ > Z

R n

dxjf(x) ^

f(x)j (3)

Proof: Westartinonedimensionandthenextend theresulttoR n

,primarilyfor

(5)

X X (x) = 1 2

if x2X

0 else

(4)

wherethe set X isthe interval<x. Consider theonvolutionof X

(:)with the

Gaussianfuntion g

(x) of variane 2 g ? X (x)= Z X dyg

(y x) (5)

whihis easilywritten as

g

?(x)= 1 4 [erf( +x p 2 ) erf( x p 2 )℄ (6)

where erf(:) is the error funtion [1℄. If ==M, then for any Æ > 0, there exists a

value of M suh that Z dx j X (x) g =M ? X (x)j < Z X M dx X

(x)+ (7)

Z

X X M

dxexp[ M℄+ Z R X dx g =M ? X (x) < 2 M

+2exp [ M℄ < Æ

2

where X M

is the subset of R for whih jxj < 1 M

. Now the onvolution, whih is

smooth, an be approximated by asum

g ? X (x) X i f i g (x 2 k i) (8)

where k is hosen suh that there is an integral absolute approximation error less

than Æ 2 Z dxjg ? X (x) X i f i g (x 2 k )j< Æ 2 (9) where f i

> 0;8i. But this is the Riemann integral theorem [8℄: one simply uses for

f i

the saled values of X (2 k i) f i =2 k X (2 k i) (10)

Then for some k, the sum, whih is absolutely onvergent, willgive anerror smaller

than Æ=2.

We have thus proved that the funtion X

(:) an be approximated to arbitrary

preisionby anite set of Gaussian funtions. Toomplete the proof,we note that

X

i f

i

!1 (11)

whih follows from the uniform sampling of X

(:), the saling of X

(:), (4) and the

onvergene of the sum. Thus if X

(6)

density an be approximated by a mixture of funtions of the form X

( x y

i

): these

simplefuntionsare the means by whih the Lebesgue integralis dened [8℄and any

probability distributionis measurable.

The extension to R n

is trivial: one simply replaes the 1 d elements by the

orresponding Cartesian separable, produt form and the proof goes through in the

same way. This ompletes the proof.

In order to use this result, we need an eÆient way of deiding how many

om-ponents to use in modelling an arbitrary density, or more realistially, data drawn

from anarbitrary density. Tothis end, onsider the following simpledeision: given

a set of data, is it better modelled with one Gaussian omponent or two? This is

a deision whih annot be made on the basis of likelihood alone: it would always

give preferene to the more omplex desription. The inrease in likelihoodmust be

balaned against the ost interms of omplexityof using two omponentsinstead of

one.

This type of problem has reeived muh attention in the literature on

lassia-tion [9,17, 19℄, and in system identiation, in whih modelorder has tobe hosen

[10℄. Inboth areas,widespreaduse hasbeen madeof Akaike's InformationCriterion

(AIC), whih is derived from the Kullbak-Liebler Information Divergene between

two densities:

d(f;g)= Z

R n

dxf(x)ln[ f(x)

g(x)

℄ (12)

A widelyused alternativetoAIC isthe MinimumDesriptionLength(MDL)

ri-terion,inwhihanestimateismadeofthe `odingost'oftheparametridesription

of anobjet [18℄. Now, in many ases, quitewhat the MDL of anobjet mightbeis

unlear, sineitwould appear todependonthehoieofdesriptionlanguageandon

the density indued onstrings from that languageby the objets of interest. In the

present ontext, muh greater preision is available, sine not only the parameters,

but inprinipletheirpriorandposteriordensitiesanbeestimated. Lastly,Bayesian

Evidene (BE), ie. the posterior P(MjY) supporting a model M amongst a set of

alternatives, has alsobeen propounded [17, 15℄as the best way of seletinga model.

In the ontext of seletingeither one or two Gaussianomponents to modela set of

data, this involves penalising the two-omponent model though the prior. It is not

so lear, however, just how steep the penalty should be; at least in the AIC and in

the presentase inMDL, this questionissettledbydenition. In general,evaluation

of these riteria is itself a soure of onsiderable diÆulty, with resort usually being

madetosimple approximations,eg. based onsaddle-pointapproximationsorsample

averagingover anappropriateset ofsimulationepohs. In the present ase, one the

data are lassied,anexat solutionispossible. Classiationofthe dataisahieved

by Gibbs sampling.

Beause the Bayesian paradigm is an appropriate one in many appliations, let

us start by onsidering anevidential framework for the problemof `one-or-two'. We

an basethe deisionbetween H 1

: one omponent andH 2

(7)

R B = P(H 1 )P(XjH 1 ) P(H 2 )P(XjH 2 ) (13)

whereP(H 1

)=1 P(H 2

)isthepriorprobabilityofasingleomponentandP(XjH i

)

is the evidenefor H i

: the integral overthe parameter density of the likelihood

P(XjH i )= Z R m

d f(Xj;H i

)f(jH i

) (14)

Wean use the results of the appendix to ompute the BayesFatorfor twoagainst

one omponents. In the following, unsubsripted quantities referto the unsplit data,

whilesubsripts 1;2refertothe two omponentsofthe putativemixture

approxima-tion. Thusthe Bayes Fator an be writtenfrom (40) as

BF = jA 1 j d 2 jA 2 j d 2 jA j d 2 jA 1 j d 1 2 jA 2 j d 2 2 jAj d 2 n Y i=1 ( d 1 +1 i 2 ) ( d 2 +1 i 2 ) ( d +1 i 2 ) ( d+1 i 2 ) (15)

In(15), quantities witha

denoteposteriorand withoutitdenotethe orresponding

prior parameters. Thus, Aisthe prior ovariane parameterfor asingle omponent,

while A 1

is the posterior ovariane parameter for the rst of the two omponents.

The sample size, N, is hidden in the parameters d

= d+N;d i

= d+N i

;i = 1;2.

A onsiderable simpliation of the above ours if we make the following hoies:

A i

=A= d d

A

. Inthis ase, the log-BayesFator redues to

LBF = d +d 2 lnjA j d 1 2 lnjA 1 j d 2 2 lnjA 2 j+ nd 2 ln d d + n X i=1 ln ( d 1

+1 i

2 )+ X i ln ( d 2

+1 i

2 ) X i ln ( d

+1 i

2 )

X

i ln (

d+1 i

2

) (16)

AIC is based onlikelihoodsand an be written using(38) as

AIC = N 2 lnjSj N 1 2 lnjS 1 j N 2 2 lnjS 2

j n(n+3) (17)

Comparing (16) and (17), there is an obvious similarity in the dependene on the

data, exept for the use of priors in the former. Moreover, in the LBF, the `penalty'

term inreases with both the dimension of the problemand the samplesize, whereas

the AIC penalty depends onlyonthe dimension. Notethatalthoughitis possibleto

use Bayesian estimates inAIC orMLestimates inthe BF, itmakeslittlesensetodo

so.

Figures 1(a)-(d) show an interesting feature of the two. All four gures show

ontourplots ofthe AICorLBFguresforatest oftwo2 DGaussianomponents

against one, as a funtion of separation either in sale or position (horizontal axis)

(8)

been notedby others, itdoes appear that AIC is more supportiveof splittingas the

population inreases; on the other hand, for very small populations, it is the LBF

whihismoreinlinedtosplit,ieispositive. ThistendenyofLBFtofavoursplitting

of smallpopulations is not a good one and shows that neither riterion dsiplays all

of the features one would like, even in suh a simple ase as a two-vs-one deision.

Nevertheless, the LBF has lear advantages, as long as the population size is large

enoughto give meaningfulresults.

A dierentperspetive anbe given totheBFif werewriteitusingBayes's Rule,

viz.

LBF =ln

P(Xj 1 ; 2 ) P(Xj) ln P( 1 ; 2 jX) P( 1 ; 2 ) +ln

P(jX)

P()

(18)

While the rst term represents the gain in likelihood from using two omponents

instead of one, the remainder represents the ost of this, in terms of the mutual

informationbetween parameters and data (f. Appendix I, (41)).

2.1 Eets of Motion

The other majorproperty,a ruial one for motion analysis,is the losureof the set

of n D Gaussian funtionsG n

underaÆne maps A:R n

7!R n

Ax=Lx+a (19)

where L is an invertible matrix and a a translation. Again, it is obvious that the

ation of AonG n

islosed, sine

g

(A 1

(x ))=g A (x A ) (20) where A

= a (21)

and A =L T L (22)

But nowweare ina position toprovea rather interesting result, summarised as

Theorem 2: Let f(:) 0 be an integrable funtion f : R n

! R , as above and let

t(:) : R n

! R n

be a smooth, invertible map from R n

to itself, having bounded rst

and seond derivatives. Then for any Æ >0, f has an approximation as a Gaussian

mixtureof theformof (2),with integralabsoluteerrornogreaterthanÆ=2and there

exists an approximation of the transformed funtion Tf

Tf(x)=f(t 1

(x)) (23)

of the same form,where eah Gaussianomponent g

i

istransformed aording toa

(9)

ture of the MGMM approximation. First, we observe that any smooth ow an be

approximated using

t (x)=t x 0

+r x

(t)(x x 0

)+O(kxk 2

) (24)

viaTaylor'stheorem, wherer x

(:)isthe gradient. Now,letg

(x )beanarbitrary

GaussianfuntiononR n

. Thenitfollowsfrom(24)thatthereexistsasaleparameter

(Æ)>0 suh that

Z

dxjg (Æ)

(t(x ) g

(Æ)

(t()+r x

(t )(x ))j< Æ

2

(25)

We an ahieve this by hoosing (Æ) > 0 so that the error (x) - the integrand in

(25) - satises

1. (x) < Æ 4

;if kx t()k < r. This follows beause the error in the Taylor

expansionfor t isof order 2

(Æ) if r=M(Æ).

2. R kxk>r dxj(x)j < R kxk>r g (Æ)

(t(x )+g (Æ)

(t ()+r x

(t )(x )) Æ 4 .

This follows from the observation that the tails of the two Gaussians an be

made of arbitrarilylow weightby asimilar hoie of r.

Putting the two errors together, we get a total error

Z

dxjg (Æ)

(t(x ) g

(Æ)

(t ()+r x

(t)(x ))j< Æ 4 + Æ 4 = Æ 2 (26)

Sine this appliestoanarbitrary Gaussian,apply itto eah omponentin amixture

representation of f(t(x))

f(t(x))= X i f i g i (x i ) (27)

for whih (i)the absolute integral erroris less than Æ=2and (ii)eah omponent has

avarianesmallenoughthat itstransformationby t introduesanerror nobigger

than Æ=2. Suh a representation exists, by Theorem 1. But then the error in the

representation of f(x) by the aÆne approximation of t is no greater than Æ. This

(10)

Thekeytoapplyingtheseideasistouse asequentialapproah,whihleadstoa

mul-tiresolution tree struture: Multiresolution Gaussian Mixture Modelling (MGMM).

We start from the natural assumption that we do not know the density of interest,

but wish to estimate it from some data, suh as an image or set of images. If these

data are denoted X i

;1 i n, we an ompute the sample mean and ovariane

[21℄ and thene inferasingle multivariateGaussianmodelg 0

(x

0

),where 0

; 0

are the sample (Maximum Likelihood) estimates for the data. Now, in rare ases

this will model the data adequately. If not, then we split the data into two parts

and model eah with a Gaussian. This an be done using the Markov Chain Monte

Carlo (MCMC) sampling tehnique desribed in [19℄, whih treats the inferene as

oneontaininghiddenvariables,namelythelassZ i

towhiheahdatumbelongsand

samplesfromthe posteriorsforthepopulationsize,meansandovarianes, assuming

onjugate priors, whose parameters are simply those of the population as a whole.

Thus the prior for the means of the two lasses are Gaussian, while the ovarianes

are drawn from an inverse Wishart density: a Normal-Inverse Wishart (NIW)

den-sity. Beause the NIW is the naturalonjugate prior for this problem, the posterior

isalsoNIWand soaGibbssampleran beused [19,14℄. Samplingfor(i)thehidden

variablesand(ii)theorrespondingsub-populationdensitiesgivesthe twoGaussians,

based on the posterior estimates from the sampler for g

1j

(x

1j

);i2 [0;1℄. This

Gibbs sampler is desribed fully in Appendix II. Clearly, we have the basis of a

re-ursionhere: we an builda tree representation, in whih eah leafis asubset of the

populationfound bysuessive binarysplitsfromtherootnode,whihrepresents the

wholepopulation. Beause theNIW isaonjugateprior,the sameomputationsan

be used atevery node in the tree:

1. Seletlass j and test itsnormaldensity approximation for`goodness-of-t'. If

the t isadequate, terminate, else

2. Splitlassj intotwoomponents: lassj1andlassj2,bysamplingthehidden

variables Z ji

;1 i N j

and thene obtaining a Bayesian estimate of the

lass means and ovarianes by sampling from the posteriors, given the prior

g

j

(x

j ).

This needsa riterionforsplitting. In the previoussetion, wereviewed twopopular

hoies: LBF and AIC, eitherof whih an be expressed in the present ase as

C =ln

f(Xj j1

; j2

)

f(Xj j

)

k(N j

;n) (28)

wherethe log-likelihoodratiois subjettoa `omplexity'penalty whih isafuntion

of the size of the populationat j and the dimensions of the spae. In the ontext of

the MGMM tree, however, the LBF given in(18) doesnot have the right form: it is

based ontheomparison of eitherasingle omponentortwo omponentsasa model

of the data. Any node in the tree, other than the root, is the result of a series of

(11)

estimates at node j are the obvious hyperparameters for the two resulting lasses.

Indeed, if we wish to store or transmit the parameters j1

; j2

, we would need the

values of j

also, sine these appear in the prior. Finally, in keeping with its roots

in rate-distortion theory [2℄, we add a rate ontrol parameter , whih allows us to

trade-othe informationinagiven MGMMtree againstitsauray. Ifwe take this

into aount,we replae (18) by the Minimum Information Criterion (MIC):

MIC()=ln

f(Xj j1

; j2

)

f(Xj j

)

ln f(

j1 ;

j2 jX;

j )

f( j1

; j2

j j

)

(29)

Anotherwayofviewingthe onstantisthatitaountsforthepriorP(H 1

)in(13).

If we do not wish to take this ommuniation-oriented approah, then setting =1

amounts to setting P(H 1

)=0:5, ie using the LBF.Although the aimof MIC might

sound similar to Rissanen's MDL, two observations might be made about this: (i)

the aimhereisto providea trade-obetween the informationost ofthe parameters

andtheinformationgainedbyprovidingabetterttothedata;(ii)intheNIW ase,

these riteria have suh losely related forms that they all amount to some form of

Bayes Evidene, leavingspae for one's prejudie inthe deniton of the prior.

Weonludebyobservingthatanalternativeviewofthe MGMMdesriptionisas

apathwork ofaÆne models,eahleaf nodebeingthe result ofalinear regressionon

the data [9℄. For example, inthe ase of a grey level image,the MGMM desription

gives for eah lass a Gaussian model, whih is diretly related to a least-squares

approximation of the form

z i

(x)=A i

(x x

i )+z

i0 +

i

(x) (30)

where z i

(:)isthe grey levelas afuntion of the spatialoordinatex forthe ith lass

and i

(:) is the residual. The matrix A i

is easily found from the ovariane matrix

i

forthat lass and x i

;z i0

(12)

Therstsetofexperimentsisdesignedtotestthesamplerondatatakenfromaknown

density: a pair of 2 D normals, with varying distane between means, ovariane

and populations:

1. 40pointsdrawnfromanormal T

=(0;0);=I densityand40fromanormal

T

=(4;0);=I density.

2. First omponentas in rst ase and seond shifted to T

=(2;0).

3. First omponentas above and seond shifted to T

=(0;0),with =16I.

4. First omponentas above and seond shifted to T

=(0;0),with =4I.

In addition, subsets onsisting of (40;20), (40;10) and (20;20) data from the two

samples respetively were used, giving12ombinationsinall. The data ineah ase

were obtained from the same set, by simply translating and saling the seond set

appropriately. Thiswasdonetoensurethatthevariationsseeninresultswereaused

by thesamplingalgorithm. The samplerusedthesamplemean andovarianeof the

whole population as prior parameters for the NIW distribution, with the

hyperpa-rameters settotheirminimumvalues, ie. =1;d=n+2in(31). The errorratesfor

the hidden variablelassiation taken from 10independent runs withdierent

sim-ulationseedsare shown inTable4. Thelassiationusedthe samplemeanfromthe

last100 iterationsof thesampler. Investigations showed that thesamplerhadsettled

intoastationarydistributionbythen. Althoughthisisasmallnumberofsteps,these

problems are omparativelywellposed, espeiallyinthe rsttwoases.

Correspond-ingly, the error rates for these two ases are smalland the estimated parameters are

lose totheirtrue values ineverysimulation. Althoughhigherror rates are observed

for the onentri ases, itshould be obviousthat insuh ases, lassiation of the

entralpointsineitherase isessentially arbitrary. Itwasnoted thatinvirtually

ev-eryase, thesamplersetttledinastateorrespondingtotwoomponentsofdiering

variane, loated approximately onentrially. Figures2(a)-(d) show typialresults

fromthe simulation.

Notethat inthese examplesand inthose thatfollow, topreventsingularities and

toreetthenitepreisionoftheimage measurements,eahdatumonsistsofboth

a loation and a variane parameter. Correspondingly, the sample ovariane has to

be modied totake aount of this.

The method has also been tested on some more realisti appliations in image

analysis. In all the ases desribed below, the MGMM algorithmwas limited to 50

iterations and typially onverged in 10 20. The Bayes Fator was used as the

riterionfor splittingaomponent. Therst results showthe segmentation of awell

known image - Lena - using the MGMM approah. In this ase, the data are 3 d:

two spatial dimensions and one for intensity. The MGMM tree for this 256 256

pixel image, produed using the Gibbs Sampler,are shown shematiallyin gure 5

and are superposed on the orresponding least squares approximations in gure 4,

(13)

40,40 0.0213 0.212 0.341 0.411

40,20 0.050 0.138 0.273 0.395

40,10 0.018 0.120 0.222 0.378

20,20 0.073 0.20 0.338 0.448

Table 1: Misslassiation rates for samples of sizes shown in leftmost olumn from

two normaldensities aslisted intext. The lassiationused the sample meansover

the last 100 of 200 iterationsof the sampler.

g. 5,thesamplemeanvetor,intheorder(x;y;z)andthepopulationprobabilityare

shown next to eah leaf vertex in the tree. Figures5(a)-(d) show the representation

of the image by 8 regions, eah orresponding to a leaf of the MGMM tree. The

lassiation 5(a) is just the MAP lassiation of pixels based on all3 oordinates

(spatial and intensity), while g. 5(b) shows the MGMM tree nodes and the ellipses

orrespondingtothespatialovarianeof theGaussianomponentateahleafnode.

The lines indiate the tree struture, with thikness indiating height in the tree:

the thikest lines are those from the root; the thinnest those to the lowest level in

this tree - level 4, numbering from 0 at the root. In other words, the tree is not

espeially balaned, nor should it be, if it is properly adapted to the data. The two

reonstrutions use the least-squares approximation based on the estimated mean

and ovariane assoiated with that lass at that pixel, as in (30). The left image,

(), shows the reonstrution using only spatial information, using a `soft' deision:

eah pixel is treated as a mixture, with weights given by the relative magnitudes of

the 8 Gaussian omponents at that position. In eet, this is making an inferene

from 2 D to 3 D using the MGMM representation, sine grey level is ignored

in lassifyingthe pixels. The reonstrution signal-noise ratio for this ase is15:1dB

(peak-rms). Althoughthis isa poorreonstrution visually,itrepresents anextreme

pauity of information: only8sets of 3 DGaussianparametersare required,ie. 72

values. Ontheotherhand,whenafullMAPlassiationisperformedusingall3

o-ordinates,thereonstrutionSNRlimbsto24:3dBandindeedthevisualappearane

issurprisinglygood. Figure6(a)shows the approximation ofthe probablilitydensity

based on the MGMM and extrating the marginal density for the intensity on level

3 of a gray level pyramid, whih has just 1024 pixels. This is a trivialomputation

from the MGMM representation, whih gives a better approximation of the density

than an be obtained by simple histogramming. For omparison, the histogram of

the 256256 pixelimage is shown ingure 6(b).

The seondillustrationuses twoframesof the`MissAmeria'sequene inthe

seg-mentation,whihisbasedonthe5 ddataonsistingofspatialo-ordinates,intensity

and 2 d motion. The raw motion estimates were obtained using a multiresolution

orrelation algorithm, again from 256 256 pixel images. In this example, a tree

desription with 66leaveswas obtained fromframe15of the sequene and the 2 d

motions. The aÆne motionsextrated by least squares (f. (30) from the ovariane

(14)

15-16,gures7(a)-(d)showthatthe MGMMapproahandeal easilywith non-rigid

motions,thepeak-rmssignal-noiseratio(PSNR)being20:0dBforthisreonstrution,

as itis for the originalframe.

As anal example,the twoimagesof abreakfast tableinFig. 8(a)-(b)were used

in the same multiresolution ross-orrelation algorithm to ompute a stereosopi

disparitymap. ThiswasinputtotheMGMMtreealgorithmas4 Ddata: spatial

o-ordinates,greylevelandhorizontaldisparity. Theoriginalimageswerealso256256

pixels, but 6464 images were input to the MGMM algorithm. Reonstrution of

the disparity also used all 4 dimensions in lassifying eah pixel, the depths of the

lassiedpixelsbeingestimatedusing theleastsquares approximationbasedontheir

spatialposition,asin(30). Theresultingapproximationisshowningure8()andis

superimposed on the left imagein gure 8(d), showing areasonable orrelation with

the expeted heights of the larger objets, suh as milk and ereal pakets. There

were 150 leaves in the MGMM tree for this example, due to the omplexity of the

(15)

Inthispaper,anovelandversatilestatistialimagerepresentationhasbeenpresented.

The theory of MGMM as a generalform of density approximation was outlined and

its use in desribing images and sequenes illustrated. The key properties whih

mark MGMM out as a representation are as follows: although it is a statistial

model, it inorporates spatial relationships; it is `auto-saling', ie. it lassies data

based on their likelihood, rather than on simple Eulidean distane; beause it is

multiresolution,itallowseÆientomputation;beauseofthelosureoftheGaussian

funtions under aÆne motions, it deals diretly with the problem of image motion.

ThisombinationofpropertiesmakesMGMMauniqueapproahtoimagemodelling.

It should be lear that the tree struture of MGMM implies that as an

approxi-mation of an arbitrary density with a xed number of omponents, it is most likely

not optimal: tree searhes have their limitations. But this is not the real point of

MGMM: what MGMM oers is a way of approximatingan arbitrary distribution to

an arbitrary preision in a omputationally eÆient way. In this respet too it is

unique.

The experiments onsyntheti data show that a samplingapproahto estimation

of the Gaussianparametersateah stagean be omputationallyeetive,while the

results on real image data illustrate that aurate reonstrutions are possible with

veryompatMGMMrepresentations-even inthemostomplexofthe above

exam-ples,only125lasseswere needed for256256 imagedata. Alloftheexamplesshow

that MAP estimation ombined with MGMM is an eetive tool in image analysis

and vision.

Of ourse, this report represents preliminary work, whih requires further

devel-opment, both theoretial and pratial. For example, the work on motion needs to

be extended totake properaount oftemporalstruture and thisin turnrequires a

proper theoretial basis. It is lear that none of the seletion riteria is ideal in all

ases. Finally,the essentialproblem invisionof movingfroma 2 D representation

to a 3 D one has not been takledyet. These are allunder ative investigationat

the time of this writing.

Aknowledgement

This work was supported by the UK EPSRC.

Referenes

[1℄ M. Abramowitz and I. A. Stegun. Handbook of Mathematial Funtions. Dover,

New York, 1965.

[2℄ T.Berger. Rate-DistortionTheory: aMathematialBasisforDataCompression.

(16)

Code. IEEE Trans. Commun., COM(31):532{540, 1983.

[4℄ I. Daubehies. Orthogonal Basesof CompatlySupported Wavelets. Comm.on

Pure and Appl. Math., XLI:909{996, 1988.

[5℄ I.Daubehies. The Wavelet Transform,Time-FrequenyLoalisationandSignal

Analysis. IEEE Trans. Information Theory, 36:961{1005,1990.

[6℄ E. H. Adelson E. P. Simonelli, W. T. Freeman and D. J. Heeger. Shiftable

Multisale Transforms. IEEE Trans.Information Theory, 38:587{607,1992.

[7℄ A. K. Jain. Fundamentals of Digital Image Proessing. Englewood Clis, New

Jersey, 1989.

[8℄ A. N. Kolmogorovand S.V. Fomin. Introdutory Real Analysis. Dover, 1975.

[9℄ R. A. Ohlsen L. Breiman, J. H. Friedman and C. J. Stone. Classiation and

Regression Trees. Wadsworth, 1984.

[10℄ L. J. Ljung. System Identiation: Theory for the User. Englewood Clis,

Prentie-Hall, 1987.

[11℄ S.G. Mallat. A Theory forMultiresolution SignalDeomposition: The Wavelet

Representation. IEEE Trans. Patt. Anal. Mahine Intell., 11(7):674{693, July

1989.

[12℄ S. G. Mallat. Multifrequeny Channel Deompositions of Images and Wavelet

Models. IEEE Trans. Aous. Speeh Sig. Pro., 37(12):2091{2110, Deember

1989.

[13℄ G. J. MLahlan. Mixture Models: Inferene and Appliations to Inferene and

Clustering. New York, M. Dekker, 1988.

[14℄ A. O'Hagan. Kendall's Advaned Theory of Statistis,vol. 2B. London,Edward

Arnold, 1994.

[15℄ J.J.K.O'RuanaidhandW.J.Fitzgerald. NumerialBayesian Methods Applied

to Signal Proessing. New York, Springer-Verlag, 1996.

[16℄ S. J.Press. Applied Multivariate Analysis. Florida,Robert E.Krieger, 1982.

[17℄ B. D. Ripley. Pattern Reognition and Neural Networks. Cambridge U.P., 1996.

[18℄ J. Rissanen. Stohasti Complexity. Jnl. Royal Stat. So., B, 49:223{239, 1987.

[19℄ C.P.Robert.MixturesofDistributions: InfereneandEstimation.InS.

Rihard-son W. R. Gilksand D. J. Spiegelhalter,editors,Markov Chain Monte Carlo in

(17)

sian Mixture Modelling. IEEE Trans. Patt.Anal. Mahine Intell.,20(11):1133{

1141, 1998.

[21℄ G. A.F. Seber. Multivariate Observations. New York,Wiley, 1984.

[22℄ R. Szeliski. Bayesian Modeling of Unertainty in Low-level Vision. Boston,

Kluwer, 1989.

[23℄ R. A. Tapia and J. R. Thompson. Non-parametri Probability Density

Estima-tion. Baltimore,Johns Hopkins Pr., 1978.

[24℄ G. S. Watson and M. R. Leadbetter. On the Estimation of the Probability

Density. In Ann. Math. Stat., volume 34, pages 480{491,1962.

[25℄ R.Wilson,A.Calway,andE.R.S.Pearson. AGeneralizedWaveletTransformfor

FourierAnalysis: The MultiresolutionFourierTransform and ItsAppliationto

Image and Audio SignalAnalysis. IEEE Trans.Information Theory,38(2):674{

690, Marh 1992.

[26℄ A. Witkin. Sale-Spae Filtering. In Pro. of IEEE ICASSP-84, 1984.

[27℄ K. Palaniappan X. Zhuang, Y. Huang and Y. Zhao. Gaussian Mixture Density

(18)

Distribution

Inusingthe MVNdistributionasauniversalapproximator,itispossibletomakeuse

of the properties of this widely studied distribution. The development here follows

that in [14℄ and [16℄. A Bayesian analysis starts from the hoie of priors and in

the present appliation, it is obvious to use the so alled natural onjugate prior

distribution: theNormal-Inverse-Wishart(NIW)distribution,formultivariatenormal

data with unknown mean and ovariane. This is dened for n D data as

f(;)=k 1 jAj d 2 jjj d+n+2 2 exp [ 2 ( a) T 1 (( a) 1 2 tr( 1 A)℄ (31)

or, briey f(;) = NIW(A;d;a;), a funtion of the parameters d > n+1, the

degrees of freedom, , the prior weight on the mean, a, the prior mean and A, the

prior ovariane. Inreasing and d attahes relativelymore weight to the prior, for

a given sample size N. The normalising onstant k is given by

k=2 n(d+1) 2 n(n+1) 4 n Y i=1 (

d+1 i

2

) (32)

The likelihood for the data X i

;1 i N, is then simply a produt of normals,

assumingthat data are independently sampled,ie.

f(Xj;) =(2) N 2 jj N 2 exp [ 1 2 N X i=1 (X i ) T 1 (X i )℄ (33)

and beause the NIW isthe natural onjugateprior, the posterior isalso NIW, with

parameters

a

=

a+N X +N (34) where X =N

1 P

i X

i

isthe samplemean and

A

=A+NS+ N +N (a X)(a X) T (35)

where S =N 1 P i (X i X)((X i X) T

is the sample ovariane. It is not hard to

see that in the posterior, the salar parameters are

=+N;d

=d+N.

The lassial analysisof MVN data is basedonthe likelihood(33), fromwhihit

is easyto nd the ML estimates

^ = X (36) and ^

=S (37)

Evaluating the likelihoodatthese estimates gives

max ;

(19)

a ; ~ = 1 d A

f(Xj;~ ~

) =j2A

j N 2 d nN 2 exp[ d 2 (n trA 1 A)℄ (39)

The above results an be used to evaluate the Bayes evidene for one omponent in

the mixture, by integrating over the parameters

f(Y)=

jAj d 2 nN 2 jA j d 2 n Y i=1 ( d +1 i 2 ) ( d+1 i 2 ) (40)

The lastquantity ofinterest isthe mutual informationbetween the data and

param-eters, whihis given by the ratio

i(Y;;) =ln

f(;jY)

f(;)

(41)

The expetation of this quantity represents the amount of information whih, on

average, the parameters yield about the data, or vie versa. Sine both prior and

posterior are NIW, it isstraightforward toevaluate (41)

i(Y;;) = d 2 lnjA j d 2 lnjAj N 2 lnjj nN 2

ln2+

X

i ln (

d

+1 i

2 )

X

i

ln (

d+1 i

2 ) 1 2 tr[ 1 ( ( a )( a ) T

( a)( a)

T +A

A)℄

(20)

with hidden variables

This samplerfollows exatlythe modelproposed by Robert [19℄. It uses NIW as the

priorforthe sampledata andaDirihletpriorforthe populationsize (experienehas

shown that this is not always helpful, however). Thus, given the estimate j

; j

for

the mean and ovarianeat anode j, the samplerperforms the following steps:

1. Sample ji

; ji

;i=1;2; NIW( j

; j

; i

;d i

).

2. Samplefor the populationsize using the Dirihlet distribution.

3. Samplethe hidden variables Z k

2[1;2℄;1k N j

using aGibbs sampler and

the urrent estimates of P; ji

; ji

;i=1;2.

4. Calulate theposteriorNIW parameters, basedonthe lassieddataand

(34)-(35).

This is repeated, starting with i

= i

=1;d i

=d i

=n+2, the minimum values, to

keep the prior vague.

It may be noted here that a deterministi algorithm, based on 2 means, but

using the log-likelihood rather than simple Eulidean distane, works adequately on

some simpler problems, for example where lasses are onvex. In general, however,

it islittleheaper toimplement2 means and muh morelikely tosettle in aloal

(21)

-4

-2

0

2

4

20

40

60

80

100 -4

-2

0

2

4

20

40

60

80

100

(a) (b)

-4

-2

0

2

4

20

40

60

80

100 -4

-2

0

2

4

20

40

60

80

100

() (d)

Figure 1: Contour plots of Akaike Information Criterion and Log-Bayes Fators for

two vs one normal omponents: (a) AIC as a funtion of dierene in means

(hor-izontal) and population size (vertial), (b) LBF as in (a), () AIC as a funtion of

dierene in sale, (d) LBF as in (). In all four ases, the proportions of the two

omponentsareequalandblakrepresentsnegativevalues,ieapreferene forasingle

(22)

-1

0

1

2

3

4

5

6 -3

-2

-1

0

1

2

3

4 -1

0

1

2

3

4 -2

-1

0

1

2

3

(a) (b)

-4

-2

0

2

4

6 -4

-2

0

2

4

6 -3

-2

-1

0

1

2

3

4 -3

-2

-1

0

1

2

3

4

() (d)

Figure2: ContourplotofestimatedGaussianomponentsfromapairof2-dnormals,

(a)havingmeans T 1

=(0;0); T 2

=(4;0)and ovarianes i

=I;i=1;2;(b)means

at T 1

= (0;0); T 2

= (2;0); () means at T i

= (0;0);i = 1;2 and 1

= 16 2

; (d)

1

=4

2

. Estimates obtained from 200 iterationsof the Gibbs sampler. The data

(23)

P=0.202

P=0.080

P=0.128

P=0.138

P=0.069

P=0.095

P=0.180

P=0.108

(12,6,187) (13,21,183)

(16,25,198)

(18,21,221) (11,20,218)

(14,14,220)

(18,16,128) (28,17,123)

8-Leaf MGMM Tree for Lena

Figure 3: Shemati of the 8 leaf MGMM representation of the 256 256 `Lena'

image. Population probabilities and mean vetors, in the order (x;y;z), shown for

(24)

() (d)

Figure4: (a)-(d)FirstfourlevelsoftheMGMMtree for256256`Lena',superposed

ontheorrespondingleastsquaresapproximations. Line thikness indiatestreelevel

and the ellipses show the ovariane of the Gaussianomponent ateahnode. Note,

(25)

() (d)

Figure5: (a)ClassiationofLena imagefromMGMMrepresentationwith 8leaves.

(b) MGMMtree orrespondingto(a), asinpreviousgure;() Reonstrution using

(26)

50

100

150

200

250

0.005

0.01

0.015

0.02

50

100

150

200

250

200

400

600

800 1000

(a) (b)

Figure 6: (a) Comparison of histogram with MGMM approximation to probability

(27)

() (d)

Figure7: (a)Frame15ofMissAmeriasequeneand(b)reonstrutionfromMGMM

tree using 66 lasses.() Original frame 16 and (d) reonstrution based on moved

(28)

() (d)

Figure 8: (a) Left image of stereo pair, (b) right image. () MGMM approximation

of disparity image from (a) and (b). (d) MGMM approximation superposed on left