Original citation:
Wilson, Roland, 1949- (2000) Multiresolution Gaussian mixture models : theory and
application
s. University of Warwick. Department of Computer Science. (Department of
Computer Science Research Report). (Unpublished) CS-RR-371
Permanent WRAP url:
http://wrap.warwick.ac.uk/61131
Copyright and reuse:
The Warwick Research Archive Portal (WRAP) makes this work by researchers of the
University of Warwick available open access under the following conditions. Copyright ©
and all moral rights to the version of the paper presented here belong to the individual
author(s) and/or other copyright owners. To the extent reasonable and practicable the
material made available in WRAP has been checked for eligibility before being made
available.
Copies of full items can be used for personal research or study, educational, or
not-for-profit purposes without prior permission or charge. Provided that the authors, title and
full bibliographic details are credited, a hyperlink and/or URL is given for the original
metadata page and the content is not changed in any way.
A note on versions:
and Appliations
Roland Wilson
February 28, 2000
1 Introdution
Multiresolution image representations have been suessfully applied to many
prob-lemsinimageanalysis [26,3,11,5℄beauseoftheirabilitytotradeospatial
resolu-tionagainstfrequeny domainresolutionandtheirsymmetry withrespet to
transla-tionsand dilations,twoof the mostimportantsouresof variationinimages[4℄. Not
every multiresolutionrepresentationis equallyeetiveinthis regard: shemes using
dyadi deimation, suh aspyramids and orthonormal wavelets, sarie translation
invariane to redue redundany [6, 25, 12℄. While this is desirable in image
om-pression, it is not ideal for problems in omputer vision, where ompatness is less
importantthanutilityinarangeofproblems-fromsegmentationtomotionanalysis.
In those ases, itis the ombination of statistialinferene, usually from inomplete
(eg. projeted) data and symmetry undermotion whih are most signiant.
Another interesting development in reent years has been the use of Gaussian
Mixture models to ope with statistial problems for whih no simple parametri
model exists [20, 27℄. While it is well known that algorithms suh as
Expetation-Maximisation an lead to eetive approximations in terms of a nite number of
omponents, the general problemof mixture modelling is diÆult when the number
of omponents isunknown [13, 17℄.
This paperdesribes anew multiresolutionrepresentation whih taklesthe
mix-ture modelling problem head-on. The general approah shares important features
withthe Classiationand RegressionTrees (CART)system of[9℄and itsderivatives
[17℄. Itmayalsobeseenasageneralisationofsale-spae[26℄,inthatitusesGaussian
funtions and inludes spatial o-ordinates,but unlikea onventional basis set, they
are adapted to the data and used statistially. Moreover, they are dened ina spae
whose dimension reets the inferene problem,not simplythe image data. Thusin
dealing with olour images, a 5 D spae is required (two spatial and three olour
dimensions);for inferring3 D struturefrom motion,typiallyninedimensions are
required (three spatial, three olour and three motion axes). Yet MGMM has no
diÆulty inprinipleinmovingseamlesslybetweenthese spaes. The next setionof
the paper outlines the theory underlying MGMM as a method of approximating an
Thisisfollowedby apresentationof somesimpleexperiments,whihillustratehowit
maybeused inimportant`early vision'problems: segmentationandmotionanalysis.
The paper is onluded with some remarks on the impliations and potential of the
Three main elements ditate the form of representation alled MGMM: ability to
approximate any probabilitydensityina spaeof arbitrarydimension; losureunder
aÆne motionsand amultiresolutionstruture, whih an be used tomake
omputa-tion eÆient.
Before outlining the theory of MGMM, it may be useful to explain the need for
suh anapproah. It is widely reognised that statistialmethodsare akey element
in vision and image analysis eg. [7, 22℄. While Gaussian models work adequately
in many ases, more general distributions pose onsiderable diÆulties. The rst is
simplyestimatingthedensity,aproblemwhihhasbesetstatistiiansformaydeades
[23,24℄. Therearetwowidelyusedalternatives: non-parametri,orkernelestimation
and parametri estimation. Kernel methods are a generalisation of histogramming,
in whih, ratherthan use anon-overlapping set of intervals,the density is estimated
usingasmoothkernelto`ironout'theutuationswhihsamplinginevitablyauses.
Ineet,thisrepresentsaonvolutionwhihisthestatistialanalogueofanti-aliasing:
itiswellknownthatrawhistogramsarewoefullyinadequateasdensityestimates[23℄.
The alternative, whih is learly preferable if there is suÆient prior informationor
knowledge about the proesses produing the data, is to use a parametri model. In
reent years, this has been extended through the use of Bayesian methods for model
seletion and the estimation of so-alled `hyper-parameters', whih haraterise the
model, eg. [19, 15℄, but while useful inpartiular problems, these hardly represent a
generalapproahtothe infereneproblem. It seemsthat whatisneeded isamethod
whih ombines the generality of kernel estimation with the power of parametri
methods. This isone of the key motivations behind the MGMM approah.
To dene the MGMM approah ina general way, westart by observing that any
probability density funtion an be approximated to an arbitrary preision by a set
of Gaussian funtions: it is well known that the Gaussian funtions are a omplete
set on L 2
(R n
). However, we want to make a stronger laim - namely that an L 1
approximation of anarbitrarydensity funtionneed only involvepositive oeÆients
in the expansion. To this end, we state the theorem:
Theorem 1: Let f(:) : R n
! R be any nonnegative real integrable funtion on R n
with Z
R n
dxf(x)=1 (1)
Then for any Æ > 0 there exists an approximation of f(:) by a stritly positive sum
of Gaussianfuntions of the form
^ f(x)=
X
i f
i g
i
(x
i
) (2)
of means i
and ovarianes i
, suhthat f i
>0;8iand
Æ > Z
R n
dxjf(x) ^
f(x)j (3)
Proof: Westartinonedimensionandthenextend theresulttoR n
,primarilyfor
X X (x) = 1 2
if x2X
0 else
(4)
wherethe set X isthe interval<x. Consider theonvolutionof X
(:)with the
Gaussianfuntion g
(x) of variane 2 g ? X (x)= Z X dyg
(y x) (5)
whihis easilywritten as
g
?(x)= 1 4 [erf( +x p 2 ) erf( x p 2 )℄ (6)
where erf(:) is the error funtion [1℄. If ==M, then for any Æ > 0, there exists a
value of M suh that Z dx j X (x) g =M ? X (x)j < Z X M dx X
(x)+ (7)
Z
X X M
dxexp[ M℄+ Z R X dx g =M ? X (x) < 2 M
+2exp [ M℄ < Æ
2
where X M
is the subset of R for whih jxj < 1 M
. Now the onvolution, whih is
smooth, an be approximated by asum
g ? X (x) X i f i g (x 2 k i) (8)
where k is hosen suh that there is an integral absolute approximation error less
than Æ 2 Z dxjg ? X (x) X i f i g (x 2 k )j< Æ 2 (9) where f i
> 0;8i. But this is the Riemann integral theorem [8℄: one simply uses for
f i
the saled values of X (2 k i) f i =2 k X (2 k i) (10)
Then for some k, the sum, whih is absolutely onvergent, willgive anerror smaller
than Æ=2.
We have thus proved that the funtion X
(:) an be approximated to arbitrary
preisionby anite set of Gaussian funtions. Toomplete the proof,we note that
X
i f
i
!1 (11)
whih follows from the uniform sampling of X
(:), the saling of X
(:), (4) and the
onvergene of the sum. Thus if X
density an be approximated by a mixture of funtions of the form X
( x y
i
i
): these
simplefuntionsare the means by whih the Lebesgue integralis dened [8℄and any
probability distributionis measurable.
The extension to R n
is trivial: one simply replaes the 1 d elements by the
orresponding Cartesian separable, produt form and the proof goes through in the
same way. This ompletes the proof.
In order to use this result, we need an eÆient way of deiding how many
om-ponents to use in modelling an arbitrary density, or more realistially, data drawn
from anarbitrary density. Tothis end, onsider the following simpledeision: given
a set of data, is it better modelled with one Gaussian omponent or two? This is
a deision whih annot be made on the basis of likelihood alone: it would always
give preferene to the more omplex desription. The inrease in likelihoodmust be
balaned against the ost interms of omplexityof using two omponentsinstead of
one.
This type of problem has reeived muh attention in the literature on
lassia-tion [9,17, 19℄, and in system identiation, in whih modelorder has tobe hosen
[10℄. Inboth areas,widespreaduse hasbeen madeof Akaike's InformationCriterion
(AIC), whih is derived from the Kullbak-Liebler Information Divergene between
two densities:
d(f;g)= Z
R n
dxf(x)ln[ f(x)
g(x)
℄ (12)
A widelyused alternativetoAIC isthe MinimumDesriptionLength(MDL)
ri-terion,inwhihanestimateismadeofthe `odingost'oftheparametridesription
of anobjet [18℄. Now, in many ases, quitewhat the MDL of anobjet mightbeis
unlear, sineitwould appear todependonthehoieofdesriptionlanguageandon
the density indued onstrings from that languageby the objets of interest. In the
present ontext, muh greater preision is available, sine not only the parameters,
but inprinipletheirpriorandposteriordensitiesanbeestimated. Lastly,Bayesian
Evidene (BE), ie. the posterior P(MjY) supporting a model M amongst a set of
alternatives, has alsobeen propounded [17, 15℄as the best way of seletinga model.
In the ontext of seletingeither one or two Gaussianomponents to modela set of
data, this involves penalising the two-omponent model though the prior. It is not
so lear, however, just how steep the penalty should be; at least in the AIC and in
the presentase inMDL, this questionissettledbydenition. In general,evaluation
of these riteria is itself a soure of onsiderable diÆulty, with resort usually being
madetosimple approximations,eg. based onsaddle-pointapproximationsorsample
averagingover anappropriateset ofsimulationepohs. In the present ase, one the
data are lassied,anexat solutionispossible. Classiationofthe dataisahieved
by Gibbs sampling.
Beause the Bayesian paradigm is an appropriate one in many appliations, let
us start by onsidering anevidential framework for the problemof `one-or-two'. We
an basethe deisionbetween H 1
: one omponent andH 2
R B = P(H 1 )P(XjH 1 ) P(H 2 )P(XjH 2 ) (13)
whereP(H 1
)=1 P(H 2
)isthepriorprobabilityofasingleomponentandP(XjH i
)
is the evidenefor H i
: the integral overthe parameter density of the likelihood
P(XjH i )= Z R m
d f(Xj;H i
)f(jH i
) (14)
Wean use the results of the appendix to ompute the BayesFatorfor twoagainst
one omponents. In the following, unsubsripted quantities referto the unsplit data,
whilesubsripts 1;2refertothe two omponentsofthe putativemixture
approxima-tion. Thusthe Bayes Fator an be writtenfrom (40) as
BF = jA 1 j d 2 jA 2 j d 2 jA j d 2 jA 1 j d 1 2 jA 2 j d 2 2 jAj d 2 n Y i=1 ( d 1 +1 i 2 ) ( d 2 +1 i 2 ) ( d +1 i 2 ) ( d+1 i 2 ) (15)
In(15), quantities witha
denoteposteriorand withoutitdenotethe orresponding
prior parameters. Thus, Aisthe prior ovariane parameterfor asingle omponent,
while A 1
is the posterior ovariane parameter for the rst of the two omponents.
The sample size, N, is hidden in the parameters d
= d+N;d i
= d+N i
;i = 1;2.
A onsiderable simpliation of the above ours if we make the following hoies:
A i
=A= d d
A
. Inthis ase, the log-BayesFator redues to
LBF = d +d 2 lnjA j d 1 2 lnjA 1 j d 2 2 lnjA 2 j+ nd 2 ln d d + n X i=1 ln ( d 1
+1 i
2 )+ X i ln ( d 2
+1 i
2 ) X i ln ( d
+1 i
2 )
X
i ln (
d+1 i
2
) (16)
AIC is based onlikelihoodsand an be written using(38) as
AIC = N 2 lnjSj N 1 2 lnjS 1 j N 2 2 lnjS 2
j n(n+3) (17)
Comparing (16) and (17), there is an obvious similarity in the dependene on the
data, exept for the use of priors in the former. Moreover, in the LBF, the `penalty'
term inreases with both the dimension of the problemand the samplesize, whereas
the AIC penalty depends onlyonthe dimension. Notethatalthoughitis possibleto
use Bayesian estimates inAIC orMLestimates inthe BF, itmakeslittlesensetodo
so.
Figures 1(a)-(d) show an interesting feature of the two. All four gures show
ontourplots ofthe AICorLBFguresforatest oftwo2 DGaussianomponents
against one, as a funtion of separation either in sale or position (horizontal axis)
been notedby others, itdoes appear that AIC is more supportiveof splittingas the
population inreases; on the other hand, for very small populations, it is the LBF
whihismoreinlinedtosplit,ieispositive. ThistendenyofLBFtofavoursplitting
of smallpopulations is not a good one and shows that neither riterion dsiplays all
of the features one would like, even in suh a simple ase as a two-vs-one deision.
Nevertheless, the LBF has lear advantages, as long as the population size is large
enoughto give meaningfulresults.
A dierentperspetive anbe given totheBFif werewriteitusingBayes's Rule,
viz.
LBF =ln
P(Xj 1 ; 2 ) P(Xj) ln P( 1 ; 2 jX) P( 1 ; 2 ) +ln
P(jX)
P()
(18)
While the rst term represents the gain in likelihood from using two omponents
instead of one, the remainder represents the ost of this, in terms of the mutual
informationbetween parameters and data (f. Appendix I, (41)).
2.1 Eets of Motion
The other majorproperty,a ruial one for motion analysis,is the losureof the set
of n D Gaussian funtionsG n
underaÆne maps A:R n
7!R n
Ax=Lx+a (19)
where L is an invertible matrix and a a translation. Again, it is obvious that the
ation of AonG n
islosed, sine
g
(A 1
(x ))=g A (x A ) (20) where A
= a (21)
and A =L T L (22)
But nowweare ina position toprovea rather interesting result, summarised as
Theorem 2: Let f(:) 0 be an integrable funtion f : R n
! R , as above and let
t(:) : R n
! R n
be a smooth, invertible map from R n
to itself, having bounded rst
and seond derivatives. Then for any Æ >0, f has an approximation as a Gaussian
mixtureof theformof (2),with integralabsoluteerrornogreaterthanÆ=2and there
exists an approximation of the transformed funtion Tf
Tf(x)=f(t 1
(x)) (23)
of the same form,where eah Gaussianomponent g
i
istransformed aording toa
ture of the MGMM approximation. First, we observe that any smooth ow an be
approximated using
t (x)=t x 0
+r x
(t)(x x 0
)+O(kxk 2
) (24)
viaTaylor'stheorem, wherer x
(:)isthe gradient. Now,letg
(x )beanarbitrary
GaussianfuntiononR n
. Thenitfollowsfrom(24)thatthereexistsasaleparameter
(Æ)>0 suh that
Z
dxjg (Æ)
(t(x ) g
(Æ)
(t()+r x
(t )(x ))j< Æ
2
(25)
We an ahieve this by hoosing (Æ) > 0 so that the error (x) - the integrand in
(25) - satises
1. (x) < Æ 4
;if kx t()k < r. This follows beause the error in the Taylor
expansionfor t isof order 2
(Æ) if r=M(Æ).
2. R kxk>r dxj(x)j < R kxk>r g (Æ)
(t(x )+g (Æ)
(t ()+r x
(t )(x )) Æ 4 .
This follows from the observation that the tails of the two Gaussians an be
made of arbitrarilylow weightby asimilar hoie of r.
Putting the two errors together, we get a total error
Z
dxjg (Æ)
(t(x ) g
(Æ)
(t ()+r x
(t)(x ))j< Æ 4 + Æ 4 = Æ 2 (26)
Sine this appliestoanarbitrary Gaussian,apply itto eah omponentin amixture
representation of f(t(x))
f(t(x))= X i f i g i (x i ) (27)
for whih (i)the absolute integral erroris less than Æ=2and (ii)eah omponent has
avarianesmallenoughthat itstransformationby t introduesanerror nobigger
than Æ=2. Suh a representation exists, by Theorem 1. But then the error in the
representation of f(x) by the aÆne approximation of t is no greater than Æ. This
Thekeytoapplyingtheseideasistouse asequentialapproah,whihleadstoa
mul-tiresolution tree struture: Multiresolution Gaussian Mixture Modelling (MGMM).
We start from the natural assumption that we do not know the density of interest,
but wish to estimate it from some data, suh as an image or set of images. If these
data are denoted X i
;1 i n, we an ompute the sample mean and ovariane
[21℄ and thene inferasingle multivariateGaussianmodelg 0
(x
0
),where 0
; 0
are the sample (Maximum Likelihood) estimates for the data. Now, in rare ases
this will model the data adequately. If not, then we split the data into two parts
and model eah with a Gaussian. This an be done using the Markov Chain Monte
Carlo (MCMC) sampling tehnique desribed in [19℄, whih treats the inferene as
oneontaininghiddenvariables,namelythelassZ i
towhiheahdatumbelongsand
samplesfromthe posteriorsforthepopulationsize,meansandovarianes, assuming
onjugate priors, whose parameters are simply those of the population as a whole.
Thus the prior for the means of the two lasses are Gaussian, while the ovarianes
are drawn from an inverse Wishart density: a Normal-Inverse Wishart (NIW)
den-sity. Beause the NIW is the naturalonjugate prior for this problem, the posterior
isalsoNIWand soaGibbssampleran beused [19,14℄. Samplingfor(i)thehidden
variablesand(ii)theorrespondingsub-populationdensitiesgivesthe twoGaussians,
based on the posterior estimates from the sampler for g
1j
(x
1j
);i2 [0;1℄. This
Gibbs sampler is desribed fully in Appendix II. Clearly, we have the basis of a
re-ursionhere: we an builda tree representation, in whih eah leafis asubset of the
populationfound bysuessive binarysplitsfromtherootnode,whihrepresents the
wholepopulation. Beause theNIW isaonjugateprior,the sameomputationsan
be used atevery node in the tree:
1. Seletlass j and test itsnormaldensity approximation for`goodness-of-t'. If
the t isadequate, terminate, else
2. Splitlassj intotwoomponents: lassj1andlassj2,bysamplingthehidden
variables Z ji
;1 i N j
and thene obtaining a Bayesian estimate of the
lass means and ovarianes by sampling from the posteriors, given the prior
g
j
(x
j ).
This needsa riterionforsplitting. In the previoussetion, wereviewed twopopular
hoies: LBF and AIC, eitherof whih an be expressed in the present ase as
C =ln
f(Xj j1
; j2
)
f(Xj j
)
k(N j
;n) (28)
wherethe log-likelihoodratiois subjettoa `omplexity'penalty whih isafuntion
of the size of the populationat j and the dimensions of the spae. In the ontext of
the MGMM tree, however, the LBF given in(18) doesnot have the right form: it is
based ontheomparison of eitherasingle omponentortwo omponentsasa model
of the data. Any node in the tree, other than the root, is the result of a series of
estimates at node j are the obvious hyperparameters for the two resulting lasses.
Indeed, if we wish to store or transmit the parameters j1
; j2
, we would need the
values of j
also, sine these appear in the prior. Finally, in keeping with its roots
in rate-distortion theory [2℄, we add a rate ontrol parameter , whih allows us to
trade-othe informationinagiven MGMMtree againstitsauray. Ifwe take this
into aount,we replae (18) by the Minimum Information Criterion (MIC):
MIC()=ln
f(Xj j1
; j2
)
f(Xj j
)
ln f(
j1 ;
j2 jX;
j )
f( j1
; j2
j j
)
(29)
Anotherwayofviewingthe onstantisthatitaountsforthepriorP(H 1
)in(13).
If we do not wish to take this ommuniation-oriented approah, then setting =1
amounts to setting P(H 1
)=0:5, ie using the LBF.Although the aimof MIC might
sound similar to Rissanen's MDL, two observations might be made about this: (i)
the aimhereisto providea trade-obetween the informationost ofthe parameters
andtheinformationgainedbyprovidingabetterttothedata;(ii)intheNIW ase,
these riteria have suh losely related forms that they all amount to some form of
Bayes Evidene, leavingspae for one's prejudie inthe deniton of the prior.
Weonludebyobservingthatanalternativeviewofthe MGMMdesriptionisas
apathwork ofaÆne models,eahleaf nodebeingthe result ofalinear regressionon
the data [9℄. For example, inthe ase of a grey level image,the MGMM desription
gives for eah lass a Gaussian model, whih is diretly related to a least-squares
approximation of the form
z i
(x)=A i
(x x
i )+z
i0 +
i
(x) (30)
where z i
(:)isthe grey levelas afuntion of the spatialoordinatex forthe ith lass
and i
(:) is the residual. The matrix A i
is easily found from the ovariane matrix
i
forthat lass and x i
;z i0
Therstsetofexperimentsisdesignedtotestthesamplerondatatakenfromaknown
density: a pair of 2 D normals, with varying distane between means, ovariane
and populations:
1. 40pointsdrawnfromanormal T
=(0;0);=I densityand40fromanormal
T
=(4;0);=I density.
2. First omponentas in rst ase and seond shifted to T
=(2;0).
3. First omponentas above and seond shifted to T
=(0;0),with =16I.
4. First omponentas above and seond shifted to T
=(0;0),with =4I.
In addition, subsets onsisting of (40;20), (40;10) and (20;20) data from the two
samples respetively were used, giving12ombinationsinall. The data ineah ase
were obtained from the same set, by simply translating and saling the seond set
appropriately. Thiswasdonetoensurethatthevariationsseeninresultswereaused
by thesamplingalgorithm. The samplerusedthesamplemean andovarianeof the
whole population as prior parameters for the NIW distribution, with the
hyperpa-rameters settotheirminimumvalues, ie. =1;d=n+2in(31). The errorratesfor
the hidden variablelassiation taken from 10independent runs withdierent
sim-ulationseedsare shown inTable4. Thelassiationusedthe samplemeanfromthe
last100 iterationsof thesampler. Investigations showed that thesamplerhadsettled
intoastationarydistributionbythen. Althoughthisisasmallnumberofsteps,these
problems are omparativelywellposed, espeiallyinthe rsttwoases.
Correspond-ingly, the error rates for these two ases are smalland the estimated parameters are
lose totheirtrue values ineverysimulation. Althoughhigherror rates are observed
for the onentri ases, itshould be obviousthat insuh ases, lassiation of the
entralpointsineitherase isessentially arbitrary. Itwasnoted thatinvirtually
ev-eryase, thesamplersetttledinastateorrespondingtotwoomponentsofdiering
variane, loated approximately onentrially. Figures2(a)-(d) show typialresults
fromthe simulation.
Notethat inthese examplesand inthose thatfollow, topreventsingularities and
toreetthenitepreisionoftheimage measurements,eahdatumonsistsofboth
a loation and a variane parameter. Correspondingly, the sample ovariane has to
be modied totake aount of this.
The method has also been tested on some more realisti appliations in image
analysis. In all the ases desribed below, the MGMM algorithmwas limited to 50
iterations and typially onverged in 10 20. The Bayes Fator was used as the
riterionfor splittingaomponent. Therst results showthe segmentation of awell
known image - Lena - using the MGMM approah. In this ase, the data are 3 d:
two spatial dimensions and one for intensity. The MGMM tree for this 256 256
pixel image, produed using the Gibbs Sampler,are shown shematiallyin gure 5
and are superposed on the orresponding least squares approximations in gure 4,
40,40 0.0213 0.212 0.341 0.411
40,20 0.050 0.138 0.273 0.395
40,10 0.018 0.120 0.222 0.378
20,20 0.073 0.20 0.338 0.448
Table 1: Misslassiation rates for samples of sizes shown in leftmost olumn from
two normaldensities aslisted intext. The lassiationused the sample meansover
the last 100 of 200 iterationsof the sampler.
g. 5,thesamplemeanvetor,intheorder(x;y;z)andthepopulationprobabilityare
shown next to eah leaf vertex in the tree. Figures5(a)-(d) show the representation
of the image by 8 regions, eah orresponding to a leaf of the MGMM tree. The
lassiation 5(a) is just the MAP lassiation of pixels based on all3 oordinates
(spatial and intensity), while g. 5(b) shows the MGMM tree nodes and the ellipses
orrespondingtothespatialovarianeof theGaussianomponentateahleafnode.
The lines indiate the tree struture, with thikness indiating height in the tree:
the thikest lines are those from the root; the thinnest those to the lowest level in
this tree - level 4, numbering from 0 at the root. In other words, the tree is not
espeially balaned, nor should it be, if it is properly adapted to the data. The two
reonstrutions use the least-squares approximation based on the estimated mean
and ovariane assoiated with that lass at that pixel, as in (30). The left image,
(), shows the reonstrution using only spatial information, using a `soft' deision:
eah pixel is treated as a mixture, with weights given by the relative magnitudes of
the 8 Gaussian omponents at that position. In eet, this is making an inferene
from 2 D to 3 D using the MGMM representation, sine grey level is ignored
in lassifyingthe pixels. The reonstrution signal-noise ratio for this ase is15:1dB
(peak-rms). Althoughthis isa poorreonstrution visually,itrepresents anextreme
pauity of information: only8sets of 3 DGaussianparametersare required,ie. 72
values. Ontheotherhand,whenafullMAPlassiationisperformedusingall3
o-ordinates,thereonstrutionSNRlimbsto24:3dBandindeedthevisualappearane
issurprisinglygood. Figure6(a)shows the approximation ofthe probablilitydensity
based on the MGMM and extrating the marginal density for the intensity on level
3 of a gray level pyramid, whih has just 1024 pixels. This is a trivialomputation
from the MGMM representation, whih gives a better approximation of the density
than an be obtained by simple histogramming. For omparison, the histogram of
the 256256 pixelimage is shown ingure 6(b).
The seondillustrationuses twoframesof the`MissAmeria'sequene inthe
seg-mentation,whihisbasedonthe5 ddataonsistingofspatialo-ordinates,intensity
and 2 d motion. The raw motion estimates were obtained using a multiresolution
orrelation algorithm, again from 256 256 pixel images. In this example, a tree
desription with 66leaveswas obtained fromframe15of the sequene and the 2 d
motions. The aÆne motionsextrated by least squares (f. (30) from the ovariane
15-16,gures7(a)-(d)showthatthe MGMMapproahandeal easilywith non-rigid
motions,thepeak-rmssignal-noiseratio(PSNR)being20:0dBforthisreonstrution,
as itis for the originalframe.
As anal example,the twoimagesof abreakfast tableinFig. 8(a)-(b)were used
in the same multiresolution ross-orrelation algorithm to ompute a stereosopi
disparitymap. ThiswasinputtotheMGMMtreealgorithmas4 Ddata: spatial
o-ordinates,greylevelandhorizontaldisparity. Theoriginalimageswerealso256256
pixels, but 6464 images were input to the MGMM algorithm. Reonstrution of
the disparity also used all 4 dimensions in lassifying eah pixel, the depths of the
lassiedpixelsbeingestimatedusing theleastsquares approximationbasedontheir
spatialposition,asin(30). Theresultingapproximationisshowningure8()andis
superimposed on the left imagein gure 8(d), showing areasonable orrelation with
the expeted heights of the larger objets, suh as milk and ereal pakets. There
were 150 leaves in the MGMM tree for this example, due to the omplexity of the
Inthispaper,anovelandversatilestatistialimagerepresentationhasbeenpresented.
The theory of MGMM as a generalform of density approximation was outlined and
its use in desribing images and sequenes illustrated. The key properties whih
mark MGMM out as a representation are as follows: although it is a statistial
model, it inorporates spatial relationships; it is `auto-saling', ie. it lassies data
based on their likelihood, rather than on simple Eulidean distane; beause it is
multiresolution,itallowseÆientomputation;beauseofthelosureoftheGaussian
funtions under aÆne motions, it deals diretly with the problem of image motion.
ThisombinationofpropertiesmakesMGMMauniqueapproahtoimagemodelling.
It should be lear that the tree struture of MGMM implies that as an
approxi-mation of an arbitrary density with a xed number of omponents, it is most likely
not optimal: tree searhes have their limitations. But this is not the real point of
MGMM: what MGMM oers is a way of approximatingan arbitrary distribution to
an arbitrary preision in a omputationally eÆient way. In this respet too it is
unique.
The experiments onsyntheti data show that a samplingapproahto estimation
of the Gaussianparametersateah stagean be omputationallyeetive,while the
results on real image data illustrate that aurate reonstrutions are possible with
veryompatMGMMrepresentations-even inthemostomplexofthe above
exam-ples,only125lasseswere needed for256256 imagedata. Alloftheexamplesshow
that MAP estimation ombined with MGMM is an eetive tool in image analysis
and vision.
Of ourse, this report represents preliminary work, whih requires further
devel-opment, both theoretial and pratial. For example, the work on motion needs to
be extended totake properaount oftemporalstruture and thisin turnrequires a
proper theoretial basis. It is lear that none of the seletion riteria is ideal in all
ases. Finally,the essentialproblem invisionof movingfroma 2 D representation
to a 3 D one has not been takledyet. These are allunder ative investigationat
the time of this writing.
Aknowledgement
This work was supported by the UK EPSRC.
Referenes
[1℄ M. Abramowitz and I. A. Stegun. Handbook of Mathematial Funtions. Dover,
New York, 1965.
[2℄ T.Berger. Rate-DistortionTheory: aMathematialBasisforDataCompression.
Code. IEEE Trans. Commun., COM(31):532{540, 1983.
[4℄ I. Daubehies. Orthogonal Basesof CompatlySupported Wavelets. Comm.on
Pure and Appl. Math., XLI:909{996, 1988.
[5℄ I.Daubehies. The Wavelet Transform,Time-FrequenyLoalisationandSignal
Analysis. IEEE Trans. Information Theory, 36:961{1005,1990.
[6℄ E. H. Adelson E. P. Simonelli, W. T. Freeman and D. J. Heeger. Shiftable
Multisale Transforms. IEEE Trans.Information Theory, 38:587{607,1992.
[7℄ A. K. Jain. Fundamentals of Digital Image Proessing. Englewood Clis, New
Jersey, 1989.
[8℄ A. N. Kolmogorovand S.V. Fomin. Introdutory Real Analysis. Dover, 1975.
[9℄ R. A. Ohlsen L. Breiman, J. H. Friedman and C. J. Stone. Classiation and
Regression Trees. Wadsworth, 1984.
[10℄ L. J. Ljung. System Identiation: Theory for the User. Englewood Clis,
Prentie-Hall, 1987.
[11℄ S.G. Mallat. A Theory forMultiresolution SignalDeomposition: The Wavelet
Representation. IEEE Trans. Patt. Anal. Mahine Intell., 11(7):674{693, July
1989.
[12℄ S. G. Mallat. Multifrequeny Channel Deompositions of Images and Wavelet
Models. IEEE Trans. Aous. Speeh Sig. Pro., 37(12):2091{2110, Deember
1989.
[13℄ G. J. MLahlan. Mixture Models: Inferene and Appliations to Inferene and
Clustering. New York, M. Dekker, 1988.
[14℄ A. O'Hagan. Kendall's Advaned Theory of Statistis,vol. 2B. London,Edward
Arnold, 1994.
[15℄ J.J.K.O'RuanaidhandW.J.Fitzgerald. NumerialBayesian Methods Applied
to Signal Proessing. New York, Springer-Verlag, 1996.
[16℄ S. J.Press. Applied Multivariate Analysis. Florida,Robert E.Krieger, 1982.
[17℄ B. D. Ripley. Pattern Reognition and Neural Networks. Cambridge U.P., 1996.
[18℄ J. Rissanen. Stohasti Complexity. Jnl. Royal Stat. So., B, 49:223{239, 1987.
[19℄ C.P.Robert.MixturesofDistributions: InfereneandEstimation.InS.
Rihard-son W. R. Gilksand D. J. Spiegelhalter,editors,Markov Chain Monte Carlo in
sian Mixture Modelling. IEEE Trans. Patt.Anal. Mahine Intell.,20(11):1133{
1141, 1998.
[21℄ G. A.F. Seber. Multivariate Observations. New York,Wiley, 1984.
[22℄ R. Szeliski. Bayesian Modeling of Unertainty in Low-level Vision. Boston,
Kluwer, 1989.
[23℄ R. A. Tapia and J. R. Thompson. Non-parametri Probability Density
Estima-tion. Baltimore,Johns Hopkins Pr., 1978.
[24℄ G. S. Watson and M. R. Leadbetter. On the Estimation of the Probability
Density. In Ann. Math. Stat., volume 34, pages 480{491,1962.
[25℄ R.Wilson,A.Calway,andE.R.S.Pearson. AGeneralizedWaveletTransformfor
FourierAnalysis: The MultiresolutionFourierTransform and ItsAppliationto
Image and Audio SignalAnalysis. IEEE Trans.Information Theory,38(2):674{
690, Marh 1992.
[26℄ A. Witkin. Sale-Spae Filtering. In Pro. of IEEE ICASSP-84, 1984.
[27℄ K. Palaniappan X. Zhuang, Y. Huang and Y. Zhao. Gaussian Mixture Density
Distribution
Inusingthe MVNdistributionasauniversalapproximator,itispossibletomakeuse
of the properties of this widely studied distribution. The development here follows
that in [14℄ and [16℄. A Bayesian analysis starts from the hoie of priors and in
the present appliation, it is obvious to use the so alled natural onjugate prior
distribution: theNormal-Inverse-Wishart(NIW)distribution,formultivariatenormal
data with unknown mean and ovariane. This is dened for n D data as
f(;)=k 1 jAj d 2 jjj d+n+2 2 exp [ 2 ( a) T 1 (( a) 1 2 tr( 1 A)℄ (31)
or, briey f(;) = NIW(A;d;a;), a funtion of the parameters d > n+1, the
degrees of freedom, , the prior weight on the mean, a, the prior mean and A, the
prior ovariane. Inreasing and d attahes relativelymore weight to the prior, for
a given sample size N. The normalising onstant k is given by
k=2 n(d+1) 2 n(n+1) 4 n Y i=1 (
d+1 i
2
) (32)
The likelihood for the data X i
;1 i N, is then simply a produt of normals,
assumingthat data are independently sampled,ie.
f(Xj;) =(2) N 2 jj N 2 exp [ 1 2 N X i=1 (X i ) T 1 (X i )℄ (33)
and beause the NIW isthe natural onjugateprior, the posterior isalso NIW, with
parameters
a
=
a+N X +N (34) where X =N
1 P
i X
i
isthe samplemean and
A
=A+NS+ N +N (a X)(a X) T (35)
where S =N 1 P i (X i X)((X i X) T
is the sample ovariane. It is not hard to
see that in the posterior, the salar parameters are
=+N;d
=d+N.
The lassial analysisof MVN data is basedonthe likelihood(33), fromwhihit
is easyto nd the ML estimates
^ = X (36) and ^
=S (37)
Evaluating the likelihoodatthese estimates gives
max ;
a ; ~ = 1 d A
f(Xj;~ ~
) =j2A
j N 2 d nN 2 exp[ d 2 (n trA 1 A)℄ (39)
The above results an be used to evaluate the Bayes evidene for one omponent in
the mixture, by integrating over the parameters
f(Y)=
jAj d 2 nN 2 jA j d 2 n Y i=1 ( d +1 i 2 ) ( d+1 i 2 ) (40)
The lastquantity ofinterest isthe mutual informationbetween the data and
param-eters, whihis given by the ratio
i(Y;;) =ln
f(;jY)
f(;)
(41)
The expetation of this quantity represents the amount of information whih, on
average, the parameters yield about the data, or vie versa. Sine both prior and
posterior are NIW, it isstraightforward toevaluate (41)
i(Y;;) = d 2 lnjA j d 2 lnjAj N 2 lnjj nN 2
ln2+
X
i ln (
d
+1 i
2 )
X
i
ln (
d+1 i
2 ) 1 2 tr[ 1 ( ( a )( a ) T
( a)( a)
T +A
A)℄
with hidden variables
This samplerfollows exatlythe modelproposed by Robert [19℄. It uses NIW as the
priorforthe sampledata andaDirihletpriorforthe populationsize (experienehas
shown that this is not always helpful, however). Thus, given the estimate j
; j
for
the mean and ovarianeat anode j, the samplerperforms the following steps:
1. Sample ji
; ji
;i=1;2; NIW( j
; j
; i
;d i
).
2. Samplefor the populationsize using the Dirihlet distribution.
3. Samplethe hidden variables Z k
2[1;2℄;1k N j
using aGibbs sampler and
the urrent estimates of P; ji
; ji
;i=1;2.
4. Calulate theposteriorNIW parameters, basedonthe lassieddataand
(34)-(35).
This is repeated, starting with i
= i
=1;d i
=d i
=n+2, the minimum values, to
keep the prior vague.
It may be noted here that a deterministi algorithm, based on 2 means, but
using the log-likelihood rather than simple Eulidean distane, works adequately on
some simpler problems, for example where lasses are onvex. In general, however,
it islittleheaper toimplement2 means and muh morelikely tosettle in aloal
-4
-2
0
2
4
20
40
60
80
100
-4
-2
0
2
4
20
40
60
80
100
(a) (b)
-4
-2
0
2
4
20
40
60
80
100
-4
-2
0
2
4
20
40
60
80
100
() (d)
Figure 1: Contour plots of Akaike Information Criterion and Log-Bayes Fators for
two vs one normal omponents: (a) AIC as a funtion of dierene in means
(hor-izontal) and population size (vertial), (b) LBF as in (a), () AIC as a funtion of
dierene in sale, (d) LBF as in (). In all four ases, the proportions of the two
omponentsareequalandblakrepresentsnegativevalues,ieapreferene forasingle
-1
0
1
2
3
4
5
6
-3
-2
-1
0
1
2
3
4
-1
0
1
2
3
4
-2
-1
0
1
2
3
(a) (b)
-4
-2
0
2
4
6
-4
-2
0
2
4
6
-3
-2
-1
0
1
2
3
4
-3
-2
-1
0
1
2
3
4
() (d)
Figure2: ContourplotofestimatedGaussianomponentsfromapairof2-dnormals,
(a)havingmeans T 1
=(0;0); T 2
=(4;0)and ovarianes i
=I;i=1;2;(b)means
at T 1
= (0;0); T 2
= (2;0); () means at T i
= (0;0);i = 1;2 and 1
= 16 2
; (d)
1
=4
2
. Estimates obtained from 200 iterationsof the Gibbs sampler. The data
P=0.202
P=0.080
P=0.128
P=0.138
P=0.069
P=0.095
P=0.180
P=0.108
(12,6,187) (13,21,183)
(16,25,198)
(18,21,221) (11,20,218)
(14,14,220)
(18,16,128) (28,17,123)
8-Leaf MGMM Tree for Lena
Figure 3: Shemati of the 8 leaf MGMM representation of the 256 256 `Lena'
image. Population probabilities and mean vetors, in the order (x;y;z), shown for
() (d)
Figure4: (a)-(d)FirstfourlevelsoftheMGMMtree for256256`Lena',superposed
ontheorrespondingleastsquaresapproximations. Line thikness indiatestreelevel
and the ellipses show the ovariane of the Gaussianomponent ateahnode. Note,
() (d)
Figure5: (a)ClassiationofLena imagefromMGMMrepresentationwith 8leaves.
(b) MGMMtree orrespondingto(a), asinpreviousgure;() Reonstrution using
50
100
150
200
250
0.005
0.01
0.015
0.02
50
100
150
200
250
200
400
600
800
1000
(a) (b)
Figure 6: (a) Comparison of histogram with MGMM approximation to probability
() (d)
Figure7: (a)Frame15ofMissAmeriasequeneand(b)reonstrutionfromMGMM
tree using 66 lasses.() Original frame 16 and (d) reonstrution based on moved
() (d)
Figure 8: (a) Left image of stereo pair, (b) right image. () MGMM approximation
of disparity image from (a) and (b). (d) MGMM approximation superposed on left