GTM-based data visualisation with incomplete data


Yi Sun, Peter Tino, and Ian Nabney

Neural Computing Research Group, Aston University,

Aston Triangle, Birmingham B4 7ET

United Kingdom

{suny,tinop,nabneyit}@aston.ac.uk

http://www.ncrg.aston.ac.uk/

Abstract. We analyse how the Generative Topographic Mapping (GTM) can be modified to cope with missing values in the training data. Our approach is based on an Expectation-Maximisation (EM) method which estimates the parameters of the mixture components and at the same time deals with the missing values. We incorporate this algorithm into a hierarchical GTM. We verify the method on a toy data set (using a single GTM) and a realistic data set (using a hierarchical GTM). The results show that our algorithm can help to construct informative visualisation plots, even when some of the training points are corrupted with missing values.

1 Introduction

Data visualisation, which plays a key role in developing good models for large quantities of data, is an important aid in dimension reduction, gives information about local deviations in performance and provides a useful check for objective quantitative measures. However, in many applications the input data is incomplete. Therefore it is important to know how to use the available data and how to reconstruct the missing values. For example, in the pharmaceutical field, scientists use computer modelling to examine and analyse the molecular structure of compounds and high throughput screening to assess their interaction with biological targets. Many compounds are not screened against a complete set of targets, yet we do not want to exclude all such compounds from data analysis since that risks missing potential drugs.

The hierarchical generative topographic mapping (GTM) model is an interactive data visualisation technique, which enables the user to construct arbitrarily detailed projection plots. The basic building block is the GTM [1]. The problem considered here is to train the GTM model with incomplete data and reconstruct the missing values. This way the data, including the missing components, can be shown in a visualisation plot that is as "faithful" as possible. For hierarchical GTM, the incomplete data can be displayed at all levels of the hierarchy of visualisation plots.


set by using an EM algorithm. For visualisation purposes, the missing data is filled in by computing the posterior mean. In [2], the GTM was trained only with complete data, and an additional condition was added to reconstruct the missing data. In contrast, our algorithm is more generic.

Since our algorithm is based on Gaussian mixture models (GMM) and the EM algorithm, in section 2 we briefly introduce the EM algorithm for GMMs. The GTM with incomplete data algorithm is detailed in section 3. Section 4 gives a basic introduction to hierarchical GTM. We illustrate the algorithm in section 5 with a toy data set and a high-dimensional data set from flow diagnostics of an oil pipeline. Section 6 discusses the results.

2 The EM Algorithm for Gaussian Mixture Models

The EM algorithm is especially relevant since it is a general method for parameter estimation in mixture models that is based on the idea of filling in missing data. This section briefly introduces the algorithm for finding the maximum likelihood parameters of a Gaussian mixture model [3].

We consider a mixture density

$$P(t_n) = \sum_{j=1}^{K} P(t_n \mid j, \theta_j)\, P(j), \tag{1}$$

which is generating the (i.i.d.) data $T = \{t_n\}_{n=1}^{N}$. In this case each component of the mixture is denoted by $j$ and parametrised by $\theta_j$, and $P(j)$ is the prior probability for the mixture component $j$. Then the log likelihood of the parameters given the data set is

$$L(\theta) = \sum_{n=1}^{N} \log \sum_{j=1}^{K} P(t_n \mid j, \theta_j)\, P(j). \tag{2}$$

The binary indicator variables $z_{nj}$ are introduced to specify which component of the mixture generated the data point: $z_{nj} = 1$ if and only if $t_n$ is generated by component $j$, otherwise $z_{nj} = 0$. Then equation (2) can be re-written as the complete data log likelihood function:

$$L_c(\theta) = \sum_{n=1}^{N} \sum_{j=1}^{K} z_{nj} \log \left[ P(t_n \mid z_{nj}; \theta)\, P(z_{nj}; \theta) \right]. \tag{3}$$

Since $z_{nj}$ is not known, the expectation $E[z_{nj} \mid t_n, \theta_j]$ of $z_{nj}$ given the current parameter values $\theta_j$ is computed. This is the probability that the Gaussian $j$ generated the data point $t_n$ and is denoted by $r_{nj}$. This is the E-step of the EM algorithm:

$$r_{nj} = \frac{|\Sigma_j|^{-1/2} \exp\{-\frac{1}{2}(t_n - \mu_j)^T \Sigma_j^{-1} (t_n - \mu_j)\}\, P(j)}{\sum_{k=1}^{K} |\Sigma_k|^{-1/2} \exp\{-\frac{1}{2}(t_n - \mu_k)^T \Sigma_k^{-1} (t_n - \mu_k)\}\, P(k)}. \tag{4}$$


The means $\mu_j$ and covariance matrices $\Sigma_j$ of the $j$th component Gaussian are updated in the M-step using the data set weighted by the $r_{nj}$:

$$\mu_j^{t+1} = \frac{\sum_{n=1}^{N} r_{nj}\, t_n}{\sum_{n=1}^{N} r_{nj}} \tag{5}$$

$$\Sigma_j^{t+1} = \frac{\sum_{n=1}^{N} r_{nj}\, (t_n - \mu_j^{t+1})(t_n - \mu_j^{t+1})^T}{\sum_{n=1}^{N} r_{nj}} \tag{6}$$

The equations above are for full covariance matrices, but there are similar equations for other covariance structures.
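As an illustration of equations (4) to (6), the following is a minimal NumPy sketch of EM for a full-covariance Gaussian mixture. It is not the authors' implementation: the function name `em_gmm`, the optional initial means `mu0`, the small ridge term `reg` and the update of the priors $P(j)$ are assumptions added to make the example self-contained and numerically stable.

```python
import numpy as np

def em_gmm(T, K, n_iter=50, mu0=None, seed=0, reg=1e-6):
    """Fit a K-component full-covariance Gaussian mixture to T (N x D) by EM."""
    rng = np.random.default_rng(seed)
    N, D = T.shape
    # Initial means: user-supplied, otherwise K randomly chosen data points.
    if mu0 is None:
        mu = T[rng.choice(N, size=K, replace=False)].astype(float)
    else:
        mu = np.array(mu0, dtype=float)
    # Every component starts from the (regularised) covariance of the whole data set.
    Sigma = np.stack([np.cov(T.T).reshape(D, D) + reg * np.eye(D)] * K)
    prior = np.full(K, 1.0 / K)                      # P(j)
    for _ in range(n_iter):
        # E-step, equation (4), evaluated in the log domain for stability.
        log_r = np.empty((N, K))
        for j in range(K):
            diff = T - mu[j]
            maha = np.einsum('nd,de,ne->n', diff, np.linalg.inv(Sigma[j]), diff)
            _, logdet = np.linalg.slogdet(Sigma[j])
            log_r[:, j] = -0.5 * (logdet + maha) + np.log(prior[j])
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)            # responsibilities r_nj
        # M-step, equations (5) and (6).
        Nj = r.sum(axis=0)
        mu = (r.T @ T) / Nj[:, None]
        for j in range(K):
            diff = T - mu[j]
            Sigma[j] = (r[:, j, None] * diff).T @ diff / Nj[j] + reg * np.eye(D)
        prior = Nj / N                               # assumed standard prior update
    return mu, Sigma, prior, r
```

The factor $(2\pi)^{-D/2}$ is omitted in the E-step because it is common to all components and cancels in the normalisation.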

3 Generative Topographic Mapping and Incomplete Data

3.1 The Generative Topographic Mapping

The generative topographic mapping (GTM) [1] is a nonlinear latent variable model that uses latent (or hidden) variables to model a probability distribution in the data space. It is a constrained mixture of Gaussians whose parameters are optimised using the expectation-maximisation (EM) algorithm.

For the GTM, $t$ denotes the data in a $D$-dimensional Euclidean space and $x$ denotes the latent variables in an $L$-dimensional latent space. Considering a non-linear transformation from the latent space to the data space using a radial basis function network (see e.g. [4]), the latent data is mapped to data space by a radial basis function $y = W\phi(x)$ with weights $W$ and a basis function matrix $\phi$. The goal of the latent variable model is to find a representation for the distribution $p(t)$ in terms of a number $K$ of latent points $x_j$ ($j = 1, 2, \ldots, K$) and corresponding Gaussian distributions centred on $y(x_j; W)$ [1]. The data density is defined by

$$P(t \mid W, \beta) = \frac{1}{K} \sum_{j=1}^{K} P(t \mid x_j, W, \beta) \tag{7}$$

and

$$P(t \mid x_j, W, \beta) = \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\left\{ -\frac{\beta}{2} \| y(x_j; W) - t \|^2 \right\} \tag{8}$$

where $W$ and the inverse variance $\beta$ can be fitted by maximum likelihood with the EM algorithm.

The latent space representation of the point $t_n$, i.e. the projection of $t_n$, is taken to be the mean $\sum_{j=1}^{K} r_{nj} x_j$ of the posterior distribution on the latent space.
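The projection rule can be sketched in a few lines of NumPy. The function name `gtm_project` and its argument layout are hypothetical; the sketch assumes the mixture centres $y(x_j; W)$ have already been computed and passed in as the array `Y`.

```python
import numpy as np

def gtm_project(T, X, Y, beta):
    """Posterior-mean latent projection for a GTM.

    T: (N, D) data points; X: (K, L) latent points x_j;
    Y: (K, D) mixture centres y(x_j; W); beta: inverse noise variance.
    """
    # Squared distances ||y_j - t_n||^2, shape (N, K).
    d2 = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    # Responsibilities from equations (7) and (8): the equal priors 1/K and the
    # (beta / 2 pi)^(D/2) factor cancel when normalising over j.
    log_r = -0.5 * beta * d2
    log_r -= log_r.max(axis=1, keepdims=True)
    R = np.exp(log_r)
    R /= R.sum(axis=1, keepdims=True)
    # Projection of each t_n: sum_j r_nj x_j.
    return R @ X, R
```

A point lying exactly on one centre is projected (almost) onto the corresponding latent point, while a point equidistant from two centres is projected halfway between their latent points.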

3.2 Incorporating missing values into the EM algorithm for the GTM model

To handle missing values in the data set, we write data points $t_n$ as $(t^o_n, t^m_n)$, where the superscripts $m$ and $o$ denote subvectors and submatrices of the parameters matching the missing and observed components of the data. The EM algorithm treats both the indicator variables $z_{nj}$ and the missing inputs $t^m_n$ as hidden variables. For the GTM, as the covariance matrix is constrained to be isotropic, $\Sigma_j = \beta^{-1} I$, the covariance of missing and observed values $\Sigma^{mo}_j$ is equal to 0. The expected value in the E-step is taken with respect to both sets of hidden variables. If we knew the values of the indicator variables $z_{nj}$, we would write the negative log likelihood function as

$$L(W, \beta) = \sum_{n=1}^{N} \sum_{j=1}^{K} z_{nj} \left\{ \frac{D}{2}\ln(2\pi) - \frac{D}{2}\ln\beta + \frac{\beta}{2}\left[ \|t^o_n - y^o_j\|^2 + \|t^m_n - y^m_j\|^2 \right] \right\} \tag{9}$$

After taking the expectation, the sufficient statistics for the parameters include three unknown terms, $z_{nj}$, $z_{nj} t^m_n$ and $z_{nj} t^m_n (t^m_n)^T$. So we must calculate the expectations for these three terms. Following [5], we introduce:

$$\hat{t}^m_{nj} = E(t^m_n \mid z_{nj} = 1, t^o_n, \theta_j) = (y^m_j)^{\mathrm{old}} \tag{10}$$

which is the least-squares regression between $t^m_n$ and $t^o_n$ predicted by Gaussian $j$, and `old' denotes the value computed in the last M-step.

The expectation of $z_{nj}$ is $E[z_{nj} \mid t^o_n, \theta_j] = r_{nj}$ (equation (4)) measured only on the observed dimensions of $t_n$. For the GTM, we calculate:

$$E[z_{nj} t^m_n \mid t^o_n, \theta_j] = E[z_{nj} \mid t^o_n, \theta_j]\, E[t^m_n \mid z_{nj} = 1, t^o_n, \theta_j] = r_{nj}\, \hat{t}^m_{nj} = r_{nj}\, (y^m_j)^{\mathrm{old}} \tag{11}$$

In the M-step, the missing values are expressed using the posterior means:

$$E[t^m_n \mid t^o_n, \theta] = \sum_{j=1}^{K} r_{nj}\, E[t^m_n \mid z_{nj} = 1, t^o_n, \theta_j] \tag{12}$$

and the weights are then updated to $W^{\mathrm{new}}$ in the usual way for the GTM [1]. The variance is updated by:

$$\frac{1}{\beta} = \frac{1}{ND} \sum_{n=1}^{N} \sum_{j=1}^{K} \left( r_{nj} \|t^o_n - y^o_j\|^2 + E[z_{nj} \|t^m_n - y^m_j\|^2] \right) \tag{13}$$

where

$$E[z_{nj} \|t^m_n - y^m_j\|^2] = E[\|t^m_n - y^m_j\|^2 \mid z_{nj} = 1] = (\beta^{-1})^{\mathrm{old}} + (\hat{t}^m_{nj})^T (\hat{t}^m_{nj}) - 2 (\hat{t}^m_{nj})^T y^m_j + (y^m_j)^T y^m_j \tag{14}$$

and $y^m_j = (W^{\mathrm{new}} \phi(x_j))^m$.
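The E-step on the observed dimensions, the posterior-mean fill-in of equation (12) and the variance update of equation (13) can be sketched as follows. This is an illustrative reading, not the authors' code: missing entries are assumed to be marked as NaN, the function names are hypothetical, and the update of $W$ itself is omitted. The sketch also assumes that the expectation in equation (14) is weighted by $r_{nj}$ and that the $(\beta^{-1})^{\mathrm{old}}$ term is counted once per missing dimension, which is the usual result for an isotropic Gaussian.

```python
import numpy as np

def estep_missing(T, Y, beta):
    """E-step with missing values (NaN) in T (N x D).

    Y: (K, D) centres y_j from the last M-step; beta: inverse variance.
    Returns responsibilities R (N, K) and the data with missing entries
    replaced by their posterior means (equation (12)).
    """
    obs = ~np.isnan(T)                               # True where a value is observed
    T0 = np.where(obs, T, 0.0)                       # placeholder zeros, masked below
    # ||t^o_n - y^o_j||^2 over the observed dimensions only, shape (N, K).
    d2_obs = (obs[:, None, :] * (T0[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    log_r = -0.5 * beta * d2_obs
    log_r -= log_r.max(axis=1, keepdims=True)
    R = np.exp(log_r)
    R /= R.sum(axis=1, keepdims=True)
    # Missing entries become sum_j r_nj (y^m_j)^old.
    T_filled = np.where(obs, T, R @ Y)
    return R, T_filled

def update_beta(T, R, Y_old, Y_new, beta_old):
    """Inverse-variance update in the spirit of equations (13) and (14)."""
    obs = ~np.isnan(T)
    N, D = T.shape
    T0 = np.where(obs, T, 0.0)
    d2_obs = (obs[:, None, :] * (T0[:, None, :] - Y_new[None, :, :]) ** 2).sum(axis=2)
    # ||(y^m_j)^old - y^m_j||^2 summed over the missing dimensions of each point.
    d2_mis = (~obs).astype(float) @ ((Y_old - Y_new) ** 2).T
    # Assumption: one (1/beta)^old term per missing dimension.
    n_mis = (~obs).sum(axis=1, keepdims=True)
    expected = d2_obs + d2_mis + n_mis / beta_old
    return N * D / (R * expected).sum()
```

In a full training loop `Y_new` would come from the weight update applied to the filled-in data; here it is simply an input.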


4 Hierarchical GTM

4.1 An introduction to hierarchical GTM

For a complex data set, a single two-dimensional visualisation plot may not be sufficient since it is difficult to capture all of the interesting aspects in the data set. Therefore a hierarchical visualisation system is desirable.

Given a training data set $T = \{t_1, t_2, \ldots, t_N\}$, the probability assigned to this set by a hierarchy of GTMs organised in a hierarchical tree $\mathcal{T}$ is calculated by considering the hierarchical GTM $\mathcal{T}$ as a mixture of GTMs [6], with mixture components being the leaves $\mathcal{M}$. The parameters of the hierarchy (weights $W$, inverse variance $\beta$ and parent-conditional mixture coefficients) are fitted by maximum likelihood using the EM algorithm. Mixture coefficients for the mixture components $\mathcal{M}$ are calculated recursively by multiplying parent-conditional mixture coefficients down the path from the root to $\mathcal{M}$.

Given a data point $t_n$ and a submodel $\mathcal{M}$ in the hierarchy $\mathcal{T}$, we have three types of hidden variables: 1) responsibility of Parent($\mathcal{M}$), the parent of $\mathcal{M}$, for generating $t_n$; 2) parent-conditional responsibility for $t_n$, given that Parent($\mathcal{M}$) generated $t_n$; and 3) responsibility of latent space centres $x_j$ of $\mathcal{M}$ for generating $t_n$.

To avoid numerical problems arising from multiplication of small probabilities and to speed up the training process, the GTMs on deeper levels are trained only on data points for which the parent model has responsibility greater than some pre-set threshold. In our experiments the threshold was $10^{-3}$.
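The recursive computation of the leaf mixture coefficients and the thresholding of training points can be sketched as below. The tree representation (a dictionary from a node name to a list of `(child, coefficient)` pairs) and both function names are hypothetical choices made only for this illustration.

```python
def leaf_coefficients(tree, root):
    """Mixture coefficients of the leaves of a hierarchy.

    tree maps a node to a list of (child, parent-conditional coefficient)
    pairs; nodes without an entry are leaves.
    """
    coeffs = {}

    def walk(node, coeff):
        children = tree.get(node, [])
        if not children:
            coeffs[node] = coeff          # product along the path from the root
            return
        for child, pi in children:
            walk(child, coeff * pi)

    walk(root, 1.0)
    return coeffs


def select_training_points(parent_resp, threshold=1e-3):
    """Indices of data points whose parent responsibility exceeds the threshold."""
    return [n for n, r in enumerate(parent_resp) if r > threshold]
```

For example, with a root that splits 0.6 / 0.4 into models A and B, and A splitting evenly into A1 and A2, the leaves A1, A2 and B receive coefficients 0.3, 0.3 and 0.4, which sum to one.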

4.2 Parameter initialisation

Having trained GTMs down to level $\ell$ of the hierarchical tree $\mathcal{T}$, we choose a parent model $\mathcal{N}$ at level $\ell$ and, based on its visualisation plot, we select "regions of interest" for child GTMs $\mathcal{M}$ at level $\ell + 1$. More precisely, the visualisation plot of the parent GTM $\mathcal{N}$ shows low-dimensional representations in the latent space of data space points from the training set.

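The regions of interest are selected as follows: The user first selects points $c_i$, $i = 1, 2, \ldots, A$, in the latent space that correspond to "centres" of the subregions the user is interested in. The points $c_i$ are then transformed via the map $y_{\mathcal{N}}$ defined by the parent GTM $\mathcal{N}$ to the data space

$$y_{\mathcal{N}}(c_i) = W_{\mathcal{N}}\, \phi_{\mathcal{N}}(c_i). \tag{15}$$

The regions of interest are given by the Voronoi compartments [7] in the data space corresponding to the points $y_{\mathcal{N}}(c_i)$, $i = 1, 2, \ldots, A$:

$$V_i = \left\{ t \in \mathbb{R}^D \mid d(t, y_{\mathcal{N}}(c_i)) = \min_j d(t, y_{\mathcal{N}}(c_j)) \right\}, \tag{16}$$

where $d(\cdot, \cdot)$ is the Euclidean distance in the data space $\mathbb{R}^D$. All points in $V_i$ are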


We initialise the parameters $W_{\mathcal{M}}$ of child GTMs $\mathcal{M}$, so that each GTM initially approximates principal component analysis (PCA) of the corresponding Voronoi compartment. For GTM $\mathcal{M}$ corresponding to a compartment $V_i$, we first evaluate the covariance matrix of training points in $V_i$ and obtain the first $L$ principal eigenvectors. Next, we determine $W_{\mathcal{M}}$ by minimising the error

$$E = \frac{1}{2} \sum_{j=1}^{K_{\mathcal{M}}} \left\| W_{\mathcal{M}}\, \phi_{\mathcal{M}}(x^{\mathcal{M}}_j) - U x^{\mathcal{M}}_j \right\|^2, \tag{17}$$

where the columns of $U$ are the first $L$ principal eigenvectors of the data covariance matrix (see [1]).

Following [1], the parameter $\beta_{\mathcal{M}}$ is initialised so that its inverse is the larger of the $(L+1)$th eigenvalue from PCA, which represents the variance of the data away from the PCA plane, or the square of half of the grid spacing of the PCA-projected latent data points in data space.
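A compact NumPy sketch of this initialisation is given below: data points are assigned to Voronoi compartments as in equation (16), and the weights of a child GTM are obtained by the least-squares fit of equation (17). The function names, the argument layout and the use of a precomputed basis activation matrix `Phi` are assumptions of the sketch; the comparison with the grid spacing for $\beta_{\mathcal{M}}$ is left out, and only the $(L+1)$th eigenvalue is returned.

```python
import numpy as np

def voronoi_assign(T, centres):
    """Index of the nearest centre y_N(c_i) for each row of T (equation (16))."""
    d2 = ((T[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def pca_init(Tv, X, Phi):
    """PCA-based initialisation of a child GTM from the points Tv of one compartment.

    Tv: (n, D) training points in the compartment (requires D > L);
    X: (K, L) latent points x_j; Phi: (K, M) basis activations phi(x_j).
    Returns W (D, M), the compartment mean and the (L+1)th eigenvalue.
    """
    L = X.shape[1]
    mean = Tv.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov((Tv - mean).T))
    order = np.argsort(evals)[::-1]               # eigenvalues in descending order
    U = evecs[:, order[:L]]                       # (D, L) principal eigenvectors
    target = X @ U.T                              # row j is (U x_j)^T
    # Least-squares minimiser of equation (17): Phi W^T ~= target.
    Wt, *_ = np.linalg.lstsq(Phi, target, rcond=None)
    return Wt.T, mean, evals[order[L]]
```

The returned mean is not part of equation (17); it is provided because in practice the PCA plane passes through the compartment mean, which a bias basis function can absorb.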

5 Experiment

In our experiments, GTM models were trained in two ways: (A1) the algorithm defined in section 3.2 and (A2) standard EM applied to a data set with the missing values replaced by the unconditional mean.
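The baseline A2 pre-processing can be sketched as a column-wise mean fill. The encoding of missing values as NaN and the function name are assumptions of this example.

```python
import numpy as np

def fill_unconditional_mean(T):
    """Replace each NaN by the mean of the observed values in the same column."""
    col_mean = np.nanmean(T, axis=0)          # per-dimension mean, ignoring NaN
    return np.where(np.isnan(T), col_mean, T)
```

Standard EM is then run on the filled array as if it were complete, so the imputed values do not depend on the other coordinates of the same data point.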

5.1 The toy data

200 training data points were generated randomly in the interval $[0, 2\pi]$ as $t_1$. The variable $t_2$ was then computed by the function $t_2 = t_1 + 1.25 \sin(2 t_1)$. Spherical Gaussian noise with standard deviation 0.1 was added to the $t_2$ coordinates. Then we deleted 30% of the values in $t_2$ randomly. Figure 1 shows the result using A1 and A2. After training, the negative log likelihood is 1.62 and 2.66 per data point respectively.
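A data set of this kind can be generated with the sketch below. The random seed, the NaN encoding of deleted values and the function name are arbitrary choices; the upper limit $2\pi$ of the sampling interval is an assumption inferred from the axis range of Figure 1.

```python
import numpy as np

def make_toy_data(n=200, missing_frac=0.3, noise_std=0.1, seed=0):
    """Noisy curve t2 = t1 + 1.25 sin(2 t1) with a fraction of t2 values deleted."""
    rng = np.random.default_rng(seed)
    t1 = rng.uniform(0.0, 2.0 * np.pi, size=n)      # assumed interval [0, 2 pi]
    t2 = t1 + 1.25 * np.sin(2.0 * t1) + rng.normal(0.0, noise_std, size=n)
    complete = np.column_stack([t1, t2])
    corrupted = complete.copy()
    # Delete the requested fraction of t2 values, chosen uniformly at random.
    n_missing = int(round(missing_frac * n))
    idx = rng.choice(n, size=n_missing, replace=False)
    corrupted[idx, 1] = np.nan
    return complete, corrupted
```

The complete array is kept alongside the corrupted one so that reconstructed values can be compared with the originals.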

5.2 Oil data

This example arises from the problem of determining the fraction of oil in a multi-phase pipeline carrying a mixture of oil, water and gas. The data set consists of 1000 12-dimensional points. Points in the data set are classified into three different multi-phase flow configurations: homogeneous, annular and laminar [8].

Figure 2 shows the visualisation results. A hierarchy of GTMs up to level 3 was trained on the data set. For every level, $15 \times 15 = 225$ latent data points were selected in the 2-dimensional latent space and the number of Gaussian basis functions is $4 \times 4 = 16$. The final visualisation plot for the complete (uncorrupted)

[Figure 1: two panels, each titled "After 15 iterations of training", with both axes running from 0 to 7. (a) Using the EM based algorithm. (b) Using unconditional mean method.]

Fig. 1. The toy problem: the complete data points are plotted as circles while the centres of the Gaussian mixture are plotted as plus signs. The centres are joined by a line according to their ordering in the (one-dimensional) latent space ($K = 60$). The stars represent the missing values. The discs surrounding each plus sign represent two standard deviations' width of the noise model.

We randomly deleted 30% of values in the data set. The maximum number of corrupted coordinates per data point is 6. Again we compare the negative log likelihood of A1 and A2. Here we just measured the values of negative log likelihood for the top level GTM, since the likelihood for lower level models depends on where the "regions of interest" are selected. For the incomplete data set, after 10 training cycles, using the EM algorithm, the negative log likelihood is $-3.39$ per data point, while using unconditional mean filling in the missing data, the negative log likelihood is $-1.31$. Using our EM based algorithm for dealing with missing values can indeed be beneficial, as can be seen by comparing the top level (root) visualisation plots and the visualisation plots on the second level of the hierarchy. These second-level plots show better separation of classes and match better to the models trained on the complete data set.

6 Conclusions

In this paper, we have shown how incomplete data can be included in the hierarchical GTM training. The algorithm for dealing with missing values based on the EM algorithm and Gaussian mixture models is a viable approach for data


References

1. C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: The Generative Topographic Mapping. Neural Computation, 10(1):215-235, 1998.

2. M. A. Carreira-Perpiñán. Reconstruction of Sequential Data with Probabilistic Models and Continuity Constraints. In Sara A. Solla, Todd K. Leen, and Klaus-Robert Müller, editors, Advances in Neural Information Processing Systems, volume 12. The MIT Press, 2000.

3. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. Roy. Stat. Soc. B, 39:1-38, 1977.

4. C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, New York, N.Y., 1995.

5. Z. Ghahramani and M. I. Jordan. Learning from incomplete data. Technical report, AI Laboratory, MIT, 1994.

6. P. Tino and I. Nabney. Constructing localized non-linear projection manifolds in a principled way: hierarchical Generative Topographic Mapping. Technical report, 2000.

7. F. Aurenhammer. Voronoi diagrams - survey of a fundamental geometric data structure. ACM Computing Surveys, 3:345-405, 1991.

8. C. M. Bishop and G. D. James. Analysis of Multi-phase Flows Using Dual-energy Gamma Densitometry and Neural Networks. Nuclear Instruments and Methods in

[Figure 2: three hierarchical visualisation plots, panels (a), (b) and (c); the legend in each panel distinguishes the Homogeneous, Annular and Laminar classes.]

Fig. 2. Data visualisation for oil data by using hierarchical GTM. Plot (a) shows the result of training on the complete data set. Plot (b) shows the result of using the EM algorithm learning from incomplete data, while plot (c) shows the same data set