Using chain event graphs to refine model selection


Author(s): PA Thwaites

Article Title: Using Chain Event Graphs to refine model selection

Year of publication: 2011

Link to published article:

http://www2.warwick.ac.uk/fac/sci/statistics/crism/research/2011/paper

11-15


Peter Thwaites, University of Warwick, UK, Peter.Thwaites@warwick.ac.uk

Abstract

Chain Event Graphs (CEGs) are specially designed to embody the conditional independence structure of problems whose state spaces are asymmetric and do not admit a natural product structure. The learning of CEGs is closely related to the learning of BNs, and if we use (for example) MAP model selection then where a model can be represented as both a BN and a CEG, the two methods assign this model the same score. If we suspect that a problem incorporates significant context-specific conditional independence structure we can use standard BN-based learning methods to select a good approximate model, and then use the CEG-based learning methods described here to further refine this model.

1 Introduction

The Chain Event Graph (CEG) (Smith and Anderson, 2008; Thwaites et al., 2008; Thwaites et al., 2010) is a graphical model which captures the conditional independence structure of problems which do not admit a natural product structure on their state spaces. Such problems often have no satisfactory representation as a BN or context-specific BN.

Specifically, a CEG is a function of an event tree. These trees (Shafer, 1996) are particularly suited to problems displaying asymmetry, but are not ideal for the representation of the conditional independence structure of a problem. The CEG has been developed to solve this fault.

A formal description and motivation for using CEGs, and an outline of some of their implicit conditional independence structure can be found in (Smith and Anderson, 2008). Three points from this paper are key to the ideas presented here. Firstly, problem asymmetries are represented explicitly in the topology of the CEG. Secondly, CEGs can be used to express a richer set of conditional independence statements not simultaneously expressible through a single BN. Lastly, the class of BNs is contained within that of CEGs. This is a property which we exploit here, since with appropriate prior settings, it follows that BN model selection procedures can be nested within those for CEGs.

Fast propagation algorithms for CEGs were developed in (Thwaites et al., 2008). These exploit the graph's embedded conditional independencies to factorize its mass function over local masses. In this paper we demonstrate how this factorization of the joint mass function over a given event space can also be used as a framework for searching over a space of promising candidate CEGs to discover models which provide good qualitative explanations of the underlying data generating process of a given data set. Because these search methods are similar to well known algorithms used for searching BNs we are able to use similar arguments for setting up hyperparameters over priors so that the priors over the model space decompose as collections of local beliefs.


[...] specific adaptations of these models which are better reflections of the problem.

Section 2 briefly describes the process by which we create a CEG from an event tree. Section 3 introduces the techniques for learning CEGs. In section 4 we provide an example of how BN-based model selection can be refined by the use of CEG-based techniques. Further discussion appears in section 5.

2 Producing a CEG

Starting with an event tree $T$ (vertex set $V(T)$, edge set $E(T)$), a probability tree can be specified by assigning probabilities to each member of $E(T)$.

Letting $T(v)$ be the subtree of $T$ rooted in the vertex $v$ ($\in V(T)$), we say that the vertices $v_1$ and $v_2$ are in the same position if:

- the subtrees $T(v_1)$ and $T(v_2)$ have identical topologies,

- there exists a map between $T(v_1)$ and $T(v_2)$ such that corresponding edges in the two subtrees are labelled with the same outcomes (given different problem developments up to $v_1$ and $v_2$) and the same probabilities.

The set $K(T)$ of positions $w$ partitions $V(T)$. The CEG $C$ is a coloured directed graph with vertex set $V(C) = K(T) \cup \{w_\infty\}$, and edge set $E(C)$. There exists an edge $e \in E(C)$ from $w_1$ to $w_2 \neq w_\infty$ for each vertex $v_2 \in w_2$ which is a child of a fixed representative $v_1 \in w_1$ for some $v_1 \in V(T)$, and an edge from $w_1$ to $w_\infty$ for each leaf-node $v \in V(T)$ which is a child of a fixed representative $v_1 \in w_1$ for some $v_1 \in V(T)$. The floret $F(w)$ of a position $w \in V(C)$ is $w$ together with the set of outgoing edges from $w$. We say that the positions $w_1$ and $w_2$ are in the same stage $u$ if:

- the florets $F(w_1)$ and $F(w_2)$ have identical topologies,

- there exists a map between $F(w_1)$ and $F(w_2)$ such that corresponding edges in the two florets are labelled with the same outcomes (given different problem developments up to $v_1$ and $v_2$) and the same probabilities.

For $w_1, w_2$ in the same stage the corresponding edges of $F(w_1)$ and $F(w_2)$ have the same colour (see positions $w_1$ and $w_2$ and their outgoing edges in Figure 2). For any $w \in u$ we can, without ambiguity, let the stage floret $F(u)$ be $u$ together with a set of edges labelled with the same events and probabilities as the outgoing edges of $w$.

The process of producing a CEG from a tree is illustrated in Example 1.

Example 1

Figure 1: Tree for Example 1 (vertices $v_0, \ldots, v_7$ etc., edges labelled with probabilities $\theta_1, \ldots, \theta_9$)

Figure 1 shows an event tree $T$ embellished with edge-probabilities. Edge event labels are not shown, but edges sharing a common probability label (eg. $\theta_4$) correspond to the same event given a different history. The CEG $C$ in Figure 2 is produced by combining the vertices $\{v_3, v_4, v_6\}$ into one position $w_3$, combining all leaf-nodes into a single sink-node $w_\infty$, and re-labelling vertices $v_0, v_1, v_2, v_5$ as $w_0, w_1, w_2, w_4$. The stages of the CEG are $u_0 = \{w_0\}$, $u_1 = \{w_1, w_2\}$, $u_2 = \{w_3\}$, $u_3 = \{w_4\}$. The edges leaving $w_1$ and $w_2$ [...]


Figure 2: CEG for Example 1 (positions $w_0, \ldots, w_4$ and sink-node $w_\infty$, edges labelled with probabilities $\theta_1, \ldots, \theta_9$)

Note that the CEG is specified through a particular event tree and statements about specific developments sharing the same distribution. Both of these properties can be expressed verbally in terms of a general explanation of the unfolding of events, and therefore have a meaning that transcends the particular instance.
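The construction of Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `TREE` is a hypothetical encoding of the tree in Figure 1, reconstructed from the description of the positions and stages above; the leaf names beyond $v_7$ and the labels `t1` to `t9` (standing for $\theta_1, \ldots, \theta_9$) are invented for the sketch; and, because event labels are not shown in the figure, the probability label is used to stand in for the event label as well. Stages are found here by grouping tree vertices on their florets, which for this example gives the same grouping as working with positions.

```python
from collections import defaultdict

# Hypothetical encoding of the Figure 1 tree (an assumption, see lead-in):
# each non-leaf vertex maps to its outgoing edges as (probability label, child).
TREE = {
    "v0": [("t1", "v1"), ("t2", "v2"), ("t3", "v3")],
    "v1": [("t4", "v4"), ("t5", "v5")],
    "v2": [("t4", "v6"), ("t5", "v7")],
    "v3": [("t6", "v8"), ("t7", "v9")],
    "v4": [("t6", "v10"), ("t7", "v11")],
    "v5": [("t8", "v12"), ("t9", "v13")],
    "v6": [("t6", "v14"), ("t7", "v15")],
}

def subtree_signature(v):
    """Canonical form of the labelled subtree T(v); a leaf gives ()."""
    return tuple(sorted((label, subtree_signature(child))
                        for label, child in TREE.get(v, [])))

def floret_signature(v):
    """Sorted labels on the edges leaving v, ignoring where they lead."""
    return tuple(sorted(label for label, _ in TREE[v]))

def group_by(key):
    """Group the non-leaf vertices of TREE that share the same signature."""
    groups = defaultdict(list)
    for v in TREE:
        groups[key(v)].append(v)
    return sorted(groups.values())

positions = group_by(subtree_signature)  # same subtree  -> same position
stages = group_by(floret_signature)      # same floret   -> same stage
print(positions)
print(stages)
```

Under this encoding the vertices $v_3, v_4, v_6$ fall into one position and $v_1, v_2$ share a stage without sharing a position, in line with the description of Figure 2.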

3 Learning CEGs

In this paper we consider maximum a posteriori (MAP) model selection on the class of CEGs. Other methods exist for BNs and many of these extend to CEGs as straightforwardly as the extension described here.

As with BN-based modelling, if we have complete random sampling the likelihood for a CEG model separates into products of terms which are only a function of parameters associated with one component of the model. In the BN each term is associated with a variable and its parents; in the case of the CEG the model components are the stage florets. Furthermore, the term in the likelihood corresponding to a particular floret $F(u)$ is proportional to one obtained from multinomial sampling on the set of units arriving at $u$.

For each stage $u$ we can label the edges in $F(u)$ by their probabilities under this model, so $\theta_{ui}$ labels the $i$th edge leaving any position which is a member of the stage $u$. We then let $n_{ui}$ be the total number of sample units passing through an edge labelled $\theta_{ui}$, and the likelihood is

$$L(\theta) = \prod_u \prod_i \theta_{ui}^{\,n_{ui}}$$

Assumptions of global and local independence together with the use of Dirichlet priors ensure conjugacy when learning BNs. To ensure the same with CEGs, we give the vectors of probabilities associated with the set of stage florets independent Dirichlet distributions. This gives prior and posterior distributions for the CEG model which are products of Dirichlet densities, and a marginal likelihood for $C$ of

$$\prod_u \left[ \frac{\Gamma\left(\sum_i \alpha_{ui}\right)}{\Gamma\left(\sum_i (\alpha_{ui} + n_{ui})\right)} \prod_i \frac{\Gamma(\alpha_{ui} + n_{ui})}{\Gamma(\alpha_{ui})} \right] \qquad (1)$$

where $\alpha_{ui}$ are the exponents of our Dirichlet priors.
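Expression (1) is most conveniently evaluated on the log scale. The following is a minimal sketch, not the author's implementation: the function names `floret_log_marginal` and `ceg_log_marginal` and the `(alpha, counts)` pair used to describe each stage floret are assumptions made for illustration. It relies only on `math.lgamma` from the Python standard library.

```python
from math import lgamma

def floret_log_marginal(alpha, counts):
    """Log of the factor of expression (1) contributed by one stage floret.

    alpha  -- Dirichlet hyperparameters, one per edge of the floret
    counts -- number of sample units observed along each edge
    """
    value = lgamma(sum(alpha)) - lgamma(sum(alpha) + sum(counts))
    for a, n in zip(alpha, counts):
        value += lgamma(a + n) - lgamma(a)
    return value

def ceg_log_marginal(florets):
    """Log marginal likelihood of a CEG given (alpha, counts) per stage."""
    return sum(floret_log_marginal(alpha, counts) for alpha, counts in florets)

# One stage with a Dirichlet(1, 1) prior and a single unit on its first edge:
# the marginal probability of that observation is 1/2.
print(ceg_log_marginal([([1, 1], [1, 0])]))  # log(1/2)
```

Because the score is a sum over stage florets, each floret can be evaluated and cached independently of the rest of the graph.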

As $P(\text{model} \mid \text{data}) \propto P(\text{data} \mid \text{model}) \times P(\text{model})$ we have to set prior probabilities for possible models as well as parameter priors. There are many choices for both these, but for accessibility in this paper we consider simple cases which have direct analogues in BN model selection. So, if there is no reason to do otherwise we let $P(\text{model})$ be constant for all models in the candidate set of CEGs. Similarly we choose the case where hyperparameter priors are set to correspond to counts of dummy units through the CEG. We do this by putting a uniform prior over the root-to-sink paths of the CEG and assigning Dirichlet priors to each of the stage florets. It is straightforward to check (see for example (Freeman and Smith, 2009)) that for models expressible as both CEGs and BNs, the values given by expression (1) are then identical to those given by BN expression (2) using the prior settings suggested in (Cooper and Herskovits, 1992; Heckerman et al., 1995) etc.

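$$\prod_{i \in V} \left[ \prod_j \frac{\Gamma\left(\sum_k \alpha_{ijk}\right)}{\Gamma\left(\sum_k (\alpha_{ijk} + n_{ijk})\right)} \prod_k \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right] \qquad (2)$$

Note that here $i$ indexes the set of variables of the BN; $k$ indexes the levels of the variable $X_i$; and $j$ indexes the configurations of the parent variables of $X_i$.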

It is this result which allows us to use BN-based methods to narrow down the set of possible models before moving over to CEGs to use the techniques here presented to refine our search.

For the CEG in Figure 2, we put a uniform prior over the nine root-to-sink paths, and assign a Di(2, 4, 3) prior to $u_0$, a Di(4, 3) prior to $u_1 = \{w_1, w_2\}$, a Di(3, 3) prior to $u_2$, and a Di(1, 1) prior to $u_3$. We then have a marginal likelihood (expression (1)) equal to

$$\begin{aligned}
&\frac{\Gamma(9)}{\Gamma(9+N)} \, \frac{\Gamma(2+n_{01})\,\Gamma(4+n_{02})\,\Gamma(3+n_{03})}{\Gamma(2)\,\Gamma(4)\,\Gamma(3)} \\
\times\; &\frac{\Gamma(7)}{\Gamma(7+n_{11}+n_{12})} \, \frac{\Gamma(4+n_{11})\,\Gamma(3+n_{12})}{\Gamma(4)\,\Gamma(3)} \\
\times\; &\frac{\Gamma(6)}{\Gamma(6+n_{21}+n_{22})} \, \frac{\Gamma(3+n_{21})\,\Gamma(3+n_{22})}{\Gamma(3)\,\Gamma(3)} \\
\times\; &\frac{\Gamma(2)}{\Gamma(2+n_{31}+n_{32})} \, \frac{\Gamma(1+n_{31})\,\Gamma(1+n_{32})}{\Gamma(1)\,\Gamma(1)}
\end{aligned}$$

where $N$ is the sample size, and $n_{11}$ (for example) is the total number of sample units leaving $u_1$ (ie. $w_1$ or $w_2$) via (in this case) a blue edge.

Note that, as in this example, CEGs can be used to depict models which admit known logical constraints. If we attempt to express the constraints of this example through a BN, we find that some variables have no outcomes given particular vectors of values of ancestral variables. We cannot simply set probabilities to zero in this instance as a Dirichlet distribution is then no longer appropriate and so the usual model selection procedure fails.
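The hyperparameters quoted for the CEG in Figure 2 can be recovered by counting root-to-sink paths through each edge. The sketch below assumes one dummy unit per path (the parameter `weight`, a hypothetical name) and a hypothetical encoding of Figure 2 in which `CEG` lists the target of each edge and `STAGE` assigns positions to stages. The order of the edges leaving $w_0$ is not recoverable from the text and is chosen here so that the result matches Di(2, 4, 3).

```python
SINK = "w_inf"

# Hypothetical encoding of the Figure 2 CEG: position -> targets of its edges,
# listed in floret order. The ordering at w0 is an assumption (see lead-in).
CEG = {
    "w0": ["w3", "w1", "w2"],
    "w1": ["w3", "w4"],
    "w2": ["w3", SINK],
    "w3": [SINK, SINK],
    "w4": [SINK, SINK],
}
STAGE = {"w0": "u0", "w1": "u1", "w2": "u1", "w3": "u2", "w4": "u3"}

def root_to_sink_paths(w="w0"):
    """Yield each root-to-sink path as a list of (position, edge index)."""
    if w == SINK:
        yield []
        return
    for i, target in enumerate(CEG[w]):
        for rest in root_to_sink_paths(target):
            yield [(w, i)] + rest

def path_prior(weight=1.0):
    """Dirichlet hyperparameters per stage from a uniform prior over paths."""
    alpha = {STAGE[w]: [0.0] * len(edges) for w, edges in CEG.items()}
    for path in root_to_sink_paths():
        for w, i in path:
            alpha[STAGE[w]][i] += weight  # each path adds one dummy unit
    return alpha

print(len(list(root_to_sink_paths())))
print(path_prior())
```

Positions in the same stage pool their path counts, which is why $u_1$ receives Di(4, 3) even though $w_1$ and $w_2$ carry different numbers of paths.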

4 An Example

In this section we consider a simple example which demonstrates the versatility of our method. Our client is analyzing a medical data set relating to an inherited condition. A random sample of 100 (51 female, 49 male) people has been taken from a population who have had recent ancestors with the condition. For each individual in the sample a record has been kept of whether or not they displayed a particular symptom in their teens, and whether or not they then developed the condition in middle age.

In Table 1, $A = 0, 1$ corresponds to female, male; $B = 1$ corresponds to the individual displaying the symptom; and $C = 1$ corresponds to the individual developing the condition.

Table 1: Data for medical example

              A = 0              A = 1
          B = 0   B = 1      B = 0   B = 1
C = 0       33      6          10      12
C = 1        6      6           9      18

Eight possible BNs could be drawn for this problem, with directed edges present or absent between $A$ & $B$, $A$ & $C$, and $B$ & $C$. These BNs represent eight possible models, which given the temporal ordering of the variables can be described by (a) full independence, (b) $A \to C$, $B \amalg (A, C)$, (c) $B \to C$, $A \amalg (B, C)$, (d) $A \to B$, $C \amalg (A, B)$, (e) $A \to C$, $B \to C$, $B \amalg A$, (f) $A \to B \to C$, $C \amalg A \mid B$, (g) $A \to B$, $A \to C$, $C \amalg B \mid A$, and (h) $A \to B \to C$, $A \to C$, full association. CEGs can also be drawn for these models, although as these are not asymmetric models, there is no advantage in doing so. For illustrative purposes the models (b), (d) and (f) are depicted as CEGs in Figure 3 (i), (ii) and (iii).

Figure 3 (i): $A \to C$, $B \amalg (A, C)$


- [...] $w_1$ and $w_2$ [...] coloured (ie. they carry different probabilities), these positions are not in the same stage, so $A \not\amalg B$,

- edges labelled $B = 0$ converge at $w_3$, so $C \amalg A \mid (B = 0)$. Similarly, edges labelled $B = 1$ converge at $w_4$, so $C \amalg A \mid (B = 1)$, and combining these we get $C \amalg A \mid B$.

Figure 3 (ii): $A \to B$, $C \amalg (A, B)$

Figure 3 (iii): $A \to B \to C$, $C \amalg A \mid B$

Figure 3 (iv): $A \to B$, $C \amalg B \mid (A = 1)$

Figure 3 (v): $A \to B$, $C \amalg A \mid (B = 1)$

Figure 3 (vi): $A \to B$, $C \amalg (A, B) \mid \mathrm{Max}(A, B)$

Our starting point is to search over the candidate set of eight BNs, and as our client has not expressed any preference for a particular model, we let $P(\text{model})$ be constant for each model in the candidate set, which allows us to use $P(\text{data} \mid \text{model})$ as a measure for $P(\text{model} \mid \text{data})$. We then (as we are using MAP model selection) let the score of the model be the logarithm of its marginal likelihood (as expressed by (2)). Note that using CEGs and expression (1) would give us exactly the same scores as using BNs and (2). The scores for our eight models are given in Table 2. The model with the highest score is the MAP model for this candidate set.
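A sketch of how the eight BN scores could be computed from Table 1 using expression (2) follows. It is an illustration under assumptions rather than the author's code: the prior is taken to be one dummy unit on each of the eight root-to-sink paths, spread evenly over the cells of each conditional table (the parameter `ess=8.0`), and the names `DATA`, `MODELS` and `bn_log_score` are hypothetical.

```python
from collections import defaultdict
from math import lgamma

# Table 1 counts keyed by (A, B, C).
DATA = {(0, 0, 0): 33, (0, 0, 1): 6, (0, 1, 0): 6, (0, 1, 1): 6,
        (1, 0, 0): 10, (1, 0, 1): 9, (1, 1, 0): 12, (1, 1, 1): 18}

# Models (a)-(h) as (parents of B, parents of C); variables are A=0, B=1, C=2.
MODELS = {"a": ((), ()), "b": ((), (0,)), "c": ((), (1,)), "d": ((0,), ()),
          "e": ((), (0, 1)), "f": ((0,), (1,)), "g": ((0,), (0,)),
          "h": ((0,), (0, 1))}

def floret_log_marginal(alpha, counts):
    """Log Dirichlet-multinomial marginal for one table row (or floret)."""
    value = lgamma(sum(alpha)) - lgamma(sum(alpha) + sum(counts))
    for a, n in zip(alpha, counts):
        value += lgamma(a + n) - lgamma(a)
    return value

def bn_log_score(pa_b, pa_c, ess=8.0):
    """Log of expression (2) for binary A, B, C with the given parent sets."""
    total = 0.0
    for var, parents in ((0, ()), (1, pa_b), (2, pa_c)):
        table = defaultdict(lambda: [0, 0])
        for cell, n in DATA.items():
            table[tuple(cell[p] for p in parents)][cell[var]] += n
        alpha = ess / (2 * 2 ** len(parents))  # assumed: even split of ess
        total += sum(floret_log_marginal([alpha, alpha], counts)
                     for counts in table.values())
    return total

scores = {name: bn_log_score(*pa) for name, pa in MODELS.items()}
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 2))
```

With this prior the hyperparameters coincide with the path counts used for the corresponding CEGs (4 per level of $A$, 2 per level of $B$ given $A$, and so on), which is the setting under which expressions (1) and (2) agree.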

Table 2


[...] (e) clearly indicate that $B$ is directly dependent on $A$. Although model (h) has the highest score, the closeness of the scores for models (f) and (g), their proximity to the score for (h), and their distance from the score for (d) suggests that there is some context-specific conditional independence at work. Context-specific properties such as $C \amalg B \mid (A = 1)$ (there is one distribution for developing the condition given that gender is male) or $C \amalg A \mid (B = 1)$ (there is one distribution for developing the condition given that symptom was displayed) can be represented as context-specific BNs of the type described in, for example, (Boutilier et al., 1996; Poole and Zhang, 2003). They can also be represented elegantly as CEGs; these particular models are depicted in Figure 3 (iv) and (v) (which also reflect the established direct dependence of $B$ on $A$). As we earlier read the CEG in Figure 3 (iii), we can read, for example, CEG (v) as follows:

- $w_1$ and $w_2$ are not in the same stage, so $A \not\amalg B$,

- edges labelled $B = 1$ converge at a single position, so $C \amalg A \mid (B = 1)$, but edges labelled $B = 0$ do not, so we do not have $C \amalg A \mid (B = 0)$.

Note that the CEG portrays the context-specific conditional independence properties of the model in its topology; the context-specific BN does not. Also, although BN-based learning methods have been adapted for context-specific BNs (see for example (Feelders and van der Gaag, 2005)), our CEG-based methods work for all CEG models without the need for any adaptation.

Using CEGs to score models with context-specific properties of the sort described, we find that $C \amalg B \mid (A = 1)$ and $C \amalg A \mid (B = 1)$ are indeed improvements not just upon $C \amalg B \mid A$ and $C \amalg A \mid B$, but also upon full association, scoring -197.58 and -197.53 respectively. The closeness of these scores suggests that there may be a model with context-specific independence which [...] models, and which is better than both of them. In fact there are 30 possible CEG models for this problem, and this is without relaxing the edge ordering $A, B, C$. In fact the best model here is $C \amalg (A, B) \mid \mathrm{Max}(A, B)$ (there is one distribution for developing the condition given that an individual is male OR displayed the symptom, and one distribution for developing the condition given that an individual is female AND did not display the symptom). This is shown as a CEG in Figure 3 (vi). It is not representable as a BN without transformation of the variables.
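The two scores quoted above can be checked with a short sketch of expression (1) applied to Table 1. The assumptions are: one dummy unit on each of the eight root-to-sink paths; a hypothetical representation of each model as a list of stages, where a stage is a list of situations and a situation is the tuple of values already observed (`()` for the root, `(a,)` before $B$, `(a, b)` before $C$). All names are illustrative.

```python
from math import lgamma

# Table 1 counts keyed by (A, B, C).
DATA = {(0, 0, 0): 33, (0, 0, 1): 6, (0, 1, 0): 6, (0, 1, 1): 6,
        (1, 0, 0): 10, (1, 0, 1): 9, (1, 1, 0): 12, (1, 1, 1): 18}

def floret_log_marginal(alpha, counts):
    """Log of the factor of expression (1) for one stage floret."""
    value = lgamma(sum(alpha)) - lgamma(sum(alpha) + sum(counts))
    for a, n in zip(alpha, counts):
        value += lgamma(a + n) - lgamma(a)
    return value

def stage_log_score(stage):
    """Pool path counts and data over all situations merged into one stage."""
    alpha, counts = [0, 0], [0, 0]
    for s in stage:
        for x in (0, 1):
            alpha[x] += 2 ** (2 - len(s))  # paths through this edge (assumed)
            counts[x] += sum(n for cell, n in DATA.items()
                             if cell[:len(s) + 1] == s + (x,))
    return floret_log_marginal(alpha, counts)

def ceg_log_score(stages):
    return sum(stage_log_score(stage) for stage in stages)

A_TO_B = [[()], [(0,)], [(1,)]]  # B depends on A in all three models
MODEL_IV = A_TO_B + [[(0, 0)], [(0, 1)], [(1, 0), (1, 1)]]  # C indep. of B | A=1
MODEL_V = A_TO_B + [[(0, 0)], [(1, 0)], [(0, 1), (1, 1)]]   # C indep. of A | B=1
MODEL_VI = A_TO_B + [[(0, 0)], [(0, 1), (1, 0), (1, 1)]]    # C given Max(A, B)

for name, model in (("iv", MODEL_IV), ("v", MODEL_V), ("vi", MODEL_VI)):
    print(name, round(ceg_log_score(model), 2))
```

Under these assumptions models (iv) and (v) come out at values that round to the -197.58 and -197.53 quoted in the text, and model (vi) scores higher than both.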

5 Discussion

In this paper we have concentrated on the principle of assigning a score to a member of a candidate class, rather than on algorithms for searching over this class. But as the example in the previous section demonstrates, not only is it very easy to establish the full candidate set of CEGs, it will also be straightforward to move between the members of this set when learning. The score for a CEG model decomposes into components associated with florets. When two CEGs contain the same floret, we assign this floret the same prior distribution in each model, and the separation of the likelihood means that this property is retained in the posterior distribution. As similar models will share a high proportion of florets, the scores for similar models will differ only in a small number of components. Efficient algorithms can therefore be created to search over the CEG model space (Freeman and Smith, 2009).
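The locality of the score can be illustrated by the change in log score when two stages are merged: only the florets involved need to be re-evaluated. The sketch below is an illustration only; `merge_gain` is a hypothetical helper, each floret is described by an assumed `(alpha, counts)` pair, and the example uses the $C$ florets for $(A = 0, B = 1)$ and $(A = 1, B = 1)$ from Table 1 with one dummy unit per root-to-sink path.

```python
from math import lgamma

def floret_log_marginal(alpha, counts):
    """Log of the factor of expression (1) for one stage floret."""
    value = lgamma(sum(alpha)) - lgamma(sum(alpha) + sum(counts))
    for a, n in zip(alpha, counts):
        value += lgamma(a + n) - lgamma(a)
    return value

def merge_gain(floret_1, floret_2):
    """Change in log score when two stages are merged into a single stage."""
    (a1, n1), (a2, n2) = floret_1, floret_2
    merged_alpha = [x + y for x, y in zip(a1, a2)]
    merged_counts = [x + y for x, y in zip(n1, n2)]
    return (floret_log_marginal(merged_alpha, merged_counts)
            - floret_log_marginal(a1, n1)
            - floret_log_marginal(a2, n2))

# C florets given (A=0, B=1) and (A=1, B=1), each with a Dirichlet(1, 1) prior.
gain = merge_gain(([1, 1], [6, 6]), ([1, 1], [12, 18]))
print(round(gain, 2))
```

A positive value means the merged model scores higher; here the merge takes the full association model to the model $C \amalg A \mid (B = 1)$, and every other component of the score is left untouched.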

Various methods have been developed to restrict the search in BN model selection to subsets of the class of models (see for example (van Gerven and Lucas, 2004)). As what we are proposing is to use CEG model selection as a refining process, we can still utilise these methods before moving on to the class of CEGs. Also there are ways in which we can further restrict the search to explore subclasses of CEGs which are expected to provide good explanations of the data.


[...] context, the task of restricting the set of candidate CEGs is much easier than it might first appear. Thus for example, in the educational examples considered in (Freeman and Smith, 2009), the context demands that the underlying event tree is consistent with the order students study courses, and that certain vertices could never reasonably be combined into the same stage. These sorts of contextually defined constraints can readily be incorporated into customized search algorithms, and the efficiency of the search procedure improved. It is also not unusual for more quantitative information to be available, such as one type of stage combination being proportionately more probable than another. This can allow one to usefully further refine and improve the search, although then the framework the CEG provides is no longer totally qualitative.

We noted earlier that there is a wide choice of possible parameter priors available, and that we had chosen a particularly straightforward set with a direct analogue in BN model selection. Care does however need to be taken when choosing parameter priors if the model selection algorithm is to function efficiently. This issue has already been addressed by a number of authors for the case of BNs (see for example (Heckerman, 1998)) using concepts of distribution and independence equivalence, and parameter modularity to ensure plausibly consistent priors over this class. For a full Bayesian estimation with conjugate locally and globally independent priors, the class of BNs nests within the larger class of CEGs. If we require that all BNs within the subclass of CEGs we are studying continue to respect these independence rules, whilst also retaining our floret independence, then the choices of prior hyperparameters are limited analogously with the class of BNs. Using a result from (Geiger and Heckerman, 1997), it is shown in (Freeman and Smith, 2009) that for a significant class of CEGs, if we assign Markov equivalent models the same prior, then the joint distribution on the leaves of the underlying tree is necessarily a priori Dirichlet [...] distributions being Dirichlet and mutually independent.

In (Silander et al., 2007) it was demonstrated that MAP model selection on the class of BNs can be sensitive to how priors are set, even when these priors are conjugate product Dirichlets. Extending this idea to CEG model selection, it may be insufficient simply to state that we are setting a uniform Dirichlet prior on the root-to-sink paths; we may also need to exercise care in the choice of a scale parameter for this distribution. This requires an explicit evaluation of the overall strength of prior beliefs, which can then be specified via the equivalent size (count of dummy units) assigned in the prior to each root-to-leaf path of the underlying tree. As already noted, there are Bayesian model selection methods other than MAP which extend to CEGs. If the analyst does not feel sufficiently confident in making this evaluation, then, for example, the Bayesian Information Criterion (BIC) could easily be modified for use with the set of CEG models.
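The role of the scale parameter can be explored with a small sketch that rescales the dummy count placed on each root-to-leaf path. It is illustrative only: `weight` is a hypothetical name for the per-path equivalent size, and the two models are written out by hand as lists of (paths per edge, data counts) pairs taken from Table 1, for full association (h) and for $C \amalg A \mid (B = 1)$ (v).

```python
from math import lgamma

def floret_log_marginal(alpha, counts):
    """Log of the factor of expression (1) for one stage floret."""
    value = lgamma(sum(alpha)) - lgamma(sum(alpha) + sum(counts))
    for a, n in zip(alpha, counts):
        value += lgamma(a + n) - lgamma(a)
    return value

# Florets shared by both models: A, then B given A=0 and B given A=1.
SHARED = [([4, 4], [51, 49]), ([2, 2], [39, 12]), ([2, 2], [19, 30])]
MODEL_H = SHARED + [([1, 1], [33, 6]), ([1, 1], [6, 6]),
                    ([1, 1], [10, 9]), ([1, 1], [12, 18])]
MODEL_V = SHARED + [([1, 1], [33, 6]), ([1, 1], [10, 9]),
                    ([2, 2], [18, 24])]

def log_score(model, weight=1.0):
    """Log marginal likelihood with `weight` dummy units per root-to-leaf path."""
    return sum(floret_log_marginal([weight * p for p in paths], counts)
               for paths, counts in model)

for weight in (0.25, 1.0, 4.0):
    difference = log_score(MODEL_V, weight) - log_score(MODEL_H, weight)
    print(weight, round(difference, 3))
```

The printed differences show how the comparison between two candidate models responds to the assumed prior strength; `weight=1.0` corresponds to the setting used in the example of section 4.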

Of course, just as with BNs, the conjugacy does not necessarily continue to hold when sampling is not complete. In this case approximate or numerical search algorithms need to be employed with consequent loss of accuracy or speed in scoring and comparing models. However in this case the methods for estimating BNs with missing values (see for example (Riggelsen, 2004)) can usually be extended so that they also apply to CEGs.

CEGs allow for the representation and analysis of problems whose state spaces are asymmetric and do not admit a natural product structure. In this paper we have shown that there are natural methods for learning CEGs which are closely related to the methods for learning BNs, and that we can use these CEG-based methods for further refining BN-based model selection.

Acknowledgments


EP/F036752/1).

References

C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. 1996. Context-specific independence in Bayesian Networks. In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, pages 115-123, Portland, Oregon.

G. F. Cooper and E. Herskovits. 1992. A Bayesian method for the induction of Probabilistic Networks from data. Machine Learning, 9(4):309-347.

A. Feelders and L. van der Gaag. 2005. Learning Bayesian Network parameters with prior knowledge about context-specific qualitative influences. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, Arlington, Virginia.

G. Freeman and J. Q. Smith. 2009. Bayesian MAP model selection of Chain Event Graphs. Research Report 09-06, CRiSM.

D. Geiger and D. Heckerman. 1997. A characterization of the Dirichlet distribution through Global and Local independence. Annals of Statistics, 25:1344-1369.

D. Heckerman, D. Geiger, and D. Chickering. 1995. Learning Bayesian Networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.

D. Heckerman. 1998. A tutorial on Learning with Bayesian Networks. In M. I. Jordan, editor, Learning in Graphical Models, pages 301-354. MIT Press.

D. Poole and N. L. Zhang. 2003. Exploiting contextual independence in probabilistic inference. Journal of Artificial Intelligence Research, 18:263-313.

C. Riggelsen. 2004. Learning Bayesian Network parameters from incomplete data using importance sampling. In Proceedings of the 2nd European Workshop on Probabilistic Graphical Models, pages 169-176, Leiden.

G. Shafer. 1996. The Art of Causal Conjecture. MIT Press.

T. Silander, P. Kontkanen, and P. Myllymaki. 2007. On the sensitivity of the MAP Bayesian Network structure to the equivalent sample size parameter. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, Vancouver.

J. Q. Smith and P. E. Anderson. 2008. Conditional independence and Chain Event Graphs. Artificial Intelligence, 172:42-68.

P. A. Thwaites, J. Q. Smith, and R. G. Cowell. 2008. Propagation using Chain Event Graphs. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pages 546-553, Helsinki.

P. A. Thwaites, J. Q. Smith, and E. M. Riccomagno. 2010. Causal analysis with Chain Event Graphs. Accepted by Artificial Intelligence.