Bayesian graphical model determination using decision theory

(1)

Bayesian graphical model determination using

decision theory

Jukka Corander

1

Rolf Nevanlinna Institute, P.O. Box 4, University of Helsinki, FIN-00014, Finland Received29 December 1999

Abstract

Bayesian model determination in the complete class of graphical models is considered using a decision theoretic framework within the regular exponential family. The complete class contains both decomposable and non-decomposable graphical models. A utility measure based on a logarithmic score function is introduced under reference priors for the model parameters. The logarithmic utility of a model is decomposed into predictive performance and relative complexity. Axioms of decision theory lead to the judgement of the plausibility of a model in terms of the posterior expected utility. This quantity has an analytic expression for decomposable models when certain reference priors are used and the exponential family is closed under marginalization. For non-decomposable models, a simulation consistent estimate of the expectation can be obtained. Both real and simulated data sets are used to illustrate the introduced methodology.

Keywords:Bayesian model determination; Entropy; Exponential family; Graphical models; Multinomial distribution; Multinormal distribution; Reference analysis; Utility

1. Introduction

During the past two decades, numerous researchers have shown the value and versatility of graphical models in multivariate analysis; for a comprehensive list of

E-mail address:[email protected].ﬁ.

1_{The author thanks Mattias Villani andthe two anonymous referees whose comments andsuggestions} ledto a substantial improvement of the original manuscript. This research has been supportedin part by Academy of Finland, Grant 50203.

(2)

references, see [17] or [29]. However, only recently, soundmethods for graphical model determination, dealing simultaneously with the uncertainty involved in the dependence structure and the parameters, have been discussed in the statistical literature. After a landmark paper by Dawid and Lauritzen[8], where the general principles for Bayesian analysis of decomposable graphical models were developed, Bayesian strategies for model determination have been considered by Madigan and Raftery[19], Madigan and York[20], Dellaportas andForster[9], Giudici and Green [11], andGiudici et al.[12].

Model determination strategies have mostly been considered within the class of decomposable graphical models, since the simple structure of such models reduces considerably the complexity of the model determination task. The exception is the paper by Dellaportas andForster [9] where also non-decomposable models were considered for multinomial data. Here, we develop a method for model determination in the complete class of graphical models, using a decision theoretic framework within the regular exponential family. In practical application of the proposedmethodwe focus on multinomial andmultinormal distributions.

Numerous authors have advocated the view of statistical inference as a special case of decision theory, one of the most impressive works on this area being the book by Bernardo and Smith[4]. Advantages of such an approach to solving the problem of learning about graphical model structure given the observed data are: (i) it provides a formal framework which necessitates only a manageable amount of effort in specifying a priori information, (ii) it facilitates a convenient communication of the results of learning, and(iii) it guarantees a coherent solution regardless of the amount of information available in the data.

In a decision theoretic formulation the plausibility of a graphical model in the light of data is measured in terms of its posterior expected utility. We propose the use of a utility function which is a proper score function of a logarithmic form, for a detailed derivation of these concepts, see[4]. The expectedutility ‘‘scores’’ a graphical model for its ability in predicting data, while weighting the ability against the complexity of the model, under the current uncertainty about the parameters.

Inspiredin particular by the work of Bernardo[3], we enter into a criterion for graphical model determination, which can be used both in the absence and presence of a priori information concerning model parameters. We concentrate on the former case and derive the criterion underreferencepriors. For a general discussion on such priors, see[15].

There has been a considerable interest in the statistical community to deﬁne Bayesian methods for model determination which necessitate as little effort as possible by the potential user. See, for instance,[2,16,23,26]. Apart from the methods basedon asymptotic approximations, such that the one by Schwarz [26], the common feature of the ‘‘automatic Bayesian’’ model determination methods is that their application to complex problems, such that the one considered in the current paper, is seldom described or even discussed.

Methods relying on asymptotic approximations are usually of the ‘‘penalized maximum likelihood’’ type, with various forms of penalty functions, and it is widely recognizedthat they have poor performance for small data sets. Here it will be

(3)

shown that the penalizedmaximum likelihoodmethods can be viewedas asymptotic approximations to the currently considered posterior expected utility, similarly as Smith andSpiegelhalter [27] showedthat they can be viewedin the framework of Bayes factors (for the deﬁnition of Bayes factor, see[14]).

The present paper is organizedas follows. The general concepts of graphical models are presented in Section 2. Model determination within the decision theoretic framework will be discussed in Section 3. Application of the introduced methodology to multinomial and multinormal data is considered in Sections 4 and 5, respectively. Some concluding remarks are given in Section 6.

2. Graphical models

We consider graphical modeling of distributions within the regular exponential family, for which a detailed theory is developed in[1,5].

Consider a ﬁnite setDofkstochastic variables with a joint sample spaceX;which is either a discrete set ID or Rk: The joint distribution of Dis characterizedby a

parameterhAHanda statistictðxÞ;xAX;such that the inner product/h;tðxÞSAR:

The density ofxwith respect to a s-ﬁnite measuremonX;is assumedto be

fðxjhÞ ¼expf/h;tðxÞSKðhÞg ð2:1Þ

so that the probabilistic behavior of the variablesDis described by an exponential model. It is assumed that (2.1) is strictly positive for allxAX:For a subsetaCD;the

marginal density ofxaAX_a is denoted byfaðxajhaÞ;whereh_a is a subvector ofh:

A generalization of the considered situation would allow the set D to be partitionedinto variables of discrete andcontinuous type. Within the regular exponential representation (2.1) there is then the possibility of using CG distributions introduced by Lauritzen and Wermuth[18]. However, the CG family does not obey closure under marginalization, which leads to technical difﬁculties and necessitates substantial further work in constructing a practically applicable method for model determination.

LetGbe the class of undirected graphical models, where eachGAGrestrictshto

take values in an afﬁne subspaceH_GofH:The terms graphical model and graph are usedinterchangeably in the sequel. For further details andexplanations concerning graphical models, see[30]or[17].

For any particular valuehAH;the projection hðGÞ ¼projðhjGÞ of h ontoH_G is

deﬁned as the unique parameter value, which minimizes the Kullback–Leibler divergence

Dðh;hðGÞÞ ¼

Z

fðxjhÞlog fðxjhÞ

fðxjhðGÞÞmðdxÞ ð2:2Þ

fromfðxjhÞtofðxjhðGÞÞ:TheentropyoffðjhÞis

hðfðjhÞÞ ¼

Z

(4)

Let CðGÞ denote the set of cliques of a graph G: When G is decomposable the

separatorsS of the cliques can be obtainedby a sequence of decompositions ofG

intoCðGÞ;see[17]. For such graphs the densityfðxjhðGÞÞcan be written as

fðxjhðGÞÞ ¼ Q cACðGÞ fcðxcjhcÞ Q sASðGÞ fsðxsjhsÞ : ð2:4Þ

For a decomposable model, divergence (2.2) can be written in terms of entropies of the marginal distributions, as

X

cACðGÞ

hðfcðjhcÞÞ X sASðGÞ

hðfsðjhsÞÞ hðfðjhÞÞ: ð2:5Þ

The standard condition for the decomposability of a graph G is that the graph shouldnot contain any chordless cycles of length four or larger. WhenG is non-decomposable, the density fðxjhðGÞÞ cannot be directly presented in terms of marginal densities as in (2.4), and the projection ofh cannot be expressedin a closedform, but it has to be foundby some iterative method. However, there is a possibility to utilize the concept of decomposability to represent the afﬁne restrictions imposed by a non-decomposable model, which is done in the following deﬁnition (originally introduced in[6]). Essentially the same idea was discovered by Rudas[25], andusedfor the maximum likelihoodestimation of graphical log-linear models.

Deﬁnition 2.1. For any two graphical models G;G0 in G; we let GCG0; if the

edges present in G are also present in G0: We say that G0 is a supermodel of G:

For any non-decomposable model G; let AG denote the class of minimal

decomposable supermodels, consisting of all G0 for which the following three conditions hold:

(1) G0 is decomposable, (2) GCG0;

(3) there does not exist any decomposableG00;such thatGCG00CG0:

The afﬁne restrictions toh imposedby a non-decomposableG are such that the densityfðxjhðGÞÞsatisﬁes factorization (2.4)simultaneouslyfor allG0_A_A

G:Assume

AGcontains themgraphsG1;y;Gm:It follows from the general results of Csiszar

[7], that the projectionhðGÞcan be obtainedas a limit of a cyclical projectionhðG1Þ;

hðG1ÞðG2Þ;y;hðGm1ÞðGmÞ; hðGmÞðG1Þ;y to the subspaces HG1;y;HGm: As

discussed in [25], this representation in terms of decomposable models has advantages over alternative, rather standard projection methods, such that those described in[28,30]. The above cyclical projection methodfacilitates both a simpler coding of the problem and a faster convergence to the parameter value minimizing (2.2).

(5)

3. Graphical model determination using decision theory

Let xdenote a set of n exchangeable observations x1;y;xn whose probabilistic

behavior is governed by (2.1). It is desired to determine the degrees of plausibility of the elements ofG under the current uncertainty about h: The statedproblem can formally be described as a decision problem, where the action to be taken is the choice of a model fromG: As discussed in[3,4], under quantitative coherence, we then needto specify: (i) the utility function uðGÞ_% ; measuring the desirability of choosingG;as a function of the plausibility of the parameter valueshAH_G;(ii) the

prior distributionpðhÞofh;(iii) the modelGwhich maximizes the expectedposterior utilityuðGj_% xÞ:

To specify the utility structure for the stateddecision problem, we use the general utility theory considered in [4]. A logarithmic utility function measures the appropriateness of an approximation to the densityfðxjhÞin terms of

alogfðxÞ þbðxÞ; xAX; ð3:1Þ

where a40 and bðÞ is an arbitrary real-valuedfunction. Let cðGÞ denote a cost function representing the relative complexity ofG:Since the data is assumed to be generatedfrom fðxjhÞ; the expectedlogarithmic utility of G before the data are actually observed, is deﬁned as

% uðGÞ ¼an Z fðxjhÞlogfðxjhðGÞÞmðdxÞ þn Z bðxÞfðxjhÞmðdxÞ cðGÞ; ð3:2Þ

where the ﬁrst part is proportional to the negative entropy of fðjhÞ minus the negative Kullback–Leibler divergence fromfðxjhÞto fðxjhðGÞÞ:

The maximization of the logarithmic utility does not depend on the actual form of the functionbðÞ;since the posterior expectation of the termnRbðxÞfðxjhÞmðdxÞdoes not depend onG:Hence, the optimal decision is to choose the Gwhich maximizes the posterior expectedlogarithmic utility

% uðGjxÞ ¼an Z Z fðxjhÞlogfðxjhðGÞÞmðdxÞ pðhjxÞdhcðGÞ: ð3:3Þ

The decision criterion thus involves the negative posterior expected entropy of the distribution with densityfðjhðGÞÞ:

WhenG is large, it may be impractical or not feasible to calculate the expected utilities for all models. In such situations, the model determination may be carried out by an efﬁcient heuristic search algorithm, such as the one considered in[19]. In the search among models, and in communicating the model determination results, it is useful to considerrelativeexpectedlogarithmic utilities on an exponential scale

expfuðGj% xÞg

P

GAG expfuðGj% xÞg

; ð3:4Þ

whereGcan be replaced by a subclass of models under consideration. The relative utilities also facilitate a weighting of the models in the light of data, which is useful for prediction purposes. The positive effect of taking into account several models by their posterior weights in prediction has been illustrated, for instance, in[19].

(6)

In contrast to the Markov chain Monte Carlo (MCMC) model determination methods considered in [9,11,12,20], models are visited only once in the search algorithm of Madigan and Raftery[19]. In an MCMC approach the plausibility of a model is measured by the number times that particular model is visited in a Markov chain over the model space. This is a potential problem, since whenGis large, the Markov chain may remain for long periods in non-representative parts ofG;making convergence to the target distribution slow. On the contrary, the algorithm of Madigan and Raftery[19]shouldbe capable of ﬁnding efﬁciently the relevant part of the model space, even for largeG:

LetGn

denote the model for whichHG¼H:The difference between the expected

utilitiesuðG% n jxÞanduðGj% xÞis an Z Z fðxjhÞlog fðxjhÞ fðxjhðGÞÞmðdxÞ pðhjxÞdh ðcðGn Þ cðGÞÞ; ð3:5Þ

where the ﬁrst part is the expectedlog-likelihoodratio under fðjhÞ: As notedby Bernardo [3], this quantity measures the amount of information about future observations which wouldbe neededto recover Gn

from G: Further, the utility constant a may be interpretedas the value of one unit of information about data generatedfromfðxjhÞ:

Recalling the deﬁnition of the density ofxaccording to a decomposable graph, we see that for such a model, the expected logarithmic utility uðGj_% xÞ involves only expectations of entropies of marginal distributions.

% uðGjxÞ ¼an X cACðGÞ hðfcðjhcÞÞ X sASðGÞ hðfsðjhsÞÞ 2 4 3 5pðhjxÞdhcðGÞ: ð3:6Þ

It will be shown in the subsequent sections that these expectedentropies have an analytic expression under a reference prior. For non-decomposable models, the expectedvalue of RfðxjhÞlogfðxjhðGÞÞmðdxÞ has to be calculatedby simulating values from pðhjxÞandprojecting each of these into HG to obtain a Monte Carlo

approximation. The computational burden due to this task might at ﬁrst sight appear prohibitive. Note, however, that when the expectation needs to be calculated for several models, the same set of simulated values frompðhjxÞcan be usedfor all models. Furthermore, it is necessary to calculate the Monte Carlo approximation only for the part ofG involving chordless cycles of length four or larger. In large graphs this is a real advantage, since a major part of the graph may be decomposable.

The computational burden may also be reduced by performing the search inGat two stages; at ﬁrst among the decomposable models, and then, given the graphs with the highest utilities, among the non-decomposable ones. Such a procedure will reduce considerably the number of non-decomposable models that need to be investigated.

Comparison of the expected utilities of different models is particularly simple for certain decomposable models. Consider the difference betweenuðGj% xÞand uðG% 0_j_x_Þ whereG0CG:If both these models are decomposable, andG0is obtainedfromGby

(7)

removing a single edgefd;d0g;the utility difference will simplify considerably. Note that the edgefd;d0gis containedin a single cliquecofG; andletcd¼c\_f_d0_g_;_c

d0 ¼ c\_f_d_g _and_c₀_¼_c\_f_d;d0g:Then, the utility difference equals

an

Z

½hðfcdðjhcdÞÞ þhðfcd0ðjhcd0ÞÞ hðfc0ðjhc0ÞÞ hðfcðjhcÞÞpðhjxÞdh

ðcðGÞ cðG0ÞÞ: ð3:7Þ

We now consider the choice ofaandcðGÞ;which, together with the choice of prior pðhÞforh;are crucial for the practical implementation of the model determination procedure. An obvious requirement to be met by a model determination procedure in the current context is that of asymptotic consistency, which means that as the sample size approaches inﬁnity, ‘‘the true model’’ among those inGwill be chosen with probability one. Although the truth of a model is merely a mathematical artifact, such a formulation provides valuable guidance in specifying a procedure which is applicable in practice. Notice that all models inGwith at least one missing edge are nested in the model corresponding to the complete graph, which by assumption gives the data generating mechanism. Therefore, the true model may be taken to correspondto the simplest possible graphGfor whichhAH_G:

LetqðGÞdenote the number of non-restricted elements inhAH_G;andletn-N;so that the posterior distribution approaches a Dirac spike at the valueh# maximizing the log-likelihood lðxjhÞ over H: At this limit, the expectedlogarithmic utility reduces to

alðxjh#ðGÞÞ cðGÞ: ð3:8Þ

It is clear from (3.3) that the choice ofadoes not affect the orderof the models with respect touðGj_% xÞ;sinceais constant overG:On the other hand, for the simplicity of interpretation of the relative utilities (3.4), one may seta¼1:Now, any choice of

cðGÞwhich involves qðGÞ;andnot n;will lead(asymptotically) to a non-consistent Akaike-type penalizedmaximum likelihoodcriterion. Whereas, ifcðGÞis set equal to

qðGÞðlognÞ=2;the logarithmic utility will tendto theSchwarz criterion[26], which is consistent in the present context. However, since the expectedutility will be lower than lðxjh#ðGÞÞ for ﬁnite samples, the cost qðGÞðlognÞ=2 will produce even more conservative results than the Schwarz criterion.

One possible solution to the problem of specifying the costcðGÞis to seek for the

least conservative asymptotically consistent criterion. Such a criterion, leading to

cðGÞ ¼qðGÞlog logn;was introduced in[13], for the determination of the order of autoregression. Heuristics suggest that their result holds more generally in the exponential family, as does that of Schwarz [26], but this is sincerely difﬁcult to demonstrate rigorously.

Although the asymptotic behavior of uðGj_% xÞis usedto provide insight into the problem of choosing cðGÞ; it shouldbe notedthat uðGj_% xÞ is not derived as an approximation to any other quantity. The choice ofcðGÞcorresponds to the choice of the degree of uncertainty in a prior distribution to be used for the calculation of the marginal likelihood of a model. The marginal likelihood is deﬁned as the

(8)

expectation of the likelihoodwith respect to prior RLðxjhÞpðhÞdh; andit is the standard Bayesian model determination criterion used in all papers concerning graphical model determination listed in the ﬁrst section.

A disadvantage of approximations of type (3.8) is that they are linear with respect to the model complexity for a ﬁxedn:Whennis small, these approximations tendto breakdown since they do not take into account the curvature in the likelihood around h#ðGÞ: In this respect uðGj_% xÞ behaves as a marginal likelihood, since the expectation of the logarithmic utility with respect to posterior penalizes a model for increasedcomplexity in a non-linear manner.

As notedin[3], in scientiﬁc applications of statistical inference there is usually a pragmatic needfor a model-basedprior, which has a minimal effect to the posterior inference relative to the data. Priors with such a desired property are often called

reference or non-informative, reﬂecting the fact they are intended to be utilized without further subjective assessment of prior hyper parameters. This is a particularly important point, since it can be extremely difﬁcult to specify reasonable subjective priors when the number of variables is large, see the discussion in[11].

When the expectedutilityuðGj% xÞis basedon a reference prior, andon the choices

cðGÞ ¼qðGÞlog logn; a¼1; we deﬁne it as the reference criterion for graphical model determination. To summarize the advantages of this criterion: (i) it provides a uniﬁed approach to graphical model determination in the complete class of graphical models, (ii) it is consistent while having desirable small sample properties, (iii) it does not require further setting of prior hyperparameters, and(iv) it facilitates a simple communication of the model determination results.

4. Multinomial case

We now consider graphical model determination for multinomial data. Let the ﬁnite set Id index the possible outcomes of dAD: On the set I_D¼ d_A_DI_d we

assume a multinomial distribution with the probabilities pD: For any aCD; the

corresponding marginal distribution with probabilities pa is deﬁned on Ia¼ dAaId:We writenD¼ ðnðxÞ ¼

Pn

i¼1 Iðxi¼xÞ;xAIDÞfor the observedcounts of

the different outcomes.

LetkDbe a vector of constantsðlðxÞ;xAIDÞ:Assuming the prior distribution for pD is the Dirichlet ðkDÞdistribution, the corresponding posterior is DirichletðkDþ nDÞ: If the Jeffreys’ reference prior is used, then lðxÞ ¼1=2;xAI_D: However, as

discussed in[9], for a largeIDthis choice leads to marginal priors that are not vague.

A more reasonable, symmetric prior is obtainedby settinglðxÞequal to 1=jIDjfor allxAI_D:This prior was originally suggestedin [24]. Given the Dirichlet-posterior

forpD;marginal posteriors for any subsetaCDare straightforwardto obtain.

For decomposable graphical models, the expected logarithmic utility involves posterior expectations of the entropies fhðfcðjhcÞÞ;hðfsðjhsÞÞ; cACðGÞ; sASðGÞg:

(9)

following expression, see,[31],

X

xAID

lðxÞ þnðxÞ

nþlðxÞjIDj½cðlðxÞ þnðxÞ þ1Þ cðnþlðxÞjIDj þ1Þ; ð4:1Þ

wherecðÞis the digamma function. As illustrated in[31], whennis relatively small andthe cardinality ofIDis large, the above Bayesian estimator of the entropy tends

to be more stable andyieldlarger values than the maximum likelihoodestimator, which is obtainedby replacing the probabilities by relative frequencies in the deﬁnition of entropy.

We now illustrate the model determination methodology using the risk factor data from Whittaker [30, p. 268]. In the present example andin the next section, the variables are representedby the integers 1;y;4 in the given order. The risk factor

data consists of a cross-classiﬁcation of 1841 car factory employees with respect to four binary variables: smoking, strenuous mental work, strenuous physical work, andratio of lipoproteins.

For k¼4 there are 64 models in G; of which 3 are non-decomposable, so it is feasible to compute the expectedlogarithmic utility for all models. To estimate the expected entropy under a non-decomposable model, we experimented with various numbers of parameter values, andstable estimates were already obtainedby using 1000 simulatedvalues from the posterior. All estimates usedin the present paper are basedon 10,000 values from the posterior.

The results of model determination are presented in Table 1, using the relative utilities (3.4). There is no ambiguity concerning the optimal model, which could be expectedbearing the large sample size in mind. The optimal model appears to be non-decomposable, which illustrates the disadvantage of model determination strategies which can be applied only to decomposable models. We note that the current model determination approach for multinomial distributions is expected to produce very similar results as that of Dellaportas and Forster[9].

Table 1

(10)

5. Multinormal case

We now consider model determination in the class of graphical Gaussian, or covariance selection models[10,30]. In this class, the joint distribution ofDequals a multinormal distribution Nkð0;RÞ: Let S denote the observed covariance matrix basedon a sample of sizen:

The entropy ofNkð0;RÞis given by

hðRÞ ¼k

2ð1þlogð2pÞÞ þ 1

2logjRj: ð5:1Þ

Since the maximum likelihoodestimate ofhðRÞis provided by replacingRwith the sample covariance matrixS;it tends to be lower than the true value, asEjSjojRjfor

k41; see, for instance, [22]. To derive an expression for the Bayesian estimate, we needthe following lemma, which follows by an obvious modiﬁcation of the proof of Theorem 3.4.8 in[21].

Lemma 5.1. LetMfollow the inverse Wishart distribution W1

k ðU;mÞ;and letw

2_ð_n_Þ

denote an inverted chi-squared random variable withndegrees of freedom. Then,jMjis distributed asjUj Qki¼01 w2ðmiÞ;where w2ðÞare all independent.

Given Lemma 5.1, andthe expectations of the logarithms ofw2_ðÞ_and_w2_ðÞ_;_we

may state the following result.

Proposition 5.2. LetSbe a kk sample covariance matrix under Nkð0;RÞ:Under the reference prior pðRÞpjRjðqþ1Þ=2 and quadratic loss, the Bayesian estimate of the entropy hðRÞ is unbiased and equals hð%RÞ ¼k

2ð1þlogð2pÞ þlognÞ þ12ðlogjSj

log 2

S

k_i¼01ðcððn1iÞ=2ÞÞÞ:

The reference posterior basedon the priorpðRÞpjRjðqþ1Þ=2is the inverse Wishart distributionW1_ðn_S_;_n₁_Þ_; _{with the parameterization of Mardia et al.}_{[21]. As in}

the multinomial case, the above formula cannot be usedfor a non-decomposable part of the investigatedgraph, but the expectedentropy has to be foundby simulating covariance matrices from the posterior, andprojecting them to the corresponding afﬁne subspace.

To illustrate graphical model determination for multinormal data we consider the Fret’s heads data set, analyzed previously in [11,21,30]. The data consists of 25 measurements of the headlength andbreadth of the ﬁrst andsecondson, respectively. The models with largest relative expected utilities (3.4) are given in Table 2. For comparison, the posterior model probabilities obtained in[11]are also given.

Situation is quite opposite to the previous example, as there is a considerable uncertainty about the dependence structure. The relative expected utilities differ from the posterior probabilities obtainedin[11], since their analysis was restrictedto decomposable models only, and favors thus unnecessary complex models according

(11)

to our results. However, the ordering of the decomposable models is roughly the same according to the posterior probabilities and the relative utilities based on the cost functionqðGÞlog logn:

To investigate performance of the model determination procedure based on different cost functions more systematically, we generated data sets from multi-normal distributions at various parameter values. The entropy criteria with the cost functionsqðGÞðlognÞ=2 andqðGÞlog logn are in the sequel referredto as EC1 and EC2, respectively. The SBC criterion of Schwarz [26] was also included in the comparison.

Four different trivariate normal distributions, which were identical except for the covariance between two of the variables, were investigated. The means and variances were chosen to be 0 and10, respectively. The two ﬁxedcovariances were set to 5 and the thirdðyÞwas given the values 2:5;3;4;5;where the ﬁrst value corresponds to a model with conditional independence between two of the variables. These models are referredto as model 1;y;4; respectively. The permutation of variable labels corresponding to the covariance matrix

10 5 y 10 5 10 0 B @ 1 C A

was chosen. With three variables, there are 8 models in the classG:The sample sizes 20;30;40;50;75;100;200 and300, with 50,000 replications of each, were used. The proportions of replicates for which the correct model was indicated as the optimal

Table 2

The models with largest relative expected utilities (3.4) for the Fret’s heads data

The usedcost functionsqðGÞðlognÞ=2 (1),qðGÞlog logn(2) are indicated at the beginning of each row. For comparison, the posterior model probabilities (3) obtained in[11]are given.

(12)

one are given inTable 3. For the investigatedmodels EC2 is nearly uniformly better in indicating the correct model than the other two criteria, the exception being model 1 for which SBC performs best. On the other hand, EC1 performs nearly uniformly worst, as might be expectedon the basis of the discussion in Section 3. The results thus strongly discourage the use of the cost functionqðGÞðlognÞ=2 in the current model determination procedure.

6. Discussion

We have demonstrated the importance of considering both decomposable and non-decomposable models in graphical model determination. Non-decomposable models arise frequently in applications, and may represent a subject-matter theory about a dependence structure equally well as decomposable models. Therefore, learning about dependence structures should not be restricted to decomposable models on the basis of mathematical and computational convenience.

In addition to the identiﬁcation of the ‘‘optimal’’ model in the light of data, it is important to consider the degrees of optimality of various competing models, as illustratedin our examples. When the relative utilities are not strongly supporting a single model, they are useful in communicating the model determination results and facilitate a weighted posterior inference about any model-dependent quantity of interest.

Although we have presented a uniﬁed approach to graphical model determination within the exponential family, some substantial matters remain for further research. First, a practically applicable methodfor the calculation of expectedutilities has still to be developed for the CG distributions of Lauritzen and Wermuth[18]. Second, it would be useful to generalize the currently considered model class to allow for graphs with directed edges. Finally, the current approach should be extended to handle situations with missing data. It is worth to notice that this important issue has Table 3

Proportions of replicates for which the correct models were indicated as optimal using different criteria

Model n 20 30 40 50 75 100 200 300 1 EC1 0.4050 0.6046 0.7595 0.8537 0.9532 0.9760 0.9871 0.9898 1 EC2 0.4724 0.6669 0.7993 0.8698 0.9350 0.9497 0.9614 0.9648 1 SBC 0.4830 0.6692 0.7976 0.8705 0.9444 0.9626 0.9780 0.9827 2 EC1 0.0139 0.0094 0.0091 0.0140 0.0281 0.0401 0.0610 0.0746 2 EC2 0.0251 0.0220 0.0308 0.0453 0.0729 0.0919 0.1317 0.1667 2 SBC 0.0266 0.0214 0.0266 0.0360 0.0549 0.0671 0.0901 0.1077 3 EC1 0.0041 0.0032 0.0137 0.0395 0.1523 0.2887 0.6332 0.8216 3 EC2 0.0095 0.0185 0.0592 0.1207 0.3070 0.4619 0.7813 0.9118 3 SBC 0.0011 0.0132 0.0520 0.0984 0.2350 0.3932 0.7089 0.8680 4 EC1 0.0021 0.0051 0.0290 0.0842 0.3325 0.5972 0.9757 0.9990 4 EC2 0.0070 0.0336 0.1130 0.2268 0.5472 0.7744 0.9928 0.9999 4 SBC 0.0065 0.0311 0.0940 0.1916 0.4783 0.7138 0.9860 0.9995

(13)

not so far been treatedin any of the papers concerning Bayesian graphical model determination.

References

[1] O. Barndorff-Nielsen, Information and Exponential Families in Statistical Theory, Wiley, New York, 1978.

[2] J. Berger, L. Pericchi, The intrinsic Bayes factor for model selection and prediction, J. Amer. Statist. Assoc. 91 (1996) 109–122.

[3] J. Bernardo, Nested hypothesis testing: the Bayesian reference criterion, in: J. Bernardo, J. Berger, A. Dawid, A. Smith (Eds.), Bayesian Statistics, Vol. 6, Oxford University Press, Oxford, 1999, pp. 101–130.

[4] J. Bernardo, A. Smith, Bayesian Theory, Wiley, Chichester, 1994.

[5] L. Brown, Fundamentals of Statistical Exponential Families, Institute of Mathematical Statistics, Hayward, 1986.

[6] J. Corander, Graphical model selection for multinomial data using information divergence, Research Report, Department of Statistics, Stockholm University, 1998.

[7] I. Csiszar, I-divergence geometry of probability distributions and minimization problems, Ann. Probab. 3 (1975) 146–158.

[8] A. Dawid, S. Lauritzen, Hyper-Markov laws in the statistical analysis of decomposable graphical models, Ann. Statist. 21 (1993) 1272–1317.

[9] P. Dellaportas, J. Forster, Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models, Biometrika 86 (1999) 615–633.

[10] A. Dempster, Covariance selection, Biometrics 28 (1972) 157–175.

[11] P. Giudici, P. Green, Decomposable graphical Gaussian model determination, Biometrika 86 (1999) 785–801.

[12] P. Giudici, P. Green, C. Tarantola, Efﬁcient model determination for discrete graphical models, Technical Report, 1999, available atwww.statslab.cam.ac.uk/MCMC/.

[13] E. Hannan, B. Quinn, The determination of the order of an autoregression, J. Roy. Statist. Soc. B 41 (1979) 190–195.

[14] R. Kass, A. Raftery, Bayes factors, J. Amer. Statist. Assoc. 90 (1995) 773–795.

[15] R. Kass, L. Wasserman, The selection of prior distributions by formal rules, J. Amer. Statist. Assoc. 91 (1996) 1343–1370.

[16] J. Key, L. Pericchi, A. Smith, Bayesian model choice: what and why? in: J. Bernardo, J. Berger, A. Dawid, A. Smith (Eds.), Bayesian Statistics, Vol. 6, Oxford University Press, Oxford, 1999, pp. 343–370.

[17] S. Lauritzen, Graphical Models, Oxford University Press, Oxford, 1996.

[18] S. Lauritzen, N. Wermuth, Graphical models for associations between variables, some of which are qualitative andsome quantitative, Ann. Statist. 17 (1989) 31–54.

[19] D. Madigan, A. Raftery, Model selection and accounting for model uncertainty in graphical models using Occam’s window, J. Amer. Statist. Assoc. 89 (1994) 1535–1546.

[20] D. Madigan, J. York, Bayesian graphical models for discrete data, Internat. Statist. Rev. 63 (1995) 215–232.

[21] K. Mardia, J. Kent, J. Bibby, Multivariate Analysis, Academic Press, London, 1979. [22] R. Muirhead, Aspects of Multivariate Statistical Theory, Wiley, New York, 1982.

[23] A. O’Hagan, Fractional Bayes factors for model comparisons, J. Roy. Statist. Soc. B 57 (1995) 99–138.

[24] W. Perks, Some observations on inverse probability including a new indifference rule, J. Inst. Actuaries 73 (1947) 285–334.

[25] T. Rudas, A new algorithm for the maximum likelihood estimation of graphical log-linear models, Comput. Statist. 13 (1998) 529–537.

(14)

[26] G. Schwarz, Estimating the dimension of a model, Ann. Statist. 6 (1978) 461–464.

[27] A. Smith, D. Spiegelhalter, Bayes factors andchoice criteria for linear models, J. Roy. Statist. Soc. B 42 (1980) 213–220.

[28] T. Speed, H. Kiiveri, Gaussian Markov distributions over ﬁnite graphs, Ann. Statist. 14 (1986) 138–150.

[29] N. Wermuth, Graphical Markov models, In: S. Kotz, C. Read, D. Banks (Eds.), Encyclopedia of Statistical Science, Update, Vol. 2, Wiley, New York, 1998, pp. 284–300.

[30] J. Whittaker, Graphical Models in Applied Multivariate Statistics, Wiley, Chichester, 1990. [31] L. Yuan, H. Kesavan, Bayesian estimation of Shannon entropy, Comm. Statist. Theory Methods 26