Statistical data mining
5.6 Graphical models
5.6.1 Symmetric graphical models
In symmetric graphical models, the probability distribution is Markovian with respect to the specified undirected graph. This is equivalent to imposing on the distribution a number of probabilistic constraints known as Markov properties. The constraints can be expressed in terms of conditional independence relationships. Here are two Markov properties and how to interpret them:
• For the pairwise Markov property, if two nodes are not adjacent in the fixed graph, the two corresponding random variables will be conditionally independent, given the others. On the other hand, if the specified probability distribution is such that X ⊥ Y | others, the edge between the nodes corresponding to X and Y has to be omitted from the graph.
• For the global Markov property, if two sets of variables, U and V, are graphically separated by a third set of variables, W, then it holds that U ⊥ V | W. For example, consider four discrete random variables, W, X, Y, and Z, whose conditional independence relations are described by the graph in Figure 5.10, from which we have that W and Z are separated by X and Y, and Y and Z are separated by X. A Markovian distribution with respect to the graph in Figure 5.10 has to satisfy the global Markov property and therefore it holds that W ⊥ Z | (X, Y) and Y ⊥ Z | (W, X).
Figure 5.10 Illustration of the global Markov property (undirected graph on the nodes W, X, Y and Z).
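The graphical separation behind the global Markov property can be checked mechanically. The sketch below is a minimal illustration, not the book's procedure: the edge set (W-X, W-Y, X-Y, X-Z) is an assumption chosen to be compatible with the separations stated for Figure 5.10, since the figure itself is not reproduced here, and the function name `separates` is hypothetical.

```python
from collections import deque


def separates(adjacency, set_u, set_v, set_w):
    """Return True if every path from set_u to set_v passes through set_w."""
    blocked = set(set_w)
    targets = set(set_v) - blocked
    queue = deque(node for node in set_u if node not in blocked)
    seen = set(queue)
    while queue:
        node = queue.popleft()
        if node in targets:
            return False  # reached V without crossing W
        for neighbour in adjacency[node]:
            if neighbour not in blocked and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return True


# Assumed edge set (hypothetical): W-X, W-Y, X-Y, X-Z.
graph = {
    "W": {"X", "Y"},
    "X": {"W", "Y", "Z"},
    "Y": {"W", "X"},
    "Z": {"X"},
}

w_z_given_xy = separates(graph, {"W"}, {"Z"}, {"X", "Y"})  # W and Z given (X, Y)
y_z_given_wx = separates(graph, {"Y"}, {"Z"}, {"W", "X"})  # Y and Z given (W, X)
w_z_given_y = separates(graph, {"W"}, {"Z"}, {"Y"})        # Y alone does not block W-X-Z
```

Under this assumed graph the first two checks succeed and the third fails, because the path W-X-Z avoids Y. A breadth-first search is used because separation only asks whether any unblocked path exists.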
It is useful to distinguish three types of symmetric graphical models:
• Discrete graphical models coincide with log-linear graphical models and are used when all the available variables are categorical.
• Graphical Gaussian models are used when the joint distribution of all variables is multivariate Gaussian.
• Mixed graphical models are used for a mixture of categorical variables and multivariate Gaussian variables.
We have seen discrete graphical models in Section 5.5.3. A similar type of symmetric model, useful for descriptive data mining, can be introduced for continuous variables. An exhaustive description of these models can be found in Whittaker (1990), who called them Gaussian graphical models, even though they were previously known in the statistical literature as covariance selection models (Dempster, 1972). For these models, it is assumed that $Y = (Y_1, \ldots, Y_q)$ is a vector of continuous variables with a multivariate normal distribution. The Markov properties allow us to show that two variables are conditionally independent given all the others if and only if the element corresponding to the two variables in the inverse of the variance–covariance matrix is zero. This is equivalent to saying that the partial correlation coefficient between the two variables, given the others, is zero. In terms of conditional independence graphs, given four variables X, Y, W, Z, if the elements $k_{x,z}$ and $k_{y,w}$ of the inverse of the variance–covariance matrix were zero, the edges between the nodes X and Z and between the nodes Y and W would have to be absent. From a statistical viewpoint, a graphical Gaussian model and, equivalently, a graphical representation are selected by successively testing hypotheses of edge removal or addition. This is equivalent to testing whether the corresponding partial correlation coefficients are zero.
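The link between zeros in the inverse variance–covariance matrix and zero partial correlations can be illustrated numerically. The following sketch assumes NumPy; the matrix `K` is a made-up example for the variables (X, Y, W, Z), and the function names are hypothetical. The deviance function uses the standard expression for excluding a single edge from the saturated Gaussian model, $-n \log(1 - r^2)$, where $r$ is the sample partial correlation.

```python
import numpy as np


def partial_correlations(covariance):
    """Partial correlation of each pair of variables given all the others."""
    precision = np.linalg.inv(covariance)
    scale = np.sqrt(np.diag(precision))
    partial = -precision / np.outer(scale, scale)
    np.fill_diagonal(partial, 1.0)
    return partial


def edge_exclusion_deviance(partial_corr, n_obs):
    """Deviance for dropping one edge from the saturated Gaussian model."""
    return -n_obs * np.log(1.0 - partial_corr ** 2)


# Hypothetical inverse covariance for (X, Y, W, Z) with k_xz = k_yw = 0.
K = np.array([
    [2.0, 0.6, 0.5, 0.0],
    [0.6, 2.0, 0.0, 0.4],
    [0.5, 0.0, 2.0, 0.7],
    [0.0, 0.4, 0.7, 2.0],
])
sigma = np.linalg.inv(K)           # implied variance-covariance matrix
rho = partial_correlations(sigma)  # rho[0, 3] and rho[1, 2] vanish up to rounding

# Deviance for the X-Y edge if rho[0, 1] were estimated from 51 observations.
deviance_xy = edge_exclusion_deviance(rho[0, 1], n_obs=51)
```

In practice the function would be applied to a sample covariance matrix, and each deviance would be compared with a chi-squared distribution with one degree of freedom to decide whether the edge can be removed.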
Notice how the treatment of the continuous case is similar to the discrete case. This has allowed us to introduce a very general class of mixed symmetric graphical models. We now introduce them in a rather general way, including continuous and discrete graphical models as special cases. Let $V = \Delta \cup \Gamma$ be the vertex set of a graph, partitioned into a set $\Gamma$ of $|\Gamma|$ continuous variables and a set $\Delta$ of $|\Delta|$ discrete variables. If a random variable $X_v$ is associated with each vertex $v$, the whole graph is associated with a random vector $X_V = (X_v, v \in V)$. A mixed graphical model is defined by a conditional Gaussian distribution for the vector $X_V$. Partition $X_V$ into a vector $X_\Delta$ containing the categorical variables and a vector $X_\Gamma$ containing the continuous variables. Then $X_V$ follows a conditional Gaussian distribution if it satisfies these two conditions:
• $p(i) = P(X_\Delta = i) > 0$
• $p(X_\Gamma \mid X_\Delta = i) = N_{|\Gamma|}(\xi(i), \Sigma(i))$
where the symbol $N_{|\Gamma|}$ indicates a Gaussian distribution of dimension $|\Gamma|$ with mean vector $\xi(i) = K(i)^{-1} h(i)$ and positive definite variance–covariance matrix $\Sigma(i) = K(i)^{-1}$. In words, a random vector is distributed as a conditional Gaussian if the distribution of the categorical variables is described by a set of positive cell probabilities (this could happen through the specification of a log-linear model) and the continuous variables are distributed, conditional on each joint level of the categorical variables, as a Gaussian distribution with a mean vector and a variance–covariance matrix that can, in general, depend on the levels of the categorical variables.
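The two defining conditions translate directly into a two-stage sampling scheme: first draw a cell of the categorical variables, then draw the continuous vector from the Gaussian attached to that cell. The sketch below assumes NumPy; all parameter values and the function names `cg_moments` and `sample_cg` are hypothetical, chosen only to illustrate the parameterisation $\xi(i) = K(i)^{-1} h(i)$, $\Sigma(i) = K(i)^{-1}$.

```python
import numpy as np


def cg_moments(h, K):
    """Mean xi = K^{-1} h and covariance Sigma = K^{-1} for one cell."""
    sigma = np.linalg.inv(K)
    return sigma @ h, sigma


def sample_cg(cell_probs, h_by_cell, K_by_cell, size, rng):
    """Draw (cell index, continuous vector) pairs from a conditional Gaussian."""
    cells = rng.choice(len(cell_probs), size=size, p=cell_probs)
    dim = len(h_by_cell[0])
    values = np.empty((size, dim))
    for i in range(len(cell_probs)):
        mask = cells == i
        xi, sigma = cg_moments(h_by_cell[i], K_by_cell[i])
        values[mask] = rng.multivariate_normal(xi, sigma, size=int(mask.sum()))
    return cells, values


# Hypothetical example: one binary variable and two continuous variables.
cell_probs = [0.4, 0.6]  # p(i) > 0 for every cell
h_by_cell = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
K_by_cell = [
    np.array([[2.0, 0.5], [0.5, 1.0]]),
    np.array([[1.0, 0.0], [0.0, 4.0]]),
]

rng = np.random.default_rng(0)
cells, values = sample_cg(cell_probs, h_by_cell, K_by_cell, 1000, rng)
xi_1, sigma_1 = cg_moments(h_by_cell[1], K_by_cell[1])
```

Here the matrices `K_by_cell` differ between cells, so the example is heterogeneous; using the same matrix for every cell would give a homogeneous model.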
From a probabilistic viewpoint, a symmetric graphical model is specified by a graph and a family of probability distributions that satisfies the Markov properties with respect to it. However, to use graphical models in real applications, it is necessary to completely specify the probability distribution, usually by estimating the unknown parameters on the basis of the data. This inferential task, usually accomplished by maximum likelihood estimation, is called quantitative learning. Furthermore, in data mining problems it is difficult to avoid uncertainty when specifying a graphical structure, so alternative graphical representations have to be compared and selected, again on the basis of the available data; this constitutes the so-called structural learning task, usually tackled by deviance-based statistical tests.
To demonstrate this approach, we can return to the European software industry application in Section 4.6 and try to describe the associative structure among all seven random variables considered. The graph in Figure 5.11 is based on hypotheses formulated through subject matter research by industrial economics experts; it shows conditional independence relationships between the available variables. One objective of the analysis is to verify whether the graph in Figure 5.11 can be simplified while maintaining a good fit to the data (structural learning). Another objective is to verify some research hypotheses on the sign of the association between some variables (quantitative learning).
We begin by assuming a probability distribution of conditional Gaussian type and, given the small sample size (51 observations), a homogeneous model (Lauritzen, 1996). A homogeneous model means we assume that the variance of the continuous variable does not depend on the levels of the qualitative variables. So that we can measure explicitly the effect of the continuous variable Y on the qualitative variables, we have decided to maintain, in all considered models, a link between Y and the qualitative variables, even when it is not significant on the basis of the data. The symmetric model for the complete graph will therefore contain a total of 129 parameters.
Figure 5.11 Initial conditional independence graph (nodes Y, A, S, I, M, H and N).
It is appropriate to start the selection from the initial research graph (Figure 5.11). Since the conditional Gaussian distribution has to be Markovian with respect to this graph, all the parameters involving the pairs {M, A}, {N, I}, {M, N}, {M, I}, {A, N}, {A, I} have to be 0, hence the total number of parameters in the model corresponding to Figure 5.11 is 29. Considering the low number of available observations, this model is clearly overparameterised.
A very important characteristic of graphical models is that they permit local calculations on each clique of the graph (Frydenberg and Lauritzen, 1989). For instance, as the above model can be decomposed into four cliques, it is possible to estimate the parameters separately on each clique, using the 51 available observations to estimate the 17 parameters of each marginal model. In fact, on the basis of a backward selection procedure using a significance level of 5%, Giudici and Carota (1992) obtained the final structural model shown in Figure 5.12. From the figure we deduce that the only direct significant associations between qualitative variables are between the pairs {H, I}, {N, S} and {N, H}. These associations depend on the revenue Y but not on the remaining variables.
Figure 5.12 Final structural model (nodes Y, A, S, I, M, H and N).
Concerning quantitative learning, the same authors used their final model to calculate the odds ratios between the qualitative variables, conditional on the level of Y. They obtained the following estimated conditional odds ratios for the pairs HI, NH and NS:
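$\widehat{\mathrm{OR}}_{IH \mid R} = \exp(0.278 + 0.139 \log R)$, therefore $\widehat{\mathrm{OR}}_{IH \mid R} > 1$ for $R > 0.135$ (all enterprises)
$\widehat{\mathrm{OR}}_{NH \mid R} = \exp(-2.829 + 0.356 \log R)$, therefore $\widehat{\mathrm{OR}}_{NH \mid R} > 1$ for $R > 2856$ (one enterprise)
$\widehat{\mathrm{OR}}_{NS \mid R} = \exp(0.827 - 0.263 \log R)$, therefore $\widehat{\mathrm{OR}}_{NS \mid R} > 1$ for $R < 23.21$ (23 enterprises)
Here $R$ denotes the revenue and $\widehat{\mathrm{OR}}_{AB \mid R}$ the estimated odds ratio between A and B given R. The thresholds are expressed on the revenue scale: for example, $0.278 + 0.139 \log R > 0$ exactly when $R > e^{-2} \approx 0.135$.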
The signs of the associations can be summarised as follows: the association between I and H is positive; the association between N and H is substantially negative; the association between N and S is positive only for enterprises having revenues less than the median.
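The revenue thresholds quoted with the estimated odds ratios can be recomputed from the coefficients. The sketch below rests on two assumptions that are not stated explicitly in the text: revenue enters the exponent on the natural-log scale, and the NS intercept is +0.827. Both are chosen because they reproduce the printed thresholds of 0.135 and 23.21. The dictionary and function names are hypothetical.

```python
import math

# (intercept, slope) of the log odds ratio as a function of log revenue.
# Assumption: the NS intercept is positive, which matches the printed threshold.
log_odds_coefficients = {
    "IH": (0.278, 0.139),
    "NH": (-2.829, 0.356),
    "NS": (0.827, -0.263),
}


def conditional_odds_ratio(pair, revenue):
    """Estimated odds ratio for a pair of variables at a given revenue."""
    intercept, slope = log_odds_coefficients[pair]
    return math.exp(intercept + slope * math.log(revenue))


def revenue_threshold(pair):
    """Revenue at which the estimated odds ratio equals 1."""
    intercept, slope = log_odds_coefficients[pair]
    return math.exp(-intercept / slope)


thresholds = {pair: revenue_threshold(pair) for pair in log_odds_coefficients}
```

With three-decimal coefficients the NH threshold comes out at roughly 2.8 thousand, close to the printed 2856; the gap is within what rounding of the coefficients can explain. A positive slope means the odds ratio exceeds 1 above the threshold (IH, NH), and a negative slope means it exceeds 1 below it (NS).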
From an economic viewpoint, these associations have a simple interpretation. The relationship between I and H confirms that enterprises which adopt a strategy of incremental innovations tend to increase their contacts with enterprises in the hardware sector. The strategy of creating radically new products is based on an opposite view. Looking at contacts exclusively within the software sector, small enterprises (having revenues less than the median) tend to fear that their innovations could be stolen or imitated, and they tend not to make contacts with other small companies. Large companies (having revenues greater than the median) do not fear imitations and tend to increase their contacts with other companies.
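The local calculations on cliques used in this application can be illustrated on a toy Gaussian example. The sketch below does not use the software industry data or the conditional Gaussian model fitted there. It assumes NumPy and a hypothetical three-variable chain graph 0-1-2, with cliques {0, 1} and {1, 2} and separator {1}, and applies the standard result for decomposable Gaussian graphical models: the fitted inverse covariance is the sum of the zero-padded inverses of the clique blocks of the sample covariance minus the same quantity for the separators.

```python
import numpy as np


def padded_inverse(S, index, dim):
    """Invert the block S[index, index] and embed it in a dim x dim zero matrix."""
    out = np.zeros((dim, dim))
    idx = np.asarray(index)
    out[np.ix_(idx, idx)] = np.linalg.inv(S[np.ix_(idx, idx)])
    return out


def decomposable_precision(S, cliques, separators):
    """Inverse covariance estimate assembled from clique-wise computations."""
    dim = S.shape[0]
    K = sum(padded_inverse(S, c, dim) for c in cliques)
    K = K - sum(padded_inverse(S, s, dim) for s in separators)
    return K


# Hypothetical sample covariance matrix for three variables.
S = np.array([
    [1.0, 0.5, 0.3],
    [0.5, 1.0, 0.4],
    [0.3, 0.4, 1.0],
])

K_hat = decomposable_precision(S, cliques=[[0, 1], [1, 2]], separators=[[1]])
sigma_hat = np.linalg.inv(K_hat)
```

Only 2 x 2 blocks are inverted, which is the point of working clique by clique. The fitted matrix `K_hat` has a zero in position (0, 2), reflecting the missing edge, while `sigma_hat` agrees with `S` on each clique.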