Statistical data mining
5.6 Graphical models
5.6.3 Graphical models versus neural networks
We have seen that constructing a statistical model is a long and conceptually complex process that requires the formulation of a series of formal hypotheses. On the other hand, a statistical model allows us to make predictions and simulate scenarios on the basis of explicit rules that are easily scalable, that is, rules that can be generalised to different data. In Chapter 4 we saw how computationally intensive techniques require a lighter analytical structure, allowing us to extract valuable information rapidly from large volumes of data. Their disadvantages are low transparency and low scalability. Here is a brief comparison to help underline the different concepts. We shall compare neural networks and graphical models; they can be seen as rather general examples of computational methods and statistical methods, respectively.
The nodes of a graphical model represent random variables, whereas in neural networks they are computational units, not necessarily random. In a graphical model an edge represents a probabilistic conditional dependency between the corresponding pair of random variables, whereas in a neural network an edge describes a functional relation between the corresponding nodes. Graphical models are usually constructed in three phases: (a) the qualitative phase establishes the conditional independence relationships among the random variables; (b) the probabilistic phase associates the graph with a vector of random variables having a Markovian distribution with respect to the graph; (c) the quantitative phase assigns the parameters (if known) that characterise the distribution in (b). Neural networks are constructed in three similar phases: (a) the qualitative phase establishes the organisation of the layers and the relationships among them; (b) the functional phase specifies the functional relationships between the layers; (c) the quantitative phase fixes the weights (if known) associated with the connections among the different nodes.
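The two three-phase constructions can be placed side by side in a small sketch. Everything specific in it is an assumption made for illustration and does not come from the text: the binary chain A → B → C, its probability tables, the 2-2-1 layer layout, the logistic activation, and all numeric values.

```python
import math
from itertools import product

# ----- Graphical model (assumed example: binary chain A -> B -> C) -----
# (a) Qualitative phase: the graph states that C is independent of A given B.
parents = {"A": [], "B": ["A"], "C": ["B"]}

# (c) Quantitative phase: P(variable = 1 | parent values); illustrative numbers.
p_one = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.7},
    "C": {(0,): 0.1, (1,): 0.6},
}

# (b) Probabilistic phase: a joint distribution that is Markovian with respect
# to the graph, p(a, b, c) = p(a) p(b | a) p(c | b).
def joint(assignment):
    prob = 1.0
    for var, pa in parents.items():
        key = tuple(assignment[p] for p in pa)
        p1 = p_one[var][key]
        prob *= p1 if assignment[var] == 1 else 1.0 - p1
    return prob

def total_probability():
    """Sum of the joint over all 2^3 configurations (should be 1)."""
    return sum(
        joint({"A": a, "B": b, "C": c}) for a, b, c in product((0, 1), repeat=3)
    )

def conditional_c_given(a, b):
    """P(C = 1 | A = a, B = b), computed from the joint distribution."""
    num = joint({"A": a, "B": b, "C": 1})
    den = sum(joint({"A": a, "B": b, "C": c}) for c in (0, 1))
    return num / den

# ----- Neural network (assumed example: 2 inputs, 2 hidden units, 1 output) -----
# (a) Qualitative phase: organisation of the layers.
layer_sizes = [2, 2, 1]

# (b) Functional phase: each unit applies a logistic function to a weighted sum.
def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# (c) Quantitative phase: weights and biases; one row of weights per unit.
weights = [
    [[0.5, -0.4], [0.3, 0.8]],  # input layer -> hidden layer
    [[1.0, -1.0]],              # hidden layer -> output layer
]
biases = [[0.0, 0.1], [-0.2]]

def forward(x):
    """Propagate an input vector through the layers, one layer at a time."""
    activations = list(x)
    for layer_w, layer_b in zip(weights, biases):
        activations = [
            logistic(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(layer_w, layer_b)
        ]
    return activations
```

In the graphical model, `conditional_c_given(0, 1)` and `conditional_c_given(1, 1)` agree, which is the conditional independence the graph encodes. In the network, `forward([1.0, 0.0])` is a deterministic function of its input, so its edges carry functional relations and not probabilistic ones.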
I believe that these two methodologies can be used in a complementary way. Taking a graphical model and introducing latent variables (variables that are not observed) confers two extra advantages. First, it allows us to represent a multilayer perceptron as a graphical model, so we can take formal statistical methods valid for graphical models and use them on neural networks (e.g. confidence intervals, rejection regions, deviance comparisons). Second, the use of a neural network in a preliminary phase could help to reduce the structural complexity of graphical models, reducing the number of variables and edges present, and doing it in a more computationally efficient way. Adding latent variables to graphical models, corresponding to purely computational units, allows us to enrich the model with non-linear components, as occurs with neural networks. For more on the role of latent variables in graphical models, see Cox and Wermuth (1996).
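One way to picture how latent variables bring non-linear components into a graphical model is the following sketch. It is an assumption of this example, not the construction used in the text: the hidden layer is replaced by two latent binary variables H1 and H2 in a directed graph X → (H1, H2) → Y, each with a logistic conditional probability, and the latent variables are then summed out. All weights and biases are hypothetical.

```python
import math
from itertools import product

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters: (weight, bias) of P(H_j = 1 | x) for each latent unit.
HIDDEN_PARAMS = [(2.0, -1.0), (-1.5, 0.5)]
# Hypothetical parameters of P(Y = 1 | h1, h2).
OUTPUT_WEIGHTS = (1.2, -0.7)
OUTPUT_BIAS = -0.3

def p_y_given_x(x):
    """P(Y = 1 | x), obtained by summing over the latent variables H1, H2."""
    total = 0.0
    for h in product((0, 1), repeat=len(HIDDEN_PARAMS)):
        # The latent variables are conditionally independent given x.
        p_h = 1.0
        for h_j, (w, b) in zip(h, HIDDEN_PARAMS):
            p1 = logistic(w * x + b)
            p_h *= p1 if h_j == 1 else 1.0 - p1
        # Conditional distribution of the response given the latent variables.
        p_y = logistic(sum(w * h_j for w, h_j in zip(OUTPUT_WEIGHTS, h)) + OUTPUT_BIAS)
        total += p_h * p_y
    return total
```

Because the result is still a probability distribution defined on a graph, the usual probabilistic machinery applies to it, while the dependence of `p_y_given_x(x)` on `x` is non-linear, much as in a network with one hidden layer.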
5.7 Further reading
In this chapter we have reviewed the main statistical models for data mining applications. Their common feature is the presence of probabilistic modelling. This makes the results much easier to interpret but it may slow down the implementation and elaboration phases. I have tried to give an overview of the relevant literature.
We began with methods for modelling uncertainty and inference; there are many textbooks on this. One to consult is Mood, Graybill and Boes (1991); another is Azzalini (1992), which takes more of a modelling viewpoint. Non-parametric models are distribution-free, as they do not require heavy preliminary assumptions. They may be very useful, especially in an exploratory context. For a review of non-parametric methods see Gibbons and Chakraborti (1992). Semi-parametric models, based on mixture models, can provide a powerful probabilistic approach to cluster analysis. For an introductory treatment from a data mining viewpoint, see Hastie, Tibshirani and Friedman (2001).
Introduction of the Gaussian distribution allows us to bring regression methods into the field of normal linear models, and therefore to correlate the least squares method with measures of sample variability, as well as to provide thresholds for evaluating goodness of fit. For an introduction to the normal linear model, consult Mood, Graybill and Boes (1991) or a classic econometrics text such as Greene (1999). It is possible to develop the normal linear model into generalised linear models. For an introduction consult the original article of Nelder and Wedderburn (1972) and the books of Dobson (1990), McCullagh and Nelder (1989) and Agresti (1990).
Log-linear models are an important class of generalised linear models. They are symmetric models and are mainly used to obtain the associative structure among categorical variables, whose observations are classified in multiple contingency tables. Graphical log-linear models are particularly useful for data interpretation. For an introduction to log-linear models, look at the earlier texts or at Christensen (1997). For graphical log-linear models it is better to consult texts on graphical models, for example Whittaker (1990).
We introduced the concept of conditional independence (and dependence); graphical representation of conditional independence relationships allowed us to take what we saw for graphical log-linear models and generalise it to a wider class of statistical models, known as graphical models. Graphical models are very general statistical models for data mining. In particular, they can adapt to different analytical objectives, from predicting multivariate response variables (recursive models) to finding associative structure (symmetric models), in the presence of both qualitative and quantitative variables. For an introduction to graphical models, consult Edwards (1995), Whittaker (1990) or Lauritzen (1996). For directed graphical models, also known as probabilistic expert systems, see Cowell et al. (1999) or Jensen (1996).