Model specification
Result 5. To compare two nested regression planes that differ in a single explanatory variable, say the (p + 1)th, present in one model but not in the other, the
4.15 Survival analysis models
Survival analysis (e.g. Singer and Willet, 2003) focuses on the time between entry to a study and some subsequent event. The standard approaches to survival analysis are stochastic; that is, the times at which events occur are assumed to be realisations of random processes. It follows that T, the event time for some particular individual, is a random variable with a probability distribution. A useful, model-free or non-parametric approach for all random variables uses the cumulative distribution function (e.g. Hougaard 1995). The cumulative distribution function of a variable T, denoted by F(t), tells us the probability that the variable will be less than or equal to some value t; that is, F(t) = P{T ≤ t}. If we know the value of F for every value of t, then we know all there is to know about the distribution of T. In survival analysis it is more common to work with a closely related function called the survivor function, defined as
S(t)=P{T > t} =1−F (t).
If the event of interest is a death (or, equivalently, a churn), the survivor function gives the probability of surviving beyond t. Because S is a probability we know that it is bounded by 0 and 1; and because T cannot be negative, we know that S(0) = 1. Often the objective is to compare survivor functions for different subgroups in a sample (clusters, regions, etc.). If the survivor function for one group is always higher than the survivor function for another group, then the first group clearly lives longer than the second group.
For continuous variables, another common way of describing their probability distribution is the probability density function. This function is defined as
$$f(t) = \frac{dF(t)}{dt} = -\frac{dS(t)}{dt};$$
that is, the probability density function is just the derivative or slope of the cumulative distribution function. For continuous survival data, the hazard function is more popular than the probability density function as a way of describing distributions. The hazard function (e.g. Allison 1995) is defined as
$$h(t) = \lim_{\Delta t \to 0} \frac{P\{t \le T < t + \Delta t \mid T \ge t\}}{\Delta t}.$$
This definition quantifies the instantaneous risk that an event will occur at time t. Since time is continuous, the probability that an event will occur exactly at time t is necessarily 0. But we can talk about the probability that an event occurs in the small interval between t and t + Δt, and we also want to make this probability conditional on the individual surviving to time t. For this formulation the hazard function is sometimes described as a conditional density and, when events are repeatable, the hazard function is often referred to as the intensity function. The survival function, the probability density function and the hazard function are equivalent ways of describing a continuous probability distribution. Another formula expresses the hazard in terms of the probability density function:
$$h(t) = \frac{f(t)}{S(t)},$$
which leads to
$$h(t) = -\frac{d}{dt} \log S(t).$$
Integrating both sides of this equation gives an expression for the survival function in terms of the hazard function:
$$S(t) = \exp\left(-\int_0^t h(u)\,du\right).$$
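The integration step can be written out in full. The following short derivation uses only the relation between h and log S and the boundary condition S(0) = 1 stated earlier; the constant-hazard case in the last line is an illustrative special case, with the rate λ introduced here as an assumption rather than taken from the text.

```latex
\begin{align*}
\int_0^t h(u)\,du
  &= -\int_0^t \frac{d}{du}\log S(u)\,du
   = -\bigl[\log S(t) - \log S(0)\bigr]
   = -\log S(t)
  && \text{since } S(0) = 1, \\
S(t) &= \exp\left(-\int_0^t h(u)\,du\right), \\
h(u) \equiv \lambda \;&\Longrightarrow\; S(t) = e^{-\lambda t},
  \quad f(t) = h(t)\,S(t) = \lambda e^{-\lambda t}.
\end{align*}
```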
The hazard is a dimensional quantity that expresses the number of events per interval of time.
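The link between the hazard and the survival function can be checked numerically. The sketch below is only an illustration: it assumes a Weibull hazard, which is not discussed in the text, and the parameter values and function names are hypothetical. It integrates the hazard with the trapezoidal rule and compares the result with the known closed form of the Weibull survival function.

```python
import math


def weibull_hazard(t, shape, scale):
    """Hazard of a Weibull distribution (an illustrative choice of h)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def survivor_from_hazard(hazard, t, steps=10_000):
    """Approximate S(t) = exp(-integral_0^t h(u) du) by the trapezoidal rule."""
    width = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for i in range(1, steps):
        total += hazard(i * width)
    return math.exp(-total * width)


# Hypothetical parameter values; shape > 1 keeps the hazard finite at t = 0.
shape, scale = 1.5, 2.0
t = 3.0

numeric = survivor_from_hazard(lambda u: weibull_hazard(u, shape, scale), t)
closed_form = math.exp(-((t / scale) ** shape))  # Weibull survival function

print(numeric, closed_form)
```

The two printed values should agree to several decimal places, which is what the formula for S(t) in terms of h predicts.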
The first step in the analysis of survival data (for descriptive purposes) is to plot the survival function and the risk. The survival function is estimated by the Kaplan–Meier method (Kaplan and Meier, 1958). Suppose that there are K distinct event times, t1 < t2 < . . . < tK. At each time tj there are nj individuals who are said to be at risk of an event; that is, they have not experienced an event and they have not been censored prior to time tj. If any cases are censored at exactly tj, they are also considered to be at risk at tj. Let dj be the number of individuals who die at time tj. The Kaplan–Meier estimator is defined as
$$\hat{S}(t) = \prod_{j:\, t_j \le t} \left(1 - \frac{d_j}{n_j}\right), \qquad t_1 \le t \le t_K.$$
This formula says that, for a given time t, we take all the event times that are less than or equal to t. For each of these event times, we compute the quantity in parentheses, which can be interpreted as the conditional probability of surviving to the next event time, tj+1, given that one has survived to time tj. Then we multiply all of these survival probabilities together.
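The product formula can be sketched in a few lines of Python. This is a minimal illustration, not a substitute for a statistical package: the function name `kaplan_meier` and the six-observation data set are hypothetical, and the code assumes that each individual contributes one follow-up time together with a flag saying whether the event was observed or the case was censored.

```python
from collections import Counter


def kaplan_meier(times, observed):
    """Return [(t_j, S_hat(t_j))] for each distinct event time t_j.

    times    -- follow-up time of each individual
    observed -- True if the event occurred at that time, False if censored
    """
    deaths = Counter(t for t, e in zip(times, observed) if e)
    survival = 1.0
    curve = []
    for t_j in sorted(deaths):
        # At risk at t_j: everyone whose follow-up time is at least t_j,
        # so cases censored exactly at t_j are still counted.
        n_j = sum(1 for t in times if t >= t_j)
        d_j = deaths[t_j]
        survival *= 1.0 - d_j / n_j
        curve.append((t_j, survival))
    return curve


# Hypothetical data: six individuals, two of them censored.
times = [2, 3, 3, 5, 5, 8]
observed = [True, True, False, True, True, False]

curve = kaplan_meier(times, observed)
for t_j, s in curve:
    print(t_j, round(s, 4))
```

With these data the risk sets have sizes 6, 5 and 3 at the event times 2, 3 and 5, so the estimate steps down to 5/6, then 5/6 × 4/5 = 2/3, then 2/3 × 1/3 = 2/9. The individual censored at time 3 is counted in the risk set at time 3, as the definition requires.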
For predictive purposes, the most popular model is Cox regression (Cox, 1972). Cox proposed a proportional hazards model and a new method of estimation that came to be called partial likelihood or, more accurately, maximum partial likelihood. We start with the basic model that does not include time-dependent covariates or non-proportional hazards. The model is usually written as
$$h(t_{ij}) = h_0(t_j) \exp\left(\beta_1 X_{1ij} + \beta_2 X_{2ij} + \cdots + \beta_p X_{pij}\right).$$
It says that the hazard for individual i at time t is the product of two factors: a baseline hazard function that is left unspecified, and a linear combination of a set of p fixed covariates, which is then exponentiated. The baseline function can be regarded as the hazard function for an individual whose covariates all have values 0. The model is called the proportional hazards model because the hazard for any individual is a fixed proportion of the hazard for any other individual. To see this, take the ratio of the hazards for two individuals i and j:
$$\frac{h_i(t)}{h_j(t)} = \exp\left\{\beta_1 (x_{i1} - x_{j1}) + \cdots + \beta_p (x_{ip} - x_{jp})\right\}.$$
What is important about this equation is that the baseline cancels out of the numerator and denominator. As a result, the ratio of the hazards is constant over time.
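The cancellation of the baseline can be illustrated numerically. In the sketch below every number is hypothetical: the baseline hazard, the coefficients and the covariate values are chosen only for illustration, and no estimation by partial likelihood is attempted. The code evaluates the hazards of two individuals at several times and compares their ratio with the exponential of the weighted covariate differences.

```python
import math


def cox_hazard(t, x, beta, baseline):
    """Proportional hazards form: h(t) = h0(t) * exp(beta . x), fixed covariates x."""
    linear = sum(b * xi for b, xi in zip(beta, x))
    return baseline(t) * math.exp(linear)


def baseline(t):
    # Hypothetical baseline hazard; any positive function of t would do.
    return 0.1 + 0.05 * t


beta = [0.7, -0.3]  # hypothetical coefficients
x_i = [1.0, 2.0]    # covariates of individual i
x_j = [0.0, 4.0]    # covariates of individual j

ratios = [
    cox_hazard(t, x_i, beta, baseline) / cox_hazard(t, x_j, beta, baseline)
    for t in (0.5, 1.0, 5.0, 20.0)
]
expected = math.exp(sum(b * (a - c) for b, a, c in zip(beta, x_i, x_j)))

print(ratios)
print(expected)
```

Every entry of `ratios` equals `expected` up to floating-point rounding, whatever baseline function is supplied, because the baseline appears in both numerator and denominator.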
4.16 Further reading
In this chapter we have reviewed the most important data mining methodologies, beginning in Sections 4.1–4.8 with those that do not strictly require a probabilistic model. The first section explained how to calculate a distance matrix from a data matrix. Sometimes we want to build a data matrix from a distance matrix, and one solution is the method of multidimensional scaling (e.g. Mardia et al., 1979). Having applied multidimensional scaling, it is possible to represent the row vectors (statistical observations) and the column vectors (statistical variables) in a unique plot called a biplot; this helps us to make interesting interpretations of the scores obtained. In general, biplots are used with tools for reducing dimensionality, such as principal component analysis and correspondence analysis. For an introduction to this important theme, see Gower and Hand (1996); in a data mining context, see Hand et al. (2001).
The next section was concerned with cluster analysis, probably one of the best-known techniques used in the statistical analysis of multidimensional data. An interesting extension of cluster analysis is fuzzy classification; this allows a ‘weighted’ allocation of the observations to the clusters (Zadeh, 1977).
Multivariate linear regression is best dealt with using matrix notation. For an introduction to matrix algebra in statistics, see Searle (1982). The logistic regression model is for predicting categorical variables. The estimated category probabilities can then be used to classify statistical observations in groups, according to a supervised method. Probit models, well known in economics, are essentially the same as logistic regression models, once the logistic link is replaced by the inverse of the standard Gaussian distribution function (e.g. Agresti, 1990).
Local model rules are still at an embryonic stage of development, at least from a statistical viewpoint. Association rules seem ripe for a full statistical treatment, and we have covered them in some depth. However, we have only briefly referred to retrieval-by-content methods, which are expected to gain in importance in the foreseeable future, especially with reference to text mining. See Hand et al. (2001) on retrieval by content and Zanasi (2003) on text mining. The statistical understanding of local models will be an important area for future research.
Tree models are probably the most popular data mining technique. A more detailed account can be found in advanced data mining textbooks, such as Hastie et al. (2001) and Hand et al. (2001). These texts offer a statistical treatment; a computational treatment can be found in Han and Kamber (2001). The original works on CART and CHAID are Breiman et al. (1984) and Kass (1980).
Neural networks and support vector machines are important classes of supervised models that originated in the machine learning communities. We have not considered support vector machines in this book because they do not provide explicit and transparent solutions and therefore are rarely used in business data mining problems. The literature on neural networks is vast; for a classical statistical approach, see Bishop (1995); for a Bayesian approach, consult Neal (1996). Support vector machines are discussed by Vapnik (1995, 1998).
In recent years statistical models of increasing complexity have been developed that closely resemble neural networks but have a more statistical structure. Examples are the projection pursuit models, the generalised additive models, and the multivariate adaptive regression spline models. For a review, see Cheng and Titterington (1994) and Hastie et al. (2001).
In Sections 4.9–4.15 we have reviewed the main statistical models for data mining applications. Their common feature is the presence of probability modelling. We began with methods for modelling uncertainty and inference; there are many textbooks on this. One to consult is Mood et al. (1991); another is Azzalini (1992), which takes more of a modelling viewpoint. Non-parametric models are distribution free, as they do not require heavy preliminary assumptions. They may be very useful, especially in an exploratory context. For a review of non-parametric methods, see Gibbons and Chakraborti (1992). Semiparametric models, based on mixture models, can provide a powerful probabilistic approach to cluster analysis. For an introductory treatment from a data mining viewpoint, see Hastie et al. (2001).
Introduction of the Gaussian distribution allows us to bring regression methods into the field of normal linear models. For an introduction to the normal linear model, see Mood et al. (1991) or a classic econometrics text such as Greene (2000). The need for predictive tools for response variables that are neither continuous nor normal led to the development of generalised linear models from the normal linear model. For an introduction, see Dobson (2002), McCullagh and Nelder (1989), Nelder and Wedderburn (1972) and Agresti (1990).
Log-linear models are an important class of generalised linear models. They are symmetric models and are mainly used to obtain the associative structure among categorical variables, whose observations are classified in multiple contingency tables. Graphical log-linear models are particularly useful for data interpretation. For an introduction to log-linear models, see Agresti (1990) or Christensen (1997). For graphical log-linear models it is better to consult texts on graphical models, such as Whittaker (1990).
We introduced the concept of conditional independence (and dependence); graphical representation of conditional independence relationships allowed us to take what we saw for graphical log-linear models and generalise it to a wider class of statistical models, known as graphical models. Graphical models are very general statistical models for data mining. In particular, they can adapt to different analytical objectives, from predicting multivariate response variables (recursive models) to finding associative structure (symmetric models), in the presence of both qualitative and quantitative variables. For an introduction to graphical models, see Edwards (2000), Whittaker (1990) or Lauritzen (1996). For directed graphical models, also known as probabilistic expert systems, see Cowell et al. (1999) or Jensen (1996).