Computational data mining
4.4 Logistic regression
Section 4.3 considered a predictive model for a quantitative response variable; this section considers a predictive model for a qualitative response variable. A qualitative response problem can often be decomposed into binary response problems (e.g. Agresti, 1990). The building block of most qualitative response models is the logistic regression model, one of the most important predictive data mining methods. Let yi (i = 1, 2, ..., n) be the observed values of a binary response variable, which can take only the values 0 or 1. The level 1 usually represents the occurrence of an event of interest, often called a ‘success’. A logistic regression model is defined in terms of fitted values to be interpreted as probabilities (Section 5.1) that the event occurs in different subpopulations:
$$\pi_i = P(Y_i = 1), \quad i = 1, 2, \ldots, n$$
More precisely, a logistic regression model specifies that an appropriate function of the fitted probability of the event is a linear function of the observed values of the available explanatory variables. Here is an example:
$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = a + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik}$$
The left-hand side defines the logit function of the fitted probability, logit(πi), as the logarithm of the odds for the event, namely the natural logarithm of the ratio between the probability of occurrence (success) and the probability of non-occurrence (failure):
$$\operatorname{logit}(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right)$$
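The definition of the logit can be checked numerically. The following is a minimal sketch in Python; the function names `odds` and `logit` and the probabilities passed to them are illustrative assumptions, not part of the text.

```python
import math

def odds(p):
    """Odds of the event: probability of success over probability of failure."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return p / (1.0 - p)

def logit(p):
    """Natural logarithm of the odds."""
    return math.log(odds(p))

# Illustrative probabilities only.
for p in (0.2, 0.5, 0.8):
    print(p, round(odds(p), 4), round(logit(p), 4))
```

A probability of 0.5 gives odds of 1 and a logit of 0; probabilities above 0.5 give positive logits and probabilities below 0.5 give negative ones.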
Once πi is calculated, on the basis of the data, a fitted value for each binary observation, ŷi, can be obtained by introducing a threshold value of πi above which ŷi = 1 and below which ŷi = 0. The resulting fit will seldom be perfect, so there will be a fitting error that will have to be kept as low as possible. Unlike linear regression, the observed response values cannot be decomposed additively as the sum of a fitted value and an error term.
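The thresholding step can be sketched as follows. The threshold of 0.5, the data, and the use of the misclassification rate as a measure of fitting error are all assumptions made for illustration; the text does not fix any of them.

```python
def classify(fitted_probs, threshold=0.5):
    """Return 1 where the fitted probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in fitted_probs]

def misclassification_rate(y_observed, y_fitted):
    """Share of observations whose fitted value differs from the observed one."""
    errors = sum(1 for y, y_hat in zip(y_observed, y_fitted) if y != y_hat)
    return errors / len(y_observed)

# Hypothetical observed responses and fitted probabilities (illustration only).
y = [1, 0, 1, 1, 0]
pi_hat = [0.91, 0.35, 0.42, 0.77, 0.58]

y_hat = classify(pi_hat)
print(y_hat, misclassification_rate(y, y_hat))
```

In this made-up example two of the five observations fall on the wrong side of the threshold, which illustrates why the fit is seldom perfect.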
The choice of the logit function to link πi to the linear combination of the explanatory variables is motivated by the fact that, with this choice, the probability tends towards 0 and 1 gradually, and these limits are never exceeded, which guarantees that πi is a valid probability. A linear regression model would be inappropriate for predicting a binary response variable, simply because a linear function is unbounded, so the model could predict values for the response variable outside the interval [0,1], which would be meaningless. Other types of link are possible, as will be seen in Section 5.4.
4.4.1 Interpretation of logistic regression
The logit function implies that the dependence of πi on the explanatory variables is described by a sigmoid or S-shaped curve. By inverting the definition of the logit function, we obtain
$$\pi_i = \frac{\exp(a + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik})}{1 + \exp(a + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik})}$$
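The inversion can be illustrated with a short Python sketch. The coefficients and observations are hypothetical, and the function names are my own. The two-branch form of the inverse logit is an implementation choice to avoid numerical overflow; it is algebraically the same expression as the one above.

```python
import math

def linear_predictor(a, b, x):
    """Return a + b1*x1 + ... + bk*xk for one observation x."""
    return a + sum(bj * xj for bj, xj in zip(b, x))

def fitted_probability(a, b, x):
    """Inverse logit of the linear predictor, written to avoid overflow in exp."""
    eta = linear_predictor(a, b, x)
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical coefficients, chosen only for illustration.
a, b = -1.0, [0.5, -0.25]
for x in ([2.0, 4.0], [30.0, 0.0], [-30.0, 0.0]):
    print(x, linear_predictor(a, b, x), fitted_probability(a, b, x))
```

The linear predictor takes values well outside [0,1] for the second and third observations, while the corresponding fitted probabilities remain between 0 and 1.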
This relationship corresponds to the function known as a ‘logistic curve’, often employed for diffusion problems, including the launch of a new product or the spread of a confidential piece of information. These applications often concern the simple case of only one explanatory variable, corresponding to a bivariate logistic regression model:
$$\pi_i = \frac{e^{a + b_1 x_{i1}}}{1 + e^{a + b_1 x_{i1}}}$$
Here the value of the success probability varies according to the observed values of the single explanatory variable. This simplified case is useful to visualise the behaviour of the logistic curve, and to make two more remarks about interpretation. Figure 4.5 shows the graph of the logistic function that links the probability of success πi to the possible values of the explanatory variable xi, corresponding to two different signs of the coefficient β (written b1 above). We have assumed the more general setting, in which the explanatory variable is continuous and therefore the success probability can be indicated as π(x). For discrete or qualitative explanatory variables the results will be a particular case of what I am about to describe.
[Figure 4.5: the logistic function π(x) plotted against x, panels (a) and (b)]
Notice that the sign of β indicates whether the curve increases or decreases, and the magnitude of β determines the rate of that increase or decrease:
• When β > 0, π(x) increases as x increases.
• When β < 0, π(x) decreases as x increases.
Furthermore, for β → 0 the curve tends to become a horizontal straight line. In particular, when β = 0, Y is independent of X.
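These remarks on the sign of β can be verified on a grid of x values. The sketch below assumes an intercept of 0 and the values β = 1, −1 and 0 purely for illustration; the function names are hypothetical.

```python
import math

def success_probability(x, alpha, beta):
    """Logistic curve with a single explanatory variable x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def curve(beta, xs, alpha=0.0):
    """Evaluate the curve on a grid of x values (alpha = 0 is an assumption)."""
    return [success_probability(x, alpha, beta) for x in xs]

xs = [-4, -2, 0, 2, 4]
for beta in (1.0, -1.0, 0.0):
    print(beta, [round(p, 3) for p in curve(beta, xs)])
```

With β = 1 the printed values rise with x, with β = −1 they fall, and with β = 0 they stay constant, so the curve is a horizontal line.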
Although the probability of success is a logistic function and therefore not linear in the explanatory variables, the logarithm of the odds is a linear function of the explanatory variables:
$$\log\left(\frac{\pi(x)}{1-\pi(x)}\right) = \alpha + \beta x$$
Positive log-odds favour Y = 1 whereas negative log-odds favour Y = 0. The log-odds expression establishes that the logit increases by β units for a unit increase in x. It could be used during the exploratory phase to evaluate the linearity of the observed logit. A good linear fit of the explanatory variable with respect to the observed logit will encourage us to apply the logistic regression model. The concept of odds was introduced in Section 3.4. For the logistic regression model, the odds of success can be expressed by
$$\frac{\pi(x)}{1-\pi(x)} = e^{\alpha + \beta x} = e^{\alpha}\,(e^{\beta})^{x}$$
This exponential relationship offers a useful interpretation of the parameter β: a unit increase in x multiplies the odds by a factor e^β. In other words, the odds at level x + 1 equal the odds at level x multiplied by e^β. When β = 0 we obtain e^β = 1, therefore the odds do not depend on X.
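The multiplicative effect on the odds is easy to confirm numerically. In the sketch below the values of α and β are hypothetical, and `odds_at` is an illustrative name for the odds expression given above.

```python
import math

def odds_at(x, alpha, beta):
    """Odds of success at level x: exp(alpha + beta * x)."""
    return math.exp(alpha + beta * x)

# Hypothetical parameter values, for illustration only.
alpha, beta = -1.5, 0.7

for x in (0, 1, 2):
    ratio = odds_at(x + 1, alpha, beta) / odds_at(x, alpha, beta)
    print(x, round(ratio, 6), round(math.exp(beta), 6))
```

Whatever the starting level x, the ratio of the odds at x + 1 to the odds at x matches e^β up to rounding.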
What about the fitting algorithm, the properties of the residuals, and goodness of fit indexes? These concepts can be introduced by interpreting logistic regression as a linear regression model for an appropriate transformation of the variables. They are examined as part of the broader field of generalised linear models (Section 5.4), which should make them easier to understand. I have waited until Section 5.4 to give a real application of the model.
4.4.2 Discriminant analysis
Linear regression and logistic regression models are essentially scoring models: they assign a numerical score to each value to be predicted. These scores can be used to estimate the probability that the response variable assumes a predetermined set of values or levels (e.g. all positive values if the response is continuous or a level if it is binary). Scores can then be used to classify the observations into disjoint classes. This is particularly useful for classifying new observations not already present in the database. This objective is more natural for logistic regression models, where predicted scores can be converted into binary values, thus classifying observations into two classes: those predicted to be 0 and those predicted to be 1. To do this, we need a threshold or cut-off rule. This type of predictive classification rule is studied by the classical theory of discriminant analysis. We will consider the simple and common case in which each observation is to be classified using a binary response: it is either in class 0 or in class 1. The more general case is similar, but more complex to illustrate.
The choice between the two classes is usually based on a probabilistic criterion: choose the class with the highest probability of occurrence, on the basis of the observed data. This rationale, which is optimal when equal misclassification costs are assumed (Section 5.1), leads to an odds-based rule that allows us to assign an observation to class 1 (rather than class 0) when the odds in favour of class 1 are greater than 1, and vice versa. Logistic regression can be expressed as a linear function of log-odds, therefore a discriminant rule can be expressed in linear terms, by assigning the ith observation to class 1 if
$$a + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik} > 0$$
With a single predictor variable, the rule simplifies to
$$a + b x_i > 0$$
This rule is known as the logistic discriminant rule; it can be extended to qualitative response variables with more than two classes.
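The following sketch shows that the linear rule gives the same classes as comparing the fitted probability with 0.5, that is, comparing the odds with 1. The coefficients and observations are hypothetical and the function names are my own; in practice the coefficients would come from a fitted model.

```python
import math

def score(a, b, x):
    """Linear score a + b1*x1 + ... + bk*xk (the fitted log-odds)."""
    return a + sum(bj * xj for bj, xj in zip(b, x))

def assign_class(a, b, x):
    """Logistic discriminant rule: class 1 if the score is positive, else class 0."""
    return 1 if score(a, b, x) > 0 else 0

def fitted_probability(a, b, x):
    """Probability of class 1 implied by the score."""
    return 1.0 / (1.0 + math.exp(-score(a, b, x)))

# Hypothetical coefficients and observations, for illustration only.
a, b = -2.0, [0.8, 0.5]
observations = [(1.0, 1.0), (2.0, 1.0), (3.0, 0.5)]

classes = [assign_class(a, b, x) for x in observations]
probs = [fitted_probability(a, b, x) for x in observations]
print(classes, [round(p, 3) for p in probs])
```

Each observation assigned to class 1 has a fitted probability above 0.5, and each observation assigned to class 0 has a fitted probability below 0.5.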
An alternative to logistic regression is linear discriminant analysis, also known as Fisher’s rule. It is based on the assumption that, for each given class of the response variable, the explanatory variables are distributed as a multivariate normal distribution (Section 5.1) with a common variance–covariance matrix. Then it is also possible to obtain a rule in linear terms. For a single predictor, the rule assigns observation i to class 1 if
$$\log\frac{n_1}{n_0} - \frac{(\bar{x}_1 - \bar{x}_0)^2}{2s^2} + \frac{x_i(\bar{x}_1 - \bar{x}_0)}{s^2} > 0$$
where n1 and n0 are the number of observations in classes 1 and 0; x̄1 and x̄0 are the observed means of the predictor X in the two classes, 1 and 0; s² is the variance of X for all the observations. Both Fisher’s rule and the logistic discriminant rule can be expressed in linear terms, but the logistic rule is simpler to apply and interpret and it does not require any probabilistic assumptions. Fisher’s rule is more explicit than the logistic discriminant rule. By assuming a normal distribution, we can add more information to the rule, such as an assessment of its sampling variability. We shall return to discriminant analysis in Section 5.1.