Logistic regression - Computational data mining

Computational data mining

4.4 Logistic regression

Section 4.3 considered a predictive model for a quantitative response variable; this section considers a predictive model for a qualitative response variable. A qualitative response problem can often be decomposed into binary response problems (e.g. Agresti, 1990). The building block of most qualitative response models is the logistic regression model, one of the most important predictive data mining methods.

Let $y_i$ ($i = 1, 2, \ldots, n$) be the observed values of a binary response variable, which can take only the values 0 or 1. The level 1 usually represents the occurrence of an event of interest, often called a ‘success’. A logistic regression model is defined in terms of fitted values to be interpreted as probabilities (Section 5.1) that the event occurs in different subpopulations:

$$\pi_i = P(Y_i = 1), \quad \text{for } i = 1, 2, \ldots, n$$

More precisely, a logistic regression model specifies that an appropriate function of the fitted probability of the event is a linear function of the observed values of the available explanatory variables. Here is an example:

$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = a + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik}$$

The left-hand side defines the logit function of the fitted probability, $\operatorname{logit}(\pi_i)$, as the logarithm of the odds for the event, namely the natural logarithm of the ratio between the probability of occurrence (success) and the probability of non-occurrence (failure):

$$\operatorname{logit}(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right)$$

Once $\pi_i$ has been calculated from the data, a fitted value $\hat{y}_i$ for each binary observation can be obtained by introducing a threshold value of $\pi_i$, above which $\hat{y}_i = 1$ and below which $\hat{y}_i = 0$. The resulting fit will seldom be perfect, so there will be a fitting error that will have to be kept as low as possible. Unlike linear regression, the observed response values cannot be decomposed additively as the sum of a fitted value and an error term.
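The thresholding step can be sketched in a few lines of Python. This is only an illustration: the function name `classify`, the default cut-off of 0.5, the convention that a probability exactly at the threshold is assigned to 0, and the data are all assumptions made for the example, not part of the text.

```python
def classify(probabilities, threshold=0.5):
    """Turn fitted probabilities into 0/1 fitted values using a cut-off.

    Assumption of this sketch: a probability exactly equal to the
    threshold is assigned to class 0.
    """
    return [1 if p > threshold else 0 for p in probabilities]


# Made-up fitted probabilities and observed binary responses.
fitted_probs = [0.10, 0.45, 0.50, 0.80]
observed = [0, 1, 0, 1]

fitted = classify(fitted_probs)

# Count the observations whose fitted value differs from the observed one.
errors = sum(y != y_hat for y, y_hat in zip(observed, fitted))

print(fitted, errors)  # [0, 0, 0, 1] 1
```

Counting mismatches is one simple way to summarise the fitting error; other measures are possible.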

The choice of the logit function to describe the link between $\pi_i$ and the linear combination of the explanatory variables is motivated by the fact that, with this choice, the probability tends towards 0 and 1 gradually, and these limits are never exceeded, guaranteeing that $\pi_i$ is a valid probability. A linear regression model would be inappropriate to predict a binary response variable, simply because a linear function is unbounded, so the model could predict values for the response variable outside the interval [0,1], which would be meaningless. Other types of link are possible, as will be seen in Section 5.4.

4.4.1 Interpretation of logistic regression

The logit function implies that the dependence of $\pi_i$ on the explanatory variables is described by a sigmoid or S-shaped curve. By inverting the definition of the logit function, we obtain

$$\pi_i = \frac{\exp(a + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik})}{1 + \exp(a + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik})}$$
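The inversion can be checked numerically. The sketch below, using only the Python standard library, computes a fitted probability from a linear predictor and confirms that applying the logit recovers that predictor. The function names and the coefficient values are hypothetical and chosen purely for illustration.

```python
import math


def logit(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse of the logit: exp(eta) / (1 + exp(eta)).

    Written in two branches so that exp() is only ever called on a
    non-positive argument, which avoids overflow for large |eta|.
    """
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def fitted_probability(a, b, x):
    """Fitted probability for intercept a, coefficients b and values x."""
    eta = a + sum(bj * xj for bj, xj in zip(b, x))
    return inv_logit(eta)


# Hypothetical intercept, coefficients and one observation.
a, b = -1.0, [0.5, 2.0]
x = [2.0, 0.25]

pi = fitted_probability(a, b, x)   # linear predictor is -1 + 1 + 0.5 = 0.5
print(pi, logit(pi))               # logit(pi) recovers 0.5 up to rounding
```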

This relationship corresponds to the function known as a ‘logistic curve’, often employed for diffusion problems, including the launch of a new product or the diffusion of a reserved piece of information. These applications often concern the simple case of only one explanatory variable, corresponding to a bivariate logistic regression model:

$$\pi_i = \frac{e^{a + b_1 x_{i1}}}{1 + e^{a + b_1 x_{i1}}}$$

Here the value of the success probability varies according to the observed values of the single explanatory variable. This simplified case is useful to visualise the behaviour of the logistic curve, and to make two more remarks about interpretation. Figure 4.5 shows the graph of the logistic function that links the probability of success $\pi_i$ to the possible values of the explanatory variable $x_i$, corresponding to two different signs of the coefficient $\beta$. We have assumed the more general setting, in which the explanatory variable is continuous and therefore the success probability can be indicated as $\pi(x)$. For discrete or qualitative explanatory variables the results will be a particular case of what I am about to describe.

[Figure 4.5: panels (a) and (b), each plotting $p(x)$ on a vertical axis from 0 to 1.0 against $x$]

Notice that the parameter $\beta$ determines how the curve rises or falls: the sign of $\beta$ indicates whether the curve increases or decreases, and the magnitude of $\beta$ determines the rate of that increase or decrease:

• When $\beta > 0$, $\pi(x)$ increases as $x$ increases.

• When $\beta < 0$, $\pi(x)$ decreases as $x$ increases.

Furthermore, as $\beta \to 0$ the curve tends to become a horizontal straight line. In particular, when $\beta = 0$, $Y$ is independent of $X$.
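The effect of the sign of the slope can be seen by evaluating the logistic curve on a small grid. In the sketch below the function name `success_probability`, the grid and the parameter values are arbitrary choices made for illustration.

```python
import math


def success_probability(x, alpha, beta):
    """Logistic curve: probability of success at x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]   # arbitrary grid of x values

rising = [success_probability(x, 0.0, 1.5) for x in xs]    # positive slope
falling = [success_probability(x, 0.0, -1.5) for x in xs]  # negative slope
flat = [success_probability(x, 0.0, 0.0) for x in xs]      # zero slope

for name, curve in [("rising", rising), ("falling", falling), ("flat", flat)]:
    print(name, [round(p, 3) for p in curve])
```

With a zero slope every value equals 0.5 here because the intercept was also set to 0; a different intercept would give a different constant.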

Although the probability of success is a logistic function and therefore not linear in the explanatory variables, the logarithm of the odds is a linear function of the explanatory variables:

$$\log\left(\frac{\pi(x)}{1-\pi(x)}\right) = \alpha + \beta x$$

Positive log-odds favour $Y = 1$ whereas negative log-odds favour $Y = 0$. The log-odds expression establishes that the logit increases by $\beta$ units for a unit increase in $x$. It could be used during the exploratory phase to evaluate the linearity of the observed logit. A good linear fit of the explanatory variable with respect to the observed logit will encourage us to apply the logistic regression model. The concept of odds was introduced in Section 3.4. For the logistic regression model, the odds of success can be expressed by

$$\frac{\pi(x)}{1-\pi(x)} = e^{\alpha + \beta x} = e^{\alpha}\,(e^{\beta})^{x}$$

This exponential relationship offers a useful interpretation of the parameter $\beta$: a unit increase in $x$ multiplies the odds by a factor $e^{\beta}$. In other words, the odds at level $x+1$ equal the odds at level $x$ multiplied by $e^{\beta}$. When $\beta = 0$ we obtain $e^{\beta} = 1$, therefore the odds do not depend on $X$.
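The multiplicative effect on the odds is easy to verify numerically. In this minimal sketch the parameter values are hypothetical; the ratio of the odds at two levels one unit apart should equal the exponential of the slope, whatever the starting level.

```python
import math


def odds(x, alpha, beta):
    """Odds of success at x under the logistic model."""
    return math.exp(alpha + beta * x)


alpha, beta = -0.5, 0.7   # hypothetical parameter values

# Ratio of the odds at x + 1 to the odds at x, for two starting levels.
ratio_at_2 = odds(3.0, alpha, beta) / odds(2.0, alpha, beta)
ratio_at_10 = odds(11.0, alpha, beta) / odds(10.0, alpha, beta)

# Both ratios agree with exp(beta) up to floating point rounding.
print(ratio_at_2, ratio_at_10, math.exp(beta))
```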

What about the fitting algorithm, the properties of the residuals, and goodness of fit indexes? These concepts can be introduced by interpreting logistic regression as a linear regression model for an appropriate transformation of the variables. They are examined as part of the broader field of generalised linear models (Section 5.4), which should make them easier to understand. I have waited until Section 5.4 to give a real application of the model.

4.4.2 Discriminant analysis

Linear regression and logistic regression models are essentially scoring models: they assign a numerical score to each value to be predicted. These scores can be used to estimate the probability that the response variable assumes a predetermined set of values or levels (e.g. all positive values if the response is continuous or a level if it is binary). Scores can then be used to classify the observations into disjoint classes. This is particularly useful for classifying new observations not already present in the database. This objective is more natural for logistic regression models, where predicted scores can be converted into binary values, thus classifying observations into two classes: those predicted to be 0 and those predicted to be 1. To do this, we need a threshold or cut-off rule. This type of predictive classification rule is studied by the classical theory of discriminant analysis. We will consider the simple and common case in which each observation is to be classified using a binary response: it is either in class 0 or in class 1. The more general case is similar, but more complex to illustrate.

The choice between the two classes is usually based on a probabilistic criterion: choose the class with the highest probability of occurrence, on the basis of the observed data. This rationale, which is optimal when equal misclassification costs are assumed (Section 5.1), leads to an odds-based rule that allows us to assign an observation to class 1 (rather than class 0) when the odds in favour of class 1 are greater than 1, and vice versa. Logistic regression can be expressed as a linear function of log-odds, therefore a discriminant rule can be expressed in linear terms, by assigning the $i$th observation to class 1 if

$$a + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik} > 0$$

With a single predictor variable, the rule simplifies to

$$a + b x_i > 0$$

This rule is known as the logistic discriminant rule; it can be extended to qualitative response variables with more than two classes.
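A small Python sketch shows that classifying on the sign of the linear score gives the same result as classifying on a fitted probability above 0.5 (odds greater than 1). The coefficients and observations are invented for illustration, and the function names are hypothetical.

```python
import math


def linear_score(a, b, x):
    """Log-odds of class 1: intercept plus weighted sum of predictors."""
    return a + sum(bj * xj for bj, xj in zip(b, x))


def assign_class(a, b, x):
    """Logistic discriminant rule: class 1 when the linear score is positive."""
    return 1 if linear_score(a, b, x) > 0 else 0


def probability(a, b, x):
    """Fitted probability of class 1 from the same linear score."""
    return 1.0 / (1.0 + math.exp(-linear_score(a, b, x)))


# Invented coefficients and three observations with two predictors each.
a, b = -3.0, [1.0, 0.5]
observations = [[1.0, 1.0], [2.0, 4.0], [4.0, 0.0]]

classes = [assign_class(a, b, x) for x in observations]
by_probability = [1 if probability(a, b, x) > 0.5 else 0 for x in observations]

print(classes)          # [0, 1, 1]
print(by_probability)   # same classification
```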

An alternative to logistic regression is linear discriminant analysis, also known as Fisher’s rule. It is based on the assumption that, for each given class of the response variable, the explanatory variables are distributed as a multivariate normal distribution (Section 5.1) with a common variance–covariance matrix. Then it is also possible to obtain a rule in linear terms. For a single predictor, the rule assigns observation $i$ to class 1 if

$$\log\frac{n_1}{n_0} - \frac{\bar{x}_1^2 - \bar{x}_0^2}{2 s^2} + \frac{x_i(\bar{x}_1 - \bar{x}_0)}{s^2} > 0$$

where $n_1$ and $n_0$ are the number of observations in classes 1 and 0; $\bar{x}_1$ and $\bar{x}_0$ are the observed means of the predictor $X$ in the two classes, 1 and 0; $s^2$ is the variance of $X$ for all the observations. Both Fisher’s rule and the logistic discriminant rule can be expressed in linear terms, but the logistic rule is simpler to apply and interpret and it does not require any probabilistic assumptions. Fisher’s rule is more explicit than the logistic discriminant rule. By assuming a normal distribution, we can add more information to the rule, such as an assessment of its sampling variability. We shall return to discriminant analysis in Section 5.1.
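For a single predictor, a rule of this kind can be sketched directly from the normality assumption: compute the log of the class-size ratio plus the difference between the two normal log-densities, and assign class 1 when the result is positive. The sketch below takes the class means, the common variance and the class sizes as given inputs; the function names and the numerical values are invented for illustration.

```python
import math


def normal_log_density(x, mean, variance):
    """Log-density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)


def fisher_score(x, mean1, mean0, variance, n1, n0):
    """Log-odds of class 1 versus class 0 under normality with a common variance."""
    return (math.log(n1 / n0)
            + normal_log_density(x, mean1, variance)
            - normal_log_density(x, mean0, variance))


def fisher_class(x, mean1, mean0, variance, n1, n0):
    """Assign class 1 when the log-odds are positive, class 0 otherwise."""
    return 1 if fisher_score(x, mean1, mean0, variance, n1, n0) > 0 else 0


# Invented summary statistics: class means, common variance, class sizes.
mean1, mean0, variance, n1, n0 = 5.0, 3.0, 4.0, 40, 60

for x in (4.0, 6.0):
    score = fisher_score(x, mean1, mean0, variance, n1, n0)
    print(x, round(score, 4), fisher_class(x, mean1, mean0, variance, n1, n0))
```

Because the two densities share the same variance, the terms in $x^2$ cancel in the difference, which is why the resulting score is linear in $x$.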

In document Applied Data Mining Statistical Methods for Business and Industry Giudici P (2003) pdf (Page 110-114)