
Statistical data mining

5.4 Generalised linear models

5.4.3 The logistic regression model

The logistic regression model is an important special case of the generalised linear model. We can use our general results to derive inferential results for it. The deviance of a model $M$ takes the following form:

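$$G^2(M) = 2 \sum_{i=1}^{n} \left[ y_i \log\left(\frac{y_i}{n_i \hat{\pi}_i}\right) + (n_i - y_i) \log\left(\frac{n_i - y_i}{n_i - n_i \hat{\pi}_i}\right) \right]$$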

where the $\hat{\pi}_i$ are the fitted probabilities of success, calculated on the basis of the estimated $\beta$ parameters for model $M$. The deviance $G^2$ can be written more compactly as

$$G^2 = 2 \sum_{i} o_i \log\left(\frac{o_i}{e_i}\right)$$

where $o_i$ indicates the observed frequencies $y_i$ and $n_i - y_i$, and $e_i$ indicates the corresponding fitted frequencies $n_i \hat{\pi}_i$ and $n_i - n_i \hat{\pi}_i$. Note that $G^2$ can be interpreted as a distance function, expressed in terms of entropy differences between the fitted model and the saturated model.

The Pearson statistic for the logistic regression model, based on the $X^2$ distance, takes the form

$$X^2 = \sum_{i=1}^{n} \frac{(y_i - n_i \hat{\pi}_i)^2}{n_i \hat{\pi}_i (1 - \hat{\pi}_i)}$$
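The two statistics can be computed directly from grouped binomial data. The following is a minimal sketch, not the software used in the text: the function names `deviance_g2` and `pearson_x2` are illustrative, and the toy counts and fitted probabilities are made up rather than obtained from an actual maximum likelihood fit.

```python
import math

def deviance_g2(y, n, pi_hat):
    """Deviance G2 for grouped binomial data: 2 * sum of o * log(o / e)."""
    total = 0.0
    for yi, ni, pi in zip(y, n, pi_hat):
        # A cell with zero observed count contributes 0 (limit of o * log(o / e)).
        if yi > 0:
            total += yi * math.log(yi / (ni * pi))
        if ni - yi > 0:
            total += (ni - yi) * math.log((ni - yi) / (ni - ni * pi))
    return 2.0 * total

def pearson_x2(y, n, pi_hat):
    """Pearson X2 statistic for grouped binomial data."""
    return sum((yi - ni * pi) ** 2 / (ni * pi * (1.0 - pi))
               for yi, ni, pi in zip(y, n, pi_hat))

# Hypothetical data: successes y out of n trials, with assumed fitted probabilities.
y = [3, 6]
n = [10, 10]
pi_hat = [0.4, 0.5]

g2 = deviance_g2(y, n, pi_hat)
x2 = pearson_x2(y, n, pi_hat)
print(g2, x2)
```

When the fitted probabilities equal the observed proportions (the saturated model), both functions return zero, which is consistent with reading $G^2$ and $X^2$ as distances from the saturated model.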

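Both $G^2$ and $X^2$ can be used to compare models in terms of distances between observed and fitted values. The advantage of $G^2$ lies in its modularity. For instance, in the case of two nested logistic regression models $M_A$, with $q$ parameters, and $M_B$, with $p$ parameters ($q < p$), the difference between the deviances is given by

$$D = G^2(M_A) - G^2(M_B) = 2 \sum_{i=1}^{n} \left[ y_i \log\left(\frac{n_i \hat{\pi}_i^B}{n_i \hat{\pi}_i^A}\right) + (n_i - y_i) \log\left(\frac{n_i - n_i \hat{\pi}_i^B}{n_i - n_i \hat{\pi}_i^A}\right) \right] = 2 \sum_{i} o_i \log\left(\frac{e_i^B}{e_i^A}\right) \approx \chi^2_{p-q}$$

where $\hat{\pi}_i^A$ and $\hat{\pi}_i^B$ indicate the success probability fitted on the basis of models $M_A$ and $M_B$, respectively. Under the simpler model $M_A$, $D$ is asymptotically distributed as a chi-squared with $p - q$ degrees of freedom. Note that the expression for the deviance boils down to an entropy measure between probability models, exactly as before. This is a general fact. The deviance residuals are defined by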

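$${}_D r_i = \pm \sqrt{2} \left[ y_i \log\left(\frac{y_i}{n_i \hat{\pi}_i}\right) + (n_i - y_i) \log\left(\frac{n_i - y_i}{n_i - n_i \hat{\pi}_i}\right) \right]^{1/2}$$

where the sign is that of the raw residual $y_i - n_i \hat{\pi}_i$.

5.4.4 Application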

We now consider a case study on correspondence sales for an editorial company, described at length in Chapter 10. The observations are the customers of the company. The response variable to be predicted divides the clients into two categories: those who buy only one product, and those who buy more following the first purchase. The response variable is called Nacquist; it indicates whether or not the number of purchases is greater than one. All the explanatory variables are binary. They are Vdpflrat, islands, south, centre, north, age15_35, age36_50, age51_89, dim_g, dim_m, dim_p and sex. The interpretation of the variables and the exploratory analysis are given in Chapter 10. Here we construct the logistic regression model, initially fitting a model with all the variables. The value of $G^2$ for this model is 3011.658.

Table 5.6 gives the deviance and the maximum likelihood estimates for this model. It begins with information relative to the chosen model. The third row shows the value of −2 log L, that is, the deviance, for the considered model and for the null model with only the intercept. The difference between the two deviances, D, is equal to 3318.752 − 3011.658 = 307.094. The considered model has 10 estimable parameters and the null model has 1, so we use a chi-squared test with 9 degrees of freedom (9 = 10 − 1). We obtain a significant difference, so we accept the considered model (the p-value reported in Table 5.6 is 0.0001). Even though there are 12 explanatory variables, the presence of the intercept means we have to eliminate three of them that we cannot estimate. For example, since there are three age classes, the three columns that indicate the presence or absence of each of them sum to a vector of ones, identical to the intercept vector. Therefore, in this model we eliminate the variable age15_35. For the same reason, north and dim_p are eliminated, as the note in Table 5.6 shows.
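The test on the deviance difference can be reproduced from the two −2 log L values in Table 5.6. The sketch below is illustrative only: `chi2_sf` is a hypothetical helper, written with the standard library, that evaluates the chi-squared upper tail for integer degrees of freedom through the recurrence for the regularised incomplete gamma function. It is not the routine used by the software in the text.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # Q(1, z) = exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # Q(1/2, z) = erfc(sqrt(z))
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q

dev_null = 3318.752   # -2 LOG L, intercept only (Table 5.6)
dev_full = 3011.658   # -2 LOG L, intercept and covariates (Table 5.6)

D = dev_null - dev_full      # deviance difference
df = 10 - 1                  # estimable parameters: full model minus null model
p_value = chi2_sf(D, df)

print(round(D, 3), df, p_value < 0.0001)   # 307.094 9 True
```

The computed tail probability is far below 0.0001, which is consistent with the value printed in the table being a reporting floor of the software rather than an exact figure.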

Table 5.6 Results of fitting the full logistic regression model.

Model Fitting Information and Testing Global Null Hypothesis BETA=0

Criterion   Intercept Only   Intercept and Covariates   Chi-Square for Covariates
AIC         3320.732         3031.658                   .
SC          3326.556         3089.700                   .
-2 LOG L    3318.752         3011.658                   307.094 with 9 DF (p=0.0001)
Score       .                .                          300.114 with 9 DF (p=0.0001)

NOTE: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

AGE15_35 = 1 * INTERCEPT - 1 * AGE51_89 - 1 * AGE36_50
NORTH    = 1 * INTERCEPT - 1 * ISLANDS - 1 * SOUTH - 1 * CENTRE
DIM_P    = 1 * INTERCEPT - 1 * DIM_G - 1 * DIM_M

Analysis of Maximum Likelihood Estimates

Variable    DF   Parameter   Standard   Wald         Pr >         Standardised   Odds
                 Estimate    Error      Chi-Square   Chi-Square   Estimate       Ratio
INTERCEPT   1    -1.2163     0.1510      64.9210     0.0001       .              .
VDPFLRAT    1     1.5038     0.0999     226.6170     0.0001        0.365878      4.498
ISLANDS     1    -0.2474     0.1311       3.5639     0.0590       -0.050827      0.781
SOUTH       1    -0.4347     0.1222      12.6525     0.0004       -0.098742      0.647
CENTRE      1    -0.1371     0.1128       1.4777     0.2241       -0.033399      0.872
AGE51_89    1     0.4272     0.1312      10.6086     0.0011        0.095633      1.533
AGE36_50    1     0.7457     0.1070      48.5289     0.0001        0.205547      2.108
DIM_G       1    -0.0689     0.1335       0.2667     0.6055       -0.016728      0.933
DIM_M       1     0.1294     0.1172       1.2192     0.2695        0.035521      1.138
AGE15_35    0     0          .            .          .             .             .
SEX         1     0.1180     0.0936       1.5878     0.2076        0.030974      1.125
NORTH       0     0          .            .          .             .             .
DIM_P       0     0          .            .          .             .             .

AIC and SC in the first two rows of the table are model choice criteria related to the numerator of the deviance. Chapter 6 covers them in more detail. The second part of the table shows, for each of the parameters, the estimates obtained and their standard errors, as well as the Wald statistic for hypothesis testing on each coefficient. Using the p-value corresponding to each statistic, we can deduce that, at a significance level of 0.05, five variables are not significant, as their p-values exceed this threshold: islands, centre, dim_g, dim_m and sex. Four of them are clearly not significant, whereas islands (p = 0.0590) is borderline. Finally, the table shows the estimated odds ratios of the response variable with each explanatory variable. These estimates are derived using the estimated parameters, so they may differ from the odds ratios calculated during the exploratory phase (Section 4.4), which are based on a saturated model. The exploratory indexes are usually calculated marginally, whereas the present indexes take into account the presence of all the other variables in the model.
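The Wald statistic, its p-value and the odds ratio in each row of Table 5.6 follow from the parameter estimate and its standard error alone. A minimal sketch, using three rows of the table as input; the function name `wald_summary` is illustrative. Because the printed estimates are rounded to four decimals, the recomputed values agree with the table only up to rounding.

```python
import math

def wald_summary(estimate, std_error):
    """Return (Wald chi-square, p-value on 1 df, odds ratio) for one coefficient."""
    wald = (estimate / std_error) ** 2
    # Upper tail of a chi-squared with 1 df: P(Z**2 > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    odds_ratio = math.exp(estimate)
    return wald, p_value, odds_ratio

# (estimate, standard error) pairs copied from Table 5.6.
rows = {
    "VDPFLRAT": (1.5038, 0.0999),
    "ISLANDS": (-0.2474, 0.1311),
    "SOUTH": (-0.4347, 0.1222),
}

for name, (est, se) in rows.items():
    wald, p, oratio = wald_summary(est, se)
    print(f"{name:10s} Wald={wald:9.4f}  p={p:.4f}  odds ratio={oratio:.3f}")
```

An odds ratio below 1, as for south, corresponds to a negative coefficient and hence to lower odds of more than one purchase, all other variables being held fixed.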

We now look at a model selection procedure to see whether the model can be further simplified. We can choose forward, backward or stepwise selection. In the normal linear model, these procedures are based on recursive application of the F test, but now they are based on the deviance differences. If the p-value for this difference is 'large', the simpler model will be chosen; if it is small, the more complex model will be chosen. The procedure stops when no further change will produce a significant change in the deviance. Table 5.7 shows the results obtained with a forward procedure (generically known as 'stepwise' by the software). It highlights for every variable the values of Rao's score statistic, in order to show the incremental importance of each inserted variable. The procedure stops after the insertion of five variables. Here they are in order of insertion: Vdpflrat, age15_35, north, age51_89, south. No other variable is retained at a significance level of α = 0.15, the software default. Table 5.7 also reports the parameter estimates for the five variables selected in the final model, with the corresponding Wald statistics. All five variables are now significant at a significance level of 0.05. The variable Vdpflrat indicates whether or not the price of the first purchase is paid in instalments; it is decisively estimated to be the variable most associated with the response variable.
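The logic of a forward procedure driven by deviance differences can be sketched as follows. This is an illustration under stated assumptions, not the software's algorithm: the software in the text ranks candidates by Rao's score statistic, whereas this sketch uses the drop in deviance, compared with a chi-squared on 1 degree of freedom. The names `forward_select`, `toy_deviance` and `TOY_DROPS` are hypothetical, and the deviance values are made up; in practice `deviance` would refit the logistic model for each candidate set of variables.

```python
import math

def forward_select(candidates, deviance, alpha=0.15):
    """Greedy forward selection based on deviance differences.

    `deviance` is a caller-supplied function mapping a frozenset of variable
    names to the deviance G2 of the model containing them (hypothetical
    interface; the model fitting itself is not shown).
    """
    selected = []
    remaining = list(candidates)
    current = deviance(frozenset(selected))
    while remaining:
        # Deviance reduction obtained by adding each remaining variable.
        drops = {v: current - deviance(frozenset(selected + [v]))
                 for v in remaining}
        best = max(remaining, key=lambda v: drops[v])
        # One extra parameter: compare the drop with a chi-squared on 1 df.
        p_value = math.erfc(math.sqrt(max(drops[best], 0.0) / 2.0))
        if p_value > alpha:
            break  # no remaining variable gives a significant improvement
        selected.append(best)
        remaining.remove(best)
        current -= drops[best]
    return selected

# Toy, made-up deviances: each variable lowers G2 by a fixed amount.
TOY_DROPS = {"x1": 30.0, "x2": 5.0, "x3": 0.5}

def toy_deviance(variables):
    return 100.0 - sum(TOY_DROPS[v] for v in variables)

print(forward_select(["x1", "x2", "x3"], toy_deviance))   # ['x1', 'x2']
```

With the default α = 0.15, the third toy variable is not entered because a deviance drop of 0.5 on 1 degree of freedom has a p-value of about 0.48.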

For large samples, stepwise selection procedures, like the one we have just applied, might lead to high instability of the results. The forward and backward approaches may even lead to different final models. Therefore it is a good idea to consider other model selection procedures too; this is discussed in Chapter 6. Figure 5.5 presents a final diagnostic of the model, through analysis of the deviance residuals. It turns out that the standardised residuals behave quite well, lying in the interval [−2,+2]. But notice a slight decreasing tendency of the residuals (as opposed to being distributed around a constant line). This indicates a possible underestimation of observations with high success probability. For more details on residual analysis see Weisberg (1985).
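The deviance residuals defined in Section 5.4.3 can be computed observation by observation. The sketch below is a minimal illustration with made-up inputs; the function name `deviance_residual` is hypothetical. It returns the raw deviance residual, whereas the standardised residuals mentioned above would additionally require the leverages from the fitted model, which are not computed here.

```python
import math

def deviance_residual(yi, ni, pi_hat):
    """Signed square root of observation i's contribution to the deviance."""
    def term(observed, fitted):
        # A zero observed count contributes 0 (limit of o * log(o / e)).
        return observed * math.log(observed / fitted) if observed > 0 else 0.0

    contribution = 2.0 * (term(yi, ni * pi_hat)
                          + term(ni - yi, ni - ni * pi_hat))
    sign = 1.0 if yi >= ni * pi_hat else -1.0
    # Guard against tiny negative values caused by floating-point rounding.
    return sign * math.sqrt(max(contribution, 0.0))

# Hypothetical binary observations (n_i = 1) with an assumed fitted probability.
r_success = deviance_residual(1, 1, 0.8)   # observed success, positive residual
r_failure = deviance_residual(0, 1, 0.8)   # observed failure, negative residual
print(r_success, r_failure)
```

By construction, the squared deviance residuals sum to the deviance $G^2$ of the model, so a plot of these residuals shows which observations contribute most to the lack of fit.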

Table 5.7 Results of forward procedure.

Summary of Stepwise Procedure

Step   Variable   Variable   Number   Score        Wald         Pr >
       Entered    Removed    In       Chi-Square   Chi-Square   Chi-Square
1      VDPFLRAT              1        234.7        .            0.0001
2      AGE15_35              2         45.0708     .            0.0001
3      NORTH                 3          9.4252     .            0.0021
4      AGE51_89              4          7.4656     .            0.0063
5      SOUTH                 5          4.4325     .            0.0353

Analysis of Maximum Likelihood Estimates

Variable    DF   Parameter   Standard   Wald         Pr >         Standardized   Odds
                 Estimate    Error      Chi-Square   Chi-Square   Estimate       Ratio
INTERCEPT   1    -0.5281     0.0811      42.3568     0.0001       .              .
VDPFLRAT    1     1.5022     0.0997     226.9094     0.0001        0.365499      4.492
SOUTH       1    -0.2464     0.1172       4.4246     0.0354       -0.055982      0.782
AGE51_89    1    -0.3132     0.1130       7.6883     0.0056       -0.070108      0.731
AGE15_35    1    -0.7551     0.1063      50.5103     0.0001       -0.186949      0.470
NORTH       1     0.2044     0.0989       4.2728     0.0387        0.053802      1.227

Figure 5.5 Residual analysis for the fitted logistic model.