Interpreting a Model for a Polychotomous Predictor

For the churn data set [4], suppose that we categorize the customer service calls variable into a new variable, CSC, as follows:

• Zero or one customer service calls: CSC = Low
• Two or three customer service calls: CSC = Medium
• Four or more customer service calls: CSC = High

Then CSC is a trichotomous predictor. How will logistic regression handle this? First, the analyst will need to code the data set using indicator (dummy) variables and reference cell coding. Suppose that we choose CSC = Low to be our reference cell. Then we assign the indicator variable values to two new indicator variables, CSC-Med and CSC-Hi, given in Table 4.5. Each record will have assigned to it a value of 0 or 1 for each of CSC-Med and CSC-Hi. For example, a customer with 1 customer service call will have CSC-Med = 0 and CSC-Hi = 0, a customer with 3 customer service calls will have CSC-Med = 1 and CSC-Hi = 0, and a customer with 7 customer service calls will have CSC-Med = 0 and CSC-Hi = 1.
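The reference cell coding described above can be sketched in Python. This is only an illustration, since the analysis in the text is carried out in Minitab; the function name `code_csc` and the dictionary keys are hypothetical.

```python
def code_csc(calls):
    """Return the indicator (dummy) variables for a customer service call count.

    CSC = Low (0 or 1 calls) is the reference cell, so it is coded (0, 0).
    The function name and dictionary keys are illustrative assumptions.
    """
    if calls < 0:
        raise ValueError("number of calls must be non-negative")
    csc_med = 1 if 2 <= calls <= 3 else 0   # Medium: two or three calls
    csc_hi = 1 if calls >= 4 else 0         # High: four or more calls
    return {"CSC-Med": csc_med, "CSC-Hi": csc_hi}


# The three example customers from the text: 1, 3, and 7 calls.
for calls in (1, 3, 7):
    print(calls, code_csc(calls))
```

Note that no indicator is needed for CSC = Low: a record belongs to the reference cell exactly when both indicators are 0.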

Table 4.6 shows a cross-tabulation of churn by CSC. Using CSC = Low as the reference class, we can calculate the odds ratios using the cross-products

INTERPRETING A LOGISTIC REGRESSION MODEL 167

TABLE 4.6 Cross-Tabulation of Churn by CSC

                        CSC = Low   CSC = Medium   CSC = High   Total
Churn = false, y = 0         1664           1057          129    2850
Churn = true, y = 1           214            131          138     483
Total                        1878           1188          267    3333

as follows:

• For CSC = Medium: $\text{OR} = \dfrac{131(1664)}{214(1057)} = 0.963687 \approx 0.96$
• For CSC = High: $\text{OR} = \dfrac{138(1664)}{214(129)} = 8.31819 \approx 8.32$
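The cross-product calculation can be checked with a few lines of Python. The sketch below uses the cell counts of Table 4.6; the dictionary layout and the function name `odds_ratio` are illustrative assumptions.

```python
# Cell counts from Table 4.6, keyed by CSC category.
churn_false = {"Low": 1664, "Medium": 1057, "High": 129}
churn_true = {"Low": 214, "Medium": 131, "High": 138}


def odds_ratio(category, reference="Low"):
    """Cross-product odds ratio of churn for `category` versus `reference`."""
    numerator = churn_true[category] * churn_false[reference]
    denominator = churn_true[reference] * churn_false[category]
    return numerator / denominator


or_medium = odds_ratio("Medium")
or_high = odds_ratio("High")
print(round(or_medium, 6), round(or_high, 5))
```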

The logistic regression is then performed in Minitab with the results shown in Table 4.7.

Note that the odds ratios reported by Minitab are the same as those we found using the cell counts directly. We verify the odds ratios given in Table 4.7, using equation (4.3):

• For CSC-Med: $\widehat{\text{OR}} = e^{b_1} = e^{-0.0369891} = 0.96$
• For CSC-Hi: $\widehat{\text{OR}} = e^{b_2} = e^{2.11844} = 8.32$
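As a quick numerical check, exponentiating the two reported coefficients reproduces the odds ratios to two decimal places. The variable names here are illustrative.

```python
import math

# Coefficients reported in Table 4.7.
b1 = -0.0369891  # CSC-Med
b2 = 2.11844     # CSC-Hi

or_med = math.exp(b1)
or_hi = math.exp(b2)
print(round(or_med, 2), round(or_hi, 2))
```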

Here we have $b_0 = -2.051$, $b_1 = -0.0369891$, and $b_2 = 2.11844$. So the probability of churning is estimated as

$$\hat{\pi}(x) = \frac{e^{\hat{g}(x)}}{1 + e^{\hat{g}(x)}} = \frac{e^{-2.051 - 0.0369891(\text{CSC-Med}) + 2.11844(\text{CSC-Hi})}}{1 + e^{-2.051 - 0.0369891(\text{CSC-Med}) + 2.11844(\text{CSC-Hi})}}$$

with the estimated logit:

$$\hat{g}(x) = -2.051 - 0.0369891(\text{CSC-Med}) + 2.11844(\text{CSC-Hi})$$

TABLE 4.7 Results of Logistic Regression of Churn on CSC

Logistic Regression Table

                                                      Odds      95% CI
Predictor         Coef     SE Coef        Z       P   Ratio   Lower   Upper
Constant      -2.05100   0.0726213   -28.24   0.000
CSC-Med     -0.0369891    0.117701    -0.31   0.753    0.96    0.77    1.21
CSC-Hi         2.11844    0.142380    14.88   0.000    8.32    6.29   11.00

Log-Likelihood = -1263.368
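Because this model spends one parameter on each of the three CSC categories, its fitted probabilities reproduce the observed churn proportions of Table 4.6. The coefficients and log-likelihood in Table 4.7 can therefore be recovered from the cell counts without an iterative fit. The sketch below checks this; it is not a description of how Minitab computes the fit, and the variable names are illustrative.

```python
import math

# (churners, non-churners) per CSC category, from Table 4.6.
cells = {"Low": (214, 1664), "Medium": (131, 1057), "High": (138, 129)}

# Log odds of churn within each category.
log_odds = {k: math.log(t / f) for k, (t, f) in cells.items()}

b0 = log_odds["Low"]                       # constant: reference cell
b1 = log_odds["Medium"] - log_odds["Low"]  # CSC-Med
b2 = log_odds["High"] - log_odds["Low"]    # CSC-Hi

# Binomial log-likelihood evaluated at the observed proportions.
log_lik = 0.0
for t, f in cells.values():
    p = t / (t + f)
    log_lik += t * math.log(p) + f * math.log(1 - p)

print(round(b0, 5), round(b1, 7), round(b2, 5), round(log_lik, 3))
```

This shortcut works only because the single predictor is categorical and every category has its own parameter; with continuous or multiple predictors an iterative maximum likelihood fit is needed.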

SPH SPH

JWDD006-04 JWDD006-Larose November 23, 2005 14:51 Char Count= 0

168 CHAPTER 4 LOGISTIC REGRESSION

For a customer with low customer service calls, we estimate his or her probability of churning:

$$\hat{g}(x) = -2.051 - 0.0369891(0) + 2.11844(0) = -2.051$$

and

$$\hat{\pi}(x) = \frac{e^{\hat{g}(x)}}{1 + e^{\hat{g}(x)}} = \frac{e^{-2.051}}{1 + e^{-2.051}} = 0.114$$

So the estimated probability that a customer with low numbers of customer service calls will churn is 11.4%, which is less than the overall proportion of churners in the data set, 14.5%, indicating that such customers churn somewhat less frequently than the overall group. Also, this probability could have been found directly from Table 4.6:

$$P(\text{churn} \mid \text{CSC} = \text{Low}) = \frac{214}{1878} = 0.114$$

For a customer with medium customer service calls, the probability of churn is estimated as

$$\hat{g}(x) = -2.051 - 0.0369891(1) + 2.11844(0) = -2.088$$

and

$$\hat{\pi}(x) = \frac{e^{\hat{g}(x)}}{1 + e^{\hat{g}(x)}} = \frac{e^{-2.088}}{1 + e^{-2.088}} = 0.110$$

The estimated probability that a customer with medium numbers of customer service calls will churn is 11.0%, which is about the same as that for customers with low numbers of customer service calls. The analyst may consider collapsing the distinction between CSC-Med and CSC-Low. This probability could have been found directly from Table 4.6:

$$P(\text{churn} \mid \text{CSC} = \text{Medium}) = \frac{131}{1188} = 0.110$$

For a customer with high customer service calls, the probability of churn is estimated as

$$\hat{g}(x) = -2.051 - 0.0369891(0) + 2.11844(1) = 0.06744$$

and

$$\hat{\pi}(x) = \frac{e^{\hat{g}(x)}}{1 + e^{\hat{g}(x)}} = \frac{e^{0.06744}}{1 + e^{0.06744}} = 0.5169$$

Thus, customers with high levels of customer service calls have a much higher estimated probability of churn, over 51%, which is more than triple the overall churn rate. Clearly, the company needs to flag customers who make 4 or more customer service calls and intervene with them before they attrit. This probability could also have been found directly from Table 4.6:

$$P(\text{churn} \mid \text{CSC} = \text{High}) = \frac{138}{267} = 0.5169$$
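The three estimated probabilities, and their agreement with the column proportions of Table 4.6, can be reproduced as follows. This is a minimal sketch using the coefficients of Table 4.7; the function name `churn_probability` and the `groups` layout are illustrative assumptions.

```python
import math

b0, b1, b2 = -2.051, -0.0369891, 2.11844  # coefficients from Table 4.7


def churn_probability(csc_med, csc_hi):
    """Estimated probability of churn from the fitted logit."""
    g = b0 + b1 * csc_med + b2 * csc_hi
    return math.exp(g) / (1 + math.exp(g))


# (CSC-Med, CSC-Hi) coding and Table 4.6 counts (churners, column total).
groups = {
    "Low": ((0, 0), (214, 1878)),
    "Medium": ((1, 0), (131, 1188)),
    "High": ((0, 1), (138, 267)),
}

for name, (coding, (churners, total)) in groups.items():
    model = churn_probability(*coding)
    direct = churners / total
    print(f"{name}: model={model:.4f}, direct={direct:.4f}")
```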

Applying the Wald test for the significance of the CSC-Med parameter, we have $b_1 = -0.0369891$ and $SE(b_1) = 0.117701$, giving us

$$Z_{\text{Wald}} = \frac{-0.0369891}{0.117701} = -0.31426$$

as reported under Z for the coefficient CSC-Med in Table 4.7. The p-value is $P(|z| > 0.31426) = 0.753$, which is not significant. There is no evidence that the CSC-Med versus CSC-Low distinction is useful for predicting churn. For the CSC-Hi parameter, we have $b_2 = 2.11844$ and $SE(b_2) = 0.142380$, giving us

$$Z_{\text{Wald}} = \frac{2.11844}{0.142380} = 14.88$$

as shown for the coefficient CSC-Hi in Table 4.7. The p-value, $P(|z| > 14.88) \cong 0.000$, indicates that there is strong evidence that the distinction CSC-Hi versus CSC-Low is useful for predicting churn.
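The two Wald statistics and their two-sided p-values can be checked with the standard library alone. The sketch below assumes the usual standard normal reference distribution for Z and uses the identity P(|Z| > |z|) = erfc(|z| / √2); the function name `wald_test` is hypothetical.

```python
import math


def wald_test(coef, se):
    """Return the Wald Z statistic and its two-sided normal p-value."""
    z = coef / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z_med, p_med = wald_test(-0.0369891, 0.117701)  # CSC-Med
z_hi, p_hi = wald_test(2.11844, 0.142380)       # CSC-Hi
print(f"CSC-Med: Z={z_med:.5f}, p={p_med:.3f}")
print(f"CSC-Hi:  Z={z_hi:.2f}, p={p_hi:.3g}")
```

The p-value for CSC-Hi is not exactly zero, only far too small to show at the three decimal places Minitab prints.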

Examining Table 4.7, note that the odds ratios for both CSC = Medium and CSC = High are equal to those we calculated using the cell counts directly. Also note that the logistic regression coefficients for the indicator variables are equal to the natural log of their respective odds ratios:

$$b_{\text{CSC-Med}} = \ln(0.96) \approx \ln(0.963687) = -0.0369891$$

$$b_{\text{CSC-Hi}} = \ln(8.32) \approx \ln(8.31819) = 2.11844$$

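For example, the natural log of the odds ratio of CSC-Hi to CSC-Low can be derived using equation (4.4) as follows:

$$\begin{aligned}
\ln[\text{OR}(\text{High}, \text{Low})] &= \hat{g}(\text{High}) - \hat{g}(\text{Low}) \\
&= [b_0 + b_1(\text{CSC-Med} = 0) + b_2(\text{CSC-Hi} = 1)] - [b_0 + b_1(\text{CSC-Med} = 0) + b_2(\text{CSC-Hi} = 0)] \\
&= b_2 = 2.11844
\end{aligned}$$

Similarly, the natural log of the odds ratio of CSC-Medium to CSC-Low is given by

$$\begin{aligned}
\ln[\text{OR}(\text{Medium}, \text{Low})] &= \hat{g}(\text{Medium}) - \hat{g}(\text{Low}) \\
&= [b_0 + b_1(\text{CSC-Med} = 1) + b_2(\text{CSC-Hi} = 0)] - [b_0 + b_1(\text{CSC-Med} = 0) + b_2(\text{CSC-Hi} = 0)] \\
&= b_1 = -0.0369891
\end{aligned}$$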
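The same cancellation can be seen numerically: differencing the estimated logit between a category and the reference cell leaves only that category's coefficient. A small sketch, with the function name `logit` chosen for illustration:

```python
b0, b1, b2 = -2.051, -0.0369891, 2.11844  # coefficients from Table 4.7


def logit(csc_med, csc_hi):
    """Estimated logit g(x) for a given indicator coding."""
    return b0 + b1 * csc_med + b2 * csc_hi


log_or_high = logit(0, 1) - logit(0, 0)    # High versus Low
log_or_medium = logit(1, 0) - logit(0, 0)  # Medium versus Low
print(log_or_high, log_or_medium)
```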

Just as for the dichotomous case, we may use the cell entries to estimate the standard error of the coefficients directly. For example, the standard error for the logistic regression coefficient $b_1$ for CSC-Med is estimated as follows:

$$\widehat{SE}(b_1) = \sqrt{\frac{1}{131} + \frac{1}{1664} + \frac{1}{214} + \frac{1}{1057}} = 0.117701$$
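The standard error calculation is easy to reproduce from the four cells involved in each comparison. The sketch below applies the same formula to CSC-Hi as well, which gives the value reported in Table 4.7; the function name `se_log_odds_ratio` is an illustrative assumption.

```python
import math


def se_log_odds_ratio(a, b, c, d):
    """Standard error of a log odds ratio from the four cells of a 2x2 table."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


# CSC-Med versus CSC-Low: Medium and Low columns of Table 4.6.
se_b1 = se_log_odds_ratio(131, 1664, 214, 1057)
# CSC-Hi versus CSC-Low: High and Low columns of Table 4.6.
se_b2 = se_log_odds_ratio(138, 1664, 214, 129)
print(round(se_b1, 6), round(se_b2, 6))
```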

Also similar to the dichotomous case, we may calculate 100(1 − α)% confidence intervals for the odds ratios, for the ith predictor, as follows:

$$\exp\left[b_i \pm z \cdot \widehat{SE}(b_i)\right]$$

For example, a 95% confidence interval for the odds ratio between CSC-Hi and CSC-Low is given by:

$$\exp\left[b_2 \pm z \cdot \widehat{SE}(b_2)\right] = \exp\left[2.11844 \pm (1.96)(0.142380)\right] = (e^{1.8394}, e^{2.3975}) = (6.29, 11.0)$$


as reported in Table 4.7. We are 95% confident that the odds ratio for churning for customers with high customer service calls compared to customers with low customer service calls lies between 6.29 and 11.0. Since the interval does not include $e^0 = 1$, the relationship is significant with 95% confidence.

However, consider the 95% confidence interval for the odds ratio between CSC-Med and CSC-Low:

$$\exp\left[b_1 \pm z \cdot \widehat{SE}(b_1)\right] = \exp\left[-0.0369891 \pm (1.96)(0.117701)\right] = (e^{-0.2677}, e^{0.1937}) = (0.77, 1.21)$$

as reported in Table 4.7. We are 95% confident that the odds ratio for churning for customers with medium customer service calls compared to customers with low customer service calls lies between 0.77 and 1.21. Since this interval does include $e^0 = 1$, the relationship is not significant with 95% confidence. Depending on other modeling factors, the analyst may consider collapsing CSC-Med and CSC-Low into a single category.
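Both intervals, and the check of whether each contains 1, can be reproduced with a short sketch. It uses the coefficients and standard errors of Table 4.7 and z = 1.96 for 95% confidence; the function name `odds_ratio_ci` is hypothetical.

```python
import math


def odds_ratio_ci(coef, se, z=1.96):
    """Confidence interval for an odds ratio: exp(coef -/+ z * se)."""
    return math.exp(coef - z * se), math.exp(coef + z * se)


ci_hi = odds_ratio_ci(2.11844, 0.142380)      # CSC-Hi versus CSC-Low
ci_med = odds_ratio_ci(-0.0369891, 0.117701)  # CSC-Med versus CSC-Low

for name, (lower, upper) in (("CSC-Hi", ci_hi), ("CSC-Med", ci_med)):
    # The odds ratio is significant at this level when 1 lies outside the interval.
    significant = not (lower <= 1 <= upper)
    print(f"{name}: ({lower:.2f}, {upper:.2f}), significant={significant}")
```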

In document Data Mining Methods And Models Larose DT (2006) pdf (Page 184-188)