Smoothing a Categorical Variable - Ratner - Statistical and Machine-Learning Data Mining

The classic approach to including a categorical variable in the modeling process involves dummy variable coding. A categorical variable with k classes of qualitative (nonnumeric) information is replaced by a set of k - 1 quantitative dummy variables. Each dummy variable is defined by the presence or absence of a class value. The class left out is called the reference class, to which the other classes are compared when interpreting the effects of the dummy variables on response. The classic approach instructs that the complete set of k - 1 dummy variables be included in the model regardless of how many of them are declared nonsignificant. This approach is problematic when the number of classes is large, which is typically the case in big data applications. By chance alone, as the number of class values increases, the probability of one or more dummy variables being declared nonsignificant increases. Putting all the dummy variables in the model effectively adds “noise” or unreliability to the model, as nonsignificant variables are known to be noisy. Intuitively, a large set of inseparable dummy variables poses difficulty in model building in that they quickly “fill up” the model, not allowing room for other variables.
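As a minimal sketch of the k - 1 coding described above, the following Python function builds the dummy variables for one observation, leaving one class out as the reference. The function name `dummy_code`, the `D_` prefix, and the three example classes are illustrative assumptions, not taken from the text.

```python
def dummy_code(value, classes, reference):
    """Return a dict of k - 1 dummy variables (0/1) for one observation.

    `classes` lists all k class labels; `reference` is the class left out.
    """
    if reference not in classes:
        raise ValueError("reference must be one of the classes")
    if value not in classes:
        raise ValueError(f"unknown class: {value!r}")
    # One dummy per non-reference class: 1 if the class is present, else 0.
    return {f"D_{c}": int(value == c) for c in classes if c != reference}


# Hypothetical three-class variable with class "C" as the reference class.
classes = ["A", "B", "C"]
row_a = dummy_code("A", classes, reference="C")
row_c = dummy_code("C", classes, reference="C")  # all dummies are 0
print(row_a)
print(row_c)
```

An observation in the reference class has every dummy equal to 0, which is why the effects of the other classes are read relative to it.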

The EDA approach of treating a categorical variable for model inclusion is a viable alternative to the classic approach as it explicitly addresses the problems associated with a large set of dummy variables. It reduces the number of classes by merging (smoothing or averaging) the classes with comparable values of the dependent variable under study, which for the application of response modeling is the response rate. The smoothed categorical variable, now with fewer classes, is less likely to add noise in the model and allows more room for other variables to get into the model.

There is an additional benefit offered by smoothing of a categorical variable. The information captured by the smoothed categorical variable tends to be more reliable than that of the complete set of dummy variables. The reliability of information of the categorical variable is only as good as the aggregate reliability of information of the individual classes. Classes of small size tend to provide unreliable information. Consider the extreme situation of a class of size one. The estimated response rate for this class is either 100% or 0% because the sole individual either responds or does not respond, respectively. It is unlikely that the estimated response rate is the true response rate for this class. This class is considered to provide unreliable information regarding its true response rate. Thus, the reliability of information for the categorical variable itself decreases as the number of small class values increases. The smoothed categorical variable tends to have greater reliability than the set of dummy variables because it intrinsically has fewer classes and consequently has larger class sizes due to the merging process. The rule of thumb of EDA for small class size is that a class of fewer than 200 is considered small.
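One way to see why small classes are unreliable is to look at the binomial standard error of an estimated response rate, sqrt(p(1 - p)/n), which shrinks as the class size n grows. The text does not say how reliability is measured, so using the standard error here is an assumption, as are the function name `response_rate_se` and the 12% rate chosen for illustration.

```python
import math


def response_rate_se(p, n):
    """Binomial standard error of an estimated response rate p from n cases."""
    if n <= 0:
        raise ValueError("class size must be positive")
    return math.sqrt(p * (1.0 - p) / n)


# Assumed true response rate of 12%, chosen only for illustration.
p = 0.12
for n in (1, 19, 200, 2828):
    print(f"n = {n:>5}: standard error = {response_rate_se(p, n):.3f}")
```

The standard error falls with the square root of the class size, so quadrupling the size of a class halves the uncertainty in its estimated response rate.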

CHAID is often the preferred EDA technique for smoothing a categorical variable. In essence, CHAID is an excellent EDA technique as it involves the three main elements of statistical detective work: “numerical, counting, and graphical.” CHAID forms new larger classes based on a numerical merging, or averaging, of response rates and counts the reduction in the number of classes as it determines the best set of merged classes. Last, the output of CHAID is conveniently presented in a graphical display that is easy to read and understand: a treelike box diagram with leaf boxes representing the merged classes.

The technical details of the merging process of CHAID are beyond the scope of this chapter. CHAID is covered in detail in subsequent chapters, so here I briefly discuss and illustrate it with the smoothing of the last variable to be considered for predicting TXN_ADD response, namely, FD_TYPE.

8.16.1 Smoothing FD_TYPE with CHAID

Remember that FD_TYPE is a categorical variable that represents the product type of the customer’s most recent investment purchase. It assumes 14 products (classes) coded A, B, C, …, N. The TXN_ADD response rates by FD_TYPE value are in Table 8.18.

There are seven small classes (F, G, J, K, L, M, and N) with sizes 42, 45, 57, 94, 126, 19, and 131, respectively. Their response rates (0.26, 0.24, 0.19, 0.20, 0.22, 0.42, and 0.16, respectively) can be considered potentially unreliable. Class B has the largest size, 2,828, with a surely reliable 0.066 response rate.

Table 8.18 FD_TYPE

                TXN_ADD
FD_TYPE        N     MEAN
A            267    0.251
B          2,828    0.066
C            250    0.156
D            219    0.128
E            368    0.261
F             42    0.262
G             45    0.244
H            225    0.138
I            255    0.122
J             57    0.193
K             94    0.202
L            126    0.222
M             19    0.421
N            131    0.160
Total      4,926    0.119

The remaining six presumably reliable classes (A, C, D, E, H, and I) have sizes between 219 and 368.
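The split into small and presumably reliable classes can be reproduced from Table 8.18 with the 200-case rule of thumb. The sketch below transcribes the table by hand; the names `FD_TYPE_TABLE`, `SMALL_CLASS_CUTOFF`, and `split_by_size` are illustrative assumptions.

```python
# Class sizes and TXN_ADD response rates transcribed from Table 8.18.
FD_TYPE_TABLE = {
    "A": (267, 0.251), "B": (2828, 0.066), "C": (250, 0.156),
    "D": (219, 0.128), "E": (368, 0.261), "F": (42, 0.262),
    "G": (45, 0.244), "H": (225, 0.138), "I": (255, 0.122),
    "J": (57, 0.193), "K": (94, 0.202), "L": (126, 0.222),
    "M": (19, 0.421), "N": (131, 0.160),
}

SMALL_CLASS_CUTOFF = 200  # EDA rule of thumb: fewer than 200 is small


def split_by_size(table, cutoff=SMALL_CLASS_CUTOFF):
    """Return (small, large) lists of class labels, each in sorted order."""
    small = sorted(c for c, (n, _) in table.items() if n < cutoff)
    large = sorted(c for c, (n, _) in table.items() if n >= cutoff)
    return small, large


small_classes, large_classes = split_by_size(FD_TYPE_TABLE)
print("small:", small_classes)
print("large:", large_classes)
```

The large group returned here contains class B together with the six classes of size 219 to 368.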

The CHAID tree for FD_TYPE in Figure 8.17 is read and interpreted as follows:

1. The top box, the root of the tree, represents the sample of 4,926 with response rate 11.9%.

2. The CHAID technique smooths FD_TYPE by merging the original 14 classes into 3 merged (smoothed) classes, as displayed in the CHAID tree with three leaf boxes.

3. The leftmost leaf, which consists of six of the seven small unreliable classes (F, G, J, K, L, and M) and the two reliable classes A and E, represents a newly merged class with a reliable response rate of 24.7% based on a class size of 1,018. In this situation, the smoothing process increases the reliability of the small classes with two-step averaging. The first step combines these small classes into a temporary class, which by itself produces a reliable average response rate of 22.7% based on a class size of 383. In the second step, which does not always occur in smoothing, the temporary class is further united with the already-reliable classes A and E because the latter classes have response rates comparable to that of the temporary class. The double-smoothed newly merged class represents the average response rate of the six small classes and classes A and E. When double smoothing does not occur, the temporary class is the final class.

Figure 8.17 (CHAID tree for FD_TYPE): The root box is the total sample, with response rate 11.9% and size 4,926. The three leaf boxes, from left to right, are FD_TYPE = A, E, F, G, J, K, L, M (24.7%, size 1,018); FD_TYPE = B (6.6%, size 2,828); and FD_TYPE = C, D, H, I, N (13.9%, size 1,080). Below the leaves: CH_TYPE = 1 has CH_FTY_1 = 1 and CH_FTY_2 = 0; CH_TYPE = 2 has CH_FTY_1 = 0 and CH_FTY_2 = 1; CH_TYPE = 3 has CH_FTY_1 = 0 and CH_FTY_2 = 0.

4. The increased reliability that smoothing of a categorical variable offers can now be clearly illustrated. Consider class M with its unreliable estimated response rate of 42% based on class size 19. The smoothing process puts class M in the larger, more reliable leftmost leaf with a response rate of 24.7%. The implication is that class M now has a more reliable estimate of response rate, namely, the response rate of its newly assigned class, 24.7%. Thus, the smoothing has effectively adjusted the original estimated response rate of class M downward, from a positively biased 42% to a reliable 24.7%. In contrast, within the same smoothing process, the adjustment of class J is upward, from a negatively biased 19% to 24.7%. It is not surprising that the two reliable classes, A and E, remain essentially unchanged, moving from 25% and 26%, respectively, to 24.7%.

5. The middle leaf consists of only class B, defined by a large class size of 2,828 with a reliable response rate of 6.6%. Apparently, the low response rate of class B is not comparable to any class (original, temporary, or newly merged) response rate to warrant a merging. Thus, the original estimated response rate of class B is unchanged after the smoothing process. This presents no concern over the reliability of class B because its class size is largest from the outset.

6. The rightmost leaf consists of large classes C, D, H, and I and the small class N for a reliable average response rate of 13.9% with class size 1,080. The smoothing process adjusts the response rate of class N downward, from 16% to a smooth 13.9%. The same downward adjustment occurs for class C. The remaining classes D, H, and I experience an upward adjustment.
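The averaging behind the three leaves can be checked with a size-weighted mean of the class response rates. The sketch below is not the CHAID algorithm itself; it only recomputes the merged rates for the leaf memberships read from Figure 8.17. The names `FD_TYPE_TABLE`, `LEAVES`, and `merged_rate` are illustrative assumptions, and because the means in Table 8.18 are rounded to three decimals, the recomputed rates may differ slightly from the figures quoted in the text.

```python
# Class sizes and TXN_ADD response rates transcribed from Table 8.18.
FD_TYPE_TABLE = {
    "A": (267, 0.251), "B": (2828, 0.066), "C": (250, 0.156),
    "D": (219, 0.128), "E": (368, 0.261), "F": (42, 0.262),
    "G": (45, 0.244), "H": (225, 0.138), "I": (255, 0.122),
    "J": (57, 0.193), "K": (94, 0.202), "L": (126, 0.222),
    "M": (19, 0.421), "N": (131, 0.160),
}

# Leaf memberships as read from Figure 8.17.
LEAVES = {
    "left": ["A", "E", "F", "G", "J", "K", "L", "M"],
    "middle": ["B"],
    "right": ["C", "D", "H", "I", "N"],
}


def merged_rate(table, classes):
    """Size-weighted average response rate and total size of merged classes."""
    total_n = sum(table[c][0] for c in classes)
    responders = sum(table[c][0] * table[c][1] for c in classes)
    return responders / total_n, total_n


for name, classes in LEAVES.items():
    rate, size = merged_rate(FD_TYPE_TABLE, classes)
    print(f"{name:>6}: size = {size:>5}, response rate = {rate:.1%}")

# First smoothing step: the small classes of the left leaf on their own.
temp_rate, temp_size = merged_rate(FD_TYPE_TABLE, ["F", "G", "J", "K", "L", "M"])
print(f"temporary class: size = {temp_size}, response rate = {temp_rate:.1%}")
```

Weighting by class size is what keeps a tiny class such as M from dominating the merged rate: its 19 cases contribute little next to the 635 cases of classes A and E.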

I call the smoothed categorical variable CH_TYPE. Its three classes are labeled 1, 2, and 3, corresponding to the leaves from left to right, respectively (see bottom of Figure 8.17). I also create two dummy variables for CH_TYPE:

1. CH_FTY_1 = 1 if FD_TYPE = A, E, F, G, J, K, L, or M; otherwise, CH_FTY_1 = 0.

2. CH_FTY_2 = 1 if FD_TYPE = B; otherwise, CH_FTY_2 = 0.

3. This dummy variable construction uses class CH_TYPE = 3 as the reference class. If an individual has values CH_FTY_1 = 0 and CH_FTY_2 = 0, then the individual implicitly has CH_TYPE = 3 and one of the original classes (C, D, H, I, or N).
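The construction of CH_TYPE and its two dummy variables can be sketched as a simple lookup. The function name `smooth_fd_type` and the set names are illustrative assumptions; the class memberships follow the definitions above.

```python
LEFT_LEAF = set("AEFGJKLM")   # CH_TYPE = 1
MIDDLE_LEAF = {"B"}           # CH_TYPE = 2
RIGHT_LEAF = set("CDHIN")     # CH_TYPE = 3 (reference class)


def smooth_fd_type(fd_type):
    """Map an FD_TYPE code to the tuple (CH_TYPE, CH_FTY_1, CH_FTY_2)."""
    if fd_type in LEFT_LEAF:
        return 1, 1, 0
    if fd_type in MIDDLE_LEAF:
        return 2, 0, 1
    if fd_type in RIGHT_LEAF:
        # Reference class: both dummy variables are 0.
        return 3, 0, 0
    raise ValueError(f"unknown FD_TYPE: {fd_type!r}")


for code in "ABCDEFGHIJKLMN":
    ch_type, ch_fty_1, ch_fty_2 = smooth_fd_type(code)
    print(code, ch_type, ch_fty_1, ch_fty_2)
```

Only two dummy variables are needed for the three smoothed classes, compared with the 13 that the original 14 classes of FD_TYPE would require.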

8.16.2 Importance of CH_FTY_1 and CH_FTY_2

I assess the importance of the CHAID-based smoothed variable CH_TYPE by performing a logistic regression analysis on TXN_ADD with both CH_FTY_1 and CH_FTY_2, as the set of dummy variables must be together in the model; the output is in Table 8.19. The G/df value is 108.234 (= 216.468/2), which is greater than the standard G/df value of 4. Thus, CH_FTY_1 and CH_FTY_2 together are declared important predictor variables of TXN_ADD.
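The importance check amounts to dividing G by its degrees of freedom and comparing the result with the standard value of 4. A minimal sketch of that arithmetic follows; the function name `g_per_df` and the constant names are illustrative assumptions, and the G value is the one quoted in the text.

```python
def g_per_df(g, df):
    """Return G per degree of freedom for a set of predictor variables."""
    if df <= 0:
        raise ValueError("degrees of freedom must be positive")
    return g / df


G_VALUE = 216.468    # G for CH_FTY_1 and CH_FTY_2 together, as quoted in the text
DF = 2               # one degree of freedom per dummy variable
STANDARD_G_DF = 4    # standard G/df value used as the threshold

ratio = g_per_df(G_VALUE, DF)
is_important = ratio > STANDARD_G_DF
print(f"G/df = {ratio:.3f}; declared important: {is_important}")
```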
