Validating the Model - Data Mining Cookbook Robert Elliot (2001) pdf

The masterpiece is out of the oven! Now we want to ensure that it was cooked to perfection. It's time for the taste test!

Validating the model is a critical step in the process. It allows us to determine if we've successfully performed all the prior steps. If a model does not validate well, it can be due to data problems, poorly fitting variables, or problematic techniques. There are several methods for validating models. In this chapter, I begin with the basic tools for validating the model, gains tables and gains charts. Marketers and managers love them because they take the modeling results right to the bottom line. Next, I test the results of the model algorithm on an alternate data set. A major section of the chapter focuses on the steps for creating confidence intervals around the model estimates using resampling. This is gaining popularity as an excellent method for determining the robustness of a model. In the final section I discuss ways to validate the model by measuring its effect on key market drivers.

Gains Tables and Charts

A gains table is an excellent tool for evaluating the performance of a model. It contains actionable information that can be easily understood and used by non-

technical marketers and managers. Using our case study, I will demonstrate the power of a gains table for Method 1.

Method 1: One Model

Recall that in Method 1 I developed a score for targeting actives using the one-model approach. To further validate the results of the model, I put the information from the decile analysis in Figure 5.6 into a spreadsheet; now I can create the gains table in Figure 6.1.

Column A mirrors the decile analysis in Figure 5.6. It shows each decile containing 10% of the total mail file.

Column B is cumulative percent of file for validation portion of the prospect data set.

Column C, as in Figure 5.6, is the average Probability of Activation for each decile as defined by the model.

Column D, as in Figure 5.6, is the average Percent Actives in the validation data set for each decile. This represents the number of true actives in the validation data set divided by the total prospects for each decile.

Column E is a cumulative calculation of column D or Cumulative PercentActives. At each decile you can estimate the number of true actives for a given "Depth of File" or from decile 0 to a designated decile. If we consider the value for decile 4, we can say that the average Percent Active for deciles 0 through 4 is 0.214%.

Column F is the number of true actives in the decile (Col. A * Col. D).

Figure 6.1

Column G is column F divided by the sum of column F. This represents the percent Actives of the Total Actives that are contained in each decile. If we consider the value in Decile 1, we can say that 27.71% of all Actives in the validation data set were put into decile 1 by the model.

Column H is a cumulative of column F. This represents the total Number of Actives for a given ''Depth of File" or from decile 0 to a designated decile.

Column I is column H divided by the number of total actives. At each decile you can determine the Cumulative Percent of Total Actives that are contained in decile 0 through a given decile. If we consider the value in decile 1, we can say that 42.79% of all Actives are in deciles 0 and 1.

Column J is the "lift" of each decile. The lift is calculated by dividing the decile percent active by the overall percent active in column D. The value of 277 in decile 0 means that the prospects in decile 0 are 277% more likely to activate than the overall average.

Column K is the cumulative Lift through a specific decile. It is calculated by dividing a decile value in column E by the overall percent active.

The lift measurement is a favorite among marketers for evaluating and comparing models. At each decile it demonstrates the model's power to beat the random approach or average performance. In Figure 6.1, we see that decile 0 is 2.77 times the average. Up to and including decile 3, the model performs better than average. Another way to use lift is in

cumulative lift. This means to a given "Depth of File": the model performs better than random. If we go through decile 3, we can say that for the first four deciles, the model performs 64% better than average. Figure 6.2, a validation gains chart, provides a visual interpretation of the concept of Lift.

The gains chart is an excellent visual tool for understanding the power of a model. It is designed to compare the percent of the target group to the percent of the prospect file through the rank ordered data set. At 50% of the file we can capture 74% of the active accounts— a 47% increase in Actives. (This will be discussed in more detail in chapter 7.)

Method 2: Two Models

Recall that in Method 2, I built a model to target Response and a second model to target Actives given response.

In Figure 6.3 we see the ability of the Method 2 model to rank order the data. When comparing the active rate for the deciles, the power of the model is slightly better than the Method 1 model in distinguishing actives from nonresponders and nonactive responders. The lift measures imitate the similarity.

Figure 6.2 Validation gains chart.

Recall that the response model in the two-model approach did a good job of predicting actives overall. The gains chart in Figure 6.4 compares the activation model from the one -model approach to the response and combination models

developed using the two-model approach. The response model seems to do almost as well as the active models. Again, this supports our theory that the probability of being active is strongly driven by response.

In conclusion, Method 1 and Method 2 produce very similar results when modeling the probability of an account being active. This may not always be the case. It is worth exploring both methods to determine the best method for each situation. In the next sections, I will compare the stability of the models through further validation. In chapter 7, I will discuss alternate ways to use the Method 2 models in the implementation process.

Figure 6.3

Enhanced gains table on validation data.

Figure 6.4

Scoring Alternate Data Sets

In business, the typical purpose of developing a targeting model is to predict behavior on a data set other than that on which the model was developed. In our case study, I have our "hold out" sample that I used for validation. But because it was randomly sampled from the same data set as the model development data, I expect it to deliver a performance very similar to that of the model development data. The best way to test the robustness of the model is to score a campaign that more closely reflects the intended use. For example, a similar campaign from a different time period is optimal because most models are designed for future use in the same general market. Other options include campaigns from different geographic regions or for a slightly different product.

Our case study deals with the insurance industry, which is highly regulated. Therefore, it is quite common to begin marketing in just one or two states. Expansion typically happens on a state-by-state basis. For our case study, the goal is to use the model for name selection in a new state with the hope of improving on random selection. Our model was developed on a campaign in the state of New York. Now I will validate the scoring algorithm on the results from the state of Colorado.

I will concentrate on the Method 1 approach for this validation because the mechanics and results should be similar. The same processing will be performed on the Method 2 combined model for comparison. The first step is to read in the data from the Colorado campaign. The input statement is modified to read in only the necessary variables.

data acqmod.colorado;

infile 'F:\insur\acquisit\camp3.txt' missover recl=99; input

prosp_id 1-9 /*unique prospect identifier*/ pop_den $ 13 /*population density code*/ | | | | | | | | | |

no90eve 97-99 /*number of payments over 90 days */ ;

run;

The next step is to create the variable transformations needed for the model scoring. The code is a reduced version of the code used in chapter 4 to create the original variables so it won't be repeated here. I rerun the logistic model on the original (New York) data to create a data set with one observation using outest=acqmod.nyscore. This data set contains the coefficients of the model. The output from acqmod.nyscore is seen in Figure 6.5.

Figure 6.5 Coefficient data set.

proc logistic data=acqmod.colorado descending

outest=acqmod.nyscore;

weight splitwgt; model active =

HOM_CUI AGE_COS AGE_SQI INC_SQRT MORTAL1 MORTAL3 HOM_MED TOA_TAN TOA_CU TOB_LOG INQ_SQRT TOP_LOGI TOP_CU TOP_CUI CRL_LOW RAT_LOG BRT_MED POPDNSBC APT_INDD SGLE_IND GENDER_D CHILDIND OCC_G NO90DE_D ACTOPL6D; run;

proc print data = acqmod.nyscore; run;

The final scoring is carried out using PROC SCORE. The procedure multiplies the coefficients in the model from (acqmod.nyscore

values in the data set to be scored (acqmod.colorado) and creates a new data set (acqmod.validco), which contains the estimates or betas for each observation. The line beginning with 'id' brings additional variables into the scored data set.

proc score data=acqmod.colorado

out=acqmod.validco predict score=acqmod.nyscore type=parms;

id respond activate pros_id activ_r activate splitwgt records smp_wgt; VAR HOM_CUI AGE_COS AGE_SQI INC_SQRT MORTAL1 MORTAL3 HOM_MED TOA_TAN TOA_CU TOB_LOG INQ_SQRT TOP_LOGI TOP_CU TOP_CUI CRL_LOW RAT_LOG BRT_MED POPDNSBC APT_INDD SGLE_IND GENDER_D CHILDIND OCC_G NO90DE_D ACTOPL6D; run;

The output for PROC SCORE gives us the sum of the betas. Recall from chapter 1 that we use it in the following equation to calculate the probability:

The rank ordering for the estimate and the predicted probability (pred) are the same. I sort the file by descending estimate and find the sum of the weights (sumwgt) to create a decile analysis.

TEAM

FLY

proc sort data=acqmod.validco; by descending estimate;

run;

proc univariate data=acqmod.validco noprint; weight smp_wgt;

var estimate;

output out=preddata sumwgt=sumwgt; run;

data acqmod.validco; set acqmod.validco; active = (activate='1');

pred = exp(estimate)/(1+exp(estimate));

if (_n_ eq 1) then set preddata; retain sumwgt;

number+smp_wgt;

if number < .1*sumwgt then val_dec = 0; else if number < .2*sumwgt then val_dec = 1; else if number < .3*sumwgt then val_dec = 2; else if number < .4*sumwgt then val_dec = 3; else if number < .5*sumwgt then val_dec = 4; else if number < .6*sumwgt then val_dec = 5; else if number < .7*sumwgt then val_dec = 6; else if number < .8*sumwgt then val_dec = 7; else if number < .9*sumwgt then val_dec = 8; else val_dec = 9;

records = 1; run;

title1 "Gains Table - Colorado"; title2 "Validation Data";

PROC tabulate data=acqmod.validco; weight smp_wgt;

class val_dec;

var respond active pred records activ_r; table val_dec='Decile' all='Total',

records='Prospects'*sum=' '*f=comma10.

pred='Predicted Probability'*(mean=' '*f=11.5) active='Percent Active'*(mean=' '*f=11.5) /rts = 9 row=float;

run;

In Figure 6.6, we see that the Method 1 model does show a loss of power when scored on the campaign from another state. This is probably due to some population differences between states. Figure 6.7 shows similar results from the Method 2 model. In any case, the performance is still much better than random with the best decile showing three times the active rate of the worst decile.

Figure 6.6

Alternate state gains table— Method 1.

Figure 6.7

Therefore, in the typical application this type of model would be very efficient in reducing the costs related to entering a new state. The model could be used to score the new state data. The best approach is to mail the top two or three deciles and sample the remaining deciles. Then develop a new model with the results of the campaign.

At this point the Method 1 model and the Method 2 model are running neck and neck. The next section will describe some powerful tools to test the robustness of the models and select a winner.

Resampling

Resampling is a common-sense, nonstatistical technique for estimating and validating models. It provides an empirical estimation (based on experience and observation) instead of a parametric estimation (based on a system or distribution). Consider the basic premise that over-fitting is fitting a model so well that it is picking up irregularities in the data that may be unique to that particular data set. In model development, resampling is commonly used in two ways: (1) it avoids over-fitting by calculating model coefficients or estimates based on repeated sampling; or (2) it detects over-fitting by using repeated samples to validate the results of a model. Because our modeling technique is not prone to over-fitting the data, I will focus on the use of resampling as a validation technique. This allows us to calculate confidence intervals around our estimates.

Two main types of resampling techniques are used in database marketing: jackknifing and bootstrapping. The following discussion and examples highlight and compare the power of these two techniques.

Jackknifing

In its purest form, jackknifing is a resampling technique based on the "leave -one-out" principle. So, if N is the total number of observations in the data set, jackknifing calculates the estimates on N – 1 different samples each having N – 1 observations. This works well for small data sets. In model development, though, we are dealing with large data sets that can be cumbersome to process. A variation of the jackknifing procedure works well on large data sets. It works on the same principle as leave-one-out. Instead of just one record, it leaves out a group of records. Overall, it gives equal opportunity to every observation in the data set.

In our case study, the model was developed on a 50% random sample, presumed to be representative of the entire campaign data set. A 50% random sample was held out for validation. In this section, I use jackknifing to estimate the pre-

dicted probability of active, the actual active rate, and the lift for each decile using 100–99% samples. I will show the code for this process for the Method 1 model. The process will be repeated for the Method 2 model, and the results will be compared.

The program begins with the logistic regression to create an output file (acqmod.resamp) that contains only the validation data and a few key variables. Each record is scored with a predicted value (pred).

proc logistic data=acqmod.model2 descending; weight splitwgt;

model active =HOM_CUI AGE_COS AGE_SQI INC_SQRT MORTAL1 MORTAL3 HOM_MED TOA_TAN TOA_CU TOB_LOG INQ_SQRT TOP_LOGI TOP_CU TOP_CUI CRL_LOW RAT_LOG BRT_MED POPDNSBC APT_INDD SGLE_IND GENDER_D CHILDIND OCC_G NO90DE_D ACTOPL6D;

output out=acqmod.resamp(where=(splitwgt=.) keep=pred active records

smp_wgt splitwgt) pred=pred;

run;

The following code begins a macro that creates 100 jackknife samples. Using a do loop, each iteration eliminates 1% of the data. This is repeated 100 times. The ranuni (5555) function with the positive seed (5555) ensures the same random number assignment for each iteration of the do loop. The resulting samples each have a different 1% eliminated.

%macro jackknif;

%do prcnt = 1 %to 100;

data acqmod.outk&prcnt; set acqmod.resamp;

if .01*(&prcnt-1) < ranuni(5555) < .01*(&prcnt) then delete; run;

The following code is similar to that in chapter 5 for creating deciles. In this case, this process is repeated 100 times to create 100 decile values. The value &prcnt increments by 1 during each iteration.

proc sort data=acqmod.outk&prcnt; by descending pred;

run;

proc univariate data=acqmod.outk&prcnt noprint; weight smp_wgt;

var pred;

output out=preddata sumwgt=sumwgt; run;

data acqmod.outk&prcnt; set acqmod.outk&prcnt;

retain sumwgt; number+smp_wgt;

In proc summary, the average values for active rate (actmn&samp) and predicted probability (prdmn&samp) are calculated for each decile. This is repeated and incremented 100 times; 100 output data sets (jkmns&prcnt) are created with the average values.

proc summary data=acqmod.outk&prcnt; var active pred;

class val_dec; weight smp_wgt;

output out=acqmod.jkmns&prcnt mean=actmn&prcnt prdmn&prcnt;

run;

To calculate the lift for each decile, the value for the overall predicted probability of active and the actual active rate are needed. These are extracted from the output file from proc summary. There is one observation in the output data set where the decile value (val_dec) is missing (.). That represents the overall means for each requested variable.

data actomean(rename=(actmn&prcnt=actom&prcnt) drop=val_dec); set acqmod.jkmns&prcnt(where=(val_dec=.) keep=actmn&prcnt val_dec); run;

The overall values are appended to the data sets and the lifts are calculated.

data acqmod.jkmns&prcnt; set acqmod.jkmns&prcnt;

if (_n_ eq 1) then set actomean; retain actom&prcnt;

liftd&prcnt = 100*actmn&prcnt/actom&prcnt;

After this process is repeated 100 times, the macro is terminated.

%end; %mend; %jackknif;

The 100 output files are merged together by decile. The values for the mean and standard deviation are calculated for the predicted probability of active, the actual active rate, and the lift. The corresponding confidence intervals are also calculated.

data acqmod.jk_sum(keep = prdmjk lci_p uci_p actmjk lci_a uci_a lftmjk lci_l uci_l val_dec); merge

acqmod.jkmns1 acqmod.jkmns2 acqmod.jkmns3 . . . acqmod.jkmns99 acqmod.jkmns100;

by val_dec;

prdmjk = mean(of prdmn1 -prdmn100); /* predicted probability */ prdsdjk = std(of prdmn1 -prdmn100);

actmjk = mean(of actmn1 -actmn100); /* active rate */ actsdjk = std(of actmn1 -actmn100);

lftmjk = mean(of liftd1 -liftd100); /* lift on active rate */ lftsdjk = std(of liftd1 -liftd100);

lci_p = prdmjk - 1.96*actsdjk; /* confidence itnerval on predicted */ uci_p = prdmjk + 1.96*actsdjk;

lci_a = actmjk - 1.96*actsdjk; /* confidence itnerval on actual */ uci_a = actmjk + 1.96*actsdjk;

lci_l = lftmjk - 1.96*lftsdjk; /* confidence itnerval on lift */ uci_l = lftmjk + 1.96*lftsdjk;

run;

Proc format and proc tabulate create the gains table for validating the model. Figure 6.8 shows the jackknife estimates, along with the confidence intervals for the predicted probability of active, the actual active rate, and the lift.

proc format; picture perc

low-high = '009.999%' (mult=1000000); proc tabulate data=acqmod.jk_sum;

var prdmjk lci_p uci_p

actmjk lci_a uci_a lftmjk lci_l uci_l; class val_dec;

table (val_dec='Decile' all='Total'), (prdmjk='JK Est Prob'*mean=' '*f=perc. lci_p ='JK Lower CI Prob'*mean=' '*f=perc. uci_p ='JK Upper CI Prob'*mean=' '*f=perc. actmjk='JK Est % Active'*mean=' '*f=perc. lci_a ='JK Lower CI % Active'*mean=' '*f=perc.

In document Data Mining Cookbook Robert Elliot (2001) pdf (Page 156-183)