Choosing a Decision Threshold
10. Select OK to save the code.
11. Run the SAS code and examine the results of the MEANS procedures.
Total Observations in the Validation Data Table

The MEANS Procedure

Analysis Variable : P_Default1 Predicted: Default=1

   N
----
1968
----

Number of Observations where P_Default1 greater than or equal to 0.33

The MEANS Procedure

Analysis Variable : P_Default1 Predicted: Default=1

   N
----
 391
----
Therefore, based on the theoretical approach, 391 of the 1968 applications in the validation data, or approximately 20%, have a predicted default probability of at least 0.33 and should be rejected.
As you will see in the next chapter, you can obtain the same result using the Model Comparison node if you input the profits associated with the decisions.
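As an illustration outside SAS, the count-and-divide step behind this result can be sketched in Python. This is a minimal sketch, not the course's SAS code: the function name rejection_summary and the sample scores are hypothetical, and only the 0.33 threshold and the counts 391 and 1968 come from the MEANS output above.

```python
def rejection_summary(probabilities, threshold=0.33):
    """Return (n_total, n_flagged, rate) for scores at or above the threshold."""
    n_total = len(probabilities)
    # Flag every case whose predicted probability meets or exceeds the threshold.
    n_flagged = sum(1 for p in probabilities if p >= threshold)
    rate = n_flagged / n_total if n_total else 0.0
    return n_total, n_flagged, rate


# Hypothetical predicted default probabilities (not taken from the course data).
scores = [0.05, 0.12, 0.33, 0.41, 0.27, 0.90, 0.08, 0.35, 0.10, 0.02]
n_total, n_flagged, rate = rejection_summary(scores)
print(f"{n_flagged} of {n_total} flagged ({rate:.1%})")

# The same division applied to the counts reported by the MEANS procedure.
print(f"{391 / 1968:.1%}")  # 19.9%
```

The comparison uses "greater than or equal to" so that a score exactly at the threshold is counted, matching the title of the second MEANS output.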
2.5 Exercises
1. Initial Data Exploration
A supermarket is beginning to offer a line of organic products. The supermarket's management would like to determine which customers are likely to purchase these products.
The supermarket has a customer loyalty program. As an initial buyer incentive plan, the supermarket provided coupons for the organic products to all of its loyalty program participants and has now collected data that includes whether or not these customers have purchased any of the organic products.
The ORGANICS data set contains over 22,000 observations and 18 variables. The variables in the data set are shown below with the appropriate roles and levels.
Name          Model Role  Measurement Level  Description
CUSTID        ID          Nominal            Customer loyalty identification number
GENDER        Input       Nominal            M = male, F = female, U = unknown
DOB           Rejected    Interval           Date of birth
EDATE         Rejected    Unary              Date extracted from the daily sales data base
AGE           Input       Interval           Age, in years
AGEGRP1       Input       Nominal            Age group 1
AGEGRP2       Input       Nominal            Age group 2
TV_REG        Input       Nominal            Television region
NGROUP        Input       Nominal            Neighborhood group
NEIGHBORHOOD  Input       Nominal            Type of residential neighborhood
LCDATE        Rejected    Interval           Loyalty card application date
LTIME         Input       Interval           Time as loyalty card member
ORGANICS      Target      Interval           Number of organic products purchased
BILL          Input       Interval           Total amount spent
REGION        Input       Nominal            Geographic region
CLASS         Input       Nominal            Customer loyalty status: tin, silver, gold, or platinum
ORGYN         Target      Binary             Organics purchased? 1 = Yes, 0 = No
AFFL          Input       Interval           Affluence grade on a scale from 1 to 30
Although two target variables are listed, these exercises concentrate on the binary variable ORGYN.
a. Set up a new project for this exercise with Exercise as the project name.
b. Create a new diagram called Organics.
c. Define the data set ADMT.ORGANICS as a data source for the project.
d. Set the model role for the target variable and examine the distribution of the variable. What is the proportion of individuals who purchased organic products?
e. Do you have any reservations about any of the other variables in the data set? Are there any variables that should not be included as input variables in your analysis?
f. The variables AGE, AGEGRP1, and AGEGRP2 are all different measurements for the same information. Presume that, based on previous experience, you know that AGE should be used for this type of modeling. Set the model role for AGEGRP1 and AGEGRP2 to rejected.
g. The variable NGROUP contains collapsed levels of the variable NEIGHBORHOOD. Therefore, only one of these variables should be used in a model. Presume that, based on previous experience, you believe that NGROUP is sufficient for this type of modeling effort. Set the model role for NEIGHBORHOOD to rejected.
h. The variables LCDATE and LTIME essentially measure the same thing. Set the model role for LCDATE to rejected, retaining the variable LTIME as an input variable.
i. The variable ORGANICS contains information that would not be known at the time you are developing a model to predict the purchase of organic products. Set the model role for ORGANICS to rejected.
j. Add the ADMT.ORGANICS data source to the diagram workspace.
k. Add a Data Partition node to the diagram and connect it to the Data Source
2. Predictive Modeling Using Decision Trees
a. Return to the Organics diagram in the Exercise project. Add a Decision Tree node to the workspace and connect it to the Data Partition node.
b. Run the diagram from the Decision Tree node with the default values for the decision tree.
c. Examine the tree results. How many leaves are in the tree that is selected based on the validation data set?
d. View the tree. Which variable was used for the first split? What were the competing splits for this first split?
e. Add a second Decision Tree node to the diagram and connect it to the Data Partition node.
f. In the Properties Panel of the new Decision Tree node, change the maximum number of branches from a node to 3 to allow for 3-way splits.
g. Run the diagram from the new Decision Tree node and examine the tree results. How many leaves are in the tree that is selected based on the validation data set?
h. Which variables were important in growing this tree?
i. View the tree. Which variable was used for the first split?
j. Close the tree results and add a Model Comparison node to the diagram. Connect both Decision Tree nodes to the Model Comparison node.
k. Using the Model Comparison node, which of the decision tree models
2.6 Solutions to Exercises
1. Initial Data Exploration
a. To set up a new project for this exercise: