Choosing a Decision Threshold

10. Select OK to save the code.

11. Run the SAS code and examine the results of the MEANS procedures.

Total Observations in the Validation Data Table

The MEANS Procedure

Analysis Variable: P_Default1 (Predicted: Default=1)

N = 1968

Number of Observations where P_Default1 greater than or equal to 0.33

The MEANS Procedure

Analysis Variable: P_Default1 (Predicted: Default=1)

N = 391

Therefore, based on the theoretical approach, 391 out of 1968 applications, or approximately 20%, should be rejected.
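The counting step behind these two MEANS outputs can be sketched outside SAS. The following is a minimal Python illustration, not the course's code: the function name `rejection_summary` and the short list of probabilities are hypothetical, and only the figures 391, 1968, and 0.33 come from the results above.

```python
def rejection_summary(probabilities, threshold):
    """Count cases whose predicted default probability is >= threshold.

    Returns (total cases, rejected cases, rejection rate).
    """
    total = len(probabilities)
    rejected = sum(1 for p in probabilities if p >= threshold)
    rate = rejected / total if total else 0.0
    return total, rejected, rate


# Hypothetical predicted probabilities, for illustration only.
example = [0.05, 0.12, 0.33, 0.41, 0.08, 0.72, 0.29, 0.30]
total, rejected, rate = rejection_summary(example, 0.33)
print(total, rejected, rate)

# The rate reported in the text: 391 of 1968 validation cases.
reported_rate = 391 / 1968
print(round(reported_rate, 3))
```

The comparison uses "greater than or equal to" so that a case scored exactly at the threshold is counted as a rejection, matching the wording of the second MEANS output.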

As you will see in the next chapter, you can obtain the same result using the Model Comparison node if you input the profits associated with the decisions.
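One way to see how profits can imply a threshold is the following Python sketch. It assumes a two-decision, two-outcome profit matrix and rejects an application whenever the expected profit of rejecting is at least that of accepting. The profit values used here are hypothetical and are not taken from the course data; they were chosen only because they happen to yield a cutoff near 0.33. The function name `profit_threshold` is likewise an assumption.

```python
def profit_threshold(accept_good, accept_bad, reject_good=0.0, reject_bad=0.0):
    """Probability of default at which rejecting becomes at least as
    profitable as accepting.

    With p = P(default):
      E[accept] = p * accept_bad + (1 - p) * accept_good
      E[reject] = p * reject_bad + (1 - p) * reject_good
    Setting E[reject] >= E[accept] and solving for p gives the cutoff.
    """
    gain_if_good = accept_good - reject_good  # benefit of accepting a non-defaulter
    loss_if_bad = reject_bad - accept_bad     # loss avoided by rejecting a defaulter
    if gain_if_good <= 0 or loss_if_bad <= 0:
        raise ValueError("profit matrix does not imply a cutoff between 0 and 1")
    return gain_if_good / (gain_if_good + loss_if_bad)


# Hypothetical profits: +1 for an accepted non-defaulter, -2 for an
# accepted defaulter, 0 for any rejected application.
threshold = profit_threshold(accept_good=1.0, accept_bad=-2.0)
print(round(threshold, 2))
```

Under these assumed profits the cutoff is one third: a defaulter costs twice what a good applicant earns, so rejection pays off once the predicted default probability reaches about 0.33.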

2.5 Exercises

1. Initial Data Exploration

A supermarket is beginning to offer a line of organic products. The supermarket’s management would like to determine which customers are likely to purchase these products.

The supermarket has a customer loyalty program. As an initial buyer incentive plan, the supermarket provided coupons for the organic products to all of its loyalty program participants and has now collected data that includes whether or not these customers have purchased any of the organic products.

The ORGANICS data set contains over 22,000 observations and 18 variables. The variables in the data set are shown below with the appropriate roles and levels.

| Name | Model Role | Measurement Level | Description |
|---|---|---|---|
| CUSTID | ID | Nominal | Customer loyalty identification number |
| GENDER | Input | Nominal | M = male, F = female, U = unknown |
| DOB | Rejected | Interval | Date of birth |
| EDATE | Rejected | Unary | Date extracted from the daily sales data base |
| AGE | Input | Interval | Age, in years |
| AGEGRP1 | Input | Nominal | Age group 1 |
| AGEGRP2 | Input | Nominal | Age group 2 |
| TV_REG | Input | Nominal | Television region |
| NGROUP | Input | Nominal | Neighborhood group |
| NEIGHBORHOOD | Input | Nominal | Type of residential neighborhood |
| LCDATE | Rejected | Interval | Loyalty card application date |
| LTIME | Input | Interval | Time as loyalty card member |
| ORGANICS | Target | Interval | Number of organic products purchased |
| BILL | Input | Interval | Total amount spent |
| REGION | Input | Nominal | Geographic region |
| CLASS | Input | Nominal | Customer loyalty status: tin, silver, gold, or platinum |
| ORGYN | Target | Binary | Organics purchased? 1 = Yes, 0 = No |
| AFFL | Input | Interval | Affluence grade on a scale from 1 to 30 |

Although two target variables are listed, these exercises concentrate on the binary variable ORGYN.

a. Set up a new project for this exercise with Exercise as the project name.

b. Create a new diagram called Organics.

c. Define the data set ADMT.ORGANICS as a data source for the project.

d. Set the model role for the target variable and examine the distribution of the variable. What is the proportion of individuals who purchased organic products?

e. Do you have any reservations about any of the other variables in the data set? Are there any variables that should not be included as input variables in your analysis?

f. The variables AGE, AGEGRP1, and AGEGRP2 are all different measurements for the same information. Presume that, based on previous experience, you know that AGE should be used for this type of modeling. Set the model role for AGEGRP1 and AGEGRP2 to rejected.

g. The variable NGROUP contains collapsed levels of the variable NEIGHBORHOOD. Therefore, only one of these variables should be used in a model. Presume that, based on previous experience, you believe that NGROUP is sufficient for this type of modeling effort. Set the model role for NEIGHBORHOOD to rejected.

h. The variables LCDATE and LTIME essentially measure the same thing. Set the model role for LCDATE to rejected, retaining the variable LTIME as an input variable.

i. The variable ORGANICS contains information that would not be known at the time you are developing a model to predict the purchase of organic products. Set the model role for ORGANICS to rejected.

j. Add the ADMT.ORGANICS data source to the diagram workspace.

k. Add a Data Partition node to the diagram and connect it to the Data Source

2. Predictive Modeling Using Decision Trees

a. Return to the Organics diagram in the Exercise project. Add a Decision Tree node to the workspace and connect it to the Data Partition node.

b. Run the diagram from the Decision Tree node with the default values for the decision tree.

c. Examine the tree results. How many leaves are in the tree that is selected based on the validation data set?

d. View the tree. Which variable was used for the first split? What were the competing splits for this first split?

e. Add a second Decision Tree node to the diagram and connect it to the Data Partition node.

f. In the Properties Panel of the new Decision Tree node, change the maximum number of branches from a node to 3 to allow for 3-way splits.

g. Run the diagram from the new Decision Tree node and examine the tree results. How many leaves are in the tree that is selected based on the validation data set?

h. Which variables were important in growing this tree?

i. View the tree. Which variable was used for the first split?

j. Close the tree results and add a Model Comparison node to the diagram. Connect both Decision Tree nodes to the Model Comparison node.

k. Using the Model Comparison node, which of the decision tree models

2.6 Solutions to Exercises

1. Initial Data Exploration

a. To set up a new project for this exercise:

In document Applying Data Mining Techniques Using SAS Enterprise Miner. Course Notes (Page 106-111)