Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

(1)

August 2015 Salford Systems

http://info.salford-systems.com/jsm-2015-ctw

(2)

Course Outline

• Demonstration of two classification examples in SPM

o Bank Marketing

o KDD Cup 2009

• Predictive Modeling package used for the examples

o Core Statistics

o Logistic Regression

o CART Decision Tree (original, by Jerome Friedman)

o MARS Spline Regression (original, by Jerome Friedman)

o TreeNet gradient boosting machine (original, by Jerome Friedman)

o RandomForests (original, Breiman and Cutler)

o Automation and model acceleration

(3)

Bank Marketing Data

• Portuguese bank marketing data

o 41,188 records

o 20 attributes, such as age, job, education, housing status

o The goal is to predict whether the client will subscribe to a term deposit

o Output variable (desired target): has the client subscribed to a term deposit? (binary: 'yes', 'no')

• The dataset is publicly available at the UCI Machine Learning Repository

o http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing

• Challenges

o Missing values

o Mixed categorical and numerical variables

o Variable selection

(4)

Sample Data

AGE  JOB          MARITAL  DEF  HOUSING  LOAN  CONTACT    EMP_VAR_RATE  CPI     CCI    EURIBOR  NUM_EMP  Y
56   housemaid    married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
57   services     married       no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
37   services     married  no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no
40   admin.       married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
56   services     married  no   no       yes   telephone  1.1           93.994  -36.4  4.857    5191     no
45   services     married       no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
59   admin.       married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
41   blue-collar  married       no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
24   technician   single   no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no
25   services     single   no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no

Other variables include: level of education, date of last contact, outcome of last campaign, days since last contact, etc.

Note: missing values, categorical and numeric variables

(5)

Open Raw Data: bank.CSV

(6)

Character Variables and Missing Values

(7)

Request Descriptive Statistics

All variables are included by default

(8)

Brief Descriptive Stats

Always check for the prevalence of missing data

Always review the number of distinct values (too few? too many?)

Does anything look wrong in the dataset?
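
The two checks above (missing data and distinct values) can be sketched outside SPM as well. The snippet below is only an illustration: the inline sample stands in for bank.CSV, and the set of strings treated as missing is an assumption, not something the slides specify.

```python
import csv
import io

# Assumed missing-value codes; adjust to whatever the raw file actually uses.
MISSING = {"", "NA", "unknown"}

def brief_stats(rows):
    """Return {column: (n_missing, n_distinct)} for a list of dict rows."""
    stats = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [v for v in values if v not in MISSING]
        stats[column] = (len(values) - len(present), len(set(present)))
    return stats

# Tiny illustrative sample, not the real file.
sample = io.StringIO(
    "AGE,JOB,DEF\n"
    "56,housemaid,no\n"
    "57,services,\n"
    "37,services,no\n"
)
rows = list(csv.DictReader(sample))
for column, (n_missing, n_distinct) in brief_stats(rows).items():
    print(f"{column}: missing={n_missing}, distinct={n_distinct}")
```

A column with a single distinct value carries no information, and a categorical column with nearly as many distinct values as records usually deserves a closer look.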

(9)

Full Descriptive Stats

Output contains detailed descriptive statistics for every variable

(10)

Frequency of Target variable

Target Variable

• 0 means “non-subscriber”

• 1 means “subscriber”

• It is not surprising that only a small percentage of people subscribed to a term deposit

(11)

Data Preparation

• The records in this dataset are ordered by date (from May 2008 to November 2010)

• Note that the 2008 economic crisis complicates this dataset because time has to be considered as a factor in the analysis

• We partitioned the data in time order, with 80% as learning data and the remaining 20% as testing data

• Note: a pdays value of 999 means the client was never contacted before this phone call
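
A time-ordered partition and the pdays sentinel can be sketched as follows. This is a minimal illustration, assuming the records are already sorted by date and that the earlier part is used for learning; the function names are hypothetical, and mapping 999 to None is just one possible treatment.

```python
def time_ordered_split(records, learn_fraction=0.8):
    """Split date-sorted records: the earlier part learns, the later part tests."""
    cut = int(len(records) * learn_fraction)
    return records[:cut], records[cut:]

def recode_pdays(pdays, never_contacted_code=999):
    """Map the 999 sentinel to None so it is not read as a real number of days."""
    return None if pdays == never_contacted_code else pdays

# Illustrative records only; "row" plays the role of the date order.
records = [{"row": i, "pdays": 999 if i % 3 else 6} for i in range(10)]
learn, test = time_ordered_split(records)
print(len(learn), len(test))
print([recode_pdays(r["pdays"]) for r in test])
```

Unlike a random partition, this split never lets the model learn from records that come after the ones it is tested on.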

(12)

Build LOGIT Model

(13)

LOGIT Model Summary

• The learn ROC value is 0.94, which is high enough that you should examine whether it is too good to be true

• The difference between the learn and test ROC tells us that time does have an impact
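
For readers who want to reproduce the learn/test comparison by hand, the area under the ROC curve can be computed directly from scores and labels. The sketch below uses the pairwise (Mann-Whitney) definition; the scores are made up for illustration and are not the model output from the slides.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve: the chance that a random positive record
    scores higher than a random negative record (ties count one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one record of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up scores: perfect separation on learn, weaker on test.
learn_auc = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
test_auc = roc_auc([0, 1, 0, 1], [0.2, 0.3, 0.6, 0.7])
print(f"learn={learn_auc:.2f} test={test_auc:.2f} gap={learn_auc - test_auc:.2f}")
```

The double loop is quadratic in the number of records, which is fine for a sketch but too slow for the full dataset; a rank-based formula gives the same value faster.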

(14)

LOGIT Model Coefficients

Partial coefficients are shown in the table above

(15)

CART

• Classification and Regression Trees

o Separates relevant from irrelevant predictors

o Yields simple, easy-to-understand results

o Doesn’t require variable transformations

o Impervious to outliers and missing values

• Fastest, most versatile predictive modeling algorithm available to analysts

• Provides the foundation for modern data mining techniques such as bagging and boosting

(16)

Build CART Model

(17)

Testing Method

(18)

CART Model

The learn and test samples perform quite differently with this model, which means that time is a factor influencing the outcome

The learn sample performance also looks too good to be true

(19)

Variable Importance

Duration: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed.

(20)

Rerun CART Model Excluding Duration

(21)

Variable Importance Ranking

CART gives an initial look at which variables are important; this is useful when there are quite a few predictors in your dataset.

(22)

Root Node Split Very Effective

• We can view node details by clicking “Tree Details” in the CART output window

• The first splitter is “month”, which the variable importance ranking table also shows as the most influential predictor

• The whole tree with details can be viewed as well

(23)

• Multivariate Adaptive Regression Splines (MARS)

• Uses “knots” to impose local linearities

• These knots create “basis functions” to decompose the information in each variable individually

[Figure: two plots of MV against LSTAT]
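
The knot idea can be made concrete with a pair of hinge basis functions. The sketch below is illustrative only: the knot location and the coefficients are hypothetical values chosen by hand, not numbers fitted by MARS to the MV/LSTAT data.

```python
def hinge(x, knot, direction=+1):
    """MARS basis function: max(0, x - knot) for direction +1,
    max(0, knot - x) for direction -1."""
    return max(0.0, direction * (x - knot))

def mars_predict(x, intercept, terms):
    """terms is a list of (coefficient, knot, direction) tuples."""
    return intercept + sum(c * hinge(x, k, d) for c, k, d in terms)

# Hypothetical model with one knot at LSTAT = 10: a steep slope below the
# knot and a gentler slope above it.
intercept = 25.0
terms = [(2.0, 10.0, -1), (-0.5, 10.0, +1)]
for lstat in (5.0, 10.0, 20.0):
    print(lstat, mars_predict(lstat, intercept, terms))
```

Each basis function is zero on one side of its knot, so the fitted curve is a sequence of straight-line pieces that change slope at the knots.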

(24)

Build MARS Model

(25)

MARS Model Setup

• The default Max Basis Functions setting is 15; the model often hits this limit and stops before reaching the optimal model

• So we set it to 60 after a couple of runs

(26)

MARS Output Window

This output window shows the number of basis functions in the model against the performance of the model. Because MARS is a regression engine, the MSE and R-squared values will still be reported, but can be ignored here.

(27)

Summary

This model improved in targeting customers, with an ROC of 0.72.

(28)

MARS Basis function

Here is where the logistic regression equation is laid out in terms of the basis functions (transformations of the predictors). Each basis function is described and the final model is listed at the bottom. This form of output is especially desired by those who are comfortable with standard regression.

(29)

MARS Plots

• Note the presence of nonlinearity in this dataset

(30)

TreeNet

• Stochastic Gradient Boosting

• Small decision trees built in an error-correcting sequence

1. Begin with a small tree as the initial model

2. Compute residuals from this model for all records

3. Grow a second small tree to predict these residuals

4. And so on…
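
The error-correcting sequence above can be sketched in a few lines. This is a simplified illustration, not TreeNet itself: it assumes a single numeric predictor with at least two distinct values, uses one-split trees and squared-error residuals, and leaves out the random subsampling that makes the method "stochastic". The data and the learn_rate value are made up.

```python
def fit_stump(xs, residuals):
    """Best one-split regression tree on a single predictor (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=20, learn_rate=0.5):
    """Grow small trees in sequence, each one fitted to the current residuals."""
    trees, history = [], []
    predictions = [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learn_rate * tree(x) for x, p in zip(xs, predictions)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return trees, history

xs = list(range(10))
ys = [float(x * x) for x in xs]
trees, history = boost(xs, ys)
print(f"MSE after 1 tree: {history[0]:.2f}, after {len(trees)} trees: {history[-1]:.2f}")
```

The learn_rate shrinks each tree's contribution, which is why many small trees are needed and why the number of trees becomes the main quantity to tune.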

(31)

Build TreeNet Model

(32)

TreeNet Output Window

The Output window shows a graph of the number of trees in the ensemble with its corresponding ROC value. The vertical green bar denotes the model with the optimal ROC: 9 trees at 0.69.

(33)

Partial Dependency Plots

Using TreeNet for targeted marketing improves on random calling and gives you an idea of how the predictors affect subscription

(34)

Random Forests

• Ensemble of trees built on bootstrap samples

• Algorithm:

o Each tree is grown on a bootstrap sample from the learning data

o During tree growing, only P predictors are selected and tried at each node

o By default, P is the square root of the total number of predictors

• The overall prediction is determined by averaging

• The Law of Large Numbers ensures convergence

• The key to accuracy is low correlation between trees and low bias

• To keep bias low, trees are grown to maximum depth
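
The two sources of randomness in the algorithm above, plus the final averaging, can be sketched as follows. The tree-growing step itself is omitted, the function names are hypothetical, and rounding the square root down is an assumption about how P is chosen.

```python
import math
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement from the learning data."""
    return [rng.choice(rows) for _ in rows]

def candidate_predictors(predictors, rng):
    """Pick P predictors at random to try at one node, with P = floor(sqrt(total))."""
    p = max(1, math.isqrt(len(predictors)))
    return rng.sample(predictors, p)

def forest_vote(tree_predictions):
    """Average the 0/1 predictions of the individual trees."""
    return sum(tree_predictions) / len(tree_predictions)

rng = random.Random(0)
rows = list(range(100))
predictors = [f"X{i}" for i in range(1, 21)]  # 20 predictors, as in the bank data
sample = bootstrap_sample(rows, rng)
print(len(sample), "rows drawn,", len(set(sample)), "distinct")
print(candidate_predictors(predictors, rng))
print(forest_vote([1, 0, 1, 1]))
```

Because each tree sees a different bootstrap sample and a different predictor subset at every node, the trees disagree with one another, and that low correlation is what the averaging exploits.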

(35)

Build RandomForests Model

(36)

RandomForests Output1

The optimal RandomForests model is always the one with the most trees

(37)

RandomForests Summary

(38)

Prediction Success Table1

We want to minimize the rate of false “non-subscriber” predictions so that the least effort is spent
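
A prediction success table is a cross-tabulation of actual class against predicted class. The sketch below builds one from made-up labels and reads the false “non-subscriber” rate as the share of actual subscribers that the model labels as non-subscribers; that reading is an assumption about what the slide means.

```python
def prediction_success(actual, predicted):
    """Count records for every (actual class, predicted class) pair; classes are 0 and 1."""
    table = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        table[(a, p)] += 1
    return table

def false_non_subscriber_rate(table):
    """Share of actual subscribers (class 1) predicted as non-subscribers (class 0)."""
    subscribers = table[(1, 0)] + table[(1, 1)]
    return table[(1, 0)] / subscribers if subscribers else 0.0

# Made-up labels for illustration.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
table = prediction_success(actual, predicted)
print(table)
print(false_non_subscriber_rate(table))
```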

(39)

Adjust Class Weights

• The Class Weights default is “BALANCED”, which means small classes are upweighted to equal the size of the largest target class

• Now we manually upweight class “1”, the small class, even more than the “BALANCED” setting does
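
The effect of balanced weighting can be worked out by hand. The sketch below assumes "balanced" means each class weight is the size of the largest class divided by the size of that class, so that all weighted class totals match; the 90/10 split and the extra factor of 1.5 are hypothetical.

```python
from collections import Counter

def balanced_class_weights(targets):
    """Weight each class so its weighted size equals the largest class."""
    counts = Counter(targets)
    largest = max(counts.values())
    return {cls: largest / n for cls, n in counts.items()}

# Hypothetical target: 90 non-subscribers (0) and 10 subscribers (1).
targets = [0] * 90 + [1] * 10
weights = balanced_class_weights(targets)
print(weights)

# Manual upweighting of the small class beyond "balanced" (factor is illustrative).
manual = dict(weights)
manual[1] *= 1.5
print(manual)
```

Raising the weight of class 1 further makes a missed subscriber more costly than a wasted call, which pushes the model toward predicting "subscriber" more often.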

(40)

Prediction Success Table2

(41)

Conclusion

• CART, MARS, TreeNet and RandomForests

o Handle missing values automatically

o Detect interactions and nonlinearity automatically

o Models can be translated into other programming languages

o Model performance usually exceeds that of traditional classification algorithms

o Advanced settings boost model performance

• CART provides initial insights into the dataset

• MARS gives equations in a linear regression format with transformations of the original predictors

• TreeNet generates more accurate models

• RandomForests excels with wide datasets

(42)

KDD Cup 2009

• Knowledge Discovery and Data Mining competition, held once a year to challenge modelers with a task

o http://www.kdd.org/kddcup/index.php - competitions from 1997-2010

o Includes tasks, data, rules, results, and FAQs

• KDD Cup 2009 was about customer relationship prediction

• French telecom company Orange provided large marketing databases

• Overall goal was to beat the in-house system implemented by Orange

(43)

Datasets

• 50,000 customers

• 15,000 predictors

o e.g., demographic, geographic, behavioral

• Three binary classification tasks:

o Appetency: customer buys a new product or service

o Churn: customer switches providers

o Upselling: customer buys an upgrade offered to them

• Training and testing dataset

• Smaller subsets of data available for practice

(44)

Challenges

• Large database

o 50,000 x 15,000

• Numerical and categorical variables

• Missing data

• Unbalanced class distributions

o Many more customers NOT doing these things

• Sanitized data - no intuition

(45)

Data Preparation

• Combine multiple datasets

o Large dataset broken into 5 “chunks”, 53 MB each

o True target values needed to be appended

• Delete or impute missing values

o Not necessary in SPM

• Handle categorical variables

o Create dummy indicators

o Combine levels in variables with many levels

o Again, not necessary in SPM
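
For tools that do need this preparation, combining rare levels and creating dummy indicators might look like the sketch below. The function names, the "OTHER" label, and the min_count threshold are all hypothetical choices for illustration.

```python
from collections import Counter

def combine_rare_levels(values, min_count, other="OTHER"):
    """Replace levels seen fewer than min_count times with a single catch-all level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

def dummy_indicators(values):
    """Turn one categorical column into one 0/1 indicator column per level."""
    levels = sorted(set(values))
    return [{f"is_{level}": int(v == level) for level in levels} for v in values]

raw = ["a", "a", "b", "c", "a", "b"]
combined = combine_rare_levels(raw, min_count=2)
print(combined)
for row in dummy_indicators(combined):
    print(row)
```

Combining levels first keeps the number of indicator columns manageable when a variable has hundreds of distinct values.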

(46)

Open Prepared Data

(47)

View Data

…

(48)

Run Descriptive Statistics

(49)

Target Frequencies

(50)

Appetency

• In this context, appetency is the propensity of the customer to buy a new product or service

(51)

CART Model Setup

• Choose CART as the Analysis Engine

• Our target is coded -1/1, so we will choose Classification/Logistic Binary as the Target Type

• Appetency is our response variable and VAR1-VAR15000 are our predictors

(52)

Setting a Testing Method

• A separate test dataset is provided in the competition, but true target values were not included

• For model-building, we will use a 20% random partition of the training dataset to monitor performance

(53)

Restricting Tree Size

• We are interested in looking at the CART ranking of important predictors

• By forcing the tree to only one split, we can quickly create a tree to access this information
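
To give a feel for what a one-split tree measures, the sketch below ranks predictors by the best Gini improvement each could achieve at the root. It is a simplification for illustration: it handles numeric predictors only, ignores surrogates and penalties, and is not CART's actual importance calculation. The two toy columns are made up.

```python
def gini(labels):
    """Gini impurity of a binary target, where class 1 is the class of interest."""
    if not labels:
        return 0.0
    p = sum(1 for y in labels if y == 1) / len(labels)
    return 2 * p * (1 - p)

def split_improvement(values, labels, threshold):
    """Impurity decrease from splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return gini(labels) - len(left) / n * gini(left) - len(right) / n * gini(right)

def rank_by_best_single_split(columns, labels):
    """Order predictor names by their best root-node improvement, best first."""
    ranking = []
    for name, values in columns.items():
        thresholds = sorted(set(values))[:-1]
        best = max((split_improvement(values, labels, t) for t in thresholds), default=0.0)
        ranking.append((best, name))
    return [name for best, name in sorted(ranking, reverse=True)]

labels = [-1, -1, 1, 1]  # target coded -1/1
columns = {"VAR1": [1, 2, 3, 4], "VAR2": [1, 2, 1, 2]}
print(rank_by_best_single_split(columns, labels))
```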

(54)

Penalties

• We are aware there are variables with many missing values and variables with a high number of categorical levels

• Setting penalties on these cases makes it harder to include these in the model

(55)

Results - Single Split CART Tree

(56)

Variable Improvement Measures

(57)

TreeNet Model Setup

(58)

Results - TreeNet Ensemble

(59)

Variable Selection

• Improvement measures are averaged across all trees in the ensemble

• Only 185 of the original 15,000 predictors are flagged as important

(60)

Recursive Feature Elimination (RFE)

• Remove one variable at a time from the TOP of the variable importance list to eliminate “too good” predictors

(61)

RFE, Step 2

• Remove one variable at a time from the BOTTOM of the variable importance list to eliminate weak predictors

• Final ROC: 0.9048
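
Removing predictors from the bottom of the importance list can be sketched as a loop around a model-fitting step. Everything here is hypothetical: fit_and_rank stands in for building a model and reading back its ROC and importance ranking, the toy scorer is invented, and stopping as soon as ROC drops is an assumed rule that the slides do not state.

```python
def shave_from_bottom(predictors, fit_and_rank, tolerance=0.0):
    """Repeatedly drop the least important predictor while ROC does not get worse.

    fit_and_rank(predictors) must return (roc, predictors ranked most to least important).
    """
    best_roc, ranked = fit_and_rank(predictors)
    current = list(ranked)
    while len(current) > 1:
        trial = current[:-1]  # drop the predictor at the bottom of the list
        roc, trial_ranked = fit_and_rank(trial)
        if roc < best_roc - tolerance:
            break
        best_roc, current = max(best_roc, roc), list(trial_ranked)
    return current, best_roc

# Toy stand-in for a real model: two useful predictors, the rest add noise.
USEFUL = {"VAR3", "VAR7"}

def toy_fit_and_rank(predictors):
    n_useful = sum(1 for p in predictors if p in USEFUL)
    n_noise = len(predictors) - n_useful
    roc = 0.5 + 0.2 * n_useful - 0.01 * n_noise
    ranked = sorted(predictors, key=lambda p: (p not in USEFUL, p))
    return roc, ranked

kept, roc = shave_from_bottom(["VAR1", "VAR3", "VAR5", "VAR7"], toy_fit_and_rank)
print(kept, round(roc, 4))
```

Refitting after every removal matters because the importance ranking can change once a predictor is gone.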

(62)

Parameter Variation - Automates

• Each TreeNet control parameter can be automatically varied over its values

• A model is built at each step and summarized

(63)

Stability of the Model

• Automate PARTITION varies the learn/test partition so the user can observe the stability of model performance
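
The idea of varying the partition can be sketched as repeated random splits with different seeds. The evaluate callback is hypothetical: in practice it would build a model on the learn indices and return the ROC on the test indices. The stand-in used here only returns the test share so that the snippet runs on its own.

```python
import random
import statistics

def partition_stability(n_records, evaluate, n_repeats=5, test_fraction=0.2):
    """Re-split the data with a different seed each time and summarize the results.

    evaluate(learn_indices, test_indices) must return a performance value such as ROC.
    """
    results = []
    for seed in range(n_repeats):
        rng = random.Random(seed)
        indices = list(range(n_records))
        rng.shuffle(indices)
        n_test = round(n_records * test_fraction)
        cut = n_records - n_test
        results.append(evaluate(indices[:cut], indices[cut:]))
    return statistics.mean(results), statistics.stdev(results)

# Stand-in evaluation: reports the test share instead of a real model's ROC.
def share_in_test(learn_indices, test_indices):
    return len(test_indices) / (len(learn_indices) + len(test_indices))

mean_value, spread = partition_stability(100, share_in_test)
print(mean_value, spread)
```

A small spread across partitions suggests the reported performance does not depend on one lucky split.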

(64)

Repeat on Churn

• Churn is the propensity of the customer to switch providers

• We repeat the same steps of model-building to achieve a final model

(65)

Repeat on Upsell

• Upsell is the propensity of the customer to buy an upgrade offered to them

• We repeat the same steps of model-building to achieve a final model

(66)

Summary of Results

Rank  Team                                     Appetency  Churn   Upselling  Score
1     IBM Research                             0.8830     0.7611  0.9038     0.8493
–     You!                                     0.9048     0.7320  0.9059     0.8476
2     ID Analytics, Inc.                       0.8724     0.7565  0.9056     0.8448
3     Old dogs with new tricks                 0.8740     0.7541  0.9050     0.8443
4     Crusaders                                0.8688     0.7569  0.9034     0.8430
5     Financial Engineering Group, Inc. Japan  0.8732     0.7498  0.9057     0.8429

• Unable to compare to true target values because these were only seen by competition judges

• However, we are confident in our results (2 of the above groups used SPM)

• Results can vary based on optimal selection criterion, random number seed, etc.
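
The Score column appears to be the arithmetic mean of the three task values, which can be checked directly from the table. The tolerance in the sketch below is an assumption that allows for the published values being rounded to four decimals.

```python
# (Appetency, Churn, Upselling, Score) as listed in the table above.
results = {
    "IBM Research": (0.8830, 0.7611, 0.9038, 0.8493),
    "You!": (0.9048, 0.7320, 0.9059, 0.8476),
    "ID Analytics, Inc.": (0.8724, 0.7565, 0.9056, 0.8448),
    "Old dogs with new tricks": (0.8740, 0.7541, 0.9050, 0.8443),
    "Crusaders": (0.8688, 0.7569, 0.9034, 0.8430),
    "Financial Engineering Group, Inc. Japan": (0.8732, 0.7498, 0.9057, 0.8429),
}

def mean_score(appetency, churn, upselling):
    """Average the three task values into one overall score."""
    return (appetency + churn + upselling) / 3

for team, (appetency, churn, upselling, score) in results.items():
    computed = mean_score(appetency, churn, upselling)
    matches = abs(computed - score) < 1e-4  # allow for rounding in the table
    print(f"{team}: computed={computed:.4f} listed={score:.4f} match={matches}")
```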

(67)

Overall Conclusions

• We were able to narrow down the predictor list significantly using TreeNet and Automate SHAVING

o Of the original 15,000 predictors:

• Appetency: 167

• Churn: 249

• Upselling: 165

• Handling of categorical variables and missing values was automatic and didn’t cause any issues

• Small rates in the class of interest didn’t pose a problem

o Priors/Costs and Class Weights can control for this in CART and TreeNet