Applied Data Mining Analysis:
A Step-by-Step Introduction Using Real-World Data Sets
August 2015 Salford Systems
http://info.salford-systems.com/jsm-2015-ctw
Course Outline
• Demonstration of two classification examples in SPM
o Bank Marketing
o KDD Cup 2009
• Predictive Modeling package used for the examples
o Core Statistics
o Logistic Regression
o CART Decision Tree (original, by Jerome Friedman)
o MARS Spline Regression (original, by Jerome Friedman)
o TreeNet gradient boosting machine (original, by Jerome Friedman)
o RandomForests (original, Breiman and Cutler)
o Automation and model acceleration
Bank Marketing Data
• Portuguese bank marketing data
o 41,188 records
o 20 attributes, such as age, job, education, housing status
o The goal is to predict whether the client will subscribe to a term deposit
o Output variable (desired target): has the client subscribed to a term deposit? (binary: 'yes', 'no')
• Dataset is publicly available at the UCI Machine Learning Repository
o http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing
• Challenges
o Missing values
o Mixed categorical and numerical variables
o Variable selection
Sample Data
AGE  JOB          MARITAL  DEF  HOUSING  LOAN  CONTACT    EMP_VAR_RATE  CPI     CCI    EURIBOR  NUM_EMP  Y
56   housemaid    married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
57   services     married       no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
37   services     married  no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no
40   admin.       married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
56   services     married  no   no       yes   telephone  1.1           93.994  -36.4  4.857    5191     no
45   services     married       no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
59   admin.       married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
41   blue-collar  married       no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
24   technician   single   no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no
25   services     single   no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no
Other variables include: level of education, date of last contact, outcome of last campaign, days since last contact, etc.
Note the missing values and the mix of categorical and numeric variables.
Open Raw Data:
bank.CSV
Character Variables and Missing Values
Request Descriptive Statistics
All variables are included by default
Brief Descriptive Stats
We always check for the prevalence of missing data
Always review the number of distinct values (too few? too many?) and look for anything that seems wrong in the dataset
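The two checks above (missing-value prevalence and number of distinct values) can be sketched outside SPM as follows. This is a minimal illustration, not SPM's own procedure: the `SAMPLE` text is a hypothetical four-row excerpt standing in for bank.CSV, and an empty field is assumed to mean "missing".

```python
import csv
import io

# Hypothetical excerpt standing in for bank.CSV; an empty field means "missing".
SAMPLE = """AGE,JOB,DEF,Y
56,housemaid,no,no
57,services,,no
37,services,no,no
41,blue-collar,,no
"""


def brief_stats(rows):
    """Return {column: (n_missing, n_distinct)} for a list of dict rows."""
    stats = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        present = [v for v in values if v != ""]
        stats[col] = (len(values) - len(present), len(set(present)))
    return stats


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
stats = brief_stats(rows)
for col, (n_missing, n_distinct) in stats.items():
    print(f"{col:5s} missing={n_missing} distinct={n_distinct}")
```

A column with a single distinct value (like Y in this tiny excerpt) or with a large share of missing fields is the kind of result worth a second look.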
Full Descriptive Stats
Output contains detailed descriptive statistics for every variable
Frequency of Target variable
Target Variable
• 0 means “non subscriber”
• 1 means “subscriber”
• It’s not surprising that only a small percentage of people subscribed to a term deposit
Data Preparation
• The records in this dataset are ordered by date (from May 2008 to November 2010)
• Note that the 2008 economic crisis complicates this dataset, because time has to be considered as a factor in the analysis
• We partitioned the data in time order: the first 80% as learning data and the remaining 20% as testing data
• Note: a pdays value of 999 means the client had never been contacted before this phone call
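The time-ordered partition and the pdays sentinel can be sketched as below. This is an illustration under assumptions: the records are toy dictionaries already sorted by date, and recoding 999 to `None` is one possible treatment, not necessarily what was done in SPM.

```python
def time_ordered_split(records, learn_fraction=0.8):
    """Split date-sorted records: the earliest part is learn, the rest is test."""
    cut = int(len(records) * learn_fraction)
    return records[:cut], records[cut:]


def recode_pdays(pdays, never_code=999):
    """Map the sentinel 999 ("never contacted") to None so it is not read as days."""
    return None if pdays == never_code else pdays


# Toy records (hypothetical), already in date order.
records = [{"row": i, "pdays": 999 if i % 3 else i} for i in range(10)]
learn, test = time_ordered_split(records)
cleaned = [recode_pdays(r["pdays"]) for r in records]
```

Unlike a random partition, this split never lets the model learn from records dated later than the ones it is tested on.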
Build LOGIT Model
LOGIT Model Summary
• The learn ROC value is 0.94, which should prompt you to examine whether it is too good to be true
• The difference between the learn and test ROC values tells us that time does have an impact
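For readers who want to see what the ROC value measures, here is a minimal sketch of the area under the ROC curve computed as a pairwise ranking statistic, applied to a learn and a test sample. The labels and scores are made up for illustration and are not taken from the bank data.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores: perfect ranking on learn, weaker ranking on test.
learn_auc = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
test_auc = roc_auc([0, 1, 0, 1], [0.1, 0.2, 0.8, 0.9])
gap = learn_auc - test_auc
print(learn_auc, test_auc, gap)
```

A large gap between the two values is the warning sign discussed above.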
LOGIT Model Coefficients
Partial coefficients are shown in the table above
CART
• Classification and Regression Trees
o Separates relevant from irrelevant predictors
o Yields simple, easy-to-understand results
o Doesn’t require variable transformations
o Impervious to outliers and missing values
• Fastest, most versatile predictive modeling algorithm available to analysts
• Provides the foundation for modern data mining techniques such as bagging and boosting
Build CART Model
Testing Method
CART Model
The learn and test samples perform quite differently with this model, which means time does act as a factor influencing the outcome
Also, the learning sample performance looks too good to be true
Variable Importance
Duration: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed.
Rerun CART Model Excluding Duration
Variable Importance Ranking
CART gives an initial look at which variables are important; this is useful when there are many predictors in your dataset.
Root Node Split Very Effective
• We can view node details by clicking “Tree Details” in the CART output window
• The first splitter is “month”, which is also shown in the variable importance ranking table as the most influential predictor
• The whole tree with details can be viewed as well
MARS
• Multivariate Adaptive Regression Splines
• Uses “knots” to impose local linearities
• These knots create “basis functions” to decompose the information in each variable individually
[Figure: two plots of MV against LSTAT]
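The idea of knots and basis functions can be sketched with the standard hinge form max(0, x - knot) and its mirror image. The knot location, coefficients, and intercept below are hypothetical values chosen for illustration; they are not fitted to the LSTAT/MV data.

```python
def hinge(x, knot, direction=+1):
    """MARS-style basis function: max(0, x - knot), or max(0, knot - x) if direction is -1."""
    return max(0.0, direction * (x - knot))


def piecewise_linear(x, intercept, terms):
    """Evaluate intercept + sum(coef * hinge) over (coef, knot, direction) terms."""
    return intercept + sum(c * hinge(x, k, d) for c, k, d in terms)


# Hypothetical model: the slope of MV on LSTAT changes at a knot of 6.
terms = [(-0.5, 6.0, +1), (2.0, 6.0, -1)]
for lstat in (4.0, 6.0, 10.0):
    print(lstat, piecewise_linear(lstat, 25.0, terms))
```

Each hinge is zero on one side of its knot, so the fitted curve is linear within each region but can bend at the knot.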
Build MARS Model
MARS Model Setup
• The default Max Basis Functions setting is 15; the model often hits this limit and stops before reaching the optimal model
• So we set it to 60 after a couple of runs
MARS Output Window
This output window shows you the number of basis functions in the model against the performance of the model. Because MARS is a regression engine, the MSE and R-squared values will still be reported, but can be ignored here.
Summary
This model improved in targeting customers, with an ROC of 0.72.
MARS Basis function
Here is where the logistic regression equation is laid out in terms of the basis functions (transformations of the predictors). Each basis function is described and the final model is listed at the bottom. This form of output is especially desired by those who are comfortable with standard regression.
MARS Plots
• Note the presence of nonlinearity in this dataset
TreeNet
• Stochastic Gradient Boosting
• Small decision trees built in an error-correcting sequence
1. Begin with small tree as initial model
2. Compute residuals from this model for all records
3. Grow a second small tree to predict these residuals
4. And so on…
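The error-correcting sequence above can be sketched in a few lines. This is a simplified illustration and not TreeNet itself: it uses one-split trees on a single predictor, squared-error residuals, a constant (the mean) as the starting model instead of a tree, and no random subsampling. The data and the `learn_rate` value are made up.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    overall = sum(residuals) / len(residuals)
    best = (float("inf"), xs[0], overall, overall)  # fallback if no split exists
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees=20, learn_rate=0.5):
    """Fit small trees in sequence, each one to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learn_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
base, stumps, preds = boost(xs, ys)
print(mse(ys, [base] * len(ys)), mse(ys, preds))
```

Each added tree only has to explain what the earlier trees got wrong, which is why very small trees are enough.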
Build TreeNet Model
TreeNet Output Window
The Output window shows a graph of the number of trees in the ensemble with its corresponding ROC value. The vertical green bar denotes the model with the optimal ROC: 9 trees at 0.69.
Partial Dependency Plots
Using TreeNet for targeted marketing improves on random calling and gives you an idea of how the predictors affect subscription
Random Forests
• Ensemble of trees built on bootstrap samples
• Algorithm:
o Each tree is grown on a bootstrap sample from the learning data
o During tree growing, only P predictors are selected and tried at each node
o By default, P is the square root of total predictors
• The overall prediction is determined by averaging
• Law of Large Numbers ensures convergence
• The key to accuracy is low correlation and bias
• To keep bias low, trees are grown to maximum depth
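The two sources of randomness in the algorithm above, and the final averaging, can be sketched as follows. This is a minimal illustration of the ingredients only (no trees are grown). The predictor names are hypothetical, and rounding the square root down to an integer is an assumption.

```python
import math
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_predictors(predictors, rng):
    """Pick P = floor(sqrt(total)) predictors to try at one node."""
    p = max(1, int(math.sqrt(len(predictors))))
    return rng.sample(predictors, p)


def average_vote(tree_predictions):
    """Combine per-tree 0/1 predictions by averaging."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)
predictors = [f"VAR{i}" for i in range(1, 21)]  # 20 hypothetical predictors
sample = bootstrap_indices(100, rng)
tried = candidate_predictors(predictors, rng)
print(len(set(sample)), tried, average_vote([0, 1, 1, 1]))
```

Because each tree sees different rows and different candidate predictors, the trees are less correlated, which is what makes the average more accurate than any single tree.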
Build RandomForests Model
RandomForests Output1
The optimal RandomForests model is always the one with the most trees.
RandomForests Summary
Prediction Success Table1
We want to minimize the false “non-subscriber” rate in order to spend the least effort
Adjust Class Weights
• The Class Weights default is “BALANCED”, which upweights the small classes to equal the size of the largest target class
• Now we manually upweight class “1”, the small class, even more than the “BALANCED” setting does
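The effect of balanced weighting can be sketched as below: each class gets a weight such that its weighted size equals the size of the largest class. This illustrates the idea only. How SPM computes the weights internally is not shown in the slides, and the class counts and the extra factor of 2 are hypothetical.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight each class so its weighted size equals the largest class size."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: largest / n for cls, n in counts.items()}


labels = [0] * 90 + [1] * 10           # hypothetical 90/10 class split
weights = balanced_weights(labels)     # class 1 counts nine times as much
# Upweight the small class beyond balanced (the factor 2 is an arbitrary example).
boosted = {cls: w * (2.0 if cls == 1 else 1.0) for cls, w in weights.items()}
print(weights, boosted)
```

Raising the weight of class “1” further makes misclassifying a subscriber more costly, so the model flags more records as likely subscribers.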
Prediction Success Table2
Conclusion
• CART, MARS, TreeNet and RandomForests
o Handle missing values automatically
o Detect interactions and nonlinearity automatically
o Models can be translated into other programming languages
o Model performance usually exceeds that of traditional classification algorithms
o Advanced settings boost model performance
• CART provides initial insights into the dataset
• MARS gives equations in a linear regression format with transformations of the original predictors
• TreeNet generates more accurate models
• RandomForests excels with wide datasets
KDD Cup 2009
• Knowledge Discovery and Data mining competition held once a year to challenge modelers to a task
o http://www.kdd.org/kddcup/index.php - competitions from 1997-2010
o Includes tasks, data, rules, results, and FAQs
• KDD Cup 2009 was about customer relationship prediction
• French telecom company Orange provided large marketing databases
• Overall goal was to beat the in-house system implemented by Orange
Datasets
• 50,000 customers
• 15,000 predictors
o e.g., demographic, geographic, behavioral
• Three binary classification tasks:
o Appetency: customer buys new product or service
o Churn: customer switches providers
o Upselling: customer buys upgrade offered to them
• Training and testing dataset
• Smaller subsets of data available for practice
Challenges
• Large database
o 50,000 x 15,000
• Numerical and categorical variables
• Missing data
• Unbalanced class distributions
o Many more customers NOT doing these things
• Sanitized data - no intuition
Data Preparation
• Combine multiple datasets
o Large dataset broken into 5 “chunks”, 53 MB each
o True target values needed to be appended
• Delete or impute missing values
o Not necessary in SPM
• Handle categorical variables
o Create dummy indicators
o Combine levels in variables with many
o Again, not necessary in SPM
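For tools that do need this preparation, combining rare levels and creating dummy indicators can be sketched as follows. The level names, the `min_count` threshold, and the "OTHER" label are hypothetical choices for illustration.

```python
from collections import Counter


def combine_rare_levels(values, min_count, other="OTHER"):
    """Replace levels seen fewer than min_count times with a single catch-all level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def dummy_indicators(values):
    """Return (sorted levels, one 0/1 indicator row per value)."""
    levels = sorted(set(values))
    return levels, [[1 if v == lvl else 0 for lvl in levels] for v in values]


# Hypothetical categorical column.
jobs = ["services", "admin.", "services", "housemaid", "technician", "admin."]
combined = combine_rare_levels(jobs, min_count=2)
levels, rows = dummy_indicators(combined)
print(levels)
for row in rows:
    print(row)
```

Combining rare levels first keeps the number of indicator columns manageable when a variable has many levels.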
Open Prepared Data
View Data
…
Run Descriptive Statistics
Target Frequencies
Appetency
• In this context, appetency is the propensity of the customer to buy a new product or service
CART Model Setup
• Choose CART as the Analysis Engine
• Our Target is coded -1/1, so we will choose Classification/Logistic Binary as the Target Type
• Appetency is our response variable and VAR1-VAR15000 are our predictors
Setting a Testing Method
• A separate test dataset is provided in the competition, but true target values were not included
• For model-building, we will use a 20% random partition of the training dataset to monitor performance
Restricting Tree Size
• We are interested in looking at CART’s ranking of important predictors
• By forcing the tree to only one split, we can quickly create a tree to obtain this information
Penalties
• We are aware there are variables with many missing values and variables with a high number of categorical levels
• Setting penalties on these cases makes it harder to include these in the model
Results - Single Split CART Tree
Variable Improvement Measures
TreeNet Model Setup
Results - TreeNet Ensemble
Variable Selection
• Improvement measures are averaged across all trees in the ensemble
• Only 185 of the original 15,000 predictors are flagged as important
Recursive Feature Elimination (RFE)
• Remove one variable at a time from the TOP of the variable importance list to eliminate “too good” predictors
RFE, Step 2
• Remove one variable at a time from the BOTTOM of the variable importance list to eliminate weak predictors
• Final ROC: 0.9048
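Step 2 (dropping predictors from the bottom of the importance list) can be sketched as a simple loop. This is an illustration under assumptions, not the SPM automation: `evaluate` is a hypothetical stand-in for rebuilding the model and reading its test ROC, the importance scores are held fixed rather than recomputed after each rebuild, and the stopping rule (stop as soon as the score drops) is one possible choice.

```python
def backward_elimination(predictors, importance, evaluate):
    """Drop the least important predictor while the score does not get worse."""
    kept = list(predictors)
    best_score = evaluate(kept)
    while len(kept) > 1:
        weakest = min(kept, key=lambda name: importance[name])
        trial = [p for p in kept if p != weakest]
        score = evaluate(trial)
        if score < best_score:
            break  # removing this predictor hurts, so stop here
        kept, best_score = trial, score
    return kept, best_score


# Hypothetical setup: two useful predictors, two that only add noise.
useful = {"VAR1": 0.3, "VAR2": 0.2}
importance = {"VAR1": 40.0, "VAR2": 25.0, "VAR3": 2.0, "VAR4": 1.0}


def evaluate(columns):
    """Toy score: signal from useful predictors minus a small cost per predictor."""
    return 0.5 + sum(useful.get(c, 0.0) for c in columns) - 0.01 * len(columns)


kept, score = backward_elimination(list(importance), importance, evaluate)
print(kept, round(score, 2))
```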
Parameter Variation - Automates
• Each TreeNet control parameter can be automatically varied over its values
• A model is built at each step and summarized
Stability of the Model
• Automate PARTITION varies the learn/test partition so the user can observe the stability of model performance
Repeat on Churn
• Churn is the propensity of the customer to switch providers
• We repeat the same steps of model-building to achieve a final model
• Final ROC: 0.7320
Repeat on Upsell
• Upsell is the propensity of the customer to buy an upgrade offered to them
• We repeat the same steps of model-building to achieve a final model
• Final ROC: 0.9059
Summary of Results
Rank  Team                                     Appetency  Churn   Upselling  Score
1     IBM Research                             0.8830     0.7611  0.9038     0.8493
–     You!                                     0.9048     0.7320  0.9059     0.8476
2     ID Analytics, Inc.                       0.8724     0.7565  0.9056     0.8448
3     Old dogs with new tricks                 0.8740     0.7541  0.9050     0.8443
4     Crusaders                                0.8688     0.7569  0.9034     0.8430
5     Financial Engineering Group, Inc. Japan  0.8732     0.7498  0.9057     0.8429
• Unable to compare to true target values because these were only seen by competition judges
• However, we are confident in our results (2 of the above groups used SPM)
• Results can vary based on optimal selection criterion, random number seed, etc.
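The Score column in the table is consistent with the arithmetic mean of the three task ROC values. This is an inference from the numbers, not something stated on the slide, and the published scores were presumably computed from unrounded values. A quick check for two of the rows:

```python
def overall_score(aucs):
    """Mean of the per-task ROC values, rounded to four decimals as in the table."""
    return round(sum(aucs) / len(aucs), 4)


# (Appetency, Churn, Upselling) values copied from the results table.
results = {
    "IBM Research": (0.8830, 0.7611, 0.9038),
    "You!": (0.9048, 0.7320, 0.9059),
}
scores = {team: overall_score(aucs) for team, aucs in results.items()}
print(scores)
```

The lower Churn value is what keeps the “You!” row just behind the first-place score, despite higher values on the other two tasks.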
Overall Conclusions
• We were able to narrow down the predictor list significantly using TreeNet and Automate SHAVING
o Of the original 15,000 predictors:
• Appetency: 167
• Churn: 249
• Upselling: 165
• Handling of categorical variables and missing values was automatic and didn’t cause any issues
• Small rates in the class of interest didn’t pose a problem
o Priors/Costs and Class Weights can control for this in CART and TreeNet