An Overview of Predictive Analytics for
Practitioners
Dean Abbott
• Co-founder and Chief Data Scientist at SmarterHQ, based in Indianapolis, Indiana
• President of Abbott Analytics in San Diego, California
• Internationally recognized data mining and predictive analytics expert with over two decades’ experience
• Author of Applied Predictive Analytics (Wiley, 2014), co-author of IBM SPSS Modeler Cookbook (Packt Publishing, 2013)
• Advisory board and instructor for UC Irvine
Speaker Social Media
@deanabb
http://www.linkedin.com/in/deanabbott/
abbottanalytics.blogspot.com/
www.abbottanalytics.com/
The Analyst’s Journey
• Gain critical business and data analytics skills
• Uncover insights and provide value to your organization
• Put your knowledge to use immediately
What do Predictive Modelers do?
The CRISP-DM Process Model
• CRoss-Industry Standard Process Model for Data Mining
• Describes Components of Complete Data Mining Cycle from the Project Manager’s Perspective
• Shows Iterative Nature of Data Mining
[Figure: CRISP-DM cycle. Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment; Data at the center]
CRISP-DM Step 1:
Business Understanding Steps
• Ask Relevant Business Questions
• Determine Data Requirements to Answer Business Question
• Translate Business Question into Appropriate Data Mining Approach
• Determine Project Plan for Data Mining Approach
Define Business Objectives: Background; Business Objectives; Business Success Criteria
Assess Situation: Inventory of Resources; Requirements, Assumptions, Constraints; Risks and Contingencies; Terminology; Costs and Benefits
Determine Data Mining Objectives: Data Mining Goals; Data Mining Success Criteria
Produce Project Plan: Project Plan; Initial Assessment of Tools & Techniques
Objectives
• Business objective:
  • Random test mailing to NRA’s house file achieved an 11% response rate
  • Need a model that finds a population with a minimum response rate of 13.5% to be profitable
• Modeling objectives:
  • Develop a binary outcome model that will rank-order the current database based on propensity to respond to traditional mailing, optimizing at a cumulative average response rate of >= 13.5%.
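The modeling objective can be illustrated with a minimal sketch: rank records by model score, compute the cumulative response rate at each mailing depth, and find the deepest cutoff that still meets the 13.5% minimum. The scores and responses below are made-up toy values, and the function names are hypothetical; nothing here is taken from the original project.

```python
def cumulative_response_rates(scores, responses):
    """Cumulative response rate at each depth, after ranking by score (highest first)."""
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    rates = []
    responders = 0
    for depth, (_, responded) in enumerate(ranked, start=1):
        responders += responded
        rates.append(responders / depth)
    return rates


def deepest_profitable_depth(rates, min_rate=0.135):
    """Largest depth whose cumulative response rate is still >= min_rate (0 if none)."""
    best = 0
    for depth, rate in enumerate(rates, start=1):
        if rate >= min_rate:
            best = depth
    return best


# Toy data (assumption): 20 records, already scored by some model.
scores = [1 - i / 20 for i in range(20)]
responses = [1, 0, 0, 1] + [0] * 16  # 2 responders out of 20 (10% overall)

rates = cumulative_response_rates(scores, responses)
depth = deepest_profitable_depth(rates)
print(f"Mail the top {depth} records; cumulative rate there is {rates[depth - 1]:.1%}")
```

In this toy case the overall response rate (10%) is below the threshold, but mailing only the top-ranked records keeps the cumulative rate above it, which is the point of rank-ordering the database.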
CRISP-DM Step 2:
Data Understanding Steps
• Collect initial data
  • Internal data: historical customer behavior, results from previous experiments
  • External data: demographics & census, other studies and government research
  • Extract superset of data (rows and columns) to be used in modeling
  • Identify form of data repository: multiple vs. single table, flat file vs. database, local copy vs. data mart
• Perform Preliminary Analysis
  • Characterize Data (describe, explore, verify)
  • Condition Data
Collect Initial Data: Initial Data Collection Report
Describe Data: Data Description Report
Explore Data: Data Exploration Report
Verify Data Quality: Data Quality Report
Source Data
• Business partner provided data that summarizes transactional data for every active NRA member: 49 independent variables.
• TN Marketing enhanced the database with demographic data: 18 appended variables.
• I-Miner was used to derive new variable features and transformations of pre-existing data points: 79 derived variables.
CRISP-DM Step 3:
Data Preparation (Conditioning) Steps
Select Data: Rationale for Inclusion/Exclusion
Clean Data: Data Cleaning Report
Construct Data: Derived Attributes; Generated Records
Integrate Data: Merged Data
Format Data: Reformatted Data
Fix Data Problems
Data Preparation
• Key transformations
• Date Features
• Filling missing data
  • Use “Distribution” when possible for numeric fields
  • Use Constant for categoricals
  • For numeric data with both “in-house” and third-party versions, use in-house when available,
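The missing-data rules can be sketched as follows. This is only one possible reading: "Distribution" is interpreted here as drawing a replacement at random from the observed values of the same field, and falling back to the third-party value when the in-house value is missing is an assumption about the rest of the last rule. All function names and sample values are hypothetical.

```python
import random


def fill_numeric_from_distribution(values, rng):
    """Replace missing numerics (None) with random draws from the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to draw from; leave the field unchanged
    return [v if v is not None else rng.choice(observed) for v in values]


def fill_categorical_with_constant(values, constant="MISSING"):
    """Replace missing categoricals (None) with a single constant level."""
    return [v if v is not None else constant for v in values]


def prefer_in_house(in_house, third_party):
    """Use the in-house value when available; otherwise fall back (assumed) to third-party."""
    return [a if a is not None else b for a, b in zip(in_house, third_party)]


rng = random.Random(0)  # fixed seed so the draws are repeatable
ages = fill_numeric_from_distribution([34, None, 51, 47, None], rng)
regions = fill_categorical_with_constant(["West", None, "South"])
income = prefer_in_house([52000, None, 61000], [50000, 48000, None])
print(ages, regions, income)
```

Drawing from the observed values keeps the field's overall shape roughly intact, whereas a constant for categoricals makes "missing" an explicit level the model can use.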
Data Size
• Original Data
• Data after data cleanup and feature creation
• Data after further cleanup, and adding interaction terms
CRISP-DM Step 4:
Modeling Steps
Select Modeling Techniques: Modeling Techniques; Modeling Assumptions
Generate Test Design: Test Design
Build Model: Parameter Settings; Models; Revised Parameter; Model Description
Algorithm Selection
Sampling
Algorithms
Model Ranking
Sampling
• Randomly split the 21,557 records into two data sets, training and validation
• Build response model on training data set: 10,778 records
• Validate model by scoring validation data set: 10,779 records
• Ideally, have a third held-out data set to provide final assessment of
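A random half-and-half split of this size can be sketched as below. The record IDs, the seed, and the function name are placeholders; the slides do not say how the split was actually produced.

```python
import random


def split_in_half(record_ids, seed=42):
    """Shuffle the records and cut them into training and validation halves."""
    shuffled = list(record_ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed (assumption) for repeatability
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]


training, validation = split_in_half(range(21557))
print(len(training), len(validation))  # 10778 10779
```

With an odd record count the halves differ by one record, which matches the 10,778 / 10,779 sizes on the slide.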
Classifiers Find Different Decision Boundaries
[Figure: decision boundaries found on the same data. Panels: Actual Data, 11-Nearest Neighbor, Neural Network, Naïve Bayes, Logistic Regression, Decision Tree]
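To make the idea concrete, here is a small sketch in which two classifiers trained on the same points disagree about a new point. It uses toy data, a 3-nearest-neighbor vote instead of the 11 neighbors in the figure, and a one-split decision tree (a "stump") standing in for a full decision tree; all of these are simplifying assumptions.

```python
def knn_predict(points, labels, query, k=3):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(
        zip(points, labels),
        key=lambda pair: (pair[0][0] - query[0]) ** 2 + (pair[0][1] - query[1]) ** 2,
    )
    votes = [label for _, label in by_distance[:k]]
    return 1 if sum(votes) * 2 > k else 0


def fit_stump(points, labels):
    """Find the single axis-aligned split with the fewest training errors."""
    best = None
    for feature in range(len(points[0])):
        values = sorted({p[feature] for p in points})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for label_above in (0, 1):
                errors = sum(
                    (label_above if p[feature] > threshold else 1 - label_above) != y
                    for p, y in zip(points, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, label_above)
    return best[1], best[2], best[3]


def stump_predict(stump, query):
    """Apply the fitted split to a new point."""
    feature, threshold, label_above = stump
    return label_above if query[feature] > threshold else 1 - label_above


# Toy training data (assumption): class 0 on the left, class 1 on the right.
points = [(1, 1), (2, 2), (1, 3), (2, 5), (4, 1), (5, 2), (4, 4), (5, 5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

stump = fit_stump(points, labels)
query = (2.9, 4.5)
print("stump:", stump_predict(stump, query), "3-NN:", knn_predict(points, labels, query))
```

The stump draws a vertical line at x = 3 and calls the query class 0, while the nearest-neighbor vote follows the local neighborhood and calls it class 1. Both fit the training data perfectly; they differ only in where they place the boundary.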
CRISP-DM Step 6:
Deployment Steps
How to deploy model? Software, source code, in database
How often, when to update model
Report results
Lessons learned
Plan Deployment: Deployment Plan
Plan Monitoring and Maintenance: Monitoring & Maintenance Plan
Produce Final Report: Final Report; Final Presentation
Review Project Experience
Model Results after Deployment
• Scored over 2,100,000 prospects
• Actual results from the rollout
  • Average response rate = 13.67%
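As a quick worked check, the rollout result can be compared with the two reference rates quoted earlier in the deck (the 11% random test mailing and the 13.5% profitability minimum). The variable names are illustrative only.

```python
baseline_rate = 0.11     # random test mailing to the house file
breakeven_rate = 0.135   # minimum response rate needed to be profitable
actual_rate = 0.1367     # average response rate from the rollout

lift = actual_rate / baseline_rate                      # improvement over random mailing
margin_points = (actual_rate - breakeven_rate) * 100    # percentage points above break-even

print(f"Lift over random mailing: {lift:.2f}x")
print(f"Margin over break-even: {margin_points:.2f} percentage points")
```

Under these figures the model-selected population responded about 1.24 times as often as a random mailing, clearing the profitability minimum by 0.17 percentage points.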
What is Predictive Analytics?
Simple Definitions
• Data-driven analysis for [large] data sets
  • Data-driven to discover input combinations
  • Data-driven to validate models
• OR
• Discovering interesting patterns in data automatically from the data
  • Input variables are selected automatically
Customer Analytics: BI vs. PA
Customer Analytics: Business Intelligence
■ What were the e-mail open, click-through, and response rates?
■ Which regions/states/ZIPs had the highest response rates?
■ Which products had the highest/lowest click-through rates?
■ How many repeat purchasers were there last month?
■ How many new subscriptions to the loyalty program were there?
■ What is the average spend of those who belong to the loyalty program? Those who aren’t a part of the
Customer Analytics: Predictive Analytics
■ What is the likelihood an e-mail will be opened?
■ What is the likelihood a customer will click-through a link in an e-mail?
■ Which product is a customer most likely to purchase if given the choice?
■ How many e-mails should the customer receive to maximize the likelihood of a purchase?
■ What is the best product to up-sell to the customer after they purchase a product?
■ What is the visit volume expected on the website next week?
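The contrast between the two lists can be sketched in a few lines: the BI-style question summarizes what already happened, while the PA-style question estimates a likelihood for an individual customer. The data are invented, and the tiny logistic regression fitted by gradient descent is just one convenient choice of model for illustration, not a method named in the slides.

```python
import math

# Hypothetical history: how many earlier e-mails each customer opened,
# and whether they opened the latest one (1) or not (0).
prior_opens = [0, 1, 2, 3, 4, 5]
opened_latest = [0, 0, 1, 0, 1, 1]

# BI-style question: what WAS the open rate?
historical_open_rate = sum(opened_latest) / len(opened_latest)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit a one-input logistic regression with plain gradient descent."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [sigmoid(weight * x + bias) - y for x, y in zip(xs, ys)]
        weight -= learning_rate * sum(r * x for r, x in zip(residuals, xs)) / n
        bias -= learning_rate * sum(residuals) / n
    return weight, bias


weight, bias = fit_logistic(prior_opens, opened_latest)


def open_likelihood(n_prior_opens):
    """PA-style question: how likely is THIS customer to open the next e-mail?"""
    return sigmoid(weight * n_prior_opens + bias)


print(f"Historical open rate: {historical_open_rate:.0%}")
print(f"Likelihood for a customer with 0 prior opens: {open_likelihood(0):.2f}")
print(f"Likelihood for a customer with 5 prior opens: {open_likelihood(5):.2f}")
```

The BI number is a single summary for everyone; the model produces a different estimate per customer, which is what makes targeting possible.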
Predictive Analytics vs. Data Science
• Predictive Analytics and Data Mining have always covered the same ground except for…
  • Big data-centricity
  • Advanced database technology (to handle big data)
    • Hadoop
    • Other NoSQL (MongoDB, Cassandra…)
  • Programming language-centricity (not listed)
What Degree Does it Take to Be a Predictive Modeler?
• Highest Degree
  • 7 PhDs
  • 1 Masters
  • 2 Bachelors
• You don’t need an advanced degree to be a great practitioner!
Max. Degree        Count
Math               2
Computer Science   2
Social Science     2
Statistics         1
Like What You Heard?
Dean will be presenting at BAC 2015!
Pre-Conference (full day):
• An Overview of Predictive Analytics for Practitioners
Breakout Sessions (60 mins):
• Starting Your First Predictive Analytics Project
passbaconference.com