Data Mining Approaches to Collections and Case Closure
Bill Haffey
Technical Director, SPSS Public Sector
Background
• Florida DOR has 500,000 sales accounts, of which
~35,000 are likely to be in the collections process in a given month
• Payment frequencies range from monthly to
annually, based on expected tax amount
• Current collections process generally entails:
– Notice mailed after 30 days
– Phone call after another 15 days
– Visit after 54 days, or collection agency for low $
– Garnishment/lien after 120 days
• All accounts treated identically, and no costs have
Background (cont)
• Idea:
– Identify ‘paths’/maps composed of minimal/optimal sequences of actions that tend to result in delinquent case closure (for monthly payment accounts), perhaps unique to particular account types
– Deploy these paths into an automated recommendation engine designed to improve timeliness and efficiency of collections process
[Figure: Sequence Detection over the action histories of Acct1, Acct2 and Acct3 (notice, phone, close events), feeding an If/Then ‘Recommendation’ Engine with one path per account type (A, B, C) built from Notice, Phone call, Visit/Collections and Close]
But, in reality . . .
‘Best Contact’ Resolution not yet feasible:
• Actions made to the account are not separable:
• 1st notice sent on establishment of liability
• Phone call after another 15 days
• Sent to service center 54 days after 1st notice
• Garnishment/lien may be made
– What if notice rec’d and payment sent day 40, but not rec’d by Fla until day 45 – after phone call placed
– A phone call placed to account == a phone call rec’d from account w/promise to pay
• Not all actions made on the account are recorded:
– ‘Virtual agent campaigns’ (e.g., Mosaix recorded msg) not
Instead . . .
• Model time to account closure (X days), broken into the following groups:
– X < 30
– 30 < X < 60
– X > 60
• Assumptions:
– X < 30
• Case will have entailed minimal contact
– 30 < X < 60
• Notice and/or phone call or automated message
– X > 60
• (and bill exceeds $250 threshold) handled by field service ctr
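The grouping above could be sketched in Python as follows. The function name `closure_group` is hypothetical, and the slide does not say where a case closing on exactly day 30 or day 60 belongs; this sketch assumes both boundary days fall into the middle group.

```python
def closure_group(days_to_close):
    """Map days-to-closure (X) to one of the three groups on the slide."""
    if days_to_close < 0:
        raise ValueError("days_to_close must be non-negative")
    if days_to_close < 30:
        return "X < 30"
    if days_to_close <= 60:  # assumption: days 30 and 60 go to the middle group
        return "30 < X < 60"
    return "X > 60"


# Example values chosen only to exercise each branch
groups = [closure_group(x) for x in (17, 46, 75)]
```

Here `groups` holds one label per input value, which is the kind of target field a classifier would be trained to predict.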
Why?
• These time-to-closure groupings provide a reasonable proxy for the type of contact that resulted in closure
• The modeling and prediction of an account’s time-to-closure could provide such business rules as:
– If account is predicted as X < 30, consider not adding case to call queue for an additional period
– If account is predicted as X > 60, refer case directly to
Why Data Mining?
• Needed to ‘model’/predict the time-to-closure category
– As opposed to query/OLAP/report ‘snapshots’
• Lots of legacy data to ‘train’ the model (account characteristics and outcomes)
– Ability to scale procedures against large volumes of data
• Needed flexibility in types of data that could be modeled
– As opposed to traditional statistical procedures
Why Data Mining? (cont)
• In training the model, needed to minimize the probability of especially ‘bad’ predictions
– Predicting 30 < X < 60 for a case that would actually close in X < 30 isn’t as ‘bad’ as predicting X > 60 for that same case
• Needed to understand the model – why certain types of cases were predicted to close at X > 60
– As opposed to an opaque ‘black-box’ modeling methodology
• Chose the Rule-Induction data mining
The Data Mining Project followed the CRISP-DM Methodology.
Project Approach Methodology
CRISP-DM Approach
✓ Standard, proven process to guide data mining efforts
✓ Maximizes return on investment in data mining tools and processes
✓ Iterative process that incorporates business expertise and understanding as a key guide to analyses
[Figure: Benefits Provided, Predictive Evolution. Business Value plotted against Time, with stages Query and Report, OLAP, Data Mining and Forecasting, Real-Time Information Distribution]
• Historical: “How many customers do we lose each month?”
• Historical: “Which cities did they live in?”
• Predictive: “Which ones are at risk of leaving?”
• Predictive/Proactive: “What should we offer this customer today?”
Cross Industry Standard Process for Data Mining: CRISP – Data Mining Methodology
• Developed by SPSS, NCR, Daimler-Chrysler, and OHRA in 1996
• Time tested and used worldwide
• Flexible and adaptable methodology
• Six Cyclic Stages:
– Business Understanding
– Data Understanding
– Data Preparation
– Modeling
– Evaluation
– Deployment
CRISP – DM: Project Approach
Project Goal: Develop a data model that will predict the time required for an account to close for both bills and delinquencies.
Steps:
• Step 1: Business Understanding
• Step 2: Data Understanding
• Step 3: Data Preparation
• Step 4: Modeling
• Step 5: Evaluation
• Step 6: Deployment
Objectives:
✓ Goals definition
✓ Project objectives
✓ Gain buy-in
✓ Determine status of data
✓ Conduct data collection process
✓ Prepare data for detailed analysis
✓ Determine missing data
✓ Model data to yield cross-sell insights
✓ Validate process and results with business goals
✓ Implement models and processes
Activities:
• Define project goals
• Conduct interviews with key staff to define analytic and reporting processes
• Assess current processes
• Define success criteria
• Determine deployment method
• Collect data
• Data quality check
• Upload data into Clementine
• Select fields to be used in analyses
• Clean data
• Transform and derive calculated fields as required
• Conduct various modeling procedures on data
• Identify and implement highest-value modeling method
• Model data
• Revisit original business objectives
• Validate process and results with business goals
• Review results with Client and make any necessary modifications prior to delivery
• Plan and structure processes for deployment of model
• Demonstration of models
Deliverables:
• Interviews
• Definition of project goals
• Success criteria
• Data audit report
• Finished dataset to be used for analysis
• Documented analytical process as performed
• Analytical results tied to business objectives
• Additional input needed for Go/No Go decision
• Data quality improvement recommendations
First Round of Models
• Data Preparation Steps
– Take time to group SIC (first 2 digits) into meaningful categories
– Create time history for AGE and CASE_AGE
– Do not yet include time histories for other fields, such as contacts, bankrupt, etc.
• Modeling Steps
– Create decision trees and neural networks using available fields
– Use balanced samples for training the neural networks
– Select models that do the best job
• Predicting outcomes
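The slides do not say how the balanced samples were drawn. One common approach, shown here purely as an assumption, is to downsample every category to the size of the rarest one. The helper `balanced_sample` and the record layout are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(records, label_key, seed=0):
    """Downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.sample(by_label[label], n))
    rng.shuffle(sample)
    return sample


# Hypothetical records: category 1 is far more common than 2 or 3
records = (
    [{"id": i, "AGE_cat": 1} for i in range(60)]
    + [{"id": i, "AGE_cat": 2} for i in range(60, 75)]
    + [{"id": i, "AGE_cat": 3} for i in range(75, 100)]
)
balanced = balanced_sample(records, "AGE_cat")
counts = {c: sum(r["AGE_cat"] == c for r in balanced) for c in (1, 2, 3)}
```

After balancing, each category contributes the same number of training records, so the network is not rewarded for always predicting the most common category.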
Data Sources
COUNTY  ACCOUNT  APP_PERIOD  CREA_DATE1  STAT_DATE1  AGE  CREA_DATE2  STAT_DATE2  CONTACTS  RECNO    CASE_AGE
11      002501   200007      10/10/00    10/27/00    17   10/11/00    10/27/00    1         604516   16
11      002501   200009      11/29/00    12/14/00    15   11/30/00    12/14/00    0         670115   14
11      002501   200010      12/21/00    1/17/01     27   12/22/00    1/17/01     2         747129   26
11      002501   200011      1/24/01     2/14/01     21   1/25/01     2/14/01     2         809634   20
11      002501   200101      4/25/01     5/18/01     23   4/26/01     5/18/01     2         1042649  22
11      002501   200102      4/25/01     5/18/01     23   4/26/01     5/18/01     2         1042650  22
11      002501   200104      6/29/01     7/23/01     24   7/2/01      7/23/01     1         1197620  21
11      003003   199910      2/4/00      3/21/00     46   2/7/00      4/5/00      2         51626    58
11      003003   199912      2/25/00     4/5/00      40   2/7/00      4/5/00      4         88246    58
11      003003   199912      3/7/00      3/31/00     24   2/7/00      4/5/00      2         97479    58
11      003003   200001      4/21/00     5/25/00     34   4/19/00     6/8/00      4         249504   50
11      003003   200002      5/1/00      5/30/00     29   4/19/00     6/8/00      4         257720   50
11      003003   199912      5/11/00     6/8/00      28   4/19/00     6/8/00      1         291136   50
11      003003   200003      7/18/00     8/21/00     34   7/19/00     8/21/00     0         435945   33
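The slide does not define AGE or CASE_AGE, but in the sample rows AGE matches the number of days from CREA_DATE1 to STAT_DATE1, and CASE_AGE matches the days from CREA_DATE2 to STAT_DATE2. The sketch below checks that reading on three of the rows; the helper `days_between` is hypothetical, and the interpretation is an inference from the data, not a documented definition.

```python
from datetime import datetime


def days_between(start, end):
    """Whole days between two m/d/yy dates as printed in the table."""
    fmt = "%m/%d/%y"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days


# (CREA_DATE1, STAT_DATE1, AGE, CREA_DATE2, STAT_DATE2, CASE_AGE) for three sample rows
rows = [
    ("10/10/00", "10/27/00", 17, "10/11/00", "10/27/00", 16),
    ("12/21/00", "1/17/01", 27, "12/22/00", "1/17/01", 26),
    ("2/4/00", "3/21/00", 46, "2/7/00", "4/5/00", 58),
]

derived = [(days_between(c1, s1), days_between(c2, s2)) for c1, s1, _, c2, s2, _ in rows]
printed = [(age, case_age) for _, _, age, _, _, case_age in rows]
```

For these rows `derived` and `printed` agree, which supports treating both fields as elapsed days between a creation date and a status date.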
Types of Features
• Create Time-Based Features
– AGE features
• Last AGE value
• Maximum AGE
• Average AGE for all modules, last 3 modules, last 5 modules, etc.
– CASE_AGE features
• Same kinds of features as AGE: last, max, average AGE
– Contacts
• Reduce large numbers of categories down to a smaller, more manageable number
– Ex: County, ORG_CODE, SIC, KIND_CODE, STAT_CODE
– Reason: reduce redundant information, speed up modeling
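The time-based AGE features listed above (last, maximum, and averages over all or the most recent records) could be computed along these lines. The function `age_features` and its output keys are hypothetical, and treating each sample row as one ‘module’ is an assumption.

```python
from statistics import mean


def age_features(ages):
    """Summarise an account's AGE history, ordered oldest to newest."""
    if not ages:
        raise ValueError("account has no AGE history")
    return {
        "last_age": ages[-1],
        "max_age": max(ages),
        "avg_age_all": mean(ages),
        "avg_age_last3": mean(ages[-3:]),
        "avg_age_last5": mean(ages[-5:]),
    }


# AGE history for account 002501 in the Data Sources sample
history = [17, 15, 27, 21, 23, 23, 24]
features = age_features(history)
```

When such features feed a model that predicts the AGE of a given case, the history passed in should presumably contain only cases that closed before that one, so the target does not leak into its own inputs.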
Data Preprocessing Stream
SIC 2-Digit Features
• Group SIC 2-digit Values
– Functionally (SIC 1-digit)
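A minimal sketch of deriving the 2-digit and 1-digit SIC prefixes from a full code follows. The function name `sic_prefixes` is hypothetical, and padding to four digits is an assumption for codes stored as numbers, where a leading zero would be lost. The actual functional grouping used in the project is not shown in the slides.

```python
def sic_prefixes(sic_code):
    """Return the (2-digit, 1-digit) prefixes of a 4-digit SIC code."""
    code = str(sic_code).strip().zfill(4)  # assumption: restore a dropped leading zero
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"not a 4-digit SIC code: {sic_code!r}")
    return code[:2], code[:1]


two_digit, one_digit = sic_prefixes("5812")
```

Grouping on the shorter prefix trades detail for fewer categories, which is the stated aim of the category-reduction step.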
Age Category Distributions
• Split sample data into training and testing subsets
– Training for creating model
– Testing for assessing model performance
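A simple random split into training and testing subsets might look like this. The 30% test fraction and the fixed seed are assumptions; the slides do not state the proportions used.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and cut it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(100))  # stand-in for account records
train, test = train_test_split(records)
```

Every record lands in exactly one subset, so performance measured on `test` reflects cases the model never saw during training.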
Template Modeling Stream
• Standard modeling stream
– Load data
– Create models
– Assess results for training subset and testing subset
AGE Neural Network Model Parameters and Results
• Sometimes the direct path to a model doesn’t work well.
• Create a model that predicts AGE, and use this model as input to the AGE_cat model (actually, created a model that predicted LOG10(AGE))
• Make sure no fields are allowed in the AGE model that cannot be included in AGE_cat model
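The two-stage idea above can be sketched as a log transform for the first-stage target plus a step that attaches the first-stage prediction as an input field for the AGE_cat model. All function names are hypothetical, `placeholder_model` stands in for the real neural network, and converting the prediction back to days before storing it as `AGE_pred` is an assumption.

```python
import math


def to_log_age(age_days):
    """First-stage regression target: LOG10(AGE)."""
    if age_days <= 0:
        raise ValueError("LOG10 is undefined for AGE <= 0")
    return math.log10(age_days)


def from_log_age(log_age):
    """Convert a predicted LOG10(AGE) back to days."""
    return 10 ** log_age


def add_age_pred(record, first_stage_model):
    """Return a copy of the record with the first-stage prediction attached."""
    enriched = dict(record)
    enriched["AGE_pred"] = from_log_age(first_stage_model(record))
    return enriched


def placeholder_model(record):
    """Stand-in for the trained network: predicts the last observed AGE."""
    return to_log_age(record["last_age"])


enriched = add_age_pred({"last_age": 24, "contacts": 1}, placeholder_model)
```

The second-stage classifier would then be trained on the enriched records, using only fields that are also available to it at prediction time.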
Neural Network Accuracy Predicting Age
• AGE model predicts AGE values with 69% correlation. A scatter plot shows predictions vs. actual AGE values.
• This doesn’t have to be perfect to provide good information for the AGE_cat models
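For reference, a correlation between predicted and actual values is typically computed as Pearson’s r, sketched below. The slides do not say whether the 69% figure was measured on AGE or on LOG10(AGE), and the `predicted` values here are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


actual = [17, 15, 27, 21, 23, 23, 24]     # AGE values from the sample rows
predicted = [20, 18, 24, 22, 22, 25, 23]  # illustrative values, not model output
r = pearson_r(actual, predicted)
```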
Rule Induction – Key Features
• Model output is intuitive – in the form of either decision trees or rulesets
• Flexibility in types of data
• Can ‘ransack’ a dataset to identify key data features
– The resultant model will utilize
– Decision trees:
  • income < $40K
    – job > 5 yrs then good risk
    – job < 5 yrs then bad risk
  • income > $40K
    – high debt then bad risk
    – low debt then good risk
– or Rule Sets:
  • Rule #1 for good risk:
    – if income > $40K
    – if low debt
  • Rule #2 for good risk:
    – if income < $40K
    – if job > 5 years
Training Data
Cust  Risk  Income  Job  Debt
1     Good  50k     6    low
2     Bad   60k     3    high
3     Good  41k     7    low
. . .
Build the MODEL
Model:
• Rule #1 for good risk:
– if income > $40K
– if low debt
• Rule #2 for good risk:
– if income < $40K
– if job > 5 years
Testing Data
Cust  Risk  Income  Job  Debt
78    Good  50k     6    low
79    Good  60k     3    high
80    Bad   41k     7    low
. . .
Test the Model
[Figure: matrix of Actual Outcomes against Modeled Outcomes, each with categories Good, Amb, Bad]
Some Model ‘misses’ are more critical than others . . .
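To make the test step concrete, the two ‘good risk’ rules can be applied to the three testing rows shown on the slide and tallied against the actual outcomes. Calling everything that matches neither rule "Bad" is an assumption made for this sketch (the slide’s matrix also has an ambiguous category), and the function name `predict_risk` is hypothetical.

```python
def predict_risk(income_k, job_years, debt):
    """Apply the two 'good risk' rules; anything else is called Bad (assumption)."""
    if income_k > 40 and debt == "low":   # Rule #1
        return "Good"
    if income_k < 40 and job_years > 5:   # Rule #2
        return "Good"
    return "Bad"


# Testing rows from the slide: (cust, actual_risk, income_k, job_years, debt)
testing = [
    (78, "Good", 50, 6, "low"),
    (79, "Good", 60, 3, "high"),
    (80, "Bad", 41, 7, "low"),
]

# Tally (actual, modeled) pairs, i.e. a small coincidence matrix
coincidence = {}
for cust, actual, income_k, job_years, debt in testing:
    modeled = predict_risk(income_k, job_years, debt)
    coincidence[(actual, modeled)] = coincidence.get((actual, modeled), 0) + 1
```

On these rows the rules get customer 78 right and miss customers 79 and 80 in opposite directions, which illustrates why the kind of miss matters as well as the count.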
Changing Where the Errors Occur
• Change misclassification costs to change where errors occur.
• If we want to ensure that category 3 records are classified correctly, change how the decision tree views errors on records with category 3.
• In this example, the classifier has 84.8% accuracy on testing data for category 3.
• However, we also get many category 1 and 2 records incorrectly called category 3 (false alarms)
No misclassification costs
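One way to picture how misclassification costs move the errors is a minimum-expected-cost decision: given estimated category probabilities for a record, pick the category whose expected cost is lowest. This is a generic sketch, not a description of how Clementine applies costs internally; the probabilities and the 3x penalty are hypothetical.

```python
def min_cost_category(probs, cost):
    """Pick the prediction with the lowest expected misclassification cost.

    probs[actual]      : estimated probability that the true category is `actual`
    cost[actual][pred] : cost of predicting `pred` when the truth is `actual`
    """
    expected = {
        pred: sum(probs[actual] * cost[actual][pred] for actual in probs)
        for pred in probs
    }
    return min(expected, key=expected.get)


categories = (1, 2, 3)
probs = {1: 0.45, 2: 0.20, 3: 0.35}  # hypothetical estimates for one record

# Every error costs the same
equal_costs = {a: {p: 0.0 if a == p else 1.0 for p in categories} for a in categories}

# Missing a true category 3 costs three times as much (hypothetical weight)
weighted_costs = {a: dict(row) for a, row in equal_costs.items()}
weighted_costs[3][1] = 3.0
weighted_costs[3][2] = 3.0

choice_equal = min_cost_category(probs, equal_costs)
choice_weighted = min_cost_category(probs, weighted_costs)
```

With equal costs the record is called category 1; with the heavier penalty it is called category 3, so more true category 3 cases are caught at the price of more false alarms among categories 1 and 2.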
Decision Tree Accuracy on Testing Data
• Results for output field Age_cat
• Comparing $C-Age_cat with Age_cat
– Correct: 20759 (60.15%)
– Wrong: 13753 (39.85%)
– Total: 34512
• Coincidence Matrix (rows: Actual Age_cat; columns: Predicted $C-Age_cat)

Actual    Predicted 1   Predicted 2   Predicted 3
1               10296           769          3040
2                3665          1892          2463
3                3175           641          8571
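The summary figures can be recomputed from the coincidence matrix, along with per-category rates that the overall accuracy hides. This sketch assumes rows are actual categories and columns are predicted categories.

```python
# matrix[actual][predicted], counts taken from the coincidence matrix
matrix = {
    1: {1: 10296, 2: 769, 3: 3040},
    2: {1: 3665, 2: 1892, 3: 2463},
    3: {1: 3175, 2: 641, 3: 8571},
}

total = sum(sum(row.values()) for row in matrix.values())
correct = sum(matrix[c][c] for c in matrix)
accuracy = correct / total

# Share of each actual category that was predicted correctly
recall = {c: matrix[c][c] / sum(matrix[c].values()) for c in matrix}

# Share of each predicted category that was actually that category
precision = {c: matrix[c][c] / sum(matrix[a][c] for a in matrix) for c in matrix}
```

The totals reproduce the reported 20759 correct out of 34512. Under the stated row/column assumption, category 2 is recovered far less often than categories 1 and 3, which the single accuracy figure does not show.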
Key Variables in AGE_cat Decision Tree Model
• Decision tree rules for best tree.
• This is actually the third “boost” from a series of decision trees
• AGE_pred is first split