Data Mining Algorithms
Data Mining End to End
Graham Williams
Principal Data Miner, ATO Adjunct Associate Professor, ANU
http://togaware.com Copyright © 2006, Graham J. Williams
Overview
1 Process
  CRISP-DM
2 Hot Spots
  NRMA
  Medicare
3 Evaluation and Communication
  Communicating Performance
4 Privacy
  Protecting Privacy
Process Hot Spots Evaluation and Communication Privacy
The CRISP-DM Model
Cross Industry Standard Process for Data Mining (http://www.crisp-dm.org/)
Developed by NCR, Daimler-Benz, ISL, OHRA
Define and validate a Data Mining Process Model:
  applicable in diverse industry sectors
  industry and tool neutral
  large data mining projects executed faster, cheaper, more reliably and more manageably
Life cycle of six (iterative) phases.
Going Through Loops
Cyclic nature of data mining:
  Every step of a data mining process can lead to revisiting any one of the previous steps.
  A DM process continues after a solution has been deployed.
  The lessons learnt can trigger new, often more focused business questions.
  Subsequent data mining processes benefit from experiences of previous ones.
Six Steps
1 Business Understanding (25%)
2 Data Understanding (20%)
3 Data Preparation (25%)
4 Modelling (10%)
5 Evaluation (20%)
6 Deployment

1. Find Objectives 20%
2. Data Preparation 60%
3. Data Mining 10%
4. Analysis 10%
Business Understanding
We had better make sure we are addressing a real business problem.
The initial phase focuses on understanding project objectives and requirements from a business perspective.
This knowledge is converted into a data mining problem definition.
Develop a preliminary plan designed to achieve the objectives.
Data Understanding
Understand what data is available and its semantics.
Initial data collection
Familiarisation with the data:
  identify data quality problems
  discover first insights into the data
  detect interesting subsets to form hypotheses for hidden information
Data Preparation
Bring together the data and get it into shape for mining.
Construct the mining dataset
  Derived from the initial raw dataset(s)
Data preparation tasks:
  table, record, and attribute selection
  generation of derived features
  data transformation
  data cleaning
Preparing to Mine
Issues to be dealt with include:
Data Quality
  missing data
  noisy data
  both lead to inconsistent or too general/specific discoveries
Data Cleaning
  duplicates
  inconsistencies
  identify and merge the same entities
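The cleaning issues listed above (missing data, duplicates, merging the same entities) can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: the field names (`name`, `postcode`, `age`), the sample records, and the matching rule (same normalised name and postcode) are all hypothetical and are not taken from any project described in these slides.

```python
from collections import OrderedDict

# Hypothetical raw records; field names and values are invented for illustration.
raw = [
    {"name": "J. Smith ", "postcode": "2600", "age": None},  # missing age
    {"name": "j. smith", "postcode": "2600", "age": 34},     # same entity, inconsistent spelling
    {"name": "A. Jones", "postcode": "2612", "age": 51},
]

def entity_key(record):
    """Normalise the fields assumed to identify an entity."""
    return (record["name"].strip().lower(), record["postcode"])

def merge_duplicates(records):
    """Merge records sharing a key, filling missing values from later duplicates."""
    merged = OrderedDict()
    for record in records:
        key = entity_key(record)
        if key not in merged:
            merged[key] = dict(record, name=record["name"].strip())
        else:
            for field, value in record.items():
                if merged[key].get(field) is None and value is not None:
                    merged[key][field] = value
    return list(merged.values())

clean = merge_duplicates(raw)
missing_age = sum(1 for r in clean if r["age"] is None)
print(len(raw), "raw records ->", len(clean), "entities;", missing_age, "still missing age")
```

Real entity matching usually needs fuzzier rules than an exact key; the exact key is used here only to keep the sketch short.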
Modelling
Now the “data mining” begins!
Select various modelling techniques
Apply and calibrate modelling techniques
Typically there are several techniques for the same data mining problem
Some techniques have specific requirements on the form of data and require stepping back to the data preparation phase
Evaluation
How do we know we have a useful outcome?
Evaluate the model and review the steps executed to construct the model
Does the model properly achieve the business objectives?
Is there some important business issue that has not been sufficiently considered?
Decide on the use of the data mining results
Deployment
There is no point to data mining unless we action the outcomes.
Deployment may be:
  Generate a report of the discoveries made
  Implement changes in the processes of the organisation
  Implement a repeatable data mining process
For successful deployment the customer must understand the actions to be carried out in order to actually make use of the created models.
Summary
The KDD Process
Iterative process requiring multiple loops
Time consuming
Mining is one “small” step
Data issues are crucial to success
Motor Vehicle Insurance
Insurance premium setting and risk rating
Actuaries study data and domain for general understanding of risk
Several million transactions annually
Consider more than the traditional small number of factors
Data mining can explore very large collections of data — both entities and features.
Cluster then Describe then Measure
The Hot Spots methodology combines Cluster Analysis and Decision Trees to symbolically identify candidate regions of a dataset.
[Figure: plot of data points alongside a decision tree describing the clusters. Tree tests: ClaimCost ≤ $695; ClmType ≤ 6; Model ∈ [holden, ford]; SumRqst ≤ $15,000; Postcode ≤ 2949; Cubic ≤ 2.4. Leaves are labelled C1 to C4. Annotations: Cost = $95,000 and Cost = $158,000.]
Find the Interesting Groups
Rule 1:  NCB < 60 and Age ≤ 24 and Address is Urban
Rule 23: Age > 57 and Vehicle ∈ {Utility, Station Wagon}

Nugget  Claims  Total  Proportion  Average Cost  Total Cost
1          150   1400          11          3700     545,000
2          140   2300           6          3800     535,000
3            5     25          20          4400      13,000
4           10    120           8          7900      79,100
5           20    340           6          5300     116,000
6           65    520          13          4400     280,700
7            5      5         100          6800      20,300
. . .
60         800   1400         5.9          3500   2,800,000
All       3800  72000         5.0          3000  12,000,000
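The "measure" step on the nugget table above can be sketched as a comparison of each nugget with the overall population. The sketch below is an assumption-laden illustration, not the Hot Spots implementation: it assumes Proportion is Claims divided by Total (which agrees with the rounded percentages shown for nuggets 1 to 7), and the flagging threshold of twice the overall figure is hypothetical.

```python
# Figures transcribed from the nugget table: (nugget, claims, total, average cost).
nuggets = [
    (1, 150, 1400, 3700),
    (2, 140, 2300, 3800),
    (3, 5, 25, 4400),
    (4, 10, 120, 7900),
    (5, 20, 340, 5300),
    (6, 65, 520, 4400),
    (7, 5, 5, 6800),
]
# The "All" row of the table.
OVERALL = {"claims": 3800, "total": 72000, "avg_cost": 3000}

def measure(claims, total, avg_cost, threshold=2.0):
    """Compare a nugget with the overall population.

    threshold is a hypothetical cut-off: a nugget is flagged when its claim
    proportion or average cost is at least `threshold` times the overall value.
    """
    base_rate = OVERALL["claims"] / OVERALL["total"]
    claim_lift = (claims / total) / base_rate
    cost_lift = avg_cost / OVERALL["avg_cost"]
    reasons = []
    if claim_lift >= threshold:
        reasons.append("proportion")
    if cost_lift >= threshold:
        reasons.append("average cost")
    return claim_lift, cost_lift, reasons

flagged = []
for nugget, claims, total, avg_cost in nuggets:
    claim_lift, cost_lift, reasons = measure(claims, total, avg_cost)
    if reasons:
        flagged.append({
            "nugget": nugget,
            "claim_lift": round(claim_lift, 2),
            "cost_lift": round(cost_lift, 2),
            "reasons": reasons,
        })

for row in flagged:
    print(row)
```

Very small nuggets (such as nugget 7 with 5 entities) produce extreme ratios, so in practice a minimum group size would probably be applied as well.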
Finding the Interesting Groups
Evaluate the large collection of groups (or Hot Spots) to find those that are important to the core business.
Nugget By Claims By Proportion By Average Cost
2 Y
3 Y
19 Y
24 Y
34 Y Y Y
35 Y Y
36 Y
40 Y Y
Operationalise
Identify groups that are:
High Risk
  Very high dollars per claim
  Large percentage of claims in the group
Low Risk
  Very few claims from the group
  Claims are low in dollars
Health Insurance Commission
Universal Health Coverage
Terabytes of patient claims since the inception of Medicare
Inappropriate Provider practices an ongoing focus
Exploration of public fraud (including doctor shoppers)
Exploration of the practice of pathology
Cluster/Describe/Measure
[Figure: plot of data points alongside a decision tree with leaves labelled C1 to C4, repeating the cluster and decision tree illustration of the Hot Spots methodology.]
Cluster/Describe/Deliver
Rule 1: Age is between 28 and 35 and Weeks ≤ 5
Rule 2: Weeks < 10 and Benefits > $350

Nugget    Size  Age  Gender  Services  Benefits  Weeks  Hoard  Regular
1         9000   30  F             10        30      2      1        1
2          150   30  F             24       841      4      2        4
3         1200   65  M              7       220     20      1        1
4           80   45  F             30       750     10      1        1
5           90   10  M             12      1125     10      5        2
6          800   55  M              8       550      7      1        9
. . .
280         30   25  F             15       450     15      2        6
All     40,000   45                 8        30      3      1        1
Claim Hoarders
A distinct group of behaviour identified as Claim Hoarders
[Figure: three panels labelled pinA, pinB and pinC, each titled Service-Claim, with a horizontal axis running from 12600 to 13400.]
But there may be many millions of these individuals.
Medicare Regulars
Group of patients with very regular activity:
[Figure: two panels labelled pinaD and pinD, each titled Service-Claim, with a horizontal axis running from 12600 to 13400.]
Remove non-cash payments!!!
Operationalise
The fraud identified was investigated and appropriate action taken.
  Perpetrators prosecuted
  Funds recovered
  Processes improved to cross validate data
The Importance of Communication
Selling the story to management.
Do we present the model or the outcomes?
Senior management in an organisation like the ATO is necessarily cautious, protecting the integrity of the country's revenue system.
Need to demonstrate and “prove” the performance and robustness of models before deployment.
Options in Rattle: Confusion Matrix
A simple instrument to convey predictive performance.
But quite a blunt instrument.

Confusion matrix for rpart model on audit.csv [test] (counts):

           Actual
Predicted    0    1
        0  428   56
        1   44   72

Confusion matrix for rpart model on audit.csv [test] (%):

           Actual
Predicted    0    1
        0   71    9
        1    7   12
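The relationship between the two confusion matrices above, and a few summary measures that can be read off them, can be checked with a short calculation. This is a sketch only: it takes the printed layout at face value (rows are predicted class, columns are actual class), and the choice of accuracy, precision and recall as summaries is an addition for illustration rather than something the slides report.

```python
# Counts from the confusion matrix; key = (predicted, actual).
counts = {(0, 0): 428, (0, 1): 56, (1, 0): 44, (1, 1): 72}

total = sum(counts.values())

# Percentage version of the matrix, rounded to whole percent.
percent = {cell: round(100 * n / total) for cell, n in counts.items()}

tn = counts[(0, 0)]  # predicted 0, actual 0
fn = counts[(0, 1)]  # predicted 0, actual 1
fp = counts[(1, 0)]  # predicted 1, actual 0
tp = counts[(1, 1)]  # predicted 1, actual 1

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted positives that are real
recall = tp / (tp + fn)      # share of real positives that are found

print("percent matrix:", percent)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

A single accuracy figure hides the gap between precision and recall, which is one sense in which the matrix is a blunt instrument.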
Options in Rattle: Risk Charts
Developed specifically for the ATO.
Capture both the “score” exhibited through probability, and the size of the Risk associated with each case! Often, it is the Risk that is of most interest.
Risk Chart: RPart
Risk Chart: RF
Risk Chart: SVM
Risk Chart: Textual Comparison
The area under the Risk and Recall curves for rpart model
  Area under the Risk (red) curve: 79% (0.790)
  Area under the Recall (green) curve: 76% (0.762)
The area under the Risk and Recall curves for rf model
  Area under the Risk (red) curve: 78% (0.780)
  Area under the Recall (green) curve: 78% (0.779)
The area under the Risk and Recall curves for ksvm model
  Area under the Risk (red) curve: 78% (0.777)
  Area under the Recall (green) curve: 77% (0.774)
Which is best?
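One plausible way to arrive at "area under the curve" figures like those above is sketched here. It is an assumption, not Rattle's implementation: cases are ranked by model score, the Recall curve is taken to be the cumulative share of positive cases found as the caseload grows, the Risk curve is taken to be the cumulative share of risk dollars found, and the area is computed with the trapezoidal rule. The case data are hypothetical.

```python
def area_under_cumulative_curve(values):
    """Trapezoidal area under the cumulative-share curve of `values`,
    plotted against the caseload fraction from 0 to 1."""
    total = sum(values)
    n = len(values)
    area, previous = 0.0, 0.0
    for value in values:
        current = previous + value / total
        area += (previous + current) / 2 / n  # one trapezoid of width 1/n
        previous = current
    return area

# Hypothetical cases: (model score, 1 if actually positive else 0, risk in dollars).
cases = [
    (0.9, 1, 5000),
    (0.8, 1, 200),
    (0.7, 0, 0),
    (0.4, 1, 800),
    (0.2, 0, 0),
    (0.1, 0, 0),
]

# Work through the caseload from the highest score to the lowest.
ranked = sorted(cases, key=lambda case: case[0], reverse=True)

recall_area = area_under_cumulative_curve([case[1] for case in ranked])
risk_area = area_under_cumulative_curve([case[2] for case in ranked])

print(f"area under Recall curve: {recall_area:.3f}")
print(f"area under Risk curve:   {risk_area:.3f}")
```

Under this reading, a model can rank well on Risk and less well on Recall (or the reverse), so which model is best depends on whether finding cases or finding dollars matters more.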
Privacy and Data Mining
Laws in many countries directly affect Data Mining and it is worth being aware of them: penalties are often severe.
The OECD Principles of Data Collection were drafted in 1980. They embody guiding principles for governments.
Revised for APEC 2003 as part of the Asia-Pacific Privacy Charter Initiative.
Data mining by the Australian Taxation Office is governed by data matching and privacy protocols, and independently overseen by the Privacy Commissioner, the ANAO, and others.
What are we trying to protect?
Protect, amongst others:
  Religious freedom
  Freedom from racial discrimination
  Personal medical records
  Employment history
  Political freedom
Centrelink and Privacy Breaches
In August 2006 Centrelink (the Australian social security agency) announced it had identified over 500 privacy breaches committed by staff.
Identified by monitoring and mining database access logs.
Activities
  looking at own personal records
  looking at family, friends, neighbours
  obtaining information to be sold
  changing information for financial gain
Consequences
  counselling
  reduced pay/position
  loss of job
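Mining access logs for breaches like those listed above could start from simple rules. The sketch below is a hypothetical illustration only; the slides do not describe how Centrelink analysed its logs. The log format, staff and record identifiers, and the two rules (access to one's own record, access to a record with no assigned case) are all assumptions.

```python
# Hypothetical reference data.
staff_own_record = {"s1": "r100", "s2": "r200"}          # staff member -> their own customer record
assigned_cases = {"s1": {"r300"}, "s2": {"r400", "r500"}}  # staff member -> records they work on

# Hypothetical access log: (staff id, record id accessed).
access_log = [
    ("s1", "r300"),  # legitimate: assigned case
    ("s1", "r100"),  # own personal record
    ("s2", "r300"),  # not assigned to s2
    ("s2", "r400"),  # legitimate: assigned case
]

def flag_suspicious(log):
    """Return (staff, record, reason) for accesses that break either rule."""
    flags = []
    for staff, record in log:
        if staff_own_record.get(staff) == record:
            flags.append((staff, record, "own record"))
        elif record not in assigned_cases.get(staff, set()):
            flags.append((staff, record, "no assigned case"))
    return flags

suspicious = flag_suspicious(access_log)
for entry in suspicious:
    print(entry)
```

Flags of this kind are only leads for review; an access outside assigned cases may still have a legitimate reason.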
AOL Privacy Breach
AOL researchers decided to release anonymised web query logs for researchers (August 2006).
Covered 250,000 users and 20 million queries.
(Compare with the US DoJ demand that Google, AOL, etc., supply such data so that it can monitor citizens.)
Usernames converted to numeric IDs.
But the aggregated queries for a single numeric ID are enough to identify individuals:
  multiple queries paint a picture
  private financial situation: property and bank loan enquiries
  indication of criminal activity or research for a book?
  health: pregnancy, home loan, dog vomit + uncooked pasta
“This was a screw-up and we’re angry and upset about it.” (AOL)
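Why replacing usernames with numeric IDs is not enough can be shown with a small sketch. The log entries and usernames below are hypothetical, and the pseudonymisation scheme (sequential IDs assigned on first sight) is an assumption for illustration, not a description of what AOL did internally.

```python
from collections import defaultdict
from itertools import count

# Hypothetical query log: (username, query).
log = [
    ("user_a", "home loan rates"),
    ("user_b", "dog vomit uncooked pasta"),
    ("user_a", "pregnancy week 12"),
    ("user_a", "bank loan calculator"),
]

# Replace each username with a numeric ID, assigned the first time it is seen.
next_id = count(1)
pseudonym = defaultdict(lambda: next(next_id))
released = [(pseudonym[user], query) for user, query in log]

# Anyone holding the released data can still group queries by numeric ID.
profiles = defaultdict(list)
for numeric_id, query in released:
    profiles[numeric_id].append(query)

for numeric_id, queries in profiles.items():
    print(numeric_id, queries)
```

The usernames are gone, but every query by the same person still carries the same ID, so the combined profile (finances, health, location) remains and can point to an individual.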
Criminal Intent or Research
Would we use the following to investigate someone?
  how to change brake pads on scion xb
  2005 us open cup florida state champions
  how to get revenge on a ex
  how to get revenge on a ex girlfriend
  how to get revenge on a friend who [expletive] you over
  replacement bumper for scion xb
  florida department of law enforcement
  crime stoppers florida
Perhaps someone researching a novel!
A Distressed Victim?
A quite distressing example from the AOL disclosure:
  casey middle school
  surgical help for depression
  can you adopt after a suicide attempt
  gynecology oncologists in new york city
  Fishman David Dr 160 E 34th St, New York, 10016
  how to tell your family you’re a victim of incest
  how long will the swelling last after my tummy tuck
  teaching positions in denver colorado
  divorce laws in ohio
Privacy is Important
Privacy is important to ensure freedom from oppression.
Privacy can be breached either accidentally or purposefully.
How much should we allow our governments to breach our privacy: what is the right trade-off?
Principles of Data Collection
1 Collection Limitation:
There should be limits to the collection of personal data and any such data should be obtained by lawful and fair means and, where appropriate, with the knowledge or consent of the data subject.
2 Data Quality:
Personal data should be relevant to the purposes for which they are to be used, and, to the extent necessary for those purposes, should be accurate, complete and kept up-to-date.
3 Purpose Specification:
The purposes for which personal data are collected should be specified not later than at the time of collection and the subsequent use limited to the fulfilment of those purposes or such others as are not incompatible with those purposes and as are specified on each occasion of change of purpose.
4 Use Limitation:
Personal data should not be disclosed, made available or otherwise used for purposes other than those specified in accordance with [Principle 3] except:
  with the consent of the data subject; or
  by the authority of law.
5 Security Safeguards:
Personal data should be protected by reasonable security safeguards against such risks as loss or unauthorised access, destruction, use, modification or disclosure of data.
6 Openness:
There should be a general policy of openness about developments, practices and policies with respect to personal data. Means should be readily available of establishing the existence and nature of personal data, and the main purposes of their use, as well as the identity and usual residence of the data controller.
7 Individual Participation:
An individual should have the right to obtain confirmation of whether or not a data controller has data relating to them, and to have access to that data within a reasonable time and cost, and to be able to challenge any denial, and to be able to challenge data relating to themselves and, if the challenge is successful, to have the data erased, rectified, completed or amended.
8 Accountability:
A data controller should be accountable for complying with measures giving effect to these principles.