
Data Mining Algorithms

Data Mining End to End

Graham Williams

Principal Data Miner, ATO Adjunct Associate Professor, ANU

http://togaware.com Copyright © 2006, Graham J. Williams

Overview

1 Process
    CRISP-DM

2 Hot Spots
    NRMA
    Medicare

3 Evaluation and Communication
    Communicating Performance

4 Privacy
    Protecting Privacy


The CRISP-DM Model

Cross Industry Standard Process for Data Mining: http://www.crisp-dm.org/

Developed by NCR, Daimler-Benz, ISL, OHRA

Define and validate a Data Mining Process Model

    applicable in diverse industry sectors

    industry and tool neutral

    large data mining projects executed faster, cheaper, more reliably and more manageably

Life cycle of six (iterative) phases.


Going Through Loops

Cyclic nature of data mining:

Every step of a data mining process can lead to revisiting any one of the previous steps.

A DM process continues after a solution has been deployed.

The lessons learnt can trigger new, often more focused business questions.

Subsequent data mining processes benefit from experiences of previous ones.


Six Steps

1 Business Understanding (25%)
2 Data Understanding (20%)
3 Data Preparation (25%)
4 Modelling (10%)
5 Evaluation (20%)
6 Deployment

1. Find Objectives 20%
2. Data Preparation 60%
3. Data Mining 10%
4. Analysis 10%


Business Understanding

We had better make sure we are addressing a real business problem.

Initial phase focuses on understanding project objectives and requirements from a business perspective

This knowledge is converted into a data mining problem definition

Develop a preliminary plan designed to achieve the objectives


Data Understanding

Understand what data is available and its semantics.

Initial data collection

Familiarisation with the data

    identify data quality problems

    discover first insights into the data

    detect interesting subsets to form hypotheses for hidden information


Data Preparation

Bring together the data and get it into shape for mining.

Construct the mining dataset

    Derived from the initial raw dataset(s)

Data preparation tasks:

    table, record, and attribute selection

    generation of derived features

    data transformation

    data cleaning
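The four preparation tasks listed above can be illustrated with a minimal sketch. The records, field names (`age`, `claims`, `policies`, `notes`), the median fill and the age bands are all assumptions made for illustration; they do not come from the slides.

```python
from statistics import median

# Hypothetical raw records; every field name here is illustrative only.
raw = [
    {"id": 1, "age": 23, "claims": 2, "policies": 1, "notes": "x"},
    {"id": 2, "age": None, "claims": 0, "policies": 2, "notes": "y"},
    {"id": 3, "age": 61, "claims": 3, "policies": 3, "notes": "z"},
]

def prepare(records, keep=("id", "age", "claims", "policies")):
    # Attribute selection: keep only the columns wanted for mining.
    rows = [{k: r[k] for k in keep} for r in records]
    # Data cleaning: fill missing ages with the median of the known ages
    # (an assumed strategy; other choices are equally plausible).
    fill = median(r["age"] for r in rows if r["age"] is not None)
    for r in rows:
        if r["age"] is None:
            r["age"] = fill
        # Derived feature: claims per policy held.
        r["claims_per_policy"] = r["claims"] / r["policies"]
        # Data transformation: bin age into coarse, assumed bands.
        if r["age"] < 25:
            r["age_band"] = "under25"
        elif r["age"] <= 57:
            r["age_band"] = "25to57"
        else:
            r["age_band"] = "over57"
    return rows

dataset = prepare(raw)
```

In practice each of these choices (which attributes to keep, how to fill gaps, how to bin) would be revisited as the later phases feed back into preparation.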


Preparing to Mine

Issues to be dealt with include:

Data Quality

    missing data

    noisy data

    lead to inconsistent or too general/specific discoveries

Data Cleaning

    duplicates

    inconsistencies

    identify and merge the same entities
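As a rough illustration of identifying and merging the same entities, the sketch below treats two records as the same entity when their normalised name and postcode agree. The matching key, the field names and the sample records are assumptions; real entity resolution usually needs fuzzier matching than this.

```python
def normalise(name):
    # Lower-case and collapse whitespace so trivially different spellings match.
    return " ".join(name.lower().split())

def merge_entities(records):
    merged = {}
    for rec in records:
        # Assumed matching key: normalised name plus postcode.
        key = (normalise(rec["name"]), rec["postcode"])
        if key not in merged:
            merged[key] = dict(rec, name=normalise(rec["name"]))
        else:
            # Fill in fields that were missing on the record seen first.
            for field, value in rec.items():
                if merged[key].get(field) is None:
                    merged[key][field] = value
    return list(merged.values())

# Hypothetical records containing one duplicate.
records = [
    {"name": "Jane  Smith", "postcode": "2600", "phone": None},
    {"name": "jane smith", "postcode": "2600", "phone": "555-0100"},
    {"name": "John Jones", "postcode": "2949", "phone": None},
]
entities = merge_entities(records)
```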


Modelling

Now the “data mining” begins!!!

Select various modelling techniques

Apply and calibrate modelling techniques

Typically there are several techniques for the same data mining problem

Some techniques have specific requirements on the form of data and require stepping back to the data preparation phase


Evaluation

How do we know we have a useful outcome?

Evaluate the model and review the steps executed to construct the model

Does the model properly achieve the business objectives?

Is there some important business issue that has not been sufficiently considered?

Decide on the use of the data mining results


Deployment

No point to data mining unless we action the outcomes.

Deployment may be:

    Generate a report of the discoveries made

    Implement changes in the processes of the organisation

    Implement a repeatable data mining process

For successful deployment the customer must understand the actions to be carried out in order to actually make use of the created models


Summary

The KDD Process

Iterative process requiring multiple loops

Time consuming

Mining is one “small” step

Data issues are crucial to success


Motor Vehicle Insurance

Insurance premium setting and risk rating

Actuaries study data and domain for general understanding of risk

Several million transactions annually

Consider more than the traditional small number of factors

Data mining can explore very large collections of data — both entities and features.


Cluster then Describe then Measure

The Hot Spots methodology combines Cluster Analysis and Decision Trees to symbolically identify candidate regions of a dataset.

[Figure: data points grouped into clusters labelled C1 to C4, described by a decision tree with splits ClaimCost ≤ $695, ClmType ≤ 6, Model ∈ [holden, ford], SumRqst ≤ $15,000, Postcode ≤ 2949 and Cubic ≤ 2.4; annotations Cost = $95,000 and Cost = $158,000]


Find the Interesting Groups

Rule 1: NCB < 60 and Age ≤ 24 and Address is Urban

Rule 23: Age > 57 and Vehicle ∈ {Utility, Station Wagon}

Nugget  Claims  Total   Proportion (%)  Average Cost ($)  Total Cost ($)
1       150     1400    11              3700              545,000
2       140     2300    6               3800              535,000
3       5       25      20              4400              13,000
4       10      120     8               7900              79,100
5       20      340     6               5300              116,000
6       65      520     13              4400              280,700
7       5       5       100             6800              20,300
...
60      800     1400    5.9             3500              2,800,000
All     3800    72000   5.0             3000              12,000,000


Finding the Interesting Groups

Evaluate the large collection of groups (or Hot Spots) to find those that are important to the core business.

Nugget By Claims By Proportion By Average Cost

2 Y 3 Y 19 Y 24 Y 34 Y Y Y 35 Y Y 36 Y 40 Y Y


Operationalise

Identify groups that are:

High Risk

    Very high dollars per claim

    Large percentage of claims in the group

Low Risk

    Very few claims from the group

    Claims are low in dollars
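One way to make these criteria concrete is to compare each nugget with the overall population. The sketch below uses the first seven rows and the "All" row of the nugget table; the cut-off `factor` and the exact rule for combining the criteria are assumptions, not part of the Hot Spots methodology as presented.

```python
# Rows copied from the nugget table: (nugget, claims, total, average cost).
nuggets = [
    (1, 150, 1400, 3700),
    (2, 140, 2300, 3800),
    (3, 5, 25, 4400),
    (4, 10, 120, 7900),
    (5, 20, 340, 5300),
    (6, 65, 520, 4400),
    (7, 5, 5, 6800),
]
OVERALL_RATE = 3800 / 72000  # claims / total from the "All" row
OVERALL_AVG_COST = 3000      # average cost from the "All" row

def classify(claims, total, avg_cost, factor=2.0):
    """Label a group relative to the overall population.

    `factor` is an assumed cut-off, not a value from the slides.
    """
    rate = claims / total
    # High risk: claim rate or dollars per claim well above overall.
    if rate >= factor * OVERALL_RATE or avg_cost >= factor * OVERALL_AVG_COST:
        return "high"
    # Low risk: few claims and low dollars per claim.
    if rate <= OVERALL_RATE / factor and avg_cost <= OVERALL_AVG_COST / factor:
        return "low"
    return "neutral"

labels = {n: classify(c, t, a) for n, c, t, a in nuggets}
```

With these assumed cut-offs, nuggets 1, 3, 4, 6 and 7 come out as high risk; a different `factor` would move the boundary, which is exactly the kind of choice the business has to own.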


Health Insurance Commission

Universal Health Coverage

Terabytes of patient claims since the inception of Medicare

Inappropriate Provider practices an ongoing focus

Exploration of public fraud (including doctor shoppers)

Exploration of the practice of pathology


Cluster/Describe/Measure

[Figure: data points grouped into clusters labelled C1 to C4, described by a decision tree with splits ClaimCost ≤ $695, ClmType ≤ 6, Model ∈ [holden, ford], SumRqst ≤ $15,000, Postcode ≤ 2949 and Cubic ≤ 2.4; annotations Cost = $95,000 and Cost = $158,000]


Cluster/Describe/Deliver

Rule 1: Age is between 28 and 35 and Weeks ≤ 5

Rule 2: Weeks < 10 and Benefits > $350

Nugget  Size    Age  Gender  Services  Benefits  Weeks  Hoard  Regular
1       9000    30   F       10        30        2      1      1
2       150     30   F       24        841       4      2      4
3       1200    65   M       7         220       20     1      1
4       80      45   F       30        750       10     1      1
5       90      10   M       12        1125      10     5      2
6       800     55   M       8         550       7      1      9
...
280     30      25   F       15        450       15     2      6
All     40,000  45           8         30        3      1      1


Claim Hoarders

A distinct group of behaviour identified as Claim Hoarders

[Figure: three panels, pinA, pinB and pinC, each labelled Service-Claim, with a horizontal axis running from 12600 to 13400]

But there may be many millions of these individuals.


Medicare Regulars

Group of patients with very regular activity:

[Figure: two panels, pinaD and pinD, each labelled Service-Claim, with a horizontal axis running from 12600 to 13400]

Remove non-cash payments!!!


Operationalise

The fraud identified was investigated and appropriate action taken.

    Perpetrators prosecuted

    Funds recovered

    Processes improved to cross validate data


The Importance of Communication

Selling the story to management.

Do we present the model or the outcomes?

Senior management in something like the ATO is necessarily cautious, protecting the integrity of the country's revenue system.

Need to demonstrate and “prove” the performance and robustness of models before deployment.


Options in Rattle: Confusion Matrix

A simple instrument to convey predictive performance

But quite a blunt instrument.

Confusion matrix rpart model on audit.csv [test] (counts):

           Actual
Predicted    0    1
        0  428   56
        1   44   72

Confusion matrix rpart model on audit.csv [test] (%):

           Actual
Predicted    0    1
        0   71    9
        1    7   12
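The percentage table follows directly from the counts, and a few summary measures show why the matrix is a blunt instrument. The sketch below assumes rows are the predicted class and columns the actual class, as the header order suggests; the variable names are illustrative.

```python
# Counts from the confusion matrix, keyed by (predicted, actual).
counts = {(0, 0): 428, (0, 1): 56, (1, 0): 44, (1, 1): 72}

total = sum(counts.values())
# Each cell as a whole-number percentage of all test cases.
percent = {cell: round(100 * n / total) for cell, n in counts.items()}

tp = counts[(1, 1)]  # predicted 1, actually 1
fp = counts[(1, 0)]  # predicted 1, actually 0
fn = counts[(0, 1)]  # predicted 0, actually 1
tn = counts[(0, 0)]  # predicted 0, actually 0

accuracy = (tp + tn) / total   # share of cases classified correctly
precision = tp / (tp + fp)     # share of predicted 1s that are real
recall = tp / (tp + fn)        # share of real 1s that are found
```

Under that reading the model is right on about 83% of cases overall but finds only about 56% of the actual positives, a distinction the single percentage table does not make obvious.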


Options in Rattle: Risk Charts

Developed specifically for the ATO.

Capture both the “score” exhibited through probability, and the size of the Risk associated with each case! Often, it is the Risk that is of most interest.
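A rough sketch of the idea, not Rattle's implementation: order the cases by model score, then track what fraction of the known positive cases (recall) and what fraction of the total risk amount has been captured as the caseload grows. The function names, the sample cases and the use of the trapezoidal rule for the area are assumptions.

```python
def risk_chart(cases):
    """cases: list of (score, is_hit, risk_amount) tuples.

    Returns three parallel lists: caseload fraction, cumulative recall,
    and cumulative share of total risk, with cases taken in score order.
    """
    ordered = sorted(cases, key=lambda c: c[0], reverse=True)
    total_hits = sum(1 for _, hit, _ in ordered if hit)
    total_risk = sum(amount for _, hit, amount in ordered if hit)
    caseload, recall, risk = [0.0], [0.0], [0.0]
    hits = 0
    captured = 0
    for i, (_, hit, amount) in enumerate(ordered, start=1):
        if hit:
            hits += 1
            captured += amount
        caseload.append(i / len(ordered))
        recall.append(hits / total_hits)
        risk.append(captured / total_risk)
    return caseload, recall, risk

def area(xs, ys):
    # Area under a curve by the trapezoidal rule.
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))

# Hypothetical scored cases: (score, is_hit, risk_amount).
cases = [(0.9, True, 5000), (0.8, False, 0), (0.6, True, 1000), (0.3, False, 0)]
caseload, recall, risk = risk_chart(cases)
recall_area = area(caseload, recall)
risk_area = area(caseload, risk)
```

Here the risk curve rises faster than the recall curve because the highest-scored case also carries most of the dollars, which is the situation where the Risk line tells management more than recall alone.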


Risk Chart: RPart


Risk Chart: RF


Risk Chart: SVM


Risk Chart: Textual Comparison

The area under the Risk and Recall curves for rpart model

    Area under the Risk (red) curve: 79% (0.790)

    Area under the Recall (green) curve: 76% (0.762)

The area under the Risk and Recall curves for rf model

    Area under the Risk (red) curve: 78% (0.780)

    Area under the Recall (green) curve: 78% (0.779)

The area under the Risk and Recall curves for ksvm model

    Area under the Risk (red) curve: 78% (0.777)

    Area under the Recall (green) curve: 77% (0.774)

Which is best?


Privacy and Data Mining

Laws in many countries directly affect Data Mining and it is worth being aware of them: penalties are often severe.

The OECD Principles of Data Collection were drafted in 1980. They embody guiding principles for governments.

Revised for APEC 2003 as part of the Asia-Pacific Privacy Charter Initiative.

Data mining by the Australian Taxation Office is governed by data matching and privacy protocols, and independently overseen by the Privacy Commissioner, ANAO, and others.


What are we trying to protect?

Protect, amongst others:

    Religious freedom

    Freedom from racial discrimination

    Personal medical records

    Employment history

    Political freedom


Centrelink and Privacy Breaches

In August 2006 Centrelink (Australian Social Security Agency) announced it had identified over 500 privacy breaches committed by staff.

Identified by monitoring and mining database access logs.

Activities

    looking at own personal records

    looking at family, friends, neighbours

    obtaining information to be sold

    changing information for financial gain

Consequences

    counselling

    reduced pay/position

    lose job
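How such monitoring might look in its simplest form is sketched below. This is not Centrelink's method: the log layout, the identifiers and the lookup tables linking staff to their own and related customer records are all hypothetical.

```python
# Hypothetical log entries: (staff_id, customer_id) for each record viewed.
access_log = [
    ("s1", "c10"), ("s1", "c11"), ("s2", "c20"), ("s2", "c77"), ("s3", "c30"),
]
# Hypothetical reference data: each staff member's own customer record,
# and records of people known to be related to them (e.g. same address).
own_record = {"s1": "c11", "s2": "c21", "s3": "c31"}
related = {"s2": {"c77"}}

def flag_accesses(log):
    """Return (staff, customer, reason) for accesses that warrant review."""
    flagged = []
    for staff, customer in log:
        if own_record.get(staff) == customer:
            flagged.append((staff, customer, "own record"))
        elif customer in related.get(staff, set()):
            flagged.append((staff, customer, "related person"))
    return flagged

flags = flag_accesses(access_log)
```

A flag here is only a prompt for investigation; a staff member may have a legitimate reason for any single access.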


AOL Privacy Breach

AOL researchers thought to release anonymised web query logs for researchers (August 2006).

Covered 250,000 users and 20 million queries.

(Compare with US DoJ demand that Google, AOL, etc, supply such data for them to monitor their citizens.)

Usernames converted to numeric IDs.

But, aggregate queries for a single numeric ID

    enough to identify individuals

    multiple queries paint a picture

    private financial situation: property and bank loan enquiries

    indication of criminal activity or research for a book?

    health: pregnancy, home loan, dog vomit + uncooked pasta

"This was a screw-up and we’re angry and upset about it." - AOL
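The weakness is easy to demonstrate: replacing a username with a number still leaves every query linked to the same key. The sketch below uses made-up rows in an assumed (numeric ID, query) shape, not data from the release.

```python
from collections import defaultdict

# Made-up log rows in the shape (numeric_id, query); not real released data.
log = [
    (101, "bank loan rates"),
    (202, "pasta recipes"),
    (101, "houses for sale springfield"),
    (101, "dentist near main street springfield"),
]

def profile_by_id(rows):
    """Group every query under the numeric ID that issued it."""
    profiles = defaultdict(list)
    for user_id, query in rows:
        profiles[user_id].append(query)
    return dict(profiles)

profiles = profile_by_id(log)
```

Each query alone says little, but the list collected under ID 101 already suggests a town, a street and a financial situation.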


Criminal Intent or Research

Would we use the following to investigate someone?

    how to change brake pads on scion xb

    2005 us open cup florida state champions

    how to get revenge on a ex

    how to get revenge on a ex girlfriend

    how to get revenge on a friend who ---- you over

    replacement bumper for scion xb

    florida department of law enforcement

    crime stoppers florida

Perhaps someone researching a novel!


A Distressed Victim?

A quite distressing example from the AOL disclosure:

    casey middle school

    surgical help for depression

    can you adopt after a suicide attempt

    gynecology oncologists in new york city

    Fishman David Dr 160 E 34th St, New York, 10016

    how to tell your family you’re a victim of incest

    how long will the swelling last after my tummy tuck

    teaching positions in denver colorado

    divorce laws in ohio


Privacy is Important

Privacy is important to ensure freedom from oppression.

Privacy can be breached either accidentally or purposefully.

How much should we allow our governments to breach our privacy? What is the right trade-off?


Principles of Data Collection

1 Collection Limitation:

There should be limits to the collection of personal data and any such data should be obtained by lawful and fair means and, where appropriate, with the knowledge or consent of the data subject.

2 Data Quality:

Personal data should be relevant to the purposes for which they are to be used, and, to the extent necessary for those purposes, should be accurate, complete and kept up-to-date.


3 Purpose Specification:

The purposes for which personal data are collected should be specified not later than at the time of collection and the subsequent use limited to the fulfilment of those purposes or such others as are not incompatible with those purposes and as are specified on each occasion of change of purpose.

4 Use limitation:

Personal data should not be disclosed, made available or otherwise used for purposes other than those specified in accordance with [Principle 3] except:

with the consent of the data subject; or by the authority of law.


5 Security Safeguards:

Personal data should be protected by reasonable security safeguards against such risks as loss or unauthorised access, destruction, use, modification or disclosure of data.

6 Openness:

There should be a general policy of openness about developments, practices and policies with respect to personal data. Means should be readily available of establishing the existence and nature of personal data, and the main purposes of their use, as well as the identity and usual residence of the data controller.


7 Individual Participation:

An individual should have the right to obtain confirmation of whether or not a data controller has data relating to them, and to have access to that data within a reasonable time and cost, and to be able to challenge any denial, and to be able to challenge data relating to themselves and, if the challenge is successful, to have the data erased, rectified, completed or amended.

8 Accountability:

A data controller should be accountable for complying with measures giving effect to these principles


Thank You
