Data Mining and Risk Management


Academic year: 2021

(1)
(2)

Data Mining

Dr. Bradley A. Malin

Assistant Professor

Department of Biomedical Informatics

Vanderbilt University

(3)

Data Collection

• Localized: Personal records in databases at source

• Distributed: Integration of records from many sources

Overt Collection: Hospital visit for treatment

Covert Collection: webcam while walking

Electronic Medical Records: demographics, clinical presentation

(4)

Data Mining

• Unsupervised
– “Labels” unknown in advance, so search for intrinsic patterns of the data
– Clustering “similar” people
• Purchased same products

• Supervised
– “Labels” known in advance
– Train models on sample data to classify new cases

[Figure: decision tree splitting on Country (USA, Canada) and Age (<50, >50), with leaves Jaws, Harry Potter, Scream, 1984]
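The unsupervised case ("clustering similar people who purchased the same products") can be sketched as follows. This is a minimal illustration, not the method from the slides: the customer names `p1` to `p4`, their baskets, the Jaccard similarity measure, and the 0.5 threshold are all assumptions chosen for the example.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Fraction of products two customers share, out of all products either bought."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_by_purchases(baskets: dict, threshold: float = 0.5) -> list:
    """Single-link clustering: customers with enough basket overlap share a cluster."""
    parent = {name: name for name in baskets}

    def find(x):
        # Follow parent pointers to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(baskets, 2):
        if jaccard(baskets[a], baskets[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in baskets:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=lambda g: sorted(g))

# Hypothetical purchase histories (no labels are given in advance).
baskets = {
    "p1": {"Jaws", "Scream"},
    "p2": {"Jaws", "Scream", "1984"},
    "p3": {"Harry Potter"},
    "p4": {"Harry Potter", "1984"},
}
clusters = cluster_by_purchases(baskets)
```

No labels are supplied; the groups emerge only from overlap in what people bought.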

(5)

Website Personalization

• Can a website predict what I want to see?
– Intra-personalization: What pages / topics did I visit in my previous visits?
– Inter-personalization: Is my browsing / purchasing history similar to other people’s?

• Does my behavior reveal my identity or sensitive things about my life?

(6)

Intelligence

• Lists of entities are becoming increasingly prevalent
– Intelligence reports, rosters, networks

• How many Alices are there? Which is which?

• How does Alice relate to Bob?

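[Figure: four documents listing names, and the network derived from them]
doc A: Alice Doe, Bob Doe, Junior Doe
doc B: Bob Doe, Junior Doe, Alice Doe
doc C: Alice Doe, John Smith
doc D: Alice Doe, John Smith, Brad Malin
Network nodes: Alice “A”, Alice “B”, Alice “C”, Alice “D”, Bob, Junior, John, Brad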
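One way to approach "which Alice is which" is to compare the other names that appear alongside each mention. The sketch below uses the four document lists from this slide; the function names and the idea of using shared co-listed names as evidence are assumptions for illustration, not the method from the talk.

```python
from collections import Counter
from itertools import combinations

# Name lists as they appear in documents A to D on the slide.
docs = {
    "A": ["Alice Doe", "Bob Doe", "Junior Doe"],
    "B": ["Bob Doe", "Junior Doe", "Alice Doe"],
    "C": ["Alice Doe", "John Smith"],
    "D": ["Alice Doe", "John Smith", "Brad Malin"],
}

def cooccurrence(docs: dict) -> Counter:
    """Count how many documents list each pair of names together."""
    counts = Counter()
    for names in docs.values():
        for a, b in combinations(sorted(set(names)), 2):
            counts[(a, b)] += 1
    return counts

def shared_context(docs: dict, name: str, d1: str, d2: str) -> set:
    """Names other than `name` that two documents have in common."""
    return (set(docs[d1]) & set(docs[d2])) - {name}

pairs = cooccurrence(docs)
same_ab = shared_context(docs, "Alice Doe", "A", "B")
same_ac = shared_context(docs, "Alice Doe", "A", "C")
same_cd = shared_context(docs, "Alice Doe", "C", "D")
```

Here the Alice mentions in documents A and B share two companions, those in C and D share one, and A and C share none, which hints at (but does not prove) two distinct people.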

(7)

Surveillance

• Location Surveillance: Did someone on Interpol’s watchlist visit hotel X? Airline Y?

Challenge: Data holders want to collaborate, but fear strategic knowledge and legal constraints

(8)

Privacy Protections

• Protect Anonymity

– Remove / encrypt identifying information

– Suppress inferences that can reveal identity

• Protect Confidentiality

– Hide Sensitive Rules

– Perturbation and Generalization

• Secure Multiparty Computation

– E(a) + E(b) = E(a + b) [homomorphism]
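The homomorphism E(a) + E(b) = E(a + b) can be made concrete with a toy additively homomorphic scheme. The sketch below uses a textbook Paillier construction, which is an assumption (the slide names no scheme); in Paillier the operation applied to ciphertexts is multiplication modulo n², and the result decrypts to the sum. The hard-coded primes are far too small for any real use.

```python
import math
import secrets

def keygen(p: int = 1789, q: int = 1861):
    """Toy Paillier key pair from two small primes (illustration only)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)   # Carmichael function of n
    mu = pow(lam, -1, n)           # modular inverse of lam mod n
    return n, (lam, mu)

def encrypt(n: int, m: int) -> int:
    """Encrypt m (0 <= m < n) with generator g = n + 1 and fresh randomness r."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # (n + 1)^m mod n^2 equals 1 + m*n, so no exponentiation is needed for g^m.
    return ((1 + m * n) * pow(r, n, n2)) % n2

def decrypt(n: int, priv: tuple, c: int) -> int:
    lam, mu = priv
    n2 = n * n
    l_value = (pow(c, lam, n2) - 1) // n
    return (l_value * mu) % n

def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Combine two ciphertexts so the result decrypts to the sum of the plaintexts."""
    return (c1 * c2) % (n * n)

n, priv = keygen()
c_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
total = decrypt(n, priv, c_sum)
```

A party holding only the public value `n` can add two encrypted values without learning either one, which is the property that makes collaborative computation between wary data holders possible.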

(9)

Clinical Genomics

• Vanderbilt DNA Databank

• DNA from “leftover” blood: 25-75K per year, 250K in 5 years

• Combined with de-identified electronic medical records: 600 GBytes on 1.4 mil. patients

• “Hypothesis Generation” to mine correlations between clinical features and DNA

[Figure: Blood Samples and Clinical Record pass through De-identification (512 Bit Hash of #) to yield DNA and Clinical Record]
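The "512 Bit Hash of #" step can be sketched as below. The choice of SHA-512, the optional secret key, and the function name `pseudonym` are assumptions; the slide states only that a 512-bit hash of the record number links the DNA sample to its clinical record.

```python
import hashlib
import hmac

def pseudonym(record_number: str, secret_key: bytes = b"") -> str:
    """Return a 512-bit (128 hex character) pseudonym for a record number.

    With an empty key this is a plain SHA-512 digest. With a key it is
    HMAC-SHA-512, which prevents someone without the key from hashing
    every possible record number and comparing the results.
    """
    data = record_number.encode("utf-8")
    if secret_key:
        return hmac.new(secret_key, data, hashlib.sha512).hexdigest()
    return hashlib.sha512(data).hexdigest()

# Hypothetical record number; the same input always yields the same pseudonym,
# so a DNA sample and a clinical record can be joined without the number itself.
dna_key = pseudonym("MR-0012345", b"hypothetical-secret")
record_key = pseudonym("MR-0012345", b"hypothetical-secret")
```

Record numbers come from a small, guessable space, so an unkeyed hash on its own is weak protection; that is the reason for the keyed variant.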

(10)

Example De-identified Medical Record

• Substituted names
• Replaced SSN and phone #
• Shifted dates
• MR# is removed
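The four transformations on this slide can be sketched on a single record. The field names (`name`, `mr_number`, `visit_date`, `note`), the placeholder strings, and the number formats matched by the patterns are all assumptions for illustration; real clinical text needs far more careful scrubbing.

```python
import re
from datetime import date, timedelta

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")      # assumed format: 123-45-6789
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")    # assumed format: 615-555-0100

def deidentify(record: dict, shift_days: int) -> dict:
    """Apply the slide's four steps to one record (hypothetical schema)."""
    out = dict(record)
    out.pop("mr_number", None)                   # MR# is removed
    out["name"] = "PATIENT"                      # substituted name
    # Shift dates by a per-patient offset so intervals between visits survive.
    out["visit_date"] = record["visit_date"] + timedelta(days=shift_days)
    note = SSN.sub("[SSN]", record["note"])      # replaced SSN
    out["note"] = PHONE.sub("[PHONE]", note)     # replaced phone number
    return out

record = {
    "name": "Jane Doe",
    "mr_number": "0012345",
    "visit_date": date(2007, 1, 10),
    "note": "SSN 123-45-6789, call 615-555-0100 to schedule follow-up.",
}
clean = deidentify(record, shift_days=-30)
```

Using one fixed offset per patient keeps the time between that patient's events intact while hiding the true calendar dates.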

(11)

Naïve Protection

[Figure: DNA sequences (ACTG 1, ACTG 2, ACTG 3) in the genomic databases and identities in the discharge databases of hospitals H1, H2, H3]

• Patterns in data can lead to privacy compromise

• Suppress patterns “intelligently” to support goals
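The compromise can be illustrated with "trails": the set of hospitals at which a DNA sample appears, compared with the set of hospitals at which an identified patient was discharged. The sketch below links a sample to a name when both have the same trail and that trail is unique on each side. The data, the patient names, and the exact-match rule are assumptions for illustration.

```python
from collections import defaultdict

def trails(dbs: dict) -> dict:
    """Map each item to the frozenset of hospitals whose database contains it."""
    seen = defaultdict(set)
    for hospital, items in dbs.items():
        for item in items:
            seen[item].add(hospital)
    return {item: frozenset(hs) for item, hs in seen.items()}

def group_by_trail(item_trails: dict) -> dict:
    """Invert the mapping: trail -> list of items that left that trail."""
    groups = defaultdict(list)
    for item, trail in item_trails.items():
        groups[trail].append(item)
    return groups

def link_unique_trails(dna_dbs: dict, discharge_dbs: dict) -> dict:
    """Link a DNA sample to an identity when their shared trail is unique on both sides."""
    dna_groups = group_by_trail(trails(dna_dbs))
    id_groups = group_by_trail(trails(discharge_dbs))
    links = {}
    for trail, samples in dna_groups.items():
        names = id_groups.get(trail, [])
        if len(samples) == 1 and len(names) == 1:
            links[samples[0]] = names[0]
    return links

# Hypothetical releases: de-identified DNA per hospital, identified discharges per hospital.
dna_dbs = {
    "H1": {"ACTG1", "ACTG2", "ACTG4", "ACTG5"},
    "H2": {"ACTG1", "ACTG3"},
    "H3": {"ACTG2", "ACTG3"},
}
discharge_dbs = {
    "H1": {"Alice", "Bob", "Dan", "Eve"},
    "H2": {"Alice", "Carol"},
    "H3": {"Bob", "Carol"},
}
links = link_unique_trails(dna_dbs, discharge_dbs)
```

Samples `ACTG4` and `ACTG5` stay unlinked because two patients share their trail; suppressing or withholding records so that every trail is shared by several people is the idea behind the "intelligent" suppression in the second bullet.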

(12)

In Detail: Cystic Fibrosis
(1149 patients, 174 hospitals)

[Figure: two plots against k, one showing % of DNA Records Disclosed and one showing % of Samples Re-identified; series: Naive, Partial Trail Suppression]

(13)

The Impact of Data Mining on Privacy

in the Public and Private Sectors

Richard S. Rosenberg

Professor Emeritus, Department of Computer

Science, University of British Columbia and

President of the BC Freedom of Information and

Privacy Association

Vancouver, BC

(14)
(15)
(16)

Top Six Purposes of Data Mining

(17)

Table 1: Key Steps Agencies Are Required to Take to Protect Privacy, with Examples of Related Detailed Procedures and Sources

Key step to protect privacy of personal information: Publish notice in the Federal Register when creating or modifying system of records
Examples of procedures:
• Specify the routine uses for the system
• Identify the individual responsible for the system
• Outline procedures individuals can use to gain access to their records
Primary statutory source: Privacy Act

Key step to protect privacy of personal information: Provide individuals with access to their records
Examples of procedures:
• Permit individuals to review records about themselves
• Permit individuals to request corrections to their records
Primary statutory source: Privacy Act

Key step to protect privacy of personal information: Notify individuals of the purpose and authority for the requested information when it is collected
Examples of procedures:
• Notify individuals of the authority that authorized the agency to collect the information
• Notify individuals of the principal purposes for which the information is to be used
Primary statutory source: Privacy Act

Key step to protect privacy of personal information: Implement guidance on system
Examples of procedures:
• Perform a risk assessment to determine the information system vulnerabilities, identify threats, and develop countermeasures to those threats
• Have the system certified and accredited by management
• Ensure the accuracy, relevance, timeliness, and completeness of information
Primary statutory source: FISMA, Privacy Act

Key step to protect privacy of personal information: Conduct a privacy impact assessment
Examples of procedures:
• Describe and analyze how information is secured
• Describe and analyze intended use of information
• Have assessment reviewed by chief information officer or equivalent
Primary statutory source: E-Government Act

(18)

ADVISE Data Mining Tool

(19)

Cato Institute: Data Mining and Terrorism

• Attempting to use predictive data mining to ferret out terrorists before they strike would be a subtle but important misdirection of national security resources.

• With a relatively small number of attempts every year and only one or two major terrorist incidents every few years – each one distinct in terms of planning and execution – there are no meaningful patterns that show what behavior indicates planning or preparation for terrorism.

(20)

Data Mining in the Private Sector

We generate an enormous amount of data as a by-product of our everyday transactions (purchasing goods, enrolling for courses, etc.), visits to Web sites and interactions with government (taxes, census, car registration, voter registration, etc.). Not only is the number of records we generate increasing, but the amount of data gathered for each type of record is increasing.

As data miners, our tasks are colliding with these concerns. In analytic customer relationship management (CRM), we often analyze customer data with the specific intent of understanding individual behavior and instituting sales campaigns based on this understanding. Researchers in economics, demographics, medicine and social sciences are trying to understand the relationships between behaviors and outcomes.

How can we reconcile the legitimate needs of business and research with the equally legitimate desire of people to maintain their privacy?

(21)

The Use of Anonymizing

Still, anonymizing technologies have been endorsed repeatedly by panels appointed to examine the implications of data mining. And intriguing progress appears to have been made at designing information-retrieval systems with record anonymization, user audit logs (which can confirm that no one looked at records beyond the approved scope of an investigation) and other privacy mechanisms "baked in."

The trick is to do more than simply strip names from records. Latanya Sweeney of Carnegie Mellon University, a leading privacy technologist who once had a project funded under TIA, has shown that 87% of Americans could be identified by records listing solely their birthdate, gender and ZIP code. Sweeney had this challenge in mind as she developed a way for the U.S. Department of Housing and Urban Development to
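Why stripping names is not enough can be checked directly: count how many records are the only one with their combination of birthdate, gender and ZIP code. The sketch below is a minimal illustration with made-up records; the field names, the `generalize_zip` helper, and the three-digit ZIP truncation are assumptions, not Sweeney's method.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("birthdate", "gender", "zip")

def group_sizes(records: list, fields=QUASI_IDENTIFIERS) -> Counter:
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[f] for f in fields) for r in records)

def unique_fraction(records: list, fields=QUASI_IDENTIFIERS) -> float:
    """Fraction of records whose quasi-identifier combination is unique."""
    if not records:
        return 0.0
    sizes = group_sizes(records, fields)
    unique = sum(1 for r in records if sizes[tuple(r[f] for f in fields)] == 1)
    return unique / len(records)

def k_anonymity(records: list, fields=QUASI_IDENTIFIERS) -> int:
    """Size of the smallest group; the data set is k-anonymous for this k."""
    return min(group_sizes(records, fields).values())

def generalize_zip(records: list) -> list:
    """Coarsen ZIP codes to their first three digits (one possible generalization)."""
    return [{**r, "zip": r["zip"][:3] + "**"} for r in records]

# Hypothetical name-free records.
records = [
    {"birthdate": "1970-01-01", "gender": "F", "zip": "37203"},
    {"birthdate": "1970-01-01", "gender": "F", "zip": "37203"},
    {"birthdate": "1982-05-17", "gender": "M", "zip": "37203"},
    {"birthdate": "1982-05-17", "gender": "M", "zip": "37212"},
]
before = unique_fraction(records)
after = unique_fraction(generalize_zip(records))
```

In this toy data half of the records are unique before generalization and none are afterwards, at the cost of less precise location information.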

(22)

A Private Sector Example

Tesco is quietly building a profile of you, along with every individual in the country - a map of personality, travel habits, shopping preferences and even how charitable and eco-friendly you are. A subsidiary of the supermarket chain has set up a database, called Crucible, that is collating detailed information on every household in the UK, whether they choose to shop at the retailer or not.

The company refuses to reveal the information it holds, yet Tesco is selling access to this database to other big consumer groups, such as Sky, Orange and Gillette. "It contains details of every consumer in the UK at their home address across a range of demographic, socio-economic and lifestyle characteristics," says the marketing blurb of dunnhumby, the Tesco subsidiary in question. It has "added intelligent profiling and targeting" to its data through a software system called Zodiac. This profiling can rank your enthusiasm for promotions, your brand loyalty, whether you are a "creature of habit" and when you prefer to shop. As the blurb puts it: "The list is

(23)

The View From 30,000 feet

• Choicepoint’s press release states they have forgone selling certain consumer information “in selected markets” at a cost of $15 million per year
• Melbourne, Australia: Jane Doe vs. ABC, 3 April 2007. Costs, including tort of invasion of privacy: $234,190
• 2007, University of Pisa, Italy: KDD Laboratory & “K-Anonymity” advancements
• Fall 2006, Purdue University: electrophotographic halftone printer code advances
• New Zealand Court of Appeal, 4 May 2007: Brooker v. Police
• Feb 2007: Portugal adopts “biometric” national ID card provider
• Brussels, Belgium: EU “googles” Google privacy practices
• Roelof Temmingh, South Africa, releases version 1 of “Evolution”
• Canadian company releases 3-D Face Scanner $350
• RCMP buys info from data broker
