Mobility, Data Mining and Privacy


Academic year: 2020


Lessons from the GeoPKDD EU project

Pisa KDD Lab

www-kdd.isti.cnr.it

University of Pisa and ISTI-CNR, Italy


www.geopkdd.eu

Project of the FET programme, from 2005 to 2009

Coordinator:


Location data from mobile phones, i.e. cell positions in the GSM/UMTS network.

Location data from GPS-equipped devices:

❑Next/current generation Nokia mobile phones have an on-board GPS receiver and can transmit GPS tracks by SMS/MMS

Location data from

❑peer-to-peer mobile networks

❑intelligent transportation environments – VANET

❑ad hoc sensor networks, RFIDs (radio-frequency ids)


The GeoPKDD scenario

From the analysis of the traces of our mobile phones it is possible to reconstruct our mobile behaviour, the way we collectively move.

This knowledge may help us improve decision-making in many mobility-related issues:

❑Planning traffic and public mobility systems in metropolitan areas

❑Planning physical communication networks

❑Localizing new services in our towns

❑Forecasting traffic-related phenomena

❑Organizing logistics systems

❑Avoiding repeated mistakes


[Figure: GeoPKDD process diagram. Labels: raw data (GSM network, WSN, GPS), mobility data, mobility patterns, end user (mobility manager), privacy and anonymity protection]


Key questions

How to reconstruct a trajectory from raw logs, and how to store and query trajectory data?

Which spatio-temporal patterns, and how to compute them?
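As an illustration of the first question, here is a minimal sketch of trajectory reconstruction. It assumes that raw log records are (object id, timestamp in seconds, x, y) tuples and that a new trajectory starts whenever two consecutive observations of the same object are further apart in time than a threshold. The record layout, the function name `reconstruct_trajectories` and the 30-minute default are illustrative assumptions, not something specified in these slides.

```python
from collections import defaultdict

def reconstruct_trajectories(records, max_gap_s=1800):
    """Group raw (obj_id, t, x, y) records into per-object trajectories.

    Assumed heuristic: a new trajectory starts when two consecutive
    observations of the same object are more than max_gap_s seconds apart.
    """
    by_object = defaultdict(list)
    for obj_id, t, x, y in records:
        by_object[obj_id].append((t, x, y))

    trajectories = []
    for obj_id, points in by_object.items():
        points.sort()  # order observations by timestamp
        current = [points[0]]
        for prev, point in zip(points, points[1:]):
            if point[0] - prev[0] > max_gap_s:
                trajectories.append((obj_id, current))  # close the trip
                current = []
            current.append(point)
        trajectories.append((obj_id, current))
    return trajectories

# Hypothetical raw log: two vehicles, one of them making two trips.
raw_log = [
    ("car1", 0, 0.0, 0.0),
    ("car1", 60, 0.5, 0.1),
    ("car2", 30, 5.0, 5.0),
    ("car1", 7200, 3.0, 3.0),  # two hours later: treated as a new trip
    ("car1", 7260, 3.2, 3.1),
]
trips = reconstruct_trajectories(raw_log)
```

A real pipeline would probably also filter noisy positions and apply a spatial threshold, which this sketch leaves out.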


A spatio-temporal sequential pattern

A sequence of visited regions, frequently visited in the specified order with similar transition times

Giannotti, Nanni, Pedreschi, Pinelli. Trajectory pattern mining. In Proc. ACM SIGKDD 2007

[30 – 40]
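The definition above can be made concrete with a small sketch. It assumes that a pattern is stored as a list of regions plus one (min, max) transition-time interval per consecutive pair, and that each trajectory has already been translated into a time-ordered list of (region, time in minutes) visits. The function names, the representation and the sample data are illustrative assumptions; this only counts the support of a given pattern and is not the extraction algorithm of the cited paper.

```python
def supports(visits, regions, intervals):
    """Return True if `visits` contains `regions` in order, with each
    transition time falling inside the matching (lo, hi) interval.

    visits    : list of (region, time_in_minutes), sorted by time
    regions   : list of region names
    intervals : list of (lo, hi) pairs, one per consecutive region pair
    """
    def match(start, k, last_time):
        if k == len(regions):
            return True
        for i in range(start, len(visits)):
            region, t = visits[i]
            if region != regions[k]:
                continue
            if k > 0:
                lo, hi = intervals[k - 1]
                if not lo <= t - last_time <= hi:
                    continue
            if match(i + 1, k + 1, t):
                return True
        return False

    return match(0, 0, None)

def support_count(dataset, regions, intervals):
    """Number of trajectories in the dataset that support the pattern."""
    return sum(supports(v, regions, intervals) for v in dataset)

# Hypothetical dataset of three region-level trajectories.
dataset = [
    [("Station", 0), ("Centre", 15), ("Museum", 45)],
    [("Station", 0), ("Centre", 50), ("Museum", 60)],
    [("Centre", 0), ("Station", 12)],
]
pattern_regions = ["Station", "Centre", "Museum"]
pattern_intervals = [(10, 20), (20, 35)]
count = support_count(dataset, pattern_regions, pattern_intervals)
```

Under this reading, a pattern counts as frequent when its support reaches a user-chosen minimum.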


T-Pattern: Extraction Process

Trajectories Dataset

Regions of Interest


[Figure: example T-pattern with transition times ΔT ∈ [10min, 20min], ΔT ∈ [20min, 35min], ΔT ∈ [5min, 10min], ΔT ∈ [25min, 45min]]


Interactive density-based trajectory clustering

Nanni, Pedreschi. Time-focused clustering of trajectories of moving objects. J. of Intelligent Information Systems, 2006

Rinzivillo, Pedreschi, Nanni, Giannotti, Andrienko, Andrienko. Visually-driven analysis of movement data by progressive clustering. J. of Information Visualization, 2008


Cluster 1: from work to home


Cluster 2: from home to work


Mobility data analysis in Milano

■WIND Telecomunicazioni spa (major telecom provider, GeoPKDD partner)

❑GSM data (handover data: aggregated flows between adjacent cells)

■Other collaborations:

❑Comune di Milano, Mobility Agency

❑Infoblu and OctoTelematics (GPS receivers on board cars with a special insurance contract)

Experience on a dataset of

❑2 M positions,

❑17 K vehicles,


T-Patterns


Left: peripheral routes; middle: inward routes; right: outward routes.


Three sample clusters are highlighted

One group (red) goes straight to NW, the others follow


From opportunities to threats

Personal mobility data, as gathered by wireless networks, are extremely sensitive

Their disclosure may be a serious violation of the right to privacy, i.e., the right to keep confidential

the places we visit

the places we live or work at

the people we meet


Making data (reasonably) anonymous is not easy.

Sometimes, it is possible to reconstruct the exact identities from the de-identified data.

■Two main sources of danger:

Many observations on the same “anonymous” subject

Linking data, after joining separate datasets

Many famous examples of re-identification

❑Governor of Massachusetts’ clinical records (Sweeney’s experiment, 2001)

❑America On Line August 2006 crisis: users re-identified from search logs


The Massachusetts Governor and AOL cases

■Linking risks: the Governor of Massachusetts’ clinical records

❑The question arises as to what harm there is in selling "scrubbed" data without patient-identifying information. It is instructive to remember the effort of Latanya Sweeney, director of the Data Privacy Laboratory at Carnegie Mellon University, who was able to pick out the medical records of the Governor of Massachusetts from so-called scrubbed medical information published by the Massachusetts insurance commission, by correlating the data with birth dates, ZIP codes and general information published in the Massachusetts voter-registration rolls. Sweeney indicates that birth date, gender and ZIP code can be used to uniquely identify 87 percent of the U.S. population.

■Repeated anonymous observations: AOL logs

❑In August 2006, America On Line (AOL) posted mildly anonymized data that included 20 million web queries from 650,000 AOL users. They did this mainly to furnish academic researchers with a dataset, but soon realized their error. The dataset was only posted online for a short period of time, but was copied numerous times. As a result, AOL customer identities could be readily reconstructed. In one article, the identity of Georgia retiree Thelma Arnold was discovered by piecing together her clickstream.


Spatio-temporal linkage in Mobility Data

By intersecting the phone directories of locations A and B we find that only one individual lives in A and works in B.

Id:34567 = Prof. Smith

Then you discover that on Saturday night Id:34567 usually drives to the city's red-light district…

[Figure: trajectories of Id: 34567 between locations A and B, almost every day Mon-Fri between 7:45 and 8:15, and almost every day Mon-Fri between 17:45 and 18:15]
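The linkage step on this slide amounts to a set intersection. The sketch below assumes the two phone directories are available as plain sets of names; every name other than "Prof. Smith" and the helper `link_identity` are made up for illustration.

```python
# Hypothetical directories: residents of location A, employees at location B.
home_directory_A = {"Prof. Smith", "Ms. Rossi", "Mr. Bianchi"}
work_directory_B = {"Prof. Smith", "Dr. Verdi"}

def link_identity(lives_in, works_in):
    """Return the only person present in both directories, or None
    when the intersection is empty or ambiguous."""
    candidates = lives_in & works_in
    return next(iter(candidates)) if len(candidates) == 1 else None

# The pseudonymous trajectory Id 34567 commutes from A to B, so a unique
# match re-identifies it.
identity = link_identity(home_directory_A, work_directory_B)
```

When the intersection contains several people the attack yields only a candidate set, which is the intuition behind requiring a minimum group size in anonymization.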


Trajectory anonymization

Construct an anonymized version of a trajectory dataset, preserving some target analytical properties. Example: Never Walk Alone

Basic ideas:

❑Trade uncertainty for anonymity: trajectories that are close up to the uncertainty threshold are indistinguishable

Two steps:

❑Cluster trajectories into groups of k similar ones (removing outliers)

❑Perturb trajectories in a cluster so that each one is close to each other up to the uncertainty threshold

Bonchi, Abul, Nanni. Never Walk Alone: Uncertainty for Anonymity in Moving
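The two steps can be illustrated with a deliberately simplified sketch. It assumes trajectories are equal-length lists of (x, y) points sampled at the same timestamps, uses a greedy grouping with a hypothetical `max_radius` outlier cutoff, and perturbs by pulling points towards the group centroid. These choices and all function names are assumptions made for illustration; this is not the authors' algorithm.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def traj_dist(a, b):
    """Largest point-wise distance between two time-aligned trajectories."""
    return max(dist(p, q) for p, q in zip(a, b))

def greedy_clusters(trajs, k, max_radius):
    """Step 1: greedily form groups of k similar trajectories (indices).
    A pivot whose (k-1)-th nearest neighbour is too far is dropped."""
    remaining = list(range(len(trajs)))
    clusters = []
    while len(remaining) >= k:
        pivot = remaining[0]
        nearest = sorted(remaining[1:],
                         key=lambda i: traj_dist(trajs[pivot], trajs[i]))[:k - 1]
        if nearest and traj_dist(trajs[pivot], trajs[nearest[-1]]) > max_radius:
            remaining.remove(pivot)  # outlier: suppress it
            continue
        group = [pivot] + nearest
        clusters.append(group)
        remaining = [i for i in remaining if i not in group]
    return clusters

def perturb(group, delta):
    """Step 2: pull every point to within delta/2 of the group's centroid
    at that timestamp, so any two trajectories end up within delta."""
    result = [list(t) for t in group]
    for j in range(len(group[0])):
        cx = sum(t[j][0] for t in group) / len(group)
        cy = sum(t[j][1] for t in group) / len(group)
        for t in result:
            d = dist(t[j], (cx, cy))
            if d > delta / 2:
                s = (delta / 2) / d
                t[j] = (cx + (t[j][0] - cx) * s, cy + (t[j][1] - cy) * s)
    return result

def is_anonymity_set(trajs, k, delta, eps=1e-9):
    """At least k trajectories, pairwise within delta at every timestamp."""
    return len(trajs) >= k and all(
        traj_dist(a, b) <= delta + eps for a in trajs for b in trajs)

trajs = [
    [(0, 0), (1, 0), (2, 0)],
    [(0, 1), (1, 1), (2, 1)],
    [(0, 2), (1, 2), (2, 2)],
    [(50, 50), (51, 50), (52, 50)],  # far from everything: left out
]
clusters = greedy_clusters(trajs, k=3, max_radius=5.0)
original = [trajs[i] for i in clusters[0]]
anonymized = perturb(original, delta=1.0)
```

In this toy run the three nearby trajectories form one group, and after perturbation no two of them are more than `delta` apart at any timestamp, which is the indistinguishability condition described above.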


(K,δ)-anonymity set

■K = minimum number of trajectories in the set


While mobility data flood us …

… mobility data mining is emerging as an exciting new field

GeoPKDD.eu is in the mix, shaping up the area

We have only begun to scratch the surface of this


RECENT RESEARCH ACTIVITY


OUTLINE

Activity in GeoPKDD

The Daedalus System: joint work with R. Trasarti, F. Giannotti (KDDLab)

The Athena System: joint work with Jose Macedo, M. Wachowicz (WUR NL), R. Trasarti, M. Baglioni

Current Ongoing Work

Privacy Semantic Trajectories: joint work with V. Bogorny (UFSC), R. Trasarti and Anna Monreale (KDDLab)

Inferring activities from stops: joint work with Laura Spinsanti (EPFL, JRC)

Semantic Enriched Trajectory patterns: joint work with M. Wachowicz (WUR NL), Rebecca Ong and Mirco Nanni (KDDLab)


GEOPKDD ACTIVITY

Daedalus System


IDEA: a uniform language and system to specify the KDD process

DAEDALUS provides a Data Mining Query Language based on SQL that includes basic mechanisms for supporting the GeoPKDD process


RECENT ACTIVITY

Athena System

Joint work with Jose de Macedo, Roberto Trasarti, Miriam Baglioni and Monica Wachowicz


The Athena System: the need for semantics and reasoning

Which are the tourist activities?

[Figure: T-pattern with transition times ΔT ∈ [10min, 20min], ΔT ∈ [20min, 35min], ΔT ∈ [5min, 10min], ΔT ∈ [25min, 45min]]

Miriam Baglioni, José Antônio Fernandes de Macêdo, Chiara Renso, Roberto Trasarti, Monica Wachowicz: Towards Semantic Interpretation of Movement Behavior. AGILE Conf. 2009: 271-288


Tourist trajectories stop in Accommodation Places and in Tourist Places!

Hotel

University


THE SEMANTIC ENRICHMENT PROCESS

Data Mining

The ontology states that tourist trajectories/patterns stop in accommodation places and tourist places.

Automatically infers the semantics of trajectories and patterns according to ontology definitions
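The ontology rule above can be approximated in a few lines. In this sketch a plain Python rule stands in for the ontology reasoner, each trajectory is reduced to the list of place categories where it stops, and a trajectory is tagged as tourist when it has at least one stop in an accommodation place and at least one in a tourist place. The category sets, the function name and that reading of the rule are assumptions for illustration.

```python
# Hypothetical place categories; in the real system these would come from
# the ontology rather than from hard-coded sets.
ACCOMMODATION_PLACES = {"Hotel", "Hostel"}
TOURIST_PLACES = {"Museum", "Monument"}

def is_tourist_trajectory(stop_categories):
    """Apply the rule: stops in an accommodation place AND a tourist place."""
    categories = set(stop_categories)
    return bool(categories & ACCOMMODATION_PLACES) and bool(categories & TOURIST_PLACES)

# Hypothetical trajectories, already annotated with the places they stop at.
stops_by_trajectory = {
    "t1": ["Hotel", "Museum", "Hotel"],
    "t2": ["University", "Hotel"],
    "t3": ["Museum", "University"],
}
labels = {tid: is_tourist_trajectory(stops)
          for tid, stops in stops_by_trajectory.items()}
```

An ontology-based system can express the same condition declaratively and combine it with other concept definitions, which a hard-coded rule like this cannot do.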


CURRENT ACTIVITY

Privacy Semantic Trajectories

Joint work with Anna Monreale, Roberto Trasarti and Vania Bogorny (UFSC)


PRIVACY IN SEMANTIC TRAJECTORIES

Anonymization of semantic trajectories datasets, by A. Monreale, R. Trasarti, C. Renso, D. Pedreschi and Vania Bogorny (UFSC, Brazil).

User#1: <Office, Bank, Hospital>

User#2: <Shop, Hospital, Restaurant>

How to protect semantic trajectory datasets against privacy attacks?


CURRENT ACTIVITY

Inferring activity during the stops

Joint work with (1) Laura Spinsanti (EPFL, JRC) and (2) Vania Bogorny (UFSC)


INFERRING ACTIVITY OF PEOPLE FROM THEIR MOVEMENT

Where you stop is who you are: understanding people’s activity by places visited, by C. Renso, Fabrizio Celli and Laura Spinsanti (EPFL and JRC).

Looking inside the stop, by C. Renso, Vania Bogorny (UFSC) and Valeria Times (UFPE, Brazil).


Goal: Inferring the probable activity of people while stopping:

◆ park the car and proceed walking without GPS (no tracks)

◆ infer the activity based on the micro movements during the stops (shopping in a shopping mall etc)
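For the first case (the car is parked and the person walks on without GPS), one possible way to rank probable activities is sketched below. It assumes a stop position and a list of points of interest (POIs) with an activity category, in metric coordinates, and scores each category by the closeness of its POIs within a walking radius. The scoring formula, the 300 m radius and all names are hypothetical; this is not the method of the papers cited on this slide.

```python
import math
from collections import defaultdict

def probable_activities(stop, pois, walk_radius=300.0):
    """Score activity categories by closeness of nearby POIs to a stop.

    stop : (x, y) in metres
    pois : list of (x, y, category)
    Returns {category: probability} over POIs within walk_radius,
    or an empty dict when no POI is reachable.
    """
    scores = defaultdict(float)
    for x, y, category in pois:
        d = math.hypot(x - stop[0], y - stop[1])
        if d <= walk_radius:
            scores[category] += 1.0 / (1.0 + d)  # closer POIs weigh more
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else {}

# Hypothetical stop and POIs.
stop = (0.0, 0.0)
pois = [
    (10.0, 0.0, "shopping"),
    (200.0, 0.0, "restaurant"),
    (1000.0, 0.0, "museum"),  # beyond walking distance: ignored
]
activities = probable_activities(stop, pois)
most_likely = max(activities, key=activities.get)
```

Opening hours, stop duration and time of day would be natural extra evidence, but they are left out to keep the sketch short.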


PRACTICE!

Choose one of the papers related to the GeoPKDD project, read it, and write a brief review that summarizes the contribution and highlights three strong points and three weak points of the paper.


SOME RESEARCH PAPERS FROM GEOPKDD

Clustering

1. Rinzivillo, Pedreschi, Nanni, Giannotti, Andrienko, Andrienko: Visually-driven analysis of movement data by progressive clustering. J. of Information Visualization, 2008

2. Mirco Nanni, Dino Pedreschi: Time-focused clustering of trajectories of moving objects. J. Intell. Inf. Syst. 27(3): 267-289 (2006)

Daedalus

1. Riccardo Ortale, Ettore Ritacco, Nikos Pelekis, Roberto Trasarti, Gianni Costa, Fosca Giannotti, Giuseppe Manco, Chiara Renso, Yannis Theodoridis: The DAEDALUS framework: progressive querying and mining of movement data. GIS 2008: 52

Athena

1. Miriam Baglioni, José Antônio Fernandes de Macêdo, Chiara Renso, Roberto Trasarti, Monica Wachowicz: Towards Semantic Interpretation of Movement Behavior. AGILE Conf. 2009: 271-288

T-Pattern

1. Fosca Giannotti, Mirco Nanni, Fabio Pinelli, Dino Pedreschi: Trajectory pattern


THANKS! OBRIGADA!


Jose Tonho de Macedo for inviting me here,


… my colleagues from KDDLab in Pisa for their help with the course and their great work in the GeoPKDD project,


Last but not least… to all of you for attending the lessons with such interest!

For any questions, you can write me an email: [email protected]
