Mobility, Data Mining and Privacy
Lessons from the GeoPKDD EU project
Pisa KDD Lab
www-kdd.isti.cnr.it
University of Pisa and ISTI-CNR, Italy
www.geopkdd.eu
A project of the FET programme, from 2005 to 2009
Coordinator:
■ Location data from mobile phones, i.e. cell positions in the GSM/UMTS network.
■ Location data from GPS-equipped devices
❑ Current and next-generation Nokia mobile phones have an on-board GPS receiver and can transmit GPS tracks by SMS/MMS
■ Location data from
❑ peer-to-peer mobile networks
❑ intelligent transportation environments (VANET)
❑ ad hoc sensor networks, RFIDs (radio-frequency IDs)
The GeoPKDD scenario
From the analysis of the traces of our mobile phones it is possible to reconstruct our mobile behaviour, the way we collectively move.
This knowledge may help us improve decision-making in many mobility-related issues:
❑ Planning traffic and public mobility systems in metropolitan areas
❑ Planning physical communication networks
❑ Localizing new services in our towns
❑ Forecasting traffic-related phenomena
❑ Organizing logistics systems
❑ Avoiding repeated mistakes
[Figure labels: Mobility Data; Raw data; Mobility Patterns; GSM network, WSN, GPS; End user; Mobility manager]
Privacy and anonymity protection
Key questions
■ How to reconstruct a trajectory from raw logs, and how to store and query trajectory data?
■ Which spatio-temporal patterns, and how to compute them?
…. A spatio-temporal sequential pattern
A sequence of visited regions, frequently visited in the
specified order with similar transition times
Giannotti, Nanni, Pedreschi, Pinelli. Trajectory pattern mining. In Proc.
ACM SIGKDD 2007
[30 – 40]
T-Pattern: Extraction Process
[Figure: Trajectories Dataset; Regions of Interest; transitions labelled ΔT ∈ [10min, 20min], ΔT ∈ [20min, 35min], ΔT ∈ [5min, 10min], ΔT ∈ [25min, 45min]]
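A minimal sketch of what it means for a trajectory to support a T-pattern, i.e. to visit the pattern's regions in order with each transition time inside its ΔT interval. It assumes each trajectory has already been reduced to a time-ordered list of (region, entry time in minutes) visits. The names `TPattern`, `supports` and `support_count`, the region names and the toy data are illustrative assumptions, not the extraction algorithm of the cited paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Visit = Tuple[str, float]  # (region id, entry time in minutes); assumed format


@dataclass
class TPattern:
    regions: List[str]                       # regions, in the required order
    transitions: List[Tuple[float, float]]   # [min, max] minutes between consecutive regions


def supports(trajectory: List[Visit], pattern: TPattern) -> bool:
    """True if the visits contain the pattern's regions in order,
    with every transition time inside the corresponding interval."""

    def match(start: int, idx: int, prev_time: float) -> bool:
        if idx == len(pattern.regions):
            return True  # every region of the pattern has been matched
        for i in range(start, len(trajectory)):
            region, t = trajectory[i]
            if region != pattern.regions[idx]:
                continue
            if idx > 0:
                lo, hi = pattern.transitions[idx - 1]
                if not (lo <= t - prev_time <= hi):
                    continue
            # Backtracking: try this visit, fall back to later ones if it fails.
            if match(i + 1, idx + 1, t):
                return True
        return False

    return match(0, 0, 0.0)


def support_count(dataset: List[List[Visit]], pattern: TPattern) -> int:
    """Number of trajectories in the dataset that support the pattern."""
    return sum(supports(t, pattern) for t in dataset)


# Toy example (invented data): Station -> Centre in 10-20 min, Centre -> Museum in 20-35 min.
pattern = TPattern(["Station", "Centre", "Museum"], [(10, 20), (20, 35)])
on_time = [("Station", 0), ("Centre", 15), ("Museum", 45)]
too_slow = [("Station", 0), ("Centre", 30), ("Museum", 60)]
```

A pattern would be called frequent when `support_count` reaches a user-chosen minimum support threshold.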
Interactive density-based trajectory clustering
■ Nanni, Pedreschi. Time-focused clustering of trajectories of moving objects. J. of Intelligent Information Systems, 2006
■ Rinzivillo, Pedreschi, Nanni, Giannotti, Andrienko, Andrienko. Visually-driven analysis of movement data by progressive clustering. J. of Information Visualization, 2008
Cluster 1: from work to home
Cluster 2: from home to work
Mobility data analysis in Milano
■ WIND Telecomunicazioni spa (major telecom provider, GeoPKDD partner)
❑ GSM data (handover data: aggregated flows between adjacent cells)
■ Other collaborations:
❑ Comune di Milano, Mobility Agency
❑ Infoblu and OctoTelematics (GPS receivers on board cars with a special insurance contract)
❑ Experience on a dataset of
❑ 2 M positions,
❑ 17 K vehicles,
T-Patterns
Left: peripheral routes; middle: inward routes; right: outward routes.
● Three sample clusters are highlighted
● One group (red) goes straight to NW, the others follow
From opportunities to threats
■ Personal mobility data, as gathered by wireless networks, are extremely sensitive
■ Their disclosure may be a serious violation of the right to privacy, i.e., the right to keep confidential
❑ the places we visit
❑ the places we live or work at
❑ the people we meet
❑ …
■ Making data (reasonably) anonymous is not easy.
■ Sometimes it is possible to reconstruct exact identities from de-identified data.
■ Two main sources of danger:
❑ Many observations on the same “anonymous” subject
❑ Linking data, after joining separate datasets
■ Many famous examples of re-identification
❑ Governor of Massachusetts’ clinical records (Sweeney’s experiment, 2001)
❑ America On Line August 2006 crisis: users re-identified from search logs
The Massachusetts Governor and AOL cases
■ Linking risks: the Governor of Massachusetts’ clinical records
❑ The question arises as to what is the harm in the selling of "scrubbed" data without patient identifying information? It is instructive to remember the effort of Latanya Sweeney, director of the Data Privacy Laboratory at Carnegie Mellon University, who was able to pick out the medical records of the Governor of Massachusetts from so-called scrubbed medical information published by the Massachusetts insurance commission by correlating the data with birthdays, ZIP codes and general information published in the Massachusetts voter-registration rolls. Ms Sweeney indicates that birth date, gender and ZIP code can be used to uniquely identify 87 percent of the U.S. population.
■ Repeated anonymous observations: AOL logs
❑ In August of 2006, America On Line (AOL) posted mildly anonymized data that included 20 million web queries from 650,000 AOL users. They did this mainly to furnish academic researchers with a dataset, but soon realized the error of their ways. The dataset was only posted online for a short period of time, but was copied numerous times. As a result, AOL customer identities could be readily reconstructed. In one article, the identity of Georgia retiree Thelma Arnold was uncovered by piecing together her clickstream.
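The linking risk in the Governor's case can be sketched as a join on quasi-identifiers between a "scrubbed" dataset and a public one. The records below are invented toy data, and the field names (`birth_date`, `gender`, `zip`, `name`, `diagnosis`) and the function `link` are assumptions chosen to mirror the attributes mentioned above, not the procedure actually used by Sweeney.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

QUASI_IDENTIFIERS = ("birth_date", "gender", "zip")  # assumed field names


def link(scrubbed: List[Dict[str, str]],
         public: List[Dict[str, str]],
         keys: Tuple[str, ...] = QUASI_IDENTIFIERS) -> List[Tuple[str, Dict[str, str]]]:
    """Return (name, scrubbed record) pairs for every scrubbed record whose
    quasi-identifier combination matches exactly one public record."""
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person)

    matches = []
    for record in scrubbed:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], record))
    return matches


# Invented toy data: no names in the medical records, names in the voter roll.
medical = [
    {"birth_date": "1950-01-01", "gender": "M", "zip": "00001", "diagnosis": "X"},
    {"birth_date": "1960-02-02", "gender": "F", "zip": "00002", "diagnosis": "Y"},
]
voters = [
    {"name": "Voter A", "birth_date": "1950-01-01", "gender": "M", "zip": "00001"},
    {"name": "Voter B", "birth_date": "1960-02-02", "gender": "F", "zip": "00002"},
    {"name": "Voter C", "birth_date": "1960-02-02", "gender": "F", "zip": "00002"},
]
reidentified = link(medical, voters)
```

In this toy data only the first medical record is re-identified: the second shares its quasi-identifiers with two voters, so it stays ambiguous.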
Spatio-temporal linkage in Mobility Data
■ By intersecting the phone directories of locations A and B we find that only one individual lives in A and works in B.
■ Id:34567 = Prof. Smith
■ Then you discover that on Saturday night Id:34567 usually drives to the city's red-light district…
[Figure: trajectory of Id: 34567 between locations A and B, observed almost every day Mon-Fri between 7:45 – 8:15 and almost every day Mon-Fri between 17:45 – 18:15]
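The directory intersection used in this attack can be sketched in a few lines. The directory contents, the names other than Prof. Smith, and the helper `reidentify` are invented for illustration. The sketch also assumes the attacker has already inferred from the regular weekday trips that Id:34567 lives in A and works in B.

```python
from typing import Optional, Set


def reidentify(home_directory: Set[str], work_directory: Set[str]) -> Optional[str]:
    """Return the single name listed in both directories, or None when the
    intersection is empty or ambiguous (more than one candidate)."""
    candidates = home_directory & work_directory
    if len(candidates) == 1:
        return next(iter(candidates))
    return None


# Invented toy directories for locations A (home) and B (work).
directory_a = {"Prof. Smith", "Ms. Jones", "Mr. Brown"}
directory_b = {"Prof. Smith", "Dr. Lee"}

# Pseudonym 34567 was observed commuting between A and B on weekdays.
identities = {34567: reidentify(directory_a, directory_b)}
```

When the intersection holds several names the pseudonym stays ambiguous, which is the intuition behind hiding each individual in a group.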
Trajectory anonymization
Construct an anonymized version of a trajectory dataset, preserving some target analytical properties. Example: Never Walk Alone
■ Basic idea:
❑ Trade uncertainty for anonymity: trajectories that are close, up to the uncertainty threshold, are indistinguishable
■ Two steps:
❑ Cluster trajectories into groups of k similar ones (removing outliers)
❑ Perturb the trajectories in a cluster so that each one is close to every other, up to the uncertainty threshold
Bonchi, Abul, Nanni. Never Walk Alone: Uncertainty for Anonymity in Moving
(K,δ)-anonymity set
■ K = minimum number of trajectories in the set
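A minimal sketch of a check for such a set. It assumes that δ plays the role of the uncertainty threshold mentioned above, that all trajectories are sampled at the same timestamps, and that "close" means a Euclidean distance of at most δ at every timestamp. The function names are illustrative; this is only the membership test, not the clustering and perturbation steps of Never Walk Alone.

```python
import math
from itertools import combinations
from typing import List, Tuple

Trajectory = List[Tuple[float, float]]  # (x, y) positions at shared timestamps (assumption)


def co_localized(a: Trajectory, b: Trajectory, delta: float) -> bool:
    """True if the two trajectories stay within delta of each other at every timestamp."""
    if len(a) != len(b):
        raise ValueError("trajectories must be sampled at the same timestamps")
    return all(math.dist(p, q) <= delta for p, q in zip(a, b))


def is_anonymity_set(trajectories: List[Trajectory], k: int, delta: float) -> bool:
    """True if there are at least k trajectories and every pair is co-localized."""
    if len(trajectories) < k:
        return False
    return all(co_localized(a, b, delta) for a, b in combinations(trajectories, 2))


# Invented toy data: three parallel trajectories along the x axis.
t1 = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
t2 = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)]
t3 = [(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)]
```

With δ = 1.0, `t1` and `t2` form a valid set for K = 2, while adding `t3` breaks the set because it lies 3.0 away from `t1`.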
While mobility data flood us …
■ … mobility data mining is emerging as an exciting new field
■ GeoPKDD.eu is in the mix, shaping up the area
❑ We have only begun to scratch the surface of this
RECENT RESEARCH ACTIVITY
OUTLINE
Activity in GeoPKDD
⚫ The Daedalus System – Joint work with R. Trasarti, F. Giannotti (KDDLab)
⚫ The Athena system – Joint work with Jose Macedo, M. Wachowicz (WUR NL), R. Trasarti, M. Baglioni
Current Ongoing Work
⚫ Privacy Semantic Trajectories – Joint work with V. Bogorny (UFSC), R. Trasarti and Anna Monreale (KDDLab)
⚫ Inferring activities from stops – Joint work with Laura Spinsanti (EPFL, JRC)
⚫ Semantic Enriched Trajectory patterns – Joint work with M. Wachowicz (WUR NL), Rebecca Ong and Mirco Nanni (KDDLab)
GEOPKDD ACTIVITY
Daedalus System
IDEA: a uniform language and system to specify the KDD process
DAEDALUS provides a Data Mining Query Language based on SQL that includes basic mechanisms for supporting the GeoPKDD process
RECENT ACTIVITY
Athena System
Joint work with Jose de Macedo, Roberto Trasarti, Miriam Baglioni and Monica Wachowicz
The Athena System: the need for semantics and
reasoning
Which are the tourist activities?
[Figure: pattern with transition times ΔT ∈ [10min, 20min], ΔT ∈ [20min, 35min], ΔT ∈ [5min, 10min], ΔT ∈ [25min, 45min]]
Miriam Baglioni, José Antônio Fernandes de Macêdo, Chiara Renso, Roberto Trasarti, Monica Wachowicz: Towards Semantic Interpretation of Movement Behavior. AGILE Conf. 2009: 271-288
Tourist trajectories stop in Accommodation Places and in Tourist Places!
Hotel
University
THE SEMANTIC ENRICHMENT PROCESS
Data Mining
The ontology states that tourist trajectories/patterns stop in accommodation places and tourist places.
The semantics of trajectories and patterns are inferred automatically according to the ontology definitions.
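The tourist rule can be illustrated with a hard-coded check that stands in for ontology reasoning. This is only a sketch: Athena's actual ontology and reasoner are not shown here, and the place-to-category table, the category names and the function names are assumptions made for the example.

```python
from typing import Dict, List, Set

# Hypothetical mapping from a stop's place to an ontology-like category.
PLACE_CATEGORY: Dict[str, str] = {
    "Hotel": "AccommodationPlace",
    "Museum": "TouristPlace",
    "University": "EducationPlace",
}


def stop_categories(stops: List[str]) -> Set[str]:
    """Categories of the places where the trajectory stops ('Unknown' if unmapped)."""
    return {PLACE_CATEGORY.get(place, "Unknown") for place in stops}


def is_tourist_trajectory(stops: List[str]) -> bool:
    """Rule from the slide: a tourist trajectory stops in at least one
    accommodation place and at least one tourist place."""
    categories = stop_categories(stops)
    return "AccommodationPlace" in categories and "TouristPlace" in categories


# Invented toy trajectories, each given as its sequence of stops.
visitor = ["Hotel", "Museum", "Hotel"]
lecturer = ["Hotel", "University"]
```

An ontology reasoner generalizes this idea: the rule is stated once as a concept definition, and membership is inferred for trajectories and patterns alike.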
CURRENT ACTIVITY
Privacy Semantic Trajectories
Joint work with Anna Monreale, Roberto Trasarti and Vania Bogorny (UFSC)
PRIVACY IN SEMANTIC TRAJECTORIES
Anonymization of semantic trajectories datasets, by A. Monreale, R. Trasarti, C. Renso, D. Pedreschi and Vania Bogorny (UFSC, Brazil).
User#1: <Office, Bank, Hospital>
User#2: <Shop, Hospital, Restaurant>
How to protect semantic trajectory datasets against privacy attacks?
CURRENT ACTIVITY
Inferring activity during the stops
Joint work with (1) Laura Spinsanti EPFL JRC and (2) Vania Bogorny UFSC
INFERRING ACTIVITY OF PEOPLE FROM THEIR MOVEMENT
Where you stop is who you are: understanding people’s activity by places visited, by C. Renso, Fabrizio Celli and Laura Spinsanti (EPFL and JRC).
Looking inside the stop, by C. Renso, Vania Bogorny (UFSC) and Valeria Times (UFPE, Brazil).
Goal: inferring the probable activity of people while stopping:
◆ park the car and proceed walking without GPS (no tracks)
◆ infer the activity based on the micro movements during the stops (shopping in a shopping mall, etc.)
PRACTICE!
Choose one of the papers related to the GeoPKDD project, read it, and write a brief review that summarizes its contribution and highlights three strong points and three weak points of the paper.
SOME RESEARCH PAPERS FROM GEOPKDD
Clustering
1. Rinzivillo, Pedreschi, Nanni, Giannotti, Andrienko, Andrienko: Visually-driven analysis of movement data by progressive clustering. J. of Information Visualization, 2008
2. Mirco Nanni, Dino Pedreschi: Time-focused clustering of trajectories of moving objects. J. Intell. Inf. Syst. 27(3): 267-289 (2006)
Daedalus
1. Riccardo Ortale, Ettore Ritacco, Nikos Pelekis, Roberto Trasarti, Gianni Costa, Fosca Giannotti, Giuseppe Manco, Chiara Renso, Yannis Theodoridis: The DAEDALUS framework: progressive querying and mining of movement data. GIS 2008: 52
Athena
1. Miriam Baglioni, José Antônio Fernandes de Macêdo, Chiara Renso, Roberto Trasarti, Monica Wachowicz: Towards Semantic Interpretation of Movement Behavior. AGILE Conf. 2009: 271-288
T-Pattern
1. Fosca Giannotti, Mirco Nanni, Fabio Pinelli, Dino Pedreschi: Trajectory pattern
THANKS! OBRIGADA!
Jose Tonho de Macedo for inviting me here,
… my colleagues from KDDLab in Pisa for their help with the course and their great work in the GeoPKDD project,
Last but not least… to all of you for attending the lessons with so much interest!
For any questions you can write me an email: [email protected]