Tutorial on Geographic and Spatial Data Mining
Michael May
15th Italian Symposium on Advanced Database Systems - SEBD’07 Torre Canne, Italy
June 17th
Fraunhofer Society
Joseph von Fraunhofer, German physicist and
entrepreneur
Fraunhofer mission:
- do state-of-the-art research and use it in challenging customer projects
- Funding is 33% research grants, 33% customer projects, 33% institutional funding
57 institutes, 40 locations, 12,000 employees, €1 billion annual volume
Fraunhofer IAIS: Intelligent Analysis and Information Systems
„From sensor data to business intelligence, from media analysis to visual information systems: Our research allows companies to do more with data“
New name, long-standing experience
- Founded in 2006 as a merger of the Fraunhofer institutes AIS and IMK
230 people: scientists, project engineers, technical and administrative staff
Located on Fraunhofer Campus Schloss Birlinghoven/Bonn
Joint research groups and cooperation with
Univ. Bonn
Fraunhofer IAIS: research and projects
Core research areas: Machine learning and adaptive systems
Data Mining and Business Intelligence
Automated media analysis
Interactive access and exploration
Objectives
Although it covers statistical concepts, algorithms and data structures, the tutorial has a practical, application-oriented focus
Integration of various technologies and algorithms: how do they combine? Covers a broad range
I do not assume familiarity with spatial concepts, but I do assume some basic familiarity with data mining approaches
Three Objectives:
- to stimulate research on spatial data mining related issues
- to stimulate development of more efficient spatial databases tailored for data mining applications
- to stimulate real-world applications
A main message
Spatial Data Mining is not an esoteric research topic; it is a practically and commercially very important, and sometimes business-critical, field!
Later I give an example where the value of several dozen companies directly depends on the predictions given by our spatial data mining algorithms.
Spatial vs. Geographic Data Mining
Geographic Data is data related to the earth
Spatial Data Mining deals with physical space in general, from molecular to astronomical level
Geographic Data Mining is a subset of Spatial Data Mining
Almost all geographic data mining algorithms can work in a general spatial setting (with the same dimensionality)
This tutorial focuses on geographic data in 2D, but most algorithms work on spatial data in general
I do not talk about the specifics of molecular data, face detection, etc.
Agenda
Introduction – Spatial and Geographic Data Mining
Part I: Basic Concepts – Spatial Databases and GIS
•Spatial Data Types
•Spatial Queries
•Construction of Complex Features
Part II: Exploratory Analysis of Spatial Data
Part III: Spatial and Geographic Data Mining Methods
•Autocorrelation
•Mining Point Data – Clustering, Kriging
•Mining Points, Lines, Areas – Clustering, Subgroup Discovery, Association Rules
•Mining Networks – A practical case study
•Mining Tracks in Space and Time – Mining from GPS-Data
Challenges
Summary
Introduction – Spatial Data Mining
A classical example of spatial analysis
Dr. John Snow
Investigating the causes of a cholera epidemic, London, September 1854
A good representation is often the key to solving a problem
Disease cluster
A good representation because it ...
Represents the spatial relation of objects of the same type
Represents the spatial relation of objects to other objects
It is not only important where a cluster is, but also what else is there (e.g. a water pump)!
Shows only relevant aspects and hides irrelevant ones
Goals of Spatial Data Mining
Identifying spatial patterns
Identifying spatial objects that are potential generators of patterns
Identifying information relevant for explaining the spatial pattern (and hiding irrelevant information)
Presenting the information in a way
Spatial Data Mining
Data Mining + Geographic Information Systems = Spatial Mining
Basic Concepts
Spatial Databases and GIS
Commercial
Where to build a new supermarket?
Where are the customers that want to buy new product X?
How many cars pass the main road per hour?
Does it pay to install new antennas?
What percentage of young females sees a billboard located in Ripley Avenue?
Public Sector
Are there clusters of a certain disease?
Is there a relationship between poverty and death rate?
Are there crime hot spots or patterns?
Buildings
Rivers
Streets
Schools
Hospitals
Factory
Attribute Data: Persons per household, No. of cars, Long-term illness, Age, Profession, Ethnic group, Unemployment, Education, Migrants, Medical establishment, Shopping areas
Elements of a spatial database
Spatial Operators (e.g. INSIDE)
Spatial Data Types
Spatial Indexes
Spatial Query Language
Metadata
Examples from Oracle Spatial:
SELECT c.holding_company, c.location
FROM competitor c, bank b
WHERE b.site_id = 1604
AND SDO_WITHIN_DISTANCE(c.location, b.location, 'distance=2 unit=mile') = 'TRUE'
Spatial Datatypes
Two basic types of representation: Fields and Discrete Objects
Fields: Raster Data
Discrete Objects: Vector Data Model (figure labels: Area, Line)
Vector Data: Data Structure
Ordered sets of xy-coordinates defining points, lines, or polygons
3D or 4D also possible
Straight lines between points
Data Structure:
Point: (5,10)
Line (Polyline): ((5,10),(9,16),(12,17))
Area (Polygon): ((5,10),(9,16),(12,17), …), draw line from last to first coordinate
Easy to scale (linear transformation)
Storage efficient
Relationships between objects (e.g. overlap) are not explicitly represented
Aka „Spaghetti Model“
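The vector data structure described above can be sketched in a few lines of Python. This is only an illustration: the coordinates are the ones shown on the slide, while the helper names (polyline_length, polygon_area, scale) are hypothetical and not taken from any GIS product.

```python
from math import hypot

# Coordinates from the slide; a polygon ring is closed implicitly.
point = (5, 10)
polyline = [(5, 10), (9, 16), (12, 17)]
polygon = [(5, 10), (9, 16), (12, 17)]


def polyline_length(coords):
    """Sum of the straight segment lengths between consecutive vertices."""
    return sum(hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(coords, coords[1:]))


def polygon_area(ring):
    """Shoelace formula; the edge from the last vertex to the first is implied."""
    closed = ring + ring[:1]
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(closed, closed[1:]))
    return abs(twice_area) / 2.0


def scale(coords, factor):
    """Scaling is a linear transformation applied to every vertex."""
    return [(x * factor, y * factor) for x, y in coords]
```

Note that nothing in these lists says whether two objects overlap; such relationships have to be computed from the coordinates.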
Two Main Types of Vector Data
- non-regular tessellations: closed polylines that partition the space
- discrete isolated objects: point, line, area (polygon)
Tessellations are very useful for aggregation of discrete objects and for feature extraction
UK, Greater Manchester, Stockport
Layers: Buildings, Hospitals, Rivers, Streets, Schools, Factory
Descriptions of objects are organized in relations (database tables). Each row in a table describes one object.
Different categories of objects are organized in separate relations, each having its own set of attributes.

Buildings (Geometry, Address, Type, …)
ID  Geometry         Address             Type
1   (1,1),(2,2),…    Gladstone Street 5  1
2   (3,3),(4,4),…    Islington Road 2    2
3   (5,5),(6,6),…    Ripley Avenue 23    1

Hospitals (Geometry, Address, Phone, #Beds, …)
ID  Geometry         Address        Phone
1   (1,1),(2,2),…    Stepping Hill  234567
2   (3,3),(4,4),…    Great Moore    567897

The Rivers, Streets, Schools and Factory layers are drawn as further tables with the columns ID, Geometry, Name, Type.
Hierarchy
Often data are organized in spatial hierarchies, e.g.
Country
State
Zip Area
Voting District Parcel
Hierarchies may overlap
UK census data: County → District1, District2, …, Districtn → Ward1, Ward2, …, Wardn
Representation of data in a spatial database
A set of relations R1,...,Rn such that each relation Ri has a geometry attribute Gi or an identifier Ai such that Ri can be linked (joined) to a relation Rk having a geometry attribute Gk
- Geometry attributes Gi consist of ordered sets of x,y-pairs defining points, lines, or polygons
- Different types of spatial objects are organized in different relations Ri (geographic layers), e.g. streets, rivers, enumeration districts, buildings, and
- each layer can have its own set of attributes A1,..., An and at most one geometry attribute G
This representation does not fit standard data mining approaches well!
This is where the specific research challenge for geographical data mining comes from!
Raster Data
How to represent phenomena conceived as fields?
Divide the world into square cells
Represent fields by assigning attribute values to cells
No variation within cells
Cell value may be average, max, min, sum, central point, …
Represent discrete objects as collections of one or more cells
(Figure: Raster representation. Each color represents a different value of a nominal-scale field. Legend: Mixed conifer, Douglas fir, Oak savannah, Grassland. Longley et al (2001))
Raster and Vector: Comparison
Raster Model
Advantages:
• Simple data structure
• Simple logical and algebraic structures
Disadvantages:
• Large data volumes
• Imprecise geometry
• Expensive transformations of coordinates
• Implicit coordinates
Vector Model
Advantages:
• Specify geometry by coordinates
• Topological relationships
• High geometric accuracy
• Storage efficient
Disadvantages:
• Complex data structure
• Compute intensive logical and algebraic operations
Remember: „Raster is vaster and vector is correcter“
Spatial Queries
Spatial Queries
Problem: The vector data model does not explicitly capture relationships among objects. They have to be inferred using spatial predicates
Spatial predicates evaluate to true or false for given objects
A query returns
the set of objects of which the statement is true; or
using aggregates, the [minimum, maximum, sum, average, …] object(s) of which the statement is true …
Queries are evaluated using a spatial join among different relations (layers)
Here's where database technology and spatial indexing come in to do the job efficiently!
Still, spatial joins can be extremely time consuming!
Spatial Predicates: Egenhofer‘s 9-intersection model
Each object has interior (i), exterior (e) and boundary (b)
This results in a 9-intersection matrix for the relation between two spatial objects A and B
A cell contains a 1 iff the intersection of point sets is non-empty
Three example matrices for objects A and B from the figure (rows and columns ordered b, i, e):
    b i e        b i e        b i e
b   1 0 1    b   1 1 1    b   0 0 1
i   0 0 1    i   1 1 1    i   1 1 1
e   1 1 1    e   1 1 1    e   0 0 1
Spatial Predicates
A inside B, B contains A
A contains B, B inside A
A covered-by B, B covers A
A covers B, B covered-by A
A equals B, B equals A
A overlaps B, B overlaps A
A meets B, B meets A
A disjoint B, B disjoint A
9-intersection model for 2 regions (Egenhofer 1991)
Spatial Queries: Distance
Metric spaces:
→ Symmetry: d(i,j) = d(j,i)
→ Triangle inequality: d(i,k) ≤ d(i,j) + d(j,k)
- Euclidean distance: $d_e(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$
Distance relation between polygons: minimum distance between any 2 points of the polygons
Spatial Queries: Distance and Proximity
Select the nearest neighbor in space
Select all objects within a certain distance
(Figure: point X on Main Street with the distances to Hospital #1 and Hospital #2)
Example: Oracle Spatial
SELECT c.holding_company, c.location
FROM competitor c, bank b
WHERE b.site_id = 1604
AND SDO_WITHIN_DISTANCE(c.location, b.location, 'distance=2 unit=mile') = 'TRUE'
Select all competitors and locations within 2 miles distance from bank with id 1604
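To make the two query types concrete without a database, here is a minimal brute-force sketch in Python. It assumes planar coordinates and Euclidean distance; the site names and coordinates are invented for illustration, and a real spatial database would use a spatial index instead of scanning every object.

```python
from math import hypot


def within_distance(sites, origin, radius):
    """Names of all sites whose Euclidean distance to origin is <= radius."""
    ox, oy = origin
    return [name for name, (x, y) in sites.items()
            if hypot(x - ox, y - oy) <= radius]


def nearest(sites, origin):
    """Name of the site closest to origin (linear scan, no index)."""
    ox, oy = origin
    return min(sites,
               key=lambda name: hypot(sites[name][0] - ox, sites[name][1] - oy))


# Hypothetical competitor locations and a bank at the origin.
competitors = {"A": (1.0, 1.0), "B": (3.0, 0.0), "C": (0.5, -1.5)}
bank = (0.0, 0.0)
```

With these values, within_distance(competitors, bank, 2.0) keeps A and C, and nearest(competitors, bank) is A.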
Distance – non-metric
Non-metric spaces:
→ Asymmetry: d(i,j) ≠ d(j,i)
→ Triangle inequality does not hold
Examples: drive time, driving distance
Stockport Database Schema
(Schema figure: the enumeration districts (ED) are linked to the census tables TAB01, TAB61, TAB95, ... by standard joins on zone_id, and to geographical layers such as Water, River, Building, Street, Shopping Region and Vegetation by spatial joins labelled "spatially interacts" or "inside")
Attribute data: 95 tables with census data, ~8000 attributes
Geographical layers: 85 tables
Spatial hierarchy:
• County
• District
• Wards
• Enumeration district
Relations between objects are implicit; very flexible and storage efficient, but compute intensive
Implementation of Spatial Databases
Many popular databases have spatial extensions by now:
Oracle Spatial
PostgreSQL
MySQL (since 4.1)
Construction of Complex Features
Spatial Functions
Example: Oracle Spatial 10g
Return a geometry: Union, Difference, Intersect, XOR, Buffer, CenterPoint, ConvexHull
Return a number: Length, Area, Distance
(Figure panels: Original, Union, Intersect, Difference, XOR. Source: http://colab.cim3.net/file/work/SICoP/2006-06-20/2006-06-21/xlopez06212006.ppt)
Constructs new geometry objects from existing ones using point set theory
Efficient implementation using computational geometry
Constructing Cells: Buffer
How many competitors are in the catchment area of my shop?
= How many shops are within the buffer?
Simplistic approximation
Does not take account of barriers (rivers, highways)
Does not take into account the road system
Voronoi diagram
Which are my nearest competitors?
What is the cover of my radio antenna?
= Find Voronoi neighbors
Approximation
Does not take account of barriers (rivers, highways)
Does not take into account the road system
Decompose space into regions around each point in a set of points S such that all the points in the region around p_i are closer to p_i than to any other point in S
Complexity: $O(n \lg n)$
Related data structure: Delaunay triangulation (graph of Voronoi neighbors)
Drive-Time Zone (Dijkstra)
How many competitors are in the catchment area of my shop?
Realistic approximation
Takes account of barriers (rivers, highways)
Takes into account the road system and the maximum speed on each road
All street segments within a drive-time distance <= d from a given starting point
Use Dijkstra's algorithm
Complexity: $O(V^2)$ to $O(E + V \lg V)$, depending on the data structures used for implementation
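As a rough illustration of the drive-time zone, the following Python sketch runs Dijkstra's algorithm on a small directed road graph and stops expanding once the time budget is exceeded. The graph, node names and travel times are made up; the binary heap from the standard library gives roughly O((V + E) lg V), which is one possible implementation choice and not necessarily the one used in the tutorial's projects.

```python
import heapq


def drive_time_zone(graph, start, max_time):
    """Return {node: travel_time} for all nodes reachable within max_time.

    graph maps a node to a list of (neighbor, travel_time) pairs; edges are
    directed, so one-way streets and asymmetric times can be represented.
    """
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        time, node = heapq.heappop(heap)
        if time > best.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for neighbor, cost in graph.get(node, []):
            new_time = time + cost
            if new_time <= max_time and new_time < best.get(neighbor, float("inf")):
                best[neighbor] = new_time
                heapq.heappush(heap, (new_time, neighbor))
    return best


# Hypothetical road network with travel times in minutes.
roads = {
    "shop": [("a", 2.0), ("b", 5.0)],
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 1.0)],
    "c": [("shop", 10.0)],
}
```

For a budget of 3.5 minutes the zone contains the shop, a (2.0) and b (3.0, reached via a rather than directly); c is only reached after 4.0 minutes and is excluded.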
Pre-processing
Several of the feature extractions are computationally quite expensive (at least for large data sets) and there is often a combinatorial explosion of features that might be constructed.
Several strategies are used in Spatial Warehouse Design:
Selective Pre-processing: materializing important joins in advance (storage requirements!)
Approximate precomputing: e.g. using the Minimum Bounding Rectangle to approximate a polygon
Schema Design (e.g. Star-Schema with selective materialization): Han J., Stefanovic N., Koperski K. Selective Materialization: An Efficient Method for Spatial Data Cube Construction. PAKDD, 1998.
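The minimum bounding rectangle approximation mentioned above is usually applied as a cheap filter before an exact geometric test. A minimal Python sketch, with hypothetical function names and invented triangles, might look like this:

```python
def mbr(ring):
    """Minimum bounding rectangle of a polygon as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_intersect(a, b):
    """True if two rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical polygons: t1 and t2 do not intersect, but their MBRs do.
t1 = [(0, 0), (4, 0), (0, 4)]
t2 = [(3, 3), (5, 3), (5, 5)]
t3 = [(10, 10), (11, 10), (10, 11)]
```

Disjoint rectangles (t1 and t3) prove that the polygons are disjoint, so the expensive exact test can be skipped. Overlapping rectangles (t1 and t2) only mean "maybe", so an exact refinement step is still needed for those candidates.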
Spatial Database of Vector Objects: Discussion
Relations between objects implicit
Very flexible: depending on the analysis task, different relationships can be constructed
Storage efficient; no overhead for storing relationship information
Compute intensive (thus spatial indexing is very important)
Consider what and when to materialize
Very rich possibilities to create new, non-trivial objects from existing ones
Makes feature extraction an important topic for Data Mining
Inherently multi-relational setting (but not first-order)
Could also be formulated in a deductive database setting
Interactive Visualization of Spatial Data – Exploratory Data Analysis
(work by G. Andrienko & N. Andrienko, H. Voss and others at Fraunhofer IAIS)
For the theory behind CommonGIS, see the book Andrienko, N. and Andrienko, G.: Exploratory Analysis of Spatial and Temporal Data - A Systematic Approach, Springer, 2005
Geographic Information Systems and CommonGIS
Many commercial tools available
- ESRI ArcGIS
- MapInfo
- Intergraph
- Manifold
But CommonGIS is different and unique …
- Map-based exploratory data analysis
- stresses interactive visualization and manipulation of statistical data in space
- elaborate facilities for time-series visualization
CommonGIS can be acquired for non-commercial use by educational institutions for no fee
CommonGIS = Fraunhofer IAIS tool for map-based exploratory data analysis
- combines interactive cartography and statistics
- Time-series visualization and analysis
- Combines Vector-Raster transformation
Multivariate decision support
- Weighted Sums
- Ideal Point Analysis
Multi-dimensional
- Similarity analysis
- Dominant Attribute
- Integration with Weka (Clustering, Decision Trees)
CommonGIS: Visual analysis of spatial data
Interactive spatial search for
geographic objects and recognition of spatial patterns:
dynamic choropleth maps, pie charts, bar charts, etc.
with dynamic removal of outliers
and dynamic queries
Comparison of attribute values of geographic objects (relations and correlations) and
comparison of spatial patterns (spatial correlations):
(Linked) dynamic maps and interactive diagrams
CommonGIS: Visual analysis of spatio-temporal data
CommonGIS as an interactive browser to study how a spatial pattern evolves over time:
time aware maps (animations)
time series charts
CommonGIS as an interactive browser for
temporal behaviours of objects:
set of controls for analysing time intervals (object animations)
CommonGIS as an interactive browser of discrete space-time events to find spatio-temporal clusters:
space-time cube
Time-Series: Sales per Shop and Product Category
Product categories: Bakery (Bäckerei), stand-up café (Stehcafé), sit-down café (Sitzcafé), terrace (Terrasse)
Different Time Hierarchies (Year, Quarter, Month, Day…)
CommonGIS: Data transformation
Transformation of data for further analysis:
Attribute transformations:
calculate statistical indices
transform and combine attribute data arithmetically
dynamic classifiers (linked with dynamic choropleth map)
cross classifiers (linked with dynamic choropleth map)
Geographic transformations:
query, transform, combine, derive raster data
illumination model
raster -> vector transformations (i.e. raster -> area aggregation)
Geographic and Spatial Data Mining Methods
Autocorrelation
Spatial Variation
How are variables distributed in space?
Tobler‘s First Law of Geography:
„Everything is related to everything else, but near things are more related than distant things.“
⇒ distribution of variables depends on space
⇒ variables are autocorrelated
Field Soil Moisture
Franke, diploma thesis, Leipzig Univ., 2006
Spatial Autocorrelation: Binary Example
binary attribute (blue, white)
autocorrelation to four immediate neighbors
Moran Index (here): $I = \frac{n_{\text{equal}} - n_{\text{change}}}{n_{\text{equal}} + n_{\text{change}}}$, where n_equal and n_change count the neighbor pairs with equal and with different values
Example patterns in the figure: I = 0.86, I = 0.39, I = 0.00, I = -1.00
Goodchild, CATMOG, GeoBooks, Norwich, 1986
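The binary example can be reproduced with a short Python sketch. It assumes the simplified index is the ratio (equal − change) / (equal + change) over pairs of immediate (4-connected) neighbors, which is how the slide's formula reads; the grids are invented and do not reproduce the patterns of the figure.

```python
def binary_moran(grid):
    """Simplified Moran index for a binary grid with 4-neighbor adjacency.

    Each horizontally or vertically adjacent pair of cells is counted once,
    either as 'equal' (same value) or as 'change' (different value).
    """
    rows, cols = len(grid), len(grid[0])
    equal = change = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):  # right and lower neighbor only
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    if grid[r][c] == grid[rr][cc]:
                        equal += 1
                    else:
                        change += 1
    return (equal - change) / (equal + change)


# Hypothetical 4x4 patterns: a checkerboard and a left/right split.
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
split = [[1, 1, 0, 0] for _ in range(4)]
```

The checkerboard has only changing pairs and yields -1, the extreme of negative autocorrelation. The split pattern has 20 equal and 4 changing pairs and yields 2/3.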
Moran‘s I
Moran's I is a measure of spatial autocorrelation. It is a weighted product-moment correlation coefficient, where the weights reflect geographic proximity, and it is used to detect departures from spatial randomness. Departures from randomness indicate spatial patterns such as clusters and geographic trends.
Values of I larger than 0 indicate positive spatial autocorrelation; values smaller than 0 indicate negative spatial autocorrelation.
$$I = \frac{n \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} w_{ij} (z_i - \bar{z})(z_j - \bar{z})}{\left( \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} w_{ij} \right) \sum_{i=1}^{n} (z_i - \bar{z})^2}$$
z – attribute of interest; w – weight; n – number of areal objects
Example: n = 4 areal objects A, B, C, D with weight matrix w_ij:
   A B C D
A  0 1 1 0
B  1 0 1 1
C  1 1 0 1
D  0 1 1 0
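A direct transcription of the standard Moran's I formula into Python may help to see how the weight matrix enters. The weight matrix below is the 4x4 example from the slide (areas A to D); the attribute values z are invented for illustration, since the slide does not give any.

```python
def morans_i(z, w):
    """Moran's I for attribute values z and a weight matrix w (list of lists)."""
    n = len(z)
    mean = sum(z) / n
    dev = [value - mean for value in z]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    cross = sum(w[i][j] * dev[i] * dev[j] for i, j in pairs)
    weight_sum = sum(w[i][j] for i, j in pairs)
    return n * cross / (weight_sum * sum(d * d for d in dev))


# Weight matrix from the slide (1 = neighbors), order A, B, C, D.
weights = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]
# Hypothetical attribute values for A, B, C, D.
values = [1.0, 2.0, 3.0, 4.0]
```

For these made-up values the result is -0.04, i.e. close to spatial randomness, because almost every area neighbors almost every other one.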
Spatial Autocorrelation
similarity in location indicates similarity in attribute value
differs from temporal autocorrelation
- 1-dimensional autocorrelation in time series, spatial autocorrelation spreads in 2 or 3 dimensions
- only forward causality in time series, direction of causality not restricted in space
depends on scale
(Figures: Temperature of Sunspots; Sunspot Time Series, number of sunspots per year)
Effects of Autocorrelation
makes spatial abstraction possible
makes standard approaches of analysis impossible - most statistics assume iid
makes local inference attractive - Kriging, kNN, …
makes choice of sampling interval hard - autocorrelation depends on scale
makes interpolation easier than extrapolation
zero autocorrelation = independence of location
(Figure: spatial autocorrelation, correlation from -1 to +1 plotted against distance)
Problem types for Spatial Data Mining
Spatial Data Mining := partially automated search for patterns and models in large spatial databases
Classification of methods along the following hierarchy
Points
Points, Lines and Areas
Networks
Handling spatial data in Data Mining – Basic Options
Treat as ordinary variables
- no special algorithms needed
- spatial properties ignored, e.g. discontiguous areas
Make spatial relationships explicit
- e.g. infer topological relationships
- expensive, but allows normal algorithms to be used
- can be done as pre-processing or dynamically (the latter requires specialized algorithms)
Specialized algorithms
- Neighborhood methods, kriging, Gaussian processes, density-based clustering …
Use a proper combination of data, preprocessing, algorithms, and interaction software!
Mining Point Data
(Figure: point data placed on axes of space complexity and time complexity)
Clustering spatial point data
Point data conceived as discrete objects
Many approaches exist for clustering spatial point data
In statistics, measures of spatial randomness or non-randomness have been developed (e.g. Ripley 1991, Cressie 1993)
- Ripley's K function, which measures deviation from complete spatial randomness (as exemplified by a Poisson process)
- Moran's I, which measures autocorrelation
Bayesian approaches often coming from image analysis (cf. Lawson et al 2002)
In Geography, spatial clustering algorithms have been developed (Openshaw, GAM,
Density Based Clustering – a KDD approach
[Ester et al. 1996]
Suitable for large databases
Discovers areas of high density and turns them into clusters
Discovers clusters of arbitrary shape
Can handle noise
Algorithm DBSCAN
Note: Relatively straightforward extension to vector data possible (GDBSCAN); requires more complex definition of some key concepts (neighborhood and MinPts)
Clustering spatial data
distance-based clustering is inherently spatial
but the assumption of convex clusters (e.g. k-means) is inappropriate for many “geographical” tasks
Definitions 1
Eps-neighborhood of a point p: N_ε(p) := {q ∈ D | dist(p, q) ≤ ε}
A point p is directly density-reachable from q iff
1. p ∈ N_ε(q)
2. |N_ε(q)| > MinPts (“q is core object”)
- Not necessarily symmetric
(Figure: p is a border object, q is a core object; p is directly density-reachable from q, q is not directly density-reachable from p)
The choice of Eps is crucial!
Definitions 2
p is density-reachable from point q wrt. Eps and MinPts iff there is a chain of points p_1,…,p_n, p_1 = q, p_n = p such that p_{i+1} is directly density-reachable from p_i
- Transitive, not symmetric
p is density-connected to q iff there is a point o such that p and q are density-reachable from o wrt. Eps and MinPts
- Symmetric
(Figure: p is density-reachable from q, q is not density-reachable from p; p and q are density-connected to each other by o)
Density-connected clustering
A cluster C wrt. Eps and MinPts is a non-empty subset of database D, where
(1) ∀p,q: if p ∈ C and q is density-reachable from p wrt. Eps and MinPts, then q ∈ C
(2) ∀p,q ∈ C: p is density-connected to q wrt. Eps and MinPts
Non-covered points are noise
Each cluster contains at least MinPts points
Exactly one clustering
Algorithm DBSCAN – Basic Idea
Check the Eps-neighborhood of every unclassified point in the database
If the neighborhood of p contains more than MinPts points, a new cluster with p as core object is built
Collect directly density-reachable objects from this set, merging clusters as necessary
Terminate when no new point can be added to any cluster
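The basic idea can be sketched as a compact, unoptimized Python function. This is an illustration under assumptions, not the reference implementation: neighborhoods are found by a linear scan (a real system would use a spatial index), the core condition uses the strict "more than MinPts" wording of these slides (Ester et al. 1996 state it as "at least MinPts"), and the sample points are invented.

```python
from math import dist  # Python 3.8+

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks uncovered points."""
    labels = [None] * len(points)

    def neighborhood(i):
        # Eps-neighborhood by linear scan; includes the point itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighborhood(i)
        if not len(seeds) > min_pts:   # not a core object (may become border later)
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border object of this cluster
            if labels[j] is not None:
                continue               # already assigned
            labels[j] = cluster
            reachable = neighborhood(j)
            if len(reachable) > min_pts:
                queue.extend(reachable)  # j is a core object: keep expanding
        cluster += 1
    return labels


# Hypothetical data: two dense squares and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
```

With eps = 1.5 and min_pts = 3, every corner of a square has four points in its neighborhood, so each square becomes one cluster and the isolated point is labelled as noise.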
Kriging-Spatial Interpolation
Kriging
developed by G. Matheron in the 1960s based on work of D. Krige
geostatistical method of interpolation
Point data conceived as samples from a continuous surface
results are smoothly varying surfaces
provides optimality given assumptions (best linear unbiased estimate)
variety of methods, e.g. Ordinary Kriging, Universal Kriging, Co-Kriging, Block Kriging, Stratified Kriging, Indicator Kriging, …
(Figure legend: • measurements, ? unknown values)
Spatial Variation
Problem: spatial variation of a continuous attribute is often too irregular to be modelled by a simple, smooth mathematical function
Solution:
variation can be described by stochastic surface
x – location in n-dimensional space
Z(x) – random variable of interest, e.g. soil moisture
A stochastic process is a family of random variables Z(x) over the index set $D \subset \mathbb{R}^n$: $\{Z(x) : x \in D\}$
A Gaussian process is a stochastic process for which any finite set of Z-variables has a joint multivariate Gaussian distribution.
Components of Spatial Variation
structural component, having a constant mean or trend
random, but spatially correlated component (regionalized variable)
spatially uncorrelated random noise term
$$Z(x) = m(x) + \varepsilon'(x) + \varepsilon''$$
m(x): trend; ε'(x): autocorrelation; ε'': random noise
(Figure: Z(x) plotted against x)
Stationarity
Problem: a spatial data set is a single realization of a random process
inference is impossible without further restrictions on spatial variation
Intrinsic Stationarity (stationarity under translation):
constant mean (E[...] = 0) or trend (E[...] > 0): $E[Z(x) - Z(x+h)] = \text{const.}$
variance of differences at lag h is independent of location: $E[\{Z(x) - Z(x+h)\}^2] = 2\gamma(h)$
Isotropy (stationarity under rotation):
spatial process evolves the same in all directions
Ordinary Kriging
Assumptions:
intrinsic stationarity with a constant mean
- constant mean value in sampling area: $E[Z(x) - Z(x+h)] = 0$
- variance of differences depends only on the distance h between sites:
$$\mathrm{Var}[Z(x) - Z(x+h)] = E[\{Z(x) - Z(x+h)\}^2] = E[\{\varepsilon'(x) - \varepsilon'(x+h)\}^2] = 2\gamma(h)$$
γ(h) is the semivariance
Once structural effects have been accounted for, the remaining variation is homogeneous in variance, so that differences between sites are merely a function of the distance between them.
Ordinary Kriging
Procedure:
1. Estimate semivariance γ(h) from data sample
2. Plot the experimental variogram
3. Fit a theoretical model to the experimental variogram
4. Estimate unknown values as weighted sum of neighboring measurements, determine optimal weights from variogram
Semivariance and Experimental Variogram
semivariance depends only on distance (lag) h
estimate semivariance between all pairs of measurements with distance h (repeat for all possible h)
$$\hat{\gamma}(h) = \frac{1}{2n} \sum_{i=1}^{n} \{z(x_i) - z(x_i + h)\}^2$$
where n is the number of pairs of measurements separated by lag h
(Figure: Experimental Variogram, γ(h) plotted against lag h)
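The estimator can be written down almost literally in Python. The sketch below assumes isotropy (only the distance between two sites matters) and groups pairs whose distance lies within a tolerance of each lag; the tolerance parameter and the sample transect are assumptions made for illustration.

```python
from math import dist  # Python 3.8+


def experimental_variogram(points, values, lags, tol):
    """Estimate the semivariance for each lag h.

    All pairs of measurements whose distance is within tol of h contribute
    their squared difference; the sum is divided by twice the pair count.
    """
    result = {}
    for h in lags:
        squared = [(values[i] - values[j]) ** 2
                   for i in range(len(points))
                   for j in range(i + 1, len(points))
                   if abs(dist(points[i], points[j]) - h) <= tol]
        if squared:  # skip lags for which no pair exists
            result[h] = sum(squared) / (2 * len(squared))
    return result


# Hypothetical measurements along a transect with unit spacing.
sites = [(0, 0), (1, 0), (2, 0), (3, 0)]
measurements = [1.0, 2.0, 4.0, 7.0]
```

For this transect the semivariance grows with the lag (14/6 at lag 1, 8.5 at lag 2, 18.0 at lag 3), which is the typical shape before a sill is reached.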
Variogram
nugget:
- γ(0) = 0 (by definition)
- nugget effect represents small scale variation and measurement errors
- estimate of ε''
range:
- spatial dependency
- here, variance of differences increases with distance
- two points are more similar the closer they are
sill:
- semivariance levels off
- variance of differences is independent of distance
(Figure: variogram γ(h) against lag h, with nugget, range and sill marked)
Variogram Models
experimental variogram must be fitted to an appropriate variogram model
most commonly used are the spherical, exponential, linear or Gaussian model
(Figure: γ(h) against lag h for the Spherical, Exponential, Linear and Gaussian models)
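For reference, the four model families can be written as small Python functions. The parameterization below (nugget, sill, range) is one common textbook convention and is an assumption here; other sources scale the range differently, for example by a factor of 3 in the exponential model, so fitted parameters are not directly comparable across tools.

```python
from math import exp


def spherical(h, nugget, sill, rng):
    """Rises from the nugget and reaches the sill exactly at the range."""
    if h == 0:
        return 0.0
    if h >= rng:
        return sill
    r = h / rng
    return nugget + (sill - nugget) * (1.5 * r - 0.5 * r ** 3)


def exponential(h, nugget, sill, rng):
    """Approaches the sill asymptotically."""
    if h == 0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - exp(-h / rng))


def gaussian(h, nugget, sill, rng):
    """Parabolic near the origin, then approaches the sill asymptotically."""
    if h == 0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - exp(-(h / rng) ** 2))


def linear(h, nugget, slope):
    """No sill: semivariance keeps growing with the lag."""
    if h == 0:
        return 0.0
    return nugget + slope * h
```

Fitting then means choosing the family and its parameters so that the curve follows the experimental variogram, for instance by least squares.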
Interpolation of unknown Values
unknown value at location x_0 is estimated as weighted sum of neighboring measurements:
$$Z^*(x_0) = \sum_{i=1}^{n} w_i Z(x_i)$$
weights w_i are determined according to two restrictions
- Z*(x_0) is an unbiased estimate of Z(x_0)
- Z*(x_0) is an optimal estimate
Have to solve a system of n+1 linear equations of semivariances and weights
Equation System
restriction on weights introduces Lagrange parameter φ (Restriction 1)
system of (n+1) equations must be solved to obtain optimal weights for each x_0
$$\begin{pmatrix} \gamma(x_1 - x_1) & \cdots & \gamma(x_1 - x_n) & 1 \\ \vdots & \ddots & \vdots & \vdots \\ \gamma(x_n - x_1) & \cdots & \gamma(x_n - x_n) & 1 \\ 1 & \cdots & 1 & 0 \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_n \\ \varphi \end{pmatrix} = \begin{pmatrix} \gamma(x_1 - x_0) \\ \vdots \\ \gamma(x_n - x_0) \\ 1 \end{pmatrix}$$
Ordinary Kriging is an exact interpolator, i.e. interpolated value of a sample location will be identical with the measurement taken
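A small, dependency-free Python sketch shows how the (n+1) equation system can be assembled and solved. It assumes an isotropic variogram passed in as a function gamma of the distance, and it uses a plain Gaussian elimination helper written for this example; the two sample sites, their values and the linear variogram are invented.

```python
from math import dist  # Python 3.8+


def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def ordinary_kriging(points, values, x0, gamma):
    """Return (estimate at x0, weights) from the ordinary kriging system."""
    n = len(points)
    # Semivariances between all sample sites, bordered by the unbiasedness row.
    a = [[gamma(dist(p, q)) for q in points] + [1.0] for p in points]
    a.append([1.0] * n + [0.0])
    b = [gamma(dist(p, x0)) for p in points] + [1.0]
    solution = solve(a, b)        # w_1..w_n followed by the Lagrange parameter
    weights = solution[:n]
    return sum(w * v for w, v in zip(weights, values)), weights


# Hypothetical data: two measurements and a linear variogram gamma(h) = h.
sample_sites = [(0.0, 0.0), (1.0, 0.0)]
sample_values = [10.0, 20.0]
```

Halfway between the two sites both weights are 0.5 and the estimate is 15. At a sample site the weight of that site is 1, so the estimate reproduces the measurement, which illustrates the exact-interpolator property.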
84 Michael May
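A minimal sketch of ordinary kriging with NumPy, assuming the usual (n+1) x (n+1) system of semivariances bordered by ones. The function name, the linear variogram γ(h) = h and the four sample points are assumptions chosen for illustration; in practice the variogram would be the model fitted to the experimental variogram.

```python
import numpy as np

def ordinary_kriging(coords, values, x0, variogram):
    """Return the kriging estimate at x0, the weights w_i and the Lagrange parameter."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Left-hand side: semivariances between all sample pairs, bordered by ones.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = variogram(d)
    A[n, n] = 0.0
    # Right-hand side: semivariances between each sample and x0, then 1.
    b = np.ones(n + 1)
    b[:n] = variogram(np.linalg.norm(coords - np.asarray(x0, dtype=float), axis=1))
    solution = np.linalg.solve(A, b)
    weights, phi = solution[:n], solution[n]
    return float(weights @ values), weights, phi

def linear_variogram(h):
    return h  # assumed toy model: gamma(h) = h, so gamma(0) = 0

coords = [(0, 0), (1, 0), (0, 1), (1, 1)]
values = [1.0, 2.0, 3.0, 4.0]
at_sample, w_sample, _ = ordinary_kriging(coords, values, (0, 0), linear_variogram)
at_centre, w_centre, _ = ordinary_kriging(coords, values, (0.5, 0.5), linear_variogram)
```

Evaluating at a sample location returns the measured value (exact interpolation), and at the centre of the square all four samples receive the same weight.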
Variants of Kriging
Universal Kriging
- structural component may contain an external trend
Co-Kriging
- interpolation for one attribute incorporates information of another, correlated attribute
- sparse measurements of an expensive variable are supported by plenty of measurements of a cheap variable
Stratified Kriging
- interpolation within sub-areas
- equations are adjusted to avoid discontinuities on boundaries
More details: Burrough, P., McDonnell, R. 1998
Mining Points, Lines, and Areas
Points, Lines and Areas
[Figure: data types arranged by space complexity and time complexity: Points; Points, Lines, and Areas]
Points, Lines and Areas
Requirements:
• Point data
• Polygons
• Aggregations
Applications:
• Customer Segmentation
• Catchment Areas
• Location Planning
• Radio Network Analysis
Examples:
• GDBScan Clustering
• Spatial Subgroup Mining
• Spatial Association Rules
• Spatial Model Trees
Clustering of Vector Data: GDBSCAN [Sander et al. 1998]
Extension of DBSCAN - Sample Instantiations
Neighborhood predicate: dist < ε | intersects/meets | neighbor
Minimum weight of a neighborhood S: |S| ≥ MinCard | ∑ areas ≥ MinArea | f(S) ≥ MinF
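The generalisation can be sketched by passing the neighborhood predicate and the weight function in as parameters. This is a simplified illustration, not the published algorithm's code: the names `npred`, `wcard` and `min_weight` and the one-dimensional toy data are assumptions, and the neighborhood search is a plain scan instead of a spatial index.

```python
def gdbscan(objects, npred, wcard, min_weight):
    """Label each object with a cluster id (0, 1, ...) or -1 for noise.

    npred(a, b) -> bool : generalised neighborhood predicate (must hold for a == b)
    wcard(list) -> num  : weight of a neighborhood (count, summed area, ...)
    """
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(objects)

    def neighbors(i):
        return [j for j in range(len(objects)) if npred(objects[i], objects[j])]

    cluster = 0
    for i in range(len(objects)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if wcard([objects[j] for j in seeds]) < min_weight:
            labels[i] = NOISE  # may later be relabelled as a border object
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border object of this cluster
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if wcard([objects[k] for k in reachable]) >= min_weight:
                queue.extend(reachable)  # j is a core object, keep expanding
        cluster += 1
    return labels

# DBSCAN-style instantiation: dist < eps and |S| >= MinCard.
points = [0, 1, 2, 10, 11, 12, 50]
labels = gdbscan(points, npred=lambda a, b: abs(a - b) < 1.5, wcard=len, min_weight=2)
```

Swapping `npred` for an intersects/meets test on polygons and `wcard` for a sum of areas gives the other instantiations without touching the clustering loop.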
Spatial Subgroup Mining
Typical Data Mining representation
Data Mining for spatial data: very different from this representation
‘spreadsheet data’
exactly 1 table
atomic values
Subgroup Discovery Search (Klösgen 1996, Wrobel 1997)
Subgroup discovery searches for subgroups with deviation patterns
- overproportionally high share of target value (or mean of target variable)
Top-down search from most general to most specific subgroups, exploiting partial ordering of subgroups
- S1 ≥ S2: S1 is more general than S2
Beam search expands only the n best ones at each level
Evaluating hypotheses according to quality function:
\frac{p(T|C) - p(T)}{\sqrt{p(T)\,(1 - p(T))}} \sqrt{n} \sqrt{\frac{N}{N - n}}
- N = total population
- n = subgroup size
- p(T) = target share in total population
- p(T|C) = target share in subgroup
Extension to multi-relational representation in Wrobel (1997)
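As a worked example, the quality function can be evaluated directly. The sketch assumes the form (p(T|C) - p(T)) / sqrt(p(T)(1 - p(T))) * sqrt(n) * sqrt(N / (N - n)); the function name and the numbers are illustrative only.

```python
import math

def subgroup_quality(N, n, p_T, p_T_given_C):
    """Quality of a subgroup of size n within a population of size N (requires n < N)."""
    deviation = (p_T_given_C - p_T) / math.sqrt(p_T * (1.0 - p_T))
    return deviation * math.sqrt(n) * math.sqrt(N / (N - n))

# 1000 districts, 20% of them with the target value; a subgroup of 100
# districts in which 40% have the target value.
q = subgroup_quality(N=1000, n=100, p_T=0.2, p_T_given_C=0.4)

# A subgroup whose target share equals the population share scores zero.
q_neutral = subgroup_quality(N=1000, n=100, p_T=0.2, p_T_given_C=0.2)
```

Larger subgroups and larger deviations from the population share both raise the score, so a beam search ranking by this value favours subgroups that are both big and unusual.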
Translating Multirelational Subgroups
to Object-relational SQL
Domain: relational database schema D = {R1, ..., Rn} having geometry attributes Gi
Hypothesis Language
Multirelational subgroups are represented by a concept set C = {Ci}, where each Ci consists of a set of attribute-value pairs {A1=v1,...,An=vn} from a relation in D,
a set of links L = {Li} linking concepts Ci, Ck via their attributes Am, An, of the form
(Ci/Am {=|inside|overlaps|...|spatially_interact} Ck/An)
target attribute can be non-numeric (A1=v1) or numeric aggregate (avg(A)=n)
Example:
C = {{district.long_term_illness=high, district.unemployment=high}, {street.name=’Manchester Road’}}
L = {{district.geometry spatially_interact street.geometry}}
“Enumeration districts with high rate of long term illness and unemployment crossed by Manchester Road”
Testing satisfaction of subgroup descriptions
The number of tuples in D that satisfy a subgroup description is evaluated using SQL select statements including joins over multiple relations.
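The translation step can be illustrated with a small Python function that turns a concept set and its links into a counting query. This is a hypothetical sketch: the function name, the table and column names and the SQL template for spatially_interact (modelled on Oracle Spatial's SDO_ANYINTERACT operator) are assumptions, and the string formatting is for illustration only and is not safe against SQL injection.

```python
def subgroup_to_sql(concepts, links, spatial_ops):
    """Build a SELECT COUNT(*) statement for a multirelational subgroup description.

    concepts    : list of dicts mapping 'table.attribute' to a value
    links       : list of (left_attribute, relation, right_attribute)
    spatial_ops : dict mapping a relation name to an SQL template (assumed syntax)
    """
    conditions, tables = [], set()
    for concept in concepts:
        for attribute, value in concept.items():
            tables.add(attribute.split(".")[0])
            conditions.append(f"{attribute} = '{value}'")
    for left, relation, right in links:
        tables.update((left.split(".")[0], right.split(".")[0]))
        conditions.append(spatial_ops[relation].format(left=left, right=right))
    return ("SELECT COUNT(*) FROM " + ", ".join(sorted(tables))
            + " WHERE " + " AND ".join(conditions))

concepts = [
    {"district.long_term_illness": "high", "district.unemployment": "high"},
    {"street.name": "Manchester Road"},
]
links = [("district.geometry", "spatially_interact", "street.geometry")]
spatial_ops = {"spatially_interact": "SDO_ANYINTERACT({left}, {right}) = 'TRUE'"}

sql = subgroup_to_sql(concepts, links, spatial_ops)
```

The resulting statement joins the district and street relations through the spatial predicate and counts the joined tuples that satisfy all attribute-value pairs.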
Approach: Translation of Spatial Subgroup Mining to SQL
(Klösgen, May 2002)
• Representing subgroups in object-relational SQL, i.e. multi-relational representation
• Using representation for spatial geometry based on Spatial Database
• Division of work between RDBMS and Search Manager
• Combining visualization in abstract and physical space
Division of labour between RDBMS
and Search Manager (May, Savinov 2003)
[Figure: the Mining Server (search algorithm) sends a mining query to the Database Server and receives statistics]
• search in hypothesis space
• generation and evaluation of hypotheses (subgroup patterns)
• Database integration: efficiently organize mining queries
• Mining query delivers statistics (aggregations) sufficient for evaluating many hypotheses
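The idea that one mining query returns enough aggregates to score many hypotheses can be sketched with Python's built-in sqlite3 module. The table, its columns and the data are invented for illustration; the system described on the slides runs against a spatial RDBMS, not SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE district (id INTEGER, unemployment TEXT, illness TEXT)")
conn.executemany(
    "INSERT INTO district VALUES (?, ?, ?)",
    [(1, "high", "high"), (2, "high", "high"), (3, "high", "low"),
     (4, "low", "low"), (5, "low", "low"), (6, "low", "high")],
)

# One mining query: per attribute value, the subgroup size n and the number of
# target cases. The database does the counting; the search algorithm only
# receives the aggregates.
rows = conn.execute(
    "SELECT unemployment, COUNT(*), SUM(illness = 'high') "
    "FROM district GROUP BY unemployment"
).fetchall()

stats = {value: (n, hits) for value, n, hits in rows}
N = sum(n for n, _ in stats.values())
p_T = sum(hits for _, hits in stats.values()) / N
# Every hypothesis 'unemployment = v' can now be evaluated without another query.
shares = {value: hits / n for value, (n, hits) in stats.items()}
conn.close()
```

The returned rows contain N, n, p(T) and p(T|C) for every candidate subgroup on that attribute, which is all the quality function needs.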
SPIN! – Spatial Data Mining System
[Screenshot: SPIN! user interface with Workspace, Property Editor, Subgroup Viewer, Flowchart-Tool and Subgroup Result List]
Interactive Exploratory Analysis
Combination of spatial and non-spatial visualization
User selects and manipulates variables
Powerful for analysis in low dimensions (3-4)
Displays dynamically linked:
- Scatter Plot
- Parallel Coordinate Plot
- Choropleth Maps
Visualization of spatial subgroups
Linked Display
Spatial Venn Diagram
Subgroup Overview: p(T|C) vs. p(C)
Subgroup: High long-term illness in districts crossed by M60
Radio Network Planning in Telecommunication
High cut-off call ratio in mountainous regions crossed by highways having a certain technical configuration
Legend:
Blue: highway
Brown: high altitude
Black: subgroup
SPIN! Mapviewer (CommonGIS)
Other commercial applications of Subgroup Discovery
How are my customers characterized? Are there interesting profiles?
Where to open the next supermarket? Does it create competition for my other supermarkets?
Spatial Association Rules
work and slides by
Donato Malerba et al., Univ. Bari
Spatial association rules
An association pattern P (s%) is a spatial association pattern if it contains at least one spatial relation
- A large town intersects a road and is adjacent to water (62%)
An association rule Q → R (s%, c%) is a spatial association rule if Q ∧ R is a spatial association pattern
- IF a large town intersects a road THEN it is also adjacent to water (62%, 89%)
Seminal work by Koperski & Han 1995
Malerba et al.
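Reading s% as support and c% as confidence, both values can be computed by counting reference objects. The sketch below assumes the spatial relations have already been evaluated into boolean flags; the five towns and all field names are invented for illustration.

```python
def support_confidence(objects, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    n_q = sum(1 for o in objects if antecedent(o))
    n_qr = sum(1 for o in objects if antecedent(o) and consequent(o))
    support = n_qr / len(objects)
    confidence = n_qr / n_q if n_q else 0.0
    return support, confidence

# Hypothetical reference objects (large towns) with precomputed spatial relations.
towns = [
    {"name": "A", "intersects_road": True,  "adjacent_to_water": True},
    {"name": "B", "intersects_road": True,  "adjacent_to_water": True},
    {"name": "C", "intersects_road": True,  "adjacent_to_water": True},
    {"name": "D", "intersects_road": True,  "adjacent_to_water": False},
    {"name": "E", "intersects_road": False, "adjacent_to_water": False},
]

support, confidence = support_confidence(
    towns,
    antecedent=lambda t: t["intersects_road"],
    consequent=lambda t: t["adjacent_to_water"],
)
```

Here three of five towns satisfy both relations (support 60%), and three of the four towns that intersect a road are also adjacent to water (confidence 75%).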
The problem
Given
- a spatial database (SDB) with a set S of reference objects
- some sets Rk, 1 ≤ k ≤ m, of task-relevant objects
- some spatial hierarchies Hk involving objects in Rk
- M granularity levels in the descriptions
- a set of granularity assignments ψk which associate each object in Hk with a granularity level
- a couple of thresholds minsup[l] and minconf[l] for each granularity level
- a domain knowledge
Find
- strong multiple-level spatial association rules.
Malerba et al.
The solution
Solution (Appice et al., IDA Journal, 2003)
based on an Inductive Logic Programming (ILP) approach → spatial relations easily handled
spatial pattern → conjunction of first-order logic atoms
θ-subsumption orders the space of spatial patterns
monotonicity of support w.r.t. θ-subsumption → pruning of patterns at the same granularity level in the candidate generation phase
monotonicity of pattern frequency w.r.t. granularity level → pruning of patterns at different granularity levels in the candidate generation phase
Implemented in SPADA (Spatial Pattern Discovery Algorithm)
European project SPIN (Spatial Mining for Data of Public Interest)
Extensions of initial solutions
Efficiency improvement of pattern evaluation by caching support objects for each stored pattern
Definition of a declarative bias to filter out rules on the basis of users’ preferences → efficiency improvement is a byproduct
- In real-world applications a large number of spatial patterns can be generated even for a few hundred spatial objects.
- Most of the discovered patterns are useless for the application at hand
- Urban accessibility application: only spatial patterns involving some sociological factor (household with no car) are interesting.
Integration of SPADA in the ARES system that interfaces a Spatial DB (Oracle Spatial)
Mining Network Data
Networks
[Figure: data types arranged by space complexity and time complexity: Points; Points, Lines, and Areas; Networks]
Points and Networks
• Requirements:
• Point Data
• Polygons
• Aggregations
• Spatial dependencies and relations, networks
• Examples: Traffic frequency prediction
• Method:
Case Study: Outdoor Advertising - Frequency Atlas
Customer: Fachverband für Außenwerbung
(FAW; German Outdoor Advertising Association)
Task:
Performance value assessment of advertising media
Traffic volume forecast, separate for private cars, public transport, pedestrians
Frequency + Media factories = poster reach
Gesellschaft für Konsumforschung
The project in numbers
Complete model for all German cities with more than 50,000 inhabitants (192 cities) = ca. 1,000,000 street segments!
Complete model includes, for each segment:
- car frequency
- pedestrian frequency
- public transport frequency
The model is presently being extended to all cities with between 10,000 and 50,000 inhabitants
Basic Data: traffic measurements
Manual traffic measurement at selected poster locations
- 4 times 6 minutes on four days of the week at four times of day
Additional empirical model of day totals
Properties
- Well defined measurements
- Extended measurement period, so concept drift cannot be excluded
Secondary data
[Figure: street network, sociodemographics + socioeconomics, public transport network, Points of Interest (POI) and frequency measurements feed into DATA MINING, which yields frequency classes (0, 200, 400, 600, 800, 1000, 1250, 1500, 1750, 2000, ...)]
Local Measurements
Inhomogeneous measurements on the same street
How Spatial Autocorrelation helps
[Figure: example measurements along one street: 820, 1200, 843]
Spatial kNN
Attributes of street segments:
- Name, type, …, class
- Points of Interest
- Spatial coordinates
Locations with measurement values
Distance between two segments xa, xb:
d(x_a, x_b) = \sum_{m=1}^{M} |x_{am} - x_{bm}|
Selection of the k closest x1, …, xk
Prediction for new segment xq:
\hat{y}_q = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i} with w_i = \frac{1}{d(x_q, x_i)}
(Project has actually used specially adapted distance measure)
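A minimal sketch of the distance-weighted kNN prediction. It assumes a plain sum of absolute attribute differences as the distance (the project itself used a specially adapted measure), and the segment vectors and frequencies are invented for illustration.

```python
def distance(a, b):
    """Sum of absolute attribute differences (an assumed, generic distance)."""
    return sum(abs(am - bm) for am, bm in zip(a, b))

def knn_predict(xq, segments, k):
    """Inverse-distance weighted mean of the k closest measured segments.

    segments: list of (attribute_vector, measured_frequency)
    """
    nearest = sorted(segments, key=lambda s: distance(xq, s[0]))[:k]
    for x, y in nearest:
        if distance(xq, x) == 0:
            return y  # identical segment: return its measurement, avoid dividing by zero
    weights = [1.0 / distance(xq, x) for x, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

measured = [((0.0, 0.0), 800.0), ((2.0, 0.0), 1200.0), ((10.0, 10.0), 100.0)]
prediction = knn_predict((1.0, 0.0), measured, k=2)
```

The query segment lies midway between the two closest measured segments, so both get the same weight and the distant third segment does not influence the result.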
Spatial KNN - Properties
kNN captures the autocorrelation inherent in the data well
Allows background knowledge to be brought in by fine-tuning the distance function
Database integrated (Oracle Spatial)
- Performs dynamic spatial query (minimum distances among polygons)
Performance improvements
- Spatial queries use index structures (R-Tree), but are still relatively costly (i.e. they dominate overall run-time)
- Partial evaluation of distance function based on lower bounds for distance to minimize number of spatial queries
Smoothing based on flow constraints
Measurement errors lead to inconsistencies
Need plausible assignment of frequencies
Solution:
Use Kirchhoff’s law as constraint
- Sum of inputs = sum of outputs
Smoothing algorithm finds locally optimal solution using constraint relaxation
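One simple way to relax flow inconsistencies is to visit each junction repeatedly and rescale its incoming and outgoing frequencies towards a common total. The sketch below is an assumed stand-in for illustration, not the project's smoothing algorithm; the function names and the three-segment street are hypothetical.

```python
def imbalance(flows, node):
    """Sum of inputs minus sum of outputs at a node."""
    total_in = sum(f for (u, v), f in flows.items() if v == node)
    total_out = sum(f for (u, v), f in flows.items() if u == node)
    return total_in - total_out

def relax_flows(edges, iterations=100):
    """edges: dict mapping (from_node, to_node) to an estimated frequency."""
    flows = dict(edges)
    nodes = sorted({u for u, _ in flows} | {v for _, v in flows})
    for _ in range(iterations):
        for node in nodes:
            ins = [e for e in flows if e[1] == node]
            outs = [e for e in flows if e[0] == node]
            if not ins or not outs:
                continue  # network boundary: no constraint to enforce
            total_in = sum(flows[e] for e in ins)
            total_out = sum(flows[e] for e in outs)
            if total_in == 0 or total_out == 0:
                continue
            target = (total_in + total_out) / 2.0
            for e in ins:
                flows[e] *= target / total_in   # scale inputs towards the target
            for e in outs:
                flows[e] *= target / total_out  # scale outputs towards the target
    return flows

# Three consecutive street segments with inconsistent estimates.
estimates = {("s", "u"): 100.0, ("u", "v"): 120.0, ("v", "t"): 90.0}
smoothed = relax_flows(estimates)
```

Each local correction disturbs the neighbouring junction slightly, so the passes are repeated until the remaining imbalances are negligible.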
Explaining frequencies
Problem: Customer wants transparent values, not a black box
⇒ Problem for Spatial kNN
Solution: Fit an explanatory model to the predicted values
- Makes it possible to understand why predictions are as they are
- Makes it possible to identify potential outliers and areas of high uncertainty
⇒ Use Model Trees
Numerical prediction with model trees
Example leaf model LM1:
FREQUENZ = 2277.3186 * X + 75.4087 * ANZAHL_EINKAUF - 142.4217 * MESSE - 21221.8497
[Figure: model tree with leaf models LM1 to LM6 and the following split nodes]
- Fussgängerzone (pedestrian zone): Nein | Ja
- Bahnhof (railway station): Nein | Ja
- Distanz_zu_Bahnhof (distance to station): <= 150 | > 150
- Anzahl_Restaurants (number of restaurants): <= 5 | > 5
- ORTSTEIL (district) = INNENSTADT (LR) | ...
- Straßenkategorie (street category): Nebenstr. | Hauptstr.
- Y-Koordinate (y coordinate): <= 9.6 | > 9.6
- X-Koordinate (x coordinate): <= 52.385 | > 52.385
- Anzahl_Restaurants (number of restaurants): <= 15 | > 15
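Prediction with a model tree amounts to routing a street segment down the splits and evaluating the linear model at the leaf it reaches. In the sketch below only the LM1 coefficients come from the slide; the tree layout, the second leaf and the attribute spelling `Fussgaengerzone` are hypothetical.

```python
def lm1(x):
    # Leaf model LM1 with the coefficients shown on the slide.
    return (2277.3186 * x["X"] + 75.4087 * x["ANZAHL_EINKAUF"]
            - 142.4217 * x["MESSE"] - 21221.8497)

def lm_other(x):
    # Hypothetical constant leaf, invented for this sketch.
    return 500.0

# Hypothetical one-split tree: (attribute, test, subtree_if_true, subtree_if_false).
# Which leaf LM1 really belongs to is not recoverable from the slide text.
MODEL_TREE = ("Fussgaengerzone", lambda v: v == "Ja", lm1, lm_other)

def predict(node, x):
    """Follow the splits until a leaf model (a function) is reached, then apply it."""
    while not callable(node):
        attribute, test, if_true, if_false = node
        node = if_true if test(x[attribute]) else if_false
    return node(x)

segment = {"Fussgaengerzone": "Ja", "X": 9.7, "ANZAHL_EINKAUF": 2, "MESSE": 0}
frequency = predict(MODEL_TREE, segment)
```

Because each leaf is an explicit linear formula, the contribution of every attribute to a predicted frequency can be read off directly, which is the transparency the customer asked for.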
Improving model by spotting outliers based on model tree prediction
Points with great prediction error are checked
- Visual inspection
- Getting additional empirical input by taking new measurements
Corrected values are basis for next round in model building, leading to improved results
Final Result: Frequency Map
[Figure: frequency maps for Cars, Public Transport and Pedestrians]
Final result: frequency atlas (cars, public transport, pedestrians)
~1 million street segments predicted based on 96,000 measurements
Used for determining poster prices in Germany since 2006
Spatial Model Trees [Malerba, Appice, Cecci 2005]
Standard Model Trees (e.g. M5’) can do Spatial Mining by splitting along x and y coordinates
Mrs-Smoti (Malerba et al. 2004) is a variant of Model Trees that
- Allows regression nodes as interior nodes
- Handles autocorrelation directly: spatial regression model with dependencies in response variables (spatially lagged response)
It inputs spatial objects, possibly belonging to separate thematic layers, stored in a spatial database S
- target objects (main subject of analysis)
- non-target objects (relevant for the task in hand)
and outputs a spatial model tree T by
- partitioning training spatial data according to intra-layer and inter-layer relationships
- associating different regression models to disjoint spatial areas
Integrates spatial database queries (see Subgroup Discovery)
[Figure: example spatial model tree T with regression nodes (e.g. Y = a + bX1) and split nodes (e.g. X’3 ≤ α)]
Mining Tracks in Space and Time
Tracks in Space and Time
[Figure: data types arranged by space complexity and time complexity: Points; Points, Lines, and Areas; Networks; Tracks in Space and Time]
Tracks in space and time
• Requirements:
• Point data
• Polygons
• Aggregations
• Networks
• Tracks, GPS/RFID/Sensor-Measurement
• Applications: Traffic prediction, Mobility analysis
• Examples
• Sampling, Event analysis, non-linear optimization
Mobility analysis based on GPS-tracks
introduction of new pricing model for poster sites based on GPS tracks
registration of contact frequencies with poster sites
contact extrapolation for target groups:
- socio-demographic characteristics
- residential areas
Media Trend Journal, Nov. 2006
Time patterns
Patterns / Questions
- How long (days) does it take till x% of objects visit all locations?
- How long does it take till x% of objects visit at least one location twice?
Applications
- determine mobility of a group of people
- reach of poster networks
- find popularity of locations (theatres, supermarkets, hospitals)
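The first question can be answered from a visit log by finding, for each object, the day on which it has seen every location, and then reading off the day at which the required share of objects is reached. The function name, the log format (object, day, location) and the data below are assumptions for illustration.

```python
import math

def days_until_share_visited_all(visits, locations, share):
    """First day on which at least `share` of all objects have visited every location.

    visits: iterable of (object_id, day, location); returns None if never reached.
    """
    first_visit = {}
    for obj, day, loc in visits:
        key = (obj, loc)
        first_visit[key] = min(day, first_visit.get(key, day))
    objects = sorted({obj for obj, _, _ in visits})
    completion_days = []
    for obj in objects:
        days = [first_visit.get((obj, loc)) for loc in locations]
        if None not in days:
            completion_days.append(max(days))  # day the last missing location was seen
    needed = math.ceil(share * len(objects))
    completion_days.sort()
    if needed < 1 or needed > len(completion_days):
        return None
    return completion_days[needed - 1]

log = [
    ("a", 1, "L1"), ("a", 3, "L2"),
    ("b", 2, "L1"), ("b", 2, "L2"),
    ("c", 1, "L1"),
]
half = days_until_share_visited_all(log, ["L1", "L2"], share=0.5)
everyone = days_until_share_visited_all(log, ["L1", "L2"], share=1.0)
```

With three objects, half of them means two, and the second object completes its tour on day 3; object c never visits L2, so a share of 100% is never reached.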
Modelling tasks
Modelling mobility for cities with GPS-measurements for the overall population
Predicting mobility for cities without measurements (hard task!)
Extrapolating predictions in time
GeoPKDD - FET Project IST-014915
Geographic Privacy-aware Knowledge Discovery and Delivery December 2005 – November 2008
Project Leader: Fosca Giannotti
http://www.geopkdd.eu
General Project Idea
extracting user-consumable forms of knowledge from large amounts of raw geographic data referenced in space and in time
knowledge discovery and analysis methods for trajectories of moving objects, which change their position in time, and possibly also their shape or other significant features
devising privacy-preserving methods for data mining from sources that typically contain personal sensitive data
The Consortium
ID | Acronym | Partner | Country
1 | KDDLAB | Knowledge Discovery and Delivery Laboratory, ISTI-CNR, Istituto di Scienza e Tecnologie dell’Informazione, Pisa. http://www.isti.cnr.it/ - jointly with Univ. Pisa, Dept. of Computer Science http://www.di.unipi.it | I
2 | LUC | Univ. Limburg, Theoretical Computer Science Group. http://www.luc.ac.be/theocomp | B
3 | EPFL | EPFL, Lab. DB, Lausanne. http://lbdwww.epfl.ch/e/ | CH
4 | FAIS | Fraunhofer Institute for Autonomous Intelligent Systems, Sankt Augustin. http://www.ais.fraunhofer.de/ | D
5 | WUR | Wageningen UR, Centre for GeoInformation. http://cgi.girs.wageningen-ur.nl/ | NL
6 | CTI | Research Academic Computer Technology Institute, Research and Development Division. http://www.cti.gr/ - jointly with Univ. Piraeus, Dept. of Informatics http://www.unipi.gr | GR
7 | UNISAB | Sabanci University, Faculty of Engineering and Natural Sciences. http://www.sabanciuniv.edu/ | TK
8 | WIND | WIND Telecomunicazioni SpA, Direzione Reti Wind Progetti Finanziati & Technology Scouting. | I
Geographic Privacy-aware Knowledge Discovery Process
[Figure: Geographic Privacy-aware Knowledge Discovery Process. Components: telecommunication company (WIND); trajectory reconstruction; trajectories warehouse; privacy-aware data mining; ST patterns, e.g. p(x)=0.02; interpretation and visualization; GeoKnowledge; privacy enforcement; public administration or business companies. Applications: traffic management, accessibility of services, mobility evolution, urban planning, ...; bandwidth/power optimization, mobile cells planning, ...; aggregative location-based services]
GeoPKDD – Specific Goals
models for moving objects, and data warehouse methods to store their trajectories
knowledge discovery and analysis methods for moving objects and trajectories
techniques to make such methods privacy-preserving
techniques for reasoning on spatio-temporal knowledge and on background knowledge
techniques for delivering the extracted knowledge within the geographic framework
From Traces to Trajectories: the Source Data
GSM network
Entering the cell
- e.g. (UserID, time, IDcell, in)
Exiting the cell
- e.g. (UserID, time, IDcell, out)
Movements inside the cell?
- e.g. (UserID, time, X, Y, IDcell)
streams of log data of mobile phones, e.g. cells in the GSM/UMTS network
Real trajectories are continuous functions
Logs are discrete samplings of real trajectories, dependent on the wireless network technology
- irregular granularity in time and space
- possible imperfection/imprecision
An approximate reconstruction of the real trajectory from its log traces is needed
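One simple way to approximate a continuous trajectory from discrete log records is linear interpolation between consecutive observations. The slides do not prescribe a reconstruction method, so this choice, the function name and the (time, x, y) record layout are assumptions for illustration.

```python
from bisect import bisect_right

def reconstruct_position(log, t):
    """Estimate the (x, y) position at time t by linear interpolation.

    log: list of (time, x, y) records with strictly increasing times.
    Before the first or after the last record the nearest observation is returned.
    """
    times = [record[0] for record in log]
    if t <= times[0]:
        return log[0][1:]
    if t >= times[-1]:
        return log[-1][1:]
    i = bisect_right(times, t)  # first record strictly later than t
    (t0, x0, y0), (t1, x1, y1) = log[i - 1], log[i]
    a = (t - t0) / (t1 - t0)    # fraction of the way from record i-1 to record i
    return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))

# Irregularly sampled trace of one user (times in seconds, positions in metres).
trace = [(0, 0.0, 0.0), (10, 100.0, 0.0), (30, 100.0, 200.0)]
position_5 = reconstruct_position(trace, 5)
position_20 = reconstruct_position(trace, 20)
```

With cell-level logs the positions themselves are only known up to the cell extent, so an interpolated point is an estimate within that imprecision, not an exact location.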
Movement patterns
Clustering
- Group together similar trajectories
- For each group produce a summary
Frequent patterns
- Discover frequently followed (sub)paths
Classification
- Extract behaviour rules from history
- Use them to predict behaviour of future users
[Figure: example trajectories with shares 60%, 7%, 8%, 5%, 20% and an unknown case (?)]
Source: Pedreschi & Giannotti, 2005
Why emphasis on privacy?
More, better data are gathered, more vulnerability from correlation
On the other hand, more and new data bring new opportunities
Need to maintain privacy without giving up opportunities
Need to obtain social acceptance through demonstrably trustworthy solutions
Privacy in GeoPKDD ... is a technical issue, besides ethical, social and legal, in the specific context of ST data
How to formalize privacy constraints over ST data and ST patterns?
- E.g., anonymity threshold on clusters of individual trajectories
How to design DM algorithms that, by construction, only yield patterns that meet the privacy constraints?
(
0)
0 0(
1
)
p
p
p
p
n
−
−
⋅
137 Michael MayTutorial Geographic and Spatial Data Mining
Causal Inference from Statistical Spatio-Temporal Data
Current project at IAIS for newspaper publisher:
Sales prediction of individual shops.
What happens if a shop closes or is sold out? Predict to which alternative shop customers go.
Spatio-Temporal Clustering of shops
Time Series Prediction
Modeling customer behavior
⇒ Causal inference about customer behavior
„If shop A closes, n% of A‘s customers go to B, m% to C“