
Tutorial on Geographic and Spatial Data Mining


Michael May

15th Italian Symposium on Advanced Database Systems (SEBD'07), Torre Canne, Italy, June 17th

Fraunhofer Society

• Joseph von Fraunhofer, German physicist and entrepreneur
• Fraunhofer mission:
- Do state-of-the-art research and use it in challenging customer projects
- Funding is 33% research grants, 33% customer projects, 33% institutional funding
• 57 institutes, 40 locations, 12,000 employees, €1 billion annual volume

Fraunhofer IAIS: Intelligent Analysis and Information Systems

"From sensor data to business intelligence, from media analysis to visual information systems: our research allows companies to do more with data"

• New name, long-standing experience
- Founded in 2006 as a merger of the Fraunhofer institutes AIS and IMK
• 230 people: scientists, project engineers, technical and administrative staff
• Located on Fraunhofer Campus Schloss Birlinghoven/Bonn
• Joint research groups and cooperation with the University of Bonn

Fraunhofer IAIS: research and projects

Core research areas:

• Machine learning and adaptive systems
• Data mining and business intelligence
• Automated media analysis
• Interactive access and exploration

Objectives

• Although it is about statistical concepts, algorithms and data structures, the tutorial has a practical, application-oriented focus
• Integration of various technologies and algorithms: how do they combine?
• Covers a broad range
• I do not assume familiarity with spatial concepts, but I do assume some basic familiarity with data mining approaches
• Three objectives:
- to stimulate research on issues related to spatial data mining
- to stimulate development of more efficient spatial databases tailored for data mining applications
- to stimulate real-world applications

A main message

• Spatial data mining is not an esoteric research topic; it is a practically and commercially very important, and sometimes business-critical, field!
• Later I give an example where the value of several dozen companies depends directly on the predictions given by our spatial data mining algorithms.

Spatial vs. Geographic Data Mining

• Geographic data is data related to the earth
• Spatial data mining deals with physical space in general, from the molecular to the astronomical level
• Geographic data mining is a subset of spatial data mining
• Almost all geographic data mining algorithms can work in a general spatial setting (with the same dimensionality)
• This tutorial focuses on geographic data in 2D, but most algorithms work on spatial data in general
• I do not talk about the specifics of molecular data, face detection, etc.

Agenda

Introduction – Spatial and Geographic Data Mining
Part I: Basic Concepts – Spatial Databases and GIS
• Spatial Data Types
• Spatial Queries
• Construction of Complex Features
Part II: Exploratory Analysis of Spatial Data
Part III: Spatial and Geographic Data Mining Methods
• Autocorrelation
• Mining Point Data – Clustering, Kriging
• Mining Points, Lines, Areas – Clustering, Subgroup Discovery, Association Rules
• Mining Networks – A practical case study
• Mining Tracks in Space and Time – Mining from GPS Data
Challenges
Summary

Introduction – Spatial Data Mining

A classical example of spatial analysis

Dr. John Snow: investigating the causes of a cholera epidemic, London, September 1854

A good representation is often the key to solving a problem

[Figure label: disease cluster]

Good representation because ...
• It represents the spatial relation of objects of the same type
• It represents the spatial relation of objects to other objects
• It is not only important where a cluster is, but also what else is there (e.g. a water pump)!
• It shows only relevant aspects and hides irrelevant ones

Goals of Spatial Data Mining

• Identifying spatial patterns
• Identifying spatial objects that are potential generators of patterns
• Identifying information relevant for explaining the spatial pattern (and hiding irrelevant information)
• Presenting the information in a way

Spatial Data Mining

Data Mining + Geographic Information Systems = Spatial Mining

Basic Concepts

Spatial Databases and GIS

Commercial
• Where to build a new supermarket?
• Where are the customers that want to buy new product X?
• How many cars pass the main road per hour?
• Does it pay to install new antennas?
• What percentage of young females sees a billboard located in Ripley Avenue?

Public Sector
• Are there clusters of a certain disease?
• Is there a relationship between poverty and death rate?
• Are there crime hot spots or patterns?

Geographic layers: Buildings, Rivers, Streets, Schools, Hospitals, Factory, Medical establishment, Shopping areas

Attribute data: Persons per household, No. of cars, Long-term illness, Age, Profession, Ethnic group, Unemployment, Education, Migrants
Elements of a spatial database

• Spatial data types
• Spatial operators (e.g. INSIDE)
• Spatial indexes
• Spatial query language
• Metadata

Example from Oracle Spatial:

    SELECT c.holding_company, c.location
    FROM competitor c, bank b
    WHERE b.site_id = 1604
      AND SDO_WITHIN_DISTANCE(c.location, b.location, 'distance=2 unit=mile') = 'TRUE'

Spatial Datatypes

Two basic types of representation: Fields and Discrete Objects

• Fields: raster data
• Discrete objects: vector data model (e.g. line, area)

Vector Data: Data Structure

Ordered sets of xy-coordinates defining points, lines, or polygons (3D or 4D also possible)

Data structure:
• Point: (5,10)
• Line (polyline): ((5,10),(9,16),(12,17)), with straight lines between points
• Area (polygon): ((5,10),(9,16),(12,17), …), closed by drawing a line from the last to the first coordinate

Properties:
• Easy to scale (linear transformation)
• Storage efficient
• Relationships between objects (e.g. overlap) are not explicitly represented
• Also known as the "spaghetti model"
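
The coordinate-list structures above can be sketched in a few lines of Python. This is only an illustration, assuming plain tuples and lists in place of a real geometry type; the helper names (`closed_ring`, `polyline_length`, `polygon_area`) are hypothetical and not part of any GIS library.

```python
import math

# Hypothetical stand-ins for vector geometries: plain coordinate tuples/lists.
point = (5, 10)
line = [(5, 10), (9, 16), (12, 17)]   # polyline: straight segments between vertices
area = [(5, 10), (9, 16), (12, 17)]   # polygon: the ring is closed implicitly


def closed_ring(vertices):
    """Append the first vertex so the last edge runs from the last to the first coordinate."""
    return list(vertices) + [vertices[0]]


def polyline_length(vertices):
    """Sum of the straight-line segment lengths between consecutive vertices."""
    return sum(math.dist(a, b) for a, b in zip(vertices, vertices[1:]))


def polygon_area(vertices):
    """Shoelace formula over the implicitly closed ring."""
    ring = closed_ring(vertices)
    twice_area = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(twice_area) / 2


print(polyline_length(line))
print(polygon_area(area))
```

Note that the same vertex list describes either a line or an area; only the interpretation (open or closed) differs, which is why layers usually fix one geometry type.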

Two Main Types of Vector Data

• Irregular tessellations: closed polylines that partition the space
• Discrete isolated objects: point, line, area (polygon)

Tessellations are very useful for aggregation of discrete objects and for feature extraction

UK, Greater Manchester, Stockport

Descriptions of objects are organized in relations (database tables). Each row in a table describes one object.

Different categories of objects are organized in separate relations, each having its own set of attributes.

Buildings (Geometry, Address, Type, …)

ID | Geometry      | Address            | Type
1  | (1,1),(2,2),… | Gladstone Street 5 | 1
2  | (3,3),(4,4),… | Islington Road 2   | 2
3  | (5,5),(6,6),… | Ripley Avenue 23   | 1

Hospitals (Geometry, Address, Phone, #Beds, …)

ID | Geometry      | Address       | Phone
1  | (1,1),(2,2),… | Stepping Hill | 234567
2  | (3,3),(4,4),… | Great Moore   | 567897

Further layers: Rivers, Streets, Schools, Factory

Hierarchy

Often data are organized in spatial hierarchies, e.g.
• Country
• State
• Zip Area
• Voting District
• Parcel

Hierarchies may overlap

Example (UK census data): County → District 1, District 2, …, District n → Ward 1, Ward 2, …, Ward n

Representation of data in a spatial database

A set of relations R_1, ..., R_n such that each relation R_i has a geometry attribute G_i or an identifier A_i such that R_i can be linked (joined) to a relation R_k having a geometry attribute G_k

- Geometry attributes G_i consist of ordered sets of x,y-pairs defining points, lines, or polygons
- Different types of spatial objects are organized in different relations R_i (geographic layers), e.g. streets, rivers, enumeration districts, buildings
- Each layer can have its own set of attributes A_1, ..., A_n and at most one geometry attribute G

This representation does not fit well with standard data mining approaches! This is where the specific research challenge for geographical data mining comes from!

Raster Data

How to represent phenomena conceived as fields?
• Divide the world into square cells
• No variation within cells
• Cell value may be average, max, min, sum, central point, …
• Represent discrete objects as collections of one or more cells
• Represent fields by assigning attribute values to cells

[Figure: raster representation. Each color represents a different value of a nominal-scale field. Legend: mixed conifer, Douglas fir, oak savannah, grassland. Source: Longley et al. (2001)]

Raster and Vector: Comparison

Raster Model

Advantages:
• Simple data structure
• Simple logical and algebraic structures
Disadvantages:
• Large data volumes
• Imprecise geometry
• Expensive transformations of coordinates
• Implicit coordinates

Vector Model

Advantages:
• Geometry specified by coordinates
• Topological relationships
• High geometric accuracy
• Storage efficient
Disadvantages:
• Complex data structure
• Compute-intensive logical and algebraic operations

Remember: "Raster is vaster and vector is correcter"

Spatial Queries

Problem: the vector data model does not explicitly capture relationships among objects. They have to be inferred using spatial predicates.

Spatial predicates evaluate to true or false for given objects. A query returns
• the set of objects of which the statement is true; or
• using aggregates, the [minimum, maximum, sum, average, …] over the object(s) of which the statement is true

Queries are evaluated using a spatial join among different relations (layers).

Here is where database technology and spatial indexing come in to do the job efficiently! Still, such queries can be extremely time-consuming!
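
To make the idea of a predicate-driven spatial join concrete, here is a minimal sketch in Python. It assumes a naive nested-loop join without any spatial index, a ray-casting point-in-polygon test as the predicate (points exactly on the boundary are not handled specially), and made-up coordinates; the layer contents and function names are hypothetical.

```python
from collections import Counter


def point_in_polygon(pt, polygon):
    """Ray-casting test for a point against an implicitly closed vertex list."""
    x, y = pt
    inside = False
    n = len(polygon)
    for k in range(n):
        (x1, y1), (x2, y2) = polygon[k], polygon[(k + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_join(left, right, predicate):
    """Nested-loop join: every (left_id, right_id) pair for which the predicate is true."""
    return [(lid, rid)
            for lid, lgeom in left.items()
            for rid, rgeom in right.items()
            if predicate(lgeom, rgeom)]


# Two hypothetical layers: point objects and area objects (coordinates invented).
hospitals = {"Stepping Hill": (1, 1), "Great Moore": (6, 2)}
districts = {"ED1": [(0, 0), (4, 0), (4, 4), (0, 4)],
             "ED2": [(4, 0), (8, 0), (8, 4), (4, 4)]}

pairs = spatial_join(hospitals, districts, point_in_polygon)
per_district = Counter(rid for _, rid in pairs)       # aggregate: hospitals per district
print(pairs, per_district)
```

The nested loop evaluates the predicate for every pair of objects, which is exactly the cost that spatial indexing is meant to avoid.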

Spatial Predicates: Egenhofer‘s 9-intersection model

• Each object has an interior (i), an exterior (e) and a boundary (b)
• This results in a 9-intersection matrix for the relation between two spatial objects A and B
• A cell contains a 1 iff the intersection of the corresponding point sets is non-empty

Three example matrices for objects A and B (rows and columns labelled e, i, b):

      e i b          e i b          e i b
  e   1 1 1      e   1 1 1      e   1 0 0
  i   1 0 0      i   1 1 1      i   1 1 1
  b   1 0 1      b   1 1 1      b   1 0 0

Spatial Predicates

9-intersection model for 2 regions (Egenhofer 1991):
• A inside B, B contains A
• A contains B, B inside A
• A covered-by B, B covers A
• A covers B, B covered-by A
• A equals B, B equals A
• A overlaps B, B overlaps A
• A meets B, B meets A
• A disjoint B, B disjoint A

[Figure label: INSIDE]

Spatial Queries: Distance

Metric spaces:
→ Symmetry: d(i,j) = d(j,i)
→ Triangle inequality: d(i,k) ≤ d(i,j) + d(j,k)
- Euclidean distance:

\[ d_e(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \]

Distance relation between polygons: minimum distance between any 2 points of the polygons

Spatial Queries: Distance and Proximity

• Select the nearest neighbor in space
• Select all objects within a certain distance

[Figure: point X on Main Street and its distance to Hospital #1 and Hospital #2]

Example (Oracle Spatial): select all competitors and their locations within a distance of 2 miles from the bank with id 1604

    SELECT c.holding_company, c.location
    FROM competitor c, bank b
    WHERE b.site_id = 1604
      AND SDO_WITHIN_DISTANCE(c.location, b.location, 'distance=2 unit=mile') = 'TRUE'
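
Outside a database, the two proximity queries can be sketched as linear scans. The following Python fragment is a minimal illustration under simplifying assumptions: point objects only, planar Euclidean distance, no spatial index, and invented coordinates and names.

```python
import math


def within_distance(origin, candidates, max_dist):
    """Names of all candidates whose Euclidean distance to origin is at most max_dist."""
    return sorted(name for name, loc in candidates.items()
                  if math.dist(origin, loc) <= max_dist)


def nearest_neighbor(origin, candidates):
    """Name of the candidate closest to origin."""
    return min(candidates, key=lambda name: math.dist(origin, candidates[name]))


# Hypothetical data: one bank and three competitors, coordinates in miles.
bank = (0.0, 0.0)
competitors = {"A": (1.0, 1.0), "B": (3.0, 0.0), "C": (0.0, -1.5)}

print(within_distance(bank, competitors, 2.0))
print(nearest_neighbor(bank, competitors))
```

A spatial database answers the same questions without scanning every object, by pruning candidates through its spatial index first.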

Distance – non-metric

• Non-metric spaces
→ Asymmetry: d(i,j) ≠ d(j,i)
→ Triangle inequality does not hold
• Examples: drive time, driving distance

Stockport Database Schema

[Diagram: Stockport database schema]
• Spatial hierarchy: County, District, Wards, Enumeration district (ED)
• Attribute data: 95 tables with census data (TAB01, …, TAB61, …, TAB95), ~8000 attributes, linked to ED by a standard join (= zone_id)
• Geographical layers: 85 tables (e.g. Water, River, Building, Street, Shopping Region, Vegetation), linked by spatial joins ("spatially interacts", "inside")

Relations between objects are implicit; this is very flexible and storage efficient, but compute-intensive

Implementation of Spatial Databases

Many popular databases have spatial extensions by now:
• Oracle Spatial
• PostgreSQL
• MySQL (since 4.1)

Construction of Complex Features

Spatial Functions

• Example: Oracle Spatial 10g
• Return a geometry: Union, Difference, Intersect, XOR, Buffer, CenterPoint, ConvexHull
• Return a number: Length, Area, Distance

[Figure: Original, Union, Difference, Intersect, XOR. Source: http://colab.cim3.net/file/work/SICoP/2006-06-20/2006-06-21/xlopez06212006.ppt]

These functions construct new geometry objects from existing ones using point set theory. They are implemented efficiently using computational geometry.
Constructing Cells: Buffer

• How many competitors are in the catchment area of my shop?
= How many shops are within the buffer?

Simplistic approximation:
• Does not take account of barriers (rivers, highways)
• Does not take the road system into account

Voronoi diagram

• Which are my nearest competitors?
• What is the coverage of my radio antenna?
= Find Voronoi neighbors

Approximation:
• Does not take account of barriers (rivers, highways)
• Does not take the road system into account

Decompose space into regions around each point in a set of points S such that all the points in the region around p_i are closer to p_i than to any other point in S.

Complexity: O(n lg n)

Related data structure: Delaunay triangulation (graph of Voronoi neighbors)

Drive-Time Zone (Dijkstra)

How many competitors are in the catchment area of my shop?

Realistic approximation:
• Takes account of barriers (rivers, highways)
• Takes into account the road system and the maximum speed on each road

The zone consists of all street segments within a drive-time distance <= d from a given starting point. Use Dijkstra's algorithm.

Complexity: O(V^2) to O(V lg V + E), depending on the data structures used for the implementation
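
A drive-time zone can be sketched with Dijkstra's algorithm and a cut-off. The Python fragment below is an illustrative sketch, not the implementation used in the tutorial: it assumes a directed road graph given as an adjacency dictionary whose edge weights are travel times in minutes (segment length divided by allowed speed), and the small network is invented. With the binary heap from `heapq`, the running time is O((V + E) lg V), which lies between the two bounds quoted above.

```python
import heapq


def drive_time_zone(graph, start, max_minutes):
    """Return {node: travel time} for all nodes reachable from start within max_minutes.

    graph maps each node to a list of (neighbor, minutes) pairs; edges are directed,
    so one-way streets and asymmetric travel times can be represented.
    """
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        t, node = heapq.heappop(queue)
        if t > best.get(node, float("inf")):
            continue                                  # stale queue entry
        for nxt, minutes in graph.get(node, []):
            t_next = t + minutes
            if t_next <= max_minutes and t_next < best.get(nxt, float("inf")):
                best[nxt] = t_next
                heapq.heappush(queue, (t_next, nxt))
    return best


# Hypothetical road network around a shop; weights are minutes of driving.
roads = {
    "shop": [("a", 2.0), ("b", 5.0)],
    "a": [("b", 1.0), ("c", 6.0)],
    "b": [("d", 4.0)],
    "c": [],
    "d": [("shop", 1.0)],
}

zone = drive_time_zone(roads, "shop", 7.0)
print(zone)
```

Counting the competitors whose nearest network node falls inside the returned zone then answers the catchment-area question.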

Pre-processing

Several of the feature extractions are computationally quite expensive (at least for large data sets), and there is often a combinatorial explosion of features that might be constructed.

Several strategies are used in spatial warehouse design:
• Selective pre-processing: materializing important joins in advance (storage requirements!)
• Approximate precomputing: e.g. using the minimum bounding rectangle to approximate a polygon
• Schema design (e.g. star schema with selective materialization): Han J., Stefanovic N., Koperski K. Selective Materialization: An Efficient Method for Spatial Data Cube Construction. PAKDD, 1998.
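
The approximation strategy can be illustrated with a minimum bounding rectangle (MBR) filter. The following Python sketch assumes geometries stored as vertex lists and uses invented layer contents; it only shows the cheap filter step, which yields candidate pairs that an exact (and more expensive) geometric test would still have to confirm.

```python
def mbr(vertices):
    """Minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_intersects(a, b):
    """True if two rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_pairs(layer_a, layer_b):
    """Filter step: pairs whose MBRs intersect. False positives are possible, misses are not."""
    boxes_b = {rid: mbr(geom) for rid, geom in layer_b.items()}   # precomputed once
    return [(lid, rid)
            for lid, geom in layer_a.items()
            for rid, box in boxes_b.items()
            if mbr_intersects(mbr(geom), box)]


# Hypothetical layers.
rivers = {"river": [(0, 0), (10, 1), (12, 5)]}
buildings = {"b1": [(3, 2), (4, 2), (4, 3), (3, 3)],
             "b2": [(20, 20), (21, 20), (21, 21)]}

print(candidate_pairs(rivers, buildings))
```

Here `b1` survives the filter although it may not actually touch the river; this is the price of the approximation, and the reason a refinement step normally follows.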

Spatial Database of Vector Objects: Discussion

• Relations between objects are implicit
• Very flexible: depending on the analysis task, different relationships can be constructed
• Storage efficient; no overhead for storing relationship information
• Compute-intensive (thus spatial indexing is very important)
• Consider what and when to materialize
• Very rich possibilities to create new, non-trivial objects from existing ones
• Makes feature extraction an important topic for data mining
• Inherently multi-relational setting (but not first-order)
• Could also be formulated in a deductive database setting

Interactive Visualization of Spatial Data – Exploratory Data Analysis

(Work by G. Andrienko & N. Andrienko, H. Voss and others at Fraunhofer IAIS)

For the theory behind CommonGIS, see the book: Andrienko, N. and Andrienko, G.: Exploratory Analysis of Spatial and Temporal Data - A Systematic Approach, Springer, 2005

Geographic Information Systems and CommonGIS

• Many commercial tools available
- ESRI ArcGIS
- MapInfo
- Intergraph
- Manifold
• But CommonGIS is different and unique …
- Map-based exploratory data analysis
- Stresses interactive visualization and manipulation of statistical data in space
- Elaborate facilities for time-series visualization
• CommonGIS can be acquired for non-commercial use by educational institutions for no fee

CommonGIS = Fraunhofer IAIS tool for map-based exploratory data analysis
- Combines interactive cartography and statistics
- Multi-dimensional: time-series visualization and analysis
- Combines vector-raster transformation
- Multivariate decision support: weighted sums, ideal point analysis, similarity analysis, dominant attribute
- Integration with Weka (clustering, decision trees)

CommonGIS: Visual analysis of spatial data

• Interactive spatial search for geographic objects and recognition of spatial patterns:
- dynamic choropleth maps, pie charts, bar charts, etc.
- with dynamic removal of outliers
- and dynamic queries
• Comparison of attribute values of geographic objects (relations and correlations) and comparison of spatial patterns (spatial correlations):
- (linked) dynamic maps and interactive diagrams

CommonGIS: Visual analysis of spatio-temporal data

• CommonGIS as an interactive browser to study how a spatial pattern evolves over time:
- time-aware maps (animations)
- time series charts
• CommonGIS as an interactive browser for temporal behaviours of objects:
- set of controls for analysing time intervals (object animations)
• CommonGIS as an interactive browser of discrete space-time events to find spatio-temporal clusters:
- space-time cube

Time-Series: Sales per Shop and Product Category

[Figure: product categories: bakery, stand-up café, sit-down café, terrace]

Different time hierarchies (year, quarter, month, day, …)

CommonGIS: Data transformation

Transformation of data for further analysis:
• Attribute transformations:
- calculate statistical indices
- transform and combine attribute data arithmetically
- dynamic classifiers (linked with dynamic choropleth map)
- cross classifiers (linked with dynamic choropleth map)
• Geographic transformations:
- query, transform, combine, derive raster data
- illumination model
- raster -> vector transformations (i.e. raster -> area aggregation)

Geographic and Spatial Data Mining Methods

Autocorrelation

Spatial Variation

How are variables distributed in space?

Tobler's First Law of Geography: "Everything is related to everything else, but near things are more related than distant things."

⇒ the distribution of variables depends on space
⇒ variables are autocorrelated

[Figure: soil moisture of a field. Source: Franke, diploma thesis, Leipzig Univ., 2006]

Spatial Autocorrelation: Binary Example

• Binary attribute (blue, white)
• Autocorrelation with respect to the four immediate neighbors
• Moran index (here):

\[ I = \frac{n_{\text{equal}} - n_{\text{change}}}{n_{\text{equal}} + n_{\text{change}}} \]

[Figure: example patterns with I = 0.86, I = 0.39, I = 0.00 and I = -1.00. Source: Goodchild, CATMOG, GeoBooks, Norwich, 1986]
Moran‘s I

• Moran's I is a measure of spatial autocorrelation. It is a weighted product-moment correlation coefficient, where the weights reflect geographic proximity, and it is used to detect departures from spatial randomness. Departures from randomness indicate spatial patterns such as clusters and geographic trends.
• Values of I larger than 0 indicate positive spatial autocorrelation; values smaller than 0 indicate negative spatial autocorrelation.
• z – attribute of interest; w – weight; n – number of areal objects

\[ I = \frac{n \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} w_{ij} (z_i - \bar{z})(z_j - \bar{z})}{\left( \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} w_{ij} \right) \sum_{i=1}^{n} (z_i - \bar{z})^2} \]

Example: n = 4, areas A, B, C, D, weight matrix w_ij:

      A B C D
  A   0 1 1 0
  B   1 0 1 1
  C   1 1 0 1
  D   0 1 1 0
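
The formula for Moran's I translates almost directly into code. The sketch below is plain Python without external libraries; the weight matrix `W` is the 4-area example above, while the attribute values and the 2x2 checkerboard are invented for illustration. It assumes the attribute is not constant, since the denominator would then be zero.

```python
def morans_i(z, w):
    """Moran's I for attribute values z and a square weight matrix w (diagonal ignored)."""
    n = len(z)
    mean = sum(z) / n
    dev = [v - mean for v in z]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    cross = sum(w[i][j] * dev[i] * dev[j] for i, j in pairs)   # weighted cross-products
    w_sum = sum(w[i][j] for i, j in pairs)                     # sum of all weights
    variance_term = sum(d * d for d in dev)
    return (n / w_sum) * (cross / variance_term)


# Weight matrix from the example: areas A, B, C, D (1 = neighbors).
W = [[0, 1, 1, 0],
     [1, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 1, 1, 0]]

z_example = [1.0, 2.0, 3.0, 4.0]          # hypothetical attribute values for A..D
print(morans_i(z_example, W))

# 2x2 checkerboard with four-neighbor (rook) adjacency: every neighbor pair differs.
W_grid = [[0, 1, 1, 0],
          [1, 0, 0, 1],
          [1, 0, 0, 1],
          [0, 1, 1, 0]]
print(morans_i([1, 0, 0, 1], W_grid))     # perfect negative autocorrelation
```

The checkerboard reproduces the extreme case I = -1 of the binary example, where each cell differs from all of its immediate neighbors.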

Spatial Autocorrelation

• Similarity in location indicates similarity in attribute value
• Differs from temporal autocorrelation:
- time series show 1-dimensional autocorrelation; spatial autocorrelation spreads in 2 or 3 dimensions
- only forward causality in time series; the direction of causality is not restricted in space
• Depends on scale

[Figures: temperature of sunspots; sunspot time series (number of sunspots per year)]

Effects of Autocorrelation

• Makes spatial abstraction possible
• Makes standard approaches to analysis inapplicable: most statistics assume i.i.d. data
• Makes local inference attractive: Kriging, kNN, …
• Makes the choice of sampling interval hard: autocorrelation depends on scale
• Makes interpolation easier than extrapolation
• Zero autocorrelation = independence of location

[Figure: spatial autocorrelation (correlation between -1 and +1) as a function of distance]

Problem types for Spatial Data Mining

• Spatial data mining := partially automated search for patterns and models in large spatial databases

Classification of methods along the following hierarchy:
• Points
• Points, lines and areas
• Networks

Handling spatial data in Data Mining – Basic Options

• Treat as ordinary variables
- no special algorithms needed
- spatial properties ignored, e.g. discontiguous areas
• Make spatial relationships explicit, e.g. infer topological relationships
- expensive, but allows normal algorithms to be used
- can be done as pre-processing or dynamically (the latter requires specialized algorithms)
• Specialized algorithms
- neighborhood methods, kriging, Gaussian processes, density-based clustering, …
• Use a proper combination of data, preprocessing, algorithms, and interaction software!

Mining Point Data

Clustering spatial point data

Point data conceived as discrete objects

Many approaches exist for clustering spatial point data:
• In statistics, measures of spatial randomness or non-randomness have been developed (e.g. Ripley 1991, Cressie 1993)
- Ripley's K function, which measures deviation from complete spatial randomness (as exemplified by a Poisson process)
- Moran's I, which measures autocorrelation
• Bayesian approaches, often coming from image analysis (cf. Lawson et al. 2002)
• In geography, spatial clustering algorithms have been developed (Openshaw, GAM,

Density-Based Clustering – a KDD approach [Ester et al. 1996]

Algorithm DBSCAN:
• Suitable for large databases
• Discovers areas of high density and turns them into clusters
• Discovers clusters of arbitrary shape
• Can handle noise

Note: a relatively straightforward extension to vector data is possible (GDBSCAN); it requires a more complex definition of some key concepts (neighborhood and MinPts)

Clustering spatial data

• Distance-based clustering is inherently spatial
• But the assumption of convex clusters (e.g. k-means) is inappropriate for many "geographical" tasks

Definitions 1

Eps-neighborhood of a point p: N_ε(p) := {q ∈ D | dist(p, q) ≤ ε}

A point p is directly density-reachable from q iff
1. p ∈ N_ε(q)
2. |N_ε(q)| ≥ MinPts ("q is a core object")

- Not necessarily symmetric

[Figure: p is a border object, q is a core object; p is directly density-reachable from q, but q is not directly density-reachable from p]

The choice of Eps is crucial!

Definitions 2

Density-reachable: p is density-reachable from point q wrt. Eps and MinPts iff there is a chain of points p_1, …, p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i
- Transitive, not symmetric

Density-connected: p is density-connected to q iff there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts
- Symmetric

[Figure: p is density-reachable from q, but q is not density-reachable from p; p and q are density-connected to each other by o]
Density-connected clustering

A cluster C wrt. Eps and MinPts is a non-empty subset of the database D, where
(1) ∀ p, q: if p ∈ C and q is density-reachable from p wrt. Eps and MinPts, then q ∈ C
(2) ∀ p, q ∈ C: p is density-connected to q wrt. Eps and MinPts

• Non-covered points are noise
• Each cluster contains at least MinPts points
• Exactly one clustering

Algorithm DBSCAN – Basic Idea

• Check the Eps-neighborhood of every unclassified point in the database
• If the neighborhood of p contains at least MinPts points, a new cluster with p as core object is built
• Collect directly density-reachable objects from this set, merging clusters as necessary
• Terminate when no new point can be added to any cluster
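
The basic idea can be turned into a compact sketch. The Python code below is a simplified illustration, not the original implementation: it assumes 2D points, Euclidean distance, a brute-force neighborhood query instead of a spatial index, and the core-object condition |N_Eps(q)| ≥ MinPts from Ester et al. (1996). The sample points are invented.

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points in the Eps-neighborhood of points[idx] (including idx itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for non-covered points."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue                                  # already classified
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE                         # may later become a border point
            continue
        labels[i] = cluster_id                        # i is a core object: start a cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id                # border point, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:           # j is a core object: keep expanding
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels


sample = [(0, 0), (0, 1), (1, 0), (1, 1),             # dense group
          (10, 10), (10, 11), (11, 10),               # second dense group
          (5, 5)]                                     # isolated point
print(dbscan(sample, eps=1.5, min_pts=3))
```

With a spatial index behind `region_query`, each neighborhood lookup no longer scans the whole database, which is what makes the approach suitable for large data sets.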

Kriging-Spatial Interpolation

Kriging

• Developed by G. Matheron in the 1960s, based on work of D. Krige
• Geostatistical method of interpolation
• Point data conceived as samples from a continuous surface
• Results are smoothly varying surfaces
• Provides optimality given assumptions (best linear unbiased estimate)
• Variety of methods, e.g. Ordinary Kriging, Universal Kriging, Co-Kriging, Block Kriging, Stratified Kriging, Indicator Kriging, …

[Figure: • – measurements; ? – unknown values]

Spatial Variation

Problem:
• The spatial variation of a continuous attribute is often too irregular to be modelled by a simple, smooth mathematical function

Solution:
• Variation can be described by a stochastic surface

x – location in n-dimensional space
Z(x) – random variable of interest, e.g. soil moisture

A stochastic process is a family of random variables Z(x) over the index set D ⊂ ℝ^n:

\[ \{ Z(x) : x \in D \} \]

A Gaussian process is a stochastic process for which any finite set of Z-variables has a joint multivariate Gaussian distribution.

Components of Spatial Variation

• Structural component, having a constant mean or trend
• Random, but spatially correlated component (regionalized variable)
• Spatially uncorrelated random noise term

\[ Z(x) = m(x) + \varepsilon'(x) + \varepsilon'' \]

where m(x) is the trend, ε'(x) the autocorrelated component and ε'' the random noise.

[Figure: Z(x) plotted against x]

Stationarity

Problem:
• A spatial data set is a single realization of a random process
• Inference is impossible without further restrictions on spatial variation

Intrinsic stationarity (stationarity under translation):
• Constant mean (E[...] = 0) or trend (E[...] > 0):

\[ E[Z(x) - Z(x+h)] = \text{const.} \]

• The variance of differences at lag h is independent of location:

\[ E[\{Z(x) - Z(x+h)\}^2] = 2\gamma(h) \]

Isotropy (stationarity under rotation):
• The spatial process evolves the same in all directions

[Figure: two locations x and x+h separated by lag h]

Ordinary Kriging

• Assumptions:
- Intrinsic stationarity with a constant mean: the mean value is constant in the sampling area, and the variance of differences depends only on the distance h between sites

\[ E[Z(x) - Z(x+h)] = 0 \]

\[ \operatorname{Var}[Z(x) - Z(x+h)] = E[\{Z(x) - Z(x+h)\}^2] = E[\{\varepsilon'(x) - \varepsilon'(x+h)\}^2] = 2\gamma(h) \]

where γ(h) is the semivariance.

• Once structural effects have been accounted for, the remaining variation is homogeneous in variance, so that differences between sites are merely a function of the distance between them.

[Figure: two locations x and x+h separated by lag h]
Ordinary Kriging

Procedure:
1. Estimate the semivariance γ(h) from the data sample
2. Plot the experimental variogram
3. Fit a theoretical model to the experimental variogram
4. Estimate unknown values as a weighted sum of neighboring measurements; determine the optimal weights from the variogram

Semivariance and Experimental Variogram

ƒ semivariance depends only on distance (lag) h

ƒ estimate semivariance between all pairs of measurements with distance h (repeat for all possible h)

$$ \hat{\gamma}(h) = \frac{1}{2n} \sum_{i=1}^{n} \{ z(x_i) - z(x_i + h) \}^2 $$

[Figure: experimental variogram, γ(h) plotted against lag h]
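The estimator above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `experimental_variogram` and its parameters are hypothetical, and pairs are grouped into distance bins of width `lag_width`, a common practical choice for irregularly spaced samples that the slide itself does not prescribe.

```python
import numpy as np

def experimental_variogram(coords, values, lag_width, n_lags):
    """Return lag-bin centers and semivariance estimates (hypothetical helper)."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Every pair of measurements exactly once.
    i, j = np.triu_indices(len(values), k=1)
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sqdiff = (values[i] - values[j]) ** 2
    centers, gammas = [], []
    for b in range(n_lags):
        lo, hi = b * lag_width, (b + 1) * lag_width
        mask = (dist > lo) & (dist <= hi)   # assumption: bin (lo, hi] stands in for "distance h"
        if mask.any():
            centers.append(0.5 * (lo + hi))
            gammas.append(0.5 * sqdiff[mask].mean())  # 1/(2n) * sum of squared differences
    return np.array(centers), np.array(gammas)

# Four samples on a line, one unit apart.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
values = [1.0, 2.0, 4.0, 7.0]
lags, gammas = experimental_variogram(coords, values, lag_width=1.0, n_lags=3)
```

Plotting `gammas` against `lags` gives the experimental variogram used in the next step of the procedure.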

Variogram

ƒ nugget:

- γ(0) = 0 (by definition)

- nugget effect represents small scale variation and measurement errors

- estimate of ε''

ƒ range:

- spatial dependency

- here, variance of differences increases with distance

- two points are more similar the closer they are

ƒ sill:

- semivariance levels off

- variance of differences is independent of distance

[Figure: variogram, γ(h) plotted against lag h, with nugget, range, and sill marked]


Variogram Models

ƒ experimental variogram must be fitted to an appropriate variogram model

ƒ most commonly used are the spherical, exponential, linear or Gaussian model

[Figure: four panels plotting γ(h) against lag h: Spherical Model, Exponential Model, Linear Model, Gaussian Model]
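As a rough sketch, the four model families can be written as functions of the lag h. The parameterization below (nugget, partial sill `psill`, range `rng`, and `slope` for the linear model) is one common textbook convention and is an assumption here; the slides do not give the formulas, and other sources scale the range differently.

```python
import numpy as np

def _zero_at_origin(h, g):
    # gamma(0) = 0 by definition; the nugget only applies for h > 0.
    return np.where(np.asarray(h, dtype=float) > 0, g, 0.0)

def spherical(h, nugget, psill, rng):
    r = np.minimum(np.asarray(h, dtype=float) / rng, 1.0)  # flat beyond the range
    return _zero_at_origin(h, nugget + psill * (1.5 * r - 0.5 * r ** 3))

def exponential(h, nugget, psill, rng):
    h = np.asarray(h, dtype=float)
    return _zero_at_origin(h, nugget + psill * (1.0 - np.exp(-h / rng)))

def gaussian(h, nugget, psill, rng):
    h = np.asarray(h, dtype=float)
    return _zero_at_origin(h, nugget + psill * (1.0 - np.exp(-(h / rng) ** 2)))

def linear(h, nugget, slope):
    h = np.asarray(h, dtype=float)
    return _zero_at_origin(h, nugget + slope * h)  # no sill: keeps growing with h

lags = np.linspace(0.0, 30.0, 7)
curve = spherical(lags, nugget=0.1, psill=0.9, rng=10.0)
```

Fitting then means choosing the parameters so that the model curve follows the experimental variogram, for example by least squares.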


Interpolation of unknown Values

ƒ unknown value at location x_0 is estimated as weighted sum of neighboring measurements

$$ Z^*(x_0) = \sum_{i=1}^{n} w_i Z(x_i) $$

ƒ weights w_i are determined according to two restrictions

- Z*(x_0) is an unbiased estimate of Z(x_0)

- Z*(x_0) is an optimal estimate

ƒ Have to solve system of n+1 linear equations of semivariances and weights


Equation System

ƒ restriction on weights introduces Lagrange parameter φ (Restriction 1)

ƒ system of (n+1) equations must be solved to obtain optimal weights for each x_0

$$
\begin{pmatrix}
\gamma(x_1 - x_1) & \cdots & \gamma(x_1 - x_n) & 1 \\
\vdots & \ddots & \vdots & \vdots \\
\gamma(x_n - x_1) & \cdots & \gamma(x_n - x_n) & 1 \\
1 & \cdots & 1 & 0
\end{pmatrix}
\begin{pmatrix} w_1 \\ \vdots \\ w_n \\ \varphi \end{pmatrix}
=
\begin{pmatrix} \gamma(x_1 - x_0) \\ \vdots \\ \gamma(x_n - x_0) \\ 1 \end{pmatrix}
$$

ƒ Ordinary Kriging is an exact interpolator, i.e. interpolated value of a sample location will be identical with the measurement taken
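A minimal sketch of solving this system with NumPy follows. The function name `ordinary_kriging`, the isotropic variogram passed in as `gamma`, and the toy data are all assumptions made for illustration; the last row of the matrix enforces that the weights sum to one.

```python
import numpy as np

def ordinary_kriging(coords, values, x0, gamma):
    """Estimate Z*(x0) from samples; gamma(h) is a fitted variogram with gamma(0) = 0."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Distances between all sample sites, and from each site to x0.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    d0 = np.linalg.norm(coords - np.asarray(x0, dtype=float), axis=1)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = gamma(d)      # semivariances between sample sites
    A[n, n] = 0.0             # corner entry belonging to the Lagrange parameter
    b = np.ones(n + 1)
    b[:n] = gamma(d0)         # semivariances between sample sites and x0
    solution = np.linalg.solve(A, b)
    weights, phi = solution[:n], solution[n]
    return float(weights @ values), weights, phi

# Hypothetical variogram model (exponential, no nugget) and four samples on a unit square.
gamma = lambda h: 1.0 - np.exp(-h / 2.0)
coords = [(0, 0), (1, 0), (0, 1), (1, 1)]
values = [1.0, 2.0, 3.0, 4.0]
estimate, weights, phi = ordinary_kriging(coords, values, (0.5, 0.5), gamma)
```

Calling the function with `x0` equal to one of the sample sites returns that site's measurement, which illustrates the exact-interpolator property.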


Variants of Kriging

Universal Kriging

ƒ structural component may contain an external trend

Co-Kriging

ƒ interpolation for one attribute incorporates information of another, correlated attribute

ƒ sparse measurements of an expensive variable are supported by plentiful measurements of a cheap variable

Stratified Kriging

ƒ interpolation within sub-areas

ƒ equations are adjusted to avoid discontinuities on boundaries

More Details: Burrough, P., McDonnell, R 1998


Mining Points, Lines, and Areas


Points, Lines and Areas

[Figure: data types arranged along axes of space complexity and time complexity: Points; Points, Lines, and Areas]


Points, Lines and Areas

Requirements:
• Point data
• Polygons
• Aggregations

Applications:
• Customer Segmentation
• Catchment Areas
• Location Planning
• Radio Network Analysis

Examples:
• GDBScan Clustering
• Spatial Subgroup Mining
• Spatial Association Rules
• Spatial Model Trees


Clustering of Vector Data: GDBScan [Sander et al 1998]

Extension of DBSCan - Sample Instantiations

Neighborhood predicate: dist < ε | intersects/meets | neighbor

Minimum weight of neighborhood S: |S| ≥ MinCard | ∑ areas ≥ MinArea | f(S) ≥ MinF
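The generalization can be illustrated with a short sketch in which both the neighborhood predicate and the minimum-weight test are passed in as functions. This is a simplified illustration, not the published algorithm verbatim: the names `gdbscan`, `is_neighbor` and `min_weight` are hypothetical, the neighborhood search is a brute-force scan, and `is_neighbor` is assumed to be reflexive and symmetric.

```python
from math import dist

def gdbscan(objects, is_neighbor, min_weight):
    """Return one label per object: a cluster id (0, 1, ...) or -1 for noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(objects)

    def neighborhood(i):
        return [j for j in range(len(objects)) if is_neighbor(objects[i], objects[j])]

    cluster = 0
    for i in range(len(objects)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighborhood(i)
        if not min_weight([objects[j] for j in seeds]):
            labels[i] = NOISE            # may later become a border object
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border object reached from a core object
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nbrs = neighborhood(j)
            if min_weight([objects[k] for k in nbrs]):
                queue.extend(nbrs)       # j is a core object: keep expanding
        cluster += 1
    return labels

# DBSCAN instantiation: dist < eps and |S| >= MinCard.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = gdbscan(points,
                 is_neighbor=lambda a, b: dist(a, b) < 1.5,
                 min_weight=lambda S: len(S) >= 3)
```

For polygons, `is_neighbor` could test whether two geometries intersect or meet, and `min_weight` could compare the summed area of the neighborhood with MinArea.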


Spatial Subgroup Mining


Typical Data Mining representation

Data Mining for spatial data: very different from this representation

‘spreadsheet data’

exactly 1 table

atomic values


Subgroup Discovery Search (Klösgen 1996, Wrobel 1997)

ƒ Subgroup discovery searches for deviation patterns: subgroups with a disproportionately high share of the target value (or mean of the target variable)

ƒ Top-down search from most general to most specific subgroups, exploiting partial ordering of subgroups

S1 ≥ S2: S1 more general than S2

ƒ Beam search expands only the n best ones at each level

ƒ Evaluating hypotheses according to a quality function:

$$ \frac{p(T|C) - p(T)}{\sqrt{p(T)\,(1 - p(T))}} \; \sqrt{n} \; \sqrt{\frac{N}{N - n}} $$

N = total population

n = subgroup size

p(T) = target share in total population

p(T|C) = target share in subgroup

ƒ Extension to multi-relational representation in Wrobel (1997)
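The quality function and the beam search can be sketched as follows for a single table with a binary target. Everything here is illustrative: the function names, the row format (a list of dictionaries) and the toy data are assumptions, and the sketch assumes 0 < p(T) < 1 so that the denominator is non-zero.

```python
from math import sqrt

def quality(n, N, p_sub, p_tot):
    """Quality of a subgroup of size n with target share p_sub in a population of size N."""
    return (p_sub - p_tot) / sqrt(p_tot * (1 - p_tot)) * sqrt(n) * sqrt(N / (N - n))

def beam_search(rows, target, attributes, beam_width=3, depth=2):
    """Top-down search over conjunctions of attribute=value conditions."""
    N = len(rows)
    p_tot = sum(r[target] for r in rows) / N

    def evaluate(desc):
        members = [r for r in rows if all(r[a] == v for a, v in desc)]
        n = len(members)
        if n == 0 or n == N:             # empty or trivial subgroup
            return None
        return quality(n, N, sum(r[target] for r in members) / n, p_tot)

    beam, results = [()], []
    for _ in range(depth):
        candidates = {}
        for desc in beam:
            used = {a for a, _ in desc}
            for a in attributes:
                if a in used:
                    continue
                for v in sorted({r[a] for r in rows}):
                    refined = tuple(sorted(desc + ((a, v),)))  # specialize by one condition
                    if refined not in candidates:
                        q = evaluate(refined)
                        if q is not None:
                            candidates[refined] = q
        beam = sorted(candidates, key=candidates.get, reverse=True)[:beam_width]
        results.extend((candidates[d], d) for d in beam)
    return sorted(results, reverse=True)

rows = [
    {"illness": "high", "unemployment": "high", "target": 1},
    {"illness": "high", "unemployment": "high", "target": 1},
    {"illness": "high", "unemployment": "low", "target": 1},
    {"illness": "high", "unemployment": "low", "target": 0},
    {"illness": "low", "unemployment": "high", "target": 0},
    {"illness": "low", "unemployment": "high", "target": 0},
    {"illness": "low", "unemployment": "low", "target": 0},
    {"illness": "low", "unemployment": "low", "target": 0},
]
ranked = beam_search(rows, "target", ["illness", "unemployment"])
```

Only the `beam_width` best descriptions of each level are specialized further, which keeps the search tractable at the cost of possibly missing some subgroups.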


Translating Multirelational Subgroups

to Object-relational SQL

Domain: relational database schema D = {R1, ..., Rn} having geometry attributes Gi

Hypothesis Language

Multirelational subgroups are represented by a concept set C = {Ci}, where each Ci consists of a set of attribute-value pairs {A1=v1, ..., An=vn} from a relation in D,

a set of links L = {Li} linking concepts Ci, Ck via their attributes Am, An, of the form

(Ci/Am {= | inside | overlaps | ... | spatially_interact} Ck/An)

target attribute can be non-numeric (A1=v1) or numeric aggregate (avg(A)=n)

Example:

C = {{district.long_term_illness=high, district.unemployment=high}, {street.name='Manchester Road'}}

L = {{district.geometry spatially_interact street.geometry}}

"Enumeration districts with high rate of long term illness and unemployment crossed by Manchester Road"

Testing satisfaction of subgroup descriptions

The number of tuples in D that satisfy a subgroup description is evaluated using SQL select statements including joins over multiple relations.
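To make the translation concrete, here is a small sketch that counts the tuples satisfying the example subgroup with a SQL join. It is deliberately simplified: it uses Python's built-in `sqlite3` instead of a spatial database, geometries are reduced to bounding boxes, and a bounding-box overlap test stands in for the `spatially_interact` predicate. The table layout and the data are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE district (id INTEGER, long_term_illness TEXT, unemployment TEXT,
                       xmin REAL, ymin REAL, xmax REAL, ymax REAL);
CREATE TABLE street (name TEXT, xmin REAL, ymin REAL, xmax REAL, ymax REAL);
INSERT INTO district VALUES (1, 'high', 'high', 0, 0, 2, 2),
                            (2, 'high', 'high', 5, 5, 7, 7),
                            (3, 'low',  'high', 0, 2, 2, 4);
INSERT INTO street VALUES ('Manchester Road', 1, 0, 1, 4);
""")

# Concepts become WHERE conditions; the link becomes the join condition.
query = """
SELECT COUNT(DISTINCT d.id)
FROM district d JOIN street s
  ON d.xmin <= s.xmax AND s.xmin <= d.xmax      -- bounding boxes overlap in x
 AND d.ymin <= s.ymax AND s.ymin <= d.ymax      -- ... and in y
WHERE d.long_term_illness = 'high'
  AND d.unemployment = 'high'
  AND s.name = 'Manchester Road'
"""
subgroup_size = con.execute(query).fetchone()[0]
```

In a spatial database the join condition would instead call the system's own spatial operator on the geometry columns.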


Approach: Translation of Spatial Subgroup Mining to SQL

(Klösgen, May 2002)

• Representing subgroups in object-relational SQL, i.e. multi-relational representation

• Using representation for spatial geometry based on Spatial Database

• Division of work between RDBMS and Search Manager

• Combining visualization in abstract and physical space


Division of labour between RDBMS

and Search Manager (May, Savinov 2003)

[Figure: the Mining Server runs the search algorithm (search in hypothesis space; generation and evaluation of hypotheses, i.e. subgroup patterns) and sends mining queries to the Database Server, which returns statistics]

• Database integration: efficiently organize mining queries

• Mining query delivers statistics (aggregations) sufficient for evaluating many hypotheses
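The idea that one mining query can serve many hypotheses can be sketched with a single GROUP BY statement. The example below uses Python's built-in `sqlite3` and a hypothetical `district` table; it is an assumption about how such a query might look, not the query used in the actual system.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE district (illness TEXT, unemployment TEXT, target INTEGER);
INSERT INTO district VALUES
  ('high', 'high', 1), ('high', 'high', 1), ('high', 'low', 1), ('high', 'low', 0),
  ('low',  'high', 0), ('low',  'high', 0), ('low',  'low', 0), ('low',  'low', 0);
""")

# One round trip to the database: counts per combination of attribute values.
COLUMNS = ("illness", "unemployment")
stats = con.execute("""
    SELECT illness, unemployment, COUNT(*) AS n, SUM(target) AS positives
    FROM district
    GROUP BY illness, unemployment
""").fetchall()

def subgroup_counts(stats, **conditions):
    """Derive (subgroup size, target count) for any conjunction from the statistics."""
    n = positives = 0
    for row in stats:
        values = dict(zip(COLUMNS, row[:2]))
        if all(values[c] == v for c, v in conditions.items()):
            n += row[2]
            positives += row[3]
    return n, positives

size, hits = subgroup_counts(stats, illness="high")
```

The search algorithm on the mining server can then evaluate every subgroup over these attributes without issuing further queries.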


SPIN! – Spatial Data Mining System

[Screenshot: SPIN! user interface with Workspace, Property Editor, Subgroup Viewer, Flowchart-Tool, and Subgroup Result List]


Interactive Exploratory Analysis

Combination of spatial and non-spatial visualization

User selects and manipulates variables

Powerful for analysis in low dimensions (3-4)

Dynamically linked displays: Scatter Plot, Parallel Coordinate Plot, Choropleth Maps


Visualization of spatial subgroups

Linked Display

Spatial Venn Diagram

Subgroup Overview

p(T|C) vs. p(C)

Subgroup: High long-term illness in districts crossed by M60


Radio Network Planning in Telecommunication

High ratio of cut-off calls in mountainous regions crossed by highways having a certain technical configuration

Legend:

Blue: motorway

Brown: high altitude

Black: subgroup

SPIN! Mapviewer (Common GIS)

Other commercial applications of Subgroup Discovery

ƒ How are my customers characterized? Are there interesting profiles?

ƒ Where to open the next supermarket? Does it create competition for my other supermarkets?


Spatial Association Rules

work and slides by

Donato Malerba et al., Univ. Bari


Spatial association rules

ƒ An association pattern P (s%) is a spatial association pattern if it contains at least one spatial relation

A large town intersects a road and is adjacent to water (62%)

ƒ An association rule Q → R (s%, c%) is a spatial association rule if Q ∧ R is a spatial association pattern

IF a large town intersects a road

THEN it is also adjacent to water (62%, 89%)

Malerba et al.; seminal work by Koperski & Han 1995
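Reading s% as support and c% as confidence, the two numbers attached to a rule can be computed as in the sketch below. It is a propositional simplification made for illustration: each reference object is described by a plain set of precomputed relation labels, whereas the original work uses first-order patterns. The town data and label names are invented.

```python
def support(pattern, objects):
    """Share of reference objects whose facts contain every atom of the pattern."""
    return sum(pattern <= facts for facts in objects.values()) / len(objects)

def confidence(body, head, objects):
    """Support of body AND head, relative to the support of the body alone."""
    return support(body | head, objects) / support(body, objects)

# Hypothetical reference objects (large towns) with their spatial relations.
towns = {
    "town_a": {"intersects_road", "adjacent_to_water"},
    "town_b": {"intersects_road", "adjacent_to_water"},
    "town_c": {"intersects_road"},
    "town_d": {"adjacent_to_water"},
}
body, head = {"intersects_road"}, {"adjacent_to_water"}
s = support(body | head, towns)
c = confidence(body, head, towns)
```

A rule is reported as strong when both values reach the chosen minimum thresholds.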


The problem

Given

ƒ a spatial database (SDB) with a set S of reference objects,

ƒ some sets R_k, 1 ≤ k ≤ m, of task-relevant objects

ƒ some spatial hierarchies H_k involving objects in R_k

ƒ M granularity levels in the descriptions

ƒ a set of granularity assignments ψ_k which associate each object in H_k with a granularity level

ƒ a couple of thresholds minsup[l] and minconf[l] for each granularity level

ƒ domain knowledge

Find strong multiple-level spatial association rules.

Malerba et al.


The solution

Solution (Appice et al., IDA Journal, 2003)

ƒ based on an Inductive Logic Programming (ILP) approach → spatial relations easily handled

ƒ spatial pattern → conjunction of first-order logic atoms

ƒ θ-subsumption orders the space of spatial patterns

ƒ monotonicity of support w.r.t. θ-subsumption → pruning of patterns at the same granularity level in the candidate generation phase

ƒ monotonicity of pattern frequency w.r.t. granularity level → pruning of patterns at different granularity levels in the candidate generation phase

Implemented in SPADA (Spatial Pattern Discovery Algorithm)

European project SPIN (Spatial Mining for Data of Public Interest)


Extensions of initial solutions

ƒ Efficiency improvement of pattern evaluation by caching support objects for each stored pattern

ƒ Definition of a declarative bias to filter out rules on the basis of users' preferences → efficiency improvement is a byproduct

- In real-world applications a large number of spatial patterns can be generated even for a few hundred spatial objects.

- Most of the discovered patterns are useless for the application at hand

- Urban accessibility application: only spatial patterns involving some sociological factor (household with no car) are interesting.

ƒ Integration of SPADA in the ARES system that interfaces a Spatial DB (Oracle Spatial)


Mining Network Data


Networks

[Figure: data types arranged along axes of space complexity and time complexity: Points; Points, Lines, and Areas; Networks]


Points and Networks

Requirements:

• Point Data
• Polygons
• Aggregations
• Spatial dependencies and relations, networks

Examples: Traffic frequency prediction

Method:


Case Study: Outdoor Advertising - Frequency Atlas

Customer:

ƒ Fachverband für Außenwerbung

(FAW; German Outdoor Advertising Association)

Task:

ƒ Performance value assessment of advertising media

ƒ Traffic volume forecast

ƒ separate for private cars, public transport, pedestrians


Frequency + Media factories = poster reach

Gesellschaft für Konsumforschung


The project in numbers

ƒ Complete model for all German cities with more than 50,000 inhabitants (192 cities) = ca. 1,000,000 street segments!

ƒ Complete model includes, for each segment:

- car frequency

- pedestrian frequency

- public transport frequency

ƒ The model is presently being extended to all cities with between 10,000 and 50,000 inhabitants


Basic Data: traffic measurements

Manual traffic measurement at selected poster locations

- 4 times 6 minutes on four days of the week at four times of day

ƒ Additional empirical model of day totals

ƒ Properties

- Well defined measurements

- Extended measurement period, so concept drift cannot be excluded

Secondary data

[Figure: DATA MINING combines frequency measurements with street network, sociodemographics and socioeconomics, public transport network, and Points of Interest (POI) to predict frequency classes (0, 200, 400, 600, 800, 1000, 1250, 1500, 1750, 2000, ...)]


Local Measurements

Inhomogeneous measurements on the same street

How Spatial Autocorrelation helps

[Figure: measured frequencies on segments of the same street: 843, 820, 1200, 843]


Spatial kNN

ƒ Attributes of street segments:

- Name, type, ..., class

- Points of Interest

- Spatial coordinates

ƒ Locations with measurement values

ƒ Distance between two segments x_a, x_b

$$ d(x_a, x_b) = \sum_{m=1}^{M} |x_{am} - x_{bm}| $$

ƒ Selection of the k closest x_1, ..., x_k

ƒ Prediction for new segment x_q

$$ \hat{y}_q = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i} \quad \text{with} \quad w_i = \frac{1}{d(x_q, x_i)} $$

ƒ (Project has actually used specially adapted distance measure)
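The distance-weighted prediction can be sketched with NumPy as below. The function name `knn_predict` and the toy data are hypothetical, and the attribute-wise absolute difference used as distance is only a stand-in: as noted above, the project used a specially adapted distance measure that is not given here.

```python
import numpy as np

def knn_predict(X, y, x_q, k):
    """Inverse-distance-weighted mean of the k measured segments closest to x_q."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.abs(X - np.asarray(x_q, dtype=float)).sum(axis=1)  # assumed distance measure
    nearest = np.argsort(d)[:k]
    if d[nearest[0]] == 0:
        return float(y[nearest[0]])   # query coincides with a measured segment
    w = 1.0 / d[nearest]              # w_i = 1 / d(x_q, x_i)
    return float((w * y[nearest]).sum() / w.sum())

# Three measured segments described by two numeric attributes each.
X = [[0, 0], [2, 0], [10, 10]]
y = [100.0, 300.0, 1000.0]
prediction = knn_predict(X, y, [1, 0], k=2)
```

Because nearby segments dominate the weighted mean, the prediction inherits the spatial autocorrelation present in the measurements.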


Spatial KNN - Properties

ƒ kNN captures the autocorrelation inherent in the data well

ƒ Allows background knowledge to be brought in by fine-tuning the distance function

ƒ Database Integrated (Oracle Spatial)

ƒ Performs dynamic spatial query (minimum distances among polygons)

Performance improvements

ƒ Spatial queries use index structures (R-Tree) but are still relatively costly (i.e. they dominate overall run-time)

ƒ Partial evaluation of distance function based on lower bounds for distance to minimize number of spatial queries


Smoothing based on flow constraints

ƒ Measurement errors lead to inconsistencies

ƒ Need plausible assignment of frequencies

Solution:

ƒ Use Kirchhoff's law as constraint

- Sum of inputs = sum of outputs

ƒ Smoothing algorithm finds locally optimal solution using constraint relaxation
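One simple way to illustrate smoothing under this constraint is to visit each junction in turn and spread its imbalance evenly over the adjacent edges, repeating until every junction is balanced. This projection scheme is a stand-in chosen for the sketch, not the project's algorithm; the function name, the edge-list format and the example network are assumptions, and the sketch does not prevent negative flows.

```python
def smooth_flows(edges, flows, tol=1e-9, max_iter=10000):
    """edges: list of (from_node, to_node); flows: measured frequency per edge."""
    flows = [float(f) for f in flows]
    nodes = {}
    for e, (u, v) in enumerate(edges):
        nodes.setdefault(u, ([], []))[1].append(e)   # e leaves u
        nodes.setdefault(v, ([], []))[0].append(e)   # e enters v
    # Only junctions with both incoming and outgoing edges are constrained.
    interior = [io for io in nodes.values() if io[0] and io[1]]
    for _ in range(max_iter):
        worst = 0.0
        for ins, outs in interior:
            imbalance = sum(flows[e] for e in ins) - sum(flows[e] for e in outs)
            worst = max(worst, abs(imbalance))
            step = imbalance / (len(ins) + len(outs))
            for e in ins:
                flows[e] -= step     # lower inputs ...
            for e in outs:
                flows[e] += step     # ... and raise outputs by the same amount
        if worst < tol:
            break
    return flows

# Inconsistent measurements: 1000 enter junction B but 700 + 400 leave it.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
smoothed = smooth_flows(edges, [1000, 700, 400, 650, 1000])
```

After smoothing, the inputs and outputs of each junction agree up to the tolerance, while each edge stays close to its measured value.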


Explaining frequencies

Problem: Customer wants transparent values, not a black box

ƒ => Problem for Spatial kNN

Solution: Fit an explanatory model to the predicted values

Makes it possible to understand why predictions are as they are

Makes it possible to identify potential outliers and areas of high uncertainty

Use Model Trees


Numerical prediction with model trees

LM1: FREQUENZ = 2277.3186 * X + 75.4087 * ANZAHL_EINKAUF - 142.4217 * MESSE - 21221.8497

[Figure: model tree with split nodes Fussgängerzone (pedestrian zone): Nein | Ja; Bahnhof (railway station): Nein | Ja; Distanz_zu_Bahnhof (distance to station): <= 150 | > 150; Anzahl_Restaurants (number of restaurants): <= 5 | > 5 and <= 15 | > 15; ORTSTEIL (district) = INNENSTADT (LR) | ...; Straßenkategorie (street category): Nebenstr. (side street) | Hauptstr. (main street); Y-Koordinate: <= 9.6 | > 9.6; X-Koordinate: <= 52.385 | > 52.385; leaves hold the linear models LM1 to LM6]

Improving model by spotting outliers based on model tree prediction

ƒ Points with large prediction error are checked

- Visual inspection

- Getting additional empirical input by taking new measurements

ƒ Corrected values are basis for next round in model building, leading to improved results


Final Result: Frequency Map

[Figure: frequency maps for Cars, Public Transport, and Pedestrians]


ƒ ~1 million street segments predicted based on 96,000 measurements

Final result: frequency atlas

(cars, public transport, pedestrians)

Used for determining poster prices in Germany since 2006


Spatial Model Trees [Malerba, Appice, Ceci 2005]

ƒ Standard Model Trees (e.g. M5') can do Spatial Mining by splitting along x and y coordinates

ƒ Mrs-Smoti (Malerba et al. 2004) is a variant of Model Trees that

- Allows regression nodes as interior nodes

- Handles autocorrelation directly, via a spatial regression model with dependencies in response variables (spatially lagged response)

ƒ It takes as input spatial objects, possibly belonging to separate thematic layers, stored in a spatial database S

- target objects (main subject of analysis)

- non-target objects (relevant for the task in hand)

and outputs a spatial model tree T by

- partitioning training spatial data according to intra-layer and inter-layer relationships

- associating different regression models to disjoint spatial areas

ƒ Integrates spatial database queries (see Subgroup Discovery)

[Figure: example spatial model tree T with nodes numbered 0 to 7: regression nodes (Y=a+bX1, Y'=c+dX', Y'=e+fX'2, Y'=g+hX'3, Y'=i+lX'4) and split nodes (X'3 ≤ α, X'2 ≤ β, X'4 ≤ γ)]

Mining Tracks in Space and Time


Tracks in Space and Time

[Figure: data types arranged along axes of space complexity and time complexity: Points; Points, Lines, and Areas; Networks; Tracks in Space and Time]


Tracks in space and time

Requirements:
• Point data
• Polygons
• Aggregations
• Networks
• Tracks, GPS/RFID/sensor measurements

Applications:

Traffic prediction, Mobility analysis

Examples

• Sampling, Event analysis, non-linear optimization


Mobility analysis based on GPS-tracks

ƒ introduction of new pricing model for poster sites based on GPS tracks

ƒ registration of contact frequencies with poster sites

ƒ contact extrapolation for target groups:

- socio-demographic characteristics

- residential areas

Media Trend Journal, Nov, 2006


Time patterns

ƒ Patterns / Questions

- How long (days) does it take till x% of objects visit all locations?

- How long does it take till x% of objects visit at least one location twice?

ƒ Applications

- determine mobility of a group of people

- reach of poster networks

- find popularity of locations (theatres, supermarkets, hospitals)
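The first question can be answered from a visit log with a few lines of code. The sketch below assumes a hypothetical log format of `(object_id, day, location)` tuples and a function name of my choosing; it returns `None` when the requested share is never reached.

```python
import math

def days_until_coverage(visits, locations, share):
    """Smallest day by which at least `share` of all objects visited every location."""
    required = set(locations)
    objects = {obj for obj, _, _ in visits}
    seen, done_day = {}, {}
    for obj, day, loc in sorted(visits, key=lambda v: v[1]):   # process in time order
        if obj in done_day or loc not in required:
            continue
        seen.setdefault(obj, set()).add(loc)
        if seen[obj] == required:
            done_day[obj] = day       # first day on which this object is complete
    needed = math.ceil(share * len(objects))
    finished = sorted(done_day.values())
    return finished[needed - 1] if 0 < needed <= len(finished) else None

# Hypothetical log: (object id, day, location).
visits = [
    ("a", 1, "P1"), ("a", 2, "P2"),
    ("b", 1, "P1"), ("b", 3, "P2"),
    ("c", 1, "P1"),
    ("d", 5, "P2"), ("d", 6, "P1"),
]
day_50 = days_until_coverage(visits, ["P1", "P2"], share=0.5)
```

The second question (at least one location visited twice) follows the same pattern with a per-location visit counter instead of a set.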


Modelling tasks

ƒ Modelling mobility for cities with GPS-measurements for the overall population

ƒ Predicting mobility for cities without measurements (hard task!)

ƒ Extrapolating predictions in time


GeoPKDD - FET Project IST-014915

ƒ Geographic Privacy-aware Knowledge Discovery and Delivery

ƒ December 2005 – November 2008

ƒ Project Leader: Fosca Giannotti

ƒ http://www.geopkdd.eu

General Project Idea

ƒ extracting user-consumable forms of knowledge from large amounts of raw geographic data referenced in space and in time.

ƒ knowledge discovery and analysis methods for trajectories of moving objects, which change their position in time, and possibly also their shape or other significant features

ƒ devising privacy-preserving methods for data mining from sources that typically contain personal sensitive data


The Consortium

ID | Acronym | Partner | Country

1 | KDDLAB | Knowledge Discovery and Delivery Laboratory, ISTI-CNR, Istituto di Scienza e Tecnologie dell'Informazione, Pisa. http://www.isti.cnr.it/ - jointly with Univ. Pisa, Dept. of Computer Science http://www.di.unipi.it | I

2 | LUC | Univ. Limburg, Theoretical Computer Science Group. http://www.luc.ac.be/theocomp | B

3 | EPFL | EPFL, Lab. DB, Lausanne. http://lbdwww.epfl.ch/e/ | CH

4 | FAIS | Fraunhofer Institute for Autonomous Intelligent Systems, Sankt Augustin. http://www.ais.fraunhofer.de/ | D

5 | WUR | Wageningen UR, Centre for GeoInformation. http://cgi.girs.wageningen-ur.nl/ | NL

6 | CTI | Research Academic Computer Technology Institute, Research and Development Division. http://www.cti.gr/ - jointly with Univ. Piraeus, Dept. of Informatics http://www.unipi.gr | GR

7 | UNISAB | Sabanci University, Faculty of Engineering and Natural Sciences. http://www.sabanciuniv.edu/ | TK

8 | WIND | WIND Telecomunicazioni SpA, Direzione Reti Wind Progetti Finanziati & Technology Scouting. | I


Geographic Privacy-aware Knowledge Discovery Process

[Figure: process diagram with the labels: Telecommunication company (WIND); Bandwidth/Power optimization; Mobile cells planning; Privacy enforcement; trajectory reconstruction; Trajectories warehouse; Privacy-aware Data mining; ST patterns; p(x)=0.02; interpretation; visualization; GeoKnowledge; Aggregative Location-based services; Public administration or business companies; Traffic Management; Accessibility of services; Mobility evolution; Urban planning]


GeoPKDD – Specific Goals

ƒ models for moving objects, and data warehouse methods to store their trajectories

ƒ knowledge discovery and analysis methods for moving objects and trajectories

ƒ techniques to make such methods privacy-preserving

ƒ techniques for reasoning on spatio-temporal knowledge and on background knowledge

ƒ techniques for delivering the extracted knowledge within the geographic framework


From Traces to Trajectories: the Source Data

GSM network

ƒ Entering the cell

- e.g. (UserID, time, IDcell, in)

ƒ Exiting the cell

- e.g. (UserID, time, IDcell, out)

ƒ Movements inside the cell?

- e.g. (UserID, time, X, Y, IDcell)

Streams of log data of mobile phones, e.g. cells in the GSM/UMTS network

ƒ Real trajectories are continuous functions

ƒ Logs are discrete samplings of real trajectories, dependent on the wireless network technology

- irregular granularity in time and space

- possible imperfection/imprecision

ƒ An approximated reconstruction of the real trajectory from its log traces is needed
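The simplest possible reconstruction is linear interpolation between consecutive log entries, sketched below. This is only one assumed approach, since the slides do not say which method is used; the function name and the log format, a time-sorted list of `(time, x, y)` tuples with strictly increasing times (for instance cell centroids), are hypothetical.

```python
from bisect import bisect_right

def position_at(log, t):
    """Interpolated (x, y) at time t from a time-sorted log of (time, x, y) entries."""
    times = [entry[0] for entry in log]
    if t <= times[0]:
        return log[0][1:]             # before the first observation
    if t >= times[-1]:
        return log[-1][1:]            # after the last observation
    i = bisect_right(times, t)        # log[i-1] and log[i] bracket t
    (t0, x0, y0), (t1, x1, y1) = log[i - 1], log[i]
    a = (t - t0) / (t1 - t0)          # fraction of the way from log[i-1] to log[i]
    return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))

log = [(0, 0.0, 0.0), (10, 10.0, 0.0), (20, 10.0, 20.0)]
midpoint = position_at(log, 5)
```

With irregular sampling the quality of the reconstruction varies along the track, which is one reason why the imprecision of the logs has to be modelled as well.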


Movement patterns

Clustering

ƒ Group together similar trajectories

ƒ For each group produce a summary

Frequent patterns

ƒ Discover frequently followed (sub)paths

Classification

ƒ Extract behaviour rules from history

ƒ Use them to predict behaviour of future users

[Figure: movement patterns annotated with shares 60%, 7%, 8%, 5%, 20%, and an unknown case marked "?"]

Source: Pedreschi & Giannotti, 2005


Why emphasis on privacy?

ƒ The more and better data are gathered, the greater the vulnerability from correlation

ƒ On the other hand, more and new data bring new opportunities

ƒ Need to maintain privacy without giving up opportunities

ƒ Need to obtain social acceptance through demonstrably trustworthy solutions

Privacy in GeoPKDD

ƒ ... is a technical issue, besides an ethical, social and legal one, in the specific context of ST data

ƒ How to formalize privacy constraints over ST data and ST patterns?

- E.g., anonymity threshold on clusters of individual trajectories

ƒ How to design DM algorithms that, by construction, only yield patterns that meet the privacy constraints?


Challenges


Causal Inference from Statistical Spatio-Temporal Data

ƒ Current project at IAIS for newspaper publisher:

ƒ Sales prediction of individual shops.

ƒ What happens if a shop closes or is sold out? Predict to which alternative shop customers go.

ƒ Spatio-Temporal Clustering of shops

ƒ Time Series Prediction

ƒ Modeling customer behavior

⇒ Causal inference about customer behavior

"If shop A closes, n% of A's customers go to B, m% to C"
