Tutorial on Geographic and Spatial Data Mining


Michael May

15th Italian Symposium on Advanced Database Systems - SEBD’07 Torre Canne, Italy

June 17th


Fraunhofer Society

Joseph von Fraunhofer, German physicist and entrepreneur

Fraunhofer mission:

- do state-of-the-art research and use it in challenging customer projects

- Funding is 33% research grants, 33% customer projects, 33% institutional funding

57 institutes, 40 locations, 12,000 employees, €1 billion annual volume


Fraunhofer IAIS: Intelligent Analysis and Information Systems

„From sensor data to business intelligence, from media analysis to visual information systems: Our research allows companies to do more with data“

New name, long-standing experience

- Founded in 2006 as a merger of the Fraunhofer institutes AIS and IMK

230 people: scientists, project engineers, technical and administrative staff

Located on Fraunhofer Campus Schloss Birlinghoven/Bonn

Joint research groups and cooperation with Univ. Bonn


Fraunhofer IAIS: research and projects

Core research areas:

Machine learning and adaptive systems

Data Mining and Business Intelligence

Automated media analysis

Interactive access and exploration


Objectives

Although it covers statistical concepts, algorithms and data structures, the tutorial has a practical, application-oriented focus

Integration of various technologies and algorithms. How do they combine? Covers a broad range

I do not assume familiarity with spatial concepts, but some basic familiarity with data mining approaches

Three Objectives:

- to stimulate research on spatial data mining related issues

- to stimulate development of more efficient spatial databases tailored for data mining applications

- to stimulate real-world applications


A main message

Spatial Data Mining is not an esoteric research topic; it is a practically and commercially very important, and sometimes business-critical, field!

Later I give an example where the value of several dozen companies directly depends on the predictions given by our spatial data mining algorithms.


Spatial vs. Geographic Data Mining

Geographic Data is data related to the earth

Spatial Data Mining deals with physical space in general, from molecular to astronomical level

Geographic Data Mining is a subset of Spatial Data Mining

Almost all geographic data mining algorithms can work in a general spatial setting (with the same dimensionality)

This tutorial focuses on geographic data in 2D, but most algorithms work on spatial data in general

I do not talk about the specifics of molecular data, face detection, etc.


Agenda

Introduction – Spatial and Geographic Data Mining
Part I: Basic Concepts – Spatial Databases and GIS
• Spatial Data Types
• Spatial Queries
• Construction of Complex Features
Part II: Exploratory Analysis of Spatial Data
Part III: Spatial and Geographic Data Mining Methods
• Autocorrelation
• Mining Point Data – Clustering, Kriging
• Mining Points, Lines, Areas – Clustering, Subgroup Discovery, Association Rules
• Mining Networks – A practical case study
• Mining Tracks in Space and Time – Mining from GPS-Data
Challenges
Summary


Introduction – Spatial Data Mining


A classical example of spatial analysis

Dr. John Snow

Investigating the causes of a cholera epidemic, London, September 1854

A good representation is often the key to solving a problem

Disease cluster


Good representation because ...

Represents spatial relation of objects of the same type

Represents spatial relation of objects to other objects

It is not only important where a cluster is but also what else is there (e.g. a water pump)!

Shows only relevant aspects and hides irrelevant ones


Goals of Spatial Data Mining

Identifying spatial patterns

Identifying spatial objects that are potential generators of patterns

Identifying information relevant for explaining the spatial pattern (and hiding irrelevant information)

Presenting the information in a way


Spatial Data Mining

Data Mining + Geographic Information Systems = Spatial Mining


Basic Concepts

Spatial Databases and GIS


Commercial
Where to build a new supermarket?
Where are the customers that want to buy new product X?
How many cars pass the main road per hour?
Does it pay to install new antennas?
What percentage of young females sees a billboard located in Ripley Avenue?

Public Sector
Are there clusters of a certain disease?
Is there a relationship between poverty and death rate?
Are there crime hot spots or patterns?


Buildings

Rivers

Streets

Schools

Hospitals

Factory

Attribute Data: Persons per household, No. of cars, Long-term illness, Age, Profession, Ethnic group, Unemployment, Education, Migrants, Medical establishment, Shopping areas


Elements of a spatial database

Spatial Operators

Spatial Data Types

Spatial Indexes

Spatial Query Language

Metadata

SELECT c.holding_company, c.location
FROM competitor c, bank b
WHERE b.site_id = 1604
  AND SDO_WITHIN_DISTANCE(c.location, b.location, 'distance=2 unit=mile') = 'TRUE'

INSIDE

Examples from Oracle Spatial


Spatial Datatypes


Two basic types of representation: Fields and Discrete Objects

Fields: Raster Data

Discrete Objects: Vector Data Model

Area

Line


Vector Data: Data Structure

Ordered sets of xy-coordinates defining points, lines, or polygons

3D or 4D also possible

Type             Data Structure                Note
Point            (5,10)
Line (Polyline)  ((5,10),(9,16),(12,17))       Straight lines between points
Area (Polygon)   ((5,10),(9,16),(12,17), …)    Draw line from last to first coordinate

Easy to scale (linear transformation)

Storage efficient

Relationships between objects (e.g. overlap) are not explicitly represented

Aka "Spaghetti Model"
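The vector data structure can be illustrated with plain coordinate lists. The following is a minimal sketch, not taken from the tutorial: the function names are hypothetical, and the coordinates reuse the ones shown on the slide. It shows the implicit closing edge of a polygon and scaling as a linear transformation.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# Example geometries, reusing the coordinates from the slide.
point: Point = (5, 10)
polyline: List[Point] = [(5, 10), (9, 16), (12, 17)]
polygon: List[Point] = [(5, 10), (9, 16), (12, 17)]  # closing edge is implicit


def closed_ring(vertices: List[Point]) -> List[Point]:
    """Return the polygon ring with the first vertex repeated at the end."""
    return vertices + [vertices[0]]


def polyline_length(vertices: List[Point]) -> float:
    """Sum of the straight-line segment lengths between consecutive vertices."""
    return sum(
        ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
        for (x1, y1), (x2, y2) in zip(vertices, vertices[1:])
    )


def polygon_area(vertices: List[Point]) -> float:
    """Unsigned polygon area via the shoelace formula."""
    ring = closed_ring(vertices)
    twice_area = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(twice_area) / 2


def scale(vertices: List[Point], factor: float) -> List[Point]:
    """Scaling is a linear transformation applied to every coordinate."""
    return [(x * factor, y * factor) for x, y in vertices]
```

Note that nothing in these lists says whether two objects overlap or touch; such relationships have to be computed, which is the point of the "Spaghetti Model" remark.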


Two Main Types of Vector Data

- non-regular tessellations: closed polylines that partition the space

- discrete isolated objects: point, line, area (polygon)

Tessellations are very useful for aggregation of discrete objects and for feature extraction


UK, Greater Manchester, Stockport

Descriptions of objects are organized in relations (database tables). Each row in a table describes one object.

Different categories of objects are organized in separate relations, each having its own set of attributes.

Buildings (Geometry, Address, Type, …)
ID  Geometry         Address              Type
1   (1,1),(2,2),…    Gladstone Street 5   1
2   (3,3),(4,4),…    Islington Road 2     2
3   (5,5),(6,6),…    Ripley Avenue 23     1

Hospitals (Geometry, Address, Phone, #Beds, …)
ID  Geometry         Address         Phone
1   (1,1),(2,2),…    Stepping Hill   234567
2   (3,3),(4,4),…    Great Moore     567897

Further layers, each with ID, Geometry, Name and Type: Rivers, Streets, Schools, Factory


Hierarchy

Often data are organized in spatial hierarchies, e.g.

Country
State
Zip Area
Voting District
Parcel

Hierarchies may overlap

[Figure: UK census data hierarchy: County, then District 1 … District n, then Ward 1 … Ward n within each district]


Representation of data in a spatial database

A set of relations R_1, ..., R_n such that each relation R_i has a geometry attribute G_i or an identifier A_i such that R_i can be linked (joined) to a relation R_k having a geometry attribute G_k

- Geometry attributes G_i consist of ordered sets of x,y-pairs defining points, lines, or polygons

- Different types of spatial objects are organized in different relations R_i (geographic layers), e.g. streets, rivers, enumeration districts, buildings, and

- each layer can have its own set of attributes A_1, ..., A_n and at most one geometry attribute G


This representation does not fit well with standard data mining approaches!

This is where the specific research challenge for geographical data mining comes from!


Raster Data

How to represent phenomena conceived as fields?

Divide the world into square cells

No variation within cells

Cell value may be average, max, min, sum, central point, …

Represent discrete objects as collections of one or more cells

Represent fields by assigning attribute values to cells

[Figure: Raster representation. Each color represents a different value of a nominal-scale field. Legend: Mixed conifer, Douglas fir, Oak savannah, Grassland. Source: Longley et al (2001)]


Raster and Vector: Comparison

Raster Model
Advantages:
• Simple data structure
• Simple logical and algebraic structures
Disadvantages:
• Large data volumes
• Imprecise geometry
• Expensive transformations of coordinates
• Implicit coordinates

Vector Model
Advantages:
• Specify geometry by coordinates
• Topological relationships
• High geometric accuracy
• Storage efficient
Disadvantages:
• Complex data structure
• Compute-intensive logical and algebraic operations

Remember: "Raster is vaster and vector is correcter"


Spatial Queries


Problem: The vector data model does not explicitly capture relationships among objects. They have to be inferred using spatial predicates.

Spatial predicates evaluate to true or false for given objects.

A query returns
- the set of objects of which the statement is true; or
- using aggregates, the [minimum, maximum, sum, average, …] object(s) of which the statement is true

Queries are evaluated using a spatial join among different relations (layers)

Here's where database technology and spatial indexing come in to do the job efficiently!

Still, they can be extremely time consuming!


Spatial Predicates: Egenhofer‘s 9-intersection model

Each object has interior (i), exterior (e) and boundary (b)

This results in a 9-intersection matrix for the relation between two spatial objects A and B

A cell contains a 1 iff the intersection of point sets is non-empty

Example matrices for two regions A and B (rows and columns in the order b, i, e):

   b i e        b i e        b i e
b  1 0 1     b  1 1 1     b  0 0 1
i  0 0 1     i  1 1 1     i  1 1 1
e  1 1 1     e  1 1 1     e  0 0 1
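To make the 9-intersection idea concrete, here is a small sketch that computes the matrix for regions given as sets of grid cells. This is a discretized approximation chosen for illustration, not the exact point-set model and not the tutorial's code: a cell counts as interior if all four edge neighbours belong to the region, and the helper names are hypothetical. Rows refer to A and columns to B, both in the order b, i, e, which is an assumption of this sketch.

```python
from itertools import product


def parts(region, universe):
    """Split a set of grid cells into (boundary, interior, exterior)."""
    def neighbours(cell):
        x, y = cell
        return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

    interior = {c for c in region if all(n in region for n in neighbours(c))}
    boundary = region - interior
    exterior = universe - region
    return boundary, interior, exterior


def nine_intersection(a, b, universe):
    """3x3 matrix of 0/1 flags: 1 iff the two point sets share a cell."""
    return [[int(bool(pa & pb)) for pb in parts(b, universe)]
            for pa in parts(a, universe)]


def rect(x0, x1, y0, y1):
    """All integer cells with x0 <= x <= x1 and y0 <= y <= y1."""
    return set(product(range(x0, x1 + 1), range(y0, y1 + 1)))


universe = rect(-2, 12, -2, 12)
a = rect(0, 5, 0, 5)
meets = nine_intersection(a, rect(5, 10, 0, 5), universe)      # shared edge cells only
overlaps = nine_intersection(a, rect(3, 8, 3, 8), universe)    # shared interior cells
contains = nine_intersection(rect(0, 7, 0, 7), rect(2, 5, 2, 5), universe)
```

With these example rectangles the three results reproduce the three example matrices in the order shown above.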


Spatial Predicates

A inside B, B contains A
A contains B, B inside A
A covered-by B, B covers A
A covers B, B covered-by A
A equals B, B equals A
A overlaps B, B overlaps A
A meets B, B meets A
A disjoint B, B disjoint A

9-intersection model for 2 regions (Egenhofer 1991)

INSIDE


Spatial Queries: Distance

Metric spaces:

→ Symmetry: d(i,j) = d(j,i)

→ Triangle inequality: d(i,k) ≤ d(i,j) + d(j,k)

- Euclidean distance: $d_e(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$

[Figure: triangle with corners i, j, k]

Distance relation between polygons: Minimum distance between any 2 points of the polygons
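The distance definitions can be sketched in a few lines. This is an illustrative sketch with hypothetical function names, assuming polygons are given as vertex lists and that the two polygons neither overlap nor contain one another; under that assumption the minimum is attained between a vertex of one polygon and an edge of the other.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    ax, ay = a
    dx, dy = b[0] - ax, b[1] - ay
    if dx == 0 and dy == 0:               # degenerate segment
        return euclidean(p, a)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return euclidean(p, (ax + t * dx, ay + t * dy))


def edges(polygon):
    """Edges of a polygon, including the closing edge from last to first vertex."""
    return list(zip(polygon, polygon[1:] + polygon[:1]))


def polygon_distance(poly_a, poly_b):
    """Minimum distance between two polygons that are disjoint and not nested."""
    d_ab = min(point_segment_distance(p, a, b) for p in poly_a for a, b in edges(poly_b))
    d_ba = min(point_segment_distance(p, a, b) for p in poly_b for a, b in edges(poly_a))
    return min(d_ab, d_ba)
```

The brute-force comparison of all vertex and edge pairs is quadratic in the number of vertices; a spatial database would typically narrow down candidates with an index first.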


Spatial Queries: Distance and Proximity

Select the nearest neighbor in space

Select all objects within a certain distance

[Figure: location X on Main Street and its distance to Hospital #1 and Hospital #2]

Example: Oracle Spatial

Select all competitors and their locations within 2 miles of the bank with id 1604:

SELECT c.holding_company, c.location
FROM competitor c, bank b
WHERE b.site_id = 1604
  AND SDO_WITHIN_DISTANCE(c.location, b.location, 'distance=2 unit=mile') = 'TRUE'


Distance – non-metric

Non-metric spaces:

→ Asymmetry: d(i,j) ≠ d(j,i)

→ Triangle inequality does not hold

Examples: drive time, driving distance


Stockport Database Schema

[Schema diagram: the enumeration district layer (ED) is linked by standard joins (=zone_id) to the census tables TAB01, …, TAB61, …, TAB95, and by spatial joins (inside, spatially interacts) to layers such as Water, River, Building, Street, Shopping Region and Vegetation]

Attribute data: 95 tables with census data, ~8000 attributes

Geographical layers: 85 tables

Spatial hierarchy:
• County
• District
• Wards
• Enumeration district

Relations between objects implicit; very flexible and storage efficient, but compute intensive


Implementation of Spatial Databases

Many popular databases have spatial extensions by now:

Oracle Spatial

PostgreSQL

MySQL (since 4.1)


Construction of Complex Features


Spatial Functions

Example: Oracle Spatial 10g

Return a geometry
- Union
- Difference
- Intersect
- XOR
- Buffer
- CenterPoint
- ConvexHull

Return a number
- Length
- Area
- Distance

[Figure: Original, Union, Difference, Intersect, XOR]

Constructs new geometry objects from existing ones using point set theory

Efficient implementation using computational geometry

Source: http://colab.cim3.net/file/work/SICoP/2006-06-20/2006-06-21/xlopez06212006.ppt


Constructing Cells: Buffer

How many competitors are in the catchment area of my shop?

= How many shops are within the buffer?

Simplistic approximation

Does not take account of barriers (rivers, highways)

Does not take into account road system


Voronoi diagram

Which are my nearest competitors?

What is the coverage of my radio antenna?

= Find Voronoi neighbors

Approximation

Does not take account of barriers (rivers, highways)

Does not take into account road system

Decompose space into regions around each point in a set of points S such that all the points in the region around p_i are closer to p_i than to any other point in S

Complexity: O(n lg n)

Related data structure: Delaunay triangulation (graph of Voronoi neighbors)
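The defining property of a Voronoi region (every location in the region of p_i is closer to p_i than to any other site) can be checked directly by brute force. The sketch below does not construct the diagram and therefore does not achieve the O(n lg n) bound; it only assigns locations to the region they fall into. The site names and coordinates are made up for illustration.

```python
import math


def nearest_site(location, sites):
    """Name of the site whose Voronoi region contains the location."""
    return min(sites, key=lambda name: math.dist(location, sites[name]))


def assign_to_regions(locations, sites):
    """Group locations by the Voronoi region (nearest site) they fall into."""
    regions = {name: [] for name in sites}
    for loc in locations:
        regions[nearest_site(loc, sites)].append(loc)
    return regions


# Hypothetical shops (sites) and customers (locations).
shops = {"shop_a": (0, 0), "shop_b": (10, 0), "shop_c": (5, 8)}
customers = [(1, 1), (9, 2), (5, 6), (4, 1)]
catchments = assign_to_regions(customers, shops)
```

As the slide notes, this is only an approximation of a catchment area, since straight-line distance ignores barriers and the road system.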


Drive-Time Zone (Dijkstra)

How many competitors are in the catchment area of my shop?

Realistic approximation

Takes account of barriers (rivers, highways)

Takes into account the road system and the maximum speed on each road

All street segments within a drive-time distance <= d from a given starting point

Use Dijkstra's algorithm

Complexity: between O(V^2) and O(E + V lg V), depending on the data structures used for implementation
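A drive-time zone can be sketched with Dijkstra's algorithm on a road graph whose edge weights are travel times. The version below is a minimal sketch using Python's binary heap (`heapq`), which gives roughly O((V + E) lg V) rather than the Fibonacci-heap bound; the road network and junction names are invented for illustration, and the function returns reachable junctions rather than street segments.

```python
import heapq


def drive_time_zone(graph, start, max_minutes):
    """Return {node: drive time} for all nodes reachable within max_minutes.

    graph maps each node to a list of (neighbour, minutes) pairs. Edges are
    directed, so one-way streets and asymmetric travel times can be modelled.
    """
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        time, node = heapq.heappop(queue)
        if time > best.get(node, float("inf")):
            continue                          # stale queue entry
        for neighbour, minutes in graph.get(node, []):
            new_time = time + minutes
            if new_time <= max_minutes and new_time < best.get(neighbour, float("inf")):
                best[neighbour] = new_time
                heapq.heappush(queue, (new_time, neighbour))
    return best


# Hypothetical road network: travel times in minutes between junctions.
roads = {
    "shop": [("a", 2), ("b", 5)],
    "a": [("shop", 2), ("b", 1), ("c", 6)],
    "b": [("shop", 5), ("a", 1), ("d", 3)],
    "c": [("a", 6)],
    "d": [("b", 3)],
}
zone = drive_time_zone(roads, "shop", max_minutes=6)
```

Cutting the search off at `max_minutes` means only the part of the network inside the zone is explored, which matters on large road graphs.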

Pre-processing

Several of the feature extractions are computationally quite expensive (at least for large data sets) and there is often a combinatorial explosion of features that might be constructed.

Several strategies are used in Spatial Warehouse Design:

Selective Pre-processing: materializing important joins in advance (storage requirements!)

Approximate precomputing: e.g. using the Minimum Bounding Rectangle to approximate a polygon

Schema Design (e.g. Star-Schema with selective materialization): Han J., Stefanovic N., Koperski K. Selective Materialization: An Efficient Method for Spatial Data Cube Construction. PAKDD, 1998.
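The approximate-precomputing strategy can be sketched as a filter step: compare cheap bounding rectangles first and run the expensive exact geometry test only on the surviving pairs. The function names and the two small layers below are hypothetical and only illustrate the idea.

```python
def mbr(vertices):
    """Minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return min(xs), min(ys), max(xs), max(ys)


def mbrs_intersect(r1, r2):
    """True if two axis-aligned rectangles share at least one point."""
    return r1[0] <= r2[2] and r2[0] <= r1[2] and r1[1] <= r2[3] and r2[1] <= r1[3]


def candidate_pairs(layer_a, layer_b):
    """Filter step of a spatial join: keep only pairs whose MBRs intersect.

    The result is a superset of the truly intersecting pairs; an exact
    geometry test would follow as the refinement step.
    """
    boxes_b = {name: mbr(geom) for name, geom in layer_b.items()}
    return [(a, b)
            for a, geom in layer_a.items()
            for b, box in boxes_b.items()
            if mbrs_intersect(mbr(geom), box)]


# Hypothetical layers: two districts (polygons) and two streets (polylines).
districts = {"d1": [(0, 0), (4, 0), (4, 4), (0, 4)],
             "d2": [(6, 0), (10, 0), (10, 4), (6, 4)]}
streets = {"s1": [(3, 1), (7, 2)], "s2": [(11, 5), (12, 9)]}
pairs = candidate_pairs(districts, streets)
```

Because the rectangles can be stored and indexed in advance, this trades a little storage for far fewer exact tests.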


Spatial Database of Vector Objects: Discussion

Relations between objects implicit

Very flexible: depending on the analysis task, different relationships can be constructed

Storage efficient; no overhead for storing relationship information

Compute intensive (thus spatial indexing is very important)

Consider what and when to materialize

Very rich possibilities to create new, non-trivial objects from existing ones

Makes feature extraction an important topic for Data Mining

Inherently multi-relational setting (but not first-order)

Could also be formulated in a deductive database setting


Interactive Visualization of Spatial Data – Exploratory Data Analysis


(work by G. Andrienko & N. Andrienko, H. Voss and others at Fraunhofer IAIS)

For the theory behind CommonGIS, see the book: Andrienko, N. and Andrienko, G.: Exploratory Analysis of Spatial and Temporal Data - A Systematic Approach, Springer, 2005


Geographic Information Systems and CommonGIS

Many commercial tools available

- ESRI ArcGIS
- MapInfo
- Intergraph
- Manifold

But CommonGIS is different and unique …
- Map-based exploratory data analysis
- stresses interactive visualization and manipulation of statistical data in space
- elaborated facilities for time-series visualization

CommonGIS can be acquired for non-commercial use by educational institutions for no fee


CommonGIS

= Fraunhofer IAIS Tool for Map-based

- combines interactive cartography and statistics
- Time-series visualization and analysis
- Combines Vector-Raster transformation
- Multivariate Decision support
- Multi-dimensional
- Weighted Sums
- Ideal Point Analysis
- Similarity analysis
- Dominant Attribute
- Integration with Weka (Clustering, Decision Trees)


CommonGIS: Visual analysis of spatial data

Interactive spatial search for geographic objects and recognition of spatial patterns:

dynamic choropleth maps, pie charts, bar charts, etc., with dynamic removal of outliers and dynamic queries

Comparison of attribute values of geographic objects (relations and correlations) and comparison of spatial patterns (spatial correlations):

(Linked) dynamic maps and interactive diagrams


CommonGIS: Visual analysis of spatio-temporal data

CommonGIS as an interactive browser to study how a spatial pattern evolves over time:

time aware maps (animations)

time series charts

CommonGIS as an interactive browser for temporal behaviours of objects:

set of controls for analysing time intervals (object animations)

CommonGIS as an interactive browser of discrete space-time events to find spatio-temporal clusters:

space-time cube


Time-Series: Sales per Shop and Product Category

[Chart categories: Bakery, Stand-up café, Sit-down café, Terrace]

Different Time Hierarchies (Year, Quarter, Month, Day…)


CommonGIS: Data transformation

Transformation of data for further analysis:

Attribute transformations:

calculate statistical indices

transform and combine attribute data arithmetically

dynamic classifiers (linked with dynamic choropleth map)

cross classifiers (linked with dynamic choropleth map)

Geographic transformations:

query, transform, combine, derive raster data

illumination model

raster -> vector transformations (i.e. raster -> area aggregation)


Geographic and Spatial Data Mining Methods


Autocorrelation


Spatial Variation

How are variables distributed in space?

Tobler‘s First Law of Geography:

„Everything is related to everything else, but near things are more related than distant things.“

⇒ distribution of variables depends on space

⇒ variables are autocorrelated

Field Soil Moisture

Franke, diploma thesis, Leipzig Univ., 2006


Spatial Autocorrelation: Binary Example

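binary attribute (blue, white)

autocorrelation to four immediate neighbors

Moran Index (here): $I = \frac{n_{equal} - n_{change}}{n_{equal} + n_{change}}$, where n_equal and n_change count the neighbor pairs with equal and with different values

[Figure: four example grids with I = 0.86, I = 0.39, I = 0.00 and I = -1.00. Source: Goodchild, CATMOG, GeoBooks, Norwich, 1986]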


Moran‘s I

Moran's I is a measure of spatial autocorrelation. It is a weighted product-moment correlation coefficient, where the weights reflect geographic proximity, and it is used to detect departures from spatial randomness. Departures from randomness indicate spatial patterns such as clusters and geographic trends.

Values of I larger than 0 indicate positive spatial autocorrelation; values smaller than 0 indicate negative spatial autocorrelation.

$$I = \frac{n \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} w_{ij} (z_i - \bar{z})(z_j - \bar{z})}{\left( \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} w_{ij} \right) \sum_{i=1}^{n} (z_i - \bar{z})^2}$$

z – attribute of interest ($\bar{z}$ – its mean); w – weight; n – number of areal objects

Example: n = 4, areas A, B, C, D, weight matrix w_ij:

   A B C D
A  0 1 1 0
B  1 0 1 1
C  1 1 0 1
D  0 1 1 0
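As a worked illustration, Moran's I can be computed directly from the attribute values and the weight matrix. The sketch below assumes the standard form of the statistic (deviations from the mean, summed over all pairs i ≠ j) and uses the 4-area weight matrix of the example; the attribute values are invented.

```python
def morans_i(values, weights):
    """Moran's I for attribute values z_i and a weight matrix w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [z - mean for z in values]
    # Sum over all pairs i != j of w_ij * (z_i - mean) * (z_j - mean).
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n) if i != j)
    total_weight = sum(weights[i][j] for i in range(n) for j in range(n) if i != j)
    return n * cross / (total_weight * sum(d * d for d in dev))


# Weight matrix of the example (areas A, B, C, D; 1 = neighbours).
w = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]

# Hypothetical attribute values for A, B, C, D.
z = [10, 10, 20, 20]
i_value = morans_i(z, w)
```

For these invented values the result is slightly negative, because most neighbouring pairs have different values. The function divides by the variance term, so it is undefined when all values are equal.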

Spatial Autocorrelation

similarity in location indicates similarity in attribute value

differs from temporal autocorrelation

- 1-dimensional autocorrelation in time series, spatial autocorrelation spreads in 2 or 3 dimensions

- only forward causality in time series, direction of causality not restricted in space

depends on scale

[Figures: Temperature of Sunspots; Sunspot Time Series (x-axis: year, y-axis: # sunspots)]


Effects of Autocorrelation

makes spatial abstraction possible

makes standard approaches of analysis impossible - most statistics assume iid

makes local inference attractive - Kriging, kNN, …

makes choice of sampling interval hard - autocorrelation depends on scale

makes interpolation easier than extrapolation

zero autocorrelation = independence of location

[Figure: spatial autocorrelation, correlation (between -1 and +1) plotted against distance]

Problem types for Spatial Data Mining

Spatial Data Mining := partially automated search for patterns and models in large spatial databases

Classification of methods along the following hierarchy

Points

Points, Lines and Area

Networks


Handling spatial data in Data Mining – Basic Options

Treat as ordinary variables

no special algorithms needed

spatial properties ignored, e.g. discontiguous areas

Make spatial relationships explicit, e.g. infer topological relationships

expensive, but allows normal algorithms to be used

Can be done as pre-processing or dynamically (the latter requires specialized algorithms)

Specialized algorithms

- Neighborhood methods, kriging, Gaussian processes, density-based clustering …

Use proper combination of data, preprocessing, algorithms, and interaction software!


Mining Point Data


[Figure: problem types placed along space-complexity and time-complexity axes; here: Points]

Clustering spatial point data

Point data conceived as discrete objects

Many approaches exist for clustering spatial point data

In statistics, measures of spatial randomness or non-randomness have been developed (e.g. Ripley 1991, Cressie 1993)

- Ripley‘s K function as measuring deviation from complete spatial randomness (as exemplified by a Poisson process)

- Moran‘s I, which measures autocorrelation

Bayesian approaches often coming from image analysis (cf. Lawson et al 2002)

In Geography, spatial clustering algorithms have been developed (Openshaw, GAM,


Density Based Clustering – a KDD approach

[Ester et al. 1996]

Suitable for large databases

Discovers areas of high density and turns them into clusters

Discovers clusters of arbitrary shape

Can handle noise

Algorithm DBSCAN

Note: Relatively straightforward extension to vector data possible (GDBSCAN); requires more complex definition of some key concepts (neighborhood and MinPts)


Clustering spatial data

distance-based clustering is inherently spatial

but assumption of convex clusters (e.g. k-means) inappropriate for many "geographical" tasks


Definitions 1

Eps-neighborhood of a point p: N_ε(p) := {q ∈ D | dist(p, q) ≤ ε}

A point p is directly density-reachable from q iff
1. p ∈ N_ε(q)
2. |N_ε(q)| ≥ MinPts ("q is core object")

- Not necessarily symmetric

[Figure: p is a border object, q is a core object; p is directly density-reachable from q, q is not directly density-reachable from p]

The choice of Eps is crucial!


Definitions 2

A point p is density-reachable from a point q wrt. Eps and MinPts iff there is a chain of points p_1, …, p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i

- Transitive, not symmetric

p is density-connected to q iff there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts

- Symmetric

[Figure: p density-reachable from q, q not density-reachable from p; p and q density-connected to each other by o]


Density-connected clustering

A cluster C wrt. Eps and MinPts is a non-empty subset of database D, where
(1) ∀p,q: if p ∈ C and q is density-reachable from p wrt. Eps and MinPts, then q ∈ C
(2) ∀p,q ∈ C: p is density-connected to q wrt. Eps and MinPts

Non-covered points are noise

Each cluster contains at least MinPts points

Exactly one clustering


Algorithm DBScan – Basic Idea

Check the Eps-neighborhood of every unclassified point in the database

If the neighborhood of p contains at least MinPts points, a new cluster with p as core object is built

Collect directly density-reachable objects from this set, merging clusters as necessary

Terminate when no new point can be added to any cluster
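The basic idea can be sketched compactly. The code below is an illustrative sketch rather than the implementation from Ester et al. (1996): it uses a linear scan for the neighborhood query instead of a spatial index, and it treats a point as a core object when its Eps-neighborhood (including the point itself) holds at least MinPts points, which is the convention of the original paper. The names `region_query`, `dbscan` and `NOISE` are this sketch's own.

```python
import math

NOISE = -1


def region_query(points, index, eps):
    """Indices of all points in the Eps-neighbourhood of points[index]."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point (NOISE for non-covered points)."""
    labels = [None] * len(points)            # None = unclassified
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:        # not a core object
            labels[i] = NOISE                # may later become a border object
            continue
        labels[i] = cluster
        seeds = list(neighbours)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:           # border object reached from a core
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is a core object: expand
                seeds.extend(j_neighbours)
        cluster += 1
    return labels


# Hypothetical data: two dense groups and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (5, 5)]
result = dbscan(data, eps=1.5, min_pts=3)
```

With a spatial index answering `region_query`, the quadratic scan used here would be replaced by logarithmic lookups, which is what makes the method suitable for large databases.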


Kriging-Spatial Interpolation


Kriging

developed by G. Matheron in the 1960s based on work of D. Krige

geostatistical method of interpolation

Point data conceived as samples from a continuous surface

results are smoothly varying surfaces

provides optimality given assumptions (best linear unbiased estimate)

variety of methods, e.g. Ordinary Kriging, Universal Kriging, Co-Kriging, Block Kriging, Stratified Kriging, Indicator Kriging, …

[Figure: • – measurements, ? – unknown values]


Spatial Variation

Problem:

spatial variation of a continuous attribute is often too irregular to be modelled by a simple, smooth mathematical function

Solution:

variation can be described by stochastic surface

x – location in n-dimensional space

Z(x) – random variable of interest, e.g. soil moisture

A stochastic process is a family of random variables Z(x) over the index set $D \subset \mathbb{R}^n$: $\{ Z(x) : x \in D \}$

A Gaussian process is a stochastic process for which any finite set of Z-variables has a joint multivariate Gaussian distribution.


Components of Spatial Variation

structural component, having a constant mean or trend: m(x)

random, but spatially correlated component (regionalized variable): ε'(x)

spatially uncorrelated random noise term: ε''

$$Z(x) = m(x) + \varepsilon'(x) + \varepsilon''$$

(trend + autocorrelation + random noise)

[Figure: Z(x) plotted against x]


Stationarity

Problem:

spatial data set is a single realization of a random process

inference is impossible without further restrictions on spatial variation

Intrinsic Stationarity (stationarity under translation):

constant mean (E[...] = 0) or trend (E[...] > 0): $E[Z(x) - Z(x+h)] = \text{const.}$

variance of differences at lag h is independent of location: $E[\{Z(x) - Z(x+h)\}^2] = 2\gamma(h)$

Isotropy (stationarity under rotation):

spatial process evolves the same in all directions

[Figure: two locations x and x+h separated by lag h]

Ordinary Kriging

Assumptions:

intrinsic stationarity with a constant mean

- constant mean value in sampling area: $E[Z(x) - Z(x+h)] = 0$

- variance of differences depends only on the distance h between sites: $\mathrm{Var}[Z(x) - Z(x+h)] = E[\{Z(x) - Z(x+h)\}^2] = E[\{\varepsilon'(x) - \varepsilon'(x+h)\}^2] = 2\gamma(h)$, where γ(h) is the semivariance

Once structural effects have been accounted for, the remaining variation is homogeneous in variance, so that differences between sites are merely a function of the distance between them.

[Figure: two locations x and x+h separated by lag h]


Ordinary Kriging

Procedure:

1. Estimate semivariance γ(h) from data sample
2. Plot the experimental variogram
3. Fit a theoretical model to the experimental variogram
4. Estimate unknown values as weighted sum of neighboring measurements, determine optimal weights from variogram


Semivariance and Experimental Variogram

semivariance depends only on distance (lag) h

estimate semivariance between all pairs of measurements with distance h (repeat for all possible h):

$$\hat{\gamma}(h) = \frac{1}{2n} \sum_{i=1}^{n} \{ z(x_i) - z(x_i + h) \}^2$$

where n is the number of pairs of measurements separated by lag h

[Figure: Experimental Variogram, γ(h) plotted against lag h]
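The estimator can be sketched as follows. Real samples are rarely spaced at exactly equal lags, so this sketch groups pairs into lag bins of a chosen width; that binning, the function name and the sample values are assumptions made for illustration.

```python
import math
from collections import defaultdict


def experimental_variogram(samples, lag_width):
    """Estimate the semivariance per lag bin.

    samples is a list of ((x, y), z) measurements. A pair at distance d falls
    into bin int(d // lag_width). Returns {bin centre: semivariance}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, (p, zp) in enumerate(samples):
        for q, zq in samples[i + 1:]:
            bin_index = int(math.dist(p, q) // lag_width)
            sums[bin_index] += (zp - zq) ** 2
            counts[bin_index] += 1
    # Semivariance: half the mean squared difference of the pairs in each bin.
    return {(k + 0.5) * lag_width: sums[k] / (2 * counts[k]) for k in sorted(sums)}


# Hypothetical measurements along a line.
measurements = [((0, 0), 1.0), ((1, 0), 2.0), ((2, 0), 4.0)]
variogram = experimental_variogram(measurements, lag_width=1)
```

Plotting the returned values against the bin centres gives the experimental variogram to which a model is then fitted.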


Variogram

nugget:

- γ(0) = 0 (by definition)

- nugget effect represents small scale variation and measurement errors
- estimate of ε''

range:

- spatial dependency

- here, variance of differences increases with distance

- two points are more similar the closer they are

sill:

- semivariance levels off
- variance of differences is independent of distance

[Figure: variogram γ(h) against lag h, with nugget, range and sill marked]


Variogram Models

experimental variogram must be fitted to an appropriate variogram model

most commonly used are the spherical, exponential, linear or Gaussian model

[Figure: four panels (Spherical Model, Exponential Model, Linear Model, Gaussian Model), each plotting γ(h) against lag h]
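The four model families can be written as small functions of the lag. The parameterizations below are one common convention and are an assumption of this sketch, not taken from the slides; in particular, some texts scale the exponent of the exponential and Gaussian models by a factor of 3 so that the range parameter matches the practical range. Here `sill` denotes the total sill including the nugget.

```python
import math


def spherical(h, nugget, sill, rng):
    """Spherical model: reaches the sill exactly at the range."""
    if h == 0:
        return 0.0
    if h >= rng:
        return sill
    r = h / rng
    return nugget + (sill - nugget) * (1.5 * r - 0.5 * r ** 3)


def exponential(h, nugget, sill, rng):
    """Exponential model: approaches the sill asymptotically."""
    if h == 0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - math.exp(-h / rng))


def gaussian(h, nugget, sill, rng):
    """Gaussian model: parabolic near the origin, asymptotic sill."""
    if h == 0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - math.exp(-((h / rng) ** 2)))


def linear(h, nugget, slope):
    """Linear model: no sill, semivariance keeps growing with the lag."""
    if h == 0:
        return 0.0
    return nugget + slope * h
```

Fitting then means choosing nugget, sill and range (or slope) so that the chosen function follows the experimental variogram as closely as possible, for example by least squares.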


Interpolation of unknown Values

unknown value at location x_0 is estimated as a weighted sum of neighboring measurements:

$$Z^*(x_0) = \sum_{i=1}^{n} w_i Z(x_i)$$

weights w_i are determined according to two restrictions

- Z*(x_0) is an unbiased estimate of Z(x_0)
- Z*(x_0) is an optimal estimate

Have to solve a system of n+1 linear equations of semivariances and weights


Equation System

restriction on weights introduces Lagrange parameter φ (Restriction 1)

system of (n+1) equations must be solved to obtain optimal weights for each x_0

$$
\begin{pmatrix}
\gamma(x_1 - x_1) & \cdots & \gamma(x_1 - x_n) & 1 \\
\vdots & \ddots & \vdots & \vdots \\
\gamma(x_n - x_1) & \cdots & \gamma(x_n - x_n) & 1 \\
1 & \cdots & 1 & 0
\end{pmatrix}
\begin{pmatrix} w_1 \\ \vdots \\ w_n \\ \varphi \end{pmatrix}
=
\begin{pmatrix} \gamma(x_1 - x_0) \\ \vdots \\ \gamma(x_n - x_0) \\ 1 \end{pmatrix}
$$

Ordinary Kriging is an exact interpolator, i.e. the interpolated value at a sample location will be identical to the measurement taken there
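The equation system can be solved directly for small n. The sketch below is illustrative only: it builds the (n+1) x (n+1) system of semivariances bordered by the sum-to-one constraint on the weights, solves it with a simple Gauss-Jordan elimination, and assumes a variogram function `gamma(h)` supplied by the caller (here a plain linear variogram, chosen for simplicity). Function names and sample values are hypothetical.

```python
import math


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ordinary_kriging(samples, target, gamma):
    """Estimate Z*(x0) at target from ((x, y), z) samples and a variogram gamma(h)."""
    n = len(samples)
    pts = [p for p, _ in samples]
    # Semivariances between samples, bordered by the constraint sum(w_i) = 1.
    a = [[gamma(math.dist(pts[i], pts[j])) for j in range(n)] + [1.0] for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [gamma(math.dist(p, target)) for p in pts] + [1.0]
    *weights, _phi = solve(a, b)              # last unknown is the Lagrange parameter
    estimate = sum(w * z for w, (_, z) in zip(weights, samples))
    return estimate, weights


# Hypothetical measurements and a linear variogram gamma(h) = h.
obs = [((0, 0), 1.0), ((2, 0), 3.0)]
mid_estimate, mid_weights = ordinary_kriging(obs, (1, 0), lambda h: h)
at_sample, _ = ordinary_kriging(obs, (0, 0), lambda h: h)
```

Evaluating at a sample location returns the measured value, which illustrates the exact-interpolator property. For many samples a numerical library and a restriction to nearby measurements would be the more practical choice.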


Variants of Kriging

Universal Kriging

structural component may contain an external trend

Co-Kriging

interpolation for one attribute incorporates information of another, correlated attribute

sparse measurements of an expensive variable are supported by plentiful measurements of a cheap variable

Stratified Kriging

interpolation within sub-areas

equations are adjusted to avoid discontinuities on boundaries

More Details: Burrough, P., McDonnell, R 1998


Mining Points, Lines, and Areas


Points, Lines and Areas

[Figure: problem types placed along space-complexity and time-complexity axes: Points; Points, Lines, and Areas]


Points, Lines and Areas

Requirements:
• Point data
• Polygons
• Aggregations

Applications:
• Customer Segmentation
• Catchment Areas
• Location Planning
• Radio Network Analysis

Examples:
• GDBScan Clustering
• Spatial Subgroup Mining
• Spatial Association Rules
• Spatial Model Trees


Clustering of Vector Data: GDBScan [Sander et al 1998]

Extension of DBSCan - Sample Instantiations

Generalized neighborhood: dist < ε, intersects/meets, neighbor

Generalized MinPts condition on a neighborhood S: |S| ≥ MinCard, ∑ areas ≥ MinArea, f(S) ≥ MinF


Spatial Subgroup Mining


Typical Data Mining representation

- 'spreadsheet data'
- exactly 1 table
- atomic values

Data Mining for spatial data: very different from this representation

91 Michael May

Subgroup Discovery Search (Klösgen 1996, Wrobel 1997)

Subgroup discovery searches deviation patterns for subgroups

overproportionally high share of target value (or mean of target variable)

Top-down search from most general to most specific subgroups, exploiting partial ordering of subgroups

S₁ ≥ S₂: S₁ is more general than S₂

Beam search expands only the n best ones at each level

Evaluating hypotheses according to a quality function:

N = total population

n = subgroup size

p(T) = target share in total population

p(T|C) = target share in subgroup

Extension to multi-relational representation in Wrobel (1997)

$$
\frac{p(T \mid C) - p(T)}{\sqrt{p(T)\,(1 - p(T))}} \cdot \sqrt{n} \cdot \sqrt{\frac{N}{N - n}}
$$
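As a worked example, the quality of a subgroup can be computed from the four quantities defined on this slide. The sketch assumes the common Klösgen-style form: the deviation of the target share, standardized by sqrt(p(T)(1 - p(T))), scaled by sqrt(n) and by the finite-population factor sqrt(N / (N - n)). The function name and the example numbers are illustrative assumptions.

```python
import math

def subgroup_quality(n, N, p_subgroup, p_total):
    """Standardized deviation of the subgroup's target share from the overall share."""
    if not 0 < n < N:
        raise ValueError("subgroup size n must satisfy 0 < n < N")
    if not 0.0 < p_total < 1.0:
        raise ValueError("p(T) must lie strictly between 0 and 1")
    deviation = (p_subgroup - p_total) / math.sqrt(p_total * (1.0 - p_total))
    return deviation * math.sqrt(n) * math.sqrt(N / (N - n))

# 100 of 1000 districts form the subgroup; target share 40% inside, 20% overall.
quality = subgroup_quality(n=100, N=1000, p_subgroup=0.4, p_total=0.2)
```

Larger subgroups and larger deviations both increase the value, so beam search can rank candidate subgroups by this single number.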

(47)

92 Michael May

Translating Multirelational Subgroups

to Object-relational SQL

Domain: relational database schema D= {R1, ..., Rn} having geometry attributes Gi

Hypothesis Language

Multirelational subgroups are represented by a concept set C= {Ci}, where each Ci consists of a set of

attribute value-pairs {A1=v1,...,An=vn} from a relation in D,

a set of links L={Li} linking concepts Ci, Ck via their attributes Am, An of the form

(Ci/Am {=|inside| overlaps|...|spatially_interact} Ck/An)

target attribute can be non-numeric (A1=v1) or numeric aggregate (avg(A)=n)

Example:

C= {{district.long_term_illness=high, district.unemployment=high},{street.name=’Manchester Road’}}

L= {{district.geometry spatially_interact street.geometry}}

“Enumeration districts with high rate of long term illness and unemployment crossed by Manchester Road”

Testing satisfaction of subgroup descriptions

The number of tuples in D that satisfy a subgroup description is evaluated using SQL select statements including joins over multiple relations.
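The translation step can be illustrated with a small query builder. This is a sketch under assumptions, not the SPIN! implementation: the spatial predicate names follow the SQL/MM style (`ST_Intersects` etc.), whereas Oracle Spatial uses its own operators; the `id` key column and the function name are hypothetical; table and attribute names are assumed to come from a trusted schema, and only values are passed as parameters.

```python
# Assumed mapping from link relations to SQL/MM-style spatial predicates.
SPATIAL_PREDICATES = {
    "spatially_interact": "ST_Intersects",
    "inside": "ST_Within",
    "overlaps": "ST_Overlaps",
}

def subgroup_to_sql(concepts, links, target_table, key="id"):
    """Return (sql, params) that counts target objects satisfying the subgroup."""
    conditions, params = [], []
    # Attribute-value pairs of every concept become equality conditions.
    for table in sorted(concepts):
        for attribute, value in sorted(concepts[table].items()):
            conditions.append(f"{table}.{attribute} = ?")
            params.append(value)
    # Links become join conditions, either plain equality or a spatial predicate.
    for left, relation, right in links:
        if relation == "=":
            conditions.append(f"{left} = {right}")
        else:
            conditions.append(f"{SPATIAL_PREDICATES[relation]}({left}, {right})")
    sql = (f"SELECT COUNT(DISTINCT {target_table}.{key}) "
           f"FROM {', '.join(sorted(concepts))} "
           f"WHERE {' AND '.join(conditions)}")
    return sql, params

concepts = {
    "district": {"long_term_illness": "high", "unemployment": "high"},
    "street": {"name": "Manchester Road"},
}
links = [("district.geometry", "spatially_interact", "street.geometry")]
sql, params = subgroup_to_sql(concepts, links, target_table="district")
print(sql)
print(params)
```

Counting distinct target objects avoids counting a district twice when it interacts with several matching street segments.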

93 Michael May

Approach: Translation of Spatial Subgroup Mining to SQL

(Klösgen, May 2002)

• Representing subgroups in object-relational SQL, i.e. multi-relational representation
• Using representation for spatial geometry based on Spatial Database
• Division of work between RDBMS and Search Manager
• Combining visualization in abstract and physical space

(48)

94 Michael May

Division of labour between RDBMS

and Search Manager (May, Savinov 2003)

Database Server

Search Algorithm

Mining Server

statistics

•search in hypothesis space

• generation and evaluation of hypotheses (subgroup patterns)

mining query

• Database integration: efficiently organize mining queries

• Mining query delivers statistics (aggregations) sufficient for evaluating many hypotheses
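The idea that one mining query returns statistics for many hypotheses can be sketched with an in-memory SQLite table. The table layout, the column names and the toy rows are assumptions made for illustration; the actual system ran against a spatial RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE district (illness TEXT, unemployment TEXT, target INTEGER)")
conn.executemany("INSERT INTO district VALUES (?, ?, ?)", [
    ("high", "high", 1), ("high", "high", 1), ("high", "low", 0),
    ("low", "high", 1), ("low", "low", 0), ("low", "low", 0),
])

# One mining query: group sizes and target counts per value combination.
stats = conn.execute(
    "SELECT illness, unemployment, COUNT(*), SUM(target) "
    "FROM district GROUP BY illness, unemployment").fetchall()

def evaluate(stats, illness=None, unemployment=None):
    """Return (n, positives) for a hypothesis; None means 'any value'."""
    n = positives = 0
    for ill, unemp, count, target_sum in stats:
        if illness in (None, ill) and unemployment in (None, unemp):
            n += count
            positives += target_sum
    return n, positives

# The search manager evaluates several hypotheses without further queries.
print(evaluate(stats, unemployment="high"))
print(evaluate(stats, illness="high", unemployment="high"))
```

The database does the aggregation once; the search algorithm then combines the aggregated cells in memory for every candidate subgroup over these attributes.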

95 Michael May

SPIN! – Spatial Data Mining System

Screenshot labels: Workspace, Property Editor, Subgroup Viewer, Flowchart Tool, Subgroup Result List

(49)

96 Michael May

Interactive Exploratory Analysis

Combination of spatial and non-spatial visualization

User selects and manipulates variables

Powerful for analysis in low dimensions (3-4)

Scatter Plot

Parallel Coordinate Plot

Choropleth Maps

Display dynamically linked

97 Michael May

Visualization of spatial subgroups

Linked Display

Spatial Venn Diagram

Subgroup Overview

p(T|C) vs. p(C)

Subgroup

High long-term illness in

districts crossed by M60

(50)

98 Michael May

Radio Network Planning in Telecommunication

High call cut-off ratio in mountainous regions crossed by highways having a certain technical configuration

Legend:

Blue: motorway

Brown: high altitude

Black: subgroup

SPIN! Mapviewer (Common GIS)

99 Michael May

Other commercial applications of Subgroup Discovery

How are my customers characterized? Are there interesting profiles?

Where to open the next supermarket? Does it create competition for my other supermarkets?

(51)

100 Michael May

Spatial Association Rules

work and slides by

Donato Malerba et al., Univ. Bari


101 Michael May

Spatial association rules

An association pattern P (s%) is a spatial association pattern if it contains at least one spatial relation

A large town intersects a road and is adjacent to water (62%)

An association rule Q → R (s%, c%) is a spatial association rule if Q ∧ R is a spatial association pattern

IF a large town intersects a road

THEN it is also adjacent to water (62%, 89%)

Malerba et al.

Seminal work by Koperski & Han 1995
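The two percentages attached to a rule can be reproduced on toy data. The sketch assumes that s% is the support of Q ∧ R over the reference objects and c% the confidence of Q → R, and that the spatial relations of each reference object (here: large towns) have already been extracted into a set of predicates. The predicate names and the data are invented for illustration.

```python
def support(descriptions, predicates):
    """Share of reference objects whose description contains all predicates."""
    hits = sum(1 for d in descriptions if predicates <= d)
    return hits / len(descriptions)

def rule_measures(descriptions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    s_rule = support(descriptions, antecedent | consequent)
    s_antecedent = support(descriptions, antecedent)
    confidence = s_rule / s_antecedent if s_antecedent else 0.0
    return s_rule, confidence

# One set of spatial predicates per large town (toy data).
towns = [
    {"intersects_road", "adjacent_to_water"},
    {"intersects_road", "adjacent_to_water"},
    {"intersects_road", "adjacent_to_water"},
    {"intersects_road"},
    {"adjacent_to_water"},
]
s, c = rule_measures(towns, {"intersects_road"}, {"adjacent_to_water"})
print(f"support={s:.0%}, confidence={c:.0%}")
```

Here three of five towns satisfy both predicates and four of five satisfy the antecedent, so the rule holds with 60% support and 75% confidence.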

(52)

102 Michael May

The problem

Given

- a spatial database (SDB) with a set of reference objects S,
- some sets R_k, 1 ≤ k ≤ m, of task-relevant objects,
- some spatial hierarchies H_k involving objects in R_k,
- M granularity levels in the descriptions,
- a set of granularity assignments ψ_k, which associate each object in H_k with a granularity level,
- a pair of thresholds minsup[l] and minconf[l] for each granularity level,
- domain knowledge

Find

strong multiple-level spatial association rules.

Malerba et al

103 Michael May

The solution

Solution (Appice et al., IDA Journal, 2003)

based on an Inductive Logic Programming (ILP) approach → spatial relations easily handled

spatial pattern → conjunction of first-order logic atoms

θ-subsumption orders the space of spatial patterns

monotonicity of support w.r.t. θ-subsumption → pruning of patterns at the same granularity level in the candidate generation phase

monotonicity of pattern frequency w.r.t. granularity level → pruning of patterns at different granularity levels in the candidate generation phase

Implemented in SPADA (Spatial Pattern Discovery Algorithm)

European project SPIN (Spatial Mining for Data of Public Interest)

(53)

104 Michael May

Extensions of initial solutions

Efficiency improvement of pattern evaluation by caching support objects for each stored pattern

Definition of a declarative bias to filter out rules on the basis of users’ preferences ⇒ efficiency improvement is a byproduct

- In real-world applications a large number of spatial patterns can be generated even for a few hundred spatial objects.

- Most of the discovered patterns are useless for the application at hand

- Urban accessibility application: only spatial patterns involving some sociological factor (household with no car) are interesting.

Integration of SPADA in the ARES system that interfaces a Spatial DB (Oracle Spatial)

105 Michael May

Mining Network Data


(54)

106 Michael May

Networks

Figure: diagram with axes "Space Complexity" and "Time complexity"; stages shown: Points; Points, Lines, and Areas; Networks

107 Michael May

Points and Networks

• Requirements:
  • Point Data
  • Polygons
  • Aggregations
  • Spatial dependencies and relations, networks

• Examples: Traffic frequency prediction

• Method:

(55)

108 Michael May

Case Study: Outdoor Advertising - Frequency Atlas

Customer:

Fachverband für Außenwerbung

(FAW; German Outdoor Advertising Association)

Task:

Performance value assessment of advertising media

Traffic volume forecast

separately for private cars, public transport and pedestrians

109 Michael May

Frequency + Media factories = poster reach

Gesellschaft für

Konsumforschung

(56)

110 Michael May

The project in numbers

Complete model for all German cities

with more than 50.000 inhabitants

(192 cities) = ca 1.000.000 street segments!

Complete model includes, for each segment:

- car frequency
- pedestrian frequency
- public transport frequency

The model is presently being extended to all cities with between 10.000 and 50.000 inhabitants

111 Michael May

Basic Data: traffic measurements

Manual traffic measurement at selected

poster locations

- 4 times 6 minutes at four days of the week at four times of day

Additional empirical model of day totals

Properties

- Well-defined measurements
- Extended measurement period, so concept drift cannot be ruled out

(57)

112 Michael May

Figure labels: Street network; Sociodemographics + Socioeconomics; Public transport network; Frequency measurements; Points of Interest (POI); DATA MINING; Frequency classes; value scale 0, 200, 400, 600, 800, 1000, 1250, 1500, 1750, 2000, ...

Secondary data

113 Michael May

Local Measurements

Inhomogeneous measurements on the same street

How Spatial Autocorrelation helps

Figure: map excerpt with example measurement values 843, 820 and 1200

(58)

114 Michael May

Attributes of street segments:

- Name, type, …, class
- Points of Interest
- Spatial coordinates

Locations with measurement values

Spatial kNN

Distance between two segments x_a, x_b:

$$
d(x_a, x_b) = \sum_{m=1}^{M} \left| x_{am} - x_{bm} \right|
$$

(The project actually used a specially adapted distance measure)

Selection of the k closest x₁, …, x_k

Prediction for new segment x_q:

$$
\hat{y}_q = \frac{\sum_{i=1}^{k} w_i \, y_i}{\sum_{i=1}^{k} w_i}
\quad \text{with} \quad
w_i = \frac{1}{d(x_q, x_i)}
$$
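The prediction step can be sketched in a few lines: pick the k measured segments closest to the query segment and average their frequencies with weights 1/d. The attribute-wise absolute-difference distance and the toy data are assumptions; as noted on the slide, the project itself used a specially adapted distance measure.

```python
def distance(a, b):
    """Sum of attribute-wise absolute differences (an assumed, simple choice)."""
    return sum(abs(am - bm) for am, bm in zip(a, b))

def knn_predict(measured, query, k):
    """Inverse-distance-weighted mean of the k nearest measured segments.

    measured: list of (attribute_vector, frequency) pairs.
    """
    nearest = sorted(measured, key=lambda item: distance(item[0], query))[:k]
    for attributes, frequency in nearest:
        if distance(attributes, query) == 0:
            return frequency  # identical segment: reuse its measurement
    weights = [1.0 / distance(attributes, query) for attributes, _ in nearest]
    weighted_sum = sum(w * freq for w, (_, freq) in zip(weights, nearest))
    return weighted_sum / sum(weights)

# Toy data: two numeric attributes per segment and a measured frequency.
measured = [((0, 0), 843), ((2, 0), 820), ((0, 4), 1200), ((10, 10), 100)]
prediction = knn_predict(measured, query=(1, 0), k=2)
```

With k = 2 the two nearest segments are equally far from the query, so the prediction is their plain mean.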

115 Michael May

Spatial KNN - Properties

kNN captures the autocorrelation inherent in the data well

Allows background knowledge to be brought in by fine-tuning the distance function

Database Integrated (Oracle Spatial)

Performs dynamic spatial query (minimum distances among polygons)

Performance improvements

Spatial Queries use Index Structures (R-Tree), still relatively costly (i.e. dominates overall run-time)

Partial evaluation of distance function based on lower bounds for distance to minimize number of spatial queries

(59)

116 Michael May

Smoothing based on flow constraints

Measurement errors lead to inconsistencies

Need plausible assignment of frequencies

Solution:

Use Kirchhoff’s law as constraint

- Sum of inputs = sum of outputs

Smoothing algorithm finds locally optimal

solution using constraint relaxation
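One way to picture constraint relaxation is to visit each junction in turn and spread its imbalance (inflow minus outflow) evenly over the incident segments, repeating until the flows settle. This is only an illustrative sketch with a hypothetical data layout, not the smoothing algorithm used in the project.

```python
def smooth_flows(flows, junctions, sweeps=100):
    """Relax measured flows toward 'sum of inputs = sum of outputs'.

    flows: dict mapping segment id to measured frequency.
    junctions: list of (incoming_segments, outgoing_segments) pairs.
    """
    smoothed = dict(flows)
    for _ in range(sweeps):
        for incoming, outgoing in junctions:
            imbalance = (sum(smoothed[s] for s in incoming)
                         - sum(smoothed[s] for s in outgoing))
            # Share the correction equally among all incident segments.
            step = imbalance / (len(incoming) + len(outgoing))
            for s in incoming:
                smoothed[s] -= step
            for s in outgoing:
                smoothed[s] += step
    return smoothed

# A street without side roads: e1 -> e2 -> e3 with inconsistent measurements.
measured = {"e1": 100.0, "e2": 130.0, "e3": 100.0}
junctions = [(["e1"], ["e2"]), (["e2"], ["e3"])]
smoothed = smooth_flows(measured, junctions)
```

In this example all three segments converge to the same value, 110, which satisfies the constraint at both junctions while staying close to the measurements.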

117 Michael May

Explaining frequencies

Problem: Customer wants transparent values, not a black box

=> Problem for Spatial kNN

Solution: Fit an explanatory model to the predicted values

Allows to understand why predictions are as they are

Allows to identify potential outliers and areas of high uncertainty

⇒ Use Model Trees

(60)

118 Michael May

Numerical prediction with model trees

Improving model by spotting outliers based on model tree prediction

Points with large prediction error are checked

- Visual inspection

- Getting additional empirical input by taking new measurements

Corrected values are basis for next round in model building, leading to improved results

(61)

120 Michael May

Final Result: Frequency Map

Figure: frequency maps for Cars, Public Transport and Pedestrians

121 Michael May

~1 Million street segments predicted based on 96.000 measurements

Final result: frequency atlas

(cars, public transport, pedestrians)

Used for determining poster prices in Germany since 2006

(62)

122 Michael May

Spatial Model Trees [Malerba, Appice, Cecci 2005]

Standard Model Trees (e.g. M5‘) can do Spatial Mining by splitting along x and y coordinates

Mrs-Smoti (Malerba et al. 2004) is a variant of Model Trees that

- Allows regression nodes as interior nodes
- Handles autocorrelation directly:
  - Spatial regression model with dependencies in response variables: spatially lagged response

It takes as input spatial objects, possibly belonging to separate thematic layers, stored in a spatial database S

- target objects (main subject of analysis)
- non-target objects (relevant for the task in hand)

and outputs a spatial model tree T by

- partitioning training spatial data according to intra-layer and inter-layer relationships

- associating different regression models to disjoint spatial areas

Integrates spatial database queries (see Subgroup Discovery)

Figure: example spatial model tree T with split nodes (X’3 ≤ α, X’2 ≤ β, X’4 ≤ γ) and regression nodes (Y=a+bX1, Y’=c+dX’3, Y’=e+fX’2, Y’=g+hX’3, Y’=i+lX’4)

123 Michael May

Mining Tracks in Space and Time


(63)

124 Michael May

Tracks in Space and Time

Figure: diagram with axes "Space Complexity" and "Time complexity"; stages shown: Points; Points, Lines, and Areas; Networks; Tracks in Space and Time

125 Michael May

Tracks in space and time

• Requirements:
  • Point data
  • Polygons
  • Aggregations
  • Networks
  • Tracks, GPS/RFID/sensor measurements

• Applications: Traffic prediction, Mobility analysis

• Examples:
  • Sampling, Event analysis, non-linear optimization

(64)

126 Michael May

Mobility analysis based on GPS-tracks

introduction of a new pricing model for poster sites based on GPS tracks

registration of contact frequencies with poster sites

contact extrapolation for target groups:

- socio-demographic characteristics
- residential areas

Media Trend Journal, Nov, 2006

127 Michael May

Time patterns

Patterns / Questions

- How long (days) does it take till x% of objects visit all locations?
- How long does it take till x% of objects visit at least one location twice?

Applications

- determine mobility of a group of people

- reach of poster networks

- find popularity of locations (theatres, supermarkets, hospitals)
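The first question can be answered directly from a visit log. The sketch below assumes a hypothetical log format, one list of (day, location) events per tracked object, and returns the first day by which the requested share of objects has visited every location of interest.

```python
import math

def first_day_all_visited(events, locations):
    """Day on which one object has visited every location, or None."""
    seen = set()
    for day, location in sorted(events):
        if location in locations:
            seen.add(location)
            if seen == locations:
                return day
    return None

def days_until_share(visits, locations, share):
    """First day by which at least `share` of all objects visited all locations."""
    days = [first_day_all_visited(events, locations) for events in visits.values()]
    finished = sorted(day for day in days if day is not None)
    needed = max(1, math.ceil(share * len(visits)))
    return finished[needed - 1] if len(finished) >= needed else None

# Toy log: three tracked persons, two poster sites P1 and P2.
visits = {
    "a": [(1, "P1"), (2, "P2")],
    "b": [(1, "P1"), (5, "P2")],
    "c": [(3, "P1")],
}
sites = {"P1", "P2"}
print(days_until_share(visits, sites, 0.5))
```

The second question (at least one location visited twice) follows the same pattern with a per-location counter instead of a set.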

(65)

128 Michael May

Modelling tasks

Modelling mobility for cities with GPS-measurements

for the overall population

Predicting mobility for cities without measurements (hard task!)

Extrapolating predictions in time

129 Michael May

GeoPKDD - FET Project IST-014915

Geographic Privacy-aware Knowledge Discovery and Delivery December 2005 – November 2008

Project Leader: Fosca Giannotti

http://www.geopkdd.eu

General Project Idea

extracting user-consumable forms of knowledge from large amounts of raw geographic data referenced in space and in time.

knowledge discovery and analysis methods for trajectories of

moving objects, which change their position in time, and possibly also their shape or other significant features

devising privacy-preserving methods for data mining from sources that typically contain personal sensitive data

(66)

130 Michael May

The Consortium

ID | Acronym | Partner | Country

1 | KDDLAB | Knowledge Discovery and Delivery Laboratory, ISTI-CNR, Istituto di Scienza e Tecnologie dell’Informazione, Pisa. http://www.isti.cnr.it/ - jointly with Univ. Pisa, Dept. of Computer Science http://www.di.unipi.it | I

2 | LUC | Univ. Limburg, Theoretical Computer Science Group. http://www.luc.ac.be/theocomp | B

3 | EPFL | EPFL, Lab. DB, Lausanne. http://lbdwww.epfl.ch/e/ | CH

4 | FAIS | Fraunhofer Institute for Autonomous Intelligent Systems, Sankt Augustin. http://www.ais.fraunhofer.de/ | D

5 | WUR | Wageningen UR, Centre for GeoInformation. http://cgi.girs.wageningen-ur.nl/ | NL

6 | CTI | Research Academic Computer Technology Institute, Research and Development Division. http://www.cti.gr/ - jointly with Univ. Piraeus, Dept. of Informatics http://www.unipi.gr | GR

7 | UNISAB | Sabanci University, Faculty of Engineering and Natural Sciences. http://www.sabanciuniv.edu/ | TK

8 | WIND | WIND Telecomunicazioni SpA, Direzione Reti Wind Progetti Finanziati & Technology Scouting. | I

131 Michael May

Geographic Privacy-aware Knowledge Discovery Process

Figure labels: Telecommunication company (WIND); trajectory reconstruction; Trajectories warehouse; Privacy-aware Data mining; ST patterns (p(x)=0.02); interpretation, visualization; GeoKnowledge; Privacy enforcement; Aggregative Location-based services; Bandwidth/Power optimization, Mobile cells planning, …; Public administration or business companies; Traffic Management, Accessibility of services, Mobility evolution, Urban planning, …

(67)

132 Michael May

GeoPKDD – Specific Goals

models for moving objects, and data warehouse methods to store their trajectories

knowledge discovery and analysis methods for moving objects and trajectories

techniques to make such methods privacy-preserving

techniques for reasoning on spatio-temporal knowledge and on background knowledge

techniques for delivering the extracted knowledge within the geographic framework

133 Michael May

From Traces to Trajectories: the Source Data

GSM network

Entering the cell

- e.g. (UserID, time, IDcell, in)

Exiting the cell

- e.g. (UserID, time, IDcell, out)

Movements inside the cell?

- e.g. (UserID, time, X, Y, IDcell)

streams of log data of mobile phones, e.g. cells in the GSM/UMTS network

Real trajectories are continuous functions

Logs are discrete samplings of real trajectories, dependent on the wireless network technology

- irregular granularity in time and space
- possible imperfection/imprecision

An approximated reconstruction of the real trajectory from its log traces is needed
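The simplest reconstruction is linear interpolation between consecutive log samples. The sketch assumes that each log record has already been mapped to a position (for example a cell centroid), giving time-sorted (time, x, y) tuples; this record format and the function name are assumptions, and real reconstruction methods may also use the road network or uncertainty models.

```python
from bisect import bisect_right

def interpolate_position(samples, t):
    """Approximate the position at time t from time-sorted (time, x, y) samples."""
    times = [sample[0] for sample in samples]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the observed time span")
    i = bisect_right(times, t)
    if i == len(samples):
        # t equals the last timestamp.
        return (samples[-1][1], samples[-1][2])
    t0, x0, y0 = samples[i - 1]
    t1, x1, y1 = samples[i]
    fraction = (t - t0) / (t1 - t0)
    return (x0 + fraction * (x1 - x0), y0 + fraction * (y1 - y0))

# Three log samples for one user, with irregular spacing in time.
log = [(0, 0.0, 0.0), (10, 100.0, 0.0), (30, 100.0, 50.0)]
print(interpolate_position(log, 5))
print(interpolate_position(log, 20))
```

Because the samples are irregular in time, the interpolation error grows with the gap between neighbouring records.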

(68)

134 Michael May

Movement patterns

Clustering

Group together similar trajectories

For each group produce a summary

Frequent patterns

Discover frequently followed (sub)paths

Classification

Extract behaviour rules from history

Use them to predict behaviour of future

users

Figure labels: 60%, 7%, 8%, 5%, 20%, ?

Source: Pedreschi & Giannotti, 2005

135 Michael May

Why emphasis on privacy?

The more and better data are gathered, the greater the vulnerability from correlation

On the other hand, more and new data bring new opportunities

Need to maintain privacy without giving up opportunities

Need to obtain social acceptance through demonstrably trustworthy solutions

Privacy in GeoPKDD is a technical issue, besides an ethical, social and legal one, in the specific context of spatio-temporal (ST) data

How to formalize privacy constraints over ST data and ST patterns?

- E.g., anonymity threshold on clusters of individual trajectories

How to design DM algorithms that, by construction, only yield patterns that meet the privacy constraints?
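An anonymity threshold on trajectory clusters can be illustrated as a simple release filter: a cluster is published only if it is supported by at least k distinct users, and only its aggregate size leaves the system. The data layout, the function name and the choice of k are assumptions for illustration, not the GeoPKDD method.

```python
def enforce_anonymity(clusters, k):
    """Release only clusters that contain trajectories of at least k distinct users.

    clusters: dict mapping cluster id to a list of (user_id, trajectory) pairs.
    Returns a dict mapping released cluster ids to their number of distinct users.
    """
    released = {}
    for cluster_id, members in clusters.items():
        users = {user_id for user_id, _trajectory in members}
        if len(users) >= k:
            released[cluster_id] = len(users)  # publish the count, not the tracks
    return released

clusters = {
    "commuters": [("u1", ["A", "B"]), ("u2", ["A", "B"]), ("u3", ["A", "C", "B"])],
    "night_route": [("u4", ["D", "E"]), ("u4", ["D", "E"])],
}
released = enforce_anonymity(clusters, k=3)
```

Counting distinct users rather than trajectories matters: the second cluster holds two trajectories but both belong to one person, so it is withheld.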

(69)

136 Michael May

Challenges


137 Michael May

Causal Inference from Statistical Spatio-Temporal Data

Current project at IAIS for newspaper publisher:

Sales prediction of individual shops.

What happens if a shop closes or is sold out? Predict to which alternative shop customers go.

Spatio-Temporal Clustering of shops

Time Series Prediction

Modeling customer behavior

⇒ Causal inference about customer behavior

„If shop A closes, n% of A‘s customers go to B, m% to C“

(70)

138 Michael