• No results found

Data Mining and Visualization

N/A
N/A
Protected

Academic year: 2021

Share "Data Mining and Visualization"

Copied!
47
0
0

Loading.... (view fulltext now)

Full text

(1)

Results Matter, Trust

14 December 2005 SDMIV2, NeSC, Edinburgh

Data Mining and Visualization

Jeremy Walton

NAG Ltd, Oxford

NAG Ltd, Oxford

(2)

14 December 2005 2

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Overview

Data mining components

Functionality

Example application

Quality control

Visualization

Use of 3D

Example application

Market research

Statistics and visualization in Excel

What’s the problem?

(3)

14 December 2005 3

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Overview

Data mining components

Functionality

Example application

Quality control

Visualization

Use of 3D

Example application

Market research

Statistics and visualization in Excel

What’s the problem?

(4)

14 December 2005 4

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

NAG Data Mining Components (DMC)

Data Cleaning

Data imputation adding missing values

Outlier detection finding suspect data records

Data Transformation

Scaling Data before distance computation

Principal Component Analysis reducing # of variables

Cluster Analysis

k-means analyst decides # of clusters in data Hierarchical stepwise agglomeration of data

(5)

14 December 2005 5

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

DMC: Classification techniques

Classification Trees Two types of decision tree:

binary (Gini index) n-ary (entropy-based)

Generalized Linear Models Fitting of

Binomial distribution (for binary classification tasks) Poisson distribution (for count data)

k-Nearest Neighbours

Predict values using k most similar records in a training dataset

Set prior probabilities for data classes Also used for regression (see below)

(6)

14 December 2005 6

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

DMC: Regression techniques

Regression Trees

Minimise sum of squares about mean

robust estimate of the mean, or sample average

Linear Regression

Automatic selection of model variables

Multi-Layer Perceptron Neural Networks

Flexible non-linear models

Free parameters in MLP optimised using conjugate gradients

Nearest Neighbours (see above) Radial Basis Function Models

function of distance from centre location to data records

(7)

14 December 2005 7

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

DMC: other techniques

Association rules

Determine relationships between nominal data values

Utility functions

Random number generators Rank ordering

Sorting

Mean and sum of squares updates Two-way classification comparison Save and load models

(8)

14 December 2005 8

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Example application: Quality control

Detection of changes in sample

due to e.g. heating

Use circular dichroism spectroscopy

measures difference in absorbance by left & right polarized light

Generates spectrum for each sample

Intensity vs wavelength

Some spectra look similar, others don’t

How to classify them?

(9)

14 December 2005 9

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Spectra display

(10)

14 December 2005 10

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Classification

36 spectra, 152 intensity values each Read into 36 x 152 matrix

Passed to hierarchical cluster analysis routines

Euclidean distances between data points Average link distances between clusters

Output displayed as dendogram

tree plot showing merging of clusters with distance

Introduce a cut-off to define “natural” clusters

(11)
(12)

14 December 2005 12

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Analysis

Cut off gives seven natural clusters

not v. sensitive to distance functions

Some of the results can be understood w.r.t experimental conditions

e.g. 030220a1 - concentrated sample (evaporation) e.g. 030319g5 to 030330f5 - repeatable experiment

But there are some outliers

e.g. 030326e1, 030326e2 in normals needs consultation with domain experts

(13)

14 December 2005 13

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Example application: Classification

Fisher’s iris dataset

4 measurements made on 50 iris specimens from each of three species

petal length, petal width, sepal length, sepal width How to classify the species?

150 data points Each point

has 4 independent variables belongs in one of three classes

red, green, blue

How to display dataset?

(14)

2D scatterplots

(15)

14 December 2005 15

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Scatterplots?

In 2D, need 6 plots

to show each variable vs every other one

Need to consider them all at the same time Can reduce the number of variables

using principal components analysis first three explain ~95% of the variance

Scatterplot in 3D

(16)

3D scatterplot

(17)

3D scatterplot

(18)

3D scatterplot

(19)

3D scatterplot

(20)

14 December 2005 20

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Overview

Data mining components

Functionality

Example application

Quality control

Visualization

Use of 3D

Example application

Market research

Statistics and visualization in Excel

What’s the problem?

(21)

14 December 2005 21

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Example: Visual datamining

Dutch financial asset management company

Keen to target products at entrepreneurs

How does entrepreneurship relate to other customer characteristics?

Marketing dataset

25,000 customers (sampled from full customer base) Each characterised by values for 100 variables

Investment history = “entrepreneurship”

Income Age

(22)

14 December 2005 22

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Select strongly correlated variables

Correlation matrix

(23)

14 December 2005 23

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Select strongly correlated variables

Histogram from cluster analysis

Select variables of interest

Correlation matrix

(24)

14 December 2005 24

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Select strongly correlated variables

Histogram from cluster analysis

Select variables of interest

Correlation matrix

Scatter plot

(25)

14 December 2005 25

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Lessons learnt

3D correlation landscape useful

Identifying significant variables Focus on data distributions

Select appropriate ranges for cluster analysis

Cluster visualization helpful

Non-linear relationships in data revealed

3D visualization combined with direct interaction

Selection of correlated variables Binning and sorting

Done with IRIS Explorer

(26)

14 December 2005 26

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Overview

Data mining components

Functionality

Example application

Quality control

Visualization

Use of 3D

Example application

Market research

Statistics and visualization in Excel

What’s the problem?

(27)

14 December 2005 27

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Simple visualization: shoe size survey

Visualize using Excel

Chart wizard creates visualization easily Interactive control over appearance

Colours, line width, text, fonts, placement

Data and visualization linked together

1 5

17 19

13 14

5 3

Frequency

12 11

10 9

8 7

6 5

Shoe size

(28)
(29)

14 December 2005 29

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Second survey (now with half-sizes)

8 8

2 1

5 17

19 13

14 5

3 Frequency

9.5 7.5

5.5 12

11 10

9 8

7 6

5 Shoe size

(30)
(31)
(32)

14 December 2005 32

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

What’s wrong with this picture?

Ordering of values on X axis reflects order in spreadsheet

not numerical order

Spacing on X axis doesn’t reflect difference between values

spacing is everywhere the same

Could pay close attention to labels

But might be harder for more complex data Obscures missing data

(33)

14 December 2005 33

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Another example – adsorption isotherm

Measurement of fluid density inside porous solid as a function of fluid pressure

Confined fluid can condense before saturation

Capillary condensation Vertical jump in isotherm

Plot data using

Scatter plot Line graph

(34)

Sc att er plo t

(35)

Lin e g ra ph

(36)

14 December 2005 36

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Another example – derivative calculation

Black-Scholes modelling of options on swap agreements (“swaptions”)

Option volatility as a function of

Swap maturity Option expiry time

Display in Excel as a surface plot

(37)

Irregular

spacing

(38)

14 December 2005 38

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

What’s wrong with this picture?

See above

Irregular spacing, discontinuities

This is only one part of the dataset

Volatility also depends on strike value Want plots at other strike values

Want to see other relationships

e.g. volatility ( strike ) = “volatility smile”

Use IRIS Explorer

(39)

Irregular

spacing

(40)
(41)
(42)

14 December 2005 42

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Two types of axis in Excel charts

Value

Data treated as continuously varying numerical values Marker placed at location reflecting its value

Used in Excel Scatter plots

Category

Data treated as sequence of non-numerical text labels Marker location reflects position in sequence

Points distributed evenly along axis

Used in Excel Bar chart, Line chart, Surface plot…

(43)

14 December 2005 43

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

NAG Schools Excel Add-in (N-SEA)

Supports instruction in statistics Functionality for

Data sampling Frequency plots

Box and whisker plots Histograms

Continuous bar charts

Allows (X/Y) ordering of data points in plotting

and inclusion of points with zero Y values

(44)

Orig ina l

(45)

N-S EA

(46)
(47)

14 December 2005 47

Results Matter, Trust

SDMIV2, NeSC, Edinburgh

Conclusions

Data mining components offer basic routines

Developers can incorporate them into applications

No wheel-reinvention, stone canoes, chocolate teapots

cf NAG numerical library

Visualization is crucial for analysis

Integration of data mining & visualization is application- dependent

Interactivity important

Problems with (even) well-known tools

Be aware Work around

References

Related documents

We conduct a comparison between DG3 (three-point discontinuous Galerkin scheme; Huynh, 2007), MCV5 (fifth- order multi-moment constrained finite volume scheme; Ii and Xiao, 2009)

Immunoprecipi- tation and Western blot for FGFR3 proteins confirmed the presence of both FGFR3 proteins in the cell lysate, suggesting that this decrease in phosphorylation did

In examining the ways in which nurses access information as a response to these uncertainties (Thompson et al. 2001a) and their perceptions of the information’s usefulness in

As a formal method it allows the user to test their applications reliably based on the SXM method of testing, whilst using a notation which is closer to a programming language.

For the cells sharing a given channel, the antenna pointing angles are first calculated and the azimuth and elevation angles subtended by each cell may be used to derive

Results indicates that high involvement job practice are lacking among the banking sector and majority of its employees need to be involved in those practices similarly the

This research contributes to the field of construction engineering and management in two major ways. The first way of contributing to the field is by

• to develop geothermal energy (electricity) to complement hydro and other sources of power to meet the energy demand of rural areas in sound environment,.. • to mitigate rural