• No results found

Data Science. IRDS: Data Mining Process. The term data mining. The term data mining

N/A
N/A
Protected

Academic year: 2021

Share "Data Science. IRDS: Data Mining Process. The term data mining. The term data mining"

Copied!
7
0
0

Loading.... (view fulltext now)

Full text

(1)

IRDS: Data Mining Process

Charles Sutton

University of Edinburgh

(many figures used from Murphy. Machine Learning: A Probabilistic Perspective.)

1

“Data Science”

• machine learning!

• databases!

• statistics!

• optimization

• Difficult to find another term for this intersection

• natural language processing!

• computer vision !

• speech processing!

• applications to science, business, health….

• Our working definition!

• Data science is the study of the computational

principles, methods, and systems for extracting

knowledge from data.!

• A relatively new term. A lot of current hype…!

• “If you have to put ‘science’ in the name…”!

• Component areas have a long history

2

The term “data mining”

Data mining is the analysis of (often large)

observational data sets to find unsuspected

relationships and to summarise the data in novel

ways that are both understandable and useful to

the data owner.

— Hand, Mannila, Smyth, 2001

The term “data mining”

Data mining is the analysis of (often large)

observational data sets to find unsuspected

relationships and to summarise the data in novel

ways that are both understandable and useful to

the data owner.

— Hand, Mannila, Smyth, 2001

not collected for the

purpose of your

(2)

The term “data mining”

Data mining is the analysis of (often large)

observational data sets to find unsuspected

relationships and to summarise the data in novel

ways that are both understandable and useful to

the data owner.

— Hand, Mannila, Smyth, 2001

e.g., pregnant example from

association rule mining

Many “easy” patterns already known

5

The term “data mining”

Data mining is the analysis of (often large)

observational data sets to find unsuspected

relationships and to summarise the data in novel

ways that are both understandable and useful to

the data owner.

— Hand, Mannila, Smyth, 2001

Tradeoff between

predictive performance

human interpretability

Ex: neural networks vs decision trees

6

Before I get too far ahead of myself…

(3)

Problem Types

• Visualization

Prediction: Learn a map x —> y

• Classification: Predict categorical value

• Regression: Predict a real value

• Others • Collaborative filtering • Learning to rank • Structured prediction • Description • Clustering • Dimensionality reduction • Density estimation • Finding patterns

• Association rule mining

• Detecting anomalies / outliers

supervised learning

unsupervised learning

9

Prediction Examples

• Classification • Advertising

• Ex: Given the text of an online advertisement and a search engine query, predict whether a user will click on the ad

• Document classification

• Ex: Spam filtering

• Object detection

• Ex: Given an image patch, dose it contain a face?

• Regression

• Predict the final vote in an election (or referendum) from polls

• Predict the temperature tomorrow given the previous few days

• Sometimes augmented with other structure / information

• Structured prediction

• Spatial data, Time series data

• Ex: Predicting coding regions in DNA

• Collaborative filtering (Amazon, Netflix)

• Semi-supervised learning

10

Description Examples

• Clustering

• Assign data into groups with high intra-group similarity

• (like classification, except without examples of “correct” group assignments)

• Ex: Cluster users into groups, based on behaviour

• Social network analysis

• Autoclass system (Cheeseman et al. 1988) discovered a new type of star,

• Dimensionality reduction

• Eigenfaces

• Topic modelling

• Discovering graph structure

• Ex: Transcription networks

• Ex: JamBayes for Seattle traffic jams

• Association rule mining

• Market basket data

• Computer security

General problem

Data

−8 −6 −4 −2 0 2 4 6 8 −4 −2 0 2 4 −2 0 2

Dimensionality reduction

Application to Images

Basis

(4)

Data

Analysis

Process

Inspired by Wagstaff, 2012. “Machine Learning that Matters”

For another more industrial process, see CRISP-DM. Phrase as a machine learning problem Collect data Select features Train model Evaluate model

Decide how to improve model

Deploy model Performance good enough?

Monitor deployed model Performance still good? More data Better features Different learning algorithm No Yes Yes No Prepare, clean data

13

Roadmap

In the next few weeks, we’ll talk about

Visualization

Feature extraction

Evaluation and debugging

But to talk about these, we still need to understand

representation behind the algorithms

14

Two Representation Problems

Image Classification

Document Clustering

Association Rule Mining

Transaction Database

Feature Extraction

Learning

Data Feature Vectors Models/Patterns

Feature Extraction Feature Extraction Learning Learning o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x oo o o o o ooo o o x1 x2 w y X X X if Diapers = 1 then Beer = 1 if Ham = 1 then Pineapple = 1 …

Two Representation Problems

Image Classification

Document Clustering

Association Rule Mining

Transaction Database

Feature Extraction

Learning

Data Feature Vectors Models/Patterns

Feature Extraction Feature Extraction Learning Learning o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x oo o o o o ooo o o x1 x2 w y X X X if Diapers = 1 then Beer = 1 if Ham = 1 then Pineapple = 1 …

(5)

Two Representation Problems

Image Classification

Document Clustering

Association Rule Mining

Transaction Database

Feature Extraction

Learning

Data Feature Vectors Models/Patterns

Feature Extraction Feature Extraction Learning Learning o o o o o o o o o o o o o o o o o o o o o o o o o o o o x x x x x x x x oo o o o o ooo o o x1 x2 w y X X X if Diapers = 1 then Beer = 1 if Ham = 1 then Pineapple = 1 …

2. What is the set of possible models?

15-3

Two Representation Problems

1. What features to use

2. What is the space of possible models

In this course, we discuss features.

Model —> IAML, PMR, MLPR

But: To pick features, must understand model.

So: Whirlwind tour of models, leaving out learning

algorithms

16

Linear regression

which can be solved easily (but I won’t say how)

Let denote the feature vector. Trying to predict y2 R Simplest choice a linear function. Define parameters

x2 Rd w2 Rd ˆ y = f (x, w) = w>x = d X j=1 wjxj

(to keep notation simple assume that always )xd= 1

x(1). . . x(N ), y(1), . . . , y(N )

Given a data set

find the best parameters min w N X i=1 ⇣ y(i) w>x(i)⌘2 −20 −1 0 1 2 3 0.5 1 1.5 2 2.5

Nonlinear regression

exactly the same form as before (because x is fixed)

so still just as easy

What if we want to learn a nonlinear function?

To find parameters,

the minimisation problem is now

0 5 10 15 20 −10 −5 0 5 10 15 degree 2

Trick: Define new features, e.g., for scalar x, define(x) = (1, x, x2)> ˆ

y = f (x, w) = w> (x)

this is still linear in w

min w N X i=1 ⇣ y(i) w> (x(i))⌘2

(6)

Logistic regression

(a classification method, despite the name)

Linear regression was easy. Can we do linear classification too?

x x x x x x x x oo o o o o o o o o o x1 x2 w f (x, w) = w>x Define a discriminant function

y = (

1 if f (x, w) 0 0 otherwise Then predict using

yields linear decision boundary Can get class probabilities from this idea, using logistic regression:

p(y = 1|x) =1 + exp1 { w>x}

(to show decision boundaries same, compute log odds logp(y = 1|x) p(y = 0|x)

19

K-Nearest Neighbour

simple method for classification or regression

1. Look through your training set. Find the K closest points. Call them (this is memory-based learning.)

2. Return the majority vote.

3. If you want a probability, take the proportion !

Define a distance function between feature vectors D(x, x0)

(the running time of this algorithm is terrible. See IAML for better indexing.) NK(x)

To classify a new feature vector x

p(y = c|x) =K1 X

(y0,x0)2NK(x)

I{y0= c}

20

K-Nearest Neighbour

Decision boundaries can be highly nonlinear The bigger the K, the smoother the boundary This is nonparametric: the complexity of the boundary varies depending on the amount of training data

−1 0 1 2 3 4 5 predicted label, K=1 c1 c2 −1 0 1 2 3 4 5 predicted label, K=5 c1 c2 −3 −2 −1 0 1 2 3 −2 −1 0 1 2 3 4 5 train data

k=1

k=5

To split the data into k clusters, iterate: !

Initialize cluster centroids randomly: Repeat

For each data point i Assign to closest cluster !

!

Move cluster centroids (average all points currently in cluster) For all k in 1,2,…K ! ! !

K-means clustering

µ1. . . µK ki arg max k2{1...K} D(µk, x(i)) µk PN

i=1x(i)I{ki= k}

PN

(7)

Summary

• Different types of model structures

1. Linear boundaries (for classification and regression) 2. Nonlinear boundaries (but linear in a set of features) 3. “Wavy” boundaries (nonparametric, piecewise linear) 4. Convex boundaries (with respect to Euclidean distance)

• This will affect feature construction, soon.

References

Related documents

[87] demonstrated the use of time-resolved fluorescence measurements to study the enhanced FRET efficiency and increased fluorescent lifetime of immobi- lized quantum dots on a

We conduct a comparison between DG3 (three-point discontinuous Galerkin scheme; Huynh, 2007), MCV5 (fifth- order multi-moment constrained finite volume scheme; Ii and Xiao, 2009)

A hybrid statistical model representing both the pose and shape variation of the carpal bones is built, based on a number of 3D CT data sets obtained from different subjects

Lead scientists, lead detective, and consulting experts develop recovery plan Forensic archaeologists in consultation with others scientists conduct recovery Remains and soil

Immunoprecipi- tation and Western blot for FGFR3 proteins confirmed the presence of both FGFR3 proteins in the cell lysate, suggesting that this decrease in phosphorylation did

In examining the ways in which nurses access information as a response to these uncertainties (Thompson et al. 2001a) and their perceptions of the information’s usefulness in

As a formal method it allows the user to test their applications reliably based on the SXM method of testing, whilst using a notation which is closer to a programming language.

For the cells sharing a given channel, the antenna pointing angles are first calculated and the azimuth and elevation angles subtended by each cell may be used to derive