(1)

Machine Learning:

Statistical Methods - I

Introduction -

15.04.2009

(2)

© Stefan Roth, 15.04.2009 | Department of Computer Science | GRIS |

■ Lecturer:

Prof. Stefan Roth, Ph.D. <sroth AT cs.tu-...>

About me...

■ Teaching Assistant:

Qi Gao <qgao AT gris.tu-...>

■ Announcements:

Course web page:

http://www.gris.informatik.tu-darmstadt.de/teaching/courses/ss09/ml/index.en.htm

Mailing list: see subscription information on web page.

Forum: http://d120.de/forum/viewforum.php?f=292

(3)

Course language

■ will be English.

This applies to lectures, exercises, announcements, etc.

■ Why?

Essentially all machine learning publications and books are written in English.

Knowing the original terms is crucial.

■ If strongly preferred, you may contact the course staff in German.

English is encouraged though, because we may use your (anonymized) question to clarify points to the entire class.

(4)

Organization

■ Lecture:

2 hours a week

Wednesdays, 9:50 - 11:30, S3/05 room 073

We will cover the foundational aspects of each topic.

■ Exercise:

2 hours a week

Wednesdays, 11:40 - 13:20, S3/05 room 073

We will cover some practical aspects, and discuss the homework assignments.

(5)

Exam & Bonus System

■ Most likely there will be an oral exam.

Likely during the first week after classes end.

Can be taken in English or German.

Details depend on how many students end up taking the class.

There will be a bonus of up to a full grade for those who do well in the homework assignments.

Details TBA.

■ Exercises:

In order to get credit for 2+2 SWS, you need to actively participate in the exercises / turn in the homework assignments.

(6)

Style

■ Lectures:

I would like the lectures to be at least partly interactive.

Maybe more interactive than you are used to.

This is supposed to be helpful for you and me.

You are encouraged to ask questions!

■ Exercises:

Mostly interactive.

You are encouraged to ask detailed questions!

Your participation counts: Bonus for final exam.

(7)

Homework assignments

■ Mix of written and programming assignments.

We will have around 4-5 assignments.

Programming assignments in MATLAB, standard environment for scientific computing.

- Goal: Work with real data to gain first-hand knowledge of how the techniques we will learn actually work.

- Introduction during first exercise (next week).

Also pen and paper exercises.

■ The last assignment may be a larger project-like one:

Stay tuned...

(8)

Readings

■ Course book:

Christopher Bishop: Pattern Recognition and Machine Learning. Springer, 2006. ISBN 0-387-31073-8 (very good book, but not an easy read).

Should be on reserve in the library.

■ Other useful books:

Duda, Hart & Stork: Pattern Classification, Wiley-Interscience, 2000, 2nd edition. ISBN 0-471-05669-3 (new version of a classic).

David J. C. MacKay: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. ISBN 0-521-64298-1 (free download at http://www.inference.phy.cam.ac.uk/mackay/itila/book.html).

Gelman et al.: Bayesian Data Analysis. CRC Press, 2nd ed., 2004, ISBN 1-584-88388-X

Hastie, Tibshirani, Friedman: The Elements of Statistical Learning, Springer, 2001. ISBN 0-387-95284-5 (the statistical perspective).

Tom Mitchell: Machine Learning, McGraw-Hill, 1997, ISBN 0-07-042807-7 (classic, but getting outdated).

(9)

Readings

■ Additional readings:

At times I will post papers and tutorials.

Will be available or linked from the course web page.

■ I will often assign weekly readings:

Please read them and come to class prepared!

The Bishop book is a good investment, because it is also a very useful reference.

(10)

How does it fit into your course plan?

■ Diplom:

Anwendungsbezogene Informatik

Possibly Praktische or Theoretische Informatik if you can find someone who will count ML I toward this.

- Note that I will not be able to offer an exam in Theoretische Informatik.

■ B.Sc. / M.Sc.:

Human Computer Systems (see Modulhandbuch)

Not Data Knowledge Engineering

If you are strongly interested in machine learning you should:

- Take ML: Statistical Methods for HCS credit and

- Take ML: Symbolische Methoden for DKE credit

(11)

How does it fit into your course plan?

■ Related classes:

Human Computer Systems (WS, Schiele/Fellner): prerequisite

Machine Learning: Statistical Methods II (WS, Schiele)

Computer Vision I (SS, Schiele/Schindler) - II (WS, Roth)

Maschinelles Lernen - Symbolische Verfahren (WS, Fürnkranz)

Bildverarbeitung (SS, Sakas)

■ Theses and projects:

Topics in machine learning with applications in computer vision.

Please talk to me if you are interested.

(12)

Machine Learning

■ What is ML? What is its goal?

Develop a machine / an algorithm that learns to perform a task from past experience.

■ Why? What for?

Fundamental component of every intelligent and/or autonomous system

Discovering “rules” and patterns in data

Automatic adaptation of systems

Attempting to understand human / biological learning

(13)

Machine Learning in Action

(14)

Machine Learning: Examples

■ Example 1: Recognition of handwritten digits

These digits are given to us as small digital images.

We have to build a “machine” to decide which digit it is.

Obvious challenge: There are many different ways in which people handwrite.

(15)

Machine Learning: Examples

■ Example 2: Classification of fish

FIGURE 1.1. The objects to be classified are first sensed by a transducer (camera), ...

[Figure: histograms of the length feature (x-axis: length, y-axis: count) for salmon and sea bass, with the threshold l* marked.]

FIGURE 1.2. Histograms for the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.
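The idea behind the threshold l* can be sketched in a few lines of Python. This is only an illustration: the fish lengths below are invented (they are not the data behind Figure 1.2), and the decision rule "sea bass if the length exceeds the threshold" is an assumption made for the example.

```python
# Hypothetical fish lengths (invented values, not the data from Figure 1.2).
salmon = [4.0, 5.5, 6.0, 7.0, 8.5, 9.0, 11.0]
sea_bass = [8.0, 10.0, 12.0, 13.5, 15.0, 16.0, 18.0]


def count_errors(threshold, salmon, sea_bass):
    """Errors of the rule: predict 'sea bass' if length > threshold, else 'salmon'."""
    salmon_errors = sum(1 for x in salmon if x > threshold)
    sea_bass_errors = sum(1 for x in sea_bass if x <= threshold)
    return salmon_errors + sea_bass_errors


def best_threshold(salmon, sea_bass):
    """Try every observed length as a candidate threshold; keep the one with fewest errors."""
    candidates = sorted(salmon + sea_bass)
    return min(candidates, key=lambda t: count_errors(t, salmon, sea_bass))


l_star = best_threshold(salmon, sea_bass)
errors = count_errors(l_star, salmon, sea_bass)
print(l_star, errors)
```

Because the two length ranges overlap, even the best threshold leaves some training errors, which is the point the caption makes about using length alone.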

(16)

Machine Learning: Examples

■ More examples:

Email filtering:

Speech recognition:

Vehicle control:

(17)

Impact & Successes

■ Recognition of speech, letters, faces, ...

■ Autonomous vehicle navigation

■ Games

Backgammon world-champion

Chess: Deep-Blue vs. Kasparov

■ Google

■ Finding new astronomical structures

■ Fraud detection (credit card applications)

■ ...

(18)

Machine Learning

■ Develop a machine / an algorithm that learns to perform a task from past experience.

■ Put more abstractly:

Our task is to learn a mapping from input to output: $f : I \to O$

Put differently, we want to predict the output from the input: $y = f(x; \theta)$

Input: $x \in I$ (images, text, other measurements, ...)

Output: $y \in O$

Parameter(s): $\theta \in \Theta$ (that is/are being “learned”)

(19)

Classification vs. Regression

■ Classification:

Learn a mapping into a discrete space, e.g.:

$O = \{0, 1\}$

$O = \{0, 1, 2, 3, \ldots\}$

$O = \{\text{verb}, \text{noun}, \text{nounphrase}, \ldots\}$

Examples: Spam / not spam, sea bass vs. salmon, parsing a sentence, recognizing digits, etc.

(20)

Classification vs. Regression

■ Regression:

Learn a mapping into a continuous space, e.g.:

$O = \mathbb{R}$

$O = \mathbb{R}^3$

Examples: “Curve fitting”, financial analysis, ...

[Figure: example plots; only the axis ticks survived extraction.]

(21)

General Paradigm

■ Training:

Training data → learn → model with learned parameters $\theta$

■ Testing:

Input → predict (using the learned parameters $\theta$) → output, e.g. 0, 1, 2, 8, 4, 6, 6, 7, 8, 9
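The training / testing split can be sketched as two functions, one that learns the parameters θ and one that evaluates y = f(x; θ). The sketch below is an illustration under stated assumptions: the nearest-class-mean rule, the function names `learn` and `predict`, and the data are all hypothetical choices, not something prescribed by the lecture.

```python
def learn(training_data):
    """Training: estimate one parameter per class (the mean of its inputs)."""
    sums, counts = {}, {}
    for x, label in training_data:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(x, theta):
    """Testing: y = f(x; theta), the label whose learned mean is closest to x."""
    return min(theta, key=lambda label: abs(x - theta[label]))


# Hypothetical labeled training data: (input, output) pairs.
training_data = [(1.0, "a"), (2.0, "a"), (3.0, "a"),
                 (7.0, "b"), (8.0, "b"), (9.0, "b")]

theta = learn(training_data)  # learned parameters
predictions = [predict(x, theta) for x in [2.5, 6.0, 4.9]]
print(theta, predictions)
```

Note that `predict` only needs θ, not the training data itself; everything the model knows about the data is summarized in the learned parameters.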

(22)

What data do we have for training?

■ Data with labels (input / output pairs): supervised learning

E.g. image with digit label

Sensory data for car with intended steering control.

■ Data without labels: unsupervised learning

E.g. automatic clustering (grouping) of sounds

Clustering of text according to topics

■ Data with and without labels: semi-supervised learning

■ No examples: learning-by-doing

Reinforcement learning


(23)

Some Key Challenges

■ We need generalization!

We cannot simply memorize the training set.

■ What if we see an input that we haven’t seen before?

Different shape of the digit image (unknown writer)

“Dirt” on the picture, etc.

We need to learn what is important for carrying out our task.

(24)

Generalization

■ How do we achieve generalization?

[Figure: salmon and sea bass samples plotted by lightness (x-axis) and width (y-axis), with an overly complex decision boundary and a novel test point marked ?.]

FIGURE 1.5. Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

(25)

Generalization

■ How do we achieve generalization?

[Figure: salmon and sea bass samples plotted by lightness (x-axis) and width (y-axis), with a smoother decision boundary.]

FIGURE 1.6. The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier, thereby giving the highest accuracy on new patterns. From: Richard O. Duda, Peter E. Hart, and David G. ...

Occam’s Razor
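The tradeoff between fitting the training set and keeping the classifier simple can be illustrated with a small sketch. Everything here is an assumption made for illustration: the (lightness, width) values are invented, a 1-nearest-neighbor rule stands in for the "overly complex" model, and a nearest-class-mean rule stands in for the simple one; neither is the classifier used to draw the figures.

```python
import math

# Hypothetical training samples as (lightness, width) pairs; the values are invented.
salmon = [(2.0, 18.0), (3.0, 19.0), (2.5, 17.0), (3.5, 18.5), (4.0, 17.5)]
# The last sea bass is an outlier sitting inside the salmon cluster.
sea_bass = [(7.0, 20.0), (8.0, 19.0), (7.5, 21.0), (8.5, 20.5), (3.0, 18.0)]
training = [(p, "salmon") for p in salmon] + [(p, "sea bass") for p in sea_bass]


def nearest_neighbor(x, training):
    """Complex model: copy the label of the closest training sample."""
    return min(training, key=lambda sample: math.dist(x, sample[0]))[1]


def class_means(training):
    """Simple model: summarize each class by the mean of its samples."""
    means = {}
    for label in {lab for _, lab in training}:
        pts = [p for p, lab in training if lab == label]
        means[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return means


def nearest_mean(x, means):
    """Simple model prediction: the class whose mean is closest to x."""
    return min(means, key=lambda label: math.dist(x, means[label]))


def accuracy(predict, samples):
    """Fraction of samples whose predicted label matches the given label."""
    return sum(predict(p) == label for p, label in samples) / len(samples)


means = class_means(training)
train_acc_complex = accuracy(lambda p: nearest_neighbor(p, training), training)
train_acc_simple = accuracy(lambda p: nearest_mean(p, means), training)

query = (3.1, 18.1)  # a new point in the middle of the salmon cluster
print(train_acc_complex, nearest_neighbor(query, training))
print(train_acc_simple, nearest_mean(query, means))
```

The nearest-neighbor rule classifies every training sample correctly but lets the single outlier decide the label of the new point, while the simpler rule accepts one training error and assigns the new point to the surrounding salmon cluster.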

(26)

Some Key Challenges

■ Input x:

■ Features:

Choosing the “right” features is very important.

Coding and use of domain knowledge.

May allow for invariance (e.g. volume and pitch of voice).

■ Curse of Dimensionality:

If the features are too high-dimensional, we will run into trouble - more later.

Dimensionality reduction

(27)

Some Key Challenges

■ How do we measure performance?

99% correct classification in speech recognition: What does that really mean?

We understand the meaning of the sentence? We understand every word? For all speakers?

■ Need more concrete numbers:

% of correctly classified letters

average distance driven (until accident...)

% of games won

% correctly recognized words, sentences, etc.

(28)

Some Key Challenges

■ We also need to define the right error metric:

Which is better?

Euclidean distance (L2 norm) might be useless.
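One way to see why the Euclidean distance can be misleading is a toy example with one-dimensional "images". The pixel values below are hypothetical and are not the images shown on the original slide; the sketch only illustrates that a small shift can cost more in L2 distance than losing the content altogether.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equally long pixel vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 1-D "images" given as pixel intensities.
original = [0, 0, 1, 0, 0, 0]
shifted = [0, 0, 0, 1, 0, 0]  # same stroke, moved by one pixel
blank = [0, 0, 0, 0, 0, 0]    # stroke missing entirely

d_shifted = l2_distance(original, shifted)
d_blank = l2_distance(original, blank)
print(d_shifted, d_blank)
```

Under the L2 norm the shifted copy is farther from the original than the blank image, although a human would judge the shifted copy to be the much better match.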

(29)

Some Key Challenges

■ Which is the right model?

The learned parameters can mean a lot of different things.

- w: may characterize the family of functions or the model space

- w: may index the hypothesis space

- w: vector, adjacency matrix, graph, ...

(30)

Some Key Challenges

■ Even if we have solved the other problems, computation is usually quite hard:

Learning often involves some kind of optimization

Find (search) best model parameters

Often we have to deal with thousands of training examples

Given a model, compute the prediction efficiently

(31)

Why is machine learning interesting (for you)?

■ Machine learning is a challenging problem that is far from being solved.

Our learning systems are primitive compared to us humans.

Think about what and how quickly a child can learn!

■ It combines insights and tools from many fields and disciplines:

Traditional artificial intelligence (logic, semantic networks, ...)

Statistics

Complexity theory

Artificial neural networks

(32)

Why is machine learning interesting (for you)?

■ Allows you to apply theoretical skills that you may otherwise only use rarely.

■ Has lots of applications:

Computer vision

Computational linguistics

Search (think Google)

Digital “assistants”

Computer systems

...

■ It is a growing field:

Many major companies are hiring people with machine learning knowledge.

Anecdote: At a recent workshop on computer graphics, about 2/3 of the groups said they would benefit from more machine learning knowledge.

(33)

Preliminary Syllabus

■ Subject to change!

■ Fundamentals (~ 3 weeks)

Bayes decision theory, maximum likelihood, Bayesian inference

Performance evaluation

Probability density estimation

Mixture models, expectation maximization

■ Linear Methods (~ 3-4 weeks)

Linear regression

PCA, robust PCA

(34)

Preliminary Syllabus

■ Large-Margin Methods (~ 3-4 weeks)

Statistical learning theory

Support vector machines

Kernel methods

■ Miscellaneous (~ 3 weeks)

Model averaging (bagging & boosting)

Graphical models (basic introduction)

(35)

Credits

■ Large parts of the lecture material have been developed by Prof. Bernt Schiele for the previous iterations of this course.

■ Many figures that I will use are directly taken out of the

books by Chris Bishop and Duda, Hart & Stork.

(36)

Brief Review of Basic Probability

■ What you should already know:

[Figure: two boxes of fruit. Box B: red (B = r) or blue (B = b). Fruit F: orange (F = o) or apple (F = a).]

Random picking:

Red box: 60% of the time

Blue box: 40% of the time

Pick fruit from a box with equal probability

p(B = r) = 0.6 p(B = b) = 0.4

p(F = a |B = r) = p(F = o|B = b) = 0.25

p(F = o |B = r) = p(F = a|B = b) = 0.75

(37)

Brief Review of Basic Probability

■ We usually do not mention the random variable (RV) explicitly (for brevity).

■ Instead of $p(X = x)$ we write:

$p(X)$ if we want to denote the probability distribution for a particular random variable $X$.

$p(x)$ if we want to denote the value of the probability of the random variable being $x$.

It should be obvious from the context whether we mean the random variable itself or a value that the random variable can take.

■ Some people use upper case for (discrete) probability distributions: $P(X = x)$

(38)

Brief Review of Basic Probability

■ Joint probability: $p(X, Y)$

The probability distribution of random variables $X$ and $Y$ taking on a configuration jointly.

For example: $p(B = b, F = o)$

■ Conditional probability: $p(X \mid Y)$

The probability distribution of random variable $X$ given the fact that random variable $Y$ takes on a specific value.

For example: $p(B = b \mid F = o)$

(39)

Basic Rules I

■ Probabilities are always non-negative:

$p(x) \ge 0$

■ Probabilities sum to 1:

$\sum_x p(x) = 1$, and therefore $0 \le p(x) \le 1$

■ Sum rule or marginalization:

$p(x) = \sum_y p(x, y)$

$p(y) = \sum_x p(x, y)$

(40)

Basic Rules II

■ Product rule:

$p(x, y) = p(x \mid y)\,p(y) = p(y \mid x)\,p(x)$

From this directly follows...

■ Bayes’ rule or Bayes’ theorem:

$p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)}$

We will widely use these rules.

[Portrait: Rev. Thomas Bayes, 1701-1761]
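As a worked example, the product rule, the sum rule and Bayes' rule can be applied to the box-and-fruit numbers from the probability review (red box 60% of the time, blue box 40%). The sketch below uses exact fractions; the variable names are arbitrary, and the question it answers ("which box did an orange come from?") is chosen here only for illustration.

```python
from fractions import Fraction

# Prior over boxes and conditionals over fruit, as given in the probability review.
p_box = {"r": Fraction(6, 10), "b": Fraction(4, 10)}
p_fruit_given_box = {
    "r": {"a": Fraction(1, 4), "o": Fraction(3, 4)},
    "b": {"a": Fraction(3, 4), "o": Fraction(1, 4)},
}
fruits = ("a", "o")

# Product rule: p(B, F) = p(F | B) p(B)
p_joint = {(b, f): p_fruit_given_box[b][f] * p_box[b] for b in p_box for f in fruits}

# Sum rule: p(F) = sum over B of p(B, F)
p_fruit = {f: sum(p_joint[(b, f)] for b in p_box) for f in fruits}

# Bayes' rule: p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o)
p_red_given_orange = p_fruit_given_box["r"]["o"] * p_box["r"] / p_fruit["o"]

print(p_fruit["o"], p_red_given_orange)
```

With these numbers an orange is drawn with probability 11/20, and observing an orange raises the probability of the red box from the prior 6/10 to 9/11.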

(41)

Continuous RVs

■ What if we have continuous random variables, say $X = x \in \mathbb{R}$?

Any single value has zero probability.

We can only assign a probability for a random variable being in a range of values:

$\Pr(x_0 < X < x_1) = \Pr(x_0 \le X \le x_1)$

■ Instead we use the probability density $p(x)$:

$\Pr(x_0 \le X \le x_1) = \int_{x_0}^{x_1} p(x)\,dx$

■ Cumulative distribution function:

$P(z) = \Pr(X \le z) = \int_{-\infty}^{z} p(x)\,dx$

(42)

Continuous RVs

■ Probability density function = pdf

■ Cumulative distribution function = cdf

■ We can work with a density (pdf) as if it were a probability distribution:

For simplicity we usually use the same notation for both.

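[Figure: a density p(x) and its cumulative distribution function P(x), plotted over x, with an interval of width δx marked.]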

(43)

Basic rules for pdfs

■ What are the rules?

Non-negativity: $p(x) \ge 0$

“Summing” to 1: $\int p(x)\,dx = 1$

But: $p(x) \not\le 1$ in general

Marginalization: $p(x) = \int p(x, y)\,dy$ and $p(y) = \int p(x, y)\,dx$
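That a density may exceed 1 while still integrating to 1 is easy to check numerically. The sketch below assumes a Gaussian density with standard deviation 0.1 (an arbitrary choice for illustration) and uses a simple midpoint rule for the integrals; the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean=0.0, sigma=0.1):
    """Density of a normal distribution; sigma = 0.1 is an arbitrary illustrative choice."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, lo, hi, steps=20000):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    dx = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * dx) for i in range(steps)) * dx


peak = gaussian_pdf(0.0)                            # density value at the mean
total = integrate(gaussian_pdf, -2.0, 2.0)          # covers +/- 20 standard deviations
prob_interval = integrate(gaussian_pdf, -0.1, 0.1)  # Pr(-0.1 <= X <= 0.1)

print(peak, total, prob_interval)
```

The density at the mean is close to 4, yet the total area is 1 and the probability of any interval, such as the one-standard-deviation interval computed last, stays between 0 and 1.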

(44)

Expectations

■ The average value of a function $f(x)$ under a probability distribution $p(x)$ is the expectation:

$E[f] = E[f(x)] = \sum_x f(x)\,p(x)$ or $E[f] = \int f(x)\,p(x)\,dx$

■ For joint distributions we sometimes write:

$E_x[f(x, y)]$

■ Conditional expectation:

$E_{x|y}[f] = E_x[f \mid y] = \sum_x f(x)\,p(x \mid y)$

(45)

Variance and Covariance

■ Variance of a single RV:

$\mathrm{var}[x] = E\left[(x - E[x])^2\right] = E[x^2] - E[x]^2$

■ Covariance of two RVs:

$\mathrm{cov}(x, y) = E_{x,y}\left[(x - E[x])(y - E[y])\right] = E_{x,y}[xy] - E[x]\,E[y]$

■ Random vectors:

All we have said so far not only applies to scalar random variables, but also to random vectors.

In particular, we have the covariance matrix:

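$\mathrm{cov}(\mathbf{x}, \mathbf{y}) = E_{\mathbf{x},\mathbf{y}}\left[(\mathbf{x} - E[\mathbf{x}])(\mathbf{y} - E[\mathbf{y}])^{T}\right]$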
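The definitions of expectation, variance and covariance can be checked on a small discrete example. The joint distribution below is hypothetical (two binary variables that tend to agree), and the helper name `expectation` is an arbitrary choice; exact fractions are used so that both forms of each formula can be compared for equality.

```python
from fractions import Fraction

# Hypothetical joint distribution p(x, y) over two binary variables.
p_joint = {
    (0, 0): Fraction(4, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10),
    (1, 1): Fraction(4, 10),
}


def expectation(f, p_joint):
    """E[f] = sum over (x, y) of f(x, y) p(x, y)."""
    return sum(f(x, y) * p for (x, y), p in p_joint.items())


e_x = expectation(lambda x, y: x, p_joint)
e_y = expectation(lambda x, y: y, p_joint)

# Variance: definition and shortcut form.
var_x_def = expectation(lambda x, y: (x - e_x) ** 2, p_joint)
var_x_short = expectation(lambda x, y: x * x, p_joint) - e_x ** 2

# Covariance: definition and shortcut form.
cov_def = expectation(lambda x, y: (x - e_x) * (y - e_y), p_joint)
cov_short = expectation(lambda x, y: x * y, p_joint) - e_x * e_y

print(e_x, var_x_def, cov_def)
```

Both forms of the variance and of the covariance agree, and the positive covariance reflects that the two variables in this made-up distribution usually take the same value.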

(46)

Preview

■ Review of some basics about probability

■ Bayesian decision theory

■ Loss functions

■ Disclaimer: It will get quite a bit more mathematical than this :)

Don’t get scared away, but be aware that this will not be a walk in the park.

(47)

Readings for next week

■ Introduction to ML:

Bishop 1.0, 1.1

■ Review of the basics of probability:

Bishop 1.2 (you can skip 1.2.5 and 1.2.6 for now)

■ Decision theory:

Bishop 1.5

■ For the curious:

Probability: You could also look at MacKay 2

Brush up on information theory: Bishop 1.6
