Machine Learning: Statistical Methods - I

(1)

Introduction -

15.04.2009

(2)

■ Lecturer:

•

Prof. Stefan Roth, Ph.D. <sroth AT cs.tu-...>

•

About me...

■ Teaching Assistant:

•

Qi Gao <qgao AT gris.tu-...>

■ Announcements:

•

Course web page:

http://www.gris.informatik.tu-darmstadt.de/teaching/courses/ss09/ml/index.en.htm

•

Mailing list: see subscription information on web page.

•

Forum: http://d120.de/forum/viewforum.php?f=292

Machine Learning I

(3)

Course language

■ will be English.

•

This applies to lectures, exercises, announcements, etc.

■ Why?

•

Essentially all machine learning publications and books are written in English.

•

Knowing the original terms is crucial.

■ If strongly preferred, you may contact the course staff in German.

•

English is encouraged though, because we may use your (anonymized) question to clarify points to the entire class.

(4)

Organization

■ Lecture:

•

2 hours a week

•

Wednesdays, 9:50 - 11:30, S3/05 room 073

•

We will cover the foundational aspects of each topic.

■ Exercise:

•

2 hours a week

•

Wednesdays, 11:40 - 13:20, S3/05 room 073

•

We will cover some practical aspects, and discuss the homework assignments.

(5)

Exam & Bonus System

■ Most likely there will be an oral exam.

• Likely during the first week after classes end.

• Can be taken in English or German.

• Details depend on how many students end up taking the class.

• There will be a bonus of up to a full grade for those who do well in the homework assignments.

• Details TBA.

■ Exercises:

• In order to get credit for 2+2 SWS, you need to actively participate in the exercises / turn in the homework assignments.

(6)

Style

■ Lectures:

•

I would like the lectures to be at least partly interactive.

•

Maybe more interactive than you are used to.

•

This is supposed to be helpful for you and me.

•

You are encouraged to ask questions!

■ Exercises:

•

Mostly interactive.

•

You are encouraged to ask detailed questions!

•

Your participation counts: Bonus for final exam.

(7)

Homework assignments

■ Mix of written and programming assignments.

• We will have around 4-5 assignments.

• Programming assignments in MATLAB, a standard environment for scientific computing.

- Goal: Work with real data to gain first-hand knowledge of how the techniques we will learn actually work.

- Introduction during the first exercise (next week).

• Also pen and paper exercises.

■ The last assignment may be a larger, project-like one:

• Stay tuned...

(8)

Readings

■ Course book:

• Christopher Bishop: Pattern Recognition and Machine Learning. Springer, 2006. ISBN 0-387-31073-8 (very good book, but not an easy read).

• Should be on reserve in the library.

■ Other useful books:

• Duda, Hart & Stork: Pattern Classification, Wiley-Interscience, 2000, 2nd edition. ISBN 0-471-05669-3 (new version of a classic).

• David J. C. MacKay: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. ISBN 0-521-64298-1 (free download at http://www.inference.phy.cam.ac.uk/mackay/itila/book.html).

• Gelman et al.: Bayesian Data Analysis. CRC Press, 2nd ed., 2004, ISBN 1-584-88388-X

• Hastie, Tibshirani, Friedman: The Elements of Statistical Learning, Springer, 2001. ISBN 0-387-95284-5 (the statistical perspective).

• Tom Mitchell: Machine Learning, McGraw-Hill, 1997, ISBN 0-07-042807-7 (classic, but getting outdated).

(9)

Readings

■ Additional readings:

•

At times I will post papers and tutorials.

•

Will be available or linked from the course web page.

■ I will often assign weekly readings:

•

Please read them and come to class prepared!

•

The Bishop book is a good investment, because it is also a very useful reference.

(10)

How does it fit into your course plan?

■ Diplom:

•

Anwendungsbezogene Informatik

•

Possibly Praktische or Theoretische Informatik if you can find someone who will count ML I toward this.

- Note that I will not be able to offer an exam in Theoretische Informatik.

■ B.Sc. / M.Sc.:

•

Human Computer Systems (see Modulhandbuch)

•

Not Data Knowledge Engineering

•

If you are strongly interested in machine learning you should:

- Take ML: Statistical Methods for HCS credit and

- Take ML: Symbolische Methoden for DKE credit

(11)

How does it fit into your course plan?

■ Related classes:

•

Human Computer Systems (WS, Schiele/Fellner): prerequisite

•

Machine Learning: Statistical Methods II (WS, Schiele)

•

Computer Vision I (SS, Schiele/Schindler) - II (WS, Roth)

•

Maschinelles Lernen - Symbolische Verfahren (WS, Fürnkranz)

•

Bildverarbeitung (SS, Sakas)

■ Theses and projects:

•

Topics in machine learning with applications in computer vision.

•

Please talk to me if you are interested.

(12)

Machine Learning

■ What is ML? What is its goal?

Develop a machine / an algorithm that learns to perform a task from past experience.

■ Why? What for?

• Fundamental component of every intelligent and/or autonomous system

• Discovering “rules” and patterns in data

• Automatic adaptation of systems

• Attempting to understand human / biological learning

(13)

Machine Learning in Action

(14)

Machine Learning: Examples

■ Example 1: Recognition of handwritten digits

•

These digits are given to us as small digital images.

•

We have to build a “machine” to decide which digit it is.

•

Obvious challenge: There are many different ways in which people handwrite.

(15)

Machine Learning: Examples

■ Example 2: Classification of fish

FIGURE 1.1. The objects to be classified are first sensed by a transducer (camera),

[Figure: histograms of count versus length for salmon and sea bass, with the threshold l* marked on the length axis.]

FIGURE 1.2. Histograms for the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.
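The idea of a threshold l* that minimizes the number of errors can be sketched in a few lines. This is only an illustration: the fish lengths below are made up (they are not read off the figure), and the names `count_errors` and `best_threshold` are hypothetical helpers, not part of the course material.

```python
# Hypothetical fish lengths (arbitrary units); NOT taken from the figure.
salmon_lengths = [4, 6, 7, 8, 9, 10, 12]
sea_bass_lengths = [9, 11, 13, 14, 16, 18, 20]


def count_errors(threshold, salmon, sea_bass):
    """Count mistakes when every fish shorter than `threshold` is called salmon."""
    salmon_errors = sum(1 for length in salmon if length >= threshold)
    sea_bass_errors = sum(1 for length in sea_bass if length < threshold)
    return salmon_errors + sea_bass_errors


def best_threshold(salmon, sea_bass):
    """Scan the observed lengths as candidate thresholds; keep the one with fewest errors."""
    candidates = sorted(set(salmon) | set(sea_bass))
    return min(candidates, key=lambda t: count_errors(t, salmon, sea_bass))


l_star = best_threshold(salmon_lengths, sea_bass_lengths)
errors_at_l_star = count_errors(l_star, salmon_lengths, sea_bass_lengths)
print("l* =", l_star, "errors =", errors_at_l_star)
```

Because the two made-up length ranges overlap, even the best threshold still misclassifies some fish, which is the point of the caption.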

(16)

Machine Learning: Examples

■ More examples:

•

Email filtering:

•

Speech recognition:

•

Vehicle control:

(17)

Impact & Successes

■ Recognition of speech, letters, faces, ...

■ Autonomous vehicle navigation

■ Games

• Backgammon world champion

• Chess: Deep Blue vs. Kasparov

■ Google

■ Finding new astronomical structures

■ Fraud detection (credit card applications)

■ ...

(18)

Machine Learning

■ Develop a machine / an algorithm that learns to perform a task from past experience.

■ Put more abstractly:

• Our task is to learn a mapping from input to output: f : I → O

• Put differently, we want to predict the output from the input: y = f(x; θ)

• Input: x ∈ I (images, text, other measurements, ...)

• Output: y ∈ O

• Parameter(s): θ ∈ Θ (that is/are being “learned”)

(19)

Classification vs. Regression

■ Classification:

• Learn a mapping into a discrete space, e.g.:

O = {0, 1}

O = {0, 1, 2, 3, . . .}

O = {verb, noun, noun phrase, . . .}

• Examples: Spam / not spam, sea bass vs. salmon, parsing a sentence, recognizing digits, etc.

(20)

Classification vs. Regression

■ Regression:

• Learn a mapping into a continuous space, e.g.:

O = ℝ

O = ℝ³

• Examples: “Curve fitting”, financial analysis, ...

[Figure: example regression plots]

(21)

General Paradigm

■ Training:

Training data → learn → model with learned parameters θ

■ Testing:

Test data + model with learned parameters θ → predict → 0, 1, 2, 8, 4, 6, 6, 7, 8, 9
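The two phases can be sketched with the simplest possible parametric model. This is a minimal illustration, not the method used in the course: it assumes a one-parameter model y = θ·x fitted by least squares, and both the data and the function names `learn` and `predict` are hypothetical.

```python
def learn(training_data):
    """Training: least-squares estimate of theta for the model y = theta * x."""
    numerator = sum(x * y for x, y in training_data)
    denominator = sum(x * x for x, _ in training_data)
    return numerator / denominator


def predict(x, theta):
    """Testing: apply the learned mapping y = f(x; theta) to a new input."""
    return theta * x


# Hypothetical input/output pairs (x, y), roughly following y = 2x.
training_data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

theta = learn(training_data)      # learned parameter
y_new = predict(5.0, theta)       # prediction for an input not in the training set
print(theta, y_new)
```

Only `theta` is carried over from training to testing; the training data itself is no longer needed once the parameter has been learned.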

(22)

What data do we have for training?

■ Data with labels (input / output pairs):

supervised learning

•

E.g. image with digit label

•

Sensory data for car with intended steering control.

■ Data without labels: unsupervised learning

•

E.g. automatic clustering (grouping) of sounds

•

Clustering of text according to topics

■ Data with and without labels: semi-supervised learning

■ No examples: learning-by-doing

•

Reinforcement learning

(23)

Some Key Challenges

■ We need generalization!

• We cannot simply memorize the training set.

■ What if we see an input that we haven’t seen before?

• Different shape of the digit image (unknown writer)

• “Dirt” on the picture, etc.

• We need to learn what is important for carrying out our task.
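A small sketch of why memorizing is not enough. The "images" here are made-up 6-pixel tuples, and the nearest-neighbor rule is just one simple stand-in for a method that generalizes; neither is taken from the lecture.

```python
# Hypothetical training set: tiny binary "images" (tuples of pixels) with labels.
training_set = {
    (0, 1, 0, 0, 1, 0): "one",
    (1, 1, 1, 1, 1, 1): "zero",
}


def memorizing_classifier(image):
    """Look the input up verbatim; returns None for anything not seen in training."""
    return training_set.get(image)


def nearest_neighbor_classifier(image):
    """Return the label of the training image that differs in the fewest pixels."""
    def differing_pixels(other):
        return sum(a != b for a, b in zip(image, other))

    closest = min(training_set, key=differing_pixels)
    return training_set[closest]


# A "one" with a single flipped pixel ("dirt" on the picture).
noisy_one = (0, 1, 0, 0, 1, 1)

print(memorizing_classifier(noisy_one))        # unseen input: no answer
print(nearest_neighbor_classifier(noisy_one))  # still recognized as "one"
```

The lookup table has nothing to say about an input it has never seen, while even a crude similarity rule still produces a sensible answer.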

(24)

Generalization

■ How do we achieve generalization?

[Figure: lightness versus width for salmon and sea bass, with a complex decision boundary and a novel test point marked “?”.]

FIGURE 1.5. Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

(25)

Generalization

■ How do we achieve generalization?

[Figure: lightness versus width for salmon and sea bass, with a decision boundary.]

FIGURE 1.6. The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier, thereby giving the highest accuracy on new patterns. From: Richard O. Duda, Peter E. Hart, and David G.

Occam’s Razor

(26)

Some Key Challenges

■ Input x:

■ Features:

• Choosing the “right” features is very important.

• Coding and use of domain knowledge.

• May allow for invariance (e.g. volume and pitch of voice).

■ Curse of Dimensionality:

• If the features are too high-dimensional, we will run into trouble - more later.

• Dimensionality reduction
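One common way to get a first feeling for the curse of dimensionality, sketched here under the assumption that each feature is discretized into a fixed number of bins (the helper `grid_cells` and the choice of 10 bins are purely illustrative):

```python
def grid_cells(bins_per_dimension, dimensions):
    """Number of cells when every feature axis is split into the same number of bins."""
    return bins_per_dimension ** dimensions


# With 10 bins per feature, the number of cells to cover with data explodes.
for d in (1, 2, 3, 10):
    print(d, "features ->", grid_cells(10, d), "cells")
```

With a fixed amount of training data, most of these cells stay empty as the number of features grows.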

(27)

Some Key Challenges

■ How do we measure performance?

•

99% correct classification in speech recognition: What does that really mean?

•

We understand the meaning of the sentence? We understand every word? For all speakers?

■ Need more concrete numbers:

•

% of correctly classified letters

•

average distance driven (until accident...)

•

% of games won

•

% correctly recognized words, sentences, etc.
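A quick calculation shows why the unit of measurement matters. It assumes, purely for illustration, that word errors are independent and that a sentence has 10 words; neither number comes from the lecture.

```python
word_accuracy = 0.99        # "99% correct" measured per word
words_per_sentence = 10     # hypothetical sentence length

# Under the independence assumption, a sentence is correct only if every word is.
sentence_accuracy = word_accuracy ** words_per_sentence
print(round(sentence_accuracy, 3))
```

Under these assumptions roughly one sentence in ten contains at least one error, even though 99% of the words are recognized correctly.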

(28)

Some Key Challenges

■ We also need to define the right error metric:

• Which is better?

• Euclidean distance (L2 norm) might be useless.
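A toy example of how the Euclidean distance can mislead. The 6-pixel "images" are made up: one is a slightly shifted copy of the original, the other is completely blank.

```python
import math


def euclidean_distance(a, b):
    """L2 norm of the pixel-wise difference between two equally sized images."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Hypothetical one-dimensional "images".
original = [1, 1, 0, 0, 0, 0]
shifted = [0, 0, 1, 1, 0, 0]   # same pattern, moved two pixels to the right
blank = [0, 0, 0, 0, 0, 0]     # no pattern at all

distance_to_shifted = euclidean_distance(original, shifted)
distance_to_blank = euclidean_distance(original, blank)
print(distance_to_shifted, distance_to_blank)
```

Measured this way, the blank image is "closer" to the original than the shifted copy, although a human would clearly say the opposite.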

(29)

Some Key Challenges

■ Which is the right model?

• The learned parameters can mean a lot of different things.

- w: may characterize the family of functions or the model space

- w: may index the hypothesis space

- w: vector, adjacency matrix, graph, ...

(30)

Some Key Challenges

■ Even if we have solved the other problems, computation is usually quite hard:

•

Learning often involves some kind of optimization

•

Find (search) best model parameters

•

Often we have to deal with thousands of training examples

•

Given a model, compute the prediction efficiently

(31)

Why is machine learning interesting (for you)?

■ Machine learning is a challenging problem that is far from being solved.

• Our learning systems are primitive compared to us humans.

• Think about what and how quickly a child can learn!

■ It combines insights and tools from many fields and disciplines:

• Traditional artificial intelligence (logic, semantic networks, ...)

• Statistics

• Complexity theory

• Artificial neural networks

(32)

Why is machine learning interesting (for you)?

■ Allows you to apply theoretical skills that you may otherwise only use rarely.

■ Has lots of applications:

• Computer vision

• Computational linguistics

• Search (think Google)

• Digital “assistants”

• Computer systems

• ...

■ It is a growing field:

• Many major companies are hiring people with machine learning knowledge.

• Anecdote: At a recent workshop on computer graphics, about 2/3 of the groups said they would benefit from more machine learning knowledge.

(33)

Preliminary Syllabus

■ Subject to change!

■ Fundamentals (~ 3 weeks)

•

Bayes decision theory, maximum likelihood, Bayesian inference

•

Performance evaluation

•

Probability density estimation

•

Mixture models, expectation maximization

■ Linear Methods (~ 3-4 weeks)

•

Linear regression

•

PCA, robust PCA

(34)

Preliminary Syllabus

■ Large-Margin Methods (~ 3-4 weeks)

•

Statistical learning theory

•

Support vector machines

•

Kernel methods

■ Miscellaneous (~ 3 weeks)

•

Model averaging (bagging & boosting)

•

Graphical models (basic introduction)

(35)

Credits

■ Large parts of the lecture material have been developed by Prof. Bernt Schiele for the previous iterations of this course.

■ Many of the figures that I will use are taken directly from the books by Chris Bishop and Duda, Hart & Stork.

(36)

Brief Review of Basic Probability

■ What you should already know:

[Figure: a red box (B = r) and a blue box (B = b), each containing oranges (F = o) and apples (F = a).]

‣ Random picking:

‣ Red box: 60% of the time

‣ Blue box: 40% of the time

‣ Pick fruit from a box with equal probability

p(B = r) = 0.6, p(B = b) = 0.4

p(F = a | B = r) = p(F = o | B = b) = 0.25

p(F = o | B = r) = p(F = a | B = b) = 0.75

(37)

Brief Review of Basic Probability

■ We usually do not mention the random variable (RV) explicitly (for brevity).

■ Instead of p(X = x) we write:

• p(X) if we want to denote the probability distribution for a particular random variable X.

• p(x) if we want to denote the value of the probability of the random variable being x.

• It should be obvious from the context when we mean the random variable itself and when we mean a value that the random variable can take.

■ Some people use upper case for (discrete) probabilities: P(X = x)

(38)

Brief Review of Basic Probability

■ Joint probability: p(X, Y)

• The probability distribution of random variables X and Y taking on a configuration jointly.

• For example: p(B = b, F = o)

■ Conditional probability: p(X | Y)

• The probability distribution of random variable X given the fact that random variable Y takes on a specific value.

• For example: p(B = b | F = o)

(39)

Basic Rules I

■ Probabilities are always non-negative:

$p(x) \geq 0$

■ Probabilities sum to 1:

$\sum_x p(x) = 1 \;\Rightarrow\; 0 \leq p(x) \leq 1$

■ Sum rule or marginalization:

$p(x) = \sum_y p(x, y), \qquad p(y) = \sum_x p(x, y)$

(40)

Basic Rules II

■ Product rule:

$p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x)$

• From this directly follows...

■ Bayes’ rule or Bayes’ theorem:

$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$

• We will use these rules widely.

[Portrait: Rev. Thomas Bayes, 1701-1761]
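The product rule, sum rule, and Bayes' rule can be checked on the box-and-fruit numbers from the probability review. The dictionary layout and variable names are just one possible way to write this down.

```python
# Prior over boxes and conditional fruit probabilities, as given in the review slide.
p_box = {"r": 0.6, "b": 0.4}
p_fruit_given_box = {
    "r": {"a": 0.25, "o": 0.75},
    "b": {"a": 0.75, "o": 0.25},
}
fruits = ("a", "o")

# Product rule: p(B, F) = p(F | B) p(B)
p_joint = {(b, f): p_fruit_given_box[b][f] * p_box[b] for b in p_box for f in fruits}

# Sum rule: p(F) = sum over B of p(B, F)
p_fruit = {f: sum(p_joint[(b, f)] for b in p_box) for f in fruits}

# Bayes' rule: p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o)
p_red_given_orange = p_fruit_given_box["r"]["o"] * p_box["r"] / p_fruit["o"]

print(p_fruit)
print(p_red_given_orange)
```

Observing an orange raises the probability of the red box from the prior 0.6 to 9/11 (about 0.82), because oranges are more common in the red box.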

(41)

Continuous RVs

■ What if we have continuous random variables, say $X = x \in \mathbb{R}$?

• Any single value has zero probability.

• We can only assign a probability for a random variable being in a range of values:

$\Pr(x_0 < X < x_1) = \Pr(x_0 \leq X \leq x_1)$

■ Instead we use the probability density $p(x)$:

$\Pr(x_0 \leq X \leq x_1) = \int_{x_0}^{x_1} p(x)\, dx$

■ Cumulative distribution function:

(42)

Continuous RVs

■ Probability density function = pdf

■ Cumulative distribution function = cdf

■ We can work with a density (pdf) as if it was a probability distribution:

• For simplicity we usually use the same notation for both.

[Figure: a density p(x) and its cumulative distribution function P(x) plotted over x, with an interval of width δx marked.]

(43)

Basic rules for pdfs

■ What are the rules?

• Non-negativity:

$p(x) \geq 0$

• “Summing” to 1:

$\int p(x)\, dx = 1$

• But:

$p(x) \nleq 1$ in general

• Marginalization:

$p(x) = \int p(x, y)\, dy, \qquad p(y) = \int p(x, y)\, dx$
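That a density may exceed 1 while still integrating to 1 is easy to verify numerically. The sketch below assumes a uniform density on the interval [0, 0.5], chosen only as an example, and uses a simple midpoint rule for the integral.

```python
def uniform_density(x, low=0.0, high=0.5):
    """Density of a uniform distribution on [low, high] (example choice)."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def integrate(f, a, b, steps=10000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


density_value = uniform_density(0.25)           # larger than 1
total = integrate(uniform_density, -1.0, 1.0)   # still integrates to 1
print(density_value, total)
```

The density is a probability per unit length, so only its integral over a range, not its value at a point, has to stay below 1.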

(44)

Expectations

■ The average value of a function $f(x)$ under a probability distribution $p(x)$ is the expectation:

$\mathbb{E}[f] = \mathbb{E}[f(x)] = \sum_x f(x)\, p(x) \quad \text{or} \quad \mathbb{E}[f] = \int f(x)\, p(x)\, dx$

■ For joint distributions we sometimes write:

$\mathbb{E}_x[f(x, y)]$

■ Conditional expectation:

$\mathbb{E}_{x \mid y}[f] = \mathbb{E}_x[f \mid y] = \sum_x f(x)\, p(x \mid y)$
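For a discrete distribution the expectation is just a weighted sum. The sketch below uses a fair six-sided die as a hypothetical example; the conditional distribution "given that the roll is even" is likewise an assumption made for illustration.

```python
def expectation(f, distribution):
    """E[f] = sum over x of f(x) p(x), for a distribution given as {x: p(x)}."""
    return sum(f(x) * p for x, p in distribution.items())


# Hypothetical example: a fair die, p(x) = 1/6 for x = 1..6.
die = {x: 1 / 6 for x in range(1, 7)}

# Conditional distribution p(x | y) for y = "the roll is even".
die_given_even = {x: 1 / 3 for x in (2, 4, 6)}

mean = expectation(lambda x: x, die)                        # E[x]
mean_square = expectation(lambda x: x * x, die)             # E[x^2]
mean_given_even = expectation(lambda x: x, die_given_even)  # E[x | even]
print(mean, mean_square, mean_given_even)
```

The same function covers the conditional case because a conditional expectation is an ordinary expectation taken under p(x | y).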

(45)

Variance and Covariance

■ Variance of a single RV:

■ Covariance of two RVs:

■ Random vectors:

•

Everything we have said so far applies not only to scalar random variables, but also to random vectors.
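For reference, the standard textbook definitions of variance and covariance, written in the expectation notation of the previous slide, are given below. They are stated here as an assumption about what this slide showed, not as a transcription of it.

```latex
\begin{align}
\operatorname{var}[x] &= \mathbb{E}\big[(x - \mathbb{E}[x])^2\big]
                       = \mathbb{E}[x^2] - \mathbb{E}[x]^2 \\
\operatorname{cov}[x, y] &= \mathbb{E}_{x,y}\big[(x - \mathbb{E}[x])\,(y - \mathbb{E}[y])\big]
                          = \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\,\mathbb{E}[y] \\
\operatorname{cov}[\mathbf{x}, \mathbf{y}] &= \mathbb{E}_{\mathbf{x},\mathbf{y}}\big[(\mathbf{x} - \mathbb{E}[\mathbf{x}])\,(\mathbf{y} - \mathbb{E}[\mathbf{y}])^{\mathsf{T}}\big]
\end{align}
```

The last line is the random-vector case: the covariance becomes a matrix whose entries are the pairwise covariances of the vector components.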