• No results found

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements

N/A
N/A
Protected

Academic year: 2021

Share "CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements"

Copied!
20
0
0

Loading.... (view fulltext now)

Full text

(1)

CS 2750 Machine Learning

CS 2750 Machine Learning

Lecture 3

Milos Hauskrecht [email protected] 5329 Sennott Square

Density estimation

Announcements

Next lecture:Matlab tutorial

Rules for attending the class:Registered for credit

Registered for audit (only if there are available seats) Rules for audit:

(2)

CS 2750 Machine Learning

Review

Design cycle

Data Feature selection Model selection Learning Evaluation Require prior knowledge
(3)

CS 2750 Machine Learning

Data

Data may need a lot of:

Cleaning

Preprocessing (conversions) Cleaning:

Get rid of errors, noise,Removal of redundancies Preprocessing:Renaming Rescaling (normalization)DiscretizationsAbstractionAggregationNew attributes

Data biases

Watch out for data biases:

– Try to understand the data source

– It is very easy to derive “unexpected” results when data used for analysis and learning are biased (pre-selected) • Results (conclusions) derived for pre-selected data do not

(4)

CS 2750 Machine Learning

Data biases

Example 1:Risks in pregnancy study • Sponsored by DARPA at military hospitals

• Study of a large sample of pregnant woman who visited military hospitals

Conclusion:the factor with the largest impact on reducing risks during pregnancy (statistically significant) is a pregnant woman being single

• Single woman Æthe smallest risk • What is wrong?

Data

Example 2:Stock market trading (example by Andrew Lo) – Data on stock performances of companies traded on stock

market over past 25 year

Investment goal:pick a stock to hold long term

Proposed strategy: invest in a company stock with an IPO corresponding to a Carmichael number

- Evaluation result:excellent return over 25 years - Where the magic comes from?

(5)

CS 2750 Machine Learning

Design cycle

Data Feature selection Model selection Learning Evaluation Require prior knowledge

Feature selection

The size (dimensionality) of a samplecan be enormous • Example:document classification

– 10,000 different words

– Inputs: counts of occurrences of different words – Too many parameters to learn (not enough samples to

justify the estimates the parameters of the model) • Dimensionality reduction: replace inputs with features

Extract relevant inputs(e.g. mutual information measure) – PCA– principal component analysis

Group (cluster) similar words(uses a similarity measure) • Replace with the group label

) ,.., , ( 1 2 d i i i i x x x x = d - very large

(6)

CS 2750 Machine Learning

Design cycle

Data Feature selection Model selection Learning Evaluation Require prior knowledge

Model selection

What is the right model to learn?

E.g what polynomial to use

– A prior knowledge helps a lot, but still a lot of guessing – Initial data analysis and visualization

• We can make a good guess about the form of the distribution, shape of the function

Overfitting problem

– Take into account the bias and varianceof error estimates – Simpler (more biased) model – parameters can be estimated

more reliably (smaller variance of estimates) – Complex model with many parameters – parameter

(7)

CS 2750 Machine Learning

Solutions for overfitting

How to make the learner avoid the overfit?

Assure sufficient number of samplesin the training set – May not be possible (small number of examples) • Hold some data out of the training set = validation set

– Train (fit) on the training set (w/o data held out);

– Check for the generalization error on the validation set, choose the model based on the validation set error (random resampling validation techniques)

Regularization (Occam’s Razor)

– Penalize for the model complexity (number of parameters) – Explicit preference towards simple models

Design cycle

Data Feature selection Model selection Learning Evaluation Require prior knowledge
(8)

CS 2750 Machine Learning

Learning

Learning = optimization problem. Various criteria: – Mean square error

Maximum likelihood (ML) criterion

Maximum posterior probability (MAP) ) | ( max arg * = Θ Θ Θ P D ) | ( max arg * = P Θ D Θ Θ ( ) ) ( ) | ( ) | ( D P P D P D P Θ = Θ Θ ) | ( log ) (Θ = − P D Θ Error 2 ,.. 1 )) , ( ( 1 ) (w i i w N i x f y N Error =

− = ) ( min arg * w w w Error =

Learning

Learning = optimization problem

• Optimization problems can be hard to solve. Right choice of a model and an error function makes a difference.

Parameter optimizations

• Gradient descent, Conjugate gradient (1storder method)

• Newton-Rhapson (2ndorder method)

• Levenberg-Marquard

Some can be carried on-lineon a sample by sample basis Combinatorial optimizations (over discrete spaces):

• Hill-climbing

• Simulated-annealing • Genetic algorithms

(9)

CS 2750 Machine Learning

Design cycle

Data Feature selection Model selection Learning Evaluation Require prior knowledge

Simple holdout method.

– Divide the data to the training and test data. • Other more complex methods

– Based on random re-sampling validation schemes: • cross-validation, random sub-sampling.

• What if we want to compare the predictive performance on a classification or a regression problem for two different learning methods?

Solution:compare the error results on the test data set • The method with better (smaller) testing error gives a better

generalization error.

• But we need statistics to show significance

Evaluation.

(10)

CS 2750 Machine Learning

Density estimation

Outline

Outline:Density estimation: – Maximum likelihood (ML) – Bayesian parameter estimates – MAP

Bernoulli distribution.Binomial distributionMultinomial distributionNormal distribution

(11)

CS 2750 Machine Learning

Density estimation

Data:

Attributes:

• modeled by random variables with: – Continuous values

Discrete values

E.g. blood pressurewith numerical values or chest painwith discrete values

[no-pain, mild, moderate, strong] Underlying true probability distribution:

} ,.., , {D1 D2 Dn D= i i

D =x a vector of attribute values } , , , {X1 X2 K Xd = X ) (X p

Density estimation

Data:

Objective: try to estimate the underlying ‘true’ probability distribution over variables , , using examples in D

Standard (iid) assumptions: Samplesare independentof each other

come from the same (identical) distribution (fixed ) } ,.., , {D1 D2 Dn D= i i

D =x a vector of attribute values

X

) (X

p D={D1,D2,..,Dn}

n samples

true distribution estimate

) ( ˆ X p ) (X p ) (X p

(12)

CS 2750 Machine Learning

Density estimation

Types of density estimation:

Parametric

• the distribution is modeled using a set of parameters • Example:mean and covariances of a multivariate normal • Estimation: find parameters describing data D

Non-parametric

• The model of the distribution utilizes all examples in D

• As if all examples were parameters of the distribution • Examples:Nearest-neighbor Semi-parametric Θ ) | (X Θ p Θ

Learning via parameter estimation

In this lecture we consider parametric density estimation Basic settings:

• A set of random variables

A model of the distributionover variables in X

with parameters : • Data

Objective:find parameters such that describes data D the best } , , , {X1 X2 K Xd = X Θ } ,.., , {D1 D2 Dn D= ) | ( ˆ X Θ p Θ p(X|Θ)

(13)

CS 2750 Machine Learning

Parameter estimation.

Maximum likelihood (ML)

– yields: one set of parameters

– the target distribution is approximated as: • Bayesian parameter estimation

– uses the posterior distribution over possible parameters

– Yields: all possible settings of (and their “weights”) – The target distribution is approximated as:

)

,

|

(

D

Θ

ξ

p

maximize ) | ( ) | ( ) , | ( ) , | ( ξ ξ ξ ξ D p p D p D p Θ = Θ Θ ML

Θ

Θ ) ( ) ( ˆ p ML p X = X|Θ

= = Θ Θ Θ Θ | X X p D p X p D d pˆ( ) ( ) ( | ) ( | ,ξ)

Parameter estimation.

Other possible criteria:

Maximum a posteriori probability (MAP) – Yields: one set of parameters

– Approximation:

Expected value of the parameter

– Expectation taken with regard to posterior – Yields: one set of parameters

– Approximation:

maximize

p

(

Θ

|

D

,

ξ

)

(mode of the posterior) MAP

Θ

) ( ˆ Θ Θ=E ) ( ) ( ˆ p MAP p X = X|Θ

)

,

|

(

D

ξ

p

Θ

) ˆ ( ) ( ˆ X p X|Θ p =
(14)

CS 2750 Machine Learning

Parameter estimation. Coin example.

Coin example:we have a coin that can be biased

Outcomes:two possible values -- head or tail Data: D a sequence of outcomes such that

headtail

Model: probability of a head probability of a tail Objective:

We would like to estimate the probability of a head from data

θ

) 1 ( −θ 0 = i x 1 = i x i x

θ

ˆ

Parameter estimation. Example.

Assumethe unknown and possibly biased coin • Probability of the head is

Data:

H H T T H H T H T H T T T H T H H H H T H H H H T – Heads:15

Tails:10

What would be your estimate of the probability of a head ?

θ

? ~

= θ

(15)

CS 2750 Machine Learning

Parameter estimation. Example

Assumethe unknown and possibly biased coin • Probability of the head is

Data:

H H T T H H T H T H T T T H T H H H H T H H H H T – Heads:15

Tails:10

What would be your choice of the probability of a head ? Solution: use frequencies of occurrences to do the estimate

This is the maximum likelihood estimateof the parameter

θ

6 . 0 25 15 ~= = θ

θ

Probability of an outcome

Data: D a sequence of outcomes such that

headtail

Model: probability of a head probability of a tail

Assume: we know the probability Probability of an outcome of a coin flip

– Combines the probability of a head and a tail – So that is going to pick its correct probability – Gives for – Gives for ) 1 ( ) 1 ( ) | ( xi xi i x P

θ

=

θ

θ

θ

) 1 ( −θ 0 = i x 1 = i x i x Bernoulli distribution i x i x

θ

) 1 ( −θ xi =0 1 = i x

θ

(16)

CS 2750 Machine Learning

Probability of a sequence of outcomes.

Data: D a sequence of outcomes such that

headtail

Model: probability of a head probability of a tail

Assume: a sequence of independent coin flips

D = H H T H T H (encoded as D= 110101) What is the probability of observing the data sequenceD:

?

)

|

(

D

θ

=

P

θ

) 1 ( −θ 0 = i x 1 = i x i x

Probability of a sequence of outcomes.

Data: D a sequence of outcomes such that

headtail

Model: probability of a head probability of a tail

Assume: a sequence of coin flips D = H H T H T H encoded as D= 110101

What is the probability of observing a data sequenceD:

θ

θ

θ

θ

θθ

θ

) (1 ) (1 ) | (D = − − P

θ

) 1 ( −θ 0 = i x 1 = i x i x
(17)

CS 2750 Machine Learning

Probability of a sequence of outcomes.

Data: D a sequence of outcomes such that

headtail

Model: probability of a head probability of a tail

Assume: a sequence of coin flips D = H H T H T H encoded as D= 110101

What is the probability of observing a data sequenceD:

θ

θ

θ

θ

θθ

θ

) (1 ) (1 ) | (D = − − P

θ

) 1 ( −θ 0 = i x 1 = i x i x

likelihood of the data

Probability of a sequence of outcomes.

Data: D a sequence of outcomes such that

headtail

Model: probability of a head probability of a tail

Assume: a sequence of coin flips D = H H T H T H encoded as D= 110101

What is the probability of observing a data sequenceD:

Can be rewritten using the Bernoulli distribution:

θ

θ

θ

θ

θθ

θ

) (1 ) (1 ) | (D = − − P

θ

) 1 ( −θ 0 = i x 1 = i x i x ) 1 ( 6 1 ) 1 ( ) | ( i xi i x D P − = − =

θ

θ

θ

(18)

CS 2750 Machine Learning

The goodness of fit to the data.

Learning: we do not know the value of the parameter Our learning goal:

• Find the parameter that fits the data D the best? One solution to the “best”:Maximize the likelihood

Intuition:

• more likely are the data given the model, the better is the fit Note: Instead of an error function that measures how bad the data

fit the model we have a measure that tells us how well the data fit :

θ

) 1 ( 1 ) 1 ( ) | ( i xi n i x D P − = − =

θ θ θ

θ

) | ( ) , (D

θ

P D

θ

Error =−

Example: Bernoulli distribution.

Coin example:we have a coin that can be biased

Outcomes:two possible values -- head or tail Data: D a sequence of outcomes such that

headtail

Model: probability of a head probability of a tail Objective:

We would like to estimate the probability of a head Probability of an outcome ) 1 ( ) 1 ( ) | ( xi xi i x P

θ

=

θ

θ

θ

) 1 ( −θ 0 = i x 1 = i x i x

θ

ˆ i x Bernoulli distribution
(19)

CS 2750 Machine Learning

Maximum likelihood (ML) estimate.

Maximum likelihoodestimate

1

N - number of heads seen N2 - number of tails seen

) , | ( max arg

θ

ξ

θ

θ P D ML = Likelihood of data: ) 1 ( 1 ) 1 ( ) , | ( i xi n i x D P − = − =

θ

θ

ξ

θ

Optimize log-likelihood (the same as maximizing likelihood)

= − = = − =

(1 ) 1 ) 1 ( log ) , | ( log ) , ( i xi n i x D P D l

θ

θ

ξ

θ

θ

) 1 ( ) 1 log( log ) 1 log( ) 1 ( log 1 1 1

= = = − − + = − − + n i i n i i n i i i x x x x

θ

θ

θ

θ

Maximum likelihood (ML) estimate.

2 1 1 1 N N N N N ML = = + θ ML Solution: Optimize log-likelihood ) 1 log( log ) , (D

θ

=N1

θ

+N2

θ

l

Set derivative to zero

0 ) 1 ( ) , ( 1 2 = − − = ∂ ∂

θ

θ

θ

θ

N N D l 2 1 1 N N N + =

θ

Solving
(20)

CS 2750 Machine Learning

Maximum likelihood estimate. Example

Assumethe unknown and possibly biased coin

• Probability of the head is • Data:

H H T T H H T H T H T T T H T H H H H T H H H H T – Heads:15

Tails:10

What is the ML estimate of the probability of a head and a tail?

θ

Maximum likelihood estimate. Example

• Assume the unknown and possibly biased coin

• Probability of the head is • Data:

H H T T H H T H T H T T T H T H H H H T H H H H T – Heads:15

Tails:10

What is the ML estimate of the probability of head and tail ?

θ

6 . 0 25 15 2 1 1 1 = = + = = N N N N N ML θ 4 . 0 25 10 ) 1 ( 2 1 2 2 = = + = = − N N N N N ML θ Head: Tail:

References

Related documents