CS 2750 Machine Learning
CS 2750 Machine Learning
Lecture 3
Milos Hauskrecht [email protected] 5329 Sennott SquareDensity estimation
Announcements
Next lecture: • Matlab tutorialRules for attending the class: • Registered for credit
• Registered for audit (only if there are available seats) Rules for audit:
CS 2750 Machine Learning
Review
Design cycle
Data Feature selection Model selection Learning Evaluation Require prior knowledgeCS 2750 Machine Learning
Data
Data may need a lot of:• Cleaning
• Preprocessing (conversions) Cleaning:
– Get rid of errors, noise, – Removal of redundancies Preprocessing: – Renaming – Rescaling (normalization) – Discretizations – Abstraction – Aggregation – New attributes
Data biases
• Watch out for data biases:– Try to understand the data source
– It is very easy to derive “unexpected” results when data used for analysis and learning are biased (pre-selected) • Results (conclusions) derived for pre-selected data do not
CS 2750 Machine Learning
Data biases
Example 1:Risks in pregnancy study • Sponsored by DARPA at military hospitals• Study of a large sample of pregnant woman who visited military hospitals
• Conclusion:the factor with the largest impact on reducing risks during pregnancy (statistically significant) is a pregnant woman being single
• Single woman Æthe smallest risk • What is wrong?
Data
Example 2:Stock market trading (example by Andrew Lo) – Data on stock performances of companies traded on stock
market over past 25 year
– Investment goal:pick a stock to hold long term
– Proposed strategy: invest in a company stock with an IPO corresponding to a Carmichael number
- Evaluation result:excellent return over 25 years - Where the magic comes from?
CS 2750 Machine Learning
Design cycle
Data Feature selection Model selection Learning Evaluation Require prior knowledgeFeature selection
• The size (dimensionality) of a samplecan be enormous • Example:document classification
– 10,000 different words
– Inputs: counts of occurrences of different words – Too many parameters to learn (not enough samples to
justify the estimates the parameters of the model) • Dimensionality reduction: replace inputs with features
– Extract relevant inputs(e.g. mutual information measure) – PCA– principal component analysis
– Group (cluster) similar words(uses a similarity measure) • Replace with the group label
) ,.., , ( 1 2 d i i i i x x x x = d - very large
CS 2750 Machine Learning
Design cycle
Data Feature selection Model selection Learning Evaluation Require prior knowledgeModel selection
• What is the right model to learn?– E.g what polynomial to use
– A prior knowledge helps a lot, but still a lot of guessing – Initial data analysis and visualization
• We can make a good guess about the form of the distribution, shape of the function
• Overfitting problem
– Take into account the bias and varianceof error estimates – Simpler (more biased) model – parameters can be estimated
more reliably (smaller variance of estimates) – Complex model with many parameters – parameter
CS 2750 Machine Learning
Solutions for overfitting
How to make the learner avoid the overfit?• Assure sufficient number of samplesin the training set – May not be possible (small number of examples) • Hold some data out of the training set = validation set
– Train (fit) on the training set (w/o data held out);
– Check for the generalization error on the validation set, choose the model based on the validation set error (random resampling validation techniques)
• Regularization (Occam’s Razor)
– Penalize for the model complexity (number of parameters) – Explicit preference towards simple models
Design cycle
Data Feature selection Model selection Learning Evaluation Require prior knowledgeCS 2750 Machine Learning
Learning
• Learning = optimization problem. Various criteria: – Mean square error
– Maximum likelihood (ML) criterion
– Maximum posterior probability (MAP) ) | ( max arg * = Θ Θ Θ P D ) | ( max arg * = P Θ D Θ Θ ( ) ) ( ) | ( ) | ( D P P D P D P Θ = Θ Θ ) | ( log ) (Θ = − P D Θ Error 2 ,.. 1 )) , ( ( 1 ) (w i i w N i x f y N Error =
∑
− = ) ( min arg * w w w Error =Learning
Learning = optimization problem• Optimization problems can be hard to solve. Right choice of a model and an error function makes a difference.
• Parameter optimizations
• Gradient descent, Conjugate gradient (1storder method)
• Newton-Rhapson (2ndorder method)
• Levenberg-Marquard
Some can be carried on-lineon a sample by sample basis Combinatorial optimizations (over discrete spaces):
• Hill-climbing
• Simulated-annealing • Genetic algorithms
CS 2750 Machine Learning
Design cycle
Data Feature selection Model selection Learning Evaluation Require prior knowledge• Simple holdout method.
– Divide the data to the training and test data. • Other more complex methods
– Based on random re-sampling validation schemes: • cross-validation, random sub-sampling.
• What if we want to compare the predictive performance on a classification or a regression problem for two different learning methods?
• Solution:compare the error results on the test data set • The method with better (smaller) testing error gives a better
generalization error.
• But we need statistics to show significance
Evaluation.
CS 2750 Machine Learning
Density estimation
Outline
Outline: • Density estimation: – Maximum likelihood (ML) – Bayesian parameter estimates – MAP• Bernoulli distribution. • Binomial distribution • Multinomial distribution • Normal distribution
CS 2750 Machine Learning
Density estimation
Data:Attributes:
• modeled by random variables with: – Continuous values
– Discrete values
E.g. blood pressurewith numerical values or chest painwith discrete values
[no-pain, mild, moderate, strong] Underlying true probability distribution:
} ,.., , {D1 D2 Dn D= i i
D =x a vector of attribute values } , , , {X1 X2 K Xd = X ) (X p
Density estimation
Data:Objective: try to estimate the underlying ‘true’ probability distribution over variables , , using examples in D
Standard (iid) assumptions: Samples • are independentof each other
• come from the same (identical) distribution (fixed ) } ,.., , {D1 D2 Dn D= i i
D =x a vector of attribute values
X
) (X
p D={D1,D2,..,Dn}
n samples
true distribution estimate
) ( ˆ X p ) (X p ) (X p
CS 2750 Machine Learning
Density estimation
Types of density estimation:Parametric
• the distribution is modeled using a set of parameters • Example:mean and covariances of a multivariate normal • Estimation: find parameters describing data D
Non-parametric
• The model of the distribution utilizes all examples in D
• As if all examples were parameters of the distribution • Examples:Nearest-neighbor Semi-parametric Θ ) | (X Θ p Θ
Learning via parameter estimation
In this lecture we consider parametric density estimation Basic settings:• A set of random variables
• A model of the distributionover variables in X
with parameters : • Data
Objective:find parameters such that describes data D the best } , , , {X1 X2 K Xd = X Θ } ,.., , {D1 D2 Dn D= ) | ( ˆ X Θ p Θ p(X|Θ)
CS 2750 Machine Learning
Parameter estimation.
• Maximum likelihood (ML)– yields: one set of parameters
– the target distribution is approximated as: • Bayesian parameter estimation
– uses the posterior distribution over possible parameters
– Yields: all possible settings of (and their “weights”) – The target distribution is approximated as:
)
,
|
(
D
Θ
ξ
p
maximize ) | ( ) | ( ) , | ( ) , | ( ξ ξ ξ ξ D p p D p D p Θ = Θ Θ MLΘ
Θ ) ( ) ( ˆ p ML p X = X|Θ∫
= = Θ Θ Θ Θ | X X p D p X p D d pˆ( ) ( ) ( | ) ( | ,ξ)Parameter estimation.
Other possible criteria:• Maximum a posteriori probability (MAP) – Yields: one set of parameters
– Approximation:
• Expected value of the parameter
– Expectation taken with regard to posterior – Yields: one set of parameters
– Approximation:
maximize
p
(
Θ
|
D
,
ξ
)
(mode of the posterior) MAPΘ
) ( ˆ Θ Θ=E ) ( ) ( ˆ p MAP p X = X|Θ)
,
|
(
D
ξ
p
Θ
) ˆ ( ) ( ˆ X p X|Θ p =CS 2750 Machine Learning
Parameter estimation. Coin example.
Coin example:we have a coin that can be biasedOutcomes:two possible values -- head or tail Data: D a sequence of outcomes such that
• head • tail
Model: probability of a head probability of a tail Objective:
We would like to estimate the probability of a head from data
θ
) 1 ( −θ 0 = i x 1 = i x i xθ
ˆParameter estimation. Example.
• Assumethe unknown and possibly biased coin • Probability of the head is• Data:
H H T T H H T H T H T T T H T H H H H T H H H H T – Heads:15
– Tails:10
What would be your estimate of the probability of a head ?
θ
? ~
= θ
CS 2750 Machine Learning
Parameter estimation. Example
• Assumethe unknown and possibly biased coin • Probability of the head is• Data:
H H T T H H T H T H T T T H T H H H H T H H H H T – Heads:15
– Tails:10
What would be your choice of the probability of a head ? Solution: use frequencies of occurrences to do the estimate
This is the maximum likelihood estimateof the parameter
θ
6 . 0 25 15 ~= = θθ
Probability of an outcome
Data: D a sequence of outcomes such that• head • tail
Model: probability of a head probability of a tail
Assume: we know the probability Probability of an outcome of a coin flip
– Combines the probability of a head and a tail – So that is going to pick its correct probability – Gives for – Gives for ) 1 ( ) 1 ( ) | ( xi xi i x P
θ
=θ
−θ
−θ
) 1 ( −θ 0 = i x 1 = i x i x Bernoulli distribution i x i xθ
) 1 ( −θ xi =0 1 = i xθ
CS 2750 Machine Learning
Probability of a sequence of outcomes.
Data: D a sequence of outcomes such that• head • tail
Model: probability of a head probability of a tail
Assume: a sequence of independent coin flips
D = H H T H T H (encoded as D= 110101) What is the probability of observing the data sequenceD:
?
)
|
(
D
θ
=
P
θ
) 1 ( −θ 0 = i x 1 = i x i xProbability of a sequence of outcomes.
Data: D a sequence of outcomes such that• head • tail
Model: probability of a head probability of a tail
Assume: a sequence of coin flips D = H H T H T H encoded as D= 110101
What is the probability of observing a data sequenceD:
θ
θ
θ
θ
θθ
θ
) (1 ) (1 ) | (D = − − Pθ
) 1 ( −θ 0 = i x 1 = i x i xCS 2750 Machine Learning
Probability of a sequence of outcomes.
Data: D a sequence of outcomes such that• head • tail
Model: probability of a head probability of a tail
Assume: a sequence of coin flips D = H H T H T H encoded as D= 110101
What is the probability of observing a data sequenceD:
θ
θ
θ
θ
θθ
θ
) (1 ) (1 ) | (D = − − Pθ
) 1 ( −θ 0 = i x 1 = i x i xlikelihood of the data
Probability of a sequence of outcomes.
Data: D a sequence of outcomes such that• head • tail
Model: probability of a head probability of a tail
Assume: a sequence of coin flips D = H H T H T H encoded as D= 110101
What is the probability of observing a data sequenceD:
Can be rewritten using the Bernoulli distribution:
θ
θ
θ
θ
θθ
θ
) (1 ) (1 ) | (D = − − Pθ
) 1 ( −θ 0 = i x 1 = i x i x ) 1 ( 6 1 ) 1 ( ) | ( i xi i x D P − = − =∏
θ
θ
θ
CS 2750 Machine Learning
The goodness of fit to the data.
Learning: we do not know the value of the parameter Our learning goal:• Find the parameter that fits the data D the best? One solution to the “best”:Maximize the likelihood
Intuition:
• more likely are the data given the model, the better is the fit Note: Instead of an error function that measures how bad the data
fit the model we have a measure that tells us how well the data fit :
θ
) 1 ( 1 ) 1 ( ) | ( i xi n i x D P − = − =∏
θ θ θθ
) | ( ) , (Dθ
P Dθ
Error =−Example: Bernoulli distribution.
Coin example:we have a coin that can be biasedOutcomes:two possible values -- head or tail Data: D a sequence of outcomes such that
• head • tail
Model: probability of a head probability of a tail Objective:
We would like to estimate the probability of a head Probability of an outcome ) 1 ( ) 1 ( ) | ( xi xi i x P
θ
=θ
−θ
−θ
) 1 ( −θ 0 = i x 1 = i x i xθ
ˆ i x Bernoulli distributionCS 2750 Machine Learning
Maximum likelihood (ML) estimate.
Maximum likelihoodestimate
1
N - number of heads seen N2 - number of tails seen
) , | ( max arg
θ
ξ
θ
θ P D ML = Likelihood of data: ) 1 ( 1 ) 1 ( ) , | ( i xi n i x D P − = − =∏
θ
θ
ξ
θ
Optimize log-likelihood (the same as maximizing likelihood)
= − = = − =
∏
(1 ) 1 ) 1 ( log ) , | ( log ) , ( i xi n i x D P D lθ
θ
ξ
θ
θ
) 1 ( ) 1 log( log ) 1 log( ) 1 ( log 1 1 1∑
∑
∑
= = = − − + = − − + n i i n i i n i i i x x x xθ
θ
θ
θ
Maximum likelihood (ML) estimate.
2 1 1 1 N N N N N ML = = + θ ML Solution: Optimize log-likelihood ) 1 log( log ) , (D
θ
=N1θ
+N2 −θ
lSet derivative to zero
0 ) 1 ( ) , ( 1 2 = − − = ∂ ∂
θ
θ
θ
θ
N N D l 2 1 1 N N N + =θ
SolvingCS 2750 Machine Learning
Maximum likelihood estimate. Example
• Assumethe unknown and possibly biased coin• Probability of the head is • Data:
H H T T H H T H T H T T T H T H H H H T H H H H T – Heads:15
– Tails:10
What is the ML estimate of the probability of a head and a tail?
θ
Maximum likelihood estimate. Example
• Assume the unknown and possibly biased coin• Probability of the head is • Data:
H H T T H H T H T H T T T H T H H H H T H H H H T – Heads:15
– Tails:10
What is the ML estimate of the probability of head and tail ?
θ
6 . 0 25 15 2 1 1 1 = = + = = N N N N N ML θ 4 . 0 25 10 ) 1 ( 2 1 2 2 = = + = = − N N N N N ML θ Head: Tail: