CS 2710 Foundations of AI
Lecture 24
Milos Hauskrecht, 5329 Sennott Square
Learning probability distributions
Announcements
Homework 10:
• due today
Homework 11:
• Due next week on Wednesday
Final exam:
• December 14, 2004 at 11:00am-1:00pm
• Location: 5201 WWPH Posvar Hall
• Closed book
Unsupervised learning
• Data: $D = \{D_1, D_2, \ldots, D_n\}$, where $D_i = \mathbf{x}_i$ is a vector of attribute values
– e.g. the description of a patient
– no specific target attribute we want to predict (no output y)
• Objective:
– learn (describe) relations between attributes, examples
Types of problems:
• Clustering
– Group together “similar” examples
• Density estimation
– Model probabilistically the population of examples
Density estimation
Data: $D = \{D_1, D_2, \ldots, D_n\}$, where $D_i = \mathbf{x}_i$ is a vector of attribute values
Attributes:
• modeled by random variables with:
– Continuous values
– Discrete values
E.g. blood pressure with numerical values or chest pain with discrete values [no-pain, mild, moderate, strong]
Underlying true probability distribution: $p(\mathbf{X})$
Density estimation
Data: $D = \{D_1, D_2, \ldots, D_n\}$, where $D_i = \mathbf{x}_i$ is a vector of attribute values
Objective: try to estimate the underlying true probability distribution over variables $\mathbf{X}$, $p(\mathbf{X})$, using examples in D
[Figure: true distribution $p(\mathbf{X})$, n samples $D = \{D_1, D_2, \ldots, D_n\}$ drawn from it, and the estimate $\hat{p}(\mathbf{X})$]
Standard (iid) assumptions: Samples
• are independent of each other
• come from the same (identical) distribution (fixed $p(\mathbf{X})$)
Learning via parameter estimation
In this lecture we consider parametric density estimation.
Basic settings:
• A set of random variables $\mathbf{X}$
• A model of the distribution over variables in $\mathbf{X}$ with parameters $\Theta$
• Data $D = \{D_1, D_2, \ldots, D_n\}$
Objective: find parameters $\hat{\Theta}$ that fit the data the best, in other words parameters that reduce the misfit between the data and the model
• What is the best set of parameters?
– There are various criteria one can apply here.
Parameter estimation. Basic criteria.
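• Maximum likelihood (ML) criterion
$$\Theta_{ML} = \arg\max_{\Theta} \; p(D \mid \Theta, \xi)$$
– $p(D \mid \Theta, \xi)$ is the likelihood of data
– $\xi$ represents prior (background) knowledge
• Maximum a posteriori probability (MAP) criterion
$$\Theta_{MAP} = \arg\max_{\Theta} \; p(\Theta \mid D, \xi)$$
– MAP selects the mode of the posterior
– $p(\Theta \mid D, \xi)$ is the posterior probability:
$$p(\Theta \mid D, \xi) = \frac{p(D \mid \Theta, \xi)\, p(\Theta \mid \xi)}{p(D \mid \xi)}$$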
Parameter estimation. Coin example.
Coin example: we have a coin that can be biased
Outcomes: two possible values -- head or tail
Data: D a sequence of outcomes $x_i$ such that
• head: $x_i = 1$
• tail: $x_i = 0$
Model: probability of a head $\theta$, probability of a tail $(1 - \theta)$
Objective: estimate the probability of a head, $\theta$, from the data D
Parameter estimation. Example
• Assume the unknown and possibly biased coin
• Probability of the head is $\theta$
• Data:
H H T T H H T H T H T T T H T H H H H T H H H H T
– Heads: 15
– Tails: 10
What would be your choice of the probability of a head?
Solution: use frequencies of occurrences to do the estimate
$$\tilde{\theta} = \frac{15}{25} = 0.6$$
This is the maximum likelihood estimate of the parameter $\theta$
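The frequency-based estimate above can be reproduced with a few lines of Python. This is only a minimal sketch: the variable names (`flips`, `theta_hat`) are illustrative and not part of the lecture.

```python
from collections import Counter

# The 25 coin flips listed in the example (H = head, T = tail).
flips = "H H T T H H T H T H T T T H T H H H H T H H H H T".split()

counts = Counter(flips)
n_heads = counts["H"]
n_tails = counts["T"]

# Relative frequency of heads, used as the estimate of theta.
theta_hat = n_heads / (n_heads + n_tails)
print(n_heads, n_tails, theta_hat)
```

Counting with `Counter` keeps the sketch close to the slide, which reports the two counts before forming the ratio.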
Probability of an outcome
Data: D a sequence of outcomes $x_i$ such that
• head: $x_i = 1$
• tail: $x_i = 0$
Model: probability of a head $\theta$, probability of a tail $(1 - \theta)$
Assume: we know the probability $\theta$
Probability of an outcome of a coin flip $x_i$ (Bernoulli distribution):
$$P(x_i \mid \theta) = \theta^{x_i} (1 - \theta)^{(1 - x_i)}$$
– Combines the probability of a head and a tail
– So that $x_i$ is going to pick its correct probability
– Gives $\theta$ for $x_i = 1$
– Gives $(1 - \theta)$ for $x_i = 0$
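As a small illustration of how the exponent $x_i$ selects the right probability, the Bernoulli formula can be written as a Python function. The function name `bernoulli_prob` is an assumption made for this sketch.

```python
def bernoulli_prob(x, theta):
    """P(x | theta) for a single coin flip, with x = 1 (head) or x = 0 (tail)."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    # theta ** 1 * (1 - theta) ** 0 = theta; theta ** 0 * (1 - theta) ** 1 = 1 - theta
    return theta ** x * (1 - theta) ** (1 - x)


print(bernoulli_prob(1, 0.3))  # probability of a head when theta = 0.3
print(bernoulli_prob(0, 0.3))  # probability of a tail when theta = 0.3
```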
Probability of a sequence of outcomes.
Data: D a sequence of outcomes $x_i$ such that
• head: $x_i = 1$
• tail: $x_i = 0$
Model: probability of a head $\theta$, probability of a tail $(1 - \theta)$
Assume: a sequence of independent coin flips D = H H T H T H, encoded as D = 110101
What is the probability of observing a data sequence D:
$$P(D \mid \theta) = \theta \, \theta \, (1 - \theta) \, \theta \, (1 - \theta) \, \theta$$
Can be rewritten using the Bernoulli distribution:
$$P(D \mid \theta) = \prod_{i=1}^{6} \theta^{x_i} (1 - \theta)^{(1 - x_i)}$$
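The product form translates directly into code. The sketch below assumes independent flips encoded as 1 (head) and 0 (tail); the value `theta = 0.5` is an arbitrary illustration, not a value from the lecture.

```python
import math


def sequence_likelihood(xs, theta):
    """P(D | theta) for independent flips xs encoded as 1 (head) / 0 (tail)."""
    return math.prod(theta ** x * (1 - theta) ** (1 - x) for x in xs)


D = [1, 1, 0, 1, 0, 1]  # H H T H T H
theta = 0.5             # illustrative value only
print(sequence_likelihood(D, theta))
```

Because each factor is either $\theta$ or $(1 - \theta)$, the result only depends on how many heads and tails the sequence contains, not on their order.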
The goodness of fit to the data.
Learning: we do not know the value of the parameter $\theta$
Our learning goal:
• Find the parameter $\theta$ that fits the data D the best
One solution to the “best”: Maximize the likelihood
$$P(D \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{(1 - x_i)}$$
Intuition:
• the more likely the data are given the model, the better the fit
Maximum likelihood (ML) estimate.
Maximum likelihood estimate
$$\theta_{ML} = \arg\max_{\theta} \; P(D \mid \theta, \xi)$$
Likelihood of data:
$$P(D \mid \theta, \xi) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{(1 - x_i)}$$
Optimize log-likelihood (the same as maximizing likelihood)
$$l(D, \theta) = \log P(D \mid \theta, \xi) = \log \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{(1 - x_i)} = \sum_{i=1}^{n} \left[ x_i \log \theta + (1 - x_i) \log (1 - \theta) \right]$$
$$= \log \theta \sum_{i=1}^{n} x_i + \log (1 - \theta) \sum_{i=1}^{n} (1 - x_i)$$
$N_1 = \sum_{i=1}^{n} x_i$ - number of heads seen
$N_2 = \sum_{i=1}^{n} (1 - x_i)$ - number of tails seen
Maximum likelihood (ML) estimate.
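Optimize log-likelihood
$$l(D, \theta) = N_1 \log \theta + N_2 \log (1 - \theta)$$
Set derivative to zero
$$\frac{\partial l(D, \theta)}{\partial \theta} = \frac{N_1}{\theta} - \frac{N_2}{1 - \theta} = 0$$
ML Solution:
$$\theta_{ML} = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$$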
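As a numerical sanity check of the closed-form solution, one can evaluate the log-likelihood on a grid and confirm that the maximizer agrees with $N_1 / (N_1 + N_2)$. This is a sketch only; the grid resolution of 0.001 and the counts (taken from the coin example) are choices made here for illustration.

```python
import math


def log_likelihood(theta, n1, n2):
    """l(D, theta) = N1 log(theta) + N2 log(1 - theta)."""
    return n1 * math.log(theta) + n2 * math.log(1 - theta)


n1, n2 = 15, 10            # heads and tails from the coin example
theta_ml = n1 / (n1 + n2)  # closed-form ML solution

# Grid over the open interval (0, 1); the endpoints are excluded because log(0) is undefined.
grid = [i / 1000 for i in range(1, 1000)]
best_on_grid = max(grid, key=lambda t: log_likelihood(t, n1, n2))
print(theta_ml, best_on_grid)
```

A grid search is of course not needed in practice; it is used here only to make the "set derivative to zero" step tangible.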
Maximum likelihood estimate. Example
• Assume the unknown and possibly biased coin
• Probability of the head is $\theta$
• Data:
H H T T H H T H T H T T T H T H H H H T H H H H T
– Heads: 15
– Tails: 10
What is the ML estimate of the probability of head and tail?
Head: $\theta_{ML} = \frac{N_1}{N} = \frac{15}{25} = 0.6$
Tail: $(1 - \theta_{ML}) = \frac{N_2}{N} = \frac{10}{25} = 0.4$
Maximum a posteriori estimate
Maximum a posteriori estimate
– Selects the mode of the posterior distribution
$$\theta_{MAP} = \arg\max_{\theta} \; p(\theta \mid D, \xi)$$
$$p(\theta \mid D, \xi) = \frac{P(D \mid \theta, \xi)\, p(\theta \mid \xi)}{P(D \mid \xi)} \quad \text{(via Bayes rule)}$$
– $P(D \mid \theta, \xi)$ is the likelihood of data:
$$P(D \mid \theta, \xi) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{(1 - x_i)} = \theta^{N_1} (1 - \theta)^{N_2}$$
– $p(\theta \mid \xi)$ is the prior probability on $\theta$
– $P(D \mid \xi)$ is the normalizing factor
How to choose the prior probability?
Prior distribution
Choice of prior: Beta distribution
$$p(\theta \mid \xi) = Beta(\theta \mid \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\, \Gamma(\alpha_2)} \, \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_2 - 1}$$
$\Gamma(x)$ - a Gamma function; for a positive integer n, $\Gamma(n) = (n - 1)!$
Why use the Beta distribution?
Beta distribution “fits” Bernoulli trials - conjugate choices
$$P(D \mid \theta, \xi) = \theta^{N_1} (1 - \theta)^{N_2}$$
Posterior distribution is again a Beta distribution
$$p(\theta \mid D, \xi) = \frac{P(D \mid \theta, \xi)\, Beta(\theta \mid \alpha_1, \alpha_2)}{P(D \mid \xi)} = Beta(\theta \mid \alpha_1 + N_1, \alpha_2 + N_2)$$
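The conjugacy claim can be checked numerically: if the posterior really is $Beta(\theta \mid \alpha_1 + N_1, \alpha_2 + N_2)$, then dividing it by likelihood times prior must give the same constant for every $\theta$. The sketch below assumes the coin counts and the Beta(5, 5) prior used elsewhere in the lecture; all function names are illustrative.

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(theta | a, b) using the Gamma-function normalizer."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * theta ** (a - 1) * (1 - theta) ** (b - 1)


def beta_posterior_params(alpha1, alpha2, n_heads, n_tails):
    """Conjugate update: prior parameters plus observed counts."""
    return alpha1 + n_heads, alpha2 + n_tails


def posterior_ratio(theta, alpha1, alpha2, n_heads, n_tails):
    """Posterior density divided by (likelihood * prior); should not depend on theta."""
    a, b = beta_posterior_params(alpha1, alpha2, n_heads, n_tails)
    likelihood = theta ** n_heads * (1 - theta) ** n_tails
    return beta_pdf(theta, a, b) / (likelihood * beta_pdf(theta, alpha1, alpha2))


print(beta_posterior_params(5, 5, 15, 10))
print(posterior_ratio(0.3, 5, 5, 15, 10), posterior_ratio(0.7, 5, 5, 15, 10))
```

The two printed ratios should agree up to floating-point error; their common value plays the role of $1 / P(D \mid \xi)$.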
Beta distribution
[Figure: Beta densities over the interval [0, 1] for (α=0.5, β=0.5), (α=2.5, β=2.5) and (α=2.5, β=5)]
Maximum a posteriori probability
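Maximum a posteriori estimate
– Selects the mode of the posterior distribution
$$p(\theta \mid D, \xi) = Beta(\theta \mid \alpha_1 + N_1, \alpha_2 + N_2)$$
MAP Solution:
$$\theta_{MAP} = \frac{\alpha_1 + N_1 - 1}{\alpha_1 + \alpha_2 + N_1 + N_2 - 2}$$
Notice that parameters of the prior act like counts of heads and tails
(sometimes they are also referred to as prior counts)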
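For completeness, here is a sketch of why the mode of the Beta posterior has this form. It follows the same "set derivative to zero" argument used for the ML estimate and assumes $\alpha_1 + N_1 > 1$ and $\alpha_2 + N_2 > 1$, so that the mode lies in the interior of $(0, 1)$.

```latex
\begin{aligned}
\log p(\theta \mid D, \xi)
  &= (\alpha_1 + N_1 - 1) \log \theta + (\alpha_2 + N_2 - 1) \log (1 - \theta) + \text{const} \\
\frac{\partial \log p(\theta \mid D, \xi)}{\partial \theta}
  &= \frac{\alpha_1 + N_1 - 1}{\theta} - \frac{\alpha_2 + N_2 - 1}{1 - \theta} = 0 \\
\theta_{MAP}
  &= \frac{\alpha_1 + N_1 - 1}{\alpha_1 + \alpha_2 + N_1 + N_2 - 2}
\end{aligned}
```

With $\alpha_1 = \alpha_2 = 1$ (a uniform prior) the expression reduces to the ML estimate $N_1 / (N_1 + N_2)$.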
MAP estimate example
• Assume the unknown and possibly biased coin
• Probability of the head is $\theta$
• Data:
H H T T H H T H T H T T T H T H H H H T H H H H T
– Heads: 15
– Tails: 10
• Assume $p(\theta \mid \xi) = Beta(\theta \mid 5, 5)$
What is the MAP estimate?
$$\theta_{MAP} = \frac{N_1 + \alpha_1 - 1}{N_1 + N_2 + \alpha_1 + \alpha_2 - 2} = \frac{15 + 5 - 1}{25 + 10 - 2} = \frac{19}{33}$$
MAP estimate example
• Note that the prior and data fit (data likelihood) are combined
• The MAP can be biased with large prior counts
• It is hard to overturn it with a smaller sample size
• Data:
H H T T H H T H T H T T T H T H H H H T H H H H T
– Heads: 15
– Tails: 10
• Assume $p(\theta \mid \xi) = Beta(\theta \mid 5, 5)$: $\theta_{MAP} = \frac{19}{33}$
• Assume $p(\theta \mid \xi) = Beta(\theta \mid 5, 20)$: $\theta_{MAP} = \frac{19}{48}$
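The two MAP values above follow from the mode of the Beta posterior, $(\alpha_1 + N_1 - 1) / (\alpha_1 + \alpha_2 + N_1 + N_2 - 2)$. A minimal sketch using exact fractions is shown below; the function name `beta_map` is an assumption of this example.

```python
from fractions import Fraction


def beta_map(alpha1, alpha2, n_heads, n_tails):
    """Mode of Beta(alpha1 + n_heads, alpha2 + n_tails), as an exact fraction."""
    a, b = alpha1 + n_heads, alpha2 + n_tails
    if a <= 1 or b <= 1:
        raise ValueError("interior mode needs both posterior parameters > 1")
    return Fraction(a - 1, a + b - 2)


print(beta_map(5, 5, 15, 10))   # prior Beta(5, 5)
print(beta_map(5, 20, 15, 10))  # prior Beta(5, 20)
```

With the Beta(5, 20) prior, the 20 prior "tail counts" outweigh the 10 observed tails, which is why the estimate drops below 0.5 even though heads dominate the data.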
Multinomial distribution
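Example: Multi-way coin toss, roll of dice
• Data: a set of N outcomes (multi-set)
– $N_i$ - a number of times an outcome i has been seen
Model parameters: $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)$ s.t. $\sum_{i=1}^{k} \theta_i = 1$
Probability of data (likelihood):
$$P(N_1, N_2, \ldots, N_k \mid \boldsymbol{\theta}, \xi) = \frac{N!}{N_1! \, N_2! \cdots N_k!} \, \theta_1^{N_1} \theta_2^{N_2} \cdots \theta_k^{N_k}$$
ML estimate:
$$\theta_{i, ML} = \frac{N_i}{N}$$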
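The ML estimate for the multinomial case is again a relative frequency, one per outcome. The sketch below uses a made-up list of die rolls (`rolls` is hypothetical data, not from the lecture) and an illustrative function name.

```python
from collections import Counter
from fractions import Fraction


def multinomial_ml(outcomes):
    """ML estimates theta_i = N_i / N for every outcome that was observed."""
    counts = Counter(outcomes)
    n = len(outcomes)
    return {outcome: Fraction(count, n) for outcome, count in counts.items()}


rolls = [1, 3, 3, 6, 2, 3, 6, 1, 5, 3]  # hypothetical die rolls
estimates = multinomial_ml(rolls)
print(estimates)
```

Note that an outcome that never appears (here the face 4) gets no entry, i.e. an implicit ML estimate of zero; this is one practical motivation for adding prior counts.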
MAP estimate
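Choice of prior: Dirichlet distribution
$$Dir(\boldsymbol{\theta} \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \cdots \theta_k^{\alpha_k - 1}$$
Dirichlet is the conjugate choice for multinomial
$$P(D \mid \boldsymbol{\theta}, \xi) = P(N_1, N_2, \ldots, N_k \mid \boldsymbol{\theta}, \xi) = \frac{N!}{N_1! \, N_2! \cdots N_k!} \, \theta_1^{N_1} \theta_2^{N_2} \cdots \theta_k^{N_k}$$
Posterior distribution
$$p(\boldsymbol{\theta} \mid D, \xi) = \frac{P(D \mid \boldsymbol{\theta}, \xi)\, Dir(\boldsymbol{\theta} \mid \alpha_1, \alpha_2, \ldots, \alpha_k)}{P(D \mid \xi)} = Dir(\boldsymbol{\theta} \mid \alpha_1 + N_1, \ldots, \alpha_k + N_k)$$
MAP estimate:
$$\theta_{i, MAP} = \frac{\alpha_i + N_i - 1}{\sum_{j=1}^{k} (\alpha_j + N_j) - k}$$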
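The Dirichlet MAP formula can be sketched in the same way as the Beta case. The die counts and the symmetric prior below are hypothetical values chosen for illustration, and `dirichlet_map` is a name introduced only for this example.

```python
from fractions import Fraction


def dirichlet_map(alphas, counts):
    """MAP estimate (mode) of the Dirichlet posterior Dir(alpha_i + N_i)."""
    posterior = [a + n for a, n in zip(alphas, counts)]
    if any(p <= 1 for p in posterior):
        raise ValueError("interior mode needs every posterior parameter > 1")
    denom = sum(posterior) - len(posterior)
    return [Fraction(p - 1, denom) for p in posterior]


counts = [2, 1, 4, 0, 1, 2]  # hypothetical counts for the six faces of a die
alphas = [2] * 6             # hypothetical symmetric prior
theta_map = dirichlet_map(alphas, counts)
print(theta_map)
```

For k = 2 the function reduces to the Beta case: `dirichlet_map([5, 5], [15, 10])` gives 19/33 for heads, matching the coin example.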
Learning complex distributions
• The problem of learning complex distributions
– can sometimes be reduced to the problem of learning a number of simpler distributions
• Such a decomposition occurs, for example, in Bayesian networks
– Builds upon independences encoded in the network
• Why learn BBNs?
– Large databases are available
Learning of BBN parameters
Learning. Two steps:
– Learning of the network structure
– Learning of parameters of conditional probabilities
• Variables:
– Observable – values present in every data sample
– Hidden – values are never in the sample
– Missing values – values sometimes present, sometimes not
• Here:
– learning parameters for the fixed graph structure
– All variables are observed in the dataset
Learning of BBN parameters. Example.
[Figure: Bayesian network over the variables Pneumonia, Paleness, Fever, Cough, High WBC]
Data D (different patient cases):
Pal  Fev  Cou  HWB  Pneu
T    T    T    T    F
T    F    F    F    F
F    F    T    T    T
F    F    T    F    T
F    T    T    T    T
T    F    T    F    F
F    F    F    F    F
T    T    F    F    F
T    T    T    T    T
F    T    F    T    T
T    F    F    T    F
F    T    F    F    F
Estimates of parameters of BBN
• Much like multiple coin tosses or rolls of a die
• A “smaller” learning problem corresponds to the learning of exactly one conditional distribution
• Example: $P(\text{Fever} \mid \text{Pneumonia} = T)$
• Problem: How to pick the data to learn?
Learning of BBN parameters. Example.
Data D (different patient cases): the same 12 cases over Pal, Fev, Cou, HWB, Pneu as listed earlier
How to estimate: $P(\text{Fever} \mid \text{Pneumonia} = T) = \; ?$
Learning of BBN parameters. Example.
Learn: $P(\text{Fever} \mid \text{Pneumonia} = T)$
Step 1: Select data points with Pneumonia=T (ignore the rest)
Pal  Fev  Cou  HWB  Pneu
F    F    T    T    T
F    F    T    F    T
F    T    T    T    T
T    T    T    T    T
F    T    F    T    T
Learning of BBN parameters. Example.
Learn: $P(\text{Fever} \mid \text{Pneumonia} = T)$
Step 2: Select values of the random variable defining the distribution of Fever (ignore the rest)
Fev
F
F
T
T
T
Learning of BBN parameters. Example.
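Learn: $P(\text{Fever} \mid \text{Pneumonia} = T)$
Step 3: Learning the ML estimate
Selected Fever values: F F T T T, so $N_T = 3$ and $N_F = 2$
$P(\text{Fever} = T \mid \text{Pneumonia} = T) = \frac{3}{5} = 0.6$, $P(\text{Fever} = F \mid \text{Pneumonia} = T) = \frac{2}{5} = 0.4$
Step 3: Learning the MAP estimate
Assume the prior $\theta_{\text{Fever} \mid \text{Pneumonia} = T} \sim Beta(3, 4)$
$P(\text{Fever} = T \mid \text{Pneumonia} = T) = \frac{3 + 3 - 1}{5 + 3 + 4 - 2} = 0.5$, $P(\text{Fever} = F \mid \text{Pneumonia} = T) = 0.5$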
Estimates of parameters of BBN
Much like multiple coin toss or roll of a die problems.
• A “smaller” learning problem corresponds to the learning of exactly one conditional distribution
Example: $P(\text{Fever} \mid \text{Pneumonia} = T)$
Problem: How to pick the data to learn?
Answer:
1. Select data points with Pneumonia=T (ignore the rest)
2. Focus on (select) only values of the random variable defining the distribution (Fever)
3. Learn the parameters of the conditional the same way as we learned the parameters of the biased coin or die
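The three steps in the answer can be sketched in Python on the 12 patient cases from the example. The data layout (a list of dictionaries) and all variable names are assumptions of this sketch; the Beta(3, 4) prior is the one assumed in the MAP step of the example, with the first parameter acting as the prior count for Fever = T.

```python
from fractions import Fraction

COLUMNS = ["Pal", "Fev", "Cou", "HWB", "Pneu"]
RAW = """
T T T T F
T F F F F
F F T T T
F F T F T
F T T T T
T F T F F
F F F F F
T T F F F
T T T T T
F T F T T
T F F T F
F T F F F
"""
data = [dict(zip(COLUMNS, line.split())) for line in RAW.strip().splitlines()]

# Step 1: select data points with Pneumonia = T (ignore the rest).
selected = [row for row in data if row["Pneu"] == "T"]

# Step 2: keep only the values of the variable defining the distribution (Fever).
fever = [row["Fev"] for row in selected]

# Step 3: estimate the parameter as for a biased coin.
n_true, n_false = fever.count("T"), fever.count("F")
ml_estimate = Fraction(n_true, n_true + n_false)

alpha_true, alpha_false = 3, 4  # Beta(3, 4) prior from the example
map_estimate = Fraction(alpha_true + n_true - 1,
                        alpha_true + alpha_false + n_true + n_false - 2)

print(fever, ml_estimate, map_estimate)
```

The same filtering pattern would apply to any other conditional distribution in the network: condition on the parent value, project onto the child variable, then count.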