Learning probability distributions

(1)

CS 2710 Foundations of AI

Lecture 24

Milos Hauskrecht [email protected] 5329 Sennott Square

Learning probability distributions

Announcements

Homework 10:

• due today

Homework 11:

• Due next week on Wednesday

Final exam:

• December 14, 2004 at 11:00am-1:00pm • Location: 5201 WWPH Posvar Hall • Closed book

(2)

Unsupervised learning

• Data:

a vector of attribute values – e.g. the description of a patient

– no specific target attribute we want to predict (no output y) • Objective:

– learn (describe) relations between attributes, examples

Types of problems:

• Clustering

Group together “similar” examples • Density estimation

– Model probabilistically the population of examples

} ,.., , {D₁ D₂ D_n D= i i D =x

Density estimation

Data: Attributes:

• modeled by random variables with: – Continuous values

– Discrete values

E.g. blood pressure with numerical values or chest pain with discrete values

[no-pain, mild, moderate, strong]

Underlying true probability distribution:

} ,.., , {D₁ D₂ D_n D= i i

D =x a vector of attribute values

(3)

Density estimation

Data:

Objective: try to estimate the underlying true probability distribution over variables , , using examples in D

Standard (iid) assumptions: Samples

• are independentof each other

• come from the same (identical) distribution (fixed )

} ,.., , {D₁ D₂ D_n D= i i

D =x a vector of attribute values

X

) (X

p D={D₁,D₂,..,D_n}

n samples

true distribution estimate

) ( ˆ X p ) (X p ) (X p

Learning via parameter estimation

In this lecture we consider parametric density estimation Basic settings:

• A set of random variables

• A model of the distributionover variables in X with parameters

• Data

Objective:find parameters that fit the data the best, or in other words reduce the misfit between the data and the model • What is the best set of parameters?

– There are various criteria one can apply here.

(4)

Parameter estimation. Basic criteria.

• Maximum likelihood (ML) criterion

• Maximum a posteriori probability (MAP) criterion

)

,

|

(

max

arg

Θ

ξ

Θ

p

D

ξ

- represents prior (background) knowledge

)

,

|

(

max

arg

p

Θ

D

ξ

Θ ) | ( ) | ( ) , | ( ) , | (

ξ

D p p D p D p Θ = Θ Θ

MAP selects the mode of the posterior

Likelihood of data

Posterior probability

Parameter estimation. Coin example.

Coin example:we have a coin that can be biased

Outcomes:two possible values -- head or tail

Data: D a sequence of outcomes such that

• head • tail

Model: probability of a head probability of a tail

Objective:

(5)

Parameter estimation. Example.

• Assumethe unknown and possibly biased coin • Probability of the head is

• Data:

H H T T H H T H T H T T T H T H H H H T H H H H T – Heads:15

– Tails:10

What would be your estimate of the probability of a head ?

θ

? ~ =

θ

Parameter estimation. Example

• Assumethe unknown and possibly biased coin • Probability of the head is

• Data:

– Tails:10

What would be your choice of the probability of a head ?

Solution: use frequencies of occurrences to do the estimate

This is the maximum likelihood estimateof the parameter

(6)

Probability of an outcome

Data: D a sequence of outcomes such that • head

• tail

Assume: we know the probability

Probability of an outcome of a coin flip

– Combines the probability of a head and a tail – So that is going to pick its correct probability – Gives for – Gives for ) 1 ( ) 1 ( ) | ( xi xi i x P

θ

=

θ

−

θ

−

θ

) 1 ( −θ 0 = i x 1 = i x i x Bernoulli distribution i x i x

θ

) 1 ( −θ x_i =0 1 = i x

θ

Probability of a sequence of outcomes.

• tail

Assume: a sequence of independent coin flips

D = H H T H T H (encoded as D= 110101)

(7)

Probability of a sequence of outcomes.

• tail

Assume: a sequence of coin flips D = H H T H T H encoded as D= 110101

What is the probability of observing a data sequence D:

θ

θθ

θ

) (1 ) (1 ) | (D = − − P

θ

) 1 ( −θ 0 = i x 1 = i x i x

Probability of a sequence of outcomes.

• tail

θ

θθ

θ

) (1 ) (1 ) | (D = − − P

θ

) 1 ( −θ 0 = i x 1 = i x i x

(8)

Probability of a sequence of outcomes.

• tail

Can be rewritten using the Bernoulli distribution:

θ

θθ

θ

) (1 ) (1 ) | (D = − − P

θ

) 1 ( −θ 0 = i x 1 = i x i x ) 1 ( 6 1 ) 1 ( ) | ( i xi i x D P − = − =

∏

θ

The goodness of fit to the data.

Learning: we do not know the value of the parameter

Our learning goal:

• Find the parameter that fits the data D the best?

One solution to the “best”:Maximize the likelihood

Intuition:

• more likely are the data given the model, the better is the fit

(9)

Maximum likelihood (ML) estimate.

Maximum likelihoodestimate

1

N - number of heads seen N2 - number of tails seen

) , | ( max arg

θ

ξ

θ

θ P D ML = Likelihood of data: ) 1 ( 1 ) 1 ( ) , | ( i xi n i x D P − = − =

∏

θ

ξ

θ

Optimize log-likelihood (the same as maximizing likelihood)

= − = = − =

∏

(1 ) 1 ) 1 ( log ) , | ( log ) , ( i xi n i x D P D l

θ

ξ

θ

) 1 ( ) 1 log( log ) 1 log( ) 1 ( log 1 1 1

∑

= = = − − + = − − + n i i n i i n i i i x x x x

θ

Maximum likelihood (ML) estimate.

2 1 1 1 N N N N N ML = = ₊ θ ML Solution: Optimize log-likelihood ) 1 log( log ) , (D

θ

=N₁

θ

+N₂ −

θ

l

Set derivative to zero

(10)

Maximum likelihood estimate. Example

• Assumethe unknown and possibly biased coin

• Probability of the head is • Data:

– Tails:10

What is the ML estimate of the probability of a head and a tail?

θ

Maximum likelihood estimate. Example

• Assume the unknown and possibly biased coin

• Probability of the head is • Data:

– Tails:10

What is the ML estimate of the probability of head and tail ?

(11)

Maximum a posteriori estimate

Maximum a posteriori estimate

– Selects the mode of the posterior distribution

How to choose the prior probability?

) , | ( max arg θ ξ θ θ p D MAP = ) | ( ) | ( ) , | ( ) , | ( ξ ξ θ ξ θ ξ θ D P p D P D

p = _{(via Bayes rule)}

) |

(

θ

ξ

p - is the prior probability on

θ

2 1(1 ) ) 1 ( ) , | ( (1 ) 1 N N x n i x_i _i D P

θ

ξ

=

θ

−

θ

− =

θ

−

θ

=

∏

prior Likelihood of data Normalizing factor

Prior distribution

) , | ( ) | ( ) , | ( ) , | ( ) , | ( 1 2 _Beta ₁ _N₁ ₂ _N₂ D P Beta D P D p = = θ α + α + ξ α α θ ξ θ ξ θ

Choice of prior: Beta distribution

Beta distribution “fits” Bernoulli trials - conjugate choices 1 1 2 1 2 1 2 1 1 (1 ) 2 ) ( ) ( ) ( ) , | ( ) | ( − − − Γ Γ + Γ = = _θα _θ α α α α α α α θ ξ θ Beta p

Why to use Beta distribution?

2 1(1 ) ) , | (_D N N P

θ

ξ

=

θ

−

θ

Posterior distribution is again a Beta distribution

) (x Γ - A Gamma function ! ) (x =x Γ

(12)

Beta distribution

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.5 1 1.5 2 2.5 3 3.5 α=0.5, β=0.5 α=2.5, β=2.5 α=2.5, β=5

Maximum a posterior probability

Maximum a posteriori estimate

– Selects the mode of the posterior distribution

Noticethat parameters of the prior act like counts of heads and tails

(sometimes they are also referred to as prior counts)

(13)

MAP estimate example

• Assume the unknown and possibly biased coin • Probability of the head is

• Data:

– Tails:10 • Assume

What is the MAP estimate?

θ

) 5 , 5 | ( ) | (θ ξ Betaθ p =

MAP estimate example

• Assume the unknown and possibly biased coin • Probability of the head is

• Data:

– Tails:10 • Assume

What is the MAP estimate ?

(14)

MAP estimate example

• Note that the prior and data fit (data likelihood) are combined • The MAP can be biased with large prior counts

• It is hard to overturn it with a smaller sample size • Data: H H T T H H T H T H T T T H T H H H H T H H H H T – Heads:15 – Tails:10 • Assume ) 20 , 5 | ( ) | (θ ξ Betaθ p = ) 5 , 5 | ( ) | (θ ξ Betaθ p = 33 19 = MAP θ 48 19 = MAP θ

Multinomial distribution

Example: Multi-way coin toss, roll of dice

• Data: a set of N outcomes (multi-set)

Model parameters:

Probability of data (likelihood)

ML estimate: N N_i ML i, =

θ

)

,

(

θ

₁

θ

₂

_K

θ

_k

=

θ

1 1 =

∑

= k i i

θ

s.t. i

N - a number of times an outcome i has been seen

(15)

MAP estimate

Choice of prior: Dirichlet distribution

) ,.., | ( ) | ( ) ,.. , | ( ) , | ( ) , | ( 1 2 ₁ ₁ k k k _Dir _N _N D P Dir D P D p = = α + α + ξ α α α ξ ξ θ θ θ θ

(

)

∑

= − + − + = k i i i i i MAP i k N N ,.. 1 , 1 α α θ MAP estimate: Posterior distribution 1 1 2 1 1 1 1 1 1 2 ) ( ) ( ) ,.., | ( − − − = =

∏

∑

Γ Γ = k k k i i k i i k Dir θα θα θα α α α α _K θ

Dirichlet is the conjugate choicefor multinomial

k N k N N k k N N N N N N N P D P ξ ξ θ θ Kθ K K 1 2 2 1 2 1 2 1 ! ! ! ! ) , | , , ( ) , | ( θ = θ =

Learning complex distributions

• The problem of learning complex distributions

–can be sometimes reduced to the problem of learning a number of simpler distributions

• Such a decomposition occurs for example in Bayesian networks

– Builds upon independences encoded in the network • Why learning of BBNs?

– Large databases are available

(16)

Learning of BBN parameters

Learning.Two steps:

– Learning of the network structure

– Learning of parameters of conditional probabilities • Variables:

– Observable – values present in every data sample – Hidden – values are never in the sample

– Missing values – values sometimes present, sometimes not

• Here:

– learning parameters for the fixed graph structure – All variables are observed in the dataset

Learning of BBN parameters. Example.

(17)

Learning of BBN parameters. Example.

Data D (different patient cases): Pal Fev Cou HWB Pneu

T T T T F T F F F F F F T T T F F T F T F T T T T T F T F F F F F F F T T F F F T T T T T F T F T T T F F T F F T F F F Pneumonia Cough Fever Paleness High WBC

Estimates of parameters of BBN

• Much like multiple coin tosses or rolls of a dice

• A “smaller” learning problem corresponds to the learning of exactly one conditional distribution

• Example:

• Problem: How to pick the data to learn?

) |

(Fever Pneumonia =T

(18)

Learning of BBN parameters. Example.

Data D (different patient cases): Pal Fev Cou HWB Pneu

T T T T F T F F F F F F T T T F F T F T F T T T T T F T F F F F F F F T T F F F T T T T T F T F T T T F F T F F T F F F Pneumonia Cough Fever Paleness High WBC ? ) | (Fever Pneumonia = T = P How to estimate:

Learning of BBN parameters. Example.

Learn:

Step 1: Select data points with Pneumonia=T

(19)

Learning of BBN parameters. Example.

Learn:

Step 1: Ignore the rest

Pal Fev Cou HWB Pneu

F F T T T F F T F T F T T T T T T T T T F T F T T ) | (Fever Pneumonia =T P Pneumonia Cough Fever Paleness High WBC

Learning of BBN parameters. Example.

Learn:

Step 2: Select values of the random variable defining the distribution of Fever

Pal Fev Cou HWB Pneu

(20)

Learning of BBN parameters. Example.

Learn:

Step 2: Ignore the rest

Fev F F T T T ) | (Fever Pneumonia =T P Pneumonia Cough Fever Paleness High WBC

Learning of BBN parameters. Example.

Learn:

Step 3: Learning the ML estimate

(21)

Learning of BBN parameters. Example.

Learn:

Step 3: Learning the MAP estimate

Assume the prior Fev F F T T T ) | (Fever Pneumonia =T P ) | (Fever Pneumonia =T P 0.5 0.5 T F Pneumonia Cough Fever Paleness High WBC ) 4 , 3 ( ~ |Pneumonia T Beta Fever =

θ

Estimates of parameters of BBN

Much like multiple coin toss or roll of a diceproblems. • A “smaller” learning problem corresponds to the learning of

exactly one conditional distribution

Example:

Problem: How to pick the data to learn?

Answer:

1. Select data points with Pneumonia=T (ignore the rest)

2. Focus on (select) only values of the random variable defining the distribution (Fever)

3. Learn the parameters of the conditional the same way as we learned the parameters of the biased coin or dice

) |

(Fever Pneumonia =T