Learning probability distributions


Academic year: 2021


CS 2710 Foundations of AI

Lecture 24

Milos Hauskrecht [email protected] 5329 Sennott Square

Learning probability distributions

Announcements

Homework 10:

• due today

Homework 11:

• Due next week on Wednesday

Final exam:

• December 14, 2004 at 11:00am-1:00pm

• Location: 5201 WWPH Posvar Hall

• Closed book


Unsupervised learning

• Data: $D = \{D_1, D_2, \ldots, D_n\}$

– $D_i = \mathbf{x}_i$ is a vector of attribute values, e.g. the description of a patient

– no specific target attribute we want to predict (no output y)

• Objective:

– learn (describe) relations between attributes, examples

Types of problems:

• Clustering

– Group together “similar” examples

• Density estimation

– Model probabilistically the population of examples

Density estimation

Data: $D = \{D_1, D_2, \ldots, D_n\}$, where $D_i = \mathbf{x}_i$ is a vector of attribute values

Attributes:

• modeled by random variables $\mathbf{X}$ with:

– Continuous values

– Discrete values

E.g. blood pressure with numerical values, or chest pain with discrete values [no-pain, mild, moderate, strong]

Underlying true probability distribution: $p(\mathbf{X})$


Density estimation

Data: $D = \{D_1, D_2, \ldots, D_n\}$, where $D_i = \mathbf{x}_i$ is a vector of attribute values

Objective: try to estimate the underlying true probability distribution $p(\mathbf{X})$ over variables $\mathbf{X}$, using examples in $D$

[Figure: true distribution $p(\mathbf{X})$ → $n$ samples $D = \{D_1, D_2, \ldots, D_n\}$ → estimate $\hat{p}(\mathbf{X})$]

Standard (iid) assumptions: Samples

• are independent of each other

• come from the same (identical) distribution (fixed $p(\mathbf{X})$)

Learning via parameter estimation

In this lecture we consider parametric density estimation.

Basic settings:

• A set of random variables $\mathbf{X}$

• A model of the distribution over variables in $\mathbf{X}$ with parameters $\Theta$

• Data $D = \{D_1, D_2, \ldots, D_n\}$

Objective: find the parameters $\Theta$ that fit the data best, or in other words reduce the misfit between the data and the model

• What is the best set of parameters?

– There are various criteria one can apply here.


Parameter estimation. Basic criteria.

• Maximum likelihood (ML) criterion

$$\arg\max_{\Theta} \; p(D \mid \Theta, \xi)$$

– $p(D \mid \Theta, \xi)$ is the likelihood of data

– $\xi$ represents prior (background) knowledge

• Maximum a posteriori probability (MAP) criterion

$$\arg\max_{\Theta} \; p(\Theta \mid D, \xi)$$

– $p(\Theta \mid D, \xi)$ is the posterior probability; MAP selects the mode of the posterior

$$p(\Theta \mid D, \xi) = \frac{p(D \mid \Theta, \xi)\, p(\Theta \mid \xi)}{p(D \mid \xi)}$$

Parameter estimation. Coin example.

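Coin example: we have a coin that can be biased

Outcomes: two possible values -- head or tail

Data: $D$, a sequence of outcomes $x_i$ such that

• head: $x_i = 1$

• tail: $x_i = 0$

Model: probability of a head $\theta$, probability of a tail $(1 - \theta)$

Objective: estimate the probability of a head $\theta$ from the data $D$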


Parameter estimation. Example

• Assume an unknown and possibly biased coin

• Probability of the head is $\theta$

• Data:

H H T T H H T H T H T T T H T H H H H T H H H H T

– Heads: 15

– Tails: 10

What would be your choice of the probability of a head?

Solution: use frequencies of occurrences to do the estimate

$$\tilde{\theta} = \frac{15}{25} = 0.6$$

This is the maximum likelihood estimate of the parameter $\theta$
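The frequency-based estimate can be checked directly on the recorded flips. The following is a minimal Python sketch; the names `DATA` and `frequency_estimate` are illustrative and do not come from the lecture.

```python
# Coin-flip record from the slide: H = head, T = tail.
DATA = "H H T T H H T H T H T T T H T H H H H T H H H H T".split()


def frequency_estimate(outcomes):
    """Return (heads, tails, fraction of heads) for a list of 'H'/'T' outcomes."""
    heads = outcomes.count("H")
    tails = outcomes.count("T")
    return heads, tails, heads / (heads + tails)


heads, tails, theta_hat = frequency_estimate(DATA)
print(heads, tails, theta_hat)  # 15 10 0.6
```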


Probability of an outcome

Data: $D$, a sequence of outcomes $x_i$ such that

• head: $x_i = 1$

• tail: $x_i = 0$

Model: probability of a head $\theta$, probability of a tail $(1 - \theta)$

Assume: we know the probability $\theta$

Probability of an outcome $x_i$ of a coin flip (Bernoulli distribution):

$$P(x_i \mid \theta) = \theta^{x_i} (1 - \theta)^{(1 - x_i)}$$

– Combines the probability of a head and a tail

– So that $x_i$ is going to pick its correct probability

– Gives $\theta$ for $x_i = 1$

– Gives $(1 - \theta)$ for $x_i = 0$

Probability of a sequence of outcomes.

Data: $D$, a sequence of outcomes $x_i$ such that

• head: $x_i = 1$

• tail: $x_i = 0$

Model: probability of a head $\theta$, probability of a tail $(1 - \theta)$

Assume: a sequence of independent coin flips

D = H H T H T H (encoded as D = 110101)


What is the probability of observing the data sequence D = H H T H T H (encoded as D = 110101)?

$$P(D \mid \theta) = \theta\,\theta\,(1 - \theta)\,\theta\,(1 - \theta)\,\theta$$

Can be rewritten using the Bernoulli distribution:

$$P(D \mid \theta) = \prod_{i=1}^{6} \theta^{x_i} (1 - \theta)^{(1 - x_i)}$$
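To make the product form concrete, here is a small Python sketch that evaluates $P(D \mid \theta)$ for the encoded sequence 110101 by multiplying one Bernoulli term per flip. The function name `sequence_likelihood` and the two trial values of $\theta$ are illustrative choices, not part of the lecture.

```python
def sequence_likelihood(xs, theta):
    """P(D | theta) for independent Bernoulli outcomes xs (1 = head, 0 = tail)."""
    prob = 1.0
    for x in xs:
        # theta ** x picks theta for a head, (1 - theta) ** (1 - x) picks 1 - theta for a tail.
        prob *= theta ** x * (1.0 - theta) ** (1 - x)
    return prob


D = [1, 1, 0, 1, 0, 1]  # H H T H T H

p_fair = sequence_likelihood(D, 0.5)       # equals 0.5 ** 6
p_biased = sequence_likelihood(D, 2 / 3)   # theta set to 4 heads / 6 flips
print(p_fair, p_biased)
```

For this sequence the value at $\theta = 2/3$ is larger than at $\theta = 0.5$, which is the comparison that likelihood-based learning relies on.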

The goodness of fit to the data.

Learning: we do not know the value of the parameter $\theta$

Our learning goal:

• Find the parameter $\theta$ that fits the data $D$ the best

One solution to the “best”: maximize the likelihood

$$P(D \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{(1 - x_i)}$$

Intuition:

• the more likely the data are given the model, the better the fit


Maximum likelihood (ML) estimate.

Likelihood of data:

$$P(D \mid \theta, \xi) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{(1 - x_i)}$$

Maximum likelihood estimate:

$$\theta_{ML} = \arg\max_{\theta} P(D \mid \theta, \xi)$$

Optimize log-likelihood (the same as maximizing likelihood):

$$l(D, \theta) = \log P(D \mid \theta, \xi) = \log \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{(1 - x_i)}$$

$$= \sum_{i=1}^{n} \left[ x_i \log \theta + (1 - x_i) \log (1 - \theta) \right] = \log \theta \sum_{i=1}^{n} x_i + \log (1 - \theta) \sum_{i=1}^{n} (1 - x_i)$$

$N_1 = \sum_{i=1}^{n} x_i$ - number of heads seen

$N_2 = \sum_{i=1}^{n} (1 - x_i)$ - number of tails seen

Maximum likelihood (ML) estimate.

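Optimize log-likelihood:

$$l(D, \theta) = N_1 \log \theta + N_2 \log (1 - \theta)$$

Set derivative to zero:

$$\frac{\partial l(D, \theta)}{\partial \theta} = \frac{N_1}{\theta} - \frac{N_2}{1 - \theta} = 0$$

ML solution:

$$\theta_{ML} = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$$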
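As a numerical sanity check on the closed form $\theta_{ML} = N_1 / (N_1 + N_2)$, the sketch below evaluates the log-likelihood $N_1 \log\theta + N_2 \log(1-\theta)$ on a grid and picks the maximizer. The grid search is only an illustration (it is not how the lecture derives the result), and `grid_argmax` with its `steps` parameter is a hypothetical helper.

```python
import math


def log_likelihood(theta, n_heads, n_tails):
    """l(D, theta) = N1 * log(theta) + N2 * log(1 - theta)."""
    return n_heads * math.log(theta) + n_tails * math.log(1.0 - theta)


def grid_argmax(n_heads, n_tails, steps=1000):
    """Return the theta on an open grid in (0, 1) with the largest log-likelihood."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda t: log_likelihood(t, n_heads, n_tails))


# Counts from the coin data used in this lecture: 15 heads, 10 tails.
theta_closed = 15 / (15 + 10)
theta_grid = grid_argmax(15, 10)
print(theta_closed, theta_grid)
```

The two values agree up to the grid resolution, as expected for a concave log-likelihood with a single maximum.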


Maximum likelihood estimate. Example

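• Assume an unknown and possibly biased coin

• Probability of the head is $\theta$

• Data:

H H T T H H T H T H T T T H T H H H H T H H H H T

– Heads: 15

– Tails: 10

What is the ML estimate of the probability of a head and a tail?

Head: $\theta_{ML} = \frac{N_1}{N_1 + N_2} = \frac{15}{25} = 0.6$

Tail: $(1 - \theta_{ML}) = \frac{N_2}{N_1 + N_2} = \frac{10}{25} = 0.4$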


Maximum a posteriori estimate

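Maximum a posteriori estimate

– Selects the mode of the posterior distribution

$$\theta_{MAP} = \arg\max_{\theta} \; p(\theta \mid D, \xi)$$

$$p(\theta \mid D, \xi) = \frac{P(D \mid \theta, \xi)\, p(\theta \mid \xi)}{P(D \mid \xi)} \quad \text{(via Bayes rule)}$$

– $P(D \mid \theta, \xi)$ is the likelihood of data:

$$P(D \mid \theta, \xi) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{(1 - x_i)} = \theta^{N_1} (1 - \theta)^{N_2}$$

– $p(\theta \mid \xi)$ is the prior probability on $\theta$

– $P(D \mid \xi)$ is the normalizing factor

How to choose the prior probability?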

Prior distribution

Choice of prior: Beta distribution

$$p(\theta \mid \xi) = Beta(\theta \mid \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\, \Gamma(\alpha_2)}\, \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_2 - 1}$$

$\Gamma(x)$ - a Gamma function; for a positive integer $n$, $\Gamma(n) = (n - 1)!$

Why use the Beta distribution?

Beta distribution “fits” Bernoulli trials - conjugate choices

$$P(D \mid \theta, \xi) = \theta^{N_1} (1 - \theta)^{N_2}$$

Posterior distribution is again a Beta distribution

$$p(\theta \mid D, \xi) = \frac{P(D \mid \theta, \xi)\, Beta(\theta \mid \alpha_1, \alpha_2)}{P(D \mid \xi)} = Beta(\theta \mid \alpha_1 + N_1, \alpha_2 + N_2)$$
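The conjugacy claim can be checked numerically: if the posterior is $Beta(\theta \mid \alpha_1 + N_1, \alpha_2 + N_2)$, then likelihood times prior divided by that Beta density must be the same constant, $P(D \mid \xi)$, for every $\theta$. The sketch below uses only the standard-library `math.gamma`; the prior counts (5, 5), the data counts (15, 10), and the helper names are illustrative assumptions.

```python
import math


def beta_pdf(theta, a1, a2):
    """Beta(theta | a1, a2) density written with the Gamma function."""
    norm = math.gamma(a1 + a2) / (math.gamma(a1) * math.gamma(a2))
    return norm * theta ** (a1 - 1) * (1.0 - theta) ** (a2 - 1)


def unnormalized_posterior(theta, n1, n2, a1, a2):
    """Likelihood theta^N1 (1 - theta)^N2 times the Beta prior."""
    return theta ** n1 * (1.0 - theta) ** n2 * beta_pdf(theta, a1, a2)


a1, a2, n1, n2 = 5, 5, 15, 10
# Ratio of (likelihood * prior) to the claimed posterior at several theta values.
ratios = [
    unnormalized_posterior(t, n1, n2, a1, a2) / beta_pdf(t, a1 + n1, a2 + n2)
    for t in (0.2, 0.5, 0.8)
]
print(ratios)  # three (numerically) equal values
```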


Beta distribution

[Figure: Beta densities over the interval [0, 1] for (α=0.5, β=0.5), (α=2.5, β=2.5), and (α=2.5, β=5)]

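Maximum a posteriori probability

Maximum a posteriori estimate

– Selects the mode of the posterior distribution

$$p(\theta \mid D, \xi) = Beta(\theta \mid \alpha_1 + N_1, \alpha_2 + N_2)$$

Notice that parameters of the prior act like counts of heads and tails (sometimes they are also referred to as prior counts)

MAP solution (the mode of the posterior, for $\alpha_1 + N_1 > 1$ and $\alpha_2 + N_2 > 1$):

$$\theta_{MAP} = \frac{\alpha_1 + N_1 - 1}{\alpha_1 + \alpha_2 + N_1 + N_2 - 2}$$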


MAP estimate example

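• Assume an unknown and possibly biased coin

• Probability of the head is $\theta$

• Data:

H H T T H H T H T H T T T H T H H H H T H H H H T

– Heads: 15

– Tails: 10

• Assume $p(\theta \mid \xi) = Beta(\theta \mid 5, 5)$

What is the MAP estimate?

$$\theta_{MAP} = \frac{\alpha_1 + N_1 - 1}{\alpha_1 + \alpha_2 + N_1 + N_2 - 2} = \frac{5 + 15 - 1}{5 + 5 + 15 + 10 - 2} = \frac{19}{33}$$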


MAP estimate example

• Note that the prior and data fit (data likelihood) are combined

• The MAP can be biased with large prior counts

• It is hard to overturn it with a smaller sample size

• Data:

H H T T H H T H T H T T T H T H H H H T H H H H T

– Heads: 15

– Tails: 10

• Assume

– $p(\theta \mid \xi) = Beta(\theta \mid 5, 5)$: $\theta_{MAP} = \frac{19}{33}$

– $p(\theta \mid \xi) = Beta(\theta \mid 5, 20)$: $\theta_{MAP} = \frac{19}{48}$
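The two MAP values can be reproduced with exact fractions. This is a minimal sketch that takes the MAP estimate to be the mode of the Beta posterior, $(\alpha_1 + N_1 - 1) / (\alpha_1 + \alpha_2 + N_1 + N_2 - 2)$; the function name `beta_map` is illustrative.

```python
from fractions import Fraction


def beta_map(n_heads, n_tails, a1, a2):
    """Mode of Beta(a1 + n_heads, a2 + n_tails).

    Assumes both updated parameters exceed 1, so the mode lies inside (0, 1).
    """
    return Fraction(a1 + n_heads - 1, a1 + a2 + n_heads + n_tails - 2)


print(beta_map(15, 10, 5, 5))    # 19/33
print(beta_map(15, 10, 5, 20))   # 19/48
print(Fraction(15, 25))          # 3/5, the ML estimate for comparison
```

With the Beta(5, 20) prior, the 20 prior "tail counts" outweigh the 10 observed tails, so the estimate drops well below the ML value of 0.6.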

Multinomial distribution

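Example: multi-way coin toss, roll of a die

• Data: a set of $N$ outcomes (multi-set)

– $N_i$ - the number of times an outcome $i$ has been seen

Model parameters: $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)$ s.t. $\sum_{i=1}^{k} \theta_i = 1$

Probability of data (likelihood):

$$P(N_1, N_2, \ldots, N_k \mid \boldsymbol{\theta}, \xi) = \frac{N!}{N_1!\, N_2! \cdots N_k!}\, \theta_1^{N_1} \theta_2^{N_2} \cdots \theta_k^{N_k}$$

ML estimate:

$$\theta_{i,ML} = \frac{N_i}{N}$$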
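The ML estimate $N_i / N$ amounts to counting. The sketch below applies it to a made-up list of die rolls; the `rolls` data and the name `multinomial_ml` are hypothetical and only serve the illustration.

```python
from collections import Counter


def multinomial_ml(outcomes):
    """ML estimate theta_i = N_i / N for every outcome that appears in the data."""
    counts = Counter(outcomes)
    n = len(outcomes)
    return {outcome: count / n for outcome, count in counts.items()}


rolls = [1, 3, 3, 6, 2, 3, 6, 1, 5, 3]  # hypothetical die rolls, N = 10
estimates = multinomial_ml(rolls)
print(estimates)
```

Face 4 never occurs in this sample, so its ML estimate is zero (it is simply absent from the dictionary). This is one practical reason to combine the counts with a prior.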


MAP estimate

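Choice of prior: Dirichlet distribution

$$Dir(\boldsymbol{\theta} \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \cdots \theta_k^{\alpha_k - 1}$$

Dirichlet is the conjugate choice for multinomial

$$P(D \mid \boldsymbol{\theta}, \xi) = P(N_1, N_2, \ldots, N_k \mid \boldsymbol{\theta}, \xi) = \frac{N!}{N_1!\, N_2! \cdots N_k!}\, \theta_1^{N_1} \theta_2^{N_2} \cdots \theta_k^{N_k}$$

Posterior distribution

$$p(\boldsymbol{\theta} \mid D, \xi) = \frac{P(D \mid \boldsymbol{\theta}, \xi)\, Dir(\boldsymbol{\theta} \mid \alpha_1, \alpha_2, \ldots, \alpha_k)}{P(D \mid \xi)} = Dir(\boldsymbol{\theta} \mid \alpha_1 + N_1, \ldots, \alpha_k + N_k)$$

MAP estimate:

$$\theta_{i,MAP} = \frac{\alpha_i + N_i - 1}{\sum_{j=1}^{k} (\alpha_j + N_j) - k}$$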
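A short sketch of the Dirichlet MAP formula, $(\alpha_i + N_i - 1) / (\sum_j (\alpha_j + N_j) - k)$, using exact fractions. The die counts and the uniform prior counts of 2 are invented for illustration, and `dirichlet_map` is a hypothetical helper name.

```python
from fractions import Fraction


def dirichlet_map(counts, alphas):
    """MAP estimate under a Dirichlet prior: the mode of Dir(alpha_i + N_i).

    Assumes every alpha_i + N_i exceeds 1 so that the mode is well defined.
    """
    k = len(counts)
    denom = sum(a + n for a, n in zip(alphas, counts)) - k
    return [Fraction(a + n - 1, denom) for a, n in zip(alphas, counts)]


counts = [2, 1, 4, 0, 1, 2]   # hypothetical counts for die faces 1..6
alphas = [2] * 6              # hypothetical prior counts
theta_map = dirichlet_map(counts, alphas)
print(theta_map)
```

Unlike the ML estimate, the face with zero observed rolls still gets a nonzero probability here. With k = 2 the same function reduces to the Beta case: `dirichlet_map([15, 10], [5, 5])[0]` gives 19/33.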

Learning complex distributions

• The problem of learning complex distributions can sometimes be reduced to the problem of learning a number of simpler distributions

• Such a decomposition occurs, for example, in Bayesian networks

– Builds upon independences encoded in the network

• Why learning of BBNs?

– Large databases are available


Learning of BBN parameters

Learning. Two steps:

– Learning of the network structure

– Learning of parameters of conditional probabilities

• Variables:

– Observable – values present in every data sample

– Hidden – values are never in the sample

– Missing values – values sometimes present, sometimes not

• Here:

– learning parameters for the fixed graph structure

– All variables are observed in the dataset


Learning of BBN parameters. Example.

Data D (different patient cases):

Pal  Fev  Cou  HWB  Pneu
T    T    T    T    F
T    F    F    F    F
F    F    T    T    T
F    F    T    F    T
F    T    T    T    T
T    F    T    F    F
F    F    F    F    F
T    T    F    F    F
T    T    T    T    T
F    T    F    T    T
T    F    F    T    F
F    T    F    F    F

[Figure: Bayesian network over the variables Pneumonia, Paleness (Pal), Fever (Fev), Cough (Cou), High WBC (HWB)]

Estimates of parameters of BBN

• Much like multiple coin tosses or rolls of a die

• A “smaller” learning problem corresponds to the learning of exactly one conditional distribution

• Example: $P(\text{Fever} \mid \text{Pneumonia} = T)$

• Problem: How to pick the data to learn?


Learning of BBN parameters. Example.

How to estimate $P(\text{Fever} \mid \text{Pneumonia} = T)$ from the patient cases in D?

Learning of BBN parameters. Example.

Learn: $P(\text{Fever} \mid \text{Pneumonia} = T)$

Step 1: Select data points with Pneumonia=T


Learning of BBN parameters. Example.

Learn: $P(\text{Fever} \mid \text{Pneumonia} = T)$

Step 1: Ignore the rest

Pal  Fev  Cou  HWB  Pneu
F    F    T    T    T
F    F    T    F    T
F    T    T    T    T
T    T    T    T    T
F    T    F    T    T

Learning of BBN parameters. Example.

Learn: $P(\text{Fever} \mid \text{Pneumonia} = T)$

Step 2: Select values of the random variable defining the distribution of Fever


Learning of BBN parameters. Example.

Learn: $P(\text{Fever} \mid \text{Pneumonia} = T)$

Step 2: Ignore the rest

Fev: F F T T T

Learning of BBN parameters. Example.

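Learn: $P(\text{Fever} \mid \text{Pneumonia} = T)$

Step 3: Learning the ML estimate

Fev: F F T T T (3 of the 5 selected cases have Fever=T)

$P(\text{Fever} = T \mid \text{Pneumonia} = T) = \frac{3}{5} = 0.6$, $P(\text{Fever} = F \mid \text{Pneumonia} = T) = \frac{2}{5} = 0.4$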


Learning of BBN parameters. Example.

Learn: $P(\text{Fever} \mid \text{Pneumonia} = T)$

Step 3: Learning the MAP estimate

Assume the prior $\theta_{\text{Fever} \mid \text{Pneumonia} = T} \sim Beta(3, 4)$

Fev: F F T T T

$P(\text{Fever} = T \mid \text{Pneumonia} = T) = 0.5$, $P(\text{Fever} = F \mid \text{Pneumonia} = T) = 0.5$

Estimates of parameters of BBN

• Much like multiple coin toss or roll of a die problems.

• A “smaller” learning problem corresponds to the learning of exactly one conditional distribution

Example: $P(\text{Fever} \mid \text{Pneumonia} = T)$

Problem: How to pick the data to learn?

Answer:

1. Select data points with Pneumonia=T (ignore the rest)

2. Focus on (select) only values of the random variable defining the distribution (Fever)

3. Learn the parameters of the conditional the same way as we learned the parameters of the biased coin or die
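The three steps can be sketched in Python on the twelve patient cases from the example. The table is transcribed from the slide; the helper `conditional_counts` is illustrative, and the reading of the Beta(3, 4) prior as 3 prior counts for Fever=T and 4 for Fever=F is an assumption (it is the reading that reproduces the 0.5 / 0.5 MAP values in the example).

```python
from fractions import Fraction

COLUMNS = ("Pal", "Fev", "Cou", "HWB", "Pneu")
TABLE = """
T T T T F
T F F F F
F F T T T
F F T F T
F T T T T
T F T F F
F F F F F
T T F F F
T T T T T
F T F T T
T F F T F
F T F F F
"""
cases = [dict(zip(COLUMNS, line.split())) for line in TABLE.strip().splitlines()]


def conditional_counts(cases, child, parent, parent_value):
    """Steps 1 and 2: keep rows with parent == parent_value, then count the child's values."""
    selected = [case[child] for case in cases if case[parent] == parent_value]
    return selected.count("T"), selected.count("F")


# Step 3: estimate P(Fever = T | Pneumonia = T) exactly like a biased coin.
n_true, n_false = conditional_counts(cases, "Fev", "Pneu", "T")
theta_ml = Fraction(n_true, n_true + n_false)

a_true, a_false = 3, 4  # assumed mapping of the Beta(3, 4) prior counts
theta_map = Fraction(a_true + n_true - 1, a_true + a_false + n_true + n_false - 2)

print(n_true, n_false, theta_ml, theta_map)  # 3 2 3/5 1/2
```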
