• No results found

A Tutorial on Inference and Learning in Bayesian Networks

N/A
N/A
Protected

Academic year: 2021

Share "A Tutorial on Inference and Learning in Bayesian Networks"

Copied!
43
0
0

Loading.... (view fulltext now)

Full text

(1)

A Tutorial on

Inference and Learning

in Bayesian Networks

Irina Rish

IBM T.J.Watson Research Center [email protected]

(2)

Outline

Motivation: learning probabilistic models from data

Representation: Bayesian network models

Probabilistic inference in Bayesian Networks

Exact inference

Approximate inference

Learning Bayesian Networks

Learning parameters

Learning graph structure (model selection)

(3)

Bayesian Networks

Structured, graphical representation of probabilistic

relationships between several random variables

Explicit representation of conditional independencies

Missing arcs encode conditional independence

Efficient representation of joint PDF P(X)

Generative model (not just discriminative): allows

arbitrary queries to be answered, e.g.

(4)

Bayesian Network:

=

P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)

lung Cancer

Smoking

X-ray

Bronchitis

Dyspnoea

P(D|C,B) P(B|S) P(S) P(X|C,S) P(C|S)

P(S, C, B, X, D)

CPD: C B D=0 D=1 0 0 0.1 0.9 0 1 0.7 0.3 1 0 0.8 0.2 1 1 0.9 0.1

Θ

)

(G,

BN

=

G

- directed acyclic graph (DAG) nodes – random variables edges – direct dependencies - set of parameters in all

conditional probability distributions (CPDs)

Θ

CPD of node X: P(X|parents(X))

Compact representation of joint distribution in a product form (chain rule):

ZY

ZY

ZY

ZY

\\\\

=

=

+

+

+

+

2

2

4

4

1

3

parameters

instead

of

2

1

(5)

Example: Printer Troubleshooting

Print Output OK Correct Driver Uncorrupted Driver Correct Printer Path Net Cable Connected Net/Local Printing Printer On and Online Correct Local Port Correct Printer Selected Local Cable Connected Application Output OK Print Spooling On Correct Driver Settings Printer Memory Adequate Network Up Spooled

Data OK GDI DataInput OK GDI Data Output OK Print Data OK PC to Printer Transport OK Printer Data OK Spool Process OK Net

Path OK Path OKLocal

Paper Loaded Local Disk Space Adequate [Heckerman, 95] 2 x 3 x2 3 2x2 1x2 17x1 99 get we parameters 2 of Instead variables 26 4 3 2 1 26 + + + + =

(6)

“Moral” graph of a BN

lung Cancer

Smoking

X-ray

Bronchitis

Dyspnoea

P(D|C,B) P(B|S) P(S) P(X|C,S) P(C|S)

Moralization algorithm:

1. Connect (“marry”) parents

of each node.

2. Drop the directionality of

the edges.

Resulting undirected graph is called the “moral” graph of BN

Interpretation:

every pair of nodes that occur together in a CPD is connected by an edge in the moral graph.

CPD for X and its k parents (called “family”) is represented by a clique of size

(k+1) in the moral graph, and contains probability parameters where

d is the number of values each variable can have (domain size).

d

(

d

1

)

(7)

Conditional Independence in BNs:

Three types of connections

Tuberculosis

Visit to Asia

Chest X-ray

Knowing T makes A and X independent (intermediate cause)

Lung Cancer

Smoking

Bronchitis

Knowing S makes L and B independent (common cause)

Dyspnoea

Lung Cancer

Bronchitis

A

T

X

S

L

B

Diverging

Serial

Converging

NOT knowing D or M makes L and B independent (common effect)

L

B

M

Running

Marathon

D

(8)

d-separation

Nodes X and Y are

d-separated

if on

any (undirected) path

between X and

Y there is some variable Z such that is either

Z is in a serial or diverging connection and Z is known, or

Z is in a converging connection and neither Z nor any of Z’s descendants are known

Nodes X and Y are called

d-connected

if they are not d-separated

(there exists an undirected path between X and Y not

d-separated by any node or a set of nodes)

If nodes X and Y are d-separated

by Z, then X and Y are

conditionally independent

given Z (see Pearl, 1988)

Z

X

Y

Y

X

Z

M

Z

Y

X

(9)

Independence Relations in BN

A variable (node) is conditionally independent of its

non-descendants given its parents

Lung Cancer

Smoking

Bronchitis

Dyspnoea

Chest X-ray

Given

Bronchitis

and

Lung Cancer,

Dyspnoea

is independent

of

X-ray

(but may depend

on

Running Marathon

)

Running

Marathon

(10)

Markov Blanket

A node is conditionally independent of ALL other nodes

given its

Markov blanket,

i.e. its

parents

,

children

, and

spouses

’’ (parents of common children)

(Proof left as a homework problem

)

Cancer

Smoking

Lung Tumor

Diet

Serum Calcium

Age

Gender

Exposure to Toxins

(11)

What are BNs useful for?

Diagnosis:

P(cause|

symptom

)=?

Medicine Bio-informatics Computer troubleshooting Stock market Text Classification Speech recognition

Prediction:

P(symptom|

cause

)=?

class

max

Classification:

P(class|

data

)

Decision-making (given a cost function)

1

C

C

2

symptom symptom cause

(12)

Application Examples

APRI system developed at AT&T Bell Labs

learns & uses Bayesian networks from data to identify customers

liable to default on bill payments

NASA Vista system

predict failures in propulsion systems

considers time criticality & suggests highest utility action

dynamically decide what information to show

(13)

Application Examples

Office Assistant in MS Office 97/ MS Office 95

Extension of Answer wizard

uses naïve Bayesian networks

help based on past experience (keyboard/mouse use) and task user is doing currently This is the “smiley face” you get in your MS Office applications

Microsoft Pregnancy and Child-Care

Available on MSN in Health section

Frequently occurring children’s symptoms are linked to expert modules that repeatedly ask parents relevant questions

Asks next best question based on provided information

(14)

Fault diagnosis using probes

Software or hardware components

Goal: finding most-likely diagnosis

1 X 4 T 1 T 2 T T3 2 X X 3 Efficiency (scalability) Missing data/noise: sensitivity analysis “Adaptive” probing: selecting “most-informative” probes on-line learning/model updates on-line diagnosis PPPP ›››› SSSS ›››› ££££ ŸŸŸŸ SSSS wOŸ wOŸwOŸ wOŸ ”ˆŸ ”ˆŸ ”ˆŸ ”ˆŸ GGGG ˆ™Ž ˆ™Žˆ™Ž ˆ™Ž ŸŸŸŸ SSSS OŸ OŸ OŸ OŸ XXXX •••• XXXX •••• ŸŸŸŸ SSSS ŸŸŸŸ QQQQ •••• QQQQ XXXX ____ :::: … … = … … ) Probe outcomes

Issues:

Pattern discovery, classification,

diagnosis and prediction

Pattern discovery, classification,

diagnosis and prediction

IBM’s systems management applications

Machine Learning for Systems @ Watson

(Hellerstein, Jayram, Rish (2000)) (Rish, Brodie, Ma (2001))

End-user transaction

recognition

5 R R3 R2 R1 R2 R5 2 R

Remote Procedure Calls (RPCs)

BUY?

SELL? OPEN_DB?SEARCH?

(15)

Probabilistic Inference Tasks

=

X/A a * k * 1

,...,

a

)

arg

max

P(

x

,

e)

(a

evidence)

|

x

P(X

)

BEL(X

i

=

i

=

i

Belief updating:

Finding most probable explanation (MPE)

Finding maximum a-posteriory hypothesis

Finding maximum-expected-utility (MEU) decision

e)

,

x

P(

max

arg

*

x

x

=

)

x

U(

e)

,

x

P(

max

arg

)

d

,...,

(d

X/D d * k * 1

=

variables hypothesis : X A function utility x variables decision : ) ( : U X D

(16)

Belief Updating Task: Example

lung Cancer

Smoking

X-ray

Bronchitis

Dyspnoea

(17)

Belief updating: find

P(X|evidence)

exp(w*))

O(n

Complexity: “Moral” graph S X D B C

P(s|d=1)

= b x c d 1, , ,

P(s)P(

c

|s)P(b|s)P(x|

c

,s)P(d|

c

,b)=

Variable Elimination

=1 d

x

P(s)

b

P(b|s)

)

,

,

,

(

s

d

b

x

f

c

P(c|s)P(x|c,s)P(d|c,b)

C B D X

Efficient inference:

variable orderings, conditioning, approximations

W*=4

”induced width”

(max induced clique size)

=

=

=

=

=

P(s,

d

1

)

1

)

P(d

1

)

d

P(s,

(18)

Variable elimination algorithms

(also called “bucket-elimination”)

Belief updating:

VE-algorithm

elim-bel

(Dechter 1996)

b

Elimination operator

P(a|e=0)

W*=4

”induced width”

(max clique size)

bucket B:

P(a)

P(c|a)

P(b|a) P(d|b,a) P(e|b,c)

bucket C:

bucket D:

bucket E:

bucket A:

e=0

B

C

D

E

A

e) (a, h D

(a)

h

E e) c, d, (a, h B

e)

d,

(a,

h

C

(19)

b

max

Elimination operator

MPE

probability

W*=4

”induced width”

(max clique size)

bucket B:

P(a)

P(c|a)

P(b|a) P(d|b,a) P(e|b,c)

bucket C:

bucket D:

bucket E:

bucket A:

e=0

B

C

D

E

A

e) (a, h D

(a)

h

E e) c, d, (a, h B

e)

d,

(a,

h

C

Finding

VE-algorithm

elim-mpe

(Dechter 1996)

)

x

P(

max

MPE

x

=

)

,

|

(

)

,

|

(

)

|

(

)

|

(

)

(

max

by

replaced

is

, , , ,

P

a

P

c

a

P

b

a

P

d

a

b

P

e

b

c

MPE

:

b c d e a

=

max

(20)

Generating the MPE-solution

C:

E:

P(b|a) P(d|b,a) P(e|b,c)

B:

D:

A:

P(a)

P(c|a)

e=0

hD(a, e) (a) hE

e)

c,

d,

(a,

h

B e) d, (a, hC (a) h P(a) max arg a' 1. E a ⋅ = 0 e' 2. = ) e' d, , (a' h max arg d' 3. C d = ) e' c, , d' , (a' h ) a' | P(c max arg c' 4. B c × × = ) c' b, | P(e' ) a' b, | P(d' ) a' | P(b max arg b' 5. b × × × =

)

e'

,

d'

,

c'

,

b'

,

(a'

Return

(21)

Complexity of VE-inference:

O

(

n

exp

(

w

o

*

))

). (denoted G graph the of the called is graph induced the of width The ordering. in the first the last to from node, each of neighbors earlier connecting y recursivel by obtained is ordering the along ' The nodes. all among (X) width maximum the is graph a of width The ). ( X to connected and ordering in the X preceding nodes of number the is ordering the along graph in X variable a of (X) width The * o o o o w width induced G' o G graph induced w w neighbors earlier o G w •Œ™Œ•ŠŒ •Œ™Œ•ŠŒ •Œ™Œ•ŠŒ •Œ™Œ•ŠŒ GGGG ‹œ™•Ž ‹œ™•Ž ‹œ™•Ž ‹œ™•Ž GGGG Š™Œˆ›Œ‹ Š™Œˆ›Œ‹ Š™Œˆ›Œ‹ Š™Œˆ›Œ‹ GGGG Š“˜œŒ Š“˜œŒ Š“˜œŒ Š“˜œŒ GGGG “ˆ™ŽŒš› “ˆ™ŽŒš› “ˆ™ŽŒš› “ˆ™ŽŒš› –G –G –G –G GGGG š¡Œ š¡Œ š¡Œ š¡Œ GGGG XXXX ž ž ž ž GG GG GG GG aaaa tŒˆ••Ž tŒˆ••Ž tŒˆ••Ž tŒˆ••Ž –––– GG GGGG GG –™‹Œ™•Ž –™‹Œ™•Ž–™‹Œ™•Ž –™‹Œ™•Ž GG GGGG GG ˆ“–•Ž ˆ“–•Ž ˆ“–•Ž ˆ“–•Ž GG GG GG GG Ž™ˆ— Ž™ˆ—Ž™ˆ— Ž™ˆ— GG GGGG GG ”–™ˆ“ ”–™ˆ“”–™ˆ“ ”–™ˆ“ –GG –GG –GG –GG GG GG GG GG ž‹› ž‹›ž‹› ž‹› GG GGGG GG •‹œŠŒ‹ •‹œŠŒ‹•‹œŠŒ‹ •‹œŠŒ‹ GG GG GG GG ›Œ ›Œ ›Œ ›Œ ž ž ž ž 3333 ```` 3333 ```` = + −

Ordering is important! But finding min-w* ordering is NP- hard… Inference is also NP-hard in general case [Cooper].

4

* 1

=

o

w

*

2

2

=

o

w

“Moral” graph A D E C B B C D E A E D C B A

(22)

Learning Bayesian Networks

Incremental

learning: P(H) or

S

C

Learning

causal

relationships:

Efficient

representation

and

inference

Handling

missing data:

<1.3 2.8

??

0 1 >

<9.7 0.6 8 14 18> <0.2 1.3 5 ?? ??> <1.3 2.8 ?? 0 1 > <?? 5.6 0 10 ??>

……….

Combining

domain expert

(23)

Learning tasks: four main cases

Known graph

C

S

B

D

X

Complete data:

parameter estimation (ML, MAP)

Incomplete data:

non-linear parametric

optimization (gradient descent, EM)

P(S) P(B|S) P(X|C,S) P(C|S) P(D|C,B)

– learn parameters

C

S

B

D

X

)

ˆ

arg

max

Score(G

G

G

=

C

S

B

D

X

Unknown graph

Complete data: optimization (search in space of graphs) Incomplete data: structural EM, mixture models

(24)

Learning Parameters: complete data

(overview)

ML-estimate:

max

log

(

|

Θ

)

Θ

P

D

- decomposable!

MAP-estimate

(

Bayesian statistics)

max

Θ

log

P

(

D

|

Θ

)

P

(

Θ

)

Conjugate

priors -

Dirichlet

(

|

1,

,...,

,

)

X X X m

Dir

θ

pa

α

pa

α

pa

X

C

B

X

Pa

)

|

(

, X x

x

P

X

pa

pa

=

θ

Multinomial

) ML( , , ,

= x x x x X X X N N pa pa pa θ counts

)

MAP(

, , , , ,

+

+

=

x x x x x x x X X X X X

N

N

pa pa pa pa pa

α

α

θ

Equivalent sample size

(25)

Learning Parameters

(26)
(27)
(28)
(29)
(30)
(31)

Learning Parameters: incomplete data

E

M

-algorithm:

iterate until convergence

Initial parameters

Current model

)

(G,

Θ

Non-decomposable

marginal likelihood (hidden nodes)

S X D C B <? 0 1 0 1> <1 1 ? 0 1> <0 0 0 ? ?> <? ?………0 ? 1> Data Expected counts

Expectation

Compute EXPECTED

Counts via inference in BN

Update parameters (ML, MAP)

Maximization

)

,

,

|

,

(

]

[

1 , ) (

G

y

x

p

N

E

k N k x x X P X

Θ

=

=

aX

aX

aX

aX

pa

(32)

Complete data – local computations Incomplete data (score

non-decomposable):stochastic methods

Local greedy search; K2 algorithm

Learning graph structure

NP-hard

optimization

Heuristic search:

G

max

arg

Find

G

ˆ

=

Score(G

)

C

S

B

C

S

B

Add S->B

C

S

B

Delete S->B

C

S

B

Reverse S->B

Constrained-based

methods (PC/IC algorithms)

Data impose independence

relations (constraints) on graph

structure

(33)

Scoring function:

Minimum Description Length (MDL)

Learning

data compression

Other: MDL = -BIC (Bayesian Information Criterion)

Bayesian score (BDe) - asymptotically equivalent to MDL

|

|

2

log

)

,

|

(

log

)

|

(

BN

D

=

P

D

Θ

G

+

N

Θ

MDL

DL(Model)

DL(Data|model)

<9.7 0.6 8 14 18> <0.2 1.3 5 ?? ??> <1.3 2.8 ?? 0 1 > <?? 5.6 0 10 ??> ……….

(34)

Model selection trade-offs

class) | P(f1 P(f2|class) P(fn |class) 1 f

feature featuref 2 featuref n Class

Naïve Bayes – too simple

(less parameters, but bad model)

class) |

P(f1 P(f2|class) P(fn|class) 1

f

feature featuref 2 featuref n Class

Unrestricted BN – too complex

(possible overfitting + complexity)

Various approximations between the two extremes

class) |

P(f1 P(f2|class) P(fn|class) 1

f

feature featuref 2 featuref n Class

TAN:

tree-augmented Naïve Bayes [Friedman et al. 1997]

Based on Chow-Liu Tree Method (CL) for learning trees

(35)

Tree-structured distributions

C

A

B

E

D

A joint probability distribution is tree-structured if it can be written as

=

=

n i i j i

x

x

P

P

1 ) (

)

|

(

)

(

ŸŸŸŸ

Not a tree – has an (undirected) cycle

C

A

B

E

D

A tree (with root A)

P(A,B,C,D,E)= P(A)P(B|A)P(C|A) P(D|C)P(E|B) tree) directed (a P(x) for network Bayesian in of parent the is where xj(i) xi

A tree requires only [(d-1) + d(d-1)(n-1)] parameters, where d is domain size Moreover, inference in trees is O(n) (linear) since their w*=1

(36)

Approximations by trees

C

A

B

E

D

C

A

B

E

D

True distribution P(X) Tree-approximation P’(X)

How good is approximation? Use cross-entropy (KL-divergence):

= ŸŸŸŸ ŸŸŸŸ ŸŸŸŸ ŸŸŸŸ ) ( ' ) ( log ) ( ) ' , ( P P P P P P D

D(P,P’) is non-negative, and D(P,P’)=0 if and only if P coincides with P’ (on a set of measure 1)

(37)

Optimal trees: Chow-Liu result

Lemma

Given a joint PDF P(x) and a fixed tree structure T, the best approximation P’(x) (i.e., P’(x) that minimizes D(P,P’) ) satisfies

Such P’(x) is called the projection of P(x) on T. Theorem [Chow and Liu, 1968]

Given a joint PDF P(x), the KL-divergence D(P,P’) is minimized by projecting P(x) on a maximum-weight spanning tree (MSWT) over nodes in X, where the weight on the edge is defined by the mutual information measure

,...,n

i

x

x

P

x

x

P

'

(

i

|

j(i)

)

=

(

i

|

j(i)

)

for

all

=

1

)

,

(

X

i

X

j

=

j i x x i j j i j i j i

x

P

x

P

x

x

P

x

x

P

X

X

I

,

(

)

(

)

)

,

(

log

)

,

(

)

;

(

)) ( ) ( ), , ( ( ) ; ( that

(38)

Proofs

). , ( weights edge of sum the maximizing by minimized

is last two termsareindependent of thechoice of the tree,and thus ( , ') The ). ( ) ( log ) ( ) , ( ) ( ) ( log ) ( ) ( ) ( log , ) ( )] ( / ) ( log[ , ) ' , ( yields (1) expression in the ) | ( ) | ( ' Replacing Lemma. the proves which , ) minimized is ) ' , ( total the thus (and ) | ( ) | ( ' choosing by maximized is ) | ( ' log | term the , and of any value for Therefore, P(x). (x) P' choice by the achieved is ) ( ' log of maximum the P(x), given : fact known A (2) ) ( ) | ( ' log | ) ( (1) ) ( ) | ( ' log , ) ( ) | ( ' log ) ( ) | ( ' log ) ( log ) | ( ' log ) ' , ( ) ( 1 1 ) ( 1 ( ) ) ( , ) ( 1 ) ( ) ( , ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( 1 ) ( ) ( ) ( 1 ) ( , ) ( 1 ) ( 1 ) ( 1 ) ( ) ( ) ( ) ( ) ( i j i n i n i x i i i j i n i i i i j i j i x x i j i n i i j i j i x x i j i i j i i j i i j i i j i i j i x i j i i j x n i i j i x x i j i i j n i i j i x x i j i n i i j i n i i j i n i i j i X X I D P P X H x P x P X X I X H x P x P x P x x P ) x P(x X H x P x x P ) x P(x P P D x x P x x P P P D x x P x x P x x P ) x P(x x i x P P(x) X H x x P ) x P(x x P X H x x P ) x P(x X H x x P ) P( X H x x P ) P( P ) P( x x P ) P( P P D i i j i i j i i i i j i i j i

∑∑

∑ ∑

∑ ∑

∑ ∑

∑ ∑

∑∑

= = = = = = = = = − − − = = −         + − = = − − = = = = − − = = − − = − − = = − − = + − = : Theorem of Proof : Lemma of Proof Ÿ ŸŸ Ÿ Ÿ ŸŸ Ÿ Ÿ ŸŸ Ÿ Ÿ ŸŸ Ÿ ŸŸŸŸ ŸŸŸŸ ŸŸŸŸ ŸŸŸŸ ŸŸŸŸ

(39)

Chow-Liu algorithm

[As presented in Pearl, 1988]

1. From the given distribution P(x) (or from data generated by P(x)), compute the joint distribution

2. Using the pairwise distributions from step 1, compute the mutual information for each pair of nodes and assign it as the weight to the corresponding edge .

3. Compute the maximum-weight spanning tree (MSWT):

a. Start from the empty tree over n variables

b. Insert the two largest-weight edges

c. Find the next largest-weight edge and add it to the tree if no cycle is

formed; otherwise, discard the edge and repeat this step.

d. Repeat step (c) until n-1 edges have been selected (a tree is

constructed).

4. Select an arbitrary root node, and direct the edges outwards from the root.

5. Tree approximation P’(x) can be computed as a projection of P(x) on the resulting directed tree (using the product-form of P’(x)).

j

i

x

x

P

(

i

|

j

)

for

all

)

;

(

X

i

X

j

I

)

,

(

X

i

X

j

(40)

Summary:

Learning and inference in BNs

Bayesian Networks

– graphical probabilistic models

Efficient

representation

and

inference

Expert

knowledge

+ learning from

data

Learning:

parameters

(parameter estimation, EM)

structure

(optimization w/

score

functions – e.g.,

MDL

)

Complexity trade-off:

NB, BNs and trees

There is much more: causality, modeling time (DBNs, HMMs),

approximate inference, on-line learning, active learning, etc.

(41)

Online/print resources on BNs

Conferences & Journals

UAI, ICML, AAAI, AISTAT, KDD

MLJ, DM&KD, JAIR, IEEE KDD, IJAR, IEEE PAMI

Books and Papers

Bayesian Networks without Tears by Eugene Charniak. AI

Magazine: Winter 1991.

Probabilistic Reasoning in Intelligent Systems by Judea Pearl.

Morgan Kaufmann: 1998.

Probabilistic Reasoning in Expert Systems by Richard

Neapolitan. Wiley: 1990.

CACM special issue on Real-world applications of BNs, March

1995

(42)

Online/Print Resources on BNs

AUAI online:

www.auai.org

. Links to:

Electronic proceedings for UAI conferences

Other sites with information on BNs and reasoning under

uncertainty

Several tutorials and important articles

Research groups & companies working in this area

Other societies, mailing lists and conferences

(43)

Publicly available s/w for BNs

List of BN software maintained by Russell Almond at

bayes.stat.washington.edu/almond/belief.html

several free packages: generally research only

commercial packages: most powerful (& expensive) is

HUGIN; others include Netica and Dxpress

we are working on developing a Java based BN toolkit here at

Watson

References

Related documents