Inference in FOL: Propositionalization
§ Convert (KB ∧ ¬α) to PL, use a PL SAT solver to check (un)satisfiability
§ Trick: replace variables with ground terms, convert atomic sentences to symbols
§ ∀x Knows(x,Obama) and Democrat(Feinstein)
§ Knows(Obama,Obama) and Knows(Feinstein,Obama) and Democrat(Feinstein)
§ Knows_Obama_Obama ∧ Knows_Feinstein_Obama ∧ Democrat_Feinstein
§ and ∀x Knows(Mother(x),x)
§ Knows(Mother(Obama),Obama) and Knows(Mother(Mother(Obama)),Mother(Obama)) …
§ Real trick: for k = 1 to infinity, use all possible terms of function nesting depth k
§ If entailed, will find a contradiction for some finite k (Herbrand); if not, may continue forever: semidecidable
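The depth-by-depth enumeration of ground terms can be sketched as follows. This is a minimal illustration, not a full propositionalizer: the helper name `ground_terms` and the string encoding of terms are assumptions made for the example.

```python
from itertools import product

def ground_terms(constants, functions, max_depth):
    """Return all ground terms with function nesting depth <= max_depth.

    `functions` maps a function symbol to its arity, e.g. {"Mother": 1}.
    Terms are encoded as plain strings (an assumption of this sketch).
    """
    terms = set(constants)                      # depth 0: the constants
    for _ in range(max_depth):
        new_terms = set(terms)
        for name, arity in functions.items():
            for args in product(sorted(terms), repeat=arity):
                new_terms.add(f"{name}({','.join(args)})")
        terms = new_terms
    return terms

# Ground instances of  forall x Knows(Mother(x), x)  using terms up to depth 1.
depth1 = ground_terms({"Obama"}, {"Mother": 1}, 1)
instances = [f"Knows(Mother({t}),{t})"
             for t in sorted(depth1, key=lambda t: (len(t), t))]
```

Each additional depth level adds more terms, which is why the loop over k has no fixed upper bound when function symbols are present.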
Inference in FOL: Lifted inference
§ Apply inference rules directly to first-order sentences, e.g.,
§ KB = Person(Socrates), ∀x Person(x) ⇒ Mortal(x)
§ conclude Mortal(Socrates)
§ The general rule is a version of Modus Ponens:
§ Given α ⇒ β and α′, where α′σ = ασ for some substitution σ, conclude βσ
§ σ is {x/Socrates}
§ Given Knows(x,Obama) and Knows(y,z) ⇒ Likes(y,z)
§ σ is {y/x, z/Obama}, conclude Likes(x,Obama)
§ Examples: Prolog (backward chaining), Datalog (forward chaining), production rule systems (forward chaining), resolution theorem provers
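The lifted Modus Ponens step can be sketched in a few lines. This is a simplified illustration under stated assumptions: atoms are tuples such as ("Knows", "x", "Obama"), lowercase arguments are variables, capitalized arguments are constants, there are no nested function terms, and the rule and fact are assumed to use different variable names. The function names are hypothetical.

```python
def is_var(term):
    # Convention assumed for this sketch: variables start with a lowercase letter.
    return term[0].islower()

def walk(term, subst):
    """Follow variable bindings until reaching a constant or an unbound variable."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term

def unify(a, b):
    """Unify two flat atoms; return a substitution dict, or None on failure."""
    if a[0] != b[0] or len(a) != len(b):
        return None
    subst = {}
    for s, t in zip(a[1:], b[1:]):
        s, t = walk(s, subst), walk(t, subst)
        if s == t:
            continue
        if is_var(s):
            subst[s] = t
        elif is_var(t):
            subst[t] = s
        else:
            return None          # two different constants
    return subst

def substitute(atom, subst):
    return (atom[0],) + tuple(walk(t, subst) for t in atom[1:])

def modus_ponens(premise, conclusion, fact):
    """Given premise => conclusion and a fact, return the instantiated conclusion."""
    subst = unify(premise, fact)
    return None if subst is None else substitute(conclusion, subst)

mortal = modus_ponens(("Person", "x"), ("Mortal", "x"), ("Person", "Socrates"))
likes = modus_ponens(("Knows", "y", "z"), ("Likes", "y", "z"), ("Knows", "x", "Obama"))
```

The second call reproduces the substitution {y/x, z/Obama} from the slide; a full unifier would also handle nested terms and an occurs check.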
Summary, pointers
§ FOL is a very expressive formal language
§ Many domains of common-sense and technical knowledge can be written in FOL (see AIMA Ch. 10)
§ circuits, software, planning, laws, taxes, network and security protocols, product descriptions, ecommerce transactions, geographical information systems, Google Knowledge Graph, Semantic Web, etc.
§ Inference is semidecidable in general; many problems are efficiently solvable in practice
§ Inference technology for logic programming is especially efficient (see AIMA Ch. 9)
CS 188: Artificial Intelligence Probability
Instructors: Stuart Russell and Dawn Song University of California, Berkeley
Uncertainty
§ The real world is rife with uncertainty!
§ E.g., if I leave for SFO 60 minutes before my flight, will I be there in time?
§ Problems:
§ partial observability (road state, other drivers’ plans, etc.)
§ noisy sensors (radio traffic reports, Google maps)
§ immense complexity of modelling and predicting traffic, security line, etc.
§ lack of knowledge of world dynamics (will tire burst? need COVID test?)
§ Probabilistic assertions summarize effects of ignorance and laziness
§ Combine probability theory + utility theory -> decision theory
§ Maximize expected utility: a* = argmax_a Σ_s P(s | a) U(s)
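As a rough illustration of the expected-utility formula, the sketch below picks the best departure time for the airport example. All probabilities and utilities are made-up numbers chosen for the example, not values from the lecture.

```python
def expected_utility(action, outcome_model, utility):
    """Sum over outcomes s of P(s | action) * U(s)."""
    return sum(p * utility[s] for s, p in outcome_model[action].items())

# Hypothetical P(s | a): chance of catching the flight for each departure time.
outcome_model = {
    "leave_60_min_early": {"on_time": 0.70, "missed": 0.30},
    "leave_120_min_early": {"on_time": 0.95, "missed": 0.05},
}
# Hypothetical utilities U(s).
utility = {"on_time": 100, "missed": -500}

best_action = max(outcome_model,
                  key=lambda a: expected_utility(a, outcome_model, utility))
```

With these numbers the expected utilities are about -80 and 70, so the later-buffer action wins; a more realistic utility would also charge for time spent waiting at the airport.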
Basic laws of probability
§ Begin with a set Ω of possible worlds
§ E.g., 6 possible rolls of a die, {1, 2, 3, 4, 5, 6}
§ A probability model assigns a number P(ω) to each world ω
§ E.g., P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.
§ These numbers must satisfy
§ 0 ≤ P(ω) ≤ 1
§ Σ_{ω ∈ Ω} P(ω) = 1
Basic laws contd.
§ An event is any subset of Ω
§ E.g., “roll < 4” is the set {1,2,3}
§ E.g., “roll is odd” is the set {1,3,5}
§ The probability of an event is the sum of probabilities over its worlds
§ P(A) = Σ_{ω ∈ A} P(ω)
§ E.g., P(roll < 4) = P(1) + P(2) + P(3) = 1/2
§ De Finetti (1931): anyone who bets according to probabilities that violate these laws can be forced to lose money on every set of bets
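The die model and the event rule can be checked directly. This sketch uses exact fractions; the names `P`, `prob`, `roll_lt_4` and `roll_odd` are chosen for the example.

```python
from fractions import Fraction

# Probability model: one number per possible world (fair six-sided die).
P = {w: Fraction(1, 6) for w in range(1, 7)}

# The two basic laws: each P(w) lies in [0, 1], and the values sum to 1.
laws_hold = all(0 <= p <= 1 for p in P.values()) and sum(P.values()) == 1

def prob(event):
    """Probability of an event (a set of worlds) = sum over its worlds."""
    return sum(P[w] for w in event)

roll_lt_4 = {w for w in P if w < 4}        # {1, 2, 3}
roll_odd = {w for w in P if w % 2 == 1}    # {1, 3, 5}
p_either = prob(roll_lt_4 | roll_odd)      # union of the two events: {1, 2, 3, 5}
```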
Random Variables
§ A random variable (usually denoted by a capital letter) is some aspect of the world about which we may be uncertain
§ Formally a deterministic function of ω
§ The range of a random variable is the set of possible values
§ Odd = Is the die roll an odd number? → {true, false}
§ e.g. Odd(1) = true, Odd(6) = false
§ often write the event Odd=true as odd, Odd=false as ¬odd
§ T = Is it hot or cold? → {hot, cold}
§ D = How long will it take to get to the airport? → [0, ∞)
§ LGhost = Where is the ghost? → {(0,0), (0,1), …}
§ The probability distribution of a random variable X gives the probability for each value x in its range (probability of the event X=x)
§ P(X=x) = Σ_{ω: X(ω)=x} P(ω)
§ P(x) for short (when unambiguous)
§ P(X) refers to the entire distribution (think of it as a vector or table)
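A random variable as a deterministic function of the world, and its distribution as a table, might look like this in code. The function name `distribution` is an assumption of the sketch.

```python
from collections import defaultdict
from fractions import Fraction

P = {w: Fraction(1, 6) for w in range(1, 7)}   # fair die, as before

def Odd(w):
    """Random variable: a deterministic function of the world w."""
    return w % 2 == 1

def distribution(X):
    """P(X=x) for every x: add up P(w) over the worlds w with X(w) = x."""
    table = defaultdict(Fraction)              # Fraction() is 0
    for w, p in P.items():
        table[X(w)] += p
    return dict(table)

P_Odd = distribution(Odd)   # the whole table P(Odd)
```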
Probability Distributions
§ Associate a probability with each value; sums to 1
§ Temperature: P(T)
T        P
hot      0.5
cold     0.5
§ Weather: P(W)
W        P
sun      0.6
rain     0.1
fog      0.3
meteor   0.0
§ Joint distribution: P(T,W)
W        T=hot   T=cold
sun      0.45    0.15
rain     0.02    0.08
fog      0.03    0.27
meteor   0.00    0.00
Making possible worlds
§ In many cases we
§ begin with random variables and their domains
§ construct possible worlds as assignments of values to all variables
§ E.g., two dice rolls Roll1 and Roll2
§ How many possible worlds?
§ What are their probabilities?
§ Size of distribution for n variables with range size d? d^n
§ For all but the smallest distributions, cannot write out by hand!
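Building the possible worlds from variables and their domains is a Cartesian product. The sketch below assumes two fair, independent dice, so every world gets the same probability; that uniformity is an assumption of the example.

```python
from fractions import Fraction
from itertools import product

domains = {"Roll1": range(1, 7), "Roll2": range(1, 7)}

# One possible world per assignment of a value to every variable.
worlds = [dict(zip(domains, values)) for values in product(*domains.values())]

# Assumed model: fair, independent dice, so all worlds are equally likely.
P = {tuple(w.values()): Fraction(1, len(worlds)) for w in worlds}

n, d = len(domains), 6
table_size = d ** n     # size of the joint table for n variables of range size d
```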
Probabilities of events
§ Recall that the probability of an event is the sum of probabilities of its worlds:
§ P(A) = Σ_{ω ∈ A} P(ω)
§ So, given a joint distribution over all variables, can compute any event probability!
§ Probability that it’s hot AND sunny?
§ Probability that it’s hot?
§ Probability that it’s hot OR not foggy?
§ Joint distribution: P(T,W)
W        T=hot   T=cold
sun      0.45    0.15
rain     0.02    0.08
fog      0.03    0.27
meteor   0.00    0.00
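The three questions can be answered by summing the matching entries of the joint table. A minimal sketch, with events written as predicates over (temperature, weather) pairs:

```python
# Joint distribution P(T, W) from the table above.
joint = {
    ("hot", "sun"): 0.45, ("cold", "sun"): 0.15,
    ("hot", "rain"): 0.02, ("cold", "rain"): 0.08,
    ("hot", "fog"): 0.03, ("cold", "fog"): 0.27,
    ("hot", "meteor"): 0.00, ("cold", "meteor"): 0.00,
}

def prob(event):
    """Sum the probabilities of all worlds (t, w) in which the event holds."""
    return sum(p for (t, w), p in joint.items() if event(t, w))

p_hot_and_sunny = prob(lambda t, w: t == "hot" and w == "sun")      # 0.45
p_hot = prob(lambda t, w: t == "hot")                               # about 0.50
p_hot_or_not_foggy = prob(lambda t, w: t == "hot" or w != "fog")    # about 0.73
```

The last query covers every world except (cold, fog), so it equals 1 - 0.27.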
Marginal Distributions
§ Marginal distributions are sub-tables which eliminate variables
§ Marginalization (summing out): Collapse a dimension by adding
P(X=x) = Σ_y P(X=x, Y=y)
§ Joint distribution P(T,W), with the marginals P(W) in the last column and P(T) in the last row:
W        T=hot   T=cold   P(W)
sun      0.45    0.15     0.60
rain     0.02    0.08     0.10
fog      0.03    0.27     0.30
meteor   0.00    0.00     0.00
P(T)     0.50    0.50
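Summing out a variable is a short loop over the joint table. In this sketch the joint is keyed by (temperature, weather) tuples, and `marginal` is a name chosen for the example.

```python
from collections import defaultdict

joint = {
    ("hot", "sun"): 0.45, ("cold", "sun"): 0.15,
    ("hot", "rain"): 0.02, ("cold", "rain"): 0.08,
    ("hot", "fog"): 0.03, ("cold", "fog"): 0.27,
    ("hot", "meteor"): 0.00, ("cold", "meteor"): 0.00,
}

def marginal(joint, index):
    """Keep the variable at position `index` of each key; sum out the rest."""
    table = defaultdict(float)
    for values, p in joint.items():
        table[values[index]] += p
    return dict(table)

P_T = marginal(joint, 0)   # collapses the weather dimension
P_W = marginal(joint, 1)   # collapses the temperature dimension
```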
Conditional Probabilities
§ A simple relation between joint and conditional probabilities
§ In fact, this is taken as the definition of a conditional probability
P(a | b) = P(a, b) / P(b)
§ E.g., using the joint distribution P(T,W):
P(W=s | T=c) = P(W=s,T=c) / P(T=c) = 0.15/0.50 = 0.3
where P(T=c) = P(W=s,T=c) + P(W=r,T=c) + P(W=f,T=c) + P(W=m,T=c)
= 0.15 + 0.08 + 0.27 + 0.00 = 0.50
Conditional Distributions
§ Distributions for one set of variables given another set
§ Joint distribution: P(T,W)
W        T=hot   T=cold
sun      0.45    0.15
rain     0.02    0.08
fog      0.03    0.27
meteor   0.00    0.00
§ Conditional distributions: P(W | T)
W        P(W | T=hot)   P(W | T=cold)
sun      0.90           0.30
rain     0.04           0.16
fog      0.06           0.54
meteor   0.00           0.00
Normalizing a distribution
§ (Dictionary) To bring or restore to a normal condition
§ Here: all entries sum to ONE
§ Procedure:
§ Multiply each entry by α = 1/(sum over all entries)
§ P(W | T=c) = P(W,T=c)/P(T=c) = α P(W,T=c)
§ E.g., select the T=c column of P(W,T), then normalize with α = 1/0.50 = 2:
W        P(W,T=c)   P(W | T=c)
sun      0.15       0.30
rain     0.08       0.16
fog      0.27       0.54
meteor   0.00       0.00
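The select-then-normalize procedure can be sketched as below; `normalize` is a helper name chosen for the example, and it assumes the selected entries do not all equal zero.

```python
joint = {
    ("hot", "sun"): 0.45, ("cold", "sun"): 0.15,
    ("hot", "rain"): 0.02, ("cold", "rain"): 0.08,
    ("hot", "fog"): 0.03, ("cold", "fog"): 0.27,
    ("hot", "meteor"): 0.00, ("cold", "meteor"): 0.00,
}

def normalize(table):
    """Multiply each entry by alpha = 1 / (sum over all entries)."""
    alpha = 1.0 / sum(table.values())   # assumes a nonzero sum
    return {key: alpha * p for key, p in table.items()}

# Select the entries consistent with T = cold: this is P(W, T=c).
selected = {w: p for (t, w), p in joint.items() if t == "cold"}

# Normalizing them gives the conditional distribution P(W | T=c).
P_W_given_cold = normalize(selected)
```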
The Product Rule
§ Sometimes have conditional distributions but want the joint
P(a | b) = P(a, b) / P(b)   ⇔   P(a | b) P(b) = P(a, b)
The Product Rule: Example
§ P(W | T) P(T) = P(W, T)
§ P(T):
T        P
hot      0.5
cold     0.5
§ P(W | T):
W        P(W | T=hot)   P(W | T=cold)
sun      0.90           0.30
rain     0.04           0.16
fog      0.06           0.54
meteor   0.00           0.00
§ P(W, T):
W        T=hot   T=cold
sun      0.45    0.15
rain     0.02    0.08
fog      0.03    0.27
meteor   0.00    0.00
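The product rule example can be reproduced by multiplying each conditional entry by the matching prior. A minimal sketch using the numbers from the tables:

```python
P_T = {"hot": 0.5, "cold": 0.5}

# P(W | T), one conditional distribution per value of T.
P_W_given_T = {
    "hot": {"sun": 0.90, "rain": 0.04, "fog": 0.06, "meteor": 0.00},
    "cold": {"sun": 0.30, "rain": 0.16, "fog": 0.54, "meteor": 0.00},
}

# Product rule: P(w, t) = P(w | t) P(t) for every pair (t, w).
joint = {
    (t, w): P_W_given_T[t][w] * P_T[t]
    for t in P_T
    for w in P_W_given_T[t]
}
```

The resulting entries match the joint table P(W, T) up to floating-point rounding.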
The Chain Rule
§ A joint distribution can be written as a product of conditional distributions by repeated application of the product rule:
§ P(x_1, x_2, x_3) = P(x_3 | x_1, x_2) P(x_1, x_2) = P(x_3 | x_1, x_2) P(x_2 | x_1) P(x_1)
§ P(x_1, x_2, …, x_n) = Π_i P(x_i | x_1, …, x_{i-1})
Probabilistic Inference
§ Probabilistic inference: compute a desired probability from a probability model
§ Typically for a query variable given evidence
§ E.g., P(airport on time | no accidents) = 0.90
§ These represent the agent’s beliefs given the evidence
§ Probabilities change with new evidence:
§ P(airport on time | no accidents, 5 a.m.) = 0.95
§ P(airport on time | no accidents, 5 a.m., raining) = 0.80
§ Observing new evidence causes beliefs to be updated
Inference by Enumeration
§ Probability model P(X1, …, Xn) is given
§ Partition the variables X1, …, Xn into sets as follows:
§ Evidence variables: E = e
§ Query variables: Q
§ Hidden variables: H
§ We want: P(Q | e)
§ Step 1: Select the entries consistent with the evidence
§ Step 2: Sum out H from model to get joint of query and evidence: P(Q, e) = Σ_h P(Q, h, e)
§ Step 3: Normalize: P(Q | e) = α P(Q, e)
Inference by Enumeration
§ P(W)?
Season Temp Weather P
summer hot sun 0.35
summer hot rain 0.01
summer hot fog 0.01
summer hot meteor 0.00
summer cold sun 0.10
summer cold rain 0.05
summer cold fog 0.09
summer cold meteor 0.00
winter hot sun 0.10
winter hot rain 0.01
winter hot fog 0.02
winter hot meteor 0.00
winter cold sun 0.15
winter cold rain 0.20
winter cold fog 0.18
winter cold meteor 0.00
§ P(W | winter)?
Inference by Enumeration
§ Obvious problems:
§ Worst-case time complexity O(d^n) (exponential in #hidden variables)
§ Space complexity O(d^n) to store the joint distribution
§ O(d^n) data points to estimate the entries in the joint distribution
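The three steps of inference by enumeration (select, sum out, normalize) fit in one small function. This is a sketch, demonstrated on the two-variable joint P(T, W) from earlier; the function name and the tuple-keyed table layout are assumptions of the example, and the evidence is assumed to have nonzero probability.

```python
def enumerate_query(joint, variables, query, evidence):
    """Return P(query | evidence) from a full joint table.

    joint:     dict mapping value tuples (ordered as `variables`) to probabilities
    evidence:  dict mapping variable name -> observed value
    """
    q = variables.index(query)
    unnormalized = {}
    for values, p in joint.items():
        # Step 1: keep only the entries consistent with the evidence.
        if all(values[variables.index(v)] == val for v, val in evidence.items()):
            # Step 2: adding up all kept entries per query value sums out
            # the hidden variables.
            unnormalized[values[q]] = unnormalized.get(values[q], 0.0) + p
    # Step 3: normalize.
    alpha = 1.0 / sum(unnormalized.values())
    return {val: alpha * p for val, p in unnormalized.items()}

variables = ["T", "W"]
joint = {
    ("hot", "sun"): 0.45, ("cold", "sun"): 0.15,
    ("hot", "rain"): 0.02, ("cold", "rain"): 0.08,
    ("hot", "fog"): 0.03, ("cold", "fog"): 0.27,
    ("hot", "meteor"): 0.00, ("cold", "meteor"): 0.00,
}

P_T = enumerate_query(joint, variables, "T", {})                  # W is hidden
P_W_given_cold = enumerate_query(joint, variables, "W", {"T": "cold"})
```

The loop visits every entry of the joint table, which is exactly the exponential cost listed above.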
Bayes’ Rule
§ Write the product rule both ways:
P(a | b) P(b) = P(a, b) = P(b | a) P(a)
§ Dividing left and right expressions, we get:
P(a | b) = P(b | a) P(a) / P(b)
§ Why is this at all helpful?
§ Lets us build one conditional from its reverse
§ Often one conditional is tricky but the other one is simple
§ Describes an “update” step from prior P(a) to posterior P(a | b)
§ Foundation of many systems we’ll see later (e.g. ASR, MT)
§ In the running for most important AI equation!
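As a small worked example of the update from prior to posterior, the sketch below computes P(T | W=sun) from P(W=sun | T) and P(T), using the temperature and weather numbers from the earlier tables.

```python
prior = {"hot": 0.5, "cold": 0.5}             # P(T)
likelihood_sun = {"hot": 0.90, "cold": 0.30}  # P(W=sun | T)

# Numerator of Bayes' rule for each value of T: P(sun | t) P(t).
unnormalized = {t: likelihood_sun[t] * prior[t] for t in prior}

# Denominator P(W=sun), obtained by summing the numerators over t.
p_sun = sum(unnormalized.values())

# Posterior P(T | W=sun).
posterior = {t: v / p_sun for t, v in unnormalized.items()}
```

Here P(W=sun) comes out as 0.60 and the posterior as roughly 0.75 for hot and 0.25 for cold, which agrees with reading 0.45/0.60 and 0.15/0.60 off the joint table.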
Independence
§ Two variables X and Y are (absolutely) independent if
∀x,y P(x, y) = P(x) P(y)
§ I.e., the joint distribution factors into a product of two simpler distributions
§ Equivalently, via the product rule P(x,y) = P(x|y)P(y), P(x | y) = P(x) or P(y | x) = P(y)
§ Example: two dice rolls Roll1 and Roll2
§ P(Roll1=5, Roll2=3) = P(Roll1=5) P(Roll2=3) = 1/6 × 1/6 = 1/36
§ P(Roll2=3 | Roll1=5) = P(Roll2=3)
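The definition can be tested numerically on a two-variable joint table. The sketch below compares every joint entry with the product of its marginals, up to a small tolerance; `is_independent` and the tolerance value are choices made for the example, and the dice are assumed fair.

```python
from collections import defaultdict
from itertools import product

def is_independent(joint, tol=1e-9):
    """Check P(x, y) = P(x) P(y) for all x, y in a two-variable joint table."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return all(abs(joint.get((x, y), 0.0) - px[x] * py[y]) <= tol
               for x in list(px) for y in list(py))

# Two fair dice: every pair has probability 1/36.
dice = {(r1, r2): 1 / 36 for r1, r2 in product(range(1, 7), repeat=2)}

# Temperature and weather, from the joint table P(T, W).
weather = {
    ("hot", "sun"): 0.45, ("cold", "sun"): 0.15,
    ("hot", "rain"): 0.02, ("cold", "rain"): 0.08,
    ("hot", "fog"): 0.03, ("cold", "fog"): 0.27,
    ("hot", "meteor"): 0.00, ("cold", "meteor"): 0.00,
}

dice_independent = is_independent(dice)
weather_independent = is_independent(weather)
```

The dice pass the check, while temperature and weather do not: P(hot, sun) = 0.45 but P(hot) P(sun) = 0.5 × 0.6 = 0.30.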
Example: Independence
§ n fair, independent coin flips:
§ P(X1), P(X2), …, P(Xn): each is the same two-entry table
X      P
H      0.5
T      0.5
§ P(X1, X2, ..., Xn): joint table with 2^n entries, represented by the n small tables above
Conditional Independence
Ghostbusters
§ A ghost is in the grid somewhere
§ Sensor readings tell how close a square is to the ghost
§ On the ghost: usually red
§ 1 or 2 away: mostly orange
§ 3 or 4 away: typically yellow
§ 5+ away: often green
§ Click on squares until confident of location, then “bust”
[Video: Demo Ghostbusters with Probability]
Ghostbusters model
§ Variables and ranges:
§ G (ghost location) in {(1,1),…,(3,3)}
§ Cx,y (color measured at square x,y) in {red,orange,yellow,green}
§ Ghostbuster physics:
§ Uniform prior distribution over ghost location: P(G)
§ Sensor model: P(Cx,y | G) (depends only on distance to G)
§ E.g. P(C1,1 = yellow | G = (1,1) ) = 0.1
§ Prior grid: P(G = g) = 1/9 ≈ 0.11 for each of the 9 squares
Ghostbusters model, contd.
§ P(G, C1,1, … C3,3) has 9 × 4^9 = 2,359,296 entries!!!
§ Ghostbuster independence:
§ Are C1,1 and C1,2 independent?
§ E.g., does P(C1,1= yellow) = P(C1,1 = yellow | C1,2 = orange) ?
§ Ghostbuster physics again:
§ P(Cx,y | G) depends only on distance to G
§ So P(C1,1 = yellow | G = (2,3) ) = P(C1,1 = yellow | G = (2,3), C1,2= orange)
§ I.e., C1,1 is conditionally independent of C1,2 given G
Ghostbusters model, contd.
§ Apply the chain rule to decompose the joint probability model:
§ P(G, C1,1 , … C3,3) = P(G) P(C1,1 | G) P(C1,2 | G, C1,1) P(C1,3 | G, C1,1, C1,2) … P(C3,3 | G, C1,1, …, C3,2)
§ Now simplify using conditional independence:
§ P(G, C1,1 , … C3,3) = P(G) P(C1,1 | G) P(C1,2 | G) P(C1,3 | G) … P(C3,3 | G)
§ I.e., conditional independence properties of ghostbuster physics simplify the probability model from exponential to quadratic in the number of squares
§ This is called a Naïve Bayes model:
§ One discrete query variable (often called the class or category variable)
§ All other variables are (potentially) evidence variables
§ Evidence variables are all conditionally independent given the query variable
[Diagram: G is the parent of C1,1, C1,2, …, C3,3]
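A Naïve Bayes posterior over the ghost location can be sketched as follows. The sensor probabilities below are hypothetical (only the 0.1 for yellow at distance 0 comes from the slides), and Manhattan distance is an assumption of the example.

```python
from itertools import product

SQUARES = list(product(range(1, 4), repeat=2))      # (1,1) .. (3,3)
COLORS = ["red", "orange", "yellow", "green"]

# Hypothetical sensor model P(color | distance band); each row sums to 1.
SENSOR = {
    0: {"red": 0.70, "orange": 0.15, "yellow": 0.10, "green": 0.05},  # on the ghost
    1: {"red": 0.15, "orange": 0.60, "yellow": 0.20, "green": 0.05},  # 1 or 2 away
    2: {"red": 0.05, "orange": 0.15, "yellow": 0.60, "green": 0.20},  # 3 or 4 away
}

def sensor(color, square, ghost):
    """P(C_square = color | G = ghost); depends only on the distance to the ghost."""
    d = abs(square[0] - ghost[0]) + abs(square[1] - ghost[1])   # Manhattan (assumed)
    band = 0 if d == 0 else 1 if d <= 2 else 2
    return SENSOR[band][color]

def posterior(readings):
    """P(G | readings) with a uniform prior; readings maps square -> color."""
    scores = {}
    for g in SQUARES:
        p = 1.0 / len(SQUARES)                 # P(G = g)
        for square, color in readings.items():
            p *= sensor(color, square, g)      # times P(C_square | G = g)
        scores[g] = p
    alpha = 1.0 / sum(scores.values())
    return {g: alpha * p for g, p in scores.items()}

# Table sizes: full joint vs. P(G) plus one P(C_xy | G) table per square.
full_joint_entries = len(SQUARES) * len(COLORS) ** len(SQUARES)
factored_entries = len(SQUARES) + len(SQUARES) * len(SQUARES) * len(COLORS)

post = posterior({(1, 1): "red"})
```

Because each reading contributes one factor P(C | G), adding evidence costs one multiplication per candidate ghost location rather than a pass over the full joint table.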
Conditional Independence
§ Conditional independence is our most basic and robust form of knowledge about uncertain environments.
§ X is conditionally independent of Y given Z if and only if:
∀x,y,z P(x | y, z) = P(x | z)
or, equivalently, if and only if
∀x,y,z P(x, y | z) = P(x | z) P(y | z)
Conditional Independence
§ What about this domain:
§ Traffic
§ Umbrella
§ Raining
Conditional Independence
§ What about this domain:
§ Fire
§ Smoke
§ Alarm