Probabilistic Model
Probability: Definitions & Notations
Probability is a numerical measure of the likelihood of a given event.
number of desirable outcomes / total number of possible outcomes
P(A) denotes the probability that event A will occur
1 - P(A) : probability that A will not occur
0.00 ≤ P(A) ≤ 1.00
The probability that any event A will occur varies from 0 to 1.
0 = never occurs, 0.5 = occurs half of the time, 1 = always occurs
$$P(A^c) = P(\bar{A}) = 1 - P(A)$$
Probability: Definitions & Notations
P(A or B) = P(A ∪ B): The probability that either A or B will occur
Mutually Exclusive events
Events A and B cannot occur simultaneously
e.g. A = coin toss turns up a head B = coin toss turns up a tail
Not Mutually Exclusive events
Events A and B can occur simultaneously
e.g. A = a card drawn is an ace B = a card drawn is a heart
P(A and B) = P(A ∩ B): The probability that both A and B will occur
Independent Events
The occurrence of A is independent of the occurrence of B
e.g. A = the first coin toss turns up a head B = the second coin toss turns up a tail
Dependent Events
The occurrence of A is not independent of the occurrence of B
e.g. A = the first card drawn is an ace
B = the second card drawn (without replacement) is an ace
P(B|A): The probability that B will occur given A has occurred.
Probability: Examples
P(A or B) = P(A ∪ B): The probability that either A or B will occur
Mutually Exclusive events
Events A and B cannot occur simultaneously
P(A ∪ B) = P(A) + P(B)
What is the probability of rolling a die and getting 1 or 6?
A = The die rolls a 1
B = The die rolls a 6
P(A) = 1/6
P(B) = 1/6
P(A ∪ B) = 1/6 + 1/6 = 1/3
Not Mutually Exclusive events
Events A and B can occur simultaneously
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
What is the probability that a card drawn from a deck is an ace or spade?
A = The card is an ace
B = The card is a spade
P(A) = 4/52
P(B) = 13/52
P(A ∩ B) = 1/52
P(A ∪ B) = 4/52 + 13/52 - 1/52 = 16/52 = 4/13
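The two addition rules can be checked by counting outcomes directly. The following is a minimal Python sketch, not part of the original slides; the helper name `prob` and the way the deck is encoded are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def prob(event, sample_space):
    """P(event) = favorable outcomes / all outcomes (equally likely outcomes assumed)."""
    outcomes = list(sample_space)
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))

# Mutually exclusive events: a die cannot show 1 and 6 at once.
die = range(1, 7)
p_1_or_6 = prob(lambda x: x == 1, die) + prob(lambda x: x == 6, die)

# Not mutually exclusive events: the ace of spades is both an ace and a spade.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))

def is_ace(card):
    return card[0] == "A"

def is_spade(card):
    return card[1] == "spades"

p_ace_or_spade = (
    prob(is_ace, deck)
    + prob(is_spade, deck)
    - prob(lambda c: is_ace(c) and is_spade(c), deck)
)

print(p_1_or_6)        # 1/3
print(p_ace_or_spade)  # 4/13
```

Counting the union directly, `prob(lambda c: is_ace(c) or is_spade(c), deck)`, gives the same 4/13, which is what subtracting the overlap is meant to guarantee.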
Probability: Examples
P(A and B) = P(A ∩ B): The probability that both A and B will occur
Independent Events
The occurrence of A is independent of the occurrence of B
P(A ∩ B) = P(A) P(B)
What is the probability that fair coin tosses will come up heads twice in a row?
A = a head on the first toss
B = a head on the second toss
P(A) = P(B) = 1/2
P(A ∩ B) = 1/2 * 1/2 = 1/4
Dependent Events
The occurrence of A is not independent of the occurrence of B
P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B)
What is the probability that two cards drawn from a deck will both be aces?
A = the first card is an ace
B = the second card is an ace
P(A) = 4/52 = 1/13
P(B|A) = 3/51 = 1/17
P(A ∩ B) = 1/13 * 1/17 = 1/221
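Both multiplication rules can be confirmed by enumerating every outcome. This is a small Python sketch under stated assumptions: cards are encoded as hypothetical `(rank, suit)` integer pairs with rank 1 standing for the ace, and the two draws are made without replacement.

```python
from fractions import Fraction
from itertools import permutations, product

# Independent events: two tosses of a fair coin.
tosses = list(product("HT", repeat=2))          # HH, HT, TH, TT
p_two_heads = Fraction(tosses.count(("H", "H")), len(tosses))

# Dependent events: two cards drawn without replacement.
# Cards are (rank, suit) pairs; rank 1 stands for the ace.
deck = [(rank, suit) for rank in range(1, 14) for suit in range(4)]
draws = list(permutations(deck, 2))             # ordered pairs of distinct cards
both_aces = sum(1 for first, second in draws if first[0] == 1 and second[0] == 1)
p_two_aces = Fraction(both_aces, len(draws))

print(p_two_heads)  # 1/4
print(p_two_aces)   # 1/221
```

The enumeration never uses P(B|A) explicitly, yet it agrees with P(A) P(B|A) = 4/52 * 3/51.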
Addition
• Mutually Exclusive
− P(A ∪ B ∪ C) = P(A) + P(B) + P(C)
• Not Mutually Exclusive
− P(A ∪ B ∪ C) = P(A) + P(B) + P(C) - P(A ∩ B) - P((A ∪ B) ∩ C)
AB = A ∪ B
P(A ∪ B ∪ C) = P(AB ∪ C) = P(AB) + P(C) - P(AB ∩ C)
» P(AB) = P(A) + P(B) - P(A ∩ B)
» P(AB ∩ C) = P((A ∪ B) ∩ C)
Multiplication
• Independent
− P(A ∩ B ∩ C) = P(A) P(B) P(C)
• Dependent
− P(A ∩ B ∩ C) = P(A) P(B|A) P(C|(A ∩ B))
AB = A ∩ B
P(A ∩ B ∩ C) = P(AB ∩ C) = P(AB) P(C|AB)
» P(AB) = P(A) P(B|A)
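The three-event addition and multiplication formulas hold for any events, so they can be spot-checked on a small sample space. The sketch below uses two dice and three events chosen purely for illustration (they do not come from the slides); the helpers `P` and `P_given` are hypothetical names.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two dice.
space = set(product(range(1, 7), repeat=2))

def P(event):
    """Probability of an event given as a set of outcomes."""
    return Fraction(len(event), len(space))

def P_given(event, condition):
    """Conditional probability P(event | condition)."""
    return Fraction(len(event & condition), len(condition))

# Illustrative events (assumptions, not from the slides).
A = {o for o in space if o[0] % 2 == 0}   # first die is even
B = {o for o in space if sum(o) >= 9}     # total is at least 9
C = {o for o in space if o[1] == 6}       # second die shows 6

# Addition for events that are not mutually exclusive.
lhs_add = P(A | B | C)
rhs_add = P(A) + P(B) + P(C) - P(A & B) - P((A | B) & C)

# Multiplication for dependent events (chain rule).
lhs_mul = P(A & B & C)
rhs_mul = P(A) * P_given(B, A) * P_given(C, A & B)

print(lhs_add == rhs_add, lhs_mul == rhs_mul)
```

In this code `|` and `&` are Python's set union and intersection, matching ∪ and ∩ in the formulas.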
Combination of Probability
OR (addition): P(A ∪ B)
• Mutually Exclusive: P(A) + P(B)
• Not Mutually Exclusive: P(A) + P(B) - P(A ∩ B)
AND (multiplication): P(A ∩ B)
• Independent: P(A) P(B)
• Dependent: P(A) P(B|A)
[Figure: Venn diagrams. Not Mutually Exclusive Events: P(A) and P(B) overlap in P(A ∩ B). Mutually Exclusive Events: P(A) and P(B) do not overlap.]
Probability: Examples
P(A|B): The probability that A will occur given B has occurred.
The conditional probability of A given B
P(A|B) = P(A ∩ B) / P(B)
Derived from P(A ∩ B) = P(B) P(A|B)
At a restaurant, 90% of the customers order a burger. If 72% of the customers order a burger and fries, what is the probability that a customer who orders a burger will also order fries?
A = a customer orders fries
B = a customer orders a burger
P(B) = 90/100, P(A ∩ B) = 72/100
P(A|B) = (72/100) / (90/100) = 72/90 = 0.8
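The restaurant example reduces to a single division. As a minimal sketch, assuming a hypothetical helper named `conditional`:

```python
from fractions import Fraction

def conditional(p_a_and_b, p_b):
    """P(A|B) = P(A and B) / P(B); undefined when P(B) = 0."""
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return p_a_and_b / p_b

# A = customer orders fries, B = customer orders a burger.
p_fries_given_burger = conditional(Fraction(72, 100), Fraction(90, 100))
print(p_fries_given_burger)  # 4/5
```

Exact fractions avoid rounding; passing floats such as 0.72 and 0.90 also works, up to floating-point error.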
Probability: Bayes Theorem
Formula for calculating conditional probabilities
P(A|B) as a function of P(B|A)
Bayesian Inference
Learning about a hypothesis from data
Hypothesis
Prior probability: initial subjective degree of belief in the hypothesis
Learning
Posterior probability: the degree of belief in the hypothesis is updated based on evidence
H = hypothesis, E = evidence
P(H) = prior probability of H
P(E) = marginal probability of E
Probability of witnessing the evidence E
P(E|H) = likelihood of E given H
Conditional probability of seeing the evidence E given that hypothesis H is true
P(H|E) = posterior probability of H given E
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \qquad P(A \cap B) = P(A)\,P(B \mid A)$$
$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)}$$
$$P(B) = P(B \cap A) + P(B \cap A^c) = P(A)\,P(B \mid A) + P(A^c)\,P(B \mid A^c)$$
$$P(H \mid E) = \frac{P(H \cap E)}{P(E)} = \frac{P(H)\,P(E \mid H)}{P(E)} = \frac{P(H)\,P(E \mid H)}{P(H)\,P(E \mid H) + P(H^c)\,P(E \mid H^c)}$$
Probability: Bayesian Inference
Box 1 has 10 oranges and 20 apples. Box 2 has 20 oranges and 10 apples. If you randomly picked out a fruit from a random box and got an apple, what is the probability that it came from box 1?
H = The fruit (e.g., apple) comes from box 1
H^c = The fruit (e.g., apple) comes from box 2
E = an apple is picked
P(H|E) = P(H)P(E|H) / P(E) = P(H)P(E|H) / (P(H)P(E|H) + P(H^c)P(E|H^c))
P(H) = P(H^c) = ½
P(E|H) = 20/30
P(E|H^c) = 10/30
P(H|E) = (½ * 2/3) / (½ * 2/3 + ½ * 1/3) = (1/3) / (1/2) = 2/3
$$P(H \mid E) = \frac{P(H \cap E)}{P(E)} = \frac{P(H)\,P(E \mid H)}{P(E)} = \frac{P(H)\,P(E \mid H)}{P(H)\,P(E \mid H) + P(H^c)\,P(E \mid H^c)}$$
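The two-box example is a direct application of Bayes Theorem with two competing hypotheses. Here is a minimal Python sketch; the function name `posterior` and its parameter names are illustrative assumptions, and it handles only the case of a hypothesis and its complement.

```python
from fractions import Fraction

def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """P(H|E) for a hypothesis H and its complement, via Bayes Theorem."""
    prior_not_h = 1 - prior_h
    # Marginal probability of the evidence: P(E) = P(H)P(E|H) + P(H^c)P(E|H^c)
    p_e = prior_h * p_e_given_h + prior_not_h * p_e_given_not_h
    return prior_h * p_e_given_h / p_e

# H = the fruit comes from box 1, E = the fruit is an apple.
p_box1_given_apple = posterior(
    prior_h=Fraction(1, 2),            # each box is equally likely
    p_e_given_h=Fraction(20, 30),      # box 1: 20 apples out of 30 fruits
    p_e_given_not_h=Fraction(10, 30),  # box 2: 10 apples out of 30 fruits
)
print(p_box1_given_apple)  # 2/3
```

The prior of 1/2 is updated to a posterior of 2/3 because an apple is twice as likely to come from box 1 as from box 2.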
Probability: Odds
Probability and odds are two ways of measuring the chances of an event occurring.
Probability
• number of desirable outcomes / total number of possible outcomes
− occurrence / whole
• chances for / total chances
• predictive: long run ratio
Odds
• number of desirable outcomes / number of undesirable outcomes
− occurrence / non-occurrence
• chances for / chances against
− O(A) = 2 / 1 odds are 2 to 1 in favor of A
− O(A) = 1 / 2 odds are 2 to 1 against A
• payoff to break even
Probability and odds have similar values for infrequent events (P < 0.1)
                 Probability                       Odds
Transformation   $P(A) = \frac{O(A)}{1 + O(A)}$    $O(A) = \frac{P(A)}{1 - P(A)}$
Range            0 to 1                            0 to ∞
Comparison       1                                 ∞
                 0.9                               9.00
                 0.8                               4.00
                 0.7                               2.33
                 0.6                               1.50
                 0.5                               1.00
                 0.4                               0.67
                 0.3                               0.43
                 0.2                               0.25
                 0.1                               0.11
                 0                                 0.00
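The two transformations between probability and odds are inverses of each other, which a short sketch makes concrete. The function names `prob_to_odds` and `odds_to_prob` are illustrative assumptions, and mapping P(A) = 1 to infinite odds is a convention chosen here to match the "0 to ∞" range.

```python
import math

def prob_to_odds(p):
    """O(A) = P(A) / (1 - P(A)); returns infinity when P(A) = 1."""
    if not 0 <= p <= 1:
        raise ValueError("probability must lie in [0, 1]")
    return math.inf if p == 1 else p / (1 - p)

def odds_to_prob(o):
    """P(A) = O(A) / (1 + O(A)); infinite odds map back to probability 1."""
    if o < 0:
        raise ValueError("odds must be non-negative")
    return 1.0 if math.isinf(o) else o / (1 + o)

# Probability and odds drift apart as the event becomes more frequent.
for p in (0.9, 0.5, 0.1, 0.05, 0.0):
    print(p, round(prob_to_odds(p), 2))
```

For small probabilities the denominator 1 - P(A) is close to 1, which is why the two measures nearly coincide for infrequent events.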
Logarithm
• Logarithm of x to base b: $y = \log_b x$
• $x = b^y$
Common Logarithm
• Logarithm to base 10
• $\log x = \log_{10} x$
Natural Logarithm
• Logarithm to base e, where e = 2.718281828…
• $\ln x = \log_e x$
Properties
• $\log (x_1 x_2) = \log x_1 + \log x_2$
• $\log \frac{x_1}{x_2} = \log x_1 - \log x_2$
• $\log x^k = k \log x$
Log Transformation
− $x = b^y$, $y = \log_b x$
− $\log_c x = \log_c b^y = y \log_c b = \log_b x \cdot \log_c b$, so $\log_b x = \frac{\log_c x}{\log_c b}$
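The logarithm properties are easy to confirm numerically. The sketch below is only an illustration: `log_base` is a hypothetical helper built on the change-of-base rule, and the sample values are arbitrary.

```python
import math

def log_base(x, b):
    """log_b(x) computed via the change-of-base rule with natural logs."""
    return math.log(x) / math.log(b)

x1, x2, k = 8.0, 32.0, 3  # arbitrary positive sample values

product_rule = math.isclose(math.log10(x1 * x2), math.log10(x1) + math.log10(x2))
quotient_rule = math.isclose(math.log10(x1 / x2), math.log10(x1) - math.log10(x2))
power_rule = math.isclose(math.log10(x1 ** k), k * math.log10(x1))
change_of_base = math.isclose(log_base(x1, 2), math.log2(x1))

print(product_rule, quotient_rule, power_rule, change_of_base)
```

The comparisons use `math.isclose` instead of `==` because floating-point logarithms are only approximately equal.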
IR: Probabilistic Model
Probability Ranking Principle
• Assumptions
− Each document is either relevant or not relevant
− Relevance of one document does not affect that of another document
• Ranking documents according to the probability of relevance to the query
Maximizes the overall retrieval effectiveness
Binary Independence Retrieval Model
• Represent documents d as binary term vectors
− $d = (t_1, \ldots, t_n)$, where $t_i = 1$ if term $t_i$ belongs to d, and $t_i = 0$ otherwise
• Rank documents according to the relevance odds (equivalent to ranking by probability of relevance, since odds increase with probability)
1. compute the odds that document d will be relevant:
$$O(R \mid d) = \frac{P(R \mid d)}{P(\bar{R} \mid d)}$$
» $P(R \mid d)$: probability that d is relevant
» $P(\bar{R} \mid d)$: probability that d is not relevant
2. sort the documents by the descending order of $O(R \mid d)$
• Term Independence Assumption
− Presence/absence of one term is unrelated to the presence/absence of any other term.
− assumes that occurrences of terms in documents are independent of each other
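The binary term vector representation ignores term frequency and word order: each position records only whether a term is present. A minimal sketch, assuming a hypothetical fixed `vocabulary` list and whitespace tokenization:

```python
def binary_vector(doc_tokens, vocabulary):
    """Return (t_1, ..., t_n): t_i = 1 if vocabulary[i] occurs in the document, else 0."""
    present = set(doc_tokens)
    return tuple(1 if term in present else 0 for term in vocabulary)

vocabulary = ["bayes", "model", "odds", "retrieval"]   # hypothetical index terms
doc = "probabilistic retrieval model ranks by odds odds".split()

print(binary_vector(doc, vocabulary))  # (0, 1, 1, 1)
```

Note that "odds" appears twice in the document but still contributes a single 1, and "probabilistic" is dropped because it is not in the vocabulary.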
IR: BIR Model Derivation
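$$O(R \mid d) = \frac{P(R \mid d)}{P(\bar{R} \mid d)} = \frac{P(R)\,P(d \mid R)\,/\,P(d)}{P(\bar{R})\,P(d \mid \bar{R})\,/\,P(d)} = \frac{P(R)}{P(\bar{R})} \cdot \frac{P(d \mid R)}{P(d \mid \bar{R})} \quad \text{(apply Bayes Theorem)}$$
$$O(R \mid d) = O(R) \prod_{i=1}^{n} \frac{P(t_i \mid R)}{P(t_i \mid \bar{R})} \quad \text{(assume Term Independence)}$$
P(R) = probability that a (randomly selected) document is relevant, and $O(R) = \frac{P(R)}{P(\bar{R})}$
Split by presence/absence of terms:
$$O(R \mid d) = O(R) \prod_{t_i = 1} \frac{P(t_i = 1 \mid R)}{P(t_i = 1 \mid \bar{R})} \prod_{t_i = 0} \frac{P(t_i = 0 \mid R)}{P(t_i = 0 \mid \bar{R})}$$
Let $p_i = P(t_i = 1 \mid R)$: probability that ti occurs in a relevant document
Let $q_i = P(t_i = 1 \mid \bar{R})$: probability that ti occurs in a non-relevant document
$$O(R \mid d) = O(R) \prod_{t_i = 1} \frac{p_i}{q_i} \prod_{t_i = 0} \frac{1 - p_i}{1 - q_i} = O(R) \prod_{i=1}^{n} \left(\frac{p_i}{q_i}\right)^{t_i} \left(\frac{1 - p_i}{1 - q_i}\right)^{1 - t_i}$$
since $\left(\frac{p_i}{q_i}\right)^{t_i} = \frac{p_i}{q_i}$ if $t_i = 1$ and $1$ if $t_i = 0$, while $\left(\frac{1 - p_i}{1 - q_i}\right)^{1 - t_i} = 1$ if $t_i = 1$ and $\frac{1 - p_i}{1 - q_i}$ if $t_i = 0$
Ignore $O(R)$, which is constant across documents, and take the log, using $\log(x_1 x_2 \cdots x_n) = \log x_1 + \log x_2 + \cdots + \log x_n = \sum_{i=1}^{n} \log x_i$:
$$RSV(d) = \log \prod_{i=1}^{n} \left(\frac{p_i}{q_i}\right)^{t_i} \left(\frac{1 - p_i}{1 - q_i}\right)^{1 - t_i} = \sum_{i=1}^{n} \left[ t_i \log \frac{p_i}{q_i} + (1 - t_i) \log \frac{1 - p_i}{1 - q_i} \right]$$
$$= \sum_{i=1}^{n} t_i \left[ \log \frac{p_i}{q_i} - \log \frac{1 - p_i}{1 - q_i} \right] + \sum_{i=1}^{n} \log \frac{1 - p_i}{1 - q_i} = \sum_{i=1}^{n} t_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \sum_{i=1}^{n} \log \frac{1 - p_i}{1 - q_i}$$
Ignore $\sum_{i=1}^{n} \log \frac{1 - p_i}{1 - q_i}$, which is constant across documents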
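The derivation drops two terms that do not depend on the document, so the simplified score should differ from the full log odds by a constant and produce the same ranking. The sketch below checks this numerically; the values of `p`, `q`, and `prior_odds` are made up for illustration and are not from the slides.

```python
import math
from itertools import product

p = [0.8, 0.3, 0.6]   # hypothetical p_i = P(t_i = 1 | relevant)
q = [0.2, 0.4, 0.1]   # hypothetical q_i = P(t_i = 1 | non-relevant)
prior_odds = 0.25     # hypothetical O(R)

def odds_relevant(d):
    """Full O(R|d): prior odds times one factor per term, split by presence/absence."""
    odds = prior_odds
    for t, pi, qi in zip(d, p, q):
        odds *= (pi / qi) if t == 1 else ((1 - pi) / (1 - qi))
    return odds

def rsv(d):
    """Simplified score: sum of log odds ratios over the terms present in d."""
    return sum(t * math.log(pi * (1 - qi) / (qi * (1 - pi)))
               for t, pi, qi in zip(d, p, q))

# The two document-independent terms dropped in the derivation.
constant = math.log(prior_odds) + sum(
    math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))

docs = list(product([0, 1], repeat=len(p)))  # every possible binary document
same_ranking = sorted(docs, key=odds_relevant) == sorted(docs, key=rsv)
print(same_ranking)
```

Because the logarithm is monotonic and the dropped terms are the same for every document, sorting by either quantity orders the documents identically.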
IR: Term Relevance Weight
$$RSV(d) = \sum_{i=1}^{n} t_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}$$
$$tr_k = \log \frac{p_k (1 - q_k)}{q_k (1 - p_k)} \quad \text{(log odds ratio)}$$
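In practice the score of a document is the sum of the term relevance weights of the terms it contains. A rough sketch of ranking with such weights follows; the probabilities, the document contents, and the names `term_weight`, `rsv`, and `rank` are all assumptions made for illustration.

```python
import math

def term_weight(p_k, q_k):
    """tr_k = log of the odds ratio p_k(1 - q_k) / (q_k(1 - p_k))."""
    return math.log(p_k * (1 - q_k) / (q_k * (1 - p_k)))

def rsv(doc_terms, weights):
    """Sum the weights of the terms present in the document (t_i = 1)."""
    return sum(w for term, w in weights.items() if term in doc_terms)

def rank(docs, weights):
    """Document names sorted by descending score."""
    return sorted(docs, key=lambda name: rsv(docs[name], weights), reverse=True)

# Hypothetical (p_k, q_k) estimates for three terms.
weights = {
    "probabilistic": term_weight(0.8, 0.2),
    "model": term_weight(0.5, 0.4),
    "retrieval": term_weight(0.6, 0.1),
}
docs = {
    "d1": {"probabilistic", "retrieval"},
    "d2": {"model"},
    "d3": {"probabilistic", "model", "retrieval"},
}
print(rank(docs, weights))  # ['d3', 'd1', 'd2']
```

A term with p_k below q_k gets a negative weight, so its presence lowers a document's score.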
Probability Estimation for Term Relevance Weight
• Without relevance information
− Known
Nd = total number of documents in collection (collection size)
dk = number of documents in which term k appears (postings)
− Estimated
qk = probability that term k occurs in a non-relevant document
» qk ≈ dk / Nd; (conventional) qk = (dk+0.5) / (Nd+1)
» assume that number of non-relevant documents ≈ collection size
pk = probability that term k occurs in a relevant document
» pk = 0.5
» assume random chance
− Same as idf for large Nd
$$tr_k = \log \frac{p_k (1 - q_k)}{q_k (1 - p_k)} = \log \frac{p_k}{1 - p_k} - \log \frac{q_k}{1 - q_k} = \log \frac{O(p_k)}{O(q_k)}$$
$O(p_k)$ = odds that term k appears in a relevant document
$O(q_k)$ = odds that term k appears in a non-relevant document
With $p_k = 0.5$ and $q_k = d_k / N_d$:
$$tr_k = \log \frac{0.5}{1 - 0.5} + \log \frac{1 - d_k / N_d}{d_k / N_d} = 0 + \log \frac{N_d - d_k}{d_k} = \log \frac{N_d - d_k}{d_k}$$
Conventional, with $p_k = 0.5$ and $q_k = (d_k + 0.5) / (N_d + 1)$:
$$tr_k = \log \frac{0.5}{1 - 0.5} + \log \frac{1 - \frac{d_k + 0.5}{N_d + 1}}{\frac{d_k + 0.5}{N_d + 1}} = \log \frac{N_d - d_k + 0.5}{d_k + 0.5}$$
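The claim that this weight behaves like idf for a large collection can be illustrated numerically. The following is a sketch under stated assumptions: natural logarithms are used (the slides do not fix a base), idf is taken as log(Nd / dk), and the collection size and posting counts are invented.

```python
import math

def tr_without_relevance(d_k, n_d, conventional=True):
    """Term relevance weight with p_k = 0.5 and q_k estimated from postings.

    conventional=True uses q_k = (d_k + 0.5) / (N_d + 1); otherwise
    q_k = d_k / N_d, which requires 0 < d_k < N_d.
    """
    if conventional:
        return math.log((n_d - d_k + 0.5) / (d_k + 0.5))
    return math.log((n_d - d_k) / d_k)

def idf(d_k, n_d):
    """Inverse document frequency, log(N_d / d_k) (natural log assumed here)."""
    return math.log(n_d / d_k)

n_d = 1_000_000  # hypothetical collection size
for d_k in (10, 1_000, 100_000):
    print(d_k, round(tr_without_relevance(d_k, n_d), 3), round(idf(d_k, n_d), 3))
```

The 0.5 and 1 in the conventional estimate keep the weight finite when a term appears in no documents or in all of them, where the unsmoothed form would divide by zero or take the log of zero.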
IR: Term Relevance Weight
Probability Estimation for Term Relevance Weight
• With relevance information
− Known
Nd = total number of documents in collection (collection size)
dk = number of documents in which term k appears (postings)
Nr = total number of relevant documents in collection
» Nd - Nr = total number of non-relevant documents in collection
rk = number of relevant documents in which term k appears
» dk - rk = number of non-relevant documents in which term k appears
− Estimated
qk = probability that term k occurs in a non-relevant document
» qk = (dk - rk) / (Nd - Nr); (conventional) qk = (dk - rk + 0.5) / (Nd - Nr + 1)
pk = probability that term k occurs in a relevant document
» pk = rk / Nr; (conventional) pk = (rk + 0.5) / (Nr + 1)
                           Relevant Documents   Non-relevant Documents   Total
Documents with Term k      rk                   dk - rk                  dk
Documents without Term k   Nr - rk              Nd - Nr - (dk - rk)      Nd - dk
Total                      Nr                   Nd - Nr                  Nd
IR: Term Relevance Weight
Probability Estimation for Term Relevance Weight
• With relevance information
qk = probability that term k occurs in a non-relevant document
» qk = (dk - rk) / (Nd - Nr); (conventional) qk = (dk - rk + 0.5) / (Nd - Nr + 1)
pk = probability that term k occurs in a relevant document
» pk = rk / Nr; (conventional) pk = (rk + 0.5) / (Nr + 1)
$$tr_k = \log \frac{p_k (1 - q_k)}{q_k (1 - p_k)} = \log \frac{\frac{r_k}{N_r} \left(1 - \frac{d_k - r_k}{N_d - N_r}\right)}{\frac{d_k - r_k}{N_d - N_r} \left(1 - \frac{r_k}{N_r}\right)} = \log \frac{r_k / (N_r - r_k)}{(d_k - r_k) / (N_d - N_r - d_k + r_k)}$$
(conventional)
$$tr_k = \log \frac{p_k (1 - q_k)}{q_k (1 - p_k)} = \log \frac{\frac{r_k + 0.5}{N_r + 1} \left(1 - \frac{d_k - r_k + 0.5}{N_d - N_r + 1}\right)}{\frac{d_k - r_k + 0.5}{N_d - N_r + 1} \left(1 - \frac{r_k + 0.5}{N_r + 1}\right)} = \log \frac{(r_k + 0.5) / (N_r - r_k + 0.5)}{(d_k - r_k + 0.5) / (N_d - N_r - d_k + r_k + 0.5)}$$
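With relevance information, the weight can be computed either from the estimated pk and qk or from the closed form in terms of the four counts; both routes should agree. A minimal sketch using the conventional (smoothed) estimates, with hypothetical function names, made-up counts, and natural logarithms assumed:

```python
import math

def tr_from_estimates(r_k, n_r, d_k, n_d):
    """Weight computed from the conventional estimates of p_k and q_k."""
    p_k = (r_k + 0.5) / (n_r + 1)
    q_k = (d_k - r_k + 0.5) / (n_d - n_r + 1)
    return math.log(p_k * (1 - q_k) / (q_k * (1 - p_k)))

def tr_closed_form(r_k, n_r, d_k, n_d):
    """Same weight written directly in terms of the contingency-table counts."""
    odds_relevant = (r_k + 0.5) / (n_r - r_k + 0.5)
    odds_non_relevant = (d_k - r_k + 0.5) / (n_d - n_r - d_k + r_k + 0.5)
    return math.log(odds_relevant / odds_non_relevant)

# Hypothetical counts: 1000 documents, 10 relevant; term k is in 50 documents,
# 8 of which are relevant.
counts = dict(r_k=8, n_r=10, d_k=50, n_d=1000)
print(round(tr_from_estimates(**counts), 3), round(tr_closed_form(**counts), 3))
```

A term concentrated in the relevant documents, as in this example, receives a positive weight; a term spread evenly across relevant and non-relevant documents receives a weight near zero.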