Basic probability theory and n-gram models
INF4820 – H2010
Jan Tore Lønning
Institutt for Informatikk Universitetet i Oslo
Outline
1 Basic probability
2 Bayes theorem
Gambling — the beginning of probabilities
Example
Throwing a fair die
What is the chance of getting a 5?
What is the chance of getting an odd number?
What is the chance of getting a number ≥ 3?
Example
Two throws
What is the chance of getting two 5s?
What is the chance of getting the sum 10?
What is the chance of getting a sum of 6 or more if the first die is 1 or 2?
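The questions above can be answered by brute-force counting, assuming every outcome of a fair die is equally likely. The following is a minimal sketch; the helper name `prob` and the variable names are illustrative, and the last question is read as being about the sum of the two throws.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)

def prob(event, space):
    """P(event) under a uniform measure: favourable outcomes / all outcomes."""
    favourable = sum(1 for outcome in space if event(outcome))
    return Fraction(favourable, len(space))

one_throw = list(FACES)
two_throws = list(product(FACES, repeat=2))  # all 36 ordered pairs

p_five = prob(lambda d: d == 5, one_throw)
p_odd = prob(lambda d: d % 2 == 1, one_throw)
p_at_least_3 = prob(lambda d: d >= 3, one_throw)
p_two_fives = prob(lambda t: t == (5, 5), two_throws)
p_sum_10 = prob(lambda t: sum(t) == 10, two_throws)

# "If the first die is 1 or 2": restrict the sample space to those outcomes.
first_low = [t for t in two_throws if t[0] in (1, 2)]
p_sum_6_given_low = prob(lambda t: sum(t) >= 6, first_low)

print(p_five, p_odd, p_at_least_3, p_two_fives, p_sum_10, p_sum_6_given_low)
# prints: 1/6 1/2 2/3 1/36 1/12 5/12
```

Exact fractions are used instead of floats so that the results can be compared with hand calculations.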
Some terminology
The sample space is a set Ω of elementary outcomes (Norwegian: “utfallsrom”).
For example, let Ω represent the outcome of throwing a die: Ω = {one, two, three, four, five, six}
Events (A, B, C, ...) are subsets of this set (Norwegian: “hendelse, begivenhet”),
such as {one}, {two, four, six}, {three, six}, etc.
A probability measure P is a function from events to the interval [0, 1].
P(A) is the probability of event A.
Some Basic Rules
The three axioms of probability theory...
P(A) ≥ 0 for all events A (non-negativity)
P(Ω) = 1 (unit measure)
A ∩ B = ∅ ⇒ P(A ∪ B) = P(A) + P(B) (additivity for disjoint events)
And some of their consequences
P(Ā) = 1 − P(A), where Ā = Ω \ A is the complement of A
P(∅) = 0
If A ⊆ B then P(A) ≤ P(B)
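Each of these consequences follows from the three axioms in a line or two. A short derivation, writing $\bar{A}$ for the complement of A and $B \setminus A$ for the part of B outside A:

```latex
\begin{align*}
A \cap \bar{A} = \emptyset,\ A \cup \bar{A} = \Omega
  &\;\Rightarrow\; P(A) + P(\bar{A}) = P(\Omega) = 1
  \;\Rightarrow\; P(\bar{A}) = 1 - P(A) \\
\emptyset = \bar{\Omega}
  &\;\Rightarrow\; P(\emptyset) = 1 - P(\Omega) = 0 \\
A \subseteq B,\ B = A \cup (B \setminus A)
  &\;\Rightarrow\; P(B) = P(A) + P(B \setminus A) \ge P(A)
\end{align*}
```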
Joint Probability and Independence
Joint probability
The joint probability of two events A and B is the probability of both events occurring, P(A ∩ B).
Example
(a) Throwing two dice: A is the event that the first is 3, B is the event that the second is 4 or 6.
(b) Throwing two dice: A is the event that the first is odd, B is the event that the sum of the two is even.
(c) Throwing one die: A is the event that it is odd, B is the event that it is ≥ 4.
Independence
Two events A and B are independent if P(A ∩ B) = P(A)P(B).
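The three dice examples can be checked against the standard product criterion for independence, P(A ∩ B) = P(A)P(B). The sketch below assumes fair dice and a uniform measure; the helper names `prob` and `independent` are illustrative.

```python
from fractions import Fraction
from itertools import product

def prob(event, space):
    """P(event) under a uniform measure on `space`."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

def independent(a, b, space):
    """True if P(A and B) equals P(A) * P(B)."""
    joint = prob(lambda o: a(o) and b(o), space)
    return joint == prob(a, space) * prob(b, space)

one_die = list(range(1, 7))
two_dice = list(product(one_die, repeat=2))

# (a) first die is 3; second die is 4 or 6
case_a = independent(lambda t: t[0] == 3, lambda t: t[1] in (4, 6), two_dice)
# (b) first die is odd; sum of the two is even
case_b = independent(lambda t: t[0] % 2 == 1, lambda t: sum(t) % 2 == 0, two_dice)
# (c) one die: it is odd; it is >= 4
case_c = independent(lambda d: d % 2 == 1, lambda d: d >= 4, one_die)

print(case_a, case_b, case_c)  # prints: True True False
```

Case (b) is the interesting one: the sum depends on the first die, yet knowing that the first die is odd does not change the probability that the sum is even.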
Conditional Probability and Independence
The notion of conditional probability lets us capture the influence of partial knowledge about the outcome.
The conditional probability of A given B is defined, for P(B) > 0, as
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
This is the probability that we will observe A given that we have already observed B: the fraction of B’s probability mass that also covers A.
Conditional Probability and Independence (cont’d)
The Multiplication Rule
The numerator in our equation for conditional probability can itself be “conditionalized” using the multiplication rule:
P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)
The Chain Rule
Generalizes the multiplication rule to multiple events:
P(A ∩ B ∩ C ∩ D ∩ ...) = P(A)P(B|A)P(C|A ∩ B)P(D|A ∩ B ∩ C) ...
Two events A and B are independent if P(A ∩ B) = P(A)P(B); when P(B) > 0 this is the same as P(A|B) = P(A).
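The chain rule can be verified numerically on a small sample space. The sketch below uses two fair dice and three events chosen only for illustration (A: first die even, B: sum at least 7, C: second die is 6), and compares the joint probability with the chained product.

```python
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))

def prob(event, given=lambda o: True):
    """P(event | given) under a uniform measure; `given` defaults to all of the space."""
    conditioned = [o for o in space if given(o)]
    return Fraction(sum(1 for o in conditioned if event(o)), len(conditioned))

a = lambda t: t[0] % 2 == 0   # first die is even
b = lambda t: sum(t) >= 7     # sum is at least 7
c = lambda t: t[1] == 6       # second die is 6

joint = prob(lambda t: a(t) and b(t) and c(t))
chained = prob(a) * prob(b, given=a) * prob(c, given=lambda t: a(t) and b(t))

print(joint, chained)  # prints: 1/12 1/12
```

Conditioning is implemented by restricting the sample space, which matches the definition P(A|B) = P(A ∩ B) / P(B) when all outcomes are equally likely.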
Random Variables
Instead of dealing with the sample space directly (which will be different for each application), random variables provide an extra layer of abstraction that lets us deal with events in a more uniform way.
A discrete random variable X is a function from the sample space Ω into a finite or countably infinite set of values.
The probability mass function (pmf) p for a random variable X gives us the probability of a given numerical value:
$p(x) = p(X = x) = P(\{\omega \in \Omega : X(\omega) = x\})$
For a discrete random variable X, let $x_i$ be a value of X and $A_{x_i} = \{\omega \in \Omega : X(\omega) = x_i\}$ the corresponding event in Ω. These events are disjoint and together cover Ω, so we have that:
$\sum_{x_i} p(X = x_i) = \sum_{x_i} P(A_{x_i}) = P(\Omega) = 1$
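As a concrete case, take X to be the sum of two fair dice. The sketch below builds its pmf by counting outcomes and checks that the probabilities add up to 1; the choice of X and all names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # sample space: 36 ordered pairs

def X(outcome):
    """Random variable: maps an outcome (a pair of throws) to their sum."""
    return outcome[0] + outcome[1]

# p(x) = P({w in omega : X(w) = x}) under a uniform measure
counts = Counter(X(w) for w in omega)
pmf = {x: Fraction(c, len(omega)) for x, c in counts.items()}

print(pmf[7], sum(pmf.values()))  # prints: 1/6 1
```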
Bayes' theorem
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ (conditional probability)
$P(A \cap B) = P(A \mid B)P(B)$ (the product rule)
$P(A \cap B) = P(B \mid A)P(A)$
$P(A \mid B)P(B) = P(B \mid A)P(A)$
$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$ (Bayes' theorem)
Example (from Wikipedia)
40% girls, 60% boys. All the boys wear trousers. 50% of the girls wear trousers.
What is the probability that someone wearing trousers is a girl?
Example (solution with Bayes)
Let A be the event that the person is a girl and B the event that the person wears trousers.
Probability that someone is a girl: P(A) = 0.4
Probability that a girl wears trousers: P(B | A) = 0.5
Probability of wearing trousers: P(B) = P(B | A)P(A) + P(B | Ā)P(Ā) = 0.5 × 0.4 + 1 × 0.6 = 0.8
Hence $P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)} = \frac{0.5 \times 0.4}{0.8} = 0.25$
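The same calculation as a small script, using the figures from the example (40% girls, all boys and half of the girls in trousers). The variable names are illustrative; exact fractions avoid rounding noise.

```python
from fractions import Fraction

p_girl = Fraction(4, 10)                # P(A), the prior
p_boy = 1 - p_girl                      # P(not A)
p_trousers_given_girl = Fraction(1, 2)  # P(B | A)
p_trousers_given_boy = Fraction(1)      # P(B | not A)

# Total probability: P(B) = P(B | A)P(A) + P(B | not A)P(not A)
p_trousers = p_trousers_given_girl * p_girl + p_trousers_given_boy * p_boy

# Bayes' theorem: P(A | B) = P(B | A)P(A) / P(B)
p_girl_given_trousers = p_trousers_given_girl * p_girl / p_trousers

print(p_trousers, p_girl_given_trousers)  # prints: 4/5 1/4
```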
A little jargon
$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$
P(A): prior probability
In the example: the probability that the person is a girl before we know that she wears trousers.
P(A | B): posterior probability
In the example: the probability that the person is a girl after we have learned that the person wears trousers.
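Argmax notation
Definition (Argmax)
$x_0 = \arg\max_x f(x)$ means that $f(y) \le f(x_0)$ for all y
Argmax and Bayes
$x_0 = \arg\max_x P(x \mid B) = \arg\max_x \frac{P(B \mid x)P(x)}{P(B)} = \arg\max_x P(B \mid x)P(x) = \arg\max_x [\log P(B \mid x) + \log P(x)]$
P(B) can be dropped because it does not depend on x, and taking logarithms preserves the argmax because log is strictly increasing.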
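A small numerical check that the posterior, the unnormalized product, and the sum of logs all pick the same x. The three hypotheses and their probabilities are invented purely for illustration.

```python
import math

# Hypothetical prior P(x) and likelihood P(B | x) for three candidates
prior = {"x1": 0.6, "x2": 0.3, "x3": 0.1}
likelihood = {"x1": 0.02, "x2": 0.10, "x3": 0.15}

# P(B) by total probability; it is the same constant for every x
evidence = sum(likelihood[x] * prior[x] for x in prior)

by_posterior = max(prior, key=lambda x: likelihood[x] * prior[x] / evidence)
by_product = max(prior, key=lambda x: likelihood[x] * prior[x])
by_log_sum = max(prior, key=lambda x: math.log(likelihood[x]) + math.log(prior[x]))

print(by_posterior, by_product, by_log_sum)  # prints: x2 x2 x2
```

The log form matters in practice because products of many small probabilities underflow, while sums of logs do not.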
Some mathematical notation
Sum and product
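$\sum_{i=1}^{n} a_i = a_1 + a_2 + \cdots + a_n$
$\prod_{i=1}^{n} a_i = a_1 \cdot a_2 \cdots a_n$
Example
$\sum_{i=1}^{7} i = 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28$
$\prod_{i=1}^{7} i = 1 \cdot 2 \cdot 3 \cdots 7 = 5040$
$\log\left(\prod_{i=1}^{n} a_i\right) = \sum_{i=1}^{n} \log a_i$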
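The two examples can be reproduced directly, together with a check of the standard identity that the logarithm of a product equals the sum of the logarithms. This is a minimal sketch; `math.prod` requires Python 3.8 or later.

```python
import math

a = range(1, 8)  # the numbers 1..7

total = sum(a)           # 1 + 2 + ... + 7
product = math.prod(a)   # 1 * 2 * ... * 7

# log of a product equals the sum of the logs (up to floating-point rounding)
log_of_product = math.log(product)
sum_of_logs = sum(math.log(x) for x in a)

print(total, product, math.isclose(log_of_product, sum_of_logs))
# prints: 28 5040 True
```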
Language Modeling
A stochastic (or probabilistic) language model M is a model that assigns probabilities $P_M(x)$ to all strings x ∈ L.
Remember the properties of $P_M$:
L is the sample space
$0 \le P_M(x) \le 1$
$\sum_{x \in L} P_M(x) = 1$
Used in NLP:
1 Speech recognition
2 Machine translation
3 Spell checking (see Norvig on the net)
4 OCR
Example: MT
We are interested in finding the best English translation $\hat{E}$ of a Norwegian (foreign) sentence F.
$\hat{E} = \arg\max_E P(E \mid F) = \arg\max_E \frac{P(F \mid E)P(E)}{P(F)} = \arg\max_E P(F \mid E)P(E)$
We turn the problem around: we regard F as a translation (a distortion) of E and ask which E is the most probable source.
This is a starting point for simplifications and approximations. It requires access to P(E), which is the language model.
Bigram model
By the chain rule of joint probabilities, the probability of a sequence of words can be factorized as:
$p(w_1, \ldots, w_k) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_k \mid w_1, \ldots, w_{k-1})$
However, modeling this full distribution would require too many parameters...
We simplify the estimation problem by the so-called Markov assumption: the probability of a given word is taken to depend only on the preceding word:
$p(w_i \mid w_1, \ldots, w_{i-1}) = p(w_i \mid w_{i-1})$
Bigram model
The probability of a string w1, ..., wk is just the product of its individual word probabilities, computed as:
$p_n(w_1 \ldots w_k) = \prod_{i=1}^{k} p(w_i \mid w_{i-1})$
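A minimal sketch of scoring a string with a bigram model. The probability table is invented for illustration, and the start symbol `<s>` is an assumption made here to give the first word something to condition on; the slides do not say how w0 is handled.

```python
import math

# Hypothetical bigram probabilities p(w_i | w_{i-1}), keyed as (previous, current)
bigram_p = {
    ("<s>", "i"): 0.5,
    ("i", "like"): 0.4,
    ("like", "bananas"): 0.25,
}

def bigram_prob(words, table, start="<s>"):
    """Product of p(w_i | w_{i-1}) over the string; unseen bigrams contribute 0."""
    prob = 1.0
    for prev, word in zip([start] + words, words):
        prob *= table.get((prev, word), 0.0)
    return prob

p_seen = bigram_prob(["i", "like", "bananas"], bigram_p)
p_unseen = bigram_prob(["i", "like", "apples"], bigram_p)
print(p_seen, p_unseen)
```

The second string gets probability zero because one of its bigrams is missing from the table, even though the other two factors are non-zero.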
N-gram model
In a trigram model the probability of a word may depend on the two preceding words:
$p(w_j \mid w_1, \ldots, w_{j-1}) = p(w_j \mid w_{j-2} w_{j-1})$
$p_n(w_1 \ldots w_k) = \prod_{i=1}^{k} p(w_i \mid w_{i-2} w_{i-1})$
In an N-gram model the probability of a word may depend on the N − 1 preceding words:
$p(w_j \mid w_1, \ldots, w_{j-1}) = p(w_j \mid w_{j-N+1} \ldots w_{j-1}) = p(w_j \mid w_{j-N+1}^{j-1})$
$p_n(w_1 \ldots w_k) = \prod_{i=1}^{k} p(w_i \mid w_{i-N+1}^{i-1})$
Estimation
How to get our hands on the probabilities?
$P(\text{bananas} \mid \text{i, like}) = \frac{C(\text{i, like, bananas})}{C(\text{i, like}, *)}$
The probabilities are given by the relative frequencies of the observed outcomes, i.e. as they occur in our training corpus.
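Relative-frequency estimation amounts to counting n-grams and dividing. The sketch below estimates P(bananas | i, like) from a toy corpus invented for illustration; the helper names `ngrams` and `mle` are not from the slides.

```python
from collections import Counter

# Toy training corpus, already tokenized by whitespace
corpus = "i like bananas . i like apples . i like bananas .".split()

def ngrams(tokens, n):
    """All contiguous n-grams in the token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

trigram_counts = Counter(ngrams(corpus, 3))

def mle(word, context):
    """Relative frequency C(context, word) / C(context, *)."""
    context = tuple(context)
    context_total = sum(c for gram, c in trigram_counts.items() if gram[:-1] == context)
    if context_total == 0:
        raise ValueError("context never seen in the training data")
    return trigram_counts[context + (word,)] / context_total

p_bananas = mle("bananas", ("i", "like"))  # 2 of the 3 continuations of "i like"
p_pears = mle("pears", ("i", "like"))      # never observed after "i like"
print(p_bananas, p_pears)
```

Any continuation that does not occur in the training corpus is estimated at exactly zero.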
Some Considerations When Counting Words
What is to count as a word depends on the application.
Tokenization.
Normalization of case, spelling variants, clitics, abbreviations, numerical expressions, punctuation...
Lemmatization: base forms vs. full word forms.
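To make the first two choices concrete, here is a deliberately simple tokenizer with optional case folding. The regular expression is one possible design among many, not a recommended standard: it keeps clitics such as "don't" as single tokens and splits off punctuation. Lemmatization is not covered.

```python
import re

def tokenize(text, lowercase=True):
    """Split text into word and punctuation tokens; optionally fold case."""
    if lowercase:
        text = text.lower()
    # Either a word (with internal apostrophes) or a single punctuation character
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

tokens = tokenize("I like bananas, don't you?")
print(tokens)
# prints: ['i', 'like', 'bananas', ',', "don't", 'you', '?']
```

Whether "I" and "i" should count as the same word, or whether the comma should count as a word at all, changes the counts that the model is estimated from.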
Evaluation
Extrinsic Evaluation
AKA application-based or end-to-end evaluation
Measure the effect on performance within the embedding application (e.g. the MT system or the speech recognizer that the LM is a part of).
Intrinsic Evaluation
Quick and cheap evaluation based on the probability that the model assigns to unseen test data.
Training Data vs Test Data
We use the training set to estimate our models and use the test set to evaluate them.
Testing on the training data would give unrealistically high expectations about performance in a real application. Even with a perfect score when testing on the training data, we would probably see a drastic fall in performance on unseen data.
Problems
1 Data sparseness
Regardless of corpus size, perfectly acceptable phrases will be missing.
The creativity of language.
2 Unknown words
3 Zero counts (mix badly with the factorization into word probabilities: a single zero factor makes the probability of the whole string zero)
Some remedies
1 Make sure all n-grams receive a non-zero count
(smoothing).
Smoothing
General idea: take some of the probability mass of frequent events, and redistribute it to less frequent or unseen events.
Simplest approach: Add-One smoothing:
$\hat{P}(w_i \mid w_{i-n+1} \ldots w_{i-1}) = \frac{C(w_{i-n+1} \ldots w_{i-1} w_i) + 1}{C(w_{i-n+1} \ldots w_{i-1}) + N}$
where N is the number of distinct words (the vocabulary size)
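A sketch of add-one smoothing for the bigram case (n = 2), taking N to be the number of distinct word types in a toy corpus invented for illustration. The function name `add_one` is illustrative.

```python
from collections import Counter

corpus = "i like bananas . i like apples .".split()
vocab = set(corpus)  # N = len(vocab) distinct word types

bigram_counts = Counter(zip(corpus, corpus[1:]))
# Count each word as a context only where it is followed by another word
context_counts = Counter(corpus[:-1])

def add_one(word, prev):
    """Add-one estimate of P(word | prev): (C(prev, word) + 1) / (C(prev) + N)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

p_seen = add_one("bananas", "like")  # (1 + 1) / (2 + 5)
p_unseen = add_one("i", "like")      # (0 + 1) / (2 + 5)
total = sum(add_one(w, "like") for w in vocab)
print(p_seen, p_unseen, total)
```

Every bigram now has a non-zero probability, and the estimates for a given context still sum to 1 over the vocabulary. The cost is that the seen bigram "like bananas" drops from a relative frequency of 1/2 to 2/7.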
Other LM techniques aimed at overcoming problems with unseen events:
Back-off and interpolated models
Class-based models