Basic probability theory and n-gram models

(1)

INF4820 – H2010

Jan Tore Lønning

Institutt for Informatikk Universitetet i Oslo

(2)

Outline

1 Basic probability

2 Bayes theorem

(4)

Gambling — the beginning of probabilities

Example

Throwing a fair die

What is the chance of getting a 5?

What is the chance of getting an odd number?

What is the chance of getting a number ≥ 3?

Example

Two throws

What is the chance of getting two 5s?

What is the chance of getting the sum 10?

What is the chance of getting 6 or more if the first die is 1 or 2?
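The unconditional questions above can be checked by brute-force enumeration over equally likely outcomes. The following is a minimal Python sketch, assuming a fair six-sided die; the helper name `prob` is hypothetical and introduced only for illustration.

```python
from fractions import Fraction
from itertools import product

def prob(event, space):
    """Probability of `event` (a predicate) when all outcomes in `space` are equally likely."""
    favourable = sum(1 for outcome in space if event(outcome))
    return Fraction(favourable, len(space))

one_die = list(range(1, 7))                   # sample space for a single throw
two_dice = list(product(one_die, repeat=2))   # 36 ordered pairs for two throws

p_five = prob(lambda x: x == 5, one_die)
p_odd = prob(lambda x: x % 2 == 1, one_die)
p_at_least_3 = prob(lambda x: x >= 3, one_die)
p_two_fives = prob(lambda t: t == (5, 5), two_dice)
p_sum_10 = prob(lambda t: sum(t) == 10, two_dice)
```

Using `Fraction` keeps the results exact, so they can be compared directly with answers worked out by hand.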

(5)

Some terminology

The sample space is a set Ω of elementary outcomes (Norwegian: “utfallsrom”).

For example, let Ω represent the outcome of throwing a die: Ω = {one, two, three, four, five, six}

Events (A, B, C, ...) are subsets of this set (Norwegian: “hendelse, begivenhet”),

such as {one}, {two, four, six}, {three, six}, etc.

A probability measure P is a function from events to the interval [0, 1].

P(A) is the probability of event A.

(6)

Some Basic Rules

The three axioms of probability theory...

P(A) ≥ 0 for all events A (non-negativity)

P(Ω) = 1 (unit measure)

A ∩ B = ∅ ⇒ P(A ∪ B) = P(A) + P(B) (additivity for disjoint events)

And some of their consequences

P(Ā) = 1 − P(A)

P(∅) = 0

If A ⊆ B then P(A) ≤ P(B)

(7)

Joint Probability and Independence

Joint probability

The joint probability of two events A and B is the probability of both events occurring, P(A ∩ B).

Example

a Throwing two dice: A is the event of the first being 3, B is the event of the second being 4 or 6.

b Throwing two dice: A is the event of the first being odd, B is the event of the sum of the two being even.

c Throwing one die: A is the event of it being odd, B is the event of it being ≥ 4.

Independence

Two events A and B are independent if P(A ∩ B) = P(A)P(B).
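Taking independence to mean the standard product condition P(A ∩ B) = P(A)P(B), the three dice examples can be tested by enumeration. This is a sketch only; the helper names `prob` and `independent` are hypothetical, and fair dice are assumed.

```python
from fractions import Fraction
from itertools import product

def prob(event, space):
    """Probability of a predicate `event` on a uniform sample space."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

def independent(a, b, space):
    """True if P(A ∩ B) equals P(A) * P(B) on a uniform sample space."""
    joint = prob(lambda o: a(o) and b(o), space)
    return joint == prob(a, space) * prob(b, space)

one_die = list(range(1, 7))
two_dice = list(product(one_die, repeat=2))

# (a) first die is 3; second die is 4 or 6
indep_a = independent(lambda t: t[0] == 3, lambda t: t[1] in (4, 6), two_dice)
# (b) first die is odd; the sum of the two is even
indep_b = independent(lambda t: t[0] % 2 == 1, lambda t: sum(t) % 2 == 0, two_dice)
# (c) one die: it is odd; it is at least 4
indep_c = independent(lambda x: x % 2 == 1, lambda x: x >= 4, one_die)
```

Under these assumptions (a) and (b) come out independent, while (c) does not: the only outcome in both events is 5, so the joint probability is 1/6 rather than 1/4.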

(8)

Conditional Probability and Independence

The notion of conditional probability lets us capture the influence of partial knowledge about the outcome.

The conditional probability of A given B is defined as

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

The probability that we will observe A given that we have already observed B.

The fraction of B’s probability mass that also covers A.
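The definition can be applied directly to the earlier two-throw question about getting 6 or more when the first die is 1 or 2. The sketch below assumes that "6 or more" refers to the sum of the two throws; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

two_dice = list(product(range(1, 7), repeat=2))   # uniform sample space, 36 outcomes

def prob(event):
    """Probability of a predicate `event` over two fair dice."""
    return Fraction(sum(1 for t in two_dice if event(t)), len(two_dice))

def cond_prob(a, b):
    """P(A | B) = P(A ∩ B) / P(B); only defined when P(B) > 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(lambda t: a(t) and b(t)) / p_b

# Assumed reading: A = "the sum is 6 or more", B = "the first die shows 1 or 2".
sum_at_least_6 = lambda t: sum(t) >= 6
first_is_1_or_2 = lambda t: t[0] in (1, 2)

p_conditional = cond_prob(sum_at_least_6, first_is_1_or_2)
p_unconditional = prob(sum_at_least_6)
```

Comparing `p_conditional` with `p_unconditional` shows how the partial knowledge about the first die lowers the probability of a high sum.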

(9)

Conditional Probability and Independence (cont’d)

The Multiplication Rule

The numerator in our equation for conditional probability can itself be “conditionalized” using the multiplication rule:

P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)

The Chain Rule

Generalizes the multiplication rule to multiple events:

P(A ∩ B ∩ C ∩ D ∩ ...) = P(A)P(B|A)P(C|A ∩ B)P(D|A ∩ B ∩ C) ...

Two events A and B are independent if P(A ∩ B) = P(A)P(B); when P(B) > 0 this is the same as P(A|B) = P(A).

(10)

Random Variables

Instead of dealing with the sample space directly (which will be different for each application), random variables provide an extra layer of abstraction that lets us deal with events in a more uniform way.

A discrete random variable X is a function from the sample space Ω into a finite or countably infinite set of values.

The probability mass function (pmf) p for a random variable X gives us the probability of a given numerical value:

$$p(x) = p(X = x) = P(\{\omega \in \Omega : X(\omega) = x\})$$

For a discrete random variable X, let $x_i$ be the value corresponding to the event $A_{x_i}$ in Ω. We then have that:

$$\sum_{x_i \in X} p(X = x_i) = \sum_{A_{x_i} \in \Omega} P(A_{x_i}) = P(\Omega) = 1$$
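As an illustration, take the sum of two fair dice as a random variable on the sample space of ordered pairs. The sketch below builds its pmf by collecting the outcomes that map to each value and checks that the masses add up to 1. The choice of random variable and all names are assumptions made for the example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # sample space: two throws

def X(outcome):
    """Random variable: maps an outcome (a pair of faces) to the sum of the faces."""
    return outcome[0] + outcome[1]

# p(x) = P({ω in Ω : X(ω) = x}) under the uniform measure on Ω
counts = Counter(X(w) for w in omega)
pmf = {x: Fraction(c, len(omega)) for x, c in counts.items()}

total_mass = sum(pmf.values())
```

Each value x of X corresponds to one event (the set of pairs with that sum), and these events partition Ω, which is why the total mass is exactly 1.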

(12)

Bayes' theorem

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$ (conditional probability)

P(A ∩ B) = P(A | B)P(B) (the product rule)

P(A ∩ B) = P(B | A)P(A)

P(A | B)P(B) = P(B | A)P(A)

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$$

(13)

Bayes' theorem

Example (from Wikipedia)

40% girls, 60% boys. All the boys wear trousers. 50% of the girls wear trousers.

What is the probability that someone wearing trousers is a girl?

Example (Solution with Bayes)

Probability that someone is a girl: P(A) = 0.4

Probability that a girl wears trousers: P(B | A) = 0.5

Probability of wearing trousers: P(B) = P(B | A)P(A) + P(B | Ā)P(Ā) = 0.5 × 0.4 + 1 × 0.6 = 0.8

Probability that someone wearing trousers is a girl: P(A | B) = P(B | A)P(A) / P(B) = 0.5 × 0.4 / 0.8 = 0.25
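The same calculation can be written out step by step. The numbers are the ones given in the example (40% girls, half of whom wear trousers; all boys wear trousers); the variable names are illustrative only.

```python
from fractions import Fraction

p_girl = Fraction(4, 10)                  # P(A): prior probability of "girl"
p_boy = 1 - p_girl                        # P(not A)
p_trousers_given_girl = Fraction(1, 2)    # P(B | A)
p_trousers_given_boy = Fraction(1)        # P(B | not A): all boys wear trousers

# Law of total probability: P(B) = P(B | A)P(A) + P(B | not A)P(not A)
p_trousers = p_trousers_given_girl * p_girl + p_trousers_given_boy * p_boy

# Bayes' theorem: P(A | B) = P(B | A)P(A) / P(B)
p_girl_given_trousers = p_trousers_given_girl * p_girl / p_trousers
```

Seeing the trousers lowers the probability of "girl" from the prior 0.4 to a posterior of 0.25.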

(14)

Some jargon

$$P(A \mid B) = \frac{P(B \mid A)}{P(B)} P(A)$$

P(A): prior probability

In the example: the probability that it is a girl before we know that she wears trousers.

P(A | B): posterior probability

In the example: the probability that it is a girl after we have learned that the person wears trousers.

(15)

Argmax notation

Definition (Argmax)

$x_0 = \arg\max_x f(x)$ means that $f(y) \le f(x_0)$ for all $y$.

Argmax and Bayes

$$x_0 = \arg\max_x P(x \mid B) = \arg\max_x \frac{P(B \mid x)}{P(B)} P(x) = \arg\max_x P(B \mid x)P(x) = \arg\max_x \left[\log P(B \mid x) + \log P(x)\right]$$
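The chain of equalities says that dropping the constant P(B) and taking logarithms never changes which x wins. A small sketch with made-up numbers (the priors, likelihoods and candidate names below are hypothetical) shows all three scoring functions selecting the same candidate.

```python
import math

# Hypothetical numbers, for illustration only.
prior = {"x1": 0.7, "x2": 0.2, "x3": 0.1}         # P(x)
likelihood = {"x1": 0.1, "x2": 0.6, "x3": 0.9}    # P(B | x)

def argmax(candidates, score):
    """Return the candidate with the highest score."""
    return max(candidates, key=score)

# P(B) is the same for every x, so it only rescales the scores.
p_b = sum(likelihood[x] * prior[x] for x in prior)

by_posterior = argmax(prior, lambda x: likelihood[x] * prior[x] / p_b)
by_product = argmax(prior, lambda x: likelihood[x] * prior[x])
by_log_sum = argmax(prior, lambda x: math.log(likelihood[x]) + math.log(prior[x]))
```

The log form is usually preferred in practice because sums of logs avoid the numerical underflow that long products of small probabilities cause.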

(16)

Some mathematical notation

Sum and product

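$$\sum_{i=1}^{n} a_i = a_1 + a_2 + \cdots + a_n$$

$$\prod_{i=1}^{n} a_i = a_1 \cdot a_2 \cdots a_n$$

Example

$$\sum_{i=1}^{7} i = 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28$$

$$\prod_{i=1}^{7} i = 1 \cdot 2 \cdot 3 \cdots 7 = 5040$$

$$\log\left(\prod_{i=1}^{n} a_i\right) = \sum_{i=1}^{n} \log a_i$$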
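The two worked examples, together with the standard identity that the logarithm of a product equals the sum of the logarithms, can be checked numerically. This is a minimal sketch using only the Python standard library (`math.prod` requires Python 3.8 or later).

```python
import math

values = list(range(1, 8))                        # 1, 2, ..., 7

total = sum(values)                               # sum of i for i = 1..7
prod_all = math.prod(values)                      # product of i for i = 1..7

log_of_product = math.log(prod_all)
sum_of_logs = sum(math.log(v) for v in values)    # equal up to rounding error
```

The last two quantities agree only up to floating-point rounding, so they should be compared with a tolerance (for example `math.isclose`) rather than with `==`.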

(18)

Language Modeling

A stochastic (or probabilistic) language model M is a model that assigns probabilities $P_M(x)$ to all strings x ∈ L.

Remember the properties of $P_M$:

L is the sample space

$0 \le P_M(x) \le 1$

$\sum_{x \in L} P_M(x) = 1$

Used in NLP:

1 Speech recognition

2 Machine translation

3 Spell checking (see Norvig on the net)

4 OCR

(19)

Example: MT

We are interested in finding the best English translation Ê of a Norwegian (foreign) sentence F.

$$\hat{E} = \arg\max_E P(E \mid F) = \arg\max_E \frac{P(F \mid E)}{P(F)} P(E) = \arg\max_E P(F \mid E)P(E)$$

We turn the problem around: we regard F as a translation (distortion) of E and ask which E.

A starting point for simplifications and approximations.

Access to P(E) = the language model.

(20)

Bigram model

By the chain rule of joint probabilities, the probability of a sequence of words can be factorized as:

$$p(w_1, \ldots, w_k) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_k \mid w_1, \ldots, w_{k-1})$$

However, modeling this full distribution would require too many parameters...

We simplify the estimation problem by the so-called Markov assumption: the probability of a given word is taken to depend only on the preceding word:

$$p(w_j \mid w_1, \ldots, w_{j-1}) = p(w_j \mid w_{j-1})$$

(21)

Bigram model

The probability of a string w1, ..., wk is just the product of its individual word probabilities, computed as:

$$p_n(w_1 \ldots w_k) = \prod_{i=1}^{k} p(w_i \mid w_{i-1})$$
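The product over bigram probabilities can be sketched as follows. The probability table is made up for illustration, and the start marker `"<s>"` standing in for the word before the first one is an assumption of this sketch, not something the formula above specifies. Log probabilities are summed to avoid underflow on long strings.

```python
import math

# Hypothetical bigram probabilities p(w_i | w_{i-1}), keyed as (previous, current).
# "<s>" is an assumed start-of-string marker.
bigram_p = {
    ("<s>", "i"): 0.5,
    ("i", "like"): 0.4,
    ("like", "bananas"): 0.1,
}

def string_log_prob(words, table):
    """Sum of log p(w_i | w_{i-1}); -inf if any bigram is missing from the table."""
    log_p = 0.0
    prev = "<s>"
    for word in words:
        p = table.get((prev, word), 0.0)
        if p == 0.0:
            return float("-inf")     # one unseen bigram zeroes the whole product
        log_p += math.log(p)
        prev = word
    return log_p

seen = string_log_prob(["i", "like", "bananas"], bigram_p)
unseen = string_log_prob(["bananas", "like", "i"], bigram_p)
```

The second call shows how a single bigram with probability zero drives the probability of the entire string to zero.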

(22)

N-gram model

In a trigram model the probability of a word may depend on the two preceding words:

$$p(w_j \mid w_1, \ldots, w_{j-1}) = p(w_j \mid w_{j-2} w_{j-1})$$

$$p_n(w_1 \ldots w_k) = \prod_{i=1}^{k} p(w_i \mid w_{i-2} w_{i-1})$$

In an N-gram model the probability of a word may depend on the N − 1 preceding words:

$$p(w_j \mid w_1, \ldots, w_{j-1}) = p(w_j \mid w_{j-N+1} \ldots w_{j-1}) = p(w_j \mid w_{j-N+1}^{j-1})$$

$$p_n(w_1 \ldots w_k) = \prod_{i=1}^{k} p(w_i \mid w_{i-N+1}^{i-1})$$

(23)

Estimation

How to get our hands on the probabilities?

$$P(\text{bananas} \mid \text{i, like}) = \frac{C(\text{i, like, bananas})}{C(\text{i, like}, *)}$$

The probabilities are given by the relative frequencies of the observed outcomes, i.e. as they occur in our training corpus.
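A relative-frequency estimate of this kind can be sketched with plain counting. The toy corpus and the function name `mle` are invented for the example; a real corpus would first need the tokenization and normalization decisions that depend on the application.

```python
from collections import Counter
from fractions import Fraction

# Toy training corpus, already tokenized and lower-cased (an assumption of this sketch).
corpus = "i like bananas . i like apples . i like bananas too .".split()

# C(w1, w2, w3): counts of every trigram in the corpus
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))

# C(w1, w2, *): how often the pair (w1, w2) is followed by any word
context_counts = Counter()
for (w1, w2, _w3), count in trigram_counts.items():
    context_counts[(w1, w2)] += count

def mle(w1, w2, w3):
    """Relative-frequency estimate of P(w3 | w1, w2); None if the context was never seen."""
    if context_counts[(w1, w2)] == 0:
        return None
    return Fraction(trigram_counts[(w1, w2, w3)], context_counts[(w1, w2)])

p_bananas = mle("i", "like", "bananas")
p_pears = mle("i", "like", "pears")
```

In this corpus "i like" occurs three times and is followed by "bananas" twice, so the estimate is 2/3, while any continuation that never occurred, such as "pears", gets probability zero.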

(24)

Some Considerations When Counting Words

What is to count as a word depends on the application.

Tokenization

Normalization of case, spelling variants, clitics, abbreviations, numerical expressions, punctuation...

Lemmatization. Base forms vs full word forms.

(25)

Evaluation

Extrinsic Evaluation

AKA application-based or end-to-end evaluation

Measure the effect on performance within the embedding application (e.g. the MT-system or the speech-recognizer that the LM is a part of).

Intrinsic

Quick and cheap evaluation based on the probability that the model assigns to unseen test data.

(26)

Training Data vs Test Data

We use the training set to estimate our models and use the test set to evaluate them.

Testing on the training data would give unrealistically high expectations about performance in a real application. Even with a perfect score when testing on the training data, we would probably see a drastic fall in performance on unseen data.

(27)

Problems

1 Data sparseness

Regardless of corpus size, perfectly acceptable phrases will be missing.

The creativity of language.

2 Unknown words

3 Zero counts (mixes badly with the factorization into word probabilities)

Some remedies

1 Make sure all n-grams receive a non-zero count (smoothing).

(28)

Smoothing

General idea: take some of the probability mass of frequent events, and redistribute it to less frequent or unseen events.

Simplest approach: Add-One smoothing:

$$\hat{P}(w_i \mid w_{i-n+1} \ldots w_{i-1}) = \frac{C(w_{i-n+1} \ldots w_{i-1} w_i) + 1}{C(w_{i-n+1} \ldots w_{i-1}) + N}$$

where N is the number of distinct words (the vocabulary size).
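For the bigram case (n = 2) the add-one estimate can be sketched as below. The toy corpus is invented, N is taken to be the number of distinct word types in it, and the history count is taken as the number of times a word is followed by another word, so that the smoothed probabilities for each history sum to 1. These choices are assumptions of the sketch.

```python
from collections import Counter
from fractions import Fraction

corpus = "i like bananas . i like apples .".split()   # toy training data
vocab = sorted(set(corpus))
N = len(vocab)                                         # vocabulary size

bigram_counts = Counter(zip(corpus, corpus[1:]))       # C(prev, word)
history_counts = Counter(corpus[:-1])                  # C(prev), counted as a history

def add_one(prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    return Fraction(bigram_counts[(prev, word)] + 1, history_counts[prev] + N)

p_seen = add_one("like", "bananas")     # bigram observed once
p_unseen = add_one("like", "i")         # bigram never observed, still non-zero
mass = sum(add_one("like", w) for w in vocab)
```

The unseen bigram now receives a small positive probability, paid for by a reduction in the probability of the observed ones (without smoothing, "bananas" after "like" would have had probability 1/2).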

Other LM techniques aimed at overcoming problems with unseen events:

Back-off and interpolated models

Class-based models