Generative models estimate a joint probability model p(class, document), which is used to determine the class that is most likely to have produced the document. The probability of the document given each of the available classes is calculated, and the class with the highest probability of producing that document is selected (Ng & Jordan, 2002).
$$\text{Class} = \underset{i}{\arg\max}\; p(\text{document} \mid \text{class}_i) \tag{23}$$
A.1.1 Bayesian classifiers. Bayesian classifiers are a type of generative model that, as the name implies, use Bayes’ rule to calculate the most likely class given the document, from probabilities that can be calculated.
$$p(\text{class}_i \mid \text{document}) = \frac{p(\text{document} \mid \text{class}_i)\, p(\text{class}_i)}{p(\text{document})} \tag{24}$$
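The class i for which this probability is highest is the class that is selected. Because p(document) is the same for every class, it does not affect which class scores highest and can be dropped, which simplifies the formula to: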
$$\text{Class} = \underset{i}{\arg\max}\; p(\text{document} \mid \text{class}_i)\, p(\text{class}_i) \tag{25}$$
The probability p(document|classi) is given by p(f1, ..., fn|classi), where f1, ..., fn refers to the words in a document. To make calculating this likelihood feasible, Bayesian models make strong assumptions about the data, not all of which hold for natural language. The most prominent, and most violated, assumption is that the features are independent. This assumption is made because the likelihood p(f1, ..., fn|classi) in
$$\text{Class} = \underset{i}{\arg\max}\; p(f_1, \ldots, f_n \mid \text{class}_i)\, p(\text{class}_i) \tag{26}$$
is computationally expensive to calculate, especially as texts, and therefore the set of words f1, ..., fn in a text, get longer. The independence assumption simplifies the formula as follows.
$$p(f_1, \ldots, f_n \mid \text{class}_i) = p(f_1 \mid \text{class}_i) \cdot p(f_2 \mid \text{class}_i) \cdot \ldots \cdot p(f_n \mid \text{class}_i) \tag{27}$$
The entire formula for the Bayesian classifier can then be simplified to:
$$\text{Class} = \underset{i}{\arg\max} \left( p(\text{class}_i) \prod_{j=1}^{n} p(f_j \mid \text{class}_i) \right) \tag{28}$$
This means that the model assumes that the words in a sentence are independent of each other: the probability of one word occurring is unrelated to the probabilities of other words occurring in that sentence. This assumption does not hold, as the words in a sentence must cohere to form a semantically logical sentence. Although the actual function approximation is poor, naive Bayesian models can classify natural language data relatively well (McCallum, Nigam, et al., 1998).
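The decision rule in equation 28 can be illustrated with a minimal Python sketch. The class names, the three-word vocabulary, and all probability values below are hypothetical toy numbers chosen for illustration; they are not taken from the thesis data. As an implementation choice of this sketch, the product is computed as a sum of logarithms, which selects the same class because the logarithm is monotonic.

```python
import math

# Hypothetical toy parameters (assumptions for illustration only).
priors = {"positive": 0.5, "negative": 0.5}
likelihoods = {
    "positive": {"great": 0.6, "boring": 0.1, "film": 0.3},
    "negative": {"great": 0.1, "boring": 0.6, "film": 0.3},
}


def classify(words, priors, likelihoods):
    """Return the class with the highest p(class) * product of p(word | class)."""
    scores = {}
    for cls, prior in priors.items():
        # Summing logs is equivalent to taking the log of the product in equation 28.
        score = math.log(prior)
        for word in words:
            # Assumes every word has a non-zero probability in the table.
            score += math.log(likelihoods[cls][word])
        scores[cls] = score
    return max(scores, key=scores.get)


print(classify(["great", "film"], priors, likelihoods))
print(classify(["boring", "film"], priors, likelihoods))
```

With these toy values the first document is assigned to the hypothetical class "positive" and the second to "negative", because the word "film" is equally likely under both classes and the remaining word decides the outcome.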
A.1.2 Training Bayesian classifiers. Bayesian classifiers are trained using maximum likelihood estimation. Frequency data is used to determine the probabilities of the documents given each class and the probabilities of the classes. This frequency data is obtained from a training dataset such as the collection of IMDb reviews used in this thesis.
The probability of a class is estimated by dividing the number of documents of that class by the total number of documents.
$$p(\text{class}_i) = \frac{N(\text{class}_i)}{N(\text{documents})} \tag{29}$$
The document probability given each class is built up from the word probabilities p(fj|classi). To estimate these, the number of times the word fj occurs in the documents of class i is divided by Vclassi, the total number of occurrences of all vocabulary words in the documents of class i.
$$p(f_j \mid \text{class}_i) = \frac{N(f_j, \text{class}_i)}{V_{\text{class}_i}} \tag{30}$$
The Bayesian classifier uses frequency information from the texts to give a class prediction. This information can be neatly represented in an array of numbers, a vector, where each dimension represents a word in the vocabulary. As vocabularies consist of hundreds to thousands of words (see section 3.1), the vectors can get quite large. To illustrate how this representation works, consider the following mini-vocabulary:
[ I, you, her, him, they, duck, goose, crouch, bend ] (31)
If the sentence ‘I saw her pet her duck’ is represented in terms of the frequencies of the words in the vocabulary, it yields the following vector:
[ 1 0 2 0 0 1 0 0 0 ] (32)
In this vector, each dimension represents a word from the vocabulary, and the values are determined by the word frequencies in the text. The example vocabulary and text are both very small, but both can be of any length. In practice, however, the length of the vector, and therefore the size of the vocabulary, is limited, as computation with longer vectors is expensive. This way of representing natural language as frequencies over a vocabulary is called the ‘bag of words’ representation.
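The bag of words representation, and the way the training estimates in equations 29 and 30 can be read off such vectors, might be sketched in Python as follows. This is a minimal sketch under stated assumptions: tokenisation is a plain whitespace split, the function names `bag_of_words` and `train` are hypothetical, the two extra sentences and the class labels "a" and "b" are invented toy data, and every class is assumed to contain at least one vocabulary word.

```python
from collections import Counter

VOCABULARY = ["I", "you", "her", "him", "they", "duck", "goose", "crouch", "bend"]


def bag_of_words(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary word occurs in a whitespace-split text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


def train(vectors, labels):
    """Estimate p(class) and p(word | class) from count vectors and their labels."""
    priors, likelihoods = {}, {}
    for cls in set(labels):
        class_vectors = [v for v, label in zip(vectors, labels) if label == cls]
        # Equation 29: documents of this class divided by all documents.
        priors[cls] = len(class_vectors) / len(vectors)
        # Per-word counts summed over all documents of this class.
        word_counts = [sum(column) for column in zip(*class_vectors)]
        total = sum(word_counts)
        # Equation 30: count of the word in the class divided by all word counts in the class.
        likelihoods[cls] = [count / total for count in word_counts]
    return priors, likelihoods


# Toy training data: the sentences other than the first and the labels are assumptions.
texts = ["I saw her pet her duck", "you saw him crouch", "they bend"]
labels = ["a", "a", "b"]
vectors = [bag_of_words(text) for text in texts]
priors, likelihoods = train(vectors, labels)

print(vectors[0])  # [1, 0, 2, 0, 0, 1, 0, 0, 0]
```

Words outside the vocabulary, such as ‘saw’ and ‘pet’, are simply ignored by this sketch, which matches the example vector in formula 32.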
If a Bayesian classifier is trained with such a standard bag of words representation, it is called a multinomial Bayesian classifier. This is opposed to the Bernoulli classifier, which functions in the same way but is trained on vectors of binary data. Each piece of text is represented by a vector that indicates, for each word in a predefined set of words (the vocabulary), whether that word occurs in the text. Using the mini-vocabulary from formula 31 and the Bernoulli representation, the example sentence ‘I saw her pet her duck’ would look as follows:
[ 1 0 1 0 0 1 0 0 0 ] (33)
In this very small example, some information is already lost because only the presence of a word is captured and its frequency is not. For longer texts, the loss of information due to the binary nature of this representation is significant. Because the Bernoulli model ignores frequency, it tends to misclassify longer texts in particular, on the basis of a single occurrence of a word that the model has found indicative of a certain class. It can for example “[...] assign an entire book to the class China because of a single occurrence of the term China” (Manning, Raghavan, Schütze, et al., 2008).
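The loss of frequency information can be made concrete with a small Python sketch of the binary representation. The function name `binary_vector`, the whitespace tokenisation, and the shortened sentence ‘I saw her duck’ are assumptions introduced for illustration only.

```python
VOCABULARY = ["I", "you", "her", "him", "they", "duck", "goose", "crouch", "bend"]


def binary_vector(text, vocabulary=VOCABULARY):
    """Mark each vocabulary word as present (1) or absent (0) in the text."""
    present = set(text.split())
    return [1 if word in present else 0 for word in vocabulary]


once = binary_vector("I saw her duck")           # 'her' occurs once
twice = binary_vector("I saw her pet her duck")  # 'her' occurs twice

print(twice)          # [1, 0, 1, 0, 0, 1, 0, 0, 0]
print(once == twice)  # True
```

Both sentences map to the same binary vector even though ‘her’ occurs a different number of times in each, so a model trained on this representation cannot distinguish them.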