5.6 Classification Algorithms
5.6.2 Multinomial Naive Bayes
The major feature of Multinomial Naive Bayes is that it captures the frequencies of words in a document [20]. Firstly, we start from the Bayes rule. The equation below shows the classical Bayes rule in the field of probability and mathematical statistics:
P (c|d) ∝ P (c) bb P (d|c) b
P (d) → bP (c) bP (d|c) (10) We use c to represent a class, and use d to represent a document. Therefore, P (c|d) represents the probability of a document d being in a class c. Actually calculating P (c|d) is the goal of our classification task. We have two classes, use c1 (CS) and c2
(non-CS) to represent them, now for each new test document di, we need to obtain
P (c1|di) and P (c2|di), compare which value is larger. Assume P (c1|di) > P (c2|di),
then we can conclude that di should belong to c1. Note that for each probability, we
add a superscript, that’s because all of these parameters are estimated.
Therefore, the problem is how to obtain P (c|d). We use Bayes rule to transform the conditional probability, transfer P (c|d) into calculating P (d|c), which is easier to
5. CLASSIFICATION SYSTEM METHODOLOGY
obtain. Note that in the Bayes rule equation, we neglect the probability P (d), since P (d) is a constant that we don’t need to care about, what we really need to calculate is P (c) and P (d|c). The meaning of P (d|c) is given class c has already happened, what is the probability of document d occurs. In order to calculate P (d|c), base on the bag of words model, we can split document d into a set of its words. The equation is shown below:
P (c|d) ∝ bP (c) bP (d|c) → bP (c) bP (t1· t2...tnd|c) (11)
The probability of document d occurs can be equalized to the probability of all the words in document d occur together. Assume document d has nd words (document
length is nd), then we separate each word, don’t need to consider the order or position
of different words, finally the probability P (d|c) can be equalized to P (t1· t2...tnd|c).
This is called the bag of words model probability transformation.
The final step of the probability calculation is using the independent assumption. In the field of probability and mathematical statistics, if two incidents A and B are independent, we can say that P (AB) = P (A) · P (B). Based on the independent assumption in Naive Bayes, we can separate the conditional probability for each words in a document. The equation is shown as follows:
P (c|d) ∝ bP (c) bP (t1· t2...tnd|c) → bP (c)
Y
16k6nd
b
P (tk|c) (12)
As we reach this final step, we transformed our goal P (c|d) into estimating P (c) and P (tk|c). P (c) can be called as the prior probability of class c, it is calculated
using the following equation:
P (c) = Nc
N (13)
Nc is the number of class c documents in the training data set, and N is the
number of all documents in the training data set. On the other hand, P (tk|c) is the
5. CLASSIFICATION SYSTEM METHODOLOGY
is equalized as the proportion of tk occurring in all class c training documents.
b P (tk|c) = Tctk + min(1, τ ) P ´ t∈V(Tc´t+ min(1, τ )) = Tctk + min(1, τ ) P ´ t∈V Tc´t+ B · min(1, τ ) τ = min{Tc´t|´t ∈ V } (14)
We use Tctk to represent the class term frequency of word tk. However, under fea-
ture weight normalized environment, Tctk maybe normalized by length normalization
and TF-iDF. V is the size of vocabulary (number of features) in the whole training set. If a word t appears in any of class c documents, add its class term frequency / or normalized weight; otherwise add 0 if the word doesn’t appear in any of class c documents. However, there is a special case that we need to take into consideration. If Tctk = 0, we need to prevent P (tk|c) = 0, which will cause the whole probability
P (c|d) becomes zero. Considering this case, a new test document dnew has word tk,
now we need to calculate the probability P (c|dnew), however, among all class c train-
ing documents, tk haven’t occurred; tk only occurred in the documents of another
class; thus Tctk = 0. In order to prevent such cases, the typical method is add-one
smoothing. Add one to the class term frequency of any words. As for the denomi- nator, we also need to add some values to ensure the balance. We follow the typical smoothing method that add the number of total different features (the size of feature set), which is represented by B. However, if feature weights are normalized, add-one smoothing maybe no longer appropriate, since the feature weight are not represented by CTF (class term frequency), instead, they are normalized to a much smaller value (most of them smaller than 1). Therefore, in the normalized environments, we need to replace 1 with the smallest feature weight value τ (note that τ > 0), as shown in equation 14.
There is one more special case that a word in test document never occur in any of the training documents, then this word will be discarded since we don’t have this word’s estimated conditional probability.
As conclusion, Multinomial Naive Bayes has two steps:
5. CLASSIFICATION SYSTEM METHODOLOGY
appeared in training data set.
2. Test: for a new document d, calculate P (ci|d) based on the estimated parame-
ters.
Algorithm 4 and algorithm 5 further illustrate the process of train and test Multi- nomial Naive Bayes in detail.
Algorithm 4 Train Multinomial NB Require: Doc set D, classList C
V ← getV ocabulary(D) N ← countDocs(D) for each class c ∈ C do
Nc← countDocsInClass(D, c)
prior[c] ← Nc N
Tc ← W ordsInClass(D, c)
for each each feature t ∈ V do Tct ← countT F inClass(D, c, t)
condP rob[t][c] ← (Tct+ min(1, τ )) (Tc+ len(V ) · min(1, τ ))
end for end for
Return prior, condP rob
Algorithm 5 Test Multinomial NB
Require: Test doc d, ClassList C, prior, condP rob, V Wd← getT okens(d)
for each class c ∈ C do score[c] ← log prior[c] for each t in Wd do
if t ∈ V then
score[c]+ = log condP rob[t][c] end if
end for end for
Return arg max
c∈C score[c]
We use this simple example in Table 12 to illustrate how Multinomial Naive Bayes works. Assume we have 4 papers in our training set, 3 papers belong to CS class, and 1 paper belongs to non-CS class. Each of the paper full texts are shown in the
5. CLASSIFICATION SYSTEM METHODOLOGY
Paper ID Paper Text Class Label
Training Set 1 Data Data Computer CS
2 Complexity Data Data CS
3 Software Data CS
4 Data Climate Business non-CS
Test Set 5 Climate Business Data Data Data ?
TABLE 12: Example of using Naive Bayes
table. Now we need to predict the class for a new test paper (d5): ”Climate Business
Data Data Data”. First we estimate the probability of d5 under class CS. Based on
the Multinomial Naive Bayes rules, we can calculate P (CS|d5) using the following
equation:
P (CS|d5) ∝ bP (CS) · bP (Climate|CS) · bP (Business|CS) · bP (Data|CS) · bP (Data|CS) ·
b
P (Data|CS)
First calculate the prior probability bP (CS) = 3
4 as the fraction of CS papers in the
training data set. Then, for each word in d5, estimate the word conditional probability under
CS class. Firstly, for the first word Climate, it appears 0 times in all the 3 CS class training
papers, so we estimate bP (Climate|CS) as 0 + 1
8 + 6 =
1
14. The total of the 3 CS papers’ length
is 8 (3+3+2), and among all 4 training papers there are 6 different words (6 features); thus, the denominator is 8+6. And the estimation process is the same for the other four words:
Business, Data, Data, Data, bP (Business|CS) = 0 + 1
8 + 6 = 1 14 P (Data|CS) =b 5 + 1 8 + 6 = 3 7.
Finally the probability of d5 under class CS is estimated as
3 4 · 1 14 · 1 14· ( 3 7) 3 ≈ 0.0003.
Accordingly, the probability of d5 under class nonCS is estimated as
1 4 · 2 9 · 2 9· ( 2 9) 3 ≈
0.0001. As a result, d5 has been classified as a CS paper.