Using Statistics to Identify Spam
3.6 Implementing the Naïve Bayes Classifier
  o = order(llrVals)
  llrVals = llrVals[o]
  isSpam = isSpam[o]
  idx = which(!isSpam)
  N = length(idx)
  list(error = (N:1)/N, values = llrVals[idx])
}
In essence, we have found a vectorized way to compute the Type I errors. We can compute the Type II errors similarly. We leave this as an exercise.
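As one possible take on that exercise, the Type II errors can be computed with the same sorting idea. The sketch below is only illustrative: the function name typeIIErrorRates is hypothetical, the toy data are invented, and it assumes that a spam message counts as misclassified when its LLR is at or below the threshold.

```r
# Hypothetical helper mirroring the Type I computation above.
# llrVals: numeric vector of LLR values; isSpam: logical vector of true labels.
typeIIErrorRates = function(llrVals, isSpam) {
  o = order(llrVals)          # sort messages by increasing LLR
  llrVals = llrVals[o]
  isSpam = isSpam[o]

  idx = which(isSpam)         # positions of the spam messages
  N = length(idx)
  # Assumption: with the threshold set at the i-th smallest spam LLR,
  # the i spam messages at or below it are classified as ham.
  list(error = seq_len(N)/N, values = llrVals[idx])
}

# Toy data, invented for illustration only
toy = typeIIErrorRates(c(-5, 3, -1, 7, 2),
                       c(FALSE, TRUE, TRUE, TRUE, FALSE))
```

Here toy$values holds the sorted spam LLR values and toy$error the fraction of spam at or below each of them, so the two vectors can be plotted against each other in the same way as the Type I errors.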
The plot in Figure 3.2 shows that a threshold of −43 looks reasonable: a Type I error rate of 0.01 coincides with τ = −43, and at that threshold our Type II error rate is 0.02. If we want a smaller Type I error, say 0.001, then we need to set the threshold at τ = 120, and that leads to a very high Type II error of 0.73, i.e., 73% of the spam is misclassified as ham.
We have used the test set here both to select the threshold τ and to evaluate the Type I and II errors for that threshold. As a result, the threshold we have chosen may work well with this particular test set but not with others, and the error rates we report may underestimate the true errors. Ideally, we select τ from other data, independent of our training and test data. To address this problem, we can apply the method of cross-validation. With cross-validation, we partition the training data into k parts at random. Then we use each of these parts in turn as a test set, computing the LLR values for the messages in this subset with the remaining data as the training set. We pool the LLR values from all k validation sets to select the threshold τ. In this case, when we use k = 5, we find that τ = −33 corresponds to a 1% Type I error. Finally, we apply this threshold to our original test set and find that, for τ = −33, the Type I error is 0.8% and the Type II error is 4%. We leave it as an exercise to carry out this cross-validation.
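The mechanics of the procedure can be sketched as follows. This is not the solution to the exercise: the names makeFolds, crossValidateLLR, chooseThreshold, and fakeLLR are hypothetical, the LLR values are simulated rather than computed from email, and choosing τ as a quantile of the pooled ham LLR values is one assumed way of hitting a target Type I error.

```r
# Assign each of n messages to one of k folds at random.
makeFolds = function(n, k) {
  sample(rep(1:k, length.out = n))
}

# Pool the LLR values computed on each held-out fold.
# llrFun(trainIdx, testIdx) is a hypothetical user-supplied function that
# fits the classifier on trainIdx and returns the LLRs for testIdx.
crossValidateLLR = function(n, k, llrFun) {
  folds = makeFolds(n, k)
  llr = numeric(n)
  for (f in 1:k) {
    testIdx = which(folds == f)
    llr[testIdx] = llrFun(which(folds != f), testIdx)
  }
  llr
}

# Pick tau so that about a fraction typeI of the ham has an LLR above it.
chooseThreshold = function(llr, isSpam, typeI = 0.01) {
  quantile(llr[!isSpam], probs = 1 - typeI, names = FALSE)
}

# Simulated stand-in for the real classifier, for illustration only
set.seed(1)
isSpam = rep(c(TRUE, FALSE), times = c(300, 700))
fakeLLR = function(trainIdx, testIdx) {
  rnorm(length(testIdx), mean = ifelse(isSpam[testIdx], 50, -100), sd = 40)
}

pooled = crossValidateLLR(length(isSpam), k = 5, llrFun = fakeLLR)
tau = chooseThreshold(pooled, isSpam)
mean(pooled[!isSpam] > tau)   # proportion of ham with LLR above tau
```

In a real run, fakeLLR would be replaced by code that estimates the word probabilities from the messages in trainIdx and evaluates the LLR for the messages in testIdx.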
This completes the naïve Bayes approach to spam classification using word vectors.
Before we turn to the second approach where we derive characteristics of email as variables to predict spam and ham, we briefly examine some of the computational considerations in calculating the LLR.
[Figure 3.2: error rate (vertical axis, 0 to 1) plotted against log odds (horizontal axis, −300 to 200), with one curve for "Classify Ham as Spam" and one for "Classify Spam as Ham". The threshold −43 is marked, where the Type I error is 0.01.]
Figure 3.2: Comparison of Type I and II Error Rates. The Type I and II error rates for the 3116 test messages are shown as a function of the threshold τ. For example, with a threshold of τ = −43, all messages with an LLR value above −43 are classified as spam and those below as ham. In this case, 1% of ham is misclassified as spam and 2% of spam is misclassified as ham.
3.6.4 Computational Considerations
In computing the log likelihood ratio for a message, we used the following representation of this quantity to guide how we wrote the code:
\[
\sum_{\text{words in message}} \bigl[ \log P(\text{word present} \mid \text{spam}) - \log P(\text{word present} \mid \text{ham}) \bigr]
+ \sum_{\text{words not in message}} \bigl[ \log P(\text{word absent} \mid \text{spam}) - \log P(\text{word absent} \mid \text{ham}) \bigr]
\]
In other words, we first computed, from the observed proportions in our training set, estimates of P(word present | spam), P(word absent | spam), P(word present | ham), and P(word absent | ham). Then we took logs of these estimated probabilities and combined them to calculate the LLR for a particular message. That is, we selected which of these terms to include in the above sum according to whether each word in the bag of words was present in or absent from that message. Given that our bag of words consists of more than 80,000 words, we want to consider whether there are faster or more accurate ways to carry out these computations.
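A minimal sketch of this calculation on a toy bag of four words may make the selection of terms concrete. The probabilities below are invented, and the names pPresentSpam, pPresentHam, logPresent, logAbsent, and toyMsgLLR are hypothetical. This is not the code developed earlier in the chapter.

```r
# Toy estimates for a 4-word bag of words (numbers invented for illustration)
pPresentSpam = c(free = 0.60, meeting = 0.05, money = 0.30, lunch = 0.02)
pPresentHam  = c(free = 0.10, meeting = 0.40, money = 0.01, lunch = 0.20)

# Take logs once, for both the "present" and the "absent" events
logPresent = log(pPresentSpam) - log(pPresentHam)
logAbsent  = log(1 - pPresentSpam) - log(1 - pPresentHam)

# Hypothetical helper: msgWords is a character vector of the words in a message
toyMsgLLR = function(msgWords) {
  present = names(logPresent) %in% msgWords
  # "present" terms for words in the message, "absent" terms for the rest
  sum(logPresent[present]) + sum(logAbsent[!present])
}

toyMsgLLR(c("free", "money"))     # positive: looks like spam
toyMsgLLR(c("meeting", "lunch"))  # negative: looks like ham
```

The logical vector present plays the role of the two index sets in the sum: every word in the bag contributes exactly one term, chosen by whether it occurs in the message.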
The following are equivalent representations of the log likelihood ratio:
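\[
\begin{aligned}
\mathrm{LLR} &= \log \prod_{\text{words in message}} \frac{P(\text{word present} \mid \text{spam})}{P(\text{word present} \mid \text{ham})}
 + \log \prod_{\text{words not in message}} \frac{P(\text{word absent} \mid \text{spam})}{P(\text{word absent} \mid \text{ham})} \\
&= \log \left( \frac{\prod_{\text{in msg}} P(\text{word present} \mid \text{spam})}{\prod_{\text{in msg}} P(\text{word present} \mid \text{ham})} \right)
 + \log \left( \frac{\prod_{\text{not in msg}} P(\text{word absent} \mid \text{spam})}{\prod_{\text{not in msg}} P(\text{word absent} \mid \text{ham})} \right) \\
&\propto \log \left( \frac{\prod_{\text{in msg}} \#\text{spam with word}}{\prod_{\text{in msg}} \#\text{ham with word}}
 \times \frac{\prod_{\text{not in msg}} \#\text{spam without word}}{\prod_{\text{not in msg}} \#\text{ham without word}} \right)
\end{aligned}
\]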
These alternative mathematical expressions each suggest a different approach to carrying out the computation of the log odds. We leave it as an exercise to write code for them and compare the results to our approach.
Why might these various alternatives not give us the same answer? A computer is a finite state machine, meaning that it has only a fixed amount of space to store a number, so some numbers, e.g., irrational numbers, can only be approximated. Additionally, the order of operations can matter. For example, if we have one large number and many small numbers, then adding up all of the small numbers first and then adding this total to the large number can produce a more accurate result than adding the small numbers one at a time to the large number. Below is an artificial example that makes this point:
smallNums = rep((1/2)^40, 2000000)
largeNum = 10000
print(sum(smallNums), digits = 20)
[1] 1.8189894035458564758e-06
print(largeNum + sum(smallNums), digits = 20)
[1] 10000.000001818989404
for (i in 1:length(smallNums)) {
  largeNum = largeNum + smallNums[i]
}
print(largeNum, digits = 20)
[1] 10000
In our case, we are working with thousands of small numbers, such as the proportion of spam that contains a particular word. It might matter quite a bit how we compute the LLR.
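To illustrate how much it can matter, the sketch below compares two of the equivalent forms of the LLR on made-up probabilities. For brevity it uses a single set of terms rather than separate "present" and "absent" sets, and the function names llrSumOfLogs and llrLogOfProducts are hypothetical. It relies only on the fact that a product of thousands of small doubles underflows to zero.

```r
# Form 1: sum of differences of logs. Form 2: log of a ratio of products.
llrSumOfLogs = function(pSpam, pHam) sum(log(pSpam) - log(pHam))
llrLogOfProducts = function(pSpam, pHam) log(prod(pSpam) / prod(pHam))

# With a handful of words, the two forms agree up to rounding
pSpam = c(0.60, 0.05, 0.30, 0.02, 0.10)
pHam  = c(0.10, 0.40, 0.01, 0.20, 0.10)
all.equal(llrSumOfLogs(pSpam, pHam), llrLogOfProducts(pSpam, pHam))

# With thousands of small probabilities, both products underflow to 0,
# so the ratio is 0/0 and the product form returns NaN
manySpam = rep(0.02, 5000)
manyHam  = rep(0.01, 5000)
llrSumOfLogs(manySpam, manyHam)       # finite, about 5000 * log(2)
llrLogOfProducts(manySpam, manyHam)   # NaN
```

Working on the log scale throughout avoids the underflow, which is one reason to prefer sums of logs when the bag of words is large.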