Using Statistics to Identify Spam
3.6 Implementing the Naïve Bayes Classifier
We apply this function to our training data with
trainTable = computeFreqs(trainMsgWords, trainIsSpam)
Now, trainTable can be used to construct the log likelihood ratio for a new message. That is, we select values from the matrix trainTable that correspond to the words that appear in a new message and the words that are absent from the message, and we use these to compute the log likelihood ratio for the message. This value is then used to classify the message as spam or ham. We do this next.
3.6.3 Classifying New Messages
The trainTable object has all of the individual word probabilities needed to construct the log likelihood ratio for a message. To do this, we take the log odds from the “present” row of trainTable for each word appearing in the message and, similarly, the log odds from the “absent” row of the table for every word in the bag of words that does not appear in the message. We sum these to obtain the log likelihood ratio that the message is spam versus ham:
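$$
\sum_{\text{words in message}} \bigl[ \log P(\text{word present} \mid \text{spam}) - \log P(\text{word present} \mid \text{ham}) \bigr]
+ \sum_{\text{words not in message}} \bigl[ \log P(\text{word absent} \mid \text{spam}) - \log P(\text{word absent} \mid \text{ham}) \bigr]
$$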
For example, consider the set of words in the first message in testMsgWords,
newMsg = testMsgWords[[1]]
There is the possibility that a test message contains a word that is not in the bag of words.
When this happens, we do not include it in our calculation as we have no information about the likelihood that a message with this word is spam or ham. We drop these new words from newMsg with
newMsg = newMsg[!is.na(match(newMsg, colnames(trainTable)))]
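The same filtering can also be written with %in%, which base R defines in terms of match(). The following is a minimal sketch on made-up data; the vectors vocab and msg are hypothetical stand-ins for colnames(trainTable) and newMsg.

```r
# Hypothetical bag of words and message (not from the chapter's data).
vocab = c("free", "meeting", "money", "report")
msg = c("free", "money", "viagra", "free")

# The chapter's approach: keep words that match() finds in the vocabulary.
keepMatch = msg[!is.na(match(msg, vocab))]

# An equivalent formulation using %in%.
keepIn = msg[msg %in% vocab]

# Both drop "viagra", the only word missing from vocab.
print(keepMatch)
print(identical(keepMatch, keepIn))
```

Either form leaves repeated words in the message, which is harmless here because the next step only asks whether each vocabulary word occurs in the message at all.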
For the remaining words that are in newMsg, we locate the columns in the frequency table that contain them with the logical vector:
present = colnames(trainTable) %in% newMsg
Then we compute the log of the ratio of the probability a message is spam versus ham with
sum(trainTable["presentLogOdds", present]) +
  sum(trainTable["absentLogOdds", !present])
[1] 255
We know the first message in testMsgWords is spam, and we see the log likelihood ratio computed for it is large and positive, indicating spam. We can try a test ham message as well, e.g.,
newMsg = testMsgWords[[ which(!testIsSpam)[1] ]]
newMsg = newMsg[!is.na(match(newMsg, colnames(trainTable)))]
present = (colnames(trainTable) %in% newMsg)
sum(trainTable["presentLogOdds", present]) +
  sum(trainTable["absentLogOdds", !present])
[1] -125
This message has a large negative value, which indicates it is ham.
We place this simple code into a function so that we can calculate the log likelihood ratio (LLR) for all of the test messages. Our function, computeMsgLLR(), appears as
computeMsgLLR = function(words, freqTable)
{
  # Discard words not in the training data.
  words = words[!is.na(match(words, colnames(freqTable)))]

  # Find which words in the bag of words are present in the message.
  present = colnames(freqTable) %in% words

  # Sum the log odds for the present and the absent words.
  sum(freqTable["presentLogOdds", present]) +
    sum(freqTable["absentLogOdds", !present])
}
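To see the arithmetic on a scale that can be checked by hand, here is a small sketch that applies the function to a made-up three-word table. The matrix toyTable and its log odds values are invented for illustration and are not derived from the training data.

```r
computeMsgLLR = function(words, freqTable)
{
  words = words[!is.na(match(words, colnames(freqTable)))]
  present = colnames(freqTable) %in% words
  sum(freqTable["presentLogOdds", present]) +
    sum(freqTable["absentLogOdds", !present])
}

# Hypothetical table: one column per word, filled column by column.
toyTable = matrix(c( 2.0, -0.5,    # free
                    -1.5,  0.3,    # meeting
                     1.0, -0.2),   # money
                  nrow = 2,
                  dimnames = list(c("presentLogOdds", "absentLogOdds"),
                                  c("free", "meeting", "money")))

# "unknownword" is dropped; free and money are present, meeting is absent:
# 2.0 + 1.0 + 0.3 = 3.3
spamLike = computeMsgLLR(c("free", "money", "unknownword"), toyTable)

# meeting is present, free and money are absent:
# -1.5 + (-0.5) + (-0.2) = -2.2
hamLike = computeMsgLLR(c("meeting"), toyTable)

print(c(spamLike, hamLike))
```

Every word in the bag of words contributes exactly one term to the sum, from either the present row or the absent row.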
We apply this function to each of the messages in our test set with
testLLR = sapply(testMsgWords, computeMsgLLR, trainTable)
We want to use these values to classify the test messages as spam or ham. A value that is positive indicates spam is more likely and a negative value indicates ham is more likely, but we are free to choose some other value as a threshold for classification.
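As a sketch of what classifying by a threshold looks like in code, the following compares LLR values to a cut-off and cross-tabulates the result against the known labels. The vectors llr and isSpam are small made-up examples, not the chapter's testLLR and testIsSpam.

```r
# Hypothetical LLR values and hand-classified labels.
llr = c(-120, -95, 15, 40, 250, -30)
isSpam = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)

# Classify as spam when the LLR exceeds the threshold.
tau = 0
predictedSpam = llr > tau

# Rows are the true labels, columns are the predicted labels.
confusion = table(actual = isSpam, predicted = predictedSpam)
print(confusion)
```

The off-diagonal cells of the table count the two kinds of misclassification: ham labeled as spam and spam labeled as ham.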
We compare the summary statistics of the LLR values for the ham and spam in the test data with
tapply(testLLR, testIsSpam, summary)
$`FALSE`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -1360    -127    -102    -117     -82     700

$`TRUE`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    -61       7      50     138     131   23600
We see from these statistics and the boxplots in Figure 3.1 that there is a good deal of separation of the ham and spam.
[Figure 3.1 image: boxplots for ham and spam; axes: message type (ham, spam) and Log Likelihood Ratio (ticks at −400, −200, 0, 200, 400).]
Figure 3.1: Boxplot of Log Likelihood Ratio for Spam and Ham. The log likelihood ratio, log(P(spam | message content)/P(ham | message content)), for 3116 test messages was computed using a naïve Bayes approximation based on word frequencies found in manually classified training data. The test messages are grouped according to whether they are spam or ham. Notice most ham messages have values well below 0 and nearly all spam values are above 0.
We have 3116 LLR values, one for each test message, and we need to decide on a cut-off τ, where we classify a message as spam or ham according to whether or not its LLR exceeds this threshold. We assess the choice of τ using our test data. That is, we find the proportion of ham messages in the test set with LLR values that exceed the threshold and so are misclassified as spam. This is the Type I error rate for the test data. Likewise, we find the proportion of spam messages in the test set with LLR values below the threshold and so misclassified as ham, which is the Type II error rate.
We can write a simple R function to compute the rate of misclassification of ham as spam for a particular value of τ. This function takes 3 inputs: the value of τ, the vector of LLR values for the test messages, and the hand-classified type of each message (spam or ham). This function appears as
typeIErrorRate =
function(tau, llrVals, spam)
{
  classify = llrVals > tau
  sum(classify & !spam)/sum(!spam)
}
Note that we divide by the number of ham messages, not by the total number of messages, because only ham messages can contribute to a Type I error.
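A companion function for the Type II error rate can be sketched along the same lines. The name typeIIErrorRate and the toy vectors below are assumptions made for illustration; the function simply mirrors typeIErrorRate(), counting spam messages that are not classified as spam and dividing by the number of spam messages.

```r
# Hypothetical counterpart to typeIErrorRate(): proportion of spam
# messages whose LLR does not exceed tau and so are classified as ham.
typeIIErrorRate =
function(tau, llrVals, spam)
{
  classify = llrVals > tau
  sum(!classify & spam)/sum(spam)
}

# Made-up LLR values and labels.
llr = c(-120, -95, 15, 40, 250, -30)
isSpam = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)

# At tau = 0 one of the three spam messages (LLR = -30) is missed.
rateAt0 = typeIIErrorRate(0, llr, isSpam)

# At tau = 100 two of the three are missed (LLR = -30 and 40).
rateAt100 = typeIIErrorRate(100, llr, isSpam)

print(c(rateAt0, rateAt100))
```

Raising the threshold can only increase this rate, which is the trade-off against the Type I error.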
The typeIErrorRate() function is not vectorized in its argument, tau. For example, in order to find the τ that yields a 0.5% Type I error rate, we examine the boxplots in Figure 3.1.
From the plot we make an initial guess that τ = 0. We use typeIErrorRate() to calculate the Type I error with this threshold for the test messages, and find it is about 0.3%. Then, we calculate the error for a few τ values below 0 and find that for τ = −20 we get an error rate of about 0.5%, i.e.,
typeIErrorRate(0, testLLR, testIsSpam)
[1] 0.0035
typeIErrorRate(-20, testLLR, testIsSpam)
[1] 0.0056
Typically, we want to find the error rate for a vector of τ values because we want to find one that provides an acceptable Type I error. In its current form, if we want to use typeIErrorRate() to calculate the Type I error for a vector of values, we need a loop in the form of an sapply() call.
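Such an sapply() call might look like the following sketch. The vectors llr, isSpam, and taus are made-up examples, and the extra arguments are passed by name so that each element of taus is matched to the tau parameter.

```r
typeIErrorRate =
function(tau, llrVals, spam)
{
  classify = llrVals > tau
  sum(classify & !spam)/sum(!spam)
}

# Hypothetical LLR values and labels; the ham LLRs are -120, -95, and 15.
llr = c(-120, -95, 15, 40, 250, -30)
isSpam = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)

# Candidate thresholds to evaluate.
taus = c(-200, -100, 0, 20)

# One call to typeIErrorRate() per threshold.
rates = sapply(taus, typeIErrorRate, llrVals = llr, spam = isSpam)
print(rates)
```

Each call rescans the full vector of LLR values, so the cost grows with the number of thresholds times the number of messages.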
In theory, to select a threshold, we need to search over all possible values of τ . However, it should be clear after a little thought that we can at least restrict the interval. Any value of τ less than the minimum of the LLR values means that we classify all messages as spam and the Type I error rate is 1. Similarly, any value of τ greater than the maximum of the LLR values implies that we classify every message in our sample as ham so our Type I error rate is 0. Additionally, we need to keep in mind that there are also errors in misclassifying spam as ham. The Type II error is 1 when we use the largest observed LLR value in our test set because all spam is classified as ham, which is clearly not acceptable either.
We also note that the Type I error rate only changes at values of τ that match one of the observed LLR values in our set of messages. That is, for 2 values of τ, say τ1 and τ2, if there are no LLR values from the test set between them, then their associated Type I errors must be the same. Likewise, the Type II error rates for τ1 and τ2 are the same. This means that we only need to compute the error rate at the 3116 LLR values for the test messages.
Our estimate of the Type I error rate is a step function and only changes at the observed LLR values. We can do even better than this to reduce the set of possible τ values that we search over. Not all LLR values can cause a change in the Type I error. Only the values corresponding to ham messages affect the Type I error because messages that are spam do not contribute to it.
These observations about the Type I and II error rates for our test messages imply that we can determine the error rates as a function of τ more conveniently and efficiently. The following function does this by looking only at the llrVals values for ham messages and recognizing that the number of Type I errors decreases by 1 at each of these values and so is i/(number of ham messages). Note that the function ignores ties for the ratios, but these are unlikely since they should be unique. Our function is defined as
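typeIErrorRates =
function(llrVals, isSpam)
{
  # Sort the LLR values and reorder the labels to match.
  o = order(llrVals)
  llrVals = llrVals[o]
  isSpam = isSpam[o]

  # Positions of the ham messages in the sorted values.
  idx = which(!isSpam)
  N = length(idx)
  list(error = (N:1)/N, values = llrVals[idx])
}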
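A small usage sketch on made-up data shows what the function returns. The vectors llr and isSpam are hypothetical. The check at the end reflects one reading of the output: with no ties, the rate paired with each ham LLR value equals the proportion of ham messages at or above that value, so it corresponds to a threshold placed just below that value.

```r
typeIErrorRates =
function(llrVals, isSpam)
{
  o = order(llrVals)
  llrVals = llrVals[o]
  isSpam = isSpam[o]
  idx = which(!isSpam)
  N = length(idx)
  list(error = (N:1)/N, values = llrVals[idx])
}

# Hypothetical LLR values and labels; the ham LLRs are -120, -95, and 15.
llr = c(-120, -95, 15, 40, 250, -30)
isSpam = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)

res = typeIErrorRates(llr, isSpam)
print(res$values)   # sorted ham LLR values
print(res$error)    # rates 3/3, 2/3, 1/3

# Compare with a direct count of ham messages at or above each value.
hamLLR = llr[!isSpam]
atOrAbove = sapply(res$values, function(v) mean(hamLLR >= v))
print(isTRUE(all.equal(res$error, atOrAbove)))
```

A single sort replaces the repeated scans of the sapply() approach, and plotting res$error against res$values traces the step function described above.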
In essence, we have found a vectorized way to compute the Type I errors. We can compute the Type II errors similarly. We leave this as an exercise.
The plot in Figure 3.2 shows that a threshold of −43 looks reasonable. A Type I error rate of 0.01 coincides with τ = −43, and the corresponding Type II error rate is 0.02. If we want a smaller Type I error, say 0.001, then we need to set the threshold at τ = 120, which leads to a very high Type II error of 0.73, i.e., 73% of the spam is misclassified as ham.
We have used the test set here to both select the threshold τ and evaluate the Type I and II errors for that threshold. The implication of this is that the threshold we have chosen may work well with this particular test set but not others, and it may underestimate the size of the errors. Ideally we select τ from other data, independent of our training and test data. To address this problem, we can apply the method of cross-validation. With cross-validation, we partition the training data into k parts at random. Then we use each of these parts to act as a test set and compute the LLR values for the messages in this subset using the remaining data as a training set. We pool all of these LLR values from all k validation sets to select the threshold τ . In this case, when we use k = 5 we find that τ = −33 corresponds to a 1% Type I error. Finally, we apply this threshold to our original test set and find for τ = −33, the Type I error is 0.8% and the Type II error is 4%. We leave it as an exercise to carry out this cross-validation.
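The partition-and-pool step of cross-validation can be outlined as follows. This is only a sketch under stated assumptions: cvPoolLLR, toyFit, and toyScore are hypothetical names, and the toy fitting and scoring functions are crude stand-ins so that the example runs on its own. In the chapter's setting, one would presumably pass functions such as computeFreqs() and computeMsgLLR() instead.

```r
# Hypothetical helper: assign each message to one of k folds at random,
# fit on the other folds, and score the held-out messages.
cvPoolLLR = function(msgWords, isSpam, k, fitFun, scoreFun)
{
  n = length(msgWords)
  fold = sample(rep(seq_len(k), length.out = n))
  llr = numeric(n)
  for (j in seq_len(k)) {
    held = fold == j
    fit = fitFun(msgWords[!held], isSpam[!held])
    llr[held] = sapply(msgWords[held], scoreFun, fit)
  }
  llr
}

# Toy stand-ins (assumptions, not the naive Bayes computations).
toyFit = function(words, spam) unique(unlist(words[spam]))
toyScore = function(words, fit) sum(words %in% fit) - 1

toyWords = list(c("free", "money"), c("meeting", "report"),
                c("free", "offer"), c("report", "lunch"),
                c("money", "offer"), c("meeting", "lunch"),
                c("free", "win"), c("report", "notes"),
                c("win", "money"), c("lunch", "notes"))
toySpam = rep(c(TRUE, FALSE), 5)

set.seed(42)  # make the random partition reproducible
pooledLLR = cvPoolLLR(toyWords, toySpam, k = 5, toyFit, toyScore)

# A threshold aimed at roughly a 1% Type I error is an upper quantile
# of the pooled scores for ham.
tau = quantile(pooledLLR[!toySpam], 0.99)
print(tau)
```

Because every message is scored by a model that never saw it, the pooled values behave like scores on new data, and the threshold chosen from them can then be evaluated once on the original test set.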
This completes the naïve Bayes approach to spam classification using word vectors.
Before we turn to the second approach where we derive characteristics of email as variables to predict spam and ham, we briefly examine some of the computational considerations in calculating the LLR.