Naïve Bayes is the simplest classification algorithm among Bayesian classification methods. To train it, we only need to learn the probabilities, under the assumption that the attributes are independent of each other given the class; this is why the model is described as an independent feature model. Naïve Bayes is widely used in text classification because the algorithm can be trained easily and efficiently. With it, we can calculate the probability of a condition A given B, written P(A|B), if we already know the probability of B given A, written P(B|A), together with the individual probabilities P(A) and P(B), as shown in the preceding Bayes Theorem.
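As a minimal numeric sketch of the theorem, the following snippet computes P(A|B) from P(B|A), P(A), and P(B). The function name bayes and the probability values are made up for illustration; they are not taken from any dataset.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be greater than zero")
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: A = "message is spam", B = "subject contains 'free'"
p_spam = 0.4              # P(A)
p_free = 0.25             # P(B)
p_free_given_spam = 0.5   # P(B|A)

p_spam_given_free = bayes(p_free_given_spam, p_spam, p_free)
print(p_spam_given_free)
```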
E-mail subject line tester
An e-mail subject line tester is a simple program that decides whether a given e-mail subject line is spam. In this chapter, we will program a Naïve Bayes classifier from scratch. The example classifies subject lines as spam or not spam using very simple code. It does so by breaking each subject line into a list of relevant words, which are used as the feature vectors of the algorithm. For this, we will use the SpamAssassin public dataset. SpamAssassin includes three categories: spam, easy ham, and hard ham. In this case, we will create a binary classifier with two classes, spam and not spam (easy ham).
There are several features that we can use for our classifier such as the precedence, the language, and the use of upper case. We will keep things simple and use the frequency of only those words which consist of more than three characters, avoiding words such as The or RT, when training the algorithm.
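The word filter described above could be sketched as follows. The function name extract_words and its min_length parameter are hypothetical, and splitting on every character that is not a letter or digit is an assumption about what counts as a word.

```python
import re

def extract_words(subject, min_length=4):
    """Split a subject line into lowercase words of at least min_length characters."""
    # Split on anything that is not a lowercase letter or a digit
    tokens = re.split(r"[^a-z0-9]+", subject.lower())
    # Words of three characters or fewer (and empty strings) are dropped
    return [t for t in tokens if len(t) >= min_length]

print(extract_words("Re: The Tiny DNS Swap"))
```

With the default min_length of 4, short tokens such as Re, The, and DNS are discarded and only the longer words remain as features.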
We will implement the Bayes rule, using the words and categories, as shown in the following equation:
$$P(\text{word} \mid \text{category}) = \frac{P(\text{category} \mid \text{word})\,P(\text{word})}{P(\text{category})}$$
For more information about probability distributions, please refer to http://en.wikipedia.org/wiki/Probability_distribution. Here, we have two categories, which represent whether a subject line is spam or not. We need to split the texts into a list of words in order to get the likelihood of each word. Once we know the probability of each word, we need to multiply the probabilities for each category as shown in the following equation:
$$P(\text{category} \mid \text{word}_1, \ldots, \text{word}_n) \propto P(\text{category}) \prod_{i=1}^{n} P(\text{word}_i \mid \text{category})$$
In other words, we multiply the likelihood P(word|category) of each word in the subject line by the probability of the category, P(category).
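The multiplication can be sketched in a few lines. The likelihood values, the priors, and the function name category_score below are hypothetical and chosen only to make the arithmetic easy to follow. For long texts, sums of logarithms are commonly used instead of a raw product to avoid numerical underflow.

```python
from math import prod  # available in Python 3.8 and later

def category_score(words, likelihoods, prior):
    """Return P(category) multiplied by P(word|category) for every word."""
    return prior * prod(likelihoods[w] for w in words)

# Hypothetical likelihoods P(word|category), for illustration only
spam_likelihoods = {"insurance": 0.5, "life": 0.2}
nospam_likelihoods = {"insurance": 0.1, "life": 0.4}

words = ["life", "insurance"]
spam_score = category_score(words, spam_likelihoods, prior=0.5)
nospam_score = category_score(words, nospam_likelihoods, prior=0.5)

# The category with the larger score is the prediction
predicted = "spam" if spam_score > nospam_score else "nospam"
print(predicted)
```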
To train the algorithm, we need to provide some prior examples. In this case, we will use the training() function, which takes a dictionary of subject lines and categories, as we can see in the following table:
Subject line | Category
Re: Tiny DNS Swap | nospam
Save up to 70% on international calls! | nospam
[Ximian Updates] Hyperlink handling in Gaim allows arbitrary code to be executed | nospam
Promises. | nospam
Life Insurance - Why Pay More? | spam
[ILUG] Guaranteed to lose 10-12 lbs in 30 days 10.206 | spam
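To make the shape of the training data concrete, here is a rough sketch of a counting step that takes such a dictionary. It is not the chapter's training() function: the name train, the regular-expression tokenizer, and the returned structures are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

def train(examples):
    """Count categories and per-category word frequencies.

    examples maps each subject line to its category.
    """
    category_counts = Counter()
    word_counts = defaultdict(Counter)
    for subject, category in examples.items():
        category_counts[category] += 1
        for word in re.findall(r"[a-z0-9]+", subject.lower()):
            if len(word) > 3:  # keep only words longer than three characters
                word_counts[category][word] += 1
    return category_counts, word_counts

# Three rows taken from the table above
examples = {
    "Re: Tiny DNS Swap": "nospam",
    "Promises.": "nospam",
    "Life Insurance - Why Pay More?": "spam",
}
category_counts, word_counts = train(examples)

# Prior P(category): share of training examples in each category
total = sum(category_counts.values())
priors = {c: n / total for c, n in category_counts.items()}
print(priors)
print(word_counts["spam"])
```

A dictionary keyed by subject line keeps the example short, but it silently drops duplicate subjects; a list of (subject, category) pairs would avoid that.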
The data
We can find the spam dataset at http://spamassassin.apache.org/. In the following screenshot, we can see the easy ham (not spam) folder with 2551 files:
The spam text looks similar to the following screenshot and may include both HTML tags and plain text. In this case, we are only interested in the subject line, so we need to write code to obtain the subject from all the files.
This example will show how to preprocess the SpamAssassin data, using Python, in order to collect all the subject lines from the e-mails.
First, we need to import the os module, in order to get the list of filenames using the listdir function from the \spam and \easy_ham folders:
import os
files = os.listdir("\\spam")  # filenames of the spam e-mails
We will need a new file to store the subject lines and category (spam or not spam), but this time we will use a comma as a separator:
with open("SubjectsSpam.out","a") as out:
    category = "spam"
Now, we will parse each file and get the subject. Finally, we write the subject and the category to the new file, deleting all the commas from the subject lines (line.replace(",", "")) to avoid future trouble with the CSV format:
    for fname in files:
        with open("\\spam\\" + fname) as f:
            data = f.readlines()
        for line in data:
            if line.startswith("Subject:"):
                # str.replace returns a new string, so keep the result
                line = line.replace(",", "")
                print(line)
                out.write("{0}, {1} \n".format(line[8:-1], category))
We use line[8:-1] to skip the word Subject: (8 characters long) and the newline at the end of the line (-1).
Output:
>>>Hosting from ?6.50 per month
>>>Want to go on a date?
>>>[ILUG] ilug,Bigger, Fuller Breasts Naturally In Just Weeks
>>> zzzz Increase your breast size. 100% safe!
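A slightly more defensive variant of the same preprocessing is sketched below. The helper names read_subject and collect_subjects are hypothetical, as is the relative folder name in the usage comment. Paths are built with os.path.join so that the code does not depend on Windows separators, and files are read as latin-1, which maps every byte to a character, on the assumption that some messages contain bytes that are not valid in the default encoding. Unlike the listing above, this sketch writes the separator without surrounding spaces.

```python
import os

def read_subject(path):
    """Return the first Subject header of an e-mail file, or None if absent."""
    with open(path, encoding="latin-1") as f:
        for line in f:
            if line.startswith("Subject:"):
                # Drop the header name, surrounding whitespace, and commas
                return line[len("Subject:"):].strip().replace(",", "")
    return None

def collect_subjects(folder, category, out_path):
    """Append one 'subject,category' row per e-mail in folder to out_path."""
    with open(out_path, "a", encoding="utf-8") as out:
        for fname in sorted(os.listdir(folder)):
            subject = read_subject(os.path.join(folder, fname))
            if subject is not None:
                out.write("{0},{1}\n".format(subject, category))

# Example call (the folder name is an assumption):
# collect_subjects("spam", "spam", "SubjectsSpam.out")
```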
We will keep the spam and not spam subject lines in different files, so that we can vary the size of the training and test sets. Usually, more data in the training set means better performance of the algorithm, but in this case we will try to find an optimal training set size.
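One way to experiment with the training set size is sketched below. The function name split_examples, the fixed random seed, and the placeholder rows are assumptions for illustration; in practice the rows would be the subject lines collected earlier.

```python
import random

def split_examples(rows, train_fraction, seed=0):
    """Shuffle rows and split them into a training list and a test list."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder rows standing in for (subject, category) pairs
rows = [("subject %d" % i, "spam") for i in range(10)]
for fraction in (0.5, 0.7, 0.9):
    train_rows, test_rows = split_examples(rows, fraction)
    print(fraction, len(train_rows), len(test_rows))
```

Repeating the evaluation for several fractions shows how the classifier's accuracy changes as the training set grows.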
All the code and datasets for this chapter can be found in the author's GitHub repository at https://github.com/hmcuesta/PDA_Book/tree/master/Chapter4.