Using Statistics to Identify Spam
3.8 Organizing an email Message into an R Data Structure
3.8.1 Processing the Header
Our plan is to process the header by converting the key: value pairs into a named vector, where the name of each element in the vector is taken as the key. Having a separate function to carry out this task allows us to test it without having to repeatedly read the entire message, and if desired, we can use this function to process the header in attachments within the message body.
To determine how to process the header, we examine a few lines of the header in one of our sample messages with
header = sampleSplit[[1]]$header
header[1:12]
[1] "From [email protected] Thu Aug 22 12:36:23..."
[2] "Return-Path: <[email protected]>"
[3] "Delivered-To: [email protected]"
[4] "Received: from localhost (localhost [127.0.0.1])"
[5] "\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id..."
[6] "\tfor <zzzz@localhost>; Thu, 22 Aug 2002 07:36:16 -0400..."
[7] "Received: from phobos [127.0.0.1]"
[8] "\tby localhost with IMAP (fetchmail-5.9.0)"
[9] "\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 12..."
[10] "Received: from listman.spamassassin.taint.org (listman...."
[11] " dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g..."
[12] " <[email protected]>; Thu, 22 Aug 200..."
We make the following observations about these 12 header lines that are relevant to our extraction process:
• Some key: value pairs appear on multiple lines, so there is not a one-to-one correspondence between lines in the header variable and key: value pairs. For example, lines 4, 5, and 6 are all part of the same Received value, and the same is true for lines 7, 8, and 9 and for lines 10–12. We also note that when the value continues over multiple lines, these additional lines begin with blanks, as in lines 11 and 12, or with a tab character, i.e., \t, as with lines 5 and 6.
• The first line in the header is not in the key: value format. The information in this line also appears elsewhere in the header. In general, a line of this form marks the start of a new message.
• Colons can appear in the value portion of the key: value pair, e.g., line 6 contains a time in the format 07:36:16.
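To illustrate the third observation, the following sketch separates a key from its value at the first colon only. The line is a made-up example modeled on the Received lines above, and the variable names are ours; this is one possible approach, not the one the section ultimately adopts.

```r
# Hypothetical line whose value contains a time with colons
line = "Received: for <zzzz@localhost>; Thu, 22 Aug 2002 07:36:16 -0400"

# Splitting on every colon breaks the time apart
length(strsplit(line, ":", fixed = TRUE)[[1]])

# regexpr() reports the position of the first match only,
# so the colons inside the value are left alone
pos = regexpr(":", line, fixed = TRUE)
key = substring(line, 1, pos - 1)
value = trimws(substring(line, pos + 1))
key
value
```

Here strsplit() returns four pieces rather than two, which is why the key is taken as everything before the first colon.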
Our function needs to handle these various situations as it transforms the header. In what order do we address them? Also, to address the second issue, we need to decide what we want to do with the first line of the header. Do we discard it because the information appears elsewhere in the header or do we keep it? Let’s be conservative and keep this line, but change it so that it follows the key: value format of the other lines.
To handle the situation where a value appears on multiple lines in the header, we can collapse these extra lines into one. Then, each key: value pair occupies one line in the revised header vector. It makes sense to fix these two issues first because then all of the lines have the same format when we go about splitting the lines up into their keys and values.
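As a rough sketch of what collapsing the continuation lines could look like, the code below groups each continuation line with the key: value line that precedes it and pastes each group into one string. The toy header and the names toy, isCont, group, and collapsed are assumptions made for illustration; the addresses and host names are invented.

```r
# Toy header lines (hypothetical), with tab- and blank-led continuations
toy = c("Return-Path: <a@example.com>",
        "Received: from localhost (localhost [127.0.0.1])",
        "\tby mail.example.com (Postfix) with ESMTP",
        "\tfor <zzzz@localhost>; Thu, 22 Aug 2002 07:36:16 -0400",
        "Received: from phobos [127.0.0.1]",
        "  by localhost with IMAP",
        "Subject: hello")

# TRUE for lines that begin with a blank or a tab
isCont = grepl("^[ \t]", toy)

# Every line that is not a continuation starts a new group
group = cumsum(!isCont)

# Trim the leading whitespace and paste each group into a single line
collapsed = unname(sapply(split(trimws(toy), group), paste, collapse = " "))
length(collapsed)
collapsed[3]
```

After this step each element of collapsed holds one complete key: value pair, so the seven toy lines reduce to four.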
We can address the problem with the first line of the header not being in the key: value format by simply substituting the string "From" that appears at the start of the line with, say, "Top-From:". Because we choose a special key that is not used in typical email headers, it will not be confused with other key: value pairs. We do this with the following call to sub():
header[1] = sub("^From", "Top-From:", header[1])
header[1]
[1] "Top-From: [email protected] Thu Aug 22..."
Here we have used the regular expression "^From" to locate "From" at the start of the string. That is, the character ^ anchors the pattern we are searching for to the start of the string, and any From later in the string is not matched. The second argument to sub() is the substring that is substituted for the initial "From". Alternatively, we can use simple string manipulation functions available in R for finding and modifying this first header line. For example, we can remove the first 4 characters from the string and paste "Top-From:" to the front of this shortened string.
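The string-manipulation alternative might look like the sketch below, which uses substring() and paste0() and compares the result with the sub() approach. The line stored in first is a made-up stand-in for the real first header line.

```r
# Hypothetical first line in the same shape as the sample header
first = "From sender@example.com Thu Aug 22 12:36:23 2002"

# Approach from the text: anchored regular expression
viaSub = sub("^From", "Top-From:", first)

# Alternative: drop the first 4 characters, then prepend the new key
viaPaste = paste0("Top-From:", substring(first, 5))

identical(viaSub, viaPaste)
```

The two approaches agree here because the line does begin with "From". Unlike sub(), the substring() version does not check this, so it would silently mangle a line that starts with something else.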
Now all of the information appears in a key: value format. We might ask: does R provide a function to read files in this format? Our initial searches on the Internet and in the documentation for import/export of files in R do not turn up anything useful. We can write code to handle the continuation lines, e.g., by concatenating them to the previous line, and code to identify the key and value. In fact, the first time that we processed the header, this is what we did, and we make this approach an exercise. Later, we discovered the read.dcf() function that handles this format. According to the documentation for the function, read.dcf() reads the format:
• Regular lines are of the form key: value and start with a non-whitespace character.
• Lines starting with whitespace are continuation lines (to the preceding field).
• Fields may appear more than once in a record.
• Records are separated by one or more empty (i.e., whitespace only) lines.
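A small experiment can make these rules concrete. The sketch below feeds read.dcf() a toy record that has a continuation line and a repeated field; the record contents and the names toy, con, and pieces are assumptions for illustration. The exact whitespace in the returned values may differ between R versions, so the comments describe only the structure.

```r
# Toy record (hypothetical): one continuation line, one repeated field
toy = c("Top-From: sender@example.com Thu Aug 22 12:36:23 2002",
        "Delivered-To: a@example.com",
        "Received: from localhost (localhost [127.0.0.1])",
        "\tby mail.example.com (Postfix) with ESMTP",
        "Delivered-To: b@example.com",
        "Subject: hello")

con = textConnection(toy)
pieces = read.dcf(con, all = TRUE)   # all = TRUE keeps repeated fields
close(con)

dim(pieces)                 # one row (one record), one column per distinct key
pieces[["Delivered-To"]]    # a list holding both Delivered-To values
pieces$Received             # the continuation line is folded into the value
```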
We can try read.dcf() on our sample header. Since it is already in R as a character vector, we use a text connection to read it, i.e.,
headerPieces = read.dcf(textConnection(header), all = TRUE)
The return value is a data frame with one row, where, e.g., the Delivered-To element is a list that contains the values for the 2 Delivered-To keys in the header. That is,
headerPieces[, "Delivered-To"]
[[1]]
[1] "[email protected]"
[2] "[email protected]"
We can convert headerPieces into a character vector and use the key for the name of each of these values. When there are duplicate fields in the header, we have duplicate names. We do this with
headerVec = unlist(headerPieces)
dupKeys = sapply(headerPieces, function(x) length(unlist(x)))
names(headerVec) = rep(colnames(headerPieces), dupKeys)
We confirm that we have 2 elements in headerVec named "Delivered-To" with
headerVec[ which(names(headerVec) == "Delivered-To") ]
Delivered-To
Delivered-To
The header vector has 36 elements, i.e.,
length(headerVec)
[1] 36
The raw header was originally 62 lines, but apparently, 26 of these lines were continuation lines. Moreover, these 36 elements include 10 duplicate names,
length(unique(names(headerVec)))
[1] 26
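The naming step can be checked on a small hand-made header where the counts are easy to verify by eye. In this sketch the toy header and the names toy, con, pieces, and toyVec are illustrative assumptions; the logic mirrors the unlist() and rep() calls above.

```r
# Toy header (hypothetical) with one repeated key
toy = c("Top-From: sender@example.com Thu Aug 22 12:36:23 2002",
        "Delivered-To: a@example.com",
        "Delivered-To: b@example.com",
        "Subject: hello")

con = textConnection(toy)
pieces = read.dcf(con, all = TRUE)
close(con)

# Flatten the one-row data frame into a character vector
toyVec = unlist(pieces)

# Count the values stored under each key, then repeat each key that many times
dupKeys = sapply(pieces, function(x) length(unlist(x)))
names(toyVec) = rep(colnames(pieces), dupKeys)

names(toyVec)
length(toyVec) - length(unique(names(toyVec)))   # number of duplicate names
```

Here dupKeys is 1, 2, 1, so the key Delivered-To is repeated twice and the four values end up under three distinct names.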
We can put this code into our processHeader() function. What are the inputs and outputs of this function? We only need the original header vector as input, and the return value from processHeader() is the named character vector. Our function follows:
processHeader = function(header) {
  # modify the first line to create a key:value pair
  header[1] = sub("^From", "Top-From:", header[1])

  headerMat = read.dcf(textConnection(header), all = TRUE)
  headerVec = unlist(headerMat)

  dupKeys = sapply(headerMat, function(x) length(unlist(x)))
  names(headerVec) = rep(colnames(headerMat), dupKeys)

  return(headerVec)
}
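One possible refinement, offered only as a sketch, is to close the text connection explicitly so that R does not later warn about unused connections. The function name processHeaderSketch and the toy header are our own assumptions; the steps otherwise follow processHeader().

```r
# Variant of processHeader() that closes its text connection (illustrative)
processHeaderSketch = function(header) {
  # modify the first line to create a key:value pair
  header[1] = sub("^From", "Top-From:", header[1])

  con = textConnection(header)
  on.exit(close(con))          # release the connection when the function exits
  headerMat = read.dcf(con, all = TRUE)
  headerVec = unlist(headerMat)

  dupKeys = sapply(headerMat, function(x) length(unlist(x)))
  names(headerVec) = rep(colnames(headerMat), dupKeys)
  headerVec
}

# Toy header (hypothetical) in the same shape as the sample messages
toy = c("From sender@example.com Thu Aug 22 12:36:23 2002",
        "Return-Path: <sender@example.com>",
        "Received: from localhost (localhost [127.0.0.1])",
        "\tby mail.example.com (Postfix) with ESMTP",
        "Received: from phobos [127.0.0.1]",
        "Subject: hello")

hv = processHeaderSketch(toy)
names(hv)
```

Testing on a short header like this is the kind of check that a separate header-processing function makes convenient, since no message body is needed.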
Let’s call processHeader() on the rest of our sample messages. Recall that the headers and bodies of the messages in sampleEmail have already been separated and assigned to sampleSplit. We apply processHeader() to them with
headerList = lapply(sampleSplit, function(msg) {
processHeader(msg$header)} )
We can access the value of, e.g., the Content-Type key with subsetting by name, i.e.,
contentTypes = sapply(headerList, function(header)
                        header["Content-Type"])
names(contentTypes) = NULL
contentTypes
[1] " text/plain; charset=us-ascii"
[2] " text/plain; charset=US-ASCII"
[3] " text/plain; charset=US-ASCII"
[4] " text/plain; charset=\"us-ascii\""
[5] " text/plain; charset=US-ASCII"
[6] " multipart/signed;\n boundary=..."
[7] NA
[8] " multipart/alternative;\n ... "
[9] " multipart/alternative; boundary=Apple-Mail-2-874629474"
[10] " multipart/signed;\n boundary=..."
[11] " multipart/related;\n boundary=..."
[12] " multipart/signed;\n boundary=..."
[13] " multipart/signed;\n boundary=..."
[14] " multipart/mixed;\n boundary=..."
[15] " multipart/alternative;\n boundary=..."
We see that in our sample one of the 15 messages has no Content-Type specified, i.e., it yields NA. When we examine the raw element, we confirm that we properly processed it, i.e., the Content-Type key is not present in the original header.
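One way to make this kind of confirmation, sketched here on invented data, is to test directly whether the key is among the names of each header vector and compare that with where NA appears. The list headerListToy and the names hasCT and ct are hypothetical stand-ins for headerList and contentTypes.

```r
# Toy stand-in for headerList (hypothetical values)
headerListToy = list(
  c("Top-From" = "a@example.com", "Content-Type" = "text/plain; charset=us-ascii"),
  c("Top-From" = "b@example.com", "Subject" = "no content type here"),
  c("Top-From" = "c@example.com", "Content-Type" = "multipart/mixed; boundary=xyz"))

# Is the key present at all in each header?
hasCT = sapply(headerListToy, function(header) "Content-Type" %in% names(header))

# Subsetting by a missing name yields NA rather than an error
ct = sapply(headerListToy, function(header) unname(header["Content-Type"]))

which(!hasCT)                    # messages lacking the key
identical(is.na(ct), !hasCT)     # NA appears exactly where the key is absent
```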
We next tackle processing the body of the message; in particular, we extract the attachments from the body and summarize them.