E-mail Analysis Methods - Computational intelligence in e mail traffic analysis

E-mail data in its original text format contains a great deal of information that can be examined to provide useful knowledge about patterns of communication occurring between e-mail users. However, human analysis of e-mail data in its original text format is an extremely difficult and time consuming task, given the large amounts of data that may be involved. This is where e-mail analysis methods can be utilised for the examination of e-mail data, in order to aid in extracting information on meaningful patterns or features hidden in the data. There are two types of e-mail analysis methods that may be employed to examine e-mail data for computer forensic purposes: e-mail content analysis and e-mail traffic analysis. Both of these methods provide different ways of analysing the data, and extract different types of information about the communication activities of individuals using e-mail. These methods also use different parts of an e-mail message for analysis, as shown in Figure 1.1.

Sections of message analysed

Main body of e-mail message.

Header Body Return-Path: Received: From: To: CC: Subject: Date: MIME-Version: Content-Type:

(a) Parts of message used for e-mail content analysis.

Sections of message analysed

Main body of e-mail message.

Header Body Return-Path: Received: From: To: CC: Subject: Date: MIME-Version: Content-Type:

(b) Parts of message used for e-mail traffic analysis.

Figure 1.1: The parts of an e-mail message that are used for content and traffic analysis.

1.4.1 E-mail Content Analysis

E-mail content analysis is defined here as a type of analysis method that extracts information related to the textual content of messages sent by e-mail users. The purpose of this form of analysis is to obtain useful information about the patterns

or features associated with the use of words entered into e-mail messages. There are various parts of an e-mail message that can be examined for content analysis, such as the Subject, To, From, and Body fields [32, 33, 34]. Each of the fields used for e-mail content analysis are highlighted in Figure 1.1(a).

Based on the text information contained in each e-mail message, e-mail content analysis techniques are able to analyse for particular patterns or features related to the linguistic characteristics of the text used in the e-mail messages. The techniques used to extract this information come from areas such as text mining [35], natural language processing (NLP) [36], or statistics. Techniques in each of these areas provide different ways of examining the content-based information contained in e-mail. Some brief examples of approaches that have been proposed for analysing e-mail content are: analysis of user topic interests [37], analysis of an individual’s word usage patterns [38], message thread analysis [39], and authorship analysis [40].

The use of e-mail content analysis for computer forensic analysis of e-mail data offers a number of advantages compared to human analysis of individual e-mail messages. One of these is that the techniques used for e-mail content analysis enable the forensic analyst to identify patterns of word usage or particular linguistic features in e-mail messages, which would otherwise be unnoticed by the analyst. Another advantage is that content analysis offers a method of summaris- ing or overviewing the textual content of e-mail messages without requiring the analyst to examine each individual e-mail message.

However, there are certain drawbacks that limit the use of e-mail content analysis. Firstly, if the body of certain e-mail messages is encrypted, then content analysis techniques will be unable to extract any information about what is contained in the main body of those messages. This defeats the purpose of e-mail content analysis and provides another problem for the analyst in terms of need- ing to break the encryption if the text content of the encrypted message is to be analysed. Another drawback is that content analysis techniques mainly focus on the textual content of e-mails, but provide little information about other characteristics such as time-based information or the presence of attachments. This limits the use of e-mail content analysis from analysing other types characteristics that may be present in e-mail data. While these are drawbacks limit the use of e-mail content analysis in these situations, e-mail traffic analysis is another type of method that can be considered for analysing e-mail data.

1.4.2 E-mail Traffic Analysis

E-mail traffic analysis is defined here as a type of analysis method that extracts information related to the transit and delivery of e-mail messages sent by e-mail users. The main purpose of e-mail traffic analysis is to obtain useful information about the exchange of messages between e-mail users and the types of messages sent, without examining the content of the e-mail messages. The information examined for traffic analysis is derived from the header section of e-mail messages, given that the header contains fields related to the transport of e-mail [9].

The header fields that can generally be used for traffic analysis are those related to the sender, recipient, and date/time information [4] (e.g. From, To, CC, BCC, and Date fields). Other header fields such as Content-Type, may be used for traffic analysis to provide other types of information about the e-mail messages (e.g. the presence of attachments or the types of attachments being sent) [3]. A header field that is generally not used for traffic analysis the Subject field, since it provides user specified information related the content of e-mail messages. Examples of header fields that can be used for traffic analysis are illustrated and highlighted in Figure 1.1(b).

Using the information contained in e-mail message headers, traffic analysis techniques are able to analyse for patterns or features associated with the communication interactions occurring between e-mail users. The techniques that can be used for e-mail traffic analysis come from a variety of areas, including: statistics, visualisation, and artificial intelligence. Techniques from each of these areas enable particular types of traffic information to be obtained about the individuals using e-mail. Some brief examples of types of information that can be extracted from e-mail using traffic analysis methods are: the usage patterns of individual e-mail users [41, 42], the interaction behaviour between an e-mail user and their contacts [2], and the connections emerging from interactions between groups of e-mail users [43, 44].

The application of e-mail traffic analysis for computer forensic purposes offers a number of benefits in comparison to human analysis of individual e-mail messages and e-mail content analysis. One of these benefits is that e-mail traffic analysis provides a way of overviewing the interactions occurring between e- mail users and enable the observation of events that occurred as a result of e- mail exchanges between e-mail users. This may provide useful information for forensic analysts, given that they need to obtain information from the data that allows the reconstruction of events associated with a crime. The second benefit

of e-mail traffic analysis is that it enables the identification of patterns of interaction occurring between e-mail users, which would otherwise be unnoticed by the analyst when examining individual e-mail messages. Finally, in comparison to e-mail content analysis, e-mail traffic analysis techniques are not hindered by the presence of encrypted e-mails, since traffic analysis focuses on the transit and delivery of e-mail messages rather than its contents.

Although there are benefits of using e-mail traffic analysis techniques, there are also some problems that make it difficult to apply traffic analysis techniques in certain situations. The first of these is the use of fake or falsely generated e-mail addresses by certain individuals, in order to hide their true identity and the origin of where their e-mails are sent from. This can be done by the individual send- ing e-mails through anonymous remailers, which are able to conceal the sender’s true identity from the recipient [45]. The use of such tactics can cause problems for e-mail traffic analysis techniques, since it is dependent on knowing the identity of the sender and recipient addresses, and also how e-mail messages are being transported through the Internet. The second problem is knowing whether certain e-mail accounts belong to a particular individual. While traffic analysis techniques are able to examine the patterns of communications associated with interactions between e-mail users, it is still difficult to determine whether certain e-mail accounts are associated with a particular individual. This problem is due to the fact that an individual may use multiple e-mail accounts and also by the fact that the e-mail addresses used by individuals do not necessarily represent the individual’s real identity. It is still yet to be determined whether e-mail traffic analysis techniques are able to accurately identify a particular individual from their e-mail traffic communication activity.

Overall, while e-mail content analysis and e-mail traffic analysis provide different methods of analysing e-mail data, the use of these methods may not necessarily have to be mutually exclusive. E-mail content analysis techniques may be used alongside e-mail traffic analysis techniques, in order to provide wider cover- age about the communication activities of the individuals using e-mail. This ap- proach may prove to be useful, given that the combination of both analysis methods could be used to overcome limitations that exist when using these methods individually. However, to determine the practical application of e-mail content and e-mail traffic analysis methods for computer forensic purposes, one needs to consider the current difficulties faced by law enforcement agencies in analysing e-mail data.

1.5 Difficulties Associated With Analysing E-mail

In document Computational intelligence in e mail traffic analysis (Page 35-39)