
3.2 Email mining

3.2.2 Concept classification

Traditional keyword search can be used, for example, to detect whether crime-related words (e.g., drugs) are mentioned. However, such techniques are inefficient and may not reveal any significant results or suspected anomalies in the analyzed data, since suspected users usually avoid such words explicitly and prefer other expressions and encrypted messages, which may hide different meanings.

Classification techniques are more robust to noise and high dimensionality. In addition, their results are more precise, and they can easily process large amounts of data that would be much more difficult to analyze with manual ad-hoc searches.

For email messages, a text-based classification algorithm helps us classify emails into different categories. Some of these may turn out to be anomalous or unconventional when compared with the type and expected usage of a particular email address, for instance when a work email address is used for personal matters.

Natural language processing (NLP) techniques help us understand, process, and mine the textual content of emails. A very common and successful approach to text classification is LSA (Latent Semantic Analysis), a powerful tool that we already discussed in the text mining background in Section 2.1. We will use it together with TF-IDF as the text weighting algorithm. The LSA process can be summarized in the following steps:

1. Building the corpus/collection of documents to use as input:

The corpus we must generate is the list of all the different emails in the archive, and we should take care to avoid text redundancy. Redundancy is mainly caused by reply emails, which usually copy the text of the message they are replying to alongside the actual reply. It is handled in the preliminary data cleaning phase described in Section 3.1.2: we filter out this quoted content using the procedure given there, and then populate the corpus with the filtered (cleaned) documents (emails).

2. Building the word phrases dictionary and removing stop words:

We use the n-gram model to build the set of word phrases, with a granularity of n = 3, which means that a word phrase contains at most 3 words (e.g., new york city). Trigram models are very common, especially when the available training data is limited. This value of n has proved very successful at detecting important and relevant word phrases, and it also keeps processing time and complexity low; 4-gram and 5-gram models are used when the available data is very large. We should exclude stop words from the dictionary: stop words are extremely common words that have little semantic relevance to the final analysis. This set of words depends strictly on the language of the text and can be extended with additional ad-hoc words. The proposed framework leaves this open, letting the user choose the language and manually add other irrelevant terms.

3. Applying a TF-IDF text weighting algorithm: TF-IDF is the combination of the term frequency and inverse document frequency metrics. Its value is high when a term occurs many times within a small number of documents, and lower when a term occurs fewer times in a single document or many times in many documents.

For a more formal, mathematical definition of TF-IDF, we refer the reader to Section 2.1. This step creates a (terms × documents) matrix in which each cell contains the TF-IDF value of a term in a document.

4. Applying a matrix decomposition scheme, SVD (Singular Value Decomposition): given the matrix from step 3, we construct a low-rank approximation of it using SVD. The algorithm decomposes the matrix into three different matrices, which are truncated to a lower rank k. When the original rank is very high, k is generally chosen in the low hundreds. In our case, k is chosen ad hoc according to the data analyzed, by manually verifying the validity of the results. We chose this approach mainly to reduce processing time. In Section 3.3 we describe a possibly more sophisticated approach that addresses this problem and automates the choice of k.
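Steps 2 and 3 above can be illustrated with a minimal, dependency-free sketch. It is not the framework's implementation: the stop-word list, the tokenizer, the choice to drop stop words before forming n-grams, and the weighting formula tf × log(N / df) (one common TF-IDF variant, which may differ from the definition in Section 2.1) are all assumptions made for illustration, as are the names ngram_terms and tfidf_matrix.

```python
import math
import re
from collections import Counter

# Illustrative English stop words only; a real list is language dependent
# and, as described above, extensible by the user.
STOP_WORDS = {"the", "a", "to", "in", "is", "of", "and", "at"}

def ngram_terms(text, max_n=3, stop_words=STOP_WORDS):
    """Return all word phrases of length 1..max_n after dropping stop words."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in stop_words]
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_matrix(corpus, max_n=3):
    """Build a terms x documents matrix of tf * log(N / df) weights."""
    counts = [Counter(ngram_terms(doc, max_n)) for doc in corpus]
    vocab = sorted(set().union(*counts))
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term.
    df = {term: sum(1 for c in counts if term in c) for term in vocab}
    matrix = [[c[term] * math.log(n_docs / df[term]) for c in counts]
              for term in vocab]
    return vocab, matrix

# Hypothetical, already-cleaned email bodies.
corpus = [
    "Meeting in New York City at noon",
    "The New York City office is closed",
    "Send the quarterly budget report",
]
vocab, matrix = tfidf_matrix(corpus)
```

Each row of matrix corresponds to one term in vocab and each column to one email. A phrase such as "new york city", which appears in two of the three emails, receives a lower weight than "quarterly budget report", which appears in only one.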

The final step of LSA (the decomposition) lets us represent and classify the original documents in terms of a new set of documents, each of which represents a different concept. For each concept (cluster) we retrieve the set of terms with the highest scores. These terms are the most representative words of that concept and may give an idea of the common subject that ties them together.
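The decomposition and the retrieval of the most representative terms per concept can be sketched as follows. This is only an illustration under stated assumptions: the small terms × documents matrix is invented, and the rank-k decomposition is computed by power iteration with deflation purely to keep the example dependency-free; in practice a library SVD routine would normally be used. The names truncated_svd and top_terms are hypothetical.

```python
import math
import random

def matvec(a, x):
    """Multiply matrix a (a list of rows) by vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

def truncated_svd(a, k, iters=200, seed=0):
    """Return the top-k (sigma, u, v) triplets of a; assumes k <= rank(a)."""
    at = [list(col) for col in zip(*a)]
    rng = random.Random(seed)
    triplets = []
    for _ in range(k):
        v = [rng.random() for _ in at]
        for _ in range(iters):
            # Deflation: remove the directions of concepts already found.
            for _, _, prev in triplets:
                dot = sum(x * y for x, y in zip(v, prev))
                v = [x - dot * y for x, y in zip(v, prev)]
            w = matvec(at, matvec(a, v))  # one power-iteration step on A^T A
            n = norm(w)
            v = [x / n for x in w]
        av = matvec(a, v)
        sigma = norm(av)
        # u holds the term loadings, v the document loadings of the concept.
        triplets.append((sigma, [x / sigma for x in av], v))
    return triplets

def top_terms(u, terms, n=2):
    """Terms with the largest absolute loading on one concept."""
    ranked = sorted(zip(terms, u), key=lambda p: abs(p[1]), reverse=True)
    return [t for t, _ in ranked[:n]]

# Invented weights: rows are terms, columns are four emails.
terms = ["budget", "report", "invoice", "party", "weekend"]
a = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 1],
]
concepts = truncated_svd(a, k=2)
labels = [top_terms(u, terms) for _, u, _ in concepts]
```

On this toy matrix the first concept is dominated by "budget" and "report" and the second by "party" and "weekend", which is the kind of term list an analyst would read to guess the subject of each cluster.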

Now that we have different clusters of concepts, we need to determine how the clusters relate to the network elements. First we need to redefine the set of documents. Two different approaches can be adopted:

Nodes: each node is assigned its own document, which includes the content of all the messages that the node handles.

Edges: each edge (a relation between two contacts) is represented as a separate document containing all the messages exchanged between the two nodes.

Both techniques consult the terms × documents matrix and retrieve the vector space representations of the documents needed. This lets us apply cosine similarity between the cluster vectors and the node/edge representations. We decided to integrate both approaches into the framework and let users choose the representation that better suits their needs.
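The node and edge representations and the cosine comparison can be sketched as below. Several simplifications are assumed: the messages and concept vectors are invented, raw word counts stand in for the TF-IDF weights, a node document is taken to include both sent and received messages, and the comparison is done directly in term space rather than in the reduced LSA space. The helper names cosine and best_concept are hypothetical.

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical messages: (sender, recipient, already-cleaned text).
messages = [
    ("alice", "bob", "budget report invoice"),
    ("bob", "alice", "invoice budget"),
    ("alice", "carol", "party weekend"),
    ("carol", "alice", "weekend party party"),
]

# Hypothetical concept vectors (term loadings per cluster).
concepts = {
    "finance": {"budget": 0.64, "report": 0.64, "invoice": 0.43},
    "leisure": {"party": 0.85, "weekend": 0.53},
}

node_docs = defaultdict(Counter)  # one document per contact
edge_docs = defaultdict(Counter)  # one document per pair of contacts
for sender, recipient, text in messages:
    words = Counter(text.split())
    node_docs[sender] += words
    node_docs[recipient] += words
    # frozenset makes the edge undirected: (a, b) and (b, a) share a document.
    edge_docs[frozenset((sender, recipient))] += words

def best_concept(doc):
    """Name of the concept whose vector is most similar to doc."""
    return max(concepts, key=lambda name: cosine(doc, concepts[name]))

edge_labels = {tuple(sorted(e)): best_concept(d) for e, d in edge_docs.items()}
node_labels = {n: best_concept(d) for n, d in node_docs.items()}
```

The edge view separates the two conversations cleanly, while a node that takes part in both (alice here) receives a mixed document, which is one reason it can be useful to offer both representations.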
