Term Frequency Inverse Document Frequency Filtering

After applying the FindFPOF algorithm, normal records generally receive higher anomaly scores than anomalous records, so a user can retrieve anomalous records by thresholding the output to the low-scoring records. However, in order to retrieve all 65 embedded anomalous records, the anomaly score threshold must be set at a high value, at which a large number of normal records are also output and the user is presented with a large number of false positives. In addition, the nature of the FindFPOF algorithm means that the result set always contains complete potentially anomalous records, even though typically only a subset of a record's features causes it to be an anomaly. In order to combat this limitation, Section 4.4 describes a novel extension to the FindFPOF algorithm.

4.4 Term Frequency Inverse Document Frequency Filtering

The anomaly detection method discussed in Section 4.1 provides potentially anomalous records to the user. In the following section, these potentially anomalous records are formally defined as $d_{Anom}$. However, in order to understand why these records are anomalous, the user must explore and analyse the records manually. To overcome this limitation, a modification of the classical Term Frequency Inverse Document Frequency (TFIDF) algorithm can be applied to the result set of the FindFPOF algorithm, as a filtering criterion on the potentially anomalous records. TFIDF is the combination of two methods:

1. Term Frequency (TF) - a form of term weighting for all terms in a given document, first proposed by Luhn [81] in 1957,

2. Inverse Document Frequency (IDF) - a method proposed by Jones [59] in 1972. This method modifies the term weight of a given term in a document, based upon the frequency of the term in all documents in a document set.

56 Chapter 4. Infrequent and Anomalous Itemset Detection

Specifically, given a term τ within a document δ, the term frequency tf(τ, δ) can be defined as:

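$$\mathrm{tf}(\tau, \delta) = 0.5 + 0.5 \times \frac{\mathrm{freq}_{\tau,\delta}}{\max\{\mathrm{freq}_{\tau',\delta} : \tau' \in \delta\}} \tag{4.6}$$

where $\mathrm{freq}_{\tau,\delta}$ is the frequency of the term τ in the document δ, and $\max\{\mathrm{freq}_{\tau',\delta} : \tau' \in \delta\}$ is the maximum frequency of all terms in the document.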

By using the frequency of a term in the document, $\mathrm{freq}_{\tau,\delta}$, relative to the maximum frequency of any term in the document, the term weight is not biased by the length of the document.
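As a minimal sketch of Equation 4.6, the following Python function computes the term frequency of a term in a document represented as a list of tokens. The function name `augmented_tf` and the sample tokens are illustrative assumptions, not part of the original method description.

```python
from collections import Counter

def augmented_tf(term, document):
    """Equation 4.6: 0.5 + 0.5 * freq(term) / (max frequency of any term)."""
    counts = Counter(document)
    if not counts:
        raise ValueError("document must contain at least one term")
    # Counter returns 0 for a term that does not occur in the document.
    return 0.5 + 0.5 * counts[term] / max(counts.values())

# Hypothetical document: a list of tokens.
doc = ["login", "login", "login", "fail", "root"]

print(augmented_tf("login", doc))       # most frequent term -> 1.0
print(augmented_tf("root", doc))        # 0.5 + 0.5 * (1 / 3)
print(augmented_tf("root", doc * 10))   # same weight for a document ten times longer
```

Repeating the document ten times leaves every weight unchanged, which illustrates the length normalisation described above.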

Furthermore, the inverse document frequency idf(τ, ∆) can be defined as:

$$\mathrm{idf}(\tau, \Delta) = \log \frac{N}{1 + |\{\delta \in \Delta : \tau \in \delta\}|} \tag{4.7}$$

where N is the total number of documents and ∆ is the complete set of all documents.
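Equation 4.7 can be sketched in a few lines of Python. The text does not state the base of the logarithm, so the natural logarithm is assumed here; the function name `idf` and the sample documents are also illustrative assumptions.

```python
import math

def idf(term, documents):
    """Equation 4.7: log(N / (1 + number of documents containing the term))."""
    n_docs = len(documents)
    if n_docs == 0:
        raise ValueError("the document set must not be empty")
    containing = sum(1 for doc in documents if term in doc)
    # Natural logarithm is an assumption; the source does not specify a base.
    return math.log(n_docs / (1 + containing))

# Hypothetical document set: each document is a set of terms.
docs = [
    {"login", "ok"},
    {"login", "ok"},
    {"login", "fail"},
    {"login", "root", "fail"},
]

print(idf("root", docs))   # term in 1 of 4 documents: log(4 / 2)
print(idf("login", docs))  # term in all 4 documents: log(4 / 5)
```

Note that, because of the added 1 in the denominator, a term that occurs in every document receives a slightly negative weight under this definition.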

The TFIDF weight tfidf(τ, δ, ∆) for a given term is therefore:

$$\mathrm{tfidf}(\tau, \delta, \Delta) = \mathrm{tf}(\tau, \delta) \times \mathrm{idf}(\tau, \Delta) \tag{4.8}$$

Although TFIDF weighting is usually applied to each document in a set of documents in order to analyse term frequency, the anomaly detection scenario has an analogous structure. In this scenario, the document δ is the set of records output by the FindFPOF algorithm, the set of all documents ∆ is the set of all records in the original dataset, and the term τ is the value of a feature within a record. TFIDF weighting can therefore be used to weight the values of each feature based upon their frequency in the anomaly detection result set, relative to their frequency in the original dataset.


For clarity, when applied to the values v of an anomalous subset of records $d_{Anom}$ from a dataset D, the notation for the TFIDF algorithm becomes:

$$\mathrm{tfidf}(v, d_{Anom}, D) = \overbrace{\left(0.5 + 0.5 \times \frac{\mathrm{freq}_{v,d_{Anom}}}{\max\{\mathrm{freq}_{v',d_{Anom}} : v' \in d_{Anom}\}}\right)}^{\text{Term Frequency}} \times \overbrace{\log \frac{N_R}{1 + |\{\mathrm{freq}_{v,D}\}|}}^{\text{Inverse Document Frequency}} \tag{4.9}$$

where $N_R$ is the total number of records in the dataset D.
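The following Python sketch shows one way Equation 4.9 could be evaluated over records. It rests on several assumptions that the text does not confirm: records are represented as dictionaries mapping feature names to values, a term is treated as a (feature, value) pair so that equal values in different features are not conflated, the quantity in the IDF denominator is read as the number of records in D containing the value, and the natural logarithm is used. The names `item_counts` and `tfidf_records` and the sample data are hypothetical.

```python
import math
from collections import Counter

def item_counts(records):
    """Count how many records contain each (feature, value) item."""
    return Counter(item for record in records for item in record.items())

def tfidf_records(d_anom, dataset):
    """Sketch of Equation 4.9 for every item that occurs in d_anom."""
    if not d_anom:
        raise ValueError("d_anom must contain at least one record")
    anom_counts = item_counts(d_anom)
    data_counts = item_counts(dataset)
    max_freq = max(anom_counts.values())  # maximum over all items in d_anom
    n_r = len(dataset)                    # N_R: number of records in D
    return {
        item: (0.5 + 0.5 * freq / max_freq)
        * math.log(n_r / (1 + data_counts[item]))
        for item, freq in anom_counts.items()
    }

# Hypothetical dataset of 8 records; the last two play the role of d_Anom.
dataset = [{"proto": "tcp", "port": 80} for _ in range(6)]
dataset += [{"proto": "tcp", "port": 6667}, {"proto": "udp", "port": 6667}]
d_anom = dataset[6:]

scores = tfidf_records(d_anom, dataset)
for item, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(item, round(score, 3))
```

In this toy data the item `("proto", "tcp")` occurs in 7 of the 8 records, so its IDF factor is log(8 / 8) = 0 and its weight is 0 regardless of its term frequency.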

In order to compensate for the distributions of the features within the dataset, Equation 4.9 can be modified such that the term frequency is calculated relative to the feature to which the value belongs, rather than relative to all values in the result set. Therefore, for the value v of a feature f, the distribution-compensated TFIDF (dc tfidf) is:

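$$\operatorname{dc\ tfidf}(v, d_{Anom}, D) = \overbrace{\left(0.5 + 0.5 \times \frac{\mathrm{freq}_{v,d_{Anom}}}{\max\{\mathrm{freq}_{v',d_{Anom} \cap f} : v' \in d_{Anom} \cap f\}}\right)}^{\text{Term Frequency}} \times \overbrace{\log \frac{N_R}{1 + |\{\mathrm{freq}_{v,D}\}|}}^{\text{Inverse Document Frequency}} \tag{4.10}$$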
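A minimal sketch of Equation 4.10 in Python is given below. The only change from Equation 4.9 is that the maximum frequency in the term frequency factor is taken over the values of the same feature. As before, the dictionary record format, the (feature, value) item representation, the reading of the IDF denominator as the number of records in D containing the value, the natural logarithm, and all names and sample data are assumptions made for illustration.

```python
import math
from collections import Counter

def item_counts(records):
    """Count how many records contain each (feature, value) item."""
    return Counter(item for record in records for item in record.items())

def dc_tfidf(d_anom, dataset):
    """Sketch of Equation 4.10: the TF maximum is taken per feature."""
    anom_counts = item_counts(d_anom)
    data_counts = item_counts(dataset)
    n_r = len(dataset)  # N_R: number of records in D

    # Maximum frequency among the values of each feature within d_anom.
    max_per_feature = {}
    for (feature, _value), freq in anom_counts.items():
        max_per_feature[feature] = max(freq, max_per_feature.get(feature, 0))

    return {
        (feature, value): (0.5 + 0.5 * freq / max_per_feature[feature])
        * math.log(n_r / (1 + data_counts[(feature, value)]))
        for (feature, value), freq in anom_counts.items()
    }

# Hypothetical dataset of 8 records; the last two play the role of d_Anom.
dataset = [{"proto": "tcp", "port": 80} for _ in range(6)]
dataset += [{"proto": "tcp", "port": 6667}, {"proto": "udp", "port": 6667}]
d_anom = dataset[6:]

scores = dc_tfidf(d_anom, dataset)
for item, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(item, round(score, 3))
```

In this toy data each `proto` value occurs once in the result set, so each is now compared with a per-feature maximum of 1 instead of the overall maximum of 2 set by the `port` value 6667, and its term frequency factor becomes 1.0.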

Using Equation 4.10, the records output from the FindFPOF algorithm can be filtered based upon the value of dc tfidf(v, $d_{Anom}$, D) for the individual features within each record. This feature-based filtering allows the user to reduce the records output from the FindFPOF algorithm to potentially anomalous subsets, by removing normal items from the records. For the following discussion, this hybrid algorithm is titled Discovering Anomalous Terms Using Mining (DATUM).


In the DATUM algorithm, a low value of dc tfidf(v, $d_{Anom}$, D) corresponds to potentially anomalous feature values. Therefore, by filtering the output of the FindFPOF algorithm to remove features with a high dc tfidf(v, $d_{Anom}$, D) value, the remaining feature patterns correspond to the anomalous values contained within the records.
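The filtering step described above can be sketched as follows. This is only an illustration of the mechanism under stated assumptions: the function name `datum_filter`, the dictionary record format, the use of a single user-chosen threshold, and the choice to keep values scoring at or below it are all hypothetical, and the scores in the example are placeholder numbers rather than the output of any real run.

```python
def datum_filter(records, scores, threshold):
    """Keep only the feature values whose score is at or below the threshold.

    records:   list of dicts mapping feature name -> value
    scores:    dict mapping (feature, value) -> dc tfidf weight
    threshold: hypothetical user-chosen cut-off; higher scores are removed
    """
    filtered = []
    for record in records:
        kept = {
            feature: value
            for feature, value in record.items()
            if scores[(feature, value)] <= threshold
        }
        if kept:  # drop records in which no feature value survives
            filtered.append(kept)
    return filtered

# Placeholder records and scores, chosen only to show the mechanism.
records = [
    {"proto": "udp", "port": 6667},
    {"proto": "tcp", "port": 6667},
]
scores = {("proto", "udp"): 0.2, ("proto", "tcp"): 1.4, ("port", 6667): 0.3}

print(datum_filter(records, scores, threshold=1.0))
# [{'proto': 'udp', 'port': 6667}, {'port': 6667}]
```

The second record is reduced to the single feature value that scored below the threshold, so the user sees a partial record rather than the full one.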

In document Interactive visualisation for the discovery of cyber security threats. (Page 67-70)