
2.3 Machine Learning Approaches

2.3.2 Behaviour-based Approaches

2.3.2.2 Anomaly Detection

When only normal data is available (i.e. data maturity is medium), the system can learn only from the majority class (normal class). Hence, the insider threat detection problem is addressed using an anomaly detection approach, which learns a baseline of users' normal behaviour and flags the instances that deviate from this baseline as anomalous. There is a recent research trend towards employing anomaly detection approaches for insider threat detection. iForest [50] is a model-based anomaly detection algorithm that isolates anomalies from the normal instances instead of profiling normal instances.
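To make the isolation idea concrete, the following is a minimal sketch of an isolation forest in plain Python. It is not the implementation of [50]: the function names, the dictionary-based tree layout, and the parameter defaults (`n_trees`, `sample_size`) are assumptions chosen for illustration. Points that end up in shallow leaves of randomly built trees receive scores closer to 1.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split the points on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # nothing left to split on this feature
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(point, node, depth=0):
    """Depth at which the point lands, adjusted for unsplit leaf size."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if point[node["dim"]] < node["split"] else node["right"]
    return path_length(point, child, depth + 1)

def fit_forest(data, n_trees=100, sample_size=64, seed=0):
    """Build n_trees isolation trees, each on a random subsample of data."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(max(sample_size, 2)))
    trees = [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, sample_size

def anomaly_score(point, forest):
    """Score in (0, 1]; shorter average paths give higher scores."""
    trees, sample_size = forest
    mean_depth = sum(path_length(point, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c_factor(sample_size))

# Hypothetical usage: fit on normal-only data, then score new instances.
data_rng = random.Random(1)
normal = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
forest = fit_forest(normal)
print(anomaly_score((0.0, 0.0), forest), anomaly_score((8.0, 8.0), forest))
```

In this sketch the forest is fitted on normal data only, matching the medium data maturity setting described above; a point far from the training data follows the path of the extreme training points and is therefore isolated early.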

In the following, we give a brief review of the anomaly detection approaches. A recent unsupervised ensemble-based anomaly detection system, namely PROactive Detection of Insider threats with Graph Analysis and Learning (PRODIGAL), was presented by Goldberg et al. [51] as the result of five years of work on the insider threat detection problem [52]–[54]. The authors define three types of user-day detectors employed in PRODIGAL: indicator-based detectors, which apply outlier detection techniques to feature subsets related to particular activities; anomaly detectors, which identify potential anomalies in the whole feature space related to different aspects of the data; and scenario-based detectors, which focus on subsets of features and subsets of target users relevant to a particular scenario in order to allow comparison with peer groups. iForest is configured as one of the anomaly detectors in PRODIGAL to detect unknown malicious insider threats in user activities. Furthermore, PRODIGAL implements an ensemble approach that combines the scores from multiple detectors, such that the consensus of the individual detectors about the most anomalous instances is maintained in the ensemble. The ensemble in PRODIGAL achieves an average AUC of 0.85, which approaches the average AUC achieved by the best individual anomaly detector (0.88 to 0.89).
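As an illustration of score combination and AUC evaluation, the sketch below averages normalised ranks across detectors and computes AUC as the probability that a positive instance outscores a negative one. The rank-averaging rule is an assumption made for this example; the text above does not state which combination rule PRODIGAL uses, and the detector outputs are invented.

```python
def to_ranks(scores):
    """Map raw scores to ranks in [0, 1]; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg / max(len(scores) - 1, 1)
        i = j + 1
    return ranks

def ensemble_scores(detector_outputs):
    """Average the normalised ranks assigned by each detector (assumed rule)."""
    ranked = [to_ranks(s) for s in detector_outputs]
    n = len(ranked[0])
    return [sum(r[i] for r in ranked) / len(ranked) for i in range(n)]

def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical user-day scores from two detectors on different scales.
detector_a = [0.1, 0.2, 0.9, 0.3]
detector_b = [5.0, 1.0, 8.0, 2.0]
labels = [0, 0, 1, 0]  # 1 marks the malicious user-day
combined = ensemble_scores([detector_a, detector_b])
print(combined, auc(combined, labels))
```

Working on ranks rather than raw scores keeps detectors with different score scales comparable, which is one plausible way for an ensemble to preserve the detectors' agreement on the most anomalous instances.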

Legg et al. [33] present an automated system, called Corporate Insider Threat Detection (CITD), to detect insider threats in an organisation. CITD defines a data parser module, comprising an activity parser and a content parser, which retrieves and parses the data logs for each session. The activity parser appends the activity logs to tree-structured behaviour profiles and extracts an activity feature set for each session. The content parser, discussed in a further paper [55], utilises k-means clustering and Principal Component Analysis (PCA) to extract psychological features from the contents of the browsed websites, and appends these features to the activity feature set. CITD then assesses the feature set for each session against three levels of alerts: level ‘1’, policy violations and threat patterns; level ‘2’, threshold-based anomalies; and level ‘3’, deviation-based anomalies. The level ‘1’ alerts correspond to the tripwires addressed in a further paper [34]. The tripwires fire only if the user’s behaviour matches the implemented policy violations or threat patterns, and thus do not suffer from FPs. This corresponds to signature detection, whereas the level ‘2’ and level ‘3’ alerts are in charge of detecting new threats (i.e. anomaly detection). The level ‘3’ alerts are in charge of finding the anomaly threshold, which in turn is used for the level ‘2’ alerts. The level ‘2’ alerts compare the anomaly score for each session against this predefined threshold and trigger an alert accordingly. If the alert is an FP, the system’s parameters are refined with the aim of reducing subsequent FPs. The authors test the scalability and performance of CITD on a real data set in an experimental paper [56]. The results show that alerts are generated for 25% of the staff in a multinational organisation. The high FP rate is attributed to the significant changes in staff working hours in a multinational organisation. CITD is also evaluated on the synthetic CMU-CERT data set, with precision = 42% and recall = 100%. In addition, a parallel coordinates plot allows 7 of 10 insider threat cases to be identified clearly.
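The interplay between tripwires, a deviation-derived threshold, and the reported precision and recall can be sketched as follows. This is not CITD's implementation: the mean-plus-k-standard-deviations threshold, the session dictionary layout, and the example tripwire rule are all assumptions made for illustration.

```python
import statistics

def deviation_threshold(history, k=3.0):
    """Assumed rule: threshold = mean + k * standard deviation of past scores."""
    return statistics.mean(history) + k * statistics.pstdev(history)

def alert_level(session, threshold, tripwires):
    """Return 1 for a tripwire match, 2 for a threshold breach, 0 otherwise."""
    if any(rule(session) for rule in tripwires):
        return 1  # signature-style match, independent of the anomaly score
    if session["score"] > threshold:
        return 2  # threshold-based anomaly
    return 0

def precision_recall(alerts, truths):
    """Precision and recall of boolean alerts against ground-truth labels."""
    tp = sum(1 for a, t in zip(alerts, truths) if a and t)
    fp = sum(1 for a, t in zip(alerts, truths) if a and not t)
    fn = sum(1 for a, t in zip(alerts, truths) if not a and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical anomaly scores observed for a user's earlier sessions.
past_scores = [1, 2, 3, 2, 1, 2, 3, 2]
threshold = deviation_threshold(past_scores)
rules = [lambda s: s.get("usb_after_hours", False)]  # illustrative tripwire
print(alert_level({"score": 5.0}, threshold, rules))
print(precision_recall([True, True, False, True], [True, False, False, False]))
```

A recall of 100% with a precision of 42% means that every true threat was flagged, while 58% of the raised alerts were FPs; the last line of the sketch shows the same pattern on a toy scale.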

Zhang et al. [57] apply the Naive Bayes algorithm to file logs with the aim of identifying anomalous users based on their probability of interest in file topics compared to that of the community. The approach first categorises the files into predefined topics. It then constructs two types of probabilistic models: a user behaviour model, which defines the probability of a user’s interest in each predefined topic (based on their file accesses); and a community behaviour profile, which defines the probability of a community’s interest in a particular topic given another topic (i.e. a conditional probability). A user is flagged as anomalous if the user-topic probability differs significantly from the user-community probability, given that the user belongs to the community (i.e. users having the same role).
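A simplified sketch of the user-versus-community comparison is given below. It is not the model of [57]: it compares marginal topic distributions only and does not reproduce the conditional topic-given-topic community profile. The Laplace smoothing, the total variation distance, and the 0.3 threshold are assumptions made for illustration.

```python
from collections import Counter

def topic_distribution(accessed_topics, topics, alpha=1.0):
    """Laplace-smoothed probability of interest in each predefined topic."""
    counts = Counter(accessed_topics)
    total = len(accessed_topics) + alpha * len(topics)
    return {t: (counts[t] + alpha) / total for t in topics}

def total_variation(p, q):
    """Half the L1 distance between two distributions over the same topics."""
    return 0.5 * sum(abs(p[t] - q[t]) for t in p)

def flag_users(user_accesses, topics, threshold=0.3):
    """Flag users whose topic interests deviate from the pooled community."""
    pooled = [t for accesses in user_accesses.values() for t in accesses]
    community = topic_distribution(pooled, topics)
    return {user: total_variation(topic_distribution(accesses, topics),
                                  community) > threshold
            for user, accesses in user_accesses.items()}

# Hypothetical topic labels of files accessed by four users in the same role.
topics = ["hr", "finance", "eng"]
accesses = {
    "alice": ["eng"] * 10,
    "bob": ["eng"] * 9 + ["hr"],
    "carol": ["eng"] * 10,
    "dave": ["finance"] * 10,
}
print(flag_users(accesses, topics))
```

Here the community is formed by pooling the accesses of all users with the same role, so a user whose interests concentrate on a topic the role rarely touches stands out.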

Gates et al. [58] use the structure of the file system hierarchy to measure the access similarity of files, and detect anomalous behaviour based on a predefined threshold. The paper defines two access similarity measures: a self score, which compares a user’s accesses to that user’s historical accesses; and a relative score, which compares a user’s average score to other users’ accesses. These measures are suggested for use as features in the feature vector. The results show that the relative score feature, which compares against others’ accesses, attains a lower number of FPs, and that 80% of the threats can be detected with an FP rate of 2.5%.
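One way to picture hierarchy-based access similarity is as a tree distance between file paths. The sketch below is an assumption-laden illustration, not the measures defined in [58]: the distance function, the use of the nearest historical access, and the way the relative score averages over other users are all hypothetical choices.

```python
def path_distance(a, b):
    """Number of tree edges between two paths via their deepest shared folder."""
    pa = [part for part in a.split("/") if part]
    pb = [part for part in b.split("/") if part]
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

def self_score(new_accesses, history):
    """Mean distance from each new access to the nearest historical access."""
    nearest = [min(path_distance(n, h) for h in history) for n in new_accesses]
    return sum(nearest) / len(nearest)

def relative_score(new_accesses, other_histories):
    """Mean of the same score computed against each other user's history."""
    scores = [self_score(new_accesses, h) for h in other_histories]
    return sum(scores) / len(scores)

# Hypothetical access histories for two users.
histories = {
    "alice": ["/home/alice/docs/a.txt", "/home/alice/docs/b.txt"],
    "bob": ["/srv/payroll/q1.csv"],
}
new = ["/srv/payroll/q2.csv"]  # a new access by alice
print(self_score(new, histories["alice"]), relative_score(new, [histories["bob"]]))
```

In this toy case the new access is far from alice's own history but close to bob's, which is the kind of contrast a threshold on such features could act on.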

Chen et al. [59] introduced Meta-CADS, an extension of the previously proposed Community Anomaly Detection System (CADS) [60], to detect insider threats in collaborative environments. CADS consists of two components: a Pattern Extraction component (CADS-PE), which extracts user-subject relations (patterns) from access logs to infer communities; and an Anomaly Detection component (CADS-AD), which employs unsupervised k-NN to compare the user-subject relations to the inferred community-subject relations, so that user behaviour which deviates significantly from its community is detected as anomalous. Meta-CADS extends CADS by incorporating the semantics of subjects (subject-category relations) into CADS-PE to infer complex categories using Singular Value Decomposition (SVD), a step performed before community inference. The experiments suggest that Meta-CADS performs better than CADS in terms of AUC when the rate of malicious users with respect to benign users is low (0.5%).
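The nearest-neighbour idea behind community-based detection can be sketched in a few lines. This is a generic k-NN deviation score, not the CADS-AD algorithm of [60]: representing each user by the set of subjects accessed, using Jaccard distance, and averaging over the k nearest users are assumptions made for illustration, and the SVD step of Meta-CADS is not reproduced.

```python
def jaccard_distance(a, b):
    """1 minus the overlap ratio of two sets of accessed subjects."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def knn_deviation(user, access_sets, k=2):
    """Mean distance from a user to their k most similar other users."""
    distances = sorted(jaccard_distance(access_sets[user], subjects)
                       for other, subjects in access_sets.items()
                       if other != user)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)

# Hypothetical user-subject relations extracted from access logs.
access_sets = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "c"},
    "u4": {"x", "y"},
}
scores = {user: knn_deviation(user, access_sets) for user in access_sets}
print(scores)
```

A user whose accessed subjects overlap with no neighbour, such as `u4` here, receives the maximum deviation, whereas users inside a tightly overlapping community receive low scores.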