Description of the Data Sets - Opportunistic machine learning methods for effective insider thr

The CMU-CERT data sets are synthetic insider threat data sets generated by the CERT Division at Carnegie Mellon University [26], [70]. The CMU-CERT data repository is the only available data repository that implements malicious insider threat scenarios (5 scenarios) and has recently become the evaluation data repository for researchers addressing the insider threat problem [21], [51], [61].

The CMU-CERT data repository has several data generator releases (including r1 to r6), such that most releases include multiple versions of data sets (e.g. r3.1, r3.2). For this thesis, we used the r5.2 data sets from the insider threat data sets generated by CMU-CERT. These data sets log the behaviour of 2000 employees over 18 months. The rationale behind the selection of the r5.2 data sets among the panoply of data generator releases is explained below.

The first step to address the insider threat problem is to determine the model to employ to analyse the behaviour of users in an organisation. We define two types of models: (1) user behaviour model, and (2) community behaviour model.


A user behaviour model defines the behaviour of a single user (employee) in an organisation. It represents only the activities of this user, in a structure that allows the user’s behaviour to be analysed against their own previous behaviour only.

A community behaviour model broadens and enriches the analysis of a user’s behaviour with respect to the community the user belongs to. We define a community as a group of users (employees) in an organisation who have the same role. In other words, the users in a community tend to work in a team environment, and the activities required of them are quite similar. Therefore, their behaviours tend to align with those of the other users in the same community. A community behaviour model represents the activities of all the users in the community in a structure that allows each user’s behaviour to be analysed against their own previous behaviour as well as the community’s behaviour. In this thesis, we employ community behaviour models to analyse the behaviour of users and detect malicious insider threats. For instance, consider the community of salesmen, where each salesman is required to sell and promote commercial products. In the salesmen behaviour model, the activities of the salesmen align, and any deviation of a salesman’s behaviour from their previous behaviour or from the behaviour of the whole community may indicate a malicious insider threat. Hence, we hypothesise that adopting community behaviour modelling in our work will guide the proposed approaches towards effective detection of malicious insider threats.
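The two reference points of a community behaviour model (a user's own history and the behaviour of their peers) can be illustrated with a minimal sketch. It assumes a single numeric feature per user per day and uses a plain z-score as the deviation measure; the feature, the user identifiers, the counts, and the function names are all hypothetical, and the z-score is a stand-in for illustration rather than the detection approach proposed in this thesis.

```python
from statistics import mean, pstdev


def zscore(value, history):
    """Standardised deviation of `value` from a list of past observations."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # No spread in the history: any different value is maximally unusual.
        return 0.0 if value == mu else float("inf")
    return (value - mu) / sigma


def deviation_scores(user, today, history_by_user):
    """Compare today's value with the user's own past and with their peers' past."""
    own = history_by_user[user]
    peers = [v for u, values in history_by_user.items() if u != user for v in values]
    return {"self": zscore(today, own), "community": zscore(today, peers)}


# Hypothetical daily counts of removable-drive connections for three salesmen.
history = {
    "u1": [1, 2, 1, 2],
    "u2": [2, 1, 2, 1],
    "u3": [1, 1, 2, 2],
}

# User u1 suddenly connects a removable drive nine times in one day.
scores = deviation_scores("u1", 9, history)
print(scores)
```

A user behaviour model corresponds to the `self` score alone; the community behaviour model adds the `community` score, so a user whose behaviour has always differed from that of their peers can still be noticed.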

Following the definition of a community behaviour model, we justify the selection of the r5.2 data sets. Unlike the other released CMU-CERT data sets, the communities in the r5.2 data sets contain a considerable variety of malicious insider threats. In other words, the r5.2 data sets implement a considerable number of malicious insiders in each community, and the scenarios followed by the malicious insiders in a community vary. As a result, the richness and variety in the community behaviour models allow us to test the effectiveness of the proposed approaches and to validate them on different scenarios. Table 4.1 summarises the released CMU-CERT data sets in terms of the number of implemented malicious insider threats. The r5.2 release has the highest number of malicious insider threats (99), followed by the r4.2 release (70). Furthermore, the scenarios implemented in each community of the r5.2 release are varied. This is further justified in Table 4.2, where the variety of scenarios implemented in each of the utilised communities is shown.

TABLE 4.1: Summary of CMU-CERT releases in terms of the number of implemented malicious insider threats.

Release            r1  r2  r3.1  r3.2  r4.1  r4.2  r5.1  r5.2  r6.1  r6.2
Number of threats  /   1   2     2     3     70    4     99    5     5

The slash symbol (/) denotes that this release has no implemented malicious insider threats.

Among the 2000 employees in the r5.2 data sets, we extracted the data logs for the users (employees) belonging to the following three community data sets to be later utilised to validate the proposed approaches:

• Production line worker (com-P): It consists of 300 users, including 17 malicious insiders. It has the scenarios {s1, s2, s4} implemented;

• Salesman (com-S): It consists of 298 users, including 22 malicious insiders. It has the scenarios {s1, s2, s4} implemented; and

• IT admin (com-I): It consists of 80 users, including 12 malicious insiders. It has the scenarios {s2, s3} implemented.

A description of the scenarios {s1, s2, s3, s4} implemented in the aforementioned community data sets is given in Section 4.2.1.
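The extraction of community data sets by role can be sketched as follows. This is only an illustration under stated assumptions: the column names (`user_id`, `role`, `user`), the role labels, and the sample rows are hypothetical and have not been checked against the r5.2 files, and the helper names are ours.

```python
import csv
import io
from collections import defaultdict

# Hypothetical mapping from a role label to the community names used in the text.
ROLE_TO_COMMUNITY = {
    "ProductionLineWorker": "com-P",
    "Salesman": "com-S",
    "ITAdmin": "com-I",
}


def group_by_community(rows, role_field="role", user_field="user_id"):
    """Collect the user ids of each community; users with other roles are skipped."""
    communities = defaultdict(set)
    for row in rows:
        community = ROLE_TO_COMMUNITY.get(row[role_field])
        if community is not None:
            communities[community].add(row[user_field])
    return dict(communities)


def filter_logs(log_rows, users, user_field="user"):
    """Keep only the log records that belong to the given set of users."""
    return [row for row in log_rows if row[user_field] in users]


# Tiny in-memory stand-in for a user-information file (made-up rows).
sample_users = io.StringIO(
    "user_id,role\n"
    "AAA0001,Salesman\n"
    "BBB0002,ITAdmin\n"
    "CCC0003,Salesman\n"
    "DDD0004,Director\n"
)
communities = group_by_community(csv.DictReader(sample_users))

# Made-up log records; only the first belongs to a salesman.
sample_logs = [
    {"user": "AAA0001", "activity": "Logon"},
    {"user": "DDD0004", "activity": "Logon"},
]
com_s_logs = filter_logs(sample_logs, communities["com-S"])
print(communities, com_s_logs)
```

Grouping the user ids first and filtering the logs afterwards keeps the role lookup separate from the much larger activity logs, which then only need a set-membership test per record.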

4.2.1 Scenarios of Malicious Insider Threats

In the following, we give a brief description of the scenarios implemented in the extracted communities:

Scenario s1: This scenario considers a user who starts logging in after hours, using a removable drive, and uploading data to the WikiLeaks website. This behaviour occurs for a period of time, after which the user leaves the organisation.

Scenario s2: This scenario considers a user who starts surfing job websites, especially those of competitor companies. The user connects a removable drive to their PC increasingly often, at a higher frequency than in their previous activity, in order to steal data before leaving the company. The activity of surfing job websites occurs for a certain period of time, stops for a few session slots, and then reoccurs in a similar manner.

Scenario s3: This scenario considers a disgruntled system administrator who downloads a keylogger onto a removable drive and connects it to their supervisor’s PC to steal the supervisor’s login identity. The next day, the administrator logs into the supervisor’s PC using the login identity collected in the keylogs and sends an alarming mass email to employees in the organisation. Scenario s3 refers to a masquerade insider threat, where the malicious insider uses a legitimate user’s identity to gain access to that user’s PC.

Scenario s4: This scenario considers a user who logs into another user’s PC, accesses their files, and emails them to a personal email address. It is carried out more and more frequently over a three-month period. New activities (e.g. logon from a new PC, emails to non-employees) occur in a persistent manner to establish a novel behaviour.

More information regarding the CMU-CERT data sets and simulated scenarios can be found in [26], [70]. Table 4.2 summarises the aforementioned r5.2 community data sets in terms of the number of users (employees), the number of malicious insider threats PT, and the number of malicious insider threats that map to each of the described scenarios {s1, s2, s3, s4}.

TABLE 4.2: Summary of the r5.2 community data sets in terms of the number of users, the number of malicious insider threats PT, and the number of malicious insider threats that map to each of the described scenarios {s1, s2, s3, s4}.

Community  Users  PT  s1  s2  s3  s4
com-P      300    17  6   5   /   6
com-S      298    22  7   9   /   6
com-I      80     12  /   2   10  /

The slash symbol (/) denotes that this scenario is not implemented in the corresponding community data set.
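As a quick check on the figures reported in Table 4.2, the per-scenario counts can be summed and compared with PT, and the share of malicious insiders in each community can be computed. The sketch below only restates the values of Table 4.2; the dictionary layout and the function name are our own.

```python
# Values copied from Table 4.2; scenarios that are not implemented are omitted.
TABLE_4_2 = {
    "com-P": {"users": 300, "threats": 17, "scenarios": {"s1": 6, "s2": 5, "s4": 6}},
    "com-S": {"users": 298, "threats": 22, "scenarios": {"s1": 7, "s2": 9, "s4": 6}},
    "com-I": {"users": 80, "threats": 12, "scenarios": {"s2": 2, "s3": 10}},
}


def summarise(table):
    """Check that scenario counts add up to PT and compute the malicious share."""
    summary = {}
    for name, row in table.items():
        per_scenario_total = sum(row["scenarios"].values())
        summary[name] = {
            "consistent": per_scenario_total == row["threats"],
            "malicious_pct": round(100 * row["threats"] / row["users"], 2),
        }
    return summary


summary = summarise(TABLE_4_2)
for name, stats in summary.items():
    print(name, stats)
```

The scenario counts add up to PT in all three communities, and malicious insiders make up roughly 5.7%, 7.4% and 15% of com-P, com-S and com-I respectively, so they are a minority of the users in each community.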


FIGURE 4.1: Sample of users’ information in the CMU-CERT r5.2 release.

4.3 Data preprocessing for CMU-CERT Insider Threat Data
