To train, test and evaluate intrusion detection systems, it is important to collect suitable labelled data and to correlate relevant information and malicious events within it. The literature shows that data for intrusion detection is usually collected from the following sources:
1. Networks: data packets
2. Host: Input commands from users
3. Host: System calls, log files, system usage statistics.
Since the focus of our thesis is on NIDS, we consider network sources. Datasets from network sources may be artificial (synthesized) or non-artificial (real). Popular artificial datasets used for performance evaluation in the intrusion detection domain include the DARPA-Lincoln datasets and the KDD99 dataset. Most anomaly detection systems depend on statistical features extracted from the network traffic, such as source and destination IP, source and destination port, latency changes, arrival rates and traffic volume. Many datasets created by researchers were designed with these features in mind, and consequently lack payload or content information.
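As a minimal sketch of the kind of statistical features mentioned above, the following Python groups packet records by source and destination address and port, then computes the traffic volume and mean arrival rate of each flow. The `Packet` record and the `flow_features` function are hypothetical names introduced for illustration, and the addresses are made up; real datasets derive such features from captured traces.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified packet record (header fields only, no payload)."""
    timestamp: float  # seconds since start of capture
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    size: int  # bytes


def flow_features(packets):
    """Group packets by (src ip, dst ip, src port, dst port) and summarise each flow."""
    flows = defaultdict(list)
    for p in packets:
        flows[(p.src_ip, p.dst_ip, p.src_port, p.dst_port)].append(p)

    features = {}
    for key, pkts in flows.items():
        times = [p.timestamp for p in pkts]
        duration = max(times) - min(times)
        features[key] = {
            "packets": len(pkts),
            "bytes": sum(p.size for p in pkts),  # traffic volume
            # Packets per second; left undefined for a zero-length flow.
            "arrival_rate": len(pkts) / duration if duration > 0 else None,
        }
    return features


# Made-up example traffic: one three-packet flow and one single-packet flow.
packets = [
    Packet(0.0, "10.0.0.5", "10.0.0.9", 40000, 80, 60),
    Packet(0.5, "10.0.0.5", "10.0.0.9", 40000, 80, 1500),
    Packet(2.0, "10.0.0.5", "10.0.0.9", 40000, 80, 40),
    Packet(1.0, "10.0.0.7", "10.0.0.9", 40001, 22, 60),
]
features = flow_features(packets)
```

Note that none of these features looks at packet contents, which is why datasets built around them carry no payload information.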
20 The use of a compromised system's resources for the attacker's gain
21 Malicious attacker gains local privileged access over the network or Internet
22 Attacker is able to escalate/elevate his privileges to administrator or root level (Lazarevic et al.,
The DARPA-Lincoln datasets were synthesized and collected at MIT’s Lincoln Labs for IDS performance evaluation. They are available as two datasets, DARPA 1998 and DARPA 1999 (see footnote 23), also known as the IDEVAL corpus. Both datasets are in TCPDump and BSM (see footnote 24) format. The DARPA 1998 dataset contains seven weeks of training data and two weeks of test data. This includes 300 instances of 38 different attacks, which fall into four main attack categories, launched against a Unix host. The DARPA 1999 dataset contains three weeks of training data and two weeks of test data, with around 58 attack types launched against a Unix host, a Windows NT host and a Cisco router.
The KDD99 dataset was derived from the DARPA 1998 network dataset and was used in the KDD Cup competition. Features were extracted from TCP connections, and each connection was labelled as normal or attack traffic. The training set comprises 24 attack types, while the test set contains 38.
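To illustrate the labelling described above, the sketch below maps a connection label either to a binary normal/attack class or to one of the four attack categories commonly used with the DARPA 1998 and KDD99 data (DoS, Probe, R2L, U2R). The mapping table is deliberately partial, the function names are hypothetical, and the example records are shortened, made-up stand-ins; the trailing period on each label follows the format of the distributed KDD99 files.

```python
# Partial, illustrative mapping from KDD99 attack labels to the four
# attack categories; the full dataset contains more attack types.
ATTACK_CATEGORY = {
    "smurf": "DoS",
    "neptune": "DoS",
    "portsweep": "Probe",
    "ipsweep": "Probe",
    "guess_passwd": "R2L",
    "buffer_overflow": "U2R",
}


def parse_label(record):
    """Return the label (last comma-separated field) without its trailing period."""
    return record.strip().split(",")[-1].rstrip(".")


def binary_class(label):
    """Collapse every attack type into a single 'attack' class."""
    return "normal" if label == "normal" else "attack"


def category(label):
    """Map a label to its attack category, or 'unknown' if not in the partial table."""
    if label == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(label, "unknown")


# Shortened, made-up records: real KDD99 records carry 41 features before the label.
records = [
    "0,tcp,http,SF,181,5450,normal.",
    "0,icmp,ecr_i,SF,1032,0,smurf.",
    "0,tcp,telnet,SF,125,179,guess_passwd.",
]
labels = [parse_label(r) for r in records]
classes = [binary_class(label) for label in labels]
categories = [category(label) for label in labels]
```

Returning "unknown" for unmapped labels matters here because the test set contains attack types that never appear in the training set.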
Other artificial datasets can be created using frameworks that generate malicious traffic. Two notable frameworks are MACE (Sommers et al., 2004) and FLAME (Brauckhoff et al., 2008).
McHugh (2000) critically analysed the DARPA dataset and concluded that artificial or synthetic data is not sufficiently similar to real network traffic, and hence that the DARPA (IDEVAL) corpus does not reflect the properties of real network traffic. This was highlighted with statistics and rate differences between synthetic and real traffic. Since the KDD99 dataset was extracted from the DARPA dataset, the same criticism applies to it.
Researchers now generate real (non-artificial) datasets to benchmark IDS. Self-produced datasets promise a larger training set, since malicious streams are manually extracted and labelled from live networks. The procurement of real datasets can be facilitated by the use of Honeypots and Honeynets. To gain better insight into computer attacks and to create a labelled dataset of such attacks, we employed Honeypots and Honeynets as data collection tools for evaluating the methods developed in our thesis.
23 http://www.ll.mit.edu/IST/ideval/data/data_index.html
24 Basic Security Module, logs audit data
2.3.1 Honeypots and Honeynets
Honeypots and Honeynets provide the means to study malicious attacks and attackers under a microscope. Because of their unique architecture they provide a strategic advantage over other security tools by allowing researchers access to critical information regarding attacks as they occur on live systems, at both network and system level. This extra insight helps researchers to quickly devise defence mechanisms against new or novel attacks. They serve as a means to spy on malicious attackers, their organization, their tools and their sophisticated techniques. Since all traffic received by a Honeynet or Honeypot is considered malicious, it is easier to focus on attack data than it is with gateway devices, which cannot provide this granularity because of the high volume of traffic they process.
2.3.1.1 Honeypots
A Honeypot is generally defined as a network security resource whose value lies in it being scanned, attacked, compromised, controlled and misused by an attacker to achieve his malicious goals. Lance Spitzner defines Honeypots as: “An information system resource whose value lies in unauthorized or illicit use of that resource” (Spitzner, 2002). Honeypots can be classified in two main ways. First, they can be classified by their level of interaction with an attacker:
Low-interaction Honeypot: Emulates a variety of host services. These mimic real services but are implemented in a sandbox environment and run as an application, e.g. honeyd and nepenthes (Provos and Holz, 2007).
High-interaction Honeypot: The attacker is given the freedom to interact with a real operating system, and every attempt is logged and accounted for.
Hybrid Honeypot: Combines features and functionality of both low- and high-interaction Honeypots.
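To make the idea of a low-interaction Honeypot concrete, the following Python sketch emulates only the greeting of a service: it accepts a TCP connection, sends a fake banner, and logs whatever the peer sends. It is not how honeyd or nepenthes are implemented; the banner text and the function names `open_listener` and `handle_one_connection` are assumptions made for illustration, and the listener binds to the loopback address only.

```python
import socket
from datetime import datetime, timezone

# Invented banner; no real service sits behind it.
FAKE_BANNER = b"220 ftp.example.org FTP server ready\r\n"


def open_listener(host="127.0.0.1", port=0):
    """Create a listening TCP socket; port 0 lets the OS pick a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(5)
    return listener


def handle_one_connection(listener, log, timeout=2.0):
    """Accept one connection, send the fake banner, and log what the peer sends."""
    conn, (peer_ip, peer_port) = listener.accept()
    with conn:
        conn.settimeout(timeout)
        conn.sendall(FAKE_BANNER)
        try:
            data = conn.recv(1024)
        except socket.timeout:
            data = b""  # peer connected but sent nothing in time
        log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "peer": (peer_ip, peer_port),
            "data": data,
        })
```

A caller would create a listener with `open_listener()`, read the chosen port from `listener.getsockname()[1]`, and call `handle_one_connection(listener, log)` in a loop or thread. Because the code never executes the received bytes, the attacker interacts with an emulation only, which is the defining property of the low-interaction class.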
Honeypots can also be categorized by the way they are deployed in a network. Server-side Honeypots are deployed on a server or host running services, which malicious attackers actively attack and exploit. Client-side Honeypots act as clients and actively interact with malicious servers.
Figure 2.3: A Honeypot comic by XKCD (http://xkcd.com/350/), showing several honeypots connected together in a network, running samples of malware
Honeypots are now widely deployed by security researchers and rolled out in production networks to monitor malicious and suspicious activity. The concept is also illustrated in a light-hearted way in Figure 2.3.
2.3.1.2 Honeynet
A Honeynet is a special kind of high-interaction Honeypot. Honeynets extend the concept of a single Honeypot to a highly controlled network of Honeypots. A Honeynet is a specialized network architecture configured to achieve:
• Data Control: It deals with the containment of activity within the Honeynet.
• Data Capture: It involves the capturing, monitoring and logging of all threats and attacker activities within the Honeynet.
• Data Collection: Captured data is securely forwarded to a centralized data collection point.
This architecture creates a highly controlled network, in which one can control and monitor all kinds of system and network activity. Honeypots are then placed within this network. A basic Honeynet comprises Honeypots placed behind a transparent gateway, the Honeywall. Acting as a transparent gateway, the Honeywall is undetectable by attackers and serves its purpose by logging all network activity going in or out of the Honeypots.
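The three Honeynet requirements can be sketched as a toy gateway model in Python. This is not the real Honeywall software: the `Honeywall` class, its `process` and `collect` methods, and the outbound connection limit used for Data Control are assumptions chosen for illustration, and all addresses are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Honeywall:
    """Toy model of a Honeywall gateway; names and policy are hypothetical."""
    honeypots: set
    max_outbound: int = 3  # assumed per-Honeypot limit on outbound connections
    captured: list = field(default_factory=list)
    outbound_count: dict = field(default_factory=dict)

    def process(self, src, dst, payload):
        """Return True if the connection is forwarded, False if it is blocked."""
        # Data Control: limit the connections that a (possibly compromised)
        # Honeypot may open towards hosts outside the Honeynet.
        allowed = True
        if src in self.honeypots and dst not in self.honeypots:
            used = self.outbound_count.get(src, 0)
            if used >= self.max_outbound:
                allowed = False
            else:
                self.outbound_count[src] = used + 1
        # Data Capture: log every connection, whether forwarded or blocked.
        self.captured.append(
            {"src": src, "dst": dst, "payload": payload, "allowed": allowed}
        )
        return allowed

    def collect(self, central_store):
        """Data Collection: hand captured records to a central store, then clear the buffer."""
        central_store.extend(self.captured)
        sent = len(self.captured)
        self.captured = []
        return sent


wall = Honeywall(honeypots={"10.0.0.10"}, max_outbound=2)
results = [
    wall.process("203.0.113.7", "10.0.0.10", b"exploit"),  # inbound attack
    wall.process("10.0.0.10", "198.51.100.1", b"scan"),    # outbound 1
    wall.process("10.0.0.10", "198.51.100.2", b"scan"),    # outbound 2
    wall.process("10.0.0.10", "198.51.100.3", b"scan"),    # outbound 3, over the limit
]
central_store = []
sent = wall.collect(central_store)
```

In this model inbound traffic is never blocked, so attackers can reach the Honeypots freely, while a compromised Honeypot is prevented from opening an unlimited number of connections to outside hosts.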
Implementation and design challenges of Honeypots and Honeynets relevant to our thesis are discussed in Chapter 3.