Chapter 2 : Literature Review
2.5 Machine Learning in IDSs
2.5.2 Insight Identification and Standards for Data-Mining
Selecting a suitable dataset is one of the key considerations in the design and evaluation of attack detection systems. In particular, several public datasets are used to train and test machine-learning-based IDSs. As established through the literature reviews in [47] and [48], two datasets are most commonly used for network intrusion detection systems: DARPA (Defence Advanced Research Projects Agency) and KDD Cup ’99 (Knowledge Discovery and Data Mining).
The former is recognised as the first standard corpus for evaluating computer network attack detection systems. It was collected and distributed by the MIT (Massachusetts Institute of Technology) Lincoln Laboratory under the sponsorship of DARPA and the Air Force Research Laboratory (AFRL). Researchers have used this dataset widely for training and testing attack detectors, with results reported as acceptable by current standards. The MIT Lincoln Laboratory has in fact documented several datasets that together form the DARPA Intrusion Detection Evaluation, namely the 1998, 1999 and 2000 datasets. The 1998 data were gathered over 9 weeks (7 weeks of training data and 2 weeks of testing data); the 1999 data comprise 3 weeks of training data and 2 weeks of testing data; and the 2000 data comprise datasets for two scenarios. Past works show that the DARPA dataset has been used less often since the introduction of the KDD Cup ’99 dataset, because the latter is considered to have overcome various restrictions and drawbacks of the former. The most fundamental drawback of the DARPA dataset is that the accuracy of the background traffic used in the evaluation cannot be established, because the testbed traffic generation software is not publicly available. Other commonly cited critiques centre on the approaches applied in creating the dataset and in carrying out the evaluation [49]. The background traffic was generated with simple models, and had live traffic been used, the false-positive rate would have been notably higher. Furthermore, the background data did not include sources of background noise such as unusual packets and packet storms.
Other critiques concern irregularities in the data, as noted in [48], where a trivial detector achieves an appreciable detection rate because the TTL values of attack packets differ markedly from those of normal packets. Despite these criticisms, the DARPA dataset is still used to some extent by researchers for IDS evaluation, as highlighted in [50], [51].
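The TTL irregularity noted in [48] can be illustrated with a minimal sketch of such a trivial detector. The sketch is an assumption made for illustration only: the packet dictionaries, the field names `ttl` and `label`, the function names and the toy values are all hypothetical and are not taken from [48] or from the DARPA data.

```python
def fit_normal_ttls(training_packets):
    """Collect the TTL values seen in packets labelled as normal."""
    return {pkt["ttl"] for pkt in training_packets if pkt["label"] == "normal"}


def flag_by_ttl(packets, normal_ttls):
    """Flag a packet as suspicious when its TTL was never seen in normal traffic."""
    return [pkt["ttl"] not in normal_ttls for pkt in packets]


def detection_rate(packets, flags):
    """Fraction of attack packets that were flagged."""
    attack_flags = [f for p, f in zip(packets, flags) if p["label"] != "normal"]
    if not attack_flags:
        return 0.0
    return sum(attack_flags) / len(attack_flags)


# Hypothetical toy data: TTL values chosen only to illustrate the idea.
train = [
    {"ttl": 64, "label": "normal"},
    {"ttl": 128, "label": "normal"},
    {"ttl": 253, "label": "attack"},
]
test = [
    {"ttl": 64, "label": "normal"},
    {"ttl": 253, "label": "attack"},
    {"ttl": 126, "label": "attack"},
    {"ttl": 128, "label": "attack"},  # shares a TTL with normal traffic, so it is missed
]

normal_ttls = fit_normal_ttls(train)
flags = flag_by_ttl(test, normal_ttls)
rate = detection_rate(test, flags)
```

A detector of this kind learns nothing about attack behaviour. Any detection rate it achieves reflects an artefact of how the data were generated, which is why the irregularity weakens evaluations based on the dataset.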
Since 1999, the KDD Cup ’99 dataset has been the most commonly used dataset for evaluating detection methods. It was derived from the TCP/IP data gathered for the DARPA 1998 evaluation. The KDD training data encompass approximately five million records, whereas the test set comprises approximately four million records, each described by 41 features. The training data contain 24 attack types, and the test data add a further 14 types. Every training record carries a label, which either names the attack type or marks the record as normal. The attacks fall into four groups: DoS, Probe, R2L, and U2R. Detailed explanations of the attack types used for training are given in [52]. Moreover, the features of the KDD Cup ’99 dataset fall into three categories, as follows [51]:
1. Basic Features: This category contains all the attributes that can be derived from a TCP/IP connection. Relying on these features leads to a delay in the detection of attacks.
2. Traffic Features: These features are computed over a time window and fall into two groups: same-host features and same-service features. Same-host features examine only the connections in the previous two seconds that have the same destination host as the current connection, whereas same-service features examine only the connections in the previous two seconds that have the same service as the current connection.
3. Content Features: These features are used to find suspicious behaviour in the data portions of packets. They make it possible to detect R2L and U2R attacks, because such attacks are embedded in the data portions of the packets and typically involve a single connection. This differs from DoS and Probe attacks, which involve many connections to different hosts within the same time period.
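The two-second window used for the traffic features in item 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to build the dataset: the record fields `time`, `dst_host` and `service`, the function name and the toy connections are hypothetical, and the current connection is assumed to be excluded from its own counts. The output keys `count` and `srv_count` follow the names of the corresponding KDD Cup ’99 features.

```python
from collections import deque

WINDOW_SECONDS = 2.0  # window length described for the traffic features


def traffic_features(connections):
    """Compute same-host and same-service counts over the previous two seconds.

    `connections` must be sorted by time; each item is a dict with the
    hypothetical keys 'time', 'dst_host' and 'service'.
    """
    window = deque()  # earlier connections that are still inside the window
    features = []
    for conn in connections:
        # Discard connections that started more than two seconds earlier.
        while window and conn["time"] - window[0]["time"] > WINDOW_SECONDS:
            window.popleft()
        same_host = sum(1 for c in window if c["dst_host"] == conn["dst_host"])
        same_service = sum(1 for c in window if c["service"] == conn["service"])
        features.append({"count": same_host, "srv_count": same_service})
        window.append(conn)
    return features


# Hypothetical toy connections, sorted by start time.
connections = [
    {"time": 0.0, "dst_host": "A", "service": "http"},
    {"time": 0.5, "dst_host": "A", "service": "ftp"},
    {"time": 1.0, "dst_host": "B", "service": "http"},
    {"time": 2.4, "dst_host": "A", "service": "http"},
    {"time": 5.0, "dst_host": "A", "service": "http"},
]
result = traffic_features(connections)
```

Because every count looks back over several connections, a burst of connections to one host or one service within two seconds raises these values. This suits DoS and Probe attacks, whereas single-connection R2L and U2R attacks leave the counts largely unchanged.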
Nonetheless, because manually labelling connections is costly and error-prone, and because of privacy concerns, obtaining public datasets for network attack detection is notably difficult. As such, the KDD data have been extensively examined and cited by the intrusion detection community, as they are among the very few datasets that are publicly available.