6.3 Evaluation metrics
In signal processing, Receiver Operating Characteristic (ROC) curves [169] are used to assess the quality of a receiver. Similarly, ROC curves have been used to evaluate IDSs, as in [170, 171]. The typical metrics used in IDS assessment are the detection rate (the ratio of detected actions to observable ones) and the false positive rate (the ratio of false alarms to benign events). For signature-based IDSs, which are non-parametric, a single point rather than a curve represents the ROC. Furthermore, IDS measures are non-binary, in contrast with signal processing metrics. For this reason, more generic metrics, inspired by Information Retrieval (IR) systems [172], are used to assess our correlation system. The four measures used are:
True positives (TP): denotes the correctly correlated alerts.
True negatives (TN): denotes the correctly uncorrelated alerts.
False positives (FP): denotes the incorrectly correlated alerts.
False negatives (FN): denotes the incorrectly uncorrelated alerts.
In IR, the confusion matrix is used to measure the precision and recall rates. In an alert correlation context, precision measures the soundness of the results and recall measures their completeness. Figure 6.1 shows the measurement terms applied to the correlation problem.
True Positives (TP): correctly correlated alert rates | False Positives (FP): incorrectly correlated alert rates
False Negatives (FN): incorrectly uncorrelated alert rates | True Negatives (TN): correctly uncorrelated alert rates
Figure 6.1. Confusion matrix.
The recall rate denotes the proportion of TP (correctly correlated alerts) to the total number of TP and FN (incorrectly uncorrelated alerts).
The true positive rate is the rate of correctly correlated alerts, which is given by the recall rate; its optimal value is 100%.
The precision rate denotes the proportion of TP (correctly correlated alerts) to the total number of TP and FP (incorrectly correlated alerts).
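Written as formulas, the two definitions above read as follows. This is only a restatement of the prose definitions, with TP, FP and FN taken as the alert counts defined earlier:

```latex
\[
\mathrm{Recall} = \frac{TP}{TP + FN},
\qquad
\mathrm{Precision} = \frac{TP}{TP + FP}
\]
```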
Hence, the alerts correlated by the system (assigned as related, although some may not be) are the total of TP and FP. On the other hand, the related alerts (known to be related, and which must therefore be correlated) are the total of TP and FN. The optimal result is a high recall rate together with a high precision rate, which means maximum precision and detection coverage. Figure 6.2 illustrates the relationships between the confusion matrix measurements.
[Figure 6.2 region labels: Detected correlated alerts = Irrelevant alerts – correlated (FP) + Relevant alerts – correlated (TP); Relevant alerts = Relevant alerts – correlated (TP) + Relevant alerts – uncorrelated (FN); Irrelevant alerts – uncorrelated (TN).]
Figure 6.2. Relations between the confusion matrix measures.
The overall system accuracy is the ratio of correct results (true positives and true negatives) to the total number of identified results.
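The three rates can be computed directly from the four confusion matrix counts. The following is a minimal sketch rather than the evaluation code used in this work: the class and function names are illustrative, the counts are hypothetical, and returning 0.0 for an empty denominator is an assumed convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Alert counts for each cell of the confusion matrix (illustrative type)."""
    tp: int  # correctly correlated alerts
    tn: int  # correctly uncorrelated alerts
    fp: int  # incorrectly correlated alerts
    fn: int  # incorrectly uncorrelated alerts


def _ratio(numerator: int, denominator: int) -> float:
    # Assumed convention: an empty denominator yields 0.0 instead of an error.
    return numerator / denominator if denominator else 0.0


def recall(c: ConfusionCounts) -> float:
    """TP / (TP + FN): share of related alerts that were correlated."""
    return _ratio(c.tp, c.tp + c.fn)


def precision(c: ConfusionCounts) -> float:
    """TP / (TP + FP): share of correlated alerts that are truly related."""
    return _ratio(c.tp, c.tp + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    """(TP + TN) / (TP + TN + FP + FN): share of all results that are correct."""
    return _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)


# Hypothetical counts for illustration only; these are not results of this work.
counts = ConfusionCounts(tp=80, tn=100, fp=20, fn=10)
print(f"recall={recall(counts):.3f}",
      f"precision={precision(counts):.3f}",
      f"accuracy={accuracy(counts):.3f}")
```

Keeping the three rates as separate functions over one immutable record makes it easy to report them side by side for each dataset.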
6.4 Datasets
The lack of sufficient benchmarking datasets has been identified as the major difficulty in evaluating IDSs in general [132]. Nevertheless, some available datasets have been used to evaluate alert correlation systems, such as DARPA2000 [161], Defcon [54] and honeypot datasets. Of these, the DARPA2000 dataset remains the reference point for comparing results. The DARPA dataset was originally created to assess IDS sensors and was not designed for alert correlation systems. Even though it has been heavily criticised [22] for the lack of realism of its background traffic, for being old and for not reflecting real attack scenarios, it is the only well-documented dataset available. The Defcon dataset, a network capture of a competition for hackers, is also commonly used to assess the correlation process. However, it differs from real-world traffic because it contains only a huge volume of attack traffic, with very few IP addresses. The offline nature of such recorded traces creates some problems. First, the sensor alerts are not included, so a particular sensor has to be used to regenerate the alerts, and the result may differ from that of other sensors depending on sensor coverage. Second, verification typically relies on the status of the target at the time of the attack, which has to be established manually when working from capture files. Furthermore, most of these traces are synthetically created and lack the mix of normal and anomalous traffic found in real-life traffic.
On the other hand, real traces recorded from real-life networks lack the necessary ground truth, and the attack traffic in these data does not contain enough activity to represent successful multi-stage attacks [173]. In the main, datasets can be collected using five different methods:
1- An attack-only dataset with no background traffic, which is very simple to produce and is used only for basic validation of detection functionality.
2- A dataset consisting of real background traffic obtained from production networks and synthetic attacks, which is similar to real-life traffic to some extent. However, it is not fully controlled, raises privacy concerns and is not available for public use.
3- A dataset similar to method 2 above, but where the background traffic is sanitised to provide semi-real-life traffic. However, traffic data sanitisation is a cumbersome and error-prone task.
4- An entirely real dataset with real background traffic and real attacks captured from a production network environment. This method requires comprehensive analysis and data labelling, which is difficult, in addition to raising privacy concerns and yielding an uncontrolled dataset. Moreover, the collected attacks are not only insufficient but also require lengthy observation, which makes analysis difficult.
5- A dataset with both synthetic attacks and synthetic background traffic. The main advantage of this method is that the test environment is totally controlled and there is no potential for unidentified variables. Consequently, the results attained are more reliable and accurate. The drawbacks of this method are that it is very costly, because various pieces of hardware and software as well as services have to be installed, and that it inherently does not reflect real-life traffic.
Our evaluation methodology is to use different datasets as follows:
- Dataset traces from .pcap files, using the same timestamps for comparison purposes.
- Datasets obtained from a controlled setup to simulate real-life traffic.
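To illustrate how capture timestamps can be recovered from a .pcap trace so that runs over the same file are comparable, the following sketch reads the per-packet record headers using only the Python standard library. It is not the tooling used in this work: the function name is hypothetical, and it assumes the classic pcap layout with microsecond timestamps (it handles neither pcapng nor the nanosecond variant).

```python
import struct
from typing import BinaryIO, Iterator

PCAP_MAGIC_USEC = 0xA1B2C3D4  # classic pcap magic number, microsecond timestamps


def packet_timestamps(stream: BinaryIO) -> Iterator[float]:
    """Yield each packet's capture time in seconds since the epoch."""
    header = stream.read(24)  # the global file header is 24 bytes long
    if len(header) < 24:
        raise ValueError("truncated pcap global header")
    # Reading the magic number in both byte orders reveals the file's endianness.
    if struct.unpack("<I", header[:4])[0] == PCAP_MAGIC_USEC:
        endian = "<"
    elif struct.unpack(">I", header[:4])[0] == PCAP_MAGIC_USEC:
        endian = ">"
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    while True:
        record = stream.read(16)  # each per-packet record header is 16 bytes
        if len(record) < 16:
            return  # end of file, or a truncated trailing record
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(endian + "IIII", record)
        stream.read(incl_len)  # skip the captured packet bytes
        yield ts_sec + ts_usec / 1_000_000
```

A file opened in binary mode, for example `open(path, "rb")` with `path` pointing at a trace, can be passed directly as `stream`.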