Chapter 5: UNB ISCX 2012 Dataset Transformation

5.4 Server Specifications

This experiment was run on a server with a 2U Supermicro chassis; 8x hot-swap 2.5" SAS/SATA disk bays; a Supermicro X8DTU-LN4F+ motherboard; dual Intel Xeon E5620 (quad-core) processors; 24 GB RAM (6 x 4 GB DDR3 ECC RDIMM); 4x 1 TB SATA disks (RAID 10); and 4x 1 Gb Ethernet. It ran the Windows Server 2012 R2 Datacentre (64-bit) operating system [351]. A virtual machine (VM) was created on Hyper-V with 4 virtual processors and 4 GB RAM. This VM hosted the SecurityOnion (12.04.5.1-20150205) operating system [352], on which Bro (2.4), Perl (5.18.2) [353], TCPTRACE (6.6.7) [350] and TShark (1.6.7) [349] were installed to run these experiments. Bro was chosen to process the PCAP files because of its high speed, its extensibility and its ability to extract features at multiple levels (frame header, IP header and transport headers) at the same time. It was also capable of extracting the content of unencrypted traffic if needed. TShark and TCPTRACE were used to validate the results of Bro’s TCP connections, which confirmed that the numbers in the generated dataset added up for all the targeted traffic. Perl was used to map the processed connections to their correct labels in the traffic flow provided by the ISCX2012 dataset. The “DMwR” package (0.4.1) [346] in R [321] was used to run the SMOTE algorithm [207] to generate the synthetic attack traffic.
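The SMOTE step mentioned above can be illustrated with a minimal sketch. It is written in Python rather than R and is not the DMwR implementation; the function name `smote`, the parameter names and the `is_synthetic` flag are assumptions made for illustration only. It shows the core idea of SMOTE: each synthetic sample is placed at a random point on the line between a minority-class sample and one of its nearest minority-class neighbours.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic samples by interpolating between minority samples.

    minority: list of numeric feature vectors (lists of floats).
    Returns a list of dicts, each flagged as synthetic (hypothetical field).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [m for m in minority if m is not base]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        point = [a + gap * (b - a) for a, b in zip(base, neighbour)]
        synthetic.append({"features": point, "is_synthetic": True})
    return synthetic


if __name__ == "__main__":
    attacks = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.5, 10.5]]
    for sample in smote(attacks, 3, k=2):
        print(sample)
```

Flagging each generated sample keeps synthetic and original records separable, so the synthetic ones can be dropped later if required.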

5.5 Limitations

The transformation process discussed above produced a large dataset in the network security domain that addressed many of the limitations of other known datasets in the field. However, there were still a number of limitations to the process that was used.

Firstly, even though the ISCX2012 dataset contained all of the exchanged payload unencrypted, this transformation did not generate any content-based features similar to those in KDD [155], NSL-KDD [157] or gureKDD [160-162]. The decision to make the generation process generic was taken to avoid complications that arise in a real-life environment, such as encryption and privacy concerns, as explained in Section 5.2.1. In addition, the payload of every service requires a different set of features specific to that service, so producing a general set of features to profile the content of all the different services would have been challenging. Moreover, this kind of profiling could be addressed through another line of research, in which service-specific IDS could be investigated.

Secondly, this transformation adopted the settings suggested by Lee et al. [354-356], Stolfo et al. [155] and Perona et al. [162] in using a window of 100 connections to derive the connection-based features, and followed the documentation of Onut et al. [15] in using a window of 5 seconds for the time-based features. It is not clear whether these sizes suit all network traffic and targeted profiling, or whether the window size should be adapted to specific traffic. Further investigation is required to analyse the effect of different connection-window and time-window sizes on different traffic patterns and attack types.

Thirdly, the SMOTE algorithm was used to generate synthetic traffic and balance the dataset. However, it is not clear whether the SMOTE algorithm was the best choice for such a domain, i.e. network security, as the values of a connection feature are not actually random because of the standards governed by networking protocols. Because the SMOTE algorithm introduces randomness, the instances it generates could cast doubt on the validity of those samples. Therefore, as this issue might have affected the quality of the dataset, the generation process ensured that every connection was identifiable. Researchers can omit these synthetic instances, or even use the original connections in the dataset, to come up with their own balanced version using whatever technique best fits their research aim.
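The effect of the two window sizes discussed in the second limitation can be explored with a small sketch. The code below is an illustration only, not the Bro-based process used in this study: the function name `window_counts`, the input layout (timestamp, destination host) and the choice of "same destination host" as the counted property are assumptions. For each connection it counts the earlier connections to the same host within the last `conn_window` connections and within the last `time_window` seconds.

```python
from collections import deque


def window_counts(connections, conn_window=100, time_window=5.0):
    """Count earlier same-host connections in two sliding windows.

    connections: list of (timestamp, dst_host) tuples sorted by timestamp.
    Returns one (count_in_connection_window, count_in_time_window) per input.
    """
    recent_by_count = deque(maxlen=conn_window)  # last N connections
    recent_by_time = deque()                     # connections in last T seconds
    results = []
    for ts, dst in connections:
        # Drop connections that fell out of the time window.
        while recent_by_time and ts - recent_by_time[0][0] > time_window:
            recent_by_time.popleft()
        same_host_conn = sum(1 for _, d in recent_by_count if d == dst)
        same_host_time = sum(1 for _, d in recent_by_time if d == dst)
        results.append((same_host_conn, same_host_time))
        # The current connection becomes history for the next one.
        recent_by_count.append((ts, dst))
        recent_by_time.append((ts, dst))
    return results


if __name__ == "__main__":
    trace = [(0.0, "a"), (1.0, "a"), (2.0, "b"), (7.0, "a"), (7.5, "a")]
    print(window_counts(trace, conn_window=2, time_window=5.0))
```

Re-running such a function with different values of `conn_window` and `time_window` is one simple way to see how sensitive the derived features are to the window choice.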

Fourthly, the labelling of this dataset used the tags provided within the XML files, which only provide binary options (Normal or Attack). Although different attack scenarios were performed in the ISCX2012 dataset, further investigation would have been required to distinguish them. Due to time limitations, the binary labels provided by the flow (XML) files were used.

Finally, this transformation process assumed that all connections in the dataset were normal except for those labelled otherwise in the XML files. This decision was dictated by the issues already discussed in relation to the ISCX2012 dataset, which affected the mapping and labelling process. For example, a processed connection (from the transformation process) could have been mapped to a number of split connections from the XML files (as discussed in Problem 5 in Section 5.3.3), where the split connections had mixed labels. In such cases, Attack was used to label these connections in the resultant dataset.
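The labelling rule described above can be expressed as a short sketch. This is an illustration and not the Perl code used in this study; the function name `resolve_label` and the list-of-strings input are assumptions. A processed connection with no matching XML flow is treated as Normal, and a connection mapped to split flows with mixed labels is treated as Attack.

```python
def resolve_label(split_labels):
    """Return the label of a processed connection.

    split_labels: labels ("Normal" or "Attack") of the XML flows that
    the processed connection was mapped to; may be empty.
    """
    if not split_labels:
        # No matching flow in the XML files: assumed normal.
        return "Normal"
    # Any attack flow among the split connections marks the whole connection.
    return "Attack" if "Attack" in split_labels else "Normal"


if __name__ == "__main__":
    print(resolve_label(["Normal", "Attack", "Normal"]))
    print(resolve_label([]))
```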

5.6 Summary

This chapter has outlined the transformation process of the UNB ISCX 2012 dataset into a KDD-like format. This transformation took into account many of the lessons learned during the analysis of the KDD 1999 dataset [1] and the investigation of the flow files in the ISCX2012 dataset. These lessons could be summarised as guidelines for dataset generators and authors who require a comparable transformation as follows:

1. Capture every targeted packet. This requirement ensures that no packet is dropped or neglected without valid reason. In any transformation, the total number of packets in the original (raw) dataset should match the total in the resultant (transformed) dataset.

2. All values (the number of packets, byte sizes, durations, timestamps, etc.) should be computed correctly. This addresses the limitation of using nonstandard tools to perform packet processing. For this reason, this experiment used well-known software (Bro) for flow processing. Another possible obstacle is misconfiguration of the tool used, which could cause some traffic to be overlooked and in turn result in a loss of information. In addition, the total number of packets in the PCAP files must equal the sum of the packets over all profiled connections, so that any mismatch can be investigated.

3. Correctly extract every IP address. An example of this problem is the IPv6 addresses in the labelled flow (XML) files of the ISCX2012 dataset, which were not processed correctly and resulted in useless data.

4. Use timestamps rather than the human-readable date/time format (or use both). The human-readable date/time format does not translate back accurately, or with the precision of the original timestamp, when it is converted to match connections.

5. Use multiple standard tools or libraries (TCPTRACE, TShark, Bro, etc.) to ensure the same view of connections and to provide guidance when differences arise between tools. Differences usually arise when a tool is configured in a way that is not readily accepted in production environments; this can be picked up when multiple standard tools produce different views of the same traffic.

6. Ensure that every flow direction is correctly represented. This avoids problems caused by a code error or a tool bug that might mix up flow processing and aggregate packets in the wrong direction, which would affect the quality of the transformed data.

7. Transformation should be based on a clear definition of connections. Ensure that every connection is clearly defined for every targeted protocol, so that the start and end criteria of such connections are clearly defined. For example, some tools will define a TCP connection from the first SYN packet to the last FIN packet, while others might define the start of a connection as being from the successful completion of the handshake phase to the last FIN packet or a certain idle period. Consistent definitions will ensure the transformation process is reproducible by other researchers.
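Two of the checks in the guidelines above (matching packet totals and comparing the views of several tools) can be sketched as follows. This is a minimal illustration, not the validation scripts used in this study: the function names, the dictionary layout and the sample values are hypothetical, and the per-connection counts are assumed to have already been exported from each tool.

```python
def reconcile_packets(pcap_total, connections):
    """Return the number of packets not accounted for by profiled connections.

    pcap_total: total packets counted in the PCAP files.
    connections: list of dicts with a "packets" field (hypothetical layout).
    A result of 0 means every packet belongs to a profiled connection.
    """
    profiled = sum(c["packets"] for c in connections)
    return pcap_total - profiled


def compare_tools(view_a, view_b):
    """Return the connection keys on which two tools disagree.

    Each view maps a connection key (for example a 5-tuple) to a packet
    count; a key missing from one view also counts as a disagreement.
    """
    keys = set(view_a) | set(view_b)
    return sorted(k for k in keys if view_a.get(k) != view_b.get(k))


if __name__ == "__main__":
    web = ("10.0.0.1", 1234, "10.0.0.2", 80, "tcp")
    ssh = ("10.0.0.3", 5555, "10.0.0.2", 22, "tcp")
    first_tool = {web: 12, ssh: 7}    # made-up counts from one tool
    second_tool = {web: 12, ssh: 8}   # made-up counts from another tool
    print(reconcile_packets(19, [{"packets": 12}, {"packets": 7}]))
    print(compare_tools(first_tool, second_tool))
```

Any non-zero remainder or non-empty list of disagreeing keys marks the connections that need to be investigated by hand.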

As a result of following these guidelines, it is believed that the resultant dataset (STA2018) of this study provided the most accurate profile for every connection in the UNB ISCX 2012 dataset. The STA2018 will be used for the experiments of the following chapters.
