5.2 Implementation Details
5.2.3 Trace Extraction and Feature Analysis
The BotFinder core implements steps 3 and 4 – the trace assembly and the feature analysis – and expects as input a flow level representation of network communication. Based on the tuple of matching source IP address, destination
IP address, protocol ID and destination port, a dictionary structure – the “trace assembly buffer” – is filled with the flows that build traces.
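The trace assembly buffer described above can be sketched as a dictionary keyed by the four-tuple. This is a minimal illustration and not BotFinder's actual code: the `Flow` record, its field names, and the `assemble_traces` helper are assumptions made for the example.

```python
from collections import defaultdict, namedtuple

# Hypothetical flow record; the field names are assumptions for illustration.
Flow = namedtuple("Flow", "start src_ip dst_ip proto dst_port num_bytes")


def assemble_traces(flows):
    """Group flows into traces keyed by (src IP, dst IP, protocol, dst port)."""
    buffer = defaultdict(list)  # plays the role of the trace assembly buffer
    for flow in flows:
        key = (flow.src_ip, flow.dst_ip, flow.proto, flow.dst_port)
        buffer[key].append(flow)
    # Order the flows of each trace chronologically.
    for trace in buffer.values():
        trace.sort(key=lambda f: f.start)
    return dict(buffer)


flows = [
    Flow(30.0, "10.0.0.5", "198.51.100.7", 6, 80, 420),
    Flow(10.0, "10.0.0.5", "198.51.100.7", 6, 80, 400),
    Flow(12.0, "10.0.0.9", "198.51.100.7", 6, 443, 900),
]
traces = assemble_traces(flows)
```

Because every flow touches exactly one key, the buffer can be filled in a single pass over the input, but its memory footprint grows with the number of distinct four-tuples.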
This trace assembly buffer raises special technical challenges for the analysis of large quantities of data: to reassemble sufficiently long traces (Section 6.2), as many connections as possible should be included in each trace. However, if millions or even billions of flows are under investigation in parallel, maintaining the current state of all traces in the trace assembly buffer in memory is hard. Moreover, a live deployment of BotFinder has to report analysis results at regular intervals. The following measures address these challenges.
Balancing the Trace Assembly Load
To process large amounts of flow records in real time, BotFinder offers various parallelization options that benefit from the fact that BotFinder performs no horizontal correlation between independent flows.
Multiple Analysis Servers: On a large scale, the data collection server or a special module at the analysis server may balance the trace analysis load among different servers. For this purpose, different source IP address ranges may be assigned to different servers to coordinate the load balancing.
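A range-based dispatch of this kind could look as follows. The sketch is only an assumption about how such a balancer might be written: the server names, the address ranges, and the `server_for` helper are hypothetical.

```python
import ipaddress

# Hypothetical assignment of source address ranges to analysis servers.
RANGES = [
    (ipaddress.ip_network("10.0.0.0/25"), "analysis-1"),
    (ipaddress.ip_network("10.0.0.128/25"), "analysis-2"),
]


def server_for(src_ip):
    """Return the server responsible for a source IP, or None if unassigned."""
    addr = ipaddress.ip_address(src_ip)
    for network, server in RANGES:
        if addr in network:
            return server
    return None
```

Since a trace is keyed by its source IP address, routing by source range keeps all flows of one trace on the same server.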
Simple Size Limitation: The simplest way to keep the trace assembly buffer in the targeted range, which is typically the overall memory size, is to limit the number of flows that are read in between two trace calculation steps. This method requires the user to specify the number of traces that should be read in. Statistically, the smaller the trace assembly buffer, the fewer long traces are generated and the lower the detection quality of BotFinder. Furthermore, BotFinder offers two different behaviors once a full trace assembly buffer has been reached and processed. In the non-overlapping mode, the trace assembly buffer is deleted and the next processing step starts after the buffer has been refilled. Alternatively, BotFinder allows the trace assembly buffers of consecutive processing intervals to overlap (e.g., by 25%) to minimize problems at the end of each processing buffer. Nevertheless, simply reducing the size of the buffer has disadvantages, such as the potential loss of traces due to small buffer sizes.
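The two buffer modes can be illustrated with a small generator. This is a sketch under one assumption the text does not spell out: "overlap" is taken to mean that the last fraction of flows of one buffer is carried over into the next. The function name and parameters are hypothetical.

```python
def buffered_batches(flows, buffer_size, overlap=0.0):
    """Yield lists of at most buffer_size flows.

    With overlap > 0, each list starts with the last `overlap` fraction
    of the previous list (assumed meaning of the overlapping mode).
    """
    if buffer_size < 1 or not 0.0 <= overlap < 1.0:
        raise ValueError("need buffer_size >= 1 and 0 <= overlap < 1")
    keep = int(buffer_size * overlap)  # flows carried into the next buffer
    batch = []
    for flow in flows:
        batch.append(flow)
        if len(batch) == buffer_size:
            yield list(batch)
            # Non-overlapping mode discards everything; otherwise keep the tail.
            batch = batch[buffer_size - keep:] if keep else []
    if len(batch) > keep:  # flows that were not part of any emitted buffer
        yield batch


non_overlapping = list(buffered_batches(range(10), buffer_size=4))
overlapping = list(buffered_batches(range(10), buffer_size=4, overlap=0.25))
```

In the overlapping variant, a trace that straddles a buffer boundary keeps some of its earlier flows, at the cost of processing those flows twice.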
File Based Analysis: As the trace building process is linear and requires no interaction with other traces, the processing is easily split by IP address, with the flows stored in different files. As an example, the traffic of a /24 network may be split into up to 254 files, each of which is processed individually. Thereby, the amount of data that has to be kept simultaneously in the trace buffer is limited.
Especially for offline analysis, the file based approach provides substantial benefits. Conceptually, the number of sub-splitting files is unlimited, although technically a limitation of ≈ 2^10 is observed (Ubuntu 11.04). Still, using, for example, 768 files to distribute flows based on the source IP address allows traces to be computed over a significantly longer timeframe than simply storing all traces in memory.
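The file based split can be sketched as follows. The example is illustrative only: the CSV layout, the file naming, the modulo-based bucket choice, and the `split_flows_by_source` helper are assumptions and not taken from BotFinder.

```python
import csv
import ipaddress
import os
import tempfile


def split_flows_by_source(flows, out_dir, num_files):
    """Write each flow (a tuple starting with the source IP) to one of
    num_files CSV files, chosen by the numeric source address."""
    outputs = {}  # bucket index -> (file handle, csv writer)
    try:
        for flow in flows:
            bucket = int(ipaddress.ip_address(flow[0])) % num_files
            if bucket not in outputs:
                path = os.path.join(out_dir, f"flows_{bucket:04d}.csv")
                handle = open(path, "w", newline="")
                outputs[bucket] = (handle, csv.writer(handle))
            outputs[bucket][1].writerow(flow)
    finally:
        for handle, _ in outputs.values():
            handle.close()
    return sorted(outputs)


flows = [
    ("10.0.0.1", "198.51.100.7", 6, 80),
    ("10.0.0.2", "198.51.100.7", 6, 80),
    ("10.0.0.1", "198.51.100.9", 6, 443),
]
with tempfile.TemporaryDirectory() as tmp:
    buckets = split_flows_by_source(flows, tmp, num_files=254)
    files = sorted(os.listdir(tmp))
```

All flows of one source address land in the same file, so each file can later be loaded into the trace assembly buffer and processed on its own. The sketch keeps one open handle per bucket, so the number of files it can use at once is bounded by the number of files the process may hold open.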
Processing the Trace Buffer
After all files are read or the trace assembly buffer is full, the processing starts. To speed up the process, BotFinder uses multiple threads to separate the work of reading data from the processing workload. Thereby, after reading the buffer, sufficiently long traces are sent to the feature extraction clients and the reading of a new buffer continues while the old buffer is being processed. This concurrency is especially useful if traffic or NetFlow data is directly piped into BotFinder without saving it to files first.
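The separation of reading and processing can be illustrated with two threads and a bounded queue. This is a minimal sketch, not BotFinder's implementation: the function names, the `min_length` threshold, and the use of a `None` sentinel are assumptions.

```python
import queue
import threading


def reader(buffers, handoff):
    """Hand over each filled buffer, then signal the end of the input."""
    for buffer in buffers:
        handoff.put(buffer)  # blocks while the previous buffer is still queued
    handoff.put(None)  # sentinel: no more buffers


def processor(handoff, results, min_length):
    """Forward only sufficiently long traces of each buffer."""
    while True:
        buffer = handoff.get()
        if buffer is None:
            break
        results.extend(trace for trace in buffer if len(trace) >= min_length)


handoff = queue.Queue(maxsize=1)
results = []
# Each buffer is a list of traces; each trace is a list of flow timestamps.
buffers = [[[1, 2, 3], [4]], [[5, 6, 7, 8], [9, 10]]]

read_thread = threading.Thread(target=reader, args=(buffers, handoff))
proc_thread = threading.Thread(target=processor, args=(handoff, results, 3))
read_thread.start()
proc_thread.start()
read_thread.join()
proc_thread.join()
```

With `maxsize=1`, the reader can prepare at most one buffer ahead of the processor, which bounds memory use while still letting both activities overlap.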
For the feature extraction, BotFinder's performance is inherently limited by the computational cost of performing large numbers of Fast Fourier Transforms and other mathematical tests. To cope with this problem, BotFinder supports optional computational workers, as highlighted in the blue cloud in Figure 5.1. These calculation clients expect a trace as input, perform all statistical analysis, and report the information on averages, deviations, and FFT back to the analysis server. As the processing of one trace does not require any result from the processing of another trace, the process parallelizes well and an arbitrary number of machines can be used for computational support. In an optimal scenario, the processing completes just before the trace assembly buffer is filled again.
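Because each trace is analyzed independently, the per-trace work maps directly onto a pool of workers. The sketch below is an assumption-laden illustration: it uses a local thread pool in place of networked calculation clients, a naive discrete Fourier transform in place of an FFT library, and a hypothetical feature set (mean and deviation of the flow intervals plus the dominant period of a one-second binned activity signal).

```python
import cmath
import statistics
from concurrent.futures import ThreadPoolExecutor


def dominant_period(signal):
    """Period (in samples) of the strongest non-constant component,
    found with a naive O(n^2) DFT instead of an FFT."""
    n = len(signal)
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):  # skip k = 0, the constant component
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(signal))
        if abs(coeff) > best_power:
            best_k, best_power = k, abs(coeff)
    return n / best_k if best_k else None


def extract_features(start_times):
    """Compute features for one trace, given sorted flow start times
    (at least two flows are required)."""
    intervals = [b - a for a, b in zip(start_times, start_times[1:])]
    first, last = start_times[0], start_times[-1]
    signal = [0] * (int(last - first) + 1)  # one bin per second (assumption)
    for t in start_times:
        signal[int(t - first)] = 1
    return {
        "mean_interval": statistics.mean(intervals),
        "stdev_interval": statistics.pstdev(intervals),
        "dominant_period": dominant_period(signal),
    }


traces = [
    [0, 10, 20, 30, 40, 50],
    [0, 5, 10, 15, 20, 25, 30],
]
# Each trace is processed independently, so the work can simply be mapped.
with ThreadPoolExecutor(max_workers=2) as pool:
    features = list(pool.map(extract_features, traces))
```

The thread pool only demonstrates the structure. For CPU-bound work of this kind, separate processes or hosts, as in the networked setup described above, would be needed to obtain an actual speedup.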
All communication with the networked calculation clients is handled in an additional thread that manages multiple processing clients; both entities, the core script and the calculation client, may also run on the same host. As a result, BotFinder already reads new data while the buffer is being processed. The processing results are written to a file that serves as input either to the model-creation script or the model-comparison (detection) script.