Measurement Methodology
3.1 Data Collection
Periodic network traffic (henceforth referred to as network heartbeats) is challenging to detect. It is first necessary to have network measurement infrastructure capable of capturing all traffic that passes through a network. Then one must have an analysis infrastructure capable of storing and processing the large amount of data collected. Even after these two preconditions have been met, designing and interpreting analyses that extract meaning from terabytes of raw data is no easy task. This section explains how we address these challenges.
3.1.1 Edge Network Description
In this thesis, we study the University of Calgary’s edge network, which is used by 32,000 undergraduate and graduate students and 3,000 faculty and staff. Like many organizations today, the U of C’s network has multiple subnets, with some for devices that are managed by the University’s IT staff, and some for devices that are managed by the users themselves (i.e., Bring Your Own Device). In our study, we refer to these subnets as “managed” and “BYOD”, respectively.
3.1.2 Measurement Infrastructure
Network Monitor
To conduct analysis on network traffic, we must first have the infrastructure in place to observe it.
To achieve this, we worked with campus IT to obtain mirrored streams of all traffic entering and exiting the campus network. These streams are then aggregated via a link aggregator tap. We then send the aggregated stream to our network monitor, which receives it using an Endace 10 Gb/s capture card. This card processes the network traffic and distributes it to a set of Bro worker nodes running on our monitor. Once processed in Bro, the resulting logs are periodically transmitted to our file server.
Our Endace card also enables the capture of full-packet traces. These packet capture abilities are used sparingly due to the large amounts of data flowing over the network at any given time. In addition, it is necessary to record packet captures for only one specific type of network traffic at a time to ensure that the card does not become overloaded with traffic.
Bro
Bro is an open-source network monitoring tool [34]. While intended for security, Bro provides a comprehensive platform for general network traffic analysis. Bro records network traffic summaries from a network interface connected to our monitor. These summaries take the form of Bro logs. Bro records logs for a number of different protocols such as HTTP, DNS, and SSL.
The logs used in this thesis are referred to as the conn logs. Conn logs consist of metadata associated with TCP, UDP, and ICMP connections¹. This means that the application-layer data sent over the connection is ignored, but information related to the nature of the connection is recorded. The fields most relevant to this thesis are the timestamp, protocol, destination port, sender’s IP address, and the receiver’s IP address. These fields provide the information needed to construct the time series of connections required for periodicity analysis.
An example of a Bro conn log (with fields not relevant to this study removed) is presented in Table 3.1. We also made use of the bytes sent, bytes received, and history fields to support our analysis.
¹ https://www.bro.org/sphinx/scripts/base/protocols/conn/main.bro.html#type-Conn::Info
Table 3.1: Example of a Bro conn log with anonymized IP addresses and unused fields removed.
Timestamp                   Orig. Host       Orig. Port  Resp. Host       Resp. Port  Protocol
2017-06-28 01:52:34.879706  136.159.643.597  39004       172.217.567.514  443         TCP
2017-06-28 01:52:35.386183  136.159.643.597  40634       216.58.592.530   443         TCP
2017-06-28 01:52:36.373944  136.159.643.597  39006       172.217.567.514  443         TCP
2017-06-28 01:59:29.111547  136.159.643.597  62935       40.90.652.448    443         TCP
2017-06-28 02:00:36.948038  136.159.643.597  60422       17.167.609.444   993         TCP
2017-06-28 02:00:36.948513  136.159.643.597  60424       132.208.508.467  993         TCP
2017-06-28 02:00:36.948596  136.159.643.597  60423       74.125.552.456   993         TCP
2017-06-28 02:00:37.440011  136.159.643.597  60431       132.208.508.467  993         TCP
2017-06-28 02:00:38.251809  136.159.643.597  60434       17.249.541.448   5223        TCP
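As an illustration of how conn log fields can be turned into a time series of connections, the following Python sketch groups the rows of Table 3.1 and computes the gaps between successive connections. The grouping key (originating host, responding host, responding port, protocol) and the function names are assumptions made for this example; they are not the queries used in this thesis.

```python
from collections import defaultdict
from datetime import datetime

# Rows copied from Table 3.1; the anonymized addresses are kept as plain strings.
ROWS = [
    ("2017-06-28 01:52:34.879706", "136.159.643.597", 39004, "172.217.567.514", 443, "TCP"),
    ("2017-06-28 01:52:35.386183", "136.159.643.597", 40634, "216.58.592.530", 443, "TCP"),
    ("2017-06-28 01:52:36.373944", "136.159.643.597", 39006, "172.217.567.514", 443, "TCP"),
    ("2017-06-28 01:59:29.111547", "136.159.643.597", 62935, "40.90.652.448", 443, "TCP"),
    ("2017-06-28 02:00:36.948038", "136.159.643.597", 60422, "17.167.609.444", 993, "TCP"),
    ("2017-06-28 02:00:36.948513", "136.159.643.597", 60424, "132.208.508.467", 993, "TCP"),
    ("2017-06-28 02:00:36.948596", "136.159.643.597", 60423, "74.125.552.456", 993, "TCP"),
    ("2017-06-28 02:00:37.440011", "136.159.643.597", 60431, "132.208.508.467", 993, "TCP"),
    ("2017-06-28 02:00:38.251809", "136.159.643.597", 60434, "17.249.541.448", 5223, "TCP"),
]


def build_series(rows):
    """Group connection start times by (orig host, resp host, resp port, protocol).

    The originating port is ignored because it is typically ephemeral.
    """
    series = defaultdict(list)
    for ts, orig_h, _orig_p, resp_h, resp_p, proto in rows:
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
        series[(orig_h, resp_h, resp_p, proto)].append(when)
    for times in series.values():
        times.sort()
    return dict(series)


def inter_arrival_seconds(times):
    """Return the gaps, in seconds, between consecutive connection start times."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    for key, times in build_series(ROWS).items():
        print(key, inter_arrival_seconds(times))
```

A series with regular gaps is a candidate heartbeat; the nine rows shown here are far too few to draw such a conclusion, so the sketch only demonstrates the data preparation step.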
Figure 3.1: Bro’s internal architecture.
Bro converts packet-level information into log summaries in real time. Figure 3.1 illustrates this process. Raw network packets are fed to Bro from our Endace card, as described above. These packets are then fed into the event engine, which generates a series of high-level events. These events are policy neutral, meaning that they explain what happened but not why. For example, if an HTTP request is seen on the wire, an “HTTP request” event is generated, which details the related IP addresses, ports, URL, and other information. However, the event engine will not attempt to apply any sort of rule to it.
Once an event is generated, it is passed to the policy script interpreter. Upon receiving an event, the policy script interpreter will execute a set of event handlers. These event handlers can be defined by the user. In our case, we use the logging scripts that come prepackaged with Bro. The script that generates our conn log entries records all transport-level metadata every time a TCP, UDP, or ICMP packet event is generated. The scripts also maintain state, so that they can aggregate the information from every packet in a transport-level connection into a single log entry. The conn log is thus populated with all of the transport-level connections observed.
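The stateful aggregation described above can be sketched conceptually in Python. This is not Bro's API (Bro policy scripts are written in Bro's own scripting language), and every class, method, and field name below is hypothetical. The sketch only shows the idea of event handlers that keep per-connection state and emit one log entry per connection.

```python
class ConnAggregator:
    """Conceptual stand-in for a stateful logging script (hypothetical names)."""

    def __init__(self):
        self.active = {}  # per-connection state, keyed by the connection 5-tuple
        self.log = []     # finished connection summaries, one entry per connection

    def on_packet(self, ts, orig_h, orig_p, resp_h, resp_p, proto, length, from_orig):
        """Handler for a packet event: update the state of the matching connection."""
        key = (orig_h, orig_p, resp_h, resp_p, proto)
        state = self.active.get(key)
        if state is None:
            state = {"first_ts": ts, "last_ts": ts, "orig_bytes": 0, "resp_bytes": 0}
            self.active[key] = state
        state["last_ts"] = ts
        if from_orig:
            state["orig_bytes"] += length
        else:
            state["resp_bytes"] += length

    def on_connection_end(self, key):
        """Handler for a connection-end event: write a single summary entry."""
        state = self.active.pop(key)
        orig_h, orig_p, resp_h, resp_p, proto = key
        self.log.append({
            "ts": state["first_ts"],
            "orig_h": orig_h, "orig_p": orig_p,
            "resp_h": resp_h, "resp_p": resp_p,
            "proto": proto,
            "duration": state["last_ts"] - state["first_ts"],
            "orig_bytes": state["orig_bytes"],
            "resp_bytes": state["resp_bytes"],
        })


if __name__ == "__main__":
    agg = ConnAggregator()
    key = ("10.0.0.1", 39004, "192.0.2.10", 443, "TCP")  # made-up addresses
    agg.on_packet(1.0, *key, length=100, from_orig=True)
    agg.on_packet(1.5, *key, length=400, from_orig=False)
    agg.on_connection_end(key)
    print(agg.log)
```

However many packets a connection carries, it contributes exactly one entry to the log, which is what keeps the conn logs compact relative to full-packet traces.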
Analysis Hardware
To conduct our analysis, we have created a dedicated computing cluster. This cluster consists of four individual machines working in concert. One machine serves as the client for our analytics database, HPE Vertica. The other three each host a Vertica node and conduct a portion of the analysis. All of these machines have a single core Intel Xeon 5110 1.6 GHz processor and 32 GB of RAM. These machines are approximately 10 years old, so their performance is not representative of more modern computer systems.
HPE Vertica
To conduct our analysis, we make use of the big data analytics database HPE Vertica². Based on the C-Store research project [45], Vertica is massively parallelized, optimized for big data analytics, and capable of processing terabytes or even petabytes of data. We conduct our analysis by loading the recorded Bro logs into a Vertica database as detailed in Appendix A.2. Once the logs are loaded, we use SQL queries to analyze the data. SQL queries make for a flexible analysis method that allows fast prototyping of analytic functions. In addition, Vertica provides many built-in functions common in analytics, such as functions for mean and sample variance calculation.
² https://www.vertica.com/
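For readers unfamiliar with the statistics named above, the following Python sketch computes the mean and the sample variance (with the n - 1 denominator) of a list of inter-connection gaps using the standard library. The gap values are made up for illustration, and the sketch stands in for, rather than reproduces, the database's built-in aggregate functions.

```python
import statistics


def summarize_gaps(gaps):
    """Return (mean, sample variance) for a list of gaps in seconds.

    statistics.variance uses the n - 1 denominator, i.e. the sample variance,
    and requires at least two data points.
    """
    return statistics.mean(gaps), statistics.variance(gaps)


if __name__ == "__main__":
    # Hypothetical gaps: the first list is nearly regular, the second is erratic.
    regular = [60.0, 60.2, 59.8, 60.0]
    erratic = [5.0, 200.0, 12.0, 23.0]
    print(summarize_gaps(regular))
    print(summarize_gaps(erratic))
```

A low variance relative to the mean suggests regular spacing between connections, which is why such summary statistics are a natural starting point when looking for periodic traffic.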
Despite the advantages of Vertica, there were practical challenges in its operation. These challenges fall into two categories:
• Space Restrictions: We only had access to the Community Edition of Vertica, which has a one terabyte license. Due to this restriction, we were limited in the number of logs that we could analyze at one time. We maximized our dataset by using only a subset of the fields available in the conn logs, and encoding data to minimize space consumption.
• Computation Restrictions: Our computation restrictions stem from how dated our analysis hardware was. Even though Vertica could handle processing large amounts of data, our computing cluster could not. To deal with this obstacle, we had to develop workarounds. For example, when analyzing our logs as a whole, we had to run one part of the analysis query first, then write the results to disk, load them into Vertica, and complete the second half of the query. These queries are presented in Appendix A.3.
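The two-stage workaround described in the second item can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in for Vertica; the table names, column names, threshold, and addresses are all hypothetical, and the queries actually used are the ones presented in Appendix A.3.

```python
import csv
import os
import sqlite3
import tempfile


def stage_one(conn, out_path):
    """Run the first half of the analysis and write the intermediate result to disk."""
    rows = conn.execute(
        "SELECT orig_h, resp_h, COUNT(*) FROM conn "
        "GROUP BY orig_h, resp_h ORDER BY orig_h, resp_h"
    ).fetchall()
    with open(out_path, "w", newline="") as fh:
        csv.writer(fh).writerows(rows)
    return len(rows)


def stage_two(conn, in_path, min_conns):
    """Load the intermediate result into a new table and finish the analysis."""
    conn.execute("CREATE TABLE pair_counts (orig_h TEXT, resp_h TEXT, n_conns INTEGER)")
    with open(in_path, newline="") as fh:
        conn.executemany("INSERT INTO pair_counts VALUES (?, ?, ?)", csv.reader(fh))
    return conn.execute(
        "SELECT orig_h, resp_h, n_conns FROM pair_counts "
        "WHERE n_conns >= ? ORDER BY n_conns DESC, resp_h",
        (min_conns,),
    ).fetchall()


def run_demo():
    """Build a tiny in-memory table with made-up rows and run both stages."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE conn (ts REAL, orig_h TEXT, resp_h TEXT)")
    conn.executemany("INSERT INTO conn VALUES (?, ?, ?)", [
        (0.0, "10.0.0.1", "192.0.2.10"),
        (60.0, "10.0.0.1", "192.0.2.10"),
        (120.0, "10.0.0.1", "192.0.2.10"),
        (5.0, "10.0.0.2", "192.0.2.20"),
    ])
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage_one.csv")
        stage_one(conn, path)
        result = stage_two(conn, path, min_conns=2)
    conn.close()
    return result


if __name__ == "__main__":
    print(run_demo())
```

Splitting the work this way trades extra disk I/O for a smaller working set in each stage, which is the property that matters on memory-constrained hardware.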
3.1.3 Data Visualization
To visualize the results of our analysis, we made use of two data visualization tools in tandem with Vertica. To generate the majority of our graphs, we used the Python visualization library matplotlib³. For example, to generate Figure 3.2, we first gathered data from Vertica using an SQL query, and then fed the result to a Python script that uses matplotlib to generate the graph. The topology graphs, like Figure 3.3, were generated with Gephi⁴. As in the previous example, we used SQL queries to gather the data, then loaded the results into Gephi and formatted the graph.
³ https://matplotlib.org/
⁴ https://gephi.org/
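The query-then-plot workflow described above can be sketched in Python. The sketch assumes that the query result is a list of timestamp strings and that the plot shows connections per hour of day; both are assumptions made for illustration, as are the function names, and the actual script and the exact quantity plotted in Figure 3.2 are not reproduced here.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(timestamps):
    """Count connections per hour of day (0 to 23) from timestamp strings."""
    counts = Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f").hour for ts in timestamps
    )
    return [counts.get(hour, 0) for hour in range(24)]


def plot_profile(counts, out_path):
    """Draw a bar chart of the hourly counts and save it to out_path."""
    # matplotlib is imported here so that the counting step has no third-party dependency.
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(range(24), counts)
    ax.set_xlabel("Hour of day")
    ax.set_ylabel("Connections")
    fig.savefig(out_path)
    plt.close(fig)


if __name__ == "__main__":
    sample = [
        "2017-06-28 01:52:34.879706",
        "2017-06-28 01:59:29.111547",
        "2017-06-28 02:00:36.948038",
    ]
    plot_profile(hourly_counts(sample), "traffic_profile.png")
```

Keeping the aggregation separate from the plotting mirrors the division of labour in the text: the database does the heavy counting, and the plotting script only formats an already small result.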
Figure 3.2: Traffic profile on our edge network.