
Journal of Physics: Conference Series 219 (2010) 052013

Monitoring Individual Traffic Flows within the ATLAS TDAQ Network

R Sjoen 1,4, S Stancu 2,3, M Ciobotaru 2,3, S M Batraneanu 1,2, L Leahu 1,2, B Martin 1 and A Al-Shabibi 1,5

1 CERN, 1211 Geneva 23, Switzerland
2 "Politehnica" University of Bucharest, Romania
3 University of California, Irvine, USA
4 Bergen University College, Norway
5 University of Heidelberg, Germany

E-mail: rune.velle.sjoen@cern.ch

Abstract. The ATLAS data acquisition system consists of four different networks interconnecting up to 2000 processors using up to 200 edge switches and five multi-blade chassis devices. The architecture of the system has been described in [1] and its operational model in [2]. Classical SNMP-based network monitoring provides statistics on aggregate traffic, but for performance monitoring and troubleshooting purposes there was an imperative need to identify and quantify single traffic flows. sFlow [3] is an industry standard based on statistical sampling which attempts to provide a solution to this.

Due to the size of the ATLAS network, the collection and analysis of the sFlow data from all devices generates a data handling problem of its own. This paper describes how this problem is addressed by making it possible to collect and store data either centrally or distributed according to need. The methods used to present the results in a relevant fashion for system analysts are discussed and we explore the possibilities and limitations of this diagnostic tool, giving an example of its use in solving system problems that arise during the ATLAS data taking.

1. Introduction

We first describe the concept of statistical sampling, followed by a description of the system architecture. The developed system consists of a collector, a processor, a storage solution, a service exposing the data and a web interface. After describing the general architecture, we discuss every component of the system in more detail using a bottom-up approach. Finally, we discuss data presentation techniques and give examples of typical use.

2. Statistical Sampling

The concept of statistical sampling involves capturing and examining a relatively small subset of the total amount of data flowing through a network, while still being able to make fairly accurate inferences and observations about the traffic. All the devices in the ATLAS data acquisition network support the sFlow standard. One of the design goals of this standard is to be able to accurately monitor traffic in high-speed networks without being overwhelmed by the volume of data that needs to be processed and stored.


We can define an individual traffic flow in a network as a set of packets belonging to a single conversation between two nodes during a certain period of time. The ability to identify individual traffic flows in a network enables powerful techniques for debugging. Implementing statistical packet sampling within the ATLAS data acquisition network gives us the ability to identify and examine the causes of unknown traffic patterns.

sFlow [3] is an industry standard which enables an Ethernet switch to take a sample of the packets traversing it and send the samples to a collector for permanent storage. The packet samples are analyzed in software and conversations at different network layers can be individually traced. The sFlow standard describes the mechanism used by the agents and allows implementation in hardware as well as software. sFlow samples every n-th packet on average; the sampling process uses randomness in order to avoid locking onto periodic patterns.

As every switch in ATLAS supports sFlow in hardware, there is the potential to concurrently monitor over 4000 ports. Since even brief transactions can be important, we operate sFlow at high sampling rates, up to one sample per 256 packets, which together with the large number of ports in the system generates a data handling problem of its own.
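To make the arithmetic behind this explicit, the following minimal Python sketch (illustrative only, not part of the deployed tools) shows a randomised 1-in-N sampling decision and how sampled counters are scaled back into traffic estimates, using the 1:256 rate quoted above.

```python
import random

SAMPLING_RATE = 256  # on average one sample per 256 packets (1:256)

def should_sample(rate=SAMPLING_RATE):
    """Randomised 1-in-N decision; randomness avoids locking onto periodic traffic."""
    return random.randrange(rate) == 0

def estimate_totals(sampled_frames, sampled_bytes, rate=SAMPLING_RATE):
    """Scale sampled counters up to an estimate of the real traffic:
    each sample statistically represents `rate` packets."""
    return sampled_frames * rate, sampled_bytes * rate

# e.g. 1200 samples totalling 900 kB observed on one port in one interval
frames, octets = estimate_totals(1200, 900_000)
print(f"estimated ~{frames} packets, ~{octets / 1e6:.1f} MB in this interval")
```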

There are other technologies for high-speed traffic monitoring, such as NetFlow [4], which was originally developed by Cisco to collect IP traffic information. NetFlow looks at all the packets and adds their properties to the frame and byte counters of the corresponding flow records before sending these records to a NetFlow collector at regular intervals. Since NetFlow agents are implemented in software, this approach can be very computationally expensive for the devices, especially when there are many short-lived traffic flows. Closer to sFlow and statistical sampling is Sampled NetFlow, which samples every n-th packet, where the value of n can be either deterministic or random.

We tested several of the solutions that were already available [5], including commercial products such as InMon's [6] Traffic Sentinel, and non-commercial and open-source products such as sFlowTrend (also from InMon) and nTop [7]. While some of the tested solutions were simply not flexible and scalable enough, the more powerful commercial solutions did not offer tight integration with our other in-house developed tools due to the proprietary nature of their data storage formats. Developing our own solution also enables potential use of the network topology already discovered by other tools when analyzing the data.

3. System architecture

The system we developed (Figure 1) consists of the following components:

• sFlow agents, implemented on the devices we want to monitor, sample packets and send them to the collector. No other packet processing is done by the device.

• sFlow collector receives and temporarily stores the samples received from the devices it monitors. The samples arrive packed in sFlow datagrams.

• sFlow processor extracts information from the collected packet samples and prepares the data for permanent storage.

• MySQL database as a data storage service stores the processed data in a permanent location.

• Web service provides a clean API for accessing the data storage.

• Web interface front-end to the data which allows the user to make queries and display plots using sFlow data.

To increase flexibility in terms of resource allocation, the collector and processor are modular components and can be decoupled if this proves necessary; an example is having multiple processors working for a single collector. The different models of distribution are presented in section 3.5.


Figure 1. Overview of the system architecture

3.1. Data collection

The process of collecting data involves receiving and temporarily storing samples until they are ready to be processed. The devices send flow samples and counter samples inside sFlow datagrams (UDP). A datagram may contain multiple samples. We currently operate our devices at a sampling rate of 1 in 256, which can be considered high for high-throughput networks. This effectively means that, on average, every 256th packet passing through each interface on the sFlow-enabled devices is sampled.

The flow samples received from the switches contain a copy of the first 128 bytes of the packet. We can use this data to extract information about the traffic flow the packet itself is a part of. All of our devices sample the raw header of the packets, so we have all the necessary information from each layer of the TCP/IP model. In addition to the packet header, a sample contains layer 2 switching information, i.e. the inbound and outbound physical interfaces used by the packet.
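The extraction itself is done by our C collector and processor; purely as an illustration of what can be recovered from those first bytes, a simplified Python sketch for untagged IPv4-over-Ethernet frames might look as follows (offsets follow the standard Ethernet, IPv4 and TCP/UDP header layouts; VLAN tags, IPv6 and truncated headers are ignored).

```python
import socket
import struct

def mac(b):
    return ":".join(f"{x:02x}" for x in b)

def parse_raw_header(raw):
    """Extract layer 2-4 fields from the first bytes of a sampled packet."""
    ethertype = struct.unpack("!H", raw[12:14])[0]
    info = {"l2": (mac(raw[6:12]), mac(raw[0:6]))}      # (src MAC, dst MAC)
    if ethertype != 0x0800:                             # not IPv4: layer 2 info only
        return info
    ihl = (raw[14] & 0x0F) * 4                          # IPv4 header length in bytes
    proto = raw[23]                                     # protocol field of the IP header
    src_ip, dst_ip = socket.inet_ntoa(raw[26:30]), socket.inet_ntoa(raw[30:34])
    info["l3"] = (src_ip, dst_ip)
    if proto in (6, 17):                                # TCP or UDP: ports follow the IP header
        sport, dport = struct.unpack("!HH", raw[14 + ihl:14 + ihl + 4])
        info["l4"] = (src_ip, dst_ip, sport, dport)
    return info
```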

The collector is a dual-threaded application written in C. The collecting thread is responsible for collecting all the samples and storing them into fixed-size dynamically allocated buffers.

When a buffer is full, or an interval expires, this buffer is passed to the processing thread which processes the samples in the buffer and distributes them into POSIX shared memory objects, one object per agent address (device).

The collector keeps an open handle to the memory object for the current interval. When a new interval starts the handle is closed and a message with information about the object is put into a POSIX message queue. The collector then creates a new object and starts collecting there. The interval size is configurable, allowing us to modify the granularity of the data we collect.

This method allows the collecting thread to store more samples than the processing thread is able to save, for short periods of time, depending on the available memory. The maximum amount of memory used by the collector is configurable.

The processing daemon waits for a message from the collector by polling the message queue. The message contains the necessary information to process the samples received. Processing the temporarily stored samples is addressed in section 3.2.
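The collector and processor are C programs built on POSIX shared memory and message queues; the Python sketch below only illustrates the handoff pattern described above (per-agent buffers rotated every interval, with a notification message telling the processor what to pick up), using in-process stand-ins for the POSIX objects.

```python
import queue
import time
from collections import defaultdict

INTERVAL = 60               # seconds per interval; this sets the data granularity
notify_q = queue.Queue()    # stand-in for the POSIX message queue

def aggregate_and_store(agent, interval_start, samples):
    """Placeholder for the processing step of section 3.2."""
    print(f"{agent}: {len(samples)} samples for interval starting at {interval_start:.0f}")

def collector(sample_source):
    """Buffer samples per agent; when the interval expires, hand the closed
    buffers to the processor via a message (sketch of the collecting thread)."""
    buffers = defaultdict(list)            # one buffer per agent address (device)
    interval_start = time.time()
    for agent_addr, sample in sample_source:
        buffers[agent_addr].append(sample)
        if time.time() - interval_start >= INTERVAL:
            notify_q.put((interval_start, dict(buffers)))   # "interval closed" message
            buffers.clear()
            interval_start = time.time()

def processor():
    """Wait for interval-closed messages and process the buffered samples
    (sketch of the processing daemon polling the message queue)."""
    while True:
        interval_start, per_agent = notify_q.get()
        for agent_addr, samples in per_agent.items():
            aggregate_and_store(agent_addr, interval_start, samples)
```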

In order to get information about short-lived traffic flows, both in heavy and light load conditions, we want to collect as many samples as we can. According to the sFlow standard, the agents may implement throttling features that permanently reduce the sampling rate in case the load gets too high. Not all agents record this in their logs and it is not necessarily easy to detect. We currently address this issue by statically resetting the sampling rates at regular intervals. At a later time we will explore the possibility of dynamically changing sampling rates.

3.1.1. Pushing vs pulling SNMP data

In addition to packet sampling, the sFlow agent in the switch sends the values of the switch's SNMP counters to the collector at regular intervals in the form of counter samples. It does so using a best-effort approach within a predefined maximum time limit, trying to piggy-back these counter samples in the datagrams together with the flow samples. If this cannot be done, it sends the counter sample in a separate datagram when the maximum time is reached.

By using this method of collecting SNMP data we can move away from the normal request-response scenario of pulling SNMP information from the switch. In addition to significantly reducing the work needed to collect this information from the devices, this also decreases the bandwidth used by monitoring tools in the network. No additional work is required with this approach beyond receiving and storing the counter samples.

3.2. Data processing

This process involves retrieving the temporarily stored samples from the collector, processing them, and extracting the information we are interested in.

We define a conversation as a stream of frames and bytes between two nodes. On layer 2 the key for a unique conversation is a tuple consisting of the source and destination hardware addresses. The same applies on layer 3, except that the Internet address is used instead of the hardware address. On layer 4 a unique conversation is represented by a 4-tuple consisting of the source and destination addresses together with the source and destination ports. These keys, in addition to the ingress and egress interface IDs, define the master key which we use to identify the conversation. This key is then used to sum up the bytes and frames from the sampled packets into their respective conversations.
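As a minimal sketch of this aggregation (the field names are illustrative, not the processor's actual data structures), the master key and counters could be built as follows; grouping by interval start also gives the time granularity discussed below.

```python
from collections import defaultdict

# counters[master_key] -> [frames, bytes] for one conversation in one interval
counters = defaultdict(lambda: [0, 0])

def account_sample(s, interval_start):
    """Add one flow sample to the counters of the conversation it belongs to.
    `s` is assumed to carry the fields extracted from the raw header plus the
    ingress/egress interface ids reported with the sample."""
    l2_key = (s["src_mac"], s["dst_mac"])
    l3_key = (s["src_ip"], s["dst_ip"])
    l4_key = l3_key + (s["src_port"], s["dst_port"])
    master_key = (interval_start, s["in_if"], s["out_if"], l2_key, l3_key, l4_key)
    counters[master_key][0] += 1                 # frames
    counters[master_key][1] += s["frame_len"]    # bytes
```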

Currently we are only processing and storing byte and frame counters for each conversation, but this can be extended to include for example flag counters for TCP and protocol distribution counters for layer 2 and layer 3. Basically any data found in the first 4 layers of the TCP/IP model can be extracted and stored.

By processing and grouping the samples into time intervals we can define the granularity of the data we collect. By doing this we can greatly reduce the amount of data that needs to be stored. This eliminates the direct proportionality between the sampling rate and the size of the stored data, allowing us to increase the sampling rate to gain more accurate information. We discuss the size of the stored data in more detail in section 3.3.

3.3. Storing data

Sustained high sampling rates require the sampled packets to be processed and stored in a timely fashion. We explored different methods of storing the data, both sample-based and conversation-based.


Currently all the counter samples we receive are stored into the same database as the data extracted from the flow samples, but since the counter samples are sent at relatively fixed intervals we also have the option to decouple flow and counter samples and store the counter values in round robin databases (RRD).

3.3.1. Sample based storage

Storing the samples involves collecting the datagrams from the agents, unpacking them and storing the samples received from each agent in separate locations. No further processing is done until the data is needed.

This is a simple but very inefficient way to store the data with regard to total size. Because the data has not been processed in any way, additional delays occur when it needs to be processed and displayed to the user. The total size of the data to store depends on two factors: the number of flow samples received and the number of ports sending counter samples. As the sampling rate is increased, the effect of the counter samples becomes less significant and the size we need to store becomes directly proportional to the number of flow samples received.

3.3.2. Conversation based storage

When storing conversations we do most of the processing in the sFlow processor before the data is requested by the user and this involves extracting the information we need at a pre-defined granularity. The granularity determines the detail at which we can examine the collected data. With a granularity of, say, 1 minute we collect data for that minute before processing it and extracting the information.

This method of storage does not directly depend on the sampling rate, but on the number of conversations expected to be seen on every port in any single interval. There may be many samples within the time period, but the number of conversations is only a function of the system, and not of the sampling. As a consequence, the quantity of data that needs to be stored is significantly reduced in comparison to the sample-based storage method.

3.3.3. Comparison between sample based and conversation based storage

We can see that even in an almost idle network, where the traffic mostly consists of basic control and monitoring traffic (Figure 2), there is a significant improvement in the size of the data we need to store. When the network load increases, and the network is busy during data taking, this difference becomes even bigger (Figure 3). As traffic increases in the network, we will get more samples, but they will mostly belong to conversations already seen, which means that they will not affect the rate at which we store data. This means that as the traffic increases in the network, the data growth rates for the two storage types will diverge.

3.3.4. Storage solution

The storage solution needs to be able to handle high-speed continuous inserts. In the current solution we are using an open source database, MySQL, as a backend for storing data. The configuration of the database server has been optimized for this purpose.

The most important optimizations include the following:

• Increasing the maximum size for bulk inserts, which is a very effective way of reducing the time and resources required to insert a large number of rows. Instead of inserting every row as it becomes available, we build a large query and do multiple insertions in the same query (a sketch of this is given after this list). This also reduces the number of open/close operations on the database table, which are time-consuming hard-disk access operations.

• Increasing the size of the key buffer allows the server to store table indexes in memory.

• By disabling the option to instantly flush transactions to disk we gain performance during write operations in exchange for data security.


Figure 2. A graph showing the data rate difference between storing raw samples and processed data (conversations) in an almost idle network; data rate in KB/s versus time in minutes.

Figure 3. A graph showing the data rate difference between storing raw samples and processed data (conversations) in a busy network; data rate in KB/s versus time in minutes.

• The buffer pool determines how much data can be cached. Increasing it results in fewer disk I/O operations.
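As an example of the first optimisation (referenced from the bulk-insert item above), a hedged Python sketch of batched insertion through a DB-API cursor is shown below. The table and column names are purely illustrative; many MySQL drivers rewrite executemany() into a single multi-row INSERT per batch, which is what makes the optimisation effective.

```python
def bulk_insert(cursor, rows, batch_size=1000):
    """Insert conversation counters in large batches instead of one row at a time.
    `rows` is a list of tuples matching the (illustrative) column list below."""
    sql = ("INSERT INTO conversations "
           "(interval_start, in_if, out_if, src, dst, frames, octets) "
           "VALUES (%s, %s, %s, %s, %s, %s, %s)")
    for i in range(0, len(rows), batch_size):
        # one executemany() call per batch instead of one execute() per row
        cursor.executemany(sql, rows[i:i + batch_size])
```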

3.4. Data presentation

This involves retrieving the stored data from the data storage facility, formatting it and presenting it in whatever form the user wants.

The main goal is to be able to perform post-mortem analysis of past events. We want to be able to examine who was using the bandwidth and contributing to the link utilization, and how specific conversations developed over time. This information can be used to discover problems in the network. The collected data can also be used to find links with errors, links operating at the wrong line speed, and other issues, but these can also be detected using other standard tools.

To make the collected data available to the end user we have implemented a web interface (Figure 4) that is part of the 3-tier architecture of the front-end system (Figure 1): the database represents the data tier; the logic tier is a web service providing access to the database while doing post-processing, formatting and other tasks; the presentation tier is the web interface itself.

The interface gives the user the ability to select a switch or a single port, and to define the TCP/IP layer and time interval to examine. After these values have been selected, it shows the conversations which have been sampled during the selected interval. From this point the user can select between different ways of visually representing the data, such as pie charts that show link utilization and line charts that illustrate how conversations develop over time.

The web interface has been implemented using the Google Web Toolkit (GWT). This provides us with a powerful and cross-browser compatible interface with little extra effort. As a simple initial implementation for displaying graphs in the web interface we have used the Google Charts application programming interface. At a later time we will examine solutions to replace the current implementation with a more interactive graphing solution.

The interface will get the data asynchronously via a Python [8] module served by Apache [9]. This module will take care of retrieving the data from the database and preparing it to be displayed in the web interface. This preparation involves different things such as resolving host and service names. This module exports a clean API to the world and makes the data available for other applications to use.


Figure 4. Screen-shot of the web interface

The data returned by the module is in JavaScript Object Notation (JSON), a lightweight data interchange format that the Google Web Toolkit, by design, supports very well.
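A hedged sketch of what one call into this logic tier might look like is given below; the schema (the illustrative conversations table used earlier), the function name and the way the module is wired into Apache are assumptions for illustration, not the actual implementation.

```python
import json

def conversations_json(db, port, start, end):
    """Return the conversations sampled on one port during [start, end) as JSON,
    ready to be consumed by the GWT front-end or any other client."""
    cur = db.cursor()
    cur.execute(
        "SELECT src, dst, SUM(frames), SUM(octets) "
        "FROM conversations "
        "WHERE (in_if = %s OR out_if = %s) "
        "  AND interval_start >= %s AND interval_start < %s "
        "GROUP BY src, dst",
        (port, port, start, end))
    rows = [{"src": src, "dst": dst, "frames": int(f), "bytes": int(b)}
            for src, dst, f, b in cur.fetchall()]
    return json.dumps(rows)
```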

3.5. Distributing the system

Because the system developed is inherently modular, it can be distributed in several different ways depending on the needs.

The system in its simplest form, with no distribution (Figure 5), has one collector feeding a processor which stores the data into a single location. It is also highly dependent on the performance of the system it is running on. At the moment we are operating a non-distributed system and it has proven capable of handling the data rates during tests; however, we are currently operating at approximately 40% of the total number of nodes. In addition to the number of nodes increasing, new nodes also have more cores, leading to more processes running on each node, which in turn leads to more conversations per node. Therefore we are looking at several different methods of distribution.

We can distribute the collecting/processing work while still storing to a centralized repository (Figure 6). This is the optimal solution when operating at high sampling rates with a limited number of possible conversations between each pair of nodes. In this case the work required to collect and process the samples outweighs the work required to insert the processed data into storage. In the ATLAS TDAQ network there are two switches and one local file server per rack, with each rack containing approximately 30 processors. We can run an sFlow collector/processor in each rack, resulting in distributed collection and processing. This significantly reduces the bandwidth consumed by sFlow, since samples no longer have to be aggregated at a single collector but are processed in each rack, leaving only the extracted information to be sent to the data storage.



Figure 5. A non-distributed system

Figure 6. A partially distributed system

Figure 7. A fully distributed system

A fully distributed system (Figure 7) can also be implemented by giving each node its own database for storage and leaving the responsibility of requesting the correct data to the web service. If we need to store large amounts of data from a large number of devices with a long history, or if the database server of the partially distributed system cannot handle the load, this is a way to address that problem. Enabling full distribution will require extra application logic in the web service, which will be implemented if proven necessary. This distribution model will also increase the response time when communicating with the web service.

4. Example 1

By inspecting the plots generated from our network monitoring tools (Figure 8) we are able to identify unknown or unexpected traffic patterns but we have no way to extract more information about them.


Figure 8. Traffic plot from SNMP data

By using the collected sFlow data, we are able to extract information on the approximate bandwidth usage per host during the selected interval (Figure 9), as well as information about the TCP conversations made by each host (Figure 10). By examining the conversation plot we are able to determine which application is generating this traffic by looking for known port numbers.

Figure 9. Bandwidth usage per host

Figure 10. TCP conversations over time
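Mapping well-known port numbers to service names is something the presentation layer can do generically; as a small illustration (not the project's actual code), Python's standard socket library already provides the lookup, falling back to the numeric port when it is not registered.

```python
import socket

def service_name(port, proto="tcp"):
    """Map a TCP/UDP port number to its well-known service name, or return the
    number itself when the port is not registered on the system."""
    try:
        return socket.getservbyport(port, proto)
    except OSError:
        return str(port)

# e.g. labelling the peers seen in an unexpected traffic pattern
for port in (22, 3306, 8080):
    print(port, "->", service_name(port))
```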

5. Example 2

While checking for anomalous traffic, a continuous traffic load on one port was noticed (Figure 11). The traffic led to machines that had nothing to do with data taking, and tracking down the source across several hops of the general purpose network would have been extremely tedious.

Figure 11. Unknown control core traffic

Figure 12. Bandwidth usage per host

However, analysing the conversations across the busy core port with sFlow immediately showed that a test PC was running some diagnostic procedure on local file servers (Figure 12).


6. Conclusion

Network traffic sampling has been an industry technology for nearly two decades. It is typically employed with fairly low sampling rates (e.g. 1:8192) for continuous sampling, and only specific ports would be subject to high rates (e.g. 1:512) for specific diagnostic purposes.

However, the whole ATLAS TDAQ network needs to be continuously monitored and the flows understood so as to achieve the maximum throughput under many different modes of operation, and this in turn places a requirement for high-rate sampling on every port. To achieve this we have developed a modular architecture that has permitted rapid deployment of the prototype system and allows for staged task distribution to meet future load increases. Much attention has been paid to the GUI and event display as well as integration into the experimental monitoring system. As a result, the intuitive interface allows even non-expert users to easily obtain the desired results. The fact that we can continuously monitor every port means that all the statistics are on hand the moment a problem is detected and diagnosis can be performed immediately.

References

[1] S. Stancu, M. Ciobotaru, C. Meirosu, L. Leahu, and B. Martin, "Networks for the ATLAS Trigger and Data Acquisition," Computing in High Energy and Nuclear Physics, Mumbai, India, 2006.
[2] S.M. Batraneanu, A. Al-Shabibi, M. Ciobotaru, M. Ivanovici, L. Leahu, B. Martin and S. Stancu, "Operational Model of the ATLAS TDAQ Network," Proc. IEEE Real Time 2007 Conference, Chicago, USA, May 2007.
[3] P. Phaal, S. Panchen, and N. McKee, "InMon Corporation's sFlow: A method for monitoring traffic in switched and routed networks," Tech. rep., RFC 3176, Sept. 2001.
[4] Cisco IOS NetFlow. [Online]. Available: http://www.cisco.com/web/go/netflow
[5] sFlow Collectors. [Online]. Available: http://www.sflow.org/products/collectors.php
[6] InMon Corporation. [Online]. Available: http://www.inmon.com
[7] Network Top (nTop). [Online]. Available: http://www.ntop.org
[8] Python Programming Language. [Online]. Available: http://www.python.org
[9] Apache HTTP Server Project. [Online]. Available: http://httpd.apache.org
