How Network Flow Data Are Organized - Introduction to Flow Collection

2.3 Introduction to Flow Collection

2.3.4 How Network Flow Data Are Organized

The data repository is accessed through the SiLK tools, particularly the rwfilter command-line application. An analyst using rwfilter specifies the data to be viewed with a set of five selection parameters. This handbook discusses selection parameters in more depth in Section 3.2; this section briefly outlines how data are stored in the repository.

Dates

Repository data are stored in hourly divisions, which are referred to in the form yyyy/mm/ddThh in UTC. Thus, the hour beginning at 11 a.m. on February 23, 2014 in Pittsburgh would be referred to as 2014/2/23T16, compensating for the five-hour difference between UTC and Eastern Standard Time (EST). In general, data for a particular hour start being recorded at that hour and continue to be recorded until some time after the end of the hour. Under ideal conditions, the last long-lived flows will be written to the file soon after they time out (e.g., if the active timeout period is 30 minutes, the last flows will be written out 30 minutes plus propagation time after the end of the hour). Under adverse network conditions, however, flows could accumulate on the sensor until they can be delivered. Under normal conditions, the file for 2005/3/7 20:00 UTC would have data starting at 3 p.m. in Pittsburgh and would finish being updated after 4:30 p.m. in Pittsburgh.
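The date arithmetic in the two examples above can be checked with a short Python sketch. It is only an illustration: the function names are hypothetical, a fixed UTC-5 offset stands in for EST (daylight saving time is ignored), propagation time is left out, and the hour label is printed in a zero-padded variant of the yyyy/mm/ddThh form.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for Eastern Standard Time;
# daylight saving time is deliberately ignored in this sketch.
EST = timezone(timedelta(hours=-5), "EST")


def repository_hour(local_time):
    """Return the UTC hour label (yyyy/mm/ddThh) covering an aware datetime."""
    utc = local_time.astimezone(timezone.utc)
    return utc.strftime("%Y/%m/%dT%H")


def earliest_final_write(hour_label, active_timeout_minutes=30):
    """Return the UTC time at which the hour ends plus the active timeout.

    Propagation time and delivery delays are not modeled, so the real file
    may keep changing after this moment.
    """
    start = datetime.strptime(hour_label, "%Y/%m/%dT%H").replace(tzinfo=timezone.utc)
    return start + timedelta(hours=1, minutes=active_timeout_minutes)


# 11 a.m. EST on February 23, 2014 falls in the 16:00 UTC hour.
label = repository_hour(datetime(2014, 2, 23, 11, 0, tzinfo=EST))

# The 20:00 UTC hour of March 7, 2005 ends at 4 p.m. EST; with a 30-minute
# active timeout, updates continue until at least 4:30 p.m. EST.
done_local = earliest_final_write("2005/03/07T20").astimezone(EST)

print(label)
print(done_local.strftime("%Y-%m-%d %H:%M %Z"))
```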

Sensors: Class and Type

Data are divided by time and sensor. The class of a sensor is often associated with the sensor’s role as a router: access layer, distribution layer, core (backbone) layer, or border (edge) router. The classes of sensors that are available are determined by the installation. By default, there is only one class, “all”, but other classes may be configured as needed, based on analytical interest. As shown in Figure 2.2, each class of sensor has several types of traffic associated with it: typically in, inweb, out, and outweb. To find the classes and types supported by the installation, run rwsiteinfo --fields=class,type,mark-defaults. This produces three columns labeled Class, Type, and Defaults. The Defaults column shows plus signs (+) for all the types in the default class and asterisks (*) for the default types in each class.
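As a rough sketch of how the Defaults column might be read by a script, the Python below parses a pipe-delimited listing with the three columns named above. The SAMPLE text is hypothetical, written by hand for illustration rather than captured from a real installation, and the helper names are likewise assumptions.

```python
# Hypothetical listing in the three-column layout described above; it was
# written by hand for this sketch and is not output from a real installation.
SAMPLE = """\
Class|  Type|Defaults|
  all|    in|      +*|
  all|   out|      + |
  all| inweb|      +*|
  all|outweb|      + |
"""


def parse_siteinfo(text):
    """Turn pipe-delimited text into a list of dicts keyed by column title."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = [h.strip().lower() for h in lines[0].rstrip("|").split("|")]
    rows = []
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.rstrip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def default_types(rows):
    """Return the types whose Defaults cell carries an asterisk."""
    return [r["type"] for r in rows if "*" in r["defaults"]]


def default_class(rows):
    """Return the class names whose Defaults cell carries a plus sign."""
    return sorted({r["class"] for r in rows if "+" in r["defaults"]})


rows = parse_siteinfo(SAMPLE)
print(default_class(rows), default_types(rows))
```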

Data types are used for two reasons: (1) they group data together into common directions and (2) they split off major query classes. As shown in Figure 2.2, most data types have a companion web type (i.e., in, inweb, out, outweb). Web traffic generally constitutes about 50% of the flows in any direction; by splitting the web traffic into a separate type, we reduce query time.
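The effect of this split on query cost can be sketched by modeling the repository as partitions keyed by class, type, and hour. The record counts below are made up for illustration (with web traffic set to half of each direction, as the text suggests), and the function name is hypothetical.

```python
from itertools import product

# Made-up record counts per (class, type, hour) partition, for illustration
# only. Web traffic is set to half of each direction.
COUNTS = {
    ("all", "in", "2014/02/23T16"): 1000,
    ("all", "inweb", "2014/02/23T16"): 1000,
    ("all", "out", "2014/02/23T16"): 800,
    ("all", "outweb", "2014/02/23T16"): 800,
}


def records_scanned(counts, classes, types, hours):
    """Sum the records in every partition selected by class, type, and hour."""
    return sum(counts.get(key, 0) for key in product(classes, types, hours))


hour = ["2014/02/23T16"]

# A query about inbound non-web traffic reads only the "in" partition.
non_web = records_scanned(COUNTS, ["all"], ["in"], hour)

# Without the split, the same query would have to read all inbound records.
all_inbound = records_scanned(COUNTS, ["all"], ["in", "inweb"], hour)

print(non_web, all_inbound, non_web / all_inbound)
```

Under these assumed counts, the non-web query touches half as many records as a query over all inbound traffic.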

Chapter 3 Essential SiLK Tools

This chapter describes analyses with the six fundamental SiLK tools: rwfilter, rwstats, rwcount, rwcut, rwsort, and rwuniq. The chapter introduces these tools through example analyses, with their more general usage briefly described. At the conclusion of this chapter, you will be able to

• use rwfilter to choose flow records

• describe the basic partitioning parameters, including how to express IP addresses, times, and ports

• perform and display basic analyses using the SiLK tools and a shell scripting language

This chapter covers the most commonly used parameters specific to each of these tools. Section 3.9 surveys features, including parameters, that are common across several tools.

3.1 Suite Introduction

The SiLK analysis suite consists of over 60 command-line UNIX tools (including flow collection tools) that rapidly process flow records or manipulate ancillary data. The tools can communicate with each other and with scripting tools via pipes,⁸ both unnamed and named, or with intermediate files.

Flow analysis is generally input/output bound: the time required to perform an analysis is proportional to the amount of data read from disk. A major goal of the SiLK tool suite is to minimize that access time. Some SiLK tools perform functions analogous to common UNIX command-line tools and to higher level scripting languages such as Perl®. However, the SiLK tools process flow data in non-text (binary) form and use data structures specifically optimized for analysis.
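To illustrate why a fixed binary layout can be cheaper to handle than text, the Python sketch below packs a few flow fields into a 12-byte record and masks the source address without any string parsing. The layout is hypothetical and chosen only for this example; it is not SiLK's file format.

```python
import ipaddress
import struct

# Hypothetical layout for illustration only; this is NOT SiLK's file format.
# Network byte order: source IPv4 (uint32), destination IPv4 (uint32),
# source port (uint16), destination port (uint16).
RECORD = struct.Struct("!IIHH")


def pack_record(sip, dip, sport, dport):
    """Encode dotted-quad addresses and ports as one fixed-size record."""
    return RECORD.pack(
        int(ipaddress.IPv4Address(sip)),
        int(ipaddress.IPv4Address(dip)),
        sport,
        dport,
    )


def source_network(raw, prefix_len=24):
    """Mask the source address directly on the unpacked integer."""
    sip, _dip, _sport, _dport = RECORD.unpack(raw)
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(sip & mask))


raw = pack_record("192.0.2.77", "198.51.100.9", 49152, 443)
text = "192.0.2.77|198.51.100.9|49152|443|"

# The binary record has a fixed size and is shorter than the text line here.
print(RECORD.size, len(text))
print(source_network(raw))
```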

Consequently, most SiLK analysis consists of a sequence of operations using the SiLK tools. These operations typically start with an initial rwfilter call to retrieve data of interest and culminate in a final call to a text output tool like rwstats or rwuniq to summarize the data for presentation. Keeping data in binary for as many steps as possible greatly improves efficiency of processing. This is because the structured binary records created by the SiLK tools are readily decomposed without parsing, their fields are compact, and the fields are already in a format that is ready for calculations, such as computing netmasks.
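The filter-then-summarize pattern described above might be driven from a script roughly as follows. This is a sketch under assumptions: the helper names, the chosen hour, the types, and the protocol are illustrative, and run_pipeline only works where the SiLK tools and a populated repository are available.

```python
import shutil
import subprocess


def build_pipeline(start_hour, flow_types=("in", "inweb"), protocol=6):
    """Return argv lists for an rwfilter call and an rwuniq call."""
    filter_cmd = [
        "rwfilter",
        f"--start-date={start_hour}",      # selection: which hourly files to read
        f"--type={','.join(flow_types)}",  # selection: which traffic types to read
        f"--protocol={protocol}",          # partitioning: 6 keeps TCP records
        "--pass=stdout",                   # write matching binary records to stdout
    ]
    # rwuniq is the only step that converts the binary records to text.
    summary_cmd = ["rwuniq", "--fields=sIP", "--values=bytes"]
    return filter_cmd, summary_cmd


def run_pipeline(filter_cmd, summary_cmd):
    """Connect the two commands with an unnamed pipe and return the text output."""
    for cmd in (filter_cmd, summary_cmd):
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} not found on PATH")
    producer = subprocess.Popen(filter_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        summary_cmd, stdin=producer.stdout, stdout=subprocess.PIPE, text=True
    )
    producer.stdout.close()  # let the producer see a closed pipe if the consumer exits
    output, _ = consumer.communicate()
    producer.wait()
    return output


filter_cmd, summary_cmd = build_pipeline("2014/02/23T16")
print(" ".join(filter_cmd), "|", " ".join(summary_cmd))
```

Calling run_pipeline(filter_cmd, summary_cmd) on a host with SiLK installed would return the rwuniq text; the records stay in binary form until that final step.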

⁸ See Section 1.2.2.

In some ways, it is appropriate to think of SiLK as an awareness toolkit. The flow-record repository provides large volumes of data, and the tool suite provides the capabilities needed to process these data. However, the actual insights come from analysts.

In document Using SiLK for Network Traffic Analysis (Page 45-48)