Conclusions - Analysis and architectural support for parallel stateful packet processing

4.7 Conclusions

In this chapter we present the first workload characterization of stateful processing. To the best of our knowledge, this study is the first to analyze stateful workloads, specially for networking applications.

In order to better understand the bottlenecks that the network applications generate, we propose a new type of classification that distinguishes workloads according to the management of data throughout packet processing: self–contained (each packet contains all data needed for processing), stateless (shared data structures that are used to check packet data and trigger an action), and stateful (keep track of previous packet processing in order to provide higher knowledge about the current packet processing).

The results show important differences between workload categories. The main rea- son for the variation in performance among workload categories relies on the data locality. The study shows that stateful processing is related to state data and therefore it is sensitive to traffic flow characteristics. In stateless applications, however, the locality is sensitive to lower layer network features (e.g. IP address distribution), as well as to the size of global information required by the packet processing.

In a single threaded system, the main bottleneck of stateful applications is the memory system, since even a larger L2 data cache is unable to maintain the state of active flows. The evaluated applications show from 3x up to 19x of performance speedup on average using a perfect memory system. The speedup is sensitive to the statefulness degree of the application and to the memory hierarchy configuration of the processor architecture. Nevertheless, branch prediction is another critical issue once memory bottleneck is overcomed, showing 16% on average of speedup. Also the percentage is sensitive to the application characteristics and the state level processed during the packet processing.

In this chapter we also analyze the processing for each packet during the life of a network connection. We compare packet requirements under different stateful configurations of Snort. The analysis shows that the bottlenecks can be concentrated in the first packets of the TCP connection, such as portscan detector, or can be distributed along the flow lifetime, such as monitoring.

The variations increase as more stateful upper layer processing is performed to the network traffic. Although this chapter discusses the variations among packets of a given TCP connection, the performance can also be sensitive to the state related to other network layers, such as group of flows, traffic of a given user, traffic of a given application. Other applications may present different valuable results, but our studies suggest that the

72 Characterization of Networking Applications

critical bottlenecks will be preserved and even be more stressed.

The analysis is performed in single threaded processor. In Chapter 6 we analyze bottlenecks that arise due to parallel execution models in a multithreaded architecture (e.g. negative aliasing in both data and instruction caches, contention due to shared data structures).

Chapter 5

Principles of Parallel Stateful

Processing

In previous chapters we have characterized the stateful applications. We have demonstrated that stateful applications present computational intensive workloads, reduced ILP, and large amount of long latency memory accesses. Thus, providing more complex stateful DPI services for high bandwidth links leads us to explore multithreaded architectures to exploit other levels of parallelism, such as the inherent packet level parallelism of network processing.

There are two main execution models towards parallel network processing on multithreaded architectures: run–to–completion (RTC) and software pipeline (SPL). In the RTC model, a thread is run to process packets in a discrete, indivisible RTC step. In the SPL model, a thread is run to process packets in several steps according to the differenti- ated pipeline stages of the workload. Both models aim to exploit packet level parallelism (PLP). In addition, SPL aims to improve the affinity among threads.

Layer 2–3 processing exploit large amount of PLP, since there are negligible dependencies among packets. In contrast, stateful processing shows significantly lower PLP, since there are dependencies between packets (e.g. variety of states related to a particular packet) as well as contention in the use of shared data structures.

In this chapter we introduce background information of parallel stateful processing. The analysis is based on our parallelization of Snort [70], called Snort–MT. We present

74 Principles of Parallel Stateful Processing

an analysis of the parallel workload. In addition, we introduce both RTC and SPL execution models using Snort–MT.

5.1 Motivation

In this section we present an example of throughput demand to provide Snort stateful DPI while sustaining 10Gbps traffic (i.e. OC–192 at full bandwidth utilization). We assume a packet size of 500 bytes on average [53] and workloads from different Snort configurations (see Table 6.1). In addition, we assume a massive multithreaded architecture where each stream is single–issue and the processor runs at 1GHz. Figure 5.1 depicts the required amount of streams in the architecture (Y–axis) running different Snort workloads (X–axis).

0 50 100 150 200 250 300 350 400 450 500

Mix-1 Mix-2 Mix-3 Mix-12 Mix-13 Mix-23 Mix-123 Avg

T o ta l N u m b e r o f S tr e a m s SingleStream IPC = 0,3 SingleStream IPC = 1

Figure 5.1: Massive multithreaded size to sustain stateful DPI for a 10Gbps link The black bars indicate the architecture size assuming the best stream performance case (i.e. IPC of 1 per stream). We can observe that the total number of streams ranges from 30 to 130. However, as we demonstrated in Chapter 4, the IPC of stateful DPI applications is substantially lower. The gray bars indicate the requirements assuming a

In document Analysis and architectural support for parallel stateful packet processing (Page 90-94)