2.1.3 Fast and Flexible Packet Processors
Networks would benefit from fast and flexible packet processors. If a packet processor can process custom header bits, it simplifies the design and deployment of new network protocols. Similarly, if a packet processor can handle custom payload bits, critical network functions, such as packet classification and consensus protocols, can be offloaded to the network dataplane. At the same time, a packet processor must be fast. For example, data center network bandwidth has been growing steadily: 10Gbps Ethernet is prevalent, 25Gbps is gaining traction, and 100Gbps is on the horizon at the time of writing this dissertation. Forwarding packets at line rate while performing complex packet processing requires significant computation power and programmability. Unfortunately, none of the existing network dataplanes achieves both flexibility and performance at the same time.
We surveyed three dataplane implementation technologies from 2008, 2012, and 2016, to understand how different types of network dataplanes have evolved and why existing dataplanes lack flexibility or performance; that is, network dataplanes are either flexible (programmable) or performant, but not both.
Year                2008          2012            2016
Series              Trident+      Trident II      Tomahawk
Model               BCM56820      BCM56850        BCM56960
Process Technology  40 nm         40 nm           28 nm
Transceivers        24 x 10Gbps   128 x 10Gbps    128 x 25Gbps
                                  (32 x 40Gbps)   (32 x 100Gbps)
Forwarding Cores    1             1               4
Buffer Size         9MB           12MB            16MB
Latency             ≈500us        ≈500ns          ≈300ns
Feature Set
Table 2.2: Broadcom Trident-series switch ASIC specification from year 2008 to 2016
Software Packet Processors Software packet processors are flexible but not fast. They are software programs written in a high-level programming language, such as C or C++, and executed on general-purpose CPUs. A 25Gbps network interface can receive a minimum-sized (64B) packet every 19.2ns. At this speed, even a single access to the last-level cache takes longer than the arrival time of a packet. Processing packets in a software dataplane is therefore challenging, even when using advanced software techniques such as kernel bypass, receive side scaling, and data direct I/O [62]. Worse, CPU performance is unlikely to improve because of stalled frequency scaling [175]. Table 2.1 summarizes three examples of CPUs for building software dataplanes from the years 2008, 2012, and 2016, comparing CPU frequency, total number of cores, fabrication process, and memory bandwidth. The “Core Count” row of the table shows that the total number of CPU cores increased from 4 to 24 by 2016, whereas clock frequencies decreased from 3.4GHz to 2.2GHz. If we use clock frequency to approximate single-thread performance, performance did not improve between 2008 and 2016. As network link speeds approach 25Gbps or 100Gbps, software network dataplanes put a lot of strain on server computation and memory capabilities and become impractical.
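The per-packet time budget above is simple arithmetic: the number of bits per packet divided by the link rate. The following is a minimal sketch of that calculation; the function name `arrival_interval_ns` and the three byte-counting conventions are assumptions made for illustration and do not come from the dissertation. The 19.2ns figure is consistent with counting 60 bytes at 25Gbps (a 64B frame without its 4-byte frame check sequence); counting the full 64B frame, or adding the 20 bytes of Ethernet preamble and inter-frame gap, yields a slightly larger budget.

```python
def arrival_interval_ns(frame_bytes: int, link_gbps: float) -> float:
    """Time between back-to-back frames of the given size, in nanoseconds."""
    bits = frame_bytes * 8
    return bits / link_gbps  # bits / (Gbit/s) = nanoseconds


# 7B preamble + 1B start-of-frame delimiter + 12B inter-frame gap (standard Ethernet)
PREAMBLE_AND_GAP_BYTES = 20

# Illustrative counting conventions (assumptions, not from the source text).
conventions = [
    ("60B (64B frame minus 4B FCS)", 60),
    ("64B frame", 64),
    ("64B frame + preamble and gap", 64 + PREAMBLE_AND_GAP_BYTES),
]

for label, size in conventions:
    for gbps in (10, 25, 100):
        interval = arrival_interval_ns(size, gbps)
        print(f"{label:30s} @ {gbps:3d}Gbps: {interval:6.2f} ns")
```

Whichever convention is used, the budget at 25Gbps stays in the low tens of nanoseconds, and at 100Gbps it shrinks by a further factor of four.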
Year             2008             2012              2016
Family           Virtex-6         Virtex-7          Virtex Ultrascale+
Model            XCE6VHX565T      XC7VH870T         VU37P
Silicon Process  40 nm            28 nm             20 nm
SerDes           48 x 6.6Gbps     72 x 13.1Gbps     96 x 32.75Gbps
                 (24 x 11.2Gbps)  (16 x 28.05Gbps)
Logic Cells      566K             876K              2,852K
Flip-Flops       708K             1,095K            2,607K
Block RAM        32,832Kb         50,760Kb          70,900Kb
UltraRAM         -                -                 270.0Mb
Table 2.3: Xilinx Virtex-series FPGA specification from year 2008 to 2016
ASIC-based Packet Processors ASIC-based packet processors are fast, but not flexible. These packet processors are often used in modern switches and routers [146], and tend to handle a limited set of protocols. For example, Table 2.2 compares three generations of Broadcom switch ASICs, the dominant devices used for switch development, from 2008 to 2016 [74]. The Transceivers row shows that the number of ports and the per-port link speed scaled substantially during that period. However, the feature set provided by the switch ASICs remained largely constant. Consequently, despite the performance gain, ASICs remain sub-optimal as a programmable dataplane, because it is difficult to fulfill application requirements that they do not directly support.
FPGA-based Packet Processors Packet processors implemented in FPGAs balance hardware performance and software flexibility. For example, the NetFPGA project [128] has been used to prototype shortest path routing and congestion control algorithms [65]. The designs are expressed using a hardware description language, such as Verilog, and synthesized to FPGAs using vendor-specific place-and-route tools (see Appendix D). Table 2.3 shows three generations of Virtex-series FPGAs from Xilinx from the years 2008, 2012, and 2016. We selected the top-of-the-line FPGA model for each generation. The process technology used by Xilinx in each generation is similar to the fabrication process employed by Broadcom in the same period [24]. During this period, the data rate of transceivers on Virtex FPGAs increased by about 3x (from 11.2Gbps in Virtex-6 to 32.75Gbps in Virtex UltraScale+), while the total number of transceivers increased only slightly, from 72 in Virtex-6 to 96 in UltraScale+, because of packaging limitations. According to Table 2.3, the number of logic cells increased by about 5x and the amount of block RAM by about 2x. Overall, FPGAs offer flexibility competitive with software because of their reconfigurability, and they provide performance comparable to ASICs.
The challenge in FPGA-based design is that the packet processing pipelines are often hard-coded with packet processing algorithms. Implementing a new packet processing algorithm requires users to program in low-level hardware description languages, which is a difficult and error-prone process.
In this dissertation, we explore a different approach: compiling network dataplane programs written in a high-level language to hardware. We allow users to program in a high-level declarative language called P4 [38], in which they describe how packets are parsed, processed, and re-assembled within the pipeline. We demonstrate that a compiler can map this high-level description to a template implementation on FPGAs to instantiate the dataplane. We prototype the approach on an FPGA, but the same methodology is applicable to ASICs as well, if they provide a programmable abstraction in their design [36]. Further, our approach is portable: the template implementation can be synthesized to both Altera [20] and Xilinx [24] FPGAs.
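To make the parse, process, and re-assemble structure concrete, the following is a toy software model of such a pipeline. It is written in Python rather than P4, and it is not the compiler or the FPGA template described in this dissertation; the function names, the forwarding table, and the sample packet are all hypothetical and serve only to illustrate the three stages.

```python
import struct

ETH_FORMAT = "!6s6sH"  # destination MAC, source MAC, EtherType (14 bytes)
ETH_LEN = struct.calcsize(ETH_FORMAT)


def parse(packet: bytes):
    """Parser: extract Ethernet header fields; the rest stays opaque payload."""
    dst, src, ethertype = struct.unpack(ETH_FORMAT, packet[:ETH_LEN])
    return {"dst": dst, "src": src, "ethertype": ethertype}, packet[ETH_LEN:]


def match_action(headers: dict, table: dict):
    """Match-action stage: exact match on destination MAC selects an output port."""
    port = table.get(headers["dst"])  # None models "no matching entry" (drop)
    return headers, port


def deparse(headers: dict, payload: bytes) -> bytes:
    """Deparser: re-assemble the (possibly modified) headers and the payload."""
    header_bytes = struct.pack(
        ETH_FORMAT, headers["dst"], headers["src"], headers["ethertype"]
    )
    return header_bytes + payload


# Hypothetical forwarding table and packet, for illustration only.
table = {bytes.fromhex("aabbccddeeff"): 3}
pkt = bytes.fromhex("aabbccddeeff" "112233445566" "0800") + b"payload"

headers, payload = parse(pkt)
headers, out_port = match_action(headers, table)
out_pkt = deparse(headers, payload)
print(out_port, out_pkt == pkt)
```

In a P4 program the same three roles are declared rather than coded imperatively, which is what makes it possible for a compiler to map them onto a fixed hardware template.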
In summary, there is an opportunity to leverage high-level programming languages to generate a network dataplane that is both high-performance and flexible. We design a system to compile network dataplane programs to FPGA devices in Chapter 3.