We present a proof-of-concept implementation of the proposed multicast forwarding system. Core routers have many different hardware designs and software stacks, so the goal of this section is to show that our proposed ideas can be implemented in a representative platform (NetFPGA). Implementations on other platforms will differ, but the conceptual ideas are the same.
[Figure 5.4 diagram: Traffic Generator, STEM Router with ports A, B, C, and D, and Traffic Measurement; packet layout Ethernet, MCT, CPY, Payload, with 28-byte and 12-byte label annotations.]
Figure 5.4: Setup of our testbed. It has a NetFPGA that represents a core router in an ISP network and runs our algorithm.
5.5.1 Testbed Setup
Figure 5.4 shows the setup of our testbed. The testbed has a router representing a core router in an ISP topology that receives and processes packets of concurrent multicast sessions. As shown in the figure, the router is a branching point for all multicast sessions passing through it. This requires duplicating every packet on multiple interfaces, and copying and rearranging labels for each packet. This represents the worst-case scenario in terms of processing for the router, as processing FSP and FTE labels is simple and requires only forwarding packets on one interface and removing a single label. We implemented the router in a programmable processing pipeline using NetFPGA SUME [153], which has four 10GbE ports.
The testbed also has a 40-core server with an Intel X520-DA2 2x10GbE NIC, which is used to generate traffic of multicast sessions at a high rate.
Implementation. Our router implementation is based on the reference_switch_lite project for NetFPGAs [104]. This router contains three main modules: input_arbiter, output_port_lookup, and output_queues. Our implementation modifies the last two Verilog modules. The first of these, output_port_lookup, receives a packet from the input queues and reads the first label to decide which ports to forward or duplicate the packet on. The second, output_queues, is executed at every output queue. It detaches labels that are not needed in the following path segments, and it has input and output queues of its own to store packets during processing. The main design decision for the second module is to process a data beat in two parallel pipelines. The first pipeline reads a data beat from the module input queue and removes or rearranges labels as required. The second pipeline writes the processed data beat to the output queue. We designed this module as a finite state machine that decides when to read from the input queue and when to write to the output queue.
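The read/write behaviour of the second module can be illustrated with a small software model. This is only a sketch under stated assumptions, not the Verilog implementation: the 32-byte beat width, the names `split_beats` and `detach_labels`, and the representation of labels to remove as a set of byte offsets are all hypothetical, and the two stages run one after the other here rather than in parallel as they do in hardware.

```python
from collections import deque
from enum import Enum, auto

BEAT_BYTES = 32  # assumption: 256-bit data path; the text does not state the beat width


class State(Enum):
    READ = auto()   # take one beat from the input queue and drop unwanted label bytes
    WRITE = auto()  # emit every complete beat that has accumulated


def split_beats(packet, beat_bytes=BEAT_BYTES):
    """Cut a packet into fixed-size data beats (the last beat may be shorter)."""
    return deque(packet[i:i + beat_bytes] for i in range(0, len(packet), beat_bytes))


def detach_labels(in_queue, drop_offsets):
    """Remove the bytes at `drop_offsets` (offsets within the packet) and repack beats."""
    out_queue = deque()
    pending = bytearray()  # bytes kept so far that do not yet fill a whole beat
    offset = 0             # packet offset of the first byte of the next beat
    state = State.READ
    while True:
        if state is State.READ:
            if not in_queue:
                break
            beat = in_queue.popleft()
            pending.extend(b for i, b in enumerate(beat, start=offset)
                           if i not in drop_offsets)
            offset += len(beat)
            state = State.WRITE
        else:
            while len(pending) >= BEAT_BYTES:
                out_queue.append(bytes(pending[:BEAT_BYTES]))
                del pending[:BEAT_BYTES]
            state = State.READ
    if pending:
        out_queue.append(bytes(pending))  # final, possibly partial, beat
    return out_queue
```

The model makes one point visible: removing label bytes shifts the remaining data across beat boundaries, so the write side cannot simply forward each beat it reads and must hold leftover bytes until a full beat is available.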
[Figure 5.5 plot: Throughput (Gbps) versus Packet Size (Bytes) for packet sizes 64, 128, 256, 512, and 1024.]
Figure 5.5: Throughput of received traffic of concurrent multicast sessions from our testbed. We stress STEM by transmitting traffic that requires copying a range of labels for every packet.
Traffic Generation. The 40-core server in our testbed is used to generate traffic of concurrent multicast sessions using MoonGen [42]. We stress STEM by transmitting 10 Gbps of traffic that requires copying and rearranging labels for each packet, as shown in Figure 5.4. We attach 28 bytes of labels to each packet, consisting of MCT and CPY labels. The MCT label instructs the router to duplicate packets on two ports, B and C in Figure 5.4. We monitor and measure the outgoing traffic on port B. The labels instruct STEM to remove 16 non-sequential bytes and keep the other 12 bytes.
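To make the label operation concrete, the following sketch removes 16 non-sequential bytes from a 28-byte label block and keeps the remaining 12. The byte ranges in `REMOVE_RANGES` are hypothetical, since the exact positions depend on the MCT and CPY label contents, and the 14-byte header assumes untagged Ethernet frames. The function name `strip_labels` is ours.

```python
ETH_HEADER_LEN = 14   # assumption: untagged Ethernet II header
LABEL_LEN = 28        # label bytes attached to each packet in the testbed

# Hypothetical, non-contiguous byte ranges to remove (4 + 8 + 4 = 16 bytes).
REMOVE_RANGES = [(0, 4), (8, 16), (20, 24)]


def strip_labels(frame):
    """Return the frame as it would leave port B: header, kept label bytes, payload."""
    header = frame[:ETH_HEADER_LEN]
    labels = frame[ETH_HEADER_LEN:ETH_HEADER_LEN + LABEL_LEN]
    payload = frame[ETH_HEADER_LEN + LABEL_LEN:]
    removed = {i for start, end in REMOVE_RANGES for i in range(start, end)}
    kept = bytes(b for i, b in enumerate(labels) if i not in removed)
    return header + kept + payload
```

For a frame whose label bytes are numbered 0 to 27, the output carries label bytes 4 to 7, 16 to 19, and 24 to 27, so the frame shrinks by 16 bytes.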
5.5.2 Results
We report three metrics that are important in the design of high-end routers: (1) received throughput, (2) packet latency, and (3) resource usage.
Throughput. We transmit labelled packets of concurrent multicast sessions from the traffic-generating server to the router for 60 sec. The main parameter that we control is the packet size, which we vary from 64 to 1024 bytes.
Figure 5.5 shows the throughput of the multicast traffic received from the router. The figure shows that STEM can forward traffic of concurrent multicast sessions at high rates for typical packet sizes. Specifically, for 1024-byte packets, the received throughput is 9.83 Gbps, which is only 1.7% less than the 10 Gbps link bandwidth. We note that multicast is often used in video streaming, which has large packet sizes.
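The relation between the reported throughput and the link bandwidth can be checked with a short calculation. This is an illustrative sketch: the helper names are ours, and the measurement tool may compute throughput differently (for example, it may or may not count framing overhead).

```python
LINK_RATE_GBPS = 10.0  # 10GbE port of the NetFPGA SUME


def throughput_gbps(received_bytes, duration_s):
    """Average throughput in Gbps from a byte counter and a measurement window."""
    return received_bytes * 8 / duration_s / 1e9


def shortfall_percent(measured_gbps, link_gbps=LINK_RATE_GBPS):
    """How far the measured rate falls below the link rate, in percent."""
    return 100.0 * (link_gbps - measured_gbps) / link_gbps
```

With the reported 9.83 Gbps, `shortfall_percent(9.83)` evaluates to about 1.7, matching the figure quoted above.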
Notice that STEM processes each incoming packet independently. That is, processing packets in STEM does not depend on the number of ports in the router. Hence, its performance still applies for routers with large port density. In addition, our simulation results in Section 5.6 show that the 28-byte label is sufficient to represent a multicast tree in a real ISP network of 84 routers for large receiver density. Thus, the results from our testbed reflect STEM performance in large and real networks.
              50th%    95th%    99.9th%
STEM          960.4    973      1,042.7
Unicast       960.3    972.9    1,040.3
Difference      0.1      0.1        2.4
Table 5.3: Packet latency (in µsec) measured in our testbed.
Latency. We report the packet processing latency at port B in Figure 5.4. We measure the latency by timestamping each packet at the traffic generator and, when the packet is received at port B, taking the difference between the current time and the timestamp carried in the packet. Because we lack specialized hardware equipment, we use the Berkeley Extensible Software Switch (BESS) [60] to timestamp the packets. Since the software layer may add overhead while timestamping and transmitting packets, we compare the latency of STEM processing against basic IP unicast forwarding in the same testbed. As shown in Figure 5.4, we transmit packets of concurrent multicast sessions at 10 Gbps, where each packet requires copying a range of its labels.
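The measurement procedure can be sketched as follows. This is not the BESS pipeline itself: the function names are hypothetical, and we use the nearest-rank definition of a percentile as an assumption, since the text does not say which definition was used.

```python
import math


def latencies(send_ts, recv_ts):
    """Per-packet latency: receive time minus the timestamp written at the sender."""
    return [rx - tx for tx, rx in zip(send_ts, recv_ts)]


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(send_ts, recv_ts, percentiles=(50, 95, 99.9)):
    """Latency at each requested percentile, keyed by percentile."""
    lat = latencies(send_ts, recv_ts)
    return {p: percentile(lat, p) for p in percentiles}
```

Running the same summary for STEM and for unicast forwarding and subtracting the results isolates the cost of label processing from the overhead that the software timestamping adds to both.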
Table 5.3 shows multiple statistics of the packet latency for both STEM and unicast forwarding in µsec when the packet size is 1,024 bytes. The results show that the latency of STEM processing under stress is close to that of simple unicast forwarding. For example, the difference between the 95th percentiles of packet latency is only 0.1 µsec.
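The Difference row of Table 5.3 follows directly from the other two rows, as the short check below shows. The values are copied from the table; the dictionary layout and the function name are ours.

```python
# Latency values in microseconds, copied from Table 5.3.
LATENCY_US = {
    "STEM":    {"50th": 960.4, "95th": 973.0, "99.9th": 1042.7},
    "Unicast": {"50th": 960.3, "95th": 972.9, "99.9th": 1040.3},
}


def latency_differences(table=LATENCY_US):
    """STEM minus unicast latency per percentile, rounded to the table's precision."""
    stem, unicast = table["STEM"], table["Unicast"]
    return {p: round(stem[p] - unicast[p], 1) for p in stem}
```

The added latency stays at 0.1 µsec up to the 95th percentile and reaches 2.4 µsec only at the 99.9th percentile, which is about 0.2% of the unicast latency at that percentile.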
Resource Usage. We measure the resource usage of the packet processing algorithm in terms of the number of look-up tables (LUTs) and registers used in the NetFPGA. These numbers are generated by the Xilinx Vivado tool after synthesizing and implementing the project. Our implementation uses 12,677 slice LUTs and 1,701 slice registers per port. These amount to only 3% and 0.2% of the available LUTs and registers, respectively. Thus, STEM requires a small amount of resources while it can forward traffic of thousands of concurrent multicast sessions.
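The reported percentages can be reproduced from the device capacity. The totals below are an assumption on our part: they are the LUT and flip-flop counts of the Xilinx Virtex-7 XC7VX690T, the FPGA on the NetFPGA SUME board, and are not stated in the text.

```python
# Assumption: NetFPGA SUME carries a Virtex-7 XC7VX690T FPGA.
DEVICE_LUTS = 433_200
DEVICE_REGISTERS = 866_400

# Usage reported by Vivado for our implementation.
USED_LUTS = 12_677
USED_REGISTERS = 1_701


def utilization_percent(used, available):
    """Share of a device resource that is in use, in percent."""
    return 100.0 * used / available
```

Under this assumption, LUT utilization is about 2.9% and register utilization about 0.2%, consistent with the rounded figures of 3% and 0.2%.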