
4.5 Application: Precise Clock Synchronization

5.1.1 Hardware

Programmable network hardware allows users to experiment with novel network system architectures. Previous studies have shown that reconfigurable NICs [200] can be used to explore new I/O virtualization techniques in VMMs. Those who are conducting research on new network protocols and intrusion detection can use NetFPGA [128, 211] to experiment with FPGA-based routers and switches [34, 47, 144, 210].

Furthermore, users can employ specialized programmable NICs that support the P4 language [38], a dataplane programming language. In P4, forwarding elements perform user-defined actions such as modifying, discarding, or forwarding packets. Netronome’s FlowNIC [145] and Xilinx’s SDNet [203] support P4. While programmable NICs allow users to access layer 2 and above, SoNIC allows users to access the PHY. In other words, SoNIC allows users to access the entire network stack in software.

5.1.2 Software

Although SoNIC is orthogonal to software routers, software routers are important because they share common techniques. SoNIC preallocates buffers to reduce memory overhead [79, 174], polls huge chunks of data from hardware to minimize interrupt overhead [63, 79], and packs packets in a fashion that resembles batching to improve performance [63, 79, 135, 174]. Software routers normally focus on scalability, and, hence, they exploit multi-core processors and multi-queue support from NICs to distribute packets to different cores for processing. On the other hand, SoNIC pipelines multiple CPUs to handle continuous bitstreams.

5.1.3 Language

P4. Reflecting the significant interest in P4 as a development platform, several efforts are underway to implement P4 compilers and tools. Our micro-benchmarks can be compared to those of PISCES [179], which is a software hypervisor switch that extends Open vSwitch [152] with a protocol-independent design. The Open-NFP [151] organization provides a set of tools to develop network function processing logic, including a P4 compiler that targets 10, 40 and 100GbE Intelligent Server Adapters (ISAs) manufactured by Netronome. These devices are network processing units (NPUs), in contrast to P4FPGA, which targets FPGAs. The Open-NFP compiler currently does not support register-related operations and cannot parse header fields that are larger than 32 bits. Users implement actions in MicroC code that is external to the P4 program. P4c [157] is a retargetable compiler for the P4 language that generates high-performance network switch code in C, linking against DPDK [64] libraries. DPDK provides a set of user-space libraries that bypass the Linux kernel. P4c does not yet support P4 applications that use registers to store state. P4.org provides a reference compiler [158] that generates a software target, which can be executed in a simulated environment (i.e., Mininet [142] and the P4 Behavioral Model switch [156]). P4FPGA shares the same compiler front-end, but it provides a different back-end.

A P4 compiler back-end targeting a programmable ASIC [90] must deal with resource constraints. The major challenge arises from mapping logical lookup tables to physical tables on an ASIC. In contrast, FPGAs can directly map logical tables into the physical substrate without the complexity of logical-to-physical table mapping, thanks to the flexible and programmable nature of FPGAs.

Perhaps the most closely related effort is Xilinx’s SDNet [203]. SDNet compiles programs from the high-level PX [41] language to a data plane implementation on a Xilinx FPGA target at selectable line rates that range from 1G to 100G. A Xilinx Labs prototype P4 compiler works by translating from P4 to PX, and then it uses SDNet to map this PX to a target FPGA. The compiler implementation is not yet publicly available, and so we cannot comment on how the design or architecture compares to P4FPGA.

High Level Synthesis. FPGAs are typically programmed using hardware description languages such as Verilog or VHDL. Many software developers find working with these languages challenging because they expose low-level hardware details to the programmer.

Consequently, there has been significant research in high-level synthesis and programming language support for FPGAs. Some well-known examples include CASH [44], which compiles C to FPGAs; Kiwi [180], which transforms .NET programs into FPGA circuits; and Xilinx’s AccelDSP [27], which performs synthesis from MATLAB code.

P4FPGA notably relies on Bluespec [147] as a target language, and it re-uses the associated compiler and libraries to provide platform independence. As already mentioned, P4FPGA uses Connectal [96] libraries, which also are written in Bluespec, for common hardware features.

5.2 Network Applications

5.2.1 Consensus Protocol

Prior work [56] proposed that consensus logic could be moved to forwarding devices using two approaches: (i) implementing Paxos in switches; and (ii) using a modified protocol, named NetPaxos, that makes assumptions about packet ordering in order to solve consensus without switch-based computation. This section builds on that work by making the implementation of a switch-based Paxos concrete. István et al. [83] have also proposed implementing consensus logic in hardware, although they focus on Zookeeper’s atomic broadcast written in Verilog.

Dataplane programming languages. Several recent projects have proposed domain-specific languages for dataplane programming. Notable examples include Huawei’s POF [186], Xilinx’s PX [41], and the P4 [39] language used throughout this section. We focus on P4 because there is a growing community of active users, and because it is relatively more mature than the other choices. However, the ideas for implementing Paxos in switches should generalize to other languages.

Replication protocols. Research on replication protocols for high availability is quite mature. Existing approaches for replication-transparent protocols, notably protocols that implement some form of strong consistency (e.g. linearizability, serializability), can be roughly divided into three classes [52]: (a) state-machine replication [104, 178], (b) primary-backup replication [150], and (c) deferred update replication [52].

Despite the long history of research on replication protocols, there exist very few examples of protocols that leverage network behavior to improve performance. We are aware of one exception: systems that exploit spontaneous message ordering [107, 162, 163]. These systems check whether messages reach their destination in order; they do not assume that order must always be constructed by the protocol or incur additional message steps to achieve it. This section implements a standard Paxos protocol that does not make ordering assumptions.
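As a point of reference for the "standard Paxos protocol" mentioned above, the following is a minimal software sketch of the acceptor role from textbook Paxos. It is not the switch implementation described in this work: the class and field names (`Acceptor`, `AcceptorSlot`, `on_prepare`, `on_accept`) are hypothetical, and the per-instance dictionary merely stands in for whatever stateful storage a forwarding device would provide.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class AcceptorSlot:
    """Acceptor state for one consensus instance (hypothetical field names)."""
    promised_round: int = 0                  # highest round promised so far
    accepted_round: int = 0                  # round of the last accepted value (0 = none)
    accepted_value: Optional[bytes] = None   # last accepted value, if any


class Acceptor:
    """Textbook Paxos acceptor; rounds are assumed to start at 1."""

    def __init__(self) -> None:
        self.slots: Dict[int, AcceptorSlot] = {}

    def _slot(self, instance: int) -> AcceptorSlot:
        return self.slots.setdefault(instance, AcceptorSlot())

    def on_prepare(self, instance: int, rnd: int) -> Optional[Tuple[int, int, Optional[bytes]]]:
        """Phase 1: promise only if rnd is strictly higher than any earlier promise."""
        slot = self._slot(instance)
        if rnd <= slot.promised_round:
            return None                      # stale prepare: ignore
        slot.promised_round = rnd
        # Report any previously accepted value so the proposer must re-propose it.
        return (rnd, slot.accepted_round, slot.accepted_value)

    def on_accept(self, instance: int, rnd: int, value: bytes) -> Optional[Tuple[int, bytes]]:
        """Phase 2: accept unless a higher round has already been promised."""
        slot = self._slot(instance)
        if rnd < slot.promised_round:
            return None                      # superseded by a newer round
        slot.promised_round = rnd
        slot.accepted_round = rnd
        slot.accepted_value = value
        return (rnd, value)
```

Both handlers are a comparison followed by a small state update keyed by the instance number, which is the kind of logic the text above proposes moving into forwarding devices.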

5.2.2 Timestamping

The importance of timestamping has long been established in the network measurement community. Prior work either does not provide sufficiently precise timestamping, or requires special devices. Packet timestamping in user-space or kernel suffers from the imprecision that is introduced by the OS layer [55]. Many commodity NICs support hardware timestamping with levels of accuracy that range from nanoseconds [6, 7, 17] to hundreds of nanoseconds [18]. Furthermore, NICs can be combined with a GPS receiver or PTP capability to use reference time for timestamping.

Timestamping in hardware requires either offloading the network stack to a custom processor [207] or using network interface cards that provide hardware timestamping capability via an external clock source [6, 17], which makes the device hard to program and inconvenient to use in a data center environment. Data acquisition and generation (DAG) cards [6] additionally offer globally synchronized clocks among multiple devices, whereas SoNIC only supports delta timestamping.

prevented it from being a portable and realtime tool. BiFocals can store and analyze only a few milliseconds' worth of a bitstream at a time due to the small memory of the oscilloscope. Furthermore, it requires thousands of CPU hours to convert raw optical waveforms to packets. Finally, the physics equipment used by BiFocals is expensive and is not easily portable. These limitations motivated us to design SoNIC to achieve realtime timestamping with exact precision. Unfortunately, both BiFocals and SoNIC only support delta timestamping.

5.2.3 Bandwidth Estimation

Prior work in MAD [182], MGRP [160] and Periscope [80] provided a thin measurement layer (via userspace daemon and kernel module) for network measurement in a shared environment. These works complement MinProbe, which is applicable in a shared environment. We also advance available bandwidth estimation to 10Gbps; this contrasts with most prior work, which operated only at 1Gbps or less.

We use an algorithm similar to Pathload [84] and Pathchirp [173] to estimate available bandwidth. There are many related works on both the theoretical and practical aspects of available bandwidth estimation [30, 84, 88, 123, 125, 129, 170, 173, 183, 185]. Our work contributes to the practice of available bandwidth estimation in high speed networks (10Gbps). We find that existing probing parameterizations such as probe trains are sound and applicable in high-speed networks when they are supported by the appropriate instrumentation. Furthermore, prior work [85, 197] mentioned that burstiness of the cross traffic can negatively affect the accuracy of estimation. Although bursty traffic does introduce noise into the measurement data, we find that the noise can be filtered out using simple statistical processing, such as the moving average.
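The moving-average filtering mentioned above can be sketched as follows. This is only an illustration under assumptions: the function name, the window size, and the choice of input series (for example, one measured value per probe) are hypothetical and are not taken from the MinProbe implementation.

```python
from collections import deque
from typing import Deque, Iterable, List


def moving_average(samples: Iterable[float], window: int) -> List[float]:
    """Smooth a series with a trailing moving average.

    Each output is the mean of the current sample and up to window - 1
    preceding samples, so the output has the same length as the input.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    recent: Deque[float] = deque()   # samples currently inside the window
    total = 0.0                      # running sum of `recent`
    smoothed: List[float] = []
    for x in samples:
        recent.append(x)
        total += x
        if len(recent) > window:
            total -= recent.popleft()    # drop the oldest sample
        smoothed.append(total / len(recent))
    return smoothed
```

For instance, `moving_average([1, 2, 3, 4], window=2)` yields `[1.0, 1.5, 2.5, 3.5]`. A larger window suppresses more of the burst-induced noise at the cost of reacting more slowly to genuine changes in the measured series.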

Middleboxes are popular building blocks for current network architecture [159]. MinProbe, when operating as a middlebox, is similar to software routers in the sense that MinProbe consists of a data forwarding path and a programmable flow table. Unlike software routers, which typically focus on high throughput, MinProbe has the capability to precisely control the distance between packets; this capability is absent in most existing software routers. With regards to system design, ICIM [133] has proposed an inline network measurement mechanism that is similar to our middlebox approach. However, in contrast to ICIM, which has only simulated its proposal, we implemented a prototype and performed experiments in a real network. Others have investigated the effect of interrupt coalescence [171] and the memory I/O subsystem [89], which are orthogonal to our efforts. MinProbe, albeit a software tool, completely avoids the noise of the OS and the network stack by operating at the physical layer of the network stack.

Another area of related work is precise traffic pacing and precise packet timestamping. These are useful system building blocks for designing a high-fidelity network measurement platform. Traditionally, traffic pacing [31] is used to smooth out the burstiness of the traffic in order to improve system performance. In MinProbe, we use traffic pacing in the opposite way to generate micro-bursts of traffic to serve as probe packets. Precise timestamping has been used widely in passive network monitoring. Typically, this is achieved through the use of a dedicated hardware platform, such as an Endace Data Acquisition and Generation (DAG) card [6]. With MinProbe, we achieve the same nanosecond timestamping precision, and we are able to use the precise timestamp for active network measurement, which the traditional hardware platform cannot achieve.
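To make the relation between packet spacing and probe rate concrete, the arithmetic can be sketched as below. This is a simplified model under stated assumptions, not MinProbe's code: the function names are hypothetical, the frame size is taken as the number of bytes the frame occupies on the wire, and Ethernet's preamble and minimum inter-frame gap are ignored.

```python
def byte_time_ns(link_bps: float = 10e9) -> float:
    """Time to transmit one byte on the link, in nanoseconds."""
    return 8e9 / link_bps


def idle_bytes_for_rate(frame_bytes: int, target_bps: float,
                        link_bps: float = 10e9) -> float:
    """Idle bytes to leave after each frame so a probe train averages target_bps.

    A frame of S bytes followed by G idle bytes occupies S + G byte times,
    so the average rate is link_bps * S / (S + G). Solving for G gives
    G = S * (link_bps / target_bps - 1).
    """
    if not 0 < target_bps <= link_bps:
        raise ValueError("target rate must be in (0, link rate]")
    return frame_bytes * (link_bps / target_bps - 1.0)


if __name__ == "__main__":
    gap = idle_bytes_for_rate(frame_bytes=1500, target_bps=5e9)
    print(f"idle gap: {gap:.0f} bytes = {gap * byte_time_ns():.0f} ns")
```

Under this model, sending 1500-byte frames at 5Gbps on a 10Gbps link means leaving 1500 idle bytes, or 1200 ns, after each frame. Because one byte time at 10Gbps is 0.8 ns, the spacing has to be controlled at sub-nanosecond-to-nanosecond granularity to realize a chosen probe rate exactly.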

In document Towards a Programmable Dataplane (Page 181-188)