OpenDataPlane Introduction and Overview


Linaro Networking Group (LNG)

Initial Release 0.1.0, January 2014

Executive Summary

OpenDataPlane (ODP) is an open source project that provides an application programming environment for data plane applications that is easy to use, high-performance, and portable across networking SoCs of various instruction sets and architectures. The environment consists of common APIs, configuration files, services, and utilities on top of an implementation optimized for the underlying hardware. ODP cleanly separates the API from the underlying implementation and is designed to support implementations ranging from pure software to those that deeply exploit the hardware coprocessing and acceleration features present in most modern networking Systems-on-Chip (SoCs). The goal of ODP is to create a truly cross-platform framework for data plane applications.

This document provides an introduction and overview of the initial ODP release and discusses the motivation and philosophy behind it, while presenting how it will evolve to achieve its goals. Also under development are a formal ODP Architecture document, which describes the overall design and structure of ODP; a Programmer’s Guide, which presents the ODP Architecture from a programmer’s perspective and is aimed at application developers who wish to use ODP to write portable data plane applications; and an Implementer’s Guide, aimed at platform vendors and those who wish to create conforming ODP implementations on new platforms. These will be released as they become available in 2014.

Problem Statement

To meet the performance, capacity, and scalability needs of modern networks, many vendors provide networking SoCs that incorporate innovative hardware solutions to common networking problems, enabling packet processing at up to 100Gb/s speeds. To enable application software to exploit the capability of these platforms, vendors supply Software Development Kits (SDKs) for each platform. While these SDKs enable applications to exploit the capabilities of each platform, they also make it difficult for applications to be truly portable across different platforms. From an SoC customer standpoint, the proliferation of differing solutions to common problems means it is difficult to manage large scale deployments of networking applications in a consistent manner. What is needed is an open standard framework for data plane applications that supports development of portable applications while simultaneously allowing innovation in how these applications are implemented to achieve various price/performance goals. OpenDataPlane is an effort to separate the development of data plane applications from how the various services used by these applications are implemented on different networking SoCs.

As such it is inspired by earlier industry precedents like OpenGL, which at the time of its introduction sought to provide a similar commonality for the then-fragmented world of graphics processing.

ODP Design Principles

ODP is motivated by several driving forces. While significant strides have been made in implementing data plane applications on general purpose processors, the leading edge of networking has always required some degree of hardware acceleration and offload. Starting with simple functions such as checksum calculation and verification, which are now virtually universal, networking application design has always been a balance between hardware and software implementation choices. When general purpose processors could handle line rate processing of network flows operating at 1Gb/s speeds, 10Gb/s networks began to arrive which required more sophisticated hardware assists. Today, as general purpose processors are beginning to be able to handle 10Gb/s line rates, 100Gb/s networks are beginning to be deployed. This trend is expected to continue with the climb towards Terabit Ethernet.

Separating Data Plane Application Design from Implementation Design

Increasing network speeds pose several scaling issues, the most obvious being that the rate of increase in networking speeds outstrips the rate of increase in processing speeds. Multicore processing fills this gap to a certain extent, but this also introduces its own challenges in scheduling, flow-order preservation, and overall Quality of Service (QoS) management. In addition, as link capacity increases, converged networking becomes an imperative, with disparate traffic classes sharing high-speed links while having very different throughput and latency requirements, all of which are difficult to manage purely in software. Beyond this, as network speeds increase the effect of packet loss on overall system performance becomes greatly magnified. Historically this required that data plane applications be completely redesigned to cope with changes in network speed and capacity because applications needed ever-closer integration with specialized offload hardware to achieve acceptable performance levels at higher scale. The key to achieving true network agility is to eliminate this need to redesign and reimplement these applications as network technology evolves by cleanly separating application design from the functional implementation of that design. This is the key aim of ODP.

Layering

The success and ubiquity of networking applications in general is due in large part to the strength of the ISO layered model for networking, which cleanly separates networking into seven distinct layers. This means that innovation at lower layers of the network does not affect the operation of applications running on upper layers. However, in the data plane (predominantly Layers 2 and 3 of the ISO model) applications are more fully exposed to the rapid changes in the underlying technologies driving networking. This is what results in the need to redesign and rework data plane applications to keep pace with this evolution. Nevertheless, within the data plane there are identifiable processing layers which can be separated and abstracted usefully. Among these are packets, flows, and traffic classes.

Packets

A packet is the basic unit of data processing in the data plane. Since data plane applications may need to process tens of millions of packets per second, features such as receive and transmit, buffer management, header parsing and assembly, encapsulation and decapsulation, and similar such offloads are common features of many networking SoCs. ODP provides APIs to abstract these features so that data plane applications may assume that these common features are available regardless of how they are realized in a given ODP implementation.

Flows

A flow is a related sequence of packets for which order must be preserved and which may share state information (if stateful processing is being performed for the flow). Since most modern networking SoCs provide significant hardware innovation in the management of flows, ODP APIs provide abstractions for flows, enabling data plane applications to take advantage of hardware classification, scheduling, flow ordering, and context management services, which may be available in the implementation.
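One common way to keep the packets of a flow in order is to steer every packet of that flow to the same queue. The following is a minimal sketch of that idea, assuming a flow is identified by an IPv4 5-tuple and that an FNV-1a hash is good enough for distribution. The type and function names are hypothetical and are not part of the ODP API; a hardware classifier would typically do this work instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key: the classic IPv4 5-tuple. Not an ODP type. */
struct flow_key {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  protocol;
};

/* Fold one 32-bit word into an FNV-1a hash, one byte at a time. */
static uint32_t fnv1a_u32(uint32_t hash, uint32_t value)
{
    for (int i = 0; i < 4; i++) {
        hash ^= (value >> (8 * i)) & 0xffu;
        hash *= 16777619u; /* FNV prime */
    }
    return hash;
}

/* Hash field by field so that struct padding never affects the result. */
static uint32_t flow_hash(const struct flow_key *key)
{
    uint32_t hash = 2166136261u; /* FNV-1a offset basis */

    assert(key != NULL);
    hash = fnv1a_u32(hash, key->src_ip);
    hash = fnv1a_u32(hash, key->dst_ip);
    hash = fnv1a_u32(hash, ((uint32_t)key->src_port << 16) | key->dst_port);
    hash = fnv1a_u32(hash, key->protocol);
    return hash;
}

/* Same key always maps to the same queue, which preserves per-flow order. */
static unsigned flow_to_queue(const struct flow_key *key, unsigned num_queues)
{
    assert(num_queues > 0);
    return flow_hash(key) % num_queues;
}
```

Two packets with identical 5-tuples always select the same queue index, while different flows spread across the available queues.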

Traffic Classes

A traffic class is a set of flows that share a common administrative policy. At the highest level, the data plane is charged with implementing control plane policies with regard to traffic classes. This is especially true in converged networks where storage traffic is mixed with voice/video and similar flows with strict latency/jitter requirements, as well as with general Ethernet traffic. ODP provides APIs for identifying traffic classes to the hardware or software rate limiters and other traffic classification and shaping features of the implementation.

Raising the Level of Abstraction

The idea behind ODP is to provide the data plane application with an abstraction of a modern network SoC for which all common (and many advanced) hardware offload features may be assumed, and then allow the implementation to map these application assumptions to whatever hardware and/or software resources are available on the host SoC to realize these functions. Thus rather than having a least-common-denominator approach to processing, or else having multiple applications for the same networking function which differ only in their environmental assumptions, the application is free to focus on the function it is designed to achieve and rely on the specific ODP implementation to help it realize that function in the most efficient manner possible for a given platform. This is done not by adding overhead but rather by factoring out implementation details into the ODP implementation layer to permit well-designed implementations to leverage the inherent capabilities of the platform. At the same time, network SoC vendors are free to create highly optimized solutions for their platforms which can be easily leveraged across a wide array of ODP applications running on that implementation.


A Foundation for Growth

Similar to the evolution of OpenGL, we expect ODP to evolve and grow both in response to continued innovation in technology and business opportunity as well as a result of the many contributions of the open source community. This will be the true key to the success of this effort and the measure of its worth to the industry.

ODP Staging

Given the ambitious scope of ODP and the fact that its development is being conducted in a fully open manner, it will take some time to realize its goals fully. One of the challenges in providing cross-platform APIs that are both portable and yet exhibit near-native performance levels on widely differing SoCs is that it is not obvious in advance how to best structure the fine details of ODP APIs to achieve these goals. Rather than take an Olympian view where a master architecture is first defined and promulgated and then force-fit to various implementations, with perhaps very uneven results, ODP intends to follow a more organic path where multiple efficient implementations of ODP on different SoCs help refine the common high level ODP APIs. Thus, while this document presents an overview of ODP as currently envisioned, it should be kept in mind that the formal ODP architecture is still very much a work in progress and is expected to evolve and change, perhaps significantly, as ODP implementations inform the direction of its evolution.

ODP uses a standard three-level release naming convention (major.minor.revision) and this first public preview release is designated 0.1.0. As such it contains a minimal set of APIs and features needed to give a flavor of ODP and to illustrate the basic programming model ODP will support. Hence not all of the features described here will be found in the initial code. These will follow in subsequent releases, both in response to ongoing development and as a result of feedback and contributions from the open source community at large.

Schemas

Different networking SoCs offer a wide variety of hardware acceleration and offload features that enable significant variability in how packets are processed by the device. Borrowing from database concepts, we use the term schema to refer to this overarching packet flow architecture embodied by an implementation that is independent of the specific data plane application using it. For example, in many SoCs, packets can be routed and processed to different hardware engines without explicit software involvement. A schema would be the description of this flow architecture and would be expressed in a formal domain-specific language (DSL) for this purpose. Some SoCs (typically ASICs) support a single hardwired schema while others permit different schemas to be configured either statically or dynamically. Similar concepts are to be found in the graphics world. For example, GStreamer provides a framework for constructing graphs of media-handling components.


ODP intends to address these capabilities in future architectural revisions; however, such extensions are not part of the initial ODP release as described here. Instead, the architecture of the initial ODP release may be thought of as a default schema in which all packet processing flows are under explicit software control and direction.

Reference Implementations

As noted previously, ODP consists of a common set of APIs that articulate a packet processing schema, coupled to an implementation of those APIs tailored to a specific platform. This is what permits ODP application portability across different platforms. Over time there will be many such implementations of the ODP API for different SoCs and platforms. In this initial release of ODP a single reference implementation is offered, named linux-generic.

Linux-Generic Implementation

The linux-generic implementation is intended to be a reference ODP implementation that is platform-neutral and relies only on the Linux kernel itself. Thus, linux-generic can run on any SoC or platform that has a Linux implementation. Linux-generic serves as a vehicle both for defining and expressing the core ODP API set as well as a means of rapidly porting ODP applications to any platform in advance of it having a native ODP reference implementation. While not intended as a performance target, the performance of linux-generic can be improved by making use of Linux kernel features like NO_HZ_FULL that seek to minimize kernel disruption of threads executing on dedicated cores. The scope of linux-generic in the preview release is quite modest and covers only the most essential APIs needed to illustrate basic packet processing in the default software-centric schema. A fuller implementation of linux-generic will parallel the development of SoC-specific reference implementations as ODP development progresses. This will include adding general performance improvements as they become available.

Implementation Limits

While ODP itself does not specify limits on functions or features, as a practical matter each ODP implementation will define appropriate limits for itself. For example, while the ODP architecture does not impose an upper limit on the number of queues that may be created, an implementation may impose such a limit to match the number of physical queues supported by hardware. Similarly, ODP threads are assumed to map uniquely to cores, but the number of cores is not unlimited and each implementation may restrict the number of ODP threads to the number of physical/logical cores available. ODP API calls use standard error return codes to indicate that a given function is unavailable or that an implementation limit has been exceeded for a given call. It is up to the application to decide how to structure itself to work with the limits imposed by any given ODP implementation it runs with.
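The interaction between an implementation limit and an application can be sketched as follows. This is an illustration only: the return codes, the limit of four queues, and every function name are hypothetical stand-ins, not real ODP definitions. It shows an application that asks for a number of queues and adapts when the implementation reports that its limit has been reached.

```c
#include <assert.h>

/* Hypothetical return codes and limit; real ODP names and values may differ. */
#define EXAMPLE_OK          0
#define EXAMPLE_ERR_LIMIT (-1)
#define EXAMPLE_MAX_QUEUES  4 /* stand-in for a hardware queue limit */

static int example_queues_in_use;

/* Stand-in for an implementation's queue-create call. */
static int example_queue_create(void)
{
    if (example_queues_in_use >= EXAMPLE_MAX_QUEUES)
        return EXAMPLE_ERR_LIMIT;
    example_queues_in_use++;
    return EXAMPLE_OK;
}

/*
 * Application side: request `wanted` queues, accept fewer if the
 * implementation limit is hit, and return how many were created.
 */
static int app_create_queues(int wanted)
{
    int created = 0;

    assert(wanted >= 0);
    while (created < wanted && example_queue_create() == EXAMPLE_OK)
        created++;
    return created;
}
```

An application written this way can size its own data structures from the returned count rather than from a compile-time assumption about the platform.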

ODP Overview

Definitions

Figure 1 shows ODP and related interfaces at a high level.

Figure 1. ODP interfaces

An ODP application runs as a Linux user space process but makes very limited calls to Linux APIs. Instead it uses ODP APIs (and possibly SDK APIs) to enable accelerated support of underlying hardware features without incurring kernel overhead. Note that while ODP does not preclude an application from using platform-specific SDK calls directly, such use would typically involve a loss of full source-level portability across platforms, and would be an application design decision. Because ODP is a framework for data plane applications, ODP applications can run in parallel with full Linux user processes that implement control and/or management plane functions. These typically do not have the critical performance and latency requirements of the data plane and can more fully benefit from the full Linux API feature set.

Figures 2 and 3 demonstrate possible ODP deployments on a multicore SoC. The first deployment has three ODP applications running: the first two in separate (sets of) Linux processes in user space and a third one outside Linux in a bare metal environment. Linux user space supports direct hardware access from ODP applications (through an SoC specific SDK). The second deployment (in Figure 3) has the same setup, but runs Linux and bare metal in separate virtual machines. ODP is designed to coexist with all standard Virtual Machine Monitors (VMMs) and virtualization hardware to enable data plane applications to run as virtualized containers in support of initiatives like Network Functions Virtualization (NFV).


Figure 3. ODP deployment example with virtualization

Terminology

Common terminology used throughout this document includes:

ODP API

The common data plane application programming interface, as described here and supported by conforming ODP implementations.

ODP Application

An ODP application is a data plane application using the ODP API. Typically it processes pieces of work (e.g., packets) in a run-to-completion loop. It may consist of multiple Linux user space processes/threads or bare metal cores.

ODP implementation

An ODP implementation provides the ODP API for use by ODP applications on a given platform.

Run to Completion

A programming model in which tasks execute non-preemptively and process work requests for as long as progress can be made. This may complete the work in a single dispatch or in stages under application control if the application needs to queue the work for asynchronous processing to an offload function.

Linux APIs

An ODP application may be a Linux program and thus may use regular Linux/POSIX APIs.

Bare metal

A bare metal environment does not contain an operating system. The application configures and uses hardware directly, usually through an SDK.

SDK

A Software Development Kit (SDK) consists of hardware specific APIs and tools. It offers an efficient interface to hardware features, but provides portability only within a family of SoCs.

Fast path

The part of an ODP application that does the majority of the work. It is the part that is optimised for maximum packet rate and data throughput and for minimum (real-time) latency.

Scope

Networking data plane applications

An ODP application implements a networking function (such as an IP router, firewall, mobile network gateway or base station, etc.), which consists of various standard (IETF, 3GPP, IEEE) and proprietary networking protocols and features. The value of the application is in providing an efficient, scalable, feature rich, and innovative implementation of the networking function. Its protocol, feature, performance, and/or robustness requirements typically exceed what a general purpose Linux IP stack provides.

Development environment

ODP applications are expected to be developed under Linux and the libraries and makefiles are distributed in source form designed for compilation using Linux tools and commands. Development may be on Linux systems running natively on the target platform or using cross compilation for other target platforms. The latter model is most convenient when the development target is running an OpenEmbedded (OE) Linux kernel.

Application environment

The majority of ODP applications are expected to run in Linux user space. However, there will also be applications running in bare metal environments, or a combination of the two. So applications may run entirely in user space, or may be divided between user space and bare metal, or run entirely on bare metal. In general, the Linux kernel (or kernel modules) is not considered an application execution environment for ODP, but may support or implement some ODP services or APIs (e.g., configuration/control). The ODP API itself does not dictate the application execution environment (user space, bare metal, or kernel). Hence, ODP APIs do not contain types, structs or definitions from Linux/POSIX headers, nor do they depend on a particular build system.

Software on programmable engines (i.e., firmware) is considered part of the hardware implementation and does not use ODP APIs.

Abstraction level

The goal of ODP is to provide full cross-platform source compatibility for ODP applications. Applications using complex hardware acceleration or that are highly tuned to particular hardware may need more porting work than just a recompile. This is an area of ODP development that will be carefully considered as multiple SoC implementations are created and in turn help drive the evolution of the ODP APIs. The goal is to retain full cross-platform source portability for applications sharing the same schema; however, since applications are generally schema-aware, it is expected that compatibility across schemas may require more than a simple recompile to achieve portability.

Programming language

The ODP API and reference implementations are written in the C programming language (C99). Applications or implementations may also use C11 or C++. ODP C++ support is limited to providing appropriate extern "C" clauses in headers to enable usage by C++ routines. ODP itself does not define classes or other object oriented structures as these have limited use in the embedded space.
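The extern "C" arrangement mentioned above usually takes the following shape. This is a generic sketch of the pattern rather than an excerpt from any ODP header: the file names, include guard, and function are hypothetical.

```c
/* example_api.h: hypothetical header, shown only to illustrate the pattern. */
#ifndef EXAMPLE_API_H
#define EXAMPLE_API_H

#include <stdint.h>

#ifdef __cplusplus
extern "C" { /* give the declarations C linkage when compiled as C++ */
#endif

/* A C99 declaration that both C and C++ callers can link against. */
uint32_t example_payload_len(uint32_t frame_len, uint32_t header_len);

#ifdef __cplusplus
}
#endif

#endif /* EXAMPLE_API_H */

/* example_api.c: the C implementation behind the header above. */
#include <assert.h>

uint32_t example_payload_len(uint32_t frame_len, uint32_t header_len)
{
    assert(header_len <= frame_len);
    return frame_len - header_len;
}
```

A C compiler never defines `__cplusplus`, so it sees plain declarations, while a C++ compiler sees them wrapped in a linkage block and therefore does not mangle the function names.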

CPU architecture

ODP is agnostic to the underlying CPU architecture and is designed to work well on various ISAs, including both 32 and 64 bit versions. As a result, various CPU features (e.g., cache line size) are treated as implementation configurations rather than assumed quantities. For now these are implemented statically as #defines in implementation headers. Other implementations (e.g., those for NUMA architectures) may take into account locality issues such as varying cache line sizes based on the target memory being referenced. ODP is designed to work in both big and little endian modes of a CPU. When referring to networking data types (like an IP address) the endianness is documented. By default, all parameters in the API use the endianness native to the implementation’s ISA. Packets on the wire use network byte order (big endian).
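The two points above, a cache line size supplied as a #define and the split between native-endian parameters and big-endian wire data, can be illustrated with a short sketch. The value 64 and all names here are assumptions made for the example, not ODP definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value; an implementation header would set this per platform. */
#define EXAMPLE_CACHE_LINE_SIZE 64

/* Per-core counters padded to one cache line to avoid false sharing. */
struct example_core_stats {
    uint64_t packets;
    uint64_t bytes;
    uint8_t  pad[EXAMPLE_CACHE_LINE_SIZE - 2 * sizeof(uint64_t)];
};

/*
 * Read a 32-bit big-endian (network order) field, such as an IPv4
 * address, into a native value. Byte-wise access gives the same result
 * on big and little endian CPUs.
 */
static uint32_t example_load_be32(const uint8_t *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Write a native value as a 32-bit big-endian field. */
static void example_store_be32(uint8_t *p, uint32_t v)
{
    assert(p != NULL);
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}
```

For instance, the wire bytes 192, 168, 0, 1 load as the native value 0xC0A80001 regardless of the host byte order.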

Coding style

The Linux kernel coding style is used for API and reference implementations. There are some exceptions, for example in the use of typedefs, as these provide greater levels of abstraction since the implementation of types may vary widely between ODP implementations.

Prefixes

All ODP APIs are prefixed with odp_ in their names.

Licensing

The ODP API is provided under the 3-clause BSD license. The API cannot have a GPL license since ODP applications and ODP implementations may be proprietary to the companies using the API.

Open source

The ODP API and reference implementations (including test applications and documentation) are open source. ODP implementations are encouraged to follow this model as well, but ODP does not dictate this.

Versioning

The ODP API is versioned with major and minor versions. Versions under the same major version (beginning with the version 1.0 release) are fully backward compatible. The version 0.x releases may not be fully backward compatible as they are preview releases.
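The compatibility rule above can be expressed as a small check. This sketch rests on an assumption the text does not spell out: that "backward compatible" means an application built against version X.Y runs on any implementation providing X.Z with Z >= Y, and that 0.x preview releases are only trusted on an exact match. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical version pair; not an ODP type. */
struct example_version {
    unsigned major;
    unsigned minor;
};

/*
 * Returns true if an application built against `built` may run on an
 * implementation that provides `provided`, under the assumed rule.
 */
static bool example_version_compatible(struct example_version built,
                                       struct example_version provided)
{
    if (built.major != provided.major)
        return false;
    if (built.major == 0) /* preview releases: exact match only */
        return built.minor == provided.minor;
    return provided.minor >= built.minor; /* newer minor stays compatible */
}

/* Usage example, written as assertions. */
static void example_version_usage(void)
{
    struct example_version v1_0 = {1, 0}, v1_2 = {1, 2}, v2_0 = {2, 0};

    assert(example_version_compatible(v1_0, v1_2));
    assert(!example_version_compatible(v1_2, v1_0));
    assert(!example_version_compatible(v1_0, v2_0));
}
```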

ODP Application Design and Programming


Resource Management APIs

These APIs enable ODP applications to interrogate the environment to discover resources (cores, I/O interfaces, special purpose offload functions, etc.) and to allocate/configure them for application use.

Memory Management APIs

These APIs enable ODP applications to allocate and manage memory areas, including shared memory areas used for communication as well as buffer pools used in support of packet processing and other interfaces.
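To make the buffer pool idea concrete, here is a minimal sketch of a fixed-size pool backed by a free-list stack. It is deliberately single-threaded and uses an assumed depth of four buffers of 2048 bytes. None of the names are ODP APIs, and a real implementation would often hand this job to hardware buffer managers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_POOL_BUFS 4    /* assumed pool depth */
#define EXAMPLE_BUF_SIZE  2048 /* assumed buffer size in bytes */

/* Hypothetical fixed-size buffer pool; single-threaded for brevity. */
struct example_pool {
    uint8_t  storage[EXAMPLE_POOL_BUFS][EXAMPLE_BUF_SIZE];
    uint8_t *free_list[EXAMPLE_POOL_BUFS]; /* stack of free buffers */
    size_t   free_count;
};

/* Put every buffer on the free list. */
static void example_pool_init(struct example_pool *pool)
{
    assert(pool != NULL);
    for (size_t i = 0; i < EXAMPLE_POOL_BUFS; i++)
        pool->free_list[i] = pool->storage[i];
    pool->free_count = EXAMPLE_POOL_BUFS;
}

/* Returns a buffer, or NULL when the pool is exhausted. */
static uint8_t *example_buf_alloc(struct example_pool *pool)
{
    assert(pool != NULL);
    if (pool->free_count == 0)
        return NULL;
    return pool->free_list[--pool->free_count];
}

/* Returns a buffer to the pool it was allocated from. */
static void example_buf_free(struct example_pool *pool, uint8_t *buf)
{
    assert(pool != NULL && buf != NULL);
    assert(pool->free_count < EXAMPLE_POOL_BUFS);
    pool->free_list[pool->free_count++] = buf;
}
```

Allocation and release are constant-time and never call the system allocator, which is the property packet processing code generally wants from a pool.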

Thread Management APIs

These APIs enable ODP applications to create and manage logical threads. While ODP itself does not specify a threading model, it does assume that an application can divide itself into multiple threads of control and provides basic APIs for this purpose. In most ODP implementations it is assumed that there is a one-to-one mapping between threads and processing cores to minimize scheduling overhead and interference.

Event Management APIs

These APIs enable ODP applications to create and configure event queues to allow threads of control to receive and process events and to queue asynchronous processing requests to other event handlers.

Packet Management APIs

These APIs enable ODP applications to receive packets from input interfaces, to manipulate them for processing, and to transmit them on output interfaces.

Flow Management APIs

These APIs enable ODP applications to configure and manage classification rules that enable packets to be grouped into flows.

Traffic Class Management APIs

These APIs enable ODP applications to define and implement policies relating to traffic classes for Quality of Service (QoS) or other purposes.

SoC introduction

SoC logical view

All packet processing or data plane applications need a set of basic functionality to manage the packets. Figure 4 illustrates a logical function split (with optional acceleration) that is also easy to map to a networking SoC.

Figure 4. Logical view on networking SoCs

● Packet input abstracts physical ingress packet ports.
● Preprocessing works at line rate and provides a coarse grained packet classification for buffer pool selection (VM etc) and first level congestion control. It allocates memory for the incoming packets and transfers content to the buffer memory.
● Input classification is a fine grained parsing and classification function that separates traffic flows into the configured queues and adds metadata like packet parsing results.
● Ingress queueing provides queues (FIFOs) of descriptors (meta data for the actual payload). Descriptors to queues may arrive directly from HW devices or from SW.
● Delivery/Scheduling is an important block. It provides a synchronized SW/HW interface, work scheduling and load balancing functionality for all cores with a single receive point. Scheduler makes the decision based on per queue priority settings, queue status and CPU status. Optionally CPUs can bypass the scheduling function and access a queue directly.
● Accelerators can provide special purpose processing like cryptography or compression with an asynchronous queue based interface. Output from an accelerator typically goes to a queue. The “job complete” descriptor can then be scheduled towards SW or chained to another accelerator.
● Coprocessors are like accelerators, but have a synchronous interface towards SW (special opcode, CPU register or dedicated mapped address), execute the operation quickly and are typically per CPU. Output from a coprocessor is typically synchronous, but could optionally be a descriptor to a queue.
● Egress queueing provides shared synchronized interface towards egress ports. Each queue is mapped towards a logical port and optionally scheduled/shaped with the configured QoS. A logical port is then mapped towards a physical port with attributes (e.g. QoS, VLAN etc).
● Post processing schedules packets towards egress ports and frees the packet buffers as the packet leaves the device. It may also provide inline acceleration like adding packet checksums.
● Packet output provides interface to the physical egress ports.

While all this could be implemented in software, many of the blocks can benefit from hardware implementation. This is especially true for functions that include very high packet/bit rates (e.g., packet classification), SoC level synchronisation (scheduling, buffer management) or wide data operations (crypto). All of these are good candidates for hardware implementation and are found in many networking SoCs.

Design Principles

Performance

Attention to maximum performance and multi core scaling is needed to achieve high throughput, packet rate and processing efficiency. Design decisions must be evaluated against performance impact on various SoCs. An ODP application should be able to use SoC features at near to native performance and not face significant overheads due to multiple layers of abstraction. While specific performance targets and measurements have yet to be established, for planning purposes the goal of “near native” is within 5%. Some numeric examples:

● An ODP application on a SoC may have to sustain ~100 Gbps and ~100 Mpps packet throughput, which could result in a total cycle budget of ~500 CPU cycles per packet (32 cores at 1.5 GHz).
● Another application and SoC may have to sustain 10 Gbps or 15 Mpps with a power budget of just a few watts (e.g., at most four 1.5 GHz CPU cores), which would then result in a total cycle budget of ~400 CPU cycles per packet.
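The cycle budgets in these examples follow from dividing the total cycles available per second (core count times core frequency) by the packet rate. A small helper, with a hypothetical name, makes the arithmetic explicit.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles available per packet = (cores * core frequency) / packet rate. */
static uint64_t example_cycles_per_packet(uint64_t cores, uint64_t core_hz,
                                          uint64_t packets_per_sec)
{
    assert(packets_per_sec > 0);
    return (cores * core_hz) / packets_per_sec;
}
```

With 32 cores at 1.5 GHz and 100 Mpps this gives 48,000,000,000 / 100,000,000 = 480 cycles, which the text rounds to about 500. With four cores at 1.5 GHz and 15 Mpps it gives 6,000,000,000 / 15,000,000 = 400 cycles.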

Multicore

Single core solutions are almost nonexistent nowadays. The power-price-performance ratio of a system is optimized by selecting a SoC with the right hardware features, core count and frequency. The same application code may cover a large range of products and performance targets. When using ODP, the application would be easy to port and scale from small to large SoCs, whichever would be the optimal selection for a given power/price budget. As the core count gets higher, it’s important to maximize parallelism in applications with minimal performance overhead and programming complexity. These can be achieved with support of hardware synchronisation features (scheduling, mutual exclusion) and an application framework which uses these hardware features.

Hardware acceleration

Special purpose hardware enables very high throughput, performance and power efficiency when properly used. ODP provides an abstraction of common SoC hardware acceleration features, which can be used on multiple SoCs at near native performance levels. ODP aims not to abstract all hardware features of all SoCs, but rather a set of the most commonly used and provided features.

Run-to-completion

For maximum performance, ODP avoids per packet interrupts, system calls and CPU context switches. All of these cost additional instructions and potential stall cycles (due to cache, TLB, and branch prediction misses). When the total CPU cycle budget per packet may be from hundreds to a couple of thousand cycles, even a single CPU context switch per packet can create an unacceptably large overhead. Most of these overheads can be avoided or minimized by running a single software thread per core (or hardware thread) in a run-to-completion loop. This thread handles one packet (task/event) at a time to completion before it starts to process the next packet. This model integrates well with global work scheduling and load balancing of the cores.
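The shape of such a loop can be sketched in a few lines. This is an illustration under simplifying assumptions: the work source is a small single-producer, single-consumer ring rather than a hardware scheduler, "processing" is a trivial stand-in, and the loop returns when the ring is empty so it can be exercised in a test, whereas a real worker would keep polling. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_RING_SIZE 8 /* assumed capacity; must be a power of two */

/* Hypothetical single-producer/single-consumer work queue. */
struct example_ring {
    uint32_t events[EXAMPLE_RING_SIZE];
    size_t   head; /* next slot to read */
    size_t   tail; /* next slot to write */
};

static int example_enqueue(struct example_ring *r, uint32_t ev)
{
    assert(r != NULL);
    if (r->tail - r->head == EXAMPLE_RING_SIZE)
        return -1; /* full */
    r->events[r->tail++ & (EXAMPLE_RING_SIZE - 1)] = ev;
    return 0;
}

static int example_dequeue(struct example_ring *r, uint32_t *ev)
{
    assert(r != NULL && ev != NULL);
    if (r->head == r->tail)
        return -1; /* empty */
    *ev = r->events[r->head++ & (EXAMPLE_RING_SIZE - 1)];
    return 0;
}

/* Stand-in for packet processing: finishes the event before returning. */
static uint64_t example_process(uint32_t ev)
{
    return (uint64_t)ev * 2;
}

/*
 * Run-to-completion worker: take one event, process it fully, then take
 * the next. Nothing inside the loop blocks or yields the core.
 * Returns the number of events handled; *sum accumulates the results.
 */
static size_t example_worker(struct example_ring *r, uint64_t *sum)
{
    size_t handled = 0;
    uint32_t ev;

    assert(sum != NULL);
    while (example_dequeue(r, &ev) == 0) {
        *sum += example_process(ev);
        handled++;
    }
    return handled;
}
```

The important property is structural: each event is finished before the next is fetched, so no per-packet state has to survive a context switch.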

Software only

ODP also enables networking applications to run in data centers or customer private clouds. The same ODP application (source code) may need to support both data centers based on general purpose CPUs (with modest hardware acceleration) and utility boxes built from special purpose SoCs. The first would provide savings to customers through high-volume hardware (including maintenance) and other benefits, such as flexibility to test new features quickly, but it may not be the most optimized solution. The second would provide customers the most performance-price-power optimized solution for highly loaded applications. ODP APIs support both software only and hardware accelerated implementations. Typically, a software only implementation would have higher CPU overhead (more instructions) per operation and may not scale as well with core count as a hardware accelerated implementation. Still, the ODP architecture and the API aim to provide best in class software only performance.


Virtualization

Full virtualization of networking SoCs will become common as core counts increase, hardware features expand and cloud deployments require it. ODP is designed to perform well with virtualization. The performance difference between native and virtualized implementations should be negligible (as long as SoC hardware supports it).

Existing APIs

ODP implementations may use or depend on existing platform (SDK) APIs when possible. ODP itself does not specify how an implementation may implement the ODP API set.

Linux features

ODP and Linux

ODP considers Linux the default operating system for SoCs running ODP applications. However, the ODP API and specifications do not rely on Linux or POSIX definitions. ODP can equally well be implemented and used with other OSes, RTOSes or bare metal environments. The following sections refer to Linux features, but the same applies to other operating systems.

Direct hardware access

ODP application performance depends on hardware accelerator performance and on the application overhead of accessing those accelerators. Direct access from the application to hardware accelerator registers/interfaces is often needed to guarantee high performance. System calls (including context switches) and data copies are avoided at least on the interfaces used by the application fast path.

Linux scheduler

Typically, an ODP application pins a single thread per core. It does not rely on the Linux scheduler to schedule threads or work when doing fast path processing. Application (work) scheduling is based on the SoC level hardware (or a specialised software) scheduler, which is optimised to efficiently load balance and synchronise work between the cores. Normal Linux threads and scheduling are used for running the slow path/control plane part of the application. Sometimes the slow and fast path core allocations may overlap, in which case some slow path threads (for debugging, etc.) may run in the background of the fast path ODP threads. This mainly occurs in low-end systems whose limited core counts preclude full dedication.
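On Linux, pinning a thread to a core can be sketched with the standard sched_setaffinity() call, where a pid of 0 refers to the calling thread. This is only an illustration of the pinning step; the helper names are hypothetical and ODP does not prescribe this mechanism.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to one core. Returns 0 on success, -1 on failure. */
static int pin_self_to_core(int core)
{
    cpu_set_t set;

    assert(core >= 0 && core < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Lowest-numbered core the calling thread is currently allowed to run on,
 * or -1 if the affinity mask cannot be read. */
static int first_allowed_core(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int c = 0; c < CPU_SETSIZE; c++)
        if (CPU_ISSET(c, &set))
            return c;
    return -1;
}

/* Core the caller is pinned to, or -1 if its mask does not
 * contain exactly one core. */
static int pinned_core(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0 || CPU_COUNT(&set) != 1)
        return -1;
    return first_allowed_core();
}
```

A fast path thread would call pin_self_to_core() once during initialization, before entering its run-to-completion loop.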

Kernel interference

ODP implementations minimize Linux kernel interference, preferably to zero, on the cores running ODP application fast path logic. When a single thread runs on a core, the kernel should not interfere with the application (thread) in any way as long as the application does not make system calls or otherwise raise/cause exceptions or interrupts. If or when an ODP application invokes the kernel (system call, exception, etc.), the kernel takes control to process the event, after which it returns to zero interference mode. Since such kernel processing is only done by specific application request, presumably the application has accounted for this overhead in its overall design. For example, Linux system calls during application initialization, termination, or special exception/error path processing for things like device recovery, link up/down, etc., would normally not be a performance concern.

The Linux kernel’s NO_HZ_FULL configuration option can be used in conjunction with some additional features to achieve the effect of eliminating kernel interrupts on cores to be dedicated to ODP threads. Details of this will be forthcoming.

Real-time

Although ODP fast path processing generally executes a single thread per core and avoids interrupt processing on those cores, sometimes this cannot be avoided, e.g., due to low core count or other reasons. When a core is shared between fast path and other (interrupt or background) processing, it is important that context switches have relatively low maximum latency in both directions. First, interrupt processing may need a fast reaction time, while the bulk of the (interrupt) processing can be message based and scheduled with an appropriate priority. Second, preemptions of fast path processing should be short and relatively constant; otherwise the fast path may not be able to meet deadlines or will suffer from increased packet latency and jitter. Third, fast path processing must be resumed quickly from background processing to meet real-time processing deadlines and guarantee maximum system performance. The worst case interrupt or context switch latency should be on the order of microseconds, far less than a millisecond. The Linux kernel RT patch improves the kernel’s response time and will help to achieve the real-time requirements described above. An ODP implementation should work with or without the RT patch, and leave it to the user to decide whether the RT patch is applied.

Preemption

As mentioned, preemption is generally avoided in ODP fast path processing. If a fast path thread is preempted, it should happen only for a short while and preferably in cooperation with the thread via an explicit yield (for example, before it starts processing a new packet). In addition to latency issues, preemption may cause core and SoC level performance issues. At the core level, preempting code may suffer from cache misses and cause cache thrashing. Performance degradation may be more severe at the SoC level. If the preempted thread is holding software or hardware locks (such as for maintaining packet order), it may cause all other fast path threads (cores) to wait and severely limit SoC level throughput.

Power efficiency

When ODP application core utilization is low, it may be appropriate for some cores to save power. Moves between different power save states may need Linux support, at least in the deeper states. Core idle and power save can be implemented on several levels. For example:

● Automatically save core dynamic power whenever there’s no work from the SoC level scheduler. In some SoCs this may be a special instruction that blocks waiting on a hardware scheduler and stops the core clock for the waiting period. The period would range roughly from nanoseconds to seconds.
● Application initiated context switch to a deeper power save state (e.g., through Linux idle), when the application notices that there’s (likely) a longer period of low activity ahead. The implementation needs actions from the application control plane and Linux such as reconfiguring scheduling and classification rules, unpinning the thread, moving the core into a sleep state, etc.
● Application initiated power down of a core. The application would remove the core (and associated thread) from the ODP processing pool and command Linux to power down the core.

Execution Model

Figure 5 shows the logical view of packet processing using ODP.

Figure 5. ODP Packet Processing


Packets arrive on one or more ingress interfaces and are processed into flows via a classifier function that assigns them to queues. Work is dispatched from the queues via a scheduler to one or more application threads and/or offload function accelerators and is then routed via queues and another scheduler/shaper instance to one or more egress interfaces. Not every ODP application will follow this model, but it is expected to be typical of a large class of them.

Load Balancing and Packet Distribution

A key design element of ODP is scale-out support via multicore processing, such that increased workloads can be processed by adding cores without fundamentally changing application design. Figures 6 and 7 show two approaches to scale-out using, respectively, pull mode and push mode scheduling.


Figure 7. ODP Push Model

The difference between push and pull models is the position of the scheduling function. In the pull model the scheduler dispatches items from queues to worker threads, while in the push model queues are associated directly with worker threads and are serviced individually by them. Again, the choice of which model to use is up to the application, as ODP APIs exist to support both.

ODP Components and APIs

As noted, ODP APIs cover several broad component areas. These are introduced and discussed in the following sections.

Resources and Resource Management APIs

Hardware resources are more complex and diverse on SoCs than on general purpose servers due to the variety of advanced hardware accelerators present. A hardware accelerator may serve multiple cores, VMs, kernels and application processes/threads. Hardware accelerators may also be interconnected, which adds complexity to the configuration. For example, a packet output port can typically free packet buffers back to hardware managed buffer pools after transmitting a packet.

Examples of SoC resources:

● CPUs (or hardware threads)
● Main memory
● Shared memory regions
● Huge page mappings (how many, what sizes)
● Physical and virtual input ports / interfaces
● Packet classification rules
● Scheduler (core groups, algorithms, ordering)
● Hardware queues
● Crypto (sessions, autonomous protocol termination)
● Timers
● Buffer management (pools, buffer sizes, buffer counts)
● Physical and virtual output ports / interfaces
● Output traffic management and hardware Quality of Service (QoS) support
● Deep packet inspection

The first four of these are common resources and can use standard (Linux) mechanisms. Others have networking SoC specific features and need special attention.

Common mechanism

Applications need a common mechanism to find and reserve hardware resources regardless of execution environment (user space, bare metal, with/without virtualization) or the resource usage of other applications or kernels. The common ODP resource management (RM) should be dynamic, so that hardware resource allocation and configuration can be changed in a live system. An application would most likely access RM during its startup/initialisation/termination phases, but potentially also when processing live traffic. Application level resource allocation and configuration must not be based on static mechanisms like recompiling images or rebooting (the SoC or VM). The RM must also work correctly when ODP applications share resources with other applications or the kernel (e.g., share a network interface with related packet classification and buffer management). The intent is that the Linux kernel itself provides the bulk of these services, since managing shared resources is one of the primary functions of an operating system. Normally the control/management plane will interact with the OS to provision resources for the data plane and the data plane will simply make use of the resources identified to it. Specific ODP APIs to help in this are currently limited to simple functions such as enumerating the number of cores available to an ODP application. Additional functions will be added as needed as ODP evolves.

Fast path processing

Software threading


Threads may be Linux processes, pthreads, or main threads on bare metal. Threads running control plane or slow path processing normally use Linux SMP scheduling, while fast path threads are pinned to separate cores and process packets in a run-to-completion loop (not using Linux SMP scheduling). Typically there should be only one fast path thread per core (or hardware thread). There may also be some low priority background threads (e.g., for housekeeping) running on fast path cores, especially in lower-end configurations with limited numbers of cores. ODP does not specify how threads are implemented, only that implementations provide some conforming thread semantics. ODP implementations are thus free to support whichever threading options (OS processes, threads and bare metal) are most relevant to that implementation. Linux processes provide better protection by default, while data sharing is easier with pthreads or bare metal. The threads provided by the linux-generic reference implementation use pthreads.

Main loop

The ODP application is in control of its main loop. ODP does not force any particular main loop structure, but offers different options for application developers. For example, an application may just run its framework

    while (errors == 0) {
        work = get_work();
        dispatch(work);
    }

or integrate other software into the framework

    while (errors == 0) {
        work = get_work();
        update_profile_stats();

        if (work == packet_in) {
            packet_classifier(work);
            continue;
        } else if (work == framework) {
            dispatch(work);
            continue;
        } else if (work == tick) {
            timer(work);
            continue;
        } else if (work == packet_out) {
            packet_output(work);
            continue;
        } else {
            error_log(work);
            errors = 1;
        }
    }

or poll individual resources

    while (errors == 0) {
        packets_out = dequeue(output_done_queues);
        timeouts    = dequeue(timer_queues);
        packets_in  = dequeue(input_queues);

        if (packets_out) {
            process_output_done(packets_out);
        }
        if (timeouts) {
            process_timeouts(timeouts);
        }
        if (packets_in) {
            packets_fwd = process_packets(packets_in);
            enqueue(packets_fwd, output_queues);
        }
    }

Queues

Definition

Queues are multicore safe, First In First Out (FIFO) structures that can hold packet descriptors / messages / events to be processed by the receiving entity. Both hardware and software entities can enqueue (send) items to queues, and dequeue (receive) items from queues. Queues are the main method to transfer data between various (hardware or software) entities on a SoC. Software may receive items from queues directly or through a scheduler. Typically, high end SoCs have hardware acceleration that supports many of these SoC level queues. Hardware implementation varies from all queues being physical to only logical queue IDs mapped on top of a small set of physical queues. When there is no hardware queue support, the implementation must be done in software using optimized, multicore safe queue or ring structures. Ideally all queues on an SoC are equal, so that any entity can en/dequeue on any of the queues. In practice, some sets of queues may be reserved for specific usage and may not be accessible to all entities. For example, there can be queues dedicated to a hardware accelerator that cores or the scheduler cannot dequeue from (but cores can send to). Queues and queue IDs are visible in many of the ODP APIs, e.g., as a destination for asynchronous messages or data flows. A queue may represent, for example, a VLAN interface, an IPsec tunnel, a port in a messaging protocol, an end user data flow, a crypto accelerator session, or a packet output interface with specific traffic shaping parameters. The specific queue types and configuration vary based on the application design and structure.

Operations

The main operations on queues are enqueue and dequeue. A scheduler can perform dequeue operations on behalf of the user software on cores, which is often the default option. Software and schedulers should not dequeue from the same set of queues. Typically hardware acceleration does not allow software to walk queues, remove items in the middle of a queue or empty a queue. Also, queue length may not be known. Some queue implementations support batching, where multiple items can be en/dequeued with a single operation. This typically lowers average queue operation overhead to software and this batching is normally transparent to software.

Software only

In addition to SoC level queues, ODP offers optimized multicore safe queues for application use. These are not connected to a scheduler or other hardware accelerators, but can be used internally by the application.
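As a rough illustration of a software ring structure, the sketch below implements a lock-free ring using C11 atomics. It is deliberately limited to a single producer and a single consumer, which is a simpler case than the general multicore safe queues described above; the type and function names are hypothetical and are not ODP APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8u
static_assert((RING_SIZE & (RING_SIZE - 1u)) == 0, "RING_SIZE must be a power of two");

typedef struct {
    _Atomic size_t head;   /* next slot to read; written only by the consumer */
    _Atomic size_t tail;   /* next slot to write; written only by the producer */
    void *slot[RING_SIZE];
} spsc_ring_t;

static void ring_init(spsc_ring_t *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false when the ring is full. */
static bool ring_enqueue(spsc_ring_t *r, void *item)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_SIZE)
        return false;
    r->slot[tail & (RING_SIZE - 1u)] = item;
    /* Release: the slot write becomes visible before the new tail does. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns NULL when the ring is empty, so NULL
 * itself cannot be used as an item. */
static void *ring_dequeue(spsc_ring_t *r)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return NULL;
    void *item = r->slot[head & (RING_SIZE - 1u)];
    /* Release: the slot has been read before the producer may reuse it. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return item;
}
```

Supporting multiple producers or consumers would require additional synchronization (for example compare-and-swap on the indices), which this sketch omits.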

Packet descriptor

Many ODP API functions handle packets. A common packet descriptor format is needed for portability and interworking between APIs. ODP defines a common data type and a basic set of metadata that the descriptor can carry. Descriptor fields are left implementation specific and are not accessed directly, but through access macros and/or inline functions. This way the implementation can use SoC specific descriptor formats and avoid data copies and abstraction overhead. If a SoC does not provide all specified metadata in hardware, the missing features are supported in software by its ODP implementation.
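The accessor approach can be sketched as follows. The descriptor layout, the pkt_t handle and the accessor names are all hypothetical, chosen only to show how inline functions can hide a platform specific layout from application code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical SoC specific descriptor layout; applications never
 * touch these fields directly. */
struct soc_pkt_desc {
    uint8_t *base;      /* start of the underlying buffer */
    uint16_t data_off;  /* offset of the first packet byte in the buffer */
    uint16_t len;       /* packet length in bytes */
    uint16_t l3_off;    /* offset of the L3 header from the packet start */
};

/* Portable handle used by application code. */
typedef struct soc_pkt_desc *pkt_t;

/* Accessors compile down to direct field reads, so the abstraction
 * costs nothing at run time. */
static inline uint8_t *pkt_data(pkt_t pkt)
{
    assert(pkt != NULL);
    return pkt->base + pkt->data_off;
}

static inline size_t pkt_len(pkt_t pkt)
{
    assert(pkt != NULL);
    return pkt->len;
}

static inline uint8_t *pkt_l3(pkt_t pkt)
{
    return pkt_data(pkt) + pkt->l3_off;
}
```

A different platform could change struct soc_pkt_desc freely, and application code written against the accessors would only need recompiling.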

Buffer descriptor

In addition to packet descriptors, ODP defines a common buffer descriptor, which includes features common to all different descriptor formats (packet, software messages, etc.). These support batch processing and scatter-gather lists. Common descriptors enable the building of standardized software interfaces and can carry metadata between software blocks. Possible metadata in a packet descriptor include:

● Buffer addresses (virtual and physical), including scatter-gather support
● Total buffer length
● Current offset
● Offsets to L2/L3/L4 protocol headers
● Flags for L2/L3/L4 protocols (errors, multicast, etc.)
● Reference count
● Owner

Scheduling

ODP applications are very dependent on the global scheduling function, which controls their throughput, QoS, queue synchronisation, load balance and multicore scaling. In the run-to-completion model the global packet/task scheduler replaces the operating system thread scheduler in driving application task priority scheduling and load balancing. It controls the fast path execution, whereas the OS thread scheduler may be used for running background threads (or idle / deep power save modes) on the same cores. Typically high end SoCs have a hardware accelerator for packet scheduling that is well integrated with the cores (e.g., it can prefetch data to core caches). However, ODP implementations for low end SoCs or general purpose CPUs (or emulation/simulation environments) will normally implement the scheduler in software. The linux-generic reference implementation uses such a software scheduler since it does not assume the availability of any specific set of hardware scheduling features.

Features

The SoC level queues are inputs to the scheduler. The scheduler can either push scheduled packets/events/work to shallow core specific queues or wait for cores to pull work from it. Supported queue depth is an implementation decision. Scheduling decisions are based on queue priority, core grouping and queue synchronization status. The priority based scheduling algorithm is implementation specific, but typically has some strict priority and/or some weighted round-robin levels. Core grouping determines the set of cores the scheduler can target from a specific queue. The scheduler enforces queue synchronization between cores, e.g., by having only a single outstanding item per queue.

Queue synchronization

The queue synchronization features are important for avoiding software locking in application code and thus provide an effective means to write scalable applications. The scheduler supports three types of queue synchronization:

● Parallel queues do not have extra synchronization features. Any number of items can be processed in parallel from a queue, and possible synchronization / ordering issues are handled in application software.
● An atomic queue can have only one core processing its outstanding items at a time. When a core holds item(s) from an atomic queue, it can be sure that there are no other cores accessing other items or context data of the queue concurrently. Thus it does not need to use software locks on those resources.
● Ordered queues can have multiple outstanding items processed concurrently by multiple cores, and still the original queue order will be restored after processing (before sending those to another queue).
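The difference between parallel and atomic queues can be shown with a small single-threaded model of a software scheduler. All names here are hypothetical, ordered queues are left out, and a real scheduler would need its own multicore synchronization; the sketch only shows the rule that an atomic queue is skipped while one of its items is outstanding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { SYNC_PARALLEL, SYNC_ATOMIC } sync_t;

#define QDEPTH 4

typedef struct {
    sync_t sync;
    int item[QDEPTH];
    size_t head, count;
    bool held;             /* atomic queues only: an item is outstanding */
} queue_t;

static bool q_put(queue_t *q, int v)
{
    if (q->count == QDEPTH)
        return false;
    q->item[(q->head + q->count) % QDEPTH] = v;
    q->count++;
    return true;
}

/* Pick the next item for a core. Returns the index of the source queue,
 * or -1 if nothing can be scheduled right now. */
static int schedule(queue_t *q, size_t nq, int *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < nq; i++) {
        if (q[i].count == 0)
            continue;
        if (q[i].sync == SYNC_ATOMIC && q[i].held)
            continue;                   /* another core holds this queue */
        *out = q[i].item[q[i].head];
        q[i].head = (q[i].head + 1) % QDEPTH;
        q[i].count--;
        if (q[i].sync == SYNC_ATOMIC)
            q[i].held = true;
        return (int)i;
    }
    return -1;
}

/* Called by a core when it has finished with an atomic queue's item. */
static void release_atomic(queue_t *q)
{
    q->held = false;
}
```

While an atomic queue is held, its remaining items stay queued even if other cores are idle, which is the price paid for lock-free access to the queue's context data.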

Performance

The above scheduling features must be provided at high packet rates and with low latency. Each incoming packet targeted to software will transit at least once through the scheduler, and may transit multiple times depending on the software structure. The result is that SoC-level total scheduling decision rates may range from millions to hundreds of millions per second.

API interface types

ODP supports both synchronous and asynchronous APIs.

Synchronous

Synchronous APIs complete their requested operation prior to returning to the caller. They thus behave like instructions. The return code indicates the success or failure of the requested operation.


Synchronous interfaces are used for operations that have a short and finite execution time, have a core local implementation, or have a real-time response requirement. Examples: read CPU cycle count, set timeout, queue enqueue/dequeue.

Asynchronous

Asynchronous interfaces are preferred when requested operations will take a long (hundreds of cycles or more) or undetermined time to finish. Requests may be generated with function calls (rather than with explicit messages), but the replies arrive as messages back to the application through queues and scheduling. Messages can be abstracted using common access functions or standard message formats. Accelerator data I/O interfaces are commonly asynchronous. Queue based interfaces give flexibility to implement acceleration functions at various levels in a SoC (SoC level hardware block, coprocessor, special instructions, plain software, or even an external offload device). Also, core cycles can be used for other useful work (controlled by the scheduler) while waiting for a reply from an asynchronous interface such as an accelerator. On the other hand, application context saves/restores add overhead (compared to synchronous waiting) and may cause side effects like cache misses/thrashing. The latter can be minimized by (hardware assisted) context data prefetching. In general, function calls are preferred over explicit messages, since functions are more portable and can hide underlying access methods. The function interface also makes it possible to combine synchronous and asynchronous modes of an operation into a single API, since a return code can indicate either completion or successful initiation of a requested operation that will then be completed asynchronously via message notification. Replies from operations are usually defined as abstract messages with a set of access functions. Sometimes user defined messages are possible (e.g., a user allocated/filled message to a user defined queue). Callback functions and interrupts are avoided as application level reply mechanisms, as these are inconsistent with the ODP execution model and in general not easily portable across different SoC hardware implementation models.
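The combined synchronous/asynchronous style can be sketched with a function whose return code says whether the result is already available or will arrive later as a message on a queue. Everything here is hypothetical (the status codes, the completion message, the byte-sum operation standing in for an accelerator job), and the "accelerator" completes immediately only to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_DONE, OP_PENDING, OP_ERROR } op_status_t;

/* Hypothetical completion message delivered through a queue. */
typedef struct { int request_id; int result; } completion_t;

#define CQ_DEPTH     8
#define INLINE_LIMIT 16   /* inputs up to this size are handled on the core */

typedef struct {
    completion_t msg[CQ_DEPTH];
    size_t head, tail;
} completion_queue_t;

/* Returns 1 and fills *out if a completion message was waiting, else 0. */
static int cq_pop(completion_queue_t *cq, completion_t *out)
{
    if (cq->head == cq->tail)
        return 0;
    *out = cq->msg[cq->head++ % CQ_DEPTH];
    return 1;
}

/* Stand-in for an accelerator finishing a job and posting its reply. */
static void accel_complete(completion_queue_t *cq, int request_id, int result)
{
    assert(cq->tail - cq->head < CQ_DEPTH);
    cq->msg[cq->tail++ % CQ_DEPTH] = (completion_t){ request_id, result };
}

/* Sum `len` bytes. OP_DONE: *result is valid on return.
 * OP_PENDING: the result arrives later as a message on `cq`. */
static op_status_t sum_bytes(const unsigned char *buf, size_t len, int request_id,
                             completion_queue_t *cq, int *result)
{
    int sum = 0;

    if (buf == NULL || cq == NULL || result == NULL)
        return OP_ERROR;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    if (len <= INLINE_LIMIT) {
        *result = sum;
        return OP_DONE;
    }
    accel_complete(cq, request_id, sum);  /* real hardware would reply later */
    return OP_PENDING;
}
```

The caller handles OP_DONE inline and, for OP_PENDING, simply returns to its main loop; the reply is picked up like any other queued event.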

Software components

A standard software component interface will be defined and implemented for ODP applications for easier software reusability and 3rd party software integration. This interface will be synchronous and will enable applications to push or pull packet (or buffer) descriptors through a chain of interconnected software components (similarly to the Click Modular Router project). These chains may be integrated into the application run-to-completion main loop.

Packet input and output

Packet I/O configuration can be divided into interface and flow levels. An ODP application must first enable an interface level configuration before it can start sending or receiving any packets.


The configuration includes:

● Interface enumeration (physical and virtual interfaces)
● Interface initialization and default configuration
  ○ buffer pools and buffer management modes
  ○ default receive and send queues
● Link layer addresses (Ethernet MAC)
● Link status (up, down) and speed

After interface level configuration, the application can send/receive packets through default queues. Only one entity (application or kernel) can control the physical functions of an interface; others use only virtual functions. Applications adapt to their role (defined by the resource management) when using an interface.

Packet classification

Hardware classification capabilities vary between SoCs. Most are able to classify a few lower level protocols (Ethernet, VLAN, etc.) in hardware, which could form the basic level of ODP classification support. More advanced classification capabilities are diverse, so a flexible method is needed to describe those. Missing or dissimilar hardware features can be complemented with additional classification in software. Integration of additional software classification should be implemented with low overhead. Depending on the implementation, additional passes through queues and scheduling could be avoided. Another, less portable option is to let the application find out the hardware capabilities and integrate missing hardware features as additional/modified application (front end) software. Similarly to physical packet I/O interfaces, classification hardware is likely to be shared with other applications or the kernel. Thus common coordination is needed when applications need to change classification rules, so that changes to the classification rules are validated and applied without disturbing other traffic on the SoC. Classification enables definition of target queues for incoming packet flows. A packet flow can be defined with:

● Physical interface
● Ethernet type
● Destination MAC, VLAN or MPLS label
● Source and destination IP addresses
● Transport layer source and destination ports
● Packet priority (VLAN, IP, MPLS)
● User defined fields relative to L2/L3/L4 layer headers
● Combinations/tunnels of the above
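A software fallback for classification can be sketched as a first-match rule table that maps a flow key to a target queue. The key fields, mask bits and function name are assumptions made for this example, covering only a small subset of the fields listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A rule compares only the fields whose bit is set in its mask. */
enum { MATCH_VLAN = 1 << 0, MATCH_DST_IP = 1 << 1, MATCH_DST_PORT = 1 << 2 };

typedef struct {
    uint16_t vlan;
    uint32_t dst_ip;
    uint16_t dst_port;
} flow_key_t;

typedef struct {
    unsigned mask;       /* which fields of `match` are significant */
    flow_key_t match;
    int queue_id;        /* target queue for packets matching this rule */
} rule_t;

/* First matching rule wins; unmatched packets go to `default_queue`. */
static int classify(const rule_t *rules, size_t n, const flow_key_t *key,
                    int default_queue)
{
    assert(key != NULL);
    for (size_t i = 0; i < n; i++) {
        const rule_t *r = &rules[i];

        if ((r->mask & MATCH_VLAN) && r->match.vlan != key->vlan)
            continue;
        if ((r->mask & MATCH_DST_IP) && r->match.dst_ip != key->dst_ip)
            continue;
        if ((r->mask & MATCH_DST_PORT) && r->match.dst_port != key->dst_port)
            continue;
        return r->queue_id;
    }
    return default_queue;
}
```

Placing more specific rules first gives them priority over broader ones; a linear scan is only reasonable for small rule sets.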


Buffer management

Buffer management hardware of SoCs typically supports a number of buffer pools, each holding a number of fixed size buffers. Buffer allocation and free operations are accelerated, and both hardware accelerators and software can operate on the same pools and buffers. Shared pools enable data sharing without copies. Used buffers can be returned directly back to the pool, regardless of the allocator (hardware or software). Applications may reserve some pools for internal use, but as SoCs often have limited RAM storage relative to general purpose servers, this must be done with knowledge of resource limits. Buffer management includes:

● Pool configuration and management
● Pool information
● Buffer allocation and free
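In software, a pool of fixed size buffers is commonly kept as a free list threaded through the unused buffers themselves. The sketch below shows that idea with hypothetical names and sizes; it is single-threaded, so a multicore safe pool would need atomic operations or per-core caches on top of it.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_SIZE  256
#define BUF_COUNT 4

/* While a buffer is free, its first bytes store the free-list link. */
typedef union buf {
    union buf *next;
    unsigned char data[BUF_SIZE];
} buf_t;

typedef struct {
    buf_t storage[BUF_COUNT];
    buf_t *free_list;
    size_t free_count;
} pool_t;

static void pool_init(pool_t *p)
{
    p->free_list = NULL;
    p->free_count = 0;
    for (size_t i = 0; i < BUF_COUNT; i++) {
        p->storage[i].next = p->free_list;
        p->free_list = &p->storage[i];
        p->free_count++;
    }
}

/* O(1) allocation: pop the head of the free list. NULL when exhausted. */
static void *pool_alloc(pool_t *p)
{
    buf_t *b = p->free_list;

    if (b == NULL)
        return NULL;
    p->free_list = b->next;
    p->free_count--;
    return b->data;
}

/* O(1) free: push the buffer back, whoever allocated it. */
static void pool_free(pool_t *p, void *ptr)
{
    buf_t *b = (buf_t *)ptr;

    assert(b >= p->storage && b < p->storage + BUF_COUNT);
    b->next = p->free_list;
    p->free_list = b;
    p->free_count++;
}
```

Because every buffer has the same size, allocation never fragments and both operations take constant time.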

Timers

ODP applications use timers frequently and for various purposes. Networking protocols specify many of these. For example, a single user flow may include timers for packet retransmissions, user inactivity, jitter buffering and packet shaping / scheduling. So there can be millions of timers running concurrently on an SoC. Some types of timers almost never expire, but are cancelled or reset in most cases. Others may expire often or periodically (e.g., every 1ms). The aggregate rate of timer operations may reach millions per second. Requested timeout range is wide, from microseconds to hours. Similarly the required resolution varies, starting from the microsecond level. ODP timer operations include timer configuration, timeout requests and cancels. Timeouts are delivered as asynchronous messages through SoC queueing and scheduling. There is an option to use user defined messages and destination queues. This way timeouts and packet data can share the same flow context without extra locking.

Crypto

The same crypto operations may be provided by an SoC level or external accelerator, accelerated instructions or generic software. Crypto operation latency is relatively high even when hardware accelerators are used, since it depends directly on the buffer length. Typically, short buffers are processed more efficiently on CPU cores than on SoC level accelerators due to access latency/overhead. Large buffers fit accelerators better than CPUs, due to higher throughput and lower CPU processing jitter for software on the same cores. ODP provides APIs for both synchronous (inline on the core) and asynchronous crypto processing operations. The application can obtain operational capabilities via the Resource Manager. Management of various crypto operation parameters is session based. The application first creates a session with parameters, and later on requests operations with references to the session as well as source and destination buffers. Selection between asynchronous/synchronous operations is on a per session and per packet basis. A crypto session may include:

● Crypto algorithm and mode selection
● Operation chaining (e.g., crypto + authentication)
● Keys
● Initialization vector (if session based)
● Binding to selected implementation(s)

Helper library

Many small helper functions and definitions are needed to enable ODP applications to be hardware optimized but not tied to a particular hardware or execution environment. These are typically implemented with inline functions, preprocessor macros, or compiler builtin features. Thus API definitions are normally inline when possible. The list of envisioned helper features includes:

Core enumeration

Applications or middleware need to handle physical and/or logical core IDs, core counts and core masks quite often. Core enumeration has to remain consistent even when core deployment changes during application execution (e.g., due to adaptation to a changing traffic profile).

Memory alignments

For optimal performance and scalability (e.g., to avoid false sharing and cache line aliasing), some application data structures need to be aligned to cache (cache line) and/or memory subsystem (page, DRAM burst) boundaries. NUMA systems also support location awareness and potentially different cache line sizes on a per-memory basis.
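A minimal sketch of cache line alignment using standard C11 alignas follows. The 64-byte line size, the structure and the helper name are assumptions made for illustration; a portable helper library would supply the platform's real value.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64   /* assumed; varies between platforms */
#define MAX_CORES       4    /* hypothetical core count for the example */

/* One instance per core. Aligning the first member to a cache line
 * aligns and pads the whole structure, so no two cores ever write
 * to the same line (no false sharing). */
typedef struct {
    alignas(CACHE_LINE_SIZE) uint64_t rx_packets;
    uint64_t rx_bytes;
} per_core_stats_t;

static_assert(alignof(per_core_stats_t) == CACHE_LINE_SIZE,
              "structure must start on a cache line");
static_assert(sizeof(per_core_stats_t) % CACHE_LINE_SIZE == 0,
              "structure must fill whole cache lines");

static per_core_stats_t core_stats[MAX_CORES];

static per_core_stats_t *stats_for_core(unsigned core)
{
    assert(core < MAX_CORES);
    return &core_stats[core];
}
```

Without the alignment, adjacent per-core entries would share a cache line and every counter update would bounce that line between cores.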

Static memory allocation

Static memory allocation helpers serve application needs for portable definitions of global and core/thread local data.

Compiler hints

The compiler and linker can do better optimizations if code includes hints on expected application behavior. Examples of these are classification of branches with likely/unlikely hints, or marking code with hot (optimize for speed) or cold (optimize for size) tags.
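With GCC and Clang, such hints are available as the __builtin_expect builtin and the hot/cold function attributes. The macro names, the minimum length constant and the example function below are illustrative assumptions, with a no-op fallback for other compilers.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#define HOT_CODE    __attribute__((hot))
#define COLD_CODE   __attribute__((cold))
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#define HOT_CODE
#define COLD_CODE
#endif

#define MIN_FRAME_LEN 60   /* hypothetical validity threshold */

static unsigned error_count;

/* Rarely executed: the compiler may optimize it for size and
 * place it away from the hot code. */
COLD_CODE static void log_bad_packet(void)
{
    error_count++;
}

/* Fast path: the error branch is marked unlikely so the common
 * case falls straight through. */
HOT_CODE static size_t count_valid(const size_t *lengths, size_t n)
{
    size_t valid = 0;

    assert(lengths != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (unlikely(lengths[i] < MIN_FRAME_LEN)) {
            log_bad_packet();
            continue;
        }
        valid++;
    }
    return valid;
}
```

The hints never change program results; they only influence code layout and branch ordering.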

Prefetching

Prefetching data into core caches before using it improves cache hit rate and thus performance. The optimal number and placement of prefetches are hardware dependent, but prefetching in the most obvious places should increase rather than decrease performance.

Atomic operations

Modern ISAs offer various atomic instructions to access/manipulate data concurrently from multiple cores. Scalable multicore software is possible only through correct usage (and combination) of hardware acceleration and atomic instructions. Applications use atomic operations to update global statistics, sequence counters, quotas, etc., and to build concurrent data structures.
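Using standard C11 atomics, a shared statistics block and a sequence counter might look like the sketch below. The structure and function names are hypothetical; relaxed ordering is used because the counters carry no ordering obligations toward other data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    atomic_uint_fast64_t packets;
    atomic_uint_fast64_t bytes;
} stats_t;

static void stats_init(stats_t *s)
{
    assert(s != NULL);
    atomic_init(&s->packets, 0);
    atomic_init(&s->bytes, 0);
}

/* Safe to call concurrently from any number of cores without locks. */
static void stats_add_packet(stats_t *s, uint64_t bytes)
{
    atomic_fetch_add_explicit(&s->packets, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&s->bytes, bytes, memory_order_relaxed);
}

/* fetch_add returns the previous value, so concurrent callers
 * each receive a distinct sequence number. */
static uint64_t next_seq(atomic_uint_fast64_t *seq)
{
    return atomic_fetch_add_explicit(seq, 1, memory_order_relaxed);
}
```

Note that the two counters in stats_t are updated independently, so a reader may briefly see a packet counted whose bytes are not yet added.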

Memory synchronization barriers

Applications (or middleware) need a portable way to synchronize data modifications into main memory before messaging other cores or hardware accelerators about the changes. The nature of the synchronization needed is specific to the cache coherence protocol.
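For core-to-core messaging, standard C11 fences can express this pattern: write the data, issue a release fence, then set the flag that the other core polls. The one-shot mailbox below is a hypothetical sketch of that idea; notifying a hardware accelerator usually needs platform specific barriers that C11 does not cover.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int payload;         /* plain data, written before publishing */
    atomic_bool ready;   /* flag polled by the receiving core */
} mailbox_t;

static void mailbox_init(mailbox_t *m)
{
    m->payload = 0;
    atomic_init(&m->ready, false);
}

/* Sender: the release fence orders the payload write before the
 * flag store, so a receiver that sees the flag also sees the data. */
static void publish(mailbox_t *m, int value)
{
    m->payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&m->ready, true, memory_order_relaxed);
}

/* Receiver: the acquire fence orders the flag load before the
 * payload read. Returns false if nothing has been published yet. */
static bool try_consume(mailbox_t *m, int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&m->ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = m->payload;
    return true;
}
```

A portable helper would wrap these fences behind one name so that each platform can substitute whatever barrier its coherence model requires.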

Execution barriers and spinlocks

Although software locking should be avoided (especially in fast path code), at times there is no practical way to synchronize cores other than using execution barriers or spinlocks. For example, the application initialization phase typically is not performance critical and may be much simpler with synchronous interfaces and locking.
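A basic spinlock can be built on the standard C11 atomic_flag type, as sketched below with hypothetical names. It busy-waits without backoff or fairness, which is acceptable for short initialization-time critical sections but not something to hold across fast path work.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_flag locked;
} spinlock_t;

static void spin_init(spinlock_t *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->locked);
}

/* Returns true if the lock was taken, false if someone else holds it. */
static bool spin_trylock(spinlock_t *l)
{
    /* test_and_set returns the previous state: false means it was free. */
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

static void spin_lock(spinlock_t *l)
{
    while (!spin_trylock(l)) {
        /* busy-wait until the holder releases the lock */
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The acquire/release pairing ensures that writes made inside the critical section are visible to the next core that takes the lock.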

Profiling and debugging

Although there are (external) tools for profiling and debugging, some level of application code instrumentation is typically needed (e.g., for in-field debugging/profiling). Typically an SoC supports CPU level (e.g., cycle count, cache misses, branch prediction misses) and SoC level (system cache misses, interconnect/DRAM utilization) performance counters.

SoC Hardware info

The application may be interested in generic performance characteristics of the SoC it is running on in order to adapt optimally to the system. APIs for reading this information are thus provided.

Data manipulation

There are some data manipulation operations that are typical of networking applications. Examples of these are byte order swaps for big/little-endian conversion, various checksum algorithms, and bit shuffling/shifting.
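Two such operations are sketched below in portable C: byte order swaps and the Internet checksum as described in RFC 1071. The function names are hypothetical, and an optimized implementation would typically use compiler builtins or SoC specific instructions instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t bswap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

static uint32_t bswap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v >> 8) & 0x0000FF00u) | (v >> 24);
}

/* Internet checksum (RFC 1071): the ones' complement of the ones'
 * complement sum of the data taken as big-endian 16-bit words.
 * Intended for packet sized inputs (up to 64 KB). */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    assert(data != NULL || len == 0);
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)data[0] << 8;   /* odd trailing byte, zero padded */
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);   /* fold carries back in */
    return (uint16_t)~sum;
}
```

A receiver can verify data by running the same function over the data with the checksum appended; the result is zero when the checksum is correct.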

Optimized standard library functions

Some commonly used standard C library functions may be supplemented with versions that are specialized, performance optimized and have bounded execution times. For example there could be multiple versions of memcpy / memmove / memset functions with different fixed alignments and possibly even lengths. Memory allocation functions (alloc, etc.) could have versions using huge page mapped memory and optimised for performance rather than memory consumption. Also functions like printf could have more fast path friendly versions (bounded execution time).