5.3 H2C design
5.3.1 Architecture
IO core Processing core SSD core
NIC Content store L1 storage L2 storage L1+L2 index Next processing stage (PIT, FIB)
SSD DRAM
Figure 5.4: H2C architecture
H2C is a multithreaded userspace application designed for *nix operating systems running on a x86 architecture. An H2C instance runs on as many threads as the number of CPU cores available in the system. Each thread is bound to a specific core using the core affinity functionalities offered by the OS.
Cores are grouped into three sets depending on the task they perform (see Fig. 5.4).
Network I/O cores. They are in charge of fetching received packets from NICs using poll-mode drivers and forward them to a specific processing core based on the hash of the content identifier. In H2C content objects are allocated to processing cores based on the hash of the content identifier to ensure that a content is always processed by the same core. This effectively shards contents among cores. As a result, each core can maintain dedicated data structures and process packets independently of other cores.
It should be noted that the role of Network I/O cores may be fulfilled in certain situations by NICs. Modern NICs support Receiver Side Scaling (RSS) functionality that places received packets on specific hardware queues depending on the result of a hash function computed on a number of fields of the received packets. However most commodity NICs limit the choice of fields that can be used to compute the hash. For example, Intel 82599 NICs, which we used in our experiments, only support hashing based on IP v4/6 source/destination addresses and TCP/UDP source/destination ports. This limitation could be overcome with the cooperation of other network entities. For example, CCN/NDN packets could be encapsulated in UDP packets. Network entities could agree on a range of UDP destination ports to use and select a specific port in such range based on the hash of the content item.
Packets are passed from Network I/O to processing cores using concurrent lock-less ring queues. These queues are implemented using the atomic compare-and-swap (CAS) instruction which is widely supported by modern CPUs, including x86 architectures. This enables lightweight producer/consumer synchronisation without recurring to locks. It should be noted that packets are not copied during this operation. Cores only pass each others 32-bit pointers indicating the index of a packet in a packet buffer pool.
Multiple instances of Network I/O cores do not need synchronisation among each other, since each core has exclusive access to a dedicated hardware queue on each NIC.
Processing cores. They are in charge of receiving packets from NICs or Network I/O cores and per- forming cache lookup operations. As mentioned above, since we shard packets across cores, each processing core can operate using dedicated instances of all data structures. Also, each processing core has exclusive access to an area of the SSD memory for storing data. In case NIC’s RSS functionality is used (instead of software-based hashing by Network I/O cores), each processing core also has exclusive access to a DRAM region, otherwise DRAM is shared by all cores. However, even in the latter case, synchronisation overhead is minimum as allocation/release of packets from the packet pool is also implemented using the atomic CAS instruction.
When an Interest for a chunk is received, the processing core looks it up in the content cache. If available in the DRAM cache, it sends it to the destination NIC directly using a DMA transfer. If available in SSD cache, it issues a request to an SSD I/O core to fetch the content from SSD. If the content is not in cache, then it performs other standard processing operations (i.e., PIT and FIB lookups).
SSD I/O cores. Their only functionality is to receive SSD read/write commands from processing cores and execute them. All communications between processing and SSD I/O cores occur using the same kind of lock-less rings used for communications between NIC I/O and processing cores. We use dedicated cores to access SSDs in order to separate high-latency SSD lookup operations from forwarding and DRAM lookup operations in order not to compromise the latency of the latter. One key design challenge is how to deal with packet-level caching (each addressable content packet
is referred to as a chunk in CCN/NDN terminology). In fact, in addition to requiring considerably more lookup operations than object-level caching, reading and writing chunks randomly to SSD causes poor performance for two reasons. First, chunks may be smaller than SSD pages and therefore lead to read and write amplification if stored independently. As a reference, NDN specifications mandate chunk sizes of up to 4 KB, while modern SSD drives may have pages of up to 32 KB. Second, even if page and chunk size matched, performing small random read and write operations does not fully exploit the parallelism offered by SSD drives.
However, since it is very likely that a user requesting a content object does so by requesting all chunks sequentially from first to last, we can use this pattern to optimise performance. We design data structures to process groups of consecutive chunks together. We name this group of consecutive chunks a segment. When H2C reads/writes data from/to SSD, it does so at a segment granularity. H2C uses SSD drives as raw disks, i.e., it accesses them reading/writing at logical block addresses bypassing the filesystem. Using a standard feature of *nix kernels named vector I/O or scatter/gather I/O, it is possible to copy chunks belonging to the same segment between random locations on DRAM and a contiguous location in SSD. As a by-product, organising chunks in segments reduces the memory overhead of control data structures, as we will see in Sec. 5.3.3. The size of the segment depends on various parameters, but it should be at least as large as the size of an SSD page. In our experiments we use chunks of 4 KB and segments of 8 chunks, i.e., 32 KB.
As the reader may have already noted, a key design decision is the extensive use of sharding throughout the architecture, specifically to partition DRAM and SSD regions and assign each of them exclusively to one processing core. This decision is motivated by the findings of Chapter 3, which showed how a system of hash-partitioned caches yields the same cache hit ratio performance of a single cache of cumulative size. Not only sharding can be safely applied without compromising cache hit ratio, but it also simplifies system implementation and improves throughput since processing cores can operate independently without requiring synchronisation to access shared memory regions. In addition, the decision of assigning chunks to shards based on segment identifier, as opposed to content object identifier, has been made to reduce load imbalance, in accordance to the results of Theorem 3.4.