2.3 Flow control
2.3.3 System-wide techniques:
To support these abstractions, aggregate rates must be enforced, which requires a system-wide approach.
Scheduling in Data Centers
In the data center, this means explicitly scheduling the rates at which servers can send to one another and enforcing these rates in software at the servers, typically by modifying the kernel or hypervisor. This approach is reasonable because data centers are under a single administrative control, which means that the networking stack of the kernel or hypervisor is part of the trusted code base. By explicitly scheduling the rates between all of a tenant’s servers, the scheduler can ensure that a tenant does not use more than its share of a bottleneck link. In principle, this means that scheduling can be used to enforce the bandwidth constraints imposed by an arbitrary network topology. In addition, by coordinating rates, explicit scheduling makes it possible to avoid congestion.
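As a minimal sketch of how scheduled rates can keep a tenant within its share of one bottleneck link, the following Python function scales down the tenant's server-to-server rates that cross the link. The function name `cap_to_share`, the dictionary layout, and the proportional scaling rule are assumptions made for illustration; they are not taken from any of the systems discussed here.

```python
def cap_to_share(rates, crossing_pairs, share):
    """Return a copy of `rates` in which the pairs that cross one bottleneck
    link are scaled so that their sum does not exceed the tenant's `share`.

    rates          -- dict mapping (src, dst) to a rate (hypothetical layout)
    crossing_pairs -- set of (src, dst) pairs whose path uses the link
    share          -- the tenant's allowed aggregate rate on that link
    """
    total = sum(rates[pair] for pair in crossing_pairs)
    if total <= share:
        return dict(rates)  # already within the share, nothing to change
    scale = share / total   # proportional scaling is an assumed policy
    return {
        pair: (rate * scale if pair in crossing_pairs else rate)
        for pair, rate in rates.items()
    }


# Two of the tenant's three server pairs cross the bottleneck link.
requested = {("a", "b"): 600, ("a", "c"): 600, ("b", "c"): 300}
scheduled = cap_to_share(requested, {("a", "b"), ("a", "c")}, share=600)
```

Pairs that do not cross the link keep their requested rate, so only the traffic that contributes to the bottleneck is throttled.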
Current approaches schedule rates centrally
There are a number of approaches that schedule rates and enforce them in software.
XCo [50] proposes a general framework to schedule traffic in data center networks.
Their current approach uses a central scheduler to collect traffic information from all of the servers in the network in order to create a global traffic matrix. Using this information, the scheduler periodically provides servers with a time schedule indicating the times during which each server can transmit. Ballani et al. describe “Oktopus” [14], which also enforces server-to-server rates at end hosts using a central controller. NetShare [36] also uses kernel-based rate limiting and a central controller to schedule rates between servers. While this approach can avoid congestion, it is unclear how well it can scale to manage large networks.
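The idea of turning a global traffic matrix into a transmission schedule can be sketched as follows. This is only an illustration under stated assumptions: the greedy slot-filling rule, the function name `build_schedule`, and the constraint that each server sends to and receives from at most one peer per slot are hypothetical and are not the algorithm of XCo, Oktopus, or NetShare.

```python
def build_schedule(demand):
    """Greedily turn a traffic matrix into a list of time slots.

    demand -- dict mapping (src, dst) to the number of slots requested
              (a hypothetical encoding of the global traffic matrix)
    Returns a list of slots; each slot is a list of (src, dst) pairs in
    which no sender and no receiver appears more than once.
    """
    remaining = {pair: n for pair, n in demand.items() if n > 0}
    schedule = []
    while remaining:
        busy_src, busy_dst, slot = set(), set(), []
        # Serve the largest remaining demands first (assumed heuristic).
        ordered = sorted(remaining.items(), key=lambda kv: (-kv[1], kv[0]))
        for (src, dst), _ in ordered:
            if src not in busy_src and dst not in busy_dst:
                slot.append((src, dst))
                busy_src.add(src)
                busy_dst.add(dst)
        for pair in slot:
            remaining[pair] -= 1
            if remaining[pair] == 0:
                del remaining[pair]
        schedule.append(slot)
    return schedule


demand = {("s1", "s3"): 2, ("s2", "s3"): 1, ("s2", "s4"): 1}
schedule = build_schedule(demand)
```

Because the scheduler must see every entry of the traffic matrix before it can emit a slot, the sketch also hints at why the scalability of a central scheduler is a concern for large networks.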
Distributed Scheduling in Interconnection Networks
[Figure: N input ports, each with Virtual Output Queues (VOQs, one per output 1 to n) and a DS controller, connected through an interconnect to N output ports.]
Figure 2.4: Distributed scheduling in high-performance routers
Our work draws upon the concept of distributed scheduling used in interconnection networks. Distributed scheduling is used as a flow control mechanism in large interconnection networks [22] such as those used in high-performance routers [51] [47] [46]. It works by scheduling the rates at which input ports can send to outputs through the interconnect. Figure 2.4 shows a simplified router diagram that helps illustrate the idea. Arriving packets are buffered at each input port in Virtual Output Queues (VOQs) corresponding to the output ports for which they are destined. By controlling the rate of traffic leaving each VOQ, the router can manage the rate at which each input sends to each output. These rates are assigned periodically by a controller, shown as DS in the figure, that resides at each port (or linecard). Controllers periodically exchange information about the state of their queues and use this information to independently assign rates to their VOQs. Congestion can be avoided by assigning rates at inputs so that the total rate sent from the inputs does not overload any of the outputs. This approach can also be used to provide QoS guarantees by managing separate VOQs for each traffic class, effectively providing virtual networks within the interconnect [22].
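The rate assignment performed independently by each controller can be sketched as below. This is a minimal illustration, assuming that every controller has received the VOQ backlogs of all inputs and that each output's capacity is split in proportion to those backlogs. The function name `assign_voq_rates` and the backlog-proportional rule are assumptions for this example, not the algorithm of any particular router.

```python
def assign_voq_rates(my_input, backlog, port_capacity):
    """Compute the rates one input assigns to its VOQs.

    my_input      -- index of the input port running this controller
    backlog       -- backlog[i][j] is the amount queued at input i for
                     output j, as learned from the periodic exchange
    port_capacity -- capacity of every input and output port
    """
    n_outputs = len(backlog[my_input])
    rates = []
    for j in range(n_outputs):
        total = sum(row[j] for row in backlog)
        # Backlog-proportional share of output j (assumed policy).
        share = backlog[my_input][j] / total if total > 0 else 0.0
        rates.append(port_capacity * share)
    # The input's own link is also a constraint, so scale down if needed.
    total_rate = sum(rates)
    if total_rate > port_capacity:
        rates = [r * port_capacity / total_rate for r in rates]
    return rates


backlog = [[100, 0], [100, 200]]  # two inputs, two outputs
rates_in0 = assign_voq_rates(0, backlog, port_capacity=10.0)
rates_in1 = assign_voq_rates(1, backlog, port_capacity=10.0)
```

Since the shares for a given output sum to at most one across inputs, and the final step only reduces rates, the total rate directed at any output never exceeds its capacity even though each controller computes its rates on its own.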
While the scheduling that we propose uses the same basic idea, there are some important differences. In the router context, the network typically provides a speedup relative to the speed of the ports. To accommodate this, each output has an output queue, which is where most of the queueing occurs. These queues are typically quite large (e.g., 100 ms worth of traffic), and the VOQs at inputs usually begin to fill only when there is an overload at an output. Maximizing throughput is often an objective of the scheduling algorithm used in these routers. For example, Distributed BLOOFA [47] attempts to remain work conserving by focusing traffic on the least occupied output queues. However, there is no sensible analogy for an output-side queue at servers in the data center network. Traffic arriving at a server may be destined for different transport-level ports, which may process incoming messages at different rates. It is also generally not practical to construct a data center network that provides a speedup relative to the speed of the server’s interface. Finally, the scheduling that occurs in routers operates on a very small time scale compared to the end-to-end delays experienced by the traffic. In the data center network, control messages experience the same end-to-end delay as the traffic being scheduled, which means that scheduling occurs on an inherently longer time scale.
Chapter 3
Packet-level Routing
In this chapter, we investigate the use of packet-level routing and resequencing in the context of multi-tenant data center networks. Here we focus specifically on FatTree networks constructed from commodity switches, such as the simple example shown in Figure 3.1a. The purpose of the FatTree is to approximate the topology shown in Figure 3.1b, since trees with such “fat” links are impractical to construct for large networks. The motivation for using packet-level routing is to be able to view the network as its logically equivalent tree. In this chapter, we argue that this is an important property to achieve in the context of providing performance isolation while maintaining agility.
[Figure: panel (a) shows link labels of 1 Gbps; panel (b) shows link labels of 4 Gbps and 1 Gbps.]
(a) A fully provisioned 4-port 3-level FatTree DCN.
(b) The same network represented logically as a tree.
Figure 3.1: Under ideal routing, a FatTree is equivalent to a tree.
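The contrast between flow-level and packet-level routing that motivates this chapter can be sketched with a toy load count over equal-cost paths. The function names are hypothetical, the modulo operation is only a stand-in for a hash of a flow's header fields, and round-robin spraying is one assumed way of routing at the packet level; none of these is presented as the method evaluated in this chapter.

```python
from collections import Counter


def flow_level_loads(packets, n_paths):
    """Count packets per path when every packet of a flow uses one path.

    packets -- list of integer flow ids, one entry per packet
    """
    loads = Counter()
    for flow_id in packets:
        # Stand-in for hashing the flow's header fields onto a path.
        loads[flow_id % n_paths] += 1
    return [loads[p] for p in range(n_paths)]


def packet_level_loads(packets, n_paths):
    """Count packets per path when packets are sprayed round-robin."""
    loads = Counter()
    for i, _ in enumerate(packets):
        loads[i % n_paths] += 1
    return [loads[p] for p in range(n_paths)]


# Two flows of six packets each that happen to map to the same path.
packets = [0] * 6 + [2] * 6
flow_loads = flow_level_loads(packets, n_paths=2)
packet_loads = packet_level_loads(packets, n_paths=2)
```

In this example the flow-level assignment places all twelve packets on one path, while the packet-level assignment spreads them evenly, which is the behavior needed for the FatTree to act like its logically equivalent tree. The cost is that packets of one flow may arrive out of order.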
This chapter is divided into four sections.
• First, we make the case for packet-level routing by presenting a performance study comparing flow-level and packet-level routing in the data center.
• Second, we adapt packet-level routing to the data center context by exploring several ways to maximize performance.
• Third, we perform a thorough evaluation where we examine the tradeoff between isolation and network utilization and the degree to which performance is dependent on flow control and the amount of buffering available at the switches.
• Finally, we show how to cope with out-of-order arrivals by describing an efficient method for resequencing packets in software at servers.
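To make the resequencing problem in the last item concrete, the following is a generic sketch of a per-flow resequencing buffer keyed by sequence number. It is an illustrative assumption, not the method described later in this chapter: the class name `Resequencer`, the use of sequence numbers starting at zero, and the absence of any timeout for lost packets are all choices made only for this example.

```python
class Resequencer:
    """Hold out-of-order packets and release them in sequence order."""

    def __init__(self):
        self.next_seq = 0   # next sequence number expected (assumed to start at 0)
        self.pending = {}   # packets that arrived early, keyed by sequence number

    def receive(self, seq, payload):
        """Accept one packet and return the payloads that can now be
        delivered in order (possibly an empty list)."""
        released = []
        if seq < self.next_seq or seq in self.pending:
            return released  # duplicate of a delivered or buffered packet
        self.pending[seq] = payload
        # Drain the buffer for as long as the next expected packet is present.
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released


r = Resequencer()
out1 = r.receive(1, "b")  # early arrival, held back
out2 = r.receive(0, "a")  # fills the gap and releases both packets
out3 = r.receive(3, "d")
out4 = r.receive(2, "c")
```

A practical resequencer would also need to bound the buffer and decide how long to wait for a missing packet; both are left out of this sketch.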