Achieving High Utilization with Software-Driven WAN

(1)

Achieving High Utilization with Software-Driven WAN

Chi-Yao Hong

(UIUC)

Srikanth Kandula Ratul Mahajan Ming Zhang

Vijay Gill Mohan Nanduri Roger Wattenhofer

(ETH)

Microsoft Abstract—We present SWAN, a system that boosts the utilization of inter-datacenter networks by centrally control-ling when and how much traffic each service sends and fre-quently re-configuring the network’s data plane to match current traffic demand. But done simplistically, these re-configurations can also cause severe, transient congestion because different switches may apply updates at different times. We develop a novel technique that leverages a small amount of scratch capacity on links to apply updates in a provably congestion-free manner, without making any as-sumptions about the order and timing of updates at individ-ual switches. Further, to scale to large networks in the face of limited forwarding table capacity, SWANgreedily selects a small set of entries that can best satisfy current demand. It updates this set without disrupting traffic by leveraging a small amount of scratch capacity in forwarding tables. Ex-periments using a testbed prototype and data-driven simu-lations of two production networks show that SWANcarries 60% more traffic than the current practice.

Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design

Keywords: Inter-DC WAN; software-defined networking

1. INTRODUCTION

The wide area network (WAN) that connects the data-centers (DC) is critical infrastructure for providers of online services such as Amazon, Google, and Microsoft. Many ser-vices rely on low-latency inter-DC communication for good user experience and on high-throughput transfers for relia-bility (e.g., when replicating updates). Given the need for high capacity—inter-DC traffic is a significant fraction of Internet traffic and rapidly growing [20]—and unique traf-fic characteristics, the inter-DC WAN is often a dedicated network, distinct from the WAN that connects with ISPs to reach end users [15]. It is an expensive resource, with amor-tized annual cost of 100s of millions of dollars, as it provides 100s of Gbps to Tbps of capacity over long distances.

However, providers are unable to fully leverage this in-vestment today. Inter-DC WANs have extremely poor ef-ficiency; the average utilization of even the busier links is 40-60%. One culprit is the lack of coordination among the services that use the network. Barring coarse, static limits Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full cita-tion on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-publish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

in some cases, services send traffic whenever they want and however much they want. As a result, the network cycles through periods of peaks and troughs. Since it must be pro-visioned for peak usage to avoid congestion, the network is under-subscribed on average. Observe that network usage does not have to be this way if we can exploit the char-acteristics of inter-DC traffic. Some inter-DC services are delay-tolerant. We can tamp the cyclical behavior if such traffic is sent when the demand from other traffic is low. This coordination will boost average utilization and enable the network to either carry more traffic with the same ca-pacity or use less caca-pacity to carry the same traffic.1

Another culprit behind poor efficiency is the distributed resource allocation model of today, typically implemented using MPLS TE (Multiprotocol Label Switching Traffic En-gineering) [4,24]. In this model, no entity has a global view and ingress routers greedily select paths for their traffic. As a result, the network can get stuck in locally optimal routing patterns that are globally suboptimal [27].

We present SWAN(Software-driven WAN), a system that enables inter-DC WANs to carry significantly more traffic. By itself, carrying more traffic is straightforward—we can let loose bandwidth-hungry services. SWANachieves high effi-ciency while meeting policy goals such as preferential treat-ment for higher-priority services and fairness among similar services. Per observations above, its two key aspects arei) globally coordinating the sending rates of services; and ii) centrally allocating network paths. Based on current service

demands and network topology, SWANdecides how much

traffic each service can send and configures the network’s data plane to carry that traffic.

Maintaining high utilization requires frequent updates to the network’s data plane, as traffic demand or network topol-ogy changes. A key challenge is to implement these updates without causing transient congestion that can hurt latency-sensitive traffic. The underlying problem is that the updates are not atomic as they require changes to multiple switches. Even if the before and after states are not congested, con-gestion can occur during updates if traffic that a link is sup-posed to carry after the update arrives before the traffic that is supposed to leave has left. The extent and duration of such congestion is worse when the network is busier and has larger RTTs (which lead to greater temporal disparity in the application of updates). Both these conditions hold

1_{In some networks, fault tolerance is another reason for low} utilization; the network is provisioned such that there is am-ple capacity even after (common) failures. However, in inter-DC WANs, traffic that needs strong protection is a small subset of the overall traffic, and existing technologies can tag and protect such traffic in the face of failures (§2).

(2)

for our setting, and we find that uncoordinated updates lead to severe congestion and heavy packet loss.

This challenge recurs in every centralized resource allo-cation scheme. MPLS TE’s distributed resource alloallo-cation can make only a smaller class of “safe” changes; it cannot make coordinated changes that require one flow to move in order to free a link for use by another flow. Further, recent work on atomic updates, to ensure that no packet experi-ences a mix of old and new forwarding rules [23, 29], does not address our challenge. It does not consider capacity lim-its and treats each flow independently; congestion can still occur due to uncoordinated flow movements.

We address this challenge by first observing that it is im-possible to update the network’s data plane without creating congestion if all links are full. SWANthus leaves “scratch” capacitys(e.g., 10%) at each link. We prove that this en-ables a congestion-free plan to update the network in at mostd1/se–1 steps. Each step involves a set of changes to forwarding rules at switches, with the property that there will be no congestion independent of the order and

tim-ing of those changes. We then develop an algorithm to

find a congestion-free plan with the minimum number of steps. Further, SWANdoes not waste the scratch capacity. Some inter-DC traffic is tolerant to small amounts of conges-tion (e.g., data replicaconges-tion with long deadlines). We extend our basic approach to use all link capacity while guaran-teeing bounded-congestion updates for tolerant traffic and congestion-free updates for the rest.

Another challenge that we face is that fully using net-work capacity requires many forwarding rules at switches, to exploit many alternative paths through the network, but commodity switches support a limited number of forwarding rules.2 _{Analysis of a production inter-DC WAN shows that} the number of rules required to fully use its capacity exceeds the limits of even next generation SDN switches. We address this challenge by dynamically changing, based on traffic de-mand, the set of paths available in the network. On the same WAN, our technique can fully use network capacity with an order of magnitude fewer rules.

We develop a prototype of SWAN, and evaluate our ap-proach through testbed experiments and simulations using traffic and topology data from two production inter-DC WANs. We find that SWANcarries 60% more traffic than MPLS TE and it comes within 2% of the traffic carried by an optimal method that assumes infinite rule capacity and incurs no update overhead. We also show that changes to network updates are quick, requiring only 1-3 steps.

While our work focuses on inter-DC WANs, many of its underlying techniques are useful for other WANs as well (e.g., ISP networks). We show that even without control-ling how much traffic enters the network, an ability that is unique to the inter-DC context, our techniques for global re-source and change management allow the network to carry 16-25% more traffic than MPLS TE.

2. BACKGROUND AND MOTIVATION

Inter-DC WANs carry traffic from a range of services, where a service is an activity across multiple hosts. Ex-ternally visible functionality is usually enabled by multiple 2

The limit stems from the amount of fast, expensive memory in switches. It is not unique to OpenFlow switches; number of tunnels that MPLS routers support is also limited [2].

internal services (e.g., search may use Web-crawler, index-builder, and query-responder services). Prior work [6] and our conversations with operators reveal that services fall into three broad types, based on their performance requirements. Interactiveservices are in the critical path of end user experience. An example is when one DC contacts another in the process of responding to a user request because not all information is available in the first DC. Interactive traffic is highly sensitive to loss and delay; even small increases in response time (100 ms) degrade user experience [35].

Elasticservices are not in the critical path of user expe-rience but still require timely delivery. An example is repli-cating a data update to another DC. Elastic traffic requires delivery within a few seconds or minutes. The consequences of delay vary with the service. In the replication example, the risk is loss of data if a failure occurs or that a user will observe data inconsistency.

Background services conduct maintenance and provi-sioning activities. An example is copying all the data of a service to another DC for long-term storage or as a pre-cursor to running the service there. Such traffic tends to be bandwidth hungry. While it has no explicit deadline or a long deadline, it is still desirable to complete transfers as soon as possible—delays lower business agility and tie up expensive server resources.

In terms of overall volumes, interactive traffic is the small-est subset and background traffic is the largsmall-est.

2.1 Current traffic engineering practice

Many WANs are operated using MPLS TE today. To

effectively use network capacity, MPLS TE spreads traffic across a number of tunnels between ingress-egress router pairs. Ingress routers split traffic, typically equally using equal cost multipath routing (ECMP), across the tunnels to the same egress. They also estimate the traffic demand for each tunnel and find network paths for it using the con-strained shortest path first (CSPF) algorithm, which iden-tifies the shortest path that can accommodate the tunnel’s traffic (subject to priorities; see below).

With MPLS TE, service differentiation can be provided using two mechanisms. First, tunnels are assigned priori-ties and different types of services are mapped to different tunnels. Higher priority tunnels can displace lower priority tunnels and thus obtain shorter paths; the ingress routers of displaced tunnels must then find new paths. Second, pack-ets carry differentiated services code point (DSCP) bits in the IP header. Switches map different bits to different pri-ority queues, which ensures that packets are not delayed or dropped due to lower-priority traffic; they may still be delayed or dropped due to equal or higher priority traffic. Switches typically have only a few priority queues (4–8).

2.2 Problems of MPLS TE

Inter-DC WANs suffer from two key problems today. Poor efficiency: The amount of traffic the WAN carries tends to be low compared to capacity. For a production inter-DC WAN, which we call IDN(§6.1), we find that the average utilization of half the links is under 30% and of three in four links is under 50%.

Two factors lead to poor efficiency. First, services send whenever and however much traffic they want, without re-gard to the current state of the network or other services. This lack of coordination leads to network swinging between

(3)

(a) 0 0.2 0.4 0.6 0.8 1 Nor malized traffic rate Mean Peak Peak-to-mean ratio = 2.17 (b) 0 0.2 0.4 0.6 0.8 1 Nor malized traffic rate Background traffic Non-background traffic (c) 0 0.2 0.4 0.6 0.8 1 Nor malized traffic rate >50% peak reduction Peak after adapting

Peak before adapting

Figure 1: Illustration of poor utilization. (a) Daily traffic pattern on a busy link in a production inter-DC WAN. (b) Breakdown based on traffic type. (c) Reduction in peak usage if background traffic is dynamically adapted.

𝑅₁ 𝑅₂ 𝑅₃ 𝑅4 𝑅₅ 𝑅₆ 𝑅₇ 𝐹_𝐶 𝐹_𝐴 𝐹_𝐵 (a) Local path selection

𝑅₁ 𝑅₂ 𝑅₃ 𝑅4 𝑅₅ 𝑅₆ 𝑅₇ 𝐹_𝐶 𝐹𝐴 𝐹𝐵

(b) Globally optimal paths Figure 2: Inefficient routing due to local allocation.

over- and under-subscription. Figure1ashows the load over a day on a busy link in IDN. Assuming capacity matches peak usage (a common provisioning model to avoid conges-tion), the average utilization on this link is under 50%. Thus, half the provisioned capacity is wasted. This inefficiency is not fundamental but can be remedied by exploiting traffic characteristics. As a simple illustration, Figure1bseparates background traffic. Figure1cshows that the same total traf-fic can fit in half the capacity if background traftraf-fic is adapted to use capacity left unused by other traffic.

Second, the local, greedy resource allocation model of MPLS TE is inefficient. Consider Figure 2 in which each link can carry at most one flow. If the flows arrive in the orderFA,FB, andFC, Figure2ashows the path assignment

with MPLS TE:FAis assigned to the top path which is one

of the shortest paths; whenFB arrives, it is assigned to the

shortest path with available capacity (CSPF); and the same happens withFC. Figure2bshows a more efficient routing

pattern with shorter paths and many links freed up to carry more traffic. Such an allocation requires non-local changes, e.g., movingFA to the lower path whenFB arrives.

Partial solutions for such inefficiency exist. Flows can be split across two tunnels, which would divide FA across

the top and bottom paths, allowing half of FB and FC to

use direct paths; a preemption strategy that prefers shorter paths can also help. But such strategies do not address the fundamental problem of local allocation decisions [27]. Poor sharing: Inter-DC WANs have limited support for flexible resource allocation. For instance, it is difficult to be fair across services or favor some services over certain paths. When services compete today, they typically obtain throughput proportional to their sending rate, an undesir-able outcome (e.g., it creates perverse incentives for service developers). Mapping each service onto its own queue at routers can alleviate problems but the number of services (100s) far exceeds the number of available router queues. Even if we had infinite queues and could ensure fairness on the data plane, network-wide fairness is not possible without

𝑆2 𝑆1 𝐷2 𝐷3 𝑆3 𝐷₁ 1/2 1/2 1/2 1/2 (a) Link-level 𝑆2 𝑆1 𝐷2 𝐷3 𝑆3 𝐷₁ 1/3 1/3 2/3 2/3 (b) Network-wide Figure 3: Link-level fairness6=network-wide fairness.

controlling which flows have access to which paths. Consider Figure3in which each link has unit capacity and each ser-vice (Si→Di) has unit demand. With link-level fairness,

S2→D2 gets twice the throughput of other services. As we show, flexible sharing can be implemented with a limited number of queues by carefully allocating paths to traffic and control the sending rate of services.

3. SWAN OVERVIEW AND CHALLENGES

Our goal is to carry more traffic and support flexible network-wide sharing. Driven by inter-DC traffic character-istics, SWANsupports two types of sharing policies. First, it supports a small number of priority classes (e.g., Inter-active > Elastic > Background) and allocates bandwidth in strict precedence across these classes, while preferring shorter paths for higher classes. Second, within a class, SWANallocates bandwidth in a max-min fair manner.

SWANhas two basic components that address the funda-mental shortcomings of the current practice. It coordinates the network activity of services and uses centralized resource allocation. Abstractly, it works as:

1. All services, except interactive ones, inform the SWAN

controller of their demand between pairs of DCs. In-teractive traffic is sent like today, without permission from the controller, so there is no delay.

2. The controller, which has an up-to-date, global view of the network topology and traffic demands, computes how much each service can send and the network paths that can accommodate the traffic.

3. Per SDN paradigm, the controller directly updates the forwarding state of the switches. We use OpenFlow switches, though any switch that permits direct pro-gramming of forwarding state (e.g., MPLS Explicit Route Objects [3]) may be used.

While the architecture is conceptually simple, we must ad-dress three challenges to realize this design. First, we need a scalable algorithm for global allocation that maximizes network utilization subject to constraints on service prior-ity and fairness. Best known solutions are computationally intensive as they solve long sequences of linear programs (LP) [9,26]. Instead, SWANuses a more practical approach that is approximately fair with provable bounds and close to optimal in practical scenarios (§6).

Second, atomic reconfiguration of a distributed system of switches is hard to engineer. Network forwarding state needs updating in response to changes in the traffic demand or network topology. Lacking WAN-wide atomic changes, the network can drop many packets due to transient congestion even if both the initial and final configurations are uncon-gested. Consider Figure4in which each flow is 1 unit and each link’s capacity is 1.5 units. Suppose we want to change the network’s forwarding state from Figure4ato4b, perhaps to accommodate a new flow fromR2 toR4. This change re-quires changes to at least two switches. Depending on the

(4)

(a) 𝑅₁ 𝑅₄ 𝑅3 𝐹_𝐴 𝑅2 𝐹_𝐵 (b) 𝑅₁ 𝑅₄ 𝑅3 𝐹𝐴 𝐹𝐵 𝑅2 (c) 𝑅₁ 𝑅₄ 𝑅3 𝑅2 𝐹_𝐵 𝐹𝐴 (d) 𝑅1 𝑅4 𝑅₃ 𝐹𝐴 𝑅2 𝐹𝐵 (e) 𝑅1 𝑅4 𝑅₃ 𝐹𝐵 𝑅2 𝐹𝐴: 50% (f ) 𝑅1 𝑅4 𝑅₃ 𝑅2 𝐹𝐴: 50% 𝐹𝐵

Figure 4: Illustration of congestion-free updates. Each flow’s size is 1 unit and each link’s capacity is 1.5 units. Changing from state (a) to (b) may lead to congested states (c) or (d). A congestion-free update sequence is (a);(e);(f );(b).

order in which the switch-level changes occur, the network reaches the states in Figures4cor4d, which have a heavily congested link and can significantly hurt TCP flows as many packets may be lost in a burst.

To avoid congestion during network updates, SWAN com-putes a multi-stepcongestion-freetransition plan. Each step involves one or more changes to the state of one or more switches, but irrespective of the order in which the changes are applied, there will be no congestion. For the reconfigura-tion in Figure4, a possible congestion-free plan is: i) move half ofFAto the lower path (Figure4e);ii) moveFB to the

upper path (Figure4f); andiii) move the remaining half of

FAto the lower path (Figure4b).

A congestion-free plan may not always exist, and even if it does, it may be hard to find or involve a large number of steps. SWANleaves scratch capacity ofs∈[0,50%] on each link, which guarantees that a transition plan exists with at mostd1/se−1 steps (which is 9 ifs=10%). We then develop a method to find a plan with the minimal number of steps. In practice, it finds a plan with 1-3 steps whens=10%.

Further, instead of wasting scratch capacity, SWAN al-locates it to background traffic. Overall, it guarantees that non-background traffic experiences no congestion dur-ing transitions, and the congestion for background traffic is bounded (configurable).

Third, switch hardware supports a limited number of for-warding rules, which makes it hard to fully use network ca-pacity. For instance, if a switch has six distinct paths to a destination but supports only four rules, a third of paths cannot be used. Our analysis of a production inter-DC WAN illustrates the challenge. If we usek-shortest paths between each pair of switches (as in MPLS), fully using this network’s capacity requiresk=15. Installing these many tunnels needs up to 20K rules at switches (§6.5), which is beyond the capa-bilities of even next-generation SDN switches; the Broadcom Trident2 chipset will support 16K OpenFlow rules [33]. The current-generation switches in our testbed support 750 rules. To fully exploit network capacity with a limited number of rules, we are motivated by how the working set of a process is often a lot smaller than the total memory it uses. Sim-ilarly, not all tunnels are needed at all times. Instead, as traffic demand changes, different sets of tunnels are most suitable. SWAN dynamically identifies and installs these tunnels. Our dynamic tunnel allocation method, which uses an LP, is effective because the number of non-zero variables in a basic solution for any LP is fewer than the number of constraints [25]. In our case, we will see that variables in-clude the fraction of a DC-pair’s traffic that is carried over a

Inter-DC WAN Datacenter

SWAN controller Datacenter

Service host Service broker

Network agent Switch

Figure 5: Architecture ofSWAN.

tunnel and the number of constraints is roughly the number of priority classes times the number of DC pairs. Because SWANsupports three priority classes, we obtain three tun-nels with non-zero traffic per DC pair on average, which is much less than the 15 required for a non-dynamic solution. Dynamically changing rules introduces another wrinkle for network reconfiguration. To not disrupt traffic, new rules must be added before the old rules are deleted; otherwise, the traffic that is using the to-be-deleted rules will be dis-rupted. Doing so requires some rule capacity to be kept vacant at switches to accommodate the new rules; done sim-plistically, up to half of the rule capacity must be kept va-cant [29], which is wasteful. SWANsets aside a small amount of scratch space (e.g., 10%) and uses a multi-stage approach to change the set of rules in the network.

4. SWAN DESIGN

Figure 5 shows the architecture of SWAN. A logically centralized controller orchestrates all activity. Each non-interactive service has a broker that aggregates demands from the hosts and apportions the allocated rate to them. One or morenetwork agentsintermediate between the con-troller and the switches. This architecture provides scale— by providing parallelism where needed—and choice—each service can implement a rate allocation strategy that fits.

Service hosts and brokerscollectively estimate the ser-vice’s current demand and limit it to the rate allocated by the controller. Our current implementation draws on dis-tributed rate limiting [5]. A shim in the host OS estimates its demand to each remote DC for the nextTh=10 seconds

and asks the broker for an allocation. It uses a token bucket per remote DC to enforce the allocated rate and tags packets with DSCP bits to indicate the service’s priority class.

The service broker aggregates demand from hosts and up-dates the controller every Ts=5 minutes. It apportions its

allocation from the controller piecemeal, in time units ofTh,

to hosts in a proportionally fair manner. This way,This the

maximum time that a newly arriving host has to wait before starting to transmit. It is also the maximum time a service takes to change its sending rate to a new allocation. Brokers that suddenly experience radically larger demands can ask for more any time; the controller does a lightweight compu-tation to determine how much of the additional demand can be carried without altering network configuration.

Network agentstrack topology and traffic with the aid

of switches. They relay news about topology changes to

the controller right away and collect and report information about traffic, at the granularity of OpenFlow rules, every

Ta=5 minutes. They are also responsible for reliably

up-dating switch rules as requested by the controller. Before returning success, an agent reads the relevant part of the switch rule table to ensure that the changes have been suc-cessfully applied.

(5)

Controlleruses the information on service demands and network topology to do the following everyTc=5 minutes.

1. Compute the service allocations and forwarding plane configuration for the network (§4.1,§4.2).

2. Signal new allocations to services whose allocation has decreased. Wait forThseconds for the service to lower

its sending rate.

3. Change the forwarding state (§4.3) and then signal the new allocations to services whose allocation has increased.

4.1 Forwarding plane configuration

SWANuses label-based forwarding. Doing so reduces for-warding complexity; the complex classification that may be required to assign a label to traffic is done just once, at the source switch. Remaining switches simply read the label and forward the packet based on the rules for that label. We use VLAN IDs as labels.

Ingress switches split traffic across multiple tunnels (la-bels). We propose to implement unequal splitting, which leads to more efficient allocation [13], using group tablesin the OpenFlow pipeline. The first table maps the packet, based on its destination and other characteristics (e.g., DSCP bits), to a group table. Each group table consists of the set of tunnels available and a weight assignment that reflects the ratio of traffic to be sent to each tunnel. Con-versations with switch vendors indicate that most will roll out support for unequal splitting. When such support is un-available, SWANuses traffic profiles to pick boundaries in the range of IP addresses belonging to a DC such that split-ting traffic to that DC at these boundaries will lead to the desired split. Then, SWAN configures rules at the source switch to map IP destination spaces to tunnels. Our ex-periments with traffic from a production WAN show that implementing unequal splits in this way leads to a small amount of error (less than 2%).

4.2 Computing service allocations

When computing allocated rate for services, our goal is to maximize network utilization subject to service priorities and approximate max-min fairness among same-priority ser-vices. The allocation process must be scalable enough to handle WANs with 100s of switches.

Inputs: The allocation uses as input the service demands

dibetween pairs of DCs. While brokers report the demand

for non-interactive services, SWAN estimates the demand of interactive services (see below). We also use as input the paths (tunnels) available between a DC pair. Running an unconstrained multi-commodity problem could result in allocations that require many rules at switches. Since a DC pair’s traffic could flow through any link, every switch may need rules to split every pair’s traffic across its outgoing ports. Constraining usable paths avoids this possibility and also simplifies data plane updates (§4.3). But it may lead to lower overall throughput. For our two production inter-DC WANs, we find that using the 15 shortest paths between each pair of DCs results in negligible loss of throughput. Allocation LP: Figure6shows the LP used in SWAN. At the core is theMCF(multi-commodity flow) function that maximizes the overall throughput while preferring shorter paths;is a small constant and tunnel weightswjare

pro-portional to latency. sP ri is the fraction of scratch link

ca-pacity that enables congestion-managed network updates;

Inputs:

di: flow demands for source destination pairi

wj: weight of tunnelj(e.g., latency)

cl: capacity of linkl

sP ri: scratch capacity ([0,50%]) for classP ri

Ij,l: 1 if tunneljuses linkland 0 otherwise

Outputs:

bi=Pjbi,j: biis allocation to flowi;bi,jover tunnelj

Func:SWANAllocation: ∀linksl:cremain

l ←cl;// remaining link capacity

forP ri= Interactive,Elastic, . . . ,Backgrounddo {bi} ← Throughput Maximization_{Approx. Max-Min Fairness} P ri,{cremainl }

; cremain l ←cremainl − P i,jbi,j·Ij,l;

Func:Throughput Maximization(P ri,{cremain

l }):

returnMCF(P ri,{cremain

l },0,∞,∅);

Func:Approx. Max-Min Fairness(P ri,{cremain

l }):

// ParametersαandUtrade-off unfairness for runtime

//α >1 and 0< U≤min(fairratei)

T← dlogα

max(di)

U e;F← ∅;

fork= 1. . . T do

foreachbi∈MCF(P ri,{cremainl }, αk

−1_{U, α}k_{U, F}₎_do

ifi /∈F andbi<min(di, αkU)then

F←F+{i};fi←bi;// flow saturated

return{fi:i∈F};

Func:MCF(P ri,{cremain

l }, bLow, bHigh, F):

//Allocate ratebifor flows in priority classP ri

maximize P

ibi−(Pi,jwj·bi,j)

subject to ∀i /∈F :bLow≤bi≤min(di, bHigh);

∀i∈F :bi=fi;

∀l:P

i,jbi,j·Ij,l≤min{clremain,(1−sP ri)cl};

∀(i, j) :bi,j≥0.

Figure 6: Computing allocations over a set of tunnels.

it can be different for different priority classes (§4.3). The SWAN Allocation function allocates rate by invoking MCF separately for classes in priority order. After a class is allo-cated, its allocation is removed from remaining link capacity. It is easy to see that our allocation respects traffic prior-ities. By allocating demands in priority order, SWANalso ensures that higher priority traffic is likelier to use shorter paths. This keeps the computation simple because MCF’s time complexity increases manifold with the number of con-straints. While, in general, it may reduce overall utilization, in practice, SWANachieves nearly optimal utilization (§6).

Max-min fairness can be achieved iteratively: maximize the minimal flow rate allocation, freeze the minimal flows and repeat with just the other flows [26]. However, solving such a long sequence of LPs is rather costly in practice, so we devised an approximate solution instead. SWANprovides approximated max-min fairness for services in the same class by invokingMCFinTsteps, with the constraint that at step

k, flows are allocated rates in the range

αk−1_{U, α}k_U

, but no more than their demand. See Fig.6, functionApprox. Max-Min Fair. A flow’s allocation isfrozenat stepkwhen it is allocated its full demanddiat that step or it receives a

rate smaller thanαkU due to capacity constraints. Ifriand

bi are the max-min fair rate and the rate allocated to flow

i, we can prove that this is anα-approximation algorithm, i.e.,bi∈

ri

α, αri

(Theorem 1 in [14]3_).

Many proposals exist to combine network-wide max-min 3

(6)

fairness with high throughput. A recent one offers a search function that is shown to empirically reduce the number of LPs that need to be solved [9]. Our contribution is showing that one can trade-off the number of LP calls and the degree of unfairness. The number of LPs we solve per priority isT; with maxdi=10Gbps,U=10Mbps andα=2, we getT=10.

We find that SWAN’s allocations are highly fair and take less than a second combined for all priorities (§6). In contrast, Danna et al. report running times of over a minute [9].

Finally, our approach can be easily extended to other pol-icy goals such as virtually dedicating capacity to a flow over certain paths and weighted max-min fairness.

Interactive service demands: SWANestimates an inter-active service’s demand based on its average usage in the last five minutes. To account for prediction errors, we inflate the demand based on the error in past estimates (mean plus two standard deviations). This ensures that enough capacity is set aside for interactive traffic. So that inflated estimates do not let capacity go unused, when allocating rates to back-ground traffic, SWANadjusts available link capacities as if there was no inflation. If resource contention does occur, priority queueing at switches protects interactive traffic. Post-processing: The solution produced by the LP may not be feasible to implement; while it obeys link capacity concerns, it disregards rule count limits on switches. Di-rectly including these limits in the LP would turn the LP into an Integer LP making it intractably complex. Hence, SWAN post-processes the output of the LP to fit into the number of rules available.

Finding the set of tunnels with a given size that carries the most traffic is NP-complete [13]. SWANuses the follow-ing heuristic: first pick at least the smallest latency tunnel for each DC pair, prefer tunnels that carry more traffic (as per the LP’s solution) and repeat as long as more tunnels can be added without violating rule count constraintmjat

switchj. If Mj is the number of tunnels that switchj can

store andλ∈ [0,50%] is the scratch space needed for rule updates (§4.3.2),mj=(1−λ)Mj.In practice, we found that

{mj}is large enough to ensure at least two tunnels per DC

pair (§6.5). However, the original allocation of the LP is no longer valid since only a subset of the tunnels are selected due to rule limit constraints. We thus re-run the LP with only the chosen tunnels as input. The output of this run has both high utilization and is implementable in the network.

To further speed-up allocation computation to work with large WANs, SWANuses two strategies. First, it runs the LP at the granularity of DCs instead of switches. DCs have at least 2 WAN switches, so a DC-level LP has at least 4x fewer variables and constraints (and the complexity of an LP is at least quadratic in this number). To map DC-level allocations to switches, we leverage the symmetry of inter-DC WANs. Each WAN switch in a inter-DC gets equal traffic from inside the DC as border routers use ECMP for outgoing traffic. Similarly, equal traffic arrives from neighboring DCs because switches in a DC have similar fan-out patterns to neighboring DCs. This symmetry allows traffic on each DC-level link (computed by the LP) to be spread equally among the switch-level links between two DCs. However, symmetry may be lost during failures; we describe how SWANhandles failures in§4.4.

Second, during allocation computation, SWANaggregates the demands from all services in the same priority class be-tween a pair of DCs. This reduces the number of flows that

Inputs:          q, sequence length b0

i,j=bi,j, initial configuration

bq_i,j=b0_i,j, final configuration

cl, capacity of linkl

Ijl, indicates if tunneljusing linkl

Outputs:{ba

i,j} ∀a∈ {1, . . . q}if feasible

maximize cmargin// remaining capacity margin

subject to ∀i, a:P

jbai,j=bi;

∀l, a:cl≥

P

i,jmax(bai,j, b a+1

i,j )·Ij,l+cmargin;

∀(i, j, a) :ba_i,j≥0;cmargin≥0;

Figure 7: LP to find if a congestion-free update sequence of lengthqexists.

the LP has to allocate by a factor that equals the number of services, which can run into 100s. Given the per DC-pair allocation, we divide it among individual services in a max-min fair manner.

4.3 Updating forwarding state

To keep the network highly utilized, its forwarding state must be updated as traffic demand and network topology change. Our goal is to enable forwarding state updates that are not only congestion-free but also quick; the more agile the updates the better one can utilize the network. One can meet these goals trivially, by simply pausing all data movement on the network during a configuration change. Hence, an added goal is that the network continue to carry significant traffic during updates.4

Forwarding state updates are of two types: changing the distribution of traffic across available tunnels and changing the set of tunnels available in the network. We describe below how we make each type of change.

4.3.1 Updating traffic distribution across tunnels

Given two congestion-free configurations with different traffic distributions, we want to update the network from the first configuration to the second in a congestion-free man-ner. More precisely, let the current network configuration be C={bij : ∀(i, j)},where bij is the traffic of flow i over

tunnel j. We want to update the network’s configuration toC0={b0ij:∀(i, j)}. This update can involve moving many

flows, and when an update is applied, the individual switches may apply the changes in any order. Hence, many transient configurations emerge, and in some, a link’s load may be much higher than its capacity. We want to find a sequence of configurations (C=C0, . . . , Ck=C0) such that no link is

overloaded in any configuration. Further, no link should be overloaded when moving fromCi toCi+1 regardless of the order in which individual switches apply their updates.

In arbitrary cases congestion-free update sequences do not exist; when all links are full, any first move will congest at least one link. However, given the scratch capacity that we engineered on each link (sP ri;§4.2), we show that there

ex-ists a congestion-free sequence of updates of length no more thand1/se−1 steps (Theorem 2 in [14]). The constructive proof of this theorem yields an update sequence with ex-actlyd1/se−1 steps. But shorter sequences may exist and are desirable because they will lead to faster updates.

We use an LP-based algorithm to find the sequence with the minimal number of steps. Figure7shows how to exam-4_{Network updates can cause packet re-ordering.} _{In this} work, we assume that switch-level (e.g., FLARE [18]) or host-level mechanisms (e.g., reordering robust TCP [36]) are in place to ensure that applications are not hurt.

(7)

ine whether a feasible sequence of q steps exists. We vary

q from 1 to d1/se−1 in increments of 1. The key part in the LP is the constraint that limits the worst case load on a link during an update to be below link capacity. This load

is P

i,jmax(b a i,j, b

a+1

i,j )Ij,l at stepa; it happens when none

of the flows that will decrease their contribution have done so, but all flows that will increase their contribution have already done so. Ifq is feasible, the LP outputsCa={bai,j},

fora=(1, . . . , q−1), which represent the intermediate con-figurations that form a congestion-free update sequence. From congestion-free to bounded-congestion: We showed above that leaving scratch capacity on each link fa-cilitates congestion-free updates. If there exists a class of traffic that is tolerant to moderate congestion (e.g., back-ground traffic), then scratch capacity need not be left idle; we can fully use link capacities with the caveat that transient congestion will only be experienced by traffic in this class. To realize this, when computing flow allocations (§4.2), we usesP ri=s >0 for interactive and elastic traffic, but set

sP ri = 0 for background traffic (which is allocated last).

Thus, link capacity can be fully used, but no more than (1−s) fraction is used by non-background traffic. Just this, however, is not enough: since links are no longer guaranteed to have slack there may not be a congestion-free solution within d1

se −1 steps. To remedy this, we replace the

per-link capacity constraint in Figure7with two constraints, one to ensure that the worst-case traffic on a link from all classes is no more than (1 +η) of link capacity (η∈[0,50%]) and another to ensure that the worst-case traffic due to the non-background traffic is below link capacity. In this case, we prove thati) there is a feasible solution within max(d1/se– 1,d1/ηe) steps (Theorem 3 in [14]) such that ii) the non-background traffic never encounters loss andiii) the back-ground traffic experiences no more than anηfraction loss. Based on this result, we setη= s

1−s in SWAN, which ensures

the samed1

se −1 bound on steps as before.

4.3.2 Updating tunnels

To update the set of tunnels in the network from P

to P0, SWAN first computes a sequence of tunnel-sets (P=P0, . . . , Pk=P0) that each fit within rule limits of

switches. Second, for each set, it computes how much traffic from each service can be carried (§4.2). Third, it signals ser-vices to send at a rate that is minimum across all tunnel-sets. Fourth, after Th=10 seconds when services have changed

their sending rate, it starts executing tunnel changes as follows. To go from set Pi to Pi+1: i) add tunnels that

are inPi+1 but not in Pi—the computation of tunnel-sets

(described below) guarantees that this will not violate rule count limits;ii) change traffic distribution, using bounded-congestion updates, to what is supported by Pi+1, which

frees up the tunnels that are in Pi but not in Pi+1; iii)

delete these tunnels. Finally, SWANsignals to services to start sending at the rate that corresponds toP0.

We compute the interim tunnel-sets as follows. LetPiadd

andPirem be the set of tunnels that remain be added and

removed, respectively, at stepi. Initially,Padd

0 =P 0 −P and Prem 0 =P−P 0

. At each step i, we first pick a subsetpa i ⊆

Piadd to add and a subset pri ⊆Piremto remove. We then

update the tunnel sets as:Pi+i=(Pi∪pai)−pri,Piadd+1=Piadd−

pa

i, andPirem+1=Pirem−pri. The process ends whenPiaddand

Piremare empty (at which pointPi will beP0).

At each step, we also maintain the invariant that Pi+1,

which is the next set of tunnels that will be installed in the network, leaves λMj rule space free at every switchj. We

achieve this by picking the maximal set pai such that the

tunnels in pa

0∪ · · · ∪pai fit within taddi rules and the

mini-mal setpri such that the tunnels that remain to be removed

(Pirem−pri) fit withintremi rules. The value oftaddi increases

with iand that of trem

i decreases with i; they are defined

more precisely in Theorem 4 in [14]. Within the size con-straint, when selecting pai, SWANprefers tunnels that will

carry more traffic in the final configuration (P0) and those that transit through fewer switches. When selecting pr

i, it

prefers tunnels that carry less traffic in Pi and those that

transit through more switches. This biases SWANtowards finding interim tunnel-sets that carry more traffic and use fewer rules.

We show that the algorithm above requires at most d1/λe −1 steps and satisfies the rule count constraints (The-orem 4 in [14]). At interim steps, some services may get an allocation that is lower than that inPorP0. The problem of finding interim tunnel-sets in which no service’s allocation is lower than the initial and final set, given link capacity con-straints, is NP-hard. (Even much simpler problems related to rule-limits are NP-hard [13]). In practice, however, ser-vices rarely experience short-term reductions (§6.6). Also, since bothP andP0 contain a common core in which there is at least one common tunnel between each DC-pair (per our tunnel selection algorithm; §4.2), basic connectivity is always maintained during transitions, which in practice suf-fices to carry at least all of the interactive traffic.

4.4 Handling failures

Gracefully handling failures is an important part of a global resource controller. We outline how SWAN handles failures. Link and switch failures are detected and com-municated to the controller by network agents, in response to which the controller immediately computes new alloca-tions. Some failures can break the symmetry in topology that SWANleverages for scalable computation of allocation. When computing allocations over an asymmetric topology, the controller expands the topology of impacted DCs and computes allocations at the switch level directly.

Network agents, service brokers, and the controller have backup instances that take over when the primary fails. For simplicity, the backups do not maintain state but acquire what is needed upon taking over. Network agents query the switches for topology, traffic, and current rules. Service bro-kers wait forTh(10 seconds), by which time all hosts would

have contacted them. The controller queries the network agents for topology, traffic, and current rule set, and service brokers for current demand. Further, hosts stop sending traffic when they are unable to contact the (primary and secondary) service broker. Service brokers retain their cur-rent allocation when they cannot contact the controller. In the period between the primary controller failing and the backup taking over, the network continues to forward traffic as last configured.

4.5 Prototype implementation

We have developed a SWANprototype that implements

all the elements described above. The controller, service brokers and hosts, and network agents communicate with each other using RESTful APIs. We implemented network agents using the Floodlight OpenFlow controller [11], which

(8)

Arista Cisco N3K Blade Server (a) Datacenters Aggregate cable bundles Hong Kong New York Florida Barcelona Los Angeles (b) WAN-facing switch Border router Data Center Server

…

Data Center 5 5 (c)

Figure 8: Our testbed. (a) Partial view of the equip-ment. (b) Emulated DC-level topology. (c) Closer look at physical connectivity for a pair of DC.

allows SWANto work with commodity OpenFlow switches. We use the QoS features in Windows Server 2012 to mark DSCP bits in outgoing packets and rate limit traffic using token buckets. We configure priority queues per class in switches. Based on our experiments (§6), we sets=10% and

λ=10% in our prototype.

5. TESTBED-BASED EVALUATION

We evaluate SWAN on a modest-sized testbed. We

ex-amine the efficiency and the value of congestion-controlled updates using today’s OpenFlow switches and under TCP dynamics. The results of several other testbed experiments, such as failure recovery time, are in [14]. We will extend our evaluation to the scale of today’s inter-DC WANs in§6.

5.1 Testbed and workload

Our testbed emulates an inter-DC WAN with 5 DCs spread across three continents (Figure8). Each DC has: i) two WAN-facing switches;ii) 5 servers per DC, where each server has a 1G Ethernet NIC and acts as 25 virtual hosts; andiii) an internal router that splits traffic from the hosts over the WAN switches. A logical link between DCs is two physical links between their WAN switches. WAN switches are a mix of Arista 7050Ts and IBM Blade G8264s, and routers are a mix of Cisco N3Ks and Juniper MX960s. The SWAN controller is in New York, and we emulate control message delays based on geographic distances.

In our experiment, every DC pair has a demand in each priority class. The demand of the Background class is infi-nite, whereas Interactive and Elastic demands vary with a period of 3-minutes as per the patterns shown in Figure9. Each DC pair has a different phase, i.e., their demands are not synchronized. We picked these demands because they have sudden changes in quantity and spatial characteristics to stress SWAN. The actual traffic per{DC-pair, class} con-sists of 100s of TCP flows. Our switches do not support unequal splitting, so we insert appropriate rules into the switches to split traffic as needed based on IP headers.

We set Ts and Tc, the service demand and network

up-date frequencies, to one minute, instead of five, to stress-test SWAN’s dynamic behavior.

5.2 Experimental results

Efficiency: Figure10shows that SWANclosely approxi-mates the throughput of an optimal method. For each 1-min interval, this method computes service rates using a multi-class, multi-commodity flow problem that is not constrained

0 0.05 0.1 0.15 0 1 2 3 Dem and pe r DC-pair [nor m. to link capacit y] Time [minute] (a) Interactive 0 0.1 0.2 0.3 0.4 0.5 0 1 2 3 Time [minute] (b) Elastic Figure 9: Demand patterns for testbed experiments.

0 0.2 0.4 0.6 0.8 1 0 1 2 3 4 5 6 Stac ked g oodput [nor m. to max] Time [minutes] InteractiveElastic Elastic Background Optimal goodput

Figure 10: SWANachieves near-optimal throughput.

.5 .751 0 10 20 Interactive .5 .751 0 10 20 Elastic .5 .751 0 10 20 Thr ou ghput [ nor m . to m aximu m] Time [second] Background (a) SWAN .5 .751 0 10 20 Interactive .5 .751 0 10 20 Elastic .5 .751 0 10 20 Thr ou ghput [ nor m . to m aximu m] Time [second] Background (b) One-shot 0 0.2 0.4 0.6 0.8 1 0% 20% 40% Cum ulative fractio n of D C-pair ﬂows Throughput loss Elastic Bkgd (c) One-shot Figure 11: Updates in SWANdo not cause congestion.

by the set of available tunnels or rule count limits. It’s pre-diction of interactive traffic is perfect, it has no overhead due to network updates, and it can modify service rates in-stantaneously.

Overall, we see that SWANclosely approximates the op-timal method. The dips in traffic occur during updates be-cause we ask services whose new allocations are lower to reduce their rates, wait for Th=10 seconds, and then ask

services with higher allocations to increase their rate. The impact of these dips is low in practice when there are more flows and the update frequency is 5 minutes (§6.6). Congestion-controlled updates: Figure11azooms in on an example update. A new epoch starts at zero and the throughput of each class is shown relative to its maximal allocation before and after the update. We see that with SWANthere is no adverse impact on the throughput in any class when the forwarding plane update is executed att=10s.

To contrast, Figure 11b shows what happens without

congestion-controlled updates. Here, as in SWAN, 10% of scratch capacity is kept with respect to non-background traf-fic, but all update commands are issued to switches in one step. We see that Elastic and Background classes suffer transient throughput degradation due to congestion induced losses followed by TCP backoffs. Interactive traffic is pro-tected due to priority queuing in this example but that does not hold for updates that move a lot of interactive traffic across paths. During updates, the throughput degradation across all traffic in a class is 20%, but as Figure11cshows, it is as high as 40% for some of the flows.

6. DATA-DRIVEN EVALUATION

To evaluate SWANat scale, we conduct data-driven sim-ulations with topologies and traffic from two production inter-DC WANs of large cloud service providers (§6.1). We

show that SWAN can carry 60% more traffic than MPLS

(9)

that SWANenables congestion-controlled updates (§6.4) us-ing bounded switch state (§6.5).

6.1 Datasets and methodology

We consider two inter-DC WANs:

IDN: A large, well-connected inter-DC WAN with more

than 40 DCs. We have accurate topology, capacity, and

traffic information for this network. Each DC is connected to 2-16 other DCs, and inter-DC capacities range from tens of Gbps to Tbps. Major DCs have more neighbors and higher capacity connectivity. Each DC has two WAN routers for fault tolerance, and each router connects to both routers in the neighboring DC. We obtain flow-level traffic on this network using sFlow logs collected by routers.

G-Scale: Google’s inter-DC WAN with 12 DCs and 19

inter-DC links [15]. We do not have traffic and capacity in-formation for it. We simulate traffic on this network using logs from another production inter-DC WAN (different from IDN) with a similar number of DCs. In particular, we ran-domly map nodes from this other network toG-Scale. This mapping retains the burstiness and skew of inter-DC traffic, but not any spatial relationships between the nodes.

We estimate capacity based on the gravity model [30]. Re-flecting common provisioning practices, we also round capac-ity up to the nearest multiple of 80 Gbps. We obtained qual-itatively similar results (omitted from the paper) with three other capacity assignment methods:i) capacity is based on 5-minute peak usage across a week when the traffic is carried over shortest paths using ECMP (we cannot use MPLS TE as that requires capacity information);ii) capacity between each pair of DCs is 320 Gbps;iii) capacity between a pair of DCs is 320 or 160 Gbps with equal probability.

With the help of network operators, we classify traffic into individual services and map each service to Interactive, Elas-tic, or Background class.

We conduct experiments using a flow-level simulator that implements a complete version of SWAN. The demand of the services is derived based on the traffic information from a week-long network log. If the full demand of a service is not allocated in an interval, it carries over to the next interval. We place the SWANcontroller at a central DC and simulate control plane latency between the controller and entities in other DCs (service brokers, network agents). This latency is based on shortest paths, where the latency of each hop is based on speed of light in fiber and great circle distance.

6.2 Network utilization

To evaluate how well SWANutilizes the network, we com-pare it to an optimal method that can offer 100% utiliza-tion. This method computes how much traffic can be car-ried in each 5-min interval by solving a class, multi-commodity flow problem. It is restricted only by link capac-ities, not by rule count limits. The changes to service rates are instantaneous, and rate limiting and interactive traffic prediction is perfect.

We also compare SWAN to the current practice, MPLS

TE (§2). Our MPLS TE implementation has the advanced features thatIDNuses [4,24]. Priorities for packets and tun-nels protect higher-priority packets and ensure shorter paths for higher-priority services. Per re-optimization, CSPF is invoked periodically (5 minutes) to search for better path assignments. Per auto-bandwidth, tunnel bandwidth is pe-riodically (5 minutes) adjusted based on the current traffic

0 0.2 0.4 0.6 0.8 1 Thr oughput [nor

-malnized to optimal] SWAN w/o RateSWAN Control MPLS TE 99.0% 70.3% 58.3% (a)IDN

SWAN w/o RateSWAN Control MPLS TE 98.1% 75.5% 65.4% (b)G-Scale

Figure 12: SWANcarries more traffic than MPLS TE.

demand, estimated by the maximum of the average (across 5-minute intervals) demand in the past 15 minutes.

Figure 12 shows the traffic that different methods can carry compared to the optimal. To quantify the traffic that a method can carry, we scale service demands by the same factor and use binary search to derive the maximum ad-missible traffic. We define admissibility as carrying at least 99.9% of service demands. Using a threshold less than 100% makes results robust to demand spikes.

We see that MPLS TE carries only around 60% of the op-timal amount of traffic. SWAN, on the other hand, can carry 98% for both WANs. This difference means that SWAN car-ries over 60% more traffic that MPLS TE, which is a signif-icant gain in the value extracted from the inter-DC WAN.

To decouple gains of SWAN from its two main

components—coordination across services and global net-work configuration—we also simulated a variant of SWAN

where the former is absent. Here, instead of getting demand requests from services, we estimate it from their throughput in a manner similar to MPLS TE. We also do not control the rate at which services send. Figure12 shows that this variant of SWANimproves utilization by 10–12% over MPLS TE, i.e., it carries 15–20% more traffic. Even this level of increase in efficiency translates to savings of millions of dol-lars in the cost of carrying wide-area traffic. By studying a (hypothetical) version of MPLS that perfectly knows future traffic demand (instead of estimating it based on history), we find that most of SWAN’s gain over MPLS stems from its ability to find better path assignments.

We draw two conclusions from this result. First, both components of SWANare needed to fully achieve its gains. Second, even in networks where incoming traffic cannot be controlled (e.g., ISP network), worthwhile utilization im-provements can be obtained through the centralized resource allocation offered by SWAN.

In the rest of the paper, we present results only forIDN. The results forG-Scaleare qualitatively similar and are de-ferred to [14] due to space constraints.

6.3 Fairness

SWANimproves not only efficiency but also fairness. To study fairness, we scale demands such that background traf-fic is 50% higher than what a mechanism admits; fairness is of interest only when traffic demands cannot be fully met. Scaling relative to traffic admitted by a mechanism ensures that oversubscription level is the same. If we used an

identi-cal demand for SWANand MPLS TE, the oversubscription

for MPLS TE would be higher as it carries less traffic. For an exemplary 5-minute window, Figure13ashows the throughput that individual flows get relative to their max-min fair share. We focus on background traffic as the higher priority for other traffic means that its demands are often

(10)

(a) 0 1 2 3 4 0 100 200 300 400 500 600 0 1 2 3 4

0Flow ID [in the increasing order of ﬂow demand] 100 200 300 400 500 600 SWAN MPLS TE Flow thr oughput [nor malized to

max-min fair rate]

(b) 0 0.1 0.2 0.3 0.01 0.1 1 10 Complementary

CDF over _ﬂows & time

Relative error [throughput vs. max-min fair rate]

SWAN MPLS TE

Figure 13: SWANis fairer than MPLS TE.

1 2 3 4 5 6 0 10% 20% 30% 40% 50% 0 0.1 0.2 0.3 # of stages Thr oughput loss Capacity slack # stages (99th_-pctl.) Throughput loss 0.001 0.01 0.1 1 1 2 3 4 5 6 Fr equency Number of stages s=0% s=10% s=30%

Figure 14: Number of stages and loss in network throughput as a function of scratch capacity.

met. We compute max-min fair shares using a precise but computationally-complex method (which is unsuitable for online use) [26]. We see that SWANwell approximates max-min fair sharing. In contrast, the greedy, local allocation of MPLS TE is significantly unfair.

Figure13b shows aggregated results. In SWAN, only 4% of the flows deviate over 5% from their fair share. In MPLS TE, 20% of the flows deviate by that much, and the worst-case deviation is much higher. As Figure 13a shows, the flows that deviate are not necessarily high- or low-demand, but are spread across the board.

6.4 Congestion-controlled updates

We now study congestion-controlled updates in detail, the tradeoff regarding the amount of scratch capacity and their benefit. Higher levels of scratch capacity lead to fewer stages, and thus faster transitions; but they lower the amount of non-background traffic that the network can carry and can waste capacity if background traffic demand is low. Figure 14 shows this tradeoff in practice. The left graph plots the maximum number of stages and loss in network throughput as a function of scratch capacity. At thes=0% extreme, throughput loss is zero but more stages—infinitely many in the worst case—are needed to transition safely. At thes=50% extreme, only one stage is needed, but the net-work delivers 25−36% less traffic. The right graph shows the PDF of the number of stages for three values ofs. Based on these results, we uses=10%, where the throughput loss is negligible and updates need only 1-3 steps (which is much lower than the theoretical worst case of 9).

To evaluate the benefit of congestion-controlled updates, we compare with a method that applies updates in one shot. This method is identical in every other way, including the amount of scratch capacity left on links. Both methods send updates in a step to the switches in parallel. Each switch ap-plies its updates sequentially and takes 2 ms per update [8]. For each method, during each reconfiguration, we com-pute the maximum over-subscription (i.e., load relative to capacity), at each link. Short-lived oversubscription will be absorbed by switch queues. Hence, we also compute the

0.01% 0.1% 1% 10% 0 0.2 0.4 0.6 Complemantary CDF over

links & updates

Oversubscription ratio 0Overloaded tra 100 200ﬃc [MB] 300 one-shot;bg

SWAN;bg one-shot;non-bg

Figure 15: Link oversubscription during updates.

0.9 0.95 1 0.125K 0.5K 2K 8K 32K Thr oughput [nor m. to optimal]

Switch memory [#Entries in log4] k-shortest paths SWAN λ=1% 10% 50% 0 0.2 0.4 0.6 0.8 1 0 1 2 3 4 5 Cumulative frac. of updates #Stages λ=1% λ=10% λ=50%

Figure 16: SWANneeds fewer rules to fully exploit net-work capacity (left). The number of stages needed for rule changes is small (right).

maximal buffering required at each link for it to not drop any packet, i.e., total excess bytes that arrive during over-subscribed periods. If this number is higher than the size of the physical queue, packets will be dropped. Per pri-ority queuing, we compute oversubscription separately for each traffic class; the computation for non-background traf-fic ignores background traftraf-fic but that for background traftraf-fic considers all traffic.

Figure15 shows oversubscription ratios on the left. We see heavy oversubscription with one-shot updates, especially for background traffic. Links can be oversubscribed by up to 60% of their capacity. The right graph plots extra bytes on the links. Today’s top-of-line switches, which we use in our testbed, have queue sizes of 9-16 MB. But we see that oversubscription can bring 100s of MB of excess pack-ets and hence, most of these will be dropped. Note that we did not model TCP backoffs which would reduce the load on a link after packet loss starts happening, but regardless, those flows would see significant slowdown. With SWAN, the worst-case oversubscription is only 11% (= s

1−s) as

con-figured for bounded-congestion updates, which presents a significantly better experience for background traffic.

We also see that despite 10% slack, one-shot updates fail to protect even the non-background traffic which is sensitive to loss and delay. Oversubscription can be up to 20%, which can bring over 50 MB of extra bytes during reconfigurations. SWANfully protects non-background traffic and hence that curve is omitted.

Since routes are updated very frequently even a small like-lihood of severe packet loss due to updates can lead to fre-quent user-visible network incidents. For e.g., when updates

happen every minute, a 1

1000 likelihood of severe packet loss due to route updates leads to an interruption, on average, once every 7 minutes on the IDN network.

6.5 Rule management

We now study rule management in SWAN. A primary mea-sure of interest here is the amount of network capacity that can be used given a switch rule count limit. Figure16(left) shows this measure for SWAN and an alternative that in-stalls rules for thek-shortest paths between DC-pairs;k is chosen such that the rule count limit is not violated for any switch. We see that k-shortest path routing requires 20K rules to fully use network capacity. As mentioned before, this requirement is beyond what will be offered by

(11)

0 0.2 0.4 0.6 0.8 1 21 22 23 Cum ulative frac. of u pdates

Update time [s] 0 5 Update time [s] 10 15 20 25

t1t2 t3t4 t5 1.3 0.7 10 0.6 10

Compute allocation & rule change plan Compute congestion-controlled plan

Wait for rate limiting Change switch rules Rate limiting

Figure 17: Time for network update.

0 0.2 0.4 0.6 0.8 1 50% 100% Cum ulative frac. of updates

Traﬃc carried by SWAN during updates / traﬃc

carried if updates were instantaneous (a) 0.5 0.6 0.7 0.8 0.9 1 1 10 100 1000 Thr ou ghput [nor maliz ed to 5-m in inter val]

Update interval [min] (b)

Figure 18: (a) SWANcarries close to optimal traffic even during updates. (b) Frequent updates lead to higher throughput.

generation switches. The natural progression towards faster link speeds and larger WANs means that future switches may need even more rules. If switches support 1K rules,

k-shortest path routing is unable to use 10% of the network capacity. In contrast, SWAN’s dynamic tunnels approach enables it to fully use network capacity with an order of magnitude fewer rules. This fits within the capabilities of current-generation switches.

Figure 16(right) shows the number of stages needed to dynamically change tunnels. It assumes a limit of 750 Open-Flow rules, which is what our testbed switches support. With 10% slack only two stages are needed 95% of the time. This nimbleness stems from 1) the efficiency of dynamic tunnels—a small set of rules are needed per interval, and 2) temporal locality in demand matrices—this set changes slowly across adjacent intervals.

6.6 Other microbenchmarks

We close our evaluation of SWANby reporting on some key microbenchmarks.

Update time: Figure17shows the time to update IDN from the start of a new epoch. Our controller uses a PC with a 2.8GHz CPU and runs unoptimized code. The left graph shows a CDF across all updates. The right graph depicts a timeline of the average time spent in various parts. Most up-dates finish in 22s; most of this time goes into waiting for ser-vice rate limits to take effect, 10s each to wait for serser-vices to reduce their rate (t1tot3) and then for those whose rate in-creases (t4tot5). SWANcomputes the congestion-controlled plan in parallel with the first of these. The network’s data plane is in flux for only 600 ms on average (t3 to t4). This includes communication delay from controller to switches and the time to update rules at switches, multiplied by the number of stages required to bound congestion. If SWAN

were used in a network without explicit resource signaling, the average update time would only be this 600 ms. Traffic carried during updates: During updates, SWAN

ensures that the network continues to maintain high utiliza-tion. That the overall network utilization of SWANcomes close to optimal (§6.2) is an evidence of this behavior. More directly, Figure18a shows the %-age of traffic that SWAN

carries during updates compared to an optimal method with instantaneous updates. The median value is 96%.

Update frequency: Figure18b shows that frequent up-dates to the network’s data plane lead to higher efficiency. It plots the drop in throughput as the update duration is increased. The service demands still change every 5 minutes but the network data plane updates at the slower rate (x-axis) and the controller allocates as much traffic as the cur-rent data plane can carry. We see that an update frequency of 10 (100) minutes reduces throughput by 5% (30%). Prediction error for interactive traffic: The error in predicting interactive traffic demand is small. For only 1% of the time, the actual amount of interactive traffic on a link differs from the predicted amount by over 1%.

7. DISCUSSION

This section discusses several issues that, for conciseness, were not mentioned in the main body of the paper. Non-conforming traffic: Sometimes services may (e.g., due to bugs) send more than what is allocated. SWANcan detect these situations using traffic logs that are collected from switches every 5 minutes. It can then notify the owners of the service and protect other traffic by re-marking the DSCP bits of non-confirming traffic to a class that is even lower than background traffic, so that it’s carried only if there is any spare capacity.

Truthful declaration: Services may declare their lower-priority traffic as higher lower-priority or ask for more bandwidth

than they can consume. SWAN discourages this behavior

through appropriate pricing: services pay more for higher priority and pay for all allocated resources. (Even within a single organization, services pay for the infrastructure re-sources they consume.)

Richer service-network interface: Our current design has a simple interface between the services and network, based on current bandwidth demand. In future work, we will consider a richer interface such as letting services reserve resources ahead of time and letting them express their needs in terms of total bytes and a deadline by which they must be transmitted. Better knowledge of such needs can further boost efficiency, for instance, by enabling store-and-forward transfers through intermediate DCs [21]. The key challenge here is the design of scalable and fair allocation mechanisms that composes the diversity of service needs.

8. RELATED WORK

SWANbuilds upon several themes in prior work.

Intra-DC traffic management: Many recent works manage intra-DC traffic to better balance load [1, 7,8] or share among selfish parties [16, 28, 31]. SWAN is similar to the former in using centralized TE and to the latter in providing fairness. But the intra-DC case has constraints and opportunities that do not translate to the WAN. For example, EyeQ [16] assumes that the network has a full bi-section bandwidth core and hence only paths to or from the core can be congested; this need not hold for a WAN. Sea-wall [31] uses TCP-like adaptation to converge to fair share, but high RTTs on the WAN would mean slow convergence. Faircloud [28] identifies strategy-proof sharing mechanisms, i.e., resilient to the choices of individual actors. SWANuses explicit resource signaling to disallow such greedy actions.