An implementation of optimal dynamic load balancing based on multipath IP routing


Juan Pablo Saibene, Richard Lempert and Fernando Paganini

Facultad de Ingeniería, Universidad ORT, Montevideo, Uruguay

Abstract—We develop a protocol through which

multipath-enabled IP routers collectively engage in dynamic traffic engineering, to optimize performance in concert with legacy TCP congestion control. We build on recent theory which shows that a globally optimum resource allocation across the TCP/IP layers can be achieved through control of rates and multipath routing fractions following a consistent congestion signal. In this work we fully develop, in an ns2 simulation environment, the required multipath IP layer consistent with prevailing loss-based TCP protocols, with the additional requirement that individual TCP connections should be routed through single paths. The solution involves a generalization of distance vector protocols where routing metrics reflect loss probabilities. We demonstrate through simulations the stability and performance of the resulting protocol in combination with TCP, and we comment on the complexity of the implementation.

I. INTRODUCTION

Routing plays a clear role in the efficient use of network capacity, since routing choices can overload certain paths while under-utilizing others. Addressing this issue is the aim of Traffic Engineering [1], which focuses mostly on methods to distribute a given traffic matrix of demands within network capacity. Solutions involve the use of optimization together with practical techniques to implement the results via OSPF weights [4], [13], [15] or MPLS tunnels (see [1]). When external demands are unknown or time-varying, dynamic load balancing methods are required (e.g., [2]).

The impact of routing becomes more subtle when combined with the TCP layer: congestion-controlled sources adapt rate to whatever capacities are made available, effectively reducing the demand on a congested path, so superficially the demand appears to be served. It is only when the TCP/IP layers are analyzed jointly that efficiency can be precisely defined, and integrated solutions can be sought. In recent years, research on Multipath TCP [6] has tackled this problem from the transport layer side: here, TCP sources manage multiple congestion windows for different fixed routes provided by IP. This approach has generated interest in the standards community [3]. Its impact is, however, constrained by the limited path diversity in the first hop seen by the source (e.g., a multi-homed server). It is difficult to overcome this limitation without a stronger breach of layer separation, namely source routing inside an operator network. The latter would also not be scalable since the number of internal paths is exponential.

A far greater efficiency gain and better scalability can be obtained if IP routers are endowed with a dynamic multipath function. Here TCP sources continue to manage only a single connection, and it is the network’s job to distribute the load among internal paths. In recent theoretical work [12], reviewed in Section II, it was shown how a very general network utility optimization over both layers can be formulated and solved in this manner. That work also provides a proof-of-concept implementation on the ns2 simulator, with queueing delay as a global congestion measure, controlling sources through a variant of TCP-FAST [7], and routing through a generalized multipath distance vector protocol.

Research supported in part by AFOSR-US under grant FA9550-09-1-0504.

In this paper we seek a router implementation that follows the proposals of [12] without the need to change TCP from the prevailing versions [8] which respond to packet loss as a congestion measure. In Section III we describe the ns2-based implementation of a multipath routing protocol that responds to loss. We also address the challenge of keeping individual TCP connections single-path (to avoid packet reordering), while balancing the aggregate load. This is achieved through a hashing technique. In Section IV we present simulation results that validate the stability and performance of our protocol in different scenarios, in particular with a realistic mix of TCP connections. Brief conclusions are given in Section V.

II. JOINTLY OPTIMIZED CONGESTION CONTROL AND MULTIPATH ROUTING

In this section we review the framework from [12] that allows for a unified control of the network and transport layers to serve a common performance objective. This work builds on the Network Utility Maximization (NUM) models for TCP congestion control [9], [14]. Here each TCP flow $k$ is assigned a utility $U_k(x_k)$ as a function of its rate $x_k$, and the control seeks to maximize the aggregate network utility $\sum_k U_k(x_k)$, subject to the constraint that aggregate link rates $y_l$ do not exceed the corresponding link capacity $c_l$.

In the case of single-path IP routing, link rates $y_l$ are just the sum of rates of the flows through the link. If more than one end-to-end path is defined for a source, the above model can be generalized [9], [6] by breaking the TCP rate $x_k$ into path components, which in turn determine the link rates. For a given capacity, the aggregate network utility can improve through the additional degrees of freedom provided by multiple paths. This motivates recent proposals for Multipath TCP [6], [3], where sources control these individual path rates. If all possible paths between source and destination were included in the model, the problem would become equivalent to the following optimization over the TCP/IP layers.


Problem 1 (NUM): Maximize $\sum_k U_k(x_k)$, subject to link capacity constraints $y_l \le c_l$, and flow balance constraints at each node.

To consider all paths would require controlling route choices at all possible intermediate hops. Source-based Multipath TCP cannot exploit this path diversity, so it cannot reach the efficiency of Problem 1.

In [12] a method is proposed that is able to fully exploit path diversity by controlling the aggregate rate $x_k$ of each TCP flow (as is current practice), plus the split fractions $\alpha^d_{i,j}$ that specify multipath routing to destination $d$ at router $i$. Namely, if $x^d_i$ is the traffic rate reaching router $i$ destined to $d$, the router sends to neighbor $j$ the rate portion

$$y^d_{i,j} = \alpha^d_{i,j}\, x^d_i. \tag{1}$$
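As a hedged illustration of what the flow balance constraints amount to in this node-based notation, one possible way to write them is sketched below. The set $S(i,d)$ of TCP flows entering the network at node $i$ with destination $d$ is notation introduced here for illustration only; the precise formulation is the one given in [12].

```latex
\begin{align*}
x^d_i &= \sum_{k \in S(i,d)} x_k + \sum_{j} y^d_{j,i}, \qquad i \neq d, \\
y^d_{i,j} &= \alpha^d_{i,j}\, x^d_i, \qquad \alpha^d_{i,j} \ge 0, \qquad \sum_j \alpha^d_{i,j} = 1, \\
y_{i,j} &= \sum_d y^d_{i,j}.
\end{align*}
```

The first line says that the traffic for $d$ seen at node $i$ is what local sources inject plus what upstream neighbors forward; the last line aggregates per-destination rates into the link rate that enters the capacity constraint.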

Control laws are presented in [12] for $x_k$, $\alpha^d_{i,j}$ so that the equilibrium allocation of Problem 1 is achieved, or alternatively that of

Problem 2: Maximize $S := \sum_k U_k(x_k) - \sum_l \phi_l(y_l)$ subject to flow balance constraints.

This approximation amounts to replacing capacity constraints with a barrier function; it also can be seen as combining the utility maximization approach with cost minimization used in Traffic Engineering [4], [13], [15].

Control relies on congestion feedback, through a link congestion measure or price $p_l$. This variable can be generated as in standard congestion control (“primal” or “dual” versions, see [14]), which will lead to solutions of either Problem 1 or Problem 2. For the purposes of this paper it suffices to say that either packet loss fraction or queuing delay can serve as congestion price, provided TCP responds to this quantity.

In the multipath setting, congestion prices must be averaged among paths to destination. We introduce node prices $q^d_i$, representing the average price of sending packets from node $i$ to destination $d$, defined through the recursion

$$q^d_d = 0, \qquad q^d_i = \sum_j \alpha^d_{i,j}\,[p_{i,j} + q^d_j], \quad i \neq d. \tag{2}$$

Here $p_{i,j} + q^d_j$ (also denoted $\pi^d_{i,j}$) is the mean price experienced from $i$ to $d$ when routing through next hop $j$. The overall mean node price $q^d_i$ is obtained by averaging these prices with their corresponding routing fractions. The overall price $q^d_s$ from the source node to destination serves as congestion control signal for the TCP sources. For the overall network to solve the desired optimization, split fractions must also respond to the same congestion prices. One control law proposed in [12] for this purpose is

$$\dot{\alpha}^d_{i,j} = \beta_i\,(\pi^d_i - \pi^d_{i,j}); \tag{3}$$

here $\pi^d_i$ is the average of the $\pi^d_{i,j}$ over $j$. So we reduce transmission on paths with higher than average price, transferring to lower price paths. In [12], a saturation is further imposed on the right-hand side of (3) so that traffic fractions remain non-negative. It is shown in [12] that this algorithm, together with prices $p_l = \phi_l'(y_l)$, provides global convergence to the solution of Problem 2. To solve Problem 1, a different price generation method is used, together with a variation of (3) which involves an anticipatory term in the prices. For the purposes of this paper, from now on we focus on (3).
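To make the behavior of (3) concrete, the following is a minimal sketch of one discretized (Euler) step for a single node and destination. It is not the update used in [12] or in the ns2 code: the function name `update_splits`, the `gain` parameter (standing for $\beta_i$ times the update period), and the clip-then-renormalize handling of the non-negativity constraint are assumptions made here for illustration; [12] specifies the actual saturation.

```python
def update_splits(alphas, prices, gain):
    """One Euler step of control law (3) for one (node, destination) pair.

    alphas: current split fractions over next hops (non-negative, sum to 1)
    prices: next-hop prices pi_{i,j}^d, in the same order as alphas
    gain:   beta_i times the update period (hypothetical parameter name)
    """
    # Plain mean of the next-hop prices, following the text after (3).
    mean_price = sum(prices) / len(prices)
    # Move traffic away from next hops priced above the mean.
    stepped = [a + gain * (mean_price - p) for a, p in zip(alphas, prices)]
    # Assumed saturation: clip at zero, then renormalize so fractions sum to 1.
    clipped = [max(0.0, a) for a in stepped]
    total = sum(clipped)
    if total == 0.0:
        return list(alphas)  # degenerate case: keep the previous splits
    return [a / total for a in clipped]


# Two next hops, the second one more congested: traffic shifts to the first.
balanced = update_splits([0.5, 0.5], [0.01, 0.03], gain=1.5)
# A large price gap drives the second fraction to the saturation at zero.
saturated = update_splits([0.99, 0.01], [0.0, 1.0], gain=1.5)
```

Because the deviations from the mean sum to zero, the unsaturated step preserves the total of the fractions; only the clipping requires the renormalization.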

In [12] the theory was demonstrated through an ns2 implementation that used queueing delay as a congestion price. On the router side, delay was used to generate the $q^d_i$ prices, which were used as metric for a generalized distance vector (RIP) protocol: multipath routers use this information to update the split fractions $\alpha^d_{i,j}$. On the transport side, a delay-based congestion control was developed based on TCP-FAST [7]. Two modifications were required: first, FAST was modified to become insensitive to packet reordering, modifying the ACKing scheme to compute RTT averages on all packets. Second, to estimate the BaseRTT parameter in FAST, which represents here average propagation delay over all paths, an explicit feedback from the routers was required at the time-scale of RIP announcements. Results with the above implementation were successful, but it is still far from a deployable proposal, because it requires modifications to both the TCP and IP portions of the network, and feedback between them.

In this paper we pursue a router-side only implementation of the above control laws, compatible with legacy TCP protocols.

III. LOSS-BASED IMPLEMENTATION

Congestion control has for a long time been implemented with TCP-Reno [8], with its additive-increase-multiplicative-decrease regulation of the congestion window based on loss events. In recent years, high-speed TCP variants have appeared that depart from AIMD, but are still predominantly loss-based. This motivates us to consider loss as a congestion price for multipath routing control as well.

If the price $p_l$ is link loss probability, the recursion (2) will compute, to first order, the mean loss fractions experienced from node to destination, given the current routing splits $\alpha^d_{i,j}$. The usual first order approximation (commonly made in congestion control) is that loss probabilities are additive over a path. Under this assumption, the terms $\pi^d_{i,j} = p_{i,j} + q^d_j$ become the conditional loss probabilities when routing to hop $j$, which weighted by the routing probabilities $\alpha^d_{i,j}$ give the correct $q^d_i$ through Bayes’ rule. In particular, if $s$ is the source node, $q^d_s$ will be the end-to-end loss probability.
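As a rough numerical check of the first-order claim, the sketch below evaluates recursion (2) on a small loop-free example and compares the result with the exact end-to-end loss. The four-node topology, the loss values and the splits are all invented here for illustration; they do not come from the simulations in this paper.

```python
# Hypothetical example: source 0, intermediate nodes 1 and 2, destination 3.
DEST = 3
alpha = {0: {1: 0.5, 2: 0.5}, 1: {3: 1.0}, 2: {3: 1.0}}   # split fractions
p = {(0, 1): 0.01, (0, 2): 0.02, (1, 3): 0.03, (2, 3): 0.01}  # link loss


def node_price(i):
    """Node price q_i^d from recursion (2), with link loss as the price."""
    if i == DEST:
        return 0.0
    return sum(a * (p[(i, j)] + node_price(j)) for j, a in alpha[i].items())


def delivery_prob(i):
    """Exact probability that a packet at node i reaches the destination."""
    if i == DEST:
        return 1.0
    return sum(a * (1.0 - p[(i, j)]) * delivery_prob(j)
               for j, a in alpha[i].items())


q_source = node_price(0)              # first-order estimate
exact_loss = 1.0 - delivery_prob(0)   # exact end-to-end loss
```

For these values the recursion gives 0.035 while the exact loss is 0.03475; the gap is second order in the link loss probabilities, as expected from the additive approximation.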

Returning now to TCP, for concreteness with AIMD, we recall the model $x = \frac{1}{RTT}\sqrt{\frac{1.5}{q}}$ from [11] for the TCP rate $x$ as a function of the loss probability $q$ experienced by the flow. Identifying this $q$ with $q^d_s$, we have modeled TCP and multipath IP in a coherent setting. Indeed, introducing the utility $U(x) = -\frac{1.5}{RTT^2\, x}$ associated with the AIMD “demand curve”, the framework of the previous section applies, without the need of explicit message passing between routers and sources, an advantage over the implementation in [12].
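The link between this utility and the AIMD rate formula can be checked in one line. The sketch below assumes the standard NUM equilibrium condition, in which each source equates its marginal utility to the price it observes (here $U'(x) = q$); this is the usual reading of a “demand curve” in [9], [14].

```latex
\[
U(x) = -\frac{1.5}{RTT^{2}\,x}
\;\Longrightarrow\;
U'(x) = \frac{1.5}{RTT^{2}\,x^{2}} = q
\;\Longrightarrow\;
x = \frac{1}{RTT}\sqrt{\frac{1.5}{q}} .
\]
```

Solving the condition for $x$ recovers exactly the rate model quoted from [11], so a loss-based AIMD source behaves as if it were maximizing this utility.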

This model assumes, however, that each TCP flow is distributed over the multiple routes, which has an undesirable effect: packet reordering on arrival due to different delay paths. For legacy TCP, out-of-order arrivals are taken as an indication of loss, triggering an unnecessary congestion response.


Is it impossible, then, to take advantage of the performance benefits of multipath in a router-side only implementation? If there is a single TCP connection between s and d, this would seem an insurmountable limitation. If many flows are present, however, rather than splitting each flow’s packets in multiple routes, we can split the aggregate traffic according to (1) while keeping each connection single-path. A hashing method to achieve this is described below.

A question arises here: does per-flow routing give the same fairness as the full multipath case? Or would some flows see higher-than-average losses, while others see lower-than-average, to the detriment of the former? The answer is that during a transient phase, unfairness can appear. However, a property of the routing dynamics (3) is that when it reaches equilibrium, all paths in use have the same price (loss). So there is still a fair equilibrium, with optimal overall utility as desired.

A. Multipath forwarding based on hashing

Given a desired routing split $(\alpha^d_{i,1}, \alpha^d_{i,2}, \ldots, \alpha^d_{i,J})$, one natural forwarding method could be: generate for each packet a pseudo-random number, uniform in the interval $[0, 1]$, and compare it with the thresholds $\alpha^d_{i,1}$, $(\alpha^d_{i,1} + \alpha^d_{i,2})$, $\ldots$, $(\alpha^d_{i,1} + \alpha^d_{i,2} + \ldots + \alpha^d_{i,J-1})$; this comparison selects the forwarding link.

We wish, however, to rig this random routing so that packets of individual TCP flows always fall in the same “bin”, while aggregate rates keep the desired proportions. This can be done by using a flow identifier (e.g., the ‘5-tuple’ in the packet header) as a seed for pseudo-random number generation. Specifically, consider the operation

$$\text{hash} = \frac{(\text{seed} \times K) \bmod k}{k}. \tag{4}$$

Here $k$ is the resolution used for the uniform distribution, assumed prime, and $K$ is another large prime. The remainder modulo $k$ of seed $\times K$ will tend to distribute uniformly in $0, 1, \ldots, k-1$, leading to the desired distribution in $[0, 1]$ after dividing by $k$.

The above can be used successfully for multipath routing in a single node. In a more general network topology we have an additional requirement: routing decisions at each node should be independent. To see this, consider the following example:

• Node 1 splits traffic between links (1, 2) and (1, 3).
• Downstream node 2 splits between (2, 4) and (2, 5).

Assume both nodes use 50-50 splits. If the hashing seed only depends on the flow, node 2 will only use one outgoing link, because it will only receive packets whose hashes have fallen on one side of the split threshold.
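The example above can be reproduced with a short sketch of the hash (4) and the threshold comparison. The specific primes (`K_RES` for $k$, `K_MULT` for $K$), the function names, and the use of consecutive integers as flow identifiers are assumptions chosen here for illustration; they are not the values used in the ns2 implementation.

```python
from bisect import bisect_right
from itertools import accumulate

K_RES = 65521         # k: prime resolution (assumed value)
K_MULT = 2147483647   # K: large prime multiplier (assumed value)


def flow_hash(seed):
    """Equation (4): map an integer seed to a number in [0, 1)."""
    return ((seed * K_MULT) % K_RES) / K_RES


def pick_link(u, alphas):
    """Compare u with the cumulative split thresholds; return a link index."""
    thresholds = list(accumulate(alphas))
    # min() guards against rounding leaving the last threshold below u.
    return min(bisect_right(thresholds, u), len(alphas) - 1)


flows = range(1, K_RES)  # hypothetical flow identifiers

# Node 1 splits 50-50 between link index 0, i.e. (1, 2), and index 1, i.e. (1, 3).
at_node1 = [pick_link(flow_hash(f), [0.5, 0.5]) for f in flows]
to_node2 = [f for f, link in zip(flows, at_node1) if link == 0]

# Node 2 reuses the same flow-only seed for its own 50-50 split.
at_node2 = [pick_link(flow_hash(f), [0.5, 0.5]) for f in to_node2]

node1_counts = (at_node1.count(0), at_node1.count(1))
node2_counts = (at_node2.count(0), at_node2.count(1))
```

With these choices node 1 divides the flows evenly, but every flow reaching node 2 already has a hash below 0.5, so node 2 sends all of them on its first link and none on the second.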

Independent hashing can be obtained with a seed = flow_id × node_id, where the first term represents the connection, the second the node. To avoid trivial results the latter should be coprime with $k$. In the ns2 version we used $k$ prime and node_id $< k$, and we validated in simulation that splitting across multiple hops takes place.
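A minimal sketch of the node-dependent seed, applied to the two-node example, is given below. As before, the primes, the function names, the integer flow identifiers and the node identifiers 1 and 2 are illustrative assumptions rather than the settings of the ns2 code.

```python
from bisect import bisect_right
from itertools import accumulate

K_RES = 65521         # k: prime resolution (assumed value)
K_MULT = 2147483647   # K: large prime multiplier (assumed value)


def node_hash(flow_id, node_id):
    """Equation (4) with seed = flow_id * node_id, so each node hashes differently."""
    seed = flow_id * node_id
    return ((seed * K_MULT) % K_RES) / K_RES


def pick_link(u, alphas):
    """Compare u with the cumulative split thresholds; return a link index."""
    thresholds = list(accumulate(alphas))
    return min(bisect_right(thresholds, u), len(alphas) - 1)


flows = range(1, K_RES)  # hypothetical flow identifiers

# Flows that node 1 (node_id = 1) forwards on its first link, towards node 2.
via_node2 = [f for f in flows if pick_link(node_hash(f, 1), [0.5, 0.5]) == 0]

# Node 2 (node_id = 2) now hashes the same flows with a different seed.
at_node2 = [pick_link(node_hash(f, 2), [0.5, 0.5]) for f in via_node2]
node2_counts = (at_node2.count(0), at_node2.count(1))
```

In this toy setting node 2 now uses both outgoing links equally, while each flow still maps to a single link at every node, since the hash depends only on the flow and node identifiers.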

B. ns2 implementation

We supplemented the standard ns2 distribution, which contains a module for TCP-Reno, with multipath router modules. Many features are common to the implementation in [12]; we highlight the differences and enhancements. We emphasize that despite some similarities, the present code is in a more final and documented form: it constitutes a patch that can be added to the ns2 distribution, and will be available in [10].

The routing agent at each node $i$ maintains an extended routing table, storing for each destination $d$ three vectors with the variables $q^d_j$, $\alpha^d_{i,j}$, $\pi^d_{i,j}$, indexed by neighbor $j$. In addition, a set of boolean flags is used to tag neighbors in “improper” state; this is part of a method for blocking transient loops, adapted from [5]; we refer to [12] for details.

Neighbor prices $q^d_j$ are updated upon receipt of the corresponding routing announcements from neighbors. A configurable time interval $T_{adv}$ is used$^1$ to schedule announcements, of the form (destination, metric, improper flag), modifying the distance vector agent in ns2. The metrics $q^d_i$ for each $d$ are computed prior to announcement according to (2).

Link prices $p_{i,j}$ (loss probabilities) are measured at configurable periodicity $T_{prob}$. This is done through a queue monitor in ns2, which returns counters p_drops and p_departures, from which the loss fraction is calculated. We included exponential smoothing to eliminate noise from the loss estimation. A slight departure from the theory was to add a small constant $h_c > 0$ to the calculation of $\pi^d_{i,j}$. This only has impact in uncongested situations where the price would otherwise be zero, making routing indifferent. Our addition implies favoring shortest hop-count routes in this under-loaded case.
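The measurement step can be sketched as follows. This is an illustration under stated assumptions, not the ns2 code: the class and function names, the smoothing weight, the value of the hop constant, and the choice of drops / (drops + departures) as the loss fraction are all hypothetical, since the text does not give the exact formula.

```python
class LossEstimator:
    """Smoothed loss fraction for one link, fed with cumulative counters."""

    def __init__(self, smoothing=0.5):
        self.smoothing = smoothing   # weight of the newest sample (assumed)
        self.last_drops = 0
        self.last_departures = 0
        self.loss = 0.0

    def update(self, drops, departures):
        """Called once per measurement period with the cumulative counters."""
        d_drops = drops - self.last_drops
        d_deps = departures - self.last_departures
        self.last_drops, self.last_departures = drops, departures
        offered = d_drops + d_deps
        # Assumed definition: fraction of packets dropped in this period.
        sample = d_drops / offered if offered > 0 else 0.0
        # Exponential smoothing of the per-period samples.
        self.loss = (1 - self.smoothing) * self.loss + self.smoothing * sample
        return self.loss


def next_hop_price(link_loss, neighbor_price, hop_cost=1e-4):
    """pi_{i,j}^d with the small constant h_c added (hop_cost is hypothetical)."""
    return link_loss + neighbor_price + hop_cost


est = LossEstimator(smoothing=0.5)
first = est.update(drops=10, departures=90)    # period sample: 10 / 100
second = est.update(drops=10, departures=190)  # no new drops in this period
idle_price = next_hop_price(0.0, 0.0, hop_cost=0.001)
```

With zero loss everywhere, each extra hop still adds the constant to the price, which is what biases an under-loaded network towards shortest hop-count routes.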

Split fractions are themselves updated with their own period $T_\alpha$, following a discretization of (3). This update must respect blocking, and also the saturation constraints $\alpha^d_{i,j} \ge 0$; we refer to [12] for details. Updated split variables are transferred to the forwarding classifiers. A minimal value $\alpha_{min}$ is also configured, below which the link is not used for forwarding.

We modified the standard ns2 node agents to include two packet classifiers: the first acts as usual, based on destination, and determines if the packet is for the node itself or must be forwarded; in the latter case it is sent to a multipath classifier, one per destination. This is where split ratios are stored and the hashing operation to resolve forwarding is carried out.

C. Complexity considerations

In regard to communication requirements between routers, our multipath routing protocol involves minimal overhead with respect to standard distance vector protocols (e.g. RIP): mainly, allowing a floating point metric. A greater penalty appears in storage and computation. For a network with $N$ destination nodes, a router with $V$ neighbors must store $O(NV)$ variables, and perform $O(NV)$ multiplications every $T_\alpha$. Queue monitoring is $O(V)$ and thus of moderate impact.

All in all, for an intradomain scenario with moderate N , this complexity should be manageable by modern-day routers.

IV. SIMULATION RESULTS

We have carried out extensive simulations of our implementation, which validate the correct behavior of the protocol. The selection presented below highlights the main features and attempts to approximate some realistic scenarios.

$^1$ Random times in $[0.9 \times T_{adv}, 1.1 \times T_{adv}]$ avoid synchronizations.

A. Topology of parallel links

We begin with the simple topology of Figure 1. 100 sources are located at node 0, with destination on node 7. Multipath routing occurs in node 1, with asymmetric bottlenecks in its outgoing links. The parameters are $T_{adv} = 2$ sec, $T_{prob} = 4$ sec, $T_\alpha = 6$ sec, with $\beta T_\alpha = 1.5$ for (3), and $\alpha_{min} = 0.005$.

Fig. 1. Parallel link topology.

Simulation results show convergence to the optimal resource allocation. In particular, the routing splits $\alpha_{1,j}$ in Figure 2 acquire the correct values to fill the overall 100Mbps capacity. This can be compared to 30Mbps achievable with single-path, or 25Mbps with equal-split multipath. Figure 2 also shows the rates of individual TCP flows, all of which are single path. These indeed achieve a fair allocation of bandwidth, around 1Mbps each, irrespective of the path they are assigned. Routing based on hashing with split fractions $\alpha_{1,j}$ controls the number of connections per path to exploit the capacity fairly.

Fig. 2. Evolution of $\alpha_{1,j}$ (top) and individual TCP flow rates (bottom).

B. Multiple source topology, random traffic

The second scenario is the topology of Figure 3, with sources at nodes 0, 1, 2 and destination in node 3. Assume first we had only long (‘elephant’) connections, all with the same utility$^2$, and half as many in node 2 with respect to nodes 0 and 1. In that case we can compute the optimal allocation, which is asymmetric; the required splits $\alpha_{i,j}$ are indicated in the figure.

Fig. 3. Topology with multiple sources, indicating optimal splits.

To make things more realistic we include a random load pattern instead of permanent flows. This is done through an ns2 module that generates Poisson traffic, with $\lambda_i$ arrivals/sec and exponential file sizes of mean size 6MB. We chose $\lambda_0 = \lambda_1 = 0.4$, and $\lambda_2 = 0.2$, creating a load of 19.2 Mbps at source 2, twice in the others. We still leave a few elephant flows in each source as a probe to measure the resulting fair-rates.

Fig. 4. Split fractions from node 1, elephant flow rates in Mbps.

Figure 4 depicts one trajectory of traffic splits, $\alpha_{1,j}$; while there is more time-variation due to random load, it settles around the optimal value. We also show the rate of the elephant flows, again equalized among routes as expected.

C. Stub domain with peering bottlenecks

The third scenario we consider is depicted in Figure 5. Here we have a full-mesh backbone from which destinations are reachable with good bandwidth and low delay, but there is scarce bandwidth and longer latency in the connections to the outside Internet. Such a situation is present in stub ISPs far removed from the Internet core, like those in the authors’ home country. External routers Ext0 and Ext1, located near the core, are still managed by the ISP: the efficient use of their outgoing capacity has high impact on network performance.

$^2$ For this to happen, RTTs must be the same; this is set up by making

Fig. 5. Stub network with bottlenecks in external access.

We include 4 external source nodes 0, 1, 2, 3, which generate traffic to consistently labelled destinations. Each source node is fed, as before, by a mix of permanent TCP connections (10 flows each) and random TCP traffic sources (Poisson traffic with mean load 70Mbps).

Fig. 6. Split fractions for Ext0. Top: to Dst0. Bottom: to Dst2.

Simulation plots for the split fractions in routers Ext0 and Ext1 are given in Figures 6 and 7. All settle initially in the 3-1 ratio consistent with the outgoing link capacities. At 3000 sec we introduce a fault, severing link Ext0-Bb1, later restoring it at 5000 sec. The routing splits at node Ext0 react to these changes quickly, rerouting traffic as needed, whereas those at Ext1 are unaffected. At 7000 sec the backbone link Bb0-Bb2 fails, causing Bb0 to reroute traffic. Since remaining backbone capacity is plentiful, the external rate suffers no degradation. Note, however, that split ratios at Ext0 and Ext1 move to a different equilibrium. This is consistent with theory since this problem has non-unique solutions: split fractions from Ext0 to different destinations can each be different from 3-1 and still result in an overall 3-1 rate partition.

Fig. 7. Split fractions for Ext1. Top: to Dst1. Bottom: to Dst3.

V. CONCLUSIONS

We have implemented a dynamic multipath routing protocol which, combined with legacy TCP, can achieve the optimal resource allocation promised by the theory in [12]. Given its performance and moderate computational requirements, it shows potential for Traffic Engineering practice. One open question is whether the proposal can be made compatible with link-state protocols, prevalent in intra-domain routing today.

REFERENCES

[1] D. Awduche, A. Chiu, A. Elwalid, I. Widjaja, X. Xiao, “Overview and Principles of Internet Traffic Engineering”, RFC3272, IETF.

[2] A. Elwalid, C. Jin, S. Low, and I. Widjaja, “MATE: MPLS Adaptive Traffic Engineering”, Proc. IEEE INFOCOM 2001.

[3] A. Ford, C. Raiciu, M. Handley, “TCP Extensions for Multipath Operation with Multiple Addresses”, Internet Draft, Oct. 2009.

[4] B. Fortz and M. Thorup, “Internet Traffic Engineering by Optimizing OSPF Weights”, Proc. IEEE INFOCOM 2000.

[5] R. G. Gallager, “A minimum delay routing algorithm using distributed computation”, IEEE Trans. on Comm., Vol Com-25 (1), pp. 73-85, 1977.

[6] H. Han, S. Shakkottai, C. Hollot, R. Srikant and D. Towsley, “Multi-Path TCP: A joint congestion and routing scheme to exploit path diversity in the Internet”, IEEE/ACM Trans. Netw. Vol. 14(6), pp. 1260-1271, 2006.

[7] C. Jin, D. X. Wei and S. H. Low, “FAST TCP: motivation, architecture, algorithms, performance”, Proc. IEEE INFOCOM 2004.

[8] V. Jacobson, “Congestion avoidance and control”, Proc. ACM SIGCOMM ’88.

[9] F. P. Kelly, A. Maulloo, and D. Tan, “Rate control for communication networks: Shadow prices, proportional fairness and stability”, Jour. Oper. Res. Society, vol. 49(3), pp 237-252, 1998.

[10] http://athenea.ort.edu.uy/publicaciones/mate/en/index.html.

[11] M. Mathis, J. Semke, J. Mahdavi, T. Ott, “The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm”, Computer Communication Review, volume 27, number 3, July 1997.

[12] F. Paganini, E. Mallada, “A unified approach to congestion control and node-based multipath routing”, IEEE/ACM Trans. on Networking, Vol. 17, no. 5, pp. 1413-1426, Oct. 2009.

[13] A. Sridharan, R. Guerin, C. Diot, “Achieving Near-Optimal Traffic Engineering Solutions for Current OSPF/ISIS Networks”, IEEE/ACM Transactions on Networking, March 2005.

[14] R. Srikant, The Mathematics of Internet Congestion Control, Birkhauser, 2004.

[15] D. Xu, M. Chiang, J. Rexford, “Link-state routing with hop-by-hop forwarding can achieve optimal traffic engineering”, Proc. IEEE