
Procedia Computer Science 19 (2013) 124-130

1877-0509 © 2013 The Authors. Published by Elsevier B.V. Open access under CC BY-NC-ND license.
Selection and peer-review under responsibility of Elhadi M. Shakshuki.
doi: 10.1016/j.procs.2013.06.021

The 4th International Conference on Ambient Systems, Networks and Technologies (ANT 2013)

Dynamic Distributed Flow Scheduling with Load Balancing

for Data Center Networks

Sourabh Bharti, K. K. Pattanaik

Information and Communication Technology, Atal Bihari Vajpayee-Indian Institute of Information Technology and Management, Gwalior, India

Email addresses: bharti.sourabh90@gmail.com (Sourabh Bharti), kkpatnaik@iiitm.ac.in (K. K. Pattanaik)

Abstract

Current flow scheduling techniques in Data Center Networks (DCN) result in overloaded or underutilized links. Static flow scheduling techniques such as ECMP and VLB use hashing for scheduling flows; in case of a hash collision, a path gets selected multiple times, resulting in overloading of that path and underutilization of other paths. Dynamic flow scheduling techniques like global first fit employ a centralized scheduler and always select the first fitting candidate path for scheduling. Thus, in addition to a single point of failure, overall link utilization also remains a problem because flows are not scheduled on the best available candidate path. This paper first presents a Dynamic Distributed Flow Scheduling (DDFS) mechanism that leads to fair link utilization in the widely used fat-tree topology of DCN. Second, it presents a mechanism to restrict flow scheduling decisions to the lower layers, thus avoiding saturation of core switches. The entire DCN is simulated using Colored Petri Nets (CPN). The load measured at the aggregate switches for various flow patterns in the DCN reveals that the load factors at the aggregate switches vary by at most 0.11, which signifies fair utilization of links.


Keywords: Dynamic Flow Scheduling, Load Balancing, Link Utilization, Data Center Networks, Colored Petri Nets.

1. Introduction

In present-day DCNs, the fat-tree topology is used almost universally. As we move up from top-of-rack (ToR) switches to core switches (CS), the link over-subscription ratio becomes 16:1. Because of this bandwidth variation, the focus is on scheduling flows effectively. To deal with this problem, many static and dynamic flow scheduling techniques have been proposed, such as Equal Cost Multi-Path (ECMP) [1], Valiant Load Balancing (VLB) [4], and global first fit [1]. Although these algorithms aim to schedule flows so as to increase overall network bandwidth utilization, they often cause some links to become highly overloaded while others remain underutilized. The gross outcome is unfair utilization of links.

As given in Hedera [1], a packet's path is non-deterministic and chosen on its way up to the core, and is deterministic from the core switches to its destination edge switch. The fat-tree topology of DCN may therefore cause all the traffic to approach the core switches, thus saturating them.

In this paper, a DDFS mechanism is proposed that mitigates the core switch saturation problem by employing deterministic flow scheduling at the lower-layer switches. The flow scheduling decision is thereby distributed among all the layers instead of residing only at the core switches, enhancing the immunity of core switches to failures and saturation.

The load on a switch is defined as the ratio of incoming traffic to outgoing traffic in a defined time period. Load information at the aggregate switches is of particular interest because these switches form the junction connecting the two remaining types of nodes (ToR and core) in a fat-tree structured DCN. Closely related load values among the aggregate switches therefore imply that their incoming-to-outgoing traffic ratios are also closely related, which signifies fair utilization of the links connected to the corresponding aggregate switches.
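
To make the metric concrete, the following short Python sketch (our own illustration, not part of the paper's CPN model; the counter layout is an assumption) computes a switch's load factor from per-interface traffic counters collected over one measurement window:

    # Minimal sketch: the load factor of a switch is read here as total incoming
    # traffic divided by total outgoing traffic over one measurement window.

    def load_factor(per_interface_counters):
        """per_interface_counters: list of (bytes_in, bytes_out) per interface."""
        total_in = sum(b_in for b_in, _ in per_interface_counters)
        total_out = sum(b_out for _, b_out in per_interface_counters)
        if total_out == 0:
            return float("inf")  # nothing forwarded: switch is effectively saturated
        return total_in / total_out

    if __name__ == "__main__":
        # four interfaces: two towards the core layer, two towards the ToR layer
        counters = [(1200, 1150), (900, 980), (1500, 1400), (700, 720)]
        print(round(load_factor(counters), 3))  # close to 1.0: links fairly utilized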

To mitigate the problem of long waits for small flows, link reservation is used. A long wait is a scenario that arises when a link is occupied by very large flows and small flows have to wait for the link to become available. Following [8], 55% of the average link capacity is reserved for long-lived flows and the remainder for the others.
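
A minimal sketch of this reservation policy is given below; the admission check and all names are our assumptions, only the 55% split comes from the paper and [8]:

    # Sketch: per-link capacity split so that small flows are not starved behind
    # long-lived flows. 55% of the average link capacity is reserved for
    # long-lived flows [8]; small flows are admitted only against the remainder.

    LARGE_SHARE = 0.55

    def admit(flow_size, is_large, used_large, used_small, link_capacity):
        """Return True if the flow fits into its reserved partition of the link."""
        large_budget = LARGE_SHARE * link_capacity
        small_budget = link_capacity - large_budget
        if is_large:
            return used_large + flow_size <= large_budget
        return used_small + flow_size <= small_budget

    if __name__ == "__main__":
        print(admit(3000, False, used_large=50000, used_small=40000, link_capacity=100000))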

The rest of this paper is organized as follows. Section 2 discusses related work. The proposed mechanism is presented in Section 3. Section 4 discusses the experiment design and simulation. Results and analysis are discussed in Section 5. Finally, Section 6 concludes the paper.

2. Related Work

Flow scheduling techniques fall mainly into two categories: static and dynamic. ECMP and VLB are static flow scheduling techniques, whereas global first fit is a dynamic flow scheduling technique.

For path selection, ECMP takes the hash value of selected fields of a packet's header modulo the number of equal-cost paths and forwards the flow along the path corresponding to that hash value. As the number of flows per host and the size of the flows increase, hash collisions become increasingly likely, which results in poor link utilization.
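
The hashing step can be illustrated with a simplified Python sketch; the choice of header fields and of MD5 as the hash function are assumptions, not part of any specific ECMP implementation:

    # Simplified illustration of ECMP path selection: hash selected header fields
    # modulo the number of equal-cost paths. Colliding hashes map different flows
    # onto the same path, which is the overload scenario discussed above.

    import hashlib

    def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
        key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
        digest = hashlib.md5(key).digest()
        return int.from_bytes(digest[:4], "big") % num_paths

    if __name__ == "__main__":
        # two distinct flows may well hash onto the same of 4 equal-cost paths
        print(ecmp_path("10.0.0.1", "10.1.0.2", 40001, 80, "tcp", 4))
        print(ecmp_path("10.0.0.3", "10.2.0.4", 40002, 80, "tcp", 4))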

VLB delivers the load in two steps: it selects an intermediate node at random and then uses this node to forward the load to the destination node. This may result in the same intermediate node being selected multiple times, which again leads to poor utilization of links.
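
The two-step VLB forwarding admits a similarly small sketch (purely illustrative; the node names are invented):

    # Illustration of Valiant Load Balancing: pick a random intermediate (core)
    # node, then forward from that intermediate to the destination. Repeatedly
    # drawing the same intermediate concentrates load on one path.

    import random

    def vlb_route(source, destination, intermediates, rng=random):
        via = rng.choice(intermediates)        # step 1: random intermediate node
        return [source, via, destination]      # step 2: intermediate -> destination

    if __name__ == "__main__":
        cores = ["C1", "C2", "C3", "C4"]
        for _ in range(3):
            print(vlb_route("E1", "E7", cores))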

Hedera [1] employs a centralized scheduler for flow scheduling. The scheduler uses global first fit to schedule the flows, which sometimes wastes link capacity due to ineffective path allocation (a simplified sketch is given below). To the best of our knowledge, no report is found in the literature on DDFS for DCNs targeting fair link utilization.
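
Global first fit can be sketched as follows under our simplified reading: the first candidate path with enough residual capacity is taken, even when a later candidate would be less loaded:

    # Simplified global first fit: scan the candidate paths in a fixed order and
    # place the flow on the first path whose residual capacity can hold it, even
    # if a later path would be a better (less loaded) choice.

    def first_fit(candidate_paths, residual_capacity, flow_size):
        """candidate_paths: list of path ids; residual_capacity: dict path -> capacity."""
        for path in candidate_paths:
            if residual_capacity[path] >= flow_size:
                residual_capacity[path] -= flow_size
                return path
        return None  # no path can currently hold the flow

    if __name__ == "__main__":
        capacity = {"p1": 30, "p2": 90, "p3": 90}
        print(first_fit(["p1", "p2", "p3"], capacity, 25))  # p1, although p2/p3 are emptier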

3. Dynamic Distributed Flow Scheduling (DDFS)

For a fat-tree structured DCN with k source-destination pairs, n flows between each source-destination pair, flows of size f_{S-D} between any source-destination pair, and the given link constraints, the objective function can be described using a linear programming model as:

maximize \sum_{i=1}^{k} \sum_{j=1}^{n} f_{S-D}    (1)

subject to \sum_{i \in U_l} F_i \le C_l, \quad \forall l \in L    (2)

where U_l is the set of all flows traversing link l, C_l is the capacity of link l, L is the set of all links in the network, and F_i is the size of flow i.

3.1. Proposed mechanism

The proposed mechanism schedules flows to improve overall link utilization while achieving closely related load values among all aggregate switches. We begin with the assumption that the load at any aggregate switch i is represented by a nonnegative scalar value w_i. Each aggregate switch has four interfaces, connecting two core switches and two ToR switches (Fig 1). Each interface is equipped with one input and one output queue. The load on each interface is represented by the ratio of incoming to outgoing traffic over a period [9]. The total load on an aggregate switch can thus be represented by Eq. (3):

w_i = \sum (\text{load on each interface})    (3)

At time t, the system's load distribution is represented by the vector

W_t = \{w_1^t, w_2^t, w_3^t, \ldots, w_n^t\}    (4)

A wide variation among the w_i for i = 1 to n signifies that the load on the aggregate switches is not uniformly distributed. Different values of w_i are the consequence of different ratios of incoming to outgoing traffic at the aggregate switches; the inference is that improper flow scheduling on the available links results in a wide variation among the w_i values. Our objective is to schedule the flows in such a way that the load distribution converges towards closely related load values, represented by Eq. (5):

W = \{w, w, w, \ldots, w\}    (5)

Under the assumption of homogeneous aggregate switches and link bandwidths, w can be represented as

w = \frac{1}{n} \sum_{i=1}^{n} w_i    (6)
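
The quantities of Eqs. (4) to (6) can be computed with a short sketch; the sample load values are hypothetical and merely chosen to be consistent with the ranges reported in Section 5:

    # Sketch: given measured loads w_i at the aggregate switches (Eq. 4), compute
    # the balanced target w of Eq. (6) and the deviation from it. A small maximum
    # deviation and spread indicate the converged distribution of Eq. (5).

    def balance_report(loads):
        w_bar = sum(loads) / len(loads)                 # Eq. (6)
        max_dev = max(abs(w - w_bar) for w in loads)
        spread = max(loads) - min(loads)                # variation reported in Section 5
        return w_bar, max_dev, spread

    if __name__ == "__main__":
        sample = [0.83, 0.86, 0.80, 0.88, 0.91, 0.84, 0.85, 0.83]  # hypothetical w_i
        print(balance_report(sample))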

Algorithm 1: flow scheduling(destination address, source address, flow size, link information). Procedure to schedule the flow.

Data: source address, destination address, flow size (fls), link state information
Result: balanced load on aggregate switches

begin
    if dest.Edge = source.Edge then
        return flow to dest from Edge
    else
        aggregate <- SELECT-SWITCH(a, b, fls)
        if dest.Edge is reachable from the selected aggregate switch then
            return flow to dest.Edge
        else
            core <- SELECT-SWITCH(x, y, fls)
            get pod number from dest. address
            forward flow to pod number.Agg
            if dest.Edge is reachable from the selected aggregate switch then
                return flow to dest.Edge

Algorithm 2: SELECT-SWITCH(link 1, link 2, flow size)

Data: link 1 capacity, link 2 capacity, flow size (fls)
Result: selected link

begin
    if fls > link1 and fls > link2 then
        wait for a link to become available
    else
        if (link1 > fls) and (link1 > link2) then send flow by link 1
        if (link2 > fls) and (link2 > link1) then send flow by link 2
    return selected switch
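
A compact Python rendering of Algorithms 1 and 2 is sketched below. The data structures (uplink lists with residual capacities, reachability callbacks) are our assumptions; the paper's actual implementation is the CPN model of Section 4, and ties between equally loaded links are resolved here in favour of link 1:

    # Illustrative Python rendering of Algorithms 1 and 2 (not the paper's CPN code).
    # select_switch: choose between two uplinks by residual capacity (Algorithm 2).
    # schedule_flow: distribute the decision across edge, aggregate and core layers
    # (Algorithm 1), so that core switches only see inter-pod traffic.

    def select_switch(link1, link2, fls):
        """link1/link2: (switch_id, residual_capacity). Returns the chosen switch id."""
        if fls > max(link1[1], link2[1]):
            return None                      # wait: no uplink can carry the flow now
        if link1[1] >= fls and link1[1] >= link2[1]:
            return link1[0]                  # ties resolved in favour of link 1
        return link2[0]

    def schedule_flow(src_edge, dst_edge, fls, uplinks, same_pod, dst_pod_aggregate):
        """uplinks: dict switch -> [(next_switch, residual_capacity), (..., ...)]."""
        if dst_edge == src_edge:
            return [src_edge, dst_edge]                  # intra-edge: deliver directly
        agg = select_switch(*uplinks[src_edge], fls)     # pick an aggregate switch
        if agg is None:
            return None
        if same_pod(agg, dst_edge):
            return [src_edge, agg, dst_edge]             # intra-pod: stop at aggregate
        core = select_switch(*uplinks[agg], fls)         # inter-pod: pick a core switch
        if core is None:
            return None
        return [src_edge, agg, core, dst_pod_aggregate(dst_edge), dst_edge]

    if __name__ == "__main__":
        uplinks = {"E1": [("A1", 80), ("A2", 60)], "A1": [("C1", 70), ("C2", 90)]}
        path = schedule_flow("E1", "E5", 40, uplinks,
                             same_pod=lambda agg, edge: False,
                             dst_pod_aggregate=lambda edge: "A5")
        print(path)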

4. Experiment design and simulation

The proposed mechanism is simulated using CPN. We explain the experiment design and simulation setup in the following subsections.

4.1. Modeling DCN traffic

Since DCN traffic traces are not publicly available due to privacy and security concerns, we model patterns that characterize DCN traffic. Data center traffic flows fall into two categories: small or short-lived flows and large or long-lived flows [1]. Large flows are far fewer in number than small flows [5, 7]. Flow arrivals are Poisson distributed, with an average number of flows per time-frame. Packet size is application dependent [10]: similar to Internet traffic, DCN traffic may consist of different application-specific flows, and the packet size in a flow is specific to an application. Packet size variation follows a discrete random distribution.
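
The traffic model can be reproduced with a short generator sketch. The numeric thresholds follow Table 1; the 5% share of long-lived flows and the size offset for large flows are assumptions:

    # Sketch of the DCN traffic model: Poisson flow arrivals, sizes dominated by
    # short-lived flows, and per-flow packet counts/sizes drawn as in Table 1.

    import math
    import random

    SMALL_FLOW_MAX = 70_000        # MB, boundary between small and large flows (Table 1)
    LARGE_FLOW_FRACTION = 0.05     # assumed share of long-lived flows (they are few [5, 7])

    def poisson(lam, rng=random):
        """Knuth's method for a Poisson-distributed integer with mean lam."""
        L, k, p = math.exp(-lam), 0, 1.0
        while True:
            k += 1
            p *= rng.random()
            if p <= L:
                return k - 1

    def generate_flows(mean_arrivals, rng=random):
        """Flows arriving in one time-frame; the arrival count is Poisson distributed."""
        flows = []
        for _ in range(poisson(mean_arrivals, rng)):
            large = rng.random() < LARGE_FLOW_FRACTION
            size = (SMALL_FLOW_MAX + rng.randint(1, 30_000) if large
                    else rng.randint(1, SMALL_FLOW_MAX))
            n_packets = poisson(100, rng)          # packet count, Poisson with mean 100
            pkt_size = rng.randint(1, 1000)        # packet size distributed in 1..1000
            flows.append({"size": size, "packets": n_packets, "pkt_size": pkt_size})
        return flows

    if __name__ == "__main__":
        print(len(generate_flows(35)), "flows generated in this time-frame")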

We first focus on generating DCN traffic patterns that stress and saturate the network, and then apply our dynamic distributed flow scheduling mechanism to achieve fair link utilization. As a by-product, the mechanism prevents the core switches from saturation and failure. In our traffic data, the flow size distribution is dominated by short-lived flows, as evident from Fig 6.

Our DCN is structured around the fat-tree topology. Fig 1 shows the net for the topology used in the DCN. It comprises four pods, and each pod in turn comprises two ToR switches and two aggregate switches. Each core switch is connected to all four pods. The capacity of the links between a node (server) and its ToR switch differs from that of all other links [2].
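
For reference, a generic k=4 fat-tree matching Fig 1 can be constructed programmatically; this is not the paper's CPN net, and the naming scheme (Node1-Node16, E1-E8, A1-A8, C1-C4) simply follows the figure:

    # Generic k=4 fat-tree construction matching Fig 1: 4 pods, each with 2 edge
    # (ToR) and 2 aggregate switches, 4 core switches, and 16 end hosts.

    def build_fat_tree(k=4):
        links = []
        cores = [f"C{i+1}" for i in range((k // 2) ** 2)]
        for p in range(k):                               # one pod at a time
            edges = [f"E{p * (k // 2) + j + 1}" for j in range(k // 2)]
            aggs = [f"A{p * (k // 2) + j + 1}" for j in range(k // 2)]
            for e_idx, e in enumerate(edges):
                for h in range(k // 2):                  # hosts under each ToR switch
                    links.append((f"Node{(p * (k // 2) + e_idx) * (k // 2) + h + 1}", e))
                for a in aggs:                           # full edge-aggregate mesh in the pod
                    links.append((e, a))
            for a_idx, a in enumerate(aggs):             # each aggregate reaches k/2 cores
                for c in range(k // 2):
                    links.append((a, cores[a_idx * (k // 2) + c]))
        return links

    if __name__ == "__main__":
        topo = build_fat_tree()
        print(len(topo), "links")   # 48 = 16 host + 16 edge-agg + 16 agg-core links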

4.2. Simulation

The hierarchical net of our simulation is shown in Fig 1, and the corresponding simulation parameters are listed in Table 1. The logical and important subnets of this hierarchical net are shown in Fig 2 through Fig 5, and the firing rules are given in Table 2.

Table 1. Simulation parameters

Parameter               Description
Topology                Fat-tree
Capacity partition      55% of the average link capacity to large flows
Flow arrivals           Poisson distributed
Flow bandwidth          1-2% of average link capacity
Scheduling              Dynamic distributed
Small flow size         0-70000 MB
Large flow size         >70000 MB
Maximum no. of flows    500 flows by one node
Packet range            Poisson distributed with mean 100
Packet size             Range distributed between 1 and 1000

Table 2. Firing rules

Label            Places                Transitions       Functionality
Input data       p1, p2, p3, p4        t1, t2            Generates flows randomly
Link capacity    p1, p2, p3, p4, p5    t1, t2, t3, t4    Assigns link capacity dynamically
Path selection   a, b, p1, p2, p3      t1                Selects the link according to its available capacity
Queue            p1, p2, p3, p4        t1, t2            Implements a queue

Fig. 1. Fat-Tree Topology of DCN (CPN hierarchical net with hosts Node1-Node16, edge switches E1-E8, aggregate switches A1-A8, and core switches C1-C4)

Fig. 2. Input data

Fig. 3. Link capacity

Fig. 4. Path selection

Fig. 5. Queue

5. Results and analysis

The flow size distribution for the average number of flows generated by any single node/server in a defined time period is shown in Fig 6. The number of large (long-lived) flows is much smaller than the number of small (short-lived) flows, which characterizes the nature of DCN traffic.

Fig. 6. Flow size distribution (number of small and large flows over time)

Fig. 7. Load factor variation at aggregate switches a1-a8 (DDFS vs. NDFS)

Fig. 8. Saturation of core switches (flows arriving at C1-C4 over time, NDFS vs. DDFS)

Load measurement at the aggregate switches was one of our important requirements for studying the effect of our mechanism. The study aimed first to see how the present non-deterministic flow scheduling (NDFS) affects link utilization, and second to see the level of impact of our mechanism on link utilization. Fig 7 shows the average load factor estimated at the aggregate switches. Load information at an aggregate switch represents the ratio of its incoming and outgoing traffic rates, which signifies the load factor on the switch. Ideally, the estimated load factor at an aggregate switch should be close to 1 for the associated link(s) to be considered fairly utilized. Wide variation in load factor values among the aggregate switches indicates that their incoming and outgoing traffic rates differ widely, which signifies unfair link utilization. For clarity about the traffic, Fig 6 presents the different flows and their population at different times. From Fig 7 it is evident that with our mechanism the load factor across the aggregate switches varies between 0.80 and 0.91, with an average load factor of 0.85 per aggregate switch. This indicates that all the links connected to the aggregate switches have been effectively utilized.

In NDFS, by contrast, the load factor across the aggregate switches varied between 1.05 and 1.50, with an average load factor of 1.27 per aggregate switch. This indicates that the outgoing traffic rate is lower than the incoming rate, saturating the aggregate switches and causing unfair link utilization. Comparing the maximum variation in load factors obtained in each case demonstrates that our mechanism addresses the identified shortcomings of NDFS.

Monitoring flow arrivals at the core switch layer was another important requirement, to study how restricting the flow scheduling decisions to the lower layers avoids saturation of the core switches. Fig 8 shows the incoming flows over an observation period, comparing the flows at each core switch for both NDFS and DDFS. It clearly indicates that the traffic on which scheduling decisions must be taken at the core switches is greatly reduced. Further analysis reveals that in the case of NDFS, as the traffic increases over time, the number of flows accumulating at the core switches shows an increasing trend, signifying that the core switches are tending towards saturation. The outcome of DDFS, in contrast, shows a decreasing trend of incoming flows, which is the direct consequence of scheduling at the lower layers.

6. Conclusion

The major finding of our work is that, in the pursuit of fair utilization of the available links in Data Center Networks, the proposed DDFS can outperform NDFS and is resilient to switch saturation when the network is stressed with a larger number of flows. A further finding is that by scheduling flows right from the edge switch level, we are able to distribute the traffic going towards the core switches fairly across the available links. The findings can be summarized as: fair utilization of links, fairly uniform load at the aggregate switches, and prevention of core switch saturation. Given the simple and easily deployable approach, we conclude that DDFS has the potential to produce better link utilization at moderate additional cost.

References

[1] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, A. Vahdat, "Hedera: dynamic flow scheduling for data center networks," in Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.
[2] M. Al-Fares, A. Loukissas, A. Vahdat, "A scalable, commodity data center network architecture," in Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, Seattle, WA, USA, August 17-22, 2008.
[3] A. Greenberg, S. Kandula, D. A. Maltz, J. R. Hamilton, C. Kim, P. Patel, N. Jain, P. Lahiri, S. Sengupta, "VL2: a scalable and flexible data center network," in Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, 2009.
[4] R. Zhang-Shen and N. McKeown, "Designing a predictable Internet backbone with Valiant load-balancing," in Thirteenth International Workshop on Quality of Service (IWQoS), Passau, Germany, 2005.
[5] T. Benson, A. Anand, A. Akella, and M. Zhang, "Understanding data center traffic characteristics," in Proceedings of ACM WREN, 2009.
[6] A. Greenberg et al., "VL2: a scalable and flexible data center network," in Proceedings of ACM SIGCOMM, 2009.
[7] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, R. Chaiken, "The nature of data center traffic: measurements and analysis," in Proceedings of ACM IMC, 2009.
[8] A. Shaikh, J. Rexford, and K. G. Shin, "Load-sensitive routing of long-lived IP flows," in Proceedings of ACM SIGCOMM, 1999.
[9] K. Singh, "Router buffer traffic load calculation based on a TCP congestion control algorithm," International Journal of Computational Engineering & Management, Vol. 15, Issue 1, 2012.
[10] X. Wang and D. J. Parish, "Optimized multi-stage TCP traffic classifier based on packet size distributions," Third
