
3.3 Performance in context

3.3.3 Evaluation

There are two metrics that define the performance of load balancing in this context:

• Isolation: The degree of isolation we can provide to tenants.

• Utilization: The capacity that can be used by tenants.

We define isolation as the acceptable fraction of packet loss over all flows and we call this the loss threshold. We then define the usable capacity as the offered load at which the loss threshold is exceeded. We use the usable capacity metric as a way to define utilization in this context since it represents the fraction of the capacity that can be allocated to tenants while maintaining isolation.
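As a minimal sketch of how these two definitions fit together, the snippet below takes measured (offered load, loss fraction) pairs and returns the highest offered load reached before the loss threshold is first exceeded. The function name, the sample values, and the choice to report the last load below the threshold (rather than interpolating to the exact crossing point) are assumptions made for illustration; they are not taken from the evaluation.

```python
from typing import Sequence, Tuple


def usable_capacity(samples: Sequence[Tuple[float, float]],
                    loss_threshold: float) -> float:
    """Largest offered load reached before loss first exceeds the threshold.

    samples: (offered_load, loss_fraction) pairs, in any order.
    Returns 0.0 if even the lowest sampled load exceeds the threshold.
    """
    usable = 0.0
    for load, loss in sorted(samples):
        if loss > loss_threshold:
            break  # isolation is violated from this load onward
        usable = load
    return usable


# Hypothetical measurements, for illustration only.
samples = [(0.80, 0.0), (0.85, 2e-5), (0.90, 4e-4), (0.95, 6e-3), (1.00, 5e-2)]

for threshold in (1e-4, 1e-3, 1e-2):
    print(threshold, usable_capacity(samples, threshold))
```

Tightening the loss threshold can only lower the value returned, which is the isolation versus utilization tradeoff that the rest of this section quantifies.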

This evaluation focuses on investigating two factors that affect the performance of routing:

• Available buffering: The sizes of switch queues determine how much imbalance we can tolerate before losing packets.

• Flow control: Bursty traffic can create substantial loss, which means the degree to which sending rates can be controlled is a key factor affecting performance.
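To make the flow-control factor concrete, the following sketch feeds a single finite queue from several sources using either a periodic send process (strict flow control) or a Poisson send process (loose flow control), the two send processes used later in this section. Everything else is a simplifying assumption for illustration: one queue rather than a FatTree, fixed-size packets with a service time of one time unit, and hypothetical function names and parameter values.

```python
import random
from collections import deque


def arrival_times(n_sources, load, horizon, strict, rng):
    """Merged arrival times at one queue fed by n_sources senders.

    Each sender offers load / n_sources packets per time unit.
    strict=True  -> periodic sends with a random initial phase.
    strict=False -> Poisson sends (exponential gaps with the same mean).
    """
    period = n_sources / load  # mean gap between packets of one sender
    times = []
    for _ in range(n_sources):
        t = rng.uniform(0.0, period) if strict else rng.expovariate(1.0 / period)
        while t < horizon:
            times.append(t)
            t += period if strict else rng.expovariate(1.0 / period)
    times.sort()
    return times


def loss_fraction(times, queue_size, service_time=1.0):
    """Fraction of packets dropped by a FIFO queue holding queue_size packets."""
    departures = deque()  # departure times of packets still in the system
    dropped = 0
    for t in times:
        while departures and departures[0] <= t:
            departures.popleft()  # these packets have already left
        if len(departures) >= queue_size:
            dropped += 1  # buffer full: the arriving packet is lost
            continue
        start = departures[-1] if departures else t
        departures.append(start + service_time)
    return dropped / len(times) if times else 0.0


rng = random.Random(1)
for label, strict in (("strict FC", True), ("loose FC", False)):
    times = arrival_times(n_sources=11, load=0.95, horizon=50_000.0,
                          strict=strict, rng=rng)
    print(label, loss_fraction(times, queue_size=16))
```

With periodic senders the backlog is bounded by roughly the number of senders, so a 16-packet buffer never overflows in this toy setting, whereas Poisson senders at the same offered load occasionally burst past it.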

Here we simulate the four corners of our strict/loose model to tease apart the effects between our routing approaches. By varying the switch queue size, we can determine how much buffering is needed to be able to utilize a given fraction of the network while maintaining a given degree of isolation. This also serves as a guide to help weigh some of the costs and tradeoffs of designing a network based around our approach.

[Figure 3.11 plot omitted. Axes: loss threshold (10⁻⁶ to 10⁻¹) and usable capacity (0.75 to 1.00). Series: LogicalTree: Strict FC; LogicalTree: Loose FC; FatTree: Strict FC, ~Strict LB (MP); FatTree: Strict FC, Loose LB (VLB); FatTree: Loose FC, ~Strict LB (MP); FatTree: Loose FC, Loose LB (VLB); M/M/1/K LogicalTree; M/M/1/K FatTree.]

Figure 3.11: Fraction of the network’s capacity achievable as a function of the acceptable loss threshold. l = 3 levels, k = 12 ports, queue size K = 50 packets.

Isolation vs. utilization

We begin by simulating the four corners of our loose/strict model under all-to-all traffic and compare the results with our queueing theory models. For reasons that we discuss in B.3, we made the packet size follow a Poisson distribution around the medium packet size (midway between the minimum and maximum sizes) so that the simulations compare as directly as possible with our queueing theory models.

Figure 3.11 shows the maximum fraction of the network’s capacity that we can safely use without exceeding a given loss threshold when the queue size is fixed at 50 packets.

With M/M/1/K the departure process is Poisson so this effectively models random packet sizes. Since the capacity, K, is in terms of packets, we chose a value of 50 packets because it roughly corresponds to a limit of 32 KB when using average-sized packets. In addition to simulating the four corners of the space described earlier, we plotted the results from our M/M/1/K FatTree and M/M/1/K LogicalTree analysis in the figure. The results show that the lower bound on performance provided by the M/M/1/K models is overly conservative. However, they show that, with a small amount of buffering, we can expect to achieve a reasonable degree of isolation (e.g. loss thresholds of 10⁻³ or 10⁻⁴) while being able to effectively utilize at least 85% of the network’s capacity.

[Figure 3.12 plot omitted. Axes: usable capacity (0.80 to 1.00) and switch queue size in packets (logarithmic scale). Series: FatTree: Strict FC, ~Strict LB (MP); FatTree: Strict FC, Loose LB (VLB); FatTree: Loose FC, ~Strict LB (MP); FatTree: Loose FC, Loose LB (VLB); M/M/1/K LogicalTree; M/M/1/K FatTree; LogicalTree: Strict FC; LogicalTree: Loose FC.]

Figure 3.12: Per-port switch buffer size required to achieve given fraction of the network’s capacity. l = 3 levels, k = 12 ports, loss threshold = 10⁻³.
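For intuition about the M/M/1/K curves, the sketch below evaluates the standard blocking probability of a single M/M/1/K queue and bisects for the largest traffic intensity that stays within a loss threshold. It is a one-queue simplification with hypothetical function names; the FatTree and LogicalTree models combine several queues and are not reproduced here, so the numbers it prints are not the values plotted in Figure 3.11.

```python
def mm1k_loss(rho: float, k: int) -> float:
    """Blocking probability of one M/M/1/K queue at traffic intensity rho."""
    if abs(rho - 1.0) < 1e-12:
        return 1.0 / (k + 1)  # limiting value as rho approaches 1
    return (1.0 - rho) * rho**k / (1.0 - rho**(k + 1))


def mm1k_usable_capacity(k: int, loss_threshold: float,
                         tol: float = 1e-9) -> float:
    """Largest rho in [0, 1] whose blocking probability is <= loss_threshold."""
    if mm1k_loss(1.0, k) <= loss_threshold:
        return 1.0  # the threshold is loose enough to run at full load
    lo, hi = 0.0, 1.0
    while hi - lo > tol:  # loss is increasing in rho, so bisection applies
        mid = (lo + hi) / 2.0
        if mm1k_loss(mid, k) <= loss_threshold:
            lo = mid
        else:
            hi = mid
    return lo


K = 50  # queue size used for Figure 3.11
for threshold in (1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1):
    print(f"{threshold:g}: {mm1k_usable_capacity(K, threshold):.3f}")
```

Bisection works here because the blocking probability increases monotonically with the traffic intensity, so there is a single crossing point for each threshold.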

In Figure 3.12 we fix the loss threshold at 10⁻³ and we plot switch queue size as a function of offered load. This is useful because it shows the per-port buffering needed at switches to avoid exceeding the loss threshold when using a given fraction of the network’s capacity. Note that queue size is shown with a logarithmic scale.

This is because, according to M/M/1 queueing theory, the length of the queue will grow exponentially with the traffic intensity, and our FatTree and LogicalTree models show this. Our simulations for loose flow control (Poisson send process) also show exponential growth, although at a slower rate. This makes sense because, as explained in B.3, the rate of traffic on a link is constrained by the speed of the link. Since a switch queue is fed by a finite number of links (i.e. at most 11 since k = 12 ports), the arrival process at a switch queue is not truly Poisson. The figure also shows our simulated results for strict flow control (periodic send process) on both the FatTree and its equivalent logical tree. In this case, the growth rate is much smaller and, with the exception of the case using VLB routing, the required queueing stays well under 100 packets.
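The question Figure 3.12 answers can be sketched for a single queue: given an offered load and a loss threshold, find the smallest buffer that keeps loss within the threshold. The code below does this for one M/M/1/K queue by direct search. The function names are hypothetical, and the single-queue model is an assumption that stands in for, but does not reproduce, the FatTree and LogicalTree analysis.

```python
def mm1k_loss(rho: float, k: int) -> float:
    """Blocking probability of one M/M/1/K queue at traffic intensity rho."""
    if abs(rho - 1.0) < 1e-12:
        return 1.0 / (k + 1)  # limiting value as rho approaches 1
    return (1.0 - rho) * rho**k / (1.0 - rho**(k + 1))


def required_queue_size(rho: float, loss_threshold: float,
                        k_max: int = 1_000_000) -> int:
    """Smallest K (in packets) with blocking probability <= loss_threshold."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]")
    for k in range(1, k_max + 1):
        if mm1k_loss(rho, k) <= loss_threshold:
            return k
    raise ValueError("no queue size up to k_max meets the threshold")


LOSS_THRESHOLD = 1e-3  # threshold used for Figure 3.12
for rho in (0.80, 0.85, 0.90, 0.95, 1.00):
    print(f"{rho:.2f}: {required_queue_size(rho, LOSS_THRESHOLD)} packets")
```

The required buffer rises steeply as the offered load approaches 1, which is why a logarithmic queue-size axis is convenient for this kind of plot.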

There are several key points to take away from these results.

• First, with loose flow control the arrival processes at switches appear Poisson, making the benefit provided by improved load balancing largely irrelevant. This can be seen from the fact that, under loose flow control, both VLB and MP perform nearly the same as when we simulate the logical tree, where routing is not a factor. As we approach full offered load, these three curves converge and show the same scaling characteristics as the M/M/1/K models.

• Second, with strict flow control, the MP load balancer performs significantly better than VLB, approaching the performance of ideal routing (the LogicalTree: Strict FC case).

• Third, even under worst-case traffic, with loose flow control we can still expect to use over 90% of the network given a reasonable amount of buffering (e.g. 100 packets’ worth).

• Finally, we can expect these results to scale with higher-speed Ethernet links since M/M/1 queues depend only on the relative rates of arrivals and departures.
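The last point can be checked with a few lines of arithmetic: in an M/M/1/K queue the loss depends on the arrival and service rates only through their ratio, so scaling the link speed and the offered traffic together leaves the loss unchanged. The sketch below assumes a single queue, a buffer of K = 50 packets at every link speed, and a hypothetical mean packet size of 782 bytes (midway between 64 and 1500); the function names are illustrative.

```python
import math


def mm1k_loss(rho: float, k: int) -> float:
    """Blocking probability of one M/M/1/K queue at traffic intensity rho."""
    if abs(rho - 1.0) < 1e-12:
        return 1.0 / (k + 1)  # limiting value as rho approaches 1
    return (1.0 - rho) * rho**k / (1.0 - rho**(k + 1))


def link_loss(link_bps: float, utilization: float, k: int,
              mean_packet_bytes: float = 782.0) -> float:
    """Loss on a link of the given speed carrying the given utilization."""
    service_pps = link_bps / (8.0 * mean_packet_bytes)  # packets drained per second
    arrival_pps = utilization * service_pps             # packets offered per second
    return mm1k_loss(arrival_pps / service_pps, k)      # only the ratio matters


for gbps in (1, 10, 100):
    print(f"{gbps} Gb/s: {link_loss(gbps * 1e9, 0.90, 50):.3e}")

# The three printed values agree up to floating-point rounding.
assert math.isclose(link_loss(1e9, 0.90, 50), link_loss(100e9, 0.90, 50),
                    rel_tol=1e-9)
```

Note that this holds when the buffer is counted in packets; a buffer fixed in bytes would hold the same number of average-sized packets at any link speed, so the conclusion is the same under that reading.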

In document Delivering Consistent Network Performance in Multi-tenant Data Centers (Page 59-62)