Small is Better: Avoiding Latency Traps in Virtualized Data Centers

(1)

SOCC 2013

Yunjing Xu, Michael Bailey, Brian Noble, Farnam Jahanian

University of Michigan


(2)

Outline

  Introduction

  Related Work

  Source of latency

  Design and Implementation

  Evaluation

  Conclusion

(3)

Introduction

  Public clouds have become a popular platform for building Internet-scale applications

  Applications built with public clouds are often highly sensitive to response time

  Large data centers have become the cornerstone of modern, Internet-scale Web applications

  Unlike dedicated data centers, a public cloud relies on virtualization both to hide the details of the underlying host infrastructure and to support multi-tenancy

(4)

Related Work

  Kernel-centric

  modify the operating system (OS) kernel to deploy new TCP congestion control algorithms. DCTCP, HULL

  Application-centric

  applications must be modified to tag the packets they generate with scheduling hints. D³, D²TCP, DeTail, PDQ, and pFabric

  Operator-centric

  require operators to change their application deployment. Bobtail

(5)

Host-centric

  It does not require or trust guest cooperation, and it only modifies the host infrastructure controlled by cloud providers.

  The existing host-centric method: EyeQ

  it mainly focuses on bandwidth sharing in the cloud

  requires feedback loops between hypervisors

  needs explicit bandwidth headroom to reduce latency

  only solves one of the three latency problems addressed by our solution.

(6)

Sources of latency

  VM scheduling delay

  the server VM cannot process the packets until scheduled by the hypervisor

  Host network queueing delay

  the response packets first go through the host network stack, which processes I/O requests on behalf of all guest VMs

  Switch queueing delay

  response packets on the wire may experience switch queueing delay on congested links

(7)

Sources of latency

(8)

EC2 and the Xen hypervisor

  Prior measurement studies find that virtualization and multi-tenancy are key to EC2’s performance variation

  Xen runs on bare metal hardware to manage guest VMs

  A privileged guest VM called dom0 is used to fulfill I/O requests for non-privileged guest VMs

(9)

EC2 measurements

--VM scheduling delay and switch queueing delay

(10)

EC2 measurements

--VM scheduling delay and switch queueing delay

(11)

Testbed experiments

--Host network queueing delay

  Byte Queue Limits (BQL)

  A new device driver interface

  CoDel

  If queued packets have already spent too much time in the queue, the upper layer of the network stack is notified to slow down
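The CoDel bullet above can be illustrated with a small sketch. This is a simplified sojourn-time check, not the Linux CoDel implementation: the class name `SojournMonitor`, the 5 ms target, and the 100 ms interval are illustrative assumptions, and the real algorithm reacts by dropping packets according to a control law rather than returning a flag.

```python
from collections import deque

TARGET_S = 0.005    # assumed acceptable time in queue (5 ms)
INTERVAL_S = 0.100  # assumed window the delay must persist for (100 ms)


class SojournMonitor:
    """Simplified CoDel-style check: signal "slow down" when packets keep
    spending more than `target` seconds in the queue for a full `interval`."""

    def __init__(self, target=TARGET_S, interval=INTERVAL_S):
        self.target = target
        self.interval = interval
        self.queue = deque()      # entries are (enqueue_time, packet)
        self.first_above = None   # time after which persistent delay is signalled

    def enqueue(self, packet, now):
        self.queue.append((now, packet))

    def dequeue(self, now):
        """Return (packet, slow_down) for the packet at the head of the queue."""
        enqueue_time, packet = self.queue.popleft()
        sojourn = now - enqueue_time
        if sojourn < self.target:
            # Delay is acceptable again, so forget any earlier excursion.
            self.first_above = None
            return packet, False
        if self.first_above is None:
            # First packet over target: start the observation window.
            self.first_above = now + self.interval
            return packet, False
        # Delay has stayed over target; signal once the window has elapsed.
        return packet, now >= self.first_above
```

Timestamps are passed in explicitly so the behaviour is deterministic and easy to trace by hand.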

[Figure: packets pass from the kernel network stack to the NIC transmission queue]

(12)

Testbed experiments

--Host network queueing delay

[Figure: testbed topology. VMs A1, A2, B1, and C1 run on three physical machines connected by a switch; one flow is a ping and another causes congestion.]

(13)

  Congestion Free: A1 and B1 are left idle.

  Congestion Enabled: A1 sends bulk traffic to B1 without BQL or CoDel.

  Congestion Managed: A1 sends bulk traffic to B1 with BQL and CoDel enabled.

Testbed experiments

--Host network queueing delay

(14)

  While BQL and CoDel can significantly reduce the latency, the result is still four to six times as large when compared to the baseline

Testbed experiments

--Host network queueing delay

(15)

Summary

  Switch queueing delay increases network tail latency by over 10 times; together with VM scheduling delay, it becomes more than 20 times as bad

  Host network queueing delay also worsens the flow completion time (FCT) tail by four to six times.

(16)

Design and Implementation

  Principle I: Not trusting guest VMs

  Principle II: Shortest remaining time first

  Principle III: No undue harm to throughput

(17)

VM scheduling delay

  Credit Scheduler is currently Xen’s default VM scheduler

  Each guest VM has at least one VCPU

  Credits are redistributed in 30ms intervals

  Each VCPU is in one of three states

  OVER

  UNDER

  BOOST

(18)

VM scheduling delay

  The BOOST mechanism only prioritizes VMs over others in UNDER or OVER states

  BOOSTed VMs cannot preempt each other, and they are round-robin scheduled

  We deploy a more aggressive VM scheduling policy to allow BOOSTed VMs to preempt each other

  Xen has a rate limit mechanism that maintains overall system throughput by preventing preemption when the running VM has run for less than 1ms in its default setting.
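The policy change on this slide can be sketched as a preemption decision. This is a toy model under stated assumptions, not Xen's credit scheduler code: the function `should_preempt`, the flag `boost_preempts_boost`, and the numeric priority values are hypothetical; only the three state names and the 1 ms default rate limit come from the slides.

```python
from enum import IntEnum


class Prio(IntEnum):
    # Illustrative ordering only; higher value means higher priority.
    OVER = 0
    UNDER = 1
    BOOST = 2


RATELIMIT_MS = 1.0  # default from the slide: no preemption before 1 ms of runtime


def should_preempt(running_prio, running_ran_ms, waking_prio,
                   boost_preempts_boost):
    """Decide whether a waking VCPU may preempt the running one.

    boost_preempts_boost=False models the stock policy described above;
    True models the more aggressive policy the slide proposes.
    """
    # The rate limit protects throughput: a VCPU that has just started
    # running is never preempted, whatever the priorities are.
    if running_ran_ms < RATELIMIT_MS:
        return False
    # BOOST already outranks UNDER and OVER under both policies.
    if waking_prio > running_prio:
        return True
    # The only difference between the policies: BOOST versus BOOST.
    return (boost_preempts_boost
            and waking_prio == Prio.BOOST
            and running_prio == Prio.BOOST)
```

For example, a BOOSTed VCPU waking while another BOOSTed VCPU has run for 5 ms waits its turn under the stock policy and preempts under the new one.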

(19)

Host network queueing delay

  From Table 2, even when BQL and CoDel are both enabled, the queueing delay is still four to six times the baseline

  The root cause is that requests are often too large and hard to preempt.

  Our solution is to break large jobs into smaller ones to allow CoDel to conduct fine-grained packet scheduling.
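A minimal sketch of the "break large jobs into smaller ones" step follows. It only shows the splitting itself; the function name `segment` and the 1448-byte default segment size are assumptions for illustration, and the actual change lives in the CoDel kernel module rather than in user-space code like this.

```python
from typing import List


def segment(payload: bytes, max_segment: int = 1448) -> List[bytes]:
    """Split one large buffer into pieces of at most `max_segment` bytes,
    so a scheduler can interleave other traffic between the pieces."""
    if max_segment <= 0:
        raise ValueError("max_segment must be positive")
    return [payload[i:i + max_segment]
            for i in range(0, len(payload), max_segment)]
```

With the assumed default, a 64,000-byte buffer becomes 45 segments, so a small packet queued behind it waits for at most one segment instead of the whole buffer.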

(20)

Host network queueing delay

  A packet sent out by guest VMs first goes to the frontend, which copies the packet to the backend

  Xen’s backend announces to the guest VMs that hardware segmentation offload is not supported, so guests have to segment the packets before copying them

[Figure: packet path from the frontend in the guest VM to the backend in dom0, then to the NIC]

(21)

Switch queueing delay

  Letting switches favor small flows when scheduling packets

  Define a flow as the collection of any IP packets

  Define flow size as the instantaneous size of the message the flow is carrying

  In reality, it is also difficult to define message boundaries. Thus, we measure flows by rate as an approximation of message semantics

  We classify flows into two classes, small and large
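One way to approximate "small versus large by rate" is a per-flow token bucket, sketched below. The slides do not say which rate meter is used, so the token bucket, the class name `FlowClassifier`, the opaque flow key, and the rate and burst parameters are all assumptions made for illustration.

```python
class FlowClassifier:
    """Classify each packet's flow as "small" or "large" by recent send rate.

    Every flow gets a token bucket that refills at `rate_bytes_per_s` up to
    `burst_bytes`. A flow is "small" while its packets fit in the bucket.
    """

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.state = {}  # flow key -> (tokens, time of last update)

    def classify(self, flow, size, now):
        tokens, last = self.state.get(flow, (self.burst, now))
        # Refill for the time elapsed since the flow was last seen.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if size <= tokens:
            self.state[flow] = (tokens - size, now)
            return "small"
        # Over budget: the flow is sending faster than the threshold rate.
        self.state[flow] = (tokens, now)
        return "large"
```

A flow that sends a short burst and then goes quiet stays "small", while a bulk transfer drains its bucket and is reclassified as "large" until its rate drops.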

(22)

Switch queueing delay

  First, we build a monitoring and tagging module in the host that sets priority on outgoing packets. Small flows are tagged as high-priority

  Second, we need switches that support basic priority queueing
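The "basic priority queueing" the switch needs can be modelled as strict priority between two queues. This is a toy model of one output port, with the hypothetical class name `StrictPriorityPort`; it is not switch firmware or any vendor's QoS configuration.

```python
from collections import deque


class StrictPriorityPort:
    """One switch output port with two FIFO queues and strict priority:
    the low-priority queue is served only when the high one is empty."""

    def __init__(self):
        self.high = deque()  # packets tagged as belonging to small flows
        self.low = deque()   # everything else

    def enqueue(self, packet, high_priority):
        (self.high if high_priority else self.low).append(packet)

    def dequeue(self):
        """Return the next packet to transmit, or None if the port is idle."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None
```

A tagged packet that arrives behind a backlog of bulk packets is transmitted next, which is the effect the host-side tagging relies on.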

(23)

Putting it all together

  We change a single line in the credit scheduler of Xen 4.2.1 to enable the new scheduling policy

  We also modify the CoDel kernel module in Linux 3.6.6 with about 20 lines to segment large packets in the host

  We augment Xen’s network backend with about 200 lines of changes to do flow monitoring and tagging

(24)

Evaluation

  The testbed consists of five four-core physical machines running Linux 3.6.6 and Xen 4.2.1.

  They are connected to a Cisco Catalyst 2970 switch

(25)

Evaluation

  A1 serves small responses to E1

  A2 sends bulk traffic to B1 (Host network queueing delay)

(26)

Evaluation

  Our solution achieves about a 40% reduction in mean latency

  The reduction is over 56% for the 99th percentile, and almost 90% for the 99.9th percentile.

(27)

VM scheduling delay

  Use E1 to query A1 for small responses

  Keep A3 running a CPU-bound workload

[Figure: testbed topology with A1, A3, and E1 connected through the switch]

(28)

VM scheduling delay

  Our new policy reduces latency at the 99.9th percentile by 95% at the cost of 3.8% reduction in CPU throughput

(29)

Host network queueing delay

  A1 pings E1 once every 10 ms to measure round-trip time

  A2 saturates B1’s access link with iperf

[Figure: testbed topology with A1, A2, E1, and B1 connected through the switch]

(30)

Host network queueing delay

  Our solution can yield an additional 50% improvement at both the body and tail of the distribution

(31)

Switch queueing delay

  E1 queries A1 and B1 in parallel for small flows

  E2 queries C1 and D1 for large flows to congest the access link

[Figure: testbed topology with A1, B1, C1, D1, E1, and E2 connected through the switch]

(32)

Switch queueing delay

  When QoS support on the switch is enabled to recognize our tags, all small flows enjoy low latency, with an order of magnitude improvement at both the 99th and 99.9th percentiles

  On the other hand, the average throughput loss for large flows is less than 3%

(33)

Conclusion

  We design a host-centric solution that extends the classic shortest remaining time first scheduling policy from the virtualization layer, through the host network stack, to the network switches