Advanced Computer Networks
263-3501-00
Layer-7-Switching and Load Balancing
Patrick Stuedi, Qin Yin and Timothy Roscoe
Spring Semester 2015
© Oriana Riva, Department of Computer Science | ETH Zürich
Outline
• Last time
– Virtual machine networking
– Para-virtualization
– SR-IOV
– IOMMU
• Today
– Load balancing
– TCP Splicing
– Distributed load balancing
Challenge: accessing services
• Datacenters are designed to be scalable
– Datacenters are replicated
– Each has lots of machines
– Services span (and share) datacenters
So:
• What address does, e.g. www.search.ch resolve to?
• What entity does this address refer to?
• What does this entity do?
Requirements
• “Close by” datacenter
• Load balance across machines in a center
• Target machines where the user’s state is kept
• Accessed using TCP (HTTP, SSL, …)
Option 1: IP Anycast
• One IP address refers to multiple destinations
– BGP advertises multiple destinations
– Packets end up at the “nearest” destination to the source
• Problems:
– Anycast at the IP layer is only reliable for stateless protocols (UDP): all packets of a TCP flow must go to the same machine
– Pushing service location into BGP couples routing with end-system provisioning
• Used for DNS root server location
Requirements
• “Close by” datacenter
• Load balance across machines in a center
• Target machines where the user’s state is kept
• Accessed using TCP (HTTP, SSL, …)
• All packets of a TCP flow must go to the same machine
Recall DNS lookup
Option 2: DNS
• Insight: who says the answer is always the same?
• Idea: “smart” DNS server authoritative for service
• A query for, e.g., www.google.com or www.bing.com returns a different “A” record depending on:
– Source address of the browser machine
– Current state of the service (load, failures)
– A random number
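The selection logic of such a “smart” authoritative server might look roughly like the sketch below. Everything concrete in it is an assumption for illustration: the datacenter table, the load threshold of 0.8, the crude first-octet region lookup, and the function name `resolve_a_record` are all hypothetical, and the addresses come from the reserved documentation ranges.

```python
import random

# Hypothetical service state (illustrative values only).
DATACENTERS = {
    "eu": {"address": "192.0.2.10", "load": 0.40, "healthy": True},
    "us": {"address": "198.51.100.10", "load": 0.85, "healthy": True},
    "asia": {"address": "203.0.113.10", "load": 0.20, "healthy": False},
}

# Crude stand-in for a geolocation database: first octet of the source -> region.
REGION_BY_FIRST_OCTET = {"85": "eu", "12": "us", "110": "asia"}


def resolve_a_record(source_ip, rng=random):
    """Return the address to put in the A record for this query."""
    healthy = {name: dc for name, dc in DATACENTERS.items() if dc["healthy"]}
    if not healthy:
        raise RuntimeError("no healthy datacenter")

    # 1. Source address: prefer the nearby datacenter if it is up and not overloaded.
    region = REGION_BY_FIRST_OCTET.get(source_ip.split(".")[0])
    if region in healthy and healthy[region]["load"] < 0.8:
        return healthy[region]["address"]

    # 2. Otherwise pick at random, favouring lightly loaded datacenters.
    names = sorted(healthy)
    weights = [max(1.0 - healthy[name]["load"], 0.01) for name in names]
    choice = rng.choices(names, weights=weights, k=1)[0]
    return healthy[choice]["address"]


if __name__ == "__main__":
    print(resolve_a_record("85.1.2.3"))
```

The three inputs from the slide (source address, service state, random number) each appear as one step; a real deployment would replace the lookup tables with live measurements.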
DNS tricks
• One level of indirection
– Single DNS server returns different A records
• Additional level of indirection
– First service resolver returns CNAME
– Regional service resolver can be more specific
• Used for finding the nearest datacenter for a service
Using CNAMEs
DNS does not solve the problem
• Need an IP address for every instance of the service
– 100,000 machines
– 100,000 globally routable IP addresses: expensive!
• Machine fails
– Need to update DNS state
• DNS state changes rapidly
– Short TTL on queries
– Even higher load on DNS servers
• Slow to react to “hot spots” or other load skews
• Selection of machine can only be based on the address of the client's primary resolver
– Don't know which client this is
Next step: use 1 IP address
• Use Network Address Translation
• Hash source addresses to server machines
TCP three-way handshake
Stateless hashing
Hash(Source IP)
• Completely static
– No dynamic load balancing
Hash(Source IP, Source TCP port)
• Better, but still static
– Limited to 64k destinations per client machine
• Known as a “Layer-4 load balancer”
Basic problem: nothing else is known by the end of the handshake!
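The two stateless hashing variants can be sketched as follows. This is a minimal illustration, assuming a hypothetical list of four backend addresses and SHA-256 as the hash; a real switch would use a much cheaper hash in hardware.

```python
import hashlib

# Hypothetical backend pool (illustrative addresses).
SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]


def _bucket(key, n):
    """Map a string key to a stable bucket index in range(n)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


def pick_by_source_ip(src_ip):
    """Hash(Source IP): every connection from one client hits the same server."""
    return SERVERS[_bucket(src_ip, len(SERVERS))]


def pick_by_source_ip_and_port(src_ip, src_port):
    """Hash(Source IP, Source TCP port): one client's connections spread out."""
    return SERVERS[_bucket(f"{src_ip}:{src_port}", len(SERVERS))]


if __name__ == "__main__":
    print(pick_by_source_ip("203.0.113.7"))
    print({pick_by_source_ip_and_port("203.0.113.7", p) for p in range(40000, 40010)})
```

Both functions look only at fields available in the SYN segment, which is exactly the limitation the slide points out: the choice is fixed before any application data is seen.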
Why is static hashing bad?
• Machine failure/upgrade/provisioning
– Can’t update hash function efficiently in switch
• Load balancing
– Can’t avoid a heavily-loaded machine
• Lack of Locality
– Resource being accessed
– Client accessing the resource
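To see why a machine failure is painful for a static hash, the sketch below counts how many flow keys change server when the pool shrinks from 4 to 3 machines under a plain modulo hash. The key format and pool sizes are made up for illustration. For a uniform hash, a key keeps its bucket only if h mod 4 equals h mod 3, which holds for 3 of every 12 hash values, so roughly three quarters of the keys are expected to move.

```python
import hashlib


def bucket(key, n):
    """Stable modulo-n bucket for a string key."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


def fraction_remapped(keys, n_before, n_after):
    """Fraction of keys whose bucket changes when the pool size changes."""
    moved = sum(1 for k in keys if bucket(k, n_before) != bucket(k, n_after))
    return moved / len(keys)


if __name__ == "__main__":
    # Hypothetical (source IP, source port) flow keys.
    flows = [f"198.51.100.{i % 256}:{40000 + i}" for i in range(1000)]
    print(f"remapped: {fraction_remapped(flows, 4, 3):.0%}")
```

Every remapped key belongs to a connection that would suddenly land on a machine with no state for it.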
What else might we want to hash on?
HTTP Host: header
• Introduced in HTTP/1.1, where it is mandatory
• Hosting providers need to switch based on virtual host, not physical host
– Different services have different virtual hosts
– Avoids replicating all service state everywhere
Switching on URL
• Locality:
– Allows state to be partitioned across machines
• Isolation:
– Rare, computationally intensive URLs can be sequestered
– Sensitive data can be kept on more expensive, audited machines
Hashing on cookies
• Enables partitioning of servers by
– User state
– Session state
• Critical for scaling online services to billions of users
– No need to share state
– No need to synchronize state
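The three layer-7 keys discussed so far (Host header, URL, cookie) can all be extracted from the head of the first HTTP request. The sketch below shows one possible precedence order. The cookie name `session`, the `/reports/` prefix, and the order itself are assumptions chosen for illustration, not part of any standard or product.

```python
def parse_request_head(raw):
    """Split an HTTP/1.1 request head into (method, path, headers)."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    request_line, *header_lines = head.split("\r\n")
    method, path, _version = request_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, headers


def cookie_value(headers, name):
    """Return the value of one cookie from the Cookie header, or None."""
    for part in headers.get("cookie", "").split(";"):
        key, _, value = part.strip().partition("=")
        if key == name:
            return value
    return None


def routing_key(raw, session_cookie="session"):
    """Choose what to hash on: cookie first, then URL prefix, then Host."""
    _method, path, headers = parse_request_head(raw)
    session = cookie_value(headers, session_cookie)
    if session is not None:
        return ("cookie", session)       # keep a user's state on one machine
    if path.startswith("/reports/"):     # hypothetical expensive URL prefix
        return ("url", "/reports/")      # sequester costly requests
    return ("host", headers.get("host", ""))  # fall back to the virtual host


if __name__ == "__main__":
    req = (b"GET /index.html HTTP/1.1\r\nHost: www.example.com\r\n"
           b"Cookie: theme=dark; session=abc123\r\n\r\n")
    print(routing_key(req))
```

The returned key would then be fed into a hash or lookup table to select the server.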
How to do it?
• Problem:
– Don’t know the hash key until after the HTTP request
– Typically the first segment after the three-way handshake
• Solution:
– Don’t establish connection to server until client has sent HTTP request
Late-binding of TCP connection
[Figure: time-sequence diagram with Client, Switch, and Server (label: Port = 3620). The client sends TCP connection setup + HTTP GET to the switch; the switch then sends TCP connection setup + HTTP GET to the server; the HTTP response travels from the server via the switch back to the client. Acks not shown.]
Late-binding: Naïve implementation (SOCKS protocol)
Inefficient: switch needs to copy data between the connections!
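The naïve approach can be sketched as a user-level relay: terminate the client connection, read the request, pick a server, open a second connection, and copy bytes between the two. This is a simplified single-threaded illustration, not the SOCKS protocol itself. The function names and the `pick_backend` / `connect` callbacks are hypothetical, and only the response direction is relayed after the request head has been replayed.

```python
import socket


def read_request_head(sock, limit=65536):
    """Read from the client until the blank line ending the HTTP request head."""
    data = b""
    while b"\r\n\r\n" not in data and len(data) < limit:
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    return data


def relay(src, dst, bufsize=4096):
    """Copy bytes from src to dst until EOF; this copy is the inefficiency."""
    total = 0
    while True:
        chunk = src.recv(bufsize)
        if not chunk:
            return total
        dst.sendall(chunk)
        total += len(chunk)


def late_bind(client, pick_backend, connect):
    """Bind the client connection to a server only after seeing the request."""
    head = read_request_head(client)   # client-side connection already exists
    backend = pick_backend(head)       # the layer-7 decision happens here
    server = connect(backend)          # second, independent TCP connection
    server.sendall(head)               # replay the buffered request
    copied = relay(server, client)     # every response byte passes through us
    return backend, copied


if __name__ == "__main__":
    # Loopback demonstration with socket pairs standing in for TCP connections.
    client_app, client_side = socket.socketpair()
    server_side, server_app = socket.socketpair()
    client_app.sendall(b"GET / HTTP/1.1\r\nHost: a.example\r\n\r\n")
    server_app.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    server_app.shutdown(socket.SHUT_WR)
    print(late_bind(client_side, lambda head: "backend-1", lambda b: server_side))
```

Each byte is received into user space and sent again, which is the per-byte cost that motivates splicing the two connections instead.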
TCP Splicing
• Proposed around 1997 by Maltz & Bhagwat at IBM
• Key idea:
– Take two established TCP connections and splice them
– Transfer segments unmodified between them
– Remap port numbers and sequence numbers on the fly
• Advantages:
– Very simple calculation per packet
– Not much state to maintain per spliced connection
– No segmentation/reassembly
– No buffering/copying
Splicing pseudocode (from Maltz & Bhagwat)
[Pseudocode figure not reproduced. Annotations:]
• Queue packets received from the server
• Splice connections, but allow for a final 'n' bytes to be transmitted to the client before splicing
• 'n'-byte message signaling the completion of the splicing operation
What state is needed?
For each packet, need to do the following:
• IP header operations:
– Rewrite source and destination IP addresses
– Update IP header checksum
• TCP header operations:
– Rewrite source and destination port numbers
– Apply fixed offset to sequence number
– Apply fixed offset to acknowledgement number
– Update TCP header checksum
• The rewrite values are calculated from existing connection state when the splice occurs
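The per-packet work listed above can be sketched for the server-to-client direction as follows. This is an illustration under stated assumptions, not the pseudocode from Maltz & Bhagwat: the class and field names are invented, checksum updates are left out, and the offsets assume that at splice time everything received on one connection has already been forwarded on the other. Connection A is client-to-switch and connection B is switch-to-server.

```python
from dataclasses import dataclass

MOD32 = 1 << 32  # TCP sequence numbers wrap at 2**32


@dataclass(frozen=True)
class Segment:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    seq: int
    ack: int


@dataclass(frozen=True)
class SpliceRule:
    """Fixed per-direction state: new addresses, new ports, two offsets."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    seq_offset: int
    ack_offset: int

    def apply(self, seg):
        """Rewrite one segment; checksum updates are omitted in this sketch."""
        return Segment(
            self.src_ip, self.dst_ip, self.src_port, self.dst_port,
            (seg.seq + self.seq_offset) % MOD32,
            (seg.ack + self.ack_offset) % MOD32,
        )


def server_to_client_rule(switch_ip, client_ip, switch_port, client_port,
                          a_snd_nxt, a_rcv_nxt, b_snd_nxt, b_rcv_nxt):
    """Derive the rule from the switch's state on both connections at splice time."""
    return SpliceRule(
        switch_ip, client_ip, switch_port, client_port,
        seq_offset=(a_snd_nxt - b_rcv_nxt) % MOD32,  # server seq space -> A's seq space
        ack_offset=(a_rcv_nxt - b_snd_nxt) % MOD32,  # B's seq space -> client seq space
    )


if __name__ == "__main__":
    rule = server_to_client_rule("192.0.2.1", "203.0.113.7", 80, 3620,
                                 a_snd_nxt=5000, a_rcv_nxt=1200,
                                 b_snd_nxt=700, b_rcv_nxt=9000)
    print(rule.apply(Segment("10.0.0.5", "10.0.0.1", 8080, 40000, seq=9000, ack=700)))
```

The rule is six constants per direction, which matches the slide's claim that little state is needed per spliced connection.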
It’s easy to do in hardware
• A10 AX Application Delivery Controller
• Advanced layer 4 / layer 7 server load balancing
• HTTP Proxy
• Layer 7 URL and URL hash switching
• Comprehensive Layer 7 application persistence support
• Load balancing methods:
– Round Robin, Least Connections, Weighted Round Robin, Weighted Least Connections, Fastest Response
• Aggregated throughput: up to 115 Gbps
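Three of the load balancing methods named above are simple enough to sketch in a few lines. These are generic textbook versions with invented function names and server labels; they are not taken from the A10 product and say nothing about how it implements them.

```python
import itertools


def round_robin(servers):
    """Yield servers in a fixed rotation."""
    return itertools.cycle(servers)


def weighted_round_robin(weights):
    """Yield each server as many times per cycle as its integer weight."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def least_connections(active):
    """Pick the server with the fewest active connections (ties: lowest name)."""
    return min(sorted(active), key=lambda name: active[name])


if __name__ == "__main__":
    rr = round_robin(["a", "b", "c"])
    print([next(rr) for _ in range(4)])
    wrr = weighted_round_robin({"a": 2, "b": 1})
    print([next(wrr) for _ in range(6)])
    print(least_connections({"a": 5, "b": 2, "c": 2}))
```

Unlike the static hashes earlier, Least Connections needs per-server state that changes with every connection, which is one reason such methods live in a dedicated box.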
Problems of single-box load balancing
• Expensive!
• Scale-up
– Buy bigger (more expensive) load balancer when reaching capacity
Ananta: Load balancing in Windows Azure
• Windows Azure: Microsoft's cloud computing platform
– Compute, Storage, Databases, etc. in the cloud
• Ananta: Distributed, scalable load balancing running on hosts in a datacenter
– Lower cost
– Scale on demand
Background: Windows Azure load balancing
• Clients connect to service using a virtual IP (VIP)
• Load balancer (LB) load balances traffic to specific server machines using a direct IP (DIP)
Background: Windows Azure load balancing (2)
• Load balancer is also used when two services communicate within the same data center
Ananta: Inbound traffic
[Figure: inbound packet path, steps 1 to 8; components shown include the Ananta Manager. Step labels, in order:]
• Spread packet to MUX using ECMP
• Look up the VIP-to-DIP mapping
• Tunnel packet to DIP
• De-capsulate and forward to DIP
• Encapsulate response
• Forward to router (bypass MUX)
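The first few steps of the inbound path can be sketched as three small functions: the router spreads flows over MUXes, a MUX maps the VIP to a DIP and tunnels the packet, and the host removes the outer header. All names, addresses, and the dictionary-based packet representation are hypothetical; the real system is described in the Ananta paper.

```python
import hashlib

# Hypothetical deployment state (illustrative values only).
MUXES = ["mux-1", "mux-2", "mux-3"]
VIP_TO_DIPS = {"203.0.113.80": ["10.1.0.4", "10.1.0.5"]}


def flow_hash(five_tuple):
    """Stable hash of a flow's 5-tuple, so all its packets take the same path."""
    text = "|".join(str(field) for field in five_tuple)
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def five_tuple_of(packet):
    return (packet["src_ip"], packet["src_port"],
            packet["dst_ip"], packet["dst_port"], packet["proto"])


def ecmp_pick_mux(packet):
    """Router: spread packets over the MUX pool using ECMP-style flow hashing."""
    return MUXES[flow_hash(five_tuple_of(packet)) % len(MUXES)]


def mux_forward(packet):
    """MUX: look up the VIP-to-DIP mapping and tunnel the packet to the DIP."""
    dips = VIP_TO_DIPS[packet["dst_ip"]]
    dip = dips[flow_hash(five_tuple_of(packet)) % len(dips)]
    return {"outer_dst": dip, "inner": packet}  # encapsulation: original kept intact


def host_decapsulate(tunneled):
    """Host: strip the outer header and hand the original packet to the DIP."""
    return tunneled["outer_dst"], tunneled["inner"]


if __name__ == "__main__":
    pkt = {"src_ip": "198.51.100.9", "src_port": 51512,
           "dst_ip": "203.0.113.80", "dst_port": 443, "proto": "tcp"}
    print(ecmp_pick_mux(pkt), host_decapsulate(mux_forward(pkt))[0])
```

Because the DIP choice depends only on the flow hash and the shared mapping, any MUX that receives a packet of the same flow makes the same decision in this sketch.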
Summary
• IP Anycast: select a DNS root server
• Dynamic DNS: locate nearby data centers
• Layer-4-switching: balance connections across machines
• TCP splicing: seamlessly join two connections
• Layer-7-switching: use splicing to late-bind servers to HTTP connections
• Ananta: distributed load balancing
References
• “Host Anycasting Service”, C. Partridge, T. Mendez, W. Milliken, Internet RFC 1546, November 1993.
• “TCP Splicing for Application Layer Proxy Performance”, David A. Maltz and Pravin Bhagwat, IBM Research Report 21139 (Computer Science/Mathematics), IBM Research Division, 1998.
• “Ananta: Cloud Scale Load Balancing”, SIGCOMM 2013.