Network Virtualization and
Data Center Networks
263-3825-00
DC Virtualization Basics – Part 3
Qin Yin
Outline
•
A Brief History of Distributed Data Centers
•
The Case for Layer 2 Extension
•
Layer 2 Extension
–
Over optical Connections
• Virtual PortChannels
• Fabric Path
–
Over MPLS
• Ethernet over MPLS (EoMPLS)
• Virtual Private LAN Service (VPLS)
–
Over IP
• MPLS over GRE
Virtual PortChannel Summary
Virtualization characteristics: Virtual PortChannel
Emulation: Single Ethernet switch
Type: Pooling
Subtype: Homogeneous
Scalability: Two physical switches
Technology area: Networking
Subarea: Data and control plane virtualization
Distributed Data Centers
•
Data center interconnect (DCI)
–
Many physical sites, one logical data center
•
Business goals
–
Seamless workload mobility
–
Business continuity
–
Pool and maximize global resources
–
Distributed applications
•
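Defining two metrics for each application environment
–
Recovery Point Objective (RPO): the maximum tolerable period of time in which data can be lost from an IT service
–
Recovery Time Objective (RTO): the maximum tolerable time to restore an IT service after a disruption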
From a networking perspective
•
Connected through Layer 3 routing
–
Decreases fate sharing in distributed data center
–
Isolates each site from remote network
instabilities
•
Extending Layer 2 domains
The Cold Age (Mid-1970s to 1980s)
•
Computer rooms housing mainframe systems
•
Applications based on batch processing
•
RPO and RTO could span days or even weeks
•
Recovery technologies: data backup and retrieval
–
Data: stored on tapes
–
Connectivity: physical transport of tapes to the
backup site
–
Cold-standby and warm-standby
The Hot Age (1990s to Mid-2000s)
•
Internet booms and the advent of electronic business
–
The need of real-time response
–
Recovery technologies focused on service availability
–
RPO and RTO confined to hours, or even minutes
•
Geographic clusters (geocluster)
–
Application servers installed on
at least two
geo-separated sites
–
Active node failure triggers
automatic switchover
to standby
node
–
Generally require data replication to the
hot-standby site
• Synchronous replication (tens of kilometers apart to avoid latency issues)
• Asynchronous replication (data periodically copied from primary to
Geocluster
•
Different types of
geocluster communication
– Heartbeat communication
– Application state information (such as cached data for database servers)
– Client traffic (especially when nodes share the same virtual IP address)
The Active-Active Age (Mid-2000s -)
•
In hot-standby site, hardware and software resources
– Used in case of major failure at the main site
– Activated for a small amount of time per year
– Some critical applications (RPO of 0, RTO of seconds)
•
Active-active design
to avoid resource waste
– No luxury of unused sites
– Deploy several active nodes dispersed over multiple data centers
– Server and storage virtualization to provide automatic and quick workload mobility between sites
•
Challenges of scalability and flexibility
Requirements of Layer 2 Extension
• Heartbeat and connection state communications are usually directed to
multiple destinations
– Broadcast, multicast, or unknown unicast Ethernet frames (flooding)
• Active and standby nodes usually share the same virtual IP and MAC
address
– To facilitate traffic handling in the case of failure
• Server migration
– Applications do not support IP readdressing
– Generates painstakingly complex operations
• Data center expansion
– A data center has reached a physical limitation
– A company hires a colocation service from an outsourcing data center
As a result, standard Layer 2 connections are deployed to provide extended VLANs over multiple data centers.
Challenges of Layer 2 Extension
•
Flooding and broadcast
– Loops over the Layer 2 extensions can be easily formed
•
A spanning tree instance spanning multiple sites presents
formidable challenges
– Scalability: recommended STP diameter is 7
– Isolation: reconvergence will affect VLANs within one STP instance
– Multihoming: multiple DCI links will not be used for data communication
Challenges of Layer 2 Extension (cont.)
•
Tromboning can be formed
between data centers
– Non-optimal internal routing
within extended VLANs
– Cause for DCI resource waste:
uncontrolled state of an
active-standby pair of devices
•
Data confidentiality
– Mandates strict forms of
encryption in data center interconnect to minimize the risk of data leakage
Traditional Layer 2 VPNs
Dark Fiber
VPLS
Ethernet Extensions over Optical
Connections
•
Optical connections
–
Distance: less than a few hundred kilometers
•
Dark fiber
–
Fiber-optic pair to connect networking devices
–
Wavelength-division multiplexing (WDM) to increase
transport capacity
•
Coarse WDM: multiplex eight optical carrier signals
•
Dense WDM: aggregate a higher number (128, for example)
–
Dark fiber and WDM
•
Communication solutions belonging to Layer 1 (physical)
•
Can transport any data-link protocol including Ethernet
Spanning Tree Protocol
•
STP does not allow
Ethernet traffic on all
the links between DCI
switches
•
STP instance
–
Is spread over both sites
–
Sharing any internal
topology change or
reconvergence
STP and Link Utilization
•
STP wastes inter-switch
Link Aggregation
•
STP only detects one logical
interface
•
Traffic destined to this
interface is load balanced
among the active physical
links that are part of the
channel
•
This virtual interface is
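The load balancing described above can be sketched as a hash over flow fields. This is a minimal illustration, not any vendor's algorithm: the function name `pick_member_link`, the choice of source and destination MAC as the hash key, and the use of CRC32 are all assumptions made for the example.

```python
import zlib


def pick_member_link(src_mac: str, dst_mac: str, active_links: list[str]) -> str:
    """Map one flow onto one active physical link of the channel.

    Hashing on flow fields keeps every frame of a flow on the same link,
    which avoids reordering, while different flows spread over all links.
    """
    if not active_links:
        raise ValueError("port channel has no active member links")
    key = f"{src_mac.lower()}-{dst_mac.lower()}".encode()
    return active_links[zlib.crc32(key) % len(active_links)]


# Hypothetical four-link channel; STP sees it as one logical interface.
links = ["Eth1/1", "Eth1/2", "Eth1/3", "Eth1/4"]
chosen = pick_member_link("00:00:00:00:00:0a", "00:00:00:00:00:0b", links)

# If the chosen link fails, the flow is rehashed over the remaining links.
remaining = [link for link in links if link != chosen]
rehashed = pick_member_link("00:00:00:00:00:0a", "00:00:00:00:00:0b", remaining)
```

Because only the list of active links changes on a failure, the channel stays up and STP does not need to reconverge.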
Virtual PortChannel
•
Eliminate STP blocked
ports
•
Uses all available uplink
bandwidth
•
Allow a single device to
use a port channel across
two upstream switches
•
Dual-homed server
operate in active-active
mode
•
Provide fast convergence
Virtual PortChannels on a Layer 2
Extension
•
Virtual PortChannels
– Transform multiple Ethernet
links into a single-switch STP connection
•
Benefits
– Multihoming is enabled – all
links are being used (and load balanced)
– Spanning tree topology is
simplified – only one
connection between sites
– If vPC peer switch feature is
deployed, a device failure will not result in reconvergence
Virtual PortChannels in Multipoint
Data Center Connections
•
Problem
– vPCs can form a logical
looped topology
•
Solution: hub-and-spoke
– Deploy disjoint STP instances
per site
– STP isolation is enabled on all DCI switches
– Avoid loops in the Layer 2
Traditional Layer 2 VPNs
Dark Fiber
VPLS
MPLS Labels and Packets
•
Provides packet
forwarding based on
labels
•
Layer 2.5 technology
•
Header fields
–
Label value
–
Experimental (Exp)
• To define QoS classes in
MPLS networks
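The header fields can be made concrete by packing one 32-bit label stack entry. The slide lists only the label value and Exp fields; the bottom-of-stack (S) bit and the TTL in this sketch come from the standard MPLS label stack encoding, and the names `LabelEntry`, `pack_entry`, and `unpack_entry` are illustrative.

```python
import struct
from typing import NamedTuple


class LabelEntry(NamedTuple):
    label: int    # 20-bit label value
    exp: int      # 3-bit Experimental field (QoS class)
    bottom: bool  # S bit: True on the last entry of the label stack
    ttl: int      # 8-bit time to live


def pack_entry(entry: LabelEntry) -> bytes:
    """Encode one label stack entry as 4 bytes in network byte order."""
    if not 0 <= entry.label < 2**20:
        raise ValueError("label must fit in 20 bits")
    if not 0 <= entry.exp < 8:
        raise ValueError("exp must fit in 3 bits")
    if not 0 <= entry.ttl < 256:
        raise ValueError("ttl must fit in 8 bits")
    word = (entry.label << 12) | (entry.exp << 9) | (int(entry.bottom) << 8) | entry.ttl
    return struct.pack("!I", word)


def unpack_entry(data: bytes) -> LabelEntry:
    """Decode the first 4 bytes of data as a label stack entry."""
    (word,) = struct.unpack("!I", data[:4])
    return LabelEntry(word >> 12, (word >> 9) & 0x7, bool((word >> 8) & 1), word & 0xFF)


entry = LabelEntry(label=72, exp=0, bottom=True, ttl=2)
assert unpack_entry(pack_entry(entry)) == entry
```

Stacking labels, which the following slides rely on, is just concatenating several such entries with the S bit set only on the last one.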
MPLS Basics
•
Protocol flexibility
–
Comes from the capability of stacking labels
•
MPLS services
–
Traffic engineering
• Configures and defines unidirectional tunnels using a tunnel label
• Overrides routing protocol decisions
–
Layer 3 virtual private networks
• Connect different VPNs
• Inner label: VPN
–
Any transport over MPLS (AToM)
• Transport of Layer 2 frames
• Inner label: virtual circuit
• Example: EoMPLS
MPLS Network
• Forwarding Equivalence Class (FEC)
– MPLS packets sharing the same
label
• Two types of routers
– Label Edge Router (LER)
– Label Switch Router (LSR)
• MPLS router elements
– A loopback interface
• To improve reachability
– A routing protocol
• To advertise connected subnets
– LDP: Label Distribution Protocol
• To enable device discovery and label
distribution
EoMPLS Configuration
•
In essence
– Within MPLS network
• Encapsulates Ethernet frames
within MPLS packets
– At the egress of MPLS
network
• Transported, de-capsulated
and delivered as they were
•
MPLS label stack
– Tunnel Label routing from
ingress to egress LER
– VC Label identifying virtual
circuit within tunnel
•
Pseudowire
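The two-label stack can be sketched as follows. This is a simplified illustration under stated assumptions: the optional control word is omitted, the tunnel label TTL of 255 is an arbitrary choice, and the function names are hypothetical. The VC label TTL of 2 follows the encapsulation figure later in the deck.

```python
import struct


def mpls_entry(label: int, exp: int = 0, bottom: bool = False, ttl: int = 255) -> bytes:
    """Encode one 32-bit MPLS label stack entry."""
    return struct.pack("!I", (label << 12) | (exp << 9) | (int(bottom) << 8) | ttl)


def eompls_encapsulate(frame: bytes, tunnel_label: int, vc_label: int) -> bytes:
    """Ingress LER: push the VC label (bottom of stack), then the tunnel label."""
    return mpls_entry(tunnel_label) + mpls_entry(vc_label, bottom=True, ttl=2) + frame


def eompls_decapsulate(packet: bytes) -> tuple[list[int], bytes]:
    """Read labels until the bottom-of-stack bit, then return the original frame."""
    labels = []
    offset = 0
    while True:
        (word,) = struct.unpack_from("!I", packet, offset)
        offset += 4
        labels.append(word >> 12)
        if (word >> 8) & 1:  # S bit set: this was the VC label
            break
    return labels, packet[offset:]


# Hypothetical Ethernet frame: destination MAC, source MAC, EtherType, payload.
frame = bytes.fromhex("0000000000020000000000010800") + b"payload"
packet = eompls_encapsulate(frame, tunnel_label=24, vc_label=72)
labels, recovered = eompls_decapsulate(packet)
```

The frame comes out byte-for-byte identical, which is what "delivered as they were" means on the slide: the MPLS network behaves like a wire between the two attachment circuits.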
Pseudo Wire Reference Model
•
A Pseudo Wire (PW) is a connection between two provider edge
(PE) devices connecting two attachment circuits (ACs)
•
Label Switched Path (LSP) – MPLS tunnel
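[Figure: pseudowire reference model. CE devices at the customer sites attach to PE1 and PE2 through attachment circuits; pseudowires PW1 and PW2 run between the PEs inside a PSN tunnel (an LSP in MPLS) across the MPLS (or IP) network and together provide the emulated service.]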
VC Distribution Mechanism using LDP
• Unidirectional Tunnel LSP
– To transport PW PDU from PE to PE based on tunnel label(s)
– Both LSPs combined to form a single bi-directional Pseudo Wire
• Directed LDP session
– To exchange VC information, such as VC label and control information
VC Label identifies interface
Tunnel Label(s) gets to PE router
[Figure: PE1 and PE2 connect customer sites (CEs) across an IP/MPLS network. The label switched path is created using IGP+LDP or RSVP-TE, and a directed LDP session runs between PE1 and PE2.]
Ethernet PW Tunnel Encapsulation
•
Tunnel Encapsulation
–
One or more MPLS labels associated with the tunnel
–
Defines the LSP from ingress to egress PE router
[Figure: Ethernet PW encapsulation, one 32-bit word per row]
Tunnel Encaps: Tunnel Label (LDP, RSVP, BGP) | EXP | S = 0 | TTL
PW Demux: VC Label (VC) | EXP | S = 1 | TTL (set to 2)
Control word: 0 0 0 0 | Reserved | Sequence Number
Payload: Layer-2 PDU
Ethernet PW Demultiplexer
• Obtained from Directed LDP session
• To identify individual circuits within a tunnel
• Used by receiving PE to determine
– Egress interface for L2PDU forwarding (Port based)
– Egress VLAN used on the CE facing interface (VLAN Based)
• EXP can be set to the values received in the L2 frame
PW Operation and Encapsulation
•
This process happens in both directions
[Figure: PW operation between PE1 and PE2 across core routers P1 and P2. Hop-by-hop LDP sessions advertise tunnel labels for the egress PE loopback Lo0 (labels 24 and 38, and a pop at the last core hop); the directed LDP session between PE1 and PE2 advertises VC label 72 for PW1. The L2 PDU is carried with label stacks 24 | 72, then 38 | 72, then 72 alone.]
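The label operations in the figure can be traced with a small simulation. The label values 24, 38, and 72 are taken from the figure, but which router advertises which tunnel label is an assumption here, as are the table layout and the function name `forward_along_lsp`.

```python
def forward_along_lsp(stack: list[int], label_tables: list[dict]) -> list[list[int]]:
    """Apply each core router's action to the top label and record the stack per hop.

    Each table maps an incoming top label to ("swap", new_label) or ("pop", None).
    Only the tunnel label is examined in the core; the VC label rides underneath.
    """
    trace = [list(stack)]
    for table in label_tables:
        top, rest = stack[0], stack[1:]
        action, out_label = table[top]
        if action == "swap":
            stack = [out_label] + rest
        elif action == "pop":
            stack = rest
        else:
            raise ValueError(f"unknown action {action!r}")
        trace.append(list(stack))
    return trace


# Assumed roles: the first core router swaps 24 for 38, the second pops the
# tunnel label so the egress PE receives only VC label 72.
first_core = {24: ("swap", 38)}
second_core = {38: ("pop", None)}

trace = forward_along_lsp([24, 72], [first_core, second_core])
assert trace == [[24, 72], [38, 72], [72]]
```

The egress PE then uses VC label 72 alone to pick the attachment circuit, which is why the core routers never need to know about individual pseudowires.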
Virtual Private LAN Service
•
End-to-end architecture allowing MPLS networks to offer Layer 2
multipoint Ethernet Services
•
Provides emulation of a single virtual Ethernet bridge network
•
Virtual Bridges linked with MPLS Pseudo Wires
[Figure: VPLS is an architecture. CE devices attach to PEs, which are linked by MPLS pseudowires.]
Virtual Private LAN Service
•
It is “Virtual”
–
Multiple instances share the same physical
infrastructure
•
It is “Private”
–
Each instance is independent and isolated from
one another
•
It is “LAN Service”
–
It emulates Layer 2 multipoint connectivity
VPLS Components
[Figure: VPLS components. CE routers and CE switches attach to N-PEs around an MPLS core.]
• Attachment circuits: port or VLAN mode
• Mesh of LSPs between N-PEs: Pseudo Wires within LSPs
• Virtual Switch Interface (VSI) terminates PWs and provides the Ethernet bridge function
• LDP between PEs is used to exchange VC and tunnel labels for Pseudo Wires
Virtual Switch Interface
•
Flooding / Forwarding
–
MAC table instances per customer (port/vlan) for each PE
–
Associate ports to MAC, flood unknowns to all other ports
•
Address Learning / Aging
–
LDP enhanced with additional MAC list TLV (label
withdrawal)
–
MAC timers refreshed with incoming frames
•
Loop Prevention
–
Create a full-mesh of Pseudo Wires (VCs in EoMPLS)
–
Unidirectional LSP carries VCs between pair of N-PEs
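The three VSI functions can be sketched as a small learning bridge. Two points are assumptions rather than slide content: the 300-second aging time, and the split-horizon rule (a frame received on a pseudowire is never flooded to another pseudowire), which is the usual companion to the full mesh for loop prevention. The class and interface names are illustrative.

```python
class VirtualSwitchInstance:
    """Per-customer bridge on a PE: learns MACs on attachment circuits and pseudowires."""

    def __init__(self, attachment_circuits, pseudowires, aging_seconds=300.0):
        self.acs = set(attachment_circuits)
        self.pws = set(pseudowires)
        self.aging_seconds = aging_seconds  # assumed default
        self.mac_table = {}  # MAC -> (interface, time last seen)

    def receive(self, in_if, src_mac, dst_mac, now=0.0):
        """Learn the source MAC and return the list of output interfaces."""
        # Address learning: incoming frames create or refresh the entry.
        self.mac_table[src_mac] = (in_if, now)

        entry = self.mac_table.get(dst_mac)
        if entry is not None and now - entry[1] <= self.aging_seconds:
            out_if = entry[0]
            return [] if out_if == in_if else [out_if]

        # Unknown (or aged-out) destination: flood to all other ports.
        targets = set(self.acs)
        if in_if not in self.pws:
            # Split horizon: only frames from attachment circuits reach the PWs.
            targets |= self.pws
        targets.discard(in_if)
        return sorted(targets)


vsi = VirtualSwitchInstance(["ac1", "ac2"], ["pw-pe2", "pw-pe3"])
flooded = vsi.receive("ac1", "mac1", "mac2")       # unknown: all other ports
reply = vsi.receive("pw-pe2", "mac2", "mac1")      # mac1 already learned on ac1
unicast = vsi.receive("ac1", "mac1", "mac2", 1.0)  # mac2 now known behind pw-pe2
```

With a full mesh every remote PE is one pseudowire away, so a frame arriving on a pseudowire never needs to be relayed onto another one, and no spanning tree is required in the core.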
VPLS Flooding and Forwarding
•
Flooding (Broadcast, Multicast, Unknown Unicast)
•
Dynamic learning of MAC addresses on PHY and VCs
•
Forwarding
–
Physical Port
–
Virtual Circuit
MAC Learning and Forwarding
• Broadcast, Multicast, and Unknown Unicast are learned via the received
label associations
• Two LSPs associated with a VC (Tx & Rx)
• If inbound or outbound LSP is down
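[Figure: MAC learning across a pseudowire between PE1 (CE on E0/0) and PE2 (CE on E0/1). PE1 signals "Send me frames using Label 102" and PE2 signals "Send me frames using Label 170". Frames from MAC1 to MAC2 leave PE1 with VC label 170; frames from MAC2 to MAC1 leave PE2 with VC label 102.]
MAC address table on PE1: MAC 1 -> E0/0, MAC 2 -> 170
MAC address table on PE2: MAC 2 -> E0/1, MAC 1 -> 102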
MAC Address Withdraw
• Message speeds up convergence process
– Otherwise PE relies on MAC Address Aging Timer
• Upon failure, PE removes locally learned MAC addresses
• Send LDP Address Withdraw to remote PEs in VPLS (using the Directed LDP
session)
• New MAC List TLV is used to withdraw addresses
VPLS Functional Components
•
N-PE provides VPLS termination/L3 services
•
U-PE provides customer UNI
[Figure: CE - U-PE - N-PE - MPLS Core - N-PE - U-PE - CE. The U-PEs sit in customer MxUs and the N-PEs in SP PoPs.]
Directed Attachment (Flat)
Characteristics
•
Suitable for simple/small implementations
•
Full mesh of directed LDP sessions required
–
N*(N-1)/2 Pseudo Wires required
–
Scalability issue as the number of PE routers grows
•
No hierarchical scalability
•
VLAN and Port level support
•
Potential signaling and packet replication overhead
–
Large amount of multicast replication over same physical
–
CPU overhead for replication
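The N*(N-1)/2 growth is easy to check numerically. The sketch below also counts the copies an ingress PE must send for each flooded frame in a flat design, one per remote PE; the function names are illustrative.

```python
def full_mesh_pseudowires(n_pe: int) -> int:
    """Number of pseudowires needed to fully mesh n_pe PE routers."""
    if n_pe < 0:
        raise ValueError("number of PE routers cannot be negative")
    return n_pe * (n_pe - 1) // 2


def copies_per_flooded_frame(n_pe: int) -> int:
    """Copies the ingress PE replicates for one broadcast or unknown unicast frame."""
    return max(n_pe - 1, 0)


for n in (4, 10, 50):
    print(n, full_mesh_pseudowires(n), copies_per_flooded_frame(n))
# 4 PEs need 6 pseudowires, 10 need 45, and 50 need 1225.
```

Pseudowire count grows quadratically while replication per frame grows linearly, so provisioning effort tends to become the limiting factor first.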
Direct Attachment VPLS (Flat
Architecture)
[Figure: direct attachment VPLS. CE - N-PE - MPLS Core - N-PE - CE, with Ethernet (VLAN/Port) attachment on both sides and full mesh PWs + LDP between the N-PEs.]
Hierarchical VPLS (H-VPLS)
•
Best for larger scale deployment
•
Reduction in packet replication and signaling
overhead
•
Consists of two levels in a Hub and Spoke topology
–
Hub consists of full mesh VPLS Pseudo Wires in MPLS core
–
Spokes consist of L2/L3 tunnels connecting to VPLS (Hub)
PEs
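The reduction in signaling can be estimated by comparing pseudowire counts. This sketch assumes each spoke device is single-homed to exactly one hub PE with one spoke connection; the node counts in the example are hypothetical.

```python
def full_mesh_pseudowires(n: int) -> int:
    """Pseudowires needed to fully mesh n devices."""
    return n * (n - 1) // 2


def hvpls_pseudowires(n_hub: int, n_spoke: int) -> int:
    """Full mesh among hub PEs plus one spoke connection per edge device."""
    return full_mesh_pseudowires(n_hub) + n_spoke


# Hypothetical network: 6 hub PEs in the MPLS core and 24 edge devices.
flat = full_mesh_pseudowires(6 + 24)  # every device in one flat mesh
hierarchical = hvpls_pseudowires(6, 24)
assert (flat, hierarchical) == (435, 39)
```

Adding an edge device in the hierarchical design costs one new spoke connection instead of one new pseudowire to every existing device.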
Why H-VPLS?
VPLS
• Potential signaling overhead
• Full PW mesh from the Edge
• Packet replication done at the Edge
• Node Discovery and Provisioning
H-VPLS
• Minimizes signaling overhead
• Full PW mesh among Core devices
• Packet replication done at the Core
• Partitions Node Discovery process
[Figure: flat VPLS, where CEs attach to a full mesh of PEs, compared with H-VPLS, where CEs attach through MTU-s and PE-r devices to a full mesh of PE-rs devices.]
[Figure: H-VPLS with 802.1q access or MPLS access. CEs connect through U-PE (PE-rs) devices to N-PE (PE-rs) devices, which are joined across the MPLS core by full mesh PWs + LDP. With 802.1q access the customer frame (MAC2, MAC1, Data, VLAN) crosses the SP edge on 802.1q; with MPLS access it crosses on an MPLS pseudowire. The same VCID is used in edge and core (labels may differ).]
Layer 2 VPNs
Dark Fiber
VPLS
Flooding Behavior
•
Traditional Layer 2 VPN technologies rely on
flooding to propagate MAC reachability
•
The flooding behavior causes failures to
propagate to every site in the Layer 2 VPN
•
Goal
–
Providing layer 2 connectivity, yet restrict the reach of
the unknown unicast flooding domain in order to
contain failures and preserve the resiliency
Pseudo Wires Maintenance
•
Before any learning can happen a full mesh of
pseudo-wires/ tunnels must be in place
•
For N sites, there will be N*(N-1)/2 pseudo-wires.
Complex to add and remove sites
•
Head-end replication for multicast and broadcast.
Sub-optimal BW utilization
•
Goal
–
providing point-to-cloud provisioning and optimal
Multi-homing
•
Requires additional protocols (BGP, ICC, EEM)
•
STP often extended
•
Malfunctions impact all sites
•
Goal
–
Natively providing automatic detection of
multi-homing without the need of extending the STP
domains, together with a more efficient
load-balancing
OTV Changes the Game
•
Circuits + Data Plane Flooding
– Full mesh of circuits
– MAC learning based on flooding
– Tunnels and Pseudo Wires
– Operationally challenging
• Loop prevention
• Multi-homing
•
Packet + Control Protocol
Learning
– Packet switched connectivity
– MAC learning by control protocol
– Dynamic encapsulation
– Operational simplification
• Automatic loop prevention and
Overlay Transport Virtualization
•
OTV delivers a virtual L2 transport over any L3
Infrastructure
•
Overlay
–
Independent of the Infrastructure technology
and
services, flexible over various inter-connect facilities
•
Transport
–
Transport services for
Layer 2 and Layer 3
Ethernet
and IP traffic
•
Virtualization
–
Provides
virtual stateless multi-access
connections.
OTV Control Plane MAC Learning
a. Server with MAC address X sends frames that are flooded or broadcast within the site.
b. OTV1 learns MAC X and populates its MAC address table.
c. OTV1 advertises MAC X with an IS-IS update.
d. OTV2 and OTV3 become aware that MAC X can be reached through OTV1 and populate their MAC address tables using the virtual Layer 2 interface called Overlay.
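Steps a to d can be sketched as a toy model in which the IS-IS update is reduced to a direct method call between edge devices. The class `OtvEdgeDevice`, its method names, the port name, and the IP addresses are illustrative assumptions, not part of any OTV implementation.

```python
class OtvEdgeDevice:
    """Edge device whose MAC table is filled by the control plane, not by flooding."""

    def __init__(self, name: str, join_ip: str):
        self.name = name
        self.join_ip = join_ip
        self.neighbors = []   # other edge devices on the same overlay
        self.mac_table = {}   # MAC -> ("local", port) or ("overlay", remote IP)

    def learn_local(self, mac: str, port: str) -> None:
        """Steps b and c: learn a MAC on an internal interface, then advertise it."""
        self.mac_table[mac] = ("local", port)
        for neighbor in self.neighbors:
            neighbor.receive_update(mac, self.join_ip)

    def receive_update(self, mac: str, advertiser_ip: str) -> None:
        """Step d: install the remote MAC as reachable through the Overlay interface."""
        self.mac_table[mac] = ("overlay", advertiser_ip)


otv1 = OtvEdgeDevice("OTV1", "10.0.0.1")
otv2 = OtvEdgeDevice("OTV2", "10.0.0.2")
otv3 = OtvEdgeDevice("OTV3", "10.0.0.3")
otv1.neighbors = [otv2, otv3]

otv1.learn_local("X", "Eth1/1")  # step a delivered a frame from MAC X to OTV1
assert otv2.mac_table["X"] == ("overlay", "10.0.0.1")
```

No data frame crossed the overlay during learning, which is the contrast with VPLS, where remote MACs are learned only from flooded traffic.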
OTV Frame Forwarding
a. Server2 sends a unicast frame destined to MAC X that is flooded to OTV2.
b. OTV2 checks its MAC address table and realizes that the MAC X entry points to an Overlay interface.
c. Internally in OTV2, this Overlay interface provides a mapping to OTV1’s IP address. As a result, the unicast frame is encapsulated into an IP packet directed to OTV1.
d. OTV1 receives the IP packet and decapsulates it, recovering the original Ethernet frame.
e. OTV1 uses its local MAC address table to forward the frame to Server1.
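Steps b to e can be sketched with two small functions. The data classes, function names, table layout, and addresses are illustrative assumptions; the OTV shim header is left out, and dropping frames for unknown destinations reflects the point that unknown unicast is not flooded between edge devices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EthernetFrame:
    src_mac: str
    dst_mac: str
    payload: bytes


@dataclass(frozen=True)
class IpPacket:
    src_ip: str
    dst_ip: str
    inner: EthernetFrame


def otv_forward(frame: EthernetFrame, mac_table: dict, join_ip: str) -> tuple:
    """Steps b and c: look up the destination MAC and encapsulate if it is remote."""
    entry = mac_table.get(frame.dst_mac)
    if entry is None:
        return ("drop", None)  # unknown unicast is not sent over the overlay
    kind, target = entry
    if kind == "overlay":
        return ("encapsulate", IpPacket(join_ip, target, frame))
    return ("local", target)


def otv_receive(packet: IpPacket, mac_table: dict) -> tuple:
    """Steps d and e: decapsulate and forward on the local port for the MAC."""
    frame = packet.inner
    kind, port = mac_table[frame.dst_mac]
    if kind != "local":
        raise ValueError("decapsulated frame is not destined to a local MAC")
    return (port, frame)


otv2_table = {"X": ("overlay", "10.0.0.1")}
otv1_table = {"X": ("local", "Eth1/1")}

frame = EthernetFrame(src_mac="Y", dst_mac="X", payload=b"hello")
action, packet = otv_forward(frame, otv2_table, join_ip="10.0.0.2")
port, delivered = otv_receive(packet, otv1_table)
assert (action, port, delivered) == ("encapsulate", "Eth1/1", frame)
```

The IP network between the sites sees only ordinary unicast packets between the two join interfaces, so it can route them like any other traffic.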
OTV Encapsulation
•
Outer IP header
•
Outer OTV shim header
–
VLAN
–
Overlay number
OTV elements
Edge device: network equipment that is actually deploying OTV
- Internal interface: connected to a Layer 2 network
  - To process Ethernet frames
- Join interface: connected to the Layer 3 network
  - To send or receive OTV packets
OTV elements
Overlay interface:
- A virtual Layer 2 interface that represents an OTV Layer 2 extension to other edge devices
- Used on their MAC address tables as the interface associated to remote MAC addresses
OTV elements
Site VLAN:
- A dedicated VLAN used for discovery and adjacency maintenance between edge devices on the same site
- Should not be extended to other sites.
Spanning Tree and OTV
•
OTV is site transparent: no changes to the STP
topology
•
Each site keeps its own STP domain
•
An Edge Device will send and receive BPDUs
OTV Loop Avoidance
•
Blocking unknown unicast traffic between edge devices
•
Authoritative edge device (AED)
–
The only edge device on a site handling multicast and
OTV and Multi-homing
•
OTV built-in multi-homing
–
Allows Layer 2 traffic to be load balanced through different
IP WAN links
•
OTV multi-homing options
–
Automatic distribution of VLANs among the available AED
candidates (a hashing function to deploy this distribution).
–
For unicast egress traffic, OTV can be load balanced among
all the equal-cost Layer 3 paths to remote edge devices.
Multidestination egress and ingress traffic can only use the
join interface.
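The automatic distribution of VLANs among AED candidates can be sketched with a simple stand-in for the hashing function. The slide does not specify the actual hash; the modulo rule over sorted device names used here is an assumption, as are the device names.

```python
def assign_aed(vlans, edge_devices) -> dict:
    """Give each extended VLAN exactly one authoritative edge device (AED).

    Every edge device on the site runs the same deterministic function over
    the same inputs, so all of them agree on the result without negotiation.
    """
    candidates = sorted(edge_devices)
    if not candidates:
        raise ValueError("site has no edge devices")
    return {vlan: candidates[vlan % len(candidates)] for vlan in vlans}


# Hypothetical dual-homed site extending VLANs 100 to 103.
aed = assign_aed(range(100, 104), ["otv-b", "otv-a"])
assert aed == {100: "otv-a", 101: "otv-b", 102: "otv-a", 103: "otv-b"}

# If one edge device fails, the survivor becomes AED for every VLAN.
after_failure = assign_aed(range(100, 104), ["otv-b"])
```

Because only one device is authoritative for a given VLAN, that VLAN's multidestination traffic has a single entry and exit point per site, while different VLANs still use different edge devices.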
Ethernet over MPLS Summary
Virtualization characteristics: Ethernet over MPLS
Emulation: Ethernet connection
Type: Abstraction
Subtype: Structural
Scalability: Hardware and software dependent
Technology area: Networking
Subarea: Data plane virtualization
Virtual Private LAN Summary
Virtualization characteristics: Virtual Private LAN
Emulation: Ethernet bridge
Type: Abstraction
Subtype: Structural
Scalability: Hardware and software dependent
Technology area: Networking
Subarea: Data plane virtualization
Advantages: Layer 2 extension, multipoint connections,
Overlay Transport Virtualization
Virtualization characteristics: Overlay Transport Virtualization
Emulation: Overlay Ethernet network
Type: Abstraction
Subtype: Structural
Scalability: Hardware and software dependent
Technology area: Networking
Subarea: Data and control plane virtualization
Advantages: Layer 2 extension, multipoint connections, transport