Load Balancing Routing - Cloud-Scale Data Center Network Architecture. Cheng-Chun Tu Advisor: T

Traditional Ethernet architecture does not support dynamic routing that can accommodate fluctuating workload patterns; only Layer-3 routers provide such support. By exploiting the ability to populate the forwarding tables on switches, Peregrine supports load-balancing packet routing, which takes the following factors into account. First, physical links in a data center network differ in importance, even if the physical network topology is symmetric. For example, one physical link may be more critical than another because many hosts use it to access a storage server. Peregrine computes the link criticality [6] of each link and avoids choosing more critical links early on, so as to eventually achieve network-wide load balance. Second, the number of hops on the route between two hosts is an important quality indicator because it determines the network latency as well as the amount of injected load.

A simple and effective load-balancing routing scheme is to compute a large number of paths between every source and destination pair $\langle s, d \rangle$ and to distribute the traffic from $s$ to $d$ equally among these paths. However, this scheme is infeasible because it would require a large number of forwarding table entries for each host. Instead, we can run this routing algorithm statically and use its result to steer more practical routing algorithms. More concretely, we compute up to $N$ shortest paths for every possible source/destination pair $\langle s, d \rangle$, distribute the traffic between $s$ and $d$ equally over these $N$ paths, and then compute the link criticality of a physical link $l$ with respect to $\langle s, d \rangle$, denoted $\theta_l(s, d)$, as $M/N$, where $M$ is the number of paths between $s$ and $d$ that go through the link $l$. The expected load from $s$ to $d$ on the link $l$ is then $\theta_l(s, d) \cdot TM(s, d)$, where $TM(s, d)$ represents the bandwidth demand from $s$ to $d$, and the total expected load on the link $l$ is thus $\theta_l = \sum_{(s,d)} \theta_l(s, d) \cdot TM(s, d)$. Finally, we define the cost of the link $l$ as $\mathrm{cost}(l) = \theta_l / R_l$, where $R_l$ is the residual capacity of the link $l$, and avoid choosing links with higher cost as much as possible when computing routes.
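The link cost defined above can be illustrated with a minimal Python sketch. It is not Peregrine's implementation: the function names (`shortest_paths`, `link`, `link_costs`), the toy topology, and the breadth-first path enumeration are assumptions made for illustration, and when fewer than N paths exist the sketch divides by the number of paths actually found.

```python
from collections import defaultdict, deque


def shortest_paths(adj, s, d, n):
    """Return up to n loop-free paths from s to d, fewest hops first (BFS)."""
    found, queue = [], deque([[s]])
    while queue and len(found) < n:
        path = queue.popleft()
        if path[-1] == d:
            found.append(path)
            continue
        for nxt in sorted(adj[path[-1]]):
            if nxt not in path:
                queue.append(path + [nxt])
    return found


def link(u, v):
    """Key for an undirected link (hypothetical helper)."""
    return tuple(sorted((u, v)))


def link_costs(adj, traffic, residual, n):
    """cost(l) = theta_l / R_l, with theta_l summed over all (s, d) pairs."""
    theta = defaultdict(float)
    for (s, d), demand in traffic.items():
        paths = shortest_paths(adj, s, d, n)
        for path in paths:
            for u, v in zip(path, path[1:]):
                # Each path through l contributes 1/N of TM(s, d) to theta_l.
                # Assumption: N is the number of paths actually found.
                theta[link(u, v)] += demand / len(paths)
    return {l: theta[l] / residual[l] for l in residual}


# Toy topology (assumed): s reaches d through switch a or b; t hangs off d.
adj = {"s": {"a", "b"}, "a": {"s", "d"}, "b": {"s", "d"},
       "d": {"a", "b", "t"}, "t": {"d"}}
traffic = {("s", "d"): 10.0, ("s", "t"): 4.0}
residual = {link(u, v): 10.0 for u in adj for v in adj[u]}
costs = link_costs(adj, traffic, residual, n=2)
```

In this toy case the link between d and t carries both paths of the pair (s, t), so its criticality for that pair is 1, while each link through a or b has criticality 1/2 for both pairs.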

Given a traffic matrix, each of whose entries represents the bandwidth demand from one host to another, Peregrine first sorts the entries in decreasing order and then computes paths for them in that order. That is, host pairs with higher bandwidth demands are routed earlier. To compute the primary path for a host pair $\langle s, d \rangle$, Peregrine computes $K$ shortest paths from $s$ to $d$, filters out those paths that cross switches whose forwarding table is already full, and picks the path whose links have the minimum total cost. After removing the links on the primary path from $s$ to $d$, Peregrine repeats the same process to compute the backup path. After the primary and backup paths from $s$ to $d$ are computed, the residual capacity of every link on these two paths is reduced by $TM(s, d)$, and the expected load and cost of the other links are adjusted accordingly.
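The primary and backup path selection described above might be sketched as follows. This is an illustrative sketch under stated assumptions, not the RAS code: all function and parameter names are hypothetical, paths are enumerated by breadth-first search, the cost of a link is recomputed from its current residual capacity, and the adjustment of expected load on other links is left out.

```python
from collections import deque


def k_shortest_paths(adj, s, d, k, banned=frozenset()):
    """Up to k loop-free paths from s to d, fewest hops first, avoiding banned links."""
    found, queue = [], deque([[s]])
    while queue and len(found) < k:
        path = queue.popleft()
        if path[-1] == d:
            found.append(path)
            continue
        for nxt in sorted(adj[path[-1]]):
            if nxt not in path and frozenset((path[-1], nxt)) not in banned:
                queue.append(path + [nxt])
    return found


def links_of(path):
    """Undirected links traversed by a path."""
    return [frozenset(hop) for hop in zip(path, path[1:])]


def pick_path(adj, s, d, k, theta, residual, full_switches, banned=frozenset()):
    """Cheapest of the K shortest paths that avoids switches with full tables."""
    def cost(l):
        return theta[l] / residual[l] if residual[l] > 0 else float("inf")

    candidates = [p for p in k_shortest_paths(adj, s, d, k, banned)
                  if not full_switches.intersection(p[1:-1])]
    return min(candidates, key=lambda p: sum(cost(l) for l in links_of(p)),
               default=None)


def route_all(adj, traffic, theta, residual, k, full_switches=frozenset()):
    """Route host pairs in decreasing order of demand; returns (primary, backup)."""
    routes = {}
    for (s, d), demand in sorted(traffic.items(), key=lambda item: -item[1]):
        primary = pick_path(adj, s, d, k, theta, residual, full_switches)
        backup = None
        if primary is not None:
            # The backup path may not reuse any link of the primary path.
            backup = pick_path(adj, s, d, k, theta, residual, full_switches,
                               banned=frozenset(links_of(primary)))
        for path in (primary, backup):
            for l in links_of(path or []):
                residual[l] -= demand  # reduce residual capacity by TM(s, d)
        routes[(s, d)] = (primary, backup)
    return routes


# Toy example (assumed): two disjoint two-hop routes from s to d.
adj = {"s": {"a", "b"}, "a": {"s", "d"}, "b": {"s", "d"}, "d": {"a", "b"}}
via_a = [frozenset(("s", "a")), frozenset(("a", "d"))]
via_b = [frozenset(("s", "b")), frozenset(("b", "d"))]
theta = {l: 1.0 for l in via_a + via_b}
residual = {**{l: 10.0 for l in via_a}, **{l: 5.0 for l in via_b}}
routes = route_all(adj, {("s", "d"): 2.0}, theta, residual, k=4)
```

Here the route through a is chosen as the primary path because its links have more residual capacity and therefore lower cost, and the route through b becomes the backup.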

Whenever a link experiences congestion because of traffic load fluctuations, Peregrine identifies all source-destination pairs whose primary or backup path passes through this link, deducts their measured bandwidth demands from the measured costs of the links on these primary paths, and applies the same routing algorithm to compute a new primary or backup path for each of these source-destination pairs, this time using their measured bandwidth demands.
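The first step of this re-routing procedure, finding the affected pairs, can be sketched as below. The function name `pairs_to_reroute` and the data layout are assumptions for illustration. Ordering the result by measured demand simply mirrors the decreasing-demand order of the routing algorithm.

```python
def pairs_to_reroute(routes, congested_link, measured_demand):
    """Return pairs whose primary or backup path crosses the congested link.

    routes maps (s, d) to (primary_path, backup_path); a path is a list of
    nodes or None. The result is sorted by measured demand, largest first.
    """
    bad = frozenset(congested_link)
    affected = [
        pair
        for pair, paths in routes.items()
        if any(
            frozenset(hop) == bad
            for path in paths if path
            for hop in zip(path, path[1:])
        )
    ]
    return sorted(affected, key=lambda pair: -measured_demand[pair])


# Hypothetical routing state: two pairs share the link between sw2 and h2.
routes = {
    ("h1", "h2"): (["h1", "sw1", "h2"], ["h1", "sw2", "h2"]),
    ("h3", "h2"): (["h3", "sw2", "h2"], None),
    ("h1", "h3"): (["h1", "sw1", "h3"], None),
}
measured = {("h1", "h2"): 3.0, ("h3", "h2"): 7.0, ("h1", "h3"): 1.0}
affected = pairs_to_reroute(routes, ("sw2", "h2"), measured)
```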

Chapter 5

Peregrine Implementation and Performance Evaluation

5.1 Prototype Implementation

The current Peregrine implementation on the ITRI container computer, shown in Figure 5.1, consists of three components: a kernel agent that performs MIM encapsulation and is installed on the Dom0 of every Xen-based physical machine; a central directory server (DS) that performs generalized IP-to-MAC address look-up; and a central route algorithm server (RAS) that constantly collects the traffic matrix, runs the load-balancing routing algorithm based on the traffic matrix, and populates the switches with the resulting routing state.

With two-stage dual-mode packet forwarding, there are up to four ways to reach a Peregrine host X:

• Route directly to X using X’s primary MAC address,

• Route directly to X using X’s backup MAC address,

• Route to X’s primary intermediary and then X using X’s primary MAC address, and

• Route to X’s backup intermediary and then X using X’s backup MAC address.

The first two possibilities exist only for hosts that are directly reachable. Accordingly, four MAC addresses are associated with each Peregrine host: its primary MAC address, its backup MAC address, its primary intermediary’s MAC address, and its backup intermediary’s MAC address. Traditionally, translating a host’s IP address to its MAC address is done via the ARP protocol, which is incompatible with Peregrine’s design because it relies on broadcast queries and unicast responses. Instead, Peregrine adopts a centralized directory service (DS) architecture, as shown in Figure 5.1, in which every ARP query about an IP address A is transparently intercepted by Peregrine’s kernel agent and redirected to the DS, which responds with the four MAC addresses associated with A and the availability status of the four routes to reach A.

Figure 5.1: The software architecture of the current Peregrine prototype, which consists of a kernel agent installed on every physical machine, a central directory server (DS) for IP to MAC address look-up, and a central route algorithm server (RAS) for route computation and route state population. (Components shown: a layer-2-only Clos network; physical servers, each running a MIM agent and VM0 through VMn; the directory server; the route algorithm server.)

Peregrine does not require modifications to the header structure of Ethernet packets. To perform MIM encapsulation for an outgoing packet, the Peregrine agent puts the primary or backup intermediary’s MAC address in the packet’s destination address field, and the MAC addresses of both the sending and the receiving host in the packet’s source address field. Because two host addresses must share the 48-bit source address field, every Peregrine host’s MAC address has only 24 bits rather than 48 bits. In addition, the MAC addresses of all VMs in a Peregrine network are centrally allocated, and every VM is assigned two MAC addresses.
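The header layout described in this paragraph can be sketched in Python as follows. This is only an illustration of packing two 24-bit host addresses into the 6-byte source address field. The function names, the byte order, and the choice to place the sender before the receiver are assumptions, not details taken from the Peregrine agent.

```python
from typing import Tuple


def mim_encapsulate(frame: bytes, intermediary_mac: bytes,
                    src_id: int, dst_id: int) -> bytes:
    """Rewrite the 12 address bytes of an Ethernet frame for MIM forwarding.

    DA (6 bytes) = intermediary's MAC address.
    SA (6 bytes) = 24-bit sender address followed by 24-bit receiver address
                   (ordering is an assumption of this sketch).
    """
    if len(intermediary_mac) != 6:
        raise ValueError("intermediary MAC must be 6 bytes")
    for host_id in (src_id, dst_id):
        if not 0 <= host_id < (1 << 24):
            raise ValueError("Peregrine host addresses are 24 bits")
    packed_sa = src_id.to_bytes(3, "big") + dst_id.to_bytes(3, "big")
    # The header structure is unchanged: only the two address fields differ.
    return intermediary_mac + packed_sa + frame[12:]


def mim_decapsulate(frame: bytes) -> Tuple[int, int, bytes]:
    """Recover (sender address, receiver address, rest of frame) at the intermediary."""
    src_id = int.from_bytes(frame[6:9], "big")
    dst_id = int.from_bytes(frame[9:12], "big")
    return src_id, dst_id, frame[12:]


# Example with made-up addresses: host 3 sends to host 6 via an intermediary.
original = bytes(12) + b"\x08\x00payload"
encapsulated = mim_encapsulate(original, bytes.fromhex("0200000000aa"), 3, 6)
```

The encapsulated frame has the same length as the original, which is consistent with the statement that the Ethernet header structure is not modified.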

The centralized IP-to-MAC address mapping architecture also enables Peregrine to support private IP address reuse, which allows multiple virtual data centers (VDCs) to run on a single Peregrine network and gives each VDC the same private IP address space (e.g., 10.x.x.x). When a VM in a VDC issues an ARP query about an IP address, Peregrine consults the DS using both the IP address and the ID of the VDC, which disambiguates an IP address that is simultaneously used by multiple VDCs.
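A directory keyed by VDC ID and IP address could look like the minimal sketch below. The class and field names are hypothetical, and the MAC values are placeholders. The point is only that the pair (VDC ID, IP address) serves as the look-up key, so the same private IP address can map to different hosts.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class DirectoryEntry:
    """The four MAC addresses the DS associates with one host."""
    primary_mac: str
    backup_mac: str
    primary_intermediary_mac: str
    backup_intermediary_mac: str


class DirectoryServer:
    """Toy in-memory directory; the real DS interface is not specified here."""

    def __init__(self) -> None:
        self._table: Dict[Tuple[int, str], DirectoryEntry] = {}

    def register(self, vdc_id: int, ip: str, entry: DirectoryEntry) -> None:
        self._table[(vdc_id, ip)] = entry

    def lookup(self, vdc_id: int, ip: str) -> Optional[DirectoryEntry]:
        # The VDC ID disambiguates IP addresses reused across VDCs.
        return self._table.get((vdc_id, ip))


ds = DirectoryServer()
ds.register(1, "10.0.0.5", DirectoryEntry("mac-a1", "mac-a2", "mac-a3", "mac-a4"))
ds.register(2, "10.0.0.5", DirectoryEntry("mac-b1", "mac-b2", "mac-b3", "mac-b4"))
```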

Figure 5.2 gives an example that illustrates the MAC address look-up for two-stage dual-mode packet forwarding. When VM3 sends out an ARP query about VM6’s IP address (step 1), the Peregrine agent installed in Dom0 of VM3’s physical machine (PM1) intercepts this query and submits the resulting query to the directory server (DS) (step 2). The DS looks up its database and sends the four MAC addresses associated with VM6, together with their availability status (step 3), back to the Peregrine agent on PM1, which creates and sends a legitimate ARP reply to VM3 and caches the reply to answer future ARP queries about VM6’s IP address. Once VM3 receives VM6’s MAC address, it forms the associated packet and sends the packet out.

Figure 5.2: MAC address translation for two-stage dual-mode forwarding. (Legend: PM = physical machine, VM = virtual machine, SW = switch, DS = directory server. The Ethernet header carries a 6-byte DA and a 6-byte SA. Directory entries shown: VM6, primary/backup mac1/mac2 (direct); SW3, primary/backup mac3/mac4 (indirect).)

In Peregrine, all packets from a DomU VM pass through the Peregrine agent in Dom0 of the corresponding physical machine. For each passing packet, the Peregrine agent looks up the packet’s destination IP address in the ARP cache and rewrites the packet’s destination MAC address field based on the look-up result. For example, in the case of Figure 5.2, VM6 can be reached in four ways: (1) Indirect Primary: The Peregrine agent on PM1 performs MIM encapsulation with the MAC address of VM6’s primary intermediary and VM6’s primary MAC address, and sends the packet out (step 4). When the packet arrives at VM6’s primary intermediary, i.e., SW3, the intermediary decapsulates the MIM packet and forwards the resulting packet to VM6 (step 5). (2) Indirect Backup: Everything works in the same way as in the Indirect Primary case, except that VM6’s backup intermediary, SW4, is used for packet relaying. (3) Direct Primary: The destination MAC address of the outgoing packets is VM6’s primary MAC address. (4) Direct Backup: The destination MAC address of the outgoing packets is VM6’s backup MAC address.
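The agent's choice among the four routes can be sketched as a look-up over the DS reply. The text does not state a preference order among the routes, so the order used below (direct before indirect, primary before backup) is purely an assumption, as are the type and function names.

```python
from typing import NamedTuple, Optional, Tuple


class HostRoutes(NamedTuple):
    """A DS reply: four MAC addresses plus the availability of the four routes."""
    primary_mac: str
    backup_mac: str
    primary_intermediary_mac: str
    backup_intermediary_mac: str
    # Order: direct primary, direct backup, indirect primary, indirect backup.
    available: Tuple[bool, bool, bool, bool]


def choose_route(r: HostRoutes) -> Optional[Tuple[str, Optional[str]]]:
    """Return (destination MAC to write, host MAC for MIM encapsulation or None).

    A second element of None means a direct route with no MIM encapsulation.
    The preference order is an assumption of this sketch.
    """
    options = [
        (r.primary_mac, None),                        # direct primary
        (r.backup_mac, None),                         # direct backup
        (r.primary_intermediary_mac, r.primary_mac),  # indirect primary
        (r.backup_intermediary_mac, r.backup_mac),    # indirect backup
    ]
    for usable, option in zip(r.available, options):
        if usable:
            return option
    return None  # host currently unreachable


# Addresses as labeled in Figure 5.2: VM6 is mac1/mac2, intermediaries mac3/mac4.
vm6 = HostRoutes("mac1", "mac2", "mac3", "mac4", (False, True, True, True))
chosen = choose_route(vm6)
```

With the direct primary route marked unavailable, this sketch falls back to the direct backup address mac2.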

Figure 5.3 illustrates how Peregrine’s fast fail-over mechanism works. Initially, VM6’s primary and backup MAC addresses, mac1 and mac2, are pre-populated by the RAS on the switches along the two disjoint routes (step 1). The primary route to VM6 goes through SW2 and SW3, while the backup route goes through SW1 and SW4. Whenever a link along the primary path from VM3 to VM6 goes down, the link’s adjacent switch sends an SNMP trap to the RAS (step 2), which determines the source-destination pairs affected by the link failure and passes this information to the DS (step 3). The DS then informs the source hosts that their associated destination hosts are reachable only via their backup MAC addresses. In this case, it sends an ARP entry update to PM1 (step 4) indicating that packets from VM3 to VM6 should use mac2, VM6’s backup MAC address, as the destination MAC address. After that, all packets destined to VM6 from VM3 go through the backup path (step 5).

Figure 5.3: Switching from the direct primary route to the direct backup route upon a link failure. (Steps shown: 1. deploy forwarding tables; 2. link-down trap; 3. update; 4. update cache; 5. backup path.)

Upon a link or switch failure, the DS only needs to update those physical machines that currently cache ARP entries invalidated by the failure, because it keeps track of which physical machines cache which ARP entries. The DS performs these ARP cache updates using unicast. For a given VM, the number of physical machines caching its ARP entry is expected to be relatively small. Therefore, the DS allocates enough space to record at most M caching machines for a given ARP entry, where M is tentatively set to 50. For a very popular VM that communicates with a large number of physical machines, a special flag is set in its ARP entry, and any modification to its ARP entry triggers an ARP update to every physical machine.
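The bookkeeping described above can be sketched as follows. The class name, method names, and the rule that an entry is flagged as popular once more than M machines cache it are assumptions. The text only says that a flag is set for very popular VMs.

```python
class ArpCacheTracker:
    """Tracks which physical machines cache each ARP entry (toy sketch)."""

    def __init__(self, all_machines, max_tracked=50):
        self.all_machines = set(all_machines)
        self.max_tracked = max_tracked  # M in the text, tentatively 50
        self.cachers = {}               # IP address -> set of caching machines
        self.popular = set()            # IP addresses with the special flag set

    def record_lookup(self, ip, machine):
        """Remember that `machine` now caches the ARP entry for `ip`."""
        if ip in self.popular:
            return
        holders = self.cachers.setdefault(ip, set())
        holders.add(machine)
        if len(holders) > self.max_tracked:
            # Assumed policy: stop tracking and flag the entry as popular.
            self.popular.add(ip)
            del self.cachers[ip]

    def update_targets(self, ip):
        """Machines that must receive a unicast ARP update for `ip`."""
        if ip in self.popular:
            return set(self.all_machines)
        return set(self.cachers.get(ip, ()))


# Small M so the popular-entry behavior is visible in a short example.
tracker = ArpCacheTracker(["pm0", "pm1", "pm2", "pm3", "pm4"], max_tracked=2)
tracker.record_lookup("10.0.0.6", "pm0")
tracker.record_lookup("10.0.0.6", "pm1")
few = tracker.update_targets("10.0.0.6")
tracker.record_lookup("10.0.0.6", "pm2")
everyone = tracker.update_targets("10.0.0.6")
```

Bounding the per-entry record at M keeps the DS state small for typical VMs, at the price of updating every physical machine when a flagged entry changes.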
