3.1 The case for packet-level routing
3.1.3 Understanding the performance of ECMP
To understand the poor performance of flow-level routing, we can use a probabilistic method to compute the throughput that we should expect to see with ECMP.
Computing the throughput analytically allows us to validate the correctness of our simulations and helps provide some insight into the behavior of flow-level routing.
The particular point of comparison we chose for this exercise was the performance of ECMP in a 3-level FatTree as the switch port count is varied. This is similar to what we presented in Figure 3.3a except that we used a restricted form of the permutation traffic pattern so that we could limit our analysis to the first half of the network.
To determine the throughput of a given flow, we need to consider the ways in which it may collide with other flows in the network. To aid in this discussion, we present an unfolded view of the 4-port FatTree network in Figure 3.4. The unfolded FatTree is essentially equivalent to the folded network shown earlier in Figure 3.1a, the main difference being that the network is drawn with unidirectional links so that all traffic flows from left to right. The unfolded view helps illustrate the ways in which flows can collide and allows us to divide the network into a set of stages which we can analyze separately. This view does imply that all traffic must pass through the intermediate switches, which is not necessary for traffic local to a subtree in the folded network.
For the traffic pattern we use here, all traffic is between servers in different pods.
In the literature, a “pod” refers to the set of servers that share the same links to the intermediate switches (i.e., a subtree at level l − 1). There are 4 such pods in Figure 3.4, and a FatTree with k-port switches has k pods. In the first half of the network, flows can only collide if they are from servers in the same subtree. Also notice that collisions in the first half depend only on the paths that flows take and not on their destinations. Flows from different pods can only collide in the second half of the network, when they are destined for the same subtree. Notice that flows that collide on a link in the first half may diverge and subsequently reconverge on a link in the second half. However, they can no longer affect each other's throughput when this happens. We can leverage this observation by adding a requirement that the servers in one pod all choose destinations in the same pod. We call this the “pod-to-pod” permutation pattern, and under this traffic pattern no further loss can occur after the second stage.
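As an illustration of the constraint just described, the following minimal sketch generates one pod-to-pod permutation. The function name `pod_to_pod_permutation`, the `(pod, slot)` server labelling, and the use of uniform random shuffles are assumptions made for this example; the text does not specify how the pattern was sampled.

```python
import random

def pod_to_pod_permutation(k, rng=random):
    """Map every server to a destination so that all servers of one pod
    send to the same, different pod (hypothetical helper)."""
    servers_per_pod = (k // 2) ** 2      # k/2 edge switches with k/2 servers each
    pods = list(range(k))                # a k-port FatTree has k pods

    # Pod-level permutation without fixed points, so no traffic stays
    # inside its own pod (simple rejection sampling).
    while True:
        dst_pods = pods[:]
        rng.shuffle(dst_pods)
        if all(src != dst for src, dst in zip(pods, dst_pods)):
            break

    # Within each pod pair, match source servers to destination servers one-to-one.
    mapping = {}
    for src_pod, dst_pod in zip(pods, dst_pods):
        dst_slots = list(range(servers_per_pod))
        rng.shuffle(dst_slots)
        for src_slot, dst_slot in enumerate(dst_slots):
            mapping[(src_pod, src_slot)] = (dst_pod, dst_slot)
    return mapping

pattern = pod_to_pod_permutation(4, random.Random(0))
for src in sorted(pattern)[:4]:
    print(src, "->", pattern[src])
```

Each server appears exactly once as a source and once as a destination, so the result is still a permutation pattern. It is restricted so that flows from different pods never share a destination subtree.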
Throughput in stage 1:
We begin by using the 4-port network as an example.
Let X be a random variable representing the number of flows on a link in stage 1, e.g., link A. Since there are 2 servers per switch, the possible outcomes are 0, 1, or 2 flows.
Each flow is assigned to a path at random, independently of the other flows, so a flow is routed through link A with probability $p = \frac{1}{2}$:
\[
P(X = x) =
\begin{cases}
\frac{1}{4} & x = 0 \\
\frac{1}{2} & x = 1 \\
\frac{1}{4} & x = 2
\end{cases}
\]
The throughput in stage 1 is equivalent to the expected load on link A. The load on link A is a function of X, which we can represent as another random variable A = g(X). Since each server has one flow sending at full offered load, A has outcomes 1 if it carries one or more flows and 0 otherwise. So we have:
\[
P_A(a) =
\begin{cases}
\frac{1}{4} & a = 0 \\
\frac{3}{4} & a = 1
\end{cases}
\]
This gives us an expected load of E[A] = 0.75 in stage 1. Thus the average throughput of a flow in stage 1 is also 0.75.
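The 4-port result can be checked by brute force. The sketch below enumerates every equally likely uplink assignment for the flows of one switch and counts how many land on link A; the helper name `stage1_by_enumeration` and the convention that uplink 0 stands for link A are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

def stage1_by_enumeration(n):
    """Enumerate all uplink choices of n flows over n uplinks.

    Returns the PMF of X (flows on uplink 0, standing in for link A)
    and the expected load E[A] = P(X > 0)."""
    outcomes = list(product(range(n), repeat=n))
    weight = Fraction(1, len(outcomes))   # every assignment is equally likely
    pmf_x = {}
    for choice in outcomes:
        x = choice.count(0)               # flows that picked link A
        pmf_x[x] = pmf_x.get(x, 0) + weight
    expected_load = sum(p for x, p in pmf_x.items() if x > 0)
    return pmf_x, expected_load

pmf_x, expected_load = stage1_by_enumeration(2)   # 4-port case: n = 2
print(sorted(pmf_x.items()))
print(expected_load)  # 3/4
```

Exact fractions are used rather than floats so the output can be compared directly with the values in the text.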
We can generalize this for k-port switches by viewing X as the number of successes in a series of n Bernoulli trials where the probability of success is $p = \frac{1}{n}$ for each trial.
This follows a binomial distribution B(n, p) giving X the probability mass function (PMF):
\[
P_X(x) = \binom{n}{x} p^x (1 - p)^{n-x} \tag{3.1}
\]
Since the load on link A is 1 as long as $x \neq 0$, the expected throughput is equivalent to $1 - P(X = 0)$. Thus with $n = \frac{k}{2}$ the expected throughput in stage 1 is:
\[
E[A] = 1 - \left(1 - \frac{1}{n}\right)^{n} \tag{3.2}
\]
Observe that $\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{n} = \frac{1}{e}$. This means that as we scale the number of ports, the throughput in stage 1 converges to $1 - \frac{1}{e}$, which is roughly 63%.
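The convergence is easy to see numerically. The following short sketch evaluates the stage-1 expression for a few port counts and compares it with the limit; the function name `stage1_throughput` and the chosen values of k are assumptions made for illustration only.

```python
import math

def stage1_throughput(k):
    """Expected stage-1 throughput 1 - (1 - 1/n)^n with n = k/2."""
    n = k // 2
    return 1.0 - (1.0 - 1.0 / n) ** n

for k in (4, 8, 16, 24, 48):
    print(f"k = {k:2d}  E[A] = {stage1_throughput(k):.4f}")
print(f"limit   1 - 1/e = {1.0 - 1.0 / math.e:.4f}")
```

The value starts at 0.75 for k = 4 and decreases monotonically toward the limit, so most of the stage-1 loss is already present at small port counts.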
Throughput in stage 2:
Finding the expected load on a link in stage 2 is more complicated since flows that collide in stage 1 do not create the same load on links in stage 2. To find the total load on some link B in stage 2, we first find the fraction of traffic from link A that continues on to B. Let Y represent the number of flows that continue on to B from the X flows on link A. For each pair (x, y) we need to compute the joint probability P(X = x, Y = y).
For the 4-port case this joint probability distribution is:
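\[
P_{X,Y}(x, y) =
\begin{cases}
\frac{1}{4} & (x, y) = (0, 0) \\
\frac{1}{4} & (x, y) = (1, 0) \\
\frac{1}{4} & (x, y) = (1, 1) \\
\frac{1}{16} & (x, y) = (2, 0) \\
\frac{1}{8} & (x, y) = (2, 1) \\
\frac{1}{16} & (x, y) = (2, 2)
\end{cases}
\]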
We then calculate Z, the load from link A on link B:
\[
Z =
\begin{cases}
Y / X & \text{if } X > 0 \\
0 & \text{otherwise}
\end{cases}
\tag{3.3}
\]
For the 4-port case we have the PMF:
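\[
P_Z(z) =
\begin{cases}
\frac{9}{16} & z = 0 \\
\frac{1}{8} & z = \frac{1}{2} \\
\frac{5}{16} & z = 1
\end{cases}
\]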
Z represents the load on link B from one of the switches in stage 1. Since there are two such switches, we need to consider the contributions of both. We can represent their combined load on B as $Z_1 + Z_2$. Since the link has a capacity of one, $B = Z_1 + Z_2$ if $Z_1 + Z_2 \le 1$ and 1 otherwise. The PMF of B is then:
\[
P_B(b) =
\begin{cases}
\frac{81}{256} & b = 0 \\
\frac{36}{256} & b = \frac{1}{2} \\
\frac{139}{256} & b = 1
\end{cases}
\]
For this case where k = 4, the expected throughput in stage 2 is thus $\frac{157}{256} \approx 0.613$.
Since each pod is independent, this also represents the throughput we expect to see under the “pod-to-pod” permutation pattern.
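The stage-2 calculation can be reproduced with exact arithmetic. The sketch below is not the script used for Figure 3.5. It assumes that each flow on link A continues to B independently with probability 1/n, and that n stage-1 switches feed link B. For n = 2 this matches the 4-port case worked above; treating larger n the same way is an assumption of this example. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def binom_pmf(n, p, x):
    """Binomial PMF, as in Equation 3.1."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def pmf_z(n):
    """PMF of Z = Y/X (Equation 3.3) for one stage-1 link."""
    p = Fraction(1, n)
    pmf = {}
    for x in range(n + 1):
        px = binom_pmf(n, p, x)
        if x == 0:
            pmf[Fraction(0)] = pmf.get(Fraction(0), 0) + px
            continue
        for y in range(x + 1):            # Y | X = x  ~  B(x, p)
            z = Fraction(y, x)
            pmf[z] = pmf.get(z, 0) + px * binom_pmf(x, p, y)
    return pmf

def pmf_b(n):
    """PMF of B = min(Z_1 + ... + Z_n, 1) for independent copies of Z."""
    z = pmf_z(n)
    total = {Fraction(0): Fraction(1)}
    for _ in range(n):
        nxt = {}
        for s, ps in total.items():
            for v, pv in z.items():
                load = min(s + v, Fraction(1))   # link capacity is one
                nxt[load] = nxt.get(load, 0) + ps * pv
        total = nxt
    return total

def stage2_throughput(n):
    return sum(load * prob for load, prob in pmf_b(n).items())

print(sorted(pmf_b(2).items()))
print(stage2_throughput(2))  # 157/256
```

Capping the running sum at one after each addition gives the same result as capping the final sum, because every contribution is non-negative.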
[Figure 3.5: ECMP routing under the pod-permutation traffic pattern. x-axis: k, ports per switch; y-axis: throughput at offered load 1; series: simulated average, simulated minimum, theoretical.]
In order to compare our simulations to the analysis, we simulated the pod-permutation traffic pattern for values of k from 4 to 24 ports. We wrote a Python script to carry out the analysis described above for each value of k. The results are shown in Figure 3.5 and demonstrate that the simulated throughput closely matches the theoretical values.
[Figure 3.6: Performance of packet-level VLB under the permutation pattern. (a) Comparison of oblivious flow-level and packet-level routing. (b) VLB with different packet sizes; series: random size (min), random size (avg), max size (min), max size (avg). x-axis: offered load.]