The Controller Failure - SD-WAN test Topologies and Parameters

6.2. SD-WAN test Topologies and Parameters

6.3.4. The Controller Failure

The fourth category of tests has two objectives. The first is to show that the COVN placement algorithm recovers quickly from controller failures. The second is to show that, even when the VN runs on backup controllers after all of the original active controllers have failed, the performance of the SD-WAN does not degrade significantly. This is because the COVN placement algorithm aims to place the backup controllers close to the original ones, while also taking the reliability of the backup node into account. The test strategy is as follows:

• The number of active physical controllers is always four.
• The complete topology is considered as a single VN, which is partitioned into four clusters.
• The COVN placement is applied to the network with the optimal closeness and reliability weights identified in the weight tests for each topology, as shown in Table 6.8.
• The first test is the failure-free case; in each subsequent test, one more active controller fails and is recovered by a backup controller, until five tests are completed.
• The maximum, average and minimum values of the latency and reliability metrics are compared across the five tests.
• The locations of the original and backup physical controllers are compared.
• The effect of using the backup controllers on the latency and reliability metrics is investigated.
• The same steps are repeated on three networks (Generated-1, Internet2 and India) to see how the results differ for each network under its own weights.
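The five-test sequence described above can be sketched in a few lines of Python. This is only an illustration of the test procedure, not the COVN implementation: the names `Scenario` and `failure_sequence` are hypothetical, and the sketch assumes that the ordered backup list has already been produced by the placement algorithm. The controller IDs are those listed in Table 6.9 for the Generated-1 topology.

```python
from typing import List, NamedTuple


class Scenario(NamedTuple):
    failed: List[int]   # controllers that have failed so far
    backups: List[int]  # backup controllers brought into service so far
    active: List[int]   # controllers currently serving the four clusters


def failure_sequence(active: List[int], backups: List[int]) -> List[Scenario]:
    """Build the failure-free case plus one test per additional failure."""
    tests = [Scenario([], [], list(active))]
    current = list(active)
    for step, failed_node in enumerate(active, start=1):
        # The failed controller leaves the active set and the next backup joins it,
        # so the number of active controllers stays at four.
        current = [c for c in current if c != failed_node] + [backups[step - 1]]
        tests.append(Scenario(list(active[:step]), list(backups[:step]), current))
    return tests


# Controller IDs taken from Table 6.9 (Generated-1 topology).
tests = failure_sequence(active=[12, 18, 11, 26], backups=[13, 28, 17, 8])
for number, case in enumerate(tests, start=1):
    print(f"Test{number}: failed={case.failed} backups={case.backups} active={case.active}")
```

Each generated scenario keeps exactly four active controllers, which matches the first item of the strategy.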

A) The controller-failure of Generated-1 topology: Applying the COVN placement algorithm four times to recover from four controller failures produces the controller placement presented in Figure 6.25.

Topology      Closeness weight   Reliability weight
Generated-1   0.3                0.7
Internet2     0.5                0.5
India         0.0                1.0

Table 6.8: The optimal weights of closeness and reliability for every topology.

The controller placement in Figure 6.25 shows that every active controller (rectangle) has been replaced by a backup controller (rhombus). The displayed active-backup pairs arise only if the failures and recoveries occur in the sequence given in Table 6.9; if the failures occur in a different order, different active-backup pairs are created.

Recovering from a controller failure clearly adds no computational complexity to the algorithm, because the algorithm only assigns the next backup controller to the affected cluster.
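A minimal sketch of this recovery step is given below, assuming that the backup controllers are held in one ordered list shared by all clusters. The function names `recover` and `replay` and the cluster labels C1 to C4 are hypothetical; the controller IDs are those of Table 6.9. The sketch shows two things: recovery is a single reassignment with no re-clustering, and the resulting active-backup pairs depend on the order in which the failures occur.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def recover(cluster_of: Dict[int, str], backups: Deque[int], failed: int) -> Tuple[int, int]:
    """Hand the failed controller's cluster to the next backup in the list.

    Only the controller-to-cluster map changes; cluster membership and the
    precomputed backup order are left untouched (no re-clustering).
    """
    cluster = cluster_of.pop(failed)  # cluster that lost its controller
    backup = backups.popleft()        # next backup, whichever cluster failed
    cluster_of[backup] = cluster
    return failed, backup


def replay(failure_order: List[int]) -> List[Tuple[int, int]]:
    # Controller IDs from Table 6.9; the cluster labels C1..C4 are hypothetical.
    cluster_of = {12: "C1", 18: "C2", 11: "C3", 26: "C4"}
    backups = deque([13, 28, 17, 8])
    return [recover(cluster_of, backups, node) for node in failure_order]


print(replay([12, 18, 11, 26]))  # failure order of Table 6.9
print(replay([18, 12, 11, 26]))  # same failures, different order
```

With the order of Table 6.9, the pairs are (12, 13), (18, 28), (11, 17) and (26, 8). If controller 18 fails first instead, it is paired with backup 13 and controller 12 with backup 28.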

The second objective is examined by comparing the latency and reliability metrics shown in Figures 6.26 and 6.27. Figure 6.26 shows that the node-controller latency and the connection control-path latency remain stable, whereas the inter-controller latency increases by 0.1 ms in the last two tests. Consequently, the average connection flow-setup latency also increases by about 1 ms in the last two tests, which is not a significant increase for this latency metric.

No.     Failure status    Failed controllers   Backup controllers   Active controllers
Test1   Failure-free      None                 None                 12, 18, 11, 26
Test2   One failure       12                   13                   18, 11, 26, 13
Test3   Two failures      12, 18               13, 28               11, 26, 13, 28
Test4   Three failures    12, 18, 11           13, 28, 17           26, 13, 28, 17
Test5   Four failures     12, 18, 11, 26       13, 28, 17, 8        13, 28, 17, 8

Table 6.9: The sequence of the controller-failure and recovery process of the five tests on Generated-1 topology.

The reliability parameters in Figure 6.27 show a small increase (one hop) in the length of the control path (min-hops to the farthest node). The maximum and average values of the controller-connectivity ratio remain steady, while its minimum value falls from 41.50 to 32 in the last three tests.

Figure 6.26: The comparison of the minimum, average and maximum values of the resulting latency metrics for the controller-failure tests of Generated-1 topology.

Figure 6.27: The comparison of the minimum, average and maximum values of the resulting reliability metrics for the controller-failure tests of Generated-1 topology.


B) The controller-failure of Internet2 topology: The active-backup controller pairs generated by the controller-failure tests are shown in Figure 6.28 below.

The failure tests are performed according to the sequence which is summarised in Table 6.10.

No.     Failure status    Failed controllers   Backup controllers   Active controllers
Test1   Failure-free      None                 None                 20, 4, 6, 29
Test2   One failure       20                   25                   4, 6, 29, 25
Test3   Two failures      20, 4                25, 15               6, 29, 25, 15
Test4   Three failures    20, 4, 6             25, 15, 33           29, 25, 15, 33
Test5   Four failures     20, 4, 6, 29         25, 15, 33, 1        25, 15, 33, 1

Figure 6.28: The active and backup physical controllers of Internet2 topology.

Table 6.10: The sequence of the controller-failure and recovery process of the five tests on Internet2 topology.

The comparisons across the five tests are again used to examine the second objective. Figure 6.29 shows that all latency metrics increase as the number of failed controllers increases. The increase in the connection flow-setup latency is larger here than in the previous tests (Generated-1 topology) because the links are longer. It is about 20 ms with each further test over the first four tests, while the last two tests give similar values.

Figure 6.29: The comparison of the minimum, average and maximum values of the resulting latency metrics for the controller-failure tests of Internet2 topology.

Figure 6.30: The comparison of the minimum, average and maximum values of the resulting reliability metrics for the controller-failure tests of Internet2 topology.

Figure 6.30 shows that the min-hops to the farthest node increases by one when three failures occur. It also shows that the average and minimum values of the controller-connectivity ratio remain the same, while its maximum value falls from 80.48 in the fourth test to 63.42 in the fifth test.

C) The controller-failure of India topology: The third set of controller-failure tests is applied to the India network, and the resulting active-backup controller pairs are shown in Figure 6.31.

Figure 6.31 shows that the COVN algorithm brings a new backup controller (rhombus) into service in place of an active controller (rectangle) when the active one fails. The failure recovery is achieved without re-clustering or recalculating the controller placement, which keeps the recovery fast.

The sequence in which the controllers fail and are recovered using the backup controllers is listed in Table 6.11 below. The sequence starts with the failure-free case and ends with the failure of all four active controllers.

Next, the comparisons of latency and reliability are discussed to track how these metrics change as controllers fail and to examine the second objective. Figure 6.32 shows that the latencies increase by very small amounts across the tests. The increase arises because the reliability weight used for this COVN placement was at its highest value (1.0), which orders the controller nodes by reliability alone. The next controller node (the backup controller) may therefore be located far from the original one.
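The effect of the weights on the ordering of candidate controller nodes can be illustrated with a small sketch. It assumes that closeness and reliability are combined as a weighted sum of values normalised to [0, 1]; this scoring form is an assumption made for illustration and is not taken from the COVN definition. The three candidate nodes and their values are hypothetical and are not measurements from these tests. Only the weight pairs come from Table 6.8.

```python
from typing import Dict, List, Tuple

# Hypothetical candidate nodes: (closeness, reliability), both normalised to [0, 1].
# These values are illustrative only and are not results from the tests.
candidates: Dict[str, Tuple[float, float]] = {
    "near":   (0.9, 0.5),
    "middle": (0.6, 0.7),
    "remote": (0.2, 0.9),
}


def rank(w_closeness: float, w_reliability: float) -> List[str]:
    """Order the candidates by an assumed weighted sum, best first."""
    def score(name: str) -> float:
        closeness, reliability = candidates[name]
        return w_closeness * closeness + w_reliability * reliability
    return sorted(candidates, key=score, reverse=True)


print(rank(0.0, 1.0))  # India weights from Table 6.8
print(rank(0.5, 0.5))  # Internet2 weights from Table 6.8
```

With a reliability weight of 1.0 the closeness term vanishes, so the most reliable candidate is ranked first even though it is the most remote one. With equal weights, the nearby candidate is ranked first in this toy example.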

The results of the reliability metrics show that using the highest reliability weight degrades the controller-connectivity ratio of the backup controllers. Figure 6.33 shows that the controller-connectivity ratio decreases as more backup controllers are used. However, the decrease in the average value of this metric is not large; it is about five units per test. Figure 6.33 also shows that the maximum and average values of the control-path length increase by one in the last three tests.

No.     Failure status    Failed controllers   Backup controllers   Active controllers
Test1   Failure-free      None                 None                 96, 121, 89, 88
Test2   One failure       96                   72                   121, 89, 88, 72
Test3   Two failures      96, 121              72, 94               89, 88, 72, 94
Test4   Three failures    96, 121, 89          72, 94, 74           88, 72, 94, 74
Test5   Four failures     96, 121, 89, 88      72, 94, 74, 95       72, 94, 74, 95

Table 6.11: The sequence of the controller-failure and recovery process of the five tests on India topology.


The findings of the fourth category of tests (controller-failure tests):

• The COVN algorithm can perform fast recovery from controller failures.
• The latency metrics increase only slightly, on the order of microseconds; consequently, the connection flow-setup latency increases by only milliseconds.
• The reliability metrics are degraded by an acceptable amount, which does not undermine the SD-WAN reliability.

Figure 6.32: The comparison of the minimum, average and maximum values of the resulting latency metrics for the controller-failure tests of India topology.

Figure 6.33: The comparison of the minimum, average and maximum values of the resulting reliability metrics for the controller-failure tests of India topology.

Finally, the nearest backup controller to the cluster that loses its active controller is not re-searched (recalculated), for two reasons. (1) The failure recovery should be performed as quickly as possible. This is not a matter of computation time; rather, the next backup controller must be configured in advance as the backup controller for all clusters. Thus, once the first backup controller has been used, the second backup controller is configured as the next backup controller for all clusters. (2) It is preferable to place the backup controller in a reliable location that is close to the other controllers rather than to the cluster's nodes. Almost always (when the reliability weight is greater than or equal to 0.5), the furthest controller in the controller list has lower reliability. It also always has a longer path to all the other controllers (higher inter-controller latency).

In document A Novel Placement Algorithm for the Controllers Of the Virtual Networks (COVN) in SD-WAN with Multiple VNs (Page 170-179)