5.6 Testing Redundant Backup Strategies
5.6.2 Complexity
In the previous sections, all possible combinations of malfunctioning devices were considered, with each device treated as independent of the others. This brute-force method has a complexity of $O(2^n)$, where n is the number of nodes in the network.⁵ Each combination means simulating a set of test cases in the network. Since this does not scale, the complexity has to be reduced.
In the introduction to this section we raised the question of whether one should test the failure of individual services or only complete node crashes. From the point of view of complexity, it does not matter. When all combinations of service failures are considered, the number of nodes n is multiplied by the number of services s. With s a constant factor, $O(2^{n \cdot s})$ remains exponential in n, just like $O(2^n)$.
If one takes the actual meaning of the tests into consideration, one can limit the number of cases that must be simulated. If something does not work without node A, it is not necessary to run this test again with node A plus additional nodes malfunctioning, as it will not stop failing with fewer nodes functioning. This cuts down the number of simulation runs that are needed if something fails. The worst-case complexity remains unchanged, however. In the present case, the worst case is a perfectly failsafe network, in which no test fails even with almost all nodes malfunctioning.
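The pruning argument can be illustrated with a short Python sketch. It is not the Verinec implementation: `run_simulation` is a hypothetical callback that returns True when all tests pass for a given set of failed nodes, and the toy network is invented for illustration only.

```python
from itertools import combinations

def minimal_failing_sets(nodes, run_simulation):
    """Enumerate failure sets by size, skipping supersets of known failing sets."""
    minimal_failures = []
    runs = 0
    # Smaller sets are tried first, so every failing set found here is minimal.
    for size in range(len(nodes) + 1):
        for combo in combinations(nodes, size):
            failed = frozenset(combo)
            # A superset of a failing set fails as well: no simulation needed.
            if any(known <= failed for known in minimal_failures):
                continue
            runs += 1
            if not run_simulation(failed):
                minimal_failures.append(failed)
    return minimal_failures, runs

# Hypothetical toy network: tests pass while "A" is up
# and at least one of "B" and "C" is up.
def toy_simulation(failed):
    return "A" not in failed and not {"B", "C"} <= failed

minimal, runs = minimal_failing_sets(["A", "B", "C", "D"], toy_simulation)
```

In this toy case only 8 of the 16 combinations are simulated. If no test ever fails, nothing is pruned and all $2^n$ combinations are still run, which is the worst case described above.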
A further measure would be to fix an upper limit for the number of nodes assumed to be malfunctioning at the same time. If more than a certain percentage, say 50%, of the nodes are out, it is very unlikely that the network is still operable, thus testing with even more nodes out is not necessary. This would reduce the number of cases to be tested by about half. However, multiplying the number of cases by a constant factor does not change the complexity of the algorithm.
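The effect of such a limit can be checked by counting combinations. The following Python sketch assumes a hypothetical network of 21 nodes; the function name `count_cases` is introduced here for illustration.

```python
from math import comb

def count_cases(n, max_failures):
    """Number of failure combinations with at most max_failures failed nodes."""
    return sum(comb(n, k) for k in range(max_failures + 1))

n = 21                                  # hypothetical network size
all_cases = count_cases(n, n)           # every combination: 2**n
capped_cases = count_cases(n, n // 2)   # at most 50% of the nodes failing
ratio = capped_cases / all_cases
```

For an odd n the limit removes exactly half of the cases; for an even n it removes slightly less than half. In both cases the count still doubles with every additional node.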
In most cases, a large number of nodes failing at the same time is not a coincidence of independent events. The probability of more than a few devices failing independently at the same time is extremely small. The risk of a common cause, such as a virus infection or an external factor like an outage in a building’s power supply system or a natural disaster, is much higher. Emergency power supplies and survivability in the case of disaster must be carefully planned; however, factors that are external to networks do not lie within the scope of this thesis.
Avoiding unnecessary tests
Not all of the network elements perform tasks that are important for the network. Most of the machines in a typical campus or company network are client machines. By definition, clients do not provide network services and their failures only affect the user of the machine, but no other part of the network. Thus all simulation runs that include malfunctioning network elements in client roles can be skipped. In order to do this, additional information about the roles of network elements is required. This information can be provided by a role model that adds a type description to each network element, as proposed in the use case in Section 4.5.
⁵ Choosing all combinations to fail from n nodes: $\sum_{k=0}^{n} \binom{n}{k} = 2^n$
A generalisation of this concept can even do without a role model. Only the nodes that actually have something to do need to be simulated as malfunctioning. If a node neither processes nor sends any packets during a simulation run, the failure of that node will not be noticed. There is no need to simulate the same constellation with an unused node switched off. The proposed algorithm works as follows:
• The simulations are arranged by number of simultaneous failures, level 0 meaning no failures.
• Along with each simulation run, the set of nodes that are to be considered on the next level is established. A node that does something on a particular level might cause a test to fail if it fails itself.
• The iterative process starts with 0 nodes malfunctioning, meaning a network with no problems.
• All nodes that process an event will be simulated to fail as single nodes.
• For two failures at the same time, only nodes that had something to do when one node failed are taken into consideration, in combination with the failure of that node.
• This procedure is continued until there are no more successful tests.
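The steps above can be sketched in Python as follows. This is an illustration under assumptions, not the Verinec implementation: `run_simulation` is a hypothetical callback that returns whether the tests passed and which nodes processed or sent packets during the run, and the toy network with a backup router is invented.

```python
def explore_failures(run_simulation):
    """Level-wise exploration: only nodes that were active are failed next."""
    level = {frozenset()}      # level 0: no node is malfunctioning
    failing = []
    runs = 0
    while level:
        next_level = set()
        for failed in level:
            runs += 1
            passed, active = run_simulation(failed)
            if not passed:
                failing.append(failed)
                continue       # failing combinations are not extended
            # Each node that did something might break a test if it fails too.
            for node in active - failed:
                next_level.add(failed | {node})
        level = next_level     # stops once no run on a level succeeded
    return failing, runs

# Hypothetical toy network: client reaches server via R1, or via R2 if R1
# is down. A further node, say an unused printer, never handles packets.
def toy_simulation(failed):
    router = "R1" if "R1" not in failed else "R2"
    route = {"client", router, "server"}
    passed = not (route & failed)
    return passed, route

failing, runs = explore_failures(toy_simulation)
```

In this toy network of five nodes, 7 runs are simulated instead of 32. R2 is only considered once R1 has failed, and the unused node is never switched off.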
Hence, the list of nodes taken into account grows as alternatives are used after node failures. Assuming a linear increase of “important” nodes in a growing network, this still does not change the basic complexity. However, in practice it reduces the number of tests considerably, delaying the point where the network becomes too large to be validated using the simulation method.
Mean time between failure
A reduction influencing the complexity can be found when considering the Mean Time Between Failure (MTBF). The MTBF is a statistical value, indicating the operating time after which half of the devices of a particular type can be expected to fail. Values for typical MTBF of nodes and links can be found in the relevant literature, for example in [Vasseur et al., 2004]. According to one observation, link failure depends on the length of cables. In this section we only look at node failures. Link failure is not inherently different, but would require taking cable length into account, which is not usually available in Verinec. Link failures are mostly important within the context of large backbones or provider networks, while Verinec is mostly aimed at local networks.
By calculating the probability of multiple simultaneous failures, the number of cases to be tested can be reduced dramatically. [Dooley, 2002] combines the probabilities of failure according to the MTBF of the devices with probabilities for the event of a number of devices failing simultaneously. He uses several simplifications, all of them overestimating the probability of failure. The most important of these is to assume that the failure rate is linear, reaching 50% of broken devices after the MTBF. This results in a probability of $\frac{1}{2 \cdot M}$ for a specific device to fail on a particular day, where M is the MTBF in days. In reality, the failure rate will be much lower until near the MTBF, where it increases rapidly.
According to Dooley, the probability for several devices to fail on the same day is
$$ {}_k p_n = \frac{n! \cdot (2m - 1)^{n-k}}{k! \cdot (n - k)! \cdot (2m)^n} = \binom{n}{k} \cdot \frac{(2m - 1)^{n-k}}{(2m)^n} $$
${}_k p_n$ is the probability of k failures on the same day in a network of n devices, with m being the MTBF measured in days.
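The formula can be evaluated directly. The following Python sketch is a plain transcription of it; the function name is chosen here for illustration. Exact integer arithmetic is used for numerator and denominator so that large n cause no floating-point overflow.

```python
from math import comb

def simultaneous_failure_probability(k, n, m):
    """Probability of exactly k out of n devices failing on the same day.

    m is the MTBF in days; each device is assumed to fail on a given day
    with probability 1 / (2 * m), independently of the others.
    """
    return comb(n, k) * (2 * m - 1) ** (n - k) / (2 * m) ** n

# 10 simultaneous failures among 1000 nodes with an MTBF of 2000 days
p = simultaneous_failure_probability(10, 1000, 2000)

# The probabilities over all k add up to 1 (checked for a small network).
total = sum(simultaneous_failure_probability(k, 50, 2000) for k in range(51))
```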
An example can illustrate this. With n=1’000 nodes relevant for the network and an MTBF of m=2’000 days (approximately 5 years), the probability of k=10 nodes failing on the same day is in the order of $10^{-13}$. Even in a big network, the probability of more than a few nodes failing simultaneously is extremely small. Figure 5.2 shows the probability of k failures on the same day for a number of nodes n from 500 to 5’000. In a small network, no failure on a specific day is most probable. With an increasing number of nodes, failures are to be expected.
Figure 5.2: Plot for probability of k simultaneous failures out of n nodes, with 500 to 5’000 nodes.
It is legitimate to stop testing for cases with small ${}_k p_n$. At the point where the probability for external problems affecting the usage of the network becomes much higher than the probability of so many devices failing independently, further improvements on the network stop being reasonable. Table 5.1 indicates how k evolves to satisfy ${}_k p_n < 0.0001$, assuming an MTBF of m=2’000 days. The limit of 0.0001 means that the untested failure combinations can be expected to occur less than once in 10’000 days or less than once every 27 years. The third row indicates the order of magnitude for the number of combinations of k elements out of n, $\binom{n}{k}$. While limiting k reduces the number of combinations tremendously ($2^{1000} \approx 10^{301}$ while $\binom{1000}{5} \approx 10^{12}$), there are still many simulation runs required for large networks with 1’000 nodes or more. Therefore, it is useful to combine this limit with the previously mentioned mechanisms to avoid unnecessary test runs.
n               100     500     1000    2000    3000    4000    5000
k               3       4       5       6       7       8       8
$\binom{n}{k}$  10^5    10^9    10^12   10^16   10^20   10^24   10^25
Table 5.1: Values of k for some n to have ${}_k p_n < 0.0001$
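A limit of this kind can be computed with a few lines of Python. The sketch below is one possible reading of the criterion and not necessarily the procedure used for the table: it returns the smallest k above the expected number of daily failures whose probability lies below the threshold. The function names are introduced here for illustration.

```python
from math import comb

def simultaneous_failure_probability(k, n, m):
    """Probability of exactly k out of n devices failing on the same day."""
    return comb(n, k) * (2 * m - 1) ** (n - k) / (2 * m) ** n

def failure_limit(n, m, threshold=0.0001):
    """Smallest k beyond the expected failure count with probability < threshold."""
    expected = n / (2 * m)     # mean number of failures per day
    for k in range(n + 1):
        if k > expected and simultaneous_failure_probability(k, n, m) < threshold:
            return k
    return None                # no such k: every combination would be tested

limit_small = failure_limit(100, 2000)
limit_large = failure_limit(1000, 2000)
```

With m=2’000 days, this yields k=3 for n=100 and k=5 for n=1’000.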
An improved selection of device combinations
In the previous discussion, we assumed that the Mean Time Between Failure (MTBF) is the same for all devices. If it varies considerably between the devices, it is possible to improve the selection of combinations needing to be tested. For this, the MTBF must be known individually for each device model.
Still assuming a linear distribution of the failures, the probability for a specific device to fail on a certain day is $\frac{1}{2 \cdot M_d}$, with $M_d$ being the MTBF for that device in days. Combining these probabilities gives the probability for any combination of failures as
$$ \prod_{d} \frac{1}{2 \cdot M_d} $$
where d takes the values of each device in the combination.
Instead of setting a limit to the number of devices failing simultaneously, a lower bound is now set on the probability of a combination failing. Combinations with a probability below this bound need not be examined.
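A selection by probability could look like the following Python sketch. The device names, MTBF values and the bound are hypothetical, and the function names are introduced here for illustration.

```python
from itertools import combinations
from math import prod

def combination_probability(combo, mtbf_days):
    """Product of the daily failure probabilities 1 / (2 * M_d) in the combination."""
    return prod(1 / (2 * mtbf_days[d]) for d in combo)

def combinations_to_test(mtbf_days, min_probability):
    """Select all failure combinations whose probability reaches the bound."""
    devices = sorted(mtbf_days)
    selected = []
    for size in range(1, len(devices) + 1):
        found = False
        for combo in combinations(devices, size):
            if combination_probability(combo, mtbf_days) >= min_probability:
                selected.append(combo)
                found = True
        # Adding a device can only lower the probability, so larger
        # combinations cannot qualify once a whole size yields nothing.
        if not found:
            break
    return selected

# Hypothetical MTBF values in days
mtbf = {"router": 4000, "switch": 2000, "old_server": 500}
selected = combinations_to_test(mtbf, min_probability=2e-7)
```

With these values, all single failures are selected, but only one pair: the two least reliable devices failing together.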
When considering the probability of an event, it can also help to weight the importance of the tests as mentioned in Section 5.6.1 and combine it with the probability of failure. Some tests can be run for more combinations of device failure than others. If a test is very important, it is run even if the combined failure probability is low.
Besides giving a more precise selection of what must be examined, this method would also help to weight error reports. Dooley’s method only allows a rough estimate of how probable a situation is, obtained by counting the number of device failures at which a test fails. With an individual MTBF for each device, a probability can be indicated for every situation. Weighted tests also show how problematic a failure would be. A single point of failure in an extremely reliable device might be less dangerous than two redundant but very unreliable devices.
There is, of course, a major drawback to this method: the administrator has to specify the MTBF for every device used in the network. In a homogeneous network with three or four types of devices, this would be feasible. However, in a heterogeneous environment, the work involved can quickly exceed the benefits gained by the more detailed simulation.
50 CHAPTER 5. VERIFICATION OF CONFIGURATION