Put in the simplest terms possible, a failure domain is the set of devices that must recalculate their control plane information in the case of a topology change. As an example, in Figure 6-2, Routers A and B are in the same failure domain because they share a common view of the network. Router C, however, is in a separate failure domain, at least in regard to 192.0.2.0/24.
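The definition above can be sketched as a reachability computation: start at the router that detects a change and follow every adjacency across which the change is advertised. This is a minimal sketch, not a model of any particular routing protocol. The three-router chain and the choice of Router B as the point that hides the change from Router C are assumptions made for illustration, because Figure 6-2 is not reproduced here; the function name failure_domain and the hiding_edges parameter are likewise hypothetical.

```python
from collections import deque


def failure_domain(links, origin, hiding_edges):
    """Return the routers that must recalculate when `origin` reports a change.

    links: iterable of (router, router) adjacencies.
    hiding_edges: set of directed (sender, receiver) pairs across which the
    change is NOT advertised (for example, the sender only sends an aggregate).
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seen = {origin}
    queue = deque([origin])
    while queue:
        router = queue.popleft()
        for peer in sorted(neighbors.get(router, ())):
            if peer in seen or (router, peer) in hiding_edges:
                continue
            seen.add(peer)
            queue.append(peer)
    return seen


# Assumed topology: A - B - C, with B hiding the change from C.
links = [("A", "B"), ("B", "C")]
print(sorted(failure_domain(links, "A", {("B", "C")})))  # ['A', 'B']
print(sorted(failure_domain(links, "A", set())))         # ['A', 'B', 'C']
```

With the hiding point in place, only Routers A and B recalculate; without it, the failure domain grows to include Router C.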
What’s the difference between a network module and a failure domain? A module is a demarcation of policy or intent, whereas a failure domain is a demarcation in the flow of control plane information. A module is a clearly definable area of the network topology.
The edges of network modules and the edges of failure domains often coincide; where we find one, we often find the other. This is because policy implementation almost always involves information hiding. This isn’t always true, but it’s true often enough that throughout this section we use the terms failure domain and network module interchangeably.
Remember, however, that failure domains and policy edges are not always in the same place in the network; although it’s convenient to think of them in the same way, they are really different things with different purposes.
As an example, consider the network in Figure 6-3.
Figure 6-3 Modules Versus Failure Domains
In this example, the reachability within each data center is aggregated at the edge routers B and C. Aggregating so the data center routes are not transmitted into the network core is a policy, but it also creates a failure domain boundary, so routers B and C are the edge of both a module (a specific place in the network, the data center, with specific policies and reachability information) and a failure domain. However, because there is a Layer 2 tunnel stretched between the two data center fabrics, these two data centers represent a single failure domain. Information about a single link failure at A must be transmitted all the way to router D in order to keep the control plane synchronized. These two data centers, although they are in the same failure domain, would not be considered the same module; each data center may have different service policies, different link bandwidth, and other policies. In this example, then, the failure domain edge aligns with the module edge in one case, and not in the other case.
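The aggregation step in this example can be illustrated with Python's standard ipaddress module. This is a minimal sketch under stated assumptions: the 10.1.x.0/24 prefixes, the 10.1.0.0/22 aggregate, and the core_view helper are hypothetical and do not come from Figure 6-3. The sketch also assumes the edge router keeps advertising the aggregate as long as at least one component route remains.

```python
import ipaddress


def core_view(component_routes, aggregate):
    """Return what the core sees from an edge router configured with `aggregate`.

    Assumption: the aggregate is advertised while any component route exists.
    """
    aggregate = ipaddress.ip_network(aggregate)
    covered = any(
        ipaddress.ip_network(route).subnet_of(aggregate)
        for route in component_routes
    )
    return {aggregate} if covered else set()


# Hypothetical routes inside one data center.
before = ["10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]
after = [route for route in before if route != "10.1.2.0/24"]  # one subnet fails

# The four contiguous /24 routes collapse into a single /22.
print(list(ipaddress.collapse_addresses(ipaddress.ip_network(r) for r in before)))

# The core's view is the same before and after the failure.
print(core_view(before, "10.1.0.0/22") == core_view(after, "10.1.0.0/22"))  # True
```

Because the core's view does not change when a single subnet fails, routers in the core have nothing to recalculate, which is what makes the aggregation point a failure domain boundary. The sketch does not model the Layer 2 tunnel, which carries failure information between the two data centers regardless of the aggregation.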
Separating Complexity from Complexity
Where should you hide information in a network? The first general rule is to separate complexity from complexity. For instance, in the network shown in Figure 6-4, it’s possible to hide information at Routers A and B toward Router C, or at Router C toward Router D. Which would be more optimal?
Figure 6-4 Optimal Information Hiding Points
In this example, the difference between these two choices might seem to be minor, but let’s examine the two possibilities in more detail. From Router D’s perspective, the only difference between the two options is that it receives information about two more routers (Routers A and B), and two more links ([A,C] and [B,C]); this isn’t that much information. From Router A’s perspective, however, the difference is between learning about every router, link, and destination reachable through Router B, including all state changes for those network elements, or only learning a single destination (in the optimal case), or destinations without links and routers (in the least optimal case) from B.
Clearly, hiding information at Routers A and B reduces the total information carried across all routers more than hiding information at Router C.
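A rough way to check this comparison is to add up the control plane state held across all routers under each option. The sketch below is a toy count, not a measurement: it assumes that each of Routers A and B fronts a complex topology of 20 routers and 60 elements (routers, links, and destinations), that every router holds one entry per element in its own domain plus one summary per other domain, and that Figure 6-4 is laid out as the text describes. All of the numbers and the total_state helper are hypothetical.

```python
def total_state(domains, summaries_per_domain=1):
    """Sum the control plane entries held across every router.

    domains: list of (router_count, element_count) pairs, one per domain.
    Each router holds its own domain's elements in detail, plus
    `summaries_per_domain` entries for each of the other domains.
    """
    total = 0
    for router_count, element_count in domains:
        other_domains = len(domains) - 1
        total += router_count * (element_count + other_domains * summaries_per_domain)
    return total


# Option 1: hide at Routers A and B toward Router C.
# Each complex side: 20 routers + its border router, 60 elements + the border router.
# Middle: Routers C and D, plus links [A,C], [B,C], and [C,D].
hide_at_a_and_b = [(21, 61), (21, 61), (2, 5)]

# Option 2: hide at Router C toward Router D.
# One large domain: both complex sides, Routers A, B, and C, links [A,C] and [B,C].
# Router D sits alone with link [C,D].
hide_at_c = [(43, 125), (1, 2)]

print(total_state(hide_at_a_and_b))  # 2660
print(total_state(hide_at_c))        # 5421
```

Under these assumed numbers, hiding at Routers A and B carries roughly half the total state of hiding at Router C, because no router in one complex topology has to hold the detail of the other.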
You will always find that hiding information at the point where any two complex topologies meet will provide the largest reduction in network-wide table sizes. This simple rule, however, must be balanced against the suboptimal routing that is always caused by removing state in the control plane, a topic discussed in the section “Modularization and Optimization,” later in the chapter.
Note: Another way of looking at this is to compartmentalize complexity: to separate complex bits of the topology from the rest of the network by placing them behind a module edge or making them into a separate failure domain. Either way, the point is to make certain that complex topologies don’t interact in the control plane, either directly or through some simpler part of the network.
Human Level Information Overload
There is a second piece to the information overload problem, as well: the ability of network operators to quickly understand, modify, and troubleshoot the network. Because this is clearly more of a business-related problem, impacting operational rather than capital costs, this side of information hiding rises above strict design and into the purview of the network architect.
There are two aspects to this side of information hiding: the ability of the network designer to clearly assign functionality, and repeatable configurations. Each of these two aspects is discussed in the following sections.
Clearly Assigned Functionality
Imagine setting up the filters for the network in Figure 6-5 so that Host A can reach Servers M, N, and C, while Host K can reach Servers F, M, and N.
Figure 6-5 Difficult Policy Problem
There are a number of ways to solve this problem, including the following:
Set up filters using the correct source and destination addresses at Routers B and H to enforce this policy.
Set up filters using the correct source and destination addresses at Routers D, G, J, and L to enforce this policy.
Create two VLAN overlays on the network, each containing the set of servers that Hosts A and K are allowed to reach.
All these solutions are going to be difficult to maintain over time. For instance, if Host A moves so it is no longer connected to Router B, but rather to Router H, then the filters associated with Host A must move with the host.
If any of the servers move, the filters enforcing these policies need to be checked for correctness, as well. If the policies change, the network administrator must find every point at which each policy is enforced, and make the correct modifications there as well—the more widely these policies are dispersed in the network, the more difficult this will be.
How could we make this problem easier to solve? By gathering each device in the network together by purpose or policy, so policies can be implemented along the edges leading to these groups of devices. This strategy of pulling devices with common policy requirements into a single module accomplishes the following:
Reduces the number of policy chokepoints in the network to the minimum possible, so network operators need to look in fewer places to understand end-to-end policy for a particular flow or set of hosts
Makes it possible to state a specific purpose for each module in the network, which allows the policies into and out of that module to be more easily defined and verified
Allows the gathering of policy problems into common sets
This gathering of policy into specific points of the network is one of the ways modularization helps network operators focus on one problem at a time.
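Grouping devices by policy can be sketched directly from the reachability requirements stated for Figure 6-5: Host A reaches Servers M, N, and C, while Host K reaches Servers F, M, and N. The sketch below places servers that share the same set of permitted hosts into the same candidate module. The group_by_policy function is hypothetical, and the result says nothing about where the servers physically attach in the figure; it only shows which servers share a policy.

```python
from collections import defaultdict

# Reachability requirements as stated in the text.
allowed = {
    "A": {"M", "N", "C"},
    "K": {"F", "M", "N"},
}


def group_by_policy(allowed):
    """Group servers by the exact set of hosts permitted to reach them."""
    hosts_per_server = defaultdict(set)
    for host, servers in allowed.items():
        for server in servers:
            hosts_per_server[server].add(host)

    modules = defaultdict(set)
    for server, hosts in hosts_per_server.items():
        modules[frozenset(hosts)].add(server)
    return dict(modules)


modules = group_by_policy(allowed)
for hosts, servers in sorted(modules.items(), key=lambda item: sorted(item[0])):
    print(sorted(hosts), "->", sorted(servers))
```

Three policy groups fall out of the two requirements: Servers M and N (reachable by both hosts), Server C (Host A only), and Server F (Host K only). If each group sits behind its own module edge, a policy change touches only the edge of the affected group.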
Repeatable Configurations
Gathering policy up along the edges between modules attacks the problem of operational complexity in a second way, as well. Each network device along the edge of a particular module should have a similar configuration, at least in terms of policy; hence they have repeatable configurations. Parameters such as IP address ranges may change, but the bulk of the configuration will remain the same.
In the same way, once policy is removed from the network devices internal to a module, those configurations are simplified as well. With careful management, the configuration of every router and/or switch internal to a network module should be very similar.
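One way to picture a repeatable configuration is as a shared template with a small number of per-device parameters. The sketch below uses Python's string.Template for this. The configuration lines are deliberately vendor-neutral and hypothetical; they are not the syntax of any real device, and the hostnames and address ranges are assumptions chosen for illustration.

```python
from string import Template

# Hypothetical, vendor-neutral syntax; not a real device configuration language.
EDGE_TEMPLATE = Template(
    "hostname $hostname\n"
    "advertise-aggregate $aggregate\n"
    "apply-policy module-edge-inbound\n"
    "apply-policy module-edge-outbound\n"
    "apply-policy module-edge-logging\n"
)


def render_edge(hostname, aggregate):
    """Render the shared edge configuration for one device."""
    return EDGE_TEMPLATE.substitute(hostname=hostname, aggregate=aggregate)


edge_b = render_edge("edge-b", "10.1.0.0/22")
edge_c = render_edge("edge-c", "10.2.0.0/22")

# Compare the two rendered configurations line by line.
differing = [
    (left, right)
    for left, right in zip(edge_b.splitlines(), edge_c.splitlines())
    if left != right
]
print(len(differing), "of", len(edge_b.splitlines()), "lines differ")  # 2 of 5 lines differ
```

Only the parameter lines differ between the two edge devices; the policy lines are identical, so an operator who understands one edge configuration understands the other.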
Mean Time to Repair and Modularization
The Mean Time to Repair (MTTR) is the time between a problem arising in the network and the network returning to a fully operational state.
This time can be broken down into two pieces:
The time it takes for the network to resume forwarding traffic between all reachable destinations
The time it takes to restore the network to its original design and operation
The first piece relates to machine-level information overload; the less information there is in the control plane, the faster the network is going to converge. The second relates to operator information overload; the more consistent configurations are, and the easier it is to understand what the network should look like, the faster operators are going to be able to track down any network problems. The relationship between MTTR and modularization can be charted as shown in Figure 6-6.
Figure 6-6 MTTR Versus Modularization
As we move from a single flat failure domain into a more modularized design, the time it takes to find and repair problems in the network decreases rapidly, driving the MTTR down. However, there is a point at which additional modularity starts increasing MTTR, where breaking the network into smaller domains actually causes the network to become more complex. To understand this phenomenon, consider the case of a network where every network device, such as a router or switch, has become its own failure domain (think of a network configured completely with static routes and no dynamic routing protocol). It’s easy to see that there is no difference between this case and the case of a single large flat failure domain.
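The shape of this curve can be reproduced with a deliberately simple toy model. The model is an assumption made for illustration and is not taken from Figure 6-6: it treats the cost of working within a failure domain as proportional to the average number of devices per domain, and the cost of understanding the boundaries as proportional to the number of domains. The toy_mttr function and the figure of 100 devices are hypothetical, and the result is in arbitrary units.

```python
def toy_mttr(devices, domains):
    """Toy repair-time estimate in arbitrary units (an assumed model).

    devices / domains: average devices per failure domain.
    domains: one unit of operator effort per failure domain boundary.
    """
    if not 1 <= domains <= devices:
        raise ValueError("domains must be between 1 and the number of devices")
    return devices / domains + domains


devices = 100
best = min(range(1, devices + 1), key=lambda count: toy_mttr(devices, count))

print(toy_mttr(devices, 1))        # single flat failure domain
print(toy_mttr(devices, best))     # lowest point on the toy curve
print(toy_mttr(devices, devices))  # every device is its own failure domain
```

In this toy model the two extremes cost the same, which matches the observation that a fully static network is no better than a single flat failure domain, and the minimum falls somewhere in between. Where the minimum falls in a real network depends on factors the model ignores.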
How do you find the right point along the MTTR curve? The answer is always going to be, “it depends,” but let’s try to develop some general rules.
First and foremost, the right size for any given failure domain is never going to be the entire network (unless the network is really and truly very small).
Almost any size network can, and should, be broken into more than one failure domain.
Second, the right size for a given failure domain is always going to depend on advances in control plane protocols, advances in processing power, and other factors. There were long and hard arguments over the optimal size of an OSPF area within the network world for years. How many LSAs could a single router handle? How fast would SPF run across a database of a given size? After years of work optimizing the way OSPF runs, and increases in processing power in the average router, this argument has generally been overcome by events.
Over time, as technology improves, the optimal size for a single failure domain will increase. Over time, as networks increase in size, the optimal number of failure domains within a single network will tend to remain constant. These two trends tend to offset one another, so that most networks end up with about the same number of failure domains throughout their life, even as they grow and expand to meet the ever increasing demands of the business.
So how big is too big? Start with the basic rules we’ve already talked about, building modules around policy requirements and separating complexity from complexity. After you get the lines drawn around these two things, and you’ve added natural boundaries based on business units, geographic locations, and other factors, you have a solid starting point for determining where failure domain boundaries should go.
From this point, consider which services need to be more isolated than others, simply so they will have a better survivability rate, and look to measure the network’s performance to determine if there are any failure domains that are too large.