Deploying management takes far more than finding the right things to measure, the right place to measure them, the right way to measure them, and the right processes to control the different management areas. This section details some of the practical points network architects need to consider when deploying network management.
Loosen the Connection Between Collection and Management
One of the major problems surrounding network management is the natural affinity, or tie-in, between the reason for collecting data and the data store itself. Using the FCAPS model, for instance, the interface speed might be collected multiple times (once for fault management, once for configuration management, once for accounting, and once for performance management) and then processed and stored separately each time.
This is a bad thing.
Network architects should, instead, treat the measurements taken off the network as one large “pool” of data. Although specific questions might generate the original or ongoing data set, measurement data should be stored in a way that is only loosely connected with the original purpose for which the data is collected. There should be a conscious effort to provide interfaces into the network management data so it can be used for purposes other than the original intent. In other words, data collected off the network should be placed in a store with a common set of interfaces that anyone (with the right credentials and reasons) can access.
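A minimal sketch of such a shared store, assuming a simple in-memory design: the `Measurement` and `MeasurementPool` names, the metric string, and the device name are all hypothetical and chosen only for illustration. The point is that a value is recorded once and then read through one common interface by any consumer.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    device: str       # hypothetical device name
    metric: str       # hypothetical metric name, e.g. "if_speed_bps"
    value: float
    timestamp: float  # seconds since the epoch


class MeasurementPool:
    """One shared store; consumers query by metric, not by who collected it."""

    def __init__(self):
        self._by_metric = defaultdict(list)

    def record(self, m):
        self._by_metric[m.metric].append(m)

    def query(self, metric, device=None):
        rows = self._by_metric.get(metric, [])
        return [m for m in rows if device is None or m.device == device]


pool = MeasurementPool()
pool.record(Measurement("rtr1", "if_speed_bps", 1e9, 0.0))

# Fault tooling and performance tooling read the same single sample.
fault_view = pool.query("if_speed_bps", device="rtr1")
perf_view = pool.query("if_speed_bps")
```

A real deployment would add access control and persistent storage; the sketch only shows the separation between collecting a value and deciding who uses it.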
Finally, the connection can be loosened by using monitoring information in new and unexpected ways. Data analytics techniques developed for big data and data science work in retail, production, and other operations areas can be applied to network measurements to mine for useful trends and analysis. You might know, for instance, that traffic in your network follows a working hours/off hours pattern, but are there deeper or less obvious patterns that might help you better use the bandwidth available? How can you relate network usage to energy usage in large-scale data centers, and what is the trade-off between moving loads to save energy versus the cost in terms of performance and the actual movement of the packets?
Information gathered from all around the network should be available for uses far beyond the original intent; there should be processes and mechanisms in place to find the hard data that will help you predict failures, manage cost, and adjust to growth more quickly.
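As one small illustration of mining pooled data for patterns, the sketch below averages utilization samples by hour of day, which is enough to expose a working hours/off hours pattern. The `hourly_profile` function and the sample values are hypothetical, and bucketing in UTC is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_profile(samples):
    """Average value per hour of day (UTC) from (epoch_seconds, value) pairs."""
    buckets = defaultdict(list)
    for ts, value in samples:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        buckets[hour].append(value)
    return {hour: sum(v) / len(v) for hour, v in sorted(buckets.items())}


# Hypothetical utilization samples: two in the 09:00 hour, one in the 02:00 hour.
samples = [(9 * 3600, 0.80), (9 * 3600 + 600, 0.60), (2 * 3600, 0.10)]
profile = hourly_profile(samples)
```

The same grouping could be applied to any other key in the pool (day of week, site, application) when looking for less obvious patterns.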
Sampling Considerations
The question often arises whether to sample data or to gather all the information, whether at the flow level (when using NetFlow, for instance), at the control plane level (should I capture the time required to run SPF on every router, or just a few?), or in many other places. In the end, you’re always going to end up sampling data; there’s simply too much of it to collect wholesale. The real question network architects should be asking is, “Which samples do I want to take, and why?”
The answer to this question is the inevitable “it depends,” but there are some helpful guidelines:
How useful will the data be if it is only sampled, rather than collected wholesale? The behavior of an application that sends only a few dozen packets in each hour probably isn’t going to be discernible from a sampled measurement. On the other hand, the average time each SPF run takes in an IS-IS network can probably be correctly ascertained through sampled data.
Are you looking for trends or for hard data? Doctors often tell their patients, “Weight is a trend, not a number.” At least that’s what they say when they’re trying to be kind to their patients—but the point still holds. Sampling is often useful for spotting trends, whereas for hard numbers, you’ll need more precise wholesale monitoring.
How self-similar is the data you’re monitoring? If there is a larger network traffic pattern, such as a sudden uptick in traffic at the beginning of the workday, and a quick drop off at the end of the workday, and you already know about this daily change through gross sampling mechanisms (such as interface counters), you might want to focus your measurement efforts at the packet level on more fine-grained changes. In this case, you might want to modify the sampling rate so it’s in line with the known traffic patterns, collecting samples more frequently when the traffic levels are high, and less frequently when the traffic levels are lower.
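Two of these guidelines can be made concrete with a little arithmetic. The sketch below assumes independent random 1-in-N packet sampling, which is a simplification, and the function names and default intervals are hypothetical. The first function estimates how likely a low-volume application is to appear in sampled data at all; the second picks a sampling interval that tightens as load rises.

```python
def detection_probability(packets, sample_one_in):
    """Chance that random 1-in-N sampling catches at least one of `packets`."""
    return 1.0 - (1.0 - 1.0 / sample_one_in) ** packets


def sampling_interval(load, busy=100, quiet=1000):
    """Pick a 1-in-N interval: sample more often as load (0.0 to 1.0) rises.

    Linear interpolation between `quiet` (idle link) and `busy` (full link);
    both defaults are illustrative, not recommendations.
    """
    load = min(max(load, 0.0), 1.0)
    return round(quiet - (quiet - busy) * load)


# An application sending 36 packets an hour, sampled 1-in-1000:
p_seen = detection_probability(36, 1000)  # roughly 0.035

peak = sampling_interval(0.9)   # 1-in-190 during the workday peak
night = sampling_interval(0.1)  # 1-in-910 overnight
```

With these assumed numbers, the quiet application would show up in only about one hour out of thirty, which is why sampled data is a poor fit for it.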
Where and What
It’s useful to not only have a model of management techniques and processes, but also of the network as a measurable object. Although network engineers often think of the network within the context of the OSI (or seven layer) model, this isn’t necessarily the best model for determining what to measure, nor how to measure it (see Chapter 4, “Models,” for more information on a number of different models used to describe networks). Figure 10-1 illustrates a model that’s useful for thinking through network measurement techniques.
Figure 10-1 Model of Network Measurement
The objective of this model is to provide a framework for asking questions, rather than providing answers. At each of the intersections, there is a series of questions that can be asked, and a set of measurement techniques that will provide the answers. Along the bottom are different physical locations within the network, and along the right are different protocol layers. The key point is to determine which of these intersections need to be measured and how each measurement should be taken. Let’s examine some of these intersections to see how this works.
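One way to work with the model is to record a measurement plan keyed by intersection and then look for gaps. The sketch below is illustrative only: it uses just the three intersections this section goes on to discuss, the full axes of Figure 10-1 are not reproduced, and the `MEASUREMENT_PLAN`, `techniques_for`, and `unmeasured` names are hypothetical.

```python
# Keys are (location, protocol layer) intersections named in this section;
# the technique lists come from the text that discusses each intersection.
MEASUREMENT_PLAN = {
    ("end-to-end", "network"): ["generated probe traffic", "NetFlow"],
    ("interface", "transport"): ["show commands", "SNMP", "NETCONF"],
    ("failure domain", "control plane"): ["SNMP", "NETCONF", "show commands"],
}


def techniques_for(location, layer):
    """Return the techniques planned for one intersection, or [] if none."""
    return MEASUREMENT_PLAN.get((location, layer), [])


def unmeasured(locations, layers):
    """List intersections that have no planned measurement yet."""
    return [(loc, lay) for loc in locations for lay in layers
            if (loc, lay) not in MEASUREMENT_PLAN]


gaps = unmeasured(["end-to-end", "interface"], ["network", "transport"])
```

Not every gap needs filling; the list is a prompt for deciding which intersections matter, in keeping with the model's role as a framework for questions.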
End-to-End/Network
For any given path taken by traffic through the network, the end-to-end performance of the network protocol (IPv4 or IPv6 in most cases) can be measured in a number of ways. For instance, the end-to-end delay, jitter, bandwidth, round-trip time, reachability, and packet loss can all be measured for each IP path through the network. Ways in which you might measure this information include the following:
Generating traffic along a given path and measuring the various parameters for the generated flow. This has the advantage of providing measurement traffic on demand, but it has the disadvantages of measuring traffic that’s not real and adding load to the network. A number of management software and network testing packages can produce traffic flows with specific characteristics for measuring the performance of an end-to-end path through a network; IP SLA is one example.
Passively measuring traffic generated by applications already running on the network. The advantages are that you’re measuring real traffic and you’re not adding load to the network. The disadvantage is that real application traffic might not always be available. NetFlow is probably the best known example of a measurement protocol in this area, although a number of other options might be viable in very narrow situations, such as logged access lists.
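Whichever way the samples are gathered, the end-to-end figures reduce to simple arithmetic over per-probe results. The sketch below summarizes one run of generated probes; the `path_stats` function is hypothetical, and jitter is defined here as the mean absolute change between consecutive round-trip times, a simple definition chosen for illustration rather than the one any particular tool uses.

```python
from statistics import mean


def path_stats(sent, rtts_ms):
    """Summarize one probe run: `sent` probes (> 0), RTTs for those answered."""
    received = len(rtts_ms)
    # Mean absolute change between consecutive RTTs (illustrative jitter).
    deltas = [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]
    return {
        "loss_pct": 100.0 * (sent - received) / sent,
        "avg_rtt_ms": mean(rtts_ms) if rtts_ms else None,
        "jitter_ms": mean(deltas) if deltas else 0.0,
    }


# Five probes sent, four answered.
stats = path_stats(sent=5, rtts_ms=[10.0, 12.0, 11.0, 15.0])
```

Here the run shows 20 percent loss and a 12 ms average round-trip time; the same function works for passively observed samples as long as the count of expected packets is known.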
Interface/Transport
At every hop, a transport layer packet, such as a Transmission Control Protocol (TCP) segment, must pass through a pair of interfaces. Each of these interfaces interacts with the packet through queuing, drops, serialization delay, and other factors. Each of these factors should (though they can’t always), in turn, be measured in some way, normally through device-level show commands, information retrieved through SNMP or NETCONF, or some other mechanism.
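Counters retrieved from a device usually need some processing before they answer a question. The sketch below assumes two polls of an interface octet counter and shows how utilization might be derived, including the case where a fixed-width counter wraps once between polls; it also computes serialization delay directly from packet size and link speed. All function names and values are hypothetical.

```python
def counter_delta(old, new, bits=64):
    """Difference between two counter readings, allowing for one wrap."""
    return (new - old) % (1 << bits)


def utilization(old_octets, new_octets, seconds, speed_bps, bits=64):
    """Fraction of link capacity used between two polls of an octet counter."""
    octets = counter_delta(old_octets, new_octets, bits)
    return (octets * 8) / (seconds * speed_bps)


def serialization_delay(packet_bytes, speed_bps):
    """Seconds needed to clock one packet onto the wire."""
    return (packet_bytes * 8) / speed_bps


# Hypothetical 32-bit counter that wrapped between polls 60 s apart
# on a 100 Mb/s interface.
used = utilization(4_294_967_000, 374_999_704, 60, 100_000_000, bits=32)

# A 1500-byte packet on the same interface.
delay = serialization_delay(1500, 100_000_000)
```

The modulo handles only a single wrap; if the polling interval is long enough for the counter to wrap more than once, the result is silently too low, which is one reason to poll wider counters where the device offers them.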
Failure Domain/Control Plane
The most crucial area in which control plane performance needs to be examined is not end-to-end, at the network level, but within each failure domain. Understanding the amount of time it takes for a routing protocol to converge around a failed link within each specific failure domain can expose which failure domains are likely too large, or poorly designed, and which can be expanded in scope and size with minimal risk.
Most of the measurements taken here are going to (necessarily) be from Simple Network Management Protocol (SNMP), the Network Configuration Protocol (NETCONF), show commands, and other “white box” measurements. There are techniques to determine when an OSPF router has converged from within the protocol itself (see RFC 4061, RFC 4062, and RFC 4063), but these techniques are not widely deployed by testing equipment vendors. So long as information in these areas is available directly from the devices, there is little reason to add the complexity of “off box measurements,” unless there’s some specific situation that requires it.
Network engineers should primarily examine the amount of time required to calculate new routing information after a link or device changes state (remembering there are four steps to this process; see Chapter 8, “Weathering Storms,” for more information). Measurements should be taken both from the perspective of the network layer (IPv4 and IPv6) and from the perspective of the control plane protocols (time required to run an SPF, for instance, in an IS-IS deployment).
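Given timestamps gathered from the devices, convergence time can be broken down per stage and compared across failure domains. The sketch below is illustrative only: the text says there are four steps but does not name them here, so the stage names are an assumption, as are the function names, the domain names, and the millisecond values.

```python
# Stage names are an assumption for illustration; this section only says
# there are four steps to the process.
STAGES = ["detected", "propagated", "calculated", "installed"]


def convergence_breakdown(events):
    """Per-stage and total convergence time from {stage: timestamp_ms}.

    `events` holds the time each stage completed, plus "failed" for the
    moment the link or device changed state.
    """
    previous = events["failed"]
    breakdown = {}
    for stage in STAGES:
        breakdown[stage] = events[stage] - previous
        previous = events[stage]
    breakdown["total"] = events[STAGES[-1]] - events["failed"]
    return breakdown


def worst_domain(totals_by_domain):
    """Name of the failure domain with the slowest total convergence."""
    return max(totals_by_domain, key=totals_by_domain.get)


events = {"failed": 0, "detected": 50, "propagated": 80,
          "calculated": 200, "installed": 260}
breakdown = convergence_breakdown(events)
slowest = worst_domain({"area-1": breakdown["total"], "area-2": 900})
```

A breakdown like this shows whether the calculation itself dominates or whether another stage does, and the per-domain comparison points at the failure domains that may be too large.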