The primary contribution of this dissertation is to establish the feasibility of measuring and analyzing the performance of services, based only on the transport-level information in TCP/IP protocol headers on packets sent and received by servers. Specifically, this contribution com- prises two sub-contributions. First, I measure the server performance by building a compact representation of the application-level dialog carried over the connection (Chapter 3). Second, I analyze the performance in three parts: building a profile of typical (or “normal”) server perfor- mance (Chapter 4); detecting performance anomalies, or deviances from this profile (Chapter 4); and diagnosing the cause of such anomalies (Chapter 5).
My thesis statement is as follows:
It is possible to detect and, to an extent, diagnose server performance issues in a network with hundreds of servers providing multiple diverse types of services without perturbing those services, by accessing only the information available in the TCP/IP headers of the packets the server receives and sends.
I will now introduce my solution and list my novel contributions.
I use the server response time to measure the server performance. The response time is the amount of time elapsed between a request from the client to the server and the server’s subsequent response. (I do not claim that this measure is novel.) Intuitively, the longer the response time, the worse the performance. A naive approach to detecting server performance issues using the server response time will not work. We cannot simply flag long response times as indicative of anomalies for two reasons. First, the definition of what is anomalous is different for different servers. For example, an SQL database server might be expected to take minutes or even hours to respond, whereas a second might be a long time for a chat server’s response. Second, any server will have at least a few long response times as part of its normal operation. For example, most email messages sent to a outgoing mail server can be sent in less than a second, but the occasional five megabyte attachment will take longer, and we would not call that longer response time an anomaly.
To demonstrate the feasibility of my solution, I measured server response time by using a network trafficmonitor to observe the traffic flowing between UNC and its commodity Internet up-link. The monitor observes any TCP connection involving a UNC host and an external host1. This setup achieves passive measurement, as I will discuss in more detail in Section 3.1.1.
Instead of using a simple threshold on the server response time to detect anomalies, my approach builds, for each server, a profile of the typical distribution of response times for that server. The profile is, in a sense, a summary of many response times, and detecting a performance issue is fundamentally a determination of how well the summary of recent response times compares to the historical profile (or profiles). Because each server has its own profile, observing ten-second response times might be indicative of a performance issue for some servers but not for others.
Not only is a profile unique to a given server, but they also vary by different amounts. For example, the individual distributions comprising the profile for one server might be nearly identical. In that case, a new distribution that deviates from the profile even a small amount might be considered highly unusual. On the other hand, other servers might have a profile
1
except for certain networks that are routed on one of the other routes to the Internet, such as educational institutions (on Internet2)
containing individual distributions that vary widely. In that case, a new distribution would need to be extremely different from the profile before it was considered unusual. Even more subtle, a profile might tolerate wide variation in some places but little variation in other places. For example, many services see wide variation among the tail of response time distributions, but little variation elsewhere, so it makes sense to penalize a distribution that deviates from the profile only in the tail less than one that deviates from the profile only in the body.
The approach corresponds with a user’s intuitive understanding of performance issues. A user is not likely to complain about the performance of a server that is always slow; instead, the user learns to expect that behavior from the server. Also, if the response times vary greatly for the server, the occasional long response time will probably not bother the user greatly; however, a long response time in a typically very responsive service such as an interactive remote-desktop session would be more likely to elicit a complaint. Thus, there is a correlation between what a user would deem problematic performance and what my methods detect as problematic performance. I will demonstrate this correlation more quantitatively in Section 5.4. Furthermore, note that, by continually applying this approach, the network manager can now proactivelydiscover problems and hopefully fix them before they become severe enough to elicit user complaints.
It is not just performancedegradations that are interesting to a network manager, but, more generally, performance anomalies. A good solution should detect when the performance for a server is unusually bad and also when it is unusually good. For example, a misconfiguration might be introduced by a web server upgrade, causing the web server to serve only error pages and yielding faster response times than usual. My approach detects this sort of performance anomaly as well.
In addition to discovering performance issues, I also provide a way to diagnose the issues. This diagnosis is necessarily from an outside perspective, and I cannot incorporate knowledge of packet payloads. However, we will see in Chapter 5 that the diagnosis can be quite helpful to the network manager. To diagnose issues, I can use contextual information provided by my measurement tool, such as the number of connections to a server, as well as the structure of the application-layer dialogs of connections. For example, one diagnosis might be that the third response time of each connection is the one causing the performance anomaly.
In the process of evaluating my thesis, I have collected a multi-terabyte dataset of in- formation about UNC network traffic. This dataset may conceivably have uses other than server performance management, and I will make an anonymized version available to network researchers through the Internet Measurement Data Catalog2.
Because my approach has a single monitoring point as a data source, it is easily deployed. Furthermore, the hardware driving my approach is relatively inexpensive, especially compared to some commercial solutions requiring more extensive deployment of hardware throughout the network.
Another aspect of my method that is important in large networks is the ability to automat- ically discover servers and build a profile for them. It is difficult in a large organization to know about all the servers in the network, especially when individual departments and divisions are free to setup their own servers without the aid of the technology support department.
Another important aspect of my solution is that they are tunable to various situations. For example, we will see in Section 5.5 a server that experienced a profound, ongoing shift in the response times. My methods provide the network manager the ability to redefine what “normal” traffic looks like for that server, to avoid being bombarded with alarms. Also, various parameters control the performance anomaly detection mechanism, which I discuss in more detail in Sections 4.7 and 4.8.
In summary, my novel contributions are as follows:
• A one-pass algorithm for continuous measurement of server performance from TCP/IP headers at high traffic rates (Section 3.1)
• A tool for measuring server performance using a one-pass algorithm that can keep up with high traffic rates (Section 3.1)
• A multi-terabyte, months-long dataset of server performance and other information (Sec- tion 3.3)
• A method for building a profile of “normal” performance for a server (Chapter 4)
– ...regardless of thescale of response times
2
http://imdc.datcat.org
– ...regardless of thedistribution of response times
– ...regardless of how widely the performance typically varies for the server
• A metric characterizing the extent to which a distribution of response times is anomalous relative to the normal profile (Section 4.4)
• The ability todiagnose performance issues (Chapter 5).
The rest of this dissertation is organized as follows. First, the rest of this chapter provides an overview of the dissertation. Chapter 2 provides some background and discusses related work. Chapter 3 discusses the measurement of server performance, including the algorithms involved and the data generated by the measurement. Chapter 4 details the performance anomaly detection techniques I used, which are based on statistical techniques originally developed for a medical imaging application. Chapter 5 gives a series of eight case studies involving servers in the UNC network that had severe anomalies, to showcase the diagnosis abilities afford to a network manager by my approach. Chapter 7 concludes, and Appendix A gives the full anomaly detection results.