Analysis of Variance
2.8 Dealing with Large Data Sets
All too often, the problem in the statistical analysis of modern computer systems is not the lack of experimental data but its surfeit. With the extensive logging of system components and the growing number of components in a system, the analyst is often confronted with the daunting task of extracting comprehensible and statistically valid results from large volumes of data. Therefore, the practice of statistical analysis, long focused on the extraction of statistically valid results from a handful of experiments, changes its character. This section discusses a pragmatic approach to the analysis of large data sets, based on the author’s own experiences over the past two decades.7
7. An alternative view from the perspective of computer system performance evaluation can be found in R. Jain, The Art of Computer Systems Performance Analysis, Wiley, 2001.
In contrast to the classical approach of carefully designing an experiment to test a hypothesis, contemporary systems evaluation focuses, at least initially, on data exploration. We typically have access to a large compendium of logs and traces, and the questions we would like to answer typically fall into the following broad categories.
How can the data help us identify the cause of poor overall performance?
What is the relative performance of alternative implementations of one component of the system?
Are there implicit rules that describe the data?
In answering these questions, the following procedure has proved useful.
1. Extract a small sample from the entire data set and carefully read through it. Even a quick glance at the data will often point out salient characteristics that can be used to speed up subsequent analysis. Moreover, doing so allows a researcher to spot potential problems, such as certain variables not being logged or having clearly erroneous values. Proceeding with a complex analysis in the presence of such defects only wastes time.
2. Attempt to visualize the entire data set. For example, if every sample could be represented by a point, the entire data set could be represented by a pixelated bitmap. The human eye is quick to find nonobvious patterns but only if presented with the entire data set. If the data set is too large, it may help to subsample it, taking every fifth, tenth, or hundredth sample before visualization. This step will often result in detecting patterns that may otherwise be revealed only with considerable effort.
3. Look for outliers. The presence of outliers usually indicates a deeper problem with either data collection or data representation (e.g., due to underflow or overflow). Analyzing the outliers often uncovers problems in the logging or tracing software, in which case the entire data set may have to be collected again. Even if part of the data set can be sanitized to correct for errors, it is prudent to collect the data set again.
4. Formulate a preliminary null hypothesis. Choose this hypothesis with care, being conservative in your selection, so that the nonrejection of the hypothesis does not lead you to a risky conclusion.
5. Use the data set to attempt to reject the hypothesis, using the techniques described earlier in this chapter.
6. Frame and test more sophisticated hypotheses. Often, preliminary results reveal insights into the structure of the problem whose further analysis will require the collection of additional data. The problem here is that if data is collected at different times, it is hard to control extraneous influences. The workload may have changed in the interim, or some system components may have been upgraded. Therefore, it is prudent to discard the entire prior data set to minimize the effects of uncontrolled variables. Step 6 may be repeated multiple times until the initial problem has been satisfactorily answered.
7. Use appropriate graphics to present and interpret the results of the analysis.8
When dealing with very large data sets, where visualization is impossible, techniques derived from data mining and machine learning are often useful. We briefly outline two elementary techniques for data clustering.
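Before turning to clustering, the first three steps of the procedure above (reading a small sample, subsampling before visualization, and looking for outliers) can be illustrated with a minimal sketch. The function names, the hypothetical trace, and the choice of an interquartile-range rule for flagging outliers are all assumptions made for illustration; the text does not prescribe a particular outlier test.

```python
import random
import statistics


def random_sample(records, n, seed=0):
    """Step 1: draw a small random sample to read through by hand."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(records, min(n, len(records)))


def subsample(records, stride):
    """Step 2: keep every stride-th record to thin the data before plotting."""
    return records[::stride]


def flag_outliers(values, k=1.5):
    """Step 3: flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    This interquartile-range rule is one common convention, assumed here;
    any other outlier test could be substituted.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical logged values; -1 and 2**32 - 1 mimic a logging error
# and an overflowed 32-bit counter.
trace = [500, 520, 540, 560, 580] * 200 + [-1, 2**32 - 1]

to_read = random_sample(trace, 20)   # small enough to read line by line
to_plot = subsample(trace, 10)       # every tenth record
suspects = flag_outliers(trace)      # [-1, 4294967295]
```

Flagged values are candidates for inspection, not automatic deletion: as step 3 notes, they often point to a fault in the logging software itself.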
The goal of a data-clustering algorithm is to find hidden patterns in the data: in this case, the fact that the data can be grouped into clusters, where each cluster represents closely related observations. For example, in a trace of packets observed at a router interface, clusters may represent packets that fall into a certain range of lengths. A clustering algorithm automatically finds clusters in the data set that, for our example, would correspond to a set of disjoint ranges of packet lengths.
A clustering algorithm takes as input a distance metric that quantifies the concept of a distance between two observations. Distances may be simple, such as the difference between two packet lengths, or more complex, such as the number of edits (that is, insertions and deletions) that need to be made to a string-valued observation to transform it into another string-valued observation. Observations within a cluster will be closer, according to the specified distance metric, than observations placed in different clusters.
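The two kinds of distance mentioned above can be sketched as follows. The function names are hypothetical. The string distance counts only insertions and deletions, as in the text, and is computed here from the length of the longest common subsequence, which is one standard way to obtain it.

```python
def length_distance(a: int, b: int) -> int:
    """Simple metric: absolute difference between two packet lengths."""
    return abs(a - b)


def indel_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions and deletions
    needed to transform s into t.

    Every character outside a longest common subsequence (LCS) must be
    deleted from s or inserted from t, so the distance is
    len(s) + len(t) - 2 * LCS(s, t).
    """
    # prev[j] is the LCS length of the prefix of s processed so far and t[:j].
    prev = [0] * (len(t) + 1)
    for ch in s:
        cur = [0]
        for j, tj in enumerate(t, start=1):
            if ch == tj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return len(s) + len(t) - 2 * prev[-1]


print(length_distance(40, 1500))             # 1460
print(indel_distance("kitten", "sitting"))   # 5
```

Either function can be passed to a clustering routine as the distance metric; the algorithm itself does not need to know what kind of observation it is comparing.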
In agglomerative clustering, we start with each observation in its own cluster. We then merge the two closest clusters (initially, the two closest observations) into a single cluster and repeat the process until the entire data set is in a single cluster. Note that to carry out repeated mergings, we need to define the distance between a point and a cluster and between two clusters. The distance between a point and a cluster can be defined either as the distance from that point to the closest point in the cluster or as the average of all the distances from that point to all the points in the cluster. Similarly, the distance between two clusters can be defined to be the closest distance between their points or the distance between their centroids. In either case, we compute a tree such that links higher up in the tree correspond to larger distances. We can therefore truncate the tree at any level and treat the forest so created as the desired set of clusters. This approach usually does not scale beyond about 10,000 observation types on a single server; distributed computation techniques allow the processing of larger data sets.
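A minimal sketch of agglomerative clustering follows, assuming the closest-point definition of the distance between clusters (often called single linkage). The function names and the sample packet lengths are hypothetical. Rather than building the full tree and truncating it, the sketch simply stops merging once the requested number of clusters remains, which yields the same forest. It is deliberately naive: every round compares all pairs of clusters, which is why such an approach does not scale to large data sets.

```python
from itertools import combinations


def closest_point_distance(c1, c2, distance):
    """Distance between two clusters: the closest distance between their points."""
    return min(distance(a, b) for a in c1 for b in c2)


def agglomerate(observations, distance, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    Returns the clusters and the distance at which each merge happened.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    clusters = [[x] for x in observations]  # each observation starts alone
    merge_distances = []
    while len(clusters) > num_clusters:
        # Find the pair of clusters (i < j) that are closest to each other.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: closest_point_distance(
                clusters[p[0]], clusters[p[1]], distance
            ),
        )
        merge_distances.append(
            closest_point_distance(clusters[i], clusters[j], distance)
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters, merge_distances


# Hypothetical packet lengths observed at a router interface.
lengths = [40, 44, 52, 570, 576, 1480, 1500]
clusters, merge_distances = agglomerate(lengths, lambda a, b: abs(a - b), 3)
print(clusters)         # [[40, 44, 52], [570, 576], [1480, 1500]]
print(merge_distances)  # [4, 6, 8, 20]
```

The merge distances never decrease from one merge to the next in this example, which reflects the property that links higher up in the tree correspond to larger distances.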
The k-means clustering technique clusters data into k classes. The earliest and most widely used algorithm for k-means clustering is Lloyd’s algorithm, in which
8. An excellent source for presentation guidelines is E. Tufte, The Visual Display of Quantitative Information, 2nd ed., Graphics Press, 2001.