Analysis of Variance
we start with a set of k empty containers. We partition the observations into k sets, either randomly or on the basis of a subsample, allocating one set to each container.
For each container, we then compute its centroid, that is, the point that minimizes the sum of squared distances from all points in the set to itself. Now, each point is reallocated to the container with the closest centroid. This may move the container's centroid to a different point. We therefore recompute the centroid for each container and reallocate points as before. This process iterates until convergence, when no points move from one cluster to another. In most practical cases, the algorithm converges after a few iterations, often to a globally optimal clustering. However, it may instead converge to a local optimum. Several variants of this algorithm with better convergence properties are described in texts on machine learning and data mining.
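The iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the random initial partition, the squared Euclidean distance, and the handling of an empty container (reseeding it with a randomly chosen point) are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points into k containers; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initial partition: allocate each point to a randomly chosen container.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Recompute the centroid of every container.
        centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if not members:
                # Assumption: an empty container is reseeded with a random point.
                members = [rng.choice(points)]
            centroids.append(centroid(members))
        # Reallocate each point to the container with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Converged: no point moved to another container.
        labels = new_labels
    return centroids, labels
```

Because the result depends on the initial partition, a common precaution is to run the procedure from several different seeds and keep the clustering with the smallest total within-cluster squared distance.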
2.9 Common Mistakes in Statistical Analysis
We now present some common problems in statistical analysis, especially in the context of computer systems.
2.9.1 Defining Population
Statistical analyses commonly omit a precise statement of the underlying population. As we saw in Section 2.1, the same sample can correspond to multiple underlying populations. It is impossible to interpret the results of a statistical analysis without carefully justifying why the sample is representative of the chosen underlying population.
2.9.2 Lack of Confidence Intervals in Comparing Results
Comparing the performance of two systems simply by comparing the mean values of performance metrics is an all-too-common mistake. The fact that one mean is greater than another is not, by itself, statistically meaningful and may lead to erroneous conclusions. The simple solution is to always compare confidence intervals rather than means, as described in Section 2.4.5.
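As a rough illustration of comparing intervals rather than means, the following sketch computes a confidence interval for each of two samples and checks whether the intervals overlap. It assumes a normal approximation, whereas a t distribution would be more appropriate for samples this small. The function names and the response-time values are hypothetical and chosen only for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(sample, confidence=0.95):
    """Confidence interval for the mean, using a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(sample) / sqrt(len(sample))
    m = mean(sample)
    return m - half_width, m + half_width

def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical response times (ms) measured on two systems.
system_a = [10, 12, 11, 13, 9, 12, 10, 11]
system_b = [11, 13, 10, 14, 12, 11, 13, 12]

ci_a, ci_b = mean_ci(system_a), mean_ci(system_b)
if intervals_overlap(ci_a, ci_b):
    print("Intervals overlap: the means alone do not separate the systems.")
else:
    print("Intervals are disjoint: the difference is unlikely to be chance.")
```

Here the mean of system_b is larger, yet the two intervals overlap, so ranking the systems by their means alone would not be justified by these data.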
2.9.3 Not Stating the Null Hypothesis
Although the process of research necessitates a certain degree of evolution of hypotheses, a common problem is to carry out a statistical analysis without stating the null hypothesis. Recall that we can only reject or not reject the null hypothesis from observational data. Therefore, it is necessary to carefully formulate and clearly state the null hypothesis.
2.9.4 Too Small a Sample
If the sample size is too small, the confidence interval associated with the sample is wide, so that even a null hypothesis that is false may not be rejected. By computing the confidence interval around the mean during exploratory analysis, it is possible to detect this situation and to collect larger samples for populations with greater inherent variance.
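One way to act on this during exploratory analysis is to estimate, from a small pilot sample, how many observations are needed for the confidence interval to reach a desired half-width. The sketch below assumes a normal approximation and treats the pilot standard deviation as a stand-in for the population value. The function name and the pilot data are hypothetical.

```python
from math import ceil
from statistics import NormalDist, stdev

def required_sample_size(pilot, half_width, confidence=0.95):
    """Estimated sample size so that the CI half-width is at most half_width."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil((z * stdev(pilot) / half_width) ** 2)

# Hypothetical pilot samples with the same mean but different variance.
low_variance = [10, 11, 9, 10, 10, 11, 9, 10]
high_variance = [5, 15, 8, 12, 2, 18, 10, 10]

print(required_sample_size(low_variance, half_width=1.0))
print(required_sample_size(high_variance, half_width=1.0))
```

The population with greater variance requires a much larger sample for the same interval width, which matches the advice above.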
2.9.5 Too Large a Sample
If the sample size is too large, a sample that deviates even slightly from the null hypothesis will cause the null hypothesis to be rejected, because the width of the confidence interval around the sample mean varies as $1/\sqrt{n}$, where $n$ is the sample size. Therefore, when interpreting a test that rejects the null hypothesis, it is important to take the effect size into account, which is the (subjective) degree to which the rejection of the null hypothesis accurately reflects reality. For instance, suppose that we hypothesize that the population mean is 0, and we find from a very large sample that the confidence interval is 0.005 ± 0.0001. This rejects the null hypothesis. However, in the context of the problem, perhaps the value 0.005 is indistinguishable from zero and therefore has a small effect. In this case, we would still not reject the null hypothesis.
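The interplay between sample size and effect size can be sketched as follows. The code separates two questions: whether the confidence interval excludes the null value, and whether the observed deviation is large enough to matter in the problem context. The standard deviation of 0.05 and the threshold min_effect are hypothetical values chosen for illustration, and a normal approximation is assumed.

```python
from math import sqrt
from statistics import NormalDist

def ci_half_width(sd, n, confidence=0.95):
    """Half-width of the CI around the sample mean; shrinks as 1/sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / sqrt(n)

def assess(sample_mean, sd, n, null_mean=0.0, min_effect=0.01):
    """Return (statistically_significant, practically_significant)."""
    deviation = abs(sample_mean - null_mean)
    statistically = deviation > ci_half_width(sd, n)
    # Assumption: min_effect is a subjective, problem-specific threshold.
    practically = deviation >= min_effect
    return statistically, practically

print(assess(0.005, sd=0.05, n=100))        # (False, False)
print(assess(0.005, sd=0.05, n=1_000_000))  # (True, False)
```

With a million observations, the same tiny deviation becomes statistically significant even though it remains below the threshold of practical interest.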
2.9.6 Not Controlling All Variables When Collecting Observations
The purpose of controlling variables in running an experiment is to get a firm grasp on the nature of the underlying population. If the population being sampled changes during the experiment, the collected sample is meaningless. For example, suppose that you are observing the mean delay from a campus router to a particular data center. Suppose that during data collection, your ISP changed its Tier 1 provider. Then, the observations made subsequent to the change would likely reflect a new population. During preliminary data analysis, therefore, it is necessary to ensure that such uncontrollable effects have not corrupted the data set.
2.9.7 Converting Ordinal to Interval Scales
Ordinal scales in which each ordinal is numbered are often treated as if they were interval scales. An example is the Likert scale, where 1 may represent “poor,” 2 “satisfactory,” 3 “good,” 4 “outstanding,” and 5 “excellent.” So, if one user were to rate the streaming performance of a video player as 1 and another as 3, the mean rating is stated to be 2. This is bad practice. It is hard to argue that the gap between “poor” and “satisfactory” is the same as the gap between “satisfactory” and “good.” Yet that is the assumption being made when ordinal scales such as these are aggregated. In such cases, it is better to ask users to rate an experience on a linear scale from 1 to 5. This converts the ordinal scale to an interval scale and allows aggregation without making unwarranted assumptions.
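The hidden assumption can be made visible with a small sketch. Any increasing set of numbers preserves the order of the labels, yet the label nearest to the mean changes with the numbers chosen, whereas the median label does not. The alternative coding (1, 2, 3, 4, 10), the function name, and the ratings are hypothetical and serve only to illustrate the point.

```python
from statistics import mean, median_low

LABELS = ["poor", "satisfactory", "good", "outstanding", "excellent"]

def summarize(ratings, codes):
    """Return (label nearest the mean, median label) under a numeric coding.

    codes must be strictly increasing, one number per entry in LABELS.
    """
    to_code = dict(zip(LABELS, codes))
    to_label = dict(zip(codes, LABELS))
    values = [to_code[r] for r in ratings]
    m = mean(values)
    nearest = min(codes, key=lambda c: abs(c - m))
    return to_label[nearest], to_label[median_low(values)]

ratings = ["poor", "poor", "excellent"]
print(summarize(ratings, [1, 2, 3, 4, 5]))   # ('satisfactory', 'poor')
print(summarize(ratings, [1, 2, 3, 4, 10]))  # ('outstanding', 'poor')
```

Both codings respect the ordering of the labels, so nothing in the ordinal data favors one over the other. The mean is meaningful only once equal spacing between the values has been justified.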
2.9.8 Ignoring Outliers
The presence of outliers should always be a cause for concern. Silently ignoring them or deleting them from the data set altogether not only is bad practice but also prevents the analyst from unearthing significant problems in the data-collection process. Therefore, outliers should never be ignored.
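In practice, this means flagging suspicious observations for investigation rather than dropping them. The sketch below uses the common interquartile-range rule as the flagging criterion. That rule is an assumption of this example rather than a method prescribed here, and the function name and delay values are hypothetical.

```python
from statistics import quantiles

def flag_outliers(data, k=1.5):
    """Return (index, value) pairs lying beyond k * IQR from the quartiles.

    The input list is left untouched so that flagged points can be examined.
    """
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, x) for i, x in enumerate(data) if x < low or x > high]

# Hypothetical delay measurements (ms) with one suspicious value.
delays = [10, 11, 12, 11, 10, 12, 11, 95]
for index, value in flag_outliers(delays):
    print(f"observation {index} = {value}: investigate how it was collected")
```

Keeping the index of each flagged observation makes it possible to trace the value back to the circumstances under which it was collected.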
2.10 Further Reading
This chapter only touches on the elements of mathematical statistics. A delightfully concise summary of the basics of mathematical statistics can be found in M. G. Bulmer, Principles of Statistics, Oliver and Boyd, 1965, reissued by Dover, 1989. Statistical analysis is widely used in the social sciences and agriculture. The classic reference for a plethora of statistical techniques is G. W. Snedecor and W. G. Cochran, Statistical Methods, 8th ed., Wiley, 1989. Exploratory data analysis is described from the perspective of a practitioner in G. Myatt, Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining, Wiley, 2006.
Readers who want to learn directly from one of the masters of statistical analysis should refer to R. A. Fisher, Statistical Methods for Research Workers, Oliver and Boyd, 1925.
2.11 Exercises
1. Means
Prove that the mean of a sample is the value of $x^*$ that minimizes $\sum_{i=1}^{n} (x_i - x^*)^2$.
2. Means
Prove Equation 2.12.
3. Confidence intervals (normal distribution)
Compute the 95% confidence interval for the following data values (Table 2.2):
4. Confidence intervals (t distribution)