Descriptive Statistics: Summary Numbers
3.2 Variability or Spread of the Data
The following groups all have the same mean, 4.25:
Group A: 2, 3, 4, 8 Group B: 1, 2, 4, 10 Group C: 0, 1, 5, 11
These data are shown graphically in Figure 3.1.
[Figure 3.1: Groups A, B, and C, each plotted on a number line from 0 to 12]
It is clear that Group B is more variable (shows a larger spread in the numbers) than Group A, and Group C is more variable than Group B. But we need a quantitative measure of this variability.
(a) Sample Range
One simple measure of variability is the sample range, the difference between the smallest item and the largest item in each sample. For Group A the sample range is 6, for Group B it is 9, and for Group C it is 11. For small samples all of the same size, the sample range is a useful quantity. However, it is not a good indicator if the sample size varies, because the sample range tends to increase with increasing sample size. Its other major drawback is that it depends on only two items in each sample, the smallest and the largest, so it does not make use of all the data. This disadvantage becomes more serious as the sample size increases. Because of its simplicity, the sample range is used frequently in quality control when the sample size is constant; simplicity is particularly desirable in this case so that people do not need much education to apply the test.
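As a small check on the figures quoted above, the sample range of each group can be computed directly. This is only an illustrative sketch; the function name `sample_range` and the dictionary `groups` are names chosen here, not part of the text.

```python
# Sample range = largest item minus smallest item in each sample.
groups = {
    "A": [2, 3, 4, 8],
    "B": [1, 2, 4, 10],
    "C": [0, 1, 5, 11],
}


def sample_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)


ranges = {name: sample_range(values) for name, values in groups.items()}

for name, r in ranges.items():
    print(f"Group {name}: sample range = {r}")
```

Only the two extreme items of each sample enter the calculation, which is the drawback described above.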
(b) Interquartile Range
The interquartile range is the difference between the upper quartile and the lower quartile, which will be described in section 3.3. It is used fairly frequently as a measure of variability, particularly in the Box Plot, which will be described in the next chapter. It is used less than some alternatives because it is not related to any of the important theoretical distributions.
(c) Mean Deviation from the Mean
The mean deviation from the mean, defined as $\sum_{i=1}^{N} (x_i - \bar{x}) / N$, where $\bar{x} = \sum_{i=1}^{N} x_i / N$, is useless because it is always zero. This follows from the discussion of the sum of deviations from the mean in section 3.1 (a).
(d) Mean Absolute Deviation from the Mean
However, the mean absolute deviation from the mean, defined as $\sum_{i=1}^{N} |x_i - \bar{x}| / N$, is used frequently by engineers to show the variability of their data, although it is usually not the best choice. Its advantage is that it is simpler to calculate than the main alternative, the standard deviation, which will be discussed below. For Groups A, B, and C the mean absolute deviation is as follows:
Group A: (2.25 + 1.25 + 0.25 + 3.75)/4 = 7.5/4 = 1.875.
Group B: (3.25 + 2.25 + 0.25 + 5.75)/4 = 11.5/4 = 2.875.
Group C: (4.25 + 3.25 + 0.75 + 6.75)/4 = 15/4 = 3.75.
Its disadvantage is that it is not simply related to the parameters of theoretical distributions. For that reason its routine use is not recommended.
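The contrast between the mean deviation (always zero) and the mean absolute deviation can be sketched in a few lines. The helper names below are chosen for illustration only.

```python
groups = {
    "A": [2, 3, 4, 8],
    "B": [1, 2, 4, 10],
    "C": [0, 1, 5, 11],
}


def mean(values):
    return sum(values) / len(values)


def mean_deviation(values):
    """Mean of the signed deviations from the mean (zero, apart from rounding)."""
    m = mean(values)
    return sum(x - m for x in values) / len(values)


def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)


for name, values in groups.items():
    print(
        f"Group {name}: mean deviation = {mean_deviation(values)}, "
        f"mean absolute deviation = {mean_absolute_deviation(values)}"
    )
```

The signed deviations cancel exactly, whereas taking absolute values before averaging preserves the spread.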
(e) Variance
The variance is one of the most important descriptions of variability for engineers. It is defined as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \tag{3.6}$$
In words it is the mean of the squares of the deviations of each measurement from the mean of the population. Since the square of any real number, positive or negative, is never negative, the variance is never negative, and it is zero only when every measurement equals the mean. The symbol $\mu$ stands for the mean of the entire population, and $\sigma^2$ stands for the variance of the population. (Remember that in Chapter 1 we defined the population as a particular characteristic of all the items in which we are interested, such as the diameters of all the bolts produced under normal operating conditions.) Notice that variance is defined in terms of the population mean, $\mu$. When we calculate the results from a sample (i.e., a part of the population) we do not usually know the population mean, so we must find a way to use the sample mean, which we can calculate. Notice also that the variance has units of the quantity squared, for example m² or s² if the original quantity was measured in meters or seconds, respectively. We will find later that the variance is an important parameter in probability distributions used widely in practice.
(f) Standard Deviation
The standard deviation is extremely important. It is defined as the square root of the variance:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \tag{3.7}$$
Thus, it has the same units as the original data and is representative of the deviations from the mean. Because of the squaring, it gives more weight to larger deviations than to smaller ones. Since the variance is the mean square of the deviations from the population mean, the standard deviation is the root-mean-square deviation from the population mean. Root-mean-square quantities are also important in describing the alternating current of electricity. An analogy can be drawn between the standard deviation and the radius of gyration encountered in applied mechanics.
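Equations 3.6 and 3.7 translate directly into code when the whole population is available. The sketch below assumes, purely for illustration, that the four numbers of Group A make up a complete population; the function names are chosen here. Python's standard `statistics` module provides `pvariance` and `pstdev` for the same population quantities, and they are printed alongside as a cross-check.

```python
import math
import statistics

# Assumption for illustration: Group A is treated as a complete population.
population = [2, 3, 4, 8]


def population_variance(values):
    """Equation 3.6: mean of squared deviations from the population mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_std_dev(values):
    """Equation 3.7: square root of the population variance."""
    return math.sqrt(population_variance(values))


sigma2 = population_variance(population)
sigma = population_std_dev(population)

print("variance:", sigma2, "standard deviation:", sigma)
print("statistics module:", statistics.pvariance(population), statistics.pstdev(population))
```

The standard deviation comes out in the same units as the data, while the variance is in squared units.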
(g) Estimation of Variance and Standard Deviation from a Sample
The definitions of equations 3.6 and 3.7 can be applied directly if we have data for the complete population. But usually we have data for only a sample taken from the population. We want to infer from the data for the sample the parameters for the population. It can be shown that the sample mean, $\bar{x}$, is an unbiased estimate of the population mean, $\mu$. This means that if very large random samples were taken from the population, the sample mean would be a good approximation of the population mean, with no systematic error but with a random error which tends to become smaller as the sample size increases.
However, if we simply substitute $\bar{x}$ for $\mu$ in equations 3.6 and 3.7, there will be a systematic error or bias. This procedure would underestimate the variance and standard deviation of the population. This is because the sum of squares of deviations from the sample mean, $\bar{x}$, is smaller than the sum of squares of deviations from any other constant value, including $\mu$. The sample mean $\bar{x}$ is an unbiased estimate of $\mu$, but in general $\bar{x} \neq \mu$, so just substituting $\bar{x}$ for $\mu$ in equations 3.6 and 3.7 would tend to give estimates of variance and standard deviation that are too small. To illustrate this, consider the four numbers 11, 13, 10, and 14 as a sample. Their sample mean is 12. They might well come from a population of mean 13. Then the sum of squares of deviations from the population mean is
$\sum_i (x_i - \mu)^2$ = (11 – 13)² + (13 – 13)² + (10 – 13)² + (14 – 13)² = 2² + 0² + 3² + 1² = 14,
whereas
$\sum_i (x_i - \bar{x})^2$ = (11 – 12)² + (13 – 12)² + (10 – 12)² + (14 – 12)² = 1² + 1² + 2² + 2² = 10.
Thus, $\sum_i (x_i - \bar{x})^2 / N$ would underestimate the variance.
The estimate of variance obtained using the sample mean in place of the population mean can be made unbiased by multiplying by the factor $N/(N-1)$. This is called Bessel’s correction. The estimate of $\sigma^2$ is given the symbol $s^2$ and is called the variance estimated from a sample, or more briefly the sample variance. Sometimes this estimate will be high, sometimes it will be low, but in the long run it will show no bias if samples are taken randomly. The result of Bessel’s correction is that we have
$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1} \tag{3.8}$$
The standard deviation is always the square root of the corresponding variance, so s is called the sample standard deviation. It is the estimate from a sample of the standard deviation of the population from which the sample came. The sample standard deviation is given by
$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1}} \tag{3.9}$$
Equations 3.8 and 3.9 (or their equivalents) should be used to calculate the variance and standard deviation from a sample unless the population mean is known.
If the population mean is known, as when we know all the members of the population, we should use equations 3.6 and 3.7 directly. Notice that when N is very large, Bessel’s correction becomes approximately 1, so then it might be neglected. However, to avoid error we should always use equations 3.8 and 3.9 (or their equivalents) unless the population mean is known accurately.
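The effect of Bessel's correction can be checked exactly on a tiny example by listing every possible sample rather than relying on random simulation. The sketch below makes two assumptions that are not in the text: the four numbers 2, 3, 4, 8 are treated as a complete population, and samples of size 2 are drawn with replacement, so that every ordered pair is equally likely. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Assumptions for illustration: a complete population of four values,
# and samples of size n drawn with replacement.
population = [2, 3, 4, 8]
n = 2

N_pop = len(population)
mu = Fraction(sum(population), N_pop)
sigma2 = sum((x - mu) ** 2 for x in population) / N_pop  # equation 3.6

# Every ordered sample of size n; all are equally likely.
samples = list(product(population, repeat=n))


def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)


# Long-run average of each estimator over all possible samples.
mean_biased = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
mean_unbiased = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print("population variance:        ", sigma2)
print("average when dividing by N: ", mean_biased)
print("average when dividing by N-1:", mean_unbiased)
```

Under these assumptions, dividing by N gives an average that is too small by the factor (N − 1)/N, while dividing by N − 1 averages exactly to the population variance.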
(h) Method for Faster Calculation
A modification of equations 3.6 to 3.9 makes calculation of variance and standard deviation faster. In most cases in this book we have omitted derivations, but this case is an exception because the algebra is simple and may be helpful.
Equations 3.8 and 3.9 include the expression
$$\sum (x_i - \bar{x})^2 = \sum x_i^2 - 2\bar{x} \sum x_i + N\bar{x}^2$$
But by definition $\bar{x} = \sum x_i / N$. Then we have
$$\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{2\left(\sum x_i\right)^2}{N} + \frac{N\left(\sum x_i\right)^2}{N^2} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N} \tag{3.10}$$
Notice that $\sum x_i^2$ means we should square all the x’s and then add them up. On the other hand, $\left(\sum x_i\right)^2$ means we should add up all the x’s and square the result. They are not the same.
An alternative to equation 3.10 is
$$\sum (x_i - \bar{x})^2 = \sum x_i^2 - N\bar{x}^2 \tag{3.10a}$$
Then we have
$$s^2 = \frac{\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2 / N}{N-1} = \frac{\sum_{i=1}^{N} x_i^2 - N\bar{x}^2}{N-1} \tag{3.11}$$
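The difference between the sum of squares and the square of the sum is easy to see numerically. The sketch below reuses the sample 11, 13, 10, 14 from part (g); the variable names are chosen here for illustration.

```python
sample = [11, 13, 10, 14]
N = len(sample)

sum_of_squares = sum(x * x for x in sample)  # square each x, then add
square_of_sum = sum(sample) ** 2             # add all the x's, then square

xbar = sum(sample) / N

# Sum of squared deviations computed three ways.
direct = sum((x - xbar) ** 2 for x in sample)   # left-hand side of 3.10
eq_3_10 = sum_of_squares - square_of_sum / N    # right-hand side of 3.10
eq_3_10a = sum_of_squares - N * xbar ** 2       # right-hand side of 3.10a

s2 = eq_3_10 / (N - 1)                          # equation 3.11

print("sum of squares:", sum_of_squares)
print("square of sum: ", square_of_sum)
print("sum of squared deviations:", direct, eq_3_10, eq_3_10a)
print("sample variance:", s2)
```

All three routes give the same sum of squared deviations, 10, which agrees with the value found by hand in part (g).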
It is often convenient to use equation 3.11 in the form for frequencies:
$$s^2 = \frac{\sum f_i x_i^2 - \left(\sum f_i x_i\right)^2 / \left(\sum f_i\right)}{\sum f_i - 1} \tag{3.12}$$
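A minimal sketch of equation 3.12 follows. The values and frequencies are hypothetical, invented only to exercise the formula, and the function name is chosen here. As a cross-check, the frequency table is expanded into a plain list and passed to `statistics.variance` from the Python standard library, which computes the sample variance.

```python
import statistics


def sample_variance_from_frequencies(values, freqs):
    """Equation 3.12: sample variance from values x_i with frequencies f_i."""
    n = sum(freqs)                                            # sum of f_i
    sum_fx = sum(f * x for x, f in zip(values, freqs))        # sum of f_i * x_i
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))   # sum of f_i * x_i^2
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


# Hypothetical frequency table (not taken from the text).
values = [1, 2, 3, 4]
freqs = [2, 5, 4, 1]

s2 = sample_variance_from_frequencies(values, freqs)

# Expand the table into individual observations for comparison.
expanded = [x for x, f in zip(values, freqs) for _ in range(f)]

print("equation 3.12:      ", s2)
print("statistics.variance:", statistics.variance(expanded))
```

The frequency form avoids repeating identical terms, which is its main convenience when many observations share the same value.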
Equations 3.6 and 3.7 include $\sum_{i=1}^{N} (x_i - \mu)^2$, where for a complete population $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$. Then similar expressions to equations 3.10 to 3.12 (but dividing by N instead of (N – 1)) apply for cases where the complete population is known.
The modified equations such as equation 3.11 or 3.12 should be used for calculation of variance (and the square root of one of them should be used for calculation of standard deviation) by hand or using a good pocket calculator because they involve fewer arithmetic operations and so are faster. However, some thought is required if a digital computer is used. That is because some computers carry relatively few significant figures in the calculation. Since in equation 3.11 the quantities $\sum_{i=1}^{N} x_i^2$ and $\left(\sum_{i=1}^{N} x_i\right)^2 / N$ or $N\bar{x}^2$ are of similar magnitudes, the differences in equation 3.11 may involve catastrophic loss of significance because of rounding of figures in the computation. Most present-day computers and calculators, however, carry enough significant figures so that this “loss of significance” is not usually a serious problem, but the possibility of such a difficulty should be considered. It can often be avoided by subtracting a constant quantity from each number, an operation which does not change the variance or standard deviation. For example, the variance of 3617.8, 3629.6, and 3624.9 is exactly the same as the variance of 17.8, 29.6, and 24.9.
However, the number of figures in the squared terms is much smaller in the second case, so the possibility of loss of significance is greatly reduced. Then in general, fewer figures are required to calculate variance by subtracting the mean from each of the values, then squaring, adding, and dividing by the number of items (i.e., using equation 3.8 directly), but this adds to the number of arithmetic operations and so requires more time for calculations. If the calculating device carries enough significant figures to allow 3.11 or 3.12 to be used, that is the preferred method.
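Loss of significance can be provoked deliberately, even in double-precision arithmetic, by adding a very large constant to each number. In the sketch below the offset of 10⁹ is a hypothetical value chosen to exaggerate the effect, and the function names are chosen here; Python's `float` is assumed to be the usual 64-bit IEEE type.

```python
def variance_shortcut(values):
    """Equation 3.11: (sum of x^2 - (sum of x)^2 / N) / (N - 1)."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)


def variance_two_pass(values):
    """Equation 3.8: subtract the mean first, then square, add, and divide."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


small = [17.8, 29.6, 24.9]
text_values = [3617.8, 3629.6, 3624.9]      # the numbers quoted in the text

OFFSET = 1.0e9                               # hypothetical, deliberately extreme
huge = [OFFSET + x for x in small]
recentred = [x - OFFSET for x in huge]       # subtract a constant before squaring

print("shortcut, small values:    ", variance_shortcut(small))
print("shortcut, text values:     ", variance_shortcut(text_values))
print("shortcut, huge values:     ", variance_shortcut(huge))
print("two-pass, huge values:     ", variance_two_pass(huge))
print("shortcut, recentred values:", variance_shortcut(recentred))
```

With the values from the text the shortcut formula is unaffected, but with the exaggerated offset the two large terms in equation 3.11 agree in nearly all of their significant figures and their difference is mostly rounding error. Subtracting the constant first, or using equation 3.8, recovers the variance.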
Microsoft Excel carries a precision of about 15 decimal digits in each numerical quantity. Statistical calculations seldom require greater precision in any final answer than four or five decimal digits, so “loss of significance” is very seldom a problem if Excel is being used. A comparison to verify that statement in a particular case will be included in Example 4.4.
(i) Illustration of Calculation
Now let us return to an example of calculations using the groups of numbers listed at the beginning of section 3.2.
Example 3.1
The numbers were as follows:
Group A: 2, 3, 4, 8 Group B: 1, 2, 4, 10 Group C: 0, 1, 5, 11
Find the sample variance and the sample standard deviation of each group of numbers. Use both equation 3.8 and equation 3.11 to check that they give the same result.
Answer: Since the mean of Group A (and also of the other groups) is 4.25, the sample variance of Group A using the basic definition, equation 3.8, is
[(2 – 4.25)² + (3 – 4.25)² + (4 – 4.25)² + (8 – 4.25)²] / (4 – 1)
= [5.0625 + 1.5625 + 0.0625 + 14.0625] / 3 = 20.75 / 3 = 6.917, so the sample standard deviation is √6.917 = 2.630.
The variance of Group A calculated by equation 3.11 is
[2² + 3² + 4² + 8² – (4)(4.25)²] / (4 – 1) = [4 + 9 + 16 + 64 – 72.25] / 3 = 6.917 (again). We can see that the advantage of equation 3.11 is greater when the mean is not a simple integer.
Using equation 3.11 on Group B gives
[1² + 2² + 4² + 10² – (4)(4.25)²] / (4 – 1) = [1 + 4 + 16 + 100 – 72.25] / 3 = 48.75 / 3 = 16.25 for the sample variance, so the sample standard deviation is 4.031.
Using equation 3.11 on Group C gives
[0² + 1² + 5² + 11² – (4)(4.25)²] / (4 – 1) = [0 + 1 + 25 + 121 – 72.25] / 3 = 74.75 / 3 = 24.917 for the variance, so the standard deviation is 4.992.
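The hand calculations of Example 3.1 can be reproduced with a short script that applies both equation 3.8 and equation 3.11 to each group. The function and variable names are chosen here for illustration.

```python
import math

groups = {
    "A": [2, 3, 4, 8],
    "B": [1, 2, 4, 10],
    "C": [0, 1, 5, 11],
}


def variance_eq_3_8(values):
    """Basic definition: squared deviations from the sample mean over (N - 1)."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def variance_eq_3_11(values):
    """Shortcut form: (sum of x^2 - N * xbar^2) / (N - 1)."""
    n = len(values)
    xbar = sum(values) / n
    return (sum(x * x for x in values) - n * xbar ** 2) / (n - 1)


results = {}
for name, values in groups.items():
    v_direct = variance_eq_3_8(values)
    v_short = variance_eq_3_11(values)
    results[name] = (v_direct, v_short, math.sqrt(v_direct))
    print(
        f"Group {name}: s^2 (3.8) = {v_direct:.3f}, "
        f"s^2 (3.11) = {v_short:.3f}, s = {math.sqrt(v_direct):.3f}"
    )
```

Both equations agree for every group, as the worked answer shows for Group A.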
(j) Coefficient of Variation
A dimensionless quantity, the coefficient of variation is the ratio between the standard deviation and the mean for the same set of data, expressed as a percentage. This can be either ($\sigma / \mu$) or ($s / \bar{x}$), whichever is appropriate, multiplied by 100%.
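As an illustration, the coefficient of variation of the three groups from Example 3.1 can be computed from the sample standard deviation and sample mean. The sketch uses `statistics.stdev` and `statistics.mean` from the Python standard library; the function name is chosen here, and the data are assumed to have a nonzero mean.

```python
import statistics

groups = {
    "A": [2, 3, 4, 8],
    "B": [1, 2, 4, 10],
    "C": [0, 1, 5, 11],
}


def coefficient_of_variation(values):
    """Return (s / x-bar) * 100%, using the sample standard deviation."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0


cv = {name: coefficient_of_variation(values) for name, values in groups.items()}

for name, value in cv.items():
    print(f"Group {name}: coefficient of variation = {value:.2f}%")
```

Because all three groups share the same mean, their coefficients of variation rank in the same order as their standard deviations.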
(k) Illustration: An Anecdote
A brief story may help the reader to see why variability is often important. Some years ago a company was producing nickel powder, which varied considerably in particle size. A metallurgical engineer in technical sales was given the task of developing new customers in the alloy steel industry for the powder. Some potential buyers said they would pay a premium price for a product that was more closely sized. After some discussion with the management of the plant, specifications for three new products were developed: fine powder, medium powder, and coarse powder. An order was obtained for fine powder. Although the specifications for this fine powder were within the size range of powder which had been produced in the past, the engineers in the plant found that very little of the powder produced at their best guess of the optimum conditions would satisfy the specifications. Thus, the mean size of the specification was satisfactory, but the specified variability was not satisfactory from the point of view of production. To make production of fine powder more practical, it was necessary to change the specifications for “fine powder” to correspond to a larger standard deviation. When this was done, the plant could produce fine powder much more easily (but the customer was not willing to pay such a large premium for it!).