Summarizing data
MEASURING CENTRAL TENDENCY
One of the most important ways of summarizing a distribution of values for a variable is to establish its central tendency, that is, the typical value in a distribution. Where, for example, do values in a distribution tend to concentrate? To many readers this may mean trying to find the ‘average’ of a distribution of values. However, statisticians mean a number of different measures when they talk about averages. Three measures of average (i.e. central tendency) are usually discussed in textbooks: the arithmetic mean, the median and the mode. Stephen J. Gould, a palaeontologist who is well known for his popular writings on science, illustrates the first two of these measures of average when he writes:
A politician in power might say with pride, ‘The mean income of our citizens is $15,000 per year.’ The leader of the opposition might retort, ‘But half our citizens make less than $10,000 per year.’ Both are right, but neither cites a statistic with impassive objectivity. The first invokes a mean, the second a median. (1991:473)
While this comment does little to reassure us about the possible misuse of statistics, it does illustrate well the different ways in which average can be construed.
The arithmetic mean
The arithmetic mean is a method for measuring the average of a distribution which conforms to most people’s notion of what an average is. Consider the following distribution of values:
12 10 7 9 8 15 2 19 7 10 8 16
The arithmetic mean is calculated by adding up all of the values (here the sum is 123) and dividing by the number of values (i.e. 12), which results in an arithmetic mean of 10.25. It is this kind of calculation which results in such seemingly bizarre statements as ‘the average number of children is 2.37’. However, the arithmetic mean, which is often symbolized as x̄, is by far the most commonly used method of gauging central tendency. Many of the statistical tests encountered later in this book are directly concerned with comparing means deriving from different samples or groups of cases (e.g. analysis of variance; see Chapter 7). The arithmetic mean is easy to understand and to interpret, which heightens its appeal. Its chief limitation is that it is vulnerable to extreme values, in that it may be unduly affected by very high or very low values which can respectively increase or decrease its magnitude. This is particularly likely to occur when there are relatively few values; when there are many values, it would take a very extreme value to distort the arithmetic mean. For example, if the number 59 is substituted for 19 in the previous distribution of twelve values, the mean would be 13.58, rather than 10.25, which constitutes a substantial difference and could be taken to be a poor representation of the distribution as a whole. Similarly, in Table 8.11 in Chapter 8, the variable ‘size of firm’ contains an outlier (case number 20) which is a firm of 2,700 employees whereas the next largest has 640 employees. The mean for this variable is 499, but if we exclude the outlier it is 382.6. Again, we see that an outlier can have a very large impact on the arithmetic mean, especially when the number of cases in the sample is quite small.
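The calculation just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: Python is not a tool used in this book, and the function name arithmetic_mean is hypothetical. The sketch computes the mean of the twelve values and then repeats the calculation with 59 substituted for 19.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by the number of values."""
    return sum(values) / len(values)


# The twelve values from the text
values = [12, 10, 7, 9, 8, 15, 2, 19, 7, 10, 8, 16]
mean_original = arithmetic_mean(values)        # 123 / 12

# Substitute the extreme value 59 for 19, as in the text
with_outlier = [59 if v == 19 else v for v in values]
mean_outlier = arithmetic_mean(with_outlier)   # 163 / 12

print(mean_original)            # 10.25
print(round(mean_outlier, 2))   # 13.58
```

A single substituted value moves the mean by more than three units, which is the vulnerability to extreme values noted above.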
The median
The median is the mid-point in a distribution of values. It splits a distribution of values in half. If the values in a distribution are arrayed from low to high, e.g. 2, 4, 7, 9, 10, the median is the middle value, i.e. 7. When there is an even number of values, the average of the two middle values is taken. Thus, in the former group of twelve values, to calculate the median we need to array them as follows:
2 7 7 8 8 9 10 10 12 15 16 19.
Thus in this array of twelve values, we take the two middle values, the sixth and seventh (9 and 10), and divide their sum by 2, i.e. (9+10)/2=9.5. This is slightly lower than the arithmetic mean of 10.25, which is almost certainly due to the presence of three fairly large values at the upper end (15, 16 and 19). If we had the value 59 instead of 19, the mean would be higher at 13.58 but the median would be unaffected, because it emphasizes the middle of the distribution and ignores the ends. For this reason, many writers suggest that when there is an outlying value which may distort the mean, the median should be considered because it will give a more representative indication of the central tendency of a group of values. On the other hand, the median is less intuitively easy to understand and its calculation does not use all of the values in a distribution. Moreover, the mean’s vulnerability to distortion as a consequence of extreme values is less pronounced when there is a large number of cases.
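The procedure for the median can be sketched in the same way. Again this is an illustrative Python fragment rather than anything drawn from this book, and the function name median is our own. It sorts the values, takes the middle value when the number of values is odd, and averages the two middle values when it is even.

```python
def median(values):
    """Return the mid-point of the values once arrayed from low to high."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: the single middle value
        return ordered[mid]
    # Even number of values: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


values = [12, 10, 7, 9, 8, 15, 2, 19, 7, 10, 8, 16]
with_outlier = [59 if v == 19 else v for v in values]

median_odd = median([2, 4, 7, 9, 10])    # 7
median_original = median(values)         # (9 + 10) / 2 = 9.5
median_outlier = median(with_outlier)    # still 9.5

print(median_odd, median_original, median_outlier)
```

Substituting 59 for 19 changes only the top end of the ordered array, so the sixth and seventh values, and hence the median, stay the same.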
The mode
This final indicator of central tendency is rarely used in research reports, but is often mentioned in textbooks. The mode is simply the value that occurs most frequently in a distribution. In the foregoing array of twelve values, there are three modes—7, 8, and 10. Unlike the mean, which strictly speaking should only be used in relation to interval variables, the mode can be employed at any measurement level. The median can be employed in relation to interval and ordinal, but not nominal, variables. Thus, although the mode appears more flexible, it is infrequently used, in part because it does not use all of the values of a distribution and is not easy to interpret when there are a number of modes.
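As a final sketch, the mode can be found by counting how often each value occurs. The Python fragment below is illustrative only: the function name modes is hypothetical, and the list of colours is made-up data included simply to show that the mode can be found for a nominal variable.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


values = [12, 10, 7, 9, 8, 15, 2, 19, 7, 10, 8, 16]
numeric_modes = modes(values)
print(numeric_modes)    # [7, 8, 10]

# Made-up nominal data: a mean or median would be meaningless here
colours = ["red", "blue", "red", "green"]
nominal_modes = modes(colours)
print(nominal_modes)    # ['red']
```

Returning a list rather than a single value makes explicit the difficulty noted above: with three modes, no single value represents the distribution.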