MEASURING DISPERSION

Summarizing data

In addition to being interested in the typical or representative score for a distribution of values, researchers are usually interested in the amount of variation shown by that distribution. This is what is meant by dispersion—how widely spread a distribution is. Dispersion can provide us with important information. For example, we may find two roughly comparable firms in which the mean income of manual workers is identical. However, in one firm the salaries of these workers are more widely spread, with both considerably lower and higher salaries than in the other firm. Thus, although the mean income is the same, one firm exhibits much greater dispersion in incomes than the other. This is important information that can usefully be employed to add to measures of central tendency.


The most obvious measure of dispersion is to take the highest and lowest values in a distribution and to subtract the latter from the former. This is known as the range. While easy to understand, it suffers from the disadvantage of being susceptible to distortion from extreme values. This point can be illustrated by the imaginary data in Table 5.5, which shows the marks out of a hundred achieved on a mathematics test by two classes of twenty students, each of which was taught by a different teacher. The two classes exhibit similar means, but the patterns of the two distributions of values are highly dissimilar. Teacher A’s class has a fairly bunched distribution, whereas that of Teacher B’s class is much more dispersed. Whereas the lowest mark attained in Teacher A’s class is 57, the lowest for Teacher B is 45. Indeed, there are eight marks in Teacher B’s class that are below 57. However, whereas the highest mark in Teacher A’s class is 74, three of Teacher B’s class exceed this figure, one with a very high 95. Although the latter distribution is more dispersed, the calculation of the range seems to exaggerate its dispersion. The range for Teacher A is 74 − 57, i.e. 17. For Teacher B, the range is 95 − 45, i.e. 50. This exaggerates the amount of dispersion, since all but three of the values lie between 45 and 72, implying a range of 27 for the majority of the values.
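The sensitivity of the range to extreme values can be sketched in a few lines of Python. The marks below are hypothetical and are not the values of Table 5.5; they are chosen only so that the lowest mark, the highest mark and the highest non-extreme mark match the figures quoted for Teacher B’s class. The function name `value_range` is likewise an illustrative assumption.

```python
# Hypothetical marks (NOT the data of Table 5.5): bunched values plus one extreme mark.
marks = [45, 52, 58, 63, 67, 72, 95]


def value_range(values):
    """Return the range: highest value minus lowest value."""
    return max(values) - min(values)


full_range = value_range(marks)

# Drop the single extreme mark to see how much it alone contributes to the range.
without_extreme = sorted(marks)[:-1]
trimmed_range = value_range(without_extreme)

print(full_range)     # 95 - 45
print(trimmed_range)  # 72 - 45
```

A single extreme mark accounts for almost half of the range here, which is the distortion described above.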

Table 5.5 Results of a test of mathematical ability for the students of two teachers (imaginary data)

One solution to this problem is to eliminate the extreme values. The inter-quartile range, for example, is sometimes recommended in this connection (see Figure 5.4). This entails arraying a range of values in ascending order. The array is divided into four equal portions, so that the lowest 25 per cent are in the first portion and the highest 25 per cent are in the last portion. These portions are used to generate quartiles. Take the earlier array from which the median was calculated:

The first quartile (Q1), often called the ‘lower quartile’, will be between 7 and 8 and is calculated as ([3×7]+8)/4, i.e. 7.25. The third quartile (Q3), often called the ‘upper quartile’, will be between 12 and 15 and is calculated as (12+[3×15])/4, i.e. 14.25. The inter-quartile range is the difference between the third and first quartiles, i.e. 14.25−7.25=7. As Figure 5.4 indicates, the median is the second quartile, but is not a component of the calculation of the inter-quartile range. The main advantage of this measure of dispersion is that it eliminates extreme values, but its chief limitation is that in ignoring 50 per cent of the values in a distribution, it loses a lot of information. A compromise is the decile range, which divides a distribution into ten portions (deciles) and, in a similar manner to the inter-quartile range, eliminates the highest and lowest portions. In this case, only 20 per cent of the distribution is lost.
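The quartile arithmetic above amounts to linear interpolation between two neighbouring values in the ordered array. A minimal Python sketch follows, assuming the first quartile lies one quarter of the way from 7 to 8 and the third quartile three quarters of the way from 12 to 15, which is what the weights 3 and 1 in the formulas imply. The helper name `interpolate` and the twelve-value array are illustrative assumptions; the array is hypothetical and is not the one used in the text. Python's `statistics.quantiles` with `method="exclusive"` appears to follow the same rule, so it serves as a cross-check.

```python
import statistics


def interpolate(lower, upper, fraction):
    """Return the value lying `fraction` of the way from `lower` to `upper`."""
    return lower + fraction * (upper - lower)


q1 = interpolate(7, 8, 0.25)     # same as ([3 x 7] + 8) / 4
q3 = interpolate(12, 15, 0.75)   # same as (12 + [3 x 15]) / 4
iqr = q3 - q1                    # inter-quartile range

# Hypothetical ordered array (an assumption, NOT the array from the text):
# its 3rd and 4th values are 7 and 8, and its 9th and 10th are 12 and 15.
values = [3, 5, 7, 8, 9, 10, 10, 11, 12, 15, 17, 21]
lower_q, median, upper_q = statistics.quantiles(values, n=4, method="exclusive")

print(q1, q3, iqr)
print(lower_q, median, upper_q)
```

Other software may use slightly different interpolation rules for quartiles, so small discrepancies between packages are to be expected.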

By far the most commonly used method of summarizing dispersion is the standard deviation. In essence, the standard deviation calculates the average amount of deviation from the mean. Its calculation is somewhat more complicated than this definition implies. A further description of the standard deviation can be found in Chapter 7. The standard deviation reflects the degree to which the values in a distribution differ from the arithmetic mean. The standard deviation is usually presented in tandem with the mean, since it is difficult to determine its meaning in the absence of the mean.

We can compare the two distributions in Table 5.5. Although the means are very similar, the standard deviation for Teacher B’s class (12.37) is much larger than that for Teacher A (4.91). Thus, the standard deviation permits the direct comparison of degrees of dispersal for comparable samples and measures. A further advantage is that it employs all of the values in a distribution. It summarizes in a single value the amount of dispersion in a distribution, which, when used in conjunction with the mean, is easy to interpret. The standard deviation can be affected by extreme values, but since its calculation is affected by the number of cases, the distortion is less pronounced than with the range. On the other hand, the possibility of distortion from extreme values must be borne in mind. None the less, unless there are very good reasons for not wanting to use the standard deviation, it should be used whenever a measure of dispersion is required. It is routinely reported in research reports and widely recognized as the main measure of dispersion.
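How the standard deviation separates two distributions with the same mean can be sketched as follows. The marks are hypothetical (they are not the data of Table 5.5), and the sketch assumes the sample form of the standard deviation, which divides the sum of squared deviations by the number of cases minus one; the text does not state which form lies behind the figures quoted above. The function name `sample_sd` is an illustrative assumption.

```python
import math
import statistics

# Hypothetical marks for two classes with the same mean (NOT the data of Table 5.5).
class_a = [60, 62, 65, 68, 70]   # bunched around the mean
class_b = [45, 55, 65, 75, 85]   # widely dispersed


def sample_sd(values):
    """Sample standard deviation: square root of summed squared deviations over n - 1."""
    mean = sum(values) / len(values)
    squared_deviations = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_deviations) / (len(values) - 1))


# Report the mean and standard deviation together, as the text recommends.
for name, marks in (("A", class_a), ("B", class_b)):
    print(name, statistics.mean(marks), round(sample_sd(marks), 2))
```

Both classes have a mean of 65, yet the second standard deviation is several times the first, which is the kind of contrast a mean alone would hide. The standard library function `statistics.stdev` computes the same sample quantity.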

This consideration of dispersion has tended to emphasize interval variables. The standard deviation can only be employed in relation to such variables. The range and inter-quartile range can be used in relation to ordinal variables, but this does not normally happen, while tests for dispersion in nominal variables would be inappropriate. Probably the best ways of examining dispersion for nominal and ordinal variables are through bar charts, pie charts and frequency tables.

Measuring central tendency and dispersion with SPSS

All of these statistics can be generated in SPSS. Taking income as an illustration, the following sequence should be followed:

→Statistics →Summarize →Explore…[opens Explore dialog box shown in Box 5.8]

→income →▶button by Dependent List: [puts income in Dependent List: box] →OK

The resulting output is in Table 5.6. The following items that have been covered above will be produced: arithmetic mean, median, range, minimum and maximum values, standard deviation, and the inter-quartile range.

In document Quantitative Analysis With SPSS (Page 100-104)