The standard deviation is
1. a distance measure: the square root of the average squared distance by which scores deviate from the mean;
2. the preferred measure for quantitative variables whose distributions are relatively symmetrical;
3. often reported with the mean (for a normal distribution, $\bar{X} \pm S$ is an interval that contains 68.27% of scores);
4. the measure with the best sampling stability;
5. widely used, implicitly or explicitly, in advanced statistics;
6. mathematically tractable;
7. the only widely used measure of dispersion whose value is affected by the value of every score in the distribution;
8. fairly sensitive to extreme scores, so it is not recommended for markedly skewed distributions; and
9. not appropriate for qualitative variables.
The semi-interquartile range is
1. a distance measure—one-half the distance between the first and the third quartiles;
2. often reported with the median for quantitative variables;
3. closely related to the median, because both are defined in terms of quartile points;
4. sensitive only to the number and not to the value of scores above Q3 and below Q1; hence, it often is used for markedly skewed distributions;
5. the only relatively stable measure of dispersion that is appropriate for open-ended distributions;
6. more subject to sampling fluctuation than the standard deviation;
7. less mathematically tractable than the standard deviation; and
8. rarely used in advanced statistical procedures.
The range is
1. a distance measure: the distance between the largest and the smallest scores;
2. often reported with the mode for quantitative variables;
3. the simplest measure of dispersion to compute and interpret;
4. used in deciding how to group data in a frequency distribution;
5. much more subject to sampling fluctuation than the other measures of dispersion;
6. dependent on sample size: the larger the sample size, the larger, on the average, the range;
7. less mathematically tractable than the standard deviation; and
8. rarely used in advanced statistical procedures.
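The contrast among the three distance-based measures summarized above can be seen on a small example. The following is a minimal sketch, not the text's own computational procedure: the scores are made up, `dispersion_summary` is a hypothetical helper, the standard deviation is assumed to use N in the denominator, and the quartiles come from Python's `statistics.quantiles` with the "inclusive" method, which may differ slightly from quartile points computed by hand from grouped data.

```python
import statistics


def dispersion_summary(scores):
    """Return the standard deviation, semi-interquartile range, and range."""
    # Quartile points by linear interpolation (an assumption; other
    # conventions for locating Q1 and Q3 give slightly different values).
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        # Descriptive standard deviation (N in the denominator).
        "standard_deviation": statistics.pstdev(scores),
        # One-half the distance between the first and third quartiles.
        "semi_interquartile_range": (q3 - q1) / 2,
        # Distance between the largest and the smallest scores.
        "range": max(scores) - min(scores),
    }


# Hypothetical scores, chosen only for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
# Same scores, but the largest one is replaced by an extreme value.
with_extreme = [2, 4, 4, 4, 5, 5, 7, 40]

for label, data in (("original", scores), ("with extreme score", with_extreme)):
    print(label, dispersion_summary(data))
```

In this sketch the extreme score changes the standard deviation and the range but leaves the semi-interquartile range untouched, because only the value, not the number, of scores above Q3 has changed.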
The index of dispersion is
1. a measure of the distinguishability of observations—that is, the number of distinguishable pairs of observations relative to the number possible. The index is 0 when all observations are in one qualitative category (minimum dispersion), and it has its maximum value of 1 when the observations are evenly distributed over the categories (maximum dispersion);
2. the only measure of dispersion appropriate for unordered qualitative variables;
3. reported with the mode;
4. rarely used in advanced statistical procedures; and
5. less familiar than the standard deviation, range, and semi-interquartile range, which are based on the concept of distance.
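The idea of distinguishable pairs can be made concrete with a short computation. The sketch below assumes one common formulation: two observations are distinguishable when they fall in different categories, and the maximum number of such pairs occurs when the N observations are spread evenly over the k categories, which gives N²(k − 1)/(2k). The function name and the category counts are hypothetical.

```python
from itertools import combinations


def index_of_dispersion(counts):
    """Ratio of distinguishable pairs to the maximum possible number.

    counts holds the number of observations in each qualitative category.
    """
    k = len(counts)
    n = sum(counts)
    if k < 2 or n == 0:
        raise ValueError("need at least two categories and one observation")
    # A pair is distinguishable when its members are in different
    # categories, so sum the products of counts over category pairs.
    distinguishable = sum(a * b for a, b in combinations(counts, 2))
    # Largest possible number of distinguishable pairs (even spread).
    maximum = n * n * (k - 1) / (2 * k)
    return distinguishable / maximum


print(index_of_dispersion([12, 0, 0, 0]))  # all in one category: 0.0
print(index_of_dispersion([3, 3, 3, 3]))   # evenly distributed: 1.0
print(index_of_dispersion([6, 3, 2, 1]))   # somewhere in between
```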
CHECK YOUR UNDERSTANDING OF SECTION 4.3
11. What measure of central tendency and dispersion would you compute for the following data? Defend your choice.
a. IQ [figure: frequency distribution; class intervals 50–59 through 130–139]
b. Creativity scores of doctoral candidates in English [figure: frequency distribution; class intervals 25–29 through 65–69]
c. Frequency of drug use among teenagers [figure: frequency distribution; categories Cannabis, LSD, Mescaline, Psilocybin]
d. Assembly line productivity during a week [figure: vertical axis Production]
4.4 DISPERSION AND THE NORMAL DISTRIBUTION
The distribution of many variables in the behavioral sciences, health sciences, and education resembles the bell-shaped normal distribution. Because this distribution is so important, its properties have been studied extensively by mathematicians. You saw in Section 4.2 that for a normal distribution, the interval $\bar{X} \pm S$ includes 68.27% of scores. Suppose that you are interested in the interval $\bar{X} \pm 2S$ or $\bar{X} \pm 3S$. The percentage of scores included in these intervals is shown in Figure 4.4-1. It can be seen that an interval of six standard deviations, $\bar{X} \pm 3S$, includes almost all of the scores, 99.73%. Also, $\bar{X} \pm S$ gives the two scores that mark the inflection points of the normal distribution, that is, the points where the curve changes from convex to concave or the reverse.
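The percentages quoted above can be checked numerically. The sketch below relies on a standard property of the normal distribution, not on anything specific to this text: the proportion of scores within k standard deviations of the mean equals erf(k/√2). The function name `percent_within` is hypothetical.

```python
import math


def percent_within(k):
    """Percentage of a normal distribution within k standard deviations
    of the mean, on both sides combined."""
    return 100 * math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"mean ± {k}S: {percent_within(k):.2f}%")
# mean ± 1S: 68.27%
# mean ± 2S: 95.45%
# mean ± 3S: 99.73%
```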
CHECK YOUR UNDERSTANDING OF SECTION 4.4
12. For a normal distribution, what percentage of the scores falls (a) below $\bar{X} + S$? (b) between $\bar{X} - 3S$ and $\bar{X} + 3S$? (c) above $\bar{X} - 2S$? (d) below $\bar{X} - S$?
13. Term to remember: a. Inflection point
4.5 DETECTING OUTLIERS
In collecting data, there are many opportunities for mistakes to occur. People misread instruments, transpose numbers, record data in the wrong place, present the wrong experimental condition or instructions, and fail to notice that equipment has malfunctioned. Often these mistakes produce scores that are indistinguishable from correct data and go undetected. However, when you find that John’s IQ is 1100 and Susan’s height is 56 feet, you know that something is wrong.
Figure 4.4-1. Percentage of scores contained in selected intervals around the mean for a normal distribution: 50% of scores fall on each side of $\bar{X}$, 68.27% within $\bar{X} \pm S$, 95.45% within $\bar{X} \pm 2S$, and 99.73% within $\bar{X} \pm 3S$.
Scores that are unusually large or small relative to other scores are called outliers.
Outliers can seriously affect the integrity of data and result in biased or distorted sample statistics and faulty conclusions. Some outliers are obvious, such as an IQ of 1100 or a height of 56 feet, but not all outliers are so obvious. There are gray areas. A number of criteria have been suggested for identifying obvious and not-so-obvious outliers. According to one criterion, an outlier is any score that falls outside of the interval given by
$\mathit{Mdn} \pm 2(Q_3 - Q_1)$
Another criterion identifies an outlier as any score that falls outside of the interval
$\bar{X} \pm 2.5S$
For the IQ scores in Table 4.2-1, the two criteria give the following intervals:
$\mathit{Mdn} \pm 2(Q_3 - Q_1) = 110.7 \pm 2(118.0 - 101.0) = 76.7 \text{ to } 144.7$
and
$\bar{X} \pm 2.5S = 110.35 \pm 2.5(13.53) = 76.5 \text{ to } 144.2$
Both criteria identify one outlier: Waldo’s score of 76. Of the two criteria, $\mathit{Mdn} \pm 2(Q_3 - Q_1)$ is preferred because the $\mathit{Mdn}$, $Q_3$, and $Q_1$ are less influenced by extreme scores than are $\bar{X}$ and $S$. A widely used rule for detecting outliers is based on a box plot, which is described in the next section.
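The two criteria are easy to apply once the summary statistics are known. The sketch below plugs in the values quoted above for the IQ scores (median 110.7, Q1 101.0, Q3 118.0, mean 110.35, S 13.53). The function names are hypothetical, and apart from Waldo's 76 the scores being checked are invented for illustration, since Table 4.2-1 is not reproduced here.

```python
def median_based_interval(mdn, q1, q3, k=2.0):
    """Interval Mdn ± k(Q3 − Q1); k = 2 is the criterion in the text."""
    spread = k * (q3 - q1)
    return mdn - spread, mdn + spread


def mean_based_interval(mean, s, k=2.5):
    """Interval mean ± kS; k = 2.5 is the criterion in the text."""
    spread = k * s
    return mean - spread, mean + spread


def is_outlier(score, interval):
    """True when the score falls outside the interval."""
    low, high = interval
    return score < low or score > high


# Summary statistics quoted in the text for the IQ scores.
median_rule = median_based_interval(mdn=110.7, q1=101.0, q3=118.0)
mean_rule = mean_based_interval(mean=110.35, s=13.53)

# 76 is Waldo's score; the other values are hypothetical.
for score in (76, 101, 118, 144):
    print(score, is_outlier(score, median_rule), is_outlier(score, mean_rule))
```

Both rules flag 76 and none of the other three scores, in line with the intervals computed above.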
Outliers should be carefully examined. Their presence suggests the possibility of some form of data contamination. Data that are obviously erroneous must be either corrected or discarded. For example, an examination of the records might reveal that John’s IQ is 110 rather than 1100 and that Susan is only 5.6 feet tall, not 56 feet. However, school records might confirm that Waldo’s score of 76 is correct. Outliers should be discarded if they are impossible (for example, an IQ of 1100) or if there is ample evidence that they have resulted from some form of data contamination (for example, a participant recorded his answers in the wrong column of an answer sheet, or the equipment malfunctioned).