Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values.
Using the textbook readings and other resources listed on the web site, be sure you can define, know when to use, calculate (with SPSS), and interpret the following:
I. Indicators of Central Tendency
A. Mode
B. Median
C. Mean
II. Indicators of Dispersion
A. Range
B. Interquartile Range
C. Variance
D. Standard Deviation
III. Graphic Presentation and Summarization
A. Sort raw data
B. Frequency table
C. Reduce raw data to categories
D. Cumulative frequencies & percentiles
E. Histograms
IV. Exploratory Data Analysis
A. Box and whisker plot
B. Stem and leaf display
Displaying the Shape of the Distribution
Goal: Determine how closely the shape of the distribution approximates a Gaussian distribution. “Parametric” statistical tests, the kind we will study next, assume the data do indeed approximate a Gaussian distribution.
V. Indicators of a Gaussian distribution
A. Mean = Median = Mode
B. Skewness: measures the asymmetry of the distribution. A value of zero indicates no skewness is present. The larger the absolute value, the more skewed the distribution. Negative skew indicates the tail of the distribution is to the left, with most of the scores clustering at the higher end of the scale. Positive skew indicates the scores cluster at the low end of the scale and the tail extends to the right.
C. Kurtosis: indicates the flatness of the distribution.
1. Mesokurtic: = 3
2. Platykurtic: < 3
3. Leptokurtic: > 3
D. Graphs
1. Ogive
2. Normal Probability Plots
E. Statistical Tests
1. Chi Square
VI. Resistant indicators
A. Central Tendency
In certain data sets, some observed values lie far away from the clump of the data values.
These “outliers” or “extreme” scores may be due to measurement errors or data recording errors, or they may represent valid data points. Extreme scores unduly influence the mean and standard deviation. Suppose, for example, that the mean annual salary in this class is
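Skewness (b1) and kurtosis (b2) coefficients:
\[ b_1 = \frac{1}{n} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3 \]
\[ b_2 = \frac{1}{n} \sum \left( \frac{x_i - \bar{x}}{s} \right)^4 - 3 \]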
The median, by contrast, is not affected by the exact value of the largest score (or value) and thus is a more resistant measure of central tendency.
B. Dispersion.
The range, clearly, is not resistant to the influence of extreme scores. Because each value in a distribution is included in the calculation of the variance and standard deviation, neither is resistant to extreme values.
The interquartile range, because it is based on percentiles, is resistant to extreme scores. The lower quartile is the value such that 25 percent of all values fall below that value. The upper quartile is the value at which 25 percent of all values fall above it.
The interquartile range is the difference between the upper and lower quartiles. In a large sample that approximates the Gaussian distribution, the interquartile range tends to be about 1.35 times the sample standard deviation.
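As a rough illustration of resistance, the sketch below uses Python's standard `statistics` module on the ten P102 exam scores that appear later in this handout, once as originally recorded and once with student A's score changed to 57. The quartile rule (`method="exclusive"`) is an assumption; SPSS and other packages use slightly different rules and may report slightly different quartiles. The last lines check the ratio of the interquartile range to the standard deviation for an exact Gaussian distribution.

```python
import statistics

# Ten P102 exam scores from this handout, and the same scores with
# student A's 67 replaced by the extreme score 57.
original = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]
extreme = [57, 95, 98, 92, 99, 96, 94, 90, 95, 75]


def iqr(values):
    """Interquartile range using the 'exclusive' quartile rule (an assumption)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q3 - q1


def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)


print("range:", value_range(original), "->", value_range(extreme))
print("IQR:  ", iqr(original), "->", iqr(extreme))

# For an exact Gaussian distribution, IQR / sigma is about 1.349.
std_normal = statistics.NormalDist(mu=0.0, sigma=1.0)
gaussian_ratio = std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25)
print("Gaussian IQR / sd:", round(gaussian_ratio, 3))
```

Under this quartile rule the range changes when the extreme score is introduced while the interquartile range does not, which is what "resistant" means here.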
C. Shape of the distribution
Resistant indicators of skewness and kurtosis also exist, such as the Yule-Kendall skewness statistic, defined as:
\[ \gamma_{YK} = \frac{x_{0.25} - 2x_{0.5} + x_{0.75}}{x_{0.75} - x_{0.25}} \]
Other resistant indicators based on all the quantiles, such as L-moments, also exist, but these are not included in an introductory discussion.
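A minimal sketch of the Yule-Kendall statistic follows. The function name `yule_kendall` and the quartile values passed to it are hypothetical, chosen only to show how the sign of the statistic responds to the position of the median between the two quartiles.

```python
def yule_kendall(q1, median, q3):
    """Yule-Kendall skewness: (q1 - 2*median + q3) / (q3 - q1)."""
    return (q1 - 2 * median + q3) / (q3 - q1)


# Hypothetical quartiles, chosen only for illustration.
symmetric = yule_kendall(2.0, 5.0, 8.0)     # median midway between the quartiles
right_skewed = yule_kendall(2.0, 3.0, 8.0)  # median closer to the lower quartile
left_skewed = yule_kendall(2.0, 7.0, 8.0)   # median closer to the upper quartile

print(symmetric, right_skewed, left_skewed)
```

Because the statistic uses only the three quartiles, changing the most extreme score in a sample usually leaves it unchanged.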
Note that the mean is the arithmetic average, \( M = \sum x_i / n \). The column labelled (x - M) shows the amount by which each score deviates from the mean. This column will always sum to zero.
The sum of the column labelled (x - M)^2 is known as the “sum of the squared deviations about the mean,” or just the “sum of squares.” The variance is the sum of squares divided by n - 1, \( s^2 = \sum (x - M)^2 / (n - 1) \), and the standard deviation is the square root of the variance, \( s = \sqrt{\sum (x - M)^2 / (n - 1)} \).
To illustrate the impact of an extreme score, suppose the instructor realizes that the score of 67 for student A was entered by mistake. In actuality, student A earned a score of 57. Note the changes in the descriptive statistics when this single change is made.
Calculation of Mean and Standard Deviation
Sample of 10 Scores from P102 Exam
Person   Score (x)   (x - M)   (x - M)^2
A           67        -23.1      533.61
B           95          4.9       24.01
C           98          7.9       62.41
D           92          1.9        3.61
E           99          8.9       79.21
F           96          5.9       34.81
G           94          3.9       15.21
H           90         -0.1        0.01
I           95          4.9       24.01
J           75        -15.1      228.01
sum of scores (Σx) = 901; sum of squares (Σ(x - M)^2) = 1,004.90
mean (Σx / n) = 90.1
variance (Σ(x - M)^2 / (n - 1)) = 111.66
standard deviation (square root of the variance) = 10.57
skewness = -1.6557; kurtosis = 1.7794
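The hand calculation for the original ten scores can be sketched in Python as follows, assuming the n - 1 divisor for the variance. The variable names are illustrative only. The results can be compared with the Sample Variance and Standard Deviation rows of the comparison table further below.

```python
import math

# Ten P102 exam scores (students A through J).
scores = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]

n = len(scores)
mean = sum(scores) / n                             # M = sum of x / n
deviations = [x - mean for x in scores]            # the (x - M) column
sum_of_squares = sum(d ** 2 for d in deviations)   # sum of the (x - M)^2 column
variance = sum_of_squares / (n - 1)                # sample variance
std_dev = math.sqrt(variance)                      # sample standard deviation

print("mean:", round(mean, 2))
print("sum of deviations:", round(sum(deviations), 6))
print("sum of squares:", round(sum_of_squares, 2))
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
```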
Note the changes in the descriptive statistics presented below. The mean changes slightly (about one percent), as you would expect due to an extreme score, but the median remains unchanged. This illustrates the meaning of a “resistant” indicator. The standard deviation shows a 24 percent increase, and skewness and kurtosis also show large changes, suggesting that the shape of the distribution departs even further from the Gaussian.
Effect of an Extreme Score
Sample of 10 Scores from P102 Exam
Person   Score (x)   (x - M)   (x - M)^2
A           57        -32.1     1030.41
B           95          5.9       34.81
C           98          8.9       79.21
D           92          2.9        8.41
E           99          9.9       98.01
F           96          6.9       47.61
G           94          4.9       24.01
H           90          0.9        0.81
I           95          5.9       34.81
J           75        -14.1      198.81
sum of scores (Σx) = 891; sum of squares (Σ(x - M)^2) = 1,556.90
mean (Σx / n) = 89.1
variance (Σ(x - M)^2 / (n - 1)) = 172.99
standard deviation (square root of the variance) = 13.15
skewness = -2.0341; kurtosis = 3.8474
Statistic            Original Data   One Extreme Score
Mean                     90.10            89.10
Standard Error            3.34             4.16
Median                   94.50            94.50
Mode                     95.00            95.00
Standard Deviation       10.57            13.15
Sample Variance         111.66           172.99
Kurtosis                  1.78             3.85
Skewness                 -1.66            -2.03
Range                    32.00            42.00
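The comparison above can be reproduced approximately with Python's standard `statistics` module, as sketched below. The helper name `describe` is hypothetical, and only the statistics the standard library provides directly are included; skewness and kurtosis are left out.

```python
import statistics

original = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]
extreme = [57, 95, 98, 92, 99, 96, 94, 90, 95, 75]  # student A entered as 57


def describe(values):
    """Return a few descriptive statistics for a list of scores."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "range": max(values) - min(values),
    }


before = describe(original)
after = describe(extreme)
for name in before:
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

The median and mode are identical in both columns, while the mean, standard deviation, and range all move with the single changed score.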
Here is how skewness is calculated by hand for a different set of data:
Skewness
1. List the raw scores in a column.
2. Subtract the mean from each raw score. These are the deviations from the mean.
3. Raise each of these deviations to the third power and sum. This is the sum of the cubed deviations (the third moment about the mean).
4. Calculate skewness: the sum of the cubed deviations, divided by the product of the number of cases minus 1 and the standard deviation raised to the third power.
y        (y - M)   (y - M)^3
8.04       0.54       0.16
6.95      -0.55      -0.17
7.58       0.08       0.00
8.81       1.31       2.24
8.33       0.83       0.57
9.96       2.46      14.87
7.24      -0.26      -0.02
4.26      -3.24     -34.04
10.84      3.34      37.23
4.82      -2.68     -19.27
5.68      -1.82      -6.04
sum of y = 82.51; sum of (y - M) = 0.00; sum of (y - M)^3 = -4.46
mean = (sum of y) / n = M = 7.50
st dev = square root of the variance = 2.03
(n - 1) x (st dev)^3 = 83.65
skewness = -4.46 / 83.65 = -0.0533
Calculating Skewness:
1. First, calculate the mean and standard deviation
2. Subtract the mean from each raw score and cube (i.e., raise to the third power).
3. Sum the cubed deviations.
4. Multiply the number of scores minus 1 times the cubed standard deviation (i.e., raised to the third power).
5. Skewness = step 3 divided by step 4
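The five steps can be sketched in Python as follows, using the eleven y values from the hand calculation. The variable names are illustrative. The code carries the standard deviation at full precision, so its denominator differs slightly from a hand calculation that rounds the standard deviation to 2.03, but the skewness agrees to about three decimal places.

```python
import statistics

# The eleven raw scores from the hand calculation.
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(y)
mean = statistics.mean(y)                    # step 1: mean
std_dev = statistics.stdev(y)                # step 1: standard deviation (n - 1)
cubed = [(v - mean) ** 3 for v in y]         # step 2: cubed deviations
sum_cubed = sum(cubed)                       # step 3: sum of cubed deviations
denominator = (n - 1) * std_dev ** 3         # step 4: (n - 1) times sd cubed
skewness = sum_cubed / denominator           # step 5: step 3 divided by step 4

print("mean:", round(mean, 2))
print("standard deviation:", round(std_dev, 2))
print("sum of cubed deviations:", round(sum_cubed, 2))
print("skewness:", round(skewness, 4))
```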
Keep in mind that if a distribution is positively skewed, the bulk of the values clump around the lower end of the scale with a few trailing off at the high end. Conversely, in a negatively skewed distribution, the bulk of the values clump around or near the high end of the scale with a few values trailing off at the low end.
The following table summarizes the descriptive statistics for the P102 sample.
Table 1: Summary Statistics for P102 Exam Data
Statistic             Symbol            Value    Comment
sample size           n                 10       number of cases/individuals
mean                  x̄                 90.1     non-resistant measure of location
standard deviation    s_x               10.57    non-resistant measure of dispersion
range                 x_max - x_min     32       non-resistant measure of scale
skewness              b1                -1.66    non-resistant measure of skewness
kurtosis              b2                1.78     non-resistant measure of kurtosis
median                x_0.5             94.5     resistant measure of location
interquartile range   x_0.75 - x_0.25   10.25    resistant measure of dispersion
Yule-Kendall          γ_YK              -0.19    resistant measure of skewness
The equation for the Gaussian curve is
\[ y = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]
where:
y = The height of the curve at a given value of x.
σ = The standard deviation of the distribution.
π = A constant (pi) of approximately 3.1416.
x = A specific score within the distribution.
e = The base of the Napierian logarithms, approximately 2.71828.
µ = The mean of the distribution.
σ² = The variance of the distribution.
[Figure: Gaussian curve, with the horizontal axis marked from −4 sd to 4 sd around the mean]
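As a small check on the equation, the sketch below evaluates the height of the Gaussian curve at whole standard deviations from the mean. The function name `gaussian_height` is hypothetical, and a mean of 0 with a standard deviation of 1 is assumed purely for illustration.

```python
import math


def gaussian_height(x, mu, sigma):
    """Height y of the Gaussian curve at x, for mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


# Heights from 4 sd below the mean to 4 sd above it (mean 0, sd 1 assumed).
for k in range(-4, 5):
    print(f"{k:+d} sd: {gaussian_height(k, 0.0, 1.0):.4f}")
```

The curve is highest at the mean and symmetric about it, so the heights at -1 sd and +1 sd are equal.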