Descriptive Statistics


Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values.

Using the textbook readings and other resources listed on the web site, be sure you can define, know when to use, calculate (with SPSS), and interpret the following:

I. Indicators of Central Tendency

A. Mode

B. Median

C. Mean

II. Indicators of Dispersion

A. Range

B. Interquartile Range

C. Variance

D. Standard Deviation

III. Graphic Presentation and Summarization

A. Sort raw data

B. Frequency table

C. Reduce raw data to categories

D. Cumulative frequencies & percentiles

E. Histograms

IV. Exploratory Data Analysis

A. Box and whisker plot

B. Stem and leaf display


Displaying the Shape of the Distribution

Goal: Determine how closely the shape of the distribution approximates a Gaussian distribution. “Parametric” statistical tests, the kind we will study next, assume the data do indeed approximate a Gaussian distribution.

V. Indicators of a Gaussian distribution

A. Mean = Median = Mode

B. Skewness: measures the asymmetry of the distribution. A value of zero indicates no skewness is present. The larger the absolute value, the more skewed the distribution. Negative skew indicates the tail of the distribution is to the left, with most of the scores clustering at the higher end of the scale. Positive skew indicates the scores cluster at the low end of the scale and the tail extends to the right.

C. Kurtosis: indicates the flatness of the distribution.

1. Mesokurtic: = 3

2. Platykurtic: < 3

3. Leptokurtic: > 3

D. Graphs

1. Ogive

2. Normal Probability Plots

E. Statistical Tests

1. Chi Square

VI. Resistant indicators

A. Central Tendency

In certain data sets some observed values lie far away from the clump of the data values.

These “outliers” or “extreme” scores may be due to measurement errors or data recording errors, or may represent valid data points. Extreme scores unduly influence the mean and standard deviation. Suppose, for example, that the mean annual salary in this class is

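Skewness (b₁) and kurtosis (b₂) are calculated from the standardized deviations:

$$b_1 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^{3}$$

$$b_2 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^{4} - 3$$

Because 3 is subtracted, b₂ is approximately zero (rather than 3) for a Gaussian distribution.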
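As a rough illustration of the two moment formulas, the sketch below computes b₁ and b₂ in plain Python. The function name `moment_shape` is hypothetical, and the sketch assumes that s is the sample standard deviation (n − 1 denominator). Statistical packages often apply small-sample corrections, so their reported skewness and kurtosis can differ from these values.

```python
import math

def moment_shape(values):
    """Return (b1, b2): moment-based skewness and kurtosis minus 3.

    Assumption: s is the sample standard deviation (n - 1 denominator).
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    # Standardize each value, then average the third and fourth powers.
    z = [(x - mean) / s for x in values]
    b1 = sum(v ** 3 for v in z) / n
    b2 = sum(v ** 4 for v in z) / n - 3
    return b1, b2

# A symmetric toy data set (hypothetical values): skewness should be zero.
b1, b2 = moment_shape([1, 2, 3, 4, 5])
print(round(b1, 4), round(b2, 4))
```

A symmetric data set gives b₁ of zero, and a flat (uniform-like) set such as this one gives a negative b₂.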


by the exact value of the largest score (or value) and thus is a more resistant measure of central tendency.

B. Dispersion.

The range, clearly, is not resistant to the influence of extreme scores. Because each value in a distribution is included in the calculation of the variance and standard deviation, neither is resistant to extreme values.

The interquartile range, because it is based on percentiles, is resistant to extreme scores. The lower quartile is the value such that 25 percent of all values fall below that value. The upper quartile is the value at which 25 percent of all values fall above it.

The interquartile range is the difference between the upper and lower quartiles. In a large sample that approximates the Gaussian distribution, the interquartile range tends to be about 1.35 times the sample standard deviation.
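The ratio between the interquartile range and the standard deviation of a Gaussian distribution can be checked from the quartiles of the standard normal curve. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative only.

```python
from statistics import NormalDist

# Standard Gaussian: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# z-scores of the upper and lower quartiles.
upper_z = std_normal.inv_cdf(0.75)
lower_z = std_normal.inv_cdf(0.25)

# Interquartile range expressed in standard deviation units.
iqr_in_sd_units = upper_z - lower_z
print(round(upper_z, 4), round(iqr_in_sd_units, 4))
```

The quartiles fall roughly 0.67 standard deviations on either side of the mean, so the interquartile range is roughly 1.35 standard deviations.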

C. Shape of the distribution

Resistant indicators of skewness and kurtosis also exist, such as the Yule-Kendall skewness statistic, defined as:

$$\gamma_{YK} = \frac{x_{0.25} - 2x_{0.5} + x_{0.75}}{x_{0.75} - x_{0.25}}$$

Other resistant indicators based on all the quantiles, such as L-moments, also exist, but these are not included in an introductory discussion.
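A short sketch of the interquartile range and the Yule-Kendall statistic follows, applied to the ten P102 exam scores used later in this handout. The function name `yule_kendall` is hypothetical. Quartile conventions differ between programs; the sketch assumes the "exclusive" method of `statistics.quantiles`, so other software may return slightly different quartiles.

```python
from statistics import quantiles

def yule_kendall(values):
    """Return (interquartile range, Yule-Kendall skewness)."""
    # Assumption: quartiles use the "exclusive" convention; others exist.
    q1, q2, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    return iqr, (q1 - 2 * q2 + q3) / iqr

scores = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]
iqr, skew_yk = yule_kendall(scores)
print(iqr, round(skew_yk, 2))
```

Because only the three quartiles enter the calculation, changing the lowest score from 67 to 57 leaves both results unchanged.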


Note that the mean is the arithmetic average, M = Σx / n. The column labelled (x − M) shows the amount by which each score deviates from the mean. This column will always sum to zero.

The sum of the column labelled (x − M)² is known as the “sum of the squared deviations about the mean,” or just the “sum of squares.” The variance is the sum of squares divided by n − 1, that is, Σ(x − M)² / (n − 1), and the standard deviation is the square root of the variance, √[Σ(x − M)² / (n − 1)].

To illustrate the impact of an extreme score, suppose the instructor realizes that the score of 67 for student A was entered by mistake and that student A actually earned a score of 57. Note the changes in the descriptive statistics when this single change is made.

Calculation of Mean and Standard Deviation

Sample of 10 Scores from P102 Exam

Person   Score (x)   (x-M)    (x-M)²
A        67          -23.1    533.61
B        95            4.9     24.01
C        98            7.9     62.41
D        92            1.9      3.61
E        99            8.9     79.21
F        96            5.9     34.81
G        94            3.9     15.21
H        90           -0.1      0.01
I        95            4.9     24.01
J        75          -15.1    228.01

sum of scores: Σx = 901

sum of squares: Σ(x − M)² = 1,004.90

mean: M = Σx / n = 90.1

variance: Σ(x − M)² / (n − 1) = 111.66

standard deviation: √[Σ(x − M)² / (n − 1)] = 10.57

skewness = -1.6557 kurtosis = 1.7794
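The columns of the table above can be reproduced in a few lines of Python. This is a minimal sketch with illustrative variable names; it divides the sum of squares by n − 1, following the variance formula used in this handout.

```python
import math

scores = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]
n = len(scores)

mean = sum(scores) / n                      # M
deviations = [x - mean for x in scores]     # (x - M) column
squared = [d ** 2 for d in deviations]      # (x - M)^2 column

sum_of_squares = sum(squared)
variance = sum_of_squares / (n - 1)         # sample variance
std_dev = math.sqrt(variance)               # sample standard deviation

for x, d, sq in zip(scores, deviations, squared):
    print(f"{x:5d} {d:8.1f} {sq:10.2f}")
print(f"mean={mean:.1f} variance={variance:.2f} sd={std_dev:.2f}")
```

The deviations sum to zero (up to floating-point rounding), which is a quick check that the mean was computed correctly.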


Note the changes in the descriptive statistics presented below. The mean changes slightly (about one percent), as you would expect with an extreme score, but the median remains unchanged. This illustrates the meaning of a “resistant” indicator. The standard deviation shows a 24 percent increase, and skewness and kurtosis also show large changes, suggesting that the shape of the distribution departs even further from the Gaussian.

Effect of an Extreme Score

Sample of 10 Scores from P102 Exam

Person   Score (x)   (x-M)    (x-M)²
A        57          -32.1    1030.41
B        95            5.9      34.81
C        98            8.9      79.21
D        92            2.9       8.41
E        99            9.9      98.01
F        96            6.9      47.61
G        94            4.9      24.01
H        90            0.9       0.81
I        95            5.9      34.81
J        75          -14.1     198.81

sum of scores: Σx = 891

sum of squares: Σ(x − M)² = 1,556.90

mean: M = Σx / n = 89.1

variance: Σ(x − M)² / (n − 1) = 172.99

standard deviation: √[Σ(x − M)² / (n − 1)] = 13.15

skewness = -2.0341 kurtosis = 3.8474

Statistic            Original Data   One Extreme Score
Mean                 90.10           89.10
Standard Error       3.34            4.16
Median               94.50           94.50
Mode                 95.00           95.00
Standard Deviation   10.57           13.15
Sample Variance      111.66          172.99
Kurtosis             1.78            3.85
Skewness             -1.66           -2.03
Range                32.00           42.00
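The comparison between the two data sets can be sketched with Python's standard `statistics` module. The helper name `summarize` is hypothetical; `stdev` uses the n − 1 denominator, matching the sample standard deviation reported above.

```python
from statistics import mean, median, stdev

original = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]
changed = [57] + original[1:]   # student A's score changed from 67 to 57

def summarize(values):
    """Collect a few resistant and non-resistant indicators."""
    return {
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
        "range": max(values) - min(values),
    }

before = summarize(original)
after = summarize(changed)
for name in before:
    change = 100 * (after[name] - before[name]) / before[name]
    print(f"{name:>6}: {before[name]:7.2f} -> {after[name]:7.2f} ({change:+.1f}%)")
```

Only the median is unaffected by the single changed score; the mean, standard deviation, and range all move.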


Here is how skewness is calculated by hand for a different set of data:

Skewness

1. List the raw scores in a column.

2. Subtract the mean from each raw score. These are the deviations from the mean.

3. Raise each of these deviations from the mean to the third power and sum. This is the sum of third moment deviations.

4. Calculate skewness, which is the sum of the deviations from the mean raised to the third power, divided by the number of cases minus 1 times the standard deviation raised to the third power.

y        (y - M)   (y - M)³
8.04      0.54       0.16
6.95     -0.55      -0.17
7.58      0.08       0.00
8.81      1.31       2.24
8.33      0.83       0.57
9.96      2.46      14.87
7.24     -0.26      -0.02
4.26     -3.24     -34.04
10.84     3.34      37.23
4.82     -2.68     -19.27
5.68     -1.82      -6.04

sum of y = 82.51; sum of (y - M) = 0.00; sum of (y - M)³ = -4.46

mean = (Σy)/n = M = 7.50

st dev = √var = 2.03

(n-1) × st dev³ = 83.65

skewness = -4.46 / 83.65 = -0.0533

Calculating Skewness:

1. First, calculate the mean and standard deviation.

2. Subtract the mean from each raw score and cube (i.e., raise to the third power).

3. Sum the cubed deviations.

4. Multiply the number of scores minus 1 by the cubed standard deviation (i.e., raised to the third power).

5. Skewness = step 3 divided by step 4.
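The five steps above can be sketched directly in Python using the eleven values from the hand calculation. Variable names are illustrative. The result differs slightly in the last digits from the hand-calculated figure because the table rounds the standard deviation to 2.03 before cubing it.

```python
import math

y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
n = len(y)

# Step 1: mean and (sample) standard deviation.
m = sum(y) / n
sd = math.sqrt(sum((v - m) ** 2 for v in y) / (n - 1))

# Step 2: subtract the mean from each score and cube.
cubed = [(v - m) ** 3 for v in y]

# Step 3: sum the cubed deviations.
sum_cubed = sum(cubed)

# Step 4: (number of scores - 1) times the cubed standard deviation.
denominator = (n - 1) * sd ** 3

# Step 5: skewness = step 3 divided by step 4.
skewness = sum_cubed / denominator
print(round(m, 2), round(sd, 2), round(sum_cubed, 2), round(skewness, 4))
```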


Keep in mind that if a distribution is positively skewed, the bulk of the values clump around the lower end of the scale with a few trailing off at the high end. Conversely, in a negatively skewed distribution, the bulk of the values clump around or near the high end of the scale with a few values trailing off at the low end.

The following table summarizes the descriptive statistics for the P102 sample.

Table 1: Summary Statistics for P102 Exam Data

Statistic             Symbol             Value    Comment
sample size           n                  10       number of cases/individuals
mean                  x̄                  90.1     non-resistant measure of location
standard deviation    s_x                10.57    non-resistant measure of dispersion
range                 x_max − x_min      32       non-resistant measure of scale
skewness              b₁                 -1.66    non-resistant measure of skewness
kurtosis              b₂                 1.78     non-resistant measure of kurtosis
median                x_0.5              94.5     resistant measure of location
interquartile range   x_0.75 − x_0.25    10.25    resistant measure of dispersion
Yule-Kendall          γ_YK               -0.61    resistant measure of skewness


The equation for the Gaussian curve is

$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$

where:

y = The height of the curve at a given value of x.

σ = The standard deviation of the distribution.

π = A constant (pi) of approximately 3.1416.

x = A specific score within the distribution.

e = The base of the Napierian (natural) logarithms, approximately 2.71828.

µ = The mean of the distribution.

σ² = The variance of the distribution.

[Figure: Gaussian curve with the horizontal axis marked from −4 sd to 4 sd around the mean]
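The height of the Gaussian curve can be evaluated directly from the equation. The sketch below is one way to do so; the function name `gaussian_height` is hypothetical, and the loop simply tabulates the curve at whole standard deviations for a distribution with mean 0 and standard deviation 1.

```python
import math

def gaussian_height(x, mu, sigma):
    """Height y of the Gaussian curve at score x."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Tabulate the curve from -4 sd to +4 sd around the mean.
for k in range(-4, 5):
    print(f"{k:+d} sd: {gaussian_height(k, 0.0, 1.0):.4f}")
```

The curve is highest at the mean and symmetric about it, so the heights at −k sd and +k sd are equal.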


Box Plots

Box plots are useful in visualizing distributions. Consider the following scattergram of per capita income for each of the 50 states (y axis) against the charitable deductions (x axis) listed on 1998 itemized tax returns.

The box plot is read as follows. The line or asterisk within the box is the median of the distribution. Fifty percent of the cases fall within the upper and lower hinges (the box boundaries). The upper hinge occurs at the 75th percentile, which is the third quartile and, in a Gaussian distribution, corresponds to a z-score of 0.67. As discussed earlier, the median occurs at the 50th percentile, which is the second quartile and corresponds to a z-score of zero. The lower hinge occurs at the 25th percentile, which is the first quartile and corresponds to a z-score of -0.67. The “whiskers” terminate at the largest and smallest values that are not considered to be outliers.

The definitions for “outlier” and “extreme” scores may vary depending on the software program. A common definition for an outlier is any value more than 1.5 box-lengths above the upper hinge or below the lower hinge, and for an extreme score, any value more than 3 box-lengths above the upper hinge or below the lower hinge.
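The 1.5 and 3 box-length rules can be sketched as follows, reusing the ten P102 exam scores from earlier in this handout. The function name `classify` is hypothetical, and the sketch assumes the hinges are the quartiles returned by `statistics.quantiles` with the "exclusive" method; a given program may place the hinges slightly differently.

```python
from statistics import quantiles

def classify(values):
    """Label each value as 'inside', 'outlier', or 'extreme'."""
    # Assumption: hinges are taken to be the first and third quartiles.
    lower, _, upper = quantiles(values, n=4, method="exclusive")
    box = upper - lower   # box length (interquartile range)
    labelled = []
    for v in values:
        # Distance beyond the nearer hinge; zero if v lies inside the box.
        distance = max(lower - v, v - upper, 0)
        if distance > 3 * box:
            label = "extreme"
        elif distance > 1.5 * box:
            label = "outlier"
        else:
            label = "inside"
        labelled.append((v, label))
    return labelled

scores = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]
for value, label in classify(scores):
    print(value, label)
```

Under these assumptions the score of 67 lies more than 1.5 box-lengths below the lower hinge and is flagged as an outlier, while 75 is not.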

[Figure: scattergram of Per Capita Income (y axis, 15,000 to 30,000) against Charitable Giving (x axis, 0 to 6,000)]


In the charitable giving example, one of the states (that shall remain nameless) has a high per capita income (around $27,000) but gives only about $1,000 to charity. Notice that the circle for this pair of values lies beyond the whisker of the “charitable giving” box.


Stem and Leaf

Another useful data display is known as the stem and leaf. This is a simple way of displaying the distribution of data without having to use computer graphics. The characteristic that makes the stem and leaf unique is that every value in the data set is displayed. The stem and leaf “plot” groups the values in a data set according to all but their least significant digits.

These are written in ascending or descending order to the left of a vertical bar and are known as the “stem.” The “leaves” are formed by writing the least significant digits to the right of the vertical bar, on the same line as the more significant digits with which they belong. The stem and leaf plot below shows the charitable giving for 100 individuals. We can see that the least amount one person gave was $1,082, while the most one person gave was $5,779. Further, we can see that in the $4,000 range, the following exact values were given: $4,018, $4,057, $4,073, $4,095 . . . $4,814.

The stem and leaf will vary slightly in appearance depending on the specific software used. Some programs enable you to examine the leaves in detail by reporting the number of cases, the spread, the value of the lower and upper hinges, etc.

1*** | 082
1*** | 303
1*** |
1*** | 785
1*** | 870,976,985
2*** | 012,040,116
2*** | 212,242,256,296,308
2*** | 448,482,511,511,530,560
2*** | 609,632,686,718,740,785
2*** | 806,829,833,871,885,899,951,963
3*** | 001,010,015,028,030,088,164,170,171,178
3*** | 225,229,237,277,310,358,385,392
3*** | 413,414,439,450,450,502,519,594
3*** | 615,633,638,654,682,738,761
3*** | 813,813,820,834,860,872,897,914,918,955,994
4*** | 018,057,073,095,154,192
4*** | 238,271,342,377,379,387
4*** | 425,426,494,545
4*** |
4*** | 814
5*** | 009
5*** | 273,379
5*** | 501
5*** | 779
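A stem and leaf display is simple enough to build by hand or in a few lines of code. The sketch below is one possible approach, not the method used by any particular program: the function name `stem_and_leaf` and its `leaf_digits` parameter are hypothetical, it assumes non-negative whole numbers, and it prints one line per stem rather than several lines per stem as in the display above.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_digits=1):
    """Return the lines of a stem and leaf display for non-negative integers."""
    divisor = 10 ** leaf_digits
    stems = defaultdict(list)
    for v in sorted(values):
        # Split each value into its leading digits (stem) and trailing digits (leaf).
        stem, leaf = divmod(v, divisor)
        stems[stem].append(f"{leaf:0{leaf_digits}d}")
    lines = []
    # Include empty stems so that gaps in the data remain visible.
    for stem in range(min(stems), max(stems) + 1):
        lines.append(f"{stem} | {','.join(stems[stem])}".rstrip())
    return lines

# A handful of values taken from the display above.
for line in stem_and_leaf([1082, 1303, 3001, 3010, 4814], leaf_digits=3):
    print(line)
```

Every original value can be read back from the display by joining a stem with one of its leaves.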


Histogram

The range of values is divided into a finite set of class intervals known as “bins.” The number of values in each bin is then counted and divided by the sample size to obtain the relative frequency of occurrence. The frequencies are plotted as vertical bars of varying height. Some programs allow the user to set the number of bins that appear. The frequencies can be divided by the bin width to obtain frequency densities that can be compared to probability densities from a theoretical distribution, such as the Gaussian distribution. For example, the Gaussian probability density function is superimposed on the frequency histogram of the charitable giving of 100 individuals.
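The binning steps described above can be sketched as follows, using the ten P102 exam scores from earlier in this handout. The function name `histogram` is hypothetical, and the sketch assumes equal-width bins spanning the minimum to the maximum value, with the maximum placed in the last bin.

```python
def histogram(values, n_bins):
    """Return (bin width, counts, relative frequencies, frequency densities)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The largest value would index one past the end, so clamp it.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    n = len(values)
    frequencies = [c / n for c in counts]            # counts / sample size
    densities = [f / width for f in frequencies]     # frequency / bin width
    return width, counts, frequencies, densities

scores = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]
width, counts, frequencies, densities = histogram(scores, 4)
print(width, counts, frequencies)
```

The densities multiplied by the bin width sum to one, which is what makes them comparable to a probability density curve.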
