Chapter 4
Population characteristic
• Fixed value about a population
• Typically unknown
Suppose we want to know the MEAN length
of all the fish in Lake Lewisville . . .
Is this a value that is known?
Can we find it out?
At any given point
in time, how
many values are
Statistic
• Value calculated from a sample
Suppose we want to know the MEAN
length of all the fish in Lake Lewisville.
Measures of Central Tendency
• Mode
– the observation that occurs the most often
– Can be more than one mode
– If all values occur only once, there is no mode
Measures of Central Tendency
Median - the middle value of the data; it divides the observations in half
To find: list the observations in numerical order
Suppose we catch a sample of 5 fish from the
lake. The lengths of the fish (in inches) are
listed below. Find the median length of fish.
3 4 5 8 10
The numbers are in order & n is odd – so find the middle observation.
The median length is 5 inches.
Suppose we caught a sample of 6 fish from the
lake. The median length is …
3 4 5 6 8 10
The numbers are in order & n is even – so find the middle two observations.
Now, average these two values.
The median length is 5.5 inches.
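The odd and even cases can be sketched in a few lines of Python. This is only an illustration; the function name `median` is chosen here and is not part of the slides.

```python
def median(values):
    """Return the median of a list of numbers."""
    ordered = sorted(values)      # list the observations in numerical order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # n is odd: the median is the middle observation
        return ordered[middle]
    # n is even: average the middle two observations
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([3, 4, 5, 8, 10]))      # sample of 5 fish
print(median([3, 4, 5, 6, 8, 10]))   # sample of 6 fish
```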
Measures of Central Tendency
Mean is the arithmetic average.
– Use μ to represent a population mean (a population characteristic)
– Use x̄ to represent a sample mean (a statistic)
Formula: x̄ = Σx / n
Σ is the capital Greek letter sigma – it means to sum the values that follow
Suppose we caught a sample of 6 fish from
the lake. Find the mean length of the
fish.
3 4 5 6 8 10
To find the mean length of fish, add the observations and divide by n.
x̄ = (3 + 4 + 5 + 6 + 8 + 10) / 6 = 36 / 6 = 6
Now find how each observation deviates from the mean.
x          3   4   5   6   8   10   Sum
(x - x̄)   -3  -2  -1   0   2    4     0
3 - 6 = -3. This is the deviation from the mean.
Find the rest of the deviations from the mean.
What is the sum of the deviations from the mean? 0
Will this sum always equal zero? YES
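The mean and the deviations from the mean for the six fish can be checked with a short Python sketch. The variable names are illustrative. With other data, floating-point rounding may leave a sum that is only very close to zero.

```python
lengths = [3, 4, 5, 6, 8, 10]   # fish lengths in inches

# Mean: add the observations and divide by n
x_bar = sum(lengths) / len(lengths)

# Deviation of each observation from the mean
deviations = [x - x_bar for x in lengths]

print("mean:", x_bar)
print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
```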
Imagine a ruler with pennies placed at
3”, 4”, 5”, 6”, 8” and 10”.
To balance the ruler on your finger, you would need to place your finger at the mean of 6.
The mean is the balance point of the distribution.
What happens to the median & mean if
the length of 10 inches was 15 inches?
3 4 5 6 8 15
The median is . . .
5.5
The mean is . . .
6.833
What happens to the median & mean if
the 15 inches was 20?
3 4 5 6 8 20
The median is . . .
5.5
The mean is . . .
7.667
Some statistics are not affected by extreme values; they are called resistant.
Is the median affected by extreme values? NO, the median is resistant.
Is the mean affected by extreme values? YES, the mean is not resistant.
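A quick way to see the effect of an extreme value is to recompute both statistics as the largest fish grows. This sketch uses Python's standard `statistics` module; the dictionary labels are made up for the illustration.

```python
from statistics import mean, median

# Same sample of fish, with the largest length changed each time
samples = {
    "largest = 10": [3, 4, 5, 6, 8, 10],
    "largest = 15": [3, 4, 5, 6, 8, 15],
    "largest = 20": [3, 4, 5, 6, 8, 20],
}

for label, lengths in samples.items():
    # The median stays put while the mean follows the extreme value
    print(label, "median =", median(lengths), "mean =", round(mean(lengths), 3))
```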
Suppose we caught a sample of 20 fish with the following lengths. Create a histogram for the lengths of fish. (Use a class width of 1.)
3 5 6 10 6 7 7 8 4 5 6 4 7 5 9 9 8 7 6 8
Calculate the mean and median.
Mean = 6.5
Median = 6.5
Look at the placement of the mean and median in this symmetrical distribution.
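As a rough stand-in for the histogram, the sketch below counts how many fish fall at each length (class width of 1) and prints a text bar for each class. It uses only the Python standard library, and the text layout is an assumption made for illustration.

```python
from collections import Counter
from statistics import mean, median

lengths = [3, 5, 6, 10, 6, 7, 7, 8, 4, 5, 6, 4, 7, 5, 9, 9, 8, 7, 6, 8]

# Frequency of each length; with a class width of 1 each value is its own class
counts = Counter(lengths)
for length in range(min(lengths), max(lengths) + 1):
    print(f"{length:2d} | {'*' * counts[length]}")

print("mean =", mean(lengths), "median =", median(lengths))
```

The bars rise and fall evenly around the center, which is why the mean and median land in the same place.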
Suppose we caught a sample of 20 fish with
the following lengths. Create a histogram
for the lengths of fish.
(Use a class width of 1.)
Mean = 6.8
Median = 5.5
Calculate the mean and median. Look at the placement of the mean
and median in this skewed distribution.
Suppose we caught a sample of 20 fish with
the following lengths. Create a histogram
for the lengths of fish.
(Use a class width of 1.)
Mean =
Median =
8.5
7.75
Calculate the mean and median. Look at the placement of the mean
and median in this skewed distribution.
Recap:
• In a symmetrical distribution, the mean and median are equal.
• In a skewed distribution, the mean is pulled in the direction of the skewness.
• In a symmetrical distribution, you should report the mean!
• In a skewed distribution, the median should be reported!
Trimmed mean:
Purpose is to remove outliers from a data set
To calculate a trimmed mean:
• Multiply the percent to trim by n
• Truncate that many observations from BOTH ends of the distribution (when listed in order)
Find the mean of the following set of data.
12 14 19 20 22 24 25 26 26 50
Mean = 23.8
To find the 10% trimmed mean: 10%(10) = 1
So remove one observation from each side!
Trimmed mean = (14 + 19 + 20 + 22 + 24 + 25 + 26 + 26) / 8 = 176 / 8 = 22
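The trimming steps can be sketched as a small Python function. The name `trimmed_mean` and the choice to pass the percent as a whole number (10 rather than 0.10) are assumptions made here to keep the arithmetic exact.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent` percent of the observations from each end."""
    ordered = sorted(values)          # list the observations in order
    n = len(ordered)
    k = (percent * n) // 100          # number of observations to drop from EACH end
    trimmed = ordered[k:n - k]
    return sum(trimmed) / len(trimmed)


data = [12, 14, 19, 20, 22, 24, 25, 26, 26, 50]
print("mean:", sum(data) / len(data))
print("10% trimmed mean:", trimmed_mean(data, 10))
```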
What values are used to describe categorical data?
Suppose that each person in a sample of 15 cell phone users is asked if he or she is satisfied with the cell phone service. What would be the possible responses?
Here are the responses:
Y N Y Y Y N N Y Y N Y Y Y N N
Find the sample proportion of the people who answered “yes”:
p̂ = 9/15 = 0.60 (p̂ is pronounced p-hat)
60% of the sample was satisfied with their cell phone service.
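The sample proportion is just a count divided by the sample size. A minimal Python sketch follows; the variable names are illustrative.

```python
responses = "Y N Y Y Y N N Y Y N Y Y Y N N".split()

yes_count = responses.count("Y")      # number of "yes" responses
n = len(responses)                    # sample size
p_hat = yes_count / n                 # sample proportion

print(yes_count, "of", n, "answered yes; p-hat =", p_hat)
```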
Why is the study of variability
important?
• There is variability in virtually everything
• Allows us to distinguish between usual &
unusual values
• Reporting only a measure of center
doesn’t provide a complete picture of the
distribution.
Does this can of soda contain exactly 12 ounces?
Notice that these three data sets all have the same mean and median (at 45), but they differ in variability.
Measures of Variability
The simplest numeric measure of variability is range.
Range = largest observation – smallest observation
Measures of Variability
Another measure of the variability in a data set uses the deviations from the mean (x – x̄).
Remember the sample of 6 fish that we
caught from the lake . . .
They were the following lengths:
3”, 4”, 5”, 6”, 8”, 10”
The mean length was 6 inches. Recall
that we calculated the deviations from
the mean. What was the sum of these
deviations?
Can we find an average
deviation?
What can we do to the
deviations so that we could
find an average?
The estimated average of the deviations squared is called the variance.
Population variance is denoted by σ².
Sample variance is denoted by s²:
s² = Σ(x – x̄)² / (n – 1)
The denominator (n – 1) is the degrees of freedom.
When calculating sample variance, we use degrees of freedom (n – 1) in the denominator instead of n because this tends to produce better estimates.
Degrees of freedom will be revisited in Chapter 8.
Suppose that everyone in the class
caught a sample of 6 fish from the
lake. Would each of our samples
contain the same fish?
Would our mean lengths be the
same?
Remember the sample of 6 fish that we caught from the lake . . .
Find the variance of the length of fish.
Finding the average of the deviations would always equal 0!
First square the deviations.
x           3   4   5   6   8   10   Sum
(x - x̄)    -3  -2  -1   0   2    4     0
(x - x̄)²    9   4   1   0   4   16    34
What is the sum of the deviations squared? 34
Divide this by 5.
s² = 34 / 5 = 6.8
Measures of Variability
The square root of variance is called standard deviation.
A typical deviation from the mean is the standard deviation.
Calculation of standard deviation of a sample:
s = √( Σ(x – x̄)² / (n – 1) )
s² = 6.8 inches², so s = 2.608 inches
Population standard deviation is denoted by σ (where n is used in the denominator).
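The variance and standard deviation of the six fish lengths can be reproduced step by step in Python. This is a sketch with illustrative variable names, using n – 1 in the denominator as the slides describe for a sample.

```python
from math import sqrt

lengths = [3, 4, 5, 6, 8, 10]   # fish lengths in inches
n = len(lengths)
x_bar = sum(lengths) / n

# Square each deviation from the mean, then add them up
squared_deviations = [(x - x_bar) ** 2 for x in lengths]
sum_of_squares = sum(squared_deviations)

# Sample variance divides by the degrees of freedom, n - 1
sample_variance = sum_of_squares / (n - 1)
sample_sd = sqrt(sample_variance)

print("sum of squared deviations:", sum_of_squares)
print("s^2 =", sample_variance)
print("s =", round(sample_sd, 3))
```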
Measures of Variability
Interquartile range (iqr) is the range of the middle half of the data.
Lower quartile (Q1) is the median of the lower half of the data
Upper quartile (Q3) is the median of the upper half of the data
iqr = Q3 – Q1
What advantage does the interquartile range have over the standard deviation?
The Chronicle of Higher Education (2009-2010 issue) published the accompanying data on the percentage of the population with a bachelor’s or higher degree in 2007 for each of the 50 states and the District of Columbia.
21 27 26 19 30 35 35 26 47 26 27 30 24 29 22 24 29 20 20 27 35 38 25 31 19 24 27 27 23 34 25 32 26 24 22 28 26 30 23 25 22 25 29 33 34 30 17 25 23 34 26
First put the data in order & find the median.
17 19 19 20 20 21 22 22 22 23 23 23 24 24 24 24 25 25 25 25 25 26 26 26 26 26 26 27 27 27 27 27 28 29 29 29 30 30 30 30 31 32 33 34 34 34 35 35 35 38 47
Median = 26
Find the lower quartile (Q1) by finding the median of the lower half.
Q1 = 24
Find the upper quartile (Q3) by finding the median of the upper half.
Q3 = 30
iqr = 30 – 24 = 6
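The quartile method used here (median of the lower half and median of the upper half) can be sketched in Python. The function names are illustrative. The sketch assumes that when n is odd the median itself is left out of both halves; library routines may use different quartile conventions and give slightly different values.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # lower half (median excluded when n is odd)
    upper = ordered[n - half:]    # upper half (median excluded when n is odd)
    return median(lower), median(ordered), median(upper)


percents = [int(v) for v in (
    "17 19 19 20 20 21 22 22 22 23 23 23 24 24 24 24 25 25 25 25 25 "
    "26 26 26 26 26 26 27 27 27 27 27 28 29 29 29 30 30 30 30 31 32 "
    "33 34 34 34 35 35 35 38 47"
).split()]

q1, med, q3 = quartiles(percents)
print("Q1 =", q1, "median =", med, "Q3 =", q3, "iqr =", q3 - q1)
```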
Another graph- Boxplots
What are some advantages of boxplots?
• ease of construction
• convenient handling of outliers
• construction is not subjective (like
histograms)
• Used with medium or large size data
sets (n > 10)
Boxplots
When to Use
Univariate numerical data
Use for moderate to large data sets. Don’t use with data sets of n < 10.
How to construct a Skeleton Boxplot
– Calculate the five number summary
– Draw a horizontal (or vertical) scale
– Construct a rectangular box from the lower quartile (Q1) to the upper quartile (Q3)
– Draw lines from the lower quartile to the smallest observation and from the upper quartile to the largest observation
To describe
– comment on the center, spread, and shape of the distribution and whether there are any unusual features
The five-number summary is the minimum value, first quartile, median, third quartile, and maximum value.
Remember the data on the percentage of the population with a bachelor’s or higher degree in
2007 for each of the 50 states and the District of Columbia.
17 19 19 20 20 21 22 22 22 23 23 23 24 24 24 24 25 25 25 25 25 26 26 26 26 26 26 27 27 27 27 27 28 29 29 29 30 30 30 30 31 32 33 34 34 34 35 35 35 38 47
First draw a scale
Draw a box from Q1 to Q3
Draw a line for the median
Modified boxplots
To display outliers:
• Identify mild & extreme outliers
An observation is an outlier if it is more than 1.5(iqr) away from the nearest quartile.
An outlier is extreme if it is more than 3(iqr) away from the nearest quartile.
• whiskers extend to largest (or smallest) data observation that is not an outlier
Modified boxplots are generally preferred
because they provide more information
Remember the data on the percentage of the population with a bachelor’s or higher degree in
2007 for each of the 50 states and the District of Columbia.
17 19 19 20 20 21 22 22 22 23 23 23 24 24 24 24 25 25 25 25 25 26 26 26 26 26 26 27 27 27 27 27 28 29 29 29 30 30 30 30 31 32 33 34 34 34 35 35 35 38 47
First, draw the scale, box and the line for the median.
Next calculate the fences for outliers.
24 - 1.5(6) = 15     30 + 1.5(6) = 39
30 + 3(6) = 48
There is one outlier at the upper end of the distribution, but none at the lower end. Is it extreme? No: 47 is less than 48, so it is a mild outlier.
Draw lines for the whiskers.
Place a solid dot for the outlier.
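The outlier rules can be sketched as a small Python function. The quartiles Q1 = 24 and Q3 = 30 are taken from the slides; the function name `classify` and its return strings are made up for this illustration.

```python
def classify(value, q1, q3):
    """Label a value using the 1.5(iqr) and 3(iqr) rules."""
    iqr = q3 - q1
    # Distance beyond the nearest quartile (zero or negative inside the box)
    distance = max(q1 - value, value - q3)
    if distance > 3 * iqr:
        return "extreme outlier"
    if distance > 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"


q1, q3 = 24, 30   # quartiles of the bachelor's degree data
for value in (17, 38, 47):
    print(value, "->", classify(value, q1, q3))
```

Because 38 is the largest observation that is not an outlier, the upper whisker would stop there.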
To describe:
The distribution of percent of the population with a bachelor’s degree or higher for the U.S. states and District of Columbia is positively
Symmetrical boxplots Approximately symmetrical boxplot
Skewed boxplot
Notice that all 3
boxplots are identical, but their corresponding
histograms are very different. Can you determine the number
of modes from a boxplot?
Notice that the range of the lower half and the range of the upper
half of this distribution are
approximately equal so we can say that it is
approximately symmetrical.
However, the ranges of the two halves of this
distribution are definitely different sizes, so it would be skewed in the direction
The 2009-2010 salaries of NBA players
published on the web site hoopshype.com were used to construct the comparative boxplot of salary data for five teams.
Discuss the similarities and differences.
Interpreting Center & Variability
Chebyshev’s Rule –
The percentage of observations that are within k standard deviations of the mean is at least 100(1 – 1/k²)%, where k > 1
If k = 2, then at least 75% of the observations are within 2 standard deviations of the mean.
This rule can be used with any distribution – no matter its shape!
For a sample of families with one preschool child, it was reported that the mean child care time per week was approximately 36 hours with a standard deviation of approximately 12 hours.
Using Chebyshev’s rule, at least 75% of the sample observations must be between 12 and 60 hours (within 2 standard deviations of the mean).
At most, what percent of the observations are greater than 72 hours?
At least 89% of the observations are between 0 & 72 hours. Since time can’t be negative, at most 11% of the observations are greater than 72 hours.
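Chebyshev's bound, 1 – 1/k², is easy to tabulate. The Python sketch below applies it to the child care example (mean 36 hours, standard deviation 12 hours); the function name is an illustrative choice.

```python
def chebyshev_at_least(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule requires k > 1")
    return 1 - 1 / k ** 2


mean_hours, sd_hours = 36, 12
for k in (2, 3):
    low = mean_hours - k * sd_hours
    high = mean_hours + k * sd_hours
    print(f"k = {k}: at least {chebyshev_at_least(k):.1%} between {low} and {high} hours")
```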
Input the following command into a graphing calculator in order to graph a normal curve with a mean of 20 and standard deviation of 3.
Y1 = normalpdf(X,20,3) (Window x: [10,30] y: [0,0.2])
Use the command 2nd trace, 7 to find the area under the curve for the: (Round to 3 decimal places.)
Lower limit: 17 Upper limit: 23 Area: ________
Lower limit: 14 Upper limit: 26 Area: ________
Lower limit: 11 Upper limit: 29 Area: ________
Graph a normal curve with a mean of 50 and standard deviation of 5.
Y1 = normalpdf(X,50,5) (x: [30,70] y: [0,0.1])
Find the area under the curve for the following:
Lower limit: 45 Upper limit: 55 Area: ________
Lower limit: 40 Upper limit: 60 Area: ________
Lower limit: 35 Upper limit: 65 Area: ________
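For readers without a graphing calculator, the same areas can be approximated in Python with `statistics.NormalDist` from the standard library (Python 3.8 or later). This is a sketch of an alternative to the calculator steps, not a replacement for them; the helper name `area_between` is made up here.

```python
from statistics import NormalDist


def area_between(mean, sd, lower, upper):
    """Area under the normal curve between two limits."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(upper) - dist.cdf(lower)


for mean, sd in ((20, 3), (50, 5)):
    for k in (1, 2, 3):
        lower, upper = mean - k * sd, mean + k * sd
        area = area_between(mean, sd, lower, upper)
        print(f"mean {mean}, sd {sd}: {lower} to {upper} -> {area:.3f}")
```

Each pair of limits sits 1, 2, or 3 standard deviations either side of the mean, so the two curves give matching areas.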
What’s my area?
Interpreting Center & Variability
Empirical Rule –
• Approximately 68% of the observations are within 1 standard deviation of the mean
• Approximately 95% of the observations are within 2 standard deviations of the mean
• Approximately 99.7% of the observations are within 3 standard deviations of the mean
Can ONLY be used with distributions that are mound shaped!
The heights of male students at PWSH are approximately normally distributed with a mean of 71 inches and standard deviation of 2.5 inches.
a) What percent of the male students are shorter than 66 inches? About 2.5%
b) Taller than 73.5 inches? About 16%
c) Between 66 & 73.5 inches? About 81.5% (47.5% below the mean plus 34% above it)
Measures of Relative Standing
Z-score
A z-score tells us how many standard deviations the value is from the mean.
z = (value – mean) / standard deviation
What do these z-scores mean?
-2.3: 2.3 standard deviations below the mean
1.8: 1.8 standard deviations above the mean
-4.3: 4.3 standard deviations below the mean
Sally is taking two different math achievement tests with different means and standard deviations. The mean score on test A was 56 with a standard deviation of 3.5, while the mean score on test B was 65 with a standard deviation of 2.8. Sally scored a 62 on test A and a 69 on test B. On which test did Sally score better?
Test A: z = (62 – 56) / 3.5 ≈ 1.71
Test B: z = (69 – 65) / 2.8 ≈ 1.43
She did better on test A.
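Comparing scores from different tests comes down to comparing z-scores. A minimal Python sketch, with an illustrative function name:

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / sd


z_a = z_score(62, 56, 3.5)   # Sally's score on test A
z_b = z_score(69, 65, 2.8)   # Sally's score on test B

print("test A z-score:", round(z_a, 2))
print("test B z-score:", round(z_b, 2))
print("better test:", "A" if z_a > z_b else "B")
```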
Measures of Relative Standing
Percentiles
A percentile is a value in the data set where r percent of the observations fall AT or BELOW that value.
In addition to weight and length, head circumference is another measure of health in newborn babies. The National Center for Health Statistics reports the following summary values for head circumference (in cm) at birth for boys.
Percentile                   5     10     25     50     75     90     95
Head circumference (cm)   32.2   33.2   34.5   35.8   37.0   38.2   38.6
What percent of newborn boys had head circumferences greater than 37.0 cm?
10% of newborn babies have head
circumferences bigger than what value?