• No results found

median is the value with exactly half the data

N/A
N/A
Protected

Academic year: 2020

Share "median is the value with exactly half the data"

Copied!
38
0
0

Loading.... (view fulltext now)

Full text

(1)

Chapter 5

Describing Distributions Numerically

What I will know and be able to do:

Describe a distribution numerically by stating its shape (mode, symmetry, skew, etc.), its center (median or

mean), a measure of its spread (standard deviation or IQR) and compare distributions using these features.

Assignment:

(2)

Where is the

Center of the Distribution

?

• If you had to pick a single number to describe all the data what would you pick?

It’s easy to find the center when a histogram is

unimodal and symmetric—it’s right in the middle.

• On the other hand, it’s not so easy to find the center of a skewed histogram or a histogram with more than one mode.

(3)

Center of a Distribution

-- Median

• The median is the value with exactly half the data values below it and half above it.

• It is the middle data value (once the data values have been

ordered) that divides the histogram into

two equal areas

It has the same units

as the data

(4)

How Spread Out is the Distribution?

Variation matters, and Statistics is about variation.

• Are the values of the distribution tightly clustered around the center or more spread out?

Always report a measure of spread along with a

measure of center when describing a distribution numerically.

(5)

Spread

: Home on the Range

• The range of the data is the difference between the maximum and minimum values:

Range = max – min

• A disadvantage of the range is that a single extreme value can make it very large and, thus, not

representative of the data overall.

(6)

Spread

: The Interquartile Range

• The interquartile range (IQR) lets us ignore extreme data values and concentrate on the middle of the data.

• To find the IQR, we first need to know what quartiles

are…

(7)

Spread

: The Interquartile Range (cont.)

• Quartiles divide the data into four equal sections.

One quarter of the data lies below the lower quartile, Q1One quarter of the data lies above the upper quartile, Q3.The quartiles border the middle half of the data.

• The difference between the quartiles is the interquartile range (IQR), so

IQR = upper quartile – lower quartile

(8)

Spread

: The Interquartile Range (cont.)

• The lower and upper

quartiles are the 25th

and 75th percentiles of the data, so…

• The IQR contains the middle 50% of the

values of the

distribution, as shown in figure:

(9)

5-Number Summary

• The 5-number summary of a

distribution reports its median, quartiles, and extremes

(maximum and minimum)

• The 5-number summary for the recent tsunami earthquake

Magnitudes looks like this:

(10)

Why use boxplots?

Advantages of using boxplots

-•

ease of construction

convenient handling of outliers

construction is not subjective (like histograms)

Used with medium or large size data sets (n > 10)

(11)

Disadvantages of boxplots

does not retain the individual observations

should

not

be used with small data sets (n<10)

(12)

How to construct a boxplot

find five-number summary

Min Q1 Med Q3 Max

draw box from Q1 to Q3

draw median as center line in the

box

(13)

Modified boxplots

display outliers

fences mark off mild & extreme outliers

whiskers extend to largest (smallest) data value

inside

the fence

(14)

Inner fence

Q1 Q3

Q1 – 1.5IQR

Q3 + 1.5IQR

Any observation outside this fence is an outlier! Put a dot

for the outliers. Interquartile Range

(IQR) – is the range (length) of the box

(15)

Modified Boxplot . . .

Q1 Q3

Draw the “whisker” from the quartiles to the observation that is within the

(16)

Outer fence

Q1 Q3

Q1 – 3IQR

Q3 + 3IQR

Any observation outside this fence is an extreme outlier!

Any observation between the fences is considered a

(17)

For the AP Exam . . .

. . . you just need to find outliers, you DO NOT

need to identify them as mild or extreme.

Therefore, you just need to use the

(18)

A report from the U.S. Department of Justice gave the

following percent increase in federal prison populations in 20 northeastern & mid-western states in 1999.

5.9 1.3 5.0 5.9 4.5 5.6 4.1 6.3 4.8 6.9 4.5 3.5 7.2 6.4 5.5 5.3 8.0 4.4 7.2 3.2

Create a modified boxplot. Describe the distribution.

(19)

Comparing Displays

• Compare the histogram and boxplot for daily wind speeds:

(20)

Comparing Groups

• It is almost always more

interesting to compare groups.

• With histograms, note the

shapes, centers, and spreads of the two distributions.

(21)

Comparing Groups (cont.)

• Boxplots offer an ideal balance of information and simplicity, hiding the details while displaying the overall summary information.

• We often plot them side by side for groups or categories we wish to compare.

(22)

What About Outliers?

• If there are any clear outliers and you are reporting the

mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing.

(23)

Evidence suggests that a high indoor radon concentration might be linked to the development of childhood cancers. The data that follows is the radon

concentration in two different samples of houses. The first sample consisted of houses in which a child was diagnosed with cancer. Houses in the second

sample had no recorded cases of childhood cancer. Cancer

10 21 5 23 15 11 9 13 27 13 39 22 7 20 45 12 15 3 8 11 18 16 23 16 9 57 16 21 18 38 37 10 15 11 18 21 22 11 16 17 33 10

No Cancer

9 38 11 12 29 5 7 6 8 29 24 12 17 11 11 3 9 33 17 55 11 29 13 24 7 11 21 6 39 29 7 8 55 9 21 9 3 85 11 14

Create parallel boxplots. Compare the distributions.

(24)

Summarizing Symmetric Distributions - The Mean

• When we have symmetric data, there is an alternative other than the median.

• If we want to calculate a number, we can average the data.

• We use the Greek letter sigma to mean “sum” and write:

Slide 4 - 24

The formula says that to find the mean, we add up all the values of the variable and divide by the number of data values, n.

y

Total

n

y

(25)

The

mean

feels like the center because it is the

point where the histogram balances:

Slide 4 - 25

(26)

Mean or Median?

• Because the median considers only the order of

values, it is resistant to values that are extraordinarily large or small; it simply notes that they are one of the “big ones” or “small ones” and ignores their distance from center.

To choose between the mean and median, start by

looking at the data. If the histogram is symmetric and there are no outliers, use the mean.

However, if the histogram is skewed or with outliers,

you are better off with the median.

(27)

What About Spread?

The Standard Deviation

• A more powerful measure of spread than the IQR is the standard deviation, which takes into account how far each data value is from the mean.

• A deviation is the distance that a data value is from the mean.

• Since adding all deviations together would total

zero, we square each deviation and find an average of sorts for the deviations.

(28)

What About Spread?

The Standard Deviation

• The variance, notated by s2, is found by summing the squared deviations and (almost) averaging them:

• The variance will play a role later in our study, but it is problematic as a measure of spread—it is measured in

squared units! Slide 4 - 28

s

2

y

y

2

(29)

• The standard deviation, s, is just the square root of

the variance and is measured in the same units as the original data.

Slide 4 - 29

s

y

y

2

n

1

(30)

Thinking About Variation

• Since Statistics is about variation, spread is an important fundamental concept of Statistics.

Measures of spread help us talk about what we don’t

know.

• When the data values are tightly clustered around the center of the distribution, the IQR and standard

deviation will be small.

• When the data values are scattered far from the

center, the IQR and standard deviation will be large.

(31)

Tell -- Draw a Picture

When talking about quantitative variables, start

by making a histogram or stem-and-leaf display

and discuss the shape of the distribution.

(32)

Tell -- Shape, Center, and Spread

Next, always report the

shape

of its distribution,

along with a

center

and a

spread

.

If the shape is skewed, report the median and IQR.If the shape is symmetric, report the mean and

standard deviation and possibly the median and IQR as well.

(33)

Tell -- What About Unusual Features?

• If there are multiple modes, try to understand why. If you identify a reason for the separate modes, it may be good to split the data into two groups.

• If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing.

• Note: The median and IQR are not likely to be affected by the outliers.

(34)

What Can Go Wrong? (cont.)

• Avoid inconsistent scales, either within the display or

when comparing two displays.

• Label clearly so a reader knows what the plot displays.

(35)

What Can Go Wrong? (cont.)

• Beware of outliers

• Be careful when comparing groups that have very

different spreads.

• Consider these side-by-side boxplots of

cotinine levels:

(36)

What Can Go Wrong? (cont.)

• Don’t forget to do a reality check – don’t let the calculator do the thinking for you.

• Don’t forget to sort the values before finding the median or percentiles.

• Don’t worry about small differences when using different methods.

• Don’t compute numerical summaries of a categorical variable.

• Don’t report too many decimal places.

• Don’t round in the middle of a calculation.

• Watch out for multiple modes

• Beware of outliers

• Make a picture … make a picture . . . make a picture !!!

(37)

What have we learned?

• We’ve learned how to make a picture for quantitative data to help us see the story the data have to Tell.

• We can display the distribution of quantitative data with a

histogram, stem-and-leaf display, or dotplot.

• We’ve learned how to summarize distributions of quantitative variables numerically.

• Measures of center for a distribution include the median and mean.

• Measures of spread include the range, IQR, and standard deviation.

• Use the median and IQR when the distribution is skewed. Use the mean and standard deviation if the distribution is symmetric.

(38)

What have we learned? (cont.)

• We’ve learned to Think about the type of variable we are summarizing.

• All methods of this chapter assume the data are quantitative.

• The Quantitative Data Condition serves as a check that the data are, in fact, quantitative.

References

Related documents

Political stability has a positive and significant effect on forest cover; however, corruption control has a negative and significant effect on forest cover because

In conclusion, a seasonal influenza immunization policy targeting the elderly Italian population, in which enhanced vaccines (aTIV and idTIV) are administered to high-risk

The average TFP of firms in rural towns is much higher than that of rural firms in remote areas, but small firms in rural towns are not significantly less productive than small

The participants also contended that the senior leaders in the Ministry of Primary and Secondary Education should support schools, as observed by Munyaradzi (2012: 10),

Commercial banks added to security holdings in 1960, as demand for bank loans slackened during the year, in contrast with 1959 when bank security holdings were substantially reduced

from Four Angles Establish Leadership Cohesion and Commitment Enlist the Help of Frontline Motivators Implement Organizational Solutions that Create Lasting Behavior Change..

For example, one way to take the ASCO framework forward would be to replace the Clinical Benefit dimension with absolute risk reduction, years of life gained or even (if

The authors conducted a study to examine the contributions of fear of self-compassion (as measured by the FSC-SC) and self- compassion (as measured by the SCS) in the link