Unit 4 Notes

Academic year: 2019

(1)

Chapter 4

(2)

Population characteristic

• Fixed value about a population

• Typically unknown

Suppose we want to know the MEAN length

of all the fish in Lake Lewisville . . .

Is this a value that is known?

Can we find it out?

At any given point

in time, how

many values are

(3)

Statistic

• Value calculated from a sample

Suppose we want to know the MEAN

length of all the fish in Lake Lewisville.

(4)

Measures of Central Tendency

• Mode

– The observation that occurs most often

– There can be more than one mode

– If all values occur only once, there is no mode

(5)

Measures of Central Tendency

Median – the middle value of the data; it divides the observations in half

To find: list the observations in numerical order

(6)

Suppose we catch a sample of 5 fish from the

lake. The lengths of the fish (in inches) are

listed below. Find the median length of fish.

3 4 5 8 10

The numbers are in order & n is odd – so find the middle observation.

(7)

Suppose we caught a sample of 6 fish from the

lake. The median length is …

3 4 5 6 8 10

The numbers are in order & n is even – so find the middle two observations.

Now, average these two values: (5 + 6) / 2 = 5.5

The median length is 5.5 inches.
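The two median examples can be checked with a short Python sketch. The function name `median` and the variable names are illustrative choices, not part of the original notes.

```python
def median(values):
    """Return the middle value (or the average of the middle two when n is even)."""
    ordered = sorted(values)   # list the observations in numerical order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:             # n is odd: take the single middle observation
        return ordered[mid]
    # n is even: average the middle two observations
    return (ordered[mid - 1] + ordered[mid]) / 2


five_fish = [3, 4, 5, 8, 10]
six_fish = [3, 4, 5, 6, 8, 10]

print(median(five_fish))  # 5
print(median(six_fish))   # 5.5
```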

(8)

Measures of Central Tendency

Mean is the arithmetic average.

– Use μ to represent a population mean (a population characteristic)

– Use x̄ to represent a sample mean (a statistic)

Formula: x̄ = Σx / n

Σ is the capital Greek letter sigma – it means to sum the values that follow

(9)

Suppose we caught a sample of 6 fish from

the lake. Find the mean length of the

fish.

3 4 5 6 8 10

To find the mean length of fish, add the observations and divide by n, the number of observations.

x̄ = 36 / 6 = 6 inches

(10)

Now find how each observation deviates from the mean.

x           3    4    5    6    8   10   Sum
(x - x̄)    -3   -2   -1    0    2    4     0

3 - 6 = -3. This is the deviation from the mean.

Find the rest of the deviations from the mean.

What is the sum of the deviations from the mean? 0

Will this sum always equal zero? YES
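A minimal Python sketch of the same calculation for the six fish lengths. The variable names are illustrative.

```python
lengths = [3, 4, 5, 6, 8, 10]           # fish lengths in inches

mean = sum(lengths) / len(lengths)       # add the observations, divide by n
deviations = [x - mean for x in lengths] # how far each observation is from the mean

print(mean)             # 6.0
print(deviations)       # [-3.0, -2.0, -1.0, 0.0, 2.0, 4.0]
print(sum(deviations))  # 0.0
```

The deviations sum to zero because the mean is the balance point of the data.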

(11)

Imagine a ruler with pennies placed at

3”, 4”, 5”, 6”, 8” and 10”.

To balance the

ruler on your

finger, you would

need to place your

finger at the mean

of 6.

The mean is the

(12)

What happens to the median & mean if

the length of 10 inches was 15 inches?

3 4 5 6 8 15

The median is . . .

5.5

The mean is . . .

6.833

(13)

What happens to the median & mean if

the 15 inches was 20?

3 4 5 6 8 20

The median is . . .

5.5

The mean is . . .

7.667
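The last two slides can be reproduced with a short Python sketch that changes only the largest fish. The names `base` and `summary` are illustrative.

```python
from statistics import median

base = [3, 4, 5, 6, 8]   # the five smaller fish stay the same
summary = {}

for largest in (10, 15, 20):
    sample = base + [largest]
    mean = sum(sample) / len(sample)
    summary[largest] = (median(sample), round(mean, 3))
    print(largest, summary[largest])

# 10 (5.5, 6.0)
# 15 (5.5, 6.833)
# 20 (5.5, 7.667)
```

The median stays at 5.5 while the mean moves toward the extreme value.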

(14)

Some statistics are not affected by extreme values; these are called resistant.

Is the median affected by extreme values? NO

Is the mean affected by extreme values? YES

(15)

Suppose we caught a sample of 20 fish with

the following lengths. Create a histogram

for the lengths of fish.

(Use a class width of 1.)

3 5 6 10 6 7 7 8 4 5 6 4 7 5 9 9 8 7 6 8

Calculate the mean and median.

Mean = 6.5

Median = 6.5

Look at the placement of the mean and median in this symmetrical distribution.
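As a rough stand-in for the histogram, the Python sketch below counts how many fish fall at each length (class width of 1) and prints a text bar for each class. The text layout is an illustrative choice, not the graph from the original slide.

```python
from collections import Counter
from statistics import mean, median

lengths = [3, 5, 6, 10, 6, 7, 7, 8, 4, 5, 6, 4, 7, 5, 9, 9, 8, 7, 6, 8]

counts = Counter(lengths)  # frequency of each length
for value in range(min(lengths), max(lengths) + 1):
    print(f"{value:2d} | {'*' * counts[value]}")

print("mean =", mean(lengths))      # mean = 6.5
print("median =", median(lengths))  # median = 6.5
```

The bar lengths rise and fall evenly around the center (1, 2, 3, 4, 4, 3, 2, 1), which is why the mean and median coincide.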

(16)

Suppose we caught a sample of 20 fish with

the following lengths. Create a histogram

for the lengths of fish.

(Use a class width of 1.)

Mean =

Median =

6.8

5.5

Calculate the mean and median. Look at the placement of the mean

and median in this skewed distribution.

(17)

Suppose we caught a sample of 20 fish with

the following lengths. Create a histogram

for the lengths of fish.

(Use a class width of 1.)

Mean =

Median =

8.5

7.75

Calculate the mean and median. Look at the placement of the mean

and median in this skewed distribution.

(18)

Recap:

• In a symmetrical distribution, the mean and median are equal.

• In a skewed distribution, the mean is pulled in the direction of the skewness.

• In a symmetrical distribution, you should report the mean!

• In a skewed distribution, you should report the median!

(19)

Trimmed mean:

Purpose is to remove outliers from a data

set

To calculate a trimmed mean:

• Multiply the percent to trim by n

• Truncate that many observations from

BOTH ends of the distribution (when

listed in order)

(20)

Find the mean of the following set of data.

12 14 19 20 22 24 25 26 26 50

Mean = 23.8

To trim 10%: 10%(10) = 1

So remove one observation from each end!
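The trimming steps can be sketched in Python as follows. The function name `trimmed_mean` and its whole-number `percent` argument are assumptions made for this illustration.

```python
def trimmed_mean(values, percent):
    """Mean after removing `percent` percent of the observations from EACH end."""
    ordered = sorted(values)         # list the observations in order
    n = len(ordered)
    k = (percent * n) // 100         # multiply the percent by n, then truncate
    trimmed = ordered[k:n - k]       # drop k observations from both ends
    return sum(trimmed) / len(trimmed)


data = [12, 14, 19, 20, 22, 24, 25, 26, 26, 50]

print(sum(data) / len(data))   # 23.8  (ordinary mean)
print(trimmed_mean(data, 10))  # 22.0  (12 and 50 removed)
```

Removing the 50 pulls the result down from 23.8 to 22.0, which shows how much the single large value affected the ordinary mean.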

(21)

What values are used to describe categorical data?

Suppose that each person in a sample of 15 cell phone users is asked if he or she is satisfied with the cell phone service.

What would be the possible responses?

Here are the responses:

Y N Y Y Y N N Y Y N Y Y Y N N

Find the sample proportion of the people who answered “yes”:

p̂ = 9/15 = 0.60 (p̂ is pronounced p-hat)

60% of the sample was satisfied with their cell phone service.
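A minimal Python sketch of the sample proportion for these 15 responses. The variable names are illustrative.

```python
responses = "Y N Y Y Y N N Y Y N Y Y Y N N".split()

yes_count = responses.count("Y")     # number of "yes" answers
p_hat = yes_count / len(responses)   # sample proportion

print(yes_count, len(responses))  # 9 15
print(p_hat)                      # 0.6
```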

(22)

Why is the study of variability

important?

• There is variability in virtually everything

• Allows us to distinguish between usual &

unusual values

• Reporting only a measure of center

doesn’t provide a complete picture of the

distribution.

Does this can of soda

contain exactly 12

(23)

Notice that these three data sets all

have the

same mean and median (at 45),

(24)

Measures of Variability

The simplest numeric measure of variability is range.

Range = largest observation – smallest observation

(25)

Measures of Variability

Another measure of the variability in a data set uses the deviations from the mean (x – x̄).

Remember the sample of 6 fish that we

caught from the lake . . .

They were the following lengths:

3”, 4”, 5”, 6”, 8”, 10”

The mean length was 6 inches. Recall

that we calculated the deviations from

the mean. What was the sum of these

deviations?

Can we find an average

deviation?

What can we do to the

deviations so that we could

find an average?

The estimated average of the squared deviations is called the variance.

Sample variance: s² = Σ(x – x̄)² / (n – 1), where n – 1 is the degrees of freedom

Population variance is denoted by σ².

(26)

When calculating sample variance, we use

degrees of freedom (n – 1) in the

denominator instead of n because this

tends to produce better estimates.

Degrees of freedom will be revisited in Chapter 8.

Suppose that everyone in the class

caught a sample of 6 fish from the

lake. Would each of our samples

contain the same fish?

Would our mean lengths be the

same?

(27)

Remember the sample of 6 fish that we caught from the lake . . .

Find the variance of the length of fish.

x            3    4    5    6    8   10   Sum
(x - x̄)     -3   -2   -1    0    2    4     0
(x - x̄)²     9    4    1    0    4   16    34

Finding the average of the deviations would always equal 0!

First square the deviations.

What is the sum of the deviations squared? 34

Divide this by 5.

(28)

Measures of Variability

The square root of variance is called standard deviation.

A typical deviation from the mean is the

standard deviation.

s² = 6.8 inches², so s = 2.608 inches
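The variance and standard deviation of the six fish lengths can be checked with the Python sketch below. It uses n – 1 in the denominator as described above; the variable names are illustrative.

```python
import math

lengths = [3, 4, 5, 6, 8, 10]   # fish lengths in inches
n = len(lengths)
mean = sum(lengths) / n

squared_deviations = [(x - mean) ** 2 for x in lengths]
variance = sum(squared_deviations) / (n - 1)   # divide by degrees of freedom
std_dev = math.sqrt(variance)                  # square root of the variance

print(sum(squared_deviations))  # 34.0
print(variance)                 # 6.8
print(round(std_dev, 3))        # 2.608
```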

(29)

Calculation of standard deviation of a sample

s = √( Σ(x – x̄)² / (n – 1) )

Population standard deviation is denoted by σ (where n is used in the denominator).

(30)

Measures of Variability

Interquartile range (iqr) is the range of the middle half of the data.

Lower quartile (Q1) is the median of the lower half of the data

Upper quartile (Q3) is the median of the upper half of the data

iqr = Q3 – Q1

What advantage does the interquartile

range have over the standard

deviation?

(31)

The Chronicle of Higher Education (2009-2010 issue) published the accompanying data on the

percentage of the population with a bachelor’s or higher degree in 2007 for each of the 50 states and the District of Columbia.

21 27 26 19 30 35 35 26 47 26 27 30 24 29 22 24 29 20 20 27 35 38 25 31 19 24 27 27 23 34 25 32 26 24 22 28 26 30 23 25 22 25 29 33 34 30 17 25 23 34 26

(32)

21 27 26 19 30 35 35 26 47 26 27 30 24 29 22 24 29 20 20 27 35 38 25 31 19 24 27 27 23 34 25 32 26 24 22 28 26 30 23 25 22 25 29 33 34 30 17 25 23 34 26

First put the data in order & find the median.

17 19 19 20 20 21 22 22 22 23 23 23 24 24 24 24 25 25 25 25 25 26 26 26 26 26 26 27 27 27 27 27 28 29 29 29 30 30 30 30 31 32 33 34 34 34 35 35 35 38 47

Median = 26

Find the lower quartile (Q1) by finding the median of the lower half.

Q1 = 24

Find the upper quartile (Q3) by finding the median of the upper half.

Q3 = 30

iqr = 30 – 24 = 6
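The quartile steps can be sketched in Python as follows. One assumption is made loudly here: when n is odd, the median itself is left out of both halves. That convention reproduces the values 24 and 30 shown above; other software may use a different quartile rule and give slightly different answers.

```python
from statistics import median

data = [21, 27, 26, 19, 30, 35, 35, 26, 47, 26, 27, 30, 24, 29, 22, 24, 29,
        20, 20, 27, 35, 38, 25, 31, 19, 24, 27, 27, 23, 34, 25, 32, 26, 24,
        22, 28, 26, 30, 23, 25, 22, 25, 29, 33, 34, 30, 17, 25, 23, 34, 26]

ordered = sorted(data)   # first put the data in order
n = len(ordered)
half = n // 2

lower_half = ordered[:half]
# Assumption: exclude the median from the upper half when n is odd
upper_half = ordered[half + 1:] if n % 2 == 1 else ordered[half:]

q1 = median(lower_half)
q3 = median(upper_half)
iqr = q3 - q1

print(n, median(ordered))  # 51 26
print(q1, q3, iqr)         # 24 30 6
```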

(33)

Another graph: Boxplots

What are some advantages of boxplots?

• ease of construction

• convenient handling of outliers

• construction is not subjective (like

histograms)

• Used with medium or large size data

sets (n > 10)

(34)

Boxplots

When to Use

Univariate numerical data

How to construct a Skeleton Boxplot

– Calculate the five number summary

– Draw a horizontal (or vertical) scale

– Construct a rectangular box from the lower quartile (Q1) to the upper quartile (Q3)

– Draw lines from the lower quartile to the smallest observation and from the upper quartile to the largest observation

To describe

– Comment on the center, spread, and shape of the distribution and on any unusual features

Use for moderate to large data sets. Don’t use with data sets of n < 10.

The five-number summary is the minimum value, first quartile, median, third quartile, and maximum value.

(35)

Remember the data on the percentage of the population with a bachelor’s or higher degree in

2007 for each of the 50 states and the District of Columbia.

17 19 19 20 20 21 22 22 22 23 23 23 24 24 24 24 25 25 25 25 25 26 26 26 26 26 26 27 27 27 27 27 28 29 29 29 30 30 30 30 31 32 33 34 34 34 35 35 35 38 47

First draw a scale

Draw a box from Q1 to Q3

Draw a line for the median

(36)

Modified boxplots

To display outliers:

• Identify mild & extreme outliers

An observation is an outlier if it is more than 1.5(iqr) away from the nearest quartile.

An outlier is extreme if it is more than 3(iqr) away from the nearest quartile.

• Whiskers extend to the largest (or smallest) observation that is not an outlier

Modified boxplots are generally preferred

because they provide more information

(37)

Remember the data on the percentage of the population with a bachelor’s or higher degree in

2007 for each of the 50 states and the District of Columbia.

17 19 19 20 20 21 22 22 22 23 23 23 24 24 24 24 25 25 25 25 25 26 26 26 26 26 26 27 27 27 27 27 28 29 29 29 30 30 30 30 31 32 33 34 34 34 35 35 35 38 47

First, draw the scale, box and the line for the median.

Next calculate the fences for outliers.

24 - 1.5(6) = 15    30 + 1.5(6) = 39

30 + 3(6) = 48

There is one outlier at the upper end of the distribution, but none at the lower end. Is it extreme?

Place a solid dot for the outlier.

Draw lines for the whiskers.
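The fence calculation for the modified boxplot can be sketched in Python. It starts from Q1 = 24 and Q3 = 30 for the degree data; the variable names are illustrative.

```python
ordered = [17, 19, 19, 20, 20, 21, 22, 22, 22, 23, 23, 23, 24, 24, 24, 24, 25,
           25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 28, 29,
           29, 29, 30, 30, 30, 30, 31, 32, 33, 34, 34, 34, 35, 35, 35, 38, 47]

q1, q3 = 24, 30
iqr = q3 - q1

# Fences for outliers: 1.5(iqr) from the nearest quartile
mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# Fences for extreme outliers: 3(iqr) from the nearest quartile
extreme_low, extreme_high = q1 - 3 * iqr, q3 + 3 * iqr

outliers = [x for x in ordered if x < mild_low or x > mild_high]
extreme = [x for x in outliers if x < extreme_low or x > extreme_high]

# Whiskers stop at the most extreme observations that are not outliers
non_outliers = [x for x in ordered if mild_low <= x <= mild_high]
whisker_low, whisker_high = min(non_outliers), max(non_outliers)

print(mild_low, mild_high)        # 15.0 39.0
print(extreme_high)               # 48
print(outliers, extreme)          # [47] []
print(whisker_low, whisker_high)  # 17 38
```

The value 47 lies beyond the fence at 39 but inside the fence at 48, so in this sketch it is flagged as an outlier and not as an extreme one.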

To describe:

The distribution of percent of the population with a bachelor’s degree or higher for the U.S. states and District of Columbia is positively

(38)

Symmetrical boxplots

Notice that all 3 boxplots are identical, but their corresponding histograms are very different. Can you determine the number of modes from a boxplot?

Approximately symmetrical boxplot

Notice that the range of the lower half and the range of the upper half of this distribution are approximately equal, so we can say that it is approximately symmetrical.

Skewed boxplot

However, the ranges of the two halves of this distribution are definitely different sizes, so it would be skewed in the direction

(39)

The 2009-2010 salaries of NBA players

published on the web site hoopshype.com were used to construct the comparative boxplot of salary data for five teams.

Discuss the similarities

and

(40)

Interpreting Center & Variability

Chebyshev’s Rule –

The percentage of observations that are within k standard deviations of the mean is at least

100(1 – 1/k²)%

where k > 1

If k = 2, then at least 75% of the observations are within 2 standard deviations of the mean.

This rule can be used with any distribution – no matter its shape.

(41)

For a sample of families with one preschool child, it was reported that the mean child care time per week was approximately 36 hours with a standard deviation of approximately 12 hours.

Using Chebyshev’s rule, at least 75% of the

sample observations must be between 12 and 60 hours (within 2 standard deviations of the mean).

At most, what percent of the observations are greater than 72 hours?

At least 89% of the observations are between 0 & 72 hours. Since time can’t be negative, at most 11% of the observations are greater than 72 hours.

(42)

Input the following command into a graphing calculator in order to graph a normal curve with a mean of 20 and standard deviation of 3.

Y1 = normalpdf(X,20,3) (Window x: [10,30] y: [0,0.2])

Use the command 2nd trace, 7 to find the area under the curve for the: (Round to 3 decimal places.)

Lower limit: 17 Upper limit: 23 Area: ________

Lower limit: 14 Upper limit: 26 Area: ________

Lower limit: 11 Upper limit: 29 Area: ________
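For readers without a graphing calculator, the same areas can be approximated in Python. This is a sketch that uses the standard library's `math.erf` to build the normal cumulative distribution; the helper names `normal_cdf` and `area_between` are assumptions for this illustration, not calculator commands.

```python
import math


def normal_cdf(x, mean, sd):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


def area_between(lower, upper, mean, sd):
    """Area under the normal curve between the lower and upper limits."""
    return normal_cdf(upper, mean, sd) - normal_cdf(lower, mean, sd)


# Normal curve with a mean of 20 and standard deviation of 3
for lower, upper in [(17, 23), (14, 26), (11, 29)]:
    print(lower, upper, round(area_between(lower, upper, 20, 3), 3))
```

Changing the last two arguments to 50 and 5 handles a curve with a mean of 50 and a standard deviation of 5 in the same way.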

(43)

Graph a normal curve with a mean of 50 and standard deviation of 5.

Y1 = normalpdf(X,50,5) (x: [30,70] y: [0,0.1])

Find the area under the curve for the following:

Lower limit: 45 Upper limit: 55 Area: ________

Lower limit: 40 Upper limit: 60 Area: ________

Lower limit: 35 Upper limit: 65 Area: ________

What’s my area?

(44)

Interpreting Center & Variability

Empirical Rule –

• Approximately 68% of the observations are within 1 standard deviation of the mean

• Approximately 95% of the observations are within 2 standard deviations of the mean

• Approximately 99.7% of the observations are within 3 standard deviations of the mean

Can ONLY be used with distributions that

are mound shaped!

(45)

The height of male students at PWSH is

approximately normally distributed with a mean of 71 inches and standard deviation of 2.5 inches.

a) What percent of the male students are shorter than 66 inches? About 2.5%

b) Taller than 73.5 inches? About 16%

c) Between 66 & 73.5 inches?
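The height questions can be worked with the Empirical Rule percentages alone, as in the Python sketch below. The table `WITHIN` and the helper `tail_percent` are illustrative names, and the sketch assumes each cutoff sits a whole number of standard deviations from the mean.

```python
# Empirical Rule: approximate percent within k standard deviations of the mean
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}


def tail_percent(k):
    """Approximate percent beyond k standard deviations on ONE side of the mean."""
    return (100 - WITHIN[k]) / 2


mean, sd = 71, 2.5   # heights in inches

k_low = round((mean - 66) / sd)     # 66 is 2 standard deviations below the mean
k_high = round((73.5 - mean) / sd)  # 73.5 is 1 standard deviation above the mean

shorter_than_66 = tail_percent(k_low)
taller_than_73_5 = tail_percent(k_high)
between = 100 - shorter_than_66 - taller_than_73_5

print(shorter_than_66)   # 2.5
print(taller_than_73_5)  # 16.0
print(between)           # 81.5
```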

(46)

Measures of Relative Standing

Z-score

A z-score tells us how many standard deviations the value is from the mean.

z = (x – mean) / standard deviation

(47)

What do these z-scores mean?

-2.3: 2.3 standard deviations below the mean

1.8: 1.8 standard deviations above the mean

-4.3: 4.3 standard deviations below the mean

(48)

Sally is taking two different math achievement tests with different means and standard

deviations. The mean score on test A was 56 with a standard deviation of 3.5, while the

mean score on test B was 65 with a standard deviation of 2.8. Sally scored a 62 on test A and a 69 on test B. On which test did Sally score the best?

She did better on test A.
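Sally's two scores can be compared with a short Python sketch of the z-score, (value – mean) / standard deviation. The function name `z_score` is illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


z_a = z_score(62, 56, 3.5)   # test A
z_b = z_score(69, 65, 2.8)   # test B

print(round(z_a, 3))  # 1.714
print(round(z_b, 3))  # 1.429
print("A" if z_a > z_b else "B")  # A
```

Her score on test A is farther above its mean in standard deviation units, which is why she did better on test A even though her raw score on test B is higher.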

(49)

Measures of Relative Standing

Percentiles

A percentile is a value in the data set where r percent of the observations fall AT or BELOW that value.

(50)

In addition to weight and length, head

circumference is another measure of health in newborn babies. The National Center for

Health Statistics reports the following

summary values for head circumference (in cm) at birth for boys.

Percentile                  5     10    25    50    75    90    95
Head circumference (cm)   32.2  33.2  34.5  35.8  37.0  38.2  38.6

What percent of newborn boys had head circumferences greater than 37.0 cm?

25%

10% of newborn babies have head circumferences bigger than what value?
