ch 3: numerically summarizing data - center, spread, shape

3.1 measure of central tendency

or, give me one number that represents all the data

consider the number of math classes taken by math 150 students. how can we represent the results in one number?

*average: add up all the numbers and divide by the amount of numbers that there are*

ex) suppose you score on three tests 71,75,84. what is your test average?
*also called the mean*

ex) for number of math classes, mean =

*median: the middle number*

ex) suppose you score on three tests 71,75,84. what is your median test score?

median is 75

interpretation: half the time the score is above 75, half the time the score is below 75

note: you must put data in ascending order to determine the median

ex) what is the median for: 75, 84, 71

ex) 0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2... median is

ex) heights of students (in inches):

59,61,62,64,64,64,65,66,66,66,67,68,68,69,70,70,71,71,73

what is the median height?

...find middle number: there are 19 numbers

(19+1)/2=10 ...so it's the number in the 10th position

...the median is 66

what do you do if there are two middle numbers? add together, divide by two (i.e. take the average)

..this will happen when there is an even amount of data

note that, using the "+1" method, you would get (20+1)/2 = 10.5

...this means the median is between the 10th and 11th numbers, so take their average
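a minimal python sketch of the "+1" method above, for checking your work. the function name `median_by_position` is made up for illustration, it is not part of the notes:

```python
def median_by_position(data):
    """Median via the (n+1)/2 position rule."""
    ordered = sorted(data)               # data must be in ascending order first
    n = len(ordered)
    position = (n + 1) / 2               # 1-based position of the middle
    if position.is_integer():            # odd amount of data: one middle number
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]   # e.g. position 10.5 -> 10th number
    upper = ordered[int(position)]       # ... and the 11th number
    return (lower + upper) / 2           # even amount of data: average the two


heights = [59, 61, 62, 64, 64, 64, 65, 66, 66, 66,
           67, 68, 68, 69, 70, 70, 71, 71, 73]
print(median_by_position(heights))        # 19 numbers, 10th position -> 66
print(median_by_position([75, 84, 71]))   # sorted first -> 75
```

note that the function sorts the data itself, so the order you type the numbers in does not matter.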

mode: most common number

ex) number of math classes: 1

ex) heights: two modes 64 and 66

ex) test scores: no mode (all the same frequency)

Question: which of these should we use, and why?

ex) number of credits taken at BMCC among math150 students: 0,0,9,12,21,22,27,32,35,38,44,50,52,56

mean = 398/14 ≈ 28.4 ... median = (27+32)/2 = 29.5 ... mode = 0
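one way to check all three measures for the credits data is python's built-in `statistics` module. this is only a sketch for checking by computer, the class does these by hand:

```python
import statistics

credits = [0, 0, 9, 12, 21, 22, 27, 32, 35, 38, 44, 50, 52, 56]

mean = statistics.mean(credits)      # add up the values, divide by how many there are
median = statistics.median(credits)  # 14 values: average of the 7th and 8th
mode = statistics.mode(credits)      # most common value

print(round(mean, 2), median, mode)
```

the three numbers come out very different here, which is the point of the question above: each one "represents" the data in a different way.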

ex) there can be a problem with the mean

the average salary in this class is around $15,000

if Bill Gates (and his $1,000,000,000 salary) walks into the room,

the average salary is now around $35,000,000. does this make us all millionaires? ...no

the median salary is still around $15,000, because at most you go to the next number on the list

"the Bill Gates effect"

*Bill Gates' salary is an outlier: it is a value far away from most of the data*
the average is not robust with respect to an outlier

*the median is robust with respect to an outlier*
robust: not affected much by an outlier [also known as resistant]
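a small sketch of the Bill Gates effect. the salary list is made up for illustration (10 people with a mean of $15,000), so the new average will not match the $35,000,000 figure above, which depends on the real class size:

```python
import statistics

# hypothetical class salaries, invented for this example
salaries = [0, 5000, 10000, 12000, 15000, 15000, 18000, 20000, 25000, 30000]
with_outlier = salaries + [1_000_000_000]   # Bill Gates walks into the room

print("before:", statistics.mean(salaries), statistics.median(salaries))
print("after: ", statistics.mean(with_outlier), statistics.median(with_outlier))
```

the mean jumps into the tens of millions, while the median only moves to the next number on the list (here it stays at 15000).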

3.2 Measures of Dispersion

or, how spread out is the data

because mean & median do not tell the whole story

ex) group of 5 men, heights

group 1: 5'8, 5'10, 5'11, 6', 5'9 ... in inches: 68,69,70,71,72

group 2: 4'6, 7'4, 4'2, 6'8, 6'6 ... in inches: 50,54,78,80,88

find mean:

group 1: (68+69+70+71+72)/5 = 350/5 = 70" (or 5'10)

group 2: (54+88+50+80+78)/5 = 350/5 = 70" (or 5'10)

- range

(highest) - (lowest)

ex) group #1: 72" - 68" = 4"

group #2: 88" - 50" = 38"

note: affected by an outlier

ex) our salary range is 30000-0 = 30000
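a short sketch comparing the two groups of heights: same mean, very different range. the helper names `mean` and `value_range` are just for this example:

```python
group1 = [68, 69, 70, 71, 72]   # heights in inches
group2 = [50, 54, 78, 80, 88]


def mean(data):
    return sum(data) / len(data)


def value_range(data):
    return max(data) - min(data)   # (highest) - (lowest)


for name, group in [("group 1", group1), ("group 2", group2)]:
    print(name, "mean:", mean(group), "range:", value_range(group))
```

both groups print a mean of 70.0, but the ranges are 4 and 38.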

standard deviation

ex) group 1 (inches) 68,69,70,71,72 mean = 70

ex) var = 4 ... st.dev. = 2

ex) st.dev. = 9 ... var = 81

standard deviation = √variance

you do:

ex) group #2: 54, 88, 50, 80, 78 ... mean = 70 find the standard deviation

|  | sample | population |
|---|---|---|
| mean | x̄ ("x-bar") | µ ("mu") |
| st.dev. | s | σ ("sigma") |
| variance | s² | σ² |
| size | n | N |
|  | depends on your sample | fixed |
|  | a "statistic" | a "parameter" |

also: "data value" = x

the sample mean and the population mean are calculated in exactly the same way. the difference is the kind of information each one gives you

note for standard dev:

for a population, divide by the number of data (N)

for a sample, divide by the number of data minus 1 (n - 1)

ex) find the standard deviation of the sample 7,10,16 (and the variance)
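a sketch of the calculation, assuming the usual definition (variance = sum of squared distances from the mean, divided by N for a population or n - 1 for a sample). the function names and the `sample` flag are made up for this example:

```python
import math


def variance(data, sample=True):
    """sample=True divides by (n - 1); sample=False divides by N (population)."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]
    divisor = n - 1 if sample else n
    return sum(squared_deviations) / divisor


def st_dev(data, sample=True):
    return math.sqrt(variance(data, sample))   # st.dev. = square root of variance


scores = [7, 10, 16]                 # a sample, so divide by n - 1 = 2
print(variance(scores), round(st_dev(scores), 2))

group1 = [68, 69, 70, 71, 72]        # treated as a whole population here
print(variance(group1, sample=False))
```

for the sample 7,10,16 the mean is 11, the squared distances are 16, 1, 25, and 42/2 gives a variance of 21.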

3.3 calculating that stuff from a table [extra credit material] (measures of central tendency and dispersion)

or, what to do if we have only the table of data and not the raw data

ex)

whats the mean??

note: the table is an approximation, so the result will be an approximation

note: divide by 12, not 5, because 12 is the total frequency (e.g. 25 appears 7 times)

this is similar to a weighted mean

ex) get three scores, 80, 95, 70. what's the mean?...

but the first score is your hw grade (that counts 20%)

the second score is your midterm grade (that counts 30%)

the third score is your final exam grade (that counts 50%)

Formula for a weighted mean:

mean (x̄ or µ) = Σ x · rel.freq(x)
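a sketch of the weighted mean for the three grades. weights are written as whole percents and divided by their total, which is the same idea as dividing by the total frequency in a table. the function name `weighted_mean` is just for illustration:

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the total weight."""
    total_weight = sum(weights)
    return sum(x * w for x, w in zip(values, weights)) / total_weight


scores = [80, 95, 70]     # hw, midterm, final exam
weights = [20, 30, 50]    # percent of the grade each one counts

print(sum(scores) / len(scores))        # plain mean: every score counts equally
print(weighted_mean(scores, weights))   # weighted mean -> 79.5
```

the weighted mean is lower than the plain mean because the lowest score (the final) counts the most.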

whats the standard deviation? [extra credit material]

measures of position

*- rank (location)*

ex) New York marathon, 12,635 people run, you finished 586th

your rank is 586 (out of 12,635)

*- percentile*

you are above ? % of the data

percentile --> data value

ex) 3,7,9,12,15,15,16,18,19,21,24,26,28,29 (n=14)

find the 37th percentile:

rank = (n+1)(P/100), then find the data value

ex) find the 58th percentile

you do:

ex) find the 82nd percentile
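a sketch of the rank formula for the three percentiles above. it only computes the position and shows the data values on either side of it. how to turn a fractional position into one data value is not spelled out in these notes, so that step is left to the rule used in class. the name `percentile_position` is made up:

```python
def percentile_position(n, p):
    """Position (rank) of the p-th percentile: rank = (n+1)(P/100)."""
    return (n + 1) * p / 100


data = [3, 7, 9, 12, 15, 15, 16, 18, 19, 21, 24, 26, 28, 29]
n = len(data)

for p in (37, 58, 82):
    position = percentile_position(n, p)
    lower = data[int(position) - 1]   # data value at the position rounded down
    upper = data[int(position)]       # next data value up
    print(p, position, lower, upper)
```

for the 37th percentile the position is (15)(.37) = 5.55, which falls between the 5th and 6th numbers (both are 15).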

data value --> percentile

ex) at what percentile is x=24? [recall: "x" means data value]

x=24 is above 10 data values (out of 14)

percentile: 10/14 = .71 or 71st percentile (above 71% of the data)

notation: the 71st percentile is 24

P71 = 24
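going the other direction (data value to percentile) is just counting. a sketch, with the made-up name `percentile_of`:

```python
def percentile_of(data, x):
    """Percent of the data values that are below x, as a whole number."""
    below = sum(1 for value in data if value < x)
    return round(100 * below / len(data))


data = [3, 7, 9, 12, 15, 15, 16, 18, 19, 21, 24, 26, 28, 29]
print(percentile_of(data, 24))   # 10 of the 14 values are below 24 -> 71
```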

note: for both problems, the middle step is to find the rank (position)

note: the "+1" formula has some glitches for small data sets. this comes from the fact that one data value represents a large chunk of your data set (e.g. if you have 20 numbers, each one represents 5%)

*- quartile*

break the data into four quartiles. they are marked off by: quarter point, half-way point, three-quarter point

*- 5-number summary*

min--Q1--Q2--Q3--max

Q1: data value after one quarter of the data. that's the same as P25 (the data value at the 25th percentile mark). it separates first quartile and second quartile

Q2 is in the 50th percentile position (then find the data value)

Q3 is in the 75th percentile position (then find the data value)

ex) 14,15,16,17,18,19,20,21,22 (n=9)

using the formula:

Q1 appears in which position? Q1 =

Q2 appears in which position? Q2 =

Q3 appears in which position? Q3 =

follow-up: in which quartile is x=19 ?

why do we need the "+1"? well, if we didn't have it then for Q2 we would calculate

(9)(.5) = 4.5

but we know that's not right, it's too low (the middle of 9 numbers is the 5th position)...the "+1" fixes that problem: (9+1)(.5) = 5
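a sketch of the 5-number summary for the n=9 example using the "+1" positions. one assumption is labeled in the code: when a position lands on a half (like 2.5), the two neighboring data values are averaged, the same way the median is handled earlier in these notes. the helper name `value_at_position` is made up:

```python
def value_at_position(ordered, position):
    """Data value at a 1-based position in sorted data."""
    lower = ordered[int(position) - 1]
    if position == int(position):      # whole-number position: take that value
        return lower
    upper = ordered[int(position)]
    return (lower + upper) / 2         # assumption: average the two neighbors


data = [14, 15, 16, 17, 18, 19, 20, 21, 22]
n = len(data)

positions = {q: (n + 1) * p for q, p in [("Q1", 0.25), ("Q2", 0.5), ("Q3", 0.75)]}
summary = {q: value_at_position(data, pos) for q, pos in positions.items()}

print(positions)                         # positions 2.5, 5.0, 7.5
print(min(data), summary, max(data))     # min--Q1--Q2--Q3--max
```

with these positions, Q2 lands exactly on the 5th number (18), which is the median we expect for 9 numbers.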

Boxplot

- a visual representation of the 5 number summary

- helps you see if the distribution is symmetric or skewed

this distribution shape is called "symmetric"

here are some other shapes (as seen with boxplots):

*- z-score*

"the number of standard deviations from the mean"

ex) there is an exam. the mean score is 77, you got an 85. is that good? how good?

it depends.

suppose the standard deviation is 4. how many standard dev's above the mean is your score?

you are 8 points above the mean...that is 2 standard deviations (since st.dev. is 4)

Jerry got an 88. how many standard deviations above the mean is his score?

what is each number called?

ex) find the z-score for 47 if µ=38, σ=5

what does that mean, in words?

... 1.8 standard deviations above the mean

ex) find the z-score for 68 if µ=78, σ=4

note that a positive z-score means your data value is above the mean and a negative z-score means your data value is below the mean

*ex) which exam score is relatively better, a 75 when the class average was 68 and the standard deviation was 4, or an 89 when the class average was 76 and the standard deviation was 12? (use the z-score)*

ex) find the data value which is 2 standard deviations above the mean if µ=32, σ=6

formula for x: x = µ + z·σ

same as the formula for z, but you solve for x

Formula:

for a z-score: z = (x - µ)/σ (population)

for a sample, same formula, different notation: z = (x - x̄)/s
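a sketch of both formulas, run on the examples from this section. the function names `z_score` and `data_value` are made up for illustration:

```python
def z_score(x, mean, st_dev):
    """Number of standard deviations x is from the mean: z = (x - mean)/st_dev."""
    return (x - mean) / st_dev


def data_value(z, mean, st_dev):
    """Same formula solved for x: x = mean + z * st_dev."""
    return mean + z * st_dev


print(z_score(85, 77, 4))     # 8 points above the mean, st.dev. 4 -> 2.0
print(z_score(47, 38, 5))     # -> 1.8
print(z_score(68, 78, 4))     # negative z-score: below the mean -> -2.5
print(data_value(2, 32, 6))   # 2 standard deviations above 32 -> 44

# comparing two exams: the larger z-score is the relatively better score
print(z_score(75, 68, 4), z_score(89, 76, 12))
```

for the two exams, the 75 has the larger z-score even though 89 is the higher raw score.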