ch 3: numerically summarizing data - center, spread, shape
3.1 measures of central tendency
or, give me one number that represents all the data
consider the number of math classes taken by math 150 students. how can we represent the results in one number?
average (also called the mean): add up all the numbers and divide by how many numbers there are
ex) suppose you score 71, 75, 84 on three tests. what is your test average?
ex) for number of math classes, mean =
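to check the test average by computer, here is a minimal python sketch. it uses the built-in statistics module; the variable names are made up for illustration.

```python
from statistics import mean

# the three test scores from the example above
test_scores = [71, 75, 84]

# add up all the numbers and divide by how many there are
by_hand = sum(test_scores) / len(test_scores)

# the library function does the same calculation
by_library = mean(test_scores)

print(round(by_hand, 2))     # 76.67
print(round(by_library, 2))  # 76.67
```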
median: the middle number
ex) suppose you score on three tests 71,75,84. what is your median test score?
median is 75
interpretation: half the time the score is above 75, half the time the score is below 75
note: you must put data in ascending order to determine the median
ex) what is the median for: 75, 84, 71
ex) 0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2... median is
ex) heights of students (in inches): 59,61,62,64,64,64,65,66,66,66,67,68,68,69,70,70,71,71,73
what is the median height?
...find middle number: there are 19 numbers, (19+1)/2 = 10
...so it's the number in the 10th position
...the median is 66
what do you do if there are two middle numbers? add together, divide by two (i.e. take the average)
..this will happen when there is an even amount of data
note that, using the "+1" method, you would get (20+1)/2 = 10.5
...this means the median is between the 10th and 11th numbers, so take their average
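the position rule above can be sketched in python. this is one possible way to write it, not the only one; the function name median_by_position is made up for illustration, and the credits list is the one used in a later example in this section.

```python
def median_by_position(data):
    """Median using the (n+1)/2 position rule."""
    ordered = sorted(data)          # data must be in ascending order
    n = len(ordered)
    position = (n + 1) / 2          # 1-based position of the middle
    if position.is_integer():
        # odd amount of data: exactly one middle number
        return ordered[int(position) - 1]
    # even amount of data: position ends in .5, so average the two neighbors
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

heights = [59, 61, 62, 64, 64, 64, 65, 66, 66, 66,
           67, 68, 68, 69, 70, 70, 71, 71, 73]
credits_taken = [0, 0, 9, 12, 21, 22, 27, 32, 35, 38, 44, 50, 52, 56]

print(median_by_position([75, 84, 71]))   # 75 (sorting happens first)
print(median_by_position(heights))        # 66 (10th of 19 numbers)
print(median_by_position(credits_taken))  # 29.5 (average of 7th and 8th)
```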
mode: most common number
ex) number of math classes: 1
ex) heights: two modes, 64 and 66
ex) test scores: no mode (all the same frequency)
Question: which of these should we use, and why?
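a small python sketch of the mode examples. it follows the convention in these notes that there is "no mode" when every value has the same frequency; the function name modes is made up for illustration.

```python
from collections import Counter

def modes(data):
    """Return the most common value(s); empty list if every value ties."""
    counts = Counter(data)
    highest = max(counts.values())
    # if all distinct values appear equally often, report no mode
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

heights = [59, 61, 62, 64, 64, 64, 65, 66, 66, 66,
           67, 68, 68, 69, 70, 70, 71, 71, 73]
test_scores = [71, 75, 84]

print(modes(heights))      # [64, 66]  two modes
print(modes(test_scores))  # []        no mode
```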
ex) number of credits taken at BMCC among math150 students: 0,0,9,12,21,22,27,32,35,38,44,50,52,56
mean = 398/14 ≈ 28.4
median = (27+32)/2 = 29.5
mode = 0
ex) there can be a problem with the mean
the average salary in this class is around $15,000
if Bill Gates (and his $1,000,000,000 salary) walks into the room,
the average salary is now around $35,000,000. does this make us all millionaires? ...no
the median salary is still around $15,000, because at most you move to the next number on the list
"the Bill Gates effect"
Bill Gates' salary is an outlier: it is a value far away from most of the data
the average is not robust with respect to an outlier
the median is robust with respect to an outlier
robust: not affected by [also known as resistant]
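to see the "Bill Gates effect" in numbers, here is a python sketch. the class salaries are made up for illustration (they are not real data, and the class is smaller than the one in the example, so the new average comes out different from $35,000,000); only the $1,000,000,000 figure comes from the example above.

```python
from statistics import median

# made-up salaries for a small class (illustrative only, not real data)
class_salaries = [0, 5000, 10000, 12000, 15000, 18000, 20000, 25000, 30000]
gates_salary = 1_000_000_000  # the figure used in the example

def average(data):
    """Add up all the numbers and divide by how many there are."""
    return sum(data) / len(data)

with_gates = class_salaries + [gates_salary]

print(average(class_salaries))  # 15000.0
print(median(class_salaries))   # 15000
print(average(with_gates))      # 100013500.0 (pulled way up by the outlier)
print(median(with_gates))       # 16500.0 (barely moves)
```

the median only slides to the average of 15000 and the next number on the list, which is the sense in which it is robust.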
3.2 Measures of Dispersion
how spread out is the data
because mean & median do not tell the whole story
ex) group of 5 men, heights
group 1: 5'8, 5'10, 5'11, 6', 5'9 ... in inches: 68,69,70,71,72
group 2: 4'6, 7'4, 4'2, 6'8, 6'6 ... in inches: 50,54,78,80,88
find mean:
group 1: (68+69+70+71+72)/5 = 350/5 = 70" (or 5'10)
group 2: (54+88+50+80+78)/5 = 350/5 = 70" (or 5'10)
- range
(highest) - (lowest)
ex) group #1: 72" - 68" = 4"
group #2: 88" - 50" = 38"
note: affected by an outlier
ex) our salary range is 30000-0 = 30000
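a short python sketch comparing the two groups of heights: same mean, very different range. the helper names are made up for illustration.

```python
group1 = [68, 69, 70, 71, 72]   # heights in inches
group2 = [50, 54, 78, 80, 88]

def average(data):
    return sum(data) / len(data)

def data_range(data):
    """(highest) - (lowest)"""
    return max(data) - min(data)

print(average(group1), average(group2))  # 70.0 70.0
print(data_range(group1))                # 4
print(data_range(group2))                # 38
```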
standard deviation
ex) group 1 (inches) 68,69,70,71,72 mean = 70
standard deviation = square root of the variance
ex) var = 4 ... st.dev. = 2
ex) st.dev. = 9 ... var = 81
you do:
ex) group #2: 54, 88, 50, 80, 78 ... mean = 70 find the standard deviation
            sample                   population
mean        x̄ "x-bar"                µ "mu"
st.dev.     s                        σ "sigma"
variance    s²                       σ²
size        n                        N
            depends on your sample   fixed
            a "statistic"            a "parameter"
also: "data value" = x
the sample mean and the population mean are calculated in exactly the same way.
the difference is the kind of information it gives you
note for standard dev:
for a population, divide by the number of data values
for a sample, divide by the number of data values - 1
ex) find the standard deviation of the sample 7,10,16 (and the variance)
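the sample 7,10,16 can be worked through in python. this sketch assumes the usual definition (variance = sum of squared distances from the mean, divided by n-1 for a sample or N for a population; standard deviation = square root of the variance). the function names and the sample flag are made up for illustration.

```python
from math import sqrt

def variance(data, sample=True):
    """Sum of squared distances from the mean, divided by n-1 or N."""
    n = len(data)
    mean = sum(data) / n
    squared_devs = sum((x - mean) ** 2 for x in data)
    divisor = n - 1 if sample else n   # sample: n-1, population: N
    return squared_devs / divisor

def st_dev(data, sample=True):
    """Standard deviation is the square root of the variance."""
    return sqrt(variance(data, sample))

data = [7, 10, 16]   # mean is 11, squared distances are 16, 1, 25

print(variance(data))                  # 21.0  (42 / 2)
print(round(st_dev(data), 2))          # 4.58
print(variance(data, sample=False))    # 14.0  (42 / 3, if it were a population)
```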
3.3 calculating that stuff from a table [extra credit material] (measures of central tendency and dispersion)
or, what to do if we have only the table of data and not the raw data
ex)
what's the mean?
note: the table is an approximation, so the result will be an approximation
note: divide by 12, not 5, because 12 is the total frequency (e.g. 25 appears 7 times)
this is similar to a weighted mean
ex) get three scores, 80, 95, 70. what's the mean?...
but the first score is your hw grade (that counts 20%)
the second score is your midterm grade (that counts 30%)
the third score is your final exam grade (that counts 50%)
Formula for a weighted mean:
x̄ or µ = Σ x · rel.freq(x)
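the weighted mean for the three grades can be checked with a short python sketch. the function name weighted_mean is made up for illustration; the weights are entered as percents and divided by their total, which is the same as multiplying each score by its relative frequency.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the total weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

scores = [80, 95, 70]            # hw, midterm, final exam
weights_percent = [20, 30, 50]   # how much each one counts

print(round(sum(scores) / len(scores), 2))      # 81.67 (plain mean)
print(weighted_mean(scores, weights_percent))   # 79.5  (weighted mean)
```

the same function works for a frequency table if you pass the frequencies as the weights, which is why you divide by the total frequency and not by the number of rows.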
what's the standard deviation? [extra credit material]
measures of position
- rank (location)
ex) New York marathon, 12,635 people run, you finished 586th
your rank is 586 (out of 12635)
- percentile
you are above ? % of the data
percentile --> data value
ex) 3,7,9,12,15,15,16,18,19,21,24,26,28,29 (n=14)
find the 37th percentile:
rank = (n+1)(P/100), then find the data value
ex) find the 58th percentile
you do:
ex) find the 82nd percentile
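the rank step can be sketched in python. the notes do not spell out what to do when the rank is not a whole number, so this sketch only reports the rank and the two data values on either side of it; the function names are made up for illustration, and it assumes the rank lands between 1 and n.

```python
def percentile_rank(n, p):
    """Position (1-based) from rank = (n+1)(P/100)."""
    return (n + 1) * p / 100

def neighbors_at(data, rank):
    """Data values at the positions just below and just above the rank."""
    ordered = sorted(data)
    lower_pos = int(rank)   # round the rank down
    upper_pos = lower_pos if rank == lower_pos else lower_pos + 1
    return ordered[lower_pos - 1], ordered[upper_pos - 1]

data = [3, 7, 9, 12, 15, 15, 16, 18, 19, 21, 24, 26, 28, 29]

rank = percentile_rank(len(data), 37)
print(rank)                      # 5.55 (between the 5th and 6th positions)
print(neighbors_at(data, rank))  # (15, 15)
```

for the 37th percentile both neighbors are 15, so the data value is 15 no matter how you handle the in-between rank.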
data value --> percentile
ex) at what percentile is x=24? [recall: "x" means data value]
x=24 is above 10 data values (out of 14)
percentile: 10/14 = .71 or 71st percentile (above 71% of the data)
notation: the 71st percentile is 24, written P71 = 24
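going from a data value to its percentile can be sketched in python, using the "count how many data values you are above" approach from the example. the function name percentile_of is made up for illustration.

```python
def percentile_of(data, x):
    """Percent of the data values that are below x."""
    below = sum(1 for value in data if value < x)
    return 100 * below / len(data)

data = [3, 7, 9, 12, 15, 15, 16, 18, 19, 21, 24, 26, 28, 29]

print(round(percentile_of(data, 24)))  # 71 (above 10 out of 14 values)
```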
note: for both problems, the middle step is to find the rank (position)
note: the "+1" formula has some glitches for small data sets. this comes from the fact that one data value represents a large chunk of your data set (e.g. if you have 20 numbers, each one represents 5%)
- quartile
break the data into four quartiles. they are marked off by: quarter point, half-way point, three-quarter point
- 5-number summary
min--Q1--Q2--Q3--max
Q1: data value after one quarter of the data. that's the same as P25 (the data value at the 25th percentile mark). it separates the first quartile and the second quartile
Q2 is in the 50th percentile position (then find the data value)
Q3 is in the 75th percentile position (then find the data value)
ex) 14,15,16,17,18,19,20,21,22 (n=9)
using the formula:
Q1 appears in which position? Q1 =
Q2 appears in which position? Q2 =
Q3 appears in which position? Q3 =
follow-up: in which quartile is x=19 ?
why do we need the "+1"? well, if we didn't have it then for Q2 we would calculate
(9)(.5) = 4.5
but we know that's not right, it's too low (the middle of 9 numbers is the 5th position)...the "+1" fixes that problem: (9+1)(.5) = 5
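the quartile positions for the n=9 example can be sketched in python. one assumption to flag: when a position ends in .5, this sketch averages the two neighboring values, borrowing the rule used for the median with two middle numbers; the notes do not state a rule for quartiles, so treat that part as a guess. the function names are made up for illustration.

```python
def quartile_position(n, k):
    """Position (1-based) of Qk: (n+1) times k/4."""
    return (n + 1) * k / 4

def value_at(ordered, position):
    """Value at a position; a .5 position averages its two neighbors
    (an assumption borrowed from the median rule)."""
    lower = int(position)
    if position == lower:
        return ordered[lower - 1]
    return (ordered[lower - 1] + ordered[lower]) / 2

data = [14, 15, 16, 17, 18, 19, 20, 21, 22]   # already in ascending order

positions = [quartile_position(len(data), k) for k in (1, 2, 3)]
quartiles = [value_at(data, p) for p in positions]

print(positions)  # [2.5, 5.0, 7.5]
print(quartiles)  # [15.5, 18, 20.5]
```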
Boxplot
- a visual representation of the 5 number summary
- helps you see if the distribution is symmetric or skewed
this distribution shape is called "symmetric"
here are some other shapes (as seen with boxplots):
- z-score
"the number of standard deviations from the mean"
ex) there is an exam. the mean score is 77, you got an 85. is that good? how good?
it depends.
suppose the standard deviation is 4. how many standard dev's above the mean is your score?
you are 8 points above the mean...that is 2 standard deviations (since st.dev. is 4)
Jerry got an 88. how many standard deviations above the mean is his score?
what is each number called?
ex) find the z-score for 47 if µ=38, σ=5
what does that mean, in words?
... 1.8 standard deviations above the mean
ex) find the z-score for 68 if µ=78, σ=4
note that a positive z-score means your data value is above the mean and a negative z-score means your data value is below the mean
ex) which exam score is relatively better, a 75 when the class average was 68 and the standard deviation was 4, or an 89 when the class average was 76 and the standard deviation was 12? (use the z-score)
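the exam comparison can be checked with a python sketch of "the number of standard deviations from the mean". the function name z_score is made up for illustration.

```python
def z_score(x, mean, st_dev):
    """How many standard deviations x is from the mean."""
    return (x - mean) / st_dev

print(z_score(47, 38, 5))   # 1.8 (the earlier example)

z_first = z_score(75, 68, 4)     # 75 when the average was 68, st.dev. 4
z_second = z_score(89, 76, 12)   # 89 when the average was 76, st.dev. 12

print(z_first)               # 1.75
print(round(z_second, 2))    # 1.08
print(z_first > z_second)    # True: the 75 is relatively better
```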
ex) find the data value which is 2 standard deviations above the mean if µ=32, σ=6
formula for x: x = µ + z·σ
same as the formula for z, but you solve for x
Formula:
for a z-score: z = (x - µ)/σ (population)
for a sample, same formula, different notation: z = (x - x̄)/s
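a python sketch of going from a z-score back to a data value, using x = µ + z·σ. the function names are made up for illustration.

```python
def z_score(x, mean, st_dev):
    """z = (x - mean) / st_dev"""
    return (x - mean) / st_dev

def x_from_z(z, mean, st_dev):
    """Same formula solved for x: x = mean + z * st_dev"""
    return mean + z * st_dev

x = x_from_z(2, 32, 6)     # 2 standard deviations above the mean
print(x)                   # 44
print(z_score(x, 32, 6))   # 2.0 (going back gives the z-score we started with)
```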