Chapter 3 – Sample Statistics
• By the end of the chapter, you will be able to:
1) Define, interpret and evaluate sample statistics.
–Mean
–Variance and Standard Deviation
–Covariance and Correlation
–Median, Mode, Percentiles
3. POPULATION VS. SAMPLE DATA
Population Data – Full information on the ENTIRE population.
-Includes population probability (pdf)
-Uses formulas containing probability
-ex) data on an ENTIRE class
Sample Data – Partial information from a RANDOM SAMPLE of the population
-Individual data points (no pdf)
-Uses the following formulas
-ex) Study of 2,000 random students
3.1 Estimators
Population Expected Value:
$$\mu = E(Y) = \sum y \, f(y)$$
Sample Mean:
$$\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N}$$
Note: From this point on, $\bar{Y}$ may be written as Ybar (and likewise for any other variable, e.g. Xbar).
3.1 Sample Mean Example
Tom and Rodney both go to a 4-day gaming convention. How much they spend (S) each day is listed below:
Day 1 2 3 4
Tom 110 85 90 135
Rodney 190 20 10 200
$$\bar{S}_T = \frac{\sum S_{Ti}}{N} = \frac{110 + 85 + 90 + 135}{4} = 105$$
$$\bar{S}_R = \frac{\sum S_{Ri}}{N} = \frac{190 + 20 + 10 + 200}{4} = 105$$
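The sample mean calculation above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names (`tom`, `rodney`, `tom_mean`, `rodney_mean`) are illustrative and not part of the course material.

```python
from statistics import mean

# Daily spending from the table above (days 1 to 4).
tom = [110, 85, 90, 135]
rodney = [190, 20, 10, 200]

# Sample mean: sum of the observations divided by N.
tom_mean = sum(tom) / len(tom)
rodney_mean = sum(rodney) / len(rodney)

print(tom_mean, rodney_mean)

# statistics.mean applies the same formula.
print(mean(tom), mean(rodney))
```

Both approaches give 105 for Tom and for Rodney, matching the hand calculation.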
3.2 Estimators
Population Variance:
$$\sigma_Y^2 = Var(Y) = \sum [y - E(y)]^2 f(y)$$
Sample Variance:
$$S_y^2 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}$$
3.2 Sample Variance Example
Day 1 2 3 4
Tom 110 85 90 135
Rodney 190 20 10 200
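$$S_y^2 = \frac{\sum (Y_i - \bar{Y})^2}{N - 1}$$
Tom:
$$S_{S_T}^2 = \frac{\sum (S_{Ti} - \bar{S}_T)^2}{N - 1} = \frac{(110 - 105)^2 + (85 - 105)^2 + (90 - 105)^2 + (135 - 105)^2}{4 - 1}$$
$$S_{S_T}^2 = \frac{25 + 400 + 225 + 900}{3} \approx 517$$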
3.2 Sample Variance Example
Day 1 2 3 4
Tom 110 85 90 135
Rodney 190 20 10 200
$$S_y^2 = \frac{\sum (Y_i - \bar{Y})^2}{N - 1}$$
Rodney:
$$S_{S_R}^2 = \frac{\sum (S_{Ri} - \bar{S}_R)^2}{N - 1} = \frac{(190 - 105)^2 + (20 - 105)^2 + (10 - 105)^2 + (200 - 105)^2}{4 - 1} \approx 10{,}833$$
3.2 Sample Variance Example
Day 1 2 3 4
Tom 110 85 90 135
Rodney 190 20 10 200
Even though Tom and Rodney spent the same average amount each day ($105), Rodney’s spending was MUCH more spread out (it changed more from day to day), as seen by the variance.
$$\bar{S}_T = \bar{S}_R = 105$$
$$S_y^2 = \frac{\sum (Y_i - \bar{Y})^2}{N - 1}$$
Tom:
$$S_{S_T}^2 = 517$$
Rodney:
$$S_{S_R}^2 = 10{,}833$$
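The two variance calculations can be reproduced as follows. This is a sketch under the assumption that the N - 1 divisor from the sample variance formula is used; `sample_variance` is a hypothetical helper written for illustration, and `statistics.variance` from the Python standard library uses the same N - 1 divisor.

```python
from statistics import variance

def sample_variance(data):
    """Sample variance: squared deviations from the mean, divided by N - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    ybar = sum(data) / n
    return sum((y - ybar) ** 2 for y in data) / (n - 1)

tom = [110, 85, 90, 135]
rodney = [190, 20, 10, 200]

tom_var = sample_variance(tom)
rodney_var = sample_variance(rodney)

print(round(tom_var), round(rodney_var))

# Cross-check against the standard library implementation.
print(variance(tom), variance(rodney))
```

Rounded to whole numbers, the results are 517 for Tom and 10,833 for Rodney.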
3.2 Estimators
Population Standard Deviation:
$$\sigma_Y = (\sigma_Y^2)^{1/2}$$
Sample Standard Deviation:
$$S_y = (S_y^2)^{1/2}$$
3.2 Sample Standard Deviation Example
Tom:
$$S_{S_T}^2 = 517, \qquad S_{S_T} = \sqrt{517} = 22.7$$
Rodney:
$$S_{S_R}^2 = 10{,}833, \qquad S_{S_R} = \sqrt{10{,}833} = 104$$
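As a quick check of the square roots above, the standard library function `statistics.stdev` returns the sample standard deviation (the square root of the N - 1 sample variance). The variable names below are illustrative only.

```python
from math import sqrt
from statistics import stdev, variance

tom = [110, 85, 90, 135]
rodney = [190, 20, 10, 200]

# Sample standard deviation = square root of the sample variance.
tom_sd = sqrt(variance(tom))
rodney_sd = sqrt(variance(rodney))

print(round(tom_sd, 1), round(rodney_sd))

# stdev() performs the same calculation in one step.
print(round(stdev(tom), 1), round(stdev(rodney)))
```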
3.5 Estimators
Population Covariance:
$$Cov(V, W) = \sum \sum (v - E(v)) (w - E(w)) f(v, w)$$
Sample Covariance:
$$Cov(V, W) = S_{vw} = \frac{\sum_{i=1}^{N} (v_i - \bar{v})(w_i - \bar{w})}{N - 1}$$
3.5 Covariance Example
Day 1 2 3 4
Tom 110 85 90 135
Rodney 190 20 10 200
There is a POSITIVE covariance between Tom
and Rodney’s spending. They seem to go up and down at the same time.
$$Cov(S_T, S_R) = \frac{\sum (S_{Ti} - \bar{S}_T)(S_{Ri} - \bar{S}_R)}{N - 1}$$
$$Cov(S_T, S_R) = \frac{(110 - 105)(190 - 105) + (85 - 105)(20 - 105) + (90 - 105)(10 - 105) + (135 - 105)(200 - 105)}{4 - 1}$$
$$Cov(S_T, S_R) = \frac{425 + 1700 + 1425 + 2850}{3} \approx 2133$$
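The covariance calculation can be sketched in Python as below. `sample_covariance` is a hypothetical helper written for illustration; it follows the sample covariance formula with the N - 1 divisor and uses no external libraries.

```python
def sample_covariance(v, w):
    """Sample covariance: paired deviations from each mean, divided by N - 1."""
    if len(v) != len(w):
        raise ValueError("both series need the same number of observations")
    n = len(v)
    if n < 2:
        raise ValueError("need at least two paired observations")
    vbar = sum(v) / n
    wbar = sum(w) / n
    return sum((vi - vbar) * (wi - wbar) for vi, wi in zip(v, w)) / (n - 1)

tom = [110, 85, 90, 135]
rodney = [190, 20, 10, 200]

cov_tr = sample_covariance(tom, rodney)
print(round(cov_tr))
```

A positive result means the two series tend to be above (or below) their means on the same days.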
3.5 Estimators
Population Correlation:
$$\rho_{vw} = corr(V, W) = \frac{Cov(V, W)}{\sigma_v \sigma_w}$$
Sample Correlation:
$$r_{vw} = corr(V, W) = \frac{Cov(V, W)}{S_v S_w}$$
3.5 Correlation Example
Day 1 2 3 4
Tom 110 85 90 135
Rodney 190 20 10 200
There is a VERY STRONG positive correlation between Tom and Rodney’s spending.
$$Cor(S_T, S_R) = \frac{Cov(S_T, S_R)}{S_{S_T} S_{S_R}} = \frac{2133}{(22.7)(104)} \approx 0.904$$
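The correlation can also be computed without rounding the intermediate values. The sketch below uses a hypothetical helper, `sample_correlation`, built from the sample covariance and the two sample standard deviations. At full precision the result is about 0.902; the 0.904 in the example comes from using the rounded inputs 2133, 22.7 and 104.

```python
from statistics import stdev

def sample_correlation(v, w):
    """Sample correlation: sample covariance divided by both sample sds."""
    if len(v) != len(w):
        raise ValueError("both series need the same number of observations")
    n = len(v)
    vbar = sum(v) / n
    wbar = sum(w) / n
    cov = sum((vi - vbar) * (wi - wbar) for vi, wi in zip(v, w)) / (n - 1)
    return cov / (stdev(v) * stdev(w))

tom = [110, 85, 90, 135]
rodney = [190, 20, 10, 200]

r_tr = sample_correlation(tom, rodney)
print(round(r_tr, 3))

# Same calculation with the rounded values used in the hand calculation.
r_rounded = 2133 / (22.7 * 104)
print(round(r_rounded, 4))
```

Either way, the value is close to +1, which is what "very strong positive correlation" means.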
3.1 Median
Median –Value point in the middle of the data set
-(Data must be arranged in ascending order)
Calculating:
Odd number of observations = middle data point
$$\text{Median Obs} = \frac{N + 1}{2}$$
For example, with 3 data points, the median is the 2nd data point:
$$\text{Median Obs} = \frac{3 + 1}{2} = 2$$
3.1 Median
Calculating:
Even number of observations = average of middle 2 data points
For example, if you have 42 data points:
$$\text{Median Obs} = \frac{N + 1}{2} = \frac{42 + 1}{2} = 21.5$$
The median would be the average of the 21st and 22nd data point.
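The position rule for odd and even sample sizes can be sketched as follows. `median_by_position` is a hypothetical helper written for illustration; `statistics.median` from the standard library gives the same answers.

```python
from statistics import median

def median_by_position(data):
    """Median via the (N + 1) / 2 position rule on sorted data."""
    ordered = sorted(data)            # data must be in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    position = (n + 1) / 2            # 1-based position of the median
    if n % 2 == 1:
        # Odd N: the position is a whole number, so take that data point.
        return ordered[int(position) - 1]
    # Even N: the position ends in .5, so average the two neighbours.
    lower = ordered[int(position) - 1]    # e.g. the 21st point when N = 42
    upper = ordered[int(position)]        # e.g. the 22nd point when N = 42
    return (lower + upper) / 2

tom = [110, 85, 90, 135]              # even N: average of 90 and 110
print(median_by_position(tom), median(tom))

three_points = [3, 1, 2]              # odd N: the 2nd point after sorting
print(median_by_position(three_points), median(three_points))
```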
3.1 Median
Usage:
-The mean is usually used as an “average” or measurement of “central location”
-If there are strong outliers (values way above or way below most others), that could influence the mean, and the median may be a better measure
Example:
At the end of term, 6/60 students were enrolled but
3.1 Percentiles
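-Percentiles are cut-off values that divide the data set so that, when arranged from smallest to largest, p% of the observations are below the pth percentile and (100 - p)% are above the pth percentile
For example, 80% of the observations are below the 80th percentile and 20% are above it.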
Note: The median is the 50th percentile.
3.1 Quartile
-Quartiles are specific percentiles that divide the data into four sections
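1st quartile = 25th percentile
2nd quartile = 50th percentile (the median)
3rd quartile = 75th percentile
Technically there is a 4th quartile (the 100th percentile), but it sits at or above 100% of the data, i.e. it is the maximum.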
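Quartiles can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later). Note that several interpolation conventions exist, and software packages can differ slightly on the 1st and 3rd quartiles. The sketch below assumes the "inclusive" method and pools Tom's and Rodney's spending into one data set purely for illustration.

```python
from statistics import median, quantiles

# Tom's and Rodney's daily spending pooled together (illustrative choice).
spending = [110, 85, 90, 135, 190, 20, 10, 200]

# n=4 splits the data into four sections, giving three cut points.
q1, q2, q3 = quantiles(spending, n=4, method="inclusive")

print(q1, q2, q3)

# The 2nd quartile is the 50th percentile, i.e. the median.
print(q2 == median(spending))
```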
3.2 Max, Min, Range and Mode
Min = minimum = lowest value in the data set
Max = maximum = highest value in the data set
Range = max – min
Mode = the value(s) that show up most often
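These four measures are straightforward to compute. The sketch below uses a small hypothetical data set (not from the course) chosen so that two values tie for most frequent; `statistics.multimode` (Python 3.8 or later) returns every value that ties for the highest count.

```python
from statistics import multimode

# Hypothetical data: 3 and 7 each appear twice, more than any other value.
data = [2, 3, 3, 5, 7, 7, 9]

lowest = min(data)
highest = max(data)
data_range = highest - lowest
modes = multimode(data)       # list of all most-frequent values

print(lowest, highest, data_range, modes)
```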
3.6 Degrees of Freedom
Some distributions (such as the t-distribution) depend on DEGREES OF FREEDOM
Degrees of Freedom are generally dependent on two things:
Sample size (as sample size rises, so do degrees of freedom)
Complication of test (more complicated statistical tests reduce degrees of freedom)
3.6 t-distribution
The t-distribution is both similar in shape to the normal distribution (bell curve) and statistically related to it
The t-distribution is symmetric
50% probability is on each half of the distribution
Statistical analysis often requires us to find critical t-values (t*) on one or both sides of the central mean of zero
These are sometimes referred to as one-tailed or two-tailed values
3.6 t-distribution
[Figure: t-distribution with 2 tails, centred at 0. The critical values -t* and t* cut off the same percentage in each tail.]
3.6 t-distribution
Example 1:
Find the critical t-values (t*) that leave 1% in the two tails combined, with 27 df
(Note: 1% in both tails = 0.5% in each tail)
For p=0.495, df 27 gives t*=2.77, -2.77
3.6 Example 1
[Figure: t-distribution with 27 df and 1% in two tails. Centred at 0, with 49.5% of the area between 0 and each critical value, t* = 2.77 and -2.77.]
3.6 t-distribution
Example 2:
Find the critical t-value (t*) that cuts off 1% of the right tail with 35 df
For 1T=0.01, df 30 gives t*=2.46
df 40 gives t*=2.42
Since 35 is halfway between 30 and 40, a good approximation of df 35 would be:
t*=(2.46+2.42)/2 = 2.44
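The averaging step in Example 2 is a special case of linear interpolation between two table rows. The sketch below uses a hypothetical helper, `interpolate_t`, and takes the two table values from the example; it is an approximation to the exact critical value, just as the hand calculation is.

```python
def interpolate_t(df, df_low, t_low, df_high, t_high):
    """Linearly interpolate a critical t-value between two table rows."""
    if not df_low <= df <= df_high:
        raise ValueError("df must lie between the two table rows")
    if df_low == df_high:
        return t_low
    weight = (df - df_low) / (df_high - df_low)   # 0 at df_low, 1 at df_high
    return t_low + weight * (t_high - t_low)

# Table values from Example 2: 1% in the right tail.
t_35 = interpolate_t(35, 30, 2.46, 40, 2.42)
print(round(t_35, 2))
```

Because 35 is exactly halfway between 30 and 40, the weight is 0.5 and the result equals the simple average. For a df that is not halfway, the same function weights the closer table row more heavily.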
3.6 Example 2
[Figure: t-distribution with 35 df and 1% in the right tail. Centred at 0, with 49% of the area between 0 and t* = 2.44.]
3.6 t-distribution
Typically, the following variable (similar to the normal Z variable seen earlier) will have a t-distribution (we will see examples later):
$$t = \frac{\text{Estimator} - E(\text{Estimator})}{\text{Sample } sd(\text{Estimator})}$$
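The t variable above is simple to compute once its three ingredients are known. The sketch below uses a hypothetical helper, `t_statistic`, and the numbers in the usage line are made up for illustration; they do not come from the course.

```python
def t_statistic(estimate, expected_value, sample_sd):
    """t = (Estimator - E(Estimator)) / sample sd(Estimator)."""
    if sample_sd <= 0:
        raise ValueError("the sample standard deviation must be positive")
    return (estimate - expected_value) / sample_sd

# Hypothetical inputs: an estimate of 105, an expected value of 100,
# and a sample standard deviation of the estimator of 2.5.
t_value = t_statistic(105, 100, 2.5)
print(t_value)
```

The result says how many sample standard deviations the estimate lies from its expected value, which is what gets compared against a critical value t*.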
7.5 Estimators as random variables
Each of these estimators will give us a result based upon the data available.
Therefore, two different data sets can yield two different point estimates.
Therefore the value of the point estimate can be seen as being the result of a chance experiment – obtaining a data set.
Therefore each point estimate is a random variable,
7.5 What distribution to use?
(when examining a sample mean)
IF:
A) The population has a normal distribution (this is a reasonable assumption for many populations)
And
B) You know the population mean
Then
The sample mean follows a NORMAL DISTRIBUTION
7.5 What distribution to use?
If the population doesn’t have a normal distribution:
The central limit theorem states that: “In selecting random samples of size n from a population, the sampling distribution of the sample mean can be approximated by a normal distribution as the sample size becomes large.”
General statistical practice assumes that a sample size of 30 or more is “large” enough
If outliers are an issue, 50 may be a better goal
7.5 What distribution to use?
If you don’t know the population mean:
A t-distribution can be used instead of a normal distribution.
For this course, we will always assume:
a) A normal distribution is appropriate BUT
b) We don’t have the population mean, so
the t-distribution will be used
7.5 Estimators Distribution
Since the sample mean is a variable, we can easily apply expectation and summation rules to find the expected value of the sample mean:
$$\bar{Y} = \frac{1}{N} \sum Y_i$$
$$E(\bar{Y}) = E\left(\frac{1}{N} \sum Y_i\right) = \frac{1}{N} \sum E(Y_i) = \frac{1}{N} N \mu = \mu$$
7.5 Estimators Distribution
If we make the simplifying assumption that there is no covariance between data points (ie: one person’s
consumption is unaffected by the next person’s
consumption), we can easily calculate variance for the sample mean:
$$Var(\bar{Y}) = Var\left(\frac{1}{N} \sum Y_i\right) = \frac{1}{N^2} \sum Var(Y_i) = \frac{1}{N^2} N \sigma_Y^2 = \frac{\sigma_Y^2}{N}$$
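The two results, E(Ybar) = μ and Var(Ybar) = σ²/N, can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is taken to be normal with a hypothetical mean of 10 and standard deviation of 2, and the draws are independent, which matches the no-covariance assumption. None of these numbers come from the course.

```python
import random
from statistics import variance

random.seed(0)            # fixed seed so the run is repeatable

MU, SIGMA = 10.0, 2.0     # hypothetical population mean and sd
N = 25                    # observations in each sample
REPS = 5000               # number of independent samples drawn

# Draw REPS samples of size N and record each sample mean (Ybar).
sample_means = [
    sum(random.gauss(MU, SIGMA) for _ in range(N)) / N
    for _ in range(REPS)
]

avg_of_means = sum(sample_means) / REPS    # should be close to MU
var_of_means = variance(sample_means)      # should be close to SIGMA**2 / N

print(round(avg_of_means, 2), MU)
print(round(var_of_means, 3), SIGMA ** 2 / N)
```

The sample means cluster around the population mean, and their variance is roughly the population variance divided by N, so larger samples give less variable estimates.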
7.5 Estimators Distribution
If we don’t know the population variance of Ybar, we can calculate its sample variance, therefore,
$$Var(\bar{Y}) = \frac{\sigma_Y^2}{N} \quad \Rightarrow \quad SampleVar(\bar{Y}) = \frac{S_y^2}{N}$$
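Applied to Tom's spending from the earlier examples, the sample variance of Ybar is the sample variance of the data divided by N. The sketch below is illustrative only; the variable names are not from the course.

```python
from math import sqrt
from statistics import variance

tom = [110, 85, 90, 135]
n = len(tom)

# Sample variance of the data, then of the sample mean (Ybar).
s2 = variance(tom)            # about 517
var_ybar = s2 / n             # sample variance of Ybar
sd_ybar = sqrt(var_ybar)      # its square root, the sd of Ybar

print(round(var_ybar, 1), round(sd_ybar, 2))
```

With only 4 observations the sample mean of 105 is still quite uncertain; more days of data would shrink this variance.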
7.5 Estimators Distribution
The STANDARD DEVIATION of a point estimate (such as the sample mean) is often referred to as the STANDARD ERROR: