Chapter 3 – Sample Statistics


• By the end of the chapter, you will be able to:

1) Define, interpret and evaluate sample statistics.

–Mean

–Variance and Standard Deviation

–Covariance and Correlation

–Median, Mode, Percentiles


3. POPULATION VS. SAMPLE DATA

Population Data – Full information on the ENTIRE population.

-Includes population probability (pdf)

-Uses formulas containing probability

-ex) data on an ENTIRE class

Sample Data – Partial information from a RANDOM SAMPLE of the population

-Individual data points (no pdf)

-Uses the following formulas

-ex) Study of 2,000 random students


3.1 Estimators

Population Expected Value:

$\mu = E(Y) = \sum y \, f(y)$

Sample Mean:

$\bar{Y} = \frac{\sum_i Y_i}{N}$

Note: From this point on, $\bar{Y}$ may be expressed as Ybar (and likewise for any other variable, e.g. Xbar). For example,

Ybar $= \frac{\sum_i Y_i}{N}$


3.1 Sample Mean Example

Tom and Rodney both go to a 4-day gaming convention. How much they spend (S) each day is listed below:

Day 1 2 3 4

Tom 110 85 90 135

Rodney 190 20 10 200

$\bar{S}_T = \frac{\sum_i S_{Ti}}{N} = \frac{110 + 85 + 90 + 135}{4} = 105$

$\bar{S}_R = \frac{\sum_i S_{Ri}}{N} = \frac{190 + 20 + 10 + 200}{4} = 105$

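The sample mean calculation above can be checked with a few lines of Python. This is a minimal sketch; the helper name `sample_mean` is an illustrative choice, not something defined in the course.

```python
# Daily spending (S) from the example table.
tom = [110, 85, 90, 135]
rodney = [190, 20, 10, 200]

def sample_mean(values):
    """Sum of the observations divided by the number of observations N."""
    return sum(values) / len(values)

mean_tom = sample_mean(tom)
mean_rodney = sample_mean(rodney)

print(mean_tom, mean_rodney)  # 105.0 105.0
```

Both means come out to 105 even though the daily amounts look very different, which is why a measure of spread is needed next.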

3.2 Estimators

Population Variance:

$\sigma_Y^2 = Var(Y) = \sum [y - E(y)]^2 f(y)$

Sample Variance:

$S_y^2 = \frac{\sum_i (Y_i - \bar{Y})^2}{N - 1}$

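To contrast the population formulas with the sample formulas, the sketch below evaluates E(Y) and Var(Y) by weighting each outcome by its probability f(y). The probabilities in `pdf` are made-up values chosen for illustration; they do not come from the course data.

```python
from fractions import Fraction

# Hypothetical probability function f(y): outcome -> probability (assumed values).
pdf = {0: Fraction(1, 5), 1: Fraction(1, 2), 2: Fraction(3, 10)}
assert sum(pdf.values()) == 1  # probabilities must sum to one

# Population expected value: sum of y * f(y)
mu = sum(y * p for y, p in pdf.items())

# Population variance: sum of (y - E(y))^2 * f(y)
sigma2 = sum((y - mu) ** 2 * p for y, p in pdf.items())

print(mu, sigma2)  # 11/10 49/100
```

Exact fractions are used here only so the arithmetic is easy to verify by hand.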

3.2 Sample Variance Example

Day 1 2 3 4

Tom 110 85 90 135

Rodney 190 20 10 200

$S_y^2 = \frac{\sum_i (Y_i - \bar{Y})^2}{N - 1}$

Tom: $S_{S_T}^2 = \frac{\sum_i (S_{Ti} - \bar{S}_T)^2}{N - 1} = \frac{(110 - 105)^2 + (85 - 105)^2 + (90 - 105)^2 + (135 - 105)^2}{4 - 1} = \frac{25 + 400 + 225 + 900}{3} \approx 517$


3.2 Sample Variance Example

Day 1 2 3 4

Tom 110 85 90 135

Rodney 190 20 10 200

$S_y^2 = \frac{\sum_i (Y_i - \bar{Y})^2}{N - 1}$

Rodney: $S_{S_R}^2 = \frac{\sum_i (S_{Ri} - \bar{S}_R)^2}{N - 1} = \frac{(190 - 105)^2 + (20 - 105)^2 + (10 - 105)^2 + (200 - 105)^2}{4 - 1} \approx 10{,}833$


3.2 Sample Variance Example

Day 1 2 3 4

Tom 110 85 90 135

Rodney 190 20 10 200

Even though Tom and Rodney spent the same average amount every day ($105), Rodney’s spending was MUCH more spread out (it changed more from day to day), as seen by the variance.

$S_y^2 = \frac{\sum_i (Y_i - \bar{Y})^2}{N - 1}$

Tom: $S_{S_T}^2 \approx 517$

Rodney: $S_{S_R}^2 \approx 10{,}833$

$\bar{S}_T = \bar{S}_R = 105$

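The two sample variances can be reproduced as follows. This is a sketch: `sample_variance` is an illustrative helper written to mirror the N - 1 formula, and the standard library's `statistics.variance` is used only as a cross-check.

```python
import statistics

tom = [110, 85, 90, 135]
rodney = [190, 20, 10, 200]

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by N - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

var_tom = sample_variance(tom)        # 1550 / 3
var_rodney = sample_variance(rodney)  # 32500 / 3

print(round(var_tom), round(var_rodney))  # 517 10833

# Cross-check against the standard library (also divides by N - 1).
print(round(statistics.variance(tom)), round(statistics.variance(rodney)))  # 517 10833
```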

3.2 Estimators

Population Standard Deviation:

$\sigma_Y = (\sigma_Y^2)^{1/2}$

Sample Standard Deviation:

$S_y = (S_y^2)^{1/2}$


3.2 Sample Standard Deviation Example

Tom: $S_{S_T}^2 \approx 517$, so $S_{S_T} = \sqrt{517} \approx 22.7$

Rodney: $S_{S_R}^2 \approx 10{,}833$, so $S_{S_R} = \sqrt{10{,}833} \approx 104$

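As a quick check of the standard deviations above, the sketch below takes the square root of the sample variance using the standard library's `statistics.stdev`, which uses the same N - 1 divisor.

```python
import statistics

tom = [110, 85, 90, 135]
rodney = [190, 20, 10, 200]

# Sample standard deviation = square root of the sample variance.
sd_tom = statistics.stdev(tom)
sd_rodney = statistics.stdev(rodney)

print(round(sd_tom, 1), round(sd_rodney, 1))  # 22.7 104.1
```

Unlike the variance, the standard deviation is in the same units as the data (dollars per day here), which makes it easier to interpret.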

3.5 Estimators

Population Covariance:

$Cov(V, W) = \sum \sum (v - E(v))(w - E(w)) f(v, w)$

Sample Covariance:

$Cov(V, W) = S_{vw} = \frac{\sum_i (V_i - \bar{V})(W_i - \bar{W})}{N - 1}$


3.5 Covariance Example

Day 1 2 3 4

Tom 110 85 90 135

Rodney 190 20 10 200

There is a POSITIVE covariance between Tom

and Rodney’s spending. They seem to go up and down at the same time.

$Cov(S_T, S_R) = \frac{\sum_i (S_{Ti} - \bar{S}_T)(S_{Ri} - \bar{S}_R)}{N - 1}$

$Cov(S_T, S_R) = \frac{(110 - 105)(190 - 105) + (85 - 105)(20 - 105) + (90 - 105)(10 - 105) + (135 - 105)(200 - 105)}{4 - 1}$

$Cov(S_T, S_R) = \frac{425 + 1700 + 1425 + 2850}{3} \approx 2133$

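The covariance arithmetic can be verified with a short sketch. The helper name `sample_covariance` is illustrative; it follows the N - 1 formula term by term.

```python
tom = [110, 85, 90, 135]
rodney = [190, 20, 10, 200]

def sample_covariance(xs, ys):
    """Sum of cross-products of deviations from each mean, divided by N - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

cov_tr = sample_covariance(tom, rodney)  # 6400 / 3

print(round(cov_tr))  # 2133
```

A positive result means the two series tend to be above (or below) their own means on the same days.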

3.5 Estimators

Population Correlation:

$\rho_{vw} = corr(V, W) = \frac{Cov(V, W)}{\sigma_v \sigma_w}$

Sample Correlation:

$r_{vw} = corr(V, W) = \frac{Cov(V, W)}{S_v S_w}$


3.5 Correlation Example

Day 1 2 3 4

Tom 110 85 90 135

Rodney 190 20 10 200

There is a VERY STRONG positive correlation between Tom and Rodney’s spending.

$Cor(S_T, S_R) = \frac{Cov(S_T, S_R)}{S_{S_T} S_{S_R}} = \frac{2133}{(22.7)(104)} \approx 0.904$

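The correlation can be recomputed as a sketch in Python. Note that carrying full precision through the calculation gives about 0.902, while plugging in the rounded inputs 2133, 22.7 and 104 gives the 0.904 shown above; the difference is rounding only. The helper name `sample_correlation` is illustrative.

```python
import math

tom = [110, 85, 90, 135]
rodney = [190, 20, 10, 200]

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)

r_exact = sample_correlation(tom, rodney)
r_rounded = 2133 / (22.7 * 104)  # same calculation with rounded inputs

print(round(r_exact, 3), round(r_rounded, 3))  # 0.902 0.904
```

Either way the value is close to +1, consistent with a very strong positive correlation.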

3.1 Median

Median – the value in the middle of the data set

-(Data must be arranged in ascending order)

Calculating:

Odd number of observations = middle data point

$\text{Median position} = \frac{N + 1}{2}$

For example, with 3 data points, the median is the 2nd data point: $\frac{3 + 1}{2} = 2$


3.1 Median

Calculating:

Even number of observations = average of middle 2 data points

For example, if you have 42 data points:

$\text{Median position} = \frac{N + 1}{2} = \frac{42 + 1}{2} = 21.5$

The median would be the average of the 21^st and 22^nd data point.

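The odd and even cases can be sketched in one small function. The name `median_of` is illustrative; the spending data are reused from the earlier example, and the three-point list is just the first three sorted values of Tom's data.

```python
def median_of(values):
    """Middle value of the sorted data; average of the two middle values if N is even."""
    ordered = sorted(values)  # data must be in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd N: the middle data point
    return (ordered[mid - 1] + ordered[mid]) / 2  # even N: average of the middle two

print(median_of([85, 90, 110]))        # 90    (3 points: the 2nd one)
print(median_of([110, 85, 90, 135]))   # 100.0 (Tom: average of 90 and 110)
print(median_of([190, 20, 10, 200]))   # 105.0 (Rodney: average of 20 and 190)
```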

3.1 Median

Usage:

-The mean is usually used as an “average” or measurement of “central location”

-If there are strong outliers (values way above or way below most others), that could influence the mean, and the median may be a better measure

Example:

At the end of term, 6/60 students were enrolled but


3.1 Percentiles

-Percentiles are cut-off values that divide the data set so that, when the data are arranged from smallest to largest,

p% of the observations are below the pth percentile

(100 − p)% of the observations are above the pth percentile

For example,

80

Note: The median is the 50^th percentile.


3.1 Quartile

-Quartiles are specific percentiles that divide the data into four sections

1st quartile = 25th percentile

2nd quartile = 50th percentile (the median)

3rd quartile = 75th percentile

Technically there is a 4^th quartile, but it is above 100%

of the data.

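As an illustration of quartiles, the sketch below uses `statistics.quantiles` from the Python standard library (Python 3.8+). The exam scores are hypothetical, and different textbooks and software packages use slightly different interpolation rules, so results on other data sets may differ a little between tools.

```python
import statistics

# Hypothetical exam scores (assumed data, not from the course), already sorted.
scores = [52, 58, 61, 65, 70, 72, 75, 80, 84, 90, 95]

# n=4 asks for the three cut points that split the data into four sections.
q1, q2, q3 = statistics.quantiles(scores, n=4)

print(q1, q2, q3)                 # 61.0 72.0 84.0
print(statistics.median(scores))  # 72 (the 2nd quartile is the median)
```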

3.2 Max, Min, Range and Mode

Min = minimum = lowest value in the data set

Max = maximum = highest value in the data set

Range = max – min

Mode = the value(s) that show up most often

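These four measures map directly onto built-in Python functions. The sketch below uses made-up values; `statistics.multimode` (Python 3.8+) returns every value tied for most frequent, which matches the "value(s)" wording above.

```python
import statistics

# Hypothetical data set (assumed values for illustration).
values = [3, 7, 7, 2, 9, 7, 2]

low = min(values)
high = max(values)
value_range = high - low
modes = statistics.multimode(values)

print(low, high, value_range, modes)  # 2 9 7 [7]

# A data set can have more than one mode when values tie.
tied_modes = statistics.multimode([1, 1, 2, 2, 3])
print(tied_modes)  # [1, 2]
```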

3.6 Degrees of Freedom

Some distributions (such as the t-distribution) depend on DEGREES OF FREEDOM

Degrees of Freedom are generally dependent on two things:

• Sample size (as sample size rises, so do degrees of freedom)

• Complication of test (more complicated statistical tests reduce degrees of freedom)


3.6 t-distribution

• The t-distribution is similar in shape to the normal distribution (bell curve) and statistically related to it

• The t-distribution is symmetric

• 50% of the probability lies on each side of the center

• Statistical analysis often requires us to find critical t-values (t*) on one or both sides of the central mean of zero

• These are sometimes referred to as one-tailed or two-tailed values


3.6 t-distribution

t-distribution with 2 tails:

[Figure: t-distribution centered at 0, with the same percentage in the tail below -t* and in the tail above t*]


3.6 t-distribution

Example 1:

Find the critical t-values (t*) with 1% in two tails with 27 df

(Note: 1% in both tails = 0.5% in each tail, which leaves 49.5% between 0 and each critical value)

For p=0.495, df 27 gives t*=2.77, -2.77


3.6 Example 1

1% in two tails, 27 df:

[Figure: t-distribution centered at 0, with 49.5% of the area between 0 and each critical value, -2.77 and 2.77]


3.6 t-distribution

Example 2:

Find the critical t-value (t*) that cuts off 1% of the right tail with 35 df

For 1T=0.01:

df 30 gives t=2.46

df 40 gives t=2.42

Since 35 is halfway between 30 and 40, a good approximation for df 35 would be:

t*=(2.46+2.42)/2 = 2.44

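Averaging the two table entries is a special case of linear interpolation, which also works when the desired df is not exactly halfway between two table rows. The sketch below is illustrative only: the helper name `interpolate` is hypothetical, and the two table values are the ones quoted above.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation: value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Table values for 1% in the right tail: df 30 -> 2.46, df 40 -> 2.42
t_35 = interpolate(35, 30, 2.46, 40, 2.42)

print(round(t_35, 2))  # 2.44
```

This is still an approximation, because critical t-values do not change exactly linearly with df, but between neighbouring table rows the error is small.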

3.6 Example 2

1% in right tail, 35 df:

[Figure: t-distribution centered at 0, with 49% of the area between 0 and the critical value 2.44]


3.6 t-distribution

Typically, the following variable (similar to the normal Z variable seen earlier) will have a t-distribution (we will see examples later):

$t = \frac{\text{Estimator} - E(\text{Estimator})}{\text{Sample } sd(\text{Estimator})}$


7.5 Estimators as random variables

Each of these estimators will give us a result based upon the data available.

Therefore, two different data sets can yield two different point estimates.

Therefore the value of the point estimate can be seen as being the result of a chance experiment – obtaining a

data set.

Therefore each point estimate is a random variable,


7.5 What distribution to use?

(when examining a sample mean)

IF:

A) The population has a normal distribution (this is a reasonable assumption for many populations)

And

B) You know the population mean

Then

The sample mean follows a NORMAL DISTRIBUTION


7.5 What distribution to use?

If the population doesn’t have a normal distribution:

The central limit theorem states that: “In selecting random samples of size n from a population, the sampling distribution of the sample mean can be approximated by a normal distribution as the sample size becomes large.”

General statistical practice assumes that a sample size of 30 or more is “large” enough

If outliers are an issue, 50 may be a better goal


7.5 What distribution to use?

If you don’t know the population mean:

A t-distribution can be used instead of a normal distribution.

For this course, we will always assume:

a) A normal distribution is appropriate BUT

b) We don’t have the population mean, so

the t-distribution will be used


7.5 Estimators Distribution

Since the sample mean is a variable, we can easily apply expectation and summation rules to find the expected value of the sample mean:

$E(\bar{Y}) = E\left(\frac{\sum_i Y_i}{N}\right) = \frac{1}{N} E\left(\sum_i Y_i\right) = \frac{1}{N} \sum_i E(Y_i) = \frac{1}{N} N \mu_Y = \mu_Y$


7.5 Estimators Distribution

If we make the simplifying assumption that there is no covariance between data points (i.e., one person’s consumption is unaffected by the next person’s consumption), we can easily calculate the variance of the sample mean:

$Var(\bar{Y}) = Var\left(\frac{\sum_i Y_i}{N}\right) = \frac{1}{N^2} Var\left(\sum_i Y_i\right) = \frac{1}{N^2} \sum_i Var(Y_i) = \frac{1}{N^2} N \sigma_Y^2 = \frac{\sigma_Y^2}{N}$

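The two results, that the sample mean has expected value equal to the population mean and variance equal to the population variance divided by N, can be checked exactly on a tiny example. The sketch below assumes a hypothetical population of four equally likely values and enumerates every possible sample of size N = 2 drawn independently (with replacement), so that there is no covariance between draws. The helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population in which every value is equally likely (assumed data).
population = [1, 2, 3, 4]
N = 2  # sample size

def mean(values):
    """Average of equally likely values, kept exact with Fraction."""
    return sum(values, Fraction(0)) / len(values)

def variance(values):
    """Population-style variance: average squared deviation from the mean."""
    m = mean(values)
    return sum(((v - m) ** 2 for v in values), Fraction(0)) / len(values)

mu_y = mean(population)          # 5/2
sigma2_y = variance(population)  # 5/4

# Every equally likely sample of size N, drawn independently (with replacement).
sample_means = [mean(sample) for sample in product(population, repeat=N)]

print(mean(sample_means), mu_y)              # 5/2 5/2
print(variance(sample_means), sigma2_y / N)  # 5/8 5/8
```

Increasing `N` shrinks the variance of the sample mean further, which is why larger samples give more precise estimates.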

7.5 Estimators Distribution

If we don’t know the population variance of Ybar, we can calculate its sample variance instead:

$SampleVar(\bar{Y}) = \widehat{Var}(\bar{Y}) = S_{\bar{Y}}^2 = \frac{S_Y^2}{N}$


7.5 Estimators Distribution

The STANDARD DEVIATION of a point estimate (such as the sample mean) is often referred to as its STANDARD ERROR:

$se(\bar{Y}) = \sqrt{\frac{S_Y^2}{N}} = \frac{S_Y}{\sqrt{N}}$
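For the sample mean, the standard error is the sample standard deviation divided by the square root of N. A minimal sketch using the spending data from the earlier examples (the helper name `standard_error` is illustrative):

```python
import math
import statistics

tom = [110, 85, 90, 135]
rodney = [190, 20, 10, 200]

def standard_error(values):
    """Sample standard deviation divided by the square root of N."""
    return statistics.stdev(values) / math.sqrt(len(values))

se_tom = standard_error(tom)
se_rodney = standard_error(rodney)

print(round(se_tom, 1), round(se_rodney, 1))  # 11.4 52.0
```

Both sample means are 105, but Rodney's is a much less precise estimate of his typical daily spending because his data are far more spread out.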
