Lecture 2

(1)

Techniques of Statistical Analysis I

Master’s Degree Programs

Universitat Pompeu Fabra

Lect_2: Introduction to Statistical Inference

Bruno Arpino

(2)

Lect_2: Intro to Inference

Outline

_{Key terminology: populations and samples; parameters and statistics}

_{Simple random sampling}

_{Getting familiar with Statistical Inference}

_{Probability and the normal distribution}

(3)

Key definitions

_{A population is the collection of all items of interest; N represents the population size}

_{A sample is a subset of the population; n represents the sample size}

_{Individual members (of the population and of the sample) are called units}

(4)

Examples of Populations

_{All registered voters in the United States}

_{All women of childbearing age living in Catalunya}

_{All the students registered in a Master’s at Pompeu Fabra}

(5)

Parameter vs statistic

_{A parameter is a specific characteristic of a population}

Parameter: % of all adult Americans who approve of Barack Obama’s performance as President

_{A statistic is a specific characteristic of a sample}

Statistic: % of 1000 adult Americans in a poll who approve of Obama’s performance as President

Parameter values are (and remain) unknown.

(6)

Sampling

_{How to obtain a sample?}

_{Imagine you want to know how many atheists there are in Spain. Is it a good idea to collect data on a group of people attending a mass in the Catedral del Mar?}

_{Imagine you want to estimate the average level of happiness. Is it a good idea to collect data on a group of retired persons?}

(7)

Simple random sampling

_{Each member of the population is chosen strictly by chance}

_{Each member of the population is equally likely to be chosen}

_{Every possible sample of size n is equally likely to be chosen}

_{Randomness guarantees that there is no systematic bias toward sub-groups of the population (over- or under-representation)}

This is an example of a probability (random)

sampling method – We can specify the
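As a minimal sketch of simple random sampling, the Python snippet below draws a sample of size n from a small population using the standard library. The population of ten numbered units, the sample size and the seed are illustrative assumptions, not part of the lecture.

```python
import random

# Hypothetical population of N = 10 units, labelled 1..10 (illustrative only).
population = list(range(1, 11))
n = 3  # sample size (assumed for the example)

rng = random.Random(42)  # fixed seed so the draw can be reproduced

# random.sample draws n distinct units without replacement;
# every subset of size n is equally likely to be selected.
sample = rng.sample(population, n)
print(sample)
```

Because selection depends only on chance, no sub-group of the population is systematically favoured.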

(8)

Statistical Inference

_{Inference is the process of drawing conclusions about a population based on sample results. Two ways of doing that:}

_{Estimation}

• Point estimate

• Confidence interval

e.g., Estimate the population mean income using sample data

_{Hypothesis testing}

(9)

An example of bad sampling

_{In 1987, Shere Hite published a best-selling book called Women and Love: A Cultural Revolution in Progress.}

_{Hite mailed out 100,000 fifteen-page questionnaires to women who were members of a wide variety of organizations across the U.S. (e.g., church, political and volunteer associations).}

_{Approximately 4,500 questionnaires were returned. Below are a few statements from Hite’s publication:}

• 84% of women are not emotionally satisfied with their relationships

• 95% of women reported emotional and psychological harassment from their partners

• 70% of women married 5 years or more are having extramarital affairs

(10)

Why was Hite’s sampling a bad one?

_{Questionnaires were only sent to women participating in organizations. Are these women representative of the whole female U.S. population? (Non-probability sample: sampling bias)}

_{Only 4,500 questionnaires were returned out of 100,000. Maybe only the angry responded (nonresponse bias)}

_{How data are collected can seriously affect the results}

(11)

Sampling error

_{Probability (random) samples, however, do not guarantee that the unknown population parameter(s) will be estimated without error}

_{The sampling error of a statistic is the error that occurs when we use a sample statistic to predict the value of a population parameter}

(12)

(13)

Key concepts

_{Probability theory is the branch of mathematics concerned with the analysis of random phenomena.}

_{The outcome of a random event cannot be determined before it occurs, but it may be any one of several possible outcomes. The actual outcome is considered to be determined by chance.}

_{In this course we need probability theory to deal with the uncertainty caused by sampling.}

_{In common language, probability is the chance that something will happen: how likely it is that an event will happen.}

(14)

Random variables

_{We toss a die. What’s the probability of getting a “4”?}

P(“4”) = 1/6 = (outcomes that satisfy the event) / (all possible outcomes)

_{X = “outcome from the toss of a die” is called a random variable}

_{A random variable takes some values with given probabilities}

_{X in this case is a discrete random variable (it can only take on a countable number of values)}

_{The list of values and their associated probabilities is called the probability distribution}
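The die example can be written out as a small table of values and probabilities. The sketch below assumes a fair six-sided die and uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# Fair six-sided die (assumption): each face has probability 1/6.
distribution = {face: Fraction(1, 6) for face in range(1, 7)}

# P("4"): one favourable outcome out of six possible outcomes.
p_four = distribution[4]

# P(even): add the probabilities of the outcomes that satisfy the event.
p_even = sum(p for face, p in distribution.items() if face % 2 == 0)

# All possible outcomes together.
p_total = sum(distribution.values())

print(p_four, p_even, p_total)  # 1/6 1/2 1
```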

(15)

Some basic probability rules

_{Any probability is a number between 0 (impossible) and 1 (certain)}

_{All possible outcomes together have probability 1}

E.g.: P(“1” or “2” or … or “6”) = 6 / 6 = 1

_{The probability that an event does not occur is 1 minus the probability that the event does occur (complement rule)}

(16)

Continuous random variables

_{Can assume any real value in an interval. Examples:}

• Time a father spends at home with his children

• Salary

_{Since they can take infinitely many values, it is not useful to consider the probability of individual values. In fact, this will always be 0!}

_{E.g., P(Time = 20 min) = 1 / ∞ = 0}

_{Instead, we typically consider the probability that a random variable assumes values in a given interval. E.g.:}

• P(15 min < Time < 25 min)

• P(Time < 20 min) (this is a cumulative probability)

• P(Salary > 2000 Euros)

(17)

A bit more formally...

_{The cumulative distribution function, F(x), for a continuous random variable X gives the probability that X does not exceed the value x:}

F(x) = P(X ≤ x)

_{We can always write:}

F(x) = P(X ≤ x) = 1 - P(X > x)

_{Let a and b be two possible values of X, with a < b. The probability that X lies between a and b is:}

P(a < X < b) = F(b) - F(a)

(18)

The normal distribution

_{Or bell-shaped or Gaussian or …}

X ~ N(µ, σ²)

[Figure: bell-shaped density f(x) plotted against x]

Symmetric (Mean, Median and Mode are Equal)

Characterized by mean (µ) and variance (σ²), determining, respectively, location and spread

_{Its range goes from −∞ to +∞}

_{Density goes to zero as X goes to −∞ or +∞ and is maximum at the mean}

(19)

Location is determined by the mean

_{Consider two normal distributions with the same σ² but different means µ₁ = 10 < µ₂ = 20}

σ² = 25

[Figure: densities f(x) of X ~ N(10, 25) and X ~ N(20, 25) plotted against x, centred at 10 and 20]

(20)

And the spread by the variance

_{Consider two normal distributions with the same µ but different variances σ₁² = 25 < σ₂² = 36}

[Figure: normal densities centred at x = 5, including X ~ N(5, 5²)]

As variance increases the curve becomes increasingly flat

(21)

Probabilities as areas


(22)

The standardised normal distribution

_{For the special normal distribution with mean = 0 and variance = 1 (standardized normal distribution, Z) there are tables giving areas under the curve.}

_{Any normal distribution (with any mean and variance combination) can be transformed into the standardized normal distribution.}

_{We need to transform X units into Z units by subtracting the mean of X and dividing by its standard deviation:}

$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$
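The transformation from X units to Z units can be sketched as a one-line function. The helper name `z_score` and the numbers used (a normal variable with mean 10 and variance 25) are illustrative assumptions.

```python
import math


def z_score(x: float, mu: float, sigma2: float) -> float:
    """Standardize x: subtract the mean, divide by the standard deviation."""
    return (x - mu) / math.sqrt(sigma2)


# Illustrative values: X ~ N(10, 25), so the standard deviation is 5.
print(z_score(25, 10, 25))  # 3.0  (three standard deviations above the mean)
print(z_score(10, 10, 25))  # 0.0  (the mean itself maps to zero)
```

Note that the function takes the variance, matching the N(µ, σ²) notation, and converts it to a standard deviation before dividing.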

(23)

Example of area calculation

_{The distribution of inhabitants of Fantasyland cities is normal with a mean of 29000 inhabitants and a standard deviation of 3000 inhabitants.}

_{Calculate the probability that the size of a randomly selected city lies between 23000 and 35000 inhabitants.}

X ~ N(29000, 3000²)

(24)

Example of area calculation (cont’d)

_{First step: standardize interval endpoints}

23000 → (23000 - 29000) / 3000 = -2

35000 → (35000 - 29000) / 3000 = +2

X ~ N(29000, 3000²) → Z ~ N(0, 1)

(25)

Example of area calculation (cont’d)

_{Now the problem is how to calculate the area between -2 and +2 under Z ~ N(0, 1)}

_{The table (see next slide) of the standard normal distribution (usually) gives the value of the cumulative distribution, F, for each value z (the area on its left)}

_{Second step is to write the area in terms of F:}

P(-2<Z<+2) = F(+2) - F(-2)

(26)

The standard normal table

_{For each value on the horizontal axis, z, the table gives F(z), that is, the value of the cumulative distribution function up to that value (the area on its left). E.g., F(+2) = 0.9772}

(27)

Example of area calculation (cont’d)

_{To calculate F(-2) we use the fact that, by the symmetry of the normal curve:}

F(-2) = 1 - F(+2)

_{F(-2) = 1 - F(+2) = 1 - 0.9772 = 0.0228}

_{So, P(23000<X<35000) = P(-2<Z<+2) = F(+2) - F(-2) = 0.9772 - 0.0228 = 0.9544}
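The table lookup can be cross-checked in software. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) as a stand-in for the printed table; this is an illustration, not the method used in the lecture.

```python
from statistics import NormalDist

X = NormalDist(mu=29000, sigma=3000)  # city sizes in Fantasyland
Z = NormalDist()                      # standard normal: mean 0, sd 1

# Step 1: standardize the interval endpoints.
z_low = (23000 - X.mean) / X.stdev
z_high = (35000 - X.mean) / X.stdev

# Step 2: write the area in terms of the cumulative distribution F.
p_via_z = Z.cdf(z_high) - Z.cdf(z_low)

# Same probability computed directly on X, without standardizing.
p_direct = X.cdf(35000) - X.cdf(23000)

print(z_low, z_high)          # -2.0 2.0
print(round(Z.cdf(2), 4))     # 0.9772
print(round(p_via_z, 4))      # 0.9545
```

The small difference from 0.9772 - 0.0228 = 0.9544 arises because the table values are rounded to four decimals before subtracting.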

(28)

Summing-up

_{The steps to calculate the areas under the normal curve are:}

1) Standardize the interval endpoints (because we only have the table for Z);

2) Write the probability in terms of F(z);

3) Find the F values on the table.

_{Check out this applet:}

(29)

Why is the normal distribution so important?

_{We’ll learn that if we take different random samples and calculate a statistic (e.g. sample mean) to estimate a parameter (e.g. population mean), the collection of statistic values from these samples usually has approximately a normal distribution.}

Why should I care about that?

(30)

Sampling distribution

_{Lists all possible values of a statistic (e.g., sample mean or sample proportion) obtained from all possible samples of a fixed size drawn from a population, together with their probabilities. How to build it?}

Example:

_{Consider a population of size N=4 (units A, B, C, D)}

_{Consider the variable, X, describing the age of individuals}

(31)

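Summary Measures for the Population Distribution:

$\mu = \frac{\sum_{i} X_i}{N} = \frac{18 + 20 + 22 + 24}{4} = 21$

$\sigma = \sqrt{\frac{\sum_{i} (X_i - \mu)^2}{N}} = 2.236$

[Figure: population distribution P(x), with P(x) = .25 for each of x = 18 (A), 20 (B), 22 (C), 24 (D)]

_{Note that the distribution of X in the population is NOT normal (but uniform)!!!}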
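The two population summary measures can be reproduced with the Python standard library. This is a sketch using the ages of units A to D from the slide; `pstdev` is the population standard deviation (it divides by N, not N - 1).

```python
from statistics import mean, pstdev

# Ages of the four population units, as given on the slide.
ages = {"A": 18, "B": 20, "C": 22, "D": 24}
values = list(ages.values())

mu = mean(values)       # population mean
sigma = pstdev(values)  # population standard deviation (divides by N)

print(mu, round(sigma, 3))  # 21 2.236
```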

(32)

Building a sampling distribution

_{Now consider all possible samples of size n = 2}

_{Sampling method: simple random sampling}

_{Note that sampling is with replacement}

(33)

Building a sampling distribution (cont’d)

_{Consider all possible sample means and their probabilities (relative frequencies)}

_{How many “18”s do we have? Just 1 out of 16. So, the probability that a sample of size 2 will give a sample mean of “18” is 1/16 = 0.0625. And so on…}

_{Note: the sampling distribution of all sample means}
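The sampling distribution of the sample mean for this example can be built by brute force. The sketch below assumes the population ages 18, 20, 22 and 24 and ordered samples of size 2 drawn with replacement, which gives the 16 samples mentioned in the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

ages = [18, 20, 22, 24]  # population values (units A, B, C, D)
n = 2                    # sample size

# All ordered samples of size n drawn with replacement: 4 * 4 = 16.
samples = list(product(ages, repeat=n))

# Sample mean of each sample, kept as an exact fraction.
means = [Fraction(sum(s), n) for s in samples]

# Relative frequency of each distinct sample mean.
counts = Counter(means)
sampling_dist = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

for m, p in sampling_dist.items():
    print(float(m), float(p))
# 18.0 0.0625
# 19.0 0.125
# 20.0 0.1875
# 21.0 0.25
# 22.0 0.1875
# 23.0 0.125
# 24.0 0.0625
```

Although the population is uniform, the distribution of the sample mean already peaks at 21 and is symmetric around it.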

(34)

Sample mean distribution shape

_{For random sampling with “large” n, the sampling distribution of the sample mean follows approximately a normal distribution (central limit theorem)}

_{Check it with this applet:}

http://www.prenhall.com/agresti/applet_files/samplingdist.html

_{Approximate normality applies no matter what the shape of the population distribution}

_{Usually “large” means n>30}

_{If X is normally distributed, then the sampling distribution of the sample mean is also normal}
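As a rough alternative to the applet, the central limit theorem can be explored by simulation. The sketch below is only an illustration under assumed settings: a uniform population of four ages, samples of size n = 40 drawn with replacement, 2000 repetitions and a fixed seed.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)          # fixed seed (assumption) for reproducibility
population = [18, 20, 22, 24]   # uniform, clearly not normal
n = 40                          # "large" sample size (n > 30)
reps = 2000                     # number of simulated samples

# Draw many samples with replacement and record each sample mean.
sample_means = [sum(rng.choices(population, k=n)) / n for _ in range(reps)]

centre = mean(sample_means)     # should be close to the population mean, 21
spread = pstdev(sample_means)   # should be close to 2.236 / sqrt(40)

# For a normal distribution about 95% of values lie within 2 sd of the centre.
within_2sd = sum(abs(m - centre) <= 2 * spread for m in sample_means) / reps

print(round(centre, 2), round(spread, 2), round(within_2sd, 2))
```

A histogram of `sample_means` would show the bell shape directly; the share of values within two standard deviations is a quick numerical check.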

(35)

Summary measures of the Sampling Distribution

_{Consider the average and standard deviation of the 16 values of our sample statistic (sample mean):}

$E(\bar{X}) = \frac{\sum_{j=1}^{16} \bar{X}_j}{16} = \frac{18 + 19 + 21 + \dots + 24}{16} = 21 = \mu$

_{Mmm the average of all sample averages is equal to the TRUE population average. Can this be useful?}

_{The standard deviation of the distribution of sample means measures the variability of sample means across the possible samples. It is called the standard error of the mean}
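Both summary measures can be checked numerically over the 16 possible samples. The sketch below assumes the population ages 18, 20, 22 and 24 and samples of size 2 drawn with replacement.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

ages = [18, 20, 22, 24]

# Sample means of all 16 ordered samples of size 2 (with replacement).
sample_means = [sum(s) / 2 for s in product(ages, repeat=2)]

expected_value = mean(sample_means)   # average of all sample averages
std_error = pstdev(sample_means)      # standard deviation of the sample means

print(expected_value, round(std_error, 3))   # 21.0 1.581

# Population standard deviation divided by the square root of n = 2.
print(round(pstdev(ages) / sqrt(2), 3))      # 1.581
```

The second line of output shows that the standard deviation of the sample means coincides with the population standard deviation divided by the square root of the sample size.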

(36)

The Sample Mean estimator

_{The sample mean estimator is:}

$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$

_{It can be proved that $E(\bar{X}) = \mu$ holds in general}

_{We refer to this property of the sample mean by saying that it is an unbiased estimator of the population mean}

_{It can be proved that, for a normal population, the sample median is also an unbiased estimator of µ, but its variance is higher. We say that the sample mean estimator is more efficient. Check this out with this applet:}

http://www.prenhall.com/agresti/applet_files/samplingdist.html

(37)

Estimators and their properties: definitions

_{An estimator, in general, is defined as a random variable that depends on sample information. Its value provides an approximation to the unknown parameter in the population.}

_{A single value of an estimator is called a point estimate}

Good properties of estimators are:

• Unbiasedness: its expected value is equal to the true

population parameter

• Efficiency: low variability from a sample to another (with

(38)

Standard error of the mean

_{Measures the variability in the sample mean from sample to sample.}

_{It depends on the standard deviation of the variable in the population and the sample size:}

$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

(Note: this formula holds if the population has infinite size. Otherwise a correction factor should be applied. We will not go into further details.)

_{It decreases as the sample size increases (as n increases uncertainty decreases)}
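To see how the standard error shrinks with the sample size, the formula can be wrapped in a small function. The function name `standard_error` and the value σ = 10 are illustrative assumptions.

```python
from math import sqrt


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean: population sd divided by sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sigma / sqrt(n)


sigma = 10.0  # hypothetical population standard deviation
for n in (4, 25, 100):
    print(n, standard_error(sigma, n))
# 4 5.0
# 25 2.0
# 100 1.0
```

Because n enters through its square root, halving the standard error requires four times as many observations.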

(39)

Summing-up

_{The sample mean estimator has expected value equal to the population mean:}

$E(\bar{X}) = \mu$

_{and standard deviation (standard error) equal to the population standard deviation divided by the square root of the sample size:}

$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

_{Its distribution is normal if X is normally distributed (or approximately normal even when X is not normally distributed but the sample size is large).}

(40)

Review exercise

_{You want to understand how your flat mates see you. In particular, you want to understand how dirty they think you are. You use the following question to operationalise your flat mates’ perception of your “dirtiness”:}

“On a scale from 0 to 100 (with 0 meaning extremely clean and 100 meaning extremely dirty), how dirty do you think I am?”

_{Let us assume that you have 3 flat mates, Ana, Joan and Xavi, and that their answers to such a question would be, respectively, 90, 50, 70.}

1. What is the population? What is the variable of interest? What is the parameter of interest?

2. Imagine you are considering drawing a sample of size 2 because you do not want to waste your time asking all of them the same question (or maybe you want to practice what you’re learning in this course ☺). If you want to estimate the average level of perceived dirtiness, what estimator would you use? Why?

(41)

Review exercise (cont’d)

1. The population in this case is the group of your flat mates and has three members (units): Ana, Joan and Xavi. The variable of interest is the level of your dirtiness as perceived by your flat mates and takes these values: X={90, 50, 70}. The parameter of interest is the average of X:

$\mu = \frac{90 + 50 + 70}{3} = 70$

2. To estimate the population average of X we consider the sample mean

$\bar{X} = \frac{1}{2} \sum_{i=1}^{2} X_i$

because it is an unbiased and efficient estimator.

3. From what we saw in previous slides we know that:

$E(\bar{X}) = \mu = 70$

$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

(42)

Review exercise (cont’d)

3. To calculate the standard error of the mean we need the population standard deviation.

$\sigma = \sqrt{\frac{(90 - 70)^2 + (50 - 70)^2 + (70 - 70)^2}{3}} = 16.33$

$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{16.33}{\sqrt{2}} = 11.55$

(43)

Review exercise (cont’d)

3. To build the sampling distribution we have to consider all possible samples of size 2 that we can draw from the population of your flat mates.

Samples

1st Obs   2nd observation
          90       50       70
90        90; 90   90; 50   90; 70
50        50; 90   50; 50   50; 70
70        70; 90   70; 50   70; 70

Sample means

1st Obs   2nd observation
          90   50   70
90        90   70   80
50        70   50   60
70        80   60   70

[Figure: sampling distribution of the sample mean, with probabilities plotted over the values 50, 60, 70, 80, 90; vertical axis from 0 to .3]
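The whole review exercise can be verified with a short script. The sketch below takes the three answers from the exercise and enumerates the nine ordered samples of size 2 with replacement; the variable names are illustrative.

```python
from collections import Counter
from itertools import product
from math import sqrt
from statistics import mean, pstdev

answers = {"Ana": 90, "Joan": 50, "Xavi": 70}
values = list(answers.values())

mu = mean(values)        # population mean
sigma = pstdev(values)   # population standard deviation (divides by N = 3)
se = sigma / sqrt(2)     # standard error of the mean for n = 2

# Sample means of all 3 * 3 = 9 ordered samples of size 2.
sample_means = [sum(s) / 2 for s in product(values, repeat=2)]

# Sampling distribution: relative frequency of each sample mean.
counts = Counter(sample_means)
dist = {m: c / len(sample_means) for m, c in sorted(counts.items())}

print(mu, round(sigma, 2), round(se, 2))  # 70 16.33 11.55
for m, p in dist.items():
    print(m, round(p, 3))
# 50.0 0.111
# 60.0 0.222
# 70.0 0.333
# 80.0 0.222
# 90.0 0.111
```

The mean of the nine sample means is 70 and their standard deviation equals the standard error computed from the formula, in line with the results stated in the exercise.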

(44)

Did you survive???


(45)

If something is not clear

(or you find mistakes in the slides)
