Lecture 2[Mododecompatibilidad]

Academic year: 2020


Techniques of Statistical Analysis I

Master’s Degree Programs

Universitat Pompeu Fabra

Lect_2: Introduction to Statistical Inference

Bruno Arpino

(2)

Lect_2: Intro to Inference

Outline

Key terminology: populations and samples; parameters and statistics

Simple random sampling

Getting familiar with Statistical Inference

Probability and the normal distribution

(3)

Key definitions

A population is the collection of all items of interest; N represents the population size

A sample is a subset of the population; n represents the sample size

Individual members (of the population and of the sample) are called units

(4)

Examples of Populations

All registered voters in the United States

All women of childbearing age living in Catalunya

All the students registered in a Master’s at Pompeu Fabra

(5)

Parameter vs statistic

A parameter is a specific characteristic of a population

A statistic is a specific characteristic of a sample

Parameter: % of all adult Americans who approve of Barack Obama’s performance as President

Statistic: % of 1000 adult Americans in a poll who approve of Obama’s performance as President

Parameter values are (and remain) unknown.

(6)

Sampling

How to obtain a sample?

Imagine you want to know how many atheists there are in Spain. Is it a good idea to collect data on a group of people attending a mass in the Catedral del Mar?

Imagine you want to estimate the average level of happiness. Is it a good idea to collect data on a group of retired persons?

(7)

Simple random sampling

Each member of the population is chosen strictly by chance

Each member of the population is equally likely to be chosen

Every possible sample of size n is equally likely to be chosen

Randomness guarantees that there is no systematic bias toward sub-groups of the population (over/under representation)

This is an example of a probability (random) sampling method – We can specify the
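Simple random sampling can be sketched in a few lines of Python. This is only an illustration: the population of 20 numbered units, the sample size of 5 and the seed are assumptions, not part of the slides.

```python
import random

# Hypothetical population: 20 units labelled 1..20 (illustrative only).
population = list(range(1, 21))

# A fixed seed makes the draw reproducible; any seed would do.
rng = random.Random(42)

# random.sample draws without replacement, and every subset of size k
# is equally likely to be chosen: a simple random sample.
sample = rng.sample(population, k=5)

print(sample)
```

Because each unit has the same chance of selection, no sub-group of the population is systematically favoured.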

(8)

Statistical Inference

Inference is the process of drawing conclusions about a population based on sample results. Two ways of doing that:

Estimation

• Point estimate

• Confidence interval

e.g., Estimate the population mean income using sample data

Hypothesis testing

(9)

An example of bad sampling

In 1987, Shere Hite published a best-selling book called Women and Love: A Cultural Revolution in Progress.

Hite mailed out 100,000 fifteen-page questionnaires to women who were members of a wide variety of organizations across the U.S. (e.g., church, political and volunteer associations, etc).

Approximately 4,500 questionnaires were returned. Below are a few statements from Hite’s publication:

• 84% of women are not emotionally satisfied with their relationships

• 95% of women reported emotional and psychological harassment from their partners

• 70% of women married 5 years or more are having extramarital affairs

(10)

Why was Hite’s sampling a bad one?

Questionnaires were only sent to women participating in organizations. Are these women representative of the whole female U.S. population? (Non-probability sample: sampling bias)

Only 4,500 questionnaires were returned out of 100,000. Maybe only the angry responded (nonresponse bias)

How data are collected can seriously affect the results

(11)

Sampling error

Probability (random) samples, however, do not guarantee that the unknown population parameter(s) will be estimated without errors

The sampling error of a statistic is the error that occurs when we use a sample statistic to predict the

(12)
(13)

Key concepts

Probability theory is the branch of mathematics concerned with the analysis of random phenomena.

The outcome of a random event cannot be determined before it occurs, but it may be any one of several possible outcomes. The actual outcome is considered to be determined by chance.

In this course we need Probability theory to deal with uncertainty caused by sampling.

In common language, probability is the chance that something will happen - how likely it is that an event will happen.

(14)

Random variables

We toss a die. What’s the probability of getting a “4”?

P(“4”) = 1/6 = outcomes that satisfy the event / all possible outcomes

X = “outcome from the toss of a die” is called a random variable

A random variable takes some values with given probabilities

X in this case is a discrete random variable (can only take on a countable number of values)

The list of values and their associated probabilities is called

(15)

Some basic probability rules

Any probability is a number between 0 (impossible) and 1 (certain)

All possible outcomes together have probability 1

E.g.: P(“1” or “2” or … “6”) = 6 / 6 = 1

The probability that an event does not occur is 1 minus the probability that the event does occur (complement rule)
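The rules above can be checked on the die example with exact fractions. This is a minimal sketch; the variable names are illustrative and a fair die is assumed.

```python
from fractions import Fraction

# Fair die: each of the six faces has probability 1/6.
outcomes = [1, 2, 3, 4, 5, 6]
p = {face: Fraction(1, 6) for face in outcomes}

p_four = p[4]              # P("4")
total = sum(p.values())    # all possible outcomes together
p_not_four = 1 - p_four    # complement rule: P(not "4")

print(p_four, total, p_not_four)  # 1/6 1 5/6
```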

(16)

Continuous random variables

Can assume any real value in an interval. Examples:

• Time a father spends at home with his children

• Salary

Since they can take infinitely many values, it is not useful to consider the probability of individual values. In fact this will always be 0!

E.g., P(Time = 20 min) = 1 / ∞ = 0

Instead, we typically consider the probability that a random variable assumes values in a given interval. E.g.:

• P(15 min < Time < 25 min)

• P(Time < 20 min) (this is a cumulative probability)

• P(Salary > 2000 Euros)

(17)

A bit more formally...

The cumulative distribution function, F(x), for a continuous random variable X gives the probability that X does not exceed the value of x:

F(x) = P(X ≤ x)

We can always write:

F(x) = P(X ≤ x) = 1 - P(X > x)

Let a and b be two possible values of X, with a < b. The probability that X lies between a and b is:

P(a < X < b) = F(b) - F(a)
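As a hedged illustration of interval probabilities computed from F, the sketch below assumes that Time is normally distributed with mean 20 minutes and standard deviation 5 minutes. These values are invented for the example and do not come from the slides.

```python
from statistics import NormalDist

# Assumed distribution for Time (in minutes); purely illustrative.
time = NormalDist(mu=20, sigma=5)

a, b = 15, 25
p_between = time.cdf(b) - time.cdf(a)   # P(a < Time < b) = F(b) - F(a)
p_below_20 = time.cdf(20)               # cumulative probability F(20)
p_above_20 = 1 - time.cdf(20)           # P(Time > 20) = 1 - F(20)

print(round(p_between, 4), p_below_20, p_above_20)
```

For a continuous variable a single point has probability 0, so it makes no difference whether the interval endpoints are included.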

(18)

The normal distribution

Or bell-shaped or Gaussian or …

$X \sim N(\mu, \sigma^2)$

[Figure: bell-shaped density f(x) plotted against x, with spread σ²]

Symmetric (Mean, Median and Mode are Equal)

Characterized by mean (µ) and variance (σ²), determining, respectively, location and spread

Its range goes from −∞ to +∞

Density goes to zero as X goes to −∞ or +∞ and is maximum at the mean

(19)

Location is determined by the mean

Consider two normal distributions with the same σ² but different means µ1 = 10 < µ2 = 20

$X \sim N(10, 25)$ and $X \sim N(20, 25)$

[Figure: two density curves f(x) with σ² = 25, centred at x = 10 and x = 20]

(20)

And the spread by the variance

Consider two normal distributions with the same µ but different variances σ1² = 25 < σ2² = 36

$X \sim N(5, 5^2)$

[Figure: two density curves centred at x = 5]

As variance increases the curve becomes increasingly flat

(21)

Probabilities as areas


(22)

The standardised normal distribution

For the special normal distribution with mean = 0 and variance = 1 (standardized normal distribution, Z) there are tables giving areas under the curve.

Any normal distribution (with any mean and variance combination) can be transformed into the standardized normal distribution.

We need to transform X units into Z units by subtracting the mean of X and dividing by its standard deviation:

$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$

(23)

Example of area calculation

The distribution of inhabitants of Fantasyland cities is normal with mean of 29000 inhabitants and a standard deviation of 3000 inhabitants.

Calculate the probability that the size of a randomly selected city lies between 23000 and 35000 inhabitants.

$X \sim N(29000, 3000^2)$

(24)

Example of area calculation (cont’d)

First step: standardize interval endpoints

$X \sim N(29000, 3000^2)$ is transformed into $Z \sim N(0, 1)$

23000 → (23000 - 29000) / 3000 = -2

35000 → (35000 - 29000) / 3000 = +2

(25)

Example of area calculation (cont’d)

Now the problem is how to calculate the area between -2 and +2 under $Z \sim N(0, 1)$

The table (see next slide) of the standard normal distribution (usually) gives the value of the cumulative distribution, F, for each value z (the area on its left)

Second step is to write the area in terms of F:

P(-2 < Z < +2) = F(+2) - F(-2)

(26)

The standard normal table

For each value on the horizontal axis, z, the table gives F(z), that is the value of the cumulative distribution function up to that value (the area on its left). E.g., F(+2) = 0.9772

(27)

Example of area calculation (cont’d)

To calculate F(-2) we use the fact that, by the symmetry of the standard normal distribution:

F(-2) = 1 - F(+2) = 1 - 0.9772 = 0.0228

So, P(23000 < X < 35000) = P(-2 < Z < +2) = F(+2) - F(-2) = 0.9772 - 0.0228 = 0.9544
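The Fantasyland calculation can be cross-checked without a printed table. The sketch below uses Python's standard-library `NormalDist` in place of the table; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 29000, 3000      # mean and standard deviation of city size
lo, hi = 23000, 35000        # interval of interest

# First step: standardize the interval endpoints.
z_lo = (lo - mu) / sigma
z_hi = (hi - mu) / sigma

# Second and third steps: write the area in terms of F and evaluate F.
Z = NormalDist()             # standard normal: mean 0, standard deviation 1
prob = Z.cdf(z_hi) - Z.cdf(z_lo)

print(z_lo, z_hi, round(prob, 4))  # -2.0 2.0 0.9545
```

The software value differs from the table-based result in the last digit only because the table entries are rounded to four decimals.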

(28)

Summing-up

The steps to calculate the areas under the normal curve are:

1) Standardize the interval endpoints (because we only have the table for Z);

2) Write the probability in terms of F(z);

3) Find the F values on the table.

Check out this applet:

(29)

Why is the normal distribution so important?

We’ll learn that if we take different random samples and calculate a statistic (e.g. sample mean) to estimate a parameter (e.g. population mean), the collection of statistic values from these samples usually has approximately a normal distribution.

Why should I care about that?

(30)

Sampling distribution

Lists all possible values of a statistic (e.g., sample mean or sample proportion) that we obtain considering all possible samples of a fixed size that we can obtain from a population and their probabilities. How to build it?

Example:

Consider a population of size N=4 (units A, B, C, D)

Consider the variable, X, describing the age of individuals

(31)

Summary Measures for the Population Distribution

$\mu = \frac{\sum_i X_i}{N} = \frac{18 + 20 + 22 + 24}{4} = 21$

$\sigma = \sqrt{\frac{\sum_i (X_i - \mu)^2}{N}} = 2.236$

[Figure: bar chart of P(x); each of the ages 18, 20, 22, 24 (units A, B, C, D) has probability .25]

Note that the distribution of X in the population is NOT normal (but uniform)!!!
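The two population summary measures can be reproduced with a short sketch. The mapping of ages to units A to D follows the axis labels of the chart; `pstdev` is used because it divides by N, the population size.

```python
from statistics import mean, pstdev

# Ages of the four population units.
ages = {"A": 18, "B": 20, "C": 22, "D": 24}
values = list(ages.values())

mu = mean(values)        # population mean
sigma = pstdev(values)   # population standard deviation (divides by N)

print(mu, round(sigma, 3))  # 21 2.236
```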

(32)

Building a sampling distribution

Now consider all possible samples of size n = 2

Sampling method: simple random sampling

[Figure: population units A, B, C, D]

Note that sampling is with replacement

(33)

Building a sampling distribution (cont’d)

Consider all possible sample means and their probabilities (relative frequencies)

How many “18” do we have? Just 1 out of 16. So, the probability that one sample of size 2 will give as sample mean “18” is 1/16 = 0.0625. And so on…

Note: the sampling distribution of all sample means
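The full sampling distribution for this example can be built by brute force. The sketch below enumerates all ordered samples of size 2 drawn with replacement from the ages 18, 20, 22 and 24; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

ages = [18, 20, 22, 24]

# All ordered samples of size 2 with replacement: 4 * 4 = 16 samples.
samples = list(product(ages, repeat=2))

# Sample mean of each sample, then the relative frequency of each mean.
means = [(first + second) / 2 for first, second in samples]
counts = Counter(means)
distribution = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

for m, prob in distribution.items():
    print(m, prob)
```

The mean 18 appears once in 16 samples, matching the 1/16 = 0.0625 worked out above, and the most frequent mean is 21.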

(34)

Sample mean distribution shape

For random sampling with “large” n, the sampling distribution of the sample mean follows approximately a normal distribution (central limit theorem)

Check it with this applet:

http://www.prenhall.com/agresti/applet_files/samplingdist.html

Approximate normality applies no matter what the shape of the population distribution

Usually “large” means n>30

If X is normally distributed then also the sampling distribution of
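The central limit theorem can also be explored by simulation rather than with the applet. The sketch below is an assumption-laden illustration: the skewed exponential population (mean 1, standard deviation 1), the sample size n = 40, the 5000 repetitions and the seed are all choices made for this example.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)   # fixed seed for reproducibility
n = 40                   # sample size ("large" in the sense n > 30)
reps = 5000              # number of simulated samples

# Population: exponential with rate 1, i.e. mean 1 and standard deviation 1.
# It is strongly skewed, so clearly not normal.
sample_means = [
    mean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

se = 1 / n ** 0.5        # theoretical standard error: sigma / sqrt(n)

# Share of sample means within 1.96 standard errors of the true mean;
# under approximate normality this should be close to 0.95.
inside = sum(abs(m - 1.0) <= 1.96 * se for m in sample_means) / reps

print(round(mean(sample_means), 3), round(pstdev(sample_means), 3), round(inside, 3))
```

Even though the population is skewed, the simulated sample means centre on the population mean and behave roughly like a normal variable.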

(35)

Summary measures of the Sampling Distribution

Consider the average and standard deviation of the 16 values of our sample statistic (sample mean):

$E(\bar{X}) = \frac{\sum_{j=1}^{16} \bar{X}_j}{16} = \frac{18 + 19 + 21 + \dots + 24}{16} = 21$

Mmm the average of all sample averages is equal to the TRUE population average. Can this be useful?

The standard deviation of the distribution of sample means measures the variability of sample means across the possible samples. It is called standard error of the mean
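Both summary measures can be computed directly from the 16 sample means. This is a minimal sketch for the age example (18, 20, 22, 24; samples of size 2 with replacement), which also compares the result with the population standard deviation divided by the square root of n.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

ages = [18, 20, 22, 24]
n = 2

# Sample mean of every possible ordered sample of size n (with replacement).
sample_means = [mean(sample) for sample in product(ages, repeat=n)]

expected_value = mean(sample_means)    # average of all sample averages
standard_error = pstdev(sample_means)  # standard deviation of the sample means
sigma = pstdev(ages)                   # population standard deviation

print(expected_value, round(standard_error, 3), round(sigma / sqrt(n), 3))
# 21 1.581 1.581
```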

(36)

The Sample Mean estimator

The sample mean estimator is:

$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$

It can be proved that $E(\bar{X}) = \mu$ holds in general

We refer to this property of the sample mean saying that it is an unbiased estimator of the population mean

It can be proved that the median estimator is also unbiased (for a symmetric population such as the normal) but its variance is higher. We say that the sample mean estimator is more efficient. Check this out with this applet:

http://www.prenhall.com/agresti/applet_files/samplingdist.html

(37)

Estimators and their properties: definitions

An estimator, in general, is defined as a random variable that depends on sample information. Its value provides an approximation to the unknown parameter in the population.

A single value of an estimator is called point estimate

Good properties of estimators are:

Unbiasedness: its expected value is equal to the true population parameter

Efficiency: low variability from a sample to another (with

(38)

Standard error of the mean

Measures the variability in the sample mean from sample to sample.

It depends on the standard deviation of the variable in the population and the sample size:

$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

(Note: this formula holds if the population has infinite size. Otherwise a correction factor should be applied. We will not go into further details.)

It decreases as the sample size increases (as n increases uncertainty decreases)
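A short sketch shows how the standard error shrinks as n grows. The population standard deviation of 10 and the sample sizes are hypothetical values chosen for illustration, and `standard_error` is a helper defined here, not a library function.

```python
from math import sqrt


def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)


sigma = 10.0  # hypothetical population standard deviation

# Quadrupling the sample size halves the standard error.
for n in (25, 100, 400):
    print(n, standard_error(sigma, n))
# 25 2.0
# 100 1.0
# 400 0.5
```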

(39)

Summing-up

The sample mean estimator has expected value equal to the population mean:

$E(\bar{X}) = \mu$

and standard deviation (standard error) equal to the population standard deviation divided by the square root of the sample size:

$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

Its distribution is normal if X is normally distributed (and approximately normal even when X is not normally distributed but the sample size is large).

(40)

Review exercise

You want to understand how your flat mates see you. In particular, you want to understand how dirty they think you are. You use the following question to operationalise your flat mates’ perception of your “dirtiness”:

“On a scale from 0 to 100 (with 0 meaning extremely clean and 100 meaning extremely dirty), how dirty do you think I am?”

Let us assume that you have 3 flat mates, Ana, Joan and Xavi, and that their answers to such a question would be, respectively, 90, 50, 70.

1. What is the population? What is the variable of interest? What is the parameter of interest?

2. Imagine you are considering drawing a sample of size 2 because you do not want to waste your time asking all of them the same question (or maybe you want to practice with what you’re learning in this course ☺). If you want to estimate the average level of perceived dirtiness, what estimator would you use? Why?

(41)

Review exercise (cont’d)

1. The population in this case is the group of your flat mates and has three members (units): Ana, Joan and Xavi. The variable of interest is the level of your dirtiness as perceived by your flat mates and takes these values: X={90, 50, 70}. The parameter of interest is the average of X:

$\mu = \frac{90 + 50 + 70}{3} = 70$

2. To estimate the population average of X we consider the sample mean

$\bar{X} = \frac{1}{2} \sum_{i=1}^{2} X_i$

because it is an unbiased and efficient estimator.

3. From what we saw in previous slides we know that:

$E(\bar{X}) = \mu = 70$

$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

(42)

Review exercise (cont’d)

3. To calculate the standard error of the mean we need the population standard deviation.

$\sigma = \sqrt{\frac{(90 - 70)^2 + (50 - 70)^2 + (70 - 70)^2}{3}} = 16.33$

$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{16.33}{\sqrt{2}} = 11.55$

(43)

Review exercise (cont’d)

3. To build the sampling distribution we have to consider all possible samples of size 2 that we can draw from the population of your flat mates.

Samples

1st obs    2nd observation
           90        50        70
90         90; 90    90; 50    90; 70
50         50; 90    50; 50    50; 70
70         70; 90    70; 50    70; 70

Sample means

1st obs    2nd observation
           90    50    70
90         90    70    80
50         70    50    60
70         80    60    70

[Figure: bar chart of the sampling distribution of the sample mean, P(X̄), over the values 50, 60, 70, 80, 90; vertical axis from 0 to .3]

.1 .2 .3

(44)

Did you survive???


(45)

If something is not clear (or you find mistakes in the slides) do not hesitate to come to office hours or e-mail me
