Techniques of Statistical
Analysis I
Master’s Degree Programs
Universitat Pompeu Fabra
Lect_2: Introduction to Statistical Inference
Bruno Arpino
Lect_2: Intro to Inference
Outline
Key terminology: populations and samples; parameters and statistics
Simple random sampling
Getting familiar with Statistical Inference
Probability and the normal distribution
Key definitions
A population is the collection of all items of interest; N represents the population size
A sample is a subset of the population; n represents the sample size
Individual members (of the population and the sample) are called units
Examples of Populations
All registered voters in the United States
All women of childbearing age living in Catalunya
All the students registered in a Master’s at Pompeu Fabra
Parameter vs statistic
A parameter is a specific characteristic of a population
A statistic is a specific characteristic of a sample
Parameter: % of all adult Americans who approve of Barack Obama’s performance as President
Statistic: % of 1000 adult Americans in a poll who approve of Obama’s performance as President
Parameter values are (and remain) unknown.
Sampling
How to obtain a sample?
Imagine you want to know how many atheists there are in Spain. Is it a good idea to collect data on a group of people attending a mass in the Catedral del Mar?
Imagine you want to estimate the average level of happiness. Is it a good idea to collect data on a group of retired persons?
Simple random sampling
Each member of the population is chosen strictly by chance
Each member of the population is equally likely to be chosen
Every possible sample of size n is equally likely to be chosen
Randomness guarantees that there is no systematic bias toward sub-groups of the population (over/under representation)
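A minimal sketch of drawing a simple random sample in Python. The population of 1000 numbered units, the sample size of 10 and the seed are illustrative assumptions, not values from the slides. Note that `random.sample` draws without replacement, so no unit appears twice.

```python
import random

# Hypothetical population: units labelled 1..N (an assumption for illustration).
N = 1000
population = list(range(1, N + 1))

n = 10                   # sample size (also an assumption)
rng = random.Random(42)  # fixed seed so the draw can be reproduced

# Every subset of n distinct units is equally likely to be selected.
sample = rng.sample(population, n)
print(sample)
```

Changing the seed gives a different sample, but no unit or sub-group is systematically favoured.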
This is an example of a probability (random)
sampling method – We can specify the
Statistical Inference
Inference is the process of drawing conclusions about a population based on sample results. Two ways of doing that:
Estimation
• Point estimate
• Confidence interval
e.g., Estimate the population mean income using sample data
Hypothesis testing
An example of bad sampling
In 1987, Shere Hite published a best-selling book called Women and Love: A Cultural Revolution in Progress.
Hite mailed out 100,000 fifteen-page questionnaires to women who were members of a wide variety of organizations across the U.S. (e.g., church, political and volunteer associations, etc.).
Approximately 4,500 questionnaires were returned. Below are a few statements from Hite’s publication:
• 84% of women are not emotionally satisfied with their relationships
• 95% of women reported emotional and psychological
harassment from their partners
• 70% of women married 5 years or more are having extramarital affairs
Why was Hite’s sampling a bad one?
Questionnaires were only sent to women participating in organizations. Are these women representative of the whole female U.S. population? (non-probability sample, sampling bias)
Only 4,500 questionnaires were returned out of 100,000. Maybe only the angry responded (nonresponse bias)
How data are collected can seriously affect the results
Sampling error
Probability (random) samples, however, do not guarantee that the unknown population parameter(s) will be estimated without errors
The sampling error of a statistic is the error that occurs when we use a sample statistic to predict the value of a population parameter
Key concepts
Probability theory is the branch of mathematics concerned
with the analysis of random phenomena.
The outcome of a random event cannot be determined
before it occurs, but it may be any one of several possible
outcomes. The actual outcome is considered to be determined
by chance.
In this course we need Probability theory to deal with
uncertainty caused by sampling.
In common language, probability is the chance that
something will happen - how likely it is that an event will
happen.
Random variables
We toss a die. What’s the probability of getting a “4”?
P(“4”) = 1/6 = outcomes that satisfy the event / all possible outcomes
X = “outcome from the toss of a die” is called a random variable
A random variable takes some values with given probabilities. X in this case is a discrete random variable (can only take on a countable number of values)
The list of values and their associated probabilities is called the probability distribution
Some basic probability rules
Any probability is a number between 0 and 1
[Figure: probability scale running from 0 (impossible) to 1 (certain)]
All possible outcomes together have probability 1
E.g.: P(“1” or “2” or … “6”) = 6 / 6 = 1
The probability that an event does not occur is 1 minus the probability that the event does occur (complement rule)
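The three rules can be checked on the die example with exact fractions. This is only an illustrative sketch; the variable names are assumptions introduced here.

```python
from fractions import Fraction

# Fair die: each of the six outcomes has probability 1/6.
outcomes = [1, 2, 3, 4, 5, 6]
p = {x: Fraction(1, 6) for x in outcomes}

p_four = p[4]            # P("4")
p_total = sum(p.values())  # all possible outcomes together
p_not_four = 1 - p_four  # complement rule

print(p_four, p_total, p_not_four)
```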
Continuous random variables
Can assume any real value in an interval. Examples:
• Time a father spends at home with his children
• Salary
Since they can take infinitely many values, it is not useful to consider the probability of individual values. In fact this will always be 0!
E.g., P(Time = 20 min) = 1 / ∞ = 0
Instead, we typically consider the probability that a random
variable assumes values in a given interval. E.g.:
• P(15 min < Time < 25 min)
• P(Time < 20 min)
• P(Salary > 2000 Euros)
This is a cumulative probability
A bit more formally...
The cumulative distribution function, F(x), for a continuous random variable X gives the probability that X does not exceed the value of x:
F(x) = P(X ≤ x)
We can always write:
F(x) = P(X ≤ x) = 1 - P(X > x)
Let a and b be two possible values of X, with a < b. The probability that X lies between a and b is:
P(a < X < b) = F(b) - F(a)
The normal distribution
Or bell-shaped or Gaussian or …
X ~ N(µ, σ²)
[Figure: bell-shaped density f(x) plotted against x]
Symmetric (Mean, Median and Mode are Equal)
Characterized by mean (µ) and variance (σ²), determining, respectively, location and spread
Its range goes from − ∞ to + ∞
Density goes to zero as X goes to − ∞ or + ∞ and is maximum at the mean
Location is determined by the mean
Consider two normal distributions with the same variance σ² = 25 but different means µ1 = 10 < µ2 = 20
[Figure: densities f(x) of X ~ N(10, 25) and X ~ N(20, 25), centred at x = 10 and x = 20]
And the spread by the variance
Consider two normal distributions with the same µ but different variances σ1² = 25 < σ2² = 36
[Figure: normal densities centred at x = 5, including X ~ N(5, 5²)]
As variance increases the curve becomes increasingly flat
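The effect of the mean and the variance on the curve can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library, which takes the standard deviation (not the variance) as its second argument. The second distribution in the spread comparison, N(5, 36), is an assumption based on the variances quoted above.

```python
from statistics import NormalDist

# Same variance (25, i.e. sigma = 5), different means: the curve only shifts.
a = NormalDist(mu=10, sigma=5)   # N(10, 25)
b = NormalDist(mu=20, sigma=5)   # N(20, 25)
peak_a = a.pdf(10)               # density at the mean of a
peak_b = b.pdf(20)               # density at the mean of b

# Same mean, different variances: the larger variance gives a flatter curve.
narrow = NormalDist(mu=5, sigma=5)   # N(5, 25)
wide = NormalDist(mu=5, sigma=6)     # N(5, 36), assumed second curve
peak_narrow = narrow.pdf(5)
peak_wide = wide.pdf(5)

print(peak_a, peak_b)          # equal heights, different locations
print(peak_narrow, peak_wide)  # lower peak for the larger variance
```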
Probabilities as areas
The standardised normal distribution
For the special normal distribution with mean = 0 and variance = 1 (standardized normal distribution, Z) there are tables giving areas under the curve.
Any normal distribution (with any mean and variance combination) can be transformed into the standardized normal distribution.
We need to transform X units into Z units by subtracting the mean of X and dividing by its standard deviation:
Z = (X − µ) / σ ~ N(0, 1)
Example of area calculation
The distribution of inhabitants of Fantasyland cities is normal
with mean of 29000 inhabitants and a standard deviation of
3000 inhabitants.
Calculate the probability that the size of a randomly selected
city lies between 23000 and 35000 inhabitants.
X ~ N(29000, 3000²)
Example of area calculation (cont’d)
First step: standardize interval endpoints
23000 → (23000 − 29000) / 3000 = −2
35000 → (35000 − 29000) / 3000 = +2
X ~ N(29000, 3000²) → Z ~ N(0, 1)
Example of area calculation (cont’d)
Now the problem is how to calculate the area between -2 and +2 under Z ~ N(0, 1)
The table (see next slide) of the standard normal distribution (usually) gives the value of the cumulative distribution, F, for each value z (the area on its left)
Second step is to write the area in terms of F:
P(-2<Z<+2) = F(+2) – F(-2)
The standard normal table
For each value on the horizontal axis, z, the table gives F(z), that is
the value of the cumulative distribution function up to that value (the area on its left). E.g., F(+2) = 0.9772
Example of area calculation (cont’d)
To calculate F(-2) we use the fact that, by the symmetry of the standard normal distribution:
F(-2) = 1 - F(+2)
F(-2) = 1 - F(+2) = 1 - 0.9772 = 0.0228
So, P(23000<X<35000) = P(-2<Z<+2) = F(+2) – F(-2) = 0.9772 - 0.0228 = 0.9544
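The same calculation can be reproduced in software instead of the printed table. This is a sketch using `statistics.NormalDist` from the Python standard library; the variable names are assumptions. The result differs from the table-based figure only in the fourth decimal, because the table values are rounded.

```python
from statistics import NormalDist

mu, sigma = 29000, 3000        # mean and standard deviation of city size
lower, upper = 23000, 35000    # interval of interest

# Step 1: standardize the interval endpoints.
z_lower = (lower - mu) / sigma
z_upper = (upper - mu) / sigma

# Steps 2 and 3: write the area in terms of F and evaluate F.
Z = NormalDist()               # standard normal, mean 0 and sd 1
prob = Z.cdf(z_upper) - Z.cdf(z_lower)

# Cross-check: the same area computed directly on the X scale.
X = NormalDist(mu, sigma)
prob_direct = X.cdf(upper) - X.cdf(lower)

print(z_lower, z_upper, round(prob, 4))  # about 0.9545
```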
Summing-up
The steps to calculate the areas under the normal curve are:
1) Standardize the interval endpoints (because we only have the table for Z);
2) Write the probability in terms of F(z);
3) Find the F values on the table.
Check out this applet:
Why is the normal distribution so important?
We’ll learn that if we take different random samples and
calculate a statistic (e.g. sample mean) to estimate a
parameter (e.g. population mean), the collection of statistic
values from these samples usually has approximately a normal
distribution.
Why should I care about that?
Sampling distribution
Lists all possible values of a statistic (e.g., sample mean or
sample proportion) that we obtain considering all possible
samples of a fixed size that we can obtain from a population
and their probabilities. How to build it?
Example:
Consider a population of size N=4 (units A, B, C, D)
Consider the variable, X, describing the age of individuals
Values of X: 18, 20, 22, 24
Summary Measures for the Population Distribution:
µ = (Σ Xi) / N = (18 + 20 + 22 + 24) / 4 = 21
σ = √( Σ (Xi − µ)² / N ) = 2.236
[Figure: bar chart of the population distribution, P(x) = .25 for each of x = 18, 20, 22, 24 (units A, B, C, D)]
Note that the distribution of X in the population is NOT normal (but uniform)!!!
Building a sampling distribution
Now consider all possible samples of size n = 2
Sampling method: simple random sampling
[Figure: population units A, B, C, D]
Note that sampling is with replacement
Building a sampling distribution (cont’d)
Consider all possible sample means and their probabilities (relative frequencies)
How many “18” do we have? Just 1 out of 16. So, the probability
that one sample of size 2 will give a sample mean of “18” is 1/16 = 0.0625. And so on…
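The enumeration can be sketched in a few lines of Python. The ages 18, 20, 22, 24 are the population values from the example; the variable names are assumptions. `itertools.product` lists every ordered sample of size 2 drawn with replacement.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

ages = [18, 20, 22, 24]   # population values of X
n = 2                     # sample size

# All ordered samples of size n with replacement: 4 * 4 = 16 of them.
samples = list(product(ages, repeat=n))

# Sample mean of each sample, kept as an exact fraction.
means = [Fraction(sum(s), n) for s in samples]

# Relative frequency of each distinct sample mean.
counts = Counter(means)
sampling_dist = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

for m, p in sampling_dist.items():
    print(float(m), p)
```

The printed list is the sampling distribution of the sample mean: each possible value together with its probability.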
Note: the sampling distribution of all sample means
Sample mean distribution shape
For random sampling with “large” n, the sampling distribution of the sample mean follows approximately a normal distribution
(central limit theorem)
Check it with this applet:
http://www.prenhall.com/agresti/applet_files/samplingdist.html
Approximate normality applies no matter what the shape of the
population distribution
Usually “large” means n>30
If X is normally distributed then the sampling distribution of the sample mean is also normal
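A small simulation can make the central limit theorem tangible. The sketch below is an illustration under stated assumptions: the population is exponential with mean 1 and standard deviation 1 (a strongly skewed shape, chosen here and not taken from the slides), the sample size is 50, and 5000 samples are drawn with a fixed seed.

```python
import math
import random

rng = random.Random(0)   # fixed seed for reproducibility
n = 50                   # sample size ("large", n > 30)
replicates = 5000        # number of samples drawn

# Sample mean of each replicate drawn from the skewed population.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(replicates)
]

# Centre and spread of the simulated sampling distribution.
center = sum(sample_means) / replicates
spread = math.sqrt(sum((m - center) ** 2 for m in sample_means) / replicates)

# Share of sample means within one standard deviation of the centre;
# for a normal distribution this share is about 0.68.
share = sum(abs(m - center) <= spread for m in sample_means) / replicates

print(round(center, 3), round(spread, 3), round(share, 3))
```

Even though the population is far from normal, the sample means cluster symmetrically around the population mean, with a spread close to 1/√50.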
Summary measures of the Sampling Distribution
Consider the average and standard deviation of the 16 values of our sample statistic (sample mean):
E(X̄) = (Σ X̄j) / 16 = (18 + 19 + … + 24) / 16 = 21 = µ, with the sum running over the j = 1, …, 16 samples
Mmm the average of all sample averages is equal to the TRUE population average. Can this be useful?
The standard deviation of the distribution of sample means measures the variability of sample means across the possible samples. It is called the standard error of the mean
The Sample Mean estimator
The sample mean estimator is:
X̄ = (1/n) Σ Xi, with the sum running over i = 1, …, n
It can be proved that
E(X̄) = µ
holds in general
We refer to this property of the sample mean saying that it is an unbiased estimator of the population mean
It can be proved that the median estimator is also unbiased (for symmetric populations such as the normal) but its variance is higher. We say that the sample mean estimator is more efficient. Check this out with this applet:
http://www.prenhall.com/agresti/applet_files/samplingdist.html
Estimators and their properties: definitions
An estimator, in general, is defined as a random variable that depends on sample information. Its value provides an approximation to the unknown parameter in the population.
A single value of an estimator is called a point estimate
Good properties of estimators are:
• Unbiasedness: its expected value is equal to the true
population parameter
• Efficiency: low variability from a sample to another (with
Standard error of the mean
Measures the variability in the sample mean from sample to sample.
It depends on the standard deviation of the variable in the population and the sample size:
σ_X̄ = σ / √n
(Note: this formula holds if the population has infinite size or if sampling is with replacement. Otherwise a correction factor should be applied. We will not go into further details.)
It decreases as the sample size increases (as n increases uncertainty decreases)
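As a check, the formula can be compared with a brute-force calculation on the small age population used earlier (18, 20, 22, 24), sampling with replacement. This is a sketch; the helper names `se_formula` and `se_enumerated` are hypothetical.

```python
import math
from itertools import product
from statistics import pstdev

ages = [18, 20, 22, 24]
sigma = pstdev(ages)   # population standard deviation (divides by N)

def se_formula(sigma, n):
    """Standard error of the mean from the formula sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def se_enumerated(values, n):
    """Standard deviation of the sample mean over all samples of size n
    drawn with replacement."""
    means = [sum(s) / n for s in product(values, repeat=n)]
    return pstdev(means)

for n in (1, 2, 3, 4):
    print(n, round(se_formula(sigma, n), 3), round(se_enumerated(ages, n), 3))
```

The two columns agree for every n, and both shrink as n grows.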
Summing-up
The sample mean estimator has expected value equal to the population mean:
E(X̄) = µ
and standard deviation (standard error) equal to the population standard deviation divided by the square root of the sample size:
σ_X̄ = σ / √n
Its distribution is normal if X is normally distributed (or approximately normal even when X is not normally distributed but the sample size is large).
Review exercise
You want to understand how your flat mates see you. In particular, you want to understand how dirty they think you are. You use the following question to operationalise your flat mates’ perception of your “dirtiness”:
“On a scale from 0 to 100 (with 0 meaning extremely clean and 100 meaning extremely dirty), how dirty do you think I am?”
Let us assume that you have 3 flat mates, Ana, Joan and Xavi, and that their answers to such a question would be, respectively, 90, 50, 70.
1. What is the population? What is the variable of interest? What is the parameter of interest?
2. Imagine you are considering drawing a sample of size 2 because you do not want to waste your time asking all of them the same question (or maybe you want to practice with what you’re learning in this course ☺). If you want to estimate the average level of perceived dirtiness, what estimator would you use? Why?
Review exercise (cont’d)
1. The population in this case is the group of your flat mates and has three members (units): Ana, Joan and Xavi. The variable of interest is the level of your dirtiness as perceived by your flat mates and takes these values: X={90, 50, 70}. The parameter of interest is the average of X:
µ = (90 + 50 + 70) / 3 = 70
2. To estimate the population average of X we consider the sample mean
X̄ = (1/2) Σ Xi, with the sum running over i = 1, 2
because it is an unbiased and efficient estimator.
3. From what we saw in previous slides we know that:
E(X̄) = µ = 70
σ_X̄ = σ / √n
Review exercise (cont’d)
3. To calculate the standard error of the mean we need the population standard deviation.
σ = √( [(90 − 70)² + (50 − 70)² + (70 − 70)²] / 3 ) = 16.33
σ_X̄ = σ / √n = 16.33 / √2 = 11.55
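The numbers in this exercise can be verified quickly. The sketch below uses `statistics.pstdev`, which divides by N and so matches the population standard deviation used in the slides; the variable names are assumptions.

```python
import math
from statistics import mean, pstdev

# Answers of the three flat mates (the whole population).
answers = {"Ana": 90, "Joan": 50, "Xavi": 70}
x = list(answers.values())

mu = mean(x)                # population mean
sigma = pstdev(x)           # population standard deviation (divides by N)
n = 2                       # sample size
se = sigma / math.sqrt(n)   # standard error of the sample mean

print(mu, round(sigma, 2), round(se, 2))
```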
Review exercise (cont’d)
3. To build the sampling distribution we have to consider all possible samples of size 2 that we can draw from the population of your flat mates.

Samples
1st obs    2nd observation
           90        50        70
90         90; 90    90; 50    90; 70
50         50; 90    50; 50    50; 70
70         70; 90    70; 50    70; 70

Sample means
1st obs    2nd observation
           90    50    70
90         90    70    80
50         70    50    60
70         80    60    70

[Figure: bar chart of the sampling distribution P(X̄) over the sample means 50, 60, 70, 80, 90; vertical axis from 0 to .3]
Did you survive???
If something is not clear
(or you find mistakes in the slides)