Introduction: Inferential Statistics
[Figure: a population (size N) and a sample (size n), linked by arrows labeled Probability and Inference (parameter estimation, hypothesis testing). Sample statistics shown: x̄, S², p̂. Sample requirements: representative (sampling rate), sample size.]
• Which variables will we collect and how will they be measured?
• What types of data do they represent?
• What do we want to do? Describe, compare groups, predict change in “Y” following change in “X”?
Objective of the Chapter
To develop the methodology of confidence intervals as a technique for analyzing differences and making decisions, and to determine the risks involved in making such decisions when we rely solely on information from the sample.
Inferential Statistics
Inferential statistics: the methods used to estimate a property of a population on the basis of a sample.
Statistical inference involves two main types of techniques: parameter estimation and hypothesis testing. Whatever the technique used, the overall purpose is to use data from a probability sample to draw conclusions about a population.
• Data are observations of the variable being measured.
• A population consists of all subjects or objects about which the study is being conducted.
In statistical inference, one wishes to estimate population parameters using observed sample data. A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. (Definition taken from Valerie J. Easton and John H. McColl's )
Parameter Estimation: Confidence interval
A 95% confidence interval is often interpreted as indicating a range within which we can be 95% certain that the true effect lies. The strictly-correct interpretation of a confidence interval is based on the hypothetical notion of considering the results that would be obtained if the study were repeated many times.
The confidence level tells you how sure you can be. It is expressed as a percentage and represents how often the true percentage of the population who would pick an answer lies within the confidence interval. The 95% confidence level means you can be 95% certain; the 99% confidence level means you can be 99% certain. Most researchers use the 95% confidence level.
Confidence Interval Data Requirements
To express a confidence interval, you need three pieces of information: the statistic, the margin of error, and the confidence level.
Given these inputs, the range of the confidence interval is defined by: sample statistic ± margin of error. The uncertainty associated with the confidence interval is specified by the confidence level.
Often, the margin of error is not given; you must calculate it. There are three steps to constructing a confidence interval:
1. Identify a sample statistic. Choose the statistic (e.g. sample mean, sample proportion) that you will use to estimate a population parameter.
2. Find the margin of error. The margin of error (maximum error of estimate) is the maximum likely difference between the point estimate of a parameter and the actual value of the parameter. When it is not given, compute it from the following equation: Margin of error = Critical value × Standard error of the mean.
3. Select a confidence level. The confidence level is the probability value associated with a confidence interval, often expressed as a percentage. For example, if α = 0.05, the confidence level is (1 − 0.05) = 0.95, i.e. a 95% confidence level. Researchers often choose a 90%, 95%, or 99% confidence level.
A confidence interval for a mean specifies a range of values within which the unknown population parameter, in this case the mean, may lie. For example, suppose you wanted to estimate the average IQ of a community and had randomly selected a sample of 200 residents.
Confidence Interval for a Mean When the Standard Deviation (σ) Is Unknown
$$\text{Standard error of the mean} = \frac{s}{\sqrt{n}}$$
$$\bar{x} - t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{x} + t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$$
Where (t depends on the level of confidence):
$\bar{x}$ = the sample mean
$t_{\alpha/2,\,n-1}$ = the critical value of the t distribution, determined by the alpha level, with n − 1 degrees of freedom
$s/\sqrt{n}$ = the standard error of the mean
$(1-\alpha)\cdot 100\%$ = the confidence level, i.e. the confidence coefficient expressed as a percentage
$t_{\alpha/2,\,n-1}\,s/\sqrt{n}$ = the margin of error
Application Example
Rongin Kwizera is the host of the KXYZ Radio 55 AM drive-time news in Kigali. During his morning program, Rongin asks listeners to call in and discuss current local and national news. This morning, Rongin was concerned with the number of hours that children under 12 years of age watch TV per day. The last 5 callers reported that their children watched the following numbers of hours of TV the previous day.
Data from the sample of callers:
1.0, 2.3, 3.1, 0.30, 3.5
Would it be reasonable to develop a confidence interval from these data to show the mean number of hours of TV watched?
If yes, construct an appropriate confidence interval at the 95% confidence level and interpret the result.
If no, why would a confidence interval not be appropriate?
Answer:
For these data the mean is 2.04, the standard deviation is 1.36, and n = 5. With n − 1 = 4 degrees of freedom the critical value is t = 2.776, so the computation using the formula is:
$$P\left(2.04 - 2.776\,\frac{1.36}{\sqrt{5}} \;\le\; \mu \;\le\; 2.04 + 2.776\,\frac{1.36}{\sqrt{5}}\right) = 0.95$$
$$0.35 \le \mu \le 3.73$$
Note: this example requires that the variable be normally distributed.
Interpretation: With 95% confidence, the mean time that children under 12 years old watch TV is between 0.35 and 3.73 hours per day.
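The arithmetic of this example can be checked with a few lines of Python. This is only a sketch using the standard library; the critical value 2.776 is copied from a t table (as in the example) rather than computed, and the variable names are illustrative.

```python
import math
import statistics

# Hours of TV reported by the five callers (data from the example).
hours = [1.0, 2.3, 3.1, 0.30, 3.5]

n = len(hours)
mean = statistics.mean(hours)
sd = statistics.stdev(hours)        # sample standard deviation (divides by n - 1)
std_error = sd / math.sqrt(n)       # standard error of the mean

# Critical t value for 95% confidence and n - 1 = 4 degrees of freedom,
# taken from a t table (assumption: not computed here).
t_crit = 2.776

margin = t_crit * std_error
lower, upper = mean - margin, mean + margin

print(f"mean = {mean:.2f}, sd = {sd:.2f}, standard error = {std_error:.4f}")
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")
```

The rounded limits agree with the interval 0.35 to 3.73 reported in the example.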
Steps in SPSS for Confidence Interval for One Sample t Test
Analyze > Compare Means > One-Sample T Test > double-click Hours to move it to Test Variable(s) > click OK. The output is displayed in the Output window.
Output of SPSS for Confidence Interval for One Sample t Test
One-Sample Statistics
| Variable | N | Mean | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|
| Number of hours children under 12 years watches TV per day | 5 | 2.0400 | 1.36308 | .60959 |
One-Sample Test (Test Value = 0)
| Variable | t | df | Sig. (2-tailed) | Mean Difference | 95% Confidence Interval of the Difference: Lower | Upper |
|---|---|---|---|---|---|---|
| Number of hours children under 12 years watches TV per day | 3.347 | 4 | .029 | 2.04000 | .3475 | 3.7325 |
With 95% confidence, the mean time that children under 12 years old watch TV is between 0.35 and 3.73 hours per day.
The average time that children under 12 years old watch TV is 2.04 hours per day, and the variability (standard deviation) is 1.36 hours.
Probability and Non-Probability Sampling
Probability sampling is a sampling technique in which every subject of the population has a known, nonzero chance of being selected for a representative sample.
All samples contain some error.
[Figure: Example. Mean ages of 47 and 48.5 are shown for a random sample and the population; the difference of 1.5 is the estimation error.]
Planning
• Which variables will we collect and how will they be measured?
• What types of data do they represent?
• What do I want to do? Describe, compare groups, predict change in Y following change in X?
Why use a sample?
Sampling is important because it is impossible to (observe, interview, survey, etc.) an entire population. ... In the context of research, sampling is the method one uses to gather and select, to sample, data. A data sample is a set of data collected from a statistical population by a defined procedure.
Lower cost
Less field time
Speed
Accuracy
Destruction of test units
When it’s impossible to study the whole population
Sample Accuracy
• Sample accuracy: refers to how close a random sample’s statistic (e.g. mean ($\bar{x}$), variance ($s^2$), proportion ($\hat{p}$)) is to the population value it represents (mean (µ), variance (σ²), proportion (π or P)).
• Important points:
  • Sample size is NOT related to representativeness: you could sample 20,000 persons walking by a street corner and the results would still not represent the city; however, an “n” of 100 could be “right on.”
  • Sample size, however, is related to accuracy. How close the sample statistic is to the actual population parameter (e.g. sample mean vs. population mean) is a function of sample size.
Sampling design: Steps
1. Definition of target population. The target population is the total population for which the information is required. Who has the information/data you need? How do you define your target population?
2. Selection of a sampling frame (list)
• List of elements
• Sampling frame error: error that occurs when certain sample elements are not listed or available and are not represented in the sampling frame
3. Probability or non-probability sampling
• Probability sample: a sampling technique in which every member of the population has a known, nonzero probability of being selected.
• Non-probability sample: units of the sample are chosen on the basis of personal judgment or convenience.
4. Sampling unit. It is necessary to decide on a sampling unit before selecting a sample. It can be a geographical one (state, district, village, etc.), a construction unit (house, flat, etc.), a social unit (family, club, school, etc.), or an individual.
5. Error
– Random sampling error (chance fluctuations)
– Nonsampling error (design errors)
6. Sample type. The researcher decides the techniques to be used in selecting the items for the sample according to the characteristics of the population.
7. Determination of levels of inference. The level of uncertainty will also be determined by the sample size. Increasing the sample size will decrease the sampling error.
Classification of Sampling Methods
Probability samples: Simple Random Sampling, Systematic Sampling, Stratified Sampling, Cluster Sampling
Non-probability samples: Convenience, Judgment, Quota, Snowball
Simple Random Sampling
A method for choosing cases from a population in which every case and every combination of cases has an equal chance of being included; it is best suited to a homogeneous population. To use this technique, we need a list of all elements. Cases are often selected using tables of random numbers. These tables are lists of numbers that have no pattern (that is, they are random); an example of such a table is shown on the next slide.
Steps:
1. List the cases to be randomized in a column on a spreadsheet and label each item in the population with a number. For consistency, these labels should all have the same number of digits. So if we have 100 elements in our population, we can use the two-digit labels 01, 02, 03, . . ., 98, 99, 00. In general, use the smallest number of digits that still gives every element its own label.
2. Select cases using the random number table. Start anywhere on the table and move in any direction (up, down, across, diagonally), reading numbers with the same number of digits as the labels; each number read identifies a case for the sample.
Simple Random Sample: Example
Since there are a total of 48 students, and 48 is a two-digit number, every individual in the population is assigned a two-digit number: 01, 02, 03, . . ., 46, 47, 48.
Answer: The random number table gives us the identification numbers of the students that we must take. In our example, according to the random table, these are students 20, 27, 29, 39 and 40. The sample size is 5, and the students are: Hellene H., Lucille L., Gilles D., Flienne M. and Alain M.
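In practice, a pseudo-random number generator can stand in for the printed random number table. The sketch below assumes a class of 48 students labelled 01 to 48, as in the example; the seed is arbitrary and the labels drawn will generally differ from those read off the table.

```python
import random

# Assumed population: 48 students labelled with two-digit numbers 01..48.
population = [f"{i:02d}" for i in range(1, 49)]

# Arbitrary seed, used only so that the draw can be repeated.
rng = random.Random(2014)

# Draw 5 distinct labels; every subset of 5 students is equally likely.
sample = rng.sample(population, k=5)

print(sorted(sample))
```

Sampling is done without replacement, so no student can be selected twice.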
Simple Random Sampling
Advantages
Minimal knowledge of population needed
External validity high; internal validity high; statistical estimation of error
Easy to analyze data
Disadvantages
High cost; low frequency of use
Requires sampling frame
Does not use researchers’ expertise
Systematic Sampling
For a systematic sample, the population must be homogeneous and ordered. This involves first listing, in serial order, all the events, persons, objects or things in the whole population. The population size (N) is then divided by the sample size (n) to get the sampling interval K. An initial starting point is selected by a random process, and then every Kth element on the list is selected; once the starting point is decided, all other elements are determined automatically.
Sampling interval: K = Population Size / Sample Size = N / n
Probability of selection = n / N = 1 / K
Example: assuming you have a population of 1,500 people and your sample
size is 100. Then Kth position will be given by N/n = 1500/100 = 15. It means that every 15th position or interval is automatically selected as part of the
sample.
Thus, numbers 15, 30, 45, etc. are already selected. You can even select
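A minimal sketch of the procedure, assuming the population is already listed in order and numbered from 1 to N; the function name `systematic_sample` is illustrative, and the starting point is drawn at random within the first interval, as described above.

```python
import random

def systematic_sample(population_size, sample_size, rng=random):
    """Return the 1-based list positions chosen by systematic sampling."""
    k = population_size // sample_size      # sampling interval K = N / n
    start = rng.randint(1, k)               # random start within the first interval
    return [start + i * k for i in range(sample_size)]

# N = 1500 and n = 100 give K = 15, as in the example.
positions = systematic_sample(1500, 100, random.Random(1))
print(positions[:5], "...", positions[-1])
```

With a start of 15 the selected positions are 15, 30, 45 and so on; any other start between 1 and 15 shifts the whole sequence by the same amount.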
Systematic Sampling
Advantages
Moderate cost; moderate usage
External validity high; internal validity high; statistical estimation of error
Simple to draw sample; easy to verify
Disadvantages
Periodic ordering
Requires sampling frame
Stratified Sampling
Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata". Each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected; each stratum should be homogeneous.
Benefits of stratified sampling
It is useful when the population shows differences in gender, occupation, income, socio-economic status, geographical location, qualifications, age, height, color, dialects, etc.
Stratified sampling is appropriate when the population consists of a number of sub-groups which are homogeneous or contain members that share common characteristics, which need to be represented in the sample. Randomization is then used to select members from the subgroups in such a way that the proportion of each sub-group in the population is reflected in the sample.
Stratified Sampling
In a stratified sampling design, the first step is to choose the attribute on which to stratify. The second is to determine how many values that attribute takes in the population and, therefore, into how many groups or strata the population is divided (the following figure shows a stratified sampling design with 4 strata, L = 4). Once the subgroups are determined, the next step is to find the total population belonging to each stratum (N1, N2, N3, N4). Finally, we take a random sample from each stratum (n1, n2, n3, n4). The sum of the subsamples constitutes our total sample (n1 + n2 + n3 + n4 = n).
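When the sample must reflect each stratum's share of the population, the subsample sizes can be obtained by proportional allocation. The following is a sketch only: the stratum sizes are hypothetical, the function name is illustrative, and the rounding rule (largest remainders first) is one reasonable choice, not one prescribed by the slides.

```python
def proportional_allocation(strata_sizes, total_sample):
    """Split total_sample across strata in proportion to each stratum's size."""
    population = sum(strata_sizes.values())
    exact = {name: total_sample * size / population
             for name, size in strata_sizes.items()}
    # Round every share down first, then hand the leftover units
    # to the strata with the largest fractional parts.
    allocation = {name: int(share) for name, share in exact.items()}
    leftover = total_sample - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - allocation[s], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation

# Hypothetical strata sizes N1..N4 (L = 4) and a total sample of n = 50.
strata = {"stratum_1": 400, "stratum_2": 300, "stratum_3": 200, "stratum_4": 100}
allocation = proportional_allocation(strata, 50)
print(allocation)
```

The subsample sizes always add up to the requested total, so n1 + n2 + n3 + n4 = n holds after rounding. A simple random sample of the allocated size is then drawn within each stratum.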
Cluster Sampling
The primary sampling unit is not the individual element, but a large cluster of elements. Either the cluster is randomly selected or the elements within are randomly selected.
If the population is large or the area is widespread, the researcher may decide to zone the area reflecting these characteristics and then take random samples from each of the identified zones.
Cluster Sampling - example
Sometimes it is more cost-effective to select respondents in groups ('clusters'). Sampling is often clustered by geography or by time periods. (Nearly all samples are in some sense 'clustered' in time.) For example, if surveying households within a city, we might choose to select 100 city blocks and then interview every household within the selected blocks.
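The city-block example can be sketched as a one-stage cluster sample. Everything about the city below is made up for illustration (1,000 blocks with 5 to 20 households each); only the design, selecting 100 blocks at random and taking every household in them, comes from the text.

```python
import random

rng = random.Random(7)   # arbitrary seed so the sketch is repeatable

# Hypothetical sampling frame: block id -> list of household ids.
blocks = {
    block: [f"block{block}-household{h}" for h in range(rng.randint(5, 20))]
    for block in range(1000)
}

# Stage 1: select 100 clusters (city blocks) at random.
chosen_blocks = rng.sample(sorted(blocks), k=100)

# Stage 2: interview every household within the selected blocks.
households = [hh for block in chosen_blocks for hh in blocks[block]]

print(len(chosen_blocks), "blocks,", len(households), "households")
```

Only a list of blocks is needed up front; households have to be listed only inside the blocks that were chosen.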
Cluster Sampling
Advantages
Low cost/high frequency of use
Requires list of all clusters, but only of individuals within chosen clusters
Can estimate characteristics of both cluster and population
For multistage, has strengths of used methods
Disadvantages
Larger error for comparable size than other probability methods
Multistage very expensive and validity depends
Non-Probability Sampling Methods
As they are not truly representative, non-probability samples are less desirable than probability samples. However, a researcher may not be able to obtain a random or stratified sample, or it may be too expensive. A researcher may not care about generalizing to a larger population. The validity of non-probability samples can be increased by trying to approximate random selection, and by eliminating as many sources of bias as possible.
Convenience Sampling
Convenience sampling, or accidental sampling, is a type of non-probability sampling in which the sample is drawn from the part of the population that is close to hand. That is, a sample is selected because it is readily available and convenient. Respondents may be included when the researcher happens to meet them, or found through technological means such as the internet or the phone. A researcher using such a sample cannot scientifically generalize to the total population, because the sample would not be representative enough. For example, if an interviewer conducted such a survey at a shopping center early in the morning on a given day, the people available for interview would be limited to those present at that time, who would not represent the views of other members of society in that area.
Judgment or Purposive Sample
The sampling procedure in which an experienced researcher selects the sample based on some appropriate characteristic of sample members, to serve a purpose. This is necessary when the researcher is interested in certain specified characteristics. It ensures that only those who meet the required purpose, attributes or characteristics are selected.
Quota Sample
The sampling procedure that ensures that a certain characteristic of a population sample will be represented to the exact extent that the investigator desires.
Advantages: moderate cost; very extensively used and understood; no need for a list of population elements; introduces some elements of stratification.
Disadvantages: variability and bias cannot be measured or controlled (classification of subjects), projecting
Snowball Sampling
The sampling procedure in which the initial respondents are chosen by non-probability methods, and then additional respondents are obtained from information provided by the initial respondents.
This is used when every member of the population cannot comply with the demands of the
investigation. Therefore, these individuals who are willing to comply with the demands of the
investigations are used. These are the volunteers who are willing and ready to cooperate with the
Sample Size
To properly understand how to determine sample size, it helps to understand the following AXIOMS:
• The only perfectly accurate sample is a census.
• A probability sample will always have some inaccuracy (sample error).
• The larger a probability sample is, the more accurate it is (less sample error).
• You can take any finding in the survey, replicate the survey with the same probability sample plan and size, and you will be “very likely” to find the same result within the ± range of the original findings.
• In almost all cases, the accuracy (sample error) of a probability sample is independent of the size of the population.
• A probability sample can be a very tiny percentage of the population size and still be very accurate (have little sample error).
• The size of the probability sample depends on the researcher’s desired accuracy (acceptable sample error) balanced against the cost of data collection for that sample size.
Determining Sample Size
There is only one method of determining sample size that allows the researcher to PREDETERMINE the accuracy of the sample results: the Confidence Interval Method of Determining Sample Size.
• Confidence interval: range whose endpoints define a certain percentage of the responses to a question.
• Confidence interval approach: applies the concepts of accuracy, variability, and confidence interval to create a “correct” sample size.
• Central limit theorem: a theory that holds that values taken from repeated samples of a survey within a population would look like a normal curve. The mean of all sample means is the mean of the population.
• Two types of error:
  • Nonsampling error: pertains to all sources of error other than sample selection method and sample size.
  • Sampling error: involves sample selection and sample size; this is the error that we are controlling through formulas.
• Sample error formula:
$$\pm\,\text{Sample error}\ (\%) = z\,\sqrt{\frac{p\,q}{n}}$$
where z is the critical value for the chosen confidence level (e.g. 1.96 for 95%), and p and q are expressed as percentages.
Determining Sample Size
• Variability: refers to how similar or dissimilar responses are to a given question.
• P (%): share that “have” or “are” or “will do” etc.
• Q (%): 100% − P%, share of “have nots” or “are nots” or “won’t dos” etc.
N.B.: The more variability in the population being studied, the larger the sample size needed to achieve a stated accuracy level.
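The effect of sample size and variability on sample error can be seen numerically. This sketch assumes the usual form of the sample error formula, z·√(pq/n), with p and q in percent and z = 1.96 for 95% confidence; the function name is illustrative.

```python
import math

def sample_error(p, n, z=1.96):
    """Sample error in percentage points for a share p (in percent) and sample size n."""
    q = 100 - p
    return z * math.sqrt(p * q / n)

# Worst-case variability (p = q = 50%) for increasing sample sizes.
for n in (100, 400, 1000):
    print(n, round(sample_error(50, n), 2))

# Less variability (p = 20%, q = 80%) gives a smaller error at the same n.
print(400, round(sample_error(20, 400), 2))
```

Quadrupling the sample size from 100 to 400 only halves the error, which is why the desired accuracy has to be balanced against the cost of data collection.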
What data do you need to consider
• Variance or heterogeneity of population: previous studies? Industry expectations? Pilot study? Sequential sampling
• Expect the worst case (p = 50%; q = 50%)
• Estimate variability: results of previous studies or conduct a pilot study
• Confidence level
With Nominal data (i.e. Yes, No), we can
The formula requires that we (a) specify the amount of confidence we wish to have, (b) estimate the variance in the population, and (c) specify the level of desired accuracy.
Sample Size Formula
• The sample size formula for estimating a proportion (also called a percentage or share):
Proportion, finite population:
$$n = \frac{N\,z^2\,p\,q}{e^2\,(N-1) + z^2\,p\,q}$$
Proportion, infinite population:
$$n = \frac{z^2\,p\,q}{e^2}$$
| Quantity | Symbol |
|---|---|
| Sample size | n |
| Population size | N |
| Confidence | 1-α |
| Critical value, the positive Z (e.g. 1.96 for 95% confidence level) | z |
| Expected proportion in population, based on previous studies or pilot studies | p |
| 1-p | q |
| Absolute error or precision; has to be decided by the researcher | e |
Determining the necessary sample size for estimating a single population mean or a single population total with a specified level of precision:
Z at 90% confidence = 1.64
Z at 95% confidence = 1.96
Z at 99% confidence = 2.58
Example
We would like to find the satisfaction level of the students at AUCA. The Registration office has 1675 students registered in the 2014-II semester. We need to take a sample with 95% confidence and a 5% error of estimation.
| Quantity | Symbol | Value |
|---|---|---|
| Sample size | n | ? |
| Population size | N | 1675 |
| Confidence | 1-α | 0.95 |
| Critical value associated with the chosen level of confidence | Z = Z(1-α/2) | 1.96 |
| Estimated percent in the population | p | 0.5 |
| 1-p | q | 0.5 |
$$n_0 = \frac{N\,z^2\,p\,q}{e^2\,(N-1) + z^2\,p\,q} = \frac{1675 \times 1.96^2 \times 0.50 \times 0.50}{0.05^2 \times (1675 - 1) + 1.96^2 \times 0.50 \times 0.50} = 312.64 \approx 313 \text{ students}$$
If $n_0/N > 0.10$, the sample size is adjusted. Here $n_0/N = 313/1675 = 0.1869 > 0.10$, so:
$$n = \frac{n_0}{1 + \dfrac{n_0}{N}} = \frac{313}{1 + \dfrac{313}{1675}} = 263.72 \approx 264 \text{ students}$$
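The computation for this example can be reproduced in Python. The sketch below assumes the finite- and infinite-population formulas for a proportion, n = N·z²·p·q / (e²(N−1) + z²·p·q) and n = z²·p·q / e², followed by the adjustment n = n₀ / (1 + n₀/N) when n₀/N exceeds 0.10; the function name is illustrative.

```python
import math

def sample_size_proportion(p, e, z=1.96, population=None):
    """Sample size for estimating a proportion p with absolute error e."""
    q = 1 - p
    if population is None:
        # Infinite population.
        return z**2 * p * q / e**2
    # Finite population of the given size N.
    return population * z**2 * p * q / (e**2 * (population - 1) + z**2 * p * q)

N = 1675
n0 = sample_size_proportion(p=0.5, e=0.05, z=1.96, population=N)
n0_rounded = math.ceil(n0)                 # always round up to whole students

# Adjustment applied in the example when n0 / N > 0.10.
adjusted = n0_rounded / (1 + n0_rounded / N) if n0_rounded / N > 0.10 else n0_rounded

print(round(n0, 2), n0_rounded)
print(round(adjusted, 2), math.ceil(adjusted))
```

Sample sizes are rounded up rather than to the nearest integer, so that the achieved error does not exceed the stated 5%.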
Sample Size Formula for Numeric Data
Mean, infinite population:
$$n = \frac{z^2\,\sigma^2}{e^2}$$
Mean, finite population:
$$n = \frac{N\,z^2\,\sigma^2}{e^2\,(N-1) + z^2\,\sigma^2}$$
z is determined the same way (1.96 or 2.58).
“e” is expressed in terms of the units we are estimating, i.e. if we are measuring attitudes on a 1-7 scale, we may want our error to be no more than ± .5 scale units. If we are estimating dollars being paid for a product, we may want our error to be no more than ± $3.00.
σ is a little more difficult to estimate, but must be in the same units as “e”.
Estimating “σ” in the Formula to Determine the Sample Size Required to Estimate a Mean
Since we are estimating a mean, we can assume that our data are either interval or ratio. When we have interval or ratio data, the standard deviation of the sample, s, may be used as a measure of variability.
How to estimate σ:
• Use the standard deviation of the sample from a previous study on the target population
• Conduct a pilot study of a few members of the target population and calculate “s”
Example
Suppose you are planning to sample transportation employees to determine average annual sick days. The following standards have been set: a confidence level of 99% and an error of fewer than 2 days. Past research has indicated the standard deviation should be 6 days. What is the required sample size?
| Quantity | Symbol | Value |
|---|---|---|
| Sample size | n | ? |
| Confidence | 1-α | 0.99 |
| Critical value associated with the chosen level of confidence | Z = Z(1-α/2) | 2.58 |
| Standard deviation | σ | 6 |
Solution:
$$n = \frac{z^2\,\sigma^2}{e^2} = \frac{2.58^2 \times 6^2}{2^2} = 59.9076 \approx 60 \text{ employees}$$
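The same calculation as a short Python sketch, assuming the formulas n = z²σ²/e² (infinite population) and n = N·z²σ² / (e²(N−1) + z²σ²) (finite population); the function name is illustrative.

```python
import math

def sample_size_mean(sigma, e, z, population=None):
    """Sample size for estimating a mean with standard deviation sigma and error e."""
    if population is None:
        # Infinite population.
        return (z * sigma / e) ** 2
    # Finite population of the given size N.
    return population * z**2 * sigma**2 / (e**2 * (population - 1) + z**2 * sigma**2)

# Sick-days example: 99% confidence (z = 2.58), sigma = 6 days, e = 2 days.
n = sample_size_mean(sigma=6, e=2, z=2.58)
print(round(n, 4), math.ceil(n))
```

Note that sigma and e must be in the same units (days here) for the ratio to make sense.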
Assignment 2
1. The daily salaries of substitute teachers for eight local school districts are shown. What is the point estimate for the mean? Find the 90% confidence interval of the mean for the salaries of substitute teachers in the region.
60 56 60 55 70 55 60 55 (Source: Pittsburgh Tribune Review)
Answer: <55.47, 62.28>
2. The number of unhealthy days based on the AQI (Air Quality Index) for a random sample of metropolitan areas is shown. Construct a 98% confidence interval based on the data.
61 12 6 40 27 38 93 5 13 40 (Source: N.Y. Times Almanac)
Answer: <8.81, 58.19>
3. We’ve just started a new educational TV program that teaches viewers all about research methods!
• We know from past educational TV programs that such a program would likely capture 2 out of 10 viewers on a typical night.
• Let’s say we want to be 95% confident that our obtained sample proportion of viewers will differ from the true population proportion by not more than 5%.
• What sample size do we need?
Answer: n = 245.86 or n = 246 viewers
4. A survey estimated that 20% of all Americans aged 16 to 20 drove under the influence of drugs or alcohol. A similar survey is planned for New Zealand. They want a 95% confidence interval to have a margin of error of 0.04. (a) Find the necessary sample size if they expect to find results similar to those in the United States. (b) Suppose instead they used the conservative formula based on p̂ = 0.5. What is now the required sample size?
Answer: a. n = 384, b. n = 600
5. We would like to start an ISP and need to estimate the average Internet usage of households in one week for our business plan and model. How many households must we randomly select to be 95 percent sure that the sample mean is within 1 minute of the population mean? Assume that a previous survey of household usage has shown σ = 6.95 minutes.
Answer: n = 186 households
Assignment 2 (continued)
7. A firm employs staff in the three categories listed below:
28 managers
44 secretaries
304 production workers
You take a sample from each stratum, with a 95% confidence level and a maximum sampling error of 7%.
a. How many managers, secretaries, and production workers should the sample contain?
Answer: (Managers = 10, Secretaries = 15, Production Workers = 104).
b. What type of sampling would you recommend to obtain representativeness (probabilistic or non-probabilistic)? Specify the method you would use according to the type of sampling recommended (for example: simple random, systematic, etc.). Then briefly describe the procedure for taking the sample.
8. For the same example as before, last year’s proportion of ATM withdrawals over $50 was 27 percent. If we used this estimate in our calculation, what would the required sample size be?
Answer: n = 1892.948
9. Alternative for the example before. Suppose that our research budget will not permit a large sample. In the previous example, we could reduce the confidence level from 95 to 90 percent and increase the maximum error to ± 4 percent. Assuming a proportion p = .50, how large a sample is needed?
Answer: n = 420.65
10. The Ministry of Economy places each of its employees on one of four salary scales, A, B, C, and D. The number of employees on each scale is listed below. How many employees from each scale should be included in the sample, with a 99% confidence level and a maximum sampling error of 1%?
A 1000 B 875 C 245 D 120