Confidence Interval and Sampling
Quantitative Techniques and Simulation
Chapter IV
Inferential Statistics
Inferential statistical methods are a way to draw conclusions about a
population from the data obtained from a probability sample.
Statistical inference involves two main types of techniques: parameter estimation and hypothesis testing. Whatever the technique used, the overall purpose is to use data from a probability sample to draw conclusions about a population.
• A data point is one observation about the variable being measured.
• A population consists of all subjects or objects about whom the study is being conducted.
Objective of the Chapter
Introduction: Inferential Statistics
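[Figure: a sample of size n is drawn from a population of size N by probability sampling; inference (parameter estimation and hypothesis testing) goes from the sample statistics ($\bar{x}$, $s^2$, $\hat{p}$) back to the population parameters. Other labels: Representative (sampling rate), Sample size.]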
• Which variables will we collect and how will they be measured?
• What types of data do they represent?
• What do we want to do? Describe, compare groups, predict change in Y following change in X?
Parameter Estimation
Estimation takes two forms:
- Point estimates
- Interval estimates
Estimation techniques are used when the researcher has no prior
assumptions about the value of a population characteristic and wants to
know what that value might be.
Statistical inference contains two types of procedures regarding
population parameters, both made on the basis of sample evidence.
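Point estimates
- One sample: $\hat{\mu} = \bar{x}$, $\hat{\sigma}^2 = s^2$, $\hat{P} = p$
- Two samples: $\hat{\mu}_1 - \hat{\mu}_2 = \bar{x}_1 - \bar{x}_2$, $\hat{\sigma}_1^2 / \hat{\sigma}_2^2 = s_1^2 / s_2^2$, $\hat{P}_1 - \hat{P}_2 = p_1 - p_2$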
In statistical inference, one wishes to estimate population parameters using observed sample data. A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. (Definition taken from Valerie J. Easton and John H. McColl's )
The common notation for the parameter in question is $\theta$.
Often, this parameter is the population mean $\mu$, which is estimated through the sample mean $\bar{x}$.
How to Interpret Confidence Intervals
The confidence level describes the uncertainty associated with a sampling method. Suppose we used the same sampling method to select different samples and to compute a different interval estimate for each sample. Some interval estimates would include the true population parameter and some would not. A 99% confidence level means that we would expect 99% of the interval estimates to include the population parameter; a 95% confidence level means that 95% of the intervals would include the parameter; and so on.
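The repeated-sampling reading of a confidence level can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normal with a known standard deviation, and the values `mu=50`, `sigma=10`, `n=30` and the function name `coverage_rate` are made up for illustration.

```python
import random
from statistics import NormalDist, mean

def coverage_rate(mu=50.0, sigma=10.0, n=30, level=0.95, trials=2000, seed=1):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    # Two-sided normal critical value, about 1.96 for a 95% level
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Margin of error; sigma is treated as known (an assumption of this sketch)
    margin = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = mean(sample)
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

rate = coverage_rate()
print(f"Share of intervals containing the true mean: {rate:.3f}")
```

The printed share should land close to the chosen confidence level, although it will not match it exactly in any single run.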
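Confidence interval
$$P(\hat{\theta} - k \le \theta \le \hat{\theta} + k) = 1 - \alpha$$
where $\hat{\theta}$ is the sample estimate of the parameter $\theta$, $k$ is the margin of error and $1 - \alpha$ is the confidence level.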
Confidence Interval Data Requirements
To express a confidence interval, you need three pieces of information:
- Confidence level
- Statistic
- Margin of error
Given these inputs, the range of the confidence interval is defined by the sample statistic ± margin of error, and the uncertainty associated with the confidence interval is specified by the confidence level.
Often, the margin of error is not given; you must calculate it.
There are four steps to constructing a confidence interval.
1. Identify a sample statistic. Choose the statistic (e.g. sample mean, sample proportion) that you will use to estimate a population parameter, according to the type of variable (e.g. the mean if the variable is numerical, or the proportion if the variable is categorical).
2. Select a confidence level. The confidence level is the probability value associated with a confidence interval. It is often expressed as a percentage. For example, say α = 0.05; then the confidence level is equal to (1 - 0.05) = 0.95, i.e. a 95% confidence level. Often, researchers choose 90%, 95%, or 99% confidence levels, but any percentage can be used.
3. Find the margin of error. If you are working on a homework problem or a test question, the margin of error may be given. Often, however, you will need to compute it: margin of error = critical value × standard error of the statistic.
4. Specify the confidence interval. Its range is the sample statistic ± margin of error, and its uncertainty is given by the confidence level.
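Confidence Interval for the Mean when the Standard Deviation Is Unknown
$$P\left(\bar{x} - t_{(\alpha/2,\,df)} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{(\alpha/2,\,df)} \frac{s}{\sqrt{n}}\right) = 1 - \alpha$$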
A confidence interval for a mean specifies a range of values within which the unknown population parameter, in this case the mean, may lie. Such intervals may be calculated by, for example, a businessman who wants to estimate average daily sales; a producer who wishes to estimate mean daily output; or a medical researcher who wishes to estimate the mean response of patients to a new drug.
Confidence Interval for the Mean when sample size is small
Where:
$\bar{x}$ = the sample mean
$t_{(\alpha/2,\,df)}$ = the Student's t value, with df = n - 1 degrees of freedom, determined by the confidence coefficient (1-α) associated with the interval estimate
$s/\sqrt{n}$ = the standard deviation of the sampling distribution, or standard error of the mean
$(1-\alpha) \times 100\%$ = level of confidence, i.e. the confidence coefficient expressed as a percentage
$t_{(\alpha/2,\,df)} \cdot s/\sqrt{n}$ = margin of error
Application Example
Del Monte Foods, Inc., distributes diced peaches in 4-ounce cans. To be sure each can contains at least the required amount, Del Monte sets the filling operation to dispense 4.01 ounces of peaches and syrup in each can, so 4.01 is the population mean. Of course, not every can will contain exactly 4.01 ounces of peaches and syrup; some cans will have more and others less. Let’s also assume that the process follows the normal probability distribution. Now, we select a random sample of 16 cans and determine the sample mean. It is 4.03 ounces of peaches and syrup. Find the 95% confidence interval.
We have the following information:
$\bar{x}$ = 4.03, s = 0.069, n = 16
4.1 4.07 4.06 4.02 3.94 4.14 3.99 4.01 4.13 4.00 3.99 4.07 3.94 4.01 4.1 3.91 3.99
$$P\left(\bar{x} - t_{(\alpha/2,\,df)} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{(\alpha/2,\,df)} \frac{s}{\sqrt{n}}\right) = 1 - \alpha$$
Application Example
With 95% confidence, the mean content of the cans is between 3.9935 and 4.0670 ounces of peaches and syrup. This interval contains the population mean (4.01 ounces), therefore the process is OK.
Therefore, the CI would be <3.9935, 4.0670>. In this case we observe that the population mean of 4.01 ounces is in this interval, but that will not always be the case.
If we set alpha at 0.05, the corresponding t score will be ±2.131, and the
95% confidence interval will be.
Margin of error = $2.131 \times \frac{0.069}{\sqrt{16}} \approx 0.04$, then
$$P(4.03 - 0.04 \le \mu \le 4.03 + 0.04) = 0.95$$
Application Example; output from SPSS
One-Sample Test (Test Value = 0)
Ounces of peaches and syrup: t = 233.702, df = 15, Sig. (2-tailed) = .000, Mean Difference = 4.03022
95% Confidence Interval of the Difference: Lower = 3.9935, Upper = 4.0670
One-Sample Statistics
Columns: N, Mean, Std. Deviation, Std. Error Mean
Ounces of peaches
Application Example
Rongin Kwizera is the host of KXYZ Radio 55 AM drive-time news in Kigali. During his morning program, Rongin asks listeners to call in and discuss current local and national news. This morning, Rongin was concerned with the number of hours children under 12 years of age watch TV per day. The last 5 callers reported that their children watched the following number of hours of TV the previous day.
Data from sample of callers:
1.0, 2.3, 3.1, 0.30, 3.5
Would it be reasonable to develop a confidence interval from these data to show the mean number of hours of TV watched?
If yes, construct an appropriate confidence interval at the 95% confidence level, and interpret the result.
If no, why would a confidence interval not be appropriate?
For these data the mean is 2.04, the standard deviation is 1.36 and n = 5, so df = 4 and t = 2.776.
The computations using the formula are:
$$P\left(2.04 - 2.776 \frac{1.36}{\sqrt{5}} \le \mu \le 2.04 + 2.776 \frac{1.36}{\sqrt{5}}\right) = 0.95$$
Answer: $0.35 < \mu < 3.73$
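The interval for the caller data can be checked with a few lines of Python. This is a minimal sketch: the function name `t_confidence_interval` is illustrative, and the t critical value (2.776 for df = 4) is passed in by hand because the standard library has no t-distribution quantile function.

```python
from statistics import mean, stdev

def t_confidence_interval(data, t_critical):
    """Return (lower, upper) for the mean, given the t critical value."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)                      # sample standard deviation (n - 1 divisor)
    margin = t_critical * s / n ** 0.5   # critical value times standard error
    return x_bar - margin, x_bar + margin

hours = [1.0, 2.3, 3.1, 0.30, 3.5]
# 2.776 is the t value for df = 4 at 95% confidence, as used in the example
lower, upper = t_confidence_interval(hours, t_critical=2.776)
print(f"{lower:.2f} < mu < {upper:.2f}")
```

Rounded to two decimals this reproduces the answer 0.35 < μ < 3.73. With only five self-selected callers the interval is wide, and it is only meaningful if the callers can be treated as a random sample from a roughly normal population.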
We have error in all samples
[Figure: example of a random sample and its population, with mean ages of 47 and 48.5. Difference = 1.5 = estimation error.]
Planning
• Which variables will we collect and how will they be
measured?
• What types of data do they represent?
• What do I want to do? Describe, compare groups, predict
change in Y following change in X?
Why use a sample
Note: Homogeneous populations – small samples are highly representative
Less cost
Less field time
Speed
Accuracy
Destruction of test units
Sample Accuracy
• Sample accuracy: refers to how close a random sample’s statistic (e.g. mean ($\bar{x}$), variance ($s^2$), proportion ($p$)) is to the population value it represents (mean ($\mu$), variance ($\sigma^2$), proportion ($\pi$ or $P$)).
• Important points:
• Sample size is NOT related to representativeness: you could sample 20,000 persons walking by a street corner and the results would still not represent the city; however, an “n” of 100 could be “right on.”
• Sample size, however, is related to accuracy. How close the sample statistic is to the actual population parameter (e.g. sample mean vs. population mean) is a function of sample size.
Sampling design: Steps
1. Definition of target population
Who has the information/data you need?
How do you define your target population?
- Geography
- Demographics
- Use
- Awareness
2. Selection of a sampling frame (list)
• List of elements
•Sampling Frame Error: error that occurs when certain sample elements are not listed or available and are not represented in the sampling frame
3. Probability or Nonprobability sampling
• Probability Sample:
Steps
4. Sampling Unit
5. Error
– Random sampling error (chance fluctuations)
– Nonsampling error (design errors)
6. Sample Type
7. Determination of levels of inference
8. Sample size calculation
Non-Probability Sample:
• Units of the sample are chosen on the basis of personal judgment or convenience
Classification of Sampling Methods
- Probability samples: Simple Random, Systematic, Stratified, Cluster
- Non-probability samples: Convenience, Quota, Snowball
[Diagram labels: Population of study, Variables of interest, Parameters to investigate, Margin of error, n]
Simple Random Sampling
A method for choosing cases from a population by which every case and every combination of cases has an equal chance of being included; it is best suited to a homogeneous population. To use this technique, we need a list of all elements, and cases are often selected by using tables of random numbers.
These tables are lists of numbers that have no pattern (that is, they are random), and an example of such a table is below.
Steps:
1. List the cases to be randomized in a column on a spreadsheet.
2. Assign a number from the table to each case. Begin by labeling the items in the population with numbers; for consistency, these labels should all have the same number of digits. So if we have 100 elements in our population, we can use the numerical labels 001, 002, 003, . . ., 098, 099, 100. The general rule is that if the population size has N digits, we use N-digit labels and read N-digit numbers from the table of random numbers. Start anywhere on the table, and move in any direction (up, down, across, diagonally).
Example: Simple Random Sample
Since there are a total of 48 students, and 48 is a two-digit number, every individual in the population is assigned a two-digit number: 01, 02, 03, . . ., 46, 47, 48.
Answer:
The sample size is 5, and the students are:
Hellene H., Lucille L., Gilles D., Flienne M. and Alain M.
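In practice a pseudo-random number generator usually stands in for the printed table of random numbers. The sketch below assumes a frame of 48 students labeled 01 to 48, as in the example; the function name `simple_random_sample` and the seed value are illustrative, and the labels drawn will generally differ from the names in the answer above.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Two-digit labels 01..48, one per student in the sampling frame
labels = [f"{i:02d}" for i in range(1, 49)]
chosen = simple_random_sample(labels, 5, seed=2014)
print(chosen)
```

Fixing the seed only makes the draw reproducible; leaving it as `None` gives a fresh sample each run.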
Simple Random Sampling
Advantages
Minimal knowledge of population needed
External validity high; internal validity
high; statistical estimation of error
Easy to analyze data
Disadvantages
High cost; low frequency of use
Requires sampling frame
Does not use researchers’ expertise
Systematic Sampling
Systematic sampling relies on arranging the study population according to some ordering scheme and then selecting elements at regular intervals through that ordered list. Systematic sampling involves a random start and then proceeds with the selection of every kth element from then onwards. It is important that the starting point is not automatically the first in the list, but is instead randomly chosen from within the first to the kth element in the list.
Sampling interval: k = Population Size / Sample Size
Probability of selection of each element = 1/k = Sample Size / Population Size
An initial starting point is selected by a random process, and then
every kth element on the list is selected.
k = sampling interval: the number of population elements between the units selected for
the sample
Systematic Sampling
Advantages
Moderate cost; moderate usage
External validity high; internal validity high; statistical estimation
of error
Simple to draw sample; easy to verify
Disadvantages
Periodic ordering
Requires sampling frame
Steps
1. Compute the sampling interval k = population size / sample size (e.g. k = 3), then pick a random integer between 1 and k; this will be your first subject.
2. Select every kth element after it (here, every 3rd element) until the required sample size is reached.
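The two steps can be sketched as follows, assuming the population size is an exact multiple of the sample size so that the interval k is a whole number. The function name `systematic_sample` and the 30-element list are made up for illustration.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Random start within the first k elements, then every k-th element."""
    rng = random.Random(seed)
    k = len(frame) // n            # sampling interval (assumes len(frame) >= n)
    start = rng.randrange(k)       # 0-based random start among the first k elements
    return [frame[start + i * k] for i in range(n)]

population = list(range(1, 31))    # 30 ordered elements, so k = 30 / 10 = 3
sample = systematic_sample(population, 10, seed=7)
print(sample)
```

If the ordered list has a periodic pattern whose period matches k, the sample can be badly unrepresentative, which is the "periodic ordering" disadvantage listed above.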
Stratified Sampling
Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata". Each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected.
Benefits to stratified sampling
First, dividing the population into distinct, independent strata can enable researchers to draw inferences about specific subgroups that may be lost in a more generalized random sample.
Second, it is sometimes the case that data are more readily available for individual, pre-existing strata within a population than for the overall population; in such cases, using a stratified sampling approach may be more convenient than aggregating data across groups (though this may potentially be at odds with the previously noted importance of utilizing criterion-relevant strata).
Stratified Sampling
In a stratified sampling design, the first step is to choose the attribute on which to stratify; the second is to determine how the categories of that attribute occur in the population and, therefore, into how many groups or strata to divide the population (the following figure shows a stratified sampling design with 4 strata, L = 4). Once the subgroups are determined, the next step is to find the total population belonging to each stratum (N1, N2, N3, N4) and, finally, to take a random sample from each stratum (n1, n2, n3, n4). The sum of the subsamples constitutes our total sample (n1 + n2 + n3 + n4 = n).
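The design with L = 4 strata can be sketched in code. Everything concrete here is a labeled assumption: the stratum names, the stratum sizes (40, 30, 20, 10), the subsample sizes (8, 6, 4, 2) and the function name `stratified_sample` are hypothetical.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw an independent simple random sample from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, sizes[name]) for name, units in strata.items()}

# Hypothetical frame with L = 4 strata of sizes N1..N4
strata = {
    "stratum 1": [f"s1-{i}" for i in range(40)],
    "stratum 2": [f"s2-{i}" for i in range(30)],
    "stratum 3": [f"s3-{i}" for i in range(20)],
    "stratum 4": [f"s4-{i}" for i in range(10)],
}
# Hypothetical subsample sizes n1..n4
sizes = {"stratum 1": 8, "stratum 2": 6, "stratum 3": 4, "stratum 4": 2}

subsamples = stratified_sample(strata, sizes, seed=3)
n_total = sum(len(units) for units in subsamples.values())
print(n_total)  # n1 + n2 + n3 + n4
```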
Cluster Sampling
The primary sampling unit is not the individual element, but a large
cluster of elements. Either the cluster is randomly selected or the
elements within are randomly selected.
Sometimes it is more cost-effective to select respondents in groups
('clusters'). Sampling is often clustered by geography, or by time periods.
(Nearly all samples are in some sense 'clustered' in time.) For example, if
surveying households within a city, we might choose to select 100 city
blocks and then interview every household within the selected blocks.
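The city-block example corresponds to one-stage cluster sampling, sketched below. The frame of 20 blocks with 4 households each, the choice of 3 blocks, and the function name `cluster_sample` are hypothetical values chosen to keep the example small.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Select whole clusters at random and keep every element inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # sample cluster names
    return {name: clusters[name] for name in chosen}

# Hypothetical frame: 20 city blocks, each listing its 4 households
blocks = {
    f"block {b}": [f"household {b}-{h}" for h in range(1, 5)]
    for b in range(1, 21)
}
selected = cluster_sample(blocks, 3, seed=11)
households = [h for members in selected.values() for h in members]
print(len(selected), len(households))
```

Only a list of blocks is needed up front; household lists are required just for the blocks that were drawn.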
Cluster Sampling
Advantages
Low cost/high frequency of use
Requires list of all clusters, but only of
individuals within chosen clusters
Can estimate characteristics of both cluster and
population
For multistage, has strengths of used methods
Disadvantages
Larger error for comparable size than other
probability methods
Multistage very expensive and validity depends
Non-Probability Sampling Methods
Convenience sampling, or accidental sampling, is a type of nonprobability sampling in which the sample is drawn from the part of the population that is close to hand. That is, a sample is selected because it is readily available and convenient. People may be included simply because the researcher happens to meet them, or because they can be found through technological means such as the internet or the phone. The researcher using such a sample cannot scientifically make generalizations about the total population from this sample because it would not be representative enough. For example, if the interviewer were to conduct such a survey at a shopping center early in the morning on a given day, the people that he/she could interview would be limited to those present there at that time, who would not represent the views of other members of society in that area.
Convenience Sampling
Advantages: Very low cost, Extensively used/ understood, no need for list of population elements
Judgment or Purposive Sample
The sampling procedure in which an experienced researcher selects
the sample based on some appropriate characteristic of sample members, to serve a purpose.
It involves the selection of the cases that we judge to be the most appropriate
ones for a given study. It is based on the judgment of a researcher.
Advantages: Moderate cost, commonly used/understood, sample will meet a specific objective
Disadvantages: Projecting
Quota Sample
The sampling procedure that ensures that a certain characteristic
of a population sample will be represented to the exact extent
that the investigator desires.
Advantages: moderate cost, very extensively used/understood, no need for list of population elements, introduces some elements of stratification.
Disadvantages: variability and bias cannot be measured or controlled (classification of subjects), projecting
Snowball Sampling
The sampling procedure in which the initial respondents are chosen by probability or non-probability methods, and then additional respondents are obtained from information provided by the initial respondents.
Advantages: low cost, useful in specific circumstances, useful for locating rare populations
Disadvantages: bias because sampling units are not independent
To properly understand how to determine sample size, it helps to understand the following AXIOMS:
• The only perfectly accurate sample is a census.
• A probability sample will always have some inaccuracy (sample error).
• The larger a probability sample is, the more accurate it is (less sample error).
• You can take any finding in the survey, replicate the survey with the same probability sample plan and size, and you will be “very likely” to find the same result within the ± range of the original findings.
• In almost all cases, the accuracy (sample error) of a probability sample is independent of the size of the population.
• A probability sample can be a very tiny percentage of the population size and still be very accurate (have little sample error).
• The size of the probability sample depends on the client’s desired accuracy (acceptable sample error) balanced against the cost of data collection for that sample size.
Determining Sample Size
There is only one method of determining sample size that allows the
researcher to PREDETERMINE the accuracy of the sample results: The
Confidence Interval Method of Determining Sample Size.
• Confidence interval: range whose endpoints define a certain percentage of the responses to a question
• Central limit theorem: a theory that holds that values taken from repeated samples of a survey within a population would look like a normal curve. The mean of all sample means is the mean of the population
Determining Sample Size
• Two types of error:
  • Nonsampling error: pertains to all sources of error other than sample selection method and sample size.
  • Sampling error: involves sample selection and sample size; this is the error that we are controlling through formulas.
• Sample error formula:
$$\pm\,\text{Sample error}\,(\%) = z_{\alpha/2} \sqrt{\frac{pq}{n}}$$
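A short sketch shows how the sample error behaves as n grows. It assumes p and q are expressed in percent (q = 100 - p), a 95% confidence level (z = 1.96) and the worst case p = 50%; the function name `sample_error_pct` is illustrative.

```python
from math import sqrt

def sample_error_pct(p_pct, n, z=1.96):
    """Sample error in percentage points: z * sqrt(p * q / n), with q = 100 - p."""
    q_pct = 100.0 - p_pct
    return z * sqrt(p_pct * q_pct / n)

# Worst-case variability (p = q = 50%) at increasing sample sizes
for n in (100, 400, 1000):
    print(n, round(sample_error_pct(50, n), 1))
```

Note that the population size N does not appear anywhere in the calculation, and that quadrupling n only halves the error.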
Determining Sample Size
• Variability: refers to how similar or dissimilar responses are to a given question
• P (%): share that “have” or “are” or “will do” etc.
• Q (%): 100% - P%, share of “have nots” or “are nots” or “won’t dos” etc.
N.B.: The more variability in the population being studied, the larger the sample size needed to achieve the stated accuracy level.
What data do you need to consider
• Variance or heterogeneity of population
Previous studies? Industry expectations? Pilot study? Sequential sampling
• Expect the worst case (p=50%; q=50%)
• Estimate variability: results of previous studies or conduct a pilot study
• Confidence level
With Nominal data (i.e. Yes, No), we can
The formula requires that we
(a.) specify the amount of confidence we wish to have,
(b.) estimate the variance in the population, and
(c.) specify the level of desired accuracy
Sample Size Formula: Proportion
Finite population:
$$n = \frac{N z_{\alpha/2}^2\, p\, q}{e^2 (N - 1) + z_{\alpha/2}^2\, p\, q}$$
Infinite population:
$$n = \frac{z_{\alpha/2}^2\, p\, q}{e^2}$$
• The sample size formula for estimating a proportion (also called a percentage or share) uses:
n = sample size
N = population size
1-α = confidence
Z = Z(1-α/2) = standard error associated with the chosen level of confidence
P = estimated percent in the population
q = 1-p
e = acceptable sample error
Sample Size Formula
Determining the necessary Sample Size for estimating a single population mean or a single population total with a specified level of precision:
Z at 90% confidence = 1.64
Z at 95% confidence = 1.96
Z at 99% confidence = 2.58
Additional correction for sampling finite populations
The above formula assumes that the population is very large compared to the sample. If you are sampling more than 10% of the whole population, then you should apply a correction to the sample size estimate that incorporates the finite population correction factor (FPC). This will reduce the sample size.
If $n^*/N > 0.10$:
$$n' = \frac{n^*}{1 + n^*/N}$$
n' = the new FPC-corrected sample size.
n* = the corrected sample size from the sample size correction table.
N = the total size of the population.
The Confidence Interval Method of
Determining Sample Size
Normal Distribution
Example 1
We’d like to find the satisfaction level of the students at AUCA. The Registration office has 1675 students registered in the 2014-II semester. We need to take a sample with 95% confidence and a 5% error of estimation.
Population size: N = 1675
Confidence: 1-α = 0.95
Standard error associated with the chosen level of confidence: Z = Z(1-α/2) = 1.96
Estimated percent in the population: P = 0.5
1-p: q = 0.5
• The information for this sampling example is presented:
Strata (Faculty)    Population strata size
Business            837
IT                  502
Education           250
Theology            86
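$$n_0 = \frac{N z_{\alpha/2}^2\, p\, q}{e^2 (N - 1) + z_{\alpha/2}^2\, p\, q} = \frac{1675 \times 1.96^2 \times 0.50 \times 0.50}{0.05^2 \times (1675 - 1) + 1.96^2 \times 0.50 \times 0.50} = 312.64 \approx 313 \text{ students}$$
If $n_0/N > 0.10$, adjust the sample size. Here $313/1675 = 0.1869 > 0.10$, so
$$n = \frac{n_0}{1 + n_0/N} = \frac{313}{1 + 313/1675} = 263.72 \approx 264 \text{ students}$$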
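The calculation for the AUCA example can be reproduced in code. This is a sketch under stated assumptions: the function names `sample_size_proportion` and `fpc_adjust` are illustrative, and results are rounded to the nearest whole unit because that is what the worked examples in this chapter appear to do (rounding up is the more conservative convention).

```python
def sample_size_proportion(z, p, e, N=None):
    """Sample size for a proportion; uses the finite formula when N is given."""
    q = 1 - p
    if N is None:
        return z**2 * p * q / e**2                               # infinite population
    return N * z**2 * p * q / (e**2 * (N - 1) + z**2 * p * q)    # finite population

def fpc_adjust(n0, N, threshold=0.10):
    """Apply the finite population correction when n0 / N exceeds the threshold."""
    if n0 / N > threshold:
        return n0 / (1 + n0 / N)
    return n0

# AUCA example: N = 1675, 95% confidence (z = 1.96), p = q = 0.5, e = 0.05
n0 = round(sample_size_proportion(1.96, 0.5, 0.05, N=1675))
n = round(fpc_adjust(n0, 1675))
print(n0, n)
```

Calling `sample_size_proportion(2.58, 0.2, 0.05)` without `N` gives the infinite-population size, which rounds to the 426 quoted for review question 2.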
Strata (Faculty)    Population strata size    Sample size (strata size × 264/1675)
Business            837                       132
IT                  502                       79
Education           250                       39
Theology            86                        14
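The allocation of the 264 students across faculties is proportional to stratum size, which can be sketched as below. The function name `proportional_allocation` is illustrative, and rounding each stratum to the nearest whole student is an assumption that matches the table.

```python
def proportional_allocation(strata_sizes, n):
    """Allocate a total sample n across strata in proportion to stratum size."""
    N = sum(strata_sizes.values())
    return {name: round(n * size / N) for name, size in strata_sizes.items()}

faculties = {"Business": 837, "IT": 502, "Education": 250, "Theology": 86}
allocation = proportional_allocation(faculties, 264)
print(allocation)
```

Because each stratum is rounded separately, the allocated sizes will not always add up to n exactly; here they happen to sum to 264.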
Sample Size Formula: Mean
Infinite population:
$$n = \frac{z_{\alpha/2}^2\, \sigma^2}{e^2}$$
Z is determined the same way (1.96 or 2.58)
“e” is expressed in terms of the units we are estimating, i.e. if we are
measuring attitudes on a 1-7 scale, we may want our error to be
no more than ± 0.5 scale units. If we are estimating dollars being paid for
a product, we may want our error to be no more than ± $3.00.
Adjust the sample size: if $n_0/N > 0.10$,
$$n = \frac{n_0}{1 + n_0/N}$$
Finite population:
$$n = \frac{N z_{\alpha/2}^2\, \sigma^2}{e^2 (N - 1) + z_{\alpha/2}^2\, \sigma^2}$$
Sample Size Formula
Estimating “σ” in the Formula to Determine the Sample Size
Required to Estimate a Mean
Since we are estimating a mean, we can assume that our data are either
interval or ratio. When we have interval or ratio data, the standard
deviation of the sample, s, may be used as a measure of variability.
How to estimate σ?
• Use standard deviation of the sample from a previous study on the
target population
• Conduct a pilot study of a few members of the target population and
Example
An accounting firm wishes to form a 90 percent confidence
interval for the population mean tax refund for its clients
who receive refunds. How large a random sample is needed
to be within $6 (error) of the actual amount if a preliminary
study finds the standard deviation to be $42.67?
Solution
Data: 90% confidence (Zα/2 = 1.64), e = $6, σ = $42.67
$$n = \frac{z_{\alpha/2}^2\, \sigma^2}{e^2} = \frac{1.64^2 \times 42.67^2}{6^2} = \frac{4897.032}{36} = 136.029 \approx 136 \text{ clients}$$
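The same arithmetic can be written as a small helper. The function name `sample_size_mean` is illustrative, and the rounding to the nearest whole unit follows what this example does; rounding up would be the more conservative choice.

```python
def sample_size_mean(z, sigma, e, N=None):
    """Sample size for a mean; uses the finite formula when N is given."""
    if N is None:
        return (z * sigma / e) ** 2                                      # infinite population
    return N * z**2 * sigma**2 / (e**2 * (N - 1) + z**2 * sigma**2)      # finite population

# Tax refund example: 90% confidence (z = 1.64), sigma = $42.67, e = $6
n = sample_size_mean(1.64, 42.67, 6)
print(round(n, 3), round(n))
```

The helper can also be used to check review question 3 (sd = 46, e = 4, 99% confidence with z = 2.58), which rounds to the quoted answer of 880.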
Review problems of chapter
Question 1
• We are about to go on a recruitment drive to hire some auditors at the entry level. We need to decide on a competitive salary offer for these new auditors. From talking to some HR professionals, I’ve made a rough estimate that most new hires are getting starting salaries in the $38-42,000 range and the average (mean) is around $39,000. The standard deviation seems to be around $3000.
• I want to be 95% confident about the average salary and I’m willing to tolerate an estimate that is within $500 (plus or minus) of the true value. If we’re off, we can always adjust salaries at the end of the probation period.
• What sample size should we use?
2. We’ve just started a new educational TV program that teaches viewers all about research methods!
• We know from past educational TV programs that such a program would likely capture 2 out of 10 viewers on a typical night.
• Let’s say we want to be 99% confident that our obtained sample proportion of viewers will differ from the true population proportion by not more than 5%.
• What sample size do we need?
Answer: 426
3. A study is to be performed to determine a certain parameter in a community. From a previous study, an “sd” of 46 was obtained. If a sample error of up to 4 is to be accepted, how many subjects should be included in this study at the 99% level of confidence?
Answer: 880
Examples
4. Management wants to know customers’ level of satisfaction with their service. They propose conducting a survey and asking for satisfaction on a scale from 1 to 10 (since there are 10 possible answers, the range = 10). Management wants to be 99% confident in the results (99 chances in 100 that the true value is captured) and they do not want the allowed error to be more than ± 0.5 scale points. S = 1.7 (from a pilot study). What is n?
Answer: 77
5. Five years ago a survey showed that 42% of consumers were aware of the company’s brand (consumers were either “aware” or “not aware”).
• After an intense ad campaign, management will conduct another survey. They want to be 95% confident (95 chances in 100) that the survey estimate will be within ± 5% of the true share of “aware” consumers in the population. What is n?
Examples:
6. We wish to determine the sample size required to estimate, with 95% confidence and 5% error tolerance, the percentage of Rwandans preferring the federal Liberal party.
A recent poll showed that 40% of Rwandans questioned preferred the Liberals. What is the required sample size? Answer: 369
7. In a school there are 800 girls and 750 boys.
a. What sampling design is more convenient in this problem and explain how you would.
b. How many girls and how many boys would you include with 95% confidence and 5% error? Answer: (Girls = 133, Boys = 124)
8. A firm employs staff in one of the three categories listed below:
18 managers
34 secretaries
204 production workers.
For example, if you want to test whether attending class influences how students perform on an exam, using test scores (from 0-100) as data would not be appropriate for a Chi-square test. However, arranging students into the categories "Pass" and "Fail" would. Additionally, the data in a Chi-square grid should not be in the form of percentages, or anything other than
frequency (count) data. Thus, by dividing a class of 54 into groups according to whether they attended class and whether they passed the exam, you
might construct a data set like this:
            Pass    Fail
Attended    25      6
Skipped     8       15