Chapter IV Confidence Interval and Sampling.ppt

(1)

Confidence Interval and Sampling

Quantitative Techniques and Simulation

Chapter IV

(2)

Inferential Statistics

Inferential statistical methods are a way to draw conclusions about a population from the data obtained from a probability sample.

Statistical inference involves two main types of techniques: Parameter Estimation and Hypothesis Testing. Whatever the technique used, the overall purpose is to use data from a probability sample to draw conclusions about a population.

• A data point is one observation of the variable being measured.

• A population consists of all subjects or objects about which the study is being conducted.

(3)

Objective of the Chapter

(4)

Introduction: Inferential Statistics

[Figure: a representative sample of size n is drawn from a population of size N by probability sampling (sampling rate, sample size). Inference (parameter estimation and hypothesis testing) uses the sample statistics x̄, s², p̂ to draw conclusions about the population parameters μ, σ², P.]

• Which variables will we collect and how will they be measured?

• What types of data do they represent?

• What do we want to do? Describe, compare groups, predict change in Y following change in X?

(5)

Parameter Estimation

Estimation takes two forms:

- Point estimates

- Interval estimates

Estimation techniques are used when the researcher has no prior assumptions about the value of a population characteristic and wants to know what that value might be.

Statistical inference contains two types of procedures regarding population parameters, both carried out on the basis of sample evidence.

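Point estimates

- One sample: $\hat{\mu} = \bar{x}$, $\hat{\sigma}^2 = s^2$, $\hat{P} = p$

- Two samples: $\hat{\mu}_1 - \hat{\mu}_2 = \bar{x}_1 - \bar{x}_2$, $\hat{\sigma}_1^2 / \hat{\sigma}_2^2 = s_1^2 / s_2^2$, $\hat{P}_1 - \hat{P}_2 = p_1 - p_2$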

(6)

In statistical inference, one wishes to estimate population parameters using observed sample data. A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. (Definition taken from Valerie J. Easton and John H. McColl's )

The common notation for the parameter in question is θ.

Often, this parameter is the population mean μ, which is estimated through the sample mean x̄.

How to Interpret Confidence Intervals

The confidence level describes the uncertainty associated with a sampling method. Suppose we used the same sampling method to select different samples and to compute a different interval estimate for each sample. Some interval estimates would include the true population parameter and some would not. A 99% confidence level means that we would expect 99% of the interval estimates to include the population parameter; a 95% confidence level means that 95% of the intervals would include the parameter; and so on.
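The repeated-sampling interpretation above can be checked with a small simulation. This is only an illustrative sketch: the population values (`mu`, `sigma`), the sample size, the number of repetitions and the function name `coverage` are assumptions chosen for the demonstration, and a known-σ interval with z = 1.96 is used for simplicity.

```python
import math
import random

def coverage(mu=50.0, sigma=10.0, n=30, reps=2000, z=1.96, seed=0):
    """Share of 95% intervals (known sigma) that contain the true mean mu."""
    rng = random.Random(seed)
    margin = z * sigma / math.sqrt(n)          # same margin for every sample
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(sample) / n
        if mean - margin <= mu <= mean + margin:
            hits += 1                          # this interval captured mu
    return hits / reps

print(coverage())   # share of intervals that include the true mean
```

The printed share should be close to 0.95: most intervals include the true mean, and a few do not.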

Confidence interval











$$P(\hat{\theta} - k \le \theta \le \hat{\theta} + k) = 1 - \alpha$$

(7)

Confidence interval

Confidence Interval Data Requirements

To express a confidence interval, you need three pieces of information.

Confidence level, Statistic, Margin of error

Given these inputs, the range of the confidence interval is defined by: sample statistic ± margin of error. The uncertainty associated with the confidence interval is specified by the confidence level.

Often, the margin of error is not given; you must calculate it.

There are four steps to constructing a confidence interval.

1. Identify a sample statistic. Choose the statistic (e.g. sample mean, sample proportion) that you will use to estimate a population parameter, according to the type of variable (e.g. the mean if the variable is numerical, or the proportion if the variable is categorical).

2. Select a confidence level. The confidence level is the probability value associated with a confidence interval. It is often expressed as a percentage. For example, say α = 0.05, then the confidence level is equal to (1-0.05) = 0.95, i.e. a 95% confidence level. Often, researchers choose 90%, 95%, or 99% confidence levels; but any percentage can be used.

3. Find the margin of error. If you are working on a homework problem or a test question, the margin of error may be given. Often, however, you will need to compute the margin of error, based on the following equation.
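The equation for this step did not survive in the slide text. A commonly used form, stated here as an assumption that agrees with the t-based interval used in the examples of this chapter, is:

```latex
\text{Margin of error} = \text{critical value} \times \text{standard error of the statistic},
\qquad \text{e.g. for a mean:}\quad
t_{(\alpha/2,\,df)} \cdot \frac{s}{\sqrt{n}}
```

The interval is then the sample statistic ± this margin of error, as described under the data requirements.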

(8)



Confidence Interval for the Mean when the Standard Deviation Is Known

$$P\left(\bar{x} - t_{(\alpha/2,\,df)}\,\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{(\alpha/2,\,df)}\,\frac{s}{\sqrt{n}}\right) = 1 - \alpha$$

A confidence interval for a mean specifies a range of values within which the unknown population parameter, in this case the mean, may lie. Such intervals may be calculated by, for example, a businessman who wants to estimate his average daily sales; a producer who wishes to estimate his mean daily output; a medical researcher who wishes to estimate the mean response of patients to a new drug; etc.

Confidence Interval for the Mean when sample size is small

Where:

$\bar{x}$ = the sample mean

$t_{(\alpha/2,\,df)}$ = the t value determined by the confidence coefficient (1-α) associated with the interval estimate, with df = n-1 degrees of freedom

$s/\sqrt{n}$ = the standard deviation of the sampling distribution, or standard error of the mean

$(1-\alpha) \times 100\%$ = level of confidence, i.e. the confidence coefficient expressed as a percentage

$t_{(\alpha/2,\,df)} \cdot s/\sqrt{n}$ = margin of error

(9)

Application Example

Del Monte Foods, Inc., distributes diced peaches in 4-ounce cans. To be sure each can contains at least the required amount, Del Monte sets the filling operation to dispense 4.01 ounces of peaches and syrup in each can. So 4.01 is the population mean. Of course not every can will contain exactly 4.01 ounces of peaches and syrup. Some cans will have more and others less. Let's also assume that the process follows the Normal Probability Distribution. Now, we select a random sample of 16 cans and determine the sample mean. It is 4.03 ounces of peaches and syrup. Find the 95% confidence interval for the population mean.

We have the following information:

$\bar{X}$ = 4.03, s = 0.069, n = 16

4.1 4.07 4.06 4.02 3.94 4.14 3.99 4.01 4.13 4.00 3.99 4.07 3.94 4.01 4.1 3.91 3.99







$$P\left(\bar{x} - t_{(\alpha/2,\,df)}\,\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{(\alpha/2,\,df)}\,\frac{s}{\sqrt{n}}\right) = 1 - \alpha$$

(10)

Application Example

With 95% confidence, the mean amount of peaches and syrup per can lies between 3.9935 and 4.0670 ounces. This interval contains the population mean (4.01 ounces), therefore the process is OK.

Therefore, the CI would be <3.9935, 4.0670>. In this case we observe that the population mean of 4.01 ounces is in this interval, but that will not always be the case.

If we set alpha at 0.05, the corresponding t score (df = 15) will be ±2.131, and the 95% confidence interval will be:

Margin of error $= 2.131 \times \dfrac{0.069}{\sqrt{16}} \approx 0.04$, then

$$P(4.03 - 0.04 \le \mu \le 4.03 + 0.04) = 0.95$$
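The arithmetic of this example can be reproduced with a few lines of Python. This is a minimal sketch using only the summary values quoted on the slide (mean 4.03, s = 0.069, n = 16, t = 2.131); the function name `t_interval` is a hypothetical helper, not part of any library.

```python
import math

def t_interval(mean, sd, n, t_crit):
    """Return (lower, upper, margin) for mean ± t_crit * sd / sqrt(n)."""
    margin = t_crit * sd / math.sqrt(n)
    return mean - margin, mean + margin, margin

# Summary values from the example; 2.131 is the t table value for df = 15, alpha = 0.05.
lower, upper, margin = t_interval(mean=4.03, sd=0.069, n=16, t_crit=2.131)
print(f"margin = {margin:.4f}, CI = ({lower:.4f}, {upper:.4f})")
```

Because the inputs are rounded, the limits can differ in the last decimals from the software output, which works from the unrounded data.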

(11)

Application Example; output from SPSS


One-Sample Test (Test Value = 0)

                               t         df    Sig. (2-tailed)    Mean Difference    95% Confidence Interval of the Difference
                                                                                     Lower      Upper
Ounces of peaches and syrup    233.702   15    .000               4.03022            3.9935     4.0670

One-Sample Statistics

                               N    Mean    Std. Deviation    Std. Error Mean
Ounces of peaches

(12)

Application Example

Rongin Kwizera is the host of the KXYZ Radio 55 AM drive-time news in Kigali. During his morning program, Rongin asks listeners to call in and discuss current local and national news. This morning, Rongin was concerned with the number of hours children under 12 years of age watch TV per day. The last 5 callers reported that their children watched the following number of hours of TV the previous day:

Data from Sample of callers:

1.0, 2.3, 3.1, 0.30, 3.5

Would it be reasonable to develop a confidence interval from these data to show the mean number of hours of TV watched?

If yes, construct an appropriate confidence interval at the 95% confidence level, and interpret the result.

If no, why would a confidence interval not be appropriate?

Mean = 2.04

Std. Deviation = 1.36

Since for these data the mean is 2.04, the standard deviation is 1.36 and n = 5 (so df = 4 and t = 2.776), the computations using the formula are:

$$P\left(2.04 - 2.776\,\frac{1.36}{\sqrt{5}} \le \mu \le 2.04 + 2.776\,\frac{1.36}{\sqrt{5}}\right) = 0.95$$

Answer: 0.35 < μ < 3.73
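The same interval can be computed directly from the five reported values. This sketch uses only the Python standard library; the t value 2.776 (df = 4, 95% confidence) is taken from the slide rather than computed. It assumes the callers can be treated as a random sample from a roughly normal population, which is exactly what the exercise asks you to question.

```python
import math
import statistics

hours = [1.0, 2.3, 3.1, 0.30, 3.5]    # hours of TV reported by the five callers
T_CRIT = 2.776                         # t table value for df = 4, 95% confidence

n = len(hours)
mean = statistics.mean(hours)
sd = statistics.stdev(hours)           # sample standard deviation (divides by n - 1)
margin = T_CRIT * sd / math.sqrt(n)
ci = (mean - margin, mean + margin)
print(f"n={n}, mean={mean:.2f}, s={sd:.2f}, CI=({ci[0]:.2f}, {ci[1]:.2f})")
```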

(13)

(14)

Every sample has error

Example:

Random sample: mean age = 47

Population: mean age = 48.5

Difference = 1.5 = estimation error

Planning

• Which variables will we collect and how will they be measured?

• What types of data do they represent?

• What do I want to do? Describe, compare groups, predict change in Y following change in X?

(15)

Why use a sample

Note: Homogeneous populations – small samples are highly representative



Less cost



Less field time



Speed



Accuracy



Destruction of test units

(16)

Sample Accuracy

• Sample accuracy: refers to how close a random sample's statistic (e.g. mean ($\bar{x}$), variance ($s^2$), proportion ($p$)) is to the population's value it represents (mean (µ), variance (σ²), proportion (π or P)).

• Important points:

• Sample size is NOT representativeness … you could sample 20,000 persons walking by a street corner and the results would still not represent the city; however, an "n" of 100 could be "right on."

• Sample size, however, is related to accuracy. How close the sample statistic is to the actual population parameter (e.g. sample mean vs. population mean) is a function of sample size.

(17)

Sampling design: Steps

1. Definition of target population

Who has the information/data you need?

How do you define your target population?

- Geography

- Demographics

- Use

- Awareness

2. Selection of a sampling frame (list)

• List of elements

• Sampling frame error: error that occurs when certain sample elements are not listed or available and are therefore not represented in the sampling frame

3. Probability or Nonprobability sampling

• Probability Sample:

(18)

Steps

4. Sampling Unit

5. Error

–

Random sampling error (chance fluctuations)

–

Nonsampling error (design errors)

6. Sample Type

7. Determination of levels of inference

8. Sample size calculation

Non-Probability Sample:

• Units of the sample are chosen on the basis of personal judgment or convenience

(19)

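Classification of Sampling Methods

[Diagram: the population of study, the variables of interest, the parameters to investigate and the margin of error lead to the sample size n and to the choice of sampling method.]

Probability samples: Simple Random, Systematic, Stratified, Cluster

Non-probability samples: Convenience, Quota, Snowball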

(20)

Simple Random Sampling

A method for choosing cases from a population by which every case and every combination of cases has an equal chance of being included; it works best when the population is homogeneous. To use this technique, we need a list of all elements, and cases are often selected by using tables of random numbers.

These tables are lists of numbers that have no pattern (that is, they are random); an example of such a table is shown below.

Steps:

1. List the cases to be randomized in a column on a spreadsheet.

2. Assign a number from the table to each case. Begin by labeling the items in the population with numbers. For consistency, these labels should all have the same number of digits. So if we have 100 elements in our population, we can use the numerical labels 001, 002, 003, . . ., 098, 099, 100. The general rule is that if the population size has N digits, then we use labels with N digits and read N-digit numbers from the table of random numbers. Start anywhere on the table, and move in any direction (up, down, across, diagonally).
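In software, the random number table is usually replaced by a pseudo-random generator. The sketch below is one possible illustration, assuming a hypothetical frame of 100 cases labelled 001 to 100; the function name `simple_random_sample` and the seed are illustrative choices.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct cases so that every case is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)    # sampling without replacement

# Hypothetical sampling frame: 100 cases with three-digit labels 001..100.
frame = [f"{i:03d}" for i in range(1, 101)]
sample = simple_random_sample(frame, n=5, seed=42)
print(sample)
```

Fixing the seed only makes the draw reproducible; leaving it as `None` gives a different sample on each run.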

(21)

(22)

Simple Random Sample

Example

Since there are a total of 48 students, and 48 is a two-digit number, every individual in the population is assigned a two-digit number beginning 01, 02, 03, . . . 46, 47, 48.

Answer:

The sample size is 5, and the students are:

Hellene H, Lucille L., Gilles D., Flienne M. and Alain M.

(23)

Simple Random Sampling



Advantages



Minimal knowledge of population needed



External validity high; internal validity high; statistical estimation of error



Easy to analyze data



Disadvantages



High cost; low frequency of use



Requires sampling frame



Does not use researchers’ expertise

(24)

Systematic Sampling

Systematic sampling relies on arranging the study population according to some ordering scheme and then selecting elements at regular intervals through that ordered list. Systematic sampling involves a random start and then proceeds with the selection of every kth element from then onwards. It is important that the starting point is not automatically the first in the list, but is instead randomly chosen from within the first to the kth element in the list.

Sampling interval k = Population Size / Sample Size

• An initial starting point is selected by a random process, and then every kth number on the list is selected

• k = sampling interval: the number of population elements between the units selected for the sample
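A minimal sketch of the procedure, assuming a hypothetical ordered list of 30 elements and a required sample of 10, so that the sampling interval is 3. The function name `systematic_sample` is illustrative, and integer division is used for the interval, which is a simplification when the population size is not a multiple of the sample size.

```python
import random

def systematic_sample(population, n, seed=None):
    """Pick a random start within the first k elements, then take every k-th element."""
    k = len(population) // n             # sampling interval = population size / sample size
    rng = random.Random(seed)
    start = rng.randrange(k)             # random start (0-based) within the first k elements
    return [population[start + i * k] for i in range(n)]

frame = list(range(1, 31))               # hypothetical ordered list of 30 elements
sample = systematic_sample(frame, n=10, seed=1)
print(sample)
```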

(25)

Systematic Sampling

 Advantages

 Moderate cost; moderate usage

 External validity high; internal validity high; statistical estimation

of error

 Simple to draw sample; easy to verify

 Disadvantages

 Periodic ordering

 Requires sampling frame

Steps

1. Compute the sampling interval k as population size divided by sample size (e.g. k = 3) and pick a random integer between 1 and k; this will be your first subject.

2. Select the remaining subjects systematically, jumping every k values (here, every 3 values), and so on until the required sample size is reached.

(26)

Stratified Sampling

Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata". Each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected.

Benefits to stratified sampling

First, dividing the population into distinct, independent strata can enable researchers to draw inferences about specific subgroups that may be lost in a more generalized random sample.

Second, it is sometimes the case that data are more readily available for individual, pre-existing strata within a population than for the overall population; in such cases, using a stratified sampling approach may be more convenient than aggregating data across groups (though this may potentially be at odds with the previously noted importance of utilizing criterion-relevant strata).

(27)

Stratified Sampling

In a stratified sampling design, the first step is to choose the attribute on which we stratify; the second is to determine which values of that attribute occur in the population and, therefore, into how many groups or strata the population is divided (the following figure shows a stratified sampling design with 4 strata, L = 4). Once the subgroups are determined, the next step is to find the total population belonging to each stratum (N₁, N₂, N₃, N₄) and, finally, we take a random sample from each stratum (n₁, n₂, n₃, n₄). The sum of the subsamples constitutes our total sample (n₁ + n₂ + n₃ + n₄ = n).
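The steps can be sketched in code as follows. The four strata, their sizes (40, 30, 20, 10) and the subsample sizes are hypothetical values chosen for illustration, and `stratified_sample` is an illustrative helper name.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw an independent simple random sample of sizes[h] from each stratum h."""
    rng = random.Random(seed)
    return {name: rng.sample(members, sizes[name]) for name, members in strata.items()}

# Hypothetical population with L = 4 strata of sizes N1..N4 = 40, 30, 20, 10.
strata = {
    "stratum_1": [f"s1_{i}" for i in range(40)],
    "stratum_2": [f"s2_{i}" for i in range(30)],
    "stratum_3": [f"s3_{i}" for i in range(20)],
    "stratum_4": [f"s4_{i}" for i in range(10)],
}
sizes = {"stratum_1": 8, "stratum_2": 6, "stratum_3": 4, "stratum_4": 2}   # n1..n4

sample = stratified_sample(strata, sizes, seed=7)
n_total = sum(len(units) for units in sample.values())   # n1 + n2 + n3 + n4 = n
print(n_total)
```

Here the subsample sizes are proportional to the stratum sizes (20% of each stratum); other allocations are possible.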

(28)

Cluster Sampling

The primary sampling unit is not the individual element, but a large cluster of elements. Either the cluster is randomly selected or the elements within are randomly selected.

Sometimes it is more cost-effective to select respondents in groups ('clusters'). Sampling is often clustered by geography, or by time periods. (Nearly all samples are in some sense 'clustered' in time.) For example, if surveying households within a city, we might choose to select 100 city blocks and then interview every household within the selected blocks.
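The city-block example can be sketched as one-stage cluster sampling. The city below (20 blocks of 5 households each) and the function name `cluster_sample` are hypothetical; only the blocks are selected at random, and every household in a selected block is kept.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Select whole clusters at random and keep every element inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # sample cluster names
    return {block: clusters[block] for block in chosen}

# Hypothetical city: 20 blocks, each a list of 5 household identifiers.
city = {f"block_{b:02d}": [f"hh_{b:02d}_{h}" for h in range(1, 6)] for b in range(1, 21)}

selected = cluster_sample(city, n_clusters=4, seed=3)
households = [hh for block in selected.values() for hh in block]
print(len(selected), len(households))
```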

(29)

(30)

Cluster Sampling



Advantages



Low cost/high frequency of use



Requires list of all clusters, but only of individuals within chosen clusters



Can estimate characteristics of both cluster and population



For multistage, has strengths of used methods



Disadvantages



Larger error for comparable size than other probability methods



Multistage very expensive and validity depends

(31)

(32)

Non-Probability Sampling Methods

(33)

Convenience sampling, or accidental sampling, is a type of nonprobability sampling in which the sample is drawn from that part of the population which is close to hand. That is, a sample is selected because it is readily available and convenient. People may be included simply because the researcher happens to meet them, or because they can be reached through technological means such as the internet or the phone. A researcher using such a sample cannot scientifically make generalizations about the total population, because the sample would not be representative enough. For example, if the interviewer were to conduct such a survey at a shopping center early in the morning on a given day, the people that he/she could interview would be limited to those present there at that time, which would not represent the views of other members of society in that area.

Convenience Sampling

Advantages: Very low cost, Extensively used/ understood, no need for list of population elements

(34)

 The sampling procedure in which an experienced researcher selects

the sample based on some appropriate characteristic of sample members… to serve a purpose.

 It involves selection of cases when we judge as most appropriate

ones for a given study. It is based on the judgment of a research

Judgment or Purposive Sample

 Advantages: Moderate cost,

Commonly used/ understood, Sample will meet a specific objective

 Disadvantages: Projecting

(35)

The sampling procedure that ensures that a certain characteristic of a population sample will be represented to the exact extent that the investigator desires.



Quota Sample

Advantages: moderate cost, very extensively used/understood, no need for list of population elements, introduces some elements of stratification.

Disadvantages: variability and bias cannot be measured or controlled (classification of subjects), projecting

(36)



The sampling procedure in which the initial respondents are chosen by probability or non-probability methods, and then additional respondents are obtained by information provided by the initial respondents.

Snowball sampling

Advantages: low cost, useful in specific circumstances, useful for locating rare populations

Disadvantages: bias because sampling units not independent,

(37)

(38)

To properly understand how to determine sample size, it helps to understand the following AXIOMS…

• The only perfectly accurate sample is a census.

• A probability sample will always have some inaccuracy (sample error).

• The larger a probability sample is, the more accurate it is (less sample error).

• You can take any finding in the survey, replicate the survey with the same probability sample plan and size, and you will be "very likely" to find the same result within the ± range of the original findings.

• In almost all cases, the accuracy (sample error) of a probability sample is independent of the size of the population.

• A probability sample can be a very tiny percentage of the population size and still be very accurate (have little sample error).

• The size of the probability sample depends on the client's desired accuracy (acceptable sample error) balanced against the cost of data collection for that sample size.

(39)

Determining Sample Size

There is only one method of determining sample size that allows the researcher to PREDETERMINE the accuracy of the sample results: The Confidence Interval Method of Determining Sample Size.

• Confidence interval: range whose endpoints define a certain percentage of the responses to a question

• Central limit theorem: a theory that holds that values taken from repeated samples of a survey within a population would look like a normal curve. The mean of all sample means is the mean of the population

(40)

Determining Sample Size

• Two types of error:

• Nonsampling error: pertains to all sources of error other than sample selection method and sample size.

• Sampling error: involves sample selection and sample size… this is the error that we are controlling through formulas

• Sample error formula:

$$\pm\,\text{Sample error (\%)} = z_{\alpha/2}\sqrt{\frac{p\,q}{n}}$$
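To see how the sample error shrinks as the sample grows, the formula can be evaluated for a few sample sizes. This is an illustrative sketch: it assumes the worst case p = q = 50% and z = 1.96 (95% confidence), and the function name `sample_error_pct` is hypothetical.

```python
import math

def sample_error_pct(n, p=50.0, z=1.96):
    """± sample error in percentage points for a share p (in %) and sample size n."""
    q = 100.0 - p
    return z * math.sqrt(p * q / n)

for n in (100, 400, 1000):
    print(n, round(sample_error_pct(n), 2))
```

Quadrupling the sample size from 100 to 400 only halves the error, which is why accuracy has to be balanced against the cost of data collection.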

(41)

(42)

• Variability: refers to how similar or dissimilar responses are to a given question

• P (%): share that "have" or "are" or "will do" etc.

• Q (%): 100%-P%, share of "have nots" or "are nots" or "won't dos" etc.

N.B.: The more variability in the population being studied, the larger the sample size needed to achieve stated accuracy level.

What data do you need to consider

• Variance or heterogeneity of population: Previous studies? Industry expectations? Pilot study? Sequential sampling

• Expect the worst case (p=50%; q=50%)

• Estimate variability: results of previous studies or conduct a pilot study

• Confidence level

(43)

With Nominal data (i.e. Yes, No), we can

(44)

The formula requires that we

(a.) specify the amount of confidence we wish to have,

(b.) estimate the variance in the population, and

(c.) specify the level of desired accuracy

(45)



























Sample Size Formula

• The sample size formula for estimating a proportion (also called a percentage or share):

Proportion

Finite population:

$$n = \frac{N z^2 p\,q}{(N-1)\,e^2 + z^2 p\,q}$$

Infinite population:

$$n = \frac{z^2 p\,q}{e^2}$$

Where:

n = sample size

N = population size

1-α = confidence

Z = Z(1-α/2) = standard error associated with the chosen level of confidence

p = estimated percent in the population

q = 1-p

e = acceptable sample error

Determining the necessary Sample Size for estimating a single population mean or a single population total with a specified level of precision:

Z at 90% confidence = 1.64

(46)

Additional correction for sampling finite populations

The above formula assumes that the population is very large compared to the proportion of the population that is sampled. If you are sampling more than 10% of the whole population then you should apply a correction to the sample size estimate that incorporates the finite population correction factor (FPC). This will reduce the sample size.

$$n' = \frac{n^*}{1 + \dfrac{n^*}{N}}, \qquad \text{applied when } \frac{n^*}{N} > 0.10$$

n' = the new FPC-corrected sample size.

n* = the corrected sample size from the sample size correction table.

N = the total size of the population.

(47)

The Confidence Interval Method of

Normal Distribution

(48)

Example 1

We'd like to find the satisfaction level of the students at AUCA. The Registration office has 1675 students registered in the 2014-II semester. We need to take a sample with 95% confidence and a 5% error of estimation.

• The information for this sampling example is presented:

Population size (N): 1675

Confidence (1-α): 0.95

Standard error associated with the chosen level of confidence, Z = Z(1-α/2): 1.96

Estimated percent in the population (p): 0.5

q = 1-p: 0.5

Strata (Faculty)    Population
Business            837
IT                  502
Education           250
Theology            86

(49)





Example 1

$$n_0 = \frac{N z^2 p\,q}{(N-1)\,e^2 + z^2 p\,q} = \frac{1675 \times 1.96^2 \times 0.50 \times 0.50}{(1675-1) \times 0.05^2 + 1.96^2 \times 0.50 \times 0.50} = 312.64 \approx 313 \text{ students}$$

If $\dfrac{n_0}{N} > 0.10$, adjust the sample size. Here $\dfrac{313}{1675} = 0.1869 > 0.10$, so:

$$n = \frac{n_0}{1 + \dfrac{n_0}{N}} = \frac{313}{1 + \dfrac{313}{1675}} = 263.72 \approx 264 \text{ students}$$

Strata (Faculty)    Population    Strata size (Population × 264/1675)
Business            837           132
IT                  502           79
Education           250           39
Theology            86            14
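The whole example can be retraced in a few lines. This is a sketch under the assumptions of the slide (z = 1.96, p = q = 0.5, e = 0.05, sample sizes rounded up, proportional allocation rounded to the nearest student); the function names are illustrative.

```python
import math

def sample_size_proportion(N, z, p, e):
    """Finite-population sample size for a proportion, before rounding."""
    q = 1 - p
    return N * z**2 * p * q / ((N - 1) * e**2 + z**2 * p * q)

def fpc_adjust(n0, N):
    """Finite population correction: n0 / (1 + n0 / N)."""
    return n0 / (1 + n0 / N)

strata = {"Business": 837, "IT": 502, "Education": 250, "Theology": 86}
N = sum(strata.values())                                  # 1675 students

n0 = math.ceil(sample_size_proportion(N, z=1.96, p=0.5, e=0.05))
# Apply the correction only when more than 10% of the population is sampled.
n = math.ceil(fpc_adjust(n0, N)) if n0 / N > 0.10 else n0
# Proportional allocation: each faculty gets its share of the n students.
allocation = {name: round(size * n / N) for name, size in strata.items()}
print(n0, n, allocation)
```

With other data, rounded allocations may not add up exactly to n, in which case one stratum has to be adjusted by hand.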

(50)

Sample Size Formula

Mean

Infinite population:

$$n = \frac{z^2 \sigma^2}{e^2}$$

Z is determined the same way (1.96 or 2.58)

"e" is expressed in terms of the units we are estimating, i.e. if we are measuring attitudes on a 1-7 scale, we may want our error to be no more than ± 0.5 scale units. If we are estimating dollars being paid for a product, we may want our error to be no more than ± $3.00.

(51)





Sample Size Formula

Finite population:

$$n_0 = \frac{N z^2 \sigma^2}{(N-1)\,e^2 + z^2 \sigma^2}$$

Adjust the sample size: if $\dfrac{n_0}{N} > 0.10$, then

$$n = \frac{n_0}{1 + \dfrac{n_0}{N}}$$

Estimating "σ" in the Formula to Determine the Sample Size Required to Estimate a Mean

Since we are estimating a mean, we can assume that our data are either interval or ratio. When we have interval or ratio data, the standard deviation of the sample, s, may be used as a measure of variability.

How to estimate σ?

• Use the standard deviation of the sample from a previous study on the target population

• Conduct a pilot study of a few members of the target population and



(52)

Example

An accounting firm wishes to form a 90 percent confidence interval for the population mean tax refund for its clients who receive refunds. How large a random sample is needed to be within $6 (error) of the actual amount if a preliminary study finds the standard deviation to be $42.67?

Solution

Data: 90% confidence (Z_α/2 = 1.64), e = $6, σ = $42.67

$$n = \frac{z^2 \sigma^2}{e^2} = \frac{1.64^2 \times 42.67^2}{6^2} = \frac{4897.032}{36} = 136.029 \approx 136 \text{ clients}$$
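The computation can be checked with a short sketch that uses the values from the example (z = 1.64, σ = 42.67, e = 6); the function name `sample_size_mean` is illustrative.

```python
def sample_size_mean(z, sigma, e):
    """Sample size for estimating a mean, infinite population: z^2 * sigma^2 / e^2."""
    return (z * sigma / e) ** 2

n_raw = sample_size_mean(z=1.64, sigma=42.67, e=6)
print(round(n_raw, 3))
```

The slides round this value to the nearest whole client. A more conservative convention, not used in the slides, is to always round up so that the error target is guaranteed to be met.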









(53)

Review problems of chapter

Question 1

• We are about to go on a recruitment drive to hire some auditors at the entry level. We need to decide on a competitive salary offer for these new auditors. From talking to some HR professionals, I've made a rough estimate that most new hires are getting starting salaries in the $38,000-42,000 range and the average (mean) is around $39,000. The standard deviation seems to be around $3,000.

• I want to be 95% confident about the average salary and I'm willing to tolerate an estimate that is within $500 (plus or minus) of the true value. If we're off, we can always adjust salaries at the end of the probation period.

• What sample size should we use?

(54)

2. We've just started a new educational TV program that teaches viewers all about research methods!!

• We know from past educational TV programs that such a program would likely capture 2 out of 10 viewers on a typical night.

• Let's say we want to be 99% confident that our obtained sample proportion of viewers will differ from the true population proportion by not more than 5%.

• What sample size do we need?

Answer:426

3. A study is to be performed to determine a certain parameter in a community. From a previous study a standard deviation ("sd") of 46 was obtained. If a sample error of up to 4 is to be accepted, how many subjects should be included in this study at the 99% level of confidence?

Answer:880

(55)

Examples

4.

Management wants to know customers' level of satisfaction with their service. They propose conducting a survey and asking for satisfaction on a scale from 1 to 10 (since there are 10 possible answers, the range = 10). Management wants to be 99% confident in the results (99 chances in 100 that the true value is captured) and they do not want the allowed error to be more than ± 0.5 scale points. s = 1.7 (from a pilot study). What is n?

Answer: 77

5. Five years ago a survey showed that 42% of consumers were aware of the company’s brand (Consumers were either “aware” or “not aware”)

• After an intense ad campaign, management will conduct another survey. They want to be 95% confident (95 chances in 100) that the survey estimate will be within ± 5% of the true share of "aware" consumers in the population. What is n?

(56)

Examples:

6. We wish to determine the required sample size, with 95% confidence and 5% error tolerance, for estimating the percentage of Rwandans preferring the federal Liberal party.

A recent poll showed that 40% of Rwandans questioned preferred the Liberals. What is the required sample size? Answer: 369

7. In a school there are 800 girls and 750 boys.

a. What sampling design is more convenient in this problem and explain how you would.

b. How many girls and how many boys would you include with 95% confidence and 5% error? Answer: (Girls = 133, Boys = 124)

8. A firm employs a number of staff in each of the three categories listed below:

18 managers

34 secretaries

204 production workers.

(57)

For example, if you want to test whether attending class influences how students perform on an exam, using test scores (from 0-100) as data would not be appropriate for a Chi-square test. However, arranging students into the categories "Pass" and "Fail" would. Additionally, the data in a Chi-square grid should not be in the form of percentages, or anything other than frequency (count) data. Thus, by dividing a class of 54 into groups according to whether they attended class and whether they passed the exam, you might construct a data set like this:

            Pass    Fail
Attended    25      6
Skipped     8       15
