Chapter II. 2020. Confidence Interval and Sampling.ppt

(1)

(2)

Introduction: Inferential Statistics

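[Figure: a sample (size n; statistics x̄, S², p̂) is drawn from the population (size N; parameters μ, σ², P) by representative sampling (sampling rate, sample size). Inference (parameter estimation, hypothesis testing) goes from the sample back to the population on the basis of probability.]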

• Which variables will we collect and how will they be measured?

• What types of data do they represent?

• What do we want to do? (describe, compare groups, predict change in “Y” following change in “X”)

(3)

Objective of Chapter

Develop the methodology of confidence intervals as a technique to analyze differences and make decisions, and determine the risks involved in making such decisions if we rely solely on information from the sample.

(4)

Inferential Statistics

Inferential statistics: the methods used to estimate a property of a population on the basis of a sample.

Statistical inference involves two main types of techniques: parameter estimation and hypothesis testing. Whatever the technique used, the overall purpose is to use data from a probability sample to draw conclusions about a population.

• Data is an observation about the variable being measured.

• A population consists of all subjects or objects about whom the study is being conducted.

(5)

Estimation

There are two types of estimators:

- Point estimates

- Estimates by interval

There are two types of inference: estimation and hypothesis testing.

The objective of estimation is to determine the approximate value of a population parameter on the basis of a sample statistic.

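Point estimates

- One sample: $\hat{\mu} = \bar{x}$, $\hat{\sigma}^2 = s^2$, $\hat{P} = p$

- Two samples: $\hat{\mu}_1 - \hat{\mu}_2 = \bar{x}_1 - \bar{x}_2$, $\hat{\sigma}_1^2 / \hat{\sigma}_2^2 = s_1^2 / s_2^2$, $\hat{P}_1 - \hat{P}_2 = p_1 - p_2$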

• A point estimate is a single number

• A confidence interval provides

(6)

In statistical inference, one wishes to estimate population parameters using observed sample data. A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. (Definition taken from Valerie J. Easton and John H. McColl's )

Confidence interval

A 95% confidence interval is often interpreted as indicating a range within which we can be 95% certain that the true effect lies. The strictly-correct interpretation of a confidence interval is based on the hypothetical notion of considering the results that would be obtained if the study were repeated many times.

Suppose we used the same sampling method to select different samples and computed a different interval estimate for each sample. Some interval estimates would include the true population parameter and some would not. A 95% confidence level means that about 95% of the intervals built this way would contain the true value. That is, we are likely to be wrong only 5% of the time.
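The repeated-sampling interpretation can be checked with a small simulation. The sketch below is illustrative only: the population mean, standard deviation, sample size and number of repetitions are assumed values, the population is assumed normal, and σ is treated as known so that a z-based interval can be used.

```python
import random
from statistics import NormalDist, mean


def coverage(true_mean=50.0, sigma=10.0, n=30, level=0.95, repeats=2000, seed=1):
    """Fraction of repeated-sample intervals that contain the true mean."""
    rng = random.Random(seed)
    # Critical value z for the chosen confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sigma / n ** 0.5  # sigma is treated as known (assumption)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        x_bar = mean(sample)
        if x_bar - margin <= true_mean <= x_bar + margin:
            hits += 1
    return hits / repeats


print(coverage())
```

The printed fraction should be close to 0.95: most intervals contain the true mean, and roughly 5% of them miss it.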

(7)

Confidence Interval Data Requirements

To express a confidence interval, you need three pieces of information: the statistic, the margin of error and the confidence level.

Given these inputs, the range of the confidence interval is defined by:

sample statistic ± margin of error. The uncertainty associated with the confidence interval is specified by the confidence level.

Often, the margin of error is not given; you must calculate it. There are three steps to constructing a confidence interval:

1. Identify a sample statistic. Choose the statistic (e.g. sample mean, sample proportion) that you will use to estimate a population parameter.

2. Find the margin of error. The margin of error (maximum error of estimate) is the maximum likely difference between the point estimate of a parameter and the actual value of the parameter. If it is not given, compute it from the following equation:

Margin of error = Critical value * Standard error of mean

3. Select a confidence level. The confidence level is the probability value associated with a confidence interval. It is often expressed as a percentage. For example, if α = 0.05, then the confidence level is equal to (1 - 0.05) = 0.95, i.e. a 95% confidence level. Often, researchers choose 90%, 95%, or 99% confidence levels, but any percentage can be used.

$$\text{Standard error of mean} = \frac{Sd}{\sqrt{n}}$$

(8)

$$P\left(\bar{x} - t_{\alpha/2}\,\frac{s}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{x} + t_{\alpha/2}\,\frac{s}{\sqrt{n}}\right) = 1 - \alpha$$

A confidence interval for a mean specifies a range of values within which the unknown population parameter, in this case the mean, may lie. For example, suppose you wanted to estimate the average IQ of a community and had randomly selected a sample of 200

Confidence Interval for a Mean when the Standard Deviation (σ) Is Unknown

Where (t depends on the level of confidence):

$\bar{x}$ = the sample mean

$t_{\alpha/2}$ = the critical value of the t distribution with n - 1 degrees of freedom, determined by the alpha level

$s/\sqrt{n}$ = the standard error of the mean

$1 - \alpha$ = the confidence level; it is the confidence coefficient, usually expressed as a percentage

(9)

Rongin Kwizera is the host of KXYZ Radio 55 AM drive-time news in Kigali. During his morning program, Rongin asks listeners to call in and discuss current local and national news. This morning, Rongin was concerned with the number of hours children under 12 years of age watch TV per day. The last 5 callers reported that their children watched the following number of hours of TV the previous day.

Data from sample of callers: 1.0, 2.3, 3.1, 0.30, 3.5

Would it be reasonable to develop a confidence interval from these data to show the mean number of hours of TV watched?

If yes, construct an appropriate confidence interval at the 95% confidence level, and interpret the result.

If no, why would a confidence interval not be appropriate?

Application Example

Steps in SPSS for Confidence Interval for One Sample t Test

(10)

(11)

Output of SPSS for Confidence Interval for One Sample t Test

The output is displayed in the Output window

One-Sample Statistics

Variable                                                      N    Mean     Std. Deviation   Std. Error Mean
Number of hours children under 12 years watches TV per day   5    2.0400   1.36308          .60959

One-Sample Test (Test Value = 0)

Variable                                                      t       df   Sig. (2-tailed)   Mean Difference   95% CI of the Difference: Lower   Upper
Number of hours children under 12 years watches TV per day   3.347   4    .029              2.04000           .3475                             3.7325

At the 95% confidence level, the mean number of hours that children under 12 years old watch TV per day is between 0.35 and 3.73 hours.
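The SPSS figures can be reproduced by hand. The following is a minimal Python sketch using only the standard library; the critical value 2.776 is an assumption taken from a t table (4 degrees of freedom, 95% two-sided), since the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

hours = [1.0, 2.3, 3.1, 0.30, 3.5]  # data from the five callers
T_CRIT = 2.776                       # assumed: t table value, df = 4, 95% two-sided

n = len(hours)
x_bar = mean(hours)                  # point estimate of the mean
s = stdev(hours)                     # sample standard deviation (divides by n - 1)
std_error = s / sqrt(n)              # standard error of the mean
margin = T_CRIT * std_error          # margin of error = critical value * standard error
lower, upper = x_bar - margin, x_bar + margin

print(f"mean={x_bar:.4f}  s={s:.5f}  SE={std_error:.5f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

The interval agrees with the SPSS output to two decimal places (0.35 to 3.73 hours).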

(12)

(13)

We have error in all samples

[Figure: a random sample drawn from a population; the two mean ages shown are 47 and 48.5.]

Difference = 1.5 = Estimation error

Example:

Planning

• Which variables will we collect and how will they be measured?

• What types of data do they represent?

• What do I want to do? Describe, compare groups, predict change in Y following change in X?

(14)

Why use a sample?

Note: in homogeneous populations, small samples are highly representative.

• Less cost
• Less field time
• Speed
• Accuracy
• Destruction of test units
• When it’s impossible to study the whole population

(15)

Sample Accuracy

• Sample accuracy: refers to how close a random sample’s statistic (e.g. mean ($\bar{x}$), variance (s²), proportion (p)) is to the population value it represents (mean (µ), variance (σ²), proportion (π or P)).

Important points:

• Sample size is NOT related to representativeness… you could sample 20,000 persons walking by a street corner and the results would still not represent the city; however, an “n” of 100 could be “right on.”

• Sample size, however, is related to accuracy. How close the sample statistic is to the actual population parameter (e.g. sample mean vs. population mean) is a function of sample size.

(16)

1. Definition of target population.

The target population is the total population for which the information is required.

Who has the information/data you need? How do you define your target population?

2. Selection of a sampling frame (list)

• List of elements

• Sampling frame error: an error that occurs when certain sample elements are not listed or available and are not represented in the sampling frame.

3. Probability or Nonprobability sampling

• Probability Sample:

A sampling technique in which every member of the population will have a known, nonzero probability of being selected.

• Non-Probability Sample:

• Units of the sample are chosen on the basis of personal judgment or convenience

• There are NO statistical techniques for measuring random sampling error in a non-probability sample. Therefore, generalizability is never statistically appropriate.

(17)

Steps (continued)

4. Sampling unit. It is necessary to decide on a sampling unit before selecting a sample. It can be a geographical unit (state, district, village, etc.), a construction unit (house, flat, etc.), a social unit (family, club, school, etc.), or an individual.

5. Error
– Random sampling error (chance fluctuations)
– Nonsampling error (design errors)

6. Sample type. The researcher decides the techniques to be used in selecting the items for the sample according to the characteristics of the population.

7. Determination of levels of inference. The level of uncertainty will also be determined by the sample size. Increasing the sample size will decrease the sampling error.

(18)

Classification of Sampling Methods

Probability samples:
• Simple random sampling
• Systematic sampling
• Stratified sampling
• Cluster sampling

Non-probability samples:
• Quota
• Convenience
• Snowball
• Judgment or purposive

(19)

Simple Random Sampling

A method for choosing cases from a population by which every case and every combination of cases has an equal chance of being included; it assumes a homogeneous population. To use this technique, we need a list of all elements, and cases are often selected by using tables of random numbers. These tables are lists of numbers that have no pattern (that is, they are random); an example of such a table is shown on the next slide.

Steps:

1. List the cases to be randomized in a column on a spreadsheet.

2. Assign a number to each case and select cases with the table. Begin by labeling the items in the population with numbers. For consistency, these numbers should all have the same number of digits. So if we have 100 elements in our population, we can use the numerical labels 001, 002, 003, . . ., 098, 099, 100. The general rule is that if the population size has k digits, we use k-digit labels and read k-digit numbers from the table of random numbers. Then start anywhere on the table and move in any direction (up, down, across, diagonally), selecting the cases whose labels come up.
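The steps above can be sketched in Python. Note the swap: instead of a printed table of random numbers, this sketch uses the standard library's pseudo-random generator (`random.sample`), which draws without replacement. The function name, the student names and the seed are hypothetical and only for illustration.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Label every case with an equal-width number, then draw n labels at random."""
    rng = random.Random(seed)
    width = len(str(len(population)))  # e.g. 48 cases -> two-digit labels
    labels = {f"{i:0{width}d}": case for i, case in enumerate(population, start=1)}
    chosen = rng.sample(sorted(labels), n)  # without replacement, equal chance
    return [(label, labels[label]) for label in chosen]


# Hypothetical list of 48 students, labeled 01, 02, ..., 48
students = [f"student_{i}" for i in range(1, 49)]
sample = simple_random_sample(students, 5, seed=2020)
print(sample)
```

Fixing the seed only makes the draw repeatable for checking; in practice the starting point should itself be chosen at random.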

(20)

(21)

Simple Random Sample: Example

Since there are a total of 48 students, and 48 is a two-digit number, every individual in the population is assigned a two-digit number beginning 01, 02, 03, . . . 46, 47, 48.

Answer:

The sample size is 5, and the students are: Hellene H., Lucille L., Gilles D., Flienne M. and Alain M.

(22)

Simple Random Sampling

Advantages
• Minimal knowledge of population needed
• External validity high; internal validity high; statistical estimation of error
• Easy to analyze data

Disadvantages
• High cost; low frequency of use
• Requires sampling frame
• Does not use researchers’ expertise

(23)

Systematic Sampling

Systematic sampling requires that the population be homogeneous and ordered. This involves first listing, in serial order, all the events, persons, objects or things in the whole population. After this, the population size (N) is divided by the sample size (n) to get the sampling interval K. An initial starting point is selected by a random process, and then every Kth case on the list is selected. Once the starting case is decided, all others are automatically selected.

Sampling interval: K = Population Size / Sample Size = N / n (each case has a probability of n/N = 1/K of being selected)

• Example: assuming you have a population of 1,500 people and your sample size is 100, the interval K is given by N/n = 1500/100 = 15. It means that every 15th position is automatically selected as part of the sample.

• Thus, numbers 15, 30, 45, etc. are already selected. You can even select any
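A minimal Python sketch of systematic selection, assuming a random start between 1 and K followed by every Kth position on the ordered list. The function name and the seed are hypothetical.

```python
import random


def systematic_sample(population_size, sample_size, seed=None):
    """Return the selected list positions (1-based) for a systematic sample."""
    rng = random.Random(seed)
    k = population_size // sample_size  # sampling interval K = N / n
    start = rng.randint(1, k)           # random starting point within the first interval
    return [start + i * k for i in range(sample_size)]


positions = systematic_sample(1500, 100, seed=7)
print(positions[:5])
```

With N = 1500 and n = 100 the interval is 15, so the selected positions are always 15 apart whatever the random start.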

(24)

Systematic Sampling

Advantages
• Moderate cost; moderate usage
• External validity high; internal validity high; statistical estimation of error
• Simple to draw sample; easy to verify

Disadvantages
• Periodic ordering
• Requires sampling frame

Example

(25)

Stratified Sampling

Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata". Each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected. Each stratum should be homogeneous.

Benefits to stratified sampling

You have differences in gender, occupation, income, socio-economic status, geographical location, qualifications, age, height, color, dialects etc.

Stratified sampling is appropriate when the population consists of a number of sub-groups which are homogeneous or contain members that share common characteristics, which need to be represented in the sample. Randomization is then used to select members from the subgroups in such a way that the proportion of each sub-group in the population is reflected in the sample.

(26)

Stratified Sampling

In a stratified sampling design, the steps are as follows. First, decide on the attribute by which to stratify. Second, determine how many values of that attribute occur in the population and, therefore, into how many groups or strata to divide the population (the following figure shows a stratified sampling design with 4 strata, L = 4). Once the subgroups are determined, the next step is to find the total population belonging to each stratum (N₁, N₂, N₃, N₄). Finally, take a random sample from each stratum (n₁, n₂, n₃, n₄). The sum of the subsamples constitutes the total sample (n₁ + n₂ + n₃ + n₄ = n).
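The stratified design can be sketched in Python, assuming proportional allocation (each stratum's share of the sample equals its share of the population) and a simple random draw inside each stratum. The strata sizes, names and seed below are hypothetical.

```python
import random


def proportional_allocation(strata_sizes, n):
    """Split a total sample size n across strata in proportion to their sizes N_h."""
    total = sum(strata_sizes.values())
    return {name: round(n * size / total) for name, size in strata_sizes.items()}


def stratified_sample(strata, n, seed=None):
    """Draw a simple random sample of the allocated size from every stratum."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in strata.items()}
    allocation = proportional_allocation(sizes, n)
    return {name: rng.sample(strata[name], allocation[name]) for name in strata}


# Hypothetical population with L = 4 strata (N1..N4 = 400, 300, 200, 100)
strata = {
    "stratum_1": list(range(400)),
    "stratum_2": list(range(300)),
    "stratum_3": list(range(200)),
    "stratum_4": list(range(100)),
}
sample = stratified_sample(strata, 100, seed=3)
print({name: len(units) for name, units in sample.items()})
```

Because each allocation is rounded separately, the subsample sizes may not add up exactly to n for other inputs; in that case one stratum is usually adjusted by hand.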

(27)

Cluster Sampling

The primary sampling unit is not the individual element, but a large cluster of elements. Either the cluster is randomly selected or the elements within are randomly selected.

If the population is large or the area is widespread, the researcher may decide to divide the area into zones reflecting these characteristics and then draw random samples from each of the identified zones.

(28)

Cluster Sampling - example

Sometimes it is more cost-effective to select respondents in groups ('clusters'). Sampling is often clustered by geography, or by time periods. (Nearly all samples are in some sense 'clustered' in time.) For example, if surveying households within a city, we might choose to select 100 city blocks and then interview every household within the selected blocks.
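A minimal Python sketch of the city-block example, assuming one-stage cluster sampling: blocks are drawn at random and every household in a selected block is interviewed. The city data, block names and seeds are made up for illustration.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep every element inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}


# Hypothetical city: 500 blocks, each with between 5 and 20 households
city_rng = random.Random(0)
city = {
    f"block_{b:03d}": [f"hh_{b:03d}_{h}" for h in range(city_rng.randint(5, 20))]
    for b in range(1, 501)
}

selected = cluster_sample(city, 100, seed=11)
households = [hh for block in selected.values() for hh in block]
print(len(selected), "blocks,", len(households), "households to interview")
```

Only the list of blocks is needed as a sampling frame here; no list of individual households is required before the blocks are chosen.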

(29)

(30)

Non-Probability Sampling Methods

As they are not truly representative, non-probability samples are less desirable than probability samples. However, a researcher may not be able to obtain a random or stratified sample, or it may be too expensive. A researcher may not care about generalizing to a larger population. The validity of non-probability samples can be increased by trying to approximate random selection, and by eliminating as many sources of bias as possible.

For this reason, this can be called biased sampling or non-random sampling.

(31)

Convenience sampling, or accidental sampling, is a type of nonprobability sampling in which the sample is drawn from the part of the population that is close to hand. That is, a sample is selected because it is readily available and convenient. People may be included when the researcher happens to meet them, or they may be found through technological means such as the internet or the phone. The researcher using such a sample cannot scientifically make generalizations about the total population, because the sample would not be representative enough. For example, if the interviewer were to conduct such a survey at a shopping center early in the morning on a given day, the people that he/she could interview would be limited to those present there at that time, and they would not represent the views of other members of society in that area.

Convenience Sampling

(32)

Judgment or Purposive Sample

• The sampling procedure in which an experienced researcher selects the sample based on some appropriate characteristic of sample members… to serve a purpose.

• This is necessary when the researcher is interested in certain specified characteristics. It ensures that only those who meet the required purpose, attributes or characteristics are selected.

(33)

Quota Sample

The sampling procedure that ensures that a certain characteristic of a population sample will be represented to the exact extent that the investigator desires.

Advantages: moderate cost, very extensively used/understood, no need for list of population elements, introduces some elements of stratification.

Disadvantages: variability and bias cannot be measured or controlled (classification of

(34)



Snowball sampling

The sampling procedure in which the initial respondents are chosen by probability or non-probability methods, and then additional respondents are obtained by information provided by the initial respondents.

Snowball sampling is often used to access low-incidence populations and individuals who are difficult for researchers to connect with. Therefore, these individuals who are willing to comply with the demands of the investigations are used. These are the volunteers who are willing and ready to cooperate with the

(35)

Sample Size

(36)

To properly understand how to determine sample size, it helps to understand the following AXIOMS…

• The only perfectly accurate sample is a census.

• A probability sample will always have some inaccuracy (sample error).

• The larger a probability sample is, the more accurate it is (less sample error).

• You can take any finding in the survey, replicate the survey with the same probability sample plan & size, and you will be “very likely” to find the same result within the ± range of the original findings.

• In almost all cases, the accuracy (sample error) of a probability sample is independent of the size of the population.

• A probability sample can be a very tiny percentage of the population size and still be very accurate (have little sample error).

• The size of the probability sample depends on the researcher’s desired accuracy (acceptable sample error) balanced against the cost of data collection for that sample size.

(37)

Determining Sample Size

There is only one method of determining sample size that allows the researcher to PREDETERMINE the accuracy of the sample results: The Confidence Interval Method of Determining Sample Size.

• Confidence interval: range whose endpoints define a certain percentage of the responses to a question

• Central limit theorem: a theory that holds that values taken from repeated samples of a survey within a population would look like a normal curve. The mean of all sample means is the mean of the population.

• Confidence interval approach: applies the concepts of

(38)

Determining Sample Size

Two types of error:

• Nonsampling error: pertains to all sources of error other than sample selection method and sample size.

• Sampling error: involves sample selection and sample size… this is the error that we are controlling through formulas.

• Sample error formula:

$$\pm\,\text{Sample error}\,\% = z_{\alpha/2}\sqrt{\frac{p\,q}{n}}$$
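A minimal Python sketch of the sample error calculation, assuming the usual form ± z·√(p·q/n) with p and q expressed in percent and z = 1.96 for 95% confidence. The function name and the example sample sizes are hypothetical.

```python
from math import sqrt


def sample_error_pct(p_pct, n, z=1.96):
    """Sample error in percentage points for a proportion p (in %) and sample size n."""
    q_pct = 100.0 - p_pct            # q = 100% - p
    return z * sqrt(p_pct * q_pct / n)


# Worst-case variability (p = q = 50%) for three sample sizes
for n in (100, 400, 1600):
    print(n, round(sample_error_pct(50, n), 2))
```

With p = 50%, quadrupling the sample from 100 to 400 halves the error from 9.8 to 4.9 percentage points. The population size does not appear in this formula.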



(39)

• Variability: refers to how similar or dissimilar responses are to a given question

• P (%): share that “have” or “are” or “will do” etc.

• Q (%): 100% - P%, share of “have nots” or “are not” or “won’t dos” etc.

N.B.: The more variability in the population being studied, the larger the sample size needed to achieve the stated accuracy level.

What data do you need to consider

• Heterogeneity of population

Previous studies? Industry expectations? Pilot study?

Sequential sampling

• Expect the worst case (p = 50%; q = 50%)

• Estimate variability: results of previous studies or conduct a pilot study

• Confidence level

Generally, we need to make judgments on all these variables

(40)

With Nominal data (i.e. Yes, No), we can

(41)

The formula requires that we

(a) specify the amount of confidence we wish to have,

(b) estimate the variability in the population, and

(c) specify the level of desired accuracy

(42)

Sample Size Formula: Proportion

• The sample size formula for estimating a proportion (also called a percentage or share):

Finite population:

$$n = \frac{N\,z_{\alpha/2}^{2}\,p\,q}{(N-1)\,e^{2} + z_{\alpha/2}^{2}\,p\,q}$$

Infinite population:

$$n = \frac{z_{\alpha/2}^{2}\,p\,q}{e^{2}}$$

Where:

n = sample size
N = population size
1-α = confidence
$z_{\alpha/2}$ = the critical value, the positive Z (e.g. 1.96 for a 95% confidence level; Z at 90% confidence = 1.64)
p = expected proportion in the population, based on previous studies or pilot studies
q = 1-p
e = absolute error or precision; has to be decided by the researcher

Determining the necessary Sample Size for estimating a single population mean or a single population total with a specified level of precision:

(43)

Example

We would like to find the level of student satisfaction at AUCA. The registration office has 2675 students registered in the 2018-2019 semester. Calculate the sample size with 95% confidence and 5% estimation error.

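(44)

Example solution

Sample size: n = ?
Population size: N = 2675
Confidence: 1-α = 0.95
Critical value associated with the chosen level of confidence: Z = Z(1-α/2) = 1.96
Estimated percent in the population: p = 0.5
1-p: q = 0.5
Estimation error: e = 0.05

$$n_0 = \frac{N\,z_{\alpha/2}^{2}\,p\,q}{(N-1)\,e^{2} + z_{\alpha/2}^{2}\,p\,q} = \frac{2675 \times 1.96^{2} \times 0.50 \times 0.50}{(2675-1) \times 0.05^{2} + 1.96^{2} \times 0.50 \times 0.50} = 336.03 \approx 336 \text{ students}$$

If $n_0 / N > 0.10$, the sample size is adjusted:

$$\frac{n_0}{N} = \frac{336}{2675} = 0.125 > 0.10$$

$$n = \frac{n_0}{1 + \dfrac{n_0}{N}} = \frac{336}{1 + \dfrac{336}{2675}} = 298.51 \approx 299 \text{ students}$$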
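The AUCA calculation can be checked with a short Python sketch. It assumes the finite-population formula for a proportion, followed by the adjustment n₀ / (1 + n₀/N) applied when n₀/N exceeds 0.10, as in the worked solution on this slide. The function names are hypothetical.

```python
from math import ceil


def n_proportion_infinite(z, p, e):
    """Sample size for a proportion, infinite population: z^2 * p * q / e^2."""
    return z ** 2 * p * (1 - p) / e ** 2


def n_proportion_finite(N, z, p, e):
    """Sample size for a proportion, finite population of size N."""
    q = 1 - p
    return N * z ** 2 * p * q / ((N - 1) * e ** 2 + z ** 2 * p * q)


def adjust_if_large_fraction(n0, N, threshold=0.10):
    """Apply n0 / (1 + n0 / N) when the sample is more than 10% of the population."""
    return n0 / (1 + n0 / N) if n0 / N > threshold else n0


N, z, p, e = 2675, 1.96, 0.5, 0.05       # values from the example
n0 = n_proportion_finite(N, z, p, e)      # first estimate
n = adjust_if_large_fraction(round(n0), N)
print(round(n0, 2), ceil(n))
```

The final size is rounded up because a fraction of a student cannot be sampled. For comparison, the infinite-population formula with the same inputs gives a larger figure, since it ignores N.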

(45)

Sample Size Formula: Mean

Infinite population (N → ∞):

$$n = \frac{z_{\alpha/2}^{2}\,\sigma^{2}}{e^{2}}$$

"Z" is determined in the same way as in the previous exercises (1.96 or 2.58 or 1.64).

“e” is expressed in terms of the units we are estimating, i.e. if we are measuring attitudes on a 1-7 scale, we may want our error to be no more than ± 0.5 scale units. If we are estimating dollars being paid for a product, we may want our error to be no more than ± $3.00.

(46)

Sample Size Formula: Mean

Finite population:

$$n = \frac{N\,z_{\alpha/2}^{2}\,\sigma^{2}}{(N-1)\,e^{2} + z_{\alpha/2}^{2}\,\sigma^{2}}$$

Estimating “σ” in the formula to determine the sample size required to estimate a mean

Since we are estimating a mean, we can assume that our data are either interval or ratio. When we have interval or ratio data, the standard deviation of the sample may be used as a measure of variability.

How to estimate σ?

• Use the standard deviation of the sample from a previous study of the target population.

• Or conduct a pilot sample of a few members of the target population and calculate “s” (s = sample standard deviation).

(47)

Example

Suppose you plan to sample transport employees to determine the average number of sick days per year. The following standard has been established: a 99% confidence level and an error of less than 2 days. Previous research has indicated that the standard deviation could be 6 days. What is the sample size required?

Solution:

Sample size: n = ?
Confidence: 1-α = 0.99
Critical value associated with the chosen level of confidence: Z = Z(1-α/2) = 2.58
Standard deviation (σ): 6 days
Acceptable sample error: e = 2 days

$$n = \frac{z_{\alpha/2}^{2}\,\sigma^{2}}{e^{2}} = \frac{2.58^{2} \times 6^{2}}{2^{2}} = 59.9076 \approx 60 \text{ employees}$$
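The sick-days calculation can be sketched in Python, assuming the infinite-population formula n = z²σ²/e² used in the solution. The finite-population variant is included for comparison. The function names are hypothetical.

```python
from math import ceil


def n_mean_infinite(z, sigma, e):
    """Sample size for estimating a mean, infinite population: (z * sigma / e)^2."""
    return (z * sigma / e) ** 2


def n_mean_finite(N, z, sigma, e):
    """Sample size for estimating a mean, finite population of size N."""
    return N * z ** 2 * sigma ** 2 / ((N - 1) * e ** 2 + z ** 2 * sigma ** 2)


n = n_mean_infinite(z=2.58, sigma=6, e=2)  # 99% confidence, sigma = 6 days, e = 2 days
print(round(n, 4), ceil(n))
```

The result is rounded up to the next whole employee. For a very large N the finite-population formula gives practically the same value.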

(48)

Assignment 2

1. The daily salaries of substitute teachers for eight local school districts are shown. What is the point estimate for the mean? Find and interpret the 90% confidence interval of the mean for the salaries of substitute teachers in the region. Answer: <55.47, 62.28>

60 56 60 55 70 55 60 55 Source: Pittsburgh Tribune Review

2. The number of unhealthy days based on the AQI (Air Quality Index) for a random sample of 10 metropolitan areas is shown. Construct and interpret a 98% confidence interval based on the data. Answer: <8.81, 58.19>

61 12 6 40 27 38 93 5 13 40 Source: N.Y. Times Almanac

3. We’ve just started a new educational TV program that teaches viewers all about research methods!

• We know from past educational TV programs that such a program would likely capture 2 out of 10 viewers on a typical night.

• Let’s say we want to be 95% confident that our obtained sample proportion of viewers will differ from the true population proportion by not more than 5%.

a. What sample size do we need? Answer: n = 245.86, or n = 246 viewers

b. What type of sampling would you recommend to obtain representativeness (probabilistic or non-probabilistic) and specify the method you would use according to the type of recommended sampling (for example: simple random, systematic, etc.)? Then, briefly, say the procedure to take the sample.

4. A survey estimated that 20% of all Americans aged 15 to 24 drove under the influence of drugs or alcohol. A similar survey is planned for Rwanda. According to the census data, the population in the study interval during 2012 was 2 141 460. (Source: Fourth Rwanda Population and Housing Census)

(49)

Assignment 2

Contd.

5. A company wants to know the degree of loyalty of its employees, who fall into three categories:

28 managers
44 secretaries
304 production workers

You take a sample from each stratum, with a 95% confidence level and a maximum sampling error of 7%.

a. How many managers, secretaries, and production workers should the sample contain?

Answer: (Managers = 10, Secretaries = 15, Production Workers = 104).

b. What type of sampling would you recommend to obtain representativeness (probabilistic or non-probabilistic) and specify the method you would use according to the type of recommended sampling (for example: simple random, systematic, etc.)? Then, briefly, say the procedure to take the sample.

6. A university credit union wants to know the proportion of cash withdrawals that exceed $50 at its ATM located in the student union building, with an error of ± 2 percent and a confidence level of 95 percent.

a. How large a sample is needed to estimate the proportion of withdrawals exceeding $50? Answer: n = 2401

b. For the same example, last year’s proportion of ATM withdrawals over $50 was 27 percent. If we used this estimate in our calculation, the required sample size would be…?

Answer: n = 1892.948, or n = 1893