Chapter 2. Confidence Interval and Sampling. 2018.ppt

(1)

(2)

Introduction: Inferential Statistics

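[Figure: a sample of size n is drawn from a population of size N by probability sampling (representative sample, sampling rate, sample size); inference (parameter estimation and hypothesis testing) runs from the sample back to the population. Sample statistics shown: x̄, S², p̂. Population parameters shown: µ, σ², P.]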

• Which variables will we collect and how will they be measured?

• What types of data do they represent?

• What do we want to do: describe, compare groups, or predict the change in “Y” following a change in “X”?

(3)

Objective of the Chapter

To develop the methodology of confidence intervals as a technique to analyze differences and make decisions, and to determine the risks involved in making such decisions if we rely solely on information from the […]

(4)

Inferential Statistics

Inferential statistics: the methods used to estimate a property of a population on the basis of a sample.

Statistical inference involves two main types of techniques: parameter estimation and hypothesis testing. Whatever the technique used, the overall purpose is to use data from a probability sample to draw conclusions about a population.

• Data are observations of the variable being measured.

• A population consists of all subjects or objects about which the study is being conducted.

(5)

In statistical inference, one wishes to estimate population parameters using observed sample data. A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. (Definition taken from Valerie J. Easton and John H. McColl's )

Parameter Estimation: Confidence interval

A 95% confidence interval is often interpreted as indicating a range within which we can be 95% certain that the true effect lies. The strictly-correct interpretation of a confidence interval is based on the hypothetical notion of considering the results that would be obtained if the study were repeated many times.

The confidence level tells you how sure you can be. It is expressed as a percentage and represents how often the true percentage of the population who would pick an answer lies within the confidence interval. The 95% confidence level means you can be 95% certain; the 99% confidence level means you can be 99% certain. Most researchers use the 95% confidence level.

(6)

Confidence Interval Data Requirements

To express a confidence interval, you need three pieces of information: the statistic, the margin of error, and the confidence level.

Given these inputs, the range of the confidence interval is defined by: sample statistic ± margin of error. The uncertainty associated with the confidence interval is specified by the confidence level.

Often, the margin of error is not given; you must calculate it. There are three steps to constructing a confidence interval:

1. Identify a sample statistic. Choose the statistic (e.g. sample mean, sample proportion) that you will use to estimate a population parameter;

2. Compute the margin of error. The margin of error (maximum error of estimate) is the maximum likely difference between the point estimate of a parameter and the actual value of the parameter. When it is not given, compute it from the following equation: Margin of error = Critical value × Standard error of the mean

3. Select a confidence level. The confidence level is the probability value associated with a confidence interval. It is often expressed as a percentage. For example, if α = 0.05, then the confidence level is equal to (1 − 0.05) = 0.95, i.e. a 95% confidence level. Often, researchers choose 90%, 95%, or 99% confidence levels.

$$\text{Standard error of the mean} = \frac{Sd}{\sqrt{n}}$$

(7)







Confidence Interval for a Mean when the Standard Deviation (σ) is Unknown

A confidence interval for a mean specifies a range of values within which the unknown population parameter, in this case the mean, may lie. For example, suppose you wanted to estimate the average IQ of a community and had randomly selected a sample of 200 residents. The interval is calculated as:

$$P\left(\bar{x} - t_{\alpha/2}\,\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2}\,\frac{s}{\sqrt{n}}\right) = 1 - \alpha$$

Where (t depends on the level of confidence):

$\bar{x}$ = the sample mean

$t_{\alpha/2}$ = the critical value of the t distribution (n − 1 degrees of freedom) determined by the alpha level

$s/\sqrt{n}$ = the standard error of the mean

$(1-\alpha)\cdot 100\%$ = the confidence level, i.e. the confidence coefficient expressed as a percentage

$t_{\alpha/2}\, s/\sqrt{n}$ = the margin of error

(8)

Application Example

Rongin Kwizera is the host of KXYZ Radio 55 AM drive-time news in Kigali. During his morning program, Rongin asks listeners to call in and discuss current local and national news. This morning, Rongin was concerned with the number of hours children under 12 years of age watch TV per day. The last 5 callers reported that their children watched the following number of hours of TV the previous day:

Data from sample of callers:

1.0, 2.3, 3.1, 0.30, 3.5

Would it be reasonable to develop a confidence interval from these data to show the mean number of hours of TV watched?

If yes, construct an appropriate confidence interval at the 95% confidence level, and interpret the result.

If no, why would a confidence interval not be appropriate?

Answer:

For these data the mean is 2.04, the standard deviation is 1.36 and n = 5. With t = 2.776 (95% confidence, n − 1 = 4 degrees of freedom), the computation using the formula is:

$$P\left(2.04 - 2.776\,\frac{1.36}{\sqrt{5}} \le \mu \le 2.04 + 2.776\,\frac{1.36}{\sqrt{5}}\right) = 0.95$$

$$0.35 \le \mu \le 3.73$$

Note: this example requires that the variable is normally distributed.

Interpretation: At 95% confidence, the mean time that children under 12 years old watch TV is between 0.35 and 3.73 hours per day.
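The calculation in this example can be checked with a short script. This is a minimal sketch using only the Python standard library; the function name `t_confidence_interval` is illustrative, and the critical value 2.776 is passed in by hand (it is the t value for 95% confidence with 4 degrees of freedom used in the example) rather than looked up by the code.

```python
import math
import statistics


def t_confidence_interval(data, t_critical):
    """Return (lower, upper) for the mean: x_bar ± t * s / sqrt(n)."""
    n = len(data)
    mean = statistics.mean(data)
    sd = statistics.stdev(data)       # sample standard deviation (n - 1 denominator)
    standard_error = sd / math.sqrt(n)
    margin = t_critical * standard_error
    return mean - margin, mean + margin


# Hours of TV reported by the 5 callers in the example.
hours = [1.0, 2.3, 3.1, 0.30, 3.5]

# Assumption: 2.776 is taken from a t table (95% confidence, df = 4).
lower, upper = t_confidence_interval(hours, t_critical=2.776)
print(f"95% CI for the mean: {lower:.2f} to {upper:.2f} hours")
```

The same function works for any other confidence level, provided the matching critical value from the t table is supplied.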

(9)

Steps in SPSS for Confidence Interval for One Sample t Test

Analyze > Compare Means > One-Sample T Test > Double-click Hours to move it to the Test Variable(s) box > Click OK. The output is displayed in the Output window.

(10)

Output of SPSS for Confidence Interval for One Sample t Test

Variable: Number of hours children under 12 years watches TV per day

One-Sample Statistics

N    Mean      Std. Deviation    Std. Error Mean
5    2.0400    1.36308           .60959

One-Sample Test (Test Value = 0)

t        df    Sig. (2-tailed)    Mean Difference    95% CI of the Difference: Lower    Upper
3.347    4     .029               2.04000            .3475                              3.7325

At 95% confidence, the mean time that children under 12 years old watch TV is between 0.35 and 3.73 hours per day.

The average hours that children under 12 year old watch TV is 2.04 hours per day, and the variability

(11)


Probability and Non Probability sampling

Probability sampling is a sampling technique in which every subject of the population has a known, nonzero chance (in simple random sampling, an equal chance) of being selected for a representative sample.

(12)

We have error in all samples

Example: [Figure: a random sample drawn from a population; the two mean ages shown are 47 and 48.5.] Difference = 1.5 = estimation error.

Planning

• Which variables will we collect and how will they be measured?

• What types of data do they represent?

• What do I want to do: describe, compare groups, or predict the change in Y following a change in X?

(13)

Why use a sample?

Sampling is important because it is impossible to (observe, interview, survey, etc.) an entire population. ... In the context of research, sampling is the method one uses to gather and select, to sample, data. A data sample is a set of data collected from a statistical population by a defined procedure.

• Less cost

• Less field time

• Speed

• Accuracy

• Destruction of test units

• When it’s impossible to study the whole population

(14)

Sample Accuracy

• Sample accuracy: refers to how close a random sample’s statistic (e.g. mean (x̄), variance (s²), proportion (p̂)) is to the population’s value it represents (mean (µ), variance (σ²), proportion (π or P)).

• Important points:

• Sample size is NOT representativeness: you could sample 20,000 persons walking by a street corner and the results would still not represent the city; however, an “n” of 100 could be “right on.”

• Sample size, however, is related to accuracy. How close the sample statistic is to the actual population parameter (e.g. sample mean vs. population mean) is a function of sample size.
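The link between sample size and accuracy can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is a made-up list of 20,000 ages (mean about 48.5, standard deviation about 12), and the function name `spread_of_sample_means` is hypothetical. It draws many random samples of a given size and measures how much the sample means vary around the population mean.

```python
import random
import statistics


def spread_of_sample_means(population, n, repetitions=500, seed=0):
    """Standard deviation of the means of repeated random samples of size n."""
    rng = random.Random(seed)
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repetitions)]
    return statistics.stdev(means)


# Assumption: a hypothetical population of ages, generated only for illustration.
rng = random.Random(42)
population = [rng.gauss(48.5, 12) for _ in range(20000)]

spread_small = spread_of_sample_means(population, n=25)
spread_large = spread_of_sample_means(population, n=400)
print(f"n = 25:  sample means vary by about {spread_small:.2f} years")
print(f"n = 400: sample means vary by about {spread_large:.2f} years")
```

With probability samples like these, the larger sample gives means that sit closer to the population mean. The simulation says nothing about representativeness: a large sample drawn from the wrong place would still be biased.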

(15)


Sampling design: Steps

1. Definition of target population. The target population is the total population for which the information is required. Who has the information/data you need? How do you define your target population?

2. Selection of a sampling frame (list)

• List of elements

• Sampling frame error: error that occurs when certain sample elements are not listed or available and are not represented in the sampling frame

3. Probability or nonprobability sampling

• Probability sample: a sampling technique in which every member of the population has a known, nonzero probability of being selected.

• Non-probability sample: units of the sample are chosen on the basis of personal judgment or convenience.

(16)

4. Sampling unit. It is necessary to decide on a sampling unit before selecting a sample. It can be a geographical one (state, district, village, etc.), a construction unit (house, flat, etc.), a social unit (family, club, school, etc.), or an individual.

5. Error

– Random sampling error (chance fluctuations)

– Nonsampling error (design errors)

6. Sample type. The researcher decides the techniques to be used in selecting the items for the sample according to the characteristics of the population.

7. Determination of levels of inference. The level of uncertainty will also be determined by the sample size. Increasing the sample size will decrease the sampling error.

(17)

Classification of Sampling Methods

Probability samples:

• Simple Random

• Systematic Sampling

• Stratified Sampling

• Cluster Sampling

Non-probability samples:

• Convenience

• Judgment

• Quota

• Snowball

(18)

Simple Random Sampling

A method for choosing cases from a population by which every case and every combination of cases has an equal chance of being included. For this technique the population should be homogeneous. To use it, we need a list of all elements, and cases are often selected by using tables of random numbers. These tables are lists of numbers that have no pattern (that is, they are random); an example of such a table is shown on the next slide.

Steps:

1. List the cases to be randomized in a column on a spreadsheet.

2. Label each case in the population with a number. For consistency, all labels should have the same number of digits. So if we have 100 elements in our population, we can use the labels 001, 002, 003, . . ., 099, 100. As a general rule, if the population size N has d digits, use d-digit labels and read d-digit numbers from the table of random numbers. Then select the cases with the table: start anywhere on the table, move in any direction (up, down, across, diagonally), and take the cases whose labels come up.
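The steps above can also be carried out with a pseudo-random number generator in place of a printed table of random numbers. The following is a minimal sketch with the Python standard library; the frame of 48 two-digit labels and the seed value are assumptions chosen only for illustration, and `simple_random_sample` is a hypothetical helper name.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct cases so that every case has an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(frame, n)   # sampling without replacement


# Assumption: a sampling frame of 48 cases labelled 01, 02, ..., 48.
frame = [f"{i:02d}" for i in range(1, 49)]

chosen = simple_random_sample(frame, n=5, seed=2018)
print("Selected labels:", sorted(chosen))
```

Fixing the seed makes the draw reproducible, which plays the same role as recording the starting point used in the random number table.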

(19)

(20)

Simple Random Sample: Example

Since there are a total of 48 students, and 48 is a two-digit number, every individual in the population is assigned a two-digit number: 01, 02, 03, . . ., 46, 47, 48.

Answer: The random number table gives the identification numbers of the students that we must take. In our example, the table selects students 20, 27, 29, 39 and 40. The sample size is 5, and the students are: Hellene H., Lucille L., Gilles D., Flienne M. and Alain M.

(21)

Simple Random Sampling

Advantages

• Minimal knowledge of population needed

• External validity high; internal validity high; statistical estimation of error

• Easy to analyze data

Disadvantages

• High cost; low frequency of use

• Requires sampling frame

• Does not use researchers’ expertise

(22)

Systematic Sampling

A systematic sample requires that the population be homogeneous and ordered. This involves first listing, in serial order, all the events, persons, objects or things in the whole population. After this, the population size (N) is divided by the sample size (n) to get the sampling interval k. An initial starting point is selected by a random process, and then every kth element on the list is selected. Once the starting point is decided, all other cases are automatically selected.

Sampling interval: k = Population size / Sample size = N/n (each element's probability of selection is then 1/k = n/N).

Example: assuming you have a population of 1,500 people and your sample size is 100, the interval is k = N/n = 1500/100 = 15. This means that every 15th position on the list is automatically selected as part of the sample.

Thus, starting from position 15, numbers 15, 30, 45, etc. are selected. You can even select
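The selection rule in this example can be sketched in a few lines. This is an illustrative sketch only: `systematic_sample` is a hypothetical helper, the population is simply the numbers 1 to 1,500, and the random start is drawn from the first interval, which is one common way of choosing the starting point.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Select every k-th element of an ordered frame after a random start."""
    k = len(frame) // n              # sampling interval k = N / n
    rng = random.Random(seed)
    start = rng.randrange(k)         # random position within the first interval (0-based)
    return [frame[start + i * k] for i in range(n)]


# Population of 1,500 numbered people and a sample of 100, as in the example.
population = list(range(1, 1501))
sample = systematic_sample(population, n=100, seed=1)
print("Interval k:", len(population) // 100)
print("First selections:", sample[:3])
```

If the list has a periodic pattern with the same period as k, the sample can be badly unrepresentative, which is why the ordering of the frame matters.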

(23)

Systematic Sampling

Advantages

• Moderate cost; moderate usage

• External validity high; internal validity high; statistical estimation of error

• Simple to draw sample; easy to verify

Disadvantages

• Periodic ordering

• Requires sampling frame

(24)

Stratified Sampling

Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata". Each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected. Each stratum should be homogeneous.

Benefits of stratified sampling

It is useful when the population differs in gender, occupation, income, socio-economic status, geographical location, qualifications, age, height, color, dialects, etc.

Stratified sampling is appropriate when the population consists of a number of sub-groups which are homogeneous or contain members that share common characteristics, which need to be represented in the sample. Randomization is then used to select members from the subgroups in such a way that the proportion of each sub-group in the population is reflected in the sample.

(25)

Stratified Sampling

In a stratified sampling design, the first step is to choose the attribute on which to stratify. The second is to determine how many values that attribute takes in the population and, therefore, into how many groups or strata the population is divided (the following figure shows a stratified sampling design with 4 strata, L = 4). Once the subgroups are determined, the next step is to find the population size of each stratum (N₁, N₂, N₃, N₄) and, finally, we take a random sample from each stratum (n₁, n₂, n₃, n₄). The sum of the subsamples constitutes our total sample (n₁ + n₂ + n₃ + n₄ = n).
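A stratified design with four strata can be sketched as follows. The stratum sizes (400, 300, 200, 100) and the total sample size of 100 are made-up numbers, the helper names are hypothetical, and the sketch assumes proportional allocation (each stratum's share of the sample equals its share of the population), which is one common allocation rule.

```python
import random


def proportional_allocation(stratum_sizes, n):
    """Split a total sample size n across strata in proportion to their sizes."""
    total = sum(stratum_sizes.values())
    # Rounding can make the parts sum to slightly more or less than n.
    return {name: round(n * size / total) for name, size in stratum_sizes.items()}


def stratified_sample(strata, n, seed=None):
    """Draw a simple random sample inside every stratum."""
    rng = random.Random(seed)
    sizes = {name: len(units) for name, units in strata.items()}
    allocation = proportional_allocation(sizes, n)
    return {name: rng.sample(units, allocation[name]) for name, units in strata.items()}


# Assumption: four hypothetical strata with N1..N4 = 400, 300, 200, 100.
strata = {
    "stratum_1": [f"s1-{i}" for i in range(400)],
    "stratum_2": [f"s2-{i}" for i in range(300)],
    "stratum_3": [f"s3-{i}" for i in range(200)],
    "stratum_4": [f"s4-{i}" for i in range(100)],
}

subsamples = stratified_sample(strata, n=100, seed=3)
for name, units in subsamples.items():
    print(name, len(units))
```

The subsample sizes n₁ to n₄ printed here add up to the total sample n.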

(26)

Cluster Sampling

The primary sampling unit is not the individual element, but a large cluster of elements. Either the cluster is randomly selected or the elements within are randomly selected.

If the population is large or the area is widespread, the researcher may decide to zone the area to reflect these characteristics and then draw random samples from each of the identified zones.

(27)

Cluster Sampling - example

Sometimes it is more cost-effective to select respondents in groups ('clusters'). Sampling is often clustered by geography, or by time periods. (Nearly all samples are in some sense 'clustered' in time.) For example, if surveying households within a city, we might choose to select 100 city blocks and then interview every household within the selected blocks.
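The city-block example can be sketched as a one-stage cluster sample. Everything about the city below is an assumption made for illustration (500 blocks with 10 households each), and `cluster_sample` is a hypothetical helper: it picks whole blocks at random and keeps every household inside them.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep every element inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}


# Assumption: a hypothetical city of 500 blocks, each with 10 households.
city = {
    f"block-{b:03d}": [f"block-{b:03d}/household-{h}" for h in range(1, 11)]
    for b in range(1, 501)
}

selected_blocks = cluster_sample(city, n_clusters=100, seed=7)
households = [h for block in selected_blocks.values() for h in block]
print(len(selected_blocks), "blocks selected,", len(households), "households to interview")
```

Only a list of blocks is needed up front; household lists are required only for the blocks that were selected.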

(28)

Cluster Sampling

Advantages

• Low cost/high frequency of use

• Requires list of all clusters, but only of individuals within chosen clusters

• Can estimate characteristics of both cluster and population

• For multistage, has strengths of used methods

Disadvantages

• Larger error for comparable size than other probability methods

• Multistage very expensive and validity depends

(29)

(30)

Non-Probability Sampling Methods

As they are not truly representative, non-probability samples are less desirable than probability samples. However, a researcher may not be able to obtain a random or stratified sample, or it may be too expensive. A researcher may not care about generalizing to a larger population. The validity of non-probability samples can be increased by trying to approximate random selection, and by eliminating as many sources of bias as possible.

(31)

Convenience Sampling

Convenience sampling, or accidental sampling, is a type of nonprobability sampling in which the sample is drawn from the part of the population that is close to hand. That is, a sample is selected because it is readily available and convenient. People may be included when the researcher happens to meet them, or may be found through technological means such as the internet or the phone. A researcher using such a sample cannot scientifically make generalizations about the total population, because the sample would not be representative enough. For example, if an interviewer were to conduct such a survey at a shopping center early in the morning on a given day, the people he or she could interview would be limited to those present there at that time, who would not represent the views of other members of society in that area.

(32)

Judgment or Purposive Sample

• The sampling procedure in which an experienced researcher selects the sample based on some appropriate characteristic of sample members… to serve a purpose.

• This is necessitated when the researcher is interested in certain specified characteristics. It ensures that only those that meet the required purpose, attributes or characteristics are selected.

(33)

Quota Sample

The sampling procedure that ensures that a certain characteristic of a population sample will be represented to the exact extent that the investigator desires.

Advantages: moderate cost, very extensively used/understood, no need for list of population elements, introduces some elements of stratification.

Disadvantages: variability and bias cannot be measured or controlled (classification of subjects), projecting

(34)



Snowball sampling

The sampling procedure in which the initial respondents are chosen by non-probability methods, and then additional respondents are obtained by information provided by the initial respondents.

This is used when not every member of the population can comply with the demands of the investigation. Therefore, the individuals who are willing to comply with the demands of the investigation are used. These are the volunteers who are willing and ready to cooperate with the

(35)

(36)

Sample Size

(37)

To properly understand how to determine sample size, it helps to understand the following AXIOMS…

• The only perfectly accurate sample is a census.

• A probability sample will always have some inaccuracy (sample error).

• The larger a probability sample is, the more accurate it is (less sample error).

• You can take any finding in the survey, replicate the survey with the same probability sample plan & size, and you will be “very likely” to find the same result within the ± range of the original findings.

• In almost all cases, the accuracy (sample error) of a probability sample is independent of the size of the population.

• A probability sample can be a very tiny percentage of the population size and still be very accurate (have little sample error).

• The size of the probability sample depends on the researcher’s desired accuracy (acceptable sample error) balanced against the cost of data collection for that sample size.

(38)

Determining Sample Size

There is only one method of determining sample size that allows the researcher to PREDETERMINE the accuracy of the sample results: the Confidence Interval Method of Determining Sample Size.

• Confidence interval: range whose endpoints define a certain percentage of the responses to a question.

• Confidence interval approach: applies the concepts of accuracy, variability, and confidence interval to create a “correct” sample size.

• Central limit theorem: a theory that holds that values taken from repeated samples of a survey within a population would look like a normal curve. The mean of all sample means is the mean of the population.

• Two types of error:

• Nonsampling error: pertains to all sources of error other than sample selection method and sample size.

• Sampling error: involves sample selection and sample size… this is the error that we are controlling through formulas.

• Sample error formula:

$$\pm\,\text{Sample error }\% = z\,\sqrt{\frac{p\,q}{n}}$$
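As a rough illustration of the sample error formula, the sketch below assumes the usual form, sample error % = z · √(p·q/n), with p and q expressed as percentages and z the critical value for the chosen confidence level (1.96 for 95%). The function name `sample_error_pct` and the example figures are assumptions, not values from the slides.

```python
import math


def sample_error_pct(p_pct, n, z=1.96):
    """Sample error in percentage points for a proportion p (given in %)."""
    q_pct = 100 - p_pct
    return z * math.sqrt(p_pct * q_pct / n)


# Assumption: worst-case variability (p = q = 50%) at 95% confidence.
error_1000 = sample_error_pct(50, n=1000)
error_4000 = sample_error_pct(50, n=4000)
print(f"n = 1000: ± {error_1000:.2f} percentage points")
print(f"n = 4000: ± {error_4000:.2f} percentage points")
```

Because n sits under a square root, quadrupling the sample size only halves the sample error.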

(39)

Determining Sample Size

•

Variability: refers to how similar or dissimilar responses are to a given question

• P (%): share that “have” or “are” or “will do” etc.

• Q (%): 100% − P%, share of “have nots” or “are nots” or “won’t dos” etc.

N.B.: The more variability in the population being studied, the larger the sample size needed to achieve a stated accuracy level.

What data do you need to consider?

• Variance or heterogeneity of population: Previous studies? Industry expectations? Pilot study? Sequential sampling

• Expect the worst case (p = 50%; q = 50%)

• Estimate variability: results of previous studies or conduct a pilot study

• Confidence level

(40)

With Nominal data (i.e. Yes, No), we can

(41)

The formula requires that we

(a) specify the amount of confidence we wish to have,

(b) estimate the variance in the population, and

(c) specify the level of desired accuracy

(42)



























Sample Size Formula

• The sample size formula for estimating a proportion (also called a percentage or share):

Proportion, finite population:

$$n = \frac{N\, z_{\alpha/2}^{2}\, p\, q}{e^{2}\,(N-1) + z_{\alpha/2}^{2}\, p\, q}$$

Proportion, infinite population:

$$n = \frac{z_{\alpha/2}^{2}\, p\, q}{e^{2}}$$

Where:

n = sample size

N = population size

1 − α = confidence

$z_{\alpha/2}$ = the critical value, the positive Z (e.g. 1.96 for 95% confidence level; Z at 90% confidence = 1.64)

p = expected proportion in the population, based on previous studies or pilot studies

q = 1 − p

e = absolute error or precision; has to be decided by the researcher

Determining the necessary sample size for estimating a single population mean or a single population total with a specified level of precision:



(43)

Example

We’d like to find the satisfaction level of the students at AUCA. The Registration office has 1675 students registered in the 2014-II semester. We need to take a sample with 95% confidence and a 5% error of estimation.

(44)





Example

Sample size (n): ?
Population size (N): 1675
Confidence (1 − α): 0.95
Standard error associated with the chosen level of confidence (Z = Z(1 − α/2)): 1.96
Estimated percent in the population (p): 0.5
1 − p (q): 0.5

$$n_0 = \frac{N\, z_{\alpha/2}^{2}\, p\, q}{e^{2}\,(N-1) + z_{\alpha/2}^{2}\, p\, q}$$

$$n_0 = \frac{1675 \times 1.96^{2} \times 0.50 \times 0.50}{0.05^{2}\,(1675-1) + 1.96^{2} \times 0.50 \times 0.50} = 312.64 \approx 313 \text{ students}$$

If $n_0/N = 313/1675 = 0.1869 > 0.10$, then:

$$n = \frac{n_0}{1 + \dfrac{n_0}{N}} = \frac{313}{1 + \dfrac{313}{1675}} = 263.72 \approx 264 \text{ students}$$
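The arithmetic of this example can be reproduced with a short script. This is a sketch, not a reference implementation: the function names are hypothetical, the finite-population formula is the standard one for a proportion, and the second function mirrors the slide's extra adjustment n₀ / (1 + n₀/N), which the example applies when the sampling fraction n₀/N exceeds 0.10.

```python
import math


def sample_size_proportion(p, e, z, N=None):
    """Sample size for estimating a proportion (N=None means infinite population)."""
    q = 1 - p
    if N is None:
        return z**2 * p * q / e**2
    return N * z**2 * p * q / (e**2 * (N - 1) + z**2 * p * q)


def adjust_for_sampling_fraction(n0, N, threshold=0.10):
    """Adjustment used in the example: shrink n0 when n0 / N exceeds the threshold."""
    if n0 / N > threshold:
        return n0 / (1 + n0 / N)
    return n0


# Values from the example: N = 1675, 95% confidence (z = 1.96), e = 0.05, p = q = 0.5.
n0_raw = sample_size_proportion(p=0.5, e=0.05, z=1.96, N=1675)
n0 = math.ceil(n0_raw)                       # round up to whole students
n_raw = adjust_for_sampling_fraction(n0, N=1675)
n = math.ceil(n_raw)
print(f"n0 = {n0_raw:.2f} -> {n0} students; adjusted n = {n_raw:.2f} -> {n} students")
```

Sample sizes are rounded up rather than to the nearest integer so that the stated error is not exceeded.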

(45)

Sample Size Formula for numeric data

Mean, infinite population (N → ∞):

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{e}\right)^{2}$$

• Z is determined the same way (1.96 or 2.58).

• “e” is expressed in terms of the units we are estimating, i.e. if we are measuring attitudes on a 1-7 scale, we may want our error to be no more than ± .5 scale units. If we are estimating dollars being paid for a product, we may want our error to be no more than ± $3.00.

• σ is a little more difficult to estimate, but must be in the same units as “e”.

(46)

Sample Size Formula when the data are numeric

Mean, finite population:

$$n = \frac{N\, z_{\alpha/2}^{2}\, \sigma^{2}}{e^{2}\,(N-1) + z_{\alpha/2}^{2}\, \sigma^{2}}$$

Estimating “σ” in the Formula to Determine the Sample Size Required to Estimate a Mean

Since we are estimating a mean, we can assume that our data are either interval or ratio. When we have interval or ratio data, the standard deviation of the sample, s, may be used as a measure of variability.

How to estimate σ?

• Use the standard deviation of the sample from a previous study on the target population

• Conduct a pilot study of a few members of the target population and calculate “s”

(47)

Example

Suppose you are planning to sample transportation employees to determine average annual sick days. The following standards have been set: a confidence level of 99% and an error of fewer than 2 days. Past research has indicated the standard deviation should be 6 days. What is the required sample size?

Sample size (n): ?
Confidence (1 − α): 0.99
Standard error associated with the chosen level of confidence (Z = Z(1 − α/2)): 2.58
Standard deviation (σ): 6
Error (e): 2

Solution:

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{e}\right)^{2} = \frac{2.58^{2} \times 6^{2}}{2^{2}} = 59.9076 \approx 60 \text{ employees}$$
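The sick-days example can be verified with a few lines of code. This is a minimal sketch; `sample_size_mean` is a hypothetical helper that covers the infinite-population formula n = (zσ/e)² and, when a population size N is supplied, a finite-population version of the same form as the proportion formula.

```python
import math


def sample_size_mean(sigma, e, z, N=None):
    """Sample size for estimating a mean (N=None means infinite population)."""
    if N is None:
        return (z * sigma / e) ** 2
    return N * z**2 * sigma**2 / (e**2 * (N - 1) + z**2 * sigma**2)


# Values from the example: 99% confidence (z = 2.58), sigma = 6 days, e = 2 days.
n_raw = sample_size_mean(sigma=6, e=2, z=2.58)
n = math.ceil(n_raw)     # round up to whole employees
print(f"n = {n_raw:.4f} -> {n} employees")
```

Note that `sigma` and `e` must be in the same units (days here), as the slide on numeric data points out.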

(48)

Assignment 2

1. The daily salaries of substitute teachers for eight local school districts are shown. What is the point estimate for the mean? Find the 90% confidence interval of the mean for the salaries of substitute teachers in the region. Answer: <55.47, 62.28>

60 56 60 55 70 55 60 55 Source: Pittsburgh Tribune Review

2. The number of unhealthy days based on the AQI (Air Quality Index) for a random sample of metropolitan areas is shown. Construct a 98% confidence interval based on the data.

61 12 6 40 27 38 93 5 13 40 Source: N.Y. Times Almanac. Answer: <8.81, 58.19>

3. We’ve just started a new educational TV program that teaches viewers all about research methods!

• We know from past educational TV programs that such a program would likely capture 2 out of 10 viewers on a typical night.

• Let’s say we want to be 95% confident that our obtained sample proportion of viewers will differ from the true population proportion by not more than 5%.

• What sample size do we need? Answer: n = 245.86 or n = 246 viewers

4. A survey estimated that 20% of all Americans aged 16 to 20 drove under the influence of drugs or alcohol. A similar survey is planned for New Zealand. They want a 95% confidence interval to have a margin of error of 0.04. (a) Find the necessary sample size if they expect to find results similar to those in the United States. (b) Suppose instead they used the conservative formula based on p̂ = 0.5. What is now the required sample size? Answer: a. n = 384, b. n = 600

5. We would like to start an ISP and need to estimate the average Internet usage of households in one week for our business plan and model. How many households must we randomly select to be 95 percent sure that the sample mean is within 1 minute of the population mean . Assume that a previous survey of household usage has shown σ = 6.95 minutes. Answer: n= 186 households

(49)

Assignment 2

Contd.

7. A firm employs staff in the three categories listed below:

28 managers

44 secretaries

304 production workers

You take a sample from each stratum, with a 95% confidence level and a maximum sampling error of 7%.

a. How many managers, secretaries, and production workers should the sample contain?

Answer: (Managers = 10, Secretaries = 15, Production Workers = 104).

b. What type of sampling would you recommend to obtain representativeness (probabilistic or non-probabilistic) and specify the method you would use according to the type of recommended sampling (for example: simple random, systematic, etc.)? Then, briefly, say the procedure to take the sample.

8. For the same example before, last year’s proportion of ATM withdrawals over $50 was 27 percent. If we used this estimate in our calculation, the required sample size would be…..?.

Answer n = 1892.948

9. Alternative for the example before. Suppose that our research budget will not permit a large sample. In the previous example, we could reduce the confidence interval level from 95 to 90 percent and increase the maximum error to ± 4 percent. Assuming p proportion = .50. How large of sample is needed?

Answer n= 420.65

10. The Ministry of Economy places each of its employees on one of four salary scales, A, B, C, and D. The number of employees on each scale is listed below. How many employees from each scale should be included in the sample, with a 99% confidence level and a maximum sampling error of 1%?

A 1000 B 875 C 245 D 120