Unit 7 - Sampling Theory

(1)

Statistics For Social Sciences – MATH1208

Introduction

There are different ways of collecting data and choosing the subjects for an investigation. Primary data is the name given to data that are used for the specific purpose for which they were collected. They will contain no unknown quantities in respect of method of collection, accuracy of measurements or which members of the population were investigated. Sources of

(2)

than that for which they were originally collected. Summaries and analyses of such data are sometimes referred to as secondary statistics. The main sources of secondary data are publications.

Census

A census is the name given to a survey which examines every member of a

population. Three government censuses are population census, census of distribution and census of production.

A Population Census is taken every ten

years, to obtain information such as age, sex, relationship to head of household,

(3)

car for travel and number of rooms in place of dwelling.

A Census of Distribution is taken every five years, covering virtually all retail establishments and some wholesalers. It obtains information on number of employees, type of goods sold, turnover and classification.

(4)

stocks of raw materials, finished goods and expenditure on plant and machinery.

Bias

Bias can be defined as the tendency of a pattern of errors to influence data in an

unrepresentative way. The errors involved in the results of investigations that have been subject to bias are known as systematic errors.

The main types of bias are:

1. Selection bias. This can occur if a sample is not truly representative of the population. Note that censuses cannot be subject to this type of bias. For example, sampling the

(5)

particular day may not adequately represent the nature and quality of the goods that

customers receive. Some factors that could influence the results include – this machine might be manned by more or less

experienced operators; there may be other machines that perform better or worse; the day’s production may be under more or less pressure than another day.

2. Structure and wording bias. This can arise from badly worded questions. For example, technical words might not be

(6)

3. Interview bias. If the subjects of an investigation are personally interviewed, the interviewer might project biased opinions or an attitude that might not gain the full cooperation of the subjects.

4. Recording bias. This could result from badly recorded answers or clerical errors made by an untrained workforce.

Probability Sampling

A probability sampling method is any

method of sampling that utilizes some form of random selection. In order to have a

(7)

the different units in your population have equal probabilities of being chosen.

Non-probability Sampling

A core characteristic of non-probability sampling techniques is that samples are selected based on the subjective judgement of the researcher, rather than random

selection (i.e., probabilistic methods), which is the cornerstone of probability sampling techniques.

(8)

Most information obtained by an organization about any population is the result of examining a small, representative subset of the population. This is called a sample.

Sampling theory is the study of relationships existing between a population and samples drawn from the population. It is useful in estimating unknown population quantities such as population mean and variance, often called population parameters, from a

knowledge of corresponding sample

(9)

Sampling Frame

A sampling frame is a list of all members of the population. Certain sampling methods require each member of the population

under consideration to be known and

identifiable. The structure which supports this identification is called a sampling

frame. Some sampling methods require a sampling frame only as a listing of the population. Other methods need certain characteristics of each member also to be known.

(10)

Once you have the sampling frame and you have determined your sample size, you are ready to select your sample. Although it is easy to understand the process of placing the name of each member of the population in a hat and selecting the sample by picking names from the hat, this is not a practical way to select the sample. To imitate this process you can use a table of random numbers. A table of random numbers is a list of

numbers randomly generated and listed in the order in which they are generated.

(11)

arranged in groups for reading convenience. The term ‘generated in a random fashion’ can be interpreted as ‘the chance of any one digit occurring in any position in the table is no more or less than the chance of any other digit occurring’.

To use the random number table to identify which members of our population will be selected for the sample, we first assign each member of the population an identification (ID) number. An example of a simple

random sample is as follows.

(12)

Each student at your college has a mailbox on campus. The mailboxes are numbered from 0000 to 9000. To select a simple random sample of ten students, we can

select ten mailbox numbers at random using the random number table. You can close

your eyes and choose a spot on the random number table. Suppose you select row 7, column 3 of the table. The first student

selected has mailbox 2419, which is a valid number. If we continue to read off four-digit numbers from the table, the second number selected is for mailbox 0210. The list of all ten mailbox numbers selected is: 2419 0210 7750 4293 6279 4778 1976

(13)

table are organized in nine-digit blocks and we need only four-digit mailbox numbers, we just keep reading the numbers
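The mailbox example can also be imitated with a pseudo-random number generator instead of a printed table. The following is a minimal sketch, not the procedure used in the original example: the function name `simple_random_sample` and the `seed` argument are illustrative assumptions, and the numbers drawn will differ from those read off the table above.

```python
import random

def simple_random_sample(low, high, k, seed=None):
    """Draw k distinct ID numbers from low..high (inclusive).

    Every ID has the same chance of selection, which is the defining
    property of a simple random sample. `seed` is only there to make
    the illustration repeatable (an assumption for this sketch).
    """
    rng = random.Random(seed)
    return rng.sample(range(low, high + 1), k)

# Mailboxes are numbered 0000 to 9000; select ten of them.
mailboxes = simple_random_sample(0, 9000, 10, seed=1)
print([f"{m:04d}" for m in mailboxes])
```

Because `random.sample` draws without replacement, no mailbox can be selected twice, which matches picking names from a hat.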

(14)

Methods of Obtaining Samples for Statistical Analysis

The sampling techniques most commonly used in business and commerce can be split into three categories.

1. Random sampling. This ensures that each and every member of the population under consideration has an equal chance of being selected as part of the sample. Two types of random sampling used are:

i. Simple random sampling

This sampling ensures that each member of the population has an equal chance of being chosen for the sample. It is therefore

(15)

least lists all the members of the target

population. Examples of where this method might be used are:

a. by a large company, to sample 10% of their orders to determine their average value; b. by a professional association, to sample a proportion of its members to determine their views on a possible amalgamation.

Advantages

The selection of the sample members is unbiased, and the method is generally accepted by the layman as fair.

Disadvantages

(16)

• the need for each chosen subject to be located and questioned is time consuming

• the chance that certain significant attributes of the population are under or over represented.

ii. Stratified random sampling

Stratified random sampling extends the idea of simple random sampling to ensure that a heterogeneous population has its defined strata levels taken account of in the sample. For example, if 10% of all heavy goods

(17)

investigation in hand, then 10% of a sample of such vehicles must have the safety

feature. The general procedure for taking a stratified sample is:

a. Stratify the population, defining a number of separate partitions.

b. Calculate the proportion of the population lying in each partition.

c. Split the total sample size up into the above proportions.

d. Take a separate sample (normally simple random) from each partition, using the

sample sizes as defined in c.
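Steps a to d can be sketched in a few lines of code. This is an illustration under stated assumptions rather than a prescribed implementation: the function `stratified_sample`, the stratum names and the vehicle counts (100 with the feature out of 1,000) are all hypothetical, chosen to mirror the 10% example.

```python
import random

def stratified_sample(strata, total_n, seed=None):
    """Proportional stratified sample.

    strata  : dict mapping a stratum name to the list of its members (step a)
    total_n : total sample size to split across the strata
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        proportion = len(members) / population_size    # step b
        n_stratum = round(total_n * proportion)        # step c
        sample[name] = rng.sample(members, n_stratum)  # step d: simple random sample
    return sample

# Hypothetical population: 10% of 1,000 vehicles have the safety feature.
vehicles = {
    "with_feature": [f"W{i}" for i in range(100)],
    "without_feature": [f"N{i}" for i in range(900)],
}
chosen = stratified_sample(vehicles, total_n=50, seed=0)
print({name: len(members) for name, members in chosen.items()})
# {'with_feature': 5, 'without_feature': 45}
```

When the proportions do not divide the sample size evenly, rounding in step c can make the stratum sizes add up to slightly more or less than the intended total, so some adjustment may be needed in practice.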

(18)

Advantage

The sample itself is free from bias, since it takes into account significant strata levels (attributes) of a population considered important to the investigation.

Disadvantages

a. an extensive sampling frame is necessary;

b. strata levels of importance can only be selected subjectively;

c. increased costs due to the extra time and manpower necessary for the organization and implementation of the sample.

(19)

i. identifies certain attributes (or strata

levels) that are considered significant to the investigation at hand;

ii. partitions the population accordingly into groups which each have a unique

combination of these levels.

2. Quasi-random sampling. Quasi means ‘almost’ or ‘nearly’. This type of technique, while not satisfying the criterion given in 1 above, is generally thought to be as

(20)

expensive to consider. Two types that are commonly used are:

i. Systematic random sampling. Systematic sampling is a type of probability sampling method in

which sample members from a larger population are selected according to a

random starting point and a fixed periodic interval. This interval, called

the sampling interval, is calculated by dividing the population size by the

desired sample size.

Systematic sampling is a method of sampling that can be used where the

(21)

or the fleet of company vehicles) or some of it is physically in evidence (such as a row of houses, items coming off a production line or customers leaving a supermarket). The technique is to choose a random starting place and then systematically sample every 40th (or 12th or 165th) item in the population.

The number chosen is based on the sample size required. For example, if a 2% sample of the population was needed, every 50th item would be selected, after starting at some random point. This is because 2% = 2 out of 100 = 1 out of 50.
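The 2% example can be sketched as follows. The function `systematic_sample` and the list of 1,000 numbered invoices are assumptions made for illustration; the sampling interval is taken as 1 divided by the sampling fraction, as described above.

```python
import random

def systematic_sample(items, fraction, seed=None):
    """Select every k-th item after a random start, where k = 1 / fraction."""
    interval = round(1 / fraction)     # a 2% sample gives every 50th item
    rng = random.Random(seed)
    start = rng.randrange(interval)    # random starting point within the first interval
    return items[start::interval]

invoices = list(range(1, 1001))        # hypothetical invoice numbers 1..1000
sample = systematic_sample(invoices, 0.02, seed=3)
print(len(sample), sample[:3])
```

Only the starting point is random; once it is fixed, every other selected item is determined by the interval.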

(22)

uniform. These are referred to as homogeneous populations. For example, the invoices of a company for one financial year would be considered a homogeneous population by an auditor, if their value or the type of goods ordered was of no consequence to the investigation.

Advantages of this method include:

i. Ease of use;

ii. the fact that it can be used where no sampling frame exists, but items are physically in evidence.

(23)

a. The main disadvantage of systematic

sampling is that bias can occur if recurring sets in the population are possible.

b. This method of sampling is not truly

random, since once a random starting point is selected all the subjects are

pre-determined. Hence the use of the term ‘quasi-random’ to describe the technique.

ii. Multi-stage sampling

Where a population is spread over a

relatively wide geographical area, random sampling will almost certainly entail

(24)

Multi-stage sampling is intended to overcome this particular problem. It involves:

a. Splitting the area into a number of regions;

b. Randomly selecting a small number of regions;

c. Confining sub-samples to these regions alone, with the size of each sub-sample proportional to the size of the area. For example, United Kingdom could be split into countries or a large city could be split into postal districts;

(25)

Once the final regions or sub-regions are selected, the final sampling technique could be simple or stratified random or systematic, depending on the existence or otherwise of a sampling frame.

Advantage

The main advantage of this method is that less time and manpower is needed and thus it is cheaper than random sampling.

Disadvantages of multi-stage sampling include:

a. possible bias if a very small number of regions is selected;

(26)

been selected, no member of the population in any other region can be selected.

3. Non-random sampling. This is used when neither of the above techniques is possible or practical. Two well-used types are:

i. Cluster sampling.

Cluster sampling is a non-random sampling method which can be employed where no sampling frame exists, and, often for a

(27)

For example, suppose a survey was needed of companies in South Wales that use a computerized payroll. First, three or four small areas would be chosen (perhaps two of these based in city centres and one or two in outlying areas). Each company, in each area, might then be phoned to identify which of them have computerized systems. The

survey itself could then be carried out.

Advantages

i. it is a good alternative to multi-stage sampling where no sampling frame exists;

ii. it is generally cheaper than other

(28)

Disadvantage

The main disadvantage of the method is the fact that sampling is not random and thus selection bias could be significant.

ii. Quota sampling.

Quota sampling is yet another non-probability sampling method, in which the population is divided into mutually exclusive sub-groups from which the sample items are selected on the basis of a given proportion.

Simply, Quota Sampling is a form of

(29)

knowledge and professional judgment. In this method, the quotas (i.e. the proportions in which the sample items are to be selected) are set up first, and then within the quotas the choice of sample items depends exclusively on the investigator’s judgment.

For example, suppose an interviewer is told to interview 250 people living in certain geographical areas, of whom 100 males, 100 females and 50 children are to be interviewed. Within these quotas, the interviewer can select any person on the basis of personal judgment.

(30)

chance of personal prejudice or bias of the investigator that can adversely affect the credibility of the results. For instance, if the interviewer finds children unable to answer the questions, he might ask their mothers to answer on their behalf. This may distort the results and defeat the purpose of the research.

The sampling technique mostly favoured in market research is quota sampling. The

method uses a team of interviewers, each with a set number (quota) of subjects to interview. Normally the population is

(31)

lot of responsibility on the interviewers, since the selection of subjects is left to them entirely. Ideally they should be well trained and have a responsible, professional attitude.

Advantages of quota sampling include:

i. stratification of the population is usual, although not essential;

ii. it is not complicated: any member can be replaced by another member with the same characteristics, so there is no non-response;

iii. low cost and convenience.

Disadvantages

(32)

ii. severe interviewer bias can be introduced into the survey by inexperienced or untrained interviewers, since all the data collection and recording rests with them.

Convenience sampling (also known as grab sampling, accidental sampling, or opportunity sampling) is a type of non-probability sampling that involves the sample being drawn from that part of the population that is close to hand. This type of sampling is most useful for pilot testing.

In sociology and statistics

research, snowball sampling (or

(33)

referral sampling) is a

non-probability sampling technique where existing study subjects recruit future

subjects from among their acquaintances.

Advantages of Snowball Sampling. The chain referral process allows the researcher to reach populations that are difficult to sample when using other sampling methods. The process is cheap, simple and cost-efficient. This sampling technique needs little planning and a smaller workforce compared to other sampling techniques.

(34)

It is usually impossible to determine the sampling error or make inferences about populations based on the obtained sample.

Apply Central Limit Theorem to Large Samples

The Central Limit Theorem (CLT) applies for large enough sample sizes. A “large

(35)

normal distribution, then the results of the CLT hold even for small samples (n < 30).

In random sampling from a population with mean (µ) and standard deviation (σ), when n is large enough, the distribution of the sample mean x̄ (the point estimator) is approximately normal with mean µ and standard error σ/√n.
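A small simulation can make this statement concrete. The sketch below is an illustration only: the skewed (exponential) population, the sample size of 50 and the 2,000 repetitions are arbitrary assumptions, and the function name `sample_means` is hypothetical.

```python
import random
import statistics

def sample_means(population, n, repeats, seed=None):
    """Means of `repeats` random samples of size n (drawn with replacement)."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(repeats)]

# A deliberately non-normal (right-skewed) population.
pop_rng = random.Random(0)
population = [pop_rng.expovariate(1.0) for _ in range(10_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 50
means = sample_means(population, n, repeats=2_000, seed=1)

# The mean of the sample means should be close to mu, and their
# standard deviation close to the standard error sigma / sqrt(n).
print(round(statistics.fmean(means), 3), round(mu, 3))
print(round(statistics.stdev(means), 3), round(sigma / n ** 0.5, 3))
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the population itself is far from normal.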

Estimate the Sample Mean and Sample Variance (Standard Error)

The standard error is the standard deviation of the sampling distribution of a point

(36)

An estimator is a statistic that estimates some fact about the population. You can also think of an estimator as the rule that creates an estimate. For example, the sample mean (x̄) is an estimator for the population mean, μ. Point estimators and interval estimators are the two types of estimators.

The quantity that is being estimated (i.e. the one you want to know) is called

the estimand. For example, let’s say you

wanted to know the average height of children in a certain school with a

(37)

is your sample mean, the estimator. You use the sample mean to estimate that the population mean (your estimand) is about 56 inches.

Point vs. Interval

Estimators can be a range of values (like a confidence interval) or a single value (like the standard deviation). When an estimator is a range of values, it’s called an interval

estimate. For the height example above, you

(38)

Characteristics of Estimators

Estimators can be described in several ways:

• Biased: a statistic that is either an overestimate or an underestimate.

• Efficient: a statistic with small variance (the one with the smallest possible variance is also called the “best”). Inefficient estimators can give you good results as well, but they usually require much larger samples.

• Invariant: statistics that are not easily

(39)

• Shrinkage: a raw estimate that’s improved by combining it with other information.

• Sufficient: a statistic that estimates the population parameter as well as if you knew all of the data in all possible samples.

• Unbiased: an accurate statistic that neither underestimates nor overestimates.

Estimation

(40)

descriptor for the sample, it is called a point estimate. A point estimate is a single number calculated from sample data. It is used to

estimate a parameter of the population. A parameter is a numerical descriptor of the population.

Interval estimates indicate the precision, or

accuracy, of an estimate and are therefore preferable to point estimates. For example, if we say that a distance is measured as 5.28 metres (m), we are giving a point estimate. If, on the other hand, we say that the

(41)

Parameters are typically unknown. One

important problem of statistical inference is the estimation of the population parameters (such as population mean and variance)

from the corresponding sample statistics (such as sample mean and variance).

An interval estimator is a statistical

estimator which is represented geometrically as a set of points in the parameter space. An interval estimator can be seen as a set of

(42)

which this estimator will "cover" the

unknown parameter point. This probability, in general, depends on unknown parameters; therefore, as a characteristic of the reliability of an interval estimator a confidence

coefficient is used; this is the lowest possible value of the given probability. Interesting statistical conclusions can be drawn for only those interval estimators

which have a confidence coefficient close to one.

A point estimator is the formula or rule that

(43)

A point estimator is a statistical estimator whose value can be represented

geometrically in the form of a point in the same space as the values of the unknown parameters (the dimension of the space is equal to the number of parameters to be

estimated). In fact, point estimators are also used as approximate values for unknown physical variables. For the sake of

simplicity, it is further supposed that one natural parameter is subject to estimation; in this case, a point estimator is a function of the results of observations, and takes

numerical values.

(44)

parameter being estimated, i.e. if the

statistical estimation is free of systematic errors. The arithmetical mean (1) is an unbiased statistical estimator for the

mathematical expectation of identically-distributed random variables (not

necessarily normal).

An unbiased estimator yields an estimate that is fair. It neither systematically overestimates the parameter nor systematically underestimates the parameter.

Properties of an estimator

(45)

samples of a given size is equal to the parameter being estimated.

2. Consistent – as the sample size increases, the value of the estimator approaches the value of the parameter estimated.

3. Relatively efficient – of all the statistics that can be used to estimate a parameter, the relatively efficient estimator has the smallest variance.

Calculate Standard Error of a Sample

The standard deviation of a sampling distribution of a statistic is often called its standard error. The standard error of the

(46)

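sample mean is $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{N}}$, where $\sigma$ is the population standard deviation, N is the sample size and $\bar{x}$ is the sample mean. This is true for large or small samples. The sampling distribution of means is very nearly normal for N ≥ 30 even when the population is non-normal.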
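As a quick numerical check, the standard error of the mean can be computed directly from the standard deviation and the sample size. The helper name `standard_error` is an assumption for this sketch; the figures (standard deviation 4.2, sample size 7) are taken from Exercise 7 in this unit.

```python
import math

def standard_error(sd, n):
    """Standard error of the sample mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)

se = standard_error(4.2, 7)
print(round(se, 2))  # 1.59
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.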

Determine Confidence Intervals for Population Means

A confidence interval (also called an interval estimate) takes the point estimate a step

further and gives a range of values and a probability. The probability value is the

likelihood that an interval actually includes the value of the unknown population

(47)

Given a random sample from some population, a confidence interval for the unknown population mean is $\bar{x} \pm z\,\dfrac{s}{\sqrt{n}}$, where $\bar{x}$ is the sample mean, s is the sample standard deviation, n is the sample size and z = confidence factor (1.64 for 90%; 1.96 for 95%; 2.58 for 99%).

Example

A sample of 100 invoices yielded a mean gross value of $45.50 and standard deviation of $3.24. Calculate a 95% confidence

interval.

(48)

95% confidence interval = $45.50 \pm (1.96)\dfrac{3.24}{\sqrt{100}} = 45.50 \pm 0.635$.

There is a 95% probability that the mean of the complete population of invoices from which the sample was taken is between 44.9 and 46.1.
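The invoice calculation can be reproduced with a short function. This is a sketch under the assumptions of this unit (large sample, confidence factors 1.64, 1.96 and 2.58); the names `Z_FACTORS` and `confidence_interval` are illustrative.

```python
import math

# Confidence factors quoted in this unit, keyed by confidence level in percent.
Z_FACTORS = {90: 1.64, 95: 1.96, 99: 2.58}

def confidence_interval(mean, sd, n, level=95):
    """Return (lower, upper) limits of mean +/- z * sd / sqrt(n)."""
    margin = Z_FACTORS[level] * sd / math.sqrt(n)
    return mean - margin, mean + margin

low, high = confidence_interval(45.50, 3.24, 100)
print(f"{low:.1f} to {high:.1f}")  # 44.9 to 46.1
```

Calling the same function with `level=99` widens the interval, since more confidence requires a larger margin around the sample mean.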

In statistics, the 68–95–99.7 rule is a

shorthand used to remember the percentage of values that lie within a band around

the mean in a normal distribution with a width of two, four and six standard

deviations, respectively; more accurately, 68.27%, 95.45% and 99.73% of the values lie within one, two and three standard

(49)

expressed as follows, where X is an observation from a normally

distributed random variable, μ is the mean of the distribution, and σ is its standard

deviation:
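The three percentages can be checked numerically. For a normally distributed X, the probability of lying within k standard deviations of the mean is P(μ − kσ ≤ X ≤ μ + kσ) = erf(k/√2), a standard identity. The sketch below evaluates it for k = 1, 2 and 3; the function name `prob_within` is an assumption.

```python
import math

def prob_within(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, f"{100 * prob_within(k):.2f}%")
# 1 68.27%
# 2 95.45%
# 3 99.73%
```

The result does not depend on the particular values of μ and σ, which is why the rule applies to every normal distribution.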

In the empirical sciences the

so-called three-sigma rule of thumb expresses a conventional heuristic that "nearly all"

values are taken to lie within three standard deviations of the mean, i.e. that it is

empirically useful to treat 99.7% probability as "near certainty".[1] The usefulness of this

heuristic of course depends significantly on the question under consideration, and there are other conventions, e.g. in the social

(50)

"significant" if its confidence level is of the order of a two-sigma effect (95%), while in particle physics, there is a convention of a five-sigma effect (99.99994% confidence) being required to qualify as a "discovery".

(51)

A hypothesis is an idea, an assumption (or guess), or a theory about the characteristics of one or more variables in one or more

populations. Once a hypothesis is formed, we must test it. We must decide whether or not to believe the hypothesis.

A hypothesis test is carried out by using the information in the sample data to decide whether or not to believe the hypothesis.

The hypothesis test is a statistical procedure that involves formulating a hypothesis and using sample data to decide on the validity of the hypothesis.

One of the first steps in carrying out the

(52)

views. One is called the null hypothesis (H0)

and the other is the alternative hypothesis (H1).

Null Hypothesis(H0) for a Test

The null hypothesis is a statement about a parameter of the population(s). It is labelled H0. In many instances we formulate a

statistical hypothesis for the sole purpose of rejecting or nullifying it. For example, if we want to decide whether a given coin is

biased, we formulate the hypothesis that ‘the coin is fair’ (i.e., p = 0.5, where p is the

(53)

than another, we formulate the hypothesis that ‘there is no difference between the procedures’ (i.e., any observed differences are due merely to fluctuations in sampling from the same population). Such hypotheses are called null hypotheses (H0).

Alternative Hypothesis (H1) for a Test

The alternative hypothesis is a statement

about a parameter of the population(s) that is opposite to the null hypothesis. It is labeled H1 or HA. Any hypothesis that differs from a

given hypothesis is called an alternative hypothesis. For example, if one hypothesis (H0) is p = 0.5, alternative hypotheses might

(54)

Type I and Type II Errors

If we reject a hypothesis when it should be accepted, we say that a ‘Type I’ error has been made. If, on the other hand, we accept a hypothesis when it should be rejected, we say that a ‘Type II’ error has been made. In either case, a wrong decision or judgment has occurred. In order for decision rules or tests of hypotheses to be good, they must be designed so as to minimize errors of

decision. This is not a simple matter,

because for any given sample, an attempt to decrease one type of error is generally

(55)

type of error. In practice, one type of error may be more serious than the other, and so a compromise should be reached in favour of limiting the more serious error. The only way to reduce both types of error is to

increase the sample size, which may or may not be possible.

Levels of Significance

(56)

level’ of the test. This probability, often denoted by α, is generally specified before any samples are drawn so that the results obtained will not influence our choice.

In practice, a significance level of 0.05 or 0.01 is customary, although other values are used. If, for example, the 0.05 (or 5%)

(57)

hypothesis has a 0.05 probability of being wrong.
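The meaning of a 0.05 significance level can be illustrated by simulation: if the null hypothesis is true and we test it many times, we should wrongly reject it (a Type I error) in roughly 5% of cases. The sketch below assumes a normal population with known mean and standard deviation; the function name and all parameter values are hypothetical choices for illustration.

```python
import math
import random

def type_i_error_rate(trials=2000, n=30, mu=0.0, sigma=1.0, z_crit=1.96, seed=0):
    """Fraction of samples, drawn with H0 true, whose z score falls in the critical region."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        z = (sample_mean - mu) / (sigma / math.sqrt(n))
        if abs(z) > z_crit:        # two-tailed test at the 0.05 level
            rejections += 1
    return rejections / trials

rate = type_i_error_rate()
print(rate)  # expected to be close to 0.05
```

Replacing `z_crit=1.96` with 2.58 should bring the rate down to about 0.01, at the cost of making true differences harder to detect.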

Rejection Region(s) and Critical Value(s)

To perform the hypothesis test we need to choose between the null and the alternative hypotheses. We must decide to reject or not to reject the null hypothesis. The decision is always phrased in terms of the null

(58)

sufficiently inconsistent with the null hypothesis.

The sample consists of n observations. We must find a single number that captures the information in the sample. This number is called ‘test statistic’. A test statistic is a number that is used to decide between the null and alternative hypothesis.

Test statistic (z) = (sample mean – mean of the distribution) ÷ (standard deviation ÷ √n) = $\dfrac{\bar{x} - \mu}{s/\sqrt{n}}$

where n is the sample size, $\bar{x}$ is the sample mean, µ is the population mean and s is

(59)

The rejection range is the range of values of the test statistic that will lead us to reject the null hypothesis. It is defined by the critical value(s).
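The test statistic is straightforward to compute. The following sketch uses the figures from Exercise 3 of this unit (sample mean 12.3, hypothesized mean 12.5, standard deviation 1.2, sample size 36); the function name `z_statistic` is an assumption.

```python
import math

def z_statistic(sample_mean, mu0, sd, n):
    """z = (sample mean - hypothesized mean) / (standard deviation / sqrt(n))."""
    return (sample_mean - mu0) / (sd / math.sqrt(n))

z = z_statistic(12.3, 12.5, 1.2, 36)
print(round(z, 2))  # -1.0
```

A negative z score simply means the sample mean lies below the hypothesized population mean.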

The second approach to deciding if we

(60)

Conduct Test for Large Samples – Normal Distribution

A large-sample test of the mean is conducted when the characteristic of interest is the population mean, µ, and either of the following situations exists:

• The population standard deviation is known (regardless of the sample size). OR

(61)

Suppose that under a given hypothesis the sampling distribution of a statistic S is a normal distribution with mean $\mu_S$ and standard deviation $\sigma_S$. Thus the distribution of the standardized variable (or z score), given by $z = \dfrac{S - \mu_S}{\sigma_S}$, is the standardized normal distribution (mean = 0, variance = 1).

As indicated in the diagram above, we can be 95% confident that if the hypothesis is true, then the z score of an actual sample statistic S will lie between – 1.96 and 1.96,

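[Figure: standard normal curve with a central area of 0.95 between z = – 1.96 and z = 1.96; each tail beyond these values has area 0.025 and forms the critical region.]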

(62)

since the area under the normal curve

between these values is 0.95. However, if on choosing a single sample at random we find that its z score lies outside the range – 1.96 to 1.96, we would conclude that such an event could happen with a probability of only 0.05 (the total shaded area in the figure) if the given hypothesis were true. We would then say that this z score differed

significantly from what would be expected under the hypothesis, and we would then be inclined to reject the hypothesis.

(63)

hypothesis (i.e., the probability of making a Type I error). Thus we say that the hypothesis is rejected at the 0.05 significance level or that the z score of the given sample statistic is significant at the 0.05 level.

The set of z scores outside the range – 1.96 to 1.96 constitutes what is called the critical region of the hypothesis, the region of

rejection of the hypothesis, or the region of significance. The set of z scores inside the range – 1.96 to 1.96 is thus called the region of acceptance of the hypothesis, or the

(64)

On the basis of the above remarks, we can formulate the following decision rule (or test of hypothesis or significance):

• Reject the hypothesis at the 0.05 significance level if the z score of the statistic S lies outside the range – 1.96 to 1.96 (i.e. either z > 1.96 or z < – 1.96). This is equivalent to saying that the observed sample statistic is significant at the 0.05 level.

• Accept the hypothesis otherwise (or, if desired, make no decision at all).

The z score is also called test statistic

(65)

that other significance levels could be used. For example, if 0.01 level were used, we would replace 1.96 everywhere above with 2.58.
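The decision rule for a two-tailed test can be written as a small function. This is a sketch only: the names `CRITICAL_Z` and `two_tailed_decision` are assumptions, and just the two critical values mentioned in the text (1.96 and 2.58) are included.

```python
# Two-tailed critical values quoted in the text, keyed by significance level.
CRITICAL_Z = {0.05: 1.96, 0.01: 2.58}

def two_tailed_decision(z, alpha=0.05):
    """Reject H0 when the z score lies outside -critical .. +critical."""
    critical = CRITICAL_Z[alpha]
    return "reject" if abs(z) > critical else "do not reject"

print(two_tailed_decision(-1.0))        # do not reject
print(two_tailed_decision(2.3))         # reject
print(two_tailed_decision(2.3, 0.01))   # do not reject
```

The last two calls show that the same z score can be significant at the 0.05 level but not at the stricter 0.01 level.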

Classify Hypothesis Tests Into One-tailed Tests and Two-tailed Tests

In the above test we were interested in extreme values of the statistic S or its corresponding z score on both sides of the mean (i.e., in both tails of the distribution). Such tests are called two-sided tests or two-tailed tests.

We may be interested in only extreme

(66)

tail of the distribution). For example, this arises when testing the hypothesis that one process is better than another, which is different from testing whether one process is better or worse than the other. Such tests are called one-sided tests or one-tailed tests. In such cases the critical region is a region to one side of the distribution, with area equal to the level of significance.

The table below gives critical values of z for both one-tailed and two-tailed tests at

various levels of significance.

A two-tailed test of the population mean has these null and alternative hypotheses:

(67)

H0: µ = [a specific number]

H1: µ ≠ [a specific number]

Level of significance, α                     0.10               0.05               0.01              0.005             0.002

Critical values of z for one-tailed tests    – 1.28 or 1.28     – 1.645 or 1.645   – 2.33 or 2.33    – 2.58 or 2.58    – 2.88 or 2.88

Critical values of z for two-tailed tests    – 1.645 and 1.645  – 1.96 and 1.96    – 2.58 and 2.58   – 2.81 and 2.81   – 3.08 and 3.08
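Critical values like those tabulated can be computed from the inverse of the standard normal distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `critical_z` is an assumption, and computed values may differ from rounded table entries in the last decimal place.

```python
from statistics import NormalDist

def critical_z(alpha, tails=2):
    """Positive critical value of z for a one- or two-tailed test at level alpha."""
    # Each tail of a two-tailed test holds alpha / 2 of the area.
    return NormalDist().inv_cdf(1 - alpha / tails)

for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(critical_z(alpha, tails=1), 3), round(critical_z(alpha, tails=2), 3))
```

For a one-tailed test the sign of the critical value depends on the direction of the alternative hypothesis: negative for a left-tailed test and positive for a right-tailed test.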

(68)

A small-sample test of the mean is conducted when the characteristic of interest is the population mean, µ, the population standard deviation is unknown and the sample size, n, is less than or equal to 30.

Infer or Draw Conclusion about the Outcome of the Test

Introduction

https://www.youtube.com/watch?v=e8ptHgDzJtQ

(69)

https://www.youtube.com/watch?annotation_id=annotation_3582407077&feature=iv&src_vid=pEidoIu3GA0&v=bU93aSJKMGw#t=3m34s

Confidence Interval for T-score

annotation_id=annotation_3635794553&feature=iv&src_vid=5LFhu0vGzkI&v=UmAJJtEo6cQ

Confidence interval for Z-score

(70)

Two-Tailed Test

https://www.youtube.com/watch?v=0XXT3bIY_pw

One-Tailed Test

https://www.youtube.com/watch?v=lNoxKsuJ6Xc

(71)

eature=iv&src_vid=0XXT3bIY_pw&v=5LFhu0vGzkI

Exercise

1. https://www.youtube.com/watch?annotation_id=annotation_3898681119&feature=iv&src_vid=lwpobQmUTd8&v=pEidoIu3GA0

2. An engineer hypothesizes that the mean number of defects can be decreased in a

(72)

a) Identify the null (H0) and the alternative hypothesis (H1). Ans: H0: µ = 18, H1: µ < 18

b) What type of hypothesis test should be carried out to test this claim?

Ans: Left-tailed test

3. The random variable X is normally distributed with a standard deviation of 1.2. The null hypothesis H0: µ = 12.5 cm for the mean of this distribution is being tested against the alternative H1: µ ≠ 12.5 cm. A sample of size 36 turns up a sample mean of 12.3 cm. Calculate the test statistic.

Ans: Test statistic (z) = – 1

4. Given that the distribution of a

(73)

mean of 60 and a standard deviation of 3. What is the approximate percentage of data values that is expected to fall between 57 and 63? (Unit 8, pages 7 to 8)

Ans: 68%

5. The principal of a large community

college wishes to estimate the average age of the students presently enrolled. From past studies, the standard deviation is known to be 2 years. A sample of 100 students is

selected and the mean is found to be 23.2 years.

a) Find the 95% confidence interval of the population mean.

Hint (Unit 8, page 6): Given a random

(74)

interval for the unknown population mean is

x̄ ± z · s/√n

where x̄ is the sample mean, s is the sample standard deviation, n is the sample size and z is the confidence factor (1.64 for 90%; 1.96 for 95%; 2.58 for 99%).

Ans: 22.8 < µ < 23.6

b) Explain your result in part (a).

Ans: There is a 95% probability (or chance) that the mean of the population from which the sample was taken lies between 22.8 and 23.6 (22.8 < µ < 23.6).
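The interval in exercise 5 can be checked with a short calculation of sample mean ± z × (standard deviation / √n). This is only a sketch: `mean_confidence_interval` is a hypothetical helper name, and the confidence factor is taken from the standard normal distribution rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Return (lower, upper) for sample_mean +/- z * sd / sqrt(n)."""
    # Two-sided confidence factor, about 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Exercise 5: mean 23.2 years, standard deviation 2 years, n = 100.
lower, upper = mean_confidence_interval(sample_mean=23.2, sd=2, n=100)
print(f"{lower:.1f} < µ < {upper:.1f}")
```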

(75)

calls by cell phone users. A sample of 65 cell phone users indicated that the mean amount spent is $250, with a standard deviation of $50.

a) Using a 95% level of confidence,

determine the confidence interval for the mean. Ans: 237.85 < µ < 262.15

b) Explain what part (a) indicates.

Ans: There is a 95% probability (or chance) that the mean of the population from which the sample was taken lies between 237.85 and 262.15.

(76)

7. Given that the sample size is 7, the sample mean is 8 and the population standard deviation is 4.2:

a) What is the standard error of the mean?

Ans: Standard error = 1.59

b) What is the 95% confidence interval?

Ans: (4.90, 11.10)

c) Explain your result in part (b).
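For exercise 7, the standard error is the standard deviation divided by √n, and the 95% interval is the sample mean plus or minus 1.96 standard errors. A minimal sketch of both steps, where `standard_error` is a hypothetical helper name:

```python
from math import sqrt

def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / sqrt(n)

# Exercise 7: n = 7, sample mean = 8, standard deviation = 4.2.
se = standard_error(sd=4.2, n=7)
margin = 1.96 * se  # 1.96 is the 95% confidence factor
print(f"standard error = {se:.2f}")
print(f"95% interval   = ({8 - margin:.2f}, {8 + margin:.2f})")
```

Carrying full precision gives roughly (4.89, 11.11). The quoted (4.90, 11.10) corresponds to rounding the margin of error to 3.1 before adding and subtracting.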

8. The attendance at the All Jam Swimming Meet was 400. A random sample of 50

(77)

the mean number of soft drinks consumed per person was 1.86 with a standard

deviation of 0.5.

a) Construct a 95% confidence interval for the mean number of soft drinks consumed per person. Ans: (1.72, 2.00)

b) Interpret your result in part (a).

9. The lives of batteries used in digital

(78)

recently modified and a sample of 24 modified batteries was tested. It was

discovered that the mean life was 311 days, and the sample standard deviation was 11 days. At the 0.05 level of significance, can we claim that the modification changed the mean life of the battery? Explain.

Ans: Null Hypothesis (H0) : µ = 305

Alternate Hypothesis (H1) : µ ≠ 305

At the 0.05 level of significance, we reject

H0 if z < - 1.96 or z > 1.96 and accept H1

(draw diagram)

(79)

We reject H0 since 2.67 > 1.96 and accept

H1. We conclude that at the 5% level of

significance, there is evidence to suggest that the mean has changed due to the modification of the batteries.
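The decision in exercise 9 can be traced with a small sketch of a two-tailed z test. The function name `two_tailed_z_test` is hypothetical, and the critical value is computed from the standard normal distribution instead of being read from the table.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_z_test(sample_mean, mu0, sd, n, alpha=0.05):
    """Return (z, critical value, reject H0?) for H1: mean != mu0."""
    z = (sample_mean - mu0) / (sd / sqrt(n))
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z, critical, abs(z) > critical

# Exercise 9: hypothesised mean 305 days, sample mean 311, s = 11, n = 24.
z, critical, reject = two_tailed_z_test(sample_mean=311, mu0=305, sd=11, n=24)
print(f"z = {z:.2f}, critical = ±{critical:.2f}, reject H0: {reject}")
```

Because n = 24 and the standard deviation comes from the sample, a small-sample t test could also be argued for here. The t critical value for 23 degrees of freedom at the two-tailed 5% level is about 2.069, so the conclusion would not change.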

10. A company suspects that the value of type A customer monthly orders has

changed from last year. Last year’s type A customer average monthly order was

$234.50. A random sample of 20 customers was taken, with a mean of $241.52 and

standard deviation $13.92.

(80)

a) Determine the test statistic at the 0.05 significance level. Ans: 2.26

b) Is the difference significant? Explain.

Ans: There is evidence of a difference, since z = 2.26 lies outside of the range – 1.96 to 1.96. That is, there is evidence that the value of type A customer monthly orders has changed.

11. A manager is convinced that a new type of machine does not affect production at the company’s major shop floor. To test this, 12 samples of this week's hourly output are taken and the average production per hour is measured as 1158 with a standard

(81)

1196 before the new machine was introduced.

a) Determine the test statistic at the 0.05 significance level. Ans: -1.85

b) Is the difference significant? Explain.

Ans: There is no evidence of any difference between the sample and population, since z = -1.85 lies within the range -1.96 to 1.96. That is, the manager’s conviction is supported by the results.

12. Test at the 5% level whether a sample value of 52 could come from a normal

(82)

= 25. (Hint: the sample size is not given; use a two-tailed test.)

Ans: At the 5% level there is significant evidence to indicate that the value does not come from a population with a mean of 40. That is, we reject the view (H0) that the mean is 40, since z = 2.4 > 1.96.

13. The length of a species of lizard is known to be normally distributed with

(83)

Ans: There is no significant evidence that the lizard is from a different species, so it could be of the same species. We accept the null hypothesis since z = 1.265, which is less than 1.64.

14. Test at the 5% level whether the value 340 could be from a normal population with a mean of 320 and variance of 80, or

whether the mean is greater than 320. (Hint: One tailed test)

Ans: There is significant evidence to suggest the value is from a population

with a larger mean than 320. We reject H0

since, z = 2.236 > 1.64
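Exercise 14 gives the variance rather than the standard deviation, so the square root must be taken before standardising the single observed value. The sketch below walks through that one-tailed test. The name `right_tailed_z_for_value` is a hypothetical helper, and the computed critical value (about 1.645) is slightly more precise than the 1.64 used in the answer.

```python
from math import sqrt
from statistics import NormalDist

def right_tailed_z_for_value(x, mu0, variance, alpha=0.05):
    """Test one observation x against H0: mean = mu0 versus H1: mean > mu0."""
    z = (x - mu0) / sqrt(variance)  # standard deviation is the root of the variance
    critical = NormalDist().inv_cdf(1 - alpha)  # all of alpha in the upper tail
    return z, critical, z > critical

# Exercise 14: observed value 340, hypothesised mean 320, variance 80.
z, critical, reject = right_tailed_z_for_value(x=340, mu0=320, variance=80)
print(f"z = {z:.3f}, critical = {critical:.3f}, reject H0: {reject}")
```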

15. A supplier claims that the mean life

(84)

A consumer organization tested 200 bulbs and found the mean to be 117.5 hours, with a variance of 169 h². Is there evidence at the 1% level that the mean is lower than 120 hours? Explain. (Hint: One tailed test)

Ans: There is evidence to indicate that the mean life span is less than 120 hours. We

reject H0 and accept H1 since z = – 2.720