Unit 5: Estimation and Sampling

QBA 201 – Summer 2013

Instructor: Michael Malcolm

5.1: Estimation and estimators

The essence of statistics is to take a sample and use it to draw inferences about a population.

A parameter is some feature of the true population. For example, we might be interested in the height of people living in the United States. The true mean 𝜇 of the population distribution and the true variance 𝜎² among the population are examples of parameters. The key is that these are features of the true population distribution and are typically unknown in practice.

What we do in statistics is to take a sample collected from the population and use this sample to make guesses about the population. If the sample is of size 𝑛, then we typically denote the observations in the sample as {𝑥₁, 𝑥₂, … , 𝑥ₙ}, where each 𝑥ᵢ is an observation from the sample.

We typically assume that our sample is independent and identically distributed (iid).

• A sample is independent if knowing the value of 𝑥ᵢ for one observation from the sample gives us no information about any other observation from the sample. Sampling the heights of fathers and their sons would violate independence, since height is partly inherited.

• A sample is identically distributed if all observations from the sample are drawn from the same true population distribution. That is, before the observation is actually made, any observation from the sample has the same probability distribution.

Most statistical analysis assumes that the data we use are an iid sample.

Given a sample {𝑥₁, 𝑥₂, … , 𝑥ₙ}, we can proceed to construct statistics based on this sample, which describe attributes of the sample. They are typically calculated by applying some formula to the sample data. For example, the sample mean and the sample variance

$$\bar{x} = \frac{\sum x_i}{n} \qquad \text{and} \qquad s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$$

are statistics that are calculated using sample data.

It is very important to distinguish sample statistics from parameters. The sample mean and variance 𝑥̅ and 𝑠² are calculated based on observed sample data. The parameters 𝜇 and 𝜎² indicate the true population mean and variance in the population distribution, and these are typically unknown in practice.

An estimator is a rule for estimating a population parameter. An estimator is typically based on some statistic that comes from the sample. For example, the sample mean 𝑥̅ is an estimator of the population mean 𝜇. The sample variance 𝑠² is an estimator of the population variance 𝜎².

For example, suppose we observe the sample {2, 7, 5, 4}. The sample mean can be calculated as 𝑥̅ = (2 + 7 + 5 + 4)/4 = 4.5. In this case, the formula 𝑥̅ is the estimator of the population mean and the specific answer of 4.5 is our estimate of the population mean. That is, the estimator is the rule and the estimate is the result of applying the rule to sample data.
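The distinction between the rule and its result can be sketched in a few lines of Python. The function names `sample_mean` and `sample_variance` are illustrative choices, not part of the course material. The functions play the role of estimators, and the numbers they return for the sample {2, 7, 5, 4} are the estimates.

```python
from statistics import mean, variance


def sample_mean(sample):
    """Estimator of the population mean: sum of observations divided by n."""
    return sum(sample) / len(sample)


def sample_variance(sample):
    """Estimator of the population variance, using the n - 1 divisor."""
    x_bar = sample_mean(sample)
    return sum((x - x_bar) ** 2 for x in sample) / (len(sample) - 1)


sample = [2, 7, 5, 4]            # the sample used in the text
x_bar = sample_mean(sample)      # the estimate produced by applying the rule
s2 = sample_variance(sample)

print("estimate of the mean:", x_bar)
print("estimate of the variance:", s2)

# Cross-check against the standard library, which applies the same formulas.
assert abs(x_bar - mean(sample)) < 1e-12
assert abs(s2 - variance(sample)) < 1e-12
```

Applying the same functions to a different sample would give different estimates, while the estimators themselves stay the same.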

What makes a good estimator? For example, why is the sample mean 𝑥̅ a good rule for estimating the population mean 𝜇? Below, we state three desirable properties of estimators. Here, 𝜃̂ is an estimator of a true population parameter 𝜃. This is just to make these definitions as general as possible. An example is that 𝑥̅ is an estimator for the true population mean 𝜇.

• 𝜃̂ is an unbiased estimator of 𝜃 if 𝐸(𝜃̂) = 𝜃. That is, if the average/expected value of the estimator is equal to the parameter being estimated.

• 𝜃̂ is a consistent estimator of 𝜃 if 𝜃̂ converges asymptotically to 𝜃 as the sample size rises. That is, the estimator 𝜃̂ should get better and better and approach the true population parameter 𝜃 asymptotically as the sample size rises.

• 𝜃̂ is the efficient estimator of 𝜃 if 𝜃̂ has the lowest variance among all possible unbiased estimators of 𝜃.

A key point in understanding these definitions is that estimators are never perfect. 𝑥̅ is not going to always be exactly equal to the true population mean 𝜇. There is going to be some distribution of 𝑥̅ – sometimes higher and sometimes lower than the true population mean 𝜇.

When we say that 𝑥̅ is unbiased, it just means that the expected or average value of 𝑥̅ is equal to 𝜇: sometimes too high, and sometimes too low, but right on average. When we say that 𝑥̅ is consistent it means that, as the sample size rises, 𝑥̅ should get closer and closer to 𝜇 with higher probability. That is, the estimator should be more accurate as the sample size rises. When we say that 𝑥̅ is efficient, it means that no other unbiased estimator has a variance lower than the variance of 𝑥̅.

In order to investigate these properties, we make an important preliminary point. When we take an iid sample {𝑥₁, 𝑥₂, … , 𝑥ₙ}, each member of the sample is drawn from a population distribution that has true mean 𝜇 and true variance 𝜎². That is, 𝐸(𝑥ᵢ) = 𝜇 and 𝑉𝑎𝑟(𝑥ᵢ) = 𝜎² for every member of the sample 𝑥ᵢ since all come from the same population distribution.

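$$
\begin{aligned}
E(\bar{x}) &= E\left(\frac{1}{n}(x_1 + x_2 + \cdots + x_n)\right) \\
&= \frac{1}{n}E(x_1) + \frac{1}{n}E(x_2) + \cdots + \frac{1}{n}E(x_n) \\
&= \frac{1}{n}\cdot\mu + \frac{1}{n}\cdot\mu + \cdots + \frac{1}{n}\cdot\mu \\
&= n\left(\frac{1}{n}\cdot\mu\right) = \mu
\end{aligned}
$$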

We have just demonstrated that 𝐸(𝑥̅) = 𝜇, meaning that – on average – the value of the sample mean is equal to the population mean. This proves that the sample mean 𝑥̅ is an unbiased estimator of the true population mean 𝜇.

Now, as for consistency, let us calculate the variance of 𝑥̅:

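$$
\begin{aligned}
Var(\bar{x}) &= Var\left(\frac{1}{n}(x_1 + x_2 + \cdots + x_n)\right) \\
&= Var\left(\frac{1}{n}x_1 + \frac{1}{n}x_2 + \cdots + \frac{1}{n}x_n\right) \\
&= Var\left(\frac{1}{n}x_1\right) + Var\left(\frac{1}{n}x_2\right) + \cdots + Var\left(\frac{1}{n}x_n\right)
\end{aligned}
$$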

Note that this last line follows only because the sample {𝑥₁, 𝑥₂, … , 𝑥ₙ} consists of independent observations. We noted in unit 3.5 that the variance of a sum of random variables is only equal to the sum of the individual variances when the random variables are independent. If they are not independent, then the answer is not this simple because there would also be covariance terms. But, since the sample consists of independent observations, we can proceed with the calculation. Recall that 𝑉𝑎𝑟(𝑏𝑋) = 𝑏²𝑉𝑎𝑟(𝑋):

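$$
\begin{aligned}
Var(\bar{x}) &= \frac{1}{n^2}Var(x_1) + \frac{1}{n^2}Var(x_2) + \cdots + \frac{1}{n^2}Var(x_n) \\
&= \frac{1}{n^2}\sigma^2 + \frac{1}{n^2}\sigma^2 + \cdots + \frac{1}{n^2}\sigma^2 \\
&= n\left(\frac{1}{n^2}\sigma^2\right) \\
&= \frac{\sigma^2}{n}
\end{aligned}
$$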

The important point is that, as the sample size 𝑛 rises, the variance of the sample mean 𝑉𝑎𝑟(𝑥̅) falls. That is, the sample mean tends to bunch up around the population mean with higher and higher probability as the sample size rises. In fact, the variance goes to 0 as 𝑛 → ∞.
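A small simulation can illustrate both results. This is only a sketch under stated assumptions: the population is taken to be normal with mean 10 and standard deviation 3 (hypothetical values chosen for illustration), and the names `MU`, `SIGMA`, `REPS` and `simulate_sample_means` are not from the course material. For each sample size it draws many iid samples and compares the average and variance of the resulting sample means with 𝜇 and 𝜎²/𝑛.

```python
import random
from statistics import pvariance

MU, SIGMA = 10.0, 3.0   # hypothetical population mean and standard deviation
REPS = 5000             # number of simulated samples per sample size

rng = random.Random(201)  # fixed seed so the run is reproducible


def simulate_sample_means(n, reps=REPS):
    """Draw `reps` iid samples of size n and return their sample means."""
    return [sum(rng.gauss(MU, SIGMA) for _ in range(n)) / n for _ in range(reps)]


results = {}
for n in (5, 50):
    means = simulate_sample_means(n)
    avg_of_means = sum(means) / len(means)   # should be close to MU
    var_of_means = pvariance(means)          # should be close to SIGMA**2 / n
    results[n] = (avg_of_means, var_of_means, SIGMA ** 2 / n)
    print(f"n={n}: mean of x-bar={avg_of_means:.3f}, "
          f"variance of x-bar={var_of_means:.4f}, sigma^2/n={SIGMA ** 2 / n:.4f}")
```

The simulated variance shrinks by roughly a factor of ten when 𝑛 goes from 5 to 50, in line with the formula.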

The fact that 𝑥̅ is a consistent estimator of 𝜇 is known as the weak law of large numbers (WLLN). As the sample size rises, 𝑥̅ converges to 𝜇 with probability approaching 1.

It turns out that 𝑥̅ is also the efficient estimator of 𝜇, although this proof is more difficult so we will omit it here. In summary, 𝑥̅ is an unbiased and consistent estimator of 𝜇 and is also the efficient estimator.

We will not give any proofs for the sample variance 𝑠², but it turns out that 𝑠² is an unbiased and consistent estimator of the true population variance 𝜎². It is also efficient in some cases.

Actually, this is the reason for the division by 𝑛 − 1 rather than 𝑛 in the formula for the sample variance. It turns out that $\frac{\sum (x_i - \bar{x})^2}{n}$ is actually a biased estimator for 𝜎² in the sense that its expected value is not equal to 𝜎². However, if we adjust it slightly by dividing by 𝑛 − 1 instead of 𝑛, then we obtain $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$, which is an unbiased estimator: 𝐸(𝑠²) = 𝜎². On average, the sample variance 𝑠² is equal to the population variance 𝜎².

The reason for this is complicated. If we had the true population mean 𝜇 in the formula in place of 𝑥̅, then we would actually divide by 𝑛 rather than 𝑛 − 1 to obtain an unbiased estimator. But the true population mean 𝜇 is typically unknown, so we have to substitute the sample mean 𝑥̅. Statisticians say that we “lose one degree of freedom” in this case since 𝑥̅ itself is estimated.
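The bias can be checked exactly on a small example. The sketch below assumes a hypothetical population, a fair six-sided die, and enumerates every possible iid sample of size 3. It then averages the sum of squared deviations divided by 𝑛 and divided by 𝑛 − 1 over all samples. Exact fractions are used so that no rounding is involved; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # hypothetical population: a fair die
n = 3                             # sample size

# True population mean and variance (each value has probability 1/6).
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)


def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    x_bar = Fraction(sum(sample), len(sample))
    return sum((x - x_bar) ** 2 for x in sample)


# Every ordered sample of size n is equally likely under iid sampling.
samples = list(product(population, repeat=n))
expected_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
expected_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print("population variance:      ", sigma2)
print("E[divide by n]:           ", expected_div_n)
print("E[divide by n - 1] (s^2): ", expected_div_n_minus_1)
```

In this example the 𝑛 divisor underestimates the population variance by the factor (𝑛 − 1)/𝑛, while the 𝑛 − 1 divisor recovers it exactly.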

Finally, the sample standard deviation 𝑠 = √(𝑠²) is actually not an unbiased estimator of the true population standard deviation 𝜎. In fact, there is no simple formula that gives an unbiased estimator of 𝜎. It is consistent, though. The bias disappears as 𝑛 → ∞ and the sample standard deviation collapses around the true standard deviation.

EXERCISES

1. Suppose you take an iid random sample of just two members of a population. We will designate the sample as {𝑥₁, 𝑥₂}. The population distribution has true mean 𝜇 and true variance 𝜎². In this problem, we will consider two estimators for 𝜇, defined as follows.

$$\hat{\mu}_1 = \frac{1}{2}x_1 + \frac{1}{2}x_2 \qquad\qquad \hat{\mu}_2 = \frac{1}{3}x_1 + \frac{2}{3}x_2$$

a. Show that both estimators are unbiased.

b. Show that 𝜇̂₁ is more efficient than 𝜇̂₂.

2. Suppose you take a very large iid random sample from a population {𝑥₁, 𝑥₂, … , 𝑥ₙ} with 𝑛 > 100. You want to estimate the true population mean 𝜇. To speed the calculations, you take the sample mean only of the first 100 observations. That is, no matter how large the sample 𝑛 is, your estimator takes the sample mean only of the first 100 observations.

5.2: Sampling Distributions

In the previous section, we explored some properties of estimators. We made the point that the sample mean 𝑥̅ will not always be exactly equal to the true population mean 𝜇. The sample mean 𝑥̅ is itself a random variable with its own distribution. It will sometimes be higher than 𝜇 and sometimes lower than 𝜇. For example, if we sample 100 people and take their average height, then because the sampling is random the sample mean will sometimes be higher and sometimes lower than the true population mean height.

Even though the sample mean 𝑥̅ is itself randomly distributed, we demonstrated some nice properties in the previous section. The sample mean 𝑥̅ is equal to 𝜇 on average (unbiased), it approaches 𝜇 as the sample size gets large (consistent), and its variance is lower than the variance of any other unbiased estimator of 𝜇 (efficient).

In this section, we go into more detail about what the distribution of 𝑥̅ actually looks like. That is, when we have a true population distribution and we take samples from this population, what can we say about the distribution of the sample mean? (We could just as easily ask this question about the distribution of the sample median or the sample standard deviation or any other sample statistic).

The answer, in general, is that it depends on the specific case. For example, suppose that the following is the population distribution.

𝑥     𝑃(𝑥)
0     1/2
3     1/6
12    1/3

You can calculate that the mean of this distribution is 𝐸(𝑥) = 𝜇 = 4.5 and the variance is 𝑉𝑎𝑟(𝑥) = 𝜎² = 29.25.

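Now consider every possible iid sample of size 𝑛 = 2 drawn from this population. Each row of the table below lists one sample, its sample mean and its probability.

Sample    𝑥̅      Probability
0, 0      0      1/2 ⋅ 1/2 = 1/4
0, 3      1.5    1/2 ⋅ 1/6 = 1/12
0, 12     6      1/2 ⋅ 1/3 = 1/6
3, 0      1.5    1/6 ⋅ 1/2 = 1/12
3, 3      3      1/6 ⋅ 1/6 = 1/36
3, 12     7.5    1/6 ⋅ 1/3 = 1/18
12, 0     6      1/3 ⋅ 1/2 = 1/6
12, 3     7.5    1/3 ⋅ 1/6 = 1/18
12, 12    12     1/3 ⋅ 1/3 = 1/9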

Summarizing the table above, the distribution of the sample mean 𝑥̅ is as follows:

𝑥̅      𝑃(𝑥̅)
0      1/4
1.5    1/6
3      1/36
6      1/3
7.5    1/9
12     1/9

Incidentally, you can easily do the calculation to show that, from these random samples of size 2, 𝐸(𝑥̅) = 4.5, which is the same as the true population mean. But this is not surprising since we know from the previous section that the sample mean is an unbiased estimator of the true population mean. You can use this distribution to calculate 𝑉𝑎𝑟(𝑥̅) = 14.625, which agrees with the formula we derived in the previous section: 𝑉𝑎𝑟(𝑥̅) = 𝜎²/𝑛 = 29.25/2 = 14.625.

You could work the sampling distribution out by hand for all the possible samples of size 3 or 4 or higher, but it would be tedious. Luckily, we have computers.
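As a sketch of how a computer can take over this enumeration, the code below builds the exact sampling distribution of the sample mean for the three-point population above (values 0, 3 and 12 with probabilities 1/2, 1/6 and 1/3). The helper names `sampling_distribution` and `mean_and_variance` are illustrative. Exact fractions are used so the results can be compared directly with the hand calculation.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Population distribution from the text: value -> probability.
population = {0: Fraction(1, 2), 3: Fraction(1, 6), 12: Fraction(1, 3)}


def sampling_distribution(pop, n):
    """Exact distribution of the sample mean for iid samples of size n."""
    dist = defaultdict(Fraction)
    for sample in product(pop, repeat=n):      # every ordered sample
        prob = Fraction(1)
        for x in sample:
            prob *= pop[x]                     # independence: multiply
        dist[Fraction(sum(sample), n)] += prob
    return dict(dist)


def mean_and_variance(dist):
    """Mean and variance of a discrete distribution given as value -> prob."""
    m = sum(x * p for x, p in dist.items())
    v = sum((x - m) ** 2 * p for x, p in dist.items())
    return m, v


mu, sigma2 = mean_and_variance(population)
for n in (2, 3, 4):
    m, v = mean_and_variance(sampling_distribution(population, n))
    print(f"n={n}: E(x-bar)={float(m)}, Var(x-bar)={float(v)}, "
          f"sigma^2/n={float(sigma2 / n)}")
```

The number of ordered samples grows as 3ⁿ, so exact enumeration is only practical for small 𝑛. For larger samples, simulation is the usual alternative.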

[Figure: sampling distribution of the sample mean for sample sizes 𝑛 = 2, 𝑛 = 5 and 𝑛 = 50, drawn from the population distribution 𝑥 = 0, 3, 12 with 𝑃(𝑥) = 1/2, 1/6, 1/3]

Population Distribution

𝑥     𝑃(𝑥)
1     0.4
2     0.1
3     0
4     0.1
5     0.4

Population Distribution: Exponential with mean 𝛽 = 5

There are two important things to notice about these sampling distributions. First, when the sample size 𝑛 is low, there is no regular pattern. The distribution of the sample mean for random samples drawn from the population is specific to the population from which the samples are drawn.

EXERCISES

5.3: The central limit theorem

The previous section discussed the concept of sampling distributions. The idea is that the sample mean 𝑥̅ is itself a random variable. Although it is an unbiased and consistent estimator of the true population mean 𝜇, it is not always exactly equal, and there is some distribution of 𝑥̅. For low sample sizes 𝑛 there was no regularity – the distribution of 𝑥̅ is specific to the nature of the population distribution. But for large sample sizes there is a surprising regularity. No matter what the population distribution looks like, the distribution of the sample mean 𝑥̅ looks similar as long as the sample size is sufficiently large.

The central limit theorem states that, for any population distribution with mean 𝜇 and standard deviation 𝜎, as long as the sample size 𝑛 is sufficiently large, then the distribution of the sample mean 𝑥̅ can be approximated by a normal distribution with mean 𝜇 and standard deviation 𝜎/√𝑛.

Briefly, as 𝑛 → ∞, then $\bar{x} \to N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.

Intuitively, the distribution of 𝑥̅ is always centered correctly around 𝜇, with the distribution getting tighter (i.e. a smaller variance) as the sample size rises. For example, in a population where the mean is 𝜇 = 7, the diagram below shows the distribution of the sample mean for 𝑛 = 50 and for 𝑛 = 500. Both are centered correctly, but the distribution is tighter for 𝑛 = 500, meaning that there is a higher probability of 𝑥̅ being closer to 𝜇.

Importantly, the central limit theorem is an asymptotic result and the approximation is only good if the sample size 𝑛 is sufficiently large. For 𝑛 very small, the distribution of 𝑥̅ may not be approximately normal.
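The role of the sample size can be illustrated with a simulation. This is a rough sketch, not a proof: it draws sample means from an exponential population with mean 5 and measures how lopsided their distribution is using the skewness coefficient, which is 0 for a normal distribution. The names `BETA`, `sample_means` and `skewness`, the seed and the number of repetitions are all illustrative assumptions.

```python
import random

BETA = 5.0                # mean of the exponential population
rng = random.Random(5)    # fixed seed so the run is reproducible


def sample_means(n, reps=4000):
    """Sample means of `reps` iid exponential samples of size n."""
    # random.expovariate takes the rate, which is 1 / mean.
    return [sum(rng.expovariate(1 / BETA) for _ in range(n)) / n
            for _ in range(reps)]


def skewness(values):
    """Third standardized moment; roughly 0 for a symmetric distribution."""
    k = len(values)
    m = sum(values) / k
    m2 = sum((v - m) ** 2 for v in values) / k
    m3 = sum((v - m) ** 3 for v in values) / k
    return m3 / m2 ** 1.5


summary = {}
for n in (2, 5, 50):
    means = sample_means(n)
    summary[n] = (sum(means) / len(means), skewness(means))
    print(f"n={n}: average of x-bar={summary[n][0]:.3f}, "
          f"skewness={summary[n][1]:.3f}")
```

The average of the sample means stays near 5 for every 𝑛, while the skewness shrinks toward 0 as 𝑛 grows, which is what the normal approximation requires.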

Consider a random variable measuring birthweights with mean 𝜇 = 7 and standard deviation 𝜎 = 2. Birthweights are not normally distributed: the distribution is skewed to the left because of premature births. However, the central limit theorem tells us that the distribution of sample means taken from this distribution is approximately normal for sufficiently large sample sizes.

• You take a random sample of 30 birthweights. What is the probability that the sample mean exceeds 7.5 pounds?

We are interested in 𝑃(𝑥̅ ≥ 7.5). Using the central limit theorem, we know that 𝑥̅ is approximately normally distributed with mean 𝜇 = 7 and standard deviation 𝜎/√𝑛 = 2/√30. Thus, converting to a z-score:

$$P(\bar{x} \ge 7.5) = P\left(\frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \ge \frac{7.5-7}{2/\sqrt{30}}\right) = P(z \ge 1.37) = 0.0853$$

• If you instead take a random sample of 100 birthweights, what now is the probability that the sample mean exceeds 7.5 pounds?

Using the same analysis as in the previous part, but with 𝑛 = 100, the central limit theorem tells us that 𝑥̅ is approximately normally distributed with mean 𝜇 = 7 and standard deviation 𝜎/√𝑛 = 2/√100. Thus, converting to a z-score:

$$P(\bar{x} \ge 7.5) = P\left(\frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \ge \frac{7.5-7}{2/\sqrt{100}}\right) = P(z \ge 2.50) = 0.0062$$

This makes sense. According to the central limit theorem, as the sample size rises, the sample mean is more and more likely to be close to the true population mean 𝜇. Thus, the probability of drawing a sample where the sample mean birthweight exceeds 7.5 pounds gets lower and lower as the sample size rises.
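The two birthweight calculations can be reproduced without a z-table. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `prob_sample_mean_exceeds` is an illustrative choice. Because it does not round the z-score to two decimals, its answers may differ from table values in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_mean_exceeds(threshold, mu, sigma, n):
    """CLT approximation to P(x-bar >= threshold) for an iid sample of size n."""
    sampling_dist = NormalDist(mu, sigma / sqrt(n))   # N(mu, sigma / sqrt(n))
    return 1 - sampling_dist.cdf(threshold)


mu, sigma = 7, 2   # birthweight mean and standard deviation from the text

p30 = prob_sample_mean_exceeds(7.5, mu, sigma, 30)
p100 = prob_sample_mean_exceeds(7.5, mu, sigma, 100)

print(f"n=30:  P(x-bar >= 7.5) is about {p30:.4f}")
print(f"n=100: P(x-bar >= 7.5) is about {p100:.4f}")
```

Calling the same helper with larger values of `n` shows the probability continuing to fall toward zero.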

Here is an example. Suppose that 𝑥 measures the number of cupcakes eaten by each guest at a wedding, and follows the distribution given below.

𝑥     𝑃(𝑥)
0     0.3
1     0.5
2     0.2

You can calculate that the mean and standard deviation of this distribution are 𝜇 = 𝐸(𝑥) = 0.9 and 𝜎 = √𝑉𝑎𝑟(𝑥) = 0.7.

• You have invited 900 guests to your wedding. If the caterer brings 800 cupcakes, what is the probability that he will run out?

What the question asks is 𝑃(∑𝑥 ≥ 800). But by dividing by the sample size 𝑛 = 900, we can easily turn this into a statement about the sample mean:

$$P\left(\frac{\sum x}{n} \ge \frac{800}{900}\right) = P(\bar{x} \ge 0.8889)$$

In other words, asking whether the total number of cupcakes eaten will exceed 800 is the same as asking whether the average number of cupcakes eaten per guest will exceed 0.8889. We can now apply the central limit theorem:

$$P(\bar{x} \ge 0.8889) = P\left(\frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \ge \frac{0.8889-0.9}{0.7/\sqrt{900}}\right) = P(z \ge -0.48) = 0.6844$$

So with probability 0.6844 the caterer will not have enough cupcakes.
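The step of converting a question about a total into a question about a mean can be checked numerically. In this sketch the same probability is computed twice: once from the approximate distribution of the total, which is normal with mean 𝑛𝜇 and standard deviation 𝜎√𝑛, and once from the distribution of the sample mean. The variable names are illustrative, and the z-score is not rounded, so the result may differ slightly from a table lookup.

```python
from math import sqrt
from statistics import NormalDist

pmf = {0: 0.3, 1: 0.5, 2: 0.2}   # cupcakes per guest, from the text
n, cupcakes = 900, 800

# Mean and standard deviation of a single guest's consumption.
mu = sum(x * p for x, p in pmf.items())
sigma = sqrt(sum((x - mu) ** 2 * p for x, p in pmf.items()))

# Route 1: work with the total, approximately N(n * mu, sigma * sqrt(n)).
total_dist = NormalDist(n * mu, sigma * sqrt(n))
p_total = 1 - total_dist.cdf(cupcakes)

# Route 2: work with the sample mean, approximately N(mu, sigma / sqrt(n)).
mean_dist = NormalDist(mu, sigma / sqrt(n))
p_mean = 1 - mean_dist.cdf(cupcakes / n)

print(f"mu={mu:.2f}, sigma={sigma:.2f}")
print(f"P(total >= 800) via total: {p_total:.4f}")
print(f"P(total >= 800) via mean:  {p_mean:.4f}")
```

Both routes standardize to the same z-score, so they agree up to floating-point error.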

Finally, it is important to consider the special case of sampling a proportion. For example, suppose we take a survey and ask people whether they intend to vote for the Democrat in the next election. This is modeled as a random variable 𝑥 equal to 1 if the condition is true and 0 if the condition is false. Where 𝑝 is the true population proportion (i.e. the probability that each single individual votes for a Democrat), then for each person this is a binomial experiment with 𝑛 = 1 and probability 𝑝 of success. Using known results, the mean is 𝜇 = 𝑝 and the standard deviation is 𝜎 = √(𝑝(1 − 𝑝)).

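Because the sample proportion 𝑝̂ is itself a sample mean of these 0/1 observations (the number of successes divided by the sample size), the central limit theorem applies as long as the sample size is sufficiently large.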

Applying the central limit theorem, the sample proportion 𝑝̂ is approximately normally distributed with mean 𝜇 = 𝑝 and standard deviation 𝜎/√𝑛 = √(𝑝(1 − 𝑝))/√𝑛. In brief, as 𝑛 → ∞, then $\hat{p} \to N\left(p, \frac{\sqrt{p(1-p)}}{\sqrt{n}}\right)$.

• The true proportion of smokers in some state is 𝑝 = 0.4. If we take a random sample of people in the state of size 𝑛 = 100, what is the probability that more than 30% from the sample are smokers?

The question is asking us to calculate 𝑃(𝑝̂ ≥ 0.3). Using the central limit theorem, we know that 𝑝̂ is approximately normally distributed with mean 𝜇 = 0.4 and standard deviation 𝜎/√𝑛 = √(0.4(1 − 0.4))/√100 ≈ 0.049. Thus, converting to a z-score:

$$P(\hat{p} \ge 0.3) = P\left(\frac{\hat{p}-p}{\sqrt{p(1-p)}/\sqrt{n}} \ge \frac{0.3-0.4}{\sqrt{0.4(1-0.4)}/\sqrt{100}}\right) = P(z \ge -2.04) = 0.9793$$
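As a check on the arithmetic, the sketch below computes the standard error, the z-score and the normal tail probability for the smoker example from the stated 𝑝 and 𝑛. It also compares the approximation with the exact binomial probability that at least 30 of the 100 people sampled are smokers. Treating the event as "at least 30" is an assumption made to match 𝑃(𝑝̂ ≥ 0.3), and the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

p, n = 0.4, 100        # true proportion and sample size from the text
cutoff = 0.3           # sample proportion of interest

# CLT approximation: p-hat is roughly N(p, sqrt(p * (1 - p) / n)).
std_error = sqrt(p * (1 - p) / n)
z = (cutoff - p) / std_error
p_clt = 1 - NormalDist().cdf(z)

# Exact binomial probability of at least 30 smokers out of 100.
k_min = 30
p_exact = sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
              for k in range(k_min, n + 1))

print(f"standard error = {std_error:.4f}, z = {z:.2f}")
print(f"CLT approximation: {p_clt:.4f}")
print(f"Exact binomial:    {p_exact:.4f}")
```

The two answers are close but not identical, since the normal curve is only an approximation to the discrete binomial distribution at 𝑛 = 100.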

EXERCISES

1. Customers who plan to purchase a new vehicle on average plan to spend $27,100, with a standard deviation of $5200. What is the probability that the sample mean planned spending for a group of 100 randomly chosen customers is within $1000 of the population mean?

2. Workers employed in a certain industry have a mean wage of $7.00 per hour with a standard deviation of $0.50. What is the probability that a randomly chosen sample of 64 workers will have an average wage less than $6.90?

3. 20% of computer disks contain defects. In a sample of 100 disks, what is the probability that 85 or more will be free of defects?

4. Vehicles entering an intersection are equally likely to turn left, turn right or proceed straight. If 500 vehicles enter this intersection, what is the probability that fewer than 30% will turn right?

5. The time required for a flight from New York to Chicago is uniformly distributed between 120 and 140 minutes. If there are 145 such flights every day, what is the probability that the average flight time for these flights will exceed 132 minutes?