Chapter 9: Sampling Distributions
Section 9.1: Sampling Distribution
This chapter introduces the topic of “statistical inference”: using statistical concepts to interpret the results of studies and surveys.
Parameter: A number that describes an aspect of a population.
Statistic: A number that is computed from sample data; often used to estimate an unknown parameter.
Example: A census of all DHS seniors found that 10% got into college early. An SRS of 30 seniors was also taken and in that sample 12% got into college early. The 10% is a parameter while the 12% is a statistic.
Sampling distributions and Sampling Variability:
If we take repeated samples from the DHS senior population and measure the proportion of seniors in each sample who got into college early, we will undoubtedly get different numbers for the different samples. This is referred to as sampling variability.
If we were to take all possible samples of the same size from the population, compute the sample proportion, $\hat{p}$, of each sample, and then create a distribution of those values, it would be called the sampling distribution of $\hat{p}$.
The following properties generally describe a sampling distribution of $\hat{p}$ created from samples with a large size (usually $n \geq 30$):
The overall shape of the distribution is symmetric and approximately normal. The larger the sample size the closer the shape is to a normal distribution.
A rule of thumb for deciding whether a normal curve can be used to approximate the sampling distribution of sample proportions is that both of the following hold:
a) np > 10 and b) n(1-p) > 10
There are no outliers or other important deviations from the main pattern.
The mean (center) of the distribution is equal to the true population parameter, p.
The variability (spread) of the sampling distribution depends on the sample size. The larger the sample size, the smaller the variability of the sampling distribution.
The standard deviation of the sampling distribution is $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$ (as long as the population is at least ten times as large as the sample).
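The properties listed above can be checked with a small simulation. The sketch below is only an illustration: the values p = 0.4 and n = 100 are hypothetical (chosen so that np and n(1-p) are both well above 10), and independent random draws stand in for an SRS from a population much larger than the sample. It takes 5000 samples rather than all possible samples, so the results are approximate.

```python
import math
import random
import statistics

def simulate_sample_proportions(p, n, reps, seed=0):
    """Return `reps` sample proportions, each from a sample of size n.

    Independent draws are used, which approximates an SRS from a
    population that is much larger than the sample.
    """
    rng = random.Random(seed)
    p_hats = []
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        p_hats.append(successes / n)
    return p_hats

# Hypothetical values for illustration: np = 40 and n(1-p) = 60.
p, n = 0.4, 100
p_hats = simulate_sample_proportions(p, n, reps=5000)

sim_mean = statistics.mean(p_hats)       # should be close to p
sim_sd = statistics.pstdev(p_hats)       # should be close to sqrt(p(1-p)/n)
theory_sd = math.sqrt(p * (1 - p) / n)

print(f"mean of p-hat values: {sim_mean:.4f} (true p = {p})")
print(f"sd of p-hat values:   {sim_sd:.4f} (formula gives {theory_sd:.4f})")
```

Rerunning with a larger n should shrink the simulated standard deviation, matching the statement that larger samples give less variability.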
Not all sampling distributions have these properties (though most do). When a sampling distribution does not have its center equal to the true population parameter, the statistic used to create that sampling distribution is said to be biased.
The goal when creating a sampling distribution is to have no bias and low variability. Bias describes where the sampling distribution is centered relative to the true parameter; variability describes how spread out it is.
The variability of a sampling distribution is determined by the sampling design and the sample size used to create the sampling distribution. As long as the population is much larger than the sample (at least 10 times as large), the spread of the sampling distribution is approximately the same for any population size.
Contrary to popular belief and intuition, the behavior of a statistic from random samples is not influenced by the size of the population (as long as the population is much larger than the sample). To see why, think of taking a scoop of M&M's from a well-shuffled 1-pound bag. If the M&M's are well shuffled, does the scoop really know whether it was surrounded by a one-pound bag of M&M's or a huge bin of M&M's? Clearly it does not.
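The scoop argument can also be tried numerically. The sketch below is an illustration under assumed values: the population sizes (10,000 and 1,000,000), the proportion p = 0.35, and the sample size n = 100 are all hypothetical. Each population is represented by numbered individuals, and an SRS is drawn without replacement.

```python
import random
import statistics

def sd_of_sample_proportion(pop_size, p, n, reps, seed=0):
    """Estimate the spread of p-hat for SRSs of size n from a finite population."""
    rng = random.Random(seed)
    cutoff = round(p * pop_size)  # individuals numbered below cutoff are "successes"
    p_hats = []
    for _ in range(reps):
        sample = rng.sample(range(pop_size), n)  # SRS without replacement
        p_hats.append(sum(1 for i in sample if i < cutoff) / n)
    return statistics.pstdev(p_hats)

# Same p and same n, but populations that differ in size by a factor of 100.
sd_small_pop = sd_of_sample_proportion(10_000, p=0.35, n=100, reps=2000)
sd_large_pop = sd_of_sample_proportion(1_000_000, p=0.35, n=100, reps=2000, seed=1)

print(f"spread of p-hat, population of 10,000:    {sd_small_pop:.4f}")
print(f"spread of p-hat, population of 1,000,000: {sd_large_pop:.4f}")
```

Both spreads should come out near sqrt(0.35 × 0.65 / 100) ≈ 0.048, since both populations are far more than 10 times the sample size.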
Section 9.2: Sample Proportions
Example: An SRS of 1500 high school seniors in CT was asked whether they applied to college early. Let’s assume that there are 100,000 high school seniors in the state of Connecticut, and that in fact 35% of them apply to college early. What is the probability that your sample of 1500 seniors will give a result within 2 percentage points of the true value of 35%?
a) Since the population size is greater than 10 times the sample size we are OK to proceed (we can use the formula for standard deviation).
b) Since np = 1500(0.35) = 525 > 10 and n(1-p) = 1500(0.65) = 975 > 10, we are OK in assuming that the distribution of sample proportions is approximately normal.
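c) We know that the sampling distribution of sample proportions has a mean of 0.35 (equal to the p in the population) and that the standard deviation is: $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{(0.35)(0.65)}{1500}} \approx 0.0123$
d) We are looking for the probability that $\hat{p}$ falls between 0.33 and 0.37 (within 2 percentage points of 35%). So we are looking for $P(0.33 \leq \hat{p} \leq 0.37)$.
e) Draw a normal curve that approximates the sampling distribution of $\hat{p}$: center it at 0.35 and shade the area between 0.33 and 0.37.
f) Standardizing the values we get: $z = \frac{0.33 - 0.35}{0.0123} \approx -1.63$ and $z = \frac{0.37 - 0.35}{0.0123} \approx 1.63$
g) We can now find the area under the normal curve by using the z-score table in the back of the book (or our calculator): $P(-1.63 \leq Z \leq 1.63) = 0.9484 - 0.0516 = 0.8968$. So about 90% of all such samples will give a result within 2 percentage points of the true value.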
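Steps a) through g) can be carried out in a few lines of code instead of a table. This is a minimal sketch using `NormalDist` from Python's standard `statistics` module; the variable names are our own. Because the code does not round the z-scores to two decimals, its answer may differ in the third decimal place from a table lookup.

```python
import math
from statistics import NormalDist

p = 0.35             # true population proportion (from the example)
n = 1500             # sample size (from the example)
pop_size = 100_000   # population size (from the example)

# Conditions: population at least 10 times the sample, and the rule of thumb.
assert pop_size >= 10 * n
assert n * p > 10 and n * (1 - p) > 10

# Normal approximation to the sampling distribution of p-hat.
sd_p_hat = math.sqrt(p * (1 - p) / n)
sampling_dist = NormalDist(mu=p, sigma=sd_p_hat)

# P(0.33 <= p-hat <= 0.37) is the area under the curve between the two values.
prob_within_2_points = sampling_dist.cdf(0.37) - sampling_dist.cdf(0.33)

print(f"standard deviation of p-hat: {sd_p_hat:.4f}")
print(f"P(0.33 <= p-hat <= 0.37):    {prob_within_2_points:.4f}")
```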
Section 9.3: Sample Means
A couple of things to think about:
1.) Averages are less variable than individual observations.
2.) Averages are more normal than individual observations.
Why are these two things important to us? A histogram of averages is more nearly normal and less spread out than a histogram of individual observations. Data are much easier to work with when they are approximately normal and have a small spread, so it is to our advantage to look at a distribution of averages.
Mean and Standard Deviation of a Sample Mean
Suppose that $\bar{x}$ is the mean of an SRS of size n drawn from a large population with mean μ and standard deviation σ. Then the mean of the sampling distribution of $\bar{x}$ is $\mu_{\bar{x}} = \mu$ and its standard deviation is $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
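A quick simulation can make these two facts concrete. In the sketch below the population is hypothetical: values spread uniformly between 0 and 12, so μ = 6 and σ = 12/√12 ≈ 3.46. This population is deliberately not normal, since the mean and standard deviation facts above do not require normality. The sample size n = 25 is also just an assumed value.

```python
import math
import random
import statistics

rng = random.Random(0)

# Hypothetical population: uniform on [0, 12], so mu = 6 and sigma = 12 / sqrt(12).
low, high = 0.0, 12.0
mu = (low + high) / 2
sigma = (high - low) / math.sqrt(12)

n = 25        # sample size (assumed for illustration)
reps = 4000   # number of simulated samples

sample_means = [
    statistics.fmean([rng.uniform(low, high) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)   # should be close to mu
sd_of_means = statistics.pstdev(sample_means)    # should be close to sigma / sqrt(n)
theory_sd = sigma / math.sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (mu = {mu})")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {theory_sd:.3f})")
print(f"sd of individuals:    {sigma:.3f}")
```

The sample means vary much less than individual observations do, by roughly a factor of √25 = 5.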
Sampling Distribution of a Sample Mean from a Normal Population
Draw an SRS of size n from a population that has a normal distribution with mean μ and standard deviation σ. Then the sample mean $\bar{x}$ has a normal distribution $N(\mu, \sigma/\sqrt{n})$ with mean μ and standard deviation $\sigma/\sqrt{n}$.
Example: Suppose the heights of young women are normally distributed with μ = 64.5 inches and σ = 2.5 inches. What is the probability that the mean height of an SRS of 10 young women is greater than 66.5 inches?
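One way to work this example is sketched below, using `NormalDist` from Python's standard `statistics` module in place of the z-score table. The variable names are our own. For comparison, the code also finds the probability that a single young woman is taller than 66.5 inches.

```python
import math
from statistics import NormalDist

mu = 64.5      # population mean height in inches (from the example)
sigma = 2.5    # population standard deviation in inches (from the example)
n = 10         # sample size (from the example)

# The population is normal, so x-bar is N(mu, sigma / sqrt(n)).
sd_x_bar = sigma / math.sqrt(n)
z = (66.5 - mu) / sd_x_bar

prob_mean_above = 1 - NormalDist(mu, sd_x_bar).cdf(66.5)
prob_individual_above = 1 - NormalDist(mu, sigma).cdf(66.5)

print(f"standard deviation of x-bar: {sd_x_bar:.4f}")
print(f"z-score for 66.5 inches:     {z:.2f}")
print(f"P(x-bar > 66.5):             {prob_mean_above:.4f}")
print(f"P(one woman > 66.5):         {prob_individual_above:.4f}")
```

A mean of 10 heights above 66.5 inches is far less likely than a single height above 66.5 inches, because averages are less variable than individual observations.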
The Central Limit Theorem
The CLT answers the question: what does the distribution of $\bar{x}$ look like if the original population is not normal?
CLT: Draw an SRS of size n from any population whatsoever with mean μ and finite standard deviation σ. When n is large, the sampling distribution of the sample mean $\bar{x}$ is close to the normal distribution $N(\mu, \sigma/\sqrt{n})$ with mean μ and standard deviation $\sigma/\sqrt{n}$.
NOTE: How large a sample size n is needed for $\bar{x}$ to be close to normal depends on the population distribution. More observations are required if the shape of the population distribution is far from normal.
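The CLT can be watched in action with a simulation. The sketch below rests on assumptions of our own: the population is a strongly right-skewed exponential distribution with mean 1, and skewness (about 0 for a symmetric, normal-looking distribution) is used as a rough measure of how far the distribution of sample means is from normal. Neither choice comes from these notes.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: near 0 for a symmetric distribution, positive if right-skewed."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def simulate_sample_means(n, reps, rng):
    """Means of `reps` samples of size n from an exponential population (mean 1)."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

rng = random.Random(0)

# n = 1 is just the population itself; larger n should look more normal.
skew_by_n = {n: skewness(simulate_sample_means(n, 4000, rng)) for n in (1, 5, 30)}

for n, skew in skew_by_n.items():
    print(f"n = {n:2d}: skewness of sample means = {skew:.2f}")
```

The skewness should shrink toward 0 as n grows. Even at n = 30 some skew remains for a population this far from normal, in line with the note above.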
Example: The number of flaws per square yard in a type of carpet material varies with mean 1.6 flaws per square yard and a standard deviation of 1.2 flaws per square yard. The population distribution cannot be normal because a count takes only whole number values. An inspector studies 200 square yards of the material, records the number of flaws found in each square yard, and calculates $\bar{x}$, the mean number of flaws per square yard inspected. What is the probability the mean number of flaws exceeds 2 per square yard?
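One way to work this example is sketched below. It assumes that n = 200 is large enough for the CLT to apply, so that the sample mean is approximately normal even though the population is not. `NormalDist` comes from Python's standard `statistics` module; the variable names are our own.

```python
import math
from statistics import NormalDist

mu = 1.6      # mean flaws per square yard (from the example)
sigma = 1.2   # standard deviation of flaws per square yard (from the example)
n = 200       # square yards inspected (from the example)

# By the CLT, x-bar is approximately N(mu, sigma / sqrt(n)) for large n.
sd_x_bar = sigma / math.sqrt(n)
z = (2 - mu) / sd_x_bar

prob_mean_exceeds_2 = 1 - NormalDist(mu, sd_x_bar).cdf(2)

print(f"standard deviation of x-bar: {sd_x_bar:.4f}")
print(f"z-score for a mean of 2:     {z:.2f}")
print(f"P(x-bar > 2):                {prob_mean_exceeds_2:.2e}")
```

The z-score is beyond the range of most printed z-tables, so the probability is essentially 0: a mean above 2 flaws per square yard would be extremely unusual for this material.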
FINAL THOUGHTS ON CHAPTER 9: