3.6 Continuous distributions
3.6.2 Probabilities from continuous distributions
We computed the proportion of individuals with heights 180 to 185 cm in Example 3.102 as a fraction:
\[ \frac{\text{number of people between 180 and 185}}{\text{total sample size}} \]
We found the number of people with heights between 180 and 185 cm by determining the fraction of the histogram’s area in this region. Similarly, we can use the area in the shaded region under the curve to find a probability (with the help of a computer):
P(height between 180 and 185) = area between 180 and 185 = 0.1157
The probability that a randomly selected person is between 180 and 185 cm is 0.1157. This is very close to the estimate from Example 3.102: 0.1172.
GUIDED PRACTICE 3.103
Three US adults are randomly selected. The probability a single adult is between 180 and 185 cm is 0.1157.⁸³
(a) What is the probability that all three are between 180 and 185 cm tall?
(b) What is the probability that none are between 180 and 185 cm?
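One way to check both parts is sketched below in Python. The sketch assumes the three heights are independent, which is reasonable for a random sample from a large population; the variable names are illustrative only.

```python
# Probability that one randomly selected adult is between 180 and 185 cm
# (value taken from the text).
p = 0.1157

# Assumption: the three adults are selected independently, so the
# individual probabilities multiply.
p_all_three = p ** 3        # (a) all three are in the range
p_none = (1 - p) ** 3       # (b) none of the three is in the range

print(f"P(all three) = {p_all_three:.4f}")
print(f"P(none)      = {p_none:.4f}")
```

Under the independence assumption, part (a) is roughly 0.0015 and part (b) is roughly 0.69.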
Figure 3.28: A histogram with bin sizes of 2.5 cm. The shaded region represents individuals with heights between 180 and 185 cm. (Horizontal axis: height in cm.)
Figure 3.29: The continuous probability distribution of heights for US adults. (Horizontal axis: height in cm.)
EXAMPLE 3.104
What is the probability that a randomly selected person is exactly 180 cm? Assume you can measure perfectly.
This probability is zero. A person might be close to 180 cm, but not exactly 180 cm tall. This also makes sense with the definition of probability as area; there is no area captured between 180 cm and 180 cm.
GUIDED PRACTICE 3.105
Suppose a person’s height is rounded to the nearest centimeter. Is there a chance that a random person’s measured height will be 180 cm?⁸⁴
⁸⁴ This has positive probability. Anyone between 179.5 cm and 180.5 cm will have a measured height of 180 cm. This is probably a more realistic scenario to encounter in practice than the one in Example 3.104.
Section summary
• Histograms use bins with a specific width to display the distribution of a variable. When there is enough data and the data does not have gaps, as the bin width gets smaller and smaller, the histogram begins to resemble a smooth curve, or a continuous distribution.
• Continuous distributions are often used to approximate relative frequencies and probabilities. In a continuous distribution, the area under the curve corresponds to relative frequency or probability. The total area under a continuous probability distribution must equal 1.
• Because the area under the curve for a single point is zero, the probability of any specific value is zero. This implies that, for example, P(X < 5) = P(X ≤ 5) for a continuous probability distribution.
• Finding areas under curves is challenging; it is common to use distribution tables, calculators, or other technology to find such areas.
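As one illustration of using technology to find areas under a curve, the Python sketch below works with a normal curve. The normal model and its parameters (mean 170 cm, standard deviation 10 cm) are hypothetical values chosen for illustration. They are not the height distribution in Figure 3.29, so the interval probability will not match 0.1157.

```python
import math

def normal_cdf(x, mean, sd):
    """Area under a normal curve to the left of x."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

def area_between(a, b, mean, sd):
    """P(a < X < b): the area under the curve between a and b."""
    return normal_cdf(b, mean, sd) - normal_cdf(a, mean, sd)

MEAN, SD = 170.0, 10.0   # hypothetical parameters, not from the text

p_interval = area_between(180, 185, MEAN, SD)      # area over an interval
p_point = area_between(180, 180, MEAN, SD)         # zero width, zero area
p_rounded = area_between(179.5, 180.5, MEAN, SD)   # height rounds to 180 cm
p_total = area_between(-math.inf, math.inf, MEAN, SD)

print(f"P(180 < X < 185)     = {p_interval:.4f}")
print(f"P(X = 180)           = {p_point:.4f}")
print(f"P(179.5 < X < 180.5) = {p_rounded:.4f}")
print(f"total area           = {p_total:.4f}")
```

The zero-width interval has probability zero, while the rounded-height interval has a small positive probability, mirroring Example 3.104 and Guided Practice 3.105.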
Exercises
3.45 Cat weights. The histogram shown below represents the weights (in kg) of 47 female and 97 male cats.⁸⁵
(a) What fraction of these cats weigh less than 2.5 kg?
(b) What fraction of these cats weigh between 2.5 and 2.75 kg?
(c) What fraction of these cats weigh between 2.75 and 3.5 kg?
[Figure: histogram of body weight (kg), horizontal axis from 2.0 to 4.0, against frequency, vertical axis from 0 to 30.]
3.46 Income and gender. The relative frequency table below displays the distribution of annual total personal income (in 2009 inflation-adjusted dollars) for a representative sample of 96,420,486 Americans. These data come from the American Community Survey for 2005-2009. This sample consists of 59% males and 41% females.⁸⁶
(a) Describe the distribution of total personal income.
(b) What is the probability that a randomly chosen US resident makes less than $50,000 per year?
(c) What is the probability that a randomly chosen US resident makes less than $50,000 per year and is female? Note any assumptions you make.
(d) The same data source indicates that 71.8% of females make less than $50,000 per year. Use this value to determine whether or not the assumption you made in part (c) is valid.
Income                     Total
$1 to $9,999 or loss        2.2%
$10,000 to $14,999          4.7%
$15,000 to $24,999         15.8%
$25,000 to $34,999         18.3%
$35,000 to $49,999         21.2%
$50,000 to $64,999         13.9%
$65,000 to $74,999          5.8%
$75,000 to $99,999          8.4%
$100,000 or more            9.7%
⁸⁵ W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Fourth Edition. www.stats.ox.ac.uk/pub/MASS4. New York: Springer, 2002.
⁸⁶ U.S. Census Bureau, 2005-2009 American Community Survey.
Chapter highlights
This chapter focused on understanding likelihood and chance variation, first by solving individual probability questions and then by investigating probability distributions.
The main probability techniques covered in this chapter are as follows:
• The General Multiplication Rule for and probabilities (intersection), along with the special case when events are independent.
• The General Addition Rule for or probabilities (union), along with the special case when events are mutually exclusive.
• The Conditional Probability Rule.
• Tree diagrams and Bayes’ Theorem to solve more complex conditional problems.
• The Binomial Formula for finding the probability of exactly x successes in n independent trials.
• Simulations and the use of random digits to estimate probabilities.
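As a small illustration of the last two techniques, the Python sketch below evaluates the Binomial Formula and compares it with a simulation estimate. It reuses the probability 0.1157 from Section 3.6.2 with n = 3 adults; the function name, seed, and number of trials are arbitrary choices for this sketch.

```python
import random
from math import comb

def binomial_probability(n, x, p):
    """P(exactly x successes in n independent trials with success chance p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

p = 0.1157   # chance one adult is between 180 and 185 cm (Section 3.6.2)
n = 3

# Exact probabilities for x = 0, 1, 2, 3 successes.
exact = [binomial_probability(n, x, p) for x in range(n + 1)]

# Simulation: repeat the chance process many times and record how often
# exactly one of the three adults falls in the range.
rng = random.Random(0)    # fixed seed so the run is repeatable
trials = 100_000
hits = 0
for _ in range(trials):
    successes = sum(rng.random() < p for _ in range(n))
    if successes == 1:
        hits += 1
estimate = hits / trials

print(f"Binomial Formula, P(X = 1): {exact[1]:.4f}")
print(f"Simulation estimate:        {estimate:.4f}")
```

The simulation estimate should land close to the exact value, and the exact probabilities for x = 0 through 3 sum to 1.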
Fundamental to all of these problems is understanding when events are independent and when they are mutually exclusive. Two events are independent when the outcome of one does not affect the outcome of the other, i.e. P(A|B) = P(A). Two events are mutually exclusive when they cannot both happen together, i.e. P(A and B) = 0.
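The two definitions can be checked by direct enumeration. The Python sketch below uses two fair dice as a hypothetical example (not taken from this chapter's exercises) and exact fractions so that the equalities hold without rounding.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a set of outcomes."""
    return Fraction(len(event), len(outcomes))

A = {o for o in outcomes if o[0] == 6}   # first die shows 6
B = {o for o in outcomes if o[1] == 6}   # second die shows 6
C = {o for o in outcomes if o[0] == 1}   # first die shows 1

# Independent: P(A|B) = P(A and B) / P(B) equals P(A).
p_A_given_B = prob(A & B) / prob(B)
independent = p_A_given_B == prob(A)

# Mutually exclusive: A and C cannot both happen, so P(A and C) = 0.
mutually_exclusive = prob(A & C) == 0

# Mutually exclusive events with positive probability are not independent:
# knowing C happened drops the chance of A to zero.
p_A_given_C = prob(A & C) / prob(C)

# General Addition Rule: P(A or B) = P(A) + P(B) - P(A and B).
p_A_or_B = prob(A) + prob(B) - prob(A & B)

print("A and B independent:       ", independent)
print("A and C mutually exclusive:", mutually_exclusive)
print("P(A|C) =", p_A_given_C, " P(A) =", prob(A))
print("P(A or B) =", p_A_or_B)
```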
Moving from solving individual probability questions to studying probability distributions helps us better understand chance processes and quantify expected chance variation.
• For a discrete probability distribution, the sum of the probabilities must equal 1. For a continuous probability distribution, the area under the curve represents a probability and the total area under the curve must equal 1.
• As with any distribution, one can calculate the mean and standard deviation of a probability distribution. In the context of a probability distribution, the mean and standard deviation describe the average and the typical deviation from the average, respectively, after many, many repetitions of the chance process.
• A probability distribution can be summarized by its center (mean, median), spread (SD, IQR), and shape (right skewed, left skewed, approximately symmetric).
• Adding a constant to every value in a probability distribution adds that constant to the mean, but it does not affect the standard deviation. Multiplying every value by a constant multiplies the mean by that constant and multiplies the standard deviation by the absolute value of the constant.
• The mean of the sum of two random variables equals the sum of the means. However, this is not true for standard deviations. Instead, when finding the standard deviation of a sum or difference of independent random variables, take the square root of the sum of the squared standard deviations.
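The last two rules can be verified on a small example. The Python sketch below uses two made-up discrete distributions (the values and probabilities are hypothetical) and assumes the two random variables are independent; the helper names are illustrative.

```python
import math
from itertools import product

def mean_sd(dist):
    """Mean and SD of a discrete distribution given as {value: probability}."""
    mu = sum(v * p for v, p in dist.items())
    var = sum((v - mu) ** 2 * p for v, p in dist.items())
    return mu, math.sqrt(var)

def transform(dist, a, b):
    """Distribution of a*X + b (a must be nonzero)."""
    return {a * v + b: p for v, p in dist.items()}

def combine(dx, dy, sign=1):
    """Distribution of X + sign*Y, assuming X and Y are independent."""
    out = {}
    for (x, px), (y, py) in product(dx.items(), dy.items()):
        out[x + sign * y] = out.get(x + sign * y, 0.0) + px * py
    return out

# Hypothetical distributions for two independent random variables.
X = {0: 0.2, 1: 0.5, 2: 0.3}
Y = {10: 0.6, 20: 0.4}

mx, sx = mean_sd(X)
my, sy = mean_sd(Y)

# Linear change: the mean becomes -3*mx + 5, the SD becomes |-3| * sx.
mt, st = mean_sd(transform(X, -3, 5))

# Sum and difference: the means add or subtract, but the variances
# add in both cases.
m_sum, s_sum = mean_sd(combine(X, Y, +1))
m_diff, s_diff = mean_sd(combine(X, Y, -1))

print(f"X: mean {mx:.2f}, SD {sx:.2f};  Y: mean {my:.2f}, SD {sy:.2f}")
print(f"-3X + 5: mean {mt:.2f}, SD {st:.2f}")
print(f"X + Y:   mean {m_sum:.2f}, SD {s_sum:.4f}")
print(f"X - Y:   mean {m_diff:.2f}, SD {s_diff:.4f}")
print(f"sqrt(SD_X^2 + SD_Y^2) = {math.sqrt(sx**2 + sy**2):.4f}")
```

The sum and the difference have different means but the same standard deviation, and that standard deviation is larger than either individual one.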
The study of probability is useful for measuring uncertainty and assessing risk. In addition, probability serves as the foundation for inference, providing a framework for evaluating when an outcome falls outside of the range of what would be expected by chance alone.
Chapter exercises
3.47 Grade distributions. Each row in the table below is a proposed grade distribution for a class. Identify each as a valid or invalid probability distribution, and explain your reasoning.
Grades     A      B      C      D      F
(a)       0.3    0.3    0.3    0.2    0.1
(b)       0      0      1      0      0
(c)       0.3    0.3    0.3    0      0
(d)       0.3    0.5    0.2    0.1   -0.1
(e)       0.2    0.4    0.2    0.1    0.1
(f)       0     -0.1    1.1    0      0
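The two conditions a discrete probability distribution must satisfy can be written as a short check. The Python sketch below is one possible way to do this; the function name and the three example rows are hypothetical and are deliberately not the rows from the table above.

```python
import math

def is_valid_distribution(probs, tol=1e-9):
    """True if every probability is nonnegative and they sum to 1."""
    nonnegative = all(p >= 0 for p in probs)
    # Compare with a tolerance because decimal fractions such as 0.1
    # are not stored exactly in floating point.
    sums_to_one = math.isclose(sum(probs), 1.0, abs_tol=tol)
    return nonnegative and sums_to_one

# Hypothetical grade distributions over A, B, C, D, F.
examples = {
    "valid":          [0.1, 0.2, 0.4, 0.2, 0.1],
    "sum too large":  [0.5, 0.5, 0.5, 0.0, 0.0],
    "negative entry": [0.6, 0.6, -0.2, 0.0, 0.0],
}

for label, probs in examples.items():
    print(f"{label:15s} -> {is_valid_distribution(probs)}")
```

Note that the third example sums to 1 but is still invalid, because no outcome can have a negative probability.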
3.48 Health coverage, frequencies. The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey designed to identify risk factors in the adult population and report emerging health trends. The following table summarizes two variables for the respondents: health status and health coverage, which describes whether each respondent had health insurance.⁸⁷
                                    Health Status
                   Excellent  Very good    Good    Fair    Poor    Total
Health    No             459        727     854     385      99    2,524
Coverage  Yes          4,198      6,245   4,821   1,634     578   17,476
          Total        4,657      6,972   5,675   2,019     677   20,000
(a) If we draw one individual at random, what is the probability that the respondent has excellent health and doesn’t have health coverage?
(b) If we draw one individual at random, what is the probability that the respondent has excellent health or doesn’t have health coverage?
3.49 HIV in Swaziland. Swaziland has the highest HIV prevalence in the world: 25.9% of this country’s population is infected with HIV.⁸⁸ The ELISA test is one of the first and most accurate tests for HIV. For those who carry HIV, the ELISA test is 99.7% accurate. For those who do not carry HIV, the test is 92.6% accurate. If an individual from Swaziland has tested positive, what is the probability that he carries HIV?
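One way to set up this kind of problem is with Bayes’ Theorem. The Python sketch below interprets "99.7% accurate for those who carry HIV" as P(positive | HIV) = 0.997 and "92.6% accurate for those who do not" as P(negative | no HIV) = 0.926; that reading, and the function and parameter names, are assumptions of this sketch.

```python
def prob_condition_given_positive(prevalence, sensitivity, specificity):
    """Bayes' Theorem: P(has condition | positive test).

    prevalence  = P(condition)
    sensitivity = P(positive | condition)
    specificity = P(negative | no condition)
    """
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    # A positive result is either a true positive or a false positive.
    return true_positive / (true_positive + false_positive)

posterior = prob_condition_given_positive(
    prevalence=0.259, sensitivity=0.997, specificity=0.926
)
print(f"P(HIV | positive test) = {posterior:.4f}")
```

The same two branches (true positive and false positive) appear as the two "positive" paths of a tree diagram.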
3.50 Twins. About 30% of human twins are identical, and the rest are fraternal. Identical twins are necessarily the same sex – half are males and the other half are females. One-quarter of fraternal twins are both male, one-quarter both female, and one-half are mixes: one male, one female. You have just become a parent of twins and are told they are both girls. Given this information, what is the probability that they are identical?
3.51 Cost of breakfast. Sally gets a cup of coffee and a muffin every day for breakfast from one of the many coffee shops in her neighborhood. She picks a coffee shop each morning at random and independently of previous days. The average price of a cup of coffee is $1.40 with a standard deviation of 30¢ ($0.30), the average price of a muffin is $2.50 with a standard deviation of 15¢, and the two prices are independent of each other.
(a) What are the mean and standard deviation of the amount she spends on breakfast daily?
(b) What are the mean and standard deviation of the amount she spends on breakfast weekly (7 days)?
⁸⁷ Office of Surveillance, Epidemiology, and Laboratory Services, Behavioral Risk Factor Surveillance System, BRFSS 2010 Survey Data.
3.52 Scooping ice cream. Ice cream usually comes in 1.5 quart boxes (48 fluid ounces), and ice cream scoops hold about 2 ounces. However, there is some variability in the amount of ice cream in a box as well as the amount of ice cream scooped out. We represent the amount of ice cream in the box as X and the amount scooped out as Y. Suppose these random variables have the following means, standard deviations, and variances:
      mean    SD      variance
X     48      1       1
Y     2       0.25    0.0625
(a) An entire box of ice cream, plus 3 scoops from a second box is served at a party. How much ice cream do you expect to have been served at this party? What is the standard deviation of the amount of ice cream served?
(b) How much ice cream would you expect to be left in the box after scooping out one scoop of ice cream? That is, find the expected value of X − Y. What is the standard deviation of the amount left in the box?
(c) Using the context of this exercise, explain why we add variances when we subtract one random variable from another.
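A simulation can make the behavior of X − Y concrete. The Python sketch below assumes that X and Y are independent and normally distributed with the means and standard deviations from the table; the exercise does not state either assumption, and the seed and sample size are arbitrary.

```python
import math
import random

rng = random.Random(1)   # fixed seed so the run is repeatable
N = 100_000

# Assumption: independent, normally distributed amounts (in ounces).
box = [rng.gauss(48, 1) for _ in range(N)]       # X: amount in the box
scoop = [rng.gauss(2, 0.25) for _ in range(N)]   # Y: amount scooped out
left = [x - y for x, y in zip(box, scoop)]       # X - Y: amount left

sim_mean = sum(left) / N
sim_sd = math.sqrt(sum((v - sim_mean) ** 2 for v in left) / N)

sd_adding_variances = math.sqrt(1**2 + 0.25**2)        # variances added
sd_subtracting_variances = math.sqrt(1**2 - 0.25**2)   # variances subtracted

print(f"simulated mean of X - Y:   {sim_mean:.3f}")
print(f"simulated SD of X - Y:     {sim_sd:.3f}")
print(f"sqrt(Var(X) + Var(Y)):     {sd_adding_variances:.3f}")
print(f"sqrt(Var(X) - Var(Y)):     {sd_subtracting_variances:.3f}")
```

The simulated standard deviation should sit close to the value obtained by adding the variances and above the standard deviation of X alone: the scoop introduces a second source of variability, so the amount left is less predictable than the amount in the box.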