CHAPTER 5
Probability Distributions
CHAPTER OUTLINE
5.1 Probability Distribution of a Discrete Random Variable
5.2 Mean and Standard Deviation of a Probability Distribution
5.3 The Binomial Distribution
5.1 PROBABILITY DISTRIBUTION OF A DISCRETE RANDOM VARIABLE
As we explained in Chapter 4 (Section 4.1), Excel can be used to convert a frequency distribution table into a probability distribution table. Then, a graphical presentation of this distribution can easily be generated, as illustrated below.
Example 5-1 The table below gives the frequency distribution of the number of vehicles owned by all 2000 families living in a small town.
Number of Vehicles Owned Frequency
0 30 1 470 2 850 3 490 4 160
Enter this into Excel as a probability distribution table (having Excel calculate relative frequencies in place of frequencies) and generate a graphical presentation of this distribution.
87
Solution: Type the column headings “Number of Vehicles Owned, x” and “Probability, P(x)” into cells A1 and B1. (Widen the columns as necessary.) Enter the values 0-4 into cells A2-A6. Click in cell B2, and type “=30/SUM(30,470,850,490,160)” and then copy this to cells B3-B6. Double-click on cell B3 so that you can edit it. Replace the 30 after the “=” sign with 470, so that it contains the formula “=470/SUM(30,470,850,490,160).”
Hit ENTER. Edit cells B4-B6 similarly, so that they also contain the rest of the relative frequencies.
Highlight the table and click on the chart wizard icon: . In Step 1, select the Column chart type and click on Next. In Step 2, you need to click on the Series tab and remove the “Number of Vehicles Owned” series from the list, leaving only the
“Probability” series. Then click in the Category (X) axis labels text box and highlight cells A2-A6. Click on Next. In Step 3, edit the chart title to read “Probability Distribution of the Number of Vehicles Owned by Families,” enter “Number of Vehicles”
for the x-axis label, and “Probability” for the y-axis label. Click on the Legend tab and remove the legend, and then click on Finish. Double-click on the bars, click on the Options tab, and lower the Gap width to 50 (or less). Resize the chart (by clicking on it and then dragging a corner) or move it (by clicking and dragging the whole thing) as necessary.
Figure 5.1 Using Excel to generate a probability distribution table and graph.
Cumulative Frequency Distributions
In order to change a frequency distribution to a cumulative frequency distribution, you
need to change all of the lower class limits to the same first lower class limit, and then
calculate the cumulative frequencies.
Figure 5.2 The Format Data Series dialog box enables you to format various aspects of the bar chart, including the width of the bars comprising the graph.
Example 5-2 Construct a cumulative frequency distribution for the 50 states’ average travel times given in Example 2-4.
Solution: First, construct the frequency table as above, but instead of editing the classes to show varying lower class limits with the upper class limits, use 15.0 as the lower class limit for all of the classes. (These will be cumulative classes.)
In the column next to the frequencies, type in the heading "Cumulative Frequency." For the first one, just copy the first class' frequency. For the second one, type "=" and then arrow left, and type "+" and then arrow up to the previous cumulative frequency. Then copy this cell down to the cells below it (click on it and drag the square in the bottom right-hand corner).
Finally, hide the column with the frequencies in it. (You can't delete it since doing so
would also delete the actual references.) Right-click on the column letter and select
Hide.
Figure 5.3 A cumulative frequency distribution. The cumulative frequencies are calculated by using cell references to frequencies in a hidden column.
In order to generate an Ogive, which is really a cumulative frequency polygon, you need to enter upper class boundaries into a column. The upper class boundary of a class is always halfway between the upper class limit of that class and the next class' lower class limit. So, you can have Excel average these values for you, if you wish. Then copy the cumulative frequencies into the column next to them. (Go to Edit>Paste Special and select Values in order to avoid changing cell references.)
Finally, insert a row above the first upper class boundary, enter the number that is exactly one class width less than the first upper class boundary (in the cell just above it), and enter a 0 in the cell just above the first cumulative frequency.
Figure 5.4 Calculate upper class boundaries and position cumulative frequencies next to them in order to generate an Ogive.
Highlight the columns and go to the Chart Wizard. Select the Line chart option. In step 2, you will probably need to fix the data references. Click on the Series tab. Under Series, you need only to have the Cumulative Frequency series; remove the other one.
Click in the text box next to Category (X) axis labels and highlight the column of upper class boundaries. Then, you should see a good preview of the ogive. Remove the legend and edit titles in step 3, and click on Finish.
Example 5-3 Construct a cumulative frequency distribution for the data in Example 2-5.
Solution: Following the same procedure as above, only this time the cumulative classes are all scores up to and including a given score. Precisely, we the following is what the input should be for Example 2-5 (in comparison to Figure 5.4):
Figure 5.5 The input used to construct a cumulative frequency distribution for Example 2-5.
Again, in order to generate the cumulative frequencies in this case, begin by entering ‘2’
into cell D2, then ‘=B3+D2’ into cell D3, and then ‘=B4+D3’ into cell D4. The
remaining cumulative frequencies can now be easily generated by highlighting cells D3 –
D12, followed by Edit>Fill Down. The result is pictured above. Now, to form the
ogive, use the first column as the upper class boundaries. Now, follow the same
instructions as in Example 5-2 to obtain an ogive that resembles the one below:
Cumulative Frequency Distribution of Quiz Scores
0 5 10 15 20 25 30 35 40
1 2 3 4 5 6 7 8 9 10 11
Scores
Figure 5.6 Ogive for the data in Example 2-5. Here, the labels for the scores have been shifted up by one. If this is not desirable, you can always change the scale/labels along the x-axis.
5.2 MEAN AND STANDARD DEVIATION OF A PROBABILITY DISTRIBUTION
We will use Excel to calculate the mean and standard deviation of a discrete random variable similarly to how we had it calculate the mean and standard deviation of grouped data in Sections 2.5 and 2.6. The method we will use involves generating columns of values and using the following mathematical formulas found in your textbook:
2 2
( ) ( )
xP x x P x
2µ = ∑ σ = ∑ − µ
(Unfortunately, Excel does not have one-step built-in functions for calculating these measures.)
Example 5-4 The following table lists the probability distribution of the number of breakdowns per week for a machine based on past data.
Breakdowns per week x
Probability P(x)
0 .15 1 .20 2 .35 3 .30
Use Excel to find the mean number of breakdowns per week for this machine. Also, use it to calculate the standard deviation of the number of breakdowns.
Solution: Enter the probability distribution into a blank Excel worksheet. Include column headings of x and P(x), using cells A1-B5. In cell C1, type “xP(x)” for the column heading for the products that will be used for calculating the mean. In cell C2, type “=” and click on cell A2, then type “*” and click on cell B2, and hit ENTER. Then copy this down to cells C3-C5. In cell C7, insert the SUM function (or type “=sum(”
without the quotes), highlight cells C2-C5 and hit ENTER. You should see the mean value of 1.8 appear.
Figure 5.7 Using Excel to find the mean of a probability distribution.
Now, type a column label of “xxP(x)” or “x
2P(x)” in cell D1 for the products that will be used for calculating the standard deviation. Click on cell D2, type “=” and click on cell A2, then type “^2*” and click on cell B2. Then hit ENTER. Copy this down to cells D3-D5. Then copy cell C6 over to cell D6 to show the sum of that column, 4.3. Now, click on an empty cell nearby, say cell E6. Here we will finish calculating the standard deviation. Type “=SQRT(” and click on cell D6, then type “-” and click on cell C6, then type “^2)” and hit ENTER. The standard deviation of 1.029563 appear.
Figure 5.8 Using Excel to find the standard deviation of a probability distribution.
5.3 THE BINOMIAL DISTRIBUTION
Recall that the binomial random variable X is used to generate a probabilistic model of a situation in which an experiment involving only two outcomes (success and failure) is performed n independent, identical times. The binomial probability distribution and cumulative distribution, along with its parameters are as follows:
Let p = probability of a success and q = − . Then, 1 p 1) P(a successes) = ( ) n
a n aP X a p q
a
⎛ ⎞
−= = ⎜ ⎟
⎝ ⎠ , for any a = 0, 1, 2, …, n.
2) Cumulative probability distribution is
( )
x n xx a
P X a n p q
x
−
≤
≤ = ⎛ ⎞ ⎜ ⎟
∑ ⎝ ⎠ , for any a = 0, 1, 2, …, n.
3) Mean µ = np Standard Deviation npq σ =
Excel has the binomial probability distribution built-in as the function BINOMDIST.
This function requires four inputs:
1) the number of successes that you are interested in, x = Number_s 2) the total number of trials, n = Trials
3) the probability of success, p, as a decimal = Probability_s
4) 1 (TRUE) if you want the cumulative probability of at most (≤) x successes, or 0 (FALSE) if you want the probability of exactly (=) x successes = Cumulative.
We remark that Excel’s built-in probability distributions, such as the Binomial, will allow you to be slightly more accurate than using values from tables that have been rounded off. (Your answers may differ slightly from those in your textbook as a result.)
Example 5-5 At the Express House Delivery Service, providing high-quality service to customers is the top priority of the management. The company guarantees a refund of all charges if a package it is delivering does not arrive at its destination by the specified time. It is known from past data that despite all efforts, 2% of the packages mailed through this company do not arrive at their destinations within the specified time.
Suppose a corporation mails 10 packages through Express House Delivery Service on a certain day. Use Excel to find
a) the probability that exactly 1 of these 10 packages will not arrive at its destination within the specified time, and
b) the probability that at most 1 of these 10 packages will not arrive at its
destination within the specified time.
Solution: Click on a blank cell in an Excel worksheet. For part a), the first input is 1, for 1 of the 10 packages, the second input is 10, the total number of packages, the third input is .02, the probability that a package does not arrive by the specified time, and the fourth is 0 (for FALSE), since we want the probability for “exactly” 1. Click on OK and you should see the probability of 0.16675 appear. See Figure 5.9 below.
Figure 5.9 Go to Insert>Function in order to insert Excel’s BINOMDIST function, and fill in the four inputs in order to calculate a binomial probability.
For part b), the inputs are all the same except the last one, which needs to be a 1 (for TRUE), since we want the probability of “at most” (less than or equal to) 1. If you want to, you can just double-click on the cell and edit this input instead of reinserting the function and starting over. When you hit ENTER, you should see the probability of 0.983822 appear.
Figure 5.10 Excel’s BINOMDIST function can be typed in and/or edited instead of inserted.
Excel’s BINOMDIST function and chart wizard are also wonderful tools for viewing an entire binomial distribution.
Example 5-6 According to an Allstate Survey, 56% of Baby Boomers have car loans
and are making payments on these loans (USA TODAY, October 28, 2002). Assume that
this result holds true for the current population of all Baby Boomers. Let x denote the
number in a random sample of three Baby Boomers who are making payments on their car loans. Have Excel generate the probability distribution of x and graph it.
Solution: Enter headings of “x” and “P(x)” into cells A1 and B1. Enter the possible values of x: 0, 1, 2, and 3, into cells A2-A5. Click on cell B2 and insert the BINOMDIST function. For the first input, Number_s, move the window out of the way, if necessary, and click on cell A2. For the second input, Trials, type a 3. For the third input, Probability_s, enter .56, and for the fourth input, Cumulative, type 0 (for FALSE). Hit ENTER and you should see the value of 0.085184 appear. Now copy this cell down to cells B3-B5, and you’ll have the probabilities for the other success values as well!
For the graph, click on a blank cell and then click on the chart wizard icon: . In Step 1, select the Column chart type and click on Next. In Step 2, highlight cells B2- B5 (the probabilities only) for the Data range. Then click on the Series tab, click in the Category (X) axis labels text box, and highlight cells A2-A5 (the values of x). Click on Next. In Step 3, edit the titles and click on the Legend tab and remove the legend, and then click on Finish. Double-click on the bars, click on the Options tab, and lower the Gap width to 50 (or less). Finally, resize the chart (by clicking on it and then dragging a corner) or move it (by clicking and dragging the whole thing) as necessary.
Figure 5.11 Using Excel’s BINOMDIST function (with the FALSE option) and chart wizard to represent a binomial distribution.
Variations of the BINOMDIST function are sometimes required to answer specific probability questions. Remember that “at most” is the same as “equal or less” and “at least” is the same as “equal or more.”
Example 5-7 According to a 2001 study of college students by Harvard University’s
School of Public Health, 19.3% of those included in the study abstained from drinking
(USA TODAY, April 3, 2002). Suppose that of all current college students in the United States, 20% abstain from drinking. A random sample of six college students is selected.
Use Excel to answer the following.
a) Find the probability that exactly three college students in this sample abstain from drinking.
b) Find the probability that at most two college students in this sample abstain from drinking.
c) Find the probability that at least three college students in this sample abstain from drinking.
d) Find the probability that one to three college students in this sample abstain from drinking.
Solution: Using the BINOMDIST function for the various inequalities involved, with 0 (FALSE) for “=” and 1 (TRUE) for “≤,” the inputs and answers are as follows:
a) P(x = 3) is calculated by: =BINOMDIST(3, 6, .20, 0) and the result is: 0.08192.
b) P(x ≤ 2) is calculated by: =BINOMDIST(2, 6, .20, 1) and the result is: 0.90112.
c) P(x ≥ 3) = 1 – P(x < 3) = 1 – P(x ≤ 2), so use:
=1-BINOMDIST(2, 6, .20, 1) and the result is: 0.09888.
d) P(1≤ x ≤ 3) = P(x ≤ 3) – P(x <1) = P(x ≤ 3) – P(x ≤ 0), so use:
= BINOMDIST(3, 6, .20, 1)-BINOMDIST(0, 6, .20, 1), which gives 0.720896.
P-charts for Trend:
Much like a time-series plot (discussed in Section 2.7) was formed to investigate the change over time in certain numerical measurements for purposes of statistical control, the same type of chart can be formed in the same way to graph a series of sample proportions. Indeed, assuming that p
0is the long-term expected proportion, the upper and lower bounds for the control chart are given by:
0 0
0
(1 )
3 p p
p n
± −
The fact that 3 standard deviations are used in contrast to 2 for the sample mean in Section 2.6 of your textbook is simply a matter of preference; any number can be used here depending on the context of the study.
Note: Actually, these bounds correspond to the endpoints of the confidence interval about the sample proportion, used for inferential purposes. This will be discussed in a subsequent chapter in more detail.
Example 5-8 Suppose that the number of applications requiring full review out of the 20
in the sample were recorded for 19 weeks, resulting in the following measurements:
11 7 8 4 9 10 4 8 8 7
10 6 9 10 7 7 6 9 10
Compute the sample proportions corresponding to each measure and plot them in a p- chart.
Solution: Enter into column A the label Sample number (in cell A1) and then 1 – 19 in cells A2 – A20, respectively. Then, enter into column B the label Number (in cell B1) and then the above data in cells B2 – B20. Then, in column C, type the label Proportion (in cell C1) and then, in cell C2, type “=B2/20”. Then, highlight cells C2 – C20 and go to Edit>Fill down to fill in the remaining proportions.
Now, highlight columns A and C, go to the Chart Wizard, and create a scatter diagram just as described earlier. The output should look as follows:
Proportion of Applications Needing Full Review: 19 Week Study
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
0 5 10 15 20
Sample Number
Proportion
Figure 5.12 p-chart for Example 5-8.
Now, to calculate the upper and lower bounds (corresponding to 3 standard deviations from the value of p
0, which in this case is taken to be SUM(B2:B20)/380 = 0.4), we simply allow Excel to perform the arithmetic for us. Indeed, we have:
Lower Bound: 0.4 – 3*SQRT( .4 * (1 - .4) / 20 ) = 0.07 Upper Bound: 0.4 + 3*SQRT( .4 * (1 - .4) / 20 ) = 0.73
Notice that (1) none of the sample proportions escape these two bounds, and (2) the
sample proportions do indeed seem to fluctuate just around 0.4. These two observations
suggest the situation is indeed stable.
Exercises
5.1 One of the most profitable items at A1’s Auto Security Shop is the remote starting system. Let x be the number of such systems installed on a given day at this shop.
The following table lists the frequency distribution of x for the past 80 days.
x 1 2 3 4 5 f 8 20 24 16 12
Use Excel to convert this frequency distribution table to a probability distribution table for the number of remote starting systems installed on a given day. Then have Excel generate a graph of the probability distribution.
5.2 Do the following for the data sets in Exercises 2.2 and 2.3:
a) Prepare a cumulative frequency distribution.
b) Have Excel generate an ogive for the cumulative frequency distribution.
5.3 The following table lists the probability distribution of the number of exercise machines sold per day at Elmo’s Sporting Goods store.
Machines sold per day 4 5 6 7 8 9 10 Probability .08 .11 .14 .19 .20 .16 .12