Stat Camp for the
Full-time MBA Program
Daniel Solow
Lecture 4
The Normal Distribution and the
Central Limit Theorem
Example 1: Dear Abby
You wrote that a woman is pregnant for 266 days. Who said so? I carried my baby for ten months and five days, and there is no doubt about it because I know the exact date my baby was conceived. My husband is in the Navy and it couldn’t possibly have been any other time because I saw him only once for an hour, and I didn’t see him again until the day before the baby was born.
I don’t drink or run around, and there is no way this baby isn’t his, so please print a retraction about the 266-day carrying time because otherwise I am in a lot of trouble.
San Diego Reader
Dear Abby
Step 1: Identify an appropriate random variable.
Y = number of days of pregnancy
What are the possible values for Y? About 230 – 290?
What is the density function for Y? ???
[Figure: unknown probability density of Y plotted against days (…, 255, 260, 265, 270, 275, …).]
Idea: Approximate the density of Y with a normal!
Dear Abby
• Question: If you are going to use a normal approximation, what information do you need?
• Answer: The mean and standard deviation.
• Fact: According to the collective experience of generations of pediatricians, pregnancies have a mean of 266 days and a standard deviation of 16 days, so Y ~ N(μ = 266, σ = 16).
• Question: What are the possible values for Y? –∞ to +∞.
• Question: How can the number of days of pregnancy be < 230?
• Answer: Using the normal distribution, you have that P(Y < 230) = NORMDIST(230, 266, 16, TRUE) ≈ 0.01.
• Thus, when using the normal approximation, there is only about a 1% chance that a pregnancy lasts less than 230 days.
Dear Abby
• Step 2: State what you are looking for as a probability question in terms of the rv.
You want to find P(Y ≥ 10 mo. and 5 days) = P(Y ≥ 310).
• Step 3: Use the probability distribution of the rv to answer the probability question.
P(Y ≥ 310) = 1 – P(Y < 310)
= 1 – NORMDIST(310, 266, 16, TRUE)
= 0.00298
Was she telling the truth?
Possibly, but highly unlikely.
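The two Excel calculations in this example can be cross-checked outside Excel. The following is a minimal Python sketch, assuming the standard-library class statistics.NormalDist as a stand-in for NORMDIST; the variable names are illustrative and not part of the lecture.

```python
from statistics import NormalDist

# Pregnancy length in days, modeled as in the example: mean 266, sd 16.
pregnancy = NormalDist(mu=266, sigma=16)

# P(Y < 230): counterpart of NORMDIST(230, 266, 16, TRUE).
p_short = pregnancy.cdf(230)

# P(Y >= 310) = 1 - P(Y < 310): counterpart of 1 - NORMDIST(310, 266, 16, TRUE).
p_long = 1 - pregnancy.cdf(310)

print(f"P(Y < 230)  = {p_short:.4f}")
print(f"P(Y >= 310) = {p_long:.5f}")
```

Both probabilities come from the same cumulative distribution function; the right-tail question is answered by taking the complement.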
Example 2: Problem of GoodTire
GoodTire has a new tire for which, in order to be competitive, they want to offer a warranty of 30,000 miles. Before doing so, the company wants to know what fraction of tires they can expect to be returned under the warranty.
The Problem of GoodTire
Step 1: Identify an appropriate random variable.
• For GoodTire, let X = number of miles such a tire will last.
What are the possible values for X? 0 – 90000?
What is the density function for X? ???
From statistical analysis of a random sample, GoodTire believes the mileage follows approximately a normal distribution with a mean of 40,000 miles and a standard deviation of 10,000 miles, so assume that
X ~ N(μ = 40000, σ = 10000) with possible values: –∞ to +∞. (cont.)
The Problem of GoodTire
Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.
• GoodTire wants to know the
Fraction of tires returned = Likelihood a tire fails = P{X ≤ 30000} = ?
The Problem of GoodTire
Step 3: Use the probability distribution of the random variable to answer the probability question.
• For GoodTire, you have
P{X ≤ 30000} = NORMDIST(30000, 40000, 10000, TRUE) = 0.1587
[Figure: density of X ~ N(40000, 10000), centered at 40000, with 30000 marked.]
The Problem of GoodTire
Question: The CEO finds that a 16% return rate is too high. What warranty mileage s should they offer to get a 5% return rate?
Step 2: Probability Question: What should s be so that P{X ≤ s} = 0.05?
Step 3: s = NORMINV(0.05, 40000, 10000) = 23551.47
[Figure: density of X centered at 40000, with area 0.05 to the left of s = ?]
Fact: While you cannot control the value of a rv, you can control the likelihood of certain events occurring with that RV.
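As a rough cross-check of the GoodTire numbers, here is a minimal Python sketch. It assumes statistics.NormalDist as a substitute for Excel: cdf plays the role of NORMDIST(..., TRUE) and inv_cdf the role of NORMINV. The last printed digit may differ slightly from the Excel output quoted above.

```python
from statistics import NormalDist

# Tire life in miles, modeled as in the example: mean 40,000, sd 10,000.
mileage = NormalDist(mu=40_000, sigma=10_000)

# Fraction of tires returned under a 30,000-mile warranty: P(X <= 30000).
return_rate = mileage.cdf(30_000)

# Warranty mileage s with P(X <= s) = 0.05 (inverse of the cdf).
warranty_5pct = mileage.inv_cdf(0.05)

print(f"Return rate at 30,000 miles: {return_rate:.4f}")
print(f"Mileage for a 5% return rate: {warranty_5pct:.2f}")
```

The first call goes from a value to a probability; the second goes from a target probability back to a value, which is how the likelihood of an event is controlled.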
Example 3: Marketing Projections
• From historical data over a number of years, a firm knows that its annual sales average $25 million. For planning purposes, the CEO wants to know the likelihood that sales next year will:
– Exceed $30 million.
– Be within $1.5 million of the average.
The CEO is willing to issue bonuses if sales are
“sufficiently” high. What level should be set so
that bonuses are given at most 20% of the time?
Marketing Projections
Step 1: Identify an appropriate random variable.
• Let Y = next year’s sales in $ millions.
What are the possible values for Y? 0 – 50?
What is the density function for Y? ???
From statistical analysis over a number of years, they believe that annual sales follow approximately a normal distribution with a mean of $25 mil. and a standard deviation of $3 mil., so assume that
Y ~ N(μ = 25, σ = 3)
Marketing Projections
Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.
• You want to know:
• P(sales exceed $30 mil.) = P(Y ≥ 30).
• P(sales are within $1.5 mil. of $25 mil.) = P(23.5 ≤ Y ≤ 26.5).
• What should be the value of sales (s) so that P(giving a bonus) = 0.20? P(Y ≥ s) = 0.20?
Marketing Projections
Step 3: Use the probability distribution of the random variable to answer the probability question.
• From Excel, using μ = 25 and σ = 3:
• P(Y ≥ 30) = 1 – NORMDIST(30, 25, 3, TRUE) = 0.048.
• P(23.5 ≤ Y ≤ 26.5) = NORMDIST(26.5, 25, 3, TRUE) – NORMDIST(23.5, 25, 3, TRUE) = 0.383.
• s = NORMINV(0.8, 25, 3) = 27.524.
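The three marketing questions map onto three patterns: a right tail, an interval, and an inverse lookup. The sketch below illustrates them in Python, assuming statistics.NormalDist in place of the Excel functions; the variable names are hypothetical.

```python
from statistics import NormalDist

# Next year's sales in $ millions, modeled as in the example: mean 25, sd 3.
sales = NormalDist(mu=25, sigma=3)

# Right tail: P(Y >= 30) = 1 - P(Y < 30).
p_exceed_30 = 1 - sales.cdf(30)

# Interval: P(23.5 <= Y <= 26.5) = P(Y <= 26.5) - P(Y <= 23.5).
p_within = sales.cdf(26.5) - sales.cdf(23.5)

# Inverse: P(Y >= s) = 0.20 is the same as P(Y <= s) = 0.80.
bonus_level = sales.inv_cdf(0.80)

print(f"P(Y >= 30)           = {p_exceed_30:.3f}")
print(f"P(23.5 <= Y <= 26.5) = {p_within:.3f}")
print(f"Bonus threshold s    = {bonus_level:.3f}")
```

Note that the bonus question is stated as a right-tail probability of 0.20, so the inverse function is called with 0.80, just as NORMINV is called with 0.8 in the slide.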
Example 4: DUI Test
• In many states, a driver is legally drunk if the blood alcohol concentration, as determined by a breath analyzer, is 0.10% or higher.
• Suppose that a driver has a true blood alcohol concentration of 0.095%. With the breath analyzer test, what is the probability that the person will be (incorrectly) booked on a DUI charge?
Step 1: Identify an appropriate random variable.
Let Y = the measurement of the analyzer as a %.
Question: What are the possible values for Y? 0 – 0.3? (cont.)
DUI Test
Step 1 (continued).
Question: What is the density function for Y?
Answer: We do not know, but experience indicates that Y follows approximately a normal distribution with mean equal to the person’s true alcohol level and standard deviation equal to 0.004%, so Y ~ N(μ, σ = 0.004), where
μ = the person’s true blood alcohol level (%)
DUI Test
Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.
• You want to know the probability that a person with μ = 0.095 will be (incorrectly) booked on a DUI charge:
P(being booked on a DUI) = P(Y ≥ 0.10)
DUI Test
Step 3: Use the probability distribution of the random variable to answer the probability question.
• From Excel (using μ = 0.095 and σ = 0.004):
P(Y ≥ 0.10) = 1 – NORMDIST(0.10, 0.095, 0.004, TRUE) = 0.1056.
• There is about a 10% chance that such a person will be incorrectly charged with a DUI.
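To see how the chance of being booked depends on the true alcohol level, the calculation can be wrapped in a small function. This is only a sketch: the function name p_booked and its parameters are hypothetical, and statistics.NormalDist is assumed as a replacement for NORMDIST.

```python
from statistics import NormalDist

LEGAL_LIMIT = 0.10      # % blood alcohol at which a driver is booked
ANALYZER_SD = 0.004     # % standard deviation of the analyzer reading


def p_booked(true_level: float) -> float:
    """Probability that the analyzer reads at or above the legal limit."""
    reading = NormalDist(mu=true_level, sigma=ANALYZER_SD)
    return 1 - reading.cdf(LEGAL_LIMIT)


p_example = p_booked(0.095)   # the driver in the example
p_at_limit = p_booked(0.10)   # a driver exactly at the limit

print(f"True level 0.095%: P(booked) = {p_example:.4f}")
print(f"True level 0.100%: P(booked) = {p_at_limit:.4f}")
```

A driver whose true level equals the limit is booked about half the time under this model, because the reading is equally likely to fall on either side of the mean.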
An Insurance Problem
GoodHands is considering insuring employees of GoodTire. What annual premium should the company charge to be sure that there is a likelihood of no more than 1% of losing money on each customer?
This is an example of decision making under uncertainty: you have to make a decision today (how much should the annual premium be) facing an uncertain future.
Question: Why is the future uncertain?
Solving the Insurance Problem
Step 1: Identify an appropriate random variable.
• Let X = the $ claimed by a customer in one year.
• What are the possible values for X? [0, 100000 (?)]
• Is X continuous or discrete? discrete
• What is the density function for X? It is unknown, so borrow one.
From statistical analysis, the annual claim for these people follows approximately a normal distribution with a mean of $2500 and a standard deviation of $1000, so:
X ~ N(μ = 2500, σ = 1000)
• Note: It can be OK to approximate a discrete RV with a continuous distribution.
An Insurance Problem
Step 2: State what you are looking for in terms of a probability question pertaining to the RV.
• For GoodHands, what should the premium s be so that the likelihood of losing money is no more than 1%?
Question: When do you lose money on a customer? When X > s.
Probability Question: What should the premium s be so that P{X > s} = 0.01?
[Figure: density of X ~ N(2500, 1000) centered at 2500, with area 0.01 to the right of s.]
An Insurance Problem
Step 3: Use the probability distribution of the random variable to answer the probability question.
X ~ N(2500, 1000), P{X > s} = 0.01
s = NORMINV(0.99, 2500, 1000) = $4826.35
Fact: While you cannot control the value of a rv (such as the claim of a person), you can control the likelihood of certain events occurring with that RV (such as the likelihood of such a claim exceeding the premium).
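The premium calculation is an inverse lookup on the right tail. A minimal Python sketch follows, assuming statistics.NormalDist.inv_cdf as the counterpart of NORMINV; the variable names are illustrative only.

```python
from statistics import NormalDist

# Annual claim per customer in $, modeled as in the example: mean 2500, sd 1000.
claim = NormalDist(mu=2500, sigma=1000)

max_loss_probability = 0.01

# P(X > s) = 0.01 is equivalent to P(X <= s) = 0.99.
premium = claim.inv_cdf(1 - max_loss_probability)

# Check: the chance that a claim exceeds this premium.
p_loss = 1 - claim.cdf(premium)

print(f"Premium s     = ${premium:.2f}")
print(f"P(claim > s)  = {p_loss:.4f}")
```

Recomputing the tail probability at the chosen premium is a quick way to confirm that the inverse was taken on the correct side of the distribution.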
The Insurance Problem (cont.)
Question: GoodTire wants to insure all 100 of its employees through GoodHands. What premium should GoodHands charge per employee so that the likelihood of losing money on the average of all these claims is 1%?
Step 1: Identify appropriate random variables.
• For GoodHands, let
Xi = the $ / annual claim of customer i (i = 1,…,100)
Xi ~ N(μ = 2500, σ = 1000)
X̄ = (X1 + … + X100) / 100
Question: What is the distribution of the random variable X̄?
Answer: You do not know. However, because X̄ is the AVERAGE of other rvs, try…
The Central Limit Theorem
The Central Limit Theorem provides an approximate density function when the r.v. you are interested in is the average of n other rvs, say, X1, X2, …, Xn. If these rvs are:
(1) Independent (knowing the value of one rv tells you nothing about the values of the other rvs).
(2) Identically distributed (have the same density function with mean μ and standard deviation σ),
then, for “large” n,
X̄ = (X1 + … + Xn) / n ~ (approx.) N(μ, σ / √n)
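The statement of the theorem can be checked empirically with a small simulation. The sketch below is only an illustration under assumed settings: it draws uniform U[0, 1] values, uses a sample size of n = 30 and 20,000 repetitions (both arbitrary choices, not from the lecture), and compares the mean and standard deviation of the simulated averages with μ and σ / √n.

```python
import random
from math import sqrt

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 30                   # sample size (an arbitrary illustrative choice)
reps = 20_000            # number of simulated averages (also arbitrary)

mu = 0.5                 # mean of U[0, 1]
sigma = sqrt(1 / 12)     # standard deviation of U[0, 1]

# Simulate many averages of n independent U[0, 1] draws.
averages = [sum(rng.random() for _ in range(n)) / n for _ in range(reps)]

sim_mean = sum(averages) / reps
sim_sd = sqrt(sum((a - sim_mean) ** 2 for a in averages) / reps)

print(f"Mean of averages: {sim_mean:.4f} (CLT predicts {mu:.4f})")
print(f"SD of averages:   {sim_sd:.4f} (CLT predicts {sigma / sqrt(n):.4f})")
```

The simulated standard deviation shrinks with √n, which is why averaging over many independent claims or people makes the outcome far more predictable than any single one.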
The Insurance Problem (cont.)
For the insurance problem, you have
Xi = annual $ claimed by person i (i = 1, …, 100) ~ N(μ = 2500, σ = 1000).
(1) Are X1, X2, …, X100 independent random variables?
Yes, because the amount claimed by one person has no effect on the amount claimed by another person.
(2) Are X1, X2, …, X100 identically distributed?
Yes, because each Xi ~ N(μ = 2500, σ = 1000).
Therefore, by the CLT, X̄ is approximately Normal with…
X̄ = (X1 + … + X100) / 100 ~ N(2500, 1000 / √100) = N(2500, 100).
An Insurance Problem
Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.
• For GoodHands,
What should the premium s be so that the probability that the average of the 100 claims exceeds s is 0.01?
Probability Question: What should s be so that
P{X̄ > s} = 0.01, where X̄ = (X1 + … + X100) / 100 ~ N(2500, 100)?
An Insurance Problem (cont.)
Probability Question: What should the premium s be so that P{X̄ > s} = 0.01?
Step 3: Use the probability distribution of the random variable to answer the probability question.
s = NORMINV(0.99, 2500, 100) = $2732.64
[Figure: density of X̄ centered at 2500, with area 0.01 to the right of s.]
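The effect of pooling 100 employees can be seen by repeating the premium calculation with the standard deviation of the average, σ / √n. This is a minimal Python sketch assuming statistics.NormalDist in place of NORMINV; the last digit may differ slightly from the Excel figure above.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 2500, 1000   # mean and sd of one annual claim, in $
n = 100                  # number of employees insured

# By the CLT, the average claim is approximately N(mu, sigma / sqrt(n)).
average_claim = NormalDist(mu=mu, sigma=sigma / sqrt(n))

# Premium per employee with P(average claim > s) = 0.01.
group_premium = average_claim.inv_cdf(0.99)

# For comparison: the premium for a single customer at the same 1% risk.
single_premium = NormalDist(mu=mu, sigma=sigma).inv_cdf(0.99)

print(f"Premium per employee (group of {n}): ${group_premium:.2f}")
print(f"Premium for one customer alone:      ${single_premium:.2f}")
```

The group premium sits much closer to the mean claim of $2500 because the average of 100 claims varies far less than a single claim.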
Another Example of the CLT
• In modeling the performance of a team with 5 people, consider the following five rvs:
Pi = performance contribution of person i, for (i = 1,…,5)
Possible values: [0, 1] (continuous)
Density function: U[0,1]
E[Pi] = μ = 0.5   STDEV[Pi] = σ = 1 / √12 = 0.29
However, what is of interest is the team performance.
Another Example of the CLT
T = performance of the whole team = (P1 + P2 + P3 + P4 + P5) / 5
Possible values: [0, 1] (continuous)
Density function: ???
You cannot find the true density function, so borrow one.
Because the rv T is the average of other RVs, think of using the Central Limit Theorem to approximate the density function of T.
The Team Problem
For the team problem, you have
Pi = performance of person i (i = 1, 2, 3, 4, 5) ~ U[0, 1] with mean μ = 0.5 and std. dev. σ = 0.29.
(1) Are P1, P2, P3, P4, P5 independent random variables?
Yes, assuming that the performance of a person says nothing about the performance of another person.
(2) Are P1, P2, P3, P4, P5 identically distributed?
Yes, because each Pi ~ U[0, 1].
Therefore, by the CLT, T is approximately Normal with…
T = (P1 + P2 + P3 + P4 + P5) / 5 ~ N(0.5, 0.29 / √5) = N(0.5, 0.13)
The Team Problem
Question: What is the probability that the team performance is at least 0.75?
P(T ≥ 0.75) = 1 – NORMDIST(0.75, 0.5, 0.13, TRUE) = 0.027
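The team calculation combines the CLT standard deviation σ / √n with a right-tail probability. Here is a minimal Python sketch, assuming statistics.NormalDist in place of NORMDIST and rounding the standard deviation to two decimals as the slide does; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 5            # team members
mu = 0.5         # mean of one person's U[0, 1] performance
sigma = 0.29     # sd of U[0, 1], rounded as in the slides (1 / sqrt(12))

# Standard deviation of the team average, rounded to 0.13 as in the slides.
sd_team = round(sigma / sqrt(n), 2)

team = NormalDist(mu=mu, sigma=sd_team)

# P(T >= 0.75) = 1 - P(T < 0.75).
p_high = 1 - team.cdf(0.75)

print(f"SD of team average: {sd_team}")
print(f"P(T >= 0.75)      = {p_high:.3f}")
```

With only five people, n is small, so the normal approximation here should be read as a rough estimate rather than an exact value.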
[Figure: density of T ~ N(0.5, 0.13) centered at 0.5, with the area P(T ≥ 0.75) to the right of 0.75.]
The Average of a Sample
Suppose you are going to record the numbers X1, X2, …, Xn taken from a sample of size n from a population and then compute:
X̄ = (X1 + … + Xn) / n
Is X̄ a rv?
The answer depends on “timing”. If you have already taken the sample, then X̄ is NOT a rv.
If you have not yet taken the sample, then X̄ IS a rv.
All possible values: The (finite) list of averages of every group of size n in the population.
Groups of size n: G1, G2, G3, …   X̄ for the group: A1, A2, A3, …
There is no practical way to list the possible values, so…
Discrete, but… YOU CANNOT WRITE THE DENSITY FUNCTION.
The Average of a Sample
X̄ = (X1 + … + Xn) / n
The rvs X1, X2, …, Xn are iid from the same population with mean = μ and std. dev. = σ.
Possible Values: (–∞, +∞)
Solution: Because X̄ is the average of rvs, think of using the CLT which, if applicable, results in the following density function for X̄:
X̄ ~ N(μ, σ / √n)
Now you can use the Normal Distribution to answer your probability question about X̄.
A Final Example of the CLT
• Historical data collected at a paper mill show that 40% of sheet breaks are due to water drops, resulting from the condensation of steam.
• Suppose that the causes of the next 100 sheet breaks are monitored and that the sheet breaks are independent of one another.
• Find the expected value and the standard deviation of the number of sheet breaks that will be caused by water drops.
• What is the probability that at least 35 of the breaks will be due to water drops?
Exact Answer
• Success = break due to water drops
• P(success) = p = 0.4
• X = number of breaks due to water drops
• X is Binomial with n = 100 and p = 0.4
• E(X) = np = (100)(0.4) = 40
• SD(X) = √(n p (1 – p)) = √((100)(0.4)(0.6)) = √24 = 4.9
• From Excel, P(X ≥ 35) = 1 – P(X < 35) = 1 – P(X ≤ 34)
• = 1 – BINOMDIST(34, 100, 0.4, TRUE)
• = 0.8697
Normal Approx. to Binomial
For this problem, let p = P(success) = 0.4, and
Xi = 1, if a success on trial i; Xi = 0, if a failure on trial i (i = 1, …, 100).
In this problem, you are interested in the rv
X = number of successes in 100 trials = X1 + X2 + … + X100
To find P(X ≥ 35) = P(X / 100 ≥ 35 / 100), you need to know the probability distribution of X / 100, which, by the CLT, is approximately normal, so…
Normal Approx. to Binomial
Each Xi ~ Binomial(1, p = 0.4), so
E[Xi] = μ = p = 0.4
SD[Xi] = σ = √(p(1 – p)) = 0.49
Assuming that
• The Xi are pairwise independent and
• n = 100 is large enough (np > 5 and n(1 – p) > 5),
then by the CLT, the random variable
X / 100 = (X1 + … + X100) / 100 ~ N(μ, σ / √n) = N(0.4, 0.049)
Normal Approx. to Binomial
Then, for X = X1 + … + X100,
P(X ≥ 35) = P(X / 100 ≥ 35 / 100) = P(X / 100 ≥ 0.35)
= 1 – NORMDIST(0.35, 0.4, 0.049, TRUE)
= 0.85.
(The exact answer was 0.87.)
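The exact binomial tail and its normal approximation can be computed side by side. The sketch below assumes Python's math.comb for the binomial coefficients and statistics.NormalDist for the normal tail, in place of BINOMDIST and NORMDIST; the helper name binom_pmf is hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.4   # number of sheet breaks monitored, P(break due to water drops)


def binom_pmf(k: int) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


mean_x = n * p                     # E(X)
sd_x = sqrt(n * p * (1 - p))       # SD(X)

# Exact: P(X >= 35) = 1 - P(X <= 34).
exact = 1 - sum(binom_pmf(k) for k in range(35))

# CLT: X / 100 is approximately N(p, sqrt(p(1 - p) / n)).
sd_proportion = sqrt(p * (1 - p) / n)
approx = 1 - NormalDist(mu=p, sigma=sd_proportion).cdf(0.35)

print(f"E(X) = {mean_x:.0f}, SD(X) = {sd_x:.1f}")
print(f"Exact P(X >= 35)        = {exact:.4f}")
print(f"Normal approx. (no c.c.) = {approx:.4f}")
```

The approximation here applies no continuity correction, matching the slide, which is one reason it lands somewhat below the exact value.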
Review of Basic Math
A function y = f(x) describes a relationship between the two quantitative variables x and y.
• y = f(x) = –x + 2 (a linear relationship)
• y = f(x) = x^2 – 2x + 1 (a nonlinear relationship)
You can represent a function visually as follows:
[Figure: two graphs of y against x.]
Review of Functions
You can also think of a function f as transforming an input x into an output y, as follows:
x → f → f(x) = y
Note: A function f can have many input values, instead of just one.
Review of Linear Equations
A linear equation, y = mx + b, provides a relationship between the two variables, x and y, in which:
• b = the y-intercept = the value of y when x = 0.
• m = the slope of the line = the change in y per unit of increase in x.
• m > 0: as x increases, y increases.
• m = 0: as x increases, y remains the same.
• m < 0: as x increases, y decreases.
[Figure: the line y = mx + b with intercept b and a change of m in y between x and x + 1; small panels show lines with m > 0, m < 0, and m = 0.]
An Example of a Line
If
y = the thousands of bushels of wheat
x = the number of inches of rain
then, for the line y = 80x + 71,
• b = 71 means that there are 71,000 bushels of wheat when there is no rain.
• m = 80 means that each extra inch of rain results in 80,000 more bushels of wheat.
A Different Equation for a Line
Sometimes a line is written in the form:
a1 x1 + a2 x2 = c
Assuming that a2 ≠ 0, you can solve for x2:
x2 = – (a1 / a2) x1 + (c / a2)
which has the form y = m x + b, with m = – (a1 / a2) and b = c / a2.
How Large is Large Enough?
• For symmetric but outlier-prone data, n = 15 samples should be enough to use the normal approximation.
• For mild skewness, n = 30 should generally be sufficient to make the normal approximation appropriate.
• For severe skewness, n should be at least 100 to use the normal approximation.
• Generally speaking, the larger n is, the better the normal approximation is.
Graphing a Line
To draw the graph of the line a1 x1 + a2 x2 = b:
• Find two different points on the line (usually by setting x1 = 0 and finding x2 and then setting x2 = 0 and finding x1).
• Plot these two points on a graph.
• Draw the straight line through those two points.
Example of Graphing a Line
The line: 2x1 + x2 = 230
When x1 = 0, x2 = 230
When x2 = 0, x1 = 115
Note: Any point on the line gives a value for x1 and a value for x2 that satisfies 2x1 + x2 = 230.
[Figure: the line 2x1 + x2 = 230 plotted with x1 on the horizontal axis and x2 on the vertical axis, each marked at 100, 200, 300.]
Solving Two Linear Equations
• Objective: Solve the following two equations for x1 and x2:
2x1 + x2 = 230 (a)
x1 + 2x2 = 250 (b)
• Solution Procedure:
– Solve (a) for x2: x2 = 230 – 2x1 (c)
– Substitute x2 = 230 – 2x1 in (b): x1 + 2(230 – 2x1) = –3x1 + 460 = 250 (d)
– Solve (d) for x1: x1 = 70
– Substitute x1 = 70 in (c): x2 = 230 – 2x1 = 90.
Another Approach
• Objective: Solve the following for x1 and x2:
(a) 2x1 + x2 = 230
(b) x1 + 2x2 = 250
• Alternative Procedure:
– Multiply (a) through by 2.
(c) 4x1 + 2x2 = 460
– Subtract (b) from (c).
– [(b) x1 + 2x2 = 250]
(d) 3x1 = 210
– Solve (d) for x1: x1 = 70
– Substitute x1 = 70 in (a) and solve for x2: x2 = 230 – 2x1 = 90
• Note: There are computer packages for solving n linear equations in n unknowns.
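As a small illustration of solving such a system by computer, the sketch below handles the two-equation case in plain Python. It uses Cramer's rule (determinants), which is a different technique from the substitution and elimination steps shown above but gives the same solution; the function name solve_2x2 and its argument order are hypothetical.

```python
def solve_2x2(a11, a12, b1, a21, a22, b2):
    """Solve a11*x1 + a12*x2 = b1 and a21*x1 + a22*x2 = b2 by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("The two lines are parallel or identical.")
    x1 = (b1 * a22 - a12 * b2) / det
    x2 = (a11 * b2 - b1 * a21) / det
    return x1, x2


# (a) 2x1 + x2 = 230 and (b) x1 + 2x2 = 250
x1, x2 = solve_2x2(2, 1, 230, 1, 2, 250)
print(f"x1 = {x1}, x2 = {x2}")

# Check the solution against both equations.
print(2 * x1 + x2, x1 + 2 * x2)
```

For larger systems, a numerical linear algebra package would normally be used instead of hand-written formulas.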
Exponentials
• An exponent is the power to which a number (called the base) is raised.
• Example: 2^5 (base = 2; exponent = 5)
• Question: How much will $1000 be worth after 5 years at 6% compound interest?

            Year 1     Year 2     Year 3     Year 4     Year 5
Principal   $1,000.00  $1,060.00  $1,123.60  $1,191.02  $1,262.48
Interest    $60.00     $63.60     $67.42     $71.46     $75.75
Total       $1,060.00  $1,123.60  $1,191.02  $1,262.48  $1,338.23

Answer: Total = f(P, r, n) = P(1 + r)^n = 1000 (1 + 0.06)^5 = 1338.23
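The year-by-year table and the closed-form formula can be reproduced with a short loop. This is a minimal Python sketch; the variable names are illustrative.

```python
principal = 1000.00   # initial investment P in $
rate = 0.06           # annual interest rate r as a fraction
years = 5             # number of years n

balance = principal
for year in range(1, years + 1):
    interest = balance * rate          # interest earned this year
    balance += interest                # total carried into next year
    print(f"Year {year}: interest ${interest:,.2f}, total ${balance:,.2f}")

# Closed form: P(1 + r)^n
closed_form = principal * (1 + rate) ** years
print(f"Closed form: ${closed_form:,.2f}")
```

Each pass through the loop multiplies the balance by (1 + r), which is why n years of compounding collapse into the single exponent in the formula.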
Properties of Exponents
• Laws of Exponents:
– x^(a + b) = x^(b + a) = x^a x^b (example: 2^(3 + 2) = 2^3 2^2)
– (x^a)^b = (x^b)^a = x^(ab) (example: (2^3)^2 = 2^6)
– x^(–a) = 1 / x^a (example: 2^(–3) = 1 / 2^3 = 1 / 8)
– x^0 = 1
• Exponential Functions Increase and Decrease Rapidly:
[Figure: two charts for x from 0 to 25: y = 2^x (vertical axis 0 to 1,200,000) and y = 2^(-x) (vertical axis 0 to 0.6).]
Scientific Notation
• Scientific Notation: a × 10^b (also written as a E ± b) means move the decimal point of a:
– b positions to the right, if b > 0.
– |b| positions to the left, if b < 0.
• Example: 4.000 × 10^3 = 4.000 E+3 = 4000.
• Example: 4 × 10^–3 = 4 E–3 = 0.004.
Logarithms
• The log base b of x [written log_b(x)] is the power to which you must raise b to get x.
• Examples: log_10(100) = 2, log_2(32) = 5
• Logs are only defined for positive numbers.
• If the base is omitted, the default is 10.
• The base e = 2.718… is used in some financial applications (such as continuous compounding), in which case, log_e(x) is written as ln(x) (the natural log).
Laws of Logarithms
• Logs convert products to sums, that is, log_b(xy) = log_b(x) + log_b(y).
– Ex: log_2(64) = log_2(4 × 16) = log_2(4) + log_2(16) = 2 + 4 = 6
• log_b(x / y) = log_b(x) – log_b(y)
– Ex: log_10(1000 / 100) = log_10(1000) – log_10(100) = 3 – 2 = 1
• Logs bring down exponents, that is, log_b(x^y) = y log_b(x).
– Example: log_2(4^5) = 5 log_2(4) = 5(2) = 10
• Logs undo exponentiation, that is, log_b(b^y) = y log_b(b) = y.
– Example: log_2(2^5) = 5
• log_a(x) = k log_b(x), where k = log_a(b)
– Example: log_2(x) = 3.322 log_10(x)
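The laws of logarithms can be spot-checked numerically. The following is a minimal Python sketch using the standard math module (math.log2 and math.log10); comparisons use math.isclose because floating-point logs are not always exact.

```python
import math

# Products to sums: log_2(4 * 16) = log_2(4) + log_2(16)
product_rule = math.isclose(math.log2(4 * 16), math.log2(4) + math.log2(16))

# Quotients to differences: log_10(1000 / 100) = log_10(1000) - log_10(100)
quotient_rule = math.isclose(
    math.log10(1000 / 100), math.log10(1000) - math.log10(100)
)

# Bringing down exponents: log_2(4^5) = 5 * log_2(4)
power_rule = math.isclose(math.log2(4**5), 5 * math.log2(4))

# Change of base: log_2(x) = k * log_10(x), where k = log_2(10)
k = math.log2(10)
x = 50.0   # any positive number works here
change_of_base = math.isclose(math.log2(x), k * math.log10(x))

print(product_rule, quotient_rule, power_rule, change_of_base)
print(f"k = log_2(10) = {k:.3f}")
```

The conversion factor k printed at the end is the 3.322 that appears in the change-of-base example.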
Problem Solving with Logs
• Question: How many years will it take to double an investment at i% interest compounded annually?
• Answer: Let
P = the initial investment
r = interest rate as a fraction = i / 100
n = the number of years of compounding
Then, after n years, you will have P(1 + r)^n.
Problem Solving with Logs
• Answer (continued): Thus, you want to find n so that
P(1 + r)^n = 2P
(1 + r)^n = 2   (a)
To solve (a) for n, take the log of both sides to bring the exponent n down:
log[(1 + r)^n] = log(2)
n log[(1 + r)] = log(2)
n = log(2) / log[(1 + r)]
• Example: At 6% (r = 0.06), it will take n = log(2) / log(1.06) = 0.3010 / 0.0253 = 11.9 years.
Qn: Log base what?
Ans: Log base 10 (but any base will work).
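The doubling-time formula n = log(2) / log(1 + r) is easy to evaluate for other interest rates. Below is a minimal Python sketch; the function name doubling_time is hypothetical, and the second calculation illustrates the point that the base of the log does not matter.

```python
import math


def doubling_time(r: float) -> float:
    """Years needed to double an investment at annual rate r (as a fraction)."""
    return math.log(2) / math.log(1 + r)   # natural logs


n_years = doubling_time(0.06)

# Same calculation with base-10 logs: the ratio is unchanged.
n_years_base10 = math.log10(2) / math.log10(1.06)

print(f"Doubling time at 6% (natural log): {n_years:.1f} years")
print(f"Doubling time at 6% (log base 10): {n_years_base10:.1f} years")

# Check: growing for n_years at 6% should double the money.
growth = (1 + 0.06) ** n_years
print(f"Growth factor after {n_years:.2f} years: {growth:.4f}")
```

Because n is not a whole number, with annual compounding the investment first exceeds double its value at the end of year 12.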