Central Limit Theorem (CLT)
DTSC‐620 Computation Statistics for Data Science Dr. Bill Mihajlovic
2021
Topics
• Uniform probability distribution via simulation
• Distribution of the sum of two uniformly distributed variables – Triangular
• Distribution of the sum of three uniformly distributed variables – Bell shaped
• Normal distribution bell shaped curve
• Basic parameters and features of the normal distribution
• Discrete distribution & continuous density
Central Limit Theorem (CLT)
• CLT is one of the most fundamental laws of nature, yet it is underappreciated and not well understood.
– We shall use a sequence of simulations to illustrate the effects of the CLT on random variable values and their distributions.
Simulation: Discrete PDF/PMF (N=1)
• We generate N=1 set x1 of 10000 samples of random values uniformly distributed in the range 0-10 and group the sampled values into bands of width 0.1, indexed by k (about 100 bands from 0 to 10).
• Plot frequencies per 0.1 band.
> n=10000
> x1 <- round(runif(n, 0,10), digits=2) ; x <- x1
> length(x)
[1] 10000
> x <- round(x, digits=1)   # one decimal place: values 0.0, 0.1, ..., 10.0
> t <- table(x)
> f <- as.data.frame(t)
> length(f$Freq)
> p <- f$Freq / (n)         # relative frequency of each band
> lp <- length(p)
> sum(p)
[1] 1
> p[1] <- p[1]*2            # the two end bands are half as wide, so double them
> p[lp] <- p[lp]*2
> k <- c(0:(lp-1))
> plot(k,p, ylim=c(0,0.05) ,type="h")
Simulation: Discrete x PDF/PMF (N=2)
• We generate N=2 sets x1 and x2 of 10000 samples of random values uniformly distributed in the range 0-10 (basic experiments).
• We compute the average x of x1 and x2, round it to one decimal place, and sort the x-values into bands of width 0.1 (about 100 x bands).
> n=10000
> x1 <- round(runif(n, 0,11), digits=2)
> x2 <- round(runif(n, 0,11), digits=2) ; length(x2)
[1] 10000
> x <- (x1 + x2) / 2
> x <- round(x, digits=1)
> t <- table(x)
> f <- as.data.frame(t)
> p <- f$Freq / (n)
> lp <- length(p) ; sum(p)   # verify that the probabilities sum to 1
[1] 1
> p[1] <- p[1]*2
> p[lp] <- p[lp]*2
> k <- c(0:(lp-1))
> plot(k,p, ylim=c(0,0.05) ,type="h")
Is the distribution of the sample average, as a random variable, triangular?
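The triangular shape can be compared with the exact result. For two independent Uniform(0, 10) variables, the average has density x/25 on [0, 5] and (10 - x)/25 on [5, 10]. The sketch below assumes the 0-10 range stated in the bullets and an arbitrary seed; `tri_pdf` and `max_gap` are illustrative names, not part of the original session.

```r
set.seed(620)                         # arbitrary seed, only for reproducibility
n <- 10000
x <- (runif(n, 0, 10) + runif(n, 0, 10)) / 2

# Exact density of the average of two independent Uniform(0, 10) variables
tri_pdf <- function(x) ifelse(x <= 5, x / 25, (10 - x) / 25)

# Histogram densities in bands of width 0.5, without drawing the plot
h <- hist(x, breaks = seq(0, 10, by = 0.5), plot = FALSE)

# Largest gap between simulated and exact density at the band midpoints
max_gap <- max(abs(h$density - tri_pdf(h$mids)))
print(max_gap)
```

A small `max_gap` suggests that the simulated frequencies follow the triangular density up to sampling noise.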
Simulation: Discrete x PDF/PMF (N=3)
• We generate N=3 sets x1, x2, and x3 of 10000 samples of random values uniformly distributed in the range 0‐10.
• The average x of the 3 sets of random samples departs from the triangular distribution and shows a slight bell shape.
> n=10000
> x1 <- round(runif(n, 0,11), digits=2)
> x2 <- round(runif(n, 0,11), digits=2) ; length(x2)
[1] 10000
> x3 <- round(runif(n, 0,11), digits=2)
> x <- (x1 + x2 + x3) /3
> x <- round(x, digits=1)
> t <- table(x)
> f <- as.data.frame(t)
> p <- f$Freq / (n) ; sum(p)
[1] 1
> lp <- length(p)
> p[1] <- p[1]*2
> p[lp] <- p[lp]*2
> k <- c(0:(lp-1))
> plot(k,p, ylim=c(0,0.05) ,type="h")
Is the distribution of x, as sample average and Experiment random variable, bell-shaped?
Simulation: Discrete x PDF/PMF (N=4)
• We generate N=4 sets x1, x2, x3, and x4 of 10000 experiment samples of random values uniformly distributed in the range 0‐10.
• We create Experiment random variable x as average of x1,x2,x3,x4.
> n=10000
> x1 <- round(runif(n, 0,11), digits=2)
> max(x1)
[1] 11
> x2 <- round(runif(n, 0,11), digits=2)
> length(x2)
[1] 10000
> x3 <- round(runif(n, 0,11), digits=2)
> x4 <- round(runif(n, 0,11), digits=2)
> x <- (x1 + x2 + x3 + x4) /4
> length(x)
[1] 10000
> max(x)
[1] 10.11
> x <- round(x, digits=1)
Simulation: Discrete x PDF/PMF (N=4)
• The 10000 samples of the Experiment random variable x are sorted into bands of width 0.1, and the frequencies are plotted against the band index k (0 to 100).
> t <- table(x)
> f <- as.data.frame(t)
> length(f$Freq)
[1] 98
> p <- f$Freq / (n)
> lp <- length(p)
> lp
[1] 98
> max(p)
[1] 0.0274
> sum(p)
[1] 1
> p[1] <- p[1]*2
> p[lp] <- p[lp]*2
> k <- c(0:(lp-1))
> plot(k,p, ylim=c(0,0.05) ,type="h")
The bell shape is taller and narrower; the variance of x is smaller.
Simulation: Discrete x PDF/PMF (N=5)
• We generate 5 sets x1, x2, x3, x4 and x5 of 10000 samples of random values uniformly distributed in the range 0‐10.
• The 5 sample sets are averaged in the Experiment, producing a new set of x-variable values.
> n=10000
> x1 <- round(runif(n, 0,11), digits=2)
> max(x1)
[1] 11
> x2 <- round(runif(n, 0,11), digits=2)
> length(x2)
[1] 10000
> x3 <- round(runif(n, 0,11), digits=2)
> x4 <- round(runif(n, 0,11), digits=2)
> x5 <- round(runif(n, 0,11), digits=2)
> x <- (x1 + x2 + x3 + x4 + x5) /5
> length(x)
[1] 10000
> max(x)
[1] 10.11
> x <- round(x, digits=1)
Simulation: Discrete x PDF/PMF (N=5)
• Changing the plot scale may lead to different conclusions.
• With a large number of possible values (here 100 possible k values), the discrete PDF/PMF P(x) approaches a continuous pdf p(x).
> t <- table(x)
> f <- as.data.frame(t)
> length(f$Freq)
[1] 91
> max(f$Freq)
[1] 294
> p <- f$Freq / (n)
> sum(p)
[1] 1
> lp <- length(p)
> p[1] <- p[1]*2
> p[lp] <- p[lp]*2
> k <- c(0:(lp-1))
> plot(k,p, ylim=c(0,0.05) ,type="h")
Simulation: The Sample Mean x Behavior (N=5)
• This sequence of 10,000 simulated experiments, each averaging N=5 sets to form the Experiment mean, illustrates the tendency of the sample mean to behave as normal.
$\bar{x} = \frac{1}{5} \sum_{i=1}^{5} x_i$
• Rounding the random variable values creates an experiment bias, which causes the mean (curve mode) to shift to the left of 50.
Simulation: The Sample Mean x Behavior (N=6)
• This sequence of 10,000 simulated sample means with N=6 experiment sets illustrates the tendency of the sample mean to assume behavior that is labeled as normal.
$\bar{x} = \frac{1}{6} \sum_{i=1}^{6} x_i$
Simulation: The Sample Mean x Behavior (N=7)
• This sequence of 10,000 simulated samples with N=7 basic experiments, with x̄ as the Experiment mean of the 7 sets of values x_i, further illustrates the tendency of the sample mean to behave as normal.
$\bar{x} = \frac{1}{7} \sum_{i=1}^{7} x_i$
Simulation: The Sample Mean x Behavior (N=8)
• This sequence of 10,000 simulated samples with N=8 sets, with the mean x̄ as the Experiment mean value, shows the tendency of the sample mean to behave as normal.
$\bar{x} = \frac{1}{8} \sum_{i=1}^{8} x_i$
The bell shape is taller and narrower. Variance is smaller.
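The shrinking variance can be checked numerically. The sketch below assumes Uniform(0, 10) sets, skips rounding, and uses an arbitrary seed. It compares the observed variance of the sample mean for N = 1 to 8 with the theoretical value (100/12)/N; `mean_of_sets` and `sim` are illustrative names.

```r
set.seed(620)          # arbitrary seed, only for reproducibility
n <- 10000             # simulated experiments per N, as in the slides
pop_var <- 10^2 / 12   # variance of Uniform(0, 10)

# One sample mean per experiment: average of N independent Uniform(0, 10) sets
mean_of_sets <- function(N) rowMeans(matrix(runif(n * N, 0, 10), nrow = n))

sim <- data.frame(N = 1:8)
sim$observed_var <- sapply(sim$N, function(N) var(mean_of_sets(N)))
sim$theory_var   <- pop_var / sim$N    # variance of the mean is sigma^2 / N
print(round(sim, 3))
```

The two variance columns should agree closely, which is why the bell becomes taller and narrower as N grows.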
Simulation: Central Limit Theorem (CLT)
• The simulation of the sample mean x̄ with N=1, 2, …, 8 illustrates the statistical tendency of the sample mean to become normally distributed, N(μ, σ²) or N(μ, σ).
[Figure: simulated distributions of the sample mean, panels N=1 through N=8]
Central Limit Theorem (CLT)
• The CLT states that if we have any population (e.g., uniform) with mean μ and standard deviation σ, and we take sufficiently large (N → ∞) random samples from the population with replacement, then the probability distribution of the sample sums or means x̄, as a new Experiment's random variable, approaches a normal distribution as N → ∞. For the sample mean this is N(μ, σ²/N); for the sample sum it is N(Nμ, Nσ²).
– With replacement means that the same value may be sampled multiple times.
– The Experiment averages N experiment samples.
– Sufficiently large (N → ∞) random samples means a large enough number N of samples x_i to be averaged.
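One way to check the statement numerically is to standardize the simulated sample means and compare them with N(0, 1). The sketch below assumes Uniform(0, 10) samples, N = 8, and an arbitrary seed; the 1.96 cutoff is the usual two-sided 95% point of the standard normal.

```r
set.seed(620)                       # arbitrary seed, only for reproducibility
n <- 10000                          # number of simulated sample means
N <- 8                              # samples averaged per mean
mu    <- 5                          # mean of Uniform(0, 10)
sigma <- 10 / sqrt(12)              # standard deviation of Uniform(0, 10)

xbar <- rowMeans(matrix(runif(n * N, 0, 10), nrow = n))

# Standardize with the CLT parameters: mean mu, standard deviation sigma / sqrt(N)
z <- (xbar - mu) / (sigma / sqrt(N))

coverage <- mean(abs(z) < 1.96)     # should be near 0.95 if z is roughly N(0, 1)
print(c(mean_z = mean(z), sd_z = sd(z), within_1.96 = coverage))
```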
CLT Argument
• The sample mean x̄ will be approximately normally distributed for large sample sizes, depending on the original distribution from which we are sampling (for some experiment variables, like uniform, N=3-8 is enough; for others we need N=30+).
– The CLT argument holds approximately regardless of the experiment variable's distribution, provided the sample size is sufficiently large (usually N > 30).
– To meet the CLT argument, the required N depends on the original experiment variable:
• Uniform (symmetric around its mean and slightly resembling the normal distribution): small N=3-10 is enough.
• Geometric or exponential (asymmetric around its mean and not resembling the normal distribution): large N=30 is needed.
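The difference between symmetric and asymmetric starting distributions can be seen in the skewness of the sample mean, which is 0 for a normal distribution. The sketch below assumes Uniform(0, 10) and Exponential(rate = 1) populations and an arbitrary seed; `skewness` and `sample_means` are helper names introduced here for illustration.

```r
set.seed(620)                       # arbitrary seed, only for reproducibility
n <- 10000                          # number of simulated sample means

# Moment-based skewness; close to 0 for a symmetric, normal-like shape
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# n sample means, each averaging N draws from the generator rdist
sample_means <- function(rdist, N) rowMeans(matrix(rdist(n * N), nrow = n))

skew <- c(
  unif_N3 = skewness(sample_means(function(m) runif(m, 0, 10), 3)),
  exp_N3  = skewness(sample_means(function(m) rexp(m, rate = 1), 3)),
  exp_N30 = skewness(sample_means(function(m) rexp(m, rate = 1), 30))
)
print(round(skew, 2))
```

For the exponential case the skewness of the mean is 2/sqrt(N) in theory, so it decays slowly. That is consistent with needing a larger N than for the uniform case.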
Simulation: Practical Simulation Problems
• The sequence of 10,000 simulated experiments repeated N times (in this case N=6) is used to create an Experiment that generates the random sample mean, with k=100 possible values.
$\bar{x} = \frac{1}{6} \sum_{i=1}^{6} x_i$
• Due to rounding error, some random sample-mean values are lost and the plot is not centered around k=50?!
Simulation: Practical Simulation Problems
• When using t <- table(x) to quickly find the frequencies of the x-values, we find that values that never occur (here the lowest ones, around 0) are dropped, so the first observed value is plotted at k=0, which causes the plot to shift to the left.
– Example: The sorted x-values indicate that the 17 values from 0.0 to 1.6, as well as 1.8, 9.1, and 9.6-10.0, are not represented in this simulation.
• Correction of this slightly biased sample‐mean tendency simulation would complicate the simulation script.
> xs <- sort(f$x)
> xs
# Sorted x-axis values indicate jumps/discontinuities in the sequence
 [1] 1.7 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8
[22] 3.9 4 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9
[43] 6 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8
[64] 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9 9.2 9.3 9.4 9.5
77 Levels: 1.7 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1 3.2 3.3 3.4 3.5 ... 9.5
> # A correct script should show 101 levels (0.0, 0.1, 0.2, . . . , 9.9, 10.0)
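One possible correction is to tabulate against a fixed grid of levels, so that values that never occur keep a zero count and the index k stays aligned with x. The sketch assumes sample means in the range 0-10 and an arbitrary seed. `grid` and `t_all` are illustrative names, and this is only one way to do it, not the original script.

```r
set.seed(620)                                   # arbitrary seed, only for reproducibility
n <- 10000
N <- 6
x <- round(rowMeans(matrix(runif(n * N, 0, 10), nrow = n)), digits = 1)

# Full grid of the 101 possible rounded values 0.0, 0.1, ..., 10.0.
# Formatting both sides as text avoids floating-point mismatches.
grid  <- seq(0, 10, by = 0.1)
t_all <- table(factor(sprintf("%.1f", x), levels = sprintf("%.1f", grid)))

p <- as.numeric(t_all) / n                      # unobserved values get p = 0
k <- seq_along(grid) - 1                        # k = 0..100, so x = 5 sits at k = 50

# Draw only in an interactive session
if (interactive()) plot(k, p, ylim = c(0, 0.05), type = "h")
```

With every level present, the plot is centered near k=50 regardless of which values happen to be sampled.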
Simulation: Simulation Problem & Purpose
• This is an almost perfect “simulation hack” that fits our purpose of illustrating the CLT argument (the statistical tendency of the sample mean/sum toward the normal distribution)!
• All of the vertical bars indicate probabilities of the discrete variable x̄ for N=8:
$\bar{x} = \frac{1}{8} \sum_{i=1}^{8} x_i$
Simulation: Practical Problem & Purpose
• It is important to distinguish sample parameter estimates, like the sample mean μ_X̄, from the entire population parameters, like the mean μ!
$\mu = \lim_{N \to \infty} \mu_{\bar{X}}$
[Figure: simulated distribution with sample mean μ_X̄ = 38 and population mean μ = 50]
Discrete & Continuous Distributions of Summations
• The sum of two variables X, Y is a new Experiment variable Z: Z = X + Y.
• For independent X and Y, the PMF of the new variable Z can be derived as the convolution of the original PMFs of X and Y:
$P_Z(Z=z) = (P_X * P_Y)(z) = \sum_x P(X=x)\,P(Y=z-x)$
• For continuous variable densities, integration is the analogue of summation:
$p_Z(z) = \int_{-\infty}^{\infty} p_X(x)\,p_Y(z-x)\,dx$
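The convolution sum can be worked through for a small case. The sketch below assumes two independent discrete uniform variables on 0, 1, ..., 10 (a choice made here for illustration) and builds the PMF of their sum directly from the formula.

```r
# PMF of a discrete uniform variable on 0, 1, ..., 10
px <- rep(1 / 11, 11)
py <- px

# Convolution: P(Z = z) = sum over x of P(X = x) * P(Y = z - x)
joint <- outer(px, py)                 # P(X = x) * P(Y = y), independence assumed
z     <- outer(0:10, 0:10, "+")        # value of x + y for every (x, y) pair

# Add up the probabilities of all pairs that give the same sum z
pz <- as.numeric(tapply(as.vector(joint), as.vector(z), sum))
names(pz) <- 0:20
print(round(pz, 4))
```

The resulting PMF rises linearly to its peak at z = 10 and falls linearly afterwards, which is the discrete counterpart of the triangular shape seen for N=2.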
The Normal Curve N(μ,σ) pdf Functions
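• The pdf of a normal continuous variable is described by the Gaussian formula:
$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
• The standard normal pdf N(0,1) is:
$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}$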
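As a quick check, the Gaussian density can be written out explicitly and compared with R's built-in `dnorm`. This is a minimal sketch; `normal_pdf` is a name introduced here, and the evaluation points and the parameters 5 and 2 are arbitrary.

```r
# Gaussian pdf written out from the formula; defaults give the standard normal N(0, 1)
normal_pdf <- function(x, mu = 0, sigma = 1) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
}

x <- seq(-4, 4, by = 0.5)

# Largest difference from the built-in density, standard and non-standard case
max_diff_std <- max(abs(normal_pdf(x) - dnorm(x)))
max_diff_gen <- max(abs(normal_pdf(x, mu = 5, sigma = 2) - dnorm(x, mean = 5, sd = 2)))

# The total area under the standard normal pdf should be 1
area <- integrate(normal_pdf, -Inf, Inf)$value
print(c(max_diff_std = max_diff_std, max_diff_gen = max_diff_gen, area = area))
```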
Standard Normal Density (pdf)
• The standard normal curve has mean μ=0 and std. dev. σ=1, or variance σ²=1.
[Figure: normal densities f(x | μ,σ); labeled curves: negative mean, small variance, big variance]
Standard Normal Density
• The standard normal pdf N(μ=0, σ=1) is a good reference curve used to make statistical judgements.
f(x | μ=0,σ=1)
Question: CLT
• If we average 10 vectors of random samples from N=10 different populations (having different distributions) to form the average x̄ as an Experiment variable, will x̄ have a tendency to be normally distributed?
Answer: CLT
• Yes!!!!
– RW additive natural processes, which sum many random events, generate approximately normally distributed random variables.
• Example: Communication signal noise.
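The answer can be illustrated with a small simulation. The ten distributions below are chosen arbitrarily for this sketch and do not come from the slides. The check compares the share of standardized averages within 1 and 2 standard deviations against the normal values of about 0.683 and 0.954.

```r
set.seed(620)                      # arbitrary seed, only for reproducibility
n <- 10000

# One column per population; the ten distributions are illustrative choices
draws <- cbind(
  runif(n, 0, 10), runif(n, 2, 6),
  rexp(n, rate = 0.5), rexp(n, rate = 2),
  rbinom(n, size = 10, prob = 0.3), rpois(n, lambda = 4),
  rgeom(n, prob = 0.4), rbeta(n, 2, 5),
  rgamma(n, shape = 2, rate = 1), rchisq(n, df = 3)
)

x <- rowMeans(draws)               # average across the 10 populations
z <- (x - mean(x)) / sd(x)         # standardized average

within_1sd <- mean(abs(z) < 1)     # about 0.683 for a normal variable
within_2sd <- mean(abs(z) < 2)     # about 0.954 for a normal variable
print(c(within_1sd = within_1sd, within_2sd = within_2sd))
```

With only 10 terms, several of them skewed, the match is approximate rather than exact.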
Sample & Population
• In some cases the CLT supports a rapid tendency toward the Normal distribution, which justifies assuming that observed/measured random variables are Normal and treating them mathematically (Probability & Statistics) as such.
What is Probability & Statistics?
• Probability & Statistics are mathematical disciplines about (dealing with) GUESSING!
– GUESSING is rarely 100% correct.
– A GUESS may be close to the correct fact (an educated/savvy guess), but JUST CLOSE, or far from it (a WILD, uneducated guess).
Homework
• Repeat all R-sessions and comment on the results obtained.
• Create a summary table of plots shown in the slide no. 15.
Summary: Sample & Population
• Please distinguish sample mean and variance estimates from the overall population mean and variance values.
– μ = Population mean
– σ = Population standard deviation
– μ_x̄ = Sample mean
– σ_x̄ = Sample standard deviation (???)
– N = Sample size
$\mu = \lim_{N \to \infty} \mu_{\bar{X}}$
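The distinction between sample estimates and population parameters can be seen by letting the sample size grow. The sketch below assumes a Uniform(0, 10) population, so μ = 5 and σ = 10/√12, and uses an arbitrary seed; `sizes` and `est` are illustrative names.

```r
set.seed(620)                      # arbitrary seed, only for reproducibility
mu    <- 5                         # population mean of Uniform(0, 10)
sigma <- 10 / sqrt(12)             # population standard deviation

sizes <- c(10, 100, 1000, 10000, 100000)

# Sample mean and sample standard deviation for each sample size N
est <- t(sapply(sizes, function(N) {
  s <- runif(N, 0, 10)
  c(N = N, mean = mean(s), sd = sd(s))
}))
print(round(est, 3))
print(c(mu = mu, sigma = round(sigma, 3)))
```

The estimates fluctuate for small N and settle near μ and σ as N grows, in line with the limit stated above.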