Central Limit Theorem (CLT)

DTSC‐620 Computation Statistics for Data Science Dr. Bill Mihajlovic

2021

Topics

• Uniform probability distribution via simulation

• Distribution of the sum of two uniformly distributed variables: triangular

• Distribution of the sum of three uniformly distributed variables: bell shaped

• Normal distribution: the bell-shaped curve

• Basic parameters and features of the normal distribution

• Discrete distribution & continuous density

Central Limit Theorem (CLT)

• The CLT is one of the most fundamental laws of nature, yet it is underappreciated and not well understood.

– We shall use a sequence of simulations to illustrate the effects of the CLT on random variable values and their distributions.

Simulation: Discrete PDF/PMF (N=1)

• We generate N=1 set x1 of 10000 samples of random values uniformly distributed in the range 0-10, and group the sampled values into bands of width 0.1, indexed by k from 0 to 100.

• Plot frequencies per 0.1 band.

> n <- 10000
> x1 <- round(runif(n, 0, 10), digits=2) ; x <- x1
> length(x)
[1] 10000
> x <- round(x, digits=1)   # round to bands of 0.1 over the range 0-10
> t <- table(x)             # frequency of each band
> f <- as.data.frame(t)
> length(f$Freq)
> p <- f$Freq / n           # relative frequency per band
> lp <- length(p)
> sum(p)
[1] 1
> p[1] <- p[1]*2            # the two end bands are only half as wide after rounding
> p[lp] <- p[lp]*2
> k <- c(0:(lp-1))          # band index
> plot(k, p, ylim=c(0,0.05), type="h")

Simulation: Discrete x PDF/PMF (N=2)

• We generate N=2 sets x1 and x2 of 10000 samples of random values uniformly distributed in the range 0-10 (basic experiments).

• We compute the average of x1 and x2 as x, round it to one decimal digit, and sort the x-values into bands of 0.1 (100 x bands).

> n <- 10000
> x1 <- round(runif(n, 0, 11), digits=2)   # upper limit 11 as recorded; the slide text states 0-10
> x2 <- round(runif(n, 0, 11), digits=2) ; length(x2)
[1] 10000
> x <- (x1 + x2) / 2
> x <- round(x, digits=1)
> t <- table(x)
> f <- as.data.frame(t)
> p <- f$Freq / n
> lp <- length(p) ; sum(p)   # Verify prob sum is 1
[1] 1
> p[1] <- p[1]*2
> p[lp] <- p[lp]*2
> k <- c(0:(lp-1))
> plot(k, p, type="h")

Is the distribution of the sample average, as a random variable, triangular?

Simulation: Discrete x PDF/PMF (N=3)

• We generate N=3 sets x1, x2, and x3 of 10000 samples of random values uniformly distributed in the range 0‐10.

• The average x of the 3 sets of random samples departs from the triangular distribution, showing a slight bell shape.

> n <- 10000
> x1 <- round(runif(n, 0, 11), digits=2)
> x2 <- round(runif(n, 0, 11), digits=2) ; length(x2)
[1] 10000
> x3 <- round(runif(n, 0, 11), digits=2)
> x <- (x1 + x2 + x3) / 3
> x <- round(x, digits=1)   # bands of 0.1
> t <- table(x)
> f <- as.data.frame(t)
> p <- f$Freq / n ; sum(p)
[1] 1
> lp <- length(p)
> p[1] <- p[1]*2
> p[lp] <- p[lp]*2
> k <- c(0:(lp-1))

Is the distribution of x, as the sample average and Experiment random variable, bell-shaped?

Simulation: Discrete x PDF/PMF (N=4)

• We generate N=4 sets x1, x2, x3, and x4 of 10000 experiment samples of random values uniformly distributed in the range 0‐10.

• We create the Experiment random variable x as the average of x1, x2, x3, and x4.

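> n <- 10000
> x1 <- round(runif(n, 0, 11), digits=2)
> x2 <- round(runif(n, 0, 11), digits=2)
> x3 <- round(runif(n, 0, 11), digits=2)
> length(x2)
[1] 10000
> max(x1)
[1] 11
> x4 <- round(runif(n, 0, 11), digits=2)
> x <- (x1 + x2 + x3 + x4) / 4
> length(x)
[1] 10000
> max(x)
[1] 10.11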

Simulation: Discrete x PDF/PMF (N=4)

• The 10000 samples of the Experiment random variable x are sorted into bands of width 0.1, and the frequencies are plotted against the band index k from 0 to 100.

> x <- round(x, digits=1)   # bands of 0.1
> t <- table(x)
> f <- as.data.frame(t)
> length(f$Freq)
[1] 98
> p <- f$Freq / n
> lp <- length(p)
> max(p)
[1] 0.0274
> sum(p)
[1] 1
> p[1] <- p[1]*2
> lp
[1] 98
> p[lp] <- p[lp]*2
> k <- c(0:(lp-1))

The bell shape is taller and narrower; the variance of x is smaller.

Simulation: Discrete x PDF/PMF (N=5)

• We generate 5 sets x1, x2, x3, x4 and x5 of 10000 samples of random values uniformly distributed in the range 0‐10.

• The 5 sample sets are averaged in the Experiment, producing a new set of x-variable values.

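> n <- 10000
> x1 <- round(runif(n, 0, 11), digits=2)
> x2 <- round(runif(n, 0, 11), digits=2)
> x3 <- round(runif(n, 0, 11), digits=2)
> length(x2)
[1] 10000
> max(x1)
[1] 11
> x4 <- round(runif(n, 0, 11), digits=2)
> x5 <- round(runif(n, 0, 11), digits=2)
> x <- (x1 + x2 + x3 + x4 + x5) / 5
> length(x)
[1] 10000
> max(x)
[1] 10.11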

Simulation: Discrete x PDF/PMF (N=5)

• Changing the plot scale may lead to different conclusions.

• With a large number of possible values (here 100 possible k values), the discrete PDF/PMF P(x) approaches the continuous pdf p(x).

> x <- round(x, digits=1)   # bands of 0.1
> t <- table(x)
> f <- as.data.frame(t)
> length(f$Freq)
[1] 91
> max(f$Freq)
[1] 294
> p <- f$Freq / n
> sum(p)
[1] 1
> lp <- length(p)
> p[1] <- p[1]*2
> p[lp] <- p[lp]*2
> k <- c(0:(lp-1))
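The N=1 to N=5 sessions repeat the same steps, so they can be folded into one function. The following is a minimal sketch, not the course script: the function name `clt_sim`, its arguments, and the seed are assumptions made for illustration, and it samples from the range 0-10 stated in the slide text.

```r
# Hypothetical helper (not from the slides): averages N uniform sets and
# returns the relative frequency of every observed 0.1-wide band.
clt_sim <- function(N, n = 10000, lo = 0, hi = 10) {
  # n x N matrix: each column is one basic experiment of n uniform samples
  samples <- matrix(runif(n * N, lo, hi), nrow = n, ncol = N)
  x <- round(rowMeans(samples), digits = 1)   # Experiment variable in 0.1 bands
  t <- table(x)                               # frequency of each observed band
  data.frame(x = as.numeric(names(t)), p = as.vector(t) / n)
}

set.seed(620)                 # arbitrary seed, only for repeatability
res <- clt_sim(N = 5)
plot(res$x, res$p, type = "h", xlab = "x (band value)", ylab = "relative frequency")
```

Calling `clt_sim(N = 1)`, `clt_sim(N = 2)`, and so on reproduces the flat, triangular, and bell-shaped pictures in turn. The sketch plots against the band value x rather than the band index k, so the horizontal axis keeps its original units.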

Simulation: The Sample Mean x Behavior (N=5)

• This simulated experiment, repeated N=5 times with 10,000 samples each and averaged over the i=1,...,N=5 sets as the Experiment mean, illustrates the tendency of the sample mean to behave as normal.

$$\bar{x} = \frac{1}{5} \sum_{i=1}^{N} x_i, \quad N = 5$$

• Rounding the random variable values creates an experiment bias, which causes the mean (curve mode) to shift to the left of 50.

Simulation: The Sample Mean x Behavior (N=6)

• This sequence of simulated 10,000 sample means with i=N=6 experiment sets illustrates the tendency of the sample‐mean to assume behavior that is labeled as normal.

$$\bar{x} = \frac{1}{6} \sum_{i=1}^{N} x_i, \quad N = 6$$

Simulation: The Sample Mean x Behavior (N=7)

• This sequence of 10,000 simulated samples with N=7 basic experiments, with $\bar{x}$ as the Experiment mean of the 7 sets of values $x_i$, further illustrates the tendency of the sample mean to behave as normal.

$$\bar{x} = \frac{1}{7} \sum_{i=1}^{N} x_i, \quad N = 7$$

Simulation: The Sample Mean x Behavior (N=8)

• This sequence of 10,000 simulated samples for each of the i=1,...,N=8 sets, with the mean $\bar{x}$ as the Experiment mean value, shows the tendency of the sample mean to behave as normal.

$$\bar{x} = \frac{1}{8} \sum_{i=1}^{N} x_i, \quad N = 8$$

The bell shape is taller and narrower. Variance is smaller.
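The narrowing of the bell can be checked numerically. The sketch below assumes a Uniform(0, 10) population, whose variance is (10 - 0)^2 / 12, and compares the simulated variance of the sample mean with the theoretical value of that variance divided by N. The seed and the chosen values of N are arbitrary.

```r
set.seed(620)                       # arbitrary seed, only for repeatability
n <- 10000
pop_var <- (10 - 0)^2 / 12          # variance of the Uniform(0, 10) population

for (N in c(1, 2, 4, 8)) {
  # each row holds N basic-experiment samples; the row mean is one sample mean
  xbar <- rowMeans(matrix(runif(n * N, 0, 10), nrow = n))
  cat(sprintf("N=%d  simulated var=%.3f  theory var=%.3f\n",
              N, var(xbar), pop_var / N))
}
```

The two columns should agree closely, and both halve each time N doubles.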

Simulation: Central Limit Theorem (CLT)

• The simulation of the sample mean $\bar{x}$ with N=1, 2, …, 8 illustrates the statistical tendency of the sample mean to become normally distributed, written N(μ, σ²) or N(μ, σ).

[Figure: eight panels of the simulated distributions, N=1 to N=8]

Central Limit Theorem (CLT)

• The CLT states that if we have any population (e.g., uniform) with finite mean μ and standard deviation σ, and we take a sufficiently large number N of random samples from the population with replacement, then the probability distribution of the sample mean $\bar{x}$, as a new Experiment's random variable, approaches the normal distribution N(μ, σ²/N) as N → ∞ (the sample summation correspondingly approaches N(Nμ, Nσ²)).

– With replacement means that the same value may be sampled multiple times.

– The Experiment is averaging N experiment samples.

– Sufficiently large (N → ∞) random samples means a large enough number N of samples $x_i$ to be averaged.
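For reference, the statement can be written in standardized form. This is the usual textbook formulation, assuming the $x_i$ are independent draws from one population with mean μ and finite standard deviation σ:

```latex
\bar{x}_N = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad
E[\bar{x}_N] = \mu, \qquad
\operatorname{Var}[\bar{x}_N] = \frac{\sigma^2}{N}, \qquad
\frac{\bar{x}_N - \mu}{\sigma / \sqrt{N}} \;\xrightarrow{\;d\;}\; N(0, 1)
\quad \text{as } N \to \infty .
```

The variance $\sigma^2/N$ is what makes the bell taller and narrower as N grows.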

CLT Argument

• The sample mean $\bar{x}$ will be approximately normally distributed for large sample sizes (for some experiment variables, like uniform, N=3-8 is enough; for some we need N=30+), depending upon the original distribution from which we are sampling.

– The CLT argument holds approximately regardless of the experiment variable distribution (as long as its variance is finite), provided the sample size is sufficiently large (usually N > 30).

– To meet the CLT argument, the required N depends on the original experiment variable:

• Uniform (symmetric around its mean and slightly resembling normal distribution): small N=3-10 is enough.

• Geometric or exponential (asymmetric around its mean and not resembling normal distribution): large N=30 is needed.
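One way to see why the asymmetric case needs a larger N is to track the skewness of the sample mean, which is 0 for a normal distribution. The sketch below is only illustrative: the helper `skew`, the Exponential(rate = 1) population, and the seed are assumptions, not part of the course script.

```r
set.seed(620)                       # arbitrary seed, only for repeatability
n <- 10000

# Hypothetical helper: sample skewness (third standardized moment)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

for (N in c(1, 5, 30)) {
  u <- rowMeans(matrix(runif(n * N, 0, 10), nrow = n))    # symmetric population
  e <- rowMeans(matrix(rexp(n * N, rate = 1), nrow = n))  # asymmetric population
  cat(sprintf("N=%2d  skew(uniform mean)=% .3f  skew(exponential mean)=% .3f\n",
              N, skew(u), skew(e)))
}
```

The uniform mean stays near zero skewness for every N, while the exponential mean starts near 2 and decays roughly like 2/sqrt(N), so some asymmetry is still visible at N=30.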

Simulation: Practical Simulation Problems

• The sequence of 10,000 simulated experiments, repeated N times (in this case N=6), is used to create an Experiment that generates the random sample mean, plotted over k=100 bands.

$$\bar{x} = \frac{1}{6} \sum_{i=1}^{N} x_i, \quad N = 6$$

• Due to rounding error, some random sample-mean values are lost and the plot is not centered around k=50?!

Simulation: Practical Simulation Problems

• When using t <- table(x) to quickly find the frequencies of the $\bar{x}$-values, we find that table() lists only the values that actually occur. The unobserved lower values (around 0) are skipped, so the first observed value is plotted at k=0, which shifts the plot to the left.

– Example: the sorted x-values show that the 17 values from 0.0 to 1.6, as well as 1.8, 9.1, and 9.6-10.0, are not represented in this simulation.

• Correction of this slightly biased sample‐mean tendency simulation would complicate the simulation script.

> xs <- sort(f$x)
> xs   # Sorted x-axis values indicate jumps/discontinuities in the sequence
 [1] 1.7 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8
[22] 3.9 4 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9
[43] 6 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8
[64] 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9 9.2 9.3 9.4 9.5
77 Levels: 1.7 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1 3.2 3.3 3.4 3.5 ... 9.5
> # A correct script should show 101 levels (0.0, 0.1, 0.2, ..., 9.9, 10.0)
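One possible correction, sketched here under assumptions rather than taken from the course script, is to tabulate x as a factor with every band value listed explicitly, so that unobserved bands appear with a count of 0 instead of being dropped. The variable name `band_values`, the seed, and the 0-10 sampling range are illustrative choices.

```r
set.seed(620)                       # arbitrary seed, only for repeatability
n <- 10000
# Sample mean of N=6 uniform sets, rounded into 0.1 bands
x <- round(rowMeans(matrix(runif(n * 6, 0, 10), nrow = n)), digits = 1)

# Every possible band value 0.0, 0.1, ..., 10.0 (101 levels)
band_values <- round(seq(0, 10, by = 0.1), digits = 1)
t <- table(factor(x, levels = band_values))   # unobserved bands get count 0

p <- as.vector(t) / n
k <- 0:(length(p) - 1)              # k now corresponds to x = k / 10 for every band
plot(k, p, type = "h")              # centered near k = 50
```

Because every level is present, the index k keeps a fixed meaning across runs and the curve no longer shifts left when the low bands are empty.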

Simulation: Simulation Problem & Purpose

• This is an almost perfect “simulation hack” that fits our purpose of illustrating the CLT argument (the statistical tendency of the sample mean/sum towards the normal distribution)!

• All of the vertical bars indicate the probability of the discrete variable $\bar{x}$ for N=8:

$$\bar{x} = \frac{1}{8} \sum_{i=1}^{N} x_i, \quad N = 8$$

Simulation: Practical Problem & Purpose

• It is important to distinguish sample parameter estimates, like the sample mean $\mu_{\bar{X}}$, from the entire population parameters, like the mean μ!

$$\mu = \lim_{N \to \infty} \mu_{\bar{X}}$$

[Figure labels: $\mu_{\bar{X}} = 38$, $\mu = 50$]

Discrete & Continuous Distributions of Summations

• The sum of two variables X, Y is a new Experiment variable Z = X + Y.

• For independent X and Y, the PMF of the new variable Z can be derived as the convolution of the original PMFs of X and Y:

$$P_Z(Z=z) = (P_X * P_Y)(z) = \sum_{x} P(X=x)\,P(Y=z-x)$$

• For continuous variable densities, integration is the analogue of summation:

$$p_Z(z) = \int_{-\infty}^{\infty} p_X(x)\,p_Y(z-x)\,dx$$
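The discrete convolution can be computed directly. The sketch below assumes two independent variables X and Y, each uniform on the integers 0 to 10 (an illustrative choice), and applies the sum over x of P(X=x)P(Y=z-x) with explicit loops.

```r
# PMF of one discrete uniform variable on the values 0, 1, ..., 10
px <- rep(1 / 11, 11)
py <- px                            # Y has the same PMF, independent of X

# P(Z = z) = sum over x of P(X = x) * P(Y = z - x), for z = 0, ..., 20
pz <- numeric(length(px) + length(py) - 1)
for (i in seq_along(px)) {
  for (j in seq_along(py)) {
    z_index <- i + j - 1            # position of z = (i - 1) + (j - 1)
    pz[z_index] <- pz[z_index] + px[i] * py[j]
  }
}

z <- 0:20
plot(z, pz, type = "h")             # triangular shape with its peak at z = 10
```

The resulting PMF rises linearly to z = 10 and falls linearly afterwards, which is the triangular distribution seen in the N=2 simulation.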

The Normal Curve N(μ,σ) pdf Functions

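• The pdf of a normal continuous variable is described by the Gaussian formula:

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

• The standard normal pdf N(0,1) is:

$$f(x \mid 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$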
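The Gaussian formula can be coded directly and compared with R's built-in `dnorm`. This is a small sketch; the function name `normal_pdf` and the parameter values of the second curve are illustrative assumptions.

```r
# Hypothetical helper: the Gaussian density
#   f(x | mu, sigma) = 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-4, 4, by = 0.01)
max(abs(normal_pdf(x) - dnorm(x)))  # largest difference from the built-in density

plot(x, normal_pdf(x), type = "l", ylim = c(0, 0.85))     # standard normal N(0,1)
lines(x, normal_pdf(x, mu = -1, sigma = 0.5), lty = 2)    # negative mean, small variance
```

Changing `mu` slides the curve left or right, and changing `sigma` trades height for width while the area under the curve stays 1.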

Standard Normal Density (pdf)

• The standard normal curve has mean μ=0 and std. dev. σ=1, or variance σ²=1.

[Figure: normal pdf curves f(x | μ,σ); panels labeled Negative Mean, Small Variance, Big Variance]

Standard Normal Density

• The standard normal pdf N(μ=0, σ=1) is a good reference curve used to make statistical judgements.

[Figure: standard normal pdf f(x | μ=0,σ=1)]

Question: CLT

• If we average 10 vectors of random samples from N=10 different populations (having different distributions) and form the average $\bar{x}$, will $\bar{x}$, as an Experiment variable, have a tendency to be normally distributed?

Answer: CLT

• If we average 10 vectors of random samples from N=10 different populations (having different distributions) and form the average $\bar{x}$, will $\bar{x}$, as an Experiment variable, have a tendency to be normally distributed?

• Yes, provided the populations are independent, have finite variances, and none of them dominates the sum.

– Real-world (RW) additive natural processes of random events tend to generate approximately normally distributed random variables.

• Example: Communication signal noise.
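The question can be explored by simulation. The sketch below averages ten independent populations with different distributions; the particular distributions, their parameters, and the seed are arbitrary illustrative assumptions, not part of the course material.

```r
set.seed(620)                       # arbitrary seed, only for repeatability
n <- 10000

# Ten independent populations with different distributions (illustrative choice)
samples <- cbind(
  runif(n, 0, 10), runif(n, 2, 6), rexp(n, rate = 0.5), rexp(n, rate = 1),
  rbinom(n, size = 10, prob = 0.3), rpois(n, lambda = 4), rgeom(n, prob = 0.3),
  rnorm(n, mean = 5, sd = 2), rbeta(n, 2, 5) * 10, rchisq(n, df = 3)
)

x <- rowMeans(samples)              # Experiment variable: average of the 10 vectors
z <- (x - mean(x)) / sd(x)          # standardize for comparison with N(0,1)

hist(z, breaks = 50, freq = FALSE)  # empirical density of the standardized average
curve(dnorm(x), add = TRUE)         # standard normal reference curve
```

With this mix the histogram is roughly bell shaped but keeps some right skew, because several of the chosen populations are strongly asymmetric and N=10 is small.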

Sample & Population

• Sometimes the CLT supports a rapid tendency towards the Normal distribution, which justifies assuming that observed/measured random variables are Normal and treating them mathematically (Probability & Statistics) as such.

What is Probability & Statistics?

• Probability & Statistics are mathematical disciplines about (dealing with) GUESSING!

– GUESSING is rarely 100% correct.

– A GUESS may be close to the correct fact (an educated/savvy guess), but JUST CLOSE, or far from it (a WILD, uneducated guess).

Homework

• Repeat all R-sessions and comment on the results obtained.

• Create a summary table of the plots shown in slide no. 15.

Summary: Sample & Population

• Please distinguish sample mean and variance estimates from the overall population mean and variance values.

– μ = Population mean

– σ = Population standard deviation

– $\mu_{\bar{x}}$ = Sample mean

– $\sigma_{\bar{x}}$ = Sample standard deviation (???)

– N = Sample size

$$\mu = \lim_{N \to \infty} \mu_{\bar{X}}$$

Summary: CLT & RW

• Since most real-world (RW) random variables originate from multiple sources through an additive influence process, most of them tend to be approximately normally distributed.