The Normal distribution

The normal distribution is the most important distribution in statistics and is essential to much statistical theory and reasoning. It is, in a sense, the ‘parent’ distribution of all the sampling distributions that we shall meet later.

To get some feel for the normal distribution, consider the exercise of constructing a histogram of people’s heights (assumed to be normally distributed). Suppose we start with 100 people and construct a histogram using enough class intervals that the diagram gives some picture of the data’s distribution. This will be a fairly ‘ragged’ diagram, but useful nonetheless.

Now suppose we increase our sample to 500 and construct an appropriate histogram for these, but using more class intervals now that we have more data. This diagram will be smoother than the first, peaked in the centre, and roughly symmetric about the centre.

The normal distribution is emerging! If we continue this exercise to samples of 5,000 or even 50,000, then we will eventually arrive at a very smooth bell-shaped curve similar to that shown in Figure 5.1.

Hence we can view the normal distribution as the smooth limit of the basic histogram as the sample size becomes very large.
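The histogram exercise described above can be imitated by simulation. The sketch below is only an illustration: the population mean of 170 cm, the standard deviation of 10 cm, the fixed class width of 5 cm and the helper names are all assumptions made for this example, not values taken from the text. It measures the largest vertical gap between the histogram bars (rescaled to have total area of about 1) and the normal curve as the sample size grows.

```python
import random
from statistics import NormalDist

# Hypothetical population of heights in cm; both values are illustrative assumptions.
MU, SIGMA = 170.0, 10.0
population = NormalDist(MU, SIGMA)

def histogram_density(sample, lo, hi, width):
    """Return (midpoint, height) pairs, scaled so the bars have total area of about 1."""
    n_bins = int((hi - lo) / width)
    counts = [0] * n_bins
    for x in sample:
        if lo <= x < hi:  # the few values beyond 3 standard deviations are ignored
            counts[int((x - lo) // width)] += 1
    n = len(sample)
    return [(lo + (k + 0.5) * width, c / (n * width)) for k, c in enumerate(counts)]

def max_gap(n, rng):
    """Largest gap between a histogram bar and the normal curve for a sample of size n."""
    sample = [rng.gauss(MU, SIGMA) for _ in range(n)]
    bars = histogram_density(sample, MU - 3 * SIGMA, MU + 3 * SIGMA, 5.0)
    return max(abs(height - population.pdf(mid)) for mid, height in bars)

rng = random.Random(1)  # fixed seed so that the run is repeatable
gaps = {n: max_gap(n, rng) for n in (100, 500, 5_000, 50_000)}
for n, gap in gaps.items():
    print(f"n = {n:>6}: largest gap between histogram and curve = {gap:.4f}")
```

The class width is held fixed here so that the gaps are comparable across sample sizes; in practice one would use more class intervals as the sample grows, as the text suggests.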

Such a diagram represents the distribution of the population. It is conventional to adjust the vertical scale so that the total area under the curve is 1, which makes it easy to view areas under the curve as probabilities (recall axiom 2). The mathematical form of this curve is well known and can be used to compute areas, and thus probabilities; in due course we shall make use of statistical tables for this purpose.

[Figure 5.1: Density function of the standard normal distribution. Plot title: ‘Typical Normal Density Function Shape’; horizontal axis x from −3 to 3, vertical axis f_X(x) from 0.0 to 0.4.]

5.8.1 Relevance of the Normal distribution

The normal distribution is relevant to the application of statistics for many reasons such as:

Many naturally occurring phenomena can be modelled as following a normal distribution. Examples include heights of people, diameters of bolts, weights of pigs, etc.⁴

⁴ Note the use of the word ‘modelled’. This is due to the ‘distributional assumption’ of normality. A normal random variable X is defined over the entire real line, i.e. −∞ < X < ∞, yet a person cannot have a negative height, even though the normal distribution assigns positive probability to negative values. Also, nobody is of infinite height (the world’s tallest man ever, Robert Wadlow, was 272 cm), so clearly there is a finite upper bound to height, rather than ∞. Therefore height does not follow a true normal distribution, but it is a good enough approximation for modelling purposes.

A very important point is that averages of sampled variables (discussed later), and indeed any functions of sampled variables, also have probability distributions. It can be demonstrated, theoretically and empirically, that, provided the sample size is reasonably large, the distribution of the sample mean, X̄, will be (approximately) normal regardless of the distribution of the original variable. This is known as the Central Limit Theorem (CLT), which we will return to later.

The Normal distribution is often used as the distribution of the error term in standard statistical and econometric models such as linear regression. This assumption can be, and should be, checked. This is considered further in 04b Statistics 2.
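To put a number on the remark that a normal model for height assigns some probability to impossible values, the following sketch computes those probabilities for one illustrative model. The parameter values (mean 170 cm, standard deviation 10 cm) and the function names are assumptions for this example only. The complementary error function `math.erfc` is used so that extremely small tail probabilities are not rounded to zero.

```python
import math

# Illustrative model for adult height in cm; both values are assumptions.
MU, SIGMA = 170.0, 10.0

def lower_tail(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), computed via erfc to keep tiny tails accurate."""
    return 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2.0)))

def upper_tail(x, mu, sigma):
    """P(X > x) for X ~ N(mu, sigma^2)."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

p_negative = lower_tail(0.0, MU, SIGMA)        # probability of a negative height
p_above_record = upper_tail(272.0, MU, SIGMA)  # probability of exceeding 272 cm

print(f"P(X < 0)   = {p_negative:.3e}")
print(f"P(X > 272) = {p_above_record:.3e}")
```

Under these assumed parameters both probabilities are positive but vanishingly small, which is why the normal distribution can still be a good enough approximation for modelling purposes.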

5.8.2 Consequences of the Central Limit Theorem

The consequences of the CLT are two-fold:

A number of statistical methods that we use have a robustness property, i.e. their validity does not depend on just what the true population distribution of the variable being sampled is.

We are justified in assuming normality for statistics which are sample means or linear transformations of them.

The CLT was introduced above with the proviso ‘provided the sample size is reasonably large’. In practice, 30 or more is usually sufficient (and can be used as a rule of thumb), although the distribution of X̄ may be close to normal for n much less than 30. This depends on the distribution of the original (population) variable. If this population distribution is in fact normal, then all sample means computed from it will be normal, whatever the sample size. However, if the population distribution is very non-normal, then a sample size of (at least) 30 would be needed to justify normality.
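The effect of the sample size can be seen in a small simulation. The sketch below makes several assumptions of its own: the population is taken to be exponential with mean 1 (a strongly skewed, very non-normal choice), 5,000 samples are drawn for each n, and skewness is used as a rough measure of departure from the symmetric normal shape. None of these choices comes from the text.

```python
import random
from statistics import fmean, pstdev

def skewness(values):
    """Average cubed standardised deviation; close to 0 for a symmetric shape."""
    m, s = fmean(values), pstdev(values)
    return fmean([((v - m) / s) ** 3 for v in values])

def sample_mean_skewness(n, replications, rng):
    """Skewness of the simulated distribution of the sample mean for samples of size n."""
    means = [
        fmean([rng.expovariate(1.0) for _ in range(n)])  # one sample mean
        for _ in range(replications)
    ]
    return skewness(means)

rng = random.Random(2)  # fixed seed so that the run is repeatable
skew_by_n = {n: sample_mean_skewness(n, 5_000, rng) for n in (2, 10, 30)}
for n, skew in skew_by_n.items():
    print(f"n = {n:>2}: skewness of the sample mean = {skew:.3f}")
```

The skewness of the sample mean shrinks towards 0 as n increases, which is consistent with the sample mean becoming approximately normal even though the population itself is far from normal.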

5.8.3 Characteristics of the Normal distribution

The equation which describes the normal distribution takes the general form

\[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \]

The shape of this function is the bell-shaped curve, such as in Figure 5.1 above. Don’t panic, you will not be dealing with this function explicitly! That said, do be aware that it involves two parameters: the mean, µ, and the variance, σ².⁵

⁵ ‘Parameters’ were introduced in Chapter 2.

Since the Normal distribution is symmetric about µ, the distribution is centred at µ. As a consequence of this symmetry, the mean is equal to the median. Also, since the distribution peaks at µ, it is also equal to the mode. In principle, the mean can take any real value, i.e. −∞ < µ < ∞.

The variance is σ², hence the larger σ², the larger the spread of the distribution. Note that variances cannot be negative, hence 0 < σ² < ∞.
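Although the density function is not used explicitly in this course, a short numerical check may make its properties more concrete. The sketch below codes the density with parameters µ and σ² and checks three facts stated in this section: the total area is 1, the curve is symmetric about µ, and it peaks at µ. The parameter values µ = 2 and σ² = 9, the function names and the use of the trapezium rule are illustrative assumptions.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x, with sigma2 denoting the variance."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def area_under_curve(mu, sigma2, half_width=10.0, steps=20_000):
    """Trapezium-rule area over mu plus or minus half_width standard deviations."""
    sigma = math.sqrt(sigma2)
    lo, hi = mu - half_width * sigma, mu + half_width * sigma
    h = (hi - lo) / steps
    inner = sum(normal_pdf(lo + k * h, mu, sigma2) for k in range(1, steps))
    return h * (0.5 * normal_pdf(lo, mu, sigma2) + inner + 0.5 * normal_pdf(hi, mu, sigma2))

MU, SIGMA2 = 2.0, 9.0  # hypothetical parameter values

area = area_under_curve(MU, SIGMA2)
left, right = normal_pdf(MU - 1.5, MU, SIGMA2), normal_pdf(MU + 1.5, MU, SIGMA2)
peak = normal_pdf(MU, MU, SIGMA2)

print(f"area under the curve       = {area:.6f}")
print(f"heights at mu -/+ 1.5      = {left:.6f}, {right:.6f}")
print(f"height at mu (the maximum) = {peak:.6f}")
```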

If the random variable X has a normal distribution with parameters µ and σ², we denote this as X ∼ N(µ, σ²).⁶ Since a normal distribution is uniquely defined by these two parameters, and there are infinitely many possible values for µ and σ², there are infinitely many normal distributions.

⁶ Read ‘∼’ as ‘is distributed as’.

The most important normal distribution is the special case when µ = 0 and σ² = 1. We call this the standard normal distribution, denoted by Z, i.e. Z ∼ N(0, 1). Tabulated probabilities that appear in statistical tables are for this Z distribution.

5.8.4 Standard normal tables

We now discuss the determination of normal probabilities using standard statistical tables. (Extracts from the) New Cambridge Statistical Tables will be provided in the examination. Here we focus on Table 4. This table lists ‘lower-tail’ probabilities, which can be represented as

P(Z ≤ z) = Φ(z), z ≥ 0,

using the conventional Z notation for a standard normal variable.⁷ Note the cumulative probability⁸ for the Z distribution, P(Z ≤ z), is often denoted Φ(z). We now consider some examples of working out probabilities from Z ∼ N(0, 1).

⁷ Although Z is the conventional letter used to denote the standard Normal distribution, Table 4 uses ‘x’ notation.

⁸ A cumulative probability is the probability of being less than or equal to some particular value.
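For readers who wish to check table look-ups by computer, lower-tail probabilities can be reproduced from the error function using the standard identity Φ(z) = ½[1 + erf(z/√2)]. This identity and the function name `phi` are not part of the text, and the sketch does not reproduce Table 4 itself; it only prints a few values to four decimal places in the same spirit.

```python
import math

def phi(z):
    """Lower-tail probability P(Z <= z) for Z ~ N(0, 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A few lower-tail probabilities, rounded to four decimal places as in printed tables.
for z in (0.0, 0.5, 1.0, 1.2, 1.96):
    print(f"z = {z:4.2f}   Phi(z) = {phi(z):.4f}")
```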


Activity

If Z ∼ N(0, 1), what is P(Z > 1.2)?

When computing probabilities, it is useful to draw a quick sketch to visualise the specific area of probability that we are after.

[Figure 5.2: Standard normal distribution with shaded area depicting P(Z > 1.2). Plot title: ‘Standard Normal Density Function’; horizontal axis z from −3 to 3, vertical axis f_Z(z) from 0.0 to 0.4.]

So for P(Z > 1.2) we require the upper-tail probability shaded in red. Since Table 4 gives us lower-tail probabilities, looking up the value 1.2 in the table gives P(Z ≤ 1.2) = 0.8849. The total area under a normal curve is 1, so the required probability is simply 1 − Φ(1.2) = 1 − 0.8849 = 0.1151.
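The same answer can be checked numerically. This is only a cross-check using `NormalDist` from Python’s standard library in place of Table 4; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist(0.0, 1.0)  # standard normal: mean 0, standard deviation 1

lower = Z.cdf(1.2)   # lower-tail probability P(Z <= 1.2), i.e. the table value
upper = 1.0 - lower  # upper-tail probability P(Z > 1.2)

print(f"P(Z <= 1.2) = {lower:.4f}")
print(f"P(Z > 1.2)  = {upper:.4f}")
```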

Activity

If Z ∼ N(0, 1), what is P(−1.24 < Z < 1.86)?

Again, begin by producing a sketch.

[Figure 5.3: Standard normal distribution with shaded area depicting P(−1.24 < Z < 1.86). Plot title: ‘Standard Normal Density Function’; horizontal axis z from −3 to 3, vertical axis f_Z(z) from 0.0 to 0.4.]

The probability we require is the sum of the blue and red areas.

Using the tables (which, note, only cover z ≥ 0), we proceed as follows:

Red area is given by:

P(0 ≤ Z ≤ 1.86) = P(Z ≤ 1.86) − P(Z ≤ 0)
= Φ(1.86) − Φ(0)
= 0.9686 − 0.5
= 0.4686.

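Blue area is given by:

P(−1.24 ≤ Z ≤ 0) = P(Z ≤ 0) − P(Z ≤ −1.24)
= Φ(0) − Φ(−1.24).

Since Table 4 does not give probabilities for negative z values, we can exploit the symmetry of the (standard) Normal distribution: Φ(−1.24) = 1 − Φ(1.24) = 1 − 0.8925 = 0.1075. The blue area is therefore 0.5 − 0.1075 = 0.3925.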

Hence P(−1.24 < Z < 1.86) = 0.4686 + 0.3925 = 0.8611.
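As a cross-check on the two-area calculation, the sketch below recomputes the red and blue areas and compares their sum with the probability obtained directly as Φ(1.86) − Φ(−1.24). It uses `NormalDist` from Python’s standard library rather than Table 4, and the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist(0.0, 1.0)  # standard normal

red = Z.cdf(1.86) - Z.cdf(0.0)   # P(0 <= Z <= 1.86)
blue = Z.cdf(1.24) - Z.cdf(0.0)  # P(0 <= Z <= 1.24), equal to P(-1.24 <= Z <= 0) by symmetry
direct = Z.cdf(1.86) - Z.cdf(-1.24)  # the same probability without splitting at 0

print(f"red area   = {red:.4f}")
print(f"blue area  = {blue:.4f}")
print(f"red + blue = {red + blue:.4f}")
print(f"direct     = {direct:.4f}")
```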

5.8.5 The general Normal distribution

We have already discussed that there exists an infinite number of different normal distributions due to the infinite pairs of parameter values, since −∞ < µ < ∞ and σ² > 0. The good news is that the standard normal statistical tables can be used to determine probabilities for any normal random variable X, such that X ∼ N(µ, σ²).

To do so, we need a little bit of magic: standardisation. This is a special transformation which converts X ∼ N(µ, σ²) into Z ∼ N(0, 1). The transformation is:

\[ Z = \frac{X - \mu}{\sigma}, \]

that is, we subtract the mean from X and divide by the standard deviation.
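A brief worked illustration of standardisation follows. The numbers are hypothetical: heights are assumed to follow N(170, 10²) in cm and we ask for P(X > 185). Standardising gives z = (185 − 170)/10 = 1.5, so the answer is 1 − Φ(1.5). The sketch checks that this agrees with working directly with the distribution of X.

```python
from statistics import NormalDist

# Hypothetical parameters and cut-off, chosen only for illustration.
MU, SIGMA = 170.0, 10.0
x = 185.0

z = (x - MU) / SIGMA  # standardise: subtract the mean, divide by the standard deviation

p_via_z = 1.0 - NormalDist(0.0, 1.0).cdf(z)     # upper tail from the standard normal
p_direct = 1.0 - NormalDist(MU, SIGMA).cdf(x)   # upper tail from N(170, 10^2) directly

print(f"z = {z}")
print(f"P(X > {x}) via Z  = {p_via_z:.4f}")
print(f"P(X > {x}) direct = {p_direct:.4f}")
```

In an examination setting only the first route is available, since the tables cover the standard normal distribution alone.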

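To see why,⁹ first note that any linear transformation of a normal random variable is also normal. Hence, as X is normal, so is Z, since the standardisation transformation is linear. It remains to show that standardisation results in a random variable with zero mean and unit variance. Since X ∼ N(µ, σ²):

\[ E(Z) = E\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma}\,E(X - \mu) = \frac{1}{\sigma}\,(E(X) - \mu) = \frac{1}{\sigma}\,(\mu - \mu) = 0. \]

This result exploits the fact that σ is a constant, hence can be taken outside the expectation operator.

\[ \mathrm{Var}(Z) = \mathrm{Var}\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma^2}\,\mathrm{Var}(X - \mu) = \frac{1}{\sigma^2}\,\mathrm{Var}(X) = \frac{\sigma^2}{\sigma^2} = 1. \]

This result uses the fact that we must square a constant when taking it outside the ‘Var’ operator.

⁹ This bit of theory is purely for illustrative purposes (and for the interested student!); you will not have to reproduce this in the examination.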
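The zero-mean, unit-variance property can also be checked empirically. In the sketch below, the parameters (mean 170, standard deviation 10), the number of simulated values and the seed are all arbitrary assumptions; the simulated values are standardised and their mean and variance are compared with 0 and 1.

```python
import random
from statistics import fmean, pvariance

MU, SIGMA = 170.0, 10.0  # hypothetical parameters of X
rng = random.Random(3)   # fixed seed so that the run is repeatable

xs = [rng.gauss(MU, SIGMA) for _ in range(20_000)]  # simulated values of X
zs = [(x - MU) / SIGMA for x in xs]                 # standardised values

z_mean = fmean(zs)
z_var = pvariance(zs)

print(f"mean of standardised values     = {z_mean:.4f}")
print(f"variance of standardised values = {z_var:.4f}")
```

The results will not be exactly 0 and 1 because they are computed from a finite simulated sample, but they should be close.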

Activity

In Chapter 2, ‘Statistics’ was introduced as a discipline for data analysis. In this chapter we have encountered the Normal probability distribution. Such distributions typically have associated parameters, such as the (theoretical) mean, µ, and the (theoretical) variance, σ², for the Normal distribution. By convention, Greek letters are used to denote population parameters, whose values in practice are typically unknown.

The subject of the next few chapters of this course will be statistical inference, whereby we infer unknown population parameters based on sample data. As an example, suppose we wanted to investigate the height of the UK population. As previously discussed, it is reasonable to assume that height is a normally distributed variable with some mean µ and some variance σ². What are the exact values of these parameters? To know these values precisely would require data on the heights of the entire UK population, all 60-plus million people.

Population sizes, N, are typically very large and clearly no-one has the time, money or patience to undertake such a marathon data collection exercise. Instead we opt to collect a sample (some subset of the population) of size n.¹⁰ Having collected said sample, we then estimate the unknown population parameters based on the known sample data. Specifically, we estimate population quantities based on their respective sample counterparts.

¹⁰ If n = N, and we sample without replacement, then we have obtained a complete enumeration of the population, i.e. a census.

A ‘statistic’ (singular noun) is just some function calculated from data. A sample statistic is calculated from sample data. At this point, be aware of the following distinction:

An estimator is a function (formula) describing how to obtain...

a (point) estimate, which is a numerical value.
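The distinction can be made concrete in code: the estimator corresponds to a function, and the estimate is the number that the function returns for one particular sample. In the sketch below the function name `sample_mean` and the five heights are made up purely for illustration.

```python
def sample_mean(observations):
    """Estimator: the rule 'add up the observations and divide by how many there are'."""
    return sum(observations) / len(observations)

# A made-up sample of n = 5 heights in cm; these are not real data.
heights_cm = [172.0, 165.5, 180.2, 158.9, 175.4]

# Estimate: the numerical value produced by applying the estimator to this sample.
estimate = sample_mean(heights_cm)

print(f"point estimate of the mean height = {estimate:.1f} cm")
```

A different sample would give a different estimate, while the estimator (the rule itself) stays the same.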

For example, the sample mean \( \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \) is the estimator for the
