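Distribution of the Sample Variance. Let $s^2$ be the variance of a sample of $n$ and let $\bar{x}$ be the sample mean. Then

$$s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \Big(\sum_{i=1}^{n}x_i/n\Big)^2$$

$$= \frac{1}{n}\sum_{i}x_i^2 - \frac{1}{n^2}\sum_{i}x_i^2 - \frac{1}{n^2}\sum_{i}\sum_{j}(x_i x_j), \quad (i \neq j)$$

$$= \frac{n-1}{n^2}\sum_{i}x_i^2 - \frac{1}{n^2}\sum_{i}\sum_{j}(x_i x_j), \quad (i \neq j)$$

Consequently

$$\mathcal{E}(s^2) = \frac{n-1}{n^2}\,\mathcal{E}\Big(\sum_{i}x_i^2\Big) - \frac{1}{n^2}\,\mathcal{E}\Big[\sum_{i}\sum_{j}(x_i x_j)\Big], \quad (i \neq j)$$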

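But there are $n$ ways of choosing $x_i$ from $n$ values and, once $x_i$ is chosen, there are $(n - 1)$ ways of choosing $x_j$ so that $x_j \neq x_i$. Also, since $x_i$ and $x_j$ are independent,

$$\mathcal{E}\Big(\sum_{i}\sum_{j}(x_i x_j)\Big) = \sum_{i}\sum_{j}[\mathcal{E}(x_i x_j)] = \sum_{i}\sum_{j}[\mathcal{E}(x_i)\,.\,\mathcal{E}(x_j)] = \sum_{i}\sum_{j}(\mathcal{E}(x))^2 = 0$$

since $\mathcal{E}(x) = \mu_1' = 0$. Therefore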

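$$\mathcal{E}(s^2) = \frac{n-1}{n^2}\,.\,n\sigma^2 = \frac{(n-1)\sigma^2}{n}$$

since, the population mean being zero, $\mathcal{E}(x_i^2) = \sigma^2$ for each of the $n$ values. Thus the mean value of the variance of all possible samples of $n$ is $(n - 1)/n$ times the variance of the parent population; and, as we should expect, $\mathcal{E}(s^2) \to \sigma^2$, as the sample size increases indefinitely.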

Thus if we draw a single sample of $n$ from a population and calculate the variance, $s^2$, of that sample, we shall have

$$\mathcal{E}(ns^2/(n-1)) = \mathcal{E}\Big(\sum_{i}(x_i - \bar{x})^2/(n-1)\Big) = \sigma^2 \qquad (7.11.2)$$

In other words:

If we calculate $\sum_{i}(x_i - \bar{x})^2/(n-1)$, instead of the usual $\sum_{i}(x_i - \bar{x})^2/n$, we have an unbiased estimate of the population variance, $\sigma^2$.
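The statement can be checked by a small sampling experiment. The following Python sketch is an illustration only and is not part of the original text: the function names, the fixed seed, and the choice of a normal population with $n = 5$ and $\sigma = 2$ are all assumptions made for the example.

```python
import random

def sample_variances(sample):
    """Return the variance with divisor n and with divisor n - 1."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq = sum((x - mean) ** 2 for x in sample)
    return sum_sq / n, sum_sq / (n - 1)

def average_variances(n, sigma, trials, seed=0):
    """Average both variance estimates over many samples of n.

    The population is assumed normal with zero mean (an illustrative choice).
    """
    rng = random.Random(seed)
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        v_n, v_n_minus_1 = sample_variances(sample)
        total_n += v_n
        total_n_minus_1 += v_n_minus_1
    return total_n / trials, total_n_minus_1 / trials

# With sigma = 2 the population variance is 4; the divisor-n average
# should lie near (n - 1)/n * 4 = 3.2 and the divisor-(n - 1) average near 4.
biased_mean, unbiased_mean = average_variances(n=5, sigma=2.0, trials=20000)
print(biased_mean, unbiased_mean)
```

The two averages differ exactly by the factor $(n - 1)/n$, whatever the samples drawn, because each sample's two estimates differ by that factor.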

Of course, the actual figure calculated from the data of a single sample will, in general, differ from the actual value of $\sigma^2$. But, if we continue sampling and calculating $ns^2/(n - 1)$, we shall obtain a set of values, the mean value of which will tend to the actual value of $\sigma^2$ as the number, $N$, of samples of $n$ drawn increases. A function of $x$ and $n$ which in this way yields, as its values, unbiased estimates of a population parameter is called an unbiased estimator of that parameter. Thus if $\theta$ is a population parameter and $\hat{\theta}$ (read "theta cap") is an estimator of $\theta$, $\hat{\theta}$ is an unbiased estimator if $\mathcal{E}(\hat{\theta}) = \theta$. $m_1' = \sum_{i}x_i/n$ is an unbiased estimator of $\mu$, the population mean, since $\mathcal{E}(m_1') = \mu$; on the other hand, $s^2 = \sum_{i}(x_i - \bar{x})^2/n$ is a biased estimator of $\sigma^2$. For this reason, some writers define the variance of a sample to be $\sum_{i}f_i(x_i - \bar{x})^2/(n - 1)$, $\sum_{i}f_i = n$; with this definition, the sample variance is an unbiased estimate of the population variance. Although we have not adopted this definition here, we are introduced by it to an important notion, that of the degrees of freedom of a sample.

Let $\bar{x}$ be the mean of a sample of $n$. Then $n\bar{x} = \sum_{i}x_i$ and, for a given $\bar{x}$, there are only $n - 1$ independent $x_i$, for when we have selected $n - 1$ items of the sample, the $n$th is necessarily determined. $\sum_{i}x_i = n\bar{x}$ is a linear equation of constraint on the sample.

If there are $p$ ($< n$, of course) linear equations of constraint on the sample, the number of degrees of freedom of the sample is reduced by $p$.
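The effect of a constraint can be seen in a few lines of Python. This is a minimal sketch, not part of the original text; the helper names `last_item` and `degrees_of_freedom` are hypothetical and chosen only for illustration.

```python
def last_item(known_items, mean, n):
    """Return the n-th item fixed by the constraint sum(x) = n * mean."""
    if len(known_items) != n - 1:
        raise ValueError("exactly n - 1 items must be supplied")
    return n * mean - sum(known_items)

def degrees_of_freedom(n, constraints):
    """Sample size less the number of linear constraints (constraints < n)."""
    if not 0 <= constraints < n:
        raise ValueError("the number of constraints must be less than n")
    return n - constraints

# A sample of 4 with mean 5: once three items are chosen, the fourth follows.
print(last_item([2, 4, 9], mean=5, n=4))
print(degrees_of_freedom(n=4, constraints=1))
```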

146 s t a t i s t i c s

variance. Once again we confine our attention to the case when the population sampled is normal with zero mean and variance $\sigma^2$.

What is the probability that a sample $(x_1, x_2, \ldots, x_n)$ from such a population is such that its mean lies between $\bar{x} \pm \tfrac{1}{2}d\bar{x}$ and its standard deviation between $s \pm \tfrac{1}{2}ds$?

Since the $n$ $x$'s are independent yet are drawn from the same population, the probability that the $n$ values of $x$ shall lie simultaneously between $x_1 \pm \tfrac{1}{2}dx_1$, $x_2 \pm \tfrac{1}{2}dx_2$, $\ldots$, $x_n \pm \tfrac{1}{2}dx_n$ is

$$dp = (2\pi\sigma^2)^{-n/2}\exp\big(-(x_1^2 + x_2^2 + \ldots + x_n^2)/2\sigma^2\big)\,dx_1 dx_2 \ldots dx_n \qquad (7.11.3)$$

Now think of $(x_1, x_2, \ldots, x_n)$ as the co-ordinates of a point $P$ in a space of $n$ dimensions. Then $dx_1 dx_2 \ldots dx_n$ is an element of volume in that space. Call it $dv$. Then $dp$ is the probability that the point $P$ shall lie within this volume element. If now we choose this volume element $dv$ in such a way that any point $P$, lying within it, represents a sample of $n$ with mean lying between $\bar{x} \pm \tfrac{1}{2}d\bar{x}$ and a standard deviation between $s \pm \tfrac{1}{2}ds$, $dp$ will be the probability that our sample has a mean and standard deviation lying between these limits. Our problem therefore is to find an appropriate $dv$. Now we have the two equations,

$$\sum_{i=1}^{n}x_i = n\bar{x} \quad \text{and} \quad \sum_{i=1}^{n}(x_i - \bar{x})^2 = ns^2.$$

Each of these equations represents a locus in our $n$-dimensional space. If $n$ were equal to 3, the equation $\sum_{i=1}^{n}x_i = n\bar{x}$ may be written $(x_1 - \bar{x}) + (x_2 - \bar{x}) + (x_3 - \bar{x}) = 0$ and represents a plane through the point $(\bar{x}, \bar{x}, \bar{x})$. Moreover, the length of the perpendicular from the origin on to this plane is $3\bar{x}/3^{1/2} = \bar{x}\,.\,3^{1/2}$ and, so, the perpendicular distance between this plane and the parallel plane through $(\bar{x} + d\bar{x}, \bar{x} + d\bar{x}, \bar{x} + d\bar{x})$ is $d\bar{x}\,.\,3^{1/2}$. In the $n$-dimensional case, the equation $\sum_{i=1}^{n}x_i = n\bar{x}$ represents a "hyperplane", as it is called, and the "distance" between this "plane" and the "parallel plane" is $d\bar{x}\,.\,n^{1/2}$.

Again, if $n = 3$, the equation $\sum_{i=1}^{n}(x_i - \bar{x})^2 = ns^2$ becomes $(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 = 3s^2$, and thus represents a sphere, with centre $(\bar{x}, \bar{x}, \bar{x})$ and radius $s\,.\,3^{1/2}$. The plane $\sum_{i=1}^{3}x_i = 3\bar{x}$ passes through the centre of this sphere. The section will therefore be a circle of radius $s\,.\,3^{1/2}$, whose area is proportional to $s^2$. If $s$ increases from $s - \tfrac{1}{2}ds$ to $s + \tfrac{1}{2}ds$, the increase in the area of section is proportional to $d(s^2)$. So the volume, $dv$, enclosed by the two neighbouring spheres and the two neighbouring planes will be proportional to $d\bar{x}\,.\,d(s^2)$. In the $n$-dimensional case, instead of a sphere, we have a "hypersphere" of "radius" $s\,.\,n^{1/2}$ and this "hypersphere" is cut by our "hyperplane" in a section which now has a "volume" proportional to $s^{n-1}$. So, in this case, $dv$ is proportional to $d\bar{x}\,.\,d(s^{n-1}) = ks^{n-2}\,ds\,d\bar{x}$, say. Consequently, the probability that our sample of $n$ will lie within this volume is given by

$$dp = \frac{k}{(2\pi\sigma^2)^{n/2}}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}x_i^2\Big\}\,s^{n-2}\,ds\,d\bar{x} \qquad (7.11.3(a))$$

But the equation $\sum_{i=1}^{n}(x_i - \bar{x})^2 = ns^2$ may be written $\sum_{i=1}^{n}x_i^2 = n(s^2 + \bar{x}^2)$ and, therefore,

$$dp = k_1\exp(-n\bar{x}^2/2\sigma^2)\,d\bar{x} \times k_2\exp(-ns^2/2\sigma^2)(s^2)^{(n-3)/2}\,d(s^2) \qquad (7.11.4)$$

where $k_1$ and $k_2$ are constants.¹
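The step from (7.11.3(a)) to (7.11.4) rests on the identity $\sum x_i^2 = n(s^2 + \bar{x}^2)$, which splits the exponent into a part depending only on $\bar{x}$ and a part depending only on $s^2$. The Python sketch below checks this split numerically for one arbitrary sample; it is an illustration and not part of the original text, and the function names and sample values are assumptions.

```python
import math

def mean_and_variance(sample):
    """Return the sample mean and the variance with divisor n."""
    n = len(sample)
    mean = sum(sample) / n
    s2 = sum((x - mean) ** 2 for x in sample) / n
    return mean, s2

def exponent_split(sample, sigma):
    """Return the joint exponent and its mean-only and variance-only parts."""
    n = len(sample)
    mean, s2 = mean_and_variance(sample)
    joint = -sum(x * x for x in sample) / (2 * sigma ** 2)
    mean_part = -n * mean ** 2 / (2 * sigma ** 2)
    variance_part = -n * s2 / (2 * sigma ** 2)
    return joint, mean_part, variance_part

joint, mean_part, variance_part = exponent_split([1.5, -0.5, 2.0, 0.25], sigma=1.3)
# The joint exponent equals the sum of the two separate parts.
print(math.isclose(joint, mean_part + variance_part))
```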

¹ Determination of $k_1$: Since

$$\int_{-\infty}^{+\infty}k_1\exp(-n\bar{x}^2/2\sigma^2)\,d\bar{x} = 1$$

we have immediately (5.4(e) footnote) $k_1 = (2\pi\sigma^2/n)^{-1/2}$.

Determination of $k_2$: $s^2$ varies from 0 to $\infty$; therefore

$$k_2\int_{0}^{\infty}\exp(-ns^2/2\sigma^2)(s^2)^{(n-3)/2}\,d(s^2) = 1$$

Put $ns^2/2\sigma^2 = x$; then

$$k_2\int_{0}^{\infty}(2\sigma^2/n)^{(n-1)/2}\exp(-x)\,x^{(n-1)/2-1}\,dx = 1$$

But since, by definition (see Mathematical Note to Chapter Three),

$$\int_{0}^{\infty}\exp(-x)\,x^{(n-1)/2-1}\,dx = \Gamma((n-1)/2)$$

$$k_2 = (n/2\sigma^2)^{(n-1)/2}/\Gamma((n-1)/2).$$


We see immediately that, when the population sampled is normal:

(i) the mean $\bar{x}$ and the variance $s^2$ of a sample of $n$ are distributed independently;

(ii) the sample mean is distributed normally about the population mean (taken here at the origin) with variance $\sigma^2/n$; and

(iii) the sample variance $s^2$ is not normally distributed.

The moment-generating function for the $s^2$-distribution is $M(t) \equiv \mathcal{E}(\exp ts^2)$. Thus

$$M(t) = k_2\int_{0}^{\infty}\exp(-ns^2/2\sigma^2)\exp(ts^2)(s^2)^{(n-3)/2}\,d(s^2)$$

Putting $X^2 = (n/2\sigma^2 - t)s^2$, we have

$$M(t) = k_2\,(n/2\sigma^2 - t)^{-(n-1)/2}\int_{0}^{\infty}\exp(-X^2)\,(X^2)^{(n-1)/2-1}\,d(X^2)$$

But, by 3.A.3,

$$\int_{0}^{\infty}\exp(-X^2)\,(X^2)^{(n-1)/2-1}\,d(X^2) = \Gamma((n-1)/2)$$

and so, since $k_2 = (n/2\sigma^2)^{(n-1)/2}/\Gamma((n-1)/2)$,

$$M(t) = (1 - 2\sigma^2 t/n)^{-(n-1)/2} \qquad (7.11.5)$$

The coefficient of $t$ in the expansion of this function is the first moment of $s^2$ about the origin: $(n - 1)\sigma^2/n$, as already established. The coefficient of $t^2/2!$ is the second moment of $s^2$ about the origin: $(n^2 - 1)\sigma^4/n^2$. Hence

$$\text{var}(s^2) = (n^2 - 1)\sigma^4/n^2 - (n - 1)^2\sigma^4/n^2 = 2(n - 1)\sigma^4/n^2$$

For large samples, therefore, $\text{var}(s^2) \approx 2\sigma^4/n$. In other words, the standard error of the sample variance for a normal parent population is approximately $\sigma^2\sqrt{2/n}$ for large samples.
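The moments quoted above follow from expanding $M(t) = (1 - at)^{-k}$ with $a = 2\sigma^2/n$ and $k = (n - 1)/2$: the coefficient of $t^r/r!$ is $k(k + 1)\ldots(k + r - 1)\,a^r$. The following Python sketch evaluates these coefficients in exact rational arithmetic. It is offered as an illustration only; the function names are assumptions and do not come from the text.

```python
from fractions import Fraction

def raw_moment(n, r, sigma2=Fraction(1)):
    """r-th moment of s^2 about the origin, read off from (7.11.5).

    The coefficient of t^r / r! in (1 - a t)^(-k) is k (k+1) ... (k+r-1) a^r,
    with a = 2 sigma^2 / n and k = (n - 1) / 2.
    """
    a = 2 * Fraction(sigma2) / n
    k = Fraction(n - 1, 2)
    rising_product = Fraction(1)
    for j in range(r):
        rising_product *= k + j
    return rising_product * a ** r

def variance_of_s2(n, sigma2=Fraction(1)):
    """Second moment about the origin less the square of the first."""
    return raw_moment(n, 2, sigma2) - raw_moment(n, 1, sigma2) ** 2

# For n = 10 and sigma^2 = 1: mean (n-1)/n, variance 2(n-1)/n^2.
print(raw_moment(10, 1), variance_of_s2(10))
```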

7.12. Worked Example.

If $s_1^2$ and $s_2^2$ are the variances in two independent samples of the same size taken from a common normal population, determine the distribution of $s_1^2 + s_2^2$. (L.U.)

Treatment: The moment-generating function of $s^2$ for samples of $n$ from a normal population $(0, \sigma)$ is (7.11.5)

$$M(t) = (1 - 2\sigma^2 t/n)^{-(n-1)/2}$$

Hence $\mathcal{E}(\exp ts_1^2) = \mathcal{E}(\exp ts_2^2) = M(t)$.

But, since the samples are independent,

$$\mathcal{E}(\exp t(s_1^2 + s_2^2)) = \mathcal{E}[(\exp ts_1^2)(\exp ts_2^2)] = [M(t)]^2$$

Hence the m.g.f. for the sum of the two sample variances is $(1 - 2\sigma^2 t/n)^{-(n-1)}$.

Expanding this in powers of $t$, we find that the mean value of $s_1^2 + s_2^2$ is $2(n - 1)\sigma^2/n$, the coefficient of $t/1!$, and

$$\text{var}(s_1^2 + s_2^2) = 4n(n - 1)\sigma^4/n^2 - 4(n - 1)^2\sigma^4/n^2 = 4(n - 1)\sigma^4/n^2$$

which, for large $n$, is approximately equal to $4\sigma^4/n$. The probability differential for $s_1^2 + s_2^2$ is

$$dp = \frac{(n/2\sigma^2)^{n-1}}{\Gamma(n - 1)}\exp\big(-n(s_1^2 + s_2^2)/2\sigma^2\big)(s_1^2 + s_2^2)^{n-2}\,d(s_1^2 + s_2^2)$$
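As a check on the probability differential just obtained, the density can be integrated numerically: the total should be 1 and the mean should agree with $2(n - 1)\sigma^2/n$. The Python sketch below does this with a simple midpoint rule. It is an illustration only; the function names, the choice $n = 6$, $\sigma = 1$, and the upper limit of integration are assumptions.

```python
import math

def sum_density(w, n, sigma):
    """Density of w = s1^2 + s2^2 for two samples of n from a normal population."""
    rate = n / (2 * sigma ** 2)
    shape = n - 1
    return rate ** shape / math.gamma(shape) * math.exp(-rate * w) * w ** (shape - 1)

def integrate(f, lower, upper, steps=20000):
    """Midpoint-rule approximation to the integral of f from lower to upper."""
    h = (upper - lower) / steps
    return h * sum(f(lower + (i + 0.5) * h) for i in range(steps))

n, sigma = 6, 1.0
# The upper limit 20 is far enough out for the neglected tail to be negligible here.
total = integrate(lambda w: sum_density(w, n, sigma), 0.0, 20.0)
mean = integrate(lambda w: w * sum_density(w, n, sigma), 0.0, 20.0)
print(total, mean, 2 * (n - 1) * sigma ** 2 / n)
```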

EXERCISES ON CHAPTER SEVEN

1. Using the table of random numbers given in the text, draw a random sample of 35 from the "population" in Table 5.1. Calculate the sample mean and compare it with the result obtained in 7.5.

2. A bowl contains a very large number of black and white balls. The probability of drawing a black ball in a single draw is $p$ and that of drawing a white ball, therefore, $1 - p$. A sample of $m$ balls is drawn at random, and the number of black balls in the sample is counted and marked as the score for that draw. A second sample of $m$ balls is drawn, and the number of white balls in this sample is the corresponding score. Find the expected combined score, and show that the variance of the combined score is $2mp(1 - p)$.

3. Out of a batch of 1,000 kg of chestnuts from a large shipment, it is found that there are 200 kg of bad nuts. Estimate the limits between which the percentage of bad nuts in the shipment is almost certain to lie.

4. A sample of 400 items is drawn from a normal population whose mean is 5 and whose variance is 4. If the sample mean is 4.45, can the sample be regarded as a truly random sample?

5. A sample of 400 items has a mean of 1.13; a sample of 900 items has a mean of 1.01. Can the samples be regarded as having been drawn at random from a common population of standard deviation 0.1?


6. A random variate $x$ is known to have the distribution

$$p(x) = c(1 + x/a)^{m-1}\exp(-mx/a), \quad -a < x < \infty$$

Find the constant $c$ and the first four moments of $x$. Derive the linear relation between $\beta_1$ and $\beta_2$ of this distribution. (L.U.)

7. Pairs of values of two variables $x$ and $y$ are given. The variances of $x$, $y$ and $(x - y)$ are $\sigma_x^2$, $\sigma_y^2$ and $\sigma_{(x-y)}^2$ respectively. Show that the coefficient of correlation between $x$ and $y$ is

$$(\sigma_x^2 + \sigma_y^2 - \sigma_{(x-y)}^2)/2\sigma_x\sigma_y$$

(L.U.)

8. If $u = ax + by$ and $v = bx - ay$, where $x$ and $y$ represent deviations from respective means, and if the correlation coefficient between $x$ and $y$ is $\rho$, but $u$ and $v$ are uncorrelated, show that

$$\sigma_u\sigma_v = (a^2 + b^2)\,\sigma_x\sigma_y(1 - \rho^2)^{1/2}$$

(L.U.)

Solutions

2. Expected score is m.

3. Probability $p$ of 1 kg of bad nuts is $200/1000 = 0.2$. Assume this is constant throughout the batch, $q = 0.8$. Mean is $np$ and variance is $npq$. For the proportion of bad nuts we divide the variate by $n$ and hence the variance by $n^2$, giving variance $= pq/n$. The standard error of the proportion of bad nuts is $\sqrt{pq/n} = \sqrt{0.2 \times 0.8/1000} = 0.01264$. The probability that a normal variate will differ from its mean value by more than three times its standard error is 0.0027. We can be practically sure that no deviation will be greater than this. Required limits for the percentage of bad nuts are therefore $100(0.2 \pm 3 \times 0.01264) = 23.8\%$ and $16.2\%$.
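The arithmetic of this solution can be reproduced in a few lines of Python. The sketch is illustrative only; the function name `proportion_limits` and its `multiple` parameter are assumptions, and, as in the solution, each kilogram is treated as one item of the sample.

```python
import math

def proportion_limits(bad, total, multiple=3.0):
    """Return (lower, upper, standard_error) for the proportion bad / total."""
    p = bad / total
    q = 1.0 - p
    standard_error = math.sqrt(p * q / total)
    return p - multiple * standard_error, p + multiple * standard_error, standard_error

lower, upper, se = proportion_limits(200, 1000)
# Limits as percentages, rounded to one decimal place.
print(round(100 * lower, 1), round(100 * upper, 1), round(se, 5))
```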

4. No, deviation of sample mean from population mean $> 4 \times$ S.E. of mean for sample of size given.

5. Difference between means is nearly twenty times the S.E. of the difference of means.
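The ratios referred to in solutions 4 and 5 may be computed as follows. This Python sketch is an illustration, not part of the original solutions; the function names are assumptions, and the standard errors used are $\sigma/\sqrt{n}$ for a single mean and $\sigma\sqrt{1/n_1 + 1/n_2}$ for the difference of two means from a common population.

```python
import math

def z_single_mean(sample_mean, population_mean, population_sd, n):
    """Deviation of the sample mean in units of its standard error."""
    return (sample_mean - population_mean) / (population_sd / math.sqrt(n))

def z_two_means(mean1, n1, mean2, n2, population_sd):
    """Difference of two sample means in units of its standard error."""
    standard_error = population_sd * math.sqrt(1.0 / n1 + 1.0 / n2)
    return (mean1 - mean2) / standard_error

# Exercise 4: mean 5, variance 4 (standard deviation 2), n = 400, sample mean 4.45.
print(z_single_mean(4.45, 5.0, 2.0, 400))
# Exercise 5: samples of 400 and 900 with common standard deviation 0.1.
print(z_two_means(1.13, 400, 1.01, 900, 0.1))
```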

6. Hint: $\int_{-a}^{\infty}p(x)\,dx = 1$. Transform by using the substitution $1 + x/a = t/m$. $c = m^m e^{-m}/a\Gamma(m)$; mean-moment generating function is $e^{-at}(1 - at/m)^{-m}$; $\mu_2/2! = a^2/2m$; $\mu_3/3! = a^3/3m^2$; $\mu_4/4! = a^4(m + 2)/8m^3$; $2\beta_2 - 3\beta_1 = 6$.

CHAPTER EIGHT

SAMPLE AND POPULATION

II: t, z, AND F

8.1. The t-distribution. We have seen that if $\bar{x}$ is the mean of a sample of $n$ from a normal population $(\mu, \sigma)$ the variate

$$t \equiv (\bar{x} - \mu)/(\sigma/\sqrt{n})$$

is normally distributed about zero mean with unit variance. But what if the variance of the parent population is unknown and we wish to test whether a given sample can be considered to be a random sample from that population? The best we can do is to use an estimate of $\sigma$ based on our sample of $n$, i.e., $s\{n/(n - 1)\}^{1/2}$, the square root of the unbiased estimate of $\sigma^2$, $s$ being the sample standard deviation. But if we do this, we cannot assume that

$$t = (\bar{x} - \mu)(n - 1)^{1/2}/s \qquad (8.1.1)$$

is a normal variate. In fact, it is not. What is the distribution of this $t$ (called Student's $t$)?

Let the population mean be taken as origin, then

$$t = (n - 1)^{1/2}\bar{x}/s \qquad (8.1.1(a))$$

Since we may write $t = (n - 1)^{1/2}\{(\bar{x}/\sigma)/(s/\sigma)\}$, $t$, and, therefore, the $t$-distribution, is independent of the population variance, a most convenient consequence which contributes greatly to the importance of the distribution. If we hold $s$ constant we have $s\,dt = (n - 1)^{1/2}d\bar{x}$. Now $\bar{x}$ is normally distributed about zero with variance $n^{-1}$ (since $t$ is independent of $\sigma$, we may take $\sigma = 1$) and, as we showed in the last chapter, $\bar{x}$ and $s^2$ are statistically independent, when the parent population is normal. Thus

$$dp(\bar{x}) = (n/2\pi)^{1/2}\exp(-n\bar{x}^2/2)\,d\bar{x}$$

Consequently, the probability differential of $t$ for a constant $s^2$ may be obtained from $dp(\bar{x})$ by using 8.1.1(a) and the relation $s\,dt = (n - 1)^{1/2}d\bar{x}$; we have

$$dp(t, \text{constant } s^2) = [n/2\pi(n - 1)]^{1/2}\,s\exp[-ns^2t^2/2(n - 1)]\,dt \qquad (8.1.2)$$

If now we multiply this by the probability differential of $s^2$ and integrate with respect to $s^2$ from 0 to $\infty$, we obtain the probability differential of $t$ for all $s^2$. By (7.11.4),

$$dp(t) = dt\int_{0}^{\infty}[n/2\pi(n - 1)]^{1/2}\exp[-ns^2t^2/2(n - 1)]\,s \times \frac{(n/2)^{(n-1)/2}}{\Gamma[(n - 1)/2]}\exp(-ns^2/2)(s^2)^{(n-3)/2}\,d(s^2)$$

$$= \frac{(n/2)^{(n-1)/2}[n/2\pi(n - 1)]^{1/2}}{\Gamma[(n - 1)/2]} \times dt\int_{0}^{\infty}(s^2)^{(n-2)/2}\exp[-ns^2\{1 + t^2/(n - 1)\}/2]\,d(s^2)$$

Putting $s^2 = 2u/n[1 + t^2/(n - 1)]$, we have

$$dp(t) = \frac{(n/2)^{(n-1)/2}[n/2\pi(n - 1)]^{1/2}}{\Gamma[(n - 1)/2]} \times (2/n)^{n/2}[1 + t^2/(n - 1)]^{-n/2}\,dt\int_{0}^{\infty}\exp(-u)\,u^{n/2-1}\,du$$

$$= \frac{\Gamma(n/2)}{\sqrt{\pi}\,(n - 1)^{1/2}\,\Gamma[(n - 1)/2]}\,.\,[1 + t^2/(n - 1)]^{-n/2}\,dt \quad \Big(\text{since } \int_{0}^{\infty}u^{n/2-1}\exp(-u)\,du = \Gamma(n/2)\Big)$$

or, since $\sqrt{\pi} = \Gamma(\tfrac{1}{2})$,

$$dp(t) = \frac{1}{\sqrt{n - 1}\;B\big(\frac{n-1}{2}, \frac{1}{2}\big)}\,.\,[1 + t^2/(n - 1)]^{-n/2}\,dt \qquad (8.1.3)$$
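The density in (8.1.3) is easily evaluated numerically. The Python sketch below is an illustration, not part of the original text; the function name `t_density` is an assumption, and the Beta function is computed from the standard relation $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b)$ using `math.lgamma`.

```python
import math

def t_density(t, n):
    """Density of Student's t for a sample of n (n >= 2), as in (8.1.3)."""
    # log B((n-1)/2, 1/2) = log Gamma((n-1)/2) + log Gamma(1/2) - log Gamma(n/2)
    log_beta = math.lgamma((n - 1) / 2) + math.lgamma(0.5) - math.lgamma(n / 2)
    return (1 + t * t / (n - 1)) ** (-n / 2) / (math.sqrt(n - 1) * math.exp(log_beta))

# For n = 2 the density reduces to 1 / (pi (1 + t^2)).
print(t_density(1.0, 2), 1 / (2 * math.pi))

# Crude midpoint-rule check that the density for n = 10 integrates to about 1.
h = 0.01
area = h * sum(t_density(-50 + (i + 0.5) * h, 10) for i in range(10000))
print(area)
```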

We see at once that $t$ is not normally distributed. If, for instance, $n = 2$, we have $dp(t) = (1/\pi)(1 + t^2)^{-1}dt$, which defines what is known as the Cauchy distribution, a distribution departing very considerably from normality, the variance, for example, being infinite. However, using Stirling's approximation (5.2.7), it can be shown, as the reader should verify for himself, that $B[(n - 1)/2, 1/2]\,.\,(n - 1)^{1/2}$ tends to $\sqrt{2\pi}$ as $n \to \infty$; at the same time, since $[1 + t^2/(n - 1)]^{-n/2}$ may be written $\{[1 + t^2/(n - 1)]^{n-1}\}^{-1/2}\,[1 + t^2/(n - 1)]^{-1/2}$ and $(1 + x/m)^m \to \exp x$ as $m \to \infty$,

$$[1 + t^2/(n - 1)]^{-n/2} \to \exp(-t^2/2).$$

Thus the $t$-distribution approaches normality $\big(\frac{1}{\sqrt{2\pi}}\exp(-t^2/2)\big)$ as $n$ increases.
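The two limits used in this argument can be examined numerically. The following Python sketch is an illustration only; the function names are assumptions, and a numerical check for a few values of $n$ is of course no substitute for the verification by Stirling's approximation.

```python
import math

def normalising_product(n):
    """B((n-1)/2, 1/2) * sqrt(n - 1), which should tend to sqrt(2 pi)."""
    log_beta = math.lgamma((n - 1) / 2) + math.lgamma(0.5) - math.lgamma(n / 2)
    return math.exp(log_beta) * math.sqrt(n - 1)

def kernel(t, n):
    """[1 + t^2/(n-1)]^(-n/2), which should tend to exp(-t^2/2)."""
    return (1 + t * t / (n - 1)) ** (-n / 2)

for n in (2, 10, 100, 10 ** 6):
    print(n, normalising_product(n), kernel(1.5, n))
print(math.sqrt(2 * math.pi), math.exp(-1.5 ** 2 / 2))
```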

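It is customary to put $\nu = (n - 1)$, the number of degrees of freedom of the sample, and to write the probability function of $t$ for $\nu$ degrees of freedom thus

$$F_\nu(t) = \frac{1}{\sqrt{\nu}\;B\big(\frac{\nu}{2}, \frac{1}{2}\big)}\,(1 + t^2/\nu)^{-(\nu+1)/2}$$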

In his Statistical Methods for Research Workers, Sir R. A. Fisher gives a table of the values of $|t|$ for given $\nu$ which will be exceeded, in random sampling, with certain probabilities ($P$). Fisher's $P$ is related to $F_\nu(t)$ by $P = 1 - 2\int_{0}^{|t|}F_\nu(t)\,dt$.
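Fisher's $P$ can be approximated directly from this relation by numerical integration. The Python sketch below uses Simpson's rule; it is an illustration and not Fisher's own method of tabulation, and the function names and the number of integration steps are assumptions.

```python
import math

def t_density_nu(t, nu):
    """Probability function of t for nu degrees of freedom."""
    log_beta = math.lgamma(nu / 2) + math.lgamma(0.5) - math.lgamma((nu + 1) / 2)
    return (1 + t * t / nu) ** (-(nu + 1) / 2) / (math.sqrt(nu) * math.exp(log_beta))

def fisher_p(t, nu, steps=2000):
    """P = 1 - 2 * integral from 0 to |t| of the density, by Simpson's rule.

    `steps` must be even.
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_density_nu(0.0, nu) + t_density_nu(upper, nu)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density_nu(i * h, nu)
    return 1 - 2 * (h / 3) * total

# For 10 degrees of freedom, |t| = 2.228 is the tabulated 5 per cent point.
print(fisher_p(2.228, 10))
```

For $\nu = 1$ the distribution is the Cauchy distribution, for which $P = 0.5$ exactly at $|t| = 1$; this gives a convenient check on the routine.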

In document Teach Yourself Statistics (Page 144-153)