Checking the Model - STATISTICAL MODELS AND MAXIMUM LIKELIHOOD ESTIMATION


(iv) The fraction of the population with BMI over 35.0, given by

$$p = 1 - \Phi\left(\frac{35.0 - \mu}{\sigma}\right)$$

where $\Phi$ is the cumulative distribution function for a G(0, 1) random variable.

Suppose a random sample of 150 males gave observations $y_1, \ldots, y_{150}$ and that the maximum likelihood estimates based on the results derived in Example 2.3.2 were

$$\hat{\mu} = \bar{y} = 27.1 \quad \text{and} \quad \hat{\sigma} = \left[\frac{1}{150}\sum_{i=1}^{150}(y_i - \bar{y})^2\right]^{1/2} = 3.56.$$

The estimates of the attributes in (i)-(iv) would be: (i) and (ii) $\hat{\mu} = 27.1$, (iii) $\hat{Q}(0.1) = \hat{\mu} - 1.28\hat{\sigma} = 27.1 - 1.28(3.56) = 22.54$, and (iv) $\hat{p} = 1 - \Phi\left(\frac{35.0 - \hat{\mu}}{\hat{\sigma}}\right) = 1 - \Phi(2.22) = 1 - 0.98679 = 0.01321$.

Note that (iii) and (iv) follow from the invariance property of maximum likelihood estimates.
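The calculations in (iii) and (iv) can be reproduced numerically. The following is a minimal R sketch, assuming only the reported estimates 27.1 and 3.56; the variable names are illustrative and do not come from the Course Notes. Because qnorm and pnorm work with unrounded values rather than the table values 1.28 and 2.22, the results agree with the text only to two or three decimal places.

```r
# Maximum likelihood estimates reported in the text
mu_hat    <- 27.1
sigma_hat <- 3.56

# (iii) Invariance: plug the estimates into Q(0.1) = mu + z_0.1 * sigma,
# where z_0.1 = qnorm(0.10) is roughly -1.28
Q_hat <- mu_hat + qnorm(0.10) * sigma_hat

# (iv) Invariance: plug the estimates into p = 1 - Phi((35.0 - mu) / sigma)
p_hat <- 1 - pnorm(35.0, mean = mu_hat, sd = sigma_hat)

print(round(c(Q_hat = Q_hat, p_hat = p_hat), 4))
```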

2.6 Checking the Model

The models used in this course are probability distributions for random variables that represent variates in a population or process. A typical model has probability density function $f(y; \theta)$ if the variate $Y$ is continuous, or probability function $f(y; \theta)$ if $Y$ is discrete, where $\theta$ is (possibly) a vector of parameter values. If a family of models is to be used for some purpose then it is important to check that the model adequately represents the variability in $Y$. This can be done by comparing the model with random samples $y_1, \ldots, y_n$ of $y$-values from the population or process.

For data that have arisen from a discrete probability model, a straightforward way to check the fit of the model is to compare observed frequencies with the expected frequencies calculated using the assumed model, as illustrated in the example below.

Example 2.6.1 Rutherford and Geiger study of alpha-particles and the Poisson model

In 1910 the physicists Ernest Rutherford and Hans Geiger conducted an experiment in which they recorded the number of alpha particles emitted from a polonium source (as detected by a Geiger counter) during 2608 time intervals each of length 1/8 minute. The number of particles $j$ detected in the time interval and the frequency $f_j$ of that number of particles are given in Table 2.1.


Table 2.1: Frequency Table for Rutherford/Geiger Data

Number of α-particles    Observed Frequency    Expected Frequency
detected: j              f_j                   e_j
 0                         57                    54.3
 1                        203                   210.3
 2                        383                   407.1
 3                        525                   525.3
 4                        532                   508.4
 5                        408                   393.7
 6                        273                   254.0
 7                        139                   140.5
 8                         45                    68.0
 9                         27                    29.2
10                         10                    11.3
11                          4                     4.0
12                          0                     1.3
13                          1                     0.4
14                          1                     0.1
Total                    2608                  2607.9

We can see whether a Poisson model fits these data by comparing the observed frequencies with the expected frequencies calculated assuming a Poisson model. To calculate these expected frequencies we need to specify the mean $\theta$ of the Poisson model. We estimate $\theta$ using the sample mean for the data which is

$$\hat{\theta} = \frac{1}{2608}\sum_{j=0}^{14} j f_j = \frac{1}{2608}(10097) = 3.8715.$$

The expected number of intervals in which $j$ particles are observed is

$$e_j = (2608)\frac{(3.8715)^j e^{-3.8715}}{j!}, \quad j = 0, 1, \ldots$$

The expected frequencies are also given in Table 2.1.
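The expected frequencies can be recomputed from the observed counts. The following R sketch uses the observed frequencies from Table 2.1; the variable names are illustrative choices, not part of the Course Notes.

```r
# Observed data from Table 2.1
j <- 0:14
f <- c(57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1)

n         <- sum(f)          # number of time intervals
theta_hat <- sum(j * f) / n  # sample mean, used as the estimate of the Poisson mean

# Expected frequency e_j = n * theta^j * exp(-theta) / j!
e <- n * dpois(j, lambda = theta_hat)

# Observed and expected frequencies side by side
print(data.frame(j = j, observed = f, expected = round(e, 1)))
```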

Since the observed and expected frequencies are reasonably close, the Poisson model seems to fit these data well. Of course, we have not specified how close the expected and observed frequencies need to be in order to conclude that the model is reasonable. We will look at a formal method for doing this in Chapter 7.

This comparison of observed and expected frequencies to check the fit of a model can also be used for data that have arisen from a continuous model. The following is an example.


Example 2.6.2 Lifetimes of brake pads and the Exponential model

Suppose we want to check whether an Exponential model is reasonable for modeling the data in Example 1.3.3 on lifetimes of brake pads. To do this we need to estimate the mean $\theta$ of the Exponential distribution. We use the sample mean $\bar{y} = 49.0275$ to estimate $\theta$. Since the lifetime $Y$ is a continuous random variable taking on all real values greater than zero, the intervals for the observed and expected frequencies are not obvious as they were in the discrete case. For the lifetime of brake pads data we choose the same intervals which were used to produce the relative frequency histogram in Example 1.3.3, except we have collapsed the last four intervals into one interval $[120, +\infty)$. The intervals are given in Table 2.2.

Table 2.2: Frequency Table for Brake Pad Data

Interval       Observed Frequency    Expected Frequency
               f_j                   e_j
[0, 15)         21                    52.72
[15, 30)        45                    38.82
[30, 45)        50                    28.59
[45, 60)        27                    21.05
[60, 75)        21                    15.50
[75, 90)         9                    11.42
[90, 105)       12                     8.41
[105, 120)       7                     6.19
[120, +∞)        8                    17.3
Total          200                   200

The expected frequency in the interval $[a_{j-1}, a_j)$ is calculated using

$$e_j = 200\int_{a_{j-1}}^{a_j} \frac{1}{49.0275}\, e^{-y/49.0275}\,dy = 200\left(e^{-a_{j-1}/49.0275} - e^{-a_j/49.0275}\right).$$

The expected frequencies are also given in Table 2.2. We notice that the observed and expected frequencies are not close in this case and therefore the Exponential model does not seem to be a good model for these data.
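The interval calculation above amounts to differencing the Exponential cumulative distribution function at the interval endpoints. A minimal R sketch, assuming the endpoints and observed counts of Table 2.2 (variable names are illustrative); note that R's pexp is parameterized by the rate, which is the reciprocal of the mean used in the Course Notes:

```r
theta_hat <- 49.0275                       # sample mean of the brake pad lifetimes
breaks    <- c(seq(0, 120, by = 15), Inf)  # endpoints a_0, a_1, ..., with last interval [120, Inf)
f         <- c(21, 45, 50, 27, 21, 9, 12, 7, 8)  # observed frequencies from Table 2.2

# e_j = 200 * (F(a_j) - F(a_{j-1})), with F the Exponential c.d.f. with mean theta_hat
e <- sum(f) * diff(pexp(breaks, rate = 1 / theta_hat))

print(data.frame(lower = head(breaks, -1), upper = breaks[-1],
                 observed = f, expected = round(e, 2)))
```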

The difficulty of using this method for continuous data is that the intervals must be selected, and this adds a degree of arbitrariness to the method.


Graphical Checks of Models [16]

We may also use graphical techniques for checking the fit of a model. These methods are particularly useful for continuous data.

The first graphical method is to superimpose the probability density function on the relative frequency histogram of the data as we did in Figures 1.15 and 1.16 for the data from the can filler study.

Empirical Cumulative Distribution Functions

A second graphical procedure is to plot the empirical cumulative distribution function $\hat{F}(y)$ and then to superimpose on this a plot of the model-based cumulative distribution function, $P(Y \le y; \theta) = F(y; \theta)$. We saw an example of such a plot in Chapter 1 but we provide more detail here. The objective is to compare two cumulative distribution functions, one that we hypothesized is the cumulative distribution function for the population, and the other obtained from the sample. If they differ a great deal, this would suggest that the hypothesized distribution is a poor fit.

Example 2.6.3 Checking a Uniform(0, 1) model

Suppose, for example, we have 10 observations which we think might come from the Uniform(0, 1) distribution. The observations are as follows:

0.76 0.43 0.52 0.45 0.01 0.85 0.63 0.39 0.72 0.88

The first step in constructing the empirical cumulative distribution function is to order the observations from smallest to largest [17], obtaining

0.01 0.39 0.43 0.45 0.52 0.63 0.72 0.76 0.85 0.88

If you were then asked, purely on the basis of these data, what you thought the probability is that a random value in the population falls below a given value $y$, you would probably respond with the proportion in the sample that falls below $y$. For example, since the four values 0.01, 0.39, 0.43, 0.45 are less than 0.5, we would estimate the cumulative distribution function at 0.5 using 4/10. Thus, we define the empirical cumulative distribution function for all real numbers $y$ by the proportion of the sample less than or equal to $y$, or:

$$\hat{F}(y) = \frac{\text{number of values in } \{y_1, y_2, \ldots, y_n\} \text{ which are} \le y}{n}.$$

[16] See the video at www.watstat.ca called "The empirical c.d.f. and the qqplot" on the material in this section.

[17] We usually denote the ordered values $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$ where $y_{(1)}$ is the smallest and $y_{(n)}$ is the largest. In this case $y_{(n)} = 0.88$.


More generally, for a sample of size $n$ we first order the $y_i$'s, $i = 1, \ldots, n$, to obtain the ordered values $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$. $\hat{F}(y)$ is a step function with a jump at each of the ordered observed values $y_{(i)}$. If $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$ are all different values, then $\hat{F}(y_{(j)}) = j/n$ and the jumps are all of size $1/n$. In general the size of a jump at a particular point $y$ is the number of values in the sample that are equal to $y$, divided by $n$:

$$\text{Size of jump in } \hat{F}(y) \text{ at } y = \frac{\text{number of the values } \{y_1, y_2, \ldots, y_n\} \text{ equal to } y}{n}.$$

Why is this a step function? In the data above there were no observations at all between the smallest number 0.01 and the second smallest 0.39. So for all $y \in [0.01, 0.39)$, the proportion of the sample which is less than or equal to $y$ is the same, namely 1/10.
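The definition of the empirical cumulative distribution function can be checked on the 10 observations of this example. The sketch below computes it once by hand, directly from the definition, and once with R's built-in ecdf function; the helper name Fhat_manual is an illustrative choice and not part of the Course Notes.

```r
y <- c(0.76, 0.43, 0.52, 0.45, 0.01, 0.85, 0.63, 0.39, 0.72, 0.88)
n <- length(y)

# Directly from the definition: proportion of the sample less than or equal to t
Fhat_manual <- function(t) sum(y <= t) / n

# Built-in version: ecdf() returns the same step function
Fhat <- ecdf(y)

print(Fhat_manual(0.5))     # four of the ten values are <= 0.5
print(Fhat(c(0.01, 0.2)))   # constant on [0.01, 0.39)

# Size of the jump at each distinct observed value: count / n
print(table(y) / n)
```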

Having obtained this estimate of the population cumulative distribution function, it is natural to ask how close it is to a given cumulative distribution function, say the Uniform(0, 1) cumulative distribution function. We can do this with a graph of the empirical cumulative distribution function or more simply on a graph that just shows the vertices $\left(y_{(1)}, \frac{1}{n}\right), \left(y_{(2)}, \frac{2}{n}\right), \ldots, \left(y_{(n)}, \frac{n}{n}\right)$, shown as stars on the graph in Figure 2.5.

Figure 2.5: The empirical cumulative distribution function for n = 10 data values and a superimposed Uniform(0, 1) cumulative distribution function. [Plot not reproduced; both axes run from 0 to 1 and are labelled "theoretical quantiles" and "sample quantiles".]

By superimposing on this graph the theoretical Uniform(0, 1) cumulative distribution function, which in this case is a straight line, we can see how well the theoretical distribution and empirical distribution agree. Since the sample is quite small we cannot expect a perfect straight line, but for larger samples we would expect much better agreement with the straight line.
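The vertices of the empirical cumulative distribution function and the corresponding Uniform(0, 1) values can be tabulated and plotted with a few lines of R. This is only a sketch using the 10 observations of Example 2.6.3; the object names are illustrative, and the plot is drawn only in an interactive session.

```r
y <- c(0.76, 0.43, 0.52, 0.45, 0.01, 0.85, 0.63, 0.39, 0.72, 0.88)
n <- length(y)

# Vertices (y_(i), i/n) of the e.c.d.f. and the Uniform(0,1) c.d.f. at the same points
vertices <- data.frame(
  y_sorted = sort(y),
  ecdf     = (1:n) / n,
  unif_cdf = punif(sort(y))   # for Uniform(0,1), F(y) = y on [0, 1]
)
print(vertices)

if (interactive()) {
  plot(vertices$y_sorted, vertices$ecdf, pch = 8, xlim = c(0, 1), ylim = c(0, 1),
       xlab = "y", ylab = "cumulative probability")
  abline(0, 1)   # Uniform(0,1) c.d.f. is the straight line through (0,0) and (1,1)
}
```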

Because the Uniform(0, 1) cumulative distribution function is a straight line, it is easy to assess graphically how closely the two curves agree, but what if the hypothesized distribution is Normal, whose cumulative distribution function is distinctly non-linear?


As an example we consider data (see Appendix C) for the time between 300 eruptions, between the first and the fifteenth of August 1985, of the geyser Old Faithful in Yellowstone National Park. One might hypothesize that the random distribution of times between consecutive eruptions follows a Normal distribution. We plot the empirical cumulative distribution function in Figure 2.6 together with the cumulative distribution function of a Gaussian distribution.

Figure 2.6: Empirical c.d.f. of times between eruptions of Old Faithful and superimposed G(72.3, 13.9) c.d.f. [Plot not reproduced; horizontal axis: time between eruptions, vertical axis: e.c.d.f.]

Of course we don't know the parameters of the appropriate Gaussian distribution, so we use the sample mean 72.3 and sample standard deviation 13.9 to approximate these parameters. Are the differences between the two curves in Figure 2.6 large enough that we would have to conclude that the data come from a distribution other than the Gaussian? There are two ways of trying to get another view of the magnitude of these differences. The first way is to plot the relative frequency histogram of the data and then superimpose the Gaussian curve. The second way is to use a qqplot, which will be discussed in the next section.

Figure 2.7: Relative frequency histogram for times between eruptions of Old Faithful and superimposed G(72.3, 13.9) p.d.f. [Plot not reproduced; horizontal axis: time between eruptions, vertical axis: relative frequency.]

Figure 2.7 seems to indicate that the distribution of the times between eruptions is not well described by a Normal model because it appears to have two modes. The plot of the empirical cumulative distribution function did not show the shape of the distribution as clearly as the histogram. The empirical cumulative distribution function does allow us to determine the $p$th quantile or $100p$th percentile (the left-most value $y_p$ on the horizontal axis where $\hat{F}(y_p) = p$). For example, from the empirical cumulative distribution function of the Old Faithful data, we see that the median time ($\hat{F}(\hat{m}) = 0.5$) between eruptions is around $\hat{m} = 78$.
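Reading a quantile off the empirical cumulative distribution function can also be done numerically. The sketch below uses the 10 observations of Example 2.6.3 rather than the Old Faithful data, which are not reproduced here. The helper ecdf_quantile is a hypothetical name; it returns the left-most observed value at which the empirical c.d.f. reaches at least p, which is assumed to be the intended reading when no value gives exactly p. R's quantile function with type = 1 uses the same inverse-e.c.d.f. convention.

```r
y <- c(0.76, 0.43, 0.52, 0.45, 0.01, 0.85, 0.63, 0.39, 0.72, 0.88)

# Left-most observed value at which the e.c.d.f. is at least p
ecdf_quantile <- function(y, p) {
  ys    <- sort(y)
  Fvals <- seq_along(ys) / length(ys)   # height of the e.c.d.f. after the i-th ordered value
  ys[which(Fvals >= p - 1e-12)[1]]      # small tolerance guards against rounding in i/n
}

m_hat <- ecdf_quantile(y, 0.5)                     # sample median read from the e.c.d.f.
m_r   <- unname(quantile(y, probs = 0.5, type = 1))  # built-in inverse of the e.c.d.f.
print(c(m_hat = m_hat, m_r = m_r))
```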

Example 2.6.4 Heights of females

For the data on female heights in Chapter 1 and using the results from Example 2.3.2 we obtain $\hat{\mu} = 1.62$, $\hat{\sigma} = 0.064$ as the maximum likelihood estimates of $\mu$ and $\sigma$. Figure 2.8 shows a plot of the empirical cumulative distribution function with the G(1.62, 0.0637) cumulative distribution function superimposed. Figure 2.9 shows a relative frequency histogram for these data with the G(1.62, 0.0637) probability density function superimposed. The two types of plots give complementary but consistent pictures. An advantage of the distribution function comparison is that the exact heights in the sample are used, whereas in the histogram plot the data are grouped into intervals to form the histogram. However, the histogram and probability density function show the distribution of heights more clearly.

Figure 2.8: Empirical c.d.f. of female heights and G(1.62, 0.064) c.d.f. [Plot not reproduced; horizontal axis: height, vertical axis: e.c.d.f.]

[Figure 2.9: plot not reproduced; relative frequency histogram of female heights with the G(1.62, 0.064) p.d.f. superimposed; horizontal axis: height, vertical axis: relative frequency.]

Qqplots

An alternative view, which is really just another method of graphing the empirical cumulative distribution function, tailored to the Normal distribution, is a graph called a qqplot. Suppose the data $Y_i$, $i = 1, \ldots, n$ were in fact drawn from the $G(\mu, \sigma)$ distribution so that the standardized variables, after we order them from smallest $Y_{(1)}$ to largest $Y_{(n)}$, are

$$Z_{(i)} = \frac{Y_{(i)} - \mu}{\sigma}.$$

These behave like the ordered values from a sample of the same size taken from the G(0, 1) distribution. Approximately what value do we expect $Z_{(i)}$ to take? If $\Phi$ denotes the standard Normal cumulative distribution function then for $0 < u < 1$

$$P(\Phi(Z) \le u) = P(Z \le \Phi^{-1}(u)) = \Phi(\Phi^{-1}(u)) = u$$

so that $\Phi(Z)$ has a Uniform(0, 1) distribution. It is easy to check that the expected value of the $i$'th smallest value in a random sample of size $n$ from a Uniform(0, 1) distribution is equal to $\frac{i}{n+1}$ [18], so we expect $\Phi(Z_{(i)})$, the value of $\Phi$ at the $(i/n)$'th sample quantile, to be close to $\frac{i}{n+1}$. In other words we expect $Z_{(i)} = (Y_{(i)} - \mu)/\sigma$ to be approximately $\Phi^{-1}\left(\frac{i}{n+1}\right)$, or $Y_{(i)}$ to be roughly a linear function of $\Phi^{-1}\left(\frac{i}{n+1}\right)$. This is the basic argument underlying the qqplot. If the distribution is actually Normal, then a plot of the points $\left(Y_{(i)}, \Phi^{-1}\left(\frac{i}{n+1}\right)\right)$, $i = 1, \ldots, n$, should be approximately linear (subject to the usual randomness).
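The argument above translates directly into a hand-built qqplot. The following R sketch uses simulated data; the seed, the G(5, 2) distribution, and all object names are arbitrary illustrative choices. It places the theoretical quantiles on the horizontal axis, as in the figures of this section. Since $Y_{(i)}$ is roughly $\mu + \sigma\Phi^{-1}(i/(n+1))$, the intercept and slope of a line through the points are rough estimates of $\mu$ and $\sigma$.

```r
set.seed(231)                        # arbitrary seed, for reproducibility only
y <- rnorm(100, mean = 5, sd = 2)    # simulated sample from a G(5, 2) distribution
n <- length(y)

p         <- (1:n) / (n + 1)   # expected values of the ordered Uniform(0,1) sample
q_theory  <- qnorm(p)          # Phi^{-1}(i / (n + 1)), the theoretical quantiles
y_ordered <- sort(y)           # y_(1) <= ... <= y_(n), the sample quantiles

# Correlation near 1 indicates an approximately linear plot
r <- cor(q_theory, y_ordered)
print(r)

# Least squares line: intercept roughly mu, slope roughly sigma
line_fit <- lm(y_ordered ~ q_theory)
print(coef(line_fit))

if (interactive()) {
  plot(q_theory, y_ordered, xlab = "N(0,1) Quantiles", ylab = "Sample Quantiles")
  abline(line_fit)
}
```

The built-in call qqnorm(y) produces a similar plot, although it uses slightly different plotting positions than i/(n+1).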

Similarly, if the data arise from an Exponential distribution we expect a plot of $\left(Y_{(i)}, F^{-1}\left(\frac{i}{n+1}\right)\right)$ to be approximately linear, where $F^{-1}(u) = -\ln(1 - u)$ is the inverse of the Exponential cumulative distribution function with mean 1; a different mean only changes the slope of the line.
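The same construction works for the Exponential case by replacing the Normal quantiles with $-\ln(1 - u)$. This is a sketch on simulated data; the seed, the mean of 5, and the object names are illustrative assumptions. The theoretical quantiles are placed on the horizontal axis, as in the Normal qqplots of this section.

```r
set.seed(231)                     # arbitrary seed, for reproducibility only
theta <- 5                        # hypothetical mean of the Exponential distribution
y <- rexp(100, rate = 1 / theta)  # R uses the rate, i.e. 1 / mean
n <- length(y)

p         <- (1:n) / (n + 1)
q_theory  <- -log(1 - p)   # inverse c.d.f. of the Exponential distribution with mean 1
y_ordered <- sort(y)

if (interactive()) {
  plot(q_theory, y_ordered,
       xlab = "Exponential(1) Quantiles", ylab = "Sample Quantiles")
  abline(0, mean(y))   # line through the origin with slope equal to the sample mean
}
```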

Since reading qqplots is an art acquired from experience, it is a good idea to generate similar plots where we know the answer. This can be done by generating data from a known distribution and then plotting a qqplot. See the R code below and Chapter 2, Problem 14. A qqplot of 100 observations randomly generated from a G(-2, 3) distribution is given in Figure 2.10. The theoretical quantiles are plotted on the horizontal axis and the empirical quantiles are plotted on the vertical axis. Since the quantiles of the Normal distribution change more rapidly in the tails of the distribution, we expect the points at both ends of the line to lie further from the line.

[18] This is intuitively obvious since $n$ values $Y_{(i)}$ break the interval into $n + 1$ spacings, and it makes sense that each should have the same expected length. For empirical evidence see http://www.math.uah.edu/stat/applets/OrderStatisticExperiment.html. More formally we must first show the p.d.f. of $Y_{(i)}$ is $\frac{n!}{(i-1)!(n-i)!}u^{i-1}(1-u)^{n-i}$ for $0 < u < 1$. Then find the integral $E(Y_{(i)}) = \int_0^1 \frac{n!}{(i-1)!(n-i)!}u^{i}(1-u)^{n-i}\,du = \frac{i}{n+1}$.
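The integral in footnote 18 can be checked numerically. The p.d.f. given there is the Beta(i, n - i + 1) density, so R's dbeta can stand in for it. This is a sketch only; the sample size n = 10 and the helper name expected_order_stat are illustrative choices.

```r
n <- 10   # illustrative sample size

# E(Y_(i)) = integral over (0, 1) of u times the p.d.f. of the i-th ordered Uniform(0,1) value.
# That p.d.f., n!/((i-1)!(n-i)!) u^(i-1) (1-u)^(n-i), is the Beta(i, n - i + 1) density.
expected_order_stat <- function(i, n) {
  integrand <- function(u) u * dbeta(u, shape1 = i, shape2 = n - i + 1)
  integrate(integrand, lower = 0, upper = 1)$value
}

numeric_values <- sapply(1:n, expected_order_stat, n = n)
closed_form    <- (1:n) / (n + 1)

print(cbind(i = 1:n, numeric = numeric_values, closed_form = closed_form))
```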

Figure 2.10: Qqplot of a random sample of 100 observations from a G(-2, 3) distribution. [Plot not reproduced; horizontal axis: N(0,1) quantiles, vertical axis: sample quantiles.]

A qqplot of the female heights is given in Figure 2.11. Overall the points lie reasonably along a straight line. The qqplot has a staircase look because the heights are rounded to the closest centimeter. As was the case for the relative frequency histogram and the empirical cumulative distribution function, the qqplot indicates that the Normal model is reasonable for these data.

[Figure 2.11: plot not reproduced; qqplot of female heights; horizontal axis: N(0,1) quantiles, vertical axis: sample quantiles.]

A qqplot of the times between eruptions of Old Faithful is given in Figure 2.12. The points form an S-shaped curve which indicates as we saw before that the Normal is not a reasonable model for these data.

Figure 2.12: Qqplot of times between eruptions of Old Faithful. [Plot not reproduced; horizontal axis: N(0,1) quantiles, vertical axis: sample quantiles.]

A qqplot of the lifetimes of brake pads (Example 1.3.3) is given in Figure 2.13. The points form a U-shaped curve. This pattern is consistent with the long right tail and positive skewness that we observed before. The Normal is not a reasonable model for these data.

[Figure 2.13: plot not reproduced; qqplot of brake pad lifetimes; horizontal axis: N(0,1) quantiles, vertical axis: sample quantiles.]

R Code for Checking Models Using Histograms, Empirical c.d.f.'s and Qqplots

# Normal Data Example
library(e1071)             # assumption: skewness() and kurtosis() with a 'type' argument come from e1071
y<-rnorm(100,5,2)          # generate 100 observations from a G(5,2) distribution
mn<-mean(y)                # find the sample mean
s<-sd(y)                   # find the sample standard deviation
summary(y)                 # five number summary (plus the mean)
skewness(y,type=1)         # find the sample skewness as given in the Course Notes
kurtosis(y,type=1)+3       # find the sample kurtosis as given in the Course Notes
hist(y,freq=FALSE)         # graph the relative frequency histogram
w<-mn+s*seq(-3,3,0.01)     # calculate points at which to graph the Normal pdf
d<-dnorm(w,mn,s)           # calculate values of Normal pdf at these points
points(w,d,type='l')       # superimpose the Normal pdf on the histogram
A<-ecdf(y)                 # calculate the empirical cdf for the data
e<-pnorm(w,mn,s)           # calculate the values of the Normal cdf
plot(A,verticals=TRUE,do.points=FALSE,xlab='y',ylab='ecdf')   # plot the ecdf
points(w,e,type='l')       # superimpose the Normal cdf
qqnorm(y)                  # graph a qqplot of the data

# Exponential Data Example

y<-rexp(100,1/5)           # generate 100 observations from Exponential(5) dist'n (mean 5, so the rate in R is 1/5)