
Statistical data mining

5.3 The normal linear model

5.3.1 Main inferential results

Under the previous assumptions, we can derive some important inferential results that build on the theory in Section 4.3.

Result 1

For a point estimate, it can be demonstrated that the least squares fitted parameters in Section 4.3 coincide with the maximum likelihood estimators of β. We will use $\hat{\beta}$ to indicate either of the two estimators.

Result 2

A confidence interval for a slope coefficient of the regression plane is

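$$\beta = \hat{\beta} \pm t_{n-p-1}(1-\alpha/2)\,\mathrm{se}(\hat{\beta})$$

where $t_{n-p-1}(1-\alpha/2)$ is the $100(1-\alpha/2)$ percentile of a Student's t distribution with $n-p-1$ degrees of freedom and $\mathrm{se}(\hat{\beta})$ is an estimate of the standard error of $\hat{\beta}$.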
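As an illustration of Result 2, the following minimal sketch computes the interval for each coefficient of a least squares fit. It assumes NumPy and SciPy are available; the function name `slope_confidence_intervals` and the simulated data are hypothetical and serve only to make the example runnable.

```python
import numpy as np
from scipy import stats

def slope_confidence_intervals(X, y, alpha=0.05):
    """Return (beta_hat, lower, upper) for the intercept and each slope."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])        # prepend the intercept column
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta_hat
    dof = n - p - 1                                   # degrees of freedom of the t distribution
    sigma2 = resid @ resid / dof                      # estimate of the error variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    t_crit = stats.t.ppf(1 - alpha / 2, dof)          # 100(1 - alpha/2) percentile
    return beta_hat, beta_hat - t_crit * se, beta_hat + t_crit * se

# Simulated data (an assumption for illustration): only the first variable matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=50)
beta_hat, lower, upper = slope_confidence_intervals(X, y)
```

Position 0 of each returned array refers to the intercept; positions 1 to p refer to the slope coefficients.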

Result 3

To test the hypothesis that a slope coefficient is 0, a rejection region is given by

$$R = \{\, |T| \ge t_{n-p-1}(1-\alpha/2) \,\}$$

where

$$T = \frac{\hat{\beta}}{\mathrm{se}(\hat{\beta})}$$

If the observed absolute value of the statistic T is contained in the rejection region, the null hypothesis of the slope equal to 0 is rejected, and the slope coefficient is statistically significant. In other words, the considered explanatory variable significantly influences the response variable. Conversely, when the observed absolute value of the statistic T falls outside the rejection region, the explanatory variable is not significant. Alternatively, it is possible to calculate the p-value of the test, that is, the probability, under the null hypothesis, of observing a value of T greater in absolute value than the observed value. If this p-value is small (e.g. lower than α = 0.05), the observed value is very distant from what the null hypothesis would predict, and therefore the null hypothesis is rejected (i.e. the slope coefficient is significant).
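The two equivalent decision rules of Result 3, the rejection region and the p-value, can be sketched as follows. This assumes SciPy is available; the function name `slope_t_test` and the numerical inputs (an estimate of 1.2 with standard error 0.4, n = 30, p = 3) are hypothetical values chosen for illustration.

```python
from scipy import stats

def slope_t_test(beta_hat, se, n, p, alpha=0.05):
    """Two-sided test of H0: slope = 0 in a model with p explanatory variables."""
    dof = n - p - 1
    t_stat = beta_hat / se
    t_crit = stats.t.ppf(1 - alpha / 2, dof)      # boundary of the rejection region
    p_value = 2 * stats.t.sf(abs(t_stat), dof)    # P(|T| > |t observed|) under H0
    reject_h0 = bool(abs(t_stat) >= t_crit)
    return t_stat, p_value, reject_h0

# Hypothetical inputs: T = 1.2 / 0.4 = 3 with 26 degrees of freedom.
t_stat, p_value, reject_h0 = slope_t_test(beta_hat=1.2, se=0.4, n=30, p=3)
```

Both rules lead to the same decision: the null hypothesis is rejected exactly when the p-value is at most α.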

Result 4

To test whether a certain regression plane, with p explanatory variables, constitutes a significant linear model, it can be compared with a trivial model, with only the intercept. The trivial model, set to be the null hypothesis H0, is obtained by simultaneously setting all slope coefficients to 0. The regression plane will be significant when the null hypothesis is rejected. A rejection region is given by the following inequality:

$$F = \frac{R^2/p}{(1-R^2)/(n-p-1)} \ge F_{p,\,n-p-1}(1-\alpha)$$

where $R^2$ is the coefficient of determination seen in Section 4.3 and $F_{p,\,n-p-1}(1-\alpha)$ is the $100(1-\alpha)$ percentile of an F distribution, with p and n−p−1 degrees of freedom. The degrees of freedom of the denominator represent the difference in dimension between the observation space (n) and the fitting plane (p+1); those of the numerator represent the difference in dimension between the fitting plane (p+1) and a fitting point (1) defined by the only intercept. A p-value for the test can be calculated, giving further support to the significance of the model.
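The test in Result 4 needs only the coefficient of determination, n and p. The sketch below assumes SciPy is available; the function name `overall_f_test` and the inputs (R² = 0.5, n = 23, p = 2) are hypothetical and chosen so that the arithmetic is easy to follow.

```python
from scipy import stats

def overall_f_test(r2, n, p, alpha=0.05):
    """F test of the regression plane against the intercept-only model."""
    df1, df2 = p, n - p - 1
    f_stat = (r2 / df1) / ((1 - r2) / df2)
    f_crit = stats.f.ppf(1 - alpha, df1, df2)     # 100(1 - alpha) percentile
    p_value = stats.f.sf(f_stat, df1, df2)        # P(F > f observed) under H0
    return f_stat, f_crit, p_value

# Hypothetical inputs: F = (0.5 / 2) / (0.5 / 20) = 10 with (2, 20) degrees of freedom.
f_stat, f_crit, p_value = overall_f_test(r2=0.5, n=23, p=2)
model_is_significant = f_stat >= f_crit
```

With these inputs the statistic equals 10, well above the critical value of about 3.49, so the trivial model would be rejected.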

Notice how we have introduced a precise threshold for evaluating whether a certain regression model is valid in making predictions, in comparison with the simple arithmetic mean. But this is a relative statement, which gives little indication of how well the linear model fits the data at hand. A statistic like this can be applied to cluster analysis, assuming that the available observations come from a normal distribution. Then the degrees of freedom are c−1 and n−c, where c is the number of clusters. The statistic is called a pseudo-F statistic, because in the general case of a non-normal distribution for the observations, the statistic does not have an F distribution.
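To make the analogy with cluster analysis concrete, here is a minimal sketch of a pseudo-F statistic. The text above does not spell out the formula, so we assume the usual ratio of between-cluster to within-cluster sum of squares, each divided by its degrees of freedom (c−1 and n−c). For brevity the sketch handles univariate observations only; the function name `pseudo_f` and the toy data are hypothetical.

```python
def pseudo_f(points, labels):
    """Pseudo-F for univariate observations assigned to c clusters (assumed form)."""
    n = len(points)
    clusters = {}
    for x, g in zip(points, labels):
        clusters.setdefault(g, []).append(x)       # group observations by cluster label
    c = len(clusters)
    grand_mean = sum(points) / n
    means = {g: sum(v) / len(v) for g, v in clusters.items()}
    between = sum(len(v) * (means[g] - grand_mean) ** 2 for g, v in clusters.items())
    within = sum((x - means[g]) ** 2 for g, v in clusters.items() for x in v)
    return (between / (c - 1)) / (within / (n - c))

# Toy data: two well separated clusters with means 2 and 8.
points = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]
value = pseudo_f(points, labels)   # between = 54, within = 4, so (54 / 1) / (4 / 4) = 54
```

Large values indicate clusters that are well separated relative to their internal spread, but, as noted above, the value can be compared with an F percentile only under normality.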

Result 5

To compare two nested regression planes that differ in a single explanatory variable, say the (p+1)th, present in one model but not in the other, the simpler model can be set as the null hypothesis H0, so the more complex model is chosen if the null hypothesis is rejected, and vice versa. A rejection region can be defined by the following inequality:

$$F = \frac{r^2_{Y,X_{p+1}|X_1,\ldots,X_p}/1}{\bigl(1-r^2_{Y,X_{p+1}|X_1,\ldots,X_p}\bigr)/(n-p-2)} \ge F_{1,\,n-p-2}(1-\alpha)$$

where $r^2_{Y,X_{p+1}|X_1,\ldots,X_p}$ is the squared partial correlation coefficient between $X_{p+1}$ and the response variable Y, conditional on all present explanatory variables, and $F_{1,\,n-p-2}(1-\alpha)$ is the $100(1-\alpha)$ percentile of an F distribution, with 1 and n−p−2 degrees of freedom.

Notice that the degrees of freedom of the denominator represent the difference in dimension between the observation space (n) and the more complex fitting plane (p+2); the degrees of freedom of the numerator represent the difference in dimension between the more complex fitting plane (p+2) and the simpler one (p+1). Alternatively, we can do the comparison by calculating the p-value of the test. This can usually be derived from the output table that contains the decomposition of the variance, also called the analysis of variance (ANOVA) table. By substituting the definition of the partial correlation coefficient $r^2_{Y,X_{p+1}|X_1,\ldots,X_p}$, we can write the test statistic as

$$F = \frac{\mathrm{Var}(\hat{Y}_{p+1}) - \mathrm{Var}(\hat{Y}_p)}{\bigl(\mathrm{Var}(Y) - \mathrm{Var}(\hat{Y}_{p+1})\bigr)/(n-p-2)}$$

Therefore, this F test statistic can be interpreted as the ratio between the additional variance explained by the (p+1)th variable and the mean residual variance. In other words, it expresses the relative importance of the (p+1)th variable. This test is the basis of a process which chooses the best model from a collection of possible linear models that differ in their explanatory variables. The final model is chosen through a series of hypothesis tests, each comparing two alternative models. The simpler of the two models is taken as the null hypothesis and the more complex model as the alternative hypothesis.
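The comparison of two nested models in Result 5 can be sketched with residual sums of squares. Since the variance explained by a model and its residual variance add up to Var(Y), the difference in explained variance equals the drop in residual sum of squares (the common factor 1/n cancels in the ratio). The sketch assumes NumPy and SciPy are available; the names `rss` and `partial_f_test` and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def rss(X, y):
    """Residual sum of squares of a least squares fit with intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def partial_f_test(X_p, x_new, y, alpha=0.05):
    """Test H0 (p variables) against H1 (the same p variables plus x_new)."""
    n, p = X_p.shape
    rss_h0 = rss(X_p, y)
    rss_h1 = rss(np.column_stack([X_p, x_new]), y)
    df2 = n - p - 2                                   # residual d.o.f. of the larger model
    f_stat = (rss_h0 - rss_h1) / (rss_h1 / df2)       # extra explained / mean residual
    p_value = stats.f.sf(f_stat, 1, df2)
    reject_h0 = bool(f_stat >= stats.f.ppf(1 - alpha, 1, df2))
    return f_stat, p_value, reject_h0

# Simulated data (an assumption for illustration): x_new truly affects y.
rng = np.random.default_rng(1)
X_p = rng.normal(size=(60, 2))
x_new = rng.normal(size=60)
y = 1.0 + X_p[:, 0] + 1.5 * x_new + rng.normal(scale=0.5, size=60)
f_stat, p_value, reject_h0 = partial_f_test(X_p, x_new, y)
```

Here the added variable has a strong effect by construction, so the simpler model is rejected in favour of the more complex one.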

As the model space will typically contain many alternative models, we need to choose a search strategy that will lead to a specific series of pairwise comparisons. There are at least three alternative approaches. The forward selection procedure starts with the simplest model, without explanatory variables. It then complicates it by specifying in the alternative hypothesis H1 a model with one explanatory variable. This variable is chosen to give the greatest increase in the explained variability of the response. The F test is used to verify whether or not the added variable leads to a significant improvement with respect to the model in H0. In the negative case the procedure stops and the chosen model is the model in H0 (i.e. the simplest model). In the affirmative case the model in H0 is rejected and replaced with the model in H1. An additional explanatory variable (chosen as before) is then inserted in a new model in H1, and a new comparison is made. The procedure continues until the F test does not reject the model in H0, which thus becomes the final model.
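The forward selection procedure just described can be sketched as a loop of partial F tests. This is a minimal illustration under stated assumptions, not the implementation used by any particular package: it assumes NumPy and SciPy, a fixed significance level for every comparison, and the hypothetical names `rss` and `forward_selection`.

```python
import numpy as np
from scipy import stats

def rss(X, cols, y):
    """Residual sum of squares using the intercept and the columns in cols."""
    design = np.column_stack([np.ones(len(y)), X[:, np.asarray(cols, dtype=int)]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def forward_selection(X, y, alpha=0.05):
    n, k = X.shape
    selected, remaining = [], list(range(k))          # start from the intercept-only model
    while remaining:
        rss_h0 = rss(X, selected, y)
        # Candidate giving the greatest increase in explained variability.
        best = min(remaining, key=lambda j: rss(X, selected + [j], y))
        rss_h1 = rss(X, selected + [best], y)
        df2 = n - len(selected) - 2                   # n - p - 2, with p variables in H0
        f_stat = (rss_h0 - rss_h1) / (rss_h1 / df2)
        if f_stat < stats.f.ppf(1 - alpha, 1, df2):
            break                                     # H0 not rejected: stop, keep H0
        selected.append(best)                         # H1 becomes the new H0
        remaining.remove(best)
    return selected

# Simulated data (an assumption): only variables 0 and 2 affect the response.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=100)
chosen = forward_selection(X, y)
```

The returned list gives the column indices in the order in which they entered the model.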

The backward elimination procedure starts with the most complex model, containing all the explanatory variables. It simplifies it by making the null hypothesis H0 equal to the original model minus one explanatory variable. The eliminated variable is chosen to produce the smallest decrease in the explained variability of the response. The F test is used to verify whether or not the eliminated variable makes a significant contribution to the model in H1. If it does, the null hypothesis is rejected, the chosen model is the model in H1 (i.e. the most complex model) and the procedure stops. If it does not, the complex model in H1 is replaced with the model in H0. An additional variable is then dropped (chosen as before), the resulting model is set as the new H0, with the previous H0 becoming the new H1, and a new comparison is made. The procedure continues until the F test rejects the null hypothesis. Then the chosen model is the model in H1.
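Backward elimination mirrors the forward loop: it starts from the full model and drops variables while the F test does not reject the simpler model. Again this is a minimal sketch under the same assumptions (NumPy, SciPy, one fixed significance level); the names `rss` and `backward_elimination` and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def rss(X, cols, y):
    """Residual sum of squares using the intercept and the columns in cols."""
    design = np.column_stack([np.ones(len(y)), X[:, np.asarray(cols, dtype=int)]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def backward_elimination(X, y, alpha=0.05):
    n, k = X.shape
    selected = list(range(k))                         # start from the full model (H1)
    while selected:
        rss_h1 = rss(X, selected, y)
        # Variable whose removal gives the smallest decrease in explained variability.
        weakest = min(selected,
                      key=lambda j: rss(X, [c for c in selected if c != j], y))
        reduced = [c for c in selected if c != weakest]
        rss_h0 = rss(X, reduced, y)
        df2 = n - len(selected) - 1                   # residual d.o.f. of the model in H1
        f_stat = (rss_h0 - rss_h1) / (rss_h1 / df2)
        if f_stat >= stats.f.ppf(1 - alpha, 1, df2):
            break                                     # H0 rejected: stop, keep H1
        selected = reduced                            # H0 becomes the new H1
    return selected

# Simulated data (an assumption): only variables 0 and 2 affect the response.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=100)
kept = backward_elimination(X, y)
```

On data such as these the two influential variables are retained, since removing either would be strongly rejected by the F test.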

The stepwise procedure is essentially a combination of the previous two. It begins with no variables; variables are then added one by one according to the forward procedure. At each step of the procedure, a backward elimination is carried out to verify whether any of the added variables should be removed. Whichever procedure is adopted, the final model is usually the same, but this cannot be guaranteed. The significance level used in the comparisons is an important parameter, as the procedure is carried out automatically by the software and the software uses the same level for all comparisons. For example, the SAS procedure reg chooses a significance level of α = 0.15 as a default. It is interesting to compare the model selection procedure of a linear model with the computational procedures in Chapter 4. The procedures in Chapter 4 usually require the introduction of heuristic criteria, whereas linear model selection can be fully automated but still remain within a formal procedure.

For large samples, stepwise procedures are often rather unstable in finding the best models. It is not a good idea to rely solely on stepwise procedures for selecting models.