Statistical data mining

5.3 The normal linear model

5.3.1 Main inferential results

Under the previous assumptions, we can derive some important inferential results that build on the theory in Section 4.3.

Result 1

For a point estimate, it can be demonstrated that the least squares fitted parameters in Section 4.3 coincide with the maximum likelihood estimators of $\beta$. We will use $\hat{\beta}$ to indicate either of the two estimators.

Result 2

A confidence interval for a slope coefficient of the regression plane is

$$\beta = \hat{\beta} \pm t_{n-p-1}(1-\alpha/2)\,\mathrm{se}(\hat{\beta})$$

where $t_{n-p-1}(1-\alpha/2)$ is the $100(1-\alpha/2)$ percentile of a Student's t distribution with $n-p-1$ degrees of freedom and $\mathrm{se}(\hat{\beta})$ is an estimate of the standard error of $\hat{\beta}$.
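As an illustration of Result 2, the following is a minimal sketch that fits a linear model by least squares and computes the confidence intervals. It assumes NumPy and SciPy are available; the function names `ols_fit` and `confidence_intervals` and the simulated data are hypothetical and not part of the original text.

```python
import numpy as np
from scipy import stats

def ols_fit(X, y):
    """Least squares fit with an intercept; returns estimates, standard errors, df."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])           # design matrix with intercept column
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta_hat
    df = n - p - 1                                 # residual degrees of freedom
    sigma2 = resid @ resid / df                    # estimate of the error variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return beta_hat, se, df

def confidence_intervals(beta_hat, se, df, alpha=0.05):
    """Rows are [lower, upper] limits: beta_hat -/+ t_{df}(1 - alpha/2) * se."""
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])

# Simulated data (an assumption for illustration): the second slope is truly 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=50)

beta_hat, se, df = ols_fit(X, y)
ci = confidence_intervals(beta_hat, se, df)
```

Each row of `ci` corresponds to one coefficient, the intercept first and then the slopes in column order.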

Result 3

To test the hypothesis that a slope coefficient is 0, a rejection region is given by

$$R = \{\, |T| \ge t_{n-p-1}(1-\alpha/2) \,\}, \qquad \text{where } T = \frac{\hat{\beta}}{\mathrm{se}(\hat{\beta})}$$

If the observed value of the statistic T falls in the rejection region, the null hypothesis that the slope equals 0 is rejected, and the slope coefficient is statistically significant. In other words, the explanatory variable under consideration significantly influences the response variable. Conversely, when the observed value of T falls outside the rejection region, the explanatory variable is not significant. Alternatively, it is possible to calculate the p-value of the test, that is, the probability under the null hypothesis of observing a value of T greater in absolute value than the observed value. If this p-value is small (e.g. lower than α = 0.05), the observed value is very distant from what the null hypothesis predicts, therefore the null hypothesis is rejected (i.e. the slope coefficient is significant).
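The test in Result 3 can be sketched as follows, computing both the rejection decision and the two-sided p-value. This is only an illustrative sketch assuming NumPy and SciPy; the function name `slope_t_tests` and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def slope_t_tests(X, y, alpha=0.05):
    """T statistics, two-sided p-values and rejection decisions for each slope."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])           # intercept plus explanatory variables
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta_hat
    df = n - p - 1
    se = np.sqrt(np.diag(resid @ resid / df * np.linalg.inv(A.T @ A)))
    T = beta_hat / se
    p_values = 2 * stats.t.sf(np.abs(T), df)       # P(|T| > |t_obs|) under H0
    reject = np.abs(T) >= stats.t.ppf(1 - alpha / 2, df)
    return T[1:], p_values[1:], reject[1:]         # drop the intercept entry

# Simulated data (an assumption for illustration): the second slope is truly 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=50)

T, p_values, reject = slope_t_tests(X, y)
```

The two criteria agree by construction: a slope falls in the rejection region exactly when its p-value does not exceed α.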

Result 4

To test whether a certain regression plane, with p explanatory variables, constitutes a significant linear model, it can be compared with a trivial model, with only the intercept. The trivial model, set to be the null hypothesis H0, is obtained by simultaneously setting all slope coefficients to 0. The regression plane will be significant when the null hypothesis is rejected. A rejection region is given by the following inequality:

$$F = \frac{R^2/p}{(1-R^2)/(n-p-1)} \ge F_{p,\,n-p-1}(1-\alpha)$$

where $R^2$ is the coefficient of determination seen in Section 4.3 and $F_{p,\,n-p-1}(1-\alpha)$ is the $100(1-\alpha)$ percentile of an F distribution, with p and n−p−1 degrees of freedom. The degrees of freedom of the denominator represent the difference in dimension between the observation space (n) and the fitting plane (p+1); those of the numerator represent the difference in dimension between the fitting plane (p+1) and a fitting point (1) defined by the intercept alone. A p-value for the test can be calculated, giving further support to the significance of the model.
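A minimal sketch of the overall F test in Result 4, built directly from the coefficient of determination, might look like this. It assumes NumPy and SciPy; the function name `overall_f_test` and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def overall_f_test(X, y, alpha=0.05):
    """Compare the regression plane with the intercept-only model."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ beta_hat
    # Coefficient of determination: share of the variability of y that is explained
    r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    F = (r2 / p) / ((1 - r2) / (n - p - 1))
    p_value = stats.f.sf(F, p, n - p - 1)
    reject = F >= stats.f.ppf(1 - alpha, p, n - p - 1)
    return float(r2), float(F), float(p_value), bool(reject)

# Simulated data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=50)

r2, F, p_value, reject = overall_f_test(X, y)
```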

Notice how we have introduced a precise threshold for evaluating whether a certain regression model is valid in making predictions, in comparison with the simple arithmetic mean. But this is a relative statement, which gives little indication of how well the linear model fits the data at hand. A statistic like this can be applied to cluster analysis, assuming that the available observations come from a normal distribution. Then the degrees of freedom are c−1 and n−c, where c is the number of clusters. The statistic is called a pseudo-F statistic because, in the general case of a non-normal distribution for the observations, it does not have an F distribution.
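For cluster analysis, the pseudo-F statistic can be sketched as the ratio of between-cluster to within-cluster variability, each divided by its degrees of freedom (c−1 and n−c). The sums-of-squares form used here is an assumption about how the statistic is computed in practice; it is algebraically the same as replacing p and n−p−1 in the F statistic above by c−1 and n−c, with R² taken as the between-cluster share of the total variability. The function name `pseudo_f` and the simulated data are hypothetical, and NumPy is assumed.

```python
import numpy as np

def pseudo_f(X, labels):
    """Pseudo-F: (between SS / (c - 1)) / (within SS / (n - c))."""
    n = X.shape[0]
    clusters = np.unique(labels)
    c = len(clusters)
    grand_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for g in clusters:
        members = X[labels == g]
        centre = members.mean(axis=0)
        between += len(members) * np.sum((centre - grand_mean) ** 2)
        within += np.sum((members - centre) ** 2)
    return float((between / (c - 1)) / (within / (n - c)))

# Two well-separated simulated groups (an assumption for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(30, 2)), rng.normal(10, 1, size=(30, 2))])
true_labels = np.repeat([0, 1], 30)
shuffled_labels = rng.permutation(true_labels)

f_true = pseudo_f(X, true_labels)          # labels that match the real grouping
f_shuffled = pseudo_f(X, shuffled_labels)  # labels carrying no grouping information
```

A partition that reflects real structure in the data should give a much larger value than an arbitrary one.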

Result 5

To compare two nested regression planes that differ in a single explanatory variable, say the (p+1)th, present in one model but not in the other, the simpler model can be set as the null hypothesis H0, so the more complex model is chosen if the null hypothesis is rejected, and vice versa. A rejection region can be defined by the following inequality:

$$F = \frac{r^2_{Y,X_{p+1}|X_1,\ldots,X_p}/1}{\left(1-r^2_{Y,X_{p+1}|X_1,\ldots,X_p}\right)/(n-p-2)} \ge F_{1,\,n-p-2}(1-\alpha)$$

where $r_{Y,X_{p+1}|X_1,\ldots,X_p}$ is the partial correlation coefficient between $X_{p+1}$ and the response variable Y, conditional on all present explanatory variables, and $F_{1,\,n-p-2}(1-\alpha)$ is the $100(1-\alpha)$ percentile of an F distribution, with 1 and n−p−2 degrees of freedom.

Notice that the degrees of freedom of the denominator represent the difference in dimension between the observation space (n) and the more complex fitting plane (p+2); the degrees of freedom of the numerator represent the difference in dimension between the more complex fitting plane (p+2) and the simpler one (p+1). Alternatively, we can do the comparison by calculating the p-value of the test. This can usually be derived from the output table that contains the decomposition of the variance, also called the analysis of variance (ANOVA) table. By substituting the definition of the partial correlation coefficient $r_{Y,X_{p+1}|X_1,\ldots,X_p}$, we can write the test statistic as

$$F = \frac{\mathrm{Var}(\hat{Y}_{p+1}) - \mathrm{Var}(\hat{Y}_{p})}{\left(\mathrm{Var}(Y) - \mathrm{Var}(\hat{Y}_{p+1})\right)/(n-p-2)}$$

Therefore, this F test statistic can be interpreted as the ratio between the additional variance explained by the (p+1)th variable and the mean residual variance. In other words, it expresses the relative importance of the (p+1)th variable. This test is the basis of a process which chooses the best model from a collection of possible linear models that differ in their explanatory variables. The final model is chosen through a series of hypothesis tests, each comparing two alternative models. The simpler of the two models is taken as the null hypothesis and the more complex model as the alternative hypothesis.
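The comparison of two nested models in Result 5 can be sketched as below. The squared partial correlation is computed here from the two coefficients of determination, as (R²_{p+1} − R²_p)/(1 − R²_p), which is the substitution that leads to the variance form of the statistic. This is a sketch assuming NumPy and SciPy; the function names `r_squared` and `partial_f_test` and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def r_squared(X, y):
    """Coefficient of determination of the least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta_hat
    return float(1 - resid @ resid / np.sum((y - y.mean()) ** 2))

def partial_f_test(X_small, x_new, y):
    """F test of H0: model with X_small, against H1: model with X_small and x_new."""
    n, p = X_small.shape
    r2_small = r_squared(X_small, y)
    r2_big = r_squared(np.column_stack([X_small, x_new]), y)
    # Squared partial correlation between x_new and y given X_small
    partial_r2 = (r2_big - r2_small) / (1 - r2_small)
    df = n - p - 2
    F = partial_r2 / ((1 - partial_r2) / df)
    p_value = stats.f.sf(F, 1, df)
    return float(F), float(p_value), float(partial_r2)

# Simulated data (an assumption for illustration): all three slopes are non-zero.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(size=60)

F, p_value, partial_r2 = partial_f_test(X[:, :2], X[:, 2], y)
```

Because the two models differ in a single variable, this F statistic equals the square of the T statistic of that variable in the larger model.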

As the model space will typically contain many alternative models, we need to choose a search strategy that will lead to a specific series of pairwise comparisons. There are at least three alternative approaches.

The forward selection procedure starts with the simplest model, without explanatory variables. It then complicates it by specifying in the alternative hypothesis H1 a model with one explanatory variable. This variable is chosen to give the greatest increase in the explained variability of the response. The F test is used to verify whether or not the added variable leads to a significant improvement with respect to the model in H0. In the negative case the procedure stops and the chosen model is the model in H0 (i.e. the simplest model). In the affirmative case the model in H0 is rejected and replaced with the model in H1. An additional explanatory variable (chosen as before) is then inserted in a new model in H1, and a new comparison is made. The procedure continues until the F test does not reject the model in H0, which thus becomes the final model.
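The forward selection procedure just described can be sketched as follows. The F statistic is written with residual sums of squares, which is equivalent to the variance form of Result 5. This is a minimal sketch, not a reference implementation; it assumes NumPy and SciPy, and the function names `rss` and `forward_selection` and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def rss(X, y, cols):
    """Residual sum of squares of the fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta_hat
    return float(resid @ resid)

def forward_selection(X, y, alpha=0.05):
    n, k = X.shape
    selected = []                                   # H0 starts as the intercept-only model
    while len(selected) < k:
        remaining = [j for j in range(k) if j not in selected]
        rss0 = rss(X, y, selected)
        # Greatest increase in explained variability = smallest residual sum of squares
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        rss1 = rss(X, y, selected + [best])
        df = n - len(selected) - 2                  # n - p - 2, with p variables in H0
        F = (rss0 - rss1) / (rss1 / df)
        if F < stats.f.ppf(1 - alpha, 1, df):
            break                                   # H0 not rejected: stop with the model in H0
        selected.append(best)                       # H0 rejected: H1 becomes the new H0
    return selected

# Simulated data (an assumption for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)

selected = forward_selection(X, y)
```

The returned list holds the column indices in the order in which they entered the model.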

The backward elimination procedure starts with the most complex model, containing all the explanatory variables. It simplifies it by setting the null hypothesis H0 equal to the original model minus one explanatory variable. The eliminated variable is chosen to produce the smallest decrease in the explained variability of the response. The F test is used to verify whether or not the elimination of this variable leads to a significant loss of fit with respect to the model in H1. If it does, that is, if the F test rejects H0, the chosen model is the model in H1 (i.e. the most complex model) and the procedure stops. Otherwise the complex model in H1 is replaced with the model in H0. An additional variable is dropped (chosen as before) and the resulting model is set as H0, then a new comparison is made. The procedure continues until the F test rejects the null hypothesis. Then the chosen model is the model in H1.
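Backward elimination can be sketched in the same spirit: at each step the variable whose removal costs the least explained variability is tested, and the procedure stops as soon as the F test rejects the reduced model. This is an illustrative sketch assuming NumPy and SciPy; the function names `rss` and `backward_elimination` and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def rss(X, y, cols):
    """Residual sum of squares of the fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta_hat
    return float(resid @ resid)

def backward_elimination(X, y, alpha=0.05):
    n, k = X.shape
    selected = list(range(k))                       # H1 starts as the full model
    while selected:
        rss1 = rss(X, y, selected)
        # Smallest decrease in explained variability = smallest RSS after removal
        drop = min(selected, key=lambda j: rss(X, y, [c for c in selected if c != j]))
        reduced = [c for c in selected if c != drop]
        rss0 = rss(X, y, reduced)
        df = n - len(selected) - 1                  # n - p - 2, with p variables in H0
        F = (rss0 - rss1) / (rss1 / df)
        if F >= stats.f.ppf(1 - alpha, 1, df):
            break                                   # H0 rejected: keep the model in H1
        selected = reduced                          # H0 not rejected: adopt the simpler model
    return selected

# Simulated data (an assumption for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)

selected = backward_elimination(X, y)
```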

The stepwise procedure is essentially a combination of the previous two. It begins with no variables; variables are then added one by one according to the forward procedure. At each step of the procedure, a backward elimination is carried out to verify whether any of the added variables should be removed. Whichever procedure is adopted, the final model should be the same. This is true most of the time, but it cannot be guaranteed. The significance level used in the comparisons is an important parameter, because the procedure is carried out automatically by the software, which uses the same level for all comparisons. For example, the SAS procedure reg uses a significance level of α = 0.15 as the default for its stepwise method. It is interesting to compare the model selection procedure of a linear model with the computational procedures in Chapter 4. The procedures in Chapter 4 usually require the introduction of heuristic criteria, whereas linear model selection can be fully automated but still remain within a formal procedure.
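A stepwise search combining the two previous procedures could be sketched as follows, using p-values of the single-variable F test and one significance level for all comparisons. The default of 0.15 simply mirrors the level quoted in the text; the iteration cap `max_steps`, the function names `rss`, `p_value_drop` and `stepwise`, and the simulated data are assumptions made for this illustration and do not describe how any particular software implements the procedure. NumPy and SciPy are assumed.

```python
import numpy as np
from scipy import stats

def rss(X, y, cols):
    """Residual sum of squares of the fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta_hat
    return float(resid @ resid)

def p_value_drop(X, y, cols, j):
    """p-value of the F test of H0: model `cols` without j, against H1: model `cols`."""
    reduced = [c for c in cols if c != j]
    rss0, rss1 = rss(X, y, reduced), rss(X, y, cols)
    df = len(y) - len(cols) - 1
    F = (rss0 - rss1) / (rss1 / df)
    return float(stats.f.sf(F, 1, df))

def stepwise(X, y, alpha=0.15, max_steps=100):
    k = X.shape[1]
    selected = []
    for _ in range(max_steps):                      # cap guards against endless cycling
        remaining = [j for j in range(k) if j not in selected]
        if not remaining:
            break
        # Forward step: the candidate with the smallest p-value when added
        best = min(remaining, key=lambda j: p_value_drop(X, y, selected + [j], j))
        if p_value_drop(X, y, selected + [best], best) > alpha:
            break                                   # no candidate is significant: stop
        selected.append(best)
        # Backward check: remove variables that are no longer significant
        while len(selected) > 1:
            worst = max(selected, key=lambda j: p_value_drop(X, y, selected, j))
            if p_value_drop(X, y, selected, worst) <= alpha:
                break
            selected.remove(worst)
    return selected

# Simulated data (an assumption for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)

selected = stepwise(X, y)
```

With a level as permissive as 0.15, variables with no real effect enter the model more easily than at 0.05, which is one reason the choice of level matters.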

For large samples, stepwise procedures are often rather unstable in finding the best models. It is not a good idea to rely solely on stepwise procedures for selecting models.

Source: Giudici, P. (2003), Applied Data Mining: Statistical Methods for Business and Industry, pp. 161-164.