10.1 Regression Analysis
10.1.1 Linear Regression
10.1.1.5 Linear Regression Diagnostics
Fitting a regression model to a set of data rests on the series of assumptions described above. These assumptions must be tested to establish whether the fitted model is appropriate (also called adequate) and whether the conclusions based upon it are valid. Note that a model can be used for forecasting only if the assumptions are shown to hold. Model adequacy is assessed with a series of tools known as regression diagnostics. This section describes several important procedures for testing the assumptions of linear regression.
10.1.1.5.1 Residual Analysis
Residuals can be analyzed to test whether a proposed model is valid, i.e. whether it fits the data adequately under the given assumptions. To this end, plots of the standardized residuals can be examined visually to assess the magnitude of the residuals and to identify unusual values (Fox, p. 246):
$$E_i = \frac{e_i}{\hat{\sigma}_{\varepsilon}\sqrt{1 - h_{i,i}}}, \quad i = 1, 2, \ldots, n \qquad (10.1.39)$$
The term "$h_{i,i}$" is called a hat value and is the $i$th diagonal element of the "hat" matrix $\mathbf{H}$ (Fox, p. 261):
$$\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \qquad (10.1.40)$$
Using the weight in the $i$th row and $j$th column of $\mathbf{H}$ ($h_{i,j}$), the fitted value $\hat{y}_j$ (also called $y$-hat) can be expressed in terms of the observed values $y_i$ (Fox, p. 244):
$$\hat{y}_j = h_{1,j}y_1 + h_{2,j}y_2 + \cdots + h_{j,j}y_j + \cdots + h_{n,j}y_n = \sum_{i=1}^{n} h_{i,j}y_i \qquad (10.1.41)$$
It can be inferred that the hat values measure the potential influence (the leverage) of $y_i$ on all fitted values. For example, if $h_{i,j}$ is large, the $i$th observation exerts a considerable impact on the $j$th fitted value. The hat values lie in the interval $1/n \le h_{i,i} \le 1$ (Fox, p. 244).
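As a minimal sketch of Eqs. (10.1.40) and (10.1.41), the hat matrix and the hat values can be computed with NumPy. The data are hypothetical and chosen so that the last observation has an extreme x-value; all variable names are illustrative and not taken from the source.

```python
import numpy as np

# Hypothetical data: n = 6 observations, one predictor plus an intercept.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])   # last x-value lies far from the rest
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 11.5])

X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept column
H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix, Eq. (10.1.40)
h = np.diag(H)                                   # hat values h_{i,i}

y_hat = H @ y                                    # fitted values, Eq. (10.1.41)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None) # independent least-squares fit

n = len(y)
print("hat values:", np.round(h, 3))
print("1/n <= h_ii <= 1:", bool(np.all(h >= 1 / n - 1e-12) and np.all(h <= 1 + 1e-12)))
print("H @ y matches X @ beta_hat:", bool(np.allclose(y_hat, X @ beta_hat)))
```

The observation with the extreme x-value receives the largest hat value, which illustrates why hat values are read as leverage.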
Another approach for identifying unusual values is to compute studentized residuals as expressed below (Fox, p. 246):
$$E_i^* = \frac{e_i}{\hat{\sigma}_{\varepsilon(-i)}\sqrt{1 - h_{i,i}}} \qquad (10.1.42)$$
The studentized residuals are calculated by refitting the model after removing the $i$th observation and then obtaining an estimate $\hat{\sigma}_{\varepsilon(-i)}$ of $\sigma_{\varepsilon}$ that is based on the remaining $n - 1$ observations.
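The refitting procedure can be sketched as follows, under stated assumptions: the data are simulated, the error standard deviation is estimated as the square root of the residual sum of squares divided by the residual degrees of freedom (the source does not spell out the estimator), and the helper `fit` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)          # simulated, illustrative data
X = np.column_stack([np.ones(n), x])
p = X.shape[1]                                   # number of coefficients incl. intercept

def fit(X, y):
    """Ordinary least-squares coefficients (hypothetical helper)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

e = y - X @ fit(X, y)                            # residuals of the full fit
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # hat values
sigma_hat = np.sqrt(e @ e / (n - p))             # assumed estimator of the error sd
E = e / (sigma_hat * np.sqrt(1 - h))             # standardized residuals, Eq. (10.1.39)

E_star = np.empty(n)                             # studentized residuals, Eq. (10.1.42)
for i in range(n):
    keep = np.arange(n) != i                     # drop the i-th observation
    r = y[keep] - X[keep] @ fit(X[keep], y[keep])
    sigma_minus_i = np.sqrt(r @ r / (n - 1 - p)) # error sd from the remaining n - 1 points
    E_star[i] = e[i] / (sigma_minus_i * np.sqrt(1 - h[i]))

print(np.round(E, 2))
print(np.round(E_star, 2))
```

For observations that fit well the two quantities are close; the studentized residual grows faster for points that the model fits poorly, because such a point no longer inflates the error estimate.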
If a valid model has been fit to the data, a plot of $E_i$ (or $E_i^*$) against any predictor or against the fitted values shows the following features (Sheather, 2009, p. 155):
• A random scatter of points around the horizontal axis, since the mean function of $e_i$ is zero when a valid model has been fit,
• Constant variability along the horizontal axis. Note that, unlike the errors, the residuals themselves do not have equal variances:
$$\mathrm{Var}(e_i) = \sigma_{\varepsilon}^2 (1 - h_{i,i}), \quad i = 1, 2, \ldots, n \qquad (10.1.43)$$
If any pattern can be recognized in a plot of standardized residuals, it can be concluded that an invalid model has been fit to the data. Subsequently, the residuals should be checked against the model assumptions of normality, homoscedasticity, uncorrelated errors and linearity.
The normality of the errors can be assessed by constructing a normal probability plot (quantile-quantile plot, i.e. Q-Q plot) and/or a histogram of $E_i$ or $E_i^*$ (Fox, p. 268). A normal probability plot is constructed by plotting the ordered $E_i$ or $E_i^*$ on the ordinate against the order statistics⁸⁶ of a standard normal distribution on the abscissa. The plot compares the quantiles of $E_i$ or $E_i^*$ (sample quantiles) with the quantiles of a normal distribution (theoretical quantiles). If they match, the points lie close to a straight line; observed departures from linearity are evidence of non-normality (Sheather, 2009, p. 70).
⁸⁶ An $i$th order statistic is the $i$th lowest value in a statistical sample.
The Shapiro-Wilk test or the Anderson-Darling test ($A^2$-test) can also be used as a supplement to the Q-Q plot. The Shapiro-Wilk test checks whether the residuals are normally distributed. The W-statistic ($W_0$) is calculated as follows (Wetherill, 1986, p. 182): [...] number in ordered form of the residuals. The term "$\bar{e}$" denotes the mean of the residuals. $H_0$ of this test states that the population is normally distributed (whereas $H_1$ states the opposite). Hence, $H_0$ is rejected if the p-value is less than the chosen $\alpha$-level. This test is suitable for $n \le 50$ (Wetherill, p. 182).
The Anderson-Darling test ($A^2$-test) is another test of normality, applicable for $n \ge 10$. Wetherill (p. 183) remarks that the Shapiro-Wilk test is more powerful than the $A^2$-test up to a sample size of 50; the $A^2$-test is recommended for larger samples (Wetherill, p. 183). The $A^2$-test has the form as follows [...] this case a normal distribution is assumed for the residuals (whereas $H_1$ indicates the opposite).
Hence, $H_0$ is rejected if the p-value is less than the chosen $\alpha$-level.
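The coordinates of a normal Q-Q plot can be sketched with the Python standard library alone. The plotting positions $(i - 0.5)/n$ are one common convention and are an assumption here, as are the residual values and the function name `normal_qq_points`.

```python
from statistics import NormalDist

def normal_qq_points(residuals):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    n = len(residuals)
    ordered = sorted(residuals)                  # sample quantiles (ordered residuals)
    std_normal = NormalDist()                    # standard normal: mean 0, sd 1
    # Assumed plotting positions (i - 0.5) / n for i = 1, ..., n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical standardized residuals.
E = [-1.3, 0.4, 0.1, -0.6, 2.1, -0.2, 0.9, -0.9, 0.3, -0.1]
for q_theory, q_sample in normal_qq_points(E):
    print(f"{q_theory:6.2f}  {q_sample:6.2f}")
```

Plotting the second column against the first gives the Q-Q plot; points that stray from a straight line suggest non-normality. A formal test would normally come from a statistics library (for example `scipy.stats.shapiro`), which is left out here to keep the sketch dependency-free.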
Heavy-tailed or highly skewed error distributions generate outliers that cause the least-squares estimates to be inaccurate (Fox, p. 268). In general, these problems can be mitigated by transforming the data. The Box-Cox approach applies a power transformation of $Y$ to normalize the errors, in the form (Bates & Watts, p. 28):
$$Y^{(\lambda)} = \begin{cases} \dfrac{Y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln Y, & \lambda = 0 \end{cases} \qquad (10.1.47)$$
In this procedure, the variances are calculated and plotted against $Y^{(\lambda)}$ for $\lambda = 0, \pm 0.5, \pm 1, \ldots$, and the value of $\lambda$ (the so-called transformation parameter) that stabilizes the variance is selected.
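A small sketch of this selection, under stated assumptions: the data are hypothetical groups whose spread grows in proportion to their level, and "stable variance" is judged by the ratio of the largest to the smallest group variance, which stands in for the visual inspection of a plot. The names `box_cox` and `variance_ratio` are illustrative.

```python
import math
from statistics import variance

def box_cox(y, lam):
    """Box-Cox transform of a single positive value y, Eq. (10.1.47)."""
    if y <= 0:
        raise ValueError("the Box-Cox transform requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

# Hypothetical data: three groups whose spread grows with their level.
factors = [0.80, 0.90, 1.00, 1.10, 1.25]
groups = [[level * f for f in factors] for level in (10.0, 100.0, 1000.0)]

def variance_ratio(lam):
    """Largest-to-smallest group variance after transforming with lam."""
    variances = [variance([box_cox(y, lam) for y in g]) for g in groups]
    return max(variances) / min(variances)

candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
ratios = {lam: variance_ratio(lam) for lam in candidates}
best = min(ratios, key=ratios.get)               # candidate with the most stable variance

for lam in candidates:
    print(f"lambda = {lam:+.1f}  variance ratio = {ratios[lam]:.3g}")
print("most stable variance at lambda =", best)
```

Because the spread in these made-up groups is multiplicative, the logarithm ($\lambda = 0$) equalizes the group variances; real data will rarely give such a clean answer.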
When residuals are plotted against fitted values, a pattern may be detected in which the variance of the residuals increases with the level of the response variable. This pattern can be regarded as evidence of non-constant error variance ("heteroscedasticity") (Fox, p. 277). As a rough rule, Fox suggests that heteroscedasticity makes the least-squares estimates seriously inaccurate only when the ratio of the largest to the smallest variance is about 10 or more (or, more conservatively, about 4 or more). Ignoring non-constant variance leads to invalid hypothesis tests and confidence intervals. Fox states that non-constant error variance can be overcome by transforming $Y$ to stabilize the variance or by substituting weighted-least-squares estimation for ordinary least squares. Weighted least squares is a special case of generalized least squares. In contrast to ordinary least squares, generalized least squares is used when the covariance matrix of the errors is proportional to an arbitrary positive definite matrix rather than to the identity matrix (Fox, p. 274):
$$\mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma_{\varepsilon}^2 \boldsymbol{\Omega} \qquad (10.1.48)$$
The term $\boldsymbol{\Omega}$ is a known $n \times n$ symmetric and positive definite matrix⁸⁷ (Huang, p. 128). When $\boldsymbol{\Omega}$ is a diagonal matrix with unequal diagonal entries, it indicates unequal error variances. When $\boldsymbol{\Omega}$ has nonzero off-diagonal elements, it indicates the presence of correlated errors. Unlike ordinary least squares, the weighted least squares method assigns each term a weight ($w_i$) that reflects the uncertainty of the corresponding observation and thereby its influence on the final parameter estimates (Huang, p. 128), as shown below:
$$\mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2 \, [\ldots]$$
⁸⁷ $\boldsymbol{\Omega}$ is symmetric and positive definite if $\boldsymbol{\Omega}' = \boldsymbol{\Omega}$ and $\mathbf{a}'\boldsymbol{\Omega}\mathbf{a} > 0$ for any non-zero column vector $\mathbf{a} \in \mathbb{R}^n$.
In practice, the weighted least squares criterion, which minimizes the weighted residual sum of squares ($WRSS$), can be expressed as follows (Sheather, p. 115):
$$WRSS = \sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2$$
In this sum, observations with smaller variance are multiplied by greater weights. More information on the estimation of $w_i$ can be found in Fox (pp. 274-275).
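The weighted criterion can be illustrated with a short NumPy sketch. It assumes that the error standard deviations are known and that the weights are taken as the inverse variances; both the data and this choice of weights are assumptions made for illustration, and the estimate is obtained from the weighted normal equations.

```python
import numpy as np

# Hypothetical data with error standard deviations assumed to be known.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.4, 5.7])
sd = np.array([0.1, 0.1, 0.2, 0.2, 0.5, 0.5])
w = 1.0 / sd**2                                  # assumed weights: inverse variances

X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
W = np.diag(w)

# Weighted normal equations: (X' W X) beta = X' W y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None) # unweighted fit for comparison

def wrss(beta):
    """Weighted residual sum of squares for coefficient vector beta."""
    return float(np.sum(w * (y - X @ beta) ** 2))

print("WLS coefficients:", np.round(beta_wls, 3), " WRSS:", round(wrss(beta_wls), 3))
print("OLS coefficients:", np.round(beta_ols, 3), " WRSS:", round(wrss(beta_ols), 3))
```

The weighted fit attains the smaller WRSS by construction, since it leans on the precisely measured observations and discounts the noisy ones.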
The residuals can also be plotted against the time index and examined for any pattern indicating that they are correlated with each other (i.e. autocorrelation). If there is no correlation among the residuals, they vary randomly around the zero line. In addition, the Durbin-Watson statistic ($DW$)
$$DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \qquad (10.1.52)$$
can be computed as a supplement to the residual plot (Fox, p. 442). $DW$ ranges from 0 to 4. $H_0$ of the Durbin-Watson test states that no autocorrelation exists between consecutive residuals. $H_0$ cannot be rejected for values of $DW$ close to 2. If $DW$ is substantially less than 2, there is evidence of positive autocorrelation; if it is substantially more than 2, there is evidence of negative autocorrelation. If autocorrelation is identified (i.e. $\boldsymbol{\Omega}$ has nonzero off-diagonal elements), the generalized least squares method can be applied to the data to account for it. More information on the estimation of $w_i$ can be found in (pp. 428-429).
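Eq. (10.1.52) translates directly into code. The sketch below uses only the standard library; the two residual series are made up to show the two extremes, and the function name is illustrative.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic of a residual series in time order, Eq. (10.1.52)."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series.
drifting = [1.0, 0.8, 0.6, 0.3, -0.1, -0.4, -0.7, -0.9]      # neighbours are alike
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # neighbours flip sign

print(f"drifting:    DW = {durbin_watson(drifting):.3f}")
print(f"alternating: DW = {durbin_watson(alternating):.3f}")
```

The slowly drifting series gives a value well below 2 (positive autocorrelation), while the alternating series gives a value well above 2 (negative autocorrelation).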
The analysis of the residual plots may also indicate non-linearity. In general, non-linearity can be remedied by transforming the variables concerned or by adding higher-order terms of the independent variables (e.g. including a quadratic term).
10.1.1.5.2 Influential Data Analysis
Once a linear model has been estimated by least squares, it should also be examined for unusual data that influence the results of the regression analysis. The data should be analyzed to distinguish among high-leverage observations, regression outliers, and influential observations (Fox, p. 244). Such observations may cause the model to miss important features of the data; however, they may also be consistent with the rest of the data.
Leverage points are data points whose $x$-values have an unusually large impact on the regression model and thus affect the accuracy with which the regression coefficients are estimated. These points are extreme values and are distant from the rest of the data. A point is a bad leverage point (i.e. an outlier) if its $y$-value does not follow the pattern set by the other points (Sheather, p. 52). A high-leverage point that is in line with the rest of the data (indicating low discrepancy) is not an outlier, since this observation has no influence on the regression coefficients. The following formula by Fox (p. 242) distinguishes among the three concepts of influence, leverage, and discrepancy (also called outlyingness):
$$\text{Influence on coefficients} = \text{Leverage} \times \text{Discrepancy} \qquad (10.1.53)$$
It can be inferred that observations with both high leverage and a large studentized residual exert substantial influence on the regression coefficients. This influence can be measured using Cook's D statistic (Fox, p. 250). Cook's D measures the impact on the coefficient vector $\boldsymbol{\beta}$ of removing the $i$th observation, by calculating $\hat{\boldsymbol{\beta}}_{(i)}$ after each removal. Cook's D statistic ($D_i$) can be computed simply from the equation below (Fox, p. 250):
$$D_i = \frac{E_i^2}{k + 1} \times \frac{h_{i,i}}{1 - h_{i,i}}, \quad i = 1, 2, \ldots, n \qquad (10.1.54)$$
The first quotient in the formula is a measure of discrepancy, while the second is a measure of leverage. If the $i$th observation is influential, its removal results in a large value of $D_i$. A rough cutoff for identifying highly influential points is $D_i > 4/(n - k - 1)$ (Fox, p. 266). If any highly influential points are identified, they should be checked for errors in data entry or measurement. Further, it is useful to remove observations temporarily, one at a time, refitting the model at each step and re-examining the resulting changes in Cook's distances. Whether the identified highly influential points are permanently removed as outliers depends on the judgment of the practitioner.
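The following sketch computes Cook's D from Eq. (10.1.54) and applies the rough cutoff. The data are hypothetical, with a final observation that is both far out in x and off the trend. The symbol `k` is assumed to denote the number of predictors excluding the intercept, and the error standard deviation is assumed to be estimated from the residual sum of squares divided by `n - k - 1`.

```python
import numpy as np

# Hypothetical data: the last point has an extreme x-value and breaks the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.1, 8.0, 9.1, 8.0])

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
k = p - 1                                        # assumed: predictors without the intercept

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                                 # residuals
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # hat values
sigma_hat = np.sqrt(e @ e / (n - k - 1))         # assumed estimator of the error sd
E = e / (sigma_hat * np.sqrt(1 - h))             # standardized residuals, Eq. (10.1.39)

D = E**2 / (k + 1) * h / (1 - h)                 # Cook's D, Eq. (10.1.54)
cutoff = 4 / (n - k - 1)                         # rough cutoff quoted from Fox
flagged = np.where(D > cutoff)[0]

print("Cook's D:", np.round(D, 3))
print("cutoff:", cutoff, " flagged indices:", flagged.tolist())
```

Only the last observation exceeds the cutoff here. It combines high leverage with a large residual, which matches the product form of Eq. (10.1.53).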
10.1.1.5.3 Multicollinearity
Multiple regression models are based on the dependence of the dependent variable in $\mathbf{Y}$ on the independent variables in $\mathbf{X}$. However, dependencies can also exist among the independent variables themselves, which indicates multicollinearity. Perfect collinearity causes the least-squares coefficients not to be unique (Fox, p. 331). Strong collinearity increases the sampling variances of the least-squares coefficients and can make them useless as estimators for forecasting (Fox, p. 309). The variance-inflation factor ($VIF$)
$$VIF(\beta_j) = \frac{1}{1 - R_j^2}, \quad j = 1, 2, \ldots, k \qquad (10.1.55)$$
is used to measure the impact of collinearity on the precision of the estimate of $\beta_j$ (Fox, p. 309). The term $R_j^2$ is the squared multiple correlation for the regression of $x_j$ on the other $x$'s. The higher the value of $VIF$, the greater the degree of collinearity. The precision of the estimate is halved as $R_j$ approaches 0.9, which corresponds to a $VIF$ value of about 5 (Fox, p. 308).
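As a sketch of Eq. (10.1.55), the VIF of each predictor can be obtained by regressing it on the remaining predictors. The simulated predictors and the function name `vif` are assumptions for illustration; two predictors are constructed with a correlation of roughly 0.9, the third is independent.

```python
import numpy as np

def vif(X):
    """Variance-inflation factors for the columns of X (predictors only, no intercept)."""
    n, k = X.shape
    factors = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)  # R_j^2
        factors[j] = 1.0 / (1.0 - r2)            # Eq. (10.1.55)
    return factors

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)  # correlated with x1
x3 = rng.normal(size=n)                                    # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])

print("VIF:", np.round(vif(X), 2))
print("VIF at R_j = 0.9:", round(1 / (1 - 0.9**2), 2))     # arithmetic behind 'about 5'
```

The last line reproduces the arithmetic behind the quoted rule of thumb: $1/(1 - 0.81) \approx 5.26$, whose square root is about 2.3, so the standard error roughly doubles.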