Linear Regression Diagnostics - Linear Regression

10.1 Regression Analysis

10.1.1 Linear Regression

10.1.1.5 Linear Regression Diagnostics

Fitting a regression model to a set of data rests on the series of assumptions stated earlier. These assumptions have to be tested to determine whether the fitted model is appropriate (also called adequate) and whether the conclusions based upon it are valid. Note that a model can be used for forecasting only if its assumptions hold. Model adequacy is assessed with a series of tools known as regression diagnostics. This section describes several important testing procedures for the linear regression assumptions.

10.1.1.5.1 Residual Analysis

Residuals can be analyzed to test whether a proposed model fits the data adequately under the given assumptions. Plots of the standardized residuals can be examined visually to assess the magnitude of the residuals and to identify unusual values (Fox, p. 246):

$$E_i = \frac{e_i}{\hat{\sigma}_{\boldsymbol{e}}\sqrt{1 - h_{i,i}}}, \qquad i = 1, 2, \ldots, n \tag{10.1.39}$$

The term “ℎ_𝑖,𝑖” is called a hat value and is the 𝑖^𝑡ℎ diagonal element of the “hat” matrix 𝑯 (Fox, p. 261):

𝑯 = 𝑿(𝑿^′𝑿)⁻¹𝑿^′ (10.1.40)

Using the weight in the 𝑖^𝑡ℎ row and 𝑗^𝑡ℎ column (ℎ_𝑖,𝑗) of 𝑯, the fitted value 𝑦̂_𝑗 (also called 𝑦-hat) can be expressed in terms of the observed values 𝑦_𝑖 (Fox, p. 244):

$$\hat{y}_j = h_{1,j}y_1 + h_{2,j}y_2 + \cdots + h_{j,j}y_j + \cdots + h_{n,j}y_n = \sum_{i=1}^{n} h_{i,j}y_i \tag{10.1.41}$$

It can be inferred that the hat value ℎ_𝑖,𝑖 measures the potential influence (the leverage) of 𝑦_𝑖 on all fitted values. For example, if ℎ_𝑖,𝑗 is large, the 𝑖^𝑡ℎ observation exerts a considerable impact on the 𝑗^𝑡ℎ fitted value. The hat values lie in the interval 1/𝑛 ≤ ℎ_𝑖,𝑖 ≤ 1 (Fox, p. 244).
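As a minimal sketch of Eqs. (10.1.40) and (10.1.41), the hat matrix can be written out for the special case of an intercept and a single predictor, where its elements reduce to 1/n + (x_i − x̄)(x_j − x̄)/S_xx. The data values and the function names below are hypothetical and serve only as an illustration; they are not taken from the cited sources.

```python
def hat_matrix_simple(x):
    """Hat matrix H for the model y = b0 + b1*x (intercept plus one predictor)."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return [[1.0 / n + (xi - x_bar) * (xj - x_bar) / s_xx for xj in x] for xi in x]


def fitted_values(H, y):
    """Fitted value y_hat_j as the weighted sum of all observed y_i (Eq. 10.1.41)."""
    n = len(y)
    return [sum(H[i][j] * y[i] for i in range(n)) for j in range(n)]


# Hypothetical data: the last x-value lies far from the others.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 10.5]

H = hat_matrix_simple(x)
hat_values = [H[i][i] for i in range(len(x))]  # diagonal elements h_ii
y_hat = fitted_values(H, y)

for xi, h in zip(x, hat_values):
    print(f"x = {xi:5.1f}   h_ii = {h:.2f}")
```

In this sketch the observation with the most extreme x-value receives the largest hat value, and all hat values stay within the interval from 1/n to 1.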

Another approach for identifying unusual values is to compute studentized residuals as expressed below (Fox, p. 246):

$$E_i^{*} = \frac{e_i}{\hat{\sigma}_{\boldsymbol{e}(-i)}\sqrt{1 - h_{i,i}}} \tag{10.1.42}$$

A studentized residual is calculated by refitting the model after removing the 𝑖^𝑡ℎ observation, and then obtaining an estimate 𝜎̂_{𝒆(−𝑖)} of 𝜎_𝝐 that is based on the remaining 𝑛 − 1 observations.
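The two residual types can be compared in a short sketch. It assumes a model with an intercept and one predictor (k = 1), so that the hat values have a closed form, and it obtains the leave-one-out estimate by actually refitting the model without the 𝑖^𝑡ℎ observation. The data and all function names are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    return y_bar - b1 * x_bar, b1


def residual_std_error(x, y):
    """Estimate of the error standard deviation, sqrt(RSS / (n - k - 1)) with k = 1."""
    b0, b1 = fit_line(x, y)
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(rss / (len(x) - 2))


def standardized_and_studentized(x, y):
    """Return the lists of standardized (E_i) and studentized (E_i*) residuals."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    sigma_hat = residual_std_error(x, y)
    standardized, studentized = [], []
    for i in range(n):
        e_i = y[i] - b0 - b1 * x[i]
        h_ii = 1.0 / n + (x[i] - x_bar) ** 2 / s_xx
        # Refit without observation i to estimate sigma from the other n - 1 points.
        sigma_minus_i = residual_std_error(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        standardized.append(e_i / (sigma_hat * math.sqrt(1.0 - h_ii)))
        studentized.append(e_i / (sigma_minus_i * math.sqrt(1.0 - h_ii)))
    return standardized, studentized


# Hypothetical data with one unusual response value at x = 7.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 20.0, 16.1]

E, E_star = standardized_and_studentized(x, y)
most_unusual = max(range(len(x)), key=lambda i: abs(E_star[i]))
print("Observation with the largest |E_i*|:", most_unusual)
```

Because the unusual observation is excluded from its own variance estimate, its studentized residual stands out more clearly than its standardized residual.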

If a valid model has been fit to the data, a plot of 𝐸_i (or 𝐸_𝑖^∗) against any predictor or against the fitted values shows the following features (Sheather, 2009, p. 155):

• A random scatter of points around the horizontal axis, since the mean function of 𝑒_i is zero when a valid model has been fit,

• Constant variability along the horizontal axis. Note that, unlike the errors, the residuals do not have equal variances:

$$Var(e_i) = \sigma_{\boldsymbol{\epsilon}}^{2}(1 - h_{i,i}), \qquad i = 1, 2, \ldots, n \tag{10.1.43}$$

If any pattern can be recognized in a plot of standardized residuals, it can be concluded that an invalid model has been fit to the data. Subsequently, the residuals should be checked against the model assumptions of normality, homoscedasticity, uncorrelated errors and linearity of the relationship.

The normality of the errors can be assessed by constructing a probability plot (quantile-quantile plot, i.e. Q-Q plot) and/or a histogram of 𝐸_i or 𝐸_𝑖^∗ (Fox, p. 268). A normal probability plot is constructed by plotting the ordered 𝐸_i or 𝐸_𝑖^∗ on the ordinate against the order statistics from a standard normal distribution on the abscissa. In this plot, the quantiles of 𝐸_i or 𝐸_𝑖^∗ (sample quantiles) are compared with the quantiles of a normal distribution (theoretical quantiles) to examine whether they match. If they match, the points lie close to a straight line. Observed departures from linearity are evidence of non-normality (Sheather, 2009, p. 70).

The Shapiro-Wilk test or the Anderson-Darling test (𝐴²-test) can also be used as a supplement to the Q-Q plot. The Shapiro-Wilk test checks whether the residuals are normally distributed. Its W-statistic (𝑊₀) is calculated from the residuals in ordered form, i.e. from their order statistics⁸⁶, and from the mean of the residuals 𝜇_𝒆 (Wetherill, 1986, p. 182). 𝐻₀ of this test states that the population is normally distributed (whereas 𝐻₁ indicates the opposite). Hence, 𝐻₀ is rejected if the p-value is less than the chosen 𝛼-level. This test is suitable for 𝑛 ≤ 50 (Wetherill, p. 182).

The Anderson-Darling test (𝐴²-test) is another test for checking normality, applicable for 𝑛 ≥ 10. Wetherill (p. 183) remarks that the Shapiro-Wilk test is more powerful than the 𝐴²-test up to a sample size of 50; the 𝐴²-test is recommended for larger samples (Wetherill, p. 183). 𝐻₀ of the 𝐴²-test is, in this case, that the residuals follow a normal distribution (whereas 𝐻₁ indicates the opposite). Hence, 𝐻₀ is rejected if the p-value is less than the chosen 𝛼-level.

86 An 𝑖^𝑡ℎ order statistic is an indexed statistic representing the 𝑖^𝑡ℎ lowest value in a statistical sample.
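The construction of the normal probability plot can be sketched with the Python standard library. The plotting positions (i − 0.5)/n used for the theoretical quantiles are an assumption of this sketch (several conventions exist), and the residual values are hypothetical.

```python
from statistics import NormalDist


def normal_qq_points(residuals):
    """Pair each ordered residual with a standard normal quantile.

    Returns (theoretical quantile, sample quantile) tuples, i.e. the
    abscissa and ordinate of each point in a normal Q-Q plot.
    """
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    # Assumed plotting positions (i - 0.5) / n for i = 1, ..., n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical standardized residuals.
residuals = [0.31, -1.20, 0.05, 1.45, -0.48, 0.77, -0.66]

points = normal_qq_points(residuals)
for theo, sample in points:
    print(f"theoretical {theo:6.3f}   sample {sample:6.3f}")
```

If the residuals are approximately normal, the printed pairs fall close to a straight line when plotted.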

Heavy-tailed or highly skewed error distributions generate outliers that cause the least squares estimates to be inaccurate (Fox, p. 268). In general, these problems can be avoided by transforming the data. The Box-Cox transformation approach selects a power transformation of 𝒚 that normalizes the errors, in the form (Bates & Watts, p. 28):

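$$\boldsymbol{y}^{(\lambda)} = \begin{cases} \dfrac{\boldsymbol{y}^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[1ex] \ln \boldsymbol{y}, & \lambda = 0 \end{cases} \tag{10.1.47}$$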

In this procedure, the variances are calculated and plotted for the transformed response 𝒚^(𝜆) with 𝜆 = 0, ±0.5, ±1, …, and the value of 𝜆 (the so-called transformation parameter) that makes the variance stable is selected.
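The transformation in Eq. (10.1.47) and a simple grid search over 𝜆 can be sketched as follows. Note the assumptions: the response must be strictly positive, the data are hypothetical, and "stable variance" is interpreted here as the ratio of the variances in two groups of observations being close to one. This criterion is only one possible reading of the procedure.

```python
import math
from statistics import variance


def box_cox(y, lam):
    """Box-Cox transform of strictly positive values y for a given lambda."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


# Hypothetical response values in two groups; the spread grows with the level.
low_level = [8.0, 10.0, 12.5]
high_level = [80.0, 100.0, 125.0]


def variance_ratio(lam):
    """Variance of the high-level group relative to the low-level group."""
    return variance(box_cox(high_level, lam)) / variance(box_cox(low_level, lam))


candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
ratios = {lam: variance_ratio(lam) for lam in candidates}

# Assumed criterion: choose the lambda whose variance ratio is closest to 1.
best_lambda = min(candidates, key=lambda lam: abs(math.log(ratios[lam])))
print("Selected lambda:", best_lambda)
```

For these multiplicative data the logarithmic transformation (𝜆 = 0) equalizes the two group variances, while the untransformed response (𝜆 = 1) leaves them differing by a factor of 100.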

When the residuals are plotted against the fitted values, a pattern may be detected in which the variance of the residuals increases with the level of the response variable. This pattern can be considered evidence of non-constant error variance (“heteroscedasticity”) (Fox, p. 277). Fox proposes a rough rule: heteroscedasticity makes the least squares estimates inaccurate when the ratio of the largest to the smallest variance is about 10 or more (or, more conservatively, about 4 or more). Ignoring the non-constant variance leads to invalid hypothesis tests and confidence intervals. Fox states that non-constant error variance can be overcome by transforming 𝒚 to stabilize the variance or by substituting weighted least squares estimation for ordinary least squares. The weighted least squares method is a special case of the generalized least squares method. In contrast to the ordinary least squares method, the generalized least squares method is used when the covariance matrix of the errors is proportional to an arbitrary positive definite matrix rather than to the identity matrix (Fox, p. 274):

𝑉𝑎𝑟(𝜖) = 𝜎_𝝐²Ω (10.1.48)

The term Ω is considered to be a known 𝑛 × 𝑛 symmetric and positive definite matrix⁸⁷ (Huang, p. 128). When Ω is a diagonal matrix with unequal diagonal elements, it indicates unequal variances. When Ω has nonzero off-diagonal elements, it indicates the presence of correlated errors. In contrast to the ordinary least squares method, the weighted least squares method assigns each term a weight (𝑤_i) that reflects the uncertainty of the corresponding observation and thereby its influence on the final parameter estimates (Huang, p. 128), as shown below:

𝑉𝑎𝑟(𝝐) = 𝜎²

87 A symmetric matrix Ω is positive definite if 𝒂^′Ω𝒂 > 0 for any non-zero column vector 𝒂 ∈ ℝ^𝑛.

In practice, the weighted least squares criterion minimizes the weighted residual sum of squares (𝑊𝑅𝑆𝑆), which can be expressed as follows (Sheather, p. 115):

$$WRSS = \sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2$$

Observations with smaller variance are multiplied with greater weights. More information on the estimation of 𝑤_𝑖 can be found in Fox (pp. 274-275).
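The 𝑊𝑅𝑆𝑆 criterion can be illustrated for a straight-line model, for which the minimizing coefficients have a closed form based on weighted means. In this sketch the error variances are assumed to be known and the weights are taken as their reciprocals; both the data and the variance values are hypothetical.

```python
def wls_line(x, y, w):
    """Coefficients (b0, b1) that minimize sum of w_i * (y_i - b0 - b1*x_i)^2."""
    sw = sum(w)
    x_w = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    y_w = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    s_xx = sum(wi * (xi - x_w) ** 2 for wi, xi in zip(w, x))
    s_xy = sum(wi * (xi - x_w) * (yi - y_w) for wi, xi, yi in zip(w, x, y))
    b1 = s_xy / s_xx
    return y_w - b1 * x_w, b1


def wrss(x, y, w, b0, b1):
    """Weighted residual sum of squares for the line b0 + b1*x."""
    return sum(wi * (yi - b0 - b1 * xi) ** 2 for wi, xi, yi in zip(w, x, y))


# Hypothetical data whose scatter grows with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.7, 5.9, 4.9]
assumed_variances = [0.1, 0.1, 0.2, 0.4, 1.6, 1.6]  # assumed known, illustrative only

weights = [1.0 / v for v in assumed_variances]  # smaller variance, greater weight
b0_wls, b1_wls = wls_line(x, y, weights)
b0_ols, b1_ols = wls_line(x, y, [1.0] * len(x))  # equal weights give ordinary least squares

print(f"WLS: b0 = {b0_wls:.3f}, b1 = {b1_wls:.3f}")
print(f"OLS: b0 = {b0_ols:.3f}, b1 = {b1_ols:.3f}")
```

With equal weights the same function reproduces the ordinary least squares fit, which makes the effect of the weighting easy to compare.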

The plot of residuals against the time index can also be analyzed for any pattern indicating that the residuals are correlated with each other (i.e. autocorrelation). If there is no correlation among the residuals, they vary randomly around the zero line. In addition, the Durbin–Watson statistic (𝐷𝑊)

$$DW = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \tag{10.1.52}$$

can be computed as a supplement to the residual plot (Fox, p. 442). The range of 𝐷𝑊 is from 0 to 4. 𝐻₀ of the Durbin–Watson test states that no autocorrelation exists between consecutive residuals. 𝐻₀ cannot be rejected for values of 𝐷𝑊 close to 2. If 𝐷𝑊 is substantially less than 2, there is evidence of positive autocorrelation; if it is substantially more than 2, there is evidence of negative autocorrelation. If autocorrelation is identified (i.e. Ω has nonzero off-diagonal elements), the generalized least squares method can be applied to the data to account for the autocorrelation. More information on the estimation of 𝑤_𝑖 can be found in (pp. 428-429).
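The Durbin–Watson statistic of Eq. (10.1.52) is straightforward to compute from a sequence of residuals ordered in time. The two residual sequences in the sketch below are hypothetical and deliberately extreme, in order to show both directions of departure from 2.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals ordered by time index."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual sequences.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips at every step
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # long runs of equal sign

dw_alternating = durbin_watson(alternating)
dw_persistent = durbin_watson(persistent)

print(f"alternating residuals: DW = {dw_alternating:.3f}")
print(f"persistent residuals:  DW = {dw_persistent:.3f}")
```

The alternating sequence yields a value well above 2 (negative autocorrelation), whereas the persistent sequence yields a value well below 2 (positive autocorrelation).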

The analysis of the residual plots may also indicate non-linearity. In general, non-linearity can be remedied by transforming the variables concerned or by increasing the order of the terms for the independent variables (e.g. including a quadratic term).

10.1.1.5.2 Influential Data Analysis

Once the linear models have been estimated by least squares, they should also be examined for unusual data that influence the results of the regression analysis. The data should be analyzed to distinguish among high-leverage observations, regression outliers, and influential observations (Fox, p. 244). Such observations may cause the model to fail to capture important features of the data; however, they may also turn out to be consistent with the rest of the data.

Leverage points are data points whose 𝑥-values have an unusually large impact on the regression model by affecting the accuracy of the estimated regression coefficients. These points are extreme values and are distant from the rest of the data. A point is a bad leverage point (also called an outlier) if its 𝑦-value does not follow the pattern set by the other points (Sheather, p. 52). A high-leverage point that is in line with the rest of the data (indicating low discrepancy) is not an outlier, since this observation has little influence on the regression coefficients. The formula below, given by Fox (p. 242), distinguishes among the three concepts of influence, leverage, and discrepancy (also called outlyingness).

𝐼𝑛𝑓𝑙𝑢𝑒𝑛𝑐𝑒 𝑜𝑛 𝑐𝑜𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡𝑠 = 𝐿𝑒𝑣𝑒𝑟𝑎𝑔𝑒 ∙ 𝐷𝑖𝑠𝑐𝑟𝑒𝑝𝑎𝑛𝑐𝑦 (10.1.53)

It can be inferred that observations with high leverage and a large studentized residual exert substantial influence on the regression coefficients. The influence on the coefficients can be measured by Cook’s D statistic (Fox, p. 250). Cook’s D measures the impact of the 𝑖^𝑡ℎ observation on the coefficients 𝜷 by removing that observation and recalculating the estimate 𝜷̂_(𝑖), for each observation in turn. The Cook’s D statistic (𝐷_𝑖) can be computed directly using the equation given below (Fox, p. 250):

$$D_i = \frac{E_i^{2}}{k + 1} \cdot \frac{h_{i,i}}{1 - h_{i,i}}, \qquad i = 1, 2, \ldots, n \tag{10.1.54}$$

The first quotient in the formula is a measure of discrepancy, while the second is a measure of leverage. If the 𝑖^𝑡ℎ observation is influential, the value of 𝐷_𝑖 is large. A rough cutoff for identifying highly influential points is 𝐷_𝑖 > 4/(𝑛 − 𝑘 − 1) (Fox, p. 266). If any highly influential points are identified, it should be checked whether an error occurred during data entry or measurement. Further, it is useful to temporarily remove observations one at a time, refit the model at each step and reexamine the resulting changes in Cook’s distances. Whether the identified highly influential points are permanently removed as outliers depends on the judgment of the practitioner.
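Eq. (10.1.54) and the rough cutoff can be sketched for a model with an intercept and one predictor (k = 1), for which the hat values have a closed form. The data are hypothetical: the last observation has an extreme x-value and does not follow the trend of the other points.

```python
import math


def cooks_distances(x, y):
    """Cook's D for each observation of the simple model y = b0 + b1*x (k = 1)."""
    n, k = len(x), 1
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    sigma_hat = math.sqrt(sum(e ** 2 for e in residuals) / (n - k - 1))
    distances = []
    for xi, e in zip(x, residuals):
        h = 1.0 / n + (xi - x_bar) ** 2 / s_xx             # hat value (leverage)
        e_std = e / (sigma_hat * math.sqrt(1.0 - h))       # standardized residual
        distances.append(e_std ** 2 / (k + 1) * h / (1.0 - h))
    return distances


# Hypothetical data; the last point combines high leverage with high discrepancy.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 20.0]

D = cooks_distances(x, y)
cutoff = 4.0 / (len(x) - 1 - 1)  # rough cutoff 4 / (n - k - 1) with k = 1
flagged = [i for i, d in enumerate(D) if d > cutoff]

print("Cutoff:", round(cutoff, 3))
print("Flagged observations:", flagged)
```

Only the last observation exceeds the cutoff in this sketch, which is the kind of point that would then be checked for data entry or measurement errors.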

10.1.1.5.3 Multicollinearity

Multiple regression models are based on the dependencies between the dependent variable 𝒚 and the independent variables in 𝒙. However, dependencies can also be observed among the independent variables themselves, which indicates multicollinearity. Perfect collinearity causes the least-squares coefficients not to be unique (Fox, p. 331). Strong collinearity increases the sampling variances of the least-squares coefficients and makes them useless as estimators for forecasting (Fox, p. 309). The variance-inflation factor (𝑉𝐼𝐹)

$$VIF(\beta_j) = \frac{1}{1 - R_j^{2}}, \qquad j = 1, 2, \ldots, k \tag{10.1.55}$$

is used to measure the impact of collinearity on the precision of the estimate of 𝛽_𝑗 (Fox, p. 309). The term 𝑅_𝑗² is the squared multiple correlation for the regression of 𝑥_𝑗 on the other 𝑥’s. The higher the value of VIF, the greater the degree of collinearity. The precision of the estimate is halved as 𝑅_𝑗 approaches 0.9, which corresponds to a VIF value of about 5 (Fox, p. 308).
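A small sketch of Eq. (10.1.55) follows. It assumes a model with exactly two predictors, in which case 𝑅_𝑗² reduces to the squared simple correlation between them, so both coefficients share the same VIF. The predictor values are hypothetical. The sketch also evaluates the VIF for 𝑅_𝑗 = 0.9.

```python
import math


def squared_correlation(a, b):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab ** 2 / (s_aa * s_bb)


def vif(r_squared):
    """Variance-inflation factor 1 / (1 - R_j^2)."""
    return 1.0 / (1.0 - r_squared)


# Hypothetical, strongly related predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1]

r_squared = squared_correlation(x1, x2)
vif_two_predictors = vif(r_squared)

vif_at_r_09 = vif(0.9 ** 2)                  # R_j = 0.9, i.e. R_j^2 = 0.81
std_error_factor = math.sqrt(vif_at_r_09)    # inflation of the standard error

print(f"VIF for the two predictors: {vif_two_predictors:.1f}")
print(f"VIF at R_j = 0.9: {vif_at_r_09:.2f}, standard error factor: {std_error_factor:.2f}")
```

At 𝑅_𝑗 = 0.9 the VIF is about 5.3 and its square root is slightly above 2, i.e. the standard error of the coefficient is roughly doubled.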

In document The development of the turkish power market with special respect to renewable power generation in Turkey (Page 129-136)