
10.1 Regression Analysis

10.1.1 Linear Regression

10.1.1.5 Linear Regression Diagnostics

Fitting a regression model to a set of data rests on the series of assumptions mentioned above. These assumptions have to be tested to determine whether the fitted model is appropriate (also called adequate) and whether the conclusions based upon it are valid. Note that a model can be used for forecasting only if the assumptions are shown to hold. Model adequacy is assessed with a series of tools known as regression diagnostics. In this section, several important testing procedures for the linear regression assumptions are described.

10.1.1.5.1 Residual Analysis

Residuals can be analyzed to test whether a proposed model provides an adequate fit to the data under the given assumptions. To this end, plots of the standardized residuals $E_i$ can be examined visually to assess the magnitude of the residuals and to identify unusual values (Fox, p. 246):

$$E_i = \frac{e_i}{\hat{\sigma}_{\boldsymbol{e}} \sqrt{1 - h_{i,i}}}, \quad i = 1, 2, \dots, n \qquad (10.1.39)$$

The term $h_{i,i}$ is called a hat value and is the $i$th diagonal element of the "hat" matrix $\boldsymbol{H}$ (Fox, p. 261):

$$\boldsymbol{H} = \boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}' \qquad (10.1.40)$$

By using the weights $h_{i,j}$ in the $i$th row and $j$th column of $\boldsymbol{H}$, the fitted value $\hat{y}_j$ (also called $y$-hat) can be expressed in terms of the observed values $y_i$ (Fox, p. 244):

$$\hat{y}_j = h_{1,j} y_1 + h_{2,j} y_2 + \dots + h_{j,j} y_j + \dots + h_{n,j} y_n = \sum_{i=1}^{n} h_{i,j} y_i \qquad (10.1.41)$$

It can be inferred that the hat values measure the potential influence (the leverage) of $y_i$ on all fitted values. For example, if $h_{i,j}$ is large, the $i$th observation exerts a considerable impact on the $j$th fitted value. The hat values lie in the interval $1/n \le h_{i,i} \le 1$ (Fox, p. 244).
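The hat values and the standardized residuals can be illustrated with a minimal Python sketch for the special case of a single predictor ($k = 1$). The sketch assumes the closed-form leverage $h_{i,i} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, which is the diagonal of the hat matrix for a straight-line model, and the estimate $\hat{\sigma}_{\boldsymbol{e}}^2 = \sum_i e_i^2 / (n - k - 1)$. The data and the function names are hypothetical and are not taken from the cited sources.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x (one predictor)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def hat_values(x):
    """Diagonal of the hat matrix for a straight-line model (assumed closed form)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

def standardized_residuals(x, y):
    """E_i = e_i / (sigma_hat * sqrt(1 - h_ii)), with sigma_hat^2 = RSS / (n - 2)."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sigma_hat = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2))
    return [ei / (sigma_hat * math.sqrt(1.0 - hi)) for ei, hi in zip(e, hat_values(x))]

# Hypothetical data: the last point lies far from the other x-values.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]
h = hat_values(x)
E = standardized_residuals(x, y)
```

For these data the hat values sum to $k + 1 = 2$, and the last observation has by far the largest leverage because its $x$-value is distant from the others.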

Another approach for identifying unusual values is to compute the studentized residuals $E_i^*$ as expressed below (Fox, p. 246):

$$E_i^* = \frac{e_i}{\hat{\sigma}_{\boldsymbol{e}(-i)} \sqrt{1 - h_{i,i}}} \qquad (10.1.42)$$

The studentized residuals are calculated by refitting the model after removing the $i$th observation and then obtaining an estimate $\hat{\sigma}_{\boldsymbol{e}(-i)}$ of $\sigma_{\boldsymbol{\epsilon}}$ that is based on the remaining $n - 1$ observations.
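The refitting procedure can be sketched as follows. This is a minimal Python illustration for one predictor ($k = 1$) that follows the description literally: each observation is removed in turn, the line is refitted, and the residual standard deviation of the reduced fit is used in the denominator. The closed-form hat value for a straight-line model, the data and the function names are assumptions made for the example.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x (one predictor)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def residual_sd(x, y):
    """Estimate of the error standard deviation: sqrt(RSS / (n - 2))."""
    b0, b1 = fit_line(x, y)
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(rss / (len(x) - 2))

def studentized_residuals(x, y):
    """E*_i = e_i / (sigma_hat_(-i) * sqrt(1 - h_ii)), refitting without observation i."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    result = []
    for i in range(n):
        e_i = y[i] - (b0 + b1 * x[i])              # residual from the full fit
        h_ii = 1.0 / n + (x[i] - x_bar) ** 2 / sxx  # hat value (one predictor)
        sd_minus_i = residual_sd(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        result.append(e_i / (sd_minus_i * math.sqrt(1.0 - h_ii)))
    return result

# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]
E_star = studentized_residuals(x, y)
```

Refitting $n$ times is affordable for small data sets; it is chosen here only because it mirrors the definition directly.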

If a valid model has been fit to the data, a plot of $E_i$ (or $E_i^*$) against any predictor or against the fitted values shows the following features (Sheather, 2009, p. 155):

• A random scatter of points around the horizontal axis, since the mean function of $e_i$ is zero when a valid model has been fit,

• Constant variability along the horizontal axis. Note that, unlike the errors, which are assumed to have equal variances, the residuals do not have equal variances:

$$\mathrm{Var}(e_i) = \sigma_{\boldsymbol{\epsilon}}^2 (1 - h_{i,i}), \quad i = 1, 2, \dots, n \qquad (10.1.43)$$

If any pattern can be recognized in a plot of the standardized residuals, it can be concluded that an invalid model has been fit to the data. Subsequently, the residuals should be checked against the model assumptions of normality, homoscedasticity, uncorrelated errors and linearity of the relationship.

The normality of the errors can be assessed by constructing a probability plot (quantile-quantile plot, i.e. Q-Q plot) and/or a histogram of $E_i$ or $E_i^*$ (Fox, p. 268). A normal probability plot of the $E_i$ or $E_i^*$ is constructed by plotting the ordered $E_i$ or $E_i^*$ on the ordinate against the order statistics of a standard normal distribution on the abscissa. In this plot, the quantiles of $E_i$ or $E_i^*$ (sample quantiles) are compared with the quantiles of a normal distribution (theoretical quantiles) to examine whether they match. If they match, the points lie close to a straight line. Observed departures from linearity are evidence of non-normality (Sheather, 2009, p. 70).

The Shapiro-Wilk test or the Anderson-Darling test ($A^2$-test) can also be used as a supplement to the Q-Q plot. The Shapiro-Wilk test checks whether the residuals are normally distributed. Its W-statistic ($W_0$) is calculated from the residuals in ordered form, i.e. from their order statistics86, and from the mean of the residuals $\mu_{\boldsymbol{e}}$ (Wetherill, 1986, p. 182). $H_0$ of this test assumes that the population is normally distributed (whereas $H_1$ indicates the opposite). Hence, $H_0$ is rejected if the p-value is less than the chosen $\alpha$-level. This test is suitable for $n \le 50$ (Wetherill, p. 182).

86 An $i$th order statistic is an indexed statistic representing the $i$th lowest value in a statistical sample.

The Anderson-Darling test ($A^2$-test) is another test for checking normality, applicable for $n \ge 10$. Wetherill (p. 183) remarks that the Shapiro-Wilk test is more powerful than the $A^2$-test up to a sample size of 50. The $A^2$-test is recommended for samples of large size (Wetherill, p. 183). In this case, $H_0$ of the $A^2$-test assumes a normal distribution for the residuals (whereas $H_1$ indicates the opposite). Hence, $H_0$ is rejected if the p-value is less than the chosen $\alpha$-level.
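The coordinates of the normal probability plot described above can be computed with a minimal Python sketch. It assumes the plotting positions $(i - 0.5)/n$ for the theoretical quantiles, which is one common convention and is not prescribed by the cited sources. The residual values and the function name are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(values):
    """Return (theoretical quantile, ordered sample value) pairs for a normal Q-Q plot."""
    n = len(values)
    ordered = sorted(values)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions (i - 0.5) / n for i = 1, ..., n
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical standardized residuals
residuals = [0.3, -1.2, 0.8, -0.1, 1.5, -0.6, 0.2]
points = normal_qq_points(residuals)
```

Plotting the second coordinate of each pair on the ordinate against the first on the abscissa gives the Q-Q plot; points that stay close to a straight line are consistent with normally distributed residuals.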

Heavy-tailed or highly skewed error distributions generate outliers, which cause the least-squares estimates to be inaccurate (Fox, p. 268). In general, these problems can be mitigated by transforming the data. The Box-Cox transformation approach selects a power transformation of $\boldsymbol{y}$ to normalize the errors, in the form (Bates & Watts, p. 28):

$$\boldsymbol{y}^{(\lambda)} = \begin{cases} \dfrac{\boldsymbol{y}^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln \boldsymbol{y}, & \lambda = 0 \end{cases} \qquad (10.1.47)$$

In this procedure, variances are calculated and plotted versus $\boldsymbol{y}^{(\lambda)}$ for $\lambda = 0, \pm 0.5, \pm 1, \dots$, and the value of $\lambda$ (the so-called transformation parameter) that makes the variance stable is selected.
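The selection of $\lambda$ can be illustrated with a minimal Python sketch. It applies the Box-Cox transformation to two hypothetical groups of positive responses observed at a low and a high level and, as a simplifying assumption made for this example, takes the variance to be "stable" when the ratio of the two group variances is closest to 1. The data, the candidate grid and the names are not taken from the cited sources.

```python
import math
from statistics import variance

def box_cox(values, lam):
    """Box-Cox power transformation of strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

# Hypothetical responses at a low and a high level; the spread grows with the level.
low = [9.0, 10.0, 11.0]
high = [90.0, 100.0, 110.0]

candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
# Ratio of the two group variances after transformation; 1 means equal spread.
ratios = {
    lam: variance(box_cox(high, lam)) / variance(box_cox(low, lam))
    for lam in candidates
}
# Assumed selection rule: the lambda whose variance ratio is closest to 1.
best_lambda = min(candidates, key=lambda lam: abs(math.log(ratios[lam])))
```

In these data the spread is proportional to the level, so the logarithm ($\lambda = 0$) equalizes the two variances, whereas the untransformed scale ($\lambda = 1$) leaves a variance ratio of 100.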

When residuals are plotted against fitted values, a pattern may be detected in which the variance of the residuals increases with the level of the response variable. This pattern can be considered evidence of non-constant error variance ("heteroscedasticity") (Fox, p. 277). As a rough rule, Fox proposes that heteroscedasticity makes the least-squares estimates inaccurate when the ratio of the largest to the smallest variance is about 10 or more (or, more conservatively, about 4 or more). Ignoring non-constant variance leads to invalid hypothesis tests and confidence intervals. Fox states that non-constant error variance can be overcome by transforming $\boldsymbol{y}$ to stabilize the variance or by substituting weighted-least-squares estimation for ordinary least squares. The weighted-least-squares estimation method is a type of generalized least squares method. As opposed to the least squares method, the generalized least squares method is used when the covariance matrix of the errors is proportional to any positive definite matrix rather than to the identity matrix (Fox, p. 274):

$$\mathrm{Var}(\boldsymbol{\epsilon}) = \sigma_{\boldsymbol{\epsilon}}^2 \Omega \qquad (10.1.48)$$

The term $\Omega$ is considered to be a known $n \times n$ symmetric and positive definite matrix87 (Huang, p. 128). When $\Omega$ is a diagonal matrix, it indicates unequal variances. When $\Omega$ has nonzero off-diagonal elements, it indicates the presence of correlated errors.

87 A symmetric matrix $\Omega$ is positive definite if $\boldsymbol{a}'\Omega\boldsymbol{a} > 0$ for any non-zero column vector $\boldsymbol{a} \in \mathbb{R}^n$.

As distinct from the least squares method, each term in the weighted least squares method is assigned a weight ($w_i$) that reflects the uncertainty of the corresponding observation and thereby influences the final parameter estimates (Huang, p. 128), as shown below:

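$$\mathrm{Var}(\boldsymbol{\epsilon}) = \sigma^2 \, \mathrm{diag}\!\left(\frac{1}{w_1}, \frac{1}{w_2}, \dots, \frac{1}{w_n}\right)$$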

In practice, the weighted least squares criterion of minimizing the weighted residual sum of squares ($WRSS$) can be expressed as follows (Sheather, p. 115):

$$WRSS = \sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2$$

In the $WRSS$, the observations with smaller variance are multiplied by greater weights. More information on the estimation of $w_i$ can be found in Fox (pp. 274-275).
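The weighted least squares criterion can be illustrated with a minimal Python sketch for a straight-line model. It uses the weighted means of $x$ and $y$ to solve the two normal equations of the weighted residual sum of squares in closed form. The data, the function names and the weights $w_i = 1/x_i^2$ (which assume an error variance proportional to $x_i^2$) are hypothetical choices made for the example.

```python
def fit_wls_line(x, y, w):
    """Intercept and slope minimizing sum(w_i * (y_i - b0 - b1 * x_i)**2)."""
    w_sum = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / w_sum  # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / w_sum  # weighted mean of y
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def wrss(x, y, w, b0, b1):
    """Weighted residual sum of squares for a given intercept and slope."""
    return sum(wi * (yi - (b0 + b1 * xi)) ** 2 for wi, xi, yi in zip(w, x, y))

# Hypothetical data whose scatter grows with x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.8, 6.5, 7.4, 11.0, 10.9]
# Assumed weights: error variance taken to be proportional to x_i**2
w = [1.0 / xi ** 2 for xi in x]

b0, b1 = fit_wls_line(x, y, w)
minimum = wrss(x, y, w, b0, b1)
```

With these weights the observations at small $x$, which are assumed to be more precise, dominate the fit. Setting all weights to 1 reduces the procedure to ordinary least squares.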

The plot of the residuals against the time index can also be analyzed for any pattern indicating correlation of the residuals with each other (i.e. autocorrelation). If there is no correlation among the residuals, they vary randomly around the zero line. In addition, the Durbin-Watson statistic ($DW$)

$$DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \qquad (10.1.52)$$

can also be computed as a supplement to the residual plot (Fox, p. 442). The range of $DW$ is from 0 to 4. $H_0$ of the Durbin-Watson test assumes that no autocorrelation exists between consecutive residuals. $H_0$ cannot be rejected for values of $DW$ close to 2. If $DW$ is substantially less than 2, there is evidence of positive autocorrelation; if it is substantially more than 2, there is evidence of negative autocorrelation. If autocorrelation is identified (i.e. $\Omega$ has nonzero off-diagonal elements), the generalized least squares method can be applied to the data to account for the autocorrelation. More information on the estimation of $w_i$ can be found in (pp. 428-429).
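The Durbin-Watson statistic can be computed directly from a time-ordered sequence of residuals, as in the following minimal Python sketch. The two residual series are hypothetical and are constructed only to show the behaviour of the statistic at both ends of its range.

```python
def durbin_watson(residuals):
    """DW = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2 for time-ordered residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # neighbours tend to share a sign
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbours tend to change sign

dw_persistent = durbin_watson(persistent)    # 4/6, well below 2
dw_alternating = durbin_watson(alternating)  # 20/6, well above 2
```

The persistent series yields a value well below 2 (positive autocorrelation), whereas the alternating series yields a value well above 2 (negative autocorrelation).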

The analysis of the residual plots may also indicate non-linearity. In general, non-linearity can be remedied by transforming the variables concerned or by increasing the order of the terms for the independent variables (e.g. including a quadratic term).

10.1.1.5.2 Influential Data Analysis

Once the linear models have been estimated by least squares, they should also be examined for unusual data that influence the results of the regression analysis. The data should be analyzed to distinguish among high-leverage observations, regression outliers, and influential observations (Fox, p. 244). Such observations may cause the model to fail to capture important features of the data; however, they may also be consistent with the rest of the data.

Leverage points are data points whose $x$-values have an unusually large impact on the regression model by affecting the accuracy of the estimated regression coefficients. These points are extreme values and are distant from the rest of the data. A leverage point is a bad leverage point (i.e. an outlier) if its $y$-value does not follow the pattern set by the other points (Sheather, p. 52). A high-leverage point that is in line with the rest of the data (indicating low discrepancy) is not an outlier, since this observation has no influence on the regression coefficients. The formula below, given by Fox (p. 242), distinguishes among the three concepts of influence, leverage, and discrepancy (also called outlyingness):

$$\text{Influence on coefficients} = \text{Leverage} \cdot \text{Discrepancy} \qquad (10.1.53)$$

It can be inferred that observations with high leverage and a large studentized residual have substantial influence on the regression coefficients. The influence on the coefficients can be measured by using Cook's D statistic (Fox, p. 250). Cook's D statistic measures the impact on the coefficients $\boldsymbol{\beta}$ of removing the $i$th observation, with $\hat{\boldsymbol{\beta}}_{(i)}$ calculated after each removal. Cook's D statistic ($D_i$) can be computed simply by using the equation given below (Fox, p. 250):

$$D_i = \frac{E_i^2}{k + 1} \cdot \frac{h_{i,i}}{1 - h_{i,i}}, \quad i = 1, 2, \dots, n \qquad (10.1.54)$$

The first quotient in the formula is a measure of discrepancy, while the second is a measure of leverage. If the $i$th observation is influential, its removal results in a large value of $D_i$. A rough cutoff for identifying highly influential points is $D_i > 4/(n - k - 1)$ (Fox, p. 266). If any highly influential points are identified, they should be checked for errors made during data entry or measurement. Further, it is useful to temporarily remove observations one at a time and to refit the model at each step in order to reexamine the resulting changes in Cook's distances. Whether the identified highly influential points are permanently removed as outliers depends on the judgment of the practitioner.
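Cook's D statistic and the rough cutoff can be illustrated with a minimal Python sketch for one predictor ($k = 1$). The sketch assumes the closed-form hat value of a straight-line model and the estimate $\hat{\sigma}_{\boldsymbol{e}}^2 = \sum_i e_i^2/(n - k - 1)$; the data and the names are hypothetical.

```python
K = 1  # number of predictors in this sketch

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x (one predictor)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def cooks_distances(x, y):
    """D_i = E_i^2 / (k + 1) * h_ii / (1 - h_ii) for a straight-line model."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sigma2_hat = sum(e ** 2 for e in residuals) / (n - K - 1)
    distances = []
    for xi, e in zip(x, residuals):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx           # leverage (hat value)
        e_std_sq = e ** 2 / (sigma2_hat * (1.0 - h))    # squared standardized residual
        distances.append(e_std_sq / (K + 1) * h / (1.0 - h))
    return distances

# Hypothetical data: the last point has an extreme x-value and departs from the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]

D = cooks_distances(x, y)
cutoff = 4.0 / (len(x) - K - 1)
flagged = [i for i, d in enumerate(D) if d > cutoff]  # indices of candidate influential points
```

The flagged indices are only candidates for closer inspection; as stated above, their removal remains a matter of judgment.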

10.1.1.5.3 Multicollinearity

Multiple regression models are based on the dependencies between the dependent variable in $\boldsymbol{y}$ and the independent variables in $\boldsymbol{x}$. However, dependencies can also be observed among the independent variables, which indicates multicollinearity. Perfect collinearity causes the least-squares coefficients to be non-unique (Fox, p. 331). Strong collinearity increases the sampling variances of the least-squares coefficients and makes them useless as estimators for forecasting (Fox, p. 309). The variance-inflation factor ($VIF$)

$$VIF(\beta_j) = \frac{1}{1 - R_j^2}, \quad j = 1, 2, \dots, k \qquad (10.1.55)$$

is used for measuring the impact of collinearity on the precision of the estimate of $\beta_j$ (Fox, p. 309). The term $R_j^2$ is the squared multiple correlation for the regression of $x_j$ on the other $x$'s. The higher the value of $VIF$, the greater the degree of collinearity. The precision of the estimate is halved as $R_j$ approaches 0.9, which corresponds to a $VIF$ value of about 5 (Fox, p. 308).
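The variance-inflation factor can be illustrated with a minimal Python sketch. It is restricted to the case of two predictors, where $R_j^2$ reduces to the squared correlation between the two predictors, so that both coefficients share the same $VIF$. The data and the function names are hypothetical.

```python
def variance_inflation_factor(r_squared):
    """VIF = 1 / (1 - R_j^2); undefined under perfect collinearity (R_j^2 = 1)."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R_j^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)

def squared_correlation(u, v):
    """Squared Pearson correlation; equals R_j^2 when there are only two predictors."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    suu = sum((ui - u_bar) ** 2 for ui in u)
    svv = sum((vi - v_bar) ** 2 for vi in v)
    return suv ** 2 / (suu * svv)

# R_j = 0.9 gives R_j^2 = 0.81 and a VIF of about 5.26
vif_at_09 = variance_inflation_factor(0.9 ** 2)

# Hypothetical predictors that are almost exact multiples of each other
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
vif_x = variance_inflation_factor(squared_correlation(x1, x2))
```

Since the standard error of the estimate grows with the square root of the $VIF$, a value of about 5.26 corresponds to a standard error roughly 2.3 times as large as without collinearity, in line with the halving of precision mentioned above. For more than two predictors, $R_j^2$ has to be obtained from a regression of $x_j$ on all other predictors.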