Models with Clearness Index (K) response - Multiple Linear Regression Analysis

CHAPTER 4: GLOBAL RADIATION

4.2 Multiple Linear Regression Analysis

4.2.1 Models with Clearness Index (K) response

The main goal of choosing K as a response is to study the effect of atmospheric parameters on the ratio of solar radiation that reaches the earth’s surface. All atmospheric parameters were included in the models as predictors. Although we expect a negative correlation between K and H0, since $K = H/H_0$, the correlation matrix in Figure 4.2 shows a moderate positive correlation between them. Accordingly, H0 was added to the predictors for inspection.

A- Variable selection

We used the “best subset selection” technique, which gives the best model for each number of predictors. We then used methods that adjust the training error for the model size, namely adjusted $R^2$, Mallows' $C_p$, and the Bayesian information criterion (BIC), to determine the best model overall. Below is a brief explanation of these methods (James et al. 2014).

The best model is the one with the lowest test (data) mean squared error (MSE) given by Equation 4.1.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \frac{\mathrm{RSS}}{n} \tag{4.1}$$

where $n$ is the number of observations, $y_i$ is the $i$th response (observed value), $\hat{y}_i$ is the $i$th fitted value, and $\mathrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ is the residual sum of squares.
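As a minimal sketch of Equation 4.1, the snippet below fits a least-squares line to a small synthetic data set and computes RSS and MSE from the residuals. The data values and the helper name `rss_and_mse` are illustrative assumptions, not taken from the study.

```python
import numpy as np

def rss_and_mse(y, y_hat):
    """Return (RSS, MSE) for observed y and fitted y_hat, following Equation 4.1."""
    resid = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    rss = float(resid @ resid)      # residual sum of squares
    return rss, rss / resid.size    # MSE = RSS / n

# Synthetic example (hypothetical numbers): one predictor plus an intercept.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 2.1, 2.9, 4.2, 4.9])
A = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
rss, mse = rss_and_mse(y, A @ beta)
```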

However, we first use training data to fit the models. Training MSE generally underestimates the test MSE. The reason is that we fit a model to the training data, using least squares, to estimate the regression coefficients such that the training RSS (but not the test RSS) is as small as possible. Therefore, training RSS cannot be used to select the best model among a set of models with different numbers of variables. To adjust the training MSE for model size, any of the following approaches can be used:

1- $$C_p = \mathrm{MSE} + \frac{2p}{n}\hat{\sigma}^2 \tag{4.2}$$

where $p$ is the number of predictors and $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement ($\epsilon_i = y_i - \hat{y}_i$). Note that $\hat{\sigma}^2 = \mathrm{RSS}/(n - p - 1)$.

Accordingly, the $C_p$ statistic adds a penalty of $\frac{2p}{n}\hat{\sigma}^2$ to the training MSE in order to adjust for the fact that the training error tends to underestimate the test error. This penalty increases as the number of predictors in the model increases.

2- $$\mathrm{BIC} = \mathrm{MSE} + \frac{\log(n)}{n}\,p\,\hat{\sigma}^2 \tag{4.3}$$

Since $\log(n) > 2$ for any $n > 7$, the BIC statistic generally places a heavier penalty on models with many variables than $C_p$ does, and therefore tends to select smaller models. $C_p$ and BIC are indirect estimates of the test MSE (James et al. 2014).

3- Adjusted $R^2$

$R^2$ represents the proportion of variability in the responses $Y$ ($y_1, \ldots, y_n$) explained by the model:

$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \tag{4.4}$$

where $\mathrm{TSS} = \sum_{i=1}^{n}(y_i - \bar{y})^2$ is the total sum of squares, which measures the total variability in the responses $Y$, and $\bar{y}$ is the average of the responses.

Since RSS never increases as variables are added to a model, $R^2$ never decreases, even when the added variables are unrelated to the response. The adjusted $R^2$ statistic, given by Equation 4.5, adds a penalty for increasing the number of variables in a model.

$$\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - p - 1)}{\mathrm{TSS}/(n - 1)} \tag{4.5}$$

For the $C_p$ and BIC criteria, the best model is the one with the smallest value of $C_p$ or BIC, while for adjusted $R^2$ the best model is the one with the largest adjusted $R^2$.
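The three criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the RSS and TSS values are hypothetical rather than results of the study, the function name `selection_criteria` is made up, and the error variance is estimated once from the largest candidate model, which is a common choice but is not confirmed by the text.

```python
import math

def selection_criteria(rss, tss, n, p, sigma2):
    """C_p (Eq. 4.2), BIC (Eq. 4.3) and adjusted R^2 (Eq. 4.5) for one model."""
    mse = rss / n
    return {
        "cp": mse + 2 * p * sigma2 / n,
        "bic": mse + math.log(n) * p * sigma2 / n,
        "adj_r2": 1 - (rss / (n - p - 1)) / (tss / (n - 1)),
    }

# Hypothetical training RSS for the best model of each size p (not study results).
n, tss = 720, 10.0
rss_by_size = {1: 4.0, 2: 2.4, 3: 1.9, 4: 1.78, 5: 1.779}

# Assumption: sigma^2 is estimated once, from the largest candidate model.
p_max = max(rss_by_size)
sigma2 = rss_by_size[p_max] / (n - p_max - 1)

scores = {p: selection_criteria(rss, tss, n, p, sigma2)
          for p, rss in rss_by_size.items()}
best_cp = min(scores, key=lambda p: scores[p]["cp"])           # smallest C_p
best_bic = min(scores, key=lambda p: scores[p]["bic"])         # smallest BIC
best_adj_r2 = max(scores, key=lambda p: scores[p]["adj_r2"])   # largest adjusted R^2
```

In this made-up example the fifth variable lowers RSS only marginally, so the penalty terms outweigh the gain and all three criteria prefer the four-variable model.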

Applying the “best subset selection” technique to the models with response K, we obtained the results shown in Figure 4.3. For the one-variable subset, the best model is the opaque cloud cover (OpqC) model. This variable remains in the best model through the five-variable subset, after which it is replaced by total cloud cover (TotC). For the two-variable subset, relative humidity (RH) enters, and it remains in the best subset models to the end. For the three-variable subset, extraterrestrial radiation (H0) enters and likewise remains in the best subset models to the end.
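The search behind best subset selection can be sketched as follows: for every subset size, fit ordinary least squares on each combination of predictors and keep the one with the lowest training RSS. The data here are synthetic, the predictor names are borrowed from the text only as labels, and the helper names `fit_rss` and `best_subsets` are hypothetical.

```python
from itertools import combinations
import numpy as np

def fit_rss(X, y):
    """Training RSS of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_subsets(X, y, names):
    """For each subset size k, return (lowest RSS, predictor names)."""
    m = X.shape[1]
    best = {}
    for k in range(1, m + 1):
        candidates = (
            (fit_rss(X[:, list(cols)], y), tuple(names[j] for j in cols))
            for cols in combinations(range(m), k)
        )
        best[k] = min(candidates, key=lambda item: item[0])
    return best

# Synthetic data: the names are labels only, the values are made up.
rng = np.random.default_rng(0)
names = ["OpqC", "RH", "H0", "TRange"]
X = rng.normal(size=(200, 4))
y = 0.7 - 0.5 * X[:, 0] - 0.2 * X[:, 1] + 0.05 * rng.normal(size=200)
best = best_subsets(X, y, names)
```

Because the winners for different sizes are compared by training RSS, which never increases with model size, the final choice among them still needs a size-adjusted criterion.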

To determine the best of these best subset models, we applied the “adjusting training error” criteria. Below are the adjusted $R^2$ values of the models shown in Figure 4.3, arranged in the same order. Figure 4.4 shows the selection results of adjusted $R^2$, $C_p$, and BIC for the above models.

Figure 4.3: Best subset selection results for the models with response K

Figure 4.4 shows that the best model according to adjusted $R^2$ and $C_p$ is the model with all variables, while BIC selects the model with nine variables. However, after the third variable is added, the improvement in the estimated test MSE starts to flatten. For convenience and simplicity, we analyzed the four-variable model.

B- Four variable model

The best model of the four variable subset is the model with the predictors OpqC, RH, H0, and TRange.

Fitting the model

Figure 4.5 shows the results of fitting the four-variable model. The coefficients of the model are highly significant for all variables. K has a negative correlation with OpqC and RH, and a positive correlation with H0 and TRange. The coefficient of H0 is very small compared to the other coefficients. This model explains only 87.28% of the variability in the K responses. It is worth mentioning that 0.2 < K < 0.8. The typical value of K for a clear-sky day is between 0.65 and 0.75 (Suehrcke, 2000).

Equation 4.6 represents the fitted model:

$$K = 0.701 + 6.379 \times 10^{-6}\,H_0 - 0.023\,\mathrm{OpqC} - 0.002\,\mathrm{RH} + 0.004\,\mathrm{TRange} + \epsilon \tag{4.6}$$

Figure 4.5: The results of fitting K against OpqC, RH, H0, and TRange variables
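To see how the fitted coefficients of Equation 4.6 act on a prediction, the sketch below evaluates the model for one set of inputs. The input values are hypothetical and their units are an assumption, since this section does not state the units of the predictors; the function name `predict_k` is also made up.

```python
def predict_k(h0, opqc, rh, trange):
    """Fitted clearness index from Equation 4.6 (error term omitted)."""
    return 0.701 + 6.379e-6 * h0 - 0.023 * opqc - 0.002 * rh + 0.004 * trange

# Hypothetical inputs; the units must match those of the study data.
k_base = predict_k(h0=10000.0, opqc=3.0, rh=50.0, trange=12.0)
# One extra unit of opaque cloud cover lowers the fitted K by 0.023.
k_cloudier = predict_k(h0=10000.0, opqc=4.0, rh=50.0, trange=12.0)
```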

Diagnostic analysis

Figure 4.6 shows four major diagnostic plots for model (4.6). The residuals plot, A, shows no pronounced pattern of nonlinearity in the data or non-constant variance of the residuals, although there is a small cluster of points below the horizontal line $\epsilon = 0$ at the right corner. These points belong to the NV station for the June, July, and August data.

The normal Q-Q plot, B, tests the assumption $\epsilon \sim N(0, \sigma^2)$. The plot shows no significant deviation from normality; the majority of the points lie on the Q-Q line. The point above the Q-Q line represents UT Jan-2001, while the points below the Q-Q line represent (starting from the furthest point) ND Jan-1999, WI Jan-1999, WI Jan-2000, and WI Jan-2001. All these points are potential outliers.

Figure 4.6: Diagnostic plots for model (4.6). A- Linearity and constant variance of $\epsilon$ test. B- Normality test ($\epsilon \sim N(0, \sigma^2)$). C- Outliers test. D- High leverage points test.

Studentized residuals were used in the Q-Q plot, as well as in plots C and D. Studentized residuals, also called jackknife residuals, are calculated using Equation 4.7:

$$\text{Studentized } \epsilon_i = \hat{y}_{(i)} - y_i \tag{4.7}$$

where $\hat{y}_{(i)}$ is the predicted value of response $i$, calculated from a model fitted by excluding point $i$, and $y_i$ is the $i$th observed response.
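The leave-one-out quantity in Equation 4.7 can be sketched in two ways: by refitting the model once per excluded point, or from a single fit using the identity $\hat{y}_{(i)} - y_i = -e_i/(1 - h_{ii})$, where $e_i$ is the ordinary residual and $h_{ii}$ the leverage. The data are synthetic and the function names are hypothetical. Note that jackknife residuals are usually defined with an additional division by an estimated standard error; that scaling is not part of Equation 4.7 as printed and is not shown here.

```python
import numpy as np

def deleted_residuals_refit(X, y):
    """y_hat_(i) - y_i from Equation 4.7, refitting the model without point i."""
    A = np.column_stack([np.ones(len(y)), X])
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                       # drop observation i
        beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        out[i] = A[i] @ beta - y[i]
    return out

def deleted_residuals_closed_form(X, y):
    """Same quantity from a single fit: -e_i / (1 - h_ii)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ beta                                   # ordinary residuals
    Q, _ = np.linalg.qr(A)                             # hat matrix H = Q Q^T
    h = np.sum(Q**2, axis=1)                           # leverages h_ii
    return -e / (1.0 - h)

# Synthetic data with one planted outlier at index 5.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=40)
y[5] += 10.0
d_refit = deleted_residuals_refit(X, y)
d_closed = deleted_residuals_closed_form(X, y)
```

The planted point has by far the largest absolute value, because its own observation no longer pulls the fitted surface toward itself.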

Accordingly, Studentized residuals reveal possible outliers that pull the regression line so close to themselves that they conceal their true status. If the Studentized residual of a point is large, the point is an outlier. The red lines in Figure 4.6-C represent the Bonferroni critical value of the Studentized residuals, beyond which points are considered outliers (Faraway, 2002). The critical value is calculated using Equation 4.8 with $\alpha = 0.05$.

$$\text{Bonferroni critical value} = \text{quantile of } t\left(\frac{\alpha}{2n},\; n - p - 1\right) \tag{4.8}$$
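A minimal sketch of Equation 4.8 using SciPy's Student t quantile function is given below. It returns the positive (upper-tail) critical value, to be compared with the absolute value of each Studentized residual. Taking $n = 720$ and $p = 4$ for the four-variable model is an assumption about how the equation was applied, and the function name is hypothetical.

```python
from scipy import stats

def bonferroni_critical_value(alpha, n, df):
    """Upper-tail t quantile at level alpha / (2n) with df degrees of freedom."""
    return float(stats.t.ppf(1.0 - alpha / (2.0 * n), df))

# Assumed inputs: 720 observations, four predictors, alpha = 0.05.
n, p, alpha = 720, 4, 0.05
crit = bonferroni_critical_value(alpha, n, df=n - p - 1)
```

Dividing $\alpha$ by $n$ accounts for testing every observation, so the threshold grows with the sample size.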

We notice that UT Jan-2001 is an outlier, and ND Jan-1999 and WI Jan-1999 are almost outliers. These points are mild outliers; they deviate only slightly from the Q-Q line in plot B. Given that we have 720 points, these outliers are of no concern, especially since they do not have high leverage. Figure 4.6-D reveals the high-leverage points; the red line (Leverage = 2p/n) in plot D is just a “rule of thumb” critical value. Two points have seriously high leverage: Corpus Christi-TX Jun-1999 and Jul-1999. Fortunately, they are not outliers and consequently are not influential points.
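The leverage rule of thumb can be sketched as follows. Leverages are the diagonal entries of the hat matrix, computed here from a reduced QR factorization of the design matrix. The data are synthetic, the function name `leverages` is hypothetical, and counting the intercept in $p$ for the 2p/n threshold is an assumption.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix for a model with an intercept."""
    A = np.column_stack([np.ones(X.shape[0]), X])
    Q, _ = np.linalg.qr(A)          # reduced QR; hat matrix H = Q Q^T
    return np.sum(Q**2, axis=1)

# Synthetic predictors with one planted point far from the rest.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
X[0] = [8.0, 8.0]
h = leverages(X)

n_params = X.shape[1] + 1           # assumption: p counts the intercept too
threshold = 2 * n_params / len(h)   # rule-of-thumb cutoff 2p/n
flagged = np.flatnonzero(h > threshold)
```

Leverage depends only on the predictor values, which is why a high-leverage point is influential only if its response is also unusual.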

An important assumption of the linear regression model is that the error terms, $\epsilon_1, \ldots, \epsilon_n$, are uncorrelated. If there is correlation among the error terms, the estimated standard errors will tend to underestimate the true standard errors. In this case, confidence and prediction intervals will be narrower than they should be, and p-values associated with the model will be lower than they should be (James et al. 2014). Figure 4.7 shows residuals versus time for two stations, Tallahassee Regional AP-FL on the left and Corpus Christi Intl Arpt-TX on the right. No pronounced correlation exists among the residuals.
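The check in the text is visual. As a numeric complement that is not part of the original analysis, the Durbin-Watson statistic summarizes first-order correlation in a residual series ordered in time: values near 2 suggest no such correlation, and values well below 2 suggest positive correlation. The sketch below uses simulated series, not the station residuals.

```python
import numpy as np

def durbin_watson(resid):
    """Sum of squared successive differences divided by the sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

rng = np.random.default_rng(3)
white = rng.normal(size=2000)       # uncorrelated errors

ar1 = np.empty(2000)                # positively correlated errors, rho = 0.8
ar1[0] = white[0]
for t in range(1, 2000):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

dw_white = durbin_watson(white)     # expected to be close to 2
dw_ar1 = durbin_watson(ar1)         # expected to be well below 2
```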

Finally, we computed the variance inflation factor (VIF) of the model variables to check for multicollinearity. Denoting the estimated coefficient for variable $j$ by $\hat{\beta}_j$, the VIF is the variance of $\hat{\beta}_j$ when fitting the full model divided by the variance of $\hat{\beta}_j$ if fit on its own. The smallest possible value of VIF is one, which indicates a complete absence of collinearity. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity (James et al. 2014). The VIF values for the predictors of model (4.6) are VIF (H0) = 1.15, VIF (OpqC) = 2.49, VIF (RH) = 1.89, and VIF (TRange) = 2.27. These values indicate that there is no collinearity problem.
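In practice the VIF is usually computed as $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors (James et al. 2014). The sketch below follows that formula on synthetic data in which one column is nearly a copy of another; the function name `vif` and the data are illustrative assumptions.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    n, m = X.shape
    out = np.empty(m)
    for j in range(m):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ beta
        tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = tss / (resid @ resid)      # TSS / RSS equals 1 / (1 - R_j^2)
    return out

# Synthetic predictors: x3 is nearly a copy of x1, x2 is independent of both.
rng = np.random.default_rng(4)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.05 * rng.normal(size=500)
values = vif(np.column_stack([x1, x2, x3]))
```

The two nearly identical columns receive very large values while the independent column stays close to one, the same pattern as a pair of predictors that carry almost the same information.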

Comparison with other models

Five variable model: The best model includes H0, OpqC, RH, T, and TDaylight. We calculated the VIF values of this model to test for collinearity. The VIF values are VIF (H0) = 3.49, VIF (OpqC) = 3.03, VIF (RH) = 1.83, VIF (TDaylight) = 434.99, and VIF (T) = 441.18. The last two values indicate a severe collinearity problem. Accordingly, the four-variable model is preferred.

[Figure 4.7 panel titles: Tallahassee Regional -FL; Corpus Christi Intl Arpt-TX]

Model without predictor H0: Since the model coefficient of H0 is very small, we repeated the analysis excluding H0. The best one-variable model is again the OpqC model. For the two-variable model, RH enters. Both variables remain to the end of the best subsets. For the three-variable model, TDaylight enters. In the four-variable model, TDaylight leaves and TMin and TRange enter, and both remain to the end of the best subsets. However, the adjusted $R^2$ values are slightly better for the models including H0, starting from the three-variable model, where H0 enters. The comparison is below:

Adjusted R2 for the model with H0 predictor:

Adjusted R2 for the model without H0 predictor:

Equation 4.9 represents the fitted four-variable model without H0:

$$K = 0.7094 + 0.0018\,\mathrm{TMin} - 0.0204\,\mathrm{OpqC} - 0.0025\,\mathrm{RH} + 0.0062\,\mathrm{TRange} + \epsilon \tag{4.9}$$

The diagnostic analysis gave similar results.

In document Understanding Coupling of Global and Diffuse Solar Radiation with Climatic Variability (Page 48-55)