CHAPTER 4: GLOBAL RADIATION
4.2 Multiple Linear Regression Analysis
4.2.1 Models with Clearness Index (K) response
The main goal of choosing K as a response is to study the effect of atmospheric parameters on the fraction of solar radiation that reaches the earth's surface. All atmospheric parameters were included in the models as predictors. Although we expect a negative correlation between K and H0, since $K = H/H_0$, the correlation matrix in Figure 4.2 shows a moderate positive correlation between them. Accordingly, H0 was added to the predictors for inspection.
A- Variable selection
We used the "best subset selection" technique, which gives the best model for each number of predictors. Then we used methods that adjust the training error for the model size, namely adjusted R2, Mallows' Cp, and the Bayesian information criterion (BIC), to determine the best model overall. Below is a brief explanation of these methods (James et al. 2014).
The best model is the one with the lowest test (data) mean squared error (MSE), given by Equation 4.1:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \frac{\mathrm{RSS}}{n} \tag{4.1}$$
where n is the number of observations, $y_i$ is the ith response (observed value), $\hat{y}_i$ is the ith fitted value, and $\mathrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ is the residual sum of squares.
However, we first use training data to fit the models. The training MSE generally underestimates the test MSE. The reason is that we fit a model to the training data, using least squares, to estimate the regression coefficients such that the training RSS (but not the test RSS) is as small as possible. Therefore, the training RSS cannot be used to select the best model among a set of models with different numbers of variables. To adjust the training MSE for the model size, any of the following approaches can be used:
1- $$C_p = \mathrm{MSE} + \frac{2d}{n}\,\hat{\sigma}^2 \tag{4.2}$$
where d is the number of predictors, and $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement ($\epsilon_i = y_i - \hat{y}_i$). Note that $\hat{\sigma}^2 = \mathrm{RSS}/(n - d - 1)$.
Accordingly, the Cp statistic adds a penalty of $\frac{2d}{n}\hat{\sigma}^2$ to the training MSE in order to adjust for the fact that the training error tends to underestimate the test error. This penalty increases as the number of predictors in the model increases.
2- $$\mathrm{BIC} = \mathrm{MSE} + \log(n)\,\frac{d}{n}\,\hat{\sigma}^2 \tag{4.3}$$
Since log(n) > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables than Cp does, and therefore tends to select smaller models. Cp and BIC are indirect estimates of the test MSE (James et al. 2014).
3- Adjusted R2
R2 represents the proportion of variability in the responses Y ($y_1, \dots, y_n$) explained by the model:
$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \tag{4.4}$$
where $\mathrm{TSS} = \sum_{i=1}^{n}(y_i - \bar{y})^2$ is the total sum of squares, which measures the total variability in the responses Y, and $\bar{y}$ is the average of the responses Y.
Since RSS never increases as we add variables to a model, R2 never decreases. The adjusted R2 statistic, given by Equation 4.5, adds a penalty for increasing the number of variables in a model.
$$\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)} \tag{4.5}$$
For Cp and BIC techniques, the best model is the model with the smallest value of Cp or BIC, while for adjusted R2 the best model is the one with the largest adjusted R2.
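As a minimal sketch, the three criteria can be computed from a model's RSS and TSS as shown below. The function name `selection_criteria`, its arguments, and the numbers in the example call are illustrative assumptions, not values from this analysis; in particular, the variance estimate is simply passed in as a given value.

```python
import math

def selection_criteria(rss, tss, n, d, sigma2_hat):
    """Return (Cp, BIC, adjusted R^2) for one fitted model.

    rss, tss   : residual and total sums of squares
    n, d       : number of observations and number of predictors
    sigma2_hat : estimate of the error variance (assumed to be supplied)
    """
    mse = rss / n                                        # training MSE
    cp = mse + 2 * d / n * sigma2_hat                    # MSE plus the Cp penalty
    bic = mse + math.log(n) * d / n * sigma2_hat         # MSE plus the BIC penalty
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))   # adjusted R^2
    return cp, bic, adj_r2

# Made-up numbers, for illustration only.
cp, bic, adj_r2 = selection_criteria(rss=12.0, tss=50.0, n=100, d=2, sigma2_hat=0.1)
```

When several candidate models are compared, the preferred one has the smallest `cp` or `bic`, or the largest `adj_r2`.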
Applying the "best subset selection" technique to the models with response K, we obtained the results shown in Figure 4.3. For the one variable subset, the best model is the opaque cloud cover (OpqC) model. OpqC remains in the best subsets up to the five variable model, after which it is replaced by total cloud cover (TotC). For the two variable subset, relative humidity (RH) enters; it continues in the best subset models to the end. For the three variable subset, extraterrestrial radiation (H0) enters and continues in the best subset models to the end.
To determine the best of these best subset models, we applied the "adjusting training error" criteria. Below are the adjusted R2 values of the models shown in Figure 4.3, arranged in the same order. Figure 4.4 shows the selection results of adjusted R2, Cp and BIC for the above models.
Figure 4.3: Best subset selection results for the models with response K
Figure 4.4 shows that the best model according to adjusted R2 and Cp is the model with all variables, while BIC selected the model with nine variables. However, after adding the third variable, the improvement in test MSE starts to flatten. For convenience and simplicity, we analyzed the four variable model.
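The best subset search used in this subsection can be sketched as follows: for each model size, every combination of predictors is fitted by least squares and the one with the smallest training RSS is kept. This is an illustrative sketch in plain Python, not the code used for the analysis; the helper names (`solve`, `rss_of_fit`, `best_subsets`) and the toy data are assumptions made for the example.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def rss_of_fit(columns, y):
    """Fit y on an intercept plus the given columns by least squares; return the RSS."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)  # normal equations
    return sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def best_subsets(data, y):
    """Map each size d to (RSS, predictor names) of the subset with the smallest RSS."""
    names = list(data)
    best = {}
    for d in range(1, len(names) + 1):
        best[d] = min(
            ((rss_of_fit([data[v] for v in subset], y), subset)
             for subset in combinations(names, d)),
            key=lambda t: t[0],
        )
    return best

# Toy data (made up): y is roughly 2*x1 - 3*x2, and x3 is unrelated.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "x3": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
}
y = [-0.9, 4.2, 2.9, 7.8, 7.2, 12.1, 10.8, 15.9]
best = best_subsets(data, y)
```

Because the training RSS can only shrink as predictors are added, the models returned for different sizes still have to be compared with a size-adjusted criterion such as adjusted R2, Cp or BIC.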
B- Four variable model
The best model of the four variable subset is the model with the predictors OpqC, RH, H0, and TRange.
Fitting the model
Figure 4.5 shows the results of fitting the four variable model. The coefficients of all variables are highly significant. K is negatively correlated with OpqC and RH, and positively correlated with H0 and TRange. The coefficient of H0 is very small compared to the other coefficients. This model explained only 87.28% of the variability in the K responses. It is worth mentioning that 0.2 < K < 0.8. The typical value of K for a clear sky day is between 0.65 and 0.75 (Suehrcke, 2000).
Equation 4.6 represents the fitted model:
$$K = 0.701 + 6.379 \times 10^{-6}\,H_0 - 0.023\,\mathrm{OpqC} - 0.002\,\mathrm{RH} + 0.004\,\mathrm{TRange} + \epsilon \tag{4.6}$$
Figure 4.5: The results of fitting K against OpqC, RH, H0, and TRange variables
Diagnostic analysis
Figure 4.6 shows four major diagnostic plots for model (4.6). The residuals plot, A, shows no pronounced pattern of nonlinearity in the data or non-constant variance of the residuals, although there is a small accumulation of points below the horizontal line $\epsilon = 0$ at the right corner. These points belong to the NV station for June, July and August data.
The normal Q-Q plot, B, tests the assumption $\epsilon \sim N(0, \sigma^2)$. The plot shows no significant deviation from normality; the majority of the points lie on the Q-Q line. The point above the Q-Q line represents UT Jan-2001, while the points below the Q-Q line represent (starting from the furthest point) ND Jan-1999, WI Jan-1999, WI Jan-2000, and WI Jan-2001. All these points are potential outliers.
Figure 4.6: Diagnostic plots for model (4.6). A- Linearity and constant variance of $\epsilon$ test. B- Normality test ($\epsilon \sim N(0, \sigma^2)$). C- Outliers test. D- High leverage points test.
Studentized residuals were used in the Q-Q plot, as well as in plots C and D. Studentized residuals, also called jackknife residuals, are calculated using Equation 4.7:
$$\text{Studentized } \epsilon_i = \hat{y}_{(i)} - y_i \tag{4.7}$$
where $\hat{y}_{(i)}$ is the predicted value of response i, calculated from a model fitted by excluding point i, and $y_i$ is the ith observed response. This difference is divided by its estimated standard error before it is compared with a critical value.
Accordingly, Studentized residuals reveal possible outliers that pull the regression line so close to themselves that they conceal their true status. If the Studentized residual of a point is large, then this point is an outlier. The red lines in Figure 4.6-C represent the Bonferroni critical value of the Studentized residuals, beyond which the points are outliers (Faraway, 2002). The critical value was calculated using Equation 4.8 with $\alpha = 0.05$.
$$\text{Bonferroni critical value} = \text{quantile of } t\left(\frac{\alpha}{2n},\; n - p - 1\right) \tag{4.8}$$
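As a worked check (not a value quoted from the analysis), with the 720 points reported for this model and a significance level of 0.05, the tail probability passed to the t quantile is:

$$\frac{\alpha}{2n} = \frac{0.05}{2 \times 720} = \frac{0.05}{1440} \approx 3.47 \times 10^{-5}$$

Dividing the significance level by the number of points keeps the overall chance of falsely flagging at least one point at or below 0.05 when all n residuals are tested.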
We notice that UT Jan-2001 is an outlier; ND Jan-1999 and WI Jan-1999 are almost outliers. These points are mild outliers; they deviate slightly from the Q-Q line in plot B. Given that we have 720 points, these outliers are of no concern, especially since they do not have high leverage. Figure 4.6-D reveals the high-leverage points; the red line (Leverage = 2p/n) in plot D is just a "rule of thumb" critical value. Two points have seriously high leverage: Corpus Christi-TX Jun-1999 and Jul-1999. Fortunately, they are not outliers and consequently are not influential points.
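The leave-one-out idea behind the jackknife residuals can be sketched as follows. For brevity the sketch uses a single predictor and returns only the unscaled difference between the leave-one-out prediction and the observation; the function names and the data are made up for illustration, and the scaling by the standard error is omitted.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def leave_one_out_differences(xs, ys):
    """For every point i, refit without point i and return prediction minus observation."""
    diffs = []
    for i in range(len(xs)):
        x_rest = xs[:i] + xs[i + 1:]
        y_rest = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(x_rest, y_rest)
        diffs.append(intercept + slope * xs[i] - ys[i])
    return diffs

# Made-up data: points on y = 2x + 1, except the last one, which is shifted up by 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0, 18.0]
diffs = leave_one_out_differences(xs, ys)
```

In this toy example the shifted point has the largest leave-one-out difference in absolute value, because the line fitted without it is not pulled toward it.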
An important assumption of the linear regression model is that the error terms $\epsilon_1, \dots, \epsilon_n$ are uncorrelated. If there is correlation among the error terms, the true standard errors will be underestimated. In this case, confidence and prediction intervals will be narrower than they should be, and p-values associated with the model will be lower than they should be (James et al. 2014). Figure 4.7 shows residuals versus time for two stations, Tallahassee Regional AP-FL on the left and Corpus Christi Intl Arpt-TX on the right. No pronounced correlation exists among the residuals.
Finally, we computed the variance inflation factor (VIF) of the model variables to check for a multicollinearity problem. Denoting the estimated value of the model coefficient for variable j by $\hat{\beta}_j$, the VIF is the ratio of the variance of $\hat{\beta}_j$ when fitting the full model to the variance of $\hat{\beta}_j$ if fit on its own. The smallest value for VIF is one, which indicates a complete absence of collinearity. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity (James et al. 2014). VIF values for the model (4.6) predictors are VIF (H0) = 1.15, VIF (OpqC) = 2.49, VIF (RH) = 1.89, and VIF (TRange) = 2.27. These values indicate that there is no collinearity problem.
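A minimal sketch of a VIF computation is given below. It uses the standard equivalent form $\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing predictor j on all the other predictors, rather than the variance ratio stated above. The helper names and the toy data are assumptions made for the example and are not the station data.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(columns, y):
    """R^2 of the least-squares fit of y on an intercept plus the given columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)  # normal equations
    rss = sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss

def vif(data):
    """Return {name: VIF}, regressing each predictor on all the others."""
    out = {}
    for name in data:
        others = [data[v] for v in data if v != name]
        out[name] = 1.0 / (1.0 - r_squared(others, data[name]))
    return out

# Toy predictors (made up): b is almost a copy of a; c is unrelated to both.
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "b": [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1],
    "c": [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0],
}
vifs = vif(data)
```

In this toy example the two nearly identical predictors receive VIF values far above the rule-of-thumb thresholds of 5 or 10, while the unrelated predictor stays at about one.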
Comparison with other models
Five variable model: The best model includes H0, OpqC, RH, T, and TDaylight. We calculated the VIF values of this model to test for collinearity. The values are VIF (H0) = 3.49, VIF (OpqC) = 3.03, VIF (RH) = 1.83, VIF (TDaylight) = 434.99, and VIF (T) = 441.18. The last two values indicate a serious collinearity problem. Accordingly, the four variable model is preferred.
Model without predictor H0: Since the model coefficient of variable H0 is very small, we repeated the analysis excluding H0. The best "one variable model" is again the OpqC model. For the two variable model, RH enters. Both variables continue to the end of the best subsets. For the three variable model, TDaylight enters. In the four variable model, TDaylight leaves and TMin and TRange enter, and both continue to the end of the best subsets. However, the adjusted R2 values are slightly better for the models including H0, starting from the three variable model, where H0 enters the model. The comparison is below:
Adjusted R2 for the model with H0 predictor:
Adjusted R2 for the model without H0 predictor:
Equation 4.9 represents the fitted four variable model without H0:
$$K = 0.7094 + 0.0018\,\mathrm{TMin} - 0.0204\,\mathrm{OpqC} - 0.0025\,\mathrm{RH} + 0.0062\,\mathrm{TRange} + \epsilon \tag{4.9}$$
The diagnostic analysis gave similar results.