2.2 Ordinary Least Squares Regression Simulation Study

2.2.5 Random Data Splitting

In this section, we investigate whether random data splitting is a useful tool for getting around the problem that the performance of the ordinary least squares prediction variance formula cannot be assessed directly from any single data set. If random data splitting is to be used in the study of principal components regression and partial least squares regression, it should first work for ordinary least squares regression.

Simulation 2.4. Random Data Splitting for One Set of Simulated Data

The simulated data set is randomly permuted and split into a calibration set with $n$ observations, a tuning set with $n_t$ observations, and a prediction set with $n_p$ observations. The three data sets are exchangeable. Regression coefficients are estimated from the calibration set and are used to make predictions for the tuning set and the prediction set. The squared prediction errors and leverages in both the tuning set and the prediction set are saved for comparison.
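The splitting procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the study: the function names are hypothetical, the toy data at the bottom are invented for the example, and the predictors are assumed to be centered by the calibration means so that the leverage takes the form $x (X_c' X_c)^{-1} x'$.

```python
import numpy as np

def split_three_ways(X, y, n, n_t, n_p, rng):
    """Randomly permute the rows, then cut them into calibration, tuning and prediction sets."""
    idx = rng.permutation(len(y))[: n + n_t + n_p]
    c, t, p = idx[:n], idx[n:n + n_t], idx[n + n_t:]
    return (X[c], y[c]), (X[t], y[t]), (X[p], y[p])

def fit_centered_ols(X_cal, y_cal):
    """OLS with intercept, fitted on data centered by the calibration means (assumption)."""
    x_mean, y_mean = X_cal.mean(axis=0), y_cal.mean()
    Xc = X_cal - x_mean
    beta, *_ = np.linalg.lstsq(Xc, y_cal - y_mean, rcond=None)
    XtX_inv = np.linalg.inv(Xc.T @ Xc)
    return x_mean, y_mean, beta, XtX_inv

def spe_and_leverage(model, X_new, y_new):
    """Squared prediction errors and leverages for new observations."""
    x_mean, y_mean, beta, XtX_inv = model
    Xn = X_new - x_mean
    y_hat = y_mean + Xn @ beta
    spe = (y_new - y_hat) ** 2
    # Row-wise quadratic form x (X_c' X_c)^{-1} x'
    lev = np.einsum("ij,jk,ik->i", Xn, XtX_inv, Xn)
    return spe, lev

# Toy data for illustration only (one predictor, noise variance 0.25).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 1))
y = 1.0 + X[:, 0] + rng.normal(scale=0.5, size=600)

cal, tun, pred = split_three_ways(X, y, 200, 200, 200, rng)
model = fit_centered_ols(*cal)
spe_t, h_t = spe_and_leverage(model, *tun)   # tuning set
spe_p, h_p = spe_and_leverage(model, *pred)  # prediction set
```

The same fitted model scores both held-out sets, so any difference between the tuning and prediction results comes from the data in those sets rather than from the fit.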

To investigate how the noise in the data set influences the result, we fix the noise term in the prediction set at $\xi_p = 0.25$. The regression model can be written as

$$\dot{y}_c = \beta_0 + \dot{X}_c \beta + \xi_c,$$

$$\dot{y}_t = \beta_0 + \dot{X}_t \beta + \xi_t,$$

$$\dot{y}_p = \beta_0 + \dot{X}_p \beta + \xi_p.$$

The parameters of the numerical experiment are configured as follows: $N = 100000$, $k = 1$, $\dot{X}_c \sim N(0, 1)$, $\beta_0 = \beta_1 = 1$, $\sigma_\xi^2 = 0.25$, $n = 200$, $n_t = 200$, and $n_p = 200$.
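As a rough illustration of this configuration, the three sets can be drawn directly from the model. The sketch below makes several assumptions that are not stated in the text: the constant and function names are hypothetical, the tuning and prediction predictors are drawn from the same $N(0, 1)$ distribution as the calibration predictors, the fixed noise term is applied literally as $\xi_p = 0.25$ for every prediction observation, and $N$ is not used because each set is drawn directly at its final size.

```python
import numpy as np

# Values quoted in the text; the names are illustrative.
BETA0, BETA1 = 1.0, 1.0
SIGMA2 = 0.25                          # noise variance sigma_xi^2
N_CAL, N_TUN, N_PRED = 200, 200, 200   # n, n_t, n_p
XI_P_FIXED = 0.25                      # fixed noise term in the prediction set

def draw_set(rng, m, fixed_noise=None):
    """Draw m observations from y = beta0 + beta1 * x + xi with k = 1 predictor."""
    x = rng.normal(0.0, 1.0, size=m)
    if fixed_noise is None:
        xi = rng.normal(0.0, np.sqrt(SIGMA2), size=m)  # random noise
    else:
        xi = np.full(m, fixed_noise)                   # noise held constant (assumption)
    return x, BETA0 + BETA1 * x + xi

def simulate_sets(rng):
    """Calibration and tuning sets with random noise, prediction set with fixed noise."""
    cal = draw_set(rng, N_CAL)
    tun = draw_set(rng, N_TUN)
    pred = draw_set(rng, N_PRED, fixed_noise=XI_P_FIXED)
    return cal, tun, pred

rng = np.random.default_rng(1)
(x_c, y_c), (x_t, y_t), (x_p, y_p) = simulate_sets(rng)
```

Holding the prediction noise constant removes one source of randomness from the prediction set, which is why its squared prediction errors are expected to be far less scattered than those of the tuning set.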

In Figure 2.5, the green points (SPEt) show the average squared prediction error against the average leverage calculated from the tuning set. The green line (SPEt fit) is the ordinary least squares fit of all squared prediction errors against leverage in the tuning set. The pink points (SPE) show the average squared prediction error against the average leverage in the prediction set. The pink dashed line (SPE fit) is the ordinary least squares fit of all squared prediction errors against leverage in the prediction set. The light blue line (OLS) is given by the ordinary least squares regression variance, where $\sigma_\xi^2 = 0.25$ is used for simplicity.
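The form of the OLS line can be motivated by the standard decomposition of the prediction error. The derivation below is a sketch under assumptions not spelled out in this section: the errors are independent with common variance $\sigma_\xi^2$, the predictors are treated as fixed, and $X_c$ and $x_p$ denote the calibration and prediction predictors centered by the calibration column means, so that $X_c' \mathbf{1} = 0$.

```latex
\begin{align*}
\hat{\dot{y}}_p &= \bar{\dot{y}}_c + x_p \hat{\beta},
  \qquad \hat{\beta} = (X_c' X_c)^{-1} X_c' \dot{y}_c, \\
\dot{y}_p - \hat{\dot{y}}_p
  &= \xi_p - \bar{\xi}_c - x_p (X_c' X_c)^{-1} X_c' \xi_c, \\
\operatorname{Cov}\!\left(\bar{\xi}_c,\; (X_c' X_c)^{-1} X_c' \xi_c\right)
  &= \frac{\sigma_\xi^2}{n} (X_c' X_c)^{-1} X_c' \mathbf{1} = 0, \\
\operatorname{Var}\!\left(\dot{y}_p - \hat{\dot{y}}_p\right)
  &= \sigma_\xi^2 + \frac{\sigma_\xi^2}{n} + \sigma_\xi^2\, x_p (X_c' X_c)^{-1} x_p'
   = \sigma_\xi^2 \left( \frac{1}{n} + h + 1 \right).
\end{align*}
```

The three terms correspond to the noise in the new observation, the uncertainty in the calibration mean, and the uncertainty in the estimated coefficients, with $h = x_p (X_c' X_c)^{-1} x_p'$ the leverage of the new observation. The variance is therefore linear in $h$ with slope $\sigma_\xi^2$.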

The pink points form a straight line (SPE fit). The light blue line (OLS) and the pink dashed line (SPE fit) overlap because the error term is fixed at 0.25. The green points (SPEt) are so noisy that the green line (SPEt fit) differs noticeably from the blue and pink lines.

If the error term in the prediction set were not fixed, the tuning set and the prediction set would give the same result, because the tuning set and the prediction set are exchangeable. The noisy green points in Figure 2.5 suggest that random data splitting cannot get around the problem for a real data set, because the noise inherent in a single data set is systematically amplified by the splitting. The tuning set can nevertheless be used in the estimation of the regression variance, serving as an adjustment for this particular set of data.

Chapter 2. OLS Prediction Uncertainty

[Figure 2.5 plot. x-axis: Average Leverage; y-axis: Average Squared Prediction Error; legend: SPEt, SPEt fit, SPE, SPE fit, OLS]

Figure 2.5: OLS Average Squared Prediction Error versus Average Leverage for One Set of Simulated Data, random data splitting, $\xi_p = 0.25$. SPEt: average squared prediction error in the fit of the tuning set, $(\dot{y}_t - \hat{\dot{y}}_t)^2$, against average leverage $h_t = x_t (X_c' X_c)^{-1} x_t'$. SPEt fit: the ordinary least squares fit of all squared prediction errors in the fit of the tuning set. SPE: average squared prediction error $(\dot{y}_p - \hat{\dot{y}}_p)^2$ against average leverage $h = x_p (X_c' X_c)^{-1} x_p'$. SPE fit: the ordinary least squares fit of all squared prediction errors. OLS: the ordinary least squares prediction variance $\operatorname{Var}(\dot{y}_p - \hat{\dot{y}}_p) = \sigma_\xi^2 \left( \frac{1}{n} + h + 1 \right)$.

Simulation 2.5. Random Data Splitting Simulation Study for 400,000 Sets of Simulated Data

[Figure 2.6 plot. x-axis: Average Leverage; y-axis: Average Squared Prediction Error; legend: SPEt, SPEt fit, SPE, SPE fit, OLS]

Figure 2.6: OLS Average Squared Prediction Error versus Average Leverage for 400,000 Sets of Simulated Data, random data splitting. SPEt: average squared prediction error in the fit of the tuning set, $(\dot{y}_t - \hat{\dot{y}}_t)^2$, against average leverage $h_t = x_t (X_c' X_c)^{-1} x_t'$. SPEt fit: the ordinary least squares fit of all squared prediction errors in the fit of the tuning set. SPE: average squared prediction error $(\dot{y}_p - \hat{\dot{y}}_p)^2$ against average leverage $h = x_p (X_c' X_c)^{-1} x_p'$. SPE fit: the ordinary least squares fit of all squared prediction errors. OLS: the ordinary least squares prediction variance $\operatorname{Var}(\dot{y}_p - \hat{\dot{y}}_p) = \sigma_\xi^2 \left( \frac{1}{n} + h + 1 \right)$.

In Simulation 2.2, it was verified that the ordinary least squares prediction variance describes the average behavior over data sets. To illustrate this using random data splitting, we run 400,000 replicates; in each replicate, a set of simulated data is generated and then randomly split into the calibration set, the tuning set, and the prediction set. All other simulation parameters are kept the same as in Simulation 2.4. Figure 2.6 shows the result.
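A scaled-down version of this replicate study can be sketched as follows. It is an illustration under stated assumptions rather than the study's own code: the function names are hypothetical, the number of replicates defaults to 2,000 instead of 400,000 to keep the run short, the noise is random in every set, and each replicate simulates exactly $n + n_t + n_p$ observations before splitting.

```python
import numpy as np

def one_replicate(rng, n=200, n_t=200, n_p=200, beta0=1.0, beta1=1.0, sigma2=0.25):
    """Simulate one data set, split it at random, fit on the calibration part,
    and score the tuning and prediction parts."""
    m = n + n_t + n_p
    x = rng.normal(size=m)
    y = beta0 + beta1 * x + rng.normal(scale=np.sqrt(sigma2), size=m)
    perm = rng.permutation(m)                       # random data splitting
    c, t, p = perm[:n], perm[n:n + n_t], perm[n + n_t:]

    x_bar, y_bar = x[c].mean(), y[c].mean()
    xc = x[c] - x_bar                               # centered calibration predictor
    sxx = xc @ xc
    b1 = (xc @ y[c]) / sxx                          # OLS slope from the calibration set

    def score(idx):
        xn = x[idx] - x_bar
        spe = (y[idx] - (y_bar + b1 * xn)) ** 2     # squared prediction error
        return spe, xn ** 2 / sxx                   # leverage for k = 1
    return score(t), score(p)

def run(n_rep=2000, seed=0, n=200, sigma2=0.25):
    """Pool all replicates and compare the mean SPE with the variance formula."""
    rng = np.random.default_rng(seed)
    reps = [one_replicate(rng, n=n, sigma2=sigma2) for _ in range(n_rep)]
    summary = {}
    for name, k in (("tuning", 0), ("prediction", 1)):
        spe = np.concatenate([r[k][0] for r in reps])
        h = np.concatenate([r[k][1] for r in reps])
        slope, intercept = np.polyfit(h, spe, 1)    # OLS fit of SPE against leverage
        theory = sigma2 * (1 + 1 / n + h.mean())    # formula evaluated at mean leverage
        summary[name] = (spe.mean(), theory, slope, intercept)
    return summary

summary = run()
for name, (mean_spe, theory, slope, intercept) in summary.items():
    print(f"{name}: mean SPE = {mean_spe:.4f}, formula at mean leverage = {theory:.4f}")
```

Because the formula is linear in the leverage, evaluating it at the mean leverage gives its average over all pooled observations, which is the quantity the pooled mean squared prediction error should approach as the number of replicates grows. The fitted slope is much noisier than the mean at this reduced replicate count, so it is returned for inspection rather than compared against a target.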

The pink points (SPE) give the average squared prediction error against the average leverage, and the pink dashed line (SPE fit) is the ordinary least squares fit of all squared prediction errors against leverage. The green points (SPEt) give the average squared prediction error against the average leverage calculated from the tuning set, and the green line (SPEt fit) is the ordinary least squares fit of all squared prediction errors against leverage in the tuning set. The light blue line (OLS) is drawn from the ordinary least squares regression variance.

The pink line, the green line, and the blue line overlap. The pink points and the green points are noisy and show different patterns, but both are fitted by the overlapping lines. This result is consistent with what we saw in Figure 2.1(c): the ordinary least squares prediction variance is the result of taking the expectation over many different data sets.

In document Quantification of Prediction Uncertainty for Principal Components Regression and Partial Least Squares Regression (Page 44-48)