1.3. MULTIVARIATE STATISTICAL ANALYSIS TECHNIQUES
1.3.4. Correlated Component Regression (CCR)
Correlated Component Regression (CCR) is an ensemble dimension reduction regression technique that provides reliable predictions even with near multicollinear data (72). Correlation amongst predictor variables, known as multicollinearity, is a common problem encountered when dealing with large datasets, and consequently, traditional regression techniques may estimate unstable coefficients (73). Situations where the number of predictor variables P approaches or exceeds the sample size N give rise to major problems, and such instability is often accompanied by perfect or near-perfect predictions for the samples used to develop the regression model. However, this seemingly good predictive performance is usually a symptom of overfitting, and does not hold when new samples are employed to test the model (72).
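The overfitting described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are synthetic random numbers, the sizes N = 20 and P = 30 are arbitrary, and the model is a minimum-norm least-squares fit rather than any method discussed in the cited works.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination relative to the mean of y."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
N, P = 20, 30                        # more predictors than samples (assumed sizes)
X_train = rng.normal(size=(N, P))
y_train = rng.normal(size=N)         # outcome unrelated to every predictor
X_test = rng.normal(size=(200, P))
y_test = rng.normal(size=200)

# Minimum-norm least-squares fit with an intercept column
A_train = np.column_stack([np.ones(N), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

A_test = np.column_stack([np.ones(len(y_test)), X_test])
train_r2 = r_squared(y_train, A_train @ coef)
test_r2 = r_squared(y_test, A_test @ coef)
print(f"training R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

Because there are more free coefficients than samples, the training outcomes are reproduced almost exactly even though they are pure noise, whereas the fit on new samples is poor.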
CCR proposes the prediction of the dependent variable based on K correlated components. If K = 1, CCR is equivalent to the corresponding Naïve Bayes solution (probabilistic classifier based on the application of Bayes' theorem with independence assumptions) (74). It is further suggested that for K > 1, CCR represents a natural extension of a Naïve Bayes model applied to multiple dimensions (74).
Estimation of components: CCR utilizes K < P correlated components in place of the
P predictors to estimate the response variable (75). Each component Sk is an exact linear combination of the variables, X = (X1, X2,…XP). The first component S1 captures the effects of those predictors that have direct effects on the outcome (72). Unlike PCA, the objective is not to explain the correlation/covariance matrix through the weights of the components (percentage of variance explained), but to compute the weights with the aim of maximizing
the ability to predict the outcome (75). The CCR-linear model (CCR-LM) algorithm proceeds as follows:
Estimate the loading λg(1) on S1, for each predictor g = 1, 2, …, P, as the slope coefficient from the simple regression of Y on Xg (Equation 8). Component S1 is then formed as the weighted sum of the predictors using these loadings (Equation 9), so that it captures the direct effects of X on Y by combining the P one-predictor models.
\[ \lambda_g^{(1)} = \frac{\mathrm{Cov}(Y, X_g)}{\mathrm{Var}(X_g)} \qquad (8) \]
\[ S_1 = \sum_{g=1}^{P} \lambda_g^{(1)} X_g \qquad (9) \]
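Equations 8 and 9 can be sketched in a few lines of NumPy. The function name `first_component` and the synthetic data are assumptions made for illustration only; they are not taken from the cited works.

```python
import numpy as np

def first_component(X, y):
    """Return the loadings of Equation 8 and the component scores S1 of Equation 9."""
    Xc = X - X.mean(axis=0)          # centre each predictor
    yc = y - y.mean()                # centre the outcome
    # Cov(Y, Xg) / Var(Xg): slope of the simple regression of Y on Xg
    lam = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)
    s1 = X @ lam                     # weighted sum of the predictors
    return lam, s1

# Synthetic example data (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.5, 0.0, -0.5]) + rng.normal(scale=0.1, size=50)
lam, s1 = first_component(X, y)
print(lam)
```

Each loading depends on a single predictor only, so this step remains well defined even when the predictors are strongly correlated or when P exceeds N.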
The first component S1 is computed as a linear combination of X using λg(1) as weights. Predictions for the 1-component CCR model are obtained from the simple OLS regression of Y on S1; similarly, predictions for the 2-component CCR model are obtained from the OLS regression of Y on S1 and S2. Unlike the components produced by other techniques (PCA, Partial Least Squares), which are orthogonal, CCR components are correlated with one another; this correlation allows the mutual enhancement of the predictive abilities of the whole set of components, so that S2 improves the performance of predictor S1, S3 improves that of S2, and so forth, with the aim of removing ‘extraneous variation’.
Component Sk′, for k′ > 1, is defined as a weighted average of all 1-predictor partial effects, where the partial effect for predictor g is computed as the partial regression coefficient in the OLS regression of Y on Xg and on all previously computed components Sk, k = 1, …, k′ − 1 (75). For the second component (k′ = 2), the loading λg(2) is therefore obtained from:
\[ Y = \alpha + \gamma_{1.g}^{(2)} S_1 + \lambda_g^{(2)} X_g + \varepsilon_g^{(2)} \qquad (10) \]
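The sequential estimation of the components can be sketched as follows. This is a minimal illustration of the steps described above, not the reference implementation of CCR-LM; the function name `ccr_components` and the synthetic data are assumptions.

```python
import numpy as np

def ccr_components(X, y, K):
    """Return a (P, K) loading matrix and an (N, K) matrix of component scores."""
    n, P = X.shape
    loadings = np.zeros((P, K))
    S = np.zeros((n, 0))             # no components computed yet
    for k in range(K):
        for g in range(P):
            # Regress Y on an intercept, all previous components and Xg
            # (Equation 10 shows the case of the second component)
            design = np.column_stack([np.ones(n), S, X[:, g]])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            loadings[g, k] = coef[-1]            # partial effect of Xg
        S = np.column_stack([S, X @ loadings[:, k]])
    return loadings, S

# Synthetic data with two nearly collinear predictors (assumed)
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=60)
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.2, size=60)
loadings, S = ccr_components(X, y, K=2)
print(np.corrcoef(S[:, 0], S[:, 1])[0, 1])       # correlation between S1 and S2
```

For the first pass of the outer loop no previous components exist, so each loading reduces to the simple regression slope of Equation 8. The printed correlation is generally non-zero, reflecting the fact that the components are not constrained to be orthogonal.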
With the aim of representing the model developed in a simpler fashion by regression coefficients, the CCR model can be expressed as follows:
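\[ \hat{Y} = \alpha^{(K)} + \sum_{k=1}^{K} b_k^{(K)} S_k = \alpha^{(K)} + \sum_{k=1}^{K} b_k^{(K)} \sum_{g=1}^{P} \lambda_g^{(k)} X_g = \alpha^{(K)} + \sum_{g=1}^{P} \beta_g X_g \qquad (11) \]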
Therefore, the regression coefficient 𝛽𝑔 is the weighted sum of the loadings, where
the weights are the regression coefficients for the components (75):
\[ \beta_g = \sum_{k=1}^{K} b_k^{(K)} \lambda_g^{(k)} \qquad (12) \]
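The equivalence between the component form and the coefficient form of the model (Equations 11 and 12) can be checked numerically. In the sketch below the loading matrix is filled with arbitrary placeholder values rather than loadings estimated by CCR, since the identity holds for any loadings; the function name `collapse_to_coefficients` is likewise an assumption.

```python
import numpy as np

def collapse_to_coefficients(X, y, loadings):
    """Regress y on the components S = X @ loadings; return (alpha, b, beta)."""
    n = X.shape[0]
    S = X @ loadings                             # component scores, one column per k
    design = np.column_stack([np.ones(n), S])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    alpha, b = coef[0], coef[1:]                 # intercept and component weights
    beta = loadings @ b                          # Equation 12: weighted sum of loadings
    return alpha, b, beta

# Synthetic data and placeholder loadings (assumed, not estimated by CCR)
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 6))
y = X[:, 0] - X[:, 3] + rng.normal(scale=0.3, size=40)
loadings = rng.normal(size=(6, 2))               # P = 6 predictors, K = 2 components

alpha, b, beta = collapse_to_coefficients(X, y, loadings)
S = X @ loadings
pred_components = alpha + S @ b                  # first form in Equation 11
pred_coefficients = alpha + X @ beta             # last form in Equation 11
print(np.allclose(pred_components, pred_coefficients))
```

Both forms give identical predictions, which is why a K-component model can be reported with one regression coefficient per predictor.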
M-fold Cross Validation (CV): The purpose of CV is to assess, and thereby help to improve, the predictive performance of the classification model by assigning a fraction of the original sample to a training set and the remainder to a ‘test set’. The training set contains the samples employed to develop the model, whilst the test set is employed to check how effectively the model predicts the classification status of samples that played no part in its development.
CCR employs M-fold validation, and runs this procedure several times iteratively. Each round provides one set of CV performance statistics. One round of M-fold validation randomly divides a sample of n cases into M mutually-exclusive sub-groups, known as folds, and obtains a similar number of samples within each fold.
The first fold is set aside as the ‘test sample’, and the remaining folds are used as ‘training samples’ to estimate the model, the performance of which is then evaluated on the held-out fold. The Q2 statistic is typically used to evaluate the performance of the model in the validation fold. Q2 is the cross-validated R2 value and is computed utilising the Predictive Error Sum of Squares (PRESS) and the Total Sum of Squares (TSS) (Equation 13). Q2 values are always ≤ 1, and can indeed be negative when PRESS > TSS, revealing that the model predicts the ‘test set’ less accurately than simply using the mean response of the ‘training set’ (76):
\[ Q^2 = 1 - \frac{PRESS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (13) \]
The result for fold 1 is then stored, and the process is repeated with the second fold serving as the validation fold: the model’s predictive performance is tested on this fold, and the results are aggregated with the previous CV performance (from fold 1). The process is then repeated for the remaining folds.
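One round of M-fold cross-validation with the Q2 statistic of Equation 13 can be sketched as below. Several choices here are assumptions made for illustration: the function names, the use of ordinary least squares as a stand-in for the model being validated, the pooling of PRESS and TSS over the folds, and the use of the training-fold mean as the reference value in TSS.

```python
import numpy as np

def m_fold_q2(X, y, M, fit, predict, seed=0):
    """One round of M-fold CV; returns Q2 = 1 - PRESS / TSS pooled over the folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), M)   # folds of similar size
    press = tss = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        model = fit(X[train_idx], y[train_idx])
        y_hat = predict(model, X[test_idx])
        press += np.sum((y[test_idx] - y_hat) ** 2)
        # Reference prediction: mean response of the training folds (assumption)
        tss += np.sum((y[test_idx] - y[train_idx].mean()) ** 2)
    return 1.0 - press / tss

def ols_fit(X, y):
    """Stand-in model (assumption): ordinary least squares with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def ols_predict(coef, X):
    return coef[0] + X @ coef[1:]

# Synthetic example data (assumed)
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

q2_good = m_fold_q2(X, y, 5, ols_fit, ols_predict)
# A deliberately poor predictor gives PRESS > TSS and hence a negative Q2
q2_bad = m_fold_q2(X, y, 5, lambda X, y: None, lambda m, X: np.full(len(X), 100.0))
print(q2_good, q2_bad)
```

Repeating the call with different values of `seed` would correspond to running several rounds, each providing one set of CV performance statistics.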
Step-Down Procedure: This approach works in conjunction with M-fold cross-validation. For a particular value of K (number of correlated components), a model with all possible predictors is initially estimated. Subsequently, the least important predictor is left out on the basis of its standardised effect (Equation 14), and the Q2 value for the new model is obtained using the original set of predictors minus the ‘worst’ (75). This process then repeats until Q2 values are available for every model in the resulting sequence (75).
\[ \beta_g^{*} = \left( \frac{\sigma_g}{\sigma_y} \right) \beta_g \qquad (14) \]
CCR uses a step-down variable selection algorithm in order to remove irrelevant predictors: it commences with all the variables in the model and subsequently eliminates those with the smallest standardised coefficients one by one, re-estimating the model at each step (72).
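The step-down procedure can be sketched as a loop that alternates between cross-validation and elimination. This is an illustration under stated assumptions rather than the published algorithm: ordinary least squares stands in for the K-component CCR model, a single round of 5-fold CV is used, and all function names and data are hypothetical.

```python
import numpy as np

def ols_fit(X, y):
    """Stand-in model (assumption): OLS with an intercept; returns (alpha, beta)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

def cv_q2(X, y, fit, M=5, seed=0):
    """Cross-validated Q2 (Equation 13), pooled over M folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), M)
    press = tss = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        alpha, beta = fit(X[train], y[train])
        press += np.sum((y[test] - (alpha + X[test] @ beta)) ** 2)
        tss += np.sum((y[test] - y[train].mean()) ** 2)
    return 1.0 - press / tss

def step_down(X, y, fit=ols_fit, M=5):
    """Return a list of (predictor indices, Q2) for the nested sequence of models."""
    active = list(range(X.shape[1]))
    history = []
    while active:
        Xa = X[:, active]
        history.append((tuple(active), cv_q2(Xa, y, fit, M)))
        if len(active) == 1:
            break
        _, beta = fit(Xa, y)                     # re-estimate on the current set
        # Standardised coefficients (Equation 14)
        beta_std = beta * Xa.std(axis=0, ddof=1) / y.std(ddof=1)
        active.pop(int(np.argmin(np.abs(beta_std))))     # drop the 'worst' predictor
    return history

# Synthetic data: only the first two predictors affect the outcome (assumed)
rng = np.random.default_rng(4)
X = rng.normal(size=(80, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=80)
history = step_down(X, y)
best_predictors, best_q2 = max(history, key=lambda item: item[1])
print(best_predictors, best_q2)
```

The predictor subset with the highest Q2 in `history` would then be retained; in practice the same loop would be run for each candidate value of K.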