
Chapter 5: Model calibration/validation methodology

5.1 Principal components analysis

5.1.1 Definition of PCA

Principal Components Analysis (PCA) is possibly the most widely used multivariate statistical technique in the atmospheric sciences (Wilks, 1995). PCA identifies and summarises patterns of correlation or covariance between variables in a large data set. It linearly transforms an original set of variables into a substantially smaller set of uncorrelated variables that represent most of the information in the original data set (Dunteman, 1989). It is sometimes referred to as Empirical Orthogonal Function (EOF) analysis or Singular Value Decomposition (SVD). PCA can yield substantial insights into both the spatial and temporal variations of the variable being analysed. It is just as often used, however, as a data reduction technique. In this form, its purpose is to reduce a data set containing a large number of highly correlated variables to one containing far fewer uncorrelated variables that nevertheless represent a large proportion of the variability in the original data. This role is significant when multiple regression is subsequently used, as correlated variables reduce the efficiency of multiple linear regression (Wilks, 1995) and may also lead to spurious results.

Data used in a PCA must first be centred in some way. In most hydroclimatic applications, this involves conversion to anomalies from some normal. Data must also be detrended if appropriate; otherwise the first PC will often capture the trend rather than any sought-after structure in the data. Data in this study were centred, but were not deemed to need detrending, as discussed in section 4.4.
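
The centring and detrending steps described above can be sketched as follows. This is a minimal illustration using NumPy, not the procedure used in the study (which centred the data but did not detrend them). The function names `to_anomalies` and `detrend_linear`, the choice of the column mean as the "normal", the straight-line trend model and the random example data are all assumptions made for the sketch.

```python
import numpy as np


def to_anomalies(data):
    """Centre each column by subtracting its own mean (the assumed 'normal')."""
    data = np.asarray(data, dtype=float)
    return data - data.mean(axis=0)


def detrend_linear(data):
    """Remove a least-squares straight line (slope and intercept) from each column."""
    data = np.asarray(data, dtype=float)
    t = np.arange(data.shape[0], dtype=float)
    design = np.column_stack([t, np.ones_like(t)])
    coeffs, *_ = np.linalg.lstsq(design, data, rcond=None)
    return data - design @ coeffs


# Hypothetical example: 30 time steps of 3 variables, the first with a trend.
rng = np.random.default_rng(0)
raw = rng.normal(size=(30, 3))
raw[:, 0] += 0.5 * np.arange(30)

anomalies = to_anomalies(raw)    # column means are now zero
detrended = detrend_linear(raw)  # fitted line removed, so mean and slope are zero
```

Because the fitted line includes an intercept, the detrended series is also centred, so detrending in this way would make a separate centring step unnecessary.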

5.1.2 Type of dispersion matrix

The type of dispersion matrix on which the PCA is carried out, which summarises the relationships between the variables, can affect the results of the analysis. A choice must usually be made between a covariance matrix and a correlation matrix. These matrices contain different types of information about the data and can therefore yield different results. Because a PCA extracts the components that explain most of the variance, variables with more variance will be picked out preferentially from a covariance matrix. In a correlation matrix, all the variables have the same variance (equal to 1), so a PCA can pick out variables with closely related variation (Kestin, 2000). This was tested in this study by running several examples on both a correlation and a covariance matrix, and the two options gave very similar results. If a correlation matrix is chosen in the “Statistica” statistics software (www.statsoft.com), data are standardised to a mean of zero and a standard deviation of one before analysis commences. If a covariance matrix is chosen, the software does not standardise the data, so data on a common scale must be input. For the above reasons, a correlation dispersion matrix was chosen for this study.
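
The difference between the two dispersion matrices can be illustrated with a short NumPy sketch. It is not a reproduction of the Statistica calculation; the data are hypothetical random numbers in which one variable is deliberately given a much larger scale than the others. The sketch shows that the correlation matrix is the covariance matrix of the standardised data, and that the large-variance variable dominates the leading eigenvector of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 200 samples of 3 variables on very different scales.
x = rng.normal(size=(200, 3)) * np.array([100.0, 1.0, 1.0])

cov = np.cov(x, rowvar=False)        # covariance dispersion matrix
corr = np.corrcoef(x, rowvar=False)  # correlation dispersion matrix

# Standardise to a mean of zero and a standard deviation of one.
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
cov_of_z = np.cov(z, rowvar=False)   # equals corr up to rounding error

# eigh returns eigenvalues in ascending order, so the last column of the
# eigenvector matrix belongs to the largest eigenvalue.
cov_eigvals, cov_eigvecs = np.linalg.eigh(cov)
leading = cov_eigvecs[:, -1]         # dominated by the large-scale variable

# Eigenvalues of a correlation matrix sum to the number of variables.
corr_eigvals = np.linalg.eigvalsh(corr)
```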

5.1.3 Rotation

Many researchers use PCA to examine atmospheric circulation patterns (Mullan 1998a, Kidson and Barnes 1984, Vuille et al. 2000, Krishnamurthy & Shukla 2000, Salinger 1980a, Tait & Fitzharris 1998, Trenberth 1976, Smith 2000, Ruiz-Barradas et al. 2000, Wang and Cho 1997). In this application it is often useful to use a rotated PCA. Mullan and Renwick (1996) state that rotation generally produces a simpler set of patterns, which often consist of a single strong anomaly of one sign in part of the domain, in contrast to the bipolar or more complex unrotated EOFs. In a rotated PCA, eigenvectors are weighted by the square root of their corresponding eigenvalues (defined below), so that the weights (i.e., loadings) represent the correlations between each variable and principal component. Most rotations approximate a simple structure by applying algorithms that distribute the PC loadings so that the dispersion of the loadings is maximised (Wilks, 1995).
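
The statement that eigenvectors weighted by the square root of their eigenvalues give the variable-to-PC correlations can be checked numerically. The sketch below is an illustration under assumed conditions: hypothetical random data, a correlation dispersion matrix, and NumPy's `eigh` for the decomposition. It computes the loadings only and does not perform any rotation.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical correlated data: 500 samples of 4 variables.
mixing = rng.normal(size=(4, 4))
x = rng.normal(size=(500, 4)) @ mixing
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]            # sort PCs by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: column j of the eigenvector matrix scaled by sqrt(eigenvalue j).
loadings = eigvecs * np.sqrt(eigvals)

# PC scores, and the directly computed correlation of variable i with PC j.
scores = z @ eigvecs
direct = np.array([[np.corrcoef(z[:, i], scores[:, j])[0, 1]
                    for j in range(4)] for i in range(4)])
```

Here `loadings` and `direct` agree to rounding error, which is the property that rotation algorithms then build on.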

Many of the researchers mentioned above use rotated PCA because it is deemed to be a more effective tool than unrotated PCA for the study of atmospheric circulation patterns. However, Wilks (1995) states that this is primarily the case when physical interpretation, rather than compression, is the goal. In this study, therefore, where data compression was the main aim of the PCA, unrotated PCA was used.

5.1.4 Eigenvalues and their retention

The eigenvalues resulting from a PCA represent the variance captured by the corresponding principal component (PC); each eigenvalue divided by the sum of all eigenvalues gives the proportion of variance captured. Eigenvalues can be defined thus: if A is an n × n matrix, the number λ is an eigenvalue of A if there exists a non-zero vector v such that Av = λv. In this case, the vector v is called an eigenvector of A corresponding to λ.
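
The definition Av = λv can be verified numerically for a small example. The 3 × 3 matrix below is a hypothetical correlation matrix chosen for illustration, and the use of NumPy's `eigh` (suitable for symmetric matrices) is an assumption of the sketch rather than the method of any particular statistics package.

```python
import numpy as np

# Hypothetical 3 x 3 correlation matrix (symmetric, unit diagonal).
A = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.5],
              [0.3, 0.5, 1.0]])

eigvals, eigvecs = np.linalg.eigh(A)
order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Check A v = lambda v for every eigenpair at once (columns are eigenvectors).
residuals = A @ eigvecs - eigvecs * eigvals

# Proportion of the total variance captured by each PC.
explained = eigvals / eigvals.sum()
```

For a correlation matrix the eigenvalues sum to n (here 3), so `explained` is each eigenvalue divided by the number of variables.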

If n variables are entered into a PCA, n eigenvalues corresponding to n PCs will be output. As data reduction is the aim of the process here, a decision must be made as to how many PCs to retain for further analysis. A substantial share of the variance must be represented, ideally by just a few PCs. There are several selection methods in the literature (e.g. the average root test, log eigenvalue test and broken stick test; Wilks, 1995), but three commonly used methods were chosen and used in conjunction with each other for this study.

The Scree Test, proposed by Cattell (1966), graphs the eigenvalues against the corresponding PC number, which produces a descending line, as in Figure 5.1. The aim of the scree test is to find the break in the curve where it begins to flatten out significantly, which marks the point beyond which each additional PC explains a smaller and smaller amount of variance.
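
The scree test is normally applied by eye, but the idea of a break in slope can be approximated numerically. The sketch below uses one simple heuristic, assumed here purely for illustration: the elbow is taken to be the PC at which the drop between successive eigenvalues shrinks the most. The function name `scree_elbow` and the eigenvalues are hypothetical, and the heuristic is not the procedure used in the study.

```python
def scree_elbow(eigenvalues):
    """Return the 1-based PC number where the scree curve flattens most sharply.

    Assumed heuristic: compare each drop between successive eigenvalues
    with the following drop and pick the largest reduction.
    """
    drops = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]
    changes = [d1 - d2 for d1, d2 in zip(drops, drops[1:])]
    if not changes:
        raise ValueError("need at least three eigenvalues")
    i = max(range(len(changes)), key=changes.__getitem__)
    # changes[i] compares the drop into PC i+2 with the drop out of it.
    return i + 2


# Hypothetical eigenvalues in descending order.
eigenvalues = [6.0, 4.0, 2.0, 0.5, 0.4, 0.3, 0.2]
elbow = scree_elbow(eigenvalues)  # the curve flattens from PC 4 onwards
```

Whether the PC at the elbow itself is retained remains a judgement call, which is one reason the scree test is usually combined with other rules.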

The scree test was used in conjunction with Kaiser’s rule, proposed by Kaiser (1960). Kaiser’s rule recommends the retention of those PCs whose associated eigenvalues are greater than 1, that is, those that account for more than the average amount of the total variance. Jolliffe (1972) has argued that Kaiser’s rule is too strict, and suggested the alternative of retaining those PCs whose eigenvalues are greater than 0.7. A combination of Kaiser’s and Jolliffe’s thresholds and the scree test was used to choose the number of PCs to retain, with the result that usually a few more PCs were retained than Kaiser’s criterion alone would have suggested.
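
The two threshold rules are straightforward to apply once the eigenvalues are known. The sketch below assumes a correlation-matrix PCA, so that the eigenvalues sum to the number of variables and the average eigenvalue is 1. The eigenvalues and the helper name `count_retained` are hypothetical.

```python
KAISER_THRESHOLD = 1.0    # Kaiser (1960): retain PCs with eigenvalue > 1
JOLLIFFE_THRESHOLD = 0.7  # Jolliffe (1972): retain PCs with eigenvalue > 0.7


def count_retained(eigenvalues, threshold):
    """Count the PCs whose eigenvalue exceeds the given threshold."""
    return sum(1 for value in eigenvalues if value > threshold)


# Hypothetical eigenvalues from an 8-variable correlation-matrix PCA
# (they sum to 8, the number of variables).
eigenvalues = [3.1, 1.8, 1.2, 0.9, 0.75, 0.15, 0.07, 0.03]

n_kaiser = count_retained(eigenvalues, KAISER_THRESHOLD)      # 3 PCs
n_jolliffe = count_retained(eigenvalues, JOLLIFFE_THRESHOLD)  # 5 PCs
```

In this example Jolliffe's threshold retains two more PCs than Kaiser's, consistent with the tendency noted above for the combined approach to retain a few more PCs than Kaiser's criterion.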

[Figure 5.1 (scree plot): eigenvalue (y-axis, 0 to 24) against PC number (x-axis, 1 to 51). Annotations: Kaiser criterion, retain PCs where eigenvalue > 1; Jolliffe (1972), retain PCs where eigenvalue > 0.7; scree plot, choose break in slope.]

Figure 5.1: Scree plot for the Principal Components Analysis of winter ocean-atmosphere variables, showing eigenvalues plotted against their corresponding PCs. The three above-mentioned selection rules are shown, along with the number of PCs each would retain. In this instance, 25 PCs were retained for further analysis.

All independent datasets in this study were subjected to a Principal Components Analysis, and the resulting principal components were taken forward and used as the independent variables in a multiple regression with seasonally lagged inflows and rainfall. The PCs retained from each PCA are outlined in section 6.2.
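
The step of carrying retained PCs forward into a multiple regression can be sketched as follows. This is an illustration only: the predictors, the predictand `y` (standing in for a seasonally lagged inflow or rainfall series), the number of retained PCs `n_keep` and the use of ordinary least squares via NumPy's `lstsq` are all assumptions, and none of the numbers come from the study.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_vars, n_keep = 60, 6, 2

# Hypothetical predictors: six correlated variables driven by two signals.
latent = rng.normal(size=(n_samples, 2))
x = latent @ rng.normal(size=(2, n_vars)) + 0.1 * rng.normal(size=(n_samples, n_vars))
# Hypothetical predictand, standing in for a lagged inflow series.
y = 2.0 * latent[:, 0] - 1.0 * latent[:, 1] + 0.1 * rng.normal(size=n_samples)

# PCA on the correlation matrix of the standardised predictors.
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
order = np.argsort(eigvals)[::-1]
scores = z @ eigvecs[:, order[:n_keep]]   # retained PC scores

# Multiple linear regression of y on the retained PCs plus an intercept.
design = np.column_stack([np.ones(n_samples), scores])
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coeffs

# The retained PCs are uncorrelated with each other by construction.
score_corr = np.corrcoef(scores, rowvar=False)
```

Because the PC scores are mutually uncorrelated, the regression avoids the multicollinearity that the original correlated variables would introduce.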

In document Model Development for Seasonal Forecasting of Hydro Lake Inflows in the Upper Waitaki Basin, New Zealand (Page 102-105)