Computation of Principal Components using Nonlinear Principal Component Analysis

5. DEVELOPMENT OF NONLINEAR AGGREGATED DROUGHT INDEX

5.3. Methodology Used for Development of Nonlinear Aggregated Drought Index

5.3.1. Computation of Principal Components using Nonlinear Principal Component Analysis

Generally, NLPCA has been used as a data reduction technique (Kramer, 1991; Monahan, 2000, 2001; Linting et al., 2007). However, in this study, NLPCA is introduced and adopted as the numerical approach to aggregate five hydro-meteorological variables (i.e., rainfall, potential evapotranspiration, streamflow, storage reservoir volume and soil moisture content) in the development of the NADI. The use of NLPCA in the development of the NADI parallels the use of linear PCA in the development of the ADI in Chapter 4 and by Keyantash and Dracup (2004). NLPCA is similar to linear PCA in most aspects, the only difference being that in linear PCA the PCs are obtained through a linear combination of the variables, whereas in NLPCA the PCs are obtained through a linear combination of transformed variables. The variable transformations are performed in NLPCA using an iterative process (which will be discussed shortly) in which the observed data are replaced with new numeric values. By doing so, the PCs generated through NLPCA capture the nonlinear relationships between the variables and account for more variance in the data than linear PCA (Linting et al., 2007).
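The effect of transforming variables before the linear combination can be illustrated with a small synthetic case. The sketch below is only an illustration under stated assumptions: the two variables, the exponential dependence between them, and the use of a logarithm as a stand-in for an optimal transformation are all hypothetical and are not taken from the study data or from CATPCA.

```python
import numpy as np

rng = np.random.default_rng(0)
h1 = rng.normal(size=500)   # hypothetical first variable
h2 = np.exp(h1)             # hypothetical variable depending nonlinearly on h1


def first_pc_share(data):
    """Share of total variance carried by the first PC of the correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)   # eigenvalues in ascending order
    return eigvals[-1] / eigvals.sum()


raw = np.column_stack([h1, h2])
# Assumption: the logarithm plays the role of the transformation that
# NLPCA would have to find iteratively; here it is simply known in advance.
transformed = np.column_stack([h1, np.log(h2)])

share_raw = first_pc_share(raw)
share_transformed = first_pc_share(transformed)
print(share_raw, share_transformed)
```

In this constructed case the first PC of the transformed variables carries essentially all of the variance, whereas the first PC of the raw variables does not, because the linear correlation understates the strength of the nonlinear relationship.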

In linear PCA, the PCs are a re-expression of the original m-variable data set in terms of uncorrelated components Z_j (1 ≤ j ≤ m). As was discussed in Section 4.3.5, the eigenvectors derived through linear PCA are unit vectors (i.e., magnitude of 1) that establish the relationship between the PCs and the original data, as shown in Equation (5.1).

$$Z = HE \tag{5.1}$$

where Z is the (n x m) matrix of PCs (i.e., uncorrelated components), in which n is the number of observations, H is the (n x m) matrix of standardized observational data (also called component scores) and E is the (m x m) matrix of eigenvectors (also called component loadings).
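The relationship in Equation (5.1) can be sketched numerically as follows. This is a minimal illustration, assuming that the eigenvectors are taken from the correlation matrix of the standardized data; the function name `linear_pca` and the synthetic five-variable data are hypothetical and are not the hydro-meteorological series used in this study.

```python
import numpy as np


def linear_pca(data):
    """Return (Z, E, eigvals) such that Z = H E, with H the standardized data."""
    # Standardize each column (population standard deviation).
    H = (data - data.mean(axis=0)) / data.std(axis=0)
    n = H.shape[0]
    R = H.T @ H / n                      # (m x m) correlation matrix
    eigvals, E = np.linalg.eigh(R)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]    # reorder so the first PC comes first
    eigvals, E = eigvals[order], E[:, order]
    Z = H @ E                            # Equation (5.1)
    return Z, E, eigvals


rng = np.random.default_rng(1)
# Synthetic stand-in for an (n x 5) monthly data matrix, illustrative only.
data = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z, E, eigvals = linear_pca(data)
```

The columns of E are unit vectors, and the columns of Z are uncorrelated, with variances equal to the corresponding eigenvalues.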

Usually, in linear PCA, the component scores and component loadings are obtained from a singular value decomposition of the standardized data matrix or an eigenvalue decomposition of the correlation matrix. In NLPCA, however, the same results are obtained through the iterative process mentioned above, in which a least squares loss function is minimized. The loss to be minimized is the loss of information due to representing the variables by a small number of components: in other words, the difference between the variables and the component scores weighted by the component loadings. It should be noted that in NLPCA, the variable transformation task and the linear PCA model estimation (i.e., computation of component scores and component loadings) are performed simultaneously through the iterative process. Computer programs or modules that perform NLPCA are available in two major commercial statistical packages: the module PRINQUAL in SAS (SAS Institute, 1992) and the module CATPCA in SPSS (Meulman et al., 2004; SPSS, 2006). The CATPCA (i.e., Categorical Principal Component Analysis) module of the SPSS software was used in this study to perform the NLPCA, and the way NLPCA is performed in CATPCA is described mathematically below.
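The least squares view of linear PCA described above can be checked with a short sketch. It assumes unit-variance component scores and synthetic data; the names `scores_and_loadings` and `information_loss` are hypothetical helpers introduced for illustration, not functions of SAS or SPSS.

```python
import numpy as np


def scores_and_loadings(data_std, p):
    """First p component scores and loadings from an SVD of standardized data."""
    n = data_std.shape[0]
    U, s, Vt = np.linalg.svd(data_std, full_matrices=False)
    scores = np.sqrt(n) * U[:, :p]               # scaled to unit variance
    loadings = Vt[:p].T * (s[:p] / np.sqrt(n))   # (m x p)
    return scores, loadings


def information_loss(data_std, scores, loadings):
    """Squared difference between the variables and scores weighted by loadings."""
    n = data_std.shape[0]
    return np.sum((data_std - scores @ loadings.T) ** 2) / n


rng = np.random.default_rng(2)
raw = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))  # synthetic, illustrative only
data_std = (raw - raw.mean(axis=0)) / raw.std(axis=0)

scores, loadings = scores_and_loadings(data_std, p=2)
loss_value = information_loss(data_std, scores, loadings)

# Eigenvalues of the correlation matrix, largest first.
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(raw, rowvar=False)))[::-1]
```

With this scaling, the loss for p retained components equals the sum of the discarded eigenvalues of the correlation matrix, which is one way to see that the singular value decomposition, the eigenvalue decomposition and the least squares formulation lead to the same linear PCA solution.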

Suppose H is the (n x m) observational data matrix, where n is the number of observations and m is the number of variables. Each variable h_j, in the jth column of H, is an (n x 1) vector of observational data, with j = 1, . . . , m. Twelve matrices H were used separately for NLPCA, one for each month. Assume that after the optimal iterative process in NLPCA, the matrix H is replaced by the (n x m) matrix Q, containing the transformed variables q_j = φ_j(h_j). Various types of nonlinear transformation (e.g., nominal, ordinal and spline) can be chosen in the CATPCA module based on the data. A spline transformation, which is suitable for numeric and continuous data, was used for the nonlinear variable transformation in this study, since the matrix H contains numeric and continuous data. If X is the (n x p) matrix of component scores, where p is the number of components, and A is the (m x p) matrix of component loadings, with its jth row indicated by a_j and its elements by a_js (where s = 1, . . . , p), then the loss function L(Q, X, A) that can be used in NLPCA to minimize the difference between the transformed data and the PCs can be expressed as:

$$L(Q, X, A) = n^{-1} \sum_{j=1}^{m} \sum_{i=1}^{n} \left( q_{ij} - \sum_{s=1}^{p} x_{is} a_{js} \right)^{2} \tag{5.2}$$

In matrix notation, this function can be written as

$$L(Q, X, A) = n^{-1} \sum_{j=1}^{m} \operatorname{tr} \left( q_j - X a_j \right)' \left( q_j - X a_j \right) \tag{5.3}$$

where tr denotes the trace function that sums the diagonal elements of a matrix. For example,

$$\operatorname{tr} B'B = \sum_{j=1}^{m} \sum_{i=1}^{n} b_{ij}^{2} \tag{5.4}$$
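The agreement between the element-wise form in Equation (5.2) and the trace form in Equation (5.3) can be verified numerically. The sketch below uses arbitrary random matrices of hypothetical size, since the identity does not depend on the data; the function names are illustrative only.

```python
import numpy as np


def loss_elementwise(Q, X, A):
    """Equation (5.2): triple sum over variables j, observations i, components s."""
    n, m = Q.shape
    p = X.shape[1]
    total = 0.0
    for j in range(m):
        for i in range(n):
            fit = sum(X[i, s] * A[j, s] for s in range(p))
            total += (Q[i, j] - fit) ** 2
    return total / n


def loss_trace(Q, X, A):
    """Equation (5.3): sum over variables of tr (q_j - X a_j)'(q_j - X a_j)."""
    n, m = Q.shape
    total = 0.0
    for j in range(m):
        residual = (Q[:, j] - X @ A[j]).reshape(-1, 1)   # (n x 1) column
        total += np.trace(residual.T @ residual)
    return total / n


rng = np.random.default_rng(3)
n, m, p = 20, 5, 2                 # hypothetical dimensions
Q = rng.normal(size=(n, m))
X = rng.normal(size=(n, p))
A = rng.normal(size=(m, p))

value_52 = loss_elementwise(Q, X, A)
value_53 = loss_trace(Q, X, A)

# Equation (5.4): tr B'B equals the sum of squared elements of B.
B = rng.normal(size=(n, m))
trace_identity_holds = np.isclose(np.trace(B.T @ B), np.sum(B ** 2))
```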

It was proven that the loss function in Equation (5.3) is equivalent to Equation (5.5) (Gifi, 1990).

$$L(Q, X, A) = n^{-1} \sum_{j=1}^{m} \operatorname{tr} \left( q_j a_j' - X \right)' \left( q_j a_j' - X \right) \tag{5.5}$$

The loss function given by Equation (5.5) is used in CATPCA because the vector representations of variables, as well as the representations of the transformed data as a set of group points, can be incorporated in it (Linting et al., 2007).
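One way to see the sense in which the two forms of the loss agree is to expand both for a single variable j. The expansion below is a sketch rather than the proof given by Gifi (1990); it assumes that the transformed variables are scaled so that q_j'q_j = n and that the component scores are scaled so that X'X = nI, which are the normalizations imposed during the minimization.

```latex
\begin{aligned}
n^{-1}\,\mathrm{tr}\,(q_j - X a_j)'(q_j - X a_j)
  &= n^{-1}\left( q_j'q_j - 2\,q_j'X a_j + a_j'X'X a_j \right)
   = 1 - 2 n^{-1} q_j'X a_j + a_j'a_j, \\
n^{-1}\,\mathrm{tr}\,(q_j a_j' - X)'(q_j a_j' - X)
  &= n^{-1}\left( q_j'q_j\, a_j'a_j - 2\,q_j'X a_j + \mathrm{tr}\,X'X \right)
   = a_j'a_j - 2 n^{-1} q_j'X a_j + p.
\end{aligned}
```

Under these assumptions, the two expressions differ only by the constant (p − 1) for each variable, so after summing over the m variables the two loss functions differ by m(p − 1). This constant does not depend on Q, X or A, so both forms are minimized by the same solution, and they coincide exactly when p = 1.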

The loss function in Equation (5.5) is minimized in an alternating least squares manner by cyclically updating one of the three sets of parameters Q, X and A while keeping the other two fixed. This iterative process is subject to the following conditions:

1) the transformed variables are standardized, so that

$$q_j' q_j = n \tag{5.6}$$

2) the component scores are restricted so that

$$X'X = nI \tag{5.7}$$

where I is the identity matrix.

3) the component scores are centered; thus,

$$1'X = 0 \tag{5.8}$$

where “1” indicates a vector of ones.

The condition presented by Equation (5.6) is required to resolve the indeterminacy between q_j and a_j in the product q_j a_j′. The condition in Equation (5.7), on the other hand, is applied to avoid the trivial solutions A = 0 and X = 0. Moreover, the conditions presented by Equations (5.7) and (5.8) imply that the columns of X (i.e., the component scores) have a mean of zero and a standard deviation of one, and that the components are uncorrelated.

The abovementioned iterative process is continued until the improvement in the subsequent loss values is below some user-specified small value. In CATPCA, starting values of X are randomly selected.
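The alternating least squares cycle described above can be sketched in a few lines. This is a simplified illustration and not the CATPCA implementation: it minimizes the loss in the form of Equation (5.3), it replaces the spline transformation with a low-order polynomial basis as a stand-in, and the function names, the `degree` and `tol` parameters, and the synthetic data are all hypothetical.

```python
import numpy as np


def standardize(v):
    """Center v and scale it so that v'v = n (Equation 5.6)."""
    v = v - v.mean()
    return v * np.sqrt(len(v) / (v @ v))


def loss(Q, X, A):
    """Loss in the form of Equation (5.3)."""
    return np.sum((Q - X @ A.T) ** 2) / Q.shape[0]


def nlpca_als(H, p, degree=3, tol=1e-8, max_iter=500, seed=0):
    n, m = H.shape
    rng = np.random.default_rng(seed)
    Hs = np.column_stack([standardize(H[:, j]) for j in range(m)])
    # Assumption: a centered polynomial basis stands in for the spline basis.
    bases = []
    for j in range(m):
        B = np.column_stack([Hs[:, j] ** d for d in range(1, degree + 1)])
        bases.append(B - B.mean(axis=0))
    Q = Hs.copy()
    # Random starting scores, centered and scaled so that X'X = nI.
    X0 = rng.normal(size=(n, p))
    X0 -= X0.mean(axis=0)
    X = np.sqrt(n) * np.linalg.qr(X0)[0]
    A = Q.T @ X / n
    history = [loss(Q, X, A)]
    for _ in range(max_iter):
        # Update X with Q and A fixed (orthogonal Procrustes step).
        U, _, Vt = np.linalg.svd(Q @ A, full_matrices=False)
        X = np.sqrt(n) * U @ Vt
        # Update A with Q and X fixed (least squares, using X'X = nI).
        A = Q.T @ X / n
        # Update each q_j with X and A fixed: project X a_j on the basis
        # of variable j, then restandardize.
        for j in range(m):
            coef, *_ = np.linalg.lstsq(bases[j], X @ A[j], rcond=None)
            Q[:, j] = standardize(bases[j] @ coef)
        history.append(loss(Q, X, A))
        if history[-2] - history[-1] < tol:   # user-specified stopping value
            break
    return Q, X, A, history


rng = np.random.default_rng(42)
t = rng.normal(size=300)
# Synthetic variables with nonlinear dependence, illustrative only.
H = np.column_stack([t, np.exp(t), t ** 3]) + 0.1 * rng.normal(size=(300, 3))
Q, X, A, history = nlpca_als(H, p=2)
```

Each of the three updates minimizes the loss over one set of parameters while the other two are held fixed, so the recorded loss values do not increase from one cycle to the next, and the returned Q and X satisfy the conditions in Equations (5.6) to (5.8).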

As was mentioned above, the PCs generated through NLPCA are a re-expression of the m-variable transformed data set in terms of uncorrelated components Y_j (1 ≤ j ≤ m). Therefore, the relationship between the PCs and the transformed data can be presented in NLPCA by Equation (5.9).

$$Y = QE \tag{5.9}$$

where, Y is the (n x m) matrix of PCs (i.e., uncorrelated components), Q is the (n x m) matrix of transformed variables and E is the (m x m) matrix of eigenvectors.

In document Drought assessment and forecasting using a nonlinear aggregated drought index (Page 104-108)