Principal component analysis technique

3. Data & Method

3.4. Principal component analysis technique

3.4.1. Introduction

The use of PCA in previous studies was described in section 2.2.1, where the technique was used to decompose the diurnal cycle into its most important components.

Principal component analysis (PCA) is a statistical technique that is very useful for finding recurring patterns in multi-dimensional data [Smith, 2002]. It transforms a set of correlated data into a set of orthogonal vectors according to their covariances. The first transformed vector, known as the first principal component (PC), represents the highest percentage of variance within the data set, the second principal component represents the next highest percentage of variance, and so on.

PCA is carried out by eigenvalue and eigenvector decomposition of the covariance matrix, which is described in the next section (3.4.2).

3.4.2. OLR PC analysis

In this analysis we used the GERB ARG product, linearly interpolated to a 15-minute time resolution. The same interpolation method was used to fill data gaps. However, if the data gap is larger than 10% of a day, that particular day is discarded from the dataset, and the data are averaged over the remaining days. This ensures that no significant noise is introduced into the PCs.
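The gap handling described above can be sketched as follows. This is only an illustration under stated assumptions, not the processing code used in the study: missing samples are assumed to be marked with NaN, the 10% threshold is assumed to apply to the total missing fraction of a day, and the function names `fill_or_discard` and `mean_diurnal_cycle` are hypothetical.

```python
import numpy as np

STEPS_PER_DAY = 96        # 24 h at 15-minute resolution
MAX_GAP_FRACTION = 0.10   # assumed: threshold on the total missing fraction


def fill_or_discard(day):
    """Return a gap-filled copy of one day of data, or None if the day is rejected.

    `day` is a length-96 sequence in which NaN marks a missing sample (assumption).
    """
    day = np.asarray(day, dtype=float)
    missing = np.isnan(day)
    if missing.mean() > MAX_GAP_FRACTION:
        return None
    steps = np.arange(day.size)
    filled = day.copy()
    # Linear interpolation between valid neighbours; np.interp holds the
    # nearest valid value constant for gaps at the very start or end.
    filled[missing] = np.interp(steps[missing], steps[~missing], day[~missing])
    return filled


def mean_diurnal_cycle(days):
    """Average the accepted days of one pixel into a mean diurnal cycle."""
    kept = [d for d in (fill_or_discard(day) for day in days) if d is not None]
    return np.mean(kept, axis=0), len(kept)
```

In this sketch a day with 4 missing samples out of 96 (about 4%) is filled and kept, whereas a day with 20 missing samples (about 21%) is dropped before averaging.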

The analysis technique is similar to that used by Comer et al. (2007): the computation of principal components (PCs) and the associated empirical orthogonal functions (EOFs). A month of data was averaged into mean diurnal cycles by averaging the 30 days of data into a single day according to local solar time for each individual GERB footprint. The measurements from each pixel over time are treated as a single column of data, and the column mean is subtracted from each column to create a deviation matrix $D_{ij}$, where i and j denote the rows and columns of the matrix respectively. A covariance matrix C is formed by taking the covariances between each pair of rows, so the covariance matrix has dimensions 96 by 96 (see eq. 3.3). The dimension 96 is the number of time steps in a day, with a 15-minute interval between measurements.

$$
C = \begin{pmatrix}
\mathrm{cov}(1,1) & \cdots & \mathrm{cov}(1,i) \\
\vdots & \ddots & \vdots \\
\mathrm{cov}(i,1) & \cdots & \mathrm{cov}(i,i)
\end{pmatrix}
\tag{3.3}
$$

where cov(a, b) represents the covariance between rows a and b of the deviation matrix [Smith, 2002]. As shown in eq. 3.3, the covariance matrix is symmetric about the diagonal. The eigenvectors and eigenvalues of the covariance matrix are then computed. The eigenvectors are ordered by their eigenvalues: the eigenvector with the highest eigenvalue is the most significant principal component of the dataset, and the one with the lowest eigenvalue is the least significant. The variance of each principal component is given by its associated eigenvalue, so the PCs are ranked by their variances.
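The steps from the deviation matrix to the ranked PCs can be sketched with NumPy. The data here are synthetic (a sinusoidal cycle with a random amplitude per pixel plus noise), and the array sizes and variable names are assumptions made for illustration; only the sequence of operations follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_pixels = 96, 500          # synthetic sizes, not the GERB grid
hours = np.arange(n_steps) / 4.0     # 15-minute steps expressed in hours

# Synthetic mean diurnal cycles: rows are time steps, columns are pixels.
amplitude = rng.uniform(5.0, 30.0, n_pixels)
data = 250.0 + amplitude * np.sin(2 * np.pi * (hours[:, None] - 6.0) / 24.0)
data += rng.normal(0.0, 1.0, data.shape)

# Deviation matrix: subtract each column (pixel) mean.
D = data - data.mean(axis=0, keepdims=True)

# 96 x 96 covariance between rows (time steps). Note that np.cov also
# removes each row's own mean before forming the products.
C = np.cov(D)

# eigh is intended for symmetric matrices and returns ascending eigenvalues,
# so the order is reversed to put the most significant PC first.
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
pcs = eigenvectors[:, order]         # column k is the k-th PC

explained = eigenvalues / eigenvalues.sum()   # fraction of variance per PC
```

Because the synthetic pixels differ mainly in the amplitude of one shared cycle, almost all of the variance falls into the first PC, which is the behaviour the ranking by eigenvalue is meant to expose.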

The EOFs associated with the PCs are calculated by:

$$
\mathrm{EOF}(i) = e_i \, D_{ij} \tag{3.4}
$$

where EOF(i) is the empirical orthogonal function corresponding to the i-th PC and $e_i$ is the i-th eigenvector (PC). The eigenvectors represent the transformed factors that contribute the most variance, in descending order (i.e. the first PC represents the most variance), whereas the eigenvalues express the variance associated with each eigenvector.
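A minimal sketch of eq. 3.4 is given below, assuming that the product of the i-th eigenvector with the deviation matrix is the projection $e_i^T D$, which gives one value per pixel. The random data and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_steps, n_pixels = 96, 200                 # synthetic sizes
D = rng.normal(size=(n_steps, n_pixels))
D -= D.mean(axis=0, keepdims=True)          # deviation matrix (column means removed)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(D))
eigenvalues = eigenvalues[::-1]             # descending order
pcs = eigenvectors[:, ::-1]                 # column i is the i-th PC

# Row i of `eofs` is the spatial pattern EOF(i): the projection of every
# pixel's diurnal cycle onto the i-th PC (assumed reading of eq. 3.4).
eofs = pcs.T @ D                            # shape (96, n_pixels)

# The variance of each projection across pixels equals its eigenvalue.
eof_variance = np.var(eofs, axis=1, ddof=1)
```

The last line illustrates the statement that the eigenvalues express the variance associated with each eigenvector: `eof_variance` matches `eigenvalues` to numerical precision.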

The data can be re-derived using all or some of the PCs and the corresponding EOFs. Using all the PCs and EOFs yields the original dataset, whereas using only certain components yields a modified dataset that projects the data only onto the chosen components [Smith, 2002]. For example, when only the most significant PC is chosen, the entire re-derived dataset consists of the first PC scaled by different magnitudes. The equation to re-derive the dataset is as follows:

$$
D_t = PC(n)^{T} \times \Delta D \tag{3.5}
$$

where $PC(n)$ denotes the chosen PCs used to re-derive the data and $\Delta D$ is the deviation matrix, i.e. the original dataset with the mean subtracted. The transformed dataset ($D_t$) has the same spatial dimension as the deviation matrix and a temporal dimension of n, the number of PCs chosen. To retrieve the full dataset in terms of the chosen PCs, the following equation is used:

$$
D_f = PC(n) \times D_t + D_\mu \tag{3.6}
$$

The transformed matrix $D_t$ is further multiplied by the $PC(n)$ matrix to retrieve the deviations in the original temporal dimension. The last step to get the final modified dataset ($D_f$) is to add back the mean of the original dataset ($D_\mu$), which was subtracted at the beginning of the PCA.
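Eqs. 3.5 and 3.6 can be sketched as a single function. This is a hedged illustration: the function name `reconstruct` is hypothetical, the mean is assumed to be the per-pixel column mean, and the PCs are taken from `np.linalg.eigh` applied to the row covariance.

```python
import numpy as np


def reconstruct(data, n):
    """Re-derive `data` (time steps x pixels) from its n leading PCs."""
    mean = data.mean(axis=0, keepdims=True)        # D_mu (assumed per-pixel mean)
    deviation = data - mean                        # Delta D
    _, eigenvectors = np.linalg.eigh(np.cov(deviation))
    pcs = eigenvectors[:, ::-1][:, :n]             # PC(n): n leading eigenvectors
    transformed = pcs.T @ deviation                # D_t, eq. 3.5 (n x pixels)
    return pcs @ transformed + mean                # D_f, eq. 3.6


rng = np.random.default_rng(2)
data = rng.normal(size=(96, 120))                  # synthetic data only

full = reconstruct(data, 96)    # all PCs: recovers the original dataset
first = reconstruct(data, 1)    # one PC: every pixel is a scaled copy of PC 1
```

With all 96 PCs the eigenvectors form a complete orthonormal basis, so `full` equals `data`. With a single PC the reconstructed deviations have rank one, which is the "functions of only the first PC in different magnitudes" case described in the text.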

In this study, the PC represents the diurnal variation on a daily timescale, whereas the EOF is the spatial expression of the magnitude of that PC at each specific location (pixel).

We expect the first few components to correspond to some of the most significant climate processes on the chosen timescale. By examining the PCs and the EOFs, which express the temporal and spatial distribution of the OLR characteristics respectively, the corresponding climate processes can be better understood. In particular, when the analysis is carried out over chosen subsets of the data, the temporal and spatial variation of the climate processes in the different subsets can be compared.

3.4.3. Uncertainty and limitations

PCA is very powerful for identifying repeating patterns, especially when they are not sinusoidal. The method decomposes the dataset according to its variance and represents that variance in both its temporal and spatial distribution, which is extremely useful in data analysis.

However, the method has limitations, which were investigated by Smith and Rutan (2003) in application to radiation studies. Smith and Rutan (2003) did so using a method proposed by North et al. (1982), who derived a criterion for validating an EOF. This method assumes that each realisation is independent, which is unlikely to hold for radiation observations. The validation is therefore not exact, but it provides a good approximate criterion and an indication of the robustness of the technique when used in radiation studies.

The validation method uses the standard deviation of the error of the eigenvalue, which is defined as [Smith and Rutan, 2003]:

$$
\delta\lambda_i = \lambda_i \left( \frac{2}{N} \right)^{1/2} \tag{3.7}
$$

where $\delta\lambda_i$ denotes the standard deviation of the i-th eigenvalue ($\lambda_i$) and N is the number of data points, which in our analysis is the number of GERB pixels selected for the study.

North et al. (1982) defined the criterion that an EOF is considered valid if it contains only a small contamination of variance from other EOFs. This is quantified by:

$$
\delta\lambda_i < \Delta\lambda_i = \lambda_i - \lambda_{i+1} \tag{3.8}
$$

That is, the standard deviation of the eigenvalue has to be smaller than the difference ($\Delta\lambda_i$) between that eigenvalue and the next for the corresponding eigenvector to be valid.
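The criterion can be checked numerically as in the sketch below. It assumes the North et al. (1982) rule of thumb for the sampling error, $\delta\lambda_i = \lambda_i (2/N)^{1/2}$, and eigenvalues sorted in descending order. The function name `north_test` and the example eigenvalues are hypothetical.

```python
import numpy as np


def north_test(eigenvalues, n_samples):
    """Return a boolean array: True where an eigenvalue is separated from the next.

    `eigenvalues` must be sorted in descending order. The last eigenvalue has
    no successor, so the result is one element shorter than the input.
    """
    lam = np.asarray(eigenvalues, dtype=float)
    delta = lam * np.sqrt(2.0 / n_samples)   # assumed sampling error of each eigenvalue
    spacing = lam[:-1] - lam[1:]             # difference to the next eigenvalue
    return delta[:-1] < spacing


# Illustrative values only: the first two eigenvalues are close together.
valid = north_test([10.0, 9.5, 2.0, 0.1], n_samples=100)
```

For these made-up values the first eigenvalue fails the test because its sampling error (about 1.41) exceeds its distance to the second (0.5), while the second and third pass. As noted in the text, N here assumes independent samples, so with correlated pixels the test is only approximate.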

Furthermore, EOFs computed according to LST cannot capture propagating events (such as advection). There are various ways in which EOFs can be modified to track propagating signals. Complex EOFs are based on the notion that a propagating signal should contain signals that are orthogonal to each other [Gille, 2012]; they involve computing the Hilbert transform of the data and then calculating the EOFs of the transformed data. Propagating signals can also be captured by extended EOFs [Gille, 2012], which assume that an event occurring at one point will occur some time later at another point; this requires creating new, time-shifted data sets and then computing the signals from these time-shifted sets.

A simpler alternative that allows the EOF analysis to track propagating features is to correlate the PCs with a time lag. However, this requires prior knowledge of the time lag and the propagation speed in order to isolate the phase of the propagating wave. This is beyond the scope of this thesis and is not included in our study.
