Principal component discriminant analysis

5.2 Supplementary information for analyzing T-Cells datasets

5.2.1 Principal component discriminant analysis

As described in section 5.1, FTIR raw T-Cell data measured in 6 different days were used in this analysis. Since pair-wise comparison can only work for two group comparisons, here we apply a multi-group comparison method, PCA-LDA, to explore and compare the different results. We categorized five different treatments to into five different clusters, which are AntiFas (AF), Control (C), ControlUV (CUV or CU), Sorbitol (S) and UVtreated (UV), and treated them to- gether as a whole.

The same procedures (PCA-LDA with LOOCV using SAS (SAS Institute Inc., Cary, NC, USA)) that were used for Vero cell analysis were applied to T-Cell data, as described in the pre- vious chapter.

Data analysis procedures are listed below:

1. Data were standardized for each treatment separately. Standardization method is to stan-

dardize data through frequencies. In other words, if mean = mean of absorbance (Abs) of 365 frequencies (there are 365 frequencies total), and sd = second deviation of 300 fre- quency abs., then (abs-mean)/sd. All data were merged into one file using MATLAB and mark each day as 1 to 6 and marked for each treatment as 1 to 5 (AntiFas=1, Control=2, ControlUV=3, Sorbitol=4 and UVtreated=5).

2. Data structure were indicated as in Table 5-1 below: 365 variables (Var, wavenumbers),

721 observations (Obs) (~120 spectra for each treatment), 5 treatments (inf) : (af, c, cuv, s, uv) , 6 days of repeats (date). Note: Treatment S data only has 5 days of repeats, as day 3 was missing here.

3. To analyze the dataset and distinguish each treatment, principal component discriminant analysis (PCA-LDA) was applied to the standardized dataset. All multivariate analysis was performed using SAS 9.1 (SAS Institute Inc., Cary, NC, USA). The merged file was processed first using the principal component discriminant analysis (PCA-LDA) method as well as the leave-one-out cross-validation.

4. Dataset was imported in SAS and then the principle components (PCs) were calculated.

The first 25 principle components, representing nearly 95% of all the information from the spectra, are applied to the linear discriminant analysis (LDA).

5. How many PCs we chose depended on what level of discriminant level (sensitivity and

specificity) we required. In order to find the best number of PCs we repeated the PCA- LDA analysis several times by increasing the number of PCs we used. The number of PCs was increased starting from 2 until the associated highest classification accuracies (sensitivity and specificity) were reached. For our dataset, 25 was a good number of PCs when balancing the minimum information input and better classification accuracies.

6. The classification accuracies, sensitivity and specificity, were computed by performing

the leave-one-out cross validation method. Leave-one-out is the degenerate case of K- Fold Cross Validation, where K is chosen as the total number of total spectra. Leave- one-out cross-validation (LOOCV) involves using a single observation from the original data as the validation data, and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation da-

ta. Therefore, the averages of all sensitivity and specificity were the final classification accuracies.

7. To better visualize the separation among clusters produced by this Multivariate analysis, one 3D plot and one 2D plot (Figure 5-1) below were drawn by MatLab R2009a (The Mathworks Inc. Natick, MA), where each cluster corresponded to the differentiation treatment for T-Cells. AntiFas is in blue. Control is in red. ControlUV is in cyan. Sorbitol is in magenta. UVTeated is in green.

5.2.1.1 Results and Discussions

Analysis results are presented in Table 5-2 and Figure 5-1. In Figure 5-1, it can be seen from the 2D or 3D plot that the control is well-separated from the remaining groups and that each cluster only slightly overlaps with any nearby group. The 100% sensitivity and 99.8% specificity in Table 5-2 also indicate that the control is well-separated from the remaining groups. We also show good discrimination among all five clusters, which should indicate that PCA-LDA with LOOCV is a good method for multi-group analysis.

We also examined the results by using different numbers of PCs. Table 5-3 presents results for PCA-LDA with 365 PCs applied in the analysis with a nearly 100% sensitivity and specificity. As the number of PCs decrease, classification ability decreased dramatically as shown in Table 5-4 (10 PCs) and Table 5-5 (7 PCs). This result agrees with the hypothesis that the more information (more PCs) from the datasets we have, the better the discrimination ability that can be achieved.

Table 5-1 Data structure for all T-Cell with five different treatments.

Var1 … Var365 inf date

Abs1 Intensity 1 1 … . . 1 6 2 1 … . . … 2 6 … 1 … . . Abs721 5 6

Table 5-2 Results for PCA-LDA with LOOCV 25 PCs applied in the analysis. 25PCs

FT-IR prediction

AntiFas Control ControlUV Sorbitol UVTreated

1 2 3 4 5 AntiFas 1 123 0 0 3 14 Control 2 0 148 0 0 0 ControlUV 3 2 1 131 0 19 Sorbitol 4 4 0 0 125 0 UVTreated 5 11 0 15 0 127 Number Correct 123 148 131 125 127 Sensitivity 87.86% 100.00% 85.62% 96.90% 83.01% Specificity 96.90% 99.80% 97.21% 99.44% 94.11%

Table 5-3 Results for PCA-LDA with 365 PCs applied in the analysis. 365 PCs

FT-IR prediction AntiFas Control ControlUV Sorbitol UVTreated

1 2 3 4 5 AntiFas 1 139 1 0 0 0 Control 2 0 148 0 0 0 ControlUV 3 0 0 150 1 0 Sorbitol 4 0 0 0 123 6 UVTreated 5 0 0 0 6 147 Number Correct 139 148 150 123 147 sensitivity 99.29% 100.00% 99.34% 95.35% 96.08% specificity 100.00% 99.82% 100.00% 98.82% 98.94%

Table 5-4 Results for PCA-LDA with 10 PCs applied in the analysis. 10 PCs

FT-IR prediction AntiFas Control ControlUV Sorbitol UVTreated

1 2 3 4 5 AntiFas 1 85 1 0 25 29 Control 2 0 146 0 2 0 ControlUV 3 0 5 124 1 21 Sorbitol 4 16 2 0 104 7 UVTreated 5 32 0 24 18 79 Number Correct 85 146 124 104 79 sensitivity 60.71% 98.65% 82.12% 80.62% 51.63% specificity 90.42% 98.00% 94.52% 90.42% 88.95%

Table 5-5 Results for PCA-LDA with 7 PCs applied in the analysis. 7 PCs

FT-IR prediction AntiFas Control ControlUV Sorbitol UVTreated

1 2 3 4 5 AntiFas 1 84 2 0 23 31 Control 2 1 123 15 9 0 ControlUV 3 0 22 108 5 16 Sorbitol 4 16 6 1 98 8 UVTreated 5 44 0 25 12 72 Number Correct 84 123 108 98 72 sensitivity 60.00% 83.11% 71.52% 75.97% 47.06% specificity 86.80% 92.35% 90.19% 88.76% 88.25%

We have preformed PCA-LDA with leave-one-out cross validation. The results show the same level of accuracy of discrimination as described in the literature [30] . However, we encountered another problem due to the same variation of the spectra we collected from different days.

Leave-one-out cross-validation (LOOCV) involves using a single spectrum from all the spectra collected over different days as the validation data, and the remaining spectra as the training data. This is the same as a K-fold cross-validation with K being equal to the number of observations in the original sample. If the data collected over different days proves to be consistent (no statistical difference over days among all spectra) or the distribution of the observed sample is the same as the population distribution, we will get similar results from K- fold cross-validation (with k equal to any number) as the results we got from leave-one-out, which represents a fair cross validation.

However, we obtained a much lower sensitivity and specificity from leaving one out of six days (all the spectra collected from one day) out of the cross validation (6-fold CV) (refer

to 5.2.2), which showed that our sample size is too small to apply the PCA-LDA k-fold CV method. Statistically speaking, 20 independent datasets will be a good sample size for any type of statistical analysis. Therefore, we have to employ a new method to reduce the influence of the variation from the different days due to our current sample size. Thus, we have settled on the new pair wise comparison method (as we discussed in 5.1) that we applied to our T-Cell datasets with a sample size equal to 6.

In the process of pair wise comparison, the most important step is to use inner intra differences. Naturally, one question will arise: why do we use inner intra differences? In order to test whether there is a difference between population means of any two groups (two different treatments), our data should satisfy two conditions:

1. The two populations have the same variance. This assumption is called the as-

sumption of homogeneity of variance. To prove this assumption, we have to prove that the variances of inner-differences within treatment type 1 should be the same as that within treatment type 2. We applied a homogeneous variance test to the two kinds of inner difference, and the homogeneous variance was confirmed.

2. The populations are normally distributed. We randomly divided all spectra obser-

vation points into two equal groups for each treatment, and find their average absorbances and inner and intra differences. From central limit theorem, both inner and intra-differences have approximately normal distributions.

Only by satisfying these two conditions, can the statistical results derived from all the statistical methods be treated as reliable. However, most of the statistical methods employed in the

literature do not attempt to meet these conditions; therefore, it is unclear whether or not the results from those studies are statistically significant [30].

In order to reduce the influence of the variability among the data observed in different dates, we will only use the differences of the absorbances between two groups (two different treatments) from the same date. To apply pair wise comparison, a control group was created by find- ing the difference between two spectra from treatment 1 and two spectra from treatment 2. Be- fore the subtraction, we randomly divided all spectral observation points into two equal groups for two different treatments, and found their average absorbances. The differences between the same treatments are called inner-difference, and the differences between different treatments are called intra-difference.

In document The Relationships between Energy Balance Deviations and Adiposity in Children and Adolescents (Page 149-157)