Chapter 4 Discussion
4.3 Principal Component Analysis vs Tile Size
4.3.1 Contribution to Explained Variance
The results of PCA applied to a collection of measures derived from a bathymetric surface over a range of tile sizes can provide information about how the spatial and analysis scale of the data affect the processed measures. The amount of variance in the data explained by the PCA is the most direct way to determine whether the calculated measures accurately represent the properties of the data. After each analysis, the variance associated with each of the reduced components can be computed from the eigenvalues of the correlation matrix created during processing. Summing the variance of the individual components for each tile size analysis gives the cumulative variance explained for the entire dataset.
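The calculation described above can be illustrated with a minimal sketch. It assumes the measures are arranged as a table with one row per tile and one column per measure; the function name `explained_variance` and the synthetic input are hypothetical and stand in for the processing code and survey data used in this study.

```python
import numpy as np

def explained_variance(measures):
    """Per-component and cumulative explained variance, in percent.

    measures: 2-D array with one row per tile and one column per measure
    (an assumed layout, not necessarily that of the processing code).
    """
    # Correlation matrix of the measures (columns are the variables).
    corr = np.corrcoef(measures, rowvar=False)
    # eigvalsh handles symmetric matrices and returns eigenvalues in
    # ascending order, so reverse them to rank components by variance.
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]
    percent = 100.0 * eigenvalues / eigenvalues.sum()
    return percent, np.cumsum(percent)

# Synthetic stand-in for the terrain measures: two pairs of strongly
# correlated columns plus one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
measures = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=500),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
])

percent, cumulative = explained_variance(measures)
retained = cumulative[2]  # variance explained by the first three components
```

Because the eigenvalues of a correlation matrix sum to the number of measures, each eigenvalue divided by that sum gives the fraction of variance carried by its component, and the running sum gives the cumulative variance for the components retained.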
Figure 4.5 shows the cumulative variance explained for each tile size that was tested in this study. The blue bars represent values of the cumulative variance that were obtained using the 0.5-meter resolution bathymetry, and the red bars represent results obtained using the 0.25-meter resolution data. For most of the smaller tile sizes tested using the 0.5-meter resolution data, there does not appear to be any correlation between the amount of variance explained and the tile size used; the amount of variance explained is consistently around 85 percent. As tile size increases, the variance explained by the PCA decreases; the largest decrease occurs at a tile size of N=65, which accounts for around 78 percent of the variance. A similar pattern is seen when higher-resolution (0.25-meter) versions of the data are analyzed. In this case, the cumulative variance is steady at just over 85 percent. Due to processing constraints, additional tile sizes were not tested to see if the trend in explained variance continues.
Figure 4.5: Plot of cumulative variance explained for the Amnicon River Survey with increasing tile size, N. The bars in blue are for data representing a surface resolution of 0.5 meters, and the red bars are for a data resolution of 0.25 meters and tile sizes of N = (11, 13, 15).
Unlike the Amnicon River data, analysis of the data collected in the Lester River survey shows the opposite progression of explained variance with increasing tile size, as illustrated in Figure 4.6. In this case, the smallest tile size explains the least variance, and the amount of variance explained increases with increasing tile size. Unlike the Amnicon River Survey, which exhibits a complex lake floor terrain, the Lester River site is characterized by a maximum of three terrain types. The most common terrain type observed is a smooth bedrock surface that covers large portions of the survey area. The variance in the Lester River dataset may be more appropriately captured using larger tile sizes, which are well suited to characterizing the regional trends in the lake floor geology. Smaller tiles perform poorly here, which suggests that they are less able to recognize smaller or localized features such as boulders or a fractured bedrock surface.
Figure 4.6: Plot of cumulative variance explained for the Lester River data at increasing tile sizes, N. Data used in this analysis was gridded at a resolution of 0.5 meters.
As tile size increases, the processing code calculates the terms of the quadratic equation outlined in Chapter 2 over a larger window, effectively capturing the lower-frequency detail of the lake floor terrain. As a result, smaller, more subtle variations in the actual lake floor are no longer picked up by the feature calculation, and the surface is therefore more generalized. This generalized surface carries less information about the lake floor’s terrain and complexity, and less of the variance is explained with increasing tile size. Although explained variance was expected to increase as tile size decreased, no such trend was observed for the Amnicon River dataset. This is interpreted to be a consequence of the limitations of the data resolution. There appears to be some slight improvement in the variance explained when the PCA is repeated for the same tile size using higher-resolution versions of the same dataset, as shown in Figure 4.5 (red bars), but the improvements are not very significant. At some point, the limit set by the data’s resolution and the resolving ability of the tile sizes will be reached. The smallest spatial scale possible using this methodology would use a tile size of N = 3, representing the smallest possible window size (Wood, 2009). Unless higher-resolution data are collected and processed, the variance explained from the original dataset appears to be maximized at small tile sizes (N ≤ 17).
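The generalizing effect of a larger window can be illustrated with a short sketch. It assumes the six-term quadratic form z = ax² + by² + cxy + dx + ey + f, which is commonly used for window-based surface fitting; the exact form given in Chapter 2 may differ. The function names, the synthetic surface, and its wavelength and amplitude are all hypothetical and are not taken from the survey data.

```python
import numpy as np

def fit_quadratic(window, cell_size=1.0):
    """Least-squares fit of z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f
    to an N x N window, with the origin at the window centre.
    The six-term form is an assumption, not confirmed by this chapter."""
    n = window.shape[0]
    coords = (np.arange(n) - (n - 1) / 2.0) * cell_size
    x, y = np.meshgrid(coords, coords)  # x varies along columns, y along rows
    x, y, z = x.ravel(), y.ravel(), window.ravel()
    design = np.column_stack([x**2, y**2, x * y, x, y, np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(design, z, rcond=None)
    fitted = (design @ coeffs).reshape(window.shape)
    return coeffs, fitted

def rms_misfit(surface, n, cell_size=0.5):
    """RMS difference between a centred N x N window and its quadratic fit."""
    c = surface.shape[0] // 2
    h = n // 2
    window = surface[c - h:c + h + 1, c - h:c + h + 1]
    _, fitted = fit_quadratic(window, cell_size)
    return float(np.sqrt(np.mean((window - fitted) ** 2)))

# Synthetic surface: gentle regional slope plus short-wavelength roughness
# (4 m wavelength on a 0.5 m grid; illustrative values only).
size, cell = 65, 0.5
axis = (np.arange(size) - size // 2) * cell
X, Y = np.meshgrid(axis, axis)
surface = 0.05 * X + 0.3 * np.sin(2 * np.pi * X / 4.0) * np.cos(2 * np.pi * Y / 4.0)

misfit_small = rms_misfit(surface, 3)   # N = 3 window
misfit_large = rms_misfit(surface, 21)  # N = 21 window
```

In this sketch the N = 21 window spans several wavelengths of the roughness, so the fitted quadratic follows only the regional slope and leaves the roughness in the misfit, whereas the N = 3 window follows the local surface closely.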
The percent contribution from each principal component also appears to vary as a function of tile size. Trends in the variance explained as the spatial scale of the analysis increases may hold information about the nature of the processing methodology or the underlying properties of the survey locations. Figure 4.7 shows the contribution of each principal component as a function of tile size. The first component shows a very subtle increase in contribution with increasing tile size, with a significant decrease at a tile size of N=21. Principal component two, however, shows a decreasing progression in variance explained, in contrast to components 3-5.
Figure 4.7: The contribution of each principal component plotted as a function of tile size. Data are computed from the Amnicon River Survey. Note the scale of the Y-axis of each sub-plot; the percent contribution changes significantly from one component to another.
The fifth component shows an average contribution of between 8 and 8.5 percent, which changes very little with increasing tile size. With the exception of component two, these results suggest that the principal components calculated for the Amnicon River are not very sensitive to tile size increases at these spatial scales.
Trends in the calculated mean values suggest a dependence on tile size, which is potentially useful for determining the appropriate spatial scale for an analysis (Figure 4.4).