GRNN for moving windows multiscale analysis

4.8 GRNN modeling and estimations for the set 3B

4.8.2 GRNN for moving windows multiscale analysis

As was done with the skewness-test and the experimental variography in chapter 3, the GRNN method can be used to give insights of the data multiscale behavior. It was observed how skewness and variance reduces with data’s size. It was also observed that the arbitrary partition of data with moving windows does not help with variography modeling but helped rather with the regional partition. The optimal sigma parameter from GRNN indicates the degree of spatial variability of the dataset; a high variable dataset will require larger sigma values or bandwidths to define their distribution function.

For the MW multiscale analysis, a GRNN optimization by leave-one-out cross-validation (CV) was launched. The Gaussian kernel is used for its simplicity and because it adapts to most kinds of data configuration. The sigma parameter with the lowest CV mean squared error (MSE) was recorded for each dataset enclosed by each of the cells for the resolution grids of analysis (namely grids 5, 10, 20 and 40). The algorithm is sensitive to the initial settings of sigma parameter; the starting sigma should be small enough to consider cases of high fitting. It was found that an appropriate starting sigma value was the average distance between points (equation 3.1).

In addition, a maximum sigma must be provided as a reference for the range of optimization. This maximum was computed as half of the diagonal of the bounding box. In practice, when the optimal sigma is found beyond this value the model becomes highly smoothed. A refinement of the sigma value is completed to give more precision to the results. A refinement is also conducted when there is a tendency of the optimal sigma to be

below the initial minimum or above the maximum values. In cases where the optimal sigma tends towards infinity, the dataset was considered unstructured. Grid cells having less than 20 points were filtered-out.

Table 4.1 presents the mean number of points, the mean variance, the mean of the half diagonals, the mean optimal sigma for GRNN, and a referential percentage of cells for each grid with sigma values under 80,000 (more structured).

Table 4.1: Mean values for GRNN features at 4 scales of analysis MW mean n mean mean half mean optimal % structured grid points variance diagonal sigma datasets

5 1817 107137 69099 2576 100

10 641 106281 35510 3152 89

20 214 86888 17187 2485 76

40 88 65066 8508 2183 71

To calculate the mean value of optimal sigma per grid, only the subsets with an optimal lower than 80,000 were considered. When a very large sigma is obtained, the optimization curve is no more convex and tends towards infinity. It should be taken into account that sigma relates to spatial distances since the input variables are spatial coordinates, but the sigma value itself is not a distance. Sigma is the smoothing parameter of the density function and reflects primarily how well it adjusts to the kernel function.

For grid5, 100% of the window subsets attained an optimal sigma in the range of the starting minimum and maximum sigma proposed; 7992 was the maximum optimal sigma obtained. It has been considered that these sets are more or less structured. For the grids with smaller moving windows and smaller datasets, there were cases where no convex curve of optimization was attained and therefore they were considered as spatially unstructured. The percentages of ‘structured’ sets were 100, 89, 76 and 71 % for MW grid5, grid10, grid20 and grid40 respectively. In map 4.21a, two subsets with an optimal sigma over 7000 were highlighted with red while the other 2 sets with small sigma values were highlighted in green. The corresponding CV optimization curves using GRNN are presented a-side.

As seen in Figure 4.21 the optimization curves have different shapes depending on whether a tendency towards fitting or smoothing exists. For the cells G5C7 and G5C10 a large sigma was found optimal. The common denominator for these two sets is that they both demonstrate significant spatial clustering. For cell G5C11, spatial clustering appears to also be important, but the relatively large number of samples seems to play a role in result- ing in a low sigma. For the case of cell G5C25, the number of samples is low (only 108) but they appear to be well distributed spatially. In this case, more than one local optimum can be found. When using a starting minimum sigma higher than 3000, the optimization curve can fall in a local minima with a very large sigma. When forcing the search to lower sigma values, a local minimum of 888 was found. As mentioned before, GRNN optimization is sensitive to initial parameters.

The number of samples and spatial clustering appear to play a major role when it comes to train the optimal sigma. Having more samples in a set tends to produce a unique solution.

Figure 4.21: Map with MW grid5 and some indoor radon GRNN sigma optimization curves with large and short sigma

This tendency is seen in the results of Table 4.1 where the datasets of finer grid cells (G10, G20 and G40) are more unstructured. The more data available, the better GRNN defines the (non-parametric) spatial distribution function.

In Figure 4.22, a map with the MW cells of the grid10 is shown. Some cells are highlighted with colors according to the results of the GRNN sigma optimization. Cells with a sigma exceeding the bounding box half-diagonal distance are highlighted in red. The cells in orange color represent subsets with an optimal sigma between 5000 and the half-diagonal distance. The cyan colored cells are those that were discharged because they have less than 20 sample points.

From the previous figure, it can be noted that when datasets are smaller, there is a tendency for some subsets to have large sigma values. When data are heavily clustered (with large empty spaces) and with high local variability (as cell 83 from grid10), sigma values are high. The GRNN multiscale analysis is useful in identifying areas posing difficulties to modeling. At first glance, the areas with lower information are logically difficult to be mod- eled. Some areas with high clustering will prevent an adequate use of GRNN. The high local variance will primarily affect kriging methods as GRNN is more robust in this regard.

It is interesting to compare the general results from MW experimental variography with those of the MW GRNN. On the one hand, the number of samples to build a variogram model are not very constraining, but the stationary conditions are indeed critical for variography. GRNN, on the other hand, suffers with the lack of samples but it is robust to local variability and noise.

In Figure 4.23, the boxplot of optimal sigma values for the GRNN analysis on different scales are shown. Sigma values become skewed for subsets in grid40 because there are a significant number of extreme sigma values. In fact, the lowest optimal sigma values obtained for the datasets of grid40 were somehow forced to a local minima by constraining the optimization search to lower initial values. When data resulted absolutely unstructured, sigma values were very large.

Figure 4.22: Map of higher grnn optimal sigma values. Cells with larger sigma are in red and with medium sigmas in orange. Filtered-out cells are highlighted with cyan color

Figure 4.23: Boxplots of optimal sigma values for grid5, 10, 20 and 40 (from left to right)

If a very large sigma value is used to make predictions, the result is simply a generalized estimator or the mean of training points, as was initially expressed in equation 4.24.

In document Analysis and spatial modeling of the indoor radon Swiss data (Page 158-162)