8 Discussion
8.1 Hypercube Soil Sampling for Digital Soil Mapping
The cLHS method identifies samples, which are stratified in the multivariate feature space in a two-step approach. First, strata are generated in a hypercube, which spans the feature space. Second, one target site per stratum is selected according to an optimization procedure [Metropolis et al., 1953] that covers the feature space by a combination of unique target sites. The latter target site selection process is conditioned by only selecting feature space locations that exist in the real world [Minasny & McBratney, 2006].
Roudier et al. [2012], Mulder et al. [2013], and Clifford et al. [2014] further constrained the target site selection by penalizing spatial locations with limited accessibility within each predefined stratum. This implies a potential bias in the final sample set size, since inaccessible areas might occupy segments of the feature space that were not sampled. Further, accessibility at the target sites is not guaranteed because the existence of accessible locations in the specific stratum cannot be ensured. Clifford et al. [2014] eludes the latter problem by additionally analyzing the relation of each location in a defined neighborhood to the initial target site, providing an ordered list of alternative target sites. While it is a combination of one unique sample per stratum covering the feature space initially, the generation of alternatives follows a biased assumption that all other target sites have been successfully sampled before. However, according to a simulation study, Clifford et al. [2014] showed that the feature space coverage remains preserved by replacing up to 50% of the initial target sites by alternatives sites.
In the present thesis, instead of penalizing locations in the stage of target site selection and generating alternatives for each initial target site individually, all possible combinations of accessible locations were analyzed according to their feature space coverage and with respect to the original cLHS design. This enables for quantifying the deviation to the original cLHS design and for providing alternative target sites, while avoiding assumptions about previously sampled target sites. Ensuring that at least one accessible location per stratum is available requires to increase the number of locations within each stratum. This is accomplished by
8 Discussion
77 decreasing the number of covariates, which build the feature space and the sample set size [Minasny & McBratney, 2006].
The number of covariates was reduced to the two most adequate (i.e., ACC and SWI) according to a plausible correlation to the target variables [Gessler et al., 2000], and avoidance of multicollinearity within the feature space [Hengl et al., 2003; Mulder et al., 2013; Schmidt et al., 2014]. Other studies that applied cLHS used more covariates (approximately four to ten) to build the feature space, while the ratio of samples per km² varies between 0.005 and 0.465 in study areas with sizes ranging from 720 km² to 12,800 km² [Mulder et al., 2013; Clifford et al., 2014; Taghizadeh-Mehrjardi et al., 2014; Kidd et al., 2015]. Brungard & Boettinger [2010] suggested a minimum ratio of 0.7 to 1 samples per km² using cLHS in a study area of a size of 300 km². In this study, the sample set size was optimized to n = 30 for a study area of 4.2 km², thus, 7.1 samples per km², by analyzing ten simulated cLHS sets with different sample set sizes. Moreover, it was shown that the deviation of the feature space coverage from the original cLHS set is marginally (Table 5), while 93% of the original samples were replaced by alternatives (Table 4).
Referring to the incorporation of legacy samples to cover the feature space, Clifford et al. [2014] used 8,669 legacy samples as predefined basis to locate 300 additional target sites. Since the subset of legacy samples was not obtained purposively to cover the feature space, redundancies within the subset were not considered and potentially led to a bias in the combined sample set. In this context, Carré et al. [2007] proposed to analyze the distribution of legacy samples across the strata to evaluate the adequacy of legacy samples for the feature space coverage. Following the principles of this approach, only those five legacy samples were used, which are located within a stratum, thus, avoiding redundancies and reducing the sampling effort. Consequently, the increased number of locations per stratum also accommodates with the goal to integrate legacy samples, which are adequate to cover the feature space by a predefined sample size n. In the present study, only one legacy sample is available in the respective stratum, thus, no evaluation of preference is necessary. Apart from this situation, the results suggest to consider multiple legacy samples within a stratum simply as potential accessible target sites and to follow the methodological procedure as described in Section 6.4.
8 Discussion
78
Any cLHS or stratified sampling design must be seen as a specific case study resulting in a partially limited generalizability. This is attributed to the assumptions that the feature space, determined by local framework conditions, sufficiently describes the targeted soil property in space and time [Clifford et al., 2014]. Besides these limitations, this study's approach is transferable to any other study area, considering that an increasing number of samples and spatial resolution increases the computational load. Thus, the applied approach is suitable for small study areas with highly limited accessibility.
Soil texture estimations using DSM typically show accuracies less than 0.50, while studies with R² > 0.70 are rare [Malone et al., 2009; Lacoste et al., 2011; Wang et al., 2012; de Carvalho Junior et al., 2014; Mansuy et al., 2014]. The results of the present study (Table 6) show accuracies with R² > 0.50 for all sand fractions and for both calibration sets (cLHSadapt with n = 30; cLHSadapt+ with n = 60). Comparing the prediction accuracies for both calibration sets and referring to independent and bootstrap validation, the cLHSadapt approach results in slightly decreased accuracies by only 0.5 %. This is confirmed by referring to the coherence of the total sand fraction, ideally amounting to 100%. The similarity between the two DSM approaches confirms the robustness of the proposed sampling approach. Moreover, since the calibration sample set size of n = 30 is relatively small compared to other studies with n between 165 and 4,920 for Random Forest (RF) regression analysis [Grimm et al., 2008; Lacoste et al., 2011; Mansuy et al., 2014], and considering the topographical heterogeneity of the study area, the accuracy of the proposed DSM approach is noticeable.