3.2 Materials and Methods
3.2.4 Model construction
3.2.4.1 Random Forest regression
We hypothesised that per-pixel LULC fractions can be retrieved from spectral data, in line with previous studies (e.g. Colditz et al., 2011; Guerschman et al., 2009; Lu et al., 2003; Obata et al., 2012; Schwieder et al., 2014). Modelling multi-type LULC fractions can be conceived as a multi-output regression task. This task can be accomplished either by simultaneously modelling a multi-output response, or by separately modelling single-output responses and aggregating the outcomes (Hothorn et al., 2006; Segal, 2004). In this study, we decomposed the multi-type fractional cover regression task into a set of single-type regression tasks. Accordingly, we built a fractional cover model for each LULC type and aggregated the model outcomes.
Table 3.2. Specification of the scenarios in combinations of the predictor set, time interval, and smoothing options.
Predictor set    No smoothing         Savitzky-Golay (SG) smoothing
                 8-day     16-day     8-day     16-day
NDVI             S1        S5         S9        S13
EVI              S2        S6         S10       S14
SR               S3        S7         S11       S15
Full             S4        S8         S12       S16
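The scenario layout of Table 3.2 can be enumerated programmatically. The following is a minimal sketch, not part of the original workflow; the variable names and the dictionary layout are assumptions chosen for illustration, and only the scenario labels and option names come from the table.

```python
from itertools import product

# Option names taken from Table 3.2; the ordering reproduces its numbering.
smoothing = ["none", "SG"]
intervals = ["8-day", "16-day"]
predictors = ["NDVI", "EVI", "SR", "Full"]

# Hypothetical container: scenario label -> processing options.
scenarios = {
    f"S{k}": {"smoothing": s, "interval": t, "predictors": p}
    for k, (s, t, p) in enumerate(product(smoothing, intervals, predictors), start=1)
}

for label, options in scenarios.items():
    print(label, options)
```

With the smoothing option varying slowest and the predictor set fastest, the labels match the table, so S10 is the SG-smoothed 8-day EVI scenario.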
Fractional cover regression can be implemented via various techniques, including the fuzzy classifier (Foody et al., 1996), the time series model (Lu et al., 2003), linear models (DeFries et al., 1995; Schwarz et al., 2005), data mining algorithms (Fernandes et al., 2004; Schwieder et al., 2014), and spectral mixture analysis (SMA) (Asner et al., 2000; Guerschman et al., 2009). Among these techniques, we used the regression mode of Random Forest (RF). RF is a decision-tree-based ensemble algorithm that uses bootstrap aggregation (bagging) and the random sub-space method (Breiman, 2001; Prasad et al., 2006). It is suitable for modelling non-linear relationships and can handle a large number of covariates, as it tends not to overfit the data (Breiman, 2001; Prasad et al., 2006; Segal, 2004). Its performance is comparable to that of other state-of-the-art learning algorithms such as support vector machines or neural networks (Attarchi et al., 2014; Gislason et al., 2006; Prasad et al., 2006; Schwieder et al., 2014). Moreover, it is convenient to set up compared to other data mining algorithms, as it has a small number of training parameters (Liaw et al., 2002).
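The two ingredients named above, bagging and random feature sub-spaces, can be illustrated with a toy ensemble. This is a simplified sketch and not the implementation used in this study: each tree is reduced to a single split (a stump), so drawing one feature subset per tree coincides with drawing it per split. All function names and the demonstration data are hypothetical.

```python
import random
from statistics import mean

def fit_stump(X, y, feature_ids):
    """Fit a one-split regression tree using only the given candidate features."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in feature_ids:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            lm, rm = mean(left), mean(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    if best is None:  # every candidate feature is constant in this sample
        m = mean(y)
        return (feature_ids[0], 0.0, m, m)
    return best[1:]

def fit_forest(X, y, n_trees=50, mtry=1, seed=0):
    """Bagging plus random sub-spaces: one bootstrap sample and mtry features per tree."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (in-bag data)
        feats = rng.sample(range(p), mtry)          # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Average the predictions of all trees, as in the regression mode of RF."""
    return mean(lm if row[f] <= thr else rm for f, thr, lm, rm in forest)

# Hypothetical data: feature 0 drives the response, feature 1 is uninformative.
X = [[i / 19, (i * 7 % 20) / 19] for i in range(20)]
y = [0.2 if row[0] < 0.5 else 0.8 for row in X]
forest = fit_forest(X, y)
low, high = predict(forest, [0.1, 0.5]), predict(forest, [0.9, 0.5])
print(low, high)
```

A production model would grow full trees and redraw the feature subset at every split; the stump version only shows how averaging over bootstrap samples and feature subsets produces the ensemble prediction.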
In land cover modelling, RF has been used to classify land cover (Clark et al., 2010; Gislason et al., 2006; Hüttich et al., 2009; Nitze et al., 2015; Rodriguez-Galiano et al., 2012; Thenkabail et al., 2005), vegetation type (Hüttich et al., 2009; Immitzer et al., 2012; Senf et al., 2013), and crop type (Ghimire et al., 2010; Nitze et al., 2012). In fractional land cover regression, Schwieder et al. (2014) used RF to estimate shrub cover fractions, where RF performed comparably to support vector machine and partial least squares regression.
3.2.4.2 Spatial cross-validation
Due to the bagging and the random sub-spacing of RF (Breiman, 2001), the bootstrap samples for training (in-bag data) can be correlated with the test samples (out-of-bag data), especially for spatial models (Brenning, 2005). To avoid dependencies between training and test data, we externally partitioned training and test data by a spatial partitioning scheme utilised by Reineking et al. (2010). The spatial partitioning was implemented in our study as follows. First, we recursively split the whole area in two, six times, which divides the catchment into 64 (2^6) sub-clusters. Second, we formed 16 clusters by randomly sampling four sub-clusters for each, so that each cluster comprises four spatially disjoint sub-clusters, distinguished by different colours in Supplementary Figure 3.10.
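The two partitioning steps can be sketched as follows. The text does not state how each binary split was placed, so this sketch assumes a median split that alternates between the x and y coordinates; the function names, the seed, and the 32 x 32 demonstration grid are likewise hypothetical.

```python
import random

def recursive_split(points, ids, depth, axis=0):
    """Recursively halve `ids` at the median coordinate, alternating x and y.

    Assumption: median splits on alternating axes; the original scheme may differ.
    """
    if depth == 0:
        return [ids]
    ordered = sorted(ids, key=lambda k: points[k][axis])
    half = len(ordered) // 2
    nxt = 1 - axis
    return (recursive_split(points, ordered[:half], depth - 1, nxt)
            + recursive_split(points, ordered[half:], depth - 1, nxt))

def spatial_folds(points, depth=6, n_folds=16, seed=0):
    """Split into 2**depth sub-clusters, then group them at random into folds."""
    sub_clusters = recursive_split(points, list(range(len(points))), depth)
    order = list(range(len(sub_clusters)))
    random.Random(seed).shuffle(order)  # random draw without replacement
    per_fold = len(sub_clusters) // n_folds
    return [
        [sub_clusters[s] for s in order[f * per_fold:(f + 1) * per_fold]]
        for f in range(n_folds)
    ]

# Hypothetical study area: pixel centres on a 32 x 32 grid.
points = [(x, y) for x in range(32) for y in range(32)]
folds = spatial_folds(points)
print(len(folds), [len(fold) for fold in folds])
```

Each fold then serves once as test data while the remaining 15 folds are used for training, which keeps test pixels spatially separated from most of the training pixels.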
3.2.4.3 Fractional cover estimation
Let $T$ be the number of LULC types such that each type $i$ has a set $F_i = \{f_{i,1}, \ldots, f_{i,n}\}$ of $n$ observed LULC fractions, where $f_{i,j}$ is the fractional area of pixel $j$ covered by LULC type $i$, and $n$ is the total number of pixels belonging to the study area.
Each LULC fraction satisfies $f_{i,j} \in [0,1]$, and all fractions of one pixel sum to one:
\[ \sum_{i=1}^{T} f_{i,j} = 1 \tag{3.3} \]
for all $j \in \{1, \ldots, n\}$.
First, we built an RF regression model per type. Given a type $i$, we used the observed fractions $F_i = \{f_{i,1}, \ldots, f_{i,n}\}$ as the response and a set of feature vectors $P = \{p_1, \ldots, p_n\}$ as predictors. Each feature vector contained $n_{\mathrm{feature}}$ features, a number that varied with the spectral data used (Supplementary Table 3.4).
The regression model was trained and tested with a 16-fold cross-validation (cf. Section 3.2.4.2 for details). By accumulating the test pixels of all CV folds, we obtained the predicted fractions $\hat{F}_i = \{\hat{f}_{i,1}, \ldots, \hat{f}_{i,n}\}$ of type $i$ over the entire study area. Note that RF produces a prediction from every regression tree (Breiman, 2001); therefore, ntree fractions were predicted for each pixel, where ntree is the total number of regression trees. We took the mean of the ntree predictions. This generated a set of LULC fractional cover estimates for the study area.
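Written out, the averaging step above amounts to the following expression. The symbol for the prediction of an individual tree $t$ is introduced here for illustration only and does not appear in the original notation.

```latex
\[
  \hat{f}_{i,j} \;=\; \frac{1}{n_{\mathrm{tree}}} \sum_{t=1}^{n_{\mathrm{tree}}} \hat{f}^{(t)}_{i,j},
  \qquad i = 1, \ldots, T, \quad j = 1, \ldots, n,
\]
% \hat{f}^{(t)}_{i,j}: fraction of type i predicted for pixel j by regression tree t
% (hypothetical symbol); n_tree corresponds to the parameter ntree.
```

Because each type is modelled separately, nothing in this step forces the $T$ averaged fractions of a pixel to sum to one.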
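Then we normalised the type-wise predictions so that they satisfy Eq. 3.3. For each pixel $j$, the normalised prediction $\hat{f}^*_{i,j}$ was
\[ \hat{f}^*_{i,j} = \frac{\hat{f}_{i,j}}{\sum_{k=1}^{T} \hat{f}_{k,j}} \tag{3.4} \]
where $\hat{f}_{i,j}$ is the type-wise prediction of type $i$ for pixel $j$. Finally, we obtained the predicted LULC fractions $\hat{F}^* = \{\hat{F}^*_1, \ldots, \hat{F}^*_T\}$, where $\hat{F}^*_i = \{\hat{f}^*_{i,1}, \ldots, \hat{f}^*_{i,n}\}$.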
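The normalisation can be sketched in a few lines, reading Eq. 3.4 as a per-pixel rescaling across types so that Eq. 3.3 holds for the predictions. The function name, the nested-list layout, and the handling of pixels whose predictions sum to zero are assumptions, not details given in the text.

```python
def normalise_fractions(pred):
    """Rescale predictions so the fractions of every pixel sum to one.

    pred[i][j] is the predicted fraction of type i at pixel j (hypothetical layout).
    """
    n_types, n_pixels = len(pred), len(pred[0])
    out = [[0.0] * n_pixels for _ in range(n_types)]
    for j in range(n_pixels):
        total = sum(pred[i][j] for i in range(n_types))
        for i in range(n_types):
            # Assumption: a pixel with an all-zero prediction gets equal shares.
            out[i][j] = pred[i][j] / total if total > 0 else 1.0 / n_types
    return out

# Hypothetical type-wise predictions for three types and two pixels.
pred = [[0.5, 0.2],
        [0.3, 0.2],
        [0.4, 0.1]]
norm = normalise_fractions(pred)
print(norm)
```

The rescaling preserves the ratios between types within a pixel and only changes their common scale.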
3.2.4.4 Training parameters
RF has three training parameters: the number of trees in the forest (ntree), the number of randomly selected variables at each split (mtry), and the minimum number of samples in terminal nodes (nodesize). These parameters need to be tuned to avoid sub-optimal model performance (Rodriguez-Galiano et al., 2012; Strobl et al., 2008).
To find the optimal ntree and nodesize, we performed a grid search on the training folds. We used a grid of all combinations of ntree = {100, 200, ..., 1000} and nodesize = {1, 2, 3, 4, 5}. Grid searching was implemented using an internal validation. We repartitioned the training folds into a new training set and a new test set. The new test set contained two spatial clusters, randomly selected without replacement. We trained the model on the new training set with different parameter values and predicted the hold-out data. This was repeated for all nine types, and we calculated the mean root mean square error (RMSE) over all types. Overall, the model performance improved with large ntree and small nodesize (Supplementary Figure 3.11).
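The tuning grid and the error metric described above can be written down compactly. This sketch covers only the grid and the RMSE averaging; model fitting is omitted, and the function names and data layout are assumptions.

```python
import math
from itertools import product

# All combinations of ntree = {100, 200, ..., 1000} and nodesize = {1, ..., 5}.
grid = list(product(range(100, 1001, 100), range(1, 6)))

def rmse(observed, predicted):
    """Root mean square error between observed and predicted fractions."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )

def mean_rmse_over_types(observed_by_type, predicted_by_type):
    """Average the per-type RMSE, one entry per LULC type (hypothetical layout)."""
    scores = [rmse(o, p) for o, p in zip(observed_by_type, predicted_by_type)]
    return sum(scores) / len(scores)

print(len(grid), grid[0], grid[-1])
```

Each of the 50 grid points would be scored with `mean_rmse_over_types` on the two held-out spatial clusters.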
We optimised ntree and nodesize separately, each based on its marginal RMSE on the tuning grid. We chose parameters by minimising the marginal error metric, unlike Rodriguez-Galiano et al. (2012) or Leutner et al. (2012), who used the joint error metric on the grid. We tried both approaches but opted for the marginal-error-based selection. Compared to the joint-error-based selection, it was less sensitive to between-partition variation and consequently led to more stable parameter selection between scenarios.
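The difference between the two selection rules can be shown on a small grid. The sketch assumes that the marginal RMSE of a parameter value is the mean RMSE over all values of the other parameter, which is one plausible reading of the text; the RMSE values and function names are invented for illustration.

```python
from statistics import mean

def select_marginal(rmse):
    """Pick ntree and nodesize separately by their mean RMSE over the other parameter."""
    ntrees = sorted({k[0] for k in rmse})
    nodesizes = sorted({k[1] for k in rmse})
    marg_ntree = {t: mean(rmse[(t, s)] for s in nodesizes) for t in ntrees}
    marg_node = {s: mean(rmse[(t, s)] for t in ntrees) for s in nodesizes}
    return min(marg_ntree, key=marg_ntree.get), min(marg_node, key=marg_node.get)

def select_joint(rmse):
    """Pick the single (ntree, nodesize) pair with the lowest RMSE on the grid."""
    return min(rmse, key=rmse.get)

# Hypothetical RMSE values on a 2 x 2 grid, keyed by (ntree, nodesize).
rmse_grid = {
    (100, 1): 0.20, (100, 5): 0.10,
    (200, 1): 0.12, (200, 5): 0.14,
}
print(select_joint(rmse_grid), select_marginal(rmse_grid))
```

In this example the joint rule follows the single best cell (100, 5), whereas the marginal rule averages over rows and columns and selects (200, 5), which illustrates why it reacts less to one unusually low cell.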
The parameter mtry was set to the square root of $n_{\mathrm{feature}}$ without grid searching, as in Clark et al. (2012). Since the scenarios have unequal numbers of input features, mtry varied between scenarios. The chosen parameter values are summarised in Supplementary Table 3.4.
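As a small illustration of the rule above, mtry can be derived from the feature count as follows. The text does not say how the square root was rounded, so rounding down is an assumption here, and the feature counts in the loop are invented rather than taken from Supplementary Table 3.4.

```python
import math

def mtry_for(n_feature):
    """Square-root rule for mtry; rounding down is an assumption of this sketch."""
    return max(1, math.isqrt(n_feature))

# Hypothetical feature counts, for illustration only.
for n_feature in (23, 46, 92):
    print(n_feature, mtry_for(n_feature))
```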