3.3 Materials and Methods

3.3.4 Experimental Design

The experimental design was devised to address the following research question: How do the predictive performance estimates of CVAL methods compare to the estimates of OOS approaches for time series forecasting tasks?

Existing empirical evidence suggests that CVAL methods provide more accurate estimations than traditionally used OOS approaches in stationary time series forecasting (Bergmeir and Benítez, 2012; Bergmeir et al., 2014, 2018) (see Section 3.2). However, many real-world time series comprise complex structures. These include cues from the future that may not have been revealed in the past. Effectively, our working hypothesis is that preserving the temporal order of observations when estimating the predictive ability of models is an important component.

Trend, Auto-Regressive Order, and Estimation Set Size

We apply a KPSS statistical test (Kwiatkowski et al., 1992) to account for trend in the data. Time series that are not trend-stationary according to this test are differenced until the test is passed. This approach is commonly used for trend inclusion in forecasting models, for example, ARIMA. Specifically, we follow the procedure adopted by the automatic forecasting model auto.arima from the forecast R package (Hyndman et al., 2014). The number of differences applied to each time series is described in the last column of Table A.1.

We estimate the optimal embedding dimension (p) using the method of False Nearest Neighbours (Kennel et al., 1992). This method analyses the behaviour of the nearest neighbours as we increase p (cf. Section 2.2.3). We set the tolerance of false nearest neighbours to 1%. The embedding dimension estimated for each series is shown in Table A.1. For the synthetic case study, we fix the embedding dimension to 5 in order to follow the experimental setup of Bergmeir et al. (2018).
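The two preprocessing steps described above can be illustrated with a minimal sketch. It assumes that the number of differences d and the embedding dimension p are already known; the KPSS test and the False Nearest Neighbours procedure themselves are not implemented here. The function names `difference` and `embed` and the toy series are hypothetical and serve only as illustration.

```python
def difference(series, d=1):
    """Apply first-order differencing d times (d is assumed to be given)."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def embed(series, p):
    """Time-delay embedding: each row holds p consecutive past values,
    and the target is the value that immediately follows them."""
    X, y = [], []
    for t in range(p, len(series)):
        X.append(series[t - p:t])
        y.append(series[t])
    return X, y


# Toy series with a trend (illustrative values only).
series = [1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0, 36.0]
diffed = difference(series, d=1)
X, y = embed(diffed, p=5)
print(X)  # [[2.0, 3.0, 4.0, 5.0, 6.0], [3.0, 4.0, 5.0, 6.0, 7.0]]
print(y)  # [7.0, 8.0]
```

Each row of `X` is then a set of p predictors for the corresponding entry of `y`, which is the form a regression algorithm expects.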

The estimation set (Yest) of each time series comprises the first 70% of its observations (see Figure 3.4). The validation period (Yval) comprises the subsequent 30% of the observations.
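As a minimal sketch of this partition, the split can be written as a single chronological cut. The function name and the use of integer percentages (to avoid floating-point rounding at the cut point) are assumptions made for illustration.

```python
def split_estimation_validation(series, est_percent=70):
    """Split a series chronologically: the first est_percent% of the
    observations form the estimation set, the remainder the validation set."""
    cut = len(series) * est_percent // 100
    return series[:cut], series[cut:]


y_est, y_val = split_estimation_validation(list(range(10)))
print(y_est)  # [0, 1, 2, 3, 4, 5, 6]
print(y_val)  # [7, 8, 9]
```

No shuffling takes place, so every observation in the validation period is later in time than every observation in the estimation set.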

Estimation Methods

In the experiments, we apply a total of 11 performance estimation methods, which are divided into CVAL variants and OOS approaches. The cross-validation methods are the following:

CV Standard, randomized K-fold cross-validation;

CV-Bl Blocked K-fold cross-validation;

CV-Mod Modified K-fold cross-validation;

CV-hvBl hv-Blocked K-fold cross-validation.
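To make the difference between the blocked variants concrete, the following sketch generates index splits for blocked and hv-blocked K-fold cross-validation. It is an illustration under stated assumptions rather than the implementation used in the experiments: the function names are hypothetical, and `gap` is an assumed name for the number of observations removed on each side of the test block.

```python
def blocked_kfold(n, k):
    """Yield (train_idx, test_idx) pairs where each test fold is one of
    k contiguous blocks; observations are never shuffled."""
    bounds = [i * n // k for i in range(k + 1)]
    for start, stop in zip(bounds, bounds[1:]):
        test = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, test


def hv_blocked_kfold(n, k, gap):
    """Like blocked_kfold, but additionally drop `gap` training
    observations adjacent to each side of the test block."""
    for train, test in blocked_kfold(n, k):
        lo, hi = test[0] - gap, test[-1] + gap
        yield [i for i in train if i < lo or i > hi], test


# Second fold of a 4-fold split over 20 observations, removing 2 per side.
train, test = list(hv_blocked_kfold(20, 4, gap=2))[1]
print(test)   # [5, 6, 7, 8, 9]
print(train)  # [0, 1, 2, 12, 13, 14, 15, 16, 17, 18, 19]
```

Removing the adjacent observations is intended to reduce the dependence between the training and test parts that arises from serial correlation.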

Conversely, the out-of-sample approaches are the following:

Holdout A simple OOS approach–the first 70% of YE is used for training and the subsequent 30% is used for testing;

Rep-Holdout OOS tested in nreps testing periods with a Monte Carlo simulation using 70% of the total observations n of the time series in each test. For each period, a random point is picked from the time series. The previous window comprising 60% of n is used for training, and the following window of 10% of n is used for testing;

Preq-Sld-Bls Prequential evaluation in blocks in a sliding fashion–the oldest block of data is discarded after each iteration;

Preq-Bls-Gap Prequential evaluation in blocks in a growing fashion with a gap block–this is similar to the method above, but comprises a block separating the training and testing blocks in order to increase the independence between the two parts of the data;

Preq-Grow and Preq-Slide As baselines, we also include the exhaustive prequential methods in which an observation is first used to test the predictive model and then to train it. We use both a growing/landmark window (Preq-Grow) and a sliding window (Preq-Slide).

We refer to Section 2.3 in the background chapter of this thesis for a complete description of these methods. The number of folds K or repetitions nreps in these methods is set to 10, which is a commonly used setting in the literature. The number of observations removed in CV-Mod and CV-hvBl (cf. Section 2.3) is the embedding dimension p of each time series.
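The out-of-sample procedures can be sketched as index generators in the same spirit. The code below is a hedged illustration, not the code used in the experiments: the function names, the `seed` argument, and the `gap_blocks` parameter are hypothetical, and the sliding variant assumes that the training window spans a single block.

```python
import random


def rep_holdout(n, n_reps=10, train_pct=60, test_pct=10, seed=0):
    """Repeated holdout: pick a random cut point, train on the window of
    train_pct% of n before it, and test on the test_pct% of n after it."""
    rng = random.Random(seed)
    train_size = n * train_pct // 100
    test_size = n * test_pct // 100
    splits = []
    for _ in range(n_reps):
        # randint is inclusive on both ends; the bounds keep both windows
        # inside the series.
        cut = rng.randint(train_size, n - test_size)
        train = list(range(cut - train_size, cut))
        test = list(range(cut, cut + test_size))
        splits.append((train, test))
    return splits


def prequential_blocks(n, k=10, sliding=False, gap_blocks=0):
    """Prequential evaluation in k blocks. Growing: train on all blocks
    before the test block. Sliding (assumed here): train on one block only.
    gap_blocks blocks are skipped between the training and test blocks."""
    bounds = [i * n // k for i in range(k + 1)]
    splits = []
    for j in range(1 + gap_blocks, k):
        train_end = bounds[j - gap_blocks]
        train_start = bounds[j - gap_blocks - 1] if sliding else 0
        train = list(range(train_start, train_end))
        test = list(range(bounds[j], bounds[j + 1]))
        splits.append((train, test))
    return splits


for train, test in prequential_blocks(100, k=10, gap_blocks=1)[:2]:
    print(train[0], train[-1], test[0], test[-1])
```

In every split produced by these functions, all training indices precede all test indices, which is the property that distinguishes the OOS approaches from the cross-validation variants.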

Evaluation Metrics

Our goal is to study which estimation method provides a $\hat{g}$ that best approximates $L^m$. Let $\hat{g}_i^m$ denote the estimated loss by the learning model $m$ using the estimation method $g$ on the estimation set, and $L^m$ denote the ground truth loss of learning model $m$ on the test set. The objective is to analyse how well $\hat{g}_i^m$ approximates $L^m$. This is quantified by the absolute predictive accuracy error (APAE) metric and the predictive accuracy error (PAE) (Bergmeir et al., 2018):

$$\mathrm{APAE} = |\hat{g}_i^m - L^m| \tag{3.1}$$

$$\mathrm{PAE} = \hat{g}_i^m - L^m \tag{3.2}$$

The APAE metric evaluates the error size of a given estimation method. On the other hand, PAE measures the error bias, i.e., whether a given estimation method is under-estimating or over-estimating the true error.
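A minimal sketch of the two metrics follows. The function names and the numeric values are hypothetical and chosen only to show the sign convention.

```python
def pae(estimated_loss, true_loss):
    """Predictive accuracy error: signed difference between the loss
    estimated on the estimation set and the ground-truth loss."""
    return estimated_loss - true_loss


def apae(estimated_loss, true_loss):
    """Absolute predictive accuracy error: magnitude of the PAE."""
    return abs(pae(estimated_loss, true_loss))


# Hypothetical losses: the estimation method reports 0.75, the true loss is 1.0.
print(pae(0.75, 1.0))   # -0.25
print(apae(0.75, 1.0))  # 0.25
```

A negative PAE, as in this example, indicates that the estimation method under-estimates the true error; a positive value indicates over-estimation.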

Another question regarding evaluation is how a given learning model is evaluated regarding its forecasting accuracy, that is, how each $\hat{g}_i^m$ or $L^m$ is quantified. In this work, we evaluate models according to the root mean squared error (RMSE). This metric is traditionally used for measuring the differences between the estimated values and actual values.
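For reference, RMSE can be computed as in the sketch below. The function name and the toy values are illustrative assumptions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Every forecast is off by exactly 2, so the RMSE is 2.
print(rmse([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]))  # 2.0
```

Because the errors are squared before averaging, large individual errors weigh more heavily in the final value than they would under a mean absolute error.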

Learning Algorithm

The results shown in this work are obtained using the rule-based regression system Cubist (Kuhn et al., 2014), a variant of the model tree proposed by Quinlan (1993). This method presented the best forecasting results among several other predictive models in a study that will be presented in the next chapter. Notwithstanding, other learning algorithms were tested, namely the lasso (Tibshirani, 1996; Friedman et al., 2010) and a random forest (Breiman, 2001; Wright, 2015). The conclusions drawn using these algorithms are similar to the ones reported in the next sections.