This section formulates the problem of selecting an appropriate algorithm for a given task and finding the configuration that leads to the best results. It is considered a crucial step towards the automation of the ML pipeline (Feurer et al., 2015). Although Model Selection and Hyper-parameter Optimization (MSHPO) are conceptually different areas, they are closely linked: an appropriate algorithm paired with a poorly chosen configuration for a given task still leads to low accuracy. The following formulation applies to classification problems, but it can be extended to regression tasks as well.

An example of a classification task is represented as a pair (x, y), where x is a vector of feature values and y is its corresponding class. A dataset D, expanded in Equation 2.1, is a set of examples that is consumed by a classification algorithm. A classification algorithm C is a function that learns patterns from D and applies them to held-out instances, using their feature values x̄ to predict the class ȳ, as shown in Equation 2.2. The equivalent Meta-RL representation is shown in Equation 2.3, where s is a set of states and a is a finite set of actions.

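$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} \tag{2.1}$$

$$\text{Supervised Meta-learning:} \quad C : \{D, \bar{x}\} \rightarrow \{\bar{y}\} \tag{2.2}$$

$$\text{Meta-Reinforcement Learning:} \quad C : \{D, s\} \rightarrow \{a\} \tag{2.3}$$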

The set of all possible classification algorithms is represented as 𝒞 = {C1, C2, ..., Ck}. A classification algorithm C requires a set of hyper-parameters Pc, where λc is a configuration of the hyper-parameters, λc ∈ Pc. The hyper-parameters of the ith algorithm in 𝒞 are denoted by λi = (λia, λib, λic, ...), and the set of all possible values that λi can take is represented by Λi. A realization of a classification algorithm C for a specific configuration λ is known as a classification model (Cλ). The error function E of the classification model Cλ on held-out instances is computed as shown in Equation 2.4.

The feature values of the instances x are used to train an algorithm, which is then applied to the feature values x̄ to predict the class ȳ. Based on the underlying distribution learned by the trained model, instances with similar feature values tend to belong to the same class. The MSHPO problem is therefore formulated as:

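$$C^{*}_{\lambda^{*}} = \operatorname*{argmin}_{C_i \in \mathcal{C},\; \lambda_i \in \Lambda_i} E\left(C^{i}_{\lambda_i}, D\right) \tag{2.5}$$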

Equation 2.5 selects the algorithm and associated configuration that obtain the best performance at predicting labels on the given task. The equation defines only the structure and general behaviour of the components of the optimization process, not the scoring function or other details. Furthermore, it cannot be guaranteed that a single model and configuration $C^{*}_{\lambda^{*}}$ is significantly better than the rest of the candidates.
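The joint search in Equation 2.5 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two toy algorithms (`threshold_classifier`, `knn_classifier`), their configuration grids, the data values, and the use of the misclassification rate as the error function E, which the formulation above deliberately leaves open. Exhaustive enumeration is used only because the toy search space is tiny.

```python
from itertools import product

# Toy dataset D = {(x, y)}: one feature, binary class (illustrative values only).
train = [(0.1, 0), (0.4, 0), (0.5, 0), (1.1, 1), (1.3, 1), (1.6, 1)]
held_out = [(0.2, 0), (0.6, 0), (1.2, 1), (1.5, 1)]

def threshold_classifier(data, threshold):
    """Hypothetical algorithm C1: predict class 1 when x exceeds a fixed threshold."""
    return lambda x: int(x > threshold)

def knn_classifier(data, k):
    """Hypothetical algorithm C2: majority vote among the k nearest training points."""
    def predict(x):
        nearest = sorted(data, key=lambda ex: abs(ex[0] - x))[:k]
        votes = sum(y for _, y in nearest)
        return int(2 * votes > len(nearest))
    return predict

# Each algorithm C_i is paired with its configuration space Lambda_i.
search_space = {
    "threshold": (threshold_classifier, {"threshold": [0.0, 0.8, 2.0]}),
    "knn": (knn_classifier, {"k": [1, 3, 5]}),
}

def error(model, data):
    """Misclassification rate on held-out examples (one possible choice of E)."""
    return sum(model(x) != y for x, y in data) / len(data)

def select_model(space, train_data, test_data):
    """Evaluate every (C_i, lambda_i) pair and return the first minimiser found."""
    best = None
    for name, (fit, grid) in space.items():
        keys = sorted(grid)
        for values in product(*(grid[key] for key in keys)):
            config = dict(zip(keys, values))
            score = error(fit(train_data, **config), test_data)
            if best is None or score < best[0]:
                best = (score, name, config)
    return best

best_error, best_algorithm, best_config = select_model(search_space, train, held_out)
```

In this toy run the threshold model with `threshold=0.8` is returned, but the 1-nearest-neighbour candidate reaches the same held-out error, which mirrors the caveat that a single best model cannot be guaranteed.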

Chapter 3

Cross-domain Meta-learning for Time-series Forecasting

In accordance with the research challenges identified in the previous chapter, a thorough study has been conducted to evaluate whether the Meta-knowledge (MK) of a specific domain can be applied to problems from other domains to find the best learning algorithm. Previous work on Meta-level Learning (MLL) for Time-series (TS) forecasting resulted in Lemke and Gabrys (2010a) and Lemke and Gabrys (2010b). The use of the proposed MLL approaches and data from the NN3 and NN5 competitions in Lemke and Gabrys (2010a), supplementing the available NN-GC1 data, led to our research group winning the NN-GC1 forecasting competition. In Lemke and Gabrys (2010b) it was hypothesised (though not verified by any further analysis) that the particularly good predictive performance obtained by deploying the MLL approach and a Meta-ranking algorithm on the NNGC-C (monthly interval) and NNGC-E (daily interval) datasets might have been due to the additional use of the NN3 and NN5 datasets (111 daily series each) for generating MK and training Meta-learners.

This chapter concentrates on understanding whether the use of additional time-series from the NN3 and NN5 competitions was indeed the main reason for the best performance of the MLL on the NNGC-C and NNGC-E series of the NN-GC1 competition, through an extended analysis of the results describing for which NN-GC1 time-series the MLL performs best or worst. An attempt has also been made to answer the more general question of when and under what circumstances the use of datasets from other domains (the NN3 and NN5 competitions in the current context) could be beneficial for recommending well-performing forecasting methods for a problem at hand (using MLL approaches on 6 different NN-GC1 datasets in this work).

The key focus is on finding evidence that addresses the following questions:

1. Whether the use of additional training data has been the main reason for the best performance of the MLL?

2. Whether the use of data from a different domain could be beneficial for recommending well-performing forecasting methods for a problem at hand?

Further investigation is required to establish whether NNGC-C and NNGC-E performed well with the NN3 and NN5 Meta-models because of the similar frequency of observation recording, or because the Meta-level problem representation is tilted more towards time-series sample-rate characteristics than towards others. Either could explain why MLL performed well only for time-series datasets with a similar frequency. Investigation is also required to analyze whether increasing the size of the training dataset by adding data from different domains could enhance the overall prediction accuracy of the MLL algorithm. This in turn leads to the problem of not finding a significant number of patterns in the cross-domain data; for example, NN3 and NN5 together contain 222 instances, which is a relatively small number with a lot of variation in the data. It raises the question of whether adding data from only the same domain can enhance Meta-level accuracy.

[Figure 3.1 diagram; recoverable box labels: Examples of Datasets (NN3, NN5 Time-series); Base-level Forecasting Methods; Meta-features generation and Selection; Meta-knowledge (NN3, NN5, NN3+NN5); Meta-model; Clustering (NN3, NN5, NN-GC1); Examples of Datasets (NN-GC1 Time-series); Meta-knowledge (NN-GC1); Evaluation (NN-GC1 Meta-model vs Clusters); Results (NN-GC1 best possible vs MLL method)]

Figure 3.1: Methodology of Cross-domain MLL

3.1 Methodology

To examine the questions stated above, an experimentation environment has been established containing the key components required by an MLL system. Figure 3.1 provides a high-level overview of the MLL system setup for this work. Apart from the MLL system, a cluster analysis has been performed on the MK. The results of both systems are correlated to find evidence that could answer the questions raised above. The MLL system is divided into two phases: i) Meta-modelling and ii) Meta-ranking. For Meta-modelling, two datasets from different domains are used: NN3 and NN5, containing empirical business observations and cash machine transactions respectively. Several Meta-features (MFs) and performance measures are computed from these datasets. The performance measures are mapped to the features of each time-series to build an MK for each dataset, and a combined NN3+NN5 MK has also been built. Three different Meta-models are built, one for each of these MKs.
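The construction of the MK described above can be sketched as follows. This is only an illustration under stated assumptions: the three meta-features (`length`, `std`, `acf1`), the three base-level methods (`naive`, `mean`, `drift`), the one-step-ahead mean absolute error, and the toy series standing in for NN3 and NN5 are all hypothetical and are not the feature set, methods, or data used in this work.

```python
from statistics import mean, pstdev

def meta_features(series):
    """A few illustrative meta-features (assumed; not the thesis's feature set)."""
    n = len(series)
    mu = mean(series)
    var = sum((v - mu) ** 2 for v in series)
    if var:
        # Lag-1 autocorrelation of the series.
        acf1 = sum((series[t] - mu) * (series[t - 1] - mu) for t in range(1, n)) / var
    else:
        acf1 = 0.0  # convention for a constant series
    return {"length": n, "std": pstdev(series), "acf1": acf1}

# Hypothetical base-level forecasting methods: history -> one-step-ahead forecast.
BASE_METHODS = {
    "naive": lambda h: h[-1],
    "mean": lambda h: mean(h),
    "drift": lambda h: h[-1] + (h[-1] - h[0]) / (len(h) - 1),
}

def method_errors(series, horizon=3):
    """Mean absolute one-step-ahead error of each method on the last `horizon` points."""
    errors = {}
    for name, forecast in BASE_METHODS.items():
        abs_errors = [abs(forecast(series[:t]) - series[t])
                      for t in range(len(series) - horizon, len(series))]
        errors[name] = mean(abs_errors)
    return errors

def build_meta_knowledge(dataset):
    """Map each series' meta-features to the performance of every base method."""
    return [{"features": meta_features(s), "errors": method_errors(s)} for s in dataset]

# Toy stand-ins for the two source datasets (values are made up for illustration).
nn3_like = [[1, 2, 3, 4, 5, 6, 7, 8], [5, 5, 6, 5, 5, 6, 5, 5]]
nn5_like = [[3, 9, 3, 9, 3, 9, 3, 9], [10, 9, 8, 7, 6, 5, 4, 3]]

mk_nn3 = build_meta_knowledge(nn3_like)
mk_nn5 = build_meta_knowledge(nn5_like)
mk_combined = mk_nn3 + mk_nn5  # stands in for the combined NN3+NN5 MK
```

Each MK entry is one Meta-example: a feature description of a series together with the error of every base method on it, so a separate Meta-model can be trained on `mk_nn3`, `mk_nn5` and `mk_combined`.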

These Meta-models have been evaluated in the Meta-ranking phase against the six datasets of NN-GC1, which come from a different domain (transportation) than NN3 and NN5. Furthermore, the NN-GC1 datasets have different observation sampling rates. The same MFs used in the Meta-modelling phase have been extracted from NN-GC1 for Meta-ranking. The Meta-models trained on NN3 and NN5 are used to estimate the most appropriate forecasting method for the Meta-examples of NN-GC1. These estimates are evaluated against the best possible forecasting method, which is determined by evaluating the base-learners on NN-GC1. Figure 3.1 provides an overview of the cross-domain MLL system.
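The Meta-ranking evaluation can be sketched as a comparison between the method a Meta-model recommends and the best possible method for each target series. The sketch below assumes a 1-nearest-neighbour Meta-model, two made-up meta-features, hypothetical method names, and invented error values; none of these are taken from this work, and the two reported measures (`hit_rate`, `mean_regret`) are illustrative choices.

```python
from math import dist

# Hypothetical source-domain MK: (meta-feature vector, best base method).
# Feature vectors are (trend strength, seasonality strength); values are made up.
source_mk = [
    ((0.9, 0.1), "drift"),
    ((0.8, 0.2), "drift"),
    ((0.1, 0.9), "seasonal_naive"),
    ((0.2, 0.7), "seasonal_naive"),
    ((0.1, 0.1), "mean"),
]

def recommend(meta_knowledge, features):
    """1-nearest-neighbour Meta-model: reuse the best method of the closest source series."""
    _, method = min(meta_knowledge, key=lambda example: dist(example[0], features))
    return method

# Hypothetical target-domain Meta-examples: the same meta-features plus the error
# of each base method, as obtained by running the base-learners on the target data.
target_examples = [
    {"features": (0.85, 0.15), "errors": {"drift": 1.0, "seasonal_naive": 4.0, "mean": 3.0}},
    {"features": (0.15, 0.80), "errors": {"drift": 5.0, "seasonal_naive": 2.0, "mean": 2.5}},
    {"features": (0.50, 0.50), "errors": {"drift": 2.0, "seasonal_naive": 3.0, "mean": 1.5}},
]

def evaluate(meta_knowledge, examples):
    """Compare each recommendation with the best possible method for that series."""
    hits, regret = 0, 0.0
    for example in examples:
        chosen = recommend(meta_knowledge, example["features"])
        best = min(example["errors"], key=example["errors"].get)
        hits += chosen == best
        # Regret: extra error incurred by following the recommendation.
        regret += example["errors"][chosen] - example["errors"][best]
    n = len(examples)
    return {"hit_rate": hits / n, "mean_regret": regret / n}

report = evaluate(source_mk, target_examples)
```

The third target example lies between the source clusters and receives a recommendation that is not the best possible method, which is the kind of cross-domain mismatch the evaluation is meant to expose.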

Apart from Meta-modelling, Cluster Analysis has been performed on three different combinations of MK: NN3 versus NN-GC1, NN5 versus NN-GC1, and NN3+NN5 versus NN-GC1. A hierarchical approach is applied with different linkage methods and distance/similarity measures to extract the most appropriate clusters from these three combinations of MK.
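A hierarchical cluster analysis over combined MK can be sketched with a naive agglomerative procedure in which the linkage method and the distance measure are interchangeable. The sketch is an assumption-laden illustration: the three linkage rules, the two distance functions, the meta-feature vectors, and the dataset labels attached to them are invented for the example and do not reproduce the settings or data of this work.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Linkage rule: how pairwise point distances are reduced to one cluster distance.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}

def agglomerate(points, n_clusters, metric=euclidean, linkage="average"):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    reduce_distances = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(pair):
        a, b = pair
        return reduce_distances(
            [metric(points[i], points[j]) for i in clusters[a] for j in clusters[b]]
        )

    while len(clusters) > n_clusters:
        # combinations() yields a < b, so deleting index b leaves index a valid.
        a, b = min(combinations(range(len(clusters)), 2), key=cluster_distance)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Made-up meta-feature vectors; labels record which dataset each row came from.
labels = ["NN3", "NN3", "NN-GC1", "NN-GC1", "NN-GC1"]
vectors = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.9, 0.8), (0.8, 0.9)]

# Cluster membership for every combination of linkage method and distance measure.
results = {
    (link, metric.__name__): agglomerate(vectors, 2, metric=metric, linkage=link)
    for link in LINKAGES
    for metric in (euclidean, manhattan)
}
membership = {key: [[labels[i] for i in group] for group in groups]
              for key, groups in results.items()}
```

Inspecting `membership` shows which target series fall into the same cluster as source series, which is the information that can then be correlated with the Meta-model results.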