4. Training Time Series With Outlier Values
5.8. Testing other ML methods
As a final experiment, we wanted to evaluate how other ML algorithms would perform within the same set of values for the other parameters, as opposed to Logistic Regres- sion. Without going into too much detail, in the list below we summarize the considered alternatives1:
Quadratic Discriminant Analysis (QDA)
A classifier with a quadratic decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule.
Random Forest
A random forest is a meta estimator that fits a number of decision tree clas- sifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Naive Bayes
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
Decision Tree
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
Nearest Neighbors
Neighbors-based classification is a type of instance-based learning or non- generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most repre- sentatives within the nearest neighbors of the point.
Support Vector Machines (SVM)
Support vector machines are a set of supervised learning methods used for classification, regression and outliers detection. Different kernel functions can be specified for the decision function.
1
5.8. Testing other ML methods 43
• Support vector classifier (SVC) with Radial basis function (RBF) as kernel The RBF kernel is a stationary kernel. It is also known as the “squared exponential” kernel.
• Linear SVM
LinearSVC is another (faster) implementation of Support Vector Clas- sification for the case of a linear kernel. This class supports both dense and sparse input and the multiclass support is handled according to a one-vs-the-rest scheme.
Neural network
Supervised learning algorithm that learns a function f (·) : Rm → Ro by training on a dataset, where m is the number of dimensions for input and o is the number of dimensions for output. Given a set of features X = x1, x2, ..., xm and a target y , it can learn a non-linear function ap- proximator for either classification or regression. It is different from logistic regression, in that between the input and the output layer, there can be one or more non-linear layers, called hidden layers.
AdaBoost
An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases. Gaussian Process
Gaussian Processes (GP) are a generic supervised learning method de- signed to solve regression and probabilistic classification problems. The prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals and decide based on those if one should refit (online fitting, adaptive fitting) the prediction in some region of interest.
In Figure 5.11 we can see the comparison between all these methods. The Gaussian Process seems to have by far the best results. Indeed, Gaussian Processes (Williams (1998)) are widely recognized as one of the best prediction methods for markets (Ou and Wang (2009)). However the big difference with the next method is very noticeable.
The second best method is Ada Boost. This can be due to the facility of this technique for mimicking square signals, which is the case of our outlier setting, as we will see in the conclusions. Anyway, the results of AdaBoost are similar to those of the Logistic Regression, the neural networks and the SVM methods.
A better performance was expected for neural networks. Maybe this is due to the fact that neural networks are highly configurable, and here we have used just the basic default parameters.
Before finishing the chapter, we must point out that usual techniques for tuning hyper- parameters such as cross validation are not available in our system. This, combined with the high requirements of our algorithms in terms of time and memory, have complicated experimenting with more exhaustive combinations of parameters.
44 Chapter 5. Use Case Results
Chapter
6
Conclusions and Future Work
In this chapter we summarize the results achieved during the work, examine if our objectives have been fulfilled and suggest some lines of future work.
6.1.
Conclusions
As we have seen across the three previous chapters, the usage of time series data in the ML context is not automatic. On the contrary, it requires oversight and attention to detail in order to succeed. Otherwise it is possible, for instance, to obtain too high evaluation scores, because we have unintendedly trained with future values to predict past values, or we are testing a row that includes a label corresponding to some event that occurs before the future we wish to predict. In any case, the final effect is that, when taking the model to a production context, we will find that the evaluation scores are much worse than those predicted during the model evaluation phase.
In this work we proposed a framework for training machine learning algorithms with time series, which consists of two parts, corresponding to the first two objectives of Sec- tion 1.3:
1. A preprocessing method that combines past and future data, explained in Chapter 3. The method takes as inputs the file containing the raw data, a set of feature columns, a future f to predict, and a number of previous (or past) rows p. The result is a preprocessed file that combines each row r with the p − 1 previous rows as features, and with the target value at row r + f as label to predict. We also propose to work with increments (one day increment for the features, f days increment for the label) for our use case. Notice that combining pasts 1, . . . , p and future f in one single row is important because ML methods work row by row, adjusting the internal parameters to infer the label from the features. With our preprocessing the ML method ‘sees’ a window of p features together with the expected value at future f .
2. A training and testing method, explained in Chapter 4. Our proposal presents several differences with the TimeSeriesSplit method proposed in the Python Scikit-learn library for evaluating time-series methods, which suggests using a train set that starts at the first row and reaches a certain row r and then use from the row r + 1 on as test. In the case of the train we suggest using sliding windows of k values to improve the efficiency. Moreover, we include the possibility of increasing this size k to accumulate in the training set a minimum number of values of each class (in our