
Feature construction and selection

It is vital that a machine learning model is provided with features as inputs that are predictive of the target output for each training and testing example. Some features can be informative and will relate to the targets while others may be uninformative, with no relationship to the targets, or redundant, with values that are strongly or perfectly correlated with the values of other features. If too few features are used, important information that is necessary to make good predictions may be missed. Conversely, too many features can lead to overfitting, where a model becomes biased towards the examples in the training set and thus generalises poorly (when this happens, the model can accurately predict the target values of training examples but performs poorly when applied to test examples). Furthermore, the number of features used will also affect the time and space required to train a model.

When machine learning is used to construct algorithm selectors, the features represent characteristics of the problems the algorithms are being applied to; when selecting algorithms for power flow management, these characteristics relate to the state of the power system. In this context, it is important that the features are predictive of which algorithm should be selected (in the case of direct selectors) or are predictive of individual algorithms’ performance with respect to particular performance measures (in the case of EPM-based selectors).

In a review paper on the subject, Guyon and Elisseeff [141] describe two main aspects to consider in relation to features: (1) feature construction, and (2) feature selection. These two aspects are discussed below, particularly in relation to using machine learning for algorithm selection.

7.4.1 Feature construction

Feature construction is concerned with what characteristics of the training and testing examples are represented. Domain knowledge can be used to determine features that are relevant to what the examples represent, so for algorithm selection for power flow management, the features will relate to the state of the power systems that the algorithms are being selected for. Feature construction also involves applying transformations to the base set of features. Transformations include normalisation and scaling (so that all features vary across a similar range, which many machine learning algorithms require), dimensionality reduction (where a higher-dimensional feature space is transformed into a lower-dimensional space, using techniques such as principal components analysis), and feature expansion (creating new features by applying functions to existing features; for instance, constructing features that are the products of other features). Although transformations do not increase the amount of information available to the machine learning models – in fact, some transformations such as dimensionality reduction may actually reduce the amount of information – they can reveal important relationships in the data, and thus help improve predictive performance.
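Two of the transformations described above, scaling and feature expansion, can be illustrated with a minimal sketch. The function names `min_max_scale` and `add_pairwise_products` and the example values are hypothetical and are not taken from this work; min-max scaling is only one of several possible scaling schemes, and dimensionality reduction is omitted for brevity.

```python
from itertools import combinations


def min_max_scale(rows):
    """Rescale every feature (column) to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def add_pairwise_products(rows):
    """Feature expansion: append the product of each pair of distinct features."""
    return [
        list(row) + [row[i] * row[j] for i, j in combinations(range(len(row)), 2)]
        for row in rows
    ]


# Made-up examples, each of the form [generator output, load scaling factor].
raw = [[10.0, 0.5], [20.0, 1.0], [30.0, 1.5]]
scaled = min_max_scale(raw)              # both features now span [0, 1]
expanded = add_pairwise_products(scaled)  # adds one product feature per pair
print(scaled)
print(expanded)
```

In practice the scaling parameters would normally be computed from the training set alone and then reused for the test set, so that no information about the test examples leaks into training.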

7.4.2 Feature selection

Feature selection is concerned with determining a subset of the potential features that is particularly predictive, that is particularly concise (and therefore efficient in terms of the time and space required to train and test a model), or that reveals more about the process that generated the data.

Guyon and Elisseeff [141] distinguish the following broad strategies for feature selection:

1. Filter methods: these are applied as a pre-processing step to determine potentially relevant feature sets before training a machine learning model. Each feature is ranked according to a particular criterion; for example, the correlation between a feature and the targets, or information theoretic criteria such as the mutual information between a feature and the targets. From the ranks, a subset of features can be determined. The main advantage of filter methods is that they are applied before the model is trained, so they can be efficient in terms of the time and space required for their application. They are also agnostic to the machine learning model used. However, filter methods may result in redundant features being selected (as relationships between different features may not be recognised), and they may discard features that rank low individually, and therefore appear irrelevant, but that could be highly predictive when combined with other features.

2. Wrapper methods: this is where candidate feature subsets are ranked by using them to train and then evaluate a machine learning model, which is treated as a black box. This typically involves splitting the training set in two, with one part being used for training a model, given a particular feature set, while the other part – the validation set – is used for evaluating the predictive performance of the trained model.

The feature subsets are determined iteratively, either by adding features to a subset that is initially empty, based on what predictive performance they add (forward selection), or by removing features from a subset that initially contains all features, based on how little they add to predictive performance (backward elimination). Which feature is added or removed at each step is determined by a search algorithm, such as best-first, branch-and-bound, or genetic algorithms.

The main advantages of wrapper methods are that they can help to avoid overfitting and that they consider how each variable affects the predictive performance of the trained model, thus better rejecting redundant features. The main disadvantage is the “brute force” nature of training models to evaluate candidate feature subsets, which can be expensive in terms of the time and space required.

3. Embedded methods: these are feature selection methods that form part of a particular machine learning algorithm. For example, the J48 algorithm (WEKA’s implementation of the C4.5 decision tree learning algorithm [13]) implicitly performs feature selection when it considers what feature to split on for each decision node in the tree, based on the information gain associated with each feature.
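The difference between the filter and wrapper strategies can be illustrated with a minimal sketch on a toy data set. Everything here is an assumption made for illustration: the data are invented, the filter criterion is the absolute Pearson correlation, the wrapper uses a 1-nearest-neighbour classifier with greedy forward selection, and none of the function names come from this work. Feature 1 is an exact copy of feature 0 (redundant) and feature 2 is unrelated to the targets.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy) if sxx and syy else 0.0


def filter_rank(X, y):
    """Filter method: rank feature indices by |correlation| with the targets."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def knn1_accuracy(X_tr, y_tr, X_val, y_val, subset):
    """Validation accuracy of a 1-nearest-neighbour model restricted to `subset`."""
    correct = 0
    for xv, yv in zip(X_val, y_val):
        nearest = min(
            range(len(X_tr)),
            key=lambda i: sum((X_tr[i][j] - xv[j]) ** 2 for j in subset),
        )
        if y_tr[nearest] == yv:
            correct += 1
    return correct / len(X_val)


def forward_select(X_tr, y_tr, X_val, y_val):
    """Wrapper method: greedily add the feature that most improves validation
    accuracy, stopping as soon as no remaining feature gives an improvement."""
    remaining = list(range(len(X_tr[0])))
    selected, best = [], 0.0
    while remaining:
        scored = [
            (knn1_accuracy(X_tr, y_tr, X_val, y_val, selected + [j]), j)
            for j in remaining
        ]
        acc, j = max(scored, key=lambda t: (t[0], -t[1]))  # ties: lowest index
        if acc <= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = acc
    return selected, best


# Invented data: feature 1 duplicates feature 0; feature 2 is uninformative.
X_train = [[0.1, 0.1, 0.9], [0.2, 0.2, 0.1], [0.8, 0.8, 0.8], [0.9, 0.9, 0.2]]
y_train = [0, 0, 1, 1]
X_val = [[0.15, 0.15, 0.5], [0.85, 0.85, 0.5]]
y_val = [0, 1]

ranking = filter_rank(X_train, y_train)
selected, accuracy = forward_select(X_train, y_train, X_val, y_val)
print(ranking, selected, accuracy)
```

On this toy data the filter ranks the duplicated features 0 and 1 jointly highest, so taking the top two would keep a redundant feature, whereas forward selection stops after feature 0 because adding the copy does not improve validation accuracy. The wrapper pays for this by retraining and evaluating a model for every candidate subset.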

7.4.3 Features used in this work

For the power flow management algorithm selection application considered in this work, the training and test examples represent particular states of the case study power systems. The state space of each system consists of only a few variables, and these have been used as the feature set for machine learning. In particular, the feature set for each system comprises:

• The output level of each generator. Where two or more generators are scaled by the same factor (and therefore always have the same output level as each other), the output level of only one of the generators is included in the feature set, preventing duplication.
• The load level (scaling factor).
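As a minimal sketch of how such a feature vector might be assembled, the function below keeps one output level per group of jointly scaled generators and appends the load level. The function name `build_features`, the generator names, the group labels, and all values are hypothetical; they do not describe any of the case study systems.

```python
def build_features(generator_outputs, scaling_groups, load_level):
    """Return one output level per scaling group, followed by the load level.

    generator_outputs: dict mapping generator name -> output level
    scaling_groups:    dict mapping generator name -> group label; generators
                       in the same group share a scaling factor, so only the
                       first one (in name order) contributes a feature
    """
    features = []
    seen_groups = set()
    for name in sorted(generator_outputs):
        group = scaling_groups[name]
        if group in seen_groups:
            continue  # output is identical to one already included
        seen_groups.add(group)
        features.append(generator_outputs[name])
    features.append(load_level)
    return features


# Hypothetical state: G1 and G2 are scaled together, G3 independently.
outputs = {"G1": 0.4, "G2": 0.4, "G3": 0.9}
groups = {"G1": "a", "G2": "a", "G3": "b"}
state_features = build_features(outputs, groups, load_level=0.7)
print(state_features)
```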

For the 33 kV meshed distribution system, the feature set comprises three variables, while the IEEE 14-bus system has five and the IEEE 57-bus system has four. These fixed feature sets are used consistently for all the machine learning algorithms considered, to allow a fair comparison between them. Because the features are the state variables, they fully characterise each state, so no information is lost; however, the information is at a high level, which some models may not be able to use effectively. Although a fixed feature set may restrict the predictive performance of model types that could do better with different features, performing feature selection for each machine learning algorithm considered would be prohibitively slow, so it is not performed in this work.