In this section, soft sensors are embedded into a theoretical machine learning framework. From the several possible types of soft sensors, this thesis concentrates on the on-line prediction soft sensors (see Section 2.4.3) that, from the machine learning point of view, represent supervised predictive models with continuous target variables.
The prerequisite for supervised learning is a training data set $S_{\mathrm{train}}$, which, in the process industry, is usually retrieved from the process information system and is referred to as the historical data (see Section 2.3.1). This data set has the following characteristics:
$$S_{\mathrm{train}} = \{X_{\mathrm{train}}, Y_{\mathrm{train}}\} \quad \text{with} \quad X_{\mathrm{train}} = (x_1, \ldots, x_n)^T \ \text{and} \ Y_{\mathrm{train}} = (y_1, \ldots, y_n)^T, \tag{3.1}$$
where $X_{\mathrm{train}} \in \mathbb{R}^{n \times m}$ are the $n$ input samples (data points) organised in a matrix, with each row vector $x_i$ corresponding to one input sample that further consists of $m$ measurement variables (or features), i.e. $x_i = (x_{i,1}, \ldots, x_{i,m})$. Each of the input samples $x_i$ can also be interpreted as a point in the $m$-dimensional input space $\mathcal{X}$. In general, a $q$-dimensional target value $y_i \in \mathbb{R}^q$ is assigned to every input sample. However, in this work, the target space is restricted to a one-dimensional output space $\mathcal{Y}$, i.e. $y_i \in \mathbb{R}^1$.
The target variable is generated by a hidden function $\varphi$, called the target function, which maps the input space onto the output space:
$$\varphi : \mathcal{X} \to \mathcal{Y}. \tag{3.2}$$
The target function is unknown and it is the task of the learning algorithm $\mathcal{L}$ to find an approximation to this function given the historical data set $S_{\mathrm{train}}$ and a (randomly initialised) predictor function $f_{\mathrm{init}}$:
$$f_{\mathrm{trained}} \leftarrow \mathcal{L}(S_{\mathrm{train}}, f_{\mathrm{init}}). \tag{3.3}$$
The predictor can be, for example, a simple regression model, a principal component regression model with a given number of principal components, or a multi-layer perceptron with a given number of hidden units and randomly initialised weights (for several possibilities for the predictor functions see Section 3.3). The outcome of the learning process is the trained predictor $f_{\mathrm{trained}}$.
This predictor is further referred to as the model. The model is able to map the input samples to the output space, i.e. $y^{\mathrm{pred}}_i = f_{\mathrm{trained}}(x_i)$, and approximates the target function $\varphi$. Going back to one of the previous examples, for the linear regression function the trained predictor is a vector of weights $f_{\mathrm{trained}} := \beta = [\beta_0, \beta_1, \ldots, \beta_m]$, which can be used to calculate the predicted target value: $y^{\mathrm{pred}}_i = \beta_0 + \sum_{j=1}^{m} \beta_j x_{i,j}$.
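The linear regression example can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in this work: it assumes that the learning algorithm is ordinary least squares (for which no initial predictor $f_{\mathrm{init}}$ is needed), and the function names `learn_linear` and `predict_linear` as well as the toy data values are hypothetical.

```python
import numpy as np

def learn_linear(X_train, y_train):
    """Assumed learning algorithm: ordinary least squares.

    Returns the weight vector beta = [beta_0, beta_1, ..., beta_m].
    """
    n = X_train.shape[0]
    # Prepend a column of ones so that beta[0] acts as the intercept beta_0.
    A = np.hstack([np.ones((n, 1)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return beta

def predict_linear(beta, X):
    """y_pred_i = beta_0 + sum_j beta_j * x_{i,j} for every row x_i of X."""
    return beta[0] + X @ beta[1:]

# Toy historical data (illustrative values only): n = 5 samples, m = 2 variables.
X_train = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [0.0, 3.0]])
# Noise-free target generated by a known linear function, used only for the demo.
y_train = 1.0 + 2.0 * X_train[:, 0] - 0.5 * X_train[:, 1]

beta = learn_linear(X_train, y_train)    # the trained predictor
y_pred = predict_linear(beta, X_train)   # predictions for the training inputs
```

Because the toy targets are noise-free and exactly linear, the fitted weights should reproduce the generating coefficients; with real historical data this will not be the case.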
In order to be able to assess and rank the outcomes of the learning process, an error function $e(f, y)$ is required. This function allows the calculation of the distance between the correct target value $y$ and the value predicted by the predictor function, $f_{\mathrm{trained}}$, given an input sample $x$. In the case of noisy observations:
$$y = \varphi(x) + \epsilon, \tag{3.4}$$
where $\epsilon$ is assumed to be a normally distributed random variable with zero mean value, it can be shown that the optimal error function (in the maximum likelihood sense), which eliminates the influence of the random noise, is the Mean Squared Error (MSE) (see e.g. [13, p. 195], [18]):
$$e(f(X), y) = \frac{1}{n} \sum_{i=1}^{n} \left(f(x_i) - y_i\right)^2, \tag{3.5}$$
where $n$ is the number of data samples. Having the error function, the learning process can be described as the search for an optimal prediction function $f_{\mathrm{optimal}}$ which minimises the error function $e()$:
$$f_{\mathrm{optimal}} = \underset{f}{\operatorname{argmin}} \; e(f(X), y). \tag{3.6}$$
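As a hedged illustration of Eqs. (3.4) to (3.6), the sketch below computes the MSE and carries out the argmin over a small, finite set of candidate predictors. The target function, the noise level and the candidate family $f_b(x) = b\,x$ are assumptions made only for this example; a real learning algorithm searches a much richer function space.

```python
import numpy as np

def mse(y_pred, y):
    """Mean squared error, Eq. (3.5): (1/n) * sum_i (f(x_i) - y_i)^2."""
    y_pred = np.asarray(y_pred, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((y_pred - y) ** 2))

# Noisy observations y = phi(x) + eps, Eq. (3.4).
# phi(x) = 3x and the noise standard deviation 0.1 are illustrative choices.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + rng.normal(0.0, 0.1, size=x.size)

# Hypothetical candidate set: predictors f_b(x) = b * x for a grid of slopes b.
slopes = np.linspace(0.0, 6.0, 61)
errors = [mse(b * x, y) for b in slopes]

# Eq. (3.6) restricted to the candidate set: keep the predictor with the smallest MSE.
best_slope = slopes[int(np.argmin(errors))]
```

With this setup the selected slope should lie close to the slope of the assumed target function, since the zero-mean noise largely averages out in the squared-error sum.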
However, in practical scenarios, such as soft sensor development, the model developer is interested in finding a model that performs optimally for future samples, which are not available at the time of the model development. This performance is called the generalisation performance. Since it is not possible to calculate the performance on future data samples, it has to be estimated from the available training data. There are several ways to estimate the generalisation performance. The easiest way, though not the most accurate, is the hold-out estimation, where the training data $S_{\mathrm{train}}$ is split into an actual training part and a validation part. The predictor is first trained on the new training data and then tested on the validation data, which gives an estimate of its generalisation performance. The problem with this approach is that in practical scenarios the size of the training data set is often limited by several factors, and the hold-out estimation wastes the validation data because they cannot be used for model training. As a solution to this impracticality, several approaches summarised under the term resampling techniques were developed in statistics (see also [193] for an overview of resampling methods). The two most common of these methods are:
• k-fold cross-validation: This method cyclically splits the training data $S_{\mathrm{train}}$ into actual training data and validation data, where the size of the splits depends on the number of folds $k$. In k-fold cross-validation, each validation set contains $n/k$ samples, i.e. a fraction $1/k$ of $S_{\mathrm{train}}$, while the remaining $n(1 - \frac{1}{k})$ samples are used as training data. The result is $k$ different training-validation splits and thus $k$ models. The main benefit of this technique is that it guarantees that all of the available samples are used for model training as well as model validation.
• bootstrapping: In the case of bootstrapping, the data are sampled randomly with replacement from the original pool of samples $S_{\mathrm{train}}$, which generates subsamples of the original training data that are used to train the models. The performance of each model is estimated by validating it on the remaining, unseen samples. The performance of the predictor is the average of the individual model performances.
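The hold-out estimation and the two resampling techniques can be sketched as follows. This is an illustrative sketch under several assumptions that are not taken from the text: the predictor is a least-squares linear model, the samples are shuffled randomly before splitting (which may be inappropriate for time-ordered process data), each bootstrap sample has the same size $n$ as the original data set, and all function names and data values are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit; returns beta = [beta_0, beta_1, ..., beta_m]."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def mse_linear(beta, X, y):
    """Mean squared error of the linear predictor beta on (X, y)."""
    return float(np.mean((beta[0] + X @ beta[1:] - y) ** 2))

def hold_out_mse(X, y, val_fraction=0.3, seed=0):
    """Single random split into an actual training part and a validation part."""
    idx = np.random.default_rng(seed).permutation(X.shape[0])
    n_val = int(round(val_fraction * X.shape[0]))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    beta = fit_linear(X[train_idx], y[train_idx])
    return mse_linear(beta, X[val_idx], y[val_idx])

def k_fold_mse(X, y, k=5, seed=0):
    """Each fold serves once as validation set; the other k-1 folds train the model."""
    folds = np.array_split(np.random.default_rng(seed).permutation(X.shape[0]), k)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        beta = fit_linear(X[train_idx], y[train_idx])
        scores.append(mse_linear(beta, X[val_idx], y[val_idx]))
    return float(np.mean(scores))

def bootstrap_mse(X, y, n_rounds=50, seed=0):
    """Train on samples drawn with replacement, validate on the unseen samples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    scores = []
    for _ in range(n_rounds):
        train_idx = rng.integers(0, n, size=n)              # with replacement
        unseen_idx = np.setdiff1d(np.arange(n), train_idx)  # never drawn this round
        if unseen_idx.size == 0:
            continue  # nothing left to validate on; skip this round
        beta = fit_linear(X[train_idx], y[train_idx])
        scores.append(mse_linear(beta, X[unseen_idx], y[unseen_idx]))
    return float(np.mean(scores))

# Synthetic data (illustrative): linear target with noise of standard deviation 0.1.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0.0, 0.1, size=100)

estimates = {
    "hold-out": hold_out_mse(X, y),
    "5-fold": k_fold_mse(X, y, k=5),
    "bootstrap": bootstrap_mse(X, y),
}
```

All three estimates should be of the order of the noise variance (0.01 in this synthetic setting). The hold-out value depends on a single split, whereas the other two average over several splits.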
Bias-variance decomposition
The study of the generalisation error led to one of the most important theoretical findings in machine learning, namely the bias-variance decomposition [66]. Geman et al. have shown that, in the case of the quadratic error, the generalisation error of a predictor can be split into two components: the (squared) bias and the variance of the error. The formal form of the decomposition for the quadratic error function, with respect to different training sets of fixed size, is:
$$\begin{aligned} E_S\left[\left(y_{\mathrm{test}} - f^{(k)}(X_{\mathrm{test}})\right)^2\right] &= \left(E_S\left[y_{\mathrm{test}} - f^{(k)}(X_{\mathrm{test}})\right]\right)^2 + E_S\left[\left(f^{(k)}(X_{\mathrm{test}}) - E_S\left[f^{(k)}(X_{\mathrm{test}})\right]\right)^2\right] \\ &= \mathrm{bias}(f^{(k)})^2 + \mathrm{variance}(f^{(k)}) \end{aligned} \tag{3.7}$$
with $f^{(k)} \leftarrow \mathcal{L}(S_i \in S_{\mathrm{train}}, f_{\mathrm{init}})$.
In the previous equation, the expectation $E_S$ is taken over the random training data sets $S_i$ of constant size, $f^{(k)}(X_{\mathrm{test}})$ are the predictions made by the predictors given the test input data $X_{\mathrm{test}}$, and $y_{\mathrm{test}}$ are the correct target values for the calculation of the quadratic error.
The significance of the above decomposition originates from the fact that it splits the generalisation error into two components that balance each other. The bias component describes the expected error over all trained predictors $f^{(k)}$. This component is in general high for simple models, i.e. models with a low number of free parameters, and relatively low for complex models with a high number of degrees of freedom, which on average will be able to fit the target function better. On the other hand, the variance component describes the degree of variability between the predictions of the predictors trained on different training sets. Contrary to the bias, this value is in general low for simple models since, due to their low number of degrees of freedom, all these models will make similar predictions. For the complex models, the variance component is usually high because they tend to overfit the training data and then produce erroneous predictions for the test data. This phenomenon is also called the bias-variance dilemma, since finding an optimal predictor means finding a balance between the bias and variance errors.
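The opposing behaviour of the two components can be checked empirically. The following sketch is only an illustration under stated assumptions: the target function $\sin(2\pi x)$, the noise level, the polynomial model family, and the choice to keep the training inputs fixed on a grid (so that the training sets differ only in their noise realisations) are all hypothetical and chosen for numerical convenience.

```python
import numpy as np

def phi(x):
    """Assumed target function for the demo."""
    return np.sin(2.0 * np.pi * x)

def bias_variance(degree, n_sets=200, n_train=25, noise_std=0.3, seed=0):
    """Estimate squared bias, variance and total quadratic error of a
    polynomial predictor of the given degree over many training sets."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)   # fixed inputs (assumption)
    x_test = np.linspace(0.0, 1.0, 50)
    y_test = phi(x_test)                       # noise-free correct targets
    preds = np.empty((n_sets, x_test.size))
    for k in range(n_sets):
        # Each training set gets its own noise realisation.
        y_train = phi(x_train) + rng.normal(0.0, noise_std, size=n_train)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[k] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)                      # average predictor
    bias2 = float(np.mean((y_test - mean_pred) ** 2))   # squared bias
    variance = float(np.mean(preds.var(axis=0)))        # spread between predictors
    total = float(np.mean((y_test - preds) ** 2))       # expected quadratic error
    return bias2, variance, total

bias2_simple, var_simple, total_simple = bias_variance(degree=1)
bias2_complex, var_complex, total_complex = bias_variance(degree=7)
```

In this setting one would expect the straight line (degree 1) to show a larger squared bias and a smaller variance than the degree-7 polynomial, and the returned total error to equal the sum of the two components up to rounding. Which of the two models has the lower total error depends on the noise level and the training set size.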
An important implication of the decomposition is that better generalisation performance cannot be achieved simply by choosing a more complex model structure, e.g. neural networks with a higher number of hidden units, since this merely decreases the bias term while potentially increasing the variance term. The goal of model selection is to find an optimal model with balanced bias and variance errors.
The ultimate ambition in machine learning is the search for techniques that reduce both of the error terms at the same time. This can be achieved by increasing the number of training examples, which in turn allows the training of more complex models. However, in practical scenarios where obtaining more training data is often very expensive or even impossible, one needs to find another way of achieving this goal. Two such methods of approaching simultaneous minimisation of bias and variance from different directions are the ensemble methods and local learning discussed in Section 3.4 and Section 3.5 respectively.
Learning-forgetting dilemma
When dealing with adaptive systems, one inevitably has to deal with the learning-forgetting dilemma [48], sometimes also called the stability-plasticity dilemma [21]. The goal is to find an optimal trade-off between learning new information and forgetting old information. Adaptive learning systems with generalisation capability need a mechanism for forgetting old information because their capacity is limited; as a consequence, they suffer from negative interference, i.e. the forgetting of useful past knowledge while learning new information [156]. The extreme manifestation of this problem is catastrophic forgetting [58]. According to the previously cited work, the problem can be prevented by: (i) having a representative validation data set; (ii) memorising all training data samples; or (iii) incorporating strong prior knowledge.
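The trade-off can be made concrete with a deliberately simple example that is not taken from the cited works: an exponentially weighted estimate of a signal level, where a hypothetical parameter `alpha` controls how quickly old information is forgotten. The data stream, the change point and the two `alpha` values are assumptions chosen for illustration.

```python
import numpy as np

def track_mean(stream, alpha):
    """Exponentially weighted estimate of the current signal level.

    alpha close to 1: fast forgetting (plastic, but sensitive to noise).
    alpha close to 0: slow forgetting (stable, but slow to adapt).
    """
    estimate = 0.0
    history = []
    for y in stream:
        # Old knowledge is discounted by (1 - alpha); the new sample gets weight alpha.
        estimate = (1.0 - alpha) * estimate + alpha * y
        history.append(estimate)
    return np.array(history)

# Illustrative stream: the level jumps from 0 to 5 after 200 samples,
# with additive noise of unit standard deviation.
rng = np.random.default_rng(0)
levels = np.concatenate([np.zeros(200), np.full(200, 5.0)])
stream = levels + rng.normal(0.0, 1.0, size=levels.size)

fast = track_mean(stream, alpha=0.5)    # forgets quickly
slow = track_mean(stream, alpha=0.01)   # forgets slowly
```

Twenty samples after the change, the fast-forgetting estimate should already be near the new level while the slow one still lags far behind. Before the change, however, the fast estimate fluctuates much more strongly with the noise, which is the price paid for its plasticity.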