1.2 Diagnosing learning algorithms
1.2.3 Learning curves
A learning curve is a graphical representation of the increase in learning as a function of experience. The concept was first used in the psychology of learning by Ebbinghaus in 1885 [19, 20], although the name was not used until 1909. A learning curve plots performance on the vertical axis against another parameter on the horizontal axis, such as training set size (in machine learning) or iteration/time (in both machine and biological learning). In machine learning, a typical learning curve shows the training and cross-validation (CV) errors as a function of the number of training samples. Note that when we train with a small subset of the training data, the training error is computed using this subset, not the full training set. These plots give a quantitative view of how beneficial it would be to add training samples.
Let us describe what is happening and how to interpret a learning curve. Regarding the training error, when the number of samples is one or very small, any model (linear or nonlinear) can fit the data (almost) perfectly, so the training error is zero or very small. As the number of samples in the training
Figure 1.6: Typical learning curve. The training error and the cross-validation error (vertical axis) are plotted against the number of samples (horizontal axis).
set increases, it becomes more difficult to fit all the points in the training set, which raises the training error. Eventually the training error flattens once the number of training samples is enough to learn the patterns in the training dataset. In contrast, the cross-validation error is expected to be large for a small number of samples because the parameters of the model are very inaccurate (they were trained using only one or a few samples). As the number of samples increases, the parameters of the model become more accurate and the cross-validation error decreases until it flattens, as the training error does.
Figure 1.6 shows the expected shape of a learning curve. For a small number of samples, the training error is minimal while the cross-validation error is maximal. As the number of samples in the training set increases, the two errors tend to flatten at a value that is determined by the task and by the bias and variance of the model.
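The procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the data are synthetic (a noisy straight line), the model is a polynomial fitted with NumPy, and the names `mse` and `learning_curve` are hypothetical helpers that do not come from this thesis. The training error is computed on the subset used for fitting, while the CV error is computed on a fixed held-out set.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))

def learning_curve(x_train, y_train, x_cv, y_cv, sizes, degree=1):
    """Return (training errors, CV errors) for growing training subsets."""
    train_err, cv_err = [], []
    for n in sizes:
        # Fit using only the first n training samples.
        coeffs = np.polyfit(x_train[:n], y_train[:n], degree)
        # The training error is measured on the same subset used for fitting.
        train_err.append(mse(y_train[:n], np.polyval(coeffs, x_train[:n])))
        # The CV error is always measured on the full held-out set.
        cv_err.append(mse(y_cv, np.polyval(coeffs, x_cv)))
    return train_err, cv_err

# Assumed synthetic data: a noisy straight line (not data from the thesis).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, 200)
x_train, y_train, x_cv, y_cv = x[:150], y[:150], x[150:], y[150:]

sizes = [2, 5, 10, 20, 50, 100, 150]
train_err, cv_err = learning_curve(x_train, y_train, x_cv, y_cv, sizes)
for n, te, ce in zip(sizes, train_err, cv_err):
    print(f"n={n:4d}  train={te:.4f}  cv={ce:.4f}")
```

With two samples a straight line passes through both points, so the first training error is essentially zero. As more samples are added, the training error should rise towards the noise level.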
To elaborate on this last point, we come back to our example of the house seller. In Figure 1.5 we showed that, using a hundred samples, a linear model shows high bias (underfitting) problems. In the learning curve this problem is recognized because the training and cross-validation errors converge very rapidly to a relatively high error. We can see this behavior in the left panel of Figure 1.7. If we continue adding samples to the training set, the situation is unlikely to change: the two errors have converged to a certain value and have become independent of the number of samples.
In the right panel of the same figure, we have the opposite case: a high variance problem given by a degree-20 polynomial model. The characteristic feature of high variance (overfitting) is the gap between the training and the cross-validation errors. If we increase the number of training samples, the gap is likely to shrink, causing the errors to converge to an intermediate value.
Figure 1.7: Learning curves depicting high bias (left panel) and high variance (right panel).
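The two regimes can be contrasted numerically with a small sketch. Everything here is an assumption made for illustration: the data are a synthetic noisy sine wave rather than the house-price example, the helper `train_and_cv_error` is hypothetical, and `numpy.polynomial.Polynomial.fit` is used because it rescales the inputs internally. A degree-1 model is expected to show similar, relatively high training and CV errors, whereas a degree-20 model trained on few samples is expected to show a low training error and a noticeable gap.

```python
import numpy as np
from numpy.polynomial import Polynomial

def train_and_cv_error(x_tr, y_tr, x_cv, y_cv, degree):
    """Fit a polynomial of the given degree; return (training MSE, CV MSE)."""
    # Polynomial.fit rescales x internally, which helps with high degrees.
    model = Polynomial.fit(x_tr, y_tr, degree)
    train = float(np.mean((y_tr - model(x_tr)) ** 2))
    cv = float(np.mean((y_cv - model(x_cv)) ** 2))
    return train, cv

# Assumed synthetic data: a noisy sine wave (not the thesis data).
rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, n)

x_tr, y_tr = sample(30)
x_cv, y_cv = sample(200)

lin_train, lin_cv = train_and_cv_error(x_tr, y_tr, x_cv, y_cv, degree=1)
poly_train, poly_cv = train_and_cv_error(x_tr, y_tr, x_cv, y_cv, degree=20)

print(f"degree  1: train={lin_train:.4f}  cv={lin_cv:.4f}  gap={lin_cv - lin_train:.4f}")
print(f"degree 20: train={poly_train:.4f}  cv={poly_cv:.4f}  gap={poly_cv - poly_train:.4f}")
```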
Small note about floating-point precision of a machine
The theory of machine learning is based on statistics and basic mathematical operations, most of which can be solved explicitly. In practice, however, we usually perform those operations on a computer, and there are limitations that we have to be aware of when using a numerical algorithm. For a training set with N samples, if N ≤ d, where d is the number of degrees of freedom of the model, the system of equations is exactly solvable. This means that we can expect a perfect fit (zero error) between the data and the predicted model. The right panel of Figure 1.7 shows a degree-20 polynomial model. For N ≤ 20 the model should be solved exactly; however, we can see non-zero errors before we reach the number of degrees of freedom of the model. This does not mean that the figure is wrong. In fact, the resulting fit has small residuals because it needs very large oscillations to fit all the points perfectly, similar to the d = 6 case in Figure 1.3, and such a solution cannot be computed exactly with finite floating-point precision.
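One way to see this numerical limitation is to look at the conditioning of the linear system behind a polynomial fit. The sketch below is an illustration under assumptions not taken from the thesis: evenly spaced inputs on [0, 1], a plain monomial (Vandermonde) basis, random targets, and the hypothetical helper names `vandermonde_condition` and `interpolation_residual`. The system is square (as many points as coefficients), so it is exactly solvable on paper, yet its condition number grows quickly with its size.

```python
import numpy as np

def vandermonde_condition(n_points):
    """Condition number of the square Vandermonde matrix for n_points samples."""
    x = np.linspace(0.0, 1.0, n_points)
    return float(np.linalg.cond(np.vander(x, n_points)))

def interpolation_residual(n_points, seed=0):
    """Largest absolute residual after solving the square system numerically."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)
    y = rng.normal(0.0, 1.0, n_points)  # assumed random targets
    V = np.vander(x, n_points)
    coeffs, *_ = np.linalg.lstsq(V, y, rcond=None)
    return float(np.max(np.abs(V @ coeffs - y)))

# 21 points correspond to the 21 coefficients of a degree-20 polynomial.
for n in (3, 6, 11, 21):
    print(f"N={n:2d}  cond={vandermonde_condition(n):.3e}  "
          f"max residual={interpolation_residual(n):.3e}")
```

For small systems the residual should sit at rounding level. For the largest one, the numerical solver may no longer reproduce the exact fit.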
1.2.4 Course to follow for high bias or high variance problems
We have seen in this section that there are several tools to diagnose a learning algorithm. All these tools can be applied to the particular case of reservoir computing in order to evaluate, diagnose and improve performance. Here we present some actions that can be taken when a high bias or a high variance problem is found. When facing a high bias problem, we can:
• Add more features. In our example of predicting house prices, including not only the size of the house but also the year it was built, the neighborhood, and other features may help a high-bias estimator.
• Increase the complexity of the model. As we studied in Section 1.2.1, for polynomial models we can increase the degree of the polynomial function to add complexity. Other kinds of models will have their own methods for adding complexity.
• Decrease regularization. Regularization is a technique used to impose simplicity in some machine learning models by adding a penalty term that depends on the characteristics of the parameters. If a model has high bias, decreasing the effect of regularization can lead to better results. Refer to Section 1.1.2 for more information.
• Use a smaller training set. This is more a piece of advice than a remedy. Reducing the number of training samples will probably not improve the performance of the estimator, since a high-bias model will keep about the same error for smaller training datasets. However, for computationally expensive algorithms, reducing the training set can lead to improvements in computational speed.
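As an illustration of the second item in the list above, the following sketch shows how increasing the polynomial degree lowers the training error of an underfitting model. The target function (a noisy cubic), the sample size and the helper `training_error` are assumptions chosen for the example, not taken from the thesis.

```python
import numpy as np

def training_error(x, y, degree):
    """Training MSE of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((y - np.polyval(coeffs, x)) ** 2))

# Assumed synthetic data: a noisy cubic that a straight line cannot capture.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 100)
y = x**3 - x + rng.normal(0.0, 0.05, 100)

errors = {d: training_error(x, y, d) for d in (1, 2, 3)}
for d, err in errors.items():
    print(f"degree {d}: training MSE = {err:.5f}")
```

Because each lower-degree model is a special case of the higher-degree one, the least-squares training error cannot increase with the degree. Whether the CV error also improves has to be checked separately.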
In contrast, if an algorithm is suffering from a high variance problem, some steps we can follow are:
• Use fewer features. Using a feature selection method may decrease the overfitting of the estimator.
• Increase the training set. Adding samples can help to reduce a high variance problem, as mentioned in Section 1.2.3.
• Increase regularization. Increasing the influence of the regularization parameter on the model may help to reduce overfitting. The regularization term is intended to avoid exactly this problem. See Section 1.1.2.
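The effect of the regularization parameter can be sketched with ridge regression, that is, least squares with an L2 penalty on the weights. Ridge is used here only as an assumed example of a regularized model; the form of regularization discussed in Section 1.1.2 may differ. The data, the polynomial degree, the values of `lam` and the helper names are likewise illustrative assumptions.

```python
import numpy as np

def poly_features(x, degree):
    """Design matrix with columns x^0, x^1, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam * I)^(-1) X^T y.

    For brevity the intercept column is penalized as well.
    """
    identity = np.eye(X.shape[1])
    return np.linalg.solve(X.T @ X + lam * identity, X.T @ y)

def mse(X, y, w):
    """Mean squared error of the linear model X @ w."""
    return float(np.mean((y - X @ w) ** 2))

# Assumed synthetic data: few noisy samples and a flexible degree-10 model.
rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(-1.0, 1.0, n)
    return x, np.sin(np.pi * x) + rng.normal(0.0, 0.2, n)

x_tr, y_tr = sample(15)
x_cv, y_cv = sample(200)
X_tr, X_cv = poly_features(x_tr, 10), poly_features(x_cv, 10)

results = []
for lam in (1e-3, 1e-1, 1.0):
    w = ridge_fit(X_tr, y_tr, lam)
    results.append((lam, mse(X_tr, y_tr, w), mse(X_cv, y_cv, w),
                    float(np.linalg.norm(w))))
    print(f"lam={lam:g}  train={results[-1][1]:.4f}  "
          f"cv={results[-1][2]:.4f}  |w|={results[-1][3]:.3f}")
```

A larger `lam` shrinks the weights and raises the training error. Whether the CV error improves depends on the data, which is why the parameter is usually chosen by comparing CV errors.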
Up to this point in this thesis we have studied how to teach a learning algorithm, such as an artificial neural network or reservoir computing, to perform a task. We have also seen what we can do to optimize our model and diagnose its problems. We may now wonder how to quantify the goodness of a model and how its performance compares to other approaches. In the following section, we explore different measures to compare models.