the size N of the learning set L used to build ϕL using A tends to infinity.
Note that the definition of consistency depends on the distribution P(X, Y). In general, a learning algorithm A can be proven to be consistent for some classes of distributions, but not for others. If consistency can be proven for any distribution P(X, Y), then A is said to be universally (strongly) consistent.
2.3 Model selection
From the previous discussion in Section 2.2.2, it appears that to solve the supervised learning problem it would be sufficient to estimate P(Y|X) from the learning sample L and then to define a model accordingly using either Equation 2.11 or Equation 2.12. Unfortunately, this approach is infeasible in practice because it requires L to grow exponentially with the number p of input variables in order to compute accurate estimates of P(Y|X) [Geurts, 2002].
2.3.1 Selecting the (approximately) best model
To make supervised learning work in high-dimensional input spaces with learning sets of moderate sizes, simplifying assumptions must be made on the structure of the best model ϕB. More specifically, a supervised learning algorithm assumes that ϕB – or at least a good enough approximation – lives in a family H of candidate models, also known as hypotheses in statistical learning theory, of restricted structure. In this setting, the model selection problem is then defined as finding the best model among H on the basis of the learning set L.
Depending on restrictions made on the structure of the problem, the Bayes model usually does not belong to H, but there may be models ϕ ∈ H that are sufficiently close to it. As such, the approximation error [Bottou and Bousquet, 2011] measures how closely the models in H can approximate the optimal model ϕB:
$$\mathrm{Err}(\mathcal{H}) = \min_{\varphi \in \mathcal{H}} \{\mathrm{Err}(\varphi)\} - \mathrm{Err}(\varphi_B)$$
To be more specific, let θ be the vector of hyper-parameter values controlling the execution of a learning algorithm A. The application of A with hyper-parameters θ on the learning set L is a deterministic⁵ process yielding a model A(θ, L) = ϕL ∈ H. As such, our goal is to find the vector of hyper-parameter values yielding the best model learnable in H from L:
$$\theta^* = \arg\min_{\theta} \mathrm{Err}(\mathcal{A}(\theta, \mathcal{L})) \tag{2.14}$$
Again, this problem cannot (usually) be solved exactly in practice since it requires the true generalization error of a model to be computable. However, approximations θ̂∗ can be obtained in several ways. When L is large, the easiest way to find θ̂∗ is to use test sample estimates (as defined by Equation 2.6) to guide the search of the hyper-parameter values, that is:
$$\hat{\theta}^* = \arg\min_{\theta} \widehat{\mathrm{Err}}_{\text{test}}(\mathcal{A}(\theta, \mathcal{L})) \tag{2.15}$$
$$\phantom{\hat{\theta}^*} = \arg\min_{\theta} E(\mathcal{A}(\theta, \mathcal{L}_{\text{train}}), \mathcal{L}_{\text{test}}) \tag{2.16}$$
In practice, solving this latter equation is also a difficult task, but approximations can be obtained in several ways, e.g., using either manual tuning of θ, exhaustive exploration of the parameter space using grid search, or dedicated optimization procedures (e.g., using random search [Bergstra and Bengio, 2012]). Similarly, when L is scarce, the same procedure can be carried out in the exact same way but using K-fold cross-validation estimates (as defined by Equation 2.7) instead of test sample estimates. In any case, once θ̂∗ is identified, the learning algorithm is run once again on the entire learning set L, finally yielding the approximately optimal model A(θ̂∗, L).
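The search described above can be sketched in a few lines. The following is only an illustration under stated assumptions: the learning algorithm is a one-dimensional k-nearest-neighbour regressor whose single hyper-parameter is k, the error E is the mean squared error, and the data come from a hypothetical toy distribution. None of these choices, nor the function names, come from the text.

```python
import math
import random

def knn_predict(train, x, k):
    """Predict y at x as the mean output of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def error(train, test, k):
    """E(A(k, train), test): mean squared error on `test` of the model built on `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)

def select_by_test_sample(learning_set, grid, test_fraction=0.3):
    """Grid search guided by a single train/test split (large learning sets)."""
    n_test = int(len(learning_set) * test_fraction)
    l_test, l_train = learning_set[:n_test], learning_set[n_test:]
    return min(grid, key=lambda k: error(l_train, l_test, k))

def select_by_kfold(learning_set, grid, n_folds=5):
    """Grid search guided by K-fold cross-validation (scarce learning sets)."""
    folds = [learning_set[i::n_folds] for i in range(n_folds)]

    def cv_error(k):
        total = 0.0
        for i, held_out in enumerate(folds):
            rest = [p for j, fold in enumerate(folds) if j != i for p in fold]
            total += error(rest, held_out, k)
        return total / n_folds

    return min(grid, key=cv_error)

# Hypothetical toy data: y = sin(x) + Gaussian noise, generated in random order.
rng = random.Random(0)
learning_set = []
for _ in range(150):
    x = rng.uniform(0.0, 6.0)
    learning_set.append((x, math.sin(x) + rng.gauss(0.0, 0.3)))

grid = [1, 3, 5, 10, 20, 40]
k_test = select_by_test_sample(learning_set, grid)
k_cv = select_by_kfold(learning_set, grid)

# Final step: rebuild the model on the entire learning set with the selected value.
def final_model(x):
    return knn_predict(learning_set, x, k_cv)
```

Both selection functions only differ in the estimate they minimize; the final model is in either case rebuilt on the whole learning set.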
When optimizing θ, special care must be taken so that the resulting model is neither too simple nor too complex. In the former case, the model is said to underfit the data, i.e., to be not flexible enough to capture the structure between X and Y. In the latter case, the model is said to overfit the data, i.e., to be too flexible and to capture isolated structures (i.e., noise) that are specific to the learning set.
As Figure 2.1 illustrates, this phenomenon can be observed by examining the respective training and test estimates of the model with respect to its complexity. When the model is too simple, both the training and test estimates are large because of underfitting. As complexity increases, the model gets more accurate and both the training and test estimates decrease. However, when the model becomes too complex, specific elements from the training set get captured, reducing the corresponding training estimates down to 0, as if the model were perfect.
⁵ If A makes use of a pseudo-random generator to mimic a stochastic process, then
[Plot: mean square error versus model complexity, with training error and test error curves.]
Figure 2.1: Training and test error with respect to the complexity of a model. The light blue curves show the training error over Ltrain while the light red curves show the test error estimated over Ltest for 100 pairs of training and test sets Ltrain and Ltest drawn at random from a known distribution. The thick blue curve is the average training error while the thick red curve is the average test error. (Figure inspired from [Hastie et al., 2005].)
At the same time, the test estimates become worse because the structure learned from the training set is actually too specific and does not generalize. The model is overfitting. The best hyper-parameter value θ is therefore the one making the appropriate trade-off and producing a model which is neither too simple nor too complex, as shown by the gray line on the figure.
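The behaviour of the two estimates can be reproduced on a small scale. The sketch below assumes a one-dimensional k-nearest-neighbour regressor, for which a smaller k means a more flexible model, and a hypothetical distribution y = sin(x) + noise; neither is the setting used to produce Figure 2.1.

```python
import math
import random

def knn_predict(train, x, k):
    """Predict y at x as the mean output of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, sample, k):
    """Mean squared error on `sample` of the model built on `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in sample) / len(sample)

def draw(n, rng):
    """Hypothetical known distribution: y = sin(x) + Gaussian noise."""
    points = []
    for _ in range(n):
        x = rng.uniform(0.0, 6.0)
        points.append((x, math.sin(x) + rng.gauss(0.0, 0.3)))
    return points

rng = random.Random(0)
l_train, l_test = draw(100, rng), draw(100, rng)

# From the simplest model (k = 100, a constant) to the most complex one (k = 1).
ks = [100, 50, 20, 10, 5, 2, 1]
train_error = {k: mse(l_train, l_train, k) for k in ks}
test_error = {k: mse(l_train, l_test, k) for k in ks}

for k in ks:
    print(f"k={k:3d}  train={train_error[k]:.3f}  test={test_error[k]:.3f}")
```

With k = 1, every training point is its own nearest neighbour, so the training error is exactly zero while the test error is not, which is the overfitting regime described above.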
As we will see later in Chapter 4, overfitting can also be explained by decomposing the generalization error in terms of bias and variance. A model which is too simple usually has high bias but low variance, while a model which is too complex usually has low bias but high variance. In those terms, finding the best model amounts to making the appropriate bias-variance trade-off.
2.3.2 Selecting and evaluating simultaneously
In real applications, and in particular when reporting results, it usually happens that one wants to do both model selection and model assessment. That is, having chosen a final model A(θ̂∗, L), one wants to also estimate its generalization error.
A naive assessment of the generalization error of the selected model might be to simply use the test sample estimate $\widehat{\mathrm{Err}}_{\text{test}}(\mathcal{A}(\hat{\theta}^*, \mathcal{L}))$ (i.e., $E(\mathcal{A}(\hat{\theta}^*, \mathcal{L}_{\text{train}}), \mathcal{L}_{\text{test}})$). The issue with this estimate is that the learned model is not independent from Ltest since its repeated construction was precisely guided by the minimization of the prediction error over Ltest. As a result, the minimized test sample error is in fact a biased, optimistic estimate of the true generalization error, sometimes leading to substantial underestimations. For the same reasons, using the K-fold cross-validation estimate does not provide a better estimate since model selection was similarly guided by the minimization of this quantity.
To guarantee an unbiased estimate, the test set on which the generalization error is evaluated should ideally be kept out of the entire model selection procedure and only be used once the final model is selected. Algorithm 2.1 details such a protocol. Similarly, per-fold estimates of the generalization error should be kept out of the model selection procedure, for example using nested cross-validation within each fold, as made explicit in Algorithm 2.2.
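As a rough illustration of both ideas, the sketch below separates selection from assessment, first with a three-way split and then with nested cross-validation. It is not a transcription of Algorithm 2.1 or Algorithm 2.2: the k-nearest-neighbour learner, the split proportions, the toy data and the function names are all assumptions made for the example.

```python
import math
import random

def knn_predict(train, x, k):
    """Predict y at x as the mean output of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def error(train, test, k):
    """Mean squared error on `test` of the model built on `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)

def split_folds(data, n_folds):
    return [data[i::n_folds] for i in range(n_folds)]

def cv_error(data, k, n_folds):
    """K-fold cross-validation estimate of the error for hyper-parameter k."""
    folds = split_folds(data, n_folds)
    total = 0.0
    for i, held_out in enumerate(folds):
        rest = [p for j, fold in enumerate(folds) if j != i for p in fold]
        total += error(rest, held_out, k)
    return total / n_folds

def train_valid_test(learning_set, grid):
    """Select k on (train, valid) only, then assess once on the untouched test part."""
    n = len(learning_set)
    l_train = learning_set[: n // 2]
    l_valid = learning_set[n // 2 : 3 * n // 4]
    l_test = learning_set[3 * n // 4 :]
    k = min(grid, key=lambda c: error(l_train, l_valid, c))
    return k, error(l_train + l_valid, l_test, k)

def nested_cv(learning_set, grid, n_folds=5):
    """Outer folds assess; selection uses an inner cross-validation on the rest."""
    folds = split_folds(learning_set, n_folds)
    estimates = []
    for i, held_out in enumerate(folds):
        rest = [p for j, fold in enumerate(folds) if j != i for p in fold]
        k = min(grid, key=lambda c: cv_error(rest, c, n_folds))
        estimates.append(error(rest, held_out, k))
    return sum(estimates) / n_folds

# Hypothetical toy data: y = sin(x) + Gaussian noise.
rng = random.Random(0)
learning_set = []
for _ in range(120):
    x = rng.uniform(0.0, 6.0)
    learning_set.append((x, math.sin(x) + rng.gauss(0.0, 0.3)))

grid = [1, 3, 5, 10, 20, 40]
k_selected, tvt_estimate = train_valid_test(learning_set, grid)
nested_estimate = nested_cv(learning_set, grid)
```

In both functions, the data used to compute the reported estimate never take part in choosing k, which is the property the paragraph above asks for.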
Algorithm 2.1. Train-Valid-Test set protocol for both model selection and