Statistical Learning Theory

1.2 Machine Learning

1.2.1 Statistical Learning Theory

Statistical learning theory (SLT) provides a rigorous mathematical framework that formulates the supervised learning task as a risk functional minimization problem. In a nutshell, the task is to minimize the risk of making a bad prediction. A full exposition of the topic would go far beyond the scope of this thesis. We therefore provide only a brief summary, following Vapnik [124] and omitting many of the technical details. Hopefully, this will be sufficient to give a clear picture of the underlying mathematical problem.

SLT formulates the learning task as the problem of finding an optimal mapping, $\hat{f}: X \to Y$, based on a set of input data, $\{x_i \mid x_i \in X\}$, and output data, $\{y_i \mid y_i \in Y\}$. The set of ordered pairs $\{(x_i, y_i)\}$ is known as the training set. If the mapping is known, then for any future input $x$ it is possible to predict the output, $y$. The input data are assumed to be drawn from an unknown but fixed probability distribution, $P(x)$. The theory therefore assumes that the training set consists of independent and identically distributed (i.i.d.) data, drawn from the joint probability distribution $P(x, y) = P(x)P(y|x)$, where $P(y|x)$ is the conditional probability of observing $y$ given the input $x$. We note that in practice the i.i.d. assumption may not hold. The learning is facilitated by the expert system, i.e. the supervisor, which represents the true response of the system and returns $y$ according to the exact probability distribution $P(y|x)$ for any $x$. We note that, because the mapping is probabilistic, the supervisor need not reproduce the exact responses, $y_i$, that are realized in the training set. Finally, an algorithm that can implement a family of functions, $f(x, \alpha)$ with $\alpha \in \Lambda$, needs to be provided. The learning task is then to identify an optimal function $\hat{f}(x) \equiv f(x, \alpha_0)$, which minimizes the risk of making a bad prediction.
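The three ingredients of this setup (a generator of inputs, a supervisor, and a learning machine) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the uniform input distribution, the noisy sine response, and the polynomial family are hypothetical choices that do not come from the text, and the function names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(n):
    """Draw n inputs i.i.d. from a fixed P(x); uniform on [-1, 1] is an assumption."""
    return rng.uniform(-1.0, 1.0, size=n)

def supervisor(x, noise=0.1):
    """Return y drawn from P(y|x): a hypothetical sine response plus Gaussian noise."""
    x = np.asarray(x, dtype=float)
    return np.sin(np.pi * x) + rng.normal(0.0, noise, size=x.shape)

def learning_machine(x, alpha):
    """Family f(x, alpha): polynomials with coefficients alpha, lowest order first."""
    return np.polynomial.polynomial.polyval(x, alpha)

# Training set {(x_i, y_i)} drawn from P(x, y) = P(x) P(y|x).
x_train = generator(50)
y_train = supervisor(x_train)

# Querying the supervisor again at the same inputs gives different responses,
# because the mapping x -> y is probabilistic.
y_again = supervisor(x_train)

# One member of the family: f(x, alpha) = x.
alpha = np.array([0.0, 1.0])
y_pred = learning_machine(x_train, alpha)
```

Comparing `y_train` with `y_again` shows why the supervisor need not reproduce the responses realized in the training set.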

The risk minimization problem is not uniquely defined. In general, one can write down the risk functional, $R(\alpha)$, once the loss function, $L(y, x)$, is given. The latter measures the distance between the predicted output value, $\hat{y} = \hat{f}(x)$, and the true output value, $y$; its dependence on $\alpha$ enters through the prediction $f(x, \alpha)$. The risk functional is defined as the expectation value of the loss function,

$$R(\alpha) = \int dP(x, y)\, L(y, x). \tag{1.83}$$

For real-valued functions $f_\alpha(x) \equiv f(x, \alpha)$, the most common choice of loss function is the squared-error loss, $L_2(y, x) = [y - f_\alpha(x)]^2$. In this case it is known that the optimal function, $\hat{f}$, which minimizes $R(\alpha)$, is the conditional expectation value of $y$ given $x$,

$$\hat{f}(x) = \int y\, dP(y|x). \tag{1.84}$$

If the response of the supervisor is not deterministic, there will always be some residual variance in predicting the output. This is an intrinsic error, which is outside our control. In other words, an ML algorithm that predicts without any loss on the training set should be regarded with suspicion. An alternative loss function is the $L_1$-loss, given by $L_1(y, x) = |y - f_\alpha(x)|$. There is no a priori reason to assume that one loss function, i.e. distance metric, is better than the other. The best choice needs to be determined for the particular problem, i.e. for the particular distribution, $P(x, y)$.
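The effect of the loss function on the optimal prediction can be checked numerically. The sketch below fixes a single input $x$, draws responses from a skewed conditional distribution (an exponential distribution, chosen purely for illustration), and estimates the risk of every constant prediction on a grid by Monte Carlo. Under the squared-error loss the minimizer should sit near the sample mean, in line with Eq. (1.84), and the minimal risk should be close to the sample variance, i.e. the intrinsic error. Under the $L_1$-loss the minimizer should sit near the sample median instead. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Responses y at one fixed x, drawn from a skewed P(y|x) (an assumption).
y = rng.exponential(scale=1.0, size=5_000)

# Candidate predictions c = f_alpha(x) at that x.
step = 0.01
candidates = np.arange(0.0, 3.0 + step, step)

# Monte Carlo estimate of the risk for each candidate prediction.
residual = y[None, :] - candidates[:, None]
risk_l2 = np.mean(residual**2, axis=1)        # squared-error loss
risk_l1 = np.mean(np.abs(residual), axis=1)   # L1 loss

best_l2 = candidates[np.argmin(risk_l2)]
best_l1 = candidates[np.argmin(risk_l1)]

print("L2 minimizer:", best_l2, " sample mean:  ", y.mean())
print("L1 minimizer:", best_l1, " sample median:", np.median(y))
print("minimal L2 risk:", risk_l2.min(), " sample variance:", y.var())
```

For a symmetric conditional distribution the two minimizers coincide, so the choice of loss only matters when the distribution is skewed or heavy-tailed.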

Learning problems can be classified into three main categories: pattern recognition (classification), regression estimation, and probability density estimation [124]. In the case of classification, the supervisor's output is described by an indicator function. In the simplest case of binary classification, the indicator function takes only two values, for example 0 and 1. The most common loss function here is the zero-one loss [23], which is a $K \times K$ matrix with zeros on the diagonal and ones elsewhere. Here $K$ is the number of different classes in the set of possible outcomes, e.g. $K = 2$ for binary classification. Every misclassification is therefore penalized by a unit loss. Minimizing the risk functional leads to the Bayes classifier, which states that the optimal class is the most probable one for a given $x$. Regression estimation was already discussed in the preceding paragraph. Finally, the problem of density estimation assumes that we have at our disposal a set of observations, $\{x_i\}$, which are randomly drawn from an unknown distribution function. The task is then to choose the optimal density from a set of densities $p(x, \alpha)$, $\alpha \in \Lambda$, that best describes the observations. The loss function used here is the log-loss, $L(p(x, \alpha)) = -\log p(x, \alpha)$. The risk minimization approach therefore encompasses all of the aforementioned learning tasks; only the loss function differs.
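As a small illustration of how the same recipe covers classification and density estimation, the sketch below first builds the zero-one loss matrix for $K = 3$ and shows that minimizing the expected loss picks the most probable class. It then minimizes the average log-loss over a one-parameter family of unit-variance Gaussian densities. The posterior probabilities, the Gaussian family, and all names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Classification: zero-one loss and the Bayes classifier ---
K = 3
zero_one_loss = np.ones((K, K)) - np.eye(K)   # rows: true class, columns: predicted class

posterior = np.array([0.2, 0.5, 0.3])         # hypothetical P(y = k | x) at one x

# Expected loss of predicting class j: sum_k P(k|x) * loss[k, j] = 1 - P(j|x).
expected_loss = posterior @ zero_one_loss
bayes_class = int(np.argmin(expected_loss))   # coincides with the most probable class

# --- Density estimation: log-loss over p(x, alpha) = N(alpha, 1) ---
x_obs = rng.normal(loc=2.0, scale=1.0, size=2_000)   # observations {x_i} (assumed Gaussian)
mus = np.linspace(0.0, 4.0, 401)                     # candidate parameters alpha

# Average of -log p(x_i, alpha) for each candidate alpha.
neg_log_p = 0.5 * (x_obs[None, :] - mus[:, None])**2 + 0.5 * np.log(2.0 * np.pi)
avg_log_loss = neg_log_p.mean(axis=1)
best_mu = mus[np.argmin(avg_log_loss)]               # close to the sample mean
```

In both parts only the loss changes; the minimization over the candidate set is the same operation.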

Although the risk functional (1.83) provides a satisfactory theoretical way to minimize the risk of making a bad prediction, in practice all samples are finite. Hence, a practical learning theory needs to provide the means to minimize the risk using only a finite set of training data. For this purpose, statistical learning theory introduces the concept of the “empirical risk functional”, defined as

$$R_{\mathrm{emp}}(\alpha) = \frac{1}{l} \sum_{i=1}^{l} L(y_i, x_i), \tag{1.85}$$

where $l$ is the size of the training set. The empirical risk of Eq. (1.85) is therefore completely analogous to the risk of Eq. (1.83), only defined for a finite sample. The task of SLT is then to establish under which conditions the empirical risk converges to the true risk and how to control the rate of convergence. Finally, the theory needs to provide guidelines for constructing practical ML algorithms.

Without going into technical details, it turns out that the empirical risk converges to the true risk if certain conditions are met. However, the convergence is guaranteed only in the asymptotic case, namely when $l \to \infty$ [122]. The strict criterion is given in terms of the Vapnik-Chervonenkis (VC) entropy of the set of functions $f(x, \alpha)$ and the corresponding VC-dimension, $h$ [124]. These measures, loosely speaking, describe the capacity of a set of functions, i.e. its ability to fit different data. If the training set size, $l$, is much larger than $h$, then the empirical risk provides a good estimate of the true risk. For small samples, however, this is no longer true. Unfortunately, the “curse of dimensionality” implies that in a high-dimensional space any finite sample is small [23].

The results of the theoretical analysis suggest that for finite samples a new risk minimization principle needs to be employed, namely structural risk minimization. Here the objective is to construct a structure of nested subsets of functions, $S_k$, each having a finite VC-dimension, $h_k$, such that

$$S_1 \subset S_2 \subset \cdots \subset S_k \subset \cdots.$$

For each subset $S_k$ we choose the optimal function, $\hat{f}_k$, by minimizing the empirical risk, so that the true risk is bounded by

$$R(\alpha_l^k) \le R_{\mathrm{emp}}(\alpha_l^k) + \Omega\!\left(\frac{l}{h_k}\right). \tag{1.86}$$

The second term is the penalty functional, which goes to zero when $l/h_k \to \infty$. It corresponds to the confidence interval, which increases as $h_k$ gets larger (for a fixed $l$). A more complete set of functions, $S_k$, has a larger confidence interval and hence the risk is less tightly bounded. In other words, we can always reduce the empirical risk by choosing a large enough set of functions; however, the uncertainty of the fit increases at the same time, i.e. we over-fit. Structural risk minimization therefore implies that we need to simultaneously minimize the empirical risk and the capacity of the family of functions, $S_k$.
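The trade-off described above can be made concrete with a toy experiment. The sketch below takes the nested structure $S_k$ to be polynomials of degree at most $k$, computes the empirical risk of Eq. (1.85) with the squared-error loss for each $k$, and adds a capacity penalty that grows with $h_k$. Several choices here are assumptions and not part of the text: the data-generating model, the identification $h_k = k + 1$ (the number of free parameters), and the penalty itself, for which a commonly quoted VC confidence term for bounded losses, $\sqrt{[h(\ln(2l/h) + 1) - \ln(\eta/4)]/l}$, is borrowed purely as a stand-in for $\Omega(l/h_k)$. It is not a valid bound for the unbounded squared loss, so the selected $k$ is only indicative.

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.polynomial.polynomial

def sample(l, noise=0.2):
    """Hypothetical data model: x ~ U(-1, 1), y = sin(pi x) + Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=l)
    y = np.sin(np.pi * x) + rng.normal(0.0, noise, size=l)
    return x, y

def empirical_risk(alpha, x, y):
    """Eq. (1.85) with the squared-error loss."""
    return np.mean((y - P.polyval(x, alpha))**2)

def vc_confidence(l, h, eta=0.05):
    """Illustrative penalty (assumption): VC confidence term for bounded losses."""
    return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

l = 30
x, y = sample(l)                      # small training set
x_big, y_big = sample(100_000)        # large fresh sample as a proxy for the true risk

degrees = list(range(0, 11))          # S_0 within S_1 within ... within S_10
alphas = [P.polyfit(x, y, k) for k in degrees]   # empirical risk minimizer in each S_k

r_emp = np.array([empirical_risk(a, x, y) for a in alphas])
r_true = np.array([empirical_risk(a, x_big, y_big) for a in alphas])
penalty = np.array([vc_confidence(l, k + 1) for k in degrees])   # h_k = k + 1 (assumption)

k_srm = degrees[int(np.argmin(r_emp + penalty))]

for k, re, rt, pen in zip(degrees, r_emp, r_true, penalty):
    print(f"k={k:2d}  R_emp={re:.4f}  R_true~{rt:.4f}  penalty={pen:.3f}")
print("degree selected by the penalized criterion:", k_srm)
```

Because the families are nested, `r_emp` can only decrease as $k$ grows, while `penalty` increases for fixed $l$. Comparing `r_emp` with `r_true` at large $k$ shows the over-fitting that the penalty is meant to guard against.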