3.2 Support vector machine at a glance
3.2.2 ERM and VRM frameworks
We finish the review of SVM by briefly discussing the general framework of Statistical Learning Theory, which includes the SVM. Without entering into details such as the important theorem of Vapnik-Chervonenkis (1998), we would like to give a more general view of the SVM by answering questions such as how to approach the SVM as a regression and how to interpret the soft-margin SVM as a regularization technique.
Empirical Risk Minimization framework
The Empirical Risk Minimization framework was studied by Vapnik and Chervonenkis in the 1970s. In order to show the main idea, we first fix some notation. Let $(x_i, y_i)$, $1 \le i \le n$, be the training dataset of input/output pairs. The dataset is supposed to be generated i.i.d. from an unknown distribution $P(x, y)$. The dependency between the input $x$ and the output $y$ is characterized by this distribution. For example, if the input $x$ has a distribution $P(x)$ and the output is related to $x$ via a function $y = f(x)$ which is altered by a Gaussian noise $\mathcal{N}(0, \sigma^2)$, then $P(x, y)$ reads
$$P(x, y) = P(x)\, \mathcal{N}\left(f(x) - y, \sigma^2\right)$$
We remark in this example that if $\sigma \to 0$ then $\mathcal{N}(0, \sigma^2)$ tends to a Dirac distribution, which means that the relation between input and output can be exactly determined by the position of the maximum of the distribution $P(x, y)$. Estimating the function $f(x)$ is fundamental. In order to measure the quality of the estimation, we compute the expected value of the loss function with respect to the distribution $P(x, y)$. We define here the loss function in two different contexts:
1. Classification: $l(f(x), y) = \mathbb{I}_{f(x) \neq y}$ where $\mathbb{I}$ is the indicator function.
2. Regression: $l(f(x), y) = (f(x) - y)^2$
The objective of statistical learning is to determine the function $f$ in a certain function space $\mathcal{F}$ which minimizes the expected loss, or risk objective function:
$$R(f) = \int l(f(x), y)\, dP(x, y)$$
As the distribution $P(x, y)$ is unknown, the expected loss cannot be evaluated. However, with the available training dataset $\{(x_i, y_i)\}$, one can compute the empirical risk as follows:
$$R_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i)$$
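As a minimal sketch, the empirical risk can be computed for the two loss functions defined earlier. The function names and the toy data are assumptions made for illustration only; they do not come from the text.

```python
import numpy as np

def zero_one_loss(pred, y):
    """Classification loss: 1 where the prediction differs from the label."""
    return (pred != y).astype(float)

def squared_loss(pred, y):
    """Regression loss: squared difference between prediction and target."""
    return (pred - y) ** 2

def empirical_risk(f, loss, X, y):
    """R_emp(f) = (1/n) * sum_i loss(f(x_i), y_i)."""
    return float(np.mean(loss(f(X), y)))

# Toy data (hypothetical, for illustration only).
X = np.array([-2.0, -1.0, 1.0, 2.0])
y_class = np.array([-1, -1, 1, -1])
y_reg = np.array([-4.0, -2.0, 2.0, 5.0])

# Classifier f(x) = sign(x): only the last point is misclassified.
risk_class = empirical_risk(np.sign, zero_one_loss, X, y_class)

# Regressor f(x) = 2x: only the last point is off, by 1.
risk_reg = empirical_risk(lambda x: 2.0 * x, squared_loss, X, y_reg)
```

Both risks are plain averages of the per-sample loss, so the same `empirical_risk` helper serves classification and regression.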
In the limit of a large dataset, $n \to \infty$, we expect the convergence $R_{emp}(f) \to R(f)$ for every tested function $f$, thanks to the law of large numbers. However, is the learning function $f$ which minimizes $R_{emp}(f)$ also the one minimizing the true risk $R(f)$? The answer to this question is no. In general, there is an infinite number of functions $f$ which can learn the training dataset perfectly, $f(x_i) = y_i\ \forall i$. In fact, we have to restrict the function space $\mathcal{F}$ in order to ensure the uniform convergence of the empirical risk to the true risk. The characterization of the complexity of a function space $\mathcal{F}$ was first studied in VC theory via the concept of VC dimension (1971) and the important VC theorem, which gives an upper bound on the probability of deviation, so that $P\{\sup_{f \in \mathcal{F}} |R(f) - R_{emp}(f)| > \varepsilon\} \to 0$.
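The following sketch illustrates why minimizing the empirical risk over an unrestricted function space is not enough. It assumes a toy model $y = 2x$ plus Gaussian noise (our choice, not from the text) and compares a "memorizer", which reproduces the training outputs exactly and returns 0 elsewhere, with the simple linear function $f(x) = 2x$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n pairs from the assumed toy model y = 2x + Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + rng.normal(0.0, 0.1, size=n)
    return x, y

def mse(pred, y):
    """Average squared loss."""
    return float(np.mean((pred - y) ** 2))

x_train, y_train = sample(50)
x_test, y_test = sample(1000)

# "Memorizer": returns y_i on a training input x_i and 0 everywhere else.
lookup = dict(zip(x_train.tolist(), y_train.tolist()))

def memorizer(x):
    return np.array([lookup.get(v, 0.0) for v in x.tolist()])

def linear(x):
    return 2.0 * x

train_risk_mem = mse(memorizer(x_train), y_train)  # zero empirical risk
test_risk_mem = mse(memorizer(x_test), y_test)     # proxy for the true risk
train_risk_lin = mse(linear(x_train), y_train)
test_risk_lin = mse(linear(x_test), y_test)
```

The memorizer attains zero empirical risk while its risk on fresh data stays large, whereas the linear function has a small risk on both samples. The test-set average is used here only as a stand-in for the true risk $R(f)$.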
A common way to restrict the function space is to impose a regularization condition. We denote by $\Omega(f)$ a measure of regularity; the regularized problem then consists of minimizing the regularized risk:
$$R_{reg}(f) = R_{emp}(f) + \lambda \Omega(f)$$
Here $\lambda$ is the regularization parameter and $\Omega(f)$ can be, for example, the $L^p$ norm of some deviation of $f$.
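As an illustration, the sketch below minimizes a regularized risk for a linear function $f(x) = w^T x$ with squared loss. The choice $\Omega(f) = \|w\|^2$, the absence of an intercept, and the synthetic data are assumptions made to keep the example short; under them the minimizer has a closed form.

```python
import numpy as np

def regularized_risk(w, X, y, lam):
    """R_reg = (1/n) * sum_i (w.x_i - y_i)^2 + lam * ||w||^2."""
    residual = X @ w - y
    return float(np.mean(residual ** 2) + lam * np.dot(w, w))

def minimize_regularized_risk(X, y, lam):
    """Closed-form minimizer: solve (X^T X / n + lam * I) w = X^T y / n."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Synthetic data (hypothetical): linear signal plus small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=100)

w_weak = minimize_regularized_risk(X, y, lam=0.01)    # close to least squares
w_strong = minimize_regularized_risk(X, y, lam=10.0)  # strongly shrunk
```

Increasing $\lambda$ shrinks the solution towards zero, which is how the regularization term restricts the effective function space.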
Vapnik and Chervonenkis theory
We are not going to discuss VC theory in detail but only recall the most important results concerning the characterization of the complexity of a function class. In order to quantify the trade-off between the overfitting problem and the inseparable data problem, Vapnik and Chervonenkis introduced a very important concept, the VC dimension, together with an important theorem which characterizes the convergence of the empirical risk function. First, the VC dimension is introduced to measure the complexity of the class of functions $\mathcal{F}$.
Definition 3.2.1 The VC dimension of a class of functions $\mathcal{F}$ is defined as the maximum number of points that can be exactly learned by a function of $\mathcal{F}$:
$$h = \max\left\{ |X| : X \subset \mathcal{X}, \text{ such that } \forall b \in \{-1, 1\}^{|X|},\ \exists f \in \mathcal{F},\ \forall x_i \in X,\ f(x_i) = b_i \right\} \qquad (3.4)$$
With the definition of the VC dimension, we now present the VC theorems, which are a very powerful tool to control the upper limit of the convergence of the empirical risk to the true risk function. These theorems give us a clear idea of the upper bound in terms of the available information and the number of observations $n$ in the training set. By satisfying this theorem, we can control the trade-off between overfitting and underfitting. The relation between the number of factors, or coordinates of the vector $x$, and the VC dimension is given in the following theorem:
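Theorem 3.2.2 (VC theorem of hyperplanes) Let $\mathcal{F}$ be the set of hyperplanes in $\mathbb{R}^d$:
$$\mathcal{F} = \left\{ x \mapsto \mathrm{sign}\left(w^T x + b\right),\ w \in \mathbb{R}^d,\ b \in \mathbb{R} \right\}$$
then the VC dimension is $d + 1$.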
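The case $d = 2$ can be explored numerically. The sketch below is an illustration, not a proof: it uses a plain perceptron (our choice of search procedure) to look for a separating hyperplane for every labelling of a point set. The two point sets are hypothetical examples.

```python
import itertools
import numpy as np

def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron search for (w, b) with sign(w.x + b) == labels.

    Returns True if a separating hyperplane is found. Returning False only
    means none was found within max_epochs; it is not a proof.
    """
    X = np.hstack([points, np.ones((len(points), 1))])  # append 1 for the bias b
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for x, t in zip(X, labels):
            if t * np.dot(w, x) <= 0:  # misclassified or on the boundary
                w += t * x
                updated = True
        if not updated:
            return True
    return False

def shattered(points):
    """Check every labelling b in {-1, 1}^|X| of the given points."""
    n = len(points)
    return all(linearly_separable(points, np.array(b))
               for b in itertools.product([-1, 1], repeat=n))

three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

can_shatter_three = shattered(three_points)
can_shatter_four = shattered(four_points)
```

The three non-collinear points are shattered, while the search fails on the corners of the unit square because of the XOR labelling. This is consistent with $h = d + 1 = 3$ in the plane, although a single four-point configuration does not by itself establish the upper bound.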
This theorem gives the explicit relation between the VC dimension and the number of factors, that is, the number of coordinates of the input vectors in the training set. It can be used in the next theorem to evaluate the amount of information necessary for a good classification or regression.
Theorem 3.2.3 (Vapnik and Chervonenkis) Let $\mathcal{F}$ be a class of functions of VC dimension $h$. Then for any distribution $\Pr$ and for any sample $\{(x_i, y_i)\}_{i=1,\dots,n}$ drawn from this distribution, the following inequality holds true:
$$\Pr\left\{ \sup_{f \in \mathcal{F}} |R(f) - R_{emp}(f)| > \varepsilon \right\} < 4 \exp\left\{ h\left(1 + \ln\frac{2n}{h}\right) - \left(\varepsilon - \frac{1}{n}\right)^2 n \right\}$$
An important corollary of the VC theorem is the upper bound for the convergence of the empirical risk function to the risk function:
Corollary 3.2.4 Under the same hypotheses as the VC theorem, the following inequality holds with probability $1 - \eta$:
$$\forall f \in \mathcal{F},\quad R(f) - R_{emp}(f) \le \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}} + \frac{1}{n}$$
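To get a feeling for the orders of magnitude, the bound of the corollary can be evaluated directly. The sketch below is a straightforward transcription of the formula; the function name `vc_bound` and the chosen values of $h$, $n$ and $\eta$ are ours.

```python
import math

def vc_bound(h, n, eta):
    """Upper bound on R(f) - R_emp(f) that holds with probability 1 - eta."""
    capacity = h * (math.log(2.0 * n / h) + 1.0)
    confidence = math.log(eta / 4.0)
    return math.sqrt((capacity - confidence) / n) + 1.0 / n

# Hyperplanes in R^2 have VC dimension h = d + 1 = 3 (Theorem 3.2.2).
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(vc_bound(h=3, n=n, eta=0.05), 4))
```

For fixed $h$ the bound decreases roughly like $\sqrt{\ln n / n}$, and for a fixed sample size it grows with $h$, which is the quantitative form of the trade-off between overfitting and underfitting.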
We skip the proofs of these theorems and postpone the discussion of the practical importance of the VC theorems to Section 6, as the overfitting and underfitting problems are very present in any financial application.
Vicinal Risk Minimization framework
The Vicinal Risk Minimization (VRM) framework was formally developed in the work of Chapelle O. (2000s). In the ERM framework, the risk is evaluated by using the empirical probability distribution:
$$dP_{emp}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)\, \delta_{y_i}(y)$$
where $\delta_{x_i}(x)$ and $\delta_{y_i}(y)$ are Dirac distributions located at $x_i$ and $y_i$ respectively. In the VRM framework, the Dirac distribution $\delta_{x_i}(x)$ in $dP_{emp}$ is replaced by an estimated density in the vicinity of $x_i$:
$$dP_{vic}(x, y) = \frac{1}{n} \sum_{i=1}^{n} dP_{x_i}(x)\, \delta_{y_i}(y)$$
The vicinal risk is then defined as follows:
$$R_{vic}(f) = \int l(f(x), y)\, dP_{vic}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \int l(f(x), y_i)\, dP_{x_i}(x)$$
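In order to illustrate the difference between the ERM framework and the VRM framework, let us consider the following example of linear regression. In this case, the loss function is $l(f(x), y) = (f(x) - y)^2$ and the learning function is of the form $f(x) = w^T x + b$. Assume that the vicinal density $dP_{x_i}(x)$ is approximated by a white noise of variance $\sigma^2$ centered at $x_i$. The vicinal risk is calculated as follows:
$$\begin{aligned}
R_{vic}(f) &= \frac{1}{n}\sum_{i=1}^{n} \int (f(x) - y_i)^2\, dP_{x_i}(x) \\
&= \frac{1}{n}\sum_{i=1}^{n} \int (f(x_i + \varepsilon) - y_i)^2\, d\mathcal{N}(0, \sigma^2) \\
&= \frac{1}{n}\sum_{i=1}^{n} (f(x_i) - y_i)^2 + \sigma^2 \|w\|^2
\end{aligned}$$
The last step uses $f(x_i + \varepsilon) = f(x_i) + w^T \varepsilon$: the cross term vanishes because $\varepsilon$ has zero mean, and the expectation of $(w^T \varepsilon)^2$ equals $\sigma^2 \|w\|^2$.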
It is equivalent to the regularized risk minimization problem $R_{vic}(f) = R_{emp}(f) + \sigma^2 \|w\|^2$, with regularization parameter $\sigma^2$ and an $L^2$ penalty.
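This identity can be checked numerically. The sketch below, under assumptions of our own (synthetic data, an arbitrary linear function, isotropic Gaussian noise of variance $\sigma^2$ around each $x_i$), compares a Monte Carlo estimate of the vicinal risk with the closed form $R_{emp}(f) + \sigma^2 \|w\|^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set (hypothetical).
n, d, sigma = 50, 3, 0.3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=n)

# An arbitrary linear function f(x) = w.x + b to evaluate.
w = np.array([0.8, -1.5, 0.2])
b = 0.1

def f(points):
    return points @ w + b

# Closed form: empirical risk plus the sigma^2 * ||w||^2 penalty.
empirical_risk = np.mean((f(X) - y) ** 2)
closed_form = empirical_risk + sigma ** 2 * np.dot(w, w)

# Monte Carlo: average the loss over m noisy copies x_i + eps of each x_i,
# with eps drawn from N(0, sigma^2 * I).
m = 5000
eps = rng.normal(0.0, sigma, size=(m, n, d))
monte_carlo = np.mean((f(X + eps) - y) ** 2)
```

The two quantities agree up to sampling error, which shows concretely that smoothing each training input with Gaussian noise acts like an $L^2$ penalty on $w$.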