The Structural Risk Minimisation Principle

Let us turn to the study of (2.31), which provides the rate of convergence of the empirical risk to the expected one in terms of the VC dimension. From this relationship, following the procedure that led to (2.28) from (2.27), one can assert that with probability $1-\eta$, for all functions parametrised by $a \in \Lambda$,

$$R(a) \le R_{emp}(a) + \sqrt{\frac{h\left(1 + \ln(2l/h)\right) - \ln(\eta/4)}{l}} + \frac{1}{l}. \tag{2.32}$$

The upper bound on the expected risk $R(a)$ is the sum of two contributions: one coming from $R_{emp}(a)$ and the other from a term involving the ratio $l/h$. It is not obvious by mere observation that minimising $R_{emp}(a)$, as dictated by the ERM principle, will lead to the tightest bound on $R(a)$. One can easily verify, however, that when the sample size $l$ is large with respect to the VC dimension $h$, the lowest value that the r.h.s. can attain is determined mainly by $R_{emp}$. Thus the suggestion of the ERM principle, that we should try to find the function that classifies the training set with the minimum number of errors, seems to point in the right direction, since in this regime the training error proves decisive for the generalisation ability of the machine.
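The claim about large $l/h$ can be checked numerically. The following is a minimal sketch that evaluates the r.h.s. of (2.32) as written above; the function names and the sample values of $h$, $l$ and $\eta$ are illustrative assumptions, not taken from the text.

```python
import math


def confidence_interval(h: int, l: int, eta: float) -> float:
    """Capacity-dependent part of the r.h.s. of (2.32): the square root plus 1/l."""
    if h <= 0 or l <= 0 or not (0.0 < eta < 1.0):
        raise ValueError("need h > 0, l > 0 and 0 < eta < 1")
    numerator = h * (1.0 + math.log(2.0 * l / h)) - math.log(eta / 4.0)
    return math.sqrt(numerator / l) + 1.0 / l


def guaranteed_risk(r_emp: float, h: int, l: int, eta: float) -> float:
    """Full r.h.s. of (2.32): empirical risk plus the capacity-dependent part."""
    return r_emp + confidence_interval(h, l, eta)


if __name__ == "__main__":
    h, eta = 10, 0.05  # hypothetical VC dimension and confidence parameter
    for l in (100, 1_000, 100_000, 10_000_000):
        # As l/h grows the printed value shrinks, so R_emp dominates the bound.
        print(f"l/h = {l // h:>9}  confidence interval = {confidence_interval(h, l, eta):.4f}")
```

For a fixed $h$ the second column decreases towards zero as $l$ grows, which is the regime in which minimising $R_{emp}$ alone is enough to keep the bound tight.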

On the other hand, if the ratio $l/h$ is not large, then the second summand on the r.h.s. of (2.32), called the confidence interval, plays an important role, and the solution that gives the minimum guaranteed risk does not necessarily coincide with the one that comes from the ERM principle. Reducing the VC dimension of the hypothesis class reduces the contribution of the confidence interval, but it is reasonable to expect that it increases the training error. Thus (2.32) leaves one with the freedom to control the generalisation ability of the learning machine by adjusting two opposing factors: the number of training errors on the one hand and the capacity of the function class on the other. In the attempt to minimise both terms simultaneously, the bound (2.32) provides a quantitative criterion on the basis of which a compromise between the two can be reached.

This new criterion, called the Structural Risk Minimisation (SRM) principle [62], suggests, in contrast to the ERM principle, the minimisation of the bound over both the empirical risk and the confidence interval, the latter being controlled by the capacity of the hypothesis class. Let us define a structure on the set $S$ of functions $Q(z, a)$, $a \in \Lambda$. Consider the nested subsets of functions

$$S_1 \subset S_2 \subset \cdots \subset S_k \subset \cdots,$$

where $S_k = \{Q(z, a) : a \in \Lambda_k\}$. The union of all subsets is denoted by $S^\star = \bigcup_k S_k$. The subsets are constructed in such a way that the VC dimension $h_k$ of the set $S_k$ of functions is nondecreasing with increasing index $k$:

$$h_1 \le h_2 \le \cdots \le h_k \le \cdots.$$

We are interested only in classification tasks. This restricts the functions in each element $S_k$ of the structure to the indicator functions $\{Q(z, a^k) \in \{1, 0\},\ a^k \in \Lambda_k\}$. Additionally, for the structure to be admissible we need the VC dimension of each element of the structure to be finite and the set $S^\star$ to be everywhere dense in the set $S$ in the $L_1$ metric. According to the SRM principle, given a number of observations one selects the element $S_k$ of the structure, and the function within it, for which the bound (2.32) is smallest.

The policy imposed by this principle, namely to discover the element of the structure whose capacity leads to the minimisation of the risk, justifies its name.
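The selection policy can be sketched in a few lines. The example below assumes that, for each element $S_k$ of a finite portion of the structure, the minimal empirical risk and the VC dimension $h_k$ are already known; the helper name `srm_select` and all numerical values are hypothetical and serve only to illustrate how the bound (2.32) arbitrates between training error and capacity.

```python
import math
from typing import Sequence, Tuple


def guaranteed_risk(r_emp: float, h: int, l: int, eta: float) -> float:
    """R.h.s. of (2.32) for empirical risk r_emp, VC dimension h and l observations."""
    numerator = h * (1.0 + math.log(2.0 * l / h)) - math.log(eta / 4.0)
    return r_emp + math.sqrt(numerator / l) + 1.0 / l


def srm_select(emp_risks: Sequence[float], vc_dims: Sequence[int],
               l: int, eta: float = 0.05) -> Tuple[int, float]:
    """Return the (0-based) index of the element with the smallest bound, and that bound."""
    if len(emp_risks) != len(vc_dims) or not emp_risks:
        raise ValueError("emp_risks and vc_dims must be non-empty and of equal length")
    bounds = [guaranteed_risk(r, h, l, eta) for r, h in zip(emp_risks, vc_dims)]
    k = min(range(len(bounds)), key=bounds.__getitem__)
    return k, bounds[k]


if __name__ == "__main__":
    # Hypothetical structure: capacity grows with k, training error falls with k.
    vc_dims = [2, 5, 20, 100]
    emp_risks = [0.30, 0.12, 0.10, 0.02]
    k_srm, bound = srm_select(emp_risks, vc_dims, l=2000)
    k_erm = min(range(len(emp_risks)), key=emp_risks.__getitem__)
    print(f"SRM picks element {k_srm} (bound {bound:.3f}); "
          f"lowest training error alone would pick element {k_erm}")
```

With these made-up numbers the element of lowest training error is the one of largest capacity, while the bound is smallest for an element of intermediate capacity, which is the compromise the principle is meant to find.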

As in the case of the ERM principle, analogous questions of consistency arise in connection with the SRM principle. For example, is it possible for the risk estimated on the basis of the function chosen according to this principle from an element $S_k$ of the structure to converge to the minimum one in $S$? And if this happens, what would be a bound on the rate of convergence?

From (2.30), by setting in the place of the growth function $G^{\Lambda}(2l)$ its upper bound written in terms of the VC dimension, assuming $2l > h_k$ and fixing $\eta = 1/l^2$, we obtain that with probability $1 - 2/l^2$

$$R(a_l^k) - R(a_0^k) = \int Q(z, a_l^k)\,dF(z) - \inf_{a \in \Lambda_k} \int Q(z, a)\,dF(z) \le \sqrt{\frac{2 \ln l}{2l}} + \sqrt{\frac{h_k\left(\ln\frac{2l}{h_k} + 1\right) + 2 \ln 2l}{l}} + \frac{1}{l}. \tag{2.33}$$
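The constants in (2.33) can be checked by direct substitution of $\eta = 1/l^2$. The second line below uses the confidence interval of (2.32) with $h = h_k$. The first line assumes that the first square root originates from a deviation term of the form $\sqrt{-\ln\eta/(2l)}$ for the single function $Q(z, a_0^k)$; this origin is not spelled out in the text and is stated here only as a plausible reading.

```latex
\begin{align*}
\eta = \frac{1}{l^{2}} \;&\Longrightarrow\; -\ln\eta = 2\ln l,
  &\sqrt{\frac{-\ln\eta}{2l}} &= \sqrt{\frac{2\ln l}{2l}}, \\
\eta = \frac{1}{l^{2}} \;&\Longrightarrow\; -\ln\frac{\eta}{4} = \ln\!\left(4l^{2}\right) = 2\ln 2l,
  &\sqrt{\frac{h_k\left(1+\ln\frac{2l}{h_k}\right) - \ln\frac{\eta}{4}}{l}}
  &= \sqrt{\frac{h_k\left(\ln\frac{2l}{h_k}+1\right) + 2\ln 2l}{l}} .
\end{align*}
```

Under the same reading, each of the two inequalities fails with probability at most $\eta$, so both hold together with probability at least $1 - 2\eta = 1 - 2/l^2$.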

We recognise in the place of the upper bound the confidence interval. The term $R(a_l^k)$ denotes the risk with respect to a solution found within the functions of the $k$-th element of the structure, whereas $R(a_0^k)$ signifies the minimum risk attainable for functions in the same element. The term on the l.h.s. of the previous relation can be decomposed as

$$R(a_l^k) - R(a_0^k) = R(a_l^k) - R(a_0) + R(a_0) - R(a_0^k).$$

With a little rearrangement this transforms (2.33) into a relation bounding the rate of convergence $V(l) = R(a_l^k) - R(a_0)$:

$$V(l) \le r_k + \sqrt{\frac{2 \ln l}{2l}} + \sqrt{\frac{h_k\left(\ln\frac{2l}{h_k} + 1\right) + 2 \ln 2l}{l}} + \frac{1}{l}, \tag{2.34}$$

where $r_k = R(a_0^k) - R(a_0)$.
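Written out, the rearrangement amounts to moving the term $R(a_0) - R(a_0^k) = -r_k$ of the decomposition to the other side and applying (2.33) to what remains:

```latex
\begin{align*}
V(l) = R(a_l^k) - R(a_0)
  &= \underbrace{R(a_0^k) - R(a_0)}_{r_k}
   + \underbrace{R(a_l^k) - R(a_0^k)}_{\text{bounded by (2.33)}} \\
  &\le r_k + \sqrt{\frac{2\ln l}{2l}}
   + \sqrt{\frac{h_k\left(\ln\frac{2l}{h_k}+1\right) + 2\ln 2l}{l}} + \frac{1}{l}.
\end{align*}
```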

One can show that if one imposes suitable rules for the choice of the element $S_k$ of the structure that depend on the number $l$ of observations, then for $l$ tending to infinity the risk $R(a_l^k)$ approaches the smallest one, $R(a_0)$, in the whole structure. Let us denote by $k(l)$ the rule, based on the number of observations, that discriminates between the subsets of the structure. In terms of the new notation (2.34) becomes

$$V(l) \le r_{k(l)} + \sqrt{\frac{2 \ln l}{2l}} + \sqrt{\frac{h_{k(l)}\left(\ln\frac{2l}{h_{k(l)}} + 1\right) + 2 \ln 2l}{l}} + \frac{1}{l}. \tag{2.35}$$

In order for consistency of the SRM principle to hold we need $V(l)$ to tend to 0 for an increasing number of observations. Indeed, if the element containing the minimiser function $a_0$ of the expected risk is found, $\lim_{l \to \infty} r_{k(l)} = 0$ due to the density of $S^\star$ in $S$. So for convergence to take place we additionally need the second term to tend to 0 for $l \to \infty$, or equivalently

$$\lim_{l \to \infty} \frac{h_{k(l)} \ln l}{l} = 0,$$

a condition which reminds us of $\lim_{l \to \infty} G^{\Lambda}(l)/l = 0$ for consistency of the ERM principle to hold. The term $r_{k(l)}$ of (2.35), known as the rate of approximation, is related to the deviation of the best approximation in $S_{k(l)}$ from the smallest possible risk. It is reasonable to expect that as we move to subsets with larger capacity this deviation will become smaller. On the other hand, the second term entering (2.35), known as the estimation error, measures the deviation of the risk computed on the basis of a function in $S_{k(l)}$ from the smallest possible in $S_{k(l)}$. We anticipate that this deviation becomes larger as $k(l)$ increases. Therefore, we come to the conclusion that the rate of convergence is governed by two opposing factors.
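The role of the condition on $h_{k(l)}$ can be illustrated numerically. The sketch below evaluates the $l$-dependent terms of (2.35), that is everything except $r_{k(l)}$, under two capacity schedules. Both schedules, `h_sqrt` and `h_fast`, are hypothetical choices made for illustration and do not come from the text: the first grows like $\sqrt{l}$ and satisfies the condition, the second grows like $l/\ln l$ and violates it.

```python
import math


def estimation_terms(h: int, l: int) -> float:
    """Sum of the terms on the r.h.s. of (2.35) other than r_k(l). Requires 2l > h."""
    first = math.sqrt(2.0 * math.log(l) / (2.0 * l))
    second = math.sqrt((h * (math.log(2.0 * l / h) + 1.0) + 2.0 * math.log(2.0 * l)) / l)
    return first + second + 1.0 / l


def condition_ratio(h: int, l: int) -> float:
    """The quantity h * ln(l) / l that must tend to 0 for consistency."""
    return h * math.log(l) / l


def h_sqrt(l: int) -> int:
    """Hypothetical schedule: capacity grows like sqrt(l)."""
    return max(1, math.isqrt(l))


def h_fast(l: int) -> int:
    """Hypothetical schedule: capacity grows like l / ln(l), too fast for the condition."""
    return max(1, int(l / math.log(l)))


if __name__ == "__main__":
    for l in (10**2, 10**4, 10**6):
        for name, schedule in (("sqrt", h_sqrt), ("fast", h_fast)):
            h = schedule(l)
            print(f"l = {l:>8}  {name:>4}: h = {h:>6}  "
                  f"h ln l / l = {condition_ratio(h, l):.4f}  "
                  f"terms = {estimation_terms(h, l):.4f}")
```

Under the slower schedule both printed quantities shrink as $l$ grows, whereas under the faster one the ratio stays close to 1 and the estimation terms do not vanish, so a fast growth of capacity buys a smaller $r_{k(l)}$ only at the price of losing control of the estimation error.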
