
The Second-Order Perceptron Algorithm

Apart from the standard Perceptron algorithm, there exist other algorithms for separating data into two classes whose bounds behave more attractively in certain cases. As a prominent example we briefly review and comment on the properties of the Second-order Perceptron algorithm [10], which can be viewed as an incremental variant of the Whitened Perceptron [10]. The algorithm may be considered an adaptation of ridge regression to online binary classification, and its analysis is inspired by an instance of Vovk's general aggregating algorithm [68], which is called the Forward algorithm in [4].

The Second-order Perceptron extends the well-known Perceptron algorithm by taking into account second-order information about the data, namely the correlation matrix of the training patterns. In this way the algorithm exploits the inherent geometrical properties of the data. As the authors of [10] argue, the algorithm has better mistake bounds in some cases and, in the experiments conducted, generalises better than the classical Perceptron algorithm. The cases shown to be advantageous for the Second-order Perceptron include settings in which the data are not scattered evenly in the volume of a hypersphere centred at the origin but mostly reside along certain directions. These directions are linked to the dominant eigenvectors of the dataset correlation matrix. The bounds and the performance of the standard Perceptron algorithm are ruled by the quantity $(R/\gamma_d)^2$. If $X$ is a matrix whose columns are the patterns contained in the sequence $S = \{y_1, y_2, \ldots, y_K\}$, then the correlation matrix $M$ is given by $M = XX^T$. We have changed notation from the vector dot product to matrix notation; the superscript $T$ designates the transpose of a vector or a matrix. The margin is defined by $\gamma_d = \min_k u^T y_k$, with $u$ the unit vector normal to the hyperplane which separates the data with margin $\gamma_d$. Since

$$K\gamma_d^2 \le \sum_{k=1}^{K} \left(u^T y_k\right)^2 \le \sum_{k=1}^{K} \|y_k\|^2 = \mathrm{Tr}(M),$$

the worst-case estimate of the time needed for convergence of the Perceptron algorithm is bounded from below by a quantity involving the trace of $M$. Moreover, points lying near the separating hyperplane and close to the surface of the minimum enclosing sphere influence the behaviour of the algorithm most. Consider an example that lies near the optimum separating hyperplane and whose length is close to $R$. If this point is misclassified by the current hypothesis weight vector, the resulting new hypothesis may move further from the optimum direction $u$ instead of approaching it. This happens because, although the Perceptron update rule forces the quantity $a_t \cdot u$ to increase at every step $t$, it does not guarantee the same for $u_t \cdot u$, which can in fact decrease. The implication is that the current hypothesis weight vector overshoots the feasible solution region, a situation in which the algorithm might be misled, slowing down convergence.
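The overshoot can be made concrete with a small numerical illustration. The sketch below assumes (the passage does not define it) that $u_t$ denotes the hypothesis $a_t$ normalised to unit length, so that $u_t \cdot u$ is the cosine of the angle between the hypothesis and the optimal direction. All vectors are made-up values chosen for the purpose.

```python
import numpy as np

u = np.array([1.0, 0.0])       # optimal unit direction (illustrative)
a_old = np.array([1.0, 1.0])   # current hypothesis weight vector
y = np.array([0.1, -3.0])      # long pattern lying close to the hyperplane u.y = 0

def cosine(a, b):
    """Cosine of the angle between a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# y is classified correctly by u (u.y = 0.1 > 0) but not by a_old (a_old.y = -2.9).
misclassified = bool(a_old @ y <= 0)

# Perceptron update; labels are absorbed into the patterns.
a_new = a_old + y

proj_old, proj_new = float(a_old @ u), float(a_new @ u)      # a_t . u grows
cos_old, cos_new = cosine(a_old, u), cosine(a_new, u)        # u_t . u shrinks
```

Here the projection on $u$ grows from 1.0 to 1.1, while the cosine drops from about 0.71 to about 0.48: the update has rotated the hypothesis away from $u$.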

As an example of how an appropriate transformation of the data can improve the time bound of an algorithm over that of the standard Perceptron, we mention the Whitened Perceptron algorithm. This variant differs from the standard one in that it needs the whole sequence of examples in advance in order to perform the following mapping of the data:

$$\{y_1, y_2, \ldots, y_K\} \to \left\{M^{-1/2}y_1,\; M^{-1/2}y_2,\; \ldots,\; M^{-1/2}y_K\right\},$$

where $M^{-1/2}$ is called the whitening transform. The existence of $M^{-1/2}$ is guaranteed whenever $M$ is positive definite, that is, whenever the patterns span the whole space. This mapping results in a correlation matrix which is the identity matrix $I_n$, with $n$ being the dimensionality of the space of the patterns. The new instances remain linearly separable after the transformation: they can be separated by a hyperplane with normal vector $z = M^{1/2}u$. This is easily verified by forming the product $z^T M^{-1/2} y_k = u^T y_k$, which is positive since the optimal direction $u$ classifies all $y_k$ correctly with margin at least $\gamma_d$. The margin that the transformed data possess with respect to $z$ is then at least $\gamma_d' = \gamma_d/\|z\| = \gamma_d/\|M^{1/2}u\|$. Therefore, in analogy to the standard Perceptron, an upper bound on the number of mistakes $t$ of the Whitened Perceptron is given by

$$t \le \frac{1}{\gamma_d^2}\, \max_k \left(y_k^T M^{-1} y_k\right) \left(u^T M u\right).$$

Observing this bound, we can say that it can be significantly smaller than $(R/\gamma_d)^2$ if either the patterns are strongly correlated, since $y_k^T M^{-1} y_k$ then becomes small, or $u$ is strongly correlated with a nondominant eigenvector of $M$.
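To make the comparison of the two bounds tangible, the following is a minimal NumPy sketch of the whitening transform on a toy data set. The helper names (`inv_sqrt_spd`, `perceptron_bound`, `whitened_bound`) and the data are illustrative assumptions, not part of [10]. The patterns are stretched along the second axis while the separating direction $u$ lies along the first, that is, along the nondominant eigenvector of $M$.

```python
import numpy as np

def inv_sqrt_spd(M):
    """Return M^{-1/2} for a symmetric positive definite matrix M."""
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

def perceptron_bound(X, u):
    """(R / gamma_d)^2 for patterns stored as the columns of X and unit normal u."""
    R2 = np.max(np.sum(X ** 2, axis=0))
    gamma_d = np.min(u @ X)
    return R2 / gamma_d ** 2

def whitened_bound(X, u):
    """max_k (y_k^T M^{-1} y_k) * (u^T M u) / gamma_d^2 with M = X X^T."""
    M = X @ X.T
    M_inv = np.linalg.inv(M)
    gamma_d = np.min(u @ X)
    quad = np.einsum("ik,ij,jk->k", X, M_inv, X)   # y_k^T M^{-1} y_k for every k
    return np.max(quad) * (u @ M @ u) / gamma_d ** 2

# Toy patterns (columns), labels already absorbed; u separates them with margin 1.
X = np.array([[1.0, 1.0, 1.0, 1.0],
              [10.0, -10.0, 5.0, -5.0]])
u = np.array([1.0, 0.0])

W = inv_sqrt_spd(X @ X.T)      # whitening transform M^{-1/2}
X_white = W @ X                # transformed patterns
z = np.linalg.inv(W) @ u       # z = M^{1/2} u separates the transformed data

standard = perceptron_bound(X, u)
whitened = whitened_bound(X, u)
```

For this data $M = \mathrm{diag}(4, 250)$, so the standard bound evaluates to 101 while the whitened bound evaluates to 2.6.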

The Second-order Perceptron algorithm can be considered an incremental variant of the Whitened one. The algorithm maintains an $n$-row matrix $X_{t-1}$ at time step $t-1$, which is initially empty. Each time a new instance $y_k$ is received by the algorithm, an augmented matrix $S_t = [X_{t-1}\;\, y_k]$ is built. Since the negatively labelled points $y_k$ are reflected in the augmented space with respect to the origin, correct classification of $y_k$ is designated by $a_t^T y_k > 0$, where $a_t = \left(\alpha I_n + S_t S_t^T\right)^{-1} v_{t-1}$. A multiple $\alpha$ of the identity matrix $I_n$ is added to the correlation matrix $S_t S_t^T$ so that the sum remains invertible even when $S_t S_t^T$ is singular. If there is a mismatch between the predicted and the true label of the pattern, then an update of $v_{t-1}$ similar in form to that of the standard Perceptron algorithm takes place, $v_t = v_{t-1} + y_k$ with $v_0 = 0$, and $X_t$ becomes $X_t = S_t$.

The parameter $\alpha$, which cannot be determined before running the algorithm, considerably affects its performance. In a way it captures the information about the existence of specific directions along which the data are scattered and about how well the solution vector is aligned with the normal to these directions. Notice that, apart from the current instance, $S_t$ contains only those patterns which are associated with mistaken past trials. This is also the main difference from the Forward algorithm, in which $S_t$ keeps track of all patterns seen so far by the algorithm.
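The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation of [10]: the function name, the cycling over the data until a mistake-free pass, the treatment of $a_t^T y_k = 0$ as a mistake, and the returned final vector $(\alpha I_n + X_t X_t^T)^{-1} v_t$ are all choices made here. Instead of storing $X_t$ itself, the sketch stores the correlation matrix $X_t X_t^T$, which is all the prediction needs.

```python
import numpy as np

def second_order_perceptron(Y, alpha=1.0, max_epochs=100):
    """Sketch of the Second-order Perceptron.

    Y holds one pattern per row; negatively labelled patterns are assumed
    to be already reflected through the origin, so a correct prediction
    means a_t . y > 0.
    """
    n = Y.shape[1]
    v = np.zeros(n)            # v_0 = 0
    C = np.zeros((n, n))       # X_{t-1} X_{t-1}^T, zero while X is empty
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for y in Y:
            C_aug = C + np.outer(y, y)     # S_t S_t^T with S_t = [X_{t-1} y]
            a = np.linalg.solve(alpha * np.eye(n) + C_aug, v)
            if a @ y <= 0:                 # mistake: update v and keep y
                v = v + y
                C = C_aug                  # X_t = S_t
                mistakes += 1
                clean_pass = False
        if clean_pass:                     # every pattern classified correctly
            break
    a_final = np.linalg.solve(alpha * np.eye(n) + C, v)
    return a_final, mistakes

# Toy separable data (rows are patterns), stretched along the second axis.
Y = np.array([[1.0, 10.0], [1.0, -10.0], [1.0, 5.0], [1.0, -5.0]])
a_final, mistakes = second_order_perceptron(Y, alpha=1.0)
```

On this toy data the sketch makes two mistakes (on the first two patterns) and ends with a vector along the first axis, which separates all four patterns.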

We now turn to some theoretical properties of the Second-order Perceptron algorithm. It has been proved that the number of mistakes made on a finite sequence of examples is bounded from above by

$$t \le \inf_{\gamma>0}\, \min_{\|u\|=1} \left[ \frac{D_\gamma(u;S)}{\gamma} + \frac{1}{\gamma} \sqrt{\left(\alpha + u^T X_t X_t^T u\right) \sum_{i=1}^{n} \ln\left(1 + \lambda_i/\alpha\right)} \right], \qquad (4.3)$$

where $\lambda_1, \lambda_2, \ldots, \lambda_n$ are the eigenvalues of $X_t X_t^T$ and $X_t$ consists only of the $t$ points that were wrongly classified during the trials. The infimum with respect to the free parameter $\gamma$ is taken in order to make the bound tighter. The quantity $D_\gamma(u;S) = \sum_k D_\gamma(u;y_k)$ is the sum of the hinge losses $D_\gamma(u;y_k) = \max\{0,\, \gamma - u^T y_k\}$ over every pattern in the sequence. In the linearly separable case this bound, which can accommodate repeated cycling through the patterns until convergence of the algorithm, reduces to

$$t \le \frac{1}{\gamma_d} \sqrt{\left(\alpha + u^T X_t X_t^T u\right) \sum_{i=1}^{n} \ln\left(1 + \lambda_i/\alpha\right)}. \qquad (4.4)$$

The term $\lambda_u = u^T X_t X_t^T u$ can be small, as in the case of the Whitened Perceptron, if $u$ is aligned with the eigenvector associated with $\min_i \lambda_i$. In the limit $\alpha \to \infty$ the bounds (4.3) and (4.4) give the known mistake bounds characterising the behaviour of the Perceptron in the inseparable and the separable case [22], respectively, bearing in mind that $\sum_i \lambda_i \le tR^2$.
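The right-hand side of (4.4) and its large-$\alpha$ limit can be checked numerically. The sketch below is illustrative only: the function name and the toy matrix of mistaken patterns are assumptions. As $\alpha \to \infty$, $(\alpha + \lambda_u)\sum_i \ln(1 + \lambda_i/\alpha) \to \sum_i \lambda_i \le tR^2$, so (4.4) becomes $t \le \sqrt{t}\,R/\gamma_d$, which is the familiar $t \le (R/\gamma_d)^2$.

```python
import numpy as np

def second_order_bound(X_t, u, gamma_d, alpha):
    """Right-hand side of (4.4).

    X_t stores the mistaken patterns as columns, u is a unit vector
    separating the data with margin gamma_d.
    """
    C = X_t @ X_t.T
    lam = np.clip(np.linalg.eigvalsh(C), 0.0, None)   # eigenvalues, guarded against round-off
    lam_u = u @ C @ u                                  # lambda_u = u^T X_t X_t^T u
    return np.sqrt((alpha + lam_u) * np.sum(np.log1p(lam / alpha))) / gamma_d

# Two mistaken patterns (columns) with squared length R^2 = 101; u has margin 1.
X_t = np.array([[1.0, 1.0],
                [10.0, -10.0]])
u = np.array([1.0, 0.0])

bound_small_alpha = second_order_bound(X_t, u, gamma_d=1.0, alpha=1.0)
bound_large_alpha = second_order_bound(X_t, u, gamma_d=1.0, alpha=1e8)
perceptron_limit = np.sqrt(np.trace(X_t @ X_t.T))     # sqrt(sum_i lambda_i) / gamma_d
```

With $\alpha = 1$ the bound is roughly 4.38, whereas for very large $\alpha$ it approaches $\sqrt{202} \approx 14.2$, the value obtained from $\sum_i \lambda_i = tR^2$ with $t = 2$.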

It remains to identify the values of $\alpha$ which make the Second-order Perceptron more advantageous than the standard one. By bringing the aforementioned bounds into a form that can be directly compared with the ones holding for the Perceptron, we may conclude that if $\lambda_u < r/2$ with $r = R^2 t/n$, then the choice $\alpha = r\lambda_u/(r - 2\lambda_u)$ favours the Second-order Perceptron. On the other hand, if $\lambda_u \ge r/2$ there is no finite value of $\alpha$ for which the bound of the Second-order Perceptron becomes better, and its performance approaches that of the standard Perceptron only for $\alpha \to \infty$.
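The rule for $\alpha$ can be written as a small helper. The sketch reads the condition as $\lambda_u < r/2$ with $r = R^2 t/n$, a reading adopted here because it is the one under which $\alpha = r\lambda_u/(r - 2\lambda_u)$ is positive and finite; both this reading and the function name are assumptions.

```python
import math

def preferred_alpha(lam_u, R, t, n):
    """Value of alpha suggested by the comparison of the bounds.

    Assumes r = R^2 * t / n. Returns math.inf when lam_u >= r / 2,
    i.e. when no finite alpha improves on the standard Perceptron bound.
    """
    r = R ** 2 * t / n
    if lam_u < r / 2:
        return r * lam_u / (r - 2 * lam_u)
    return math.inf

alpha_favourable = preferred_alpha(lam_u=2.0, R=10.0, t=2, n=2)     # r = 100, lam_u < 50
alpha_unfavourable = preferred_alpha(lam_u=50.0, R=10.0, t=2, n=2)  # lam_u = r / 2
```

In the first call $r = 100$ and $\alpha = 200/96 \approx 2.08$; in the second $\lambda_u = r/2$, so only the limit $\alpha \to \infty$ remains.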

In practice it is impossible to know $\lambda_u$, since the optimal direction $u$ is unknown, so we can proceed in one of two alternative ways. The first is to let $\alpha$ increase with every mistaken trial $t$ following the rule $\alpha_t = cR^2 t$, with $c > 0$ adjusted empirically. The linear dependence of $\alpha_t$ on $t$ is justified by the following observation. The value of $\lambda_u$ appearing in (4.3) and (4.4) is bounded from above and (for the linearly separable case) from below by terms linear in the number of mistakes. It can be shown that the slowest growth of the bounds is ensured with $\alpha$ growing linearly with $t$. The second way eliminates $\alpha$ altogether and replaces $\left(\alpha I_n + S_t S_t^T\right)^{-1}$ by the pseudoinverse $\left(S_t S_t^T\right)^{+}$, which exists in all cases.
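Both practical variants change only how the hypothesis $a_t$ is computed from $S_t$ and $v_{t-1}$. The following sketch shows the two computations side by side; the function names, the explicit `rcond` cut-off passed to `numpy.linalg.pinv`, and the toy values are assumptions made for illustration.

```python
import numpy as np

def hypothesis_growing_alpha(S, v, c, R, t):
    """a_t = (alpha_t I_n + S S^T)^{-1} v with alpha_t = c * R^2 * t."""
    n = S.shape[0]
    alpha_t = c * R ** 2 * t
    return np.linalg.solve(alpha_t * np.eye(n) + S @ S.T, v)

def hypothesis_pinv(S, v):
    """a_t = (S S^T)^+ v, using the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(S @ S.T, rcond=1e-10) @ v

# A single stored pattern y = (1, 2): S S^T has rank one and is singular.
S = np.array([[1.0],
              [2.0]])
v = np.array([1.0, 2.0])

a_growing = hypothesis_growing_alpha(S, v, c=1.0, R=np.sqrt(5.0), t=1)
a_pinv = hypothesis_pinv(S, v)
```

With a single pattern $y$ and $v = y$, the first variant gives $y/(\alpha_t + \|y\|^2) = (0.1, 0.2)$ and the second gives $y/\|y\|^2 = (0.2, 0.4)$, even though $S_t S_t^T$ itself cannot be inverted.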