Kernel Principal Component Ranking: Robust Ranking on Noisy Data


Evgeni Tsivtsivadze, Botond Cseke, and Tom Heskes

Institute for Computing and Information Sciences, Radboud University Nijmegen, Toernooiveld 1, 6525 ED Nijmegen, The Netherlands

firstname.lastname@science.ru.nl

Abstract. We propose the kernel principal component ranking algorithm (KPCRank) for learning preference relations. The algorithm can be considered as an extension of nonlinear principal component regression applicable to the preference learning task. It is particularly suitable for learning from noisy datasets where a lower dimensional data representation preserves the most expressive features. In many cases near-linear dependence of regressors (multicollinearity) can notably decrease the performance of a learning algorithm; KPCRank can effectively deal with this situation. This is accomplished by projecting the data onto p principal components in the feature space defined by a positive definite kernel and subsequently learning the ranking function. Despite the fact that the number of pairwise preferences is quadratic, the training time of KPCRank scales linearly with the number of data points in the training set and is equal to that of principal component regression. We compare the algorithm to several ranking and regression methods, including probabilistic regression on pairwise comparison data. Our experiments demonstrate that the performance of KPCRank is better than that of the baseline methods when learning to rank from data corrupted by noise.

1 Introduction

The task of learning preference relations (see e.g. [5]) has received significant attention in the machine learning literature. This paper primarily concerns the task of ranking, which is a special case of a preference learning task where a total order is associated with the set of data points under consideration. Both the preference learning and the ranking tasks can be formulated as problems where the aim is to learn a function capable of arranging data points according to a given preference relation. When comparing two data points, the function is able to evaluate whether the first point is preferred over the second one. To learn this function we propose the kernel principal component ranking (KPCRank) algorithm.

The algorithm can be considered as an extension of nonlinear principal component regression applicable to the preference learning task. Similar to kernel principal component regression (KPCR) [14], KPCRank is particularly suitable for learning from noisy datasets where a lower dimensional data representation preserves the most expressive features. By projecting the original data onto the components with higher eigenvalues we can discard the noise contained in the original data. We derive the algorithm and describe an efficient way of selecting the optimal number of principal components for the task in question. To demonstrate that the algorithm can learn to rank well from noisy data, we evaluate KPCRank on a dataset corrupted by noise.

We consider a label ranking problem setting in which we are given training data consisting of scored data points, that is, each input data point is assigned a real valued score indicating its goodness. The pairwise preferences between these data points are then determined by the differences of the scores. Learning these preferences can also be treated as an extension of pairwise classification [7], where the class label indicates the direction of the preference. For example, in [2,1] the pairwise preference approach is used together with Gaussian processes to learn preference relations. The pairwise approach is also used with support vector machines (SVMs) for learning to rank; for example, the RankSVM algorithm was proposed in [8] to rerank the results obtained from a search engine.

We evaluate our algorithm on a parse ranking task that is a common problem in natural language processing (see e.g. [3]). In this task, the aim is to rank a set of parses associated with a single sentence, based on some goodness criteria. We consider the parse ranking task as label ranking. However, in the parse ranking task the labels (i.e. the parses of a sentence) are instance-specific. That is, for each sentence we have a different set of labels, while in the conventional label ranking setting labels are not instance-specific [5]. As baseline methods in the conducted experiments we consider the regularized least-squares (RLS) [13], the RankRLS [11], and the kernel principal component regression (KPCR) [14] algorithms. Furthermore, we compare the KPCRank algorithm to probabilistic regression on pairwise preference data generated using the sinc function. The results show that the performance of the KPCRank algorithm is better than that of the baseline methods when learning to rank from data corrupted by noise.

2 Label Ranking

Let $X$ be a set of instances and $Y$ be a set of labels. In the label ranking setting [4,5], we would like to predict for any instance $x \in X$ a preference relation $P_x \subseteq Y \times Y$ among the set of labels $Y$. An element $(y, y') \in P_x$ means that the instance $x$ prefers the label $y$ to $y'$, also written as $y \succ_x y'$. We assume that the preference relation $P_x$ is transitive and asymmetric for each instance $x \in X$. As training information, we are given a finite set $\{(q_i, s_i)\}_{i=1}^{n}$ of $n$ data points, where each data point $(q_i, s_i) = ((x_i, y_i), s_i) \in Q \times \mathbb{R}$ consists of an instance-label tuple $q_i = (x_i, y_i) \in Q$, where $Q = X \times Y$, and a score $s_i \in \mathbb{R}$. We consider two data points $(q, s)$ and $(q', s')$ to be relevant iff $x = x'$. For relevant data points, instance $x$ prefers label $y$ to $y'$ if $s > s'$. If $s = s'$, the labels are called tied. Accordingly, we write $y \succ_x y'$ if $s > s'$ and $y \sim_x y'$ if $s = s'$.


To incorporate the relevance information we define an undirected preference graph determined by its adjacency matrix $W$ such that $W_{ij} = 1$ if $(q_i, q_j)$ are relevant and $W_{ij} = 0$ otherwise. Furthermore, let $Q = (q_1, \ldots, q_n)^T \in Q^n$ be the vector of instance-label training tuples and $s = (s_1, \ldots, s_n)^T \in \mathbb{R}^n$ the corresponding vector of scores. Given these definitions, our training set is the tuple $T = (Q, W, s)$. We also note that our definitions allow considering the bipartite ranking, ordinal regression, and label ranking settings at the same time.
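As a concrete illustration of this setup (not code from the paper), the sketch below builds the adjacency matrix $W$ of the preference graph from per-point instance identifiers; the function name and the toy data are assumptions made for this example.

```python
import numpy as np

def preference_adjacency(instance_ids):
    """Adjacency matrix W of the undirected preference graph:
    W[i, j] = 1 iff data points i and j share the same instance (are relevant)."""
    ids = np.asarray(instance_ids)
    W = (ids[:, None] == ids[None, :]).astype(float)
    np.fill_diagonal(W, 0.0)  # a data point is not compared with itself
    return W

# toy example: 5 instance-label pairs, the first three share one instance
instance_ids = [0, 0, 0, 1, 1]
scores = np.array([0.9, 0.4, 0.7, 0.2, 0.8])
W = preference_adjacency(instance_ids)
```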

We define a mapping from the input space $Q$ to some (higher dimensional) feature space $F$. In many cases, the number of training data points is much smaller than the number of dimensions $m$ of the feature space and we can write $F = \mathbb{R}^m$, where $m \gg n$. Further, let $\Phi : Q \to F$. The inner product

$$k(q, q') = \langle \Phi(q), \Phi(q') \rangle$$

of the mapped instance-label pairs is called a kernel function. We also denote the sequence of feature mapped inputs as

$$\Phi(Q) = (\Phi(q_1), \ldots, \Phi(q_n)) \in (F^n)^T$$

for all $Q \in (Q^n)^T$. We define the symmetric kernel matrix $K \in \mathbb{R}^{n \times n}$, where $\mathbb{R}^{n \times n}$ denotes the set of real matrices of dimension $n \times n$, as

$$K = \Phi(Q)^T \Phi(Q) = \begin{pmatrix} k(q_1, q_1) & \cdots & k(q_1, q_n) \\ \vdots & \ddots & \vdots \\ k(q_n, q_1) & \cdots & k(q_n, q_n) \end{pmatrix}.$$

Unless stated otherwise, we assume that the kernel matrix is strictly positive definite.
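For illustration only, the sketch below computes a kernel matrix with a Gaussian (RBF) kernel, one common choice of positive definite kernel; the paper itself uses a graph kernel for the parse ranking data (Section 6.1), so the kernel choice and the `width` parameter here are assumptions.

```python
import numpy as np

def rbf_kernel_matrix(X, Z=None, width=1.0):
    """K[i, j] = exp(-||x_i - z_j||^2 / (2 * width^2)) for rows of X and Z."""
    Z = X if Z is None else Z
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Z**2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * width**2))

X = np.random.randn(5, 3)   # 5 vectorized instance-label pairs
K = rbf_kernel_matrix(X)    # symmetric 5 x 5 kernel matrix
```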

3 Dimensionality Reduction

Before presenting the KPCRank algorithm we briefly describe the procedure for kernel principal component analysis (KPCA), following [15]. KPCA allows us to perform standard PCA in the higher dimensional space $F$ using the kernel function described in Section 2. After the initial mapping, the data falls onto some hyperplane in $F$ and the extracted principal components will map onto manifolds in the lower dimensional space. The KPCA algorithm boils down to diagonalization of the covariance matrix

$$C = \frac{1}{m}\sum_{i=1}^{n} \Phi(q_i)\Phi(q_i)^T = \frac{1}{m}\Phi(Q)\Phi(Q)^T,$$

where the $\Phi(q_i)$ are the centered mappings of the individual instance-label pairs. To find the first principal component we need to solve the eigenvalue equation

$$Cv = \lambda v. \qquad (1)$$

The key observation made in [15] states that all solutions $v$ with $\lambda \neq 0$ can be written as a linear combination of the $\Phi(q_i)$. Therefore, there are some $\alpha_i \in \mathbb{R}$ such that $v = \sum_{i=1}^{n}\alpha_i\Phi(q_i)$. Then Equation (1) can be written as

$$\frac{1}{m}\Phi(Q)\Phi(Q)^T v = \lambda v$$
$$\frac{1}{m}\Phi(Q)\Phi(Q)^T\Phi(Q)\alpha = \lambda\Phi(Q)\alpha$$
$$\frac{1}{m}\Phi(Q)^T\Phi(Q)\Phi(Q)^T\Phi(Q)\alpha = \lambda\Phi(Q)^T\Phi(Q)\alpha,$$

and hence, since $K$ is strictly positive definite,

$$\frac{1}{m}K\alpha = \lambda\alpha,$$

where $\alpha \in \mathbb{R}^n$. Thus, we have arrived at an equivalent eigenvalue problem that requires diagonalization of $K$ instead of $C$. Now we can compute the projection of the mapped instance-label pair $\Phi(q)$ onto the $l$-th eigenvector $v^l$:

$$\langle v^l, \Phi(q)\rangle = \frac{1}{\sqrt{m\lambda_l}}\sum_{i=1}^{n}\alpha_i^l\langle\Phi(q_i), \Phi(q)\rangle = \frac{1}{\sqrt{m\lambda_l}}\sum_{i=1}^{n}\alpha_i^l k(q_i, q). \qquad (2)$$

If we denote by $V \in \mathbb{R}^{m \times p}$ the matrix whose columns are the eigenvectors $\{v^l\}_{l=1}^{p}$ of the covariance matrix $C$, by $\Lambda \in \mathbb{R}^{p \times p}$ the diagonal matrix of the eigenvalues corresponding to the extracted eigenvectors of $K$, and by $\hat{V} \in \mathbb{R}^{n \times p}$ the matrix of extracted eigenvectors $\{\alpha^l\}_{l=1}^{p}$ of the kernel matrix $K$, then the projection of the instance-label pairs $Q$ onto the first $p$ eigenvectors can be written as

$$\Phi(Q)^T V = \Phi(Q)^T\Phi(Q)\hat{V}\Lambda^{-1/2} = K\hat{V}\Lambda^{-1/2}. \qquad (3)$$
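A minimal sketch of the projection in Equation (3), assuming the kernel matrix has already been centered (see Section 6.1) and is strictly positive definite; function and variable names are illustrative.

```python
import numpy as np

def kpca_projection(K, p):
    """Return the n x p matrix of projections K V_hat Lambda^{-1/2}
    onto the first p kernel principal components, together with the
    eigenvectors (alpha vectors) and eigenvalues of K that were used."""
    eigvals, eigvecs = np.linalg.eigh(K)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:p]    # indices of the p largest
    lam = eigvals[order]                     # diagonal of Lambda
    V_hat = eigvecs[:, order]                # columns are alpha^l
    Z = K @ V_hat / np.sqrt(lam)             # Equation (3), one row per data point
    return Z, V_hat, lam
```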

4 Kernel Principal Component Ranking Algorithm

A ranking function is a function $f : Q \to \mathbb{R}$ mapping the instance-label pair $q$ to a real value representing the relevance of the label $y$ with respect to the instance $x$. This induces for any instance $x \in X$ a transitive preference relation $P_{f,x} \subseteq Y \times Y$ with $(y, y') \in P_{f,x} \Leftrightarrow f(q) > f(q')$. Informally, the goal of our ranking task is to find a label ranking function $f : Q \to \mathbb{R}$ such that the ranking $P_{f,x} \subseteq Y \times Y$ induced by the function for any instance $x \in X$ is a good prediction for the true preference relation $P_x \subseteq Y \times Y$.

Let us define $\mathbb{R}^Q = \{f : Q \to \mathbb{R}\}$ and let $H \subseteq \mathbb{R}^Q$ be the hypothesis space of possible ranking functions. To measure how well a hypothesis $f \in H$ is able to predict the preference relations $P_x$ for all instances $x \in X$, we consider the following cost function that captures the number of incorrectly predicted pairs of relevant training data points:

$$d(f, T) = \frac{1}{2}\sum_{i,j=1}^{n} W_{ij}\left|\mathrm{sign}(s_i - s_j) - \mathrm{sign}(f(q_i) - f(q_j))\right|, \qquad (4)$$

where $\mathrm{sign}(\cdot)$ denotes the signum function

$$\mathrm{sign}(r) = \begin{cases} 1, & \text{if } r > 0 \\ -1, & \text{if } r < 0. \end{cases}$$

The use of cost functions like Equation (4) leads to intractable optimization problems; therefore, we consider the following least squares approximation, which regresses the differences $s_i - s_j$ with $f(q_i) - f(q_j)$ of the relevant training data points $q_i$ and $q_j$:

$$c(f, T) = \frac{1}{2}\sum_{i,j=1}^{n} W_{ij}\left((s_i - s_j) - (f(q_i) - f(q_j))\right)^2. \qquad (5)$$

In the above described setting we assume that every instance-label pair has an associated score. A straightforward extension of the proposed algorithm (similar to the one presented in [11]) makes it applicable to the situation when only pairwise preferences are available for training the ranker.
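For concreteness, the following sketch evaluates the disagreement error of Equation (4) and its least squares surrogate of Equation (5) for a given vector of predictions; it is an illustrative implementation under the reconstruction above, not code from the paper.

```python
import numpy as np

def disagreement_error(W, s, f):
    """Equation (4): 0.5 * sum_ij W_ij |sign(s_i - s_j) - sign(f_i - f_j)|."""
    ds = np.sign(s[:, None] - s[None, :])
    df = np.sign(f[:, None] - f[None, :])
    return 0.5 * np.sum(W * np.abs(ds - df))

def least_squares_surrogate(W, s, f):
    """Equation (5): 0.5 * sum_ij W_ij ((s_i - s_j) - (f_i - f_j))^2."""
    diff = (s[:, None] - s[None, :]) - (f[:, None] - f[None, :])
    return 0.5 * np.sum(W * diff**2)
```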

4.1 KPCRank

The next theorem characterizes the method we refer to as kernel principal component ranking (KPCRank).

Theorem 1. Let the training information provided to the algorithm be a tuple $T = (Q, W, s)$. Further, let the projection of a training instance-label pair onto the $l$-th principal component be $\langle v^l, \Phi(q)\rangle = \frac{1}{\sqrt{m\lambda_l}}\sum_{j=1}^{n}\alpha_j^l k(q_j, q)$. Considering the linear ranking model in the dimensionality reduced feature space, the prediction function can be written as $f(q) = \sum_{l=1}^{p} w_l \frac{1}{\sqrt{m\lambda_l}}\sum_{j=1}^{n}\alpha_j^l k(q_j, q)$. Then the coefficient vector $w$ for the algorithm

$$A(T) = \operatorname*{argmin}_{f \in H} J(f),$$

with objective function

$$J(f) = \frac{1}{2}\sum_{i,j=1}^{n} W_{ij}\left((s_i - s_j) - (f(q_i) - f(q_j))\right)^2,$$

minimizing the least squares approximation of the disagreement error in the dimensionality reduced feature space, is

$$w = \Lambda^{1/2}\left(\hat{V}^T K L K \hat{V}\right)^{-1}\hat{V}^T K L s, \qquad (6)$$

where $\Lambda$ and $\hat{V}$ contain the $p$ leading eigenvalues and eigenvectors of $K$ (Section 3) and $L$ is the Laplacian matrix of the preference graph determined by $W$ (defined in the proof below).


Proof. We use the fact that for any vector $r \in \mathbb{R}^n$ and an undirected weighted graph $W$ of $n$ vertices, we can write

$$\frac{1}{2}\sum_{i,j=1}^{n} W_{ij}(r_i - r_j)^2 = r^T D r - r^T W r = r^T L r,$$

where $D$ and $L$ are the degree matrix and the Laplacian matrix of the graph determined by $W$. Considering a linear ranking model in the feature space $F$, the prediction function can be written as $f(Q) = \Phi(Q)^T w$, where $w \in \mathbb{R}^m$. The objective function in matrix form can be written as

$$J(w) = (s - \Phi(Q)^T w)^T L (s - \Phi(Q)^T w). \qquad (7)$$

We also assume that our instance-label pairs are centered around zero mean. Therefore, kernel PCA can be performed on the covariance matrix $\Phi(Q)\Phi(Q)^T$ to extract the corresponding eigenvectors. Projecting the data onto the principal components we can rewrite Equation (7) as

$$J(w) = (s - \Phi(Q)^T V w)^T L (s - \Phi(Q)^T V w),$$

or equivalently

$$J(w) = (s - K\hat{V}\Lambda^{-1/2} w)^T L (s - K\hat{V}\Lambda^{-1/2} w),$$

where $w$ now denotes the coefficient vector in the reduced space, $w \in \mathbb{R}^p$. By taking the derivative of $J(w)$ with respect to $w$ we obtain

$$\frac{\partial}{\partial w} J(w) = -2\Lambda^{-1/2}\hat{V}^T K L s + 2\Lambda^{-1/2}\hat{V}^T K L K \hat{V}\Lambda^{-1/2} w.$$

We set the derivative to zero and solve with respect to $w$:

$$w = \Lambda^{1/2}\left(\hat{V}^T K L K \hat{V}\right)^{-1}\hat{V}^T K L s. \qquad (8)$$

Finally we obtain the predicted score of an unseen instance-label pair, based on the first $p$ principal components, as

$$f(q) = \sum_{l=1}^{p} w_l \frac{1}{\sqrt{m\lambda_l}}\sum_{j=1}^{n}\alpha_j^l k(q_j, q). \qquad (9)$$

In the next section we demonstrate how to further simplify Equation (8) and describe ways for efficient multiplication of the matrices involved in the expression. To extract the individual principal components we can use the power method described in [6], whose computational complexity scales as $O(n^2)$ per extracted principal component. The procedure must be repeated for each principal component used for the projection. The computational complexity of making a prediction on a single test data point scales as $O(pn^2)$, because the multiplication of the matrices $K L K$ can be accomplished efficiently thanks to the sparseness of $L$ (see Section 4.2), and the inversion involved in Equation (8) is less expensive than the multiplication when $n \gg p$. There are several other strategies that can reduce the complexity of the KPCRank algorithm; however, they are outside the scope of this paper.
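A compact sketch of the closed-form solution of Equation (8) and the prediction rule of Equation (9), written directly in matrix form so that a block of test points is scored with the test-train kernel matrix; it assumes centered kernel matrices and illustrative function names, and is not the authors' implementation.

```python
import numpy as np

def laplacian(W):
    """Graph Laplacian L = D - W of the preference graph."""
    return np.diag(W.sum(axis=1)) - W

def kpcrank_fit(K, W, s, p):
    """Equation (8): w = Lambda^{1/2} (V^T K L K V)^{-1} V^T K L s,
    using the p leading eigenpairs of the (centered) kernel matrix K."""
    eigvals, eigvecs = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1][:p]
    lam, V_hat = eigvals[order], eigvecs[:, order]
    L = laplacian(W)
    A = V_hat.T @ K @ L @ K @ V_hat            # p x p matrix
    b = V_hat.T @ K @ L @ s
    w = np.sqrt(lam) * np.linalg.solve(A, b)   # Lambda^{1/2} applied elementwise
    return w, V_hat, lam

def kpcrank_predict(K_test, V_hat, lam, w):
    """Equation (9) in matrix form: f = K_test V_hat Lambda^{-1/2} w,
    where K_test holds kernel values between test and training pairs."""
    return K_test @ V_hat @ (w / np.sqrt(lam))
```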


4.2 Efficient Selection of Principal Components

Selecting the optimal number of principal components for the KPCRank algorithm can notably improve its performance. There are different ways to efficiently extract the principal components (e.g. the power method), whose complexity scales as $O(n^2)$ for extracting a single component. Here we describe another method for selecting the optimal number of principal components for the learning task, based on a single eigendecomposition of the kernel matrix.

When considering Equation (8), we observe that by eigendecomposing the kernel matrix $K = \hat{V}\Lambda\hat{V}^T$ and sorting the eigenvalues and corresponding eigenvectors in decreasing order, we can further simplify the expression, that is,

$$w = \Lambda^{1/2}\left(\hat{V}^T K L K \hat{V}\right)^{-1}\hat{V}^T K L s = \Lambda^{1/2}\left(\Lambda\hat{V}^T L \hat{V}\Lambda\right)^{-1}\Lambda\hat{V}^T L s = \left(\Lambda^{1/2}\hat{V}^T L \hat{V}\Lambda^{1/2}\right)^{-1}\Lambda^{1/2}\hat{V}^T L s,$$

where $\Lambda^{1/2} \in \mathbb{R}^{n \times p}$ is the (rectangular) diagonal matrix containing the square roots of the eigenvalues of $K$ in decreasing order. The latter simplification is possible because $\hat{V}^T\hat{V} = I$ and because both $\Lambda^{-1/2}$ and $\Lambda$ are diagonal matrices obtained from the eigenvalues of the kernel matrix. The multiplication $\Lambda^{1/2}\hat{V}^T L s$ can be accomplished in $O(n^2)$ time since $s \in \mathbb{R}^n$. Furthermore, $\Lambda^{1/2}\hat{V}^T L \hat{V}\Lambda^{1/2}$ is a square matrix of dimension $p \times p$, and when $p$ is selected to be much smaller than $n$, the dominating term is the multiplication $\hat{V}^T L \hat{V}$, whose computational cost is $O(un^2)$, where $u$ is the number of instances. This reduction in complexity is achieved due to the sparseness of the Laplacian matrix [17]. Let $M = \Lambda^{1/2}\hat{V}^T L \hat{V}\Lambda^{1/2}$.

We can thus efficiently search for the optimal number of principal components in $O(n^2)$ time. Once we have performed the eigendecomposition of the matrix $M = \tilde{V}\tilde{\Lambda}\tilde{V}^T$ (here we assume an initial projection of the training data onto all principal components), the subsequent calculation of $M_p^{-1}$ based on $p$ principal components ($p < n$) is computationally inexpensive. It can be performed as follows. When the matrix $M$ is decomposed, $\tilde{\Lambda} = \mathrm{diag}(\tilde{\lambda}_1, \ldots, \tilde{\lambda}_n)$ is a diagonal matrix containing its eigenvalues. Set some of the eigenvalues to 0 and leave only $p$ non-zero ones, corresponding to the number of principal components onto which the data is projected. Then the inverse of the matrix $M_p$ can be calculated as

$$M_p^{-1} = \tilde{V}\,\mathrm{diag}(1/\tilde{\lambda}_1, \ldots, 1/\tilde{\lambda}_p, 0, \ldots)\,\tilde{V}^T.$$

Thus, by a single decomposition of the matrix $M$ and subsequent manipulation of the eigenvalues we can efficiently search for the optimal number of principal components.
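The following sketch mirrors the selection procedure described above: $M$ and the vector $\Lambda^{1/2}\hat{V}^T L s$ are formed once from the full projection, $M$ is decomposed once, and a coefficient vector for any candidate $p$ is then obtained by zeroing all but the $p$ largest eigenvalues of $M$. Function and variable names are illustrative assumptions.

```python
import numpy as np

def selection_precompute(K, L, s):
    """Form M = Lambda^{1/2} V^T L V Lambda^{1/2} and b = Lambda^{1/2} V^T L s
    once, using all components of K, and eigendecompose M once."""
    eigvals, eigvecs = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1]
    lam, V_hat = eigvals[order], eigvecs[:, order]
    B = V_hat * np.sqrt(np.maximum(lam, 0.0))   # V Lambda^{1/2}
    M = B.T @ L @ B
    b = B.T @ (L @ s)
    m_vals, m_vecs = np.linalg.eigh(M)          # reused for every candidate p
    return m_vals, m_vecs, b

def coefficients_for_p(m_vals, m_vecs, b, p):
    """w = M_p^{-1} b, with M_p^{-1} obtained by keeping only the p largest
    eigenvalues of M and zeroing the rest, as described in the text."""
    keep = np.argsort(m_vals)[::-1][:p]
    inv = np.zeros_like(m_vals)
    inv[keep] = 1.0 / m_vals[keep]
    M_p_inv = (m_vecs * inv) @ m_vecs.T
    return M_p_inv @ b
```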

5 Probabilistic Regression on Pairwise Preference Data

In this section we consider learning a ranking function based on pairwise comparison data, that is, data about the ranking function values is provided in terms of pairwise comparisons at given locations. Let $D = \{(i_l, j_l, w_l)\}_{l=1,\ldots,N}$ be a dataset, where $N$ is the number of pairwise comparisons and $w_l \in \{-1, 1\}$ specifies the direction of the preference. We aim to learn the ranking score function $f(x)$. Further, we assume that the observations about the comparisons are noisy and that $p(w_l | f)$ is a Bernoulli distribution, defined as

$$p(w_l | f) = \Phi\!\left(w_l\left(f(x_{i_l}) - f(x_{j_l})\right)\right),$$

where $\Phi$ is the normal cumulative density function; however, any sigmoid function would be a suitable choice. We have chosen the former because it makes some of the computations tractable. We model the function $f$ with a zero mean Gaussian process [9] having covariance function $k(\cdot, \cdot\,; \theta)$. Let the vector $f$ denote the values of the function $f$ at the input variables, that is, the random vector $f = (f(x_1), \ldots, f(x_n))$. Using Bayes' theorem the posterior probability of $f$ can be written as

$$p(f | D, \theta) = \frac{1}{p(D | \theta)}\prod_{l=1}^{N} p(w_l | f)\, p(f | \theta) = \frac{1}{p(D | \theta)}\prod_{l=1}^{N} \Phi\!\left(w_l\left(f(x_{i_l}) - f(x_{j_l})\right)\right) N(f | 0, K), \qquad (10)$$

where $K$ is the covariance matrix with elements $k(x_i, x_j; \theta)$. Since the posterior $p(f | D, \theta)$ is analytically intractable, we approximate its mean $m$ and covariance $C$ with the expectation propagation method [10]. The prediction for the ranking function value at a new evaluation point $x_*$ can be computed as

$$p(f(x_*) | D, \theta) \approx N\!\left(f(x_*) \,\middle|\, k_*^T K^{-1} m,\; k_{**} + k_*^T K^{-1}(C - K)K^{-1}k_*\right), \qquad (11)$$

where $k_{**} = k(x_*, x_*; \theta)$, $K = [k(x_i, x_j; \theta)]_{i,j}$, and $k_* = [k(x_*, x_j; \theta)]_j$. The prediction for the comparison of the ranking function values at two locations $x_1^*$ and $x_2^*$ follows analogously, where the $2 \times 2$ predictive covariance matrix $R$ is given by

$$R = \left[k(x_i^*, x_j^*; \theta)\right]_{i,j=1,2} + [k_1^*, k_2^*]^T K^{-1}(C - K)K^{-1}[k_1^*, k_2^*].$$

Since we can approximate the marginal likelihood $p(D | \theta)$ using expectation propagation, we carry out a maximum likelihood procedure on the hyper-parameters $\theta$. We use the squared exponential covariance function.

A probabilistic counterpart of the RankRLS algorithm would be regression with Gaussian noise and a Gaussian process prior, given the score differences $w_l = s_{i_l} - s_{j_l}$, that is, $p(w_l | f) = N\!\left(w_l \,\middle|\, f(x_{i_l}) - f(x_{j_l}), v\right)$, where $v$ is the noise variance. Then the posterior distribution of the random vector $f$ is

$$p(f | D, v, \theta) \propto \prod_{l=1}^{N} N\!\left(w_l \,\middle|\, f(x_{i_l}) - f(x_{j_l}), v\right) N(f | 0, K).$$

The posterior distribution $p(f | D, v, \theta)$ is Gaussian; its mean and covariance matrix can be computed by solving a system of linear equations and inverting a matrix, respectively. Here we choose to optimize the model parameters $v$ and $\theta$ by maximizing the marginal likelihood $p(D | v, \theta)$. Prediction can be done similarly to Equation (11), except that no approximations are needed. Note that the predictions obtained by the KPCRank algorithm correspond to the predicted mean values of the Gaussian process regression.
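For the Gaussian-noise counterpart just described, the posterior is available in closed form by standard linear-Gaussian conditioning. The sketch below computes the posterior mean of $f$ from difference observations; it is a minimal illustration (hyper-parameter optimization is omitted), not the authors' implementation, and the function name is an assumption.

```python
import numpy as np

def gp_difference_posterior_mean(K, pairs, w, noise_var):
    """Posterior mean of f ~ GP(0, K) given noisy difference observations
    w_l = f(x_{i_l}) - f(x_{j_l}) + eps_l, with eps_l ~ N(0, noise_var).
    `pairs` is a list of (i, j) index pairs, `w` the observed differences."""
    n, N = K.shape[0], len(pairs)
    D = np.zeros((N, n))                       # difference operator: w ~ D f
    for l, (i, j) in enumerate(pairs):
        D[l, i], D[l, j] = 1.0, -1.0
    S = D @ K @ D.T + noise_var * np.eye(N)    # covariance of the observations
    return K @ D.T @ np.linalg.solve(S, w)     # E[f | w] = K D^T S^{-1} w
```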

6 Experiments

6.1 Parse Ranking Dataset

We apply the KPCRank algorithm to rank the parses generated from the BioInfer corpus [12], which consists of 1100 manually annotated sentences.¹ The preprocessed data in the format of RankSVM [8] is available for download.² The parse ranking task is similar to the document ranking task frequently encountered in the information retrieval domain. Analogously to the document ranking task, where for each query we have an associated set of retrieved documents, in the parse ranking task for each sentence there is a set of parses that needs to be ranked according to some criteria. A more detailed description of the task is provided in [11]. The main motivation for applying machine learning techniques to the problem of parse ranking is that in many cases the set of built-in heuristics included in the parser software used to rank the parses does not perform satisfactorily and, hence, subsequent ranking or selection methods are needed.

¹ www.it.utu.fi/BioInfer
² www.cs.ru.nl/~evgeni

We obtain a scoring for an instance-label pair by comparing the parse to the hand-annotated correct parse of its sentence. The feature vectors for the instance-label pairs are generated using a graph kernel (see [11]). The relevant instance-label pairs for the ranking task are those associated with the same instance (see Section 2). All other pairs are considered to be irrelevant to the task of parse ranking.

Before conducting the experiments we ensure that the data is centered in the feature space. This can be accomplished with the following modifications of the training kernel matrix $K_{\mathrm{train}}$ and the test kernel matrix $K_{\mathrm{test}}$ [16]:

$$K_{\mathrm{train}} \leftarrow \left(I - \tfrac{1}{n}\mathbf{1}_n\mathbf{1}_n^T\right) K_{\mathrm{train}} \left(I - \tfrac{1}{n}\mathbf{1}_n\mathbf{1}_n^T\right)$$

and

$$K_{\mathrm{test}} \leftarrow \left(K_{\mathrm{test}} - \tfrac{1}{n}\mathbf{1}_{n_{\mathrm{test}}}\mathbf{1}_n^T K_{\mathrm{train}}\right)\left(I - \tfrac{1}{n}\mathbf{1}_n\mathbf{1}_n^T\right),$$

where $n$ is the number of training data points, $n_{\mathrm{test}}$ the number of test data points, and $\mathbf{1}_n$ and $\mathbf{1}_{n_{\mathrm{test}}}$ are column vectors of ones of dimensions $\mathbb{R}^n$ and $\mathbb{R}^{n_{\mathrm{test}}}$, respectively.
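A sketch of the two centering transformations above, where `K_train` is the $n \times n$ training kernel matrix and `K_test` the $n_{\mathrm{test}} \times n$ test-train kernel matrix; this follows the standard feature-space centering of [16] as reconstructed above.

```python
import numpy as np

def center_kernels(K_train, K_test):
    """Center the training and test kernel matrices in the feature space."""
    n = K_train.shape[0]
    n_test = K_test.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n                      # I - (1/n) 1 1^T
    K_train_c = C @ K_train @ C
    K_test_c = (K_test - np.ones((n_test, n)) @ K_train / n) @ C
    return K_train_c, K_test_c
```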

In order to select the optimal number of principal components and the regularization parameters for the baseline methods, we divide the complete dataset of 1100 annotated sentences, with a maximum of 3 parses associated with each sentence, into two parts. The first part, consisting of 500 instance-label pairs, is used for training, and the second part of 2698 instance-label pairs is reserved for final validation. The appropriate values of the regularization and kernel parameters are determined by grid search with 5-fold cross-validation on the parameter estimation data. The performed experiments can be subdivided into three parts: experiments on the dataset without noise, experiments on the dataset where the scores were intentionally corrupted by Gaussian noise with standard deviation σ = 0.5 and mean μ = 0.0, and finally with standard deviation σ = 1.0 and mean μ = 0.0. We note that the parse goodness scores present in the dataset are based on the F-score function and thus vary between 0 and 1. To avoid the influence of the random initialization of the Gaussian noise, we perform the complete experiment 3 times and average the obtained results. The algorithms are trained on the parameter estimation dataset with the best found parameter values and tested on the instance-label pairs reserved for final validation. The results of the validation are presented in Table 1.

When no noise is added to the labels of the dataset, the RLS and RankRLS algorithms outperform the KPCRank and KPCR algorithms. The unsatisfactory performance of the KPCRank algorithm can be explained by the fact that, by projecting the data onto a small number of principal components (in our case fewer than 100), features that are not among the most expressive, but that still contribute to the learning performance of the algorithm, are lost. Opposite to this, RLS and RankRLS are methods that do not reduce the dimensionality of the feature space, but use regularization to control the tradeoff between the cost on the training set and the complexity of the hypothesis learned in the complete feature space. However, once noise is added to the dataset, the KPCRank algorithm performs notably better than the other methods. This is due to the fact that RLS and RankRLS again use the complete feature space for learning, but this time the inclusion of less expressive noisy features (instance-label pairs) degrades the performance. The normalized version of the disagreement error, Equation (4), is used to measure the performance of the ranking algorithms. The error is calculated for each sentence separately and the performance is averaged over all sentences.

Table 1. Comparison of the parse ranking performances of the KPCRank, KPCR, RLS, and RankRLS algorithms, using a normalized version of the disagreement error, Equation (4), as the performance evaluation measure.

Method     Without noise   σ = 0.5   σ = 1.0
KPCR            0.40         0.46      0.47
KPCRank         0.37         0.41      0.42
RLS             0.34         0.43      0.46
RankRLS         0.35         0.45      0.47

6.2 Sinc Dataset

To test the performance of the method on pairwise preference data we have constructed a simple artificial dataset using the sinc function as follows. The sinc function

$$\mathrm{sinc}(x) = \frac{\sin(\pi x)}{\pi x},$$

where $x \in [-4, 4]$, is used to generate the values for creating the magnitudes of the pairwise preferences. We take 2000 equidistant points from the interval $[-4, 4]$ and sample 1000 for constructing the training pairs and 338 for constructing the test pairs. From these pairs we randomly sample 379 for training and 48 for testing. We consider both models proposed in Section 5, that is, (a) using only pairwise preferences and (b) using pairwise preferences with score differences. The magnitude of a pairwise preference is calculated as

$$w = \mathrm{sinc}(x) - \mathrm{sinc}(x').$$

We have compared the probabilistic pairwise regression described in Section 5 to the KPCRank algorithm. To evaluate the performance of the methods we use the normalized count of incorrectly predicted pairs of data points. The error obtained by using the probabilistic pairwise regression on the test dataset is 0.035. The KPCRank algorithm has an error of 0.025, indicating that both methods perform well when learning from pairwise comparison data.
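A sketch of how such a dataset can be generated; the pairing scheme, random seed, and helper names used here are illustrative assumptions rather than the exact procedure of the paper.

```python
import numpy as np

def sinc(x):
    # numpy's np.sinc already implements sin(pi x) / (pi x)
    return np.sinc(x)

def make_sinc_preference_data(n_points=2000, n_pairs=379, seed=0):
    """Sample pairs (x, x') from equidistant points on [-4, 4] and attach
    the preference magnitude w = sinc(x) - sinc(x') and its direction."""
    rng = np.random.default_rng(seed)
    xs = np.linspace(-4.0, 4.0, n_points)
    idx = rng.choice(n_points, size=(n_pairs, 2), replace=True)
    x, x_prime = xs[idx[:, 0]], xs[idx[:, 1]]
    w = sinc(x) - sinc(x_prime)       # score differences (magnitudes)
    direction = np.sign(w)            # pairwise preference labels in {-1, 1}
    return x, x_prime, w, direction
```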

[Figure 1 appears here; panel titles: "Posterior mean approximation" (left) and "GP approximation (MLII) and KPCRank" (right).]

Fig. 1. The sinc function and the approximate posterior mean of the random vector $f$ using the pairwise preference data and a full Bayesian procedure with a uniform prior on the log of the kernel width parameter (left). The approximation is done with expectation propagation [10]. We have randomly corrupted 5% of the pairwise preferences; this is why the difference between the sinc function and the approximate posterior mean $f$ (left) has no influence on the disagreement error as long as the "shape" is correct. The approximate posterior mean of the random vector $f$ using the score differences data and the maximum likelihood method on the hyper-parameters (right). Note that the model is invariant to translations of the score function.


7 Conclusions

In this paper we propose the KPCRank algorithm for learning preference relations. The KPCRank algorithm can be considered as an extension of nonlinear principal component regression for learning pairwise preferences. The algorithm belongs to the class of so-called shrinkage methods (similar to KPCR, RLS, etc.) that are designed to shrink the solution away from areas of low data spread and can result in an estimate that is biased but has lower variance. This is accomplished by projecting the training data onto principal components in the feature space, which in turn map onto an arbitrary manifold in the input space. Another advantage of the method is that by projecting the data onto the components with higher eigenvalues we aim at discarding the noise contained in the original data and obtain better performance compared to the baseline methods when the data is corrupted by noise.

Our experiments confirm that the proposed algorithm works well in situations where the ranking function has to be learned from noisy data. The KPCRank algorithm notably outperforms the RLS, RankRLS, and KPCR algorithms in the experiments where the data is intentionally corrupted by noise. Furthermore, we compare the KPCRank algorithm to probabilistic regression when learning on pairwise preference data. Our results indicate that both methods achieve good performance by learning a correct prediction function.

In the future we are planning to extend the algorithm to learning multiple labels simultaneously, to propose methods that further decrease its computational complexity, and to test it on various ranking datasets.

Acknowledgments

We acknowledge support from the Netherlands Organization for Scientific Research (NWO), in particular a Vici grant (639.023.604).

References

1. Adriana Birlutiu, Perry Groot, and Tom Heskes. Multi-task preference learning with Gaussian processes. In Proceedings of the 17th European Symposium on Artificial Neural Networks (ESANN), pages 123-128, 2009.

2. Wei Chu and Zoubin Ghahramani. Preference learning with Gaussian processes. In Luc De Raedt and Stefan Wrobel, editors, Proceedings of the Twenty-Second International Conference on Machine Learning (ICML 2005), volume 119 of ACM International Conference Proceeding Series, pages 137-144. ACM, 2005.

3. Michael Collins and Nigel Duffy. New ranking algorithms for parsing and tagging: kernels over discrete structures, and the voted perceptron. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 263-270, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

4. Ofer Dekel, Christopher D. Manning, and Yoram Singer. Log-linear models for label ranking. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 497-504, Cambridge, MA, 2004. MIT Press.

5. Johannes Fürnkranz and Eyke Hüllermeier. Preference learning. Künstliche Intelligenz, 19(1):60-61, 2005.

6. Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.

7. Eyke Hüllermeier, Johannes Fürnkranz, Weiwei Cheng, and Klaus Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16-17):1897-1916, 2008.

8. Thorsten Joachims. Optimizing search engines using clickthrough data. In David Hand, Daniel Keim, and Raymond Ng, editors, Proceedings of the 8th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '02), pages 133-142, New York, NY, USA, 2002. ACM Press.

9. David J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

10. Thomas P. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001.

11. Tapio Pahikkala, Evgeni Tsivtsivadze, Antti Airola, Jouni Järvinen, and Jorma Boberg. An efficient algorithm for learning to rank from preference graphs. Machine Learning, 75(1):129-165, 2009.

12. Sampo Pyysalo, Filip Ginter, Juho Heimonen, Jari Björne, Jorma Boberg, Jouni Järvinen, and Tapio Salakoski. BioInfer: A corpus for information extraction in the biomedical domain. BMC Bioinformatics, 8(50), 2007.

13. Ryan Rifkin, Gene Yeo, and Tomaso Poggio. Regularized least-squares classification. In J. A. K. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, editors, Advances in Learning Theory: Methods, Models and Applications, pages 131-154, Amsterdam, 2003. IOS Press.

14. Roman Rosipal and Leonard J. Trejo. Kernel partial least squares regression in reproducing kernel Hilbert space. Journal of Machine Learning Research, 2:97-123, 2002.

15. Bernhard Schölkopf, Alex J. Smola, and Klaus-Robert Müller. Kernel principal component analysis. In Wulfram Gerstner, Alain Germond, Martin Hasler, and Jean-Daniel Nicoud, editors, Artificial Neural Networks - ICANN '97, 7th International Conference, volume 1327 of Lecture Notes in Computer Science, pages 583-588. Springer, 1997.

16. John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.

17. Evgeni Tsivtsivadze, Tapio Pahikkala, Antti Airola, Jorma Boberg, and Tapio Salakoski. A sparse regularized least-squares preference learning algorithm. In Anders Holst, Per Kreuger, and Peter Funk, editors, 10th Scandinavian Conference on Artificial Intelligence (SCAI 2008), volume 173, pages 76-83. IOS Press, 2008.
