PDF hosted at the Radboud Repository of the Radboud University
Nijmegen
The following full text is a preprint version which may differ from the publisher's version.
For additional information about this publication click this link.
http://hdl.handle.net/2066/76137
Please be advised that this information was generated on 2017-12-06 and may be subject to
change.
K ern el P rin cip a l C o m p o n en t R anking:
R o b u st R a n k in g on N o isy D a ta
E vgeni T sivtsivadze, B o to n d Cseke, an d Tom Heskes
Institute for Computing and Information Sciences, Radboud University Nijmegen, Toernooiveld 1, 6525 ED Nijmegen, The Netherlands
firstnam e.lastnam e@ science.ru.nl
A b s tr a c t. We propose the kernel principal component ranking algo rithm (KPCRank) for learning preference relations. The algorithm can be considered as an extension of nonlinear principal component regres sion applicable to preference learning task. It is particularly suitable for learning from noisy datasets where a lower dimensional data representa tion preserves most expressive features. In many cases near-linear depen dence of regressors (multicollinearity) can notably decrease performance of the learning algorithm, however, KPCRank can effectively deal with this situation. It is accomplished by projecting the d ata onto p-principal components in the feature space defined by a positive definite kernel and consecutive learning of the ranking function. Despite the fact th at the number of the pairwise preferences is quadratic, the training time of K PCRank scales linearly with the number of d ata points in the training set and is equal to th a t of the principal component regression. We com pare the algorithm to several ranking and regression methods, including probabilistic regression on pairwise comparison data. O ur experiments dem onstrate th a t the performance of K PC R ank is better than th a t of the baseline methods, when learning to rank from the d ata corrupted by noise.
1 I n t r o d u c t i o n
T he ta sk of learning preference relatio n s (see e.g. [5]) has received significant a tte n tio n in m achine learning lite ra tu re . T h is p a p e r p rim arily concerns th e ta sk of ranking, w hich is a special case of a preference learning ta sk w hen a to ta l order is associated w ith th e set of d a ta p o in ts u n d er consideration. B o th th e preference learning and th e ran k in g task s can be form ulated as th e problem s w here th e aim is to learn a function capable of arran g in g d a ta p o in ts according to a given preference relation. W h en com paring tw o d a ta points, th e function is able to evaluate w h eth er th e first p o in t is p referred over th e second one. To learn th is function we propose th e kernel prin cip al com ponent ran k in g (K P C R an k ) algorithm .
T he alg o rith m can be considered as an extension of th e nonlinear principal com ponent regression applicable to th e preference learning task . S im ilar to th e kernel principal com ponent regression (K P C R ) [14], K P C R a n k is p a rtic u la rly
su itab le for learning from noisy d a ta se ts w here lower dim ensional d a ta represen ta tio n preserves m ost expressive features. B y p ro jec tin g th e original d a ta onto th e co m ponents w ith higher eigenvalues we can d iscard th e noise co n tain ed in th e original d a ta . We derive th e alg o rith m a n d describe an efficient w ay for selecting o p tim al nu m b er of th e principal com ponents su ita b le for th e ta s k in question. To d e m o n stra te th a t th e alg o rith m can learn to ra n k well from noisy d a ta , we evaluate K P C R a n k on th e d a ta s e t co rru p te d by noise.
We consider a label ran k in g problem settin g in w hich we are given a tra in ing d a ta consisting of th e scored d a ta points, th a t is, each in p u t d a ta p o in t is assigned a real valued score in d icatin g its goodness. T he pairw ise preferences betw een these d a ta p o in ts are th e n d eterm in ed by th e differences of th e scores. L earning these preferences can also be tre a te d as an extension of pairw ise classi fication [7] w here class label in d icates d irectio n of th e preference. For exam ple, in [2,1] th e pairw ise preference ap p ro ach is used to g e th e r w ith G aussian processes to learn preference relations. Pairw ise app ro ach is also used w ith s u p p o rt vector m achines (SVM ) for learning to ran k . For exam ple, th e R ankS V M alg o rith m was proposed in [8] to re ra n k th e resu lts o b ta in e d from a search engine.
We evaluate ou r alg o rith m on a parse ran k in g ta sk th a t is a com m on problem in n a tu ra l language processing (see e.g. [3]). In th is task , th e aim is to ra n k a set of parses associated w ith a single sentence, based on som e goodness criteria. We consider th e parse ran k in g ta sk as label ranking. However, in th e p arse ran k in g ta sk th e labels (i.e. th e parses of a sentence) are instance-specific. T h a t is, for each sentence, we have a different set of labels, w hile in th e conventional label ran k in g se ttin g labels are n o t in stan ce specific [5]. As baseline m eth o d s in th e co n d u cted exp erim en ts we consider th e regularized least-squares (RLS) [13], th e R ankR L S [11], an d th e kernel principal com ponent regression (K P C R ) [14] al gorithm s. F urthem ore, we com pare th e K P C R a n k alg o rith m to th e pro b ab ilistic regression on pairw ise preference d a ta th a t is g en erated using sinc function. T he resu lts show th a t th e perform ance of th e K P C R a n k alg o rith m is b e tte r th a n th a t of th e baseline m ethods, w hen learning to ra n k from th e d a ta co rru p te d by noise.
2 L a b e l R a n k i n g
L et X b e a set of in stan ces a n d Y be a set of labels. In th e label ran k in g settin g [4, 5], we would like to p red ict for any in stan ce x G X a preference re latio n P x Ç Y x Y am ong th e set of labels Y . A n elem ent (y, y ') G P x m eans th a t th e instance x prefers th e label y com pared to y', also w ritte n as y y x y ' . We assum e th a t th e preference re latio n P x is tra n sitiv e an d asym m etric for each in stan ce x G X . As a tra in in g inform ation, we are given a finite set { ( q i, Sj)}™= i of n d a ta points, w here each d a ta p o in t (qi , s i ) = ( ( x i , yi ), s i ) G Q x R consists of an instance-label tu p le q i = ( x i ,y i ) G Q, w here Q = X x Y an d th e score si G R . We consider two d a ta p o in ts (q, s) an d (q ', s') to be relevant, iff x = x '. For relevant d a ta points, in stan ce x prefers label y to y ', if s > s ' . If s = s', th e labels are called tied. Accordingly, we w rite y y x y ' if s > s' an d y ~ x y ' if s = s '. To be able to
in c o rp o rate th e relevance in fo rm atio n we define an u n d irecte d preference g raph w hich is d eterm in ed by its adjacency m a trix W such th a t W ij = 1 if (q i , q j) are relevant an d Wij = 0 otherw ise. F u rth erm o re, let Q = ( q 1, . . . , q n )* G Q ” be th e vector of instance-label tra in in g tu p les an d s = ( s 1, . . . , s n )t G R ” th e corresponding vector of scores. G iven these definitions, our tra in in g set is th e tu p le T = (Q, W, s). We also n o te th a t our definitions allow considering b ip a rtite ranking, o rd in al regression, and label ran k in g se ttin g a t th e sam e tim e.
We define m ap p in g from th e in p u t space Q to some (higher dim ensional) feature space F . In m an y cases, th e n u m b er of tra in in g d a ta p o in ts is m uch sm aller th a n th e nu m b er of dim ensions m of th e feature space a n d we can w rite F = R m , w here m ^ n. F u rth e r, let
^ : Q ^ F . T he inner p ro d u c t
k(q , q ') = (<£(q),<£(q'))
of th e m a p p ed in stan ce-lab el pairs is called a kernel function. We also denote th e sequence of featu re m a p p ed in p u ts as
* (Q ) = ( * ( q i ) , . . . , * ( q ” )) G ( F ” )*
for all Q G (Q ” )*. We define th e sym m etric kernel m a trix K G R ” x” , w here R ” x” d enotes th e set of real m atrices of dim ension n x n, as
( k (q i, q i) ••• k ( q i, q ri) '
K = 3>(Q)‘3>(Q) = . ... . \ k ( q ” , q i) ••• k (q ” , q j ,
U nless s ta te d otherw ise, we assum e th a t th e kernel m a trix is stric tly positive definite.
3 D i m e n s i o n a l i t y R e d u c t i o n
Before presen tin g th e K P C R a n k alg o rith m we briefly describe th e pro ced u re for kernel principal com ponent analysis (K P C A ), following [15]. K P C A allows us to perform s ta n d a rd P C A in th e higher dim ensional space F using th e kernel function described in Section 2. A fter th e in itial m apping, th e d a ta falls onto some h y p erp lan e in F and th e e x tra c te d prin cip al com ponents will m ap onto m anifolds in th e lower dim ensional space. T he K P C A alg o rith m boils dow n to d iag onalization of th e covariance m a trix
1 ” 1
c = - Em ^ ( q i ) ^ ( q i ) ‘ = - ^ ( Q ) ^ ( Q ) 4, m i= i
w here th e ^ ( q i ) are th e centered m appings of th e individual instance-label pairs. To find th e first p rin cip al com ponent we need to solve th e eigenvalue eq u atio n
T he key observation m ade in [15] sta te s th a t all solutions v w ith A = 0 can be w ritte n as a linear com b in atio n of ^ ( q i ). T herefore, th e re are some ai G R such th a t v = ^ ”= i a i ^ ( q i ). T h e n E q u a tio n (1) can be w ritte n as
- $ ( Q ) ^ ( Q ) ‘v = Av m - $ ( Q ) $ ( Q ) ‘^ ( Q ) a = A ^ (Q )a m - ^ ( Q ) ^ ( Q ) ^ ( Q ) ^ ( Q ) a = A ^ ( Q ) ^ ( Q ) a m - K a = Aa, m
w here a G R ” . T hus, we have arrived to th e equivalent eigenvalue problem th a t requires d iag onalization of K in stead of C . Now we can com p u te th e p ro jectio n of th e m a p p ed in stan ce-lab el p a ir ^ ( q ) onto l — t h eigenvector v 1
” ”
(v ' , ^ (q )) = E a i ( ^ (q i)^ (q )) = E a i k (q ^ q ) . (2) V mA; ^ v mA1 “ i
If we denote by V G R mxp th e m a trix consisting of th e colum ns of th e eigenvec to rs { v i }P= i of th e covariance m a trix C , by A G R pxp th e diagonal m a trix of th e eigenvalues corresponding to th e e x tra c te d eigenvectors of K , an d by V G R ” xp th e m a trix of e x tra c te d eigenvectors { a i }^= i of th e kernel m a trix K , th e n th e p ro je ctio n of th e instance-label p airs Q on to first p eigenvectors can be w ritte n as
# ( Q ) ‘V = 0 ( Q ) ^ ( Q ) V A - 2 = k V A- 1 . (3)
4 K e r n e l P r i n c i p a l C o m p o n e n t R a n k i n g A l g o r i t h m
A ran k in g function is a function f : Q ^ R m ap p in g th e in stan ce-lab el pair q to a real value rep resen tin g th e relevance of th e label y w ith resp ect to th e in stan ce x . T h is induces for an y in stan ce x G X a tra n sitiv e preference relatio n P f ,x Ç Y x Y w ith (y, y ') G P f ,x ^ f (q) > f (q ' ). Inform ally, th e goal of our ran k in g ta s k is to find a label ran k in g function f : Q ^ R such th a t th e ran k in g P f ,x Ç Y x Y induced by th e function for any in stan ce x G X is a good pred ictio n for th e tru e preference rela tio n P x Ç Y x Y .
L et us define R Q = { f : Q ^ R } and let H Ç R Q be th e h ypothesis space of possible ran k in g functions. To m easure how well a h ypothesis f G H is able to p red ict th e preference relatio n s P x for all in stan ces x G X , we consider th e following cost function th a t ca p tu re s th e am o u n t of in co rrectly p red icte d pairs of th e relevant tra in in g d a ta points:
1 ”
d (f , T ) = 2 ^ 2 W ij s ig n (s i — s j) — s i g n f (qi) — f ( q j )) , (4)
w here sign(-) d enotes th e signum function
. , , 1, if r > 0 s ign ( r ) = \ . n .
I —1, if r < 0
T he use of cost functions like E q u a tio n (4) leads to in tra c ta b le o p tim izatio n problem s, therefore, we consider th e following least squares appro x im atio n , which regresses th e differences si — sj w ith f (q i ) — f ( q j ) of th e relevant tra in in g d a ta p o in ts q i a n d q j :
1 ” 2
c(f T
) = 2E
Wj(
(si—
sj )—
(f
(q i )—
f
(q j.
(5) 1i,j= i
In th e above described se ttin g we assum e th a t every instance-label p a ir has an associated score. A straig h tfo rw ard extension of th e proposed alg o rith m (sim ilar to th e one p resen ted in [11]) m akes it applicable to th e s itu a tio n w hen only pairw ise preferences are available for th e tra in in g of th e ranker.
4 .1 K P C R a n k
T he n e x t th eo rem ch aracterizes a m e th o d we refer to as kernel p rin cip al com po n en t ran k in g (K P C R an k ).
T h e o r e m 1 L e t the tra in in g in fo r m a tio n p ro v id ed to the a lg o rith m be a tuple
T = (Q, W, s ). F u rth e r, let th e p r o je c tio n o f th e tra in in g in sta n ce -la b e l p a ir o nto th e l — t h p r in c ip a l c o m p o n e n t be (v 1, ^ ( q ) ) = ÿ m x 2 ”= i a j k ( q j , q). C o n s id e r ing th e lin e a r ra n kin g m o d el in d im e n s io n a lity reduced fe a tu r e space, the p re d ic tio n fu n c tio n can be w r itte n as f (q) = 5^f= i y m x w ( 2 ”= i a j k ( q j , q). T hen the
co e fficie n t v e c to r w f o r th e a lg o rith m
A ( T ) = argm in J ( f ), ƒ ew w ith objective fu n c tio n
1 ”
J ( f ) = 2 E Wij ((s i — sj ) — ( f (q i ) — f (qj ))2 i,j= i
m in im iz in g the least squares a p p ro x im a tio n o f th e d isa g re e m e n t erro r in th e d i m e n s io n a lity reduced fe a tu re space is
w = A 1 (1/ ‘K L K F ) -1 V * K L s , (6)
Proof. We use th e fact th a t for any vector r G R ” an d an u n d irected w eighted g rap h W of n vertices, we can w rite
1 ”
— Wij ( r i — r j ) 2 = r * D r — r * W r = r* L r, i,j= i
w here D a n d L are th e degree m a trix a n d th e L aplacian m a trix of th e g raph d eterm in ed by W . C onsidering a linear ran k in g m odel in th e feature space F th e p red ictio n function can be w ritte n as f (Q) = ^ ( Q ) w , w here w G R ” . T he objective function in m a trix form can be w ritte n as
J (w ) = (s — ^ ( Q ) tw ) t L ( s — <?(Q)‘w ). (7) We also assum e th a t our in stan ce-lab el pairs are cen tered aro u n d zero m ean. Therefore, kernel P C A can be perform ed on th e covariance m a trix ^ (Q )^ (Q )* to e x tra c t corresponding eigenvectors. P ro je c tin g th e d a ta on to th e principal com ponents we can rew rite E q u a tio n (7) as
J (w ) = (s — ^ (Q )* V w )* L (s — ^ (Q )* V w ) or equivalently
J (w ) = (s — K V A - 1 w )* L (s — K V A - 2 w ). B y ta k in g th e derivative of J ( w ) w ith resp ect to w we o b tain
— J (w ) = — 2A- 2 V * K L s + 2A- 2 V * K L K Î A - 2 tw. dw
We set th e derivative to zero an d solve w ith resp ect to wÎ
tw = A 1 (V ‘K L K Î ) -1 V * K L s. (8) F in ally we o b ta in th e p red icted score of th e unseen in stan ce-lab el p a ir based on th e first p p rin cip al co m ponents by
p , ”
f (q ) ^ E w ^ a j k (q j , q ) . (9)
□
In th e n e x t section we d em o n stra te how to even fu rth e r sim plify E q u a tio n (8) and describe ways for efficient m u ltip licatio n of th e m atrices involved in th e expression. To e x tra c t th e ind iv id u al principle com ponents we can use power m e th o d described in [6] whose co m p u ta tio n a l com plexity scales to O ( n 2) for e x tra c tio n of individual prin cip al com ponent. T he p rocedure m u st be re p eated for th e prin cip al com ponents used for th e p rojection. T h e co m p u ta tio n a l com p lexity for m aking th e p red ictio n on a single te s t d a ta p o in t scales as O (p n 2) due to th e fact th a t m u ltip licatio n of m atrices K L K can b e accom plished effi ciently because of th e sparseness of L (see Section 4.2) an d inversion involved in th e E q u a tio n (8) is less expensive th a n m u ltip licatio n in case n ^ p. T here are several o th e r strateg ie s th a t can reduce com plexity of th e K P C R a n k algorithm , however, th e y are o utside of th e scope of th is p ap er.
4 .2 E f f ic ie n t S e le c tio n o f P r i n c i p a l C o m p o n e n t s
Selecting th e o p tim al nu m b er of p rin cip al co m ponents for th e K P C R a n k algo rith m can n o ta b ly im prove its perform ance. T h ere are different ways for efficient e x tra c tio n of th e prin cip al com ponents (e.g. pow er m eth o d ) whose com plexity scales to O ( n 2) for e x tra c tin g a single com ponent. H ere we describe an o th er m e th o d for selecting th e o p tim a l nu m b er of principal co m ponents for th e learn ing ta s k based on a single tim e perform ed eigendecom position of th e kernel m a trix .
W h en considering E q u a tio n (8) we can observe th a t by eigendecom position of th e kernel m a trix K = VA V * an d by so rtin g eigenvalues an d corresponding eigenvectors in decreasing o rd er we can fu rth e r sim plify th e expression, th a t is
w = A 2 (V ‘K L K / ) - ^ * K L s
= (A- 1V * V AV * l V AV * V A - 1 ) - 1 A - 1V * V AV * L s = (A 1 *V * lV A 1 )- 1 A 1 ‘ÿ ‘L s,
w here A 2 g R ” xp is th e diagonal m a trix co ntaining eigenvalues of K in decreas ing order. T he la tte r sim plification is possible due to th e fact th a t V*V = I px” as well as th a t b o th A - 2 an d A are diagonal m atrices co n tain in g m an ip u latio n s on eigenvalues of th e kernel m atrix . T h e m u ltip licatio n of th e A2 *V*Ls can be
accom plished in O ( n 2) tim e since s G R ” . F u rth erm o re A 1 *V*LVA1 is a square m a trix of dim ensions p x p, an d in case p is selected to be m uch sm aller th a n n, th e d o m in a tin g te rm is m u ltip licatio n of V4LV , whose co m p u ta tio n cost is O ( u n 2), w here u is th e n u m b er of instances. T h is red u ctio n in com plexity is achieved due to th e sparseness of th e L aplacian m a trix [17]. L et M = A2‘V ^L V A2. We
can efficiently search for th e o p tim al n u m b er of p rin cip al co m ponents in tim e O ( n 2). O nce we have perform ed eigendecom position of th e m a trix M = V AV * (here we assum e in itial p ro jectio n of th e tra in in g d a ta on to all p rin cip al com po n en ts), th e subsequent calcu latio n of M p-1 based on th e p principal com ponents (p < n) is co m p u ta tio n a lly inexpensive. It can be perform ed as follows. W hen m a trix M is decom posed, A = diag(A 1, . . . , A” ) is a diagonal m a trix co n tain ing th e eigenvalues. Set to 0 som e of th e eigenvalues an d leave only p non-zero ones, corresponding to th e n u m b er of prin cip al com ponents on to w hich d a ta is p ro jected . T h en inverse of m a trix M p can be calcu lated as
M p-1 = Vd ia g (1 /A 1, . . . , 1/Ap, 0 , . . .)V *.
T hus, by a single decom position of th e m a trix M an d subsequent m an ip u latio n on eigenvalues we can efficiently search for th e o p tim al n u m b er of principal com ponents.
5 P r o b a b i l i s t i c R e g r e s s i o n o n P a i r w i s e P r e f e r e n c e D a t a In th is section we consider learning a ran k in g function based on pairw ise com parison d a ta , th a t is, d a ta a b o u t th e ran k in g function values is provided in term s
of pairw ise com parisons a t th e given locations. L et D = { (i; , j ; , w; )};=1.N be a d a ta se t, w here N is th e n u m b er of pairw ise com parisons, and w; G {—1,1} spec ifies th e d irectio n of th e preference. We aim to learn th e ran k in g score function f (x ). F u rth e r, we assum e th a t th e observations a b o u t th e com parisons are noisy and p ( w |f ) is a B ernoulli d istrib u tio n , defined as
w here ^ is th e norm al cum ulative d en sity function, however, any sigm oid function can be su itab le choice. We have chosen th e form er because it m akes some of th e c o m p u tatio n s tra c ta b le . We m odel th e function f w ith a zero m ean G aussian process [9] having covariance function k (•, •; 0). L et th e vector f d en o te th e values of th e function f in th e in p u t variables, th a t is, th e ran d o m vector f = ( f ( x i ) , . . . , f (X” )). U sing B ayes’ th eo rem th e G aussian p o sterio r p ro b a b ility of f can be w ritte n as
w here K is th e covariance m a trix w ith elem ents k ( x i , Xj ; 0). Since th e p o ste rior p ( f |D , 0) is an aly tically in tra c ta b le we will ap p ro x im ate its m ean m and covariance C w ith th e e x p ectatio n p ro p ag atio n m e th o d [10]. P red ic tio n for th e ran k in g function values a t a new evaluation p o in t x* can be co m p u te d as
w here k* = k (x * , x*; 0), K = [k ( x i , Xj ; 0)]i j an d k* = [k (x*, Xj ; 0)]i . P red ic tio n for th e com parisons of th e ran k in g function values for locations x i an d x 2
Since we can ap p ro x im ate th e m arginal likelihood p (D |0 ) using ex p ectatio n p ro p ag atio n , we c a rry o u t m axim um likelihood pro ced u re on th e h y p e r-p aram eters We use th e square expo n en tial covariance function.
P ro b ab ilistic c o u n te rp a rt of th e R ankR L S alg o rith m would be regression w ith G aussian noise an d G au ssian processes prior, given th e score differences wij = Si — Sj, th a t is, p (w; | f ) = ^ (w; ( f ( x i ) — f (x j i )) ) , p ( f | D 0) = ( D 0)
n
p (w; | f ) p ( f |0) p ( D |0 H ; (10) = p ( D i0 )n
^ (w; ( f ( x i i ) —f (x j i ))) n ( f |0, k ) , N ( ƒ (x * )|fc* K - 1 m , k* + fc*K-1 (C — K ) K - 1 fc*) , is (11) w here th e 2 x 2 m a trix R is given byt
R = [k(x**, x**, 0 ^ i j= 1 2 + [fc*, fe2] K -1 (C — K ) K -1 [fc*, fe2] .
T h en th e p o sterio r d istrib u tio n of th e ran d o m vector f is
T he p o sterio r d istrib u tio n p ( f |w ,v , 0) is G aussian, its m ean an d covariance m a trix can be co m p u ted by solving a system of linear equations and inverting a m atrix , respectively. Here we choose to optim ize th e m odel p a ra m e te rs v an d 0 by m axim izing th e m arg in al likelihood p (D |v , 0 ). P red ictio n can be done sim ilar to E q u a tio n (11), except th a t no ap p ro x im atio n s are needed. N ote th a t predictions o b ta in ed by th e K P C R a n k alg o rith m correspond to th e p red icted m ean values of th e G au ssian process regression.
6 E x p e r i m e n t s
6 .1 P a r s e R a n k i n g D a t a s e t
We ap p ly th e K P C R a n k alg o rith m to ra n k th e parses g en erated from th e B ioln- fer corpus [12] w hich consists of 1100 m an u ally a n n o ta te d sen ten ces.1 T he p re processed d a ta in form at of R ankSV M [8] is available for dow nload.2 T he parse ran k in g is sim ilar to th e d ocum ent ran k in g ta sk frequently e ncountered in th e in fo rm atio n retrieval dom ain. A nalogously to th e d o cu m en t ran k in g task , w here for each q u ery we have an associated set of retriev ed docum ents, in th e parse ran k in g ta s k for each sentence th ere is a set of parses th a t needs to be ranked according to som e criteria. A m ore d etailed d escrip tio n of th e ta sk is provided in [11]. T he m ain m o tiv atio n for applying m achine learning techniques to th e p ro b lem of parse ran k in g is th a t in m an y cases a set of b u ilt-in heuristics included in th e p arser softw are th a t is used to ra n k parses is perform ing n o t sa tisfacto ry and, hence, subsequent ran k in g or selection m eth o d s are needed.
We o b ta in a scoring for an in stan ce-lab el p air by com paring th e parse to th e h a n d a n n o ta te d correct parse of its sentence. T he feature vectors for instance- label p air are g en erated using g rap h kernel (see [11]). T he relevant instance-label pairs for th e ran k in g ta sk are those associated w ith th e sam e in stan ce (see Section 2). All th e o th e r pairs are considered to b e irrelevant to th e ta sk of p arse ranking.
Before co n d u ctin g th e experim ents we ensure th a t th e d a ta is centered in th e feature space. T h is can be accom plished w ith th e following m odifications to th e tra in in g kernel m a trix K train an d th e te s t kernel m a trix K test [16]
K train — (1 1 n and K test (K test 1 n 1 w w w .it.u tu .f i/B io I n fe r w w w .cs.ru.nl/~ evgeni
w here n is n u m b er of tra in in g d a ta , n test n u m b er of te s t d a ta , 1” an d 1” test are th e colum n vector co ntaining ones of th e dim ensions R ” an d R ”test respectively.
In order to select th e o p tim al n u m b er of p rin cip al co m ponents an d regu lariza tio n p a ra m e te rs for th e baseline m eth o d s, we divide com plete d a ta s e t of 1100 a n n o ta te d sentences w ith th e m axim um of 3 parses associated w ith each sentence in to tw o p a rts. T h e first p a rt consisting of 500 in stan ce-lab el p airs is used for tra in in g an d second p a rt of 2698 in stance-label p airs is reserved for final validation. T h e a p p ro p ria te values of th e reg u larizatio n an d th e kernel p ara m e ters are d eterm in ed by grid search w ith 5-fold cross-validation on th e p a ra m e te r e stim atio n d a ta . T h e perform ed experim ents can be su b d iv id ed in to th re e p arts: th e experim ents on th e d a ta s e t w ith o u t noise, exp erim en ts on th e d a ta s e t w here scores were in te n tio n a lly co rru p te d by G aussian noise w ith s ta n d a rd d eviation a = 0.5 an d m ean ^ = 0.0, an d finally w ith s ta n d a rd d ev iatio n a = 1.0 and m ean ^ = 0.0. We n o te th a t p arse goodness scores p resent in th e d a ta s e t are based on th e F -score function and, th u s, vary betw een 0 an d 1. To avoid influ ence of th e ran d o m in itializa tio n of th e G au ssian noise, we perfo rm th e com plete ex perim ent 3 tim es an d average th e o b ta in e d results. T he algorithm s are tra in e d on th e p a ra m e te r e stim a tio n d a ta set w ith th e b e st found p a ra m e te r values an d te ste d on instance-label pa irs reserved for th e final validation. T he resu lts of th e v alidation are p resen ted in Table 1.
W h en no noise is add ed to th e labels of th e d a ta se t, th e RLS an d R ankR L S algorithm s o u tp erfo rm th e K P C R a n k and K P C R algorithm s. U n satisfa cto ry p e r form ance of th e K P C R a n k alg o rith m can be explained by th e fact th a t by pro je ctin g d a ta on to a sm all n u m b er of prin cip al com ponents (in our case less th a n
100) features th a t are n o t m ost expressive, b u t th a t still co n trib u te to th e learn ing perform ance of th e alg o rith m are lost. O pposite to this, RLS an d R ankR L S are th e m e th o d s th a t do n o t reduce d im ensionality of th e featu re space, b u t use reg u larizatio n controlling th e trad eo ff betw een th e cost on th e tra in in g set and th e com plexity of th e hyp o th esis learned in com plete feature space. However, once th e noise is ad d ed to th e d a ta s e t K P C R a n k alg o rith m p erform s n o ta b ly b e tte r th a n th e o th e r m ethods. T his is due to th e fact th a t RLS an d R ankR L S again use com plete feature space for learning, b u t th is tim e inclusion of less expressive noisy features (instance-label pairs) degrades th e perform ance. T he norm alized version of th e disagreem ent erro r E q u a tio n (4) is used to m easure th e perform ance of th e ran k in g algorithm s. T he erro r is c alcu lated for each sentence se p arately a n d th e perform ance is averaged over all sentences.
6 .2 S in c D a t a s e t
To te s t perform ance of th e m e th o d on pairw ise preference d a ta we have con s tru c te d sim ple artificial d a ta s e t using sin c function as follows. A sin c function
s in (n x ) sin c (x ) = --- ,
n x
w here x G [ - 4 , 4] is used to g en erate th e values used for creatin g m ag n itu d es of pairw ise preferences. We get 2000 e q u id ista n t p o in ts from th e interval [- 4 , 4],
T a b le 1. Comparison of the parse ranking performances of the KPCRank, KPCR, RLS, and RankRLS algorithms using a normalized version of the disagreement error Equation (4) as performance evaluation measure.
Method W ithout noiseb 0..5 a = 1.0 K PCR 0.40 0.46 0.47 KPCRank 0.37 0.41 0.42 RLS 0.34 0.43 0.46 RankRLS 0.35 0.45 0.47
and sam ple 1000 for c o n stru ctin g th e tra in in g pairs an d 338 for c o n stru ctin g th e te s t pairs. From these pairs we ran d o m ly sam ple 379 used for th e tra in in g and 48 for th e testin g . We consider b o th m odels proposed in Section 5, th a t is, (a) using only pairw ise preferences and (b) using pairw ise preferences w ith score differences. T he m ag n itu d e of pairw ise preference is calcu lated as
w = sin c (x ) — s in c (x /).
We have com pared p ro b ab ilistic pairw ise regression described in Section 5 to th e K P C R a n k algorithm . To evaluate perform ance of th e m eth o d s we use norm alized count of in co rrectly p red icte d pairs of d a ta points. T he erro r o b ta in e d by using th e p ro b ab ilistic pairw ise regression on te s t d a ta s e t is 0.035. T he K P C R a n k alg o rith m has an erro r of 0.025, in d icatin g th a t b o th m e th o d s have a good p e r form ance w hen learning pairw ise com parison d a ta .
Posterior mean approximation GP approximation (MLII) and KPCRank
F ig. 1. The sin c function and the approximate posterior means of the random vector f using the pairwise preference d ata and a full Bayesian procedure w ith a uniform prior on the log of the kernel width param eter (left). The approximation is done with expec tation propagation [10]. We have randomly corrupted 5% of the pairwise preferences. T hat is why the difference between the sin c function and the approximate posterior mean f (left) has no influence on the disagreement error as long as the “shape” is correct. The approximate posterior means of the random vector f using the score dif ferences data and the maximum likelihood method on the hype-parameter (right). Note th a t the model is invariant to the translations of the score function.
7 C o n c l u s i o n s
In th is p a p e r we propose th e K P C R a n k alg o rith m for learning preference rela tions. T h e K P C R a n k alg o rith m can be considered as an extension of nonlinear principal com ponent regression for learn in g pairw ise preferences. T he alg o rith m belongs to th e class of so-called shrinkage m eth o d s (sim ilar to K P C R , RLS, etc.) th a t are designed to sh rin k th e solution from th e areas of low d a ta sp read and can resu lt in an e stim a te th a t is biased b u t has lower variance. T h is is accom plished by p ro je ctin g th e tra in in g d a ta on to p rin cip al co m ponents in th e feature space, th a t in tu r n m ap on to a rb itra ry m anifold in th e in p u t space. A n o th er ad vantage of th e m e th o d is th a t by p ro jectin g th e d a ta on to th e com ponents w ith higher eigenvalues we aim a t discard in g th e noise co n tain ed in th e original d a ta and o b ta in b e tte r perform ance com pared to th e baseline m e th o d s w hen d a ta is co rru p te d by noise.
O ur experim ents confirm th a t th e proposed alg o rith m works well in situ a tio n s w hen th e ra n k in g function has to be learned from th e noisy d a ta . T he K P C R a n k alg o rith m n o ta b ly o u tp erfo rm s th e RLS, R ankR L S, a n d K P C R alg o rith m s in th e experim ents w here d a ta is in ten tio n ally co rru p te d by noise. F u rth erm o re, we com pare th e K P C R a n k alg o rith m to th e p ro b ab ilistic regression w hen learning on pairw ise preference d a ta . O ur resu lts in d icate th a t b o th m eth o d s achieve good perform ance by learning correct p red ictio n function.
In th e fu tu re we are p la n n in g to ex ten d th e alg o rith m for learn in g m ultiple labels sim ultaneously, propose m eth o d s to fu rth e r decrease its c o m p u tatio n al com plexity, an d te s t it on various ran k in g d a ta se ts.
A c k n o w l e d g m e n t s
We acknowledge su p p o rt from th e N eth erlan d s O rg an izatio n for Scientific Re search (N W O ), in p a rtic u la r a Vici g ra n t (639.023.604).
R e f e r e n c e s
1. A driana Birlutiu, Perry Groot, and Tom Heskes. M ulti-task Preference learning w ith Gaussian Processes. In Proceedings o f the 17th European Sym posium on A r tificial Neural N etw orks (E S A N N ), pages 123-128, 2009.
2. Wei Chu and Zoubin Ghahramani. Preference learning with gaussian processes. In Luc De Raedt and Stefan Wrobel, editors, Proceedings o f the Tw enty-Second In te r national Conference (IC M L 2005), volume 119 of A C M International Conference Proceeding Series, pages 137-144. ACM, 2005.
3. Michael Collins and Nigel Duffy. New ranking algorithms for parsing and tagging: kernels over discrete structures, and the voted perceptron. In A C L ’02: Proceedings o f the 40th A nn u a l M eeting on A ssociation fo r C om putational Linguistics, pages 263-270, Morristown, NJ, USA, 2001. Association for Com putational Linguistics. 4. Ofer Dekel, Christopher D. Manning, and Yoram Singer. Log-linear models for label
ranking. In Sebastian Thrun, Lawrence Saul, and Bernhard Scholkopf, editors,
Advances in Neural In fo rm a tio n Processing S ystem s 16, pages 497-504, Cambridge, MA, 2004. MIT Press.
5. Johannes Furnkranz and Eyke Hullermeier. Preference learning. K ünstliche In tel ligenz, 19(1):60-61, 2005.
6. Gene H. Golub and Charles F. Van Loan. M atrix Com putations. Johns Hopkins University Press, 1996.
7. Eyke Hullermeier, Johannes Furnkranz, Weiwei Cheng, and Klaus Brinker. Label ranking by learning pairwise preferences. A rtificial Intelligence, 172(16-17):1897- 1916, 2008.
8. Thorsten Joachims. Optimizing search engines using clickthrough data. In David Hand, Daniel Keim, and Raymond Ng, editors, Proceedings o f the 8th A C M SIG K D D Conference on Knowledge D iscovery and D ata M ining (K D D ’02), pages 133-142, New York, NY, USA, 2002. ACM Press.
9. David J. C. MacKay. In fo rm a tio n Theory, Inference, and Learning Algorithms.
Cambridge University Press, 2003.
10. Thomas P. Minka. A fa m ily o f algorithms fo r approximate B ayesian inference.
PhD thesis, MIT, 2001.
11. Tapio Pahikkala, Evgeni Tsivtsivadze, A ntti Airola, Jouni Jarvinen, and Jorm a Boberg. An efficient algorithm for learning to rank from preference graphs. M a chine Learning, 75(1):129-165, 2009.
12. Sampo Pyysalo, Filip Ginter, Juho Heimonen, Jari Björne, Jorm a Boberg, Jouni Jarvinen, and Tapio Salakoski. BioInfer: A corpus for information extraction in the biomedical domain. B M C B ioinform atics, 8(50), 2007.
13. Ryan Rifkin, Gene Yeo, and Tomaso Poggio. Regularized least-squares classifica tion. In J.A.K. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, editors, Advances in Learning Theory: Methods, Model and A pplications, pages 131-154, Amsterdam, 2003. IOS Press.
14. Roman Rosipal and Leonard J. Trejo. Kernel partial least squares regression in reproducing kernel hilbert space. Journal o f M achine Learning Research, 2:97-123,
2002.
15. Bernhard Scholkopf, Alex J. Smola, and Klaus-Robert Muller. Kernel principal component analysis. In Wulfram Gerstner, Alain Germond, M artin Hasler, and Jean-Daniel Nicoud, editors, A rtificial Neural N etw orks - I C A N N ’97, 7th In te r national Conference, volume 1327 of Lecture N otes in C om puter Science, pages 583-588. Springer, 1997.
16. John Shawe-Taylor and Nello Cristianini. K ernel M ethods fo r P a ttern Analysis.
Cambridge University Press, New York, NY, USA, 2004.
17. Evgeni Tsivtsivadze, Tapio Pahikkala, A ntti Airola, Jorm a Boberg, and Tapio Salakoski. A sparse regularized least-squares preference learning algorithm. In Anders Holst, Per Kreuger, and Peter Funk, editors, 10th Scandinavian Conference on A rtificial Intelligence (S C A I 2008), volume 173, pages 76-83. IOS Press, 2008.