U-Processes and Preference Learning


Hong Li


Chuanbao Ren


School of Mathematics and Statistics, Huazhong University of Science and Technology, Wuhan 430074, China

Luoqing Li


Faculty of Mathematics and Statistics, Hubei University, Wuhan 430062, China

Preference learning has attracted great attention in machine learning. In this letter we propose a learning framework for pairwise loss based on empirical risk minimization of U-processes via Rademacher complexity. We first establish a uniform version of the Bernstein inequality for U-processes of degree 2 via the entropy method. Then we bound the excess risk by using the Bernstein inequality and peeling techniques. Finally, we apply the excess risk bound to pairwise preference learning and derive the convergence rates of pairwise preference learning algorithms with squared loss and indicator loss by using empirical risk minimization with respect to U-processes.

1 Introduction

Preference learning has attracted considerable attention in machine learning in recent years, with applications including the design of search engines, information retrieval, and movie recommendation systems. Preference learning involves predicting an ordering of the data points rather than predicting a single numerical value, as in regression, or a class label, as in classification.

Several methods based on the so-called pairwise approach have been developed and successfully applied to information retrieval (Liu, 2011). This approach takes sample pairs and formalizes the problem of preference learning as that of classification and regression. It collects sample pairs from the entire samples, and for each pair, it assigns a label representing the relative relevance of the two samples.

Preference learning aims to learn a binary preference relation, which is a function of two variables (Cohen, Schapire, & Singer, 1999; Freund, Iyer, Schapire, & Singer, 2003; Hüllermeier, Fürnkranz, Cheng, & Brinker, 2008; Pahikkala, Tsivtsivadze, Airola, Boberg, & Salakoski, 2007; Clémençon, Lugosi, & Vayatis, 2008; Rejchel, 2012). In a comparison of two input points, the preference relation evaluates whether the first point ranks before the second one. A binary preference relation learned from data is not necessarily consistent in the sense of transitivity.

Neural Computation 26, 2896–2924 (2014). © 2014 Massachusetts Institute of Technology. doi:10.1162/NECO_a_00674

Preference learning is distinct from both classification and regression. It is natural to ask what kinds of properties hold for algorithms for this problem and, in particular, whether tools that have been applied to study the excess risk of classification and regression algorithms can be adapted to study the excess risk of preference algorithms. The application of U-processes to the generalization analysis of excess ranking risk was first developed in Clémençon et al. (2008). The authors gave a novel moment inequality for degenerate U-processes and investigated the empirical risk minimization of U-statistics for the ranking problem. Rejchel (2012) extended the local Rademacher complexity technique to U-statistics and obtained bounds on the excess risk for convex loss functions. (For an excellent account of the theory of U-statistics and U-processes, see De la Peña & Giné, 1999.)

We propose a new framework about the pairwise loss function based on empirical risk minimization of U-processes. This framework is general and allows users to apply it directly instead of deriving bounds in each risk minimization problem.

First, we derive a uniform version of Bernstein inequality of U-processes of degree 2 via the entropy methods by Ledoux (1996). Our concentration inequality is new and is suitable for preference learning.

Second, we bound the excess risk of the empirical risk minimizer of U-processes. Our approach is based on the theory of U-processes, and the key tools are the Bernstein concentration inequality and the local Rademacher complexity, which generalize the results of Koltchinskii (2006) for empirical processes.

Third, we investigate the convergence rates of pairwise preference learning algorithms with squared loss and indicator loss by using empirical risk minimization with respect to U-processes. The convergence rates are fast and are the same as those in Clémençon et al. (2008) and Rejchel (2012), but our method is different from theirs.

This letter is organized as follows. Section 2 discusses the U-processes. We establish a uniform version of Bernstein inequality for U-processes. In section 3, we give the bounds of excess risk based on the local Rademacher complexity developed by Koltchinskii (2006) for the empirical processes. In section 4, we bound the homogeneous Rademacher chaos process of order 2 via the entropy integral. In section 5, we apply the excess risk bounds to the pairwise preference algorithm and derive the convergence rates. The Bernstein inequality and the error bounds of the excess risk of the U-processes are proved in the appendixes.


2 Bernstein Inequality for U-Processes

In this section we establish a uniform version of the Bernstein inequality for suprema of U-processes of degree 2. Let $X$ be an input space, $P$ a probability measure on $X$, and $P^2 = P \otimes P$ the product probability measure on $X \times X$. We denote by $\mathcal{H}$ the class of measurable functions from $X \times X$ into $\mathbb{R}$. The U-statistic of $h \in \mathcal{H}$ is defined as

$$U_n(h) = \frac{1}{n(n-1)} \sum_{i \neq j}^{n} h(x_i, x_j).$$
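The definition of the degree-2 U-statistic can be illustrated with a minimal sketch. The function names and the choice of kernel below are illustrative assumptions and do not come from the letter; the kernel $h(a, b) = (a - b)^2/2$ is used only because its U-statistic is known to coincide with the unbiased sample variance, which gives a simple sanity check.

```python
import numpy as np

def u_statistic(kernel, x):
    """Degree-2 U-statistic: average of kernel(x_i, x_j) over ordered pairs i != j."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two sample points")
    # Evaluate the kernel on the full n x n grid of pairs via broadcasting.
    values = kernel(x[:, None], x[None, :])
    # Remove the diagonal so that only pairs with i != j contribute.
    off_diagonal_sum = values.sum() - np.trace(values)
    return off_diagonal_sum / (n * (n - 1))

def half_squared_difference(a, b):
    """Illustrative symmetric kernel (an assumption, not taken from the letter)."""
    return 0.5 * (a - b) ** 2

sample = np.array([1.0, 2.0, 4.0, 7.0])
estimate = u_statistic(half_squared_difference, sample)
print(estimate, np.var(sample, ddof=1))
```

The two printed numbers agree, which reflects the fact that the sample variance is itself a U-statistic of degree 2.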

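As usual, we use the notation

$$\|F\|_{\mathcal{H}} = \sup_{h \in \mathcal{H}} |F(h)| \quad \text{and} \quad \|F\|_{\mathcal{H}^2} = \sup_{h \in \mathcal{H}} |F(h^2)|$$

for any functional $F$ on $\mathcal{H}$, where $\mathcal{H} = \{h : X \times X \to \mathbb{R}\}$ and $\mathcal{H}^2 = \{h^2 : h \in \mathcal{H}\}$.

Then the supremum of the U-process $U_n(h)$ indexed by a function class $\mathcal{H}$ is defined as

$$\|U_n\|_{\mathcal{H}} = \frac{1}{n(n-1)} \sup_{h \in \mathcal{H}} \Big| \sum_{i \neq j}^{n} h(x_i, x_j) \Big|.$$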

We assume for simplicity that $\mathcal{H}$ is a countable class of functions. This condition can easily be replaced by standard measurability assumptions known in the theory of empirical processes (Dudley, 1999; van der Vaart & Wellner, 1996; Koltchinskii & Panchenko, 2000; Talagrand, 1994) and U-processes (Arcones & Giné, 1993; De la Peña & Giné, 1999). We do not make the countability assumption in some of the examples below.

Talagrand (1996) obtained new concentration inequalities for empirical processes. Then Ledoux (1996) developed these types of inequalities with explicit constants using entropy methods. The Talagrand-type inequalities for empirical processes were also developed further (Massart, 2000b; Bousquet, 2002; Klein & Rio, 2005) and were used to investigate nonparametric estimation and machine learning (Massart, 2000a; Koltchinskii, 2006; Bartlett, Bousquet, & Mendelson, 2005).

There are some concentration inequalities for U-processes (Arcones, 1995, for instance). For our purpose we establish a uniform version of the Bernstein inequality for suprema of U-processes of degree 2. This concentration inequality is more convenient for our analysis and is one of the main contributions of this letter.


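Theorem 1. Let $\mathcal{H}$ be a class of measurable and symmetric functions from $X \times X$ to $[-b, b]$, $b > 0$. We have for any positive number $x$,

$$P\left\{ \|U_n\|_{\mathcal{H}} \ge E\|U_n\|_{\mathcal{H}} + \sqrt{\frac{160\, E\|U_n\|_{\mathcal{H}^2}\, x}{n}} + \frac{80bx}{n} \right\} \le \exp\{-x\}.$$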

This inequality can be used to obtain some upper bounds of excess risk of preference learning algorithms. We postpone its proof to appendix A.

The expectations of suprema of U-processes involve the Rademacher complexity via symmetrization techniques. Rademacher complexity refers to the data-dependent estimates of the complexity of a function class. Several authors have considered Rademacher complexity (Bartlett & Mendelson, 2002; Ying & Campbell, 2010) and local Rademacher complexity (Bartlett et al., 2005; Koltchinskii, 2006).

Let $x = \{x_1, \ldots, x_n\}$ be independent and identically distributed (i.i.d.) random variables drawn according to the distribution $P$ on $X$, and let $\mathcal{F}$ be a set of real-valued functions on $X$. The empirical Rademacher complexity of $\mathcal{F}$ is the random variable

$$\hat{R}_n^{(1)}(\mathcal{F}) = E_{\varepsilon} \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(x_i) \Big|,$$

where $\varepsilon = \{\varepsilon_1, \ldots, \varepsilon_n\}$ are independent uniform $\pm 1$-valued Rademacher random variables. The Rademacher complexity of $\mathcal{F}$ is $R_n^{(1)}(\mathcal{F}) = E\hat{R}_n^{(1)}(\mathcal{F})$. Some basic properties of Rademacher complexities can be found in Bartlett and Mendelson (2002).

The homogeneous Rademacher chaos process of order 2 is also a random variable. We refer to the expectation of the supremum,

$$\hat{R}_n^{(2)}(\mathcal{H}) = E_{\varepsilon} \sup_{h \in \mathcal{H}} \Big| \frac{1}{n(n-1)} \sum_{i \neq j}^{n} \varepsilon_i \varepsilon_j h(x_i, x_j) \Big|,$$

as the empirical Rademacher chaos complexity over $\mathcal{H}$. The corresponding Rademacher chaos complexity is $R_n^{(2)}(\mathcal{H}) = E\hat{R}_n^{(2)}(\mathcal{H})$.
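For a finite function class, both empirical complexities can be approximated by averaging over random sign vectors. The following is a minimal Monte Carlo sketch under assumptions that are not in the letter: the classes are finite, function values are precomputed on the sample, and the threshold classes used in the demonstration are hypothetical.

```python
import numpy as np

def empirical_rademacher(f_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps sup_f |(1/n) sum_i eps_i f(x_i)|.

    f_values has shape (m, n): row k holds f_k(x_1), ..., f_k(x_n).
    """
    rng = np.random.default_rng(seed)
    m, n = f_values.shape
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))
    sums = (eps @ f_values.T) / n          # shape (n_draws, m)
    return float(np.abs(sums).max(axis=1).mean())

def empirical_rademacher_chaos(h_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps sup_h |1/(n(n-1)) sum_{i!=j} eps_i eps_j h(x_i, x_j)|.

    h_values has shape (m, n, n): h_values[k, i, j] = h_k(x_i, x_j).
    """
    rng = np.random.default_rng(seed)
    m, n, _ = h_values.shape
    off_diag = h_values * (1.0 - np.eye(n))   # drop the i == j terms
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # quad[d, k] = sum_{i != j} eps[d, i] * eps[d, j] * h_k(x_i, x_j)
    quad = np.einsum("di,kij,dj->dk", eps, off_diag, eps)
    return float(np.abs(quad).max(axis=1).mean() / (n * (n - 1)))

# Hypothetical finite classes indexed by thresholds (illustration only).
rng = np.random.default_rng(1)
x = rng.normal(size=30)
thresholds = np.linspace(-1.0, 1.0, 5)
f_values = np.sign(x[None, :] - thresholds[:, None])
h_values = (x[None, :, None] - x[None, None, :] > thresholds[:, None, None]).astype(float)
rad = empirical_rademacher(f_values)
chaos = empirical_rademacher_chaos(h_values)
```

Both estimates lie in $[0, 1]$ here because the functions are bounded by 1 in absolute value; the number of draws and the seed are arbitrary choices.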

For U-statistics and U-processes, the Hoeffding decomposition is a basic tool. We state it as follows (De la Peña & Giné, 1999):

Lemma 1 (Hoeffding decomposition). The U-statistic

$$U_n(h) = \frac{1}{n(n-1)} \sum_{i \neq j}^{n} h(x_i, x_j)$$

can be decomposed into the following form:

$$U_n(h) = P^2 h + 2T_n(h) + W_n(h),$$

where

$$T_n(h) = \frac{1}{n} \sum_{i=1}^{n} \big(Ph(x_i) - P^2 h\big) = P_n Ph - P^2 h$$

and

$$W_n(h) = \frac{1}{n(n-1)} \sum_{i \neq j}^{n} \big(h(x_i, x_j) - Ph(x_i) - Ph(x_j) + P^2 h\big).$$
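The decomposition in lemma 1 is an algebraic identity, so it can be checked numerically whenever $Ph$ and $P^2 h$ are available in closed form. The sketch below assumes a finite input space with a known probability vector and a randomly generated symmetric kernel; these choices are purely illustrative and are not part of the letter.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 50

# Hypothetical finite input space {0, 1, 2, 3} with known distribution p.
p = np.array([0.1, 0.2, 0.3, 0.4])
raw = rng.uniform(size=(k, k))
H = (raw + raw.T) / 2.0            # symmetric kernel h(a, b) stored as a matrix

Ph = H @ p                          # Ph(a) = sum_b p[b] * h(a, b)
P2h = float(p @ H @ p)              # P^2 h = sum_{a,b} p[a] * p[b] * h(a, b)

x = rng.choice(k, size=n, p=p)      # i.i.d. sample from p
pair = H[np.ix_(x, x)]              # pair[i, j] = h(x_i, x_j)
off_diagonal = ~np.eye(n, dtype=bool)

U = pair[off_diagonal].mean()                       # U_n(h)
T = Ph[x].mean() - P2h                              # T_n(h) = P_n Ph - P^2 h
centered = pair - Ph[x][:, None] - Ph[x][None, :] + P2h
W = centered[off_diagonal].mean()                   # W_n(h), the degenerate part

# The kernel of W_n integrates to zero in each argument (degeneracy).
degenerate_kernel = H - Ph[:, None] - Ph[None, :] + P2h
print(U, P2h + 2 * T + W)
```

The two printed values coincide up to floating-point rounding, and the kernel of $W_n$ averages to zero against $p$ in each argument, which is what "degenerate" means in this context.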

By the Hoeffding decomposition, the U-process is decomposed into a sum of i.i.d. random variables plus a degenerate U-statistic. In our analysis, the sum of i.i.d. random variables will be bounded by Rademacher processes, while the degenerate part will be controlled by the homogeneous Rademacher chaos processes of order 2.

3 Performance of Excess Risk

The general theory of empirical risk minimization was developed by Vapnik and Chervonenkis (Vapnik, 1998). Concentration inequalities provided a basic tool and played an important role in analyzing empirical risk minimization algorithms. Later, new concentration inequalities led to tighter generalization bounds via peeling methods (Massart, 2000a; Bartlett et al., 2005; Koltchinskii, 2006). The general approach described in Koltchinskii (2006, 2011) was the motivation for this letter.

We first describe the empirical risk minimization algorithm based on U-processes. The target function $h_{\mathcal{H}}$ over $\mathcal{H}$ is defined by

$$h_{\mathcal{H}} = \arg\min_{h \in \mathcal{H}} P^2(h).$$

Since the distribution $P$ is unknown, we can construct only an approximation to $h_{\mathcal{H}}$ based on the given samples $x_1, \ldots, x_n$. One way is to find the empirical minimizer of the U-process, defined as

$$h_n = \arg\min_{h \in \mathcal{H}} U_n(h).$$

The excess risk of $h_n$ is defined as

$$L(h_n) = P^2(h_n) - P^2(h_{\mathcal{H}}).$$

The excess risk is a natural measure of the accuracy of this approximation. An upper bound on the excess risk follows from theorem 1 and the inequality

$$L(h_n) \le 2\|U_n - P^2\|_{\mathcal{H}}.$$

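Theorem 2. Assume that $\mathcal{H} := \{h : X \times X \to [0, 1]\}$. With probability at least $1 - e^{-x}$, $x > 0$, there holds

$$L(h_n) \le 2E\|U_n - P^2\|_{\mathcal{H}} + \sqrt{\frac{640\, E\|U_n\|_{\mathcal{H}^2}\, x}{n}} + \frac{160x}{n}. \qquad (3.1)$$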

Theorem 2 provides global estimates of the complexity of the function classH. As a result, using the global Rademacher complexity, the error rate is at least of the order of 1/√n. To get a fast rate, we consider the local Rademacher complexities. The fact is that the algorithm will likely pick functions that have a small error in a small subset of the entire function class.

We follow Koltchinskii (2006) in developing concentration inequalities for the excess risk based on U-processes. Our concentration inequality is new. It is suitable for analyzing preference learning algorithms and is interesting in its own right.

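We need more notation. For $\delta > 0$, we define the $\delta$-minimal set $\mathcal{H}_\delta \subset \mathcal{H}$ of the risk as follows:

$$\mathcal{H}_\delta = \{h \in \mathcal{H} : L(h) \le \delta\}.$$

Denote $\mathcal{H}'_\delta = \{h = h_1 - h_2 : h_1, h_2 \in \mathcal{H}_\delta\}$ and define

$$\Delta_t(\delta) = E\|U_n - P^2\|_{\mathcal{H}'_\delta} + \sqrt{\frac{160\, E\|U_n\|_{(\mathcal{H}'_\delta)^2}\, t}{n}} + \frac{80t}{n}.$$

It follows from theorem 1 that for all $t > 0$,

$$P\Big\{ \sup_{h, g \in \mathcal{H}_\delta} |(U_n - P^2)(h - g)| \ge \Delta_t(\delta) \Big\} \le e^{-t}. \qquad (3.2)$$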

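Note that $U_n(h_n) \le U_n(h_{\mathcal{H}})$. From equation 3.2, for any $\delta > 0$, if $L(h_n) < \delta$, then with probability at least $1 - e^{-t}$,

$$L(h_n) = P^2 h_n - P^2 h_{\mathcal{H}} = U_n(h_n) - U_n(h_{\mathcal{H}}) + (P^2 - U_n)(h_n - h_{\mathcal{H}}) \le \sup_{h, g \in \mathcal{H}_\delta} |(P^2 - U_n)(h - g)| \le \Delta_t(\delta).$$

This implies that $\delta \le \Delta_t(\delta)$ for the $\delta > 0$ satisfying $L(h_n) < \delta$. Then with the same probability, the excess risk $L(h_n)$ will be uniformly bounded by the largest solution $\bar{\delta}$ of the inequality $\delta \le \Delta_t(\delta)$. That is, the optimal $\bar{\delta}$ is the solution of $\delta = \Delta_t(\delta)$.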


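In order to obtain the optimal $\delta$ such that $L(h_n) \le \delta$ with high probability, we use the iterative localization technique.

Take $\delta^{(0)} = 1$, so that $\mathcal{H}_{\delta^{(0)}} = \mathcal{H}$. Assume, for simplicity, that the minimum of $P^2 h$ is attained at $h_{\mathcal{H}} \in \mathcal{H}$. Since $h_n, h_{\mathcal{H}} \in \mathcal{H}_{\delta^{(0)}}$ and $U_n(h_n) \le U_n(h_{\mathcal{H}})$, we have, with probability at least $1 - e^{-t}$, from equation 3.2,

$$L(h_n) = P^2 h_n - P^2 h_{\mathcal{H}} = U_n(h_n) - U_n(h_{\mathcal{H}}) + (P^2 - U_n)(h_n - h_{\mathcal{H}}) \le \sup_{h, g \in \mathcal{H}_{\delta^{(0)}}} |(P^2 - U_n)(h - g)| \le \Delta_t(\delta^{(0)}) \wedge 1 =: \delta^{(1)}.$$

This implies that $h_n \in \mathcal{H}_{\delta^{(1)}}$. We can repeat the argument to show that with probability at least $1 - 2e^{-t}$,

$$L(h_n) \le \Delta_t(\delta^{(1)}) \wedge 1 =: \delta^{(2)}.$$

Iterating the argument $N$ times shows that with probability at least $1 - Ne^{-t}$, we have $L(h_n) \le \delta^{(N)}$.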
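The iterative localization argument is a fixed-point iteration started from $\delta^{(0)} = 1$. The sketch below illustrates the recursion with a hypothetical bound function of the form $a\sqrt{\delta} + b$; this shape and the constants are assumptions chosen for illustration and are not the bound derived in the letter.

```python
import math

def localize(delta_fn, n_steps):
    """Iterate delta <- min(delta_fn(delta), 1), starting from delta = 1."""
    delta = 1.0
    path = [delta]
    for _ in range(n_steps):
        delta = min(delta_fn(delta), 1.0)
        path.append(delta)
    return path

# Hypothetical constants: a plays the role of a complexity term, b of a t/n term.
a, b = 0.1, 0.01

def toy_bound(delta):
    """Assumed stand-in for the bound function; not the letter's exact expression."""
    return a * math.sqrt(delta) + b

path = localize(toy_bound, 20)

# Closed-form solution of delta = a * sqrt(delta) + b, for comparison.
fixed_point = ((a + math.sqrt(a * a + 4.0 * b)) / 2.0) ** 2
print(path[:4], fixed_point)
```

In this toy case the iterates decrease monotonically toward the fixed point. In the argument above each step costs an additional $e^{-t}$ in probability, which is why the number of iterations $N$ appears in the final bound.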

We regard $\delta$ as a variable to construct a fixed-point equation based on a Bernstein inequality for U-processes. We cannot compute accurate solutions $\delta_0$ for the equation $\delta = R(\delta, t)$, but we can find an upper bound of $\bar{\delta}$ through the $\flat$-transform and $\sharp$-transform, which are involved in the definitions of various complexity measures of function classes in empirical risk minimization.

Definition 1. The $\flat$-transform and the $\sharp$-transform of $\psi$ are defined by

$$\psi^{\flat}(\delta) = \sup_{\alpha \ge \delta} \frac{\psi(\alpha)}{\alpha} \quad \text{and} \quad \psi^{\sharp}(\varepsilon) = \inf\{\delta > 0 : \psi^{\flat}(\delta) \le \varepsilon\},$$

respectively.

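It will sometimes be convenient to discretize the definitions of the $\flat$-transform and $\sharp$-transform. Let $q > 1$ and $\delta_j = q^{-j}$. Define

$$\psi^{\flat,q}(\delta) = \sup_{\delta_j \ge \delta} \frac{\psi(\delta_j)}{\delta_j} \quad \text{and} \quad \psi^{\sharp,q}(\varepsilon) = \inf\{\delta > 0 : \psi^{\flat,q}(\delta) \le \varepsilon\}.$$

Some descriptions and properties can be found in Koltchinskii (2006).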
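The discretized transforms are easy to evaluate on the geometric grid $\delta_j = q^{-j}$. The following is a minimal sketch, assuming the grid starts at $j = 0$ and is truncated at a finite `max_j` (both are assumptions made for the illustration; the function names are hypothetical).

```python
import math

def flat_transform_grid(psi, q, max_j):
    """Return [(delta_j, sup_{j' <= j} psi(delta_j') / delta_j')] for delta_j = q**-j."""
    values, running_max = [], -math.inf
    for j in range(max_j + 1):
        delta_j = q ** (-j)
        running_max = max(running_max, psi(delta_j) / delta_j)
        values.append((delta_j, running_max))
    return values

def sharp_transform(psi, eps, q=2.0, max_j=60):
    """Discretized sharp-transform: inf{delta > 0 : flat-transform(delta) <= eps}.

    The flat-transform is constant on each interval (delta_{j+1}, delta_j], so the
    infimum equals delta_{j+1} for the last grid index j whose value is <= eps.
    If even the first grid value exceeds eps, the infimum is delta_0 = 1.
    The result is only accurate down to the truncation level q**-(max_j + 1).
    """
    result = 1.0
    for delta_j, flat_value in flat_transform_grid(psi, q, max_j):
        if flat_value > eps:
            break
        result = delta_j / q
    return result

# Toy example: psi(delta) = 0.1 * sqrt(delta), so psi(delta)/delta = 0.1 / sqrt(delta).
value = sharp_transform(lambda d: 0.1 * math.sqrt(d), eps=0.5, q=2.0)
print(value)
```

For this toy $\psi$ the ratio $\psi(\delta_j)/\delta_j = 0.1 \cdot 2^{j/2}$ stays below $0.5$ up to $j = 4$, so the discretized $\sharp$-transform equals $2^{-5}$.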

We now state the bound on the excess risk of $h_n$, which shows that the investigation of the excess risk reduces to the computation of the $\sharp$-transform of $\Delta_t(\delta)$ in the variable $\delta$.


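Theorem 3. Let $\mathcal{H}$ denote a class of functions $h : X \times X \to [0, 1]$. For all $t > 0$ and $\delta \ge \delta_n(t) = \Delta_t^{\sharp,q}\big(\tfrac{1}{2q}\big)$,

$$P\{L(h_n) > \delta\} = P\{P^2(h_n) - P^2(h_{\mathcal{H}}) > \delta\} \le \log_q\Big(\frac{q}{\delta}\Big) e^{-t}.$$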

The theorem provides a general upper bound in terms of the $\sharp$-transform, which has been studied and is well understood. Although $\delta_n(t)$ is difficult to deal with directly, it can be bounded in many interesting cases.

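For brevity, we denote

$$\varphi_n(\mathcal{H}; \delta) = E \sup_{h, g \in \mathcal{H}_\delta} |(P_n - P)(Ph - Pg)| = E\|T_n\|_{\mathcal{H}'_\delta}$$

and

$$\phi_n(\mathcal{H}; \delta) = E \sup_{h, g \in \mathcal{H}_\delta} |W_n(h - g)| = E\|W_n\|_{\mathcal{H}'_\delta}.$$

Using the Hoeffding decomposition, it follows that

$$E\|U_n - P^2\|_{\mathcal{H}'_\delta} \le 2\varphi_n(\mathcal{H}; \delta) + \phi_n(\mathcal{H}; \delta).$$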

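Bounding $\varphi_n(\mathcal{H}; \delta)$ and $\phi_n(\mathcal{H}; \delta)$ is related to the behavior of the continuity modulus of the empirical processes and U-processes. We define a pseudometric $d$ on the class $\mathcal{H}$ by

$$d(h, g) = \sqrt{P^2(h - g)^2},$$

the continuity modulus of the empirical processes by

$$\theta_n(\delta) = E \sup_{d(h, h_{\mathcal{H}}) \le \sqrt{\delta}} |T_n(h - h_{\mathcal{H}})|, \qquad (3.3)$$

and the continuity moduli of the U-processes by

$$\vartheta_n(\delta) = E \sup_{d(h, h_{\mathcal{H}}) \le \sqrt{\delta}} |W_n(h - h_{\mathcal{H}})| \qquad (3.4)$$

and

$$\eta_n(\delta) = \sqrt{\frac{1}{n}\, E \sup_{d(h, h_{\mathcal{H}}) \le \sqrt{\delta}} |U_n(h - h_{\mathcal{H}})^2|}. \qquad (3.5)$$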

The continuity modulus of the empirical processes $\theta_n(\delta)$ can be bounded by the Rademacher complexity, a result due to Koltchinskii (2006). The continuity moduli of the U-processes $\vartheta_n(\delta)$ and $\eta_n(\delta)$ will be bounded in the next section.

Let $h_* = \arg\min_h P^2 h$, where the infimum is taken over all measurable functions $h$ on $X \times X$. Then $h_*$ is a global minimizer of $P^2 h$. Set $\Lambda = \inf_{h \in \mathcal{H}} (P^2 h - P^2 h_*)$. With this notation we have:

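Theorem 4. Let $\mathcal{H}$ denote a class of functions $h : X \times X \to [0, 1]$. Set $q > 1$ and $\Theta_n^{\sharp}(\varepsilon) = \theta_n^{\sharp}(\varepsilon) + \vartheta_n^{\sharp}(\varepsilon)$. Assume that for any $h \in \mathcal{H}$,

$$P^2(h - h_*)^2 \le A(P^2 h - P^2 h_*) \qquad (3.6)$$

with some numerical constant $A > 0$. Then there exists a constant $K > 0$ such that for $0 < \varepsilon \le 1$ and for all $t > 0$,

$$\delta_n(t) \le \varepsilon\Lambda + \frac{1}{A}\Theta_n^{\sharp}\Big(\frac{\varepsilon}{KA}\Big) + \frac{1}{A}\eta_n^{\sharp}\Big(\frac{\varepsilon}{KAt^{1/2}}\Big) + \frac{Kt}{n},$$

and

$$P\Big\{ P^2 h_n - P^2 h_* \ge (1 + \varepsilon)\Lambda + \frac{1}{A}\Theta_n^{\sharp}\Big(\frac{\varepsilon}{KA}\Big) + \frac{1}{A}\eta_n^{\sharp}\Big(\frac{\varepsilon}{KAt^{1/2}}\Big) + \frac{Kt}{n} \Big\} \le \log_q\Big(\frac{qn}{t}\Big) e^{-t}.$$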

Note that for condition 3.6, the constant A will be given explicitly for some special cases.

The proofs of theorems 3 and 4 are in appendixes B and C, respectively.

4 Bound the Rademacher Complexity

In this section, we estimate the continuity moduli $\vartheta_n(\delta)$ and $\eta_n(\delta)$ for a special function class.

Definition 2. Let $(T, \rho)$ be a pseudometric space and $\varepsilon > 0$. The covering number $N(T, \rho, \varepsilon)$ is defined to be the minimal integer $n \in \mathbb{N}$ such that there exist $n$ disks of radius $\varepsilon$ with respect to $\rho$ covering $T$.

In order to improve the result, we impose conditions on the covering numbers (see Rejchel, 2012; Nolan & Pollard, 1987).

Assumption. Suppose that $\mathcal{H}$ is a measurable class of functions on $X \times X$ with values in $[-b, b]$ satisfying

$$N(\mathcal{H}, \rho, \varepsilon) \le B\varepsilon^{-\alpha}, \qquad (4.1)$$


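As usual, we define two empirical metrics over $\mathcal{H}$ as

$$\rho_1^{\mathcal{H}}(h, g) = \frac{1}{n} \sum_{i=1}^{n} P(h(x_i, \cdot) - g(x_i, \cdot))^2, \qquad \rho_2^{\mathcal{H}}(h, g) = \frac{1}{n(n-1)} \sum_{i \neq j} (h(x_i, x_j) - g(x_i, x_j))^2.$$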

Definition 3. A class of functions $\mathcal{F}$ is called a VC subgraph class if the subgraphs of the functions in $\mathcal{F}$ form a VC class of sets. That is, if we define the subgraph of a real-valued function $f$ on $S$ as the following subset $G_f$ of $S \times \mathbb{R}$:

$$G_f = \{(s, t) : s \in S,\ t \in \mathbb{R},\ 0 \le t \le f(s) \text{ or } f(s) \le t \le 0\},$$

then the class $\{G_f : f \in \mathcal{F}\}$ is a VC class of sets on $S \times \mathbb{R}$.

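Lemma 2 (Nolan & Pollard, 1987). Let $\mathcal{H}$ be a uniformly bounded class of functions on $X \times X$. For each finite measure $P$, if the class $\mathcal{H}$ satisfies equation 4.1, then the class $P(\mathcal{H})$ also satisfies that equation.

In particular, if $\mathcal{H}$ is a VC-subgraph class, then condition 4.1 holds.

Lemma 3. Suppose that $\mathcal{H}$ satisfies condition 4.1. Then it holds that (Koltchinskii, 2006)

$$\theta_n(\delta) \le K\left( \sqrt{\frac{\alpha\delta \log(1/\delta)}{n}} \vee \frac{\alpha \log(1/\delta)}{n} \right).$$

By the definition of the $\sharp$-transform, it follows that

$$\theta_n^{\sharp}(\varepsilon) \le \frac{C\alpha \log(n\varepsilon^2/\alpha)}{n\varepsilon^2}.$$

The Rademacher chaos complexity $R_n^{(2)}(\mathcal{H})$ can be bounded by the metric entropy integral (Arcones & Giné, 1993; De la Peña & Giné, 1999).

Lemma 4. There exists a universal constant $K$ such that

$$R_n^{(2)}(\mathcal{H}) \le \frac{K}{n}\, E \int_0^{\sigma_n} \log N(\mathcal{H}, L_2(U_n), \varepsilon)\, d\varepsilon,$$

where $\sigma_n^2 = \sup_{h, g \in \mathcal{H}} U_n((h - g)^2)$ and $\|h - g\|_{L_2(U_n)} = \sqrt{U_n((h - g)^2)}$.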


We now compute the entropy integral. We need the following lemma to bound $E\sigma_n^2$:

Lemma 5. Let $\mathcal{H}$ be a uniformly bounded class of real-valued measurable functions on $X \times X$. Then for every integer $n$,

$$E \sup_{h \in \mathcal{H}} \frac{1}{n(n-1)} \sum_{i \neq j}^{n} h^2(x_i, x_j) \le \sup_{h \in \mathcal{H}} P^2 h^2 + 4R_n^{(1)}(P(\mathcal{H}^2)) + 512R_n^{(2)}(\mathcal{H}^2).$$

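Proof. By the Hoeffding decomposition, one has

$$E \sup_{h \in \mathcal{H}} \frac{1}{n(n-1)} \sum_{i \neq j}^{n} h^2(x_i, x_j) \le \sup_{h \in \mathcal{H}} P^2 h^2 + 2E \sup_{h \in \mathcal{H}} |T_n(h^2)| + E \sup_{h \in \mathcal{H}} |W_n(h^2)|.$$

The second term on the right side of the inequality is bounded by the Rademacher complexity via the symmetrization method,

$$E \sup_{h \in \mathcal{H}} |T_n(h^2)| \le 2E \sup_{h \in \mathcal{H}} \Big| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i Ph^2(x_i) \Big|.$$

The third term on the right side of the inequality is bounded by the Rademacher chaos complexity of degree 2 via symmetrization and decoupling methods (here $x'_j$ and $\varepsilon'_j$ denote the independent copies introduced by decoupling):

$$E \sup_{h \in \mathcal{H}} |W_n(h^2)| = E \sup_{h \in \mathcal{H}} \Big| \frac{1}{n(n-1)} \sum_{i \neq j}^{n} \big(h^2(x_i, x_j) - Ph^2(x_i) - Ph^2(x_j) + P^2 h^2\big) \Big|$$

$$\le 8E \sup_{h \in \mathcal{H}} \Big| \frac{1}{n(n-1)} \sum_{i \neq j}^{n} \big(h^2(x_i, x'_j) - Ph^2(x_i) - Ph^2(x'_j) + P^2 h^2\big) \Big|$$

$$\le 32E \sup_{h \in \mathcal{H}} \Big| \frac{1}{n(n-1)} \sum_{i \neq j}^{n} \varepsilon_i \varepsilon'_j \big(h^2(x_i, x'_j) - Ph^2(x_i) - Ph^2(x'_j) + P^2 h^2\big) \Big|$$

$$\le 128E \sup_{h \in \mathcal{H}} \Big| \frac{1}{n(n-1)} \sum_{i \neq j}^{n} \varepsilon_i \varepsilon'_j h^2(x_i, x'_j) \Big|$$

$$\le 512E \sup_{h \in \mathcal{H}} \Big| \frac{1}{n(n-1)} \sum_{i \neq j}^{n} \varepsilon_i \varepsilon_j h^2(x_i, x_j) \Big|.$$

Summing the estimates above, we complete the proof of lemma 5.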

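Let $K$ and $K_i$ ($i \in \mathbb{N}$) denote constants whose values may change from line to line.

Corollary 1. Let $\mathcal{H}$ denote a class of functions $h : X \times X \to [0, 1]$. Suppose that $\mathcal{H}$ satisfies conditions 3.6 and 4.1, and let $0 < \varepsilon \le 1$. There exist constants $K_1$, $K_2$, $K_3$, $K_4$, and $K_5$ such that for any $t > 0$, with probability at least $1 - \log_q\big(\tfrac{qn}{t}\big) e^{-t}$,

$$P^2 h_n - P^2 h_* \le (1 + \varepsilon)\Lambda + \frac{K_1 \log(n\varepsilon^2)}{n\varepsilon^2} + \frac{K_2 t \log(n\varepsilon^2/t)}{n\varepsilon^2} + \frac{K_3}{n^2\varepsilon^2} + \frac{K_4 t}{n} + \frac{K_5(1 + \sqrt{t})}{n\varepsilon}$$

holds true.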

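Proof. Using condition 4.1 and the fact that $0 \le (h - g)^2 \le 1$, by lemma 5, we have

$$E\sigma_n^2 = E \sup_{d(h, g) \le \sqrt{\delta}} U_n((h - g)^2) \le \delta + \frac{K_1}{\sqrt{n}} + \frac{K_2}{n}.$$

By condition 4.1 again, it holds that

$$\vartheta_n(\delta) \le K R_n^{(2)}\{h : d(h, h_{\mathcal{H}}) \le \sqrt{\delta}\} \le \frac{K}{n}\, E \int_0^{\sigma_n} (\log A - \alpha \log \varepsilon)\, d\varepsilon$$

$$\le \frac{K}{n}\, E\big(\sigma_n \log A + \alpha\sigma_n - \alpha\sigma_n \log \sigma_n\big) \le \frac{K}{n}\, E\big(\sigma_n \log A + \alpha\big) \le \frac{K \log A}{n} \sqrt{E\sigma_n^2} + \frac{\alpha K}{n}$$

$$\le \frac{K \log A}{n} \sqrt{\delta + \frac{K_1}{\sqrt{n}} + \frac{K_2}{n}} + \frac{\alpha K}{n} \le \frac{K_1\sqrt{\delta}}{n} + \frac{K_2}{n} + \frac{K_3}{n^{5/4}} + \frac{K_4}{n^{3/2}} \le \frac{K_1\sqrt{\delta}}{n} + \frac{K_5}{n}. \qquad (4.2)$$

By the definition of $\vartheta_n^{\sharp}(\varepsilon)$, if $K_1\sqrt{\delta} \ge K_5$, we have $\sup_{x \ge \delta} \frac{2K_1\sqrt{x}}{nx} \le \varepsilon$; then $\vartheta_n^{\sharp}(\varepsilon) \le \frac{4K_1^2}{n^2\varepsilon^2}$, and if $K_1\sqrt{\delta} \le K_5$, then $\vartheta_n^{\sharp}(\varepsilon) \le \frac{2K_5}{n\varepsilon}$. Consequently we obtain

$$\vartheta_n^{\sharp}(\varepsilon) \le \frac{4K_1^2}{n^2\varepsilon^2} + \frac{2K_5}{n\varepsilon}.$$

Recall that

$$\eta_n(\delta) = \sqrt{\frac{1}{n}\, E \sup_{d(h, h_{\mathcal{H}}) \le \sqrt{\delta}} |U_n(h - h_{\mathcal{H}})^2|}.$$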

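Using lemma 5, we obtain

$$n\eta_n^2(\delta) = E \sup_{d(h, h_{\mathcal{H}}) \le \sqrt{\delta}} |U_n(h - h_{\mathcal{H}})^2| \le \delta + 4R_n^{(1)}\big(P(\{h^2 : d(h, h_{\mathcal{H}}) \le \sqrt{\delta}\})\big) + 512R_n^{(2)}\big(\{h^2 : d(h, h_{\mathcal{H}}) \le \sqrt{\delta}\}\big)$$

$$\le \delta + K_1 R_n^{(1)}\big(P(\{h : d(h, h_{\mathcal{H}}) \le \sqrt{\delta}\})\big) + K_2 R_n^{(2)}\big(\{h : d(h, h_{\mathcal{H}}) \le \sqrt{\delta}\}\big).$$

The second inequality uses the contraction principle (see equation D.3 in appendix D). The third inequality follows from lemma 3 and equation 4.2:

$$n\eta_n^2(\delta) \le \delta + K\left( \sqrt{\frac{\delta \log(1/\delta)}{n}} \vee \frac{\log(1/\delta)}{n} \right) + \frac{K_1\sqrt{\delta}}{n} + \frac{K_5}{n}.$$

Similarly, we have

$$\eta_n^{\sharp}(\varepsilon) \le \frac{K}{n} + \frac{1}{n\varepsilon} + \frac{K \log(n\varepsilon^2)}{n\varepsilon^2} + \frac{4K_1^2}{n^2\varepsilon^2} + \frac{2K_5}{n\varepsilon}.$$

It follows that

$$\eta_n^{\sharp}(\varepsilon) \le \frac{K_6}{n\varepsilon} + \frac{K_7 \log(n\varepsilon^2)}{n\varepsilon^2}.$$

The estimates of $\vartheta_n^{\sharp}(\varepsilon)$ and $\eta_n^{\sharp}(\varepsilon)$ together with $\theta_n^{\sharp}(\varepsilon)$ in lemma 3 yield

$$\Theta_n^{\sharp}(\varepsilon) + \eta_n^{\sharp}\Big(\frac{\varepsilon}{\sqrt{t}}\Big) \le \frac{K \log(n\varepsilon^2)}{n\varepsilon^2} + \frac{4K_1^2}{n^2\varepsilon^2} + \frac{2K_5}{n\varepsilon} + \frac{K_6\sqrt{t}}{n\varepsilon} + \frac{K_7 t \log(n\varepsilon^2/t)}{n\varepsilon^2}.$$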

The desired result follows from theorem 4.

5 Error Bounds of Pairwise Preference Algorithm

We have developed an abstract empirical risk minimization based on the U-processes. We now turn to preference learning problems. We investigate the convergence rates of pairwise preference learning by using empirical risk minimization with respect to U-processes. The nature of U-processes fits preference learning problems. The framework we gave in section 3 is general and allows users to apply it directly instead of deriving bounds in each risk minimization problem. We pay attention to the pairwise loss function of learning algorithms.

Let $X$ be an observation space and $Y \subset \mathbb{R}$ be a real-valued label set. In this section we assume that $P$ is the joint distribution on $X \times Y$ and is unknown. For a sample set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ of independent copies of $X \times Y$, we let $P_n$ be the empirical distribution on $X \times Y$ based on the given samples.

5.1 Preference Learning with Squared Loss. Pahikkala et al. (2007) and Cossock and Zhang (2008) considered learning a linear order $f : X \to \mathbb{R}$ from the i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{n}$. Here we discuss how to learn a preference function $s : X \times X \to [0, 1]$ and show how to derive the convergence rate from theorem 4.

We may suppose that $\omega_{ij} \in [0, 1]$ for our purposes. Otherwise we use the transform

$$\omega_{ij} = \frac{e^{y_i - y_j}}{1 + e^{y_i - y_j}}.$$

When $\omega_{ij} > \frac{1}{2}$, it means that $x_i$ is prior to $x_j$, and vice versa.

When taking the loss function as $h_s = (s - \omega)^2$, we consider the preference function class,

$$\mathcal{S} = \{s : X \times X \to [0, 1]\},$$


The inference property of $s$ is measured by its expected risk:

$$\mathcal{E}(s) = \int (s(x, x') - \omega(y, y'))^2\, dP^2.$$

The corresponding empirical risk is defined as

$$\mathcal{E}_n(s) = \frac{1}{n(n-1)} \sum_{i \neq j}^{n} (s(x_i, x_j) - \omega(y_i, y_j))^2,$$

which measures the average error between the ranking function $s(x_i, x_j)$ and the feedback information $\omega_{ij}$. The empirical risk minimizer $s_n$ over the function class $\mathcal{S} = \{s : X \times X \to [0, 1]\}$ is defined as

$$s_n = \arg\min_{s \in \mathcal{S}} \mathcal{E}_n(s).$$
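The pairwise empirical risk with squared loss and its minimizer can be sketched for a small finite class. Everything specific below is an assumption for illustration: the logistic family of candidate preference functions, the noiseless labels $y = 2x$, and the function names are not taken from the letter.

```python
import numpy as np

def soft_preference_labels(y):
    """omega_ij = exp(y_i - y_j) / (1 + exp(y_i - y_j)), written as a sigmoid."""
    diff = y[:, None] - y[None, :]
    return 1.0 / (1.0 + np.exp(-diff))

def pairwise_squared_risk(s_matrix, omega):
    """Average of (s(x_i, x_j) - omega_ij)^2 over ordered pairs i != j."""
    n = omega.shape[0]
    off_diagonal = ~np.eye(n, dtype=bool)
    return float(np.mean((s_matrix[off_diagonal] - omega[off_diagonal]) ** 2))

def logistic_preference(x, slope):
    """Hypothetical candidate s(x, x') = sigmoid(slope * (x - x')) with values in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-slope * (x[:, None] - x[None, :])))

x = np.array([-1.0, -0.2, 0.4, 1.5, 2.0])
y = 2.0 * x                                   # noiseless labels, chosen for the demo
omega = soft_preference_labels(y)

slopes = [0.5, 1.0, 2.0, 4.0]                 # finite stand-in for the class S
risks = [pairwise_squared_risk(logistic_preference(x, c), omega) for c in slopes]
best_slope = slopes[int(np.argmin(risks))]
print(best_slope, risks)
```

Because the labels are generated without noise from slope 2, the empirical risk minimizer over this finite family recovers that slope with zero empirical risk. With noisy labels the minimizer would only approximate the regression function.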

The regression function is defined by

$$s^*(x, x') = \int \omega\, dP^2(\omega \mid x, x'),$$

where $P^2(\omega \mid x, x')$ is the conditional distribution. It is well known that $s^*(x, x') = \arg\min \mathcal{E}(s)$, where the infimum is taken over all measurable functions $s$ on $X \times X$. Furthermore,

$$\int \big((s - \omega)^2 - (s^* - \omega)^2\big)\, dP^2(\omega \mid x, x') = (s - s^*)^2. \qquad (5.1)$$

Note that equation 5.1 also implies (by integration)

$$\mathcal{E}(s) - \mathcal{E}(s^*) = \int (s - s^*)^2 = \|s - s^*\|^2.$$

Consequently,

$$L(h_s) = E(s - \omega)^2 - E(s^* - \omega)^2 = \|s - s^*\|^2.$$

For simplicity, we suppose that $s^* \in \mathcal{S}$. Thus,


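It is easy to get

$$P^2\big((s_1 - \omega)^2 - (s_2 - \omega)^2\big)^2 \le 4\|s_1 - s_2\|^2.$$

As a result, condition 3.6 holds with $A = 4$. The symmetrization inequality gives

$$\varphi_n(\mathcal{H}; \delta) = E \sup_{h_{s_1}, h_{s_2} \in \mathcal{H}_\delta} \big|(P_n - P)(Ph_{s_1} - Ph_{s_2})\big| = E \sup_{\|s_1 - s^*\|^2 \le \delta,\ \|s_2 - s^*\|^2 \le \delta} \big|(P_n - P)(Ph_{s_1} - Ph_{s_2})\big|$$

$$\le 2E \sup_{\|s - s^*\|^2 \le \delta} \Big| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i (Ph_s - Ph_{s^*}) \Big| \le 2E \sup_{\|h_s - h_{s^*}\|^2 \le 4\delta} \Big| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i (Ph_s - Ph_{s^*}) \Big| = 2\theta_n(4\delta).$$

Similarly, we have

$$\phi_n(\mathcal{H}; \delta) = E \sup_{h, g \in \mathcal{H}_\delta} |W_n(h - g)| \le 8\vartheta_n(4\delta).$$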

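For a function class $\mathcal{S}$, we define two empirical metrics over it as

$$\rho_1^{\mathcal{S}}(h, g) = \frac{1}{n} \sum_{i=1}^{n} P(h(x_i, \cdot) - g(x_i, \cdot))^2, \qquad \rho_2^{\mathcal{S}}(h, g) = \frac{1}{n(n-1)} \sum_{i \neq j} (h(x_i, x_j) - g(x_i, x_j))^2.$$

The corresponding empirical covering numbers are denoted by $N(\mathcal{S}, \rho_1, \varepsilon)$ and $N(\mathcal{S}, \rho_2, \varepsilon)$, respectively. Since $\mathcal{H} = \{(s - \omega)^2 : s \in \mathcal{S}\}$, with $\rho_1^{\mathcal{H}}$ and $\rho_2^{\mathcal{H}}$ defined as above, let $h_1 = (s_1 - \omega)^2 \in \mathcal{H}$ and $h_2 = (s_2 - \omega)^2 \in \mathcal{H}$.

It is easy to see that $\rho_1^{\mathcal{H}}(h_1, h_2) \le 4\rho_1^{\mathcal{S}}(s_1, s_2)$ and $\rho_2^{\mathcal{H}}(h_1, h_2) \le 4\rho_2^{\mathcal{S}}(s_1, s_2)$. Since an $\varepsilon/4$-covering of $\mathcal{S}$ provides an $\varepsilon$-covering of $\mathcal{H}$, $N(\mathcal{H}, \rho_1, \varepsilon) \le N(\mathcal{S}, \rho_1, \varepsilon/4)$ and $N(\mathcal{H}, \rho_2, \varepsilon) \le N(\mathcal{S}, \rho_2, \varepsilon/4)$. If $\mathcal{S}$ satisfies condition 4.1,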


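Summarizing the discussion, we obtain the following bound on the excess risk of $s_n$ from corollary 1.

Theorem 5. Let $\mathcal{S}$ denote a class of functions $s : X \times X \to [0, 1]$. Suppose that $\mathcal{S}$ satisfies condition 4.1 and $s^* \in \mathcal{S}$. For any $0 < \varepsilon \le 1$ and any $t > 0$, there exist constants $K_1$, $K_2$, $K_3$, $K_4$, and $K_5$ such that, with probability at least $1 - \log_q\big(\tfrac{qn}{\delta}\big) e^{-t}$, we have

$$\|s_n - s^*\|^2 \le \frac{K_1 \log(n\varepsilon^2)}{n\varepsilon^2} + \frac{K_2 t \log(n\varepsilon^2/t)}{n\varepsilon^2} + \frac{K_3}{n^2\varepsilon^2} + \frac{K_4 t}{n} + \frac{K_5(1 + \sqrt{t})}{n\varepsilon}.$$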

5.2 Preference Learning with Indicator Loss. Clémençon et al. (2008) and Rejchel (2012) considered the indicator loss function.

Denote

$$\omega_{ij} = \omega(y_i, y_j) = \begin{cases} 1, & y_i > y_j, \\ -1, & y_i < y_j. \end{cases}$$

For preference learning with an indicator loss, one observes $x$ and $x'$ but not their labels $y$ and $y'$. We regard $x$ as being prior to $x'$ if $\omega = 1$. The goal is to rank $x$ and $x'$ so that the probability that the better ranked of them has a smaller label is as small as possible. Formally, a preference relation is a function $r : X \times X \to \{-1, 1\}$. If $r(x, x') = 1$, then the preference relation ranks $x$ higher than $x'$, and vice versa.

The setting of the bipartite ranking problem (Agarwal & Niyogi, 2005) can be described as follows. There is an instance space $X$ from which instances are drawn, and the learner is given a training sample $(S^+, S^-) \in X^m \times X^l$ consisting of a sequence of positive training examples $S^+ = (x_1^+, \ldots, x_m^+)$ and a sequence of negative training examples $S^- = (x_1^-, \ldots, x_l^-)$. Denote

$$\omega_{ij} = \begin{cases} 1, & x_i^+ \in S^+,\ x_j^- \in S^-, \\ -1, & x_i^+ \in S^-,\ x_j^- \in S^+. \end{cases}$$

The goal is to learn from these examples a preference function $r : X \times X \to \{-1, 1\}$ that ranks a positive instance $x$ higher than a negative one $x'$ if $r(x, x') = 1$ and ranks $x$ lower than $x'$ if $r(x, x') = -1$. Denote by $\mathcal{R}$ a class of preference rules $r : X \times X \to \{-1, 1\}$.

We consider the indicator loss function $h_r = I_{\{r(x, x') \ne \omega\}}$ defined on $(X \times Y)^2$, where $I$


indicator loss function class is denoted by $\mathcal{H}$. The performance of a preference rule is measured by the preference risk

$$\mathcal{E}(r) = P^2(h_r) = P\{\omega \ne r(x, x')\} = P\{\omega \cdot r(x, x') < 0\}.$$

Although the preference problem shares similarities with the binary classification problem, preference risk and classification risk are different (Agarwal & Niyogi, 2005). The empirical risk minimizer $r_n$ over $\mathcal{R}$ is denoted by

$$r_n = \arg\min_{r \in \mathcal{R}} \frac{1}{n(n-1)} \sum_{i \neq j} I_{\{r(x_i, x_j) \ne \omega_{ij}\}}. \qquad (5.2)$$
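Empirical risk minimization with the indicator loss amounts to counting misordered pairs. The sketch below assumes preference rules induced by scoring functions, $r(x, x') = 1$ if the score of $x$ exceeds the score of $x'$ and $-1$ otherwise, and assumes distinct labels so that ties do not arise; the candidate scores and names are hypothetical.

```python
import numpy as np

def preference_labels(y):
    """omega_ij = 1 if y_i > y_j and -1 otherwise (labels assumed distinct)."""
    return np.where(y[:, None] > y[None, :], 1, -1)

def rule_from_score(scores):
    """Preference rule induced by a scoring function: r_ij = 1 if score_i > score_j."""
    return np.where(scores[:, None] > scores[None, :], 1, -1)

def empirical_ranking_risk(r_matrix, omega):
    """Fraction of ordered pairs i != j with r(x_i, x_j) != omega_ij."""
    n = omega.shape[0]
    off_diagonal = ~np.eye(n, dtype=bool)
    return float(np.mean(r_matrix[off_diagonal] != omega[off_diagonal]))

x = np.array([0.3, 1.2, -0.5, 2.0, 0.9])
y = x ** 3                                    # increasing in x, so labels are distinct
omega = preference_labels(y)

# Hypothetical finite class of rules, each induced by a scoring function of x.
candidates = {"identity": x, "negated": -x, "absolute": np.abs(x)}
risks = {name: empirical_ranking_risk(rule_from_score(s), omega)
         for name, s in candidates.items()}
best = min(risks, key=risks.get)
print(best, risks)
```

The identity score orders every pair correctly, the negated score misorders every pair, and the absolute-value score misorders only the pair involving the negative point, so the minimizer over this finite class is the identity rule.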

Set $\eta(x, x') = E(\omega \mid X = x, X' = x')$. Then $\eta(x, x') = P(\omega = 1 \mid x, x') - P(\omega = -1 \mid x, x')$. The target we want to learn is denoted by

$$r^* = \arg\min_r \mathcal{E}(r) = \begin{cases} 1, & \eta(x, x') > 0, \\ -1, & \eta(x, x') < 0, \end{cases}$$

where the infimum is taken over all measurable functions $r$ on $X \times X$. Now we take $h = h_r$ in theorem 4 and get

$$\int \big(I_{\{r(x, x') \ne \omega\}} - I_{\{r^*(x, x') \ne \omega\}}\big)^2\, dP^2(\omega \mid x, x') = \frac{1}{|\eta(x, x')|} \int \big(I_{\{r(x, x') \ne \omega\}} - I_{\{r^*(x, x') \ne \omega\}}\big)\, dP^2(\omega \mid x, x').$$

We assume that $|\eta(x, x')| \ge \eta_0$ for some $\eta_0 > 0$ (see Massart, 2000a). Then the indicator loss function $h_r$ satisfies condition 3.6 with $A = 1/\eta_0$. The convergence rate of the preference algorithm defined in equation 5.2 follows from corollary 1.

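Theorem 6. Let $\mathcal{R}$ be a class of preference rules. Suppose that $\mathcal{R}$ satisfies condition 4.1 and there exists $\eta_0 > 0$ such that $|\eta(x, x')| \ge \eta_0$. For any $0 < \varepsilon \le 1$ and any $t > 0$, there exist constants $K_1$, $K_2$, $K_3$, $K_4$, and $K_5$ such that the excess error

$$\mathcal{E}(r_n) - \mathcal{E}(r^*) \le (1 + \varepsilon)\Big(\inf_{r \in \mathcal{R}} \mathcal{E}(r) - \mathcal{E}(r^*)\Big) + \frac{K_1 \log(n\varepsilon^2\eta_0^2)}{n\varepsilon^2\eta_0} + \frac{K_2 t \log(n\varepsilon^2\eta_0^2/t)}{n\varepsilon^2\eta_0} + \frac{K_3}{n^2\varepsilon^2\eta_0} + \frac{K_4 t}{n} + \frac{K_5(1 + \sqrt{t})}{n\varepsilon}$$

holds with probability at least $1 - \log_q\big(\tfrac{qn}{\delta}\big) e^{-t}$.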


Appendix A: Proof of Theorem 1

To prove theorem 1, we need the following tensorization inequality in Massart (2000b, lemma 8) and Boucheron, Lugosi, and Massart (2000, lemma 2.3):

Lemma 6. Let $x_1, \ldots, x_n$ be independent random variables with values in $X$ and $x'_1, \ldots, x'_n$ independent copies of $x_1, \ldots, x_n$. Let $V = V(x_1, \ldots, x_n)$ and $V^k = V^k(x_1, \ldots, x_{k-1}, x'_k, x_{k+1}, \ldots, x_n)$ be measurable functions. For any $\lambda$,

$$\lambda E\{Ve^{\lambda V}\} - E\{e^{\lambda V}\} \log E\{e^{\lambda V}\} \le \sum_{k=1}^{n} E\big\{\psi(-\lambda(V - V^k))\, e^{\lambda V} I_{[V - V^k \ge 0]}\big\}, \qquad (A.1)$$

where $\psi(\lambda) = \lambda(e^{\lambda} - 1)$. Let $V_k = V_k(x_1, \ldots, x_{k-1}, x_{k+1}, \ldots, x_n)$ be measurable functions. For any $\lambda$,

$$\lambda E\{Ve^{\lambda V}\} - E\{e^{\lambda V}\} \log E\{e^{\lambda V}\} \le \sum_{k=1}^{n} E\big\{\varphi(-\lambda(V - V_k))\, e^{\lambda V}\big\}, \qquad (A.2)$$

where $\varphi(\lambda) = e^{\lambda} - \lambda - 1$.

As for empirical processes, we first establish the concentration inequality for nonnegative functions using lemma 6.

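Theorem 7. Let $\mathcal{H}$ be a countable class of measurable and symmetric functions from $X \times X$ to $[0, 1]$ and let

$$V = \frac{1}{2(n-1)} \sup_{h \in \mathcal{H}} \sum_{i \neq j}^{n} h(x_i, x_j).$$

Then there holds

$$\log E e^{t(V - EV)} \le \frac{t^2 EV}{1 - 2t}, \quad \text{for } 0 \le t < \frac{1}{2}.$$

Furthermore we have for any $x > 0$,

$$P\{V \ge EV + x\} \le \exp\left\{ -\frac{x^2}{4EV + 4x} \right\}.$$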


Proof. Without loss of generality, we consider a finite class of functions $\mathcal{H} = (h_1, \ldots, h_N)$ (see Boucheron et al., 2000). Set

$$\tau = \min_{1 \le k \le N} \Big\{ k : \sum_{i \neq j}^{n} h_k(x_i, x_j) = \sup_{h \in \mathcal{H}} \sum_{i \neq j}^{n} h(x_i, x_j) \Big\}.$$

Let $V_k = \frac{1}{2(n-1)} \sum_{i \neq j;\ i, j \neq k} h_\tau(x_i, x_j)$. We have for any $1 \le k \le n$,

$$0 \le V - V_k \le \frac{1}{n-1} \sum_{i \neq k} h_\tau(x_k, x_i) \le 1. \qquad (A.3)$$

Note that the function $\varphi(\lambda) = e^{\lambda} - \lambda - 1$ is convex and $\varphi(0) = 0$. Thus, for any $\lambda$ and any $\mu \in [0, 1]$, we have $\varphi(-\lambda\mu) \le \varphi(-\lambda)\mu$. It follows from equations A.2 and A.3 that for any $\lambda$,

$$\lambda E\{Ve^{\lambda V}\} - E\{e^{\lambda V}\} \log E\{e^{\lambda V}\} \le 2\varphi(-\lambda) E\{Ve^{\lambda V}\}.$$

By making use of lemmas 6 and 3 in McDiarmid and Reed (2006), we complete the proof of theorem 7.

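Proof of Theorem 1. Let $(x'_1, \ldots, x'_n)$ be an i.i.d. copy of $(x_1, \ldots, x_n)$. Without loss of generality, we consider a finite class of functions $\mathcal{H} = (h_1, \ldots, h_N)$ (see Massart, 2000b). Set

$$\tau = \min_{1 \le k \le N} \Big\{ k : \Big| \sum_{i \neq j}^{n} h_k(x_i, x_j) \Big| = \sup_{h \in \mathcal{H}} \Big| \sum_{i \neq j}^{n} h(x_i, x_j) \Big| \Big\}.$$

Let $V = \frac{1}{2(n-1)} \big| \sum_{i \neq j}^{n} h_\tau(x_i, x_j) \big|$ and

$$V^k = \frac{1}{2(n-1)} \Big| \sum_{i \neq j;\ i, j \neq k} h_\tau(x_i, x_j) + 2\sum_{i \neq k}^{n} h_\tau(x'_k, x_i) \Big|.$$

We observe that $\psi(-\lambda) = -\lambda(e^{-\lambda} - 1) \le \lambda^2$ for all $\lambda \ge 0$. For each $k$,

$$V - V^k \le \frac{1}{n-1} \Big| \sum_{i \neq k}^{n} h_\tau(x_k, x_i) - \sum_{i \neq k}^{n} h_\tau(x'_k, x_i) \Big|,$$

it follows that

$$\psi(-\lambda(V - V^k))\, I_{[V - V^k \ge 0]} \le \frac{\lambda^2}{(n-1)^2} \Big( \sum_{i \neq k}^{n} h_\tau(x_k, x_i) - \sum_{i \neq k}^{n} h_\tau(x'_k, x_i) \Big)^2 \le \frac{\lambda^2}{n-1} \sum_{i \neq k}^{n} \big(h_\tau(x_k, x_i) - h_\tau(x'_k, x_i)\big)^2.$$

Inequality A.1 gives, for $\lambda > 0$,

$$\lambda E\{Ve^{\lambda V}\} - E\{e^{\lambda V}\} \log E\{e^{\lambda V}\} \le \lambda^2 E\{We^{\lambda V}\}, \qquad (A.4)$$

where $W = \sup_{h \in \mathcal{H}} \frac{1}{n-1} \sum_{k=1}^{n} \sum_{i \neq k}^{n} \big(h(x_k, x_i) - h(x'_k, x_i)\big)^2$.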

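Let $G(\lambda) = E e^{\lambda(V - EV)}$, for $0 < \lambda < 1$. We have from inequality A.4,

$$\frac{1}{\lambda} \frac{G'(\lambda)}{G(\lambda)} - \frac{1}{\lambda^2} \log G(\lambda) \le \frac{\log E[e^{\lambda W}]}{\lambda(1 - \lambda)}.$$

Integrating this inequality yields

$$\frac{1}{\lambda} \log G(\lambda) \le \frac{1}{1 - \lambda} \int_0^{\lambda} \frac{\log E[e^{10\mu\widetilde{W}}]}{\mu}\, d\mu, \qquad (A.5)$$

where $\widetilde{W} = \sup_{h \in \mathcal{H}} \frac{1}{n-1} \sum_{k=1}^{n} \sum_{i \neq k}^{n} h^2(x_k, x_i)$. Here we use the inequality $E[e^{\lambda W}] \le E[e^{10\lambda\widetilde{W}}]$, which can be obtained from the decoupling theorem (see De la Peña & Giné, 1999).

Without loss of generality, we assume $b = 1$ in theorem 1. We apply theorem 7 to $\widetilde{W}/2$ and get, for $0 < \mu < \frac{1}{40}$,

$$\log E e^{10\mu\widetilde{W}} \le 10\mu E\widetilde{W} + \frac{200\mu^2 E\widetilde{W}}{1 - 40\mu}.$$

For all $0 < \lambda < \frac{1}{40}$, inequality A.5 implies

$$\frac{1}{\lambda} \log G(\lambda) \le 10E\widetilde{W}\, \frac{1}{1 - \lambda} \int_0^{\lambda} \Big(1 + \frac{20\mu}{1 - 40\mu}\Big)\, d\mu.$$

This yields, by straightforward computation,

$$\log G(\lambda) \le 10E\widetilde{W}\, \frac{\lambda^2(1 - 30\lambda)}{(1 - \lambda)(1 - 40\lambda)}.$$

Finally, we get

$$\log G(\lambda) \le \frac{20E\widetilde{W}\lambda^2}{2(1 - 40\lambda)}.$$

By the Markov inequality, for all $x > 0$,

$$P\{V \ge EV + x\} \le \exp\left\{ -\frac{x^2}{40E\widetilde{W} + 80x} \right\},$$

or

$$P\Big\{ V \ge EV + \sqrt{40E\widetilde{W}x} + 40x \Big\} \le \exp\{-x\}.$$

Replacing $M$ by $\widetilde{W}/n$ and $V$ by $\frac{n}{2}\|U_n\|_{\mathcal{H}}$, we complete the proof of theorem 1.

Appendix B: Proof of Theorem 3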

Now we give the proof of the main results of this letter. Proof of Theorem 3. Recall that

t(δ) =EUn− P2H δ+ 160EUnH2 δ n + 80t n . Denote E_{n, j}(t) = sup h,g∈Hδ | (U_n− P2_{)(h − g) |≤} t(δj) . By theorem 1, P((E_{n, j}(t))c_{) ≤ exp(−t).}

Letδ_j≥ δ. In the event that E_{n, j}(t), we conclude that if h_n∈H_δ j\Hδj+1, then∀0 < ε < δ_j₊₁, ∀g ∈H_ε, δj+1<L(hn) ≤ P2_h n− P2g+ ε ≤Un(hn) − Un(g) + (P2− Un)(hn− g) + ε ≤ sup h,g∈Hδ_j | P2_{− U} n)(h − g) | +ε ≤ t(δj) + ε ≤ ,q t (δ)δj+ ε.

(23)

Consequently, we have_t,q(δ) ≥1

q > 2q1. As

,q

t (δ) is decreasing with re-spect toδ, we obtain that

δ ≤ t,q 1 2q = δn(t).

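We can conclude that for $\delta_j \ge \delta \ge \delta_n(t)$, $\{h_n \in \mathcal{H}_{\delta_j} \setminus \mathcal{H}_{\delta_{j+1}}\} \subset (E_{n,j}(t))^c$. Therefore, for $\delta \ge \delta_n(t)$, on the event $E_n(t) = \bigcap_{j : \delta_j \ge \delta} E_{n,j}(t)$ we have $L(h_n) \le \delta$, implying that
\[
P\{L(h_n) > \delta\} \le \sum_{j : \delta_j \ge \delta} P\{(E_{n,j}(t))^c\} \le \log_q \frac{q}{\delta}\, e^{-t} .
\]
The proof of theorem 3 is complete.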
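The counting behind the last union bound can be illustrated with a short sketch. It assumes the geometric radii $\delta_j = q^{-j}$, $j = 0, 1, \ldots$, with $q > 1$ and $0 < \delta \le 1$ (the convention of Koltchinskii, 2006, taken here as an assumption), and the function names are illustrative. There are at most $\log_q (q/\delta)$ radii with $\delta_j \ge \delta$, each contributing at most $e^{-t}$.

```python
import math

def shells(delta, q):
    """Radii delta_j = q**(-j), j = 0, 1, ..., with delta_j >= delta (assumed convention)."""
    radii = []
    j = 0
    while q ** (-j) >= delta:
        radii.append(q ** (-j))
        j += 1
    return radii

def union_bound(delta, q, t):
    """Sum of exp(-t) over all shells, i.e. (number of shells) * exp(-t)."""
    return len(shells(delta, q)) * math.exp(-t)

def stated_bound(delta, q, t):
    """The bound log_q(q / delta) * exp(-t) used in the proof."""
    return math.log(q / delta, q) * math.exp(-t)

# Example: q = 2 and delta = 0.1 give the radii 1, 1/2, 1/4, 1/8.
example_radii = shells(0.1, 2.0)
```

For `q = 2.0` and `delta = 0.1` the four radii above are compared with $\log_2 20 \approx 4.32$, so the stated bound is slightly larger than the plain count.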

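Appendix C: Proof of Theorem 4

Recall that
\[
\Gamma_t(\delta) \le 2\varphi_n(\mathcal{H}; \delta) + \phi_n(\mathcal{H}; \delta) + \sqrt{\frac{160\, \mathbb{E}\|U_n\|_{\mathcal{H}_\delta^2}\, t}{n}} + \frac{80t}{n} .
\]
Notice that $d(h, g) = \sqrt{P^2 (h - g)^2}$. For any $h \in \mathcal{H}_\delta$ there holds
\[
\begin{aligned}
d(h, h_{\mathcal{H}}) &\le d(h, h_*) + d(h_{\mathcal{H}}, h_*) \\
&\le \sqrt{A(P^2 h - P^2 h_*)} + \sqrt{A(P^2 h_{\mathcal{H}} - P^2 h_*)} \\
&\le \sqrt{A(P^2 h - P^2 h_{\mathcal{H}})} + 2\sqrt{A(P^2 h_{\mathcal{H}} - P^2 h_*)} \\
&\le \sqrt{A\delta} + 2\sqrt{A\Delta} \le \sqrt{2A(\delta + 4\Delta)},
\end{aligned}
\]
where $\Delta = P^2 h_{\mathcal{H}} - P^2 h_*$.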

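On the other hand, we have
\[
\mathbb{E} \sup_{h, g \in \mathcal{H}_\delta} |T_n(h - g)| \le 2\, \mathbb{E} \sup_{h \in \mathcal{H}_\delta} |T_n(h - h_{\mathcal{H}})|
\]
and
\[
\mathbb{E} \sup_{h, g \in \mathcal{H}_\delta} |W_n(h - g)| \le 2\, \mathbb{E} \sup_{h \in \mathcal{H}_\delta} |W_n(h - h_{\mathcal{H}})| .
\]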

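Therefore,
\[
\varphi_n(\delta) \le 2\theta_n(2A(\delta + 4\Delta)) \quad \text{and} \quad \phi_n(\delta) \le 2\vartheta_n(2A(\delta + 4\Delta)) .
\]
Then there exists a constant $C > 0$ such that
\[
\Gamma_t(\delta) \le C\theta_n(2A(\delta + 4\Delta)) + C\vartheta_n(2A(\delta + 4\Delta)) + C\eta_n(2A(\delta + 4\Delta))\, t^{1/2} + \frac{Ct}{n}
=: \chi_1(\delta) + \chi_2(\delta) + \chi_3(\delta) + \chi_4(\delta) .
\]
According to the property of the $\sharp$-transform, it follows that
\[
\delta_n(t) = \Gamma_t^{\sharp, q}\Bigl( \frac{1}{2q} \Bigr) \le \Gamma_t^{\sharp}\Bigl( \frac{1}{2q} \Bigr) .
\]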
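To make the transforms concrete, the following sketch evaluates them on a finite grid. It assumes the definitions of Koltchinskii (2006), namely $\psi^{\flat}(\delta) = \sup_{\sigma \ge \delta} \psi(\sigma)/\sigma$ and $\psi^{\sharp}(\varepsilon) = \inf\{\delta > 0 : \psi^{\flat}(\delta) \le \varepsilon\}$ on $(0, 1]$. The function names, the grid, and the numerical constants are illustrative choices, not taken from the letter.

```python
import math

def flat_transform(psi, grid):
    """Values sup_{sigma >= delta} psi(sigma)/sigma for each delta in an ascending grid."""
    ratios = [psi(s) / s for s in grid]
    out = [0.0] * len(grid)
    running = float("-inf")
    for k in range(len(grid) - 1, -1, -1):  # sweep from the largest radius down
        running = max(running, ratios[k])
        out[k] = running
    return out

def sharp_transform(psi, grid, eps):
    """Smallest grid point delta whose flat transform is <= eps (None if there is none)."""
    for delta, value in zip(grid, flat_transform(psi, grid)):
        if value <= eps:
            return delta
    return None

grid = [k / 10000.0 for k in range(1, 10001)]  # discretization of (0, 1]

# A constant function psi = C*t/n has sharp transform C*t/(mu*n).
C, t, n, mu = 1.0, 2.0, 1000.0, 0.0625        # hypothetical constants
sharp_const = sharp_transform(lambda d: C * t / n, grid, mu)

# A square-root function psi = K*sqrt(delta) has sharp transform (K/eps)^2.
K = 0.05
sharp_sqrt = sharp_transform(lambda d: K * math.sqrt(d), grid, mu)
```

Because the flat transform is nonincreasing in $\delta$, the first grid point at which it drops below $\varepsilon$ approximates the infimum up to the grid spacing.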

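In order to estimate the $\sharp$-transform $\Gamma_t^{\sharp}(\frac{1}{2q})$, we bound the $\sharp$-transforms of $\chi_1(\delta)$, $\chi_2(\delta)$, $\chi_3(\delta)$, and $\chi_4(\delta)$, respectively. Let $\mu = \frac{1}{8q}$. Then the property of the $\sharp$-transform (see lemma 5 in Koltchinskii, 2006) gives, for all $0 < \tau \le 1$,
\[
\chi_1^{\sharp}(\mu) \le \frac{1}{2A}\, \theta_n^{\sharp}\Bigl( \frac{\tau\mu}{4CA} \Bigr) + 4\Delta\tau .
\]
Similarly,
\[
\chi_2^{\sharp}(\mu) \le \frac{1}{2A}\, \vartheta_n^{\sharp}\Bigl( \frac{\tau\mu}{4CA} \Bigr) + 4\Delta\tau, \qquad
\chi_3^{\sharp}(\mu) \le \frac{1}{2A}\, \eta_n^{\sharp}\Bigl( \frac{\tau\mu}{4CA\sqrt{t}} \Bigr) + 4\Delta\tau, \qquad
\chi_4^{\sharp}(\mu) \le \frac{Ct}{\mu n} .
\]
As a result,
\[
\delta_n(t) \le \frac{1}{2A}\, \theta_n^{\sharp}\Bigl( \frac{\tau\mu}{4CA} \Bigr) + \frac{1}{2A}\, \vartheta_n^{\sharp}\Bigl( \frac{\tau\mu}{4CA} \Bigr) + \frac{1}{2A}\, \eta_n^{\sharp}\Bigl( \frac{\tau\mu}{4CA\sqrt{t}} \Bigr) + 12\Delta\tau + \frac{Ct}{\mu n} .
\]
Taking $\varepsilon = 12\tau \in (0, 1]$ and $\mu = \frac{48C}{K}$, we deduce that
\[
\delta_n(t) \le \varepsilon\Delta + \frac{1}{A}\, \theta_n^{\sharp}\Bigl( \frac{\varepsilon}{KA} \Bigr) + \frac{1}{A}\, \vartheta_n^{\sharp}\Bigl( \frac{\varepsilon}{KA} \Bigr) + \frac{1}{A}\, \eta_n^{\sharp}\Bigl( \frac{\varepsilon}{KA\sqrt{t}} \Bigr) + \frac{Kt}{n},
\]
and the desired estimate follows. The proof of theorem 4 is complete.

Appendix D: The Contraction Principle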

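Let $\mathcal{A}$ denote a collection of $n \times n$ symmetric matrices $A$, and let $\varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. Rademacher variables. The matrices $A = (a_{ij})$ have zero diagonal, that is, $a_{ii} = 0$ for all $A \in \mathcal{A}$ and $i = 1, \ldots, n$.

Theorem 8. Let $\phi_{ij} : \mathbb{R} \to \mathbb{R}$, $i, j = 1, \ldots, n$, be functions such that $\phi_{ij}(0) = 0$ and
\[
|\phi_{ij}(\mu) - \phi_{ij}(\nu)| \le |\mu - \nu|, \quad \mu, \nu \in \mathbb{R}
\]
(that is, $\phi_{ij}$ are contractions). Let $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$ be convex and nondecreasing. Then, for any bounded subset $\mathcal{A}$ in $\mathbb{R}^{n \times n}$,
\[
\mathbb{E}\, \Phi\Bigl( \frac{1}{2} \sup_{A \in \mathcal{A}} \Bigl| \sum_{i=1}^{n} \sum_{j=1}^{n} \varepsilon_i \varepsilon_j \phi_{ij}(a_{ij}) \Bigr| \Bigr)
\le \mathbb{E}\, \Phi\Bigl( \sup_{A \in \mathcal{A}} \Bigl| \sum_{i=1}^{n} \sum_{j=1}^{n} \varepsilon_i \varepsilon_j a_{ij} \Bigr| \Bigr) .
\]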

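Proof. We first prove that if $\Phi : \mathbb{R} \to \mathbb{R}_+$ is convex and nondecreasing, then
\[
\mathbb{E}\, \Phi\Bigl( \sup_{A \in \mathcal{A}} \sum_{i=1}^{n} \sum_{j=1}^{n} \varepsilon_i \varepsilon_j \phi_{ij}(a_{ij}) \Bigr)
\le \mathbb{E}\, \Phi\Bigl( \sup_{A \in \mathcal{A}} \sum_{i=1}^{n} \sum_{j=1}^{n} \varepsilon_i \varepsilon_j a_{ij} \Bigr) . \tag{D.1}
\]
By conditioning and iteration, it suffices to show that if $T$ is a subset of $\mathbb{R}^3$ and $\phi$ is a contraction on $\mathbb{R}$ such that $\phi(0) = 0$, then, for any $j \ne i$,
\[
\mathbb{E}_i \Bigl[ \sup_{t \in T} [t_1 + \varepsilon_i t_2 + \varepsilon_i \varepsilon_j \phi(t_3)] \,\Big|\, \varepsilon_j \Bigr]
\le \mathbb{E}_i \Bigl[ \sup_{t \in T} [t_1 + \varepsilon_i t_2 + \varepsilon_i \varepsilon_j t_3] \,\Big|\, \varepsilon_j \Bigr] . \tag{D.2}
\]
The above inequality is equivalent to
\[
\frac{1}{2} \sup_{t \in T} [t_1 + t_2 + \varepsilon_j \phi(t_3)] + \frac{1}{2} \sup_{t \in T} [t_1 - t_2 - \varepsilon_j \phi(t_3)]
\le \frac{1}{2} \sup_{t \in T} [t_1 + t_2 + \varepsilon_j t_3] + \frac{1}{2} \sup_{t \in T} [t_1 - t_2 - \varepsilon_j t_3] .
\]
We can prove the above inequality as in Ledoux and Talagrand (1991) and Koltchinskii (2011).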

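In the general case, we have
\[
\begin{aligned}
&\mathbb{E}\, \Phi\Bigl( \sup_{A \in \mathcal{A}} \sum_{i=1}^{n} \sum_{j > i} \varepsilon_i \varepsilon_j \phi_{ij}(a_{ij}) \Bigr) \\
&\quad = \mathbb{E}_{\varepsilon_n \cdots \varepsilon_2} \mathbb{E}_{\varepsilon_1} \Phi\Bigl( \sup_{A \in \mathcal{A}} \Bigl( \sum_{i=2}^{n} \varepsilon_i \Bigl( \sum_{j > i} \varepsilon_j \phi_{ij}(a_{ij}) \Bigr) + \sum_{j=2}^{n} \varepsilon_1 \varepsilon_j \phi_{1j}(a_{1j}) \Bigr) \Bigr) \\
&\quad = \mathbb{E}_{\varepsilon_n \cdots \varepsilon_2} \mathbb{E}_{\varepsilon_1} \Phi\Bigl( \sup_{A \in \mathcal{A}} \Bigl( \sum_{i=2}^{n} \varepsilon_i \Bigl( \sum_{j > i} \varepsilon_j \phi_{ij}(a_{ij}) \Bigr) + \varepsilon_1 \sum_{j=3}^{n} \varepsilon_j \phi_{1j}(a_{1j}) + \varepsilon_1 \varepsilon_2 \phi_{12}(a_{12}) \Bigr) \Bigr) .
\end{aligned}
\]
Using inequality D.2, we have
\[
\begin{aligned}
&\mathbb{E}_{\varepsilon_n \cdots \varepsilon_2} \mathbb{E}_{\varepsilon_1} \Phi\Bigl( \sup_{A \in \mathcal{A}} \Bigl( \sum_{i=2}^{n} \varepsilon_i \Bigl( \sum_{j > i} \varepsilon_j \phi_{ij}(a_{ij}) \Bigr) + \varepsilon_1 \sum_{j=3}^{n} \varepsilon_j \phi_{1j}(a_{1j}) + \varepsilon_1 \varepsilon_2 \phi_{12}(a_{12}) \Bigr) \Bigr) \\
&\quad \le \mathbb{E}_{\varepsilon_n \cdots \varepsilon_2} \mathbb{E}_{\varepsilon_1} \Phi\Bigl( \sup_{A \in \mathcal{A}} \Bigl( \sum_{i=2}^{n} \varepsilon_i \Bigl( \sum_{j > i} \varepsilon_j \phi_{ij}(a_{ij}) \Bigr) + \varepsilon_1 \sum_{j=3}^{n} \varepsilon_j \phi_{1j}(a_{1j}) + \varepsilon_1 \varepsilon_2 a_{12} \Bigr) \Bigr) \\
&\quad \le \cdots \le \mathbb{E}_{\varepsilon_n \cdots \varepsilon_2} \mathbb{E}_{\varepsilon_1} \Phi\Bigl( \sup_{A \in \mathcal{A}} \Bigl( \sum_{i=2}^{n} \varepsilon_i \Bigl( \sum_{j > i} \varepsilon_j \phi_{ij}(a_{ij}) \Bigr) + \varepsilon_1 \sum_{j=2}^{n} \varepsilon_j a_{1j} \Bigr) \Bigr) \\
&\quad = \mathbb{E}_{\varepsilon_1} \mathbb{E}_{\varepsilon_n \cdots \varepsilon_3} \mathbb{E}_{\varepsilon_2} \Phi\Bigl( \sup_{A \in \mathcal{A}} \Bigl( \sum_{i=3}^{n} \varepsilon_i \Bigl( \sum_{j > i} \varepsilon_j \phi_{ij}(a_{ij}) \Bigr) + \varepsilon_1 \sum_{j=3}^{n} \varepsilon_j a_{1j} + \varepsilon_2 \Bigl( \varepsilon_1 a_{12} + \sum_{j=4}^{n} \varepsilon_j \phi_{2j}(a_{2j}) \Bigr) + \varepsilon_2 \varepsilon_3 \phi_{23}(a_{23}) \Bigr) \Bigr) .
\end{aligned}
\]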

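By an induction argument, we complete the proof of inequality D.1. Note that $(-\varepsilon_i \varepsilon_j)$ has the same distribution as $(\varepsilon_i \varepsilon_j)$. Since $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$ is convex and nondecreasing, we have
\[
\begin{aligned}
\mathbb{E}\, \Phi\Bigl( \frac{1}{2} \sup_{A \in \mathcal{A}} \Bigl| \sum_{i=1}^{n} \sum_{j=1}^{n} \varepsilon_i \varepsilon_j \phi_{ij}(a_{ij}) \Bigr| \Bigr)
&\le \frac{1}{2}\, \mathbb{E}\, \Phi\Bigl( \Bigl( \sup_{A \in \mathcal{A}} \sum_{i=1}^{n} \sum_{j=1}^{n} \varepsilon_i \varepsilon_j \phi_{ij}(a_{ij}) \Bigr)_+ \Bigr)
+ \frac{1}{2}\, \mathbb{E}\, \Phi\Bigl( \Bigl( \sup_{A \in \mathcal{A}} \sum_{i=1}^{n} \sum_{j=1}^{n} -\varepsilon_i \varepsilon_j \phi_{ij}(a_{ij}) \Bigr)_+ \Bigr) \\
&\le \mathbb{E}\, \Phi\Bigl( \sup_{A \in \mathcal{A}} \Bigl| \sum_{i=1}^{n} \sum_{j=1}^{n} \varepsilon_i \varepsilon_j a_{ij} \Bigr| \Bigr),
\end{aligned}
\]
where $(x)_+ = \max\{0, x\}$. The proof of theorem 8 is complete.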
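Theorem 8 can be sanity-checked on a small example by enumerating all sign vectors. The sketch below is only an illustration under stated assumptions: it takes $\Phi(x) = x$, the contraction $\phi_{ij}(x) = x^2/2$ on $[-1, 1]$, and a hypothetical class of three matrices with constant off-diagonal entries. The helper names are illustrative. A check on one class is of course not a proof.

```python
import itertools

def constant_offdiag(n, value):
    """Symmetric n x n matrix with zero diagonal and all off-diagonal entries equal to value."""
    return [[0.0 if i == j else value for j in range(n)] for i in range(n)]

def chaos(eps, a, phi=lambda x: x):
    """Rademacher chaos sum_{i != j} eps_i * eps_j * phi(a_ij)."""
    n = len(eps)
    return sum(eps[i] * eps[j] * phi(a[i][j])
               for i in range(n) for j in range(n) if i != j)

def expected_sup_abs(matrices, phi=lambda x: x):
    """Exact E sup_A |chaos|, averaging over all 2^n Rademacher sign vectors."""
    n = len(matrices[0])
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        total += max(abs(chaos(eps, a, phi)) for a in matrices)
    return total / 2 ** n

# Hypothetical class: entries lie in [-1, 1], where phi(x) = x^2 / 2 is a contraction.
matrices = [constant_offdiag(4, value) for value in (-1.0, 0.5, 0.25)]
lhs = 0.5 * expected_sup_abs(matrices, phi=lambda x: 0.5 * x * x)
rhs = expected_sup_abs(matrices)
```

Here `lhs` is the left-hand side of theorem 8 and `rhs` the right-hand side. Replacing `matrices` by any finite class with entries in $[-1, 1]$ gives the analogous comparison, at a cost of $2^n$ evaluations per matrix.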

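In particular, taking $\Phi(x) = x$ and noting that, for $-1 \le x \le 1$, $\phi(x) = \frac{1}{2} x^2$ is a contraction with $\phi(0) = 0$, we obtain
\[
\mathbb{E} \sup_{h \in \mathcal{H}} \Bigl| \sum_{i=1}^{n} \sum_{j \ne i} \varepsilon_i \varepsilon_j h^2(x_i, x_j) \Bigr|
\le 4\, \mathbb{E} \sup_{h \in \mathcal{H}} \Bigl| \sum_{i=1}^{n} \sum_{j \ne i} \varepsilon_i \varepsilon_j h(x_i, x_j) \Bigr| . \tag{D.3}
\]

Acknowledgments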

We are grateful to the editor and anonymous reviewer for their valuable comments and suggestions that helped improve the original version of this letter. The research was partially supported by NSFC under grants 61472155 and 11371007.

References

Agarwal, S., & Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. In P. Auer & R. Meir (Eds.), Lecture Notes in Computer Science: Vol. 3559. Proceedings of the 18th annual conference on learning theory (pp. 32–47). Berlin: Springer.

Arcones, M. A. (1995). A Bernstein type inequality for U-statistics and U-processes. Statist. Probab. Lett., 22, 223–230.

Arcones, M. A., & Giné, E. (1993). Limit theorems for U-processes. Annals of Probability, 21, 1494–1542.

Bartlett, P. L., Bousquet, O., & Mendelson, S. (2005). Local Rademacher complexities. Annals of Statistics, 33, 1497–1537.

Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.

Boucheron, S., Lugosi, G., & Massart, P. (2000). A sharp concentration inequality with applications. Random Structures Algorithms, 16, 277–292.

Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris Ser. I, 334, 495–500.

Clémençon, S., Lugosi, G., & Vayatis, N. (2008). Ranking and empirical minimization of U-statistics. Annals of Statistics, 36(2), 844–874.

Cohen, W. W., Schapire, R. E., & Singer, Y. (1999). Learning to order things. Journal of Artificial Intelligence Research, 10, 243–270.

Cossock, D., & Zhang, T. (2008). Statistical analysis of Bayes optimal subset ranking. IEEE Trans. Info. Theory, 54, 5140–5154.

De la Peña, V. H., & Giné, E. (1999). Decoupling: From dependence to independence. New York: Springer.

Dudley, R. M. (1999). Uniform central limit theorems. Cambridge: Cambridge University Press.

Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.

Hüllermeier, E., Fürnkranz, J., Cheng, W., & Brinker, K. (2008). Label ranking by learning pairwise preference. Artif. Intell., 172, 1897–1916.

Klein, T., & Rio, E. (2005). Concentration around the mean for maxima of empirical processes. Annals of Probability, 33, 1060–1077.

Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34, 2593–2656.

Koltchinskii, V. (2011). Oracle inequalities in empirical risk minimization and sparse recovery problems. New York: Springer.

Koltchinskii, V., & Panchenko, D. (2000). Rademacher processes and bounding the risk of function learning. In E. Giné, D. Mason, & J. Wellner (Eds.), High dimensional probability II, 443–459.

Ledoux, M. (1996). On Talagrand’s deviation inequalities for product measures.


Ledoux, M., & Talagrand, M. (1991). Probability in Banach spaces. New York: Springer-Verlag.

Liu, T. Y. (2011). Learning to rank for information retrieval. New York: Springer.

Massart, P. (2000a). Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse Math., 6, 245–303.

Massart, P. (2000b). About the constants in Talagrand’s inequality for empirical processes. Annals of Probability, 28, 863–884.

McDiarmid, C., & Reed, B. (2006). Concentration for self-bounding functions and an inequality of Talagrand. Random Structures and Algorithms, 29, 549–557.

Nolan, D., & Pollard, D. (1987). U-processes: Rates of convergence. Annals of Statistics, 15, 780–799.

Pahikkala, T., Tsivtsivadze, E., Airola, A., Boberg, J., & Salakoski, T. (2007). Learning to rank with pairwise regularized least-squares. In SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, 80, 27–33.

Rejchel, W. (2012). On ranking and generalization bounds. Journal of Machine Learning Research, 13, 1373–1392.

Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22, 28–76.

Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math., 126, 505–563.

van der Vaart, A., & Wellner, J. (1996). Weak convergence and empirical processes. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Ying, Y., & Campbell, C. (2010). Rademacher chaos complexities for learning the kernel problem. Neural Computation, 22, 2858–2886.