
U-Processes and Preference Learning

Hong Li


Chuanbao Ren


School of Mathematics and Statistics, Huazhong University of Science and Technology, Wuhan 430074, China

Luoqing Li


Faculty of Mathematics and Statistics, Hubei University, Wuhan 430062, China

Preference learning has attracted great attention in machine learning. In this letter we propose a learning framework for pairwise loss based on empirical risk minimization of U-processes via Rademacher complexity. We first establish a uniform version of the Bernstein inequality for U-processes of degree 2 via the entropy method. Then we bound the excess risk by using the Bernstein inequality and peeling techniques. Finally, we apply the excess risk bound to pairwise preference learning and derive the convergence rates of pairwise preference learning algorithms with squared loss and indicator loss by using empirical risk minimization with respect to U-processes.

1 Introduction

Preference learning has attracted considerable attention in machine learning in recent years, with applications including the design of search engines, information retrieval, and movie recommendation systems. Preference learning involves predicting an ordering of the data points rather than predicting a single numerical value, as in regression, or a class label, as in classification.

Several methods based on the so-called pairwise approach have been developed and successfully applied to information retrieval (Liu, 2011). This approach takes sample pairs and formalizes the problem of preference learning as that of classification and regression. It collects sample pairs from the entire samples, and for each pair, it assigns a label representing the relative relevance of the two samples.

Preference learning aims to learn a binary preference relation, which is a function of two variables (Cohen, Schapire, & Singer, 1999; Freund, Iyer, Schapire, & Singer, 2003; Hüllermeier, Fürnkranz, Cheng, & Brinker, 2008; Pahikkala, Tsivtsivadze, Airola, Boberg, & Salakoski, 2007; Clémençon, Lugosi, & Vayatis, 2008; Rejchel, 2012). Given two input points, the preference relation evaluates whether the first point ranks before the second one. A binary preference relation learned from data is not necessarily consistent in the sense of transitivity.

Neural Computation 26, 2896–2924 (2014). © 2014 Massachusetts Institute of Technology. doi:10.1162/NECO_a_00674

Preference learning is distinct from both classification and regression. It is natural to ask what kinds of properties hold for algorithms for this problem and, in particular, whether tools that have been applied to study the excess risk of classification and regression algorithms can be adapted to study the excess risk of preference algorithms. The application of U-processes to the generalization analysis of excess ranking risk was first developed in Clémençon et al. (2008). The authors gave a novel moment inequality for degenerate U-processes and investigated the empirical risk minimization of U-statistics for the ranking problem. Rejchel (2012) extended the technique of local Rademacher complexity to U-statistics and obtained bounds of the excess risk for convex loss functions. (For an excellent account of the theory of U-statistics and U-processes, see de la Peña & Giné, 1999.)

We propose a new framework for pairwise loss functions based on empirical risk minimization of U-processes. The framework is general and can be applied directly, without deriving bounds separately for each risk minimization problem.

First, we derive a uniform version of Bernstein inequality of U-processes of degree 2 via the entropy methods by Ledoux (1996). Our concentration inequality is new and is suitable for preference learning.

Second, we estimate the bounds of the excess risk of the empirical risk minimizer of U-processes. Our approach is based on the theory of U-processes, and the key tools are the Bernstein concentration inequality and the local Rademacher complexity. This generalizes the results of Koltchinskii (2006) for empirical processes.

Third, we investigate the convergence rates of pairwise preference learning algorithms with squared loss and indicator loss by using empirical risk minimization with respect to U-processes. The convergence rates are fast and are the same as those in Clémençon et al. (2008) and Rejchel (2012), but our method is different from theirs.

This letter is organized as follows. Section 2 discusses the U-processes. We establish a uniform version of Bernstein inequality for U-processes. In section 3, we give the bounds of excess risk based on the local Rademacher complexity developed by Koltchinskii (2006) for the empirical processes. In section 4, we bound the homogeneous Rademacher chaos process of order 2 via the entropy integral. In section 5, we apply the excess risk bounds to the pairwise preference algorithm and derive the convergence rates. The Bernstein inequality and the error bounds of the excess risk of the U-processes are proved in the appendixes.


2 Bernstein Inequality for U-Processes

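In this section we establish a uniform version of the Bernstein inequality for suprema of U-processes of degree 2. Let $X$ be an input space, $P$ a probability measure on $X$, and $P^2 = P \otimes P$ the product probability measure on $X \times X$. We denote by $\mathcal{H}$ a class of measurable functions from $X \times X$ into $\mathbb{R}$. The U-statistic of $h \in \mathcal{H}$ is defined as

$$U_n(h) = \frac{1}{n(n-1)} \sum_{i \neq j}^{n} h(x_i, x_j).$$

As usual, we use the notations

$$\|F\|_{\mathcal{H}} = \sup_{h \in \mathcal{H}} |F(h)| \quad \text{and} \quad \|F\|_{\mathcal{H}^2} = \sup_{h \in \mathcal{H}} |F(h^2)|$$

for any functional $F$ on $\mathcal{H}$, where $\mathcal{H} = \{h : X \times X \to \mathbb{R}\}$ and $\mathcal{H}^2 = \{h^2 : h \in \mathcal{H}\}$.

Then the supremum of the U-process $U_n(h)$ indexed by the function class $\mathcal{H}$ is defined as

$$\|U_n\|_{\mathcal{H}} = \frac{1}{n(n-1)} \sup_{h \in \mathcal{H}} \left| \sum_{i \neq j}^{n} h(x_i, x_j) \right|.$$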
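As a small numerical illustration of these definitions, the following sketch evaluates a U-statistic of degree 2 and the supremum over a finite class of kernels. The function names and the two example kernels are illustrative assumptions and do not come from the letter.

```python
from itertools import permutations


def u_statistic(h, xs):
    """U_n(h) = 1/(n(n-1)) * sum over ordered pairs i != j of h(x_i, x_j)."""
    n = len(xs)
    # permutations(xs, 2) enumerates ordered pairs of distinct positions.
    total = sum(h(x_i, x_j) for x_i, x_j in permutations(xs, 2))
    return total / (n * (n - 1))


def sup_u_process(kernels, xs):
    """Supremum of |U_n(h)| over a finite (hence countable) class of kernels."""
    return max(abs(u_statistic(h, xs)) for h in kernels)


# Two symmetric example kernels (assumptions chosen only for illustration).
def abs_difference(x, y):
    return abs(x - y)


def product(x, y):
    return x * y


sample = [0.0, 1.0, 3.0]
u_abs = u_statistic(abs_difference, sample)
u_sup = sup_u_process([abs_difference, product], sample)
```

The direct sum costs $O(n^2)$ kernel evaluations, which is adequate for small samples.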

We assume for simplicity that $\mathcal{H}$ is a countable class of functions. This condition can easily be replaced by the standard measurability assumptions known in the theory of empirical processes (Dudley, 1999; van der Vaart & Wellner, 1996; Koltchinskii & Panchenko, 2000; Talagrand, 1994) and U-processes (Arcones & Giné, 1993; de la Peña & Giné, 1999); we do not make the countability assumption in some of the examples below.

Talagrand (1996) obtained new concentration inequalities for empirical processes. Then Ledoux (1996) developed these types of inequalities with explicit constants using entropy methods. The Talagrand-type inequalities for empirical processes were also developed further (Massart, 2000b; Bousquet, 2002; Klein & Rio, 2005) and were used to investigate nonparametric estimation and machine learning (Massart, 2000a; Koltchinskii, 2006; Bartlett, Bousquet, & Mendelson, 2005).

There are some concentration inequalities for U-processes (Arcones, 1995, for instance). For our purpose we establish a uniform version of the Bernstein inequality for suprema of U-processes of degree 2. This concentration inequality is more convenient for our analysis and is one of the main contributions of this letter.


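Theorem 1. Let $\mathcal{H}$ be a class of measurable and symmetric functions from $X \times X$ to $[-b, b]$, $b > 0$. We have for any positive number $x$,

$$\mathbb{P}\left\{ \|U_n\|_{\mathcal{H}} \geq \mathbb{E}\|U_n\|_{\mathcal{H}} + \sqrt{\frac{160\, \mathbb{E}\|U_n\|_{\mathcal{H}^2}\, x}{n}} + \frac{80 b x}{n} \right\} \leq \exp\{-x\}.$$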

This inequality can be used to obtain some upper bounds of excess risk of preference learning algorithms. We postpone its proof to appendix A.

The expectations of suprema of U-processes involve the Rademacher complexity via symmetrization techniques. Rademacher complexity refers to the data-dependent estimates of the complexity of a function class. Several authors have considered Rademacher complexity (Bartlett & Mendelson, 2002; Ying & Campbell, 2010) and local Rademacher complexity (Bartlett et al., 2005; Koltchinskii, 2006).

Let $x = \{x_1, \ldots, x_n\}$ be independent and identically distributed (i.i.d.) random variables according to the distribution $P$ on $X$, and let $\mathcal{F}$ be a set of real-valued functions on $X$. The empirical Rademacher complexity of $\mathcal{F}$ is the random variable

$$\hat{R}_n^{(1)}(\mathcal{F}) = \mathbb{E}_\varepsilon \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(x_i) \right|,$$

where $\varepsilon = \{\varepsilon_1, \ldots, \varepsilon_n\}$ are independent uniform $\pm 1$-valued Rademacher random variables. The Rademacher complexity of $\mathcal{F}$ is $R_n^{(1)}(\mathcal{F}) = \mathbb{E}\hat{R}_n^{(1)}(\mathcal{F})$. Some basic properties of Rademacher complexities can be found in Bartlett and Mendelson (2002).
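For a finite function class, the empirical Rademacher complexity can be approximated by averaging over random sign vectors. The sketch below is a Monte Carlo estimate, not an exact computation; the function name, the number of draws, and the representation of a class as a list of value vectors $(f(x_1), \ldots, f(x_n))$ are assumptions made for illustration.

```python
import random


def empirical_rademacher(function_values, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps sup_f |(1/n) sum_i eps_i f(x_i)|.

    function_values: list of lists; entry k holds (f_k(x_1), ..., f_k(x_n)).
    """
    rng = random.Random(seed)
    n = len(function_values[0])
    total = 0.0
    for _ in range(num_draws):
        # One draw of independent uniform +/-1 Rademacher signs.
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(
            abs(sum(e * v for e, v in zip(signs, values))) / n
            for values in function_values
        )
    return total / num_draws


single = empirical_rademacher([[1.0, 1.0]])
pair = empirical_rademacher([[1.0, 1.0], [1.0, -1.0]])
```

With the same seed, both calls use the same sign draws, so enlarging the class can only increase the estimate.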

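The homogeneous Rademacher chaos process of order 2 is also a random variable. We refer to the expectation of the supremum,

$$\hat{R}_n^{(2)}(\mathcal{H}) = \mathbb{E}_\varepsilon \sup_{h \in \mathcal{H}} \left| \frac{1}{n(n-1)} \sum_{i \neq j}^{n} \varepsilon_i \varepsilon_j h(x_i, x_j) \right|,$$

as the empirical Rademacher chaos complexity over $\mathcal{H}$. The corresponding Rademacher chaos complexity is $R_n^{(2)}(\mathcal{H}) = \mathbb{E}\hat{R}_n^{(2)}(\mathcal{H})$.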

For U-statistics and U-processes, the Hoeffding decomposition is a basic tool. We state it as follows (de la Peña & Giné, 1999):

Lemma 1 (Hoeffding decomposition). The U-statistic

$$U_n(h) = \frac{1}{n(n-1)} \sum_{i \neq j}^{n} h(x_i, x_j)$$

can be decomposed into the following form:

$$U_n(h) = P^2 h + 2 T_n(h) + W_n(h),$$

where

$$T_n(h) = \frac{1}{n} \sum_{i=1}^{n} \left( Ph(x_i) - P^2 h \right) = P_n P h - P^2 h$$

and

$$W_n(h) = \frac{1}{n(n-1)} \sum_{i \neq j}^{n} \left( h(x_i, x_j) - Ph(x_i) - Ph(x_j) + P^2 h \right).$$
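The decomposition is an algebraic identity for symmetric kernels and can be checked numerically. The sketch below uses the kernel $h(x, y) = xy$ with $x$ uniform on $[0, 1]$, for which $Ph(x) = x/2$ and $P^2 h = 1/4$; the kernel and the distribution are assumptions chosen only because these projections have closed forms.

```python
import random
from itertools import permutations


def hoeffding_terms(xs):
    """Return (U_n, P^2 h, T_n, W_n) for h(x, y) = x * y, x ~ Uniform[0, 1]."""
    n = len(xs)
    pairs = list(permutations(xs, 2))   # ordered pairs with i != j
    p2h = 0.25                          # P^2 h = E[x] * E[y]

    def ph(x):
        return x / 2.0                  # Ph(x) = x * E[y]

    u_n = sum(x * y for x, y in pairs) / (n * (n - 1))
    t_n = sum(ph(x) for x in xs) / n - p2h
    w_n = sum(x * y - ph(x) - ph(y) + p2h for x, y in pairs) / (n * (n - 1))
    return u_n, p2h, t_n, w_n


rng = random.Random(1)
sample = [rng.random() for _ in range(40)]
u_n, p2h, t_n, w_n = hoeffding_terms(sample)
gap = abs(u_n - (p2h + 2.0 * t_n + w_n))
```

Up to floating-point rounding, `gap` is zero for any sample, since the identity does not depend on the data being drawn from $P$.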

By the Hoeffding decomposition, the U-process is decomposed as a sum of i.i.d. random variables plus a degenerate U-statistic. In our analysis, the sum of i.i.d. random variables will be bounded by the Rademacher complexity, while the degenerate part will be controlled by the homogeneous Rademacher chaos processes of order 2.

3 Performance of Excess Risk

The general theory of empirical risk minimization was developed by Vapnik and Chervonenkis (Vapnik, 1998). Concentration inequalities provided a basic tool and played an important role in analyzing empirical risk minimization algorithms. Later, new concentration inequalities led to tighter generalization bounds via peeling methods (Massart, 2000a; Bartlett et al., 2005; Koltchinskii, 2006). The general approach described in Koltchinskii (2006, 2011) was the motivation for this letter.

We first describe the algorithm of empirical risk minimization based on U-processes. The target function $h_{\mathcal{H}}$ over $\mathcal{H}$ is defined by

$$h_{\mathcal{H}} = \arg\min_{h \in \mathcal{H}} P^2(h).$$

Since the distribution $P$ is unknown, we can construct only an approximation of $h_{\mathcal{H}}$ based on the given samples $x_1, \ldots, x_n$. One way is to find the empirical minimizer of the U-process, defined as

$$h_n = \arg\min_{h \in \mathcal{H}} U_n(h).$$

The excess risk of $h_n$ is defined as

$$\mathcal{L}(h_n) = P^2(h_n) - P^2(h_{\mathcal{H}}).$$

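The excess risk is a natural measure of accuracy of this approximation. An upper bound of the excess risk follows from theorem 1 and the inequality

$$\mathcal{L}(h_n) \leq 2 \|U_n - P^2\|_{\mathcal{H}}.$$

Theorem 2. Assume that $\mathcal{H} := \{h : X \times X \to [0, 1]\}$. With probability at least $1 - e^{-x}$, $x > 0$, there holds

$$\mathcal{L}(h_n) \leq 2\, \mathbb{E}\|U_n - P^2\|_{\mathcal{H}} + \sqrt{\frac{640\, \mathbb{E}\|U_n\|_{\mathcal{H}^2}\, x}{n}} + \frac{160 x}{n}. \tag{3.1}$$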

Theorem 2 provides global estimates of the complexity of the function class $\mathcal{H}$. As a result, using the global Rademacher complexity, the error rate is at least of the order of $1/\sqrt{n}$. To get a fast rate, we consider the local Rademacher complexities. The reason is that the algorithm is likely to pick functions with a small error, and these form a small subset of the entire function class.

We follow Koltchinskii (2006) in developing concentration inequalities for the excess risk based on U-processes. Our concentration inequality is new. It is suitable for analyzing preference learning algorithms and is interesting in its own right.

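We need more notation. For $\delta > 0$, we define the $\delta$-minimal set $\mathcal{H}_\delta \subset \mathcal{H}$ of the risk as follows:

$$\mathcal{H}_\delta = \{h \in \mathcal{H} : \mathcal{L}(h) \leq \delta\}.$$

Denote $\mathcal{H}'_\delta = \{h = h_1 - h_2 : h_1, h_2 \in \mathcal{H}_\delta\}$ and define

$$\Delta_t(\delta) = \mathbb{E}\|U_n - P^2\|_{\mathcal{H}'_\delta} + \sqrt{\frac{160\, \mathbb{E}\|U_n\|_{(\mathcal{H}'_\delta)^2}\, t}{n}} + \frac{80 t}{n}.$$

It follows from theorem 1 that for all $t > 0$,

$$\mathbb{P}\left\{ \sup_{h, g \in \mathcal{H}_\delta} |(U_n - P^2)(h - g)| \geq \Delta_t(\delta) \right\} \leq e^{-t}. \tag{3.2}$$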

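Note that $U_n(h_n) \leq U_n(h_{\mathcal{H}})$. From equation 3.2, for any $\delta > 0$, if $\mathcal{L}(h_n) < \delta$, then with probability at least $1 - e^{-t}$,

$$\mathcal{L}(h_n) = P^2 h_n - P^2 h_{\mathcal{H}} = U_n(h_n) - U_n(h_{\mathcal{H}}) + (P^2 - U_n)(h_n - h_{\mathcal{H}}) \leq \sup_{h, g \in \mathcal{H}_\delta} |(P^2 - U_n)(h - g)| \leq \Delta_t(\delta).$$

This implies that $\delta \leq \Delta_t(\delta)$ for the $\delta > 0$ satisfying $\mathcal{L}(h_n) < \delta$. Then, with the same probability, the excess risk $\mathcal{L}(h_n)$ will be uniformly bounded by the largest solution $\bar{\delta}$ of the inequality $\delta \leq \Delta_t(\delta)$. That is, the optimal $\bar{\delta}$ is the solution of $\delta = \Delta_t(\delta)$.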

In order to obtain the optimal $\delta$ such that $\mathcal{L}(h_n) \leq \delta$ with high probability, we use the iterative localization technique.

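Take $\delta^{(0)} = 1$, so that $\mathcal{H}_{\delta^{(0)}} = \mathcal{H}$. Assume, for simplicity, that the minimum of $P^2 h$ is attained at $h_{\mathcal{H}} \in \mathcal{H}$. Since $h_n, h_{\mathcal{H}} \in \mathcal{H}_{\delta^{(0)}}$ and $U_n(h_n) \leq U_n(h_{\mathcal{H}})$, we have, with probability at least $1 - e^{-t}$, from equation 3.2,

$$\mathcal{L}(h_n) = P^2 h_n - P^2 h_{\mathcal{H}} = U_n(h_n) - U_n(h_{\mathcal{H}}) + (P^2 - U_n)(h_n - h_{\mathcal{H}}) \leq \sup_{h, g \in \mathcal{H}_{\delta^{(0)}}} |(P^2 - U_n)(h - g)| \leq \Delta_t(\delta^{(0)}) \wedge 1 =: \delta^{(1)}.$$

This implies that $h_n \in \mathcal{H}_{\delta^{(1)}}$. We can repeat the argument to show that with probability at least $1 - 2e^{-t}$,

$$\mathcal{L}(h_n) \leq \Delta_t(\delta^{(1)}) \wedge 1 =: \delta^{(2)}.$$

Iterating the argument $N$ times shows that with probability at least $1 - Ne^{-t}$, we have $\mathcal{L}(h_n) \leq \delta^{(N)}$.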
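The iteration above can be simulated for a toy bound function. The sketch below assumes a hypothetical bound of the form $\sqrt{c\delta/n} + t/n$, which only mimics the typical shape of such bounds and is not the quantity defined in this letter; the constants are likewise illustrative.

```python
import math


def toy_bound(delta, n=1000, t=3.0, c=1.0):
    """Hypothetical stand-in for the localized bound: sqrt(c*delta/n) + t/n."""
    return math.sqrt(c * delta / n) + t / n


def localize(bound, num_iterations=200):
    """Iterate delta_(k+1) = min(bound(delta_k), 1), starting from delta_0 = 1."""
    deltas = [1.0]
    for _ in range(num_iterations):
        deltas.append(min(bound(deltas[-1]), 1.0))
    return deltas


deltas = localize(toy_bound)

# Closed-form fixed point of delta = sqrt(a*delta) + b with a = c/n, b = t/n.
a, b = 1.0 / 1000, 3.0 / 1000
fixed_point = ((math.sqrt(a) + math.sqrt(a + 4.0 * b)) / 2.0) ** 2
```

Because the toy bound is nondecreasing in $\delta$, the sequence decreases monotonically from 1 toward the largest fixed point, which matches the role of the largest solution of the fixed-point equation in the argument above.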

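We regard $\delta$ as a variable to construct a fixed-point equation based on the Bernstein inequality for U-processes. We cannot compute accurate solutions $\delta_0$ of the equation $\delta = \Delta_t(\delta)$, but we can find an upper bound of $\bar{\delta}$ through the $\flat$-transform and the $\sharp$-transform, which are involved in the definitions of various complexity measures of function classes in empirical risk minimization.

Definition 1. The $\flat$-transform and the $\sharp$-transform of $\psi$ are defined by

$$\psi^\flat(\delta) = \sup_{\alpha \geq \delta} \frac{\psi(\alpha)}{\alpha} \quad \text{and} \quad \psi^\sharp(\varepsilon) = \inf\{\delta > 0 : \psi^\flat(\delta) \leq \varepsilon\},$$

respectively.

It will sometimes be convenient to discretize the definitions of the $\flat$-transform and the $\sharp$-transform. Let $q > 1$ and $\delta_j = q^{-j}$. Define

$$\psi^{\flat, q}(\delta) = \sup_{\delta_j \geq \delta} \frac{\psi(\delta_j)}{\delta_j} \quad \text{and} \quad \psi^{\sharp, q}(\varepsilon) = \inf\{\delta > 0 : \psi^{\flat, q}(\delta) \leq \varepsilon\}.$$

Some descriptions and properties can be found in Koltchinskii (2006).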
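The two transforms can be approximated on a finite ascending grid of $\delta$ values. The sketch below truncates the supremum to the grid, so it is only an approximation; the grid, the function names, and the test function $\psi(\delta) = \sqrt{\delta/n}$ are assumptions for illustration. For this $\psi$, the supremum of $\psi(\alpha)/\alpha$ over $\alpha \geq \delta$ is $1/\sqrt{n\delta}$, so the infimum defining the second transform is $1/(n\varepsilon^2)$.

```python
import math


def flat_transform(psi, deltas):
    """Grid version of sup over alpha >= delta of psi(alpha)/alpha.

    deltas must be sorted in ascending order; returns one value per grid point.
    """
    flat = []
    running_max = float("-inf")
    # Walk from the largest delta downward, keeping a running maximum.
    for delta in reversed(deltas):
        running_max = max(running_max, psi(delta) / delta)
        flat.append(running_max)
    flat.reverse()
    return flat


def sharp_transform(psi, deltas, eps):
    """Smallest grid point whose flat-transform value is at most eps."""
    flat = flat_transform(psi, deltas)
    admissible = [d for d, value in zip(deltas, flat) if value <= eps]
    return min(admissible) if admissible else float("inf")


n = 100
grid = [k * 1e-4 for k in range(1, 10001)]
delta_sharp = sharp_transform(lambda d: math.sqrt(d / n), grid, eps=0.5)
```

Here `delta_sharp` should agree with $1/(n\varepsilon^2) = 0.04$ up to the grid resolution.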

We state the bound of the excess risk of $h_n$, which shows that the investigation of the excess risk reduces to the computation of the $\sharp$-transform of $\Delta_t(\delta)$ in the variable $\delta$.

Theorem 3. Let $\mathcal{H}$ denote a class of functions $h : X \times X \to [0, 1]$. For all $t > 0$ and $\delta \geq \delta_n(t) = \Delta_t^{\sharp, q}\left(\frac{1}{2q}\right)$,

$$\mathbb{P}\{\mathcal{L}(h_n) > \delta\} = \mathbb{P}\{P^2(h_n) - P^2(h_{\mathcal{H}}) > \delta\} \leq \log_q\left(\frac{q}{\delta}\right) e^{-t}.$$

The theorem provides a general upper bound in terms of the $\sharp$-transform, which has been studied and is well understood. Although $\delta_n(t)$ is difficult to deal with directly, it can be bounded in many interesting cases.

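For brevity, we denote

$$\phi_n(\mathcal{H}; \delta) = \mathbb{E} \sup_{h, g \in \mathcal{H}_\delta} |(P_n - P)(Ph - Pg)| = \mathbb{E}\|T_n\|_{\mathcal{H}'_\delta}$$

and

$$\varphi_n(\mathcal{H}; \delta) = \mathbb{E} \sup_{h, g \in \mathcal{H}_\delta} |W_n(h - g)| = \mathbb{E}\|W_n\|_{\mathcal{H}'_\delta}.$$

Using the Hoeffding decomposition, it follows that

$$\mathbb{E}\|U_n - P^2\|_{\mathcal{H}'_\delta} \leq 2\phi_n(\mathcal{H}; \delta) + \varphi_n(\mathcal{H}; \delta).$$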

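Bounding $\phi_n(\mathcal{H}; \delta)$ and $\varphi_n(\mathcal{H}; \delta)$ is related to the behavior of the continuity modulus of the empirical processes and U-processes. We define a pseudometric $d$ on the class $\mathcal{H}$ by

$$d(h, g) = \sqrt{P^2 (h - g)^2},$$

the continuity modulus of the empirical processes by

$$\theta_n(\delta) = \mathbb{E} \sup_{d(h, h_{\mathcal{H}}) \leq \sqrt{\delta}} |T_n(h - h_{\mathcal{H}})|, \tag{3.3}$$

and the continuity moduli of the U-processes by

$$\vartheta_n(\delta) = \mathbb{E} \sup_{d(h, h_{\mathcal{H}}) \leq \sqrt{\delta}} |W_n(h - h_{\mathcal{H}})| \tag{3.4}$$

and

$$\eta_n(\delta) = \sqrt{\frac{1}{n}\, \mathbb{E} \sup_{d(h, h_{\mathcal{H}}) \leq \sqrt{\delta}} |U_n (h - h_{\mathcal{H}})^2|}. \tag{3.5}$$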

The continuity modulus of the empirical processes θn(δ) can be bounded by Rademacher complexity, which is due to Koltchinskii (2006).

The continuity moduli of the U-processes $\vartheta_n(\delta)$ and $\eta_n(\delta)$ will be bounded in the next section.

Let $h^* = \arg\min_h P^2 h$, where the infimum is taken over all measurable functions $h$ on $X \times X$, so that $h^*$ is a global minimizer of $P^2 h$. Set $\Lambda = \inf_{h \in \mathcal{H}} (P^2 h - P^2 h^*)$. With these notations we have:

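Theorem 4. Let $\mathcal{H}$ denote a class of functions $h : X \times X \to [0, 1]$. Set $q > 1$ and $\Theta_n^\sharp(\varepsilon) = \theta_n^\sharp(\varepsilon) + \vartheta_n^\sharp(\varepsilon)$. Assume that for any $h \in \mathcal{H}$,

$$P^2 (h - h^*)^2 \leq A (P^2 h - P^2 h^*) \tag{3.6}$$

with some numerical constant $A > 0$. Then there exists a constant $K > 0$ such that for $0 < \varepsilon \leq 1$ and for all $t > 0$,

$$\delta_n(t) \leq \varepsilon \Lambda + \frac{1}{A} \Theta_n^\sharp\left(\frac{\varepsilon}{KA}\right) + \frac{1}{A} \eta_n^\sharp\left(\frac{\varepsilon}{KA t^{1/2}}\right) + \frac{Kt}{n},$$

and

$$\mathbb{P}\left\{ P^2 h_n - P^2 h^* \geq (1 + \varepsilon)\Lambda + \frac{1}{A} \Theta_n^\sharp\left(\frac{\varepsilon}{KA}\right) + \frac{1}{A} \eta_n^\sharp\left(\frac{\varepsilon}{KA t^{1/2}}\right) + \frac{Kt}{n} \right\} \leq \log_q\left(\frac{qn}{t}\right) e^{-t}.$$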

Note that for condition 3.6, the constant A will be given explicitly for some special cases.

The proofs of theorems 3 and 4 are in appendixes B and C, respectively.

4 Bound the Rademacher Complexity

In this section, we estimate the continuity moduli $\vartheta_n(\delta)$ and $\eta_n(\delta)$ for a special function class.

Definition 2. Let $(T, \rho)$ be a pseudometric space and $\varepsilon > 0$. The covering number $N(T, \rho, \varepsilon)$ is defined to be the minimal integer $n \in \mathbb{N}$ such that there exist $n$ disks of radius $\varepsilon$ (with respect to $\rho$) covering $T$.
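For a finite set of points, a greedy pass gives a valid cover and therefore an upper bound on the covering number of that set. This is a sketch of a standard greedy construction, not a method used in this letter, and it does not in general return the minimal cover; the function name and the example metric are assumptions.

```python
def greedy_cover(points, radius, distance):
    """Greedily pick centers so that every point is within `radius` of one.

    The number of returned centers upper-bounds the covering number of the
    finite set `points` at scale `radius` under the metric `distance`.
    """
    centers = []
    for point in points:
        # A point farther than `radius` from every center becomes a new center.
        if all(distance(point, center) > radius for center in centers):
            centers.append(point)
    return centers


def absolute_distance(a, b):
    return abs(a - b)


points = [i / 100 for i in range(101)]
centers = greedy_cover(points, 0.1, absolute_distance)
```

By construction the chosen centers are pairwise more than `radius` apart, so the same list is also a packing of the point set.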

In order to improve the result, we impose conditions on the covering numbers (see Rejchel, 2012; Nolan & Pollard, 1987).

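Assumption. Suppose that $\mathcal{H}$ is a measurable class of functions on $X \times X$ with values in $[-b, b]$ satisfying

$$N(\mathcal{H}, \rho, \varepsilon) \leq B \varepsilon^{-\alpha}. \tag{4.1}$$

As usual, we define two empirical metrics over $\mathcal{H}$ as

$$\rho_1^{\mathcal{H}}(h, g) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} P\left(h(x_i, \cdot) - g(x_i, \cdot)\right)^2}, \qquad \rho_2^{\mathcal{H}}(h, g) = \sqrt{\frac{1}{n(n-1)} \sum_{i \neq j} \left(h(x_i, x_j) - g(x_i, x_j)\right)^2}.$$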

Definition 3. A class of functions $\mathcal{F}$ is called a VC subgraph class if the subgraphs of the functions in $\mathcal{F}$ form a VC class of sets. That is, if we define the subgraph of a real-valued function $f$ on $S$ as the following subset $G_f$ of $S \times \mathbb{R}$:

$$G_f = \{(s, t) : s \in S,\ t \in \mathbb{R},\ 0 \leq t \leq f(s) \text{ or } f(s) \leq t \leq 0\},$$

then the class $\{G_f, f \in \mathcal{F}\}$ is a VC class of sets on $S \times \mathbb{R}$.

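Lemma 2 (Nolan & Pollard, 1987). Let $\mathcal{H}$ be a uniformly bounded class of functions on $X \times X$. For each finite measure $P$, if the class $\mathcal{H}$ satisfies equation 4.1, then the class $P(\mathcal{H})$ also satisfies that equation.

In particular, if $\mathcal{H}$ is a VC-subgraph class, then condition 4.1 holds.

Lemma 3. Suppose that $\mathcal{H}$ satisfies condition 4.1. Then it holds that (Koltchinskii, 2006)

$$\theta_n(\delta) \leq K \left( \sqrt{\frac{\alpha \delta \log(1/\delta)}{n}} \vee \frac{\alpha \log(1/\delta)}{n} \right).$$

By the definition of the $\sharp$-transform, it follows that

$$\theta_n^\sharp(\varepsilon) \leq \frac{C \alpha \log(n\varepsilon^2/\alpha)}{n\varepsilon^2}.$$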

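For the Rademacher chaos complexity $R_n^{(2)}(\mathcal{H})$, we can bound it by the metric entropy integral (Arcones & Giné, 1993; de la Peña & Giné, 1999).

Lemma 4. There exists a universal constant $K$ such that

$$R_n^{(2)}(\mathcal{H}) \leq \frac{K}{n}\, \mathbb{E} \int_0^{\sigma_n} \log N(\mathcal{H}, L_2(U_n), \varepsilon)\, d\varepsilon,$$

where $\sigma_n^2 = \sup_{h, g \in \mathcal{H}} U_n((h - g)^2)$ and $\|h - g\|_{L_2(U_n)} = \sqrt{U_n((h - g)^2)}$.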

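We now compute the entropy integral. We need the following lemma to bound $\mathbb{E}\sigma_n^2$:

Lemma 5. Let $\mathcal{H}$ be a uniformly bounded class of real-valued measurable functions on $X \times X$. Then for every integer $n$,

$$\mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n(n-1)} \left| \sum_{i \neq j}^{n} h^2(x_i, x_j) \right| \leq \sup_{h \in \mathcal{H}} P^2 h^2 + 4 R_n^{(1)}(P(\mathcal{H}^2)) + 512 R_n^{(2)}(\mathcal{H}^2).$$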

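Proof. By the Hoeffding decomposition, one has

$$\mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n(n-1)} \left| \sum_{i \neq j}^{n} h^2(x_i, x_j) \right| \leq \sup_{h \in \mathcal{H}} P^2 h^2 + 2\, \mathbb{E} \sup_{h \in \mathcal{H}} |T_n(h^2)| + \mathbb{E} \sup_{h \in \mathcal{H}} |W_n(h^2)|.$$

The second term on the right side of the inequality is bounded by the Rademacher complexity via the symmetrization method,

$$\mathbb{E} \sup_{h \in \mathcal{H}} |T_n(h^2)| \leq 2\, \mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n} \left| \sum_{i=1}^{n} \varepsilon_i P h^2(x_i) \right|.$$

The third term on the right side of the inequality is bounded by the Rademacher chaos complexity of degree 2 via symmetrization and decoupling methods. Writing $x'_j$ and $\varepsilon'_j$ for independent copies of $x_j$ and $\varepsilon_j$, we have

$$\begin{aligned}
\mathbb{E} \sup_{h \in \mathcal{H}} |W_n(h^2)| &= \mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n(n-1)} \left| \sum_{i \neq j}^{n} \left( h^2(x_i, x_j) - P h^2(x_i) - P h^2(x_j) + P^2 h^2 \right) \right| \\
&\leq 8\, \mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n(n-1)} \left| \sum_{i \neq j}^{n} \left( h^2(x_i, x'_j) - P h^2(x_i) - P h^2(x'_j) + P^2 h^2 \right) \right| \\
&\leq 32\, \mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n(n-1)} \left| \sum_{i \neq j}^{n} \varepsilon_i \varepsilon'_j \left( h^2(x_i, x'_j) - P h^2(x_i) - P h^2(x'_j) + P^2 h^2 \right) \right| \\
&\leq 128\, \mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n(n-1)} \left| \sum_{i \neq j}^{n} \varepsilon_i \varepsilon'_j h^2(x_i, x'_j) \right| \\
&\leq 512\, \mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n(n-1)} \left| \sum_{i \neq j}^{n} \varepsilon_i \varepsilon_j h^2(x_i, x_j) \right|.
\end{aligned}$$

Summing the estimates above, we complete the proof of lemma 5.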

Let K and Ki(i ∈N) denote constants whose value may change from line to line:

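Corollary 1. Let $\mathcal{H}$ denote a class of functions $h : X \times X \to [0, 1]$. Suppose that $\mathcal{H}$ satisfies conditions 3.6 and 4.1, and let $0 < \varepsilon \leq 1$. There exist constants $K_1$, $K_2$, $K_3$, $K_4$, and $K_5$ such that for any $t > 0$, with probability at least $1 - \log_q\left(\frac{qn}{t}\right) e^{-t}$,

$$P^2 h_n - P^2 h^* \leq (1 + \varepsilon)\Lambda + \frac{K_1 \log(n\varepsilon^2)}{n\varepsilon^2} + \frac{K_2 t \log(n\varepsilon^2/t)}{n\varepsilon^2} + \frac{K_3}{n^2\varepsilon^2} + \frac{K_4 t}{n} + \frac{K_5 (1 + t)}{n\varepsilon}$$

holds true.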

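Proof. Using condition 4.1 and the fact $0 \leq (h - g)^2 \leq 1$, by lemma 5, we have

$$\mathbb{E}\sigma_n^2 = \mathbb{E} \sup_{d(h, g) \leq \sqrt{\delta}} U_n((h - g)^2) \leq \delta + \frac{K_1}{\sqrt{n}} + \frac{K_2}{n}.$$

By condition 4.1 again, it holds that

$$\begin{aligned}
\vartheta_n(\delta) &\leq K R_n^{(2)}\{h : d(h, h_{\mathcal{H}}) \leq \sqrt{\delta}\} \leq \frac{K}{n}\, \mathbb{E} \int_0^{\sigma_n} (\log B - \alpha \log \varepsilon)\, d\varepsilon \\
&= \frac{K}{n}\, \mathbb{E}\left[ \sigma_n \log B + \alpha \sigma_n - \alpha \sigma_n \log \sigma_n \right] \leq \frac{K}{n}\, \mathbb{E}\left[ \sigma_n \log B + \alpha \right] \\
&\leq \frac{K \log B}{n} \sqrt{\mathbb{E}\sigma_n^2} + \frac{\alpha K}{n} \leq \frac{K \log B}{n} \sqrt{\delta + \frac{K_1}{\sqrt{n}} + \frac{K_2}{n}} + \frac{\alpha K}{n} \\
&\leq \frac{K_1 \sqrt{\delta}}{n} + \frac{K_2}{n} + \frac{K_3}{n^{5/4}} + \frac{K_4}{n^{3/2}} \leq \frac{K_1 \sqrt{\delta}}{n} + \frac{K_5}{n}.
\end{aligned} \tag{4.2}$$

By the definition of $\vartheta_n^\sharp(\varepsilon)$, if $K_1 \sqrt{\delta} \geq K_5$, the condition $\sup_{x \geq \delta} \frac{2K_1 \sqrt{x}}{nx} \leq \varepsilon$ gives $\vartheta_n^\sharp(\varepsilon) \leq \frac{4K_1^2}{n^2\varepsilon^2}$, and if $K_1 \sqrt{\delta} \leq K_5$, then $\vartheta_n^\sharp(\varepsilon) \leq \frac{2K_5}{n\varepsilon}$. Consequently we obtain

$$\vartheta_n^\sharp(\varepsilon) \leq \frac{4K_1^2}{n^2\varepsilon^2} + \frac{2K_5}{n\varepsilon}.$$

Recall

$$\eta_n(\delta) = \sqrt{\frac{1}{n}\, \mathbb{E} \sup_{d(h, h_{\mathcal{H}}) \leq \sqrt{\delta}} |U_n (h - h_{\mathcal{H}})^2|}.$$

Using lemma 5, we obtain

$$\begin{aligned}
n\eta_n^2(\delta) &= \mathbb{E} \sup_{d(h, h_{\mathcal{H}}) \leq \sqrt{\delta}} |U_n (h - h_{\mathcal{H}})^2| \\
&\leq \delta + 4 R_n^{(1)}\left(P(\{h^2 : d(h, h_{\mathcal{H}}) \leq \sqrt{\delta}\})\right) + 512 R_n^{(2)}\left(\{h^2 : d(h, h_{\mathcal{H}}) \leq \sqrt{\delta}\}\right) \\
&\leq \delta + K_1 R_n^{(1)}\left(P(\{h : d(h, h_{\mathcal{H}}) \leq \sqrt{\delta}\})\right) + K_2 R_n^{(2)}\left(\{h : d(h, h_{\mathcal{H}}) \leq \sqrt{\delta}\}\right).
\end{aligned}$$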

The second inequality uses the contraction principle (see equation D.3 in appendix D). The third inequality is bounded by lemma 3 and equation 4.2:

2 n(δ) ≤ δ + K  δ log(1/δ) n ∨ log(1/δ) n  +K1 √ δ n + K5 n . Similarly, we have η n(ε) ≤ K n + 1 + K log(nε2) 2 + 4K2 1 n2ε2 + 2K5 .


It follows that η n(ε) ≤ K6 + K7log(nε2) 2 .

The estimates of $\vartheta_n^\sharp(\varepsilon)$ and $\eta_n^\sharp(\varepsilon)$, together with $\theta_n^\sharp(\varepsilon)$ in lemma 3, yield

 n(ε) + ηn  εt  ≤K log(nε2) 2 + 4K2 1 n2ε2+ 2K5 + K6t + K7t log(nε2/t) 2 .

The desired result follows from theorem 4.

5 Error Bounds of Pairwise Preference Algorithm

We have developed an abstract empirical risk minimization based on the U-processes. We now turn to preference learning problems. We investigate the convergence rates of pairwise preference learning by using empirical risk minimization with respect to U-processes. The nature of U-processes fits preference learning problems. The framework we gave in section 3 is general and allows users to apply it directly instead of deriving bounds in each risk minimization problem. We pay attention to the pairwise loss function of learning algorithms.

Let $X$ be an observation space and $Y \subset \mathbb{R}$ be a real-valued label set. In this section we assume that $P$ is the joint distribution on $X \times Y$ and is unknown. For a sample set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ of independent copies of $X \times Y$, we let $P_n$ be the empirical distribution on $X \times Y$ based on the given samples.

5.1 Preference Learning with Squared Loss. Pahikkala et al. (2007) and Cossock and Zhang (2008) considered learning a linear order $f : X \to \mathbb{R}$ from the i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{n}$. Here we discuss how to learn a preference function $s : X \times X \to [0, 1]$ and show how to derive the convergence rate from theorem 4.

We may suppose that $\omega_{ij} \in [0, 1]$ for our purposes. Otherwise we use the transform $\omega_{ij} = \frac{e^{y_i - y_j}}{1 + e^{y_i - y_j}}$. When $\omega_{ij} > \frac{1}{2}$, it means that $x_i$ is prior to $x_j$, and vice versa.
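The logistic transform of a label difference can be sketched as follows. The function name is an assumption; the two branches are only there to avoid floating-point overflow for large label differences and compute the same quantity.

```python
import math


def preference_weight(y_i, y_j):
    """omega_ij = exp(y_i - y_j) / (1 + exp(y_i - y_j)), evaluated stably."""
    difference = y_i - y_j
    if difference >= 0:
        # Equivalent form 1 / (1 + exp(-d)); exp(-d) cannot overflow here.
        return 1.0 / (1.0 + math.exp(-difference))
    exp_difference = math.exp(difference)
    return exp_difference / (1.0 + exp_difference)


weight_forward = preference_weight(2.0, 1.0)
weight_backward = preference_weight(1.0, 2.0)
```

The weights of a pair in the two orders sum to one, so a value above one half in one order corresponds to a value below one half in the other.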

When taking the loss function as $h_s = (s - \omega)^2$, we consider the preference function class

$$\mathcal{S} = \{s : X \times X \to [0, 1]\}.$$

The inference property of $s$ is measured by its expected risk:

$$\mathcal{E}(s) = \int \left(s(x, x') - \omega(y, y')\right)^2 dP^2.$$

The corresponding empirical risk is defined as

$$\mathcal{E}_n(s) = \frac{1}{n(n-1)} \sum_{i \neq j}^{n} \left(s(x_i, x_j) - \omega(y_i, y_j)\right)^2,$$

which measures the average error between the ranking function $s(x_i, x_j)$ and the feedback information $\omega_{ij}$. The empirical risk minimizer $s_n$ over the function class $\mathcal{S} = \{s : X \times X \to [0, 1]\}$ is defined as

$$s_n = \arg\min_{s \in \mathcal{S}} \mathcal{E}_n(s).$$
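Empirical risk minimization with the pairwise squared loss can be sketched over a small finite family of candidate preference functions. The sigmoid family, the parameter grid, and the synthetic sample below are hypothetical choices for illustration; the letter itself works with a general function class.

```python
import math
from itertools import permutations


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def empirical_pairwise_risk(s, sample, omega):
    """Average of (s(x_i, x_j) - omega(y_i, y_j))^2 over ordered pairs i != j.

    sample: list of (x, y) pairs; omega maps two labels to a value in [0, 1].
    """
    n = len(sample)
    total = sum(
        (s(x_i, x_j) - omega(y_i, y_j)) ** 2
        for (x_i, y_i), (x_j, y_j) in permutations(sample, 2)
    )
    return total / (n * (n - 1))


def make_candidate(theta):
    # Hypothetical one-parameter family s_theta(x, x') = sigmoid(theta * (x - x')).
    return lambda x, x_prime: sigmoid(theta * (x - x_prime))


sample = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]


def omega(y, y_prime):
    return sigmoid(2.0 * (y - y_prime))


thetas = (0.5, 1.0, 2.0, 4.0)
risks = {th: empirical_pairwise_risk(make_candidate(th), sample, omega) for th in thetas}
best_theta = min(risks, key=risks.get)
```

In this synthetic setup the feedback is generated by the candidate with parameter 2, so that candidate attains zero empirical risk and is selected.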

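The regression function is defined by

$$s^*(x, x') = \int \omega\, dP^2(\omega | x, x'),$$

where $P^2(\omega | x, x')$ is the conditional distribution of $\omega$ given $(x, x')$. It is well known that $s^* = \arg\min \mathcal{E}(s)$, where the infimum is taken over all measurable functions $s$ on $X \times X$. Furthermore,

$$\int \left((s - \omega)^2 - (s^* - \omega)^2\right) dP^2(\omega | x, x') = (s - s^*)^2. \tag{5.1}$$

Note that equation 5.1 also implies (by integration)

$$\mathcal{E}(s) - \mathcal{E}(s^*) = \int (s - s^*)^2\, dP^2 = \|s - s^*\|^2.$$

Consequently,

$$\mathcal{L}(h_s) = \mathbb{E}(s - \omega)^2 - \mathbb{E}(s^* - \omega)^2 = \|s - s^*\|^2.$$

For simplicity, we suppose that $s^* \in \mathcal{S}$. Thus,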

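It is easy to get

$$P^2\left((s_1 - \omega)^2 - (s_2 - \omega)^2\right)^2 \leq 4\|s_1 - s_2\|^2.$$

As a result, condition 3.6 holds with $A = 4$. The symmetrization inequality gives

$$\begin{aligned}
\phi_n(\mathcal{H}; \delta) &= \mathbb{E} \sup_{h_{s_1}, h_{s_2} \in \mathcal{H}_\delta} \left| (P_n - P)\left(P h_{s_1} - P h_{s_2}\right) \right| \\
&= \mathbb{E} \sup_{\|s_1 - s^*\|^2 \leq \delta,\ \|s_2 - s^*\|^2 \leq \delta} \left| (P_n - P)\left(P h_{s_1} - P h_{s_2}\right) \right| \\
&\leq 2\, \mathbb{E} \sup_{\|s - s^*\|^2 \leq \delta} \left| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \left(P h_s - P h_{s^*}\right)(x_i) \right| \\
&\leq 2\, \mathbb{E} \sup_{\|h_s - h_{s^*}\|^2 \leq 4\delta} \left| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \left(P h_s - P h_{s^*}\right)(x_i) \right| = 2\theta_n(4\delta).
\end{aligned}$$

Similarly, we have

$$\varphi_n(\mathcal{H}; \delta) = \mathbb{E} \sup_{h, g \in \mathcal{H}_\delta} |W_n(h - g)| \leq 8\vartheta_n(4\delta).$$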

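For a function class $\mathcal{S}$, we define two empirical metrics over it as

$$\rho_1^{\mathcal{S}}(h, g) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} P\left(h(x_i, \cdot) - g(x_i, \cdot)\right)^2}, \qquad \rho_2^{\mathcal{S}}(h, g) = \sqrt{\frac{1}{n(n-1)} \sum_{i \neq j} \left(h(x_i, x_j) - g(x_i, x_j)\right)^2}.$$

The corresponding empirical covering numbers are denoted by $N(\mathcal{S}, \rho_1, \varepsilon)$ and $N(\mathcal{S}, \rho_2, \varepsilon)$, respectively. Since $\mathcal{H} = \{(s - \omega)^2 : s \in \mathcal{S}\}$, with $\rho_1^{\mathcal{H}}$ and $\rho_2^{\mathcal{H}}$ defined as above, let $h_1 = (s_1 - \omega)^2 \in \mathcal{H}$ and $h_2 = (s_2 - \omega)^2 \in \mathcal{H}$.

It is easy to see that $\rho_1^{\mathcal{H}}(h_1, h_2) \leq 4\rho_1^{\mathcal{S}}(s_1, s_2)$ and $\rho_2^{\mathcal{H}}(h_1, h_2) \leq 4\rho_2^{\mathcal{S}}(s_1, s_2)$. Since an $\varepsilon/4$-covering of $\mathcal{S}$ provides an $\varepsilon$-covering of $\mathcal{H}$, we have $N(\mathcal{H}, \rho_1, \varepsilon) \leq N(\mathcal{S}, \rho_1, \varepsilon/4)$ and $N(\mathcal{H}, \rho_2, \varepsilon) \leq N(\mathcal{S}, \rho_2, \varepsilon/4)$. If $\mathcal{S}$ satisfies condition 4.1, then so does $\mathcal{H}$.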

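Summarizing the discussion, we obtain the following bound of the excess risk of $s_n$ from corollary 1.

Theorem 5. Let $\mathcal{S}$ denote a class of functions $s : X \times X \to [0, 1]$. Suppose that $\mathcal{S}$ satisfies condition 4.1 and $s^* \in \mathcal{S}$. For any $0 < \varepsilon \leq 1$ and any $t > 0$, there exist constants $K_1$, $K_2$, $K_3$, $K_4$, and $K_5$ such that, with probability at least $1 - \log_q\left(\frac{qn}{t}\right) e^{-t}$, we have

$$\|s_n - s^*\|^2 \leq \frac{K_1 \log(n\varepsilon^2)}{n\varepsilon^2} + \frac{K_2 t \log(n\varepsilon^2/t)}{n\varepsilon^2} + \frac{K_3}{n^2\varepsilon^2} + \frac{K_4 t}{n} + \frac{K_5 (1 + t)}{n\varepsilon}.$$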

5.2 Preference Learning with Indicator Loss. Clémençon et al. (2008) and Rejchel (2012) considered the indicator loss function.

Denote

$$\omega_{ij} = \omega(y_i, y_j) = \begin{cases} 1, & y_i > y_j, \\ -1, & y_i < y_j. \end{cases}$$

For preference learning with an indicator loss, one observes $x$ and $x'$ but not their labels $y$ and $y'$. We regard $x$ as prior to $x'$ if $\omega(y, y') = 1$. The goal is to rank $x$ and $x'$ so that the probability that the better ranked of them has a smaller label is as small as possible. Formally, a preference relation is a function $r : X \times X \to \{-1, 1\}$. If $r(x, x') = 1$, then the preference relation ranks $x$ higher than $x'$, and vice versa.

The setting of the bipartite ranking problem (Agarwal & Niyogi, 2005) can be described as follows. There is an instance space $X$ from which instances are drawn, and the learner is given a training sample $(S^+, S^-) \in X^m \times X^l$ consisting of a sequence of positive training examples $S^+ = (x_1^+, \ldots, x_m^+)$ and a sequence of negative training examples $S^- = (x_1^-, \ldots, x_l^-)$. Denote

$$\omega_{ij} = \begin{cases} 1, & x_i \in S^+,\ x_j \in S^-, \\ -1, & x_i \in S^-,\ x_j \in S^+. \end{cases}$$

The goal is to learn from these examples a preference function $r : X \times X \to \{-1, 1\}$ that ranks a positive instance $x$ higher than a negative one $x'$ if $r(x, x') = 1$ and ranks $x$ lower than $x'$ if $r(x, x') = -1$. Denote by $\mathcal{R}$ a class of preference rules $r : X \times X \to \{-1, 1\}$.

We consider the indicator loss function $h_r = \mathbb{I}_{\{r(x, x') \neq \omega\}}$ defined on $(X \times Y)^2$, where $\mathbb{I}$ denotes the indicator function. The indicator loss function class is denoted by $\mathcal{H}$. The performance of a preference rule is measured by the preference risk

$$\mathcal{E}(r) = P^2(h_r) = \mathbb{P}\{\omega \neq r(x, x')\} = \mathbb{P}\{\omega \cdot r(x, x') < 0\}.$$

Although the preference problem shares similarities with the binary classification problem, preference risk and classification risk are different (Agarwal & Niyogi, 2005). The empirical risk minimizer $r_n$ over $\mathcal{R}$ is denoted by

$$r_n = \arg\min_{r \in \mathcal{R}} \frac{1}{n(n-1)} \sum_{i \neq j} \mathbb{I}_{\{r(x_i, x_j) \neq \omega_{ij}\}}. \tag{5.2}$$
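The empirical preference risk with indicator loss, and its minimization over a small finite class of rules, can be sketched as follows. The two candidate rules and the synthetic sample are hypothetical, and labels are assumed to be pairwise distinct so that every pair has a well-defined preference.

```python
from itertools import permutations


def omega(y_i, y_j):
    """Pairwise feedback: +1 if y_i > y_j, else -1 (labels assumed distinct)."""
    return 1 if y_i > y_j else -1


def empirical_ranking_risk(r, sample):
    """Fraction of ordered pairs i != j on which r(x_i, x_j) != omega_ij."""
    n = len(sample)
    mistakes = sum(
        r(x_i, x_j) != omega(y_i, y_j)
        for (x_i, y_i), (x_j, y_j) in permutations(sample, 2)
    )
    return mistakes / (n * (n - 1))


# Two hypothetical preference rules on real-valued inputs.
def prefer_larger(x, x_prime):
    return 1 if x > x_prime else -1


def prefer_smaller(x, x_prime):
    return 1 if x < x_prime else -1


sample = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0)]
rules = {"larger": prefer_larger, "smaller": prefer_smaller}
risks = {name: empirical_ranking_risk(rule, sample) for name, rule in rules.items()}
best_rule = min(risks, key=risks.get)
```

Since the labels increase with the inputs in this sample, the rule preferring larger inputs makes no pairwise mistakes, while the reversed rule errs on every pair.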

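Set $\eta(x, x') = \mathbb{E}(\omega | X = x, X' = x')$. Then $\eta(x, x') = \mathbb{P}(\omega = 1 | x, x') - \mathbb{P}(\omega = -1 | x, x')$. The target we want to learn is denoted by

$$r^* = \arg\min_r \mathcal{E}(r) = \begin{cases} 1, & \eta(x, x') > 0, \\ -1, & \eta(x, x') < 0, \end{cases}$$

where the infimum is taken over all measurable functions $r$ on $X \times X$. Now we take $h^* = h_{r^*}$ in theorem 4 and get

$$\int \left(\mathbb{I}_{\{r(x, x') \neq \omega\}} - \mathbb{I}_{\{r^*(x, x') \neq \omega\}}\right)^2 dP^2(\omega | x, x') = \frac{1}{|\eta(x, x')|} \int \left(\mathbb{I}_{\{r(x, x') \neq \omega\}} - \mathbb{I}_{\{r^*(x, x') \neq \omega\}}\right) dP^2(\omega | x, x').$$

We assume that $|\eta(x, x')| \geq \eta_0$ for some $\eta_0 > 0$ (see Massart, 2000a). Then the indicator loss function $h_r$ satisfies condition 3.6 with $A = 1/\eta_0$. The convergence rate of the preference algorithm defined in equation 5.2 follows from corollary 1.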

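Theorem 6. Let $\mathcal{R}$ be a class of preference rules. Suppose that $\mathcal{R}$ satisfies condition 4.1 and there exists $\eta_0 > 0$ such that $|\eta(x, x')| \geq \eta_0$. For any $0 < \varepsilon \leq 1$ and any $t > 0$, there exist constants $K_1$, $K_2$, $K_3$, $K_4$, and $K_5$ such that the excess error bound

$$\mathcal{E}(r_n) - \mathcal{E}(r^*) \leq (1 + \varepsilon)\left(\inf_{r \in \mathcal{R}} \mathcal{E}(r) - \mathcal{E}(r^*)\right) + \frac{K_1 \log(n\varepsilon^2\eta_0^2)}{n\varepsilon^2\eta_0} + \frac{K_2 t \log(n\varepsilon^2\eta_0^2/t)}{n\varepsilon^2\eta_0} + \frac{K_3}{n^2\varepsilon^2\eta_0} + \frac{K_4 t}{n} + \frac{K_5 (1 + t)}{n\varepsilon}$$

holds with probability at least $1 - \log_q\left(\frac{qn}{t}\right) e^{-t}$.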


Appendix A: Proof of Theorem 1

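To prove theorem 1, we need the following tensorization inequality from Massart (2000b, lemma 8) and Boucheron, Lugosi, and Massart (2000, lemma 2.3):

Lemma 6. Let $x_1, \ldots, x_n$ be independent random variables with values in $X$ and $x'_1, \ldots, x'_n$ independent copies of $x_1, \ldots, x_n$. Let $V = V(x_1, \ldots, x_n)$ and $V_k = V(x_1, \ldots, x_{k-1}, x'_k, x_{k+1}, \ldots, x_n)$ be measurable functions. For any $\lambda$,

$$\lambda \mathbb{E}\{V e^{\lambda V}\} - \mathbb{E}\{e^{\lambda V}\} \log \mathbb{E}\{e^{\lambda V}\} \leq \sum_{k=1}^{n} \mathbb{E}\left[\psi(-\lambda(V - V_k))\, e^{\lambda V}\, \mathbb{I}_{[V - V_k \geq 0]}\right], \tag{A.1}$$

where $\psi(\lambda) = \lambda(e^\lambda - 1)$. Let $V_k = V_k(x_1, \ldots, x_{k-1}, x_{k+1}, \ldots, x_n)$ be measurable functions. For any $\lambda$,

$$\lambda \mathbb{E}\{V e^{\lambda V}\} - \mathbb{E}\{e^{\lambda V}\} \log \mathbb{E}\{e^{\lambda V}\} \leq \sum_{k=1}^{n} \mathbb{E}\{\phi(-\lambda(V - V_k))\, e^{\lambda V}\}, \tag{A.2}$$

where $\phi(\lambda) = e^\lambda - \lambda - 1$.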

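As for empirical processes, we first establish the concentration inequality for nonnegative functions using lemma 6.

Theorem 7. Let $\mathcal{H}$ be a countable class of measurable and symmetric functions from $X \times X$ to $[0, 1]$ and let

$$V = \frac{1}{2(n-1)} \sup_{h \in \mathcal{H}} \sum_{i \neq j}^{n} h(x_i, x_j).$$

Then there holds

$$\log \mathbb{E} e^{t(V - \mathbb{E}V)} \leq \frac{t^2\, \mathbb{E}V}{1 - 2t}, \quad \text{for } 0 \leq t < \frac{1}{2}.$$

Furthermore, we have for any $x > 0$,

$$\mathbb{P}\{V \geq \mathbb{E}V + x\} \leq \exp\left\{-\frac{x^2}{4\mathbb{E}V + 4x}\right\}.$$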

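Proof. Without loss of generality, we consider a finite class of functions $\mathcal{H} = (h_1, \ldots, h_N)$ (see Boucheron et al., 2000). Set

$$\tau = \min_{1 \leq k \leq N} \left\{ k : \sum_{i \neq j}^{n} h_k(x_i, x_j) = \sup_{h \in \mathcal{H}} \sum_{i \neq j}^{n} h(x_i, x_j) \right\}.$$

Let $V_k = \frac{1}{2(n-1)} \sum_{i \neq j;\ i, j \neq k} h_\tau(x_i, x_j)$. We have for any $1 \leq k \leq n$,

$$0 \leq V - V_k \leq \frac{1}{n-1} \sum_{i \neq k} h_\tau(x_k, x_i) \leq 1. \tag{A.3}$$

Note that the function $\phi(\lambda) = e^\lambda - \lambda - 1$ is convex and $\phi(0) = 0$. Thus, for any $\lambda$ and any $\mu \in [0, 1]$, we have $\phi(-\lambda\mu) \leq \phi(-\lambda)\mu$. It follows from equations A.2 and A.3 that for any $\lambda$,

$$\lambda \mathbb{E}\{V e^{\lambda V}\} - \mathbb{E}\{e^{\lambda V}\} \log \mathbb{E}\{e^{\lambda V}\} \leq 2\phi(-\lambda)\, \mathbb{E}\{V e^{\lambda V}\}.$$

By making use of lemmas 6 and 3 in McDiarmid and Reed (2006), we complete the proof of theorem 7.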

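Proof of Theorem 1. Let $(x'_1, \ldots, x'_n)$ be an i.i.d. copy of $(x_1, \ldots, x_n)$. Without loss of generality, we consider a finite class of functions $\mathcal{H} = (h_1, \ldots, h_N)$ (see Massart, 2000b). Set

$$\tau = \min_{1 \leq k \leq N} \left\{ k : \left| \sum_{i \neq j}^{n} h_k(x_i, x_j) \right| = \sup_{h \in \mathcal{H}} \left| \sum_{i \neq j}^{n} h(x_i, x_j) \right| \right\}.$$

Let $V = \frac{1}{2(n-1)} \left| \sum_{i \neq j}^{n} h_\tau(x_i, x_j) \right|$ and

$$V_k = \frac{1}{2(n-1)} \left| \sum_{i \neq j;\ i, j \neq k} h_\tau(x_i, x_j) + 2 \sum_{i \neq k}^{n} h_\tau(x'_k, x_i) \right|.$$

We observe that $\psi(-\lambda) = -\lambda(e^{-\lambda} - 1) \leq \lambda^2$ for all $\lambda \geq 0$. For each $k$,

$$V - V_k \leq \frac{1}{n-1} \left| \sum_{i \neq k}^{n} h_\tau(x_k, x_i) - \sum_{i \neq k}^{n} h_\tau(x'_k, x_i) \right|,$$

and it follows that

$$\psi(-\lambda(V - V_k))\, \mathbb{I}_{[V - V_k \geq 0]} \leq \frac{\lambda^2}{(n-1)^2} \left| \sum_{i \neq k}^{n} h_\tau(x_k, x_i) - \sum_{i \neq k}^{n} h_\tau(x'_k, x_i) \right|^2 \leq \frac{\lambda^2}{n-1} \sum_{i \neq k}^{n} \left( h_\tau(x_k, x_i) - h_\tau(x'_k, x_i) \right)^2.$$

Inequality A.1 gives, for $\lambda > 0$,

$$\lambda \mathbb{E}\{V e^{\lambda V}\} - \mathbb{E}\{e^{\lambda V}\} \log \mathbb{E}\{e^{\lambda V}\} \leq \lambda^2\, \mathbb{E}\{W e^{\lambda V}\}, \tag{A.4}$$

where $W = \sup_{h \in \mathcal{H}} \frac{1}{n-1} \sum_{k=1}^{n} \sum_{i \neq k}^{n} \left( h(x_k, x_i) - h(x'_k, x_i) \right)^2$.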

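Let $G(\lambda) = \mathbb{E} e^{\lambda(V - \mathbb{E}V)}$, for $0 < \lambda < 1$. We have from inequality A.4,

$$\frac{1}{\lambda} \frac{G'(\lambda)}{G(\lambda)} - \frac{1}{\lambda^2} \log G(\lambda) \leq \frac{\log \mathbb{E}[e^{\lambda W}]}{\lambda(1 - \lambda)}.$$

Integrating this inequality yields

$$\frac{1}{\lambda} \log G(\lambda) \leq \frac{1}{1 - \lambda} \int_0^\lambda \frac{\log \mathbb{E}[e^{10\mu \widetilde{W}}]}{\mu}\, d\mu, \tag{A.5}$$

where $\widetilde{W} = \sup_{h \in \mathcal{H}} \frac{1}{n-1} \sum_{k=1}^{n} \sum_{i \neq k}^{n} h^2(x_k, x_i)$. Here we use the inequality $\mathbb{E}[e^{\lambda W}] \leq \mathbb{E}[e^{10\lambda \widetilde{W}}]$, which can be obtained from the decoupling theorem (see de la Peña & Giné, 1999).

Without loss of generality, we assume $b = 1$ in theorem 1. We apply theorem 7 to $\widetilde{W}/2$ and get, for $0 < \mu < \frac{1}{40}$,

$$\log \mathbb{E} e^{10\mu \widetilde{W}} \leq 10\mu\, \mathbb{E}\widetilde{W} + \frac{200\mu^2\, \mathbb{E}\widetilde{W}}{1 - 40\mu}.$$

For all $0 < \lambda < \frac{1}{40}$, inequality A.5 implies

$$\frac{1}{\lambda} \log G(\lambda) \leq 10\, \mathbb{E}\widetilde{W} \left[ \frac{1}{1 - \lambda} \int_0^\lambda \left( 1 + \frac{20\mu}{1 - 40\mu} \right) d\mu \right].$$

This yields, by straightforward computation,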

log G(λ) ≤ 10EW! λ

2(1 − 30λ)

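Finally, we get

$$\log G(\lambda) \leq \frac{20\, \mathbb{E}\widetilde{W}\, \lambda^2}{2(1 - 40\lambda)}.$$

By the Markov inequality, for all $x > 0$,

$$\mathbb{P}\{V \geq \mathbb{E}V + x\} \leq \exp\left\{ -\frac{x^2}{40\, \mathbb{E}\widetilde{W} + 80x} \right\},$$

or

$$\mathbb{P}\left\{ V \geq \mathbb{E}V + \sqrt{40\, \mathbb{E}\widetilde{W}\, x} + 40x \right\} \leq \exp\{-x\}.$$

Replacing $\widetilde{W}$ by $n\|U_n\|_{\mathcal{H}^2}$ and $V$ by $\frac{n}{2}\|U_n\|_{\mathcal{H}}$, we complete the proof of theorem 1.

Appendix B: Proof of Theorem 3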

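Now we give the proof of the main results of this letter.

Proof of Theorem 3. Recall that

$$\Delta_t(\delta) = \mathbb{E}\|U_n - P^2\|_{\mathcal{H}'_\delta} + \sqrt{\frac{160\, \mathbb{E}\|U_n\|_{(\mathcal{H}'_\delta)^2}\, t}{n}} + \frac{80t}{n}.$$

Denote

$$E_{n,j}(t) = \left\{ \sup_{h, g \in \mathcal{H}_{\delta_j}} |(U_n - P^2)(h - g)| \leq \Delta_t(\delta_j) \right\}.$$

By theorem 1, $\mathbb{P}((E_{n,j}(t))^c) \leq \exp(-t)$.

Let $\delta_j \geq \delta$. On the event $E_{n,j}(t)$, we conclude that if $h_n \in \mathcal{H}_{\delta_j} \setminus \mathcal{H}_{\delta_{j+1}}$, then for all $0 < \varepsilon < \delta_{j+1}$ and all $g \in \mathcal{H}_\varepsilon$,

$$\begin{aligned}
\delta_{j+1} < \mathcal{L}(h_n) &\leq P^2 h_n - P^2 g + \varepsilon \leq U_n(h_n) - U_n(g) + (P^2 - U_n)(h_n - g) + \varepsilon \\
&\leq \sup_{h, g \in \mathcal{H}_{\delta_j}} |(P^2 - U_n)(h - g)| + \varepsilon \leq \Delta_t(\delta_j) + \varepsilon \leq \Delta_t^{\flat, q}(\delta)\, \delta_j + \varepsilon.
\end{aligned}$$

Consequently, we have $\Delta_t^{\flat, q}(\delta) \geq \frac{1}{q} > \frac{1}{2q}$. As $\Delta_t^{\flat, q}(\delta)$ is decreasing with respect to $\delta$, we obtain that

$$\delta \leq \Delta_t^{\sharp, q}\left(\frac{1}{2q}\right) = \delta_n(t).$$

We can conclude that for $\delta_j \geq \delta \geq \delta_n(t)$, $\{h_n \in \mathcal{H}_{\delta_j} \setminus \mathcal{H}_{\delta_{j+1}}\} \subset (E_{n,j}(t))^c$. Therefore, for $\delta \geq \delta_n(t)$, on the event $E_n(t) = \cap_{j : \delta_j \geq \delta} E_{n,j}(t)$ we have $\mathcal{L}(h_n) \leq \delta$, implying that

$$\mathbb{P}\{\mathcal{L}(h_n) > \delta\} \leq \sum_{j : \delta_j \geq \delta} \mathbb{P}\{(E_{n,j}(t))^c\} \leq \log_q\left(\frac{q}{\delta}\right) e^{-t}.$$

The proof of theorem 3 is complete.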

Appendix C: Proof of Theorem 4 Recall that t(δ) ≤ 2φn(H; δ) + ϕn(H; δ) +  160EUnH 2 δt n + 80t n . Notice that d(h, g) =P2(h − g)2. For any hH

δthere holds d(h, hH) ≤ d(h, h) + d(hH, h)A(P2h− P2h) + A(P2h H− P2h)A(P2h− P2h H) + 2 A(P2h H− P2h) ≤√Aδ + 2A ≤2A(δ + 4), where = P2h H− P2h∗.

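On the other hand, we have

$$\mathbb{E}\sup_{h,g\in\mathcal{H}_\delta}|T_n(h-g)| \le 2\,\mathbb{E}\sup_{h\in\mathcal{H}_\delta}|T_n(h-h_{\mathcal{H}})|$$

and

$$\mathbb{E}\sup_{h,g\in\mathcal{H}_\delta}|W_n(h-g)| \le 2\,\mathbb{E}\sup_{h\in\mathcal{H}_\delta}|W_n(h-h_{\mathcal{H}})|.$$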

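Therefore,

$$\varphi_n(\delta) \le 2\theta_n(2A(\delta+4\Delta)) \quad\text{and}\quad \phi_n(\delta) \le 2\vartheta_n(2A(\delta+4\Delta)).$$

Then there exists a constant $C > 0$ such that

$$\Psi_t(\delta) \le C\theta_n(2A(\delta+4\Delta)) + C\vartheta_n(2A(\delta+4\Delta)) + C\eta_n(2A(\delta+4\Delta))\,t^{1/2} + \frac{Ct}{n} =: \chi_1(\delta) + \chi_2(\delta) + \chi_3(\delta) + \chi_4(\delta).$$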

According to the property of the $\sharp$-transform, it follows that

$$\delta_n(t) = \Psi_t^{\sharp,q}\left(\frac{1}{2q}\right) \le \Psi_t^{\sharp}\left(\frac{1}{2q}\right).$$

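In order to estimate the $\sharp$-transform $\Psi_t^{\sharp}\left(\frac{1}{2q}\right)$, we bound the $\sharp$-transforms of $\chi_1(\delta)$, $\chi_2(\delta)$, $\chi_3(\delta)$, and $\chi_4(\delta)$, respectively. Let $\mu = \frac{1}{8q}$. Then the property of the $\sharp$-transform (see lemma 5 in Koltchinskii, 2006) gives, for all $0 < \tau \le 1$,

$$\chi_1^{\sharp}(\mu) \le \frac{1}{2A}\theta_n^{\sharp}\left(\frac{\tau\mu}{4CA}\right) + 4\tau.$$

Similarly,

$$\chi_2^{\sharp}(\mu) \le \frac{1}{2A}\vartheta_n^{\sharp}\left(\frac{\tau\mu}{4CA}\right) + 4\tau, \qquad \chi_3^{\sharp}(\mu) \le \frac{1}{2A}\eta_n^{\sharp}\left(\frac{\tau\mu}{4CA\sqrt{t}}\right) + 4\tau, \qquad \chi_4^{\sharp}(\mu) \le \frac{Ct}{\mu n}.$$

As a result,

$$\delta_n(t) \le \frac{1}{2A}\theta_n^{\sharp}\left(\frac{\tau\mu}{4CA}\right) + \frac{1}{2A}\vartheta_n^{\sharp}\left(\frac{\tau\mu}{4CA}\right) + \frac{1}{2A}\eta_n^{\sharp}\left(\frac{\tau\mu}{4CA\sqrt{t}}\right) + 12\tau + \frac{Ct}{\mu n}.$$

Taking $\varepsilon = 12\tau \in (0,1]$ and $\mu = 48C/K$, we deduce that

$$\delta_n(t) \le \varepsilon + \frac{1}{2A}\theta_n^{\sharp}\left(\frac{\varepsilon}{KA}\right) + \frac{1}{2A}\vartheta_n^{\sharp}\left(\frac{\varepsilon}{KA}\right) + \frac{1}{2A}\eta_n^{\sharp}\left(\frac{\varepsilon}{KA\sqrt{t}}\right) + \frac{Kt}{n},$$

and the desired estimate follows. The proof of theorem 4 is complete.

Appendix D: The Contraction Principle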

Let $\mathcal{A}$ denote a collection of $n \times n$ symmetric matrices $A$, and let $\varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. Rademacher variables. The matrices $A = (a_{ij})$ have zero diagonal, that is, $a_{ii} = 0$ for all $A \in \mathcal{A}$ and $i = 1, \ldots, n$.

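Theorem 8. Let $\phi_{ij} : \mathbb{R} \to \mathbb{R}$, $i, j = 1, \ldots, n$, be functions such that $\phi_{ij}(0) = 0$ and

$$|\phi_{ij}(\mu) - \phi_{ij}(\nu)| \le |\mu - \nu|, \quad \mu, \nu \in \mathbb{R}$$

(that is, the $\phi_{ij}$ are contractions). Let $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$ be convex and nondecreasing. Then, for any bounded subset $\mathcal{A}$ of $\mathbb{R}^{n\times n}$,

$$\mathbb{E}\,\Phi\left(\frac{1}{2}\sup_{A\in\mathcal{A}}\left|\sum_{i=1}^{n}\sum_{j=1}^{n}\varepsilon_i\varepsilon_j\phi_{ij}(a_{ij})\right|\right) \le \mathbb{E}\,\Phi\left(\sup_{A\in\mathcal{A}}\left|\sum_{i=1}^{n}\sum_{j=1}^{n}\varepsilon_i\varepsilon_j a_{ij}\right|\right).$$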

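Proof. We first prove that if $\Phi : \mathbb{R} \to \mathbb{R}_+$ is convex and nondecreasing, then

$$\mathbb{E}\,\Phi\left(\sup_{A\in\mathcal{A}}\sum_{i=1}^{n}\sum_{j=1}^{n}\varepsilon_i\varepsilon_j\phi_{ij}(a_{ij})\right) \le \mathbb{E}\,\Phi\left(\sup_{A\in\mathcal{A}}\sum_{i=1}^{n}\sum_{j=1}^{n}\varepsilon_i\varepsilon_j a_{ij}\right). \tag{D.1}$$

By conditioning and iteration, it suffices to show that if $T$ is a subset of $\mathbb{R}^3$ and $\phi$ is a contraction on $\mathbb{R}$ such that $\phi(0) = 0$, then for any $j \neq i$,

$$\mathbb{E}_{\varepsilon_i}\left[\Phi\left(\sup_{t\in T}[t_1 + \varepsilon_i t_2 + \varepsilon_i\varepsilon_j\phi(t_3)]\right)\Big|\,\varepsilon_j\right] \le \mathbb{E}_{\varepsilon_i}\left[\Phi\left(\sup_{t\in T}[t_1 + \varepsilon_i t_2 + \varepsilon_i\varepsilon_j t_3]\right)\Big|\,\varepsilon_j\right]. \tag{D.2}$$

The above inequality is equivalent to

$$\frac{1}{2}\Phi\left(\sup_{t\in T}[t_1 + t_2 + \varepsilon_j\phi(t_3)]\right) + \frac{1}{2}\Phi\left(\sup_{t\in T}[t_1 - t_2 - \varepsilon_j\phi(t_3)]\right) \le \frac{1}{2}\Phi\left(\sup_{t\in T}[t_1 + t_2 + \varepsilon_j t_3]\right) + \frac{1}{2}\Phi\left(\sup_{t\in T}[t_1 - t_2 - \varepsilon_j t_3]\right).$$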

We can prove the above inequality as in Ledoux and Talagrand (1991) and Koltchinskii (2011).

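In the general case, we have

$$\begin{aligned}
&\mathbb{E}\,\Phi\left(\sup_{A\in\mathcal{A}}\sum_{i=1}^{n}\sum_{i<j}^{n}\varepsilon_i\varepsilon_j\phi_{ij}(a_{ij})\right) \\
&\quad= \mathbb{E}_{\varepsilon_n\cdots\varepsilon_2}\mathbb{E}_{\varepsilon_1}\Phi\left(\sup_{A\in\mathcal{A}}\left(\sum_{i=2}^{n}\varepsilon_i\left(\sum_{i<j}^{n}\varepsilon_j\phi_{ij}(a_{ij})\right) + \sum_{j=2}^{n}\varepsilon_1\varepsilon_j\phi_{1j}(a_{1j})\right)\right) \\
&\quad= \mathbb{E}_{\varepsilon_n\cdots\varepsilon_2}\mathbb{E}_{\varepsilon_1}\Phi\left(\sup_{A\in\mathcal{A}}\left(\sum_{i=2}^{n}\varepsilon_i\left(\sum_{i<j}^{n}\varepsilon_j\phi_{ij}(a_{ij})\right) + \varepsilon_1\sum_{j=3}^{n}\varepsilon_j\phi_{1j}(a_{1j}) + \varepsilon_1\varepsilon_2\phi_{12}(a_{12})\right)\right).
\end{aligned}$$

Using inequality D.2, we have

$$\begin{aligned}
&\mathbb{E}_{\varepsilon_n\cdots\varepsilon_2}\mathbb{E}_{\varepsilon_1}\Phi\left(\sup_{A\in\mathcal{A}}\left(\sum_{i=2}^{n}\varepsilon_i\left(\sum_{i<j}^{n}\varepsilon_j\phi_{ij}(a_{ij})\right) + \varepsilon_1\sum_{j=3}^{n}\varepsilon_j\phi_{1j}(a_{1j}) + \varepsilon_1\varepsilon_2\phi_{12}(a_{12})\right)\right) \\
&\quad\le \mathbb{E}_{\varepsilon_n\cdots\varepsilon_2}\mathbb{E}_{\varepsilon_1}\Phi\left(\sup_{A\in\mathcal{A}}\left(\sum_{i=2}^{n}\varepsilon_i\left(\sum_{i<j}^{n}\varepsilon_j\phi_{ij}(a_{ij})\right) + \varepsilon_1\sum_{j=3}^{n}\varepsilon_j\phi_{1j}(a_{1j}) + \varepsilon_1\varepsilon_2 a_{12}\right)\right) \\
&\quad\le \cdots \le \mathbb{E}_{\varepsilon_n\cdots\varepsilon_2}\mathbb{E}_{\varepsilon_1}\Phi\left(\sup_{A\in\mathcal{A}}\left(\sum_{i=2}^{n}\varepsilon_i\left(\sum_{i<j}^{n}\varepsilon_j\phi_{ij}(a_{ij})\right) + \varepsilon_1\sum_{j=2}^{n}\varepsilon_j a_{1j}\right)\right) \\
&\quad= \mathbb{E}_{\varepsilon_1}\mathbb{E}_{\varepsilon_n\cdots\varepsilon_3}\mathbb{E}_{\varepsilon_2}\Phi\left(\sup_{A\in\mathcal{A}}\left(\sum_{i=3}^{n}\varepsilon_i\left(\sum_{i<j}^{n}\varepsilon_j\phi_{ij}(a_{ij})\right) + \varepsilon_1\sum_{j=3}^{n}\varepsilon_j a_{1j} + \varepsilon_2\left(\varepsilon_1 a_{12} + \sum_{j=4}^{n}\varepsilon_j\phi_{2j}(a_{2j})\right) + \varepsilon_2\varepsilon_3\phi_{23}(a_{23})\right)\right).
\end{aligned}$$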

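By an induction argument, we complete the proof of equation D.1. Note that $(-\varepsilon_i\varepsilon_j)$ has the same distribution as $(\varepsilon_i\varepsilon_j)$. Since $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$ is convex and nondecreasing, we have

$$\begin{aligned}
\mathbb{E}\,\Phi\left(\frac{1}{2}\sup_{A\in\mathcal{A}}\left|\sum_{i=1}^{n}\sum_{j=1}^{n}\varepsilon_i\varepsilon_j\phi_{ij}(a_{ij})\right|\right)
&\le \frac{1}{2}\mathbb{E}\,\Phi\left(\sup_{A\in\mathcal{A}}\left(\sum_{i=1}^{n}\sum_{j=1}^{n}\varepsilon_i\varepsilon_j\phi_{ij}(a_{ij})\right)_{+}\right) + \frac{1}{2}\mathbb{E}\,\Phi\left(\sup_{A\in\mathcal{A}}\left(\sum_{i=1}^{n}\sum_{j=1}^{n}-\varepsilon_i\varepsilon_j\phi_{ij}(a_{ij})\right)_{+}\right) \\
&\le \mathbb{E}\,\Phi\left(\sup_{A\in\mathcal{A}}\left|\sum_{i=1}^{n}\sum_{j=1}^{n}\varepsilon_i\varepsilon_j a_{ij}\right|\right),
\end{aligned}$$

where $(x)_+ = \max\{0, x\}$. The proof of theorem 8 is complete.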
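For small $n$, the inequality of theorem 8 with $\Phi(x) = x$ can be checked by exact enumeration of all $2^n$ sign vectors. The following is only an illustrative sketch: the two $3 \times 3$ matrices, the choice of contraction $\phi(x) = \frac{1}{2}x^2$ on $[-1, 1]$, and the function names are all assumptions made for the example, not objects taken from the letter.

```python
from itertools import product


def chaos_sup_mean(matrices):
    """Exact E sup_A |sum_{i,j} eps_i eps_j a_ij| over all 2^n Rademacher sign vectors."""
    n = len(matrices[0])
    total = 0.0
    for eps in product((-1, 1), repeat=n):
        total += max(
            abs(sum(eps[i] * eps[j] * a[i][j] for i in range(n) for j in range(n)))
            for a in matrices
        )
    return total / 2 ** n


def apply_contraction(a):
    """Entrywise phi(x) = x^2 / 2, a contraction on [-1, 1] with phi(0) = 0."""
    return [[0.5 * x * x for x in row] for row in a]


# Hypothetical family of symmetric matrices with zero diagonal and entries in [-1, 1].
A = [[0, 1, 1], [1, 0, -1], [1, -1, 0]]
B = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
family = [A, B]

lhs = 0.5 * chaos_sup_mean([apply_contraction(a) for a in family])  # left side, Phi(x) = x
rhs = chaos_sup_mean(family)                                        # right side, Phi(x) = x
```

For this family the left-hand side is strictly smaller than the right-hand side. A single example is of course no proof, but it makes the roles of the factor $\frac{1}{2}$ and of the zero diagonal concrete.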

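In particular, take $\Phi(x) = x$. Since $\phi(x) = \frac{1}{2}x^2$ is a contraction on $[-1, 1]$ with $\phi(0) = 0$, we obtain

$$\mathbb{E}\sup_{h\in\mathcal{H}}\left|\sum_{i=1}^{n}\sum_{j\neq i}\varepsilon_i\varepsilon_j h^2(x_i, x_j)\right| \le 4\,\mathbb{E}\sup_{h\in\mathcal{H}}\left|\sum_{i=1}^{n}\sum_{j\neq i}\varepsilon_i\varepsilon_j h(x_i, x_j)\right|. \tag{D.3}$$

Acknowledgments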

We are grateful to the editor and anonymous reviewer for their valuable comments and suggestions that helped improve the original version of this letter. The research was partially supported by NSFC under grants 61472155 and 11371007.

References

Agarwal, S., & Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. In P. Auer & R. Meir (Eds.), Lecture Notes in Computer Science: Vol. 3559. Proceedings of the 18th annual conference on learning theory (pp. 32–47). Berlin: Springer.

Arcones, M. A. (1995). A Bernstein type inequality for U-statistics and U-processes. Statist. Probab. Lett., 22, 223–230.

Arcones, M. A., & Giné, E. (1993). Limit theorems for U-processes. Annals of Probability, 21, 1494–1542.

Bartlett, P. L., Bousquet, O., & Mendelson, S. (2005). Local Rademacher complexities. Annals of Statistics, 33, 1497–1537.

Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.

Boucheron, S., Lugosi, G., & Massart, P. (2000). A sharp concentration inequality with applications. Random Structures Algorithms, 16, 277–292.

Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris Ser. I, 334, 495–500.

Clémençon, S., Lugosi, G., & Vayatis, N. (2008). Ranking and empirical minimization of U-statistics. Annals of Statistics, 36(2), 844–874.

Cohen, W. W., Schapire, R. E., & Singer, Y. (1999). Learning to order things. Journal of Artificial Intelligence Research, 10, 243–270.

Cossock, D., & Zhang, T. (2008). Statistical analysis of Bayes optimal subset ranking. IEEE Trans. Info. Theory, 54, 5140–5154.

De la Peña, V. H., & Giné, E. (1999). Decoupling: From dependence to independence. New York: Springer.

Dudley, R. M. (1999). Uniform central limit theorems. Cambridge: Cambridge University Press.

Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.

Hüllermeier, E., Fürnkranz, J., Cheng, W., & Brinker, K. (2008). Label ranking by learning pairwise preferences. Artif. Intell., 172, 1897–1916.

Klein, T., & Rio, E. (2005). Concentration around the mean for maxima of empirical processes. Annals of Probability, 33, 1060–1077.

Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34, 2593–2656.

Koltchinskii, V. (2011). Oracle inequalities in empirical risk minimization and sparse recovery problems. New York: Springer.

Koltchinskii, V., & Panchenko, D. (2000). Rademacher processes and bounding the risk of function learning. In E. Giné, D. Mason, & J. Wellner (Eds.), High dimensional probability II (pp. 443–459).

Ledoux, M. (1996). On Talagrand’s deviation inequalities for product measures.

Ledoux, M., & Talagrand, M. (1991). Probability in Banach spaces. New York: Springer-Verlag.

Liu, T. Y. (2011). Learning to rank for information retrieval. New York: Springer.

Massart, P. (2000a). Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse Math., 6, 245–303.

Massart, P. (2000b). About the constants in Talagrand’s inequality for empirical processes. Annals of Probability, 28, 863–884.

McDiarmid, C., & Reed, B. (2006). Concentration for self-bounding functions and an inequality of Talagrand. Random Structures and Algorithms, 29, 549–557.

Nolan, D., & Pollard, D. (1987). U-processes: Rates of convergence. Annals of Statistics, 15, 780–799.

Pahikkala, T., Tsivtsivadze, E., Airola, A., Boberg, J., & Salakoski, T. (2007). Learning to rank with pairwise regularized least-squares. In SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, 80, 27–33.

Rejchel, W. (2012). On ranking and generalization bounds. Journal of Machine Learning Research, 13, 1373–1392.

Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22, 28–76.

Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math., 126, 505–563.

van der Vaart, A., & Wellner, J. (1996). Weak convergence and empirical processes. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Ying, Y., & Campbell, C. (2010). Rademacher chaos complexities for learning the kernel problem. Neural Computation, 22, 2858–2886.
