Consistency of the partial loss - Statistical consistency of structured output learning with mi

4. Statistical consistency of structured output learning with missing labels

4.4. Consistency of the partial loss

In this section, we present the principal result which justifies learning of the structured clas- sifiers by minimization of the partial loss provided the data are generated from the statistical model defined in section4.3.

Theorem 4. Letp(x,y,a) be an arbitrary distribution defined over_{X × Y}V_{× A}V with prop-

erty A andp0(x,y) =P

a∈AVp(x,y,a)andp00(x,a) = P

y∈YVp(x,y,a)be the corresponding

marginal distributions. Let` be an additive loss (4.1) and let `p _{be an associated partial loss}

defined by (4.2). Then, for all sequences of random decision functionshm:X → TV (depend-

ing on training data generated from i.i.d variables with p00(x,a)), it holds

R`p(hm;p00)→P R∗`p(p00)⇔R`(hm;p0)

→R`∗(p0).

We start with a key lemma which shows that under proper assumptions a set of minimizers of the supervised risk is the same as the set of minimizers of the partial risk although the risk functions and their values are different.

Lemma 1. Let `:_YV _{× T}V _→ R+ be an additive loss function and let `p:AV × TV → R+

be the associated partial loss. Let p(x,y,a) be a distribution with the property A. Then,

h∗:X → TV _{is a minimizer of} _R`₍_h_;_p0₎ _{if and only if it is a minimizer of} _R`p

(h;p00), where

p0(x,y) =P

a∈AVp(x,y,a) andp00(x,a) = P

y∈YVp(x,y,a).

proof: The riskR`p(h;p00) can be rewritten as follows:

Here equality (_∗) holds because of the equality following from (4.4):

p(av |x) = X yv∈Y X zv∈Z p(yv |x)p(zv |x) [[av =α(yv, zv)]],

4.4. Consistency of the partial loss

which forav ={yv}gives us the following equality

p(av |x) =p(zv = 1|x)p(yv|x). It is seen from the last equation that if h(x)∗ _{= (}_h

v(x) |v ∈ V) is a minimizer of R`

(h;p) then for anyx∈ X and v∈ V it holds that

h∗_v(x)∈Argmin t∈T

yv∈Y

p(yv |x)`v(yv, t), (4.8)

because the marginalsp(zv|x)>0 thanks to Definition 8. Analogically, one can rewrite the riskR`₍_h_;_p0_{) as follows:} R`(h;p0) =Ep(x) X y∈YV p(y_|x)`(y,h(x)) =Ep(x) X v∈V X yv∈Y p(yv |x)`v(yv,hv(x)),

showing that also any minimizerh(x)∗ = (hv(x)|v∈ V) ofR`₍_h_;_p0_{) has to satisfy (4.8).}

Note that Lemma 1 shows that the `-risk and the `p_{-risk have the same set of minimiz-} ers. However, they do not have the same minimal value because the `-risk upper bounds the `p_{-risk. The lemma is easy to prove and understand. However, it is not immediately} applicable in practice because we cannot minimize the risks due unknown data generating distributionp(x,y,a). Instead, we resort to minimization of the empirical risk, by which we obtain approximate minimizers. The conditions under which the minimizers of the empirical risk converge are well studied [Vapnik, 1998]. It remains to show that convergence of the minimizers of the empirical partial risk to the expected partial risk implies the convergence of the same minimizers the expected true risk as stated in Theorem 4. The rigorous proof is not trivial. In the rest of the section, we give a road map of the proof and we leave the details to AppendixA.5.

It follows from Lemma 1 that for a fixed probability model p induced by a model with property A the functionHp₍_,_p

ya) : R×∆|YV_|×|AV_|→_R2 whose values is defined by minimize t∈TV pa >_`p t − min t0_∈TVpa >_`p t0 subject to py>`t− min t0_∈TVpy > `t0 ≥

is always positive for any > 0, where pya(x) = (p(y,a | x) | a ∈ AV,y ∈ YV). Flipped functionH(,pya) : R×∆|YV_|×|AV_|→_Rwhose values is defined by

minimize t∈TV py >_` t− min t0_∈TVpy >_` t0 subject to pa>`p_t − min t0_∈TVpa >_`p t0 ≥

is positive as well for any > 0. It is possible to show (see Lemmas 4 and 8 in Appendix) even stronger statement that for any > 0, Hp₍₎ _, _inf

pay∈Px

Hp₍_,_p

ay) > 0 and H() , 2_{We use ∆}

n = {p∈ Rn |pi ≥0,∀i∈ [n],Pn_i₌₁pi = 1} to denote the probability simplex inRn. In case

4. Statistical consistency of structured output learning with missing labels

inf pay∈Px

H(,pay) > 0. Thanks to this it is possible to show3 _{that for loss functions} _`₍_y_,_t₎

and `p₍_a_,_t_{) defined by (4.1), (4.2) there exist nonnegative concave functions}_ξ_:

R→R+and

ζ:R→R+, both right continuous at 0 with ξ(0) = 0 and ζ(0) = 0, such that∀ h:X → TV and for all distributions with property A it holds that

Ep(x)py(x)>`h(x)−Ep(x) min t0_∈TVpy(x)

>_`

t0 ≤

Ep(x)pa(x)>`p_h(x)−Ep(x) min t0_∈TVpa(x)

>_`p t0

Ep(x)pa(x)>`p_h(x)−Ep(x) min t0_∈TVpa(x)

>_`p t0 ≤

Ep(x)py(x)>`h(x)−Ep(x) min t0_∈TVpy(x)

>_`

See Lemmas 5 and 9 in Appendix for complete proof. Functions ξ and ζ make Theorem 4easy to prove.

proof: (⇒) We have that for any >0 and pya(x)∈ Px the inequality P{Ep(x)py(x)>`hm(x)−Ep(x) min

t0_∈TVpy(x) >_`

t0 > } ≤ P{ξ(Ep(x)pa(x)>`p_h_m_(x)−Ep(x) min

t0_∈TVpa(x) >_`p

t0)> }

holds. Since ξ(x) is right continuous at 0, there exists δ > 0 such that ∀x: x−0 ≤ δ ⇒ ξ(x)₋ξ(0)_≤. Hence, ifξ(x)> thenx > δ, thus we obtain

P{ξ(Ep(x)pa(x)>`p_h_m_(x)−Ep(x) min t0_∈TVpa(x)

>_`p

t0)> } ≤ P{Ep(x)pa(x)>`p_h_m_(x)−Ep(x) min

t0_∈TVpa(x) T_`p

t0 > δ} −→0, givenm→ ∞.

(⇐) implication is proved by repeating the same steps but using relation with functionζ.

In document Discriminative learning from partially annotated examples (Page 74-76)