
A Statistical Learning Framework

The basic idea of the statistical approach [Quadrianto et al., 2009, Patrini et al., 2014] is to estimate a quantity called the mean map operator, which is sufficient to learn a classifier. We follow [Quadrianto et al., 2009] to explain the idea. Consider learning a conditional exponential model:

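$$p(y \mid x; \theta) = \exp\big(\langle \phi(x, y), \theta \rangle - g(\theta \mid x)\big) \tag{3.1}$$

where $\phi(x, y)$ is a feature mapping to a Reproducing Kernel Hilbert Space and $g(\theta \mid x)$ is the log-partition function. If we could observe a set of independently and identically distributed training instances $(X, Y) = \{(x_j, y_j)\}_{j=1}^{n}$ sampled from a distribution $p(x, y)$ on $\mathcal{X} \times \mathcal{Y}$, then the conditional log-likelihood function would be:

$$\log p(Y \mid X, \theta) = \sum_{j=1}^{n} \big\{ \langle \phi(x_j, y_j), \theta \rangle - g(\theta \mid x_j) \big\} = n \langle \mu_{XY}, \theta \rangle - \sum_{j=1}^{n} g(\theta \mid x_j) \tag{3.2}$$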

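where $\mu_{XY} = \frac{1}{n} \sum_{j=1}^{n} \phi(x_j, y_j)$ is the empirical mean in the feature space. Notice that the labels enter the objective only through $\mu_{XY}$, so $\mu_{XY}$ is a sufficient statistic for the objective function. This is what makes LLP learning possible without knowing the labels of individual instances. To avoid over-fitting, one commonly maximizes the log-likelihood penalized by a prior $p(\theta)$; for example, a Gaussian prior on $\theta$ gives the quadratic penalty $\lambda \|\theta\|^2$. The optimization problem then becomes:

$$\theta^{*} = \arg\min_{\theta} \left[ \sum_{j=1}^{n} g(\theta \mid x_j) - n \langle \mu_{XY}, \theta \rangle + \lambda \|\theta\|^2 \right] \tag{3.3}$$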
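To illustrate why the labels enter the likelihood only through the empirical mean, the following minimal sketch evaluates the log-likelihood (3.2) in two ways and checks that both agree. The joint feature map `phi` (a one-hot encoding of y combined with x), the synthetic data and all function names are assumptions made for illustration; they are not taken from the cited work.

```python
import numpy as np

def phi(x, y, K):
    """Assumed joint feature map: one_hot(y) combined with x, one block per class."""
    d = x.shape[0]
    out = np.zeros(K * d)
    out[y * d:(y + 1) * d] = x
    return out

def log_partition(theta, x, K):
    """g(theta | x) = log sum_y exp(<phi(x, y), theta>), computed stably."""
    scores = np.array([phi(x, y, K) @ theta for y in range(K)])
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())

def loglik_with_labels(theta, X, Y, K):
    """First form of (3.2): uses every individual label y_j."""
    return sum(phi(x, y, K) @ theta - log_partition(theta, x, K)
               for x, y in zip(X, Y))

def loglik_from_mean(theta, X, mu_XY, K):
    """Second form of (3.2): labels are needed only through mu_XY."""
    n = len(X)
    return n * (mu_XY @ theta) - sum(log_partition(theta, x, K) for x in X)

rng = np.random.default_rng(0)
K, d, n = 3, 4, 50
X = rng.normal(size=(n, d))
Y = rng.integers(0, K, size=n)
theta = rng.normal(size=K * d)

mu_XY = np.mean([phi(x, y, K) for x, y in zip(X, Y)], axis=0)
ll_labels = loglik_with_labels(theta, X, Y, K)
ll_mean = loglik_from_mean(theta, X, mu_XY, K)
```

Under these assumptions `ll_labels` and `ll_mean` coincide up to floating-point error, so any estimate of the mean can be plugged into the second form without access to the individual labels.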

However, $\mu_{XY}$ is unknown because we cannot observe the instance labels. But notice that, under mild conditions, the empirical mean $\mu_{XY}$ is statistically well behaved and converges to the population mean $\mu_{xy} := E_{(x,y) \sim p(x,y)}[\phi(x, y)]$ at rate $O(n^{-1/2})$. Hence, the solution is to estimate the population mean $\mu_{xy}$ first and use it as a proxy for $\mu_{XY}$, and only

3.2.1 Estimation Of The Mean Operator

$\mu_{xy}$ can be decomposed into a sum of conditional expectations:

$$\mu_{xy} = \sum_{y \in \mathcal{Y}} p(y)\, \mu^{\text{class}}_{x}[y, y] \tag{3.4}$$

where:

• $p(y)$ is the prior distribution of $y \in \mathcal{Y}$. It is assumed known, since it can be seen as the class proportion of a special bag that contains the whole instance population.

• $\mu^{\text{class}}_{x}[y, y] = E_{x \sim p(x \mid y)}[\phi(x, y)]$ is the conditional expectation. More generally, the quantity $\mu^{\text{class}}_{x}[y, y'] := E_{x \sim p(x \mid y)}[\phi(x, y')]$ denotes the expectation of $\phi(x, y')$ when $x$ is drawn from $p(x \mid y)$.
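The decomposition (3.4) can be checked on a labelled sample, where it holds exactly once the prior and the class-conditional means are replaced by their empirical counterparts. The sketch below assumes a hypothetical joint feature map `phi` (one-hot label combined with x) and synthetic data; neither is part of the source.

```python
import numpy as np

def phi(x, y, K):
    """Assumed joint feature map: one_hot(y) combined with x."""
    d = x.shape[0]
    out = np.zeros(K * d)
    out[y * d:(y + 1) * d] = x
    return out

rng = np.random.default_rng(1)
K, d, n = 3, 2, 300
X = rng.normal(size=(n, d))
Y = rng.permutation(np.arange(n) % K)  # every class is guaranteed to appear

# Left-hand side: joint empirical mean over all (x_j, y_j)
mu_joint = np.mean([phi(x, y, K) for x, y in zip(X, Y)], axis=0)

# Right-hand side: sum over y of p(y) * mu_class[y, y]
prior = np.bincount(Y, minlength=K) / n
mu_class_diag = [np.mean([phi(x, y, K) for x in X[Y == y]], axis=0)
                 for y in range(K)]
mu_decomposed = sum(prior[y] * mu_class_diag[y] for y in range(K))
```

Here `mu_joint` and `mu_decomposed` agree up to rounding, which is the sample analogue of (3.4).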

In order to compute $\mu^{\text{class}}_{x}[y, y]$, a bag-based quantity $\mu^{\text{set}}_{x}[i, y']$ is introduced, which is the conditional expectation based on bag $i$:

$$\mu^{\text{set}}_{x}[i, y'] := E_{x \sim p(x \mid i)}[\phi(x, y')] \tag{3.5}$$

Note that the distribution $p(x \mid i)$ can be decomposed as:

$$p(x \mid i) = \sum_{y \in \mathcal{Y}} p(x, y \mid i) = \sum_{y \in \mathcal{Y}} p(x \mid y, i)\, p(y \mid i) \tag{3.6}$$

where $p(y \mid i) := \pi_{iy}$ is observed. In order to proceed, a crucial assumption has to be made on $p(x \mid y, i)$: the conditional independence $p(x \mid y, i) = p(x \mid y)$. In other words, the authors assume that the conditional distribution of $x$ is independent of the bag index $i$ once the label $y$ is known. After all, we want the distributions within each class to be independent of which bag they can be found in. If this were not the case, it would be impossible to infer anything about the distribution on the test set from the (biased) distributions over the bags. This assumption bridges the gap between the instance-based classifier to be learned and the bag-level labels that are given.

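Then, substituting (3.6) and the conditional independence assumption into Equation (3.5) gives:

$$\mu^{\text{set}}_{x}[i, y'] = \sum_{y \in \mathcal{Y}} \pi_{iy}\, E_{x \sim p(x \mid y)}[\phi(x, y')] = \sum_{y \in \mathcal{Y}} \pi_{iy}\, \mu^{\text{class}}_{x}[y, y'] \tag{3.7}$$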

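This means that $\mu^{\text{set}}_{x}[i, y']$ is a linear combination of the $\mu^{\text{class}}_{x}[y, y']$. So we rewrite Equation (3.7) as a linear equation system:

$$\mu^{\text{set}}_{x} = \pi\, \mu^{\text{class}}_{x} \tag{3.8}$$

where $\mu^{\text{set}}_{x}$, $\pi$ and $\mu^{\text{class}}_{x}$ are in matrix form. Provided $\pi^{\top}\pi$ is invertible (that is, $\pi$ has full column rank), $\mu^{\text{class}}_{x}$ is solved as:

$$\mu^{\text{class}}_{x} = (\pi^{\top}\pi)^{-1} \pi^{\top} \mu^{\text{set}}_{x} \tag{3.9}$$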
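As a small numerical illustration of (3.8) and (3.9), the sketch below builds bag-level means from known class-level means and recovers them by least squares. The array layout (`mu_set[i, y', :]`, `mu_class[y, y', :]`), the proportion matrix and the helper name `solve_class_means` are hypothetical choices for this example. `np.linalg.lstsq` is used instead of forming the inverse explicitly; for a full-column-rank proportion matrix it returns the same solution as (3.9).

```python
import numpy as np

def solve_class_means(pi, mu_set):
    """Least-squares solution of mu_set = pi @ mu_class.

    pi:     (n_bags, K) label proportions, pi[i, y] = p(y | i)
    mu_set: (n_bags, K, D) bag-level means indexed as [i, y', :]
    returns (K, K, D) class-level means indexed as [y, y', :]
    """
    n_bags, K = pi.shape
    flat = mu_set.reshape(n_bags, -1)            # one column per (y', feature)
    sol, *_ = np.linalg.lstsq(pi, flat, rcond=None)
    return sol.reshape(K, *mu_set.shape[1:])

rng = np.random.default_rng(0)
pi = np.array([[0.8, 0.2],
               [0.5, 0.5],
               [0.1, 0.9]])                      # 3 bags, 2 classes
mu_class_true = rng.normal(size=(2, 2, 3))       # [y, y', feature]

# Forward model (3.8): each bag mean is a pi-weighted mix of class means
mu_set = np.einsum("iy,ypd->ipd", pi, mu_class_true)

mu_class_lstsq = solve_class_means(pi, mu_set)

# The same solution via the normal equations written out as in (3.9)
flat = mu_set.reshape(3, -1)
mu_class_normal = (np.linalg.inv(pi.T @ pi) @ pi.T @ flat).reshape(2, 2, 3)
```

With more bags than classes the system is overdetermined, so on noisy bag means the result is a least-squares fit rather than an exact solution.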

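Lastly, $\mu^{\text{set}}_{x}[i, y']$ is estimated by its empirical value $\mu^{\text{set}}_{X}[i, y']$, since $\mu^{\text{set}}_{X}[i, y']$ also converges to $\mu^{\text{set}}_{x}[i, y']$ at rate $O(n_i^{-1/2})$:

$$\mu^{\text{set}}_{X}[i, y'] := \frac{1}{n_i} \sum_{x \in B_i} \phi(x, y') \tag{3.10}$$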

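Then, Equation (3.9) is rewritten as:

$$\hat{\mu}^{\text{class}}_{x} = (\pi^{\top}\pi)^{-1} \pi^{\top} \mu^{\text{set}}_{X} \tag{3.11}$$

Finally,

$$\hat{\mu}_{XY} = \sum_{y \in \mathcal{Y}} p(y)\, \hat{\mu}^{\text{class}}_{x}[y, y] \tag{3.12}$$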
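Putting (3.10), (3.11) and (3.12) together, the estimator can be sketched end to end on synthetic bags. Everything below is an illustrative assumption rather than the authors' implementation: the joint feature map `phi` (one-hot label combined with x), the Gaussian class-conditional data, the equal bag sizes (so that the prior is the average of the bag proportions) and the function name `estimate_mean_operator`. The hidden labels are kept only to compare the estimate against the true empirical mean.

```python
import numpy as np

def phi(x, y, K):
    """Assumed joint feature map: one_hot(y) combined with x."""
    d = x.shape[0]
    out = np.zeros(K * d)
    out[y * d:(y + 1) * d] = x
    return out

def estimate_mean_operator(bags, pi, prior):
    """bags: list of (n_i, d) arrays; pi: (n_bags, K); prior: (K,)."""
    n_bags, K = pi.shape
    # (3.10): empirical bag means for every candidate label y'
    mu_set = np.array([[np.mean([phi(x, yp, K) for x in B], axis=0)
                        for yp in range(K)] for B in bags])   # (n_bags, K, D)
    # (3.11): least-squares solve for the class-level means
    sol, *_ = np.linalg.lstsq(pi, mu_set.reshape(n_bags, -1), rcond=None)
    mu_class = sol.reshape(K, K, -1)                          # [y, y', :]
    # (3.12): combine the diagonal entries with the class prior
    return sum(prior[y] * mu_class[y, y] for y in range(K))

rng = np.random.default_rng(0)
K, d, n_per_bag = 2, 2, 2000
class_means = np.array([[2.0, 0.0], [-2.0, 1.0]])
pi = np.array([[0.8, 0.2], [0.5, 0.5], [0.2, 0.8]])

bags, labels = [], []
for row in pi:
    counts = np.round(row * n_per_bag).astype(int)
    y = np.repeat(np.arange(K), counts)          # hidden instance labels
    # x depends on y only, never on the bag (conditional independence)
    bags.append(class_means[y] + rng.normal(size=(len(y), d)))
    labels.append(y)

prior = pi.mean(axis=0)                          # valid because bags are equal-sized
mu_hat = estimate_mean_operator(bags, pi, prior)

# Reference value that would require the hidden labels
X_all, Y_all = np.vstack(bags), np.concatenate(labels)
mu_true = np.mean([phi(x, y, K) for x, y in zip(X_all, Y_all)], axis=0)
err = np.linalg.norm(mu_hat - mu_true)
```

In this setting `err` is small compared with the norm of `mu_true`, which is the behaviour the convergence argument predicts when bags are large and the proportion matrix is well conditioned.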

3.2.2 Assumptions Of The Statistical Learning Framework

Overall, the statistical learning algorithm works as follows: it uses the empirical means ($\mu^{\text{set}}_{X}[i, y']$) on the bags $\{B_i\}$ to approximate the expectations with respect to the bag distribution ($\mu^{\text{set}}_{x}[i, y']$); then it uses the latter to compute the expectations with respect to a given label ($\mu^{\text{class}}_{x}[y, y']$); finally, it uses the means conditional on the label distribution to obtain $\mu_{xy}$, which is a good proxy for $\mu_{XY}$, i.e.

$$\mu^{\text{set}}_{X}[i, y'] \longrightarrow \mu^{\text{set}}_{x}[i, y'] \longrightarrow \mu^{\text{class}}_{x}[y, y'] \longrightarrow \mu_{xy} \longrightarrow \mu_{XY} \tag{3.13}$$

The middle two steps in the sequence follow from linear algebra. But more importantly, the first and last steps in the chain require uniform convergence results. The last convergence is relatively easy to satisfy because it does not involve bags:

$$\mu_{XY} := \frac{1}{n} \sum_{j=1}^{n} \phi(x_j, y_j) \longrightarrow \mu_{xy} := E_{(x,y) \sim p(x,y)}[\phi(x, y)] \tag{3.14}$$

This convergence holds if (1) the instances are sampled i.i.d. from $p(x, y)$ and (2) the feature map $\phi$ maps into a Reproducing Kernel Hilbert Space [Bartlett and Mendelson, 2002].
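The convergence in (3.14) can be observed in a small simulation. The sketch below assumes a finite-dimensional joint feature map (one-hot label combined with x), Gaussian class-conditional distributions and a known prior, so that the population mean is available in closed form; all names and parameter values are illustrative.

```python
import numpy as np

def empirical_mean_map(X, Y, K):
    """Empirical mean of the assumed feature map one_hot(y) combined with x."""
    n, d = X.shape
    Phi = np.zeros((n, K * d))
    for y in range(K):
        mask = Y == y
        Phi[mask, y * d:(y + 1) * d] = X[mask]
    return Phi.mean(axis=0)

rng = np.random.default_rng(0)
K, d = 2, 2
class_means = np.array([[1.0, 0.0], [-1.0, 2.0]])
prior = np.array([0.3, 0.7])
# Population mean under the assumed map: block y equals p(y) * E[x | y]
mu_xy = (prior[:, None] * class_means).ravel()

def mean_error(n, repeats=50):
    """Average distance between the empirical and population means."""
    errs = []
    for _ in range(repeats):
        Y = rng.choice(K, size=n, p=prior)
        X = class_means[Y] + rng.normal(size=(n, d))
        errs.append(np.linalg.norm(empirical_mean_map(X, Y, K) - mu_xy))
    return float(np.mean(errs))

err_small = mean_error(100)
err_large = mean_error(10_000)
```

Increasing the sample size by a factor of 100 should shrink the error by roughly a factor of 10, in line with the $O(n^{-1/2})$ rate.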

The first convergence in the chain is really the key:

$$\mu^{\text{set}}_{X}[i, y'] := \frac{1}{n_i} \sum_{x \in B_i} \phi(x, y') \longrightarrow \mu^{\text{set}}_{x}[i, y'] := E_{x \sim p(x \mid i)}[\phi(x, y')] \tag{3.15}$$

To make this happen, the authors assume the conditional independence on bags that was discussed earlier:

$$p(x \mid i) = \sum_{y \in \mathcal{Y}} p(x \mid y, i)\, p(y \mid i) = \sum_{y \in \mathcal{Y}} p(x \mid y)\, p(y \mid i) \tag{3.16}$$

The assumption $p(x \mid y, i) = p(x \mid y)$ states that “within each bag, the distribution of the instances conditioned on class should be identical to the unbagged instance distribution conditioned on class”. In other words, the conditional distribution of instances is independent of the bags as long as the class is known. Therefore, if the bags and the instances inside them are not formed in this way, the method cannot learn an instance-level model.
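The effect of violating the assumption can be seen at the population level, without any sampling. The sketch below works directly with expectations of x, which is enough if one assumes a feature map that is linear in x; the proportion matrix, the class means and the bag-specific shifts are hypothetical values chosen for illustration.

```python
import numpy as np

pi = np.array([[0.8, 0.2], [0.5, 0.5], [0.2, 0.8]])   # pi[i, y] = p(y | i)
class_means = np.array([[2.0, 0.0], [-2.0, 1.0]])     # E[x | y]

def recover_class_means(bag_means):
    """Least-squares inversion of bag_means = pi @ class_means."""
    sol, *_ = np.linalg.lstsq(pi, bag_means, rcond=None)
    return sol

# Assumption holds: p(x | y, i) = p(x | y), so E[x | i] = sum_y pi[i, y] E[x | y]
bag_means_ok = pi @ class_means

# Assumption violated: each bag shifts x by its own offset, whatever the class.
# Rows of pi sum to one, so the shift adds directly to the bag mean.
bag_shift = np.array([[0.5, 0.0], [0.0, 0.0], [-0.5, 0.0]])
bag_means_bad = pi @ class_means + bag_shift

bias_ok = np.linalg.norm(recover_class_means(bag_means_ok) - class_means)
bias_bad = np.linalg.norm(recover_class_means(bag_means_bad) - class_means)
```

When the assumption holds the class means are recovered exactly, while the bag-dependent shift leaves a bias that no amount of additional data within the bags would remove.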