6.3.1 The backward correction
We can build an unbiased estimator of the loss function in the same sense as Theorem 52: the corrected loss, in expectation under label noise, equals the original loss computed on clean data. This property is stated in the next Theorem, a multi-class generalization of the already cited Lemma 1 of Natarajan et al. [2013]. The Theorem is also a particular instance of the more abstract [van Rooyen, 2015, Theorem 3.2].
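Theorem 60. Suppose that the noise matrix $T$ is non-singular. Given a loss $\ell$, define the backward corrected loss as:
$$\ell^{\leftarrow}(h(x)) = T^{-1}\,\ell(h(x)) . \tag{6.3}$$
Then, the loss correction is unbiased, i.e., its expectation under label noise is exactly the loss:
$$(\forall y = e_i) \quad \mathbb{E}_{p(\tilde{y}|y)}\,\ell^{\leftarrow}(h(x))_i = \ell(h(x))_i , \tag{6.4}$$
and therefore the minimizers are the same:
$$\operatorname*{argmin}_{h}\ \mathbb{E}_{\tilde{\mathcal{D}}}\,\ell^{\leftarrow}(y, h(x)) = \operatorname*{argmin}_{h}\ \mathbb{E}_{\mathcal{D}}\,\ell(y, h(x)) . \tag{6.5}$$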
Proof in 6.7.1. The corrected loss is effectively a linear combination of the loss values for each observable label, whose coefficients are given by the probability that $T^{-1}$ attributes to each possible true label $y$, given the observed one $\tilde{y}$. Intuitively, we are “going one step back” in the noise process described by the Markov chain $T$. The corrected loss is differentiable, although not always non-negative, and can be minimized with any off-the-shelf algorithm for back-propagation. Although in practice $T$ would be invertible almost surely, its condition number may be problematic. A simple solution is to mix $T$ with the identity matrix before inversion; this may be seen as taking a more conservative noise-free prior.
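As an illustration of Theorem 60, the following NumPy sketch applies the backward correction to a cross-entropy loss vector and checks unbiasedness numerically. It reads Equation (6.4) as the statement that averaging the corrected loss over the noisy labels drawn from row $i$ of $T$ recovers the clean loss for label $e_i$. The 3-class matrix `T`, the probability vector `probs`, and the helper `mix_with_identity` (with its weight `alpha`) are assumptions made for the example and do not come from the text.

```python
import numpy as np

def cross_entropy_vector(probs):
    """Loss vector l(h(x)): entry i is the loss incurred if the label were e_i."""
    return -np.log(probs)

def backward_corrected(loss_vec, T):
    """Backward correction of Equation (6.3): T^{-1} l(h(x))."""
    # Solving T z = l avoids forming the inverse of T explicitly.
    return np.linalg.solve(T, loss_vec)

def mix_with_identity(T, alpha):
    """Hypothetical conditioning helper: convex mix of T with the identity."""
    return (1.0 - alpha) * T + alpha * np.eye(T.shape[0])

# Assumed row-stochastic noise matrix: T[i, j] = p(noisy = e_j | clean = e_i).
T = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
probs = np.array([0.6, 0.3, 0.1])   # assumed softmax output at some x

loss = cross_entropy_vector(probs)
corrected = backward_corrected(loss, T)

# Entry i averages the corrected loss over noisy labels drawn from row i of T.
expected_under_noise = T @ corrected

# Mixing with the identity keeps the matrix row-stochastic.
T_mixed = mix_with_identity(T, alpha=0.5)
```

Here `expected_under_noise` coincides with `loss` up to floating-point error, while individual entries of `corrected` may be negative, in line with the remark that the corrected loss is not always non-negative.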
6.3.2 The forward correction
Alternatively, we can correct the model predictions. Following Sukhbaatar et al. [2015], we start by observing that a neural network learned with no loss correction would result in a predictor for noisy labels $p(\tilde{y}|x)$. We can make explicit the dependency on $T$. For instance, with cross-entropy we have:
$$\ell(e_i, h(x)) = -\log p(\tilde{y} = e_i \,|\, x) \tag{6.6}$$
$$= -\log \sum_{j \in [c]} p(\tilde{y} = e_i \,|\, y = e_j)\, p(y = e_j \,|\, x) \tag{6.7}$$
$$= -\log \sum_{j \in [c]} T_{ji}\, p(y = e_j \,|\, x) , \tag{6.8}$$
or in matrix form $\ell(h(x)) = -\log T^{\top} p(y|x)$. This loss compares the noisy label $\tilde{y}$ to the averaged noisy prediction corrupted by $T$. We call this procedure “forward” correction.
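The matrix form of Equations (6.6)-(6.8) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a training recipe: the function names, the batch layout (one row per sample), the integer encoding of noisy labels, and the example matrix `T` are all choices made for the example.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward_cross_entropy(logits, noisy_labels, T):
    """Batch mean of -log (T^T p(y|x))[noisy label]."""
    clean_probs = softmax(logits)        # p(y | x), shape (n, c)
    # Row k of the product equals T^T p(y | x_k): entry i is sum_j T[j, i] p_j.
    noisy_probs = clean_probs @ T
    picked = noisy_probs[np.arange(len(noisy_labels)), noisy_labels]
    return -np.log(picked).mean()

# Assumed row-stochastic noise matrix: T[j, i] = p(noisy = e_i | clean = e_j).
T = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
logits = np.zeros((2, 3))                # uniform predictions for two samples
noisy_labels = np.array([0, 2])          # observed (possibly corrupted) labels

value = forward_cross_entropy(logits, noisy_labels, T)
```

With `T` equal to the identity matrix the function reduces to the ordinary cross-entropy, which is a convenient sanity check.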
In order to analyze its behavior, we first need to recall the definition and properties of a new family of losses, named proper composite [Reid and Williamson, 2010, Section 4]. This is an additional requirement with respect to the properness of Definition 12. Many losses are said to be composite, in the sense that they can be expressed with the aid of a link function.
Definition 61. A loss $\ell_\psi$ is composite with link function $\psi : \Delta^{c-1} \to \mathbb{R}^c$, invertible, if it can be written as:
$$\ell_\psi(y, h(x)) = \ell(y, \psi^{-1}(h(x))) . \tag{6.9}$$
Cross-entropy and square loss are examples of proper composite losses. In the case of cross-entropy, the softmax is the inverse link function. When composite losses are also proper, their minimizer assumes the particular shape of the link function applied to the class probability:
$$\operatorname*{argmin}_{h}\ \mathbb{E}_{\mathcal{D}}\,\ell_\psi(y, h(x)) = \psi(p(y|x)) . \tag{6.10}$$
An intriguing robustness property holds for forward correction of proper composite losses.
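Theorem 62. Suppose that the noise matrix $T$ is non-singular. Given a proper composite loss $\ell_\psi$, define the forward loss correction as:
$$\ell^{\rightarrow}_\psi(h(x)) = \ell(T^{\top} \psi^{-1}(h(x))) . \tag{6.11}$$
Then, the minimizer of the corrected loss under the noisy distribution is the same as the minimizer of the original loss under the clean distribution:
$$\psi(p(y|x)) = \operatorname*{argmin}_{h}\ \mathbb{E}_{\tilde{\mathcal{D}}}\,\ell^{\rightarrow}_\psi(y, h(x)) \tag{6.12}$$
$$= \operatorname*{argmin}_{h}\ \mathbb{E}_{\mathcal{D}}\,\ell_\psi(y, h(x)) . \tag{6.13}$$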
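The statement of Theorem 62 can be checked numerically at a single point $x$ for the cross-entropy case. The sketch below works directly with the probability vector $q = \psi^{-1}(h(x))$ rather than with the scores $h(x)$, and compares the expected forward-corrected loss at the clean posterior against randomly drawn candidates. The matrix `T`, the vector `clean_posterior`, and the random search over the simplex are assumptions of the example; this is a sanity check and not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed row-stochastic noise matrix: T[j, i] = p(noisy = e_i | clean = e_j).
T = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
clean_posterior = np.array([0.6, 0.3, 0.1])   # assumed p(y | x) at a fixed x
noisy_posterior = T.T @ clean_posterior       # induced p(noisy y | x)

def expected_forward_loss(q):
    """Expected forward-corrected cross-entropy at x, under the noisy posterior."""
    return -(noisy_posterior * np.log(T.T @ q)).sum()

# Candidate predictions drawn uniformly from the probability simplex.
candidates = rng.dirichlet(np.ones(3), size=1000)
risk_at_clean = expected_forward_loss(clean_posterior)
risks = np.array([expected_forward_loss(q) for q in candidates])
```

No candidate attains a lower risk than `risk_at_clean`: the expected loss is a cross-entropy between the noisy posterior and $T^{\top} q$, which is smallest when the two agree, and with a non-singular $T$ that happens only at $q = p(y|x)$.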
2 Symmetric proper losses of Chapter 4 are also proper composite, with link function equal to the derivative of the generator $\phi$. See Nock and Nielsen [2009] and Reid and Williamson [2010] for details.
Algorithm 10: Robust two-stage training
Input: the noisy training set $\tilde{S}$, any loss $\ell$
If $T$ is unknown:
    Train a network $h(x)$ on $\tilde{S}$ with loss $\ell$
    Obtain an unlabeled sample $S_X$
    Estimate $\hat{T}$ by Equations (6.15)-(6.16) on $S_X$
Train the network $h(x)$ on $\tilde{S}$ with loss $\ell^{\leftarrow}$ or $\ell^{\rightarrow}$
Output: $h(\cdot)$
Proof in 6.7.2. The result expresses a weaker property than the unbiasedness of Theorem 60. Robustness applies to the minimizer only: the model learned by forward correction is the minimizer over the clean distribution. Yet, Theorem 62 guarantees noise independence without explicitly inverting the noise process; the inversion happens “behind the scenes” through a “de-noising” link function. This turns out to be an important factor in practice, as shown experimentally in Section 6.5 and discussed in Section 6.6.
6.3.3 Estimating the noise rates
A clear limitation of the above procedures is that they require knowing $T$. In most applications, the matrix $T$ is unknown and needs to be estimated. We present here an extension of the noise estimator of Menon et al. [2015] and Liu and Tao [2016] to the multi-class setting. The estimator is derived under two assumptions.
Theorem 63. Assume $p(x, y)$ is such that:
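(i) there exist “perfect examples” of each class $j \in [c]$, in the sense that:
$$(\exists\, \bar{x}_j \in \mathcal{X}) : \ p(\bar{x}_j) > 0 \ \wedge\ p(y = e_j \,|\, \bar{x}_j) = 1 .$$
(ii) given sufficiently many corrupted samples, $h$ is rich enough to model $p(\tilde{y}|x)$ accurately.
It follows that:
$$\forall i, j \in [c], \quad T_{ij} = p(\tilde{y} = e_j \,|\, \bar{x}_i) . \tag{6.14}$$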
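Equation (6.14) suggests reading the rows of $T$ off the noisy class-probability estimates at the perfect examples. The sketch below illustrates this under a loudly stated assumption: since the perfect examples are not known in advance, the sample with the highest predicted noisy probability for class $i$ is used as a stand-in for $\bar{x}_i$. This selection rule, the function name, and the toy data are choices made for the example and are not taken from the text above.

```python
import numpy as np

def estimate_noise_matrix(noisy_probs):
    """Estimate T from noisy class-probability estimates on a sample.

    noisy_probs[k, j] approximates p(noisy label = e_j | x_k).
    """
    # Assumption of this sketch: the sample scoring highest for class i
    # stands in for the "perfect example" of class i.
    anchors = noisy_probs.argmax(axis=0)   # one sample index per class
    # Row i of the estimate is the noisy prediction at the anchor of class i.
    return noisy_probs[anchors, :]

# Toy check with a known, assumed noise matrix and exact noisy posteriors.
T = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
clean_posteriors = np.array([[1.0, 0.0, 0.0],   # perfect example of class 0
                             [0.0, 1.0, 0.0],   # perfect example of class 1
                             [0.0, 0.0, 1.0],   # perfect example of class 2
                             [0.5, 0.3, 0.2],
                             [0.1, 0.8, 0.1],
                             [0.3, 0.3, 0.4]])
noisy_probs = clean_posteriors @ T               # p(noisy y | x_k) for each row
T_hat = estimate_noise_matrix(noisy_probs)
```

In this toy setting `T_hat` recovers `T` exactly, because the sample contains a perfect example of each class and every column of `T` peaks on the diagonal. With a trained network the estimate is only as good as the noisy probability estimates themselves.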
Proof in 6.7.3. Rather surprisingly, Theorem 63 tells us that we can estimate each component of the matrix $T$ just from noisy class probability estimates, that is, the output of the softmax of a network trained with noisy labels. In particular, let $S_X$ be any set of feature vectors. This can be taken from S itself, but not necessarily: we do not require this sample to have any label at all, and therefore whenever more unlabeled samples are easy to obtain from the same distribution, they could be used in place