Estimating Data Uncertainty - Uncertainty Estimation in Deep Learning with application to Spoke

For regression tasks, knowledge uncertainty represents uncertainty about both the mapping represented by the function $f(x)$ and the nature of the noise $\epsilon$. Unlike classification, where data uncertainty and the prediction are linked, knowledge uncertainty over the systematic component (prediction) and knowledge uncertainty over the noise (data uncertainty) can be treated separately for regression tasks. Consider the datasets in figure 3.3. There is no data in the region $|x| \geq 15$. Thus, in this region there will be uncertainty in the function being modelled as well as in the nature of the additive noise. It is difficult to know whether the systematic component changes or continues. Thus, the further a test input $x^*$ is from the region of training data, the more uncertainty there is in a model's estimate of both the underlying systematic component and the noise (data uncertainty).

3.2 Estimating Data Uncertainty

Having discussed the nature of data uncertainty and knowledge uncertainty for both classification and regression in the previous section, we now discuss how to obtain estimates of uncertainty in the predictions due to data uncertainty for both classification and regression models. Crucially, it is shown how a parametric probabilistic model will naturally estimate data uncertainty as a consequence of maximum likelihood estimation, given certain conditions.

3.2.1 Estimating Data Uncertainty for Classification

In order to obtain estimates of data uncertainty it is necessary to use a probabilistic model, such as a standard classification neural network which parameterizes a discrete posterior distribution over class labels $y$ conditioned on the input $x$:

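$$P(y|x^*;\hat{\theta}) = \mathrm{Cat}(y;\hat{\pi}), \quad \hat{\pi} = f(x^*;\hat{\theta}), \quad \sum_{c=1}^{K}\hat{\pi}_c = 1, \quad \hat{\pi}_c \geq 0 \tag{3.16}$$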
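The parameterization in equation 3.16 can be sketched in a few lines. The snippet below is a minimal illustration, assuming the network's final layer is a softmax (the text does not state which output layer is used) and using made-up logits in place of a trained network $f(x^*;\hat{\theta})$.

```python
import numpy as np

def categorical_posterior(logits):
    """Map raw network outputs (logits) to class probabilities via softmax.

    Assumption: a softmax output layer; the logits stand in for f(x*; theta_hat).
    """
    logits = np.asarray(logits, dtype=float)
    # Subtract the maximum for numerical stability; this does not change the result.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for a 3-class problem.
pi_hat = categorical_posterior([2.0, 0.5, -1.0])
print(pi_hat, pi_hat.sum())
```

The resulting vector satisfies both constraints in equation 3.16: its entries are non-negative and sum to one.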

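In chapter 2 section 2.2.1 it was shown that the minimization of the expected negative log-likelihood is equivalent to minimizing the expected KL divergence between the model and the true conditional distribution $P_{tr}(y|x)$. The result is reproduced below for convenience:

$$\mathcal{L}^{NLL}(\theta, \mathcal{D}) = \mathbb{E}_{p_{tr}(x)}\Big[\underbrace{\mathrm{KL}\big[P_{tr}(y|x)\,\|\,P(y|x;\theta)\big]}_{\text{Reducible Loss}} + \underbrace{\mathcal{H}\big[P_{tr}(y|x)\big]}_{\text{Irreducible Loss}}\Big] \geq \mathbb{E}_{p_{tr}(x)}\Big[\mathcal{H}\big[P_{tr}(y|x)\big]\Big] \tag{3.17}$$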

where the expected negative log-likelihood is lower-bounded by the expected entropy of $P_{tr}(y|x)$, which, as discussed in section 3.1.1, is the average data uncertainty of the distribution $P_{tr}(y|x)$. As was also discussed there, the conditional entropy of the underlying distribution, $\mathcal{H}[P_{tr}(y|x)]$, represents the data uncertainty at the input $x$. Thus, a model $P(y|x;\theta)$ should capture uncertainty in predictions due to data uncertainty in its posterior over classes when trained via maximum likelihood.
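The decomposition in equation 3.17 can be checked numerically for a single input $x$, dropping the outer expectation over $p_{tr}(x)$. The sketch below uses two made-up categorical distributions, `p_true` and `q_model`, as stand-ins for $P_{tr}(y|x)$ and $P(y|x;\theta)$. They are illustrative values, not results from the text.

```python
import numpy as np

def entropy(p):
    """Entropy (in nats) of a discrete distribution, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(p, q):
    """KL[p || q] for discrete distributions with q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def expected_nll(p_true, q_model):
    """Expected negative log-likelihood of the model under the true distribution."""
    p_true = np.asarray(p_true, dtype=float)
    q_model = np.asarray(q_model, dtype=float)
    return -np.sum(p_true * np.log(q_model))

# Hypothetical distributions at one input x.
p_true = np.array([0.7, 0.2, 0.1])
q_model = np.array([0.5, 0.3, 0.2])

nll = expected_nll(p_true, q_model)
reducible = kl_divergence(p_true, q_model)
irreducible = entropy(p_true)
print(nll, reducible + irreducible)
```

The two printed values agree, and the loss only reaches its lower bound, the entropy of `p_true`, when the model matches the true distribution exactly.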

For this result to hold, two conditions are necessary: an infinite amount of training data must be available, and the true underlying distribution must lie within the class of models which can be parameterized. Thus, in practice a model will only capture an estimate of data uncertainty, as it is only possible to minimize the KL divergence with respect to an empirical distribution derived from a finite training dataset, $\hat{p}_{tr}(x, y) = \mathcal{D}_{trn}$. A model $P(y|x;\theta)$ which is too large may over-fit the training data and yield poor estimates of uncertainty on a held-out test dataset. Conversely, a model which is too simple may fail to fit the training data at all and also yield poor estimates of data uncertainty. The quality of estimates of data uncertainty should improve asymptotically as the amount of training data grows. Models which generalize well should also yield more accurate estimates of data uncertainty.
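The effect of finite training data can be illustrated with a toy experiment. The sketch below assumes a single input $x$ and a tabular model whose maximum likelihood estimate is simply the vector of relative class frequencies; the distribution `p_true` and the sample sizes are made up for illustration.

```python
import numpy as np

def entropy(p):
    """Entropy (in nats) of a discrete distribution, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def entropy_of_ml_estimate(p_true, n, rng):
    """Draw n labels from p_true and return the entropy of the ML estimate."""
    samples = rng.choice(len(p_true), size=n, p=p_true)
    # For a categorical model, the ML estimate is the relative frequency of each class.
    p_hat = np.bincount(samples, minlength=len(p_true)) / n
    return entropy(p_hat)

p_true = np.array([0.7, 0.2, 0.1])   # hypothetical true distribution
rng = np.random.default_rng(0)

# Absolute error of the estimated data uncertainty for growing dataset sizes.
errors = {
    n: abs(entropy_of_ml_estimate(p_true, n, rng) - entropy(p_true))
    for n in (10, 1_000, 100_000)
}
print(errors)
```

With very few samples the estimated entropy can be far from the true value, while with a large sample it is close, which mirrors the asymptotic behavior described above.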

Given a model $P(y|x;\hat{\theta})$ which generalizes well, the expected behavior for inputs in regions of low and high data uncertainty is shown in figure 3.4. The entropy of the predictive posterior is the model's estimate of data uncertainty at a particular test input $x^*$:

$$\mathcal{H}\big[P(y|x^*;\hat{\theta})\big] = -\sum_{c=1}^{K} P(y=\omega_c|x^*;\hat{\theta}) \ln P(y=\omega_c|x^*;\hat{\theta}) \tag{3.18}$$
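Equation 3.18 is straightforward to compute from a vector of class probabilities. The sketch below uses two made-up posteriors, one sharply peaked and one uniform, in the spirit of the low and high uncertainty cases; the `eps` clipping constant is an implementation choice to avoid evaluating `log(0)`.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Entropy (in nats) of a categorical posterior, as in equation 3.18."""
    probs = np.asarray(probs, dtype=float)
    # Clip only inside the log so that zero-probability classes contribute zero.
    return -np.sum(probs * np.log(np.clip(probs, eps, 1.0)), axis=-1)

confident = np.array([0.98, 0.01, 0.01])   # hypothetical low-uncertainty posterior
uniform = np.full(3, 1.0 / 3.0)            # hypothetical high-uncertainty posterior

print(predictive_entropy(confident), predictive_entropy(uniform))
```

The uniform posterior attains the maximum possible entropy for $K$ classes, $\ln K$, while a one-hot posterior has zero entropy.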

However, entropy is an 'overall' measure of uncertainty in the predictions, as it depends on the entire posterior distribution over classes. It is possible to obtain a measure of uncertainty in predicting the most likely class for a particular test input $x^*$ via the likelihood of that class:

$$\mathcal{P} = \max_{c}\big\{P(y=\omega_c|x^*;\hat{\theta})\big\} \tag{3.19}$$


Figure 3.4 Indication of uncertainty via posterior over class labels $P(y|x^*;\hat{\theta})$. (a) Low Uncertainty. (b) High Uncertainty.

This yields the model's confidence in its prediction, $\mathcal{P}$. This measure of uncertainty is not affected by the probabilities of the other classes, as they are irrelevant to the prediction, and may yield a more precise estimate of uncertainty in the prediction.
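The difference between the two measures can be seen on a small example. The sketch below compares two made-up posteriors that assign the same probability to the most likely class but distribute the remaining mass differently; the function names are illustrative.

```python
import numpy as np

def max_probability(probs):
    """Confidence of the prediction: probability of the most likely class (equation 3.19)."""
    return np.max(np.asarray(probs, dtype=float), axis=-1)

def predictive_entropy(probs, eps=1e-12):
    """Entropy (in nats) of a categorical posterior (equation 3.18)."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(np.clip(probs, eps, 1.0)), axis=-1)

# Hypothetical posteriors with the same top-class probability.
two_way = np.array([0.6, 0.4, 0.0])
three_way = np.array([0.6, 0.2, 0.2])

print(max_probability(two_way), max_probability(three_way))
print(predictive_entropy(two_way), predictive_entropy(three_way))
```

Both posteriors have the same confidence, yet their entropies differ, because entropy also reflects how the remaining probability mass is spread over the other classes.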

Estimation of data uncertainty is illustrated on the Low Data Uncertainty (LDU) and High Data Uncertainty (HDU) datasets introduced in section 3.1.1. Figure 3.5 demonstrates how the conditional entropy of a pair of simple DNNs trained using maximum likelihood on these datasets captures data uncertainty. The distribution of entropy of DNNs trained on these datasets, shown in figures 3.5a and 3.5b, is almost identical to the distribution of entropy of the true underlying distribution, shown in figures 3.1b and 3.1d. In these experiments it is easy to obtain enough training data, and the true underlying distribution is easily within the class of models which can be parameterized by the neural networks considered here. In practice it is difficult to fully satisfy these conditions for real applications.

3.2.2 Estimating Data Uncertainty for Regression

Figure 3.5 Conditional entropy $\mathcal{H}[P(y|x^*;\hat{\theta})]$ of a pair of classification neural networks with 2 hidden layers of 100 units with ReLU activations, trained on the LDU and HDU datasets with maximum likelihood using the Adam [62] optimizer. (a) Low Data Uncertainty dataset. (b) High Data Uncertainty dataset.

The previous section shows how a classification model will naturally capture data uncertainty as a consequence of maximum likelihood training. This section demonstrates how the same can be done for regression tasks when using probabilistic models for regression. It was shown in section 2.2.2 that training a Density Network $p(y|x;\theta)$ via maximum likelihood is equivalent to minimizing the expected KL divergence between the model and the true conditional distribution $p_{tr}(y|x)$. The result is reproduced below for convenience:

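$$\mathcal{L}^{NLL}(\theta, \mathcal{D}) = \mathbb{E}_{p_{tr}(x)}\Big[\underbrace{\mathrm{KL}\big[p_{tr}(y|x)\,\|\,p(y|x;\theta)\big]}_{\text{Reducible Loss}} + \underbrace{\mathcal{H}\big[p_{tr}(y|x)\big]}_{\text{Irreducible Loss}}\Big] \geq \mathbb{E}_{p_{tr}(x)}\Big[\mathcal{H}\big[p_{tr}(y|x)\big]\Big] \tag{3.20}$$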

where the expected negative log-likelihood is lower-bounded by the expected differential entropy of $p_{tr}(y|x)$. As discussed in section 3.1.2, data uncertainty for regression tasks is expressed as the differential entropy of the underlying distribution. Thus, as the loss is reduced and the model $p(y|x;\theta)$ becomes closer to $p_{tr}(y|x)$, it will yield increasingly accurate estimates of data uncertainty in the form of the differential entropy of the density network for a test input $x^*$:

$$\mathcal{H}\big[p(y|x^*;\hat{\theta})\big] = -\int p(y|x^*;\hat{\theta}) \ln p(y|x^*;\hat{\theta})\, dy \tag{3.21}$$
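For some output densities the integral in equation 3.21 has a closed form. The sketch below assumes the density network outputs a univariate Gaussian with mean `mu` and standard deviation `sigma` (one possible choice, not one prescribed by the text), for which the differential entropy is $\frac{1}{2}\ln(2\pi e \sigma^2)$, and checks this against a simple numerical integration.

```python
import numpy as np

def gaussian_entropy(sigma):
    """Closed-form differential entropy (in nats) of N(mu, sigma^2)."""
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)

def numerical_entropy(mu, sigma, num=200_001, width=12.0):
    """Approximate -int p(y) ln p(y) dy for a Gaussian with a Riemann sum."""
    y = np.linspace(mu - width * sigma, mu + width * sigma, num)
    dy = y[1] - y[0]
    log_p = -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2
    p = np.exp(log_p)
    return -np.sum(p * log_p) * dy

# Hypothetical predicted mean and standard deviation at a test input x*.
mu, sigma = 1.0, 0.5
print(gaussian_entropy(sigma), numerical_entropy(mu, sigma))
```

The entropy depends only on `sigma`, not on `mu`, and, unlike the discrete entropy of equation 3.18, differential entropy can be negative when the predicted density is sufficiently narrow.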

Modelling data uncertainty for regression tasks is more difficult than for classification tasks. For the result in equation 3.20 to hold it is necessary to have infinite training data and for the true distribution to be within the class of models parameterizable by the neural network. To satisfy the second condition it is necessary not only to have a model of sufficient capacity, but also to choose an appropriate probability density function for the Density Network to parameterize. The choice of probability density which the model parameterizes should reflect the structure of the noise in the data, whether homoscedastic or heteroscedastic.
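The importance of matching the noise structure can be shown on synthetic data. The sketch below is a toy comparison, not a trained Density Network: it assumes the systematic component is known, generates made-up heteroscedastic Gaussian noise, and compares the average negative log-likelihood of a homoscedastic Gaussian (one shared variance) with that of a heteroscedastic Gaussian using the true input-dependent variance.

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Average negative log-likelihood of y under N(mu, sigma^2)."""
    return np.mean(0.5 * np.log(2.0 * np.pi * sigma**2) + 0.5 * ((y - mu) / sigma) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=10_000)

mu = np.sin(x)                         # systematic component (assumed known here)
sigma_true = 0.1 + 0.5 * np.abs(x)     # hypothetical input-dependent noise level
y = mu + sigma_true * rng.standard_normal(x.shape)

# Homoscedastic model: a single shared sigma, set to its maximum likelihood value.
sigma_const = np.sqrt(np.mean((y - mu) ** 2))
nll_homo = gaussian_nll(y, mu, sigma_const)

# Heteroscedastic model: sigma varies with the input.
nll_hetero = gaussian_nll(y, mu, sigma_true)

print(nll_homo, nll_hetero)
```

Even with the best possible shared variance, the homoscedastic model attains a higher negative log-likelihood on this data, so its density cannot reach the lower bound in equation 3.20 regardless of network capacity.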

In document Uncertainty Estimation in Deep Learning with application to Spoken Language Assessment (Page 63-67)