
3.3 Estimating Knowledge Uncertainty via Single Models

3.3.1 Single Model Approaches for Classification

As discussed in section 3.1.1, knowledge uncertainty corresponds to uncertainty in predictions due to a lack of knowledge about the mapping $x \mapsto y$ in the region of the input space from which a test sample $x^*$ came. One way a model can indicate this is via a high-entropy posterior distribution over class labels (eqn. 3.18), as shown in figure 3.4b. The current section discusses how a single probabilistic classification model can be made to yield a high-entropy posterior for out-of-distribution inputs, which come from regions of the input space far from the training data.

6 In contrast to ensemble approaches considered in the next section, which use multiple sets of model

42 Predictive Uncertainty Estimation

The simplest approach is to hope that a model trained via maximum likelihood will naturally yield a high-entropy posterior distribution over classes for out-of-distribution (OOD) inputs. This approach was evaluated as a baseline for detection of misclassifications and out-of-distribution samples in [49]. However, standard maximum likelihood estimation contains no mechanism which drives a model to learn the limits of its knowledge. In general, it is difficult to guarantee any particular behaviour on out-of-distribution data for parametric models trained with standard maximum likelihood, especially neural networks, as they are complex non-linear parametric functions.

A diverse range of approaches has been proposed to modify a neural network classification model so that it produces high-entropy predictive posteriors for out-of-distribution inputs, such as [73, 74, 91]. However, while many of these methods have impressive empirical results, few have a solid theoretical justification for why they work. Consequently, this thesis considers only a particular class of single-model approaches [73, 80] which does provide a theoretical justification. This approach [73] involves multi-task training of a model to simultaneously minimize the negative log-likelihood on in-domain training data and the KL divergence between a uniform distribution $\mathcal{U}(y)$ and the model on out-of-domain training data:

$$
\mathcal{L}_{MT}(\theta, \mathcal{D}) = \underbrace{\mathcal{L}_{NLL}(\theta, \mathcal{D}_{trn})}_{\text{In-Domain Loss}} + \gamma \cdot \underbrace{\mathbb{E}_{\hat{p}_{out}(x)}\big[ KL[\,\mathcal{U}(y)\,\|\,P(y|x;\theta)\,] \big]}_{\text{Out-of-Distribution Loss}} \tag{3.22}
$$

where $\gamma$ is a weight associated with the out-of-distribution loss. By minimizing this loss function the model should learn a decision boundary between the in-domain region and the rest of the input space, given an appropriate choice of out-of-distribution training data.
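The multi-task loss in equation 3.22 can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the thesis: the function names are hypothetical, the model is represented by its raw class logits, and both expectations are assumed to be approximated by simple averages over a batch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(probs, label):
    """Negative log-likelihood of the correct class for one in-domain input."""
    return -math.log(probs[label])

def kl_uniform_to_model(probs):
    """KL[U(y) || P(y|x)] for one input, with U uniform over the classes."""
    u = 1.0 / len(probs)
    return sum(u * (math.log(u) - math.log(p)) for p in probs)

def multitask_loss(in_logits, in_labels, out_logits, gamma=1.0):
    """In-domain NLL plus gamma times the KL term on out-of-domain inputs.

    Assumption: both terms are batch averages; gamma=1.0 is an arbitrary default.
    """
    in_loss = sum(nll(softmax(z), y) for z, y in zip(in_logits, in_labels)) / len(in_logits)
    out_loss = sum(kl_uniform_to_model(softmax(z)) for z in out_logits) / len(out_logits)
    return in_loss + gamma * out_loss
```

The KL term is zero exactly when the model outputs a uniform posterior on an out-of-domain input, and grows as the posterior becomes more peaked, which is what pushes the model towards high entropy away from the training data.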

This approach can be interpreted as explicitly building in knowledge about the limits of the model’s understanding, which is encoded via the choice of out-of-domain training distribution $\mathcal{D}_{out} = \hat{p}_{out}(x)$. It is necessary to select $\mathcal{D}_{out}$ in such a way as to learn a tight decision boundary between the in-domain region and everything else. Consider figure 3.6, which depicts the three-class toy classification Low Data Uncertainty dataset introduced in section 3.1.1. The in-domain data is shown in red and out-of-distribution data is shown in green. If the decision boundary is ‘too loose’, as depicted in figure 3.6a, then certain out-of-domain inputs may be incorrectly considered to be in-domain. Alternatively, if the decision boundary is ‘too tight’, as depicted in figure 3.6b, then certain in-domain inputs may be incorrectly considered out-of-domain. The out-of-distribution data must be carefully chosen so that it is near the in-domain region, but does not overlap with it, as depicted in figure 3.6c.

(a) Loose Decision Boundary (b) Overlap of In-domain and OOD Data (c) Tight Decision Boundary

Figure 3.6 Illustration of in-domain (red) and out-of-domain (green) training data using a toy example. Out-of-domain training data should be close to the in-domain data in order to learn a tight decision boundary around the in-domain region.

Crucially, $\mathcal{D}_{out}$ must lie on, or close to, the same manifold on which the in-domain data lies, as shown in figure 3.7. The motivation for this is that in a real deployment scenario out-of-distribution data is likely to have a similar structure to the in-domain data. For example, in image classification tasks where the in-domain data consists of natural images, OOD images are also likely to be natural images of the real world. In this scenario, using images of cartoons, random noise or other ‘unnatural’ images as OOD training data will result in the model learning a decision boundary which is too loose. One approach to generating such data is to use a generative model, such as Factor Analysis [92, 80], Variational Auto-encoders [63] or Generative Adversarial Networks [39, 73]. However, generating appropriate OOD training data is still an open task. In practice it is possible to use data from a different, appropriately chosen dataset [73].

Figure 3.7 Low-dimensional manifold of data in high-dimensional input space. This figure shows both the in-domain data and out-of-distribution data lying on the same 2-dimensional manifold in a 3-D input space.

Consider a model trained via the loss specified in equation 3.22 on appropriately chosen out-of-distribution data. The entropy of the posterior over classes produced by this model becomes a measure of total uncertainty rather than just data uncertainty or knowledge uncertainty. This leads to a complication: in order to distinguish out-of-domain and in-domain inputs, this approach implicitly assumes that the entropy of the model’s posterior over classes never reaches its maximum in-domain. Otherwise, a high-entropy posterior over classes could indicate uncertainty in the prediction due to either an in-domain input in a region of severe class overlap or an out-of-distribution input far from the training data. However, this assumption may or may not hold, depending on the nature of the in-domain data. As a result, while it is possible to determine whether the model is uncertain, it may not always be possible to robustly determine why the model is uncertain using this approach. This is a drawback for applications where it is necessary to determine the source of uncertainty.
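The ambiguity described above can be made concrete with a small sketch. The three posteriors below are hypothetical values chosen for illustration, not outputs of any model in this thesis: one stands for an in-domain input where all three classes overlap, one for an out-of-distribution input whose posterior was driven to uniform by the KL term in equation 3.22, and one for a confidently classified in-domain input.

```python
import math

def entropy(probs):
    """Entropy (in nats) of a categorical distribution given as a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical posteriors over three classes (illustrative values only).
overlap_posterior = [1 / 3, 1 / 3, 1 / 3]    # in-domain input, all classes overlap
ood_posterior = [1 / 3, 1 / 3, 1 / 3]        # OOD input, posterior trained towards uniform
confident_posterior = [0.98, 0.01, 0.01]     # in-domain input far from any class boundary

for name, probs in [("overlap", overlap_posterior),
                    ("ood", ood_posterior),
                    ("confident", confident_posterior)]:
    print(name, round(entropy(probs), 4))
```

The first two inputs produce identical entropy, equal to the maximum $\ln 3$, so a threshold on entropy alone cannot separate them, even though the sources of uncertainty differ.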

This problem is illustrated on the 3-class artificial datasets introduced in section 3.1.1. Figure 3.8 shows the entropy of the predictive posterior derived from a pair of DNNs trained on the Low Data Uncertainty (LDU) and High Data Uncertainty (HDU) datasets, respectively, using the loss specified in equation 3.22. The out-of-distribution training data was sampled as shown in figure 3.6c. Figure 3.8 shows that the entropy is high in the entire out-of-domain region on both the LDU and HDU datasets, which is the desired out-of-distribution behaviour. However, the entropy is also high along the decision boundaries, which is most clearly seen in figure 3.8b. In fact, the entropy is as high in the region where all classes overlap as it is out-of-distribution. This makes it difficult to distinguish inputs associated with a high degree of data uncertainty from inputs associated with a high degree of knowledge uncertainty.

(a) Total Uncertainty (b) Total Uncertainty

Figure 3.8 Entropy of predictive posterior $\mathcal{H}[P(y|x;\hat{\theta})]$ derived from DNNs trained in a multi-task fashion on the LDU and HDU datasets using equation 3.22. The DNNs had 2 layers of 100 ReLU units.

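An ad-hoc solution is to introduce an extra ‘output head’ to yield the probability of the input being in-domain, $P(\text{in}|x^*;\hat{\theta})$:

$$
\begin{aligned}
P(y|x^*;\hat{\theta}) &= \text{Cat}(y;\hat{\pi}) \\
P(\text{in}|x^*;\hat{\theta}) &= \hat{\pi}_{in} \\
\{\hat{\pi}, \hat{\pi}_{in}\} &= f(x^*;\hat{\theta})
\end{aligned} \tag{3.23}
$$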

This model would use the softmax to yield a distribution over classes, from which measures of total uncertainty are derived, and a separate probabilistic output head, which gives the probability of the input being in-domain. Such a model can be trained in a multi-task fashion with the following loss:
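A minimal sketch of how the two outputs in equation 3.23 might be computed from the raw outputs of a network is given below. The text specifies a softmax for the class posterior; the use of a sigmoid for the in-domain head is an assumption made here for illustration, as are the function names.

```python
import math

def softmax(logits):
    """Numerically stable softmax, giving the class probabilities pi_hat."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Numerically stable logistic function (assumed form of the in-domain head)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def two_head_outputs(class_logits, in_domain_logit):
    """Map raw network outputs to (pi_hat, pi_in) as in equation 3.23."""
    pi_hat = softmax(class_logits)      # P(y | x*): categorical over classes
    pi_in = sigmoid(in_domain_logit)    # P(in | x*): scalar in (0, 1)
    return pi_hat, pi_in
```

The two heads share the same network $f$ but are otherwise independent, so the in-domain probability can be high or low regardless of how peaked the class posterior is.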

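$$
\begin{aligned}
\mathcal{L}(\theta, \mathcal{D}) &= \mathcal{L}_{MT}(\theta, \mathcal{D}) + \gamma \cdot \mathcal{L}_{AD}(\theta, \mathcal{D}) \\
\mathcal{L}_{AD}(\theta, \mathcal{D}) &= \underbrace{\mathbb{E}_{\hat{p}_{tr}(x)}\big[\ln P(\text{in}|x;\theta)\big]}_{\text{In-Domain Loss}} + \underbrace{\mathbb{E}_{\hat{p}_{out}(x)}\big[\ln\big(1 - P(\text{in}|x;\theta)\big)\big]}_{\text{Out-of-Distribution Loss}}
\end{aligned} \tag{3.24}
$$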
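The in-domain detection term of equation 3.24 can be sketched as follows. The two expectations are written as log-probabilities exactly as in the equation; since the overall objective is minimized, this sketch assumes that the quantity actually minimized is their negation, i.e. a binary cross-entropy. That sign convention, the batch averaging and the function names are assumptions of the sketch.

```python
import math

def in_domain_log_likelihood(p_in_train, p_in_out):
    """Sum of the two expectation terms in equation 3.24, as batch averages.

    p_in_train: P(in|x) for in-domain training inputs.
    p_in_out:   P(in|x) for out-of-domain training inputs.
    """
    in_term = sum(math.log(p) for p in p_in_train) / len(p_in_train)
    out_term = sum(math.log(1.0 - p) for p in p_in_out) / len(p_in_out)
    return in_term + out_term

def detection_loss(p_in_train, p_in_out):
    """Assumed minimization form: the negated log-likelihood (binary cross-entropy)."""
    return -in_domain_log_likelihood(p_in_train, p_in_out)
```

Under this convention the loss is small when the extra head assigns high in-domain probability to training inputs and low in-domain probability to out-of-domain inputs.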

Given this model, if the entropy of the predictive posterior is high and the probability of in-domain is high, then there is a high level of data uncertainty. Conversely, if the entropy is high, but the probability of in-domain is low, then there is a high level of knowledge uncertainty.
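This decision rule can be written down directly. The sketch below is illustrative only: the thresholds `entropy_frac` and `p_in_threshold` are hypothetical parameters that the text does not specify, and the "low entropy" label for the remaining case is added here purely so that the function always returns a value.

```python
import math

def entropy(probs):
    """Entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def diagnose(class_probs, p_in, entropy_frac=0.8, p_in_threshold=0.5):
    """Attribute high predictive entropy to data or knowledge uncertainty.

    entropy_frac:   fraction of the maximum entropy ln(K) above which the
                    posterior counts as 'high entropy' (hypothetical threshold).
    p_in_threshold: in-domain probability above which the input counts as
                    in-domain (hypothetical threshold).
    """
    max_entropy = math.log(len(class_probs))
    if entropy(class_probs) < entropy_frac * max_entropy:
        return "low entropy"
    if p_in >= p_in_threshold:
        return "data uncertainty"
    return "knowledge uncertainty"
```

For example, `diagnose([1/3, 1/3, 1/3], p_in=0.9)` attributes the uncertainty to the data, while the same posterior with `p_in=0.1` is attributed to a lack of knowledge.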


(a) Low Data Uncertainty dataset (b) High Data Uncertainty dataset

Figure 3.9 Probability of in-domain input derived from DNNs with an additional output head trained on the LDU and HDU datasets via equation 3.24. The DNNs had 2 layers of 100 ReLU units. Note, the color scale is inverted and white corresponds to high values in this figure.

This approach is illustrated on the 3-class artificial LDU and HDU datasets in figure 3.9, which shows the output of the extra output head for DNNs trained on these datasets using the loss in equation 3.24. The probability of the input being in-domain is high in-domain, even in regions of class overlap and along decision boundaries, and low elsewhere, which is the desired behaviour. Clearly, this approach alleviates the issue of conflating data uncertainty with knowledge uncertainty. However, it is not clear how to interpret estimates of data and knowledge uncertainty within a single consistent probabilistic framework. Specifically, while this approach could work from a practical point of view, it can only answer the question "what is the probability that the input is in-domain?". It does not allow questions to be posed about how much uncertainty there is in the prediction of a particular class due to knowledge uncertainty. It is necessary to point out that while figures 3.8 and 3.9 show that it is possible to choose OOD training data which allows the model to learn to yield high estimates of uncertainty out-of-distribution, the task of choosing OOD training data for real tasks and datasets is highly non-trivial.