Uncertainty for Classification


3.1.1 Uncertainty for Classification

This section discusses the sources of uncertainty in the context of classification tasks, providing an illustrative example of each source of uncertainty on artificial datasets. Consider a finite dataset $\mathcal{D}_{\text{tr}}$ drawn from a distribution $p_{\text{tr}}(\mathbf{x}, y)$ over inputs $\mathbf{x} \in \mathbb{R}^{D}$ and class labels $y \in \{\omega_1, \cdots, \omega_K\}$:

$$\mathcal{D}_{\text{tr}} = \{\mathbf{x}^{(i)}, y^{(i)}\}_{i=1}^{N}, \qquad \{\mathbf{x}^{(i)}, y^{(i)}\} \sim p_{\text{tr}}(\mathbf{x}, y) \tag{3.1}$$

¹ In general, the term epistemic uncertainty covers both uncertainty in predictions for in-domain test inputs $\mathbf{x}^{*}$ due to data sparsity and uncertainty due to distributional mismatch. However, in this thesis we use a narrower definition of epistemic uncertainty, which we refer to as knowledge uncertainty, and which covers only the latter.

3.1 Sources of Uncertainty 31

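In the context of a discriminative classification task, the data uncertainty at an input point $\mathbf{x}$ is defined as the entropy of the true conditional distribution $P_{\text{tr}}(y|\mathbf{x})$:

$$\mathcal{H}[P_{\text{tr}}(y|\mathbf{x})] = -\sum_{c=1}^{K} P_{\text{tr}}(y=\omega_c|\mathbf{x}) \ln P_{\text{tr}}(y=\omega_c|\mathbf{x}) \tag{3.2}$$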

The entropy of a discrete probability distribution is an information-theoretic measure of uncertainty [23]. The overall level of data uncertainty of the distribution $p_{\text{tr}}(\mathbf{x}, y)$ is given by the expected conditional entropy:

$$\mathbb{E}_{p_{\text{tr}}(\mathbf{x})}\big[\mathcal{H}[P_{\text{tr}}(y|\mathbf{x})]\big] \tag{3.3}$$
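As a minimal sketch of equation (3.2), the entropy of a single class posterior can be computed directly from its probabilities. The function name `entropy` and the two example posteriors are illustrative assumptions, not taken from the thesis; natural logarithms are used, so values are in nats.

```python
import math

def entropy(probs):
    """Entropy in nats of a discrete distribution given as a sequence of probabilities."""
    # Zero-probability classes contribute nothing (0 * ln 0 is taken as 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical posteriors P(y | x) over K = 3 classes at two input points.
confident = [0.98, 0.01, 0.01]   # x lies well inside one class region
ambiguous = [1 / 3, 1 / 3, 1 / 3]  # x is equally compatible with all classes

print(entropy(confident))
print(entropy(ambiguous))  # equals ln(3), the maximum for three classes
```

Averaging this quantity over inputs drawn from the marginal over inputs gives the expected conditional entropy of equation (3.3).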

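A related way of thinking about data uncertainty is to consider the mutual information between $y$ and $\mathbf{x}$, defined as:

$$\mathcal{I}[y, \mathbf{x}] = \text{KL}\big[p_{\text{tr}}(\mathbf{x}, y)\,\|\,p_{\text{tr}}(\mathbf{x})P_{\text{tr}}(y)\big] = \mathcal{H}[P_{\text{tr}}(y)] - \mathbb{E}_{p_{\text{tr}}(\mathbf{x})}\big[\mathcal{H}[P_{\text{tr}}(y|\mathbf{x})]\big] \tag{3.4}$$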

where the marginal distribution $P_{\text{tr}}(y)$ is given by:

$$P_{\text{tr}}(y) = \int p_{\text{tr}}(\mathbf{x}, y)\,d\mathbf{x} \tag{3.5}$$

The mutual information can be interpreted as a measure of information gain: it answers the question "how much information does $\mathbf{x}$ convey about $y$?". Alternatively, it can be seen as a measure of how far $y$ and $\mathbf{x}$ are from being independent. If the mutual information is high, then the level of data uncertainty is low and $\mathbf{x}$ conveys a large degree of information about $y$. Conversely, if the mutual information is 0, then $\mathbf{x}$ conveys no information about $y$, which is another way of saying that $\mathbf{x}$ and $y$ are independent. In this case the expected conditional entropy equals the marginal entropy $\mathcal{H}[P_{\text{tr}}(y)]$, its largest possible value, which corresponds to a high degree of data uncertainty.
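The decomposition in equation (3.4) can be checked numerically on a small joint probability table. This is only a sketch: it assumes a discrete stand-in for the input, whereas the inputs in the text are continuous, and the tables `informative` and `independent` are made-up examples.

```python
import math

def entropy(probs):
    """Entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mutual_information(joint):
    """Return (I[y, x], E_x[H[P(y|x)]]) for joint[i][c] = P(x = i, y = c)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    # Expected conditional entropy: sum over x of P(x) * H[P(y | x)].
    expected_cond = sum(
        px * entropy([pxy / px for pxy in row])
        for row, px in zip(joint, p_x)
        if px > 0.0
    )
    return entropy(p_y) - expected_cond, expected_cond

informative = [[0.5, 0.0], [0.0, 0.5]]      # x determines y exactly
independent = [[0.25, 0.25], [0.25, 0.25]]  # x says nothing about y

print(mutual_information(informative))
print(mutual_information(independent))
```

In the first table the mutual information equals the marginal entropy of the labels and the expected conditional entropy is zero; in the second the roles are reversed.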

To illustrate the concept of data uncertainty more concretely, consider a 'toy' distribution $p_{\text{tr}}(\mathbf{x}, y)$ which consists of three normally distributed clusters with tied isotropic covariances and equidistant means, where each cluster corresponds to a separate class. The marginal distribution over $\mathbf{x}$ is given as a mixture of Gaussian distributions:

$$p_{\text{tr}}(\mathbf{x}) = \sum_{c=1}^{3} p_{\text{tr}}(\mathbf{x}|y=\omega_c) \cdot P_{\text{tr}}(y=\omega_c) = \frac{1}{3}\sum_{c=1}^{3} \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_c, \sigma^2 \cdot \mathbf{I}) \tag{3.6}$$

32 Predictive Uncertainty Estimation

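Figure 3.1 Panels: (a) Low data uncertainty (LDU) dataset; (b) Entropy of LDU dataset. The top row depicts the Low Data Uncertainty (LDU) dataset with distinct classes ($\sigma = 1$), where $\mathbb{E}_{p_{\text{tr}}(\mathbf{x})}\big[\mathcal{H}[P_{\text{tr}}(y|\mathbf{x})]\big] = 0.002$ and $\mathcal{I}[y, \mathbf{x}] = 1.097$. The bottom row depicts the High Data Uncertainty (HDU) dataset with overlapping classes ($\sigma = 4$), where $\mathbb{E}_{p_{\text{tr}}(\mathbf{x})}\big[\mathcal{H}[P_{\text{tr}}(y|\mathbf{x})]\big] = 0.706$ and $\mathcal{I}[y, \mathbf{x}] = 0.393$.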

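The conditional distribution over the classes $y$ can be obtained via Bayes' rule:

$$P_{\text{tr}}(y=\omega_c|\mathbf{x}) = \frac{p_{\text{tr}}(\mathbf{x}|y=\omega_c) \cdot P_{\text{tr}}(y=\omega_c)}{\sum_{k=1}^{3} p_{\text{tr}}(\mathbf{x}|y=\omega_k) \cdot P_{\text{tr}}(y=\omega_k)} = \frac{\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_c, \sigma^2 \cdot \mathbf{I})}{\sum_{k=1}^{3} \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \sigma^2 \cdot \mathbf{I})} \tag{3.7}$$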

Samples of $\mathbf{x}$ from the marginal $p_{\text{tr}}(\mathbf{x})$ for the case when $\sigma = 1$ are shown in figure 3.1a. Here, as the three classes are distinct and non-overlapping, it is easy to assign a test sample $\mathbf{x}^{*}$ (green point in figure 3.1a) to the correct class. The conditional entropy, shown in figure 3.1b, is high only along the decision boundaries between classes. The expected conditional entropy is 0.002 and the mutual information between $y$ and $\mathbf{x}$ is 1.097. Now consider the dataset shown in figure 3.1c, where the covariances of each cluster are increased so that there is a large degree of class overlap. The entropy, shown in figure 3.1d, is now high in a wide region along the decision boundaries and is highest in the area of class overlap. Due to the large degree of class overlap, it is more difficult to assign the same test sample $\mathbf{x}^{*}$ (green point) to the correct class. The expected conditional entropy is 0.706, which is several hundred times larger than for the dataset with no class overlap. The mutual information is 0.393, which is about a third of the mutual information with no class overlap. Clearly, there is low uncertainty in the predictions when the classes do not overlap and high uncertainty when they do. Thus, throughout the rest of this thesis, the first dataset, with no overlap, will be referred to as the Low Data Uncertainty (LDU) dataset, and the artificial dataset with significant class overlap will be referred to as the High Data Uncertainty (HDU) dataset. These two datasets will be used to illustrate the approaches discussed in this chapter and in chapter 4.
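The quantities reported for the two toy datasets can be approximated by Monte Carlo sampling from the mixture in equation (3.6) and evaluating the posterior in equation (3.7). The sketch below is not the thesis' code: it assumes two-dimensional inputs and places the cluster means on an equilateral triangle of hypothetical radius `RADIUS`, since the actual means are not stated. Its outputs will therefore differ from the figures quoted above, although the qualitative ordering should be the same.

```python
import math
import random

# Assumption: equidistant 2-D means on a circle of hypothetical radius 5.
RADIUS = 5.0
ANGLES = (math.pi / 2, math.pi / 2 + 2 * math.pi / 3, math.pi / 2 + 4 * math.pi / 3)
MEANS = [(RADIUS * math.cos(a), RADIUS * math.sin(a)) for a in ANGLES]

def entropy(probs):
    """Entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def posterior(x, sigma):
    """P(y = w_c | x) for equal priors and tied isotropic covariance (eq. 3.7)."""
    # The Gaussian normalising constant is shared by all classes and cancels.
    log_lik = [-((x[0] - m[0]) ** 2 + (x[1] - m[1]) ** 2) / (2 * sigma ** 2) for m in MEANS]
    top = max(log_lik)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_lik]
    total = sum(weights)
    return [w / total for w in weights]

def expected_conditional_entropy(sigma, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_{p(x)}[H[P(y|x)]] (eq. 3.3)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        m = rng.choice(MEANS)  # pick a class uniformly, then sample x from it
        x = (rng.gauss(m[0], sigma), rng.gauss(m[1], sigma))
        total += entropy(posterior(x, sigma))
    return total / n_samples

def mutual_information(sigma, **kwargs):
    """I[y, x] = H[P(y)] - E[H[P(y|x)]] with a uniform prior over 3 classes (eq. 3.4)."""
    return math.log(3) - expected_conditional_entropy(sigma, **kwargs)

for sigma in (1.0, 4.0):
    print(sigma, expected_conditional_entropy(sigma), mutual_information(sigma))
```

Because the class prior is uniform over three classes, the two printed quantities always sum to $\ln 3 \approx 1.099$, which is also true of the values reported for the LDU and HDU datasets.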

For classification problems, data uncertainty arises from the natural complexity of the data and the structure of the decision boundaries. Datasets which contain a large number of fine-grained classes have a higher level of data uncertainty, as the distinctions between classes erode. To illustrate, consider a dataset with two sets of labels: one with ten 'coarse' classes and one with a hundred fine-grained classes, where each coarse class is split into ten fine-grained variants. The first set of labels contains the classes 'animals', 'cars', 'airplanes' and seven other classes. The second set of labels contains the fine-grained classes 'dog', 'wolf', 'cat' and so on for the coarse class 'animals', the fine-grained classes 'Audi', 'Ferrari', 'BMW', 'Mercedes' and so on for the coarse class 'cars', and similarly for the remaining classes. There will be more confusion between different types of cars, or between different types of animals, than there will be between cars and animals. The second set of labels will therefore have a higher level of data uncertainty. Thus, data uncertainty depends on how the input space is partitioned into regions belonging to different classes. If the regions are distinct, then there is low data uncertainty. Conversely, if the partitioning is such that certain regions are similar, then there is high data uncertainty.
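The effect of label granularity can be illustrated on a single posterior. In the sketch below, the class names, the probabilities in `fine` and the grouping into coarse classes are all hypothetical; the point is only that summing fine-grained probabilities within each coarse class cannot increase the entropy.

```python
import math

def entropy(probs):
    """Entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def coarsen(fine_probs, groups):
    """Sum fine-grained class probabilities within each coarse group."""
    return [sum(fine_probs[i] for i in group) for group in groups]

# Hypothetical posterior over fine classes [dog, wolf, Audi, BMW] for one image.
fine = [0.55, 0.40, 0.03, 0.02]
# Coarse classes: animals = {dog, wolf}, cars = {Audi, BMW}.
coarse = coarsen(fine, groups=[(0, 1), (2, 3)])

print(coarse, entropy(coarse))
print(fine, entropy(fine))
```

Here most of the fine-grained entropy comes from the dog/wolf confusion, which disappears once both are merged into the coarse class.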

Now consider the situation described in figure 3.2, where the test sample $\mathbf{x}^{*}$ is far away from the region of training data. Such a sample will be referred to, interchangeably, as either an out-of-distribution or out-of-domain sample, because it is sampled from a distribution $p_{\text{out}}(\mathbf{x}, y)$ different to the one from which the training data was sampled. In mathematical terms, an input can be considered out-of-distribution relative to the training data if its probability density under the marginal distribution, $p_{\text{tr}}(\mathbf{x}^{*})$, is smaller than a threshold $\delta$. In practice, however, $p_{\text{tr}}(\mathbf{x})$ is unavailable.
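For the toy mixture of equation (3.6), where the marginal density is known in closed form, the threshold criterion can be written out directly. This is a sketch under stated assumptions: inputs are two-dimensional, and the values of `MEANS`, `SIGMA` and `DELTA` are hypothetical choices rather than values from the thesis. For real data the density is not available, so this test cannot be applied as written.

```python
import math

# Assumptions: 2-D inputs, hypothetical equidistant means, sigma and threshold.
MEANS = [(0.0, 5.0), (-4.33, -2.5), (4.33, -2.5)]
SIGMA = 1.0
DELTA = 1e-4

def mixture_density(x, means=MEANS, sigma=SIGMA):
    """Density of an equal-weight mixture of 2-D isotropic Gaussians (eq. 3.6)."""
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    components = [
        norm * math.exp(-((x[0] - m[0]) ** 2 + (x[1] - m[1]) ** 2) / (2 * sigma ** 2))
        for m in means
    ]
    return sum(components) / len(means)

def is_out_of_distribution(x, delta=DELTA):
    """Flag x as out-of-distribution when its training density falls below delta."""
    return mixture_density(x) < delta

print(is_out_of_distribution((0.0, 5.0)))    # a point at a cluster mean
print(is_out_of_distribution((30.0, 30.0)))  # a point far from all clusters
```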


Figure 3.2 Low data uncertainty dataset (σ= 1) with out-of-distribution input (green dot).

In this situation, a model trained on a finite dataset $\mathcal{D}_{\text{tr}}$ may have no understanding or knowledge² of the discriminative mapping $\mathbf{x} \mapsto y$ appropriate to $p_{\text{out}}(\mathbf{x}, y)$, due to distributional mismatch. The test sample $\mathbf{x}^{*}$ could correspond to unseen variations of known classes or to an example of a new, unseen class. As no training data is observed in that region, without strong modelling assumptions it is difficult to know to what class $\mathbf{x}^{*}$ actually belongs. Thus, there will be a high level of knowledge uncertainty in the prediction.

In document Uncertainty Estimation in Deep Learning with application to Spoken Language Assessment (Page 56-60)