Experiments on Artificial Data - Prior Networks for Classification

4.2 Prior Networks for Classification

4.2.3 Experiments on Artificial Data

The previous section investigated the theoretical properties of several training criteria for Prior Networks. In this section, the properties of these criteria are assessed empirically by using them to train Prior Networks on the artificial 3-class Low Data Uncertainty (LDU) and High Data Uncertainty (HDU) datasets introduced in chapter 3, section 3.1.1. Specifically, both the forward and reverse KL-divergence between the Prior Network and a target Dirichlet distribution are considered. In these experiments, Prior Networks parameterize the Dirichlet distribution by directly yielding the concentration parameters α̂. The models use the same architecture and training hyper-parameters as the networks trained on these datasets in chapter 3. The out-of-distribution training data D_out was sampled such that it forms a thin shell around the training data, as shown in figure 3.6c. The target Dirichlet concentration parameters β^(c) were constructed as described in equation 4.24, with β = 1e3. The in-domain and out-of-distribution losses were equally weighted when training with the forward KL-divergence loss. However, it was found necessary to weight the out-of-distribution loss 10 times as much as the in-domain loss when using the reverse KL-divergence.
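The two training criteria can be sketched in a few lines of dependency-free Python. This is only an illustrative sketch under stated assumptions, not the implementation used for the experiments: the form of the in-domain target (β + 1 on the labelled class and 1 elsewhere) and the flat out-of-distribution target are assumptions standing in for equation 4.24, which is not reproduced here, and the function names `dirichlet_kl`, `target_concentration` and `prior_network_loss` are hypothetical. The digamma function is approximated by hand so that the block needs only the standard library.

```python
import math


def digamma(x: float) -> float:
    """Digamma function for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def dirichlet_kl(p, q):
    """KL[Dir(p) || Dir(q)] for two lists of positive concentration parameters."""
    p0, q0 = sum(p), sum(q)
    log_norm = (math.lgamma(p0) - sum(math.lgamma(a) for a in p)
                - math.lgamma(q0) + sum(math.lgamma(b) for b in q))
    return log_norm + sum((a - b) * (digamma(a) - digamma(p0))
                          for a, b in zip(p, q))


def target_concentration(label, num_classes, beta=1e3):
    # ASSUMPTION: beta + 1 on the labelled class, 1 elsewhere.
    # The source only says the targets follow its equation 4.24 with beta = 1e3.
    return [beta + 1.0 if k == label else 1.0 for k in range(num_classes)]


def prior_network_loss(alpha_in, label, alpha_out, reverse, ood_weight):
    """Loss for one in-domain and one out-of-distribution model output."""
    num_classes = len(alpha_in)
    target_in = target_concentration(label, num_classes)
    target_out = [1.0] * num_classes  # ASSUMPTION: flat Dirichlet target for OOD inputs
    if reverse:
        # Reverse KL: the model's Dirichlet is the first argument.
        in_loss = dirichlet_kl(alpha_in, target_in)
        out_loss = dirichlet_kl(alpha_out, target_out)
    else:
        # Forward KL: the target Dirichlet is the first argument.
        in_loss = dirichlet_kl(target_in, alpha_in)
        out_loss = dirichlet_kl(target_out, alpha_out)
    return in_loss + ood_weight * out_loss
```

Calling `prior_network_loss(..., reverse=False, ood_weight=1.0)` mirrors the equally weighted forward setting described above, and `reverse=True, ood_weight=10.0` mirrors the reverse setting.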

Figures 4.4 and 4.5 depict the behaviour of measures of uncertainty derived from Prior Networks trained using either the forward or the reverse KL-divergence loss on the LDU and HDU datasets, respectively. Specifically, the figures depict total uncertainty, expected data uncertainty and mutual information, which is a measure of knowledge uncertainty. As was shown in equation 4.4, mutual information is the difference of total uncertainty and expected data uncertainty.

(a) Total Uncertainty - KL (b) Total Uncertainty - reverse KL

(e) Mutual Information - KL (f) Mutual Information - reverse KL

Figure 4.4 Comparison of measures of uncertainty derived from Prior Networks trained with forward and reverse KL-divergence loss on the Low Data Uncertainty dataset. Measures of uncertainty are derived via equation 4.13.

(a) Total Uncertainty - KL (b) Total Uncertainty - reverse KL

(e) Mutual Information - KL (f) Mutual Information - reverse KL

Figure 4.5 Comparison of measures of uncertainty derived from Prior Networks trained with forward and reverse KL-divergence loss on the High Data Uncertainty dataset. Measures of uncertainty are derived via equation 4.13.
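The three measures depicted in figures 4.4 and 4.5 have well-known closed forms for a Dirichlet distribution, which the following sketch evaluates for a single vector of concentration parameters. It is offered as an illustration only: the expressions are the standard ones and are assumed, not confirmed, to coincide with equation 4.13, and the function name `uncertainty_measures` is hypothetical. The digamma function is approximated by hand to keep the block free of external dependencies.

```python
import math


def digamma(x: float) -> float:
    """Digamma function for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def uncertainty_measures(alpha):
    """Return (total uncertainty, expected data uncertainty, mutual information)."""
    alpha0 = sum(alpha)
    mean = [a / alpha0 for a in alpha]  # expected categorical distribution
    # Total uncertainty: entropy of the expected categorical distribution.
    total = -sum(p * math.log(p) for p in mean)
    # Expected data uncertainty: expected entropy of categoricals drawn from the Dirichlet.
    expected_data = -sum(p * (digamma(a + 1.0) - digamma(alpha0 + 1.0))
                         for p, a in zip(mean, alpha))
    # Knowledge uncertainty: mutual information, the difference of the two.
    return total, expected_data, total - expected_data


# A flat Dirichlet (an out-of-distribution style output) versus a sharp
# Dirichlet centred on the middle of the simplex (a class-overlap style output).
flat = uncertainty_measures([1.0, 1.0, 1.0])
sharp_centre = uncertainty_measures([1000.0, 1000.0, 1000.0])
```

Both example outputs have the same total uncertainty, since their expected categorical distributions are uniform, but the flat Dirichlet has markedly higher mutual information. This is the distinction between knowledge uncertainty and data uncertainty that the decomposition is meant to expose.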

(a) Low Data Uncertainty Dataset - KL (b) Low Data Uncertainty Dataset - reverse KL

Figure 4.6 Comparison of differential entropy derived from Prior Networks trained with forward and reverse KL-divergence loss on the Low Data Uncertainty and High Data Uncertainty datasets. Differential entropy is derived using equation 4.15.
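For reference, the differential entropy of a Dirichlet distribution with concentration parameters α̂ over K classes has the standard closed form below. It is given here as a reminder of the quantity being plotted and is assumed, not confirmed, to agree term by term with equation 4.15, which is not reproduced in this section; ψ denotes the digamma function.

```latex
\mathcal{H}\big[\mathrm{Dir}(\boldsymbol{\mu} \mid \hat{\boldsymbol{\alpha}})\big]
  = \sum_{c=1}^{K} \ln\Gamma(\hat{\alpha}_c) - \ln\Gamma(\hat{\alpha}_0)
  - \sum_{c=1}^{K} (\hat{\alpha}_c - 1)\big(\psi(\hat{\alpha}_c) - \psi(\hat{\alpha}_0)\big),
\qquad
\hat{\alpha}_0 = \sum_{c=1}^{K} \hat{\alpha}_c
```

The expression is maximised by the flat Dirichlet, where every concentration parameter equals 1 and the entropy reduces to −ln Γ(K), and it decreases as the distribution becomes sharper.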

Figures 4.4a and 4.4b show that Prior Networks trained using either the forward or the reverse KL-divergence loss appropriately capture the total uncertainty of the LDU dataset. However, Prior Networks trained using the forward KL-divergence do not fully capture data uncertainty: figure 4.4c shows that data uncertainty is lower in the region where all three decision boundaries meet than along the decision boundaries where only two classes meet. As a result, the mutual information provided by a Prior Network trained with the forward KL-divergence is higher in-domain along the decision boundaries than out-of-domain. In contrast, figures 4.4d and 4.4f show that the measures of uncertainty provided by a Prior Network trained using the reverse KL-divergence decompose correctly. Data uncertainty is highest along the decision boundaries and mutual information is 0 in-domain, even along the decision boundaries.

These trends in the behaviour of the uncertainty estimates are more apparent on the HDU dataset. Comparing figures 4.5a and 4.5b, it is clear that a Prior Network trained using the forward KL-divergence over-estimates total uncertainty in-domain, as the total uncertainty is equally high along the decision boundaries, in the region of class overlap and out-of-domain. The Prior Network trained using the reverse KL-divergence, on the other hand, yields a far more structured estimate of total uncertainty. Figure 4.5c shows that the expected data uncertainty is altogether incorrectly estimated by a Prior Network trained via the forward KL-divergence. This causes mutual information, which is the difference of total uncertainty and expected data uncertainty, to also behave incorrectly. In contrast, the Prior Network trained via the reverse KL-divergence yields correct decompositions of uncertainty.

Lastly, figure 4.6 depicts the behaviour of the differential entropy of Prior Networks trained on the LDU and HDU datasets using both KL-divergence losses. Unlike total uncertainty, expected data uncertainty and mutual information, it is less clear what the desired behaviour of the differential entropy is. Conceptually, it should be low in-domain and high out-of-distribution. Figure 4.6 shows that on both the LDU and HDU datasets both losses yield low differential entropy in-domain and high differential entropy out-of-distribution. The reverse KL-divergence seems to capture more of the structure of the dataset than the forward KL-divergence, which is especially evident in figure 4.6d. In general, however, both losses seem to yield appropriate behaviour of the differential entropy.

The experiments in this section support the analysis in the previous section and illustrate how the reverse KL-divergence is a generally more suitable optimization criterion than the forward KL-divergence, especially for datasets with a significant level of data uncertainty. However, it is important to keep in mind that the LDU and HDU are toy datasets where it is possible to obtain ideal out-of-distribution training data. The behaviour of Prior Networks trained using the forward and reverse KL-divergence losses may be different on real datasets.

In document Uncertainty Estimation in Deep Learning with application to Spoken Language Assessment (Page 102-106)