6.3 Evaluating Generalization by De Morgan’s Laws

6.3.3 Results in Detail

In the following, we will address some of the open questions related to our results of the generalization experiments.

Differences by Inference Pattern. A detailed analysis of prediction differences by inference pattern could be interesting from a logical point of view. However, the results provide little room for such a distinction by logical form. The tensor model correctly predicted entailment for nearly every test case, and thus across all inference patterns. The few misclassifications of the model in the second experiment are spread across inference patterns without any apparent systematicity. The matrix model, on the other hand, yielded zero entailment predictions in the second experiment, so we cannot distinguish non-occurrences by case.

The single exception is the result of the matrix model in the first experiment. Here, the tRNN predicted entailments in aggregate near a random baseline. However, most of these were made on a single inference pattern, the equivalence of Eq. (6.4), for sentences of the form all x not y ≡ no x y. We currently cannot say with certainty whether the effect has a systematic cause or is a random artifact of the training process. Considering, however, that these entailment predictions vanished completely in the tRNN when using the higher-difficulty data, we tend to favor the latter explanation, i.e. an effect without a systematic cause.

Results Variability. Our results here use three models (i.e. training runs) per model architecture. The predictions were highly consistent, in fact nearly identical, across runs. While perhaps puzzling at first, we believe the most likely reason is a stabilization effect of the final prediction steps: a ‘squashing’ of values by the softmax layer activation function, followed by application of the max function. Minor (and perhaps even moderate) differences are leveled by these steps, and models that are ‘internally’ different can appear indistinguishable from the outside.

Figure 6.1: Softmax layer input. Pattern: no x y ≡ all x not y. tRNN (left), tRNTN (right).16

To be sure then that our results are not merely a side effect of some leveling, we also inspected the (weighted) softmax input, i.e. the state of the model before those steps apply. Fig. 6.1 shows the input to the softmax layer of one model instance, over all test sentences based on the inference pattern of Eq. (6.4). We can see then that, in fact, the contrast between the two model types seems less stark than in the prediction outcomes. For example, both architectures assign most of the prediction weight to the same three relations here (FOR, REV, ALT). However, even at this incomplete state of the prediction process, some clear differences emerge. The median values of the two ‘correct’ labels (FOR, REV) are above the median value of the ‘wrong’ label (ALT) in the tensor network, while the opposite is the case for the matrix model. Further, the minimum softmax input values of the correct labels are roughly the same as the maximum values of the wrong label in the tRNTN, while the reversed case holds in the tRNN. In other words, the ‘raw’ model predictions match our previous results based on the final model predictions, just in a less sharply contrasted form.
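The leveling effect of the final prediction steps can be illustrated with a minimal sketch. The softmax inputs below are invented purely for illustration and are not values measured in the experiments; the restriction to three labels and the helper names `softmax` and `predict` are likewise assumptions.

```python
import math

# Illustrative subset of the relation labels (the full label set is larger).
LABELS = ["FOR", "REV", "ALT"]

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores, labels=LABELS):
    """Return the single most probable label (softmax followed by max)."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]

# Hypothetical softmax inputs of three training runs for one sentence pair.
run_a = [2.1, 1.9, 0.3]
run_b = [5.0, 1.0, -2.0]
run_c = [0.6, 0.5, 0.4]

predictions = [predict(run) for run in (run_a, run_b, run_c)]
print(predictions)
```

Although the three hypothetical runs differ considerably in their raw scores and in the resulting probabilities, all of them yield the same final label, which is why inspecting the softmax input is the more sensitive comparison.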

Data Imbalance. We mentioned that we observe a ‘shift’ of predictive weight, from entailment to cover, when comparing the first and the second experiment. We believe this effect is likely linked to the distribution of relation examples in the data. The relations independence, cover and alternation make up about 70% of the training examples (INDY: 30%, COV: 21%, ALT: 19%), while the other four relations are encountered substantially less frequently during training. This underlying ‘imbalance’ of relation examples could then explain two observations: the tendency of both models to assign significant predictive weight to cover and alternation, and the weight ‘shift’ described above. Both could plausibly be seen as expressions of a prior given by the training data. For example, upon increasing task difficulty, the tRNN predictions shift back to the prior that was learned during training.

We note, however, that this explanation is incomplete: while the data prior might account for the predictive weight of cover and alternation, it seems to be contradicted by the observation that independence is assigned little weight in the tests, despite being the most frequent relation encountered during training. One possible hypothesis is that independence is somewhat easier for the models to ‘isolate’. While other relations can overlap to a degree in their ‘meaning’ (if learned only incompletely), the one relation expressing “absence of any informative relation” seems to stand out somewhat.

Direction of Entailment. In Section 6.3.1 we gave an account of the reasons why a prediction of entailment is the most informative answer we can expect from the models, given the limitations of the training data. In fact, a prediction of entailment in both directions for a single equivalence test case would amount to a prediction of equivalence in everything but the name. However, in our experiments, we accepted the less informative answer of a single entailment prediction.

The weaker correctness criterion is justified because the definition of model prediction and the respective training objective of our models did not permit the stronger prediction. Note that model predictions are defined here as the single most probable relation for a pair of sentences, and cost was given by the negative log-likelihood of the correct label. Under this cost function, only the model output for the target relation directly influences cost, while the softmax distribution over the remaining labels does not enter the calculation. Therefore, models were not trained to make two relation predictions for one sentence pair, and the weaker prediction is the best possible outcome for these models.
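The property of the cost function described above can be made concrete with a small sketch. The two output distributions are invented for illustration, and the function name `nll` and the three-label setup are assumptions rather than part of the actual model code.

```python
import math

def nll(probs, target_index):
    """Negative log-likelihood of the correct label only."""
    return -math.log(probs[target_index])

# Two hypothetical softmax outputs over three labels. Both assign 0.5 to
# the target (index 0) but spread the remaining mass differently.
dist_a = [0.5, 0.4, 0.1]
dist_b = [0.5, 0.1, 0.4]

cost_a = nll(dist_a, 0)
cost_b = nll(dist_b, 0)
print(cost_a, cost_b)
```

Both distributions receive the same cost, since only the probability of the target label is read. A cost of this form therefore gives no direct reward for ranking a second relation above the others.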

An extension of the models to predict multiple relations would be an interesting direction for future work. This can be achieved by replacing the current cost by general negative log-likelihood, by which models learn a probability distribution over multiple classes.17 Doing so would allow us to verify whether the models learned the full ‘meaning’ of equivalence, i.e. whether a distribution emerges in which forward and reverse entailment are the two most likely relations given a pair of equivalent sentences.18
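One way such an extension might look is sketched below, under stated assumptions: the cost is taken to be the cross-entropy against a target distribution, the target for an equivalent pair is assumed to split its mass evenly between forward and reverse entailment, and only five of the relation labels are listed. The function names and all numeric values are hypothetical.

```python
import math

# Illustrative subset of the relation labels.
LABELS = ["FOR", "REV", "ALT", "COV", "INDY"]

def cross_entropy(target, probs):
    """General negative log-likelihood against a target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def top_two(probs, labels=LABELS):
    """Return the two most probable labels as a set."""
    ranked = sorted(zip(probs, labels), reverse=True)
    return {label for _, label in ranked[:2]}

# Assumed target for an equivalent sentence pair: half FOR, half REV.
target = [0.5, 0.5, 0.0, 0.0, 0.0]

# Two hypothetical model outputs.
both_directions = [0.45, 0.40, 0.05, 0.05, 0.05]
one_direction = [0.80, 0.02, 0.10, 0.05, 0.03]

print(cross_entropy(target, both_directions), top_two(both_directions))
print(cross_entropy(target, one_direction), top_two(one_direction))
```

Under this cost, the output that supports both entailment directions is penalized less than the output that commits to a single direction, and the `top_two` check corresponds to the proposed test of whether forward and reverse entailment are the two most likely relations.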