Visualization of Model Representations - MoL 2017 28: Understanding Generalization: Learning

The equivalence tests of the previous section seem to show a significant difference in generalization performance between the simpler two-way tRNN models and the more complex three-way tRNTN models. While this is interesting as an observation, we would ideally like to understand how this difference comes about. For this reason, we now turn to inspecting the models' internal representations by applying dimensionality reduction techniques to parts of the models.

17 Both cost functions were defined in Section 4.2.

18 We would however have to replace the natural logic relations then, or at least, adjust them for our

Figure 6.2: Tensor model. Hyperplane separation of quantifiers and their negation.

Visualization Method We tested several different techniques to inspect the model representation of the processed sentences. The best results were achieved with principal component analysis (PCA), which takes a set of correlated values and turns them into a set of linearly uncorrelated values by an orthogonal transformation. Intuitively, PCA can be thought of as a rotation of the d-dimensional coordinate system in which our data lives, such that in the new coordinate system the first dimension (principal component 1) explains the largest share of the data variance, the second dimension explains the largest share of the variance that is independent of the first component, and so on.
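The rotation described above can be sketched with a singular value decomposition. This is a minimal illustration, not the thesis code: the function name `pca_project` and the synthetic matrix `X` (standing in for d-dimensional sentence vectors) are assumptions made for this example.

```python
import numpy as np

def pca_project(X, n_components=3):
    """Project the rows of X onto their first n_components principal components."""
    # PCA operates on mean-centered data
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal principal axes, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Fraction of the total variance explained by each component
    explained = S**2 / np.sum(S**2)
    # Coordinates of every data point in the rotated coordinate system
    scores = X_centered @ Vt[:n_components].T
    return scores, explained[:n_components]

# Hypothetical stand-in for sentence vectors: 200 points in d = 10 dimensions
# whose features are correlated through a 3-dimensional latent source
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

scores, explained = pca_project(X, n_components=3)
```

The columns of `scores` are linearly uncorrelated, and `explained` is sorted in decreasing order, which is what makes it meaningful to plot only a handful of components.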

We preferred PCA over other methods because we wanted to explore the model internals without major distortions or loss of crucial information. The advantage of PCA is that the global structure of our model representations is preserved, in particular how these models organize the sentences they learned to classify.19 Other popular visualization (and dimensionality reduction) techniques, such as t-SNE, were less suitable for this task. While t-SNE produces precise local ‘clusters’ of related observations, the global arrangement of these clusters in the visualization coordinate system is not representative of how the data points in these clusters are organized globally within the original coordinate system of our models.
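One way to see why a PCA view keeps the global arrangement is that it is an orthogonal projection: keeping all components is a pure rotation that leaves every pairwise distance unchanged, and keeping only a few components can shrink distances but never enlarge them. The sketch below checks this on random data; the helper `pairwise_distances` and the random matrix are assumptions for illustration and do not come from the thesis.

```python
import numpy as np

def pairwise_distances(A):
    """Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical data: 50 points in 8 dimensions
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
Xc = X - X.mean(axis=0)

# Vt is an 8 x 8 orthogonal matrix whose rows are the principal axes
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

full = Xc @ Vt.T       # all 8 components: a rotation of the data
top3 = Xc @ Vt[:3].T   # only the first 3 components: a projection

d_orig = pairwise_distances(Xc)
d_full = pairwise_distances(full)
d_top3 = pairwise_distances(top3)

# The rotation preserves all distances exactly (up to rounding)
rotation_preserves = np.allclose(d_orig, d_full)
# The projection never moves two points further apart than they were
projection_contracts = bool(np.all(d_top3 <= d_orig + 1e-9))
```

t-SNE offers no comparable guarantee for points that are far apart, which is the property the comparison above relies on.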

Quantifier Negation Fig. 6.2 is one of the clearest depictions of a systematic representation of quantifier negation in the tensor model that we were able to extract.

19 By ‘global’ we mean data points that are far apart in the representation space constructed by the

Figure 6.3: Matrix model. Organization inside quantifiers along nouns.

Plotted are principal components 1, 2 and 4 of a single run of the tensor model on the CompQuant data. The coloring of the data points was chosen manually to mark the classes that underlie the observed spatial separation. Different colors mark the different quantifiers, with solid circles for the non-negated quantifiers and transparent circles for their negations. We see that, without exception, non-negated quantifiers lie on one side of a separating hyperplane, while their negations lie on the other side.
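The separation reported here was identified by inspecting the plot. As a hedged sketch of how such a claim could be checked numerically, a perceptron can be run on the projected points: it finds a separating hyperplane whenever one exists. The function `find_separating_hyperplane` and the two synthetic clusters are hypothetical and are not taken from the thesis.

```python
import numpy as np

def find_separating_hyperplane(points, labels, max_epochs=1000):
    """Perceptron search for (w, b) with sign(w . x + b) == label for all points.

    Returns None if no separating hyperplane is found within max_epochs.
    """
    points = np.asarray(points, dtype=float)
    # Append a constant 1 so the bias is learned as an extra weight
    X = np.hstack([points, np.ones((len(points), 1))])
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(X, labels):
            if y * (w @ x) <= 0:   # misclassified or on the boundary
                w += y * x
                errors += 1
        if errors == 0:
            return w[:-1], w[-1]
    return None

# Hypothetical projected points: +1 = non-negated, -1 = negated quantifier
rng = np.random.default_rng(2)
non_negated = rng.normal(loc=[-2.0, 0.0, 0.0], scale=0.3, size=(40, 3))
negated = rng.normal(loc=[2.0, 0.0, 0.0], scale=0.3, size=(40, 3))
points = np.vstack([non_negated, negated])
labels = np.array([1] * 40 + [-1] * 40)

result = find_separating_hyperplane(points, labels)
```

A return value of `None` only means that no hyperplane was found within the epoch budget, so for real data a maximum-margin classifier would be a more informative choice.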

In contrast, Fig. 6.3 depicts a similar plot for the matrix model. Across model runs, we found no evidence that the matrix model learned a “negation hyperplane” similar to that of the tensor model. Instead, the organization in Fig. 6.3 appears to proceed along different nouns, i.e. each quantifier cluster contains an organization of all nouns encountered within the scope of the quantifier. This is consistent with the findings of Veldhoen and Zuidema (2017), who observe that the matrix model only learns many “local approximations” or clusters. While there are clear signs of systematicity inside these clusters, they find no evidence of a global organization in the matrix models.

Noun Negation Fig. 6.4 shows another plot of the tensor representations, using a different component configuration and coloring by nouns (e.g. red for ‘mammal’). As before, solid vs. transparent circles mark non-negation and negation. In contrast to the case of quantifiers, here both negated and non-negated forms appear on the same side of a separating line. However, we also observe that a diagonal displacement separates the negated and non-negated nouns on each side. It is therefore possible that we simply lack one additional principal component to depict another ‘perfect’ separation, as in the previous case of quantifiers.

Figure 6.4: Tensor model. Noun negation, all.

Verb Negation Fig. 6.5 depicts the organization of verbs and their negation in the tensor model, showing the separation of the verb ‘growl’ and its negation. ‘Growl’ (solid red) and ‘not growl’ (hollow red) are separated by a horizontal line. Our results indicate that verb negation is the least ‘general’ representation in the tensor model: while each verb and its negation are clearly separated, we could not find evidence that the tensor model learned a general representation that would capture the negation of all verbs.

Figure 6.6: Tensor model. Quantifiers ordered by numerosity (view 1).

Quantifier Ordering Figs. 6.6 and 6.7 depict the organization of quantifiers in the tensor model, colored here to highlight that the model appears to order quantifiers based on their approximate numerosity. Consider here that the (logical) model learned by the (network) model is likely to be interpreted as finite. We could therefore consider an ordering of the quantifiers along some range of the finitely many values they could take in the model. We see then that ‘all’ (blue), ‘three’ (dark green), ‘two’ (light green) and ‘no’ (red) are arranged approximately according to their implicit numerosity, including their respective negations. Note however that ‘some’ (yellow) appears to fall outside of this ordering, for reasons that remain unclear.
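The ordering above is read off the plots. A rank correlation would be one way to quantify such an ordering, sketched here under strong assumptions: the numerosity ranks and the coordinates along a component are invented purely for illustration and are not results from the thesis, and `spearman_rho` is a hypothetical helper that assumes no tied values.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation for two sequences without ties."""
    # argsort of argsort turns values into 0-based ranks
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    # Pearson correlation of the ranks
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Hypothetical values, NOT measured: one entry per quantifier
quantifiers = ["no", "two", "three", "all"]
numerosity_rank = [0, 2, 3, 4]            # assumed ordering by numerosity
component_coord = [-1.9, -0.2, 0.4, 2.1]  # made-up cluster centers on one axis

rho = spearman_rho(numerosity_rank, component_coord)
```

A value of `rho` near 1 or -1 would indicate that the clusters are ordered by numerosity along the chosen axis; a quantifier that breaks the ordering, as ‘some’ does in the plots, would pull the value toward 0.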

Figure 6.7: Tensor model.

In document MoL 2017 28: Understanding Generalization: Learning Quantifiers and Negation with Neural Tensor Networks (Page 77-82)