Evaluation - A Compositional Vector Space Model of Ellipsis and Anaphora.

We now discuss the evaluation of the new representations that we learn on a number of similarity and compositional tasks. The main goal is to see which of the different types of representation works best across these tasks. As a second goal, we want to see whether these representations outperform a non-neural baseline, for which we use the analytical models that we experimented with in the previous chapter. Finally, we want to establish whether the hybrid neural tensor-based setting can achieve results similar to those of state-of-the-art sentence encoders and contextualised embeddings. As we train verb representations, we evaluate on three types of tasks: verb similarity, verb disambiguation – where the goal is to distinguish a verb’s meaning given a sentential context – and compositional sentence similarity. Below we discuss all tasks, composition models and similarity metrics that we experimented with, including a short reference to the sentence encoders and contextualised embeddings with which we compare our approach.

Verb Similarity. For verb similarity, we consider the verb-only partitions of a number of word similarity datasets, as well as datasets aimed specifically at evaluating verb similarity. First, we consider those pairs of words from the MEN [Bru+12] and SimLex-999 [HRK15] datasets that were labelled as verbs, obtaining 22 and 222 verb similarity pairs, respectively. In addition to these partial datasets, we consider VerbSim [YP06], a dataset of 130 verb pairs, and the more recent SimVerb-3500 dataset [Ger+16], containing 3500 verb pairs. For the latter, we evaluate separately on the development set (500 pairs) and the test set (3000 pairs).

Compositional Tasks. We consider the seven compositional tasks that we also evaluated on in Chapter 4: first, the two intransitive sentence datasets ML2008 and ML2010 introduced by Mitchell and Lapata [ML08;ML10], the first aimed at disambiguating the verb of each sentence, the second at evaluating general sentence similarity. Next, we consider the transitive variants of these datasets: the transitive verb disambiguation datasets of Grefenstette and Sadrzadeh [GS11a] (GS2011) and Kartsaklis and Sadrzadeh [KS13] (KS2013), and the transitive sentence similarity dataset of Kartsaklis, Sadrzadeh, and Pulman [KSP13] (KS2014). Finally, we also evaluate on the verb phrase elliptical disambiguation (ELLDIS) and similarity (ELLSIM) tasks introduced in the previous chapter.

Non-Neural Baseline Models. We compare our neural verb representations (Table 5.1) with the two non-neural verb representation methods from the type-driven literature [GS11a], already mentioned in the introduction of this chapter but reiterated here. On the left is the Kronecker representation, on the right is the relational representation:

$$V^{Kron} = v \otimes v \qquad\qquad V^{Rel} = \sum_i s_i \otimes o_i$$

Here $v$ is the vector of the verb, and $s_i$, $o_i$ are the vectors of the subjects and objects with which the verb occurs.
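To make the two baseline constructions concrete, the following is a minimal NumPy sketch. It assumes that the verb vector and the subject and object vectors of the verb's corpus occurrences are already available as arrays; the function names and the toy values are illustrative only and are not taken from the original experiments.

```python
import numpy as np

def kronecker_verb(v):
    """V^Kron = v (x) v: outer product of the verb vector with itself."""
    return np.outer(v, v)

def relational_verb(subjects, objects):
    """V^Rel = sum_i s_i (x) o_i.

    subjects, objects: arrays of shape (n, d); row i holds the subject
    and object vector of the i-th occurrence of the verb (assumed input).
    """
    return np.einsum("ni,nj->ij", subjects, objects)

# Toy data (hypothetical values, for illustration only).
v = np.array([1.0, 2.0])
S = np.array([[1.0, 0.0], [0.0, 1.0]])
O = np.array([[2.0, 1.0], [1.0, 3.0]])

V_kron = kronecker_verb(v)      # shape (2, 2)
V_rel = relational_verb(S, O)   # shape (2, 2)
```

Both functions return a $d \times d$ matrix, so either baseline can be plugged into the same composition models as a learned verb matrix.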

Fusion. We consider two ways of fusing our neural verb matrices into a single representation: middle and late fusion, after Bruni, Tran, and Baroni [BTB14]. Middle fusion takes a weighted average of the two verb representations and uses the result to compute similarity scores. Late fusion uses each representation to compute separate similarity scores and then averages the results. Given a weighted average $M_\alpha(A, B) = \alpha A + (1 - \alpha) B$ for $\alpha \in [0, 1]$, and verbs $V_1, V_2$, the middle and late fusion operations are defined as follows:

$$\mathrm{sim}(M_\alpha(V_1^S, V_1^O),\ M_\alpha(V_2^S, V_2^O))$$

$$M_\alpha(\mathrm{sim}(V_1^S, V_2^S),\ \mathrm{sim}(V_1^O, V_2^O)) \qquad (5.12)$$

The same fusion operations are used in the compositional tasks, where either the verb matrices are averaged before composition, or the cosine scores are averaged after. Our values for $\alpha$ range from 0 (only the subject matrix) to 1 (only the object matrix) in increments of 0.1.
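The difference between the two fusion operations can be illustrated with a short NumPy sketch. It implements the weighted average exactly as defined above; the similarity function `flat_cos` (cosine of the flattened matrices) is only a placeholder assumption standing in for whichever similarity metric is used, and the verb matrices are toy values.

```python
import numpy as np

def weighted_average(alpha, a, b):
    """M_alpha(A, B) = alpha * A + (1 - alpha) * B."""
    return alpha * a + (1.0 - alpha) * b

def flat_cos(a, b):
    """Placeholder similarity (an assumption): cosine of flattened arrays."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def middle_fusion(alpha, verb1, verb2, sim=flat_cos):
    """Average each verb's two matrices first, then compare once."""
    return sim(weighted_average(alpha, *verb1), weighted_average(alpha, *verb2))

def late_fusion(alpha, verb1, verb2, sim=flat_cos):
    """Compare the matrices pairwise, then average the two scores."""
    return weighted_average(alpha, sim(verb1[0], verb2[0]), sim(verb1[1], verb2[1]))

# Each verb is a pair (V_S, V_O) of toy matrices (hypothetical values).
I = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
verb1 = (I, X)
verb2 = (I, -X)

# Sweep alpha over 0.0, 0.1, ..., 1.0.
alphas = np.round(np.linspace(0.0, 1.0, 11), 1)
middle_scores = [middle_fusion(a, verb1, verb2) for a in alphas]
late_scores = [late_fusion(a, verb1, verb2) for a in alphas]
```

At the two endpoints of the sweep both operations reduce to comparing a single pair of matrices; in between they generally give different scores.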

Clustering. In their article introducing the adjective skipgram model, Maillard and Clark [MC15] argue that “cosine similarity, while suitable for vectors, does not capture any information about the function of matrices as linear maps”. This argument also holds for generalisations of matrices to higher-order tensors, such as cubes and tesseracts. The functions of these tensors are multilinear maps, e.g. bilinear for cubes and trilinear for tesseracts. Thus, following Maillard and Clark, we postulate that a suitable measure of similarity for two functions should be related to how similarly they transform their arguments. We say that two words with tensor representations $W, W'$ and arguments $d_1, \dots, d_n$ are similar whenever $W d_1 \dots d_n$ is similar to $W' d_1 \dots d_n$, for the vector $d_i$ of every argument that they have transformed in the corpus. The degree of similarity between $W$ and $W'$ is obtained by taking the median of the degrees of similarity of their applications to the arguments. Since going through all instantiations of the arguments is expensive, we cluster the most frequent argument vectors and work with the similarity between the two transformations applied to the centroids of each cluster. The resulting similarity function is defined as follows, for $D$ the set of tuples of cluster centroids:

$$\mathrm{tensorsim}(W, W') := \operatorname*{med}_{\langle d_1, \dots, d_n \rangle \in D} \cos(W d_1 \dots d_n,\ W' d_1 \dots d_n) \qquad (5.13)$$

In the case of cube embeddings for transitive verbs, this definition is equivalent to considering the most frequent subjects and objects of the verb, clustering them separately, then applying the cube to the centroid vectors and taking the median. The metrics that we use for the different representations from Table 5.1 are given in Table 5.2.

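Metric               Formula
vecsim               $\cos(a, b) = \frac{a \cdot b}{|a||b|}$
$\mathrm{matsim}_S$  $\operatorname{med}_{s \in S} \cos(V_1 s, V_2 s)$
$\mathrm{matsim}_O$  $\operatorname{med}_{o \in O} \cos(V_1 o, V_2 o)$
cubesim              $\operatorname{med}_{\langle s, o \rangle \in A} \cos(V_1 o s, V_2 o s)$

TABLE 5.2: Similarity metrics on vectors, matrices and cubes, based on clustering centroids.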
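The median-based metrics can be sketched in a few lines of NumPy. The sketch assumes that the cluster centroids have already been computed (the clustering step itself is not shown), and the axis order of the cube, (sentence, subject, object), is a hypothetical convention chosen for the example.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def matsim(V1, V2, centroids):
    """Median cosine between the two matrices applied to each centroid."""
    return float(np.median([cos(V1 @ c, V2 @ c) for c in centroids]))

def apply_cube(C, s, o):
    """Contract a cube with an object and a subject vector.

    Assumed axis order of C: (sentence, subject, object).
    """
    return np.einsum("kij,i,j->k", C, s, o)

def cubesim(C1, C2, pairs):
    """Median cosine between two cubes applied to (subject, object) centroid pairs."""
    return float(np.median([cos(apply_cube(C1, s, o), apply_cube(C2, s, o))
                            for s, o in pairs]))

# Toy centroids and tensors (hypothetical values).
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
centroids = [e1, e2, e1 + e2]
V1, V2 = np.eye(2), np.diag([1.0, -1.0])
score_mat = matsim(V1, V2, centroids)   # cosines are 1, -1, 0

C1 = np.arange(1.0, 9.0).reshape(2, 2, 2)
pairs = [(e1, e2), (e1 + e2, e1)]
score_cube = cubesim(C1, 3.0 * C1, pairs)  # rescaling leaves cosines unchanged
```

Using the median rather than the mean makes the score less sensitive to a single centroid on which the two tensors behave very differently.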

Composition Models. Similar to the evaluation study in Chapter 4, we define a number of composition models to test on the datasets involving sentences. For the current study, we consider three baseline models: a non-compositional model involving just the verb representation (with clustering applied in the cases of matrix and cube representations), and two compositional baselines, namely the arithmetic models of [ML10], which, given a sentence $w_1 w_2 \dots w_n$, produce the addition or the multiplication of its word vectors. The results for these will differ from those in Chapter 4, as the underlying word vectors are different, i.e. 100-dimensional skipgram vectors. We then evaluate composition models for the different neural verb representations, where the main new models take a verb matrix or cube representation, compose it with the vectors of its subject and object, and compute a final sentence representation via middle or late fusion. Composition in these models is (a variant of) tensor contraction, as used for example in [MCG14;KSP13;Mil+14].

Intransitive Models. The datasets ML2008 and ML2010 contain pairs of subject-verb and verb-object phrases, respectively. Next to the arithmetic baseline that adds the vectors, we apply middle and late fusion to the separate subject-verb and verb-object matrices, with an unmixed model as a special case when only a single matrix verb embedding is available. The fusion models for subject-verb and verb-object phrases are specified in the table below:

Phrase    subj verb                   verb obj
Middle    $M_\alpha(V^S, V^O)\, s$    $M_\alpha(V^S, V^O)\, o$
Late      $M_\alpha(V^S s, V^O s)$    $M_\alpha(V^S o, V^O o)$
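Because matrix-vector multiplication is linear, averaging the two composed vectors directly would yield the same vector as the middle model. The sketch below therefore reads the late row as averaging the cosine scores of the two separately composed phrases, in line with the description of fusion for compositional tasks given earlier. This reading, the helper names and the toy values are assumptions made for illustration.

```python
import numpy as np

def weighted_average(alpha, a, b):
    """M_alpha(A, B) = alpha * A + (1 - alpha) * B."""
    return alpha * a + (1.0 - alpha) * b

def cos(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def middle_phrase_sim(alpha, verb1, arg1, verb2, arg2):
    """Fuse each verb's matrices, compose with the argument, compare once."""
    p1 = weighted_average(alpha, *verb1) @ arg1
    p2 = weighted_average(alpha, *verb2) @ arg2
    return cos(p1, p2)

def late_phrase_sim(alpha, verb1, arg1, verb2, arg2):
    """Compose with each matrix separately, then average the cosine scores."""
    sim_s = cos(verb1[0] @ arg1, verb2[0] @ arg2)
    sim_o = cos(verb1[1] @ arg1, verb2[1] @ arg2)
    return weighted_average(alpha, sim_s, sim_o)

# Each verb is a pair (V_S, V_O); the argument is a subject or object vector.
I = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
verb1, verb2 = (I, X), (I, 2.0 * X)
arg = np.array([1.0, 0.0])

middle_score = middle_phrase_sim(0.5, verb1, arg, verb2, arg)
late_score = late_phrase_sim(0.5, verb1, arg, verb2, arg)
```

In this toy case the late score is 1, since each matrix pair agrees up to scaling, while the middle score is lower because the fused matrices no longer point in the same direction.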

Transitive Models. To model a transitive sentence of the form subj verb obj, we compare verb-only and arithmetic baselines with tensor-based models as below:

Model type    Formula
Middle        $T(s, M_\alpha(V^S, V^O), o)$
Late          $M_\alpha(T(s, V^S, o), T(s, V^O, o))$
Two           $M_\alpha(T_s(s, V^S, o), T_o(s, V^O, o))$
Cube          $V o s$

TABLE 5.3: Composition models for transitive sentences. $T$ represents any standard tensor-based composition model for transitive sentences, $T_s$ is subject-directed composition, and $T_o$ is object-directed composition. When $\alpha = 0$ or $\alpha = 1$, the models reduce to the case of using one of the two verb matrix embeddings.
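As an illustration of the Middle and Cube rows, the following NumPy sketch plugs one particular choice of $T$ into the middle model and contracts a cube with its arguments. The choice of `example_T` (an element-wise "copy object" style composition) and the axis order of the cube are assumptions made for the example; any other tensor-based $T$ could be substituted.

```python
import numpy as np

def weighted_average(alpha, a, b):
    """M_alpha(A, B) = alpha * A + (1 - alpha) * B."""
    return alpha * a + (1.0 - alpha) * b

def example_T(s, V, o):
    """Hypothetical instance of T: o * (V^T s), element-wise product."""
    return o * (V.T @ s)

def middle_sentence(alpha, s, V_S, V_O, o, T=example_T):
    """Middle row: T(s, M_alpha(V_S, V_O), o)."""
    return T(s, weighted_average(alpha, V_S, V_O), o)

def cube_sentence(C, s, o):
    """Cube row: contract the cube with object and subject.

    Assumed axis order of C: (sentence, subject, object).
    """
    return np.einsum("kij,i,j->k", C, s, o)

# Toy vectors and tensors (hypothetical values).
s = np.array([1.0, 2.0])
o = np.array([3.0, 4.0])
V_S, V_O = np.eye(2), 2.0 * np.eye(2)
C = np.arange(8.0).reshape(2, 2, 2)

sent_middle = middle_sentence(0.5, s, V_S, V_O, o)
sent_single = middle_sentence(1.0, s, V_S, V_O, o)  # uses only the first matrix
sent_cube = cube_sentence(C, s, o)
```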

Note, however, that here we define a new class of models (Two), in which we apply one model to the $V^S$ matrix with the subject vector and a distinct model to the $V^O$ matrix with the object vector; the two results are then fused to give a final representation. Concretely, this leads to three new composition models, Copy Argument, Copy Argument Sum and Categorical Argument, described below:

Model    Formula
CA       $M_\alpha(s^T V^S \odot o,\ V^O o \odot s)$
CAS      $M_\alpha(s^T V^S + o,\ V^O o + s)$
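The two tabulated models can be sketched with NumPy as follows. The sketch assumes that CA combines each partially composed vector with the remaining argument by element-wise multiplication, where CAS uses addition; the function names and toy values are illustrative only.

```python
import numpy as np

def weighted_average(alpha, a, b):
    """M_alpha(A, B) = alpha * A + (1 - alpha) * B."""
    return alpha * a + (1.0 - alpha) * b

def copy_argument(alpha, s, V_S, V_O, o):
    """CA, assuming element-wise multiplication with the copied argument."""
    subject_side = (s @ V_S) * o   # s^T V_S combined with the object
    object_side = (V_O @ o) * s    # V_O o combined with the subject
    return weighted_average(alpha, subject_side, object_side)

def copy_argument_sum(alpha, s, V_S, V_O, o):
    """CAS: as CA, but the copied argument is added instead."""
    subject_side = (s @ V_S) + o
    object_side = (V_O @ o) + s
    return weighted_average(alpha, subject_side, object_side)

# Toy vectors and matrices (hypothetical values).
s = np.array([1.0, 2.0])
o = np.array([3.0, 4.0])
V_S = np.array([[1.0, 0.0], [1.0, 1.0]])
V_O = np.array([[0.0, 1.0], [1.0, 0.0]])

sent_ca = copy_argument(0.5, s, V_S, V_O, o)
sent_cas = copy_argument_sum(0.5, s, V_S, V_O, o)
```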

Sentence Encoders and Contextualised Representations. Similar to the previous chapter, we also include the results of state-of-the-art sentence encoders and contextualised representations, to allow for a comparison with our proposed models. For sentence encoders, we take a number of pretrained models and directly encode the sentences in the datasets of interest; for the verb phrase elliptical datasets we also report the resolved sentence encodings. We consider the same encoders used in Chapter 4: Doc2vec [LB16], Skipthoughts [Kir+15], InferSent [Con+17], and Universal Sentence Encoder [Cer+18]. For contextualised embeddings, we take pretrained models for ELMo [Pet+18] and BERT [Dev+19] to give a contextualised representation for the words in a sentence, then take the mean of these to give a sentence embedding.
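The mean-pooling step for contextualised embeddings amounts to averaging the token vectors of a sentence. A minimal NumPy sketch follows; obtaining the token vectors from a pretrained model is not shown, and the array below is a hypothetical stand-in for such output.

```python
import numpy as np

def mean_pool(token_vectors):
    """Average the token vectors (shape: tokens x dims) into one sentence vector."""
    return np.asarray(token_vectors, dtype=float).mean(axis=0)

# Stand-in for the contextualised vectors of a three-token sentence.
tokens = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 2.0]])
sentence_embedding = mean_pool(tokens)
```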
