Evaluation of the models on a word similarity task

In this section, we evaluate our graph-based model of syntactic compatibility and our word2vec model of similarity on a word association test set to estimate the quality of both models.

For this experiment, we use the German Relatedness Dataset16 (Gurevych, 2005; Zesch and Gurevych, 2006). The data set provides three different test sets. The first one (Gur65) consists of 65 word pairs. Each pair was assigned a similarity score between 0 and 4 by human subjects, where 0 denotes fully dissimilar and 4 fully similar. The second (Gur350) and third set (ZG222) contain 350 and 222 such pairs, respectively. The latter two sets are aimed at semantic relatedness rather than (direct) similarity. The pairs do not consist only of nouns, but also include words of other parts of speech. We here limit our investigation to pairs of nouns.17 To assess the performance of a system, the Pearson correlation between the system scores and the human judgements is measured.

Our graph-based model of compatibility is designed to assign a high score to pairs of nouns that display high second-order co-occurrence with verbs given a specific argument slot, i.e. a specific grammatical relation. Test sets of noun associativity and similarity, by contrast, provide gold standards of noun similarity where the semantics of the similarity is generally underspecified, i.e. they are not geared towards a specific semantic relation like hyponymy, meronymy, or synonymy, but subsume different such relations. Our graph model is therefore not necessarily suited to capturing these relations and is not aimed at outperforming the state of the art for this task. Still, it is interesting to see how well the model fares, especially compared to the word2vec model.

Since our models are specific to grammatical relations, we test them using the subject and direct object relation. That is, we calculate how similar a pair of nouns is regarding the subject and the direct object role in our models, respectively.

Table 6.1 gives the results. The top table shows the Pearson correlations of our models' similarity estimates with human judgements for the grammatical relation subject; the lower table shows the correlations for the grammatical relation direct object. The first column indicates the data set, along with the inter-annotator correlation in parentheses.18 We also indicate the correlation between our two models (Graph-W2V) to see how similarly they judge the pairs. The right part of the tables gives the count of pairs where both our models are applicable (Appl.), i.e. for which both models have representations for both nouns, the count of noun pairs (NN), and the total count of pairs (Total).

16 https://www.ukp.tu-darmstadt.de/data/semantic-relatedness/german-relatedness-datasets/

17 We identify noun pairs by checking whether both words in a test instance start with an uppercase character.

18 This can be read as the average pair-wise correlation between the human judgements, although the actual calculation is more complicated, cf. Gurevych (2005).

Chapter 6. Semantics for pronoun resolution 154

Data set         Graph   W2V    Graph-W2V   Appl.   NN    Total

Subject relation
Gur65 (0.81)     0.38    0.76   0.64        53      63    65
Gur350 (0.69)    0.31    0.75   0.48        108     168   350
ZG222 (0.49)     0.23    0.54   0.22        68      118   222

Direct object relation
Gur65 (0.81)     0.56    0.77   0.69        49      63    65
Gur350 (0.69)    0.46    0.74   0.63        100     168   350
ZG222 (0.49)     0.24    0.53   0.29        65      118   222

Table 6.1: Pearson correlation of similarity/relatedness estimations by our models and human judgements.

We see that for both grammatical relations, the word2vec model achieves a higher correlation with the human judgements than our graph model. Still, the graph model yields a positive, moderate correlation with the human judgements. Also, we have to take into account that the inter-annotator correlation for two of the three data sets is rather low, especially for ZG222.

Interestingly, the correlation of the graph-based model increases drastically when the direct object relation is used to determine noun similarity, e.g. from 0.38 to 0.56 for the Gur65 test set. This is in line with, e.g., Wunsch (2010), who excluded subject-verb relations from his selectional preference model, since he found that verbs hardly feature designated preferences towards their subjects.19 That is, the results suggest that the direct object relation is more informative than the subject relation when it comes to determining the semantic similarity of nouns based on their distributions as arguments of verbs.

The word2vec model is not affected by the choice of the grammatical relation, however. This is not too surprising, since the model learns the vector representations based on all context words within a given window, while the graph model relies only on the specific co-occurrences of nouns and verbs given a specific grammatical relation. Since the grammatical relations yield different verb-argument pairs, the graph model's similarity judgements differ as well.

Concerning applicability (Appl.), we see that the number of applicable pairs is slightly higher for the subject relation than for the direct object relation. Note that this count only includes pairs where both models have representations of both nouns in the pair. Thus, the applicability of the individual models is potentially higher.20 Given the subject relation, our models jointly cover 84% of the noun pairs in the Gur65 set, 64% of the noun pairs in the Gur350 set, and 58% of the noun pairs in the ZG222 set.

Cf. section 6.6.
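The coverage figures can be reproduced from the counts in Table 6.1 as the ratio of applicable pairs (Appl.) to noun pairs (NN). The following sketch assumes exactly this reading; the names `TABLE_6_1_COUNTS` and `coverage_percent` are illustrative, while the counts themselves are copied from the table.

```python
# Counts from Table 6.1 as (Appl., NN, Total) per relation and test set.
TABLE_6_1_COUNTS = {
    "subject": {
        "Gur65": (53, 63, 65),
        "Gur350": (108, 168, 350),
        "ZG222": (68, 118, 222),
    },
    "direct object": {
        "Gur65": (49, 63, 65),
        "Gur350": (100, 168, 350),
        "ZG222": (65, 118, 222),
    },
}


def coverage_percent(applicable, noun_pairs):
    """Share of noun pairs for which both models have representations, in %."""
    return round(100 * applicable / noun_pairs)


for relation, rows in TABLE_6_1_COUNTS.items():
    for data_set, (applicable, noun_pairs, _total) in rows.items():
        pct = coverage_percent(applicable, noun_pairs)
        print(f"{relation:13s} {data_set:7s} {pct}%")
```

Under the same reading, the direct object relation yields a slightly lower coverage on all three sets, which matches the observation about the applicable counts.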

Overall, we conclude from this evaluation that both our models are capable of producing similarity judgements that correlate with human judgements. We also see that the models' judgements correlate with each other (Graph-W2V), but not perfectly so. Thus, there is potential for complementary use of the models. We next investigate how well the models fare w.r.t. the task they were designed for, i.e. identifying antecedents of pronouns. Before doing so, we discuss related work on selectional preferences of verbs w.r.t. pronoun resolution.

In document Incremental Coreference Resolution for German (Page 168-170)