6.7 The distributional compatibility models as postfilters for nbest candidate
6.7.1 Learning when to apply the postfilter
The evaluation in the previous section shows that our distributional models of compatibility have the potential to substantially improve the salience-based resolution approach. However, the evaluation setting is unrealistic, since in a real-world application, a strategy is needed to decide which of the approaches to apply for a given pronoun instance and the constellation of its antecedent candidates. We saw that the resolution performance of our compatibility models is below that of the salience-based approach. That
Chapter 6. Semantics for pronoun resolution 162
is, always selecting the antecedents that these models identify during the re-ranking will harm system performance overall.
A strategy is needed to decide which of the suggested antecedents should be picked in the cases where the models indicate different ones. In other words, we need a formal criterion to select the appropriate method for each pronoun. Our initial efforts, which derived features from the compatibility models and incorporated them into the salience model to rank all candidates, did not affect performance significantly. In this regard, our findings align with Kehler et al. (2004) and Wunsch (2010). Therefore, similar to Lappin and Leass (1994), we have opted for the strategy of nbest re-ranking, i.e. re-ranking the top two antecedent candidates as identified by the salience-based approach.
We have conducted initial experiments with a classifier that aims to identify which model to choose given the verb governing the pronoun and its two top-ranked antecedents. Our main idea is to combine different features that indicate the applicability of the distributional models in the cases where they disagree with the salience-based antecedent selection. Obvious features are the confidence values of the models regarding their decisions. Our entity-mention model calculates scores for each candidate, and we can access these scores and their differences to assess the classifier’s confidence by comparing them. The compatibility models also produce scores which we can access and compare. The task is then to learn thresholds for the confidence measures and their differences in order to decide whether the salience-based antecedent selection should be revised in cases where the verb models disagree with it. This is the approach that Lappin and Leass (1994) and Dagan et al. (1995) applied, although they set the confidence thresholds manually. One problem with this approach is that it assumes that small differences in the salience-based antecedent scoring (indicating weak confidence in the antecedent selection) coincide with large differences in the scores assigned by the verb models (indicating high confidence). Since there is no clear motivation for this assumption, we argue that it is more beneficial to focus on the verb-based models and neglect the scores assigned by the salience model.
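The decision rule discussed above, restricted to the scores of the compatibility model, can be sketched as a simple margin test. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the function name `choose_antecedent`, the threshold value and the example scores are all hypothetical.

```python
def choose_antecedent(salience_ranked, compat_scores, min_margin=0.2):
    """Pick one of the two top-ranked antecedent candidates.

    salience_ranked: the two candidates, best salience first.
    compat_scores:   hypothetical compatibility score per candidate.
    min_margin:      hypothetical threshold; it would have to be learned.
    """
    first, second = salience_ranked
    margin = compat_scores[second] - compat_scores[first]
    # Revise the salience-based choice only if the compatibility model
    # prefers the second candidate by a sufficiently large margin.
    if margin >= min_margin:
        return second
    return first


# Invented scores, purely for illustration.
candidates = ["Tisch", "Hund"]
print(choose_antecedent(candidates, {"Tisch": 0.1, "Hund": 0.7}))
print(choose_antecedent(candidates, {"Tisch": 0.4, "Hund": 0.5}))
```

In the first call the compatibility model is confident enough to overrule the salience ranking; in the second call the margin is too small, so the salience-based choice is kept.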
One of the features we envision in this direction is aimed at capturing the selectional narrowness of the verb governing the pronoun w.r.t. the grammatical function of the pronoun. For example, we expect the verb bellen (to bark) to have a more strict selection regarding its subjects than e.g. machen (to make). The main motivation for investigating this feature is that a narrow selection should correlate with the trustworthiness of the antecedent selection of the compatibility models.
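One conceivable way to quantify selectional narrowness, offered here only as an illustrative assumption and not as the measure used in this work, is the Shannon entropy of the arguments observed in the relevant grammatical function: the lower the entropy, the narrower the selection. The function name and the counts below are invented for illustration.

```python
import math


def selectional_entropy(arg_counts):
    """Shannon entropy (in bits) of a verb slot's argument distribution."""
    total = sum(arg_counts.values())
    entropy = 0.0
    for count in arg_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Invented subject counts, purely for illustration.
bellen_subjects = {"Hund": 18, "Welpe": 2}
machen_subjects = {"Mann": 5, "Firma": 5, "Kind": 5, "Regierung": 5}

# A narrowly selecting verb should receive the lower entropy.
print(selectional_entropy(bellen_subjects) < selectional_entropy(machen_subjects))
```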
Another feature that we deem helpful in deciding whether to trust the distributional models’ antecedents is the similarity of the two antecedent candidates. We assume that the more dissimilar the two candidates are, the more trustworthy the decisions of the models are.
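The dissimilarity requirement could, for instance, be checked with cosine similarity over vector representations of the two candidates. The sketch below assumes that such vectors are available (e.g. from an embedding model); the function names, the threshold and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidates_dissimilar(vec_a, vec_b, max_similarity=0.5):
    """True if the candidates are dissimilar enough to trust the verb model.

    max_similarity is a hypothetical threshold that would need tuning.
    """
    return cosine_similarity(vec_a, vec_b) <= max_similarity


# Toy vectors, purely for illustration.
print(candidates_dissimilar([1.0, 0.0], [0.0, 1.0]))
print(candidates_dissimilar([1.0, 2.0], [2.0, 4.0]))
```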
Since these intuitions require further investigation, we leave it to future work to explore and parametrize them empirically. Given the long-standing debate among researchers about whether incorporating verb semantics into pronoun resolution is a fruitful endeavor, we subscribe to the camp cheering in favor of doing so.
6.8 Chapter summary
This chapter explored the use of the distributional hypothesis to model compatibility of antecedent candidates and a pronoun’s context.
We have presented a graph representation of first-order co-occurrence of verbs and arguments, and second-order co-occurrence among arguments. Within this representation, we have defined compatibility metrics and similarity scores that enabled us to address the sparsity problem. We have contrasted the graph model with a state-of-the-art approach to word similarity modeling within distributional semantics, i.e. word2vec. We found that the word2vec model provides better coverage, i.e. it applies to more pronoun instances, while the graph-based model achieves slightly higher Precision. A combination of both models in an oracle setting further increased performance.
In contrast to related work that used selectional preferences of verbs as a means to semantically represent a pronoun’s context, we have included additional verb arguments of the verb governing the pronoun to determine compatibility with an antecedent candidate. We have argued that verb selectional preferences are not always narrow enough to favor one candidate over the other. The inclusion of the additional verb arguments helps to narrow down the selection in cases of (di-)transitive verbs.
A clear benefit of our framework over related work is that the entity-mention model can provide nominal antecedents for pronouns that are themselves antecedents for subsequent pronouns, since the antecedents of resolved pronouns are accessible during the traversal of a document. Related work that processes markables (including pronouns) in a pairwise fashion does not have access to these antecedents. In such approaches, selectional preferences can thus only be applied to pronoun instances where the relevant antecedent candidates are all nouns.
Apart from Lappin and Leass (1994), who reported a small accuracy improvement, related work has so far reported mixed or negative results on incorporating verb semantics into pronoun resolution. Kehler et al. (2004) and Wunsch (2010) reported no performance impact when incorporating features denoting selectional preferences into their classifier. Klebanov and Peter (2002) and Bergsma et al. (2008a) showed that their models of verb semantics were able to outperform simple baselines, but did not incorporate their models into real-world pronoun resolution systems.
Although we have not yet found a way to decide when to apply our models, we have shown that they have a large potential to improve performance of a real-world pronoun resolution system which, by itself, reaches state-of-the-art performance. How much of this potential can be harvested in a fully automated setting will have to be determined by future work.
Chapter 7
Conclusions and future work
Underspecification of German pronouns. The main interest of this thesis was to develop a procedure for coreference resolution that addresses the problem of local underspecification of mentions. While underspecification poses a problem in coreference resolution in general, we argued that it is particularly problematic regarding certain German pronouns that feature underspecified morphological properties.
We presented an entity-mention model which efficiently remedies the problem of inconsistent coreference decisions by incrementally disambiguating properties of mentions. Our main hypothesis stated that performance of pronoun resolution for German improves when a consistent solution for the problem of underspecification is devised. We found empirically that the entity-mention model improves performance of pronoun resolution compared to related work which does not address this issue.
Coupled with heuristics to resolve nominal mentions, the incremental entity-mention model achieved new state-of-the-art performance in German coreference and pronoun resolution. Whether our approach of incrementally disambiguating properties of mentions is beneficial for coreference resolution in other languages has to be determined by future work.
Evaluation of coreference and pronoun resolution. We argued that the common evaluation framework for coreference and pronoun resolution is not tailored to the specific requirements of downstream applications. By devising the ARCS metrics, we aimed at developing an evaluation framework that supports the view of prospective downstream applications.
We showed that evaluation of our approach to pronoun resolution yields varying performance levels when different requirements regarding the antecedents are applied, ranging from 90% accuracy of classifiers under idealized settings to 65% F-score when pronouns are required to link to the first mentions of the entities they denote. Such a requirement is not unusual for downstream applications that seek mentions of specific target entities. Thus, pronoun resolution remains a challenging task.
In this light, we encourage future work to investigate the crucial link of pronouns to nominal antecedents, since pronouns are often followed by subsequent pronouns. If the first pronominal mention of an entity is resolved incorrectly, all pronouns linked subsequently to that first pronominal mention denote an incorrect underlying entity and are thus irrelevant from the perspective of downstream applications. We believe that paying attention to this problem will significantly improve the benefit that coreference and pronoun resolution systems provide for downstream applications.
The state of the art in coreference resolution changes rapidly, and progress is often made in small steps. We outlined that evaluation of coreference is affected by a variety of factors. Therefore, it is often not clear why a particular system achieves better performance than another. In an effort to shed light on these differences, we have extended the ARCS framework to accommodate an in-depth comparison of system outputs. This comparison enables an arguably more informative view on the performance differences between systems than the comparison of small changes in averaged F-score. Thus, we encourage researchers to demonstrate in what regard their approach works better compared to related work. Together with the recent approaches to systematic and automated error analysis for coreference, we hope to have provided a tool for this purpose.
Semantics for pronoun resolution. We investigated distributional models that capture the semantic compatibility of antecedent candidates and contexts of pronouns. As an extension to related work, we proposed to take into account the additional arguments of a verb that governs a pronoun to determine compatibility with the antecedent candidates. We showed that the models have the potential to correct a large number of erroneous pronoun resolutions of the salience-based antecedent selection. However, we found that devising strategies to successfully integrate the models into the salience-based resolution approach in a real-world setting is difficult. Given the potential for error reduction and the leveling performance of salience-based approaches, we encourage future work to further pursue this direction.
An interesting approach would be to narrow down the set of verbs whose selectional preferences are applicable to pronoun resolution. We argued that not all verbs have a selectional preference which is narrow enough to be useful for pronoun resolution. We proposed to address this issue indirectly by requiring dissimilarity between the antecedent candidates in order for the verb selectional preferences to be taken into account. A different approach would be to narrow down the set of verbs to those that have specific selectional preferences. Furthermore, psycholinguistic research has investigated verbs that promote either their subjects or their direct objects for subsequent mentions by pronouns. For example, constructions like “Peter accused Paul that he [...]” clearly mark the object of the matrix verb as the antecedent for the pronoun in the subordinate clause. Traditional approaches to pronoun resolution would resolve this pronoun incorrectly because salience dictates subject preference and favors parallelism of grammatical roles. Thus, we believe that identifying verbs and constructions that violate these general salience patterns is a fruitful direction.