5.4 Evaluation of pronoun resolution performance
5.4.4 Performance impact of real preprocessing components
As mentioned in section 1.5.2, preprocessing is an integral part of any coreference resolution system. It has been shown that the performance of coreference resolution, including pronoun resolution, depends heavily on the quality of preprocessing components such as PoS tagging, syntactic parsing, and morphological analysis (Schiehlen, 2004; Klenner et al., 2010, inter alia). So far, we have assumed perfect preprocessing information. While doing so eliminates noise when investigating coreference resolution performance, it presents an unrealistic setting for real-world applications of a coreference system. Therefore, we report the performance of the entity-mention and the mention-pair model w.r.t. pronoun resolution when real preprocessing components are used to extract the markables and their features. However, we keep our method of aligning gold mention boundaries to the boundaries of the corresponding extracted markables, because we argue that the precise identification of the boundaries (i.e. including or excluding a PP or a relative clause) is irrelevant for higher-level applications.
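To make the idea of lenient boundary alignment concrete, the following is a minimal sketch that maps a gold mention to the extracted markable sharing the most tokens with it. The overlap criterion, the `Span` representation, and the function names are assumptions made for illustration; the alignment method actually used in this work is defined elsewhere and may differ (e.g. it may rely on syntactic heads).

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) token indices, end inclusive


def overlap(a: Span, b: Span) -> int:
    """Number of tokens shared by two spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def align(gold: Span, extracted: List[Span]) -> Optional[Span]:
    """Return the extracted markable sharing the most tokens with `gold`.

    Assumption: token overlap is the alignment criterion. Returns None
    if no extracted markable overlaps the gold mention at all.
    """
    best = max(extracted, key=lambda m: overlap(gold, m), default=None)
    if best is None or overlap(gold, best) == 0:
        return None
    return best


# The gold mention covers tokens 4-9 (e.g. an NP including a relative clause),
# while the parser-based markable covers only tokens 4-5.
print(align((4, 9), [(0, 1), (4, 5), (11, 12)]))
```

Under such a scheme, a markable that omits a trailing PP or relative clause is still matched to its gold mention, which is the behaviour the argument above relies on.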
34 The Semeval 2010 Shared task on coreference resolution for multiple languages (Recasens et al., 2010) also featured a German data set compiled from an earlier version of the TüBa-D/Z corpus. However, the participating systems did not fare particularly well regarding German pronoun resolution, as reported in Tuggener and Klenner (2014). This is not surprising, since German pronoun resolution was not the focus of the task. Therefore, we refrain from re-running our system on the Semeval data and from comparing it to the participating systems.
35
Chapter 5. Empirical validation of our entity-mention model 121
We apply the ParZu parser (Sennrich et al., 2013), which provides PoS tagging and morphological analysis in addition to dependency parsing. The parser is an adaptation of the Pro3Gres dependency parser for English (Schneider, 2008) to German. For named entity recognition, we use the Stanford Named Entity Recognizer36 with the model for German provided by Faruqui and Padó (2010). We keep the tokenization given by the gold standard to avoid token alignment problems in evaluation. Apart from tokenization, all preprocessing is fully automated. The evaluation thus represents the performance of our system in a real-world setting. We compare the entity-mention and mention-pair models and use our antecedent selection strategy introduced in section 5.3.2. Furthermore, we found that training the weights on the training set preprocessed with the real components gives slightly better results than using the weights obtained from gold-preprocessed training data.
Apart from using the TüBa-D/Z development and test sets, we evaluate the systems on the Potsdam Commentary Corpus (PCC).37 The corpus does not annotate relative pronouns, only personal and possessive pronouns.
               PPER                  PPOSAT                   ALL
            R      P     F1        R      P     F1        R      P     F1
TüBa-D/Z Development set
E-M     64.80  62.87  63.82    68.68  62.19  65.27    65.52  64.00  64.75
M-P     59.19  57.66  58.42    64.78  58.82  61.66    62.24  60.73  61.48
TüBa-D/Z Test set
E-M     64.37  62.87  63.61    68.13  60.64  64.17    65.15  63.45  64.29
M-P     59.33  58.03  58.67    67.60  60.20  63.68    63.13  61.44  62.27
Potsdam Commentary Corpus
E-M     70.55  66.67  68.55    70.62  60.68  65.27        -      -      -
M-P     66.34  62.88  64.57    66.10  56.80  61.10        -      -      -
Table 5.20: Functional evaluation of pronoun resolution performance using real preprocessing components.
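As a quick consistency check on the table, the F1 column can be recomputed as the harmonic mean of recall (R) and precision (P). The sketch below does this for the E-M rows on personal pronouns; the function and variable names are illustrative only. Because R and P are themselves rounded to two decimals, the recomputed value may differ from the reported one in the last digit.

```python
def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (both given in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision, reported F1) for the E-M model on PPER, from Table 5.20
rows = {
    "TüBa-D/Z development set": (64.80, 62.87, 63.82),
    "TüBa-D/Z test set": (64.37, 62.87, 63.61),
    "Potsdam Commentary Corpus": (70.55, 66.67, 68.55),
}

for name, (r, p, reported) in rows.items():
    print(f"{name}: recomputed {f1(r, p):.2f}, reported {reported:.2f}")
```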
36 http://nlp.stanford.edu/software/CRF-NER.shtml
37 Cf. section 5.1.1. The corpus currently only provides coreference annotation in the CoNLL format. Thus, we were not able to evaluate our approaches based on gold preprocessing on this corpus.
Table 5.20 shows the results of the functional evaluation, which requires pronouns to (transitively) link to nominal antecedents (i.e. the ARCS inferred antecedent metric). On the TüBa-D/Z data, performance drops by roughly 6 to 10 percentage points in F-score when real preprocessing is used instead of gold preprocessing. In table 5.16, we saw that the entity-mention model (E-M) achieved F-scores of 70.15 and 73.39 on the TüBa-D/Z development and test set, respectively, for personal pronouns (PPER). Given real preprocessing, F-scores drop to 63.82 and 63.61, respectively. A loss of the same magnitude is observed for possessive pronouns and for the performance over all pronouns. For possessive pronouns, the entity-mention model reached F-scores of 72.91 and 74.09 on the development and test set, respectively. Here, performance is lowered to F-scores of 65.27 and 64.17.
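The size of the loss can be made explicit by subtracting the F-scores under real preprocessing from those under gold preprocessing. The short sketch below uses only the entity-mention figures quoted in the text; the dictionary layout is an illustrative choice. The resulting drops lie between 6.33 and 9.92 percentage points.

```python
# F1 scores of the entity-mention model as quoted in the text:
# gold preprocessing (table 5.16) vs. real preprocessing (table 5.20).
gold = {("PPER", "dev"): 70.15, ("PPER", "test"): 73.39,
        ("PPOSAT", "dev"): 72.91, ("PPOSAT", "test"): 74.09}
real = {("PPER", "dev"): 63.82, ("PPER", "test"): 63.61,
        ("PPOSAT", "dev"): 65.27, ("PPOSAT", "test"): 64.17}

# Loss in percentage points, rounded to the precision of the reported scores.
drops = {key: round(gold[key] - real[key], 2) for key in gold}

for (pronoun_type, split), drop in drops.items():
    print(f"{pronoun_type} {split}: {drop:.2f} points lower")
```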
The evaluation of our systems on the Potsdam Commentary Corpus (PCC) shows that performance on possessive pronouns differs little from that on the TüBa-D/Z data sets. For personal pronouns, performance is higher on the PCC. However, the PCC is considerably smaller than the TüBa-D/Z test set.38 Still, the results suggest that our system does not overfit the TüBa-D/Z data and can be applied to other corpora.
Furthermore, the table shows that the entity-mention model outperforms the mention-pair model on all data sets. Thus, the improvements achieved on gold data carry over to an evaluation in a real-world setting.
This evaluation shows that pronoun resolution in a real-world setting, where pronouns have to meet the requirements of downstream applications (i.e. identify nominal antecedents) and where the systems have to rely on automated preprocessing, remains a challenging task. Pair-wise evaluation and the use of gold preprocessing do not adequately reflect these requirements.
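The requirement that a pronoun (transitively) identifies a nominal antecedent can be illustrated with a small sketch that follows antecedent links until a non-pronominal mention is reached. This is not an implementation of the ARCS metric; the mention identifiers, the `antecedent_of` mapping, and the function name are hypothetical and serve only to show the transitive lookup.

```python
from typing import Dict, Optional, Set


def nominal_antecedent(mention: str,
                       antecedent_of: Dict[str, str],
                       pronouns: Set[str]) -> Optional[str]:
    """Follow antecedent links from `mention` until a non-pronoun is reached."""
    seen = {mention}
    current = antecedent_of.get(mention)
    while current is not None:
        if current not in pronouns:
            return current      # first nominal mention on the chain
        if current in seen:
            return None         # guard against cyclic links
        seen.add(current)
        current = antecedent_of.get(current)
    return None                 # chain ended without a nominal mention


# Hypothetical mention ids: m2 and m3 are pronouns, m1 is a noun phrase.
links = {"m3": "m2", "m2": "m1"}
pronoun_ids = {"m2", "m3"}
print(nominal_antecedent("m3", links, pronoun_ids))
```

In this toy chain, the pronoun m3 links only to another pronoun, so a pair-wise view says nothing about whether a downstream application could recover the noun phrase m1; the transitive lookup makes that dependency explicit.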