(Co)Reference Resolution in Wikipedia

Razvan Bunescu

School of Electrical Engineering and Computer Science Ohio University

Abstract

We present a coreference resolution system aimed at improving information extraction from Wikipedia. The proposed system is trained on a corpus of Wikipedia articles manually annotated with coreference and reference information. Experimental results demonstrate that highly accurate coreference decisions can be made for 75% of the input, and furthermore indicate that Wikipedia’s structure can be used to boost the resolution performance.

1 Introduction

The wealth of user-contributed knowledge in Wikipedia (WP) has led to a number of projects that leverage its rich structural information in order to extract highly structured repositories of world knowledge, such as DBpedia (Bizer et al., 2009) or YAGO (Suchanek et al., 2007). These and other similar resources rely exclusively on the structured information from WP, therefore missing numerous ontological relations asserted by the text. Figure 1 shows a paragraph from the WP article on Barack Obama, together with the three relational facts that it asserts about Obama. While the first two relations may be inferred from the Alma Mater attribute and its values in the article’s infoboxes, the third relation is stated only in the text. We believe that an automatic method for extracting relational information from the text of the articles would significantly enrich the world knowledge that is already available in a structured form in WP. As illustrated in the sentence from Figure 1, central to such a method is the capability to perform reference resolution: in order to extract the president relation, the method needs to determine that he refers to Barack Obama.

Reference resolution inside WP articles may be construed as a sequence of two steps:

“Obama[x1] is a graduate of Columbia University[x2] and Harvard Law School[x3], where he[x1] was the president of the Harvard Law Review[x4].”

1. Obama is a graduate of Columbia University.
2. Obama is a graduate of Harvard Law School.
3. Obama was the president of Harvard Law Rev.

Figure 1: Generic coreference example.

1. In the coreference resolution step, all noun phrases in an article are clustered into a set of maximal non-overlapping clusters, such that mentions in a cluster refer to the same discourse entity.

2. In the reference resolution step, each cluster is linked to the corresponding WP entity title, if such a title exists.

In this paper, we study the task of coreference resolution in the context of WP articles. Besides the already discussed utility of reference resolution for extracting a highly structured repository of world knowledge, our decision to focus coreference resolution on WP text was also motivated by the hypothesis that WP-specific structures, such as title links and category links, can be utilized to boost the coreference resolution accuracy.

2 A Wikipedia Corpus for (Co)Reference

We manually annotated three WP articles with both coreference and reference information. Unlike ACE, we are interested in annotating coreference between mentions of all possible types of entities. Consequently, we based our annotation guidelines on the MUC coreference task definition, which we further simplified in order to make the annotation task easier and more reproducible. As in MUC and ACE, we mark coreference only between markable noun phrases. Proper names, definite and indefinite noun phrases, and pronouns are all considered seed markables. Bare nouns are considered markable only if 1) they are not modifiers in compound nouns, and 2) they corefer with a seed markable.

The seed markables are identified using exclusively syntactic criteria. For lack of a comprehensive, operational definition of what constitutes a name, we define proper names to be noun phrases in which content words are tagged as proper nouns. In general, we consider the entire maximal NP to be a name, as in “[Harvard Law Review]”. Exceptions are titles or other capitalized attributes, which are considered separately from the name they modify, as in “[President] [Barack Obama]”. Seed markables such as proper names and definite NPs are prototypical examples of definite descriptions, whereas indefinite NPs are typically used as indefinite descriptions (Abbott, 2006). Although they can also be used generically, definite (indefinite) NPs are generally unambiguous between the generic and the definite (indefinite) use.

The theoretical interpretation of bare NPs is more controversial. Here we follow the view that bare NPs are systematically ambiguous (Wilkinson, 1991; Gerstner-Link and Krifka, 1993). For example, plural bare NPs can refer generically to a class of objects, as in “potatoes contain vitamin C”, or to a specific set of objects, as in “potatoes rolled out of the bag” (Krifka, 2004). It is very often difficult, if not impossible, to disambiguate between the two interpretations: in “Obama criticized partisan views of the electorate”, partisan views may be interpreted either as a generic or as an indefinite description. This and other types of ambiguity may render the annotation task very difficult. We avoid such ambiguous cases by restricting the types of bare NPs that are markable to those that are coreferent with seed markables, which are significantly easier to identify and interpret.

Our definition of markable depends on certain coreference relations. As noted by van Deemter and Kibble (2000), this may be seen as a drawback if the intention is to use the markables as input for the coreference resolution proper. However, this is not the case in our work: the markables are only introduced to make the manual annotation easier. The input to the actual coreference resolution system will consist of all the words in the text that may enter in a referential relation, such as common nouns, proper names, numbers, or pronouns (the MENTIONS column in Table 1).

We annotate coreference only between syntactic heads of noun phrases by associating them with the same coreference chain identifier. Borrowing from the ACE terminology, we mark coreference relations between NPs in appositions and predicative patterns as attributive. We also use special attributes to mark coreference in negative and modal contexts.

Once the coreference chains are annotated, we manually link them to the WP title that describes the corresponding entity, if such a title exists. The task is made easier by the fact that some markables in the WP article are already linked to the corresponding title, since WP guidelines require the first mention of an entity to be hyperlinked to its title. Manual work is nevertheless needed to ensure 1) completeness, since not all coreference chains (singletons included) contain a mention that is hyperlinked in WP, and 2) correctness, since many links in WP relate to the sense, and not to the reference, of a phrase. We use WP titles describing the meaning of a word (e.g. Potato) as reference links only for generic uses of that word.

Table 1 shows the number of syntactic heads considered as input MENTIONS in each article (M), the number of coreference CHAINS corresponding to these mentions (C), the number of SINGLETON coreference chains (S), and the number of chains annotated with a WP entity (E).

Title             M      C      S     E
Barack Obama      2141   1351   352   212
New York Times    1454   1026   176   146
John Williams     1236   797    272   176
WP Corpus         4831   3174   800   534

Table 1: Wikipedia Corpus Statistics.

3 A Coreference Resolution Algorithm

We cast coreference resolution as greedy agglomerative clustering, using the algorithm shown in Table 2. The algorithm starts by initializing the clustering $C$ with a set of singleton clusters. Then, as long as the clustering has more than one cluster, it repeatedly finds the highest scoring pair of clusters $\langle C_i, C_j \rangle$. If the score passes the threshold $S(\emptyset, \emptyset)$, it joins the clusters $C_i$ and $C_j$. $\Phi(C_i, C_j)$ is defined as the average of the feature vectors $\phi(x_i, x_j)$ over all pairs $(x_i, x_j) \in C_i \times C_j$.
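Written out, the cluster-level feature vector and score described above can be read as follows. This is a direct transcription of the prose definition (the average over $C_i \times C_j$) together with the scoring function given in Table 2; the normalisation by $|C_i|\,|C_j|$ is simply what "average over all pairs" amounts to.

```latex
\[
\Phi(C_i, C_j) \;=\; \frac{1}{|C_i|\,|C_j|} \sum_{x_i \in C_i} \sum_{x_j \in C_j} \phi(x_i, x_j),
\qquad
S(C_i, C_j) \;=\; w^{T} \Phi(C_i, C_j)
\]
```

By linearity, $S(C_i, C_j)$ is then the average of the pairwise scores $w^{T} \phi(x_i, x_j)$ over the same pairs.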

Cluster($X$, $S$)

[In]: A set of mentions $X = \{x_1, x_2, ..., x_n\}$.

[In]: A scoring function $S(C_i, C_j) = w^{T} \Phi(C_i, C_j)$.

[Out]: A greedy clustering of $X$.

1. let $C_i = \{x_i\}$, for $1 \le i \le n$
2. let $C = \{C_i\}_{1 \le i \le n}$
3. let $\langle C_i, C_j \rangle = \arg\max_{p \in P(C)} S(p)$
4. while $|C| > 1$ and $S(C_i, C_j) \ge S(\emptyset, \emptyset)$:
5.     replace $C_i$, $C_j$ in $C$ with $C_i \cup C_j$
6.     set $\langle C_i, C_j \rangle \leftarrow \arg\max_{p \in P(C)} S(p)$
7. return $C$

Table 2: Greedy Agglomerative Clustering.

The function $P$ takes a clustering $C$ as argument and returns a set of cluster pairs $\langle C_i, C_j \rangle$ as follows:

$P(C) = \{\langle C_i, C_j \rangle \mid C_i, C_j \in C,\ C_i \neq C_j\} \cup \{\langle \emptyset, \emptyset \rangle\}$

$P(C)$ contains a special cluster pair $\langle \emptyset, \emptyset \rangle$, where $\Phi(\emptyset, \emptyset)$ is defined to contain a feature uniquely associated with this pair. Its corresponding weight is learned together with all other weights and will effectively function as a clustering threshold.
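The clustering procedure and the pair set $P(C)$ can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's implementation: the three-dimensional feature space, the toy `phi`, the `(id, text)` mention representation, and the hand-picked weights are all hypothetical. The loop stops as soon as the null pair is the highest scoring candidate, which is one reading of the threshold test in Table 2.

```python
from itertools import combinations, product

import numpy as np

# Hypothetical toy feature space: [same_string, bias, null_pair].
DIM, NULL_FEATURE = 3, 2
NULL_PAIR = (frozenset(), frozenset())  # stands in for the special pair <∅, ∅>


def phi(xi, xj):
    """Toy pairwise features for mentions given as (id, text) tuples."""
    v = np.zeros(DIM)
    v[0] = 1.0 if xi[1].lower() == xj[1].lower() else 0.0
    v[1] = 1.0
    return v


def cluster_features(ci, cj):
    """Phi(Ci, Cj): mean of phi over Ci x Cj; the null pair fires only its own feature."""
    if not ci and not cj:
        v = np.zeros(DIM)
        v[NULL_FEATURE] = 1.0
        return v
    return np.mean([phi(a, b) for a, b in product(ci, cj)], axis=0)


def candidate_pairs(clustering):
    """P(C): all pairs of distinct clusters, plus the null pair (listed last)."""
    return list(combinations(clustering, 2)) + [NULL_PAIR]


def greedy_cluster(mentions, w):
    """Merge the best-scoring pair until the null pair wins or one cluster is left."""
    clustering = [frozenset([m]) for m in mentions]

    def score(pair):
        return float(w @ cluster_features(*pair))

    while len(clustering) > 1:
        # max() returns the first maximum, so ties favour a real pair over the null pair.
        ci, cj = max(candidate_pairs(clustering), key=score)
        if not ci and not cj:  # the threshold pair scored highest: stop merging
            break
        clustering = [c for c in clustering if c not in (ci, cj)] + [ci | cj]
    return clustering


mentions = [(0, "Obama"), (1, "he"), (2, "obama"), (3, "Harvard")]
w = np.array([2.0, -1.0, 0.0])  # hand-picked weights, for illustration only
clusters = greedy_cluster(mentions, w)
```

With these toy weights only the two string-identical mentions are merged; every other candidate pair scores below the null pair, so clustering stops after one merge.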

Tables 3 and 4 show a generic incremental learning algorithm for the weight vector $w$ that is parametrized with the number of training epochs $T$ and a set of clusterings $\mathcal{C}$ in which each clustering contains the coreference clusters from one document. The notation $C \models \langle C_k, C_l \rangle$ means that combining $C_k$ and $C_l$ is consistent with the known clustering $C$, i.e. there is a cluster $C_p \in C$ such that $(C_k \cup C_l) \subseteq C_p$.

$w$ = Train($\mathcal{C}$, $T$)

[In]: A dataset of training clusterings $\mathcal{C}$.
[In]: The number of training epochs $T$.
[Out]: The averaged parameter vector $w$.

1. let $w = 0$
2. for $t = 1$ to $T$ do
3.     foreach $C \in \mathcal{C}$ do
4.         Update($C$, $w$)
5. return $w$

Table 3: Incremental Learning.

The proposed algorithm is similar to the error-driven approach of Culotta et al. (2007), with the following notable differences: 1) the update step does not stop after the first clustering error, and 2) $\Phi(C_i, C_j)$ is built out of feature vectors $\phi(x_i, x_j)$ over pairs $(x_i, x_j) \in C_i \times C_j$ (as opposed to $C_i \cup C_j$). During development on ACE corpora, these two changes were observed to yield a significant improvement in accuracy.

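Update($C$, $w$)

[In]: A clustering $C = \{C_1, C_2, ..., C_m\}$.

[In/Out]: The parameter vector $w$.

1. let $X = C_1 \cup C_2 \cup ... \cup C_m = \{x_1, x_2, ..., x_n\}$
2. let $\hat{C}_i = \{x_i\}$, for $1 \le i \le n$
3. let $\hat{C} = \{\hat{C}_i\}_{1 \le i \le n}$
4. while $\hat{C} \neq C$ do:
5.     let $\langle \hat{C}_i, \hat{C}_j \rangle = \arg\max_{\langle \hat{C}_i, \hat{C}_j \rangle \in P(\hat{C})} S(\hat{C}_i, \hat{C}_j)$
6.     unless $C \models \langle \hat{C}_i, \hat{C}_j \rangle$:
7.         let $\langle \hat{C}_k, \hat{C}_l \rangle = \arg\max_{C \models \langle \hat{C}_k, \hat{C}_l \rangle \in P(\hat{C})} S(\hat{C}_k, \hat{C}_l)$
8.         set $w \leftarrow w + \Phi(\hat{C}_k, \hat{C}_l) - \Phi(\hat{C}_i, \hat{C}_j)$
9.         set $i \leftarrow k$, $j \leftarrow l$
10.     replace clusters $\hat{C}_i$, $\hat{C}_j$ in $\hat{C}$ with $\hat{C}_i \cup \hat{C}_j$
11. let $\langle C_i, C_j \rangle = \arg\max_{\langle C_i, C_j \rangle \in P(C)} S(C_i, C_j)$
12. unless $\langle C_i, C_j \rangle = \langle \emptyset, \emptyset \rangle$:
13.     set $w \leftarrow w + \Phi(\emptyset, \emptyset) - \Phi(C_i, C_j)$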

Table 4: Incremental Update Step.
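The training loop and update step of Tables 3 and 4 might be sketched as follows. This is an illustrative reading rather than the authors' code, and it makes several assumptions that should be kept in mind: the toy feature function and the `(id, text)` mention format are hypothetical; the null pair is treated as inconsistent with the gold clustering while merges are still pending (otherwise the loop could not make progress); ties are broken in favour of real pairs; and the parameter averaging mentioned in Table 3 is omitted, so the final weight vector is returned.

```python
from itertools import combinations, product

import numpy as np

DIM, NULL_FEATURE = 3, 2  # hypothetical toy features: [same_string, bias, null_pair]
NULL_PAIR = (frozenset(), frozenset())  # stands in for <∅, ∅>


def phi(xi, xj):
    """Toy pairwise features for mentions given as (id, text) tuples."""
    v = np.zeros(DIM)
    v[0] = 1.0 if xi[1].lower() == xj[1].lower() else 0.0
    v[1] = 1.0
    return v


def big_phi(ci, cj):
    """Phi(Ci, Cj): mean of phi over Ci x Cj; the null pair fires only its own feature."""
    if not ci and not cj:
        v = np.zeros(DIM)
        v[NULL_FEATURE] = 1.0
        return v
    return np.mean([phi(a, b) for a, b in product(ci, cj)], axis=0)


def pairs(clustering):
    """P(C): all pairs of distinct clusters, plus the null pair (listed last)."""
    return list(combinations(clustering, 2)) + [NULL_PAIR]


def consistent(gold, ci, cj):
    """C |= <Ci, Cj>: some gold cluster contains Ci ∪ Cj (null pair excluded, by assumption)."""
    return bool(ci or cj) and any((ci | cj) <= g for g in gold)


def update(gold, w):
    """One pass of the update step (Table 4); modifies w in place."""
    gold = [frozenset(g) for g in gold]
    current = [frozenset([x]) for g in gold for x in sorted(g)]

    def score(pair):
        return float(w @ big_phi(*pair))

    while set(current) != set(gold):
        ci, cj = max(pairs(current), key=score)
        if not consistent(gold, ci, cj):
            # Clustering error: move w toward the best consistent merge, then apply it.
            ck, cl = max((p for p in pairs(current) if consistent(gold, *p)), key=score)
            w += big_phi(ck, cl) - big_phi(ci, cj)
            ci, cj = ck, cl
        current = [c for c in current if c not in (ci, cj)] + [ci | cj]
    # On the finished clustering, the null pair should outscore every real pair.
    ci, cj = max(pairs(gold), key=score)
    if ci or cj:
        w += big_phi(*NULL_PAIR) - big_phi(ci, cj)


def train(dataset, epochs):
    """Table 3 without averaging: run update over every gold clustering for each epoch."""
    w = np.zeros(DIM)
    for _ in range(epochs):
        for gold in dataset:
            update(gold, w)
    return w


gold = [{(0, "Obama"), (2, "obama")}, {(1, "he")}, {(3, "Harvard")}]
w = train([gold], epochs=10)
```

On this single toy document the weights stop changing after a few epochs, at which point the string-identical pair scores at least as high as the null pair and all other pairs score below it.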

4 Coreference Features

The baseline model uses two sets of features:

1. Mention-level features (especially useful for anaphoricity decisions).

2. Pair-wise features (especially useful for coreference decisions).

Given an active mention, the first category contains features that describe the linguistic form of the mention (e.g. third person pronoun, proper name, definite description) and features that capture the part-of-speech tags of the words surrounding the mention. The pairwise features for a mention pair $(x_i, x_j)$ encode: the sentence and mention distance between $x_i$ and $x_j$, their agreement in gender and number, their semantic similarity based on WordNet senses, string similarity, whether they are in an appositive or predicative construction, or if one is the acronym of the other.

4.1 Wikipedia Coreference Features

We associate every cluster $C_i$ with a reference set $R_i$ and a category set $T_i$. The reference set contains all the reference titles that are hyperlinked from mentions $x_i \in C_i$ in the WP article. For example, if the mention $x_i \in C_i$ is the proper name Harvard Law Review from Figure 1, which is hyperlinked in WP to the article title Harvard Law Review, we add this title to $R_i$. We use a set of simple heuristics to filter out WP hyperlinks that are related to the sense rather than the reference of a word. For example, in “The New York Times is an American daily newspaper founded in 1851”, the word newspaper is linked to the WP article title with the same name that describes the meaning of the word. In this case, we use a heuristic rule that filters out hyperlinks associated with nouns that are the heads of indefinite NPs. Other useful features for deciding the referential status of a WP hyperlink are based on whether the mention is a standalone proper name (e.g. “Harvard Law Review”), whether it is modified by an adjective (e.g. “the current President of the United States”), or whether the modifiers appear in the hyperlinked title (e.g. “the 109th Congress”, which is linked to 109th United States Congress).

The category set $T_i$ is built from the WP categories associated with the titles in $R_i$. For every WP title in $R_i$, we collect the singular form of the syntactic heads of its direct categories, by parsing them as noun phrases. We filter out administrative categories, and categories that do not normally correspond to an IS-A relationship with the title, e.g. categories headed by proper names, or by words such as “births” or “deaths” that are normally associated with pages on people in WP. We also use the presence of categories headed by “births” or “deaths” to set an attribute for every title in $R_i$ indicating whether it refers to a person. The following additional features are computed using the titles in $R_i$ and the categories in $T_i$:

• If $x_i$ is linked to a person title in WP, and $x_j$ is a personal pronoun, then add a special person agreement feature to $\phi(x_i, x_j)$. If, on the other hand, $x_j$ is a neutral pronoun, add a disagreement feature to $\phi(x_i, x_j)$. These features help in ruling out coreference links such as “George Lucas” = “it”, or in enforcing the coreference “George Lucas” = “he”.

• If mention $x_j$ is also a category in $T_i$, and $x_j$ is at most one sentence apart from $x_i$, then add a special feature to $\phi(x_i, x_j)$. This feature helps in identifying coreference links such as “Steven Spielberg” = “the young director”, or “Star Wars” = “the film”.

• If $x_i$ and $x_j$ are hyperlinked to titles in WP, add a special feature to $\phi(x_i, x_j)$ indicating whether the titles are the same or different. These features help in joining clusters corresponding to the same WP title, such as “the Academy Awards” = “the Oscars”.
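The three Wikipedia-based pairwise features listed above could be computed along the following lines. This is a hedged sketch only: the `Mention` record, its field names, the feature names, and the two pronoun lists are hypothetical, and the category set is stored on the mention rather than on its cluster to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative pronoun lists (assumptions, not taken from the paper).
PERSON_PRONOUNS = {"he", "she", "him", "her", "his", "hers", "himself", "herself"}
NEUTRAL_PRONOUNS = {"it", "its", "itself"}


@dataclass
class Mention:
    head: str                       # syntactic head of the mention
    sentence: int                   # index of the sentence containing it
    wp_title: Optional[str] = None  # hyperlinked WP title, if any
    is_person: bool = False         # title has a "births"/"deaths" category
    categories: frozenset = frozenset()  # singular heads of the title's categories


def wp_features(xi, xj):
    """Return the WP-specific features that fire for the mention pair (xi, xj)."""
    feats = {}
    head_j = xj.head.lower()

    # 1) Person agreement / disagreement with a pronoun.
    if xi.wp_title is not None and xi.is_person:
        if head_j in PERSON_PRONOUNS:
            feats["wp_person_agree"] = 1.0
        elif head_j in NEUTRAL_PRONOUNS:
            feats["wp_person_disagree"] = 1.0

    # 2) xj names a category of xi and is at most one sentence away.
    if head_j in xi.categories and abs(xi.sentence - xj.sentence) <= 1:
        feats["wp_category_match"] = 1.0

    # 3) Both mentions are hyperlinked: same title or different titles.
    if xi.wp_title is not None and xj.wp_title is not None:
        key = "wp_same_title" if xi.wp_title == xj.wp_title else "wp_different_title"
        feats[key] = 1.0

    return feats


lucas = Mention("Lucas", 0, "George Lucas", True, frozenset({"director"}))
print(wp_features(lucas, Mention("he", 1)))
print(wp_features(lucas, Mention("director", 1)))
```

Returning a sparse dictionary of fired features keeps the sketch independent of any particular feature indexing; a real system would map these names to positions in $\phi(x_i, x_j)$.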

[Figure 2 plot: Precision (B3) and Recall (B3) axes, each ranging from 0 to 1, with two curves: Baseline + WP and Baseline.]

Figure 2: Precision vs. Recall graphs.

5 Experimental Evaluation

The system is trained for 20 iterations on the first two WP articles from the dataset in Table 1, and tested on the third WP article. The set of input mentions (column M) is created automatically based on POS tags and a proper name recognizer (94% F-measure). Table 5 shows the Precision (P), Recall (R) and F-measure (F) computed using the $B^3$ metric (Bagga and Baldwin, 1998), and the area under the precision vs. recall curve (AUC) corresponding to the graphs in Figure 2.

Method           P      R      F1     AUC
Baseline         90.7   70.6   79.4   82.3
+ WP Features    94.1   70.2   80.4   83.3

Table 5: Coreference results.
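For readers unfamiliar with the $B^3$ metric, the sketch below shows one common way to compute it. It assumes that the key (gold) and response (system) clusterings cover exactly the same mentions; how mismatches between automatically detected and gold mentions were handled in the reported experiments is not specified here. The function names are illustrative.

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def b_cubed(key, response):
    """B3 precision, recall and F over two clusterings of the same mentions."""
    key_of = {m: frozenset(c) for c in key for m in c}
    resp_of = {m: frozenset(c) for c in response for m in c}
    if key_of.keys() != resp_of.keys():
        raise ValueError("key and response must cover the same mentions")
    n = len(key_of)
    # Per-mention overlap between its gold cluster and its system cluster.
    overlap = {m: len(key_of[m] & resp_of[m]) for m in key_of}
    precision = sum(overlap[m] / len(resp_of[m]) for m in key_of) / n
    recall = sum(overlap[m] / len(key_of[m]) for m in key_of) / n
    return precision, recall, f_measure(precision, recall)


key = [{1, 2, 3}, {4, 5}]
response = [{1, 2}, {3, 4, 5}]
p, r, f = b_cubed(key, response)
```

The same `f_measure` helper reproduces the F1 column of Table 5 from its P and R columns: 90.7 and 70.6 give 79.4, and 94.1 and 70.2 give 80.4, after rounding to one decimal.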

The system obtains an F-measure of 80.4%. The WP features improve the overall performance only modestly: F-measure and AUC each rise by 1.0 point over the baseline, with precision increasing from 90.7 to 94.1 and recall dropping slightly from 70.6 to 70.2. In future work we plan to expand the set of Wikipedia-specific features; in particular, we plan to exploit the redundancy of the information contained in Wikipedia in order to mutually reinforce coreference relations across multiple documents.

6 Conclusion

We presented a coreference resolution system specifically tailored for Wikipedia articles. We created a dataset of WP articles manually annotated with coreference and reference information, and used it to train and evaluate an incremental learning approach to coreference. The proposed system holds the promise of improving not only information extraction from WP, but also other types of WP-based systems.

References

Barbara Abbott. 2006. Definite and indefinite. In Keith Brown, editor, Encyclopedia of Language and Linguistics, volume 3, pages 392–399. Elsevier.

Amit Bagga and Breck Baldwin. 1998. Entity-based cross-document coreferencing using the vector space model. In Christian Boitet and Pete Whitelock, editors, Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, pages 79–85, San Francisco, California. Morgan Kaufmann Publishers.

Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. DBpedia - a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web, 7:154–165, July.

Aron Culotta, Michael Wick, and Andrew McCallum. 2007. First-order probabilistic models for coreference resolution. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 81–88, Rochester, New York, April. Association for Computational Linguistics.

Claudia Gerstner-Link and Manfred Krifka. 1993. Genericity. In Joachim Jacobs, Arnim von Stechow, Wolfgang Sternefeld, and Theo Vennemann, editors, Syntax: An International Handbook of Contemporary Research, pages 966–978. De Gruyter.

Manfred Krifka. 2004. Bare NPs: Kind referring, Indefinites, Both, or Neither? In Proceedings of Semantics and Linguistic Theory (SALT-XIII).

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th World Wide Web Conference (WWW-07), pages 697–706. ACM.

Kees van Deemter and Rodger Kibble. 2000. Squibs and discussions: On coreferring: Coreference in MUC and related annotation schemes. Computational Linguistics, 26(4):629–637.

Karina Wilkinson. 1991. Studies in the Semantics of Generic Noun Phrases. Ph.D. thesis, University of Massachusetts, Amherst.