
MULTILINGUAL COREFERENCE RESOLUTION

In an increasingly complex and rapidly changing world, there is a rapidly growing need for robust, efficient and flexible Natural Language Processing (NLP) methods that lead to good and stable system performance. With scientific and technological advances, wider access to information and software, and ever-growing communication, the demand for multilingual applications is considerable. Modern multilingual systems build a bridge between widely available knowledge and the monolingual end user. One well-known multilingual project is Wikipedia [1], a multilingual, web-based, free-content encyclopedia. Thanks to its hyperlinked nature, this easily accessible resource allows textual content to be entered and used across language boundaries. Yet there is no guarantee that the content a user is searching for will be available in a language that the user can actually speak or understand. Further multilingual assistants, such as Google Translate [2], can be used to make that content understandable. Yet multilingual approaches often carry an immense engineering and implementation effort with them. For this reason, it is necessary to shed more light on the problem of multilinguality for the subject of our interest: coreference resolution.

[1] http://www.wikipedia.org
[2] http://translate.google.com

Thus, in the current chapter we go beyond the notion of simple CR and review the advances of that field for more than one target language; in this way we delineate the complex task of Multilingual Coreference Resolution (MCR). We first review the initial approaches to MCR (see section 3.1) and then discuss the basic necessities as well as the pressing issues with respect to MCR-based approaches (see section 3.2). Section 3.3 offers concluding remarks.

3.1 Contemporary Multilingual Coreference Resolution

Multilingual Coreference Resolution has been attracting a great amount of interest in the CL community for almost two decades now. Aone and McKee [1993] were the first to present a data-driven architecture for language-independent anaphora resolution that was capable of functioning on any language while remaining robust, easily extendable and trainable. Mitkov [1999b] proposed a knowledge-poor approach to AR that was initially developed and tested for English and then extended to Polish and Arabic as well as Finnish, Russian and French. Yet, as the author notes, by that time there were already several approaches for various languages, such as French [Popescu-Belis and Robba, 1997, Rolbert, 1989], German [Dunker and Umbach, 1993, Fischer et al., 1995, Leass and Schwall, 1991, Stuckardt, 1997], Japanese [Mori et al., 1997, Nakaiwa and Ikehara, 1992, 1995], Portuguese [Abraços and Lopes, 1994], Swedish [Fraurud, 1988] and Turkish [Tin and Akman, 1994]. Later on, numerous other languages were added to that list: Bulgarian [Grigorova, 2011, Tanev and Mitkov, 2002], Catalan [Mayol, 2006, Potau, 2008], Dutch [Hendrickx et al., 2008, Hoste, 2005], Italian [Poesio et al., 2010, Sorace and Filiaci, 2006], Spanish [Palomar and Martínez-Barco, 2001, Potau, 2008], etc.

However, the cases given above were, and still are, only a very small portion of AR and CR research, because most of the effort of the CL community is concentrated on English. This is explained by the fact that linguistic information, annotations and analysis tools are easily available for English, but not for less-resourced languages such as Bulgarian and Portuguese (see section 1.2). A multilingual approach that depends on deeper semantic and syntactic analysis will inevitably prove inapplicable when that information is not accessible for every targeted language. Mitkov [1999b] also points out that the endeavour of concentrating on a multilingual approach is bound to be directed towards circumventing more complex syntactic, semantic and discourse analysis. After the two multilingual approaches [Aone and McKee, 1993, Mitkov, 1999b], there were only a few other methods concentrating on more than one language at a time [Harabagiu and Maiorano, 2000, Luo and Zitouni, 2005]. It was not until the introduction of two highly important events for multilingual approaches that further methods and systems featuring multiple languages simultaneously were presented:

1. SemEval-2 task 1: Coreference Resolution for Multiple Languages, further referred to as SemEval-2 [3] (see section 3.1.1)

2. CoNLL 2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes, further referred to as CoNLL 2012 [4] (see section 3.1.2)

Both events were organized as shared tasks and targeted the development of coreference resolution systems that can be applied to the languages addressed by the competitions. Both tasks laid the foundations of MCR and play a central role in our further discussion. Thus, we devote the following sections to their introduction and the main aspects of their proceedings. Further, in section 3.2 we will delineate the key problems of multilingual CR, because these are the issues that serve as the basis for the research in this work. Pradhan et al. [2012] as well as Recasens et al. [2010] provide more detailed information about the proceedings of both tasks.

3.1.1 SemEval-2 task 1: Coreference Resolution for Multiple Languages

The first multilingual endeavour was undertaken in 2010 by SemEval-2 task 1: Coreference Resolution for Multiple Languages [Recasens et al., 2010]. This was the first opportunity for MCR systems to be objectively reviewed and comparatively evaluated. It was a new and highly innovative pursuit, as it aimed at answering various questions (with respect to CR applied to multiple languages) that were still open to the research community. Because there were hardly any systems able to work on more than one language, SemEval-2 planned to estimate the effort needed to transform a monolingual system into a multilingual one. As Recasens et al. [2010] report, it was unclear how many language-specific modifications would be needed for competitive performance, as well as how important general linguistic annotations such as morphological, syntactic and semantic layers are to that performance. Since manually annotated data, also called gold data or gold standard, is exceptionally hard and expensive to obtain, it was necessary to investigate the difference between system performance on gold data vs. auto data. Auto data is noisier than and inferior to gold data, because it is produced by various computational tools. As we presented in section 2.3.4.2, evaluation of CR systems is still highly difficult.

Thus, another question Recasens et al. [2010] were interested in was the overall effect of the various evaluation metrics (MUC, CEAF, B³, BLANC) on the ranking, the comparison and the overall representation of the performance of the participating systems. It is those and many other questions with respect to multilinguality that we focus on in the context of our work. We will look into the full coreference resolution pipeline within the SemEval-2 and the CoNLL 2012 shared tasks and analyze the results from the approach we make use of (see chapter 4).

[3] http://stel.ub.edu/semeval2010-coref
[4]

3.1.1.1 Data

The SemEval-2 shared task targeted six different languages: Catalan, Dutch, English, German, Italian and Spanish. The six languages cover two language families: the Romance family (represented by Catalan, Italian and Spanish) and the Germanic family (represented by Dutch, English and German). As Recasens et al. [2010] present, the datasets were assembled based on the availability of distinct corpora and annotation tools for the six languages, which we summarize in the following paragraphs.

Catalan and Spanish. The Catalan and Spanish data was extracted from the AnCora corpora [Recasens and Martí, 2010], which mainly contain newswire texts annotated manually for arguments and thematic roles, predicate and semantic classes, named entities, WordNet [5] nominal senses as well as coreference. A Named Entity (NE) is an atomic element in text that is categorized according to a predefined list of categories. NEs can be of various types: proper names, locations, expressions of time or quantity, monetary values, percentages, etc. Additionally, automatic annotations for lemmas and Part of Speech (POS) information were acquired via the FreeLing [6] open-source suite of language analyzers [Padró and Stanilovsky, 2012]. The dependency structure and predicate semantic roles were obtained via the syntactic-semantic JointParser [7] [Lluís et al., 2009]. An example sentence for each of the two languages, Catalan and Spanish, is provided in table A.1 on page 248 and table A.6 on page 253, respectively.

Dutch. The dataset for the Dutch language was assembled from the KNACK-2002 corpus [Hoste and Pauw, 2006], which also contains newswire texts. The annotations in the texts include manually identified coreference relations and semi-automatically annotated POS, phrase chunks and named entities. The automatic part of the annotation of lemmas, POS and named entities was produced by the memory-based shallow parser for Dutch presented in [Daelemans et al., 1999]. The parser was developed by the Induction of Linguistic Knowledge Research Group and is available from their website [8]. The dependency information was labeled by the Alpino [9] parser introduced in [Van Noord et al., 2006]. An example sentence for Dutch is given in table A.2 on page 249.

[5] http://wordnet.princeton.edu
[6] http://nlp.lsi.upc.edu/freeling
[7] http://nlp.lsi.upc.edu/jointparser/demo
[8] http://ilk.uvt.nl

English. The English part of the SemEval-2 shared task dataset was taken from the OntoNotes Release 2.0 corpus [Pradhan et al., 2007]. This release consists of newswire and broadcast news annotated with Penn Treebank [10] syntactic annotations, Penn Propbank [11] predicate argument structures, named entities, word senses and coreference information. Automatic annotations for lemmas and POS information were generated using the SVMTagger [12] presented in [Giménez and Màrquez, 2004]. The syntactic-semantic JointParser [7] [Lluís et al., 2009] was again used for the dependency structure and predicate semantic roles. An example sentence for English can be found in table A.3 on page 250.

German. For German the data was extracted from the Tüba-D/Z corpus [Hinrichs et al., 2005], which is a treebank of newswire texts with syntactic and coreference annotations. Lemmas, POS, morphological and dependency information were also automatically annotated. Lemmas were labeled by the TreeTagger [13] [Schmid, 1995]. POS tags and morphological information were predicted by the RFTagger [14] introduced in [Schmid and Laws, 2008], while the dependency layer was constructed by the MaltParser [15] presented in [Hall and Nivre, 2008]. A German excerpt from the data is shown in table A.4 on page 251.

Italian. The collection for Italian was acquired from the LiveMemories corpus [Rodríguez et al., 2010], which is composed of Wikipedia articles, blogs, newswire and dialogues. The data is annotated for coreference, agreement and named entities on the basis of automatic parses. The TextPro [16] suite of modular NLP tools was used for the lemma and POS annotations, and the MaltParser [15] [Hall and Nivre, 2008] was employed for the acquisition of the dependency information. An example sentence from the Italian dataset can be found in table A.5 on page 252.

A complete summary of the size of the datasets used per language, as given in [Recasens et al., 2010], is shown in table 3.1. The figures are separated for the training, development and test parts of the datasets, and counts are listed for the number of documents, sentences and tokens within each part.

[9] http://www.let.rug.nl/vannoord/alp/Alpino
[10] http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42
[11] http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2004T14
[12] http://www.lsi.upc.edu/~nlp/SVMTool
[13] http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger
[14] http://www.ims.uni-stuttgart.de/projekte/corplex/RFTagger
[15] http://www.maltparser.org
[16] http://textpro.fbk.eu

           training                  development               test
           docs   sents    tokens    docs   sents   tokens     docs   sents   tokens
Catalan     829    8,709   253,513    142   1,445   42,072      167   1,698   49,260
Dutch       145    2,544    46,894     23     496    9,165       72   2,410   48,007
English     229    3,648    79,060     39     741   17,044       85   1,141   24,206
German      900   19,233   331,614    199   4,129   73,145      136   2,736   50,287
Italian      80    2,951    81,400     17     551   16,904       46   1,494   41,586
Spanish     875    9,022   284,179    140   1,419   44,460      168   1,705   51,040

Table 3.1: A full summary of the size of the datasets for all six languages within the SemEval-2 shared task. The numbers are separated for the training, development and test sets, and counts are provided for the number of documents (docs), sentences (sents) and tokens (tokens).

As can be seen, the datasets differed greatly in size: German had the largest set, while Italian had the fewest documents and Dutch the fewest tokens. The size of the provided data is important, because a machine learning approach, such as the one that we will use in our investigation (see chapter 4), needs a large number of examples to train on.
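The size comparison can be recomputed directly from table 3.1. The following Python sketch is only an illustration: the counts are copied from the table, while the names SIZES and totals are introduced here and are not part of the shared task material.

```python
# Counts copied from table 3.1, ordered as (training, development, test).
SIZES = {
    "Catalan": {"docs": (829, 142, 167), "sents": (8709, 1445, 1698),
                "tokens": (253513, 42072, 49260)},
    "Dutch":   {"docs": (145, 23, 72), "sents": (2544, 496, 2410),
                "tokens": (46894, 9165, 48007)},
    "English": {"docs": (229, 39, 85), "sents": (3648, 741, 1141),
                "tokens": (79060, 17044, 24206)},
    "German":  {"docs": (900, 199, 136), "sents": (19233, 4129, 2736),
                "tokens": (331614, 73145, 50287)},
    "Italian": {"docs": (80, 17, 46), "sents": (2951, 551, 1494),
                "tokens": (81400, 16904, 41586)},
    "Spanish": {"docs": (875, 140, 168), "sents": (9022, 1419, 1705),
                "tokens": (284179, 44460, 51040)},
}


def totals(unit):
    """Sum training, development and test counts of one unit per language."""
    return {language: sum(counts[unit]) for language, counts in SIZES.items()}


token_totals = totals("tokens")
doc_totals = totals("docs")

for language in SIZES:
    print(f"{language:8} docs={doc_totals[language]:5} "
          f"tokens={token_totals[language]:7}")

print("most tokens:     ", max(token_totals, key=token_totals.get))
print("fewest tokens:   ", min(token_totals, key=token_totals.get))
print("fewest documents:", min(doc_totals, key=doc_totals.get))
```

Summed over all three parts, German is the largest dataset by both documents and tokens, Italian has the fewest documents, and Dutch has the fewest tokens.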

3.1.1.2 Task Definition

Unlike previous evaluation exercises, such as ACE [Doddington et al., 2004] and ARE [Orăsan et al., 2008], the task description of the SemEval-2 shared task given in [Recasens et al., 2010] included the identification of mentions in its definition. The competing systems needed to extract all types of noun phrases (apart from NPs that cannot be referential, such as appositives, expletive NPs, attributive NPs, etc.) and possessive determiners, all of which were regarded as mentions. Singletons were also considered entities and included in the set of gold mentions. Both auto and gold annotation layers were provided for the majority of languages and annotations, with the following exceptions: no gold layers were given for Italian and Dutch, apart from named entities for Italian; German did not include gold NEs; and none of the datasets except the one for Dutch provided auto NEs.

The task aimed at the identification of intra-document coreference relations across the identified mentions and their proper clustering into coreference classes. Each class represents a distinct discourse entity.

3.1.1.3 Data Format

The data was prepared in a simplified and uniform column-based format. The dataset for each language consisted of one single file per data split: one file for the training, one for the development and one for the test data. Since intra-document coreference was the target of the task, the files were divided into documents. This structure is visualized in figure 3.1. Specific examples including one sentence for each of the task languages are provided in A.1. Each document consists of n sentences separated by empty lines.

    #begin document <document ID>
    <sentence>
    <sentence>
    ...
    <sentence>
    #end document <document ID>
    ...
    #begin document <document ID>
    <sentence>
    <sentence>
    ...
    <sentence>
    #end document <document ID>

Figure 3.1: The structure of the train/devel/test files provided for each of the six languages in the SemEval-2 shared task. The information listed within < > is a placeholder for the actual data.

The sentences were represented by their tokens, each listed on a distinct line. The latter is shown in figure 3.2. The various columns contained the diverse layers of linguistic annotations made available by the task. The actual information listed in the columns is given in table 3.2 on page 51. The two types of annotations, auto and gold, were appended in alternating order, which is also made visible by the descriptions provided in table 3.2 [17]. In case the information in a column is not available or is irrelevant to the given token, an underscore is used as a placeholder.

    <token#1 column#1> <token#1 column#2> <token#1 column#3> ...
    <token#2 column#1> <token#2 column#2> <token#2 column#3> ...
    <token#3 column#1> <token#3 column#2> <token#3 column#3> ...
    ...

Figure 3.2: The structure of the sentences building the documents provided for all six languages in the SemEval-2 shared task. The information listed within < > is a placeholder for the actual data.

[17]
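The file layout described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions, not the reader used by the task organizers: the function names read_documents and cell_value are hypothetical, and columns are assumed to be separated by whitespace (the text only says the format is column-based).

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Token = List[str]        # one string per column
Sentence = List[Token]   # one token per line


def cell_value(cell: str) -> Optional[str]:
    """Map the underscore placeholder to None, keep any other value."""
    return None if cell == "_" else cell


def read_documents(lines: Iterable[str]) -> Iterator[Tuple[str, List[Sentence]]]:
    """Yield (document ID, sentences) pairs from one train/devel/test file."""
    doc_id: Optional[str] = None
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#begin document"):
            doc_id = line[len("#begin document"):].strip()
            sentences, current = [], []
        elif line.startswith("#end document"):
            if current:                      # close the last open sentence
                sentences.append(current)
            yield doc_id, sentences
            doc_id, sentences, current = None, [], []
        elif not line.strip():               # an empty line ends a sentence
            if current:
                sentences.append(current)
                current = []
        elif doc_id is not None:
            # Assumption: columns are whitespace-separated.
            current.append(line.split())


# Made-up sample with three columns per token (not taken from the task data).
SAMPLE = """#begin document d1
1 The the
2 cat cat

1 It _
#end document d1
#begin document d2
1 Hello hello
#end document d2
""".splitlines()

docs = list(read_documents(SAMPLE))
for doc_id, sents in docs:
    print(doc_id, len(sents), "sentence(s)")
```

Keeping every cell as a string and converting the placeholder only on demand (via cell_value) leaves the interpretation of each column to later processing steps.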

The coreference annotation was represented in a bracketed notation, the so-called open-close notation, which uses “(<entityID>” to signify that the token is the beginning of a mention that refers to the entity identified by the <entityID>. The “<entityID>)”, respectively, denotes the end of that mention.

Mentions that are marked by the same <entityID> are coreferent, because they refer to the same entity. Yet, this is only true for mentions that are situated in the same document. Mentions across documents that share an identical <entityID> are not coreferent. The same is also true for mentions across languages that share the same <entityID>.
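The open-close notation can be decoded into mention spans and coreference classes as sketched below in Python. The sketch makes several assumptions that the description above does not spell out: a single-token mention is written as "(<entityID>)", several brackets on one token are separated by "|", and tokens outside any mention carry the underscore placeholder. The function names are hypothetical. Because entity IDs are only meaningful within one document, the functions are meant to be called once per document.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

Mention = Tuple[str, int, int]   # (entity ID, first token index, last token index)

# Optional "(" + entity ID + optional ")", e.g. "(7", "7)" or "(7)".
_PART = re.compile(r"(\()?([^()|]+)(\))?")


def extract_mentions(coref_column: List[str]) -> List[Mention]:
    """Decode the coreference column of ONE document into mention spans."""
    open_starts: Dict[str, List[int]] = defaultdict(list)
    mentions: List[Mention] = []
    for index, cell in enumerate(coref_column):
        if cell == "_":                      # token is not a mention boundary
            continue
        for part in cell.split("|"):         # assumption: "|" separates brackets
            match = _PART.fullmatch(part)
            if match is None or not (match.group(1) or match.group(3)):
                raise ValueError(f"malformed coreference cell: {cell!r}")
            opens, entity, closes = match.groups()
            if opens:
                open_starts[entity].append(index)
            if closes:
                if not open_starts[entity]:
                    raise ValueError(f"unmatched close for entity {entity!r}")
                mentions.append((entity, open_starts[entity].pop(), index))
    return mentions


def cluster(mentions: List[Mention]) -> Dict[str, List[Tuple[int, int]]]:
    """Group mention spans of one document by entity ID."""
    classes: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for entity, start, end in mentions:
        classes[entity].append((start, end))
    return dict(classes)


# Made-up column for an eight-token document.
column = ["(1", "1)", "_", "(2)", "_", "(1)", "_", "(2)"]
classes = cluster(extract_mentions(column))
print(classes)
```

A per-entity stack of start positions is used so that nested mentions of the same entity are matched with their own closing bracket.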

3.1.1.4 Evaluation

The SemEval-2 shared task included four different evaluation settings: gold-closed, auto-closed, gold-open and auto-open. These variations regulated the use of gold vs. auto annotations and of external tools and resources for preprocessing. The settings are to be read as follows:

gold-closed: gold linguistic annotations must be used by the systems and no external tools and resources are allowed for additional preprocessing.
auto-closed: auto linguistic annotations must be used by the systems and no external tools and resources are allowed for additional preprocessing.
gold-open: gold linguistic annotations must be used by the systems and external tools and resources are allowed for additional preprocessing.
auto-open: auto linguistic annotations must be used by the systems and external tools and resources are allowed for additional preprocessing.

The SemEval-2 shared task did not release system rankings according to the results submitted by all participating teams. Furthermore, as Pradhan et al. [2012] report, because of the low number of participants, the organizers of the task were not able to draw any strong conclusions. Appendix B.1 lists the full system scores as reported by the SemEval-2 shared task.
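The four evaluation settings described above are the combinations of two independent choices: the annotation source (gold or auto) and the resource policy (closed or open). A small Python sketch makes this explicit; the class name Setting and its fields are illustrative and do not come from the task definition.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Setting:
    annotations: str   # "gold" or "auto": which annotation layers must be used
    resources: str     # "closed" or "open": whether external tools are allowed

    @property
    def name(self) -> str:
        return f"{self.annotations}-{self.resources}"

    @property
    def external_tools_allowed(self) -> bool:
        return self.resources == "open"


# All combinations of annotation source and resource policy.
SETTINGS = [Setting(a, r) for a, r in product(("gold", "auto"), ("closed", "open"))]

for setting in SETTINGS:
    print(f"{setting.name:12} external tools allowed: {setting.external_tools_allowed}")
```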

3.1.2 CoNLL 2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes

The second multilingual task that aimed at resolving coreference relations for more than one language at a time was the CoNLL 2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes [Pradhan et al., 2012]. The task closely followed the framework established by the SemEval-2 shared task. For this reason, similar to the presentation in section 3.1.1, in the


#    type      description
1    ID        word identifiers in the sentence
2    TOKEN     word forms
3    LEMMA     word lemmas (gold standard manual annotation)
4    PLEMMA    word lemmas predicted by an automatic analyzer
5    POS       coarse part of speech
6    PPOS      same as 5 but predicted by an automatic analyzer
7    FEAT      morphological features (part of speech type, number, gender, case, tense, aspect, degree of comparison, etc., separated by the character "|")
8    PFEAT     same as 7 but predicted by an automatic analyzer
9    HEAD      for each word, the ID of the syntactic head ('0' if the word is the root of the tree)
10   PHEAD     same as 9 but predicted by an automatic analyzer
11   DEPREL    dependency relation labels corresponding to the dependencies described in 9
12   PDEPREL   same as 11 but predicted by an automatic analyzer
13   NE        named entities
14   PNE       same as 13 but predicted by a named entity recognizer
15   PRED      predicates are marked and annotated with a semantic class label
16   PPRED     same as 15 but predicted by an automatic analyzer
*    APREDs    N columns, one for each predicate in 15, containing the semantic
