
A Corpus-based Investigation of Definite Description Use

Massimo Poesio

University of Edinburgh

Renata Vieira

University of Edinburgh

We present the results of a study of the use of definite descriptions in written texts aimed at assessing the feasibility of annotating corpora with information about definite description interpretation. We ran two experiments, in which subjects were asked to classify the uses of definite descriptions in a corpus of 33 newspaper articles, containing a total of 1,412 definite descriptions.

We measured the agreement among annotators about the classes assigned to definite descriptions, as well as the agreement about the antecedent assigned to those definites that the annotators classified as being related to an antecedent in the text. The most interesting result of this study from a corpus annotation perspective was the rather low agreement (K = 0.63) that we obtained using versions of Hawkins's and Prince's classification schemes; better results (K = 0.76) were obtained using the simplified scheme proposed by Fraurud that includes only two classes, first-mention and subsequent-mention. The agreement about antecedents was also not complete. These findings raise questions concerning the strategy of evaluating systems for definite description interpretation by comparing their results with a standardized annotation. From a linguistic point of view, the most interesting observations were the great number of discourse-new definites in our corpus (in one of our experiments, about 50% of the definites in the collection were classified as discourse-new, 30% as anaphoric, and 18% as associative/bridging) and the presence of definites that did not seem to require a complete disambiguation.

1. Introduction

The work presented in this paper was inspired by the growing realization in the field of computational linguistics of the need for experimental evaluation of linguistic theories--semantic theories, in our case. The evaluation we are considering typically takes the form of experiments in which human subjects are asked to annotate texts from a corpus (or recordings of spoken conversations) according to a given classification scheme, and the agreement among their annotations is measured (see, for example, Passonneau and Litman 1993 or the papers in Moore and Walker 1997). These attempts at evaluation are, in part, motivated by the desire to put these theories on a more "scientific" footing by ensuring that the semantic judgments on which they are based reflect the intuitions of a large number of speakers; 1 but experimental evaluation is also seen as a necessary precondition for the kind of system evaluation done, for example, in the Message Understanding initiative (MUC), where the performance of a system is evaluated by comparing its output on a collection of texts with a standardized annotation of those texts produced by humans (Chinchor and Sundheim 1995). Clearly,


Computational Linguistics Volume 24, Number 2

a MUC-style evaluation presupposes an annotation scheme on which all participants agree.

Our own concern is with semantic judgments concerning the interpretation of noun phrases with the definite article the, which we will call definite descriptions, following Russell (1919). 2 These noun phrases are one of the most common constructs in English, 3 and have been extensively studied by linguists, philosophers, psychologists, and computational linguists (Russell 1905; Christophersen 1939; Strawson 1950; Clark 1977; Grosz 1977; Cohen 1978; Hawkins 1978; Sidner 1979; Webber 1979; Clark and Marshall 1981; Prince 1981; Heim 1982; Appelt 1985; Löbner 1985; Kadmon 1987; Carter 1987; Bosch and Geurts 1989; Neale 1990; Kronfeld 1990; Fraurud 1990; Barker 1991; Dale 1992; Cooper 1993; Kamp and Reyle 1993; Poesio 1993).

Theories of definite descriptions such as (Christophersen 1939; Hawkins 1978; Webber 1979; Prince 1981; Heim 1982) identify two subtasks involved in the interpretation of a definite description: deciding whether the definite description is related to an antecedent in the text 4 --which in turn may involve recognizing fairly fine-grained distinctions--and, if so, identifying this antecedent. Some of these theories have been cast in the form of classification schemes (Hawkins 1978; Prince 1992), and have been used for corpus analysis (Prince 1981, 1992; Fraurud 1990); 5 yet, we are aware of no attempt at verifying whether subjects not trained in linguistics are capable of recognizing the proposed distinctions, which is a precondition for using these schemes for the kind of large-scale text annotation exercises necessary to evaluate a system's performance, as done in MUC.

In the past two or three years, this kind of verification has been attempted for other aspects of semantic interpretation: by Passonneau and Litman (1993) for segmentation and by Kowtko, Isard, and Doherty (1992) and Carletta et al. (1997) for dialogue act annotation. Our intention was to do the same for definite descriptions. We ran two experiments to determine how good naive subjects are at doing the form of linguistic analysis presupposed by current schemes for classifying definite descriptions. (By "how good" here we mean "how much do they agree among themselves?" as commonly assumed in work of this kind.) Our subjects were asked to classify the definite descriptions found in a corpus of natural language texts according to classification schemes that we developed starting from the taxonomies proposed by Hawkins (1978) and Prince (1981, 1992), but which took into account our intention of having naive speakers perform the classification. Our experiments were also designed to assess the feasibility of a system to process definite descriptions on unrestricted text and to collect data that could be used for this implementation. For both of these reasons, the classification schemes that we tried differ in several respects from those adopted in prior corpus-based studies such as Prince (1981) and Fraurud (1990). Our study is also different from these previous ones in that measuring the agreement among annotators became an issue (Carletta 1996).
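The agreement figures quoted in this paper (K = 0.63, K = 0.76) are kappa-style statistics, which correct raw agreement P(A) for the agreement P(E) expected by chance: K = (P(A) - P(E)) / (1 - P(E)). A minimal sketch of the two-annotator (Cohen) form is given below for illustration only. The function name and the toy labels are hypothetical, and this excerpt does not specify which variant of the statistic was used in the experiments; other variants estimate P(E) from the pooled label distribution instead.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa).

    labels_a and labels_b are equal-length sequences holding the class
    each annotator assigned to the same items, in the same order.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")

    n = len(labels_a)
    # P(A): proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # P(E): agreement expected if each annotator labeled at random
    # according to his or her own class distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    classes = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in classes) / (n * n)

    if expected == 1.0:
        # Both annotators used a single, identical class throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (not taken from the corpus): four definite descriptions.
annotator_1 = ["first-mention", "first-mention", "subsequent-mention", "subsequent-mention"]
annotator_2 = ["first-mention", "subsequent-mention", "subsequent-mention", "subsequent-mention"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In the toy data the annotators agree on three of four items, but half of that agreement is expected by chance, which is why a kappa value is always lower than the raw percentage agreement unless agreement is perfect.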

For the experiments, we used a set of randomly selected articles from the Wall Street Journal contained in the ACL/DCI CD-ROM, rather than a corpus of transcripts of spoken language such as the HCRC MapTask corpus (Anderson et al. 1991) or the TRAINS corpus (Heeman and Allen 1995). The main reason for this choice was to avoid dealing with deictic uses of definite descriptions and with phenomena such as reference failure and repair. A second reason was that we intended to use computer simulations of the classification task to supplement the results of our experiments, and we needed a parsed corpus for this purpose; the articles we chose were all part of the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993).

2 We will not be concerned with other cases of definite noun phrases such as pronouns, or possessive descriptions; hence the term definite description rather than the more general term definite NP.

3 The word the is by far the most common word in the Brown corpus (Francis and Kucera 1982), the LOB corpus (Johansson and Hofland 1989), and the TRAINS corpus (Heeman and Allen 1995).

4 We concentrated on written texts in this study. See discussion below.

In the remainder of the paper, we review two existing classification schemes in Section 2 and then discuss our two classification experiments in Sections 3 and 4.

2. Towards a Classification Scheme: Linguistic Theories of Definite Descriptions

When looking for an annotation scheme for definite descriptions, one is faced with a wide range of options. At one end of the spectrum there are mostly descriptive lists of definite description uses, such as those in Christophersen (1939) and Hawkins (1978), whose only goal is to assign a classification to all uses of definite descriptions. At the other end, there are highly developed formal analyses, such as Russell (1905), Heim (1982), Löbner (1985), Kadmon (1987), Neale (1990), Barker (1991), and Kamp and Reyle (1993), in which the compositional contribution of definite descriptions to the meaning of an utterance, as well as their truth-conditional properties, are spelled out in detail. These more formal analyses are concerned with questions such as the quantificational or nonquantificational status of definite descriptions and the proper treatment of presuppositions, but tend to concentrate on a subset of the full range of definite description use. Among the more developed semantic analyses, some identify uniqueness as the defining property of definite descriptions (Russell 1905; Neale 1990), whereas others take familiarity as the basis for the analysis (Christophersen 1939; Hawkins 1978; Heim 1982; Prince 1981; Kamp and Reyle 1993). We will say more about some of these analyses below.

Our choice of a classification scheme was dictated in part by the intended use of the annotation, in part by methodological considerations. An annotation used to evaluate the performance of a system ought to identify the anaphoric connections between discourse entities; this makes familiarity-based analyses more attractive. From a methodological point of view, it was important to choose an annotation scheme that (i) would make the classification task doable by subjects not trained in linguistics, and (ii) had already been applied to the task of corpus analysis. We felt that we could ask naive subjects to assign each definite description to one of a few classes and to identify its antecedent when appropriate; we also wanted an annotation scheme that would characterize the whole range of definite description use, so that we would not need to worry about eliminating definite descriptions from our texts because they were unclassifiable.

For these reasons, we chose Hawkins's list of definite description uses (Hawkins 1978) and Prince's taxonomy (Prince 1981, 1992) as our starting point, and developed from there two slightly different annotation schemes, which allowed us to see whether it was better to describe the classes to our annotators in a surface-oriented or a semantic fashion, and to evaluate the seriousness of the problems with these schemes identified in the literature (see Fraurud 1990).

2.1 The Christophersen/Hawkins List of Definite Description Uses


Anaphoric Use. These are definite descriptions that cospecify with a discourse entity already introduced in the discourse. 6 The definite description may use the same descriptive predicate as its antecedent, or any other capable of indicating the same antecedent (e.g., a synonym, a hyponym, etc.).

(1) a. Fred was discussing an interesting book in his class. I went to discuss the book with him afterwards.

b. Bill was working at a lathe the other day. All of a sudden the machine stopped turning.

c. Fred was wearing trousers. The pants had a big patch on them.

d. Mary travelled to Paris. The journey lasted six hours.

e. A man and a woman entered a restaurant. The couple was received by a waiter.

Immediate Situation Uses. The next two uses of definite descriptions identified by Hawkins are occurrences used to refer to an object in the situation of utterance. The referent may be visible, or its presence may be inferred. The visible situation use occurs when the object referred to is visible to both speaker and hearer, as in the following examples:

(2) a. Please, pass me the salt.

b. Don't break the vase.

Hawkins classifies as immediate situation uses those definite descriptions whose referent is a constituent of the immediate situation in which the use of the definite description is located, without necessarily being visible:

(3) a. Beware of the dog.

b. Don't feed the pony.

c. You can put your coat on the clothes peg.

d. Mind the step.

6 There are some complex terminological problems when discussing anaphoric expressions. Following standard terminology, we will use the term referent to indicate the object in the world that is


Larger Situation Uses. Hawkins lists two uses of definite descriptions characteristic of situations in which the speaker appeals to the hearer's knowledge of entities that exist in the nonimmediate or larger situation of utterance--knowledge they share by being members of the same community, for instance.

A definite description may rely on specific knowledge about the larger situation: this is the case in which both the speaker and the hearer know about the existence of the referent, as in the example below, in which it is assumed that speaker and hearer are both inhabitants of Halifax, a town which has a gibbet at the top of Gibbet Street:

(4) The Gibbet no longer stands.

Specific knowledge is not, however, a necessary part of the meaning of larger situation uses of definite descriptions. While some hearers may have specific knowledge about the actual individuals referred to by a definite description, others may not. General knowledge about the existence of certain types of objects in certain types of situations is sufficient. Hawkins classifies those definite descriptions that depend on this knowledge as instances of general knowledge in the larger situation use. An example is the following utterance in the context of a wedding:

(5) Have you seen the bridesmaids?

Such a first-mention of the bridesmaids is possible on the basis of the knowledge that weddings typically have bridesmaids. In the same way, a first-mention of the bride, the church service, or the best man would be possible.

Associative Anaphoric Use. Speaker and hearer may have (shared) knowledge of the relations between certain objects (the triggers) and their components or attributes (the associates): associative anaphoric uses of definite descriptions exploit this knowledge. Whereas in larger situation uses the trigger is the situation itself, in the associative anaphoric use the trigger is an NP introduced in the discourse.

(6) a. The man drove past our house in a car. The exhaust fumes were terrible.

b. I am reading a book about Italian history. The author claims that Ludovico il Moro wasn't a bad ruler. The content is generally interesting.

c. I went to a wedding last weekend. The bride was a friend of mine. She baked the cake herself.

Unfamiliar Uses. Hawkins classifies as unfamiliar those definite descriptions that are not anaphoric, do not rely on information about the situation of utterance, and are not associates of some trigger in the previous discourse. Hawkins groups these definite descriptions in classes according to their syntactic and lexical properties, as follows:

NP complements. One form of unfamiliar definite descriptions is characterized by the presence of a complement to the head noun.

(7) a. Bill is amazed by the fact that there is so much life on Earth.


c. Fleet Street has been buzzing with the rumour that the Prime Minister is going to resign.

d. I remember the time when I was a little girl.

Nominal modifiers. The distinguishing feature of these phrases, according to Hawkins, is the presence of a nominal modifier that refers to the class to which the head noun belongs.

(8) a. I don't like the colour red.

b. The number seven is my lucky number.

Referent-Establishing Relative Clauses. Relative clauses may establish a referent for the hearer without a previous mention, when the relative clause refers to something mutually known.

(9) a. What's wrong with Bill? Oh, the woman he went out with last night was nasty to him. (But: ?? Oh, the woman was nasty to him.)

b. The box (that is) over there

Associative clauses. Some definite descriptions can be seen as cases of bridging references in which both the trigger and the associate are specified. The modifiers of the head noun specify the set of objects with which the referent of the definite description is associated.

(10) a. I remember the beginning of the war very well.

b. There was a funny story on the front page of the Guardian this morning.

c. ... the bottom of the sea.

d. ... the fight during the war.

Unexplanatory Modifiers Use. Finally, Hawkins lists a small number of modifiers that require the use of the:

(11) a. My wife and I share the same secrets.

b. The first person to sail to America was an Icelander.

c. The fastest person to sail to America ...

2.2 The Semantics of Definite Descriptions


One group of authors have identified uniqueness as the defining property of definite descriptions. This idea goes back to Russell (1905), and is motivated by larger situation definite descriptions such as the pope and by some cases of unexplanatory modifier use such as the first person to sail to America. The hypothesis was developed in recent years to address the problem of uniqueness within small situations (Kadmon 1987; Neale 1990; Cooper 1993). 7

Another line of research is based on the observation that many of the uses of definite descriptions listed by Hawkins have one property in common: the speaker (or writer) is making some assumptions about what the hearer (or reader) already knows. Speaking very loosely, we might say that the speaker assumes that the hearer is able to "identify" the referent of the definite description. This is also true of some of the uses Hawkins classified as unfamiliar, such as his nominal modifiers and associative clause classes. Attempts at making this intuition more precise include Christophersen's (1939) familiarity theory, Strawson's (1950) presuppositional theory of definite descriptions, Hawkins's (1978) location theory and its revision, Clark and Marshall's (1981) theory of definite reference and mutual knowledge, as well as more formal proposals such as Heim (1982).

Neither the uniqueness nor the familiarity approach has yet succeeded in providing a satisfactory account of all uses of definite descriptions (Fraurud 1990; Birner and Ward 1994). However, the theories based on familiarity address more directly the main concern of NLP system designers, which is to identify the connections between discourse entities. Furthermore, the prior corpus-based studies of definite description use that we are aware of (Prince 1981, 1992; Fraurud 1990) are based on theories of this type. For both of these reasons, we adopted semantic notions introduced in familiarity-style accounts in designing our experiments--in particular, distinctions introduced in Prince's taxonomy.

2.3 Prince's Classification of Noun Phrases

Prince studied in detail the connection between a speaker/writer's assumptions about the hearer/reader and the linguistic realization of noun phrases (Prince 1981, 1992). She criticizes as too simplistic the binary distinction between given and new discourse entities that is at the basis of most previous work on familiarity, and proposes a much more detailed taxonomy of "givenness"--or, as she calls it, assumed familiarity--meant to address this problem. Also, Prince's analysis of noun phrases is closer than the Christophersen/Hawkins taxonomy to a classification of definite descriptions on purely semantic terms: for example, she relates unfamiliar definites based on referent-establishing relative clauses with Hawkins's associative clause and associative anaphoric uses. 8

Hearer-New/Hearer-Old. One factor affecting the choice of a noun phrase, according to Prince, is whether a discourse entity is old or new with respect to the hearer's knowledge. A speaker will use a proper name or a definite description when he or she assumes that the addressee already knows the entity whom the speaker is referring to, as in (12) and (13).

(12) I'm waiting for it to be noon so I can call Sandy Thompson.

7 Löbner (1985) generalizes this idea, with good results; we will return to this work later.


(13) Nine hundred people attended the Institute.

On the other hand, if the speaker believes that the addressee does not know of Sandy Thompson, an indefinite will be used:

(14) I'm waiting for it to be noon so I can call someone in California.

Discourse-New/Discourse-Old. In addition, discourse entities can be new or old with respect to the discourse model: an NP may refer to an entity that has already been evoked in the current discourse, or it may evoke an entity that has not been previously mentioned. Discourse novelty is distinct from hearer novelty: both Sandy Thompson in (12) and the someone in California mentioned in (14) may well be discourse-new even if only the second one will be hearer-new. On the other hand, for an entity, being discourse-old entails being hearer-old. In other words, in Prince's theory, the notion of familiarity is split in two: familiarity with respect to the discourse, and familiarity with respect to the hearer. Either type of familiarity can license the use of definites: Hawkins's anaphoric uses of definite descriptions are cases of noun phrases referring to discourse-old discourse entities, whereas his larger situation and immediate situation uses are cases of noun phrases referring to discourse-new, hearer-old entities. 9

Inferrables. The uses of definite descriptions that Hawkins called associative anaphoric, such as a book ... the author, are not discourse-old or even hearer-old, but they are not entirely new, either; as Hawkins pointed out, the hearer is assumed to be capable of inferring their existence. Prince called these discourse entities inferrables. (This is the class of definite descriptions for which Clark [1977] used the term bridging references.)

Containing Inferrables. Finally, Prince proposes a category for noun phrases that are like inferrables, but whose connection with previous hearer's knowledge is specified as part of the noun phrase itself--her example is the door of the Bastille in the following example:

(15) The door of the Bastille was painted purple.

At least three of the unfamiliar uses of Hawkins--NP complements, referent-establishing relative clauses, and associative clauses--fall into this category. (See also Clark and Marshall [1981].)
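The correspondences just described can be summarized in a small sketch. It encodes Prince's two familiarity dimensions together with the constraint that discourse-old entails hearer-old, and records the Prince category that the prose above associates with each of Hawkins's uses. The class, function, and label strings are hypothetical names chosen for illustration; they are not part of either taxonomy, and uses for which the text names no Prince category are deliberately left unmapped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Familiarity:
    """Status of a discourse entity along Prince's two dimensions."""
    discourse_old: bool
    hearer_old: bool

    def __post_init__(self) -> None:
        # Being discourse-old entails being hearer-old.
        if self.discourse_old and not self.hearer_old:
            raise ValueError("a discourse-old entity must also be hearer-old")

    def licenses_definite(self) -> bool:
        # Either type of familiarity can license the use of a definite.
        return self.discourse_old or self.hearer_old


# Hawkins use -> Prince category, as stated in the prose of this section.
# The label strings are illustrative only.
HAWKINS_TO_PRINCE = {
    "anaphoric": "discourse-old",
    "immediate situation": "discourse-new, hearer-old",
    "larger situation": "discourse-new, hearer-old",
    "associative anaphoric": "inferrable",
    "NP complement": "containing inferrable",
    "referent-establishing relative clause": "containing inferrable",
    "associative clause": "containing inferrable",
}


def prince_category(hawkins_use: str) -> Optional[str]:
    """Return the Prince category named in the text, or None if none is given."""
    return HAWKINS_TO_PRINCE.get(hawkins_use)


# Sandy Thompson in (12): not yet mentioned, but assumed known to the hearer.
sandy_thompson = Familiarity(discourse_old=False, hearer_old=True)
# "someone in California" in (14): new on both dimensions.
someone_in_california = Familiarity(discourse_old=False, hearer_old=False)
```

Under this encoding only the entity in (12) licenses a definite or a proper name, which matches the contrast between (12) and (14).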

2.4 Some Remarks about Coverage

Perhaps the most important question concerning a classification scheme is its coverage. The two taxonomies we have just seen are largely satisfactory in this respect, but a couple of issues are worth mentioning.

First of all, Prince's taxonomy does not give us a complete account of the licensing conditions for definite descriptions. Of the uses mentioned by Hawkins, the unfamiliar definites with unexplanatory modifiers and NP complements need not satisfy any of the conditions that license the use of definites according to Prince: these definites are not necessarily discourse-old, hearer-old, inferrables, or containing inferrables. These uses fall outside of Clark and Marshall's classification, as well.

Secondly, none of the classification schemes just discussed, nor any of the alternatives proposed in the literature, consider so-called generic uses of definite descriptions, such as the use of the tiger in the generic sentence The tiger is a fierce animal that lives in the jungle. The problem with these uses is that the very question of whether the "referent" is familiar or not seems misplaced--these uses are not "referential." A problem related to the one just mentioned is that certain uses of definite descriptions are ambiguous between a referential and an attributive interpretation (Donnellan 1972). The sentence The first person to sail to America was an Icelander, for example, can have two interpretations: the writer may either refer to a specific person, whose identity may be mutually known to both writer and reader; or he or she may be simply expressing a property that is true of the first person to sail to America, whoever that person happened to be. This ambiguity does not seem to be possible with all uses of definite descriptions: e.g., pass me the salt seems only to have a referential use. Again, the schemes we have presented do not consider this issue. The question of how to annotate generic uses of definite descriptions or uses that are ambiguous between a referential and an attributive use will not be addressed in this paper.

2.5 Fraurud's Study

A second problem with the classification schemes we have discussed was raised by Fraurud in her study of definite NPs in a corpus of Swedish text (Fraurud 1990). Fraurud introduced a drastically simplified classification scheme based on two classes only: subsequent-mention, corresponding to Hawkins's anaphoric definite descriptions and Prince's discourse-old, and first-mention, including all other definite descriptions.

Fraurud simplified matters in this way because she was primarily interested in verifying the empirical basis for the claim that familiarity is the defining property of definite descriptions; she also observed, however, that some of the distinctions introduced by Hawkins and Prince led to ambiguities of classification. For example, she observed that the reader of a Swedish newspaper can equally well interpret the definite description the king in an article about Sweden by reference to the larger situation or to the content of the article.
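Fraurud's two-class scheme can be viewed as a projection of the finer-grained labels. The sketch below, which uses hypothetical label strings and a hypothetical function name, collapses a Hawkins-style label into first-mention or subsequent-mention and shows how a disagreement of the kind described for the king disappears once the labels are collapsed.

```python
# Illustrative label strings for Hawkins-style uses; not an official inventory.
HAWKINS_USES = {
    "anaphoric",
    "immediate situation",
    "larger situation",
    "associative anaphoric",
    "unfamiliar",
}


def to_fraurud(hawkins_use: str) -> str:
    """Collapse a Hawkins-style label into Fraurud's two classes."""
    if hawkins_use not in HAWKINS_USES:
        raise ValueError(f"unknown use: {hawkins_use!r}")
    # Only anaphoric definites count as subsequent-mention;
    # every other use is a first-mention.
    if hawkins_use == "anaphoric":
        return "subsequent-mention"
    return "first-mention"


# Two hypothetical readers of "the king" in an article about Sweden:
# one relates it to the larger situation, the other to the article's content.
reader_1 = "larger situation"
reader_2 = "associative anaphoric"

fine_grained_agreement = reader_1 == reader_2
collapsed_agreement = to_fraurud(reader_1) == to_fraurud(reader_2)
```

Here `fine_grained_agreement` is false while `collapsed_agreement` is true: the two readers differ only on a distinction that the simplified scheme does not draw.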

We took into account Fraurud's observations in designing our experiments, and we will compare our results to hers below.

3. A First Experiment in Classification

For our first experiment evaluating subjects' performance at the classification task, we developed a taxonomy of definite description uses based on the schemes discussed in the previous section, preliminarily tested the taxonomy by annotating the corpus ourselves, and then asked two annotators to do the same task. This first experiment is described in the rest of this section. We explain first the classification we developed for this experiment, then the experimental conditions, and, finally, discuss the results.

3.1 The First Classification Scheme


entities introduced by a noun phrase and other entities in the discourse; the scheme used in MUC-6 is of this type.

In our experiments, we tried both a pure labeling scheme and a mixed labeling and linking scheme. We also tried two slightly different taxonomies of definite descriptions, and we varied the way membership in a class was defined to the subjects. Both taxonomies were based on the schemes proposed by Hawkins and Prince, but we introduced some changes in order, first, to find a scheme that would be easily understood by individuals without previous linguistic training and would lead to maximum agreement among the classifiers; and second, to make the classification more useful for our goal of feeding the results into an implementation.
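The difference between a pure labeling scheme and a mixed labeling and linking scheme can be pictured as the difference between recording only a class for each definite description and additionally recording a pointer to its antecedent. The sketch below is one possible representation, not the format used in the experiments: the record fields, the identifiers, and the simple proportion computed by `antecedent_agreement` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Annotation:
    """One annotator's judgment about one definite description."""
    description_id: int                   # hypothetical identifier of the definite
    label: str                            # class assigned (labeling)
    antecedent_id: Optional[int] = None   # antecedent pointer (linking), if any


def antecedent_agreement(first: List[Annotation],
                         second: List[Annotation]) -> Optional[float]:
    """Proportion of shared, linked descriptions given the same antecedent.

    Only descriptions that both annotators linked to some antecedent are
    counted. Returns None when there are no such descriptions.
    """
    second_by_id = {a.description_id: a for a in second}
    both_linked = [
        (a, second_by_id[a.description_id])
        for a in first
        if a.antecedent_id is not None
        and a.description_id in second_by_id
        and second_by_id[a.description_id].antecedent_id is not None
    ]
    if not both_linked:
        return None
    same = sum(x.antecedent_id == y.antecedent_id for x, y in both_linked)
    return same / len(both_linked)


# Toy judgments (not corpus data) from two annotators on three descriptions.
annotator_1 = [
    Annotation(1, "anaphoric", antecedent_id=10),
    Annotation(2, "associative", antecedent_id=11),
    Annotation(3, "larger situation/unfamiliar"),
]
annotator_2 = [
    Annotation(1, "anaphoric", antecedent_id=10),
    Annotation(2, "associative", antecedent_id=12),
    Annotation(3, "associative", antecedent_id=11),
]
agreement = antecedent_agreement(annotator_1, annotator_2)
```

Description 3 is left out of the computation because only one annotator linked it, so agreement on antecedents is measured separately from agreement on classes.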

In the first experiment, we used a labeling scheme, and the classes were introduced to the subjects with reference to the surface characteristics of the definite descriptions. (See below and Appendix A.) The taxonomy we used in this experiment is a simplification of Hawkins's scheme, to which we made three main changes. First of all, we separated those anaphoric descriptions that have the same descriptive content as their antecedent (which we will call anaphoric (same head)) from other cases of anaphoric descriptions in which the association is based on more complex forms of lexical or commonsense knowledge (synonyms, hypernyms, information about events, etc.). We grouped these latter definite descriptions with Hawkins's associative descriptions in a class that we called associative. This was done in order to see how much need there is for complex lexical inferences in resolving anaphoric definite descriptions, as opposed to simple head matching.
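To make the contrast with "simple head matching" concrete, the sketch below separates the two classes for a definite description whose textual antecedent is already known. The head finder is a deliberately crude heuristic of our own (the last word before any postmodifier, ignoring articles), and the marker lists and function names are assumptions; a real system would take the head from a parse, such as the Penn Treebank annotation.

```python
# Words that, in this crude heuristic, signal the start of a postmodifier.
POSTMODIFIER_MARKERS = {"of", "in", "on", "at", "with", "for", "from",
                        "that", "which", "who"}
DETERMINERS = {"the", "a", "an"}


def head_noun(noun_phrase: str) -> str:
    """Guess the head noun: last non-determiner word before a postmodifier."""
    tokens = [t.strip(".,;:\"'").lower() for t in noun_phrase.split()]
    tokens = [t for t in tokens if t]

    premodified = []
    for token in tokens:
        if token in POSTMODIFIER_MARKERS and premodified:
            break  # everything from here on is treated as a postmodifier
        premodified.append(token)

    content = [t for t in premodified if t not in DETERMINERS]
    if not content:
        raise ValueError(f"no head found in {noun_phrase!r}")
    return content[-1]


def classify_pair(definite: str, antecedent: str) -> str:
    """Label a definite/antecedent pair that is known to be related."""
    if head_noun(definite) == head_noun(antecedent):
        return "anaphoric (same head)"
    return "associative"


# Pairs taken from examples (16) and (17b) in this section.
rig = classify_pair("the rig", "a rig")
house = classify_pair("the house", "a stately Victorian home")
```

The pair in (16) is resolved by head matching alone, whereas the house / home pair in (17b) requires lexical knowledge that the two nouns can describe the same object, which is exactly the kind of inference the associative class is meant to isolate.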

Secondly, we grouped together all the definite descriptions that introduce a novel discourse entity not associated to some previously established object in the text, i.e., that were discourse-new in Prince's sense. This class, that we will call larger situation/unfamiliar, includes both definite descriptions that exploit situational information (Hawkins's larger situation uses) and discourse-new definite descriptions introduced together with their links or referents (unfamiliar). This was done because of Fraurud's observation that distinguishing the two classes is generally difficult (Fraurud 1990). Third, we did not include a class for immediate situation uses, since we assumed they would be rare in written text. 10 We also introduced a separate class of idioms including indirect references, idiomatic expressions and metaphorical uses, and we allowed our subjects to mark definite descriptions as doubts.

To summarize, the classes used in this experiment were as follows:

I. Anaphoric same head. This class includes uses of definite descriptions that refer back to an antecedent introduced in discourse; it differs from Hawkins's anaphoric use or Prince's textually evoked classes because it only includes definite-antecedent pairs with the same head noun.

(16) Grace Energy just two weeks ago hauled a rig here 500 miles from Caspar, Wyo., to drill the Bilbrey well, a 15,000-foot, $1-million-plus natural gas well. The rig was built around 1980, but has drilled only two wells, the last in 1982.

10 This was indeed the case, but we did observe a few instances of an interesting kind of immediate situation use. In these cases, the text is describing the immediate situation in which the writer is, and the writer apparently expects the reader to reconstruct this situation:

(i) "And you didn't want me to buy earthquake insurance", says Mrs. Hammack, reaching across the table and gently tapping his hand.

II. Associative. We assigned to this class those definite descriptions that stand in an anaphoric or associative anaphoric relation with an antecedent explicitly mentioned in the text, but that are not identified by the same head noun as their antecedent. This class includes Hawkins's associative anaphoric definite descriptions and Prince's inferrables, as well as some definite descriptions that would be classified as anaphoric by Hawkins and as textually evoked in Prince (1981). Recognizing the antecedent of these definite descriptions involves at least knowledge of lexical associations, and possibly general commonsense knowledge. 11

(17) a. With all this, even the most wary oil men agree something has changed. "It doesn't appear to be getting worse. That in itself has got to cause people to feel a little more optimistic," says Glenn Cox, the president of Phillips Petroleum Co. Though modest, the change reaches beyond the oil patch, too.

b. Toni Johnson pulls a tape measure across the front of what was once a stately Victorian home. A deep trench now runs along its north wall, exposed when the house lurched two feet off its foundation during last week's earthquake.

c. Once inside, she spends nearly four hours measuring and diagramming each room in the 80-year-old house, gathering enough information to estimate what it would cost to rebuild it. While she works inside, a tenant returns with several friends to collect furniture and clothing. One of the friends sweeps broken dishes and shattered glass from a countertop and starts to pack what can be salvaged from the kitchen.

III. Larger situation/unfamiliar. This class includes Hawkins's larger situation uses of definite descriptions based on specific and general knowledge (discourse-new, hearer-old in Prince's terms) as well as his unfamiliar uses (many of which correspond to Prince's containing inferrables).

(18) a. Out here on the Querecho Plains of New Mexico, however, the mood is more upbeat--trucks rumble along the dusty roads and burly men in hard hats sweat and swear through the afternoon sun.

b. Norton Co. said net income for the third quarter fell 6% to $20.6 million, or 98 cents a share, from $22 million, or $1.03 a share.

c. For the Parks and millions of other young Koreans, the long-cherished dream of home ownership has become a cruel illusion. For the government, it has become a highly volatile political issue.

d. About the same time, the Iran-Iraq war, which was roiling oil markets, ended.


Computational Linguistics Volume 24, Number 2

Table 1
Classification by the authors of the definite descriptions in the first corpus.

Class                 Total Number   Percentage of the Total
I. Anaphoric s.h.              304                    29.23%
II. Associative                193                    18.55%
III. LS/Unfamiliar             503                    48.37%
IV. Idiom                       26                     2.50%
V. Doubt                        14                     1.35%
Total                        1,040                      100%

IV. Idiom. This class includes indirect references, idiomatic expressions, and metaphorical uses.

(19) A recession or new OPEC blowup could put oil markets right back in the soup.

3.2 Experimental Conditions

First of all, we classified the definite descriptions included in 20 randomly chosen articles from the Wall Street Journal contained in the subset of the Penn Treebank corpus included in the ACL/DCI CD-ROM. 12 Altogether, these articles contain 1,040 instances of definite description use. The results of our analysis are summarized in Table 1.

Next, we asked two subjects to perform the same task. Our two subjects in this first experiment were graduate students in linguistics. The two subjects were given the instructions in Appendix A. They had to assign each definite description to one of the classes described in Section 3.1: I. anaphoric (same head), II. associative, III. larger situation/unfamiliar, and IV. idiom. The subjects could also express V. doubt about the classification of the definite description. Since the classes I-III are not mutually exclusive, we instructed the subjects to resolve conflicts according to a preference ranking, i.e., to choose a class with higher preference when two classes seemed equally applicable. The ranking was (from most preferred to least preferred): 1) anaphoric (same head), 2) larger situation/unfamiliar, and 3) associative. The annotators were given one text to familiarize themselves with the task before starting with the annotation proper.
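The preference ranking amounts to a simple tie-breaking rule, which can be sketched in Python as follows. This is only an illustration: the names `PREFERENCE` and `resolve_conflict` are hypothetical and do not come from the instructions in Appendix A.

```python
# Preference ranking from Section 3.2, most preferred first.
PREFERENCE = [
    "anaphoric (same head)",
    "larger situation/unfamiliar",
    "associative",
]


def resolve_conflict(applicable):
    """Return the most preferred class among those that seem equally applicable."""
    candidates = [c for c in applicable if c in PREFERENCE]
    if not candidates:
        raise ValueError("no ranked class among the candidates")
    # The class with the smallest index in PREFERENCE wins.
    return min(candidates, key=PREFERENCE.index)


choice = resolve_conflict({"associative", "larger situation/unfamiliar"})
print(choice)
```

Under this reading, a description that could be either associative or larger situation/unfamiliar is assigned to larger situation/unfamiliar, since that class is ranked higher.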

3.3 Results

3.3.1 The Distribution of Definite Descriptions in Classes. The results of the first annotator (henceforth, Annotator A) are shown in Table 2, and those of the second annotator (henceforth, Annotator B) in Table 3.

As the tables indicate, the annotators assigned approximately the same percentage of definite descriptions to each of the five classes as we did; however, the classes do not always include the same elements. This can be gathered from the confusion matrix in Table 4, where an entry m_{x,y} indicates the number of definite descriptions assigned to class x by subject A and to class y by subject B.


Table 2
Classification of definite descriptions according to Annotator A.

Class                              Total Number   Percentage of the Total
I. Anaphoric s.h.                           294                    28.27%
II. Associative                             160                    15.38%
III. Unfamiliar/Larger Situation            546                       52%
IV. Idiom                                    39                     3.75%
V. Doubt                                      1                     0.09%
Total                                     1,040                      100%

Table 3
Classification of definite descriptions according to Annotator B.

Class                              Total Number   Percentage of the Total
I. Anaphoric s.h.                           332                    31.92%
II. Associative                             150                    14.42%
III. Unfamiliar/Larger Situation            549                    52.78%
IV. Idiom                                     2                     0.19%
V. Doubt                                      7                     0.67%
Total                                     1,040                      100%

Table 4
Confusion matrix of A and B's classifications.

B (rows) / A (columns)    I.   II.   III.   IV.    V.   Total B
I. Anaphoric             274    26     32     0     0       332
II. Associative            9    97     44     0     0       150
III. LS/Unfamiliar         8    37    465    38     1       549
IV. Idiom                  0     0      1     1     0         2
V. Doubt                   3     0      4     0     0         7
Total A                  294   160    546    39     1     1,040

In order to measure the agreement in a more precise way, we used the Kappa statistic (Siegel and Castellan 1988), recently proposed by Carletta as a measure of agreement for discourse analysis (Carletta 1996). We also used a measure of per-class agreement that we introduced ourselves. We discuss these results below, after reviewing briefly how K is computed.

3.3.2 The Kappa Statistic. Kappa is a test suitable for cases when the subjects have to assign items to one of a set of nonordered classes. The test computes a coefficient K of agreement among coders, which takes into account the possibility of chance agreement. It is dependent on the number of coders, number of items being classified, and number of choices of classes to be ascribed to items.

The kappa coefficient of agreement between c annotators is defined as:

$$K = \frac{P(A) - P(E)}{1 - P(E)}$$

Table 5
An example of the Kappa test.

Definite Description                                                    ASH  ASS  LSU     S
1. the third quarter                                                      0    0    3     1
2. the abrasives, engineering materials and petroleum services concern    0    2    1  0.33
3. The company                                                            0    3    0     1
4. the year-earlier quarter                                               0    2    1  0.33
5. the tax credit                                                         3    0    0     1
6. the engineering materials segment                                      1    1    1     0
7. the possible sale of all or part of Eastman Christensen                0    0    3     1
8. the nine months                                                        0    0    3     1
9. the year-earlier period                                                0    2    1  0.33
10. the company                                                           3    0    0     1
11. the company                                                           3    0    0     1
12. the company                                                           3    0    0     1
13. the company                                                           3    0    0     1
N = 13    ASH = 16    ASS = 10    LSU = 13    Z = 10

Here P(A) is the proportion of times that the annotators agree, and P(E) is the proportion of times that we would expect the annotators to agree by chance. When there is complete agreement among the raters, K = 1; if there is no agreement other than that expected by chance, K = 0. According to Carletta, in the field of content analysis--where the Kappa statistic originated--K > 0.8 is generally taken to indicate good reliability, whereas 0.68 < K < 0.8 allows tentative conclusions to be drawn.

We will illustrate the method for computing K proposed in Siegel and Castellan (1988) by means of an example from one of our texts, shown in Table 5.

The first column in Table 5 (Definite description) shows the definite description being classified. The columns ASH, ASS, and LSU stand for the classification options presented to the subjects (anaphoric (same head), associative, and larger situation/unfamiliar, respectively). The numbers in each n_ij entry of the matrix indicate the number of classifiers that assigned the description in row i to the class in column j. The final column (labeled S) represents the percentage agreement for each definite description; we explain below how this percentage agreement is calculated. The last row in the table shows the total number of descriptions (N), the total number of descriptions assigned to each class and, finally, the total percentage agreement for all descriptions (Z).

The equations for computing S_i, PE, PA, and K are shown in Table 6. In these formulas, c is the number of coders; S_i the percentage agreement for description i (we show S_1 and S_2 as examples); m the number of categories; T the total number of classification judgments; PE the percentage agreement expected by chance; PA the total agreement; and K the Kappa coefficient.

3.3.3 Value of K for the First Experiment. For the first experiment, K = 0.68 if we count idioms as a class, K = 0.73 if we take them out. The overall coefficient of agreement between the two annotators and our own analysis is K = 0.68 if we count idioms, K = 0.72 if we ignore them.
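As a rough check, the two-annotator figures can be recomputed from the confusion matrix in Table 4. The sketch below rests on two assumptions that the text does not spell out: chance agreement is estimated from the pooled class proportions of both annotators, and "taking idioms out" means discarding every description that either annotator marked as an idiom. The function names are hypothetical.

```python
# Confusion matrix from Table 4: rows are Annotator B's classes,
# columns are Annotator A's classes (I, II, III, IV, V).
TABLE_4 = [
    [274, 26, 32, 0, 0],   # B: I. Anaphoric
    [9, 97, 44, 0, 0],     # B: II. Associative
    [8, 37, 465, 38, 1],   # B: III. LS/Unfamiliar
    [0, 0, 1, 1, 0],       # B: IV. Idiom
    [3, 0, 4, 0, 0],       # B: V. Doubt
]


def kappa_two_coders(matrix):
    """K for two coders, with chance agreement from pooled class proportions."""
    n_items = sum(sum(row) for row in matrix)
    n_classes = len(matrix)
    # Observed agreement: diagonal cells got the same class from both coders.
    p_a = sum(matrix[i][i] for i in range(n_classes)) / n_items
    # Expected agreement: squared share of all judgments falling in each class.
    p_e = 0.0
    for j in range(n_classes):
        row_total = sum(matrix[j])
        col_total = sum(matrix[i][j] for i in range(n_classes))
        p_e += ((row_total + col_total) / (2 * n_items)) ** 2
    return (p_a - p_e) / (1 - p_e)


def drop_class(matrix, index):
    """Remove the row and column of one class (assumed reading of 'take them out')."""
    return [
        [v for j, v in enumerate(row) if j != index]
        for i, row in enumerate(matrix)
        if i != index
    ]


k_all = kappa_two_coders(TABLE_4)
k_no_idiom = kappa_two_coders(drop_class(TABLE_4, 3))
print(round(k_all, 2), round(k_no_idiom, 2))
```

Under these assumptions the sketch gives values that round to 0.68 and 0.73, in line with the figures reported for the two annotators.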


Table 6
Computing the K coefficient of agreement.

$$S_i = \frac{1}{c(c-1)} \sum_{j=1}^{m} n_{ij}(n_{ij} - 1)$$

$$S_1 = \frac{1}{3(2)} \, [0 + 0 + 3(2)] = \frac{1}{6} \times 6 = 1$$

$$S_2 = \frac{1}{6} \, [0 + 2(1) + 1(0)] = \frac{1}{6} \times 2 = 0.33$$

$$T = 39$$

$$\mathit{PE} = (\mathit{ASH}/T)^2 + (\mathit{ASS}/T)^2 + (\mathit{LSU}/T)^2 = (16/39)^2 + (10/39)^2 + (13/39)^2 = 0.17 + 0.07 + 0.11 = 0.35$$

$$\mathit{PA} = Z/N = 10/13 = 0.77$$

$$K = \frac{\mathit{PA} - \mathit{PE}}{1 - \mathit{PE}} = \frac{0.77 - 0.35}{1 - 0.35} = \frac{0.42}{0.65} = 0.65$$
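The worked example of Tables 5 and 6 can be reproduced with a short Python sketch. The counts are copied from Table 5; the function name `kappa_from_counts` and the variable names are illustrative choices, not part of the original study.

```python
# Rows of Table 5: number of coders choosing (ASH, ASS, LSU) for each description.
TABLE_5 = [
    (0, 0, 3), (0, 2, 1), (0, 3, 0), (0, 2, 1), (3, 0, 0),
    (1, 1, 1), (0, 0, 3), (0, 0, 3), (0, 2, 1), (3, 0, 0),
    (3, 0, 0), (3, 0, 0), (3, 0, 0),
]


def kappa_from_counts(counts):
    """Return (S, PA, PE, K) following the formulas of Table 6."""
    n_items = len(counts)        # N
    c = sum(counts[0])           # number of coders per description
    m = len(counts[0])           # number of categories
    total = n_items * c          # T, total number of judgments
    # S_i: proportion of agreeing coder pairs for description i.
    s = [sum(n * (n - 1) for n in row) / (c * (c - 1)) for row in counts]
    p_a = sum(s) / n_items       # PA = Z / N
    # PE: sum over classes of the squared share of judgments in that class.
    p_e = sum((sum(row[j] for row in counts) / total) ** 2 for j in range(m))
    return s, p_a, p_e, (p_a - p_e) / (1 - p_e)


s, p_a, p_e, k = kappa_from_counts(TABLE_5)
print(round(p_a, 2), round(p_e, 2), round(k, 2))
```

Without intermediate rounding, the values come out as PA = 10/13 and PE = 525/1521, which round to the 0.77, 0.35, and K = 0.65 shown in Table 6.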

Table 7
Per-class agreement in Experiment 1.

Class                 Total   Comparisons   Agree   Disagree   % Agreement
I. Anaphoric s.h.       930          1860    1646        214           88%
II. Associative         503          1006     596        410           59%
III. LS/Unfamiliar     1598          3196    2684        512           84%
IV. Idiom                67           134      42         92           31%
V. Doubt                 22            44       2         42            4%

and where they disagreed the most. The confusion matrix does this to some extent, but only works for two annotators--and therefore, for example, we couldn't use it to measure agreement on classes between the two annotators and ourselves.

We computed what we called per-class percentage of agreement for three coders (the two annotators and ourselves) by taking the proportion of pairwise agreements relative to the number of pairwise comparisons, as follows: whenever all three coders ascribe a description to the same class, we count six pairwise agreements out of six pairwise comparisons for that class (100%). If two coders ascribe a description to class 1 and the other coder to class 2, we count two agreements in four comparisons for class 1 (50%) and no agreement for class 2 (0%). The rates of agreement for each class thus obtained are presented in Table 7. The figures indicate better agreement on anaphoric same head and larger situation/unfamiliar definite descriptions, worse agreement on the other classes. (In fact, the percentages for idioms and doubts are very low; but these classes are also too small to allow us to draw any conclusions.)
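The counting scheme just described can be sketched as follows. The sketch assumes that each coder's judgment is compared with every other coder's judgment and that the comparison is credited to the class the first coder chose; this reading reproduces the 6/6, 2/4, and 0/2 counts in the text. The function name is hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def per_class_agreement(judgments):
    """Map each class to (agreements, comparisons) over all descriptions.

    judgments: list of tuples, one label per coder for each description.
    """
    comparisons = defaultdict(int)
    agreements = defaultdict(int)
    for labels in judgments:
        # Ordered pairs of coders: three coders give six comparisons.
        for own, other in permutations(labels, 2):
            comparisons[own] += 1
            if own == other:
                agreements[own] += 1
    return {cls: (agreements[cls], comparisons[cls]) for cls in comparisons}


print(per_class_agreement([("I", "I", "I")]))
print(per_class_agreement([("I", "I", "II")]))
```

With this scheme the number of comparisons for a class is twice the number of times the class was assigned, which is the relation between the Total and Comparisons columns of Table 7.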

3.4 Discussion of the Results


antecedent previously introduced in the text. Surprising as it may seem, this finding is in fact just a confirmation of the results of other researchers. Fraurud (1990) reports that 60.9% of definite descriptions in her corpus of 11 Swedish texts are first-mention, i.e., do not corefer with an entity already evoked in the text; 13 Gallaway (1996) found a distribution similar to ours in (English) spoken child language.

3.4.2 Disagreements Among Annotators. The second notable result was the relatively low agreement among annotators. The reason for this disagreement was not so much annotators' errors as the fact, already mentioned, that the classes are not mutually exclusive. The confusion matrix in Table 4 indicates that the major classes of disagreements were definite descriptions classified by annotator A as larger situation and by annotator B as associative, and vice versa. One such example is the government in (20). This definite description could be classified as larger situation because it refers to the government of Korea, and presumably the fact that Korea has a government is shared knowledge; but it could also be classified as being associative on the predicate Koreans. 14

(20) For the Parks and millions of other young Koreans, the long-cherished dream of home ownership has become a cruel illusion. For the government, it has become a highly volatile political issue.

We will analyze the reasons for the disagreement in more detail in relation to our second experiment, in which we also asked the annotators to indicate the antecedent of definite descriptions.

3.4.3 Surface Indicators of Discourse Novelty. Examining the annotations produced in this experiment, we were able to confirm the correlation observed by Hawkins between the syntactic structure of certain definite descriptions and their classification as discourse-new. Factors that strongly suggest that a definite description is discourse-new (and in fact, presumably hearer-new as well) include the presence of modifiers such as first or best, and of a complement for NPs of the form the fact that ... or the conclusion that .... 15 Postnominal modification of any type is also a strong indicator of discourse novelty, suggesting that most postnominal clauses serve to establish a referent in the sense discussed in the previous section. In addition, we observed a previously unreported (to our knowledge) correlation between discourse novelty and syntactic constructions such as appositions, copular constructions, and comparatives. The following examples from our corpus illustrate the correlations just mentioned:

(21) a. Mr. Ramirez, who arrived late at the Sharpshooter with his crew because he had started early in the morning setting up tanks at another site, just got the first raise he can remember in eight years, to $8.50 an hour from $8.

b. Mr. Dinkins also has failed to allay Jewish voters' fears about his association with the Rev. Jesse Jackson, despite the fact that few local non-Jewish politicians have been as vocal for Jewish causes in the past 20 years as Mr. Dinkins has.

c. They wonder whether he has the economic know-how to steer the city through a possible fiscal crisis, and they wonder who will be advising him.

d. The appetite for oil-service stocks has been especially strong, although some got hit yesterday when Shearson Lehman Hutton cut its short-term investment ratings on them.

e. After his decisive primary victory over Mayor Edward I. Koch in September, Mr. Dinkins coasted, until recently, on a quite comfortable lead over his Republican opponent, Rudolph Giuliani, the former crime buster who has proved a something [sic] of a bust as a candidate.

f. "The bottom line is that he is a very genuine and decent guy", says Malcolm Hoenlein, a Jewish community leader.

13 As mentioned above, Fraurud's first-mention class consists of Prince's discourse-new, inferrables, and containing inferrables.

14 As discussed above, this problem with Hawkins's and Prince's classification schemes had already been noted by Fraurud--e.g., Fraurud (1990, 416).

In addition, we observed a correlation between larger situation uses of definite descriptions (discourse-new, and often hearer-old) and certain syntactic expressions and lexical items. For example, we noticed that a large number of uses of definite descriptions in the corpus used for this first experiment referred to temporal entities such as the year or the month, or included proper names in place of the head noun or in premodifier position, as in the Querecho Plains of New Mexico and the Iran-Iraq war. Although these definite descriptions would have been classified by Hawkins as larger situation uses, in many cases they couldn't really be considered hearer-old or unused: what seems to be happening in these cases is that the writer assumed the reader would use information about the visual form of words, or perhaps lexical knowledge, to infer that an object of that name existed in the world.

We evaluated the strength of these correlations by means of a computer simulation (Vieira and Poesio 1997). The system attempts to classify the definite descriptions found in texts syntactically annotated according to the Penn Treebank format. The system classifies a definite description as unfamiliar using heuristics based on the syntactic and lexical correlations just observed, i.e., if either (i) it includes an unexplanatory modifier, (ii) it occurs in an apposition or a copular construction, or (iii) it is modified by a relative clause or prepositional phrase. A definite description is classified as larger situation if its head noun is a temporal expression such as year or month, or if its head or premodifiers are proper nouns. The implementation revealed that some of the correlations are very strong: for example, the agreement between the system's classification and the annotators' on definite descriptions with a nominal complement, such as the fact that ..., varied between 93% and 100% depending on the annotator; and on average, 70% of temporal expressions such as the year were interpreted as larger situation by the annotators.
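The heuristics listed above can be sketched as a small rule-based classifier. This is not the system of Vieira and Poesio (1997): the `DefiniteDescription` record, its boolean fields, the two-word temporal list, and the order in which the rules are tried are all assumptions made for illustration, and a real system would derive such features from Penn Treebank parses.

```python
from dataclasses import dataclass

# Temporal heads named in the text; a realistic list would be longer (assumption).
TEMPORAL_HEADS = {"year", "month"}


@dataclass
class DefiniteDescription:
    """Hypothetical pre-extracted features of one definite description."""
    head: str
    has_unexplanatory_modifier: bool = False  # e.g. "first", "best"
    in_apposition_or_copula: bool = False
    has_postmodifier: bool = False            # relative clause or prepositional phrase
    has_proper_noun: bool = False             # proper noun as head or premodifier


def classify(dd):
    """Return 'larger situation', 'unfamiliar', or None if no heuristic fires."""
    # Assumed order: larger situation cues are checked first, so that
    # "the Querecho Plains of New Mexico" is not caught by the PP rule.
    if dd.head.lower() in TEMPORAL_HEADS or dd.has_proper_noun:
        return "larger situation"
    if (dd.has_unexplanatory_modifier
            or dd.in_apposition_or_copula
            or dd.has_postmodifier):
        return "unfamiliar"
    return None


print(classify(DefiniteDescription(head="year")))
print(classify(DefiniteDescription(head="raise", has_unexplanatory_modifier=True)))
```

Descriptions for which no heuristic fires are left unclassified here; how the original system handled them is not stated in this section.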


4. The Second Experiment

In order to address some of the questions raised by Experiment 1 we set up a second experiment. In this second experiment we modified both the classification scheme and what we asked the annotators to do.

4.1 Revisions to the Annotators' Task

One concern we had in designing this second experiment was to understand better the reasons for the disagreement among annotators observed in the first experiment. In particular, we wanted to understand whether the classification disagreements reflected disagreements about the final semantic interpretation. Another difference between this new experiment and the first one is that we structured the task of deciding on a classification for a definite description around a series of questions forming a decision tree, rather than giving our subjects an explicit preference ranking. A third aspect of the first experiment we wanted to study more carefully was the distribution of definite descriptions, in particular, the characteristics of the large number of definite descriptions in the larger situation/unfamiliar class. Finally, we chose truly naive subjects to perform the classification task.

In order to get a better idea of the extent of agreement among annotators about the semantic interpretation of definite descriptions, we asked our subjects to indicate the antecedent in the text for the definite descriptions they classified as anaphoric or associative. This would also allow us to test how well subjects did with a linking type of classification like the one used in MUC-6. We also replaced the anaphoric (same head) class we had in the first experiment with a broader coreferent class including all cases in which a definite description is coreferential with its antecedent, whether or not the head noun was the same: e.g., we asked the subjects to classify as coreferent a definite like the house referring back to an antecedent introduced as a Victorian home, which would not have counted as anaphoric (same head) in our first experiment. This resulted in a taxonomy that was at the same time more semantically oriented and closer to Hawkins's and Prince's classification schemes: our broadened coreferent class coincides with Hawkins's anaphoric and Prince's textually evoked classes, whereas the resulting, narrower associative class (that we called bridging references) coincides with Hawkins's associative anaphoric and Prince's class of inferrables. Our intention was to see whether the distinctions proposed by Hawkins and Prince would result in a better agreement among annotators than the taxonomy used in our first experiment, i.e., whether the subjects would be more in agreement about the semantic relation between a definite description and its antecedent than they were about the relation between the head noun of the definite description and the head noun of its antecedent.

The larger situation/unfamiliar class we had in the first experiment was split back into two classes, as in Hawkins's and Prince's schemes. We did this to see whether indeed these two classes were difficult to distinguish; we also wanted to get a clearer idea of the relative importance of the two kinds of definites that we had grouped together in the first annotation. The two classes were called Larger situation and Unfamiliar.

4.2 Experimental Conditions


Table 8
Coders' classification of definite descriptions in Experiment 2.

                              C                D                E
Class                   Total      %     Total      %     Total      %
I. Coreferential          205    44%       211    45%       201    43%
II. Bridging               40   8.5%        29     6%        49    11%
III. Larger situation     119  25.5%       115    25%        93    20%
IV. Unfamiliar             92    20%        82    18%       121    26%
V. Doubt                    8     2%        27     6%         0     0%
Total                     464   100%       464   100%       464   100%

different from those used in Experiment 1, and containing 464 definite descriptions in total. 16

Unlike in our first experiment, we did not suggest any relation between the classes and the syntactic form of the definite descriptions in the instructions. The subjects were asked to indicate whether the entity referred to by a definite description (i) had been mentioned previously in the text, else if (ii) it was new but related to an entity already mentioned in the text, else (iii) it was new but presumably known to the average reader, or, finally, (iv) it was new in the text and presumably new to the average reader.
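The four questions form a chain of else-if tests, which can be sketched as below. The mapping from questions (i)-(iv) to the class names of Table 8 is inferred from Section 4.1 rather than quoted from Appendix B, and the function and parameter names are hypothetical.

```python
def classify_by_questions(mentioned_before, related_to_mentioned, known_to_reader):
    """Walk through questions (i)-(iv) in order and return the resulting class."""
    if mentioned_before:        # (i) entity already mentioned in the text
        return "coreferential"
    if related_to_mentioned:    # (ii) new, but related to a mentioned entity
        return "bridging"
    if known_to_reader:         # (iii) new, but presumably known to the reader
        return "larger situation"
    return "unfamiliar"         # (iv) new in the text and new to the reader


# Earlier questions take priority over later ones.
print(classify_by_questions(False, True, True))
```

Because the questions are asked in a fixed order, a description that is both related to a mentioned entity and generally known ends up as bridging, which replaces the explicit preference ranking used in Experiment 1.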

When the description was indicated as discourse-old (i) or related to some other entity (ii), the subjects were asked to locate the previous mention of the related entity in the text. Unlike in the first experiment, the subjects did not have the option of classifying a definite description as Idiom; we instructed them to make a choice and write down their doubts. The written instructions and the script given to the subjects can be found in Appendix B. As in Experiment 1, the subjects were given one text to practice on before starting with the analysis of the corpus. They took, on average, eight hours to complete the task.

4.3 Results

The distribution of definite descriptions in the four classes according to the three coders is shown in Table 8.

We had 283 cases of complete agreement among annotators on the classification (61%): 164 cases of complete agreement on coreferential definite descriptions, 7 cases of complete agreement on bridging, 65 cases of complete agreement on larger situation, and 47 cases of complete agreement on the unfamiliar class.

As in Experiment 1, we measured the K coefficient of agreement among annotators; the result for annotators C, D, and E is K = 0.58 if we consider the definite descriptions marked as doubts (in which case we have 464 descriptions and five classes), K = 0.63 if we leave them out (430 descriptions and the four classes I-IV).

We also measured the extent of agreement among subjects on the antecedents for coreferential and bridging definite descriptions. A total of 164 descriptions were classified as coreferential by all three coders; of these, 155 (95%) were taken by all coders to refer to the same entity (although not necessarily to the same mention of that entity).


There were only 7 definite descriptions classified by all three annotators as bridging references; in 5 of these cases (71%) the three annotators also agreed on a textual antecedent (i.e., on the discourse entity to which the bridging reference was related).

4.4 Discussion

4.4.1 Distribution into Classes. As shown in Table 8, the distribution of definite descriptions among discourse-new, on the one side, and coreferential with bridging references, on the other, is roughly the same in Experiment 2 as in Experiment 1, and roughly the same among annotators. The average percentage of discourse-new descriptions (larger situation and unfamiliar together) is 46%, against an average of 50% in the first experiment. Having split the discourse-new class in two in this experiment, we got an indication of the relative importance of the hearer-old and hearer-new subclasses--about half of the discourse-new uses fall in each of these classes--but this indication is only very approximate, since the first two annotators classified the majority of these definite descriptions as larger situation, whereas the last annotator classified the majority as unfamiliar.

As expected, the broader definition of the coreferent class resulted in a larger percentage of definite descriptions being included in this class (an average of 45%), and a smaller percentage being included in the bridging reference class. Considering the difference between the relative importance of the same-head anaphora class in the first experiment and of the coreferent class in the second experiment, we can estimate that approximately 15% of definite descriptions are coreferential and have a different head from their antecedents.

4.4.2 Agreement among Annotators. The agreement among annotators in Experiment 2 was not very high: 61% total agreement, which gives K = 0.58 or K = 0.63, depending on whether we consider doubts as a class. 17 This value is worse than the one we obtained in Experiment 1 (K = 0.68 or K = 0.73); in fact, this value of K goes below the level at which we can tentatively assume agreement among the annotators. There could be several reasons for the fact that agreement got worse in this second experiment. Perhaps the simplest explanation is that we were just using more classes. In order to check whether this was the case, we merged the classes larger situation and unfamiliar back into one class, as we had in Experiment 1: that is, we recomputed K after counting all definite descriptions classified as either larger situation or unfamiliar as members of the same class. And indeed, the agreement figures went up from K = 0.63 to K = 0.68 (ignoring doubts) when we did so, i.e., within the "tentative" margins of agreement according to Carletta (1996) (0.68 ≤ x < 0.8).
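The merge-and-recompute check can be sketched as follows. The four judgments in `toy` are invented purely to show the mechanics and are not data from the experiment; the K computation follows the formulas of Table 6, and all names are hypothetical.

```python
from collections import Counter


def kappa(judgments):
    """K for a list of per-description label tuples (one label per coder)."""
    n_items = len(judgments)
    c = len(judgments[0])
    # PE: squared share of all judgments that fall in each class.
    pooled = Counter(label for labels in judgments for label in labels)
    p_e = sum((n / (n_items * c)) ** 2 for n in pooled.values())
    # PA: mean proportion of agreeing coder pairs per description.
    s = [
        sum(n * (n - 1) for n in Counter(labels).values()) / (c * (c - 1))
        for labels in judgments
    ]
    p_a = sum(s) / n_items
    return (p_a - p_e) / (1 - p_e)


def merge_classes(judgments, mapping):
    """Relabel judgments, collapsing the classes named in mapping."""
    return [tuple(mapping.get(label, label) for label in labels) for labels in judgments]


# Collapse the two discourse-new classes into one, as in Experiment 1.
MERGE = {"larger situation": "discourse-new", "unfamiliar": "discourse-new"}

# Invented toy judgments by three coders (illustration only).
toy = [
    ("larger situation", "unfamiliar", "larger situation"),
    ("coreferential", "coreferential", "coreferential"),
    ("unfamiliar", "unfamiliar", "larger situation"),
    ("bridging", "bridging", "bridging"),
]

k_split = kappa(toy)
k_merged = kappa(merge_classes(toy, MERGE))
print(k_split < k_merged)
```

In the toy data all disagreements are between larger situation and unfamiliar, so merging the two removes them entirely; in the real annotation the gain was more modest (from 0.63 to 0.68).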

The remaining difference between the level of agreement obtained in this experiment and that obtained in the first one (K = 0.73, ignoring doubts) might have to do with the annotators, with the difficulty of the texts, or with using a syntactic (same head) as opposed to a semantic notion of what counts as coreferential; we are inclined to think that the last two explanations are more likely. For one thing, we found very few examples of true mistakes in the annotation, as discussed below. Secondly, we observed that the coefficient of agreement changes dramatically from text to text: in this second experiment, it varies from K = 0.42 to K = 0.92 depending on the text, and if we do not count the three worst texts in the second experiment, we get again K = 0.73. Third, going from a syntactic to a semantic definition of anaphoric definite description resulted in worse agreement both for coreferential and for bridging ref-
