A Machine Learning Approach to Coreference Resolution of Noun Phrases

Full text

A Machine Learning Approach to Coreference Resolution of N o u n Phrases. W e e M e n g S o o n * DSO National Laboratories. D a n i e l C h u n g Y o n g L i m * DSO National Laboratories. H w e e T o u N g t DSO National Laboratories. In this paper, we present a learning approach to coreference resolution o f noun phrases in unre- stricted text. The approach learns from a small, annotated corpus and the task includes resolving not just a certain type of noun phrase (e.g., pronouns) but rather general noun phrases. It also does not restrict the entity types of the noun phrases; that is, coreference is assigned whether they are o f "organization," "person," or other types. We evaluate our approach on common data sets (namely, the M U C - 6 and M U C - 7 coreference corpora) and obtain encouraging results, in- dicating that on the general noun phrase coreference task, the learning approach holds promise and achieves accuracy comparable to that of nonlearning approaches. Our system is the first learning-based system that offers performance comparable to that of state-of-the-art nonlearning systems on these data sets.. 1. I n t r o d u c t i o n . Coreference resolution is the process of determining whether two expressions in natural language refer to the same entity in the world. It is an important subtask in natural language processing systems. In particular, information extraction (IE) systems like those built in the DARPA Message Understanding Conferences (Chinchor 1998; Sundheim 1995) have revealed that coreference resolution is such a critical component of IE systems that a separate coreference subtask has been defined and evaluated since MUC-6 (MUC-6 1995).. In this paper, we focus on the task of determining coreference relations as defined in MUC-6 (MUC-6 1995) and MUC-7 (MUC-7 1997). Specifically, a coreference relation denotes an identity of reference and holds between two textual elements k n o w n as markables, which can be definite n o u n phrases, demonstrative n o u n phrases, proper names, appositives, s u b - n o u n phrases that act as modifiers, pronouns, and so on. Thus, our coreference task resolves general n o u n phrases and is not restricted to a certain type of n o u n phrase such as pronouns. Also, we do not place any restriction on the possible candidate markables; that is, all markables, whether they are "organization," "person," or other entity types, are considered. The ability to link coreferring n o u n phrases both within and across sentences is critical to discourse analysis and language u n d e r s t a n d i n g in general.. * 2 0 Science Park Drive, Singapore 118230. E-mail: [email protected] t 2 0 Science Park Drive, Singapore 118230. E-maih [email protected]. This author is also affiliated. with the Department of Computer Science, School of Computing, National University of Singapore. Web address: h t t p : / / w w w . c o m p . n u s . e d u . s g / ~ n g h t . :~ 20 Science Park Drive, Singapore 118230. E-maih [email protected]. (~) 200l Association for Computational Linguistics. Computational Linguistics Volume 27, Number 4. Free text :~ TokenizatiOnsentence & Ii ~ Morphological ~I Segmentation [ ~ Processing. POS tagger. Named Entity ~ Nested [ i. Noun Phrase :~ Recognition ~ Extraction [. Semantic Class. Determination. Figure 1 System architecture of natural language processing pipeline.. Noun Phrase. Identification. Markables. 2. A M a c h i n e Learning A p p r o a c h to Coreference R e s o l u t i o n . We a d o p t a corpus-based, machine learning a p p r o a c h to n o u n phrase coreference resolution. This a p p r o a c h requires a relatively small corpus of training d o c u m e n t s that h a v e b e e n annotated w i t h coreference chains of n o u n phrases. All possible markables in a training d o c u m e n t are d e t e r m i n e d b y a pipeline of language-processing modules, and training examples in the form of feature vectors are generated for appropriate pairs of markables. These training examples are then given to a learning algorithm to b u i l d a classifier. To determine the coreference chains in a n e w document, all markables are d e t e r m i n e d and potential pairs of coreferring markables are presented to the classifier, w h i c h decides w h e t h e r the t w o markables actually corefer. We give the details of these steps in the following subsections.. 2.1 D e t e r m i n a t i o n of Markables A prerequisite for coreference resolution is to obtain most, if not all, of the possible markables in a r a w i n p u t text. To determine the markables, a pipeline of natural language processing (NLP) m o d u l e s is used, as s h o w n in Figure 1. They consist of tokenization, sentence segmentation, morphological processing, part-of-speech tagging, n o u n phrase identification, n a m e d entity recognition, nested n o u n phrase extraction, and semantic class determination. As far as coreference resolution is concerned, the goal of these N L P m o d u l e s is to determine the b o u n d a r y of the markables, and to p r o v i d e the necessary information a b o u t each markable for s u b s e q u e n t generation of features in the training examples.. O u r part-of-speech tagger is a standard statistical tagger b a s e d on the H i d d e n Markov M o d e l (HMM) (Church 1988). Similarly, w e built a statistical H M M - b a s e d n o u n phrase identification m o d u l e that determines the n o u n phrase b o u n d a r i e s solely b a s e d on the part-of-speech tags assigned to the w o r d s in a sentence. We also im- p l e m e n t e d a m o d u l e that recognizes MUC-style n a m e d entities, that is, organization, person, location, date, time, money, and percent. O u r n a m e d entity recognition m o d u l e uses the H M M approach of Bikel, Schwartz, and Weischedel (1999), w h i c h learns from a tagged corpus of n a m e d entities. That is, our part-of-speech tagger, n o u n phrase identification module, and n a m e d entity recognition m o d u l e are all b a s e d on H M M s and learn from corpora tagged w i t h parts of speech, n o u n phrases, and n a m e d entities, respectively. Next, b o t h the n o u n phrases d e t e r m i n e d b y the n o u n phrase identification m o d u l e and the n a m e d entities are m e r g e d in such a w a y that if the n o u n phrase overlaps w i t h a n a m e d entity, the n o u n phrase b o u n d a r i e s will b e adjusted to s u b s u m e the n a m e d entity.. 522. Soon, Ng, and Lira Coreference Resolution. T h e n e s t e d n o u n p h r a s e e x t r a c t i o n m o d u l e s u b s e q u e n t l y accep t s t h e n o u n p h r a s e s a n d d e t e r m i n e s t h e n e s t e d p h r a s e s for e a c h n o u n p h r a s e . T h e n e s t e d n o u n p h r a s e s are d i v i d e d into t w o g r o u p s : . .. .. N e s t e d n o u n p h r a s e s f r o m p o s s e s s i v e n o u n p h r a s e s . C o n s i d e r t w o p o s s e s s i v e n o u n p h r a s e s m a r k e d b y t h e n o u n p h r a s e m o d u l e , his long-range strategy a n d Eastern's parent. T h e n e s t e d n o u n p h r a s e fo r t h e first p h r a s e is the p r o n o u n his, w h i l e f o r t h e s e c o n d one, it is t h e p r o p e r n a m e Eastern.. N e s t e d n o u n p h r a s e s t h a t are m o d i f i e r n o u n s (or p r e n o m i n a l s ) . F o r e x a m p l e , the n e s t e d n o u n p h r a s e for wage reductions is wage, a n d f o r Union representatives, it is Union.. Finally, the m a r k a b l e s n e e d e d f o r c o r e f e r e n c e r e s o l u t i o n are t h e u n i o n o f t h e n o u n p h r a s e s , n a m e d entities, a n d n e s t e d n o u n p h r a s e s f o u n d . F o r m a r k a b l e s w i t h o u t a n y n a m e d e n t i t y t y p e , s e m a n t i c class is d e t e r m i n e d b y t h e s e m a n t i c class d e t e r m i n a t i o n m o d u l e . M o r e details r e g a r d i n g this m o d u l e are g i v e n in t h e d e s c r i p t i o n o f t h e s e m a n - tic class a g r e e m e n t feature.. To a c h i e v e a c c e p t a b l e recall for c o r e f e r e n c e r e s o l u t i o n , it is m o s t critical t h a t t h e eligible c a n d i d a t e s f o r c o r e f e r e n c e b e i d e n t i f i e d c o r r e c t l y in t h e first place. In o r d e r to test o u r s y s t e m ' s e f f e c t i v e n e s s in d e t e r m i n i n g the m a r k a b l e s , w e a t t e m p t e d to m a t c h the m a r k a b l e s g e n e r a t e d b y o u r s y s t e m a g a i n s t t h o s e a p p e a r i n g in t h e c o r e f e r e n c e chains a n n o t a t e d in 100 S G M L d o c u m e n t s , a s u b s e t o f t h e t r a i n i n g d o c u m e n t s av ai l ab l e in MUC-6. We f o u n d t h a t o u r s y s t e m is able to c o r r e c t l y i d e n t i f y a b o u t 85% o f t h e n o u n p h r a s e s a p p e a r i n g in c o r e f e r e n c e c h a i n s in t h e 100 a n n o t a t e d S G M L d o c u m e n t s . M o s t of the u n m a t c h e d n o u n p h r a s e s are of the f o l l o w i n g t y p es: . 1.. .. .. O u r s y s t e m g e n e r a t e d a h e a d n o u n t h a t is a s u b s e t o f t h e n o u n p h r a s e in t h e a n n o t a t e d c o r p u s . For e x a m p l e , Saudi Arabia, the cartel's biggest producer w a s a n n o t a t e d as a m a r k a b l e , b u t o u r s y s t e m g e n e r a t e d o n l y Saudi Arabia.. O u r s y s t e m e x t r a c t e d a s e q u e n c e of w o r d s t h a t c a n n o t b e c o n s i d e r e d as a m a r k a b l e . . O u r s y s t e m e x t r a c t e d m a r k a b l e s t h a t a p p e a r to b e c o r r e c t b u t d o n o t m a t c h w h a t w a s a n n o t a t e d . For e x a m p l e , o u r s y s t e m i d e n t i f i e d selective wage reductions, b u t wage reductions w a s a n n o t a t e d i n s t e a d . . 2.2 D e t e r m i n a t i o n o f F e a t u r e V e c t o r s To b u i l d a l e a r n i n g - b a s e d c o r e f e r e n c e e n g i n e , w e n e e d to d e v i s e a set of f e a t u r e s t h a t is u s e f u l in d e t e r m i n i n g w h e t h e r t w o m a r k a b l e s c o r e f e r o r not. In a d d i t i o n , t h e s e f e a t u r e s m u s t b e g e n e r i c e n o u g h to b e u s e d across d i f f e r e n t d o m a i n s . Since t h e M U C - 6 a n d M U C - 7 tasks d e f i n e c o r e f e r e n c e g u i d e l i n e s for all t y p e s o f n o u n p h r a s e s a n d d i f f e r e n t t y p e s of n o u n p h r a s e s b e h a v e d i f f e r e n t l y in t e r m s o f h o w t h e y corefer, o u r f e a t u r e s m u s t b e able to h a n d l e this a n d g i v e d i f f e r e n t c o r e f e r e n c e d e c i s i o n s b a s e d o n d i f f e r e n t t y p e s of n o u n p h r a s e s . In g e n e r a l , t h e r e m u s t b e s o m e f e a t u r e s t h a t i n d i c a t e t h e t y p e o f a n o u n p h r a s e . A l t o g e t h e r , w e h a v e five f e a t u r e s t h a t i n d i c a t e w h e t h e r t h e m a r k a b l e s are d e f i n i t e n o u n p h r a s e s , d e m o n s t r a t i v e n o u n p h r a s e s , p r o n o u n s , o r p r o p e r n a m e s . . T h e r e are m a n y i m p o r t a n t k n o w l e d g e s o u r c e s u s e f u l f o r c o r e f e r e n c e . We w a n t e d to u s e t h o s e t h a t are n o t too difficult to c o m p u t e . O n e i m p o r t a n t f a c t o r is t h e d i s t a n c e . 523. Computational Linguistics Volume 27, Number 4. b e t w e e n the t w o m a r k a b l e s . M c E n e r y , Tanaka, a n d B o t l e y (1997) h a v e d o n e a s t u d y o n h o w d i s t a n c e affects c o r e f e r e n c e , p a r t i c u l a r l y fo r p r o n o u n s . O n e o f t h e i r c o n c l u s i o n s is t h a t t h e a n t e c e d e n t s of p r o n o u n s d o e x h i b i t clear q u a n t i t a t i v e p a t t e r n s o f d i s t r i b u t i o n . T h e d i s t a n c e f e a t u r e h a s d i f f e r e n t effects o n d i f f e r e n t n o u n p h r a s e s . F o r p r o p e r n a m e s , locality of the a n t e c e d e n t s m a y n o t b e so i m p o r t a n t . We i n c l u d e t h e d i s t a n c e f e a t u r e so t h a t the l e a r n i n g a l g o r i t h m c a n b e s t d e c i d e t h e d i s t r i b u t i o n fo r d i f f e r e n t classes o f n o u n p h r a s e s . . T h e r e are o t h e r f e a t u r e s t h a t are r e l a t e d to t h e g e n d e r , n u m b e r , a n d s e m a n t i c class of the t w o m a r k a b l e s . S u c h k n o w l e d g e s o u r c e s are c o m m o n l y u s e d fo r t h e t a s k o f d e t e r m i n i n g c o r e f e r e n c e . . O u r f e a t u r e v e c t o r consists of a total of 12 f e a t u r e s d e s c r i b e d b el o w , a n d is d e r i v e d b a s e d o n t w o e x t r a c t e d m a r k a b l e s , i a n d j, w h e r e i is t h e p o t e n t i a l a n t e c e d e n t a n d j is t h e a n a p h o r . I n f o r m a t i o n n e e d e d to d e r i v e t h e f e a t u r e v e c t o r s is p r o v i d e d b y t h e p i p e l i n e o f l a n g u a g e - p r o c e s s i n g m o d u l e s p r i o r t o t h e c o r e f e r e n c e e n g i n e . . 1. Distance Feature (DIST): Its p o s s i b l e v a l u e s are 0,1, 2, 3 . . . . . Th i s f e a t u r e c a p t u r e s the d i s t a n c e b e t w e e n i a n d j. If i a n d j are i n t h e s a m e s e n t e n c e , t h e v a l u e is 0; if t h e y are o n e s e n t e n c e a p a r t , t h e v a l u e is 1; a n d so on.. 2. i-Pronoun Feature ( I _ P R O N O U N ) : Its p o s s i b l e v a l u e s are t r u e o r false. If i is a p r o n o u n , r e t u r n true; else r e t u r n false. P r o n o u n s i n c l u d e r e f l e x i v e p r o n o u n s (himself, herself), p e r s o n a l p r o n o u n s (he, him, you), a n d p o s s e s s i v e p r o n o u n s (hers, her).. 3. j-Pronoun Feature ( J _ P R O N O U N ) : Its p o s s i b l e v a l u e s are t r u e o r false. If j is a p r o n o u n (as d e s c r i b e d a b o v e ) , t h e n r e t u r n t r u e ; else r e t u r n false.. 4. String Match Feature ( S T R _ M A T C H ) : Its p o s s i b l e v a l u e s are t r u e o r false. If t h e s t r i n g of i m a t c h e s the s t r i n g of j, r e t u r n t ru e; else r e t u r n false. We first r e m o v e articles (a, an, the) a n d d e m o n s t r a t i v e p r o n o u n s (this, these, that, those) f r o m t h e s t r i n g s b e f o r e p e r f o r m i n g t h e s t r i n g c o m p a r i s o n . T h e r e f o r e , the license m a t c h e s this license, that computer m a t c h e s computer.. 5. Definite N o u n Phrase Feature (DEF_NP): Its p o s s i b l e v a l u e s are t r u e o r false. In o u r d e f i n i t i o n , a d e f i n i t e n o u n p h r a s e is a n o u n p h r a s e t h a t starts w i t h t h e w o r d the. F o r e x a m p l e , the car is a d e f i n i t e n o u n p h r a s e . If j is a d e f i n i t e n o u n p h r a s e , r e t u r n true; else r e t u r n false.. 6. Demonstrative N o u n Phrase Feature (DEM_NP): Its p o s s i b l e v a l u e s are t r u e or false. A d e m o n s t r a t i v e n o u n p h r a s e is o n e t h a t starts w i t h t h e w o r d this, that, these, or those. If j is a d e m o n s t r a t i v e n o u n p h r a s e , t h e n r e t u r n true; else r e t u r n false.. 7. N u m b e r Agreement Feature ( N U M B E R ) : Its p o s s i b l e v a l u e s are t r u e o r false. If i a n d j a g r e e in n u m b e r (i.e., t h e y are b o t h s i n g u l a r o r b o t h plural), t h e v a l u e is true; o t h e r w i s e false. P r o n o u n s s u c h as they a n d them are p l u r a l , w h i l e it, him, a n d so on, are singular. T h e m o r p h o l o g i c a l r o o t of a n o u n is u s e d to d e t e r m i n e w h e t h e r it is s i n g u l a r o r p l u r a l if t h e n o u n is n o t a p r o n o u n . . 8. Semantic Class Agreement Feature ( S E M C L A S S ) : Its p o s s i b l e v a l u e s are t r u e , false, or u n k n o w n . In o u r s y s t e m , w e d e f i n e d t h e f o l l o w i n g s e m a n t i c classes: " f e m a l e , " " m a l e , " " p e r s o n , " " o r g a n i z a t i o n , " " l o c a t i o n , " " d a t e , " " t i m e , " " m o n e y , . . . . p e r c e n t , " a n d " o b j e c t . " T h e s e s e m a n t i c classes. 524. Soon, Ng, and Lim Coreference Resolution. .. 10.. 11.. are a r r a n g e d in a s i m p l e ISA hierarchy. E a c h o f t h e " f e m a l e " a n d " m a l e " s e m a n t i c classes is a subclass of t h e s e m a n t i c class " p e r s o n , " w h i l e e a c h of t h e s e m a n t i c classes " o r g a n i z a t i o n , " " l o c a t i o n , " " d a t e , " " t i m e , " " m o n e y , " a n d " p e r c e n t " is a subclass of t h e s e m a n t i c class " o b j e c t . " E a c h of t h e s e d e f i n e d s e m a n t i c classes is t h e n m a p p e d to a W o r d N e t s y n s e t (Miller 1990). F o r e x a m p l e , " m a l e " is m a p p e d t o t h e s e c o n d s e n s e o f t h e n o u n male in W o r d N e t , " l o c a t i o n " is m a p p e d t o t h e first s e n s e o f t h e n o r m location, a n d so on.. T h e s e m a n t i c class d e t e r m i n a t i o n m o d u l e a s s u m e s t h a t t h e s e m a n t i c class f o r e v e r y m a r k a b l e e x t r a c t e d is the first s e n s e o f t h e h e a d n o u n o f the m a r k a b l e . Since W o r d N e t o r d e r s t h e s en ses o f a n o u n b y t h e i r f r e q u e n c y , this is e q u i v a l e n t to c h o o s i n g t h e m o s t f r e q u e n t s e n s e as t h e s e m a n t i c class f o r e a c h n o r m . If t h e s e l e c t e d s e m a n t i c class o f a m a r k a b l e is a subclass of o n e of o u r d e f i n e d s e m a n t i c classes C, t h e n t h e s e m a n t i c class of the m a r k a b l e is C; else its s e m a n t i c class is " u n k n o w n . " . T h e s e m a n t i c classes of m a r k a b l e s i a n d j are in a g r e e m e n t if o n e is the p a r e n t of t h e o t h e r (e.g., chairman w i t h s e m a n t i c class " p e r s o n " a n d Mr. Lim w i t h s e m a n t i c class " m a l e " ) , or t h e y are t h e s a m e (e.g., Mr. Lim a n d he, b o t h of s e m a n t i c class " m a l e " ) . T h e v a l u e r e t u r n e d fo r s u c h cases is true. If the s e m a n t i c classes of i a n d j are n o t t h e s a m e (e.g., IBM w i t h s e m a n t i c class " o r g a n i z a t i o n " a n d Mr. Lim w i t h s e m a n t i c class " m a l e " ) , r e t u r n false. If e i t h e r s e m a n t i c class is " u n k n o w n , " t h e n t h e h e a d n o u n strings of b o t h m a r k a b l e s are c o m p a r e d . If t h e y are t h e sam e, r e t u r n t ru e; else r e t u r n u n k n o w n . . G e n d e r A g r e e m e n t F e a t u r e ( G E N D E R ) : Its p o s s i b l e v a l u e s are t ru e, false, or u n k n o w n . T h e g e n d e r of a m a r k a b l e is d e t e r m i n e d in s e v e r a l w a y s . D e s i g n a t o r s a n d p r o n o u n s s u c h as Mr., Mrs., she, a n d he, c a n d e t e r m i n e the g e n d e r . For a m a r k a b l e t h a t is a p e r s o n ' s n a m e , s u c h as Peter H. Diller, t h e g e n d e r c a n n o t b e d e t e r m i n e d b y t h e a b o v e m e t h o d . In o u r s y s t e m , t h e g e n d e r o f s u c h a m a r k a b l e c a n b e d e t e r m i n e d if m a r k a b l e s are f o u n d later in the d o c u m e n t t h a t r e f e r to Peter H. Diller b y u s i n g t h e d e s i g n a t o r f o r m of the n a m e , s u c h as Mr. Diller. If t h e d e s i g n a t o r f o r m of the n a m e is n o t p r e s e n t , t h e s y s t e m will l o o k t h r o u g h its d a t a b a s e of c o m m o n h u m a n first n a m e s to d e t e r m i n e t h e g e n d e r o f t h a t m a r k a b l e . T h e g e n d e r of a m a r k a b l e will b e u n k n o w n fo r n o u n p h r a s e s s u c h as the president a n d chief executive officer. The g e n d e r o f o t h e r m a r k a b l e s t h a t are n o t " p e r s o n " is d e t e r m i n e d b y t h e i r s e m a n t i c classes. U n k n o w n s e m a n t i c classes will h a v e u n k n o w n g e n d e r w h i l e t h o s e t h a t are objects will h a v e " n e u t r a l " g e n d e r . If t h e g e n d e r o f e i t h e r m a r k a b l e i or j is u n k n o w n , t h e n the g e n d e r a g r e e m e n t f e a t u r e v a l u e is u n k n o w n ; else if i a n d j a g r e e in g e n d e r , t h e n t h e f e a t u r e v a l u e is true; o t h e r w i s e its v a l u e is false.. B o t h - P r o p e r - N a m e s F e a t u r e ( P R O P E R _ N A M E ) : Its p o s s i b l e v a l u e s are t r u e or false. A p r o p e r n a m e is d e t e r m i n e d b a s e d o n cap i t al i zat i o n . P r e p o s i t i o n s a p p e a r i n g in t h e n a m e s u c h as of a n d and n e e d n o t b e in u p p e r c a s e . If i a n d j are b o t h p r o p e r n a m e s , r e t u r n t ru e; else r e t u r n false.. A l i a s F e a t u r e (ALIAS): Its p o s s i b l e v a l u e s are t r u e o r false. If i is a n alias o f j o r vice v e r s a , r e t u r n true; else r e t u r n false. T h a t is, this f e a t u r e v a l u e is t r u e if i a n d j are n a m e d entities ( p e r s o n , d at e, o r g a n i z a t i o n , etc.) t h a t r e f e r to t h e s a m e entity. T h e alias m o d u l e w o r k s d i f f e r e n t l y d e p e n d i n g . 525. Computational Linguistics Volume 27, N u m b e r 4. 12.. o n t h e n a m e d e n t i t y t y p e . F o r i a n d j t h a t a r e d a t e s (e.g., 01-08 a n d Jan. 8), b y u s i n g s t r i n g c o m p a r i s o n , t h e day, m o n t h , a n d y e a r v a l u e s a r e e x t r a c t e d a n d c o m p a r e d . I f t h e y m a t c h , t h e n j is a n alias o f i. F o r i a n d j t h a t a r e " p e r s o n , " s u c h as Mr. Simpson a n d Bent Simpson, t h e l a s t w o r d s of t h e n o u n p h r a s e s a r e c o m p a r e d to d e t e r m i n e w h e t h e r o n e is a n alias o f t h e other. F o r o r g a n i z a t i o n n a m e s , t h e alias f u n c t i o n a l s o c h e c k s f o r a c r o n y m m a t c h s u c h as IBM a n d International Business Machines Corp. I n this case, t h e l o n g e r s t r i n g is c h o s e n to b e t h e o n e t h a t is c o n v e r t e d i n t o t h e a c r o n y m f o r m . T h e first s t e p is to r e m o v e all p o s t m o d i f i e r s s u c h as Corp. a n d Ltd. T h e n , t h e a c r o n y m f u n c t i o n c o n s i d e r s e a c h w o r d i n t u r n , a n d if t h e first l e t t e r is c a p i t a l i z e d , it is u s e d t o f o r m t h e a c r o n y m . T w o v a r i a t i o n s o f t h e a c r o n y m s a r e p r o d u c e d : o n e w i t h a p e r i o d a f t e r e a c h letter, a n d o n e w i t h o u t . . A p p o s i t i v e F e a t u r e ( A P P O S I T I V E ) : Its p o s s i b l e v a l u e s a r e t r u e o r false. If j is in a p p o s i t i o n to i, r e t u r n t r u e ; else r e t u r n false. F o r e x a m p l e , t h e m a r k a b l e the chairman of Microsoft Corp. is in a p p o s i t i o n to Bill Gates in t h e s e n t e n c e Bill Gates, the chairman of Microsoft Corp . . . . . . O u r s y s t e m d e t e r m i n e s w h e t h e r j is a p o s s i b l e a p p o s i t i v e c o n s t r u c t b y first c h e c k i n g f o r t h e e x i s t e n c e o f v e r b s a n d p r o p e r p u n c t u a t i o n . L i k e t h e a b o v e e x a m p l e , m o s t a p p o s i t i v e s d o n o t h a v e a n y v e r b ; a n d a n a p p o s i t i v e is s e p a r a t e d b y a c o m m a f r o m t h e m o s t i m m e d i a t e a n t e c e d e n t , i, to w h i c h it refers. F u r t h e r , at l e a s t o n e o f i a n d j m u s t b e a p r o p e r n a m e . T h e M U C - 6 a n d M U C - 7 c o r e f e r e n c e t a s k d e f i n i t i o n s a r e s l i g h t l y d i f f e r e n t . I n M U C - 6 , j n e e d s to b e a d e f i n i t e n o u n p h r a s e to b e a n a p p o s i t i v e , w h i l e b o t h i n d e f i n i t e a n d d e f i n i t e n o u n p h r a s e s a r e a c c e p t a b l e in M U C - 7 . . A s a n e x a m p l e , Table 1 s h o w s t h e f e a t u r e v e c t o r a s s o c i a t e d w i t h t h e a n t e c e d e n t i, Frank Newman, a n d t h e a n a p h o r j, vice chairman, in t h e f o l l o w i n g s e n t e n c e : . (1) S e p a r a t e l y , C l i n t o n t r a n s i t i o n officials s a i d t h a t Frank Newman, 50, vice chairman a n d c h i e f f i n a n c i a l officer o f B a n k A m e r i c a C o r p . , is e x p e c t e d to b e n o m i n a t e d as a s s i s t a n t T r e a s u r y s e c r e t a r y f o r d o m e s t i c finance.. T a b l e 1 Feature vector of the markable pair (i = Frank Newman, j = vice chairman).. F e a t u r e V a l u e C o m m e n t s . DIST 0 i and j are in the same sentence I_PRONOUN - i is not a p r o n o u n J ~ R O N O U N - j is not a pronoun STR_MATCH - i and j do not match DEF_NP - j is not a definite noun phrase DEMaNP - j is not a demonstrative noun phrase NUMBER + i and j are both singular SEMCLASS 1 i and j are both persons (This feature has three values:. false(0), true(l), unknown(2).) GENDER 1 i and j are both males (This feature has three values:. false(0), true(l), unknown(2).) PROPER_NAME - Only i is a proper name ALIAS - j is not an alias of i APPOSITIVE + j is in apposition to i. 526. Soon, Ng, and Lira Coreference Resolution. Because of capitalization, m a r k a b l e s in the h e a d l i n e s of MUC-6 a n d MUC-7 doc- u m e n t s are a l w a y s c o n s i d e r e d p r o p e r n a m e s e v e n t h o u g h s o m e are not. O u r s y s t e m solves this i n a c c u r a c y b y first p r e p r o c e s s i n g a h e a d l i n e to correct the c a p i t a l i z a t i o n before p a s s i n g it into the p i p e l i n e of N L P m o d u l e s . O n l y those m a r k a b l e s in the h e a d - line t h a t a p p e a r in the text b o d y as p r o p e r n a m e s h a v e their c a p i t a l i z a t i o n c h a n g e d to m a t c h those f o u n d in the text body. All other h e a d l i n e m a r k a b l e s are c h a n g e d to lowercase.. 2.3 Generating Training Examples C o n s i d e r a coreference c h a i n A1 - A2 - A3 - A4 f o u n d in a n a n n o t a t e d t r a i n i n g d o c u - m e n t . O n l y pairs of n o u n p h r a s e s in the c h a i n t h a t are i m m e d i a t e l y a d j a c e n t (i.e., A1 - A2, A2 - A3, a n d A3 - A4) are u s e d to g e n e r a t e the p o s i t i v e t r a i n i n g examples. The first n o u n p h r a s e in a pair is a l w a y s c o n s i d e r e d the a n t e c e d e n t , w h i l e the s e c o n d is the anaphor. O n the o t h e r h a n d , n e g a t i v e t r a i n i n g e x a m p l e s are extracted as follows. B e t w e e n the t w o m e m b e r s of each a n t e c e d e n t - a n a p h o r pair, there are o t h e r m a r k a b l e s extracted b y o u r l a n g u a g e - p r o c e s s i n g m o d u l e s t h a t either are n o t f o u n d in a n y coreference c h a i n or a p p e a r in o t h e r chains. Each of t h e m is t h e n p a i r e d w i t h the a n a p h o r to f o r m a n e g a t i v e example. For example, if m a r k a b l e s a, b, a n d B1 a p p e a r b e t w e e n A1 a n d A2, t h e n the n e g a t i v e e x a m p l e s are a - A2, b - A2, a n d B1 - A2. N o t e t h a t a a n d b d o n o t a p p e a r in a n y coreference chain, w h i l e B1 a p p e a r s in a n o t h e r coreference chain.. For a n a n n o t a t e d n o u n p h r a s e in a coreference c h a i n in a t r a i n i n g d o c u m e n t , the s a m e n o u n p h r a s e m u s t be i d e n t i f i e d as a m a r k a b l e b y o u r p i p e l i n e of l a n g u a g e - p r o c e s s i n g m o d u l e s before this n o u n p h r a s e can be u s e d to f o r m a f e a t u r e v e c t o r for use as a t r a i n i n g example. This is b e c a u s e the i n f o r m a t i o n n e c e s s a r y to d e r i v e a feature vector, s u c h as s e m a n t i c class a n d gender, is c o m p u t e d b y the l a n g u a g e - p r o c e s s i n g m o d u l e s . If a n a n n o t a t e d n o u n p h r a s e is n o t i d e n t i f i e d as a m a r k a b l e , it will n o t c o n t r i b u t e a n y t r a i n i n g example. To see m o r e clearly h o w t r a i n i n g e x a m p l e s are g e n e r a t e d , c o n s i d e r the f o l l o w i n g f o u r sentences:. • Sentence 1. 1. (Eastern Air)a1 Proposes (Date For Talks o n ((Pay)cl-CUt)dl Plan)hi. 2. (Eastern Air)l Proposes (Date)2 For (Talks)3 on ( P a y - C u t Plan)4. • Sentence 2. 1. (Eastern Airlines)a2 executives n o t i f i e d (union)el leaders t h a t the carrier w i s h e s to discuss selective ((wage)c2 reductions)d2 o n (Feb. 3)b2.. 2. ((Eastern Airlines)5 executives)6 n o t i f i e d ((union)7 leaders)8 t h a t (the carrier)9 w i s h e s to discuss (selective (wage)10 r e d u c t i o n s ) n o n (Feb. 3)12.. 1.. .. Sentence 3. ((Union)e2 r e p r e s e n t a t i v e s w h o c o u l d be reached)f1 said (they)f2 h a d n ' t d e c i d e d w h e t h e r (they)f3 w o u l d r e s p o n d . ((Union)13 representatives)14 w h o c o u l d be r e a c h e d said (they)is h a d n ' t d e c i d e d w h e t h e r (they)16 w o u l d r e s p o n d . . 527. Computational Linguistics Volume 27, Number 4. • Sentence 4. .. .. By proposing (a meeting date)b3, (Eastern)a3 m o v e d one step closer toward reopening current high-cost contract agreements with ((its)a4 unions)e3. By proposing (a meeting dateh7, (Eastern)18 m o v e d (one step)19 closer toward reopening (current high-cost contract agreements)20 w i t h ((its)a1 UniOnS)a2.. Each sentence is s h o w n twice with different n o u n phrase boundaries. Sentences labeled (1) are obtained directly from part of the training document. The letters in the subscripts uniquely identify the coreference chains, while the n u m b e r s identify the n o u n phrases. N o u n phrases in sentences labeled (2) are extracted b y our language- processing modules and are also uniquely identified b y numeric subscripts.. Let's consider chain e, which is about the union. There are three n o u n phrases that corefer, and our system m a n a g e d to extract the boundaries that correspond to all of them: (union)7 matches with (union)el, (union)13 with (union)e2, and (its unions)22 with (its unions)e3. There are two positive training examples formed by ((union)13, (its unions)22) and ((union)7, (union)13). N o u n phrases between (union)7 and (union)13 that do not corefer with (union)13 are used to form the negative examples. The negative examples are ((the carrier)9, (union)is), ((wage)lo, (union)13), ((selective wage reductions)11, (union)13), and ((Feb. 3)12, (union)13). Negative examples can also be f o u n d similarly between ((union)13, (its unions)22).. As another example, neither n o u n phrase in chain d, (Pay-Cut)all and (wage reductions)a2, matches w i t h any machine-extracted n o u n phrase boundaries. In this case, no positive or negative example is formed for n o u n phrases in chain d.. 2.4 B u i l d i n g a Classifier The next step is to use a machine learning algorithm to learn a classifier based on the feature vectors generated from the training documents. The learning algorithm used in our coreference engine is C5, which is an u p d a t e d version of C4.5 (Quinlan 1993). C5 is a c o m m o n l y used decision tree learning algorithm and thus it m a y be considered as a baseline m e t h o d against which other learning algorithms can be compared.. 2.5 G e n e r a t i n g Coreference C h a i n s for Test D o c u m e n t s Before determining the coreference chains for a test document, all possible markables need to be extracted from the document. Every markable is a possible anaphor, and every markable before the anaphor in d o c u m e n t order is a possible antecedent of the anaphor, except w h e n the anaphor is nested. If the anaphor is a child or nested markable, then its possible antecedents m u s t not be any markable with the same root markable as the current anaphor. However, the possible antecedents can be other root markables and their children that are before the anaphor in d o c u m e n t order. For example, consider the two root markables, Mr. Tom's daughter and His daughter's eyes, appearing in that order in a test document. The possible antecedents of His cannot be His daughter or His daughter's eyes, but can be Mr. Tom or Mr. Tom's daughter.. The coreference resolution algorithm considers every markable j starting from the second markable in the d o c u m e n t to be a potential candidate as an anaphor. For each j, the algorithm considers every markable i before j as a potential antecedent. For each pair i a n d j, a feature vector is generated a n d given to the decision tree classifier. A coreferring antecedent is f o u n d if the classifier returns true. The algorithm starts from the immediately preceding markable a n d proceeds backward in the reverse order of. 528. Soon, Ng, and Lim Coreference Resolution. the m a r k a b l e s in the d o c u m e n t u n t i l t h e r e is n o r e m a i n i n g m a r k a b l e t o test o r a n a n t e c e d e n t is f o u n d . . As a n e x a m p l e , c o n s i d e r the f o l l o w i n g text w i t h m a r k a b l e s a l r e a d y d e t e c t e d b y t h e N L P m o d u l e s : . (2) (Ms. W a s h i n g t o n ) 7 3 ' s c a n d i d a c y is b e i n g c h a m p i o n e d b y (sev eral p o w e r f u l lawmakers)74 i n c l u d i n g ((her)76 boss)75, ( C h a i r m a n J o h n Dingell)77 (D., (Mich.)78) of (the H o u s e E n e r g y a n d C o m m e r c e Committee)79. (She)so c u r r e n t l y is (a counsel)81 to (the committee)s2. (Ms. W a s h i n g t o n ) s 3 a n d (Mr. DingeU)s4 h a v e b e e n c o n s i d e r e d (allies)s5 o f (the (securities)s7 exchanges)s6, w h i l e (banks)s8 a n d ((futures)90 exchanges)89 h a v e o f t e n f o u g h t w i t h ( t h e m M . . We will c o n s i d e r h o w t h e b o l d f a c e d c h a i n s are d e t e c t e d . Table 2 s h o w s t h e p a i r s of m a r k a b l e s t e s t e d f o r c o r e f e r e n c e to f o r m t h e c h a i n f o r Ms. Washington-her-She-Ms. Washington. W h e n the s y s t e m c o n s i d e r s t h e a n a p h o r , (her)76, all p r e c e d i n g p h r a s e s , e x c e p t (her boss)75, are t e s t e d to see w h e t h e r t h e y c o r e f e r w i t h it. (her boss)75 is n o t t e s t e d b e c a u s e (her)76 is its n e s t e d n o u n p h r a s e . Finally, t h e d e c i s i o n t ree d e t e r m i n e s t h a t the n o u n p h r a s e (Ms. Washington)73 c o r e f e r s w i t h (her)76. In Table 2, w e o n l y s h o w t h e s y s t e m c o n s i d e r i n g the t h r e e a n a p h o r s (her)76, (She)so, a n d (Ms. Washington)s3, in t h a t order.. T a b l e 2 Pairs of markables that are tested in forming the coreference chain Ms. Washington-her-She-Ms. Washington. The feature vector format: DIST, SEMCLASS, NUMBER, GENDER, PROPER_NAME, ALIAS, ,_PRONOUN, DEF_NP, DEMANP, STR_MATCH, APPOSITIVE, Id~RONOUN.. A n t e c e d e n t A n a p h o r F e a t u r e V e c t o r C o r e f e r s ? . (several powerful lawmakers)74 (Ms. Washington)73 (the House Energy and Commerce Committee)79 (Mich.)TS (Chairman John Dingell)77 (her) 76 (the committee)s2 (a counsel)81 (She)so (the House Energy and Commerce Committee) 79 (Mich.)7s (Chairman John Dingell)77 (her) 76 (her boss)75 (several powerful lawmakers)74 (Ms. Washington)73. (her)76 0 , 1 , - , 2 , - , - , + . . . . . (her)76 0,1,+,1,-,-,+, , , , (She)80 1,0,+,0,-,-,+ . . . . . (She)80 2,0,+,0,-,-,+ . . . . (She)s0 3,1,+,0,-,-,+, , , ,. (She)80 (Ms. Washington)8s (Ms. Washington)s3 (Ms. Washington)83 (Ms. Washington)83. Ms. Washington)s3 Ms. Washington)s3. Ms. Washington)83 Ms. Washington)83 Ms. Washington)s3. Ms. Washington)83. 3,1,+,1,-,-,+,, , 1,0,+,0,-, , , 1,1,+,2, , , , 1,1,+,1, , , , 2,0,+,0,+,-, ,. , , ' 1 + • , ]. ] ] ]. , , + ]. 3,0,+,0,+,-, 4,1,+,0,+ . . . . . 4,1,%1, , , 4 , 1 , - , 0 , - , , 4,1,-,2, , ,. , , r + . [ Y Y , . 4 , 1 , + , 1 , + , + , - - , - - , - - , + , - ,. N o . Yes No. N o No. Yes No No No No. N o No. N o No No. Yes. 529. Computational Linguistics Volume 27, N u m b e r 4. We u s e t h e s a m e m e t h o d to g e n e r a t e c o r e f e r e n c e c h a i n s for b o t h M U C - 6 a n d M U C - 7, e x c e p t f o r t h e f o l l o w i n g . F o r M U C - 7 , b e c a u s e of s l i g h t c h a n g e s in t h e c o r e f e r e n c e t a s k d e f i n i t i o n , w e i n c l u d e a f i l t e r i n g m o d u l e to r e m o v e c e r t a i n c o r e f e r e n c e c h a i n s . T h e t a s k d e f i n i t i o n s t a t e s t h a t a c o r e f e r e n c e c h a i n m u s t c o n t a i n a t l e a s t o n e e l e m e n t t h a t is a h e a d n o u n o r a n a m e ; t h a t is, a c h a i n c o n t a i n i n g o n l y p r e n o m i n a l m o d i f i e r s is r e m o v e d b y t h e f i l t e r i n g m o d u l e . . 3. E v a l u a t i o n . I n o r d e r to e v a l u a t e t h e p e r f o r m a n c e o f o u r l e a r n i n g a p p r o a c h to c o r e f e r e n c e r e s o l u - t i o n o n c o m m o n d a t a sets, w e u t i l i z e d t h e a n n o t a t e d c o r p o r a a n d s c o r i n g p r o g r a m s f r o m M U C - 6 a n d M U C - 7 , w h i c h a s s e m b l e d a set of n e w s w i r e d o c u m e n t s a n n o t a t e d w i t h c o r e f e r e n c e c h a i n s . A l t h o u g h w e d i d n o t p a r t i c i p a t e i n e i t h e r M U C - 6 or M U C - 7 , w e w e r e a b l e to o b t a i n t h e t r a i n i n g a n d t e s t c o r p o r a f o r b o t h y e a r s f r o m t h e M U C o r g a - n i z e r s f o r r e s e a r c h p u r p o s e s . 1 To o u r k n o w l e d g e , t h e s e are t h e o n l y p u b l i c l y a v a i l a b l e a n n o t a t e d c o r p o r a f o r c o r e f e r e n c e r e s o l u t i o n . . F o r M U C - 6 , 30 d r y - r u n d o c u m e n t s a n n o t a t e d w i t h c o r e f e r e n c e : i n f o r m a t i o n w e r e u s e d as t h e t r a i n i n g d o c u m e n t s f o r o u r c o r e f e r e n c e e n g i n e . T h e r e a r e a l s o 30 a n n o t a t e d t r a i n i n g d o c u m e n t s f r o m M U C - 7 . T h e t o t a l size of t h e 30 t r a i n i n g d o c u m e n t s is close to 12,400 w o r d s f o r M U C - 6 a n d 19,000 w o r d s f o r M U C - 7 . T h e r e a r e a l t o g e t h e r 20,910 (48,872) t r a i n i n g e x a m p l e s u s e d f o r M U C - 6 (MUC-7), o f w h i c h o n l y 6.5% (4.4%) a r e p o s i t i v e e x a m p l e s i n M U C - 6 (MUC-7). 2. A f t e r t r a i n i n g a s e p a r a t e classifier f o r e a c h y e a r , w e t e s t e d t h e p e r f o r m a n c e of e a c h classifier o n its c o r r e s p o n d i n g t e s t c o r p u s . F o r M U C - 6 , t h e C5 p r u n i n g c o n f i d e n c e is set at 20% a n d t h e m i n i m u m n u m b e r of i n s t a n c e s p e r l e a f n o d e is set a t 5. F o r M U C - 7 , t h e p r u n i n g c o n f i d e n c e is 60% a n d t h e m i n i m u m n u m b e r of i n s t a n c e s is 2. T h e p a r a m e t e r s a r e d e t e r m i n e d b y p e r f o r m i n g 10-fold c r o s s - v a l i d a t i o n o n t h e w h o l e t r a i n i n g s e t f o r e a c h M U C year. T h e p o s s i b l e p r u n i n g c o n f i d e n c e v a l u e s t h a t w e t r i e d a r e 10%, 20%, 40%, 60%, 80%, a n d 100%, a n d f o r m i n i m u m i n s t a n c e s , w e t r i e d 2, 5, 10, 15, a n d 20. T h u s , a t o t a l of 30 (6 x 5) c r o s s - v a l i d a t i o n r u n s w e r e e x e c u t e d . . O n e a d v a n t a g e o f u s i n g a d e c i s i o n t r e e l e a r n i n g a l g o r i t h m is t h a t t h e r e s u l t i n g d e c i s i o n t r e e classifier c a n b e i n t e r p r e t e d b y h u m a n s . T h e d e c i s i o n t r e e g e n e r a t e d f o r M U C - 6 , s h o w n in F i g u r e 2, s e e m s t o e n c a p s u l a t e a r e a s o n a b l e r u l e o f t h u m b t h a t m a t c h e s o u r i n t u i t i v e l i n g u i s t i c n o t i o n of w h e n t w o n o u n p h r a s e s c a n corefer. I t is a l s o i n t e r e s t i n g t o n o t e t h a t o n l y 8 o u t o f t h e 12 a v a i l a b l e f e a t u r e s in t h e t r a i n i n g e x a m p l e s a r e a c t u a l l y u s e d i n t h e final d e c i s i o n t r e e built.. M U C - 6 h a s a s t a n d a r d set of 30 t e s t d o c u m e n t s , w h i c h is u s e d b y all s y s t e m s t h a t p a r t i c i p a t e d i n t h e e v a l u a t i o n . Similarly, M U C - 7 h a s a t e s t c o r p u s o f 20 d o c u m e n t s . W e c o m p a r e d o u r s y s t e m ' s M U C - 6 a n d M U C - 7 p e r f o r m a n c e w i t h t h a t of t h e s y s t e m s t h a t t o o k p a r t in M U C - 6 a n d M U C - 7 , r e s p e c t i v e l y . W h e n t h e c o r e f e r e n c e e n g i n e is g i v e n n e w t e s t d o c u m e n t s , its o u t p u t is i n t h e f o r m of S G M L files w i t h t h e c o r e f e r e n c e c h a i n s p r o p e r l y a n n o t a t e d a c c o r d i n g to t h e g u i d e l i n e s . 3 We t h e n u s e d t h e s c o r i n g p r o g r a m s . 1 See http://www.itl.nist.gov/iad/894.02/related_projects/muc/index.html for details on obtaining the corpora.. 2 Our system runs on a Pentium III 550MHz PC. It took less than 5 minutes to generate the training examples from the training documents for MUC-6, and about 7 minutes for MUC-7. The training time for the C5 algorithm to generate a decision tree from all the training examples was less than 3 seconds for both MUC years.. 3 The time taken to generate the coreference chains for the 30 MUC-6 test documents of close to 13,400 words was less than 3 minutes, while it took less than 2 minutes for the 20 MUC-7 test documents of about 10,000 words.. 530. Soon, Ng, and Lira Coreference Resolution. STR_MATCH = +: +. STR_MATCH = -". :...J_PRONOUN = -". :...APPOSITIVE = +: +. : APPOSITIVE = -:. : :...ALIAS = +: +. : ALIAS . . . . . J_PRONOUN = +:. :...GENDER = O: -. GENDER = 2: -. GENDER = I:. :...1 PRONOUN = +: +. I_PRONOUN = -". :...DIST > 0: -. DIST <= 0:. :...NUMBER = +: +. NUMBER . . . . . Figure 2 The decision tree classifier learned for MUC-6.. f o r t h e r e s p e c t i v e y e a r s to g e n e r a t e t h e recall a n d p r e c i s i o n s c o r e s f o r o u r c o r e f e r e n c e e n g i n e . . O u r c o r e f e r e n c e e n g i n e a c h i e v e s a recall o f 58.6% a n d a p r e c i s i o n o f 67.3%, y i e l d i n g a b a l a n c e d F - m e a s u r e o f 62.6% f o r M U C - 6 . F o r M U C - 7 , t h e recall is 56.1%, t h e p r e c i s i o n is 65.5%, a n d t h e b a l a n c e d F - m e a s u r e is 60.4%. 4 W e p l o t t e d t h e s c o r e s of o u r c o r e f e r e n c e e n g i n e ( s q u a r e - s h a p e d ) a g a i n s t t h e official t e s t s c o r e s o f t h e o t h e r s y s t e m s (cross- s h a p e d ) in F i g u r e 3 a n d F i g u r e 4.. We a l s o p l o t t e d t h e l e a r n i n g c u r v e s o f o u r c o r e f e r e n c e e n g i n e in F i g u r e 5 a n d F i g u r e 6, s h o w i n g its a c c u r a c y a v e r a g e d o v e r t h r e e r a n d o m trials w h e n t r a i n e d o n 1, 2, 3, 4, 5, 10, 15, 20, 25, a n d 30 t r a i n i n g d o c u m e n t s . T h e l e a r n i n g c u r v e s i n d i c a t e t h a t o u r c o r e f e r e n c e e n g i n e a c h i e v e s its p e a k p e r f o r m a n c e w i t h a b o u t 25 t r a i n i n g d o c u m e n t s , o r a b o u t 11,000 to 17,000 w o r d s o f t r a i n i n g d o c u m e n t s . T h i s n u m b e r o f t r a i n i n g d o c u m e n t s w o u l d g e n e r a t e t e n s o f t h o u s a n d s o f t r a i n i n g e x a m p l e s , sufficient f o r t h e d e c i s i o n t r e e l e a r n i n g a l g o r i t h m to l e a r n a g o o d classifier. A t h i g h e r n u m b e r s o f t r a i n i n g d o c u m e n t s , o u r s y s t e m s e e m s to s t a r t o v e r f i t t i n g t h e t r a i n i n g d a t a . F o r e x a m p l e , o n M U C - 7 d a t a , t r a i n i n g o n t h e full set o f 30 t r a i n i n g d o c u m e n t s r e s u l t s in a m o r e c o m p l e x d e c i s i o n tree.. O u r s y s t e m ' s s c o r e s a r e in t h e u p p e r r e g i o n o f t h e M U C - 6 a n d M U C - 7 s y s t e m s . We p e r f o r m e d a s i m p l e o n e - t a i l e d , p a i r e d s a m p l e t-test a t s i g n i f i c a n c e l e v e l p = 0.05 to d e t e r m i n e w h e t h e r t h e d i f f e r e n c e b e t w e e n o u r s y s t e m ' s F - m e a s u r e s c o r e a n d e a c h o f t h e o t h e r s y s t e m s ' F - m e a s u r e s c o r e o n t h e test d o c u m e n t s is s t a t i s t i c a l l y significant. 5 W e f o u n d t h a t a t t h e 95% s i g n i f i c a n c e l e v e l (p = 0.05), o u r s y s t e m p e r f o r m e d b e t t e r t h a n t h r e e M U C - 6 s y s t e m s , a n d as w e l l as t h e r e s t o f t h e M U C - 6 s y s t e m s . U s i n g t h e . 4 Note that MUC-6 did not use balanced F-measure as the official evaluation measure, but MUC-7 did. 5 Though the McNemar test is shown to have low Type I error compared with the paired t-test. (Dietterich 1998), we did not carry out this test in the context of coreference. This is because an example instance defines a coreference link between two noun phrases, and since this link is transitive in nature, it is unclear how the number of links misclassified by System A but not by System B and vice versa can be obtained to execute the McNemar test.. 531. Computational Linguistics Volume 27, Number 4. 100. c- O. O,... 90. 80. 70. 60. 50. 40. 30. 20. 10. 0. X~. x x { x . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . X . . . . . . . . . . . . . . . . . - . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . X. 0 10 20 30 40 50 60 70 80 90 100. Recall. Figure 3 Coreference scores of MUC-6 systems and our system.. 100. t - O. O. 0... 90. 80. 70. 60. 50. 40. 30. 20. 10. 0. X. % . . . . . . . . . . . . . . . . . . . . . . . . . . . . . X . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . • . . . . . . . . . . . . . . . . . . . . . . . . . X. I. 0 10 20 30 40 50 60 70 80 90 100. Recall. Figure 4 Coreference scores of MUC-7 systems and our system.. same significance level, o u r system p e r f o r m e d better than four MUC-7 systems, and as well as the rest of the MUC-7 systems. O u r result is encouraging since it indicates that a learning a p p r o a c h using relatively shallow features can achieve scores comparable to those of systems built using n o n l e a r n i n g approaches.. 532. Soon, Ng, and Lira Coreference Resolution. 63 62 61 60 59 58 57 56. -1. ,~ 55 5 4 . ,,' 53 52 51 50 49 48 47 46. 0. I I. 5 10 15 20 25. Number of MUC-6 Documents Trained On. Figure 5 Learning curve of coreference resolution accuracy for MUC-6.. 63 I I I I I. "3. ¢ ) . E L I _ . 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46. Figure 6. J J. 80. I I I I I I. 0 5 10 15 20 25 30. Number of MUC-7 Documents Trained On. Learning curve of coreference resolution accuracy for MUC-7.. It s h o u l d b e n o t e d t h a t t h e a c c u r a c y o f o u r c o r e f e r e n c e r e s o l u t i o n e n g i n e d e p e n d s to a l a r g e e x t e n t o n t h e p e r f o r m a n c e o f t h e N L P m o d u l e s t h a t a r e e x e c u t e d b e f o r e t h e c o r e f e r e n c e e n g i n e . O u r c u r r e n t l e a r n i n g - b a s e d , H M M n a m e d e n t i t y r e c o g n i t i o n m o d u l e is t r a i n e d o n 318 d o c u m e n t s (a d i s j o i n t s e t f r o m b o t h t h e M U C - 6 a n d M U C - 7 . 533. Computational Linguistics Volume 27, N u m b e r 4. test d o c u m e n t s ) t a g g e d w i t h n a m e d entities, a n d its s c o r e o n t h e M U C - 6 n a m e d e n t i t y t a s k f o r t h e 30 f o r m a l t e s t d o c u m e n t s is o n l y 88.9%, w h i c h is n o t c o n s i d e r e d v e r y h i g h b y M U C - 6 s t a n d a r d s . F o r e x a m p l e , o u r n a m e d e n t i t y r e c o g n i z e r c o u l d n o t i d e n t i f y t h e t w o n a m e d e n t i t i e s USAir a n d Piedmont in t h e e x p r e s s i o n USAir and Piedmont b u t i n s t e a d t r e a t t h e m as o n e s i n g l e n a m e d entity. O u r p a r t - o f - s p e e c h t a g g e r a c h i e v e s 96% accuracy, w h i l e t h e a c c u r a c y of n o u n p h r a s e i d e n t i f i c a t i o n is a b o v e 90%.. 4. T h e C o n t r i b u t i o n o f t h e F e a t u r e s . O n e f a c t o r t h a t affects t h e p e r f o r m a n c e of a m a c h i n e l e a r n i n g a p p r o a c h is t h e set o f f e a t u r e s u s e d . I t is i n t e r e s t i n g t o f i n d o u t h o w u s e f u l e a c h o f o u r 12 f e a t u r e s is i n t h e M U C - 6 a n d M U C - 7 c o r e f e r e n c e t a s k s . O n e w a y t o d o t h i s is t o t r a i n a n d t e s t u s i n g j u s t o n e f e a t u r e a t a t i m e . T a b l e 3 a n d Table 4 s h o w t h e r e s u l t s o f t h e e x p e r i m e n t . F o r b o t h M U C - 6 a n d M U C - 7 , t h e 3 f e a t u r e s t h a t g i v e n o n z e r o recall a n d p r e c i s i o n a r e A L I A S , S T R _ M A T C H , a n d A P P O S I T I V E . T h e 12 f e a t u r e s c a n b e d i v i d e d i n t o u n a r y a n d b i n a r y . Table 3 MUC-6 results of complete and baseline systems to study the contribution of the features.. S y s t e m ID Recall Prec F Remarks. C o m p l e t e s y s t e m s . DSO 58.6 67.3 62.6 Our system DSO_TRG 52.6 67.6 59.2 Our system using RESOLVE's method. of generating positive and negative examples. RESOLVE 44.2 50.7 47.2 The RESOLVE coreference system at the University of Massachusetts. Baseline s y s t e m s u s i n g just o n e feature. DIST 0.0 0.0 0.0 Only SEMCLASS 0.0 0.0 0.0 Only. NUMBER 0.0 0.0 0.0 Only GENDER 0.0 0.0 0.0 Only. PROPER_NAME 0.0 0.0 0.0 Only ALIAS 24.5 88.7 38.4 Only. J_PRONOUN 0.0 0.0 0.0 Only DEF_NP 0.0 0.0 0.0 Only. DEM_NP 0.0 0.0 0.0 Only STR_MATCH 45.7 65.6 53.9 Only APPOSITIVE 3.9 57.7 7.3 Only I_PRONOUN 0.0 0.0 0.0 o n l y . "distance" feature is used "semantic class agreement" " n u m b e r agreement" "gender agreement" "both proper names" "alias" "j-pronoun" "definite noun phrase" "demonstrative n o u n phrase" "string match" "appositive" "i-pronoun". Other b a s e l i n e s y s t e m s . ALIAS_STR 51.5 66.4 58.0 Only the "alias" and "string match" features are used. ALIAS_STR~PPOS 55.2 66.4 60.3 Only the "alias," "string match," and "appositive" features are used. ONE_CHAIN 89.9 31.8 47.0 All markables form one chain ONE_WRD 55.4 36.6 4 4 . 1 Markables corefer if there is at least one. c o m m o n w o r d HD_WRD 56.4 50.4 53.2 Markables corefer if their head words are. the same. 534. Soon, Ng, and Lim Coreference Resolution. Table 4 MUC-7 results of complete and baseline systems to study the contribution of the features.. S y s t e m ID Recall Prec F Remarks. C o m p l e t e s y s t e m s . DSO 56.1 65.5 60.4 Our system DSO_TRG 53.3 69.7 60.4 Our system using RESOLVE's method. of generating positive and negative examples. Baseline s y s t e m s u s i n g just o n e feature. DIST 0.0 0.0 0.0 Only "distance" feature is used SEMCLASS 0.0 0.0 0.0 Only "semantic class agreement". NUMBER 0.0 0.0 0.0 Only "number agreement" GENDER 0.0 0.0 0.0 Only "gender agreement". PROPER3X/AME 0.0 0.0 0.0 Only "both proper names" ALIAS 25.6 8 1 . 1 38.9 Only "alias". J_PRONOUN 0.0 0.0 0.0 Only "j-pronoun" DEF_NP 0.0 0.0 0.0 Only "definite noun phrase". DEM_NP 0.0 0.0 0.0 Only "demonstrative noun phrase" STRdV[ATCH 43.8 71.4 54.3 Only "string match" APPOSITIVE 2.4 60.0 4.6 Only "appositive" I_PRONOUN 0.0 0.0 0.0 Only "i-pronoun". Other baseline s y s t e m s . ALIAS_STR 49.4 70.4 5 8 . 1 Only the "alias" and "string match" features are used. ALIAS_STR_APPOS 51.6 69.9 59.4 Only the "alias," "string match," and "appositive" features are used. ONE_CHAIN 87.5 30.5 45.2 All markables form one chain ONE_WRD 55.9 38.7 45.7 Markables corefer if there is at least one. common word HD_WRD 55.2 55.6 55.4 Markables corefer if their head words are. the same. f e a t u r e s . T h e u n a r y f e a t u r e s are I _ P R O N O U N , J _ P R O N O U N , DEF_NP, a n d DEM_NP, w h i l e t h e r e s t a r e b i n a r y in n a t u r e . All t h e u n a r y f e a t u r e s s c o r e a n F - m e a s u r e o f 0. T h e b i n a r y f e a t u r e s w i t h 0 F - m e a s u r e a r e DIST, P R O P E R d ' q A M E , G E N D E R , SEMCLASS, a n d N U M B E R . . T h e A L I A S , A P P O S I T I V E , a n d S T R _ M A T C H f e a t u r e s g i v e n o n z e r o F - m e a s u r e . All t h e s e f e a t u r e s g i v e r a t h e r h i g h p r e c i s i o n s c o r e s ( > 80% f o r A L I A S , > 65% f o r S T R _ M A T C H , a n d > 57% f o r A P P O S I T I V E ) . Since t h e s e f e a t u r e s a r e h i g h l y i n f o r m a - tive, w e w e r e c u r i o u s to see h o w m u c h t h e y c o n t r i b u t e to o u r M U C - 6 a n d M U C - 7 r e s u l t s o f 62.6% a n d 60.4%, r e s p e c t i v e l y . S y s t e m s ALIAS_STR a n d ALIAS_STR~a~PPOS in Table 3 a n d Table 4 s h o w t h e r e s u l t s o f t h e e x p e r i m e n t . I n t e r m s o f a b s o l u t e F- m e a s u r e , t h e d i f f e r e n c e b e t w e e n u s i n g t h e s e t h r e e f e a t u r e s a n d u s i n g all f e a t u r e s is 2.3% f o r M U C - 6 a n d 1% f o r M U C - 7 ; in o t h e r w o r d s , t h e o t h e r n i n e f e a t u r e s c o n t r i b u t e j u s t 2.3% a n d 1% m o r e f o r e a c h o f t h e M U C y e a r s . T h e s e n i n e f e a t u r e s w i l l b e t h e first o n e s to b e c o n s i d e r e d f o r p r u n i n g a w a y b y t h e C5 a l g o r i t h m . F o r e x a m p l e , f o u r f e a t u r e s , n a m e l y , SEMCLASS, P R O P E R _ N A M E , DEF_NP, a n d DEM_NP, a r e n o t u s e d i n t h e M U C - 6 t r e e s h o w n in F i g u r e 2. F i g u r e 7 s h o w s t h e d i s t r i b u t i o n o f t h e t e s t c a s e s o v e r t h e f i v e p o s i t i v e l e a f n o d e s o f t h e M U C - 6 tree. F o r e x a m p l e , a b o u t 66.3% o f all. 535. Computational Linguistics Volume 27, Number 4. STR_MATCH = +: + 944 (66.3%). STR_MATCH = -". :...J_PRONOUN = -". :...APPOSITIVE = +: + Iii (7.8%). STR_MATCH = -". :...J_PRONOUN = -". :...APPOSITIVE = -". :...ALIAS = +: + 163 (11.5%). STR_MATCH = -". :...J_PRONOUN = +:. :...GENDER = I:. :...I_PRONOUN = +: + 77 (5.4%). STR_MATCH = --. :...J_PRONOUN = +:. :...GENDER = I:. :...I_PRONOUN = -". :...DIST <= 0:. :...NUMBER = +: + 128 (9.0%). Figure 7 Distribution of test examples from the 30 MUC-6 test documents for positive leaf nodes of the MUC-6 tree.. the test examples that are classified positive go to the "If STRiVIATCH" b r a n c h of the tree.. Other baseline systems that are used are ONE_CHAIN, ONE_WRD, and HD_WRD (Cardie and Wagstaff 1999). For ONE_CHAIN, all markables f o r m e d one chain. In ONE_WRD, markables corefer if there is at least one c o m m o n word. In HD_WRD, markables corefer if their h e a d w o r d s are the same. The p u r p o s e of ONE_CHAIN is to determine the m a x i m u m recall our system is capable of. The recall level here indirectly measures h o w effective the n o u n phrase identification m o d u l e is. Both ONE_WRD a n d HD_WRD are less stringent variations of STR_MATCH. The p e r f o r m a n c e of ONE_WRD is the worst. HD_WRD offers better recall c o m p a r e d to STR_MATCH, b u t p o o r e r precision. H o w e v e r , its F-measure is comparable to that of STRA4ATCH.. The score of the coreference system at the University of Massachusetts (RESOLVE), which uses C4.5 for coreference resolution, is s h o w n in Table 3. RESOLVE is s h o w n because a m o n g the MUC-6 systems, it is the only machine l e a r n i n g - b a s e d system that w e can directly c o m p a r e to. The other MUC-6 systems w e r e not based on a learning approach. Also, n o n e of the systems in MUC-7 a d o p t e d a learning a p p r o a c h to coreference resolution (Chinchor 1998).. RESOLVE's score is not high c o m p a r e d to scores attained b y the rest of the MUC- 6 systems. In particular, the system's recall is relatively low. O u r system's score is higher than that of RESOLVE, a n d the difference is statistically significant. The RE- SOLVE system is described in three papers: McCarthy a n d Lehnert (1995), Fisher et al. (1995), and McCarthy (1996). As explained in McCarthy (1996), the reason for this low recall is that RESOLVE takes only the "relevant entities" and "relevant references" as input, w h e r e the relevant entities a n d relevant references are restricted to " p e r s o n " . 536. Soon, Ng, and Lim Coreference Resolution. and "organization." In addition, because of limitations of the n o u n phrase detection module, nested phrases are not extracted and therefore do not take part in coreference. Nested phrases can include prenominal modifiers, possessive pronouns, and so forth. Therefore, the n u m b e r of candidate markables to be used for coreference is small.. On the other hand, the markables extracted by our system include nested n o u n phrases, MUC-style n a m e d entity types (money, percent, date, etc.), and other types not defined by MUC. These markables will take part in coreference. About 3,600 top- level markables are extracted from the 30 MUC-6 test documents by our system. As detected by our NLP modules, only about 35% of these 3,600 phrases are "person" and "organization" entities and references. Concentrating on just these types has thus affected the overall recall of the RESOLVE system.. RESOLVE's w a y of generating training examples also differs from our system's: instances are created for all possible pairings of "relevant entities" and "relevant references," instead of our system's m e t h o d of stopping at the first coreferential n o u n phrase w h e n traversing back from the anaphor u n d e r consideration. We implemented RESOLVE's w a y of generating training examples, and the results (DSO-TRG) are reported in Table 3 and Table 4. For MUC-7, there is no drop in F-measure; for MUC-6, the F-measure dropped slightly.. RESOLVE makes use of 39 features, considerably more than our system's 12 features. RESOLVE's feature set includes the two highly informative features, ALIAS and STR_MATCH. RESOLVE does not use the APPOSITIVE feature.. 5. Error Analysis. In order to determine the major classes of errors m a d e by our system, we r a n d o m l y chose five test documents from MUC-6 and determined the coreference links that were either missing (false negatives) or spurious (false positives) in these sample documents. Missing links result in recall errors; spurious links result in precision errors.. Breakdowns of the n u m b e r of spurious and missing links are shown in Table 5 and Table 6, respectively. The following two subsections describe the errors in more detail.. 5.1 Errors Causing Spurious Links This section describes the five major types of errors s u m m a r i z e d in Table 5 in more detail.. 5.1.1 Prenominal Modifier String Match. This class of errors occurs w h e n some strings of the prenominal modifiers of two markables match b y surface string comparison and thus, b y the C5 decision tree in Figure 2, the markables are treated as coreferring. How-. Table 5 The types and frequencies of errors that affect precision.. Types of Errors Causing Spurious Links Frequency %. Prenominal modifier string match Strings match but noun phrases refer to different entities Errors in noun phrase identification Errors in apposition determination Errors in alias determination. 16 11. 42.1% 28.9%. 10.5% 13.2% 5.3%. 537. Computational Linguistics Volume 27, Number 4. Table 6 The types and frequencies of errors that affect recall.. Types of Errors Causing Missing Links Frequency %. Inadequacy of current surface features Errors in noun phrase identification Errors in semantic class determination Errors in part-of-speech assignment Errors in apposition determination Errors in tokenization. 38 63.3% 7 , 11.7% 7 11.7% 5 8.3% 2 3.3% 1 1.7%. ever, the entire markable actually does not corefer. The nested n o u n phrase extraction m o d u l e is responsible for obtaining the possible prenominal modifiers from a n o u n phrase.. In (3), the n o u n phrase extraction m o d u l e mistakenly extracted (vice)l a n d (vice)2, which are not prenominal modifiers. Because of string match, (vice)l a n d (vice)2 incorrectly corefer. In (4), (undersecretary)2 was correctly extracted as a prenorninal modifier, but incorrectly corefers with (undersecretary)l by string match.. (3). (4). David J. Bronczek, (vice)l president and general manager of Federal Express Canada Ltd., was n a m e d senior (vice)2 president, Europe, Africa and Mediterranean, at this air-express concern.. Tarnoff, a former Carter administration official and president of the Council on Foreign Relations, is expected to be n a m e d (undersecretary)l for political affairs . . . . Former Sen. Tim Wirth is expected to get a n e w l y created (undersecretary)2 post for global affairs, which w o u l d include refugees, drugs and environmental issues.. 5.1.2 Strings Match b u t N o u n Phrases Refer to D i f f e r e n t Entities. This error occurs w h e n the surface strings of two markables match and thus, b y the C5 decision tree in Figure 2, they are treated as coreferring. However, they actually refer to different entities a n d should not corefer. In (5), (the committee)l actually refers to the entity the House Energy and Commerce Committee, and (the committee)2 refers to the Senate Finance Committee; therefore, they should not corefer. In (6), the two instances of chief executive officer refer to two different persons, namely, Allan Laufgraben and Milton Petrie, and, again, should not corefer.. (5). (6). Ms. Washington's candidacy is being championed b y several powerful lawmakers including her boss, Chairman John Dingell (D., Mich.) of the House Energy and Commerce Committee. She currently is a counsel to (the c o m m i t t e e h . . . . Mr. Bentsen, w h o h e a d e d the Senate Finance Committee for the past six years, also is expected to nominate Samuel Sessions, (the committee)2's chief tax counsel, to one of the top tax jobs at Treasury.. Directors also approved the election of Allan Laufgraben, 54 years old, as president and (chief e x e c u t i v e officerh a n d Peter A. Left, 43, as chief operating officer. Milton Petrie, 90-year-old chairman, president a n d (chief e x e c u t i v e officer)2 since the c o m p a n y was f o u n d e d in 1932, will continue as chairman.. 538. Soon, Ng, and Lira Coreference Resolution. 5.1,3 Errors in N o u n Phrase Identification. This class of errors is caused by mistakes m a d e by the n o u n phrase identification module. In (7), May and June are incorrectly grouped together by the n o u n phrase identification m o d u l e as one n o u n phrase, that is, May, June. This markable then incorrectly causes the APPOSITIVE feature to be true, which results in classifying the pair as coreferential. In fact, (thefirst week of July)2 should not be in apposition to (May, Juneh. However, we classified this error as a n o u n phrase identification error because it is the first m o d u l e that causes the error. In (8), the n o u n phrase m o d u l e extracted Metaphor Inc. instead of Metaphor Inc. unit. This causes (it)2 to refer to Metaphor Inc. instead of Metaphor Inc. unit.. (7). (8). The w o m e n ' s apparel specialty retailer said sales at stores open more than one year, a key barometer of a retail concern's strength, declined 2.5% in (May, June)l and (the first w e e k of July)2.. . . . International Business Machines Corp.'s (Metaphor InC.)l unit said (it)2 will shed 80 employees . . . . 5.1.4 Errors in A p p o s i t i o n Determination. This class of errors occurs w h e n the anaphor is incorrectly treated as being in apposition to the antecedent and therefore causes the n o u n phrases to corefer. The precision scores obtained w h e n using the APPOSITIVE feature alone are shown in Table 3 and Table 4, which suggest that the m o d u l e can be improved further. Examples where apposition determination is incorrect are shown in (9) and (10).. (9). (10). Clinton officials are said to be deciding between recently retired Rep. Matthew M c H u g h (D., (N.Y.)l) and ( e n v i r o n m e n t a l activist)2 and transition official Gus Speth for the director of the Agency for International Development.. Metaphor, a software subsidiary that IBM purchased in 1991, also n a m e d (Chris Grejtak)l, (43 years old)2, currently a senior vice president, president and chief executive officer.. 5.1.5 Errors in A l i a s Determination. This class of errors occurs w h e n the anaphor is incorrectly treated as an alias of the antecedent, thus causing the n o u n phrase pair to corefer. In (11), the two phrases (House)l and (the House Energy and Commerce Committee)2 corefer because the ALIAS feature is incorrectly determined to be true.. (11) Consuela Washington, a longtime (House)l staffer and an expert in securities laws, is a leading candidate to be chairwoman of the Securities and Exchange Commission in the Clinton administration . . . . Ms. Washington's candidacy is being championed by several powerful lawmakers including her boss, Chairman John Dingell (D., Mich.) of (the H o u s e Energy and C o m m e r c e Committee)2.. 5.2 Errors C a u s i n g M i s s i n g Links This subsection describes the six major classes of errors s u m m a r i z e d in Table 6 in more detail.. 5.2.1 I n a d e q u a c y of Current Surface Features. This class of errors is due to the inadequacy of the current surface features because they do not have information about. 539. Computational Linguistics Volume 27, Number 4. other words (such as the connecting conjunctions, prepositions, or verbs) a n d other knowledge sources that m a y provide important clues for coreference. As a result, the set of shallow features we used is unable to correctly classify the n o u n phrases in the examples below as coreferring.. Example (12) illustrates w h y resolving (them)2 is difficult. (allies)l, securities exchanges, banks, and futures exchanges are all possible antecedents of (them)2, and the feature set m u s t include more information to be able to pick the correct one. The con- junction and in (13) and was named in (16) are important cues to determine coreference. In addition, it m a y also be possible to capture n o u n phrases in predicate constructions like (17), where (Mr. Gleason)l is the subject and (president)a is the object.. (12). (13). (14). (15). (16). (17). Ms. Washington and Mr. Dingell have been considered (allies)l of the securities exchanges, while banks a n d futures exchanges have often fought w i t h (them)2.. Separately, Clinton transition officials said that Frank N e w m a n , 50, (vice chairmanh and (chief financial officer)2 of BankAmerica Corp., is expected to be n o m i n a t e d as assistant Treasury secretary for domestic finance.. Separately, (Clinton transition officials)l said that Frank N e w m a n , 50, vice chairman and chief financial officer of BankAmerica Corp., is expected to be n o m i n a t e d as assistant Treasury secretary for domestic finance . . . . As early as today, (the C l i n t o n camp)2 is expected to n a m e five undersecretaries of state and several assistant secretaries.. (Metro-Goldwyn-Mayer InC.)l said it n a m e d Larry Gleason president of world-wide theatrical distribution of (the movie studio)2's distribution unit.. . . . (general managerh of Federal Express C a n a d a Ltd., was n a m e d (senior vice president)a, Europe, Africa and Mediterranean . . . . (Mr. G l e a s o n h , 55 years old, was (president)2 of theatrical exhibition for P a r a m o u n t Communications Inc., in charge of the c o m p a n y ' s 1,300 movie screens in 12 countries.. 5.2.2 Errors in Noun Phrase Identification. This class of errors was described in Sec- tion 5.1.3. The n o u n phrase identification m o d u l e m a y extract n o u n phrases that do not match the phrases in the coreference chain, therefore causing missing links a n d recall error.. 5.2.3 Errors in Semantic Class Determination. These errors are caused by the w r o n g assigmnent of semantic classes to words. For example, (Metaphor)l should be assigned "organization" but it is assigned " u n k n o w n " in (18), a n d (second-quarter)2 should be assigned " d a t e " instead of " u n k n o w n " in (19). However, correcting these classes will still not cause the n o u n phrases in the examples to corefer. This is because the values of the SEMCLASS feature in the training examples are extremely noisy, a situation caused largely b y our semantic class determination module. In m a n y of the negative training examples, although the n o u n phrases are assigned the same semantic classes, these assignments do not seem to be correct. Some examples are (four-year, NBC), (The union, Ford Motor Co.), a n d (base-wage, job-security). A better algorithm for assigning semantic classes a n d a more refined semantic class hierarchy are needed.. 540. Soon, Ng, and Lira Coreference Resolution. (18). (19). ( M e t a p h o r h , a software subsidiary that IBM purchased in 1991, also n a m e d Chris Grejtak, . . . Mr. Grejtak said in an interview that the staff reductions will affect most areas of (the company)2 related to its early proprietary software products.. Business brief--Petrie Stores Corp.: Losses for (Fiscal 2nd Period)l, half seen likely by retailer . . . . Petrie Stores Corp., Secaucus, N.J., said an uncertain economy and faltering sales probably will result in a (second-quarter)2 loss and perhaps a deficit for the first six months of fiscal 1994.. 5.2.4 Errors in Part-of-Speech Assignment. This class of errors is caused by the w r o n g assignment of part-of-speech tags to words. In (20), (there)2 is not extracted because the part-of-speech tag assigned is "RB," which is an adverb and not a possible n o u n phrase.. (20) Jon W. Slangerup, who is 43 and has been director of customer service in (Canadah, succeeds Mr. Bronczek as vice president and general manager (there)2.. 5.2.5 Errors in Apposition Determination. This class of errors was described in Sec- tion 5.1.4.. 5.2.6 Errors in Tokenization. This class of errors is due to the incorrect tokenization of words. In (21), (1-to-2)1 and (15-to-1)2 are not f o u n d because the tokenizer breaks 1-to-2 into 1, -, to-2.15-to-1 is broken up similarly.. (21) Separately, MGM said it completed a previously announced financial restructuring designed to clean up its balance sheet--removing $900 million in bank debt from MGM's books and reducing its debt-to-equity ratio to (1-to-2)1 from (15-to-1)2--with a view toward a future sale of the company.. 5.3 Comparing Errors Made by RESOLVE McCarthy (1996) has also performed an analysis of errors while conducting an evaluation on the MUC-5 English Joint Venture (EJV) corpus. A large n u m b e r of the spurious links are caused by w h a t he terms "feature ambiguity," which means that feature values are not computed perfectly. As seen in Table 5, our string match feature accounts for most of the spurious links. Also, seven of the spurious links are caused by alias and apposition determination. As with RESOLVE, "feature ambiguity" is the main source of precision errors.. For RESOLVE, a large n u m b e r of the missing links are caused by "incomplete semantic k n o w l e d g e " (32%) and " u n u s e d features" (40.5%). For our system, the errors due to the inadequacy of surface features and semantic class determination problems account for about 75% of the missing links. " U n u s e d features" means that some of the features, or combinations of features, that are needed to classify pairs of phrases as coreferential are not present in the decision trees (McCarthy 1996). Similarly, the inadequacy of our system's surface features means that the current feature set m a y not be enough and more information sources should be added.. Because a detailed error analysis of RESOLVE w o u l d require not only its MUC-6 response file, but also the o u t p u t of its various components, we cannot perform the same error analysis that we did for our system on RESOLVE.. 541. Computational Linguistics Volume 27, Number 4. 6. R e l a t e d W o r k . There is a long tradition of w o r k on coreference resolution within computational linguistics, b u t most of it w a s not subject to empirical evaluation until recently. A m o n g the papers that have reported quantitative evaluation results, most are not b a s e d on learning from an annotated corpus (Baldwin 1997; K a m e y a m a 1997; Lappin and Leass 1994; Mitkov 1997).. To our knowledge, the research efforts of Aone and Bennett (1995), Ge, Hale, and Charniak (1998), Kehler (1997), McCarthy and Lehnert (1995), Fisher et al. (1995), and McCarthy (1996) are the only ones that are b a s e d on learning from an annotated corpus. Ge, Hale, and Charniak (1998) u s e d a statistical m o d e l for resolving pronouns, w h e r e a s w e u s e d a decision tree learning algorithm and resolved general n o u n phrases, not just pronouns. Similarly, Kehler (1997) u s e d m a x i m u m e n t r o p y modeling to assign a probability distribution to alternative sets of coreference relationships a m o n g n o u n phrase entity templates, w h e r e a s w e u s e d decision tree learning.. The w o r k of Aone and Bennett (1995), McCarthy a n d Lehnert (1995), Fisher et al. (1995), a n d McCarthy (1996) e m p l o y e d decision tree learning. The RESOLVE system is presented in McCarthy and Lehnert (1995), Fisher et al. (1995), and McCarthy (1996). McCarthy a n d Lehnert (1995) describe h o w RESOLVE w a s tested on the MUC- 5 English Joint Ventures (EJV) corpus. It u s e d a total of 8 features, 3 of w h i c h w e r e specific to the EJV domain. For example, the feature JV-CHILD-i d e t e r m i n e d w h e t h e r i referred to a joint venture f o r m e d as the result of a tie-up. McCarthy (1996) describes h o w the original RESOLVE for MUC-5 EJV w a s i m p r o v e d to include more features, 8 of w h i c h w e r e d o m a i n specific, and 30 of w h i c h w e r e d o m a i n independent. Fisher et al. (1995) a d a p t e d RESOLVE to w o r k in MUC-6. The features u s e d w e r e slightly changed for this domain. Of the original 30 d o m a i n - i n d e p e n d e n t features, 27 w e r e used. The 8 domain-specific features w e r e completely changed for the MUC-6 task. For example, JV-CHILD-i w a s changed to C H I L D - / t o decide w h e t h e r i is a "unit" or a "subsidiary" of a certain parent company. In contrast to RESOLVE, our system makes use of a smaller set of 12 features and, as in A o n e a n d Bennett's (1995) system, the features u s e d are generic and applicable across domains. This m a k e s our coreference engine a d o m a i n - i n d e p e n d e n t module.. Although Aone and Bennett's (1995) system also m a d e use of decision tree learning for coreference resolution, they dealt w i t h Japanese texts, a n d their evaluation focused only on n o u n phrases denoting organizations, w h e r e a s our evaluation, w h i c h dealt w i t h English texts, e n c o m p a s s e d n o u n phrases of all types, not just those denoting organizations. In addition, A o n e and Bennett e v a l u a t e d their s y s t e m on n o u n phrases that h a d b e e n correctly identified, w h e r e a s w e e v a l u a t e d our coreference resolution engine as part of a total s y s t e m that first has to identify all the candidate n o u n phrases and has to deal w i t h the inevitable n o i s y data w h e n mistakes occur in n o u n phrase identification and semantic class determination.. The contribution of our w o r k lies in s h o w i n g that a learning approach, w h e n evaluated on c o m m o n coreference data sets, is able to achieve accuracy competitive w i t h that of state-of-the-art systems using nonlearning approaches. It is also the first machine learning-based system to offer performance comparable to that of nonlearning approaches.. Finally, the w o r k of Cardie and Wagstaff (1999) also falls u n d e r the machine learning approach. H o w e v e r , t h e y u s e d u n s u p e r v i s e d learning and their m e t h o d d i d not require a n y annotated training data. Their clustering m e t h o d achieved a balanced F- measure of only 53.6% on MUC-6 test data. This is to b e expected: s u p e r v i s e d learning in general o u t p e r f o r m s u n s u p e r v i s e d learning since a s u p e r v i s e d learning algorithm. 542. Soon, Ng, and Lirn Coreference Resolution. h a s access to a r i c h e r set of a n n o t a t e d d a t a to l e a r n f r o m . Since o u r s u p e r v i s e d l e a r n i n g a p p r o a c h r e q u i r e s o n l y a m o d e s t n u m b e r of a n n o t a t e d t r a i n i n g d o c u m e n t s to a c h i e v e g o o d p e r f o r m a n c e (as c a n b e s e e n f r o m the l e a r n i n g cu rv es), w e a r g u e t h a t t h e b e t t e r a c c u r a c y a c h i e v e d m o r e t h a n justifies t h e a n n o t a t i o n e f f o r t i n c u r r e d . . 7. Conclusion. In this p a p e r , w e p r e s e n t e d a l e a r n i n g a p p r o a c h to c o r e f e r e n c e r e s o l u t i o n o f n o u n p h r a s e s in u n r e s t r i c t e d text. T h e a p p r o a c h l e a r n s f r o m a small, a n n o t a t e d c o r p u s a n d the task i n c l u d e s r e s o l v i n g n o t just p r o n o u n s b u t g e n e r a l n o u n p h r a s e s . We e v a l u - a t e d o u r a p p r o a c h o n c o m m o n d a t a sets, n a m e l y , t h e M U C - 6 a n d M U C - 7 c o r e f e r e n c e c o r p o r a . We o b t a i n e d e n c o u r a g i n g results, i n d i c a t i n g t h a t o n t h e g e n e r a l n o u n p h r a s e c o r e f e r e n c e task, the l e a r n i n g a p p r o a c h a c h i e v e s a c c u r a c y c o m p a r a b l e to t h a t o f n o n - l e a r n i n g a p p r o a c h e s . . Acknowledgments This paper is an expanded version of a preliminary paper that appeared in the Proceedings of the 1999 Joint S1GDAT Conference on Empirical Methods in Natural Langua

No results found