• No results found

Transformation Based Error Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging

N/A
N/A
Protected

Academic year: 2020

Share "Transformation Based Error Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging"

Copied!
24
0
0

Loading.... (view fulltext now)

Full text

Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Eric Brill* The Johns H o p k i n s University. Recently, there has been a rebirth of empiricism in the field of natural language processing. Man- ual encoding of linguistic information is being challenged by automated corpus-based learning as a method of providing a natural language processing system with linguistic knowledge. Al- though corpus-based approaches have been successful in many different areas of natural language processing, it is often the case that these methods capture the linguistic information they are modelling indirectly in large opaque tables of statistics. This can make it difficult to analyze, understand and improve the ability of these approaches to model underlying linguistic behavior. In this paper, we will describe a simple rule-based approach to automated learning of linguistic knowledge. This approach has been shown for a number of tasks to capture information in a clearer and more direct fashion without a compromise in performance. We present a detailed case study of this learning method applied to part-of-speech tagging.. 1. I n t r o d u c t i o n . It has recently become clear that automatically extracting linguistic information from a sample text corpus can be an extremely p o w e r f u l m e t h o d of overcoming the linguis- tic k n o w l e d g e acquisition bottleneck inhibiting the creation of robust and accurate natural language processing systems. A n u m b e r of part-of-speech taggers are readily available and w i d e l y used, all trained and retrainable on text corpora (Church 1988; Cutting et al. 1992; Brill 1992; Weischedel et al. 1993). Endemic structural ambiguity, which can lead to such difficulties as trying to cope with the m a n y t h o u s a n d s of possi- ble parses that a g r a m m a r can assign to a sentence, can be greatly reduced b y adding empirically derived probabilities to grammar rules (Fujisaki et al. 1989; Sharman, Je- linek, and Mercer 1990; Black et al. 1993) and b y computing statistical measures of lexical association (Hindle and Rooth 1993). Word-sense disambiguation, a p r o b l e m that once s e e m e d o u t of reach for systems w i t h o u t a great deal of handcrafted lin- guistic and w o r l d knowledge, can n o w in s o m e cases be d o n e with high accuracy w h e n all information is derived automatically from corpora (Brown, Lai, and Mercer 1991; Yarowsky 1992; Gale, Church, and Yarowsky 1992; Bruce and Wiebe 1994). An effort has recently been u n d e r t a k e n to create a u t o m a t e d machine translation systems in which the linguistic information n e e d e d for translation is extracted automatically from aligned corpora (Brown et al. 1990). These are just a f e w of the m a n y recent applications of corpus-based techniques in natural language processing.. • Department of Computer Science, Baltimore, MD 21218-2694. E-mail: [email protected].. © 1995 Association for Computational Linguistics. Computational Linguistics Volume 21, Number 4. Along with great research advances, the infrastructure is in place for this line of research to grow even stronger, with on-line corpora, the grist of the corpus-based natural language processing grindstone, getting bigger a n d better and becoming more readily available. There are a n u m b e r of efforts w o r l d w i d e to m a n u a l l y annotate large corpora with linguistic information, including parts of speech, phrase structure a n d predicate-argument structure (e.g., the Penn Treebank a n d the British National Corpus (Marcus, Santorini, a n d Marcinkiewicz 1993; Leech, Garside, a n d Bryant 1994)). A vast a m o u n t of on-line text is n o w available, a n d m u c h more will become available in the future. Useful tools, such as large aligned corpora (e.g., the aligned Hansards (Gale a n d Church 1991)) a n d semantic w o r d hierarchies (e.g., Wordnet (Miller 1990)), have also recently become available.. Corpus-based m e t h o d s are often able to succeed while ignoring the true complex- ities of language, banking on the fact that complex linguistic p h e n o m e n a can often be indirectly observed t h r o u g h simple epiphenomena. For example, one could accurately assign a part-of-speech tag to the w o r d race in (1-3) without a n y reference to phrase structure or constituent movement: One w o u l d only have to realize that, usually, a w o r d one or two w o r d s to the right of a m o d a l is a verb a n d not a noun. A n excep- tion to this generalization arises w h e n the w o r d is also one w o r d to the right of a determiner.. (1). (2). (3). He w i l l race/VERB the car.. He w i l l not race/VERB the car.. W h e n will the r a c e / N O U N end?. It is an exciting discovery that simple stochastic n - g r a m taggers can obtain v e r y high rates of tagging accuracy simply b y observing fixed-length w o r d sequences, with- out recourse to the u n d e r l y i n g linguistic structure. However, in order to m a k e progress in corpus-based natural language processing, w e m u s t become better aware of just w h a t cues to linguistic structure are being captured a n d w h e r e these approximations to the true u n d e r l y i n g p h e n o m e n a fail. With m a n y of the current corpus-based ap- proaches to natural language processing, this is a nearly impossible task. Consider the part-of-speech tagging example above. In a stochastic n - g r a m tagger, the informa- tion about w o r d s that follow modals w o u l d be h i d d e n d e e p l y in the thousands or tens of thousands of contextual probabilities (P(Tagi I Zagi-lZagi-2) ) a n d the result of multiplying different combinations of these probabilities together.. Below, w e describe a n e w approach to corpus-based natural language processing, called transformation-based error-driven learning. This algorithm has been applied to a n u m b e r of natural language problems, including part-of-speech tagging, prepositional phrase attachment disambiguation, a n d syntactic parsing (Brill 1992; Brill 1993a; Brill 1993b; Brill a n d Resnik 1994; Brill 1994). We have also recently b e g u n exploring the use of this technique for letter-to-sound generation a n d for building pronunciation networks for speech recognition. In this approach, the learned linguistic information is represented in a concise a n d easily u n d e r s t o o d form. This p r o p e r t y should m a k e transformation-based learning a useful tool for further exploring linguistic modeling a n d attempting to discover w a y s of m o r e tightly coupling the u n d e r l y i n g linguistic systems a n d o u r approximating models.. 544. Brill Transformation-Based Error-Driven Learning. UNANNOTATED. TEXT. STATE. /~kI~INOTETAx.T ~ D TRUTH. Figure 1 Transformation-Based Error-Driven Learning.. RULES. 2. Transformation-Based Error-Driven Leaming. Figure I illustrates how transformation-based error-driven learning works. First, unan- notated text is passed through an initial-state annotator. The initial-state annotator can range in complexity from assigning r a n d o m structure to assigning the o u t p u t of a sophisticated m a n u a l l y created annotator. In part-of-speech tagging, various initial- state annotators have been used, including: the o u t p u t of a stochastic n-gram tagger; labelling all words with their most likely tag as indicated in the training corpus; and naively labelling all words as nouns. For syntactic parsing, we have explored initial- state annotations ranging from the o u t p u t of a sophisticated parser to r a n d o m tree structure with r a n d o m nonterminal labels.. Once text has been passed through the initial-state annotator, it is then compared to the truth. A m a n u a l l y annotated corpus is used as our reference for truth. An ordered list of transformations is learned that can be applied to the o u t p u t of the initial-state annotator to make it better resemble the truth. There are two components to a transformation: a rewrite rule and a triggering environment. An example of a rewrite rule for part-of-speech tagging is:. Change the tag from modal to noun.. and an example of a triggering environment is:. The preceding word is a determiner.. Taken together, the transformation with this rewrite rule and triggering environ- m e n t w h e n applied to the word can w o u l d correctly change the mistagged:. The~determiner can~modal rusted~verb ./.. 545. Computational Linguistics Volume 21, Number 4. to:. The~determiner can~noun rusted~verb ./.. An example of a bracketing rewrite rule is: change the bracketing of a subtree from:. A. B C. to:. C. A B. where A, B and C can be either terminals or nonterminals. One possible set of trigger- ing environments is a n y combination of words, part-of-speech tags, and nonterminal labels within and adjacent to the subtree. Using this rewrite rule and the triggering environment A = the, the bracketing:. ( the ( boy ate ) ). w o u l d become:. ( ( the boy ) ate ). In all of the applications we have examined to date, the following greedy search is applied for deriving a list of transformations: at each iteration of learning, the transfor- mation is f o u n d whose application results in the best score according to the objective function being used; that transformation is then a d d e d to the ordered transforma- tion list and the training corpus is u p d a t e d by applying the learned transformation. Learning continues until no transformation can be found whose application results in an improvement to the annotated corpus. Other more sophisticated search techniques could be used, such as simulated annealing or learning with a look-ahead window, but we have not yet explored these alternatives.. Figure 2 shows an example of learning transformations. In this example, we as- sume there are only four possible transformations, T1 through T4, and that the ob- jective function is the total n u m b e r of errors. The u n a n n o t a t e d training corpus is processed by the initial-state annotator, and this results in an annotated corpus with 5,100 errors, determined by comparing the o u t p u t of the initial-state annotator with the m a n u a l l y derived annotations for this corpus. Next, we apply each of the possible transformations in turn and score the resulting annotated corpus. 1 In this example,. 1 In the real i m p l e m e n t a t i o n , the search is data driven, a n d therefore not all t r a n s f o r m a t i o n s n e e d to be examined.. 546. Brill Transformation-Based Error-Driven Learning. Unannotated. Corpus. I Initial State. Annotator. Annotated. Corpus. Errors = 5,100. T I . I. Annotated. Corpus. Errors = 5,100. Annotated. Corpus. Errors = 3,145. Annotated. Corpus. Errors = 3,910. T1. Annotated. Corpus. Annotated T2 Corpus. Errors = 2,110. Annotated. Corpus. Errors = 1,231. W4 Ano ted 4 IAnnotate' Corpus Corpus /. Errors = 6,300 Errors = 4 , 2 5 ~ . Figure 2 An Example of Transformation-Based Error-Driven Learning.. T1. 1"2. T3. T4. Annotated. Corpus. Errors = 1,410. Annotated. Corpus. Errors = 1,251. Annotated Corpus. Errors = 1,231. Annotated Corpus. Errors = 1,231. applying transformation T2 results in the largest reduction of errors, so T2 is learned as the first transformation. T2 is then applied to the entire corpus, and learning con- tinues. At this stage of learning, transformation T3 results in the largest reduction of error, so it is learned as the second transformation. After applying the initial-state annotator, followed by T2 and then T3, no further reduction in errors can be obtained from applying a n y of the transformations, so learning stops. To annotate fresh text, this text is first annotated by the initial-state annotator, followed by the application of transformation T2 and then by the application of T3.. To define a specific application of transformation-based learning, one m u s t specify the following:. .. 2.. .. The initial state-annotator.. The space of allowable transformations (rewrite rules and triggering environments).. The objective function for comparing the corpus to the truth and choosing a transformation.. In cases where the application of a particular transformation in one environment could affect its application in another environment, two additional parameters m u s t be specified: the order in which transformations are applied to a corpus, and whether a transformation is applied immediately or only after the entire corpus has been ex- amined for triggering environments. For example, take the sequence:. A A A A A A . and the transformation:. 547. Computational Linguistics Volume 21, Number 4. Change the label from A to B if the preceding label is A.. If the effect of the application of a transformation is not written o u t until the entire file has b e e n processed for that one transformation, then regardless of the order of processing the o u t p u t will be:. A B B B B B , . since the triggering environment of a transformation is always checked before that transformation is applied to any s u r r o u n d i n g objects in the corpus. If the effect of a transformation is recorded immediately, then processing the string left to right w o u l d result in:. A B A B A B , . whereas processing right to left w o u l d result in:. A B B B B B . . 3. A Comparison With D e c i s i o n Trees. The technique e m p l o y e d b y the learner is s o m e w h a t similar to that u s e d in decision trees (Breiman et al. 1984; Quinlan 1986; Quinlan and Rivest 1989). A decision tree is trained on a set of preclassified entities and o u t p u t s a set of questions that can be asked a b o u t an entity to determine its p r o p e r classification. Decision trees are built b y finding the question w h o s e resulting partition is the purest, 2 splitting the training data according to that question, a n d then recursively reapplying this p r o c e d u r e on each resulting subset.. We first s h o w that the set of classifications that can b e p r o v i d e d via decision trees is a p r o p e r subset of those that can b e p r o v i d e d via transformation lists (an ordered list of transformation-based rules), given the s a m e set of primitive questions. We then give s o m e practical differences b e t w e e n the t w o learning methods.. 3.1 D e c i s i o n Trees c_ Transformation Lists We p r o v e here that for a fixed set of primitive queries, a n y binary decision tree can b e converted into a transformation list. Extending the p r o o f b e y o n d binary trees is straightforward.. Proof (by induction) Base Case: Given the following primitive decision tree, w h e r e the classification is A if the. a n s w e r to the q u e r y X? is yes, and the classification is B if the a n s w e r is no:. X?. B A. 2 O n e p o s s i b l e m e a s u r e for p u r i t y is e n t r o p y r e d u c t i o n . . 548. this tree can be c o n v e r t e d into the f o l l o w i n g t r a n s f o r m a t i o n list:. .. 2.. 3.. X?. Label w i t h S / * Start State A n n o t a t i o n */. If X, t h e n S --* A. S --* B / * E m p t y Tagging E n v i r o n m e n t - - A l w a y s A p p l i e s To Entities C u r r e n t l y Labeled With S */. Induction: A s s u m e t h a t t w o decision trees T1 a n d T2 h a v e c o r r e s p o n d i n g t r a n s f o r m a t i o n lists. L1 a n d L2. A s s u m e t h a t the a r b i t r a r y label n a m e s c h o s e n in c o n s t r u c t i n g L1 are n o t u s e d in L2, a n d t h a t those in L2 are n o t u s e d in L1. G i v e n a n e w decision tree T3 c o n s t r u c t e d f r o m T1 a n d T2 as follows:. Brill Transformation-Based Error-Driven Learning. w e c o n s t r u c t a n e w t r a n s f o r m a t i o n list L3. A s s u m e the first t r a n s f o r m a t i o n in L1 is:. Label w i t h S'. a n d the first t r a n s f o r m a t i o n in L2 is:. Label w i t h S". The first three t r a n s f o r m a t i o n s in L3 will t h e n be:. 1. Label w i t h S. 2. If X t h e n S --* S'. 3. S --+ S". f o l l o w e d b y all of the rules in L1 o t h e r t h a n the first rule, f o l l o w e d b y all of the rules in L2 o t h e r t h a n the first rule. The r e s u l t i n g t r a n s f o r m a t i o n list will first label a n i t e m as S' if X is true, or as S" if X is false. Next, the t r a n f o r m a t i o n s f r o m L1 will be a p p l i e d if X is true, since S' is the initial-state label for L1. If X is false, the t r a n s f o r m a t i o n s f r o m L2 will be applied, b e c a u s e S" is the initial-state label for L2. []. 3.2 D e c i s i o n Trees # Transformation Lists We s h o w here t h a t there exist t r a n s f o r m a t i o n lists for w h i c h n o e q u i v a l e n t decision trees exist, for a fixed set of p r i m i t i v e queries. The f o l l o w i n g classification p r o b l e m is o n e example. G i v e n a s e q u e n c e of characters, classify a character b a s e d o n w h e t h e r the p o s i t i o n i n d e x of a character is divisible b y 4, q u e r y i n g o n l y u s i n g a c o n t e x t of t w o characters to the left of the character b e i n g classified.. 549. Computational Linguistics Volume 21, Number 4. Assuming transformations are applied left to right on the sequence, the above classification problem can be solved for sequences of arbitrary length if the effect of a transformation is written out immediately, or for sequences up to a n y prespecified length if a transformation is carried out only after all triggering environments in the corpus are checked. We present the proof for the former case.. Given the input sequence:. A A A A A A A A A A 0 1 2 3 4 5 6 7 8 9. the underlined characters should be classified as true because their indices are 0, 4, and 8. To see w h y a decision tree could not perform this classification, regardless of order of classification, note that, for the two characters before both A3 and A4, both the characters and their classifications are the same, although these two characters should be classified differently. Below is a transformation list for performing this classification. Once again, we assume transformations are applied left to right and that the result of a transformation is written out immediately, so that the result of applying transformation x to character ai will always be k n o w n w h e n applying transformation x to ai+l.. 1. Label with S RESULT: A / S A / S A / S A / S . 2. If there is no previous character, RESULT: A / F A / S A / S A / S . 3. If the character two to the left is RESULT: A / F A / S A / F A / S . 4. If the character two to the left is RESULT: A / F A / S A / S A / S . 5. F --+ yes. 6. S--* no. A / S A / S A / S A / S A / S A / S A / S . then S ~ F A/S A/S A/S A/S A/S A/S A/S. labelled with F, then S --* F A / F A / S A / F A / S A / F A / S A / F . labelled with F, then F ~ S A / F A / S A / S A / S A / F A / S A / S . RESULT: A / y e s A / n o A / n o A / n o A / y e s A / n o A / n o A / n o A / y e s A / n o A / n o . The extra power of transformation lists comes from the fact that intermediate results from the classification of one object are reflected in the current label of that object, thereby making this intermediate information available for use in classifying other objects. This is not the case for decision trees, where the outcome of questions asked is saved implicitly by the current location within the tree.. 3.3 S o m e P r a c t i c a l D i f f e r e n c e s B e t w e e n D e c i s i o n T r e e s a n d T r a n s f o r m a t i o n L i s t s There are a n u m b e r of practical differences between transformation-based error-driven learning and learning decision trees. One difference is that w h e n training a decision tree, each time the depth of the tree is increased, the average a m o u n t of training mate- rial available per node at that new depth is halved (for a binary tree). In transformation- based learning, the entire training corpus is used for finding all transformations. There- fore, this m e t h o d is not subject to the sparse data problems that arise as the d e p t h of the decision tree being learned increases.. Transformations are ordered, with later transformations being d e p e n d e n t u p o n the outcome of applying earlier transformations. This allows intermediate results in. 550. Brill Transformation-Based Error-Driven Learning. class ifying o n e object to b e available in classifying o t h e r objects. F o r i n st an ce, w h e t h e r t h e p r e v i o u s w o r d is t a g g e d as to-infinitival or to-preposition m a y b e a g o o d c u e fo r d e- t e r m i n i n g t h e p a r t of s p e e c h of a w o r d . 3 If, initially, t h e w o r d to is n o t r e l i a b l y t a g g e d e v e r y w h e r e in the c o r p u s w i t h its p r o p e r tag (or n o t t a g g e d at all), t h e n this c u e will b e unreliable. T h e t r a n s f o r m a t i o n - b a s e d l e a r n e r will d e l a y p o s i t i n g a t r a n s f o r m a t i o n t r i g g e r e d b y t h e tag of the w o r d to u n t i l o t h e r t r a n s f o r m a t i o n s h a v e r e s u l t e d in a m o r e reliable t a g g i n g of this w o r d in t h e c o r p u s . For a d e c i s i o n t ree to t ak e a d v a n t a g e o f this i n f o r m a t i o n , a n y w o r d w h o s e o u t c o m e is d e p e n d e n t u p o n t h e t a g g i n g o f to w o u l d n e e d t h e e n t i r e d e c i s i o n tree s t r u c t u r e for the p r o p e r classification o f e a c h o c c u r r e n c e of to b u i l t into its d e c i s i o n tree path. If t h e classification o f to w e r e d e p e n d e n t u p o n t h e classification of y e t a n o t h e r w o r d , this w o u l d h a v e to b e b u i l t i n t o t h e d e c i s i o n t ree as well. U n l i k e d e c i s i o n trees, in t r a n s f o r m a t i o n - b a s e d l e a r n i n g , i n t e r m e d i a t e classifica- tion results are available a n d c a n b e u s e d as classification p r o g r e s s e s . E v e n if d e c i s i o n trees are a p p l i e d to a c o r p u s in a left-to-right fashi o n , t h e y are a l l o w e d o n l y o n e p a s s in w h i c h to p r o p e r l y classify.. Since a t r a n s f o r m a t i o n list is a p r o c e s s o r a n d n o t a classifier, it can r e a d i l y b e u s e d as a p o s t p r o c e s s o r to a n y a n n o t a t i o n s y s t e m . In a d d i t i o n to a n n o t a t i n g f r o m scratch, rules c a n b e l e a r n e d to i m p r o v e the p e r f o r m a n c e o f a m a t u r e a n n o t a t i o n s y s t e m b y u s i n g t h e m a t u r e s y s t e m as the initial-state a n n o t a t o r . This c a n h a v e t h e a d d e d a d v a n t a g e t h a t the list of t r a n s f o r m a t i o n s l e a r n e d u s i n g a m a t u r e a n n o t a t i o n s y s t e m as the initial-state a n n o t a t o r p r o v i d e s a r e a d a b l e d e s c r i p t i o n o r classification o f t h e e r r o r s the m a t u r e s y s t e m m a k e s , t h e r e b y a i d i n g in t h e r e f i n e m e n t o f t h a t s y s t e m . T h e fact t h a t it is a p r o c e s s o r gives a t r a n s f o r m a t i o n - b a s e d l e a r n e r g r e a t e r t h a n t h e classifier-based d e c i s i o n tree. For e x a m p l e , in a p p l y i n g t r a n s f o r m a t i o n - b a s e d l e a r n i n g to p a r s i n g , a r u l e can a p p l y a n y s t r u c t u r a l c h a n g e to a tree. In tagging, a r u l e s u c h as:. Change the tag of the current word to X, and of the previous word to Y, if Z holds. c a n easily b e h a n d l e d in the p r o c e s s o r - b a s e d s y s t e m , w h e r e a s it w o u l d b e difficult to h a n d l e in a classification s y s t e m . . In t r a n s f o r m a t i o n - b a s e d l e a r n i n g , the objective f u n c t i o n u s e d in t r a i n i n g is t h e s a m e as t h a t u s e d for e v a l u a t i o n , w h e n e v e r this is feasible. In a d e c i s i o n tree, u s i n g sys- t e m a c c u r a c y as a n objective f u n c t i o n for t r a i n i n g t y p i c a l l y resu l t s in p o o r p e r f o r m a n c e 4 a n d s o m e m e a s u r e of n o d e p u r i t y , s u c h as e n t r o p y r e d u c t i o n , is u s e d i n st ead . T h e di- rect c o r r e l a t i o n b e t w e e n rules a n d p e r f o r m a n c e i m p r o v e m e n t in t r a n s f o r m a t i o n - b a s e d l e a r n i n g c a n m a k e the l e a r n e d r u l e s m o r e r e a d i l y i n t e r p r e t a b l e t h a n d e c i s i o n t ree r u l e s for i n c r e a s i n g p o p u l a t i o n purity, s. 4. Part of Speech Tagging: A Case Study in Transformation-Based Error-Driven Learning. In this section w e d e s c r i b e the p r a c t i c a l a p p l i c a t i o n o f t r a n s f o r m a t i o n - b a s e d l e a r n i n g to p a r t - o f - s p e e c h tagging. 6 P a r t - o f - s p e e c h t a g g i n g is a g o o d a p p l i c a t i o n to test t h e. 3 T h e o r i g i n a l t a g g e d B r o w n C o r p u s ( F r a n c i s a n d K u c e r a , 1982) m a k e s t h i s d i s t i n c t i o n ; t h e P e n n T r e e b a n k ( M a r c u s , S a n t o r i n i , a n d M a r c i n k i e w i c z , 1993) d o e s n o t . . 4 F o r a d i s c u s s i o n o f w h y t h i s is t h e case, s e e B r e i m a n et al. (1984, 94-98). 5 F o r a d i s c u s s i o n o f o t h e r i s s u e s r e g a r d i n g t h e s e t w o l e a r n i n g a l g o r i t h m s , s e e R a m s h a w a n d M a r c u s . (1994). 6 A l l of t h e p r o g r a m s d e s c r i b e d h e r e i n a r e f r e e l y a v a i l a b l e w i t h n o r e s t r i c t i o n s o n u s e o r r e d i s t r i b u t i o n . . F o r i n f o r m a t i o n o n o b t a i n i n g t h e tagger, c o n t a c t t h e a u t h o r . . 551. Computational Linguistics Volume 21, Number 4. learner, for several reasons. There are a n u m b e r of large tagged corpora available, allowing for a variety of experiments to be run. Part-of-speech tagging is an active area of research; a great deal of w o r k has b e e n d o n e in this area over the past f e w years (e.g., Jelinek 1985; Church 1988; Derose 1988; Hindle 1989; DeMarcken 1990; Merialdo 1994; Brill 1992; Black et al. 1992; Cutting et al. 1992; Kupiec 1992; Charniak et al. 1993; Weischedel et al. 1993; Schutze and Singer 1994).. Part-of-speech tagging is also a v e r y practical application, with uses in m a n y areas, including speech recognition and generation, machine translation, parsing, information retrieval and lexicography. Insofar as tagging can b e seen as a prototypical p r o b l e m in lexical ambiguity, advances in part-of-speech tagging could readily translate to progress in other areas of lexical, and perhaps structural, ambiguity, such as w o r d - sense disambiguation and prepositional phrase attachment disambiguation. 7 Also, it is possible to cast a n u m b e r of other useful problems as part-of-speech tagging problems, such as letter-to-sound translation (Huang, Son-Bell, and Baggett 1994) and building pronunciation n e t w o r k s for speech recognition. Recently, a m e t h o d has b e e n p r o p o s e d for using part-of-speech tagging techniques as a m e t h o d for parsing with lexicalized grammars (Joshi and Srinivas 1994).. W h e n a u t o m a t e d part-of-speech tagging w a s initially explored (Klein a n d Sim- m o n s 1963; Harris 1962), people manually engineered rules for tagging, sometimes with the aid of a corpus. As large corpora became available, it b e c a m e clear that simple M a r k o v - m o d e l b a s e d stochastic taggers that w e r e automatically trained could achieve high rates of tagging accuracy (Jelinek 1985). M a r k o v - m o d e l b a s e d taggers assign to a sentence the tag sequence that maximizes Prob(word I tag),Prob(tag I previous n tags). These probabilities can b e estimated directly from a m a n u a l l y tagged corpus, s These stochastic taggers have a n u m b e r of a d v a n t a g e s over the m a n u a l l y built taggers, in- cluding obviating the n e e d for laborious manual rule construction, and possibly cap- turing useful information that m a y not have b e e n noticed b y the h u m a n engineer. H o w e v e r , stochastic taggers have the d i s a d v a n t a g e that linguistic information is cap- tured only indirectly, in large tables of statistics. Almost all recent w o r k in d e v e l o p i n g automatically trained part-of-speech taggers has b e e n on further exploring Markov- m o d e l b a s e d tagging (Jelinek 1985; Church 1988; Derose 1988; DeMarcken 1990; Meri- aldo 1994; Cutting et al. 1992; Kupiec 1992; Charniak et al. 1993; Weischedel et al. 1993; Schutze a n d Singer 1994).. 4.1 Transformation-based Error-driven Part-of-Speech Tagging Transformation-based part of speech tagging w o r k s as follows. 9 The initial-state an- notator assigns each w o r d its m o s t likely tag as indicated in the training corpus. The m e t h o d u s e d for initially tagging u n k n o w n w o r d s will b e described in a later section. A n ordered list of transformations is then learned, to i m p r o v e tagging accuracy b a s e d on contextual cues. These transformations alter the tagging of a w o r d from X to Y iff. 7 In Brill and Resnik (1994), w e describe an approach to prepositional phrase attachment disambiguation that obtains highly competitive performance compared to other corpus-based solutions to this problem. This system was derived in u n d e r two hours from the transformation-based part of speech tagger described in this paper.. 8 One can also estimate these probabilities without a manually tagged corpus, using a h i d d e n Markov model. However, it appears to be the case that directly estimating probabilities from even a very small manually tagged corpus gives better results than training a h i d d e n Markov m o d e l on a large untagged corpus (see Merialdo (1994)).. 9 Earlier versions of this work were reported in Brill (1992, 1994).. 552. Brill Transformation-Based Error-Driven Learning. either:. 1. T h e w o r d w a s n o t s e e n in t h e t r a i n i n g c o r p u s O R . 2. T h e w o r d w a s s e e n t a g g e d w i t h ¥ a t l e a s t o n c e in t h e t r a i n i n g c o r p u s . . I n t a g g e r s b a s e d o n M a r k o v m o d e l s , t h e l e x i c o n c o n s i s t s o f p r o b a b i l i t i e s o f t h e s o m e w h a t c o u n t e r i n t u i t i v e b u t p r o p e r f o r m P(WORD I TAG). I n t h e t r a n s f o r m a t i o n - b a s e d t a g g e r , t h e l e x i c o n is s i m p l y a list o f all t a g s s e e n f o r a w o r d in t h e t r a i n i n g c o r p u s , w i t h o n e t a g l a b e l e d as t h e m o s t likely. B e l o w w e s h o w a lexical e n t r y f o r t h e . w o r d half in t h e t r a n s f o r m a t i o n - b a s e d tagger. 1°. h a l f : C D D T JJ N N P D T RB VB. T h i s e n t r y lists t h e s e v e n t a g s s e e n f o r half in t h e t r a i n i n g c o r p u s , w i t h N N m a r k e d as t h e m o s t likely. B e l o w a r e t h e lexical e n t r i e s f o r half in a M a r k o v m o d e l t a g g e r , e x t r a c t e d f r o m t h e s a m e c o r p u s : . P(half l CD ) = 0.000066 P(half l DT ) = 0.000757. P(half I J J) = 0.000092 P(half INN) = 0.000702. P(half l PDT ) = 0.039945 P(half l RB ) = 0.000443 P(half I VB ) = 0.000027. It is difficult to m a k e m u c h s e n s e o f t h e s e e n t r i e s in isolation; t h e y h a v e to b e v i e w e d in t h e c o n t e x t o f t h e m a n y c o n t e x t u a l p r o b a b i l i t i e s . . First, w e will d e s c r i b e a n o n l e x i c a l i z e d v e r s i o n o f t h e t a g g e r , w h e r e t r a n s f o r m a t i o n t e m p l a t e s d o n o t m a k e r e f e r e n c e to specific w o r d s . I n t h e n o n l e x i c a l i z e d t a g g e r , t h e t r a n s f o r m a t i o n t e m p l a t e s w e u s e are:. Change tag a to tag b when:. 1. T h e p r e c e d i n g ( f o l l o w i n g ) w o r d is t a g g e d z.. 2. T h e w o r d t w o b e f o r e (after) is t a g g e d z.. 3. O n e o f t h e t w o p r e c e d i n g ( f o l l o w i n g ) w o r d s is t a g g e d z.. 4. O n e o f t h e t h r e e p r e c e d i n g ( f o l l o w i n g ) w o r d s is t a g g e d z.. 5. T h e p r e c e d i n g w o r d is t a g g e d z a n d t h e f o l l o w i n g w o r d is t a g g e d w.. 6. T h e p r e c e d i n g ( f o l l o w i n g ) w o r d is t a g g e d z a n d t h e w o r d t w o b e f o r e (after) is t a g g e d w.. w h e r e a, b, z a n d w a r e v a r i a b l e s o v e r t h e s e t o f p a r t s o f s p e e c h . . To l e a r n a t r a n s f o r m a t i o n , t h e l e a r n e r , in e s s e n c e , t r i e s o u t e v e r y p o s s i b l e t r a n s - f o r m a t i o n , 1I a n d c o u n t s t h e n u m b e r o f t a g g i n g e r r o r s a f t e r e a c h o n e is a p p l i e d . A f t e r . 10 A description of the partoof-speech tags is provided in Appendix A. 11 All possible instantiations of transformation templates.. 553. Computational Linguistics Volume 21, Number 4. 1. apply initial-state annotator to corpus 2. w h i l e transformations can still be found do 3. for from_tag = tag1 to tagn 4. for to_tag = tag1 to tagn 5. for corpus_position = 1 to corpus_size 6. if (correct_tag(corpus_position) --= to_tag. && current_tag(corpus_position) == from_tag) 7. num_good_transformations(tag(corpus_position -1))++ 8. else if (correct_tag(corpus_position) == from_tag. && current_tag(corpus_position) == from_tag) 9. num_bad_transformations(tag(corpus_position-1 ))++ 10. find maxT (num_good_transformations(T) - num_bad_transformations(T)) 11. if this is the best-scoring rule f o u n d yet then store as best rule:. Change tag from from_tag to to_tag if previous tag is T 12. apply best rule to training corpus 13. a p p e n d best rule to ordered list of transformations. Figure 3 Pseudocode for learning transformations.. all possible transformations have been tried, the transformation that resulted in the greatest error reduction is chosen. Learning stops w h e n no transformations can be f o u n d whose application reduces errors b e y o n d some prespecified threshold.. In the experiments described below, processing was done left to right. For each transformation application, all triggering environments are first f o u n d in the corpus, and then the transformation triggered by each triggering environment is carried out.. The search is data-driven, so only a very small percentage of possible transfor- mations really need be examined. In figure 3, we give pseudocode for the learning algorithm in the case where there is only one transformation template:. Change the tag from X to Y if the previous tag is Z.. In each learning iteration, the entire training corpus is examined once for every pair of tags X and Y, finding the best transformation whose rewrite changes tag X to tag Y. For every w o r d in the corpus whose environment matches the triggering environment, if the word has tag X and X is the correct tag, then making this transformation will result in an additional tagging error, so we increment the n u m b e r of errors caused w h e n making the transformation given the part-of-speech tag of the previous w o r d (lines 8 and 9). If X is the current tag and Y is the correct tag, then the transformation will result in one less error, so we increment the n u m b e r of improvements caused w h e n making the transformation given the part-of-speech tag of the previous w o r d (lines 6 and 7).. In certain cases, a significant increase in speed for training the transformation- based tagger can be obtained by indexing in the corpus where different transformations can and do apply. For a description of a fast index-based training algorithm, see Ramshaw and Marcus (1994).. In figure 4, we list the first t w e n t y transformations learned from training on the Penn Treebank Wall Street Journal Corpus (Marcus, Santorini, and Marcinkiewicz 1993). 12 The first transformation states that a n o u n should be changed to a verb if. 12 V e r s i o n 0.5 o f t h e P e n n T r e e b a n k w a s u s e d i n all e x p e r i m e n t s r e p o r t e d in t h i s p a p e r . . 554. Brill Transformation-Based Error-Driven Learning. Change Tag # From To 1 N N VB 2 VBP VB 3 N N VB 4 VB N N 5 VBD VBN 6 VBN VBD 7 VBN VBD 8 VBD VBN 9 VBP VB 10 POS VBZ 11 VB VBP 12 VBD VBN 13 IN WDT 14 VBD VBN 15 VB VBP 16 IN WDT 17 IN DT 18 JJ N N P 19 IN WDT 20 JJR RBR. Figure 4. Condition Previous tag is TO. One of the previous three tags is M D One of the previous two tags is M D One of the previous two tags is D T . One of the previous three tags is VBZ Previous tag is PRP Previous tag is N N P Previous tag is VBD Previous tag is TO. Previous tag is PRP Previous tag is N N S . One of previous three tags is VBP One of next two tags is VB. One of previous two tags is VB Previous tag is PRP. Next tag is VBZ Next tag is N N . Next tag is N N P Next tag is VBD. Next tag is JJ. The first 20 nonlexicalized transformations.. the previous tag is TO, as in: to~TO conflict/NN--.VB with. The second transforma- tion fixes a tagging such as: might/MD vanish/VBP--.VB. The third fixes might/MD not reply/NN--.VB. The tenth transformation is for the t o k e n ' s , which is a separate token in the Penn Treebank. 's is most frequently used as a possessive ending, but after a personal p r o n o u n , it is a verb (John's, c o m p a r e d to he 's). The transformations chang- ing IN to WDT are for tagging the w o r d that, to d e t e r m i n e in which environments that is being used as a s y n o n y m of which.. 4.2 Lexicalizing the Tagger In general, no relationships b e t w e e n w o r d s have been directly e n c o d e d in stochas- tic n-gram taggers. 13 In the Markov m o d e l typically used for stochastic tagging, state transition probabilities (P(Tagi I Tagi_l... Tagi-n)) express the likelihood of a tag im- mediately following n other tags, and emit probabilities (P(Wordj I Tagi)) express the likelihood of a word, given a tag. M a n y useful relationships, such as that b e t w e e n a w o r d and the previous word, or b e t w e e n a tag and the following word, are not di- rectly captured by M a r k o v - m o d e l based taggers. The same is true of the nonlexicalized transformation-based tagger, w h e r e transformation templates do not make reference to words.. To r e m e d y this problem, we extend the transformation-based tagger b y a d d i n g . 13 In Kupiec (1992), a limited amount of lexicalization is introduced by having a stochastic tagger with word states for the 100 most frequent words in the corpus.. 555. Computational Linguistics Volume 21, Number 4. c o n t e x t u a l t r a n s f o r m a t i o n s t h a t c a n m a k e r e f e r e n c e to w o r d s as w e l l as p a r t - o f - s p e e c h tags. T h e t r a n s f o r m a t i o n t e m p l a t e s w e a d d are:. Change tag a to tag b when:. .. 2.. 3.. 4.. 5.. 6.. 7.. .. T h e . T h e . T h e . T h e t.. T h e p r e c e d i n g (following) w o r d is w.. T h e w o r d t w o b e f o r e (after) is w.. O n e of the t w o p r e c e d i n g (following) w o r d s is w.. c u r r e n t w o r d is w a n d t h e p r e c e d i n g (fo l l o w i n g ) w o r d is x.. c u r r e n t w o r d is w a n d t h e p r e c e d i n g (following) w o r d is t a g g e d z.. c u r r e n t w o r d is w.. p r e c e d i n g (following) w o r d is w a n d t h e p r e c e d i n g (fo l l o w i n g ) tag is. T h e c u r r e n t w o r d is w, t h e p r e c e d i n g (following) w o r d is w2 a n d t h e p r e c e d i n g (following) tag is t.. w h e r e w a n d x are v a r i a b l e s o v e r all w o r d s i n t h e t r a i n i n g c o r p u s , a n d z a n d t are v a r i a b l e s o v e r all p a r t s o f s p e e c h . . BelOw w e list t w o l e x i c a l i z e d t r a n s f o r m a t i o n s t h a t w e r e l e a r n e d , t r a i n i n g o n c e a g a i n o n t h e Wall Street Journal.. C h a n g e t h e tag:. (12) F r o m I N to RB if the w o r d t w o p o s i t i o n s to t h e r i g h t is as. (16) F r o m V B P to VB if o n e of t h e p r e v i o u s t w o w o r d s is n ' t . TM. T h e P e n n T r e e b a n k t a g g i n g style m a n u a l specifies t h a t in t h e c o l l o c a t i o n as . . . as, the first as is t a g g e d as a n a d v e r b a n d the s e c o n d is t a g g e d as a p r e p o s i t i o n . Since as is m o s t f r e q u e n t l y t a g g e d as a p r e p o s i t i o n in t h e t r a i n i n g c o r p u s , t h e initial-state t a g g e r will m i s t a g t h e p h r a s e as tall as as:. a s / I N tall/JJ a s / I N . T h e first l e x i c a l i z e d t r a n s f o r m a t i o n c o r r e c t s this m i s t a g g i n g . N o t e t h a t a b i g r a m t a g g e r t r a i n e d o n o u r t r a i n i n g set w o u l d n o t c o r r e c t l y t ag t h e first o c c u r r e n c e o f as. A l t h o u g h a d v e r b s are m o r e likely t h a n p r e p o s i t i o n s to f o l l o w s o m e v e r b f o r m tags, t h e fact t h a t P(as ] IN) is m u c h g r e a t e r t h a n P(as ] RB), a n d P(JJ ] IN) is m u c h g r e a t e r t h a n P(JJ ] RB) l e a d to as b e i n g i n c o r r e c t l y t a g g e d as a p r e p o s i t i o n b y a st o ch ast i c tagger. A t r i g r a m t a g g e r will c o r r e c t l y tag this c o l l o c a t i o n in s o m e i n st an ces, d u e to t h e fact t h a t P ( I N ] RB JJ) is g r e a t e r t h a n P ( I N ] I N JJ), b u t t h e o u t c o m e will b e h i g h l y d e p e n d e n t u p o n the c o n t e x t in w h i c h this c o l l o c a t i o n a p p e a r s . . T h e s e c o n d t r a n s f o r m a t i o n arises f r o m t h e fact t h a t w h e n a v e r b a p p e a r s in a c o n t e x t s u c h as We do n't eat o r We did n't usually drink, t h e v e r b is in b a s e f o r m . A s tochastic t r i g r a m t a g g e r w o u l d h a v e to c a p t u r e this linguistic i n f o r m a t i o n i n d i r e c t l y f r o m f r e q u e n c y c o u n t s of all t r i g r a m s of t h e f o r m s h o w n i n f i g u r e 5 ( w h e r e a st ar c a n m a t c h a n y p a r t - o f - s p e e c h tag) a n d f r o m t h e fact t h a t P(n't ] RB) is fai rl y high.. 14 In the Penn Treebank, n't is treated as a separate token, so don't becomes do/VBP n't/RB.. 556. Brill Transformation-Based Error-Driven Learning. * RB VBP * RB VB RB * VBP RB * VB. F i g u r e 5 Trigram Tagger Probability Tables.. In Weischedel et al. (1993), results are given w h e n training and testing a Markov- m o d e l based tagger on the Penn Treebank Tagged Wall Street Journal Corpus. T h e y cite results making the closed vocabulary assumption that all possible tags for all w o r d s in the test set are known. W h e n training contextual probabilities on one million words, an accuracy of 96.7% was achieved. Accuracy d r o p p e d to 96.3% w h e n contextual prob- abilities were trained on 64,000 words. We trained the transformation-based tagger on the same corpus, making the same closed-vocabulary assumption. 15 W h e n training contextual rules on 600,000 words, an accuracy of 97.2% was achieved on a separate 150,000 w o r d test set. W h e n the training set was r e d u c e d to 64,000 words, accuracy d r o p p e d to 96.7%. The transformation-based learner achieved better performance, de- spite the fact that contextual information was captured in a small n u m b e r of simple nonstochastic rules, as o p p o s e d to 10,000 contextual probabilities that were learned b y the stochastic tagger. These results are s u m m a r i z e d in table 1. W h e n training on 600,000 words, a total of 447 transformations were learned. However, transformations t o w a r d the end of the list contribute v e r y little to accuracy: a p p l y i n g only the first 200 learned transformations to the test set achieves an accuracy of 97.0%; applying the first 100 gives an accuracy of 96.8%. To match the 96.7% accuracy achieved b y the stochas- tic tagger w h e n it was trained on one million words, only the first 82 transformations are needed.. To see w h e t h e r lexicalized transformations were contributing to the transformation- based tagger accuracy rate, we first trained the tagger using the nonlexical transfor- mation template subset, then ran exactly the same test. Accuracy of that tagger was 97.0%. A d d i n g lexicalized transformations resulted in a 6.7% decrease in the error rate (see table 1 ) . 16. We f o u n d it a bit surprising that the addition of lexicalized transformations did not result in a m u c h greater i m p r o v e m e n t in performance. W h e n transformations are allowed to make reference to w o r d s and w o r d pairs, some relevant information is probably missed d u e to sparse data. We are currently exploring the possibility of incorporating w o r d classes into the rule-based learner, in hopes of o v e r c o m i n g this problem. The idea is quite simple. Given any source of w o r d class information, such. 15 I n b o t h W e i s c h e d e l et al. (1993) a n d h e r e , t h e t e s t s e t w a s i n c o r p o r a t e d i n t o t h e lexicon, b u t w a s n o t u s e d i n l e a r n i n g c o n t e x t u a l i n f o r m a t i o n . T e s t i n g w i t h n o u n k n o w n w o r d s m i g h t s e e m like a n u n r e a l i s t i c test. W e h a v e d o n e s o for t h r e e r e a s o n s : (1) to a l l o w for a c o m p a r i s o n w i t h p r e v i o u s l y q u o t e d r e s u l t s , (2) to i s o l a t e k n o w n w o r d a c c u r a c y f r o m u n k n o w n w o r d a c c u r a c y , a n d (3) i n s o m e s y s t e m s , s u c h a s a c l o s e d v o c a b u l a r y s p e e c h r e c o g n i t i o n s y s t e m , t h e a s s u m p t i o n t h a t all w o r d s a r e k n o w n is v a l i d . (We s h o w r e s u l t s w h e n u n k n o w n w o r d s a r e i n c l u d e d l a t e r i n t h e p a p e r . ) . 16 T h e t r a i n i n g w e d i d h e r e w a s s l i g h t l y s u b o p t i m a l , i n t h a t w e u s e d t h e c o n t e x t u a l r u l e s l e a r n e d w i t h u n k n o w n w o r d s ( d e s c r i b e d i n t h e n e x t s e c t i o n ) , a n d filled i n t h e d i c t i o n a r y , r a t h e r t h a n t r a i n i n g o n a c o r p u s w i t h o u t u n k n o w n w o r d s . . 557. Computational Linguistics Volume 21, N u m b e r 4. Table 1 Comparison of Tagging Accuracy With No Unknown Words. Training # of Rules Corpus or Context. Acc.. Method Size (Words) Probs. (%) Stochastic 64 K 6,170 96.3 Stochastic 1 Million 10,000 96.7. Rule-Based With Lex. Rules 64 K 215 96.7. Rule-Based With Lex. Rules 600 K 447 97.2. Rule-Based w / o Lex. Rules 600 K 378 97.0. as W o r d N e t (Miller 1990), t h e l e a r n e r is e x t e n d e d s u c h t h a t a r u l e is a l l o w e d to m a k e r e f e r e n c e to p a r t s o f s p e e c h , w o r d s , a n d w o r d classes, a l l o w i n g f o r r u l e s s u c h as. Change the tag from X to Y if the following word belongs to word class Z.. T h i s a p p r o a c h h a s a l r e a d y b e e n s u c c e s s f u l l y a p p l i e d to a s y s t e m f o r p r e p o s i t i o n a l p h r a s e a t t a c h m e n t d i s a m b i g u a t i o n (Brill a n d R e s n i k 1994).. 4.3 Tagging U n k n o w n Words So far, w e h a v e n o t a d d r e s s e d t h e p r o b l e m o f u n k n o w n w o r d s . A s s t a t e d a b o v e , t h e initial-state a n n o t a t o r f o r t a g g i n g a s s i g n s all w o r d s t h e i r m o s t l i k e l y tag, a s i n d i c a t e d in a t r a i n i n g c o r p u s . B e l o w w e s h o w h o w a t r a n s f o r m a t i o n - b a s e d a p p r o a c h c a n b e t a k e n f o r t a g g i n g u n k n o w n w o r d s , b y a u t o m a t i c a l l y l e a r n i n g c u e s to p r e d i c t t h e m o s t l i k e l y t a g f o r w o r d s n o t s e e n in t h e t r a i n i n g c o r p u s . I f t h e m o s t l i k e l y t a g f o r u n k n o w n w o r d s c a n b e a s s i g n e d w i t h h i g h a c c u r a c y , t h e n t h e c o n t e x t u a l r u l e s c a n b e u s e d to i m p r o v e a c c u r a c y , as d e s c r i b e d a b o v e . . I n t h e t r a n s f o r m a t i o n - b a s e d u n k n o w n - w o r d t a g g e r , t h e i n i t i a l - s t a t e a n n o t a t o r n a i v e l y a s s u m e s t h e m o s t l i k e l y t a g f o r a n u n k n o w n w o r d is " p r o p e r n o u n " if t h e w o r d is c a p i t a l i z e d a n d " c o m m o n n o u n " o t h e r w i s e . 17. Below, w e list t h e s e t o f a l l o w a b l e t r a n s f o r m a t i o n s . . Change the tag of an unknown word (from X) to Y if:. 1.. .. 3.. .. 5.. D e l e t i n g t h e p r e f i x (suffix) x, Ixl < 4, r e s u l t s in a w o r d (x is a n y s t r i n g o f l e n g t h 1 to 4).. T h e first (last) (1,2,3,4) c h a r a c t e r s o f t h e w o r d a r e x.. A d d i n g t h e c h a r a c t e r s t r i n g x a s a p r e f i x (suffix) r e s u l t s in a w o r d (Ixl ~ 4).. W o r d w e v e r a p p e a r s i m m e d i a t e l y to t h e left (right) o f t h e w o r d . . C h a r a c t e r z a p p e a r s in t h e w o r d . . 17 If we change the tagger to tag all unknown words as common nouns, then a number of rules are learned of the form: change tag to proper noun if the prefix is "E', "A", "B', etc., since the learner is not provided with the concept of upper case in its set of transformation templates.. 558. Brill Transformation-Based Error-Driven Learning. C h a n g e Tag # F r o m To C o n d i t i o n 1 N N N N S H a s suffix -s 2 N N C D H a s c h a r a c t e r . 3 N N JJ H a s c h a r a c t e r - 4 N N VBN H a s suffix -ed 5 N N VBG H a s suffix - i n g 6 ?? RB H a s suffix -l y 7 ?? JJ A d d i n g suffix -l y resu l t s in a w o r d . 8 N N C D T h e w o r d $ c a n a p p e a r to t h e left. 9 N N JJ H a s suffix -al 10 N N VB T h e w o r d w o u l d c a n a p p e a r to t h e left. 11 N N C D H a s c h a r a c t e r 0 12 N N JJ T h e w o r d b e c a n a p p e a r to t h e left. 13 N N S JJ H a s suffix - u s 14 N N S VBZ T h e w o r d it c a n a p p e a r to t h e left. 15 N N JJ H a s suffix - b l e 16 N N JJ H a s suffix -ic 17 N N C D H a s c h a r a c t e r 1 18 N N S N N H a s suffix - s s 19 ?? JJ D e l e t i n g the prefi x u n - resu l t s in a w o r d 20 N N JJ H a s suffix - i r e . F i g u r e 6 The first 20 transformations for unknown words.. A n u n a n n o t a t e d text can b e u s e d to c h e c k the c o n d i t i o n s in all o f t h e a b o v e t ran s- f o r m a t i o n t e m p l a t e s . A n n o t a t e d text is n e c e s s a r y in t r a i n i n g to m e a s u r e t h e effect o f t r a n s f o r m a t i o n s o n t a g g i n g accuracy. Since t h e g o a l is to label e a c h lexical e n t r y fo r n e w w o r d s as a c c u r a t e l y as possible, a c c u r a c y is m e a s u r e d o n a p e r t y p e a n d n o t a p e r t o k e n basis.. F i g u r e 6 s h o w s the first 20 t r a n s f o r m a t i o n s l e a r n e d f o r t a g g i n g u n k n o w n w o r d s in t h e Wall Street J o u r n a l c o r p u s . As a n e x a m p l e of h o w ru l es c a n c o r r e c t e r r o r s g e n e r a t e d b y p r i o r rules, n o t e t h a t a p p l y i n g t h e first t r a n s f o r m a t i o n will r e s u l t in t h e m i s t a g g i n g of t h e w o r d actress. T h e 18th l e a r n e d r u l e fixes this p r o b l e m . Th i s r u l e states:. Change a tag from p l u r a l c o m m o n n o u n to s i n g u l a r c o m m o n n o u n if the word has SUffiX s s . . K e e p in m i n d t h a t n o specific affixes are p r e s p e c i f i e d . A t r a n s f o r m a t i o n c a n m a k e r e f e r e n c e to a n y string of c h a r a c t e r s u p to a b o u n d e d l en g t h . So w h i l e t h e first r u l e specifies the English suffix "s', the r u l e l e a r n e r w a s n o t c o n s t r a i n e d f r o m c o n s i d e r i n g s u c h n o n s e n s i c a l r u l e s as:. Change a tag to adjective if the word has suffix "xhqr'.. Also, a b s o l u t e l y n o English-specific i n f o r m a t i o n (su ch as a n affix list) n e e d b e p r e s p e c i f i e d in t h e learner. TM. 18 This learner has also been applied to tagging Old English. See Brill (1993b). Although the. 559. Computational Linguistics Volume 21, Number 4. J. ==. i i E i i. 0 100 200 300 400. Transformation Number. Figure 7 Accuracy vs. Transformation Number. We then ran the following e x p e r i m e n t using 1.1 million w o r d s of the Penn Tree- bank Tagged Wall Street Journal Corpus. Of these, 950,000 w o r d s were used for training and 150,000 w o r d s were used for testing. Annotations of the test corpus were not used in a n y w a y to train the system. From the 950,000 w o r d training corpus, 350,000 w o r d s were used to learn rules for tagging u n k n o w n words, and 600,000 w o r d s were used to learn contextual rules; 243 rules were learned for tagging u n k n o w n words, and 447 contextual tagging rules were learned. U n k n o w n w o r d accuracy on the test corpus was 82.2%, and overall tagging accuracy on the test corpus was 96.6%. To o u r knowledge, this is the highest overall tagging accuracy ever q u o t e d on the Penn Treebank C o r p u s w h e n making the o p e n v o c a b u l a r y assumption. Using the tagger w i t h o u t lexicalized rules, an overall accuracy of 96.3% a n d an u n k n o w n w o r d accuracy of 82.0% is ob- tained. A g r a p h of accuracy as a function of transformation n u m b e r on the test set for lexicalized rules is s h o w n in figure 7. Before a p p l y i n g any transformations, test set ac- curacy is 92.4%, so the transformations reduce the error rate b y 50% over the baseline. The high baseline accuracy is s o m e w h a t misleading, as this includes the tagging of u n a m b i g u o u s words. Baseline accuracy w h e n the w o r d s that are u n a m b i g u o u s in o u r lexicon are not considered is 86.4%. However, it is difficult to c o m p a r e taggers using this figure, as the accuracy of the system d e p e n d s on the particular lexicon used. For instance, in o u r training set the w o r d the was tagged with a n u m b e r of different tags, and so according to o u r lexicon the is ambiguous. If we instead used a lexicon w h e r e the is listed u n a m b i g u o u s l y as a determiner, the baseline accuracy w o u l d be 84.6%.. For tagging u n k n o w n words, each w o r d is initially assigned a part-of-speech tag based on w o r d and word-distribution features. Then, the tag m a y be changed based on contextual cues, via contextual transformations that are applied t o the entire cor- pus, both k n o w n and u n k n o w n - w o r d s . W h e n the contextual rule learner learns trans- formations, it does so in an a t t e m p t to maximize overall tagging accuracy, and not u n k n o w n - w o r d tagging accuracy. U n k n o w n w o r d s account for only a small percent- age of the corpus in o u r experiments, typically two to three percent. Since the distribu- tional b e h a v i o r of u n k n o w n w o r d s is quite different from that of k n o w n words, a n d . transformations are not English-specific, the set of transformation templates w o u l d have to be extended to process languages with dramatically different morphology,. 560. Brill Transformation-Based Error-Driven Learning. Table 2 Tagging Accuracy on Different Corpora. Corpus Accuracy. Penn WSJ 96.6%. Penn Brown 96.3%. Orig Brown 96.5%. since a transformation that does not increase u n k n o w n - w o r d tagging accuracy can still be beneficial to overall tagging accuracy, the contextual transformations learned are not optimal in the sense of leading to the highest tagging accuracy on u n k n o w n words. Better u n k n o w n - w o r d accuracy m a y be possible by training and using two sets of contextual rules, one maximizing k n o w n - w o r d accuracy and the other maxi- mizing u n k n o w n - w o r d accuracy, and then applying the appropriate transformations to a word w h e n tagging, d e p e n d i n g u p o n whether the word appears in the lexicon. We are currently experimenting with this idea.. In Weischedel et al. (1993), a statistical approach to tagging u n k n o w n words is shown. In this approach, a n u m b e r of suffixes and important features are prespecified. Then, for u n k n o w n words:. p ( W I T) -= p ( u n k n o w n word I T) • p(Capitalize-feature I T) * p(suffixes, hyphenation I T). Using this equation for u n k n o w n word emit probabilities within the stochastic tagger, an accuracy of 85% was obtained on the Wall Street Journal corpus. This portion of the stochastic model has over 1,000 parameters, with 108 possible unique emit proba- bilities, as opposed to a small n u m b e r of simple rules that are learned and used in the rule-based approach. In addition, the transformation-based m e t h o d learns specific cues instead of requiring them to be prespecified, allowing for the possibility of uncover- ing cues not apparent to the h u m a n language engineer. We have obtained comparable performance on u n k n o w n words, while capturing the information in a m u c h more concise and perspicuous manner, and without prespecifying any information specific to English or to a specific corpus.. In table 2, we show tagging results obtained on a n u m b e r of different corpora, in each case training on roughly 9.5 x 10 s words total and testing on a separate test set of 1.5-2 x 10 s words. Accuracy is consistent across these corpora and tag sets.. In addition to obtaining high rates of accuracy and representing relevant linguistic information in a small set of rules, the part-of-speech tagger can also be m a d e to run extremely fast. Roche and Schabes (1995) show a m e t h o d for converting a list of tagging transformations into a deterministic finite state transducer with one state transition taken per word of input; the result is a transformation-based tagger whose tagging speed is about ten times that of the fastest Markov-model tagger.. 4.4 K-Best Tags There are certain circumstances where one is willing to relax the one-tag-per-word requirement in order to increase the probability that the correct tag will be assigned to each word. In DeMarcken (1990) and Weischedel et al. (1993), k-best tags are assigned within a stochastic tagger by returning all tags within some threshold of probability of being correct for a particular word.. 561. Computational Linguistics Volume 21, N u m b e r 4. Table 3 Results from k-best tagging.. # of Rules Accuracy Avg. # of tags per word. 0 96.5 1.00. 50 96.9 1.02. 100 97.4 1.04. 150 97.9 1.10. 200 98.4 1.19. 250 99.1 1.50. We c a n m o d i f y t h e t r a n s f o r m a t i o n - b a s e d t a g g e r to r e t u r n m u l t i p l e t a g s f o r a w o r d b y m a k i n g a s i m p l e m o d i f i c a t i o n to t h e c o n t e x t u a l t r a n s f o r m a t i o n s d e s c r i b e d a b o v e . T h e i n i t i a l - s t a t e a n n o t a t o r is t h e t a g g i n g o u t p u t o f t h e p r e v i o u s l y d e s c r i b e d o n e - b e s t t r a n s f o r m a t i o n - b a s e d tagger. T h e a l l o w a b l e t r a n s f o r m a t i o n t e m p l a t e s a r e t h e s a m e as t h e c o n t e x t u a l t r a n s f o r m a t i o n t e m p l a t e s l i s t e d a b o v e , b u t w i t h t h e r e w r i t e rule: change tag X to tag Y m o d i f i e d to add tag X to tag Y o r add tag X to word W. I n s t e a d o f c h a n g i n g t h e t a g g i n g o f a w o r d , t r a n s f o r m a t i o n s n o w a d d a l t e r n a t i v e t a g g i n g s to a w o r d . . W h e n a l l o w i n g m o r e t h a n o n e t a g p e r w o r d , t h e r e is a t r a d e - o f f b e t w e e n a c c u r a c y a n d t h e a v e r a g e n u m b e r o f t a g s f o r e a c h w o r d . Ideally, w e w o u l d like to a c h i e v e as l a r g e a n i n c r e a s e i n a c c u r a c y w i t h as f e w e x t r a t a g s a s p o s s i b l e . T h e r e f o r e , in t r a i n i n g w e f i n d t r a n s f o r m a t i o n s t h a t m a x i m i z e t h e f u n c t i o n : . N u m b e r o f c o r r e c t e d e r r o r s . N u m b e r o f a d d i t i o n a l t a g s . I n t a b l e 3, w e p r e s e n t r e s u l t s f r o m first u s i n g t h e o n e - t a g - p e r - w o r d t r a n s f o r m a - t i o n - b a s e d t a g g e r d e s c r i b e d in t h e p r e v i o u s s e c t i o n a n d t h e n a p p l y i n g t h e k - b e s t t a g t r a n s f o r m a t i o n s . T h e s e t r a n s f o r m a t i o n s w e r e l e a r n e d f r o m a s e p a r a t e 240,000 w o r d c o r p u s . A s a b a s e l i n e , w e d i d k - b e s t t a g g i n g o f a t e s t c o r p u s . E a c h k n o w n w o r d in t h e t e s t c o r p u s w a s t a g g e d w i t h all t a g s s e e n w i t h t h a t w o r d in t h e t r a i n i n g c o r p u s a n d t h e five m o s t l i k e l y u n k n o w n - w o r d t a g s w e r e a s s i g n e d to all w o r d s n o t s e e n in t h e t r a i n i n g c o r p u s . 19 T h i s r e s u l t e d in a n a c c u r a c y o f 99.0%, w i t h a n a v e r a g e o f 2.28 t a g s p e r w o r d . T h e t r a n s f o r m a t i o n - b a s e d t a g g e r o b t a i n e d t h e s a m e a c c u r a c y w i t h 1.43 t a g s . p e r w o r d , o n e t h i r d t h e n u m b e r o f a d d i t i o n a l t a g s as t h e b a s e l i n e t a g g e r . 2°. 5. C o n c l u s i o n s . I n this p a p e r , w e h a v e d e s c r i b e d a n e w t r a n s f o r m a t i o n - b a s e d a p p r o a c h to c o r p u s - b a s e d l e a r n i n g . W e h a v e g i v e n d e t a i l s o f h o w this a p p r o a c h h a s b e e n a p p l i e d to p a r t - o f - s p e e c h t a g g i n g a n d h a v e d e m o n s t r a t e d t h a t t h e t r a n s f o r m a t i o n - b a s e d a p p r o a c h o b t a i n s . 19 Thanks to Fred Jelinek and Fernando Pereira for suggesting this baseline experiment. 20 Unfortunately, it is difficult to find results to compare these k-best tag results to. In DeMarcken (1990),. the test set is included in the training set, and so it is difficult to know how this system would do on fresh text. In Weischedel et al. (1993), a k-best tag experiment was run on the Wall Street Journal corpus. They quote the average number of tags per word for various threshold settings, but do not provide accuracy results.. 562. Brill Transformation-Based Error-Driven Learning. competitive performance with stochastic taggers on tagging both u n k n o w n a n d k n o w n words. The transformation-based tagger captures linguistic information in a small n u m b e r of simple nonstochastic rules, as opposed to large n u m b e r s of lexical a n d contextual probabilities. This learning approach has also been applied to a n u m b e r of other tasks, including prepositional phrase attachment disambiguation (Brill and Resnik 1994), bracketing text (Brill 1993a) a n d labeling nonterminal nodes (Brill 1993c). Recently, w e have b e g u n to explore the possibility of extending these techniques to other problems, including learning pronunciation networks for speech recognition a n d learning mappings b e t w e e n syntactic and semantic representations.. Appendix A: Penn Treebank Part-of-Speech Tags (Excluding Punctuation). 1. CC Coordinating conjunction 2. CD Cardinal n u m b e r 3. DT Determiner 4. EX Existential "there" 5. FW Foreign w o r d 6. IN Preposition or subordinating conjunction 7. JJ Adjective 8. JJR Adjective, comparative 9. JJS Adjective, superlative 10. LS List item m a r k e r 11. MD Modal 12. N N Noun, singular or mass 13. NNS Noun, plural 14. N N P Proper noun, singular 15. NNPS Proper noun, plural 16. PDT Predeterminer 17. POS Possessive ending 18. PP Personal p r o n o u n 19. PP$ Possessive p r o n o u n 20. RB A d v e r b 21. RBR Adverb, comparative 22. RBS Adverb, superlative 23. RP Particle 24. SYM Symbol 25. TO "to" 26. U H Interjection 27. VB Verb, base form 28. VBD Verb, past tense 29. VBG Verb, g e r u n d or present participle 30. VBN Verb, past participle 31. VBP Verb, non-3rd person singular present 32. VBZ Verb, 3rd person singular present 33. WDT Wh-determiner 34. WP W h - p r o n o u n 35. WP$ Possessive w h - p r o n o u n 36. WRB Wh-adverb. 563. Computational Linguistics Volume 21, Number 4. Acknowledgments This work was funded in part by NSF grant IRI-9502312. In addition, this work was done in part while the author was in the Spoken Language Systems Group at Massachusetts Institute of Technology under ARPA grant N00014-89-J-1332, and by DARPA/AFOSR grant AFOSR-90-0066 at the University of Pennsylvania. Thanks to Mitch Marcus, Mark Villain, and the anonymous reviewers for many useful comments on earlier drafts of this paper.. References Black, Ezra; Jelinek, Fred; Lafferty, John;. Magerman, David; Mercer, Robert; and Roukos, Salim (1993). "Towards history-based grammars: Using richer models for probabilistic parsing." In Proceedings, 31st Annual Meeting of the Association for Computational Linguistics. Columbus, OH.. Black, Ezra; Jelinek, Fred; Lafferty, John; Mercer, Robert, and Roukos, Salim (1992). "Decision tree models applied to the labeling of text with parts-of-speech." In Darpa Workshop on Speech and Natural Language. Harriman, NY.. Breiman, Leo; Friedman, Jerome; Olshen, Richard; and Stone, Charles (1984). Classification and regression trees. Wadsworth and Brooks.. Brill, Eric (1992). "A simple rule-based part of speech tagger." In Proceedings, Third Conference on Applied Natural Language Processing, ACL, Trento, Italy.. Brill, Eric (1993a). "Automatic grammar induction and parsing free text: A transformation-based approach." In Proceedings, 31st Meeting of the Association of Computational Linguistics, Columbus, OH.. Brill, Eric (1993b). A Corpus-Based Approach to Language Learning. Doctoral dissertation, Department of Computer and Information Science, University of Pennsylvania.. Brill, Eric (1993c). "Transformation-based error-driven parsing." In Proceedings, Third International Workshop on Parsing Technologies, Tilburg, The Netherlands.. Brill, Eric (1994). "Some advances in rule-based part of speech tagging." In Proceedings, Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle,. WA. Brill, Eric and Resnik, Philip (1994). "A. transformation-based approach to prepositional phrase attachment disambiguation." In Proceedings, Fifteenth International Conference on Computational Linguistics (COLING-1994), Kyoto, Japan.. Brown, Peter; Cocke, John; Della Pietra, Stephen; Della Pietra, Vincent; Jelinek, Fred; Lafferty, John; Mercer, Robert; and Roossin, Paul (1990). "A statistical approach to machine translation." Computational Linguistics, 16(2).. Brown, Peter; Lai, Jennifer; and Mercer, Robert (1991). "Word-sense disambiguation using statistical methods." In Proceedings, 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, CA. Bruce, Rebecca and Wiebe, Janyce (1994). "Word-sense disambiguation using decomposable models." In Proceedings, 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM. Charniak, Eugene; Hendrickson, Curtis; Jacobson, Neil; and Perkowitz, Michael (1993). "Equations for part of speech tagging." In Proceedings, Conference of the American Association for Artificial Intelligence (AAAI-93), Washington, DC.. Church, Kenneth (1988). "A stochastic parts program and noun phrase parser for unrestricted text." In Proceedings, Second Conference on Applied Natural Language Processing, ACL, Austin, TX.. Cutting, Doug; Kupiec, Julian; Pedersen, Jan; and Sibun, Penelope (1992). "A practical part-of-speech tagger." In Proceedings, Third Conference on Applied Natural Language Processing, ACL, Trento, Italy.. DeMarcken, Carl (1990). "Parsing the lob corpus." In Proceedings, 1990 Conference of the Association for Computational Linguistics, Pittsburgh, PA.. Derose, Stephen (1988). "Grammatical category disambiguation by statistical optimization." Computational Linguistics, 14.. Francis, Winthrop Nelson and Kucera, Henry (1982). Frequency analysis of English usage: Lexicon and grammar. Houghton Mifflin, Boston.. 564. Brill Transformation-Based Error-Driven Learning. Fujisaki, Tetsu; Jelinek, Fred; Cocke, John; and Black, Ezra (1989). "Probabilistic parsing method for sentence disambiguation." In Proceedings, International Workshop on Parsing Technologies, Carnegie Mellon University, Pittsburgh, PA.. Gale, William and Church, Kenneth (1991). "A program for aligning sentences in bilingual corpora." In Proceedings, 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, CA.. Gale, William; Church, Kenneth; and Yarowsky, David (1992). "A method for disambiguating word senses in a large corpus." Computers and the Humanities.. Leech, Geoffrey; Garside, Roger; and Bryant, Michael (1994). "Claws4: The tagging of the British National Corpus." In Proceedings, 15th International Conference on Computational Linguistics, Kyoto, Japan.. Harris, Zellig (1962). String Analysis of Language Structure. Mouton and Co., The Hague.. Hindle, Donald (1989). "Acquiring disambiguation rules from text." In Proceedings, 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC.. Hindle, D. and Rooth, M. (1993). "Structural ambiguity and lexical relations." Computational Linguistics, 19(1):103-120.. Huang, Caroline; Son-Bell, Mark; and Baggett, David (1994). "Generation of pronunciations from orthographies using transformation-based error-driven learning." In International Conference on Speech and Language Processing (ICSLP), Yokohama, Japan.. Jelinek, Fred (1985). Self-Organized Language Modelling for Speech Recognition. Dordrecht. In Impact of Processing Techniques on Communication, J. Skwirzinski, ed.. Joshi, Aravind and Srinivas, B. (1994). "Disambiguation of super parts of speech (or supertags): Almost parsing." In Proceedings, 15th International Conference on Computational Linguistics, Kyoto, Japan.. Klein, Sheldon and Simmons, Robert (1963). "A computational approach to grammatical coding of English words." ]ACM, 10.. Kupiec, Julian (1992). "Robust. part-of-speech tagging using a hidden Markov model." Computer Speech and Language, 6.. Marcus, Mitchell; Santorini, Beatrice; and Marcinkiewicz, Maryann (1993). "Building a large annotated corpus of English: the Penn Treebank." Computational Linguistics, 19(2).. Merialdo, Bernard (1994). "Tagging English text with a probabilistic model." Computational Linguistics.. Miller, George (1990). "Wordnet: an on-line lexical database." International Journal of Lexicography, 3(4).. Quinlan, J. Ross (1986). "Induction of decision trees." Machine Learning, 1:81-106.. Quinlan, J. Ross and Rivest, Ronald (1989). "Inferring decision trees using the minimum description length principle." Information and Computation, 80.. Ramshaw, Lance and Marcus, Mitchell (1994). "Exploring the statistical derivation of transformational rule sequences for part-of-speech tagging." In The Balancing Act: Proceedings of the ACL Workshop on Combining Symbolic and Statistical Approaches to Language, N ew Mexico State University, July.. Roche, Emmanuel and Schabes, Yves (1995). "Deterministic part of speech tagging with finite state transducers." Computational Linguistics, 21(2), 227-253.. Schutze, Hinrich and Singer, Yoram (1994). Part of speech tagging using a variable memory Markov model. In Proceedings, Association for Computational Linguistics, Las Cruces, NM.. Sharman, Robert; Jelinek, Fred; and Mercer, Robert (1990). "Generating a grammar for statistical training." In Proceedings, 1990 Darpa Speech and Natural Language Workshop.. Weischedel, Ralph; Meteer, Marie; Schwartz, Richard; Ramshaw, Lance; and Palmucci, Jeff (1993). "Coping with ambiguity and unknown words through probabilistic models." Computational Linguistics.. Yarowsky, David (1992). "Word-sense disambiguation using statistical models of Roget's categories trained on large corpora." In Proceedings of COLING-92, pages 454-460, Nantes, France, July.. 565

References

Related documents