C O N S T R U C T I N G LEXICAL T R A N S D U C E R c Lauri Karttunen
Rank Xerox Research Centre Grenoble
O. I N T R O D U C T I O N
A lexical t r a n s d u c e r , first discussed in Karttunen, Kaplan and Z a e n e n 1992, is a s p e c i a l i s e d finite-state a u t o m a t o n that m a p s inflected s u r f a c e forms to lexical forms, and vice versa. The lexical form con- sists of a canonical representation of the w o r d and a s e q u e n c e of tags that s h o w the morphological characteristics of the form in question and its syntactic category. For ex- a m p l e , a lexical t r a n s d u c e r for French m i g h t relate the surface form veut to the lexical form vouloir+IndPr+SG+P3. In order to m a p b e t w e e n these t w o forms, the trans- d u c e r m a y c o n t a i n a p a t h like the one s h o w n in Fig. 1.
Lexical side
v o u 1 o i r +IndP+SG+P3
v e u t
Surface side
Fig. 1 Transducer path
as well as generation (vouloir+lndP+SG+P3 -~ veut ). Analysis and generation differ only with respect to the choice of the input side (surface or lexical). The transducer and the a n a l y s i s / g e n e r a t i o n algorithm are the s a m e in both directions.
Other a d v a n t a g e s that lexical transduc- ers have over other m e t h o d s of morpholog- ical processing are compactness and speed. The logical size of a transducer for French is a r o u n d 50K states and 100K arcs b u t the n e t w o r k can be physically c o m p a c t e d to a few h u n d r e d kilobytes. The speed of analy- sis varies from a few t h o u s a n d w o r d s per second to 10K w / s or more d e p e n d i n g on h a r d w a r e and the degree of compaction.
At this time there exist full-size lexical t r a n s d u c e r s for at least five languages: English and French (Xerox
DDS),
G e r m a n (Lingsoft), Korean ( H y u k - C h u l Kwon, see K w o n and K a r t t u n e n 1994), and Turkish (Kemal Oflazer). It is expected that b y the time of the Coling conference several lan- guages will have been a d d e d to the list. The circles in Fig. 1 represent states, thearcs (state transitions) are labelled by a pair of symbols: a lexical s y m b o l and a surface symbol. Sometimes they are the same (v:v), sometimes different (o:e), sometimes one or the o t h e r side is e m p t y (=
EPSILON).
The exact alignment of the s y m b o l s on the two sides is not crucial: the path w o u l d express the veut ~ vouloir+IndP+SG+P3 m a p p i n g even if the last t on the l o w e r side was m o v e d to match the I of vouloir.Because finite-state transducers are bidi- rectional, the s a m e transducer can be u s e d for analysis (veut ~ vouloir +IndP+SG+P3)
Stage I Stage 2 Stage 3 Stage 4
surface suing
LEXICON o FSTI
0
FST 2
...
¢
o
I
surface stritq!
[image:2.612.91.471.67.251.2]surface s trittg surface string
Fig. 2 Construction of a lexical transducer with intersection and composition (Karttunen, Kaplan and Zaenen 1992)
length relations (see Kaplan and Kay 1994 for a detailed discussion).
For practical as well as linguistic reasons it m a y be desirable to use more than one set of rules to describe the mapping. Fig. 2 illustrates the construction of a lexical transducer for French by two sets of rules using intersection (&) and composition (o) operations.
Stage 1 shows two parallel rule systems arranged in a cascade. In Stage 2, the rules on each level have been intersected to a sin- gle transducer. Stage 3 shows the composi- tion of the rule system to a single trans~- ducer. Stage 4 shows the final result that is derived by comf~osh G the rule system with the lexicon.
Although the conceptual picture is quite straightforward, the actual computations are very resource intensive because of the size of the intermediate structures at Stages 2 and 3. Individual rule transducers are generally quite small but the result of inter- secting and composing sets of rule trans- ducers tends to result in very large struc- tures.
This paper describes two recent act~ vances in the construction of lexical trans- ducers that circumvent these problems: (1)
moving the description of irregular map- pings from the rtde system to the source lexicon; (2) performing intersection and composition in single operation. Both of these features have been implemented in the Xerox authoring tools (Karttunen and Beesley 1992, Karttunen 1993) for lexical transducers.
"I. LEXICON AS A SET OF RELATIONS 1.1 Stem Alternations
idiosyncratic m a p p i n g s from regular alter- nations.
From a practical point of view, the treat- m e n t of irregular alternations by individual two-level rules is not optimal if the n u m b e r of rules becomes very large. For example, the description of the irregular verb mor- p h o l o g y in English requires m o r e t h a n a h u n d r e d rules, a large n m n b e r of w h i c h d e a l with the idiosyncratic b e h a v i o u r of just one word, such as the rule in (1).
( 1 ) " F r o m 'be' to 'is' - P a r t i"
b : i <=> #: e: I r r e g u l a r :
+Pres: +Sg: +P3: ;
H e r e # (word b o u n d a r y ) and I r r e g u l a r (a diacritic assigned to strong verbs in the source lexicon) g u a r a n t e e that the rule ap- plies just w h e r e it is supposed to apply: to d e r i v e the first letter of is in the present tense 3rd person form of be. Another rule is n e e d e d to delete the e b e c a u s e two-level rules are restricted to apply to just a pair of symbols. This is an artefact of the formal- ism b u t even if the be~is alternation were to be described b y a single rule, the construc- tion of d o z e n s of rules of such limited applicability is a tedious, error prone pro- cess.
A natural solution to this problem is to a l l o w i d i o s y n c r a t i c v a r i a t i o n to be de- scribed without a n y rules at all, by a simple lexical fiat, a n d to use rules only for the m o r e general alternations. This is one of t h e novel f e a t u r e s of the Xerox lexicon c o m p i l e r (Karttunen 1993). Technically, it m e a n s that the lexicon is not necessarily a collection of simple lexical forms as in classical Kimmo systems but a relation that m a y include pairs such as <be+Pres+Sg+P3, is>.
This can be achieved by c h a n g i n g the lexical format so that the lexicon interpreter compiles an e n t r y like (2) to the transducer s h o w n in Fig. 3.
(2)
b e + P r e s + S g + P 3 + V e r b : i s N e g ? ;Here the colon marks the juncture b e t w e e n the lexical form and its surface realisation, neej? is the continuation class for forms that m a y be have an attached n 't clitic.
Lexical side
b e +IndP +Sg +P3 +Verb i s
Surface side
Fig. 3 Entry (2) in compiled form. The convention used by the Xerox corn-. piler is to interpret paired entries so that the lexical form gets paired with the second form from left to right adding epsilons to the end of the shorter form to m a k e u p the difference in length. The location of the ep- silons can also be m a r k e d explicitly by ze- ros in the lexical form. For example, if it is important to regard the s of is as the regu- lar present singular 3rd person ending, the entry could be written as in (3).
(3) b e + P r e s + S g + P 3 + V e r b : i 0 0 0 s # ;
This has the effect of moving the s in }rig. 3 to the right to give a +~3 : s pair. The map- ping from vouloir+lndP+Sg+P3 to veut in Fig. 1 can be achieved in a similar way. 1.2 Inflectional and Derivational Affixes
M o v i n g all idiosyncratic realisations from the rules to the source lexicon also gives a natural w a y to describe the realisa~ tion of inflectional and derivational suf- fixes. A l t h o u g h it is possible to write a set of two-level rules that realise the English p r o g r e s s i v e suffix as ing, the rules that m a k e up this m a p p i n g have no other func- tion in the English m o r p h o l o g y system. A m o r e s t r a i g h t - f o r w a r d solution is
just
to include the entry (4.) is the appropriate sub- lexicon.The forms on the lower side of a source lexicon need not be identical to the surface f o r m s of t h e language; t h e y can still be modified b y the rules. This allows a good separation b e t w e e n idiosyncratic and regu- lar aspects of inflectional and derivational m o r p h o l o g y . The application of v o w e l h a r m o n y to the case m a r k i n g suffixes in Finnish illustrates this point.
Finnish case e n d i n g s are subject to vowel harmonv. For example, the abessive (%vithout") case in Finnish is realised as ei- ther tta or tftl d e p e n d i n g on the stem; the latter f o r m occurs in w o r d s w i t h o u t back vowels. The general shape of the abessive,
t t A , is not predictable from a n y t h i n g but
the specific realisation of the final vowel (represented b y the " a r c h i p h o n e m e " A) is just an instance of the general vowel har- m o n y rule in Vinnish. This gives us the three-level structm'e in Fig. 4 for forms like
syyttgl "without reason."
s y y +Sg + A b e Iexical/brm
LEXICON
s y y t t A intermediate f o r m
I RULES
[image:4.612.308.499.365.541.2]s y y t t a surfai:eJ'orm
Fig. 4 Three level analysis of Finnish Abessive.
The m a p p i n g from the lexical form to the intermediate form is given by the lexi- con; the rules are responsible for the map- ping from the lower side of the lexicon to the surface. The i n t e r m e d i a t e representa- tion disappears in the composition with the rules so that the resulting transducer maps the lexical f o r m directly to the surface form, and vice versa, as in Fig. 5.
s y y +sg +~be h, x i c a l J b r m
s y y t t a su@wefi~rm
Fig. 5 I;'inal structure after composition.
This c o m p u t a t i o n a l account coincides nicely w i t h t r a d i t i o n a l d e s c r i p t i o n s of Finnish n o u n inflection (Laaksonen and Lieko 1988).
2. INTERSECTING C O M P O S I T I O N The transfer of i r r e g u l a r alternations from the rules to the lexicon has a signifi- cant effect on reducing the n u m b e r of rules that have to be intersected in o r d e r to pro- duce lexical transducers in accordance with the scheme in Fig. 2. Nevertheless, in prac- tice the intersection of even a m o d e s t n u n > ber of rules tends to be quite large. The in- termediate rule transducers in Fig. 2 (Sta- ges 2 and 3) m a y actually be larger than the final result that emerges from the composi- tion with the lexicon in Stage 4. This has b e e n o b s e r v e d in s o m e versions of o u r French and English lexicons. Fig. 6 illus- trates this curious phenomenon.
S ource
Lexicon
intersection of Rule Transducers
o
Composed Lexicon
Fig. 6. Composition of a lexicon with rule transducers
[image:4.612.93.287.375.453.2]case, it could yield a result w h o s e size is t h e p r o d u c t ~ o f the sizes of the i n p u t machines.
I n s t e a d of t h e e x p e c t e d b l o w u p , the composition of a lexicon with a rule system t e n d s to p r o d u c e a t r a n s d u c e r that is not significantly larger than the source lexicon. The r e a s o n a p p e a r s to be that the rule intersection involves c o m p u t i n g the com- b i n e d effect of t h e rules for m a n y t y p e s complex situations that n e v e r occur in the lexicon. T h u s the c o m p o s i t i o n w i t h the actual lexicon can be a simplification from the point of the rule system.
This observation suggests that it is ad~ vantageous not to do rule intersections as a s e p a r a t e operation, as in Fig. 2. We can avoid the large intermediate results by per- f o r m i n g the intersection of the rules and t h e c o m p o s i t i o n w i t h the lexicon in one c o m b i n e d o p e r a t i o n . We call this n e w o p e r a t i o n i n t e r s e c t i n g c o m p o s i t i o n . It takes as i n p u t a lexicon (either a simple a u t o m a t o n or a transducer) a n d any n u m - ber of rule transducers. The result is what w e w o u l d get b y c o m p o s i n g the lexicon with the intersection of the rules.
The basic idea of intersecting composi- tion is to build an o u t p u t transducer w h e r e each state c o r r e s p o n d s to a state in the source lexicon a n d a configuration of rule states. The initial o u t p u t state corresponds to the initial state of the lexicon and its con- figuration consists of the initial states of the rule transducers. At every step in the con- struction, w e try to add transitions to our c u r r e n t o u t p u t state b y m a t c h i n g transi- tions in the current lexicon state against the transitions in all the rule states of the cur- rent configuration. The core of the algo- rithm is given in (5).
(5) Inters¢c..ting Composition For each transition x:y in the cur- rent lexicon state, find a pair y:z such that all rule states in the cur-
rent configuration h a v e a transi- tion for the pair. For each such y:z pair, get its destination configura- tion a n d the c o r r e s p o n d i n g out- put state. T h e n build an x:z tran- sition to that state from the cur- r e n t o u t p u t state. W h e n all t h e arcs in the c u r r e n t lexicon state have been processed, m o v e on to t h e n e x t u n p r o c e s s e d o u t p u t state. Iterate until finished.
Special provisions h a v e to be m a d e for x:y arcs in the lexicon state and for y:z arcs in the rules w h e n y is EPSILON. In the for- m e r case, we build an x:EPSILON transition to the o u t p u t state associated with the des- tination and the u n c h a n g e d configuration; in the latter case we build an EPSILON:z arc to the o u t p u t state associated with t h e cur- rent lexicon state and the n e w destination configuration.
If there is o n l y one rule transducer, in- tersecting composition is equivalent to or- d i n a r y composition. If all the n e t w o r k s are simple finite-state n e t w o r k s r a t h e r t h a n t r a n s d u c e r s , the operation r e d u c e s to n- w a y intersection in which one of the net- w o r k s (the lexicon in o u r case) is u s e d to drive the process. This special case of the algorithm was d i s c o v e r e d i n d e p e n d e n t l y b y Pasi T a p a n a i n e n ( T a p a n a i n e n , 1991; K o s k e n n i e m i , T a p a n a i n e n a n d Voutilai- nen, 1992) as an efficient w a y to a p p l y finite-state constraints for syntactic dis- ambiguation. The d i s t i n g u i s h e d n e t w o r k in his application represents alternative in- terpretations of the sentence. The m e t h o d allows the sentence to be d i s a m b i g u a t e d w i t h o u t c o m p u t i n g the large (possibly un- c o m p u t a b l e ) intersection of the rule au- tomata.
StaRe I Stage 2 Stage 3
~ource
Lexicon Intermediate I
&o Result
+
& o~ ~ Final
i
m.m _ _ -
-Fig. 7. Construction of a lexical transducer with intersecting composition.
In Fig. 7, the first set of rules applies to the source lexicon, tile second set of rules modifies the outcome of the first stage to yield the final outcome. The rules are not intersected separately. The Xerox lexicon compiler (Karttunen 1993) is designed to carry out the computations in this optimal way.
REFERENCES
Antworth, E. I, (1990).
I'C-KIMMO: a lwo-
level processor for morpholo,@:al analysis.
S m n m e r institute of Linguistics, Dallas, Texas.
Kaplan, Ronald M. and Martin Kay (1994). Regular Models of Phonological Rule Systems. To appear in
Corrlputational ldn-
guistics.
Karttunen, Lauri, Ronald M. Kaplan and Annie Zaenen (1992). Two-I,evel Mor- p h o l o g y with Composition. In the
Pro~
ceedings of the fifteenth International Con-
ference on Computational Linguistics. Col-
ing-.92.
Nantes, 23-28/8/1992. Vol. 1 141- 48. ICCL.Karttunen, Lauri and Kenneth R. Beesley
(1992).Two-I.evel Rule Compiler
Technical Report. ISTLq992-2. Xerox Palo Alto Re- search Center. Palo Alto, California.Kartttmen, I,auri
(1993).F'inile-Stah' l.exicon
Compiler
Technical Report. ISTI,-NI,TT- 1993-04-02. Xerox Palo Alto Research Center. Palo Alto, California.Koskenniemi, Kimmo, Pasi T a p a n a i n e n and Atro Voutilainen (1992). Compiling and Using Finite-State Syntactic Rules. in the
Proceedings of the fifteenth Interna-
tional Conference on Computalional l,in-
guistics. Colin2-92.
Nantes,23-28/8/1992.
Vol. 1 156-62. ICCL. 1992.
Kwon, l Iyuk-Chul and l,auri Karttunen (1994). Incremental Construction of a Lexical Transducer for Korean. in the
Proceedings of the 151h International Con-
ference on Compulatiortal i.in~uislics.
Col-ing-94. Kyoto, Aug. 5--9, 1994.
l,aaksonen, Kaino and Anneli I,ieko (1988).