i N V E S T I G A T I N G THE P O S S I B I L I T Y OF A H I C R O P R O C E S S O R - B A S E D MACIIINE TRANSLATTON SYSTEM
H a r o l d L. Somers
C e n t r e f o r C o m p u t a t i o n a l L i n g u i s t i c s
U n i v e r s i t y o f H a n c h e s t e r I n s t i t u t e o f S c i e n c e and T e c h n o l o g y PO Box RR, M a n c h e s t e r H60 tqO, E n g l a n d
ABSTRACT
T h i s p a p e r d e s c r i b e s an o n - g o i n ~ r e s e a r c h p r o j e c t b e i n g c a r r i e d o u t by s t a f f and s t u d e n t s ac t h e C e n t r e f o r C o m p u t a t i o n a l L i n g u i s t i c s co examine t h e f e a s i b i l i t y o f H a c h i n e T r a n s l a t i o n (~T) i n a m i c r o p r o c e s s o r e n v i r o n m e n t . The s y s t e m i n c o r p o r a t e s as f a r as ~ o s s i h l e ~ e a c u r e s o f l a r g e - s c a l e HT s y s t e m s ~hac have p r o v e d d e s i r a b l e o r e f f e c t i v e : i t i s m u t C i l i n R u a l , a l g o r i t h m s and da~a a r e s t r i c t l y s e p a r a t e d , and t h e s y s t e m is h i = h l y m o d u l a r . Problems o f t e r m i n o l o g i c a l p o l y s e m y and s y n t a c t i c c o m p l e x i t y a r e r e d u c e d v i a t h e n o t i o n s o f c o n t r o l l e d v o c a b u l a r y and r e s t r i c t e d s y n t a x . ~ i v e n t h e s e c o n s t r a i n t s , iE seems f e a s i b l e Co a c h i e v e t r a n s t a c i o n v i a an ' i n t e r t t n g u a ' , a v o i d i n ~ any l a n g u a g e - p a i r o r i e n t e d ' t r a n s f e r ' s c a r e . The p a p e r c o n c e n t r a t e s on a d e s c r i p t i o n of t h e s e p a r a t e m o d u l e s i n t h e t r a n s l a ~ i o n p r o c e s s as t h e y a r e c u r r e n t l y e n v i s a g e d , and d e c a t t s some of t h e p r o b l e m s s p e c i f i c t o t h e m i c r o p r o c e s s o r - b a s e d a p p r o a c h t o ~ chac have So ~ar come tO t i g h t .
I . BACKC2OU:'D ;'-':D '.'V£2VI':"
This p a p e r describes p r e l i m i n a r y research in the design of Bede, a l i m i t e d - s y n t a x c o n t r o l l e d - v o c a b u l a r y ~ a c h i n e T r a n s l a t i o n s y s t e m co run on a m i c r o p r o c e s s o r , t r a n s l a c i n e b e t w e e n E n g l i s h , ~ r e n c h , Cerman and D u t c h . Our e x p e r i m e n t a l c o r p u s is a c a r - r a d i o m a n u a l . Bede (named a f t e r t h e 7 t h C e n c u r y E n g l i s h t i n ~ u i s t ) , is e s s e n t i a l l y a research p r o j e c t : we are noc immediately concerned ~v~ch c o m m e r c i a l a p o l i c a c i o n s , t h o u g h such a r e c l e a r l y possible if t h e research p r o v e s f r u i t f u l . ":ork on Bede ac t h i s s t a g e thouRh is p r i m a r i l y experimentnl. The aim at t h e moment [s co
i n v e s t i g a t e t h e e x t e n t t o w h i c h a m i c r o p r o c e s s o r - based ~ s y s t e m o f a d v a n c e d d e s i 2 n is P o s s i b l e , and the l i m i t a t i o n s t h a t have t o be imposed in order co achieve .~ ~or;<in~ s y s t e m . This p a p e r 'Je~crihes the overall s y s t e m design s n e c i f ~ c ~ C i o n t) .~n£cn we are c u r r e n t l y working.
~n cite b a s i c d e s i g n of t h e s y s t e m we a t t e m p t to i n c o r p o r a t e as much as p o s s i b l e F e a t u r e s o f f a r , e - s c a l e ~ systems ~hac have p r o v e d t o be d e s i r a b l e o r e f f e c t i v e . Thus. Bede is m u l ~ i l i n B u a l by ,~csi(zn. alqorithms and l i n R u i s t i c data are
striccl~ separated, and the system [s desiRned in
~ore o- less independent modules.
T~n ~[cron'occssor environment means t h a t ~ : r ~ r ~ l ~I" s i z ~ ~ro ~{~norE,l~E: '4~ta ~ c r u c c u r e s
b o t h d y n a m i c ( c r e a t e d by and m a n i p u t a t e d d u r i n g t h e t r a n s l a t i o n process) and static ( d i c t i o n a r i e s and l i n g u i s t i c r u l e packages) a r e c o n s t r a i n e d co be as e c o n o m i c a l i n terms o f s¢oraBe space and a c c e s s p r o c e d u r e s as p o s s i b l e . L i m i t a t i o n s on ~n- c o r e and p e r i o h e r a [ s t o r a g e a r e i m p o r t a n t c o n s i d e r a t i o n s i n t h e s y s t e m d e s i g n .
I n l a r g e g e n e r a [ p u r p o s e ,HT s y s t e m s , i= is n e c e s s a r y t o assume t h a t f a i t u r e t o t r a n s l a t e t h e g i v e n i n p u t c o r r e c t t y is g e n e r a l l y n o t due t o
incorrectly
~ormed i n p u t , bu~ t o [ n s u f f i c i e n t J y e l a b o r a t e d ~ r a n s l a c i o n a l g o r i t h m s . T h i s i s p a r t i c u l a r l y due t o =wo p r o b l e m s : t h e l e x i c a l p r o b l e m o f c h o i c e o f a p p r o p r i a t e t r a n s l a t i o n e q u i v a l e n t s , and t h e s t r a t e g i c p r o b l e m o f e ~ f e c ~ i v e a n a l y s i s of t h e w i d e range of s y n t a c t i c p a t t e r n s Eound i n n a c u r a l l a n g u a g e . The r e d u c t i o n o f t h e s e p r o b l e m s v i a ~he n o t i o n s of" c o n t r o l l e dvocabu[ary and restricted syntax seems
p a r t i c u l a r l y a p p r o p r i a t e in t h e m i c r o p r o c e s s o r e n v i r o n m e n t , s i n c e t h e a l t e r n a t i v e of m a k i n ~ a system |n~tnitely e x t e n d a b l e [s probably no~ feasible,
G i v e n t h e s e c o n s t r a / n t s , i t seems f e a s i b l e t o a c h i e v e c r a n s t a c i o n v i a an I n C e r l t n g u a . ~n ~ h i c h t h e c a n o n i c a t s t r u c t u r e s f r o m t h e s o u r c e l a n = u a ~ e a r e mapped d i r e c t l y o n t o t h o s e o f t h e t a r g e t l a n g u a g e ( s ) , a v o i d i n R any l a n g u a g e - p a i r o r i e n t e d ' t r a n s f e r ' s t a ~ e . T r a n s l a t i o n t h u s cakes p l a c e in ~wu puase~= a n a i y s l s o t s o u r c e ~ e x t an~ s y n t h e s i s o f t a r g e t t e x t .
A. I n c o r p o r a t i o n o f r e c e n t desL~n o r t n c i o [ e s ~ o d e r n ~ s y s t e m d e s i g n can be c h a r ~ c t e r L s e d hv t h r e e p r i n c i p l e s t h a c have p r o v e d Co be d e s i r a b l e and e f f e c t i v e (Lehmann e t a [ , t g ~ } o : I - ] ) : each of t h e s e is a d h e r e d co i n t h e d e s i R n oF Rede.
Bede Es m u t t [ l i n g u a l by design: early "!T systems were designed with specific lan~uaBe-oatrs in m i n d , and t r a n s l a t i o n a l g o r i t h m s were e l a b o r a t e d on t h i s b a s i s . The main c o n s e o u e n c e o f
this was that source lan~uaRe analysis ~¢as
e f f e c t e d w i t h i n t h e p e r s p e c t i v e o f t h e B~ven t a r g e t [ a n g u a R e , and was t h e r e f o r e o f t e n o f l i t t l e o r no use on t h e a d d i t i o n i n t o t h e s y s t e m o f a further l a n g u a g e (of. ~in~, IORI:12; ~:in~ Perschke, 1982:28).
underlying linguistic theory which might have been present was inextricably bound up with the program
itself. This clearly entailed the disadvantage
that any modification of the system had to be done
by a skilled programmer (of. Johnson, IgRO:IAO).
Furthermore, the side-effects of apparently quite
innocent modifications were often quite far-
reaching and difficult to trace (see for example
Boscad, l q 8 2 : 1 3 0 ) , A l t h o u g h t h i s has o n l y r e c e n t l y become an i s s u e in HT ( e . g . V a u q u o i s , 1 9 7 9 : I . 3 ; 1981=10), i t has o f course f o r a long time been s t a n d a r d p r a c t i c e i n o t h e r areas o f knowledge-based programming ( N e w e l l , 1973; Davis & King, 1 9 7 7 ) .
The third principle now current in MT and to be
incorporated in Bede is that the translation
process should be modular. This approach was a
feature of the earliest 'second generation'
systems ( o f . V a u q u o i s , 1975:33), and i s c h a r a c t e r i s e d by the g e n e r a l n o t i o n t h a t any c o m p l i c a t e d c o m p u t a t i o n a l t a s k i s b e s t t a c k l e d by
dividing it up into smaller more or less
independent sub-casks which communicate only by
means of a strictly defined interface protocol
(Aho et al, 1974). This is typically achieved in
the bit environment by a gross division of the
translation process i n t o analysis of source
language and synthesis of target language,
possibly with an intermediate transfer sca~e (see
!.D below), with these phases in turn sub-divided,
for example into morphological, lexical and
syntactico-semantlc modules. This modularity may
be reflected both in the linguistic organisation
of the translation process and in the provision of
software devices specifically tailored to the
r e l e v a n t sub-task (Vauquois, 1975:33). This is
the case in Bede, where for each sub-task a
grammar interpreter is p r o v i d e d which has the
property of being no more powerful than necessary
for the task in question. This contrasts with the approach taken in TAt~-H~c~o (TAUM, Ig73), where a single general-purpose device (Colmerauer's (1970) ' O - S y s t e m s ' ) is o r o v i d e d , w i t h the a s s o c i a t e d d i s a d v a n t a g e t h a t f o r s o m e ' s i m p l e ' t a s k s the
superfluous power of the device means that
processes are seriously u n e c o n o m i c a l . Bede
incorporates five such 'grammar t y p e s ' with
associated individual formalisms and processors:
these are described in detail in the second half of this paper.
B. The microproce,ssor environment
!t is in the microprocessor basis that the
principle interest in this system lies, and, as
mentioned above, the main concern is the effects
of the r e s t r i c t i o n s t h a t the e n v i r o n m e n t imposes. Development o f the Bede p r o t o t y p e is p r e s e n t l y caking p l a c e on ZRO-based machines which p r o v i d e
6Ak bytes of in-core memory and 72Ok bytes of
peripheral s t o r e on two 5-I/~" double-sided
double-density floppy disks. The intention i s
that any commercial version of Bede would run on
more powerful processors with larger address
space, since we f e e l chat such machines w i l l soon rival the nopularity of the less powerful ZRO's as
the standard desk-cop hardware. Pro~rarzninR so
f a r has been in P a s c a l - " ( S o r c i m , 197q), a Pascal
d i a l e c t c l o s e l y r e s e m b l i n g UCSD P a s c a l , but we are conscious of the fact that both C ( K e r n i g h a n &
Ritchie, 1978) and BCPL (Richards & Whitby-
Strevens, Ig7g) may be more suitable for some of
the software elements, and do not rule out
completing the p r o t o t y p e in a number of languages. This adds the burden of designing compatible data-
structures and interfaces, and we are currently
investigating the relative merits of these
languages. Portability and efficiency seem to be
in conflict here.
M i c r o p r o c e s s o r - b a s e d MT c o n t r a s t s s h a r p l y w i t h
the mainframe-based activity, where the
significance of problems of economy of storage and
efficiency of programs has decreased in recent
years. The possibility of introducing an element
of human interaction with the system (of. Kay,
Ig80; Melby, 1981) is also highlighted in this
environment. Contrast systems like SYSTRAN (Toma,
1977) and GETA (Vauquois, 1975, lq7g; Boiler &
Nedobejkine, IggO) which work on the principle of
large-scale processing in batch mode.
Our experience so far is chat the economy and
efficiency in data-structure design and in the
elaboration of interactions between programs and
data and between different modules is of paramount
importance. While it is relatively evident thac
l a r g e - s c a l e HT can be s i m u l a t e d in the m i c r o p r o c e s s o r e n v i r o n m e n t , the c o s t in r e a l time
is tremendous: entirely new design ~nd
implementation strategies seem co be called for.
The ancient skills of the programmer that have
become eroded b y the generosity afforded by modern
mainframe configurations become highly valued in
this microprocessor application.
C. C o n t r o l l e d v o c a b u l a r y and r e s t r i c t e d sync@x
The state of the art of language processing is
such chat the analysis of a significant range of
syntactic patterns has been shown t o be possible, and by means of a number of different approaches.
Research in this area nowadays is concentrated on
the treatment of more problematic constructions
(e.g. Harcus, lqgO). This observation has led us
tO believe that a degree of success in a small
scale MT project can be achieved via the notion of r e s t r i c t i n g the c o m p l e x i t y o f a c c e p t a b l e i n p u t , so
that only constructions that are sure tc ne
Correctly analysed are permitted. This notion of
r e s t r i c t e d s y n t a x ~ has been t r i e d w i t h some success in larger systems ( c f . Elliston, IGYn:
Lawson, 107q:81f; Somers & HcNaught, I9~O:ao~,
r e s u l t i n g both in more a c c u r a t e t r a n s l a t i o n , and
in increased legibility from t~e human point of
view. AS Elliston p o i n t s out, the development e f
strict guidelines for writers leads not only t :
the use of simpler constructions, but also to =he
avoidance of potentially ambiguous text. In
either case, the benefits for ~ are obvious.
Less obvious however is the acceptability of such
constraints; yet 'restricted syntax' need noc
imply 'baby t a l k ' , and a reasonably extensive range of c o n s t r u c t i o n s can be included.
the s y n t a c t i c c o m p l e x i t y o f the i n p u t , so the c o r r e s p o n d i n g problem o f l e x i c a l d i s a m b i g u a t i o n chat l a r g e - s c a l e HT systems are faced w i t h can be eased by the n o t i o n of c o n t r o l l e d v o c a b u l a r y . A major problem f o r PIT is the c h o i c e o f a p p r o p r i a t e
t r a n s l a t i o n e q u i v a l e n t s a t the l e x i c a l l e v e l , a c h o i c e o f t e n d e t e r m i n e d by a v a r i e t y o f f a c t o r s a t a l l l i n g u i s t i c l e v e l s ( s y n t a x , s e m a n t i c s , p r a g m a t i c s ) . I n t h e f i e l d of m u l C i l i n g u a l t e r m i n o l o g y , t h i s problem has been t a c k l e d v i a the concept o f t e r m i n o l o g i c a l e q u i v a l e n c e ( W U s t e r ,
1971):
f o r a g i v e n concept i n one l a n g u a g e , at r a n s l a t i o n in a n o t h e r language is e s t a b l i s h e d , these being c o n s i d e r e d by d e f i n i t i o n t o be i n one- t o - o n e correspondence. I n the case o f Beds, where the s u b j e c t - m a t t e r o f the t e x t s to be t r a n s l a t e d is f i x e d , such an approach f o r the ' t e c h n i c a l terms' in the corpus is c l e a r l y f e a s i b l e ; the n o t i o n is extended as f a r as p o s s i b l e to g e n e r a l v o c a b u l a r y as w e l l . For each concept a s i n g l e t e r m o n l y i s p e r m i t t e d , and a l t h o u g h the r e s u l t i n g s t y l e may appear l e s s mature ( s i n c e the use o f near synonyms f o r the sake o f v a r i e t y i s not p e r m i t t e d ) , the problems d e s c r i b e d above are somewhat a l l e v i a t e d . Polysemy is noC e n t i r e l y a v o i d a b l e , b u t i f r e d u c e d co a b a r e minimum, and p e r m i t t e d o n l y i n s p e c i f i c and acknowledged c i r c u m s t a n c e s , the problem becomes more e a s i l y manageable.
D. I n t e r l i n ~ u a
A s i g n i f i c a n t dichotomy in PIT is between the ' t r a n s f e r ' and ' t n t e r l i n g u a ' approaches. The f o r m e r can be c h a r a c t e r i s e d by t h e u s e o f b i l i n g u a l t r a n s f e r modules w h i c h c o n v e r t the r e s u l t s o f the a n a l y s i s of the source language i n t o a r e p r e s e n t a t i o n a p p r o p r i a t e f o r a s p e c i f i c
t a r g e t language. This c o n t r a s t s wlth the
i n t e r l i n g u a avproach i n w h i c h the r e s u l t o f a n a l y s i s is passed d i r e c t l y co the a p p r o p r i a t e s y n t h e s i s module.
I t is beyond the scope o f the p r e s e n t paper to discuss in d e t a i l the r e l a t i v e m e r i t s o f the two approaches (see Vauquois, i 9 7 5 : l & 2 f f ; H u t c h i n s , lq78). I~ should however c o n s i d e r soma of the major o b s t a c l e s i n h e r e n t in the i n c e r l i n g u a approach.
The development o f an I n t e r l i n g u a f o r v a r i o u s purposes (noc o n l y t r a n s l a t i o n ) has been the s u b j e c t o f p h i l o s o p h i c a l debate f o r s o m e y e a r s , and p r o p o s a l s f o r ~T have i n c l u d e d the use o f f o r m a l i z e d n a t u r a l language ( e . g . Hel'~uk, Ig7&; Andreev, lg67), a r t i f i c i a l languages (like ~ s o e r a n c o ) , or v a r i o u s symbolic r e p r e s e n t a t i o n s , ~hecher linear (e.~. B U l c i n s , I061) o r otherwise ( e . ~ . "~ilks, 1073). Host of chess approaches are p r o b l e m a t i c however ( f o r a thorough d i s c u s s i o n of the l n c e r l i n g u a approach co ~ , see O f t e n & Pacak (1071) and Barnes (ig83)). Nevertheless, some i n c e r l i n g u a - b a s e d HT systems have been developed
co a considerable degree: f o r example, the
~ r e n o h l e team's first a t t e m p t s at wT cook this approach ( V e i l l o n , 106R), w h i l e the TITUS system s t i l l in use ac the À n s c i c u t T e x t i l e de France (Ducroc. Ig72; Zinge[, 1~78~ is claimed to be ( n c e r l i n ~ u , l - b a s e d .
151
I t seems t h a t i t can be assumed a p r i o r i thac an e n t i r e l y l a n g u a g e - i n d e p e n d e n t t h e o r e t i c a l r e p r e s e n t a t i o n o f a given t e x t is f o r all p r a c t i c a l purposes i m p o s s i b l e . A more r e a l i s t i c t a r g e t seems t o be a r e p r e s e n t a t i o n in which s i g n i f i c a n t s y n t a c t i c d i f f e r e n c e s between the languages in q u e s t i o n are n e u t r a l i z e d so chat the b e s t one can aim f o r is a l a n g u a g e s - s p e c i f i c ( s i c ) r e p r e s e n t a t i o n . This approach i m p l i e s the d e f i n i t i o n o f an I n t e r l i n g u a which cakes advantage o f a n y t h i n g the languages in the system have in common, w h i l e accomodating t h e i r i d i o s y n c r a s i e s . T h i s mains chat f o r • system which i n v o l v e s s e v e r a l fairly c l o s e l y r e l a t e d languages the i n t e r l i n s u a approach is a t l e a s t f e a s i b l e , on the u n d e r s t a n d i n g t h a t the i n t r o d u c t i o n o f a s i g n i f i c a n t l y d i f f e r e n t type o f language may i n v o l v e the complete r e d e f i n i t i o n o f the I n c e r l i n g u a ( B a r n e s , 1983). ~rom the p o i n t o f v i e w o f Beds, t h e n , the common base o f the languages i n v o l v e d can be used to g r e a t advantage. The n o t i o n o f r e s t r i c t e d s y n t a x d e s c r i b e d above can be employed to f i l t e r out c o n s t r u c t i o n s chac cause p a r t i c u l a r problems f o r the chosen I n t e r l i n g u a r e p r e s e n t a t i o n .
There remains however the problem o f ~he r e p r e s e n t a t i o n o f l e x i c a l items in the I n t e r l i n g u a . T h e o r e t i c a l approaches co t h i s problem ( e . g . Andreev, 1967) s e e m q u i t e u n s a t i s f a c t o r y . BuC the n o t i o n o f c o n t r o l l e d vocabulary" seems to o f f e r a s o l u t i o n . If a one- co-one e q u i v a l e n c e o f ' t e c h n i c a l ' terms can be a c h i e v e d , t h i s leaves o n l y a r e l a t i v e l y small area o f v o c a b u l a r y for which an i n c e r l i n g u a l r e p r e s e n t a t i o n must be d e v i s e d . I t seems r e a s o n a b l e , o n a small s c a l e , co t r e a t g e n e r a l
v o c a O u i a r y tn an enelagous way co t e c h n i c a l v o c a b u l a r y , in p a r t i c u l a r c r e a t i n g l e x i c a l items in one language t h a t are ambiguous w i t h r e s p e c t co any o f the ocher languages as 'homographs'. Their ' d i s a m b i g u a t i o n ' must cake p l a c e in A n a l y s i s as t h e r e is no b i l t g u a l ' T r a n s f e r ' phase, and
Synthesis is p u r e l y deterministic. While this
approach would be q u i t e u n s u i t a b l e f o r a l a r R e - s c a l e g e n e r a l purpose HT system, in the p r e s e n t c o n t e x t - where the problem can be minimised - ~c seems Co be a reasonable approach.
Our own model f o r the Bede t n C e r l i n g u a has noc y e t been f i n a l i s e d . We b e l i e v e t h i s co be an area f o r r e s e a r c h and e x p e r i m e n t a t i o n once the system software has b e e n m o r e f u l l y developed. ~ur c u r r e n t h y p o t h e s i s is chat the I n t e r l i n R u a w i l l cake the form of a canonical r e p r e s e n t a t i o n of the t e x t in which valency-houndness and (deep) ~ e w i l l p l a y a s i g n i f i c a n t r o l e . S e n t e n t i a l f e a t u r e s such as tense and aspect w i l l be capcured by ' u n i v e r s a l ' system o f values f o r the languages i n v o l v e d . This concepcion of an I n t e r l i n g u a c l e a r l y f a l l s s h o r t of the language-independent p i v o t r e p r e s e n t a t i o n t y p i c a l l y envisaged Ccf.
Boitet & NedobeJklne, 1980:2), but we hope :o
and v a l u a b l e b y - p r o d u c t o f the p r o j e c t as a w h o l e .
II. D E S C R I P T I O N OF THE S Y S T E M DESIGN In this second half of the paper we present a d e s c r i p t i o n of the t r a n s l a t i o n p r o c e s s in Bede, as it is c u r r e n t l y envisaged. The p r o c e s s is d i v i d e d b r o a d l y into two parts, a n a l y s i s and synthesis, the i n t e r f a c e b e t w e e n the two b e i n g p r o v i d e d by the Interlingua. The a n a l y s i s m o d u l e uses a C h a r t - l i k e s t r u c t u r e ( c f . K a p l a n , 1973) and a series o f g r a m m a r s to p r o d u c e from the source text the I n c e r l i n g u a tree structure w h i c h serves as input to synthesis, w h e r e it is r e a r r a n g e d into a v a l i d s u r f a c e s t r u c t u r e f o r the t a r g e t language. The ' t r a n s l a t i o n u n i t ' (TU) i s t a k e n co be the s e n t e n c e , or e q u i v a l e n t ( e . g . s e c t i o n h e a d i n g , title, figure caption). Full d e t a i l s of the rule formalisms are g i v e n in Somers (Ig81).
A. S t r l n ~ s e g m e n t a t i o n
The TU is first s u b j e c t e d t o a t w o - s t a g e s t r i n g - s e g m e n t a t i o n and 'lemmatlsation' analysis. I n the first stage it is c o m p a r e d word by w o r d w i t h a 'stop-list' of f r e q u e n t l y o c c u r r i n g w o r d s ( m o s t l y f u n c t i o n words); words not found in the stop-list undergo s t r i n g - s e g m e n t a t l o n analysis, again on a w o r d by w o r d basis. S t r i n g - s e g m e n t a t i o n rules form a f i n i t e - s t a t e g r a m m a r of a f f i x - s t r i p p i n g r u l e s ( ' A - r u l e s ' ) w h i c h handle m o s t l y i n f l e c t i o n a l m o r p h o l o g y . T h e output is a Chart w i t h l a b e l l e d arcs i n d i c a t i n g lexical unit (LU) and p o s s i b l e i n t e r p r e t a t i o n o£ the s t r i p p e d affixes, this 'hypothesis' to be c o n f i r m e d by d i c t i o n a r y look-up. By w a y of example, c o n s i d e r (I~, a p o s s i b l e French rule, w h i c h takes any w o r d ending in - i s s o n s (e.g. f i n i s s o n s or h 4 r i s s o n s ) and c o n s t r u c t s an arc on the Chart r e c o r d i n g the h y p o t h e s i s that the w o r d is an i n f l e c t e d form of an '-it' verb (i.e. finir or *h4rir).
(I) V + " - I S S O N S " ~ V ~ "-IR"
[PERS=I & N U M = P L U R & T E N S E = P R E S & H O O D = I N D I C ] At the end of d i c t i o n a r y l o o k - u p , a t e m p o r a r y ' s e n t e n c e d i c t i o n a r y ' is c r e a t e d , c o n s i s t i n g of copies of the d i c t i o n a r y e n t r i e s f o r ( o n l y ) those
LUs found in the current TU. This is purely an
e f f i c i e n c y measure. The sentence d i c t i o n a r y may of course include entries f o r h o m o g r a p h s w h i c h will later be r e j e c t e d .
B. S t r u c t u r a l a n a l y s i s I . ' P - r u l e s '
The chart then u n d e r g o e s a two-stage structural a n a l y s t s . In the f i r s t s t a g e , c o n t e x t - s e n s i t i v e augmented p h r a s e - s t r u c t u r e r u l e s ( ' P - r u l e s ' ) work towards c r e a t i n g a s i n g l e arc spanning the e n t i r e TU. Arcs are labelled with a p p r o p r i a t e syntactic c l a s s and s y n c a c t i c o - s e m a n t i c f e a t u r e i n f o r m a t i o n and a t r a c e o f the lower arcs which have been subsumed from w h i c h the parse tree can be simply extracted. The trivial P-rule (2) iS p r o v i d e d as an examnle.
(2)
< N U M ( D E T ) = N U M ( N ) & G D R ( D E T ) . I N T . G D R ( N ~ r.. ~ > DET + N -~ NP< G D R ( N P ) : = G D R ( N ) & N U M ( N P 3:=NLvM(N) • P - r u l e s c o n s i s t o f ' c o n d i t i o n s t i p u l a t i o n s ' , a ' g e o m e t r y ' , and ' a s s i g n m e n t s t i p u l a t i o n s ' . The nodes o f the Chart are by d e f a u l t i d e n t i f i e d by the v a l u e o f the a s s o c i a t e d v a r i a b l e CLASS, though i t is a l s o p o s s i b l e to r e f e r to a node by a l o c a l v a r i a b l e name and t e s t f o r or a s s i g n the v a l u e o f CLASS in the s t i p u l a t i o n s . Our r u l e f o r m a l i s m s are q u i t e d e l i b e r a t e l y d e s i g n e d to r e f l e c t the f o r m a l i s m s of t r a d i t i o n a l l i n g u i s t i c s .
This f o r m a l i s m allows e x p e r i m e n t a t i o n with a large n u m b e r of d i f f e r e n t c o n t e x t - f r e e p a r s i n g a l g o r i t h m s . We are in f a c t s t i l l e x p e r i m e n t i n g i n t h i s a r e a . For a s i m i l a r i n v e s t i g a t i o n , though on a m a c h i n e w i t h s i g n i f i c a n t l y d i f f e r e n t time and space c o n s t r a i n t s , see Slocum ( 1 9 8 1 ) .
2. 'T-rules'
I n the second stage o f s t r u c t u r a l a n a l y s i s , the t r e e s t r u c t u r e i m p l i e d by the l a b e l s and t r a c e s on these a r c s is d i s j o i n e d from the Char~ and undergoes g e n e r a l t r e e - C o - c r e e - t r a n s d u c t i o n s as d e s c r i b e d by ' T - r u l e s ' , r e s u l t i n g in a s i n g l e t r e e structure r e p r e s e n t i n g the c a n o n i c a l form of the TU.
• The f o r m a l i s m for the T - r u l e s is s i m i l a r co that f o r the P-rules, except in the g e o m e t r y part, w h e r e tree s t r u c t u r e s rather than arc s e q u e n c e s are defined. C o n s i d e r the n e c e s s a r i l y more complex (though still s i m p l i f i e d ) example (3~. w h i c h r e g u l a r i s e s a simple E n g l i s h passive. (3) < L U ( A U X ) = " B E " & P A R T ( V ) = P A S T P A R T &
L U ( P R E P ) = " B Y " & CASE(NP{2})=ACE?;T > S(NP{I} * AUX - V ÷ N P { 2 } ( P R E P . ~ )
s ( ~ P ( 2 } ( s ) ~ v + ~ p { l } ) < D S F ( N P { 2 } ) : = D S U J & V O I C E ( V ) : = P A S S V &
D S F ( N P { I } : = D O B J •
N o t i c e the n e c e s s i t y to 'disamb£Ruate' the two NPs via c u r l y - b r a c k e t t e d d i s a m b l R u a t o r s ; the p o s s i b i l i t y of d e f i n i n g a partial g e o m e t r y via the 'dummy' symbol ($~; and how the AUX and PREP are e l i m i n a t e d in the resulting tree s t r u c t u r e . L a b e l l i n g s for nodes are copied over by default unless s p e c i f i c a l l y suppressed.
With s o u r c e - l a n g u a g e LUs r e p l a c e d by unique m u l t i i i n g u a l - d i c t i o n a r y a d d r e s s e s , t h i s c a n o n i c a l r e p r e s e n t a t i o n is the I n t e r l i n g u a which is passed f o r s y n t h e s i s i n t o the t a r g e t l a n g u a g e ( s ~ .
C. S y n t h e s i s
the a n a l y s i s r u l e s c o u l d be simpLy r e v e r s e d ,
Once the d e s i r e d s t r u c t u r e has been a r r i v e d a t , the t r e e s undergo a s e r i e s o f c o n t e x t - s e n s i t i v e
rules used to assign mainly syntactic features co
the l e a v e s ( ' L - r u l e s ' ) , f o r example f o r the purpose o f a s s i g n i n g number and g e n d e r c o n c o r d ( e t c . ) . The f o r m a l i s m for the L - r u l e s i s a g l i n s i m i l a r t o t h a t f o r t h e p - r u l e s and T - r u l e s , the geOmett'y p e r t t h i s time d e f i n Y n g a s i n g l e t r e e s t r u c t u r e with no s t r u c t u r a l modification i m p l i e d . A s i m p l e example f o r German i s p r o v i d e d here ( 4 ) .
(4) <SF(NP)=SUBJ>
NP(Drr + N)
<CASE(DET):=NOH & CASE(N):=NOH & NI~H(DET):=NUH(NP) & GDR(DET):-GDR(N)> The llst o f l a b e l l e d l e a v e s r e s u l t i n g from the a p p l i c a t i o n o f L - r u l e s i s passed to m o r p h o l o g i c a l
synthesis (the superior branches are no longer
n e e d e d ) , where a f i n i t e - s t a t e grammar o f morpbographemic and a f f t x a t i o n r u l e s ( ' H - r u l e s ' )
is a p p l i e d to produce the t a r g e t s t r i n g . The f o r m a l i s m f o r H - r u l e s is much l e s s complex than the A - r u l e f o m e l i s m , the grammar b e i n g a g a i n s t r a i g h t f o r w a r d l y deterministic. The only t a x i n g r e q u i r e m e n t o f the M - r u l e f o r m a l i s m ( w h i c h , a t the ~ime o f w r i t i n g , has not been f i n a l i s e d ) i s t h a t i t must p e r m i t a wide v a r i e t y o f s t r i n g m a n i p u l a t i o n s to be d e s c r i b e d , and t h a t it must d e f i n e a t r a n s a p a r e n t i n t e r f a c e with the d i c t i o n a r y . A t y p i c a l r u l e f o r French f o r example might consist o f s t i p u l a t i o n s concerning i n f o r m a t i o n found b o t h on the l e a f in q u e s t i o n and i n the d i c t i o n a r y , as in ( 5 ) .
( 5 ) l e a f i n f o . :
CLASS.V;
TENSE.PRES; NUH.SING;PEgs-3; HOOD=INDIC
dict. info.: CONJ(V)=IRREG
assign: A f f i x "-T" to STEHI(V) D. G e n e r a l comments on system d e s i g n
The general m o d u l a r i t y of the system w i l l have been q u i t e e v i d e n t . A key f a c t o r , as mentioned above, is that each of these grammars is j u s t powerful enough for the cask required of It: thus no computing power is ' w a s t e d ' at any o f the i n t e r m e d i a t e s t a g e s .
At each interface between grammars o n l y a small p a r t o f the d a t a s t r u c t u r e s used by the d o n a t i n g module is r e q u i r e d by the r e c e i v i n g module. The 'unwanted' data s t r u c t u r e s are w r i t t e n to
peripheral store co enable recovery of partial
s~ructures in the case of f a i l u r e or
m i s t r a n s l a t i o n , though automatic backtracking to p r e v i o u s modules by the system as such is not e n v i s a g e d as a m a j o r component.
The ' s t a t i c ' d a t a used by the system consist o f the d i f f e r e n t s e t s o f l~nguistic r u l e packages, p l u s ~he d i c t i o n a r y . The system e s s e n t i a l l y has one l a r g e m u [ t i l i n g u a l d i c t i o n a r y from which
numerous software packages generate various
s u b d i c c i o n a r i e s as required either in the
:rans[acion process itself, or for lexicographers
working on the system. Alphabetical or other
structured language-specific listings can be
produced, while of course dictionary updating and
editing packages are a l s o p r o v i d e d .
The system as a whole can be viewed as a c o l l e c t i o n o f P r o d u c t i o n Systems (PSs) ( N e w e l l , 1973; D a v i s & K i n g , 1977; see a l s o Ashman ( 1 9 8 2 ) on the use o f PSs in HT) in the way t h a t the r u l e packages ( w h i c h , i n c i d e n t a l l y , as an e f f i c i e n t 7 i I ~ a l u t e , u n d e r g o s e p a r a t e s y n t a x v e r i f i c a t i o n and
' c o m p i l a t i o n ' i n t o i n t e r p r e t a b l e ' c o d e ' ) o p e r a t e on the d a t a s t r u c t u r e . The system d i f f e r s from the c l a s s i c a l PS s e t u p in d i s t r i b u t i n g i t s s t a t i c d a t a o v e r two d a t a b a s e s : the r u l e packages and the d i c t i o n a r y . The c o m b i n a t i o n o f the r u l e packages and the d i c t i o n a r y , the s o f t w a r e i n t e r f a c i n g t h e s e , end the r u l e i n t e r p r e t e r can however be c o n s i d e r e d as a n a l g o u s t o the rule i n t e r p r e t e r of a c l a s s i c a l P$.
IIl. CONCLUSION
As an experimental research project, Bede
provides us with an extremely varied range of
computational linguistics problems, ranging from
the p r i n c i p a l l y l i n g u i s t i c t a s k o f r u l e - w r i t i n g , t o the e s s e n t i a l l y c o m p u t a t i o n a l work of s o f t w a r e t m p l e n ~ l n c a t t o n , w i t h l e x i c o g r a p h y and t e r m i n o l o g y p l a y i n g t h e i r p a r t a l o n g the way.
g u t we hope t o o t h a t Bade i s more t h a n an academic e x e r c i s e , and t h a t we a r e making a s i g n i f i c a n t c o n t r i b u t i o n to a p p l i e d C o m p u t a t i o n a l l i n g u i s t i c s r e s e a r c h .
IV. ACKNOWLEDCHENTS
I p r e s e n t t h i s p a p e r o n l y as spokesman f o r a l a r g e g r o u p o£ p e o p l e who have w o r k e d , are w o r k i n g , o r w i l l work on Bede. T h e r e f o r e I would l i k e to t h a n k c o l l e a g u e s and s t u d e n t s at C . C . L . , p a s t , p r e s e n t , and future for their work on the p r o j e c t , and in p a r t i c u l a r Rod Johnson, Jock H c N e u g h c , Pete W h i t e l o c k , K i e r a n ~ i l b y , Tonz B a r n e s , Paul B e n n e t t and R e v e r l e y Ashman f o r he[~ with ~his w r i t e - u p . I of c o u r s e accept r e s p o n s i b i l i t y f o r any e r r o r s thac s l i p p e d t h r o u g h t h a t t i g h t n e t .
V. REFERENCES
Aho, A l f r e d V . , John E. H o p c r o f c & J e f f r e y B. Utlman. The d e s i g n and a n a l y s i s o f computer a l g o r i t h m s . Reading, H a s s . : A d d i s o n - : ; e s l e v .
I g 7 4 .
A n d r e e v , N.D. The i n t e r m e d i a r y language as the f o c a l p o i n t o f machine t r a n s l a t i o n . In A.D. Booth ( e d ) , Hachine T r a n s l a t i o n , Amsterdam: N o r t h - H o l l a n d , 1967, 1-27.
Barnes, Antonia M.N. An investigation into the syntactic structures of abstracts, and the definition of an 'interlingua' for their translation by machine. MSc thesis. Centre for Computational Linguistics, University of Manchester Institute of Science and Technology, 1983.
Boiler, C. & N. NedobeJkine. Russian-French at GETA: Outline of the method and derailed example (Rapport de Recherche No. 219). Grenoble: GETA, 1980.
B~Iting, R u d o l p h . A d o u b l e i n t e r m e d i a t e l a n g u a g e for Machine Translation. In Allen Kent (ed), Information Retrieval and Machine Translation, Part 2 (Advances in Documentation and Library Science, Volume III), New York: Interscience, 1961, I139-I144.
Boscad, Dale A. Quality control procedures in modification of the Air Force Russian-English MT system. In Veronica Lawson (ed), Practical Experience of Machine Translation, Amsterdam: North-Holland, 1982, 129-133.
Colmerauer, Alain. Les syst~mes-Q: ou un formalisme pour anal~ser et s~nthdciser des phrases sur ordinateur. (Publication interne no. 43). Moncr4al: Projet de Traduction Automatique de l'Universitd de Montr4al, 1970.
Davis, Randall & Jonathan King. An overview of Production Systems. In E.W. Elcock & D. Michie (eds), Machine Intelligence Volume 81 Machine representation of knowledBe, New York: Halated, 1977, 300-332.
Ducroc, J . M . Research for an automatic translation s~stem for the diffusion of scientific and technical textile documentation in English speaking countries: F i n a l report. Boulogne-Billancourt: Insticut Textile de France, I972.
Kernighan, Brian W. & Dennis ~I. Ritchi~. ~he C programmin K language. Eng|ewood Cli~fs, ~:J: Prentice-Hall, 1978.
King, M. EUROTRA - a European system for machine translation. Lebende Sprachen 1981, 26:12-1&.
King, M. & S. Perschke. EUROTRA and its object- ives. Multilin~ua, 1982, 1127-32.
Lawson, Veronica. Tigers and polar bears, or: translating and the computer. The Incorpnrated Linguist, 1979, 18181-85.
Lehmann, Winfred P., Winfield S. Bennett. Jonathan Slocum, Howard Smith, Solveig M.V. Pfluger & Sandra A. Eveland. The METAL system (RADC-TR- 80-37&). Austin, TX: Linguistics Research Center, University of Texas, 1080.
Marcus, Mitchell P. A theory of syntactic recognition for natural language, Cambridge, MA: MIT Press, 1980.
Melby, Alan K. Translators and machines - can they cooperate? META, 1981, 26:23-34.
Mel'~uk, I.A. Grammatical meanings in interlinguas for automatic translation and the concept of grammatical meaning. In V. Ju. Rozencvejg (ed), Machine Translation and Applied Linguistics, Volume I, Frankfurt am Main: Athenaion Verlag, 1974, 95-113.
Newell, A. Production systems: Models of control structures. In William G. Chase (ed) - Visual information processing, New York: Academic Press, 1973, ~63-526.
Otten, Hichael & Milos G. P a c a k . Intermediate languages for automatic language processina. In Julius T. Tou (edi, Software Engineering: CO~:~ I l l , Volume 2, New York: Academic Press, i c - I , 105-118.
Ellis(on, John S.C. Computer aided translation: a business viewpoint. In Barbara M. Snell (ed) - Translatin~ and the computer, Amsterdam: North- Holland, 197g, I~0-158.
Johnson, Rod. Contemporary perspectives in machine translation. In Suzanne Hanon & Vigge Hj~rneger Pedersen (edsl, Human translation machine t r a n s l a t i o n (Noter og Kommentarer 39). Odensel Romansk I n s t i t u t , Odense U n i v e r s i t e t , l O g O , 1 3 ~ - 1 ~ 7 .
Hutchins, W.j. Machine t r a n s l a t i o n and machine aided t r a n s l a t i o n . Journal of Documentation, 1978, 34:119-159.
Kaplan, Ronald N. A general syntactic processor. In Randall Rustin (ed), Natural Language Processin~ (Courant Computer Symposium Q~, New York: Algnrithmics Press, 1073, 103-2&I.
Kay, "larcin. The proper place of men and machines in language t r a n s l a [ i o n (Report in. CSL-80-ll). P a l o .\~Co, CA: Xerox, l g ~ O .
Richards, Martin & Colin Whitby-Screvens. BCPL - the language and i t s compiler. Camhridze: Cambridge University Press, I QTQ .
S[ocum, Jonathan. A practical comparison ~f parsin R strategies for Machine Translation and other Natural Language Processing Purposes. PhF dissertation. University of Texas at Austin, I981. [ = Technical Report NL-41, Department of Computer Sciences, The University of Texas, Austin, TX.]
Somers, H.L. Bede - the CCL,/I~IIST Machine Translation system: Rule-writinE formalism '3rd revision) (CCL/~IIST Report Xo. 8 1 ' 5 ' . M a n c h e s t e r : C e n t r e f o r C o m p u t a t i o n a l L i n g u i s t i c s , University of Manchester £nst~cute of Science and Technology, 1981.
S o m e r s , H . L . & d . H c N a u g h t . The t r a n s l a t o r a s
Sorcim. Pa__~sca_I/H u____se.r.'.sr~.fere.ns.e manua.[ _. :'alnur. Creek, C.%: Digic.~l '.lar!<ecing, 1°79.
TA[/~. L e sysr~me de craduction a u c o m a ~ q u e de l'Universit~ de Montreal (TA(Df). HETA0 1q73, la :227-2~O.
Toma, P,P. SYSTRAN as a m u l t i l i n g u a l ~achtne T r a n s l a t i o n system. In Commission of the European Communities, T h i r d ~uropean ConRress on I n f o r m a t i o n Systems and Plecwor~s=..Overcomtn~ the tan~uaRe b a r r i e r , Volume 1, HUnchen: Ver[ag Dokumencac~on, 1977, 569-581.
Vauquois, Bernard. La c r a d u c t i o n automaclque Grenoble (Documents de L / n g u i s c i q u e Q u a n t i t a t i v e 2~), P a r i s : Dunod, 1975.
Vauquots, B. Aspects of mechanical t r a n s l a t i o n in 1079 (Conference for Japan I ~ S c i e n t i f i c Program). Grenoble: Groupe d'Ecudes pour la T r a d u c t i o n Aucomatique, 1979o
Vauquois, Bernard. L ' t n f o r m a t i q u e au s e r v i c e de la ¢raduccion. ~ETA, 1981. 2 6 : 8 - 1 7 .
VeiZZon, C. D e s c r i p t i o n du Iangage p i v o t du sysCEme de c r a d u c t i o n automatique du C.E.T.A. ?.A. [ n f o r m a t i o n s , 1068, 1:B-17.
WtIks, Y o r i c k . An A r t i f i c i a l I n t e l l i g e n c e approach co Machine T r a n s l a t i o n . In Ro~er C. Schank & ~enneth ~ark Colby ( e d s ) , Computer models of chou~ht and language, San F r a n c i s c o : Freeman, lq73, 114-151.
~¢Oscer, ~uRen. B e g r i f f s - und T h e m a k I a s s t f i k - acionen: Uncerschiede in threm ~esen und in t h r e r Anwendung. ~achrtchcen f u r Dokumenc- aclon, t 9 7 1 , 22:qR-IO~.
Zin~e[, Hermann-losef. Experiences with TITUS II. Tn~ernatinna[ Classlfication, Ig7a, 5 : 3 3 - 3 7 .