LANGUAGE-BASED ENVIRONMENT FOR NATURAL LANGUAGE PARSING
Lehtola, A., Jäppinen, H., Nelimarkka, E.
SITRA Foundation (*) and Helsinki University of Technology, Helsinki, Finland
ABSTRACT
This paper introduces a special programming environment for the definition of grammars and for the implementation of corresponding parsers. In natural language processing systems it is advantageous to keep linguistic knowledge and processing mechanisms separate. Our environment accepts grammars consisting of binary dependency relations and grammatical functions. Well-formed expressions of functions and relations provide constituent surroundings for syntactic categories in the form of two-way automata. These relations, functions, and automata are described in a special definition language.
By focusing on high-level descriptions a linguist may ignore the computational details of the parsing process. The linguist writes the grammar as a DPL-description and a compiler translates it into efficient LISP-code. The environment also has a tracing facility for the parsing process, grammar-sensitive lexical maintenance programs, and routines for the interactive graphic display of parse trees and grammar definitions. Translator routines are also available for the transport of compiled code between various LISP-dialects. The environment itself currently exists in INTERLISP and FRANZLISP. This paper focuses on knowledge engineering issues and does not enter into linguistic argumentation.
INTRODUCTION
Our objective has been to build a parser for Finnish to work as a practical tool in real production applications. At the beginning of our work we were faced with two major problems. First, there was so far no formal description of Finnish grammar. Second, Finnish differs greatly in structure from the Indo-European languages. Finnish has relatively free word order, and syntactico-semantic knowledge in a sentence is often expressed in the inflections of the words. Therefore existing parsing methods for Indo-European languages (e.g. ATN, DCG, LFG) did not seem to grasp the idiosyncrasies of Finnish.
The parser system we have developed is based on functional dependency. A grammar is specified by a family of two-way finite automata and by dependency function and relation definitions. Each automaton expresses the valid dependency context of one constituent type. In an abstract sense the working storage of the parser consists of two constituent stacks and of a register which holds the current constituent (Figure 1).
[Figure 1 diagram: the register of the current constituent lies between the left constituent stack (L1, L2, L3) and the right constituent stack (R1, R2, R3).]
Figure 1. The working storage of DPL-parsers
(*) SITRA Foundation
[Figure 2 diagram: the state nodes of the verb automaton and their transitions. Labels recoverable from the diagram include the conditions +Phrase, -Phrase, +Nominal, -Nominal, "empty left-hand side" and "end of input", the functions Subject and Adverbial, and the operation nodes BUILD PHRASE ON RIGHT and FIND REGENT ON RIGHT.]
Notations: A state node of the automaton is labelled with its name; the question mark indicates the direction. A state transition is shown with its priority, the conditions for the dependent candidate (if not otherwise stated) and the connection function. Double circles are used to denote entries and exits of an automaton; inside is expressed the manner of operation.
Figure 2. A two-way automaton for Finnish verbs
The two stacks hold the right and left contexts of the current constituent. The parsing process is always directed by the expectations of the current constituent. Dynamic local control is realized by permitting the automata to activate one another. The basic decision for the automaton associated with the current constituent is to accept or reject a neighbor via a valid syntactico-semantic subordinate relation. Acceptance subordinates the neighbor, and it disappears from the stack. The structure an input sentence receives is an annotated tree of such binary relations.
An automaton for verbs is described in Figure 2. When a verb becomes the current constituent for the first time it will enter the automaton through the START node. The automaton expects to find a dependent from the left (?V). If the left [...] Subject and then for Object. When a function test succeeds, the neighbor will be subordinated and the verb advances to the state indicated by the arcs. The double circle states denote entry and exit points of the automaton.
If completed constituents do not exist as neighbors, an automaton may defer its decision. In Figure 2 the states labelled "BUILD PHRASE ON RIGHT" and "FIND REGENT ON RIGHT" push the verb to the left stack and pop the right stack for the current constituent. When the verb is activated later on, the control flow will continue from the state expressed in the deactivation command.
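The working storage and the two basic moves described above (subordinating a neighbor, and deferring by handing control to the right neighbor) can be pictured with a minimal Python sketch. It is not the authors' LISP implementation: the names `Constituent`, `WorkingStorage`, `subordinate_left` and `build_phrase_on_right` are assumptions made for illustration, and ambiguity handling is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Constituent:
    """A word or phrase together with the dependents it has subordinated."""
    lexeme: str
    state: str = "START"  # state from which its automaton resumes
    dependents: List["Constituent"] = field(default_factory=list)


@dataclass
class WorkingStorage:
    """Register C plus the left and right constituent stacks (cf. Figure 1)."""
    current: Optional[Constituent]
    left: List[Constituent] = field(default_factory=list)   # last item is L1
    right: List[Constituent] = field(default_factory=list)  # last item is R1

    def subordinate_left(self) -> None:
        """Accept L1 as a dependent of C; it disappears from the stack."""
        self.current.dependents.append(self.left.pop())

    def build_phrase_on_right(self, return_state: str) -> None:
        """Defer: park C on the left stack and make R1 the current constituent."""
        self.current.state = return_state
        self.left.append(self.current)
        self.current = self.right.pop()


# Hypothetical walk-through with lexemes taken from the example sentence.
ws = WorkingStorage(
    current=Constituent("TULLA"),
    left=[Constituent("POIKA")],
    right=[Constituent("KENTTÄ"), Constituent("ILTA")],
)
ws.subordinate_left()             # the verb accepts POIKA as a dependent
ws.build_phrase_on_right("VS?")   # the verb is parked; ILTA becomes current
```

When the parked verb is popped back into the register later, its `state` field tells the automaton where to continue, which mirrors the return state named in the deactivation command.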
The functions, relations and automata are expressed in a special conditional expression formalism DPL (for Dependency Parser Language). We believe that DPL might find applications in other inflectional languages as well.
DPL-DESCRIPTIONS
The main object in DPL is a constituent. A grammar specification opens with the structural descriptions of constituents and the allowed property names and property values. The user may specify simple properties, features or categories. The structures of the lexical entries are also defined at the beginning. The syntax of these declarations can be seen in Figure 3.
All properties of constituents may be referred to in a uniform manner directly by their values. The system automatically takes into account the computational details associated with property types. For example, the system automatically takes note of the inheritance of properties in their hierarchies. Extensive support for multidimensional analysis has been one of the central objectives in the design of the DPL-formalism. Patterning can be done in multiple dimensions and the property set associated with constituents can easily be extended.
An example of a constituent structure and its property definitions is given in Figure 4. The description states first that each constituent contains Function, Role, ConstFeat, PropOfLexeme and MorphChar. The next two definitions further specify PropOfLexeme and MorphChar. In the last part the definition of a category tree SemCat is given. This tree has sets of property values associated with its nodes. The DPL-system automatically takes care of their inheritance. Thus for a constituent that belongs to the semantic category Human the system automatically associates the feature values +Hum, +Anim, +Countable, and +Concr.
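The inheritance along the category tree can be illustrated with a short Python sketch. The node names, features and father links are those of the SemCat tree in Figure 4; the dictionary layout and the function name `inherited_features` are assumptions made for illustration and say nothing about how the DPL-system stores the tree internally.

```python
# SemCat tree of Figure 4: node -> (own feature values, father node or None).
SEM_CAT = {
    "Entity":   ((), None),
    "Concrete": (("+Concr",), "Entity"),
    "Animate":  (("+Anim", "+Countable"), "Concrete"),
    "Human":    (("+Hum",), "Animate"),
    "Animals":  ((), "Animate"),
    "NonAnim":  ((), "Concrete"),
    "Matter":   (("-Countable",), "NonAnim"),
    "Thing":    (("+Countable",), "NonAnim"),
}


def inherited_features(node, tree=SEM_CAT):
    """Collect the features of a node and of all its ancestors, nearest first."""
    features = []
    while node is not None:
        own, father = tree[node]
        features.extend(own)
        node = father
    return features


print(inherited_features("Human"))  # ['+Hum', '+Anim', '+Countable', '+Concr']
```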
The binary grammatical functions and relations are defined using the syntax in Figure 5. A DPL-function returns as its value the binary construct built from the current constituent (C) and its dependent candidate (D), or it returns NIL. DPL-relations return as their values the pairs of C and D constituents that have passed the associated predicate filter. By choosing operators a user may vary a predication between simple equality (=) and equality with ambiguity elimination (=:=). Operators := and :- denote replacement and insertion, respectively. In predicate expressions angle brackets signal the scope of an implicit OR-operator and parentheses that of an
<constituent structure> ::= ( CONSTITUENT: <list of properties>.. )
<subtree of constituent> ::= ( SUBTREE: <glue node> <list of properties> ) |
                             ( LEXICON-ENTRY: <glue node> <list of properties> )
<list of properties> ::= ( <list of properties>.. ) | ( <property name>.. )
<property name> ::= <type name> | <glue node name>
<type name> ::= <unique lisp atom>
<glue node name> ::= <unique lisp atom>
<glue node> ::= <glue node name in upper level>
<property declaration> ::= ( PROPERTY: <type name> <possible values> ) |
                           ( FEATURE: <type name> <possible values> ) |
                           ( CATEGORY: <type name> < <node definition>.. > )
<possible values> ::= < <default value> <unique lisp atom>.. >
<default value> ::= NoDefault | <unique lisp atom>
<node definition> ::= ( <node name> <feature set> <father node> )
<node name> ::= <unique lisp atom>
<feature set> ::= ( <feature value> ) | <empty>
<father node> ::= / <name of an already defined node> | <empty>
<empty> ::=
(CONSTITUENT: ( Function Role ConstFeat PropOfLexeme MorphChar ) )
(LEXICON-ENTRY: PropOfLexeme
    ( ( SyntCat SyntFeat ) ( SemCat SemFeat ) ( FrameCat LexFrame ) AKO ) )
(SUBTREE: MorphChar
    ( Polar Voice Modal Tense Comparison Number Case PersonN PersonP Clit1 Clit2 ) )
(CATEGORY: SemCat
    < ( Entity )
      ( Concrete ( +Concr ) / Entity )
      ( Animate ( +Anim +Countable ) / Concrete )
      ( Human ( +Hum ) / Animate )
      ( Animals / Animate )
      ( NonAnim / Concrete )
      ( Matter ( -Countable ) / NonAnim )
      ( Thing ( +Countable ) / NonAnim ) > )
Figure 4. An example of a constituent structure specification and the definition of a category tree
implicit AND-operator. An arrow triggers defaults on: the elements of expressions to the right of an arrow are in the OR-relation and those to the left of it are in the AND-relation. Two kinds of arrows are in use. A simple arrow (->) performs all operations on the right and a double arrow (=>) terminates the execution at the first successful operation.
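One possible reading of the two arrows can be sketched in Python. This is an interpretation of the prose above, not the DPL-compiler's actual evaluation scheme: the function `run_rule`, the helper `traced`, and the representation of conditions and operations as zero-argument callables are all assumptions made for illustration.

```python
def run_rule(conditions, arrow, operations):
    """Evaluate one rule of the form: conditions ARROW operations.

    conditions: zero-argument predicates, combined with AND (left of the arrow).
    operations: zero-argument callables returning True on success,
                combined with OR (right of the arrow).
    arrow: "->" runs every operation; "=>" stops at the first success.
    """
    if not all(cond() for cond in conditions):
        return False
    succeeded = False
    for operation in operations:
        if operation():
            succeeded = True
            if arrow == "=>":
                break
    return succeeded


def traced(name, result, log):
    """Build an operation that records its name in `log` and returns `result`."""
    def operation():
        log.append(name)
        return result
    return operation


simple_log, double_log = [], []
always = [lambda: True]
run_rule(always, "->", [traced("a", False, simple_log),
                        traced("b", True, simple_log),
                        traced("c", True, simple_log)])
run_rule(always, "=>", [traced("a", False, double_log),
                        traced("b", True, double_log),
                        traced("c", True, double_log)])
print(simple_log)  # ['a', 'b', 'c']
print(double_log)  # ['a', 'b']
```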
Figure 6 gives an example of how one may define Subject. If the relation RecSubj holds between the regent and the dependent candidate, the latter will be labelled Subject and subordinated to the former. The relational expression RecSubj defines the property patterns the constituents should match.
A grammar definition ends with the context specifications of constituents expressed as two-way automata. The automata are described using the notation shown in somewhat simplified form in Figure 7. An automaton can refer to up to three constituents to the right or left using indexed names: L1, L2, L3, R1, R2 or R3.
<function> ::= ( FUNCTION: <function name> <operation expr> )
<relation> ::= ( RELATION: <relation name> <operation expr> )
<operation expr> ::= ( <predicate expr>.. <impl> <operation expr>.. ) |
                     <predicate expr> | <relation name> |
                     ( DEL <constituent label> )
<predicate expr> ::= < <predicate expr> > |
                     ( <predicate expr> ) |
                     ( <constituent pointer> <operator> <value expr> )
<impl> ::= -> | =>
<constituent label> ::= C | D
<operator> ::= = | := | :- | =:=
<value expr> ::= < <value expr>.. > |
                 ( <value expr>.. ) | <value of some property> |
                 '<lexeme> |
                 ( <property name> <constituent label> )
(FUNCTION: Subject
    ( RecSubj -> (D := Subject) ) )

(RELATION: RecSubj
    ( (C = Act <Ind Cond Pot Imper>) (D = -Sentence +Nominal)
      -> ( (D = Nom)
           -> (D = PersPron (PersonP C) (PersonN C))
              ( (D = Noun) (C = 3P)
                -> ( (C = S) (D = SG) )
                   ( (C = P) (D = PL) ) ) )
         ( (D = Part) (C = S 3P)
           -> ( (C = 'OLLA) => (C :- +Existence) )
              ( (C = -Transitive +Existence) ) ) ) )
Figure 6. A realisation of Subject
<state in autom.> ::= ( STATE: <state name> <direction> <state expr>.. )
<direction> ::= LEFT | RIGHT
<state expr> ::= ( <lhs of s.expr> <impl> <state expr>.. ) |
                 ( <lhs of s.expr> <impl> <state change> )
<lhs of s.expr> ::= <function name> | <predicate expr>..
<state change> ::= ( C := <name of next state> ) |
                   ( FIND-REG-ON <direction> <sstate ch.> ) |
                   ( BUILD-PHRASE-ON <direction> <sstate ch.> ) |
                   ( PARSED )
<state change> ::= <worksp. manip.> <state change>
<sstate ch.> ::= ( C := <name of return state> )
<worksp. manip.> ::= ( DEL <constituent label> ) |
                     ( TRANSPOSE <constituent label> <constituent label> )
Figure 7. Simplified syntax of state specifications
( STATE: V? RIGHT
    ( (D = +Phrase) -> ( Subject -> (C := VS?) )
                       ( Object -> (C := VO?) )
                       ( Adverbial -> (C := V?) )
                       ( T => (C := ?VFinal) ) )
    ( (D = -Phrase) -> ( BUILD-PHRASE-ON RIGHT (C := V?) ) ) )
The direction of a state (see Figure 2) normally selects the dependent candidate as L1 or R1. A switch of state takes place by an assignment in the same way as linguistic properties are assigned. As an example the node V? of Figure 2 is defined formally in Figure 8.
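The decision made in state V? can be paraphrased in Python as follows. This is a simplified sketch and not generated parser code: the function name `state_v_right` and its argument conventions are assumptions, and the sketch returns only the first alternative that succeeds, whereas the simple arrow in the DPL definition performs all operations on its right-hand side.

```python
def state_v_right(d_is_phrase, functions):
    """Choose the next move in state V? for the right-hand candidate D.

    d_is_phrase: True when D carries +Phrase, False when it carries -Phrase.
    functions: maps a function name to a zero-argument test that returns
               True when the function succeeds in connecting D to C.
    Returns a pair (action, next_state).
    """
    if not d_is_phrase:
        # D is not yet a complete phrase: let it build itself first and
        # come back to V? when the verb is reactivated.
        return ("BUILD-PHRASE-ON RIGHT", "V?")
    transitions = (("Subject", "VS?"), ("Object", "VO?"), ("Adverbial", "V?"))
    for name, next_state in transitions:
        if functions[name]():
            return (name, next_state)
    # The default branch T always holds and leads to ?VFinal.
    return ("T", "?VFinal")


# Hypothetical candidate that can only be connected as an Object.
tests = {"Subject": lambda: False, "Object": lambda: True, "Adverbial": lambda: False}
print(state_v_right(True, tests))  # ('Object', 'VO?')
```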
More linguistically oriented argumentation of the DPL-formalism appears elsewhere (Nelimarkka, 1984a, and Nelimarkka, 1984b).
THE ARCHITECTURE OF THE DPL-ENVIRONMENT
The architecture of the DPL-environment is described schematically in Figure 9. The main parts are highlighted by heavy lines. Single arrows represent data transfer; double arrows indicate the production of data structures. All modules have been implemented in LISP. The realisations do not rely on specifics of underlying LISP-environments.
The DPL-compiler
A compilation results in the executable code of a parser. The compiler produces highly optimized code (Lehtola, 1984). Internally, data structures are only partly dynamic in order to make information fetches fast. Ambiguities are expressed locally to minimize redundant search. The principle of structure sharing is followed whenever new data structures are built. In the manipulation of constituent structures there exists a special service routine for each combination of property and predication types. These routines take special care of time and memory consumption. For instance, with regard to replacements and insertions, the copying physically includes only the path from the root of the list structure to the changed sublist. The logically shared parts will also be shared physically. This stipulation minimizes memory usage.
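The copying discipline described above is what is commonly called path copying. A minimal Python sketch, using nested tuples in place of LISP list structure, is given below; the function name `replace_at` and the sample property values are assumptions made for illustration and are not taken from the compiler.

```python
def replace_at(tree, path, new_value):
    """Return a copy of a nested tuple with the element at `path` replaced.

    Only the tuples on the path from the root to the changed element are
    rebuilt; every other subtree is reused as is, not copied.
    """
    if not path:
        return new_value
    index, rest = path[0], path[1:]
    changed = replace_at(tree[index], rest, new_value)
    return tree[:index] + (changed,) + tree[index + 1:]


# Hypothetical constituent: lexeme, (function, role), morphological properties.
morph = ("Pos", "SG", "Nom")
old = ("POIKA", ("Subject", "Neutral"), morph)
new = replace_at(old, (1, 0), "Object")

print(new[1])              # ('Object', 'Neutral')
print(old[1])              # ('Subject', 'Neutral'), the original is untouched
print(new[2] is old[2])    # True: the unchanged sublist is shared physically
```

Because the old structure is never modified, alternative interpretations of an ambiguous constituent can keep pointing at it while sharing all unchanged parts.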
At the state transition network level the search is done depth first. To handle ambiguities, DPL-functions and -relations process all alternative interpretations in parallel. In fact the alternatives are stored in the stacks and in the C-register as trees of alternants.
In the first version of the DPL-compiler the generation rules were intermixed with the compiler code. The maintenance of the compiler grew harder when we experimented with new computational features. We therefore started to develop a metacompiler in which compilation is defined by rules. At the moment we are testing it and soon it will be in everyday use. The amount of LISP-code has been greatly reduced with the rule-based approach, and we are now planning to install the DPL-environment on the IBM PC.
[Figure 9 diagram: the architecture of the DPL-environment. Module labels recoverable from the diagram: parser, facility, lexicon maintenance, information extraction system.]
Our parsers were intended to be practical tools in real production applications. It was hence important to make the produced programs transferable. As of now we have a rule-based translator which converts parsers between LISP dialects. The translator currently accepts INTERLISP, FranzLISP and Common Lisp.
Lexicon and its Maintenance
The environment has a special maintenance program for lexicons. The program uses video graphics to ease updating and it performs various checks to guarantee the consistency of the lexical entries. It also co-operates with the information extraction system to help the user in the selection of properties.
The Tracing Facility
The tracing facility is a convenient tool for grammar debugging. For example, Figure 10 shows the trace of the parsing of the sentence "Poikani tuli illalla kentältä heittämästä kiekkoa." (= "My son
( T POIKANI TULI ILLALLA KENTÄLTÄ HEITTÄMÄSTÄ KIEKKOA . )
[...] conses
.03 seconds
0.0 seconds, garbage collection time
PARSED
_PATH ( )
=> (POIKA) (TULLA) (ILTA) (KENTTÄ) (HEITTÄÄ) (KIEKKO)                 ?N
(POIKA) <= (TULLA) (ILTA) (KENTTÄ) (HEITTÄÄ) (KIEKKO)                 N?
=> (POIKA) (TULLA) (ILTA) (KENTTÄ) (HEITTÄÄ) (KIEKKO)                 ?NFinal
(##) (POIKA) (TULLA) (ILTA) (KENTTÄ) (HEITTÄÄ) (KIEKKO)               NIL
(POIKA) => (TULLA) (ILTA) (KENTTÄ) (HEITTÄÄ) (KIEKKO)                 ?V
=> ((POIKA) TULLA) (ILTA) (KENTTÄ) (HEITTÄÄ) (KIEKKO)                 ?VS
((POIKA) TULLA) <= (ILTA) (KENTTÄ) (HEITTÄÄ) (KIEKKO)                 VS?
((POIKA) TULLA) => (ILTA) (KENTTÄ) (HEITTÄÄ) (KIEKKO)                 ?N
((POIKA) TULLA) (ILTA) <= (KENTTÄ) (HEITTÄÄ) (KIEKKO)                 N?
((POIKA) TULLA) => (ILTA) (KENTTÄ) (HEITTÄÄ) (KIEKKO)                 ?NFinal
((POIKA) TULLA) <= (ILTA) (KENTTÄ) (HEITTÄÄ) (KIEKKO)                 VS?
((POIKA) TULLA (ILTA)) <= (KENTTÄ) (HEITTÄÄ) (KIEKKO)                 VS?
((POIKA) TULLA (ILTA)) => (KENTTÄ) (HEITTÄÄ) (KIEKKO)                 ?N
((POIKA) TULLA (ILTA)) (KENTTÄ) <= (HEITTÄÄ) (KIEKKO)                 N?
((POIKA) TULLA (ILTA)) => (KENTTÄ) (HEITTÄÄ) (KIEKKO)                 ?NFinal
((POIKA) TULLA (ILTA)) <= (KENTTÄ) (HEITTÄÄ) (KIEKKO)                 VS?
((POIKA) TULLA (ILTA) (KENTTÄ)) <= (HEITTÄÄ) (KIEKKO)                 VS?
((POIKA) TULLA (ILTA) (KENTTÄ)) => (HEITTÄÄ) (KIEKKO)                 ?V
((POIKA) TULLA (ILTA) (KENTTÄ)) (HEITTÄÄ) <= (KIEKKO)                 V?
((POIKA) TULLA (ILTA) (KENTTÄ)) (HEITTÄÄ) => (KIEKKO)                 ?N
((POIKA) TULLA (ILTA) (KENTTÄ)) (HEITTÄÄ) (KIEKKO) <=                 N?
((POIKA) TULLA (ILTA) (KENTTÄ)) (HEITTÄÄ) => (KIEKKO)                 ?NFinal
((POIKA) TULLA (ILTA) (KENTTÄ)) (HEITTÄÄ) <= (KIEKKO)                 V?
((POIKA) TULLA (ILTA) (KENTTÄ)) (HEITTÄÄ (KIEKKO)) <=                 VO?
((POIKA) TULLA (ILTA) (KENTTÄ)) => (HEITTÄÄ (KIEKKO))                 ?VFinal
((POIKA) TULLA (ILTA) (KENTTÄ)) <= (HEITTÄÄ (KIEKKO))                 VS?
((POIKA) TULLA (ILTA) (KENTTÄ) (HEITTÄÄ (KIEKKO))) <=                 VS?
=> ((POIKA) TULLA (ILTA) (KENTTÄ) (HEITTÄÄ (KIEKKO)))                 ?VFinal
((POIKA) TULLA (ILTA) (KENTTÄ) (HEITTÄÄ (KIEKKO))) <=                 MainSent?
((POIKA) TULLA (ILTA) (KENTTÄ) (HEITTÄÄ (KIEKKO))) <=                 MainSent? OK
DONE
came back in the evening from the stadium where he had been throwing the discus."). Each row represents a state of the parser before the control enters the state mentioned in the right-hand column. The constituents found thus far are shown by the parentheses. An arrowhead points from a dependent candidate (one which is subjected to dependency tests) towards the current constituent.
The tracing facility also gives the consumed CPU-time and two quality indicators: search efficiency and connection efficiency. Search efficiency is 100% if no useless state transitions took place in the search. This figure is meaningless when the system is parameterized to full search, because then all transitions are tried.
Connection efficiency is the ratio of the number of connections remaining in a result to the total number of connections attempted for it during the search. We are currently developing other measuring tools to extract statistical information, e.g. about the frequency distribution of different constructs. Also under development is automatic book-keeping of all sentences input to the system. These will be divided into two groups: parsed and not parsed. The first group constitutes growing test material to ensure monotonic improvement of grammars: after a non-trivial change is made in the grammar, a newly compiled parser runs all test sentences and the results are compared to the previous ones.
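The connection efficiency ratio and the planned comparison of test results can be sketched in a few lines of Python. The function names, the dictionary-based bookkeeping and all numbers below are assumptions made for illustration; they are not taken from the tracing facility.

```python
def connection_efficiency(remaining, attempted):
    """Ratio of connections kept in the result to connections attempted."""
    if attempted == 0:
        raise ValueError("no connections were attempted")
    return remaining / attempted


def changed_sentences(previous, current):
    """Test sentences whose parse differs from the one recorded earlier.

    previous, current: dictionaries mapping a sentence to its parse result.
    """
    return sorted(s for s in previous if current.get(s) != previous[s])


# Made-up figures: 5 connections remain in the result out of 8 attempted.
print(connection_efficiency(5, 8))  # 0.625

# Made-up regression run over two previously parsed sentences.
before = {"sentence 1": "parse A", "sentence 2": "parse B"}
after = {"sentence 1": "parse A", "sentence 2": "parse C"}
print(changed_sentences(before, after))  # ['sentence 2']
```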
Information Extraction System
In an actual working situation there may be thousands of linguistic symbols in the work space. To make such a complex manageable, we have implemented an information system that for a given symbol pretty-prints all information associated with it.
The environment has routines for the graphic display of parsing results. A user can select information by pointing with the cursor. The example in Figure 11 demonstrates the use of this facility. The command SHOW() inquires the results of
_SHOW ( )
(POIKANI) (TULI) (ILLALLA) (KENTÄLTÄ) (HEITTÄMÄSTÄ) (KIEKKOA)    START
((POIKA) TULLA (ILTA) (KENTTÄ) (HEITTÄÄ (KIEKKO)))

[Tree display: TULLA is the root. Its dependents are the opened node POIKA (Subject, (Ergative Neutral)), ILTA (Adverbial, TimePred), KENTTÄ (Adverbial, Ablative) and HEITTÄÄ (Adverbial), which in turn has the dependent KIEKKO (Object, Neutral).]

Properties of the opened node POIKA:
    Function    Subject
    Role        (Ergative Neutral)
    FrameFeat   (NIL)
    Polar       (Pos)
    Voice       (NIL)
    Modal       (NIL)
    Tense       (NIL)
    Comparison  (NilCompar)
    Number      (SG)
    Case        (Nom)
    PersonN     (S)
    PersonP     (1P)
    Clit1       (NIL)
    Clit2       (NIL)

ConstFeat is a linguistic feature type.
Default value: -Phrase
Associated values: ( +Declarative -Declarative +Main -Main +Nominal -Nominal +Phrase -Phrase +Predicative -Predicative +Relative -Relative +Sentence -Sentence )
Associated functions:
the parsing process described in Figure 10. The system replies by first printing the start state and then the found result(s) in compressed form. The cursor has been moved on top of this parse and CTRL-G has been typed. The system now draws the picture of the tree structure. Subsequently one of the nodes has been opened. The properties of the node POIKA appear pretty-printed. The user has furthermore asked for information about the property type ConstFeat. All these operations are general; they do not use the special features of any particular terminal.
CONCLUSION
The parsing strategy applied for the DPL-formalism was originally viewed as a cognitive model. It has proved to yield practical and efficient parsers as well. Experiments with a non-trivial set of Finnish sentence structures have been performed both on DEC-2060 and on VAX-11/780 systems. The analysis of an eight-word sentence, for instance, takes between 20 and 600 ms of DEC CPU-time in the INTERLISP-version, depending on whether one wants only the first parse or, through complete search, all parses for structurally ambiguous sentences. The MacLISP-version of the parser runs about 20 % faster on the same computer. The NIL-version (Common Lisp compatible) is about 5 times slower on VAX. The whole environment has also been transferred to FranzLISP on VAX. We have not yet focused on optimality issues in grammar descriptions. We believe that rearranging the orderings of expectations in the automata will improve efficiency.
REFERENCES
1. Lehtola, A., Compilation and Implementation of 2-way Tree Automata for the Parsing of Finnish. M.Sc. Thesis, Helsinki University of Technology, Department of Physics, 1984, 120 p. (in Finnish)
2. Nelimarkka, E., Jäppinen, H. and Lehtola, A., Two-way Finite Automata and Dependency Theory: A Parsing Method for Inflectional Free Word Order Languages. Proc. COLING84/ACL, Stanford, 1984a, pp. 389-392.
3. Nelimarkka, E., Jäppinen, H. and Lehtola, A., Parsing an Inflectional Free Word Order Language with Two-way Finite Automata. Proc. of the 6th European Conference on Artificial Intelligence, Pisa, 1984b, pp. 167-176.
4. Winograd, T., Language as a Cognitive Process. Volume I: Syntax,