Generalized Left-Corner Parsing


Mark-Jan Nederhof*
University of Nijmegen, Department of Computer Science
Toernooiveld, 6525 ED Nijmegen, The Netherlands
markjan@cs.kun.nl

* Supported by the Dutch Organization for Scientific Research (NWO), under grant 00-62-518.

Abstract

We show how techniques known from generalized LR parsing can be applied to left-corner parsing. The resulting parsing algorithm for context-free grammars has some advantages over generalized LR parsing: the sizes and generation times of the parsers are smaller, the produced output is more compact, and the basic parsing technique can more easily be adapted to arbitrary context-free grammars. The algorithm can be seen as an optimization of algorithms known from existing literature. A strong advantage of our presentation is that it makes explicit the role of left-corner parsing in these algorithms.

Keywords: Generalized LR parsing, left-corner parsing, chart parsing, hidden left recursion.

1 Introduction

Generalized LR parsing was first described by Tomita [Tomita, 1986; Tomita, 1987]. It has been regarded as the most efficient parsing technique for context-free grammars. The technique has been adapted to other formalisms than context-free grammars in [Tomita, 1988].

A useful property of generalized LR parsing (henceforth abbreviated to GLR parsing) is that input is parsed in polynomial time. To be exact, if the length of the right side of the longest rule is p, and if the length of the input is n, then the time complexity is O(n^(p+1)). Theoretically, this may be worse than the time complexity of Earley's algorithm [Earley, 1970], which is O(n^3). For practical cases in natural language processing, however, GLR parsing seems to give the best results.

The polynomial time complexity is established by using a graph-structured stack, which is a generalization of the notion of parse stack, in which pointers are used to connect stack elements. If nondeterminism occurs, then the search paths are investigated simultaneously, where the initial part of the parse stack which is common to all search paths is represented only once. If two search paths share the state of the top elements of their imaginary individual parse stacks, then the top element is represented only once, so that any computation which thereupon pushes elements onto the stack is performed only once.

Another useful property of GLR parsing is that the output is a concise representation of all possible parses, the so-called parse forest, which can be seen as a generalization of the notion of parse tree. (By some authors, parse forests are more specifically called shared, shared-packed, or packed shared (parse) forests.) The parse forests produced by the algorithm can be represented using O(n^(p+1)) space. Efficient decoration of parse forests with attribute values has been investigated in [Dekkers et al., 1992].

There are, however, some drawbacks to GLR parsing. In order of decreasing importance, these are:
• The parsing technique is based on the use of LR tables, which may be very large for grammars describing natural languages.(1) Related to this is the large amount of time needed to construct a parser. Incremental construction of parsers may in some cases alleviate this problem [Rekers, 1992].

• The parse forests produced by the algorithm are not as compact as they might be. This is because packing of subtrees is guided by the merging of search paths due to equal LR states, instead of by the equality of the derived nonterminals. The solution presented in [Rekers, 1992] implies much computational overhead.

• Adapting the technique to arbitrary grammars requires the generalization to cyclic graph-structured stacks [Nozohoor-Farshi, 1991], which may complicate the implementation.

• A minor disadvantage is that the theoretical time complexity worsens if p becomes larger. The solution given in [Kipps, 1991] to obtain a variant of the parsing technique which has a fixed time complexity of O(n^3), independent of p, implies an overhead in computation costs which worsens instead of improves the time complexity in practical cases.

(1) [Purdom, 1974] argues that grammars for programming languages require LR tables whose size is about linear in the size of the grammar. It is generally considered doubtful that similar observations can be made for grammars for natural languages.

These disadvantages of generalized LR parsing are mainly consequences of the LR parsing technique, more than consequences of the use of graph-structured stacks and parse forests.

Lang [Lang, 1974; Lang, 1988c] gives a general construction of deterministic parsing algorithms from nondeterministic push-down automata. The produced data structures have a strong similarity to parse forests, as argued in [Billot and Lang, 1989; Lang, 1991]. The general idea of Lang has been applied to other formalisms than context-free grammars in [Lang, 1988a; Lang, 1988b; Lang, 1988d].

The idea of a graph-structured stack, however, does not immediately follow from Lang's construction. Instead, Lang uses the abstract notion of a table to store information, without trying to find the best implementation for this table.(2)

One of the parsing techniques which can with some minor difficulties be derived from the construction of Lang is generalized left-corner parsing (henceforth abbreviated to GLC parsing).(3) The starting-point is left-corner parsing, which was first formally defined in [Rosenkrantz and Lewis II, 1970]. Generalized left-corner parsing, albeit under a different name, was first investigated in [Pratt, 1975]. (See also [Tanaka et al., 1979; Bear, 1983; Sikkel and Op den Akker, 1992].) In [Shann, 1991] it was shown that the parsing technique can be a serious rival to generalized LR parsing with regard to the time complexities. (Other papers discussing the time complexity of GLC parsing are [Slocum, 1981; Wirén, 1987].)

(2) [Sikkel, 1990] argues that the way in which the table is implemented (using a two-dimensional matrix, as in the case of Earley's algorithm, or using a graph-structured stack) is only of secondary importance to the global behaviour of the parsing algorithm.

(3) The term "generalized left-corner parsing" has been used before in [Demers, 1977] for a different parsing technique. Demers generalizes the "left corner of a right side" to be a prefix of a right side which does not necessarily consist of one member, whereas we generalize LC parsing with zero lookahead to grammars which are not LC(0).
A functional variant of GLC parsing for definite clause grammars has been discussed in [Matsumoto and Sugimura, 1987]. This algorithm does not achieve a polynomial time complexity, however, because no "packing" takes place.

A variant of Earley's algorithm discussed in [Leiss, 1990] is also very similar to GLC parsing, although the top-down nature of Earley's algorithm is preserved.

GLC parsing has been rediscovered a number of times (e.g. in [Leermakers, 1989; Leermakers, 1992], [Schabes, 1991], and [Perlin, 1991]), but without any mention of the connection with LC parsing, which made the presentations unnecessarily difficult to understand. This also prevented discovery of a number of optimizations which are obvious from the viewpoint of left-corner parsing.

In this paper we reinvestigate GLC parsing in combination with graph-structured stacks and parse forests. It is shown that this parsing technique is not subject to the four disadvantages of the algorithm of Tomita.

The structure of this paper is as follows. In Section 2 we explain nondeterministic LC parsing. This parsing algorithm is the starting-point of Section 3, which shows how a deterministic algorithm can be defined which uses a graph-structured stack and produces parse forests. Section 4 discusses how this generalized LC parsing algorithm can be adapted to arbitrary context-free grammars.

How the algorithm can be improved to operate in cubic time is shown in Section 5. The improved algorithm produces parse forests in a non-standard representation, which requires only cubic space. One more class of optimizations is discussed in Section 6.

Preliminary results with an implementation of our algorithm are discussed in Section 7.

2 Left-corner parsing

Before we define LC parsing, we first define some notions strongly connected with this kind of parsing.

We define a spine to be a path in a parse tree which begins at some node which is not the first son of its father (or which does not have a father), then proceeds downwards, every time taking the leftmost son, and finally ends in a leaf.

We define the relation ∠ between nonterminals such that B ∠ A if and only if there is a rule A → B α, where α denotes some sequence of grammar symbols. The transitive and reflexive closure of ∠ is denoted by ∠*, which is called the left-corner relation. Informally, we have that B ∠* A if and only if it is possible to have a spine in some parse tree in which B occurs below A (or B = A). We pronounce B ∠* A as "B is a left corner of A".

We define the set GOAL to be the set consisting of S, the start symbol, and of all nonterminals A which occur in a rule of the form B → α A β where α is not ε (the empty sequence of grammar symbols). Informally, a nonterminal is in GOAL if and only if it may occur at the first node of some spine.
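These definitions translate directly into code. The sketch below is an illustration of ours rather than part of the paper: grammars are assumed to be given as a list of (lhs, rhs) pairs with rhs a tuple of symbols, where a symbol counts as a nonterminal exactly when it occurs as some left-hand side; the names left_corners_star and goal_set are ours.

    from collections import defaultdict

    def left_corners_star(rules):
        # rules: list of (lhs, rhs) pairs, rhs a tuple of symbols.
        nonterminals = {lhs for lhs, _ in rules}
        # Direct relation: B is a direct left corner of A iff there is a rule A -> B alpha.
        direct = defaultdict(set)
        for lhs, rhs in rules:
            if rhs and rhs[0] in nonterminals:
                direct[lhs].add(rhs[0])
        # Reflexive-transitive closure: lc_star[A] = {B : B is a left corner of A}.
        lc_star = {}
        for A in nonterminals:
            closure, frontier = {A}, [A]
            while frontier:
                X = frontier.pop()
                for B in direct[X]:
                    if B not in closure:
                        closure.add(B)
                        frontier.append(B)
            lc_star[A] = closure
        return lc_star

    def goal_set(rules, start):
        # GOAL: the start symbol plus every nonterminal that occurs in a rule
        # B -> alpha A beta with alpha non-empty, i.e. not in the first position.
        nonterminals = {lhs for lhs, _ in rules}
        goals = {start}
        for _, rhs in rules:
            for symbol in rhs[1:]:
                if symbol in nonterminals:
                    goals.add(symbol)
        return goals

The same grammar representation is used in the later sketches in this paper.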
We explain LC parsing by means of the small context-free grammar below. No claims are made about the linguistic relevance of this grammar. Note that we have transformed lexical ambiguity into grammatical ambiguity by introducing the nonterminals VorN and VorP.

S → NP VP
S → S PP
NP → "time"
NP → "an" "arrow"
NP → NP NP
NP → VorN
VP → VorN
VP → VorP NP
PP → VorP NP
VorN → "flies"
VorP → "like"

The algorithm reads the input from left to right. The elements on the parse stack are either nonterminals (the goal elements) or items (the item elements). Items consist of a rule in which a dot has been inserted somewhere in the right side to separate the members which have been recognized from those which have not.

Initially, the parse stack consists only of the start symbol, which is the first goal, as indicated in Figure 1. The indicated parse corresponds with one of the two possible readings of "time flies like an arrow" according to the grammar above.

We define a nondeterministic LC parser by the parsing steps which are possible according to the following clauses:

1a. If the element on top of the stack is the nonterminal A and if the first symbol of the remaining input is t, then we may remove t from the input and push an item [B → t • α] onto the stack, provided B ∠* A.

1b. If the element on top of the stack is the nonterminal A, then we may push an item [B → •] onto the stack, provided B ∠* A. (The item [B → •] is derived from an epsilon rule B → ε.)

2. If the element on top of the stack is the item [A → α • t β] and if the first symbol of the remaining input is t, then we may remove t from the input and replace the item by the item [A → α t • β].

3. If the top-most two elements on the stack are B [A → α •], then we may replace the item by an item of the form [C → A • β], provided C ∠* B.

4. If the top-most three elements on the stack are [B → β • A γ] A [A → α •], then we may replace these three elements by the item [B → β A • γ].

5. If a step according to one of the previous clauses ends with an item [A → α • B β] on top of the stack, where B is a nonterminal, then we subsequently push B onto the stack.

6. If the stack consists only of the two elements S [S → α •] and if the input has been completely read, then we may successfully terminate the parsing process.

Note that only nonterminals from GOAL will occur as separate elements on the stack.

The nondeterministic LC parsing algorithm defined above uses one symbol of lookahead in case of terminal left corners. The algorithm is therefore deterministic for the LC(0) grammars, according to the definition of LC(k) grammars in [Soisalon-Soininen and Ukkonen, 1979]. (This definition is incompatible with that of [Rosenkrantz and Lewis II, 1970].)

The exact formulation of the algorithm above is chosen to simplify the treatment of generalized LC parsing in the next section. The strict separation between goal elements and item elements has also been achieved in [Perlin, 1991], as opposed to [Schabes, 1991].
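Read operationally, the clauses describe a nondeterministic machine. The following backtracking recognizer is a sketch of ours, not the paper's formulation: it implements Clauses 1a and 2 to 6 (Clause 1b is omitted because the example grammar has no epsilon rules), reuses the left_corners_star helper sketched earlier in this section, and only recognizes, without building parse trees. It assumes a grammar that is neither cyclic nor hidden left-recursive, so that backtracking terminates.

    def lc_recognize(rules, start, words, lc_star):
        # Stack elements are 4-tuples: ("goal", A, None, None) for a goal element,
        # or ("item", A, rhs, dot) for an item [A -> rhs with the dot at position dot].
        nonterminals = {lhs for lhs, _ in rules}

        def after(stack):
            # Clause 5: if the new top item expects a nonterminal next, push it as a goal.
            kind, _, rhs, dot = stack[-1]
            if kind == "item" and dot < len(rhs) and rhs[dot] in nonterminals:
                return stack + [("goal", rhs[dot], None, None)]
            return stack

        def step(stack, i):
            kind, A, rhs, dot = stack[-1]
            if kind == "goal":
                # Clause 1a: read the next word as a terminal left corner below goal A.
                if i < len(words):
                    for lhs, r in rules:
                        if r and r[0] == words[i] and lhs in lc_star[A]:
                            if step(after(stack + [("item", lhs, r, 1)]), i + 1):
                                return True
                return False
            if dot < len(rhs):
                # Clause 2: shift the expected terminal.
                if i < len(words) and rhs[dot] == words[i]:
                    return step(after(stack[:-1] + [("item", A, rhs, dot + 1)]), i + 1)
                return False
            # The top item [A -> alpha .] is complete; the element below it is a goal B.
            B = stack[-2][1]
            if len(stack) == 2 and A == start and i == len(words):
                return True                                   # Clause 6: accept
            for lhs, r in rules:                              # Clause 3
                if r and r[0] == A and lhs in lc_star[B]:
                    if step(after(stack[:-1] + [("item", lhs, r, 1)]), i):
                        return True
            if B == A and len(stack) >= 3:                    # Clause 4
                _, C, r, d = stack[-3]
                if d < len(r) and r[d] == A:
                    if step(after(stack[:-3] + [("item", C, r, d + 1)]), i):
                        return True
            return False

        return step([("goal", start, None, None)], 0)

    # Usage with the example grammar above (returns True):
    grammar = [("S", ("NP", "VP")), ("S", ("S", "PP")),
               ("NP", ("time",)), ("NP", ("an", "arrow")), ("NP", ("NP", "NP")),
               ("NP", ("VorN",)), ("VP", ("VorN",)), ("VP", ("VorP", "NP")),
               ("PP", ("VorP", "NP")), ("VorN", ("flies",)), ("VorP", ("like",))]
    lc_recognize(grammar, "S", "time flies like an arrow".split(),
                 left_corners_star(grammar))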
3 Generalizing left-corner parsing

The construction of Lang can be used to form deterministic table-driven parsing algorithms from nondeterministic push-down automata. Because left-corner parsers are also push-down automata, Lang's construction can also be applied to formulate a deterministic parsing algorithm based on LC parsing.

The parsing algorithm we propose in this paper does, however, not follow straightforwardly from Lang's construction. If we applied the construction directly, then not as much sharing would be provided as we would like. This is caused by the fact that sharing of computation of different search paths is interrupted if different elements occur on top of the stack (or just beneath the top if elements below the top are investigated).

To explain this more carefully we focus on Clause 3 of the nondeterministic LC parser. Assume the following situation. Two different search paths have at the same time the same item element [A → α •] on top of the stack. The goal elements (say B' and B'') below that item element are, however, different in both search paths.

This means that the step which replaces [A → α •] by [C → A • β], which is done for both search paths (provided both C ∠* B' and C ∠* B''), is done separately because B' and B'' differ. This is unfortunate, because sharing of computation in this case is desirable both for efficiency reasons and because it would simplify the construction of a most-compact parse forest.

Step  Parse stack                                                      Input read
      S
  1   S [NP → "time" •]                                                "time"
  2   S [NP → NP • NP] NP
  3   S [NP → NP • NP] NP [VorN → "flies" •]                           "flies"
  4   S [NP → NP • NP] NP [NP → VorN •]
  5   S [NP → NP NP •]
  6   S [S → NP • VP] VP
  7   S [S → NP • VP] VP [VorP → "like" •]                             "like"
  8   S [S → NP • VP] VP [VP → VorP • NP] NP
  9   S [S → NP • VP] VP [VP → VorP • NP] NP [NP → "an" • "arrow"]     "an"
 10   S [S → NP • VP] VP [VP → VorP • NP] NP [NP → "an" "arrow" •]     "arrow"
 11   S [S → NP • VP] VP [VP → VorP NP •]
 12   S [S → NP VP •]
 13   (successful termination, Clause 6)

Figure 1: One possible sequence of parsing steps while reading "time flies like an arrow"

Related to the fact that we propose to implement the parse table by means of a graph-structured stack, our solution to this problem lies in the introduction of goal elements consisting of sets of nonterminals from GOAL, instead of single nonterminals from GOAL.

As an example, Figure 2 shows the state of the graph-structured stack for the situation just after reading "time flies". Note that this state represents the states of two different search paths of a nondeterministic LC parser after reading "time flies", one of which is the state after Step 3 in Figure 1.

We see that the goals NP and VP are merged in one goal element so that there is only one edge from the item element labelled with [VorN → "flies" •] to those goals.

Figure 2: The graph-structured stack after reading "time flies". (Diagram not reproduced; it depicts the item element [VorN → "flies" •] with a single edge to a goal element containing the goals NP and VP, which in turn connect to the item elements [NP → NP • NP] and [S → NP • VP].)

Merging goals in one stack element is of course only useful if those goals have at least one left corner in common. For the simplicity of the algorithm, we even allow merging of two goals in one goal element if these goals have anything to do with each other with respect to the left-corner relation ∠*.

Formally, we define an equivalence relation ~ on nonterminals, which is the reflexive, transitive, and symmetric closure of ∠. An equivalence class of this relation which includes nonterminal A will be denoted by [A]. Each goal element will now consist of a subset of some equivalence class of ~.

In the running example, the goal elements consist of subsets of {S, NP, VP, PP}, which is the only equivalence class in this example.
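Because ~ is the smallest equivalence relation containing ∠, its classes are just the connected components of the undirected left-corner graph. A minimal sketch of ours, using the same grammar representation as the earlier sketches:

    def equivalence_classes(rules):
        # Classes of '~': connected components of the undirected graph that links
        # A and B whenever B is a direct left corner of A (rule A -> B alpha).
        nonterminals = {lhs for lhs, _ in rules}
        adjacent = {A: set() for A in nonterminals}
        for lhs, rhs in rules:
            if rhs and rhs[0] in nonterminals:
                adjacent[lhs].add(rhs[0])
                adjacent[rhs[0]].add(lhs)
        classes, seen = [], set()
        for A in nonterminals:
            if A in seen:
                continue
            component, frontier = {A}, [A]
            seen.add(A)
            while frontier:
                X = frontier.pop()
                for B in adjacent[X]:
                    if B not in seen:
                        seen.add(B)
                        component.add(B)
                        frontier.append(B)
            classes.append(component)
        return classes

For the example grammar this yields a single class containing all nonterminals; its GOAL members are S, NP, VP, and PP, as stated above.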
Figures 3 and 4 give the complete generalized LC parsing algorithm. At this stage we do not want to complicate the algorithm by allowing epsilon rules in the grammar. Consequently, Clause 1b of the nondeterministic LC parser will have no corresponding piece of code in the GLC parsing algorithm. For the other clauses, we will indicate where they can be retraced in the new algorithm. In Section 4 we explain how our algorithm can be extended so that grammars with epsilon rules can also be handled.

The nodes and arrows in the parse forest are constructed by means of two functions:

MAKE_NODE(X) constructs a node with label X, which is a terminal or nonterminal. It returns (the address of) that node.

A node is associated with a number of lists of sons, which are other nodes in the forest. Each list represents an alternative derivation of the nonterminal with which the node is labelled. Initially, a node is associated with an empty collection of lists of sons. ADD_SUBNODE(m, l) adds a list of sons l to the node m.

In the algorithm, an item element el labelled with [A → X1 ... Xm • α] is associated with a list of nodes deriving X1, ..., Xm. This list is accessed by SONS(el). A list consisting of exactly one node m is denoted by <m>, and list concatenation is denoted by the operator +.

A goal element g contains for every nonterminal A such that A ∠* P for some P in g a value NODE(g, A), which is the node representing some derivation of A found at the current input position, provided such a derivation exists, and NODE(g, A) is NIL otherwise.

In the graph-structured stack there may be an edge from an item element to a unique goal element, and from a goal in a goal element to a number of item elements. For item element el, SUCCESSOR(el) yields the unique goal element to which there is an edge from el. For goal element g and goal P in g, SUCCESSORS(g, P) yields the zero or more item elements to which there is an edge from P in g.
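One possible concrete rendering of these data structures, ours rather than the paper's (the paper leaves the representation abstract); the class and attribute names are illustrative:

    class Node:
        # A parse-forest node: a label and a collection of alternative lists of sons.
        def __init__(self, label):
            self.label = label
            self.subnodes = []          # each entry is one alternative list of son Nodes

    def make_node(label):               # MAKE_NODE(X)
        return Node(label)

    def add_subnode(node, sons):        # ADD_SUBNODE(m, l)
        node.subnodes.append(sons)

    class ItemElement:
        # An item element of the graph-structured stack: a dotted rule, the forest
        # nodes for the members before the dot (SONS), and its unique goal element
        # (SUCCESSOR).
        def __init__(self, lhs, rhs, dot, sons, successor):
            self.lhs, self.rhs, self.dot = lhs, rhs, dot
            self.sons = sons            # SONS(el): list of Nodes
            self.successor = successor  # SUCCESSOR(el): a GoalElement

    class GoalElement:
        # A goal element: a set of goals from one equivalence class.  For each goal P
        # it records the item elements reached by an edge from P (SUCCESSORS(g, P)),
        # and for nonterminals A it may cache NODE(g, A); an absent entry means NIL.
        def __init__(self):
            self.successors = {}        # goal P -> list of ItemElements
            self.node = {}              # nonterminal A -> Node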
The global variables used by the algorithm are the following.

a_1 ... a_n   The symbols in the input string.
i             The current input position.
r             The root of the parse forest. It has the value NIL at the end of the algorithm if no parse has been found.
Γ and Γnext   The sets of goal elements containing goals to be fulfilled from the current and next input position on, respectively.
I and Inext   The sets of item elements labelled with [A → α • t β] such that a shift may be performed through t at the current and next input position, respectively.
F             The set of pairs (g, A) such that a derivation from A has been found for g at the current input position. In other words, F is the set of all pairs (g, A) such that NODE(g, A) ≠ NIL.

The graph-structured stack (which is initially empty) and the rules of the grammar are implicit global data structures.

In a straightforward implementation, the relation ∠* is recorded by means of one large s' × s boolean matrix, where s is the number of nonterminals in the grammar, and s' is the number of elements in GOAL. We can do better, however, by using the fact that A ∠* B never holds if A and B are not equivalent under ~. We propose storing ∠* for every equivalence class of ~ separately, i.e. we store one t' × t boolean matrix for every class of ~ with t members, t' of which are in GOAL.

We furthermore need a list of all rules A → X α for each terminal and nonterminal X. A small optimization of top-down filtering (see also Section 6) can be achieved by grouping the rules in these lists according to the left sides A.

Note that the storage of the relation ∠* is the main obstacle to a linear-sized parser.

The time needed to generate a parser is determined by the time needed to compute ∠* and the classes of ~, which is quadratic in the size of the grammar.

4 Adapting the algorithm for arbitrary context-free grammars

The generalized LC parsing algorithm from the previous section is only specified for grammars without epsilon rules. Allowing epsilon rules would not only complicate the algorithm but would for some grammars also introduce the danger of non-termination of the parsing process.

There are two sources of non-termination for nondeterministic LC and LR parsing: cyclicity and hidden left recursion. A grammar is said to be cyclic if there is some derivation of the form A →+ A. A grammar is said to be hidden left-recursive if A → B α, B →* ε, and α →* A β, for some A, B, α, and β. Hidden left recursion is a special case of left recursion where the fact is "hidden" by an empty-generating nonterminal. (A nonterminal is said to be nonfalse if it generates the empty string.)

Both sources of non-termination have been studied extensively in [Nederhof and Koster, 1993; Nederhof and Sarbo, 1993].
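The extensions that follow (a generalized relation ∠, modified clauses, and a precomputation of empty-deriving subforests) all presuppose that the set of nonfalse nonterminals is known. A standard fixpoint computation suffices; a minimal sketch of ours:

    def nullable_nonterminals(rules):
        # The 'nonfalse' nonterminals: those that derive the empty string.
        # Fixpoint iteration; terminals never enter the set, so a rule qualifies
        # only if every member of its right side is already known to be nonfalse.
        nullable = set()
        changed = True
        while changed:
            changed = False
            for lhs, rhs in rules:
                if lhs not in nullable and all(x in nullable for x in rhs):
                    nullable.add(lhs)
                    changed = True
        return nullable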
An obvious way to avoid non-termination for nondeterministic LC parsers in case of hidden left-recursive grammars is the following. We generalize the relation ∠ so that B ∠ A if and only if there is a rule A → μ B β, where μ is a (possibly empty) sequence of grammar symbols such that μ →* ε. Clause 1b is eliminated and, to compensate for this, Clauses 1a and 3 are modified so that they take into account prefixes of right sides which generate the empty string:

1a. If the element on top of the stack is the nonterminal A and if the first symbol of the remaining input is t, then we may remove t from the input and push an item [B → μ t • α] onto the stack, provided B ∠* A and μ →* ε.

3. If the top-most two elements on the stack are B [A → α •], then we may replace the item by an item of the form [C → μ A • β], provided C ∠* B and μ →* ε.

These clauses now allow for nonfalse members at the beginning of right sides. To allow for other nonfalse members we need an extra seventh clause:(4)

7. If the element on top of the stack is the item [A → α • B β], then we may replace this item by the item [A → α B • β], provided B →* ε.

(4) Actually, an eighth clause is necessary to handle the special case where S, the start symbol, is nonfalse and the input is empty. We omit this clause for the sake of clarity.

PARSE:
  • r ⇐ NIL
  • Create goal element g consisting of S, the start symbol
  • Γ ⇐ {g}
  • I ⇐ ∅
  • F ⇐ ∅
  • for i ⇐ 1 to n do
      • PARSE_WORD
  • return r, as the root of the parse forest

PARSE_WORD:
  • Γnext ⇐ ∅
  • Inext ⇐ ∅
  • for all pairs (g, A) ∈ F do
      • NODE(g, A) ⇐ NIL
  • F ⇐ ∅
  • t ⇐ MAKE_NODE(a_i)
  • FIND_CORNERS(t)
  • SHIFT(t)
  • Γ ⇐ Γnext
  • I ⇐ Inext

FIND_CORNERS(t):                                   /* cf. Clause 1a of the nondeterministic LC parser */
  • for all goal elements g in Γ containing goals in class [B] do
      • for all rules A → a_i α such that A ∈ [B] do
          • if A ∠* P for some goal P in g then    /* top-down filtering */
              • MAKE_ITEM_ELEM([A → a_i • α], <t>, g)

SHIFT(t):                                          /* cf. Clause 2 */
  • for all item elements el in I labelled with [A → α • a_i β] do
      • MAKE_ITEM_ELEM([A → α a_i • β], SONS(el) + <t>, SUCCESSOR(el))

MAKE_ITEM_ELEM([A → α • β], l, g):
  • Create item element el labelled with [A → α • β]
  • SONS(el) ⇐ l
  • Create an edge from el to g
  • if β = ε then
      • REDUCE(el)
    elseif β = t γ, where t is a terminal, then
      • Inext ⇐ Inext ∪ {el}
    elseif β = B γ, where B is a nonterminal, then
      • MAKE_GOAL(B, el)                           /* cf. Clause 5 */

MAKE_GOAL(A, el):
  • if there is a goal element g in Γnext containing goals in class [A] then
      • Add goal A to g (provided it is not already there)
    else
      • Create goal element g consisting of A
      • Add g to Γnext
  • Create an edge from A in g to el

Figure 3: The generalized LC parsing algorithm
REDUCE(el):
  • Assume the label of el is [A → α •]
  • Assume SUCCESSOR(el) is g
  • if NODE(g, A) = NIL then
      • m ⇐ MAKE_NODE(A)
      • NODE(g, A) ⇐ m
      • F ⇐ F ∪ {(g, A)}
      • for all rules B → A β do                   /* cf. Clause 3 */
          • if B ∠* P for some goal P in g then    /* top-down filtering */
              • MAKE_ITEM_ELEM([B → A • β], <m>, g)
      • if A is a goal in g then
          • if SUCCESSORS(g, A) ≠ ∅ then
              • for all el' ∈ SUCCESSORS(g, A) labelled with [B → β • A γ] do    /* cf. Clause 4 */
                  • MAKE_ITEM_ELEM([B → β A • γ], SONS(el') + <m>, SUCCESSOR(el'))
            elseif i = n then                      /* cf. Clause 6 */
              • r ⇐ m
  • ADD_SUBNODE(NODE(g, A), SONS(el))

Figure 4: The generalized LC parsing algorithm (continued)

The same idea can be used in a straightforward way to make generalized LC parsing suitable for hidden left-recursive grammars, similar to the way this is handled in [Schabes, 1991] and [Leermakers, 1992]. The only technical problem is that, in order to be able to construct a complete parse forest, we need precomputed subforests which derive the empty string in every way from nonfalse nonterminals. This precomputation consists of performing mA ⇐ MAKE_NODE(A) for each nonfalse nonterminal A (where the mA are specific variables, one for each nonterminal A) and subsequently performing ADD_SUBNODE(mA, <mB1, ..., mBk>) for each rule A → B1 ... Bk consisting only of nonfalse nonterminals. The variables mA now contain pointers to the required subforests.
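A sketch of that precomputation, ours rather than the paper's, reusing the nullable-set computation above and the make_node/add_subnode helpers from the data-structure sketch in Section 3; an epsilon rule contributes an empty list of sons:

    def epsilon_subforests(rules, nullable):
        # For every nonfalse nonterminal A, build a node m[A] that records every
        # rule A -> B1 ... Bk whose members are all nonfalse nonterminals.
        m = {A: make_node(A) for A in nullable}
        for lhs, rhs in rules:
            if lhs in nullable and all(x in nullable for x in rhs):
                add_subnode(m[lhs], [m[x] for x in rhs])
        return m

Note that for cyclic grammars the resulting pointer structure may itself be cyclic, which is harmless here because only pointers are stored.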
GLC parsing is guaranteed to terminate also for cyclic grammars, in which case the infinite number of parses is reflected by cyclic forests, which are also discussed in [Nozohoor-Farshi, 1991].

5 Parsing in cubic time

The size of parse forests, even of those which are optimally dense, can be more than cubic in the length of the input. More precisely, the number of nodes in a parse forest is O(n^(p+1)), where p is the length of the right side of the longest rule.

Using the normal representation of parse forests does therefore not allow cubic parsing algorithms for arbitrary grammars. There is, however, a kind of shorthand for parse forests which allows a representation that only requires cubic space.

For example, suppose that for some rule A → α β the prefix α of the right side derives the same part of the input in more than one way; then these derivations may be combined in a new kind of packed node. Instead of the multiple derivations from α, this packed node is then combined with the derivations from β deriving subsequent input. We call packing of derivations from prefixes of right sides subpacking, to distinguish this from normal packing of derivations from one nonterminal.

Subpacking has been discussed in [Billot and Lang, 1989; Leiss, 1990; Leermakers, 1991]; see also [Sheil, 1976].

Connected with cubic representation of parse forests is cubic parsing. The GLC parsing algorithm in Section 3 has a time complexity of O(n^(p+1)). The algorithm can easily be changed so that, with a small amount of overhead, the time complexity is reduced to O(n^3), similar to the algorithms in [Perlin, 1991] and [Leermakers, 1992], and the algorithm then produces parse forests with subpacking, which require only O(n^3) space for storage.

We consider how this can be accomplished. First we define the underlying rule of an item element labelled with [A → α • β] to be the rule A → α β. Now suppose that two item elements el1 and el2 with the same underlying rule, with the dot at the same position, and with the same successor are created at the same input position; then we may perform subpacking for the prefix of the right side before the dot. From then on, we only need one of the item elements el1 and el2 for continuing the parsing process.

Whether two item elements have one and the same goal element as successor cannot be efficiently verified. Therefore we propose to introduce a new kind of stack element which takes over the role of all former item elements whose successors are one and the same goal element and which have the same underlying rule.

We leave the details to the imagination of the reader.

6 Optimization of top-down filtering

One of the most time-costly activities of generalized LC parsing is the check whether for a goal element g and a nonterminal A there is some goal P in g such that A ∠* P. This check, which is sometimes called top-down filtering, occurs in the routines FIND_CORNERS and REDUCE. We propose some optimizations to reduce the number of goals P in g for which A ∠* P has to be checked.

The most straightforward optimization consists of annotating every edge from an item element labelled with [A → α • β] to a goal element g with the subset of goals in g that excludes those goals P for which A ∠* P has already been found to be false. This is the set of goals in g which are actually useful in top-down filtering when a new item element labelled with [B → A • γ] is created during a REDUCE (see the piece of code in REDUCE corresponding with Clause 3 of the nondeterministic LC parser). The idea is that if A ∠* P does not hold for goal P in g, then neither does B ∠* P if A ∠ B. This optimization can be realized very easily if sets of goals are implemented as lists.

A second optimization is useful if ∠ is such that there are many nonterminals A for which there is only one B with A ∠ B. In case we have such a nonterminal A which is not a goal, then no top-down filtering needs to be performed when a new item element labelled with [B → A • α] is created during a REDUCE. This can be explained by the fact that if for some goal P we have A ∠* P, and if A ≠ P, and if there is only one B such that A ∠ B, then we already know that B ∠* P.

There are many more of these optimizations, but not all of them give better performance in all cases. It depends heavily on the properties of ∠ whether the gain in time while performing the actual top-down filtering (i.e. performing the tests A ∠* P for some P in a particular subset of the goals in a goal element g) outweighs the time needed to set up extra administration for the purpose of reducing those subsets of the goals.
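The first optimization can be pictured as computing, once per edge, the goals that can still pass the filter. A small sketch of ours (filtered_goals is a hypothetical attribute; with the GoalElement sketch of Section 3 the goals of g are the keys of g.successors):

    def useful_goals(goals, A, lc_star):
        # Keep only the goals P with A a left corner of P; if that fails for P,
        # it also fails for every B with A a direct left corner of B, so later
        # items [B -> A . gamma] made by REDUCE need only be filtered against
        # this subset.  Goals are kept as a list, as suggested above.
        return [P for P in goals if A in lc_star[P]]

    # Where MAKE_ITEM_ELEM creates the edge from a new item element el labelled
    # [A -> alpha . beta] to goal element g, one would additionally record, e.g.:
    #     el.filtered_goals = useful_goals(g.successors.keys(), A, lc_star)
    # and let the top-down filter in REDUCE iterate over el.filtered_goals only.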
7 Preliminary results

Only recently the author has implemented a GLC parser. The algorithm as presented in this paper has been implemented almost literally, with the treatment of epsilon rules as suggested in Section 4. A small adaptation has been made to deal with terminals of different lengths.

Also recently, some members of our department have completed the implementation of a GLR parser. Because both systems have been implemented using different programming languages, a fair comparison of the two systems is difficult. Specific problems which occurred concerning the efficient calculation of LR tables and the correct treatment of epsilon rules for GLR parsing suggest that GLR parsing requires more effort to implement than GLC parsing.

Preliminary tests show that the division of nonterminals into equivalence classes yields disappointing results. In all tested cases, one large class contained most of the nonterminals.

The first optimization discussed in Section 6 proved to be very useful. The number of goals which had to be considered could in some cases be reduced to one fifth.

Conclusions

We have discussed a parsing algorithm for context-free grammars called generalized LC parsing. This parsing algorithm has the following advantages over generalized LR parsing (in order of decreasing importance).

• The size of a parser is much smaller; if we neglect the storage of the relation ∠*, the size is even linear in the size of the grammar. Related to this, only a small amount of time is needed to generate a parser.

• The generated parse forests are as compact as possible.

• Cyclic and hidden left-recursive grammars can be handled more easily and more efficiently (Section 4).

• As Section 5 shows, GLC parsing can more easily be made to run in cubic time for arbitrary context-free grammars. Furthermore, this can be done without much loss of efficiency in practical cases.

Because LR parsing is a more refined form of parsing than LC parsing, generalized LR parsing may, at least for some grammars, be more efficient than generalized LC parsing.(5) However, we feel that this does not outweigh the disadvantages of the large sizes and generation times of LR parsers in general, which render GLR parsing unfeasible in some natural language applications.

(5) The ratio between the time complexities of GLC parsing and GLR parsing is smaller than some constant, which is dependent on the grammar.

GLC parsing does not suffer from these defects. We therefore propose this parsing algorithm as a reasonable alternative to GLR parsing. Because of the small generation time of GLC parsers, we expect this kind of parsing to be particularly appropriate during the development of grammars, when grammars change often and consequently new parsers have to be generated many times.

As we have shown in this paper, the implementation of GLC parsing using a graph-structured stack allows many optimizations.
These optimizations would be less straightforward and possibly less effective if a two-dimensional matrix were used for the implementation of the parse table. Furthermore, matrices require a large amount of space, especially for long input, causing overhead for initialization (at least if no optimizations are used).

In contrast, the time and space requirements of GLC parsing using a graph-structured stack are only a negligible quantity above those of nondeterministic LC parsing if no nondeterminism occurs (e.g. if the grammar is LC(0)). Only in the worst case does a graph-structured stack require the same amount of space as a matrix.

In this paper we have not considered GLC parsing with more lookahead than one symbol for terminal left corners. The reason for this is that we feel that one of the main advantages of our parsing algorithm over GLR parsing is the small size of the parsers. Adding more lookahead requires larger tables and may therefore reduce the advantage of generalized LC parsing over its LR counterpart.

On the other hand, the phenomenon reported in [Billot and Lang, 1989] and [Lankhorst, 1991], that the time complexity of GLR parsing sometimes worsens if more lookahead is used, does possibly not apply to GLC parsing. For GLR parsing, more lookahead may result in more LR states, which may result in less sharing of computation. For GLC parsing there is, however, no relation between the amount of lookahead and the amount of sharing of computation.

Therefore, a judicious use of extra lookahead may on the whole be advantageous to the usefulness of GLC parsing.

Acknowledgements

The author is greatly indebted to Klaas Sikkel, Janos Sarbo, Franc Grootjen, and Kees Koster for many fruitful discussions. Valuable correspondence with René Leermakers, Jan Rekers, Masaru Tomita, and Dick Grune is gratefully acknowledged.

References

[Bear, 1983] J. Bear. A breadth-first parsing model. In Proc. of the Eighth International Joint Conference on Artificial Intelligence, volume 2, pages 696-698, Karlsruhe, West Germany, August 1983.

[Billot and Lang, 1989] S. Billot and B. Lang. The structure of shared forests in ambiguous parsing. In 27th Annual Meeting of the ACL [1], pages 143-151.

[Dekkers et al., 1992] C. Dekkers, M.J. Nederhof, and J.J. Sarbo. Coping with ambiguity in decorated parse forests. In Coping with Linguistic Ambiguity in Typed Feature Formalisms, Proceedings of a Workshop held at ECAI 92, pages 11-19, Vienna, Austria, August 1992.

[Demers, 1977] A.J. Demers. Generalized left corner parsing. In Conference Record of the Fourth ACM Symposium on Principles of Programming Languages, pages 170-182, Los Angeles, California, January 1977.

[Earley, 1970] J. Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94-102, February 1970.

[Kipps, 1991] J.R. Kipps. GLR parsing in time O(n^3). In [Tomita, 1991], chapter 4, pages 43-59.

[Lang, 1974] B. Lang. Deterministic techniques for efficient non-deterministic parsers. In Automata, Languages and Programming, 2nd Colloquium, Lecture Notes in Computer Science, volume 14, pages 255-269, Saarbrücken, 1974. Springer-Verlag.

[Lang, 1988a] B. Lang. Complete evaluation of Horn clauses: An automata theoretic approach. Rapport de Recherche 913, Institut National de Recherche en Informatique et en Automatique, Rocquencourt, France, November 1988.
[Lang, 1988b] B. Lang. Datalog automata. In Proc. of the Third International Conference on Data and Knowledge Bases: Improving Usability and Responsiveness, pages 389-401, Jerusalem, June 1988.

[Lang, 1988c] B. Lang. Parsing incomplete sentences. In Proc. of the 12th International Conference on Computational Linguistics, volume 1, pages 365-371, Budapest, August 1988.

[Lang, 1988d] B. Lang. The systematic construction of Earley parsers: Application to the production of O(n^6) Earley parsers for tree adjoining grammars. Unpublished paper, December 1988.

[Lang, 1991] B. Lang. Towards a uniform formal framework for parsing. In M. Tomita, editor, Current Issues in Parsing Technology, chapter 11, pages 153-171. Kluwer Academic Publishers, 1991.

[Lankhorst, 1991] M. Lankhorst. An empirical comparison of generalized LR tables. In R. Heemels, A. Nijholt, and K. Sikkel, editors, Tomita's Algorithm: Extensions and Applications, Proc. of the first Twente Workshop on Language Technology, pages 87-93. University of Twente, September 1991. Memoranda Informatica 91-68.

[Leermakers, 1989] R. Leermakers. How to cover a grammar. In 27th Annual Meeting of the ACL [1], pages 135-142.

[Leermakers, 1991] R. Leermakers. Non-deterministic recursive ascent parsing. In Fifth Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, pages 63-68, Berlin, Germany, April 1991.

[Leermakers, 1992] R. Leermakers. A recursive ascent Earley parser. Information Processing Letters, 41(2):87-91, February 1992.

[Leiss, 1990] H. Leiss. On Kilbury's modification of Earley's algorithm. ACM Transactions on Programming Languages and Systems, 12(4):610-640, October 1990.

[Matsumoto and Sugimura, 1987] Y. Matsumoto and R. Sugimura. A parsing system based on logic programming. In Proc. of the Tenth International Joint Conference on Artificial Intelligence, volume 2, pages 671-674, Milan, August 1987.

[Nederhof, 1992] M.J. Nederhof. Generalized left-corner parsing. Technical report no. 92-21, University of Nijmegen, Department of Computer Science, August 1992.

[Nederhof and Koster, 1993] M.J. Nederhof and C.H.A. Koster. Top-down parsing of left-recursive grammars. Technical report, University of Nijmegen, Department of Computer Science, 1993. Forthcoming.

[Nederhof and Sarbo, 1993] M.J. Nederhof and J.J. Sarbo. Increasing the applicability of LR parsing. Submitted for publication, 1993.

[Nozohoor-Farshi, 1991] R. Nozohoor-Farshi. GLR parsing for ε-grammars. In [Tomita, 1991], chapter 5, pages 61-75.

[Perlin, 1991] M. Perlin. LR recursive transition networks for Earley and Tomita parsing. In 29th Annual Meeting of the ACL [2], pages 98-105.

[Pratt, 1975] V.R. Pratt. LINGOL - A progress report. In Advance Papers of the Fourth International Joint Conference on Artificial Intelligence, pages 422-428, Tbilisi, Georgia, USSR, September 1975.

[Purdom, 1974] P. Purdom. The size of LALR(1) parsers. BIT, 14:326-337, 1974.

[Rekers, 1992] J. Rekers. Parser Generation for Interactive Environments. PhD thesis, University of Amsterdam, 1992.

[Rosenkrantz and Lewis II, 1970] D.J. Rosenkrantz and P.M. Lewis II. Deterministic left corner parsing. In IEEE Conference Record of the 11th Annual Symposium on Switching and Automata Theory, pages 139-152, 1970.

[Schabes, 1991] Y. Schabes. Polynomial time and space shift-reduce parsing of arbitrary context-free grammars. In 29th Annual Meeting of the ACL [2], pages 106-113.
[Shann, 1991] P. Shann. Experiments with GLR and chart parsing. In [Tomita, 1991], chapter 2, pages 17-34.

[Sheil, 1976] B.A. Sheil. Observations on context-free parsing. Statistical Methods in Linguistics, 1976, pages 71-109.

[Sikkel, 1990] K. Sikkel. Cross-fertilization of Earley and Tomita. Memoranda Informatica 90-69, University of Twente, November 1990.

[Sikkel and Op den Akker, 1992] K. Sikkel and R. op den Akker. Head-corner chart parsing. In Computing Science in the Netherlands, Utrecht, November 1992.

[Slocum, 1981] J. Slocum. A practical comparison of parsing strategies. In 19th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 1-6, Stanford, California, June-July 1981.

[Soisalon-Soininen and Ukkonen, 1979] E. Soisalon-Soininen and E. Ukkonen. A method for transforming grammars into LL(k) form. Acta Informatica, 12:339-369, 1979.

[Tanaka et al., 1979] H. Tanaka, T. Sato, and F. Motoyoshi. Predictive control parser: Extended LINGOL. In Proc. of the Sixth International Joint Conference on Artificial Intelligence, volume 2, pages 868-870, Tokyo, August 1979.

[Tomita, 1986] M. Tomita. Efficient Parsing for Natural Language. Kluwer Academic Publishers, 1986.

[Tomita, 1987] M. Tomita. An efficient augmented-context-free parsing algorithm. Computational Linguistics, 13:31-46, 1987.

[Tomita, 1988] M. Tomita. Graph-structured stack and natural language parsing. In 26th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 249-257, Buffalo, New York, June 1988.

[Tomita, 1991] M. Tomita, editor. Generalized LR Parsing. Kluwer Academic Publishers, 1991.

[Wirén, 1987] Mats Wirén. A comparison of rule-invocation strategies in context-free chart parsing. In Third Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, pages 226-233, Copenhagen, Denmark, April 1987.

[1] 27th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Vancouver, British Columbia, June 1989.

[2] 29th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Berkeley, California, June 1991.
