Completeness conditions for mixed strategy bidirectional parsing

(1)

Strategy Bidirectional Parsing

G r a e m e R i t c h i e * University of E d i n b u r g h

It has been suggested that, in certain circumstances, it might be useful for a grammar writer to annotate which rules are to be used bottom-up and which are to be used top-down within a parser, using a bidirectional variant of the active chart parsing technique. The formal properties of such systems have not been fully explored. One limitation of this mixed strategy technique is that certain annotations of rules can lead to incompleteness; that is, there may be valid analyses of the input string that cannot be found by the parser. We formalize a fairly natural notion of mixed strategy bidirectional parsing for context-free grammars, in which one or more symbols within a rule may be annotated as "triggers," so that the rule is either top-down (triggered from its left- hand side), or bottom-up (triggered from element(s) of its right-hand side). We define a decidable property of annotated grammars, such that any grammar with this property is provably complete. There are, however, some complete annotations of grammars that fall outside this decidable class. We show that membership of this wider class is undecidable. These results suggest that the mixed strategy approach is of rather limited usefulness, regardless of whether it is empirically efficient or not.

1. O v e r v i e w

M a n y m e t h o d s h a v e b e e n e x p l o r e d for parsing context-free grammars; some of these m e t h o d s are loosely categorized as " t o p - d o w n " (e.g., recursive descent), some as " b o t t o m - u p " (e.g., shift-reduce), a n d some could be seen as a m i x t u r e of these two varieties (e.g., left-corner). All of the well-explored m e t h o d s a s s u m e that the rules in the g r a m m a r are h a n d l e d in a fairly u n i f o r m way. In particular, it is not usual for the rules to be separated into t w o classes--those to be u s e d b o t t o m - u p a n d those to be used t o p - d o w n . Steel a n d de Roeck (1987) argue (giving credit to H e n r y T h o m p s o n for some of the ideas) that the p e r f o r m a n c e of a parser c o u l d be i m p r o v e d b y allowing the g r a m m a r writer to d o exactly this. The m o t i v a t i o n comes from linguistic p h e n o m - ena w h e r e it is intuitively clear that one s y m b o l (linguistic category) in the rule is noticeably m o r e distinctive t h a n others, so that a parser s h o u l d not waste time trying to m a t c h the rule unless that distinctive element is there. For example, a rule such as NP --* NP CONJ NP (where CONJ indicates a conjunction, such as and) s h o u l d not be i n v o k e d simply because a n o u n p h r a s e (NP), or the start of a n o u n phrase, has b e e n found. The p r o p o s a l is that if the linguist is allowed to m a r k the CONJ e l e m e n t as a "trigger," a n d the parser introduces the rule, b o t t o m - u p , only if the trigger has b e e n matched, t h e n parsing w o u l d p r o c e e d m o r e efficiently.

Steel a n d de Roeck describe semi-formally a system t h e y h a v e i m p l e m e n t e d , w h i c h they claim benefits from this labeling of rules. The current p a p e r does not take a position on the w i s d o m or effectiveness of such labeling. Instead, w e explore the

(2)

Computational Linguistics Volume 25, Number 4

formal consequences of this proposal. We s h o w that, a l t h o u g h the idea m a y seem superficially plausible, it still has certain formal limitations in the area of completeness a n d decidability. The proofs m a y be of some theoretical interest from a formal language viewpoint.

The central ideas are as follows: A conventional context-free g r a m m a r is " a n n o - t a t e d " b y m a r k i n g at least one s y m b o l in each rule as a trigger. M a r k i n g the left-hand- side (LHS) s y m b o l as a trigger indicates that the rule can be u s e d t o p - d o w n ; m a r k i n g a right-hand-side (RHS) s y m b o l as a trigger m e a n s that the rule can be u s e d b o t t o m - u p w h e n e v e r a constituent labeled w i t h that s y m b o l is f o u n d b y the parser. 1 Using a m e t h o d of parsing k n o w n as active chart parsing, it is s t r a i g h t f o r w a r d to give a precise m e a n i n g to this labeling of rules, since a chart parser can operate either b o t t o m - u p or t o p - d o w n . The scheme e x a m i n e d here is similar to, b u t different in i m p o r t a n t w a y s from, h e a d - d r i v e n p a r s i n g (see Section 7.2).

It is simple to construct an a n n o t a t e d g r a m m a r in w h i c h there are some analyses that are valid according to the original ( u n a n n o t a t e d ) g r a m m a r b u t that w o u l d not be p a r s e d b y a chart parser following the annotations. This establishes that n o t all a n n o t a t e d g r a m m a r s allow complete parsing.

The m a i n substance of this p a p e r is as follows: A p r o p e r t y of a n n o t a t e d g r a m m a r s (direct analyzability) is defined, w h i c h is decidable, a n d it is p r o v e n that a n y a n n o t a t e d g r a m m a r w i t h this p r o p e r t y will also allow the p a r s e r to p r o d u c e all the valid analyses licensed b y the original grammar. H o w e v e r , some a n n o t a t e d g r a m m a r s are not directly analyzable, b u t nevertheless lead to complete parsing. A characteristic of (a subset of) this w i d e r class of a n n o t a t e d g r a m m a r s (indirect analyzability) is defined, a n d it is p r o v e n that a n y a n n o t a t e d g r a m m a r w i t h this p r o p e r t y will allow c o m p l e t e parsing. H o w e v e r , indirect analyzability can be s h o w n to be undecidable.

2. The problems

2.1 Losing Completeness

Before presenting a formal definition of the mechanisms, a n d p r o c e e d i n g to p r o v e their various properties, it is useful to consider informally a v e r y simple e x a m p l e that shows h o w this a p p r o a c h can lead to loss of analyses b y the parser. As outlined above, the central idea is to allow different rules to be m a r k e d as either t o p - d o w n (LHS trigger symbol) or b o t t o m - u p (RHS trigger symbol(s)), or both. T o p - d o w n m e a n s that the rule can be i n v o k e d only if some other rule has established a n e e d for its LHS s y m b o l (or if the LHS s y m b o l is the initial s y m b o l of the grammar). B o t t o m - u p m e a n s that the rule can be i n v o k e d only if one of the symbols m a r k e d as triggers on its RHS has b e e n c o m p l e t e l y parsed. We shall a s s u m e that rules of the f o r m A --* w w h e r e w is a terminal s y m b o l are n e v e r annotated, a n d can be u s e d w h e n e v e r n e e d e d in the parser (all this is m a d e precise in o u r formalization in Section 3.3 below).

For this informal presentation, a n d occasionally elsewhere, w e shall m a r k a trigger s y m b o l A b y overlining it, thus: A. In the illustrative examples, the distinguished (initial) s y m b o l of the g r a m m a r will always be S a n d terminal symbols will be in l o w e r case.

(3)

Consider the annotated grammar (see Section 3 for a definition of grammar):

S ~ N P V P N P --* A r t N V P ~ runs A r t --* the N - - * dog

It should be intuitively clear that although this grammar generates exactly one sentence, that string cannot be parsed by a parser that follows the annotations as described. The rule S ~ N P V P cannot be used until an initial N P is recognized, and the rule that might do that, N P --+ A r t N , cannot be used until an initial need for an N P is established (which could happen only using S ~ N P VP). There is a form of dead- lock, resulting in incompleteness. It should also be intuitively clear that the presence or absence of such combinations of annotations m a y not be as obvious as it is here. In a grammar with hundreds of rules, the presence of a combination that blocks an otherwise valid analysis could take some detailed checking. This is a serious flaw, as the annotation method was supposed to alter the efficiency of the parser, but not to eliminate strings from its language.

It would be very easy to ensure that annotation does not lose analyses, by stipu- lating that all rules are marked as top-down, or that all rules are marked as bottom-up with the leftmost symbol as a trigger. The parser would then behave as a conventional top-down (or bottom-up) chart parser, which is known to be complete. However, since the aim is to allow the grammar writer to make a nontrivial annotation of the grammar (in an attempt to allow linguistic knowledge to influence the efficiency of the parsing), we need to be able to check the completeness of arbitrarily annotated grammars. In Section 4 below, we define formally a nontrivial characteristic of annotated grammars that guarantees that they do not lose analyses in this way, and show that this property of grammars is decidable.

2.2 Completeness through Interactions

The situation is even more complicated than Section 2.1 above indicates. One of the crucial aspects of chart parsing (which is central to its simplicity and its efficiency) is that any entry in a chart can be used to combine with any other compatible entry, regardless of whether there is a single coherent tree that will result from it. In particular, an entry that has been inserted in the chart as the result of some rule interaction that does not itself produce a complete sentential tree (i.e., a partial fragment of an analysis) can contribute to some other analysis that happens to require it.

This is best demonstrated by a simple artificial example. Consider the strategy- marked grammar, notation as before:

-S--* E H H - - - ~ B F -~__, p Q E --* j P---~ I Q - - - ~ m F ---~ k

(4)

a derivation as follows (see Section 3 for a definition of the relation " 0 " ) :

S =~ EH ~ jH ~ j B F ~ j P Q F ~ j l Q F ~ j l m F =~jlmk

The tree described by this derivation cannot be found by a parser following the strategy-marked grammar, for reasons similar to those outlined in Section 2.1 above. Suppose we now add the following rules to the grammar:

S - - , C D

D--,-E A

-d-- B C*

C--+ x

This larger grammar will also generate the string

xjlmx,

but this is not relevant to the argument. What is more interesting is that the extended grammar does now allow the parsing of

jlmk, with an associated syntax tree that corresponds to the derivation given

above (i.e., a tree that makes no direct use of the rules that have been added to the grammar). The w a y in which the added rules act as a "catalyst" to allow the hitherto blocked analysis is an example of a general phenomenon. Informally, what happens is the following: (A chart parser is assumed here; formal details are given in Section 3.3 below.) With just the smaller grammar, the nonterminal H cannot be expanded as required, since it is on the LHS of a bottom-up rule, and its first symbol B cannot be recognized because it requires a top-down rule. In both the original grammar and the larger grammar, H is introduced only by the rule S ~ E H, i.e., with E on its immediate left. So the only strings where H can participate in an analysis are those where E occurs at the start. Consider the parsing, with the larger grammar, of the string jlmk (which does indeed start with an E). As E is preterminal, it can be recognized directly (with no effect from annotations). In the larger grammar, the bottom-up rule D -~ E A is then introduced to the parsing, which creates a predictive entry in the parser's structures seeking an A, after the recognized E. The top-down rule A -+ B C is then introduced, which leads to an entry, at that same point, seeking a B. This causes the top-down rule B --* P Q (from the original grammar) to be introduced; this is a crucial step. This allows the sequence

Im

to be parsed as a B, thereby causing the introduction of the bottom-up rule H --+ B F, and the subsequent success of the parse.

2.3 What Are the Problems?

The grammars discussed above (Sections 2.1 and 2.2) are examples of various aspects of the problem. We shall show that there is a simple, decidable property of annotated grammars that guarantees completeness, and that could be used to detect the simple blocking illustrated in Section 2.1. However, this property is merely a sufficient condition for completeness, as the larger grammar of Section 2.2 above does not pos- sess it, despite being complete. We shall show that the larger grammar of Section 2.2 has a more general property, which also guarantees completeness. However, the more general property of annotated grammars is undecidable.

First, we have to set up the basic formal mechanisms for our definitions.

3. Trees, Grammars, and Charts

3.1 Basic Concepts and Terms

(5)

h a s exactly one m o t h e r n o d e , a n d each n o n t e r m i n a l n o d e h a s o n e or m o r e d a u g h t e r nodes. A tree is said to s p a n the s e q u e n c e of labels associated w i t h the s e q u e n c e of its t e r m i n a l n o d e s (in left-to-right order), a n d w e shall also s a y that the root n o d e of a (sub)tree s p a n s its sequence of t e r m i n a l nodes.

Definition 1

The h e i g h t of a n o d e in a tree is d e f i n e d as follows. A t e r m i n a l n o d e h a s height 0; a n o n t e r m i n a l n o d e h a s height = (1 + m a x i m u m h e i g h t of its d a u g h t e r nodes).

Definition 2

The d e p t h of a n o d e in a tree is d e f i n e d as follows. A root n o d e h a s d e p t h 0; a n o n r o o t n o d e h a s d e p t h = (1 + d e p t h of its p a r e n t node).

Following the u s u a l c o n v e n t i o n s (e.g., A h o a n d U l l m a n 1972), w e will take a context-free g r a m m a r (CFG) to be a q u a d r u p l e (VN, VT, P, S), consisting of a set VN of nonterminal s y m b o l s , a set VT of t e r m i n a l s y m b o l s , a set P of r u l e s (productions), a n d a single d i s t i n g u i s h e d s y m b o l S E VN. Set theoretically, rules can be r e g a r d e d as b e i n g o r d e r e d pairs w h e r e the first e l e m e n t is a n o n t e r m i n a l s y m b o l a n d the s e c o n d is a tuple of s y m b o l s , i.e., of the f o r m (A0, (A1 . . . A k ) ) w h e r e k > 0, b u t for ease of exposition t h e y will b e w r i t t e n as

A0 --+ A1 . . . Ak

We will m a k e the following s i m p l i f y i n g a s s u m p t i o n s (which do n o t lose general- ity):

.

Each rule in P is either of the f o r m Ao --+ A 1 A 2 . . . Ak w i t h k > 0, w h e r e all the s y m b o l s A i E

VN,

o r of the f o r m A --+ w w h e r e w E VT.

The g r a m m a r h a s no r e d u n d a n t s y m b o l s , in the sense that no s y m b o l s are "useless" or "inaccessible" as d e f i n e d b y A h o a n d U l l m a n (1972, Sect. 2.4.2).

Rules of the f o r m A --* w w h e r e w E VT will b e referred to as lexical rules, a n d other rules as nonlexical. A n o n t e r m i n a l A that a p p e a r s in a lexical rule will b e called p r e t e r m i n a l , or lexical.

G i v e n a CFG G, a s y n t a x tree b a s e d o n G is a rooted, o r d e r e d tree w h o s e n o n t e r - m i n a l n o d e s are labeled w i t h e l e m e n t s of VN a n d w h o s e t e r m i n a l n o d e s are labeled w i t h e l e m e n t s of VT. Those n o d e s i m m e d i a t e l y d o m i n a t i n g t e r m i n a l n o d e s will be referred to as p r e t e r m i n a l ; other n o n t e r m i n a l n o d e s will be referred to as nonlexical.

W h e r e a tree T s p a n s a t e r m i n a l string al . . . an, a n d M is a n o d e w i t h i n T that s p a n s a i . . • ak, the start of M is the index i - 1, a n d the e n d of M is the index k.

A syntax tree b a s e d o n (VN, VT, P, S) is said to be w e l l - f o r m e d w i t h respect to (VN, VT, P, S) if for e v e r y n o n t e r m i n a l n o d e w i t h label A0 a n d d a u g h t e r n o d e s labeled A1 . . . . , Ak, there is a rule in P of the f o r m Ao --+ A1 • .. Ak; this rule is said to license

the n o d e labeled A0. For convenience, w e shall d i s t i n g u i s h b e t w e e n a tree that is c o m p a t i b l e w i t h the rules of the g r a m m a r , a n d a tree that also s p a n s a sentence. A syntax tree is said to be generated b y a g r a m m a r G iff:

.

2.

T h e root n o d e is labeled w i t h S (the d i s t i n g u i s h e d symbol).

(6)

We will write trees(G) for the set of all trees g e n e r a t e d b y G.

The conventional " r e w r i t e " interpretation of CFGs will also be u s e d in some sit- uations (Section 5 below). G i v e n t w o strings w1, W2 f r o m ( W N [..J V T ) * , t h e n w1 directly

derives w2, written "w1 ~ w2," if w1 = 6A% ~;2 ~- 6Ol'y a n d A --* c~ is a rule in G. Sim-

ilarly, W 1 derives w2, written "w1 G w2," is the reflexive transitive closure of directly derives. A derivation is a sequence of s y m b o l strings w1 . . . wn such that wi ~ wi+l for all 1 < i < n. A rightmost derivation is one in w h i c h each step f r o m wi to ~i+1 is m a d e b y replacing the furthest right n o n t e r m i n a l s y m b o l in W i USing some rule (i.e., -y in the a b o v e definition of directly derives is entirely m a d e u p of terminal symbols) (cf. A h o a n d U l l m a n 1972).

3.2 Annotated Grammars

Since w e are allowing trigger elements of a rule to occur a n y w h e r e on the RHS of a rule, it is necessary to allow the parser to explore o u t w a r d s in either direction (leftwards or rightwards) from a constituent that has b e e n parsed. H e n c e the p a r s i n g schemes defined b e l o w are referred to as b i d i r e c t i o n a l , to reflect this fact. This does not allude to the two "directions" of t o p - d o w n or b o t t o m - u p .

Definition 3

Let G be a context-free g r a m m a r (VN, VT, P, S). A bidirectional strategy marking of

G is a (total) function tr f r o m the nonlexical rules in P to ~P(N) (the set of sets of n o n n e g a t i v e integers) such that for a n y rule r of the f o r m Ao --* A1 ... Ak:

1. tr(r) # 0

2. 0 < i < k for e v e r y i E tr(r)

Informally, tr indicates w h i c h element(s) of the rule can trigger it. If 0 E tr(r), the LHS of the rule is a trigger; that is, it can be u s e d t o p - d o w n . If j E tr(r), w h e r e j > 0, t h e n e l e m e n t j of the RHS can act as a trigger, b o t t o m - u p . The v a l u e of tr(r) is a set of integers in o r d e r to allow a rule to h a v e m o r e t h a n one possible trigger; in particular, it is allowable for a rule to be u s e d either t o p - d o w n or b o t t o m - u p .

Definition 4

A bidirectionally strategy-marked context free-grammar (BSCFG) is a pair (G, tr)

w h e r e G is a CFG a n d tr is a bidirectional strategy m a r k i n g of G.

Definition 5

Let ((VN, VT, P, S), tr) be a bidirectionally s t r a t e g y - m a r k e d context-free grammar. T h e n a rule r E P is said to be:

1. t o p - d o w n , if 0 E tr(r).

2. b o t t o m - u p , if there is an i > 0 such that i E tr(r).

3. purely bottom-up, if 0 ~ tr(r).

4. purely top-down, if tr(r) = {0}.

3.3 Active Charts

(7)

is a generalization of Earley's algorithm (Earley 1970), a n d tutorial expositions of the ideas can be f o u n d in T h o m p s o n and Ritchie (1984) or W i n o g r a d (1983). In k e e p i n g with m o r e recent presentations (e.g., Shieber, Schabes, a n d Pereira 1995; Sikkel a n d op d e n Akker 1996) w e define the parsing principles as w e l l - f o r m e d n e s s conditions on complete charts, abstracting a w a y from the sequence of steps u s e d to build them.

Definition 6

G i v e n a CFG G of the f o r m (VN, VT, P, S) a d o u b l e - d o t t e d rule b a s e d on G is a triple (p, l, r) w h e r e p is a rule in P of the f o r m Ao --* A1 . . . Ak a n d l, r are integers such that O < l < r < k .

Such a rule will be written as:

Ao --* A1 . . . Al • Ai+l . . . Ar • Ar+l . . . Ak

for ease of exposition and similarity to p r e v i o u s literature. W h e r e either l = 0 or r = k, the e m p t y portions will be o m i t t e d f r o m the expression.

Definition 7

G i v e n a CFG G = (VN, VT, P, S), an edge based on G is a triple (i,j, d) w h e r e i a n d j are n o n n e g a t i v e integers with i _< j, and d is a d o u b l e - d o t t e d rule b a s e d on G.

An edge is said to be lexical or nonlexical according to w h e t h e r or not the rule is lexical. An edge of the form (i,j, Ao ~ A1 . . . A q - 1 • A q . . . Ap • Ap+l . . . Ak) w h e r e either q > 1 or p < k (i.e., with a n o n e m p t y c o m p o n e n t at either end) is referred to as an active edge, a n d an edge of the f o r m (i,j, Ao --+ • A 1 . . . A k o) is an inactive edge.

An active edge (i,i, Ao --* • , A 1 . . . A k ) or (i,i, Ao --* A 1 . . . A k • •) is referred to as an e m p t y active edge. (Sometimes it will be referred to as " a n e m p t y active edge for A0 --+ a l . . . Ak.")

Definition 8

G i v e n a CFG G = (VN, VT, P, S) a n d a string al . . . an from V~, a chart based on

a l , . . . , an and using G is a set C of edges b a s e d on G that meets the following conditions:

.

2.

for e v e r y (i,j, r) E C, i E {0 . . . n} and j c {0 . . . n}

for ai E VT, (i -- 1, i,L ~ . a i , ) E C iff ai C {al, a2,... ,an} a n d L --+ ai ff P. The t e r m i n o l o g y of the last three definitions will also be u s e d for a BSCFG (G, tr). Definition 9

Let G be a CFG, a n d let C be a chart b a s e d on a string ¢ a n d using G. C is said to be

bidirectionally resolved iff both the following conditions hold:

. Left Extension: For e v e r y pair of edges:

(i,j, Ao --+ , A 1 . . . A m ' )

(j,k, B0 --~ B1 . . . B q • B q + I . . . B p " B p + l . . . B v ) w h e r e p ~ v, q > 0 a n d A0 = Bq, there is also an edge:

(8)

2. Right Extension: For every pair of edges:

(i,j, Bo --* B 1 . . . B q • B q + l . . . B p • B p + l . . . Bv)

(j,k, Ao --* • A 1 . . . A m ' )

where p < v, q > 0 a n d Ao -- Bp+l, there is also an edge:

(i, k, Bo --+ B1 . . . Bq • Bq+l . . . Bp+l • Bp+2... By)

L e m m a 1

Let C be a bidirectionally resolved chart based on a string cr a n d using a CFG G, a n d suppose that C contains an edge of the form:

(i,j, Bo --~ B1. .. Bq • Bq+l .. . Bp • Bp+l .. . By)

(i) (Rightwards) If C contains edges of the form:

(ipq-l,jp+l, Bp+l --+

•Wp+l,)

(ip+t, jp+,, Bp+t --+ •We+t•)

where (p + t) <_ v, ik+ 1 ---- jk where (p + 1) _< k < (p + t) a n d ip+l = j, then C also contains an edge of the form:

(i, jp+t, Bo --~ B 1 . . . Bq • Bq+l ...Bp+t • Bp+t+l . . . B v ) (ii) (Leftwards) If C contains edges of the form:

( i(q-t), j(q-t), B(q-t) -'4 •bd(q-t) • )

(iq,jq, Bq --4 •~q,)

where 0 < t < q, ik+l = jk where (q - t) ( k < q a n d jq -~ i, then C also contains an edge of the form:

(i(q_t),j, Bo ~ B1 ...B(q_t_l) • B(q_t) . . . Bp • B p + l . . . By)

Proof

Both the cases (i) a n d (ii) proceed b y induction on the n u m b e r of inactive edges.

Corollary

If C is as described, a n d it contains a full set of edges as given at both sides (i.e., t = (q - 1), so that there are q inactive edges to the left, a n d (p + t) = v so that there are (v - p) inactive edges to the right, all w i t h labels matching the rule), then there is a complete (inactive) edge of the form (il,jv, Bo -~ . B 1 . . . Bv. ).

(9)

Definition 10

Let (G, tr) be a BSCFG, a n d let C be a chart based on a string c~ a n d using G. C is said to be b i d i r e c t i o n a l l y m i x e d strategy e x p l o r e d iff all the following conditions hold:

1. (Bottom-up activation) For e v e r y edge:

( i , j , A o ~ . A 1 . . . A m ' )

there is an edge in C:

(i,j, Bo ~ B 1 . . . B q _ l • Bq • Bq+l . . . Bv)

for e v e r y rule r in G of the form Bo --* B1 . . . Bv such that q C t r ( r ) a n d Bq = A0,

2. (Top-down initialization) For e v e r y rule r in G of the f o r m S --~ B 1 . . . Bk, w h e r e S is the distinguished symbol of G a n d 0 c t r ( r ) , there is an edge in C of the form:

( 0 , 0 , S ~ • ° B 1 . . . B k )

3. (Top-down activation, right) For e v e r y edge:

(i,j, Bo --* B1 . . . Bq • Bq+l . . . Bp ° B p + l . . . Bv)

w h e r e 0 _< p < v, a n d e v e r y rule r in G of the f o r m A o ~ A 1 . . . A k for w h i c h Bp+l = A o a n d 0 C t r ( r ) , there is also an edge in C of the form:

(j,j, A0 --* " • A 1 . . . A k )

4. (Top-down activation, left) For e v e r y edge:

( i, j , Bo ~ B 1 . . . Bq • Bq+ l . . . Bp • Bp+ l . . . By)

w h e r e 0 < q _< v, a n d e v e r y rule r in G of the f o r m A o --* A 1 . . . A k for w h i c h Bq = A o a n d 0 c t r ( r ) , there is also an edge in C of the form:

(i, i, A o --* A 1 . . . A k • • )

For brevity, the t e r m f u l l y b i d i r e c t i o n a l will be used for a chart that is b o t h bidirectionally resolved a n d bidirectionally m i x e d strategy explored.

To explore the issue of completeness (i.e., w h e t h e r a parsing m e c h a n i s m finds all valid analyses) w e n e e d to define h o w the edges in a chart c o r r e s p o n d to those in a syntax tree.

Definition 11

A chart C based on a l a 2 . . • an is said to contain a representation of a syntax tree T, iff:

1. T spans a substring (not necessarily proper) of a l a 2 . . , an; a n d

2. for e v e r y n o n t e r m i n a l n o d e N in T, s p a n n i n g ai+l • • • aj, labeled Ao, w i t h k d a u g h t e r s labeled A1 . . . A k i n order, C contains an edge

( i , j , A o --+ ° A 1 . . . A k ° )

(10)

Notice that for a n y chart C b a s e d o n a string or, C will contain an edge for each p r e t e r m i n a l n o d e of a n y tree that spans c~, b y virtue of the definition of a chart b e i n g "based o n " a string. H e n c e later discussions of p a r s i n g a n d completeness can a s s u m e the presence of these edges in the relevant charts, w i t h only the presence of edges for other n o n t e r m i n a l n o d e s being subject to verification.

Definition 12

G i v e n a CFG

G,

a bidirectional s t r a t e g y - m a r k i n g

tr

of G is said to be complete iff for e v e r y tree T E

trees(G)

that spans a string or, a n y fully bidirectional chart C b a s e d on cr and using (G,

tr)

contains a r e p r e s e n t a t i o n of T.

4. A Decidable Class of Complete Annotations

In this section w e define a decidable p r o p e r t y of a n n o t a t e d g r a m m a r s that guarantees that a p a r s e r following the annotations will not miss analyses in the m a n - ner outlined in Section 2.1. There is also a w e a k e r (more general) sufficient condition for completeness, w h i c h is defined in Section 5 below, b u t w h i c h is u n d e - cidable. The fact that the stronger condition is decidable m a k e s it w o r t h defining, a n d some of the proofs in Section 5 m a k e use of some concepts from the c u r r e n t section.

4.1 Reachability

A n o t h e r n o t i o n that has to be f o r m a l i z e d is the w a y in w h i c h a syntax tree can be p a r s e d f r o m a string of terminal symbols in a p u r e l y b o t t o m - u p manner.

Definition 13

Let (G,

tr)

be a BSCFG. In a syntax tree T g e n e r a t e d b y G, a nonlexical n o d e M0, w i t h d a u g h t e r s M1 . . .

Mk,

is said to be reachable from b e l o w iff M0 is licensed b y a b o t t o m - u p rule r a n d there is a j, 1 G j G k, such that j E

tr(r)

a n d one of the following is true:

.

2.

Mj

is a p r e t e r m i n a l n o d e of T;

Mj

is reachable f r o m below.

Definition 14

Let (G,

tr)

be a BSCFG. A syntax tree T g e n e r a t e d b y G, is said to be fully reachable

iff e v e r y nonlexical n o d e M in T licensed b y a p u r e l y b o t t o m - u p rule is reachable from below.

4.2 Direct Analyzability

N o w w e n e e d to define a p r o p e r t y of g r a m m a r s that will g u a r a n t e e that g e n e r a t e d trees are fully reachable in the above sense. This can be d o n e in three stages: first, define a p r o p e r t y of n o n t e r m i n a l symbols; then, use that to define a p r o p e r t y of g r a m - mars; lastly, p r o v e that a n y g r a m m a r w i t h this p r o p e r t y generates only fully reachable trees.

(11)

Draft Definition:

G i v e n a BSCFG (G, tr), a n o n t e r m i n a l s y m b o l A0 is

directly analyzable

iff e v e r y rule r of the f o r m A0 --* . . . is either lexical, or of the f o r m Ao --* A1 . . . Ak w i t h at least one i E tr(r), i > 0, for w h i c h Ai is directly analyzable.

The s u b s e q u e n t definition for g r a m m a r s is then:

Definition 15

A BSCFG (G, tr) is

directly analyzable

iff it meets the following condition: for e v e r y p u r e l y b o t t o m - u p rule r of the f o r m Ao --* A1 . . . Ak, there is at least one i C tr(r), i > 0, for w h i c h Ai is directly analyzable.

The draft definition captures the essential idea in a fairly natural a n d clear way, b u t it has a slight technical problem. Consider the toy g r a m m a r given below:

S ~ - d B A ---~ C A C---~ x B -* y A --+ z

In this grammar, the n o n t e r m i n a l A is not classed as directly analyzable. This is because there is a cycle from A to itself via trigger symbols in b o t t o m - u p rules. 2 It w o u l d be equally consistent w i t h the draft definition to state that A is directly analyzable, or to stipulate that it is not directly analyzable. There is a sense in w h i c h the draft deft- nition is underspecified, and gives o n l y partial coverage of the items being classified (nonterminals). To e x t e n d this definition to total coverage, a m o r e elaborate construction is n e e d e d ( b o r r o w e d from theoretical c o m p u t e r science; cf. Stoy [1981, Chap. 6]). 3 First, for a n y BSCFG (G, tr) w e define an

analyzability predicate

as a n y function g from n o n t e r m i n a l symbols to the set {true,false} that assigns true to a category A0 iff e v e r y rule r of the f o r m A0 --* . . . is either lexical, or of the f o r m A0 --* A1 ...Ak, with at least one i E tr(r), i > 0, for w h i c h g(Ai) = true. Call the set of all such func- tions A73(G, tr). For a n y two g , h E A79(G, tr), define the relation " u " b y h u g iff h ( A ) = true D g ( A ) = true. This relation is easily s h o w n to be reflexive, transitive, a n d antisymmetric, a n d h e n c e ( A ~ ( G , tr), F) forms a partially o r d e r e d set (Maclane a n d Birkhoff 1967, 59; Stoy 1981, 82). Then for a n y set gl . . . gn of elements of A~P(G, tr), the function g' (in A3V(G, tr)) given b y

g'(A) = true iff either g l ( A ) = true or ... gn(A) = true

is at least u p p e r b o u n d (Maclane a n d Birkhoff 1967; Stoy 1981) for gl . . . gn with respect to G. Since A3V(G, tr) is finite, the presence of a 1.u.b. for a n y subset m e a n s it has a m a x i m u m element, w h i c h w e will call APMAX(G,tr). This predicate APMAX(G,tr) will assign true to a s y m b o l A if there is some analyzability predicate (for (G, tr)) that m a k e s this assignment. 4 T h e n define a n o n t e r m i n a l A (from G) to be directly analyzable iff APMAX(G,tr)(A ) = true. Intuitively, a n y n o n t e r m i n a l that the draft definition m i g h t

2 Thanks to Alistair Willis for pointing out this problem. 3 Thanks to Suresh Manandhar for suggesting this approach.

(12)

leave as u n d e f i n e d w i t h respect to being directly analyzable is classed b y this n e w definition as being directly analyzable. Instead of the draft definition, w e can n o w h a v e the following complete definition:

Definition 16

G i v e n a BSCFG (G, tr), a n o n t e r m i n a l s y m b o l A0 is directly

A P M A X ( c , tr) (Ao) = true, w h e r e APMAX(G,tr) is as constructed above.

analyzable iff

Notice that it follows f r o m the construction of APMAX(G,tr) that A0 is directly analyzable iff e v e r y rule r of the f o r m A0 --* . . . is either lexical, or of the f o r m Ao --* A1 ... Ak w i t h at least one i E tr(r), i > 0, for w h i c h Ai is directly analyzable. That is, the draft definition, w h i c h was not sufficiently self-contained to be a definition, is n o w deriv- able as a theorem f r o m the m o r e rigorous definition. This m e a n s that w e can use the logical equivalence stated in the draft definition in s u b s e q u e n t proofs.

Lemma 2

Let (G, tr) be a BSCFG that is directly analyzable. Let T be a tree in trees(G). T h e n T is fully reachable.

Proof

It is s t r a i g h t f o r w a r d to p r o v e the following p r e l i m i n a r y result, using i n d u c t i o n o n the height of n o d e s a n d the logical equivalence stated in the draft definition above:

A n y n o d e in T that has a directly analyzable label is reachable from below.

It is t h e n easy to s h o w that a n y n o d e in T that is licensed b y a p u r e l y b o t t o m - u p rule

is reachable from below. []

4.3 Parsing

Lemma 3

Let (G, tr) be a BSCFG. Let T be a fully reachable tree in trees(G), s p a n n i n g the string or. Let C be a fully bidirectional chart b a s e d o n cr a n d using (G, tr). T h e n for a n y n o d e M in T, labeled A:

.

If M is licensed b y a p u r e l y b o t t o m - u p rule, t h e n C contains a representation of the subtree r o o t e d at M.

If M is licensed b y a t o p - d o w n rule, a n d there is in C an active e d g e

(t,g, B0 --~ B1 . . . B p _ l • B p . . . B q _ l • B q . . . B v )

w h e r e either Bp-1 = A a n d t is the e n d of M, or Bq = A a n d g is the start of M, then C contains a r e p r e s e n t a t i o n of the subtree r o o t e d at M.

Proof

By i n d u c t i o n on the height of nodes.

Inductive Hypothesis: For a n y 0 < d ~ < d, if n o d e M in T is of height d ~, the conditions listed in the l e m m a hold.

(13)

Inductive Step: S u p p o s e M is of height d, w h e r e d > 1, a n d is labeled A.

(a) S u p p o s e M, w i t h d a u g h t e r n o d e s M1 . . . Mk, is licensed b y a p u r e l y b o t t o m - u p rule A --* A1 • • • Ak. Then, since T is fully reachable, there is a j, 1 ~ j ~ k such that My is reachable from below; that is, My is either lexical or licensed b y a b o t t o m - u p rule. Therefore, b y the Inductive Hypothesis, C contains a representation of the subtree r o o t e d at Mj. Since C is fully bidirectional, it contains an edge s p a n n i n g M j of the form:

( l , h , A ~ A 1 . . . A j _ I • Aj • Aj+l .. . a k )

N o w consider the n o d e s Mi, for j + 1 K i < k. The Inductive H y p o t h e s i s applies to each of these nodes. H e n c e it can be p r o v e d b y i n d u c t i o n on i that there is a representation in C of the subtree r o o t e d at M i for all j + 1 < i < k (cf. L e m m a 1). Similarly, it can be p r o v e d that there are representations in C for the subtrees r o o t e d at M1 . . . M j _ I . By the corollary to L e m m a 1, there is a representation for the tree r o o t e d at M i n C .

S u p p o s e M is licensed b y a t o p - d o w n rule. S u p p o s e there is an e d g e

(t,g, Bo ~ B1 . . . Bp-1 • B p . . . Bq-1 * B q . . . Bv)

w h e r e Bq = A a n d g is the start of M (a similar a r g u m e n t holds in the case w h e r e B;_I = A a n d t is the e n d of M). Since the chart is fully bidirectional, there m u s t also be an e m p t y active edge

(g, g , A --~ • • A 1 . . .Ak)

By a similar a r g u m e n t to that in case (a) above, it follows that there are representations in C for all the n o d e s M1 . . . M k a n d thence for M.

This establishes the m a i n induction. []

L e m m a 4

Let (G, tr) be a BSCFG that is directly analyzable. Let T be a tree in trees(G), s p a n n i n g the string ~. Let C be a fully bidirectional chart b a s e d o n o- a n d using (G, tr). Then for a n y n o d e M in T, labeled A:

.

If M is licensed b y a p u r e l y b o t t o m - u p rule, then C contains a representation of the subtree rooted at M.

If M is licensed b y a t o p - d o w n rule, a n d there is in C an active edge

(t,g, Bo ~ B 1 . . . B p _ l e B ; . . . B q _ l o B q . . . B v )

w h e r e either Bp-1 = A a n d t is the e n d of M, or Bq = A a n d g is the start of M, then C contains a representation of the subtree r o o t e d at M.

Proof

(14)

Being reachable from below can be seen as a condition on nodes that can be built bottom-up. Surprisingly, we do not need a corresponding condition for nodes that are built top-down. It is possible to formulate the appropriate condition, but it turns out that any tree that meets the condition of being fully reachable will also meet the appropriate condition for top-down nodes. It is hard to give an informal, intuitive explanation for this, but roughly speaking, the reason is as follows. For a top-down rule to be invoked, it must be used in a position at which some prediction of its LHS symbol A will be introduced (by some other rule). This can happen either as a cascade of predictions from above, using a sequence of top-down rules, or because a rule has been introduced and has caused a sequence of predictions to be made, either left- to-right or right-to-left, as its RHS symbols are parsed. For either of these to happen, either there must be a clear path of daughter categories from some other prediction, or A must be on the RHS of a rule that is somehow introduced. The daughter condition of "reachable from below" simultaneously imposes these conditions on the top-down rules.

4.4 Completeness

The final step in proving completeness is now simple.

Theorem 1

If a BSCFG (G, tr) is directly analyzable, then

tr

is complete.

Proof

Let T be a tree in

trees(G),

spanning the string or, with root node M0 labeled S (the distinguished symbol of G). Let C be a fully bidirectional chart based on cr and using (G, tr).

(a)

(b)

If M0 is licensed by a bottom-up rule, then by Lemma 4, C contains a representation of the tree rooted at M0.

If M0 is licensed by a top-down rule S --* A1 . . . Ak, then C must contain an empty active edge of the form:

(0,0, S --~ • . A 1 . . . A k )

It follows from repeated applications of Lemma 4 and Lemma 1 (similar to part (b) of the Inductive Step of Lemma 3) that C contains a

representation of the subtrees rooted at the daughters of M0, and thence

of the subtree rooted at M0. []

Thus we have proved that all BSCFGs that meet the condition of being directly analyzable can be bidirectionally parsed without any valid trees being omitted.

Theorem 2

It is decidable whether a given BSCFG is directly analyzable.

Proof

(15)

5. An Undecidable Class of Complete Annotations

5.1 Informal Outline

Before proceeding to formalize the mechanisms underlying the problem presented in Section 2.2, it is useful to set out informally the relevant factors in that example. A strategy-marked grammar (G, tr) causes problems only if there is some purely bottom- up rule of the form Ao --* A 1 . . . Ak such that every trigger symbol

Ai

requires a purely top-down rule somewhere in its expansion (see Section 4.2 above). Such a rule leads to the possibility of there being a tree T E trees(G) that contains a nonterminal node that can only be built by a bottom-up rule, and whose trigger daughter can only be built using a top-down rule. This would give rise to a tree that was not fully reachable. In the example in Section 2.2 above, the rules

H---~ B F

B - - ~ P Q

create this situation. The "upper" symbol H cannot be parsed because the "lower" symbol B cannot be parsed. What salvages this difficulty is the fact that the "upper" nonterminal (H in this example) always occurs in a left context, (i.e., a string of symbols to its left) with the following property: Every possible terminal expansion of the left context contains a substring that will, via bottom-up rules, introduce rules that are bound to result in the introduction of an active edge, which starts at the point where the "upper" symbol (H) is needed and which is seeking the "lower" symbol (B).

The illustrative grammars in Sections 2.1 and 2.2 are of a particular subclass of grammars--those where

tr(r)

= {0} or

tr(r)

= {1} for any (nonlexical) rule r. This is equivalent to partitioning the rules into two subgroups--top-down and bottom-up-- where the bottom-up rules are always triggered in a left-corner manner, much as in conventional "bottom-up" chart parsers (such as those in Thompson and Ritchie [1984] or Winograd [1983]). That is, there is a natural subclass of annotated grammars that do not rely on the bidirectional exploration of the chart, but allow this limited form of mixed strategy left-to-right exploration.

The definitions and proofs of the earlier sections apply to this subclass. It is also clear, from Section 2.2, that the issue of "completeness by interaction" can be illustrated within this limited subclass. In the remainder of Section 5 below, it is proved that detecting the possibility of such rule interactions is undecidable even for this limited subclass of grammars. It follows that it must be undecidable for the more general class, where any annotation is permitted. The advantages of focusing on this more limited subclass are twofold: it shows that restricting the annotations in this w a y would not ease the undecidability problem, and it simplifies the proofs (which are already tediously complex).

Definition 17

A left-corner strategy-marked context-free grammar (LCSCFG) is a BSCFG (G, tr) such

that

tr(r) c

{0,1} for every rule r in G.

(16)

5.2 Left Contexts

Following from the informal discussion in Section 5.1 above, w e n e e d to define m o r e precisely the notion of a left context of a symbol. W h a t w e w a n t is a w a y of characterizing, for a g i v e n n o n t e r m i n a l A, exactly those strings of symbols that m u s t a p p e a r i m m e d i a t e l y to the left of A in a n y valid d e r i v a t i o n in w h i c h A appears. These n e e d not be all that is to the left of A in a derivation, b u t it m u s t be the case that A c a n n o t a p p e a r w i t h o u t h a v i n g one of these left context strings i m m e d i a t e l y adjacent to it.

In the following definitions, the d e r i v a t i o n relationship " G " is the c o n v e n t i o n a l

one, a n d is i n d e p e n d e n t of a n y strategy marking; the relationship " ~ " indicates a r i g h t m o s t d e r i v a t i o n (see Section 3.1 earlier).

Definition 18

S u p p o s e w e h a v e a context-free g r a m m a r (VN, VT, P, S), a n d a sequence of symbols

B1 . . . . , B t i n VN, w h e r e there are rules e i --+ P i - l B i - l f l i - 1 for 2 < i < t (Pi, fli C V'~l ). S u p p o s e w e h a v e a r i g h t m o s t d e r i v a t i o n of the form:

Bt ~ P t - l B t - l a ; t - 1 R

P t - l P t - 2 B t - 20dt-2

R

::~ P t - l P t - 2 • •. B2¢v2 R

=:k P t - l P t - 2 • • • plBlCOl

(all the ~vi E V~). This d e r i v a t i o n is said to be:

1 . 2.

3.

4.

n o n r e p e a t i n g if B i ~ Bj w h e n e v e r i ~ j.

rooted if B t = S.

l o c a l i z e d if there is a longer sequence of n o n t e r m i n a l symbols B1 . . . . ,Bm I

a n d a r o o t e d r i g h t m o s t d e r i v a t i o n Bm ~ Pm--1 . . . plBlWl such that Bt = Bk for some t < k < m.

essential if it is n o n r e p e a t i n g a n d either r o o t e d or localized.

Also, the d e r i v a t i o n is said to be for B1 f r o m B t , a n d the string pt-1 • .. Pl is said to be the left context s e q u e n c e of this derivation.

Definition 19

For a n y n o n t e r m i n a l A, the set of essential left contexts of A is

{or E V~ [ 3 an essential r i g h t m o s t d e r i v a t i o n D for A a n d ~b is the left

context sequence of D a n d ~b ~ or}

The following l e m m a p r o v e s that essential left contexts h a v e just the r e q u i r e d property.

L e m m a 5

(17)

pk,

ki,

i

Pk-~ :: Pl i 8 i m,

9 P 4

7

0 ~l

9 I*

Figure 1

Situation described in Lemma 5.

Proof

(See Figure 1 for an intuitive picture.) Since T is a tree, there is a p a t h of n o n t e r m i n a l n o d e s (N1 . . . . , N t) w h e r e Ni is labeled Bi, for 1 < i < t, N1 = M , Nt is the root of T, a n d Ni is the m o t h e r of Ni-1 for 2 < i < t . Since T E trees(G), there m u s t be a sequence of rules e i ---+ P i - l B i - l f l i - 1 (Pi, fli E W~l ) such that the n o d e Ni is licensed b y Bi ---+ Pi-lBi-lfli-1 for 2 < i < t. H e n c e there is a r o o t e d r i g h t m o s t d e r i v a t i o n for B1 from Bt. F r o m this it is trivial to f o r m an essential r i g h t m o s t d e r i v a t i o n for B1 from some s y m b o l Bk (where k < t):

Bk ~ Pk-1 ...plBlWl w h e r e a;1 E V~ a n d

Pk-I'''plBIOdl ~ 0 w h e r e 0 is the substring of ~ s p a n n e d b y Nk.

This m e a n s that 0 is of the f o r m 3"~wl w h e r e Pk-1 .. • Pl :~ 3" a n d B1 ~ 6 (since B1 is the label of M, a n d 6 is the terminal string s p a n n e d b y M). T h e n 3' is an essential left context of B1, b y virtue of the w a y P k - 1 . . . Pl was constructed. Since 0 is a substring

[image:17.468.40.385.63.404.2]

(18)

5.3 Bottom-up Derivations

Definition 20

In a BSCFG

(G, tr),

s u p p o s e A is a n o n t e r m i n a l s y m b o l , a n d cr is a string of t e r m i n a l s y m b o l s . T h e n ~ can b e coherently derived from A (with tree T) iff T is a syntax tree g e n e r a t e d b y G s u c h that:

1. T s p a n s cr

2. the root of T is labeled A

3. T is fully reachable.

The n e x t definition requires that the d e r i v a t i o n can occur w i t h o u t n e e d for top- d o w n initiation.

Definition 21

Let (G, tr) be a BSCFG. S u p p o s e A is a n o n t e r m i n a l s y m b o l , a n d ~r is a string of t e r m i n a l s y m b o l s , f r o m G . T h e n cr can b e up-derived from A (written "A ~* or") u s i n g (G, tr) iff:

.

2.

cr can b e coherently d e r i v e d f r o m A w i t h tree T;

the root of T is reachable f r o m below.

It is clear that all n o d e s of s u c h trees will a p p e a r in a chart, as f o r m a l i z e d in L e m m a 6.

Lemma 6

Let

(G, tr)

b e a BSCFG. If A ~* v u s i n g

(G, tr),

a n d C is a fully bidirectional chart b a s e d on a string 3,1c~-~2, a n d u s i n g

(G, tr),

t h e n C contains a r e p r e s e n t a t i o n of a tree T s u c h that T s p a n s e a n d the root of T is labeled A.

Proof

Follows f r o m L e m m a 3. []

5.4 Left-Introducible Rules

In characterizing f o r m a l l y the situation o u t l i n e d i n f o r m a l l y in Section 5.1 above, the following definition allows a m o r e succinct statement.

Definition 22

Let A, B be t w o n o n t e r m i n a l s y m b o l s f r o m a LCSCFG.

A introduces B from above

(written "A ~,- B") if either A = B, or there is a s e q u e n c e of t o p - d o w n rules

a ---+ a o . . .

Ao ~ A1 ...

(19)

Z 4

A . . . t i

Pl Pi

Figure 2

Left-introducibility.

A i+l

?

L e m m a 7

Let A, B be t w o n o n t e r m i n a l s y m b o l s f r o m a LCSCFG (G, tr). If A ---* B, t h e n in a n y bidirectionally m i x e d s t r a t e g y e x p l o r e d chart C u s i n g (G, tr) that contains a n active e d g e ( i , j , A ' - - * . a l .Aft1), there will also be an active e d g e in C of the f o r m ( I , j , A " --* ca2 • Bfl2), w h e r e either l = i or l = j.

Proof

Straightforward. The case l = i allows for A = B (and A ' = A " ) , a n d I = j is the m o r e general case w h e r e there is a s e q u e n c e of t o p - d o w n - i n v o k e d active e d g e s linking A

to B. []

N e x t w e h a v e a definition of the condition on rules that allows t h e m to enter into the p a r s i n g p r o c e s s d e s p i t e the difficulties outlined in Section 2.1 above.

D e f i n i t i o n 23

In a LCSCFG (G, tr), a rule B0 --+ BlC~ is said to be left-introducible iff for e v e r y "~ that is an essential left context of B0, there is a b o t t o m - u p rule Ao -* A1 • • • Ak s u c h that

1o 2.

3.

4.

"y = Xpl . . . p i for s o m e i < k

A1 ~* Pl

pj c a n be coherently d e r i v e d f r o m Aj, for all 1 < j G i Ai+l ""+ Bo

Scrutiny of this definition s h o u l d reveal its relationship to the i n f o r m a l outlines in Sections 2.2 a n d 5.1 earlier (see also Figure 2). Notice that for a n y n o n t e r m i n a l A, if S ~ A . . . t h e n the e m p t y string is an essential left context of A a n d h e n c e a n y rule of the f o r m A --~ . . . c a n n o t be left-introducible.

L e m m a 8

[image:19.468.35.257.58.236.2]

(20)

the start of both M0 a n d M1 is m. Suppose that the rule B0 --* B1 • .. licensing M0 in T is left-introducible. Then in a n y fully bidirectional chart based on the terminal string s p a n n e d b y T a n d using (G, tr), there is an active edge of the form

(I, m, A --* •c~ • Boil) (i.e., an edge at the start of M0,M1, seeking B0).

Proof

Let the string s p a n n e d b y T be ~r, w i t h ¢ = cr16cr2, where 6 is s p a n n e d b y M0. Let C be a fully bidirectional chart based on cr a n d using (G, tr). By L e m m a 5, Crl is of the form ~ where 3/is in the essential left contexts of B0. Since B0 --+ B1 . . . is left-introducible, every such ~ has the property that there is a bottom-up rule Ao --* A1 . . . A k such that

• "Y = X P l . . . P i

• A1 "~* Pl

• pj can be coherently derived from A j , 1 < j <_ i

• A i + l ~ Bo

It follows from L e m m a 6 that, since A1 ~* Pl, there are inactive edges in C for all nodes of a tree w i t h root label A1 spanning pl. Since Ao --* A1 . . . A k is bottom-up, this means there is an active edge in C

(1,j',A0 -* •A1 • A 2 . . . a k )

where j is the start of the inactive edge for the root of this tree (i.e., the n o d e labeled A1), a n d j' is its end. By L e m m a 1 a n d L e m m a 4, there are inactive edges in C labeled A2 . . . Ai, corresponding to nodes spanning/92 . . . Pi. By L e m m a 1, there is an edge

(j,m, A0 --* • A 1 . . . A i • A i + l . . . A k )

where m is the start of & Since A i + l "~ B0, b y L e m m a 7 there is an active edge

( l , m , A --* .c~ . Bofl) []

5.5 Indirect Analyzability

In Section 4 we defined direct analyzability as a condition on g r a m m a r s that w o u l d lead to complete parsing. N o w we establish a more general property that also leads to completeness.

Definition 24

Let (G, tr) be a LCSCFG. A nonterminal symbol A0 in G is said to be indirectly an-

alyzable iff every rule A0 --* ~; is either lexical, or t o p - d o w n a n d left-introducible, or

bottom-up of the form A0 --* A-~c~ where A1 is indirectly analyzable. 5

(21)

Definition 25

A LCSCFG (G, tr) is indirectly analyzable iff for every purely bottom-up rule A0 -* Alc~, the nonterminal symbol A1 is indirectly analyzable.

The next two lemmas ensure that a g r a m m a r w i t h the property of indirect analyzability leads to complete parses. The first is just a generalization of L e m m a 4.

Lemma 9

Let (G, tr) be a BSCFG that is indirectly analyzable. Let T be a tree in trees(G), spanning the string or. Let C be a fully bidirectional chart based on cr a n d using

(G, tr).

Then for a n y node M labeled A in T:

.

If M is licensed b y a purely bottom-up rule, then C contains a representation of the subtree rooted at M.

If M is licensed b y a t o p - d o w n rule, a n d there is in C an active edge

(t,g, Bo -- B1...Bp-1 • Bp...Bq_l • Bq...An)*

where either Bp-1 = A a n d t is the end of M, or Bq = A a n d g is the start of M, then C contains a representation of the subtree rooted at M.

Proof

By induction on the height of nodes, in a m a n n e r very similar to L e m m a 3, except that part (a) of the Inductive Step is as follows:

Inductive Step(a): Suppose M0, w i t h d a u g h t e r nodes M1 . . . Mk, is licensed b y a purely bottom-up rule A0 --* A 1 . . . Ak. Then, since (G, tr) is indirectly analyzable, this means that A1 is indirectly analyzable. Hence whatever rule licenses M1, it m u s t be either lexical, or t o p - d o w n a n d left-introducible, or bottom-up with an indirectly analyzable symbol at the start of its RHS. In the lexical a n d bottom-up cases, the Base Case a n d Inductive Hypothesis establish that C contains a representation of the tree rooted at M1. If the rule is t o p - d o w n and left-introducible, there is an edge seeking its LHS symbol at the start of M1, a n d so, b y the Inductive Hypothesis, there is a representation of the tree rooted at M1 in C. Since C is fully bidirectional, it contains an edge spanning M1 of the form:

(I, h, A0 --~ ,A1 • A 2 . . . Ak)

Repeated applications of the Inductive Hypothesis a n d L e m m a 1 (Corollary) establish that there is a representation of the tree rooted at M0 in C (i.e., the Inductive Step).

Lemma 10

(22)

Proof

By induction on the depth of nodes.

Inductive Hypothesis: Assume that for any node M of depth d', where 0 _< d' < d, the lemma holds.

Base Case: Suppose M is of depth 1 (i.e., a daughter of the root node). Suppose the root is licensed by a rule S - ~ A1 . . . A k , where Mi is the ith daughter of the root (1 < i < k) and M = My.

(a)

(b)

Assume this rule is purely bottom-up. Since (G, tr) is indirectly analyzable, A1 is indirectly analyzable. Consider the rule that licenses

M1. It cannot be left-introducible, as S ~ A1 ... (see earlier remark about empty essential left contexts); hence it must be either lexical or

bottom-up. By Lemma 9, C contains a representation of the subtree rooted at M1. Since the rule S --+ A1 . . . A k is bottom-up, C contains an active edge of the form (0, 0, S ~ • A 1 • A 2 . . . A k ) .

Assume this rule is top-down. Then there must be an empty active edge (0,0,S --~ • • A 1 . . . A k ) .

By repeated applications of Lemma 9 and Lemma 1, there are edges of the form

(0, ti, S --~ cA1 . . . A i • Ai+l . . . A k ) 1 < i < (j - 1). The last of these fulfils the condition.

Inductive Step: Let M be a node labeled A of depth d > 1, licensed by a top-down rule r. Let its mother node be N, of depth (d - 1), and the leftmost daughter of N be M1. Consider the rule r' (of the form A0 --+ A1 ... Ak), which licenses N.

(a)

1°

2.

.

Suppose r' is purely bottom-up. Then M1 is labeled with an indirectly analyzable symbol, A1. Hence for the rule licensing M1, three cases must be considered:

It is lexical. In this case, C contains a representation for M1. It is top-down and left-introducible. By Lemma 8, there is an edge at the start of M1 seeking its label. If M = M1, this establishes the Inductive Step in this situation. Otherwise, by Lemma 9, there is a representation in C for the subtree rooted at M1.

It is bottom-up of the form A1 --~ B1 ... where B1 is indirectly analyzable. By Lemma 9, there is a representation in C for the subtree rooted at M1.

(23)

(b) Suppose r' is top-down. By the Inductive Hypothesis, there is an edge at the start of N seeking the label of N. Since r' is top-down, there is also an empty active edge for r ~ at that point. If M = M1, this establishes the Inductive Step in this situation. Otherwise, by repeated applications of Lemma 1 and Lemma 9, there is an active edge seeking A at the start

of M. []

Theorem 3

If a LCSCFG (G, tr) is indirectly analyzable, then tr is complete.

Proof

Follows from Lemma 9 and Lemma 10. []

5.6 Undecidability

We have established that the condition of indirect analyzability suffices to ensure completeness. Unfortunately, indirect analyzability is not a decidable property of annotated grammars, as we now show.

Theorem 4

It is undecidable whether an arbitrary LCSCFG is indirectly analyzable.

Proof

Suppose that there were a decision procedure for indirect analyzability. This could then be used to construct a decision procedure that determines for any two CFGs G1, G2 whether every member of

L(G1)

ends in a substring that is a member of L(G2). This is an undecidable problem (see Appendix B); hence, the indirect analyzability question is also undecidable. The construction proceeds as outlined below.

Suppose we have the two arbitrary CFGs G1 and G2 over the same alphabet VT, and assume that their nonterminal alphabets 1 Vh, V N do not intersect. Construct a LCSCFG 2 as follows. The distinguished symbol S ~ is distinct from all symbols in V 1 U V 2. Use symbols B1, B2 . . . B6 also not in V 1 U V 2. The purely bottom-up rules are all the rules

o f G2, together with

B1 ~ B2B3 B6 --~ $2B2

where each Si is the distinguished symbol of Gi. The purely top-down rules are all the rules of G1 together with

S I ___+ S1B1 B2 ~ B4B5

Also we include lexical rules:

B 3 ---+a

B4 ---~b B5 --*c

for some terminal symbols a, b, c.

This LCSCFG is indirectly analyzable iff (by definition) every purely bottom-up rule has an indirectly analyzable symbol at the start of its RHS (the trigger position). All the rules taken directly from G2 meet this condition, since all are bottom-up. So, therefore, does the rule B 6 ---+ $2B2. All the rules taken directly from G1, and the rule

S / --+ SIB1, do not affect the condition, since all are top-down. Hence the grammar

(24)

analyzable. This depends on whether the only rule expanding B2, B2 ~ B4B5 is left- introducible. The essential left contexts of B2 is the set {7 E V~ I S1 ~ 3'}. The only symbol X for which X -,z B2 is B2 itself. Hence the only rule that meets the schema for left-introducibility is B6 ~ $2B2. So B2 --* B4B5 is left-introducible iff every 3' such that $1 G 3' is of the form Cp with $2 coherently derived from p. Since all G2 rules are bottom-up, $2 is coherently derived from p iff $2 ~ p. Hence the left-introducibility of the rule in question is logically equivalent to L(G1) C V~ + L(G2) (where + indicates

concatenation). []

6. Some Further Complications

So far, the proofs have shown that direct analyzability is a sufficient condition for completeness, and that indirect analyzability (a more general condition) is also sufficient. The question might be posed--is indirect analyzability necessary for completeness? In fact, it is not, as there is at least one other sufficient condition for completeness, not covered by indirect analyzability.

It is not worthwhile formalizing and analyzing these possibilities in detail, but a brief informal outline of one such condition m a y be helpful. This occurs where a set of rules that is not directly analyzable, and might seem to cause "blocking" as discussed in Section 2.1 earlier, is redeemed by interaction with other rules in the grammar. This is similar to the phenomenon analyzed in Section 5 above, but whereas the analysis above dealt with a configuration of rules that can be parsed bottom-up to the left of the problematic rule, there is an analogous condition on subtrees to the left that can be parsed top-down.

The following grammar illustrates this phenomenon.

S--+ H K S--* Z B H--+ E F E--* P R Q---~ T V -Z---~ H Q K - - * Q D P - - , p R ---+ r F--* f D ---~ d T - + t V---+ v

D

Here, the grammar is not indirectly analyzable, as the purely bottom-up rule K --* Q D has a trigger category Q that is not indirectly analyzable. (The rule S --* H K is also problematic.) However, the only situation in which K --+ Q D would be needed would be to parse a string prftvd. Since S --~ H, there will be an empty active edge introduced for H --* E F at the start of the string. This will parse prf (top-down) as an H, and this will combine with the active edge already introduced for Z --* H Q, leading to the introduction of an empty active edge for Q --* T V at the start of the correct substring, tvd.

Completeness conditions for mixed strategy bidirectional parsing

Strategy Bidirectional Parsing

2. The problems

2.1 Losing Completeness

2.2 Completeness through Interactions

S =~ EH ~ jH ~ j B F ~ j P Q F ~ j l Q F ~ j l m F =~jlmk

S - - , C D

D--,-E A

-d--* B C

C--+ x

xjlmx,

jlmk, with an associated syntax tree that corresponds to the derivation given

Im

VN,

(ipq-l,jp+l, Bp+l --+

•Wp+l,)

( i(q-t), j(q-t), B(q-t) -'4 •bd(q-t) • )

(iq,jq, Bq --4 •~q,)

G,

tr

trees(G)

tr)

tr)

Mk,

tr(r)

Mj

Mj

tr)

Draft Definition:

directly analyzable

Definition 15

directly analyzable

analyzability predicate

tr

trees(G),

Ai

H---~ B F

B - - ~ P Q

tr(r)

tr(r)

tr(r) c

pk,

ki,

0

~l

(G, tr),

(G, tr)

(G, tr),

(G, tr),

A introduces B from above

Ao ~ A1 ...

?

(G, tr).

(t,g, Bo --* B1...Bp-1 • Bp...Bq_l • Bq...An)

(b)

L(G1)

-d-- B C*

(t,g, Bo -- B1...Bp-1 • Bp...Bq_l • Bq...An)*