An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities


Andreas Stolcke*
University of California at Berkeley and International Computer Science Institute

[* Speech Technology and Research Laboratory, SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025. E-mail: stolcke@speech.sri.com.]

We describe an extension of Earley's parser for stochastic context-free grammars that computes the following quantities given a stochastic context-free grammar and an input string: a) probabilities of successive prefixes being generated by the grammar; b) probabilities of substrings being generated by the nonterminals, including the entire string being generated by the grammar; c) most likely (Viterbi) parse of the string; d) posterior expected number of applications of each grammar production, as required for reestimating rule probabilities. Probabilities (a) and (b) are computed incrementally in a single left-to-right pass over the input. Our algorithm compares favorably to standard bottom-up parsing methods for SCFGs in that it works efficiently on sparse grammars by making use of Earley's top-down control structure. It can process any context-free rule format without conversion to some normal form, and combines computations for (a) through (d) in a single algorithm. Finally, the algorithm has simple extensions for processing partially bracketed inputs, and for finding partial parses and their likelihoods on ungrammatical inputs.

1. Introduction

Context-free grammars are widely used as models of natural language syntax. In their probabilistic version, which defines a language as a probability distribution over strings, they have been used in a variety of applications: for the selection of parses for ambiguous inputs (Fujisaki et al. 1991); to guide the rule choice efficiently during parsing (Jones and Eisner 1992); to compute island probabilities for non-linear parsing (Corazza et al. 1991). In speech recognition, probabilistic context-free grammars play a central role in integrating low-level word models with higher-level language models (Ney 1992), as well as in non-finite-state acoustic and phonotactic modeling (Lari and Young 1991). In some work, context-free grammars are combined with scoring functions that are not strictly probabilistic (Nakagawa 1987), or they are used with context-sensitive and/or semantic probabilities (Magerman and Marcus 1991; Magerman and Weir 1992; Jones and Eisner 1992; Briscoe and Carroll 1993).

Although clearly not a perfect model of natural language, stochastic context-free grammars (SCFGs) are superior to nonprobabilistic CFGs, with probability theory providing a sound theoretical basis for ranking and pruning of parses, as well as for integration with models for nonsyntactic aspects of language. All of the applications listed above involve (or could potentially make use of) one or more of the following standard tasks, compiled by Jelinek and Lafferty (1991). [Footnote 1: Their paper phrases these problems in terms of context-free probabilistic grammars, but they generalize in obvious ways to other classes of models.]

1.  What is the probability that a given string x is generated by a grammar G?

2.  What is the single most likely parse (or derivation) for x?

3.  What is the probability that x occurs as a prefix of some string generated by G (the prefix probability of x)?
4.  How should the parameters (e.g., rule probabilities) in G be chosen to maximize the probability over a training set of strings?

The algorithm described in this article can compute solutions to all four of these problems in a single framework, with a number of additional advantages over previously presented isolated solutions.

Most probabilistic parsers are based on a generalization of bottom-up chart parsing, such as the CYK algorithm. Partial parses are assembled just as in nonprobabilistic parsing (modulo possible pruning based on probabilities), while substring probabilities (also known as "inside" probabilities) can be computed in a straightforward way. Thus, the CYK chart parser underlies the standard solutions to problems (1) and (4) (Baker 1979), as well as (2) (Jelinek 1985). While the Jelinek and Lafferty (1991) solution to problem (3) is not a direct extension of CYK parsing, the authors nevertheless present their algorithm in terms of its similarities to the computation of inside probabilities.

In our algorithm, computations for tasks (1) and (3) proceed incrementally, as the parser scans its input from left to right; in particular, prefix probabilities are available as soon as the prefix has been seen, and are updated incrementally as it is extended. Tasks (2) and (4) require one more (reverse) pass over the chart constructed from the input.

Incremental, left-to-right computation of prefix probabilities is particularly important since it is a necessary condition for using SCFGs as a replacement for finite-state language models in many applications, such as speech decoding. As pointed out by Jelinek and Lafferty (1991), knowing probabilities P(x₀ ⋯ xᵢ) for arbitrary prefixes x₀ ⋯ xᵢ enables probabilistic prediction of possible follow-words x_{i+1}, as P(x_{i+1} | x₀ ⋯ xᵢ) = P(x₀ ⋯ xᵢ x_{i+1}) / P(x₀ ⋯ xᵢ). These conditional probabilities can then be used as word transition probabilities in a Viterbi-style decoder or to incrementally compute the cost function for a stack decoder (Bahl, Jelinek, and Mercer 1983).

Another application in which prefix probabilities play a central role is the extraction of n-gram probabilities from SCFGs (Stolcke and Segal 1994). Here, too, efficient incremental computation saves time, since the work for common prefix strings can be shared.

The key to most of the features of our algorithm is that it is based on the top-down parsing method for nonprobabilistic CFGs developed by Earley (1970). Earley's algorithm is appealing because it runs with best-known complexity on a number of special classes of grammars. In particular, Earley parsing is more efficient than the bottom-up methods in cases where top-down prediction can rule out potential parses of substrings. The worst-case computational expense of the algorithm (either for the complete input, or incrementally for each new word) is as good as that of the other known specialized algorithms, but can be substantially better on well-known grammar classes.
Earley's parser (and hence ours) also deals with any context-free rule format in a seamless way, without requiring conversions to Chomsky Normal Form (CNF), as is often assumed. Another advantage is that our probabilistic Earley parser has been extended to take advantage of partially bracketed input, and to return partial parses on ungrammatical input. The latter extension removes one of the common objections against top-down, predictive (as opposed to bottom-up) parsing approaches (Magerman and Weir 1992).

2. Overview

The remainder of the article proceeds as follows. Section 3 briefly reviews the workings of an Earley parser without regard to probabilities. Section 4 describes how the parser needs to be extended to compute sentence and prefix probabilities. Section 5 deals with further modifications for solving the Viterbi and training tasks, for processing partially bracketed inputs, and for finding partial parses. Section 6 discusses miscellaneous issues and relates our work to the literature on the subject. In Section 7 we summarize and draw some conclusions.

To get an overall idea of probabilistic Earley parsing it should be sufficient to read Sections 3, 4.2, and 4.4. Section 4.5 deals with a crucial technicality, and later sections mostly fill in details and add optional features.

We assume the reader is familiar with the basics of context-free grammar theory, such as given in Aho and Ullman (1972, Chapter 2). Some prior familiarity with probabilistic context-free grammars will also be helpful. Jelinek, Lafferty, and Mercer (1992) provide a tutorial introduction covering the standard algorithms for the four tasks mentioned in the introduction.

Notation. The input string is denoted by x. |x| is the length of x. Individual input symbols are identified by indices starting at 0: x₀, x₁, ..., x_{|x|−1}. The input alphabet is denoted by Σ. Substrings are identified by beginning and end positions, x_{i...j}. The variables i, j, k are reserved for integers referring to positions in input strings. Latin capital letters X, Y, Z denote nonterminal symbols. Latin lowercase letters a, b, ... are used for terminal symbols. Strings of mixed nonterminal and terminal symbols are written using lowercase Greek letters λ, μ, ν. The empty string is denoted by ε.

3. Earley Parsing

An Earley parser is essentially a generator that builds left-most derivations of strings, using a given set of context-free productions. The parsing functionality arises because the generator keeps track of all possible derivations that are consistent with the input string up to a certain point. As more and more of the input is revealed, the set of possible derivations (each of which corresponds to a parse) can either expand as new choices are introduced, or shrink as a result of resolved ambiguities. In describing the parser it is thus appropriate and convenient to use generation terminology.

The parser keeps a set of states for each position in the input, describing all pending derivations. [Footnote 2: Earley states are also known as items in LR parsing; see Aho and Ullman (1972, Section 5.2) and Section 6.2.] These state sets together form the Earley chart. A state is of the form

    i : kX → λ.μ,

where X is a nonterminal of the grammar, λ and μ are strings of nonterminals and/or terminals, and i and k are indices into the input string. States are derived from productions in the grammar.
The above state is derived from a corresponding production

    X → λμ

with the following semantics:

•  The current position in the input is i, i.e., x₀ ⋯ x_{i−1} have been processed so far. [Footnote 3: This index is implicit in Earley (1970). We include it here for clarity.] The states describing the parser state at position i are collectively called state set i. Note that there is one more state set than input symbols: set 0 describes the parser state before any input is processed, while set |x| contains the states after all input symbols have been processed.

•  Nonterminal X was expanded starting at position k in the input, i.e., X generates some substring starting at position k.

•  The expansion of X proceeded using the production X → λμ, and has expanded the right-hand side (RHS) λμ up to the position indicated by the dot. The dot thus refers to the current position i.

A state with the dot to the right of the entire RHS is called a complete state, since it indicates that the left-hand side (LHS) nonterminal has been fully expanded.

Our description of Earley parsing omits an optional feature of Earley states, the lookahead string. Earley's algorithm allows for an adjustable amount of lookahead during parsing, in order to process LR(k) grammars deterministically (and obtain the same computational complexity as specialized LR(k) parsers where possible). The addition of lookahead is orthogonal to our extension to probabilistic grammars, so we will not include it here.

The operation of the parser is defined in terms of three operations that consult the current set of states and the current input symbol, and add new states to the chart. This is strongly suggestive of state transitions in finite-state models of language, parsing, etc. This analogy will be explored further in the probabilistic formulation later on.

The three types of transitions operate as follows.

Prediction. For each state

    i : kX → λ.Yμ,

where Y is a nonterminal anywhere in the RHS, and for all rules Y → ν expanding Y, add states

    i : iY → .ν.

A state produced by prediction is called a predicted state. Each prediction corresponds to a potential expansion of a nonterminal in a left-most derivation.

Scanning. For each state

    i : kX → λ.aμ,

where a is a terminal symbol that matches the current input xᵢ, add the state

    i + 1 : kX → λa.μ

(move the dot over the current symbol). A state produced by scanning is called a scanned state. Scanning ensures that the terminals produced in a derivation match the input string.

Completion. For each complete state

    i : jY → ν.

and each state in set j, j < i, that has Y to the right of the dot,

    j : kX → λ.Yμ,

add the state

    i : kX → λY.μ

(move the dot over the current nonterminal). A state produced by completion is called a completed state. [Footnote 4: Note the difference between "complete" and "completed" states: complete states (those with the dot to the right of the entire RHS) are the result of a completion or scanning step, but completion also produces states that are not yet complete.] Each completion corresponds to the end of a nonterminal expansion started by a matching prediction step.

For each input symbol and corresponding state set, an Earley parser performs all three operations exhaustively, i.e., until no new states are generated. One crucial insight into the working of the algorithm is that, although both prediction and completion feed themselves, there are only a finite number of states that can possibly be produced.
Therefore recursive prediction and completion at each position have to terminate eventually, and the parser can proceed to the next input via scanning.

To complete the description we need only specify the initial and final states. The parser starts out with

    0 : 0 → .S,

where S is the sentence nonterminal (note the empty left-hand side). After processing the last symbol, the parser verifies that

    l : 0 → S.

has been produced (among possibly others), where l is the length of the input x. If at any intermediate stage a state set remains empty (because no states from the previous stage permit scanning), the parse can be aborted because an impossible prefix has been detected.

States with empty LHS such as those above are useful in other contexts, as will be shown in Section 5.4. We will refer to them collectively as dummy states. Dummy states enter the chart only as a result of initialization, as opposed to being derived from grammar productions.

It is easy to see that Earley parser operations are correct, in the sense that each chain of transitions (predictions, scanning steps, completions) corresponds to a possible (partial) derivation. Intuitively, it is also true that a parser that performs these transitions exhaustively is complete, i.e., it finds all possible derivations. Formal proofs of these properties are given in the literature; e.g., Aho and Ullman (1972). The relationship between Earley transitions and derivations will be stated more formally in the next section.

The parse trees for sentences can be reconstructed from the chart contents. We will illustrate this in Section 5 when discussing Viterbi parses.

Table 1 shows a simple grammar and a trace of Earley parser operation on a sample sentence.

Table 1
(a) Example grammar for a tiny fragment of English. (b) Earley parser processing the sentence "a circle touches a triangle."

(a)  S  → NP VP        Det → a
     NP → Det N        N   → circle | square | triangle
     VP → VT NP        VT  → touches
     VP → VI PP        VI  → is
     PP → P NP         P   → above | below

(b)  State set 0:
       initial          0 → .S
       predicted        0S → .NP VP       0NP → .Det N       0Det → .a
     State set 1 (a):
       scanned          0Det → a.
       completed        0NP → Det.N
       predicted        1N → .circle      1N → .square       1N → .triangle
     State set 2 (circle):
       scanned          1N → circle.
       completed        0NP → Det N.      0S → NP.VP
       predicted        2VP → .VT NP      2VP → .VI PP       2VT → .touches      2VI → .is
     State set 3 (touches):
       scanned          2VT → touches.
       completed        2VP → VT.NP
       predicted        3NP → .Det N      3Det → .a
     State set 4 (a):
       scanned          3Det → a.
       completed        3NP → Det.N
       predicted        4N → .circle      4N → .square       4N → .triangle
     State set 5 (triangle):
       scanned          4N → triangle.
       completed        3NP → Det N.      2VP → VT NP.       0S → NP VP.         0 → S.
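As a concrete illustration of the three transitions, the following is a minimal nonprobabilistic Earley recognizer sketch in Python. It is our own illustration, not the paper's implementation: the dictionary grammar encoding, the function name, and the agenda-based control loop are assumptions made here for brevity, and ε-productions are not handled (consistent with the simplifying assumption adopted below). Run on the Table 1 grammar it accepts the sample sentence.

    # A state is (lhs, rhs, dot, origin); chart[i] is Earley state set i.
    def earley_recognize(rules, start, tokens):
        chart = [set() for _ in range(len(tokens) + 1)]
        chart[0].add(("", (start,), 0, 0))              # dummy initial state 0 -> .S
        for i in range(len(tokens) + 1):
            agenda = list(chart[i])
            while agenda:                               # exhaust prediction/completion at position i
                lhs, rhs, dot, origin = agenda.pop()
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in rules:                    # prediction
                        for prod in rules[sym]:
                            new = (sym, tuple(prod), 0, i)
                            if new not in chart[i]:
                                chart[i].add(new); agenda.append(new)
                    elif i < len(tokens) and sym == tokens[i]:   # scanning
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                else:                                   # completion of a finished lhs expansion
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            new = (l2, r2, d2 + 1, o2)
                            if new not in chart[i]:
                                chart[i].add(new); agenda.append(new)
        return ("", (start,), 1, 0) in chart[len(tokens)]        # final state 0 -> S.

    grammar = {
        "S": [["NP", "VP"]], "NP": [["Det", "N"]],
        "VP": [["VT", "NP"], ["VI", "PP"]], "PP": [["P", "NP"]],
        "Det": [["a"]], "N": [["circle"], ["square"], ["triangle"]],
        "VT": [["touches"]], "VI": [["is"]], "P": [["above"], ["below"]],
    }
    print(earley_recognize(grammar, "S", "a circle touches a triangle".split()))  # True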
Earley's parser can deal with any type of context-free rule format, even with null or ε-productions, i.e., those that replace a nonterminal with the empty string. Such productions do, however, require special attention, and make the algorithm and its description more complicated than otherwise necessary. In the following sections we assume that no null productions have to be dealt with, and then summarize the necessary changes in Section 4.7. One might choose to simply preprocess the grammar to eliminate null productions, a process which is also described.

4. Probabilistic Earley Parsing

4.1 Stochastic Context-Free Grammars
A stochastic context-free grammar (SCFG) extends the standard context-free formalism by adding probabilities to each production:

    X → λ  [p],

where the rule probability p is usually written as P(X → λ). This notation to some extent hides the fact that p is a conditional probability, of production X → λ being chosen, given that X is up for expansion. The probabilities of all rules with the same nonterminal X on the LHS must therefore sum to unity. Context-freeness in a probabilistic setting translates into conditional independence of rule choices. As a result, complete derivations have joint probabilities that are simply the products of the rule probabilities involved.

The probabilities of interest mentioned in Section 1 can now be defined formally.

Definition 1
The following quantities are defined relative to a SCFG G, a nonterminal X, and a string x over the alphabet Σ of G.

a)  The probability of a (partial) derivation ν₁ ⇒ ν₂ ⇒ ⋯ ⇒ ν_k is inductively defined by

        1)  P(ν₁) = 1
        2)  P(ν₁ ⇒ ⋯ ⇒ ν_k) = P(X → λ) P(ν₂ ⇒ ⋯ ⇒ ν_k),

    where ν₁, ν₂, ..., ν_k are strings of terminals and nonterminals, X → λ is a production of G, and ν₂ is derived from ν₁ by replacing one occurrence of X with λ.

b)  The string probability P(X ⇒* x) (of x given X) is the sum of the probabilities of all left-most derivations X ⇒ ⋯ ⇒ x producing x from X. [Footnote 5: In a left-most derivation each step replaces the nonterminal furthest to the left in the partially expanded string. The order of expansion is actually irrelevant for this definition, because of the multiplicative combination of production probabilities. We restrict summation to left-most derivations to avoid counting duplicates, and because left-most derivations will play an important role later.]

c)  The sentence probability P(S ⇒* x) (of x given G) is the string probability given the start symbol S of G. By definition, this is also the probability P(x | G) assigned to x by the grammar G.

d)  The prefix probability P(S ⇒*_L x) (of x given G) is the sum of the probabilities of all sentence strings having x as a prefix,

        P(S ⇒*_L x) = Σ_{y ∈ Σ*} P(S ⇒* xy).

    (In particular, P(S ⇒*_L ε) = 1.)
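To make Definition 1 concrete, here is a small numeric check. It uses the two-rule grammar that reappears in Section 4.6; the value of p is purely illustrative.

    # Grammar of Section 4.6:  S -> a [p],  S -> S S [q],  q = 1 - p.
    # The string "aa" has exactly one left-most derivation,
    #     S => S S => a S => a a,
    # so by Definition 1 its probability is the product of the rules used.
    p = 0.6                     # illustrative value; any 0 < p < 1 works
    q = 1 - p
    deriv_prob = q * p * p      # rules used: S -> S S, then S -> a twice
    print(deriv_prob)           # equals p^2 * q, the string probability P(S =>* aa)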
In the following, we assume that the probabilities in a SCFG are proper and consistent as defined in Booth and Thompson (1973), and that the grammar contains no useless nonterminals (ones that can never appear in a derivation). These restrictions ensure that all nonterminals define probability measures over strings; i.e., P(X ⇒* x) is a proper distribution over x for all X. Formal definitions of these conditions are given in Appendix A.

4.2 Earley Paths and Their Probabilities
In order to define the probabilities associated with parser operation on a SCFG, we need the concept of a path, or partial derivation, executed by the Earley parser.

Definition 2

a)  An (unconstrained) Earley path, or simply path, is a sequence of Earley states linked by prediction, scanning, or completion. For the purpose of this definition, we allow scanning to operate in "generation mode," i.e., all states with terminals to the right of the dot can be scanned, not just those matching the input. (For completed states, the predecessor state is defined to be the complete state from the same state set contributing to the completion.)

b)  A path is said to be constrained by, or to generate a string x if the terminals immediately to the left of the dot in all scanned states, in sequence, form the string x.

c)  A path is complete if the last state on it matches the first, except that the dot has moved to the end of the RHS.

d)  We say that a path starts with nonterminal X if the first state on it is a predicted state with X on the LHS.

e)  The length of a path is defined as the number of scanned states on it.

Note that the definition of path length is somewhat counterintuitive, but is motivated by the fact that only scanned states correspond directly to input symbols. Thus the length of a path is always the same as the length of the input string it generates.

A constrained path starting with the initial state contains a sequence of states from state set 0 derived by repeated prediction, followed by a single state from set 1 produced by scanning the first symbol, followed by a sequence of states produced by completion, followed by a sequence of predicted states, followed by a state scanning the second symbol, and so on. The significance of Earley paths is that they are in a one-to-one correspondence with left-most derivations. This will allow us to talk about probabilities of derivations, strings, and prefixes in terms of the actions performed by Earley's parser. From now on, we will use "derivation" to imply a left-most derivation.

Lemma 1

a)  An Earley parser generates state

        i : kX → λ.μ

    if and only if there is a partial derivation

        S ⇒*_L x_{0...k−1} X ν  ⇒  x_{0...k−1} λ μ ν  ⇒*_L  x_{0...k−1} x_{k...i−1} μ ν

    deriving a prefix x_{0...i−1} of the input.
b)  There is a one-to-one mapping between partial derivations and Earley paths, such that each production X → ν applied in a derivation corresponds to a predicted Earley state X → .ν.

(a) is the invariant underlying the correctness and completeness of Earley's algorithm; it can be proved by induction on the length of a derivation (Aho and Ullman 1972, Theorem 4.9). The slightly stronger form (b) follows from (a) and the way possible prediction steps are defined.

Since we have established that paths correspond to derivations, it is convenient to associate derivation probabilities directly with paths. The uniqueness condition (b) above, which is irrelevant to the correctness of a standard Earley parser, justifies (probabilistic) counting of paths in lieu of derivations.

Definition 3
The probability P(𝒫) of a path 𝒫 is the product of the probabilities of all rules used in the predicted states occurring in 𝒫.

Lemma 2

a)  For all paths 𝒫 starting with a nonterminal X, P(𝒫) gives the probability of the (partial) derivation represented by 𝒫. In particular, the string probability P(X ⇒* x) is the sum of the probabilities of all paths starting with X that are complete and constrained by x.

b)  The sentence probability P(S ⇒* x) is the sum of the probabilities of all complete paths starting with the initial state, constrained by x.

c)  The prefix probability P(S ⇒*_L x) is the sum of the probabilities of all paths 𝒫 starting with the initial state, constrained by x, that end in a scanned state.

Note that when summing over all paths "starting with the initial state," summation is actually over all paths starting with S, by definition of the initial state 0 → .S. (a) follows directly from our definitions of derivation probability, string probability, path probability, and the one-to-one correspondence between paths and derivations established by Lemma 1. (b) follows from (a) by using S as the start nonterminal. To obtain the prefix probability in (c), we need to sum the probabilities of all complete derivations that generate x as a prefix. The constrained paths ending in scanned states represent exactly the beginnings of all such derivations. Since the grammar is assumed to be consistent and without useless nonterminals, all partial derivations can be completed with probability one. Hence the sum over the constrained incomplete paths is the sought-after sum over all complete derivations generating the prefix.

4.3 Forward and Inner Probabilities
Since string and prefix probabilities are the result of summing derivation probabilities, the goal is to compute these sums efficiently by taking advantage of the Earley control structure. This can be accomplished by attaching two probabilistic quantities to each Earley state, as follows. The terminology is derived from analogous or similar quantities commonly used in the literature on Hidden Markov Models (HMMs) (Rabiner and Juang 1986) and in Baker (1979).

Definition 4
The following definitions are relative to an implied input string x.

a)  The forward probability α_i(kX → λ.μ) is the sum of the probabilities of all constrained paths of length i that end in state kX → λ.μ.
b)  The inner probability γ_i(kX → λ.μ) is the sum of the probabilities of all paths of length i − k that start in state k : kX → .λμ and end in i : kX → λ.μ, and generate the input symbols x_k ⋯ x_{i−1}.

It helps to interpret these quantities in terms of an unconstrained Earley parser that operates as a generator emitting, rather than recognizing, strings. Instead of tracking all possible derivations, the generator traces along a single Earley path randomly determined by always choosing among prediction steps according to the associated rule probabilities. Notice that the scanning and completion steps are deterministic once the rules have been chosen.

Intuitively, the forward probability α_i(kX → λ.μ) is the probability of an Earley generator producing the prefix of the input up to position i − 1 while passing through state kX → λ.μ at position i. However, due to left-recursion in productions the same state may appear several times on a path, and each occurrence is counted toward the total α_i. Thus, α_i is really the expected number of occurrences of the given state in state set i. Having said that, we will refer to α simply as a probability, both for the sake of brevity, and to keep the analogy to the HMM terminology of which this is a generalization. [Footnote 6: The same technical complication was noticed by Wright (1990) in the computation of probabilistic LR parser tables. The relation to LR parsing will be discussed in Section 6.2. Incidentally, a similar interpretation of forward "probabilities" is required for HMMs with non-emitting states.] Note that for scanned states, α is always a probability, since by definition a scanned state can occur only once along a path.

The inner probabilities, on the other hand, represent the probability of generating a substring of the input from a given nonterminal, using a particular production. Inner probabilities are thus conditional on the presence of a given nonterminal X with expansion starting at position k, unlike the forward probabilities, which include the generation history starting with the initial state. The inner probabilities as defined here correspond closely to the quantities of the same name in Baker (1979). The sum of γ of all states with a given LHS X is exactly Baker's inner probability for X.

The following is essentially a restatement of Lemma 2 in terms of forward and inner probabilities. It shows how to obtain the sentence and string probabilities we are interested in, provided that forward and inner probabilities can be computed effectively.

Lemma 3
The following assumes an Earley chart constructed by the parser on an input string x with |x| = l.

a)  Provided that S ⇒*_L x_{0...k−1} X ν is a possible left-most derivation of the grammar (for some ν), the probability that a nonterminal X generates the substring x_k ⋯ x_{i−1} can be computed as the sum

        P(X ⇒* x_{k...i−1}) = Σ_{i : kX → λ.} γ_i(kX → λ.)

    (sum of inner probabilities over all complete states with LHS X and start index k).

b)  In particular, the string probability P(S ⇒* x) can be computed as [Footnote 7: The definitions of forward and inner probabilities coincide for the final state.]

        P(S ⇒* x) = γ_l(0 → S.) = α_l(0 → S.).

c)  The prefix probability P(S ⇒*_L x), with |x| = l, can be computed as

        P(S ⇒*_L x) = Σ_{kX → λ x_{l−1}.μ} α_l(kX → λ x_{l−1}.μ)

    (sum of forward probabilities over all scanned states).
The restriction in (a) that X be preceded by a possible prefix is necessary, since the Earley parser at position i will only pursue derivations that are consistent with the input up to position i. This constitutes the main distinguishing feature of Earley parsing compared to the strict bottom-up computation used in the standard inside probability computation (Baker 1979). There, inside probabilities for all positions and nonterminals are computed, regardless of possible prefixes.

4.4 Computing Forward and Inner Probabilities
Forward and inner probabilities not only subsume the prefix and string probabilities, they are also straightforward to compute during a run of Earley's algorithm. In fact, if it weren't for left-recursive and unit productions their computation would be trivial. For the purpose of exposition we will therefore ignore the technical complications introduced by these productions for a moment, and then return to them once the overall picture has become clear.

During a run of the parser both forward and inner probabilities will be attached to each state, and updated incrementally as new states are created through one of the three types of transitions. Both probabilities are set to unity for the initial state 0 → .S. This is consistent with the interpretation that the initial state is derived from a dummy production → S for which no alternatives exist.

Parsing then proceeds as usual, with the probabilistic computations detailed below. The probabilities associated with new states will be computed as sums of various combinations of old probabilities. As new states are generated by prediction, scanning, and completion, certain probabilities have to be accumulated, corresponding to the multiple paths leading to a state. That is, if the same state is generated multiple times, the previous probability associated with it has to be incremented by the new contribution just computed. States and probability contributions can be generated in any order, as long as the summation for one state is finished before its probability enters into the computation of some successor state. Appendix B.2 suggests a way to implement this incremental summation.

Notation. A few intuitive abbreviations are used from here on to describe Earley transitions succinctly. (1) To avoid unwieldy Σ notation we adopt the following convention. The expression x += y means that x is computed incrementally as a sum of various y terms, which are computed in some order and accumulated to finally yield the value of x. [Footnote 8: This notation suggests a simple implementation, being obviously borrowed from the programming language C.] (2) Transitions are denoted by ⇒, with predecessor states on the left and successor states on the right. (3) The forward and inner probabilities of states are notated in brackets after each state, e.g.,

    i : kX → λ.Yμ  [α, γ]

is shorthand for α = α_i(kX → λ.Yμ), γ = γ_i(kX → λ.Yμ).

Prediction (probabilistic).

    i : kX → λ.Yμ  [α, γ]   ⇒   i : iY → .ν  [α', γ']

for all productions Y → ν. The new probabilities can be computed as

    α' += α · P(Y → ν)
    γ'  = P(Y → ν).

Note that only the forward probability is accumulated; γ is not used in this step.
Rationale. α' is the sum of all path probabilities leading up to kX → λ.Yμ, times the probability of choosing production Y → ν. The value γ' is just a special case of the definition.

Scanning (probabilistic).

    i : kX → λ.aμ  [α, γ]   ⇒   i + 1 : kX → λa.μ  [α', γ']

for all states with terminal a matching input at position i. Then

    α' = α
    γ' = γ.

Rationale. Scanning does not involve any new choices, since the terminal was already selected as part of the production during prediction. [Footnote 9: In different parsing scenarios the scanning step may well modify probabilities. For example, if the input symbols themselves have attached likelihoods, these can be integrated by multiplying them onto α and γ when a symbol is scanned. That way it is possible to perform efficient Earley parsing with integrated joint probability computation directly on weighted lattices describing ambiguous inputs.]

Completion (probabilistic).

    i : jY → ν.    [α'', γ'']
    j : kX → λ.Yμ  [α, γ]      ⇒   i : kX → λY.μ  [α', γ']

Then

    α' += α · γ''     (11)
    γ' += γ · γ''     (12)

Note that α'' is not used.

Rationale. To update the old forward/inner probabilities α and γ to α' and γ', respectively, the probabilities of all paths expanding Y → ν have to be factored in. These are exactly the paths summarized by the inner probability γ''.

4.5 Coping with Recursion
The standard Earley algorithm, together with the probability computations described in the previous section, would be sufficient if it weren't for the problem of recursion in the prediction and completion steps.

The nonprobabilistic Earley algorithm can stop recursing as soon as all predictions/completions yield states already contained in the current state set. For the computation of probabilities, however, this would mean truncating the probabilities resulting from the repeated summing of contributions.

4.5.1 Prediction loops. As an example, consider the following simple left-recursive SCFG:

    S → a   [p]
    S → Sb  [q],

where q = 1 − p. Nonprobabilistically, the prediction loop at position 0 would stop after producing the states

    0 → .S
    0S → .a
    0S → .Sb.

This would leave the forward probabilities at

    α₀(0S → .a)  = p
    α₀(0S → .Sb) = q,

corresponding to just two out of an infinity of possible paths. The correct forward probabilities are obtained as a sum of infinitely many terms, accounting for all possible paths of length 1:

    α₀(0S → .a)  = p + qp + q²p + ⋯ = p(1 − q)⁻¹ = 1
    α₀(0S → .Sb) = q + q² + q³ + ⋯ = q(1 − q)⁻¹ = p⁻¹q.

In these sums each p corresponds to a choice of the first production, each q to a choice of the second production. If we didn't care about finite computation the resulting geometric series could be computed by letting the prediction loop (and hence the summation) continue indefinitely.

Fortunately, all repeated prediction steps, including those due to left-recursion in the productions, can be collapsed into a single, modified prediction step, and the corresponding sums computed in closed form. For this purpose we need a probabilistic version of the well-known parsing concept of a left corner, which is also at the heart of the prefix probability algorithm of Jelinek and Lafferty (1991).
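The geometric series above are easy to check numerically. The following small sketch (with an illustrative value of p, chosen here for demonstration only) approximates the infinite sums by truncation and recovers the closed forms:

    p = 0.4
    q = 1 - p
    alpha_a  = sum(q**k * p for k in range(50))   # p + qp + q^2 p + ...
    alpha_Sb = sum(q**k * q for k in range(50))   # q + q^2 + q^3 + ...
    print(alpha_a, alpha_Sb)                      # ~1.0 and ~q/p, as computed above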
Definition 5
The following definitions are relative to a given SCFG G.

a)  Two nonterminals X and Y are said to be in a left-corner relation X →_L Y iff there exists a production for X that has a RHS starting with Y,

        X → Yλ.

b)  The probabilistic left-corner relation [Footnote 10: If a probabilistic relation R is replaced by its set-theoretic version R', i.e., (x, y) ∈ R' iff R(x, y) ≠ 0, then the closure operations used here reduce to their traditional discrete counterparts; hence the choice of terminology.] P_L = P_L(G) is the matrix of probabilities P(X →_L Y), defined as the total probability of choosing a production for X that has Y as a left corner:

        P(X →_L Y) = Σ_{X → Yλ ∈ G} P(X → Yλ).

c)  The relation X ⇒*_L Y is defined as the reflexive, transitive closure of X →_L Y, i.e., X ⇒*_L Y iff X = Y or there is a nonterminal Z such that X →_L Z and Z ⇒*_L Y.

d)  The probabilistic reflexive, transitive left-corner relation R_L = R_L(G) is a matrix of probability sums R(X ⇒*_L Y). Each R(X ⇒*_L Y) is defined as a series

        R(X ⇒*_L Y) = δ(X, Y) + P(X →_L Y)
                      + Σ_{Z₁} P(X →_L Z₁) P(Z₁ →_L Y)
                      + Σ_{Z₁,Z₂} P(X →_L Z₁) P(Z₁ →_L Z₂) P(Z₂ →_L Y)
                      + ⋯

    Alternatively, R_L is defined by the recurrence relation

        R(X ⇒*_L Y) = δ(X, Y) + Σ_Z P(X →_L Z) R(Z ⇒*_L Y),

    where we use the delta function, defined as δ(X, Y) = 1 if X = Y, and δ(X, Y) = 0 if X ≠ Y.

The recurrence for R_L can be conveniently written in matrix notation,

    R_L = I + P_L R_L,

from which the closed-form solution is derived:

    R_L = (I − P_L)⁻¹.

An existence proof for R_L is given in Appendix A. Appendix B.3.1 shows how to speed up the computation of R_L by inverting only a reduced version of the matrix I − P_L.

The significance of the matrix R_L for the Earley algorithm is that its elements are the sums of the probabilities of the potentially infinitely many prediction paths leading from a state kX → λ.Zμ to a predicted state iY → .ν, via any number of intermediate states.

R_L can be computed once for each grammar, and used for table-lookup in the following, modified prediction step.

Prediction (probabilistic, transitive).

    i : kX → λ.Zμ  [α, γ]   ⇒   i : iY → .ν  [α', γ']

for all productions Y → ν such that R(Z ⇒*_L Y) is nonzero. Then

    α' += α · R(Z ⇒*_L Y) P(Y → ν)     (13)
    γ'  = P(Y → ν).                     (14)

The new R(Z ⇒*_L Y) factor in the updated forward probability accounts for the sum of all path probabilities linking Z to Y. For Z = Y this covers the case of a single step of prediction; R(Y ⇒*_L Y) ≥ 1 always, since R_L is defined as a reflexive closure.
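Since R_L is just a matrix inverse, it can be computed with standard linear algebra. The following numpy sketch is our own illustration (not the paper's implementation), applied to the left-recursive grammar of Section 4.5.1 with an illustrative value of p; the resulting 1×1 matrix [p⁻¹] is exactly the left-corner sum that produced the forward probabilities computed there.

    import numpy as np

    p = 0.4
    q = 1 - p
    # P_L[i][j] = total probability of a production for nonterminal i whose RHS
    # starts with nonterminal j.  For S -> a [p], S -> Sb [q], only S -> Sb counts.
    P_L = np.array([[q]])
    R_L = np.linalg.inv(np.eye(1) - P_L)
    print(R_L)            # [[1/p]]: the left-corner sum 1 + q + q^2 + ...
    print(R_L[0, 0] * p)  # = 1    -> alpha_0(0S -> .a)
    print(R_L[0, 0] * q)  # = q/p  -> alpha_0(0S -> .Sb)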
4.5.2 Completion loops. As in prediction, the completion step in the Earley algorithm may imply an infinite summation, and could lead to an infinite loop if computed naively. However, only unit productions [Footnote 11: Unit productions are also called "chain productions" or "single productions" in the literature.] can give rise to cyclic completions.

The problem is best explained by studying an example. Consider the grammar

    S → a  [p]
    S → T  [q]
    T → S  [1],

where q = 1 − p. Presented with the input a (the only string the grammar generates), after one cycle of prediction, the Earley chart contains the following states:

    0 : 0 → .S      α = 1,           γ = 1
    0 : 0S → .T     α = p⁻¹q,        γ = q
    0 : 0T → .S     α = p⁻¹q,        γ = 1
    0 : 0S → .a     α = p⁻¹p = 1,    γ = p.

The p⁻¹ factors are a result of the left-corner sum 1 + q + q² + ⋯ = (1 − q)⁻¹.

After scanning 0S → .a, completion without truncation would enter an infinite loop. First 0T → .S is completed, yielding a complete state 0T → S., which allows 0S → .T to be completed, leading to another complete state for S, etc. The nonprobabilistic Earley parser can just stop here, but as in prediction, this would lead to truncated probabilities. The sum of probabilities that needs to be computed to arrive at the correct result contains infinitely many terms, one for each possible loop through the T → S production. Each such loop adds a factor of q to the forward and inner probabilities. The summations for all completed states turn out as

    1 : 0S → a.    α = 1,                               γ = p
    1 : 0T → S.    α = p⁻¹q(p + pq + pq² + ⋯) = p⁻¹q,   γ = p + pq + pq² + ⋯ = 1
    1 : 0 → S.     α = p + pq + pq² + ⋯ = 1,            γ = p + pq + pq² + ⋯ = 1
    1 : 0S → T.    α = p⁻¹q(p + pq + pq² + ⋯) = p⁻¹q,   γ = q(p + pq + pq² + ⋯) = q.

The approach taken here to compute exact probabilities in cyclic completions is mostly analogous to that for left-recursive predictions. The main difference is that unit productions, rather than left-corners, form the underlying transitive relation. Before proceeding we can convince ourselves that this is indeed the only case we have to worry about.

Lemma 4
Let

    k₁X₁ → λ₁X₂.  ⇒  k₂X₂ → λ₂X₃.  ⇒  ⋯  ⇒  k_cX_c → λ_cX_{c+1}.

be a completion cycle, i.e., k₁ = k_c, X₁ = X_c, λ₁ = λ_c, X₂ = X_{c+1}. Then it must be the case that λ₁ = λ₂ = ⋯ = λ_c = ε, i.e., all productions involved are unit productions X₁ → X₂, ..., X_c → X_{c+1}.

Proof
For all completion chains it is true that the start indices of the states are monotonically nondecreasing, k₁ ≤ k₂ ≤ ⋯ (a state can only complete an expansion that started at the same or a previous position). From k₁ = k_c, it follows that k₁ = k₂ = ⋯ = k_c. Because the current position (dot) also refers to the same input index in all states, all nonterminals X₁, X₂, ..., X_c have been expanded into the same substring of the input between k₁ and the current position. By assumption the grammar contains no nonterminals that generate ε, [Footnote 12: Even with null productions, these would not be used for Earley transitions; see Section 4.7.] therefore we must have λ₁ = λ₂ = ⋯ = λ_c = ε, q.e.d.

We now formally define the relation between nonterminals mediated by unit productions, analogous to the left-corner relation.

Definition 6
The following definitions are relative to a given SCFG G.

a)  Two nonterminals X and Y are said to be in a unit-production relation X → Y iff there exists a production for X that has Y as its RHS.
b)  The probabilistic unit-production relation P_U = P_U(G) is the matrix of probabilities P(X → Y).

c)  The relation X ⇒* Y is defined as the reflexive, transitive closure of X → Y, i.e., X ⇒* Y iff X = Y or there is a nonterminal Z such that X → Z and Z ⇒* Y.

d)  The probabilistic reflexive, transitive unit-production relation R_U = R_U(G) is the matrix of probability sums R(X ⇒* Y). Each R(X ⇒* Y) is defined as a series

        R(X ⇒* Y) = δ(X, Y) + P(X → Y)
                    + Σ_{Z₁} P(X → Z₁) P(Z₁ → Y)
                    + Σ_{Z₁,Z₂} P(X → Z₁) P(Z₁ → Z₂) P(Z₂ → Y)
                    + ⋯
                  = δ(X, Y) + Σ_Z P(X → Z) R(Z ⇒* Y).

As before, a matrix inversion can compute the relation R_U in closed form:

    R_U = (I − P_U)⁻¹.

The existence of R_U is shown in Appendix A.

The modified completion loop in the probabilistic Earley parser can now use the R_U matrix to collapse all unit completions into a single step. Note that we still have to do iterative completion on non-unit productions.

Completion (probabilistic, transitive).

    i : jY → ν.    [α'', γ'']
    j : kX → λ.Zμ  [α, γ]      ⇒   i : kX → λZ.μ  [α', γ']

for all Y, Z such that R(Z ⇒* Y) is nonzero, and Y → ν is not a unit production (|ν| > 1 or ν ∈ Σ). Then

    α' += α · γ'' R(Z ⇒* Y)
    γ' += γ · γ'' R(Z ⇒* Y).

4.6 An Example
Consider the grammar

    S → a    [p]
    S → S S  [q],

where q = 1 − p. This highly ambiguous grammar generates strings of any number of a's, using all possible binary parse trees over the given number of terminals. The states involved in parsing the string aaa are listed in Table 2, along with their forward and inner probabilities. The example illustrates how the parser deals with left-recursion and the merging of alternative sub-parses during completion.

Since the grammar has only a single nonterminal, the left-corner matrix P_L has rank 1:

    P_L = [q].

Its transitive closure is

    R_L = (I − P_L)⁻¹ = [(1 − q)⁻¹] = [p⁻¹].

Consequently, the example trace shows the factor p⁻¹ being introduced into the forward probability terms in the prediction steps.

The sample string can be parsed as either (a(aa)) or ((aa)a), each parse having a probability of p³q². The total string probability is thus 2p³q², the computed α and γ values for the final state. The α values for the scanned states in sets 1, 2, and 3 are the prefix probabilities for a, aa, and aaa, respectively: P(S ⇒*_L a) = 1, P(S ⇒*_L aa) = q, P(S ⇒*_L aaa) = (1 + p)q².

Table 2
Earley chart as constructed during the parse of aaa with the grammar in (a). The two columns to the right in (b) list the forward and inner probabilities, respectively, for each state. In both α and γ columns, the · separates old factors from new ones (as per equations 11, 12, and 13). Addition indicates multiple derivations of the same state.

(a)   S → a    [p]
      S → S S  [q]

(b)
State set 0
    0 → .S            α = 1                                 γ = 1
  predicted
    0S → .a           α = 1 · p⁻¹p = 1                      γ = p
    0S → .S S         α = 1 · p⁻¹q = p⁻¹q                   γ = q

State set 1
  scanned
    0S → a.           α = 1                                 γ = p
  completed
    0S → S.S          α = p⁻¹q · p = q                      γ = q · p = pq
  predicted
    1S → .a           α = q · p⁻¹p = q                      γ = p
    1S → .S S         α = q · p⁻¹q = p⁻¹q²                  γ = q

State set 2
  scanned
    1S → a.           α = q                                 γ = p
  completed
    1S → S.S          α = p⁻¹q² · p = q²                    γ = q · p = pq
    0S → S S.         α = q · p = pq                        γ = pq · p = p²q
    0S → S.S          α = p⁻¹q · p²q = pq²                  γ = q · p²q = p²q²
    0 → S.            α = 1 · p²q = p²q                     γ = 1 · p²q = p²q
  predicted
    2S → .a           α = (q² + pq²) · p⁻¹p = (1 + p)q²     γ = p
    2S → .S S         α = (q² + pq²) · p⁻¹q = (1 + p⁻¹)q³   γ = q

State set 3
  scanned
    2S → a.           α = (1 + p)q²                         γ = p
  completed
    2S → S.S          α = (1 + p⁻¹)q³ · p = (1 + p)q³       γ = q · p = pq
    1S → S S.         α = q² · p = pq²                      γ = pq · p = p²q
    1S → S.S          α = p⁻¹q² · p²q = pq³                 γ = q · p²q = p²q²
    0S → S S.         α = pq² · p + q · p²q = 2p²q²         γ = p²q² · p + pq · p²q = 2p³q²
    0S → S.S          α = p⁻¹q · 2p³q² = 2p²q³              γ = q · 2p³q² = 2p³q³
    0 → S.            α = 1 · 2p³q² = 2p³q²                 γ = 1 · 2p³q² = 2p³q²
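The closed forms quoted above can also be checked against the definitions by brute force. The sketch below is our own check, not part of the paper: it uses the standard observation that the number of binary parse trees over n terminals is the Catalan number C_{n−1}, each tree using S → S S exactly n − 1 times and S → a exactly n times, so P(S ⇒* aⁿ) = C_{n−1} pⁿ qⁿ⁻¹; summing over n ≥ 2 approximates the prefix probability of aa. The value of p is illustrative (and chosen ≥ 1/2 so the grammar is consistent).

    from math import comb

    p = 0.7
    q = 1 - p

    def catalan(m):
        return comb(2 * m, m) // (m + 1)

    string_prob_aaa = catalan(2) * p**3 * q**2      # = 2 p^3 q^2, the final-state alpha = gamma
    prefix_prob_aa = sum(catalan(n - 1) * p**n * q**(n - 1) for n in range(2, 200))
    print(string_prob_aaa)
    print(prefix_prob_aa, q)                        # the brute-force sum approaches q, as claimed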
4.7 Null Productions
Null productions X → ε introduce some complications into the relatively straightforward parser operation described so far, some of which are due specifically to the probabilistic aspects of parsing. This section summarizes the necessary modifications to process null productions correctly, using the previous description as a baseline. Our treatment of null productions follows the (nonprobabilistic) formulation of Graham, Harrison, and Ruzzo (1980), rather than the original one in Earley (1970).

4.7.1 Computing ε-expansion probabilities. The main problem with null productions is that they allow multiple prediction-completion cycles in between scanning steps (since null productions do not have to be matched against one or more input symbols). Our strategy will be to collapse all predictions and completions due to chains of null productions into the regular prediction and completion steps, not unlike the way recursive predictions/completions were handled in Section 4.5.

A prerequisite for this approach is to precompute, for all nonterminals X, the probability that X expands to the empty string. Note that this is another recursive problem, since X itself may not have a null production, but expand to some nonterminal Y that does.

Computation of P(X ⇒* ε) for all X can be cast as a system of non-linear equations, as follows. For each X, let e_X be an abbreviation for P(X ⇒* ε). For example, let X have productions

    X → ε          [p₁]
    X → Y₁ Y₂      [p₂]
    X → Y₃ Y₄ Y₅   [p₃]
    ⋯

The semantics of context-free rules imply that X can only expand to ε if all the RHS nonterminals in one of X's productions expand to ε. Translating to probabilities, we obtain the equation

    e_X = p₁ + p₂ e_{Y₁} e_{Y₂} + p₃ e_{Y₃} e_{Y₄} e_{Y₅} + ⋯

In other words, each production contributes a term in which the rule probability is multiplied by the product of the e variables corresponding to the RHS nonterminals, unless the RHS contains a terminal (in which case the production contributes nothing to e_X because it cannot possibly lead to ε).

The resulting nonlinear system can be solved by iterative approximation, as sketched below. Each variable e_X is initialized to P(X → ε), and then repeatedly updated by substituting in the equation right-hand sides, until the desired level of accuracy is attained. Convergence is guaranteed, since the e_X values are monotonically increasing and bounded above by the true values P(X ⇒* ε) ≤ 1. For grammars without cyclic dependencies among ε-producing nonterminals, this procedure degenerates to simple backward substitution. Obviously the system has to be solved only once for each grammar.
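A minimal sketch of this fixed-point iteration follows. The dictionary grammar encoding (LHS mapped to (RHS, probability) pairs), the helper name, and the example rule probabilities are our own assumptions for illustration; nonterminals are the dictionary keys, and every other symbol is treated as a terminal.

    def epsilon_probs(rules, iterations=50):
        # Initialize e_X to P(X -> eps), then repeatedly substitute the RHS products.
        e = {X: sum(prob for rhs, prob in prods if len(rhs) == 0)
             for X, prods in rules.items()}
        for _ in range(iterations):
            for X, prods in rules.items():
                total = 0.0
                for rhs, prob in prods:
                    if all(sym in rules for sym in rhs):   # an RHS with a terminal contributes 0
                        term = prob
                        for sym in rhs:
                            term *= e[sym]                 # empty product = 1 for the eps rule
                        total += term
                e[X] = total
        return e

    # Example with a cyclic dependency (illustrative probabilities):
    #   X -> eps [0.1] | Y Y [0.4] | a [0.5],   Y -> eps [0.3] | X [0.7]
    rules = {
        "X": [((), 0.1), (("Y", "Y"), 0.4), (("a",), 0.5)],
        "Y": [((), 0.3), (("X",), 0.7)],
    }
    print(epsilon_probs(rules))    # converges to the fixed point (e_X, e_Y)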
The probability e_X can be seen as the precomputed inner probability of an expansion of X to the empty string; i.e., it sums the probabilities of all Earley paths that derive ε from X. This is the justification for the way these probabilities can be used in modified prediction and completion steps, described next.

4.7.2 Prediction with null productions. Prediction is mediated by the left-corner relation. For each X occurring to the right of a dot, we generate states for all Y that are reachable from X by way of the X →_L Y relation. This reachability criterion has to be extended in the presence of null productions. Specifically, if X has a production X → Y₁ ⋯ Y_{i−1} Y_i λ then Y_i is a left corner of X iff Y₁, ..., Y_{i−1} all have a nonzero probability of expanding to ε. The contribution of such a production to the left-corner probability P(X →_L Y_i) is

    P(X → Y₁ ⋯ Y_{i−1} Y_i λ) ∏_{k=1}^{i−1} e_{Y_k}.

The old prediction procedure can now be modified in two steps. First, replace the old P_L relation by the one that takes into account null productions, as sketched above. From the resulting P_L compute the reflexive transitive closure R_L, and use it to generate predictions as before.

Second, when predicting a left corner Y with a production Y → Y₁ ⋯ Y_{i−1} Y_i λ, add states for all dot positions up to the first RHS nonterminal that cannot expand to ε, say from Y → .Y₁ ⋯ Y_{i−1} Y_i λ through Y → Y₁ ⋯ Y_{i−1}.Y_i λ. We will call this procedure "spontaneous dot shifting." It accounts precisely for those derivations that expand the RHS prefix Y₁ ⋯ Y_{i−1} without consuming any of the input symbols.

The forward and inner probabilities of the states thus created are those of the first state Y → .Y₁ ⋯ Y_{i−1} Y_i λ, multiplied by factors that account for the implied ε-expansions. This factor is just the product ∏_{k=1}^{j} e_{Y_k}, where j is the dot position.

4.7.3 Completion with null productions. Modification of the completion step follows a similar pattern. First, the unit production relation has to be extended to allow for unit production chains due to null productions. A rule X → Y₁ ⋯ Y_{i−1} Y_i Y_{i+1} ⋯ Y_j can effectively act as a unit production that links X and Y_i if all other nonterminals on the RHS can expand to ε. Its contribution to the unit production relation P(X → Y_i) will then be

    P(X → Y₁ ⋯ Y_{i−1} Y_i Y_{i+1} ⋯ Y_j) ∏_{k ≠ i} e_{Y_k}.

From the resulting revised P_U matrix we compute the closure R_U as usual.

The second modification is another instance of spontaneous dot shifting. When completing a state X → λ.Yμ and moving the dot to get X → λY.μ, additional states have to be added, obtained by moving the dot further over any nonterminals in μ that have nonzero ε-expansion probability. As in prediction, forward and inner probabilities are multiplied by the corresponding ε-expansion probabilities.

4.7.4 Eliminating null productions. Given these added complications one might consider simply eliminating all ε-productions in a preprocessing step.
This is mostly straightforward and analogous to the corresponding procedure for nonprobabilistic CFGs (Aho and Ullman 1972, Algorithm 2.10). The main difference is the updating of rule probabilities, for which the ε-expansion probabilities are again needed.

1.  Delete all null productions, except on the start symbol (in case the grammar as a whole produces ε with nonzero probability). Scale the remaining production probabilities to sum to unity.

2.  For each original rule X → λYμ that contains a nonterminal Y such that Y ⇒* ε:

    (a)  Create a variant rule X → λμ.
    (b)  Set the rule probability of the new rule to e_Y P(X → λYμ). If the rule X → λμ already exists, sum the probabilities.
    (c)  Decrement the old rule probability by the same amount.

    Iterate these steps for all RHS occurrences of a null-able nonterminal.

The crucial step in this procedure is the addition of variants of the original productions that simulate the null productions by deleting the corresponding nonterminals from the RHS. The spontaneous dot shifting described in the previous sections effectively performs the same operation on the fly as the rules are used in prediction and completion.

4.8 Complexity Issues
The probabilistic extension of Earley's parser preserves the original control structure in most aspects, the major exception being the collapsing of cyclic predictions and unit completions, which can only make these steps more efficient. Therefore the complexity analysis from Earley (1970) applies, and we only summarize the most important results here.

The worst-case complexity for Earley's parser is dominated by the completion step, which takes O(l²) for each input position, l being the length of the current prefix. The total time is therefore O(l³) for an input of length l, which is also the complexity of the standard Inside/Outside (Baker 1979) and LRI (Jelinek and Lafferty 1991) algorithms. For grammars of bounded ambiguity, the incremental per-word cost reduces to O(l), O(l²) total. For deterministic CFGs the incremental cost is constant, O(l) total. Because of the possible start indices each state set can contain O(l) Earley states, giving O(l²) worst-case space complexity overall.

Apart from input length, complexity is also determined by grammar size. We will not try to give a precise characterization in the case of sparse grammars (Appendix B.3 gives some hints on how to implement the algorithm efficiently for such grammars). However, for fully parameterized grammars in CNF we can verify the scaling of the algorithm in terms of the number of nonterminals n, and verify that it has the same O(n³) time and space requirements as the Inside/Outside (I/O) and LRI algorithms.

The completion step again dominates the computation, which has to compute probabilities for at most O(n³) states. By organizing summations (11) and (12) so that the γ'' are first summed by LHS nonterminals, the entire completion operation can be accomplished in O(n³). The one-time cost for the matrix inversions to compute the left-corner and unit production relation matrices is also O(n³).

5. Extensions

This section discusses extensions to the Earley algorithm that go beyond simple parsing and the computation of prefix and string probabilities.
4.8 Complexity Issues
The probabilistic extension of Earley's parser preserves the original control structure in most aspects, the major exception being the collapsing of cyclic predictions and unit completions, which can only make these steps more efficient. Therefore the complexity analysis from Earley (1970) applies, and we only summarize the most important results here.

The worst-case complexity for Earley's parser is dominated by the completion step, which takes O(l²) for each input position, l being the length of the current prefix. The total time is therefore O(l³) for an input of length l, which is also the complexity of the standard Inside/Outside (Baker 1979) and LRI (Jelinek and Lafferty 1991) algorithms. For grammars of bounded ambiguity, the incremental per-word cost reduces to O(l), O(l²) total. For deterministic CFGs the incremental cost is constant, O(l) total. Because of the possible start indices each state set can contain O(l) Earley states, giving O(l²) worst-case space complexity overall.

Apart from input length, complexity is also determined by grammar size. We will not try to give a precise characterization in the case of sparse grammars (Appendix B.3 gives some hints on how to implement the algorithm efficiently for such grammars). However, for fully parameterized grammars in CNF we can verify the scaling of the algorithm in terms of the number of nonterminals n, and verify that it has the same O(n³) time and space requirements as the Inside/Outside (I/O) and LRI algorithms.

The completion step again dominates the computation, which has to compute probabilities for at most O(n³) states. By organizing summations (11) and (12) so that the γ″ terms are first summed by LHS nonterminals, the entire completion operation can be accomplished in O(n³). The one-time cost for the matrix inversions to compute the left-corner and unit production relation matrices is also O(n³).

5. Extensions

This section discusses extensions to the Earley algorithm that go beyond simple parsing and the computation of prefix and string probabilities. These extensions are all quite straightforward and well supported by the original Earley chart structure, which leads us to view them as part of a single, unified algorithm for solving the tasks mentioned in the introduction.

5.1 Viterbi Parses

Definition 7
A Viterbi parse for a string x, in a grammar G, is a left-most derivation that assigns maximal probability to x, among all possible derivations for x.

Both the definition of Viterbi parse and its computation are straightforward generalizations of the corresponding notion for Hidden Markov Models (Rabiner and Juang 1986), where one computes the Viterbi path (state sequence) through an HMM. Precisely the same approach can be used in the Earley parser, using the fact that each derivation corresponds to a path.

The standard computational technique for Viterbi parses is applicable here. Wherever the original parsing procedure sums probabilities that correspond to alternative derivations of a grammatical entity, the summation is replaced by a maximization. Thus, during the forward pass each state must keep track of the maximal path probability leading to it, as well as the predecessor states associated with that maximum probability path. Once the final state is reached, the maximum probability parse can be recovered by tracing back the path of "best" predecessor states.

The following modifications to the probabilistic Earley parser implement the forward phase of the Viterbi computation.

• Each state computes an additional probability, its Viterbi probability v.

• Viterbi probabilities are propagated in the same way as inner probabilities, except that during completion the summation is replaced by maximization: vi(kX → λY.μ) is the maximum of all products vi(jY → ν.) vj(kX → λ.Yμ) that contribute to the completed state kX → λY.μ. The same-position predecessor jY → ν. associated with the maximum is recorded as the Viterbi path predecessor of kX → λY.μ (the other predecessor state kX → λ.Yμ can be inferred).

• The completion step uses the original recursion without collapsing unit production loops. Loops are simply avoided, since they can only lower a path's probability. Collapsing unit production completions has to be avoided to maintain a continuous chain of predecessors for later backtracing and parse construction.

• The prediction step does not need to be modified for the Viterbi computation.

Once the final state is reached, a recursive procedure can recover the parse tree associated with the Viterbi parse. This procedure takes an Earley state i: kX → λ.μ as input and produces the Viterbi parse for the substring between k and i as output. (If the input state is not complete (μ ≠ ε), the result will be a partial parse tree with children missing from the root node.)

Viterbi-parse(i: kX → λ.μ):

1. If λ = ε, return a parse tree with root labeled X and no children.

2. Otherwise, if λ ends in a terminal a, let λ′a = λ, and call this procedure recursively to obtain the parse tree

       T = Viterbi-parse(i − 1: kX → λ′.aμ)

   Adjoin a leaf node labeled a as the right-most child to the root of T, and return T.

3. Otherwise, if λ ends in a nonterminal Y, let λ′Y = λ. Find the Viterbi predecessor state jY → ν. for the current state. Call this procedure recursively to compute

       T = Viterbi-parse(j: kX → λ′.Yμ)

   as well as

       T′ = Viterbi-parse(i: jY → ν.)

   Adjoin T′ to T as the right-most child at the root, and return T.
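The backtrace just described can be rendered compactly in Python. Everything below is a sketch under stated assumptions: the chart is assumed to be indexable by (position, start, lhs, before-dot, after-dot) tuples, states are assumed to carry the recorded Viterbi predecessor, and the lowercase-terminal convention is ours, not the paper's.

```python
from dataclasses import dataclass, field

@dataclass
class Tree:
    label: str
    children: list = field(default_factory=list)

def is_terminal(symbol):
    return symbol.islower()        # illustrative convention only

def viterbi_parse(state, chart):
    """Recover the Viterbi parse for an Earley state i: kX -> lambda.mu (sketch).

    `state` is assumed to expose .lhs, .before_dot, .after_dot, .position (i),
    .start (k), and .viterbi_pred (the recorded same-position predecessor
    jY -> nu.).
    """
    if not state.before_dot:                       # case 1: lambda = epsilon
        return Tree(state.lhs)
    *prefix, last = state.before_dot
    key_tail = (state.start, state.lhs, tuple(prefix), (last,) + state.after_dot)
    if is_terminal(last):                          # case 2: lambda ends in terminal a
        prev = chart[(state.position - 1,) + key_tail]
        t = viterbi_parse(prev, chart)
        t.children.append(Tree(last))              # adjoin the leaf a
        return t
    pred = state.viterbi_pred                      # case 3: predecessor jY -> nu.
    prev = chart[(pred.start,) + key_tail]         # kX -> lambda'.Ymu at position j
    t = viterbi_parse(prev, chart)
    t.children.append(viterbi_parse(pred, chart))  # adjoin T' as right-most child
    return t
```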
5.2 Rule Probability Estimation
The rule probabilities in a SCFG can be iteratively estimated using the EM (Expectation-Maximization) algorithm (Dempster et al. 1977). Given a sample corpus D, the estimation procedure finds a set of parameters that represent a local maximum of the grammar likelihood function P(D | G), which is given by the product of the string probabilities

    P(D | G) = ∏_{x ∈ D} P(S ⇒ x),

i.e., the samples are assumed to be distributed identically and independently. The two steps of this algorithm can be briefly characterized as follows.

E-step: Compute expectations for how often each grammar rule is used, given the corpus D and the current grammar parameters (rule probabilities).

M-step: Reset the parameters so as to maximize the likelihood relative to the expected rule counts found in the E-step.

This procedure is iterated until the parameter values (as well as the likelihood) converge. It can be shown that each round of the algorithm produces a likelihood that is at least as high as the previous one; the EM algorithm is therefore guaranteed to find at least a local maximum of the likelihood function.

EM is a generalization of the well-known Baum-Welch algorithm for HMM estimation (Baum et al. 1970); the original formulation for the case of SCFGs is attributable to Baker (1979). For SCFGs, the E-step involves computing the expected number of times each production is applied in generating the training corpus. After that, the M-step consists of a simple normalization of these counts to yield the new production probabilities.

In this section we examine the computation of production count expectations required for the E-step. The crucial notion introduced by Baker (1979) for this purpose is the "outer probability" of a nonterminal, or the joint probability that the nonterminal is generated with a given prefix and suffix of terminals. Essentially the same method can be used in the Earley framework, after extending the definition of outer probabilities to apply to arbitrary Earley states.
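The overall estimation procedure is then a standard EM loop. The sketch below is not the paper's implementation: it assumes a grammar object with a rule list and a probability setter, and a helper expected_rule_counts that returns the per-rule expectations discussed in Section 5.2.1; all of these names are hypothetical.

```python
from collections import defaultdict

def estimate_rule_probabilities(grammar, corpus, iterations=20):
    """EM loop for SCFG rule probabilities (illustrative sketch)."""
    for _ in range(iterations):
        counts = defaultdict(float)
        for x in corpus:                                  # E-step: expected rule uses
            for rule, c in expected_rule_counts(grammar, x).items():
                counts[rule] += c
        totals = defaultdict(float)                       # M-step: normalize per LHS
        for (lhs, _), c in counts.items():
            totals[lhs] += c
        for lhs, rhs in grammar.rules:
            if totals[lhs] > 0.0:
                grammar.set_prob(lhs, rhs, counts[(lhs, rhs)] / totals[lhs])
    return grammar
```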
Definition 8
Given a string x, |x| = l, the outer probability βi(kX → λ.μ) of an Earley state is the sum of the probabilities of all paths that

1. start with the initial state,
2. generate the prefix x0 ... xk−1,
3. pass through kX → .νμ, for some ν,
4. generate the suffix xi ... xl−1 starting with state kX → ν.μ,
5. end in the final state.

Outer probabilities complement inner probabilities in that they refer precisely to those parts of complete paths generating x not covered by the corresponding inner probability γi(kX → λ.μ). Therefore, the choice of the production X → λμ is not part of the outer probability associated with a state kX → λ.μ. In fact, the definition makes no reference to the first part λ of the RHS: all states sharing the same k, X, and μ will have identical outer probabilities.

Intuitively, βi(kX → λ.μ) is the probability that an Earley parser operating as a string generator yields the prefix x0...k−1 and the suffix xi...l−1, while passing through state kX → λ.μ at position i (which is independent of λ). As was the case for forward probabilities, β is actually an expectation of the number of such states in the path, as unit production cycles can result in multiple occurrences for a single state. Again, we gloss over this technicality in our terminology. The name is motivated by the fact that β reduces to the "outer probability" of X, as defined in Baker (1979), if the dot is in final position.

5.2.1 Computing expected production counts. Before going into the details of computing outer probabilities, we describe their use in obtaining the expected rule counts needed for the E-step in grammar estimation.

Let c(X → λ | x) denote the expected number of uses of production X → λ in the derivation of string x. Alternatively, c(X → λ | x) is the expected number of times that X → λ is used for prediction in a complete Earley path generating x. Let c(X → λ | 𝒫) be the number of occurrences of predicted states based on production X → λ along a path 𝒫. Then

    c(X → λ | x) = Σ_{𝒫 derives x} P(𝒫 | S ⇒ x) c(X → λ | 𝒫)
                 = (1 / P(S ⇒ x)) Σ_{𝒫 derives x} P(𝒫, S ⇒ x) c(X → λ | 𝒫)
                 = (1 / P(S ⇒ x)) Σ_{i: iX → .λ} P(S ⇒ x0...i−1 Xν ⇒ x).

The last summation is over all predicted states based on production X → λ. The quantity P(S ⇒ x0...i−1 Xν ⇒ x) is the sum of the probabilities of all paths passing through i: iX → .λ. Inner and outer probabilities have been defined such that this quantity is obtained precisely as the product of the corresponding γi and βi. Thus, the expected usage count for a rule can be computed as

    c(X → λ | x) = (1 / P(S ⇒ x)) Σ_{i: iX → .λ} γi(iX → .λ) βi(iX → .λ).

The sum can be computed after completing both forward and backward passes (or during the backward pass itself) by scanning the chart for predicted states.
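In a chart implementation this amounts to one sweep over the predicted states, as in the following sketch. The field names .inner and .outer, the use of dot position 0 to mark predicted states, and the chart layout are assumptions, not definitions from the paper.

```python
from collections import defaultdict

def expected_rule_counts_from_chart(chart, string_prob):
    """Accumulate c(X -> lambda | x) as gamma*beta products of predicted states."""
    counts = defaultdict(float)
    for state_set in chart:                    # one state set per input position i
        for state in state_set:
            if state.dot == 0:                 # predicted state  i: iX -> .lambda
                counts[(state.lhs, state.rhs)] += state.inner * state.outer
    return {rule: c / string_prob for rule, c in counts.items()}
```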
5.2.2 Computing outer probabilities. The outer probabilities are computed by tracing the complete paths from the final state to the start state, in a single backward pass over the Earley chart. Only completion and scanning steps need to be traced back. Reverse scanning leaves outer probabilities unchanged, so the only operation of concern is reverse completion.

We describe reverse transitions using the same notation as for their forward counterparts, annotating each state with its outer and inner probabilities.

Reverse completion.

    i: jY → ν.      [β″, γ″]
    i: kX → λY.μ    [β, γ]      ⇒    j: kX → λ.Yμ    [β′, γ′]

for all pairs of states jY → ν. and kX → λ.Yμ in the chart. Then

    β′ += γ″ β
    β″ += γ′ β

The inner probability γ is not used.

Rationale. Relative to β, β′ is missing the probability of expanding Y, which is filled in from γ″. The probability of the surrounding of Y (β″) is the probability of the surrounding of X (β), plus the choice of the rule of production for X and the expansion of the partial RHS λ, which are together given by γ′.

Note that the computation makes use of the inner probabilities computed in the forward pass. The particular way in which γ and β were defined turns out to be convenient here, as no reference to the production probabilities themselves needs to be made in the computation.

As in the forward pass, simple reverse completion would not terminate in the presence of cyclic unit productions. A version that collapses all such chains of productions is given below.

Reverse completion (transitive).

    i: jY → ν.      [β″, γ″]
    i: kX → λZ.μ    [β, γ]      ⇒    j: kX → λ.Zμ    [β′, γ′]

for all pairs of states jY → ν. and kX → λ.Zμ in the chart, such that the unit production relation R(Z ⇒ Y) is nonzero. Then

    β′ += γ″ β
    β″ += γ′ β R(Z ⇒ Y)

The first summation is carried out once for each state j: kX → λ.Zμ, whereas the second summation is applied for each choice of Z, but only if X → λZμ is not itself a unit production, i.e., λμ ≠ ε.

Rationale. This increments β″ the equivalent of R(Z ⇒ Y) times, accounting for the infinity of surroundings in which Y can occur if it can be derived through cyclic productions. Note that the computation of β′ is unchanged, since γ″ already includes an infinity of cyclically generated subtrees for Y, where appropriate.
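A backward pass implementing the transitive reverse completion might look as follows. The state fields (.inner, .outer, .dot_symbol, .is_complete, .is_unit_production), the helper advanced_state for locating the completed state i: kX → λZ.μ, and the iteration order within a state set are all implementation assumptions; the sketch is meant to illustrate the two updates above, not to reproduce the paper's exact bookkeeping.

```python
def reverse_completion_pass(chart, unit_closure):
    """Propagate outer probabilities beta backward over the chart (sketch)."""
    for i in reversed(range(len(chart))):
        for done in chart[i]:                       # i: jY -> nu.   [beta'', gamma'']
            if not done.is_complete():
                continue
            j, Y = done.start, done.lhs
            for parent in chart[j]:                 # j: kX -> lambda.Zmu  [beta', gamma']
                Z = parent.dot_symbol()
                if Z is None:
                    continue
                r = unit_closure.get((Z, Y), 0.0)   # R(Z => Y)
                if r == 0.0:
                    continue
                advanced = advanced_state(chart[i], parent)   # i: kX -> lambdaZ.mu  [beta, gamma]
                parent.outer += done.inner * advanced.outer   # beta'  += gamma'' * beta
                if not parent.is_unit_production():
                    done.outer += parent.inner * advanced.outer * r   # beta'' += gamma' * beta * R
    return chart
```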
5.3 Parsing Bracketed Inputs
The estimation procedure described above (like EM-based estimators in general) is only guaranteed to find locally optimal parameter estimates. Unfortunately, it seems that in the case of unconstrained SCFG estimation local maxima present a very real problem, and make success dependent on chance and initial conditions (Lari and Young 1990). Pereira and Schabes (1992) showed that partially bracketed input samples can alleviate the problem in certain cases. The bracketing information constrains the parse of the inputs, and therefore the parameter estimates, steering them clear of some of the suboptimal solutions that could otherwise be found.

An Earley parser can be minimally modified to take advantage of bracketed strings by invoking itself recursively when a left parenthesis is encountered. The recursive instance of the parser is passed any predicted states at that position, processes the input up to the matching right parenthesis, and hands complete states back to the invoking instance. This technique is efficient, as it never explicitly rejects parses not consistent with the bracketing. It is also convenient, as it leaves the basic parser operations, including the left-to-right processing and the probabilistic computations, unchanged. For example, prefix probabilities conditioned on partial bracketings could be computed easily this way.

Parsing bracketed inputs is described in more detail in Stolcke (1993), where it is also shown that bracketing gives the expected improved efficiency. For example, the modified Earley parser processes fully bracketed inputs in linear time.

5.4 Robust Parsing
In many applications ungrammatical input has to be dealt with in some way. Traditionally it has been seen as a drawback of top-down parsing algorithms such as Earley's that they sacrifice "robustness," i.e., the ability to find partial parses in an ungrammatical input, for the efficiency gained from top-down prediction (Magerman and Weir 1992).

One approach to the problem is to build robustness into the grammar itself. In the simplest case one could add top-level productions

    S → X S

where X can expand to any nonterminal, including an "unknown word" category. This grammar will cause the Earley parser to find all partial parses of substrings, effectively behaving like a bottom-up parser constructing the chart in left-to-right fashion. More refined variations are possible: the top-level productions could be used to model which phrasal categories (sentence fragments) can likely follow each other. This probabilistic information can then be used in a pruning version of the Earley parser (Section 6.1) to arrive at a compromise between robust and expectation-driven parsing.

An alternative method for making Earley parsing more robust is to modify the parser itself so as to accept arbitrary input and find all, or a chosen subset of, possible substring parses. In the case of Earley's parser there is a simple extension to accomplish just that, based on the notion of a wildcard state

    i: k → λ.?

where the wildcard ? stands for an arbitrary continuation of the RHS.

During prediction, a wildcard to the right of the dot causes the chart to be seeded with dummy states → .X for each phrasal category X of interest. Conversely, a minimal modification to the standard completion step allows the wildcard states to collect all abutting substring parses:

    i: jY → μ.
    j: k → λ.?      ⇒    i: k → λY.?

for all Y. This way each partial parse will be represented by exactly one wildcard state in the final chart position.

A detailed account of this technique is given in Stolcke (1993). One advantage over the grammar-modifying approach is that it can be tailored to use various criteria at runtime to decide which partial parses to follow.
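The extra completion step for wildcard states can be sketched as follows. The State constructor, the field names, and the use of the string "?" to mark the wildcard are assumptions of this illustration rather than part of the published algorithm.

```python
def complete_wildcards(chart, i):
    """Let wildcard states absorb complete states that abut them (Section 5.4)."""
    for done in list(chart[i]):                   # complete states  i: jY -> mu.
        if not done.is_complete():
            continue
        for wild in chart[done.start]:            # wildcard states  j: k -> lambda.?
            if wild.after_dot == ("?",):
                chart[i].add(State(lhs=wild.lhs,
                                   before_dot=wild.before_dot + (done.lhs,),
                                   after_dot=("?",),
                                   start=wild.start,
                                   position=i))
```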
6. Discussion

6.1 Online Pruning
In finite-state parsing (especially speech decoding) one often makes use of the forward probabilities for pruning partial parses before having seen the entire input. Pruning is formally straightforward in Earley parsers: in each state set, rank states according to their forward probabilities α, then remove those states with small probabilities compared to the current best candidate, or simply those whose rank exceeds a given limit. Notice that this will not only omit certain parses, but will also underestimate the forward and inner probabilities of the derivations that remain. Pruning procedures therefore have to be evaluated empirically, since they invariably sacrifice completeness and, in the case of the Viterbi algorithm, optimality of the result.

While Earley-based on-line pruning awaits further study, there is reason to believe the Earley framework has inherent advantages over strategies based only on bottom-up information (including so-called "over-the-top" parsers). Context-free forward probabilities include all available probabilistic information (subject to assumptions implicit in the SCFG formalism) available from an input prefix, whereas the usual inside probabilities do not take into account the nonterminal prior probabilities that result from the top-down relation to the start state. Using top-down constraints does not necessarily mean sacrificing robustness, as discussed in Section 5.4. On the contrary, by using Earley-style parsing with a set of carefully designed and estimated "fault-tolerant" top-level productions, it should be possible to use probabilities to better advantage in robust parsing. This approach is a subject of ongoing work, in the context of tightly coupling SCFGs with speech decoders (Jurafsky, Wooters, Segal, Stolcke, Fosler, Tajchman, and Morgan 1995).

6.2 Relation to Probabilistic LR Parsing
One of the major alternative context-free parsing paradigms besides Earley's algorithm is LR parsing (Aho and Ullman 1972). A comparison of the two approaches, both in their probabilistic and nonprobabilistic aspects, is interesting and provides useful insights. The following remarks assume familiarity with both approaches. We sketch the fundamental relations, as well as the important tradeoffs, between the two frameworks.

Like an Earley parser, an LR parser uses dotted productions, called items, to keep track of the progress of derivations as the input is processed. The start indices are not part of LR items; we may therefore use the term "item" to refer to both LR items and Earley states without start indices. An Earley parser constructs sets of possible items on the fly, by following all possible partial derivations. An LR parser, on the other hand, has access to a complete list of sets of possible items computed beforehand, and at runtime simply follows transitions between these sets. The item sets are known as the "states" of the LR parser. A grammar is suitable for LR parsing if these transitions can be performed deterministically by considering only the next input and the contents of a shift-reduce stack. Generalized LR parsing is an extension that allows parallel tracking of multiple state transitions and stack actions by using a graph-structured stack (Tomita 1986).

Probabilistic LR parsing (Wright 1990) is based on LR items augmented with certain conditional probabilities. Specifically, the probability p associated with an LR item X → λ.μ is, in our terminology, a normalized forward probability:

    p = αi(X → λ.μ) / P(S ⇒L x0 ... xi−1),

where the denominator is the probability of the current prefix.
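Under one plausible reading of this correspondence, the item probability can be obtained from an Earley chart by summing forward probabilities over the start indices that LR items do not record, and normalizing by the prefix probability. The sketch below makes that reading explicit; the chart layout and field names are assumptions, and this is an interpretation rather than the published construction.

```python
def lr_item_probability(chart, i, lhs, before_dot, after_dot, prefix_prob):
    """Normalized forward probability of the LR item X -> lambda.mu at position i."""
    alpha = sum(s.forward for s in chart[i]
                if s.lhs == lhs
                and s.before_dot == before_dot
                and s.after_dot == after_dot)       # sum over start indices k
    return alpha / prefix_prob                      # divide by P(S =>_L x_0 ... x_{i-1})
```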