The steps we have taken in the example above resemble very much the steps of a so-called pushdown automaton. A pushdown automaton (PDA) is an imaginary mathematical device that reads input and has control over a stack. The stack can contain symbols that belong to a so-called stack alphabet. A stack is a list that can only be accessed at one end: the last symbol entered on the list (“pushed”) is the first symbol to be taken from it (“popped”). This is also sometimes called a “first-in, last-out” list, or a FILO list: the first symbol that goes in is the last symbol to come out. In the example above, the prediction works like a stack, and this is what the pushdown automaton uses the stack for too. We therefore often call this stack the prediction stack. The stack also explains the name “pushdown” automaton: the automaton “pushes” symbols on the stack for later processing.
The pushdown automaton operates by popping a stack symbol and reading an input symbol. These two symbols then in general give us a choice of several lists of stack symbols to be pushed on the stack. So, there is a mapping of (input symbol, stack symbol) pairs to lists of stack symbols. The automaton accepts the input sentence when the stack is empty at the end of the input. If there are choices (so an (input symbol, stack symbol) pair maps to more than one list), the automaton accepts a sentence when there are choices that lead to an empty stack at the end of the sentence.
Greibach Normal Form (GNF). In this normal form, all grammar rules have either the form A→a or A→aB1B2...Bn, with a a terminal and A, B1, ..., Bn non-terminals. The stack symbols are, of course, the non-terminals. A rule of the form A→aB1B2...Bn leads to a mapping of the (a, A) pair to the list B1B2...Bn. This means that if the input symbol is an a, and the prediction stack starts with an A, we could accept the a, and replace the A part of the prediction stack with B1B2...Bn. A rule of the form A→a leads to a mapping of the (a, A) pair to an empty list. The automaton starts with the start symbol of the grammar on the stack. Any context-free grammar that does not produce the empty string can be put into Greibach Normal Form. Most books on formal language theory discuss how to do this (see for instance Hopcroft and Ullman [Books 1979]).
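The translation from rules to mapping entries can be sketched in a few lines of Python. This is only an illustration: the function name gnf_to_mapping, the representation of rules as (left-hand side, right-hand side) string pairs, and the small grammar RULES are assumptions made for the example and do not come from the text. Each symbol is a single character, and an empty string stands for the empty list.

```python
from collections import defaultdict

def gnf_to_mapping(rules):
    """Turn GNF rules into a (input symbol, stack symbol) -> lists mapping.

    Each rule is a pair (lhs, rhs) in which rhs starts with a terminal
    followed by zero or more non-terminals, as Greibach Normal Form requires.
    """
    mapping = defaultdict(list)
    for lhs, rhs in rules:
        terminal, nonterminals = rhs[0], rhs[1:]
        # A -> a B1...Bn maps the pair (a, A) to the list B1...Bn;
        # A -> a maps (a, A) to the empty list (here: the empty string).
        mapping[(terminal, lhs)].append(nonterminals)
    return dict(mapping)

# Hypothetical GNF grammar, used only for illustration:
# S -> aSB | aB,  B -> b
RULES = [("S", "aSB"), ("S", "aB"), ("B", "b")]
```

For this hypothetical grammar, the pair (a, S) gets two lists, SB and B, so the resulting automaton has a choice whenever it sees an a with S on top of the stack.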
The example grammar of Figure 6.1 already is in Greibach Normal Form, so we can easily build a pushdown automaton for it. The automaton is characterized by the mapping shown in Figure 6.3.
    (a, S) -> B
    (b, S) -> A
    (a, A) ->
    (a, A) -> S
    (b, A) -> AA
    (b, B) ->
    (b, B) -> S
    (a, B) -> BB
Figure 6.3 Mapping of the PDA for the grammar of Figure 6.1
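One way to see the mapping of Figure 6.3 at work is to simulate the automaton while keeping track of every stack that some sequence of choices could have produced. The following Python sketch does this; the names MAPPING and accepts_gnf and the representation of a stack as a string with its top at the left are assumptions of the example, not part of the automaton itself.

```python
# Mapping of Figure 6.3: (input symbol, stack symbol) -> lists to push.
# An empty string stands for the empty list.
MAPPING = {
    ("a", "S"): ["B"],
    ("b", "S"): ["A"],
    ("a", "A"): ["", "S"],
    ("b", "A"): ["AA"],
    ("b", "B"): ["", "S"],
    ("a", "B"): ["BB"],
}

def accepts_gnf(sentence, start="S"):
    """Return True if some sequence of choices empties the stack at end of input."""
    # Each stack is a string whose first character is the top of the stack.
    stacks = {start}
    for symbol in sentence:
        next_stacks = set()
        for stack in stacks:
            if not stack:
                continue  # stack already empty but input remains: dead end
            top, rest = stack[0], stack[1:]
            # Pop the top symbol, read the input symbol, push each possible list.
            for pushed in MAPPING.get((symbol, top), []):
                next_stacks.add(pushed + rest)
        stacks = next_stacks
    # Acceptance by empty stack at the end of the input.
    return "" in stacks
```

With this sketch, accepts_gnf("aabb") and accepts_gnf("abba") hold, whereas accepts_gnf("aab") does not, since no sequence of choices leaves an empty stack after the last symbol.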
An important remark to be made here is that many pushdown automata are non-deterministic. For instance, the pushdown automaton of Figure 6.3 can choose between an empty list and an S for the pair (a, A). In fact, there are context-free languages for which we cannot build a deterministic pushdown automaton, although we can build a non-deterministic one. We should also mention that the pushdown automata as discussed here are a simplification of the ones we find in automata theory. In automata theory, pushdown automata have so-called states, and the mapping is from (state, input symbol, stack symbol) triplets to (state, list of stack symbols) pairs. Seen in this way, they are like finite-state automata (discussed in Chapter 5), extended with a stack. Also, pushdown automata come in two different kinds: some accept a sentence by empty stack, others accept by ending up in a state that is marked as an accepting state. Perhaps surprisingly, having states does not make the pushdown automaton concept more powerful. Pushdown automata with states still only accept languages that can be described with a context-free grammar. In our discussion, the pushdown automaton only has one state, so we have taken the liberty of leaving it out.
Pushdown automata as described above have several shortcomings that must be resolved if we want to convert them into parsing automata. Firstly, pushdown automata require us to put our grammar into Greibach Normal Form. While grammar transformations are no problem for the formal linguist, we would like to avoid them as much as possible, and use the original grammar if we can. Now we could relax the Greibach Normal Form requirement a little by also allowing terminals as stack symbols, and adding
(a, a) →
to the mapping for all terminals a. We could then use any grammar all of whose right-hand sides start with a terminal. We could also split the steps of the pushdown automaton into separate “match” and “predict” steps, as we did in the example of Section 6.1. The “match” steps then correspond to usage of the
(a, a) →
mappings, and the “predict” step then corresponds to a
(, A) → . . .
mapping, that is, a non-terminal on the top of the stack is replaced by one of its right-hand sides, without consuming a symbol from the input. For the grammar of Figure 6.1, this would result in the mapping shown in Figure 6.4, which is in fact just a rewrite of the grammar of Figure 6.1.
    (, S) -> aB
    (, S) -> bA
    (, A) -> a
    (, A) -> aS
    (, A) -> bAA
    (, B) -> b
    (, B) -> bS
    (, B) -> aBB
    (a, a) ->
    (b, b) ->
Figure 6.4 Match and predict mappings of the PDA for the grammar of Figure 6.1
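The separate match and predict steps of Figure 6.4 can be sketched as a small recursive recognizer in Python. The names PREDICT and accepts_match_predict are assumptions of this example, and trying the alternatives one after the other by recursion is merely a convenient way to explore the choices here. The sketch terminates for this grammar because every right-hand side starts with a terminal, so each predict step is followed by a match that either consumes input or fails; for a left-recursive grammar this naive recursion would not terminate.

```python
# Predict mappings of Figure 6.4: non-terminal -> its right-hand sides.
# Any stack symbol that is not a key of PREDICT is treated as a terminal.
PREDICT = {
    "S": ["aB", "bA"],
    "A": ["a", "aS", "bAA"],
    "B": ["b", "bS", "aBB"],
}

def accepts_match_predict(rest, stack="S"):
    """Return True if the prediction stack can be emptied exactly at end of input."""
    if not stack:
        return not rest  # empty stack: accept only if the input is used up
    top, below = stack[0], stack[1:]
    if top in PREDICT:
        # Predict step: replace the non-terminal by one of its right-hand
        # sides, without consuming a symbol from the input.
        return any(accepts_match_predict(rest, rhs + below)
                   for rhs in PREDICT[top])
    # Match step: the terminal on the stack must equal the next input symbol.
    return bool(rest) and rest[0] == top and accepts_match_predict(rest[1:], below)
```

The sketch accepts the same sentences as the mapping of Figure 6.3, for instance aabb, and rejects aab.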
We will see later that, even using this approach, we may have to modify the grammar anyway, but in the meantime, this looks very promising so we adopt this strategy. This strategy also solves another problem: ε-rules do not need special treatment any more. To get Greibach Normal Form, we would have to eliminate them. This is not necessary any more, because they now just correspond to a
(, A) → mapping.
The second shortcoming is that the pushdown automaton does not keep a record of the rules (mappings) it uses. Therefore, we introduce an analysis stack into the automaton. For every prediction step, we push the non-terminal being replaced onto the analysis stack, suffixed with the number of the right-hand side taken (numbering the right-hand sides of a non-terminal from 1 to n). For every match, we push the matched terminal onto the analysis stack. Thus, the analysis stack corresponds exactly to the parts to the left of the dashed line in Figure 6.2, and the dashed line represents the separation between the analysis stack and the prediction stack. This results in an automaton that at any point in time has a configuration as depicted in Figure 6.5. In the literature, such a configuration, together with its current state, stacks, etc. is sometimes called an instantaneous description. In Figure 6.5, matching can be seen as pushing the vertical line to the right.
    matched input | rest of input
    analysis      | prediction
Figure 6.5 An instantaneous description
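An instantaneous description and the effect of the two kinds of steps on it can be sketched in Python as follows. The class name Description, the functions steps and analyses, and the use of the match and predict mappings of Figure 6.4 as the grammar are assumptions of this example. The order in which pending descriptions are tried is arbitrary; the sketch simply tries them all and collects the analysis stacks of those that end with an empty prediction stack and no remaining input.

```python
from dataclasses import dataclass

# Right-hand sides as in Figure 6.4, numbered from 1 in the order listed.
GRAMMAR = {
    "S": ["aB", "bA"],
    "A": ["a", "aS", "bAA"],
    "B": ["b", "bS", "aBB"],
}

@dataclass(frozen=True)
class Description:
    matched: str      # input consumed so far
    rest: str         # input still to be read
    analysis: tuple   # record of predictions and matches made
    prediction: str   # prediction stack, top at the left

def steps(d):
    """Yield every description reachable from d in one predict or match step."""
    if not d.prediction:
        return
    top, below = d.prediction[0], d.prediction[1:]
    if top in GRAMMAR:
        # Predict: push the non-terminal with the number of the chosen
        # right-hand side onto the analysis stack.
        for number, rhs in enumerate(GRAMMAR[top], start=1):
            yield Description(d.matched, d.rest,
                              d.analysis + (f"{top}{number}",), rhs + below)
    elif d.rest and d.rest[0] == top:
        # Match: move one terminal from the rest of the input to the matched part.
        yield Description(d.matched + top, d.rest[1:],
                          d.analysis + (top,), below)

def analyses(sentence, start="S"):
    """Return the analysis stacks of all successful sequences of steps."""
    pending = [Description("", sentence, (), start)]
    found = []
    while pending:
        d = pending.pop()
        if not d.prediction and not d.rest:
            found.append(d.analysis)
        pending.extend(steps(d))
    return found
```

For the sentence ab, the only analysis stack found is S1 a B1 b: the first right-hand side of S was predicted, a was matched, the first right-hand side of B was predicted, and b was matched.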
The third and most important shortcoming, however, is the non-determinism. Formally, it may be satisfactory that the automaton accepts a sentence if and only if there is a sequence of choices that leads to an empty stack at the end of the sentence, but for our purpose it is not, because it does not tell us how to obtain this sequence. We have to guide the automaton to the correct choices. Looking back to the example of Section 6.1, we had to make a choice at several points in the derivation, and we did so based on some ad hoc considerations that were specific for the grammar at hand: sometimes we looked at the next symbol in the sentence, and there were also some points where we had to look further ahead, to make sure that there were no more a’s coming. In the example, the choices were easy, because all the right-hand sides start with a terminal symbol. In general, however, finding the correct choice is much more difficult. The right-hand sides could for instance equally well have started with a non-terminal symbol that again has right-hand sides starting with a non-terminal, etc.
In Chapter 8 we will see that many grammars still allow us to decide which right-hand side to choose, given the next symbol in the sentence. In this chapter, however, we will focus on top-down parsing methods that work for a larger class of grammars. Rather than trying to pick a choice based on ad hoc considerations, we would like to guide the automaton through all the possibilities. In Chapter 3 we saw that there are in general two methods for solving problems in which there are several alternatives in well-determined points: depth-first search and breadth-first search. We shall now see how we can make the machinery operate for both search methods. Since the effects can be exponential in size, even a small example can get quite big. We will use the grammar of Figure 6.6, with test input aabc. This grammar generates a rather complex language: sentences consist either of a number of a’s followed by a number of b’s followed by an equal number of c’s, or of a number of a’s followed by an equal number of b’s followed by a number of c’s. Example sentences are for instance: abc, aabbc.
    S -> AB | DC
    A -> a | aA
    B -> bc | bBc
    D -> ab | aDb
    C -> c | cC
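The language generated by this grammar can also be described directly, which gives a handy check on the examples. The Python sketch below is such a check and not a parser: the function name in_language is an assumption of the example, and "a number of" is taken to mean one or more, as the rules for A, B, C and D require.

```python
import re

def in_language(sentence):
    """Test whether sentence is a's, then b's, then c's, with either
    as many b's as c's or as many a's as b's."""
    m = re.fullmatch(r"(a+)(b+)(c+)", sentence)
    if m is None:
        return False
    a, b, c = (len(group) for group in m.groups())
    # S -> AB gives equal numbers of b's and c's;
    # S -> DC gives equal numbers of a's and b's.
    return b == c or a == b
```

Under this description, the test input aabc and the example sentences abc and aabbc are in the language, whereas aabcc is not.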