Top-Down Parser for a Grammar


Algorithm 3.12 Top-Down Parser for a Grammar

Input • a grammar G = (GΣ, GR).

Output • a G-based top-down parser represented by a pushdown automaton M = (MΣ, MR) that accepts L(G).

Method

begin

MΣ := GΣ with MN = GN, M∆ = G∆, MS = GS;

MR := ∅;

for each r: A → x ∈ GR, where A ∈ GN and x ∈ GΣ^*, do add A → reversal(x) to MR; {expansion rule}

for each a ∈ G∆ do add aa → to MR;

end.
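The two loops of Algorithm 3.12 can be sketched in Python as follows. This is only an illustrative sketch: the function name build_top_down_parser, the representation of rules as tuples, and the tags "expand" and "pop" are assumptions made for the example and do not come from the algorithm itself.

```python
def build_top_down_parser(grammar_rules, terminals):
    """Return the parser rules as (left side, right side, tag) triples.

    grammar_rules: list of (A, x) pairs, one per grammar rule A -> x,
                   where x is a tuple of symbols (empty for A -> ε).
    terminals:     iterable of the grammar's terminal symbols.
    """
    parser_rules = []
    # Expansion rules: A -> reversal(x) for every grammar rule A -> x.
    for lhs, rhs in grammar_rules:
        parser_rules.append(((lhs,), tuple(reversed(rhs)), "expand"))
    # Rules aa -> (empty right side) for every terminal a: the first a is the
    # pushdown top, the second a is the current input symbol.
    for a in sorted(terminals):
        parser_rules.append(((a, a), (), "pop"))
    return parser_rules


# Sample grammar (hypothetical encoding): S -> SS, S -> aSb, S -> ε.
G_RULES = [("S", ("S", "S")), ("S", ("a", "S", "b")), ("S", ())]
M_RULES = build_top_down_parser(G_RULES, {"a", "b"})

for lhs, rhs, tag in M_RULES:
    print("".join(lhs), "->", "".join(rhs) or "ε", f"({tag})")
```

The parser has one expansion rule per grammar rule and one additional rule per terminal, so its size grows linearly with the size of G.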

Crucially, the G-based top-down parser M = (MΣ, MR) constructed by Algorithm 3.12 satisfies A → x ∈ GR if and only if A → reversal(x) ∈ MR.

Introduce the M-G parsing correspondence o as a mapping from MR^* to GR^* defined as o(A → reversal(x)) = A → x, for every A → x ∈ GR, and o(aa → ) = ε, for each a ∈ G∆. Observe that for every π ∈ GR^* such that u lm⇒^* yv [π], y ∈ G∆^*, u ∈ GNGΣ^*, v ∈ GNGΣ^* ∪ {ε}, there exists precisely one ρ ∈ MR^* such that o(ρ) = π and

reversal(u)y ⇒^* reversal(v) [ρ] in M if and only if u lm⇒^* yv [o(ρ)] in G.

Notice that for u = GS = MS and v = ε, this equivalence states that

MSy ⇒^* ε [ρ] in M if and only if GS lm⇒^* y [o(ρ)] in G,

so L(M) = L(G). The G-based top-down transducer is defined simply as the pushdown transducer Π = (MΣ, MR, o) underlain by M. Observe that this transducer translates every x ∈ L(G) to its left parse; more formally,

τ(Π) = {(x, π) | x ∈ L(G) and π is a left parse of x in G}
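To make the notion of a left parse concrete, the following Python sketch replays a sequence of rule labels as a leftmost derivation and returns the derived string. It is an illustration under stated assumptions: the function name apply_left_parse and the dictionary encoding of the grammar are hypothetical, and the sample grammar is the three-rule grammar used in the case study that follows.

```python
def apply_left_parse(rules, start, parse):
    """Rewrite the leftmost nonterminal with each labeled rule in `parse`."""
    nonterminals = {lhs for lhs, _ in rules.values()}
    form = [start]  # current sentential form, one symbol per list element
    for label in parse:
        lhs, rhs = rules[label]
        # Find the leftmost nonterminal in the current sentential form.
        pos = next((i for i, s in enumerate(form) if s in nonterminals), None)
        if pos is None or form[pos] != lhs:
            raise ValueError(f"rule {label} is not applicable")
        form[pos:pos + 1] = rhs  # replace that nonterminal by the right-hand side
    return "".join(form)


# Hypothetical encoding of 1: S -> SS, 2: S -> aSb, 3: S -> ε.
G = {1: ("S", "SS"), 2: ("S", "aSb"), 3: ("S", "")}
print(apply_left_parse(G, "S", [1, 2, 3, 2, 2, 3]))  # prints abaabb
```

A pair (x, π) belongs to τ(Π) exactly when replaying π in this way from the start symbol yields x.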

Lemma 3.13. Let G = (GΣ, GR) be a grammar. With G as its input, Algorithm 3.12 correctly constructs the G-based top-down parser M = (MΣ, MR) such that L(M) = L(G).

Proof. Based on the notes preceding the algorithm, prove this lemma as an exercise by analogy with the proof of Lemma 3.17, which is more complicated and, therefore, presented in detail later in this section.

Case Study 7/35 Top-Down Parser. Consider the set of all well-formed tokenized FUN programs. Reduce this set by erasing everything but begins and ends from it. The language resulting from this reduction thus contains all correctly balanced strings of begins and ends; for instance, begin begin end begin end end is in this language, but begin begin end is not. For brevity, replace each begin and each end with a and b, respectively, in this language, which is generated by the following grammar, G:

1: S → SS

2: S → aSb

3: S → ε

Algorithm 3.12 turns G into this G-based top-down parser, M:

S → SS

S → bSa

S → ε

aa → 

bb → 

Define o as o(S → SS) = S → SS, o(S → bSa) = S → aSb, o(S → ε) = S → ε, o(aa → ) = ε, and o(bb → ) = ε. For brevity, by using the rule labels, we express o(S → SS) = S → SS, o(S → bSa) = S → aSb, and o(S → ε) = S → ε as o(S → SS) = 1, o(S → bSa) = 2, and o(S → ε) = 3, respectively. As a result, we have

o(S → SS) = 1, o(S → bSa) = 2, o(S → ε) = 3, and o(aa → ) = ε, o(bb → ) = ε

At this point, we also obtain the G-based top-down transducer Π = (Σ, R, o); for instance, from S → SS and o(S → SS) = 1, we obtain Π’s rule of the form S → SS1 to put it in terms of Conventions 3.11. Figure 3.9 summarizes the definitions of G, M, o, and Π.

G            M          o                  Π
1: S → SS    S → SS     o(S → SS) = 1      S → SS1
2: S → aSb   S → bSa    o(S → bSa) = 2     S → bSa2
3: S → ε     S → ε      o(S → ε) = 3       S → ε3
             aa →       o(aa → ) = ε
             bb →       o(bb → ) = ε

Figure 3.9 Top-down parsing models.

Π translates abaabb, representing begin end begin begin end end, as

Sabaabb ⇒ SSabaabb 1 [S → SS1]

⇒ SbSaabaabb 12 [S → bSa2]

⇒ SbSbaabb 12 [aa → ]

⇒ Sbbaabb 123 [S → ε3]

⇒ Saabb 123 [bb → ]

⇒ bSaaabb 1232 [S → bSa2]

⇒ bSabb 1232 [aa → ]

⇒ bbSaabb 12322 [S → bSa2]

⇒ bbSbb 12322 [aa → ]

⇒ bbbb 123223 [S → ε3]

⇒ bb 123223 [bb → ]

⇒ 123223 [bb → ]

That is, Sabaabb ⇒^* 123223 in Π, so (abaabb, 123223) ∈ τ(Π). Notice that G makes S lm⇒^* abaabb [123223], so 123223 represents the left parse of abaabb in G.
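The translation above can be replayed mechanically. The following Python sketch keeps a configuration as a triple (pushdown, remaining input, output), with the pushdown top at the right end, and applies a given sequence of moves. The function name run_transducer, the "pop" marker, and the dictionary encoding of G are assumptions made for this illustration; the sketch replays a move sequence that is already known and does not search for one.

```python
def run_transducer(start, word, steps, g_rules):
    """Replay `steps` and return the list of configurations visited.

    Each step is either a grammar rule label (an expansion that also emits
    the label) or the string "pop" (a rule of the form aa -> ).
    """
    pushdown, rest, output = start, word, ""
    trace = [(pushdown, rest, output)]
    for step in steps:
        if step == "pop":
            # The pushdown top must equal the current input symbol.
            if not pushdown or not rest or pushdown[-1] != rest[0]:
                raise ValueError("pop is not applicable")
            pushdown, rest = pushdown[:-1], rest[1:]
        else:
            lhs, rhs = g_rules[step]
            # Replace the top nonterminal A by reversal(x) and emit the label.
            if not pushdown.endswith(lhs):
                raise ValueError(f"rule {step} is not applicable")
            pushdown = pushdown[:-len(lhs)] + rhs[::-1]
            output += str(step)
        trace.append((pushdown, rest, output))
    return trace


G = {1: ("S", "SS"), 2: ("S", "aSb"), 3: ("S", "")}
STEPS = [1, 2, "pop", 3, "pop", 2, "pop", 2, "pop", 3, "pop", "pop"]
for pushdown, rest, output in run_transducer("S", "abaabb", STEPS, G):
    print(f"{pushdown + rest:12} {output}")
```

The final configuration has an empty pushdown, no remaining input, and the output 123223.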

Recursive-Descent Parser

Algorithm 3.12 constructs a G-based top-down parser M = (MΣ, MR) as a pushdown automaton, which requires, strictly speaking, an implementation of a pushdown list. There exists, however, a top-down parsing method, called recursive descent, which frees us from this implementation.

Indeed, the pushdown list is invisible in this method because it is actually realized by the pushdown used to support recursion in the programming language in which we write the recursive-descent parser. As this method does not require any explicit manipulation of the pushdown list, it comes as no surprise that it is extremely popular in practice. Therefore, in its next description, we pay special attention to its implementation.

Goal. Recursive-descent parser based upon a grammar G.

Gist. Consider a programming language defined by a grammar G. Let w = t1…tjtj+1…tm be an input string or, more pragmatically speaking, the tokenized version of a source program. Like any top-down parser, a G-based recursive-descent parser, symbolically denoted as G-rd-parser, simulates the construction of a parse tree with its frontier equal to w by using G’s rules, so it starts from the root and works down to the leaves, reading w in a left-to-right way. In terms of derivations, G-rd-parser looks for the leftmost derivation of w. To find it, for each nonterminal, A, G-rd-parser has a Boolean function, rd-function A, which simulates rewriting the leftmost nonterminal A. More specifically, with the right-hand side of an A-rule, A → X1…XiXi+1…Xn, rd-function A proceeds from X1 to Xn. Assume that rd-function A currently works with Xi and that tj is the input token. At this point, depending on whether Xi is a terminal or a nonterminal, this function works as follows:

• If Xi is a terminal, rd-function A matches Xi against tj. If Xi = tj, it reads tj and proceeds to Xi+1 and tj+1. If Xi ≠ tj, a syntactical error occurs, which the parser has to handle.

• If Xi is a nonterminal, G-rd-parser calls rd-function Xi, which simulates rewriting Xi according to a rule in G.

G-rd-parser starts the parsing process from rd-function S, which corresponds to G’s start symbol, and it ends when it eventually returns to this function after completely reading w. If during this entire process no syntactical error occurs, G-rd-parser has found the leftmost derivation of w, which thus represents a syntactically well-formed program written in the source language; otherwise, w is syntactically incorrect.
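The terminal and nonterminal cases described above can be sketched in Python for the language of balanced strings of as and bs. The sketch rests on an assumption that is not part of the text: it uses the grammar S → aSbS, S → ε, which generates the same balanced strings without left recursion, because an rd-function written directly for a left-recursive rule such as S → SS would call itself without reading any input. The names BalancedParser, current, match, and rd_S are hypothetical.

```python
class BalancedParser:
    """Recursive-descent sketch for the assumed grammar S -> aSbS | ε."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0  # index of the current input token

    def current(self):
        # The current token, or None when the input is exhausted.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, terminal):
        # Terminal case: compare with the current token; read it on success.
        if self.current() == terminal:
            self.pos += 1
            return True
        return False  # syntactical error, reported to the caller as False

    def rd_S(self):
        # Select the rule by inspecting the current token.
        if self.current() == "a":
            # S -> aSbS: walk the right-hand side from left to right;
            # each nonterminal becomes a recursive call.
            return (self.match("a") and self.rd_S()
                    and self.match("b") and self.rd_S())
        return True  # S -> ε

    def parse(self):
        # Accept only if rd_S succeeds and the whole input has been read.
        return self.rd_S() and self.pos == len(self.tokens)


print(BalancedParser("abaabb").parse())
print(BalancedParser("aab").parse())
```

The call stack of the host language plays the role of the pushdown list here: each pending call of rd_S remembers which part of a right-hand side is still to be processed.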

As this method does not require explicitly manipulating a pushdown list, it is very popular in practice; in particular, it is suitable for parsing declarations and general program flow as the next case study illustrates.

Case Study 8/35 Recursive-Descent Parser. Recursive-descent parsing is particularly suitable for the syntax analysis of declarations and general program flow as this case study illustrates in terms of FUN. Consider the FUN declarations generated by the next grammar declG, where 〈declaration part〉 is its start symbol:

〈declaration part〉 → declaration 〈declaration list〉

〈declaration list〉 → 〈declaration〉; 〈declaration list〉

〈declaration list〉 → 〈declaration〉

〈declaration〉 → integer 〈variable list〉

〈declaration〉 → real 〈variable list〉

〈declaration〉 → label 〈label list〉

〈variable list〉 → i, 〈variable list〉

〈variable list〉 → i

Next, we construct declG-rd-parser, which consists of the Boolean functions corresponding to the nonterminals in declG. Throughout this construction, we obtain the tokens by programming function INPUT-SYMBOL, described next.

Definition 3.14. INPUT-SYMBOL is a lexical-analysis programming function that returns the current input symbol or, in other words, the current token when called. After a call of this function, the input string is advanced to the next symbol.
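A minimal Python sketch of a token source in the spirit of Definition 3.14 is given next. The class name TokenSource, the method names, and the use of None to signal the end of input are assumptions made for illustration. The peek method is an addition that Definition 3.14 does not mention; it is included because a token returned by input_symbol cannot be examined a second time.

```python
class TokenSource:
    """Hands out tokens one at a time from a pre-tokenized input."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._pos = 0  # index of the current token

    def input_symbol(self):
        """Return the current token and advance to the next one."""
        if self._pos >= len(self._tokens):
            return None  # assumed end-of-input marker
        token = self._tokens[self._pos]
        self._pos += 1
        return token

    def peek(self):
        """Return the current token without advancing (added convenience)."""
        if self._pos >= len(self._tokens):
            return None
        return self._tokens[self._pos]


source = TokenSource(["declaration", "integer", "i"])
print(source.input_symbol())  # prints declaration
print(source.peek())          # prints integer
print(source.input_symbol())  # prints integer
```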

First, consider the start symbol 〈declaration part〉, and the rule with 〈declaration part〉 on its left-hand side:

〈declaration part〉 → declaration 〈declaration list〉

This rule says that 〈declaration part〉 derives a string that consists of declaration followed by a string derived from 〈declaration list〉. Formally,

〈declaration part〉 lm⇒^* declaration x

where 〈declaration list〉 lm⇒^* x. The next rd-function 〈declaration part〉 simulates this derivation.

function 〈declaration part〉 : boolean;
begin
  〈declaration part〉 := false;
  if INPUT-SYMBOL = 'declaration' then
    if 〈declaration list〉 then
      〈declaration part〉 := true;
end

Consider 〈declaration list〉 and the two rules with 〈declaration list〉 on its left-hand side:

〈declaration list〉 → 〈declaration〉; 〈declaration list〉

〈declaration list〉 → 〈declaration〉

These two rules say that 〈declaration list〉 derives a string that consists of some substrings separated by semicolons so that each of these substrings is derived from 〈declaration〉. That is,

〈declaration list〉 lm⇒^* y1; …; yn

where 〈declaration〉 lm⇒^* yi, 1 ≤ i ≤ n, for some n ≥ 1 (n = 1 means 〈declaration list〉 lm⇒^* y1). The following rd-function 〈declaration list〉 simulates these leftmost derivations.

function 〈declaration list〉 : boolean;
begin
  〈declaration list〉 := false;
  if 〈declaration〉 then
    if INPUT-SYMBOL = ';' then 〈declaration list〉 := 〈declaration list〉
    else 〈declaration list〉 := true;
end

With the left-hand side equal to 〈declaration〉, there exist these two rules:

〈declaration〉 → integer 〈variable list〉

〈declaration〉 → real 〈variable list〉

According to them, 〈declaration〉 derives a string that starts with integer or real behind which there is a string derived from 〈variable list〉. That is,

〈declaration〉 lm⇒^* ay

where a ∈ {integer, real} and 〈variable list〉 lm⇒^* y. The following rd-function 〈declaration〉 simulates the above leftmost derivation.

function 〈declaration〉 : boolean;
begin
  〈declaration〉 := false;
  if INPUT-SYMBOL in ['real', 'integer'] then
    if 〈variable list〉 then
      〈declaration〉 := true;
end

Consider the 〈variable list〉-rules:

〈variable list〉 → i, 〈variable list〉

〈variable list〉 → i

Thus, 〈variable list〉 derives a string, w, that consists of i followed by zero or more further occurrences of i, each preceded by a comma; in other words,

〈variable list〉 lm⇒^* i, …, i

The next rd-function 〈variable list〉 simulates these leftmost derivations.

function 〈variable list〉 : boolean;
begin
  〈variable list〉 := false;
  if INPUT-SYMBOL = 'i' then
    if INPUT-SYMBOL = ',' then 〈variable list〉 := 〈variable list〉
    else 〈variable list〉 := true;
end

At this point, we have completed the construction of declG-rd-parser, consisting of the previous four functions corresponding to declG’s nonterminals. This parser starts from function 〈declaration part〉 because 〈declaration part〉 is declG’s start symbol, and it ends in this function after completely reading the input string of symbols or, to put it in terms of lexical analysis, tokens. If during this entire process no syntactical error occurs, the input string is syntactically correct.
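For comparison, the four rd-functions can be put together as one runnable Python sketch. It is an illustration under stated assumptions rather than a transcription: the class name DeclParser and the helpers peek and accept are hypothetical, label declarations are left out just as in the four functions above, and a token is read only when it matches an expected terminal, so that a token that fails one comparison remains available to the calling function.

```python
class DeclParser:
    """Recursive-descent sketch for the declaration grammar declG."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0  # index of the current token

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def accept(self, *expected):
        # Read the current token only if it is one of the expected terminals.
        if self.peek() in expected:
            self.pos += 1
            return True
        return False

    def declaration_part(self):
        # <declaration part> -> declaration <declaration list>
        return self.accept("declaration") and self.declaration_list()

    def declaration_list(self):
        # <declaration list> -> <declaration> ; <declaration list> | <declaration>
        if not self.declaration():
            return False
        if self.accept(";"):
            return self.declaration_list()
        return True

    def declaration(self):
        # <declaration> -> integer <variable list> | real <variable list>
        return self.accept("integer", "real") and self.variable_list()

    def variable_list(self):
        # <variable list> -> i , <variable list> | i
        if not self.accept("i"):
            return False
        if self.accept(","):
            return self.variable_list()
        return True

    def parse(self):
        # Succeed only if the start function succeeds and all input is read.
        return self.declaration_part() and self.pos == len(self.tokens)


tokens = "declaration integer i , i ; real i".split()
print(DeclParser(tokens).parse())
```

Separating the test of a token from its consumption is a design choice of this sketch; it keeps the separator that ends a variable list, such as a semicolon, available to the function handling the declaration list.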
