
The deterministic pushdown automata are less powerful than the pushdown automata

In document This page intentionally left blank (Page 120-126)

Lexical Analysis

Theorem 3.19. The deterministic pushdown automata are less powerful than the pushdown automata

Proof. By Definition 2.6, every deterministic pushdown automaton is a special case of a pushdown automaton. On the other hand, consider the language L = {wv| w, v ∈ {0, 1}*, v = reversal(w)}. A pushdown automaton can accept L as follows. First, it moves w onto the pushdown. Then, it reads v and, simultaneously, compares it with the pushdown contents. In greater detail, during every move of this comparison, it verifies that the current input symbol and the pushdown top symbol coincide. If they do, the automaton pops the pushdown top symbol, reads the next input symbol, and continues by making the next comparison. If, in this way, the pushdown automaton empties the pushdown and, simultaneously, reads the entire input string, it accepts. Nothing in the input marks where w ends and v begins, so we intuitively see that the automaton has to choose non-deterministically the move from which it starts the comparison. We leave a rigorous proof that no deterministic pushdown automaton accepts L as an exercise.
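The non-deterministic choice described in this proof can be imitated by a deterministic program that simply tries every possible position at which the comparison might start. The following Python sketch does so; the function name accepts_w_reversal is hypothetical, and the code illustrates the idea of the proof rather than giving a formal pushdown automaton.

```python
def accepts_w_reversal(s: str) -> bool:
    """Decide whether s = w + reversal(w) by trying every guessed midpoint."""
    for guess in range(len(s) + 1):      # guessed move at which comparison starts
        pushdown = list(s[:guess])       # push phase: move w onto the pushdown
        matched = True
        for symbol in s[guess:]:         # comparison phase: read v
            if pushdown and pushdown[-1] == symbol:
                pushdown.pop()           # top symbol coincides with the input symbol
            else:
                matched = False
                break
        if matched and not pushdown:     # pushdown empty and whole input read
            return True
    return False
```

Trying all guesses in a loop replaces the automaton's non-determinism by exhaustive search, which is exactly what a deterministic pushdown automaton cannot do with a single pushdown and one left-to-right pass over the input.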

3.3.2 Verification of the Grammatical Syntax Specification

Having specified the syntax of a programming language L by a grammar G, we want to be sure that this specification is correct. As G may contain many rules, a verification of this specification requires a rigorous proof rather than an intuitive justification. In other words, we must formally demonstrate that L(G) = L. Obviously, this verification always depends on the specific grammar, so we can hardly give a general guide on how to carry it out. Instead, we give a single example of this verification, which demonstrates that even for a simple abstract language and a rather small grammar, this verification may represent a non-trivial mathematical task.

Example 3.5 Verification of the Generated Language. Consider the grammar G defined as

1: S → aB
2: S → bA
3: A → a
4: A → aS
5: A → bAA
6: B → b
7: B → bS
8: B → aBB

We want to verify that G generates the language L that contains all non-empty strings that consist of an equal number of as and bs; formally,

L(G) = {w| w ∈ {a, b}+ and occur(a, w) = occur(b, w)}

A verification of this kind often turns out easier if we first prove a claim that states something more than we actually need; subsequently, we obtain the desired statement as a straightforward consequence of the stronger claim. In this way, we proceed to verify the equation above.

Before we give the next claim, representing the stronger statement, recall that for a symbol a and a string w, occur(a, w) denotes the number of occurrences of a in w (see Section 1.1).

Claim. For all w ∈ {a, b}+, these three equivalences hold

I. S ⇒* w if and only if occur(a, w) = occur(b, w)
II. A ⇒* w if and only if occur(a, w) = occur(b, w) + 1
III. B ⇒* w if and only if occur(b, w) = occur(a, w) + 1

Proof by induction on |w| ≥ 1.

Basis. Let |w| = 1.

I. From S, G generates no sentence of length one. On the other hand, no sentence of length one satisfies occur(a, w) = occur(b, w). Thus, in this case, the basis holds vacuously.

II. Examine G to see that if A ⇒* w with |w| = 1, then w = a. For w = a, A ⇒* w [3]. Therefore, II holds in this case.

III. Prove this by analogy with the proof of II.

Consequently, the basis holds.

Induction Hypothesis. Assume that, for some integer n ≥ 1, the claim holds for every w ∈ {a, b}* satisfying 1 ≤ |w| ≤ n.

Induction Step. Let w ∈ {a, b}* with |w| = n + 1.

I. Only if. Consider any derivation of the form S ⇒* w [ρ], where ρ is a sequence of rules. This derivation starts from S. As only rules 1 and 2 have S on the left-hand side, express S ⇒* w [ρ] as S ⇒* w [rπ], where ρ = rπ and r ∈ {1, 2}.

• If r = 1, S ⇒* w [1π], where 1: S → aB. At this point, w = av, and B ⇒* v [π], where |v| = n. By the induction hypothesis, III holds for v, so occur(b, v) = occur(a, v) + 1. Therefore, occur(a, w) = occur(b, w).

• If r = 2, S ⇒* w [2π], where 2: S → bA. Thus, w = bv, and A ⇒* v [π], where |v| = n. By the induction hypothesis, II holds for v, so occur(a, v) = occur(b, v) + 1. As w = bv, occur(a, w) = occur(b, w).

If. Let occur(a, w) = occur(b, w). Clearly, w = av or w = bv, for some v ∈ {a, b}* with |v| = n.

• Let w = av. Then, |v| = n and occur(a, v) + 1 = occur(b, v). As |v| = n, by the induction hypothesis, III holds for v; that is, B ⇒* v if and only if occur(b, v) = occur(a, v) + 1. Therefore, B ⇒* v. By using 1: S → aB, we obtain S ⇒ aB [1]. Putting S ⇒ aB and B ⇒* v together, we have S ⇒ aB ⇒* av, so S ⇒* w because w = av.

• Let w = bv. Then, |v| = n and occur(a, v) = occur(b, v) + 1. By the induction hypothesis, we have A ⇒* v if and only if occur(a, v) = occur(b, v) + 1 (see II). By 2: S → bA, G makes S ⇒ bA. Thus, S ⇒ bA and A ⇒* v, so S ⇒* w.

II. Only if. Consider any derivation of the form A ⇒* w [ρ], where ρ is a sequence of G’s rules.

Express A ⇒* w [ρ] as A ⇒* w [rπ], where ρ = rπ and r ∈ {3, 4, 5} because rules 3: A → a, 4: A → aS, and 5: A → bAA are all the A-rules in G.

• If r = 3, A ⇒* w [rπ] is a one-step derivation A ⇒ a [3], so w = a, which satisfies occur(a, w) = occur(b, w) + 1.

• If r = 4, A ⇒* w [4π], where 4: A → aS. Thus, w = av, and S ⇒* v [π], where |v| = n. By the induction hypothesis, from I, occur(a, v) = occur(b, v), so occur(a, w) = occur(b, w) + 1.

• If r = 5, A ⇒* w [5π], where 5: A → bAA. Thus, w = buv, A ⇒* u, A ⇒* v, where |u| ≤ n, |v| ≤ n.

By the induction hypothesis, from II, occur(a, u) = occur(b, u) + 1 and occur(a, v) = occur(b, v) + 1, so occur(a, uv) = occur(b, uv) + 2. As w = buv, occur(a, w) = occur(a, uv) and occur(b, w) = occur(b, uv) + 1. Therefore, occur(a, w) = occur(b, uv) + 2 = occur(b, w) + 1.

If. Let occur(a, w) = occur(b, w) + 1. Obviously, w = av or w = bv, for some v ∈ {a, b}* with |v| = n.

• Let w = av. At this point, |v| = n and occur(a, v) = occur(b, v). As |v| = n, by the induction hypothesis (see I), we have S ⇒* v. By using 4: A → aS, A ⇒ aS [4]. Putting A ⇒ aS and S ⇒* v together, we obtain A ⇒ aS ⇒* av, so A ⇒* w because w = av.

• Let w = bv. At this point, |v| = n and occur(a, v) = occur(b, v) + 2. Express v as v = uz so that occur(a, u) = occur(b, u) + 1 and occur(a, z) = occur(b, z) + 1; as an exercise, we leave a proof that occur(a, v) = occur(b, v) + 2 implies that v can always be expressed in this way. Since |v| = n, |u| ≤ n and |z| ≤ n. Thus, by the induction hypothesis (see II), we have A ⇒* u and A ⇒* z. By using 5: A → bAA, A ⇒ bAA [5]. Putting A ⇒ bAA, A ⇒* u, and A ⇒* z together, we obtain A ⇒ bAA ⇒* buz, so A ⇒* w because w = bv = buz.
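One way to see why the decomposition v = uz used in the case w = bv always exists is to scan v from left to right while tracking occur(a, ·) − occur(b, ·) for the prefix read so far. This difference starts at 0, changes by exactly 1 with every symbol, and ends at 2, so it must equal 1 after some prefix u; the remaining suffix z then carries the other surplus a. The Python sketch below follows this idea; the function name split_surplus_two is hypothetical, and the sketch does not replace the rigorous proof left as an exercise.

```python
def split_surplus_two(v: str) -> tuple[str, str]:
    """Split v, which has two more a's than b's, into u and z with one surplus a each."""
    surplus = 0                                  # occur(a, prefix) - occur(b, prefix)
    for i, symbol in enumerate(v, start=1):
        surplus += 1 if symbol == "a" else -1
        if surplus == 1:                         # shortest prefix with one surplus a
            return v[:i], v[i:]
    raise ValueError("v must contain two more a's than b's")
```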

III. Prove this inductive step by analogy with the proof of the inductive step of II.

Having established this claim, we easily obtain the desired equation L(G) = {w| w ∈ {a, b}+ and occur(a, w) = occur(b, w)} as a consequence of Equivalence I. Indeed, this equivalence says that for all w ∈ {a, b}+, S ⇒* w if and only if occur(a, w) = occur(b, w). Consequently, for every non-empty w, w ∈ L(G) if and only if occur(a, w) = occur(b, w). As G has no ε-rules, ε ∉ L(G), so L(G) = {w| w ∈ {a, b}+ and occur(a, w) = occur(b, w)}.
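A proof like the one above can be cross-checked experimentally on short strings. The following Python sketch enumerates all terminal strings of bounded length derivable from S, A, and B in G and compares them with the three sets described by the claim. The encoding of G as the dictionary RULES and the names generated, expected, and check_claim are assumptions made for this illustration. The length-based pruning is sound only because no rule of G shortens a sentential form. Such a test can reveal a mistake in a grammar, but it is no substitute for the proof.

```python
from itertools import product

# Rules 1-8 of G from Example 3.5, grouped by left-hand side.
RULES = {
    "S": ["aB", "bA"],
    "A": ["a", "aS", "bAA"],
    "B": ["b", "bS", "aBB"],
}

def generated(start: str, max_len: int) -> set[str]:
    """All terminal strings of length <= max_len derivable from `start`."""
    seen, stack, result = {start}, [start], set()
    while stack:
        form = stack.pop()
        # Position of the leftmost nonterminal, if there is one.
        pos = next((i for i, c in enumerate(form) if c in RULES), None)
        if pos is None:
            result.add(form)                 # no nonterminal left: a terminal string
            continue
        for rhs in RULES[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                stack.append(new)
    return result

def expected(surplus_of_a: int, max_len: int) -> set[str]:
    """Non-empty strings w with occur(a, w) - occur(b, w) == surplus_of_a."""
    return {
        "".join(t)
        for n in range(1, max_len + 1)
        for t in product("ab", repeat=n)
        if t.count("a") - t.count("b") == surplus_of_a
    }

def check_claim(max_len: int) -> bool:
    """Check equivalences I, II, and III for all strings up to max_len."""
    return (
        generated("S", max_len) == expected(0, max_len)       # I
        and generated("A", max_len) == expected(1, max_len)   # II
        and generated("B", max_len) == expected(-1, max_len)  # III
    )
```

Calling check_claim(7), for instance, compares the generated and the expected sets for all strings of length at most seven.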

3.3.3 Simplification of Grammars

This section simplifies grammars in order to make their specification of the programming language syntax easier and clearer.

Canonical derivations and parse trees

As already explained in Section 3.1, both canonical derivations and parse trees simplify the discussion of parsing because they free us from considering all possible derivations in a grammar.

Indeed, assume that a programming language is specified by a grammar. To verify that a string w representing the tokenized version of a source program is syntactically correct, a top-down parser builds a parse tree starting from the root labeled with the grammatical start symbol and proceeding down towards the frontier equal to w. In other words, it constructs a leftmost derivation of w. On the other hand, a bottom-up parser starts from w as the frontier and works up toward the root labeled with the start symbol. In terms of canonical derivations, it builds up a rightmost derivation of w in reverse order. In either approach, canonical derivations and parse trees are central to parsing. What we next demonstrate is that without any loss of generality, we can always represent every grammatical derivation of a sentence by a canonical derivation or a parse tree, depending on which of these representations is more appropriate in a given discussion.

Theorem 3.20. Let G = (Σ, R) be a grammar. Then, w ∈ L(G) if and only if S lm⇒* w.

Proof.

If. This part of the proof says that S lm⇒* w implies w ∈ L(G), for every w ∈ ∆*. As S lm⇒* w is a special case of a derivation from S to w, this implication surely holds.

Only If. To demonstrate that G can generate every w ∈ L(G) in the leftmost way, we first prove the next claim.

Claim. For every w ∈ L(G), S ⇒ⁿ w implies S lm⇒ⁿ w, for all n ≥ 0.

Proof by induction on n ≥ 0.

Basis. For n = 0, this implication is trivial.

Induction Hypothesis. Assume that, for some integer n ≥ 0, the claim holds for all derivations of length n or less.

Induction Step. Let S ⇒ⁿ⁺¹ w [ρ], where w ∈ L(G), ρ ∈ R+, and |ρ| = n + 1. If S ⇒ⁿ⁺¹ w [ρ] is leftmost, the claim holds, so we assume that this derivation is not leftmost. Express S ⇒ⁿ⁺¹ w [ρ] as

S lm⇒* uAvBx [σ]

⇒ uAvyx [r: B → y]

⇒* w [θ]

where σ, θ ∈ R*, ρ = σrθ, r: B → y ∈ R, u ∈ prefix(w), A ∈ N, u ∈ ∆*, and v, x, y ∈ Σ*. That is, S lm⇒* uAvBx is the longest beginning of S ⇒ⁿ⁺¹ w performed in the leftmost way. As A ∈ N and w ∈ L(G), A ∉ alph(w), so A is surely rewritten during uAvyx ⇒* w. Thus, express S ⇒ⁿ⁺¹ w as

S lm⇒* uAvBx [σ]

⇒ uAvyx [r: B → y]

⇒* uAz [π]

lm⇒ utz [p: A → t]

⇒* w [ο]

where π, ο ∈ R*, θ = πpο, p: A → t ∈ R, vyx ⇒* z, z ∈ Σ*. Rearrange this derivation as

S lm⇒* uAvBx [σ]

lm⇒ utvBx [p: A → t]

⇒ utvyx [r: B → y]

⇒* utz [π]

⇒* w [ο]

The resulting derivation S ⇒* w [σprπο] begins with at least |σp| leftmost steps, so its leftmost beginning is definitely longer than the leftmost beginning of the original derivation S ⇒ⁿ⁺¹ w [ρ].

If S ⇒* w [σprπο] still does not represent a leftmost derivation, we apply the analogous rearrangement to it. After n − 2 or fewer repetitions of this derivation rearrangement, we obtain S lm⇒* w, which completes the induction step, so the proof of the claim is completed.

From this claim, we see that G can generate every w ∈ L(G) in the leftmost way, so the theorem holds.

We might be tempted to generalize Theorem 3.20 for w ∈ Σ*. That is, we might consider a statement that for every grammar G = (Σ, R), S ⇒* w if and only if S lm⇒* w, for all w ∈ Σ*. This statement is false, however. To give a trivial counterexample, consider a grammar with two rules of the form S → AA and A → a. Observe that this grammar makes S ⇒ AA ⇒ Aa; however, there is no leftmost derivation of Aa in G.
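This counterexample is small enough to check mechanically. The Python sketch below collects all sentential forms reachable from S, once with arbitrary rewriting and once with leftmost rewriting only. The function name sentential_forms and the dictionary encoding of the two rules are assumptions made for this illustration, and the search terminates only because this particular grammar has finitely many sentential forms.

```python
def sentential_forms(rules: dict, start: str, leftmost_only: bool) -> set:
    """Collect every sentential form reachable from `start`."""
    seen, stack = {start}, [start]
    while stack:
        form = stack.pop()
        positions = [i for i, c in enumerate(form) if c in rules]
        if leftmost_only:
            positions = positions[:1]    # only the leftmost nonterminal is rewritten
        for pos in positions:
            for rhs in rules[form[pos]]:
                new = form[:pos] + rhs + form[pos + 1:]
                if new not in seen:
                    seen.add(new)
                    stack.append(new)
    return seen

RULES = {"S": ["AA"], "A": ["a"]}        # S -> AA and A -> a
all_forms = sentential_forms(RULES, "S", leftmost_only=False)
lm_forms = sentential_forms(RULES, "S", leftmost_only=True)
```

Here Aa belongs to all_forms but not to lm_forms, whereas the sentence aa belongs to both, in agreement with Theorem 3.20.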

Theorem 3.21. Let G = (Σ, R) be a grammar. Then, w ∈ L(G) if and only if S rm⇒* w.

From Definition 3.5, we already know how to convert any grammatical derivation to the corresponding parse tree. In the following proof, we describe the opposite conversion.

Theorem 3.22. Let G = (Σ, R) be a grammar. Then, A ⇒* x if and only if there exists a parse tree t such that root(t) = A and frontier(t) = x, where A ∈ N, x ∈ Σ*.

Proof.

Only If. This part of the proof says that for every derivation A ⇒* x, there exists a parse tree t such that root(t) = A and frontier(t) = x, where A ∈ N, x ∈ Σ*. From Definition 3.5, we know how to construct this tree, denoted by pt(A ⇒* x).

If. We prove that for every derivation tree t in G with root(t) = A and frontier(t) = x, where A ∈ N, x ∈ Σ*, there exists A ⇒* x in G by induction on depth(t) ≥ 0.

Basis. Consider any derivation tree t in G such that depth(t) = 0. As depth(t) = 0, t is a tree consisting of one node, so root(t) = frontier(t) = A, where A ∈ N. Observe that A ⇒⁰ A in G, so the basis holds.

Induction Hypothesis. Suppose that the claim holds for every derivation tree t with depth(t) ≤ n, where n is a non-negative integer.

Induction Step. Let t be any derivation tree with depth(t) = n + 1. Let root(t) = A and frontier(t) = x, where A ∈ N, x ∈ Σ*. Consider the topmost rule tree, rt(p), occurring in t; that is, rt(p) is the rule tree whose root coincides with root(t). Let p: A → u ∈ R. If u = ε, t has the form A〈〉, which means depth(t) = 1; at this point, A ⇒ ε [p], so the induction step is completed. Assume u ≠ ε. Let u = X1X2… Xm, where m ≥ 1. Thus, t is of the form A〈t1t2… tm〉, where each ti is a parse tree with root(ti) = Xi and depth(ti) ≤ n, 1 ≤ i ≤ m. Let frontier(ti) = yi, where yi ∈ Σ*, so x = y1y2… ym. As depth(ti) ≤ n, by the induction hypothesis, we have Xi ⇒* yi in G, 1 ≤ i ≤ m. Since A → u ∈ R with u = X1X2… Xm, we have A ⇒ X1X2… Xm. Putting together A ⇒ X1X2… Xm and Xi ⇒* yi for all 1 ≤ i ≤ m, we obtain

A ⇒ X1X2… Xm

⇒* y1X2… Xm

⇒* y1y2… Xm

⋮

⇒* y1y2… ym

Thus, A ⇒* x in G.
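The induction step above is constructive: it first applies the topmost rule and then derives y1, y2, …, ym one after another from left to right. The following Python sketch turns this construction into a recursive procedure that lists the sentential forms of a derivation read off a parse tree. The Node class and the functions frontier and derivation are hypothetical names introduced for this illustration, and symbols are assumed to be single characters.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    symbol: str
    # None marks a leaf; an empty list marks a node expanded by an ε-rule.
    children: Optional[list["Node"]] = None

def frontier(t: Node) -> str:
    """Concatenate the leaf labels of t from left to right."""
    if t.children is None:
        return t.symbol
    return "".join(frontier(c) for c in t.children)

def derivation(t: Node) -> list[str]:
    """Sentential forms of a derivation from root(t) to frontier(t)."""
    forms = [t.symbol]
    if t.children is None:               # depth 0: the zero-step derivation
        return forms
    current = [c.symbol for c in t.children]
    forms.append("".join(current))       # first step: A => X1 X2 ... Xm
    done = ""                            # y1 ... y(i-1), already derived
    for i, child in enumerate(t.children):
        rest = "".join(current[i + 1:])  # X(i+1) ... Xm, not yet rewritten
        for step in derivation(child)[1:]:
            forms.append(done + step + rest)
        done += frontier(child)
    return forms
```

For the tree A〈b A〈a〉 A〈a〉〉 built from rules 5 and 3 of Example 3.5, the procedure lists the forms A, bAA, baA, baa. Because it always completes the leftmost subtree first, the derivation it returns is leftmost.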

Corollary 3.23. Let G = (Σ, R) be a grammar. Then, w ∈ L(G) if and only if there exists a parse tree t such that root(t) = S and frontier(t) = w.

Proof. Every w ∈ L(G) satisfies S ⇒* w, so this corollary follows from Theorem 3.22.

From the previous statements, we next derive the main result concerning canonical derivations and parse trees. It guarantees that without any loss of generality, we can always restrict our attention to the canonical derivations or parse trees when discussing the language generated by a grammar.

Corollary 3.24. Let G = (Σ, R) be a grammar. For every w ∈ ∆*, the following statements are equivalent in G:

• w ∈ L(G);

• S ⇒* w;

• S lm⇒* w;

• S rm⇒* w;

• there exists a parse tree t such that root(t) = S and frontier(t) = w.

Proof. From Definition 3.1, for every w ∈ ∆*, w ∈ L(G) if and only if S ⇒* w. The rest of Corollary 3.24 follows from Theorem 3.20, Theorem 3.21, and Corollary 3.23.

Proper Grammatical Specification of the Syntax

A grammar G may contain some components that are of no use regarding the generation of its language L(G). These useless components unnecessarily increase the grammatical size, make the programming language syntax specification clumsy and complicate parsing. Therefore, we next explain how to remove them from G. First, we eliminate all useless symbols that are completely superfluous during the specification of any syntax structure. Then, we turn our attention to the removal of some grammatical rules, which might needlessly complicate the syntax specification.

Specifically, we explain how to eliminate all the ε-rules and the unit rules that have a single nonterminal on their right-hand side. Finally, we summarize all these grammatical transformations by converting any grammar to an equivalent proper grammar without useless components.

Useful symbols. The grammatical symbols that take part in the generation of some sentences from L(G) are useful; otherwise, they are useless. In particular, every symbol from which no string of terminals is derivable is completely useless.

Definition 3.25 Terminating Symbols. Let G = (Σ, R) be a grammar. A symbol X ∈ Σ is terminating if X ⇒* w for some w ∈ ∆*; otherwise, X is non-terminating.

To eliminate all non-terminating symbols, we first need to know which symbols are terminating and which are not.

Goal. Given a grammar, G = (Σ, R), determine the subset V ⊆ Σ that contains all terminating symbols.

Gist. Every terminal a ∈ ∆ is terminating because a ⇒* a, so initialize V with ∆. If a rule r ∈ R satisfies rhs(r) ∈ V*, then rhs(r) ⇒* w for some w ∈ ∆*; at this point, we add lhs(r) to V because lhs(r) ⇒ rhs(r) ⇒* w, so lhs(r) is terminating, too. In this way, we keep extending V until no further terminating symbol can be added to V.
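The gist above amounts to a simple fixed-point computation. The following Python sketch is one possible way to write it; the function name terminating_symbols and the representation of rules as (lhs, rhs) pairs, with rhs a string of single-character symbols, are assumptions made for this illustration.

```python
def terminating_symbols(terminals, rules):
    """Return the set V of all terminating symbols.

    `rules` is an iterable of (lhs, rhs) pairs; rhs is a string of
    single-character symbols, and the empty string encodes an ε-rule.
    """
    rules = list(rules)
    v = set(terminals)                   # every terminal is terminating
    changed = True
    while changed:                       # repeat until V stops growing
        changed = False
        for lhs, rhs in rules:
            if lhs not in v and all(x in v for x in rhs):
                v.add(lhs)               # rhs(r) is in V*, so lhs(r) is terminating
                changed = True
    return v
```

For example, with the terminals a and b and the rules S → AB, S → a, A → aA, and B → b, the computation yields V = {a, b, S, B}. The symbol A is left out because its only rule reintroduces A, so no string of terminals is derivable from it.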
