Algorithm 3.43 implies the next corollary because for every grammar, there exists an equivalent grammar in Greibach normal form.
Corollary 3.44. Let G = (GΣ, GR) be a grammar. Then, there exists a one-state pushdown automaton, M, without ε-rules that accepts L(G). With w ∈ G∆* as the input string, M thus makes no more than |w| moves.
The previous three parsing algorithms all turn any grammar G into an equivalent G-based parser represented by a pushdown automaton. However, parsing theory has also developed algorithms that act as parsers without being based upon pushdown automata that simulate canonical derivations in G. We close this section by giving one of them, called the Cocke-Younger-Kasami parsing algorithm after its authors. This algorithm works in a bottom-up way with grammars in the Chomsky normal form. Recall that in these grammars, every rule has one terminal or two nonterminals on its right-hand side, and Algorithm 3.40 has already demonstrated how to turn any grammar into an equivalent grammar in this normal form.
Goal. Given a grammar, G = (Σ, R), in the Chomsky normal form and a string, x = a_1a_2…a_n with a_i ∈ G∆, 1 ≤ i ≤ n, for some n ≥ 1, decide whether x ∈ L(G).
Gist. The Cocke-Younger-Kasami parsing algorithm decides whether x ∈ L(G) in a bottom-up way. It constructs sets of nonterminals, CYK[i, j], where 1 ≤ i ≤ j ≤ n, satisfying A ∈ CYK[i, j] if and only if A ⇒* a_i…a_j; therefore, as a special case, L(G) contains x if and only if CYK[1, n] contains G’s start symbol, S. To construct these sets, the algorithm initially includes A in CYK[i, i] if and only if A → a_i is in R because A ⇒ a_i in G by using A → a_i. Then, whenever B ∈ CYK[i, j], C ∈ CYK[j+1, k], and A → BC ∈ R, it adds A to CYK[i, k] because B ∈ CYK[i, j] and C ∈ CYK[j+1, k] imply B ⇒* a_i…a_j and C ⇒* a_{j+1}…a_k, respectively, so
A ⇒ BC [A → BC]
⇒* a_i…a_j C
⇒* a_i…a_j a_{j+1}…a_k
When this construction cannot extend any set, we examine whether S ∈ CYK[1, n]. If so, S ⇒* a_1…a_n and x = a_1…a_n ∈ L(G), so the algorithm announces ACCEPT (see Convention 1.8);
otherwise, x ∉ L(G) and the algorithm announces REJECT.
Algorithm 3.45 Cocke-Younger-Kasami Parsing Algorithm.
Input • a grammar, G = (GΣ, GR), in the Chomsky normal form;
• w = a_1a_2…a_n with a_i ∈ G∆, 1 ≤ i ≤ n, for some n ≥ 1.
Output • ACCEPT if w ∈ L(G);
• REJECT if w ∉ L(G).
Method
begin
  introduce sets CYK[i, j] = ∅ for 1 ≤ i ≤ j ≤ n;
  for i = 1 to n do
    if A → a_i ∈ R then
      add A to CYK[i, i];
  repeat
    if B ∈ CYK[i, j], C ∈ CYK[j+1, k], and A → BC ∈ R for some A, B, C ∈ GN and some 1 ≤ i ≤ j < k ≤ n then
      add A to CYK[i, k];
  until no change;
  if S ∈ CYK[1, n] then ACCEPT
  else REJECT;
end.
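Algorithm 3.45 can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation. The function name cyk_accepts, the encoding of rules as tuples, and the sample grammar for {a^n b^n | n ≥ 1} are assumptions introduced here. The sketch also replaces the "repeat … until no change" loop with a fill ordered by substring length, a common way to guarantee that every set is complete before it is used.

```python
def cyk_accepts(terminal_rules, binary_rules, start, word):
    """Decide whether `word` is generated by a grammar in Chomsky normal form.

    terminal_rules: set of (A, a) pairs, one per rule A -> a
    binary_rules:   set of (A, B, C) triples, one per rule A -> BC
    (This rule encoding is a hypothetical choice made for the sketch.)
    """
    n = len(word)
    if n == 0:
        return False  # the algorithm assumes n >= 1
    # cyk[i][j] collects every nonterminal deriving word[i..j] (0-based, inclusive)
    cyk = [[set() for _ in range(n)] for _ in range(n)]
    for i, symbol in enumerate(word):
        for lhs, terminal in terminal_rules:
            if terminal == symbol:
                cyk[i][i].add(lhs)
    # Shorter substrings are completed before longer ones are examined
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            k = i + length - 1
            for j in range(i, k):
                for lhs, left, right in binary_rules:
                    if left in cyk[i][j] and right in cyk[j + 1][k]:
                        cyk[i][k].add(lhs)
    return start in cyk[0][n - 1]


# Sample grammar (an assumption): S -> AB | AT, T -> SB, A -> a, B -> b
TERMINAL_RULES = {("A", "a"), ("B", "b")}
BINARY_RULES = {("S", "A", "B"), ("S", "A", "T"), ("T", "S", "B")}

print(cyk_accepts(TERMINAL_RULES, BINARY_RULES, "S", "aabb"))
print(cyk_accepts(TERMINAL_RULES, BINARY_RULES, "S", "aab"))
```

The three nested loops over i, j, and k make the running time cubic in n for a fixed grammar.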
The pushdown-automaton-based parsers given so far work non-deterministically (Algorithms 3.12, 3.16, and 3.43), so they are of little importance in practice. In the real world, we are primarily interested in parsing algorithms underlain by easy-to-implement deterministic pushdown automata. These deterministic parsing algorithms work only for some special cases of grammars. Besides, as they are based upon deterministic pushdown automata, they are less powerful than the parsers discussed earlier in this chapter (see Theorem 3.19). Nevertheless, their determinism represents an enormous advantage that makes them by far the most popular parsing algorithms in practice, and that is why we discuss them in detail from a practical viewpoint in Chapters 4 and 5.
3.3.5 Syntax that Grammars cannot Specify
Sometimes, we come across syntactic programming-language constructs that cannot be specified by any grammar. To put it more theoretically, these constructs are out of the family of context-free languages (see the note following Definition 3.1 for this family). However, proving that a language L is out of this family may be a difficult task because it requires demonstrating that no possible grammar generates L. Frequently, we can simplify such a proof by demonstrating that L does not satisfy some condition that all context-free languages satisfy, so L cannot be context-free. As a result, conditions of this kind are important to syntax analysis, so we pay special attention to them in this section. First, we give the conditions of the pumping lemma; then, we present some useful closure properties that help prove that a language is not context-free.
Pumping Lemma and its Proof
The pumping lemma says that for every context-free language, L, there is a constant k ≥ 1 such that every z ∈ L with |z| ≥ k can be expressed as z = uvwxy with vx ≠ ε so that L also contains uv^m wx^m y, for every non-negative integer m. Consequently, to demonstrate the non-context-freedom of a language, K, by contradiction, assume that K is a context-free language, and k is its pumping lemma constant. Select a string z ∈ K with |z| ≥ k, consider all possible decompositions of z into uvwxy, and for each of these decompositions, prove that uv^m wx^m y is out of K, for some m ≥ 0, which contradicts the pumping lemma. Thus, K is not context-free.
Without any loss of generality, we prove the context-free pumping lemma based on the Chomsky normal form, discussed in Section 3.3.4. In addition, we make use of some other notions introduced earlier in this chapter, such as the parse tree pt(A ⇒* x) corresponding to a derivation A ⇒* x (see Definition 3.5).
Lemma 3.46. Let G = (Σ, R) be a grammar in the Chomsky normal form. For every parse tree pt(A ⇒* x), where A ∈ N and x ∈ ∆*, |x| ≤ 2^{depth(pt(A ⇒* x)) − 1}.
Proof by induction on depth(pt(A ⇒* x)) ≥ 1.
Basis. Let depth(pt(A ⇒* x)) = 1, where A ∈ N and x ∈ ∆*. Because G is in Chomsky normal form, A ⇒* x [A → x] in G, where x ∈ ∆, so |x| = 1 ≤ 2^{depth(pt(A ⇒* x)) − 1} = 2^0 = 1.
Induction Hypothesis. Suppose that this lemma holds for all parse trees of depth n or less, for some positive integer n.
Induction Step. Let depth(pt(A ⇒* x)) = n + 1, where A ∈ N and x ∈ ∆*. Let A ⇒* x [rρ] in G, where r ∈ R and ρ ∈ R*. As G is in Chomsky normal form, r: A → BC ∈ R, where B, C ∈ N. Let B ⇒* u [π], C ⇒* v [θ], π, θ ∈ R*, x = uv, ρ = πθ so that A ⇒* x can be expressed in greater detail as A ⇒ BC [r] ⇒* uC [π] ⇒* uv [θ]. Observe that depth(pt(B ⇒* u)) ≤ depth(pt(A ⇒* x)) − 1 = n, so |u| ≤ 2^{depth(pt(B ⇒* u)) − 1} by the induction hypothesis. Analogously, as depth(pt(C ⇒* v)) ≤ depth(pt(A ⇒* x)) − 1 = n, |v| ≤ 2^{depth(pt(C ⇒* v)) − 1}. Thus, |x| = |u| + |v| ≤ 2^{depth(pt(B ⇒* u)) − 1} + 2^{depth(pt(C ⇒* v)) − 1} ≤ 2^{n − 1} + 2^{n − 1} = 2^n = 2^{depth(pt(A ⇒* x)) − 1}.
Corollary 3.47. Let G = (Σ, R) be a grammar in the Chomsky normal form. For every parse tree pt(A ⇒* x), where A ∈ N and x ∈ ∆*, if |x| ≥ 2^m for some m ≥ 0, then depth(pt(A ⇒* x)) ≥ m + 1.
Proof. This corollary follows from Lemma 3.46 and the contrapositive law (see Section 1.1).
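The contrapositive step can be spelled out as a short derivation. In the sketch below, d abbreviates depth(pt(A ⇒* x)); this shorthand is introduced here only for brevity and is not part of the original notation.

```latex
% d abbreviates depth(pt(A =>* x)); the shorthand is an assumption of this sketch
\begin{align*}
  d \le m
    &\;\Longrightarrow\; |x| \le 2^{d-1} \le 2^{m-1} < 2^{m}
    && \text{(Lemma 3.46)} \\
  |x| \ge 2^{m}
    &\;\Longrightarrow\; d > m, \text{ that is, } d \ge m + 1
    && \text{(contrapositive of the first line)}
\end{align*}
```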
Lemma 3.48 Pumping Lemma for Context-Free Languages. Let L be an infinite context-free language. Then, there exists a positive integer, k, such that every string z ∈ L satisfying |z| ≥ k can be expressed as z = uvwxy, where 0 < |vx| < |vwx| ≤ k, and uv^m wx^m y ∈ L, for all m ≥ 0.
Proof. Let L be a context-free language, L = L(G) for a grammar, G = (Σ, R), in the Chomsky normal form. Set k = 2^{card(N)}. Let z ∈ L(G) satisfy |z| ≥ k. As z ∈ L(G), S ⇒* z, and by Corollary 3.47, depth(pt(S ⇒* z)) ≥ card(N) + 1, so pt(S ⇒* z) contains some subtrees in which there is a path with two or more nodes labeled by the same nonterminal. Express S ⇒* z as S ⇒* uAy ⇒+ uvAxy ⇒+ uvwxy with uvwxy = z so that the parse tree corresponding to A ⇒+ vAx ⇒+ vwx contains no proper subtree with a path containing two or more different nodes labeled with the same nonterminal.
Claim A. 0 < |vx| < |vwx| ≤ k
Proof. As G is in the Chomsky normal form, every rule in R has on its right-hand side either a terminal or two nonterminals. Thus, A ⇒+ vAx implies 0 < |vx|, and vAx ⇒+ vwx implies |vx| < |vwx|. As the parse tree corresponding to A ⇒+ vAx ⇒+ vwx contains no proper subtree with a path containing two different nodes labeled with the same nonterminal, depth(pt(A ⇒* vwx)) ≤ card(N) + 1, so by Lemma 3.46, |vx| < |vwx| ≤ 2^{card(N)} = k.
Claim B. For all m ≥ 0, uv^m wx^m y ∈ L.
Proof. As S ⇒* uAy ⇒+ uvAxy ⇒+ uvwxy, S ⇒* uAy ⇒+ uwy, so uv^0 wx^0 y = uwy ∈ L. Similarly, since S ⇒* uAy ⇒+ uvAxy ⇒+ uvwxy, S ⇒* uAy ⇒+ uvAxy ⇒+ uvvAxxy ⇒+ … ⇒+ uv^m Ax^m y ⇒+ uv^m wx^m y, so uv^m wx^m y ∈ L, for all m ≥ 1.
Thus, Lemma 3.48 holds true.
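To see pumping at work inside a context-free language, consider the following small sketch. The language {a^n b^n | n ≥ 1}, the string z = aabb, and its decomposition are illustrative assumptions, as are the helper names pump and in_anbn; the sketch only shows that this particular decomposition stays in the language for the first few values of m.

```python
def pump(u, v, w, x, y, m):
    """Return the string u v^m w x^m y."""
    return u + v * m + w + x * m + y


def in_anbn(s):
    """Membership test for {a^n b^n | n >= 1} (an example language chosen here)."""
    n = len(s) // 2
    return n >= 1 and s == "a" * n + "b" * n


# One decomposition of z = aabb with 0 < |vx| < |vwx|
u, v, w, x, y = "", "a", "ab", "b", ""
pumped = [pump(u, v, w, x, y, m) for m in range(4)]
print(pumped)
print(all(in_anbn(s) for s in pumped))
```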
Applications of the pumping lemma
We usually use the pumping lemma in a proof by contradiction to demonstrate that a given language L is not context-free. Typically, we make a proof of this kind as follows:
A. Assume that L is context-free.
B. Select a string z ∈ L whose length depends on the pumping-lemma constant k so that |z| ≥ k is necessarily true.
C. For all possible decompositions of z into uvwxy satisfying the pumping lemma conditions, find a non-negative integer m such that uv^m wx^m y ∉ L, which is a contradiction.
D. Make the conclusion that the assumption in A was incorrect, so L is not context-free.
Example 3.14 A Non-Context-Free Language. Consider L = {a^n b^n c^n | n ≥ 1}. Although this language looks quite simple at first glance, no grammar can specify it because L is not context-free, as proved next by following the recommended proof structure preceding this example.
A. Assume that L is context-free.
B. As L is context-free, there exists a natural number k satisfying Lemma 3.48. Set z = a^k b^k c^k with |z| = 3k ≥ k.
C. By Lemma 3.48, z can be written as z = uvwxy so that this decomposition satisfies the pumping lemma conditions. As 0 < |vx| < |vwx| ≤ k, vwx ∈ {a}*{b}* or vwx ∈ {b}*{c}*. If vwx ∈ {a}*{b}*, uv^0 wx^0 y has k cs but fewer than k as or bs, so uv^0 wx^0 y ∉ L, but by the pumping lemma, uv^0 wx^0 y ∈ L, a contradiction. If vwx ∈ {b}*{c}*, uv^0 wx^0 y has k as but fewer than k bs or cs, so uv^0 wx^0 y ∉ L, but by the pumping lemma, uv^0 wx^0 y ∈ L, a contradiction.
D. L is not context-free.
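Step C of this proof can be checked mechanically for small values of k. The following brute-force sketch enumerates every decomposition of a^k b^k c^k that satisfies 0 < |vx| < |vwx| ≤ k and confirms that some pumped string leaves the language. The function names and the tested range of k are assumptions made for illustration; a finite check of this kind supports the argument but does not replace the proof.

```python
def in_language(s):
    """Membership test for {a^n b^n c^n | n >= 1}."""
    n = len(s) // 3
    return n >= 1 and s == "a" * n + "b" * n + "c" * n


def decompositions(z, k):
    """Yield every (u, v, w, x, y) with z = uvwxy and 0 < |vx| < |vwx| <= k."""
    n = len(z)
    for start in range(n):
        # vwx = z[start:end]; it needs at least two symbols (one in vx, one in w)
        for end in range(start + 2, min(start + k, n) + 1):
            for p in range(start, end + 1):          # v = z[start:p]
                for q in range(p + 1, end + 1):      # w = z[p:q] is nonempty
                    if (p - start) + (end - q) > 0:  # vx is nonempty
                        yield z[:start], z[start:p], z[p:q], z[q:end], z[end:]


def every_decomposition_fails(k, max_m=2):
    """True if each admissible decomposition leaves the language for some m <= max_m."""
    z = "a" * k + "b" * k + "c" * k
    return all(
        any(not in_language(u + v * m + w + x * m + y) for m in range(max_m + 1))
        for u, v, w, x, y in decompositions(z, k)
    )


print(all(every_decomposition_fails(k) for k in range(2, 6)))
```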
Omitting some obvious details, we usually proceed in a briefer way than above when proving the non-context-freedom of a language by using the pumping lemma.
Example 3.15 A Short Demonstration of Non-Context-Freedom. Let L = {a^n b^m a^n b^m | n, m ≥ 1}.
Assume that L is context-free. Set z = a^k b^k a^k b^k with |a^k b^k a^k b^k| = 4k ≥ k. By Lemma 3.48, express z = uvwxy. Observe that 0 < |vx| < |vwx| ≤ k implies uwy ∉ L in all possible occurrences of vwx in a^k b^k a^k b^k; however, from the pumping lemma, uwy ∈ L, a contradiction. Thus, L is not context-free.
Even some seemingly trivial unary languages are not context-free as shown next.
Example 3.16 A Non-Context-Free Unary Language. Consider K = {a^i | i = n^2 for some n ≥ 0}.
To demonstrate the non-context-freedom of K, assume that K is context-free and select z = a^i ∈ K with i = k^2, where k is the pumping lemma constant. As a result, |z| = k^2 ≥ k, so z = uvwxy, which satisfies the pumping-lemma conditions. As |uv^2 wx^2 y| = k^2 + |vx| and 0 < |vx| < k, we have k^2 < |uv^2 wx^2 y| ≤ k^2 + k < k^2 + 2k + 1 = (k + 1)^2, so uv^2 wx^2 y ∉ K, but by Lemma 3.48, uv^2 wx^2 y ∈ K, a contradiction. Thus, K is not context-free.
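The arithmetic behind this example can be confirmed numerically: adding between 1 and k to k^2 never produces a perfect square. The sketch below checks this for a range of k; the function names and the upper limit of 200 are arbitrary choices made for illustration.

```python
from math import isqrt


def is_square(i):
    """True if i is a perfect square."""
    return isqrt(i) ** 2 == i


def pumped_lengths_avoid_squares(k):
    """Check that k^2 + d is not a perfect square for every 1 <= d <= k."""
    return all(not is_square(k * k + d) for d in range(1, k + 1))


print(all(pumped_lengths_avoid_squares(k) for k in range(1, 200)))
```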
Closure properties
Combined with the pumping lemma, the closure properties of context-free languages often significantly simplify a demonstration of the non-context-freedom of a language, L, in the following way. By contradiction, we first assume that L is context-free, and transform this language to a significantly simpler language, K, by using some operations under which the family of context-free languages is closed. Then, by the pumping lemma, we prove that K is not context-free, so neither is L, and the demonstration is completed.
Next, we discuss whether the family of context-free languages is closed under these operations:
• union
• concatenation
• closure
• intersection
• complement
• homomorphism
Union. To prove that the family of context-free languages is closed under union, we transform any two grammars to a grammar that generates the union of the languages generated by the two grammars.
Goal. Convert any two grammars, H and K, to a grammar G such that L(G) = L(H) ∪ L(K).
Gist. Consider any two grammars, H = (HΣ, HR) and K = (KΣ, KR). Without any loss of generality, suppose that HN ∩ KN = ∅ (if HN ∩ KN ≠ ∅, rename the nonterminals in either H or K so that HN ∩ KN = ∅ and the generated language remains unchanged). G = (GΣ, GR) contains all rules of H and K. In addition, we include GS → HS and GS → KS, where GS ∉ HN ∪ KN. If G generates x ∈ L(G) by a derivation starting with GS → HS, then x ∈ L(H). Analogously, if G generates x ∈ L(G) by a derivation starting with GS → KS, then x ∈ L(K). Conversely, every x ∈ L(H) ∪ L(K) is generated by G through a derivation that starts with GS → HS or GS → KS and then proceeds as in H or K. Thus, L(G) = L(H) ∪ L(K).
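The union construction can be sketched in Python as follows. The encoding of a grammar as a triple (nonterminals, rules, start), the helper names, and the two sample grammars are assumptions introduced for illustration. The bounded enumerator language_up_to is only a testing aid and assumes grammars without ε-rules, so that sentential forms never shrink.

```python
def tag(grammar, label):
    """Prefix every nonterminal with `label` so that two grammars share none."""
    nonterminals, rules, start = grammar

    def rename(symbol):
        return f"{label}.{symbol}" if symbol in nonterminals else symbol

    tagged_rules = {(rename(lhs), tuple(rename(s) for s in rhs)) for lhs, rhs in rules}
    return {rename(a) for a in nonterminals}, tagged_rules, rename(start)


def union_grammar(h, k, new_start="S"):
    """Build G with L(G) = L(H) ∪ L(K) by adding two rules from a fresh start symbol."""
    hn, hr, hs = tag(h, "H")
    kn, kr, ks = tag(k, "K")
    rules = hr | kr | {(new_start, (hs,)), (new_start, (ks,))}
    return hn | kn | {new_start}, rules, new_start


def language_up_to(grammar, max_len):
    """Enumerate all generated strings of length <= max_len (no ε-rules assumed)."""
    nonterminals, rules, start = grammar
    seen, frontier, words = {(start,)}, [(start,)], set()
    while frontier:
        form = frontier.pop()
        positions = [i for i, s in enumerate(form) if s in nonterminals]
        if not positions:
            words.add("".join(form))
            continue
        i = positions[0]  # rewriting the leftmost nonterminal suffices
        for lhs, rhs in rules:
            if lhs == form[i]:
                new = form[:i] + rhs + form[i + 1:]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    frontier.append(new)
    return words


# Sample grammars (assumptions): H generates a^n b^n, K generates c^n, n >= 1
H = ({"S"}, {("S", ("a", "S", "b")), ("S", ("a", "b"))}, "S")
K = ({"S"}, {("S", ("c", "S")), ("S", ("c",))}, "S")
G = union_grammar(H, K)
print(sorted(language_up_to(G, 4)))
```

Both sample grammars deliberately use the same nonterminal name, S, so the renaming step that makes the nonterminal sets disjoint is actually exercised.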