Lexical Analysis
Lemma 2.56. Algorithm 2.55 correctly decides the infiniteness problem for M.
Proof. We prove this lemma by demonstrating that M-infiniteness-test ≠ ∅ if and only if L(M) is infinite.
Only If. Let M-infiniteness-test ≠ ∅. Consider any z ∈ M-infiniteness-test. By Lemma 2.38, z = uvw, where 0 < |v| ≤ |uv| ≤ k (k is the pumping lemma constant), and uv^m w ∈ L(M), for all m ≥ 0.
As |v| > 0, the strings uv^m w have pairwise different lengths. Thus, L(M) is infinite.
If. We prove by contradiction that if L(M) is infinite, then M-infiniteness-test ≠ ∅. Assume that L(M) is infinite and M-infiniteness-test = ∅. Let z be the shortest string such that z ∈ L(M) and |z| ≥ 2k; such a string exists because L(M) is infinite.
As |z| ≥ 2k ≥ k, Lemma 2.38 implies that z = uvw, where 0 < |v| ≤ |uv| ≤ k, and uv^m w ∈ L(M), for all m ≥ 0. Take uv^0 w = uw ∈ L(M). Observe that 2k > |uw| because 2k ≤ |uw| < |z| would contradict that z is the shortest string satisfying z ∈ L(M) and |z| ≥ 2k. As 0 < |v| ≤ k and |z| ≥ 2k, |uw| = |z| − |v| ≥ k, so k ≤ |uw| < 2k ≤ |z|. Hence uw ∈ M-infiniteness-test, which contradicts M-infiniteness-test = ∅. Thus, if L(M) is infinite, then M-infiniteness-test ≠ ∅.
As we can decide the infiniteness problem for finite automata, we can also algorithmically decide the finiteness problem for finite automata.
Instance: M = (MΣ, MR).
Question: Is L(M) finite?
Algorithm 2.57 Decision of Finiteness Problem for Finite Automata. If M-infiniteness-test ≠ ∅, set answer to false; otherwise, set answer to true.
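The following Python fragment is a minimal sketch of Algorithm 2.57, not a definitive implementation. It assumes that M is deterministic, that k may be taken as the number of states of M, and that M-infiniteness-test = {x | x ∈ L(M), k ≤ |x| < 2k}, as the proof of Lemma 2.56 suggests. The dictionary encoding of the rules and all function names are illustrative assumptions. Instead of enumerating strings, the sketch tracks the states reachable by reading exactly n symbols.

```python
def states_by_length(delta, alphabet, start, max_len):
    """Yield (n, states reachable from start by reading exactly n symbols)."""
    current = {start}
    for n in range(max_len + 1):
        yield n, current
        current = {
            delta[(q, a)]
            for q in current
            for a in alphabet
            if (q, a) in delta  # missing entries model an incompletely specified automaton
        }


def infiniteness_test_nonempty(states, alphabet, delta, start, finals):
    """True if some accepted string x satisfies k <= |x| < 2k (assumed test set)."""
    k = len(states)  # assumption: the number of states serves as the pumping constant
    for n, reachable in states_by_length(delta, alphabet, start, 2 * k - 1):
        if n >= k and reachable & set(finals):
            return True
    return False


def is_finite(states, alphabet, delta, start, finals):
    """Algorithm 2.57: the answer is false exactly when the test set is nonempty."""
    return not infiniteness_test_nonempty(states, alphabet, delta, start, finals)


# Hypothetical example automata (not taken from the text).
INFINITE = dict(
    states={"s", "q", "f"},
    alphabet="ab",
    delta={("s", "a"): "s", ("s", "b"): "q", ("q", "a"): "f",
           ("q", "b"): "f", ("f", "a"): "f", ("f", "b"): "f"},
    start="s",
    finals={"q", "f"},
)
ONLY_AB = dict(
    states={"s", "p", "f"},
    alphabet="ab",
    delta={("s", "a"): "p", ("p", "b"): "f"},
    start="s",
    finals={"f"},
)

print(is_finite(**INFINITE))  # False: the accepted language is infinite
print(is_finite(**ONLY_AB))   # True: only the string ab is accepted
```

Tracking reachable state sets keeps the work polynomial in the number of states, whereas listing every string of length between k and 2k − 1 would be exponential.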
Equivalence problem. Consider two tokens, t and u, and their corresponding regular languages, t-lexemes and u-lexemes, respectively. Furthermore, assume that t-lexemes and u-lexemes are specified by finite automata, M and N, respectively; formally, t-lexemes = L(M) and u-lexemes = L(N). If M is equivalent to N, then t-lexemes = u-lexemes. In that case, the lexical analyzer probably contains some incorrect, duplicate, or superfluous parts, so its designer should carefully revise its construction. As a result, an algorithm that decides the equivalence of two finite automata is clearly relevant to lexical analysis.
Equivalence problem for finite automata.
Instance: Finite automata M = (MΣ, MR) and N = (NΣ, NR).
Question: Is M equivalent to N?
Let M and N be two completely specified deterministic finite automata. As an exercise, prove that L(M) = L(N) if and only if H = ∅, where H = (L(M) ∩ complement(L(N))) ∪ (L(N) ∩ complement(L(M))). Design an algorithm that converts any two finite automata M and N to a finite automaton O such that L(O) = H. By using the algorithms given in Section 2.3.3, convert O to an equivalent completely specified deterministic finite automaton, P. By Algorithm 2.53, decide whether L(P) = ∅. If L(P) = ∅, M is equivalent to N; otherwise, M is not equivalent to N. A precise description of an algorithm that decides this problem is left as an exercise.
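One possible way to realize this idea is sketched below in Python. It is only an illustration under stated assumptions: both automata are already completely specified and deterministic over the same input alphabet, and the encoding of rules as dictionaries as well as the function name are hypothetical. Rather than building O explicitly, the sketch searches the reachable pairs of states and reports a difference as soon as exactly one component of a pair is final, which corresponds to H ≠ ∅.

```python
from collections import deque


def equivalent(alphabet, delta_m, start_m, finals_m, delta_n, start_n, finals_n):
    """Decide L(M) = L(N) for two complete DFAs by a breadth-first product search."""
    seen = {(start_m, start_n)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        # A reachable pair with exactly one final component witnesses a string in H.
        if (p in finals_m) != (q in finals_n):
            return False
        for a in alphabet:
            successor = (delta_m[(p, a)], delta_n[(q, a)])
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return True  # no reachable pair disagrees, so H is empty


# Hypothetical example: a three-state and a two-state automaton for {a}*{b}{a, b}*.
DELTA_M = {("s", "a"): "s", ("s", "b"): "q", ("q", "a"): "f",
           ("q", "b"): "f", ("f", "a"): "f", ("f", "b"): "f"}
DELTA_N = {("s", "a"): "s", ("s", "b"): "g", ("g", "a"): "g", ("g", "b"): "g"}

print(equivalent("ab", DELTA_M, "s", {"q", "f"}, DELTA_N, "s", {"g"}))  # True
print(equivalent("ab", DELTA_M, "s", {"q", "f"}, DELTA_N, "s", {"s"}))  # False
```

The search visits at most card(MQ) · card(NQ) pairs, so it terminates on every instance that satisfies the assumptions.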
Exercises
2.1. Consider the language consisting of all identifiers defined as alphanumerical strings ending with a letter. Construct a finite automaton that accepts this language. Construct a regular expression that denotes this language.
2.2. Consider a language of all Pascal identifiers of even length. Construct a finite automaton that accepts this language and a regular expression that denotes it.
2.3. Let L = {x| x ∈ {a, b}*, ab ∉ substring(x), and |x| is divisible by 3} be the language over Σ = {a, b}. Construct a finite automaton that accepts this language and a regular expression that denotes it.
2.4. Let L = {x| x ∈ {a, b}*, aa ∉ suffixes(x)}. Construct a regular expression that denotes this language. Explain why this construction is more difficult than the construction of a finite automaton that accepts this language.
2.5. Consider the following Pascal program. Does the program contain any lexical errors? If so, specify them. If not, what is the token sequence produced by a Pascal scanner working with this text? Which of these tokens are attributed?
program trivialprogram(input, output);
var i : integer;
function result(k: integer): integer;
begin
result := k * k + k end;
begin read(i);
writeln('For ', i, ', the result is ', result(i)) end.
2.6. Detect all lexical errors, if any, appearing in the following Pascal fragment. How should a Pascal lexical analyzer handle each of these errors?
while a <> 0.2+E2 do writeln(" ');
if a > 24. than writewrite('hi') else
write ('bye);
fori := 1 through 10 todo writeln(Iteration: ', i:1);
2.7. Write a regular expression that defines programming language comments delimited by { and } in a Pascal-like way. Suppose that such a comment may contain any number of either {s or }s; however, it does not contain both { and }. Write a program that implements a finite automaton equivalent to the expression.
2.8. Design a regular expression that defines Pascal-like real numbers without any superfluous leading or trailing 0s; for instance, 10.01 and 1.204 are legal, but 01.10 and 01.240 are not.
2.9. FORTRAN ignores blanks. Explain why this property implies that a FORTRAN lexical analyzer may need an extensive lookahead to determine how to scan a source program. Consider DO 9 J = 1, 9 and DO 10 J = 1. 9. Describe the lexical analysis of both fragments in detail. Are both correct in FORTRAN? If not, justify your answer in detail. If so, how many lexemes are in the former fragment? How many lexemes are in the other?
2.10. Modify FUN lexemes by reversing them; for instance, identifiers are alphanumerical strings that end with a letter. Which lexemes remain unchanged after this modification? Write a program to implement a FUN scanner that recognizes these modified lexemes.
2.11. Consider the next Ada source text. What are the Ada lexemes contained in this text?
Translate these lexemes to the corresponding tokens.
with Ada.Text_IO;
procedure Lexemes is begin
Ada.Text_IO.Put_Line("Find lexemes!");
end Lexemes;
2.12. Design a data structure for representing tokens, including attributed tokens. Based on this structure, implement the tokens given in Case Study 3/35 Lexemes and Tokens.
2.13. In detail, design a cooperation between a lexical analyzer and a symbol-table handler.
Describe how to store and find identifiers in the table. Furthermore, explain how to distinguish keywords from identifiers.
2.14. Write a program to simplify the source-program text, including the conversion of the uppercase letters to the lowercase letters and the removal of all comments and blanks.
2.15. Write a program to implement the representations of finite automata described in Example 2.2 Tabular and Graphical Representation.
2.16. Write a program that transforms the tabular representation of any completely specified deterministic finite automaton to a C program that represents the implementation described in Algorithm 2.8 Implementation of a Finite Automaton—Tabular Method.
2.17. Write a program that transforms the tabular representation of any completely specified deterministic finite automaton to a Pascal program that represents the implementation described in Algorithm 2.9 Implementation of a Finite Automaton—Case-Statement Method.
2.18. Write a program to implement the scanner described in Case Study 4/35 Scanner.
2.19. Write a program to implement
(a) Algorithm 2.19 Finite Automaton for Union;
(b) Algorithm 2.22 Finite Automaton for Concatenation;
(c) Algorithm 2.24 Finite Automaton for Iteration.
2.20. Describe the transformation of regular expressions to finite automata formally as an algorithm. Write a program to implement this algorithm.
2.21. Give a fully rigorous proof that for any regular expression, there exists an equivalent finite automaton (see Lemma 2.26).
2.22. Give a fully rigorous proof that for every finite automaton, there exists an equivalent regular expression (see Lemma 2.27).
2.23. Consider each of the following languages. By the pumping lemma for regular languages (Lemma 2.38), demonstrate that the language is not regular.
(a) {a^i b a^i| i ≥ 1}
(b) {a^i b^j| 1 ≤ i ≤ j}
(c) {a^(2^i)| i ≥ 0}
(d) {a^i b^j c^k| i, j, k ≥ 0 and i = k + j}
(e) {a^i b^i c^j| i, j ≥ 0 and i ≤ j ≤ 2i}
(f) {a^i b^j c^k| i, j, k ≥ 0, i ≠ j, k ≠ i, and j ≠ k}
(g) {a^i b^j c^j d^i| i, j ≥ 0 and j ≤ i}
(h) {a^i b^(2i)| i ≥ 0}
2.24. Introduce a set of ten regular languages such that each of them contains infinitely many subsets that represent regular languages. For each language, L, in this set, define a non-regular language, K, such that K ⊆ L and prove that K is not regular by the regular pumping lemma. For instance, if L = {a, b}*, K = {a^n b^n| n ≥ 0} represents one of infinitely many non-regular languages such that K ⊆ L.
2.25. Prove the pumping lemma for regular languages (Lemma 2.38) in terms of regular expressions.
2.26. Prove the following modified version of the pumping lemma for regular languages.
Lemma 2.58. Let L be a regular language. Then, there is a natural number, k, such that if xzy ∈ L and |z| = k, then z can be written as z = uvw, where |v| ≥ 1 and xuv^m wy ∈ L, for all m ≥ 0.
2.27. Use Lemma 2.58, established in Exercise 2.26, to prove that {a^i b^j c^j| i, j ≥ 1} is not regular.
2.28. Prove that a language, L, over an alphabet Σ, is regular if and only if there exists a natural number, k, satisfying this statement: if z ∈ Σ* and |z| ≥ k, then (1) z = uvw with v ≠ ε and (2) zx ∈ L if and only if uv^m wx ∈ L, for all m ≥ 0 and x ∈ Σ*. Explain how to use this lemma to prove that a language is regular. Then, explain how to use this lemma to prove that a language is not regular.
2.29. Let L be a regular language over an alphabet Σ. Are complement(L), substrings(L), prefixes(L), suffixes(L), reversal(L), L*, and L+ regular as well (see Section 1.1)?
2.30. Let L be a regular language over an alphabet Σ. Define the next language operations. For each of these operations, prove or disprove that the family of regular languages is closed under the operation.
(a) min(L) = {w| w ∈ L and {prefix(w) − {w}} ∩ L = ∅}
(b) max(L) = {w| w ∈ L and {w}Σ+ ∩ L = ∅}
(c) sqrt(L) = {x| xy ∈ L for some y ∈ Σ*, and |y| = |x|^2}
(d) log(L) = {x| xy ∈ L for some y ∈ Σ*, and |y| = 2^|x|}
(e) cycle(L) = {vw| wv ∈ L for some v, w ∈ Σ*}
(f) half(L) = {w| wv ∈ L for some v ∈ Σ*, and |w| = |v|}
(g) inv(L) = {xwy| xzy ∈ L for some x, y, w, z ∈ Σ*, z = reversal(w)}
2.31. For a language L over an alphabet Σ and a symbol, a ∈ Σ, eraser_a(L) denotes the language obtained by removing all occurrences of a from the words of L. Formalize eraser_a(L). Is the family of regular languages closed under this operation?
2.32 Solved. Theorem 2.49 has demonstrated that the family of regular languages is closed under regular substitution in terms of finite automata. Prove this important closure property in terms of regular expressions.
2.33. Consider these languages. Use the closure properties of the regular languages and the regular pumping lemma to demonstrate that these languages are not regular.
(a) {w| w ∈ {a, b}* and occur(w, a) = 2occur(w, b)}
(b) {0^i 1 0^i| i ≥ 1}
(c) {wcv| w ∈ {a, b}* and v = reversal(w)}
2.34. In Section 2.3.4, we have described all algorithms rather informally. Give these algorithms formally and verify them in a rigorous way.
2.35. Reformulate all decision problems discussed in Section 2.3.4 in terms of regular expressions. Design algorithms that decide the problems reformulated in this way.
2.36. Consider Equivalence problem for finite automata and regular expressions.
• Instance: A finite automaton, M, and a regular expression, E.
• Question: Is M equivalent to E?
Design an algorithm that decides this problem.
2.37. Consider Computational multiplicity problem for finite automata.
• Instance: A finite automaton, M = (Σ, R), and w ∈ Σ*.
• Question: Can M compute sw ⇒* f [ρ] and sw ⇒* f' [ρ'] so that f, f' ∈ F and ρ ≠ ρ'?
Design an algorithm that decides this problem.
2.38 Solved. Let M = (Σ, R) be a deterministic finite automaton (M may not be completely specified, however) and k = card(Q). Prove that L(M) = ∅ if and only if {x| x ∈ Σ*, x ∈ L(M), and |x| < k} = ∅. Notice that from this equivalence, it follows that the emptiness problem for finite automata is decidable.
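To illustrate why this equivalence yields a decision procedure, the following Python sketch simply tries every input string shorter than k. It is a hedged illustration, not the solution given in the book: the dictionary encoding of the rules and the function names are assumptions, and the enumeration takes exponential time, so a reachability search would be preferable in practice.

```python
from itertools import product


def accepts(delta, start, finals, word):
    """Run a deterministic, possibly incompletely specified, automaton on word."""
    state = start
    for a in word:
        if (state, a) not in delta:
            return False  # no applicable rule: the automaton gets stuck
        state = delta[(state, a)]
    return state in finals


def is_empty(states, alphabet, delta, start, finals):
    """L(M) is empty iff no string of length less than k = card(Q) is accepted."""
    k = len(states)
    for n in range(k):
        for word in product(alphabet, repeat=n):
            if accepts(delta, start, finals, word):
                return False
    return True


# Hypothetical examples: f is unreachable in the first automaton.
print(is_empty({"s", "f"}, "ab", {("s", "a"): "s"}, "s", {"f"}))   # True
print(is_empty({"s", "p", "f"}, "ab",
               {("s", "a"): "p", ("p", "b"): "f"}, "s", {"f"}))    # False
```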
2.39. Consider
Definition 2.59 Minimal Finite Automaton. Let M = (Σ, R) be a deterministic finite automaton.
M is a minimal finite automaton if every deterministic finite automaton that accepts L(M) has no fewer states than M has.
To explain this definition, observe that a deterministic finite automaton, M, may contain some redundant states that can be merged together without any change of the accepted language. For instance, introduce a deterministic finite automaton, M, with these six rules: sa → s, sb → q, qa → f, qb → f, fa → f, and fb → f. Clearly, L(M) = {a}*{b}{a, b}*. Consider another deterministic finite automaton, N, obtained from M by merging states q and f into a single state, g, and replacing the six rules with these four rules: sa → s, sb → g, ga → g, and gb → g. Clearly, both automata are equivalent. However, N has two states while M has three. In a minimal finite automaton, there exist no redundant states that can be merged together without changing the accepted language. For example, N is a minimal finite automaton while M is not. Design an algorithm that turns any deterministic finite automaton into an equivalent minimal finite automaton.
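As a starting point for this exercise, the Python sketch below groups states that no input string can tell apart, using the classical partition-refinement idea. It is not necessarily the algorithm the text has in mind. It assumes a completely specified deterministic automaton in which every state is reachable; the encoding, the function name, and the choice of {q, f} as the final states of M (inferred from L(M) = {a}*{b}{a, b}*) are assumptions.

```python
def mergeable_blocks(states, alphabet, delta, finals):
    """Partition states into blocks of mutually indistinguishable states."""
    symbols = sorted(alphabet)
    # Start by separating final states from non-final ones.
    block_of = {q: int(q in finals) for q in states}
    while True:
        # A state's signature: its own block and the blocks of its successors.
        signature = {
            q: (block_of[q],) + tuple(block_of[delta[(q, a)]] for a in symbols)
            for q in states
        }
        ids = {}
        refined = {q: ids.setdefault(signature[q], len(ids)) for q in states}
        if len(ids) == len(set(block_of.values())):
            break  # no block was split, so the partition is stable
        block_of = refined
    groups = {}
    for q, block in block_of.items():
        groups.setdefault(block, set()).add(q)
    return {frozenset(group) for group in groups.values()}


# The automaton M from the text; the final states are an inferred assumption.
DELTA = {("s", "a"): "s", ("s", "b"): "q", ("q", "a"): "f",
         ("q", "b"): "f", ("f", "a"): "f", ("f", "b"): "f"}
print(mergeable_blocks({"s", "q", "f"}, "ab", DELTA, {"q", "f"}))
```

For M, the sketch places q and f in one block and s in another, which matches the merge of q and f into g described above. Turning each block into a single state then gives a candidate for the minimal automaton.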
2.40 Solved. Reconsider Example 2.13 Primes are Non-Regular, which demonstrates that L = {a^n| n is a prime} is not regular. Give an alternative proof of this result by using the pumping lemma.
2.41. Design tabular and graphical representations for finite-state transducers.
2.42. Write a program to implement
(a) Algorithm 2.29 States Reachable without Reading;
(b) Algorithm 2.31 Removal of ε-Rules;
(c) Algorithm 2.34 Determinism;
(d) Algorithm 2.36 Determinism with Reachable States.
2.43. Discuss Example 2.8 Removal of ε-Rules in detail.
2.44. Give a rigorous proof of Lemma 2.35.
2.45. Prove that Algorithm 2.36 Determinism with Reachable States correctly converts a finite automaton M = (MΣ, MR) without ε-rules to a deterministic finite automaton N = (NΣ, NR) such that L(N) = L(M) and NQ contains only reachable states.
2.46. Consider
Definition 2.60 Lazy Finite Automaton and its Language. A lazy finite automaton is a rewriting system, M = (Σ, R), where
• Σ is divided into two pairwise disjoint subalphabets Q and ∆;
• R is a finite set of rules of the form qx → p, where q, p ∈ Q and x ∈ ∆*.
Q and ∆ are referred to as the set of states and the alphabet of input symbols, respectively. Q contains a state called the start state, denoted by s, and a set of final states, denoted by F. As in any rewriting system, we define u ⇒ v, u ⇒^n v with n ≥ 0, and u ⇒* v, where u, v ∈ Σ*. If sw ⇒* f in M, where w ∈ ∆* and f ∈ F, M accepts w. The set of all strings that M accepts is the language accepted by M, denoted by L(M).
Definition 2.60 generalizes finite automata (see Definition 2.2). Explain this generalization.
Design an algorithm that turns any lazy finite automaton to an equivalent deterministic finite automaton.
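A natural first step toward such an algorithm is to break every rule qx → p with |x| ≥ 2 into a chain of rules that read one symbol each. The Python sketch below shows only this step, under stated assumptions: rules are encoded as triples (q, x, p) with x a Python string, and the fresh intermediate states are hypothetical tuples of the form ("new", i). The resulting automaton may still contain ε-rules and nondeterminism, which the algorithms of Section 2.3.3 are presumably meant to remove.

```python
def split_lazy_rules(rules):
    """Replace each rule q x -> p with |x| >= 2 by a chain of one-symbol rules."""
    result = []
    fresh = 0
    for q, x, p in rules:
        if len(x) <= 1:
            result.append((q, x, p))  # already an ordinary rule or an epsilon-rule
            continue
        previous = q
        for a in x[:-1]:
            fresh += 1
            middle = ("new", fresh)  # hypothetical fresh state, not in the original Q
            result.append((previous, a, middle))
            previous = middle
        result.append((previous, x[-1], p))
    return result


# Hypothetical lazy rules: s if -> f and s -> f (reading the empty string).
for rule in split_lazy_rules([("s", "if", "f"), ("s", "", "f")]):
    print(rule)
```

Because each fresh state belongs to exactly one chain, a chain can be traversed only from its first rule to its last one, so the accepted language stays the same.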
2.47. Consider the language consisting of FUN keywords and identifiers. Design a lazy finite automaton that accepts this language (see Exercise 2.46). Then, convert this automaton to an equivalent deterministic finite automaton.
2.48. Consider
Definition 2.61 Loop-Free Finite Automaton. Let M = (Σ, R) be a deterministic finite automaton (see Definition 2.6). A state, q ∈ MQ, is looping if there exists x ∈ ∆+ such that qx ⇒* q. M is loop-free if no state in MQ is looping.
Prove the next theorem.
Theorem 2.62. A language L is finite if and only if L = L(M), where M is a loop-free finite