
Formal Models and Computability


6.4.3 Context-Free Grammars and Parsing

From a practical point of view, for each grammar G = (Σ, N, S, P) representing some language, the following two problems are important:

1. (Membership) Given a string over Σ, does it belong to L(G)?

2. (Parsing) Given a string in L(G), how can it be derived from S?

The importance of the membership problem is quite obvious: given an English sentence or computer program we wish to know if it is grammatically correct or has the right format. Parsing is important because a derivation usually allows us to interpret the meaning of the string. For example, in the case of a Pascal program, a derivation of the program in Pascal grammar tells the compiler how the program should be executed. The following theorem illustrates the decidability of the membership problem for the four classes of grammars in the Chomsky hierarchy. The proofs can be found in Chomsky [1963], Harrison [1978], and Hopcroft and Ullman [1979].

Theorem 6.5 The membership problem for type-0 grammars is undecidable in general, but it is decidable for any context-sensitive grammar (and thus for any context-free or right-linear grammar).

Because context-free grammars play a very important role in describing computer programming languages, we discuss the membership and parsing problems for context-free grammars in more detail.

First, let us look at another example of a context-free grammar. For convenience, let us abbreviate a set of productions with the same left-hand side nonterminal

A → α_1, ..., A → α_n

as

A → α_1 | ··· | α_n

Example 6.12

We construct a context-free grammar for the set of all valid Pascal real values. In general, a real constant in Pascal has one of the following forms:

m.n, meq, m.neq,

where m and q are signed or unsigned integers and n is an unsigned integer. Let Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, e, +, −, .}, N = {S, M, N, D}, and let the set P of productions contain

S → M.N | MeM | M.NeM

M → N | +N | −N

N → DN | D

D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Then the grammar generates all valid Pascal real values (including some absurd ones like 001.200e000).

The value 12.3e−4 can be derived as

S ⇒ M.NeM ⇒ N.NeM ⇒ DN.NeM ⇒ 1N.NeM ⇒ 1D.NeM

⇒ 12.NeM ⇒ 12.DeM ⇒ 12.3eM ⇒ 12.3e−N ⇒ 12.3e−D ⇒ 12.3e−4

Perhaps the most natural representation of derivations for a context-free grammar is a derivation tree or a parse tree. Each internal node of such a tree corresponds to a nonterminal and each leaf corresponds to a terminal. If A is an internal node with children B_1, ..., B_n ordered from left to right, then A → B_1 ··· B_n must be a production. The concatenation of all leaves from left to right yields the string being derived. For example, the derivation tree corresponding to the preceding derivation of 12.3e−4 is given in Figure 6.3.

Such a tree also makes possible the extraction of the parts 12, 3, and −4, which are useful in the storage of the real value in a computer memory.
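The extraction of these parts can be sketched in Python. This is a minimal sketch and not part of the original text: the name parse_real is hypothetical, the ASCII hyphen '-' stands in for the minus sign, and D is assumed to range over all ten digits. Because this particular grammar is so simple, a regular expression that mirrors the productions for S, M, and N is enough; a general context-free grammar would need a real parser.

```python
import re

# M -> N | +N | -N and N -> DN | D give an optionally signed digit string.
# S -> M.N | MeM | M.NeM means: mantissa M, then a fraction ".N", an
# exponent "eM", or both (but not neither).
_REAL = re.compile(r"([+-]?[0-9]+)(?:\.([0-9]+))?(?:e([+-]?[0-9]+))?")


def parse_real(text):
    """Return the parts (m, n, q) of a real value, or None if text is rejected.

    A part that does not occur in the string is reported as None.
    """
    match = _REAL.fullmatch(text)
    if match is None:
        return None
    m, n, q = match.groups()
    if n is None and q is None:
        return None  # a bare integer M is not derivable from S
    return m, n, q


if __name__ == "__main__":
    for sample in ["12.3e-4", "001.200e000", "12", "1e5", "1."]:
        print(sample, parse_real(sample))
```

The parts are returned as strings so that leading zeros, as in 001.200e000, remain visible; converting them to numbers is left to the caller.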

Definition 6.18 A context-free grammar G is ambiguous if there is a string x ∈ L(G) that has two distinct derivation trees. Otherwise G is unambiguous.

FIGURE 6.3 The derivation tree for 12.3e−4.

FIGURE 6.4 Different derivation trees for the expression 1 + 2 ∗ 3 + 4.

Unambiguity is a very desirable property because it allows a unique interpretation of each sentence in the language. It is not hard to see that the preceding grammar for Pascal real values and the grammar G₁ defined in Example 6.7 are both unambiguous. The following example shows an ambiguous grammar.

Example 6.13

Consider a grammar G₇ for all valid arithmetic expressions that are composed of unsigned positive integers and the symbols +, ∗, (, ). For convenience, let us use the symbol n to denote any unsigned positive integer.

This grammar has the productions

S → T + S | S + T | T

T → F ∗ T | T ∗ F | F

F → n | (S)

Two possible different derivation trees for the expression 1 + 2 ∗ 3 + 4 are shown in Figure 6.4. Thus, G₇ is ambiguous. The left tree means that the first addition should be done before the second addition and the right tree says the opposite.

Although in the preceding example different derivations/interpretations of any expression always result in the same value because the operations addition and multiplication are associative, there are situations where the difference in the derivation can affect the final outcome. Actually, the grammar G₇ can be made unambiguous by removing some (redundant) productions, for example, S → T + S and T → F ∗ T. This corresponds to the convention that a sequence of consecutive additions (or multiplications) is always evaluated from left to right, and it does not change the language generated by G₇.

It is worth noting that there are context-free languages which cannot be generated by any unambiguous context-free grammar [Hopcroft and Ullman 1979]. Such languages are said to be inherently ambiguous. An example of an inherently ambiguous language is the set

{0^m 1^m 2^n 3^n | m, n > 0} ∪ {0^m 1^n 2^n 3^m | m, n > 0}
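Returning to the grammar G₇ of Example 6.13, its ambiguity and the effect of dropping the two productions can be checked mechanically by counting derivation trees. The following Python sketch is an illustration under stated assumptions, not a method from the text: the function count_trees and the dictionary encoding of productions are hypothetical, and the counting only terminates for grammars that have no ε-productions and no cycle of unit productions (both hold here).

```python
from functools import lru_cache


def count_trees(productions, start, tokens):
    """Count the derivation trees of tokens from start.

    productions maps each nonterminal to a list of right-hand sides, each
    a tuple of symbols; any symbol that is not a key is a terminal.
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        # Number of trees rooted at symbol whose leaves spell tokens[i:j].
        if symbol not in productions:
            return 1 if j - i == 1 and tokens[i] == symbol else 0
        return sum(ways(rhs, i, j) for rhs in productions[symbol])

    @lru_cache(maxsize=None)
    def ways(rhs, i, j):
        # Number of ways to split tokens[i:j] among the symbols of rhs,
        # giving every symbol at least one token.
        if len(rhs) == 1:
            return count(rhs[0], i, j)
        first, rest = rhs[0], rhs[1:]
        return sum(count(first, i, k) * ways(rest, k, j)
                   for k in range(i + 1, j - len(rest) + 1))

    return count(start, 0, len(tokens))


# G7 as given in Example 6.13, with n standing for any unsigned integer.
G7 = {
    "S": [("T", "+", "S"), ("S", "+", "T"), ("T",)],
    "T": [("F", "*", "T"), ("T", "*", "F"), ("F",)],
    "F": [("n",), ("(", "S", ")")],
}

# G7 without the productions S -> T + S and T -> F * T.
G7_REDUCED = {
    "S": [("S", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("n",), ("(", "S", ")")],
}

if __name__ == "__main__":
    expression = "n+n*n+n"  # the shape of 1 + 2 * 3 + 4
    print(count_trees(G7, "S", expression))
    print(count_trees(G7_REDUCED, "S", expression))
```

For the expression 1 + 2 ∗ 3 + 4, the count for G₇ is greater than one, while the reduced grammar yields exactly one tree. A count of one for a single string does not by itself prove that a grammar is unambiguous; it only confirms the claim for that string.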

We end this section by presenting an efficient algorithm for the membership problem for context-free grammars. The algorithm is due to Cocke, Younger, and Kasami [Hopcroft and Ullman 1979] and is often called the CYK algorithm. Let G = (Σ, N, S, P) be a context-free grammar. For simplicity, let us assume that G does not generate the empty string ε and that G is in the so-called Chomsky normal form [Chomsky 1963], that is, every production of G is either of the form A → BC, where B and C are nonterminals, or of the form A → a, where a is a terminal. An example of such a grammar is G₁ given in Example 6.7. This is not a restrictive assumption because there is a simple algorithm which can convert every context-free grammar that does not generate ε into one in Chomsky normal form.

Suppose that x = a_1 ··· a_n is a string of n terminals. The basic idea of the CYK algorithm, which decides if x ∈ L(G), is dynamic programming. For each pair i, j, where 1 ≤ i ≤ j ≤ n, define a set X_{i,j} ⊆ N as

X_{i,j} = {A | A ⇒* a_i ··· a_j}

Thus, x ∈ L(G) if and only if S ∈ X_{1,n}. The sets X_{i,j} can be computed inductively in ascending order of j − i. It is easy to figure out X_{i,i} for each i because X_{i,i} = {A | A → a_i ∈ P}. Suppose that we have computed all X_{i,j} where j − i < d for some d > 0. To compute a set X_{i,j}, where j − i = d, we just have to find all of the nonterminals A such that there exist some nonterminals B and C satisfying A → BC ∈ P and, for some k with i ≤ k < j, B ∈ X_{i,k} and C ∈ X_{k+1,j}. A rigorous description of the algorithm in a Pascal-style pseudocode is given as follows.

Algorithm CYK(x = a_1 ··· a_n):

    for i ← 1 to n do
        X_{i,i} ← {A | A → a_i ∈ P}
    for d ← 1 to n − 1 do
        for i ← 1 to n − d do
            X_{i,i+d} ← ∅
            for t ← 0 to d − 1 do
                X_{i,i+d} ← X_{i,i+d} ∪ {A | A → BC ∈ P for some B ∈ X_{i,i+t} and C ∈ X_{i+t+1,i+d}}

Table 6.2 shows the sets X_{i,j} for the grammar G₁ and the string x = 000111. It just so happens that every X_{i,j} is either empty or a singleton. The computation proceeds from the main diagonal toward the upper-right corner.

TABLE 6.2 An Example Execution of the CYK Algorithm

    x        0    0    0    1    1    1
    i \ j    1    2    3    4    5    6
      1      O    ∅    ∅    ∅    ∅    S
      2           O    ∅    ∅    S    T
      3                O    S    T    ∅
      4                     I    ∅    ∅
      5                          I    ∅
      6                               I
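The CYK pseudocode translates almost line by line into Python. The sketch below is a minimal illustration rather than a definitive implementation: the function names cyk_table and cyk_member and the tuple encoding of productions are hypothetical. The grammar used in the example (S → OT | OI, T → SI, O → 0, I → 1) is an assumption chosen to be in Chomsky normal form and consistent with the symbols appearing in Table 6.2; it is not quoted from Example 6.7.

```python
def cyk_table(binary_rules, terminal_rules, x):
    """Compute the sets X[i, j] for 1 <= i <= j <= len(x).

    binary_rules holds triples (A, B, C) for productions A -> BC, and
    terminal_rules holds pairs (A, a) for productions A -> a.
    """
    n = len(x)
    X = {}
    # Main diagonal: X[i, i] = {A | A -> a_i is a production}.
    for i in range(1, n + 1):
        X[i, i] = {A for (A, a) in terminal_rules if a == x[i - 1]}
    # Remaining cells, in ascending order of d = j - i.
    for d in range(1, n):
        for i in range(1, n - d + 1):
            cell = set()
            for t in range(d):
                left = X[i, i + t]
                right = X[i + t + 1, i + d]
                cell |= {A for (A, B, C) in binary_rules
                         if B in left and C in right}
            X[i, i + d] = cell
    return X


def cyk_member(binary_rules, terminal_rules, start, x):
    """Decide whether x is generated; the empty string is always rejected."""
    if not x:
        return False
    return start in cyk_table(binary_rules, terminal_rules, x)[1, len(x)]


# Assumed grammar in Chomsky normal form for strings of the form 0^k 1^k.
BINARY = [("S", "O", "T"), ("S", "O", "I"), ("T", "S", "I")]
TERMINAL = [("O", "0"), ("I", "1")]

if __name__ == "__main__":
    table = cyk_table(BINARY, TERMINAL, "000111")
    for i in range(1, 7):
        print(i, [sorted(table[i, j]) for j in range(i, 7)])
    print(cyk_member(BINARY, TERMINAL, "S", "000111"))
```

The three nested loops over d, i, and t give a running time cubic in n, with an additional factor for scanning the productions in the innermost step. Storing the sets in a dictionary keyed by (i, j) keeps the 1-based indices of the pseudocode.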
