

4.2 The CYK parsing method

4.2.2 CYK recognition with a grammar in Chomsky Normal Form

Two of the restrictions that we want to impose on the grammar are obvious by now: no unit rules and no ε-rules. We would also like to limit the maximum length of a right-hand side to 2; this would simplify checking that a right-hand side derives a certain substring. It turns out that there is a form for CF grammars that exactly fits these restrictions: the Chomsky Normal Form. It is as if this normal form was invented for this algorithm. A grammar is in Chomsky Normal Form (CNF) when all rules have either the form $A \to a$ or the form $A \to BC$, where $a$ is a terminal and $A$, $B$, and $C$ are non-terminals.
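The definition can be turned into a mechanical check. The following is a minimal sketch, assuming a hypothetical representation in which a grammar is a list of `(lhs, rhs)` pairs with `rhs` a tuple of symbols; the function name `is_cnf` and the toy grammars are illustrative and do not come from the text.

```python
def is_cnf(rules, nonterminals):
    """Return True if every rule has the form A -> a or A -> B C."""
    for lhs, rhs in rules:
        if lhs not in nonterminals:
            return False
        if len(rhs) == 1 and rhs[0] not in nonterminals:
            continue  # A -> a, with a a terminal
        if len(rhs) == 2 and all(sym in nonterminals for sym in rhs):
            continue  # A -> B C, with B and C non-terminals
        return False  # unit rule, epsilon-rule, or right-hand side too long
    return True


# Hypothetical toy grammars, for illustration only.
NONTERMINALS = {"S", "A", "B"}
CNF_RULES = [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",))]
NON_CNF_RULES = [("S", ("A",)), ("A", ("a",)), ("B", ())]  # unit rule and epsilon-rule

print(is_cnf(CNF_RULES, NONTERMINALS))
print(is_cnf(NON_CNF_RULES, NONTERMINALS))
```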

Fortunately, as we shall see later, almost all CF grammars can be mechanically transformed into a CNF grammar.

There are no ε-rules in a CNF grammar, so $R_\varepsilon$ is empty. The sets $R_{s_{i,1}}$ can be read directly from the rules: they are determined by the rules of the form $A \to a$. A rule $A \to BC$ can never derive a single terminal, because there are no ε-rules.
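Reading the length-1 sets from the rules might look as follows. This is a sketch under the same assumed representation (rules as `(lhs, rhs)` pairs); the CNF grammar for strings of the form a...ab...b is a hypothetical example, not one taken from the text.

```python
# Hypothetical CNF grammar: S -> A B | A T, T -> S B, A -> a, B -> b
RULES = [
    ("S", ("A", "B")),
    ("S", ("A", "T")),
    ("T", ("S", "B")),
    ("A", ("a",)),
    ("B", ("b",)),
]


def length_one_sets(rules, sentence):
    """Return a list whose element i-1 is the set for the substring s_{i,1}."""
    sets = []
    for symbol in sentence:
        # Only rules of the form A -> a can derive a single terminal.
        sets.append({lhs for lhs, rhs in rules if rhs == (symbol,)})
    return sets


print(length_one_sets(RULES, "aabb"))
```

A symbol that no rule of the form A → a produces yields an empty set, so no longer substring containing it can be derived either.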

Next, we proceed iteratively as before, first processing all substrings of length 2, then all substrings of length 3, etc. When a right-hand side BC is to derive a substring of length l, B has to derive the first part (which is non-empty), and C the rest (also non-empty).

$$
\underbrace{z_i \,\ldots\, z_{i+k-1}}_{B} \;\; \underbrace{z_{i+k} \,\ldots\, z_{i+l-1}}_{C}
$$

So, $B$ must derive $s_{i,k}$, that is, $B$ must be a member of $R_{s_{i,k}}$, and, likewise, $C$ must derive $s_{i+k,l-k}$, that is, $C$ must be a member of $R_{s_{i+k,l-k}}$. Determining if such a $k$ exists is easy: just try all possibilities; they range from 1 to $l-1$. All sets $R_{s_{i,k}}$ and $R_{s_{i+k,l-k}}$ have already been computed at this point.
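The search over split positions can be sketched as a single function. The table layout (a dictionary keyed by `(i, l)`, with 1-based start position `i` and length `l`) and the name `combine` are assumptions made for this illustration.

```python
# Hypothetical binary rules of a CNF grammar: S -> A B | A T, T -> S B
BINARY_RULES = [("S", ("A", "B")), ("S", ("A", "T")), ("T", ("S", "B"))]


def combine(table, binary_rules, i, l):
    """Compute the set for substring s_{i,l} from sets of shorter substrings."""
    result = set()
    for k in range(1, l):                 # k ranges from 1 to l-1
        left = table[(i, k)]              # non-terminals deriving s_{i,k}
        right = table[(i + k, l - k)]     # non-terminals deriving s_{i+k,l-k}
        for lhs, (b, c) in binary_rules:
            if b in left and c in right:
                result.add(lhs)
    return result


# Length-1 sets for the sentence "ab", filled in by hand for the example.
table = {(1, 1): {"A"}, (2, 1): {"B"}}
print(combine(table, BINARY_RULES, 1, 2))
```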

This process is much less complicated than the one we saw before, with a general CF grammar, for two reasons. The most important one is that we do not have to repeat the process again and again until no new non-terminals are added to $R_{s_{i,l}}$. Here, the substrings we are dealing with are really substrings: they cannot be equal to the string we started out with. The second reason is that we only have to find one place where the substring must be split in two, because the right-hand side consists of only two non-terminals. In ambiguous grammars, there can be several different splittings, but at this point, that does not worry us. Ambiguity is a parsing issue, not a recognition issue.

The algorithm results in a complete collection of sets $R_{s_{i,l}}$. The sentence $z$ consists of only $n$ symbols, so a substring starting at position $i$ can never have more than $n+1-i$ symbols. This means that there are no substrings $s_{i,l}$ with $i+l > n+1$. Therefore, the $R_{s_{i,l}}$ sets can be organized in a triangular table, as depicted in Figure 4.6.
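Putting the pieces together, a complete recognizer that fills the triangular table length by length could be sketched like this. The dictionary-based table, the function names, and the example grammar are assumptions for illustration; only entries with i + l ≤ n + 1 are ever created.

```python
# Hypothetical CNF grammar: S -> A B | A T, T -> S B, A -> a, B -> b
RULES = [
    ("S", ("A", "B")),
    ("S", ("A", "T")),
    ("T", ("S", "B")),
    ("A", ("a",)),
    ("B", ("b",)),
]


def cyk_table(rules, sentence):
    """Build the table: key (i, l) maps to the set for substring s_{i,l}."""
    n = len(sentence)
    table = {}
    for i in range(1, n + 1):             # substrings of length 1
        table[(i, 1)] = {lhs for lhs, rhs in rules if rhs == (sentence[i - 1],)}
    for l in range(2, n + 1):             # then lengths 2, 3, ..., n
        for i in range(1, n - l + 2):     # start positions with i + l <= n + 1
            entry = set()
            for k in range(1, l):         # split positions 1 .. l-1
                for lhs, rhs in rules:
                    if (len(rhs) == 2
                            and rhs[0] in table[(i, k)]
                            and rhs[1] in table[(i + k, l - k)]):
                        entry.add(lhs)
            table[(i, l)] = entry
    return table


def recognizes(rules, start, sentence):
    """True if the start symbol derives the whole (non-empty) sentence."""
    n = len(sentence)
    return n > 0 and start in cyk_table(rules, sentence)[(1, n)]


print(recognizes(RULES, "S", "aabb"))
print(recognizes(RULES, "S", "aab"))
print(len(cyk_table(RULES, "aabb")))      # number of table entries for n = 4
```

The empty sentence is rejected outright in this sketch, since a CNF grammar as defined above has no ε-rules.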

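$$
\begin{array}{lllllll}
R_{s_{1,n}} \\
R_{s_{1,n-1}} & R_{s_{2,n-1}} \\
\vdots & & \ddots \\
R_{s_{1,l}} & \cdots & R_{s_{i,l}} & \cdots \\
\vdots & & \uparrow V & \searrow W & \ddots \\
R_{s_{1,1}} & \cdots & R_{s_{i,1}} & \cdots & R_{s_{i+l-1,1}} & \cdots & R_{s_{n,1}}
\end{array}
$$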

Figure 4.6 Form of the recognition table

This table is called the recognition table, or the well-formed substring table. The entry $R_{s_{i,l}}$ is computed by following the arrows V and W simultaneously, looking for rules $A \to BC$ with $B$ a member of a set on the V arrow, and $C$ a member of the corresponding set on the W arrow. For $B$, substrings are taken starting at position $i$, with increasing length $k$. So the V arrow is vertical and rising, visiting $R_{s_{i,1}}$, $R_{s_{i,2}}$, . . . , $R_{s_{i,k}}$, . . . , $R_{s_{i,l-1}}$; for $C$, substrings are taken starting at position $i+k$, with length $l-k$ and end-position $i+l-1$, so the W arrow is diagonally descending, visiting $R_{s_{i+1,l-1}}$, $R_{s_{i+2,l-2}}$, . . . , $R_{s_{i+k,l-k}}$, . . . , $R_{s_{i+l-1,1}}$.

As described above, the recognition table is computed in the order depicted in Figure 4.7(a). We could also compute the recognition table in the order depicted in Figure 4.7(b). In this last order, $R_{s_{i,l}}$ is computed as soon as all sets and input symbols needed for its computation are available. For instance, when computing $R_{s_{3,3}}$, $R_{s_{5,1}}$ is relevant, but $R_{s_{6,1}}$ is not, because the substring at position 3 with length 3 does not contain the substring at position 6 with length 1. This order makes the algorithm particularly suitable for on-line parsing, where the number of symbols in the input is not known in advance, and additional information is computed each time a symbol is entered.
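The on-line order can be sketched as a recognizer that accepts one symbol at a time and, for each new symbol, computes exactly the entries for substrings ending at that symbol. The class `OnlineRecognizer`, its `feed` method, and the example grammar are hypothetical names chosen for this illustration.

```python
# Hypothetical CNF grammar: S -> A B | A T, T -> S B, A -> a, B -> b
RULES = [
    ("S", ("A", "B")),
    ("S", ("A", "T")),
    ("T", ("S", "B")),
    ("A", ("a",)),
    ("B", ("b",)),
]


class OnlineRecognizer:
    def __init__(self, rules, start):
        self.rules = rules
        self.start = start
        self.table = {}   # key (i, l): set for substring s_{i,l}
        self.n = 0        # number of symbols read so far

    def feed(self, symbol):
        """Add one symbol; return True if the input so far is a sentence."""
        self.n += 1
        j = self.n                        # end position of all new substrings
        self.table[(j, 1)] = {lhs for lhs, rhs in self.rules if rhs == (symbol,)}
        for l in range(2, j + 1):         # increasing length, all ending at j
            i = j - l + 1
            entry = set()
            for k in range(1, l):
                left = self.table[(i, k)]            # ends before j: already known
                right = self.table[(i + k, l - k)]   # ends at j, but shorter
                for lhs, rhs in self.rules:
                    if len(rhs) == 2 and rhs[0] in left and rhs[1] in right:
                        entry.add(lhs)
            self.table[(i, l)] = entry
        return self.start in self.table[(1, j)]


recognizer = OnlineRecognizer(RULES, "S")
results = [recognizer.feed(symbol) for symbol in "aabb"]
print(results)
```

Within one call, lengths are processed in increasing order, so every right-hand part needed (which also ends at the new symbol but is shorter) is available when it is looked up.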

(a) off-line order (b) on-line order

Figure 4.7 Different orders in which the recognition table can be computed

Now, let us examine the cost of this algorithm. Figure 4.6 shows that there are $n(n+1)/2$ substrings to be examined. For each substring, at most $n-1$ different $k$-positions have to be examined. All other operations are independent of $n$, so the algorithm operates in a time at most proportional to the cube of the length of the input sentence. As such, it is far more efficient than exhaustive search, which needs a time that is exponential in the length of the input sentence.
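The bound can be made slightly more precise by counting split positions per substring length. The following is a worked count based on the description above, not a figure taken from the text. There are $n-l+1$ substrings of length $l$, and each has $l-1$ candidate values of $k$:

$$
\sum_{l=1}^{n} (n-l+1)(l-1) \;=\; \sum_{m=0}^{n-1} (n-m)\,m \;=\; \frac{(n-1)\,n\,(n+1)}{6}
$$

For $n = 4$ this gives $3 + 4 + 3 = 10$ split positions, and the leading term $n^3/6$ is consistent with the cubic bound.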