Parsing Algorithms - Lambda-calculus and formal language theory

Grammatical formalisms come with a natural algorithmic problem called parsing. This problem consists in mapping a sentence to a representation of the set of its possible derivations. We have given [S22] an algorithmic solution for second-order linear ACGs. As a first generalization, we proved, using intersection types, that this algorithm can be extended to non-linear second-order ACGs [S21]. This generalization shows that it is possible to generate texts from semantic representations that may be logical formulae. Of course, these formulae are not taken up to logical equivalence, but they can nevertheless serve as a high-level representation of meaning, and some basic equivalence relations could be added, such as the associativity and commutativity of conjunction and disjunction.

One of our motivations for introducing the notion of recognizability for the simply typed λ-calculus was to simplify the proof of this result. Indeed, our proof looked quite similar to the usual proof of the closure of context-free languages under intersection with regular sets [296]. Using the closure of recognizable languages of λ-terms under inverse homomorphism and the fact that singleton languages are recognizable (by Statman’s theorem), we know that the set of syntactic structures of a given λ-term in a second-order ACG is a recognizable set of trees. Moreover, as all the theorems that are used are effective, we thus have a parsing algorithm for ACGs.

When we look at the grammar that we used as an example of an ACG in Section 3.1, this means that we may algorithmically retrieve the set of derivations whose interpretation is a given logical formula, such as the one we have taken as an example:

∃x.rat x ∧ (∃y.cat y ∧ (∃z.dog z ∧ saw z y)) ∧ chased y x ∧ (∃u.cheese u ∧ ate x u)

This result is far from intuitive, as the operations performed by the λ-calculus are complex. Nevertheless, the conceptual gain of recognizability makes the proof rather trivial. It also generalizes the remark of Mezei and Wright [213] about the regularity of the set of derivations of a sentence in a context-free grammar.

When we look at the algorithm this method gives, it amounts to computing least fixpoints in the domains of interpretation of atomic types. When instantiated on a context-free grammar, this naive algorithm is a bottom-up algorithm that does not benefit from the binarization procedure that speeds up the algorithm of Cocke, Younger [297] and Kasami [173]. Binarization methods can be adapted by transforming the abstract language but, in general, the parsing problem for second-order non-linear ACGs is non-elementary. If we fix the complexity of the lexicon at k, the time complexity of this problem is a tower of exponentials of height k − 1 [117].
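To make the context-free instance concrete, here is a minimal sketch of such a naive bottom-up least-fixpoint computation. It is an illustration only and is not taken from the cited works: the toy grammar `GRAMMAR`, the item format `(nonterminal, start, end)` and the function names are all assumptions introduced for the example. Rules are used as they are, without binarization.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy grammar: nonterminal -> list of right-hand sides.
# Symbols that are not keys of the dictionary are treated as terminals.
GRAMMAR: Dict[str, List[Tuple[str, ...]]] = {
    "S": [("NP", "VP")],
    "NP": [("the", "N")],
    "VP": [("V", "NP")],
    "N": [("cat",), ("rat",)],
    "V": [("chased",)],
}

Item = Tuple[str, int, int]  # (nonterminal, start position, end position)


def end_positions(rhs: Tuple[str, ...], start: int,
                  words: List[str], facts: Set[Item]) -> Set[int]:
    """End positions reached by matching `rhs` from `start` with the known facts."""
    ends = {start}
    for sym in rhs:
        reached = set()
        for i in ends:
            if sym in GRAMMAR:
                # Nonterminal: follow every item already derived for it.
                reached |= {k for (a, j, k) in facts if a == sym and j == i}
            elif i < len(words) and words[i] == sym:
                # Terminal: consume one input word.
                reached.add(i + 1)
        ends = reached
    return ends


def least_fixpoint(words: List[str]) -> Set[Item]:
    """Apply every rule at every position until no new item appears."""
    facts: Set[Item] = set()
    while True:
        new = {
            (lhs, i, k)
            for lhs, alternatives in GRAMMAR.items()
            for rhs in alternatives
            for i in range(len(words) + 1)
            for k in end_positions(rhs, i, words, facts)
        }
        if new <= facts:  # nothing new: the least fixpoint is reached
            return facts
        facts |= new


def recognizes(words: List[str], start: str = "S") -> bool:
    return (start, 0, len(words)) in least_fixpoint(words)
```

Each round recomputes every rule from scratch against all known items, which is exactly the redundancy that more refined evaluation strategies try to avoid.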

An important feature of this algorithm is that denotational semantics provides the representation of the information needed to represent the set of derivation trees. This is in general the difficult part of designing a parsing algorithm. This information may be rather complicated, for example in parsing algorithms for Tree Adjoining Grammars [261, 228], where it is represented by dotted trees with indices. Proving that this information is indeed sufficient to deduce the existence of a syntactic structure then requires most of the effort in proving the algorithm correct. Here this part is already contained in the fact that we use models of the λ-calculus, which yields the correctness of the algorithm as a corollary. So, technically, the use of denotational semantics seems to be a conceptual improvement.

The complexity of the parsing problem for non-linear ACGs pushed us to study some restrictions. The algorithm we proposed for linear second-order ACGs runs in polynomial time. This algorithm has been recast by Kanazawa in terms of datalog programs [168]. In that article, Kanazawa also remarks that the result can be extended to second-order almost linear ACGs. Such ACGs use lexicons that map constants to almost linear terms, which obey the non-copying constraint for all variables of functional type but not necessarily for variables of atomic type.
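For the simplest case, a context-free grammar, the datalog view can be sketched as follows. This is only an illustration under stated assumptions: it covers plain context-free rules with non-empty right-hand sides, not ACGs in general, and the helper names `cfg_to_datalog`, `sentence_to_facts` and `goal`, as well as the concrete clause syntax, are hypothetical. Each rule becomes a clause over string positions, the input sentence becomes a database of facts, and recognition becomes a query.

```python
from typing import Dict, List, Tuple

Grammar = Dict[str, List[Tuple[str, ...]]]


def cfg_to_datalog(grammar: Grammar) -> List[str]:
    """One clause per rule; A -> B C becomes a(X0, X2) :- b(X0, X1), c(X1, X2)."""
    clauses = []
    for lhs, alternatives in grammar.items():
        for rhs in alternatives:
            # Assumption: rhs is non-empty, so every head variable occurs in the body.
            body = [f"{sym}(X{n}, X{n + 1})" for n, sym in enumerate(rhs)]
            head = f"{lhs}(X0, X{len(rhs)})"
            clauses.append(f"{head} :- {', '.join(body)}.")
    return clauses


def sentence_to_facts(words: List[str]) -> List[str]:
    """The i-th word w of the input becomes the fact w(i, i + 1)."""
    return [f"{w}({i}, {i + 1})." for i, w in enumerate(words)]


def goal(words: List[str], start: str = "s") -> str:
    """The sentence is accepted iff the start predicate spans the whole input."""
    return f"?- {start}(0, {len(words)})."


if __name__ == "__main__":
    toy: Grammar = {"s": [("np", "vp")], "np": [("the", "n")], "n": [("cat",)]}
    for line in cfg_to_datalog(toy) + sentence_to_facts(["the", "cat"]):
        print(line)
    print(goal(["the", "cat"]))
```

Terminals and nonterminals are both rendered as binary predicates, so the grammar and the sentence live in the same program and any datalog evaluation strategy can be applied to it.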

Kanazawa’s datalog method is very interesting: not only does it allow one to give a nice presentation of parsing algorithms for second-order ACGs, but it also allows one to define parsing algorithms for other formalisms. The view datalog gives of parsing algorithms is that they are mostly specific strategies for computing fixpoints. In particular, Kanazawa has shown that many of the algorithms that were described in the literature in a rather technical way for particular formalisms can be described and generalized in terms of datalog program transformations [167]. This approach provides simpler presentations of algorithms and also simpler proofs of their correctness.

Interestingly, the datalog community has tried to reduce every fixpoint computation strategy to a single one, called the semi-naive bottom-up algorithm. To this end, it has developed a wide range of program transformations that preserve program semantics. An important transformation is magic supplementary set rewriting, which makes it possible to reduce the top-down resolution algorithm to the semi-naive one [46]. On a datalog program that represents a context-free grammar, this transformation gives rise to an improved version of Earley’s parsing algorithm [109]. The algorithm is improved in the sense that the magic predicate makes its time complexity linear, rather than quadratic, in the size of the original grammar. The datalog methodology allows us to see parsing algorithms as program transformations and program optimizations. From a software engineering perspective, this view of parsing makes it possible to factor out the semi-naive bottom-up resolution algorithm, which is responsible for memoisation; memoisation is delicate to implement and may constitute a serious bottleneck in practice.
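The following minimal sketch illustrates the semi-naive bottom-up strategy on a datalog program that represents a toy context-free grammar. It is an assumption-laden illustration, not the algorithm of the cited works: the atom format (tuples whose first component is the predicate name), the convention that variables are capitalized strings, and the names `RULES`, `sentence_facts` and `semi_naive` are all hypothetical. No magic set rewriting is performed, and for simplicity a rule instance may be rederived within one round.

```python
def is_var(term):
    """Assumption: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def match(atom, fact, env):
    """Extend `env` so that `atom` instantiates to `fact`, or return None."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    env = dict(env)  # copy, so the caller's bindings are left untouched
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if env.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return env


def fire(rule, known, delta):
    """Head instances whose derivation uses at least one fact of `delta`."""
    head, body = rule
    derived = set()
    for pivot in range(len(body)):  # the body atom forced to match a new fact
        envs = [{}]
        for n, atom in enumerate(body):
            source = delta if n == pivot else known
            extended = []
            for env in envs:
                for fact in source:
                    new_env = match(atom, fact, env)
                    if new_env is not None:
                        extended.append(new_env)
            envs = extended
        for env in envs:
            # Assumes safe rules: every head variable occurs in the body.
            derived.add((head[0],) + tuple(env[t] if is_var(t) else t for t in head[1:]))
    return derived


def semi_naive(rules, edb):
    """Least fixpoint of `rules` over `edb`; each round only uses fresh facts."""
    known, delta = set(edb), set(edb)
    while delta:
        derived = set()
        for rule in rules:
            derived |= fire(rule, known, delta)
        delta = derived - known  # keep only the facts that are really new
        known |= delta
    return known


# Hypothetical toy grammar written as datalog rules over string positions.
RULES = [
    (("s", "I", "K"), [("np", "I", "J"), ("vp", "J", "K")]),
    (("np", "I", "K"), [("the", "I", "J"), ("n", "J", "K")]),
    (("vp", "I", "K"), [("v", "I", "J"), ("np", "J", "K")]),
    (("n", "I", "J"), [("cat", "I", "J")]),
    (("n", "I", "J"), [("rat", "I", "J")]),
    (("v", "I", "J"), [("chased", "I", "J")]),
]


def sentence_facts(words):
    """The i-th word w of the input becomes the fact (w, i, i + 1)."""
    return {(w, i, i + 1) for i, w in enumerate(words)}
```

A sentence `words` is then accepted when `("s", 0, len(words))` belongs to `semi_naive(RULES, sentence_facts(words))`. The evaluation loop is independent of the grammar, which is the factoring-out mentioned above: changing the parsing strategy amounts to transforming `RULES`, not rewriting the loop.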

We may understand grammars as non-deterministic programs that compute least fixpoints. Using datalog may seem to be just another way of computing those fixpoints. Nevertheless, datalog offers richer computational capabilities; in other words, datalog is intensionally richer than grammars. It thus allows us to define parsing algorithms that could not be described directly in terms of grammars. The magic predicate in magic supplementary set rewriting is an instance of this phenomenon. Another good example is Kanazawa’s prefix-correct algorithm² for MCFGs, which uses a program that cannot be represented as a grammar [167].

With my PhD student Pierre Bourreau, we have worked on generalizing Kanazawa’s datalog approach to almost affine ACGs [S3, S2]. This work required the use of game semantics to prove the correctness of the algorithm. We have also extended this approach to copying formalisms like PMCFG [S1]. In this work we describe various transformations that allow us to obtain algorithms with or without the prefix-correctness property, and which may also use what is known as the left-corner strategy [227]. The way those program transformations are combined results in different algorithms, which may or may not have the prefix-correctness property, may or may not use a left-corner strategy, and so on.

² A parsing algorithm has the prefix-correctness property when it reads the input from left to right and rejects incorrect sentences as soon as it has processed a prefix that cannot be completed into an element of the language.
