Recognition Systems - A Pattern-based Foundation for Language-Driven Software Engineering

expressions define unambiguous parsers for strings. Hence, this work is related to parsing techniques and grammar formalisms in general and analytical or recognition-based formalisms and parsing techniques in particular.

Regular expressions [158] recognise strings based on operators that express concatenation, repetition and alternation. For example, the expression a(b|c)* matches all strings starting with a followed by an arbitrary number of bs and cs. Parsing Expression Grammars (PEGs) [52] can be regarded as an extension of regular expressions to a more expressive parsing language. Prioritised choice (expressed using the /-operator) and greedy repetition make PEG grammars unambiguous. For example, the PEG grammar fragment Exp <- ('ab' / 'a') Rest clearly defines a parsing strategy for a string such as "abc". Even though both choices 'ab' and 'a' match the beginning of the string, the leftmost choice 'ab' is always tried first. If 'ab' succeeds, 'a' will never be tried, not even if Rest fails.
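The difference between regular-expression matching and PEG-style prioritised choice can be illustrated with a minimal Python sketch. The helper names literal, choice and sequence, as well as the two concrete definitions of Rest ('c' and 'bc'), are assumptions made for illustration only; they are not taken from any PEG implementation discussed here.

```python
import re

# Regular expression from the text: 'a' followed by any number of 'b's and 'c's.
pattern = re.compile(r"a(b|c)*")
regex_hits = [s for s in ["a", "abcb", "ba", "abd"] if pattern.fullmatch(s)]

def literal(text):
    """Parser for a fixed string: returns the new position, or None on failure."""
    def parse(s, pos):
        return pos + len(text) if s.startswith(text, pos) else None
    return parse

def choice(*alternatives):
    """PEG prioritised choice: commit to the first alternative that succeeds."""
    def parse(s, pos):
        for alt in alternatives:
            end = alt(s, pos)
            if end is not None:
                return end  # later alternatives are never tried
        return None
    return parse

def sequence(*parts):
    """PEG sequence: every part must succeed, one after the other."""
    def parse(s, pos):
        for part in parts:
            pos = part(s, pos)
            if pos is None:
                return None
        return pos
    return parse

# Exp <- ('ab' / 'a') Rest, with two hypothetical definitions of Rest.
exp_rest_c = sequence(choice(literal("ab"), literal("a")), literal("c"))
exp_rest_bc = sequence(choice(literal("ab"), literal("a")), literal("bc"))

print(regex_hits)              # strings accepted by a(b|c)*
print(exp_rest_c("abc", 0))    # 'ab' then 'c' consumes the whole input
print(exp_rest_bc("abc", 0))   # fails: 'ab' is committed, 'a' is never retried
```

With Rest defined as 'bc', the input "abc" is rejected even though the split 'a' followed by 'bc' exists. A backtracking formalism would find that split, whereas the prioritised choice does not revisit its decision.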

The unambiguity of PEGs makes the formalism especially suitable for defining analytical (rather than generative) grammars [64]. The theory of PEG parsing is based on the parsing algorithms described by Birman and Ullman [14]. The first practical implementations of PEG-like operators go back at least to Schorre's META II system, in which the basic operations of PEGs are compiled into recursive-descent parsers [146].

Restricting the pattern core of Chapter 3 to horizontal operators, atoms, negation and references to define production rules yields the core PEG formalism. The use of a PEG-style grammar formalism to parse not only strings, but also nested sequences, is described by Baker [8]: the META II approach is used to parse S-expressions. Concat applies this approach to typed sequences. Because characters are also encoded as typed sequences, there is no conceptual difference between parsing and structural pattern matching.

A similar unification of parsing strings and pattern matching on nested structures can be realised with Definite Clause Grammars (DCGs) [19]. DCGs are a grammar formalism built into Prolog and other logic programming languages [150]. DCGs are syntactic sugar for regular Prolog rules. For example, the DCG grammar

as --> [a].
as --> [a], as.

recognises atomic sequences of as, such as [a] or [a,a,a]. It is equivalent to the following standard Prolog definition:

as([a|R],R).
as([a|R0],R) :- as(R0,R).

In contrast to PEGs, DCGs undo choices and try alternatives in case parsing fails. For instance, by forcing Prolog to backtrack, a query as([a,a,a],R) produces three different result values R: the sequences [a,a], [a] and the empty sequence [].
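The enumeration of alternatives can be mimicked in Python with a generator, where each yield plays the role of one solution found on backtracking. This is only an illustrative sketch: the function name as_ is hypothetical, and Python lists stand in for Prolog lists.

```python
def as_(seq):
    """Yield every remainder R for which as(seq, R) would hold.

    Mirrors the two DCG clauses  as --> [a].  and  as --> [a], as.
    """
    if seq and seq[0] == "a":
        rest = seq[1:]
        yield rest            # clause 1: consume a single a
        yield from as_(rest)  # clause 2: consume an a, then recurse

remainders = list(as_(["a", "a", "a"]))
print(remainders)
```

Collecting all values from the generator corresponds to repeatedly asking Prolog for another solution. A PEG-style parser would instead commit to one outcome.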

DCGs can be used to parse strings, which in Prolog are lists of atoms, and structured terms. DCGs do not provide syntactic meta-operators for repetition or alternation. However, those can be implemented using Prolog's abstraction mechanisms [11]. In contrast to PEGs and the pattern expressions presented in this work, DCGs support full backtracking and ambiguous grammars.

Parser combinators are a technique for implementing top-down parsers in functional languages [38, 82, 176]. Parsers are implemented as functions, and larger parsers are built from smaller parsers by functional composition using higher-order functions [83]. For example, the following Scheme code defines a parser a-or-b that recognises either token a or token b.

(define a-or-b
  (alternate (token "a") (token "b")))

A parser is a function that when applied to a string yields a pair containing the parse result and the rest of the string. The function token yields such a parser that recognises a single token. For example, the application ((token "a") "abc") yields ("a" . "bc"). The higher-order function (parser combinator) alternate combines two parsers. The result is a parser that attempts the first parser and in case of success yields its result, otherwise it attempts the second parser and yields its result.
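The behaviour of token and alternate described above can be sketched in Python as follows. The text does not say how failure is represented, so using None for a failed parse is an assumption, as is the name a_or_b.

```python
def token(text):
    """Return a parser that recognises the literal `text` at the start of its input."""
    def parse(s):
        if s.startswith(text):
            return (text, s[len(text):])  # (parse result, rest of the string)
        return None                       # failure (assumed representation)
    return parse

def alternate(first, second):
    """Return a parser that tries `first` and falls back to `second` on failure."""
    def parse(s):
        result = first(s)
        return result if result is not None else second(s)
    return parse

# Counterpart of the Scheme definition of a-or-b.
a_or_b = alternate(token("a"), token("b"))

print(token("a")("abc"))
print(a_or_b("bcd"))
print(a_or_b("xyz"))
```

Because alternate returns an ordinary parser function, its result can be passed to further combinators without special treatment.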

Because a-or-b is a parser, it can again be combined using parser combinators. Parser combinators typically implement basic features such as sequencing, repetition or look-ahead but might provide more advanced functionality. An important feature of most combinator libraries is to give users control over the combination of parse results. This typically involves host-language interaction – at least on the level of result data types. For example, in a Lisp parser combinator library parse trees are represented by list structures and the user must have a way to express how results of individual parsers are arranged in the list structure.
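User control over how results are combined can be sketched as a sequencing combinator that takes an extra function. The combinator name sequence and its combine parameter are hypothetical and do not refer to a specific library. The helper token is repeated here so that the example is self-contained.

```python
def token(text):
    """Parser for a literal: (result, rest) on success, None on failure."""
    def parse(s):
        return (text, s[len(text):]) if s.startswith(text) else None
    return parse

def sequence(first, second, combine):
    """Run two parsers in order; the caller decides how their results are merged."""
    def parse(s):
        r1 = first(s)
        if r1 is None:
            return None
        value1, rest1 = r1
        r2 = second(rest1)
        if r2 is None:
            return None
        value2, rest2 = r2
        return (combine(value1, value2), rest2)
    return parse

# The same grammar with two different result shapes chosen by the user.
as_list = sequence(token("a"), token("b"), lambda x, y: [x, y])
as_string = sequence(token("a"), token("b"), lambda x, y: x + y)

print(as_list("abc"))
print(as_string("abc"))
```

Both parsers accept the same inputs and differ only in the host-language data structure they build.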

While the semantics of the operators presented in this work can be implemented using combinatory techniques, the goal is to define an extensible formalism with stand-alone syntax, semantics and meta-functionality. This goes beyond what parser combinator libraries are used for. The into combinator defined by Hutton [82] allows results of previous

by the vertical operator presented in this work. However, the vertical operator passes the result of a match to the following pattern as the input sequence. The passed result is thus treated in the same way as any other parsing input. This allows the vertical combination of parsers that were not specifically designed for receiving parameters.

One of the core features of Concat is a unification of all stages of program execution through pattern-based transformations. Piumarta uses a pattern-based transformation language to define the front-, middle- and backend of a simple compiler that generates machine code for a Lisp-like programming language [137]. In Concat such a compiler could be realised using internalisation macros and computation macros, with typed sequences being the representation for program text, abstract syntax trees and machine code.
