
2.8.1

Introduction

In order to transform the P2AMF programs used by CySeMoL, we require a method for systematically analysing and storing such programs. Analysing a text in some language and building a data structure that represents it is known as 'parsing'. In this section, we discuss the theory and practice of parsing. We first introduce the notions of formal languages and of grammars over languages.

A (formal) language L over an alphabet Σ is defined as a set of strings (words) built from that alphabet, i.e. L ∈ P(Σ∗), where Σ∗ is the set of all finite strings over Σ and P denotes the power set. When we put constraints on the structure of the words that are allowed to occur in a language, we obtain classes of languages which satisfy these constraints.

One method for specifying structural constraints on a language is the use of (formal) grammars. Grammars define how valid sentences of a formal language can be generated by applying transformation rules, starting from a start symbol S. For example, we can define the language over the alphabet {a} of sequences of 'a' symbols with arbitrary length as follows:

S → ε
S → Sa

where ε stands for the 'empty string', the string containing no symbols at all. By starting from the start symbol S, we can form any sequence of 'a' symbols by applying the two rules. For instance, the following derivation proves that the string 'aaa' is in our language:

S ⇒ Sa      (by S → Sa)
  ⇒ Saa     (by S → Sa)
  ⇒ Saaa    (by S → Sa)
  ⇒ aaa     (by S → ε)
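
The derivation above can be reproduced mechanically. The following is a minimal Python sketch, assuming the two rules S → ε and S → Sa from the text; the function name `derive` is hypothetical and only serves the illustration.

```python
def derive(n):
    """Return the sentential forms of a derivation of n 'a' symbols.

    Applies S -> Sa exactly n times and then S -> ε once.
    """
    form = "S"
    steps = [form]
    for _ in range(n):
        form = form.replace("S", "Sa", 1)  # apply S -> Sa
        steps.append(form)
    form = form.replace("S", "", 1)        # apply S -> ε
    steps.append(form)
    return steps


print(" => ".join(derive(3)))  # S => Sa => Saa => Saaa => aaa
```

Because the only non-terminal always sits at the front of the sentential form, a single string replacement per step is enough here; a general grammar would need an explicit choice of which rule to apply at which position.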

Note that the second rule is a recursive definition: the symbol occurring on the left-hand side of the rule also appears on the right-hand side.

In this example, we have used the two types of symbols which occur in grammars: terminal (a) and non-terminal (S) symbols. Non-terminal symbols are not part of the alphabet of the language, and serve as intermediate symbols used to structure the application of rules. Terminal symbols, on the other hand, are part of the language alphabet, and will eventually form the sentences generated by the grammar.

Similar to the restrictions we can impose on languages, we can create classes of grammars by restricting the form of the transformation rules. These restrictions have given rise to the Chomsky hierarchy, which divides all grammars into four categories: regular grammars, context-free grammars, context-sensitive grammars, and unrestricted grammars. For the specification of parsers, we often use grammars from the family of context-free grammars, or CFGs. CFGs are defined as grammars whose rules have exactly one non-terminal symbol, and nothing else, on their left-hand side. This means that whether a rule can be applied depends on that non-terminal only, and not on the terminals or non-terminals next to it (i.e. its context). For instance, the example grammar shown previously is in fact a context-free grammar.
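
The context-free restriction is easy to check on a grammar given as data. The sketch below assumes a hypothetical encoding in which each rule is a pair of symbol tuples (left-hand side, right-hand side); the second grammar is an invented counter-example, not one taken from the text.

```python
def is_context_free(rules, non_terminals):
    """True if every rule has exactly one non-terminal as its left-hand side."""
    return all(
        len(lhs) == 1 and lhs[0] in non_terminals
        for lhs, _rhs in rules
    )


# The grammar from the text: S -> ε and S -> Sa.
a_star = [(("S",), ()), (("S",), ("S", "a"))]

# Invented counter-example: the rule 'a S -> a a S' depends on the
# terminal 'a' standing next to S, so it is not context-free.
with_context = [(("a", "S"), ("a", "a", "S"))]

print(is_context_free(a_star, {"S"}))        # True
print(is_context_free(with_context, {"S"}))  # False
```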

Table 2.6: EBNF operators, and their function.

Operator       Function
<a> '|' <b>    Choice operator, which specifies that either <a> or <b> is a valid alternative.
<a> ' ' <b>    Concatenation operator (empty space), which specifies the concatenation of <b> after <a>.
<a> '?'        Optionality, which specifies that <a> is allowed to be included, or left out.
<a> '*'        Kleene-star operation, which specifies zero or more consecutive repetitions of <a>.
<a> '+'        Alternative repetition operator, specifying that <a> should be repeated one or more times.

2.8.2

Extended Backus-Naur Form

One of the more popular notations for specifying context-free grammars for parsers is the Extended Backus-Naur Form (EBNF). Grammars written in EBNF form a set of production rules, which specify how all valid strings of a given language can be generated, starting from the root symbol S. An EBNF production rule has the following form:

<rulename> ::= <rulebody> ;

where the rule body can contain references to other rules, or terminals. By defining multiple rules, the grammar can be specified in a structured and intuitive manner. Furthermore, EBNF introduces special operators, which we list in Table 2.6. These operators are used to define more complex patterns of rules and terminals.
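
To make the meaning of the operators in Table 2.6 concrete, the following Python sketch models each operator as a small matcher function. This is an illustration under our own assumptions, not how a parser generator works internally: every matcher takes a text and a start position and returns the set of positions at which a match can end. All names (`lit`, `choice`, `seq`, `opt`, `star`, `plus`, `accepts`) and the example rule `number` are hypothetical.

```python
def lit(s):
    """Terminal: matches the literal string s."""
    return lambda text, pos: {pos + len(s)} if text.startswith(s, pos) else set()

def choice(a, b):
    """<a> '|' <b>: either alternative may match."""
    return lambda text, pos: a(text, pos) | b(text, pos)

def seq(a, b):
    """<a> ' ' <b>: b must match where a ended."""
    return lambda text, pos: {e for m in a(text, pos) for e in b(text, m)}

def opt(a):
    """<a> '?': a may be included or left out."""
    return lambda text, pos: {pos} | a(text, pos)

def star(a):
    """<a> '*': zero or more consecutive repetitions of a."""
    def match(text, pos):
        seen, frontier = {pos}, {pos}
        while frontier:  # stop once no new end positions are found
            frontier = {e for p in frontier for e in a(text, p)} - seen
            seen |= frontier
        return seen
    return match

def plus(a):
    """<a> '+': one or more repetitions, i.e. a followed by a*."""
    return seq(a, star(a))

def accepts(matcher, text):
    """True if the matcher can consume the whole text."""
    return len(text) in matcher(text, 0)


# Hypothetical rule:  number ::= '-'? ('0' | '1')+ ;
digit = choice(lit("0"), lit("1"))
number = seq(opt(lit("-")), plus(digit))

print(accepts(number, "-101"))  # True
print(accepts(number, "-"))     # False
```

Returning a set of end positions, rather than a single one, lets the sketch try every alternative without explicit backtracking code, at the cost of efficiency.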

2.8.3

ANTLR4

ANTLR4 (ANother Tool for Language Recognition) is a tool for generating parsers for the class of ALL(*) grammars [64]. The design goals of ANTLR4 favour ease of use over speed. The generated parsers are nevertheless efficient in practice.

The ANTLR4 file format is analogous to an EBNF grammar, with additional instructions for the generated code. From such a file, ANTLR4 generates a parser program which recognizes whether a string conforms to the specified language. Additionally, ANTLR provides the parse tree, which represents the rule application choices made by the parser. Another useful feature is the support for labels: a terminal or non-terminal in a rule can be given a label, which will be present in the parse tree. We use this feature to label the arguments of our binary expression rules, which gives us convenient access to these arguments.
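
The benefit of labelled arguments can be illustrated without ANTLR itself. The sketch below is a hand-written Python analogue, not ANTLR-generated code: the classes `Num` and `Add`, the function `parse_sum` and the toy rule `NUMBER ('+' NUMBER)*` are all assumptions made for the example. The fields `left` and `right` play the role that labels on the operands of a binary expression rule would play in a parse tree.

```python
from dataclasses import dataclass


@dataclass
class Num:
    value: int


@dataclass
class Add:
    left: object   # labelled first operand
    right: object  # labelled second operand


def parse_sum(tokens):
    """Parse NUMBER ('+' NUMBER)* into a left-nested tree of Add nodes."""
    it = iter(tokens)
    tree = Num(int(next(it)))
    for op in it:
        if op != "+":
            raise ValueError(f"unexpected token {op!r}")
        operand = next(it, None)
        if operand is None:
            raise ValueError("missing operand after '+'")
        tree = Add(left=tree, right=Num(int(operand)))
    return tree


def evaluate(node):
    """The labels give direct access to both operands, without indexing children."""
    if isinstance(node, Num):
        return node.value
    return evaluate(node.left) + evaluate(node.right)


tree = parse_sum(["1", "+", "2", "+", "3"])
print(evaluate(tree))  # 6
```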

In order to smooth the integration of generated ANTLR parsers into existing projects, ANTLR is able to generate listener interfaces. These interfaces allow existing code to be notified when elements of the language are parsed, and allow that code to access and query the generated parse tree. Compared with the previous version, ANTLR3, the parse tree of ANTLR4 is more accessible. A common usage scenario which has been improved is the representation of nodes formed by the * or + operations. The ANTLR4 tree nodes representing such an operation provide a list interface to the repeated rules, which improves the accessibility of the information in these nodes.
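
The listener idea and the list interface for repeated elements can be sketched in a few lines of Python. This is a hand-rolled illustration under our own assumptions and does not use the ANTLR runtime: `Node`, `Listener`, `walk`, `ArgumentCollector` and the rule names `call` and `arg` are hypothetical.

```python
class Node:
    """Parse tree node; repeated sub-rules are exposed as a plain list."""

    def __init__(self, rule, children=(), text=None):
        self.rule = rule
        self.children = list(children)
        self.text = text


class Listener:
    """Base listener: subclasses override only the callbacks they need."""

    def enter(self, node):
        pass

    def exit(self, node):
        pass


def walk(listener, node):
    """Depth-first walk that notifies the listener on entering and leaving a node."""
    listener.enter(node)
    for child in node.children:
        walk(listener, child)
    listener.exit(node)


class ArgumentCollector(Listener):
    """Existing code hooks in by collecting the text of every 'arg' node."""

    def __init__(self):
        self.args = []

    def exit(self, node):
        if node.rule == "arg":
            self.args.append(node.text)


# Hypothetical tree for a rule such as:  call ::= arg* ;
tree = Node("call", [Node("arg", text="x"), Node("arg", text="y")])
collector = ArgumentCollector()
walk(collector, tree)
print(collector.args)  # ['x', 'y']
```

Keeping the traversal in `walk` and the reaction in the listener means the code that consumes the tree does not have to know how the tree is traversed.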