Regular Tree Grammars for Unordered Unranked Trees

In many applications of data modelling, the order of instances in some context is of minor importance. Typical examples are object-oriented databases and programming languages, where the order in which the attributes of a class are defined is irrelevant, e.g. for their access: the name is the only property needed to access an attribute. With XML, in contrast, the document-centric paradigm always imposes an order on the elements in a sequence of child nodes. Attributes, however, are considered to be unordered information of an element, so XML provides a limited² concept of mixed ordered and unordered content. An example of XML documents in which the order of the elements is of high importance is an HTML document: headings and paragraphs occur in an order that is essential to the document, and ignoring the order of those elements could arguably render the document meaningless.

Many applications of XML are data centric and not document centric, i.e. more in the spirit of semistructured databases. For many such applications, the order imposed by the document is irrelevant. As an example, consider a bibliographic database: the order of the books in the database may be completely arbitrary, e.g. it may be the order in which the librarian took the books off the shelves to index them. The information relevant for accessing the books is e.g. title, author or any other information about the book suitable for indexing. For such an application, the following two databases can arguably be considered equivalent, yet the XML semantics considers them to be different.

² Limited, because attributes are arguably less expressive than elements, as they may not contain structured information.

<bib>
  <book>
    <title>Data on the Web</title>
    <authors>S.Abiteboul, and P.Buneman, and D.Suciu</authors>
  </book>
  <book>
    <title>Automata, Languages, and Programming</title>
    <authors>S.Abiteboul, and E.Shamir</authors>
  </book>
</bib>

Code Example 22 First variation of a bibliography with arguably irrelevant order of the book elements

<bib>
  <book>
    <title>Automata, Languages, and Programming</title>
    <authors>S.Abiteboul, and E.Shamir</authors>
  </book>
  <book>
    <authors>S.Abiteboul, and P.Buneman, and D.Suciu</authors>
    <title>Data on the Web</title>
  </book>
</bib>

Code Example 23 Other variation of the bibliography seen in code example 22

Xcerpt distinguishes ordered child sequences from unordered child multisets in order to represent data for both application paradigms, data centric and document centric, in an appropriate manner. Using unordered Xcerpt content specifications, the two databases above are modelled as semantically equivalent.
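The difference between the ordered and the unordered reading of the two bibliographies can be illustrated with a small sketch. It is not Xcerpt's implementation; the helper names `canonical` and `equivalent` are hypothetical, attributes and tail text are ignored for brevity, and child multisets are compared by sorting a recursive canonical form.

```python
import xml.etree.ElementTree as ET

DOC_A = """<bib>
  <book><title>Data on the Web</title>
        <authors>S.Abiteboul, and P.Buneman, and D.Suciu</authors></book>
  <book><title>Automata, Languages, and Programming</title>
        <authors>S.Abiteboul, and E.Shamir</authors></book>
</bib>"""

DOC_B = """<bib>
  <book><title>Automata, Languages, and Programming</title>
        <authors>S.Abiteboul, and E.Shamir</authors></book>
  <book><authors>S.Abiteboul, and P.Buneman, and D.Suciu</authors>
        <title>Data on the Web</title></book>
</bib>"""


def canonical(elem, ordered=False):
    """Return a nested tuple describing elem (hypothetical helper).

    With ordered=False the children are sorted, so two elements whose
    children form the same multiset get the same canonical form.
    """
    text = " ".join((elem.text or "").split())  # normalise whitespace
    children = [canonical(child, ordered) for child in elem]
    if not ordered:
        children.sort()
    return (elem.tag, text, tuple(children))


def equivalent(doc_a, doc_b, ordered=False):
    """Compare two XML strings under ordered or unordered child semantics."""
    return (canonical(ET.fromstring(doc_a), ordered)
            == canonical(ET.fromstring(doc_b), ordered))
```

Under these assumptions, `equivalent(DOC_A, DOC_B)` holds for the unordered reading, while `equivalent(DOC_A, DOC_B, ordered=True)` does not.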

To model unordered content appropriately in a grammar, a dedicated modelling formalism is necessary. With type checking as the goal, such a formalism needs a decidable membership test, emptiness check and subset test, and it needs to be closed under intersection, as motivated in 8.

Current Modelling Formalisms for languages of Multisets

Various formalisms for modelling languages of multisets have been proposed, some of which are briefly presented here:

Multiplicity Lists Multiplicity lists [16] are a metalanguage for specifying decidable sets of data terms, which are used in later work of the authors [65] [13] as types in the processing of tree-structured data. The idea is motivated by DTDs and by XML Schema. A multiplicity list is a regular type expression of the form $s_1(n_1 : m_1) \cdots s_k(n_k : m_k)$, where $k \ge 0$ and $s_1, \ldots, s_k$ are distinct type names. Informally, a multiset of symbols is valid with respect to a given multiplicity list if each symbol $s_i$ occurs at least $n_i$ times but not more than $m_i$ times. Symbols other than $s_1$ through $s_k$ may not occur.

Multiplicity lists are closed under intersection but not under union and complement. Multiplicity lists have been applied successfully in static type checking of Xcerpt programs as presented in [Descriptive Typing Rules for Xcerpt and their Soundness] and [A Prototype of a Descriptive Type System for Xcerpt.].
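The informal semantics above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation from [16]: a multiplicity list is represented as a dictionary from symbol to a pair (n, m), the function names `is_valid` and `intersect` are hypothetical, and a symbol missing from a list is treated as having bounds (0, 0), since unmentioned symbols may not occur.

```python
from collections import Counter


def is_valid(symbols, mlist):
    """Check a multiset (given as any iterable of symbols) against a
    multiplicity list {symbol: (n, m)} with n <= count <= m."""
    counts = Counter(symbols)
    # Symbols not mentioned in the multiplicity list may not occur.
    if any(symbol not in mlist for symbol in counts):
        return False
    return all(lo <= counts[symbol] <= hi
               for symbol, (lo, hi) in mlist.items())


def intersect(list_a, list_b):
    """Intersect two multiplicity lists by intersecting the intervals
    symbol by symbol; returns None if the intersection is empty."""
    result = {}
    for symbol in set(list_a) | set(list_b):
        lo_a, hi_a = list_a.get(symbol, (0, 0))
        lo_b, hi_b = list_b.get(symbol, (0, 0))
        lo, hi = max(lo_a, lo_b), min(hi_a, hi_b)
        if lo > hi:
            return None  # no multiset satisfies both lists
        result[symbol] = (lo, hi)
    return result


# Hypothetical content model: exactly one title, one to three authors.
book = {"title": (1, 1), "author": (1, 3)}
```

The interval-wise construction in `intersect` is one way to see why intersection stays within the formalism, whereas the union of two intervals is in general not an interval.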

L-Formulae Frank Neven and Thomas Schwentick presented the L-Formulae [39] as a decidable formalism for unordered content models on the Web. L-Formulae are defined as ϕ ::=

L-Formulae are closed under intersection, complement and union. When typing or type checking semistructured data queries in general, and Xcerpt programs in particular, an interesting situation arises when ordered data is queried without regard to its order, i.e. when it is queried using order-agnostic query constructs. As an order-agnostic query construct has unordered type semantics, the compatibility of an unordered type with ordered, typed data has to be checked. Ordered regular expressions can model dependencies between the numbers of occurrences of symbols, e.g. the regular expression "(aab)+" states that valid data contains twice as many a symbols as b symbols. Using Schwentick formulae, it is not possible to express such dependent occurrence constraints on data, as the multiplicities are always formulated absolutely. Linear equation systems restricted to natural number solutions make it possible to express such constraints. Interestingly, it is almost enough to use Presburger arithmetic [42], whose formulae are much easier to solve under the given natural number solution constraint. This approach is used as the starting point of an unordered type specification language later on.
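As an illustration of such a dependent constraint, the unordered reading of "(aab)+" can be written as a formula over multiplicities. The variable names $x_a$ and $x_b$ (the numbers of a and b symbols) and the auxiliary variable $n$ (the number of repetitions) are assumptions introduced here, not notation from the cited work.

```latex
\[
  \exists n.\; n \ge 1 \;\land\; x_a = n + n \;\land\; x_b = n
\]
```

Only addition is needed: the factor two is expressed as $n + n$, so no multiplication of variables is involved.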

Presburger Arithmetic Expressions A formalism not conceived only for unordered content models, yet formally elegant, is Presburger arithmetic [42]. The approach has been applied to unordered content models in XML by Denis Lugiez and Silvano Dal Zilio to formalize XML Schema [67] and by Seidl, Schwentick, Muscholl, and Habermehl [45] as a formal model for unordered content models. Presburger arithmetic is the first-order theory of the natural numbers with addition but without multiplication. For any given statement in Presburger arithmetic it is decidable whether the statement is true and, if it contains free variables, for which variable bindings it is true.

The relationship between languages of multisets and Presburger arithmetic lies in the commutativity and associativity of addition (in Presburger arithmetic) and of the sequence operator (in the serialisation of multisets). The commutativity and associativity of sequence operators in serialisations of bags, multisets or sets has been presented e.g. in [18]. A multiset of symbols can be fully characterized by the multiplicities of the symbols occurring in it. A Presburger arithmetic expression with variables may have many solutions, i.e. many possible natural number assignments for its variables. Each assignment can be interpreted as the multiplicities of the symbols in a multiset that is valid with respect to the given Presburger arithmetic expression.
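The correspondence between solutions and multisets can be made concrete with a brute-force sketch. It is not a decision procedure for Presburger arithmetic: it merely enumerates assignments up to an assumed bound, and the names `solutions` and `as_multiset` as well as the sample formula are hypothetical.

```python
from itertools import product


def solutions(formula, symbols, bound):
    """Yield every assignment symbol -> multiplicity in 0..bound that
    satisfies formula (a plain Python predicate over the assignment)."""
    for values in product(range(bound + 1), repeat=len(symbols)):
        assignment = dict(zip(symbols, values))
        if formula(assignment):
            yield assignment


def as_multiset(assignment):
    """Return one serialisation of the multiset described by the
    multiplicities; any permutation denotes the same multiset."""
    return [symbol
            for symbol in sorted(assignment)
            for _ in range(assignment[symbol])]


def sample(v):
    """Hypothetical constraint: x_a + x_b = 3 and x_a >= 1."""
    return v["a"] + v["b"] == 3 and v["a"] >= 1


found = list(solutions(sample, ["a", "b"], bound=3))
multisets = [as_multiset(v) for v in found]
```

Each element of `found` is one natural number assignment, and the matching entry of `multisets` is a serialisation of the multiset it describes.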

Applications, Shortcomings and Extension of current Approaches

Although multiplicity lists are very restricted and, as stated above, not closed under union and complement, they are arguably an elegant way to model multisets in the context of mixed ordered and unordered content, as multiplicity lists are restricted regular expressions with a clear and simple unordered interpretation. While Presburger arithmetic formulae are more expressive and offer the decidability and closure properties needed for type checking, end users without a scientific background can arguably be confused by the very mathematical formalism and by the need to learn two content model specification mechanisms: regular expressions for ordered content models and Presburger arithmetic formulae for unordered content models. For modelling unordered content with the grammar formalism proposed later, an extension of multiplicity lists will be introduced: unrestricted regular expressions with unordered interpretation.

As an example that has not yet been formally introduced, consider the regular expression for a data object representing a course in a dancing school, as seen in code example 24:

DancingMaster, ((Boy, Girl)+ | BalletDancer+ )

Code Example 24 A regular content model representing well-balanced dancing classes for either ballet dancers or standard couple dancers. While the regular content model implies an order, the order of the objects is arguably not of importance for the application.

For every course there must be a dancing master. Courses may be standard dance courses, in which the same number of boys and girls have to be registered so that everyone can always have a partner. Courses may also be ballet dancing courses, in which case at least one ballet dancer has to be registered. The order of the objects in a content set corresponding to this regular expression is irrelevant, e.g. the dancing master may be assigned at the end, once the kind of course is clear. Therefore, every word that has a permutation of its symbols matched by the regular expression is a member of the unordered language represented by the regular expression.
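The definition of the unordered language can be turned directly into a naive membership test: try every permutation and check it against the ordered expression. The sketch below is only meant to illustrate the definition; it takes factorial time, the one-letter codes in `CODES` are an assumed encoding so that Python's `re` module can be used, and the function name is hypothetical.

```python
import re
from itertools import permutations

# Assumed one-letter encoding of the object types.
CODES = {"DancingMaster": "D", "Boy": "B", "Girl": "G", "BalletDancer": "T"}

# Ordered content model: DancingMaster, ((Boy, Girl)+ | BalletDancer+)
PATTERN = re.compile(r"D((BG)+|T+)")


def in_unordered_language(objects):
    """True if some permutation of objects matches the ordered model."""
    if any(obj not in CODES for obj in objects):
        return False
    word = [CODES[obj] for obj in objects]
    # set() removes duplicate permutations of repeated symbols.
    return any(PATTERN.fullmatch("".join(candidate))
               for candidate in set(permutations(word)))
```

For instance, `in_unordered_language(["Girl", "DancingMaster", "Boy"])` holds although the dancing master is not listed first, while a course with two boys and one girl is rejected.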

An example of a Presburger arithmetic expression representing the same constraints on a valid dancing course is arguably more difficult to understand and to write for authors:

Let $d$ be the multiplicity of DancingMaster objects, $m$ the number of Boys ($m$ for male), $f$ the number of Girls ($f$ for female) and $b$ the number of BalletDancers. A multiset of objects is in the given language if the multiplicities of the symbols fulfil the following formula:

$$d = 1 \land ((m = f \land m \ge 1 \land f \ge 1 \land b = 0) \lor (m = 0 \land f = 0 \land b \ge 1))$$
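Evaluating this formula on a concrete multiset only requires counting. The following sketch uses the variables d, m, f and b from the text; the function name `valid_course` and the rejection of unknown object types are assumptions made for the illustration.

```python
from collections import Counter

KNOWN = {"DancingMaster", "Boy", "Girl", "BalletDancer"}


def valid_course(objects):
    """Evaluate the multiplicity formula on a multiset of objects."""
    counts = Counter(objects)
    if set(counts) - KNOWN:
        return False  # assumption: other object types may not occur
    d = counts["DancingMaster"]
    m = counts["Boy"]
    f = counts["Girl"]
    b = counts["BalletDancer"]
    return d == 1 and (
        (m == f and m >= 1 and f >= 1 and b == 0)
        or (m == 0 and f == 0 and b >= 1)
    )
```

Because only the counts are inspected, the order in which the objects are listed has no influence on the result.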

As shown in Chapter 7, the unordered interpretation of regular expressions can always be mapped to Presburger arithmetic constraints and is therefore decidable and has the desired closure properties; however, no system for translating Presburger formulas and constraints back to regular expressions will be presented (as it is not strictly needed for type checking). An advantage of this approach is that it can also be used for type checking unordered queries on ordered content models (see e.g. section 2.5.4), which is valid and useful in Xcerpt as presented in [43].

In document Berger, Sacha (2008): Regular Rooted Graph Grammars: A Web Type and Schema Language. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 51-54)