

7.1.2 Multiset Languages

A multiset language is a set of multisets; hence a multiset language $L$ over $\Sigma$ satisfies $L \subseteq \bigcup_{i \in \mathbb{N}} \Sigma^i$.

What was called “set of defined values” in the definition of multisets is often called vocabulary in linguistics or symbols in language theory.

A language of multisets may be finite or infinite, hence a finite notation for languages of multisets is necessary to formalize them.

Existing Approaches to Formalize Multiset Languages

Different approaches have been proposed to formalize sets of multisets, among them so-called multiplicity lists [16], L≥-Formulae [39] and so-called counting constraints [67].

Definition 7.3 (Multiplicity Lists)

A multiplicity list is a regular type expression of the form $s_1(n_1{:}m_1) \cdots s_k(n_k{:}m_k)$ where $k \geq 0$ and $s_1, \ldots, s_k$ are distinct symbols. Here $n_i$ expresses the minimum and $m_i$ the maximum number of occurrences of $s_i$.

A problem of the interval notation is that it cannot express gaps in the multiplicity of a symbol. For example, it is not possible to express that the multiplicity of a symbol should be at least $n$ and at most $m$ but not $p$, with $n < p < m$. Nor is it possible to express that the multiplicity of a symbol should be odd, even, or a multiple of a given factor.
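The interval semantics of multiplicity lists can be illustrated with a minimal Python sketch. The function name `matches_multiplicity_list` and the dictionary encoding are hypothetical, and the sketch assumes that symbols not mentioned in the list may not occur at all; the definition above does not state this explicitly.

```python
from collections import Counter

def matches_multiplicity_list(multiset, mlist):
    """Check a multiset against a multiplicity list.

    mlist maps each symbol s_i to a pair (n_i, m_i); m_i = None stands for
    "unbounded" (an assumption of this sketch).
    """
    counts = Counter(multiset)
    # Assumption: symbols that the list does not mention are not allowed.
    if any(sym not in mlist for sym in counts):
        return False
    # Every listed symbol must occur between n_i and m_i times.
    return all(
        lo <= counts[sym] and (hi is None or counts[sym] <= hi)
        for sym, (lo, hi) in mlist.items()
    )

# a(1:3) b(0:1)
mlist = {"a": (1, 3), "b": (0, 1)}
print(matches_multiplicity_list(["a", "b", "a"], mlist))
print(matches_multiplicity_list(["a", "a", "a", "a"], mlist))
```

Since each symbol carries exactly one interval, there is no way in this encoding to exclude a single value inside the interval, which is the limitation described above.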

Definition 7.4 (L≥-Formulae)

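An L≥-Formula is an expression $\varphi$ of the shape

$\varphi ::= \mathit{true} \mid \mathit{false} \mid a = i \mid a \geq i \mid \neg\varphi \mid \varphi \vee \varphi \qquad a \in \Sigma,\ i \in \mathbb{N}$

and the language of $\varphi$ is $L(\varphi) = \{w \mid w \models \varphi\}$, with $w \models a = i$ iff the multiplicity of $a$ in $w$ is $i$, $w \models a \geq i$ iff the multiplicity of $a$ in $w$ is greater than or equal to $i$, $w \models \neg\varphi$ iff not $w \models \varphi$, $w \models \varphi_1 \vee \varphi_2$ iff $w \models \varphi_1$ or $w \models \varphi_2$, never $w \models \mathit{false}$ and always $w \models \mathit{true}$.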

L≥-Formulae are closed under union, intersection and complement.

Compared to multiplicity lists, L≥-Formulae are more expressive, as they allow the expression of gaps in symbol multiplicity intervals. However, every gap has to be stated explicitly; there is no abstraction over the concept of a gap. If, for example, odd multiplicity has to be expressed, a gap of one must be declared after each valid multiplicity. Such multiplicities are commonly expressed using regular expressions; odd multiplicity of a symbol $s$, for example, can be expressed by the regular expression $s, (s, s)^*$.
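The satisfaction relation of L≥-Formulae can be sketched as a small recursive evaluator. The tuple encoding of formulae and the helper names `holds` and `conj` are hypothetical choices made for this illustration; conjunction is derived from negation and disjunction, since the grammar offers only those two connectives.

```python
from collections import Counter

def holds(w, phi):
    """Return True iff the multiset w (a Counter) satisfies the formula phi."""
    op = phi[0]
    if op == "true":
        return True
    if op == "false":
        return False
    if op == "eq":   # a = i
        return w[phi[1]] == phi[2]
    if op == "geq":  # a >= i
        return w[phi[1]] >= phi[2]
    if op == "not":
        return not holds(w, phi[1])
    if op == "or":
        return holds(w, phi[1]) or holds(w, phi[2])
    raise ValueError(f"unknown connective: {op!r}")

def conj(p, q):
    """Conjunction via De Morgan: p and q == not (not p or not q)."""
    return ("not", ("or", ("not", p), ("not", q)))

# "a occurs at least once and at most five times, but not three times":
# an interval with a gap, which a multiplicity list cannot express.
gap = conj(conj(("geq", "a", 1), ("not", ("geq", "a", 6))),
           ("not", ("eq", "a", 3)))

print(holds(Counter("aa"), gap))
print(holds(Counter("aaa"), gap))
```

Each excluded multiplicity needs its own subformula here, so a property such as "odd multiplicity" over unbounded counts would require infinitely many of them.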

Definition 7.5 (Counting Constraints (a.k.a. Presburger Constraints))

A counting constraint is a formula $\varphi$ of the shape:

where $n$ is a natural number and $N$ is a natural-number variable. Later on, $\varphi(N_1, \ldots, N_m)$ will denote the counting constraint with the free variables $N_1, \ldots, N_m$. Counting constraints are decidable, i.e. for every $\varphi(N_1, \ldots, N_m)$ it is decidable whether there exist values $n_1, \ldots, n_m$ for the free variables such that $\varphi(n_1, \ldots, n_m)$ is true.

In contrast to L≥-Formulae, counting constraints can express repeating interval/gap multiplicities by using variables and sums for multiplicities. As an illustration, consider the odd multiplicity of a symbol: let the multiplicity of that symbol be $S$; then $S = 1 + N + N \wedge N \geq 0$, with $N$ and $S$ ranging over natural numbers, expresses the fact that $S$ is an odd number.
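The odd-multiplicity constraint can be checked mechanically with a naive witness search. This is only an illustrative sketch, not a Presburger decision procedure; the function name `satisfies_odd_constraint` is hypothetical, and the search bound relies on the observation that any witness $N$ must satisfy $N \leq S$.

```python
def satisfies_odd_constraint(s):
    """Search for a witness N >= 0 such that s == 1 + N + N."""
    # Any witness satisfies N <= s, so a bounded search is sufficient.
    return any(s == 1 + n + n for n in range(s + 1))

# The constraint agrees with the usual parity test on a finite sample.
sample = range(200)
print(all(satisfies_odd_constraint(s) == (s % 2 == 1) for s in sample))
```

The same pattern with $S = N + N$ would describe even multiplicities.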

Although counting constraints are decidable, the complexity of deciding them is high in general [26]. In practice, however, many efficient approaches based on heuristics exist [27], making type checking with counting constraints feasible.

Motivating a new Approach

Type and schema declarations for XML-based languages are usually based on regular expressions; content models are ordered sequences of elements. For unordered content, different formalisms are usually used, as introduced earlier in this section. The new approach suggested in this thesis is to use regular expressions with unordered semantics as the multiset language formalism. This relieves the user of the burden of learning different formalisms for ordered and unordered content. Further, it gives rise to a simple treatment of ordered types under unordered queries, as presented now:

Often, when querying, and sometimes when modelling, the order is irrelevant. When querying XML using XPath it is generally possible to query data based on the sequential order (using the so-called following, preceding, following-sibling and preceding-sibling axes), but in many applications the order of the elements is ignored, especially when querying database-like documents. In Xcerpt, explicitly ordered or unordered query patterns are used to query documents. An unordered query can be considered to have a type with an unordered content specification, which itself is a language of multisets. This is motivated by the fact that it is arguably reasonable to interpret the type of a query as the type representing all data instances that can be queried by the query term, under the assumption of a given type or schema for the queried data. If no type or schema is given for the queried data, the most general type may safely be assumed. The type of a query is a subset of the type of the queried data. When a query has multiple occurrences of the same variable, it becomes necessary to calculate the variable intersection to infer the type of the query, as shown in section 8.3.2. When constructing unordered content, e.g. in Xcerpt, out of queries to ordered typed data, it is easy to construct content whose type is the unordered interpretation of an ordered content model. As an example, consider the following Xcerpt rule:

CONSTRUCT
  unorderedData{ all var C }
FROM
  orderedData[[ var C ]]
END

To infer the type of the result of that rule, it is necessary to give an unordered interpretation of the type of the content queried by the query pattern orderedData[[ var C ]]. Unfortunately, such a type may not be precisely representable using multiplicity lists or L≥-Formulae. As an example, think of a regular content model of the shape $(C, C)^*$, which models content with an even number of elements of type $C$. An unordered interpretation of such a type should arguably reflect the even multiplicity of its $C$-typed members. Neither L≥-Formulae nor multiplicity lists allow this. With counting constraints or Presburger arithmetic, the desired set of multisets can be expressed.
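The idea of an unordered interpretation can be made concrete with a brute-force sketch: a multiset belongs to the unordered interpretation of a regular expression if at least one ordering of its members matches the expression. The function name `in_unordered_language` is hypothetical, symbols are assumed to be single characters, and Python's `re` syntax `(CC)*` stands in for the content model $(C, C)^*$. Trying all permutations is exponential and serves only as an illustration; it is not the counting-constraint technique discussed in this section.

```python
import re
from itertools import permutations

def in_unordered_language(multiset, pattern):
    """True iff some ordering of the multiset's symbols matches the pattern.

    multiset: a string or sequence of single-character symbols.
    pattern:  a regular expression in Python `re` syntax.
    """
    # set(...) removes duplicate orderings caused by repeated symbols.
    return any(
        re.fullmatch(pattern, "".join(ordering)) is not None
        for ordering in set(permutations(multiset))
    )

# Unordered interpretation of (C, C)*: exactly the multisets with an
# even number of C members.
print(in_unordered_language("CCCC", "(CC)*"))
print(in_unordered_language("CCC", "(CC)*"))
# Order does not matter: "BABA" is accepted because "ABAB" matches.
print(in_unordered_language("BABA", "(AB)*"))
```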

R2G2 uses regular expressions with unordered interpretation to model unordered content models. For the operational semantics, counting constraints or Presburger arithmetic formulae, very much in the spirit of sheaves automata [67], are used. The main difference to sheaves automata and sheaves logic lies on the surface: unordered content models in R2G2 are not specified using a special formalism, but with regular expressions.