4 A Model for Documents and Constraints
DOM and XPath Summary
• String, Number and Boolean are the basic data types.
• A DOM tree consists of a set of nodes, represented by Node. A node is either an element node, a text node or an attribute node.
• Every node has a unique identity, a name and a value.
• XPath provides paths for selecting nodes from documents and expressions for simple computations and logical comparisons.
• Paths are either absolute, or relative to variables and are evaluated to a node set using the semantic function S.
• Expressions evaluate to a string, number, boolean or node set using the semantic function E. Conversion functions can be used to convert each of these types to a string, number or boolean.
• Variables are handled using a binding context and the two functions bind and lookup.
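The last two points of this summary can be illustrated with a minimal Python sketch. It is not the formal definition given earlier in the chapter: the field names of Node, the use of object identity for node identity, and the dictionary-based binding context are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False: two nodes are equal only if they are the same object
class Node:
    kind: str   # assumed encoding: "element", "text" or "attribute"
    name: str
    value: str


def bind(context, variable, node):
    """Return a new binding context in which `variable` refers to `node`."""
    extended = dict(context)  # copy, so the outer context is left untouched
    extended[variable] = node
    return extended


def lookup(context, variable):
    """Return the node bound to `variable`; raises KeyError if it is unbound."""
    return context[variable]
```

Returning a fresh context from bind, rather than updating one in place, is a design choice of this sketch; it keeps an outer context valid after an inner binding has been made.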
4.5 Constraint Language
We have put in place the foundations to represent documents in a common format as DOM trees, and to query them for sets of nodes using XPath. We can now specify what it means for a set of documents to be consistent by relating nodes in the documents through constraints. By definition, elements that do not obey these constraints cause an inconsistency.
The choice of constraint language is not entirely straightforward, as evaluating formulae in a language and providing suitable feedback becomes progressively more difficult as the expressive power of the language increases. An earlier approach [Zisman et al., 2000] was based on a simple language that expressed a relationship between a set of “source” and a set of “destination” elements. We have also experimented with a restricted language that expressed constraints between two sets of elements, for example “for every element in set A, there must be an element in set B”. While it was easy to specify diagnostics for these languages – inconsistent elements in the sets could easily be pointed out – it soon became clear that the expressive power of such languages was insufficient for a number of application domains, most notably in software engineering. The UML [OMG, 2000a] is a good counter-example as its validation requires expressiveness beyond propositional logic.
We therefore chose a language based on first order logic as our constraint language, expecting it to be expressive enough for our purposes. We have since confirmed this hypothesis in a number of case studies.
We will introduce our constraint language using the running example. The question “Are all the products that are listed in adverts also present in the catalogue?” can be represented more formally as an assertion: For all Advert elements, there exists a Product element in the Catalogue element where the ProductName subelement of the former equals the Name subelement of the latter. If this condition holds, a consistent relationship exists between the Advert element and the Product element. Otherwise, the Advert element is inconsistent with respect to our rule as there is no matching Product element.
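Before formalising this assertion, it can help to see the check written out by hand. The following Python sketch uses the standard xml.etree.ElementTree module. The sample documents and the function name unmatched_adverts are hypothetical; only the element names are taken from the running example, and string comparison of element text is a simplification of XPath equality.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents; element names follow the running example.
ADVERT_DOCS = [
    "<Advert><ProductName>Kettle</ProductName></Advert>",
    "<Advert><ProductName>Toaster</ProductName></Advert>",
]
CATALOGUE_DOC = """<Catalogue>
  <Product><Name>Kettle</Name></Product>
  <Product><Name>Blender</Name></Product>
</Catalogue>"""


def unmatched_adverts(advert_docs, catalogue_doc):
    """Return the ProductName of every Advert that has no matching Product."""
    catalogue = ET.fromstring(catalogue_doc)
    # Collect the Name of every Product; drop Products that have no Name.
    names = {p.findtext("Name") for p in catalogue.findall("Product")} - {None}
    unmatched = []
    for doc in advert_docs:
        advert = ET.fromstring(doc)
        product_name = advert.findtext("ProductName")
        # "There exists a Product ..." fails if no catalogue name is equal.
        if product_name is None or product_name not in names:
            unmatched.append(product_name)
    return unmatched


print(unmatched_adverts(ADVERT_DOCS, CATALOGUE_DOC))  # → ['Toaster']
```

A hand-written check of this kind has to be repeated for every new rule, which is the motivation for stating such assertions in a dedicated constraint language instead.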
Definition 4.18. Constraint Language Abstract Syntax
p : Path
e : Expr
v : Variable
f : formula ::= ∀v ∈ p (f) | ∃v ∈ p (f) |
                f1 and f2 | f1 or f2 | f1 implies f2 | not f |
                e1 = e2 | same v1 v2
Definition 4.18 gives an abstract syntax of our constraint language. The language is a function-free first order predicate logic that expresses constraints over finite sets. These sets contain elements and attributes of XML documents. Since XPath permits variable references, formulae can make reference to variables bound in superformulae. This allows predicates to be used for testing properties of nodes currently bound by a quantifier. We can express our semi-formal assertion more formally in this language:
∀a ∈ /Advert (∃p ∈ /Catalogue/Product ($a/ProductName=$p/Name))
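One possible in-memory encoding of the abstract syntax in Definition 4.18 is sketched below in Python. The class names, the field names and the show function are assumptions made for illustration and are not part of the definition; paths and expressions are kept as plain strings.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Quantifier:
    kind: str          # "forall" or "exists"
    var: str
    path: str
    body: "Formula"


@dataclass(frozen=True)
class Binary:
    op: str            # "and", "or" or "implies"
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Not:
    body: "Formula"


@dataclass(frozen=True)
class Equals:
    left: str          # expression e1, kept as text
    right: str         # expression e2, kept as text


@dataclass(frozen=True)
class Same:
    var1: str
    var2: str


Formula = Union[Quantifier, Binary, Not, Equals, Same]


def show(f: Formula) -> str:
    """Render a formula in the notation used in the text."""
    if isinstance(f, Quantifier):
        symbol = "∀" if f.kind == "forall" else "∃"
        return f"{symbol}{f.var} ∈ {f.path} ({show(f.body)})"
    if isinstance(f, Binary):
        return f"({show(f.left)} {f.op} {show(f.right)})"
    if isinstance(f, Not):
        return f"not {show(f.body)}"
    if isinstance(f, Equals):
        return f"{f.left}={f.right}"
    if isinstance(f, Same):
        return f"same {f.var1} {f.var2}"
    raise TypeError(f"not a formula: {f!r}")


# The advert rule from the running example.
RULE = Quantifier("forall", "a", "/Advert",
                  Quantifier("exists", "p", "/Catalogue/Product",
                             Equals("$a/ProductName", "$p/Name")))
print(show(RULE))
```

Printing RULE reproduces the formula above, which gives a quick check that the encoding covers the quantifier and equality cases.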
Note that the paths in the constraint refer to elements from different documents, while not explicitly referring to which documents the constraint is applied to. This is a matter for the evaluation semantics that we apply to the language. This evaluation semantics, which we will discuss in the next chapter, will extend our notion of path evaluation to multiple documents.
To ensure constraint integrity, we place some additional static semantic constraints on formulae in the language that cannot be captured in the abstract syntax. We will state these informally here:
Definition 4.19. For any quantifier formula, the variable v must not be equal to a variable bound in an outer formula. This is to prevent double bindings, which would lead to accidental overwriting of the binding context.
Definition 4.20. If a path expression is relative to a variable v, that variable must have been bound by a quantifier in an outer formula.
Definition 4.21. For formulae of the form e1 = e2, any path expressions contained in e1 or e2 must be relative to a variable. For example, $a/ProductName=$p/Name is legal, while /Advert/ProductName=$p/Name is not. We made this choice because we wanted to restrict distributed path evaluation over multiple documents to quantifiers, to simplify implementation. It also eliminates any implicit quantification if the path points to entries in multiple documents, which would be better captured as an explicit quantifier.
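Definitions 4.19 to 4.21 lend themselves to a simple recursive check. The Python sketch below is one way this could be done and is not a description of our implementation. It assumes a hypothetical encoding of formulae as nested tuples, treats each expression as a single path or literal held as a string, and additionally assumes that the variables of a same formula must be bound, which the definitions above do not state.

```python
import re

VAR_REF = re.compile(r"\$(\w+)")  # matches variable references such as $a


def unbound_refs(text, bound):
    """Definition 4.20: every referenced variable must be bound further out."""
    return [f"4.20: variable {v} is not bound"
            for v in VAR_REF.findall(text) if v not in bound]


def check(formula, bound=frozenset()):
    """Return a list of violations of Definitions 4.19 to 4.21."""
    tag = formula[0]
    if tag in ("forall", "exists"):
        _, var, path, body = formula
        errors = []
        if var in bound:  # Definition 4.19: no double bindings
            errors.append(f"4.19: variable {var} is already bound")
        errors += unbound_refs(path, bound)
        return errors + check(body, bound | {var})
    if tag in ("and", "or", "implies"):
        return check(formula[1], bound) + check(formula[2], bound)
    if tag == "not":
        return check(formula[1], bound)
    if tag == "eq":
        errors = []
        for expr in formula[1:]:
            if expr.startswith("/"):  # Definition 4.21: no absolute paths
                errors.append(f"4.21: path {expr} is not relative to a variable")
            errors += unbound_refs(expr, bound)
        return errors
    if tag == "same":  # assumption: both variables must be bound
        return [f"4.20: variable {v} is not bound"
                for v in formula[1:] if v not in bound]
    raise ValueError(f"unknown formula tag: {tag}")


legal = ("forall", "a", "/Advert",
         ("exists", "p", "/Catalogue/Product",
          ("eq", "$a/ProductName", "$p/Name")))
illegal = ("forall", "a", "/Advert",
           ("exists", "a", "/Catalogue/Product",
            ("eq", "/Advert/ProductName", "$p/Name")))
print(check(legal))    # → []
print(check(illegal))  # three violations: 4.19, 4.21 and 4.20
```

The second formula rebinds a, uses an absolute path inside an equality, and refers to p without binding it, so it violates each of the three definitions once.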
Since the only iteration constructs in the language are quantifiers that iterate over finite sets of elements retrieved from documents, we cannot directly express constraints that require any form of infinity. For example, the constraint “for all elements x, the children of x are prime numbers” would require quantification over the integers to express the latter half of the constraint and thus cannot be expressed directly in this language. Such a case would have to be handled by implementing a predicate that performs a primality test, using a Turing-complete programming language. Nevertheless, as will be shown in the following, the power of the constraint language is great enough to express a wide range of static semantic constraints, including those of the Unified Modeling Language [OMG, 2000a].
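As an illustration of such an externally implemented predicate, the following Python sketch tests whether all children of an element hold prime numbers. The function names and the sample element are hypothetical, and how a predicate of this kind would be attached to the constraint language is not specified here.

```python
import xml.etree.ElementTree as ET


def is_prime(n: int) -> bool:
    """Primality test by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def children_are_prime(element) -> bool:
    """True if every child of `element` contains a prime number as its text."""
    for child in element:
        text = (child.text or "").strip()
        if not text.isdigit() or not is_prime(int(text)):
            return False
    return True


# Hypothetical sample element.
x = ET.fromstring("<x><n>2</n><n>7</n><n>13</n></x>")
print(children_are_prime(x))  # → True
```

The quantification over the integers is hidden inside the loop of is_prime, which is why it needs a general-purpose programming language rather than a quantifier over document nodes.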
Constraint Language Summary
• The constraint language combines first order logic with XPath in order to relate nodes in DOM trees.
• The language includes universal and existential quantifiers, boolean connectives and predicates.
4.6 Chapter Summary
This chapter contains the groundwork for the formal definition of our consistency checking semantics. It defines a model for XPath and the Document Object Model, and gives a semantics for evaluating paths over DOM trees.
Though most of it is based on [Wadler, 1999], we believe that this extended treatment of the DOM and XPath will be valuable as reference material in its own right for the semantic definition of other types of systems that deal with XML input.
We have also introduced the abstract syntax of a constraint language that relates elements in DOM trees. In the following chapter we will look at consistency checking as the process of evaluating constraints written in this language and providing diagnostic results. We will do this by combining the abstract syntax with the document and path models presented here.
In this chapter we get to the core theoretical contribution of our investigation: how to evaluate constraints between documents and provide a rich diagnostic result that enables users to identify inconsistency, and trace it effectively should it involve multiple documents.
We will first define a standard boolean semantics for evaluating constraints over multiple DOM models. This boolean semantics will follow the standard definition of first order logic, and adapt it to apply to a set of DOM trees.
Following an illustration of the shortcomings of the boolean semantics as a diagnostic mechanism, we will move on to defining a novel evaluation semantics for our constraint language that connects inconsistent elements in documents through hyperlinks called consistency links. By using a denotational semantics as a formal basis, and notions from game semantics [Hintikka and Sandu, 1996] for explanation, we cover the entire language and provide examples for each construct.
Towards the end of the chapter we will cover some of the assumptions and design decisions that went into defining our evaluation semantics. We will also investigate its relationship to the standard boolean interpretation, providing proofs that de Morgan’s laws as well as other standard equivalences hold.