Section 3.3 is closely based on [51], Section 3.5 on [13] and [12].

3.1 Introduction

The vision and principles of a versatile Web query language, capable of accessing Web data in different formats yet no less suitable for each format than a language specialized to that format, have guided the continuing development of the Xcerpt [54, 187] query language. Xcerpt as described in [187] was originally developed with a focus on XML data, though viewed as graph data as in the XML Information Set [81] (by resolving id/idref links in the data model). Nevertheless, the design of the language has been carefully tailored from the beginning to also accommodate other Web data formats such as RDF [150] or Topic Maps [133].

This foundation has been exploited to refine Xcerpt, its semantics and formal foundation, and its evaluation towards the vision of versatility outlined in the previous chapter. In this chapter, we focus on refinements to the language Xcerpt itself. In Part II, we describe an integrated formal perspective on Web data models and queries and demonstrate in Chapter 7 how to describe Xcerpt and its data model in terms of that formal framework (an extension of datalog with value invention). Moreover, we extend the framework to include other Web query languages like XPath, XQuery, and SPARQL, thus opening the door for versatility not only on the level of the data but even on the level of the language. In Part IV, we finally show that versatility does not have to come at a price in performance or evaluation complexity: the CIQCAG algebra is proposed, which allows the evaluation of Xcerpt, XPath, XQuery, SPARQL, and many other Web query languages but is capable of delivering optimal or at least competitive time and space complexity for restricted cases such as XPath (tree queries on tree data). CIQCAG also extends previous results for (tree) query evaluation on tree data to a substantially larger class of graph data, viz. continuous image graphs. As Chapter 7 connects Xcerpt to our formal foundation for Web data and queries, Chapter 13 connects that framework to CIQCAG by showing how to compile Web queries to CIQCAG expressions.

Returning to the topic of this chapter, we focus, as stated, on the refinement of the Xcerpt language towards the vision of versatility. The result is called Xcerpt 2.0 and is, in many ways, a summary of one of the strands of work in the REWERSE working group on “Reasoning-aware Querying”. The following chapter presents some of the highlights of that refinement but refrains from addressing all the technical details (such as the full grammars, the language meta model, etc.). These details can be found in several deliverables and publications of the aforementioned working group, on which this chapter draws heavily: Xcerpt 2.0 is first drafted in [59] and fully specified in [101] (both REWERSE deliverables); the latter also contains a discussion of node identity and the proper graph data model of Xcerpt 2.0. The versatile aspects of Xcerpt 2.0 are first described in [51]. Finally, the module extensions of Xcerpt 2.0 are based on the modularization framework described in [13], which is realized with the REWERSE composition framework Reuseware and has previously been published in [12].

In the following, we first briefly recall Xcerpt 2.0 in Section 3.2, focusing on differences to previous versions of Xcerpt. Then we illustrate the versatility of Xcerpt by a number of examples in Section 3.3. Finally, we highlight two particular issues where the design of Xcerpt 2.0 differs notably from that of previous versions: (1) adding node (or object or surrogate) identity to Xcerpt and thus moving from infinite regular trees to graphs as data model, in Section 3.4; (2) modules with parameters for Xcerpt (and any rule language), in Section 3.5. Neither of these issues is, in and of itself, novel: XQuery uses node identity, and the effect of object identity on query languages in general has long been investigated, e.g., in [1]; regular path expressions have been studied extensively for object-oriented and semi-structured data and are used, e.g., in Lorel [3]; modules for rule languages such as Prolog and Datalog have been considered, e.g., in [47]. However, their application to Xcerpt illustrates the progress of Xcerpt towards flexible, versatile Web queries. In particular, node identity and modules can be considered essential features of a versatile Web query language: node identity is, arguably, necessary to properly support, e.g., occurrence queries in cyclic data or change tracking due to updates. Versatile access to Web data often requires mediating views or rules that provide an integrated view of data from different sources. Encapsulating such views in modules ensures proper encapsulation of data and separation of concerns at the level of tasks or rule sets rather than individual rules.

3.2 Xcerpt 2.0: Overview in 5000 Words

3.2.1 Xcerpt: A Rough Sketch

Xcerpt is a semi-structured query language for the Web, but unique among the exemplars of that type of query language (for an overview see [16] and [100]) in that it combines aspects of different languages in novel ways, aiming towards a versatile query language as defined in [58, 61] and Chapter 2.

(1) In its use of a graph data model, it stands closer to early semi-structured query languages such as Lorel [3] than to current W3C XML query languages such as XPath, XQuery, or XSLT. A graph data model enables Xcerpt to faithfully represent id/idref links in XML as well as arbitrary RDF graphs. Previous versions of Xcerpt [187] lack (node or object) identity and are thus better characterized as having infinite regular trees as data model, cf. [80, 1]. However, Xcerpt 2.0 introduces full node identity and identity variables and thus moves towards a graph data model as in Lorel or object-oriented databases. For details see Section 3.4.

(2) In its aim to address all specificities of XML with great care, it resembles current W3C recommended XML query languages such as XSLT [72] or XQuery [35]. Xcerpt is tailored to XML in numerous ways, e.g., by proper support for attributes, namespaces [44], XML base [151], comments, and processing instructions. This is achieved without sacrificing the conceptual simplicity and syntactical conciseness of the language. Some aspects of XML are treated differently than in the W3C query languages, e.g., the transparent resolution of non-hierarchical relations.

(3) In using (slightly enriched) patterns (or templates or examples) of the sought-for data for querying, it resembles the “query-by-example” paradigm [205] and XML query languages such as XML-QL [84]. In contrast, current XPath, XSLT, and XQuery use navigational access to XML data, which is very convenient for unary selection where path expressions can be used, but quickly becomes unwieldy for n-ary queries where more complex, often nested FLWOR loops must be employed.

(4) In offering a consistent extension of XML to overcome certain restrictions of XML that seem arbitrary in the context of Web querying and Xcerpt in particular, it is ready to incorporate access to data represented in richer data representation formats. Instances of such features are siblings whose relative order is irrelevant (and cannot be queried) and more flexible label alphabets.

(5) In providing (syntactical) extensions for querying, among others, RDF, Xcerpt becomes a versatile query language (as defined in [58]).

(6) In a strict separation of querying and construction and in its use of logical variables and deductive rules, it resembles logic programming languages or Datalog. In contrast, SQL and XQuery, e.g., mix construction and querying (nested queries) and use explicit references to views rather than rule chaining.

Most of these characteristics hold also for earlier versions of Xcerpt, but are further strengthened in Xcerpt 2.0. This holds particularly for items 1, 2, and 5 and, in general, strengthens Xcerpt’s character as a versatile query language in the sense of Chapter 2.

As briefly mentioned above, Xcerpt uses to a large extent the same concepts for data and queries, in that each data item can also serve as a query and a query is mostly an example or pattern of the sought-for data. Instead of using separate concepts and syntax for queries (as in navigational query languages such as XQuery [35]), Xcerpt uses terms for representing both data and queries. All data terms are also query terms, but query terms offer some additional constructs that allow (a) the extraction of data by using logical variables, (b) the specification of queries that are only incomplete patterns of the data, i.e., where more nodes may occur in the data than specified in the query, and (c) the specification of formulas in terms, i.e., conjunction, disjunction, negation, optionality, etc.:

– Logical Variables. In query terms, logical variables are used to indicate which data is to be selected and to join data (indicated by multiple occurrences of the same variable, as in logic programming languages). The result of a query is conceptually a set of tuples, each representing a combination of bindings (or matches) for all the variables occurring in the query term. For each tuple, a data term must exist that matches the query where all the variables are substituted by the bindings of the tuple.

– Separation of Querying and Construction. In contrast to query languages such as SQL or XQuery, construction and querying are strictly separate in Xcerpt; in particular, there are no nested queries in Xcerpt (rather, rules and rule chaining are used). The data constructed by a rule is specified in construct terms that contain variables from the corresponding query terms acting as placeholders for selected data. Additionally, construct terms make use of grouping constructs to return all or some of the alternative bindings of a variable.

– Incomplete Patterns. In most cases, queries specify just enough restrictions on the data to be returned, as required by the query intent, rather than specifying full or “total” patterns of the data. Xcerpt supports such queries by providing constructs to express that a pattern is incomplete in breadth (i.e., there can be more children than specified), depth (i.e., there can be additional nodes and edges between the matched nodes), etc.

– Terms as Formulas. Query terms are not only augmented by variables, but also by constructs for expressing negation, disjunction, conjunction, and optionality.
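The interplay of logical variables and incomplete patterns can be illustrated with a small Python sketch. It is not Xcerpt's matching algorithm: the names Var, Term, match, and bindings as well as the partial flag are hypothetical, only ordered terms and incompleteness in breadth are modelled, and negation, optionality, and incompleteness in depth are left out.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class Var:
    """A logical variable (hypothetical stand-in for an Xcerpt variable)."""
    name: str


@dataclass(frozen=True)
class Term:
    """An ordered term; `partial` marks a pattern as incomplete in breadth."""
    label: str
    children: tuple = ()
    partial: bool = False


def match(pattern, data, env: Dict[str, object]) -> Iterator[Dict[str, object]]:
    """Yield every extension of `env` under which `pattern` matches `data`."""
    if isinstance(pattern, Var):
        if pattern.name not in env:
            yield {**env, pattern.name: data}      # first occurrence: bind
        elif env[pattern.name] == data:
            yield env                              # repeated occurrence: join
        return
    if isinstance(pattern, str) or isinstance(data, str):
        if pattern == data:                        # text must match exactly
            yield env
        return
    if pattern.label == data.label:
        yield from _match_children(pattern.children, data.children,
                                   env, pattern.partial)


def _match_children(pats, kids, env, partial):
    if not pats:
        # A total pattern must consume all children, a partial one need not.
        if partial or not kids:
            yield env
        return
    if not kids:
        return
    for env2 in match(pats[0], kids[0], env):
        yield from _match_children(pats[1:], kids[1:], env2, partial)
    if partial:
        # Incomplete in breadth: the first data child may be skipped.
        yield from _match_children(pats, kids[1:], env, partial)


def bindings(pattern, data) -> List[Dict[str, object]]:
    """All variable bindings, one dictionary per match."""
    return list(match(pattern, data, {}))
```

Matching a partial pattern with a single editor child against an editors term with three editor children yields one binding per editor, whereas the same pattern without the partial flag yields none, since a total pattern must account for every child. The sketch returns a list and may therefore contain duplicate bindings; a set of tuples, as described above, would remove them.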

In the remainder of this section, we briefly outline Xcerpt’s data model and data terms, highlighting the changes in Xcerpt 2.0 compared to previous versions. Then, we discuss how construct and query terms differ from data terms.

3.2.2 Xcerpt 2.0: Data Model

As stated above, Xcerpt 2.0 uses a graph data model. More precisely, Xcerpt provides access to one or more data graphs (that are usually stored in data units called “documents” identified by IRIs [90]). Each data graph is a rooted, directed, node-labeled, ordered, unranked graph with two types of nodes:

Definition 3.1 (Element nodes). Element (or structural) nodes represent XML elements or similar structured data items (e.g., resources in RDF) that contain a list of references to further nodes (the node’s children).

Each element node is decorated further with a dictionary (or associative list) of (XML-style) attributes. Some attributes are predefined and exist at all nodes, viz. the label and namespace IRI (cf. [44]); others are specified in the data, e.g., as XML attributes. Just like in XML, attributes are single-valued and unordered, i.e., for each attribute name (dictionary key) a single value exists, and the order of the key-value pairs is not significant and cannot be queried. Attributes may be hereditary, i.e., shared by all descendants of a node unless there is an intermediary node that provides a differing value for that attribute. Examples of hereditary attributes are namespaces [44] and base IRIs [151] in XML documents.
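The lookup rule for hereditary attributes (the nearest enclosing declaration wins) can be sketched with Python's standard collections.ChainMap. This is an illustration only: the scope names and the base IRI http://example.org/ are assumptions, and the sketch treats scopes as a simple nesting chain rather than as part of a graph.

```python
from collections import ChainMap

# Outermost scope: hypothetical document-level declarations.
document_scope = ChainMap({"base": "http://example.org/", "ns-default": ""})

# A nested node that declares a differing default namespace.
content_scope = document_scope.new_child(
    {"ns-default": "http://www.w3.org/1999/xhtml"}
)

# A descendant without declarations of its own inherits from its ancestors.
body_scope = content_scope.new_child()


def hereditary(scope: ChainMap, name: str) -> str:
    """Return the value of a hereditary attribute visible in `scope`."""
    return scope[name]
```

Here body_scope sees the XHTML namespace declared by its parent but still inherits the base IRI from the document scope, while document_scope is unaffected by the inner declaration.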

In contrast to Xcerpt 1.0, element nodes in Xcerpt 2.0 have an implicit object or surrogate identity, and there are three kinds of equality between element nodes: label (or shallow) equality, structural (or deep) equality, and identity-based equality. The first holds if two nodes have the same label, ignoring any child nodes; the second holds if they have the same label and for each child of one node there is a corresponding child of the other node such that the two children are themselves deep equal (in the presence of order, the corresponding children must be in the same order); the third holds only between a node and itself. For details on node identity see Section 3.4.
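The three notions of equality can be made concrete in a short Python sketch. The class Element and the three functions are hypothetical names, plain strings stand in for text nodes, and deep equality is only implemented for ordered children; cycles are handled by assuming that a pair of nodes already under comparison is equal.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(eq=False)  # eq=False keeps Python's identity-based == and hash
class Element:
    label: str
    children: List["Node"] = field(default_factory=list)


Node = Union[Element, str]  # a plain str stands in for a text node


def label_equal(a: Element, b: Element) -> bool:
    """Shallow equality: same label, children are ignored."""
    return a.label == b.label


def identity_equal(a: Node, b: Node) -> bool:
    """Identity-based equality: holds only between a node and itself."""
    return a is b


def deep_equal(a: Node, b: Node, _assumed=None) -> bool:
    """Structural equality for ordered children, safe on cyclic graphs."""
    if _assumed is None:
        _assumed = set()
    if isinstance(a, str) or isinstance(b, str):
        return isinstance(a, str) and isinstance(b, str) and a == b
    key = (id(a), id(b))
    if key in _assumed:
        return True  # pair is already being compared further up the call stack
    _assumed.add(key)
    return (
        a.label == b.label
        and len(a.children) == len(b.children)
        and all(deep_equal(x, y, _assumed)
                for x, y in zip(a.children, b.children))
    )
```

Two separately constructed editor nodes with the same text child are label equal and deep equal but not identical, which is exactly the distinction that node identity adds to the data model.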

Element nodes closely resemble element information items from [81]. The handling of attributes, however, deviates notably from the XML Information Set to emphasize the distinction between elements and attributes: attributes are simple key-value pairs, where the key is an XML name (and thus may consist of prefix, IRI, and local name) and the value is an arbitrary string. No further information can be attached to attributes.

Each element node has zero or more edges to other nodes, called its children. These edges are always ordered. However, in contrast to pure XML, one can specify whether this order is significant, i.e., whether it has to be preserved during storage or transformation and can be queried. All element nodes originating from XML documents are by default ordered. Element nodes where the order is significant are called ordered, element nodes where the order is insignificant unordered. There are no further restrictions on the edges, i.e., the graph may be cyclic, may have loops, and may have multi-edges, i.e., the same two nodes may be connected by several edges, e.g., if a node is the 2nd, 4th, and 12th child of another one.
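A child list that holds references rather than copies is enough to represent loops and multi-edges. The Python sketch below is illustrative only: Element, positions_of, and same_child_edges are hypothetical names, and treating a mixed ordered/unordered comparison as unordered is an assumption of the sketch, not something the data model prescribes.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # nodes are compared by identity
class Element:
    label: str
    ordered: bool = True
    children: List["Element"] = field(default_factory=list)


def positions_of(parent: Element, child: Element) -> List[int]:
    """1-based positions at which `child` occurs among `parent`'s children."""
    return [i + 1 for i, c in enumerate(parent.children) if c is child]


def same_child_edges(a: Element, b: Element) -> bool:
    """Compare child edges as a sequence (ordered) or as a multiset (unordered)."""
    ids_a = [id(c) for c in a.children]
    ids_b = [id(c) for c in b.children]
    if a.ordered and b.ordered:
        return ids_a == ids_b
    return Counter(ids_a) == Counter(ids_b)
```

If a node x occurs as the first and third child of p, positions_of(p, x) returns both positions, and appending p to its own child list creates a loop without any special treatment.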

(XML) element nodes are the only complex data structure in Xcerpt. Other complex data structures such as lists (or sequences), homogeneous or heterogeneous records, sets, and dictionaries (or associative lists) can be simulated as terms, but no specific support is offered.

The only other node type is that of atomic or content nodes:

Definition 3.2 (Content nodes). Content (or atomic) nodes represent data items that are considered unstructured in the context of Xcerpt, i.e., they contain no list of references to further nodes and thus always play the role of leaf nodes in the data graph.

Content nodes can be further distinguished into

(1) text nodes that represent the textual content of element nodes. The only attribute of a text node is the string it represents. The same restrictions as for text nodes in XSLT [72], XQuery, and XPath [94] apply, i.e., (a) text nodes never represent an empty string, and (b) two text nodes can never be direct siblings of each other. Two nodes are direct siblings if either they are children of the same ordered element node and are consecutive in the children order, or they are children of the same unordered element node. Thus, an unordered element node may not have more than one text node child. If two text nodes are constructed as direct siblings, they are collapsed.

(2) comment nodes that represent comments, i.e., annotations on the actual data that are not meant for machine processing. Like text nodes, they have only one attribute: the content of the comment. However, in contrast to text nodes, no further restrictions are placed on comment nodes.

(3) processing instruction nodes that represent processing instructions, i.e., annotations on the actual data that are meant for processing by specific “target” services. They carry two attributes: the content of the processing instruction (usually some form of instructions for the “target” service) and the name of the “target” service.
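The two restrictions on text nodes amount to a normalization step when children are constructed. The following Python sketch shows that step for an ordered element node; normalize_text_children is a hypothetical helper, plain strings stand in for text nodes, and any other object stands in for an element, comment, or processing instruction node.

```python
from typing import List


def normalize_text_children(children: List[object]) -> List[object]:
    """Drop empty text nodes and collapse consecutive text nodes (ordered case)."""
    result: List[object] = []
    for child in children:
        if isinstance(child, str):
            if child == "":
                continue                      # text nodes are never empty
            if result and isinstance(result[-1], str):
                result[-1] += child           # direct text siblings collapse
                continue
        result.append(child)
    return result
```

For an unordered element node all text children are direct siblings and would collapse into a single node; since the concatenation order is then not determined by the data model, the sketch deliberately covers only the ordered case.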

In Chapter 5, the formal notion of CIQLog data graphs is introduced, which are a (slight) generalization of Xcerpt data graphs, and a mapping from the Xcerpt data model to CIQLog data graphs is discussed. This mapping faithfully represents all of the above issues.

3.2.3 A Syntax for Data: (Data) Terms

As syntax for representing data in the above data model, Xcerpt chooses terms. However, these Xcerpt terms extend standard logic terms in several ways to accommodate the richer data model of Xcerpt: we need to add means to represent ordered and unordered terms, cyclic data, attributes, and hereditary information.

An Xcerpt term is called a data term if it maps directly to a data graph as defined in Section 3.2.2. For that, it may only contain four types of terms:

(1) atomic or content data terms that represent a content node. The most common atomic data term is a simple string representing a text node.

(2) structured data terms that represent an element node in the data model.

(3) references to other (structured) data terms expressed by a term identifier.

(4) declarations of hereditary attributes such as namespace or XML base [151] declarations defining a scope for those attributes.
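How term identifiers and references turn a tree-shaped data term into a (possibly cyclic) data graph can be sketched in Python as a two-pass resolution. Elem, Ref, and resolve are hypothetical names, and the sketch assumes that the input term is a tree whose references appear as leaves.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Ref:
    """A reference to the term carrying the identifier `target`."""
    target: str


@dataclass(eq=False)
class Elem:
    label: str
    children: List[object] = field(default_factory=list)
    ident: Optional[str] = None  # the identifier defined for this term, if any


def resolve(root: Elem) -> Dict[str, Elem]:
    """Replace every Ref below `root` by the element that defines its target."""
    defs: Dict[str, Elem] = {}

    def collect(node: object) -> None:
        # Pass 1: record all identifier definitions.
        if isinstance(node, Elem):
            if node.ident is not None:
                defs[node.ident] = node
            for child in node.children:
                collect(child)

    def link(node: Elem) -> None:
        # Pass 2: follow only the original tree edges, so cycles created
        # by linking are never traversed.
        for i, child in enumerate(node.children):
            if isinstance(child, Ref):
                node.children[i] = defs[child.target]  # KeyError if dangling
            elif isinstance(child, Elem):
                link(child)

    collect(root)
    link(root)
    return defs
```

Two articles that cite each other by identifier end up as two element nodes with edges in both directions, a structure that a plain tree could not express.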

Figure 1 gives an example of an Xcerpt data term drawn from the domain of bibliography management: Mixing typical bibliographic records (similar to Bibtex or DBLP) with actual content (represented as XHTML or in a Docbook-style format) it combines

– so-called document-oriented with data-oriented XML, i.e., data with flexible, recursive structure and data with rather rigid and flat structure. Recursive structure is used, e.g., for the content of articles in Docbook-style format.

    bib{
      journal.adm @ journal{
        title["Applied Data Management"]
        editors[
          editor-in-chief["Titus Pomponius Atticus"]
          editor(region="Africa")["Marcus Aemilius Aemilianus"]
          editor(region="Gaul")["Aulus Hirtius"
            affiliation["Governor, Transalpine Gaul"] ]
          editor(region="Cilicia")["Marcus Tullius Cicero"
            affiliation["Governor, Cilicia"] ]
        ]
        publisher["Titus Pomponius Atticus"]
        volumes[
          journal.adm.v10 @ volume[
            journal.adm.v10.n1 @ number(type="special-issue"){
              title["Data Processing Challenges in the Age of Wax Tablets"]
              editorial[ ^ article.66.cicero.wax ]
              year["60"]
              month["july"]
            }
            journal.adm.v10.n2 @ number{
              year["60"]
              month["november"]
            }
          ]
        ]
      }

      conf.dmmc @ proceedings{
        editors[
          editor["Marcus Aemilius Lepidus"
            affiliation["Consul, SPQR"] ]
          editor["Gaius Julius Caesar Octavianus"]
          editor["Marcus Antonius"]
        ]
        title[
          "Advancements in Data Management for Military and Civil Application"
        ]
        invited-papers[
          ^inproc.44.brutus
          ^article.66.scaurus.qumran
        ]
        abbrev["DMMC"]
        year["44"]
        month["july"]
        location["Mutina"]
        publisher["SPQR"]
      }

      article.66.scaurus.qumran @ article{
        author["Marcus Aemilius Scaurus"
          affiliation["Tribun, Gnaeus Pompeius Magnus"] ]
        title["From Wax Tablets to Papyri: The Qumran Case Study"]
        in(scrolls="102-112")[ ^ journal.adm.v10.n1 ]
        citations[
          cite(ref="article.66.cicero.wax")[ ]
          cite(type="formatted")["M. Aemilius Scaurus (104): A Case for Permanent Storage of Senate Proceedings. In: M. Aemilius Scaurus, ed. (104): "
            i["Princeps Senatus: Honor and Responsibility"]
            ", Chapter 2, 14-88."]
        ]
      }

      article.66.cicero.wax @ article{
        authors[
          author["Marcus Tullius Cicero"
            affiliation["Governor, Cilicia"] ]
          author["Marcus Aemilius Lepidus"
            affiliation["Gens Aemilia"] ]
          author["Marcus Tullius Tiro"
            affiliation["Secretary, M. T. Cicero"] ]
        ]
        title["Space- and Time-Optimal Data Storage on Wax Tablets"]
        in(scrolls="1-94")[ ^ journal.adm ]
        content(type="xhtml")[
          declare ns-default="http://www.w3.org/1999/xhtml"
          body[
            <!-- incomplete due to melted letters on tablet -->
            h1(id="contributions")["Contributions"]
            h1["A History of Data Storage: From Stone to Parchment"]
            p["Despite " cite[ ^ article.66.scaurus.qumran ] ... ]
            ol[
              li[ em[ strong["Homeric"] " Age:"] ... ]
              li[ em["Age of the " strong["Kings"] ":"] ... ]
            ]
            h1(id="tiro")["Notae Tironianae"]
            img(title="Tironian et" src=...)[ ]
            p["As discussed in " a(href="#contributions")[ ... ] ]
            h1(id="tachygraphy")["Challenges for Tachygraphy on Wax"]
            p["Though conditions for writing on wax tablets are adverse to tachygraphy, systems as in " a(href="#tiro")[ ... ] ]
          ]
        ]
      }

      inproc.44.brutus @ inproceedings{
        authors[
          author["Marcus Antonius"
            affiliation["Consul, SPQR"] ]
          author["Decimus Junius Brutus"

In document Furche, Tim (2008): Implementation of Web Query Languages Reconsidered: Beyond Tree and Single-Language Algebras at (almost) No Cost. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 57-127)