2.1 Data Integration towards XML & Relational Database Integration
2.1.7.3 XML query languages
2.1.7.3.1 XPath
The XML Path language [W3C 1999b], known as XPath, is a W3C recommendation from November 1999. Its name derives from its central feature, the “path expression”, which addresses the nodes of an XML tree hierarchically. XPath thus provides a common method for locating and extracting information from XML documents based on specified criteria. It was defined during the development of XSLT and XPointer, both of which use its functionality, and it was designed to provide unambiguous navigation of XML documents.
XSLT uses XPath expressions to select nodes and a restricted subset of XPath syntax for its match patterns; XPointer adds further syntax mechanisms to extend XPath's functionality (XPointer allows forward and backward addressing to specific XML locations internal to a document and to locations in external XML documents). XPath is also used by XQuery, an emerging technology that will provide standardized access to data in the RDBMS using XML.
XPath 2.0 [W3C 2007a] is a superset of the old XPath 1.0 version, with new capabilities to support a richer set of data types and to take advantage of the type information that becomes available when documents are validated using XML Schema (XSD) [W3C 2001]. XPath 2.0 allows the processing of values conforming to the XQuery/XPath Data Model (XDM) defined in [W3C 2007b]. This data model provides, in addition to a tree representation of XML documents, the atomic values such as integers, strings and booleans, and sequences that may contain both references to nodes in an XML document and atomic values. A backwards compatibility mode is provided to ensure that nearly all XPath 1.0 expressions continue to deliver the same result with XPath 2.0.
XPath works on the XML document model, which can be represented as a hierarchical tree structure. This model has seven node types (root, namespace, processing instruction, element, text, attribute, and comment nodes). XPath uses path expressions to select nodes or node-sets in an XML document. Path expressions are very similar to the paths used in a traditional computer file system. An XPath expression is the series of steps required to identify the desired section of XML. Each of these steps represents a layer (or possibly several layers if a wildcard is used) within the XML tree. At each step, tests may optionally be performed to narrow the search based on criteria specified by a predicate. The result of an XPath expression may be a selection of nodes from the input documents, an atomic value, or, more generally, any sequence allowed by the data model.
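The step-and-predicate mechanism described above can be sketched with Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath. This is an illustration under assumptions, not a full XPath processor: the sample document and its element names are hypothetical and were invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration only.
XML = """
<bookstore>
  <book category="web">
    <title>Learning XML</title>
    <price>39.95</price>
  </book>
  <book category="cooking">
    <title>Everyday Italian</title>
    <price>30.00</price>
  </book>
</bookstore>
"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# Two steps, one per layer of the tree: first book, then title.
all_titles = [t.text for t in root.findall("book/title")]

# A predicate on the first step narrows the selection by attribute value.
web_titles = [t.text for t in root.findall("book[@category='web']/title")]

# The descendant step ".//" spans several layers at once.
all_prices = [p.text for p in root.findall(".//price")]

print(all_titles)
print(web_titles)
print(all_prices)
```

Each call returns a list of element nodes; the text is extracted afterwards in Python, since this subset of XPath does not return atomic values directly.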
2.1.7.3.2 XSLT
The eXtensible Stylesheet Language Transformations (XSLT) [W3C 1999a], a W3C recommendation from November 1999, is a common technology for carrying out XML document transformations. It is one of the two main parts of the eXtensible Stylesheet Language (XSL), whose version 1.1 specification became a W3C recommendation in December 2006. XSL is a language for expressing stylesheets. It makes it possible to express a very wide range of transformations on an XML document, which strengthens the case for XML as a general-purpose file format for data.
The second part of the XSL version 1.1 specification is XSL Formatting Objects (XSL-FO), an XML vocabulary for specifying formatting semantics that offers a wide range of possibilities for print, display or oral presentation.
XSLT version 2.0 [W3C 2007c], a W3C recommendation from January 2007, is a revision of the version published in November 1999. It is compatible with Namespaces and XPath 2.0. XSLT 2.0 shares the same data model [W3C 2007b] as XPath 2.0, and it uses the XPath 2.0 library of functions and operators. One of the principal uses of XSLT is to translate information from one XML vocabulary to another.
XSLT operates on an abstract model that views an XML document as a tree, although a processor is not required to actually build that tree. It provides means to access the document tree in order to access nodes by name or content, search for specific content or nodes, and manipulate content or nodes. XML syntax was chosen for XSLT for many reasons, the most important of which were:
● Reuse of the XML parser minimizes footprint.
● Familiarity and ease of understanding.
● Reuse of the lexical apparatus of XML for handling whitespace, Unicode, namespaces, and so forth.
XSLT has many features, such as:
● XSLT Stylesheets are XML documents and they follow the XML rules.
● Multiple input sources.
● Ability to select document fragments using XPath expressions.
● Named and/or pattern-based templates.
● Parameterized templates.
● Intermediate transformation states may be managed using variables.
● Stylesheets may be combined using include or import.
● Built-in support for output sorting and numbering.
● Both XML and non-XML output is supported.
● The functions in XSLT have no side effects and can be processed in any order.
● How we code our stylesheet will impact what parts can be processed independently.
● The parts of the stylesheet can be processed in any order and that does not impact the order of the output. The order of the output depends on the order in the XML file.
● XSLT supports recursion. This feature makes it a very powerful tool.
An XSLT stylesheet accepts XML from the abstract tree model of the source document, known as the source tree, and processes it to produce a result tree. The stylesheet defines the rules for transformation, based on the XML elements and attributes in the source tree. It may also contain formatting information, called formatting objects (or FOs), and applies those objects in the transformation. A single stylesheet can apply to multiple XML documents, provided their elements and structure are consistent with those specified by the stylesheet. Conversely, a single source XML document can be processed by multiple XSLT stylesheets: for example, the same source could be transformed into HTML, into voice markup, and into a rendering for print. These transformations could run as separate parallel processes (each invocation running an XSL processor in a separate memory space) or sequentially (each invocation running after the previous one completes).
The advantage of parallel processing is that an XSLT error in one stylesheet will not prevent the others from running, whereas in sequential processing an error also terminates every downstream process. The disadvantage of parallel processing is its higher system memory usage.
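The idea of pattern-based templates that turn a source tree into a result tree can be illustrated without an XSLT processor. The following Python sketch is only an analogy under stated assumptions: it is not XSLT, the rule table `RULES`, the helper names and the sample document are all hypothetical, and rules are matched on element names only. It does show the recursive "apply templates" pattern and a default rule that descends into children.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document, invented for illustration only.
SOURCE = (
    "<bookstore>"
    "<book><title>Learning XML</title></book>"
    "<book><title>Everyday Italian</title></book>"
    "</bookstore>"
)

def apply_templates(node, parent):
    """Apply the rule matching the node's name, writing output under parent."""
    rule = RULES.get(node.tag, default_rule)
    rule(node, parent)

def default_rule(node, parent):
    # Fallback rule: emit nothing for this node, keep processing its children.
    for child in node:
        apply_templates(child, parent)

def bookstore_rule(node, parent):
    # Emit a <ul> and process the children recursively inside it.
    ul = ET.SubElement(parent, "ul")
    for child in node:
        apply_templates(child, ul)

def title_rule(node, parent):
    # Emit one <li> per title, copying the text content.
    li = ET.SubElement(parent, "li")
    li.text = node.text

# Name-based "templates" (a simplification of XSLT match patterns).
RULES = {"bookstore": bookstore_rule, "title": title_rule}

def transform(source_xml):
    source_tree = ET.fromstring(source_xml)   # the source tree
    result_root = ET.Element("html")          # root of the result tree
    apply_templates(source_tree, result_root)
    return result_root

result = transform(SOURCE)
print(ET.tostring(result, encoding="unicode"))
```

Because each rule only reads its own source node and appends to the result, the output order follows the document order of the source, as noted in the feature list above.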
2.1.7.3.3 XQuery
XQuery [W3C 2007d], a W3C recommendation since January 2007, is the language for querying XML data (i.e. not only XML files, but anything that can appear as XML, including databases). It uses the structure of XML intelligently to allow expressing queries across all these kinds of data, whether physically stored in XML or viewed as XML via middleware.
XQuery 1.0 is a superset of XPath 2.0 and shares the same data model and supports the same functions and operators [W3C 2007e]. It is compatible with several W3C standards, such as XML, Namespaces, XSLT and XML Schema. It is supported by all major databases. XQuery allows finding and extracting elements and attributes from XML documents. Here is an example of a query that XQuery could solve:
“Select the titles of all books with a price higher than $30 from the bookstore collection stored in the XML file called books.xml”.
This query may be expressed by the following FLWOR¹ expression:
for $x in doc("books.xml")/bookstore/book
where $x/price > 30
return $x/title
This FLWOR expression selects exactly the same nodes as the following path expression:
doc("books.xml")/bookstore/book[price > 30]/title
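The equivalence between the iteration form and the predicate form can be mimicked in Python. The sketch below is not XQuery: it uses the standard `xml.etree.ElementTree` module on a hypothetical stand-in for books.xml (titles and prices are invented), and it returns title strings rather than title nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for books.xml; titles and prices are invented.
BOOKS_XML = """
<bookstore>
  <book><title>Everyday Italian</title><price>30.00</price></book>
  <book><title>XQuery Kick Start</title><price>49.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>
"""

bookstore = ET.fromstring(BOOKS_XML)

# Iteration form: explicit loop, filter, and returned value.
flwor_titles = []
for x in bookstore.findall("book"):                # for $x in .../bookstore/book
    if float(x.findtext("price")) > 30:            # where $x/price > 30
        flwor_titles.append(x.findtext("title"))   # return $x/title

# Predicate form: the filter plays the role of [price > 30] on the book step.
path_titles = [
    b.findtext("title")
    for b in bookstore.findall("book")
    if float(b.findtext("price")) > 30
]

print(flwor_titles)
print(path_titles)
```

Both forms keep document order and skip the book priced at exactly 30.00, since the comparison is strict.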
XQuery was designed to provide flexible query facilities for extracting data from real and virtual documents on the Web, and to supply the needed interaction between the Web and database worlds. It is intended to be for XML what SQL is for relational databases, so that collections of XML files may be accessed like databases. However, XQuery 1.0 has no means of making persistent changes or updates to XML documents.
Therefore, the XQuery working group published a new candidate recommendation called “XQuery update facility” that extends the XQuery language. We present it in the following section.
2.1.7.3.4 XQuery update facility
The XQuery update facility [W3C 2008c] extends the XQuery language with expressions that can be used to make persistent changes to instances of the XQuery 1.0 and XPath 2.0 Data Model (XDM). It was published as a W3C candidate recommendation in August 2008, and its specification is expected to reach recommendation status soon.
Indeed, an XQuery 1.0 expression takes zero or more XDM instances as input and returns an XDM instance as a result. In XQuery 1.0 there is no expression that can modify the state of an existing node; however, constructor expressions may create new nodes. Therefore, the XQuery update facility 1.0 introduces a new category of expression, called an “updating expression”, that may modify the state of an existing node. It provides means to perform the following operations on an XDM instance: node insertion, node deletion, node modification (changing some of a node's properties while preserving its node identity), and creation of a modified copy of a node with a new node identity. The XQuery update facility has five new kinds of expressions: insert, delete, replace, rename, and transform.
Hence, it is expected to facilitate the updates of XML documents.
¹ FLWOR stands for: FOR, LET, WHERE, ORDER BY, RETURN.
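The five kinds of operations can be illustrated on an in-memory tree. The following Python sketch is an analogy only, using the standard `xml.etree.ElementTree` and `copy` modules. It is not the XQuery update facility syntax, the sample document is hypothetical, and the changes stay in memory unless the tree is written back to storage.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical document, invented for illustration only.
doc = ET.fromstring(
    "<bookstore>"
    "<book><title>Learning XML</title><price>39.95</price></book>"
    "<book><title>Old Title</title><price>10.00</price></book>"
    "</bookstore>"
)
first, second = doc.findall("book")

# insert: add a new node under an existing one.
year = ET.SubElement(first, "year")
year.text = "2003"

# delete: remove an existing node from its parent.
doc.remove(second)

# replace (value): change a property while the node object stays the same.
price = first.find("price")
price.text = "29.95"

# rename: change the node's name, again preserving its identity.
price.tag = "cost"

# transform: modify a copy with a new identity, leaving the original untouched.
modified = copy.deepcopy(doc)
modified.find("book/title").text = "Learning XML, 2nd ed."

print(ET.tostring(doc, encoding="unicode"))
print(ET.tostring(modified, encoding="unicode"))
```

The first four operations change `doc` in place, whereas the last one mirrors the copy-and-modify behaviour: the original title in `doc` is unchanged after the copy is edited.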