2.1 Data Integration towards XML & Relational Database Integration

2.1.7.1 Introduction to XML

The Extensible Markup Language (XML) [W3C 2008a] is a simple, very flexible tagged text format derived from the Standard Generalized Markup Language (SGML) [ISO8879 1986]. It has been developed and maintained since 1996 by the World Wide Web Consortium (W3C) as a tool to standardize the format of documents used on the Internet and to meet the challenges of large-scale electronic publishing. The first edition was published as a W3C Recommendation in February 1998 [W3C 1998].

XML is a well-defined set of rules for specifying semantic tags that divide a document into parts and identify each part. It is therefore more a meta-language than a language, because it defines the syntax in which other domain-specific markup languages can be written. The tags of a domain-specific language can be documented in a schema written in any of several languages, including the Document Type Definition (DTD) [W3C 2008b] and the XML Schema Definition (XSD) [W3C 2001] languages.

XML differs from the HyperText Markup Language (HTML) [W3C 1999c], another widely used markup language on the Web that is also derived from SGML, in that XML has no predefined tags (i.e. authors must define their own set of tags). XML and HTML were designed for different purposes: XML was designed to describe data objects, called XML documents, while HTML was designed to display data and focuses on how that data is presented.

Moreover, XML has a much stricter syntax than HTML, which simplifies its processing.

In XML documents, data can be stored in either elements or attributes; both can be named to give meaning to the data they contain. An element is constructed from a start tag, an end tag, and possibly some content between them. This content can be text data or other elements. Attributes, in contrast, have no start and end tags: they are defined in the start tag of an element, and their content is limited to text data. The order of elements is significant in an XML document, while the order of attributes is not.
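
As an illustration of the element/attribute distinction, the following minimal sketch uses Python's standard xml.etree.ElementTree module. The fragments and the variable names (fragment, note_id, and so on) are assumptions made for the example, not part of the solution described in this work.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one attribute (id) and two child elements.
fragment = '<note id="001"><to>Hossam</to><from>Joel</from></note>'
note = ET.fromstring(fragment)

# An attribute is defined in the start tag and holds text only.
note_id = note.attrib["id"]

# Child elements are returned in document order.
child_tags = [child.tag for child in note]
child_texts = [child.text for child in note]

# Attribute order is not significant: both start tags carry the same attributes.
first = ET.fromstring('<type priority="high" kind="memo"/>')
second = ET.fromstring('<type kind="memo" priority="high"/>')
same_attributes = first.attrib == second.attrib

print(note_id, child_tags, child_texts, same_attributes)
```

The parser exposes attributes as an unordered name/value mapping, whereas children are exposed as an ordered sequence, which mirrors the rule stated above.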

Data vs. Documents: It is important to distinguish between two kinds of XML structure: data-centric and document-centric [Bourret 2005]. In the data-centric approach, the XML structure is highly regular and composed of fine-grained data, the order of elements is not very important, and there is no mixed content. This approach is mainly used to transport data (and often to represent legacy data) and is designed to produce XML files that are read by machines. In the document-centric approach, the XML structure is less regular, with coarse-grained data; the order of elements is important and elements may have mixed content. This approach is used to design XML documents that are mainly handled manually. In the context of data management and of integration between XML and relational databases, the first approach is the one adopted.
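
Mixed content is one of the features that separates the two structures. The sketch below shows one possible way to detect it with Python's standard xml.etree.ElementTree module; the helper has_mixed_content and both sample fragments are hypothetical, and the check only inspects the given element, not its descendants.

```python
import xml.etree.ElementTree as ET

def has_mixed_content(element):
    """Return True if the element holds both child elements and non-blank text."""
    children = list(element)
    if not children:
        return False
    # Text before the first child.
    if element.text and element.text.strip():
        return True
    # Text following any child (stored by ElementTree as the child's tail).
    return any(child.tail and child.tail.strip() for child in children)

# Data-centric: regular structure, text only at the leaves.
data_centric = ET.fromstring("<note><to>Hossam</to><from>Joel</from></note>")

# Document-centric: text interleaved with markup.
document_centric = ET.fromstring("<body>Don't forget <b>me</b> this weekend!</body>")

print(has_mixed_content(data_centric), has_mixed_content(document_centric))
```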

Here are the main basic rules for well-formed XML documents:

· There must be a single root element: The root element can appear only once and all other elements are nested inside it.

· Elements must be properly terminated: Each element has a start tag <tag-name> that must be matched by an end tag </tag-name>. The only exceptions are empty elements (those with no content), which may be written <tag-name/>.

· Elements must be properly nested and must not overlap; there is no limit on the nesting depth.

· Element and attribute names are case sensitive.

· Attribute values must be quoted: an attribute is extra information added to an element's start tag, and its value must always be enclosed in quotes.

· An XML document should start with a declaration using a special tag that identifies the version of the XML specification.

· It is possible to impose a specific grammar by using an XML schema language (DTD or XSD).
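
Several of the rules above can be exercised with any conforming XML parser. The following minimal sketch uses Python's standard xml.etree.ElementTree module, whose parser raises ParseError on input that is not well-formed. The helper is_well_formed and the sample strings are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# Each hypothetical sample breaks exactly one rule, except the last one.
samples = {
    "two root elements": "<a/><b/>",
    "missing end tag": "<a><b></a>",
    "overlapping elements": "<a><b></a></b>",
    "case mismatch": "<Note></note>",
    "unquoted attribute": "<a id=1/>",
    "well-formed": '<a id="1"><b/></a>',
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

This check covers well-formedness only; it says nothing about whether a document conforms to a DTD or an XSD schema.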

A simple sample XML document may look like this:

<?xml version="1.0"?> {declaration}

<!DOCTYPE notes SYSTEM "Notes.dtd"> {type of document}

<!--My notes file--> {comment}

<notes> {root element start tag}

<note id="001"> {attribute}

<type priority="high"/> {empty element (no data)}

<to>Hossam</to> {not empty element}

<from>Joel</from>

<heading>Reminder</heading>

<body>Don't forget me this weekend!</body>

</note> {end element tag}

</notes>
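
To show how an application might read such a document, the sketch below parses a copy of the sample with Python's standard xml.etree.ElementTree module. The explanatory annotations in braces and the DOCTYPE line are left out so that the example does not depend on the external file Notes.dtd; the dictionary name summary is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

sample = """<?xml version="1.0"?>
<!--My notes file-->
<notes>
  <note id="001">
    <type priority="high"/>
    <to>Hossam</to>
    <from>Joel</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
  </note>
</notes>"""

root = ET.fromstring(sample)   # root is the <notes> element
note = root.find("note")       # first <note> child of the root

summary = {
    "id": note.get("id"),                          # attribute of <note>
    "priority": note.find("type").get("priority"), # attribute of the empty element
    "to": note.findtext("to"),                     # text content of child elements
    "from": note.findtext("from"),
    "heading": note.findtext("heading"),
    "body": note.findtext("body"),
}
print(summary)
```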

When an XML document is received by a server, various operations can be applied to it. The document can be checked for well-formedness (i.e. that it satisfies the XML rules listed above), validated against an XML schema, transformed, stored, or forwarded, depending on the application.

The well-formedness check verifies that the document's syntax is correct according to the XML specification, while validation checks the document's structure against a DTD or an XSD schema, if one is provided. The transformation process converts an XML document from one structure to another, or from one set of tags to another. It makes it possible to reorder elements between the source and destination representations; arbitrary elements and structures can also be added, and existing ones removed.
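
The kind of restructuring a transformation performs can be sketched in plain Python with the standard xml.etree.ElementTree module. This is not XSLT: it is only a hand-written illustration of reordering, renaming, adding, and removing elements, and the function transform, the target tag names (message, sender, recipient, text), and the input fragment are all hypothetical.

```python
import xml.etree.ElementTree as ET

def transform(note):
    """Build a <message> element from a <note> element."""
    # Rename the element and carry the id over under a new attribute name.
    message = ET.Element("message", {"ref": note.get("id", "")})
    # Reorder: sender first, then recipient, with renamed tags.
    ET.SubElement(message, "sender").text = note.findtext("from")
    ET.SubElement(message, "recipient").text = note.findtext("to")
    # <heading> is dropped; <body> is kept under a new tag name.
    ET.SubElement(message, "text").text = note.findtext("body")
    return message

source = ET.fromstring(
    '<note id="001"><to>Hossam</to><from>Joel</from>'
    '<heading>Reminder</heading><body>Hi</body></note>'
)
result = ET.tostring(transform(source), encoding="unicode")
print(result)
```

A stylesheet-based transformation expresses the same mapping declaratively instead of in application code.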

The Extensible Stylesheet Language (XSL) [W3C 2006] and its XSL Transformations (XSLT) part [W3C 1999a] are the common technology for carrying out XML document transformations.

In fact, XML is not a single technology or recommendation: many XML-related technologies contribute to its power. Well-formed XML documents can be created in different applications using the core XML recommendation alone. However, to use such documents as a format for storing and publishing information, on the World Wide Web for instance, other XML-based or XML-related technologies are needed. For example, to define the structure of an XML document, the Document Type Definition (DTD) or the XML Schema Definition (XSD) languages may be used. To display XML in browsers, Cascading Style Sheets (CSS) may be used. To display XML or transform it into another format such as PDF or HTML, the Extensible Stylesheet Language (XSL) may be used. To navigate and organize the data in XML documents, technologies such as XPath, XPointer and XLink may be used. To use XML in an application, one of the XML interfacing technologies such as the Document Object Model (DOM) or the Simple API for XML (SAX) may be used. To query or update XML documents, the XQuery language and the XQuery Update Facility may be used.
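
The difference between the two interfacing styles can be sketched with Python's standard library, which ships a DOM implementation (xml.dom.minidom) and a SAX parser (xml.sax). The document string DOC and the handler class NoteIdHandler are assumptions made for the example. The last lines use the limited XPath subset supported by xml.etree.ElementTree, not a full XPath engine.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET
import xml.sax

DOC = (
    '<notes>'
    '<note id="001"><to>Hossam</to></note>'
    '<note id="002"><to>Joel</to></note>'
    '</notes>'
)

# DOM: the whole document is loaded into an in-memory tree, then navigated.
dom = xml.dom.minidom.parseString(DOC)
dom_ids = [node.getAttribute("id") for node in dom.getElementsByTagName("note")]

# SAX: the parser streams through the document and calls back on each event.
class NoteIdHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        # Called once for every start tag encountered.
        if name == "note":
            self.ids.append(attrs.get("id"))

handler = NoteIdHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_ids = handler.ids

# XPath-style navigation: select the recipient of the note whose id is "002".
recipient = ET.fromstring(DOC).findtext("./note[@id='002']/to")

print(dom_ids, sax_ids, recipient)
```

Both parsers recover the same identifiers; the trade-off is that DOM keeps the full tree available for random access, while SAX keeps only what the handler chooses to store.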

Therefore, XML, with its related languages and derivatives, now provides powerful tools for sharing, converting and exchanging information between networked computers. Furthermore, there are many proposals to standardize XML document structures for domains as diverse as stock trading, graphic design, healthcare (HL7), and Web Services (SOAP, ebXML), among others.

In the following sections we present in more detail the XML-related technologies used in our proposed solution, starting with the technologies for defining an XML document schema (the DTD and XSD languages), followed by the XML query languages (XPath, XSLT, XQuery and the XQuery Update Facility), and finally the technologies for parsing XML data (DOM and SAX).