2.1 Data Integration towards XML & Relational Database Integration
2.1.7.2 XML schema languages (DTD vs. XSD)
XML schema languages provide a way to define the structure of XML documents and add a level of checking beyond well-formedness. The well-formedness rules of XML are very simple: they ensure only that the markup is syntactically correct, for example that tags are properly nested and closed.
The validity constraints introduced by a schema go further and specify the permitted tree structure of XML documents. The W3C has published two standard recommendations for defining this structure: the Document Type Definition (DTD) and the XML Schema Definition (XSD). Below, we present both technologies and compare them. Other schema languages exist, including RELAX NG [OASIS 2001], [ISO/IEC 19757-2 2003] and Schematron [ISO/IEC 19757-3 2006], but they are not W3C recommendations.
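The difference between well-formedness and validity can be illustrated with a minimal Python sketch. The sample documents and the helper names `is_well_formed` and `has_required_children` are assumptions made for illustration; the second helper is only a toy stand-in for what a real schema validator does.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents, loosely modelled on a "note" vocabulary.
WELL_FORMED = "<note id='1'><to>Ann</to><from>Bob</from></note>"
NOT_WELL_FORMED = "<note id='1'><to>Ann</from></note>"  # mismatched tags


def is_well_formed(text):
    """Return True if the text parses as XML (well-formedness only)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def has_required_children(text, required):
    """Toy validity rule: every name in `required` occurs as a child element."""
    root = ET.fromstring(text)
    present = {child.tag for child in root}
    return all(name in present for name in required)


# The first document is well-formed, yet it fails a structural rule
# that requires a <body> child.
print(is_well_formed(WELL_FORMED))
print(is_well_formed(NOT_WELL_FORMED))
print(has_required_children(WELL_FORMED, ["to", "from", "body"]))
```

A parser alone accepts the first document; only a structural rule, which is the role of a schema, reveals that it is incomplete.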
2.1.7.2.1 XML schemas advantages
XML schemas specify contracts between data producers and consumers.
They therefore represent a standardized vocabulary and structure for an application domain, and designing them requires a deep understanding of both the application and the domain of interest.
Schemas may be particularly useful for:
· Validation of XML documents against a schema (for safety).
· Automation of XML exchange and storage into relational databases.
· Query optimization.
· XML binding (mapping to programming languages).
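As an illustration of the last item, XML binding maps elements and attributes to objects of a programming language. The following is a minimal hand-written sketch in Python; the `Note` class, the `bind_note` helper, and the sample document are hypothetical, and real binding tools usually generate such classes from the schema instead.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Note:
    note_id: str
    to: str
    sender: str  # "from" is a reserved word in Python, so it is renamed here
    heading: str
    body: str


def bind_note(elem):
    """Map one <note> element to a Note object."""
    return Note(
        note_id=elem.get("id"),
        to=elem.findtext("to"),
        sender=elem.findtext("from"),
        heading=elem.findtext("heading"),
        body=elem.findtext("body"),
    )


# Hypothetical sample document.
SAMPLE = """<notes>
  <note id="1"><type priority="high"/><to>Ann</to><from>Bob</from>
    <heading>Reminder</heading><body>Call me</body></note>
</notes>"""

notes = [bind_note(n) for n in ET.fromstring(SAMPLE).findall("note")]
print(notes[0])
```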
2.1.7.2.2 Document Type Definition language
The Document Type Definition language (DTD) [W3C 2008b] is inherited from the SGML world with its syntax almost completely intact. A DTD is therefore not itself written in XML syntax. A DTD is the blueprint of a document's structure and consists of a series of declarations. It describes the valid syntax of a class of XML documents by assigning names and types to elements and attributes. A DTD thus allows each XML document to carry a description of its own format.
A DTD may also capture an agreed standard for interchanging data between independent groups of people. In addition, applications can use a standard DTD to verify that data received from the outside world is valid.
The DTD can be a separate file, or it can be embedded in the XML file. The DTD contents can also be split between an external file and the XML file. Here is a sample of a separate (external) DTD for validating the previous XML document sample:
<?xml version="1.0" encoding="UTF-8"?>
<!-- element declarations -->
<!ELEMENT notes (note*)>
<!ELEMENT note (type, to, from, heading, body)>
<!ELEMENT type EMPTY>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
<!-- attribute declarations -->
<!ATTLIST type priority (low|normal|high) "normal">
<!ATTLIST note id CDATA #REQUIRED>
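The element declarations in the DTD above are content models, which behave like regular expressions over the names of child elements. The sketch below hand-translates three of them into Python regular expressions to show what a validating parser checks. The translation, the helper names, and the sample documents are assumptions; the sketch ignores attributes and text content.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the DTD, hand-translated into regular expressions
# over space-terminated child names (an illustrative assumption).
CONTENT_MODELS = {
    "notes": r"(note )*",                   # (note*)
    "note": r"type to from heading body ",  # (type, to, from, heading, body)
    "type": r"",                            # EMPTY: no child elements
}


def matches_content_model(elem):
    """Check the sequence of child names of one element against its model."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return True  # no model recorded for this element in the sketch
    children = "".join(child.tag + " " for child in elem)
    return re.fullmatch(model, children) is not None


def check_tree(root):
    """Apply the check to the root and to every descendant."""
    return all(matches_content_model(e) for e in root.iter())


# Hypothetical documents: the second one has <to> and <type> swapped.
GOOD = ("<notes><note id='1'><type/><to>Ann</to><from>Bob</from>"
        "<heading>Hi</heading><body>Text</body></note></notes>")
BAD = ("<notes><note id='2'><to>Ann</to><type/><from>Bob</from>"
       "<heading>Hi</heading><body>Text</body></note></notes>")

print(check_tree(ET.fromstring(GOOD)))
print(check_tree(ET.fromstring(BAD)))
```

Both documents are well-formed; only the first respects the order imposed by the `note` content model.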
2.1.7.2.3 XML Schema Definition language
The XML Schema Definition language (XSD) [W3C 2001], published as a W3C recommendation in May 2001, is a newer, more flexible, and more elaborate schema language than DTD. As an alternative to DTD, XSD specifies the type of each element and the data types associated with elements and attributes. It provides strong typing and supports data types and namespaces, which DTDs do not. In addition, XSD schemas are themselves XML documents, so they can be managed with XML authoring tools.
XSD introduces new levels of flexibility, and of assurance against unauthorized changes, that may accelerate the adoption of XML for significant industrial use. For example, a schema author can build a schema that reuses a previous schema and even overrides it when new, unique features are needed. XSD allows the author to determine which parts of a document may be validated, or to identify the parts of a document to which a schema applies. XSD also lets users choose which XML Schema to use to validate the elements in a given namespace. It can define data types, ranges, enumerations, dates, and more complex data types to specify strictly what constitutes a valid XML document.
The XML Schema specification is divided into three parts. Part 0 [W3C 2004] explains what schemas are, how they differ from DTDs, and how to build a schema. Part 1 [W3C 2009a] proposes methods for describing the structure and constraining the contents of XML documents, and defines the rules for the schema validation of documents. Part 2 [W3C 2009b] defines a set of simple data types to be associated with XML element types and attributes, which allows XML applications to better manage dates, numbers, and other types of information.
Here is a sample of an XSD schema for validating the previous XML document sample:
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
<xs:attribute name="id" type="xs:int" use="required"/>
</xs:complexType>
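Because an XSD schema is itself an XML document, it can be processed with ordinary XML tools. The Python sketch below embeds one possible complete schema around the fragment above and lists the elements it declares. The outer structure (`xs:schema`, the `notes` and `note` declarations) and the declaration of `type` are assumptions inferred from the DTD sample, not taken from the original schema.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Assumed reconstruction: everything outside the to/from/heading/body
# sequence and the id attribute is inferred from the DTD sample.
XSD_TEXT = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="notes">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="note" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="type">
                <xs:complexType>
                  <xs:attribute name="priority" default="normal">
                    <xs:simpleType>
                      <xs:restriction base="xs:string">
                        <xs:enumeration value="low"/>
                        <xs:enumeration value="normal"/>
                        <xs:enumeration value="high"/>
                      </xs:restriction>
                    </xs:simpleType>
                  </xs:attribute>
                </xs:complexType>
              </xs:element>
              <xs:element name="to" type="xs:string"/>
              <xs:element name="from" type="xs:string"/>
              <xs:element name="heading" type="xs:string"/>
              <xs:element name="body" type="xs:string"/>
            </xs:sequence>
            <xs:attribute name="id" type="xs:int" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# The schema parses like any other XML document.
schema = ET.fromstring(XSD_TEXT)

# Collect the names of all declared elements, in document order.
declared = [e.get("name") for e in schema.iter("{%s}element" % XS)]
print(declared)
```

The standard library parser used here only reads the schema; validating a document against it would need a schema-aware validator, which is outside the scope of this sketch.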
2.1.7.2.4 XSD vs. DTD
XSD has several advantages over DTD, mainly:
(1) XSD supports powerful and rich sets of built-in data types (for elements as well as for attributes) with the possibility of inheritance and derived data types, compared to DTDs which only support character strings. For example, XSD can specify that a particular attribute must be a valid date, or a number, or a list of URIs, or a string that is exactly 8 characters long.
(2) XSD can define all the constraints that a DTD can define, and many more. XSD supports the identity constraints (key, keyref, unique), which are more powerful than the IDs and IDREFs supported by DTDs.
Identity constraints can be specified on any element or attribute, regardless of its type. They can be locally defined for a combination of elements and attributes. In addition, XSD supports fine-grained cardinality constraints, while DTD is limited to the Kleene-style occurrence operators (*, +, ?).
(3) XSD has the same syntax as XML, and it may be managed by XML editing tools.
(4) Namespaces are supported in XSD, but not in DTD.
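To make point (1) concrete, the following Python sketch approximates three of the checks that XSD data types make possible: an integer, a date, and a string of exactly 8 characters. The function names are hypothetical, and each function is only a rough stand-in for the corresponding XSD type or facet (for instance, time zones on dates are not handled).

```python
import re
from datetime import date


def is_xs_int(text):
    """Rough stand-in for xs:int: a decimal integer in the signed 32-bit range."""
    text = text.strip()
    if not re.fullmatch(r"[+-]?[0-9]+", text):
        return False
    return -2**31 <= int(text) <= 2**31 - 1


def is_xs_date(text):
    """Rough stand-in for xs:date without a time zone: YYYY-MM-DD."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", text):
        return False
    try:
        date.fromisoformat(text)  # rejects impossible dates such as 02-30
        return True
    except ValueError:
        return False


def has_length(text, n):
    """Stand-in for an xs:length facet applied to xs:string."""
    return len(text) == n


print(is_xs_int("42"), is_xs_int("4.2"))
print(is_xs_date("2009-03-15"), is_xs_date("2009-02-30"))
print(has_length("ABCDEFGH", 8))
```

A DTD can only declare such values as character data, so checks of this kind would have to be written in application code.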
A summary of the comparison between XSD and DTD is given in Table 2-2.
| Feature | XSD | DTD |
|---|---|---|
| Data types | rich sets of built-in data types and customized new data types | only two types of data: PCDATA and CDATA |
| Constraints: cardinalities | fine grained: minOccurs = 0..*, maxOccurs = 0..* | *, +, ? |
| Constraints: primary keys | id, key | ID |
| Constraints: foreign keys | idref, keyref | IDREF/IDREFS |
| Constraints: uniqueness | unique | not supported |
| Namespace | well supported | not supported |
| Syntax | XML | not XML (SGML) |
Table 2-2 XSD vs. DTD to specify XML schemas
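The primary-key and foreign-key rows of Table 2-2 can be illustrated with a small Python sketch of what key and keyref constraints check. The document, the `reply` element, its `to-note` attribute, and the helper names are hypothetical; an XSD validator would evaluate such constraints from `xs:key` and `xs:keyref` declarations in the schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: each <reply> refers to a <note> through "to-note".
DOC = """<notes>
  <note id="1"/><note id="2"/>
  <reply to-note="1"/><reply to-note="3"/>
</notes>"""


def check_key(root, path, attr):
    """Key: every selected element has the attribute, and values are unique."""
    values = [e.get(attr) for e in root.findall(path)]
    return None not in values and len(values) == len(set(values))


def dangling_keyrefs(root, key_path, key_attr, ref_path, ref_attr):
    """Keyref: return referenced values that match no existing key."""
    keys = {e.get(key_attr) for e in root.findall(key_path)}
    return [e.get(ref_attr) for e in root.findall(ref_path)
            if e.get(ref_attr) not in keys]


root = ET.fromstring(DOC)
print(check_key(root, "note", "id"))
print(dangling_keyrefs(root, "note", "id", "reply", "to-note"))
```

In this sample the key holds, while the reference to note 3 has no matching key and would be reported as a keyref violation.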