6.5 Document-specific Technologies
6.5.4 XML – eXtensible Markup Language
Extensible Markup Language (XML) is an extremely simple dialect of SGML [. . .]. The goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. For this reason, XML has been designed for ease of implementation, and for interoperability with both SGML and HTML (W3C Working Draft, November 1996).
Since its publication as a W3C recommendation (W3C 1998), XML (eXtensible Markup Language) has spread rapidly both within and outside the Web. Because it allows flexible data formats to be defined simply and exchanged on the Web, XML offers the prerequisite for homogenizing heterogeneous environments. This means that, in addition to describing uniform data formats, we can also consider the semantics of the data, regardless of the information it carries (see also Chapter 14). Its extensibility stems from the fact that, in contrast to HTML, XML does not dictate predefined markup with implicit semantics. Instead, it lets us define both the structure of an XML document and the meaning of its markup, so that it can be adapted to given circumstances. This is why XML, rather than being an alternative to HTML, opens up entirely new ways to describe data for arbitrary purposes.
XML documents are characterized by two distinct properties: well-formedness and validity. While well-formedness is inherently anchored in XML, validity can be ensured by the Document Type Definition (DTD) and XML schemas. Validity is discussed later in this section. To ensure that an XML document is well-formed, there are general rules for the syntax of XML documents, which, in contrast to HTML, have to be strictly observed. In fact, the XML specification refers to these rules as “constraints”. Table 6-1 uses examples to demonstrate the XML well-formedness rules.

Table 6-1 XML well-formedness rules

| Description | Wrong | Correct |
|---|---|---|
| All tags have to appear in pairs on the same nesting depth. In addition, there are empty tags (<B/>). | <A><B></A></B> | <A><B></B></A> or <A><B/></A> |
| Tag names are case-sensitive. | <A></a> | <A></A> |
| All attribute values must be enclosed in quotation marks. | <A elem = 5/> | <A elem = "5"/> |
| In contrast to HTML, there are no attributes without values. | <td nowrap> | <td style="white-space:nowrap"> |
| Tag names have to comply with the rules for element names. For example, blanks and < or > are invalid. | <A B></A B> | <AB></AB> |
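The rules in Table 6-1 can be checked mechanically, because a conforming XML parser has to reject any document that violates them. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical and introduced only for illustration. The <td> fragments from the table are completed with a closing tag (an addition), so that only the attribute rule is tested.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# (wrong, correct) pairs taken from Table 6-1. The <td> fragments are
# closed with </td> here (an addition) so only the attribute rule matters.
RULE_EXAMPLES = [
    ("<A><B></A></B>", "<A><B></B></A>"),
    ("<A></a>", "<A></A>"),
    ("<A elem = 5/>", '<A elem = "5"/>'),
    ("<td nowrap></td>", '<td style="white-space:nowrap"></td>'),
    ("<A B></A B>", "<AB></AB>"),
]

for wrong, correct in RULE_EXAMPLES:
    print(f"{wrong!r}: {is_well_formed(wrong)}   {correct!r}: {is_well_formed(correct)}")
```

Each line of output pairs a fragment with the parser's verdict, so the "Wrong" column should be rejected and the "Correct" column accepted.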
Since these rules have to be strictly observed, it is possible to clearly determine the structure of XML documents. This led to the definition of the Document Object Model (DOM), which can be used to transform the tree-type structure of XML documents into an object-oriented tree. Figure 6-1 shows a purchase order as a simple XML document example. We will use this example as our basis for other examples in the following sections.
<?xml version="1.0"?>
<order OrderID="10643">
  <item><book isbn="123-321" /></item>
  <item><cdrom title="Vivaldi Four Seasons" /></item>
  <item><book isbn="3-8265-8059-1" /></item>
  <OrderDate ts="2003-06-30T00:00:00" />
  <price>167.00 EUR</price>
</order>
Figure 6-1 The “purchase order” XML example.
Namespaces
Namespaces (W3C 1999a) are among the core characteristics of handling XML. Namespaces can be used to avoid name collisions between equally named elements in an XML document. This allows documents with different structures to be merged.
There are two different ways to mark an XML element with namespaces: one can either state the namespace for an element or use a prefix. The prefix method is useful mainly when several elements belong to the same namespace, because it makes XML documents shorter and easier to read. Figure 6-2 uses our above example to illustrate the two variants. The URI (uri:order) addresses a namespace that corresponds to a purchase order.

<order xmlns="uri:order">
  <item>
    <book xmlns="uri:order" isbn="123-456" />
  </item>
</order>

<o:order xmlns:o="uri:order">
  <o:item>
    <o:book />
  </o:item>
</o:order>

Figure 6-2 Using a namespace without and with a prefix.
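To a namespace-aware parser, both notations yield the same qualified element names; the prefix itself carries no meaning. The following minimal sketch assumes Python's standard xml.etree.ElementTree module, which reports each element name as {namespace-URI}local-name. The constant and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The two variants from Figure 6-2, written on one line each.
PREFIXED = '<o:order xmlns:o="uri:order"><o:item><o:book /></o:item></o:order>'
DEFAULT = (
    '<order xmlns="uri:order"><item>'
    '<book xmlns="uri:order" isbn="123-456" />'
    "</item></order>"
)


def qualified_names(text: str) -> list:
    """List element names in document order as {namespace-URI}local-name."""
    return [element.tag for element in ET.fromstring(text).iter()]


print(qualified_names(PREFIXED))
print(qualified_names(DEFAULT))
```

Both calls print the same list, which is why documents that use different prefixes for the same namespace URI can still be merged.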
XML DOM
The Document Object Model (DOM) introduces an object-oriented view of XML documents, allowing the easy and intuitive processing of XML. A DOM is created by an XML parser, which parses the structure of an XML document and instantiates an object tree (Figure 6-3). Each XML element in this tree corresponds to a node. The benefit of this method is that, once the DOM has been created, the nodes can be accessed in an object-oriented way. The drawback is that this approach is rather costly, because the parser has to read the entire document and build the complete tree before any node can be accessed. Often, however, we want to read just parts of an XML document rather than the entire document. In these cases, it is recommended to use parsers which consume fewer resources, e.g., SAX (Simple API for XML) parsers. SAX parsers use an event-based model, which supports targeted intervention in the parsing process, since the calling program can register a method for each type of event that occurs. Like DOM-enabled parsers, SAX parsers are available for most common platforms and programming languages.
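The event-based model can be sketched as follows, assuming Python's standard xml.sax module and the purchase order from Figure 6-1. The handler class IsbnCollector is hypothetical; it registers a method for the "start of element" event and collects only the isbn attributes, without building a tree.

```python
import xml.sax

# The purchase order from Figure 6-1.
ORDER_XML = b"""<?xml version="1.0"?>
<order OrderID="10643">
  <item><book isbn="123-321" /></item>
  <item><cdrom title="Vivaldi Four Seasons" /></item>
  <item><book isbn="3-8265-8059-1" /></item>
  <OrderDate ts="2003-06-30T00:00:00" />
  <price>167.00 EUR</price>
</order>"""


class IsbnCollector(xml.sax.ContentHandler):
    """Reacts to start-of-element events and keeps only book ISBNs."""

    def __init__(self):
        super().__init__()
        self.isbns = []

    def startElement(self, name, attrs):
        # Called by the parser for every opening tag it encounters.
        if name == "book":
            self.isbns.append(attrs.getValue("isbn"))


collector = IsbnCollector()
xml.sax.parseString(ORDER_XML, collector)
print(collector.isbns)
```

All other events (end tags, character data) are ignored here because the handler does not override the corresponding methods.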
<order>
  <OrderDate ts="2003-06-30" />
  <price>30.00 EUR</price>
</order>

nodeName = order, nodeValue = null, nodeType = Element
  nodeName = OrderDate, nodeValue = null, nodeType = Element
    nodeName = ts, nodeValue = 2003-06-30, nodeType = Attribute
  nodeName = price, nodeValue = null, nodeType = Element
    nodeValue = 30.00 EUR, nodeType = Text
Figure 6-3 The DOM structure for a fragment of the “purchase order” XML example.
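The node properties shown in Figure 6-3 can be inspected with any DOM implementation. The sketch below assumes Python's standard xml.dom.minidom module; the function describe and the mapping NODE_TYPE_NAMES are hypothetical helpers. The fragment is written without whitespace between elements, because a DOM parser would otherwise also create text nodes for the line breaks and indentation.

```python
from xml.dom import Node, minidom

FRAGMENT = '<order><OrderDate ts="2003-06-30" /><price>30.00 EUR</price></order>'

NODE_TYPE_NAMES = {
    Node.ELEMENT_NODE: "Element",
    Node.ATTRIBUTE_NODE: "Attribute",
    Node.TEXT_NODE: "Text",
}


def describe(node, depth=0, rows=None):
    """Collect (depth, nodeName, nodeValue, type name) for a node and its subtree."""
    if rows is None:
        rows = []
    rows.append((depth, node.nodeName, node.nodeValue, NODE_TYPE_NAMES[node.nodeType]))
    if node.nodeType == Node.ELEMENT_NODE:
        # Attributes are not child nodes; they are reached via the attribute map.
        for index in range(node.attributes.length):
            attribute = node.attributes.item(index)
            rows.append((depth + 1, attribute.nodeName, attribute.nodeValue, "Attribute"))
        for child in node.childNodes:
            describe(child, depth + 1, rows)
    return rows


document = minidom.parseString(FRAGMENT)
rows = describe(document.documentElement)
for depth, name, value, type_name in rows:
    print("  " * depth + f"nodeName={name} nodeValue={value} nodeType={type_name}")
```

In this implementation the text node reports the generic name #text, and element nodes report None where the figure shows null.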
The XML Validity Constraint
While the well-formedness constraint defined in the XML specification ensures a clear syntax for XML documents, validity allows us to introduce a specifically defined structure for an XML document. An XML document is valid when it is well-formed, and when its content and structure are compliant with predefined rules. These rules are formulated either in Document Type Definitions (DTDs) or XML schemas. In terms of object orientation, this means that well-formedness enables us to map XML to DOM, while validity lets us introduce application-specific data types (thus achieving validatability).
DTD (Document Type Definition)
A DTD represents a set of rules which can be used to describe the structure of an XML document. XML borrows DTDs from SGML. Figure 6-4 shows a DTD that validates the “purchase order” XML example. The !DOCTYPE, !ELEMENT, and !ATTLIST fragments describe the data type. The way elements are combined is strongly reminiscent of the definition of regular expressions. The rule <!ELEMENT order (item+,OrderDate,price)> expresses that an order element consists of at least one item element (“+”), followed by an OrderDate and a price.
<?xml version="1.0"?>
<!DOCTYPE order [
  <!ELEMENT order (item+,OrderDate,price)>
  <!ATTLIST order OrderID CDATA #REQUIRED>
  <!ELEMENT item (book|cdrom)+>
  <!ELEMENT book EMPTY>
  <!ATTLIST book isbn CDATA #REQUIRED>
  <!ELEMENT cdrom EMPTY>
  <!ATTLIST cdrom title CDATA #REQUIRED>
  <!ELEMENT OrderDate EMPTY>
  <!ATTLIST OrderDate ts CDATA '2003-06-30T00:00:00'>
  <!ELEMENT price (#PCDATA)>
]>
Figure 6-4 DTD for the “purchase order” XML example.
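The similarity between content models and regular expressions can be made concrete. The following sketch is not DTD validation; it only illustrates the analogy, under the assumption that the names of an element's children are written as a comma-terminated sequence. It uses Python's standard re and xml.etree.ElementTree modules, and all names in it are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content model (item+,OrderDate,price) rewritten as a regular expression
# over the comma-terminated sequence of child element names.
ORDER_MODEL = re.compile(r"(item,)+OrderDate,price,")


def matches_order_model(order_xml: str) -> bool:
    """Check only the sequence of <order>'s direct children, nothing else."""
    order = ET.fromstring(order_xml)
    child_names = "".join(child.tag + "," for child in order)
    return ORDER_MODEL.fullmatch(child_names) is not None


GOOD = "<order><item/><item/><OrderDate/><price>1</price></order>"
NO_ITEM = "<order><OrderDate/><price>1</price></order>"
SWAPPED = "<order><item/><price>1</price><OrderDate/></order>"

for sample in (GOOD, NO_ITEM, SWAPPED):
    print(matches_order_model(sample), sample)
```

The “+” in the content model maps directly to the “+” quantifier of the regular expression, and the comma-separated sequence maps to concatenation.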
Thanks to their simple structure, DTDs are relatively easy for humans to understand. This is why they are useful mainly when they have to be created or maintained manually. However, precisely because of their simple structure, DTDs cause two distinct problems, which are eventually solved by XML schemas:
• The fact that DTDs have been borrowed from SGML is often considered a problem, because it requires a DTD parser to read the grammar. It would be better to also notate the grammar itself in XML, so that existing XML parsers could read it. Our example shows that a DTD is not well-formed XML.
• Although some data types can be used to define elements or attributes within DTDs, their extent is very limited. This restriction impairs the reusability of DTDs.
XML Schemas
XML schemas (W3C 2000) are designed to answer the problems introduced by DTDs. However, their major benefits, i.e., data type integration, reusability, and XML formulation, came at the cost of a growing complexity. The result is that, when developing schemas, it has become almost unavoidable to use tools. Due to their complexity, this section discusses schemas only briefly to outline their most important properties and concepts.
An XML schema can use various pre-defined data types, such as string, byte, decimal, or date. In addition, schemas let one define facets, which support user-defined data types similar to templates. Let’s assume, for example, that all valid isbn attributes of the XML element book in the “purchase order” example must follow the notation for ISBN numbers. We could use a facet to describe the combined numbers and dashes with the pattern N\-NNNN\-NNNN\-N and reuse this data type in future development projects.
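The effect of such a pattern facet can be approximated with an ordinary regular expression. The sketch below assumes that each N in the pattern stands for exactly one decimal digit, which the text does not state explicitly; the names ISBN_PATTERN and is_valid_isbn are hypothetical. Because a schema pattern has to match the entire value, the sketch uses fullmatch rather than a substring search.

```python
import re

# N\-NNNN\-NNNN\-N with every N read as one decimal digit (an assumption).
ISBN_PATTERN = re.compile(r"\d-\d{4}-\d{4}-\d")


def is_valid_isbn(value: str) -> bool:
    """Return True if the whole value follows the digit-and-dash pattern."""
    return ISBN_PATTERN.fullmatch(value) is not None


# The two isbn values from the purchase order in Figure 6-1.
for isbn in ("3-8265-8059-1", "123-321"):
    print(isbn, is_valid_isbn(isbn))
```

Of the two isbn values in Figure 6-1, only the second book's value follows the pattern, so a schema with this facet would reject the order as written.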
There are two distinct concepts to derive schema data types from existing types: extension and restriction. In the sense of object-oriented inheritance, restriction would correspond to a specialization of the value range of the supertype, while extension would be similar to an aggregation of other types. Figure 6-5 (left-hand side) shows how the type LTH (“less than 100”) is created by restricting the (pre-defined) type positiveInteger and setting an upper limit for the value range. On the right-hand side in this figure, the user-defined type orderType is extended to datedOrderType, which adds an element, orderDate, of the (pre-defined) type date to an orderType.
Left-hand side (restriction): xs:positiveInteger is restricted to ns:LTH
  xs:simpleType ns:LTH
    restriction of xs:positiveInteger
      xs:maxExclusive="100"

Right-hand side (extension): ns:orderType is extended to ns:datedOrderType
  ns:orderType
    <xs:choice>
      <xs:element name="book" />
      <xs:element name="cdrom" />
    </xs:choice>
  ns:datedOrderType
    <xs:sequence>
      <xs:group ref="ns:orderType" />
      <xs:element name="orderDate" type="xs:date" />
    </xs:sequence>
Figure 6-5 User-defined types.
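The object-oriented analogy can be sketched in a general-purpose language. The following Python classes are only an illustration of the two concepts, not a mapping defined by XML Schema: LTH narrows the value range of an integer type (restriction), while DatedOrderType aggregates an existing type with an additional date (extension). The class names mirror the type names in Figure 6-5; the field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


class LTH(int):
    """Positive integer below 100: a restriction of the value range."""

    def __new__(cls, value):
        number = super().__new__(cls, value)
        if not 1 <= number < 100:
            raise ValueError(f"{value!r} is not a positive integer below 100")
        return number


@dataclass
class OrderType:
    """Either a book or a cdrom, mirroring the choice in the figure."""

    item: str

    def __post_init__(self):
        if self.item not in ("book", "cdrom"):
            raise ValueError(f"unexpected item kind: {self.item!r}")


@dataclass
class DatedOrderType:
    """Extension by aggregation: an existing OrderType plus an order date."""

    order: OrderType
    order_date: date


dated = DatedOrderType(OrderType("book"), date(2003, 6, 30))
print(LTH(99), dated)
```

Constructing LTH(100) raises a ValueError, just as a value of 100 would violate the upper limit of the restricted schema type.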