
2.2 XML – the Extensible Markup Language

2.2.3 Anatomy of an XML Document

An “XML document” is a file, or collection of files, that adheres to the general syntax specified in the XML Recommendation [116], independent of the concrete application. XML documents consist of an optional document prologue and a document tree containing elements, character data and attributes, with a distinguished root element.

Document Prologue

The document prologue is used to define properties of an XML document, like the version of XML used, the character encoding, processing instructions and schema information. It consists of the following parts:

• an XML declaration denoted by <?xml version="1.0" ...?>, which specifies the version of XML used, and optionally the encoding of the document. The declaration is mandatory in XML 1.1; in XML 1.0 it is formally optional, but should be present.

• zero or more application-specific processing instructions that may be evaluated when loading an XML document, denoted by <?target data?>, where target identifies the application to which the instruction is directed, and data represents additional information for the application.

• an optional schema declaration in terms of a DTD, defined either internally, in an external system file, or by a public identifier associated with a DTD which is assumed to be known to the processing program¹⁵.

Example 2.1 (XML Document Prologue)

The following document prologue initiates an XML document in DocBook format (the DTD of which is identified by a public identifier), to be processed with a stylesheet stylesheet.css and an encoding of ISO-8859-15 (Western Europe with Euro):

<?xml version="1.0" encoding="ISO-8859-15"?>

<?stylesheet href="stylesheet.css"?>

<!DOCTYPE book PUBLIC "-//Norman Walsh//DTD DocBk XML V1.4/EN"

"http://docbook.org/docbook/xml/1.4/db3xml.dtd">

Although several improved schema languages for XML exist, such as XML Schema [111] and Relax NG [39], both XML 1.0 and the recently released XML 1.1 only support the declaration of DTD schemas in the document prologue (see schema languages below).
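The parts of a prologue can also be inspected programmatically. The following is a minimal sketch in Python using the standard library module xml.dom.minidom; the sample document, its public identifier and the function name describe_prologue are assumptions made for illustration, and UTF-8 is used instead of ISO-8859-15 to keep the sketch simple.

```python
from xml.dom import minidom, Node

# Hypothetical sample document; the DTD identifiers are placeholders and
# the external DTD is never fetched by this non-validating parser.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<?stylesheet href="stylesheet.css"?>
<!DOCTYPE book PUBLIC "-//Example//DTD Sample Book//EN"
  "http://example.org/sample-book.dtd">
<book/>
"""


def describe_prologue(data):
    """Collect the prologue information that minidom exposes."""
    doc = minidom.parseString(data)
    # Processing instructions in the prologue are children of the document node.
    instructions = [
        (node.target, node.data)
        for node in doc.childNodes
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
    ]
    doctype = doc.doctype
    return {
        "version": doc.version,
        "encoding": doc.encoding,
        "instructions": instructions,
        "root": doctype.name,
        "public_id": doctype.publicId,
        "system_id": doctype.systemId,
    }


prologue = describe_prologue(SAMPLE)
print(prologue)
```

Note that the parser only records the identifiers of the DTD; whether the document is valid with respect to that DTD is not checked here.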

Character Set and Encodings

Since the Web is a place containing documents in many different languages, XML has been designed as an internationalised language from the beginning. XML supports all characters defined in ISO/IEC 10646 (a superset of Unicode¹⁶), amounting to approximately 4 billion. To represent these characters in concrete documents, XML supports a variety of encodings, which can be specified in the XML declaration of the document prologue. Table 2.1 lists some of the more frequent character encodings¹⁷. Of these encodings, XML language processors need to implement at least UTF-8 and UTF-16.
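The effect of the declared encoding can be illustrated with a small experiment. The sketch below, in Python with the standard library module xml.etree.ElementTree, serialises the same one-element document in three encodings and parses it back; the element name price and the helper encode_document are hypothetical.

```python
import xml.etree.ElementTree as ET


def encode_document(text, encoding):
    """Serialise a one-element document in the given encoding."""
    source = f'<?xml version="1.0" encoding="{encoding}"?><price>{text}</price>'
    return source.encode(encoding)


decoded = {}
for encoding in ("UTF-8", "UTF-16", "ISO-8859-15"):
    data = encode_document("10 €", encoding)
    # The parser picks the decoder from the XML declaration (and the
    # byte order mark in the case of UTF-16).
    decoded[encoding] = ET.fromstring(data).text

# Number of bytes needed for the Euro sign in each encoding.
euro_bytes = {
    encoding: len("€".encode(encoding))
    for encoding in ("UTF-8", "UTF-16-BE", "ISO-8859-15")
}
print(decoded)
print(euro_bytes)
```

The character content is the same in all three cases; only the byte representation differs. ISO-8859-1, in contrast, cannot represent the Euro sign at all.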

Elements

Elements are used to “mark up” the document. They are identified by a label (called tag name) and specified by opening and closing tags that enclose the element content. Opening tags are of the form <label ...> and contain the label and optionally a set of attributes (see below). Closing tags are of the form </label> and contain only the label. Labels start with either an alphabetical character (with respect to the defined character encoding) or with an underscore _. They may contain any alphanumeric characters and the signs _, -, : and . (full stop). The character : is reserved for separating namespace prefixes from element names.
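The naming rule for labels can be written down as a short check. The following Python sketch implements only the simplified rule as stated in the text, not the full Name production of the XML Recommendation; the function name is_valid_label is hypothetical.

```python
def is_valid_label(label):
    """Check a label against the simplified naming rule described above.

    This is an approximation: the XML Recommendation defines the exact
    sets of permitted start and name characters.
    """
    if not label:
        return False
    first = label[0]
    # The first character must be a letter or an underscore.
    if not (first.isalpha() or first == "_"):
        return False
    # Remaining characters: letters, digits, and the signs _ - : .
    return all(ch.isalnum() or ch in "_-:." for ch in label[1:])


for candidate in ("address-book", "_private", "xlink:href", "1st", "two words"):
    print(candidate, is_valid_label(candidate))
```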

¹⁵ Public identifiers are commonly used for widespread XML applications like XHTML or DocBook.

¹⁶ http://www.unicode.org

CHAPTER 2. DATA REPRESENTATION ON THE WEB

ASCII        American Standard Code for Information Interchange, 7-bit
Big5         Traditional Chinese, Hong Kong and Taiwan, 2 bytes
GB2312       Simplified Chinese (Guójiā Biāozhǔn Mǎ), People’s Republic of China, 2 bytes
ISO-2022-JP  Japanese, 1-2 bytes variable length (compatible with ASCII)
ISO-8859-1   Latin, Western European without Euro, 8-bit
ISO-8859-2   Latin, East European, 8-bit
ISO-8859-15  Latin, Western European with Euro, 8-bit
KOI8-R       Cyrillic, Russian, 8-bit
UTF-8        Unicode, 1-4 bytes variable length (compatible with ASCII)
UTF-16       Unicode, 2 bytes (4 bytes for characters outside the Basic Multilingual Plane)

Table 2.1: Frequently used character encodings in XML

Example 2.2 (XML Elements)

<address-book>
content
</address-book>

Elements may contain either other elements, character data, or both (mixed content). In analogy with the document tree, such content is often referred to as the children of an element. Interleaving of the opening and closing tags of different elements (e.g. <b><i>Text</b></i>) is forbidden. The order of elements is relevant (so-called document order). This is a reasonable requirement for storing text data, but might be too restrictive when storing data items of a database. Applications working with XML data thus often ignore the document order. If an element contains no content, it may be abbreviated as <label/>, i.e. the “closing slash” is contained in the start tag.

Example 2.3 (Empty Elements)

In HTML, line breaks are indicated by an empty element with label br. In XML syntax, this is specified as <br/>.

An XML document always contains a distinguished element called the root element that encloses all other content of the document. If a schema is associated with an XML document, then the root element has to be an instance of this schema in order for the document to be valid. Documents that do not conform to a specified schema, but otherwise adhere to the XML specification, are invalid but well-formed.
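Well-formedness in the above sense can be tested with any XML parser. The following Python sketch uses the standard library module xml.etree.ElementTree, which is non-validating: it rejects documents that are not well-formed, but does not check validity against a schema. The function name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = [
    "<b><i>Text</i></b>",   # properly nested
    "<b><i>Text</b></i>",   # interleaved tags
    "<a/><b/>",             # more than one root element
    "<p>line<br/>break</p>",  # empty element in abbreviated form
]
for sample in samples:
    print(sample, is_well_formed(sample))

# The children of an element are reported in document order.
order = [child.tag for child in ET.fromstring("<r><x/><y/><x/></r>")]
print(order)
```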

Character Data

Besides elements, XML documents may contain character data. In general, character data is written “as-is”, i.e. it is not enclosed in special symbols like in many programming languages or the semistructured expressions above.

Example 2.4 (Character Data)

The following XML document contains character data mixed with element content:

<document> ...

The quick brown fox <highlight>jumps</highlight> over the lazy dog. ...

</document>
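How mixed content is exposed depends on the programming interface. As an illustration, the following Python sketch shows how the standard library module xml.etree.ElementTree splits the sentence of Example 2.4 into the text before the child element, the text of the child, and the text following it (called tail in that API). The ellipses of the example are left out.

```python
import xml.etree.ElementTree as ET

DOCUMENT = (
    "<document>The quick brown fox "
    "<highlight>jumps</highlight> over the lazy dog.</document>"
)

root = ET.fromstring(DOCUMENT)
highlight = root[0]

# Character data before the first child, inside the child, and after it.
parts = [root.text, highlight.text, highlight.tail]
print(parts)

# itertext() walks all character data in document order.
sentence = "".join(root.itertext())
print(sentence)
```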

Whitespace in character data is often treated as insignificant by applications, and certain reserved characters (like <) are disallowed. Therefore, XML provides an additional construct for escaping character data, so-called CDATA sections. CDATA sections are enclosed in <![CDATA[ and ]]>.

Example 2.5 (CDATA Sections)

The following is only character data and does not contain markup:

<![CDATA[The quick brown fox <highlight>jumps</highlight> over the lazy dog.]]>
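That the content of a CDATA section is not interpreted as markup can be verified with a parser. The sketch below uses Python's standard library; wrapping the CDATA section in a document element is an assumption made so that the fragment forms a complete document. For comparison, the same text is also written with escaped reserved characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

TEXT = "The quick brown fox <highlight>jumps</highlight> over the lazy dog."

# Variant 1: the text is protected by a CDATA section.
with_cdata = ET.fromstring(f"<document><![CDATA[{TEXT}]]></document>")

# Variant 2: the reserved characters are replaced by entity references.
with_escapes = ET.fromstring(f"<document>{escape(TEXT)}</document>")

print(with_cdata.text)
print(len(with_cdata))  # number of child elements
```

In both variants the parser reports a single piece of character data and no child elements.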


Attributes

Opening tags of elements may contain a set of key/value pairs called attributes. Attributes are of the form name = "value", where name may contain the same characters as element labels and value is a character sequence which is always enclosed in quotes (either " or ') and in which white space is insignificant. An opening tag may contain attributes in any order, but each attribute name can occur at most once.

Example 2.6 (XML Attributes)

<person id="mickey mouse">
  <name>
    <first>Mickey</first>
    <last>Mouse</last>
  </name>
  <phone type="home">19281118</phone>
</person>
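Attributes are typically exposed as a mapping from names to values. The following Python sketch reads the attributes of Example 2.6 with the standard library module xml.etree.ElementTree and shows that a repeated attribute name is rejected by the parser; the helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET

PERSON = """<person id="mickey mouse">
  <name><first>Mickey</first><last>Mouse</last></name>
  <phone type="home">19281118</phone>
</person>"""

person = ET.fromstring(PERSON)
phone = person.find("phone")

person_id = person.get("id")    # value of a single attribute
phone_attributes = phone.attrib  # all attributes as a dictionary
print(person_id, phone_attributes)


def parses(text):
    """Return True if the parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Each attribute name may occur at most once per opening tag.
print(parses('<phone type="home" type="work"/>'))
```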

XML defines certain reserved attributes, currently xml:lang (which defines the language of the element content) and xml:space (which signals whether whitespace in the element content is to be preserved). Furthermore, certain extensions of XML, like XLink [110] and XML Namespaces [117], reserve attributes prefixed by xlink: and xmlns:.

Example 2.7 (xml:lang)

The reserved attribute xml:lang may be used to specify the language of element contents. This may e.g. be used to specify two different titles for a book:

<book>
  <title xml:lang="sv">Folket i Birka på Vikingarnas Tid</title>
  <title xml:lang="de">Die Leute von Birka. So lebten die Wikinger.</title>
  <title xml:lang="en">The people of Birka in the Viking Age</title>
</book>
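An application can use xml:lang to select the content in a desired language. The following Python sketch is one way to do this with the standard library module xml.etree.ElementTree, which reports the attribute under the reserved XML namespace. The language codes sv, de and en and the function name titles_by_language are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved prefix xml: to its namespace name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Sample data; the language codes are assumed for this sketch.
BOOK = """<book>
  <title xml:lang="sv">Folket i Birka på Vikingarnas Tid</title>
  <title xml:lang="de">Die Leute von Birka. So lebten die Wikinger.</title>
  <title xml:lang="en">The people of Birka in the Viking Age</title>
</book>"""


def titles_by_language(xml_text):
    """Map each language code to the title given in that language."""
    root = ET.fromstring(xml_text)
    return {
        title.get(XML_LANG): title.text.strip()
        for title in root.findall("title")
    }


titles = titles_by_language(BOOK)
print(titles["en"])
```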

Entities

XML entities are a macro mechanism for reusing commonly used content. In particular, the reserved characters < and & can be expressed using entities. Note that, unlike many other macro mechanisms, XML entities cannot be parametrised.

Entities are defined in the document type definition in the prologue of an XML document (or in an external DTD) with the construct <!ENTITY name "value">, which defines the entity name to be an abbreviation for value. value may contain any content, including markup. Entity references have the form &name;, where name is the name of either a predefined or a previously defined entity. The occurrence of &name; is then literally replaced by the value of the entity.

Example 2.8 (Entities)

The following example defines an entity warn to be the data <bold>Warning:</bold> (i.e. the word “Warning” printed in bold face) and refers to it later by &warn;:

<?xml version="1.0" encoding="ISO-8859-15"?>
<!DOCTYPE paragraph [
<!ELEMENT paragraph ANY >
<!ENTITY warn "<bold>Warning:</bold>">
]>
<paragraph>
&warn; Don’t ever try this out yourself.
</paragraph>

&amp;   &
&lt;    <
&gt;    >
&apos;  '
&quot;  "
&#xN;   the ISO/IEC 10646 character with hexadecimal number N

Table 2.2: Predefined character references available in XML
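The replacement of entity references can be observed with a parser that expands internal entities. The following Python sketch uses the standard library module xml.etree.ElementTree on a document modelled after Example 2.8; the XML declaration is omitted and a plain apostrophe is used so that the sketch does not depend on a particular encoding.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE paragraph [
  <!ELEMENT paragraph ANY>
  <!ENTITY warn "<bold>Warning:</bold>">
]>
<paragraph>&warn; Don't ever try this out yourself.</paragraph>"""

paragraph = ET.fromstring(DOCUMENT)

# After expansion, the markup contained in the entity value is part of the tree.
bold = paragraph.find("bold")
print(bold.text)
print(bold.tail)
print("".join(paragraph.itertext()))
```

The application no longer sees the reference &warn; but only its replacement, an element bold followed by character data.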

Entities can also be used for character references. For example, &lt; refers to the character <, which is otherwise not allowed in character data. Table 2.2 summarises character references that may be used in XML.

Example 2.9 (Character References)

The following character reference includes the character α, which has the hexadecimal number 0x03B1 (or 945 in decimal format): The character &#x03B1; is rendered as The character α.
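Character references are resolved by the parser before the character data reaches the application. The following Python sketch checks this for the hexadecimal and the decimal form of the reference to α as well as for the predefined references, using the standard library module xml.etree.ElementTree; the element name p is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hexadecimal and decimal references to the same character.
hex_ref = ET.fromstring("<p>The character &#x03B1;</p>").text
dec_ref = ET.fromstring("<p>The character &#945;</p>").text

# The five predefined references, separated by single spaces.
predefined = ET.fromstring("<p>&lt; &amp; &gt; &apos; &quot;</p>").text

print(hex_ref)
print(dec_ref)
print(predefined)
```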

A third application of entities is the possibility to include binary data, such as a PNG (Portable Network Graphics) image, in an XML document by means of so-called external entities.

Example 2.10 (External Entities with Binary Content)

The following external entity includes the PNG image figure.png in the XML document:

<!ENTITY figure SYSTEM "./figure.png" NDATA png>
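A parser does not read the binary data itself; it merely reports the declaration to the application, which then decides how to handle the referenced file. The following Python sketch collects such declarations with the standard library module xml.parsers.expat. The surrounding document, the root element figures, the notation declaration for png and the function name unparsed_entities are assumptions made for illustration.

```python
import xml.parsers.expat

# Hypothetical document around the entity declaration from the example.
DOCUMENT = """<!DOCTYPE figures [
  <!NOTATION png SYSTEM "image/png">
  <!ENTITY figure SYSTEM "./figure.png" NDATA png>
]>
<figures/>"""


def unparsed_entities(text):
    """Collect the unparsed (NDATA) entities declared in a document."""
    declared = {}

    def on_unparsed_entity(name, base, system_id, public_id, notation_name):
        # Only the location and the notation are reported, not the data.
        declared[name] = (system_id, notation_name)

    parser = xml.parsers.expat.ParserCreate()
    parser.UnparsedEntityDeclHandler = on_unparsed_entity
    parser.Parse(text, True)
    return declared


entities = unparsed_entities(DOCUMENT)
print(entities)
```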

In document Schaffert, Finn Sebastian (2004): Xcerpt: A Rule-Based Query and Transformation Language for the Web. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 30-33)