Additional XML Elements - An Introduction to XML

2.5 Additional XML Elements

While the element or node is the primary component in XML, the language also includes several other, somewhat less common, but useful constructs. These are the XML declaration, processing instructions, comments, <CDATA> section delimiters, entity references, and document-type declarations. In this section, we briefly describe each of these.

Overview of Additional XML Markup

CDATA Character data that is treated by the XML parser as literal text and not XML content. This is used to “escape” content that happens to contain special XML characters such as < and &, and treat it as verbatim text. Markup between the delimiters <![CDATA[ and ]]> is not processed, e.g.,

<![CDATA[ x <- y > 10 & z < 20 ]]>

Comment A block of text that is not considered part of the data itself, but informal information about the document. A comment appears between the delimiters <!-- and -->.

XML Declaration Optional statement at the beginning of a document that identifies the content as an XML document and provides additional information about the character encoding and the version of XML, e.g.,

<?xml version = "1.0" encoding="UTF-8" ?>

Processing Instruction Optional hint or instructions to an application that might parse the XML document. It is a way for us to include directives to a target application so that it will do something as it processes the XML nodes, such as use a particular style sheet to render the document. A processing instruction has two parts: the name of the target application and the instruction. The target application is a simple string, while the instruction can be any sequence of text and is entirely application-specific. For example, we can set the width option in R with

<?R options(width = 140) ?>

Other applications reading the XML will ignore this. One can include the same information within regular XML elements, but processing instructions are a powerful and convenient way to specify application-specific information without adding to the XML vocabulary used for the data.

Document-type Declaration A declaration at the start of the document (before the root node) that identifies the “type” of the document, e.g.,

<!DOCTYPE html>

The type can either be one of the known types such as html or xhtml, or can also specify the location of an external DTD (Document Type Definition) document that describes the structure of a valid document of this type. Some or all of the DTD can also be inlined within the <DOCTYPE> node, and extensions and character entities can be added.

CDATA and Entities

When working to get data from XML documents, we often come across two other XML constructs: entities and <CDATA> (the <CDATA> construct stands for character data). The good news is that the XML parser typically allows us to transparently work with these as regular text. However, it is good to be able to recognize and understand them and why they are necessary. Since the character < is used to indicate the start of an XML tag name, how can we write an inequality such as x < y? We need a way to “escape” or protect the < from the XML parser so that it does not think the < starts a new XML element. For this, we use the entity &lt;, that is, an ampersand (&), the shorthand lt for “less than”, and we end the name of the entity with a semicolon. Any entity is introduced with an ampersand and ended with a semicolon. XML defines several standard entities; the most commonly used are &lt; for < and &amp; for &. We need an entity for & itself since we cannot use the literal & character as XML thinks that starts an entity! We can use entities generally in XML as macros or text substitutions and also for specifying nonstandard characters, e.g., accents such as the cedilla under the character c (ç) with &#231;, or characters in other alphabets such as the Greek letter alpha with &#945;, or for special symbols such as the copyright symbol (©) with &#169;. These can also be inserted directly into the XML without entities. The XML parser can keep entities as special objects or just replace them with the corresponding text.
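As a small illustration of the escaping described above, the following Python sketch uses the standard library to replace the special characters in an expression with entity references and then checks that a parser hands back the original text. The choice of Python and the element name expr are assumptions made only for this example; they are not taken from the text.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# A plain-text expression containing the special characters < and &.
expr = "x < y & z"

# escape() replaces &, < and > with the entity references &amp;, &lt; and &gt;.
escaped = escape(expr)

# Wrap the escaped text in a hypothetical <expr> element and parse it.
doc = "<expr>" + escaped + "</expr>"
parsed_text = ET.fromstring(doc).text

print(escaped)      # the text as it is written in the XML document
print(parsed_text)  # the text as the application sees it after parsing
```

The parser substitutes the entities, so the application works with the ordinary characters and never sees the escaped form.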

Entities are especially important when we have code within the XML document because many programming languages such as R and C make common use of the < and & characters. Rather than using entities for each instance of < and & within a complex piece of code, we can “escape” an entire block of text from the XML parser using a <CDATA> section. We start such a section with <![CDATA[, and we end it with ]]>. For example,

<r:code>
<![CDATA[
if(all(x) > 0 && max(y) < 10) z = log(x) > 1 & y > median(y) ]]>
</r:code>

When the XML parser encounters the start of the <CDATA> section, it reads the characters up to the end of the <CDATA> section as regular text. As a result, we see this text not within a special XML construct, but as a simple text node.
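To see that a <CDATA> section and entity references are two spellings of the same text, here is a minimal Python sketch using the standard library parser. It reuses the R expression from the example above, but places it in a hypothetical <code> element without a namespace prefix so that the snippet is self-contained.

```python
import xml.etree.ElementTree as ET

# The R expression from the example, as plain text.
code = "if(all(x) > 0 && max(y) < 10) z = log(x) > 1 & y > median(y)"

# The same content written once with a CDATA section, once with entities.
with_cdata = "<code><![CDATA[" + code + "]]></code>"
with_entities = ("<code>if(all(x) &gt; 0 &amp;&amp; max(y) &lt; 10) "
                 "z = log(x) &gt; 1 &amp; y &gt; median(y)</code>")

text_from_cdata = ET.fromstring(with_cdata).text
text_from_entities = ET.fromstring(with_entities).text

# Writing the code without any escaping is not well-formed XML.
try:
    ET.fromstring("<code>" + code + "</code>")
    unescaped_parses = True
except ET.ParseError:
    unescaped_parses = False

print(text_from_cdata == text_from_entities)
print(unescaped_parses)
```

Both forms yield an ordinary text node with identical content, while the unescaped version is rejected by the parser.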

Comments

XML allows comments to be included within a document. These are meant to provide information about (part of) the document. These are not necessarily structured in any particular way, although one could use some convention to include the information. However, it is much more sensible to include that information as regular XML if it is to be interpreted. Accordingly, comments are usually just free-form descriptions or notes about the content. A parser can ignore comments or keep them in the parsed XML tree.

A comment is introduced with <!-- and ended with -->. Comments can contain the symbols < and > because all text between the delimiters is ignored by the XML processor and so is neither rendered nor read. For example, the following KML contains a comment in the <Point> element.

<Point>

</Point>

</Placemark>

Comments can appear anywhere after the XML declaration. They can even appear outside the root element of the document immediately following the XML declaration.
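The statement that a parser can either ignore comments or keep them in the tree can be sketched with two parsers from the Python standard library. The <note> document below is made up for this illustration; with default settings, xml.etree.ElementTree drops comments, while xml.dom.minidom keeps them as comment nodes.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# A hypothetical document whose comment contains < and >.
doc = "<note><!-- draft: check < and > here --><to>Ann</to></note>"

# ElementTree's default parser discards comments while building the tree.
root = ET.fromstring(doc)
tags_seen = [child.tag for child in root]

# minidom keeps comments as nodes that an application may inspect.
dom = minidom.parseString(doc)
comments = [node.data
            for node in dom.documentElement.childNodes
            if node.nodeType == node.COMMENT_NODE]

print(tags_seen)
print(comments)
```

Either behavior is acceptable, since the comment is not part of the data itself.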

XML Declaration

Typically, an XML document starts with an XML declaration that identifies it as an XML document and provides, most importantly, the information defining the character encoding, e.g., UTF-8. The declaration must also specify the version of XML, which in practice is almost always 1.0. For example, the following declaration,

<?xml version = "1.0" encoding="UTF-8" ?>

indicates that version 1.0 of XML is being used and that the character encoding is UTF-8. The most common encodings for XML are UTF-8, UTF-16, and ISO-8859-1 (a.k.a. Latin1). All XML processors must support UTF-8 and UTF-16, and most also support ISO-8859-1. The declaration appears first in the document, followed by the root element, i.e., the document appears as

<?xml version = "1.0" encoding="UTF-8" ?>

<root>

...

</root>

Although this declaration is not required, it is a good way to clearly identify a document as XML.

One should always specify the character encoding to help consumers of the XML determine how to interpret the sequence of bytes as strings. It is essential when one uses non-ASCII characters.
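The following Python sketch illustrates why the encoding matters for non-ASCII content. It stores a hypothetical <name> element as ISO-8859-1 bytes, once with a matching declaration and once with no declaration at all. The document and names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The same text, including the non-ASCII character c-cedilla (U+00E7).
declared = '<?xml version="1.0" encoding="ISO-8859-1"?><name>Fran\u00e7ois</name>'
undeclared = "<name>Fran\u00e7ois</name>"

# With the declaration, the parser knows how to decode the bytes.
name = ET.fromstring(declared.encode("iso-8859-1")).text

# Without it, the parser assumes UTF-8, and the Latin1 byte for the
# c-cedilla is not a valid UTF-8 sequence.
try:
    ET.fromstring(undeclared.encode("iso-8859-1"))
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(name)
print(undeclared_ok)
```

Had the bytes been written as UTF-8, the undeclared version would also parse, since UTF-8 is the default when no encoding is given.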

Processing Instruction

A processing instruction (PI) is a directive or instruction within the XML content to a particular target application that might be parsing the document. If another application is processing the document, it will ignore PIs not meant for it. The idea of a PI is to be able to give an application-specific command to the parser to change its state or have it do something.

A PI identifies the target application and is followed by the command. This information is enclosed within an opening <? and a closing ?>, e.g.,

<?R options(width = 140)?>

<?xml-stylesheet type="text/xsl" href="XSL/Todo.xsl" ?>

(This is similar to the form of the XML declaration above, but that is technically not a processing instruction.) In these examples, we have an instruction for R and another for an application named xml-stylesheet. The latter is intended for a Web browser. The information after the target application gives two attribute-like settings. The type identifies an XSL document which can be used to transform the XML document into HTML. The href “attribute” identifies the location of the XSL document. A capable Web browser will then use this to transform the XML document and show the resulting HTML document in its place.

An xml-stylesheet processing instruction must be placed between the XML declaration and the root element of the document. Other processing instructions may also be placed there or at other locations in the document. The name–value pair format of the xml-stylesheet, e.g., type="text/css", imitates the syntax for attributes. In general, the content can be any text values expected by the application, i.e., different applications support different formats for the processing instructions.

Each application will choose which target applications it will recognize, and it will operate only on <PI>s for those applications and ignore the rest. The <PI>s are only hints, as a parser may ignore them entirely.
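As a rough sketch of how an application might pick out only the processing instructions addressed to it, the Python code below parses a document containing the two PIs shown above and keeps those whose target is R. The empty <root/> element and the variable names are assumptions for illustration.

```python
from xml.dom import minidom

doc = """<?xml version="1.0" encoding="UTF-8"?>
<?R options(width = 140)?>
<?xml-stylesheet type="text/xsl" href="XSL/Todo.xsl" ?>
<root/>"""

dom = minidom.parseString(doc)

# Collect (target, instruction) pairs for every PI before the root element.
# The XML declaration is not reported as a processing instruction.
pis = [(node.target, node.data.strip())
       for node in dom.childNodes
       if node.nodeType == node.PROCESSING_INSTRUCTION_NODE]

# A hypothetical application that only understands PIs targeted at R.
for_r = [data for target, data in pis if target == "R"]

print(pis)
print(for_r)
```

The instruction text is an opaque string to the parser; interpreting it, for example by evaluating it as R code, is left entirely to the target application.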

Document-type Declaration

We have so far described the requirements for a document to be well formed, i.e., it meets the generic syntax requirements of XML. In addition to being well formed, a class of documents may only make sense if there is a specific relationship between the elements. When a document meets these application-specific requirements it is said to be valid. With HTML5, this information is specified via the document-type declaration (DTD), as follows

<!DOCTYPE html>

Another approach to defining an application-specific vocabulary and rules is with XML Schema.

Details on DTDs and XML Schema appear in Section 2.7.
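The difference between well formed and valid can be hinted at with a short Python sketch. The standard library parser used here is non-validating: it records the document-type declaration but only checks well-formedness. Both small documents are invented for this example.

```python
from xml.dom import minidom

# The parser records the document type named in the declaration.
dom = minidom.parseString("<!DOCTYPE html><html><body/></html>")
doctype_name = dom.doctype.name
root_name = dom.documentElement.tagName

# This document is well formed, so it parses, even though a <li> directly
# inside <html> would not satisfy the rules for an HTML document.
odd = minidom.parseString("<!DOCTYPE html><html><li/></html>")
odd_child = odd.documentElement.firstChild.tagName

print(doctype_name, root_name, odd_child)
```

Checking validity requires a validating parser or a separate schema-aware tool, which is beyond what this sketch shows.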

In document XML and Web Technologies for Data Sciences With R-Springer-Verlag New York (2014) (Page 63-66)