
1. XML is Verbose

XML is verbose: the source text of an XML document contains far more than the information the document carries. Put another way, the unique information contained within an XML document makes up only a portion of the document’s size. In the simple notebook.xml example, repeated in Table 3 below for clarity, only the attribute values (the characters enclosed in quotation marks) and the element content (the characters between an element’s start and end tags) are the actual unique information of the document. All other characters are XML structural metadata, which loosely means information about information; this metadata is the verboseness of XML.

<?xml version="1.0" encoding="UTF-8"?>
<notebook date="2007-09-12">
  <note category="EXI" date="2007-07-23">
    <subject>EXI</subject>
    <body>Do not forget it!</body>
  </note>
  <note date="2007-09-12">
    <subject>shopping list</subject>
    <body>milk, honey</body>
  </note>
</notebook>

Table 3. Notebook.xml Verbose Structured Format (From W3C, 2007)

Stripping all of the XML structure from the raw notebook.xml document, and leaving only the actual information of the document, yields Table 4.

2007-09-12EXI2007-07-23EXIDo not forget it!2007-09-12shopping listmilk, honey

Table 4. Notebook.xml Terse Information-Only Format

The original verbose version of notebook.xml in Table 3 has 310 characters including spaces, while the information-only version in Table 4 has only 77 characters. Only about a quarter of the document (77 of 310 characters, roughly 25%) is data; the remaining three quarters are XML formatting and structuring characters.
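The comparison between Table 3 and Table 4 can be reproduced with a short script. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and an indented layout of notebook.xml; the helper name information_only is hypothetical. Because the total character count depends on how the whitespace is laid out, the total printed by the script need not equal the 310 characters quoted above.

```python
import xml.etree.ElementTree as ET

# notebook.xml from Table 3; the indentation used here is an assumption.
NOTEBOOK_XML = """<?xml version="1.0" encoding="UTF-8"?>
<notebook date="2007-09-12">
  <note category="EXI" date="2007-07-23">
    <subject>EXI</subject>
    <body>Do not forget it!</body>
  </note>
  <note date="2007-09-12">
    <subject>shopping list</subject>
    <body>milk, honey</body>
  </note>
</notebook>
"""


def information_only(xml_text):
    """Concatenate attribute values and element text in document order."""
    # Parse from bytes so the encoding declaration in the prolog is honored.
    root = ET.fromstring(xml_text.encode("utf-8"))
    parts = []
    for element in root.iter():  # pre-order traversal, i.e. document order
        parts.extend(element.attrib.values())  # attribute content
        text = (element.text or "").strip()  # element content without layout whitespace
        if text:
            parts.append(text)
    return "".join(parts)


info = information_only(NOTEBOOK_XML)
data_share = len(info) / len(NOTEBOOK_XML)
print(info)
print(f"{len(info)} of {len(NOTEBOOK_XML)} characters ({data_share:.0%}) are data")
```

The first line printed is the 77-character information-only string of Table 4; everything else in the source text is structure.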

2. Why XML is Verbose

The fundamental problem with the information-only version of the notebook.xml document is that it is nearly impossible to determine the meaning of the information: there is no clear way to determine where one piece of information starts and ends, nor how the pieces of information relate to one another. Without detailed prior knowledge of exactly how many pieces of information the document contains, or of exactly how many characters each piece occupies, extracting the individual attribute values and element contents from the terse information-only version ranges from difficult to impossible. For general-purpose XML, this information-only approach is clearly unworkable. Even knowing the meaning of the notebook.xml example, it is still hard to decipher the information when looking at the information-only version.

Prior to the advent of XML, this problem of information representation was addressed with a file-structure, which is essentially XML structure hardcoded into a procedural programming language, within an algorithm that parses the information from a file. This requires the parsing algorithm to know, by means of hardcoded procedural routines, the precise number of data fields and the exact number of characters of each data field within the file-structure (Stern & Stern, 1994). Note that a field is a piece of information from a list, properly called a record. Using the notebook.xml example, the subject element is a field of the record note, and notebook.xml contains two note records.

The problem with this approach is that it mandates a rigid input file format, one that cannot change. The file has to consist of the same number of fields with the same number of characters every time, and it cannot support new fields or varying-length fields. The traditional file-structure paradigms cannot adapt to different file formats or slight deviations without source-code modification followed by recompilation. File-structures simply cannot keep up with the dynamics of the information-based world of today.
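The rigidity of a file-structure can be illustrated with a small fixed-width parser. This is only a sketch under stated assumptions: the field names and widths (date 10, subject 13, body 17) and the helpers parse_records and pack_record are hypothetical, chosen so that the two note records of notebook.xml fit, and are not taken from any real file format.

```python
# Hypothetical fixed-width layout for one note record: (field name, width).
# These widths are an assumption for illustration only.
FIELD_WIDTHS = (("date", 10), ("subject", 13), ("body", 17))
RECORD_LENGTH = sum(width for _, width in FIELD_WIDTHS)  # 40 characters


def parse_records(data):
    """Split a flat string into records using the hardcoded field widths."""
    if len(data) % RECORD_LENGTH != 0:
        raise ValueError("input does not match the fixed record layout")
    records = []
    for start in range(0, len(data), RECORD_LENGTH):
        record, offset = {}, start
        for name, width in FIELD_WIDTHS:
            record[name] = data[offset:offset + width].rstrip()  # drop padding
            offset += width
        records.append(record)
    return records


def pack_record(date, subject, body):
    """Pad each field to its fixed width (values are assumed to fit)."""
    values = (date, subject, body)
    return "".join(
        value.ljust(width) for value, (_, width) in zip(values, FIELD_WIDTHS)
    )


flat_file = (
    pack_record("2007-07-23", "EXI", "Do not forget it!")
    + pack_record("2007-09-12", "shopping list", "milk, honey")
)
notes = parse_records(flat_file)
print(notes)
```

A value longer than its field, or a new field such as category, cannot be represented without changing FIELD_WIDTHS and rewriting every existing file, which is the kind of source-code modification described above.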

3. XML is Verbose by Design and is Not New

XML’s flexibility to support an arbitrary information set requires that every element and attribute be declared explicitly every time it occurs, even when it repeats a previous element or attribute, so that the parent-child relationships are shown clearly and unambiguously. Overall, this results in a verbose XML document, but the verbosity is one of the design tradeoffs accepted during XML’s development: point four, “XML is Verbose by Design,” of the W3C XML in 10 Points document (1999).

It can be reasoned that XML’s structured format is not extra wasteful information, since XML’s verbose structure defines the information of a document in the same way any other file-structure does with strict, structural, hard-coded algorithms. XML resolves the strict structural issues of file-structures, but at the cost of being verbose. Point six of the W3C XML in 10 Points document, “XML is new but not that new,” points out that XML is the evolution of previous data-structuring formats such as SGML (W3C, 2003). XML is really no different from a file-structure, but unlike a file-structure in the classical sense, XML is neither hardware driven nor driven by any particular data format, unless explicitly stated so by means of a schema, and so it is able to flexibly define any information set. XML has no rigid structuring requirements of the information itself other than information tagging, which leads to the verboseness of XML.

Table 5 gives another simple example of the size implications of XML. This XML document (top), when rendered by a browser, displays a single 3D box (bottom).

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE X3D PUBLIC "ISO//Web3D//DTD X3D 3.2//EN"
  "http://www.web3d.org/specifications/x3d-3.2.dtd">
<X3D profile='Immersive' version='3.2'
  xmlns:xsd='http://www.w3.org/2001/XMLSchema-instance'
  xsd:noNamespaceSchemaLocation='http://www.web3d.org/specifications/x3d-3.2.xsd'>
  <head>
  </head>
  <Scene>
    <Shape>
      <Box/>
    </Shape>
  </Scene>
</X3D>

Table 5. XML Structure Bloating Example (X3D XML Code Top, X3D Scene Bottom)

This Extensible 3D Graphics (X3D) document, which is a member of the XML family of languages, consists of 355 characters, but it represents only a box, which is defined by three characters: the bold “Box” statement in the code at the top. Everything else is, in some sense, overhead metadata. This of course assumes that there is a language vocabulary that defines what a box is, its location, colors, etc., which is the case in this example, all based on the X3D specification.

In summary, once structural semantics are discounted, XML documents are larger than the actual information contained within them because of XML’s intentionally verbose design.