4.2 Shallow Processing and XML Markup
4.2.3 Well-Formed and Valid Documents
• text
• elements (loosely also called tags), forming the 'markup'. Elements enclose text or
embed other elements and hence can impose a hierarchical structure on documents. An element name should semantically describe its content:
<heading> text </heading>
• attributes, which add non-structured information (modifiers) to elements and are
specified together with the opening element:
<heading level="2"> text </heading>
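The three building blocks can be made concrete with a minimal sketch. The use of Python and of its standard xml.etree.ElementTree module is an assumption of this example, not something the text prescribes; the sample element is the one shown above.

```python
import xml.etree.ElementTree as ET

# The sample element from the text: element name, one attribute, enclosed text.
heading = ET.fromstring('<heading level="2"> text </heading>')

print(heading.tag)      # the element name
print(heading.attrib)   # the attributes, as a dictionary
print(heading.text)     # the enclosed text, whitespace preserved
```

Text, element name and attributes are thus accessible as three separate pieces of information about the same node.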
One of the design goals of XML (and SGML) was to make the syntax both human-readable and machine-readable. This is probably why the closing element designator redundantly repeats the element name (in contrast to the parenthesis syntax of LISP s-expressions, which the XML/SGML designers obviously did not consider human-readable).
To distinguish elements from text, the starting and ending element names in XML must be enclosed in angle brackets; the closing element is additionally marked by a single slash after the opening angle bracket. Empty elements (elements that contain neither text nor other elements) may be abbreviated as <element/>.
Angle brackets and three other characters used for markup (ampersand, apostrophe and double quote) have to be quoted when occurring in normal text within an SGML or XML document. The elements in an XML document form a tree and hence must be balanced (element borders must not cross). An XML document must have a single root element.
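Quoting of markup characters can be sketched as follows, assuming Python's standard xml.sax.saxutils module. The sample string and the wrapping <code> element are invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = 'if a < b && b > c then "ok"'

# escape() always replaces &, < and >; the two quote characters are
# passed explicitly as additional replacements.
quoted = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(quoted)

# Embedded in an element, the quoted text parses back to the original string.
parsed = ET.fromstring("<code>" + quoted + "</code>")
print(parsed.text == raw)
```

The parser resolves the predefined entity references again, so the quoting is invisible to the consumer of the document.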
An XML document is well-formed if it meets these conditions (plus some others mentioned in the standard, such as Unicode conformity), i.e., if it is syntactically correct.
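A well-formedness check amounts to asking whether a parser accepts the document at all. The following is a minimal sketch under the assumption that Python's standard xml.etree.ElementTree parser is used; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<a><b>text</b></a>"))   # balanced tree, single root
print(is_well_formed("<a><b>text</a></b>"))   # element borders cross
print(is_well_formed("<a/><b/>"))             # more than one root element
print(is_well_formed("<a>1 < 2</a>"))         # unquoted angle bracket in text
```

Only the first document satisfies the conditions listed above; the other three each violate exactly one of them.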
An XML document is valid if it conforms to a DTD (document type definition) that describes the structure of a class of documents in a grammar with a BNF-like description of element containment, order and repetition, as well as constraints on attributes.
Such DTDs are optional, i.e., the XML recommendation requires XML documents to be well-formed, but they do not necessarily have to be valid. A DTD states, for example, which element and attribute names are admitted in the document class, which element is the root element, which elements may be enclosed by which other elements (and possibly in which order), which elements are mandatory or optional, and where text is allowed within elements. Examples of NLP-related DTDs can be found in the DTD Appendix (page 285ff).
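The difference between well-formed and valid documents can be observed with a non-validating parser. The sketch below assumes Python's standard xml.etree.ElementTree, which is built on the non-validating Expat parser. The DTD and both documents are hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD: a <doc> is one <heading> followed by any number of <para>.
DTD = """<!DOCTYPE doc [
  <!ELEMENT doc (heading, para*)>
  <!ELEMENT heading (#PCDATA)>
  <!ATTLIST heading level CDATA #IMPLIED>
  <!ELEMENT para (#PCDATA)>
]>"""

conforming = DTD + "<doc><heading level='2'>Title</heading><para>Text</para></doc>"
# Well-formed, but not valid: <heading> is missing and <note> is undeclared.
violating = DTD + "<doc><para>Text</para><note>Remark</note></doc>"

for document in (conforming, violating):
    root = ET.fromstring(document)          # no exception for either document
    print([child.tag for child in root])
```

Both documents are accepted because both are well-formed. Only a validating parser, which is not part of this sketch, would reject the second one.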
Instead of a DTD, a schema can be used to validate an XML document. Schemata allow for finer-grained validity checking than DTDs, e.g. through user-definable data types, which do not exist in DTDs. XML Schema (by the World Wide Web Consortium; Thompson et al. 2004) and Relax NG Schema (by the OASIS consortium; http://relaxng.org) are the most popular schema definition languages.
For the purposes of this thesis, DTDs are preferable because we are mainly interested in the coarse structure of valid documents, which can be defined concisely in DTDs, whereas XML Schema and Relax NG, which are themselves expressed in XML syntax, are verbose, harder to read and less intuitive.
Both SGML and XML provide a means for describing document structure in the form of an abstract syntax via a DTD or schema. However, unlike the ISO/ITU standard ASN.1 (Abstract Syntax Notation One; Dubuisson 2000), which includes a semantics description in the form of world-wide unique object identifiers (OIDs), they do not provide a semantics for the document schemata or instances.
In XML, semantics is specified only implicitly and informally by giving elements and attributes descriptive ('speaking') names. The XML-generating and the XML-parsing ends must be guaranteed to interpret the content in the same, appropriate way. However, optionality of elements and attributes is quite an elegant way to cope with the fact that it may make sense to have XML-consuming software that only looks at those pieces of the XML input that it knows about and ignores the rest (which in turn may be of interest to another consumer).
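A consumer that reads only the parts it knows about can be sketched as follows, again assuming Python's xml.etree.ElementTree. The element and attribute names (token, surface, morph, confidence, pos, lemma) and the function read_token are hypothetical and not taken from any DTD in this thesis.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation produced by some upstream component.
ANNOTATION = """<token pos="NN" lemma="house">
  <surface>houses</surface>
  <morph number="pl"/>
  <confidence>0.93</confidence>
</token>"""

def read_token(xml_text):
    """Consumer that only knows <surface> and the optional 'pos' attribute."""
    token = ET.fromstring(xml_text)
    return {
        "surface": token.findtext("surface"),
        "pos": token.get("pos"),   # None if the optional attribute is absent
    }

print(read_token(ANNOTATION))
print(read_token("<token><surface>a</surface></token>"))
```

The elements morph and confidence and the attribute lemma are silently skipped. A second consumer could read exactly those without either component having to know about the other.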
One main difference between SGML and XML is that XML imposes stricter well-formedness conditions than SGML, while SGML provides a more powerful language for describing the validity of documents. Both properties together make XML easier to implement than SGML. Moreover, DTDs are mandatory in SGML, while they are optional in XML.
Further concepts of XML are:
• Uniform Resource Identifiers (URIs). URIs are used to reference external
resources (similar to HTML hyperlink references). However, an explicit linking mechanism is not part of the core XML standard, but is defined in separate standards such as XPointer (DeRose et al., 2002), XLink (DeRose
et al., 2001) or XInclude (Marsh and Orchard, 2001).
• Namespaces. Namespaces are, similar to packages in programming languages,
dictionaries of identifiers that make, for example, elements with the same name but from different DTDs distinguishable. The namespaces used in an XML document are declared (typically at the beginning, on the root element) using a URI that uniquely identifies the namespace and a local name that can then be used as a prefix for element names, separated by a colon, e.g.
<invoice xmlns:edi='http://ecommerce.org/schema'>
<edi:price units='Euro'>32.18</edi:price>
</invoice>
• ID/IDREF. ID and IDREF are special attributes for indexing and searching
elements within an XML document. To this end, ID attributes must be unique within a document. The XPath language we will describe below provides an id() function that can be used to access an XML node via its unique ID.
• Entities. Entities are abbreviations, e.g. for often repeated character
sequences. Entities can be defined in a separate DTD or at the beginning of an XML file.
• Unicode. The XML recommendation obliges implementations of XML to
support Unicode (other character sets may be supported optionally). This 'greatest common denominator' of character encoding ensures, for example, that multilingual documents can be processed uniformly. A further very important property concerns character length. Unicode introduces (in contrast to previous encoding initiatives) the distinction between a code and an encoding. Each Unicode character has a single, unique code, although there may be different encodings or representation formats with fixed-length (UCS-2, UCS-4) or variable-length (UTF-8, UTF-16) binary representations. The existence of an equal-length character code is very important for standoff annotation references that are based on unique text positions and string operations in multilingual applications.
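Several of the concepts listed above can be illustrated in one short sketch, assuming Python's standard xml.etree.ElementTree. Apart from the invoice example, all names here are hypothetical: the entity nlp, the elements text and w, and the sample word. The dictionary lookup is only a stand-in for the XPath id() function, because the attribute named id is not declared as type ID in any DTD here.

```python
import xml.etree.ElementTree as ET

# Namespaces: the prefix 'edi' is only a local abbreviation for the URI.
invoice = ET.fromstring(
    "<invoice xmlns:edi='http://ecommerce.org/schema'>"
    "<edi:price units='Euro'>32.18</edi:price>"
    "</invoice>"
)
price = invoice.find("{http://ecommerce.org/schema}price")
print(price.tag, price.get("units"), price.text)

# ID-style access: index elements by a document-unique attribute value.
text = ET.fromstring("<text><w id='t1'>small</w><w id='t2'>house</w></text>")
by_id = {w.get("id"): w for w in text.iter("w")}
print(by_id["t2"].text)

# Entities: an abbreviation declared at the beginning of the file.
doc = ET.fromstring(
    '<!DOCTYPE doc [<!ENTITY nlp "natural language processing">]>'
    "<doc>Tools for &nlp;.</doc>"
)
print(doc.text)

# Unicode: one code per character, encodings of different byte length.
word = "Größe"
print(len(word))                      # number of characters (codes)
print(len(word.encode("utf-8")))      # variable-length encoding
print(len(word.encode("utf-16-le")))  # two bytes per character here
print(len(word.encode("utf-32-le")))  # fixed four bytes per character
print(word[2:4])                      # character positions, not byte offsets
```

The last lines show why character-based positions matter for standoff annotation: the slice addresses the same two characters regardless of which encoding the document is stored in.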
The essentials of XML syntax mentioned above constitute, of course, only a partial description. The complete XML syntax is described in the W3C recommendation (Bray et al., 1998). The W3C XML recommendation (the official standardization document) itself makes references to other, lower-level standards such as Unicode for the character encoding of text and elements, and IETF RFC 1738 for the Uniform Resource Locator (URL) syntax.