An Introduction to XML

2.7 Describing the Structure of Classes of XML Documents: Schema and DTDs

When creating a grammar, we need a mechanism to tell people (and machines) what the format permits. Obviously, we can show them some examples. However, to illustrate the format in its full generality, we would need a multitude of examples covering all combinations of possibilities. Instead, we provide a set of constraints or definitions that limit the element and attribute names, the data types for attribute values and element content, and the allowable containment hierarchies of the elements.

There are two common approaches for doing this: one uses XML Schema [9] and the other a Document Type Definition (DTD) [40]. An XML document that, in addition to being well-formed, complies with a particular schema or DTD is said to be valid.

We describe both XML Schema and DTDs in this section. Chapter 14 discusses the functionality for reading and processing XML schema that is available in the XMLSchema package [35].

2.7.1 The DTD

The oldest schema format for XML is the Document Type Definition (DTD). While DTD support is ubiquitous due to its inclusion in the XML 1.0 standard, it is seen as limited. For example, it uses a non-XML syntax, inherited from SGML, to describe the vocabulary. As a result, it is not as easily extensible, does not support namespaces, and lacks the expressiveness of the XML Schema Definition (XSD), such as rich data typing and complex logical structure. However, the DTD is still used in many applications because it is easy to read and write. For this reason, we provide a brief example with a small sample of a DTD.

Example 2-4 A DTD for XHTML

A simple example of XML is the eXtensible HyperText Markup Language (XHTML). Readers familiar with HyperText Markup Language (HTML), the language developed for presenting material on the Web, know that HTML need not be well-formed to be properly rendered in a browser. For example, there is no need to close a <p> tag, and attribute values need not be enclosed in quotation marks. XHTML was developed to bring HTML in line with XML standards.
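The difference between browser-tolerated HTML and well-formed XHTML can be checked mechanically. The following is a minimal sketch in Python (not the language used elsewhere in this book) that relies only on the standard library's xml.etree.ElementTree parser. The helper is_well_formed() and the two sample strings are invented for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Tolerated by HTML browsers, but not XML: the first <p> is never
# closed and the attribute value is not quoted.
html_style = "<body><p class=intro>First paragraph<p>Second paragraph</body>"

# The same content written to XML rules, as XHTML requires.
xhtml_style = ('<body><p class="intro">First paragraph</p>'
               '<p>Second paragraph</p></body>')

print(is_well_formed(html_style))   # False
print(is_well_formed(xhtml_style))  # True
```

Note that this only tests well-formedness. Checking validity against a DTD or schema requires a validating parser, which the Python standard library does not provide.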

A DTD for XHTML is available at http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. The strict standards for XHTML require all element and attribute names to be lowercase. Another rule is that the root node must be <html> and it must contain a namespace declaration that points to the XHTML URI: http://www.w3.org/1999/xhtml. The start of an XHTML document typically consists of an XML declaration, a DOCTYPE declaration, and an <html> root with a namespace declaration, as follows:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

Below is a sample of the XHTML DTD. It pertains to the ordered list element <ol> and its child, the <li> element.

<!-- Ordered (numbered) list -->
<!ELEMENT ol (li)+>
<!ATTLIST ol
  %attrs;
  >

<!-- list item -->
<!ELEMENT li %Flow;>
<!ATTLIST li
  %attrs;
  >

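Notice the format of the DTD. Elements are declared via <!ELEMENT markup. The "(li)+" in <!ELEMENT ol (li)+> indicates that the <ol> tag can have only <li> children and that it must have one or more of them. The %attrs; and %Flow; tokens are parameter entity references, which stand for declarations made elsewhere in the DTD.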

We do not describe the syntax of DTDs further, as it is not our focus.
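To make the meaning of the content model "(li)+" concrete, here is a hand-written check in Python that mirrors the rule for a single <ol> element. It is a sketch of what the rule says, not a DTD validator: the function satisfies_ol_model() and the sample fragments are invented for this illustration, and attributes and the content of <li> are ignored.

```python
import xml.etree.ElementTree as ET

def is_blank(text):
    """True for missing or whitespace-only character data."""
    return text is None or text.strip() == ""

def satisfies_ol_model(ol):
    """Hand-coded check of the content model (li)+ for one <ol> element."""
    children = list(ol)
    # (li)+ : at least one child, and every child is an <li>.
    if len(children) < 1 or any(child.tag != "li" for child in children):
        return False
    # Element-only content: no stray text between the <li> children.
    return is_blank(ol.text) and all(is_blank(c.tail) for c in children)

good = ET.fromstring("<ol><li>one</li><li>two</li></ol>")
empty = ET.fromstring("<ol></ol>")
mixed = ET.fromstring("<ol><li>one</li><p>stray</p></ol>")

print(satisfies_ol_model(good))   # True
print(satisfies_ol_model(empty))  # False
print(satisfies_ol_model(mixed))  # False
```

In practice a validating parser derives such checks from the DTD itself, so none of this code needs to be written by hand.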

2.7.2 Schema

An XML parser has the job of reading the XML, checking it for errors, and passing it on to the intended application. If no schema (or DTD) is provided, the parser simply checks that the XML is well-formed. If a schema is provided, then the parser also determines whether the XML is valid before passing it on to the application. Models for parsing XML are described in greater detail in Chapters 3 and 5.

The XML Schema Definition (XSD) is a schema for a particular grammar of XML. It defines the building blocks of an XML document: what elements can appear in the document; which attributes can appear in an element; which elements can be children of an element; the order and number of child elements; and which data types are allowed for element content and attribute values. With schema, the format of an XML document can be specified at a fairly high level; general parsing tools can be used to validate documents; and other applications can easily reuse, extend, and even combine schemas to cover specialized cases.

The schema is an XML document itself with its own grammar. The root element of an XML Schema Definition is the <schema> element, and following the root are elements that describe the allowable elements for the XML grammar being defined. These elements are defined via <element> tags. The value in the name attribute of <element> is the tag name in the grammar. The schema specifies whether, for example, the element is simple or complex. Recall that a simple element is one that contains only text content. The type of text content can be specified in the schema as a boolean, string, integer, date, etc., or it can be a custom type that is also defined in the schema.

Attributes for tags in the grammar being defined are specified through <attribute> tags. Again, the name attribute in the <attribute> tag supplies the name of the attribute in the element being defined. Other attributes of <attribute> are used to specify its data type, default value, and whether or not it is required.

A complex element is one that contains other elements. In the schema definition of complex elements, child nodes describe constraints on the hierarchy of the complex element. For example, the <all> node indicates that children can appear in any order and that each child element can occur at most once; the <sequence> tag specifies that the children have to appear in a specified order; and the <group> tag is used to define related sets of elements. The <any> element indicates that any tag may be a child of the element being defined. The <any> tag makes it easy to extend an XML document with elements not defined specifically in the schema.
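Because a schema is itself an XML document, it can be read with any XML parser. The sketch below, in Python with only the standard library, parses a small schema and lists its <element> and <attribute> declarations. The schema (a <book> with a <title> child and two attributes, plus a simple <price> element) is hypothetical and was invented purely for this illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the XML Schema vocabulary, in ElementTree's {uri}name form.
XS = "{http://www.w3.org/2001/XMLSchema}"

# A small, hypothetical schema invented for this illustration.
schema_text = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="price" type="xs:decimal"/>
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
      </xs:sequence>
      <xs:attribute name="isbn" type="xs:string" use="required"/>
      <xs:attribute name="lang" type="xs:string" default="en"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

root = ET.fromstring(schema_text)   # the root is the <schema> element

# Each <element> declaration: tag name in the grammar and its type, if given.
elements = [(d.get("name"), d.get("type")) for d in root.iter(XS + "element")]

# Each <attribute> declaration: name, whether required, and default value.
# In XML Schema, an attribute is optional unless use="required" is given.
attributes = [(d.get("name"), d.get("use", "optional"), d.get("default"))
              for d in root.iter(XS + "attribute")]

print(elements)
print(attributes)
```

Here <price> is a simple element with a built-in type, whereas <book> carries no type attribute because its content is described by the nested <complexType>.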

There are many more details to setting up a schema. We refer the interested reader to [38, 41]. Here we use an example to give only an idea of what is possible.

Example 2-5 Examining Schema for the Predictive Model Markup Language

The Predictive Model Markup Language (PMML) [5] is a language for representing statistical models in an application-independent way. Several applications support PMML, including ADAPA, CART, Clementine, Enterprise Miner, DB2 Intelligent Miner, R, Teradata Warehouse Miner, and Zementis.

A PMML document must contain three main chunks: a header, a data dictionary, and a model as follows:

• The header provides general information about the model used in the application, such as the copyright and a non-application-specific description of the model.

• The data dictionary defines the variables used in the application of the model. It includes specifications of data types and value ranges. These data dictionaries can be shared across models.

• The model chunk consists of a mining schema and a model specification. The mining schema lists fields that must be provided to use the model; this list can be a proper subset of the fields in the data dictionary. The model also contains information specific to the type of model; that is, the model specification is dependent on the type of model fitted. As an example, the tree model used for classification and prediction contains <Node> elements that hold the logical predicate expressions that define the rules for branching.

Below is the PMML document (with a shortened header) that results from fitting a classification tree to the kyphosis dataset using the rpart() function in R. The dependent variable in the fit is Kyphosis, which is a categorical variable with levels "absent" and "present", and the independent variables are Age, Number, and Start. The R code to fit the model and output the fit as a PMML document is provided here:

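library(rpart)                 # provides rpart() and the kyphosis data
library(pmml); library(XML)    # provide pmml() and saveXML(), respectively
fit = rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
saveXML(pmml(fit), file = "KyphosisRpart.pmml")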

The pmml() function is in the pmml package. The saveXML() function is in the XML package, and is covered in greater detail in Chapter 6. The document, KyphosisRpart.pmml, follows.

<PMML version="3.1"
      xmlns="http://www.dmg.org/PMML-3_1"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Header copyright="Copyright..."
          description="RPart decision tree model">
    ...
  </Header>
  <DataDictionary numberOfFields="4">
    <DataField name="Kyphosis" optype="categorical"
               dataType="string">
      <Value value="absent"/>
      <Value value="present"/>
    </DataField>
    <DataField name="Age" optype="continuous" dataType="double"/>
    <DataField name="Number" optype="continuous" dataType="double"/>
    <DataField name="Start" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TreeModel modelName="RPart_Model"
             functionName="classification" algorithmName="rpart"
             splitCharacteristic="binarySplit">
    <MiningSchema>
      <MiningField name="Kyphosis" usageType="predicted"/>
      <MiningField name="Age" usageType="active"/>
      <MiningField name="Number" usageType="active"/>
      <MiningField name="Start" usageType="active"/>
    </MiningSchema>
    <Node score="absent" recordCount="81">
      <True/>
      <Node score="absent" recordCount="62">
        <SimplePredicate field="Start"
                         operator="greaterOrEqual" value="8.5"/>
        <Node score="absent" recordCount="29">
          <SimplePredicate field="Start"
                           operator="greaterOrEqual" value="14.5"/>
        </Node>
        <Node score="absent" recordCount="33">
          <SimplePredicate field="Start"
                           operator="lessThan" value="14.5"/>
          <Node score="absent" recordCount="12">
            <SimplePredicate field="Age"
                             operator="lessThan" value="55"/>
          </Node>
          <Node score="absent" recordCount="21">
            <SimplePredicate field="Age"
                             operator="greaterOrEqual" value="55"/>
            <Node score="absent" recordCount="14">
              <SimplePredicate field="Age"
                               operator="greaterOrEqual" value="111"/>
            </Node>
            <Node score="present" recordCount="7">
              <SimplePredicate field="Age"
                               operator="lessThan" value="111"/>
            </Node>
          </Node>
        </Node>
      </Node>
      <Node score="present" recordCount="19">
        <SimplePredicate field="Start"
                         operator="lessThan" value="8.5"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
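The nested <Node> elements can be read as a decision procedure: start at the root, move to the first child whose predicate holds, and report the score of the node where the descent stops. The Python sketch below illustrates this on a trimmed fragment of the KyphosisRpart.pmml listing that keeps only the first split on Start (it omits the header, data dictionary, and mining schema, so it is not a complete PMML document). The functions node_matches() and score() are invented for this illustration and are a simplified reading of how such a tree is applied, not the PMML specification's full scoring procedure.

```python
import operator
import xml.etree.ElementTree as ET

# The PMML default namespace, in ElementTree's {uri}name form.
NS = "{http://www.dmg.org/PMML-3_1}"

# Trimmed fragment of the listing: only the first split on Start is kept.
pmml_text = """
<PMML version="3.1" xmlns="http://www.dmg.org/PMML-3_1">
  <TreeModel modelName="RPart_Model" functionName="classification">
    <Node score="absent" recordCount="81">
      <True/>
      <Node score="absent" recordCount="62">
        <SimplePredicate field="Start" operator="greaterOrEqual" value="8.5"/>
      </Node>
      <Node score="present" recordCount="19">
        <SimplePredicate field="Start" operator="lessThan" value="8.5"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""

# Only the two operators that occur in the listing are handled here.
OPERATORS = {"greaterOrEqual": operator.ge, "lessThan": operator.lt}

def node_matches(node, record):
    """Evaluate the predicate stored in a <Node> against one record."""
    if node.find(NS + "True") is not None:
        return True
    pred = node.find(NS + "SimplePredicate")
    compare = OPERATORS[pred.get("operator")]
    return compare(record[pred.get("field")], float(pred.get("value")))

def score(node, record):
    """Descend to the first matching child until no child matches."""
    for child in node.findall(NS + "Node"):
        if node_matches(child, record):
            return score(child, record)
    return node.get("score")   # a leaf, or no child predicate held

root_node = ET.fromstring(pmml_text).find(NS + "TreeModel").find(NS + "Node")
print(score(root_node, {"Age": 60, "Number": 3, "Start": 12}))  # absent
print(score(root_node, {"Age": 60, "Number": 3, "Start": 5}))   # present
```

The two records are made up. Because the tag names live in the PMML namespace, each lookup has to be qualified with the namespace URI.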

The schema for the <TreeModel> element appears below. It gives a sense of how schema are used to provide the rules for a valid PMML document.

<xs:element name="TreeModel">                                  <!-- 1 -->
  <xs:complexType>                                             <!-- 2 -->
    <xs:sequence>                                              <!-- 3 -->
      <xs:element ref="Extension" minOccurs="0"
                  maxOccurs="unbounded"/>                      <!-- 4 -->
      <xs:element ref="MiningSchema"/>                         <!-- 5 -->
      <xs:element ref="Output" minOccurs="0"/>
      <xs:element ref="ModelStats" minOccurs="0"/>
      <xs:element ref="Targets" minOccurs="0"/>
      <xs:element ref="LocalTransformations" minOccurs="0"/>
      <xs:element ref="Node"/>
      <xs:element ref="ModelVerification" minOccurs="0"/>
      <xs:element ref="Extension" minOccurs="0"
                  maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="modelName" type="xs:string"/>          <!-- 6 -->
    <xs:attribute name="functionName"
                  type="MINING-FUNCTION" use="required"/>
    <xs:attribute name="algorithmName" type="xs:string"/>
    <xs:attribute name="splitCharacteristic"
                  default="multiSplit">                        <!-- 7 -->
      <xs:simpleType>
        <xs:restriction base="xs:string">                      <!-- 8 -->
          <xs:enumeration value="binarySplit"/>
          <xs:enumeration value="multiSplit"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>

1 This <element> tag defines the <TreeModel> element and provides rules for its content.

2 The <complexType> child indicates that <TreeModel> has complex content.

3 According to the <sequence> child of <complexType>, the children of <TreeModel> must appear in the specified order: <Extension>, <MiningSchema>, <Output>,...

4 The minOccurs attribute has a value of "0" in the definition of <Extension>. This indicates that <Extension> is optional. Also, since maxOccurs is "unbounded", there may be arbitrarily many <Extension> elements in <TreeModel>.

5 minOccurs is not specified for <MiningSchema>, so the default of 1 is used.

6 The allowable attributes for <TreeModel> are provided via <attribute> elements. Here, the data type for the attribute modelName is any character data.

7 The attribute splitCharacteristic has a default value of "multiSplit".

8 In addition, <restriction> indicates that this attribute (splitCharacteristic) has only two possible values: "multiSplit" and "binarySplit".
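The ordering and occurrence rules in notes 3 to 5 can be illustrated with a small hand-written matcher. The Python sketch below transcribes the <sequence> for <TreeModel> into a list of (name, minOccurs, maxOccurs) entries, using the schema defaults of 1 where an attribute is absent, and checks a list of child names against it. The function matches_sequence() is invented for this illustration. It uses a simple greedy strategy that suffices for this particular sequence but is not a general schema validator.

```python
# (child name, minOccurs, maxOccurs); None stands for "unbounded".
# Both occurrence attributes default to 1 when not given in the schema.
TREE_MODEL_SEQUENCE = [
    ("Extension", 0, None),
    ("MiningSchema", 1, 1),
    ("Output", 0, 1),
    ("ModelStats", 0, 1),
    ("Targets", 0, 1),
    ("LocalTransformations", 0, 1),
    ("Node", 1, 1),
    ("ModelVerification", 0, 1),
    ("Extension", 0, None),
]

def matches_sequence(child_names, particles):
    """Greedy left-to-right match of child names against a sequence."""
    pos = 0
    for name, min_occurs, max_occurs in particles:
        count = 0
        # Consume as many consecutive children of this name as allowed.
        while (pos < len(child_names) and child_names[pos] == name
               and (max_occurs is None or count < max_occurs)):
            pos += 1
            count += 1
        if count < min_occurs:
            return False          # a required child is missing here
    return pos == len(child_names)  # no unexpected children left over

# The children of <TreeModel> in KyphosisRpart.pmml, in document order.
print(matches_sequence(["MiningSchema", "Node"], TREE_MODEL_SEQUENCE))  # True
# Same children in the wrong order.
print(matches_sequence(["Node", "MiningSchema"], TREE_MODEL_SEQUENCE))  # False
# The required <Node> is missing.
print(matches_sequence(["MiningSchema"], TREE_MODEL_SEQUENCE))          # False
```

A schema-aware parser performs this kind of check for every element, along with the attribute rules in notes 6 to 8, using the schema alone.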

The goal of this section was to introduce the basic concepts in XML Schema. The topic is revisited in Chapter 14 in the context of an application (the XMLSchema package [35]).