CSE 3241: XML
Extensible Markup Language
(Ch. 12)
Topics
Structured, Semistructured, and Unstructured Data
XML Hierarchical (Tree) Data Model
XML Documents, DTD
XML Schema
Storing and Extracting XML Documents from Databases
XML Languages
Structured, Semistructured, and Unstructured Data
Structured data
◦ Represented in a strict format
◦ Example: information stored in databases
Semistructured data
◦ Has a certain structure
◦ Not all information collected will have identical structure
Unstructured data
◦ Limited indication of the type of data
◦ Example: a document that contains information embedded within it
What is XML?
XML provides a framework to define a structure for data
◦ An XML document is a collection of related data items
◦ Document is “marked up” with tags known as elements
Elements are used to provide structure to the data
XML Hierarchical (Tree) Data Model
Elements and attributes
◦ Main structuring concepts used to construct an XML document
Complex elements
◦ Constructed from other elements hierarchically
Simple elements
◦ Contain data values
XML tag names
◦ Describe the meaning of the data elements in the document
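A minimal sketch of the tree model, using Python's standard xml.etree.ElementTree module. The sample document and the helper names classify and outline are assumptions made up for illustration. An element with child elements is treated as complex, and one that only holds a data value as simple.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for illustration.
DOC = """
<Project number="1">
  <Name>Product X</Name>
  <Workers>
    <Worker><Ssn>123456789</Ssn><Hours>32.5</Hours></Worker>
  </Workers>
</Project>
"""

def classify(element):
    """Return 'complex' if the element has child elements, else 'simple'."""
    return "complex" if len(element) > 0 else "simple"

def outline(element, depth=0):
    """Return one line per element, indented by its depth in the tree."""
    kind = classify(element)
    line = "  " * depth + f"{element.tag} [{kind}]"
    text = (element.text or "").strip()
    if kind == "simple" and text:
        line += f" = {text}"          # simple elements carry the data values
    lines = [line]
    for child in element:             # complex elements are built from children
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(DOC)
print("\n".join(outline(root)))
```

The indentation of the printed outline mirrors the nesting of the elements, which is the hierarchy the slide describes.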
XML Hierarchical (Tree) Data Model (cont’d.)
XML attributes
◦ Describe properties and characteristics of the elements (tags) within which they appear
◦ May reference another element in another part of the XML document
Common to use attribute values in one element as the references
The XML Data Model
Attributes vs. Elements
◦ Data can be stored in an XML document either as the contents of an element OR as an attribute of an element
◦ Why pick one over the other?
◦ Best practice:
Use attributes for information that describes/modifies the element
Use element contents to hold the actual data values
Much like in HTML:
Element (tag) contents are the data to be displayed
Attributes (generally) modify/describe how it is to be displayed
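A small sketch of the best practice above, again with xml.etree.ElementTree. The Price element and its currency attribute are hypothetical. The element content holds the data value, while the attribute describes it.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the text is the data, the attribute describes it.
price = ET.fromstring('<Price currency="USD">19.99</Price>')

value = float(price.text)          # element content: the actual data value
currency = price.get("currency")   # attribute: describes/modifies the value
missing = price.get("discount")    # an absent attribute comes back as None

print(value, currency, missing)
```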
XML Document Types – Data Centric XML
Data-centric XML
◦ Highly structured
◦ Many small data items
◦ Often used for data exchange purposes
Transfer data from one system to another
◦ Also used to create web pages dynamically from databases
◦ Generally follow a schema document that determines their structure
XML Document Types – Document-Centric XML
Few structural elements
Large amounts of text
◦ Articles, blog entries, books
May have a schema document, but not required
◦ Schema may be very limited in semantics
What’s a title?
What’s a chapter?
What’s a paragraph?
More XML Document Types
Hybrid XML
◦ Some parts are highly structured
◦ Some parts mostly blocks of text and/or unstructured
◦ May or may not have a predefined schema
Schemaless XML documents
◦ Semi-structured documents without a predefined schema
◦ Denoted by the attribute standalone="yes" in the XML declaration on the top line
Valid XML
An XML document is considered valid if:
◦ It is well-formed* , and …
◦ It follows a particular schema in a standard definition language
A DTD document (Document Type Definition)
DTDs are the original, older technology
An XML schema document
XML schema documents are the “new” hotness
First published in 2001
*Well-formed XML
An XML document is well-formed when it follows certain conditions:
◦ It should start with an XML declaration line (optional in XML 1.0, but recommended):
<?xml version="1.0" standalone="yes"?>
◦ It must form a tree:
Must start with a single root element
Every child element must have start and end tags that are contained completely within a parent element:
Good                 Bad
<parent>             <parent>
  <child>              <child>
  </child>           </parent>
</parent>            </child>
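One way to test the tree conditions is to hand the text to a parser and see whether it is accepted. The sketch below uses xml.etree.ElementTree; the function name is_well_formed and the sample strings are assumptions for illustration. It checks well-formedness only, not validity against a schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<parent><child></child></parent>"
bad = "<parent><child></parent></child>"   # child is not closed inside parent
two_roots = "<a></a><b></b>"               # no single root element

print(is_well_formed(good), is_well_formed(bad), is_well_formed(two_roots))
```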
DTD – Document Type Definition
Original method of specifying a schema definition
◦ Still in widespread use
A very simple schema definition language
◦ Each possible element in the document is defined
What children must it have?
What children can it (optionally) have?
What kinds of attributes can/must it have?
If it is a leaf element, what kinds of values can it have?
XML Documents, DTD, and XML Schema (cont’d.)
Notation for specifying elements
XML DTD
◦ Special syntax
Requires specialized processors
◦ All DTD elements always forced to follow the specified ordering of the document
Unordered elements not permitted
A sample XML document and DTD
<?xml version="1.0" standalone="no"?>
<!DOCTYPE Projects SYSTEM "proj.dtd">
<Projects>
  <Project number="1">
    <Name>Product X</Name>
    <Location>Bellaire</Location>
    <Dept_no>5</Dept_no>
    <Workers>
      <Worker>
        <Ssn>123456789</Ssn>
        <Last_name>Smith</Last_name>
        <Hours>32.5</Hours>
      </Worker>
      <Worker>
        <Ssn>453453453</Ssn>
        <Hours>15.5</Hours>
      </Worker>
    </Workers>
  </Project>
  …
</Projects>
<!ELEMENT Projects (Project+)>
<!ELEMENT Project (Name, Location, Dept_no?, Workers)>
<!-- CDATA rather than ID: an ID value must be an XML name, which cannot start with a digit such as "1" -->
<!ATTLIST Project number CDATA #REQUIRED>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Location (#PCDATA)>
<!ELEMENT Dept_no (#PCDATA)>
<!ELEMENT Workers (Worker*)>
<!ELEMENT Worker (Ssn, Last_name?, First_name?, Hours)>
<!ELEMENT Ssn (#PCDATA)>
<!ELEMENT Last_name (#PCDATA)>
<!ELEMENT First_name (#PCDATA)>
<!ELEMENT Hours (#PCDATA)>
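A DTD content model is essentially a regular expression over child element names. The sketch below hand-translates the Worker model (Ssn, Last_name?, First_name?, Hours) into a Python regular expression. It is an illustration of the idea only, not a DTD validator: xml.etree.ElementTree does not validate against DTDs, and the names WORKER_MODEL and matches_worker_model are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# <!ELEMENT Worker (Ssn, Last_name?, First_name?, Hours)> rewritten as a
# regular expression over space-terminated child tag names.
WORKER_MODEL = re.compile(r"(Ssn )(Last_name )?(First_name )?(Hours )")

def matches_worker_model(worker):
    """Check the order and presence of a Worker element's children."""
    child_tags = "".join(child.tag + " " for child in worker)
    return WORKER_MODEL.fullmatch(child_tags) is not None

ok = ET.fromstring("<Worker><Ssn>453453453</Ssn><Hours>15.5</Hours></Worker>")
wrong_order = ET.fromstring("<Worker><Hours>15.5</Hours><Ssn>453453453</Ssn></Worker>")
no_hours = ET.fromstring("<Worker><Ssn>123456789</Ssn></Worker>")

print(matches_worker_model(ok),
      matches_worker_model(wrong_order),
      matches_worker_model(no_hours))
```

The second case fails because a DTD sequence fixes the order of the children, which is the ordering restriction noted earlier.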
XML Document Types - Schemas
Semi-structured data
Defines
◦ names for expected XML tags/elements
◦ required/optional elements
◦ Nested elements
Defines the entire structure of the documents
What is the root element?
What are the possible children of the root element?
What attributes must/can an element have?
Communicate semantic information about an XML document
◦ What does each of these elements mean?
XML Schema
XML schema definition for the COMPANY database
◦ XML Schema definition includes:
Strong types (integer, floats, strings, dates, etc.)
Minimum and maximum occurrences of elements
Keys! References from one element to another with the xsd:keyref tag
Breaks the strict hierarchy of XML
… and many other features
XML Schema
XML schema language
◦ Standard for specifying the structure of XML documents
◦ Uses same syntax rules as regular XML documents
Same processors can be used on both
XML Schema documents
Provide a way to specify an XML schema using XML as a language
◦ Overcomes the limitations of DTD
◦ However, XML schemas are much more complex than DTDs
XML Schema is a hierarchical, tree model
◦ Makes sense – it’s XML!
Adds concepts from database models
◦ Keys, references, identifiers
XML document and XML Schema
<?xml version="1.0" encoding="UTF-8"?>
<pwk:Projects
    xmlns:pwk="http://www.example.org/ProjectWorkers"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.example.org/ProjectWorkers ../Schemas/ProjectWorkers.xsd">
  <pwk:Project pwk:number="1">
    <pwk:Name>Product X</pwk:Name>
    <pwk:Location>Bellaire</pwk:Location>
    <pwk:Dept_no>1</pwk:Dept_no>
    <pwk:Workers>
      <pwk:Worker>
        <pwk:Ssn>123456789</pwk:Ssn>
        <pwk:Last_name>Smith</pwk:Last_name>
        <pwk:Hours>32.5</pwk:Hours>
      </pwk:Worker>
      <pwk:Worker>
        <pwk:Ssn>453453453</pwk:Ssn>
        <pwk:Hours>15.5</pwk:Hours>
      </pwk:Worker>
    </pwk:Workers>
  </pwk:Project>
  …
</pwk:Projects>
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:pwk="http://www.example.org/ProjectWorkers"
    elementFormDefault="qualified">
  <xsd:element name="Projects">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Project" minOccurs="1" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="Name" type="xsd:string"/>
              <xsd:element name="Location" type="xsd:string"/>
              <xsd:element name="Dept_no" type="xsd:positiveInteger" minOccurs="0" maxOccurs="1"/>
              <xsd:element name="Workers" minOccurs="0" maxOccurs="unbounded">
                <xsd:complexType>
                  …
                </xsd:complexType>
              </xsd:element>
            </xsd:sequence>
            <xsd:attribute name="number" type="xsd:positiveInteger"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
Declare the xsd:schema element
◦ This says we’re creating an XML Schema
◦ Our namespace prefix will be “pwk”
Elements are declared with the “element” tag
Complex elements use the complexType tag to hold their component parts
The child elements will form a sequence
There will be at least one Project element and no maximum number of Project elements
Project is another complexType holding a sequence of elements
Name, Location, Dept_no are all simple elements
Workers is another complex element
Project has an attribute “number” with a positiveInteger type
◦ XSD attributes are optional by default; add use="required" to make the attribute mandatory
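The constraints in this walkthrough can be mirrored by hand to see what a schema processor would enforce. The sketch below is not an XSD validator (Python's standard library does not include one); it only re-implements a few of the rules with xml.etree.ElementTree. The sample document, the helper names q, is_positive_integer and check_projects, and the wording of the messages are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = "http://www.example.org/ProjectWorkers"   # namespace URI used in the slides

# Hypothetical instance document, shortened for the example.
DOC = f"""
<pwk:Projects xmlns:pwk="{NS}">
  <pwk:Project pwk:number="1">
    <pwk:Name>Product X</pwk:Name>
    <pwk:Location>Bellaire</pwk:Location>
    <pwk:Dept_no>1</pwk:Dept_no>
  </pwk:Project>
</pwk:Projects>
"""

def q(name):
    """Build the namespace-qualified name that ElementTree uses internally."""
    return f"{{{NS}}}{name}"

def is_positive_integer(text):
    """Rough stand-in for xsd:positiveInteger: decimal digits, value >= 1."""
    return text is not None and text.strip().isdecimal() and int(text) >= 1

def check_projects(root):
    """Return a list of violations of the hand-copied constraints."""
    problems = []
    projects = root.findall(q("Project"))
    if len(projects) < 1:                          # minOccurs="1"
        problems.append("at least one Project is required")
    for project in projects:
        for required in ("Name", "Location"):      # minOccurs defaults to 1
            if project.find(q(required)) is None:
                problems.append(f"Project is missing {required}")
        dept = project.find(q("Dept_no"))          # minOccurs="0": optional
        if dept is not None and not is_positive_integer(dept.text):
            problems.append("Dept_no must be a positive integer")
        number = project.get(q("number"))          # attribute, optional by default
        if number is not None and not is_positive_integer(number):
            problems.append("number must be a positive integer")
    return problems

print(check_projects(ET.fromstring(DOC)))
```

In practice a schema-aware processor would do this work from the .xsd file itself; the point here is only to make minOccurs, maxOccurs and the typed values concrete.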
Storing and Extracting XML Documents from Databases
Most common approaches
◦ Using a DBMS to store the documents as text
Can be used if DBMS has a special module for document processing
◦ Using a DBMS to store document contents as data elements
Require mapping algorithms to design a database schema that is compatible with XML document structure
Storing and Extracting XML Documents from Databases (cont’d.)
◦ Designing a specialized system for storing native XML data
Called Native XML DBMSs
◦ Creating or publishing customized XML documents from preexisting relational databases
Use a separate middleware software layer to handle conversions
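As an illustration of storing document contents as data elements, the sketch below maps a small XML document onto two relational tables using Python's sqlite3 and xml.etree.ElementTree modules. The table names project and works_on, their columns, and the sample document are assumptions chosen for the example, not a prescribed mapping algorithm.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document to be stored.
DOC = """
<Projects>
  <Project number="1">
    <Name>Product X</Name>
    <Workers>
      <Worker><Ssn>123456789</Ssn><Hours>32.5</Hours></Worker>
      <Worker><Ssn>453453453</Ssn><Hours>15.5</Hours></Worker>
    </Workers>
  </Project>
</Projects>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE project (number INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE works_on (ssn TEXT, project_number INTEGER, hours REAL)")

# Mapping step: each Project becomes a row, and each nested Worker becomes
# a row in a second table that points back to its parent project.
for project in ET.fromstring(DOC).findall("Project"):
    number = int(project.get("number"))
    conn.execute("INSERT INTO project VALUES (?, ?)",
                 (number, project.findtext("Name")))
    for worker in project.findall("Workers/Worker"):
        conn.execute("INSERT INTO works_on VALUES (?, ?, ?)",
                     (worker.findtext("Ssn"), number,
                      float(worker.findtext("Hours"))))

total_hours = conn.execute(
    "SELECT SUM(hours) FROM works_on WHERE project_number = 1"
).fetchone()[0]
print(total_hours)
```

Once the data is in tables, ordinary SQL applies; the nesting of the document has been replaced by a foreign-key style reference.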
XML Languages
Several different query languages proposed for XML
◦ Languages used to select (and manipulate) the nodes in an XML document
Common tools for querying data in XML documents
◦ XPath
◦ XQuery
XML Languages
Two query language standards
◦ XPath
Specify path expressions to identify certain nodes (elements) or attributes within an XML document that match specific patterns
◦ XQuery
Uses XPath expressions but has additional constructs
XPath
XPath is a simple query language used to select parts of an XML document
◦ Other more complex languages (XQuery, XSLT) use the XPath syntax
XPath queries are written as paths through an XML document
◦ Nodes are separated by ‘/’ characters
◦ The result returned is whatever is at the end of the XPath expression
XPath: Specifying Path Expressions in XML
XPath expression
◦ Returns a sequence of items that satisfy a certain pattern as specified by the expression
◦ Either values (from leaf nodes) or elements or attributes
◦ Qualifier conditions
Further restrict nodes that satisfy pattern
Separators used when specifying a path:
◦ Single slash (/) and double slash (//)
XPath:
/company
◦ Returns the company root node and all its descendant nodes, i.e., the whole XML document
◦ Customary to include the file name, e.g.
doc("www.company.com/info.XML")/company
XPath: Specifying Path Expressions in XML (cont’d.)
Main restriction of XPath path expressions
◦ Path that specifies the pattern also specifies the items to be retrieved
◦ Difficult to specify certain conditions on the pattern while separately specifying which result items should be retrieved
XPath: Specifying Path Expressions in XML (cont’d.)
/ - child of the current node
// - descendant or self at any level of the current node
@ - attribute of the current node
* - Wildcard symbol
◦ Stands for any element
◦ Example: /company/*
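The separators above can be tried out with xml.etree.ElementTree, which supports a limited subset of XPath. Paths there are written relative to an element, so the root company element plays the role of /company. The sample document and its element names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical company document; element names are assumptions.
company = ET.fromstring("""
<company>
  <department id="d5">
    <employee><employeeName>Smith</employeeName></employee>
  </department>
  <department id="d4">
    <employee><employeeName>Wong</employeeName></employee>
  </department>
  <project/>
</company>
""")

# /company/*  -> every child element of the root, whatever its tag
children = [element.tag for element in company.findall("*")]

# /company/department  -> only the department children (single slash)
departments = company.findall("department")

# //employee  -> employee elements at any depth (double slash)
employees = company.findall(".//employee")

# @id  -> ElementTree selects the element by attribute, then reads the value
d5 = company.find("department[@id='d5']")

print(children, len(departments), len(employees), d5.get("id"))
```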
XPath (cont’d.)
/company/department
◦ Returns all department nodes and their descendant subtrees.
◦ Nodes in an XML document are ordered, so the XPath result will be in the same order as the document tree.
//employee[employeeSalary gt 70000]/employeeName
◦ [ ] encloses a qualifier condition that restricts the selected nodes
◦ // is used if we don’t know the full path, but do know the name of some tags
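The salary query can be emulated in Python. ElementTree's XPath subset has no numeric comparison such as gt, so in this sketch the qualifier condition is written as an ordinary Python test. The sample data and the function name names_with_salary_above are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data for the salary query.
company = ET.fromstring("""
<company>
  <department>
    <employee>
      <employeeName>Smith</employeeName>
      <employeeSalary>65000</employeeSalary>
    </employee>
    <employee>
      <employeeName>Wong</employeeName>
      <employeeSalary>80000</employeeSalary>
    </employee>
  </department>
</company>
""")

def names_with_salary_above(root, threshold):
    """Emulate //employee[employeeSalary gt threshold]/employeeName."""
    return [
        employee.findtext("employeeName")                 # final path step
        for employee in root.iter("employee")             # "//employee"
        if float(employee.findtext("employeeSalary")) > threshold  # qualifier
    ]

print(names_with_salary_above(company, 70000))
```

Because iter walks the tree in document order, the results come back in the same order as the document, as noted above.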
XPath: Specifying Path Expressions in XML (cont’d.)
Axes
◦ Move in multiple directions from current node in path expression
◦ Include self, child, descendant, attribute, parent, ancestor, previous sibling, and next sibling
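A rough sketch of moving along the parent, ancestor and sibling axes in Python. ElementTree elements do not keep a pointer to their parent, so the example builds a child-to-parent lookup first. The document and the helper names parent_of, ancestors and siblings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with three sibling elements.
company = ET.fromstring(
    "<company><department><employee/><manager/><project/></department></company>"
)

# Build a lookup table once: child element -> parent element.
parent_of = {child: parent for parent in company.iter() for child in parent}

def ancestors(element):
    """Ancestor axis: follow parent links up to the root."""
    chain = []
    while element in parent_of:
        element = parent_of[element]
        chain.append(element.tag)
    return chain

def siblings(element):
    """Return (previous, next) sibling tags, or None at either end."""
    family = list(parent_of[element])
    position = family.index(element)
    previous = family[position - 1].tag if position > 0 else None
    following = family[position + 1].tag if position + 1 < len(family) else None
    return previous, following

manager = company.find(".//manager")
print(ancestors(manager), siblings(manager))
```

A full XPath processor exposes these directions directly as axes; the lookup table is only a workaround for the standard-library subset.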
XQuery
XQuery is an extension of XPath
◦ Uses the same data model as XPath
◦ Provides a language for more complex and general queries
◦ Similar to SQL’s relationship to relational databases
XQuery uses FLWR expressions
◦ Acronym for “For, Let, Where, Return”
XQuery Features
Support for joins
for $x in <element1>, $y in <element2>
where $x/element = $y/element
Support for aggregate functions
◦ Sum, count, min, max, avg
Support for conditional branching
◦ If – then branching
And a lot more
XQuery: Specifying Queries in XML
XQuery FLWR expression
◦ Four main clauses of XQuery
◦ Form:
FOR <variable bindings to individual nodes (elements)>
LET <variable bindings to collections of nodes (elements)>
WHERE <qualifier conditions>
RETURN <query result specification>
◦ Zero or more instances of FOR and LET clauses
XQuery Example
let $d := doc("www.company.com/info.xml")
for $x in $d/company/project[projectNumber = 5]/projectWorker,
    $y in $d/company/employee
where $x/hours gt 20.0 and $y/ssn = $x/ssn
return <res> { $y/employeeName/firstName, $y/employeeName/lastName, $x/hours } </res>
Variables have a $ prefix
LET assigns a variable for the rest of the query
FOR assigns a variable to range over items in a sequence
WHERE specifies any conditions
RETURN specifies the elements to be retrieved
◦ XQuery keywords are case-sensitive and written in lowercase; the uppercase clause names here are for emphasis
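To make the clauses concrete, the same query can be written as nested loops in Python over a small stand-in document. This is a sketch of what the query computes, not of how an XQuery processor evaluates it; the sample data, including the employee names, is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the document behind doc(...).
d = ET.fromstring("""
<company>
  <employee>
    <ssn>123456789</ssn>
    <employeeName><firstName>John</firstName><lastName>Smith</lastName></employeeName>
  </employee>
  <employee>
    <ssn>453453453</ssn>
    <employeeName><firstName>Joyce</firstName><lastName>English</lastName></employeeName>
  </employee>
  <project>
    <projectNumber>5</projectNumber>
    <projectWorker><ssn>123456789</ssn><hours>32.5</hours></projectWorker>
    <projectWorker><ssn>453453453</ssn><hours>15.5</hours></projectWorker>
  </project>
</company>
""")

results = []
# FOR: x ranges over the workers of project 5, y over all employees
for x in d.findall("project[projectNumber='5']/projectWorker"):
    for y in d.findall("employee"):
        # WHERE: more than 20 hours, plus the join condition on ssn
        if float(x.findtext("hours")) > 20.0 and y.findtext("ssn") == x.findtext("ssn"):
            # RETURN: one result per qualifying (x, y) pair
            results.append((
                y.findtext("employeeName/firstName"),
                y.findtext("employeeName/lastName"),
                x.findtext("hours"),
            ))

print(results)
```

The two nested loops correspond to the two variable bindings in the FOR clause, and the ssn comparison is the join between workers and employees.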
Summary
Three main types of data:
structured, semi-structured, and unstructured
XML standard
◦ Tree-structured (hierarchical) data model
◦ XML documents and the languages for specifying the structure of these documents