CSE 3241: XML
Extensible Markup Language (Ch. 12)
Topics
⚫
Structured, Semistructured, and Unstructured Data
⚫
XML Hierarchical (Tree) Data Model
⚫
XML Documents, DTD
⚫
XML Schema
⚫
Storing and Extracting XML Documents from Databases
⚫
XML Languages
Structured, Semistructured, and Unstructured Data
⚫
Structured data
◦ Represented in a strict format
◦ Example: information stored in databases
⚫
Semistructured data
◦ Has a certain structure
◦ Not all information collected will have identical structure
⚫
Unstructured data
◦ Limited indication of the type of data document that contains information embedded within it
What is XML?
⚫
XML provides a framework to define a structure for data
◦ An XML document is a collection of related data items
◦ Document is “marked up” with tags known as elements
● Elements are used to provide structure to the data
XML Hierarchical (Tree) Data Model
⚫
Elements and attributes
◦ Main structuring concepts used to construct an XML document
⚫
Complex elements
◦ Constructed from other elements hierarchically
⚫
Simple elements
◦ Contain data values
⚫
XML tag names
◦ Describe the meaning of the data elements in the document
XML Hierarchical (Tree) Data Model (cont’d.)
⚫
XML attributes
◦ Describe properties and characteristics of the elements (tags) within which they appear
◦ May reference another element in another part of the XML document
● Common to use attribute values in one element as the references
The XML Data Model
⚫
Attributes vs. Elements
◦ Data can be stored in an XML document either as the contents of an element OR as an attribute of an element
◦ Why pick one over the other?
◦ Best practice:
● Use attributes for information that describes/modifies the element
● Use element contents to hold the actual data values
● Much like in HTML:
● Element (tag) contents are the data to be displayed
● Attributes (generally) modify/describe how it is to be displayed
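A minimal sketch of the element-versus-attribute distinction, using Python's standard xml.etree.ElementTree module. The price element and its currency attribute are made-up names for illustration, not part of the course examples.

```python
import xml.etree.ElementTree as ET

# Hypothetical element: the content is the data value,
# the attribute describes how to interpret it.
item = ET.fromstring('<price currency="USD">19.99</price>')

value = float(item.text)          # element content holds the actual data value
currency = item.get("currency")   # attribute describes/modifies the element

print(value, currency)
```

Moving the currency into a child element would also work; the attribute form simply keeps the descriptive detail out of the data value itself.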
XML Document Types – Data Centric XML
⚫
Data-centric XML
◦ Highly structured
◦ Many small data items
◦ Often used for data exchange purposes
● Transfer data from one system to another
◦ Also used to create web pages dynamically from databases
◦ Generally follow a schema document that determines their structure
XML Document Types – Document-Centric XML
⚫
Few structural elements
⚫
Large amounts of text
◦ Articles, blog entries, books
⚫
May have a schema document, but not required
◦ Schema may be very limited in semantics
● What’s a title?
● What’s a chapter?
● What’s a paragraph?
More XML Document Types
⚫
Hybrid XML
◦ Some parts are highly structured
◦ Some parts mostly blocks of text and/or unstructured
◦ May or may not have a predefined schema
⚫ Schemaless XML documents
◦ Semi-structured documents without a predefined schema
◦ Often indicated by standalone="yes" in the XML declaration on the top line (the document does not depend on an external DTD)
Valid XML
⚫
An XML document is considered valid if:
◦ It is well-formed*, and …
◦ It follows a particular schema in a standard definition language
● A DTD document (Document Type Definition)
● DTDs are the original, older technology
● An XML schema document
● XML schema documents are the “new” hotness
● First published in 2001
*Well-formed XML
⚫
An XML document is well-formed when it follows certain conditions:
◦ It should start with an XML declaration line (optional in XML 1.0, but recommended):
<?xml version="1.0" standalone="yes"?>
◦ It must form a tree:
● Must start with a single root element
● Every child element must have start and end tags that are contained completely within a parent element:
Good              Bad
<parent>          <parent>
  <child>           <child>
  </child>        </parent>
</parent>         </child>
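A small sketch of checking well-formedness with Python's standard xml.etree.ElementTree parser, which rejects any text that does not form a proper tree. The helper name is_well_formed is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<parent><child></child></parent>"
bad = "<parent><child></parent></child>"   # tags overlap instead of nesting

print(is_well_formed(good))
print(is_well_formed(bad))
```

This only tests well-formedness. Checking validity against a DTD or XML Schema needs a validating parser, which the standard library module does not provide.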
DTD – Document Type Definition
⚫
Original method of specifying a schema definition
◦ Still in widespread use
⚫
A very simple schema definition language
◦ Each possible element in the document is defined
● What children must it have?
● What children can it (optionally) have?
● What kinds of attributes can/must it have?
● If it is a leaf element, what kinds of values can it have?
XML Documents, DTD, and XML Schema (cont’d.)
⚫
Notation for specifying elements
⚫
XML DTD
◦ Special syntax
● Requires specialized processors
◦ Child elements are always forced to follow the ordering specified in the DTD
● Unordered elements not permitted
A sample XML document and DTD
<?xml version="1.0" standalone="no"?>
<!DOCTYPE Projects SYSTEM "proj.dtd">
<Projects>
  <Project number="1">
    <Name>Product X</Name>
    <Location>Bellaire</Location>
    <Dept_no>5</Dept_no>
    <Workers>
      <Worker>
        <Ssn>123456789</Ssn>
        <Last_name>Smith</Last_name>
        <Hours>32.5</Hours>
      </Worker>
      <Worker>
        <Ssn>453453453</Ssn>
        <Hours>15.5</Hours>
      </Worker>
    </Workers>
  </Project>
  …
</Projects>
<!ELEMENT Projects (Project+)>
<!ELEMENT Project (Name, Location, Dept_no?, Workers)>
<!ATTLIST Project number ID #REQUIRED>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Location (#PCDATA)>
<!ELEMENT Dept_no (#PCDATA)>
<!ELEMENT Workers (Worker*)>
<!ELEMENT Worker (Ssn, Last_name?, First_name?, Hours)>
<!ELEMENT Ssn (#PCDATA)>
<!ELEMENT Last_name (#PCDATA)>
<!ELEMENT First_name (#PCDATA)>
<!ELEMENT Hours (#PCDATA)>
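The sample document can be read with Python's standard xml.etree.ElementTree, as sketched below. That module parses but does not validate against a DTD, so worker_problems is a hand-written stand-in (an assumption of this example) for the Worker content model. It checks only that the required children are present, not their order.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document; the DOCTYPE line is omitted
# because ElementTree does not validate against a DTD anyway.
SAMPLE = """
<Projects>
  <Project number="1">
    <Name>Product X</Name>
    <Location>Bellaire</Location>
    <Dept_no>5</Dept_no>
    <Workers>
      <Worker><Ssn>123456789</Ssn><Last_name>Smith</Last_name><Hours>32.5</Hours></Worker>
      <Worker><Ssn>453453453</Ssn><Hours>15.5</Hours></Worker>
    </Workers>
  </Project>
</Projects>
"""

root = ET.fromstring(SAMPLE)

def worker_problems(worker):
    """List required children (Ssn, Hours) that are missing from a Worker."""
    return [tag for tag in ("Ssn", "Hours") if worker.find(tag) is None]

for project in root.findall("Project"):
    print(project.get("number"), project.findtext("Name"))
    for worker in project.findall("Workers/Worker"):
        print("  ", worker.findtext("Ssn"), worker.findtext("Hours"), worker_problems(worker))
```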
XML Document Types - Schemas
⚫
Semi-structured data
⚫
Defines
◦ names for expected XML tags/elements
◦ required/optional elements
◦ Nested elements
● Defines the entire structure of the documents
● What is the root element?
● What are the possible children of the root element?
● What attributes must/can an element have?
⚫
Communicate semantic information about an XML document
◦ What does each of these elements mean?
XML Schema
⚫
XML schema definition for the COMPANY database
◦ XML Schema definition includes:
● Strong types (integer, floats, strings, dates, etc.)
● Minimum and maximum occurrences of elements
● Keys! References from one element to another with the xsd:keyref tag
● Breaks the strict hierarchy of XML
● … and many other features
XML Schema
⚫
XML schema language
◦ Standard for specifying the structure of XML documents
◦ Uses same syntax rules as regular XML documents
● Same processors can be used on both
XML Schema documents
⚫
Provide a way to specify an XML schema using XML as a language
◦ Overcomes the limitations of DTD
◦ However, XML schemas are much more complex than DTDs
⚫
XML Schema uses a hierarchical, tree model
◦ Makes sense – it’s XML!
⚫
Adds concepts from database models
◦ Keys, references, identifiers
XML document and XML Schema
<?xml version="1.0" encoding="UTF-8"?>
<pwk:Projects
    xmlns:pwk="http://www.example.org/ProjectWorkers"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.example.org/ProjectWorkers ../Schemas/ProjectWorkers.xsd">
  <pwk:Project number="1">
    <pwk:Name>Product X</pwk:Name>
    <pwk:Location>Bellaire</pwk:Location>
    <pwk:Dept_no>1</pwk:Dept_no>
    <pwk:Workers>
      <pwk:Worker>
        <pwk:Ssn>123456789</pwk:Ssn>
        <pwk:Last_name>Smith</pwk:Last_name>
        <pwk:Hours>32.5</pwk:Hours>
      </pwk:Worker>
      <pwk:Worker>
        <pwk:Ssn>453453453</pwk:Ssn>
        <pwk:Hours>15.5</pwk:Hours>
      </pwk:Worker>
    </pwk:Workers>
  </pwk:Project>
  …
</pwk:Projects>
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:pwk="http://www.example.org/ProjectWorkers"
            targetNamespace="http://www.example.org/ProjectWorkers"
            elementFormDefault="qualified">
  <xsd:element name="Projects">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Project" minOccurs="1" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="Name" type="xsd:string"/>
              <xsd:element name="Location" type="xsd:string"/>
              <xsd:element name="Dept_no" type="xsd:positiveInteger" minOccurs="0" maxOccurs="1"/>
              <xsd:element name="Workers" minOccurs="0" maxOccurs="unbounded">
                <xsd:complexType>
                  …
                </xsd:complexType>
              </xsd:element>
            </xsd:sequence>
            <xsd:attribute name="number" type="xsd:positiveInteger"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
XML Schema
Declare the xsd:schema element
This says we’re creating an XML Schema
Our namespace prefix will be “pwk”
Elements are declared with the “element” tag
Complex elements use the complexType tag to hold their component parts
The child elements will form a sequence
There will be at least one Project element and no maximum number of Project elements
Project is another complexType holding a sequence of elements
Name, Location, and Dept_no are all simple elements
Workers is another complex element
Project has an attribute “number” with a positiveInteger type
As declared the attribute is optional; adding use="required" would make it mandatory
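A brief sketch of reading the namespace-qualified Projects document with Python's standard xml.etree.ElementTree. The trimmed document and the NS prefix map are assumptions made for this example. The module does not validate against the schema, so the positiveInteger rule is only imitated by converting Dept_no to an int.

```python
import xml.etree.ElementTree as ET

# Prefix map used only on the Python side; the URI must match the document's.
NS = {"pwk": "http://www.example.org/ProjectWorkers"}

DOC = """<pwk:Projects xmlns:pwk="http://www.example.org/ProjectWorkers">
  <pwk:Project number="1">
    <pwk:Name>Product X</pwk:Name>
    <pwk:Dept_no>1</pwk:Dept_no>
  </pwk:Project>
</pwk:Projects>"""

root = ET.fromstring(DOC)

# ElementTree expands prefixes: tags are stored as {namespace-URI}local-name.
print(root.tag)

# The prefix map lets path expressions use the short pwk: form.
names = [p.findtext("pwk:Name", namespaces=NS) for p in root.findall("pwk:Project", NS)]

# Imitates the xsd:positiveInteger type by hand; a real check needs a schema validator.
dept_no = int(root.findtext("pwk:Project/pwk:Dept_no", namespaces=NS))

print(names, dept_no)
```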
Storing and Extracting XML Documents from Databases
⚫
Most common approaches
◦ Using a DBMS to store the documents as text
● Can be used if DBMS has a special module for document processing
◦ Using a DBMS to store document contents as data elements
● Requires mapping algorithms to design a database schema that is compatible with XML document structure
Storing and Extracting XML Documents from Databases (cont’d.)
◦ Designing a specialized system for storing native XML data
● Called Native XML DBMSs
◦ Creating or publishing customized XML documents from preexisting relational databases
● Use a separate middleware software layer to handle conversions
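A minimal sketch of the "store document contents as data elements" approach, using Python's standard sqlite3 and xml.etree.ElementTree modules. The table names project and works_on, and the mapping itself, are assumptions chosen for illustration. A real mapping algorithm would derive the relational schema from the DTD or XML Schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = """<Projects>
  <Project number="1">
    <Name>Product X</Name>
    <Workers>
      <Worker><Ssn>123456789</Ssn><Hours>32.5</Hours></Worker>
      <Worker><Ssn>453453453</Ssn><Hours>15.5</Hours></Worker>
    </Workers>
  </Project>
</Projects>"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE project (number INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE works_on (ssn TEXT, project_number INTEGER, hours REAL)")

# Each Project element becomes a row; each nested Worker becomes a row
# that refers back to its project, flattening the hierarchy.
for project in ET.fromstring(DOC).findall("Project"):
    number = int(project.get("number"))
    conn.execute("INSERT INTO project VALUES (?, ?)", (number, project.findtext("Name")))
    for worker in project.findall("Workers/Worker"):
        conn.execute(
            "INSERT INTO works_on VALUES (?, ?, ?)",
            (worker.findtext("Ssn"), number, float(worker.findtext("Hours"))),
        )

total_hours = conn.execute(
    "SELECT SUM(hours) FROM works_on WHERE project_number = 1"
).fetchone()[0]
print(total_hours)
```

Once the data is in tables, ordinary SQL applies. Rebuilding the original document from the rows is the reverse mapping that the middleware layer would handle.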
XML Languages
⚫
Several different query languages proposed for XML
◦ Languages used to select (and manipulate) the nodes in an XML document
⚫
Common tools for querying data in XML documents
◦ XPath
◦ XQuery
XML Languages
⚫
Two query language standards
◦ XPath
● Specify path expressions to identify certain nodes (elements) or attributes within an XML document that match specific patterns
◦ XQuery
● Uses XPath expressions but has additional constructs
XPath
⚫
XPath is a simple query language used to select parts of an XML document
◦ Other more complex languages (XQuery, XSLT) use the XPath syntax
⚫
XPath queries are written as paths through an XML document
◦ Nodes are separated by ‘/’ characters
◦ The result returned is whatever is at the end of the XPath expression
XPath: Specifying Path Expressions in XML
⚫
XPath expression
◦ Returns a sequence of items that satisfy a certain pattern as specified by the expression
◦ Either values (from leaf nodes) or elements or attributes
◦ Qualifier conditions
● Further restrict nodes that satisfy pattern
⚫
Separators used when specifying a path:
◦ Single slash (/) and double slash (//)
XPath: Example
⚫
/company
◦ Returns the company root node and all its descendant nodes
● That is, the whole XML document
◦ Customary to include the file name
● E.g. doc("www.company.com/info.XML")/company
XPath: Specifying Path Expressions in XML (cont’d.)
⚫
Main restriction of XPath path expressions
◦ Path that specifies the pattern also specifies the items to be retrieved
◦ Difficult to specify certain conditions on the pattern while separately specifying which result items should be retrieved
XPath: Specifying Path Expressions in XML (cont’d.)
⚫
/ - child of the current node
⚫
// - descendant or self at any level of the current node
⚫
@ - attribute of the current node
⚫
* - Wildcard symbol
◦ Stands for any element
◦ Example: /company/*
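The separators and wildcard can be tried out with Python's standard xml.etree.ElementTree, which supports a limited subset of XPath. The miniature company document below is invented for illustration. ElementTree paths are relative to the element they are called on, and it has no attribute step, so the @ case reads the attribute from each selected element instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a company document.
company = ET.fromstring("""
<company>
  <department id="5"><departmentName>Research</departmentName></department>
  <department id="4"><departmentName>Administration</departmentName></department>
  <employee><employeeName>Smith</employeeName></employee>
</company>
""")

# /company/*  ->  every child element of the root
children = [e.tag for e in company.findall("*")]

# //departmentName  ->  matching descendants at any depth
dept_names = [e.text for e in company.findall(".//departmentName")]

# /company/department/@id  ->  read the attribute from each selected element
dept_ids = [d.get("id") for d in company.findall("department")]

# /company/department[@id='5']  ->  attribute test inside a predicate
research = company.find("department[@id='5']")

print(children, dept_names, dept_ids, research.findtext("departmentName"))
```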
XPath (cont’d)
⚫
/company/department
◦ Returns all department nodes and their descendant subtrees.
◦ Nodes in an XML document are ordered, so the XPath result will be in the same order as the document tree.
⚫ //employee[employeeSalary gt 70000]/employeeName
◦ [qualifier condition]
◦ Use // if we don’t know the full path, but do know the name of some tags
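The qualifier query can be imitated in plain Python, as sketched below. Python's standard xml.etree.ElementTree does not support numeric comparisons inside path predicates, so the condition is written as an ordinary filter. The sample data is invented for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<company>
  <employee><employeeName>Smith</employeeName><employeeSalary>30000</employeeSalary></employee>
  <department>
    <employee><employeeName>Wong</employeeName><employeeSalary>80000</employeeSalary></employee>
  </department>
</company>
""")

# Equivalent of //employee[employeeSalary gt 70000]/employeeName:
# iter() visits employee elements at any depth (the // step) and the
# comprehension's condition plays the role of the [qualifier].
high_paid = [
    emp.find("employeeName")
    for emp in doc.iter("employee")
    if float(emp.findtext("employeeSalary")) > 70000
]

print([name.text for name in high_paid])
```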
XPath: Specifying Path Expressions in XML (cont’d.)
⚫
Axes
◦ Move in multiple directions from current node in path expression
◦ Include self, child, descendant, attribute, parent, ancestor, preceding sibling, and following sibling
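A rough sketch of the ancestor and following-sibling axes in Python. Elements in the standard xml.etree.ElementTree module keep no parent pointer, so the example builds a child-to-parent map first. The helper names ancestors and following_siblings are assumptions of this example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b><c/></b><d/><e/></a>")

# Build a child -> parent map, since elements do not store their parent.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(node):
    """Yield the ancestor axis: parent, grandparent, ... up to the root."""
    while node in parent_of:
        node = parent_of[node]
        yield node

def following_siblings(node):
    """Return the siblings that come after node in document order."""
    siblings = list(parent_of[node])
    return siblings[siblings.index(node) + 1:]

c = root.find(".//c")
b = root.find("b")

print([n.tag for n in ancestors(c)])
print([n.tag for n in following_siblings(b)])
```

A full XPath engine provides these axes directly (ancestor::, following-sibling::); the map is only needed because of this module's limited data model.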
XQuery
⚫
XQuery is an extension of XPath
◦ Uses the same data model as XPath
◦ Provides a language for more complex and general queries
◦ Similar to SQL’s relationship to relational databases
⚫
XQuery uses FLWR expressions
◦ Acronym for “For, Let, Where, Return” (the XQuery standard calls them FLWOR expressions, adding “Order by”)
XQuery Features
⚫
Support for joins
for $x in <element1>, $y in <element2>
where $x/element = $y/element
⚫
Support for aggregate functions
◦ Sum, count, min, max, avg
⚫
Support for conditional branching
◦ If – then branching
⚫
And a lot more
XQuery: Specifying Queries in XML
⚫
XQuery FLWR expression
◦ Four main clauses of XQuery
◦ Form:
FOR <variable bindings to individual nodes (elements)>
LET <variable bindings to collections of nodes (elements)>
WHERE <qualifier conditions>
RETURN <query result specification>
◦ Zero or more instances of FOR and LET clauses
XQuery Example
let $d := doc("www.company.com/info.xml")
for $x in $d/company/project[projectNumber = 5]/projectWorker,
    $y in $d/company/employee
where $x/hours gt 20.0 and $y/ssn = $x/ssn
return <res> { $y/employeeName/firstName, $y/employeeName/lastName, $x/hours } </res>
Variables have $ prefix
LET assigns a variable for the rest of the query
FOR assigns a variable to range over items in a sequence
WHERE specifies any conditions
RETURN specifies the elements to be retrieved.
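For comparison, the same join can be sketched in Python with the standard xml.etree.ElementTree module. The sample data (names and hours) is invented for illustration. Each line of the comprehension lines up with one clause of the query.

```python
import xml.etree.ElementTree as ET

# LET $d := the document (inlined here instead of being loaded from a URL)
d = ET.fromstring("""
<company>
  <project>
    <projectNumber>5</projectNumber>
    <projectWorker><ssn>123456789</ssn><hours>32.5</hours></projectWorker>
    <projectWorker><ssn>453453453</ssn><hours>15.5</hours></projectWorker>
  </project>
  <employee>
    <ssn>123456789</ssn>
    <employeeName><firstName>John</firstName><lastName>Smith</lastName></employeeName>
  </employee>
  <employee>
    <ssn>453453453</ssn>
    <employeeName><firstName>Joyce</firstName><lastName>English</lastName></employeeName>
  </employee>
</company>
""")

results = [
    # RETURN: first name, last name, hours
    (y.findtext("employeeName/firstName"),
     y.findtext("employeeName/lastName"),
     float(x.findtext("hours")))
    for x in d.findall("project[projectNumber='5']/projectWorker")  # FOR $x
    for y in d.findall("employee")                                   # FOR $y
    if float(x.findtext("hours")) > 20.0                             # WHERE
    and y.findtext("ssn") == x.findtext("ssn")                       # join condition
]

print(results)
```

The nested loops compare every worker with every employee, which is fine for a sketch. A query processor would normally choose a more efficient join strategy.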
Summary
⚫
Three main types of data: structured, semi-structured, and unstructured
⚫
XML standard
◦ Tree-structured (hierarchical) data model
◦ XML documents and the languages for specifying the structure of these documents