Databases and Information Systems 2 - SS 2007 - Prof. Dr. Stefan Böttcher - Storage models for XML trees
Databases and Information Systems 2
Storage models for XML trees
in small main memory devices
Long term goals:
• reduce memory → compression (?)
• still query efficiently → small data structures
Storage models for XML trees (1):
binary tables for binary trees
• based on label, first-child (fc) and next-sibling (ns)
Example document (Auftrag = order, Kunde = customer):
<Auftrag>
  <Kunde>Meier</Kunde>
  <PC>pc500</PC>
</Auftrag>
label(ID, Label)
ID  Label(ID)
1   root
2   Auftrag
3   Kunde
4   Meier
5   PC
6   pc500
fc(P, FC)
ID  fc(ID)
1   2
2   3
3   4
5   6
ns(ID, NS)
ID  ns(ID)
3   5
Storage models for XML trees (2):
a single table for binary trees
• based on label, first-child (fc) and next-sibling (ns)
Example document: <Auftrag> <Kunde>Meier</Kunde> <PC>pc500</PC> </Auftrag>
ID  Label(ID)  fc(ID)  ns(ID)
1   root       2       -
2   Auftrag    3       -
3   Kunde      4       5
4   Meier      -       -
5   PC         6       -
6   pc500      -       -
Storage models for XML trees (3):
a single table for unranked trees
• based on label, parent (p) and position among siblings (sibling)
Example document: <Auftrag> <Kunde>Meier</Kunde> <PC>pc500</PC> </Auftrag>
ID  Label(ID)  p(ID)  sibling
1   root       -      -
2   Auftrag    1      1
3   Kunde      2      1
4   Meier      3      1
5   PC         2      2
6   pc500      5      1
Ex.: Compute needed storage for XML trees
Example tree: 1 root, 2 Auftrag, 3 Kunde, 4 Meier, 5 PC, 6 pc500
(1) three tables, binary tree: label(ID, Label) with 6 rows; fc(ID, fc(ID)) with rows (1,2), (2,3), (3,4), (5,6); ns(ID, ns(ID)) with row (3,5)
(2) single table, binary tree: columns ID, Label(ID), fc(ID), ns(ID), 6 rows
(3) single table, unranked tree: columns ID, Label(ID), p(ID), sibling, 6 rows
Assume: 1 byte per ID, fc(ID), ns(ID) and 5 bytes on average per Label
1. compute sizes for (1), (2) and (3)
2. develop general formulas for a binary XML tree with N inner nodes:
a) → how many leaf nodes?
b) → formulas for (1) and for (2)
3. How can we store an unranked tree in (more) tables?
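A minimal sketch for part 1 of the exercise, under the stated assumption of 1 byte per ID, fc(ID) or ns(ID) entry and 5 bytes per label. The class and method names are made up for illustration, and treating the p(ID) and sibling columns of model (3) as 1 byte each is an additional assumption that the slide does not state.

```java
final class StorageSizes {
    static final int ID_BYTES = 1;    // per ID, fc(ID), ns(ID) entry (from the exercise)
    static final int LABEL_BYTES = 5; // average label size (from the exercise)

    // Model (1): the label table has one row per node,
    // the fc and ns tables only have rows for existing pointers.
    static int threeTables(int nodes, int fcRows, int nsRows) {
        int labelTable = nodes * (ID_BYTES + LABEL_BYTES);
        int fcTable = fcRows * 2 * ID_BYTES;
        int nsTable = nsRows * 2 * ID_BYTES;
        return labelTable + fcTable + nsTable;
    }

    // Models (2) and (3): one row per node with ID, label and two 1-byte columns
    // (fc/ns for model (2); p/sibling for model (3), assumed to be 1 byte each).
    static int singleTable(int nodes) {
        return nodes * (ID_BYTES + LABEL_BYTES + 2 * ID_BYTES);
    }

    public static void main(String[] args) {
        System.out.println("(1) three tables: " + threeTables(6, 4, 1) + " bytes");
        System.out.println("(2)/(3) single table: " + singleTable(6) + " bytes");
    }
}
```

For the example tree this gives 36 + 8 + 2 = 46 bytes for (1) and 6 * 8 = 48 bytes for each single-table model.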
Using arrays to store XML trees
Applies to all three storage models of the example tree:
(1) three tables, binary tree
(2) single table, binary tree
(3) single table, unranked tree
• Use ID as array index → 1st column is not needed
• how to reduce high ID values?
  use relative IDs / array indices (array index difference) instead of absolute IDs
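The two ideas above (ID as array index, relative instead of absolute indices) could look roughly as follows for the single-table binary tree. This is only a sketch: the class `ArrayTree`, the use of `byte` offsets and the convention "0 = no pointer" are assumptions made for the example. The convention works because in document order a first child or next sibling always has a higher index than the node itself.

```java
final class ArrayTree {
    final String[] label;  // array index i holds node ID i + 1, so no ID column is stored
    final byte[] fcOffset; // index distance to the first child, 0 = no first child
    final byte[] nsOffset; // index distance to the next sibling, 0 = no next sibling

    // fc and ns hold absolute array indices, -1 = no such node.
    // Offsets above 127 would overflow a byte; this sketch does not handle that case.
    ArrayTree(String[] label, int[] fc, int[] ns) {
        this.label = label;
        fcOffset = new byte[label.length];
        nsOffset = new byte[label.length];
        for (int i = 0; i < label.length; i++) {
            fcOffset[i] = (byte) (fc[i] < 0 ? 0 : fc[i] - i);
            nsOffset[i] = (byte) (ns[i] < 0 ? 0 : ns[i] - i);
        }
    }

    int firstChild(int i)  { return fcOffset[i] == 0 ? -1 : i + fcOffset[i]; }
    int nextSibling(int i) { return nsOffset[i] == 0 ? -1 : i + nsOffset[i]; }

    // The example tree of the slides, with array indices 0..5 for node IDs 1..6.
    static ArrayTree example() {
        return new ArrayTree(
            new String[] {"root", "Auftrag", "Kunde", "Meier", "PC", "pc500"},
            new int[] {1, 2, 3, -1, 5, -1},
            new int[] {-1, -1, 4, -1, -1, -1});
    }
}
```

In the example every stored offset is 0, 1 or 2, although the absolute indices go up to 5; small values like these are what later compression steps can exploit.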
How to treat different node types?
escape attribute nodes:
<E a="value"></E>
→
<E>
  <@a>
    <=value>
    </=value>
  </@a>
</E>
escape text nodes:
<E>"text"</E>
→
<E>
  <=text>
  </=text>
</E>
+ escape root node, comments, PIs
→ only elements remain
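The escaping rules above can be sketched as a SAX handler that turns attributes and text into element-only events. The class `Simplifier` and its event strings are invented for this illustration; escaping of the root node, comments and PIs is left out, and text that the parser delivers in several chunks would produce several text elements here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class Simplifier extends DefaultHandler {
    final List<String> events = new ArrayList<>(); // simplified start and end tags

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("<" + qName + ">");
        for (int i = 0; i < atts.getLength(); i++) { // attribute a="value" becomes <@a><=value></=value></@a>
            events.add("<@" + atts.getQName(i) + ">");
            events.add("<=" + atts.getValue(i) + ">");
            events.add("</=" + atts.getValue(i) + ">");
            events.add("</@" + atts.getQName(i) + ">");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) { // text becomes <=text></=text>
            events.add("<=" + text + ">");
            events.add("</=" + text + ">");
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("</" + qName + ">");
    }

    // Parses the given XML string and returns the simplified tag sequence.
    static List<String> simplify(String xml) {
        try {
            Simplifier handler = new Simplifier();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

Calling `Simplifier.simplify("<E a=\"value\">text</E>")` yields the attribute escape followed by the text escape, all nested inside `<E>` and `</E>`.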
How to transform XML into a binary XML tree
1. Simplify → single node type (element nodes) only
2. generate binary tree
3. store binary tree
[Figure: example tree with element nodes E1 to E8, redrawn as a binary tree whose edges are labeled fc (first-child) and ns (next-sibling)]
Processing pipeline:
XML file → SAX events → simplified SAX events → binary simplified SAX events → list storing binary simplified SAX events → …
Simple Access to XML (SAX)
Parser accesses at most one XML element node at a time:
- can navigate and process nodes only in document order
  → less flexible programming than DOM
+ needs less space in main memory
+ loading document nodes into main memory is fast
Example (the numbers give the order in which the parser visits the nodes):
1.        <doc>
2.          <customer name="Alice">
3.            <order>
4. 5. 6.        ...
              </order>
7.            <address> </address>
            </customer>
8.          <customer>
9.            <order/>
10.           <address/>
            </customer>
          </doc>
SAX-Parser-Java-API (1)
// imports needed by this fragment
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// generate JAXP SAXParserFactory
SAXParserFactory spf = SAXParserFactory.newInstance();
// set namespaceAware to true
spf.setNamespaceAware(true);
// generate JAXP SAXParser
SAXParser saxParser = spf.newSAXParser();
// get handle to the embedded SAX XMLReader
XMLReader xmlReader = saxParser.getXMLReader();
// install a new SAXOut as ContentHandler of the XMLReader
xmlReader.setContentHandler(new SAXOut());
// set up the ErrorHandler before parsing starts
xmlReader.setErrorHandler(new MyErrorHandler(System.err));
// parse the XML file using the XMLReader
xmlReader.parse(xmlFile); // call not shown on the slide; xmlFile is an assumed String with the document's URI
SAX-Parser-Java-API (2)
// SAX parser calls this procedure once, when parsing the document starts
public void startDocument() throws SAXException { … }

// SAX parser calls this once for each start tag of an element
public void startElement(String namespaceURI, String localName, String qName, Attributes atts)
        throws SAXException {
    …
    // code example: output attribute name and attribute value of each attribute
    for (int i = 0; i < atts.getLength(); i++) {
        out.println(atts.getQName(i) + "=\"" + atts.getValue(i) + "\"");
    }
    …
}

// SAX parser calls this once for each end tag of an element
public void endElement(String namespaceURI, String localName, String qName) throws SAXException { … }

// SAX parser calls this once when the end of the document is reached
public void endDocument() throws SAXException { … }
SAX-Parser-Java-API (3)
// SAX parser calls this for each text found in the XML document
// (a parser may also deliver one contiguous text in several chunks, i.e. several calls)
public void characters(char[] ch, int start, int length) throws SAXException {
    String text = new String(ch, start, length);
    text = text.trim();
    …
}
From SAX events to binary SAX events
Pairs of SAX-Events                    Location Step       Generate node
end-element(_), start-element(a)       next-sibling :: a   next-sibling : a
start-element(_), start-element(a)     first-child :: a    first-child : a
end-element(_), end-element(_)         parent :: *         (nothing), go back to parent
start-element(_), end-element(_)       no location step    (nothing)
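The pair rules can be sketched as a small builder that looks only at the previous and the current event. `BinaryTreeBuilder` and its fields are hypothetical names; the result is the single-table form with fc and ns columns, where 0 stands for "no pointer".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class BinaryTreeBuilder {
    final List<String> label = new ArrayList<>(); // label.get(id - 1) is the label of node id
    final List<Integer> fc = new ArrayList<>();   // first-child ID, 0 = none
    final List<Integer> ns = new ArrayList<>();   // next-sibling ID, 0 = none
    private final Deque<Integer> open = new ArrayDeque<>(); // IDs of currently open elements
    private int lastClosed = 0;           // ID of the element closed most recently, 0 = none yet
    private boolean prevWasStart = false; // type of the previous event

    void startElement(String name) {
        label.add(name);
        fc.add(0);
        ns.add(0);
        int id = label.size();
        if (prevWasStart) {
            fc.set(open.peek() - 1, id);  // start, start(a): first-child of the open element
        } else if (lastClosed != 0) {
            ns.set(lastClosed - 1, id);   // end, start(a): next-sibling of the element just closed
        }
        open.push(id);
        prevWasStart = true;
    }

    void endElement() {
        lastClosed = open.pop();          // end, end: back to parent; start, end: nothing generated
        prevWasStart = false;
    }
}
```

Feeding the events of the example (root, Auftrag, Kunde, Meier, two end events, PC, pc500, four end events) reproduces the fc and ns columns of the single table for binary trees.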
Summary: Storage of XML trees
• different storage models
• binary tree can be efficiently stored
• different implementations: multiple tables, single table
• we use single table because of further compression steps
Succinct storage of XML trees (1)
How can we avoid pointers?
• use array instead of table → avoids first column
• use bits denoting existence of fc and ns → avoids fc and ns columns, but requires bits
Starting point: single table, binary tree
ID  Label(ID)  fc(ID)  ns(ID)
1   root       2       -
2   Auftrag    3       -
3   Kunde      4       5
4   Meier      -       -
5   PC         6       -
6   pc500      -       -
[Figure: binary tree of the example with nodes root (1), Auftrag (2), Kunde (3), Meier (4), PC (5), pc500 (6) and an ns edge]
→ use bits denoting existence of fc and ns to avoid pointers
Succinct storage of XML trees (2)
Single table, binary tree of the example (root, Auftrag, Kunde, Meier, PC, pc500 with IDs 1 to 6), written as a tag sequence with one bit per tag:
tags:     <r> <A> <K> <=M> </=M> </K> <P> <=p> </=p> </P> </A> </r>
bits:     1   1   1   1    0     0    1   1    0     0    0    0
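As an illustration of how the bit row is derived from the tag row: every start tag contributes a 1 and every end tag a 0. The helper below is a sketch with an invented name (`BitEncoder`); it works on tag strings rather than on real SAX events.

```java
final class BitEncoder {
    // One bit per tag of the simplified document: 1 = start tag, 0 = end tag.
    static String bits(String[] tags) {
        StringBuilder bits = new StringBuilder();
        for (String tag : tags) {
            bits.append(tag.startsWith("</") ? '0' : '1');
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        String doc = "<r> <A> <K> <=M> </=M> </K> <P> <=p> </=p> </P> </A> </r>";
        System.out.println(bits(doc.split(" ")));
    }
}
```

For the tag row of the slide this produces the 12 bits 111100110000.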
Succinct storage of XML trees (3)
use bits denoting existence of fc and ns → avoid pointers
ID → Label(ID): 1 → root, 2 → Auftrag, 3 → Kunde, 4 → Meier, 5 → PC, 6 → pc500
tags:     <r> <A> <K> <=M> </=M> </K> <P> <=p> </=p> </P> </A> </r>
bits:     1   1   1   1    0     0    1   1    0     0    0    0
node IDs: 1   2   3   4               5   6
1. How can we support navigation via first-child (fc) and next-sibling (ns)?
2. How can we compress further without disabling navigation?
Succinct storage of XML trees (4) - Exercise
ID → Label(ID): 1 → root, 2 → Auftrag, 3 → Kunde, 4 → Meier, 5 → PC, 6 → pc500
tags:     <r> <A> <K> <=M> </=M> </K> <P> <=p> </=p> </P> </A> </r>
bits:     1   1   1   1    0     0    1   1    0     0    0    0
Succinct storage of XML trees (5)
ID → Label(ID): 1 → root, 2 → Auftrag, 3 → Kunde, 4 → Meier, 5 → PC, 6 → pc500
tags:     <r> <A> <K> <=M> </=M> </K> <P> <=p> </=p> </P> </A> </r>
bits:     1   1   1   1    0     0    1   1    0     0    0    0
node IDs: 1   2   3   4               5   6
Questions:
1. How can we support navigation via first-child (fc) and next-sibling (ns)?
2. How can we compress further without disabling navigation?
Answer to question 1:
1. store the current position in the bit sequence
   fc: look at the next bit → 1 = fc exists
   ns: close the following subtree, i.e. skip bits until |1s| = |0s|, and look at the next bit → 1 = ns exists
2. count the 1s up to the current position → node ID
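A minimal sketch of this navigation on the bit sequence, assuming the bits are held in a `String` of '0' and '1' characters and a position always points at a 1 bit (a start tag). The class and method names are chosen for the example only; -1 means that the requested node does not exist.

```java
final class SuccinctTree {
    // fc: look at the next bit, 1 = first child exists.
    static int firstChild(String bits, int pos) {
        return pos + 1 < bits.length() && bits.charAt(pos + 1) == '1' ? pos + 1 : -1;
    }

    // ns: skip the subtree starting at pos (until the number of 1s equals the number of 0s),
    // then look at the next bit, 1 = next sibling exists.
    static int nextSibling(String bits, int pos) {
        int depth = 0;
        int i = pos;
        do {
            depth += bits.charAt(i) == '1' ? 1 : -1;
            i++;
        } while (depth > 0);
        return i < bits.length() && bits.charAt(i) == '1' ? i : -1;
    }

    // Node ID = number of 1s up to and including the current position.
    static int nodeId(String bits, int pos) {
        int ones = 0;
        for (int i = 0; i <= pos; i++) {
            if (bits.charAt(i) == '1') ones++;
        }
        return ones;
    }
}
```

With the bits 111100110000, position 2 is Kunde (node ID 3); `nextSibling` skips the four bits of Kunde's subtree and returns position 6, whose node ID is 5 (PC). A linear scan like this is the simplest variant; faster lookup structures are outside the scope of this sketch.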
Succinct storage of XML trees (6)
ID → Label(ID): 1 → root, 2 → Auftrag, 3 → Kunde, 4 → Meier, 5 → PC, 6 → pc500
tags:     <r> <A> <K> <=M> </=M> </K> <P> <=p> </=p> </P> </A> </r>
bits:     1   1   1   1    0     0    1   1    0     0    0    0
node IDs: 1   2   3   4               5   6
Succinct representation of IDs in table (ID, Label(ID)): TID concept
length  Label(ID)
4       root
7       Auftrag
5       Kunde
5       Meier
2       PC
5       pc500
• zip packages
• package size e.g. 20
Succinct storage of XML trees (7)
ID → Label(ID): 1 → root, 2 → Auftrag, 3 → Kunde, 4 → Meier, 5 → PC, 6 → pc500
tags:     <r> <A> <K> <=M> </=M> </K> <P> <=p> </=p> </P> </A> </r>
bits:     1   1   1   1    0     0    1   1    0     0    0    0
node IDs: 1   2   3   4               5   6
Succinct representation of IDs in table (ID, Label(ID)): TID concept
length → Label(ID): 4 → root, 7 → Auftrag, 5 → Kunde, 5 → Meier, 2 → PC, 5 → pc500
• zip packages
• package size e.g. 20 → ?
→ store / search only 1, 4, 7 (IDs that start a new package)
Tuple IDentifier (TID) concept for strings
length → Label(ID): 4 → root, 7 → Auftrag, 5 → Kunde, 5 → Meier, 2 → PC, 5 → pc500
• zip packages
• package size e.g. 20
Package 1: rootAuftragKunde 0574   (4+7+5+4 <= 20 bytes)
Package 2: MeierPCpc500 0525       (5+2+5+4 <= 20 bytes)
store only IDs that start a new package:
String ID that starts a new package
1 (root)
4 (Meier)
7 …
improvement (?): relative addresses (= package lengths)
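The packaging idea can be sketched as follows. The class `StringPackages` is hypothetical, and the cost model is an interpretation of the sums on the slide (4+7+5+4 and 5+2+5+4): each string is charged its length plus one byte for its length entry, and each package one extra byte, with one byte per character assumed. Zipping the packages is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class StringPackages {
    static final int PACKAGE_SIZE = 20; // bytes per package, as in the slide example
    final List<Integer> firstId = new ArrayList<>();        // ID of the first string of each package
    final List<List<String>> packages = new ArrayList<>();  // strings of each package, in ID order

    // labels[id - 1] is the label of node id; strings are packed greedily in ID order.
    // A single string that does not fit into an empty package is not handled in this sketch.
    StringPackages(String[] labels) {
        int used = PACKAGE_SIZE + 1; // forces opening the first package
        for (int id = 1; id <= labels.length; id++) {
            String s = labels[id - 1];
            int cost = s.length() + 1; // string bytes + 1 length byte (assumed layout)
            if (used + cost > PACKAGE_SIZE) {
                firstId.add(id);       // only these IDs are stored and searched
                packages.add(new ArrayList<>());
                used = 1;              // assumed 1 extra byte per package
            }
            packages.get(packages.size() - 1).add(s);
            used += cost;
        }
    }

    // Finds the package whose first ID is the largest one <= id, then indexes inside it.
    String label(int id) {
        int p = firstId.size() - 1;
        while (firstId.get(p) > id) {
            p--; // linear scan; a binary search over firstId would also work
        }
        return packages.get(p).get(id - firstId.get(p));
    }
}
```

For the six labels of the example this opens packages at IDs 1 and 4, matching the first two entries (1 (root), 4 (Meier)) listed on the slide.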