Databases and Information Systems 2 - SS 2007 - Prof. Dr. Stefan Böttcher - Storage models for XML trees

Databases and Information Systems 2

Storage models for XML trees

in small main memory devices

Long-term goals:

• reduce memory → compression (?)
• still query efficiently → small data structures

Storage models for XML trees (1):

binary tables for binary trees

• based on label, first-child (fc) and next-sibling (ns)

<Auftrag>
  <Kunde>Meier</Kunde>
  <PC>pc500</PC>
</Auftrag>

label(ID, Label)      fc(P, FC)        ns(ID, NS)

ID  Label(ID)         ID  fc(ID)       ID  ns(ID)
1   root              1   2            3   5
2   Auftrag           2   3
3   Kunde             3   4
4   Meier             5   6
5   PC
6   pc500

Storage models for XML trees (2):

a single table for binary trees

• based on label, first-child (fc) and next-sibling (ns)

<Auftrag> <Kunde>Meier</Kunde> <PC>pc500</PC> </Auftrag>

ID  Label(ID)  fc(ID)  ns(ID)
1   root       2       -
2   Auftrag    3       -
3   Kunde      4       5
4   Meier      -       -
5   PC         6       -
6   pc500      -       -

(- = no entry)

Storage models for XML trees (3):

a single table for unranked trees

• based on label, parent (p) and position among the siblings (sibling)

<Auftrag> <Kunde>Meier</Kunde> <PC>pc500</PC> </Auftrag>

ID  Label(ID)  p(ID)  sibling
1   root       -      -
2   Auftrag    1      1
3   Kunde      2      1
4   Meier      3      1
5   PC         2      2
6   pc500      5      1

(- = no entry)

Ex.: Compute needed storage for XML trees

Storage models to compare (tables as introduced in Storage models for XML trees (1) to (3)):

(1) three tables, binary tree
(2) single table, binary tree
(3) single table, unranked tree

Assume: 1 byte per ID, fc(ID), ns(ID) and 5 bytes on average per Label

1. compute sizes for (1), (2) and (3)
2. develop general formulas for a binary XML tree with N inner nodes:
   a) → how many leaf nodes?
   b) → formulas for (1) and for (2)
3. How can we store an unranked tree in (more) tables?
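A possible check for part 1 on the six-node example, as a minimal sketch. It assumes that an empty fc/ns cell in a single table still occupies one byte and that the three-table layout stores only existing fc/ns pairs (4 fc rows, 1 ns row in the example). The class and method names are illustrative and not part of the lecture.

```java
// Sketch: storage sizes for the six-node example under the exercise's
// assumptions (1 byte per ID / fc / ns entry, 5 bytes per label on average).
final class StorageSizes {
    static final int ID_BYTES = 1;     // per ID, fc(ID) or ns(ID) entry
    static final int LABEL_BYTES = 5;  // average per label

    // (1) three tables: label(ID, Label), fc(P, FC), ns(ID, NS).
    // Only existing fc/ns pairs are stored as rows.
    static int threeTables(int nodes, int fcRows, int nsRows) {
        int label = nodes * (ID_BYTES + LABEL_BYTES);
        int fc = fcRows * 2 * ID_BYTES;
        int ns = nsRows * 2 * ID_BYTES;
        return label + fc + ns;
    }

    // (2) and (3) single table: ID, Label and two 1-byte columns per row.
    // Assumption: an empty cell still costs one byte.
    static int singleTable(int nodes) {
        return nodes * (ID_BYTES + LABEL_BYTES + 2 * ID_BYTES);
    }

    public static void main(String[] args) {
        System.out.println(threeTables(6, 4, 1)); // 36 + 8 + 2 = 46
        System.out.println(singleTable(6));       // 6 * 8 = 48
    }
}
```

Under these assumptions the three-table layout is slightly smaller here because most nodes have no next sibling; a different cost for empty cells changes the comparison.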

Using arrays to store XML trees

Applies to all three storage models: (1) three tables, binary tree; (2) single table, binary tree; (3) single table, unranked tree

• Use ID as array index → 1st column is not needed
• How to reduce high ID values? Use relative IDs / array indices (array index difference) instead of absolute IDs
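The two ideas above can be sketched for the single-table binary tree as follows. The array contents are derived from the example table; the convention that offset 0 means "no such node" and all names are assumptions made for this illustration.

```java
// Sketch: array storage of the example binary tree with relative indices.
// Array index = node ID - 1, so the ID column is not stored at all.
final class RelativeArrayTree {
    static final String[] LABEL = {"root", "Auftrag", "Kunde", "Meier", "PC", "pc500"};
    // Offset from a node to its first child / next sibling.
    // Assumption: 0 means "none", which is safe because no node points to itself.
    static final byte[] FC = {1, 1, 1, 0, 1, 0};
    static final byte[] NS = {0, 0, 2, 0, 0, 0};

    // Returns the ID of the first child of node id, or 0 if there is none.
    static int firstChild(int id) {
        int offset = FC[id - 1];
        return offset == 0 ? 0 : id + offset;
    }

    // Returns the ID of the next sibling of node id, or 0 if there is none.
    static int nextSibling(int id) {
        int offset = NS[id - 1];
        return offset == 0 ? 0 : id + offset;
    }

    static String label(int id) {
        return LABEL[id - 1];
    }

    public static void main(String[] args) {
        System.out.println(label(firstChild(1)));  // Auftrag
        System.out.println(label(nextSibling(3))); // PC
    }
}
```

In document order a first child is always the directly following node, so the fc offsets are either 0 or 1; only the ns offsets can grow with the size of the skipped subtree.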

How to treat different node types?

escape attribute nodes:

<E a="value"></E>   →   <E> <@a> <=value> </=value> </@a> </E>

escape text nodes:

<E>"text"</E>   →   <E> <=text> </=text> </E>

+ escape root node, comments, PIs

→ only elements remain
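The escaping of attribute and text nodes can be sketched for a single element as follows. This is a minimal illustration, not the lecture's implementation: the method signature is hypothetical, and the escaping of root node, comments and PIs is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: rewrite one element with attributes and optional text
// into a sequence of element-only tags, using the @ and = prefixes.
final class NodeEscaper {
    static List<String> simplify(String name, Map<String, String> attrs, String text) {
        List<String> tags = new ArrayList<>();
        tags.add("<" + name + ">");
        for (Map.Entry<String, String> a : attrs.entrySet()) {
            tags.add("<@" + a.getKey() + ">");   // attribute becomes an element @name
            addLeaf(tags, "=" + a.getValue());   // its value becomes a child element =value
            tags.add("</@" + a.getKey() + ">");
        }
        if (text != null && !text.isEmpty()) {
            addLeaf(tags, "=" + text);           // text becomes a child element =text
        }
        tags.add("</" + name + ">");
        return tags;
    }

    // Adds an element without children: start tag directly followed by end tag.
    private static void addLeaf(List<String> tags, String label) {
        tags.add("<" + label + ">");
        tags.add("</" + label + ">");
    }

    public static void main(String[] args) {
        System.out.println(simplify("E", Map.of("a", "value"), null));
        System.out.println(simplify("E", Map.of(), "text"));
    }
}
```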

How to transform XML into a binary XML tree

1. Simplify → single node type (element nodes) only
2. generate binary tree
3. store binary tree

[Figure: example XML tree with element nodes E1, E2, E3, E4]

[Figure: binary tree over the element nodes E1 to E8; edges are labeled fc (first-child) or ns (next-sibling)]

XML file
→ SAX events
→ simplified SAX events
→ binary simplified SAX events
→ list storing binary simplified SAX events
→ …

Simple API for XML (SAX)

Parser accesses at most one XML element node at a time:

- can navigate and process nodes only in document order → less flexible programming than DOM
+ needs less space in main memory
+ loading document nodes into main memory is fast

[Figure: document tree with root doc and two customer nodes, the first with attribute name = "Alice"; each customer has order and address children. The numbers 1. to 10. mark the order in which the parser reaches the nodes.]

 1. <doc>
 2.   <customer name="Alice">
 3.     <order> 4. 5. 6. ... </order>
 7.     <address> </address>
      </customer>
 8.   <customer>
 9.     <order/>
10.     <address/>
      </customer>
    </doc>

SAX-Parser-Java-API (1)

// imports needed by this excerpt
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// generate JAXP SAXParserFactory
SAXParserFactory spf = SAXParserFactory.newInstance();

// set namespaceAware to true
spf.setNamespaceAware(true);

// generate JAXP SAXParser
SAXParser saxParser = spf.newSAXParser();

// get handle to the embedded SAX XMLReader
XMLReader xmlReader = saxParser.getXMLReader();

// generate new SAX output stream for ContentHandler of XMLReader
xmlReader.setContentHandler(new SAXOut());

// set up ErrorHandler before parsing starts
xmlReader.setErrorHandler(new MyErrorHandler(System.err));

// parse the XML file using the XMLReader

SAX-Parser-Java-API (2)

// Parser calls this procedure once, when parsing the document starts
public void startDocument() throws SAXException { … }

// SAX parser calls this once for each start tag of an element
public void startElement(String namespaceURI, String localName, String qName, Attributes atts)
        throws SAXException {
    …
    // code example: output attribute name and attribute value
    for (int i = 0; i < atts.getLength(); i++) {   // for each attribute
        out.println(atts.getQName(i) + "=\"" + atts.getValue(i) + "\"");
    }
    …
}

// SAX parser calls this once for each end tag of an element
public void endElement(String namespaceURI, String localName, String qName) throws SAXException { … }

// SAX parser calls this once when end of document is reached
public void endDocument() throws SAXException { … }

SAX-Parser-Java-API (3)

// SAX parser calls this for each chunk of text found in the XML document
// (a parser may deliver one text node in several chunks)
public void characters(char[] ch, int start, int length) throws SAXException
{
    String text = new String(ch, start, length);
    text = text.trim();
}
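To see the callbacks of the three API slides working together, here is a self-contained sketch that parses the running example from a string and records the events. The class `SaxEventRecorder` and its event notation are made up for this illustration; the lecture's own `SAXOut` and `MyErrorHandler` classes are not shown in the slides and are not used here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: a ContentHandler that records start, end and text events in document order.
final class SaxEventRecorder extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start(" + qName + ")");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end(" + qName + ")");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Assumption: each short text arrives as one chunk; a parser may split longer text.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("text(" + text + ")");
        }
    }

    // Parses the given XML string and returns the recorded events.
    static List<String> record(String xml) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            SAXParser saxParser = spf.newSAXParser();
            XMLReader xmlReader = saxParser.getXMLReader();
            SaxEventRecorder handler = new SaxEventRecorder();
            xmlReader.setContentHandler(handler);
            xmlReader.parse(new InputSource(new StringReader(xml)));
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        record("<Auftrag><Kunde>Meier</Kunde><PC>pc500</PC></Auftrag>")
                .forEach(System.out::println);
    }
}
```

Only the current event is visible in each callback, which is why the handler has to keep any state it needs (here the event list) itself.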

From SAX events to binary SAX events

Pairs of SAX events                    Location step       Generate node
end-element(_), start-element(a)       next-sibling :: a   next-sibling : a
start-element(_), start-element(a)     first-child :: a    first-child : a
end-element(_), end-element(_)         parent :: *         (nothing), go back to parent
start-element(_), end-element(_)       no location step    (nothing)
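The pair rules in the table can be sketched as a single pass over the simplified tag sequence. This is a minimal sketch under the assumption that the input is a well-formed sequence of start and end tags; node IDs are assigned in document order starting at 1, 0 stands for "no such node", and the class name is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: derive first-child (fc) and next-sibling (ns) links from pairs of
// consecutive start/end events.
final class BinaryEvents {
    final String[] label; // label[id], index 0 unused
    final int[] fc;       // fc[id] = ID of first child, 0 = none
    final int[] ns;       // ns[id] = ID of next sibling, 0 = none

    BinaryEvents(String[] tags) {
        int nodes = tags.length / 2;        // one start and one end tag per node
        label = new String[nodes + 1];
        fc = new int[nodes + 1];
        ns = new int[nodes + 1];
        Deque<Integer> open = new ArrayDeque<>(); // IDs of currently open elements
        int nextId = 0;
        int lastClosed = 0;
        boolean previousWasStart = false;
        for (String tag : tags) {
            if (tag.startsWith("</")) {
                // (start, end) and (end, end): generate nothing, go back to the parent
                lastClosed = open.pop();
                previousWasStart = false;
            } else {
                int id = ++nextId;
                label[id] = tag.substring(1, tag.length() - 1);
                if (previousWasStart) {
                    fc[open.peek()] = id;   // (start, start): first-child
                } else if (lastClosed != 0) {
                    ns[lastClosed] = id;    // (end, start): next-sibling
                }
                open.push(id);
                previousWasStart = true;
            }
        }
    }

    public static void main(String[] args) {
        String[] tags = {"<root>", "<Auftrag>", "<Kunde>", "<=Meier>", "</=Meier>", "</Kunde>",
                         "<PC>", "<=pc500>", "</=pc500>", "</PC>", "</Auftrag>", "</root>"};
        BinaryEvents t = new BinaryEvents(tags);
        for (int id = 1; id < t.label.length; id++) {
            System.out.println(id + " " + t.label[id] + " fc=" + t.fc[id] + " ns=" + t.ns[id]);
        }
    }
}
```

For the running example this yields the same fc and ns entries as the single table for binary trees: fc(1)=2, fc(2)=3, fc(3)=4, ns(3)=5, fc(5)=6.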

Summary: Storage of XML trees

• different storage models
• binary tree can be efficiently stored
• different implementations: multiple tables, single table
• we use single table because of further compression steps

Succinct storage of XML trees (1)

How can we avoid pointers?

• use array instead of table → avoids first column
• use bits denoting existence of fc and ns → avoids fc and ns columns, but requires bits

Starting point: single table, binary tree

ID  Label(ID)  fc(ID)  ns(ID)
1   root       2       -
2   Auftrag    3       -
3   Kunde      4       5
4   Meier      -       -
5   PC         6       -
6   pc500      -       -

Succinct storage of XML trees (2)

use bits denoting existence of fc and ns → avoid pointers

Starting point: the single table for the binary tree, as in Succinct storage of XML trees (1)

tags  <r>  <A>  <K>  <=M>  </=M>  </K>  <P>  <=p>  </=p>  </P>  </A>  </r>
bits   1    1    1    1     0      0     1    1     0      0     0     0

(r = root, A = Auftrag, K = Kunde, =M = text Meier, P = PC, =p = text pc500)

Succinct storage of XML trees (3)

use bits denoting existence of fc and ns → avoid pointers

ID → Label(ID): 1 → root, 2 → Auftrag, 3 → Kunde, 4 → Meier, 5 → PC, 6 → pc500

tags      <r>  <A>  <K>  <=M>  </=M>  </K>  <P>  <=p>  </=p>  </P>  </A>  </r>
bits       1    1    1    1     0      0     1    1     0      0     0     0
node IDs   1    2    3    4                  5    6

1. How can we support navigation via first-child (fc) and next-sibling (ns)?
2. How can we compress further without disabling navigation?

Succinct storage of XML trees (4) - Exercise

(ID/label list, tags and bits as in Succinct storage of XML trees (3))

Succinct storage of XML trees (5)

(ID/label list, tags, bits and node IDs as in Succinct storage of XML trees (3))

1. How can we support navigation via first-child (fc) and next-sibling (ns)?

   1. store the current position
      fc: look at the next bit → 1 = fc exists
      ns: skip the following subtree, i.e. advance until |1s| = |0s|, and look at the next bit → 1 = ns exists
   2. count the 1s up to the current position → node ID

2. How can we compress further without disabling navigation?
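The navigation rules for fc, ns and node IDs can be sketched on the bit sequence as follows. This is a minimal sketch with linear scans: positions are 0-based array indices, -1 stands for "does not exist", and every position passed in is assumed to hold a 1 bit of a balanced sequence. Practical succinct structures typically add rank/select indexes to avoid the scans; that is not shown here.

```java
// Sketch: navigation on the bit sequence (1 = start tag, 0 = end tag).
final class SuccinctBits {
    private final boolean[] bits;

    SuccinctBits(String zerosAndOnes) {
        String s = zerosAndOnes.replace(" ", "");
        bits = new boolean[s.length()];
        for (int i = 0; i < s.length(); i++) {
            bits[i] = s.charAt(i) == '1';
        }
    }

    // fc: look at the next bit; 1 means the first child starts there.
    int firstChild(int pos) {
        return pos + 1 < bits.length && bits[pos + 1] ? pos + 1 : -1;
    }

    // ns: skip the subtree starting at pos (until |1s| = |0s|), then look at the next bit.
    int nextSibling(int pos) {
        int open = 0;
        int i = pos;
        do {
            open += bits[i] ? 1 : -1;
            i++;
        } while (open > 0);
        return i < bits.length && bits[i] ? i : -1;
    }

    // Node ID = number of 1 bits up to and including pos.
    int nodeId(int pos) {
        int ones = 0;
        for (int i = 0; i <= pos; i++) {
            if (bits[i]) {
                ones++;
            }
        }
        return ones;
    }

    public static void main(String[] args) {
        SuccinctBits t = new SuccinctBits("1 1 1 1 0 0 1 1 0 0 0 0");
        int kunde = t.firstChild(t.firstChild(0)); // root -> Auftrag -> Kunde
        int pc = t.nextSibling(kunde);
        System.out.println(t.nodeId(kunde)); // 3
        System.out.println(t.nodeId(pc));    // 5
    }
}
```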

Succinct storage of XML trees (6)

(ID/label list, tags, bits and node IDs as in Succinct storage of XML trees (3))

Succinct representation of IDs in table (ID, Label(ID)): TID concept

length  Label(ID)
4       root
7       Auftrag
5       Kunde
5       Meier
2       PC
5       pc500

• zip packages
• package size e.g. 20

Succinct storage of XML trees (7)

(ID/label list, tags, bits and node IDs as in Succinct storage of XML trees (3); length/label table and zip packages as in Succinct storage of XML trees (6))

package size e.g. 20 → ?

→ store / search only 1, 4, 7 (IDs that start a new package)

Tuple IDentifier (TID) concept for strings

length  Label(ID)
4       root
7       Auftrag
5       Kunde
5       Meier
2       PC
5       pc500

zip packages, package size e.g. 20

package 1: rootAuftragKunde 0574     4+7+5+4 <= 20 bytes
package 2: MeierPCpc500 0525         5+2+5+4 <= 20 bytes

store only the IDs that start a new package

String ID that starts a new package:
1 (root)
4 (Meier)
7 …

improvement (?): relative addresses (= package lengths)
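How the package boundaries 1, 4, 7 come about can be sketched with a greedy packing. The sketch reads the 4 extra bytes per package as one length byte per label plus one marker byte (the digits 0574 and 0525); this reading is an interpretation of the slide, not something it states. Labels are assumed to use one byte per character and to fit into a package on their own; all names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: pack labels into fixed-size packages and keep only the IDs
// that start a new package.
final class LabelPackages {
    static final int PACKAGE_SIZE = 20; // bytes, as in the slide's example

    // Returns the 1-based label IDs that start a new package.
    static List<Integer> packageStarts(String[] labels) {
        List<Integer> starts = new ArrayList<>();
        int used = PACKAGE_SIZE + 1; // forces a new package for the first label
        for (int i = 0; i < labels.length; i++) {
            int need = labels[i].length() + 1; // label bytes + 1 length byte
            if (used + need > PACKAGE_SIZE) {
                starts.add(i + 1);
                used = 1;                      // assumed marker byte of the package
            }
            used += need;
        }
        return starts;
    }

    // Returns the index of the package holding the given ID:
    // the last package whose start ID is <= id.
    static int packageOf(int id, List<Integer> starts) {
        int p = 0;
        while (p + 1 < starts.size() && starts.get(p + 1) <= id) {
            p++;
        }
        return p;
    }

    public static void main(String[] args) {
        String[] labels = {"root", "Auftrag", "Kunde", "Meier", "PC", "pc500"};
        List<Integer> starts = packageStarts(labels);
        System.out.println(starts);               // [1, 4]
        System.out.println(packageOf(5, starts)); // 1, i.e. the second package
    }
}
```

A seventh label would open the third package here, matching the "7 …" entry in the list of start IDs.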

XML file
→ SAX events
→ simplified SAX events
→ binary simplified SAX events
→ list storing binary simplified SAX events
→ binary DAG of binary simplified SAX events
→ grammar of simplified DAG events
→ succinct representation of grammar
