String-Based Semantic Web Data Management Using Ternary B-Trees
PhD Seminar, April 29, 2010

(1)

Jürg Senn

Department of Computer Science, University of Basel


(2)

RDF

– Resource Description Framework (RDF): basis of the Semantic Web
– intended for knowledge management
– models: statements/facts
– Building block for:
  – modelling structured knowledge
  – automatic logical reasoning

– Example:

@prefix : <http://exmpl.org/book/>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix title: <http://exmpl.org/book/title/>.

Subject   Predicate   Object
:book0    dc:title    "Web Handbook" .
:book1    dc:title    "SPARQL Tutorial" .
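The example above can be read as a list of (subject, predicate, object) tuples. The following is a minimal sketch, not a Turtle parser: the `PREFIXES` table and the `expand` helper are hypothetical names introduced only for illustration, and literals are simply kept as quoted strings.

```python
# Prefix table taken from the @prefix lines of the example.
PREFIXES = {
    "": "http://exmpl.org/book/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def expand(term):
    """Expand a prefixed name such as dc:title; keep quoted literals unchanged."""
    if term.startswith('"'):
        return term
    prefix, _, local = term.partition(":")
    return PREFIXES[prefix] + local

# The two statements of the example as (subject, predicate, object).
triples = [
    (":book0", "dc:title", '"Web Handbook"'),
    (":book1", "dc:title", '"SPARQL Tutorial"'),
]

expanded = [tuple(expand(term) for term in triple) for triple in triples]
for s, p, o in expanded:
    print(s, p, o)
```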

(3)

General RDF

– Outside of knowledge management: a general data model, like the relational model, XML, ...
– Name/value list (predicate corresponds to property)

– Properties:

+ simple

+ no schema

– fine-grained
– grows quickly

– Examples available on the Web:

– GeoNames (geographical data)

– DBpedia (structured data extracted from Wikipedia)

– UniProt (protein sequence database)
– many others...

(4)

SPARQL

– Query language SPARQL:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title
WHERE { ?x dc:title ?title
        FILTER REGEX(?title, "Web") }
ORDER BY ?title

– General research problem (Semantic Web data management): efficient and scalable support for RDF/SPARQL
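To make the semantics of the query concrete, here is a small sketch that evaluates it over an in-memory list of triples. It is an illustration only, not a SPARQL engine: `select_titles` is a hypothetical helper, the data are the two triples from the RDF example, and Python's `re.search` and `sorted` stand in for REGEX and ORDER BY.

```python
import re

DC_TITLE = "http://purl.org/dc/elements/1.1/title"

# The two triples from the RDF example, with literals as plain strings.
triples = [
    ("http://exmpl.org/book/book0", DC_TITLE, "Web Handbook"),
    ("http://exmpl.org/book/book1", DC_TITLE, "SPARQL Tutorial"),
]

def select_titles(triples, pattern):
    """Match ?x dc:title ?title, keep titles matching the pattern, order them."""
    titles = [o for s, p, o in triples
              if p == DC_TITLE and re.search(pattern, o)]
    return sorted(titles)

result = select_titles(triples, "Web")
print(result)
```

Both the filter and the ordering operate on the title strings themselves, which is the kind of string-based operation the talk is concerned with.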

(5)

Related Work

– Coarse categorization by physical design:

  – mapping to relational database (reference: vertically partitioned approach [1])
  – index-based [2]
  – other: graph-based, distributed, bitmap-based, vector-based
– Index-based currently most scalable and efficient
  – idea: build indexes from triple permutations
  – indexes become the database, no tables
  – requires compression due to data replication
  – reference: RDF-3X [2]
– Universal: mapping to integer identifiers

[Figure: indexes over the triple permutations SPO, SOP, PSO, POS, OSP, OPS, the pairs SP, SO, PS, PO, OS, OP, and the single components S, P, O; a dictionary maps each string to an identifier, e.g. http://exmpl.org/book/book0 to 1A32C24A]

[1] D. Abadi et al., Scalable semantic web data management using vertical partitioning, VLDB, 2007
[2] T. Neumann et al., RDF-3X: a RISC-style engine for RDF
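The index-based idea (dictionary mapping plus one sorted index per triple permutation) can be sketched in a few lines. This is a toy illustration under simplifying assumptions, not how RDF-3X is implemented: the `build` function is hypothetical, identifiers are assigned sequentially rather than as the hexadecimal values on the slide, and there is no compression.

```python
from itertools import permutations

def build(triples):
    """Return (dictionary, indexes): string-to-id map and six sorted permutations."""
    # Dictionary: every distinct string gets an integer id on first appearance.
    ids = {}
    for triple in triples:
        for term in triple:
            ids.setdefault(term, len(ids))
    encoded = [tuple(ids[term] for term in triple) for triple in triples]

    # One sorted list of id tuples per permutation of (S, P, O).
    indexes = {}
    for order in permutations("SPO"):
        cols = ["SPO".index(c) for c in order]
        indexes["".join(order)] = sorted(tuple(t[c] for c in cols) for t in encoded)
    return ids, indexes

triples = [
    ("http://exmpl.org/book/book0", "dc:title", "Web Handbook"),
    ("http://exmpl.org/book/book1", "dc:title", "SPARQL Tutorial"),
]
ids, indexes = build(triples)
print(sorted(indexes))   # the six permutation names
print(indexes["POS"])    # triples reordered as (predicate, object, subject)
```

Every triple is stored six times, which is why the slide notes that this design requires compression.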

(6)

Problem Statement

– Approaches do not consider extensive filtering and ordering
– RDF/SPARQL is inherently string-based
– String-based operations are neglected
  – e.g. FILTER with regular expression
  – require many lookups in the dictionary
– Mapping destroys string order
  – current indexes are ordered by id
– Partitioning indexes by string order is more problematic
  – but partitioning is interesting for parallel processing
– Inherent properties of RDF are not used
  – URIs already have a hierarchical structure

Extensions of current approaches should take into account the string-based nature of RDF.

Dictionary:
http://exmpl.org/book/book0   1A32C24A
http://exmpl.org/book/book1   09913AC4
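The loss of string order can be seen directly with the two dictionary entries from the slide. A short sketch, assuming the identifiers are compared as hexadecimal integers:

```python
# Dictionary entries from the slide: string -> identifier.
dictionary = {
    "http://exmpl.org/book/book0": 0x1A32C24A,
    "http://exmpl.org/book/book1": 0x09913AC4,
}

by_string = sorted(dictionary)                  # order of the strings
by_id = sorted(dictionary, key=dictionary.get)  # order an id-based index sees

print(by_string)
print(by_id)
```

Sorting by identifier places book1 before book0, so an ORDER BY or a range scan over strings cannot be answered from the id order alone and needs dictionary lookups.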

(7)

Approach

– String-based data structure and processing method
– Work with unmapped RDF exclusively

– Contributions:
  – disk-based string index: ternary B-tree
  – compression scheme
  – adapted query optimizer and processor
  – partitioning scheme, parallelization

– Goal:

– demonstrate viability and competitiveness

– demonstrate high performance even with variable-sized data

http://exmpl.org/book/book0

(8)

Ternary B-Tree

– Combination of:
  – B-Tree
  – trie
– B-Tree provides:
  – balance
  – disk/file structure
– B-Tree node = trie
  – stores strings
  – each string = full RDF triple

[Figure: a node storing the strings "tie", "tier" and "tree" as a trie with shared prefixes]
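As a rough illustration of how a trie shares prefixes among the strings of the figure, here is a character-level sketch using nested dictionaries. It is an assumption-laden toy: the `END` sentinel, `insert` and `words` are hypothetical names, and it says nothing about the on-disk node format used in the talk.

```python
END = ""  # hypothetical sentinel key marking the end of a stored string

def insert(trie, word):
    """Insert a word, reusing the nodes of any prefix already present."""
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = {}

def words(trie, prefix=""):
    """Yield the stored words in sorted order."""
    for key in sorted(trie):
        if key == END:
            yield prefix
        else:
            yield from words(trie[key], prefix + key)

trie = {}
for w in ["tree", "tie", "tier"]:
    insert(trie, w)

stored = list(words(trie))
print(stored)
```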

(9)

Ternary Search

– Search is ternary: less, equal, greater (hence: ternary B-tree)
– Indexes of next node to search
– Indexes stored at end of strings ("terminators")

Child indexes for a node storing tie, tier, tree:

0: x < tie
1: tie < x < tier
2: tier < x < tree
3: tree < x
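The choice of the next node can be sketched with a binary search over the sorted strings of a node. This is only an illustration of the less/equal/greater decision, with `search` as a hypothetical helper; the actual structure walks a trie and reads the index from the terminators.

```python
from bisect import bisect_left

KEYS = ["tie", "tier", "tree"]  # strings stored in the node, in sorted order

def search(keys, x):
    """Return ("equal", position) on a hit, else ("child", index of next node)."""
    pos = bisect_left(keys, x)
    if pos < len(keys) and keys[pos] == x:
        return ("equal", pos)
    # pos counts the keys smaller than x, which is the child index 0..len(keys).
    return ("child", pos)

for x in ["apple", "tie", "tied", "toy", "zoo"]:
    print(x, search(KEYS, x))
```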

(10)

Triple Decomposition

S: http://example.org/book/book1
P: http://purl.org/dc/elements/1.1/title
O: "Web Handbook"

S parts: http://e | xample.o | rg/book/ | book1000
P parts: http://p | url.org/ | ...

– Triples decomposed into parts (8 bytes)
– Comparing parts, not single characters
– Each element filled up with zeros (e.g. book1 becomes book1000)
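The decomposition can be sketched as follows. The assumptions are mine: elements are UTF-8 encoded, padding uses zero bytes, and `decompose` and `as_int` are hypothetical helper names. Reading each 8-byte part as a big-endian integer shows why whole parts can be compared in one step instead of character by character.

```python
PART = 8  # part size in bytes, as stated on the slide

def decompose(element):
    """Split an element into 8-byte parts, padding the last part with zero bytes."""
    raw = element.encode("utf-8")
    raw += b"\x00" * (-len(raw) % PART)
    return [raw[i:i + PART] for i in range(0, len(raw), PART)]

def as_int(part):
    """Big-endian integer whose order equals the byte order of the part."""
    return int.from_bytes(part, "big")

parts = decompose("http://example.org/book/book1")
print(parts)

# Comparing two parts takes a single integer comparison.
print(as_int(b"book1\x00\x00\x00") < as_int(b"book2\x00\x00\x00"))
```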

(11)

Triple Arrangement

(12)

Physical Storage

Dir.: 1, 2, 3, 4, 6, 7, 8, 9, 10, 13, ...
Data: *http://, example., org/prop, erty, *http://, *_:0, example., *http://, org/reso, example., urce1, urce2, urce3, ...

– Directory: tree structure stored as prefix sum
– Data: stored breadth first
– Dir./Data stored separately
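One way to read this layout is that the data array lists the parts in breadth-first order and the directory entry of a node is the data index of its first child, obtained as a running sum of child counts. The sketch below follows that reading, which is an interpretation of the slide rather than a confirmed description; the tree shape and the `layout` function are assumptions chosen so that the result matches the numbers shown.

```python
from collections import deque

def layout(root):
    """root is (part, [children]); return (directory, data) in breadth-first order."""
    data, counts = [], []
    queue = deque([root])
    while queue:
        part, children = queue.popleft()
        data.append(part)
        counts.append(len(children))
        queue.extend(children)
    # Prefix sum: node k's first child sits after the root and all earlier children.
    directory, next_free = [], 1
    for count in counts:
        directory.append(next_free)
        next_free += count
    return directory, data

def chain(*parts, children=()):
    """Build a single path of parts ending in the given children."""
    node_children = list(children)
    for part in reversed(parts):
        node_children = [(part, node_children)]
    return node_children[0]

leaves = [("urce1", []), ("urce2", []), ("urce3", [])]
branch_a = chain("*http://", "example.", "org/reso", children=leaves)
branch_b = chain("*_:0", "*http://", "example.")
root = chain("*http://", "example.", "org/prop", "erty", children=[branch_a, branch_b])

directory, data = layout(root)
print(directory[:10])
print(data)
```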

(13)

Traversal

Index   Dir.   Data
0       1      *http://
1       2      example.
2       3      org/prop
3       4      erty
4       6      *http://
5       7      *_:0
6       8      example.
7       9      *http://
8       10     org/reso
9              example.

– dirIndex = dataIndex + i + 1
– dataIndex = dir[dirIndex - 1]

dirIndex   dataIndex
0          0
1          1
2          2
3          3
4          4
6          7
8          9

– next dirIndex = 4 + 1 + 1 = 6
– next dataIndex = dir[5] = 7
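The two formulas can be replayed against the directory values shown on the slide. In this sketch `step` is a hypothetical helper, and `i` is assumed to be the offset of the matching entry within the current group of siblings; the slide does not define it explicitly.

```python
# Directory values as shown on the slide (entries 0 to 8).
directory = [1, 2, 3, 4, 6, 7, 8, 9, 10]

def step(data_index, i):
    """Apply dirIndex = dataIndex + i + 1 and dataIndex = dir[dirIndex - 1]."""
    dir_index = data_index + i + 1
    return dir_index, directory[dir_index - 1]

# Replay the walk: the offsets i chosen at each level are assumptions.
trace = [(0, 0)]
for i in [0, 0, 0, 0, 1, 0]:
    trace.append(step(trace[-1][1], i))

for dir_index, data_index in trace:
    print(dir_index, data_index)
```

With these offsets the trace contains the pairs (1, 1) to (4, 4), then (6, 7) and (8, 9), including the step from data index 4 with i = 1 that the slide works out.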

(20)

(21)

Conclusion

– Semantic Web data management
– String-based approach
– Basis: ternary B-tree
– Next steps: query processing and performance
  – implementation, then benchmarking

Thank you for your attention.

Questions?