Jürg Senn
Department of Computer Science, University of Basel
String-Based
Semantic Web Data Management
Using Ternary B-Trees
RDF
– Resource Description Framework (RDF): basis of the Semantic Web
– intended for knowledge management
– models: statements/facts
– Building block for:
  – modelling structured knowledge
  – automatic logical reasoning
– Example:
@prefix : <http://exmpl.org/book/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix title: <http://exmpl.org/book/title/> .
# Subject  Predicate  Object
:book0 dc:title "Web Handbook" .
:book1 dc:title "SPARQL Tutorial" .
General RDF
– Outside of knowledge management:
  – general data model like relational model, XML, ...
  – name/value list (predicate = property)
– Properties:
  + simple
  + no schema
  – fine-grained
  – grows quickly
– Examples available on the Web:
– GeoNames (geographical data)
– DBpedia (structured data extracted from Wikipedia)
– Uniprot (protein sequence database)
– many others...
SPARQL
– Query language SPARQL:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title
WHERE { ?x dc:title ?title
        FILTER REGEX(?title, "Web") }
ORDER BY ?title
– General research problem: efficient and scalable support for RDF/SPARQL (Semantic Web data management)
Related Work
– Coarse categorization by physical design:
  – mapping to relational database (reference: vertically partitioned approach [1])
  – index-based [2]
  – other: graph-based, distributed, bitmap-based, vector-based
– Index-based currently most scalable and efficient
  – idea: build indexes from triple permutations
  – indexes become the database, no tables
  – requires compression due to data replication
  – reference: RDF-3X [2]
– Universal: mapping to integer identifiers
[Figure: indexes SPO SOP PSO POS OSP OPS, SP SO PS PO OS OP, S P O; dictionary entry http://exmpl.org/book/book0 → 1A32C24A]
[1] D. Abadi et al., Scalable semantic web data management using vertical partitioning, VLDB, 2007
[2] T. Neumann et al., RDF-3X: a RISC-style engine for RDF,
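The index-based idea above (integer dictionary plus one sorted index per triple permutation) can be sketched as follows. This is a minimal illustration, not RDF-3X itself: the id assignment (order of first appearance) and the in-memory sorted lists are assumptions made for the example, and the aggregated SP/SO/... indexes and compression are left out.

```python
from itertools import permutations

# The two example triples from the RDF slide, written with full URIs.
triples = [
    ("http://exmpl.org/book/book0", "http://purl.org/dc/elements/1.1/title", '"Web Handbook"'),
    ("http://exmpl.org/book/book1", "http://purl.org/dc/elements/1.1/title", '"SPARQL Tutorial"'),
]

# Dictionary: each distinct term gets an integer id.
# Assumption: ids are handed out in order of first appearance.
dictionary = {}
for triple in triples:
    for term in triple:
        dictionary.setdefault(term, len(dictionary))

encoded = [tuple(dictionary[term] for term in triple) for triple in triples]

# One sorted index per permutation of (S, P, O): SPO, SOP, PSO, POS, OSP, OPS.
indexes = {}
for order in permutations(range(3)):
    name = "".join("SPO"[i] for i in order)
    indexes[name] = sorted(tuple(t[i] for i in order) for t in encoded)

print(list(indexes))
print(indexes["POS"])
```

Every triple is stored six times, once per permutation, which is why such systems depend on compression.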
Problem Statement
– Approaches do not consider extensive filtering and ordering
– RDF/SPARQL inherently string-based
– String-based operations neglected
  – e.g. FILTER with regular expression
  – requires many lookups in dictionary
– Mapping destroys string order
  – current indexes ordered by id
– Partitioning indexes by string order more problematic
  – but partitioning interesting for parallel processing
– Inherent properties of RDF not used
  – URIs already have hierarchical structure
– Extensions of current approaches should take into account the string-based nature of RDF
– Dictionary example:
  http://exmpl.org/book/book0 → 1A32C24A
  http://exmpl.org/book/book1 → 09913AC4
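The point that mapping destroys string order can be checked directly on the two dictionary entries shown on this slide. The sketch below only compares the two orderings; the dictionary is a plain Python dict standing in for the real structure.

```python
# Dictionary entries as shown on the slide (ids read as hexadecimal numbers).
dictionary = {
    "http://exmpl.org/book/book0": 0x1A32C24A,
    "http://exmpl.org/book/book1": 0x09913AC4,
}

# Lexicographic order of the URIs themselves.
by_string = sorted(dictionary)

# Order in which an id-ordered index would store the same two entries.
by_id = sorted(dictionary, key=dictionary.get)

order_preserved = by_string == by_id
print(by_string)
print(by_id)
print(order_preserved)
```

Because book1 has the smaller id, an id-ordered index stores it before book0, so ORDER BY or a range scan over strings cannot be answered from the index order alone.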
Approach
– String-based data structure and processing method
– Work with unmapped RDF exclusively
– Contributions:
  – disk-based string index: ternary B-tree
  – compression scheme
  – adapted query optimizer and processor
  – partitioning scheme, parallelization
– Goal:
  – demonstrate viability and competitiveness
  – demonstrate high performance even with variable-sized data
http://exmpl.org/book/book0
Ternary B-Tree
– Combination of:
  – B-Tree
  – trie
– B-Tree provides:
  – balance
  – disk/file structure
– B-Tree node = trie
  – stores strings
  – each string = full RDF triple
[Figure: B-tree node storing the strings "tie", "tier" and "tree" as a trie]
Ternary Search
– Search is ternary: less, equal, greater
Ternary B-Tree
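– Indexes of next node to search
– Indexes stored at end of strings ("terminators")
– Example node with keys tie, tier, tree and child indexes 0 to 3:
  0: x < tie
  1: tie < x < tier
  2: tier < x < tree
  3: tree < x
[Figure: the same node drawn as a trie, with the child indexes attached to the string ends]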
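The less/equal/greater decision inside one node can be sketched as below. Note the substitution: for brevity the sketch keeps the node's keys in a sorted Python list and uses binary search, whereas the slides store them as a trie with terminators. The function name search_node and its return shape are hypothetical.

```python
from bisect import bisect_left

def search_node(keys, x):
    """Return ("equal", i) if x == keys[i]; otherwise ("child", j), where j
    is the index of the child node in which the search continues."""
    pos = bisect_left(keys, x)          # number of keys strictly less than x
    if pos < len(keys) and keys[pos] == x:
        return ("equal", pos)
    return ("child", pos)

keys = ["tie", "tier", "tree"]          # keys of the example node

for probe in ["the", "tied", "tier", "tiger", "try"]:
    print(probe, search_node(keys, probe))
```

The child index equals the number of keys smaller than the probe, which reproduces the ranges 0: x < tie up to 3: tree < x.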
Triple Decomposition
S: http://example.org/book/book1
P: http://purl.org/dc/elements/1.1/title
O: "Web Handbook"
http://e xample.o rg/book/ book1000
http://p url.org/
...
– Triples decomposed into parts (8 bytes)
– Comparing parts, not single characters
– Each element padded with zeros
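The decomposition can be sketched as follows. The 8-byte part size and the zero padding come from the slide; the UTF-8 encoding, the use of zero bytes as the padding value, and the function name decompose are assumptions made for the example.

```python
PART_SIZE = 8  # bytes per part, as stated on the slide

def decompose(element, part_size=PART_SIZE):
    """Split one triple element into fixed-size parts, padding the last
    part with zero bytes (assumption: elements are UTF-8 encoded)."""
    raw = element.encode("utf-8")
    padded_len = -(-len(raw) // part_size) * part_size   # round up to a multiple
    raw = raw.ljust(padded_len, b"\x00")
    return [raw[i:i + part_size] for i in range(0, len(raw), part_size)]

parts = decompose("http://example.org/book/book1")
print(parts)
```

Fixed-size parts can be compared as whole units, for example as byte strings or as 64-bit integers in big-endian order, instead of character by character.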
Triple Arrangement
Physical Storage
Index  Dir.  Data
0      1     *http://
1      2     example.
2      3     org/prop
3      4     erty
4      6     *http://
5      7     *_:0
6      8     example.
7      9     *http://
8      10    org/reso
9      13    example.
10           urce1
11           urce2
12           urce3
...
– Directory: tree structure, prefix sum
– Data: breadth first
– Dir./Data separate
Traversal
Index  Dir.  Data
0      1     *http://
1      2     example.
2      3     org/prop
3      4     erty
4      6     *http://
5      7     *_:0
6      8     example.
7      9     *http://
8      10    org/reso
9            example.
– dirIndex = dataIndex + i + 1
– dataIndex = dir[dirIndex - 1]
– Trace:
  dirIndex  dataIndex
  0         0
  1         1
  2         2
  3         3
  4         4
  6         7
  8         9
– next dirIndex = 4 + 1 + 1 = 6
– next dataIndex = dir[5] = 7
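The two traversal formulas can be replayed on the arrays from the slide. This is a sketch under one assumption about the notation: i is read as the position of the chosen entry within the current group of siblings, so dir[dataIndex + i] is the data index of that entry's first child. The function names descend and walk are hypothetical.

```python
# Directory and data arrays as far as they are shown on the slide.
dir_ = [1, 2, 3, 4, 6, 7, 8, 9, 10]
data = ["*http://", "example.", "org/prop", "erty", "*http://", "*_:0",
        "example.", "*http://", "org/reso", "example."]

def descend(data_index, i):
    """Apply the slide's formulas once and return (dirIndex, dataIndex)."""
    dir_index = data_index + i + 1
    return dir_index, dir_[dir_index - 1]

def walk(choices, start=0):
    """Follow a list of sibling choices and record every index pair."""
    path = [(0, start)]
    data_index = start
    for i in choices:
        dir_index, data_index = descend(data_index, i)
        path.append((dir_index, data_index))
    return path

trace = walk([0, 0, 0, 0, 1, 0])
print(trace)
print([data[d] for _, d in trace])
```

With the choices 0, 0, 0, 0, 1, 0 the recorded pairs match the dirIndex/dataIndex trace on the slide, including the step dirIndex = 4 + 1 + 1 = 6 and dataIndex = dir[5] = 7.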
Conclusion
– Semantic Web data management
– String-based approach
– Basis: ternary B-tree
– Next steps: query processing and performance
  – implementation
  – benchmarking
Thank you for your attention.
Questions?