Wolf-Tilo Balke
Sascha Tönnies
Institut für Informationssysteme
Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de
Peer-to-Peer
Overview
• Why Peer-to-Peer Databases?
– Federation
– Information integration
– Sensor networks
• P2P Databases
– Challenges
– Design Dimensions
• Existing P2P Database systems
– Edutella: focus on expressivity
– PIER: focus on scalability
– Piazza: focus on integration
1 Motivation
• Peer-to-peer data management might need some database-like functionality
– Complex queries over possibly large volumes of data
• Examples of applications include
– Federation of sources
– Information integration
– Sensor networks
1.1 Federation of similar data providers
• Examples
– (Digital) Libraries
– Primary Scientific Data Providers (Gene Databases)
– News Providers
• All nodes offer the same kind of information
• Homogeneous network (fixed schema)
1.2 Information Integration
• Examples
– Find German professors who have published at least three papers at the Conference on Very Large Databases
– Find an introductory database book in German, written by a German professor
– Find all recordings of Mozart's "Magic Flute" with conductors who have also conducted the Berliner Philharmoniker
• Very tedious to find with current search engines
• Needs database-like querying capabilities
• Heterogeneous network
1.3 Sensor Networks
• Examples
– Network Monitoring: network maps, event detection, ...
– Car Traffic Monitoring
• Huge number of nodes
• Small amount of data
2.1 Challenges of Schema-Based P2P Networks
• Multi-Dimensional Search Space
– DHTs only work for one dimension (one attribute)
• Schema Heterogeneity
– Sources use different database schemas for similar information
• Potentially large result sets
– SELECT * FROM Firewalls.BlockedPackets ...
– Range and Aggregate Queries
• And the usual P2P challenges...
– Trust
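The one-dimension limitation of DHTs can be illustrated with a minimal sketch. It assumes a toy "DHT" that assigns a key to one of 16 nodes by hashing; `node_for` and `nodes_for_range` are hypothetical helpers for illustration, not the API of any real DHT.

```python
import hashlib


def node_for(key: str, num_nodes: int) -> int:
    """Map a key to a node id by hashing (stand-in for a real DHT lookup)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes


def nodes_for_range(low: int, high: int, num_nodes: int) -> set:
    """A range query has to probe every value in the range separately."""
    return {node_for(str(value), num_nodes) for value in range(low, high + 1)}


# Exact match on the hashed attribute: one lookup, one responsible node.
owner = node_for("1948", num_nodes=16)

# Range 1916..1959: hashing destroys order, so neighboring values
# end up on unrelated nodes and many nodes must be contacted.
touched = nodes_for_range(1916, 1959, num_nodes=16)
print(owner, sorted(touched))
```

Exact-match lookups on the hashed attribute stay cheap, while range queries and queries on any other attribute need extra overlays or a different index.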
2.2 Design Dimensions
• Network Properties
– Data Placement
– Topology and Routing
• Data Access
– Data Model
– Query Language
• Integration Mechanism
– Mapping Representation
– Mapping Creation
– Integration Method
2.2 Data Placement
• Placement according to ownership
– Data stays at information source
– Full control of data by owner (access policy, availability, etc.)
– More autonomy of single nodes
• Placement according to search strategy
– Data is distributed according to later access mechanism (e.g., DHT)
– No control over data access
– More freedom to optimize query routing
• Additional caching/replication possible
2.2 Topology and Routing (1)
• Unstructured Networks
– Flooding as routing algorithm
– Supports arbitrarily expressive queries
– Agnostic to schema heterogeneity
– Inefficient (filtered flooding can help)
• Short-cut networks
– Unstructured, but continuously optimize network connections
– Can develop into regular structures like Small-World networks
– Clustering & filtered flooding reduce query distribution traffic
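To make the cost of flooding concrete, here is a minimal sketch of TTL-limited flooding on a small unstructured network. The five-peer graph and the `flood` helper are assumptions made for illustration. As a simplification, every peer forwards to all of its neighbors, including the one the query came from.

```python
from collections import deque


def flood(graph, start, ttl):
    """Return (peers reached, messages sent) for TTL-limited flooding."""
    reached = {start}
    messages = 0
    queue = deque([(start, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL used up: this peer does not forward any further
        for neighbor in graph[peer]:
            messages += 1  # every forward costs one message
            if neighbor not in reached:  # duplicates are dropped on arrival
                reached.add(neighbor)
                queue.append((neighbor, remaining - 1))
    return reached, messages


# Hypothetical unstructured network of five peers.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

reached, messages = flood(graph, "A", ttl=2)
print(sorted(reached), messages)
```

Even in this tiny network a query costs more messages than there are peers, because duplicates are sent and then discarded. Filtered flooding cuts this down by forwarding only to neighbors that can contribute to the query.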
2.2 Topology and Routing (2)
• Super-peer networks
– Inherit advantages and disadvantages of unstructured networks
– Better efficiency and scaling (but still flooding)
– Good match to distributed databases (super-peers become mediators)
• DHT Networks
– Create separate overlay for each attribute
– Or use multidimensional DHTs, e.g. Mercury
– Limited query expressivity
2.2 Topology and Routing - Summary
• Local indexing
– No knowledge about other peers
– Doesn't scale
• Central indexing
– One node holds complete index
– Single point of control (and failure)
• Distributed indexing
– Distributed Hash Tables
– Filtered Flooding
– Short-cut networks
– Super-peer networks
2.2 Data Model
• Fixed set of attributes
– Allows for sophisticated topologies
– Inflexible
– Applicability: custom applications
• Relational model
– Usual database model
– Not designed for distribution
• XML
– Semi-structured data
• RDF
– Semantic Web exchange format
2.2 Query Language
• None
– Fixed set of parameterized queries
• Relational query language
– Always a subset of SQL
• XML query language
– XPath or XQuery
• RDF Query Language
– SPARQL or its predecessors
2.2 Mapping Representation
• Declarative
– Translation between schema elements
– Distributed database approaches applicable
• Procedural
– Imperative description of how to translate/transform queries and data
• Mapping characteristics
– Unidirectional or bidirectional
– Simple (one-to-one) mappings or complex mappings
• Mapping of objects
2.2 Mapping Creation
• Manual
– Users create mappings
– Network distributes mappings and uses them for translation
• Semi-automatic
– System proposes mappings, based on heuristics (attribute name, similar data)
– User feedback used to validate created mappings
• Automatic
– E.g., probabilistic mapping
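The attribute-name heuristic of the semi-automatic approach can be sketched as follows. The schemas, the 0.6 threshold and the `propose_mappings` helper are assumptions chosen for illustration. String similarity comes from `difflib.SequenceMatcher` in the Python standard library, and the proposals would still go to a user for validation.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def propose_mappings(source_attrs, target_attrs, threshold=0.6):
    """Propose (source, target, score) pairs based on attribute names only."""
    proposals = []
    for source in source_attrs:
        best = max(target_attrs, key=lambda target: similarity(source, target))
        score = similarity(source, best)
        if score >= threshold:  # weak candidates are not proposed at all
            proposals.append((source, best, round(score, 2)))
    return proposals


# Hypothetical schemas of two peers.
proposals = propose_mappings(
    ["Title", "Language", "Date"],
    ["doc_title", "lang", "pub_date", "format"],
)
for source, target, score in proposals:
    print(f"{source} -> {target} (score {score})")
```

A name-based heuristic alone is easily fooled by synonyms or abbreviations, which is why the slide also lists similar data as a second heuristic and user feedback as validation.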
2.2 Integration Mechanism
• Query Rewriting
– Query is translated to target schema
– Data is translated back to source schema
– Most common approach
• Data Rewriting
– Data is replicated to source schema
– Only feasible for small data sets
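Query rewriting with a simple one-to-one mapping can be sketched like this. The attribute names and the dictionary-based query representation are assumptions for illustration. Real systems rewrite full query expressions and also handle complex mappings.

```python
def rewrite_query(query: dict, mapping: dict) -> dict:
    """Translate the attribute names of a conjunctive query to the target schema."""
    return {mapping[attr]: value for attr, value in query.items()}


def rewrite_result(row: dict, mapping: dict) -> dict:
    """Translate a result row from the target schema back to the source schema."""
    inverse = {target: source for source, target in mapping.items()}
    return {inverse[attr]: value for attr, value in row.items()}


# Hypothetical one-to-one mapping: source schema attribute -> target schema attribute.
mapping = {"Title": "doc_title", "Language": "lang"}

target_query = rewrite_query({"Language": "de"}, mapping)
source_row = rewrite_result({"doc_title": "Datenbanken", "lang": "de"}, mapping)
print(target_query, source_row)
```

Only the query and the result rows are translated, so the data itself never has to be copied into another schema.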
2.2 Existing Systems - Typology
• Focus on network scalability
– homogeneous schema
– low query expressivity
– DHT as underlying network structure
• Focus on expressivity
– super-peer or unstructured
– unlimited query complexity
• Focus on integration
– typically unstructured
2.2 Existing Systems – Overview
(List not complete)

Focus         Name      Topology        Data Placement  Data Model  Query Language
Scalability   PIER      DHT (Bamboo)    Distributed     Relational  SQL subset
              RDFPeers  DHT (MAAN)      Distributed     RDF         -
              Mercury   DHT (Symphony)  Distributed     Tuples      -
Expressivity  SQPeer    Super-peer      Owner           RDF         RQL
              PeerDB    Unstructured    Owner           Relational  SQL subset
              Edutella  Super-peer      Owner           RDF         datalog (SQL)
Integration   Piazza    Unstructured    Owner           XML         XQuery subset
              GridVine  DHT (P-Grid)    Distributed     RDF
3.1 Edutella: Introduction
• Initial goal: achieve interoperability between heterogeneous metadata-driven (e-learning) systems
• Provides metadata only, not the resources
– Resources are fetched via HTTP
• Query Examples
– “Find software engineering course lecture notes for undergraduates in German”
– “Find an introduction to Enterprise Java Beans for professionals”
– “Find a course in software requirements analysis from a Swedish university”
3.1 Query Service
• Provides standardized query/retrieval of RDF metadata stored in distributed RDF repositories
• Query Exchange Language
– Based on Datalog (allows expression of rules)
– RDF syntax
– For exchange only
•
Adapters to enable QEL (query exchange
3.1 Query processing
• Parsers/Formatters convert between query languages
• Applications and backends are shielded from communication layer
• Query messages are exchanged in RDF/XML format
• Wrappers available for SQL, RDQL, RQL, and others
[Figure: consumer application (app-specific format) → Edutella consumer interface with query parser (EQM) → QEL over the P2P network → Edutella provider interface with query formatter (EQM) → provider back-end/repository (repository-specific format)]
3.1 Edutella Topology
• Super-Peers
• Content Providers
• Content Consumers
• Use filtered flooding in super-peer backbone
• HyperCuP topology for backbone
3.1 Cayley Graphs
• Graph representing a permutation group G, described by a set of generators
– Regular, vertex-symmetric, recursively decomposable
– Optimal routing and broadcast algorithms exist
[Figure: example Cayley graphs, including one whose vertices are the permutations of 1234]
3.1 Super-peer Topology: HyperCuP
[Figure: eight super-peers SP1-SP8 arranged as a hypercube with edge dimensions 0, 1 and 2]
• Super-peers are arranged as a hypercube
• Broadcast needs n-1 messages and log2(n) hops
• High connectivity, resilient against node failures
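The message and hop counts for a hypercube broadcast can be checked with a small simulation. This is a minimal sketch under stated assumptions: a complete hypercube with n = 2^k super-peers labeled by k-bit integers, and a forwarding rule where a node that received the message on dimension d only forwards it on dimensions above d. It is not Edutella's actual implementation.

```python
def hypercube_broadcast(dimensions: int, origin: int = 0):
    """Simulate a broadcast; return (messages sent, maximum hops, nodes reached)."""
    hops = {origin: 0}
    messages = 0
    # Each entry: (node, lowest dimension this node may still forward on).
    frontier = [(origin, 0)]
    while frontier:
        node, first_dim = frontier.pop()
        for dim in range(first_dim, dimensions):
            neighbor = node ^ (1 << dim)  # flip one bit = follow one hypercube edge
            messages += 1
            hops[neighbor] = hops[node] + 1
            frontier.append((neighbor, dim + 1))
    return messages, max(hops.values()), len(hops)


# Eight super-peers, i.e. a 3-dimensional hypercube.
print(hypercube_broadcast(3))
```

Because each node forwards only along dimensions it has not yet covered, every super-peer receives the message exactly once, which gives n-1 messages in total and at most log2(n) hops.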
3.1 Super-Peer-based Query Routing
• Database fragment summaries
• Index structure and maintenance
• Query Routing
3.1 Peer Fragment Summaries
Peer1.Doc
Identifier  Title                  Date  Format  Language
521354021   Csdoi sdofi sfi sfdsf  1948  Book    de
593574021   Deor aodfi sdfwe dls   1952  Book    de
534536021   Toid sdofij cvcdova    1937  Book    de
528943021   Csdo asofdi weor       1916  Book    de
529874521   Epodsf csmieo mo       1924  Book    de
526983221   Awer fzwe xhzpwf       1959  Book    de

Peer2.Doc
Identifier  Title                Date  Language  Coverage
1861978766  Eoite odsifj woifj   1993  en        Scotland
1394875966  Oewr svonwe          2005  en        Wales
1817305606  Psadoifh sdafns dsf  1999  en        York
1809239086  Vsd sdfokj sfew      2001  en        West Midlands
1345398705  Wdfj vspo sdfp dort  1989  en        London
Peer1
Doc.Identifier
Doc.Title
Doc.Date[1916-1959]
Doc.Format [Book]
Doc.Language[de]
Peer2
Doc.Identifier
Doc.Title
Doc.Date[1989-2005]
Doc.Language[en]
Doc.Coverage[UK]
[Figure: super-peers SP1-SP4 connected by edges of dimension 0 and 1, with attached peers P1-P4]
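How a peer could derive such a summary from its rows can be sketched as follows. The `summarize` helper and its parameters are assumptions for illustration. The rows reuse the Date, Format and Language values of Peer1.Doc from this slide. Generalizing values, such as mapping regions to "UK", is not covered.

```python
def summarize(rows, range_attrs=("Date",), max_values=3):
    """Summarize rows: (min, max) for range attributes, small value sets otherwise."""
    summary = {}
    attrs = {attr for row in rows for attr in row}
    for attr in sorted(attrs):
        values = [row[attr] for row in rows if attr in row]
        if attr in range_attrs:
            summary[attr] = (min(values), max(values))
        else:
            distinct = set(values)
            # Attributes with many distinct values are listed by name only (None).
            summary[attr] = distinct if len(distinct) <= max_values else None
    return summary


# Date, Format and Language columns of Peer1.Doc.
peer1_rows = [
    {"Date": 1948, "Format": "Book", "Language": "de"},
    {"Date": 1952, "Format": "Book", "Language": "de"},
    {"Date": 1937, "Format": "Book", "Language": "de"},
    {"Date": 1916, "Format": "Book", "Language": "de"},
    {"Date": 1924, "Format": "Book", "Language": "de"},
    {"Date": 1959, "Format": "Book", "Language": "de"},
]

peer1_summary = summarize(peer1_rows)
print(peer1_summary)
```

The summary is much smaller than the data and only changes when a new value falls outside the recorded range or value set, which keeps index maintenance cheap.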
3.1 Super-peer / Peer Indices
Super-Peer1 SP/P Index
Doc.Identifier        P1, P2
Doc.Title             P1, P2
Doc.Date [1916-1959]  P1
         [1989-2005]  P2
Doc.Format [Book]     P1
Doc.Language [de]     P1
             [en]     P2
Doc.Coverage [UK]     P2
Peer1 Summary
Doc.Identifier
Doc.Title
Doc.Date[1916-1959]
Doc.Format [Book]
Doc.Language[de]
Peer2 Summary
Doc.Identifier
Doc.Title
Doc.Date[1989-2005]
Doc.Language[en]
Doc.Coverage[UK]
3.1 Super-Peer Fragment Summaries
Doc
Identifier  Title                  Date  Format  Language  Coverage
521354021   Csdoi sdofi sfi sfdsf  1948  Book    de
593574021   Deor aodfi sdfwe dls   1952  Book    de
534536021   Toid sdofij cvcdova    1937  Book    de
528943021   Csdo asofdi weor       1916  Book    de
529874521   Epodsf csmieo mo       1924  Book    de
526983221   Awer fzwe xhzpwf       1959  Book    de
1861978766  Eoite odsifj woifj     1993          en        Scotland
1394875966  Oewr svonwe            2005          en        Wales
1817305606  Psadoifh sdafns dsf    1999          en        York
1809239086  Vsd sdfokj sfew        2001          en        West Midlands
1345398705  Wdfj vspo sdfp dort    1989          en        London
Super-Peer1
SP1 Summary
Doc.Identifier
Doc.Title
Doc.Date[1916-2005]
Doc.Format [Book]
Doc.Language[de, en]
Doc.Coverage[UK]
3.1 Super-peer/Super-peer Indices
• Naive forwarding is not optimal
[Figure: super-peers SP1-SP4 connected by edges of dimension 0 and 1]
SP1 Summary
Doc.Identifier
Doc.Title
Doc.Date[1916-2005]
Doc.Format [Book]
Doc.Language[de, en]
Doc.Coverage[UK]
Super-Peer2 SP/SP Index
…
Doc.Language [de]  SP1
             [en]  SP1
…
Super-Peer3 SP/SP Index
…
Doc.Language [de]  SP1
             [en]  SP1
…
Super-Peer4 SP/SP Index
…
Doc.Language [de]  SP2, SP3
             [en]  SP2, SP3
…
Super-Peer4 SP/SP Index
…
Doc.Language [de]  SP2
             [en]  SP2
…
3.1 Super-peer/Super-peer Indices
[Figure: super-peers SP1-SP4 connected by edges of dimension 0 and 1, with attached peers P1-P4]
SP1 Summary
…
Doc.Language[de, en]
…
• Take edge dimension into account
• Forward SP/SP index entries only along lower edges
Super-Peer3 SP/SP Index
…
Doc.Language [de]  SP1 (1)
             [en]  SP1 (1)
…
Super-Peer2 SP/SP Index
…
Doc.Language [de]  SP1 (0)
             [en]  SP1 (0)
…
Super-Peer4 SP/SP Index
…
Doc.Language [de]  SP3 (0)
             [en]  SP3 (0)
…
Super-Peer4 SP/SP Index
…
Doc.Language [de]
             [en]
…
3.1 Query Routing
Super-Peer3 SP/SP Index
…
Doc.Language [de]  SP1 (1)
             [en]  SP1 (1)
…
Super-Peer4 SP/SP Index
…
Doc.Language [de]  SP3 (0)
             [en]  SP3 (0)
…
• Use SP/P and SP/SP indices as filters
SELECT * FROM Doc WHERE Language='de' AND …
Super-Peer1 SP/P Index
…
Doc.Language [de]  P1
             [en]  P2
…
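Using an index as a filter can be sketched as a set intersection over the query's conditions. The nested-dictionary index layout and the `route` helper are assumptions for illustration. The entries mirror the Super-Peer1 SP/P index shown on these slides. The sketch assumes a non-empty, purely conjunctive query with equality conditions.

```python
def route(index: dict, query: dict) -> set:
    """Return the neighbors whose index entries match every condition of the query."""
    candidates = None
    for attr, value in query.items():
        matching = index.get(attr, {}).get(value, set())
        candidates = matching if candidates is None else candidates & matching
    return candidates or set()


# Super-Peer1 SP/P index: attribute -> value -> peers holding matching data.
sp1_peer_index = {
    "Language": {"de": {"P1"}, "en": {"P2"}},
    "Format": {"Book": {"P1"}},
}

german = route(sp1_peer_index, {"Language": "de"})
english_books = route(sp1_peer_index, {"Language": "en", "Format": "Book"})
print(german, english_books)
```

The same lookup works for SP/SP indices, with neighboring super-peers in place of peers. A query is forwarded only to the returned neighbors and is dropped if the set is empty.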
3.1 Application: P2P Digital Library Network
• Large number of individual digital libraries (DLs)
• Autonomous institutions
• Users have to
– find relevant DLs
– search separately in every DL found
• Violates the 4th law of Library Science
– “Save the time of the reader” (Ranganathan, 1931)
3.1 DL Search Engine Solution
• Search engine approach
– "Crawl" DLs
– Copy content
– Offer unified collection
• Issues
– Search engine controls content
– Proprietary interface (or just Web crawl)
– Difficult to preserve metadata
– Single point of failure
3.1 Open Archives Initiative Solution
• Standardize metadata "crawling" interface
– OAI-PMH (Protocol for Metadata Harvesting)
• Harvesters
– collect metadata from DLs
– offer search facilities
• Issues
– No single entry point
– Harvesters control content
– Points of failure
– Incentive for harvesters?
3.1 From OAI to P2P
• Create "peer wrapper" for existing DLs
[Figure: digital libraries with OAI-PMH interfaces attached as content providers to the super-peer backbone]
3.1 OAI-P2P – a Digital Library Network
• P2P approach:
– DLs form self-organized network
– User queries are distributed
• Advantages
– No dependency on service provider
– Each DL still controls its content
– No single point of failure
• 5th law of Library Science:
– “The library is a growing organism” (Ranganathan, 1931)
3.1 Edutella – Discussion
• Efficiently limits query distribution to relevant peers
• Very good scalability in terms of data size
– No data movement required
– Little index maintenance effort
• Flooding limits super-peer backbone scalability
– Will never scale to millions of peers
• Mainly query forwarding
– Initial extension to full query planning exists
3.2 PIER
• P2P Relational Database
• Foundation: any DHT
• Extended hash interface
– put(namespace, key, value)
– get(namespace, key)
– namespace/key combination is used as hash value (DHT key)
• Extended network capabilities
• Exploit DHT structure for broadcast
[Figure: DHT with node identifiers 0-15]
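The extended hash interface listed above can be sketched with a single-process stand-in for the DHT. `LocalDht` and the use of SHA-1 over the namespace/key pair are assumptions for illustration, not PIER's actual implementation. In a real deployment the DHT key would determine which node stores the value.

```python
import hashlib


class LocalDht:
    """Single-process stand-in for a DHT with a put/get(namespace, key) interface."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def dht_key(namespace: str, key: str) -> str:
        """Derive the DHT key from the namespace/key combination."""
        data = f"{namespace}\x00{key}".encode("utf-8")
        return hashlib.sha1(data).hexdigest()

    def put(self, namespace: str, key: str, value) -> None:
        # Several values may share one namespace/key, e.g. tuples with equal join keys.
        self._store.setdefault(self.dht_key(namespace, key), []).append(value)

    def get(self, namespace: str, key: str) -> list:
        return list(self._store.get(self.dht_key(namespace, key), []))


dht = LocalDht()
dht.put("Doc", "521354021", {"Date": 1948, "Language": "de"})
print(dht.get("Doc", "521354021"))
print(dht.get("Other", "521354021"))
```

Using the namespace (for example a relation name) as part of the hash input keeps tuples of different relations apart even when their keys collide.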