Peer-to-Peer Data Management

(1)

Wolf-Tilo Balke

Sascha Tönnies

Institut für Informationssysteme

Technische Universität Braunschweig

http://www.ifis.cs.tu-bs.de

Peer-to-Peer

(2)

Overview

• Why Peer-to-Peer Databases?

–

Federation

–

Information integration

–

Sensor networks

• P2P Databases

–

Challenges

–

Design Dimensions

• Existing P2P Database systems

–

Edutella: focus on expressivity

–

PIER: focus on scalability

(3)

1 Motivation

• Peer-to-peer data management might need some

database-like functionality

–

Complex queries over possibly large volumes of data

• Examples

of applications include

–

Federation of sources

–

Information integration

–

Sensor networks

(4)

1.1 Federation of similar data providers

• Examples

–

(Digital) Libraries

–

Primary Scientific Data Providers

(Gene Databases)

–

News Providers

• All nodes offer the same kind of information

• Homogeneous network (fixed schema)

(5)

1.2 Information Integration

• Examples

–

Find German professors having published at

least three papers at the Conference on

Very Large Databases

–

Find introductory database book in German,

written by a German professor

–

Find all recordings of Mozart's 'Magic Flute' with conductors who have also conducted the Berliner Philharmoniker

• Very tedious to find with current search engines

• Needs database-like querying capabilities

• Heterogeneous network

(6)

1.3 Sensor Networks

• Examples

–

Network Monitoring:

• network maps

• event detections

• ...

–

Car Traffic Monitoring

• Huge number of nodes

• Small amount of data

(7)

Overview

1. Why Peer-to-Peer Databases?

1. Federation

2. Information integration

3. Sensor networks

2. P2P Databases

1. Challenges

2. Design Dimensions

3. Existing P2P Database systems

1. Edutella: focus on expressivity

2. PIER: focus on scalability

3. Piazza: focus on integration

(8)

2.1 Challenges of Schema-Based P2P Networks

• Multi-Dimensional Search Space

– DHTs only work for one dimension (one attribute)

• Schema Heterogeneity

– Sources use different database schemas for similar

information

• Potentially large result sets

– SELECT * FROM Firewalls.BlockedPackets ...

– Range and Aggregate Queries

• And the usual P2P challenges...

– Trust

(9)

2.2 Design Dimensions

• Network Properties

–

Data Placement

–

Topology and Routing

• Data Access

–

Data Model

–

Query Language

• Integration Mechanism

–

Mapping Representation

–

Mapping Creation

–

Integration Method

(10)

2.2 Data Placement

• Placement according to ownership

–

Data stays at information source

–

Full control of data by owner (access policy, availability, etc.)

–

More autonomy of single nodes

• Placement according to search strategy

–

Data is distributed according to later access mechanism

(e.g., DHT)

–

No control over data access

–

More freedom to optimize query routing

• Additional caching/replication possible

(11)

2.2 Topology and Routing (1)

• Unstructured Networks

–

Flooding as routing algorithm

–

Supports arbitrary expressive queries

–

Agnostic to schema heterogeneity

–

Inefficient (filtered flooding can help)

• Short-cut networks

–

Unstructured, but continuously optimize network connections

–

Can develop into regular structures like Small-World networks

–

Clustering and filtered flooding reduce query distribution traffic

(12)

2.2 Topology and Routing (2)

• Super-peer networks

–

Inherit the advantages and disadvantages of unstructured networks

–

Better efficiency and scaling (but still flooding)

–

Good match to distributed databases (super-peers

become mediators)

• DHT Networks

–

Create separate overlay for each attribute

• Or use Multidimensional DHTs, e.g. Mercury

–

Limited query expressivity

(13)

2.2 Topology and Routing - Summary

• Local indexing

– No knowledge about other peers

– Doesn't scale

• Central indexing

– One node holds complete index

– Single point of control (and failure)

• Distributed indexing

– Distributed Hash Tables

– Filtered Flooding

– Short-cut networks

– Super-peer networks

(14)

2.2 Data Model

• Fixed set of attributes

–

Allows for sophisticated topologies

–

Inflexible

–

Applicability: custom applications

• Relational model

–

Usual database model

–

Not designed for distribution

• XML

–

Semi-structured data

• RDF

–

Semantic Web exchange format

(15)

2.2 Query Language

• None

–

Fixed set of parameterized queries

• Relational query language

–

Always subset of SQL

• XML query language

–

XPath or XQuery

• RDF Query Language

–

SPARQL or its predecessors

(16)

2.2 Mapping Representation

• Declarative

–

Translation between schema elements

–

Distributed database approaches applicable

• Procedural

–

Imperative description how to translate/transform queries and data

• Mapping characteristics

–

Unidirectional or Bidirectional

–

Simple (one-to-one) mapping or complex mappings

• Mapping of objects

(17)

2.2 Mapping Creation

• Manual

–

Users create mappings

–

Network distributes mappings and

uses them for translation

• Semi-automatic

–

System proposes mappings, based on heuristics

• attribute name

• similar data

–

User feedback used to validate created mappings

• Automatic

–

E.g., probabilistic mapping

(18)

2.2 Integration Mechanism

• Query Rewriting

–

Query is translated to target schema

–

Data is translated back to source schema

–

Most common approach

• Data Rewriting

–

Data is replicated to source schema

–

Only feasible for small data sets
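
Query rewriting can be sketched as follows. This is only an illustration: the attribute names (`dc_title`, `dc_language`), the sample rows and all function names are assumptions, not part of any of the systems discussed here.

```python
# Hypothetical one-to-one attribute mapping between a source and a target schema.
SOURCE_TO_TARGET = {"Title": "dc_title", "Language": "dc_language"}
TARGET_TO_SOURCE = {t: s for s, t in SOURCE_TO_TARGET.items()}


def rewrite_query(conditions):
    """Translate {attribute: value} conditions from the source to the target schema."""
    return {SOURCE_TO_TARGET[attr]: value for attr, value in conditions.items()}


def rewrite_result(row):
    """Translate a result row from the target schema back to the source schema."""
    return {TARGET_TO_SOURCE[attr]: value for attr, value in row.items()}


def answer(conditions, target_rows):
    """Evaluate a source-schema query against data kept in the target schema."""
    target_query = rewrite_query(conditions)
    hits = [row for row in target_rows
            if all(row.get(attr) == value for attr, value in target_query.items())]
    return [rewrite_result(row) for row in hits]


# Made-up data held by a peer that uses the target schema.
target_rows = [
    {"dc_title": "Intro to Databases", "dc_language": "de"},
    {"dc_title": "Distributed Systems", "dc_language": "en"},
]
result = answer({"Language": "de"}, target_rows)
```

The data never leaves the target peer's schema; only the query and the returned rows are translated, which is why this approach also works for large data sets.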

(19)

2.2 Existing Systems - Typology

• Focus on network scalability

–

homogeneous schema

–

low query expressivity

–

DHT as underlying network structure

• Focus on expressivity

–

super-peer or unstructured

–

unlimited query complexity

• Focus on integration

–

typically unstructured

(20)

2.2 Existing Systems – Overview

(List not complete)

Focus         Name      Topology        Data Placement  Data Model  Query Language
Scalability   PIER      DHT (Bamboo)    Distributed     Relational  SQL subset
              RDFPeers  DHT (MAAN)      Distributed     RDF         -
              Mercury   DHT (Symphony)  Distributed     Tuples      -
Expressivity  SQPeer    Super-peer      Owner           RDF         RQL
              PeerDB    Unstructured    Owner           Relational  SQL subset
              Edutella  Super-peer      Owner           RDF         datalog (SQL)
Integration   Piazza    Unstructured    Owner           XML         XQuery subset
              GridVine  DHT (P-Grid)    Distributed     RDF


(22)

3.1 Edutella: Introduction

• Initial Goal:

Achieve interoperability between

heterogeneous metadata-driven (e-learning) systems

• Provides metadata only, not the resources

–

Resources are fetched via http

• Query Examples

–

“Find software engineering course lecture notes for

undergraduates in German language”

–

“Find an introduction to Enterprise Java Beans for

professionals”

–

“Find a course in software requirements analysis from a

Swedish university”

(23)

3.1 Query Service

• Provides standardized query/retrieval of RDF metadata stored in distributed RDF repositories

• Query Exchange Language

–

Based on Datalog (allows expression of rules)

–

RDF syntax

–

For exchange only

• Adapters to enable QEL (query exchange

(24)

3.1 Query processing

• Parsers/Formatters convert between query languages

• Applications and backends are shielded from

communication layer

• Query messages are exchanged in RDF/XML format

• Wrappers available for SQL, RDQL, RQL, and others

[Figure: on the consumer side, an application passes queries in an application-specific format to the Edutella Consumer Interface, whose query parser converts them to EQM; queries cross the P2P network as QEL; on the provider side, the query formatter of the Edutella Provider Interface converts EQM to the repository-specific format of the back-end (repository).]

(25)

3.1 Edutella Topology

• Super-Peers

• Content Providers

• Content Consumers

• Use filtered

flooding in

super-peer

backbone

• HyperCuP

topology

for backbone

(26)

3.1 Cayley Graphs

• Graph representing a permutation group G, described by a set of generators

–

Regular, vertex-symmetric, recursively decomposable

–

Optimal routing and broadcast algorithms exist

[Figure: example Cayley graphs with labeled edges, including one whose vertices are permutations of 1234]

(27)

3.1 Super-peer Topology: HyperCuP

[Figure: eight super-peers SP₁ to SP₈ arranged as a hypercube with edge dimensions 0, 1 and 2]

• Super-peers are arranged as hypercube

• Broadcast needs n-1 messages, log₂(n) hops

• High connectivity, resilient against node failures
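
The message and hop counts for a hypercube broadcast can be checked with a small simulation. This is a sketch under assumptions: nodes are numbered 0 to n-1 so that neighbors differ in exactly one bit, and a node that receives the message over an edge of dimension d forwards it only over edges of lower dimension. The function name is hypothetical and the code is not taken from HyperCuP itself.

```python
def hypercube_broadcast(dim, origin=0):
    """Simulate a broadcast in a hypercube with 2**dim nodes.

    Returns (number of messages sent, {node: hops from origin}).
    """
    messages = 0
    hops = {origin: 0}
    # The origin behaves as if the message had arrived over dimension `dim`,
    # so it forwards over all dimensions 0 .. dim-1.
    frontier = [(origin, dim)]
    while frontier:
        next_frontier = []
        for node, arrived_on in frontier:
            for d in range(arrived_on):          # only lower dimensions
                neighbor = node ^ (1 << d)       # flip bit d
                messages += 1
                hops[neighbor] = hops[node] + 1
                next_frontier.append((neighbor, d))
        frontier = next_frontier
    return messages, hops


messages, hops = hypercube_broadcast(3)   # 8 super-peers
```

For 8 super-peers the simulation sends 7 messages and needs at most 3 hops; every node receives the message exactly once because no two forwarding paths lead to the same node.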

(28)

3.1 Super-Peer-based Query Routing

• Database fragment summaries

• Index structure and maintenance

• Query Routing

(29)

3.1 Peer Fragment Summaries

Peer1.Doc

Identifier  Title                  Date  Format  Language
521354021   Csdoi sdofi sfi sfdsf  1948  Book    de
593574021   Deor aodfi sdfwe dls   1952  Book    de
534536021   Toid sdofij cvcdova    1937  Book    de
528943021   Csdo asofdi weor       1916  Book    de
529874521   Epodsf csmieo mo       1924  Book    de
526983221   Awer fzwe xhzpwf       1959  Book    de

Peer2.Doc

Identifier  Title                Date  Language  Coverage
1861978766  Eoite odsifj woifj   1993  en        Scotland
1394875966  Oewr svonwe          2005  en        Wales
1817305606  Psadoifh sdafns dsf  1999  en        York
1809239086  Vsd sdfokj sfew      2001  en        West Midlands
1345398705  Wdfj vspo sdfp dort  1989  en        London

Peer1 Summary

Doc.Identifier
Doc.Title
Doc.Date[1916-1959]
Doc.Format[Book]
Doc.Language[de]

Peer2 Summary

Doc.Identifier
Doc.Title
Doc.Date[1989-2005]
Doc.Language[en]
Doc.Coverage[UK]
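
A fragment summary of this kind could be derived from a peer's rows roughly as follows. This is a minimal sketch: the function name and the split into range attributes and set attributes are assumptions, and the generalization of Coverage values such as Scotland or Wales to UK is left out because it would need a place hierarchy.

```python
def summarize(rows, range_attrs=("Date",), set_attrs=("Format", "Language")):
    """Build a summary: (min, max) for range attributes, value sets for the others."""
    summary = {}
    for attr in range_attrs:
        values = [row[attr] for row in rows if attr in row]
        if values:
            summary[attr] = (min(values), max(values))
    for attr in set_attrs:
        values = {row[attr] for row in rows if attr in row}
        if values:
            summary[attr] = values
    return summary


# Date, Format and Language columns of the two example fragments.
peer1_rows = [{"Date": d, "Format": "Book", "Language": "de"}
              for d in (1948, 1952, 1937, 1916, 1924, 1959)]
peer2_rows = [{"Date": d, "Language": "en"}
              for d in (1993, 2005, 1999, 2001, 1989)]

peer1_summary = summarize(peer1_rows)
peer2_summary = summarize(peer2_rows)
```

A super-peer summary can be obtained the same way by summarizing the rows (or merging the summaries) of all attached peers.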

(30)

3.1 Super-peer / Peer Indices

[Figure: super-peers SP₁ to SP₄ connected by edges of dimension 0 and 1, with attached peers P₁ to P₄]

Super-Peer1 SP/P Index

Doc.Identifier              P₁, P₂
Doc.Title                   P₁, P₂
Doc.Date      [1916-1959]   P₁
              [1989-2005]   P₂
Doc.Format    [Book]        P₁
Doc.Language  [de]          P₁
              [en]          P₂
Doc.Coverage  [UK]          P₂

Peer1 Summary

Doc.Identifier
Doc.Title
Doc.Date[1916-1959]
Doc.Format[Book]
Doc.Language[de]

Peer2 Summary

Doc.Identifier
Doc.Title
Doc.Date[1989-2005]
Doc.Language[en]
Doc.Coverage[UK]

(31)

3.1 Super-Peer Fragment Summaries

Doc

Identifier Title Date Format Language Coverage

521354021 Csdoi sdofi sfi sfdsf 1948 Book de

593574021 Deor aodfi sdfwe dls 1952 Book de

534536021 Toid sdofij cvcdova 1937 Book de

528943021 Csdo asofdi weor 1916 Book de

529874521 Epodsf csmieo mo 1924 Book de

526983221 Awer fzwe xhzpwf 1959 Book de

1861978766 Eoite odsifj woifj 1993 en Scotland

1394875966 Oewr svonwe 2005 en Wales

1817305606 Psadoifh sdafns dsf 1999 en York

1809239086 Vsd sdfokj sfew 2001 en West Midlands

1345398705 Wdfj vspo sdfp dort 1989 en London

Super-Peer1

SP1 Summary

Doc.Identifier

Doc.Title

Doc.Date[1916-2005]

Doc.Format [Book]

Doc.Language[de, en]

Doc.Coverage[UK]

(32)

3.1 Super-peer/Super-peer Indices

• Naively forwarding is not optimal

[Figure: super-peers SP₁ to SP₄ connected by edges of dimension 0 and 1]

SP1 Summary

Doc.Identifier
Doc.Title
Doc.Date[1916-2005]
Doc.Format[Book]
Doc.Language[de, en]
Doc.Coverage[UK]

Super-Peer2 SP/SP Index

…
Doc.Language  [de]  SP₁
              [en]  SP₁
…

Super-Peer3 SP/SP Index

…
Doc.Language  [de]  SP₁
              [en]  SP₁
…

Super-Peer4 SP/SP Index

…
Doc.Language  [de]  SP₂, SP₃
              [en]  SP₂, SP₃
…

Super-Peer4 SP/SP Index

…
Doc.Language  [de]  SP₂
              [en]  SP₂
…

(33)

3.1 Super-peer/Super-peer Indices

[Figure: super-peers SP₁ to SP₄ connected by edges of dimension 0 and 1]

SP1 Summary

…
Doc.Language[de, en]
…

• Take edge dimension into account

– Forward SP/SP index entries only along lower edges

Super-Peer3 SP/SP Index

…
Doc.Language  [de]  SP₁ (1)
              [en]  SP₁ (1)
…

Super-Peer2 SP/SP Index

…
Doc.Language  [de]  SP₁ (0)
              [en]  SP₁ (0)
…

Super-Peer4 SP/SP Index

…
Doc.Language  [de]  SP₃ (0)
              [en]  SP₃ (0)
…

Super-Peer4 SP/SP Index

…
Doc.Language  [de]
              [en]
…

(34)

3.1 Query Routing

[Figure: super-peers SP₁ to SP₄ connected by edges of dimension 0 and 1, with attached peers P₁ to P₄]

• Use SP/P and SP/SP indices as filters

SELECT * FROM Doc WHERE Language = "de" AND …

Super-Peer3 SP/SP Index

…
Doc.Language  [de]  SP₁ (1)
              [en]  SP₁ (1)
…

Super-Peer4 SP/SP Index

…
Doc.Language  [de]  SP₃ (0)
              [en]  SP₃ (0)
…

Super-Peer1 SP/P Index

…
Doc.Language  [de]  P₁
              [en]  P₂
…
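
Using the indices as filters might look like the sketch below. The dictionary layout, the `route` function and the restriction to equality conditions are assumptions for illustration; Edutella's actual index structures are richer than this.

```python
# Index entries copied from the example: (attribute, value) -> neighbors to forward to.
SP1_SP_P_INDEX = {
    ("Language", "de"): {"P1"},
    ("Language", "en"): {"P2"},
    ("Format", "Book"): {"P1"},
}
SP3_SP_SP_INDEX = {
    ("Language", "de"): {"SP1"},
    ("Language", "en"): {"SP1"},
}


def route(index, conditions):
    """Return the neighbors whose index entries match every equality condition."""
    candidates = set().union(*index.values())   # start with all known neighbors
    for attr, value in conditions.items():
        candidates &= index.get((attr, value), set())
    return candidates


# SELECT * FROM Doc WHERE Language = "de": SP3 forwards to SP1, SP1 forwards to P1 only.
next_hop_at_sp3 = route(SP3_SP_SP_INDEX, {"Language": "de"})
targets_at_sp1 = route(SP1_SP_P_INDEX, {"Language": "de"})
```

In this simplified form a condition on a value that is not in the index yields no targets at all, so the query is not forwarded.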

(35)

3.1 Application: P2P Digital Library Network

• Large amount of individual DLs

• Autonomous institutions

• Users have to

– find relevant DLs

– search separately on every found DL

• Violates 4th law of Library Science

– “Save the time of the reader”

(Ranganathan, 1931)


(36)

3.1 DL Search Engine Solution

• Search engine approach

– 'Crawl' DLs

– Copy Content

– Offer unified collection

• Issues

– Search engine controls content

– Proprietary interface

(or just Web crawl)

– Difficult to preserve metadata

– Single point of failure


(37)

3.1 Open Archive Initiative Solution

• Standardize metadata 'crawling' interface

– OAI-PMH (Protocol for

Metadata Harvesting)

• Harvesters

– collect metadata from DLs

– offer search facilities

• Issues

– No single entry point

– Harvesters control content

– Points of failure

– Incentive for Harvester?


(38)

3.1 From OAI to P2P

• Create 'peer wrapper' for existing DLs

[Figure: digital libraries with OAI-PMH interfaces act as content providers attached to the super-peer backbone]

(39)

3.1 OAI-P2P – a Digital Library Network

• P2P approach:

–

DLs form self-organized network

–

User queries are distributed

• Advantages

–

No dependency on service provider

–

Each DL still controls its content

–

No single point of failure

• 5th law of Library Science:

–

“The library is a growing organism”

(Ranganathan, 1931)


(40)

3.1 Edutella – Discussion

• Efficiently limits query distribution to relevant peers

• Very good scalability in terms of data size

–

No data movement required

–

Little index maintenance efforts

• Flooding limits super-peer backbone scalability

–

Will never scale to millions of peers

• Mainly query forwarding

–

Initial extension to full query planning exists

(41)

Overview

1. Why Peer-to-Peer Databases?

1. Federation

2. Information integration

3. Sensor networks

4. 'New' internet

2. P2P Databases

1. Challenges

2. Design Dimensions

3. Existing P2P Database systems

1. Edutella: focus on expressivity

2. PIER: focus on scalability

3. Piazza: focus on integration

(42)

3.2 PIER

• P2P Relational Database

• Foundation: any

DHT

• Extended hash interface

–

put(namespace, key, value)

–

get(namespace, key)

–

namespace/key combination is used as hash value (DHT

Key)

• Extended network capabilities

– Exploit DHT structure for broadcast

[Figure: spanning tree over DHT nodes 0 to 15]

(43)

3.2 Application: Phi

• Phi: Public Health for the Internet

– Monitor IP network state world-wide

–

Collect statistics

• Network traffic

• Latency

• …

(44)

3.2 Storing and Indexing Tuples

• Storing

–

Every tuple needs a synthetic tuple key

–

Choose combination of table name and tuple key as DHT

key

–

Insert complete tuple into DHT using this key

• Indexing

–

Additional attribute indexes are built by inserting

attribute value/tuple key pairs into the DHT

–

Choose combination of attribute name and attribute value

as DHT key

(45)

3.2 Example

• Sample database

– Doc(Id, Title, Date, Language)

– Author(DocId, PersonId)

– Person(Id, Name, Surname)

• Sample tuple: (456, 'Critique of Pure Reason', 1781, 'en')

• Storing

– put(Doc, 456, (456, 'Critique...', 1781, 'en'))

• Indexing on 'Title' and 'Date' attributes

– put(Doc.Title, 'Critique...', 456)

– put(Doc.Date, '1781', 456)
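
The storing and indexing steps can be tried out on a toy, single-process stand-in for a DHT. The class `ToyDHT`, the SHA-1 based node selection and the node count are assumptions made for this sketch; they are not PIER's implementation.

```python
import hashlib


class ToyDHT:
    """Single-process stand-in for a DHT: each 'node' is just a dictionary."""

    def __init__(self, num_nodes=16):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, namespace, key):
        # The namespace/key combination is hashed to pick the responsible node.
        digest = hashlib.sha1(f"{namespace}/{key}".encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, namespace, key, value):
        self._node_for(namespace, key).setdefault((namespace, key), []).append(value)

    def get(self, namespace, key):
        return self._node_for(namespace, key).get((namespace, key), [])


dht = ToyDHT()
doc = (456, "Critique of Pure Reason", 1781, "en")

# Storing: table name + tuple key form the DHT key, the value is the whole tuple.
dht.put("Doc", doc[0], doc)

# Indexing: attribute name + attribute value form the DHT key, the value is the tuple key.
dht.put("Doc.Title", doc[1], doc[0])
dht.put("Doc.Date", doc[2], doc[0])
```

Each indexed attribute costs one additional `put` per tuple, which is one reason inserts become expensive as more indexes are added.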

(46)

3.2 PIER Query Plans

• DHT-Scan

–

Use index to retrieve tuple key(s)

–

Use key(s) to retrieve data tuple(s)

• Example

– SELECT Id, Title FROM Doc WHERE Date = '1781' AND Language = 'en'

• Each peer can create a query plan

• One DHT lookup per result tuple

• Filtering has to be done at the query originator

Query plan: dht-scan (Doc, Date='1781') → filter (Language='en') → project ({Id, Title})
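
The plan of scan, filter and project can be sketched on top of a dictionary that stands in for the DHT. All names here (`store`, `put`, `get`, `dht_scan`, `run_query`) and the two extra sample tuples are assumptions for illustration, not PIER's operators.

```python
from collections import defaultdict

store = defaultdict(list)                 # stand-in for the DHT
COLUMNS = ("Id", "Title", "Date", "Language")


def put(namespace, key, value):
    store[(namespace, key)].append(value)


def get(namespace, key):
    return list(store.get((namespace, key), []))


def insert_doc(doc):
    put("Doc", doc[0], doc)               # the tuple itself
    put("Doc.Date", doc[2], doc[0])       # index entry on Date


def dht_scan(table, attr, value):
    """Use the attribute index to find tuple keys, then fetch each tuple."""
    for tuple_key in get(f"{table}.{attr}", value):
        for row in get(table, tuple_key):         # one lookup per result tuple
            yield dict(zip(COLUMNS, row))


def run_query():
    scanned = dht_scan("Doc", "Date", 1781)
    filtered = (row for row in scanned if row["Language"] == "en")   # at the originator
    return [(row["Id"], row["Title"]) for row in filtered]           # projection


insert_doc((456, "Critique of Pure Reason", 1781, "en"))
insert_doc((457, "Kritik der reinen Vernunft", 1781, "de"))
insert_doc((458, "Critique of Practical Reason", 1788, "en"))
result = run_query()
```

Both tuples dated 1781 are fetched through the index, and the German one is discarded only afterwards, which illustrates why the filter runs at the query originator.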

(47)

3.2 Aggregate and Range Queries

• Example

–

SELECT COUNT(Id) FROM Doc WHERE Date > '1780' AND Date < '1790'

• Use spanning tree for broadcast

• Aggregate on return

[Figure: partial counts are combined along the spanning tree on the way back to the query originator]
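
Aggregation on the way back along the spanning tree can be illustrated with a recursive count. The `Node` class, the tree shape and the date values are made up for this sketch; a real system would exchange messages between peers instead of making recursive calls.

```python
class Node:
    """A peer in the broadcast spanning tree with its locally stored Date values."""

    def __init__(self, dates, children=()):
        self.dates = list(dates)
        self.children = list(children)


def count_in_range(node, low, high):
    """Count local matches and add the partial counts returned by the children."""
    local = sum(1 for date in node.dates if low < date < high)
    return local + sum(count_in_range(child, low, high) for child in node.children)


leaf_a = Node([1781, 1795])
leaf_b = Node([1785, 1788, 1770])
inner = Node([1783], [leaf_a, leaf_b])
root = Node([], [inner, Node([1789, 1790])])

# SELECT COUNT(Id) FROM Doc WHERE Date > 1780 AND Date < 1790
total = count_in_range(root, 1780, 1790)
```

Every node returns a single number to its parent, so the originator receives one partial count per child instead of all matching tuples.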

(48)

3.2 Join Queries

• Example

– Assume a Person tuple (789, 'Kant', 'Immanuel')

– SELECT Id, Title FROM Doc, Author WHERE Author.DocId = Doc.Id AND Author.PersonId = 789

• Approach:

Hierarchical Joins

–

Use spanning tree for broadcast

–

Do local select on peer table fragments

–

Do local join on each peer

• Improves load balancing

(49)

3.2 Hierarchical Joins

[Figure: Doc fragments D1 to D3 and Author fragments A1 to A3 are joined pairwise into partial results T11 to T33, which are combined hierarchically]

(50)

3.2 PIER - Discussion

• Real query planning

• Very efficient access to individual tuples and small

result sets

• Very good scalability in terms of network size

• Degrades to broadcast for many types of queries

–

Aggregate queries

–

Joins

• INSERT operation expensive (see P2P Information Retrieval)


(52)

3.3 Piazza

• Tackles problem of "reconciling different models of the world" (A. Halevy)

• Goal:

provide a uniform interface to a set of

autonomous data sources

• New abstraction layer over multiple sources

• Introduce mappings between 'world views'

–

Mapping rules are specified

manually by experts

(53)

3.3 Example – Publication Databases

(54)

3.3 Mapping Rules

• Datalog is used to specify

mapping rules

Mapping from UW to UCSD:

    UCSD:Member(projName, member) :-
        UW:Member(_, pid, member, _),
        UW:Project(pid, _, projName).

Mapping from UPenn to UCSD:

    UCSD:Member(projName, member) :-
        UPenn:Student(sid, member, _),
        UPenn:ProjMember(pid, sid),
        UPenn:Project(pid, projName, _).

    UCSD:Member(projName, member) :-
        UPenn:Faculty(sid, member, _),
        UPenn:ProjMember(pid, sid),
        UPenn:Project(pid, projName, _).
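
Read operationally, the UW-to-UCSD rule above is a join of UW:Member and UW:Project on pid, followed by a projection. The sketch below evaluates it over made-up tuples; the function name, the sample data and the positions of the ignored columns are assumptions for illustration.

```python
def ucsd_member_from_uw(uw_member, uw_project):
    """Derive UCSD:Member(projName, member) from UW:Member and UW:Project."""
    # UW:Project(pid, _, projName): collect project names per pid.
    names_by_pid = {}
    for pid, _, proj_name in uw_project:
        names_by_pid.setdefault(pid, []).append(proj_name)
    # UW:Member(_, pid, member, _): join on pid and keep (projName, member).
    return {(proj_name, member)
            for _, pid, member, _ in uw_member
            for proj_name in names_by_pid.get(pid, [])}


# Hypothetical UW data; the "-" fields are columns the rule ignores.
uw_member = [
    ("-", "p1", "Alice", "-"),
    ("-", "p1", "Bob", "-"),
    ("-", "p2", "Carol", "-"),
    ("-", "p9", "Dave", "-"),     # no matching project, so no derived tuple
]
uw_project = [("p1", "-", "Piazza"), ("p2", "-", "PIER")]

ucsd_member = ucsd_member_from_uw(uw_member, uw_project)
```

The two UPenn rules follow the same pattern with a three-way join over Student or Faculty, ProjMember and Project.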

(55)

3.3 Storing and Indexing

• Unstructured network

(Gnutella-like)

• Peer keeps its database

–

No exchange of data between peers

• Indexing

–

Only on schema level

–

Each peer maintains schema catalog of its neighbors

–

Mappings stored in central catalog (hybrid system)

• could be replaced by DHT

(56)

3.3 Query Routing

• Query Flooding

–

Peer translates query to

schema of neighbor (if possible)

–

Result tuples are

converted on way back

• Queries answered by

traversing semantic

paths

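[Figure: peers such as UCSD, UPenn, DBLP and CiteSeer linked by the mappings M(UW, UCSD), M(UW, Stanford), M(UCSD, UPenn) and M(Stanford, DBLP); queries Q1, Q3 and Q4 travel along these mappings]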

(57)

3.3 Piazza - Discussion

• Supports a world of multiple schemas (more realistic)

• Very expressive mapping mechanism

• Not scalable

–

Gnutella-like topology and flooding

• Piazza mapping technique could be applied to

other network infrastructures


(59)

3.4 HiSbase

• Specialized for distributed spatial data

• Application: astronomy data

–

Huge amounts of data (terabyte scale)

–

Region-based queries

–

Skewed data distribution

• Main ideas

–

Distribute data on peers by region

–

Use DHT for data access

–

Use neighbor-preserving hash

function (space-filling curve)
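
One common space-filling curve is the Z-order (Morton) curve, used here purely as an illustration of a neighbor-preserving hash. Both function names and the grid resolution are assumptions for this sketch, and the equal-length split of the curve among peers ignores data skew.

```python
def z_order(x, y, bits=16):
    """Interleave the bits of grid coordinates x and y into one curve position."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)         # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)     # y bits go to odd positions
    return code


def responsible_peer(x, y, num_peers, bits=4):
    """Assign contiguous, equally long curve segments to peers 0 .. num_peers-1."""
    return z_order(x, y, bits) * num_peers >> (2 * bits)


# Cells that are close in space tend to be close on the curve, hence on the same peer.
same_peer = responsible_peer(0, 0, 4) == responsible_peer(1, 1, 4)
```

With such a mapping a region query touches a small number of contiguous key ranges instead of keys scattered over the whole DHT, although cells on opposite sides of a quadrant border can still end up far apart on the curve.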

(60)

3.4 Load Distribution

• Use Quad-Tree structure to split data space into

equally loaded regions
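
A quad-tree split of this kind can be sketched as a recursion that divides a region until it holds at most `capacity` points. The function name, the capacity threshold and the sample points are assumptions; the sketch bounds the load per region rather than making loads exactly equal.

```python
def split(points, x0, y0, x1, y1, capacity, depth=0, max_depth=16):
    """Return a list of ((x0, y0, x1, y1), points) leaf regions.

    Points are expected to lie in the half-open square [x0, x1) x [y0, y1).
    """
    if len(points) <= capacity or depth == max_depth:
        return [((x0, y0, x1, y1), points)]
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    quadrants = [(x0, y0, mx, my), (mx, y0, x1, my),
                 (x0, my, mx, y1), (mx, my, x1, y1)]
    regions = []
    for qx0, qy0, qx1, qy1 in quadrants:
        inside = [(x, y) for x, y in points
                  if qx0 <= x < qx1 and qy0 <= y < qy1]
        regions += split(inside, qx0, qy0, qx1, qy1, capacity, depth + 1, max_depth)
    return regions


# Skewed sample data: four points in one corner, one point far away.
points = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (0.15, 0.3), (0.9, 0.9)]
regions = split(points, 0.0, 0.0, 1.0, 1.0, capacity=2)
```

Dense areas end up covered by many small regions and sparse areas by a few large ones, which is how skewed data can be spread across peers.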

(61)

3.4 Data Hashing

(62)

(63)

3.4 Query Processing

• Point query

–

Simple DHT access

• Region query

–

Route to arbitrary peer in range (e.g. using upper left

region boundary)

–

This peer acts as coordinator

–

Forward query to the peers of neighboring regions

• Until whole area is covered

(64)

3.4 HiSbase - Discussion

• Very efficient for

spatial queries

–

But only spatial queries possible

• Not completely self-organizing

(65)

3. P2P Database Networks – Summary

• Challenges

–

Multi-Dimensional Search Space

–

Schema Heterogeneity

–

Potentially large result sets

• Design Dimensions

–

Network Properties (Data Placement, Topology and Routing)

–

Data Access (Data Model, Query Language)

–

Integration Mechanism (Mapping Representation/Creation/Usage)

• P2P Database Types

–

Focus on high network scalability (e.g., PIER)

–

Focus on high query expressivity (e.g., Edutella)

–

Focus on information integration (e.g., Piazza)

(66)

3. Conclusion

• P2P databases already work

–

although immature compared to traditional database

technology

• One size does

not

fit all

–

Choose P2P database approach according to application

requirements

• Open problems

–

Load Balancing (Replication/Caching)

–

How to combine DHT and filtered flooding advantages

–

Reliability (probabilistic guarantees)