Peer-to-Peer Data Management

Academic year: 2021


(1)

Wolf-Tilo Balke

Sascha Tönnies

Institut für Informationssysteme

Technische Universität Braunschweig

http://www.ifis.cs.tu-bs.de

Peer-to-Peer

(2)

Overview

Why Peer-to-Peer Databases?

Federation

Information integration

Sensor networks

P2P Databases

Challenges

Design Dimensions

Existing P2P Database systems

Edutella: focus on expressivity

PIER: focus on scalability

(3)

1 Motivation

Peer-to-peer data management might need some database-like functionality

Complex queries over possibly large volumes of data

Examples of applications include

Federation of sources

Information integration

Sensor networks

(4)

1.1 Federation of similar data providers

Examples

(Digital) Libraries

Primary Scientific Data Providers (Gene Databases)

News Providers

All nodes offer the same kind of information

Homogeneous network (fixed schema)

(5)

1.2 Information Integration

Examples

Find German professors who have published at least three papers at the Conference on Very Large Databases

Find an introductory database book in German, written by a German professor

Find all recordings of Mozart's “Magic Flute” with conductors who also once conducted the Berliner Philharmoniker

Very tedious to find with current search engines

Needs database-like querying capabilities

Heterogeneous network

(6)

1.3 Sensor Networks

Examples

Network Monitoring:

network maps

event detections

...

Car Traffic Monitoring

Huge number of nodes

Small amount of data

(7)

Overview

1. Why Peer-to-Peer Databases?

1. Federation

2. Information integration

3. Sensor networks

2. P2P Databases

1. Challenges

2. Design Dimensions

3. Existing P2P Database systems

1. Edutella: focus on expressivity

2. PIER: focus on scalability

3. Piazza: focus on integration

(8)

2.1 Challenges of Schema-Based P2P Networks

Multi-Dimensional Search Space

– DHTs only work for one dimension (one attribute)

Schema Heterogeneity

– Sources use different database schemas for similar information

Potentially large result sets

– SELECT * FROM Firewalls.BlockedPackets ...

– Range and Aggregate Queries

And the usual P2P challenges...

– Trust

(9)

2.2 Design Dimensions

Network Properties

Data Placement

Topology and Routing

Data Access

Data Model

Query Language

Integration Mechanism

Mapping Representation

Mapping Creation

Integration Method

(10)

2.2 Data Placement

Placement according to ownership

Data stays at information source

Full control of data by owner (access policy, availability, etc.)

More autonomy of single nodes

Placement according to search strategy

Data is distributed according to later access mechanism

(e.g., DHT)

No control over data access

More freedom to optimize query routing

Additional caching/replication possible

(11)

2.2 Topology and Routing (1)

Unstructured Networks

Flooding as routing algorithm

Supports arbitrary expressive queries

Agnostic to schema heterogeneity

Inefficient (filtered flooding can help)

Short-cut networks

Unstructured, but continuously optimize network connections

Can develop into regular structures like Small-World networks

Clustering & filtered flooding reduce query distribution traffic

(12)

2.2 Topology and Routing (2)

Super-peer networks

Inherit advantages and disadvantages of unstructured networks

Better efficiency and scaling (but still flooding)

Good match to distributed databases (super-peers become mediators)

DHT Networks

Create separate overlay for each attribute

Or use Multidimensional DHTs, e.g. Mercury

Limited query expressivity

(13)

2.2 Topology and Routing - Summary

Local indexing

No knowledge about other peers

Central indexing

One node holds complete index

Distributed indexing

Distributed Hash Tables

Filtered Flooding

Short-cut networks

Super-peer networks

Doesn't scale

Single point of control (and failure)

(14)

2.2 Data Model

Fixed set of attributes

Allows for sophisticated topologies

Inflexible

Applicability: custom applications

Relational model

Usual database model

Not designed for distribution

XML

Semi-structured data

RDF

Semantic Web exchange format

(15)

2.2 Query Language

None

Fixed set of parameterized queries

Relational query language

Always subset of SQL

XML query language

XPath or XQuery

RDF Query Language

SPARQL or its predecessors

(16)

2.2 Mapping Representation

Declarative

Translation between schema elements

Distributed database approaches applicable

Procedural

Imperative description of how to translate/transform queries and data

Mapping characteristics

Unidirectional or Bidirectional

Simple (one-to-one) mapping or complex mappings

Mapping of objects

(17)

2.2 Mapping Creation

Manual

Users create mappings

Network distributes mappings and uses them for translation

Semi-automatic

System proposes mappings, based on heuristics

attribute name

similar data

User feedback used to validate created mappings

Automatic

E.g., probabilistic mapping

(18)

2.2 Integration Mechanism

Query Rewriting

Query is translated to target schema

Data is translated back to source schema

Most common approach

Data Rewriting

Data is replicated to source schema

Only feasible for small data sets
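
The query rewriting steps above can be sketched as follows. This is a minimal illustration, assuming a simple one-to-one attribute mapping; the attribute names and the functions `rewrite_query` and `rewrite_result` are hypothetical and not taken from any of the systems discussed here.

```python
def rewrite_query(predicates, mapping):
    """Translate the attribute names of a conjunctive query into the target schema."""
    return {mapping[attr]: value for attr, value in predicates.items()}


def rewrite_result(row, mapping):
    """Translate a result row from the target schema back to the source schema."""
    inverse = {target: source for source, target in mapping.items()}
    return {inverse[attr]: value for attr, value in row.items()}


# Hypothetical one-to-one mapping between a source and a target schema.
source_to_target = {"Title": "dc_title", "Language": "dc_language"}

# Step 1: the query is translated to the target schema.
query = {"Language": "de"}
target_query = rewrite_query(query, source_to_target)

# Step 2: a result produced in the target schema is translated back.
target_row = {"dc_title": "Datenbanken", "dc_language": "de"}
source_row = rewrite_result(target_row, source_to_target)
```

Complex (one-to-many) mappings would need more than a dictionary lookup; the sketch only covers the simple case.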

(19)

2.2 Existing Systems - Typology

Focus on network scalability

homogeneous schema

low query expressivity

DHT as underlying network structure

Focus on expressivity

super-peer or unstructured

unlimited query complexity

Focus on integration

typically unstructured

(20)

2.2 Existing Systems – Overview

List not complete

Focus         Name      Topology        Data Placement  Data Model  Query Language
Scalability   PIER      DHT (Bamboo)    Distributed     Relational  SQL subset
Scalability   RDFPeers  DHT (MAAN)      Distributed     RDF         -
Scalability   Mercury   DHT (Symphony)  Distributed     Tuples      -
Expressivity  SQPeer    Super-peer      Owner           RDF         RQL
Expressivity  PeerDB    Unstructured    Owner           Relational  SQL subset
Expressivity  Edutella  Super-peer      Owner           RDF         datalog (SQL)
Integration   Piazza    Unstructured    Owner           XML         XQuery subset
Integration   GridVine  DHT (P-Grid)    Distributed     RDF

(22)

3.1 Edutella: Introduction

Initial Goal:

Achieve interoperability between heterogeneous metadata-driven (e-learning) systems

Provides metadata only, not the resources

Resources are fetched via HTTP

Query Examples

“Find software engineering course lecture notes for undergraduates in German language”

“Find an introduction to Enterprise Java Beans for professionals”

“Find a course in software requirements analysis from a Swedish university”

(23)

3.1 Query Service

Provides standardized query/retrieval of RDF metadata stored in distributed RDF repositories

Query Exchange Language

Based on Datalog (allows expression of rules)

RDF syntax

For exchange only

Adapters to enable QEL (query exchange

(24)

3.1 Query processing

Parsers/Formatters convert between query languages

Applications and backends are shielded from communication layer

Query messages are exchanged in RDF/XML format

Wrappers available for SQL, RDQL, RQL, and others

[Figure: query processing pipeline. Labels: Consumer Application, Edutella Consumer Interface with Query Parser (app.-specific format, EQM), P2P Network (QEL), Edutella Provider Interface with Query Formatter (EQM, repository-specific format), Back-End (Repository), three Providers.]

(25)

3.1 Edutella Topology

Super-Peers

Content Providers

Content Consumers

Use filtered flooding in super-peer backbone

HyperCuP topology for backbone

(26)

3.1 Cayley Graphs

Graph representing a permutation group G, described by a set of generators

Regular, vertex-symmetric, recursively decomposable

Optimal routing and broadcast algorithms exist

[Figure: example Cayley graphs; one with vertices labeled by permutations of 1234, one hypercube with edges labeled by dimensions 0, 1, 2.]

(27)

3.1 Super-peer Topology: HyperCuP

[Figure: super-peers SP1 to SP8 arranged as a hypercube; edges are labeled with dimensions 0, 1, 2.]

Super-peers are arranged as hypercube

Broadcast needs n-1 messages, log2(n) hops

High connectivity, resilient against node failures
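
The message and hop counts for a hypercube broadcast can be checked with a small simulation. This is a sketch, not the HyperCuP implementation: it assumes a complete hypercube of 2^dim nodes with binary node ids, and the forwarding rule used here (a node that receives the message along dimension d forwards it only along dimensions below d) is one common way to avoid duplicate deliveries. The function name `hypercube_broadcast` is made up for illustration.

```python
def hypercube_broadcast(dim, origin=0):
    """Simulate a duplicate-free broadcast in a complete hypercube.

    Returns (messages sent, maximum hop count, nodes reached).
    """
    hops = {origin: 0}          # node id -> hop count at which it was reached
    messages = 0
    # Each frontier entry is (node, limit): the node forwards along dimensions < limit.
    frontier = [(origin, dim)]
    while frontier:
        next_frontier = []
        for node, limit in frontier:
            for d in range(limit):
                neighbor = node ^ (1 << d)   # flip bit d to cross the dimension-d edge
                assert neighbor not in hops  # every node is reached exactly once
                hops[neighbor] = hops[node] + 1
                messages += 1
                next_frontier.append((neighbor, d))
        frontier = next_frontier
    return messages, max(hops.values()), len(hops)


# Eight super-peers (dim = 3): 7 messages, 3 hops, all 8 nodes reached.
result = hypercube_broadcast(3)
```

With n = 8 this gives n-1 = 7 messages and log2(8) = 3 hops, matching the figures stated on the slide.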

(28)

3.1 Super-Peer-based Query Routing

Database fragment summaries

Index structure and maintenance

Query Routing

(29)

3.1 Peer Fragment Summaries

Peer1.Doc

Identifier Title Date Format Language

521354021 Csdoi sdofi sfi sfdsf 1948 Book de

593574021 Deor aodfi sdfwe dls 1952 Book de

534536021 Toid sdofij cvcdova 1937 Book de

528943021 Csdo asofdi weor 1916 Book de

529874521 Epodsf csmieo mo 1924 Book de

526983221 Awer fzwe xhzpwf 1959 Book de

Peer2.Doc

Identifier Title Date Language Coverage

1861978766 Eoite odsifj woifj 1993 en Scotland

1394875966 Oewr svonwe 2005 en Wales

1817305606 Psadoifh sdafns dsf 1999 en York

1809239086 Vsd sdfokj sfew 2001 en West Midlands

1345398705 Wdfj vspo sdfp dort 1989 en London

Peer1 Summary

Doc.Identifier

Doc.Title

Doc.Date[1916-1959]

Doc.Format [Book]

Doc.Language[de]

Peer2 Summary

Doc.Identifier

Doc.Title

Doc.Date[1989-2005]

Doc.Language[en]

Doc.Coverage[UK]
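
A fragment summary of this kind could be derived from the local tuples roughly as follows. This is a minimal sketch under assumptions: the function `summarize` and its parameters are hypothetical, dates are summarized as a min/max range, and Format and Language as value sets. Generalizing place names to `Coverage[UK]` would need a taxonomy, which is not modeled here.

```python
def summarize(rows, range_attrs=("Date",), value_attrs=("Format", "Language")):
    """Summarize a table fragment: attribute -> None, (min, max), or set of values."""
    summary = {}
    for row in rows:
        for attr, value in row.items():
            if attr in range_attrs:
                lo, hi = summary.get(attr, (value, value))
                summary[attr] = (min(lo, value), max(hi, value))
            elif attr in value_attrs:
                summary.setdefault(attr, set()).add(value)
            else:
                # Only record that the attribute exists, not its values.
                summary.setdefault(attr, None)
    return summary


# Three of the Peer1.Doc rows from the slide.
peer1_rows = [
    {"Identifier": 521354021, "Title": "Csdoi sdofi sfi sfdsf", "Date": 1948,
     "Format": "Book", "Language": "de"},
    {"Identifier": 528943021, "Title": "Csdo asofdi weor", "Date": 1916,
     "Format": "Book", "Language": "de"},
    {"Identifier": 526983221, "Title": "Awer fzwe xhzpwf", "Date": 1959,
     "Format": "Book", "Language": "de"},
]
peer1_summary = summarize(peer1_rows)
```

The summary stays small regardless of how many tuples the peer holds, which is why it is cheap to ship to the super-peer.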

(30)

3.1 Super-peer / Peer Indices

[Figure: super-peers SP1 to SP4 connected as a hypercube (edge dimensions 0 and 1), with peers P1 to P4 attached.]

Super-Peer1 SP/P Index

Doc.Identifier: P1, P2

Doc.Title: P1, P2

Doc.Date [1916-1959]: P1; [1989-2005]: P2

Doc.Format [Book]: P1

Doc.Language [de]: P1; [en]: P2

Doc.Coverage [UK]: P2

Peer1 Summary

Doc.Identifier

Doc.Title

Doc.Date[1916-1959]

Doc.Format [Book]

Doc.Language[de]

Peer2 Summary

Doc.Identifier

Doc.Title

Doc.Date[1989-2005]

Doc.Language[en]

Doc.Coverage[UK]

(31)

3.1 Super-Peer Fragment Summaries

Doc

Identifier Title Date Format Language Coverage

521354021 Csdoi sdofi sfi sfdsf 1948 Book de

593574021 Deor aodfi sdfwe dls 1952 Book de

534536021 Toid sdofij cvcdova 1937 Book de

528943021 Csdo asofdi weor 1916 Book de

529874521 Epodsf csmieo mo 1924 Book de

526983221 Awer fzwe xhzpwf 1959 Book de

1861978766 Eoite odsifj woifj 1993 en Scotland

1394875966 Oewr svonwe 2005 en Wales

1817305606 Psadoifh sdafns dsf 1999 en York

1809239086 Vsd sdfokj sfew 2001 en West Midlands

1345398705 Wdfj vspo sdfp dort 1989 en London

Super-Peer1

SP1 Summary

Doc.Identifier

Doc.Title

Doc.Date[1916-2005]

Doc.Format [Book]

Doc.Language[de, en]

Doc.Coverage[UK]

(32)

3.1 Super-peer/Super-peer Indices

Naive forwarding is not optimal

[Figure: super-peers SP1 to SP4 connected as a hypercube with edge dimensions 0 and 1.]

SP1 Summary

Doc.Identifier

Doc.Title

Doc.Date[1916-2005]

Doc.Format [Book]

Doc.Language[de, en]

Doc.Coverage[UK]

Super-Peer2 SP/SP Index

Doc.Language [de]: SP1; [en]: SP1

Super-Peer3 SP/SP Index

Doc.Language [de]: SP1; [en]: SP1

Super-Peer4 SP/SP Index

Doc.Language [de]: SP2, SP3; [en]: SP2, SP3

Super-Peer4 SP/SP Index

Doc.Language [de]: SP2; [en]: SP2

(33)

3.1 Super-peer/Super-peer Indices

[Figure: super-peers SP1 to SP4 connected as a hypercube with edge dimensions 0 and 1.]

SP1 Summary

Doc.Language[de, en]

Take edge dimension into account: forward SP/SP index entries only along lower edges

Super-Peer3 SP/SP Index

Doc.Language [de]: SP1 (1); [en]: SP1 (1)

Super-Peer2 SP/SP Index

Doc.Language [de]: SP1 (0); [en]: SP1 (0)

Super-Peer4 SP/SP Index

Doc.Language [de]: SP3 (0); [en]: SP3 (0)

Super-Peer4 SP/SP Index

Doc.Language [de]; [en]

(34)

3.1 Query Routing

[Figure: super-peers SP1 to SP4 connected as a hypercube (edge dimensions 0 and 1), with peers P1 to P4 attached.]

Super-Peer3 SP/SP Index

Doc.Language [de]: SP1 (1); [en]: SP1 (1)

Super-Peer4 SP/SP Index

Doc.Language [de]: SP3 (0); [en]: SP3 (0)

Use SP/P and SP/SP indices as filters

SELECT * FROM Doc WHERE Language='de' AND …

Super-Peer1 SP/P Index

Doc.Language [de]: P1; [en]: P2
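
Using an index as a routing filter can be sketched as below. This is an illustration under assumptions, not Edutella code: the index layout (attribute, then constraint, then set of neighbors), the functions `matches` and `route`, and the restriction to conjunctive equality predicates are all choices made for this example.

```python
def matches(constraint, value):
    """Check a query value against an index constraint.

    None means 'attribute present', a tuple is an inclusive (lo, hi) range,
    anything else is compared for equality.
    """
    if constraint is None:
        return True
    if isinstance(constraint, tuple):
        lo, hi = constraint
        return lo <= value <= hi
    return constraint == value


def route(index, predicates):
    """Return the neighbors that may hold results for all predicates."""
    targets = None
    for attr, value in predicates.items():
        hits = set()
        for constraint, neighbors in index.get(attr, {}).items():
            if matches(constraint, value):
                hits |= neighbors
        targets = hits if targets is None else targets & hits
    return targets if targets is not None else set()


# SP/P index of Super-Peer1, restricted to two attributes from the slides.
sp1_index = {
    "Doc.Language": {"de": {"P1"}, "en": {"P2"}},
    "Doc.Date": {(1916, 1959): {"P1"}, (1989, 2005): {"P2"}},
}

german = route(sp1_index, {"Doc.Language": "de"})
nothing = route(sp1_index, {"Doc.Language": "en", "Doc.Date": 1950})
```

The same function works for an SP/SP index if the neighbor sets contain super-peers instead of peers.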

(35)

3.1 Application: P2P Digital Library Network

Large amount of individual DLs

Autonomous institutions

Users have to

– find relevant DLs

– search separately on every found DL

Violates 4th law of Library Science

– “Save the time of the reader” (Ranganathan, 1931)


(36)

3.1 DL Search Engine Solution

Search engine approach

– “Crawl” DLs

– Copy Content

– Offer unified collection

Issues

– Search engine controls content

– Proprietary interface (or just Web crawl)

– Difficult to preserve metadata

– Single point of failure


(37)

3.1 Open Archives Initiative Solution

Standardize metadata “crawling” interface

– OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)

Harvesters

– collect metadata from DLs

– offer search facilities

Issues

– No single entry point

– Harvesters control content

– Points of failure

– Incentive for Harvester?


(38)

3.1 From OAI to P2P

Create “peer wrapper” for existing DLs

[Figure: digital libraries (content providers) connect through an OAI-PMH interface to the super-peer backbone.]

(39)

3.1 OAI-P2P – a Digital Library Network

P2P approach:

DLs form self-organized network

User queries are distributed

Advantages

No dependency on service provider

Each DL still controls its content

No single point of failure

5th law of Library Science: “The library is a growing organism” (Ranganathan, 1931)


(40)

3.1 Edutella – Discussion

Efficiently limits query distribution to relevant peers

Very good scalability in terms of data size

No data movement required

Little index maintenance efforts

Flooding limits super-peer backbone scalability

Will never scale to millions of peers

Mainly query forwarding

Initial extension to full query planning exists

(42)

3.2 PIER

P2P Relational Database

Foundation: any DHT

Extended hash interface

put(namespace, key, value)

get(namespace, key)

namespace/key combination is used as hash value (DHT key)

Extended network capabilities

Exploit DHT structure for broadcast

[Figure: spanning tree over DHT nodes 0 to 15.]
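
The put/get interface with a combined namespace/key hash can be mimicked in memory as follows. This is a toy stand-in, not PIER or any real DHT library: the class `ToyDHT`, the 16-bit identifier space, the SHA-1 based hash, and the "first node id at or after the key" ownership rule are assumptions made for the sketch.

```python
import hashlib
from bisect import bisect_left


class ToyDHT:
    """In-memory stand-in for a DHT with put(namespace, key, value) / get(namespace, key)."""

    def __init__(self, node_ids):
        self.ring = sorted(node_ids)
        self.store = {node: {} for node in self.ring}

    @staticmethod
    def dht_key(namespace, key, bits=16):
        # The namespace/key combination is hashed into the identifier space.
        digest = hashlib.sha1(f"{namespace}/{key}".encode()).digest()
        return int.from_bytes(digest, "big") % (1 << bits)

    def _owner(self, dht_key):
        # Responsible node: first node id >= key, wrapping around the ring.
        i = bisect_left(self.ring, dht_key)
        return self.ring[i % len(self.ring)]

    def put(self, namespace, key, value):
        node = self._owner(self.dht_key(namespace, key))
        self.store[node].setdefault((namespace, key), []).append(value)

    def get(self, namespace, key):
        node = self._owner(self.dht_key(namespace, key))
        return list(self.store[node].get((namespace, key), []))


dht = ToyDHT([0, 16384, 32768, 49152])
dht.put("Doc", 456, (456, "Critique...", 1781, "en"))
found = dht.get("Doc", 456)
missing = dht.get("Doc", 999)
```

Several values may share one namespace/key pair, which is what the attribute indexes rely on.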

(43)

3.2 Application: Phi

Phi: Public Health for the Internet

Monitor IP network state world-wide

Collect statistics

Network traffic

Latency

(44)

3.2 Storing and Indexing Tuples

Storing

Every tuple needs a synthetic tuple key

Choose combination of table name and tuple key as DHT key

Insert complete tuple into DHT using this key

Indexing

Additional attribute indexes are built by inserting attribute value/tuple key pairs into the DHT

Choose combination of attribute name and attribute value as DHT key

(45)

3.2 Example

Sample Database

Doc(Id, Title, Date, Language)

Author(DocId, PersonId)

Person(Id, Name, Surname)

Sample tuple: (456, 'Critique of Pure Reason', 1781, 'en')

Storing

put(Doc, 456, (456, 'Critique...', 1781, 'en'))

Indexing on 'Title' and 'Date' attributes

put(Doc.Title, 'Critique...', 456)

put(Doc.Date, '1781', 456)

(46)

3.2 PIER Query Plans

DHT-Scan

Use index to retrieve tuple key(s)

Use key(s) to retrieve data tuple(s)

Example

SELECT Id, Title FROM Doc WHERE Date = '1781' AND Lang = 'en'

Each peer can create a query plan

One DHT lookup per result tuple

Filter has to be done on query originator

Query plan: dht-scan(Doc, Date='1781') -> filter(Lang='en') -> project({Id, Title})
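
The plan above (index lookup, tuple fetch, filter, project) can be traced with a dictionary standing in for the DHT. This is a sketch under assumptions: `put`, `get`, `insert_doc`, `dht_scan`, and `query` are illustrative helpers, dates are stored as integers, and the second document is invented test data.

```python
from collections import defaultdict

# Stand-in for a real DHT: (namespace, key) -> list of values.
_dht = defaultdict(list)


def put(namespace, key, value):
    _dht[(namespace, key)].append(value)


def get(namespace, key):
    return list(_dht.get((namespace, key), []))


def insert_doc(row):
    """Store the full tuple under its key and index it on Title and Date."""
    doc_id, title, date, _lang = row
    put("Doc", doc_id, row)
    put("Doc.Title", title, doc_id)
    put("Doc.Date", date, doc_id)


def dht_scan(table, attr, value):
    """Use the attribute index to find tuple keys, then fetch each tuple."""
    for tuple_key in get(f"{table}.{attr}", value):
        yield from get(table, tuple_key)  # one DHT lookup per result tuple


def query(date, lang):
    rows = dht_scan("Doc", "Date", date)            # dht-scan(Doc, Date=...)
    rows = (r for r in rows if r[3] == lang)        # filter(Lang=...), at the originator
    return [(r[0], r[1]) for r in rows]             # project({Id, Title})


insert_doc((456, "Critique of Pure Reason", 1781, "en"))
insert_doc((457, "Kritik der reinen Vernunft", 1781, "de"))  # hypothetical second tuple
result = query(1781, "en")
```

Only the Date predicate is answered by the index; the Lang predicate is applied after the tuples have been fetched, as noted on the slide.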

(47)

3.2 Aggregate and Range Queries

Example

SELECT COUNT(Id) FROM Doc WHERE Date > '1780' AND Date < '1790'

Use spanning tree for broadcast

Aggregate on return

[Figure: partial counts are combined on the way back up the spanning tree.]
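
Aggregating on the way back up the spanning tree can be sketched as a recursive sum. This is an illustration only: the tree shape, the per-node counts, and the function `aggregate_count` are invented for the example and do not reproduce the numbers in the figure.

```python
def aggregate_count(tree, local_counts, root):
    """Sum local COUNT results bottom-up along a spanning tree.

    tree maps a node to its children; local_counts maps a node to the number
    of locally stored tuples that satisfy the query predicate.
    """
    def visit(node):
        # A node adds its own count to the partial counts returned by its children.
        return local_counts.get(node, 0) + sum(visit(child) for child in tree.get(node, []))

    return visit(root)


# Hypothetical spanning tree rooted at node 0.
tree = {0: [1, 2], 1: [3, 4], 2: [5]}
local_counts = {0: 1, 1: 1, 2: 3, 3: 1, 4: 1, 5: 1}
total = aggregate_count(tree, local_counts, root=0)
```

Each inner node forwards a single partial count to its parent, so the query originator receives one number per child instead of one message per peer.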

(48)

3.2 Join Queries

Example

Assume a Person tuple (789, 'Kant', 'Immanuel')

SELECT Id, Title FROM Doc, Author WHERE Author.DocId = Doc.Id AND Author.PersonId = 789

Approach:

Hierarchical Joins

Use spanning tree for broadcast

Do local select on peer table fragments

Do local join on each peer

Improves load balancing

(49)

3.2 Hierarchical Joins

[Figure: hierarchical join of Doc fragments D1 to D3 with Author fragments A1 to A3; partial results are labeled T11 to T33.]

(50)

3.2 PIER - Discussion

Real query planning

Very efficient access to individual tuples and small result sets

Very good scalability in terms of network size

Degrades to broadcast for many types of queries

Aggregate queries

Joins

INSERT operation expensive (see P2P Information Retrieval)

(52)

3.3 Piazza

Tackles problem of “reconciling different models of the world” (A. Halevy)

Goal: provide a uniform interface to a set of autonomous data sources

New abstraction layer over multiple sources

Introduce mappings between “world views”

Mapping rules are specified manually by experts

(53)

3.3 Example – Publication Databases

(54)

3.3 Mapping Rules

Datalog is used to specify mapping rules

Mapping from UW to UCSD

UCSD:Member(projName, member) :- UW:Member(_, pid, member, _), UW:Project(pid, _, projName)

Mapping from UPenn to UCSD

UCSD:Member(projName, member) :- UPenn:Student(sid, member, _), UPenn:ProjMember(pid, sid), UPenn:Project(pid, projName, _)

UCSD:Member(projName, member) :- UPenn:Faculty(sid, member, _), UPenn:ProjMember(pid, sid), UPenn:Project(pid, projName, _)
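
Read operationally, the UW to UCSD rule is a join of two UW relations on pid, projected to (projName, member). The sketch below evaluates it over plain Python tuples. It assumes that the blank argument positions in the rule are anonymous variables; the function name and the sample rows are hypothetical.

```python
def ucsd_member_from_uw(uw_member, uw_project):
    """Derive UCSD Member(projName, member) facts from UW Member and UW Project.

    uw_member rows:  (anything, pid, member, anything)
    uw_project rows: (pid, anything, projName)
    """
    proj_name_by_pid = {pid: proj_name for (pid, _, proj_name) in uw_project}
    return {
        (proj_name_by_pid[pid], member)
        for (_, pid, member, _) in uw_member
        if pid in proj_name_by_pid          # join condition on pid
    }


# Hypothetical UW data.
uw_member = [(1, "p1", "Alice", "x"), (2, "p2", "Bob", "y")]
uw_project = [("p1", "db", "Piazza")]
ucsd_member = ucsd_member_from_uw(uw_member, uw_project)
```

A real system would rewrite queries using the rule rather than materialize the derived relation, but the join structure is the same.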

(55)

3.3 Storing and Indexing

Unstructured network (Gnutella-like)

Peer keeps its database

No exchange of data between peers

Indexing

Only on schema level

Each peer maintains schema catalog of its neighbors

Mappings stored in central catalog (hybrid system)

could be replaced by DHT

(56)

3.3 Query Routing

Query Flooding

Peer translates query to schema of neighbor (if possible)

Result tuples are converted on way back

Queries answered by traversing semantic paths

[Figure: labels are peers UCSD, UPenn, DBLP, CiteSeer; queries Q1, Q3, Q4; mappings M(UW, UCSD), M(UW, Stanford), M(UCSD, UPenn), M(Stanford, DBLP).]

(57)

3.3 Piazza - Discussion

Supports multiple schema world (more realistic)

Very expressive mapping mechanism

Not scalable

Gnutella-like topology and flooding

Piazza mapping technique could be applied to other network infrastructures

(59)

3.4 HiSbase

Specialized in distributed spatial data

Application: astronomy data

Huge amounts of data (terabyte scale)

Region-based queries

Skewed data distribution

Main ideas

Distribute data on peers by region

Use DHT for data access

Use neighbor-preserving hash function (space-filling curve)
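
One widely used space-filling curve is the Z-order (Morton) curve, which interleaves the bits of the coordinates. The sketch below uses it as an example of a neighbor-preserving hash; that HiSbase uses exactly this curve is an assumption here, and the function name `z_order` and the 8-bit coordinate width are chosen for illustration.

```python
def z_order(x, y, bits=8):
    """Map a 2D grid cell to a 1D key by interleaving the bits of x and y."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # bit i of x goes to even position 2i
        code |= ((y >> i) & 1) << (2 * i + 1)   # bit i of y goes to odd position 2i+1
    return code


# The four cells of a 2x2 block receive consecutive keys.
block = [z_order(x, y) for y in (0, 1) for x in (0, 1)]
```

Cells that are close in the plane tend to receive nearby keys, so a region maps to a small number of key ranges in the DHT.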

(60)

3.4 Load Distribution

Use Quad-Tree structure to split data space into equally loaded regions
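
A simplified version of the quad-tree split is sketched below: a region is divided into four quadrants until it holds at most `capacity` points, so dense areas end up with smaller regions. The function `split`, the capacity threshold, the depth limit, and the sample points are assumptions for this illustration and not the HiSbase algorithm itself.

```python
def split(points, x0, y0, x1, y1, capacity, depth=0, max_depth=16):
    """Recursively split the half-open region [x0, x1) x [y0, y1).

    Returns a list of (region, points) leaves, each with at most `capacity`
    points unless max_depth is reached.
    """
    if len(points) <= capacity or depth >= max_depth:
        return [((x0, y0, x1, y1), points)]
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    quadrants = [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]
    leaves = []
    for qx0, qy0, qx1, qy1 in quadrants:
        inside = [(x, y) for x, y in points if qx0 <= x < qx1 and qy0 <= y < qy1]
        leaves += split(inside, qx0, qy0, qx1, qy1, capacity, depth + 1, max_depth)
    return leaves


# Skewed sample data: four points clustered near the origin, two far away.
points = [(1, 1), (2, 2), (3, 1), (1, 3), (9, 9), (12, 14)]
regions = split(points, 0, 0, 16, 16, capacity=2)
```

The clustered corner is subdivided several times while the sparse corner stays a single region, which is how skewed data is spread more evenly over peers.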

(61)

3.4 Data Hashing

(62)
(63)

3.4 Query Processing

Point query

Simple DHT access

Region query

Route to arbitrary peer in range (e.g. using upper left region boundary)

This peer acts as coordinator

Forward query to peer region neighbors

Until whole area is covered

(64)

3.4 HiSbase - Discussion

Very efficient for spatial queries

But only spatial queries possible

Not completely self-organizing

(65)

3. P2P Database Networks – Summary

Challenges

Multi-Dimensional Search Space

Schema Heterogeneity

Potentially large result sets

Design Dimensions

Network Properties (Data Placement, Topology and Routing)

Data Access (Data Model, Query Language)

Integration Mechanism (Mapping Representation/Creation/Usage)

P2P Database Types

Focus on high network scalability (e.g., PIER)

Focus on high query expressivity (e.g., Edutella)

Focus on information integration (e.g., Piazza)

(66)

3. Conclusion

P2P Databases do already work

although immature compared to traditional database technology

One size does not fit all

Choose P2P database approach according to application

requirements

Open problems

Load Balancing (Replication/Caching)

How to combine DHT and filtered flooding advantages

Reliability (probabilistic guarantees)
