• No results found

Peer-to-Peer Data Management

N/A
N/A
Protected

Academic year: 2021

Share "Peer-to-Peer Data Management"

Copied!
70
0
0

Loading.... (view fulltext now)

Full text

(1)

Wolf-Tilo Balke

Sascha Tönnies

Institut für Informationssysteme

Technische Universität Braunschweig

http://www.ifis.cs.tu-bs.de

Peer-to-Peer

(2)

14. Overview

1. Introduction

2. Content Searching in Peer-to-Peer Applications

1. Problems in Peer-to-Peer Information Retrieval

2. Related Work in Distributed Information Retrieval

3. Index structures for Query Routing

1. Distributed Hash Tables for Information Retrieval

2. Routing Indexes for Information Retrieval

3. Locality-Based Routing Indexes

4. Supporting Effective Information Retrieval

1. Providing Collection-Wide Information

2. Estimating the Document Overlap

3. Prestructuring Collections with Taxonomies

(3)

14.1 What is IR?

Information retrieval

(

IR

) is the science of

searching for documents, for information within

documents and for metadata about documents

A user enters a

query

, i.e.

an information need, into

the system

Several objects may match

the query with different

(4)

14.1 Representing Text

How do we represent the

complexities of language?

Computers don‟t “understand”

documents or queries

Simple, yet effective approach: bag of words

Treat all the words in a document as index terms for

that document

Assign a “weight” to each term based on its

“importance”

(5)

14.1 Representing Text

McDonald's slims down spuds

Fast-food chain to reduce certain types of fat in its

french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is

cutting the amount of "bad" fat in its french fries

nearly in half, the fast-food chain said Tuesday as it

moves to make all its fried menu items healthier.

But does that mean the popular shoestring fries won't

taste the same? The company says no. "It's a win-win

for our customers because they are getting the same

great french-fry taste along with an even healthier

nutrition profile," said Mike Roberts, president of

McDonald's USA.

But others are not so sure. McDonald's will not

specifically discuss the kind of oil it plans to use, but at

least one nutrition expert says playing with the

formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD:

down $0.54 to $23.22, Research, Estimates) were

lower Tuesday afternoon. It was unclear Tuesday

whether competitors Burger King and Wendy's

International (WEN: down $0.80 to $34.91, Research,

Estimates) would follow suit. Neither company could

immediately be reached for comment.

16 × said

14 × McDonalds

12 × fat

11 × fries

8 × new

6 × company, french, nutrition

5 × food, oil, percent, reduce,

taste, Tuesday

(6)

14.1 Retrieval

Retrieving

relevant

information is hard!

Evolving, ambiguous user needs, context, etc.

Complexities of language

To

operationalize

information retrieval, we

must vastly simplify the picture

Information retrieval is all (and only) about matching

words in documents with words in queries

Obviously, not true…

(7)

14.1 Representing Documents as Vectors

The quick brown

fox jumped over

the lazy dog’s

back.

Document 1

Document 2

Now is the time

for all good men

to come to the

aid of their party.

the

is

for

to

of

quick

brown

fox

over

lazy

dog

back

now

time

all

good

men

come

jump

aid

their

party

0

0

1

1

0

1

1

0

1

1

0

0

1

0

1

0

0

1

1

0

0

1

0

0

1

0

0

1

1

0

1

0

1

1

Term

Do

cu

m

en

t

1

Do

cu

m

en

t

2

Stopword

List

(8)

14.1 Representing Text

document

structured

recognition

accents,

spacing,

etc.

stopwords

noun

stemming

groups

automatic

or manual

indexing

text +

structure

text

structure

full text

index terms

How to compare

(9)

14.1 Boolean Retrieval

Weights assigned to terms are either

“0”

or

“1”

“0” represents “absence”: term

isn’t

in the document

“1” represents “presence”: term

is

in the document

Build queries by combining terms with Boolean

operators

AND, OR, NOT

The system returns all documents that satisfy the

(10)

14.1 Boolean View of a Document-Set (=Collection)

quick

brown

fox

over

lazy

dog

back

now

time

all

good

men

come

jump

aid

their

party

0

0

1

1

0

0

0

0

0

1

0

0

1

0

1

1

0

0

1

0

0

1

0

0

1

0

0

1

1

0

0

0

0

1

Term

Doc

1

Do

c

2

0

0

1

1

0

1

1

0

1

1

0

0

1

0

1

0

0

1

1

0

0

1

0

0

1

0

0

1

0

0

0

0

0

1

Do

c

3

Do

c

4

0

0

0

1

0

1

1

0

0

1

0

0

1

0

0

1

0

0

1

0

0

1

0

0

1

0

0

0

1

0

1

0

0

1

Do

c

5

Do

c

6

0

0

1

1

0

0

1

0

0

1

0

0

1

0

0

1

0

1

0

0

0

1

0

0

1

0

0

1

1

1

1

0

0

0

Do

c

7

Do

c

8

Each column represents the view of

a particular document: What terms

are contained in this document?

Each row represents the view of a

particular term: What documents

contain this term?

To execute a query, pick out rows

corresponding to query terms

and then apply logic table of

corresponding Boolean operator

(11)

14.1 Sample Queries

fox

dog

0

0

0

0

1

1

0

0

1

1

0

0

0

1

0

0

Term

Do

c

1

Do

c

2

Do

c

3

Do

c

4

Do

c

5

Do

c

6

Do

c

7

Do

c

8

dog

fox

0 0 1 0 1 0 0 0

dog

fox

0 0 1 0 1 0 1 0

dog

fox

0 0 0 0 0 0 0 0

fox

dog

0 0 0 0 0 0 1 0

dog AND fox 

Doc 3, Doc 5

dog OR fox 

Doc 3, Doc 5, Doc 7

dog NOT fox 

empty

fox NOT dog 

Doc 7

good

party

0

0

1

0

0

0

1

0

0

0

1

1

0

0

1

1

g

p

0 0 0 0 0 1 0 1

g

p

o

0 0 0 0 0 1 0 0

good AND party 

Doc 6, Doc 8

over

1 0 1 0 1 0 1 1

good AND party NOT over 

Doc 6

Term

Do

c

1

Do

c

2

Do

c

3

Do

c

4

Do

c

5

Do

c

6

Do

c

7

Do

c

8

(12)

14.1 The Perfect Query Paradox

Every information need has a

perfect set of

documents

If not, there would be no sense doing retrieval

Every document set has a

perfect query

AND every word in a document to get a query for it

Repeat for each document in the set

OR every document query to get the set query

But can users realistically be expected to

formulate

this perfect query?

(13)

14.1 Why Boolean Retrieval fails

Natural language is way more complex

AND “discovers” nonexistent relationships

Terms in different sentences, paragraphs, …

Guessing terminology for OR is hard

good, nice, excellent, outstanding, awesome, …

Guessing terms to exclude is even harder!

(14)

14.1 Strengths and Weaknesses

Strengths

Precise, if you have a clear idea of what you‟re looking for

Efficient for the computer

Weaknesses

Users must learn Boolean logic

Boolean logic insufficient to capture the richness of

language

No control over size of result set: either too many

documents or none

All documents in the result set are considered “equally

good”

What about partial matches? Documents that “don‟t quite

match” the query may be useful also

(15)

14.1 Ranked Retrieval

Order

documents by how likely they are to be

relevant to the information need

Present hits one screen at a time

At any point, users can continue browsing through

ranked list or reformulate query

Attempts to retrieve relevant documents directly,

(16)

14.1 Why Ranked Retrieval?

Arranging documents by relevance is

Closer to how humans think: some documents are

“better” than others

Closer to user behavior: users can decide when to

stop reading

Best (partial) match: documents need not have all

query terms

Although documents with more query terms should

(17)

14.1 Similarity-based Retrieval?

Let‟s replace relevance with “similarity”

Rank documents by their similarity with the query

Treat the query as if it were a document

Create a query bag-of-words

Find its similarity to each document

Rank order the documents by similarity

(18)

14.1 Vector Space Model

Postulate:

Documents that are “close together” in vector

space “talk about” the same things

t

1

d

2

d

1

d

3

d

4

d

5

t

3

t

2

θ

φ

Therefore, retrieve documents based on how close the

(19)

14.1 How to Weight Terms?

Idea: Hans Peter Luhn 1958, IBM

Here‟s the intuition:

Terms that appear often in a document

should get high weights

Terms that appear in many documents should get low

weights

How do we capture this mathematically?

Term frequency

Inverse document frequency

The more often a document contains the term “dog”,

the more likely that the document is “about” dogs.

Words like “the”, “a”, “of” appear in (nearly) all

documents.

(20)

14.1 TFxIDF

TFxIDF [Gerald Salton, 1961]

T

erm

F

requency (TF)

How often a term appears in a document

can be calculated locally

D

ocument

F

requency (DF)

Number of documents, which contain a specific term

Needs global (system wide) knowledge

I

nverse

D

ocument

F

requency (IDF)

Discriminator for the importance of a term regarding the number of

occurrences in all documents

(21)

14.1 Working on Indices

quick

brown

fox

over

lazy

dog

back

now

time

all

good

men

come

jump

aid

their

party

0

0

1

1

0

0

0

0

0

1

0

0

1

0

1

1

0

0

1

0

0

1

0

0

1

0

0

1

1

0

0

0

0

1

Term

Do

c

1

Do

c

2

0

0

1

1

0

1

1

0

1

1

0

0

1

0

1

0

0

1

1

0

0

1

0

0

1

0

0

1

0

0

0

0

0

1

Do

c

3

Doc

4

0

0

0

1

0

1

1

0

0

1

0

0

1

0

0

1

0

0

1

0

0

1

0

0

1

0

0

0

1

0

1

0

0

1

Do

c

5

Do

c

6

0

0

1

1

0

0

1

0

0

1

0

0

1

0

0

1

0

1

0

0

0

1

0

0

1

0

0

1

1

1

1

0

0

0

Do

c

7

Do

c

8

The term-document matrix again

has “bag of words” information

(22)

14.1 Small yet Fast?

Can we make this data structure smaller, keeping

in mind the need for fast retrieval?

Observations:

The nature of the search problem requires us to

quickly find which documents contain a term

The term-document matrix is very sparse

(23)

14.1 Posting Lists

quick

brown

fox

over

lazy

dog

back

now

time

all

good

men

come

jump

aid

their

party

0

0

1

1

0

0

0

0

0

1

0

0

1

0

1

1

0

0

1

0

0

1

0

0

1

0

0

1

1

0

0

0

0

1

Term

Do

c

1

Do

c

2

0

0

1

1

0

1

1

0

1

1

0

0

1

0

1

0

0

1

1

0

0

1

0

0

1

0

0

1

0

0

0

0

0

1

Do

c

3

Do

c

4

0

0

0

1

0

1

1

0

0

1

0

0

1

0

0

1

0

0

1

0

0

1

0

0

1

0

0

0

1

0

1

0

0

1

Do

c

5

Do

c

6

0

0

1

1

0

0

1

0

0

1

0

0

1

0

0

1

0

1

0

0

0

1

0

0

1

0

0

1

1

1

1

0

0

0

Do

c

7

Do

c

8

Postings

1, 3

1, 3, 5, 7

3, 5, 7

1, 3, 5, 7, 8

1, 3, 5, 7

3, 5

1, 3, 7

2, 6, 8

2, 4, 6

2, 4, 6

2, 4, 6, 8

2, 4, 8

2, 4, 6, 8

3

4, 8

1, 5, 7

6, 8

(24)

14.1 Inverted Document Index

quick

brown

fox

over

lazy

dog

back

now

time

all

good

men

come

jump

aid

their

party

Term

Postings

1, 3

1, 3, 5, 7

3, 5, 7

1, 3, 5, 7, 8

1, 3, 5, 7

3, 5

1, 3, 7

2, 6, 8

2, 4, 6

2, 4, 6

2, 4, 6, 8

2, 4, 8

2, 4, 6, 8

3

4, 8

1, 5, 7

6, 8

(25)

14.1 What goes in the Postings?

Boolean retrieval

Just the document number

Ranked Retrieval

Document number and term weight (

tf.idf

, ...)

Proximity operators

Word offsets for each occurrence of the term

(26)

14.2 Overview

1. Introduction

2. Content Searching in Peer-to-Peer Applications

1. Problems in Peer-to-Peer Information Retrieval

2. Related Work in Distributed Information Retrieval

3. Index structures for Query Routing

1. Distributed Hash Tables for Information Retrieval

2. Routing Indexes for Information Retrieval

3. Locality-Based Routing Indexes

4. Supporting Effective Information Retrieval

1. Providing Collection-Wide Information

2. Estimating the Document Overlap

3. Prestructuring Collections with Taxonomies

(27)

14.2 Information Retrieval in P2P Systems

Information Retrieval deals with

complex documents

Meta-data can only capture some aspects of a document, but

not anticipate all semantic searches

E.g. sports-related newspaper article, but no names, locations, etc.

Support for full-text searches needed

Find the

best-matching document

from the

best-connected peer

Unlike in file sharing emphasis is on the document quality

If there are multiple sources offering similar quality documents,

choose best peer according to connection, etc.

(28)

14.2 Challenges in P2P IR

Efficient query evaluation scheme

Central inverted index of documents is expensive to maintain

How to disseminate a peer„s query?

Simple flooding of all queries is not scalable, if „best„ documents have to be

found (not just some match)

Dealing with network churn

A peer can always alter the set of documents offered, or significantly

change individual documents

Peers may join and leave the network, i.e. whole document

collections may disappear, or can be added

Integration of collection-wide information

Peers are not able to calculate IR-style scorings from

local knowledge, but needs some knowledge from the

(virtual) merged collection

Constant dissemination of collection-wide information

needs a lot of bandwidth

(29)

14.2 Example: Problem of Collection-wide

Information

Example: Different news collections, query on keyword

„basketball„

General news collection, e.g.

Many articles, only few about basketball, therefore IDF small

Keyword discriminates well between articles

NBA news collection

Few articles, almost all about basketball, therefore IDF high

Keyword hardly discriminates between articles

Merged collection: IDF medium

(30)

…B...

…A...

…B...

…A...

…A...

…B...

14.2 Example: Problem of Collection-wide

Information

Query:

A and B

TF=1 IDF=3/2

TF=1 IDF= 3/1

local scoring

TF=1 IDF=3/1

TF=1 IDF= 3/2

local scoring

…B…

…A…

Top object

Top object

global scoring

all objects identical

TF = 1 IDF = 6/3

Querying Peer

Peer 1

(31)

14.2 Distributed IR

Distributed information retrieval techniques grew

increasingly important for searching Web sources

Abstracts of information sources

To support distributed retrieval sources have to register

abstracts or keyword sets

Abstracts can either be kept in a central repository or

distributed by gossiping algorithms, e.g. PlanetP

[Cuenca-Acuna et al., „03]

Collection selection

Having no central index needs a sophisticated way of

choosing the most promising collections for querying

(32)

14.2 Distributed IR

Such abstracts can be compactly represented by

Bloom

Filters

, i.e. bit vectors that allow membership queries

Each term is hashed with n different functions and the position

in the bit vector for each hash value is set to 1

Allows for false positives, but no false negatives

(33)

14.2 Distributed IR

Benefit estimators

for collection selection use aggregated statistics

about individual collections for selection, e.g. CORI measure

[Callan et al., „95]

CORI calculates collection score s

i

for collection i regarding query q:

with

and

where

n

is the number of collections,

cdf

the collection document

frequency,

cdf

max

the maximum

cdf

and

cf

t

the collection frequency of

(34)

14.3 Overview

1. Introduction

2. Content Searching in Peer-to-Peer Applications

1. Problems in Peer-to-Peer Information Retrieval

2. Related Work in Distributed Information Retrieval

3. Index structures for Query Routing

1. Distributed Hash Tables for Information Retrieval

2. Routing Indexes for Information Retrieval

3. Locality-Based Routing Indexes

4. Supporting Effective Information Retrieval

1. Providing Collection-Wide Information

2. Estimating the Document Overlap

3. Prestructuring Collections with Taxonomies

(35)

14.3 Index Structures for Query Routing

Traditional index structures cannot be readily

employed in P2P systems

High degree of distribution

High degree of volatility (churn)

Distributed paradigms needed to route queries to

appropriate peers

Simple flooding method does not scale

Distributed hash table lookup

Using indexed routing information

Using shortcut overlays

High degree of

index maintenance

(36)

14.3 Distributed Hash Tables for IR

Distributed hash tables

Route queries to appropriate peers with number of hops

logarithmic in network size

No peer needs to maintain more than logarithmic amount

of routing information

But…

Exact match queries only

All new content has to be published, if peers join/change

Old content has to be unpublished, if peers leave

Documents added/removed will contain a lot of different terms

to be published/unpublished. Thus, usually many index peers have

to be addressed

Conjunction of query terms needs to access many peers, but

there is still no guarantee that a single document with the

conjunction exists

(37)

14.3 Distributed Hash Tables for IR

Improvement:

Hybrid P2P

infrastructures [Loo et al.,‟04]

Efficiency of DHT is worst, if

highly replicated

items are

requested

Experiments show worse behavior than flooding, degrading with churn

Querying and content allocation

follow

Zipf-distribution

Only few highly replicated and

often queried items

„People are looking for hay,

not for needles‟ (S. Shenker)

Hybrid P2P infrastructures use DHTs

only for the less replicated and rarely

queried items, all other queries are flooded

Still, DHTs have to be maintained for the majority of query

terms

Query Frequency Distribution

0,00% 2,00% 4,00% 6,00% 8,00% 10,00% 12,00% 14,00% 16,00% 0 10 20 30 40 50 60 70 80 90 Query O c c urr e nc e Fre qu e nc y

(38)

14.3 Routing Indexes for IR

Routing indexes

are

local

collections of (key, peer) pairs

Key is either a keyword or a query

Peer is the address of a peer that either offers relevant results, or

routes the query to other peers with relevant result

In contrast to flooding only „interesting‟ directions are queried

Often distinguished between links in the default network (directions

of content providers) and overlay structure of direct links to content

providers („shortcuts‟)

First introduced by [Crespo & Garcia-Molina, „02] to choose

best neighbors in the default network for query forwarding

Index maintenance is of local nature and index coverage is usually

high due to Zipf distribution of requests

(39)

14.3 Routing Indexes for IR

Routing index policies in the face of

network churn

With restricted index sizes new entries are collected and

always stored. If the maximum size is reached, some stale

information is replaced

A simple strategy always replaces the currently oldest index

entries

„Least recently used‟ (LRU) strategy assigns higher usefulness to

entries that have been successfully used recently

Optimal index size is a problematic parameter

Indexes with unrestricted size have to combat network

churn differently

„time to live‟ assigns an expiry time for each new index entry

„forgetting factors‟ can periodically weigh down reliability of link

information

(40)

14.3 An Algorithm for Correct Query Routing

Goal:

progressive distributed top-k ranking of

documents

Putting techniques together to design an efficient

top-k algorithm

Minimal number of object transfers

Optimal number of object accesses

Features of the P2P based approach

Optimized Query-Routing

No global Index

(41)

14.3 Bird’s View

1. Distribute query through the network (Routing)

2. Every peer scores documents locally (Ranking)

3. Hierarchical construction of the final result (Merging)

4. Optimized query routing (Index)

(42)

14.3 Building Blocks

local ranking

query-driven index

result

merging

Structured network

(43)

14.3 Network Structure

Observation: peers strongly differ in availability,

bandwidth, computing power, …

Hierarchical network structure with super-peers

Query routing

Result merging

Indexes

(44)

14.3 Network topology

Super-peers as hypercube (HyperCuP protocol)

Resilient against leaving peers

Broadcast with (n-1) messages, log

2

(n) hops

0 0 1 0 1 1 1 0 2 2 2 2

SP

1

SP

3

SP

4

SP

2

SP

5

SP

7

SP

8

SP

6

SP

1

SP

3

SP

2

SP

7

SP

5

SP

8

SP

6

SP

4

(45)

14.3 Local Ranking

Super-peer asks for

local rankings

of peers„

collections

Top-

k

results (plus metric-dependent

information) are returned to SP

Arbitrary similarity measures can be used

TFxIDF

(46)

14.3 Result Merging

Results will be

merged

at the super-peers

Unique scoring function

Maximum of k messages per SP-SP egde

SP

A

SP

C

SP

B

SP

D

0

1

0

P

3

P

2

P

5

P

4

P

1

P

Q

P

7

P

6

(47)

14.3 Indexing

Super-peers keep indexes

IDFs (collection wide information)

IDF-values for query terms

Top peers (routing)

List of peers that already have contributed to a previous top-k result

Others possible, e.g. for taxonomies

(48)

14.3 Routing Indexes Example: Top k Query Routing

Example for routing indexes in P2P networks with

super-peer backbone holding routing indexes

Progressive P2P top-k algorithm [Balke et al., „04]

If query q is indexed, distribute query and collect results

Otherwise flood query and

Compute ranks at local peers

Merge results at super-peers

Use statistics for new entry in routing index (routing information, collection-wide

information, etc.)

Data structures at super-peers

RequestResults: Peers which are queried for result (index information)

BestPeer: Peers which delivered recent best result

TopRes: Current top results

(49)

SP

1

SP

3

SP

2

SP

7

SP

5

SP

8

SP

6

SP

4

P

4

P

2

P

3

P

0

P

1

14.3 Routing Indexes Example: Top k Query Routing

d31

0.6

d32

0.6

d33

0.1

d21

0.7

d22

0.4

d23

0.3

d41

0.5

d42

0.5

d43

0.2

d11 0.8

d12 0.3

d13 0.2

q1 ?

SP4

RequestResults

{SP8,P2, P3, P4}

BestPeers

{}

TopRes

{}

Delivered

{}

Empty routing index at SP

4

Find top 2

documents

(50)

SP

1

SP

3

SP

2

SP

7

SP

5

SP

8

SP

6

SP

4

P

4

P

2

P

3

P

0

P

1

14.3 Routing Indexes Example: Top k Query Routing

SP4

RequestResults

{SP8,P2, P3, P4}

BestPeers

{}

TopRes

{}

Delivered

{}

SP4

RequestResults

{SP8, P3, P4}

BestPeers

{}

TopRes

{(P2, d21, 0.7)}

Delivered

{}

SP4

RequestResults {}

BestPeers

{}

TopRes

{(P2, d21, 0.7),

(P3, d31, 0.5),

(P4, d41, 0.4)}

Delivered

{}

SP4

RequestResults {}

BestPeers

{P2}

TopRes

{(P3, d31, 0.5),

(P4, d41, 0.4)}

Delivered

{(P2, d21, 0.7)}

d31

0.6

d32

0.6

d33

0.1

d21

0.7

d22

0.4

d23

0.3

d41

0.5

d42

0.5

d43

0.2

d11 0.8

d12 0.3

d13 0.2

q1 ?

d21 0.7

d21

0.7

d21 0.7

(51)

SP

1

SP

3

SP

2

SP

7

SP

5

SP

8

SP

6

SP

4

P

4

P

2

P

3

P

0

P

1

14.3 Routing Indexes Example: Top k Query Routing

SP1

RequestResults {SP3,SP5, P1}

BestPeers

{}

TopRes

{(SP2, d21, 0.7)}

Delivered

{SP2}

SP1

RequestResults {}

BestPeers

{}

TopRes

{(P1, d11, 0.8),

(SP2, d21, 0.7)}

Delivered

{}

SP1

RequestResults {}

BestPeers

{P1}

TopRes

{(SP2, d21, 0.7)}

Delivered

{(P1, d11, 0.8)}

d31

0.6

d32

0.6

d33

0.1

d21

0.7

d22

0.4

d23

0.3

d41

0.5

d42

0.5

d43

0.2

d11 0.8

d12 0.3

d13 0.2

q1 {(d11, 0.8)}

d11 0.8

q1 ?

(52)

SP

1

SP

3

SP

2

SP

7

SP

5

SP

8

SP

6

SP

4

P

4

P

2

P

3

P

0

P

1

14.3 Routing Indexes Example: Top k Query Routing

SP1

RequestResults {P1}

BestPeers

{}

TopRes

{(SP2, d21, 0.7)}

Delivered

{(P1, d11, 0.8)}

SP1

BestPeers

{}

Delivered

{(P1, d11, 0.8)}

RequestResults {}

TopRes

{(SP2, d21, 0.7),

(P1, d12, 0.3)}

q1

{(d11, 0.8),

(d21, 0.7)}

SP1

RequestResults {}

BestPeers

{SP2}

TopRes

{(P1, d12, 0.3)}

Delivered

{(P1, d11, 0.8),

(SP2, d21, 0.7)}

d31

0.6

d32

0.6

d33

0.1

d21

0.7

d22

0.4

d23

0.3

d41

0.5

d42

0.5

d43

0.2

d11 0.8

d12 0.3

d13 0.2

d12 0.3

q1 {(d11, 0.8)}

(53)

SP

1

SP

3

SP

2

SP

7

SP

5

SP

8

SP

6

SP

4

P

4

P

2

P

3

P

0

P

1

14.3 Routing Indexes Example: Top k Query Routing

SP1

RequestResults {}

BestPeers

{SP2}

TopRes

{(P1, d12, 0.3)}

Delivered

{(P1, d11, 0.8),

(SP2, d21, 0.7)}

q1

{(d11, 0.8),

(d21, 0.7)}

d31

0.6

d32

0.6

d33

0.1

d21

0.7

d22

0.4

d23

0.3

d41

0.5

d42

0.5

d43

0.2

d11 0.8

d12 0.3

d13 0.2

SP1 Routing Index

q1

{SP2, P1}

SP2 Routing Index

q1

{SP4}

SP4 Routing Index

q1

{P2, P3}

q1

(54)

14.3 Query Routing

At the first appearance of a queries peers only

send out their input for IDF computation

Super-peers aggregate IDFs and build index

Whenever a query is repeated

SPs will send recent IDF-values together with query terms

Peers will uses IDFs for local score computation

Disadvantage: at first occurrance of query it has

to be sent twice

Zipf-Distribution minimizes number of queries concerned

Advantages:

No effort for maintaining global IDF index

(55)

14.3 Query Routing und Network Churn

Query index strategy

Send queries only to peers that have already

recently contributed

to answering a query

Problem:

the network„s and

each peer„s volatility

Solution 1: Send queries also to a randomly

selected set of peers

Solution 2: “Best before”-timestamp

SP

1

SP

3

SP

2

SP

7

SP

5

SP

6

SP

4

X

X

X

X

(56)

14.3 Locality-Based Routing Indexes

Refinement o

f

routing indexes

by social metaphors

Similar retrieval process like in real life

Every person has only limited knowledge of the environment

Who knows about a certain topic?

Who might know other people who know about the topic?

Try to build (short) „chains of acquaintances‟ that will bring you close

to the requested information

Aims at building „social networks‟ as overlays

Peers semantically connected by certain topics form „small

world networks‟, e.g. [Milgram, ‟67; Kleinberg, „00]

Paradigm of

interest-based locality

If a peer has relevant content for a user‟s query, it very often also has

some other content that this user might be interested in

(57)

14.3 Locality-Based Routing Indexes

For information retrieval in P2P network this enables

new routing in

interest-based overlay structures

Route queries to peers with documents matching

semantically close queries

Traces on practical data collections show that

Peers get well-connected

The overlay graph shows highly-clustered characteristics with a small minimum

distance between any two nodes

„Overhearing‟ of communications routed through a peer

can be used to enhance its local index

Randomly sending queries also to peers from the default

network helps to extend knowledge and can remedy the

effect of network churn

(58)

14.4 Overview

1. Introduction

2. Content Searching in Peer-to-Peer Applications

1. Problems in Peer-to-Peer Information Retrieval

2. Related Work in Distributed Information Retrieval

3. Index structures for Query Routing

1. Distributed Hash Tables for Information Retrieval

2. Routing Indexes for Information Retrieval

3. Locality-Based Routing Indexes

4. Supporting Effective Information Retrieval

1. Providing Collection-Wide Information

2. Estimating the Document Overlap

3. Prestructuring Collections with Taxonomies

(59)

14.4 Supporting Effective P2P IR

P2P information retrieval has to deal with

the

trade-off

between

Efficient local maintenance of statistics / index information,

where information can be stale (incorrect)

Expensive global maintenance of statistics / index

information, where information always is accurate

Needed is „just the right level‟ of dissemination of

statistics to guarantee a „sufficiently effective‟ retrieval

Some techniques help to support efficient retrieval

Providing adequate collection-wide information

Estimate document overlap between peers

(60)

14.3 Providing Collection-Wide Information

Collection-wide information

is important for retrieval

quality, but cannot be calculated locally like e,g., IDFs

Some systems like e.g. PlanetP, do not use CWI directly, but

circumnavigate the problem by using an inverted peer frequency

where

N

is the number of all peers and

N

t

is the number of peers

offering documents on term

t

If summarizations of peers (abstracts) are eagerly disseminated, each

peer can locally decide values for

N

and

N

t

The relevance of peers in multi-keyword queries is simply the sum of

IPFs for the individual terms

Practical tests show an average overlap of about 70% between result

sets retrieved with IDFs and those retrieved with IPFs

(61)

14.4 Providing Collection-Wide Information

Tests in Web information retrieval, e.g. [Viles & French, „95],

show that CWI

stays relatively stable

over the whole

collection of Web Sites even with churn

Only joining/leaving corpora on completely new topics result in

significant change

Indexing CWI in a similar way as the routing information

for queries is possible [Balke et al., „05]

In structured networks CWI can be aggregated along the

backbone and indexed CWI can be distributed together with

the query

New queries have to be flooded/routed twice

The first flooding collects and aggregates CWI

The second one provides the correct CWI for local scorings

(62)

14.4 Estimating the Document Overlap

Assessing the

novelty of collections

also supports

retrieval quality

Pre-computed statistics about expected result quality in each

collection is often used to minimize the number of queried

collections

Choosing collection with high overlap for querying will usually

not improve result sets sufficiently to justify the access costs

Especially progressive searches, like top-k searches, profit from

focusing on collections with small overlaps, since result merging

procedures will ignore identical/similar results

The novelty of a collection can only be calculated with

respect to some reference collection(s)

(63)

14.4 Estimating the Document Overlap

A definition of a peer

p

‟s collection

C

p

with respect

to a reference collection

C

ref

[Bender et al., „05]

Since the information what exact documents a peer offers

is usually not disseminated, the values have to be

approximated from statistics

E.g. if abstracts in the form of Bloom filters are given, a combined Bloom filter

b

p

can

be calculated by bitwise logical AND between

p

‟s Bloomfilters for all keywords in a

query

Novelty then can be estimated by comparing it to

as the union of those Bloom filters

b

i

of the set of collections

S

that have already

been retrieved

The

degree of novelty

is given by counting locations where

p

‟s Bloom filter has

(64)

14.4 Prestructuring Collections with Taxonomies

Retrieval in P2P systems generally

considers two basic paradigms

Fulltext-based queries

Metadata-based queries

Integrating these paradigms can support retrieval effectiveness

Structuring document collections

Disambiguation of query terms

Peers often host collections of similar documents, e.g. similar

kind of information (newspaper articles, etc.) on similar topics,

etc.

Scalability and successful use of statistics are strongly improved, if a

common system of categories to classify the documents can be used

Since categories are more or less similar to each other a taxonomy

on categories allows for easily finding semantically similar documents

(65)

14.4 Prestructuring Collections with Taxonomies

Business

Politics

Sports

News

sim

(Politics, Foreign):

l

= 1

h

= 2

sim

= 0.68

sim

(Politics, Sports):

l

= 2

h

= 1

sim

= 0.35

l

h

l

h

Topical similarity

within a taxonomy is defined by

[Li et al., „03]

l

: shortest path between categories

c

1

and

c

2

h

: level of common subsumer

Common values

= 0.2,

=0.6 (experimentally determined)

(66)

14.4 Combination of Topics and Keywords

Topics dominate keywords

Cooperative Filter: Relax on topics until k results have

been found

Example: [<

Politics

>, “London Olympics”]

Politics

Foreign

Sports

Politics

Foreign

Domestic

Sports

Business

Tennis

(67)

14.4 Combination of Topics and Keywords

SP

1

SP

3

SP

2

SP

7

SP

5

SP

8

SP

6

SP

4

P

4

P

2

P

3

P

0

P

1

d11

P

0.8

d12

P

0.3

d13

P

0.2

SP1

RequestResults {P1}

TopRes

{(P1, d11, [P,0.8]),

(SP2, d21, [P,0.7])}

Delivered

{(P1, d11, [P,0.8])}

SP1

RequestResults {P1}

TopRes

{(P1, d11, [P,0.8]),

(SP2, d21, [P,0.7]),

(P1, d12, [S,0.3])}

Delivered

{(P1, d11, [P,0.8])}

SP1

RequestResults {P1}

TopRes

{(P1, d11, [P,0.8]),

(SP2, d21, [P,0.7]),

(P1, d12, [P,0.3])}

Delivered

{(P1, d11, [P,0.8]),

(SP2, d21, [P,0.7])}

d12

P

0.3

d31

D

0.6

d32

D

0.5

d33

D

0.1

d21

D

0.7

d22

P

0.4

d23

S

0.3

d41

S

0.9

d42

S

0.5

d43

S

0.2

{d11}

{d11, d21}

S

ports

P

olitics

News

d21

P

0.7

d31

P

0.6

d41 S

0.9

(68)

14.5 Overview

1. Introduction

2. Content Searching in Peer-to-Peer Applications

1. Problems in Peer-to-Peer Information Retrieval

2. Related Work in Distributed Information Retrieval

3. Index structures for Query Routing

1. Distributed Hash Tables for Information Retrieval

2. Routing Indexes for Information Retrieval

3. Locality-Based Routing Indexes

4. Supporting Effective Information Retrieval

1. Providing Collection-Wide Information

2. Estimating the Document Overlap

3. Prestructuring Collections with Taxonomies

(69)

14.5 Summary and Conclusion

In today‟s P2P systems only

exact match keyword

retrieval

is prevalent (usually on meta-data)

Information retrieval in P2P scenarios is needed

Individual, loosely coupled document collections need fulltext

retrieval and ranking techniques

Applications range from shared working environments e.g. in

project groups, to distributed digital libraries

Almost all IR systems use at least some

global statistics

,

in P2P infrastructures the dissemination of necessary

statistics becomes a performance bottleneck

Trade-off between

cached, but sometimes stale statistics

and

new,

but expensively updated statistics

needs to be managed

How much staleness does a „sufficient‟ retrieval effectiveness

allow?

(70)

14.5 Summary and Conclusion

Choosing the „right‟ collections for querying improves retrieval

efficiency

Containing most promising documents with possibly little overlap

Small worlds offer quick connections to semantically close

collections

Query routing indexes can handle some network churn while

providing results of sufficient quality

Local indexes can be efficiently maintained

Can exploit advantages by Zipf-distributed content allocations and

querying behavior

Need to contact only small numbers of peers

Supporting techniques like efficient CWI estimation/

dissemination or taxonomies of document categories

References

Related documents

• This course covers methods for analysis of data from Illumina and Ion Torrent high- throughput sequencing, with or without a reference genome sequence, using free and

The purpose of this study was to determine to what extent an education and training module on the proper use of NEXRAD-based products in the cockpit, developed using

Therefore, there is inevitable need to develop and design a syllabus and instructional materials that appropriate for the teaching of Essay Writing course for

Adapun definisi operasional dan pengukuran variabel penelitian adalah sebagai berikut: Kinerja keuangan pemerintah daerah adalah kemampuan suatu daerah untuk menggali dan

Hierarchies : Example 2 Hierarchies : Example 2 R1 Person Dept Student Graduate PhD SSN marital status ID Name first mi last phone DOB R3 Thesis GradDate Defense Date Text Title

The atherosclerotic plaque area in aortic sinus and arch, plasma lipid profile, fatty acid composition, hepatic enzyme activities and gene expression were determined.. A

As reported last year, the City of Cambridge notified MWRA in the fall of 2012 that new information gained from its design of the CAM004 sewer separation project had caused it to

[r]