Wolf-Tilo Balke
Sascha Tönnies
Institut für Informationssysteme
Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de
Peer-to-Peer
14. Overview
1. Introduction
2. Content Searching in Peer-to-Peer Applications
1. Problems in Peer-to-Peer Information Retrieval
2. Related Work in Distributed Information Retrieval
3. Index structures for Query Routing
1. Distributed Hash Tables for Information Retrieval
2. Routing Indexes for Information Retrieval
3. Locality-Based Routing Indexes
4. Supporting Effective Information Retrieval
1. Providing Collection-Wide Information
2. Estimating the Document Overlap
3. Prestructuring Collections with Taxonomies
14.1 What is IR?
•
Information retrieval
(
IR
) is the science of
searching for documents, for information within
documents and for metadata about documents
–
A user enters a
query
, i.e.
an information need, into
the system
–
Several objects may match
the query with different
14.1 Representing Text
•
How do we represent the
complexities of language?
–
Computers don‟t “understand”
documents or queries
•
Simple, yet effective approach: bag of words
–
Treat all the words in a document as index terms for
that document
–
Assign a “weight” to each term based on its
“importance”
14.1 Representing Text
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its
french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is
cutting the amount of "bad" fat in its french fries
nearly in half, the fast-food chain said Tuesday as it
moves to make all its fried menu items healthier.
But does that mean the popular shoestring fries won't
taste the same? The company says no. "It's a win-win
for our customers because they are getting the same
great french-fry taste along with an even healthier
nutrition profile," said Mike Roberts, president of
McDonald's USA.
But others are not so sure. McDonald's will not
specifically discuss the kind of oil it plans to use, but at
least one nutrition expert says playing with the
formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's (MCD:
down $0.54 to $23.22, Research, Estimates) were
lower Tuesday afternoon. It was unclear Tuesday
whether competitors Burger King and Wendy's
International (WEN: down $0.80 to $34.91, Research,
Estimates) would follow suit. Neither company could
immediately be reached for comment.
…
16 × said
14 × McDonalds
12 × fat
11 × fries
8 × new
6 × company, french, nutrition
5 × food, oil, percent, reduce,
taste, Tuesday
…
14.1 Retrieval
•
Retrieving
relevant
information is hard!
–
Evolving, ambiguous user needs, context, etc.
–
Complexities of language
•
To
operationalize
information retrieval, we
must vastly simplify the picture
–
Information retrieval is all (and only) about matching
words in documents with words in queries
–
Obviously, not true…
14.1 Representing Documents as Vectors
The quick brown
fox jumped over
the lazy dog’s
back.
Document 1
Document 2
Now is the time
for all good men
to come to the
aid of their party.
the
is
for
to
of
quick
brown
fox
over
lazy
dog
back
now
time
all
good
men
come
jump
aid
their
party
0
0
1
1
0
1
1
0
1
1
0
0
1
0
1
0
0
1
1
0
0
1
0
0
1
0
0
1
1
0
1
0
1
1
Term
Do
cu
m
en
t
1
Do
cu
m
en
t
2
Stopword
List
14.1 Representing Text
document
structured
recognition
accents,
spacing,
etc.
stopwords
noun
stemming
groups
automatic
or manual
indexing
text +
structure
text
structure
full text
index terms
How to compare
14.1 Boolean Retrieval
•
Weights assigned to terms are either
“0”
or
“1”
–
“0” represents “absence”: term
isn’t
in the document
–
“1” represents “presence”: term
is
in the document
•
Build queries by combining terms with Boolean
operators
–
AND, OR, NOT
•
The system returns all documents that satisfy the
14.1 Boolean View of a Document-Set (=Collection)
quick
brown
fox
over
lazy
dog
back
now
time
all
good
men
come
jump
aid
their
party
0
0
1
1
0
0
0
0
0
1
0
0
1
0
1
1
0
0
1
0
0
1
0
0
1
0
0
1
1
0
0
0
0
1
Term
Doc
1
Do
c
2
0
0
1
1
0
1
1
0
1
1
0
0
1
0
1
0
0
1
1
0
0
1
0
0
1
0
0
1
0
0
0
0
0
1
Do
c
3
Do
c
4
0
0
0
1
0
1
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
0
1
0
1
0
0
1
Do
c
5
Do
c
6
0
0
1
1
0
0
1
0
0
1
0
0
1
0
0
1
0
1
0
0
0
1
0
0
1
0
0
1
1
1
1
0
0
0
Do
c
7
Do
c
8
Each column represents the view of
a particular document: What terms
are contained in this document?
Each row represents the view of a
particular term: What documents
contain this term?
To execute a query, pick out rows
corresponding to query terms
and then apply logic table of
corresponding Boolean operator
14.1 Sample Queries
fox
dog
0
0
0
0
1
1
0
0
1
1
0
0
0
1
0
0
Term
Do
c
1
Do
c
2
Do
c
3
Do
c
4
Do
c
5
Do
c
6
Do
c
7
Do
c
8
dog
fox
0 0 1 0 1 0 0 0
dog
fox
0 0 1 0 1 0 1 0
dog
fox
0 0 0 0 0 0 0 0
fox
dog
0 0 0 0 0 0 1 0
dog AND fox
Doc 3, Doc 5
dog OR fox
Doc 3, Doc 5, Doc 7
dog NOT fox
empty
fox NOT dog
Doc 7
good
party
0
0
1
0
0
0
1
0
0
0
1
1
0
0
1
1
g
p
0 0 0 0 0 1 0 1
g
p
o
0 0 0 0 0 1 0 0
good AND party
Doc 6, Doc 8
over
1 0 1 0 1 0 1 1
good AND party NOT over
Doc 6
Term
Do
c
1
Do
c
2
Do
c
3
Do
c
4
Do
c
5
Do
c
6
Do
c
7
Do
c
8
14.1 The Perfect Query Paradox
•
Every information need has a
perfect set of
documents
–
If not, there would be no sense doing retrieval
•
Every document set has a
perfect query
–
AND every word in a document to get a query for it
–
Repeat for each document in the set
–
OR every document query to get the set query
•
But can users realistically be expected to
formulate
this perfect query?
14.1 Why Boolean Retrieval fails
•
Natural language is way more complex
•
AND “discovers” nonexistent relationships
–
Terms in different sentences, paragraphs, …
•
Guessing terminology for OR is hard
–
good, nice, excellent, outstanding, awesome, …
•
Guessing terms to exclude is even harder!
14.1 Strengths and Weaknesses
•
Strengths
–
Precise, if you have a clear idea of what you‟re looking for
–
Efficient for the computer
•
Weaknesses
–
Users must learn Boolean logic
–
Boolean logic insufficient to capture the richness of
language
–
No control over size of result set: either too many
documents or none
–
All documents in the result set are considered “equally
good”
–
What about partial matches? Documents that “don‟t quite
match” the query may be useful also
14.1 Ranked Retrieval
•
Order
documents by how likely they are to be
relevant to the information need
–
Present hits one screen at a time
–
At any point, users can continue browsing through
ranked list or reformulate query
•
Attempts to retrieve relevant documents directly,
14.1 Why Ranked Retrieval?
•
Arranging documents by relevance is
–
Closer to how humans think: some documents are
“better” than others
–
Closer to user behavior: users can decide when to
stop reading
•
Best (partial) match: documents need not have all
query terms
–
Although documents with more query terms should
14.1 Similarity-based Retrieval?
•
Let‟s replace relevance with “similarity”
–
Rank documents by their similarity with the query
•
Treat the query as if it were a document
–
Create a query bag-of-words
•
Find its similarity to each document
•
Rank order the documents by similarity
14.1 Vector Space Model
Postulate:
Documents that are “close together” in vector
space “talk about” the same things
t
1
d
2
d
1
d
3
d
4
d
5
t
3
t
2
θ
φ
Therefore, retrieve documents based on how close the
14.1 How to Weight Terms?
•
Idea: Hans Peter Luhn 1958, IBM
•
Here‟s the intuition:
–
Terms that appear often in a document
should get high weights
–
Terms that appear in many documents should get low
weights
•
How do we capture this mathematically?
–
Term frequency
–
Inverse document frequency
The more often a document contains the term “dog”,
the more likely that the document is “about” dogs.
Words like “the”, “a”, “of” appear in (nearly) all
documents.
14.1 TFxIDF
•
TFxIDF [Gerald Salton, 1961]
•
T
erm
F
requency (TF)
–
How often a term appears in a document
–
can be calculated locally
•
D
ocument
F
requency (DF)
–
Number of documents, which contain a specific term
–
Needs global (system wide) knowledge
•
I
nverse
D
ocument
F
requency (IDF)
–
Discriminator for the importance of a term regarding the number of
occurrences in all documents
14.1 Working on Indices
quick
brown
fox
over
lazy
dog
back
now
time
all
good
men
come
jump
aid
their
party
0
0
1
1
0
0
0
0
0
1
0
0
1
0
1
1
0
0
1
0
0
1
0
0
1
0
0
1
1
0
0
0
0
1
Term
Do
c
1
Do
c
2
0
0
1
1
0
1
1
0
1
1
0
0
1
0
1
0
0
1
1
0
0
1
0
0
1
0
0
1
0
0
0
0
0
1
Do
c
3
Doc
4
0
0
0
1
0
1
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
0
1
0
1
0
0
1
Do
c
5
Do
c
6
0
0
1
1
0
0
1
0
0
1
0
0
1
0
0
1
0
1
0
0
0
1
0
0
1
0
0
1
1
1
1
0
0
0
Do
c
7
Do
c
8
The term-document matrix again
has “bag of words” information
14.1 Small yet Fast?
•
Can we make this data structure smaller, keeping
in mind the need for fast retrieval?
•
Observations:
–
The nature of the search problem requires us to
quickly find which documents contain a term
–
The term-document matrix is very sparse
14.1 Posting Lists
quick
brown
fox
over
lazy
dog
back
now
time
all
good
men
come
jump
aid
their
party
0
0
1
1
0
0
0
0
0
1
0
0
1
0
1
1
0
0
1
0
0
1
0
0
1
0
0
1
1
0
0
0
0
1
Term
Do
c
1
Do
c
2
0
0
1
1
0
1
1
0
1
1
0
0
1
0
1
0
0
1
1
0
0
1
0
0
1
0
0
1
0
0
0
0
0
1
Do
c
3
Do
c
4
0
0
0
1
0
1
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
0
1
0
1
0
0
1
Do
c
5
Do
c
6
0
0
1
1
0
0
1
0
0
1
0
0
1
0
0
1
0
1
0
0
0
1
0
0
1
0
0
1
1
1
1
0
0
0
Do
c
7
Do
c
8
Postings
1, 3
1, 3, 5, 7
3, 5, 7
1, 3, 5, 7, 8
1, 3, 5, 7
3, 5
1, 3, 7
2, 6, 8
2, 4, 6
2, 4, 6
2, 4, 6, 8
2, 4, 8
2, 4, 6, 8
3
4, 8
1, 5, 7
6, 8
14.1 Inverted Document Index
quick
brown
fox
over
lazy
dog
back
now
time
all
good
men
come
jump
aid
their
party
Term
Postings
1, 3
1, 3, 5, 7
3, 5, 7
1, 3, 5, 7, 8
1, 3, 5, 7
3, 5
1, 3, 7
2, 6, 8
2, 4, 6
2, 4, 6
2, 4, 6, 8
2, 4, 8
2, 4, 6, 8
3
4, 8
1, 5, 7
6, 8
14.1 What goes in the Postings?
•
Boolean retrieval
–
Just the document number
•
Ranked Retrieval
–
Document number and term weight (
tf.idf
, ...)
•
Proximity operators
–
Word offsets for each occurrence of the term
14.2 Overview
1. Introduction
2. Content Searching in Peer-to-Peer Applications
1. Problems in Peer-to-Peer Information Retrieval
2. Related Work in Distributed Information Retrieval
3. Index structures for Query Routing
1. Distributed Hash Tables for Information Retrieval
2. Routing Indexes for Information Retrieval
3. Locality-Based Routing Indexes
4. Supporting Effective Information Retrieval
1. Providing Collection-Wide Information
2. Estimating the Document Overlap
3. Prestructuring Collections with Taxonomies
14.2 Information Retrieval in P2P Systems
•
Information Retrieval deals with
complex documents
–
Meta-data can only capture some aspects of a document, but
not anticipate all semantic searches
•
E.g. sports-related newspaper article, but no names, locations, etc.
–
Support for full-text searches needed
•
Find the
best-matching document
from the
best-connected peer
–
Unlike in file sharing emphasis is on the document quality
–
If there are multiple sources offering similar quality documents,
choose best peer according to connection, etc.
14.2 Challenges in P2P IR
•
Efficient query evaluation scheme
–
Central inverted index of documents is expensive to maintain
–
How to disseminate a peer„s query?
•
Simple flooding of all queries is not scalable, if „best„ documents have to be
found (not just some match)
•
Dealing with network churn
–
A peer can always alter the set of documents offered, or significantly
change individual documents
–
Peers may join and leave the network, i.e. whole document
collections may disappear, or can be added
•
Integration of collection-wide information
–
Peers are not able to calculate IR-style scorings from
local knowledge, but needs some knowledge from the
(virtual) merged collection
–
Constant dissemination of collection-wide information
needs a lot of bandwidth
14.2 Example: Problem of Collection-wide
Information
•
Example: Different news collections, query on keyword
„basketball„
–
General news collection, e.g.
•
Many articles, only few about basketball, therefore IDF small
•
Keyword discriminates well between articles
–
NBA news collection
•
Few articles, almost all about basketball, therefore IDF high
•
Keyword hardly discriminates between articles
–
Merged collection: IDF medium
…B...
…A...
…B...
…A...
…A...
…B...
14.2 Example: Problem of Collection-wide
Information
Query:
A and B
TF=1 IDF=3/2
TF=1 IDF= 3/1
local scoring
TF=1 IDF=3/1
TF=1 IDF= 3/2
local scoring
…B…
…A…
Top object
Top object
global scoring
all objects identical
TF = 1 IDF = 6/3
Querying Peer
Peer 1
14.2 Distributed IR
•
Distributed information retrieval techniques grew
increasingly important for searching Web sources
–
Abstracts of information sources
•
To support distributed retrieval sources have to register
abstracts or keyword sets
•
Abstracts can either be kept in a central repository or
distributed by gossiping algorithms, e.g. PlanetP
[Cuenca-Acuna et al., „03]
–
Collection selection
•
Having no central index needs a sophisticated way of
choosing the most promising collections for querying
14.2 Distributed IR
•
Such abstracts can be compactly represented by
Bloom
Filters
, i.e. bit vectors that allow membership queries
–
Each term is hashed with n different functions and the position
in the bit vector for each hash value is set to 1
–
Allows for false positives, but no false negatives
14.2 Distributed IR
•
Benefit estimators
for collection selection use aggregated statistics
about individual collections for selection, e.g. CORI measure
[Callan et al., „95]
CORI calculates collection score s
i
for collection i regarding query q:
with
and
where
n
is the number of collections,
cdf
the collection document
frequency,
cdf
max
the maximum
cdf
and
cf
t
the collection frequency of
14.3 Overview
1. Introduction
2. Content Searching in Peer-to-Peer Applications
1. Problems in Peer-to-Peer Information Retrieval
2. Related Work in Distributed Information Retrieval
3. Index structures for Query Routing
1. Distributed Hash Tables for Information Retrieval
2. Routing Indexes for Information Retrieval
3. Locality-Based Routing Indexes
4. Supporting Effective Information Retrieval
1. Providing Collection-Wide Information
2. Estimating the Document Overlap
3. Prestructuring Collections with Taxonomies
14.3 Index Structures for Query Routing
•
Traditional index structures cannot be readily
employed in P2P systems
–
High degree of distribution
–
High degree of volatility (churn)
•
Distributed paradigms needed to route queries to
appropriate peers
–
Simple flooding method does not scale
–
Distributed hash table lookup
–
Using indexed routing information
–
Using shortcut overlays
High degree of
index maintenance
14.3 Distributed Hash Tables for IR
•
Distributed hash tables
–
Route queries to appropriate peers with number of hops
logarithmic in network size
–
No peer needs to maintain more than logarithmic amount
of routing information
–
But…
•
Exact match queries only
•
All new content has to be published, if peers join/change
•
Old content has to be unpublished, if peers leave
•
Documents added/removed will contain a lot of different terms
to be published/unpublished. Thus, usually many index peers have
to be addressed
•
Conjunction of query terms needs to access many peers, but
there is still no guarantee that a single document with the
conjunction exists
14.3 Distributed Hash Tables for IR
•
Improvement:
Hybrid P2P
infrastructures [Loo et al.,‟04]
–
Efficiency of DHT is worst, if
highly replicated
items are
requested
•
Experiments show worse behavior than flooding, degrading with churn
–
Querying and content allocation
follow
Zipf-distribution
•
Only few highly replicated and
often queried items
•
„People are looking for hay,
not for needles‟ (S. Shenker)
–
Hybrid P2P infrastructures use DHTs
only for the less replicated and rarely
queried items, all other queries are flooded
–
Still, DHTs have to be maintained for the majority of query
terms
Query Frequency Distribution
0,00% 2,00% 4,00% 6,00% 8,00% 10,00% 12,00% 14,00% 16,00% 0 10 20 30 40 50 60 70 80 90 Query O c c urr e nc e Fre qu e nc y
14.3 Routing Indexes for IR
•
Routing indexes
are
local
collections of (key, peer) pairs
–
Key is either a keyword or a query
–
Peer is the address of a peer that either offers relevant results, or
routes the query to other peers with relevant result
•
In contrast to flooding only „interesting‟ directions are queried
–
Often distinguished between links in the default network (directions
of content providers) and overlay structure of direct links to content
providers („shortcuts‟)
•
First introduced by [Crespo & Garcia-Molina, „02] to choose
best neighbors in the default network for query forwarding
–
Index maintenance is of local nature and index coverage is usually
high due to Zipf distribution of requests
14.3 Routing Indexes for IR
•
Routing index policies in the face of
network churn
–
With restricted index sizes new entries are collected and
always stored. If the maximum size is reached, some stale
information is replaced
•
A simple strategy always replaces the currently oldest index
entries
•
„Least recently used‟ (LRU) strategy assigns higher usefulness to
entries that have been successfully used recently
•
Optimal index size is a problematic parameter
–
Indexes with unrestricted size have to combat network
churn differently
•
„time to live‟ assigns an expiry time for each new index entry
•
„forgetting factors‟ can periodically weigh down reliability of link
information
14.3 An Algorithm for Correct Query Routing
•
Goal:
progressive distributed top-k ranking of
documents
•
Putting techniques together to design an efficient
top-k algorithm
–
Minimal number of object transfers
–
Optimal number of object accesses
•
Features of the P2P based approach
–
Optimized Query-Routing
–
No global Index
14.3 Bird’s View
1. Distribute query through the network (Routing)
2. Every peer scores documents locally (Ranking)
3. Hierarchical construction of the final result (Merging)
4. Optimized query routing (Index)
14.3 Building Blocks
local ranking
query-driven index
result
merging
Structured network
14.3 Network Structure
•
Observation: peers strongly differ in availability,
bandwidth, computing power, …
•
Hierarchical network structure with super-peers
–
Query routing
–
Result merging
–
Indexes
14.3 Network topology
•
Super-peers as hypercube (HyperCuP protocol)
•
Resilient against leaving peers
•
Broadcast with (n-1) messages, log
2
(n) hops
0 0 1 0 1 1 1 0 2 2 2 2