Brief (non-technical) history

(1)

CS6322:

CS6322: Information Retrieval

Information Retrieval

Sanda Harabagiu

Lecture 10: Web search basics

(2)

Brief (non-technical) history

_{Early keyword-based engines ca. 1995-1997}

_{Altavista, Excite, Infoseek, Inktomi, Lycos}

_{Paid search ranking: Goto (morphed into}

Overture.com

→

Yahoo!)

_{Your search ranking depended on how much you}

paid

(3)

CS6322: Information Retrieval CS6322: Information Retrieval

Brief (non-technical) history

_{1998+: Link-based ranking pioneered by Google}

_{Blew away all early engines save Inktomi}

_{Great user experience in search of a business model}

_{Meanwhile Goto/Overture’s annual revenues were nearing $1 billion}

_{Result: Google added paid search “ads” to the side,}

independent of search results

_{Yahoo followed suit, acquiring Overture (for paid placement) and}

Inktomi (for search)

_{2005+: Google gains search share, dominating in Europe and}

very strong in North America

(4)

Algorithmic results. Paid

(5)

Web search basics

The Web

Ad indexes

Web Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)

Miele, Inc -- Anything else is a compromise

At the heart of your home, Appliances by Miele. ... USA. to miele.com. Residential Appliances. Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System ... www.miele.com/ - 20k - Cached - Similar pages

Miele

Welcome to Miele, the home of the very best appliances and kitchens in the world. www.miele.co.uk/ - 3k - Cached - Similar pages

Miele - Deutscher Hersteller von Einbaugeräten, Hausgeräten ... - [ Translate this page ]

Das Portal zum Thema Essen & Geniessen online unter www.zu-tisch.de. Miele weltweit ...ein Leben lang. ... Wählen Sie die Miele Vertretung Ihres Landes. www.miele.de/ - 10k - Cached - Similar pages

Herzlich willkommen bei Miele Österreich - [ Translate this page ] Herzlich willkommen bei Miele Österreich Wenn Sie nicht automatisch weitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERÄTE ... www.miele.at/ - 3k - Cached - Similar pages

User Needs

_{Need [Brod02, RL04]}

_{Informational} _{– want}_{to learn} _{about something (~40% / 65%)} _Navigational _{– want}_{to go}_{to that page (~25% / 15%)}

_{Transactional} _{– want}_{to do something} _{(web-mediated) (~35% / 20%)}

_{Access a service} _Downloads _Shop

_{Gray areas}

_{Find a good hub}

_{Exploratory search “see what’s there”}

Low hemoglobin United Airlines

Seattle weather

Mars surface images Canon S410

(7)

How far do people look for results?

(8)

Users’ empirical evaluation of results

_{Quality of pages varies widely}

_{Relevance is not enough}

_{Other desirable qualities (non IR!!)}

_{Content: Trustworthy, diverse, non-duplicated, well maintained} _{Web readability: display correctly & fast}

_{No annoyances: pop-ups, etc}

_{Precision vs. recall}

_{On the web, recall seldom matters}

_{What matters}

_{Precision at 1? Precision above the fold?}

_{Comprehensiveness – must be able to deal with obscure queries}

_{Recall matters when the number of matches is very small}

_{User perceptions may be unscientific, but are significant}

(9)

Users’ empirical evaluation of engines

_{Relevance and validity of results}

_{UI – Simple, no clutter, error tolerant} _{Trust – Results are objective}

_{Coverage of topics for polysemic queries} _{Pre/Post process tools provided}

_{Mitigate user errors (auto spell check, search assist,…)} _{Explicit: Search within results, more like this, refine ...} _{Anticipative: related searches}

_{Deal with idiosyncrasies}

_{Web specific vocabulary}

_{Impact on stemming, spell-check, etc}

_{Web addresses typed in the search box}

(10)

The Web document collection

_{No design/co-ordination}

_{Distributed content creation, linking,}

democratization of publishing

_{Content includes truth, lies, obsolete}

information, contradictions …

_{Unstructured (text, html, …),}

semi-structured (XML, annotated photos), structured (Databases)…

_{Scale much larger than previous text}

collections … but corporate records are catching up

_{Growth – slowed down from initial}

“volume doubling every few months” but still expanding

_{Content can be}_{dynamically generated}

(11)

Spam

(12)

The trouble with paid search ads …

_{It costs money. What’s the alternative?}

_{Search Engine Optimization:}

_{“Tuning” your web page to rank highly in the}

algorithmic search results for select keywords

_{Alternative to paying for placement} _{Thus, intrinsically a marketing function}

_{Performed by companies, webmasters and}

consultants (“Search engine optimizers”) for their

clients

(13)

Search engine optimization (Spam)

_Motives

_{Commercial, political, religious, lobbies} _{Promotion funded by advertising budget}

_Operators

_{Contractors (Search Engine Optimizers) for lobbies, companies} _{Web masters}

_{Hosting services}

_Forums

_{E.g., Web master world ( www.webmasterworld.com} ₎

_{Search engine specific tricks}

_{Discussions about academic papers}☺

(14)

Simplest forms

_{First generation engines relied heavily on}_tf/idf

_{The top-ranked pages for the query}_{maui resort} _{were the}

ones containing the most maui’s and resort’s

_{SEOs responded with dense repetitions of chosen terms}

_e.g.,_{maui resort maui resort maui resort} _{Often, the repetitions would be in the same color as the}

background of the web page

_{Repeated terms got indexed by crawlers} _{But not visible to humans on browsers}

Pure word density cannot be trusted as an IR signal

(15)

Variants of keyword stuffing

_{Misleading meta-tags, excessive repetition}

_{Hidden text with colors, style sheet tricks, etc.}

Meta-Tags =

“… London hotels, hotel, holiday inn, hilton, discount,

booking, reservation, sex, mp3, britney spears, viagra, …”

(16)

Cloaking

_{Serve fake content to search engine spider}

_{DNS cloaking: Switch IP address. Impersonate}

Is this a Search Engine spider? Y N SPAM Real Doc Cloaking

(17)

More spam techniques

_{Doorway pages}

_{Pages optimized for a single keyword that re-direct to the}

real target page

_{Link spamming}

_{Mutual admiration societies, hidden links, awards – more}

on these later

_{Domain flooding:} _{numerous domains that point or}

re-direct to a target page

_Robots

_{Fake query stream – rank checking programs}

_{“Curve-fit” ranking programs of search engines}

_{Millions of submissions via Add-Url}

(18)

The war against spam

_{Quality signals - Prefer}

authoritative pages based on:

_{Votes from authors (linkage}

signals)

_{Votes from users (usage signals)}

_{Policing of URL submissions}

_{Anti robot test}

_{Limits on meta-keywords} _{Robust link analysis}

_{Ignore statistically implausible}

linkage (or text)

_{Use link analysis to detect}

spammers (guilt by association)

_{Spam recognition by}

machine learning

_{Training set based on known}

spam

_{Family friendly filters}

_{Linguistic analysis, general}

classification techniques, etc.

_{For images: flesh tone}

detectors, source text analysis, etc.

_{Editorial intervention}

_Blacklists

_{Top queries audited} _{Complaints addressed} _{Suspect pattern detection}

(19)

More on spam

_{Web search engines have policies on SEO practices}

they tolerate/block

_{http://help.yahoo.com/help/us/ysearch/index.html} _{http://www.google.com/intl/en/webmasters/}

_{Adversarial IR: the unending (technical) battle}

between SEO’s and web search engines

(20)

(21)

What is the size of the web ?

_Issues

_{The web is really infinite}

_{Dynamic content, e.g., calendar}

_{Soft 404: www.yahoo.com/<anything>} _{is a valid page}

_{Static web contains syntactic duplication, mostly due to}

mirroring (~30%)

_{Some servers are seldom connected}

_{Who cares?}

_{Media, and consequently the user} _{Engine design}

_{Engine crawl policy. Impact on recall.}

(22)

Nielsen NetRatings Search Engine Ratings - July 2005

The Nielsen NetRatings

MegaView Search reporting service measures the

search behavior of more than a million people worldwide. These web surfers have real-time meters on their computers which monitor the sites they visit.

(23)

(24)

Web Challenges

_{Distributed data}

_{High percentage of volatile data} _{Large volume}

_{Unstructured and redundant data} _{Quality of data}

_{Heterogeneous data}

_{difference from standard IR systems: all queries must be}

answered without accessing the text.

S e ve ra l a rc h ite ct u re s fo r se a rc h e n g in e s

(25)

Centralized Architecture

Interface Query Engine Indexer Crawler Index

(26)

Distributed Architecture

_The_{Harvest distributed architecture} _{requires the}

coordination of several Web servers

Replication Manager Broker Object Cache User Broker Gatherer Web site

Gatherers = collect and extract indexing information from one

or more web servers

Brokers = provide the indexing information and the query interface

(27)

Search Engines

_{Google case (1997)}

_{Name is a common spelling of}_googol ₁₀100

_{Goal: build a very large-scale search engine} _{Index : 1997 100million; 2002 2 billion}

_{In 1997 their forecast was that in 2000 they will have 1 billion page}

index (they had 2 billion pages!!!)

_{World Wide Web Worm (1994)}

(28)

Web Search Engines:

Google

_Hyperlinks

_{represent new forms of information}

_Hypertexts

_{have existed and been studied for a long}

time

_New

_{a large number of hyperlinks created by}

independent individuals

_{Hyperlinks provide a valuable source of information}

(29)

CrawlerCrawlerCrawler

Google Anatomy

URL Server Store Server

Anchors

URL Resolver Indexer

Repository

Lexicon Links

Doc

Index SorterSorterSorter

Pagerank _Searcher

(30)

Google Indexer

Functions

_{Reads the repository}

_{Uncompress the documents, parse them}

_{Each document is converted into a set set of word occurrences}_hits

_{1 hit records [word, position in document, font size, capitalization]}

_{The hits are distributed in a set of “}_barrels_{” partially sorted forward}

index

_{Parses all links and stores info about them in anchors file:}

(31)

URL Resolver for Google

Reads the anchors file

_{Converts relative URLs in absolute URLs}

_{Translate them in}_docID_s

_{Puts the anchor text into the forward index, associated}

with the docID that the anchor points to

_{Generates a database of links = pairs of docIDs}

Sorter

– takes the barrels (sorted by

docID

) and resorts

them by

wordID

to generate the inverted index.

_{Sorting in place}_{little temporary space is needed.}

Dump Lexicon

– a program that takes the inverted list

and the lexicon produced by the indexer

(32)

Major Data Structures in Google

Optimized for crawling/indexing/searching large document

collections.

Trick

_{avoid disk seeks (10ms per seek)}

_{influences the design of data structures}

_{Big files}

_Repository

_{Document Index}

_Lexicon

_{Hit list}

_{Forward index}

_{Inverted index}

(33)

Big Files

_{Virtual files spanning multiple file systems}

addressable by 64 bit integers

_{Allocation among multiple file systems is handled}

automatically

_{Allocates and de-allocates descriptors} _{Compression options}

(34)

Repository

_{Contains the full HTML of every web page}

_{Compression using zlib}

Repository: 53.5 GB = 147.8 GB uncompressed

…

Packet (stored compressed in repository)

sync length Compressed packet sync length Compressed packet

(35)

Document Index

_{Information about each document}

_{Solution: fixed width ISAM (}_{Index sequential access mode}₎

index, ordered by docID

_{Each query:}

Lexicon: 14 million of words

_{List of words + hash table of pointers} _{__Sanda_Harabagiu_research__} current document status Pointer to repository Document checksum statistics

(36)

Hit Lists

_{Account for most of the space used in both the forward}

and inverted indices

_{It contains 2 forms of information:}

1. documents where a word occurred,

2. position in document and font info

_{Encoding position, font, capitalization}

_{Simple encoding (3 integers)}

_{Compact encoding (hand-optimized allocation of bits)} _{Huffman encoding}

_{Select – hand optimized compact encoding}

_{Less space than simple encoding}

(37)

Google Compact Coding

_{Uses 2 bytes for every hit}

_{There are two types of hits:}

_{fancy hits}

_and

_{plain hits}

_:

1. Fancy lists _{hits occurring in a URL, title, anchor text or meta}

tag

2. Plain hits _{all the other}

The codification:

A plain hit has

_{1 capitalization bit} _{font size (3 bits)}

_{12 bits of word position in document (positions higher than}

4096 are labeled 4096)

(38)

Google Compact Coding - cont

A fancy hit has

_{1 capitalization bit} _{Font size set to 7}

_{4 bits encode the type of fancy hit} _{8 bits for position}

_{For anchor hits the 8 bits for position are split:}

_{4 bits for position anchor}

_{4 bits for hash of the}_docID _{the anchor occurs in}

Hit: 2 bytes

Cap:1 Imp:3 Position:12 Cap:1 Imp=7 Type:4 Position:8 Cap:1 Imp=7 Type:4 Hash:4 Pos:4

(39)

Forward Index

_{It is partially sorted}

_{Stored in 64 barrels. Each barrel holds a range of word Ids}

_{If a document contains words that fall in a particular barrel, the}

docID is recorded into the barrel followed by a list of word Ids with hit-lists corresponding to the words

Forward barrels: total 43 GB

docid Wordid:24 Nhits:8 Hit hit hit hit Wordid:24 Nhits:8 Hit hit hit hit Null wordid

docid Wordid:24 Nhits:8 Hit hit hit hit Wordid:24 Nhits:8 Hit hit hit hit Wordid:24 Nhits:8 Hit hit hit hit Null wordid

(40)

Forward Index -cont

_DocIds

_{are duplicated –reasonable for 64 buckets but}

saves considerable time and coding complexity

_{Instead of storing actual word Id, store relative}

difference to minimum word ID in the barrel. Use

only 24 bits for word ID.

(41)

Inverted Index

_{On the same barrels as forward index}

_{For every valid wordID} _{– there is a pointer in the lexicon to the}

barrel that the wordID falls into:

_{Barrel points to a doclist of}_docIDs _{+ corresponding hit lists}_.

_{Issue: in what order the docIDs} _{should appear in the doclist ?}

_{Solution 1}_{: store them sorted by docID – quick mapping for multiple}

word queries

_{Solution 2}_{: store them sorted by a ranking of occurrence of the word in}

each document – merging is difficult,

_{Solution (Google):} _{compromise: 2 sets of inverted barrels:}

_{1 set for hit lists which include title and anchor hits} _{1 set for the other hit lists.}

(42)

Crawling the Web

Running a Web Crawler is a challenging task

_{Crawling is the most fragile application – it involves}

interacting with hundreds of thousands of Web

servers and various name servers

_{beyond the}

control of the system

_{Google solution: fast distributed crawling system}

_{The URL server serves lists of URLs to 3-5 crawlers (both}

server and crawlers are implemented in Python)

_{Each crawler keeps 300 connections open at once to}

retrieve Web pages at a fast enough pace.

(43)

Google crawler

_{Performance stress: DNS lookup}

_{Each crawler maintains its own DNS cache (it does not need to do a DNS}

lookup before crawling each document)

_{Each of the 300 connections can be in a different state: Lookup DNS,}

Connect to host, Send request or Receive response

_{The Robot Exclusion Standard}

(devised in 1994 by Martin Kostner)

_{Declares that a Web server administrator should create a document}

accessible at the relative URL /robots.txt

_E.g.

User-agent: * Disallow: /cgi Disallow: /stats

(44)

The robots.txt file

_{3 basic directives can be in robots.txt:}_User-agent_,_Allow_and_Disallow _{If the robots.txt file contains:}

User-agent: * Disallow: /

the administrator wants to shut out all robots from the entire site

_{Multiple User-agents can be specified:}

User-agent: friendly indexes User-agent: search-thingy Disallow: /cgi-bin/ Allow: / _{Another example:} User-agent: * Disallow: / User-agent: search-thingy Allow: /

(45)

Google Initial Performance

Storage Statistics

Total size of Fetched Pages 147.8 GB Compressed Repository 53.5 GB Short Inverted Index 4.1 GB Full Inverted Index 37.2 GB Lexicon 293 MB Temporary anchor data

(not in total) 6.6 GB Document Index Incl.

Variable Width Data 9.7 GB Links Database 3.9 GB Total Without Repository 55.2 GB Total With Repository 108.7 GB

Web Page Statistics Number of

Web Pages Fetched

24 million

Number of

URLs seen 76.5 million Number of Email addresses 1.7 million Number of 404’s 1.6 million

(46)

The Web Macroscopic Structure

_AltaVista experiments IN 44 Million nodes SCC 56 Milion nodes Tendrils 44 Million nodes Tubes Disconnected components

90% are in a single, giant Strongly Connected Component (SCC) IN = pages that can reach SCC, but cannot be reached from it OUT = pages reached from SCC, but cannot access SCC

TENDRILS = pages that cannot reach SCC and cannot be reached by SCC All 4 sets have roughly the same size _{big surprise} _!

(47)

More Data

_{The diameter of the central core of SCC is at least 28, the}

diameter of the graph is > 500

_{The probability that a path exists from the source to the}

destination 0.24

_{If a directed path exists, its average length is 16}

_{The Power law for in-degree}

_{The probability that a node has in-degree I is proportional to 1/i}x _for

some x > 1

(48)

What can we attempt to measure?

_{The relative sizes of search engines}

_{The notion of a page being indexed is still} _reasonably _well

defined.

_{Already there are problems}

_{Document extension: e.g. engines index pages not yet crawled, by}

indexing anchortext.

_{Document restriction: All engines restrict what is indexed (first} _n

words, only relevant words, etc.)

_{The coverage of a search engine relative to another}

(49)

New definition?

(IQ is whatever the IQ tests measure.)

_{The statically indexable web is whatever search}

engines index.

_{Different engines have different preferences}

_{max url depth, max count/host, anti-spam rules, priority}

rules, etc.

_{Different engines index different things under the}

same URL:

_{frames, meta-keywords, document restrictions, document}

extensions, ...

(50)

A ∩ ∩ ∩ ∩ B = (1/2) * Size A A ∩ ∩ ∩ ∩ B = (1/6) * Size B (1/2)*Size A = (1/6)*Size B ∴ ∴ ∴ ∴ Size A / Size B = (1/6)/(1/2) = 1/3

Sample URLs randomly from A

Check if contained in B and vice versa

A ∩∩∩∩B

Each test involves: (i) Sampling (ii) Checking

(51)

Sampling URLs

_{Ideal strategy: Generate a random URL and check for}

containment in each index.

_Problem:

_{Random URLs are hard to find! Enough to}

generate a random URL contained in a given Engine.

_{Approach 1: Generate a random URL contained in a}

given engine

_{Suffices for the estimation of relative size}

_{Approach 2: Random walks / IP addresses}

_{In theory: might give us a true estimate of the size of the web (as}

opposed to just relative sizes of indexes)

(52)

Statistical methods

_{Approach 1}

_{Random queries} _{Random searches}

_{Approach 2}

_{Random IP addresses} _{Random walks}

(53)

Random URLs from random queries

_{Generate random query: how?}

_Lexicon:

_{400,000+ words from a web crawl}

_{Conjunctive Queries:}

_w

1

and w

2

e.g., vocalists AND rsi

_{Get 100 result URLs from engine A}

_{Choose a random URL as the candidate to check for}

presence in engine B

_{This distribution induces a probability weight W(p) for each}

page.

_{Conjecture: W(SE}

A

) / W(SE

B

) ~ |SE

A

| / |SE

B

|

Not an English dictionary

(54)

Query Based Checking

_{Strong Query}

_{to check whether an engine}

_B

_{has a}

document

D

:

_Download_D_{. Get list of words.}

_{Use 8 low frequency words as AND query to}_B _{Check if}_D _{is present in result set.}

_Problems:

_{Near duplicates} _Frames

_Redirects

_{Engine time-outs}

(55)

Advantages & disadvantages

_{Statistically sound under the induced weight.}

_{Biases induced by random query}

_{Query Bias}_:_{Favors content-rich pages in the language(s) of the lexicon}

_{Ranking Bias}_:_Solution: _{Use conjunctive queries & fetch all}

_{Checking Bias}_:_{Duplicates, impoverished pages omitted}

_{Document or query restriction bias}_: _{engine might not deal properly}

with 8 words conjunctive query

_{Malicious Bias}_:_{Sabotage by engine}

_{Operational Problems}_:_{Time-outs, failures, engine inconsistencies,}

index modification.

(56)

Random searches

_{Choose random searches extracted from a local log}

[Lawrence & Giles 97] or build “random searches”

[Notess]

_{Use only queries with small result sets.} _{Count normalized URLs in result sets.} _{Use ratio statistics}

(57)

Advantages & disadvantages

_Advantage

_{Might be a better reflection of the human perception}

of coverage

_Issues

_{Samples are correlated with source of log} _Duplicates

_{Technical statistical problems (must have non-zero}

results, ratio average not statistically sound)

(58)

Random searches

_{575 & 1050 queries from the NEC RI employee logs} _{6 Engines in 1998, 11 in 1999}

_{Implementation:}

_{Restricted to queries with < 600 results in total}

_{Counted URLs from each engine after verifying query}

match

_{Computed size ratio & overlap for individual queries}

_{Estimated index size ratio & overlap by averaging over all}

(59)

Introduction to Information Retrieval Introduction to Information Retrieval

_{adaptive access control} _{neighborhood preservation}

topographic

_{hamiltonian structures} _{right linear grammar}

_{pulse width modulation neural} _{unbalanced prior probabilities} _{ranked assignment method} _{internet explorer favourites}

importing

_{karvel thornber} _{zili liu}

Queries from Lawrence and Giles study

_{softmax activation function} _{bose multidimensional system}

theory

_{gamma mlp} _dvi2pdf

_{john oliensis}

_{rieke spikes exploring neural} _{video watermarking}

_{counterpropagation network} _{fat shattering dimension}

_{abelson amorphous computing}

(60)

Random IP addresses

_{Generate random IP addresses}

_{Find a web server at the given address}

_{If there’s one}

_{Collect all pages from server}

(61)

Random IP addresses

_{HTTP requests to random IP addresses}

_{Ignored: empty or authorization required or excluded} _{[Lawr99] Estimated 2.8 million IP addresses running}

crawlable web servers (16 million total) from observing 2500 servers.

_{OCLC using IP sampling found 8.7 M hosts in 2001}

_{Netcraft [Netc02] accessed 37.2 million hosts in July 2002}

_{[Lawr99] exhaustively crawled 2500 servers and}

extrapolated

_{Estimated size of the web to be 800 million pages} _{Estimated use of metadata descriptors:}

_{Meta tags (keywords, description) in 34% of home pages, Dublin}

core metadata in 0.3%

(62)

Advantages & disadvantages

_Advantages

_{Clean statistics}

_{Independent of crawling strategies}

_{Disadvantages}

_{Doesn’t deal with duplication}

_{Many hosts might share one IP, or not accept requests} _{No guarantee all pages are linked to root page.}

_{Eg: employee pages}

_{Power law for # pages/hosts generates bias towards sites with}

few pages.

_{But bias can be accurately quantified IF underlying distribution}

understood

_{Potentially influenced by spamming (multiple IP’s for same}

(63)

Random walks

_{View the Web as a directed graph}

_{Build a random walk on this graph}

_{Includes various “jump” rules back to visited sites}

_{Does not get stuck in spider traps!} _{Can follow all links!}

_{Converges to a stationary distribution}

_{Must assume graph is finite and independent of the walk.} _{Conditions are not satisfied (cookie crumbs, flooding)}

_{Time to convergence not really known}

_{Sample from stationary distribution of walk}

_{Use the “strong query” method to check coverage by SE}

(64)

Advantages & disadvantages

_Advantages

_{“Statistically clean” method at least in theory!}

_{Could work even for infinite web (assuming convergence)}

under certain metrics.

_{Disadvantages}

_{List of seeds is a problem.}

_{Practical approximation might not be valid.} _{Non-uniform distribution}

(65)

Conclusions

_{No sampling solution is perfect.}

_{Lots of new ideas ...}

_{....but the problem is getting harder}

_{Quantitative studies are fascinating and a good}

research problem

(66)

(67)

Duplicate documents

_{The web is full of duplicated content}

_{Strict duplicate detection = exact match}

_{Not as common}

_{But many, many cases of near duplicates}

_{E.g., Last modified date the only difference}

between two copies of a page

(68)

Duplicate/Near-Duplicate Detection

_Duplication

_{: Exact match can be detected with}

fingerprints =

64-bit digest of characters from that page

_{Near-Duplication}

_{: Approximate match}

_Overview

_{Compute syntactic similarity with an edit-distance}

measure

_{Use similarity threshold to detect near-duplicates}

_{E.g., Similarity > 80% => Documents are “near duplicates”} _{Not transitive though sometimes used transitively}

(69)

Computing Similarity

_Features:

_{Segments of a document (natural or artificial breakpoints)}

_Shingles _{(Word N-Grams)}

_{a rose is a rose is a rose} _→

a_rose_is_a

rose_is_a_rose

is_a_rose_is

a_rose_is_a

_{Similarity Measure between two docs (= sets of shingles)}

_{Set intersection}

_{Specifically (Size_of_Intersection / Size_of_Union)}

(70)

Shingles + Set Intersection

Computing exact set intersection of shingles

between all pairs of documents is

expensive/intractable

_{Approximate using a cleverly chosen subset of shingles}

from each (a sketch)

_Estimate

_{(size_of_intersection / size_of_union)}

based on a short sketch

Doc

A Shingle set A Sketch A Doc

B Shingle set B Sketch B

(71)

Sketch of a document

_{Create a “}

_{sketch vector}

_{” (of size ~200) for each}

document

_{Documents that share}

_≥

_t

_{(say 80%) corresponding}

vector elements are

near duplicates

_{For doc}

_D

_{, sketch}

D

[

i

] is as follows:

_Let_f _{map all shingles in the universe to 0..2}m _{(e.g., f =}

fingerprinting)

_Let_π

i be a random permutation on 0..2m

_{Pick MIN {}_π

i(f(s))} over all shingles s in D

(72)

Computing Sketch[i] for Doc1

Document 1 264 264 264 264

Start with 64-bit _f(shingles) Permute on the number line with π_i

(73)

Test if Doc1.Sketch[i] = Doc2.Sketch[i]

Document 1 Document 2 264 264 264 264 264 264 264 264

Are these equal?

Test for 200 random permutations:

π

₁

,

π

₂

,…

π

₂₀₀

A B

(74)

However…

Document 1 Document 2 264 264 264 264 264 264 264 264

A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the

intersection)

Claim: This happens with probability

Size_of_intersection / Size_of_union

B A

(75)

Set Similarity of sets C

_i

, C

_j

_{View sets as columns of a matrix A; one row for each}

element in the universe. a_ij = 1 indicates presence of

item i in set j _Example j i j i j i

C

)

C

,

Jaccard(C

U

I

=

C₁ C₂ 0 1 1 0 1 1 Jaccard(C₁,C₂) = 2/5 = 0.4 0 0 1 1 0 1 Sec. 19.6

(76)

Key Observation

_{For columns C}

i

, C

j

,

four types of rows

C_i C_j

A 1 1

B 1 0

C 0 1

D 0 0

_{Overload notation:}

_{A = # of rows of type A}

_Claim

C

B

A

)

C

,

Jaccard(C

_i _j

+

=

(77)

“Min” Hashing

_Randomly

_permute

_rows

_Hash

_h(C

i

) = index of first row with 1 in column

C

_i

_{Surprising Property}

_Why?

_{Both are}_A/(A+B+C)

_{Look down columns}_C_i_{, C}_j _{until first}_non-Type-D _row _h(C

i) = h(Cj)  type A row

[

h(C_i) h(C_j)

]

Jaccard

(

C_i,C_j

)

P = =

(78)

Min-Hash sketches

_Pick

_P

_{random row permutations}

_{MinHash sketch}

Sketch

_D

= list of

P

indexes of first rows with 1 in column C

_{Similarity of signatures}

_Let_sim[sketch(C

i),sketch(Cj)] = fraction of permutations

where MinHash values agree

_Observe _E[sim(sig(C

(79)

Example

C₁ C₂ C₃ R₁ 1 0 1 R₂ 0 1 1 R₃ 1 0 0 R₄ 1 0 1 R₅ 0 1 0 Signatures S₁ S₂ S₃ Perm 1 = (12345) 1 2 1 Perm 2 = (54321) 4 5 4 Perm 3 = (34512) 3 5 4 Similarities 1-2 1-3 2-3 Col-Col 0.00 0.60 0.40 Sig-Sig 0.00 0.67 0.00 Sec. 19.6

(80)

Implementation Trick

_Permuting

_{universe even once is prohibitive}

_{Row Hashing}

_Pick _{P hash functions}_h_k_{: {1,…,n}}_{1,…,O(n)}

_Ordering _{under h}

k gives random permutation of rows

_{One-pass Implementation}

_{For each}_C

i and hk, keep “slot” for min-hash value

_Initialize _{all slot(C}

i,hk) to infinity

_{Scan rows} _{in arbitrary order looking for 1’s}

_{Suppose row R}

j has 1 in column Ci

_{For each h}

k,

_{if h}

(81)

Example

C₁ C₂ R₁ 1 0 R₂ 0 1 R₃ 1 1 R₄ 1 0 R₅ 0 1 h(x) = x mod 5 g(x) = 2x+1 mod 5 h(1) = 1 1 -g(1) = 3 3 -h(2) = 2 1 2 g(2) = 0 3 0 h(3) = 3 1 2 g(3) = 2 2 0 h(4) = 4 1 2 g(4) = 4 2 0 h(5) = 0 1 0 g(5) = 1 2 0 C₁slots C₂slots Sec. 19.6 For each h_k, if h_k(j) < slot(C_i,h_k), then slot(C_i,h_k)  _h_k_(j) J=1 J=2