(1)

Tutorial on

Information Retrieval

By:

Ansif Arooj

Lecturer of Computer Science, University of Education

(2)

Outline for the tutorial

– Introduction to information retrieval

– Boolean retrieval

– Index Construction

(3)

(4)

o Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

(Reference: An Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze)

o Information Retrieval is finding any type of relevant information. This may include web pages, news events, answers, images, etc., but the key notion is relevance.

(5)

• Goal = find documents relevant to an information need from a large document set

[Figure: Info. need → Query → IR system (Retrieval over the Document collection) → Answer list]

(6)

Example

Google

(7)

• Retrospective

– “Searching the past”

– Different queries posed against a static collection

• Prospective (Filtering)

– “Searching the future”

(8)

(9)

• Text (Documents)

• XML and structured documents

• Images

• Audio (sound effects, songs, etc.)

• Video

• Source code

(10)

The Big Picture

• The three components of the information retrieval environment:

– User

– Process

– Collection

What computer geeks care about!

(11)

(12)

The Information Retrieval Cycle

[Figure: the retrieval cycle. Stages: Source Selection → Query Formulation → Search → Selection. Intermediate outputs: Resource, Query, Ranked List, Documents, result. Feedback: query reformulation, relevance feedback.]

(13)

[Figure: the same cycle with the system side added. Stages: Query Formulation → Search → Selection, with outputs Resource, Query, Ranked List, Documents, Results. Document Collection → Indexing → Index feeds Search.]

(14)

The IR Black Box

[Figure: Documents and Query are the inputs to the IR black box.]

(15)

[Figure: inside the black box. Query → Representation → Query Representation; Documents → Representation → Document Representation; both representations feed Comparison.]

(16)

Difference between Structured and Unstructured Data

(17)

• Structured data is stored and managed in “tables”.

First Name    Last Name    Salary
Ali           Raza         50000
Ibraheem      Khan         60000
Ayesha        Umar         50000

(18)

Unstructured data

• An unstructured-data database is intended to store, in a manageable and protected way, diverse objects that do not fit naturally and conveniently in common databases. It may include:

– email messages,

– documents,

– journals,

– multimedia objects, etc.

• Allows

– Keyword queries including operators

– More sophisticated “concept” queries, e.g.,

(19)

                 Databases                            IR
Data             Structured                           Unstructured
Fields           Clear semantics                      No fields (other than text)
Queries          Defined (relational algebra, SQL)    Free text (“natural language”), Boolean
Recoverability   Critical (concurrency control,       Downplayed, though still
                 recovery, atomic operations)

(20)

Rapid growth in unstructured and semi-structured data

(21)

(22)

Unstructured data in the past

• Query:

– Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

• Solution:

– One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia.

(23)

Answers to query

• Antony and Cleopatra, Act III, Scene ii

Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,

When Antony found Julius Caesar dead,

He cried almost to roaring; and he wept

When at Philippi he found Brutus slain.

• Hamlet, Act III, Scene ii

Lord Polonius: I did enact Julius Caesar I was killed i' the

(24)

Term-document incidence

            Antony and   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra    Caesar   Tempest
Antony      1            1        0         0        0         1
Brutus      1            1        0         1        0         0
Caesar      1            1        0         1        1         1
Calpurnia   0            1        0         0        0         0
Cleopatra   1            0        0         0        0         0
mercy       1            0        1         1        1         1
worser      1            0        1         1        1         0

1 if play contains word, 0 otherwise

Brutus AND Caesar BUT NOT Calpurnia

(25)

Incidence vectors

• So we have a 0/1 vector for each term.

• To answer the query, take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them:

110100 AND 110111 AND 101111 = 100100
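The bitwise AND over incidence vectors can be sketched in a few lines. This is a minimal illustration, assuming the three rows are copied from the term-document incidence matrix; the names `PLAYS`, `INCIDENCE` and `boolean_query` are made up for this example.

```python
# Plays in the column order of the term-document incidence matrix.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One 0/1 incidence vector per term, written as a bit string.
INCIDENCE = {
    "Brutus":    "110100",
    "Caesar":    "110111",
    "Calpurnia": "010000",
}

def vector(term):
    """Return the incidence vector of a term as an integer bit mask."""
    return int(INCIDENCE[term], 2)

def boolean_query():
    """Evaluate Brutus AND Caesar AND NOT Calpurnia."""
    mask = (1 << len(PLAYS)) - 1          # keeps the complement to 6 bits
    not_calpurnia = ~vector("Calpurnia") & mask
    result = vector("Brutus") & vector("Caesar") & not_calpurnia
    bits = format(result, "0{}b".format(len(PLAYS)))
    return bits, [play for play, bit in zip(PLAYS, bits) if bit == "1"]

bits, plays = boolean_query()
print(bits, plays)
```

The two plays selected are exactly the ones quoted on the "Answers to query" slide.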

(26)

(27)

How good are the retrieved docs?

• Precision

(28)

(29)

(30)

How to build inverted index

1. Assign an ID to each document: docID

2. Run the document preparation process on each document

3. Compile a list of all index terms

4. Assign an ID to each index term: termID

5. Create a list of all (termID, docID) pairs

(31)

Inverted index

• For each term t, we must store a list of all documents that contain t.

– Identify each by a docID, a document serial number

Brutus     →  1  2  4  11  31  45  173  174
Caesar     →  1  2  4  5  6  16  57  132
Calpurnia  →  2  31  54  101

(32)

Inverted index construction

[Figure: indexing pipeline. Documents to be indexed (“Friends, Romans, countrymen.”) → Tokenizer → Token stream (Friends, Romans, Countrymen) → Linguistic modules → Modified tokens (friend, roman, countryman) → Indexer → Inverted index (friend → 2, 4; roman → 1, 2).]

(33)

Indexer steps: Token sequence

• Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol;

Doc 2: So let it be with Caesar. The noble Brutus hath told you

(34)

Indexer steps: Sort

• Sort by terms

(35)

Indexer steps: Dictionary & Postings

• Multiple term entries in a single document are merged.

• Split into Dictionary and Postings

• Doc. frequency
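The three indexer steps (token sequence, sort, dictionary and postings) can be sketched as follows. This is a simplified illustration, assuming whitespace tokenization and lowercasing stand in for the linguistic modules, and using terms directly instead of termIDs; `DOCS` and `build_index` are names chosen for this example.

```python
from itertools import groupby

# The two documents from the token-sequence slide (Doc 2 as far as it is quoted).
DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol",
    2: "So let it be with Caesar The noble Brutus hath told you",
}

def build_index(docs):
    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(token.lower(), doc_id)
             for doc_id, text in docs.items()
             for token in text.split()]
    # Step 2: sort by term, then by docID.
    pairs.sort()
    # Step 3: merge duplicate entries and split into dictionary and postings.
    dictionary, postings = {}, {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        doc_ids = sorted({doc_id for _, doc_id in group})
        postings[term] = doc_ids
        dictionary[term] = len(doc_ids)   # document frequency
    return dictionary, postings

dictionary, postings = build_index(DOCS)
print(postings["caesar"], dictionary["caesar"])
```

The term "i" occurs twice in Doc 1 but ends up with a single posting, which is the merge step from the slide.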

(36)

The index we just built

• How do we process a query?

– Later - what kinds of queries can we process?

(37)

Query processing: AND

• Consider processing the query:

Brutus AND Caesar

– Step 1: Locate Brutus in the Dictionary;

• Retrieve its postings.

– Step 2: Locate Caesar in the Dictionary;

• Retrieve its postings.

– Step 3: “Merge” the two postings lists.

(38)

The merge

• Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

Brutus  →  2  4  8  16  32  64  128
Caesar  →  1  2  3  5  8  13  21  34
Result  →  2  8

If list lengths are x and y, merge takes O(x+y) operations.
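A minimal sketch of the merge, assuming both postings lists are sorted by docID (the linear-time walk depends on this). The function name `intersect` is chosen for this example.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer behind
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```

Each step advances at least one pointer, so the loop runs at most x + y times.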

(39)

Recall: Inverted index construction

[Figure: the indexing pipeline again. Documents to be indexed → Tokenizer → Token stream (Friends, Romans, Countrymen) → Linguistic modules → Modified tokens (friend, roman, countryman).]

(40)

Index Construction

Step 1- Parsing a document

• What format is it in?

– pdf/word/excel/html?

• What language is it in?

• What character set is in use?

We have to classify each document by its type and description.

Classification is therefore an important part of parsing.

But these tasks are often done heuristically …

(41)

Step 2 & 3 - Tokenization and Linguistic issues

• Input: “Friends, Romans, Countrymen”

• Output: Tokens

– Friends

– Romans

– Countrymen

• A token is a sequence of characters in a document

• Each such token is now a candidate for an index entry, after further processing

– Described below

(42)

Issues in Tokenization

– Finland’s capital

– State-of-the-art

– co-education

– lowercase, lower-case, lower case?

– San Francisco: one token or two?

• Number or date formats

– 3/12/91

– Mar. 12, 1991

– 12/3/91

– 55 B.C.

• With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:

– They have little semantic content: the, a, and, to, be

– There are a lot of them: ~30% of postings for top 30 words
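One possible set of tokenization choices can be sketched as below. The regular expression is an assumption made for this example (it keeps apostrophes and hyphens inside a token and splits on whitespace), not the rule used by any particular system; the stop list holds only the five words named on the slide.

```python
import re

# Assumed rule: a token is a run of word characters, optionally joined
# by hyphens or apostrophes (so "State-of-the-art" stays one token).
TOKEN_RE = re.compile(r"\w+(?:[-'’]\w+)*")

# The stop words named on the slide; real stop lists are longer.
STOP_WORDS = {"the", "a", "and", "to", "be"}

def tokenize(text):
    """Split text into tokens according to TOKEN_RE."""
    return TOKEN_RE.findall(text)

def index_terms(text):
    """Lowercase the tokens and drop those on the stop list."""
    return [t.lower() for t in tokenize(text) if t.lower() not in STOP_WORDS]

print(tokenize("Friends, Romans, Countrymen"))
print(tokenize("State-of-the-art hotels in San Francisco"))
print(index_terms("To be, or not to be"))
```

Under this rule "San Francisco" becomes two tokens; treating it as one would need a phrase list or a different rule.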

(43)

• Normalization to Terms:

– Deleting periods to form a term, e.g., U.S.A. → USA

– Deleting hyphens to form a term, e.g., anti-discriminatory → antidiscriminatory

– Accents: e.g., French résumé vs. resume.

• Asymmetric Expansion

– An alternative to equivalence classing

– An example of where this may be useful

• Enter: window Search: window, windows

• Enter: windows Search: Windows, windows, window
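The normalization rules can be sketched as a small function. The lowercasing step is an assumption added for this example (the slide lists only periods, hyphens and accents), and `EXPANSIONS` simply encodes the two window/windows rows as a lookup table.

```python
import unicodedata

def normalize(token):
    """Map a token to its equivalence-class representative."""
    token = token.lower()                               # assumption: case folding
    token = token.replace(".", "").replace("-", "")     # U.S.A. -> usa
    decomposed = unicodedata.normalize("NFKD", token)   # split letters from accents
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Asymmetric expansion: the query term decides which index terms are searched.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
}

def expand(query_term):
    """Return the index terms to search for a query term."""
    return EXPANSIONS.get(query_term, [query_term])

print(normalize("U.S.A."), normalize("anti-discriminatory"), normalize("résumé"))
print(expand("window"))
```

Equivalence classing is symmetric by construction; the expansion table is what allows the two directions to differ.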

(44)

• Lemmatization:

– Reduce inflectional/variant forms to base form

– E.g.,

• am, are, is → be

• car, cars, car's, cars' → car

• Porter’s Algorithm (stemming):

– Results suggest it’s at least as good as other stemming options

– sses → ss

– ies → i

– ational → ate

– tional → tion

Sec. 2.2.3
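The four suffix rules can be tried out with the sketch below. It is not the Porter algorithm: the real stemmer applies many more rules in several phases and checks conditions on the remaining stem. Here only the rules listed on the slide are applied, first match wins, and `apply_rules` is a name chosen for this example.

```python
# (suffix, replacement) pairs from the slide, tried in this order.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ational", "ate"),
    ("tional", "tion"),
]

def apply_rules(word):
    """Rewrite the first matching suffix; leave the word unchanged otherwise."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

for word in ["caresses", "ponies", "relational", "conditional", "cat"]:
    print(word, "->", apply_rules(word))
```

Listing "ational" before "tional" matters: "relational" must become "relate", not "relation".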

(45)

(46)

Phrase queries

• Want to be able to answer queries such as “stanford university” – as a phrase

• Thus the sentence “I went to university at

Stanford” is not a match.

– The concept of phrase queries has proven easily understood by users

– Many more queries are implicit phrase queries

(47)

• Biword Index:

– Index every consecutive pair of terms in the text as a phrase. For example, the text “Friends, Romans, Countrymen” would generate the biwords

• friends romans

• romans countrymen

• Extended Biword:

– Parse the indexed text and perform part-of-speech tagging (POST).

– Bucket the terms into (say) Nouns (N) and articles/prepositions (X).

– Example: catcher in the rye

N X X N

• Query processing: parse it into N’s and X’s

– Segment query into enhanced biwords

– Look up in index: catcher rye
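Both biword variants can be sketched briefly. The part-of-speech tags are supplied by hand here, since no tagger is assumed; `biwords`, `build_biword_index` and `extended_biwords` are names chosen for this example.

```python
from collections import defaultdict

def biwords(terms):
    """Every consecutive pair of terms, joined into one dictionary entry."""
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

def build_biword_index(docs):
    """Map each biword to the set of docIDs that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for biword in biwords(terms):
            index[biword].add(doc_id)
    return index

def extended_biwords(tagged):
    """Collapse each run N X* N into a biword made of the two nouns."""
    previous_noun = None
    result = []
    for term, tag in tagged:
        if tag == "N":
            if previous_noun is not None:
                result.append(f"{previous_noun} {term}")
            previous_noun = term
        elif tag != "X":
            previous_noun = None   # any other tag breaks the run
    return result

print(biwords(["friends", "romans", "countrymen"]))
print(extended_biwords([("catcher", "N"), ("in", "X"), ("the", "X"), ("rye", "N")]))
```

A phrase query is then segmented the same way and each resulting biword is looked up like an ordinary term.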

(48)

• Positional Indexes: In the postings, store for each term the position(s) in which tokens of it appear:

<term, number of docs containing term;

doc1: position1, position2 … ;

doc2: position1, position2 … ;

etc.>

Sec. 2.4.2

(49)

Positional index example

• For phrase queries, we use a merge

algorithm recursively at the document level

<be: 993427;

1: 7, 18, 33, 72, 86, 231;

2: 3, 149;

4: 17, 191, 291, 430, 434;
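A small positional index and a phrase lookup can be sketched as follows. This is an illustration under simple assumptions: whitespace tokenization, lowercasing, positions counted from 1, and a second document invented for the example so that the phrase has a match.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {docID: [positions]}, positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_query(index, phrase):
    terms = phrase.lower().split()
    # Document-level merge: keep docs that contain every term.
    docs = set(index.get(terms[0], {}))
    for term in terms[1:]:
        docs &= set(index.get(term, {}))
    hits = []
    for doc_id in sorted(docs):
        # Position-level check: the k-th following term must sit at start + k.
        later = [set(index[term][doc_id]) for term in terms[1:]]
        if any(all(start + k in positions
                   for k, positions in enumerate(later, start=1))
               for start in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return hits

DOCS = {
    1: "I went to university at Stanford",
    2: "She teaches at Stanford University",   # hypothetical second document
}
index = build_positional_index(DOCS)
print(phrase_query(index, "stanford university"))
```

Doc 1 contains both words but in the wrong order, so only the position check rules it out.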

(50)

(51)

Wild-card queries: *

• mon*: find all docs containing any word beginning with “mon”.

• Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo

• *mon: find words ending in “mon”: harder

– Maintain an additional B-tree for terms
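The range lookup can be sketched with a sorted list and binary search standing in for the B-tree. The lexicon is made up for this example, and handling `*mon` through a second lexicon of reversed terms is an assumption about what the additional tree holds.

```python
import bisect

# Hypothetical lexicon, kept sorted as a B-tree would keep it.
LEXICON = sorted(["moan", "mon", "monday", "money", "month",
                  "moo", "moon", "salmon", "sermon"])

def prefix_matches(lexicon, prefix):
    """All terms w with prefix <= w < upper, e.g. mon <= w < moo."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)   # "mon" -> "moo"
    low = bisect.bisect_left(lexicon, prefix)
    high = bisect.bisect_left(lexicon, upper)
    return lexicon[low:high]

# Assumed second structure: the same terms written backwards.
REVERSED_LEXICON = sorted(term[::-1] for term in LEXICON)

def suffix_matches(reversed_lexicon, suffix):
    """Answer *suffix as a prefix lookup on the reversed terms."""
    hits = prefix_matches(reversed_lexicon, suffix[::-1])
    return sorted(term[::-1] for term in hits)

print(prefix_matches(LEXICON, "mon"))
print(suffix_matches(REVERSED_LEXICON, "mon"))
```

The prefix is assumed to be non-empty; an empty prefix would simply mean the whole lexicon.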

(52)

(1) Permuterm index

• For term hello, index under:

– hello$, ello$h, llo$he, lo$hel, o$hell, $hello where $ is a special symbol.

• Queries:

– X lookup on X$

– X* lookup on $X*

– *X lookup on X$*

– *X* lookup on X*

– X*Y lookup on Y$X*

Sec. 3.2.1
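The rotation idea can be sketched as below. This version handles queries with at most one `*` and scans the rotations linearly instead of using a B-tree; the term list and the function names are assumptions made for the example.

```python
def rotations(term):
    """All rotations of term + '$'."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(terms):
    """Sorted list of (rotation, original term) pairs."""
    return sorted((rotation, term) for term in terms for rotation in rotations(term))

def lookup(permuterm, query):
    """Answer X, X*, *X or X*Y by rotating the query so the * is at the end."""
    if "*" not in query:
        key = query + "$"
        return sorted({term for rotation, term in permuterm if rotation == key})
    head, tail = query.split("*")        # X*Y -> head = X, tail = Y
    prefix = tail + "$" + head           # look up Y$X*
    return sorted({term for rotation, term in permuterm
                   if rotation.startswith(prefix)})

TERMS = ["hello", "help", "hero", "moon", "month"]   # hypothetical lexicon
permuterm = build_permuterm(TERMS)
print(rotations("hello"))
print(lookup(permuterm, "hel*"), lookup(permuterm, "h*o"))
```

Every term of length n contributes n + 1 rotations, which is why the permuterm lexicon is several times larger than the original one.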

(53)

(2) Bigram (k-gram) indexes

• Enumerate all k-grams (sequence of k chars) occurring in any term

• e.g., from text “April is the cruelest month” we get the 2-grams (bigrams)

$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$,$m,mo,on,nt,th,h$

– $ is a special word boundary symbol

• Maintain a second inverted index from bigrams to the dictionary terms that contain them

(54)

Bigram index example

• The k-gram index finds terms based on a query consisting of k-grams (here k=2).

$m  →  mace, madden
mo  →  among, amortize
on  →  along, among

• Query mon* can now be run as

– $m AND mo AND on

Sec. 3.2.2
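A bigram index and the `mon*` lookup can be sketched as follows. The term list is hypothetical, and the final filtering step is included because the AND of bigrams alone also matches terms such as "moon", which contain all three bigrams without starting with "mon".

```python
from collections import defaultdict

def kgrams(term, k=2):
    """k-grams of a term, with $ marking both word boundaries."""
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

def build_kgram_index(terms, k=2):
    """Second inverted index: k-gram -> set of terms containing it."""
    index = defaultdict(set)
    for term in terms:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

def prefix_wildcard(index, prefix, k=2):
    """Answer prefix*: AND the k-gram postings, then post-filter."""
    grams = kgrams(prefix, k)[:-1]        # drop the gram with the trailing $
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return sorted(t for t in candidates if t.startswith(prefix))

TERMS = ["mace", "madden", "among", "amortize", "along",
         "moon", "month", "money"]       # hypothetical lexicon
index = build_kgram_index(TERMS)
print(kgrams("month"))
print(prefix_wildcard(index, "mon"))
```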

(55)

Step 4- Index construction

• How do we construct an index?

• What strategies can we use with limited main memory?

• Many design decisions in information retrieval are based on the characteristics of hardware

(56)

Hardware basics

• Access to data

• Disk seeks

• Block-based sorting

– Reading and writing of entire blocks (as opposed to smaller chunks).

– Block sizes: 8KB to 256 KB.

(57)

(58)

(59)

Distributed indexing

• Especially important for web-scale indexing.

• Maintain a master machine directing the indexing job

• Break up indexing into sets of (parallel) tasks.

• Master machine assigns each task to an idle machine

(60)

Parallel tasks

• We will use two sets of parallel tasks

– Parsers

– Inverters

• Break the input document collection into splits

• Each split is a subset of documents

(61)

Data flow

[Figure: data flow. The Master assigns splits to Parsers; each Parser writes its output into term partitions a-f, g-p and q-z; each Inverter collects one partition and writes its Postings.]

(62)

MapReduce

• The index construction algorithm we just described is an instance of MapReduce.

• … without having to write code for the distribution part.

• They describe the Google indexing system (ca. 2002) as consisting of a number of phases,

each implemented in MapReduce.

(63)

MapReduce

• Schema of map and reduce functions

map: input → list(k, v)

reduce: (k, list(v)) → output

• Instantiation of the schema for index construction

map: collection → list(termID, docID)

reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings list1, postings list2, …)

(64)

Example for index construction

• Map:

d1 : C came, C c’ed.

d2 : C died.

→ <C,d1>, <came,d1>, <C,d1>, <c’ed,d1>, <C,d2>, <died,d2>

• Reduce:

(<C,(d1,d2,d1)>, <died,(d2)>, <came,(d1)>, <c’ed,(d1)>)

→ (<C,(d1:2,d2:1)>, <died,(d2:1)>, <came,(d1:1)>, <c’ed,(d1:1)>)
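The example can be replayed on a single machine. This is only a sketch of the data flow: no framework is involved, terms are used directly instead of termIDs, and the `shuffle` function stands in for the grouping a MapReduce system performs between the two phases.

```python
from collections import Counter, defaultdict

def map_phase(docs):
    """map: collection -> list of (term, docID) pairs."""
    return [(term, doc_id) for doc_id, text in docs for term in text.split()]

def shuffle(pairs):
    """Group the docIDs by term (done by the framework in a real system)."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped

def reduce_phase(grouped):
    """reduce: (term, list(docID)) -> postings list with term frequencies."""
    return {term: sorted(Counter(doc_ids).items())
            for term, doc_ids in grouped.items()}

DOCS = [("d1", "C came C c'ed"), ("d2", "C died")]
pairs = map_phase(DOCS)
postings = reduce_phase(shuffle(pairs))
print(pairs)
print(postings)
```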

(65)

(66)

Dynamic indexing

• Up to now, we have assumed that collections are static.

• They rarely are:

– Documents come in over time and need to be inserted.

– Documents are deleted and modified.

• This means that the dictionary and postings lists have to be modified:

– Postings updates for terms already in dictionary

– New terms added to dictionary

(67)

Simplest approach

• Maintain “big” main index

• New docs go into “small” auxiliary index

• Search across both, merge results
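The main-plus-auxiliary scheme can be sketched with two in-memory dictionaries. Folding the auxiliary index into the main one after a fixed number of documents is an assumption added for this example, and deletions are not handled; the class and parameter names are hypothetical.

```python
class DynamicIndex:
    """Big main index plus a small auxiliary index for new documents."""

    def __init__(self, main, max_aux_docs=2):
        self.main = {term: list(postings) for term, postings in main.items()}
        self.aux = {}
        self.aux_docs = 0
        self.max_aux_docs = max_aux_docs   # assumed merge threshold

    def add_document(self, doc_id, terms):
        """New docs go into the auxiliary index only."""
        for term in set(terms):
            self.aux.setdefault(term, []).append(doc_id)
        self.aux_docs += 1
        if self.aux_docs >= self.max_aux_docs:
            self.merge()

    def merge(self):
        """Fold the auxiliary postings into the main index."""
        for term, postings in self.aux.items():
            self.main[term] = sorted(set(self.main.get(term, [])) | set(postings))
        self.aux, self.aux_docs = {}, 0

    def search(self, term):
        """Search across both indexes and merge the results."""
        return sorted(set(self.main.get(term, [])) | set(self.aux.get(term, [])))

index = DynamicIndex({"caesar": [1, 2]})
index.add_document(3, ["caesar", "brutus"])
print(index.search("caesar"), index.search("brutus"))
```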

(68)

(69)

What is the size of the web?

• Issues

– The web is really infinite

• Dynamic content, e.g., calendars

– Static web contains syntactic duplication, mostly due to mirroring (~30%)

(70)

New definition?

– The statically indexable web is whatever search engines index.

• Different engines have different preferences

– max url depth, max count/host, anti-spam rules, priority rules, etc.

• Different engines index different things under the same URL:

– frames, meta-keywords, document restrictions, document extensions, ...

(71)

Duplicate documents

• The web is full of duplicated content

• Strict duplicate detection = exact match

• But many, many cases of near duplicates

– E.g., last-modified date the only difference between two copies of a page

(72)

Computing Similarity

• Features:

– Segments of a document (natural or artificial breakpoints)

– Shingles (Word N-Grams)

– a rose is a rose is a rose →

a_rose_is_a

rose_is_a_rose

is_a_rose_is

a_rose_is_a

rose_is_a_rose

• Similarity Measure between two docs (= sets of shingles)

– Jaccard coefficient: Size_of_Intersection / Size_of_Union

Sec. 19.6
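Shingling and the Jaccard coefficient can be sketched in a few lines. The shingle size of 4 words matches the rose example; the second sentence is made up to give a near duplicate.

```python
def shingles(text, n=4):
    """Set of word n-grams, with the words joined by underscores."""
    words = text.lower().split()
    return {"_".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Size of intersection divided by size of union."""
    if not a and not b:
        return 1.0          # two empty documents count as identical
    return len(a & b) / len(a | b)

d1 = shingles("a rose is a rose is a rose")
d2 = shingles("a rose is a rose is a flower")   # hypothetical near duplicate
print(sorted(d1))
print(jaccard(d1, d2))
```

Because a document is treated as a set, the repeated shingles of the rose sentence collapse to three distinct entries.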

(73)

Shingles + Set Intersection

• Computing exact set intersection of shingles between all pairs of documents is expensive/intractable

– Approximate using a cleverly chosen subset of shingles from each (a sketch)

• Estimate (size_of_intersection / size_of_union) based on a short sketch
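One common way to choose the subset is min-hashing, shown here as an assumption since the slide does not name the method. For each of several hash functions the smallest hash value over a document's shingles is kept; the fraction of positions where two sketches agree estimates the Jaccard coefficient. Seeded MD5 stands in for a family of random permutations, and the shingle sets are assumed to be non-empty.

```python
import hashlib

def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def sketch(shingle_set, num_hashes=64):
    """Keep the minimum hash value per seed."""
    return [min(_hash(s, seed) for s in shingle_set)
            for seed in range(num_hashes)]

def estimate_jaccard(sketch1, sketch2):
    """Fraction of sketch positions on which the two documents agree."""
    matches = sum(a == b for a, b in zip(sketch1, sketch2))
    return matches / len(sketch1)

doc1 = {"a_rose_is_a", "rose_is_a_rose", "is_a_rose_is"}
doc2 = {"a_rose_is_a", "rose_is_a_rose", "is_a_rose_is", "rose_is_a_flower"}
print(estimate_jaccard(sketch(doc1), sketch(doc2)))
```

The estimate fluctuates around the true value (0.75 for these two sets) and tightens as the number of hash functions grows, while each document is reduced to a fixed-size sketch.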

(74)

(75)

How does a search engine work?

[Figure: screenshot of a web search results page for the query “miele”: results 1 - 10 of about 7,310,000 (0.12 seconds), listing www.miele.com, www.miele.co.uk, www.miele.de and www.miele.at.]

Parts of web search engines

• The Web document

• Web crawler (Spider)

• Indexer

– Indexes

(77)

• A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.

• A Web crawler may also be called a Web spider, an ant, an automatic indexer, or a Web scutter.

(78)

Basic crawler operation

• A typical Web crawler

– picks a URL from the frontier (which starts from a set of seed pages),

– locates new pages by parsing the downloaded seed pages,

– extracts the hyperlinks within,

– stores the extracted links in a fetch queue for retrieval,

– continues downloading until the fetch queue is empty or a satisfactory number of pages has been downloaded.
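The crawl loop can be sketched without touching the network. The `WEB` dictionary is a made-up stand-in for downloading and parsing pages, so `fetch_links` is a placeholder where a real crawler would issue an HTTP GET and extract hyperlinks.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> hyperlinks found on that page.
WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}

def fetch_links(url):
    """Stand-in for download + parse of one page."""
    return WEB.get(url, [])

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)        # fetch queue, starts from the seed pages
    seen = set(seeds)              # URLs already queued or crawled
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:   # avoid queueing the same URL twice
                seen.add(link)
                frontier.append(link)
    return crawled

print(crawl(["http://a.example/"]))
```

The loop stops on either condition from the slide: an empty fetch queue or the page budget `max_pages`.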

(79)

Crawling picture

[Figure: crawling picture. Seed pages; URLs crawled and parsed; URLs frontier; Unseen Web.]

(80)

Web Crawler Architecture

• Sequential

– single computer

– not scalable

• Parallel

– multiple computers, single data center

– not scalable in terms of network

• Geographically distributed

– multiple computers, multiple data centers

– scalable, but has overheads

(81)

(82)

(83)

(84)

An Architectural Classification of Web Crawlers

(85)

What any crawler must do

• Be Polite: Respect implicit and explicit

politeness considerations

– Only crawl allowed pages

– Respect robots.txt (more on this shortly)

• Be Robust: Be resistant to spider traps and

other malicious behavior from web servers

(86)

What any crawler should do

• Be capable of distributed operation: designed to run on multiple distributed machines

• Be scalable: designed to increase the crawl rate by adding more machines

• Performance/efficiency: permit full use of available processing and network resources

(87)

What any crawler should do

• Fetch pages of “higher quality” first

• Continuous operation: Continue fetching

fresh copies of a previously fetched page

• Extensible: Adapt to new data formats,

protocols

(88)

Beginning with web crawler

• The basic algorithm:

{

Pick up the next URL

Connect to the server

GET the URL

When the page arrives, get its links

(Optional)

REPEAT

}

(89)

• Complete search engine = CRAWLER + indexer/searcher + GUI

• Working of WebCrawler

– Find Stuff

– Gather Stuff

– Check Stuff

(90)

Updated crawling picture

[Figure: updated crawling picture. Seed Pages; URLs crawled and parsed; URL frontier; Unseen Web.]

Sec. 20.1.1

(91)

URL frontier

• The URL frontier is given URLs by its crawl process.

• Can include multiple pages from the same

host

• Must avoid trying to fetch them all at the

same time
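One way to keep pages from the same host from being fetched at the same time is a frontier with one queue per host and a minimum gap between requests to that host. This is a sketch under assumptions: the delay value, the class name and the explicit `now` argument (used instead of a real clock so the behaviour is easy to follow) are all choices made for the example.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

class PoliteFrontier:
    """URL frontier that spaces out requests to the same host."""

    def __init__(self, delay=2.0):
        self.delay = delay                   # assumed gap in seconds per host
        self.queues = defaultdict(deque)     # host -> pending URLs
        self.next_ok = {}                    # host -> earliest next fetch time

    def add(self, url):
        self.queues[urlsplit(url).hostname].append(url)

    def next_url(self, now):
        """Return a URL whose host may be contacted at time `now`, else None."""
        for host, queue in self.queues.items():
            if queue and self.next_ok.get(host, 0.0) <= now:
                self.next_ok[host] = now + self.delay
                return queue.popleft()
        return None

frontier = PoliteFrontier(delay=2.0)
for url in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
    frontier.add(url)
print(frontier.next_url(0.0), frontier.next_url(0.0), frontier.next_url(0.0))
```

At time 0 the second page of a.example is held back even though it was queued before the page of b.example.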

(92)

Robots.txt

• Filter is a regular expression for a URL to be excluded

• Protocol for giving spiders (“robots”) limited access to a website, originally from 1994

– www.robotstxt.org/wc/norobots.html

• Website announces its request on what can(not) be crawled

– For a server, create a file /robots.txt

– This file specifies access restrictions

(93)

Robots.txt example

• No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

User-agent: *

Disallow: /yoursite/temp/

User-agent: searchengine

Disallow:
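A crawler can check such rules with Python's standard `urllib.robotparser` module. In this sketch the rules are parsed from a string rather than fetched from a server; the empty `Disallow:` line that closes the searchengine record and the host `www.example.com` are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# The rules from the slide; the final empty "Disallow:" (nothing is
# disallowed for searchengine) is assumed to complete the record.
ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

def make_parser(text):
    """Build a parser from robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = make_parser(ROBOTS_TXT)
url = "http://www.example.com/yoursite/temp/page.html"   # hypothetical URL
print(parser.can_fetch("searchengine", url))
print(parser.can_fetch("otherbot", url))
```

Against a live site, `set_url` followed by `read` would download the file before calling `can_fetch`.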

(94)

Recommended Texts

• Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley, 1999.

• Information Retrieval Algorithms and Heuristics, by David A. Grossman and Ophir Frieder, Kluwer Academic Publishers, 1998.

• An Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press.

(95)