Tutorial on
Information Retrieval
By:
Ansif Arooj
Lecturer of Computer Science University of the Education
Outline for the tutorial
– Introduction to information retrieval – Boolean retrieval
– Index Construction
o Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
(Reference: An Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze)
o Information Retrieval is finding any type of
relevant information. This may include
web-pages, news events, answers, images etc. but the key notion is relevance.
• Goal = find documents relevant to an information need from a large document set
[Figure: an information need is expressed as a query; the IR system performs retrieval over the document collection and returns an answer list]
Example
• Retrospective
– “Searching the past”
– Different queries posed against a static collection
• Prospective (Filtering)
– “Searching the future”
• Text (Documents)
• XML and structured documents • Images
• Audio (sound effects, songs, etc.) • Video
• Source code
The Big Picture
• The three components of the information retrieval environment:
– User – Process – Collection
What computer geeks care about!
The Information Retrieval Cycle
[Figure: the information retrieval cycle. Stages: Source Selection, Query Formulation, Search, Selection. Artifacts passed between stages: Resource, Query, Ranked List, Documents, Results. Feedback loops: query reformulation, relevance feedback. Indexing side: Document Collection, Indexing, Index.]
The IR Black Box
[Figure: the IR black box. Documents and Query each pass through a Representation step, giving a Document Representation and a Query Representation, which are then matched by a Comparison step.]
Difference between Structured and Unstructured Data
• Structured data stores and manages information in “tables”.
First Name   Last Name   Salary
Ali          Raza        50000
Ibraheem     Khan        60000
Ayesha       Umar        50000
Unstructured data
• An unstructured database is intended to store, in a manageable and protected way, diverse objects that do not fit naturally and conveniently in common databases. It may include:
– email messages, – documents,
– journals,
– multimedia objects, etc.
• Allows
– Keyword queries including operators
– More sophisticated “concept” queries, e.g.,
                 Databases                                                     IR
Data             Structured                                                    Unstructured
Fields           Clear semantics                                               No fields (other than text)
Queries          Defined (relational algebra, SQL)                             Free text (“natural language”), Boolean
Recoverability   Critical (concurrency control, recovery, atomic operations)   Downplayed, though still
Rapid growth in unstructured and semi-structured data
Unstructured data in the past
• Query:
– Which plays of Shakespeare contain the words
Brutus AND Caesar but NOT Calpurnia?
• Solution:
– One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia.
Answers to query
• Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.
• Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar I was killed i' the
Term-document incidence
            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      1                      1               0             0        0         1
Brutus      1                      1               0             1        0         0
Caesar      1                      1               0             1        1         1
Calpurnia   0                      1               0             0        0         0
Cleopatra   1                      0               0             0        0         0
mercy       1                      0               1             1        1         1
worser      1                      0               1             1        1         0
1 if play contains word, 0 otherwise
Brutus AND Caesar BUT NOT Calpurnia
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.
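The bitwise AND can be worked through directly on the three rows of the incidence matrix. The sketch below is only an illustration: the names `PLAYS`, `INCIDENCE` and `matching_plays` are made up here, and each row is stored as a 6-bit integer whose leftmost bit is the first play.

```python
# Rows copied from the term-document incidence matrix, one bit per play,
# leftmost bit = "Antony and Cleopatra", rightmost bit = "Macbeth".
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
INCIDENCE = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}
ALL_PLAYS = (1 << len(PLAYS)) - 1  # mask that keeps the complement to 6 bits


def matching_plays(bits):
    """Translate a result bit vector back into play titles."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if bits & (1 << (n - 1 - i))]


# Brutus AND Caesar AND NOT Calpurnia
not_calpurnia = ~INCIDENCE["Calpurnia"] & ALL_PLAYS
answer = INCIDENCE["Brutus"] & INCIDENCE["Caesar"] & not_calpurnia
answer_plays = matching_plays(answer)
```

Here `answer` is 100100, so the two matching plays are the ones listed under "Answers to query".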
How good are the retrieved docs?
• Precision
How to build inverted index
1. Assign an ID to each document: docID
2. Run the document preparation process on each document
3. Compile a list of all index terms
4. Assign an ID to each index term: termID
5. Create a list of all (termID, docID) pairs
Inverted index
• For each term t, we must store a list of all documents that contain t.
– Identify each by a docID, a document serial number
Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
Inverted index construction
[Figure: Documents to be indexed (“Friends, Romans, countrymen.”) → Tokenizer → Token stream (Friends Romans Countrymen) → Linguistic modules → Modified tokens (friend roman countryman) → Indexer → Inverted index]
Indexer steps: Token sequence
• Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol;
So let it be with Caesar. The noble Brutus hath told you
Indexer steps: Sort
• Sort by terms
Indexer steps: Dictionary & Postings
• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings
• Doc. frequency
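The three indexer steps (token sequence, sort, dictionary and postings) can be sketched in a few lines. This is a minimal illustration, not the deck's implementation: it keeps terms as lowercase strings instead of termIDs, splits on whitespace only, and `DOCS` holds just the two text fragments quoted above.

```python
from itertools import groupby

# Hypothetical two-document collection built from the fragments quoted above.
DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol",
    2: "So let it be with Caesar The noble Brutus hath told you",
}


def build_index(docs):
    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(token.lower(), doc_id)
             for doc_id, text in docs.items()
             for token in text.split()]
    # Step 2: sort by term, then by docID.
    pairs.sort()
    # Step 3: merge repeated entries and split into dictionary and postings.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        index[term] = sorted({doc_id for _, doc_id in group})
    return index


INDEX = build_index(DOCS)
# Document frequency of a term is the length of its postings list.
DOC_FREQ = {term: len(postings) for term, postings in INDEX.items()}
```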
The index we just built
• How do we process a query?
– Later - what kinds of queries can we process?
Query processing: AND
• Consider processing the query:
Brutus AND Caesar
– Step 1-Locate Brutus in the Dictionary;
• Retrieve its postings.
– Step 2-Locate Caesar in the Dictionary;
• Retrieve its postings.
– Step 3-“Merge” the two postings:
The merge
• Walk through the two postings
simultaneously, in linear time in the total number of postings entries
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Result → 2 8
If list lengths are x and y, merge takes O(x+y) operations.
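A minimal sketch of the linear-time merge, assuming both postings lists are already sorted by docID (the function name `intersect` is chosen here for illustration):

```python
def intersect(p1, p2):
    """Intersect two docID lists that are sorted in increasing order."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that sits on the smaller docID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
both = intersect(brutus, caesar)
```

Each loop iteration advances at least one pointer, which is where the O(x+y) bound comes from.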
Recall: Inverted index construction
[Figure: Documents to be indexed → Tokenizer → Token stream (Friends Romans Countrymen) → Linguistic modules → Modified tokens (friend roman countryman)]
Index Construction
Step 1- Parsing a document
• What format is it in?
– pdf/word/excel/html?
• What language is it in?
• What character set is in use?
We have to classify each document by its type and description.
Classification is therefore an important part of parsing.
But these tasks are often done heuristically …
• Input: “Friends, Romans, Countrymen”
• Output: Tokens
– Friends
– Romans
– Countrymen
• A token is a sequence of characters in a document
• Each such token is now a candidate for an index entry, after further processing
– Described below
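As a rough illustration of the input/output pair above, a tokenizer can be as small as one regular expression. The pattern here is an assumption (runs of letters, digits and apostrophes); real tokenizers need language-specific rules.

```python
import re


def tokenize(text):
    """Split text into tokens: maximal runs of letters, digits or apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


tokens = tokenize("Friends, Romans, Countrymen")
```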
Step 2 & 3 - Tokenization and Linguistic issues
Issues in Tokenization
– Finland’s capital
– State-of-the-art
– co-education
– lowercase, lower-case, lower case?
– San Francisco: one token or two?
• Number or date formats
– 3/12/91
– Mar. 12, 1991
– 12/3/91
– 55 B.C.
• With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
– They have little semantic content: the, a, and, to, be
– There are a lot of them: ~30% of postings for top 30 words
• Normalization to terms:
– Deleting periods to form a term, e.g., U.S.A. → USA
– Deleting hyphens to form a term, e.g., anti-discriminatory → antidiscriminatory
– Accents: e.g., French résumé vs. resume.
• Asymmetric expansion
– An alternative to equivalence classing
– An example of where this may be useful
• Enter: window Search: window, windows
• Enter: windows Search: Windows, windows, window
• Lemmatization:
– Reduce inflectional/variant forms to base form
– E.g.,
• am, are, is → be
• car, cars, car's, cars' → car
• Porter’s Algorithm:
Results suggest it’s at least as good as other stemming options. Typical rules:
– sses → ss
– ies → i
– ational → ate
– tional → tion
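The four suffix rules can be tried out with a tiny rewrite function. This is only a sketch of these four rules, not Porter's algorithm: the real stemmer applies many more rules in ordered phases, with conditions on the remaining stem. The names `RULES` and `apply_suffix_rules` are illustrative.

```python
# (suffix, replacement) pairs; "ational" is listed before "tional" so that
# the longer suffix wins when both match.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ational", "ate"),
    ("tional", "tion"),
]


def apply_suffix_rules(word):
    """Apply the first matching rule; leave the word alone if none match."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


stems = [apply_suffix_rules(w)
         for w in ["caresses", "ponies", "relational", "conditional", "cat"]]
```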
Phrase queries
• Want to be able to answer queries such as “stanford university” – as a phrase
• Thus the sentence “I went to university at
Stanford” is not a match.
– The concept of phrase queries has proven easily understood by users
– Many more queries are implicit phrase queries
• Biword Index:
– Index every consecutive pair of terms in the text as a phrase. For example, “Friends, Romans, Countrymen” would generate the biwords
• friends romans
• romans countrymen
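Generating biwords is a one-line pairing of each term with its successor. A minimal sketch, assuming lowercasing and a simple alphanumeric tokenizer (both are choices made here):

```python
import re


def biwords(text):
    """Return every consecutive pair of terms as a single 'biword' string."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [f"{first} {second}" for first, second in zip(terms, terms[1:])]


pairs = biwords("Friends, Romans, Countrymen")
```

Each biword then becomes a dictionary entry in its own right, so a two-word phrase query is a single lookup.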
• Extended Biword:
– Parse the indexed text and perform part-of-speech-tagging (POST).
– Bucket the terms into (say) Nouns (N) and articles/prepositions (X).
– Example: catcher in the rye
N X X N
• Query processing: parse it into N’s and X’s
– Segment query into enhanced biwords
– Look up in index: catcher rye
• Positional Indexes: in the postings, store for each term the position(s) in which tokens of it appear:
<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
Positional index example
• For phrase queries, we use a merge
algorithm recursively at the document level
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
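To make the document-level merge concrete, here is a small sketch of a two-word phrase query over a positional index. It assumes whitespace tokenization and an in-memory dict of the form term → {docID: [positions]}; the two sample documents reuse the “stanford university” example, and all names are illustrative.

```python
def build_positional_index(docs):
    """term -> {docID: [positions]} with positions counted from 0."""
    index = {}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


def phrase_docs(index, first, second):
    """docIDs in which `second` occurs immediately after `first`."""
    postings1 = index.get(first, {})
    postings2 = index.get(second, {})
    hits = []
    # Document-level merge: only docs containing both terms are candidates.
    for doc_id in sorted(postings1.keys() & postings2.keys()):
        next_positions = set(postings2[doc_id])
        # Position-level check: some occurrence of `first` is followed by `second`.
        if any(pos + 1 in next_positions for pos in postings1[doc_id]):
            hits.append(doc_id)
    return hits


DOCS = {
    1: "I went to university at Stanford",
    2: "Stanford University is in California",
}
POSITIONAL = build_positional_index(DOCS)
matches = phrase_docs(POSITIONAL, "stanford", "university")
```

Document 1 contains both words but not adjacent in that order, so only document 2 matches.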
Wild-card queries: *
• mon*: find all docs containing any word beginning with “mon”.
• Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
• *mon: find words ending in “mon”: harder
– Maintain an additional B-tree for terms
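The range lookup mon ≤ w < moo can be imitated with binary search over a sorted word list. This sketch swaps the B-tree for a plain sorted Python list and the standard `bisect` module; the lexicon contents are made up.

```python
import bisect

# Hypothetical lexicon, kept sorted so that binary search applies.
LEXICON = sorted(["moat", "monday", "money", "month", "moon", "more"])


def prefix_range(lexicon, prefix):
    """All words w with prefix <= w < upper, where upper bumps the last char."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)  # "mon" -> "moo"
    low = bisect.bisect_left(lexicon, prefix)
    high = bisect.bisect_left(lexicon, upper)
    return lexicon[low:high]


mon_words = prefix_range(LEXICON, "mon")
```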
(1) Permuterm index
• For term hello, index under:
– hello$, ello$h, llo$he, lo$hel, o$hell, $hello where $ is a special symbol.
• Queries:
– X lookup on X$
– X* lookup on $X*
– *X lookup on X$*
– *X* lookup on X*
– X*Y lookup on Y$X*
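A small sketch of the permuterm idea for queries with a single `*`: rotate the query so the wildcard lands at the end, then do a prefix match over the stored rotations. For brevity it scans a dict linearly instead of walking a B-tree, and the vocabulary is invented.

```python
def rotations(term):
    """All rotations of term + '$', e.g. hello$ ... $hello."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]


def build_permuterm(terms):
    """Map every rotation back to the term it came from."""
    return {rotation: term for term in terms for rotation in rotations(term)}


def wildcard_lookup(permuterm, query):
    """Answer a query X*Y (X or Y may be empty) containing exactly one '*'."""
    head, tail = query.split("*")
    key = tail + "$" + head  # X*Y -> prefix lookup on Y$X
    return sorted({term for rotation, term in permuterm.items()
                   if rotation.startswith(key)})


PERMUTERM = build_permuterm(["hello", "help", "halo", "month"])
prefix_hits = wildcard_lookup(PERMUTERM, "hel*")
suffix_hits = wildcard_lookup(PERMUTERM, "*lo")
infix_hits = wildcard_lookup(PERMUTERM, "h*o")
```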
(2) Bigram (k-gram) indexes
• Enumerate all k-grams (sequence of k chars) occurring in any term
• e.g., from text “April is the cruelest month” we get the 2-grams (bigrams)
$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$,$m,mo,on,nt,th,h$
– $ is a special word boundary symbol
• Maintain a second inverted index from bigrams to
Bigram index example
• The k-gram index finds terms based on a query consisting of k-grams (here k=2).
• Query mon* can now be run as
– $m AND mo AND on
$m → mace, madden
mo → among, amortize
on → along, among
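The bigram lookup for mon* can be sketched as a set intersection. One caveat goes beyond the bullets above: the AND of bigrams can also return terms such as "moon", which contain all three bigrams without starting with "mon", so this sketch adds a final filtering step against the original prefix. The vocabulary and function names are assumptions.

```python
def kgrams(term, k=2):
    """Set of k-grams of a term, with '$' marking both word boundaries."""
    marked = f"${term}$"
    return {marked[i:i + k] for i in range(len(marked) - k + 1)}


def build_kgram_index(terms, k=2):
    """k-gram -> set of dictionary terms containing that k-gram."""
    index = {}
    for term in terms:
        for gram in kgrams(term, k):
            index.setdefault(gram, set()).add(term)
    return index


def prefix_query(index, prefix, k=2):
    """Answer prefix* (non-empty prefix) via k-gram AND plus a post-filter."""
    marked = "$" + prefix
    grams = [marked[i:i + k] for i in range(len(marked) - k + 1)]
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Post-filter: drop false positives such as "moon" for the query "mon*".
    return sorted(t for t in candidates if t.startswith(prefix))


TERMS = ["mace", "madden", "among", "amortize", "along", "month", "moon"]
KGRAM_INDEX = build_kgram_index(TERMS)
mon_terms = prefix_query(KGRAM_INDEX, "mon")
```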
Step 4- Index construction
• How do we construct an index?
• What strategies can we use with limited main memory?
• Many design decisions in information retrieval are based on the characteristics of hardware
Hardware basics
• Access to data
• Disk seeks
• Block-based sorting
– Reading and writing of entire blocks (as opposed to smaller chunks).
– Block sizes: 8KB to 256 KB.
Distributed indexing
• Especially important for web-scale indexing.
• Maintain a master machine directing the indexing job
• Break up indexing into sets of (parallel) tasks.
• Master machine assigns each task to an idle machine
Parallel tasks
• We will use two sets of parallel tasks
– Parsers
– Inverters
• Break the input document collection into splits
• Each split is a subset of documents
Data flow
[Figure: data flow. The Master assigns splits to Parsers; each Parser's output is partitioned into the term ranges a-f, g-p and q-z; each Inverter handles one term range and produces its Postings.]
MapReduce
• The index construction algorithm we just described is an instance of MapReduce.
• … without having to write code for the distribution part.
• They describe the Google indexing system (ca. 2002) as consisting of a number of phases,
each implemented in MapReduce.
MapReduce
• Schema of map and reduce functions
map: input → list(k, v)
reduce: (k, list(v)) → output
• Instantiation of the schema for index construction
map: collection → list(termID, docID)
reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings list1, postings list2, …)
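The schema can be mimicked in a single process to see what each phase produces. This is a toy sketch, not a distributed implementation: it uses term strings rather than termIDs, a hand-written `shuffle` step stands in for the framework's grouping, and the two documents d1 and d2 are the ones from the worked example in this section (with a plain apostrophe).

```python
from collections import Counter, defaultdict

DOCS = {"d1": "C came, C c'ed.", "d2": "C died."}


def map_phase(doc_id, text):
    """map: document -> list of (term, docID) pairs."""
    words = text.replace(",", " ").replace(".", " ").split()
    return [(word, doc_id) for word in words]


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    """reduce: (term, list(docID)) -> postings list with term frequencies."""
    return term, sorted(Counter(doc_ids).items())


mapped = [pair for doc_id, text in DOCS.items() for pair in map_phase(doc_id, text)]
postings = dict(reduce_phase(term, ids) for term, ids in shuffle(mapped).items())
```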
Example for index construction
• Map: d1 : C came, C c’ed. d2 : C died. → <C,d1>, <came,d1>, <C,d1>, <c’ed,d1>, <C,d2>, <died,d2>
• Reduce:
– (<C,(d1,d2,d1)>, <died,(d2)>, <came,(d1)>, <c’ed,(d1)>)
→ (<C,(d1:2,d2:1)>, <died,(d2:1)>, <came,(d1:1)>, <c’ed,(d1:1)>)
Dynamic indexing
• Up to now, we have assumed that collections are static.
• They rarely are:
– Documents come in over time and need to be inserted.
– Documents are deleted and modified.
• This means that the dictionary and postings lists have to be modified:
– Postings updates for terms already in dictionary
– New terms added to dictionary
Simplest approach
• Maintain “big” main index
• New docs go into “small” auxiliary index
• Search across both, merge results
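A minimal sketch of the main-plus-auxiliary scheme, assuming both indexes are in-memory dicts from term to docID list. The class name `DynamicIndex` and its methods are invented for illustration; deletions and the periodic merge of the auxiliary index into the main one are left out.

```python
class DynamicIndex:
    """A 'big' read-only main index plus a 'small' auxiliary index for new docs."""

    def __init__(self, main_index):
        self.main = main_index  # term -> sorted list of docIDs, built offline
        self.aux = {}           # term -> list of docIDs added since then

    def add_document(self, doc_id, text):
        # New documents only ever touch the auxiliary index.
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)

    def search(self, term):
        # Query both indexes and merge the two postings lists.
        merged = set(self.main.get(term, [])) | set(self.aux.get(term, []))
        return sorted(merged)


index = DynamicIndex({"caesar": [1, 2], "brutus": [2]})
index.add_document(3, "Brutus killed Caesar")
caesar_docs = index.search("caesar")
killed_docs = index.search("killed")
```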
What is the size of the web?
• Issues
– The web is really infinite
• Dynamic content, e.g., calendars
– Static web contains syntactic duplication, mostly due to mirroring (~30%)
New definition?
– The statically indexable web is whatever search engines index.
• Different engines have different preferences
– max url depth, max count/host, anti-spam rules, priority rules, etc.
• Different engines index different things under the same URL:
– frames, meta-keywords, document restrictions, document extensions, ...
Duplicate documents
• The web is full of duplicated content
• Strict duplicate detection = exact match
• But many, many cases of near duplicates
– E.g., last-modified date the only difference between two copies of a page
Computing Similarity
• Features:
– Segments of a document (natural or artificial breakpoints)
– Shingles (Word N-Grams)
– a rose is a rose is a rose →
a_rose_is_a
rose_is_a_rose
is_a_rose_is
a_rose_is_a
• Similarity Measure between two docs (= sets of shingles)
– Jaccard coefficient: Size_of_Intersection / Size_of_Union
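Shingling and the Jaccard coefficient fit in a few lines. A minimal sketch, assuming whitespace tokenization, lowercasing and 4-word shingles as in the rose example; treating two empty shingle sets as identical is a choice made here.

```python
def shingles(text, n=4):
    """Set of word n-grams, joined with underscores."""
    words = text.lower().split()
    return {"_".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Size of intersection divided by size of union."""
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(union)


rose = shingles("a rose is a rose is a rose")
short_rose = shingles("a rose is a rose")
similarity = jaccard(rose, short_rose)
```

Because shingles form a set, the repeated a_rose_is_a counts once, leaving three distinct shingles for the longer sentence.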
Shingles + Set Intersection
• Computing exact set intersection of shingles between all pairs of documents is expensive/intractable
– Approximate using a cleverly chosen subset of shingles from each (a sketch)
• Estimate (size_of_intersection / size_of_union) based on a short sketch
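One common way to build such a sketch is min-hashing; the slides do not say which scheme they have in mind, so treat this as one possible instance. For each of several hash functions we keep only the smallest hash value over a document's shingles, and the fraction of positions where two sketches agree estimates the Jaccard coefficient. The salted SHA-1 hash, the sketch length of 64 and the assumption of non-empty shingle sets are all arbitrary choices here.

```python
import hashlib


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle under a given seed (salt)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def sketch(shingle_set, num_hashes=64):
    """Min-hash sketch: the minimum hash value under each seeded hash function."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions on which the two documents agree."""
    agree = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return agree / len(sketch_a)


doc_a = {"a_rose_is_a", "rose_is_a_rose", "is_a_rose_is"}
doc_b = {"a_rose_is_a", "rose_is_a_rose"}
estimate = estimate_jaccard(sketch(doc_a), sketch(doc_b))
```

The estimate fluctuates around the true value (2/3 for these two sets) and tightens as `num_hashes` grows; only the fixed-length sketches need to be compared, not the full shingle sets.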
How does a search engine work?
[Figure: example results page for the query “miele” (results 1 - 10 of about 7,310,000, 0.12 seconds), showing web results such as www.miele.com, www.miele.co.uk, www.miele.de and www.miele.at alongside sponsored links; the diagram labels the components Web spider, Indexer, Search and User.]
Parts of web search engines
• The Web document
• Web crawler (Spider)
• Indexer
– Indexes
• A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.
• A Web crawler may also be called a Web spider, an ant, an automatic indexer, or a Web scutter.
Basic crawler operation
• A typical Web crawler
– picks a URL from the frontier (which starts as a set of seed pages),
– locates new pages by parsing the downloaded seed pages,
– extracts the hyperlinks within,
– stores the extracted links in a fetch queue for retrieval,
– continues downloading until the fetch queue is empty or a satisfactory number of pages has been downloaded.
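The crawl loop can be sketched without touching the network by replacing the download-and-parse step with a lookup in a toy link graph. Everything named here (`TOY_WEB`, `crawl`, the `fetch_links` callback) is hypothetical; a real crawler would fetch pages over HTTP and apply the politeness rules discussed in this tutorial.

```python
from collections import deque

# Hypothetical in-memory "web": page -> hyperlinks found on that page.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl starting from the seed pages."""
    frontier = deque(seeds)  # fetch queue
    seen = set(seeds)        # never enqueue the same URL twice
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)            # "download" the page
        for link in fetch_links(url):  # extract its hyperlinks
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled


visited = crawl(["a"], lambda url: TOY_WEB.get(url, []))
limited = crawl(["a"], lambda url: TOY_WEB.get(url, []), max_pages=2)
```

The loop stops on either of the two conditions in the last bullet: an empty fetch queue or the page budget `max_pages`.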
Crawling picture
[Figure: crawling picture. Seed pages, URLs crawled and parsed, URLs frontier, Unseen Web]
Web Crawler Architecture
• Sequential
– single computer
– not scalable
• Parallel
– multiple computers, single data center
– not scalable in terms of network
• Geographically distributed
– multiple computers, multiple data centers
– scalable, but has overheads
An Architectural Classification of Web Crawlers
What any crawler must do
• Be Polite: Respect implicit and explicit
politeness considerations
– Only crawl allowed pages
– Respect robots.txt (more on this shortly)
• Be Robust: Be resistant to spider traps and
other malicious behavior from web servers
What any crawler should do
• Be capable of distributed operation: designed to run on multiple distributed machines
• Be scalable: designed to increase the crawl rate by adding more machines
• Performance/efficiency: permit full use of available processing and network resources
What any crawler should do
• Fetch pages of “higher quality” first
• Continuous operation: Continue fetching
fresh copies of a previously fetched page
• Extensible: Adapt to new data formats,
protocols
Beginning with web crawler
• The basic algorithm :
{
Pick up the next URL
Connect to the server
GET the URL
When the page arrives, get its links
(Optional) REPEAT
}
• Complete search engine = CRAWLER + indexer/searcher + GUI
• Working of WebCrawler
– Find Stuff
– Gather Stuff
– Check Stuff
Updated crawling picture
[Figure: updated crawling picture. Seed Pages, URLs crawled and parsed, URL frontier, Unseen Web]
URL frontier
• The URL frontier gives a URL to the crawl process when asked.
• Can include multiple pages from the same host
• Must avoid trying to fetch them all at the same time
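One simple way to avoid hitting a single host with back-to-back requests is to keep a queue per host together with the earliest time that host may be contacted again. The sketch below is an assumption-laden illustration, not a production frontier design: `PoliteFrontier`, the fixed 2-second delay and the explicit `now` argument (used instead of a real clock) are all invented here.

```python
from collections import deque
from urllib.parse import urlparse


class PoliteFrontier:
    """URL frontier with one queue per host and a minimum delay between fetches."""

    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.queues = {}    # host -> deque of URLs waiting to be fetched
        self.ready_at = {}  # host -> earliest time the host may be hit again

    def add(self, url):
        host = urlparse(url).netloc
        self.queues.setdefault(host, deque()).append(url)
        self.ready_at.setdefault(host, 0.0)

    def next_url(self, now):
        """Return a URL whose host is ready at time `now`, or None."""
        for host, queue in self.queues.items():
            if queue and self.ready_at[host] <= now:
                self.ready_at[host] = now + self.delay
                return queue.popleft()
        return None


frontier = PoliteFrontier(delay_seconds=2.0)
for url in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
    frontier.add(url)

first = frontier.next_url(now=0.0)
second = frontier.next_url(now=0.0)   # a.example is cooling down, so b.example is served
third = frontier.next_url(now=0.0)    # both hosts are cooling down
fourth = frontier.next_url(now=2.0)   # a.example is ready again
```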
Robots.txt
• Filter is a regular expression for a URL to be excluded
• Protocol for giving spiders (“robots”) limited access to a website, originally from 1994
– www.robotstxt.org/wc/norobots.html
• Website announces its request on what can(not) be crawled
– For a server, create a file /robots.txt
– This file specifies access restrictions
Robots.txt example
• No robot should visit any URL starting with "/yoursite/temp/", except the robot called “searchengine":
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
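Python's standard library ships a parser for this protocol, `urllib.robotparser`, which can be used to check the rule described above. The sketch assumes the `searchengine` group ends with an empty `Disallow:` line (meaning nothing is disallowed for that robot); the host `example.com` and the robot name `otherbot` are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Assumed full text of the example file; the empty Disallow grants full access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""


def make_parser(robots_txt):
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)
temp_url = "http://example.com/yoursite/temp/page.html"
other_may_fetch = parser.can_fetch("otherbot", temp_url)
searchengine_may_fetch = parser.can_fetch("searchengine", temp_url)
```

A crawler would normally load the file with `set_url` and `read` from the live site before calling `can_fetch`; parsing from a string keeps the example offline.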
Recommended Texts
• Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley, 1999.
• Information Retrieval Algorithms and Heuristics, by David A. Grossman and Ophir Frieder, Kluwer Academic Publishers, 1998.
• An Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press.