Introduction to Information Retrieval
Christof Monz and Maarten de Rijke
Today’s Program
• What’s Information Retrieval?
• Some administrative stuff
  ◦ Overview of the course
  ◦ Grading, homework etc.
• How to represent information
• Our first retrieval model: Boolean retrieval
What is Information Retrieval?
• Finding relevant information in large collections of data
• In such a collection you may want to find:
  ◦ ‘Give me information on the history of the Kennedys’
    An article about the Kennedys (text retrieval)
  ◦ ‘What does a brain tumor look like on a CT-scan’
    A picture of a brain tumor (image retrieval)
  ◦ ‘It goes like this: hmm hmm hahmmm ...’
    A certain song (music retrieval)
Text Retrieval
• Online library catalogs (OPAC)
• Internet search engines, such as AltaVista, Google, Ilse
• Specialized systems (aka vendors):
  ◦ MEDLINE (medical articles)
  ◦ Lexis-Nexis (legal, business, academic, ...)
  ◦ Westlaw (legal articles)
  ◦ Dialog (business information)
Retrieval vs. Browsing
• Popular Web Directories:
  ◦ Yahoo!, Open Directory Project (dmoz)
• The user has to ‘guess’ the ‘right’ directories to find the information
  ◦ The user has to adapt to the designers’ conceptualization of the directory
• The goal of information retrieval is to provide immediate random access to the data
IR vs. Database Querying
• IR is not the same thing as querying a database
• Database querying assumes that the data is in a standardized format
• Transforming all information (news articles, web sites) into a database format is difficult, and for large data collections impossible
• Text retrieval can work with plain, unformatted data
Relevance as Similarity
• A fundamental idea within IR is:
  ‘A document is relevant to a query if they are similar’
• Similarity can be defined as
  ◦ string matching/comparison
  ◦ similar vocabulary
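A minimal sketch of the ‘similar vocabulary’ idea, assuming that similarity is measured as the overlap between the word sets of the query and the document. The overlap measure used here (the Jaccard coefficient) is one possible choice, not one the slides commit to; the function name `vocabulary_similarity` and the sample texts are illustrative.

```python
def vocabulary_similarity(query: str, document: str) -> float:
    """Jaccard overlap between the lower-cased word sets of two texts."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    if not query_words or not document_words:
        return 0.0
    shared = query_words & document_words
    return len(shared) / len(query_words | document_words)


if __name__ == "__main__":
    query = "history of the Kennedys"
    doc_a = "An article about the history of the Kennedys"
    doc_b = "A picture of a brain tumor"
    # doc_a shares every query word, doc_b only shares "of".
    print(vocabulary_similarity(query, doc_a))
    print(vocabulary_similarity(query, doc_b))
```

Under this definition a document scores higher the more of its vocabulary it shares with the query, which is why the Kennedys article outranks the brain tumor picture caption.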
The Ubiquity of IR
• Information filtering
  ◦ E-mail routing
  ◦ Text categorization
• Detecting information structure
  ◦ Hyperlink generation
  ◦ Topic/Information detection/screening
  ◦ Portal development and maintenance
• Question Answering
Some Research Groups in IR
• Industrial IR research: AT&T, NEC, Sun Microsystems, Microsoft, G&E Research, Sabir Research, NTT, AltaVista, Xerox, Q-Go, GO.com (Infoseek), Lexiquest, Answers.com, AnswerLogics, Google, Ask-Jeeves, Lucent Technologies, IBM ...
• Academic IR Groups: Cornell, Massachusetts, Twente, Glasgow, Sheffield, Dortmund, Dublin, Stanford, Syracuse, Virginia Tech, Pisa ...
• Other:
History of IR
• 1950: Calvin N. Mooers coins the term ‘Information Retrieval’
• 1959: Luhn describes statistical retrieval
• 1960: Maron and Kuhns define a probabilistic model of IR
• 1966: Cranfield project defines evaluation measures
• 1968: Gerard Salton’s first book about the SMART retrieval system
• 1972: Lockheed introduces DIALOG as a commercial online service
History of IR (continued)
• Early 1990s: Cheap disks lead to the information storage revolution
• 1992: Westlaw is the first large-scale information service using probabilistic retrieval
• Mid 1990s: Multi-media databases
• 1994: The internet and web explosion
Overview of the Course
• Basic IR models (week 1 & 2)
• Evaluating the quality of IR methods (week 3)
• Text representation (week 4)
• Components of an IR system (week 5 & 6)
• Improving effectiveness and efficiency (week 6 & 7)
• Web-based IR (week 8 & 9)
Objectives of the Course
• At the end of the course you will be able to...
  ◦ Exploit web-specific information when searching
  ◦ Understand the core components of modern IR systems
  ◦ Understand the potential of IR techniques for today’s information society
  ◦ Build your own search engine (in principle)
  ◦ Make some serious dough
Grading etc.
• Prerequisites:
  ◦ Computer literacy (including an account on gene plus the ability to use the Unix command line interface)
• Assessment:
  ◦ Weekly reading assignments (1 or 2 papers per week)
  ◦ (3-5) assignments
  ◦ Final exam
  ◦ Final mark is obtained as the weighted average of the final exam (60%), assignments (30%) and reading (10%)
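The weighting can be written out as a one-line calculation. A minimal sketch, assuming all three components are marked on the same scale; the function name `final_mark` and the sample marks are made up for illustration.

```python
def final_mark(exam: float, assignments: float, reading: float) -> float:
    """Weighted average: final exam 60%, assignments 30%, reading 10%."""
    return 0.6 * exam + 0.3 * assignments + 0.1 * reading


if __name__ == "__main__":
    # Hypothetical marks: exam 8, assignments 7, reading 9.
    print(round(final_mark(8.0, 7.0, 9.0), 2))
```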
Web Site of the Course
• URL: www.science.uva.nl/~christof/courses/ir/
• Features of the web site:
  ◦ Some of the reading material is available online
  ◦ Links to universities, companies and people relevant to IR
  ◦ Printer-friendly versions of the transparencies
Retrieval Models
• A retrieval model is an idealization or abstraction of an actual retrieval process
• Conclusions derived from a model depend on whether the model is a good approximation of the retrieval situation
• Note that a retrieval model is not the same thing as a retrieval implementation
Retrieval Models
[Figure: diagram relating the User, query formulation, document representations, and the step ‘identify relevant information’]
Components of a Retrieval Model
• The user:
  ◦ Search expert (e.g., librarian) vs. non-expert
  ◦ Background of the user (knowledge of the topic)
  ◦ In-depth searching vs. ‘just-wanna-get-an-idea’ searching
• The documents:
  ◦ Different languages
Document Representation
• Meta-descriptions
  ◦ Field information (author, title, date)
  ◦ Key words
    - Predefined
    - Manually extracted (by author/editor)
• Content: automatically identifying what the document is about
Document Representation

                        Manual                      Automatic
Controlled Vocabulary   Current indexing practice   Text categorization, ‘intelligent’ IR
Free Text               Current indexing practice   Text search engines
Controlled Vocabularies
• Examples are:
  ◦ ACM Computing Classification System
    An article on Web search engines would (probably) be classified as H.3.5 where:
    - H: Information Systems
    - H.3: Information Storage and Retrieval
    - H.3.5: Online Information Services
  ◦ NLM Medical Subject Headings (MeSH)
  ◦ Yahoo!
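The hierarchical structure of such a code can be made explicit by expanding it into its ancestors. A small sketch, assuming that every dot-separated prefix of a code names a broader category; the table `CCS_LABELS` holds only the three labels from the slide, and the function name `classification_path` is illustrative.

```python
from typing import Dict, List

# Labels taken from the slide; the table is deliberately incomplete.
CCS_LABELS: Dict[str, str] = {
    "H": "Information Systems",
    "H.3": "Information Storage and Retrieval",
    "H.3.5": "Online Information Services",
}


def classification_path(code: str) -> List[str]:
    """Return the code and all of its ancestors, most general first."""
    parts = code.split(".")
    prefixes = [".".join(parts[: i + 1]) for i in range(len(parts))]
    return [f"{prefix}: {CCS_LABELS.get(prefix, 'unknown')}" for prefix in prefixes]


if __name__ == "__main__":
    for line in classification_path("H.3.5"):
        print(line)
```

A searcher who knows the scheme can query at any level of the path, which is the efficiency argument for controlled vocabularies.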
Manual vs. Automatic Indexing
• Pros of manual indexing:
  + Human judgements are most reliable
  + Searching controlled vocabularies is more efficient
• Cons of manual indexing:
  − Time consuming
  − The person using the retrieval system has to be familiar with the classification system
  − Classification systems are sometimes incoherent
Automatic Content Representation
• Using natural language understanding?
  ◦ Computationally too expensive in real-world settings
  ◦ Coverage
  ◦ Language dependence
  ◦ The resulting representations may be too explicit to deal with the vagueness of a user’s information need
Bag-of-Words Approach
• A document is an unordered list of words
  Grammatical information is lost
• Tokenization: What is a word?
  Is ‘White House’ one or two words?
• Case folding
  ‘President Bush’ becomes ‘president’, ‘bush’
• Stemming or lemmatization
  Morphological information is thrown away
  ‘agreements’ becomes ‘agreement’ (lemmatization) or even ‘agree’ (stemming)
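The three normalization steps above can be sketched as follows. This is a toy illustration, not a real linguistic pipeline: the tokenizer simply keeps runs of letters and digits (optionally joined by hyphens), and `toy_stem` is a made-up two-rule suffix stripper that only exists to reproduce the ‘agreements’ example; real stemmers use far more careful rules.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Keep runs of letters/digits, allowing internal hyphens."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)


def case_fold(tokens: List[str]) -> List[str]:
    """Lower-case every token."""
    return [token.lower() for token in tokens]


def toy_stem(token: str) -> str:
    """Strip a plural 's', then a 'ment' suffix (toy rules only)."""
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]
    if token.endswith("ment") and len(token) > 6:
        token = token[:-4]
    return token


if __name__ == "__main__":
    print(tokenize("White House"))                 # two separate tokens
    print(case_fold(tokenize("President Bush")))   # name and noun collapse
    print(toy_stem("agreements"))
```

With this whitespace-style tokenizer ‘White House’ becomes two tokens, so the phrase is no longer distinguishable from any other text that contains both words.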
Example Bag of Words
Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday.

a, ancient, announced, compelling, crystals, derived, earth, evidence, fell, found, from (2×), have, in, life, magnetic, mars, meteorite, microscopic, monday, nasa, new, of, on (2×), planet, possible, red, scientists, that, the, to
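The bag for this sentence can be computed mechanically. A minimal sketch, assuming tokens are runs of letters after case folding and that no stemming is applied; the names `TEXT` and `bag_of_words` are illustrative.

```python
import re
from collections import Counter

TEXT = (
    "Scientists have found compelling new evidence of possible "
    "ancient microscopic life on Mars, derived from magnetic "
    "crystals in a meteorite that fell to Earth from the red planet, "
    "NASA announced on Monday."
)


def bag_of_words(text: str) -> Counter:
    """Lower-case the text, split it into alphabetic tokens, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


if __name__ == "__main__":
    bag = bag_of_words(TEXT)
    # Print the bag alphabetically, marking words that occur more than once.
    for word in sorted(bag):
        count = bag[word]
        print(word if count == 1 else f"{word} ({count}x)")
```

Sorting the words alphabetically makes the loss of word order visible: the counts survive, the sentence structure does not.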
What is this about?
added, al, an, and, ballots, been, completed, count, county (2×), even, former, gore, ground, had, hand, have (2×), he, if, in (2×), independent, lost, many, miami-dade, might, new, not, of, president, presidential, requested, shows, study, that, the, vice, votes, would

=

An independent study shows former Vice President Al Gore would not have added many new votes in Miami-Dade County and might even have lost ground in that county, if the hand count of presidential ballots he requested had been completed.
Boolean Retrieval
• Boolean operators are: AND (NEAR), OR, NOT
• The semantics of the Boolean operators:
  ◦ $t_1 \text{ AND } t_2 = \{ d \mid t_1 \in r(d) \} \cap \{ d \mid t_2 \in r(d) \}$
    Documents whose representation contains $t_1$ and $t_2$
  ◦ $t_1 \text{ OR } t_2 = \{ d \mid t_1 \in r(d) \} \cup \{ d \mid t_2 \in r(d) \}$
    Documents whose representation contains $t_1$ or $t_2$
  ◦ $\text{NOT } t_1 = \{ d \mid t_1 \notin r(d) \}$
    Documents whose representation doesn’t contain $t_1$
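The set definitions translate almost literally into code. A minimal sketch, assuming that r(d) is stored as a set of terms per document; the three toy documents in `REPRESENTATIONS` and all function names are made up for illustration.

```python
from typing import Dict, Set

# r(d): each document id mapped to its representation (a set of terms).
# The toy collection is hypothetical.
REPRESENTATIONS: Dict[str, Set[str]] = {
    "d1": {"president", "bush", "congress"},
    "d2": {"president", "bill", "clinton"},
    "d3": {"bill", "senate"},
}


def docs_with(term: str) -> Set[str]:
    """{ d | term in r(d) }"""
    return {d for d, terms in REPRESENTATIONS.items() if term in terms}


def boolean_and(t1: str, t2: str) -> Set[str]:
    """Intersection of the two document sets."""
    return docs_with(t1) & docs_with(t2)


def boolean_or(t1: str, t2: str) -> Set[str]:
    """Union of the two document sets."""
    return docs_with(t1) | docs_with(t2)


def boolean_not(t1: str) -> Set[str]:
    """All documents whose representation does not contain the term."""
    return set(REPRESENTATIONS) - docs_with(t1)


if __name__ == "__main__":
    print(sorted(boolean_and("president", "bill")))
    print(sorted(boolean_or("president", "bill")))
    print(sorted(boolean_not("president")))
```

Note that NOT is computed relative to the whole collection, so a bare NOT query can return almost every document.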
Boolean Retrieval
• Information need: President Bill Clinton
• Boolean query: clinton AND (bill OR president)
[Figure: diagram of the term sets bill, clinton, and president]
Boolean Retrieval in Action
1. President George W. Bush on Tuesday makes his first address to a joint session of Congress and has promised a “to the point” speech laying out his plans for tax cuts and spending priorities.
2. While he was still president, Bill Clinton telephoned the chief executive of television network CBS seeking to help two old friends in a million-dollar billing dispute, The Wall Street Journal reported in its online edition Tuesday.
3. The White House press office did return calls seeking President Bush’s position on the bill, but Bell, from the national partnership, said she is optimistic that paid family leave will become a reality under a Republican administration.
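The query clinton AND (bill OR president) can be run against these three documents. A minimal sketch, assuming case folding, tokens made of letters only, and no stemming; the inverted index (term to set of document numbers) and the names `build_index` and `postings` are illustrative choices, not something the slides prescribe.

```python
import re
from typing import Dict, Set

# The three example documents from the slide, keyed by their numbers.
DOCUMENTS: Dict[int, str] = {
    1: "President George W. Bush on Tuesday makes his first address to a "
       "joint session of Congress and has promised a “to the point” speech "
       "laying out his plans for tax cuts and spending priorities.",
    2: "While he was still president, Bill Clinton telephoned the chief "
       "executive of television network CBS seeking to help two old friends "
       "in a million-dollar billing dispute, The Wall Street Journal "
       "reported in its online edition Tuesday.",
    3: "The White House press office did return calls seeking President "
       "Bush’s position on the bill, but Bell, from the national "
       "partnership, said she is optimistic that paid family leave will "
       "become a reality under a Republican administration.",
}


def build_index(documents: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each case-folded token to the set of documents containing it."""
    index: Dict[str, Set[int]] = {}
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index.setdefault(token, set()).add(doc_id)
    return index


def postings(index: Dict[str, Set[int]], term: str) -> Set[int]:
    """Documents containing the term; empty set for unknown terms."""
    return index.get(term, set())


INDEX = build_index(DOCUMENTS)

# clinton AND (bill OR president)
RESULT = postings(INDEX, "clinton") & (
    postings(INDEX, "bill") | postings(INDEX, "president")
)

if __name__ == "__main__":
    print(sorted(RESULT))
```

Only document 2 satisfies the query. Documents 1 and 3 both contain ‘president’, and document 3 also contains ‘bill’, but neither mentions ‘clinton’. Because of case folding, the name ‘Bill’ in document 2 and the noun ‘bill’ in document 3 become the same term; without stemming, ‘billing’ does not match ‘bill’.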
Zipf’s Law
[Figure: number of occurrences plotted against words sorted by frequency]
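The data behind such a plot is a list of words ranked by their number of occurrences. A minimal sketch of how to compute it; Zipf’s law says that frequency falls off roughly in inverse proportion to rank, which only becomes visible on collections far larger than the made-up sample string used here. The function name `rank_frequency` is illustrative.

```python
import re
from collections import Counter
from typing import List, Tuple


def rank_frequency(text: str) -> List[Tuple[int, str, int]]:
    """Return (rank, word, occurrences) triples, most frequent word first."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    ranked = counts.most_common()
    return [(rank, word, n) for rank, (word, n) in enumerate(ranked, start=1)]


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    for rank, word, n in rank_frequency(sample):
        print(rank, word, n)
```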