Introduction to Information Retrieval


Academic year: 2021

Christof Monz and Maarten de Rijke

Today’s Program

- What’s Information Retrieval?
- Some administrative stuff
  - Overview of the course
  - Grading, homework etc.
- How to represent information
- Our first retrieval model: boolean retrieval

What is Information Retrieval?

Finding relevant information in large collections of data

In such a collection you may want to find:
- ‘Give me information on the history of the Kennedys’
  An article about the Kennedys (text retrieval)
- ‘What does a brain tumor look like on a CT-scan?’
  A picture of a brain tumor (image retrieval)
- ‘It goes like this: hmm hmm hahmmm . . .’
  A certain song (music retrieval)

Text Retrieval

- Online library catalogs (OPAC)
- Internet search engines, such as AltaVista, Google, Ilse
- Specialized systems (aka vendors):
  - MEDLINE (medical articles)
  - Lexis-Nexis (legal, business, academic, . . . )
  - Westlaw (legal articles)
  - Dialog (business information)

Retrieval vs. Browsing

- Popular Web Directories:
  - Yahoo!, Open Directory Project (dmoz)
- The user has to ‘guess’ the ‘right’ directories to find the information
  - The user has to adapt to the designers’ conceptualization of the directory
- The goal of information retrieval is to provide immediate random access to the data

IR vs. Database Querying

- IR is not the same thing as querying a database
- Database querying assumes that the data is in a standardized format
- Transforming all information, news articles, and web sites into a database format is difficult, and impossible for large data collections
- Text retrieval can work with plain, unformatted data

Relevance as Similarity

A fundamental idea within IR is:

  ‘A document is relevant to a query if they are similar’

Similarity can be defined as
- string matching/comparison
- similar vocabulary
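
As an illustration of the ‘similar vocabulary’ reading, the sketch below scores a document against a query by the overlap of their word sets (Jaccard coefficient). This measure is an assumption chosen for the example, not one the slides prescribe; the function names and the two toy documents are hypothetical.

```python
import re

def vocabulary(text):
    """Lower-cased set of alphabetic tokens in `text`."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(query, document):
    """Vocabulary overlap |Q & D| / |Q | D| (0.0 when both are empty)."""
    q, d = vocabulary(query), vocabulary(document)
    union = q | d
    return len(q & d) / len(union) if union else 0.0

query = "history of the Kennedys"
docs = {
    "a": "An article about the history of the Kennedys",
    "b": "A picture of a brain tumor",
}
# Higher score = more shared vocabulary with the query.
scores = {name: jaccard(query, text) for name, text in docs.items()}
```

Under this measure document "a" scores higher than document "b", which shares only the word "of" with the query.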

The Ubiquity of IR

- Information filtering
  - E-mail routing
  - Text categorization
- Detecting information structure
  - Hyperlink generation
  - Topic/Information detection/screening
  - Portal development and maintenance
- Question Answering

Some Research Groups in IR

- Industrial IR research: AT&T, NEC, Sun Microsystems, Microsoft, G&E Research, Sabir Research, NTT, AltaVista, Xerox, Q-Go, GO.com (Infoseek), Lexiquest, Answers.com, AnswerLogics, Google, Ask-Jeeves, Lucent Technologies, IBM . . .
- Academic IR Groups: Cornell, Massachusetts, Twente, Glasgow, Sheffield, Dortmund, Dublin, Stanford, Syracuse, Virginia Tech, Pisa . . .
- Other:

History of IR

1950: Calvin N. Mooers coins the term ‘Information Retrieval’
1959: Luhn describes statistical retrieval
1960: Maron and Kuhns define a probabilistic model of IR
1966: Cranfield project defines evaluation measures
1968: Gerard Salton’s first book about the SMART retrieval system
1972: Lockheed introduces DIALOG as a commercial online service

History of IR

Early 1990s: Cheap disks lead to the information storage revolution
1992: Westlaw is the first large-scale information service using probabilistic retrieval
Mid-1990s: Multi-media databases
1994: The internet and web explosion

Overview of the Course

- Basic IR models (weeks 1 & 2)
- Evaluating the quality of IR methods (week 3)
- Text representation (week 4)
- Components of an IR system (weeks 5 & 6)
- Improving effectiveness and efficiency (weeks 6 & 7)
- Web-based IR (weeks 8 & 9)

Objectives of the Course

At the end of the course you will be able to . . .
- Exploit web-specific information when searching
- Understand the core components of modern IR systems
- Understand the potential of IR techniques for today’s information society
- Build your own search engine (in principle)
- Make some serious dough

Grading etc.

Prerequisites:
- Computer literacy (including an account on gene, plus the ability to use the Unix command-line interface)

Assessment:
- Weekly reading assignments (1 or 2 papers per week)
- 3-5 assignments
- Final exam
- Final mark is obtained as the weighted average of the final exam (60%), assignments (30%) and reading (10%)

Web Site of the Course

URL: www.science.uva.nl/christof/courses/ir/

Features of the web site:
- Some of the reading material is available online
- Links to universities, companies and people relevant to IR
- Printer-friendly versions of the transparencies

Retrieval Models

- A retrieval model is an idealization or abstraction of an actual retrieval process
- Conclusions derived from a model depend on whether the model is a good approximation of the retrieval situation
- Note that a retrieval model is not the same thing as a retrieval implementation

Retrieval Models

[Figure: diagram relating the user, query formulation, document representations, and the step ‘identify relevant information’]

Components of a Retrieval Model

The user:
- Search expert (e.g., librarian) vs. non-expert
- Background of the user (knowledge of the topic)
- In-depth searching vs. ‘just-wanna-get-an-idea’ searching

The documents:
- Different languages

Document Representation

- Meta-descriptions
  - Field information (author, title, date)
  - Key words
    - Predefined
    - Manually extracted (by author/editor)
- Content: automatically identifying what the document is about

Document Representation

                        Manual                      Automatic
Controlled Vocabulary   Current indexing practice   Text categorization, ‘intelligent’ IR
Free Text               Current indexing practice   Text search engines

Controlled Vocabularies

Examples are:
- ACM Computing Classification System
  An article on Web search engines would (probably) be classified as H.3.5 where:
  - H: Information Systems
  - H.3: Information Storage and Retrieval
  - H.3.5: Online Information Services
- NLM Medical Subject Headings (MeSH)
- Yahoo!
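
A hierarchical code such as H.3.5 carries its broader categories in its own prefixes. The sketch below shows one possible way to unpack such a code. The three labels are the ones given on the slide; the dictionary and function names are hypothetical.

```python
# Labels copied from the slide; a real classification has many more entries.
ACM_LABELS = {
    "H": "Information Systems",
    "H.3": "Information Storage and Retrieval",
    "H.3.5": "Online Information Services",
}

def ancestors(code):
    """Return the code and all its prefixes, most general first."""
    parts = code.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]

def describe(code):
    """Pair each level of `code` with its label (None if the label is unknown)."""
    return [(c, ACM_LABELS.get(c)) for c in ancestors(code)]

path = describe("H.3.5")
```

A search for the broad category H.3 could then match every document whose code lists H.3 among its ancestors.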

Manual vs. Automatic Indexing

Pros of manual indexing:
+ Human judgements are most reliable
+ Searching controlled vocabularies is more efficient

Cons of manual indexing:
- Time consuming
- The person using the retrieval system has to be familiar with the classification system
- Classification systems are sometimes incoherent

Automatic Content Representation

Using natural language understanding?
- Computationally too expensive in real-world settings
- Coverage
- Language dependence
- The resulting representations may be too explicit to deal with the vagueness of a user’s information need

Bag-of-Words Approach

- A document is an unordered list of words
  Grammatical information is lost
- Tokenization: What is a word?
  Is ‘White House’ one or two words?
- Case folding
  ‘President Bush’ becomes ‘president’, ‘bush’
- Stemming or lemmatization
  Morphological information is thrown away
  ‘agreements’ becomes ‘agreement’ (lemmatization) or even ‘agree’ (stemming)
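
The first three steps above can be sketched in a few lines. This is a minimal illustration under simple assumptions: a token is any run of ASCII letters (so ‘White House’ becomes two tokens), stemming and lemmatization are left out, and the function names and sample sentence are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Split on anything that is not a letter; 'White House' yields two tokens."""
    return re.findall(r"[A-Za-z]+", text)

def bag_of_words(text):
    """Case-fold every token and count occurrences; word order is discarded."""
    return Counter(token.lower() for token in tokenize(text))

bag = bag_of_words("President Bush visited the White House; the president spoke.")
```

Because of case folding, ‘President’ and ‘president’ are counted as the same word, and two texts with the same words in a different order yield the same bag.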

Example Bag of Words

Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday.

a, ancient, announced, compelling, crystals, derived, earth, evidence, fell, found, from (2×), have, in, life, magnetic, mars, meteorite, microscopic, monday, nasa, new, of, . . .

What is this about?

added, al, an, and, ballots, been, completed, count, county (2×), even, former, gore, ground, had, hand, have (2×), he, if, in (2×), independent, lost, many, miami-dade, might, new, not, of, president, presidential, requested, shows, study, that, the, vice, votes, would

=

An independent study shows former Vice President Al Gore would not have added many new votes in Miami-Dade County and might even have lost ground in that county, if the hand count of presidential ballots he requested had been completed.

Boolean Retrieval

Boolean operators are: AND (NEAR), OR, NOT

The semantics of the Boolean operators:
- $t_1 \text{ AND } t_2 = \{d \mid t_1 \in r(d)\} \cap \{d \mid t_2 \in r(d)\}$
  Documents whose representation contains $t_1$ and $t_2$
- $t_1 \text{ OR } t_2 = \{d \mid t_1 \in r(d)\} \cup \{d \mid t_2 \in r(d)\}$
  Documents whose representation contains $t_1$ or $t_2$
- $\text{NOT } t_1 = \{d \mid t_1 \notin r(d)\}$
  Documents whose representation doesn’t contain $t_1$
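
These set definitions translate almost directly into code. The sketch below assumes that r(d) is simply the set of terms of document d, and uses a hypothetical three-document collection. The function names mirror the operators and are chosen for the example only.

```python
# Hypothetical toy collection: r(d) is modelled as the set of terms of document d.
representations = {
    "d1": {"president", "bush", "congress"},
    "d2": {"president", "bill", "clinton"},
    "d3": {"bill", "bush", "president"},
}

def docs_with(term):
    """{d | term in r(d)}"""
    return {d for d, terms in representations.items() if term in terms}

def AND(t1, t2):
    """Intersection of the two document sets."""
    return docs_with(t1) & docs_with(t2)

def OR(t1, t2):
    """Union of the two document sets."""
    return docs_with(t1) | docs_with(t2)

def NOT(t):
    """{d | t not in r(d)}: the complement, taken within the collection."""
    return set(representations) - docs_with(t)
```

Note that NOT is only well defined relative to a fixed collection, since the complement has to be taken with respect to some set of documents.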

Boolean Retrieval

Information need: President Bill Clinton

Boolean query: clinton AND (bill OR president)

[Figure: diagram of the term sets bill, clinton, and president]

Boolean Retrieval in Action

1. President George W. Bush on Tuesday makes his first address to a joint session of Congress and has promised a “to the point” speech laying out his plans for tax cuts and spending priorities.

2. While he was still president, Bill Clinton telephoned the chief executive of television network CBS seeking to help two old friends in a million-dollar billing dispute, The Wall Street Journal reported in its online edition Tuesday.

3. The White House press office did return calls seeking President Bush’s position on the bill, but Bell, from the national partnership, said she is optimistic that paid family leave will become a reality under a Republican administration.
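
As a rough check, the query clinton AND (bill OR president) can be run against excerpts of these three documents. The sketch assumes a plain case-folded bag-of-words representation without stemming; the helper names are hypothetical and the texts are shortened to their opening clauses.

```python
import re

# Opening clauses of the three documents above (shortened for the example).
docs = {
    1: "President George W. Bush on Tuesday makes his first address "
       "to a joint session of Congress",
    2: "While he was still president, Bill Clinton telephoned the chief "
       "executive of television network CBS",
    3: "The White House press office did return calls seeking "
       "President Bush's position on the bill",
}

def terms(text):
    """Representation of a document: the set of its case-folded tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

index = {doc_id: terms(text) for doc_id, text in docs.items()}

def matching(term):
    """Ids of the documents whose representation contains `term`."""
    return {doc_id for doc_id, ts in index.items() if term in ts}

# clinton AND (bill OR president)
result = matching("clinton") & (matching("bill") | matching("president"))
```

Only document 2 satisfies the query. Document 3 does match the term bill, but there the word refers to legislation rather than to a person; a Boolean model over bags of words cannot tell the two senses apart.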

Zipf’s Law

[Figure: plot of no. occurrences against words (sorted by freq.)]

- only a few words occur many times
- a lot of words occur only once (hapax legomena)
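
Both observations can be checked by counting words and sorting them by frequency. The sketch below does this for the single sentence from the ‘What is this about?’ slide. One sentence is far too little text to show the full distribution, so this only illustrates the bookkeeping; the tokenization rule (letters, with internal hyphens kept) is an assumption made for the example.

```python
import re
from collections import Counter

text = (
    "An independent study shows former Vice President Al Gore would not have "
    "added many new votes in Miami-Dade County and might even have lost ground "
    "in that county, if the hand count of presidential ballots he requested "
    "had been completed."
)

# Case-fold, then keep runs of letters; 'miami-dade' stays one token.
counts = Counter(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))

ranked = counts.most_common()  # (word, frequency) pairs, most frequent first
hapaxes = sorted(word for word, freq in counts.items() if freq == 1)
```

Even in this tiny sample, 34 of the 37 distinct words are hapax legomena, and no word occurs more than twice.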

Searching the Collection

Finding a word by linear search can be inefficient
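
The slides break off at this point, so what follows is an assumption about one common remedy rather than the authors’ own continuation: build a lookup table from words to documents once, and answer each query from that table. The toy collection and function names are hypothetical.

```python
# Hypothetical toy collection: document id -> list of tokens.
docs = {
    1: ["president", "bush", "congress"],
    2: ["president", "bill", "clinton"],
    3: ["president", "bush", "bill"],
}

def linear_search(word):
    """Scan every document for `word`: cost grows with the collection size."""
    return {doc_id for doc_id, tokens in docs.items() if word in tokens}

def build_index(collection):
    """Map each word to the set of documents containing it (built once)."""
    index = {}
    for doc_id, tokens in collection.items():
        for token in tokens:
            index.setdefault(token, set()).add(doc_id)
    return index

index = build_index(docs)

def indexed_search(word):
    """A single dictionary lookup instead of a scan over all documents."""
    return index.get(word, set())
```

Both functions return the same documents; the difference is that the indexed version pays the scanning cost once, when the table is built, instead of on every query.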
