• No results found

Information Retrieval

N/A
N/A
Protected

Academic year: 2021

Share "Information Retrieval"

Copied!
95
0
0

Loading.... (view fulltext now)

Full text

(1)

Information Retrieval

Slides are adapted from

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Raymond J. Mooney

Simone Teufel and Ronan Cummins

(2)
(3)

Information Retrieval

Information Retrieval (IR) is

finding material

(usually documents) of

an

unstructured

nature (usually text) that satisfies an

information

need

from within

large collections

(usually stored on computers).

– These days we frequently think first of

web search

, but there are many other

cases:

• E-mail search

• Searching your laptop

• Corporate knowledge bases

• Legal information retrieval

3

(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)

Unstructured (text) vs. structured (database) data in the

mid-nineties

11

(12)

Unstructured (text) vs. structured (database) data today

12

(13)
(14)
(15)
(16)
(17)
(18)

What other organisations have this mission?

• Libraries ?

• Scopus, Web of Science, … ?

• Twitter / Facebook ?

• Netflix ?

• Amazon ?

• iTunes / Spotify ?

• Medium ?

(19)
(20)

how trap mice alive

The classic search model

Collection User task Info need Query Results Search engine Query refinement

Get rid of mice in a

politically correct way

Info about removing mice

without killing them

Misconception?

Misformulation?

Search

(21)
(22)
(23)
(24)
(25)
(26)
(27)
(28)
(29)
(30)
(31)
(32)

Memex (Wikipedia)

• Memex is the name of the hypothetical electromechanical

device that

Vannevar Bush

described in his 1945 article "

As We

May Think

". Bush envisioned the memex as a device in which

individuals would compress and store all of their books,

records, and communications, "mechanized so that it may be

consulted with exceeding speed and flexibility". The individual

was supposed to use the memex as automatic personal

filing

system

, making the memex "an enlarged intimate supplement

to his memory".

[1]

• The concept of the memex influenced the development of

early

hypertext

systems, eventually leading to the creation of

the

World Wide Web

, and

personal knowledge

base

software.

[2]

The hypothetical implementation depicted by

Bush for the purpose of concrete illustration was based upon a

document bookmark list of static

microfilm

pages and lacked a

true hypertext system, where parts of pages would have

(33)
(34)

34

History of IR

• 1960-70’s:

• Initial exploration of text retrieval systems for “small” corpora of scientific

abstracts, and law and business documents.

• Development of the basic Boolean and vector-space models of retrieval.

• Prof. Salton and his students at Cornell University are the leading researchers

in the area.

(35)

35

IR History Continued

• 1980’s:

• Large document database systems, many run by companies:

• Lexis-Nexis

• Dialog

• MEDLINE

(36)

36

IR History Continued

• 1990’s:

• Searching FTPable documents on the Internet

• Archie

• WAIS

• Searching the World Wide Web

• Lycos

• Yahoo

• Altavista

(37)

37

IR History Continued

• 1990’s continued:

• Organized Competitions

• NIST TREC

• Recommender Systems

• Ringo

• Amazon

• NetPerceptions

• Automated Text Categorization & Clustering

(38)

38

IR History Continued

• 2000’s

• Link analysis for Web Search

• Google

• Automated Information Extraction

• Parallel Processing

• Map/Reduce

• Question Answering

• TREC Q/A track

(39)

39

IR History Continued

• 2000’s continued:

• Multimedia IR

• Image

• Video

• Audio and music

• Cross-Language IR

• DARPA Tides

• Document Summarization

• Learning to Rank

(40)

Recent IR History

• 2010’s

• Intelligent Personal Assistants

• Siri

• Cortana

• Google Now

• Alexa

• Complex Question Answering

• IBM Watson

• Distributional Semantics

• Deep Learning

40

(41)
(42)
(43)
(44)
(45)
(46)
(47)
(48)
(49)
(50)
(51)
(52)

52

Information Retrieval

(IR)

• The indexing and retrieval of textual documents.

• Searching for pages on the World Wide Web

• Concerned firstly with retrieving relevant documents to a query.

• Concerned secondly with retrieving from large sets of documents

efficiently.

(53)

53

Typical IR Task

Given:

A corpus of textual natural-language documents.

A user query in the form of a textual string.

Find:

A ranked set of documents that are relevant to the query.

(54)

54

IR System

IR

System

Query

String

Document

corpus

Ranked

Documents

1. Doc1 2. Doc2 3. Doc3 . .

(55)
(56)

56

Relevance

• Relevance is a subjective judgment and may include:

• Being on the proper subject.

• Being timely (recent information).

• Being authoritative (from a trusted source).

• Satisfying the goals of the user and his/her intended use of the information

(information need).

(57)
(58)
(59)
(60)
(61)
(62)
(63)
(64)

64

Keyword Search

• Simplest notion of relevance is that the query string appears verbatim

in the document.

• Slightly less strict notion is that the words in the query appear

frequently in the document, in any order (bag of words).

(65)

65

Problems with Keywords

• May not retrieve relevant documents that include

synonymous terms.

• “restaurant” vs. “café”

• “PRC” vs. “China”

• May retrieve irrelevant documents that include

ambiguous terms.

• “bat” (baseball vs. mammal)

• “Apple” (company vs. fruit)

• “bit” (unit of data vs. act of eating)

(66)

66

Beyond Keywords

• We will cover the basics of keyword-based IR,

but…

• We will focus on extensions and recent

developments that go beyond keywords.

• We will cover the basics of building an efficient IR

system, but…

• We will focus on basic capabilities and algorithms

rather than systems issues that allow scaling to

industrial size databases.

(67)

67

Intelligent IR

• Taking into account the meaning of the words used.

• Taking into account the order of words in the query.

• Adapting to the user based on direct or indirect feedback.

• Taking into account the authority of the source.

(68)

68

IR System Architecture

Text

Database

Database

Manager

Indexing

Index

Query

Operations

Searching

Ranking

Ranked

Docs

User

Feedback

Text Operations

User Interface

Retrieved

Docs

User

Need

Text

Query

Logical View

Inverted

file

(69)

69

IR System Components

• Text Operations

forms index words (tokens).

• Stopword removal

• Stemming

• Indexing

constructs an inverted index of word to

document pointers.

• Searching

retrieves documents that contain a given

query token from the inverted index.

• Ranking

scores all retrieved documents according to

a relevance metric.

(70)

70

IR System Components (continued)

• User Interface

manages interaction with the user:

• Query input and document output.

• Relevance feedback.

• Visualization of results.

• Query Operations

transform the query to improve retrieval:

• Query expansion using a thesaurus.

• Query transformation using relevance feedback.

(71)

71

Web Search

• Application of IR to HTML documents on the

World Wide Web.

• Differences:

• Must assemble document corpus by spidering the

web.

• Can exploit the structural layout information in HTML

(XML).

• Documents change uncontrollably.

• Can exploit the link structure of the web.

(72)

72

Web Search System

Query

String

IR

System

Ranked

Documents

1. Page1 2. Page2 3. Page3 . .

Document

corpus

Web

Spider

(73)

73

Other IR-Related Tasks

• Automated document categorization

• Information filtering (spam filtering)

• Information routing

• Automated document clustering

• Recommending information or products

• Information extraction

• Information integration

• Question answering

(74)
(75)
(76)
(77)
(78)
(79)
(80)
(81)
(82)
(83)
(84)
(85)
(86)
(87)
(88)

88

Related Areas

• Database Management

• Library and Information Science

• Artificial Intelligence

• Natural Language Processing

• Machine Learning

(89)

89

Database Management

• Focused on structured data stored in relational

tables rather than free-form text.

• Focused on efficient processing of well-defined

queries in a formal language (SQL).

• Clearer semantics for both data and queries.

• Recent move towards semi-structured data (XML)

brings it closer to IR.

(90)

90

Library and Information Science

• Focused on the human user aspects of

information retrieval (human-computer

interaction, user interface, visualization).

• Concerned with effective categorization of human

knowledge.

• Concerned with citation analysis and

bibliometrics (structure of information).

• Recent work on digital libraries brings it closer to

CS & IR.

(91)

91

Artificial Intelligence

• Focused on the representation of knowledge,

reasoning, and intelligent action.

• Formalisms for representing knowledge and

queries:

• First-order Predicate Logic

• Bayesian Networks

• Recent work on web ontologies and intelligent

information agents brings it closer to IR.

(92)

92

Natural Language Processing

• Focused on the syntactic, semantic, and pragmatic analysis of natural

language text and discourse.

• Ability to analyze syntax (phrase structure) and semantics could allow

retrieval based on meaning rather than keywords.

(93)

93

Natural Language Processing:

IR Directions

• Methods for determining the sense of an ambiguous word based on

context (word sense disambiguation).

• Methods for identifying specific pieces of information in a document

(information extraction).

• Methods for answering specific NL questions from document corpora

or structured data like FreeBase or Google’s Knowledge Graph.

(94)

94

Machine Learning

• Focused on the development of computational

systems that improve their performance with

experience.

• Automated classification of examples based on

learning concepts from labeled training examples

(supervised learning).

• Automated methods for clustering unlabeled

examples into meaningful groups (unsupervised

learning).

(95)

95

Machine Learning:

IR Directions

• Text Categorization

• Automatic hierarchical classification (Yahoo).

• Adaptive filtering/routing/recommending.

• Automated spam filtering.

• Text Clustering

• Clustering of IR query results.

• Automatic formation of hierarchies (Yahoo).

• Learning for Information Extraction

• Text Mining

• Learning to Rank

References

Related documents

Design Principles for Protection Mechanisms • Least privilege • Economy of mechanism • Complete mediation • Open design • Separation of privilege • Least common mechanism

After a review of the literature, it seems that the specific immunodeficiency pathway of STAT3 HIES exposes STAT3 HIES patients to Ureaplasma lung infections because

¾ Present information, ideas and results of work using a variety of communications technologies (e.g. multimedia presentations, web pages, videotapes, desktop published

The objectives of this study are to (1) map the key characteristics relevant to the interpretation of figures on health and healthcare, and (2) to develop a Figure

The aims of this study are to examine rear-seated adult passenger mortality with respect to 1) select driver, pas- senger, vehicle, and crash characteristics hypothesized to

In a pilot experiment and to achieve suitable conditions for the synthesis of 2-amino - 4 H -chromene, various reaction conditions have been investigated in the reaction of

Discontent with Gold Coast cocoa price policy stimulated support for Togolese reunification, but the lack of economie Integration between British and French Togo weakened Ewe resolve

Since many children with Down syndrome can attain functional literacy skills in inclusive education (Bochner, Outhred 8c Pieterse, 2001) and the finding that language