Search and You Shall Find - and Teach Us All


Marius Paşca

Google Inc.

Second Kyoto Workshop, January 2011, Gifu, Japan

(2)

Unweaving the World Wide Web of Facts

• The Web is a repository of implicitly-encoded human knowledge

– some text fragments contain easier-to-extract knowledge

• More knowledge leads to better answers

– acquire facts from a fraction of the knowledge on the Web

– exploit available facts during search

• Open-domain information extraction

– extract knowledge (facts, relations) applicable to a wide range of domains, rather than to a closed, pre-defined set (e.g., medical, financial)

– no need to specify the set of concepts and relations of interest in advance

(3)

Instances, Classes and Attributes

• A concept (class) is a placeholder for a set of instances (objects) that share similar properties

– set of instances

• {matrix, kill bill, ice age, pulp fiction, cidade de deus,...}

– class label

• movies, films

– definition

• a series of pictures projected on a screen in rapid succession with objects shown in successive positions slightly changed so as to produce the optical effect of a continuous picture in which the objects move (Merriam-Webster)

• a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement (WordNet)

(4)

Instances, Classes and Attributes

• Attributes capture the types of facts that are relevant for a given instance or class

– relevant properties extracted from a text collection for a given class (e.g., stealth factor and top speed for SportsCar, or best-selling album and drummer for MusicBand, or author and genre for Book)

– as an alternative to manually pre-specifying relevant relations of a class (e.g., Currency-CurrencyOf-Country, or City-BirthPlaceOf-Actor)

• Applications

– augment results of search queries (zr1, black eyed peas, la sombra del viento) with class attributes and/or facts

– structured-search interfaces

– semantic query refinements
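
A minimal sketch of how the first application might look, assuming classes are stored together with their instances and extracted attributes. The names SemanticClass and attributes_for_query are hypothetical, and the pairing of each example query with a class is an assumption read off the examples on this slide rather than something the slides state:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SemanticClass:
    """A class label with its known instances and extracted attributes."""
    label: str
    instances: Set[str] = field(default_factory=set)
    attributes: List[str] = field(default_factory=list)


def attributes_for_query(query: str,
                         classes: List[SemanticClass]) -> Dict[str, List[str]]:
    """Return the attributes of every class that lists the query as an instance."""
    key = query.strip().lower()
    return {c.label: c.attributes for c in classes if key in c.instances}


# Assumed instance-to-class assignments, for illustration only.
CLASSES = [
    SemanticClass("SportsCar", {"zr1"}, ["stealth factor", "top speed"]),
    SemanticClass("MusicBand", {"black eyed peas"}, ["best-selling album", "drummer"]),
    SemanticClass("Book", {"la sombra del viento"}, ["author", "genre"]),
]

print(attributes_for_query("black eyed peas", CLASSES))
```

An exact, lowercased lookup keeps the sketch short; matching instances inside longer queries would need the phrase matching used for extraction.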

(5)

Sources of Open-Domain Information

• Human-compiled knowledge resources

– resources created by experts

– resources created collaboratively by non-experts

• Sources of textual data

– semi-structured text

– unstructured text

(6)

Expert Resources: Cyc

Collections and individuals

– collections correspond to classes (concepts)

– individuals correspond to instances

– collections have instances; individuals cannot have instances

Attributes

(7)

Non-Expert Resources: Wikipedia

Wikipedia infobox

(8)

Documents

Semi-structured text Unstructured text

(9)

Documents

Semi-structured text Semi-structured text

(10)

Beyond Documents

(11)

Characteristics of Documents vs. Queries

                     Data Source
Characteristic       Document Sentences    Queries
Type of medium       text                  text
Purpose              convey info.          request info.
Available context    surrounding text      self-contained
Average quality      high (varies)         low
Grammatical style    natural language      bag of keywords
Average length       25 words or more      2-3 words

(12)

(13)

Extraction from Queries: Instances

• Input

– target classes, available as small sets of seed instances

• e.g., {phentermine, viagra, vicodin, vioxx, xanax} for Drug

• Data source

– anonymized search queries along with frequencies

• Output

– ranked (longer) lists of instances, one per class

• e.g., [viagra, phentermine, vicodin, xanax, vioxx, ambien, adderall,

(14)

Instance Extraction

side effects of generic birth control pills
can low blood pressure make you tired
prescription vicodin online
long term xanax use
propanolol and vicodin interaction
causes of low blood pressure during pregnancy
taking lipitor during pregnancy
buy xanax in uk
taking beta blockers during pregnancy
how oxycontin works
long term lamictal use
propanolol and lamictal interaction
is xanax habit forming
can lamictal be crushed
how does lipitor work
effect of beta blockers on exercise
does vioxx cause weight gain
can phentermine make you tired
buy vioxx in uk
side effects of viagra pills
long term vioxx use
prescription lamictal online
long term xanax use
what effects does low blood pressure have
buy lipitor online

• Identify queries that contain a seed instance

(15)

Instance Extraction

• Collect query templates

(16)

Instance Extraction

• Identify queries that match the query templates

– collect and rank large pool of candidate instances

Query templates:

– prefix [long term], postfix [use]

– prefix [buy], postfix [in uk]

– prefix [can], postfix [make you tired]

Candidate instances: xanax, lamictal, vioxx, low blood pressure
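
The three steps above (find queries containing a seed instance, collect the surrounding prefix and postfix as templates, then match the templates against all queries) can be sketched roughly as follows. This is an illustration under assumptions, not the author's implementation: the function names are hypothetical, the query frequencies are invented, and scoring candidates by summed query frequency is a placeholder, since the slides do not say how the pool is ranked.

```python
from collections import Counter


def split_around(query, phrase):
    """Return (prefix, postfix) word tuples around the first whole-word
    occurrence of phrase in query, or None if phrase does not occur."""
    q, p = query.split(), phrase.split()
    for i in range(len(q) - len(p) + 1):
        if q[i:i + len(p)] == p:
            return tuple(q[:i]), tuple(q[i + len(p):])
    return None


def collect_templates(queries, seeds):
    """Keep the text around each seed instance as a (prefix, postfix) template."""
    templates = set()
    for query in queries:
        for seed in seeds:
            parts = split_around(query, seed)
            # Skip queries that consist of the seed alone (empty template).
            if parts is not None and (parts[0] or parts[1]):
                templates.add(parts)
    return templates


def extract_candidates(queries, templates):
    """Whatever fills a template slot becomes a candidate instance.
    Summing query frequencies is a placeholder scoring assumption."""
    scores = Counter()
    for query, freq in queries.items():
        tokens = tuple(query.split())
        for prefix, postfix in templates:
            end = len(tokens) - len(postfix)
            if (len(tokens) > len(prefix) + len(postfix)
                    and tokens[:len(prefix)] == prefix
                    and tokens[end:] == postfix):
                scores[" ".join(tokens[len(prefix):end])] += freq
    return scores


# Queries are taken from the slides; the frequencies are invented.
QUERIES = {
    "long term xanax use": 3,
    "long term lamictal use": 2,
    "buy xanax in uk": 1,
    "buy vioxx in uk": 1,
    "can phentermine make you tired": 1,
    "can low blood pressure make you tired": 2,
    "how does lipitor work": 1,
}
SEEDS = {"phentermine", "viagra", "vicodin", "vioxx", "xanax"}

templates = collect_templates(QUERIES, SEEDS)
for instance, score in extract_candidates(QUERIES, templates).most_common():
    print(instance, score)
```

Matching on whole words rather than raw substrings keeps a seed such as "xanax" from firing inside a longer token. Seeds reappear among the candidates, which agrees with the example output list starting with the seed drugs.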

(17)

Output Instances

Class: Top Extracted Instances

VideoGame: [grand theft auto, warcraft, need for speed, quake, super mario bros., gta, world of warcraft, doom, need for speed underground, ...]

University: [university of chicago, stanford university, university of texas at austin, columbia university, university of pennsylvania, ...]

Person: [leonardo da vinci, rembrandt, andy warhol, pablo picasso, vincent van gogh, salvador dali, van gogh, frida kahlo, picasso, ...]

Newspaper: [new york times, le monde, washington post, usa today, wall street journal, ny times, chicago tribune, boston globe, toronto star, ...]

(18)

Extraction from Queries: Attributes

• Input

– target classes, available as sets of representative instances

• e.g., {Delphi, Apple Computer, Honda, Oracle, Coca Cola, Toyota, Washington Mutual, Delta, Reuters, Target, ...} for Company

– small sets of seed attributes, one per class

• e.g., {headquarters, stock price, ceo, location, chairman} for Company

• Data source

– anonymized search queries along with frequencies

• Output

– ranked lists of attributes, one per class

• e.g., {headquarters, mission statement, stock price, ceo, cio, code of conduct, stock symbol, organizational structure, corporate address,...} for Company

(19)

Class Attribute Extraction

Target classes

Company: {Delphi, Apple Computer, Honda, Oracle, Coca Cola, Toyota, Washington Mutual, Delta, Reuters, Target,...}

Seed attributes

Company: {headquarters, stock price, ceo, location, chairman}

Query logs, with the [prefix] [infix] [postfix] around the instance and the candidate attribute

– installing toyota cressida water pump: [ ] [ ] [cressida water pump], Company:installing

– installing oracle 8.1-7 on solaris 8: [ ] [ ] [8.1-7 on solaris 8], Company:installing

– coca cola company one year stock price target: [ ] [company one year] [target], Company:stock price

– delta air lines stock price history: [ ] [air lines] [history], Company:stock price

– honda accord 1989 sei: [ ] [ ] [1989 sei], Company:accord

– new honda accord: [new] [ ] [ ], Company:accord

– where is the world headquarters for delphi corporation: [where is the world] [for] [corporation], Company:headquarters

– washington mutual new headquarters impact: [ ] [new] [impact], Company:headquarters

– mission statement for the oracle corporation: [ ] [for the] [corporation], Company:mission statement

– mission statement for delta airlines: [ ] [for] [airlines], Company:mission statement

Pool of candidate attributes

Company: {installing, stock price, accord, headquarters, mission statement,...}
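
The prefix/infix/postfix decomposition on this slide can be sketched as below. This covers only the splitting step, given a query, a class instance, and a candidate attribute; how candidates are proposed and ranked against the seed attributes is not shown on the slides and is left out. The function names are hypothetical.

```python
def find_phrase(tokens, phrase):
    """Index of the first whole-word occurrence of phrase in tokens, or -1."""
    p = phrase.split()
    for i in range(len(tokens) - len(p) + 1):
        if tokens[i:i + len(p)] == p:
            return i
    return -1


def decompose(query, instance, attribute):
    """Split query into (prefix, infix, postfix) around the instance and the
    candidate attribute, whichever comes first. Returns None if either
    phrase is missing or the two overlap."""
    tokens = query.split()
    spans = []
    for phrase in (instance, attribute):
        start = find_phrase(tokens, phrase)
        if start < 0:
            return None
        spans.append((start, start + len(phrase.split())))
    (s1, e1), (s2, e2) = sorted(spans)
    if e1 > s2:  # instance and attribute overlap
        return None
    return (" ".join(tokens[:s1]),    # text before the first phrase
            " ".join(tokens[e1:s2]),  # text between the two phrases
            " ".join(tokens[e2:]))    # text after the second phrase


print(decompose("where is the world headquarters for delphi corporation",
                "delphi", "headquarters"))
print(decompose("coca cola company one year stock price target",
                "coca cola", "stock price"))
```

Returning the three parts as plain strings makes it easy to compare the templates of a candidate such as installing with those of the seed attributes, which is one plausible way to separate noise from real attributes.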

(20)

Output Attributes

Class: Top Extracted Attributes

Wine: [vintage, color, cost, style, taste, vintage chart, pronunciation, shelf life, wine ratings, wine reviews, ...]

CellPhoneModel: [features, battery life, retail price, mobile review, specification, price list, functions, ratings, tips, tricks, ...]

CarModel: [transmission, top speed, acceleration, transmission problems, owners manual, gas mileage, towing capacity, stalling, maintenance schedule, performance parts, ...]

AircraftModel: [weight, length, history, fuel consumption, interior photos, specifications, photographs, interior pictures, seating arrangement, flight deck, ...]

(21)

Conclusion

• If knowledge is generally prominent or relevant, people will (eventually) search for it

– anonymized query logs collectively capture knowledge, through requests that may be answered by knowledge asserted in document collections

• Queries contain multiple types of knowledge

– some of them are easier to extract than others

– instances, classes, attributes, relations
