Search and You Shall Find - and Teach Us All
Marius Paşca
Google Inc.
Second Kyoto Workshop, January 2011, Gifu, Japan
Unweaving the World Wide Web of Facts
• The Web is a repository of implicitly-encoded human knowledge
– some text fragments contain easier-to-extract knowledge
• More knowledge leads to better answers
– acquire facts from a fraction of the knowledge on the Web
– exploit available facts during search
• Open-domain information extraction
– extract knowledge (facts, relations) applicable to a wide range of domains, rather than to a closed, pre-defined set (e.g., medical, financial)
– no need to specify the set of concepts and relations of interest in advance
Instances, Classes and Attributes
• A concept (class) is a placeholder for a set of instances
(objects) that share similar properties
– set of instances
• {matrix, kill bill, ice age, pulp fiction, cidade de deus,...}
– class label
• movies, films
– definition
• a series of pictures projected on a screen in rapid succession with objects shown in successive positions slightly changed so as to produce the optical effect of a continuous picture in which the objects move (Merriam-Webster)
• a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement (WordNet)
Instances, Classes and Attributes
• Attributes capture the types of facts that are relevant for a given instance or class
– relevant properties extracted from a text collection for a given class (e.g., stealth factor and top speed for SportsCar, or best-selling album and drummer for MusicBand, or author and genre for Book)
– as an alternative to manually pre-specifying relevant relations of a class (e.g., Currency-CurrencyOf-Country, or City-BirthPlaceOf-Actor)
• Applications
– augment results of search queries (zr1, black eyed peas, la sombra del viento) with class attributes and/or facts
– structured-search interfaces
– semantic query refinements
Sources of Open-Domain Information
• Human-compiled knowledge resources
– resources created by experts
– resources created collaboratively by non-experts
• Sources of textual data
– semi-structured text
– unstructured text
Expert Resources: Cyc
• Collections and individuals
– collections correspond to classes (concepts)
– individuals correspond to instances
– collections have instances; individuals cannot have instances
• Attributes
Non-Expert Resources: Wikipedia
Wikipedia infobox
Documents
Semi-structured text Unstructured text
Documents
Semi-structured text Semi-structured text
Beyond Documents
Characteristics of Documents vs. Queries
                     Data Source
Characteristic       Document Sentences    Queries
Type of medium       text                  text
Purpose              convey info.          request info.
Available context    surrounding text      self-contained
Average quality      high (varies)         low
Grammatical style    natural language      bag of keywords
Average length       25 words or more      2-3 words
Extraction from Queries: Instances
• Input
– target classes, available as small sets of seed instances
• e.g., {phentermine, viagra, vicodin, vioxx, xanax} for Drug
• Data source
– anonymized search queries along with frequencies
• Output
– ranked (longer) lists of instances, one per class
• e.g., [viagra, phentermine, vicodin, xanax, vioxx, ambien, adderall,
Instance Extraction
side effects of generic birth control pills
can low blood pressure make you tired
prescription vicodin online
long term xanax use
propanolol and vicodin interaction
causes of low blood pressure during pregnancy
taking lipitor during pregnancy
buy xanax in uk
taking beta blockers during pregnancy
how oxycontin works
long term lamictal use
propanolol and lamictal interaction
is xanax habit forming
can lamictal be crushed
how does lipitor work
effect of beta blockers on exercise
does vioxx cause weight gain
can phentermine make you tired
buy vioxx in uk
side effects of viagra pills
long term vioxx use
prescription lamictal online
long term xanax use
what effects does low blood pressure have
buy lipitor online
• Identify queries that contain a seed instance
Instance Extraction
• Collect query templates
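The first two steps (find queries that contain a seed instance, then keep what surrounds the seed as a template) can be sketched as follows. This is a minimal illustration, not the system behind the slides: the function names, whitespace tokenization and the query frequencies are assumptions made for the example.

```python
from collections import Counter


def find_phrase(tokens, phrase_tokens):
    """Return the start index of the first occurrence of phrase_tokens, or -1."""
    n = len(phrase_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            return i
    return -1


def collect_templates(query_freqs, seeds):
    """Collect (prefix, postfix) templates from queries that contain a seed.

    query_freqs maps a query string to its frequency (hypothetical values).
    Each template is weighted by the frequency of the queries that produced it.
    """
    templates = Counter()
    for query, freq in query_freqs.items():
        tokens = query.split()
        for seed in seeds:
            seed_tokens = seed.split()
            start = find_phrase(tokens, seed_tokens)
            if start < 0:
                continue
            prefix = " ".join(tokens[:start])
            postfix = " ".join(tokens[start + len(seed_tokens):])
            # Skip the empty template (query == seed); it would match anything.
            if prefix or postfix:
                templates[(prefix, postfix)] += freq
    return templates


# Seeds for Drug from the slides; frequencies are made up for illustration.
seeds = {"phentermine", "viagra", "vicodin", "vioxx", "xanax"}
query_freqs = {
    "long term xanax use": 2,
    "buy xanax in uk": 1,
    "can phentermine make you tired": 1,
    "how oxycontin works": 1,
    "buy lipitor online": 1,
}
templates = collect_templates(query_freqs, seeds)
```

Queries without a seed, such as "how oxycontin works", contribute no template at this stage.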
Instance Extraction
• Identify queries that match the query templates
– collect and rank large pool of candidate instances
– query templates:
• prefix [long term], postfix [use]
• prefix [buy], postfix [in uk]
• prefix [can], postfix [make you tired]
– candidate instances from matching queries: xanax, lamictal, vioxx, low blood pressure
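The matching step can be sketched in the same spirit. The slides say only that candidates are collected and ranked; the ranking used here (number of distinct templates matched, then total query frequency) is an assumption chosen for simplicity, and the function names and frequencies are hypothetical.

```python
from collections import defaultdict


def match_template(tokens, prefix, postfix):
    """Return the phrase filling the slot between prefix and postfix, or None."""
    p, s = prefix.split(), postfix.split()
    end = len(tokens) - len(s)
    if end <= len(p):  # no room left for a non-empty candidate
        return None
    if tokens[:len(p)] != p or tokens[end:] != s:
        return None
    return " ".join(tokens[len(p):end])


def rank_candidates(query_freqs, templates):
    """Collect candidates matched by any template and rank them.

    Assumed scoring: more distinct templates first, then higher total
    frequency, then alphabetical order to break ties.
    """
    matched = defaultdict(set)
    total_freq = defaultdict(int)
    for query, freq in query_freqs.items():
        tokens = query.split()
        for template in templates:
            candidate = match_template(tokens, *template)
            if candidate is not None:
                matched[candidate].add(template)
                total_freq[candidate] += freq
    return sorted(matched, key=lambda c: (-len(matched[c]), -total_freq[c], c))


templates = [("long term", "use"), ("buy", "in uk"), ("can", "make you tired")]
query_freqs = {  # frequencies are made up for illustration
    "long term xanax use": 2,
    "buy xanax in uk": 1,
    "long term lamictal use": 1,
    "long term vioxx use": 1,
    "buy vioxx in uk": 1,
    "can low blood pressure make you tired": 1,
    "can phentermine make you tired": 1,
    "can lamictal be crushed": 1,
    "buy lipitor online": 1,
}
ranked = rank_candidates(query_freqs, templates)
```

Seeds are kept in the ranked list on purpose, since the example output on the slides also contains them. A phrase like "low blood pressure" is picked up even though it is not a drug, which is why ranking matters.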
Output Instances
Class         Top Extracted Instances
VideoGame     [grand theft auto, warcraft, need for speed, quake, super maro bros., gta, world of warcraft, doom, need for speed underground, ...]
University    [university of chicago, stanford university, universty of texas at austin, columbia university, university of pennsylvania, ...]
Person        [leonardo da vinci, rembrandt, andy warhol, pablo picasso, vincent van gogh, salvador dali, van gogh, frida kahlo, picasso, ...]
Newspaper     [new york times, le monde, washington post, usa today, wall street journal, ny times, chicago tribune, boston globe, toronto star, ...]
Extraction from Queries: Attributes
• Input
– target classes, available as sets of representative instances
• e.g., {Delphi, Apple Computer, Honda, Oracle, Coca Cola, Toyota,
Washington Mutual, Delta, Reuters, Target, ...} for Company
– small sets of seed attributes, one per class
• e.g., {headquarters, stock price, ceo, location, chairman} for Company
• Data source
– anonymized search queries along with frequencies
• Output
– ranked lists of attributes, one per class
• e.g., {headquarters, mission statement, stock price, ceo, cio, code of conduct, stock symbol, organizational structure, corporate address,...} for Company
Class Attribute Extraction
Target classes
Company: {Delphi, Apple Computer, Honda, Oracle, Coca Cola, Toyota, Washington Mutual, Delta, Reuters, Target,...}
Seed attributes
Company: {headquarters, stock price, ceo, location, chairman}
Query logs
Query                                                   Template: prefix infix postfix
installing toyota cressida water pump                   [ ] [ ] [cressida water pump]
installing oracle 8.1-7 on solaris 8                    [ ] [ ] [8.1-7 on solaris 8]
coca cola company one year stock price target           [ ] [company one year] [target]
delta air lines stock price history                     [ ] [air lines] [history]
honda accord 1989 sei                                   [ ] [ ] [1989 sei]
new honda accord                                        [new] [ ] [ ]
where is the world headquarters for delphi corporation  [where is the world] [for] [corporation]
washington mutual new headquarters impact               [ ] [new] [impact]
mission statement for the oracle corporation            [ ] [for the] [corporation]
mission statement for delta airlines                    [ ] [for] [airlines]
Candidate attributes
Company:installing
Company:stock price
Company:accord
Company:headquarters
Company:mission statement
Pool of candidate attributes
Company: {installing, stock price, accord, headquarters, mission statement,...}
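To make the prefix/infix/postfix templates concrete, the sketch below splits a query around a class instance and a candidate attribute. It is only an illustration under stated assumptions: the helper names are hypothetical, tokenization is whitespace-based, and the candidate attribute is given as input rather than discovered, since the slides do not say how candidate boundaries are chosen.

```python
def find_span(tokens, phrase):
    """Return (start, end) token offsets of the first occurrence of phrase, or None."""
    p = phrase.split()
    for i in range(len(tokens) - len(p) + 1):
        if tokens[i:i + len(p)] == p:
            return i, i + len(p)
    return None


def decompose(query, instance, attribute):
    """Split a query into (prefix, infix, postfix) around instance and attribute.

    The instance and the attribute may occur in either order.
    Returns None if either is missing or if the two spans overlap.
    """
    tokens = query.split()
    inst = find_span(tokens, instance)
    attr = find_span(tokens, attribute)
    if inst is None or attr is None:
        return None
    first, second = sorted([inst, attr])
    if first[1] > second[0]:
        return None  # overlapping spans
    prefix = " ".join(tokens[:first[0]])
    infix = " ".join(tokens[first[1]:second[0]])
    postfix = " ".join(tokens[second[1]:])
    return prefix, infix, postfix


t1 = decompose("where is the world headquarters for delphi corporation",
               "delphi", "headquarters")
t2 = decompose("coca cola company one year stock price target",
               "coca cola", "stock price")
t3 = decompose("new honda accord", "honda", "accord")
```

Grouping these templates per candidate attribute gives each candidate a signature that could then be compared with the signatures of the seed attributes; that comparison step is not shown here.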
Output Attributes
Class             Top Extracted Attributes
Wine              [vintage, color, cost, style, taste, vintage chart, pronunciation, shelf life, wine ratings, wine reviews, ...]
CellPhoneModel    [features, battery life, retail price, mobile review, specification, price list, functions, ratings, tips, tricks, ...]
CarModel          [transmission, top speed, acceleration, transmission problems, owners manual, gas mileage, towing capacity, stalling, maintenance schedule, performance parts, ...]
AircraftModel     [weight, length, history, fuel consumption, interior photos, specifications, photographs, interior pictures, seating arrangement, flight deck, ...]
Conclusion
• If knowledge is generally prominent or relevant, people will (eventually) search for it
→ anonymized query logs collectively capture knowledge, through requests that may be answered by knowledge asserted in document collections
• Queries contain multiple types of knowledge
– some of them are easier to extract than others