
Generating Query Substitutions


Academic year: 2022


(1)

Generating Query Substitutions

Rosie Jones

Benjamin Rey, Omid Madani and Wiley Greiner

Yahoo! Research

(2)

Substitutes

(3)

Query Substitutions

(4)

Query Substitutions

(5)

Functions of Rewriting

• Enhance meaning

– Spell correction

– Corpus-appropriate terminology

• Cat cancer → feline cancer

• Change meaning

– Narrow 

• [ lexical entailment: fruit → apple]

– Broaden 

• [ alternatives, common interests]

• Conference proceedings → textbooks

(6)

Trying to Find Nathan Welsh, who lives and works in Edinburgh

Queries:

• nathan welsh edinburg scotland
• nathan welsh edinburgh scotland
• financial consultants edinburg scotland
• financial consultants edinburgh scotland
• financial consultants
• nathan welsh 16-18 pennwell place edinburgh
• nathan welsh 16-18 pennywell place edinburgh
• international phone directory
• white pages
• edinburgh scotland phone directory
• edinburgh scotland uk
• nathan welsh investment consultant edinburg
• nathan welsh investment consultant edinburgh
• investment consultants edinburgh scotland
• nathan welsh

Annotations:

• Spell correction
• Name → profession
• Spell correction
• Delete terms, generalize
• Spell correction
• Try second approach, using his address
• Try looking up addresses
• rephrase
• specialization
• Generalize to location

(7)

Half of Query Pairs are Related

%       Type               Example
53.2%   non-rewrite        mic amps → create taxi
9.1%    insertions         game codes → video game codes
2.4%    other              thansgiving → dia de acconde gracias
8.7%    substitutions      john wayne bust → john wayne statue
7.0%    spell correction   real eastate → real estate
6.2%    mixture            huston's restaurant → houston's
3.2%    generalization     gm reabtes → show me all the current auto rebates
4.6%    specialization     jobs → marine employment
5.0%    deletions          skateboarding pics → skateboarding

(8)

Substitutions are repeated

• car insurance → auto insurance 

– 5086 times in a sample

• car insurance → car insurance quotes 

– 4826 times

• car insurance → geico  

[ brand of car insurance ]

–  2613 times

• car insurance → progressive auto insurance

–  1677 times

• car insurance → carinsurance  

– 428 times

     Different Users, Different Days
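The counts above come from tallying successive query pairs in a log sample. A minimal sketch of that tally follows; the `sessions` structure (one ordered list of queries per user session) and the function name are assumptions made for illustration, not the authors' pipeline.

```python
from collections import Counter

def count_rewrites(sessions):
    """Count how often each (q1, q2) pair occurs as successive queries.

    `sessions` is a hypothetical iterable of query lists, one list per
    user session, in the order the queries were issued.
    """
    pair_counts = Counter()
    for queries in sessions:
        # Pair each query with the one issued immediately after it.
        for q1, q2 in zip(queries, queries[1:]):
            if q1 != q2:  # ignore exact resubmissions
                pair_counts[(q1, q2)] += 1
    return pair_counts

# Toy sessions, invented for illustration only.
sessions = [
    ["car insurance", "auto insurance"],
    ["car insurance", "car insurance quotes", "geico"],
    ["car insurance", "auto insurance"],
]
counts = count_rewrites(sessions)
print(counts.most_common(3))
```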

(9)

Statistical Test to Find Significant  Rewrites

Test whether

$P(q_2 \mid q_1) \gg P(q_2)$

Example: P(britney spears | brittney spears) >> P(britney spears), i.e. 8% >> 0.01%

Log-likelihood ratio test (GLRT) gives a χ²-distributed score

About 90% of query pairs are related after filtering with LLR > 100
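One common way to compute such a score is the G² statistic on a 2x2 contingency table of successive-query counts. The sketch below assumes that formulation; the slides do not spell out the exact computation, and the counts in the usage lines are invented to mimic the 8% versus 0.01% contrast.

```python
import math

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 contingency table.

    k11: q1 followed by q2           k12: q1 followed by another query
    k21: another query followed by q2    k22: neither q1 first nor q2 second
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2

# Hypothetical counts: q2 follows q1 in 80 of 1000 cases (8%),
# but follows other queries only 100 times in 999,000.
strong = llr(80, 920, 100, 998900)
# Hypothetical counts where q2 is equally likely after any query.
independent = llr(10, 90, 100, 900)
print(strong > 100, independent)
```

Under this sketch, a pair is kept when its score exceeds the cutoff of 100 mentioned above.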

(10)

Many Types of Substitutable Rewrites

dog → dogs           9185   pluralization
dog → cat            5942   both instances of 'pet'
dog → dog picture    1416   more specific
dog → animals        1363   generalization -- hypernym
dog → puppy          1553   specification -- hyponym
dog → dog breeds     5567   generalization
dog → pets           1719   generalization -- hypernym
dog → 80             2420   random junk in query processing
dog → dog pictures   5292   more specific
dog → pet             920   generalization -- hypernym

(11)

Defining Categories of Relatedness for Sponsored Search

1 - Precise Match: A near-certain match. E.g.: automotive insurance - automobile insurance

2 - Approximate Match: A probable, but inexact match with user intent. E.g.: apple music player - ipod shuffle

3 - Marginal Match: A distant, but plausible match to a related topic. E.g.: glasses - contact lenses

4 - Mismatch: A clear mismatch.

We will call {1,2} Precise and {1,2,3} Broad

(12)

Increase Tail Coverage with Query Segmentation

• Query segmented using high mutual information terms

• Most frequent queries: replace whole query

• Infrequent queries: replace constituent phrases

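[Figure: segmentation of a query Q into phrases p1 p2. Whole-query substitutes are Q' and Q''; phrase-level substitutes combine replaced and original phrases: p'1 p2, p''1 p2, p1 p'2, p'1 p'2.]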
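The segmentation and phrase-replacement idea on this slide can be sketched as follows. This is a minimal illustration that assumes pointwise mutual information between adjacent words with a fixed threshold; the count tables, the threshold, and the function names are hypothetical and are not taken from the slides.

```python
import math
from itertools import product

def segment(query, unigram, bigram, total, threshold=0.0):
    """Split a query into phrases, breaking between adjacent words
    whose pointwise mutual information is at or below `threshold`."""
    words = query.split()
    if not words:
        return []
    phrases = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        joint = bigram.get((prev, cur), 0)
        if joint > 0:
            # PMI = log( P(prev, cur) / (P(prev) * P(cur)) )
            pmi = math.log(joint * total / (unigram[prev] * unigram[cur]))
        else:
            pmi = float("-inf")
        if pmi > threshold:
            phrases[-1].append(cur)   # strongly associated: same phrase
        else:
            phrases.append([cur])     # weakly associated: start a new phrase
    return [" ".join(p) for p in phrases]

def candidates(phrases, substitutes):
    """Yield rewrites formed by replacing one or more phrases."""
    options = [[p] + substitutes.get(p, []) for p in phrases]
    original = tuple(phrases)
    for combo in product(*options):
        if combo != original:
            yield " ".join(combo)

# Hypothetical counts, invented for illustration.
unigram = {"catholic": 100, "baby": 200, "names": 200}
bigram = {("catholic", "baby"): 1, ("baby", "names"): 150}
phrases = segment("catholic baby names", unigram, bigram, total=10000)
rewrites = list(candidates(phrases, {"catholic": ["christian"],
                                     "baby names": ["baby boy names"]}))
print(phrases, rewrites)
```

Whole-query replacement would simply look the full query up in the substitution table before falling back to this phrase-level path.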

(13)

Generating Query Substitutions

• Q1 → {q2, q3, q4, q5, q6}

• “catholic baby names” → {christian baby names, christian baby boy names, catholic names, …}

• All are statistically relevant (log likelihood ratio on successive queries)

Find a model to

• rank substitutions, to be able to pick the best ones

• associate a probability of correctness

$P(Q \to Q' \text{ is correct} \mid \mathrm{score}(Q \to Q'))$

$\mathrm{score}(Q \to u''_1 u_2)$

$\mathrm{score}(Q \to Q'')$

. . .

(14)

Train/Test Data

• Sample 1000 queries (q1)

• Select a single substitution for each (q2)

• Manually label the <q1,q2> pairs

• Learn to score <q1,q2> pairs

• Order by score

• Assess Precision/Recall

– Precise task {1,2} vs {3,4}

– Broad task {1,2,3} vs {4}
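The evaluation steps above can be sketched as follows: collapse the 1-4 labels into a binary target for each task, order the pairs by score, and measure precision and recall at a cutoff. The function names and the higher-is-better score convention are assumptions for illustration; the five (score, label) pairs reuse the algorithm probabilities and hand labels from the "Example Query Substitutions" slide.

```python
def binarize(label, task):
    """Map a 1-4 editorial label to True (relevant) or False."""
    positive = {"precise": {1, 2}, "broad": {1, 2, 3}}[task]
    return label in positive

def precision_recall_at_k(scored_pairs, task, k):
    """scored_pairs: list of (score, label); higher score is assumed better."""
    ranked = sorted(scored_pairs, key=lambda sl: sl[0], reverse=True)
    relevant = [binarize(label, task) for _, label in ranked]
    total_relevant = sum(relevant)
    hits = sum(relevant[:k])
    precision = hits / k
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall

# (algorithm probability, hand label) for the five example pairs.
pairs = [(0.92, 1), (0.90, 2), (0.89, 2), (0.66, 3), (0.17, 4)]
print(precision_recall_at_k(pairs, "precise", 3))
print(precision_recall_at_k(pairs, "broad", 3))
print(precision_recall_at_k(pairs, "precise", 5))
```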

(15)

Predicting High Quality Query  Suggestions

• Used labels to fit model

• Tried 37 features for model:

– Lexical features including

• Levenshtein character edit distance

• Prefix overlap

• Porter-stem

• Jaccard score on words

– Statistical features including

• Probability of rewrite

• Frequency of rewrite

– Other

• Number of substitutions (numSubst)

– Whole query = 0

– Replace one phrase = 1

– Replace two phrases = 2

• Query length, existence of sponsored results…
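Three of the lexical features named above can be sketched in a few lines. These are standard textbook definitions; whether the authors normalized them (for example, dividing edit distance by query length) is not stated on the slide, so the raw forms here are an assumption.

```python
def levenshtein(a, b):
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def prefix_overlap(a, b):
    """Number of leading characters the two queries share."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def jaccard_words(a, b):
    """Jaccard similarity between the two queries' word sets."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0

print(levenshtein("anne klien watches", "anne klein watches"))
print(prefix_overlap("anne klien watches", "anne klein watches"))
print(jaccard_words("car insurance", "auto insurance"))
```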

(16)

Baseline: Most suggestions broadly related

Method                             Precise {1,2}   Broad {1,2,3}
Random suggestion                  55%             -
Maximize LLR, minimize numSubst    66%             87.5%

(17)

Simple Decision Tree

wordsInCommon > 0?
  Yes: Class = {1,2}
  No: prefixOverlap > 0?
    Yes: Class = {1,2}
    No: Class = {3,4}

Interpretation of the decision tree:

• substitution must have at least 1 word in common with initial query

• the beginning of the query should stay unchanged
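The two-level tree can be written as a short function. This sketch reads the flattened diagram left to right, so that a shared word leads directly to class {1,2} and the prefix test applies only when no word is shared; that branch assignment is an assumption about the original figure, and the helper definitions are illustrative.

```python
def words_in_common(q1, q2):
    """Number of distinct words the two queries share."""
    return len(set(q1.split()) & set(q2.split()))

def prefix_overlap(q1, q2):
    """Number of leading characters the two queries share."""
    n = 0
    for c1, c2 in zip(q1, q2):
        if c1 != c2:
            break
        n += 1
    return n

def classify(q1, q2):
    """Return the predicted label group for the pair (q1, q2)."""
    if words_in_common(q1, q2) > 0:
        return "{1,2}"
    if prefix_overlap(q1, q2) > 0:
        return "{1,2}"
    return "{3,4}"

print(classify("car insurance", "auto insurance"))
print(classify("huston's restaurant", "houston's"))
print(classify("mic amps", "create taxi"))
```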

(18)

Linear Regression Model

Regression: continuous output in [1,4]

$\mathrm{LMScore} = \mathrm{intercept} + \sum_{f \in \mathrm{features}} w_f \cdot f$

Classification:

If(LMScore < T) then Good, else Bad

For each T, we have a precision and a recall

Evaluation:

Average precision / recall over 100 runs of 10-fold cross validation

(19)

Learned Function

$f(q_1, q_2) = 0.74 + 1.88 \times \mathrm{editDist}(q_1, q_2) + 0.71 \times \mathrm{wordDist}(q_1, q_2) + 0.36 \times \mathrm{numSubst}(q_1, q_2)$

• Outputs continuous score [1..4]

• Like decision tree

– Prefer few edits

– Prefer few word changes

– Prefer whole-query or few phrase changes

• Normalize output to a probability of correctness 
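A minimal sketch of applying the learned function and the threshold rule "LMScore < T means Good". The operators between terms did not survive in the extracted text; this sketch assumes they are all additions, which agrees with the preferences listed above (fewer edits, fewer word changes, and fewer phrase substitutions all lower the score toward label 1). How editDist and wordDist are scaled is not stated, so the inputs here are plain numbers supplied by the caller, and the threshold value is hypothetical.

```python
def lm_score(edit_dist, word_dist, num_subst):
    """Linear score; lower means closer to label 1 (precise match).

    Coefficients are those shown on the slide; the '+' signs are assumed.
    """
    return 0.74 + 1.88 * edit_dist + 0.71 * word_dist + 0.36 * num_subst

def is_good(score, threshold):
    """Classification rule from the linear-model slide: below T is Good."""
    return score < threshold

# Hypothetical feature values for two candidate rewrites.
close_rewrite = lm_score(edit_dist=0.1, word_dist=0.0, num_subst=0)
distant_rewrite = lm_score(edit_dist=0.5, word_dist=0.5, num_subst=1)
print(close_rewrite, distant_rewrite)
print(is_good(close_rewrite, 2.0), is_good(distant_rewrite, 2.0))
```

Sweeping the threshold T over the range of scores yields one precision/recall point per value, which is how a trade-off curve for the model would be traced.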

(20)

SVM, Bags of Trees, Linear Model Give Similar Trade-offs

[Figure: precision curves for four models (2-level DT, bag of 100 DTs, SVM, linear model); axis ticks run from 60% to 100% and from 0% to 100%.]

(21)

Results Reranking Query-Suggestion Pairs

Model            Breakeven   Av. Prec   Max-F1
2 level DT       0.71        0.71       0.83
Bag of 100 DTs   0.83        0.88       0.84
Baseline         0.66        0.66       0.66
SVM              0.81        0.86       0.83
Linear Model     0.80        0.87       0.84

(22)

Generate Best Candidate on New Sample: Precision at 100% Recall

Method                             Precise {1,2}   Broad {1,2,3}
Random suggestion                  55%             -
Maximize LLR, minimize numSubst    66%             87.5%
Linear regression                  74%             87.5%

(23)

Example Query Substitutions

Initial Query                     Substitution                   Hand label   Alg. Prob
anne klien watches                anne klein watches             1            92%
sea world san diego               sea world san diego tickets    2            90%
restaurants in washington dc      restaurants in washington      2            89%
frank sinatra birth certificate   elvis presley birth            4            17%
nash county                       wilson county                  3            66%

(24)

Applications

• Sponsored Search

• Query Expansion

• Assisted Search

• Lexical entailment

(25)

Other Sources of Phrase Similarity

Data Source      Co-occurrence         Distributional Similarity
Queries          Pet | dogs            dog | food; pet | food
Query Sequence   Pets → dogs           dogs → petshop; pets → petshop
Sentences        “Pets such as dogs”   “pets like their owners”; “dogs like their owners”
Documents        “… pets … dogs …”     “dogs .. owners .. food”; “pets .. owners .. food”

(26)

Sample Query Substitution Types

• meaning of dreams → interpretation of dreams (synonym)

• furniture etegere → furniture etagere (spelling correction)

• venetian hotel → venetian hotel las vegas (expansion)

• delta employment credit union → decu (acronym)

• lyrics finder → mp3 finder (related term)

• national car rental → alamo car rental (related brand)

• amanda peet → saving silverman (actress in)

(27)

Future Work

• Add other sources of similarity

• Try additional features

• IR experiments

(28)

Substitutes for Edinburgh Castle

• edinburgh castle scotland (0.86)

• edinburgh (0.79)

• stirling castle (0.76)

• dudley castle (0.56)

(29)

Substitutes for Mars Bar

• mars candy bar (0.83)

• mars candy (0.77)

• Substitutes for “deep fried mars bar”

– ?
