Enhancing Search with Predictive Analytics

(1)

Enhancing Search with Predictive Analytics

Text Analytics World – Boston 2013

Andrew Fast

Chief Scientist

Elder Research, Inc.

(2)

The Elephant Test

“It is difficult to describe, but you know it when you see it.”

Lord Justice Stuart Smith, Cadogan Estates Limited v. Morris (1998)

Likewise, most textual concepts cannot be easily defined with a single keyword query

(3)

Combining Search and Predictive Models

Search and predictive modeling each provide a different trade-off between power and generality.

Keyword queries can answer any query, but with limited depth for complex queries.

A predictive model can answer one query well, especially a complex query

[Figure: Keyword Search and Document Classification placed along Generality and Power axes]

(4)

Our Approach

A “search ensemble” ranking function that “boosts” keyword relevance based on a predictive model

[Figure: quadrant chart with axes Model Ranking and Keyword Relevance. Quadrants: High Keyword Relevance, High Model Ranking; High Keyword Relevance, Low Model Ranking; Low Keyword Relevance, Low Model Ranking; Low Keyword Relevance, High Model Ranking]

(5)

The Problem

The Goal: Explore NEW interesting ideas using OLD social entrepreneurship contest entries

The Data: A collection of contest entries from 19 different contests sponsored by our client

Contests cover a range of topics such as health, education, literacy, finance, technology, and geo-tourism.

The Challenge: Emphasize high-quality entries in

(6)

Combining Search and Predictive Models

Keyword ranking does not help you find high-quality entries …

… but Model Ranking is not topic centric.

Complementary strengths

Search for exploration and discovery

(7)
(8)

Target Variable

Identify characteristics of past entries that are correlated with the entry being ‘Shortlisted’ by the Contest Judges

Rankings:

1 – Likely Finalist

2 – Top Tier

3 – Honorable Mention

4 – Passed Screening

5 – No

Note: Not every contest used all 5 rankings

(9)

The Inputs

Learn a logistic regression model to fit the feature weights

Inputs:

Structured Data: Budget Size, Maturity, Impact

Taxonomy: Auto-tagging taxonomy terms

Textual Features: Length, Lexical Diversity
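A minimal sketch of how the textual inputs and the model could fit together. It assumes lexical diversity means the type-token ratio, log-scales length only to keep the fitting well behaved, and uses plain gradient descent as a stand-in for whatever fitting routine was actually used. The function names, toy entries, and labels are hypothetical.

```python
import math
import re


def text_features(text):
    """Return [log-scaled length in tokens, lexical diversity] for one entry.

    Lexical diversity is assumed to be the type-token ratio
    (distinct tokens / total tokens); the slides do not define it.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return [0.0, 0.0]
    return [math.log1p(len(tokens)), len(set(tokens)) / len(tokens)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    """Estimated probability that an entry is shortlisted (bias is weights[0])."""
    return sigmoid(weights[0] + sum(w * xj for w, xj in zip(weights[1:], x)))


def fit_logistic(rows, labels, lr=0.3, epochs=3000):
    """Fit the feature weights by batch gradient descent on log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Hypothetical past entries; 1 = shortlisted, 0 = not shortlisted.
entries = [
    "clean water filters for rural schools and clinics",
    "mobile literacy program training teachers in remote villages",
    "good good idea good idea good",
    "help help people help people help people",
]
labels = [1, 1, 0, 0]
rows = [text_features(text) for text in entries]
weights = fit_logistic(rows, labels)
probabilities = [predict(weights, row) for row in rows]
```

In practice the structured and taxonomy inputs would be appended to each row in the same way, and a library implementation of logistic regression would replace the hand-written loop.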

(10)

• Joint work with Beth Maser and Richard Iams at PPC

• Non-traditional, general approach

Broad, flexible taxonomy

• Focus on the range of interests of the organization

(11)

Using the Taxonomy

Each contest emphasizes different branches of the taxonomy

Taxonomy features need to be contest specific

Step 1: Use the “Wisdom of Crowds” to find the center of each contest

Step 2: Rate each entry based on the distance
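The two steps could be sketched as follows. This assumes each entry is represented by counts of its auto-tagged taxonomy terms, that the contest "center" is the mean of those count vectors, and that distance is cosine distance. None of these choices is confirmed by the slides, and the term names and entries are hypothetical.

```python
import math
from collections import Counter


def contest_center(entries):
    """Step 1: average the term-count vectors of all entries in a contest."""
    total = Counter()
    for tags in entries:
        total.update(tags)
    return {term: count / len(entries) for term, count in total.items()}


def cosine_distance(vec, center):
    """Step 2: 1 - cosine similarity between an entry and the contest center."""
    dot = sum(value * center.get(term, 0.0) for term, value in vec.items())
    norm_v = math.sqrt(sum(v * v for v in vec.values()))
    norm_c = math.sqrt(sum(v * v for v in center.values()))
    if norm_v == 0.0 or norm_c == 0.0:
        return 1.0  # an untagged entry is treated as maximally distant
    return 1.0 - dot / (norm_v * norm_c)


# Hypothetical auto-tagged entries from one health-focused contest.
health_contest = [
    Counter({"health": 3, "education": 1}),
    Counter({"health": 2, "technology": 1}),
    Counter({"health": 4}),
    Counter({"finance": 3, "technology": 1}),
]
center = contest_center(health_contest)
distances = [cosine_distance(entry, center) for entry in health_contest]
```

Because the center is computed per contest, the same entry would receive a different distance in a contest that emphasizes other branches, which is one way to make the taxonomy feature contest specific.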

(12)

Evaluation: Area Under the ROC Curve

• Evaluate the overall ranking provided by the model.
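For reference, the area under the ROC curve can be computed directly as the probability that a randomly chosen shortlisted entry is scored above a randomly chosen non-shortlisted one. The sketch below uses that pairwise formulation; the function name and the labels and scores in the usage lines are illustrative, not results from the talk.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative values only: 1 = shortlisted, 0 = not shortlisted.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.3, 0.5, 0.1])
```

A value of 0.5 corresponds to a random ordering and 1.0 to a ranking that places every shortlisted entry above every other entry.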

(13)

Evaluation: Lift

Evaluates the improvement using the model at a fixed amount of work

How much more efficient are the judges using our model alone?

Every contest showed positive lift.

Maximum lift of 3.3
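A short sketch of one common definition of lift, assuming "a fixed amount of work" means reviewing a fixed fraction of the entries in model order. The function name, the fraction, and the sample data are hypothetical; the slides do not state which cutoff produced the reported figures.

```python
def lift_at(labels, scores, fraction):
    """Shortlist rate in the top `fraction` by score, divided by the overall rate."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(fraction * len(ranked)))
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    return top_rate / base_rate


# Illustrative data: 2 of 10 entries were shortlisted.
sample_labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
sample_scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.4, 0.5, 0.6, 0.05]
sample_lift = lift_at(sample_labels, sample_scores, 0.2)
```

A lift above 1 means judges reading the top-scored fraction would find shortlisted entries at a higher rate than reading entries in arbitrary order.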

(14)
(15)

Our Approach

A new search ranking function that “boosts” keyword relevance for probable shortlisted entries

[Figure: quadrant chart with axes Model Ranking and Keyword Relevance. Quadrants: High Keyword Relevance, High Model Ranking; High Keyword Relevance, Low Model Ranking; Low Keyword Relevance, Low Model Ranking; Low Keyword Relevance, High Model Ranking]
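One way such a boosted ranking function might look, assuming keyword relevance is multiplied by a factor that grows with the model's shortlist probability. The slides do not give the actual formula; `ensemble_score`, the `boost` parameter, and the sample entries are hypothetical.

```python
def ensemble_score(keyword_relevance, model_probability, boost=1.0):
    """Scale keyword relevance up for entries the model rates highly.

    With boost=0 this is plain keyword ranking; larger values give the
    model more influence. An entry with zero keyword relevance stays at
    zero, so results remain on topic.
    """
    return keyword_relevance * (1.0 + boost * model_probability)


def rank(results, boost=1.0):
    """Sort (entry_id, keyword_relevance, model_probability) tuples, best first."""
    return sorted(
        results,
        key=lambda r: ensemble_score(r[1], r[2], boost),
        reverse=True,
    )


results = [
    ("entry-a", 2.0, 0.10),  # high keyword relevance, low model ranking
    ("entry-b", 1.8, 0.90),  # high keyword relevance, high model ranking
    ("entry-c", 0.4, 0.95),  # low keyword relevance, high model ranking
    ("entry-d", 0.3, 0.05),  # low keyword relevance, low model ranking
]
ranked_ids = [entry_id for entry_id, _, _ in rank(results)]
keyword_only_ids = [entry_id for entry_id, _, _ in rank(results, boost=0.0)]
```

The multiplicative form keeps the keyword query in charge of topic while letting the model reorder entries that match it, which mirrors the quadrants: entries that score high on both rise to the top, and a strong model score alone cannot rescue an off-topic entry.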

(16)

The Prototype Platform

ERI Text Mining

Model (Predictive + Taxonomy)

Search Index

Custom Search Interface

(17)

Faceted Search with Solr

Apache Solr is an open-source faceted search engine
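As an illustration of what a faceted request to Solr can look like, the sketch below builds a `/select` URL using Solr's standard `facet` and `facet.field` parameters. The optional `boost` parameter of the edismax query parser is shown as one possible way to fold a stored model score into the ranking; whether the prototype did this is an assumption. The core name, field names, and helper function are hypothetical.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, keywords, facet_fields, boost_field=None):
    """Build a Solr /select URL with faceting and an optional score boost."""
    params = [
        ("q", keywords),
        ("defType", "edismax"),
        ("facet", "true"),
        ("wt", "json"),
    ]
    # One facet.field parameter per field to count values for.
    params += [("facet.field", field) for field in facet_fields]
    if boost_field is not None:
        # edismax multiplies the keyword score by this function's value.
        # `boost_field` is a hypothetical numeric field holding the model score.
        params.append(("boost", f"sum(1,field({boost_field}))"))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core ("entries") and field names.
url = build_solr_query(
    "http://localhost:8983/solr/entries",
    "clean water",
    ["contest", "topic"],
    boost_field="shortlist_prob",
)
```

The response to such a request would include value counts for each facet field alongside the ranked documents, which a custom search interface can render as filters.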

(18)

Text mining can be viewed from many different perspectives

No single view provides a complete solution

Must consider the entire “beast” to get the best solution

(19)

Contact Information

Andrew Fast, Ph.D.

Chief Scientist

[email protected]

(434) 973-7673

www.datamininglab.com

(20)

Practical Text Mining

• Winner of the 2012 PROSE award for Computing and Information Science

• Written for a technical audience seeking more text experience

• Includes trial versions

(21)

Andrew Fast

Chief Scientist, Elder Research, Inc.

Dr. Fast graduated Magna Cum Laude from Bethel University and earned Master’s and Ph.D. degrees in Computer Science from the University of Massachusetts Amherst. There, his research focused on causal data mining and mining complex relational data such as social networks. At ERI, Andrew leads the development of new tools and algorithms for data and text mining for applications of capabilities assessment, fraud detection, and national security.

Dr. Fast has published on an array of applications including detecting securities fraud using the social network among brokers, and understanding the structure of criminal and violent groups. Other publications cover modeling peer-to-peer music file sharing networks, understanding how collective classification works, and predicting playoff success of NFL head coaches (work featured on ESPN.com). With John Elder and other co-authors, Andrew has written a book, Practical Text Mining, which was awarded the PROSE Award for Computing and Information Science in 2012.

Dr. Andrew Fast leads research in Text Mining and Social Network Analysis at Elder Research, the nation’s leading data mining consultancy. ERI was founded in 1995 and has offices in Charlottesville, VA and Washington, DC (www.datamininglab.com). ERI focuses on Federal, commercial, investment, and security applications of advanced analytics, including stock selection, image recognition, biometrics, process optimization, cross-selling, drug efficacy, credit scoring, risk management, and fraud detection.
