Enhancing Search with
Predictive Analytics
Text Analytics World – Boston 2013
Andrew Fast
Chief Scientist
Elder Research, Inc.
The Elephant Test
• “It is difficult to describe, but you know it when you see it.”
  – Lord Justice Stuart Smith, Cadogan Estates Limited v. Morris (1998)
• Likewise, most textual concepts cannot be easily defined with a single keyword query
Combining Search and Predictive Models
• Search and predictive modeling each provide a different trade-off between power and generality.
• Keyword Search (generality): keyword queries can answer any query, but with limited depth for complex queries.
• Document Classification (power): a predictive model can answer one query well, especially a complex query.
Our Approach
• A “search ensemble” ranking function that “boosts” keyword relevance based on a predictive model
[Figure: four quadrants on the axes Keyword Relevance and Model Ranking: High Keyword Relevance, High Model Ranking; High Keyword Relevance, Low Model Ranking; Low Keyword Relevance, Low Model Ranking; Low Keyword Relevance, High Model Ranking]
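One way to read the “search ensemble” idea is as a score that scales keyword relevance by a factor derived from the model's output. The sketch below is illustrative only: the names `ensemble_score` and `rank`, the `weight` parameter, and the multiplicative form are assumptions, not the ranking function used in the talk, and the numbers are made up.

```python
from typing import List, Tuple


def ensemble_score(keyword_relevance: float, model_prob: float, weight: float = 1.0) -> float:
    """Boost keyword relevance by a model probability in [0, 1].

    Assumed form: relevance * (1 + weight * probability).
    A document with zero keyword relevance stays at zero, so the model
    re-orders keyword matches rather than adding off-topic results.
    """
    if not 0.0 <= model_prob <= 1.0:
        raise ValueError("model_prob must be in [0, 1]")
    return keyword_relevance * (1.0 + weight * model_prob)


def rank(docs: List[Tuple[str, float, float]], weight: float = 1.0) -> List[str]:
    """Return document ids sorted by boosted score, best first."""
    scored = [(ensemble_score(rel, prob, weight), doc_id) for doc_id, rel, prob in docs]
    return [doc_id for _, doc_id in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Made-up (id, keyword relevance, model probability) values, one per quadrant.
docs = [
    ("high_kw_low_model", 0.9, 0.1),
    ("high_kw_high_model", 0.8, 0.9),
    ("low_kw_high_model", 0.2, 0.9),
    ("low_kw_low_model", 0.2, 0.1),
]
ranking = rank(docs)
print(ranking)
```

With these values the entry that scores well on both axes moves to the top, even though another entry has slightly higher keyword relevance on its own.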
The Problem
• The Goal: Explore NEW interesting ideas using OLD social entrepreneurship contest entries
• The Data: A collection of contest entries from 19 different contests sponsored by our client
  – Contests cover a range of topics such as health, education, literacy, finance, technology, and geo-tourism.
• The Challenge: Emphasize high-quality entries in
Combining Search and Predictive Models
• Keyword ranking does not help you find high-quality entries …
• … but Model Ranking is not topic centric.
• Complementary strengths
  – Search for exploration and discovery
Target Variable
• Identify characteristics of past entries that are correlated with an entry being ‘Shortlisted’ by the Contest Judges
• Rankings:
  1 – Likely Finalist
  2 – Top Tier
  3 – Honorable Mention
  4 – Passed Screening
  5 – No
• Note: Not every contest used all 5 rankings
The Inputs
• Learn a logistic regression model to fit the feature weights
• Inputs:
  – Structured Data: Budget Size, Maturity, Impact
  – Taxonomy: Auto-tagging taxonomy terms
  – Textual Features: Length, Lexical Diversity
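To make the modeling step concrete, here is a minimal sketch of two textual features and a logistic regression fitted by gradient descent. Everything in it is an assumption for illustration: lexical diversity is taken to be the type-token ratio, the function names are hypothetical, and the training rows are made-up values rather than contest data.

```python
import math
from typing import List, Sequence


def text_features(text: str) -> List[float]:
    """Length in tokens and lexical diversity (unique tokens / tokens)."""
    tokens = text.lower().split()
    if not tokens:
        return [0.0, 0.0]
    return [float(len(tokens)), len(set(tokens)) / len(tokens)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: Sequence[float], x: Sequence[float]) -> float:
    """Probability of the positive class; w[0] is the bias term."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X: List[List[float]], y: List[int],
                 lr: float = 0.5, epochs: int = 2000) -> List[float]:
    """Fit feature weights by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


# Made-up rows: [scaled budget size, lexical diversity]; 1 = shortlisted.
X = [[0.2, 0.9], [0.3, 0.8], [0.8, 0.3], [0.9, 0.4]]
y = [1, 1, 0, 0]
weights = fit_logistic(X, y)
print(text_features("the cat and the hat"))
print(round(predict(weights, [0.25, 0.85]), 3))
```

The fitted weights are the quantities the slide refers to: each one says how strongly a feature pushes an entry toward or away from the shortlisted class.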
• Joint work with Beth Maser and Richard Iams at PPC
• Non-traditional, general approach
  – Broad, flexible taxonomy
• Focus on the range of interests of the organization
Using the Taxonomy
• Each contest emphasizes different branches of the taxonomy
  – Taxonomy features need to be contest specific
• Step 1: Use the “Wisdom of Crowds” to find the center of each contest
• Step 2: Rate each entry based on the distance
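The two steps can be sketched as follows. The slides do not say how the center or the distance is computed, so this example assumes the center is the mean taxonomy-tag vector over all entries in a contest and uses cosine distance; the branch names and tag counts are hypothetical.

```python
import math
from typing import List, Sequence


def contest_center(entries: List[Sequence[float]]) -> List[float]:
    """Step 1 (assumed): average the taxonomy vectors of all entries."""
    n = len(entries)
    return [sum(column) / n for column in zip(*entries)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Step 2 (assumed): 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical tag counts per entry over the branches [health, education, finance].
entries = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
center = contest_center(entries)
distances = [cosine_distance(entry, center) for entry in entries]
print(center)
print([round(d, 3) for d in distances])
```

Because the center is recomputed per contest, the same entry would receive a different distance in a contest that emphasizes other branches, which is what makes the feature contest specific.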
Evaluation: Area Under the ROC
• Evaluate the overall ranking provided by the model.
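The area under the ROC curve equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the labels and scores are made up, and `roc_auc` is a hypothetical helper name.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct pair
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = shortlisted, 0 = not shortlisted.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

A value of 0.5 corresponds to a random ordering and 1.0 to a ranking that places every shortlisted entry above every other entry.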
Evaluation: Lift
• Evaluates the improvement using the model at a fixed amount of work
  – How much more efficient are the judges using our model alone?
• Every contest showed positive lift.
  – Maximum lift of 3.3
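Lift at a fixed amount of work can be computed as the hit rate among the top-scored fraction of entries divided by the overall hit rate. The sketch below assumes that definition; the `lift_at` name, the 20% cutoff, and the data are illustrative and unrelated to the contest results reported above.

```python
from typing import Sequence


def lift_at(labels: Sequence[int], scores: Sequence[float], fraction: float) -> float:
    """Hit rate in the top `fraction` of entries divided by the base rate."""
    n = len(labels)
    k = max(1, int(round(n * fraction)))  # number of entries reviewed
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / n
    if base_rate == 0:
        raise ValueError("no positive examples, lift is undefined")
    return top_rate / base_rate


# Made-up example: 10 entries, 2 shortlisted, scores already in descending order.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
lift = lift_at(labels, scores, fraction=0.2)
print(round(lift, 2))
```

Here a judge who reads only the top two entries finds shortlisted entries at 2.5 times the rate of a judge reading entries in random order.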
Our Approach
• A new search ranking function that “boosts” keyword relevance for probable shortlisted entries
[Figure: four quadrants on the axes Keyword Relevance and Model Ranking: High Keyword Relevance, High Model Ranking; High Keyword Relevance, Low Model Ranking; Low Keyword Relevance, Low Model Ranking; Low Keyword Relevance, High Model Ranking]
The Prototype Platform
ERI Text Mining Model (Predictive + Taxonomy)
Search Index
Custom Search Interface
Faceted Search with Solr
Apache Solr is an open-source faceted search engine
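As a rough illustration of how a model score might be combined with keyword relevance in Solr, the sketch below builds a query URL that uses the Extended DisMax parser's multiplicative `boost` parameter together with a facet field. The talk does not say how the boost was implemented, so treat this as one possible setup: the core name `entries`, the fields `shortlist_prob` and `contest`, and the local URL are all hypothetical.

```python
from urllib.parse import urlencode


def build_solr_url(base_url: str, keywords: str,
                   prob_field: str = "shortlist_prob",
                   facet_field: str = "contest") -> str:
    """Build a Solr select URL that multiplies relevance by (1 + model score)."""
    params = [
        ("defType", "edismax"),             # Extended DisMax query parser
        ("q", keywords),                    # the user's keyword query
        ("boost", f"sum(1,{prob_field})"),  # multiplicative boost function (assumed field)
        ("facet", "true"),                  # turn on faceting
        ("facet.field", facet_field),       # facet on a hypothetical contest field
        ("wt", "json"),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core holding the contest entries.
url = build_solr_url("http://localhost:8983/solr/entries", "clean water")
print(url)
```

Storing the model score as an indexed numeric field keeps the predictive model out of the query path: scoring happens once at indexing time and the search engine only reads the stored value.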
• Text mining can be viewed from many different perspectives
• No single view provides a complete solution
• Must consider the entire “beast” to get the best solution
Contact Information
Andrew Fast, Ph.D.
Chief Scientist
[email protected]
(434) 973-7673
www.datamininglab.com
Practical Text Mining
• Winner of the 2012 PROSE award for Computing and Information Science
• Written for a technical audience seeking more text experience
• Includes trial versions
Andrew Fast
Chief Scientist, Elder Research, Inc.
Dr. Fast graduated Magna Cum Laude from Bethel University and earned Master’s and Ph.D. degrees in Computer Science from the University of Massachusetts Amherst. There, his research focused on causal data mining and mining complex relational data such as social networks. At ERI, Andrew leads the development of new tools and algorithms for data and text mining for applications of capabilities assessment, fraud detection, and national security.
Dr. Fast has published on an array of applications, including detecting securities fraud using the social network among brokers and understanding the structure of criminal and violent groups. Other publications cover modeling peer-to-peer music file sharing networks, understanding how collective classification works, and predicting playoff success of NFL head coaches (work featured on ESPN.com). With John Elder and other co-authors, Andrew has written a book, Practical Text Mining, which was awarded the PROSE Award for Computing and Information Science in 2012.
Dr. Andrew Fast leads research in Text Mining and Social Network Analysis at Elder Research, the nation’s leading data mining consultancy. ERI was founded in 1995 and has offices in Charlottesville, VA and Washington, DC (www.datamininglab.com). ERI focuses on Federal, commercial, investment, and security applications of advanced analytics, including stock selection, image recognition, biometrics, process optimization, cross-selling, drug efficacy, credit scoring, risk management, and fraud detection.