Faculty of Science
Big Data Challenges for Information Retrieval
Christina Lioma
Department of Computer Science
UNIVERSITY OF COPENHAGEN, DEPARTMENT OF COMPUTER SCIENCE
Information Retrieval: needles in haystacks
Branch of computer science behind search engines: find information among large, noisy, heterogeneous data
• a known needle in a known haystack
• a known needle in an unknown haystack
• an unknown needle in an unknown haystack
• any needle in a haystack
• the sharpest needle in a haystack
• most of the sharpest needles in a haystack
• all the needles in a haystack
• affirmation of no needles in the haystack
• things like needles in any haystack
• let me know whenever a new needle shows up
• where are the haystacks?
• needles, haystacks - whatever
Search engines in a nutshell
Three main types of ingredients (features):
1 Words: text semantics can be approximated by word frequencies
2 Web structure: if enough people point to something, it must be good and (probably) relevant
3 Users: search behaviour, click behaviour, dwell behaviour
• User queries: distribution over features INPUT
• Indexed documents: distribution over features INPUT
• Ranking: comparing distributions OUTPUT
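A minimal sketch of ranking as "comparing distributions", using only the word-frequency feature. The choice of KL divergence, the add-one smoothing, and the names `word_distribution`, `kl_divergence` and `rank` are assumptions made for illustration; the slides do not say which comparison a real engine uses.

```python
import math
from collections import Counter

def word_distribution(text, vocabulary, smoothing=1.0):
    """Smoothed relative word frequencies over a fixed vocabulary (assumed: add-one smoothing)."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}

def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def rank(query, documents):
    """Return document ids ordered from closest to farthest from the query distribution."""
    texts = [query, *documents.values()]
    vocabulary = sorted({w for text in texts for w in text.lower().split()})
    q_dist = word_distribution(query, vocabulary)
    scored = [
        (kl_divergence(q_dist, word_distribution(text, vocabulary)), doc_id)
        for doc_id, text in documents.items()
    ]
    return [doc_id for _, doc_id in sorted(scored)]

# Hypothetical toy collection.
docs = {
    "d1": "needle in a haystack",
    "d2": "search engines rank web pages",
    "d3": "cooking pasta at home",
}
ranking = rank("search engines", docs)
print(ranking[0])
```

The document sharing the query's words ends up closest to the query distribution and is ranked first; web-structure and user features would enter as further dimensions of the same distributions.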
Anno 2013
• Realtime indexing: 20 billion pages crawled per day
• Instant search: retrieval time < 0.3 sec, faster than human typing
• Zero query search: try to retrieve information before you know what you are looking for, based on user profiling
In terms of scale:
• 50 billion indexed webpages
• 3 billion search requests per day, Google alone (world population: ca. 7 billion people)
Data-driven technology – Big Data challenges
1 Long Data
2 Your Data
3 Small Data Thinking
Big data challenge 1: long data
Long as in longitudinal: spanning over time
• The problem is not the range but the intervals: dynamic streams of data coming in, timestamped at intervals shorter than seconds
Implications for search engines:
• time-versioned indexing: fine-grained updates & threaded associations
• time-travel queries: what is relevant depends on when
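The two implications above can be sketched as a toy inverted index whose postings carry timestamps, so a query can be answered "as of" a given moment. The class `TimeVersionedIndex`, its methods and the sample documents are hypothetical; a production index would also handle deletions, updates and compression.

```python
import bisect
from collections import defaultdict

class TimeVersionedIndex:
    """Toy inverted index: each term maps to (timestamp, doc_id) postings sorted by time."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, timestamp, doc_id, text):
        """Index a document that arrived at `timestamp` (seconds, may be fractional)."""
        for term in set(text.lower().split()):
            bisect.insort(self._postings[term], (timestamp, doc_id))

    def search(self, term, as_of):
        """Time-travel query: documents containing `term` that existed at time `as_of`."""
        postings = self._postings.get(term.lower(), [])
        timestamps = [t for t, _ in postings]
        cutoff = bisect.bisect_right(timestamps, as_of)
        return [doc_id for _, doc_id in postings[:cutoff]]

index = TimeVersionedIndex()
index.add(1.0, "d1", "storm warning issued")
index.add(2.5, "d2", "storm passes coast")
index.add(4.0, "d3", "storm damage report")

early = index.search("storm", as_of=3.0)  # only documents seen up to t=3.0
late = index.search("storm", as_of=5.0)   # all three documents
print(early, late)
```

The same query returns different results depending on `as_of`, which is the sense in which what is relevant depends on when.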
Big data challenge 2: your data
Personalisation. Can of worms.
We can collect your data BUT it is safer not to personalise than to annoy you...
Big data implications:
• Personalised data on two axes: individual (e.g. user click-through, preferences, history) and social (e.g. Twitter, Facebook, blogs)
• Search engines must translate all this data into a single user state reflecting user preferences
• This state needs to be updated dynamically with every new input, but also remain consistent and below the nuisance threshold
The larger and noisier the input, the harder it is to keep this balance
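One way to picture a single user state that updates with every input yet stays stable is an exponential moving average over topic signals, with personalisation switched on only for topics backed by consistent evidence. This is an illustrative assumption, not a method from the slides: the class `UserState`, the parameters `learning_rate` and `confidence_threshold`, and the topic names are all hypothetical.

```python
class UserState:
    """Toy user state: per-topic preference weights smoothed over time."""

    def __init__(self, learning_rate=0.2, confidence_threshold=0.6):
        self.learning_rate = learning_rate                # how fast new input moves the state
        self.confidence_threshold = confidence_threshold  # below this, do not personalise
        self.preferences = {}

    def update(self, topic_signals):
        """Blend one new observation (topic -> strength in [0, 1]) into the state."""
        for topic in set(self.preferences) | set(topic_signals):
            old = self.preferences.get(topic, 0.0)
            new = topic_signals.get(topic, 0.0)
            self.preferences[topic] = (1 - self.learning_rate) * old + self.learning_rate * new

    def topics_to_personalise(self):
        """Topics with enough consistent evidence to risk personalising on."""
        return sorted(
            topic for topic, weight in self.preferences.items()
            if weight >= self.confidence_threshold
        )

state = UserState()
state.update({"sports": 1.0})           # a single click is not enough evidence
after_one = state.topics_to_personalise()
for _ in range(4):
    state.update({"sports": 1.0})       # repeated, consistent signals
after_five = state.topics_to_personalise()
print(after_one, after_five)
```

A low learning rate keeps the state consistent under noisy input but makes it slow to react; that trade-off is the balance the slide refers to.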
Big data challenge 3: small data thinking
R&D in information retrieval: clear division between efficiency and effectiveness
• Efficiency: index compression, reducing lookup time, query caching ...
Is not always on-topic
• Effectiveness: accurate feature extraction, personalisation, relevance ...
Does not always scale
Sources
• Haystack image, page 2: http://footprinthr.com.au/wp-content/uploads/2012/01/needle_haystack.jpg
• Needles in haystack metaphor, page 2: Matthew Koll, Bulletin of the American Society for Information Science, Vol. 2, No. 2, December/January 2000
• Typewriter image, page 3: Copyright: Roberto Zilli, ID: 99118544, available from http://www.shutterstock.com
• Distributions image, page 3: Source: Edgar Meij, Large-scale Data Processing for Information Retrieval, 2012
• Tweets image, page 5: Source: http://blog.crowdbooster.com/take-control-of-your-twitter-data-introducing
• Can of worms image, page 6: Copyright: munchester2cool, available from http://munchester2cool.deviantart.com/art/Luke-s-Can-of-Worms-55442402
• Efficiency vs. effectiveness image, page 7: http://psychologyface.com/2012/11/effectiveness-and-efficiency