Big Data Challenges for Information Retrieval

(1)

Faculty of Science

Christina Lioma

Department of Computer Science

(2)

University of Copenhagen, Department of Computer Science

Information Retrieval: needles in haystacks

Branch of computer science behind search engines: find information among large, noisy, heterogeneous data

• a known needle in a known haystack
• a known needle in an unknown haystack
• an unknown needle in an unknown haystack
• any needle in a haystack
• the sharpest needle in a haystack
• most of the sharpest needles in a haystack
• all the needles in a haystack
• affirmation of no needles in the haystack
• things like needles in any haystack
• let me know whenever a new needle shows up
• where are the haystacks?
• needles, haystacks - whatever

(4)

Search engines in a nutshell

Three main types of ingredients (features):

1 Words: text semantics can be approximated by word frequencies
2 Web structure: if enough people point to something, it must be good and (probably) relevant
3 Users: search behaviour, click behaviour, dwell behaviour
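The web-structure ingredient can be illustrated with a PageRank-style score. This is a minimal sketch of one well-known link-analysis technique, not necessarily what any particular search engine uses; the toy graph, the damping factor of 0.85 and the function name `pagerank` are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. `links` maps a page to the pages it points to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline score (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages point to "c", and "c" points back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In this toy graph the page that most others point to ends up with the highest score, which is the intuition behind ingredient 2.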

• User queries: distribution over features INPUT

• Indexed documents: distribution over features INPUT

• Ranking: comparing distributions OUTPUT
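A minimal sketch of "ranking as comparing distributions", using word frequencies only. It assumes unigram distributions with additive smoothing and KL divergence as the comparison; the slides do not name a specific measure, so both choices, the smoothing value and the function names are illustrative.

```python
import math
from collections import Counter


def distribution(text, vocabulary, smoothing=0.1):
    """Smoothed word-frequency distribution of `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[word] for word in vocabulary) + smoothing * len(vocabulary)
    return {word: (counts[word] + smoothing) / total for word in vocabulary}


def kl_divergence(p, q):
    """KL(p || q); smaller means q is closer to p."""
    return sum(p[word] * math.log(p[word] / q[word]) for word in p)


def rank(query, documents, smoothing=0.1):
    """Return document ids ordered from closest to farthest from the query."""
    texts = [query, *documents.values()]
    vocabulary = sorted({word for text in texts for word in text.lower().split()})
    query_dist = distribution(query, vocabulary, smoothing)
    scored = [
        (kl_divergence(query_dist, distribution(text, vocabulary, smoothing)), doc_id)
        for doc_id, text in documents.items()
    ]
    return [doc_id for _, doc_id in sorted(scored)]


# Hypothetical three-document collection.
docs = {
    "d1": "needle in a haystack",
    "d2": "search engines rank web pages",
    "d3": "a haystack full of hay",
}
ranking = rank("needle haystack", docs)
print(ranking)  # ['d1', 'd3', 'd2']
```

The query and each document are the inputs, each turned into a distribution over the same features; the ordering by divergence is the output.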

(6)

Anno 2013

• Realtime indexing: 20 billion pages crawled per day

• Instant search: retrieval time < 0.3 sec, faster than human typing

• Zero query search: try to retrieve information before you know what you are looking for, based on user profiling

In terms of scale:

• 50 billion indexed webpages

• 3 billion search requests per day [1] (world population: ca. 7 billion people)

Data-driven technology – Big Data challenges

1 Long Data

2 Your Data

3 Small Data Thinking

[1] Google alone
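To put the daily figures above on a per-second scale, a small back-of-the-envelope calculation. It assumes the load is spread evenly over the day, which real traffic is not; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

pages_crawled_per_day = 20_000_000_000  # from the slide
searches_per_day = 3_000_000_000        # from the slide (Google alone)
world_population = 7_000_000_000        # from the slide (approximate)

# Average rates, assuming an even spread over the day.
crawl_rate = pages_crawled_per_day / SECONDS_PER_DAY
query_rate = searches_per_day / SECONDS_PER_DAY
searches_per_person = searches_per_day / world_population

print(f"pages crawled per second: {crawl_rate:,.0f}")     # 231,481
print(f"search requests per second: {query_rate:,.0f}")   # 34,722
print(f"searches per person per day: {searches_per_person:.2f}")  # 0.43
```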

(8)

Big data challenge 1: long data

Long as in longitudinal: spanning over time

• The problem is not the range but the intervals: dynamic streams of data coming in with timestamps at intervals of less than a second

Implications for search engines:

• time-versioned indexing: fine-grained updates & threaded associations
• time-travel queries: what is relevant depends on when
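A minimal sketch of a time-versioned index that answers time-travel queries. Each posting carries a validity interval, and an update only touches the terms that changed. The class name, the interval layout and the assumption that updates for a document arrive in timestamp order are all illustrative; threaded associations are not modelled here.

```python
from collections import defaultdict


class TimeVersionedIndex:
    """Inverted index whose postings are valid for a time interval."""

    def __init__(self):
        # term -> list of [doc_id, valid_from, valid_to]; valid_to None = still current
        self._postings = defaultdict(list)
        # doc_id -> set of terms in the latest version of the document
        self._current = {}

    def update(self, doc_id, text, timestamp):
        """Index a new version of a document (timestamps must not go backwards)."""
        new_terms = set(text.lower().split())
        old_terms = self._current.get(doc_id, set())
        # Close the interval of every term that disappeared in this version.
        for term in old_terms - new_terms:
            for posting in self._postings[term]:
                if posting[0] == doc_id and posting[2] is None:
                    posting[2] = timestamp
        # Open a new interval for every term that appeared in this version.
        for term in new_terms - old_terms:
            self._postings[term].append([doc_id, timestamp, None])
        self._current[doc_id] = new_terms

    def search(self, term, at):
        """Return the documents that contained `term` at time `at`."""
        return sorted(
            doc_id
            for doc_id, start, end in self._postings.get(term.lower(), [])
            if start <= at and (end is None or at < end)
        )


index = TimeVersionedIndex()
index.update("page1", "election polls open", timestamp=10.0)
index.update("page1", "election results announced", timestamp=20.0)
index.update("page2", "election results analysis", timestamp=25.0)

print(index.search("polls", at=15.0))    # ['page1']
print(index.search("polls", at=22.0))    # []
print(index.search("results", at=30.0))  # ['page1', 'page2']
```

The same query term returns different documents depending on the query time, which is the sense in which relevance depends on when.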

(10)

Big data challenge 2: your data

Personalisation. Can of worms.

We can collect your data BUT it is safer not to personalise than to annoy you...

Big data implications:

• Personalised data on two axes: individual (e.g. user click-through, preferences, history) and social (e.g. Twitter, Facebook, blogs)
• Search engines must translate all this data into a single user state reflecting user preferences
• This state needs to be updated dynamically with every new input, but also remain consistent and below the nuisance threshold

The larger and noisier the input, the harder to keep this balance
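One way to picture the single user state is a set of topic weights that moves a little with every signal and is only acted upon once the evidence is strong enough. This is a sketch under stated assumptions: the exponential moving average, the `learning_rate`, and the `confidence_threshold` (a hypothetical stand-in for staying below the nuisance threshold) are illustrative choices, not something the slides specify.

```python
class UserState:
    """Single user state: one weight per topic, updated with every new signal."""

    def __init__(self, learning_rate=0.2, confidence_threshold=0.6):
        self.learning_rate = learning_rate
        self.confidence_threshold = confidence_threshold
        self.weights = {}

    def update(self, signal):
        """Blend in one signal (topic -> strength in [0, 1]) from a click or post.

        The small learning rate keeps the state consistent: no single input
        can move a weight by more than `learning_rate`.
        """
        for topic in set(self.weights) | set(signal):
            old = self.weights.get(topic, 0.0)
            new = signal.get(topic, 0.0)
            self.weights[topic] = (1 - self.learning_rate) * old + self.learning_rate * new

    def preferred_topic(self):
        """Return a topic to personalise on, or None when evidence is too weak."""
        if not self.weights:
            return None
        topic, weight = max(self.weights.items(), key=lambda item: item[1])
        return topic if weight >= self.confidence_threshold else None


state = UserState()
for _ in range(4):
    state.update({"cycling": 1.0})
after_four = state.preferred_topic()   # None: not enough evidence yet

state.update({"cycling": 1.0})
after_five = state.preferred_topic()   # 'cycling'

state.update({"cooking": 1.0})         # one off-topic signal
after_noise = state.preferred_topic()  # None: safer not to personalise
print(after_four, after_five, after_noise)
```

With noisier input the weights hover below the threshold more often, so the state falls back to not personalising, which mirrors the balance described above.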

(11)

Big data challenge 3: small data thinking

R&D in information retrieval: clear division between efficiency and effectiveness

• Efficiency: index compression, reducing lookup time, query caching ...
  Is not always on-topic
• Effectiveness: accurate feature extraction, personalisation, relevance ...
  Does not always scale
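As one concrete instance of the efficiency side, a sketch of index compression: document ids in a postings list are stored as gaps and each gap is variable-byte encoded. This is a standard textbook scheme chosen for illustration; the slides do not say which compression method is meant, and the function names are assumptions.

```python
def encode_postings(doc_ids):
    """Gap-encode a sorted list of doc ids, then variable-byte encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        # Split the gap into 7-bit groups, lowest-order group first.
        chunk = [gap & 0x7F]
        gap >>= 7
        while gap:
            chunk.append(gap & 0x7F)
            gap >>= 7
        chunk[0] |= 0x80  # high bit marks the final byte of this number
        out.extend(reversed(chunk))  # emit highest-order group first
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings."""
    doc_ids, value, previous = [], 0, 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:  # final byte of a gap: turn the gap back into a doc id
            previous += value
            doc_ids.append(previous)
            value = 0
    return doc_ids


postings = [824, 829, 215406]
encoded = encode_postings(postings)
print(len(encoded), encoded.hex())  # 6 06b8850d0cb1
print(decode_postings(encoded))     # [824, 829, 215406]
```

Three document ids fit in six bytes here, compared with twelve bytes as fixed 32-bit integers.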

(12)

Sources

• Haystack image, page 2: http://footprinthr.com.au/wp-content/uploads/2012/01/needle_haystack.jpg
• Needles in haystack metaphor, page 2: Matthew Koll, Bulletin of the American Society for Information Science, Vol. 2, No. 2, December/January 2000
• Typewriter image, page 3: Copyright: Roberto Zilli, ID: 99118544, available from http://www.shutterstock.com
• Distributions image, page 3: Source: Edgar Meij, Large-scale Data Processing for Information Retrieval, 2012
• Tweets image, page 5: Source: http://blog.crowdbooster.com/take-control-of-your-twitter-data-introducing
• Can of worms image, page 6: Copyright: munchester2cool, available from http://munchester2cool.deviantart.com/art/Luke-s-Can-of-Worms-55442402
• Efficiency vs. effectiveness image, page 7: http://psychologyface.com/2012/11/effectiveness-and-efficiency