Big Data Hope or Hype?

(1)

David J. Hand

Imperial College, London and Winton Capital Management

(2)

Google trends on big data

Google search 1 Sept 2013: 1.6 billion hits on ‘big data’

(3)

What is big data?

Various definitions:

‐ data which are too extensive to permit iterative analysis: one‐pass analysis is necessary;

‐ data sets which standard database tools cannot handle;

‐ data sets which are so large they require new forms of processing;

‐ a data set which exceeds 20% of the RAM of a given machine;
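
To make the first definition concrete, here is a minimal sketch of a one-pass analysis: a running mean and variance computed with Welford's update, so that each value is seen once and never stored. The `readings` generator is a hypothetical stand-in for a data source too large to hold in memory; it is not taken from the talk.

```python
from typing import Iterable, Tuple


def one_pass_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) after a single pass over the stream."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else float("nan")
    return count, mean, variance


def readings():
    # Hypothetical data source: values are generated one at a time,
    # so the full data set never sits in memory.
    for i in range(1, 100_001):
        yield float(i % 100)


count, mean, variance = one_pass_mean_variance(readings())
print(count, round(mean, 3), round(variance, 3))
```

Only three numbers are kept in memory however long the stream is, which is the sense in which one pass is enough for this particular summary. Many iterative procedures have no such exact one-pass form.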

(4)

Some big data stories

‐ The Large Hadron Collider: 1 petabyte (10¹⁵ bytes) per second;

‐ Sequencing the Human Genome: 3.3 billion base pairs;

‐ Social network analysis: 2.5 quintillion (10¹⁸) bytes per day;

‐ Climate modelling: Coupled Model Intercomparison Project, 5th phase: more than 2 petabytes;

‐ Google Translate: statistical machine translation; 200 billion words from UN documents

(5)

Why now?

‐ automatic data capture (often secondary)

‐ simulations (e.g. meteorology, physics)

‐ exponential growth in computer memory

(6)

But it’s not new!

It’s media rebranding

‐ 1994: Wal‐Mart, with over 7 billion transactions per year;

‐ 1997: AT&T, with over 70 billion long‐distance phone call records per year;

‐ 1990s: Mobil Oil, over 100 terabytes of data;

‐ 2000: in just a few months the Sloan Digital Sky Survey collected more data than had previously been collected in the entire history of astronomy

(7)

Why is it exciting?

A new world, according to many!

McKinsey: ‘we are on the cusp of a tremendous wave of innovation, productivity, and growth, as well as new modes of competition and value capture, all driven by big data as consumers, companies, and economic sectors exploit its potential’

(8)

Some see big data as a paradigm shift in science:

“Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.”

Chris Anderson, in a Wired article called ‘The End of Theory: The Data Deluge Makes the Scientific Method Obsolete’.

(9)

But he was wrong:

the numbers don’t speak for themselves

(10)

There are two kinds of models:

data‐driven

substantive

(11)

Data‐driven models

Based purely on empirical relationships in the data

e.g. in credit scoring the model of choice is a logistic regression tree

‐ The population is partitioned into segments on empirical grounds

‐ Different logistic regression models built in each segment

No underlying theory

No psychology, prospect theory, behavioural finance, etc.
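
A minimal sketch of the segment-then-fit idea follows. It is not a full logistic regression tree: the two segments are fixed by hand rather than found empirically, the data are synthetic, and all names and parameter values (`segment_a`, `segment_b`, the intercepts and slopes) are made up for illustration. Each segment gets its own one-variable logistic regression, fitted by plain gradient ascent.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, steps=1000, lr=0.5):
    """Fit P(y=1 | x) = sigmoid(a + b*x) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(a + b * x)
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def simulate_segment(rng, n, intercept, slope):
    """Synthetic (x, y) pairs; x stands in for a standardised applicant characteristic."""
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(intercept + slope * x) else 0 for x in xs]
    return xs, ys


rng = random.Random(0)
# Hypothetical segments with made-up true parameters.
segments = {
    "segment_a": simulate_segment(rng, 400, -1.0, 2.0),
    "segment_b": simulate_segment(rng, 400, 0.5, -1.0),
}

# One separate logistic regression per segment.
models = {name: fit_logistic(xs, ys) for name, (xs, ys) in segments.items()}


def predict(segment, x):
    """Predicted probability for a case, using the model of its own segment."""
    a, b = models[segment]
    return sigmoid(a + b * x)


for name, (a, b) in models.items():
    print(name, round(a, 2), round(b, 2))
```

The fitted coefficients describe what the data do in each segment and can be used to score new cases, but nothing in them says why the segments behave differently.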

(12)

Data‐driven models are not new

e.g. segmented regression in credit scoring in the 1960s

Data‐driven models are good for prediction and anomaly detection, which is why they are so heavily used in some domains

But data‐driven models don’t provide insight

(13)

Substantive models

Are essentially theories

e.g. Newton’s Laws of Motion

‐ necessary for understanding

e.g. to detect dark matter from galaxy rotation

‐ lack of insight has its dangers

[Figure: chart with an axis in billions (ticks at 10, 20, 30, 40)]

(14)

So it’s too much to say

“Out with every theory of human behavior” (Anderson)

It depends what you are using the models for

‐ prediction

‐ understanding

(15)

Big data needs

Computer science for manipulating data

Sorting, adding, selecting, aggregating, concatenating, etc.

Statistics for extracting information from data

Most of the problems we want to solve are inferential

We don’t want to make a statement about the data we have, but about

‐ data we might get tomorrow (e.g. economic forecasting);

‐ the population from which our data were drawn (e.g. astronomical databases);

‐ a true value, which we have observed with measurement error (e.g. gene expression data);

‐ data we might have had if things had been different (e.g. social

(16)

Big data risks

‐ big data often collected as a side effect of some other exercise: the definitions may not match

‐ definitions may change over time if administrative

‐ data quality (good for one purpose, not for another; computer is a necessary intermediary)

‐ selection bias

‐ different observational automatic data capture sources have different biases; problem of selecting on basis of response variable

‐ crime maps example: Direct Line Insurance survey: selective reporting of incidents for fear of impact on house prices

‐ multiple testing

‐ ‘everything’ significant
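
The last two risks can be illustrated with a little arithmetic. The sketch below assumes a simple two-group z-test with known standard deviation and independent tests of true null hypotheses. The effect size (0.01 standard deviations), sample sizes, and significance level are illustrative choices, not figures from the talk.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def z_for_mean_difference(diff, sd, n_per_group):
    """z statistic for a difference in means between two equal-sized groups."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return diff / standard_error


def prob_any_false_positive(num_tests, alpha=0.05):
    """Chance of at least one false positive among independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** num_tests


# 'Everything' significant: the same tiny difference, tested at growing sample sizes.
for n in (1_000, 100_000, 10_000_000):
    z = z_for_mean_difference(diff=0.01, sd=1.0, n_per_group=n)
    print(f"n per group = {n:>10,}  p = {two_sided_p_value(z):.3g}")

# Multiple testing: expected false positives and chance of at least one.
for m in (1, 20, 1_000):
    print(f"tests = {m:>5}  expected false positives = {m * 0.05:.2f}  "
          f"P(at least one) = {prob_any_false_positive(m):.3f}")
```

With enough data a practically negligible difference becomes statistically significant, and with enough tests some nulls are rejected by chance alone. That is why significance alone is a poor filter at this scale.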

(17)

New tools needed

Wikipedia says the challenges include

“capture, curation, storage, search, sharing, transfer, analysis, and visualization”

While true, this is mostly talking about computational housekeeping tools rather than knowledge extraction tools:

It’s talking about data juggling rather than inference

[Even ‘analysis’ in the above quote refers to Hadoop, missing the point]

(18)

But there are implications for inference

‐ visualisation: but familiar tools may be inadequate

(19)

‐ iteration too slow ‐ use simple models (e.g. regression instead of logistic regression)

‐ splitting and screening (not really taking advantage of the big data)

e.g. the LHC: 1 petabyte per second; an online filter reduces this by a factor of 10,000, and further selection by a factor of 100.

‐ anomaly detection

‐ streaming data
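
The LHC screening figures quoted above can be worked through directly. This short sketch uses only the numbers given on the slide; the function and variable names are illustrative.

```python
PETABYTE = 10**15  # bytes


def rate_after_filters(raw_bytes_per_sec, reduction_factors):
    """Apply successive reduction factors to a raw data rate."""
    rate = raw_bytes_per_sec
    for factor in reduction_factors:
        rate /= factor
    return rate


raw = 1 * PETABYTE
kept = rate_after_filters(raw, [10_000, 100])

print(f"retained rate: {kept:.3g} bytes per second")
print(f"fraction retained: {kept / raw:.0e}")
```

On these figures about one byte in a million survives the two stages, which is the sense in which screening does not really take advantage of the full data volume.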

(20)