Big Data
Hope or Hype?
David J. Hand
Imperial College, London and
Winton Capital Management
Google trends on big data
Google search 1 Sept 2013: 1.6 billion hits on ‘big data’
What is big data?
Various definitions:
‐ data which are too extensive to permit iterative analysis: one‐pass analysis is necessary;
‐ data sets which standard database tools cannot handle;
‐ data sets which are so large they require new forms of processing;
‐ a data set which exceeds 20% of the RAM of a given machine.
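The first definition turns on one‐pass analysis. As a minimal sketch of what that can look like, the Python below computes a mean and variance in a single pass using Welford's update, so the data never have to be held in memory or revisited. The function name and the toy stream are illustrative assumptions, not something taken from the slides.

```python
from typing import Iterable, Tuple


def one_pass_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) after a single pass over the stream."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance


# Toy stream (illustrative only); a generator or file reader would work the same way.
count, mean, variance = one_pass_mean_variance(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(count, mean, variance)
```

Each value is seen once and then discarded, which is the constraint the definition describes.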
Some big data stories
The Large Hadron Collider: a petabyte (10^15 bytes) per second;
Sequencing the Human Genome: 3.3 billion base pairs
Social network analysis: 2.5 quintillion (10^18) bytes per day
Climate modelling: Coupled Model Intercomparison Project, 5th phase: more than 2 petabytes
Google Translate: statistical machine translation; 200 billion words from UN documents
Why now?
‐ automatic data capture (often secondary)
‐ simulations (e.g. meteorology, physics)
‐ exponential growth in computer memory
But it’s not new!
It’s media rebranding
‐ 1994: Wal‐Mart, with over 7 billion transactions per year;
‐ 1997: AT&T, with over 70 billion long‐distance phone call records per year;
‐ 1990s: Mobil Oil, over 100 terabytes of data;
‐ 2000: in just a few months the Sloan Digital Sky Survey
collected more data than had previously been collected
in the entire history of astronomy
Why is it exciting?
A new world, according to many!
McKinsey: ‘we are on the cusp of a tremendous wave of innovation, productivity, and growth, as well as new
modes of competition and value capture, all driven by big data as consumers, companies, and economic sectors
exploit its potential’
Some see big data as a paradigm shift in science:
“Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology.
Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for
themselves.”
Chris Anderson, in a Wired article called ‘The end of theory: the data deluge makes the scientific method obsolete’.
But he was wrong:
the numbers don’t speak for themselves
There are two kinds of models:
data‐driven
substantive
Data‐driven models
Based purely on empirical relationships in the data
e.g. in credit scoring the model of choice is a logistic regression tree:
‐ The population is partitioned into segments on empirical grounds
‐ Different logistic regression models built in each segment
No underlying theory
No psychology, prospect theory, behavioural finance, etc.
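A rough sketch of the segmented idea, in Python: partition the population, then fit a separate logistic regression in each segment. Everything specific here is an assumption for illustration. The segment labels "A" and "B" are given rather than found empirically by a tree, the data are synthetic, and the plain gradient‐ascent fitter stands in for whatever fitting routine a real scorecard would use.

```python
import math
import random
from collections import defaultdict


def fit_logistic(xs, ys, lr=0.5, epochs=500):
    """Fit p(y=1|x) = sigmoid(a + b*x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def fit_segmented(records):
    """records: (segment, x, y) triples; returns one (intercept, slope) per segment."""
    by_segment = defaultdict(lambda: ([], []))
    for segment, x, y in records:
        by_segment[segment][0].append(x)
        by_segment[segment][1].append(y)
    return {seg: fit_logistic(xs, ys) for seg, (xs, ys) in by_segment.items()}


# Synthetic data (hypothetical): the same predictor has opposite effects in the two segments.
rng = random.Random(0)
true_slope = {"A": 2.0, "B": -2.0}
records = []
for segment, slope in true_slope.items():
    for _ in range(500):
        x = rng.uniform(-2.0, 2.0)
        p = 1.0 / (1.0 + math.exp(-slope * x))
        records.append((segment, x, 1 if rng.random() < p else 0))

models = fit_segmented(records)
for segment, (a, b) in sorted(models.items()):
    print(f"segment {segment}: intercept={a:.2f}, slope={b:.2f}")
```

The fitted slopes differ by segment purely because the data say so; nothing in the code encodes why the segments behave differently.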
Data‐driven models are not new
e.g. segmented regression in credit scoring in the 1960s
Data‐driven models are good for prediction and anomaly detection,
which is why they are so heavily used in some domains
But data‐driven models don’t provide insight
Substantive models
Are essentially theories
e.g. Newton’s Laws of Motion
‐ necessary for understanding
e.g. to detect dark matter from galaxy rotation
‐ lack of insight has its dangers
[Figure: chart with an axis in billions, ticks at 10, 20, 30 and 40]
So it’s too much to say
“Out with every theory of human behavior” (Anderson)
It depends what you are using the models for
‐ prediction
‐ understanding
Big data needs
Computer science for manipulating data
Sorting, adding, selecting, aggregating, concatenating, etc.
Statistics for extracting information from data
Most of the problems we want to solve are inferential
We don’t want to make a statement about the data we have, but about
‐ data we might get tomorrow (e.g. economic forecasting);
‐ the population from which our data were drawn (e.g. astronomical databases);
‐ a true value, which we have observed with measurement error (e.g.
gene expression data);
‐ data we might have had if things had been different (e.g. social
Big data risks
‐ big data often collected as a side effect of some other exercise: the definitions may not match
‐ definitions may change over time if administrative
‐ data quality (good for one purpose, not for another;
computer is a necessary intermediary)
‐ selection bias
‐ different observational automatic data capture sources have
different biases; problem of selecting on basis of response variable
‐ crime maps example: Direct Line Insurance survey: selective reporting of incidents for fear of impact on house prices
‐ multiple testing
‐ ‘everything’ significant
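The multiple testing risk can be illustrated with a small simulation, sketched here in Python. It runs many two‐sided z‐tests on pure noise, so every "significant" result is a false positive. The number of tests, the sample size, the 5% level and the Bonferroni correction are all illustrative choices, not figures from the slides.

```python
import random
from statistics import NormalDist, fmean

rng = random.Random(42)
n_tests, n_obs, alpha = 1000, 100, 0.05
std_normal = NormalDist()

p_values = []
for _ in range(n_tests):
    # Pure noise: the true mean is exactly zero in every test.
    sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    z = fmean(sample) * n_obs ** 0.5  # z-statistic with known standard deviation 1
    p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z))))

naive_hits = sum(p < alpha for p in p_values)
bonferroni_hits = sum(p < alpha / n_tests for p in p_values)
print(f"'significant' at {alpha}: {naive_hits} of {n_tests}")
print(f"after Bonferroni correction: {bonferroni_hits} of {n_tests}")
```

With no real effect anywhere, roughly 5% of the tests should still come out as significant at the uncorrected level, which is why screening many variables in a large data set needs some form of correction.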
New tools needed
Wikipedia says the challenges include
“capture, curation, storage, search, sharing, transfer, analysis, and visualization”
While true, this is mostly talking about computational housekeeping tools rather than knowledge extraction tools:
It’s talking about data juggling rather than inference
[Even ‘analysis’ in the above quote refers to Hadoop,
missing the point]
But there are implications for inference
‐ visualisation: but familiar tools may be inadequate
‐ iteration too slow: use simple models (e.g. regression instead of logistic regression)
‐ splitting and screening (not really taking advantage of the big data)
e.g. the LHC: 1 petabyte per second, online filter reduces by a factor of 10,000, further selection by a factor of 100.
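Taking the LHC figures above at face value, the arithmetic can be checked in a few lines of Python. The variable names are illustrative, and a petabyte is assumed to mean 10^15 bytes.

```python
raw_rate = 10**15          # bytes per second before any filtering (1 petabyte)
online_filter = 10_000     # reduction factor of the online filter
further_selection = 100    # reduction factor of the later selection step

after_filter = raw_rate // online_filter
retained_rate = after_filter // further_selection
overall_reduction = online_filter * further_selection

print(f"after online filter: {after_filter:,} bytes/s")
print(f"after further selection: {retained_rate:,} bytes/s")
print(f"overall reduction factor: {overall_reduction:,}")
```

On these numbers only about one byte in a million is kept, about a gigabyte per second, which is the sense in which splitting and screening do not really take advantage of the full data.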