Göran Kauermann, Statistics & Big Data, Munich, 8 May 2015
Göran Kauermann, Scientific Day, Jahrestagung der DAV, 29 April 2016
Statistics, Big Data
and Data Science !?
Prof. Dr. Göran Kauermann Ludwig-Maximilians-Universität
Munich, Germany
Statistics, Big Data and Data Science
• Statistics
– Founded around 1900 with the seminal work of Pearson and later Fisher
• Big Data
– The Big Topic with the three (four) V‘s
• Data Science
– Proposed by Cleveland (2001, 2005): „Learning from Data: Unifying Statistics and Computer Science“
Statistics
Statistics is … (the) science that pertains to the
collection, analysis, interpretation … and presentation of data.
(Wikipedia)
[Figure: timeline from 1900 via 1950 and 2000 to 2015 with three phases: Statistical Foundations, Statistical Modelling, Statistics and Big Data? Topics shown along the timeline, in order: likelihood inference, statistical tests, ANOVA, linear regression, EDA etc., generalised regression, computational statistics, MCMC, R-Project, smooth regression, data mining, inference in Big Data, computational statistics, Data Science.]
Statistics – the first 100 years
Is statistics ready for the next century ?
Statistics in Germany
Statistics has been prosperous in Germany in the last 10 years
• TU Dortmund and LMU Munich (BA and MA)
• HU/FU/TU Berlin, Bielefeld, Göttingen (MA)
• Ulm, Bremen, Heidelberg, Bamberg, Trier, Mainz, Magdeburg (special programs)
• Mathematics departments and economics departments
Are the German statisticians ready for the next century?
Big Data – Everybody talks about it!
Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it …
(Dan Ariely, 2013)
Big Data – The Buzzword
Financial Times Magazine (March 2014):
• „Big Data is a vague term for a massive phenomenon that has rapidly become an obsession with entrepreneurs, scientists, governments and the media.“
• „As with so many buzzwords, „big data“ is a vague term, often thrown around by people with something to sell.“
Is Big Data the new gold rush?
Big Data – the four V‘s
Big Data are characterised by the four V‘s:
• Volume – Big Data are large in size
• Variety – Big Data are complex
• Velocity – Big Data arrive at high speed and high resolution
• Veracity – Big Data may not be reliable (bias issues)
Big Data – From Data to Knowledge
Wired Magazine (June 2008):
• „The End of Theory: The data deluge makes the scientific method obsolete.“
• „The End of Theory: With enough data, the numbers speak for themselves.“
Big Data, is this the end of statistics?
Gartner‘s Hype Cycle
Big Data – The Two Extreme Opinions
The two views of Big Data:
• With enough data we don‘t need theory and we can explain the world.
• Big Data is just a hype and will die out sooner or later.
Big Data, a challenge or the end of statistics?
Big Data – End of Statistics?
Let’s answer the question with Big Data:
https://www.google.com/trends/
Google Trends records which keywords are searched on Google, when, where, etc.
Big Data – End of Theory?
Is Big Data the death of Statistics?
Statisticians have spent the past 200 years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster and cheaper these days, but we must not pretend that the traps have all been made safe.
(Financial Times Magazine, Tim Harford, 28.3.2014)
Data Scientists –
Why are they needed by the industry?
Big Data – Example 1
Source: Lazer et al, 2014, Science, Vol. 343.
Google Flu Trends
The trend worked nicely, but then it failed, since:
• Correlation is not equal to causation
• „What causes what“ needs a model and data
Big Data – Example 2
Price Elasticity Estimation
• Research project with a large German airline
• Problem: Estimation of price elasticity
• Huge (!!) database containing prices and ticket sales
• Regression model:
Ticket Sales = s(Price) + error
• Problem: The price is NOT exogenous !!
• Demand depends on price and price depends on demand
• The data-based price elasticity is overestimated
• The problem is well known in econometrics
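The endogeneity problem can be illustrated with a small simulation. This is a minimal sketch, not the analysis from the airline project: it replaces the smooth function s(Price) by a straight line, and all numbers (an assumed true slope of -2, the noise levels, the hypothetical `price_reaction` parameter) are chosen purely for illustration.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def simulate(n, price_reaction, true_slope=-2.0, seed=1):
    """Simulate prices and sales and return the naive regression slope.

    price_reaction > 0 means the seller raises the price when demand is
    high, so price depends on demand (assumed mechanism, for illustration).
    """
    rng = random.Random(seed)
    prices, sales = [], []
    for _ in range(n):
        demand_shock = rng.gauss(0.0, 1.0)  # unobserved demand level
        price = 10.0 + price_reaction * demand_shock + rng.gauss(0.0, 0.5)
        quantity = 50.0 + true_slope * price + demand_shock + rng.gauss(0.0, 0.5)
        prices.append(price)
        sales.append(quantity)
    return ols_slope(prices, sales)


slope_exogenous = simulate(20000, price_reaction=0.0)   # price set independently
slope_endogenous = simulate(20000, price_reaction=1.0)  # price reacts to demand
print(slope_exogenous, slope_endogenous)
```

With `price_reaction=0.0` the fitted slope should land near the assumed true slope. Once price reacts to demand, the least-squares slope drifts away from it no matter how large the sample is, so more data alone does not remove the bias.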
Big Data – Example 3
Big Computing versus Sampling
• Big Data often demand Big Computing
• Information, however, can also be retrieved from a sample
• Example: Network Data (e.g. Facebook)
• Statisticians know how to sample
• The „Sonntagsfrage“ (the German „Sunday question“ poll on voting intention) asks roughly 1,000 people about their political views
⇒ sample 1,000 out of about 60 million
⇒ margin of error (standard deviation)
• Why is it better to ask just 1,000 people and not 60 million, if possible?
⇒ sampling error diminishes, but
⇒ „sampling bias“ occurs
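The trade-off between sampling error and sampling bias can be made concrete with a short calculation. The sketch below assumes simple random sampling, a party share of 50% (the worst case for the standard error) and a purely hypothetical bias of 3 percentage points for a very large, self-selected sample; none of these figures come from the slides.

```python
import math


def standard_error(p, n):
    """Standard deviation of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error (normal approximation)."""
    return z * standard_error(p, n)


def root_mean_squared_error(bias, p, n):
    """Total error: systematic bias and sampling error combined."""
    return math.sqrt(bias ** 2 + standard_error(p, n) ** 2)


# Random sample of 1,000 respondents with no bias.
small_random = root_mean_squared_error(0.0, 0.5, 1_000)

# Hypothetical huge but self-selected sample with an assumed bias of 0.03.
large_biased = root_mean_squared_error(0.03, 0.5, 6_000_000)

print(f"margin of error, n=1000: {margin_of_error(0.5, 1_000):.3f}")
print(f"total error, random n=1000: {small_random:.4f}")
print(f"total error, biased n=6,000,000: {large_biased:.4f}")
```

The sampling error shrinks with the square root of n, but the bias term does not shrink at all, so under these assumptions the small random sample has the smaller total error.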
Big Data and Statistics
We conclude:
• Big Data does not make theory (thinking) obsolete.
• Big Data analytics needs statistical thinking and reasoning.
• But: Statistics also needs to tackle Big Data issues.
Big Data and Statistics
David Spiegelhalter:
• „Complete bollocks. Absolute nonsense.“
• „There are a lot of small data problems that occur in big data. They don‘t disappear because you‘ve got lots of the stuff. They get worse.“
Other statements:
• David Hand: “We have a new resource here. But nobody wants ‘data’. What they want are the answers.”
• Patrick Wolfe: “It’s the wild west right now. People who are clever and driven will twist and turn and use every tool to get sense out of these data sets, and that’s cool. But we’re flying a little bit blind at the moment.”
Big Data
- A further (single) Statistician‘s View
• Statistics needs more involvement in the Big Data wave
• Statistical ideas and models are useful and need to be scaled up
• The „old statistics“ is not dying out (p-values and small samples remain useful)
• A new paradigm: Approximate data analysis may be better than optimal fitting procedures
(Göran Kauermann, 2016)
Statistics versus Data Scientists
What is Data Science?
• Cleveland (2001): „Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics“
• Data Science = Statistics of tomorrow ?
or
• Data Science = Statistics carried out by non-statisticians?
Statistics and Data Scientists
[Figure: timeline with the marks 1900, 1950 and 2000, showing Statistics, Computer Science and Data Science]
Quotes from Cleveland
„Computer scientists, waking up to the value of the information stored, processed and transmitted by today‘s computing environments, have attempted to fill the void. One current of work is data mining. But the benefit to the data analyst has been limited, because the knowledge among computer scientists about how to think of and approach the analysis of data is limited, just as the knowledge of computing environments by statisticians is limited. A merger of the knowledge bases would produce a powerful force for innovation.“
Data Science – What is it about ?
Data Science combines informatics and statistics in order to extract information from real data.
“Data Science is a blend of Red-Bull-fuelled hacking and espresso-inspired statistics” (Mike Driscoll, CEO of Metamarkets)
Data Scientists – What do they do?
• Retrieve information from data
• Deal with data confidentiality
• Communicate the results
• Use statistical models
• Apply machine learning tools
Source: C. O‘Neil, R. Schutt (2014), Doing Data Science, O‘Reilly Media Inc., USA.
Statistics and Computer Science
The stereotypes:
• Computer Scientists predict and forecast
• Statisticians model and interpret
But both tackle the question:
How can we make the data speak?
Data Science
• The definition of Data Science is not consolidated
• We consider Data Science as
– 50% Statistics and
– 50% Informatics (Computer Science)
• Master in Data Science at LMU (Elite-Network Bavaria)
Data Science @ LMU
• Program starts Oct 2016
• International Program
• 50% Statistics and 50% Informatics
www.datascience-munich.de
Challenges in Data Science
• Collaboration
⇒ Big Data occur outside of statistics/informatics
• Training
⇒ More master programs in Data Science
• Consolidation
⇒ Data Science is Data Science
Statistics, Big Data and Data Science
• Statistics and Computer Science merged into Data Science
• Big Data are the driving force
• “Classical” Statistics remains important
• New challenges in Statistics/Informatics
Challenges in Statistics
• Do we need optimal solutions?
⇒ Approximate inference
⇒ Smart and real time computing
⇒ Parallel Computing
• Do we need asymptotic statistics?
⇒ We have large n, so why bother about mathematical asymptotics?
⇒ What does n → ∞ really mean?
• Do we need significance tests?
⇒ Model Selection is important
⇒ Significance versus relevance
• Do we need statistical models at all?
⇒ Stochastic character remains in big data
⇒ Simple stochastic models are too simple
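The point "significance versus relevance" can be sketched numerically. The example below assumes a two-sample z-test with known standard deviation and a hypothetical, practically negligible mean difference of 0.01 standard deviations; the setting is an illustration, not taken from the slides.

```python
import math


def two_sided_p_value(effect, n_per_group, sd=1.0):
    """p-value of a two-sample z-test for a mean difference with known sd."""
    z = effect / (sd * math.sqrt(2.0 / n_per_group))
    return math.erfc(abs(z) / math.sqrt(2.0))


tiny_effect = 0.01  # assumed difference: one hundredth of a standard deviation
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p_value(tiny_effect, n))
```

The effect stays the same in every row, only the sample size grows. With a million observations per group the test rejects decisively, although the difference may be irrelevant in practice, which is why model selection and effect sizes matter more than p-values alone when n is large.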
• Do we need correlation?
⇒ Dependence structure is relevant
⇒ Copula or more complex models
• Do we need linear models at all?
⇒ Linear models and linear procedures are fast
⇒ Linear approximations are often sufficient
After all: The statistical paradigm remains
[Figure: The Statistical Approach, a diagram linking Questions, Answers, Data, Estimates and Model]