(Big) Data Analytics:
From Word Counts to Population Opinions
Mark Keane
Outline
• What’s New About (Big) Data Analytics
• 3 Sample Cases:
– Google Queries Predicting Epidemics
– Networks of Influence
What’s New?: The Suggestion of a…
• Brave new world of (new) data analysis… that can
• Handle vast amounts of data effortlessly… with
• Instant press-of-a-button answers… from
• Vast server farms of (almost free) computation
• But… there are significant issues
What’s Old?
• Good old-fashioned data analysis
• Many statistical ideas are very familiar
• Many research problems are familiar
• Proper collection of data is important
What’s Really New? An Approach
• Tipping-point with Very Large Data Sets
» from 100s to 1,000,000,000s of data points
• Unusual Types of Data
» video, text, thumbs-up, unstructured data
• Non-standard Data Sources
» social media (FB, Tweets), news, phones
• Data is not conventionally measured
In this New Big-Data World…
• Who we know says a lot about who we are…
– Facebook friends, LinkedIn network, tweet followers
• What we write says a lot about what we think…
– text in books, news, blogs, social media and so on
• Where we are located says a lot about us…
– location-based sensing, GPS, IP addresses
• What we do says a lot about our decisions/interests…
– what we buy, websites visited, YouTube videos watched, news retweeted, items shared and so on…
Case 1: Predicting Flu from Searches
• Google Flu Trends (GFT):
– aggregates search data, counting influenza keywords
• US Centers for Disease Control and Prevention (CDC):
– tracks influenza-like illnesses (ILIs) in outpatient data
• From 2003-2009:
– GFT showed high correlations with ILI stats (ILINet)
Good Correlations (Initially…)
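The comparison behind Case 1 can be sketched in a few lines: count flu-related queries per week, then correlate the counts with the reported ILI rate. This is only an illustration under stated assumptions. The keyword set, the query log and the ILI figures below are all made up for the example, and the real GFT query set and model are not described in these slides.

```python
from collections import Counter
from math import sqrt

# Hypothetical keyword set; the actual GFT query terms are not given here.
FLU_KEYWORDS = {"flu", "influenza", "fever"}


def weekly_flu_counts(query_log):
    """Count queries containing at least one flu keyword, per week.

    query_log is an iterable of (week_number, query_text) pairs.
    """
    counts = Counter()
    for week, text in query_log:
        if FLU_KEYWORDS & set(text.lower().split()):
            counts[week] += 1
    return counts


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy query log (invented for illustration only).
query_log = [
    (1, "flu symptoms"), (1, "cheap flights"),
    (2, "flu remedies"), (2, "influenza vaccine"), (2, "weather"),
    (3, "fever and cough"), (3, "flu shot"), (3, "how long does flu last"),
    (4, "football scores"), (4, "flu symptoms"),
]
# Toy ILI rates per week (invented; not real CDC figures).
ili_rate = {1: 1.0, 2: 2.1, 3: 2.9, 4: 1.2}

counts = weekly_flu_counts(query_log)
weeks = sorted(ili_rate)
search_series = [counts[w] for w in weeks]
ili_series = [ili_rate[w] for w in weeks]
r = pearson_r(search_series, ili_series)
print(f"weekly flu-query counts: {search_series}, r = {r:.2f}")
```

A high r on the period used to choose the keywords says little about later periods, which is the point of the "(Initially…)" qualifier.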
Hang on a sec…
The Message
• What we do says a lot about our concerns…
– if I think I have flu, I look it up on Google
• Here, people’s illness is being defined by
– their search behaviour and keywords
• Population behaviour can be predicted (in locations)
– by aggregating these searches
• But,
Case 2: Showing Networks of Influence
• Tracking news on Social Networks
– terrorists release YouTube videos
– politicians comment on Facebook
– celebs tweet intimacies
• Who you comment on, what you comment on, and where can reveal networks of influence
• Storyful is using the Insight system to curate the lists of sources and…
Networks in Syrian Conflict
Network of Syrian-related Twitter accounts active during late 2013
European Parliament Networks
Data analysed for 584 MEPs on Twitter during July-Sept 2014.
Political Groupings…
Data analysed for 584 MEPs on Twitter during July-Sept 2014.
The Outlier Party…
Data analysed for 584 MEPs on Twitter during July-Sept 2014.
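As a rough illustration of how such networks can be read, the sketch below builds a mention graph from (source, target) pairs, ranks accounts by mentions received, and measures how many links stay inside the same group. The account names, the edges and the group labels are hypothetical, and in-degree and within-group share are simple stand-ins rather than the measures used to produce the network figures.

```python
from collections import Counter

# Hypothetical (source_account, mentioned_account) pairs, e.g. replies or retweets.
mentions = [
    ("ann", "bob"), ("cat", "bob"), ("dan", "bob"),
    ("bob", "ann"), ("eve", "fay"), ("fay", "eve"), ("dan", "eve"),
]
# Hypothetical group label (e.g. a political grouping) for each account.
group = {"ann": "A", "bob": "A", "cat": "A", "dan": "A", "eve": "B", "fay": "B"}


def influence_ranking(edges):
    """Rank accounts by the number of mentions they receive (in-degree)."""
    return Counter(target for _, target in edges).most_common()


def within_group_share(edges, labels):
    """Fraction of edges joining two accounts with the same label.

    A crude homophily measure: values near 1 mean accounts mostly
    mention others in their own group.
    """
    same = sum(1 for source, target in edges if labels[source] == labels[target])
    return same / len(edges)


ranking = influence_ranking(mentions)
share = within_group_share(mentions, group)
print("most mentioned:", ranking[0])
print(f"within-group share of mentions: {share:.2f}")
```

On real data, a group whose within-group share is unusually high, with few links to the rest, would show up as a separate cluster in a network plot.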
The Message
• Who we know says a lot about who we are…
– Facebook friends, LinkedIn network, tweet followers
• I can be defined by
– the people I know/like/respect/follow (homophily)
• My behaviour can be predicted by
– assuming that like people act alike
• But,
Case 3: Tracking Herding & Market Bubbles
• Word frequencies reveal power laws (Zipf’s Law)
• A bubble would show in herd-like use of language
• Power laws change systematically with herding
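Zipf's Law says that word frequency falls off roughly as a power of frequency rank, so log(frequency) plotted against log(rank) is close to a straight line. A minimal sketch of estimating that slope follows. The least-squares fit on log-log values is a simple, rough estimator chosen for illustration, not necessarily the fitting method used in this study, and the toy corpus is constructed so that frequencies are exactly 12/rank.

```python
import math
import re
from collections import Counter


def zipf_slope(text):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct words. A slope near -1 corresponds to the
    classic Zipf distribution.
    """
    words = re.findall(r"[a-z']+", text.lower())
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy corpus with frequencies 12, 6, 4, 3 (that is, 12 / rank).
toy_text = " ".join(["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3)
slope = zipf_slope(toy_text)
print(f"fitted log-log slope: {slope:.3f}")
```

To look for herding, one could fit the slope separately for each time window of text and watch how it shifts. That windowing step is an assumption about how the idea might be applied and is not shown here.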
Agreement ’tween Commentators…
Analysing Text in News
• 17,713 finance articles (FT, NYT, BBC)
• 4 years (Jan 2006-Jan 2010) including 2007 crash
• 10,418,266 words, from which we extract nouns and verbs
• Correlations for verb distributions show:
– DJIA (r = .79), FTSE-100 (r = .78), NIKKEI-225 (r = .73)
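One way to turn a verb distribution into a single number per time period, which could then be compared against an index series, is sketched below. Shannon entropy is used here purely as an illustrative stand-in for "agreement" (lower entropy means commentators concentrate on fewer verbs). It is not claimed to be the statistic behind the correlations above, and the period labels and verb lists are invented.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical verbs already extracted from articles, grouped by period.
verbs_by_period = {
    "period 1": ["rise", "fall", "gain", "slip"],   # varied language
    "period 2": ["rise", "rise", "rise", "gain"],   # herd-like language
}

entropy_by_period = {
    period: shannon_entropy(verbs) for period, verbs in verbs_by_period.items()
}
for period, value in entropy_by_period.items():
    print(f"{period}: entropy = {value:.3f} bits")
```

In this toy example the second period has lower entropy because one verb dominates, which is the kind of narrowing a herding measure is meant to pick up.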
The Message
• What we write says a lot about what we think…
– text in books, news, blogs, social media and so on
• Here, agreement in a population is being captured by
– carefully treated word frequencies
• Population beliefs can be tracked
– by a distributional analysis of changes in words
• But,
In this New Big-Data World…
• Who we know
– Facebook friends, LinkedIn network, tweet followers
• What we write
– text in books, news, blogs, social media and so on
• Where we are located
– location-based sensing, GPS, IP addresses
• What we do
– what we buy, websites visited, YouTube videos watched, news retweeted, items shared and so on…
…TINELY AVAILABLE AT A SMARTPHONE NEAR YOU
Promises and Caveats…
• Data analytics bears promise in tracking and predicting:
– population actions, beliefs, opinions, illness…
– changes in those actions, beliefs, opinions, illnesses…
• Challenges are in finding:
– right treatment of the data: selection/collation of data is still critical, combining multiple data sources
– right analytic methods: which, if any, are appropriate
– right interpretations: old-fashioned exclusion of variables / interpretation