(Big) Data Analytics: From Word Counts to Population Opinions

Academic year: 2021

Mark Keane

Outline

What’s New About (Big) Data Analytics

3 Sample Cases:

– Google Queries Predicting Epidemics

– Networks of Influence

– Herding & Market Bubbles

What’s New?: The Suggestion of a…

Brave new world of (new) data analysis…that can

Handle vast amounts of data effortlessly…with

Instant press-of-a-button answers…from

Vast server farms of (almost free) computation

But

… there are significant issues

What’s Old?

Good old-fashioned data analysis

Many statistical ideas are very familiar

Many research problems are familiar

Proper collection of data is important

What’s Really New? An Approach

Tipping-point with Very Large Data Sets
» from 100s to 1,000,000,000s of data points

Unusual Types of Data
» video, text, thumbs-up, unstructured data

Non-standard Data Sources
» social media (FB, Tweets), news, phones

Data is not conventionally measured

In this New Big-Data World…

Who we know says a lot about who we are…

– Facebook friends, LinkedIn network, Twitter followers

What we write says a lot about what we think…

– text in books, news, blogs, social media and so on

Where we are located says a lot about us…

– location-based sensing, GPS, IP addresses

What we do says a lot about our decisions/interests…

– what we buy, websites visited, YouTube videos watched, news re-tweeted, items shared and so on…

Case 1: Predicting Flu from Searches

Google Flu Trends (GFT):

aggregates search data, counting influenza keywords

US Centers for Disease Control and Prevention (CDC):

tracks influenza-like-illnesses (ILIs) in outpatient data

From 2003 to 2009:

GFT showed high correlations with ILI stats (ILINet)
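
The two steps on this slide (counting influenza-related queries, then correlating the counts with ILI figures) can be sketched roughly as follows. This is only an illustration, not how GFT itself worked: the keyword list, the function names and the weekly numbers are all made up for the example.

```python
from math import sqrt

# Hypothetical keyword list; the real GFT query terms are not given in the slides.
FLU_KEYWORDS = {"flu", "influenza", "fever"}


def count_flu_queries(queries):
    """Count queries that contain at least one influenza keyword."""
    return sum(1 for q in queries if FLU_KEYWORDS & set(q.lower().split()))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented weekly figures for illustration only (not GFT or CDC data).
weekly_flu_queries = [120, 150, 310, 480, 620, 540, 300, 160]
weekly_ili_percent = [1.1, 1.3, 2.4, 3.9, 5.2, 4.6, 2.7, 1.5]

r = pearson_r(weekly_flu_queries, weekly_ili_percent)
print(f"r = {r:.2f}")
```

A high r on past data says only that the two series moved together over that period; it does not guarantee the relationship will hold later.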

Good Correlations (Initially…)


Hang on a sec…

The Message

What we do says a lot about our concerns…

– if I think I have flu, I look it up on Google

Here, people’s illness is being defined by their search behaviour and keywords

Population behaviour can be predicted (in locations) by aggregating these searches

But,

Case 2: Showing Networks of Influence

Tracking news on Social Networks

terrorists release YouTube videos

politicians comment on Facebook

celebs tweet intimacies

Who you comment on, what you comment on, and where can reveal networks of influence

Storyful is using Insight system to curate the lists of sources and

Networks in Syrian Conflict

Network of Syrian-related Twitter accounts active during late 2013
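
As a rough illustration of how such a network might be assembled, the sketch below builds a directed graph from (source, target) mention pairs and ranks accounts by in-degree, i.e. by how many distinct accounts mention them. In-degree is just one simple proxy for influence, chosen here for brevity; the account names are hypothetical and this is not claimed to be the method behind the slide.

```python
from collections import Counter, defaultdict


def build_mention_graph(interactions):
    """Map each source account to the set of accounts it mentions."""
    graph = defaultdict(set)
    for source, target in interactions:
        if source != target:  # ignore self-mentions
            graph[source].add(target)
    return graph


def rank_by_in_degree(graph):
    """Rank accounts by the number of distinct accounts that mention them."""
    in_degree = Counter()
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    return in_degree.most_common()


# Hypothetical (source, target) pairs; repeated pairs count only once.
interactions = [
    ("acct_a", "acct_hub"), ("acct_b", "acct_hub"), ("acct_c", "acct_hub"),
    ("acct_a", "acct_b"), ("acct_hub", "acct_b"), ("acct_a", "acct_hub"),
]

ranking = rank_by_in_degree(build_mention_graph(interactions))
print(ranking)
```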


European Parliament Networks

Data analysed for 584 MEPs on Twitter during July-Sept 2014.


Political Groupings…

Data analysed for 584 MEPs on Twitter during July-Sept 2014.
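
One simple way to think about how groupings could emerge from interaction data is to assume that an account resembles the accounts it interacts with, and to guess its group by majority vote among its neighbours. The sketch below shows that idea only; the identifiers and group labels are invented, and the analysis behind the slide may have used a different technique.

```python
from collections import Counter


def predict_group(account, neighbours, known_groups):
    """Guess an account's group as the most common group among its neighbours.

    Returns None when no neighbour has a known group. Ties are broken by
    whichever tied group was encountered first.
    """
    votes = Counter(
        known_groups[n] for n in neighbours.get(account, ()) if n in known_groups
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical interaction lists and group labels.
neighbours = {
    "mep_x": ["mep_1", "mep_2", "mep_3", "mep_4"],
    "mep_y": [],
}
known_groups = {"mep_1": "group_A", "mep_2": "group_A", "mep_3": "group_B"}

print(predict_group("mep_x", neighbours, known_groups))
print(predict_group("mep_y", neighbours, known_groups))
```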


The Outlier Party…

Data analysed for 584 MEPs on Twitter during July-Sept 2014.


The Message

Who we know says a lot about who we are…

– Facebook friends, LinkedIn network, Twitter followers

I can be defined by

– the people I know/like/respect/follow (homophily)

My behaviour can be predicted by assuming that like people act alike

But,

Case 3: Tracking Herding & Market Bubbles

Word frequencies reveal power laws (Zipf’s Law)

A bubble would show in herd-like use of language

Power laws change systematically with herding
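
To make the power-law idea concrete: under Zipf's Law the frequency of the word at rank k is roughly proportional to k^(-alpha), so log-frequency plotted against log-rank is close to a straight line with slope -alpha. The sketch below estimates alpha with an ordinary least-squares fit on the log-log values. This is a crude estimator picked for simplicity, with hypothetical function names; the slides do not say which fitting method was actually used.

```python
import math
import re
from collections import Counter


def rank_frequencies(text):
    """Return word frequencies sorted from most to least frequent."""
    words = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def fit_power_law_exponent(frequencies):
    """Estimate alpha in frequency ~ rank**(-alpha) by least squares on logs."""
    if len(frequencies) < 2:
        raise ValueError("need at least two ranked frequencies")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


# Synthetic frequencies that follow an exact 1/rank law, so alpha should be ~1.
ideal = [1000 / rank for rank in range(1, 51)]
print(round(fit_power_law_exponent(ideal), 3))

# Toy text, far too small for a meaningful fit; shown only to illustrate usage.
print(rank_frequencies("the cat and the dog and the bird"))
```

Fitting the exponent separately for successive time windows of text would give a series whose changes could then be compared with market movements.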


Agreement ‘tween Commentators…

Analysing Text in News

17,713 finance articles (FT, NYT, BBC)

4 years (Jan 2006-Jan 2010) including 2007 crash

10,418,266 words; we extract nouns and verbs

Correlations for verb distributions show:

DJIA (r = .79), FTSE-100 (r = .78), NIKKEI-225 (r = .73)

The Message

What we write says a lot about what we think…

– text in books, news, blogs, social media and so on

Here, agreement in a population is being captured by carefully treated word frequencies

Population beliefs can be tracked by a distributional analysis of changes in words

But,

In this New Big-Data World…

Who we know,

– Facebook friends, LinkedIn network, Twitter followers

What we write,

– text in books, news, blogs, social media and so on

Where we are located,

– location-based sensing, GPS, IP addresses

What we do

– what we buy, websites visited, YouTube videos watched, news re-tweeted, items shared and so on…

…ROUTINELY AVAILABLE AT A SMARTPHONE NEAR YOU

Promises and Caveats…

Data analytics holds promise in tracking and predicting:

population actions, beliefs, opinions, illnesses…

changes in those actions, beliefs, opinions, illnesses…

Challenges are in finding:

the right treatment of the data: selection/collation of data is still critical, as is combining multiple data sources

the right analytic methods: which, if any, are appropriate

the right interpretations: old-fashioned exclusion of variables/interpretation
