(1)

Taking the world’s pulse

Antal van den Bosch

Language Machines, Centre for Language Studies

With Florian Kunneman, Ali Hürriyetoglu, Iris Hendrickx

ICT.Open 2015, 24 March 2015

(2)

Language Machines

(3)

[Figure: estimated number of web pages indexed by Google and Bing per year, 2007–2015. Y-axis: estimated number of web pages, 0 to 55 billion. Numbered markers (#1 to #36) indicate events, including: Launch of Bing (#9), Caffeine update (#14), Launch of BingBot crawler (#18), Panda 1.0 update (#20), Panda 4.0 update (#31), Catapult update (#33).]

Van den Bosch, Bogers, and De Kunder. A Longitudinal Analysis of Search Engine Index Size. To appear in Proceedings of ISSI-2015

(4)

COMMIT/ TWIQS.NL

HTTP://TWIQS.NL

We continuously search Twitter for common Dutch words (keyword tracking). In addition, we collect all tweets from the 5,000 most active Dutch-language users of the previous month (user tracking).

This generates close to 4 million tweets per day
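
As a rough illustration of the two tracking modes, here is a minimal sketch. It is not the Twiqs.nl implementation: the function names, the sample tweets, and the keyword set are all hypothetical, and real collection would go through the Twitter API rather than an in-memory list.

```python
from collections import Counter


def top_users(tweets, n=5000):
    """User tracking: return the n most active authors among (user, text) pairs."""
    counts = Counter(user for user, _ in tweets)
    return [user for user, _ in counts.most_common(n)]


def matches_keywords(text, keywords):
    """Keyword tracking: True if the tweet contains at least one tracked word."""
    tokens = set(text.lower().split())
    return not tokens.isdisjoint(keywords)


# Hypothetical sample standing in for last month's collected tweets.
last_month = [
    ("anna", "ik ga eten"),
    ("bob", "het is mooi weer"),
    ("anna", "de trein is laat"),
]
tracked_users = top_users(last_month, n=1)

# Hypothetical list of common Dutch words used as search terms.
keywords = {"het", "de", "een"}
```

Common function words such as "het" or "de" occur in most Dutch sentences, which is why tracking a small set of them can capture a large share of Dutch-language tweets.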

(5)

COMMIT/ TWIQS.NL

“ETEN”

(6)

COMMIT/ TWIQS.NL

“HEDDE”

(7)

COMMIT/ PROVINCIAL ELECTIONS MARCH 2015

(8)

COMMIT/ EXAMS IN DREAM TEXTS

(9)

Looking forward or looking backward to an event

(10)

The language of time: data-driven time estimates

(11)
(12)
(13)

Results Table, figure

Mean Absolute Error relative to event time
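
For reference, mean absolute error over time-to-event estimates can be computed as in the sketch below. The choice of hours as the unit and the sample values are assumptions made for illustration; they are not results from the talk.

```python
def mean_absolute_error(estimated, actual):
    """Average absolute difference between estimated and actual time-to-event."""
    if len(estimated) != len(actual) or not estimated:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(e - a) for e, a in zip(estimated, actual)) / len(estimated)


# Hypothetical estimates (hours until the event) for three tweets.
estimated_hours = [20.0, 50.0, 3.0]
actual_hours = [24.0, 48.0, 2.0]
mae = mean_absolute_error(estimated_hours, actual_hours)
```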

(14)

Lama Events (http://applejack.science.ru.nl/lamaevents/)

(15)

•  Hürriyetoglu, A., Kunneman, F., and Van den Bosch, A. (2013). Estimating the time between Twitter messages and future events. In Proceedings of the 13th Dutch-Belgian Information Retrieval Workshop, pp. 20–23.

•  Hürriyetoglu, A., Oostdijk, N., and Van den Bosch, A. (2014). Estimating time to event from tweets using temporal expressions. In Proceedings of the Workshop on Language Analysis in Social Media, LASM-2014, Gothenburg, Sweden, pp. 8–16. ACL.

•  Kunneman, F., and Van den Bosch, A. (2012). Leveraging unscheduled event prediction through mining scheduled event tweets. In N. Roos, M. Winands, and J. Uiterwijk (Eds.), Proceedings of the 24th Benelux Conference on Artificial Intelligence, Maastricht, The Netherlands, pp. 147–154.

•  Kunneman, F., Hürriyetoglu, A., Oostdijk, N., and Van den Bosch, A. (2014). Timely identification of event start dates from Twitter. Computational Linguistics in the Netherlands Journal, 4, pp. 39–52.

•  Kunneman, F., Liebrecht, C., Van den Bosch, A., and Van Mulken, M. (2014). Signaling sarcasm: From hyperbole to hashtag. Information Processing & Management, published online.

•  Kunneman, F., Liebrecht, C., and Van den Bosch, A. (2014). The (un)predictability of emotional hashtags in Twitter. In Proceedings of the Workshop on Language Analysis in Social Media, LASM-2014, Gothenburg, Sweden, pp. 26–34. ACL.

•  Kunneman, F., and Van den Bosch, A. (2014). Event detection in Twitter: A machine learning approach based on term pivoting. In F. Grootjen, M. Otworowska, and J. Kwisthout (Eds.), Proceedings of the 26th Benelux Conference on Artificial Intelligence, pp. 65–72.

•  Liebrecht, C., Kunneman, F., and Van den Bosch, A. (2013). The perfect solution for detecting sarcasm in tweets #not. In A. Balahur, E. van der Goot, and A. Montoyo (Eds.), Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA-2013, pp. 29–37. New Brunswick, NJ: ACL.

•  Tops, H., Kunneman, F., and Van den Bosch, A. (2013). Predicting time-to-event from Twitter messages. In K. Hindriks, M. de Weerdt, B. van Riemsdijk, and M. Warnier (Eds.), Proceedings of the 25th Benelux Artificial Intelligence Conference, BNAIC-2013. Delft, The Netherlands.

(16)

Information need

Linked data

NLP

Text Text Text

Words, phrases

Search, translate

Bags of words

Classify, filter, recommend

(17)

Information need

Text Text Text

… Lee Harvey Oswald may not have been acting alone in the assassination of John F. Kennedy …

Assassination plot

Conspiracy theory

Named entities: LHO, JFK

Semantic roles: LHO actor, JFK patient

Negator: not; negated: ?

(18)

Information vs. language

•  Every little fact can be (and is) expressed in an endless variety of linguistic expressions

•  Language is a great way of hiding information

Navajo Code Talkers Henry Bake and George Kirk, 12/1943 (ARC 593415)

(19)
(20)

Text mining of dream descriptions

•  Dreambank (http://www.dreambank.net/)

-  archive of personal dream descriptions collected in different studies

-  61 collections (approx. 30K dreams)

–  143 dreams of blind dreamers

–  469 dreams of college women, late 1940s

–  Phil, teens: 106 dreams; late 20s: 219 dreams; retirement: 180 dreams

–  490 female, 491 male dreams annotated with Hall & Van de Castle norms

•  LDA (Latent Dirichlet Allocation)

-  50 topics

-  2000 iterations, Gibbs sampling

-  Look for significant differences (log-likelihood test, G²) in topic distributions in female vs male dreams
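
The log-likelihood statistic can be sketched as follows for a single topic. This is a generic G² computation on a contingency table, not the exact procedure used in the study: how a topic is counted as present in a dream is an assumption here, and the counts are invented. Only the group sizes (490 female, 491 male dreams) come from the slide above.

```python
import math


def g2(table):
    """Log-likelihood ratio statistic G^2 = 2 * sum(O * ln(O / E)) for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                stat += observed * math.log(observed / expected)
    return 2.0 * stat


# Hypothetical counts: dreams in which one topic is prominent vs. not.
# Rows: female dreams (490 in total), male dreams (491 in total).
topic_table = [[30, 460], [10, 481]]
statistic = g2(topic_table)

# 3.84 is the chi-square critical value for 1 degree of freedom at p = 0.05.
is_significant = statistic > 3.84
```

With 50 topics tested, a correction for multiple comparisons would normally be considered; the slides do not say whether one was applied.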

(21)

Generic dream topics

•  store money buy get pay bill man grocery lot counter bank shopping machine give tickets shop put dollars change bought

•  bathroom toilet hair shower water go room bath floor clean wash see naked sink tub pee bathtub face cut towel

•  book paper read books find picture write writing reading letter pictures written name looking look something library letters office computer

(22)

Dream topics, women (significant)

•  house mother father baby old boy brother family girl home little sister children years parents child see aunt son kids

•  church wedding people married front sit seats seat aisle back table sitting ceremony group place priest getting side room chair

•  wearing white dress black blue red clothes hair dressed shirt put shoes wear pair pants hat green suit see old

•  play dance stage music audience group playing song people dancing singing show sing part piano good band performance do concert

•  pool swimming water swim go board end burt dive diving bottom little bathing side deep suit lake watching shallow underwater

(23)

Dream topics, men (significant)

•  car driving drive road truck get going go seat side back cars stop front parked turn driver street station parking

•  game ball playing team play basketball football field high baseball coach good balls hit players tennis school man other player

•  gun men man shoot shot people fire kill shooting guns police war run killed enemy escape being soldiers fight get

•  building stairs floor go get elevator people going door top hall room roof wall high office find steps way ladder

(24)

Can we predict hashtag usage?

(25)

Distant supervision for emotion detection in Twitter

Leveraging uncontrolled labeling to obtain large amounts of training data

Two forms of emotion hashtag usage:

-  Meta-communication

-  Add emotion
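
A minimal sketch of the distant-supervision idea: the author's own hashtag serves as the label, and it is stripped from the text so that a classifier cannot simply memorize it. The tag set and function name are hypothetical, and the whitespace tokenization is a simplification (a hashtag followed directly by punctuation would be missed).

```python
# Hypothetical emotion hashtags used as noisy labels.
EMOTION_TAGS = {"#blij", "#boos"}


def make_training_example(tweet, tags=EMOTION_TAGS):
    """Return (text without emotion hashtags, label), or None if no tag occurs."""
    tokens = tweet.split()
    found = [t.lower() for t in tokens if t.lower() in tags]
    if not found:
        return None
    text = " ".join(t for t in tokens if t.lower() not in tags)
    return text, found[0]


example = make_training_example("Eindelijk vakantie #blij")
```

Tweets where the hashtag adds the emotion, rather than commenting on emotion already present in the text, are the harder case: once the tag is removed, little evidence of the emotion may remain.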

(26)
(27)

Hashtags conveying sarcasm

#sarcasm

#irony

#cynicism

#not

90% of tweets with a selected hashtag are indeed sarcastic (Cohen’s Kappa: 0.44)
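
For readers unfamiliar with the agreement measure, Cohen's kappa for two annotators can be computed as in the sketch below. The labels are invented for illustration and are not the annotations behind the figure above.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments (1 = sarcastic, 0 = not) on ten tweets.
annotator_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
annotator_b = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
```

In this toy example the annotators agree on 8 of 10 tweets, yet kappa is only 0.375, because most tweets fall in one class and chance agreement is therefore high.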

(28)

1: Sarcastic tweets

2: Random tweets: 406,439

3: Day of tweets: February 1, 2013; 2,246,904 tweets, 353 with sarcastic hashtag

(29)

Results

(30)

Analysis: tweets without sarcasm hashtag

The 250 tweets without a sarcasm hashtag that the classifier ranked highest by confidence for sarcasm

Three annotators

Is a tweet sarcastic or not? (strict decision)

(31)

Analysis: tweets without sarcasm hashtag

(32)

What next?

•  Profiling of Twitter users

-  Tweetgenie: age, gender

-  Levels of aggression, sarcasm

•  Mood and emotion detection in events

-  Emotion classifiers

-  Beyond list-based sentiment analysis

•  Predicting events of specific types

-  Warnings for disasters

-  Occupations

-  Illegal parties

(33)
