Taking the world’s pulse
Antal van den Bosch
Language Machines, Centre for Language Studies
With Florian Kunneman, Ali Hürriyetoglu, Iris Hendrickx
ICT.Open 2015, 24 March 2015
[Figure: Estimated number of web pages for Google and Bing, 2007–2015. x-axis: year; y-axis: estimated number of web pages, 0 to 5×10^10 (ticks in steps of 5 billion). Numbered markers (#1–#36) annotate events on the curves, including: Launch of Bing (#9), Caffeine update (#14), Launch of BingBot crawler (#18), Panda 1.0 update (#20), Panda 4.0 update (#31), Catapult update (#33).]
Van den Bosch, Bogers, and De Kunder. A Longitudinal Analysis of Search Engine Index Size. To appear in Proceedings of ISSI-2015
COMMIT/ TWIQS.NL
HTTP://TWIQS.NL
We continuously search for common Dutch words on Twitter (keyword tracking).
Additionally, we collect all tweets of the 5000 most active users tweeting in Dutch in the previous month (user tracking).
This generates close to 4 million tweets per day.
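The two collection strategies above can be sketched in a few lines. This is a minimal illustration only, not the TWIQS.NL implementation: the tweet dictionaries, the keyword list, and the function names `keyword_track` and `top_users` are assumptions made for the example, and a real system would read from a streaming API rather than a list.

```python
from collections import Counter

def keyword_track(tweets, keywords):
    """Keep tweets that contain at least one tracked keyword (case-insensitive)."""
    tracked = {k.lower() for k in keywords}
    return [t for t in tweets if tracked & set(t["text"].lower().split())]

def top_users(tweets, n=5000):
    """Return the n users with the most tweets in the given collection."""
    counts = Counter(t["user"] for t in tweets)
    return [user for user, _ in counts.most_common(n)]

# Hypothetical toy data; user names and texts are invented for illustration.
tweets = [
    {"user": "anna", "text": "Wat gaan we eten vanavond"},
    {"user": "bram", "text": "Good morning everyone"},
    {"user": "anna", "text": "Het is een mooie dag"},
    {"user": "cees", "text": "Ik heb honger"},
]

# Keyword tracking: common Dutch words act as a cheap language filter.
matched = keyword_track(tweets, ["een", "het", "ik", "we"])

# User tracking: the most frequent users in last month's matches are followed in full.
frequent = top_users(matched, n=2)
```

Tracking very common function words works as a language filter because almost any Dutch tweet contains at least one of them; user tracking then recovers tweets from prolific users that happen to miss every keyword.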
Example queries: “ETEN” (Dutch for “to eat” / “food”) and “HEDDE”
COMMIT/ PROVINCIAL ELECTIONS MARCH 2015
COMMIT/ EXAMS IN DREAM TEXTS
Looking forward or looking backward to an event
The language of time: data-driven time estimates
Results Table, figure
Mean Absolute Error relative to event time
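For reference, mean absolute error over time-to-event estimates can be computed as below. This is a generic sketch: the unit (hours before the event) and the example values are assumptions for illustration and are not results from the talk.

```python
def mean_absolute_error(estimated, actual):
    """Average absolute difference between estimated and actual time-to-event."""
    if not estimated or len(estimated) != len(actual):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(e - a) for e, a in zip(estimated, actual)) / len(estimated)

# Hypothetical values: hours remaining until the event, per tweet.
estimated_hours = [48.0, 20.0, 5.0, 1.0]
actual_hours = [72.0, 24.0, 3.0, 1.0]

mae = mean_absolute_error(estimated_hours, actual_hours)
```

Reporting the error relative to event time means grouping tweets by their actual time-to-event and computing the MAE per group, which shows whether estimates improve as the event approaches.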
Lama Events (http://applejack.science.ru.nl/lamaevents/)
• Hürriyetoglu, A., Kunneman, F., and Van den Bosch, A. (2013). Estimating the time between Twitter messages and future events. In Proceedings of the 13th Dutch-Belgian Information Retrieval Workshop, pp. 20–23.
• Hürriyetoglu, A., Oostdijk, N., and Van den Bosch, A. (2014). Estimating time to event from tweets using temporal expressions. In Proceedings of the Workshop on Language Analysis in Social Media, LASM-2014, Gothenburg, Sweden, pp. 8–16. ACL.
• Kunneman, F., and Van den Bosch, A. (2012). Leveraging unscheduled event prediction through mining scheduled event tweets. In N. Roos, M. Winands, and J. Uiterwijk (Eds.), Proceedings of the 24th Benelux Conference on Artificial Intelligence, Maastricht, The Netherlands, pp. 147–154.
• Kunneman, F., Hürriyetoglu, A., Oostdijk, N., and Van den Bosch, A. (2014). Timely identification of event start dates from Twitter. Computational Linguistics in the Netherlands Journal, 4, pp. 39–52.
• Kunneman, F., Liebrecht, C., Van den Bosch, A., and Van Mulken, M. (2014). Signaling sarcasm: From hyperbole to hashtag. Information Processing & Management, published online.
• Kunneman, F., Liebrecht, C., and Van den Bosch, A. (2014). The (un)predictability of emotional hashtags in Twitter. In Proceedings of the Workshop on Language Analysis in Social Media, LASM-2014, Gothenburg, Sweden, pp. 26–34. ACL.
• Kunneman, F., and Van den Bosch, A. (2014). Event detection in Twitter: A machine learning approach based on term pivoting. In F. Grootjen, M. Otworowska, and J. Kwisthout (Eds.), Proceedings of the 26th Benelux Conference on Artificial Intelligence, pp. 65–72.
• Liebrecht, C., Kunneman, F., and Van den Bosch, A. (2013). The perfect solution for detecting sarcasm in tweets #not. In A. Balahur, E. van der Goot, and A. Montoyo (Eds.), Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA-2013, pp. 29–37. New Brunswick, NJ: ACL.
• Tops, H., Kunneman, F., and Van den Bosch, A. (2013). Predicting time-to-event from Twitter messages. In K. Hindriks, M. de Weerdt, B. van Riemsdijk, and M. Warnier (Eds.), Proceedings of the 25th Benelux Artificial Intelligence Conference, BNAIC-2013. Delft, The Netherlands.
Information need
• Linked data: NLP
• Words, phrases: search, translate
• Bags of words: classify, filter, recommend
Text
Information need
Text: “… Lee Harvey Oswald may not have been acting alone in the assassination of John F. Kennedy …”
• Assassination plot
• Conspiracy theory
• Named entities: LHO, JFK
• Semantic roles: LHO actor, JFK patient
• Negator: not; negated: ?
Information vs. language
• Every little fact can be (and is) expressed in an endless variety of linguistic expressions
• Language is a great way of hiding information
Navajo Code Talkers Henry Bake and George Kirk, 12/1943 (ARC 593415)
Text mining of dream descriptions
• Dreambank (http://www.dreambank.net/)
- archive of personal dream descriptions collected in different studies
- 61 collections (approx. 30K dreams)
– 143 dreams of blind dreamers
– 469 dreams of college women, late 1940s
– Phil, teens: 106 dreams; late 20s: 219 dreams; retirement: 180 dreams
– 490 female, 491 male dreams annotated with Hall & Van de Castle norms
• LDA (Latent Dirichlet Allocation)
- 50 topics
- 2000 iterations, Gibbs sampling
- Look for significant differences (log-likelihood test, G²) in topic distributions in female vs male dreams
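The log-likelihood test mentioned above can be sketched for a single topic as a 2×2 comparison: how often the topic occurs in female dreams versus male dreams. This is a minimal illustration under assumptions: the slides do not say how topic occurrences were counted, so the function name `g2` and the counts (120 and 60 dreams) are hypothetical; only the collection sizes 490 and 491 come from the slide.

```python
import math

def g2(hits_a, hits_b, total_a, total_b):
    """Log-likelihood ratio statistic (G2) for a 2x2 table.

    hits_a of total_a items in group A show the topic, and hits_b of
    total_b items in group B. Cells with an observed count of zero
    contribute nothing to the sum.
    """
    total = total_a + total_b
    hits = hits_a + hits_b
    misses = total - hits
    observed = [hits_a, hits_b, total_a - hits_a, total_b - hits_b]
    expected = [
        total_a * hits / total,
        total_b * hits / total,
        total_a * misses / total,
        total_b * misses / total,
    ]
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Hypothetical counts: a topic is prominent in 120 of 490 female dreams
# and in 60 of 491 male dreams.
score = g2(120, 60, 490, 491)

# With one degree of freedom, G2 above 3.84 corresponds to p < 0.05.
significant = score > 3.84
```

When testing all 50 topics, a stricter threshold or a multiple-comparison correction would be a reasonable precaution.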
Generic dream topics
• store money buy get pay bill man grocery lot counter bank shopping machine give tickets shop put dollars change bought
• bathroom toilet hair shower water go room bath floor clean wash see naked sink tub pee bathtub face cut towel
• book paper read books find picture write writing reading letter pictures written name looking look something library letters office computer
Dream topics, women (significant)
• house mother father baby old boy brother family girl home little sister children years parents child see aunt son kids
• church wedding people married front sit seats seat aisle back table sitting ceremony group place priest getting side room chair
• wearing white dress black blue red clothes hair dressed shirt put shoes wear pair pants hat green suit see old
• play dance stage music audience group playing song people dancing singing show sing part piano good band performance do concert
• pool swimming water swim go board end burt dive diving bottom little bathing side deep suit lake watching shallow underwater
Dream topics, men (significant)
• car driving drive road truck get going go seat side back cars stop front parked turn driver street station parking
• game ball playing team play basketball football field high baseball coach good balls hit players tennis school man other player
• gun men man shoot shot people fire kill shooting guns police war run killed enemy escape being soldiers fight get
• building stairs floor go get elevator people going door top hall room roof wall high office find steps way ladder
Can we predict hashtag usage?
Distant supervision for emotion detection in Twitter
Leveraging uncontrolled labeling to obtain large amounts of training data
Two forms of emotion hashtag usage:
- Meta-communication
- Add emotion
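A minimal sketch of how distant supervision with hashtags can produce training data: the hashtag serves as the label and is removed from the text, so that a classifier has to learn from the remaining words. The hashtags, example tweets, and the function name `make_examples` are assumptions for illustration, not the setup used in the studies.

```python
def make_examples(tweets, label_tags):
    """Turn hashtagged tweets into (text_without_tag, label) training pairs."""
    examples = []
    for text in tweets:
        tokens = text.split()
        labels = {t.lower() for t in tokens if t.lower() in label_tags}
        if len(labels) != 1:
            continue  # skip tweets with no label tag or with conflicting tags
        kept = [t for t in tokens if t.lower() not in label_tags]
        examples.append((" ".join(kept), labels.pop()))
    return examples

# Hypothetical tweets and label hashtags.
tweets = [
    "Finally weekend #happy",
    "Missed my train again #sad",
    "Just a normal day",
    "Mixed feelings #happy #sad",
]
examples = make_examples(tweets, {"#happy", "#sad"})
```

Stripping the tag matters most for the "add emotion" usage, where the hashtag may carry an emotion that the rest of the tweet does not express, which makes such tweets harder to predict.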
Hashtags conveying sarcasm
#sarcasm
#irony
#cynicism
#not
90% of tweets with selected hashtag are indeed sarcastic (Cohen’s Kappa: 0.44)
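Cohen's kappa corrects raw agreement between two annotators for the agreement expected by chance. The sketch below uses the standard formula; the annotation vectors are hypothetical and are not the annotations behind the figures above.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label sequences of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical judgments (1 = sarcastic, 0 = not sarcastic) for ten tweets.
annotator_1 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
annotator_2 = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]

kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 8 of 10 tweets, yet kappa is only 0.375, because most tweets fall into one class and chance agreement is therefore high. A skewed label distribution can yield a modest kappa even when raw agreement looks strong.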
1: Sarcastic tweets
2: Random tweets: 406,439
3: Day of tweets: February 1, 2013; 2,246,904 tweets, 353 with sarcastic hashtag
Results
Analysis: tweets without sarcasm hashtag
250 top ranked tweets without sarcasm hashtag by classifier confidence for sarcasm
Three annotators
Is a tweet sarcastic or not? (strict decision)
What next?
• Profiling of Twitter users
- Tweetgenie: age, gender
- Levels of aggression, sarcasm
• Mood and emotion detection in events
- Emotion classifiers
- Beyond list-based sentiment analysis
• Predicting events of specific types
- Warnings for disasters
- Occupations
- Illegal parties