Leveraging Big Data
A case study from Thomson Reuters
About the speakers
Chawapong Suriyajan,
Development Group Leader
Sakol Suwinaitrakool
Senior Solution Architect
FOLLOW US:
facebook.com/ThomsonReutersThailand
What’s the problem we want to solve?
“Behavioral finance is an area of increasing interest in financial markets, but it's been
difficult for human traders to keep pace due to the sheer volume and detail of data and the need to interpret it and spot trends
immediately”
Philip Brittan
Chief Technology Officer & Global Head of Platform, Thomson Reuters
Introducing Social Media Monitor
A tool that helps overcome the challenges of analyzing social data and provides insights for investors
Awards
Corporate Entrepreneur Awards 2014
• Best New Service
The Technical Analyst Awards 2014
• Best Specialist Product
FStech Awards 2015
• Financial Sector Innovation of the Year
What does SMM do?
Perform Sentiment
Analysis
Visualize
Why Social Media?
Fast!!!
Social Media is Fast
On June 10, 2014, as Iraqi militants seized the Baiji oil refinery, the news broke on Twitter - six hours in advance of other media outlets covering the story.
Social Media is Fast
In November 2013, The Globe and Mail tweeted that BlackBerry’s $4.7 billion buyout had been scuttled. The tweet
went out at 8:12 a.m., and by 8:19 a.m. BlackBerry stock had fallen 20%.
Why Social Media?
Provides collective sentiment indicators
Social Media Platforms
The Growth of Social Data
Source: http://www.searchenginejournal.com/growth-social-media-2-0-infographic/77055/
Challenges in leveraging Social Media
- The volume of data is enormous
- How can we make a machine analyze sentiment correctly?
- How can we deal with data that is “noise”?
- How do we present a huge amount of data in a way that a human can easily understand?
Emerging Technology Trends: Big Data
File System:
Document Store:
Wide-column Store:
TOOL
How big is our Data?
• Millions of tweets with a cashtag (e.g. $AAPL) per quarter
• More than 1 terabyte (TB) of compressed data
• Around 50 GB of data flowing into our system daily
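Picking cashtags such as $AAPL out of tweet text can be sketched as follows. The regular expression and the function name `extract_cashtags` are illustrative assumptions, not the pattern Social Media Monitor actually uses.

```python
import re
from typing import List

# Assumed pattern: "$" followed by 1 to 6 letters, not preceded by a
# word character or another "$". Amounts such as "$100" do not match.
CASHTAG_RE = re.compile(r"(?<![\w$])\$([A-Za-z]{1,6})\b")


def extract_cashtags(text: str) -> List[str]:
    """Return the upper-cased ticker symbols written as cashtags in text."""
    return [symbol.upper() for symbol in CASHTAG_RE.findall(text)]


print(extract_cashtags("Long $AAPL and $msft, paid $100 in fees"))
```

Real ticker rules differ by exchange, so the length limit of six letters is only a placeholder.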
Social Media Monitor Data Sources
45,000 Entries/Day
45,000,000 Entries/Day
215,000 Entries/Day
Social Media Ingestor
Filter
What is used to handle such big data?
• Distributed
• High Availability
• Full-Text Search
• Document Oriented
• Schema Free
• RESTful API
• Apache 2 Open Source License
• Low Cost
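To illustrate the document-oriented, RESTful model listed above, here is a minimal sketch that builds a JSON search body in the Elasticsearch query DSL. The field names `text` and `created_at`, the index name `tweets`, and the helper `build_cashtag_query` are hypothetical; the code only builds the request body and does not contact a cluster.

```python
import json


def build_cashtag_query(cashtag: str, since: str, size: int = 10) -> dict:
    """Build a search body: full-text match on the tweet text plus a date filter."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text match against the (hypothetical) "text" field.
                "must": [{"match": {"text": cashtag}}],
                # Non-scoring filter on the (hypothetical) "created_at" field.
                "filter": [{"range": {"created_at": {"gte": since}}}],
            }
        },
        "sort": [{"created_at": {"order": "desc"}}],
    }


body = build_cashtag_query("$AAPL", "now-1d")
print(json.dumps(body, indent=2))
```

A body like this would typically be sent with a POST to `/tweets/_search`. Whether the `$` survives indexing depends on the analyzer configured for the field, which the slides do not describe.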
Our experience using Elasticsearch & Hadoop
Strengths
• Clean distributed deployment, with prior in-house testing already done
Challenges
• Determining the size of the cluster
• Resource contention / Resource Sharing
• Large dataset
Analyzing Sentiments
Natural Language Processing
• Tokenization
Apples, are, red, …
• Part-of-speech tagging
Apples = noun, are = verb
• Lemmatization
Apples = apple, are = be
• Named entity recognition
Apples = Fruit, Red = Color
• Coreference resolution
Apples are red. They are very delicious. (They = Apples)
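The first and third steps above can be sketched in plain Python. This is a toy illustration only: the lemma table `LEMMAS` is hand-written for this one sentence, whereas a production pipeline would rely on a full NLP library and dictionary.

```python
import re

# Tiny hand-written lemma table (an assumption for this example only).
LEMMAS = {"apples": "apple", "are": "be", "is": "be"}


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def lemmatize(tokens):
    """Lower-case each token and map it to its base form when one is known."""
    return [LEMMAS.get(token.lower(), token.lower()) for token in tokens]


tokens = tokenize("Apples are red.")
print(tokens)             # ['Apples', 'are', 'red', '.']
print(lemmatize(tokens))  # ['apple', 'be', 'red', '.']
```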
Analyzing Sentiments
Machine Learning
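The slides do not say which learning algorithm the sentiment engine uses. As one common possibility, the sketch below trains a small multinomial Naive Bayes classifier with add-one smoothing. The training sentences and the names `train` and `classify` are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training data; a real engine needs far more examples.
model = train([
    ("buy strong growth", "bullish"),
    ("great earnings beat", "bullish"),
    ("sell weak guidance", "bearish"),
    ("terrible loss miss", "bearish"),
])
print(classify("strong earnings", model))  # bullish
```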
Processing Tweets
Tweets
SM Ingestor
Tweets + Sentiments/
Bullish, Bearish
Search
SM Statistic Aggregator
• Count of ‘positive’ tweets
• Count of ‘negative’ tweets
• Count of ‘neutral’ tweets
• Count of ‘bullish’ tweets
• Count of ‘bearish’ tweets
• Total tweet count
Processed in milliseconds
NLP
Sentiment Analysis
(Machine Learning)
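The counts produced by the statistic aggregator can be sketched as follows. The tweet structure, a dict with a `labels` list, is an assumption made for this example, as is the function name `aggregate`.

```python
from collections import Counter

LABELS = ("positive", "negative", "neutral", "bullish", "bearish")


def aggregate(tweets):
    """Count tweets per label and report the total number of tweets."""
    counts = Counter()
    for tweet in tweets:
        for label in tweet["labels"]:
            if label in LABELS:
                counts[label] += 1
    stats = {label: counts[label] for label in LABELS}
    stats["total"] = len(tweets)
    return stats


# Invented sample: each tweet has a sentiment and, optionally, a market view.
sample = [
    {"labels": ["positive", "bullish"]},
    {"labels": ["negative", "bearish"]},
    {"labels": ["neutral"]},
    {"labels": ["positive"]},
]
print(aggregate(sample))
```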
What is the noise?
• What if people tweet about a company with great bias?
• What if someone tweets a joke? Will this affect the analysis?
• Example:
– “Buy $Apple?” Is it positive or negative?
Minimizing the noise
• Use the proper filter for the PowerTrack API
• Weight the sentiment score using the Klout score
• Focus on collective sentiment during a specific time period instead of individual tweets
• Use enough training data to train our sentiment engine
Klout score
• The Klout Score is a number between 1 and 100 that represents your influence. The more influential you are, the higher your Klout Score.
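One simple way to weight sentiment by influence is a Klout-weighted average over the tweets in a time window. The slides do not give the actual formula, so the one below is an assumption, as are the name `weighted_sentiment` and the sentiment range of -1 to 1.

```python
def weighted_sentiment(tweets):
    """Klout-weighted mean of sentiment for a list of (sentiment, klout) pairs.

    sentiment is assumed to lie in [-1, 1]; klout is the 1-100 Klout score.
    """
    total_weight = sum(klout for _, klout in tweets)
    if total_weight == 0:
        return 0.0  # no tweets in the window
    return sum(sentiment * klout for sentiment, klout in tweets) / total_weight


# Invented window: an influential positive tweet outweighs a minor negative one.
window = [(1.0, 80), (-1.0, 20), (0.0, 50)]
print(weighted_sentiment(window))
```

With these numbers the result is (80 - 20 + 0) / 150 = 0.4, whereas an unweighted mean would be 0.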
Data Visualization – Bubble Chart
Data Visualization – Heatmap
Data Visualization Technology
Strengths and Challenges
Strengths
• Server-side deployment
– No installation on client machines
• Offload presentation logic to client machines
– Reduces resource requirements on the server side, making it more scalable (good code needed)
• Scalable
– Node.js uses single-threaded, non-blocking I/O, so there is no thread context-switching overhead
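The single-threaded, non-blocking model can be illustrated with Python's `asyncio`, used here as a stand-in for the Node.js event loop. The handler name and the delays are invented; `asyncio.sleep` stands in for a non-blocking I/O call.

```python
import asyncio
import threading


async def handle_request(name, delay, thread_ids):
    """Simulate a request that waits on I/O without blocking the thread."""
    thread_ids.add(threading.get_ident())
    await asyncio.sleep(delay)  # yields to the event loop while "waiting"
    return f"{name} done"


async def main():
    thread_ids = set()
    # Both requests are in flight at once, yet only one thread is used.
    results = await asyncio.gather(
        handle_request("a", 0.02, thread_ids),
        handle_request("b", 0.01, thread_ids),
    )
    return results, thread_ids


results, thread_ids = asyncio.run(main())
print(results, len(thread_ids))
```

The trade-off is that any CPU-heavy handler blocks every other request, which is why the slide notes that good code is needed.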
Strengths and Challenges
Challenges
• Developer skills with the AngularJS framework
• JavaScript performance
• Node.js is quite sensitive to unhandled exceptions, which can cause excessive memory usage
Q&A