
Leveraging Big Data. A case study from Thomson Reuters


Academic year: 2021

(1)

Leveraging Big Data

A case study from Thomson Reuters

(2)

About the speakers

Chawapong Suriyajan

Development Group Leader

Sakol Suwinaitrakool

Senior Solution Architect


(3)

FOLLOW US:

facebook.com/ThomsonReutersThailand

(4)

What’s the problem we want to solve?

“Behavioral finance is an area of increasing interest in financial markets, but it's been difficult for human traders to keep pace due to the sheer volume and detail of data and the need to interpret it and spot trends immediately”

Philip Brittan

Chief Technology Officer & Global Head of Platform, Thomson Reuters


(5)

Introducing Social Media Monitor

The tool that helps overcome the challenges of analyzing social data and provides insights for investors

(6)

Awards

Corporate Entrepreneur Awards 2014

• Best New Service

The Technical Analyst Awards 2014

• Best Specialist Product

FStech Awards 2015

• Financial Sector Innovation of the Year


(7)

What does SMM do?

Perform Sentiment Analysis

Visualize

(8)

Why Social Media?

Fast!!!


(9)

Social Media is Fast

On June 10, 2014, as Iraqi militants seized the Baiji oil refinery, the news broke on Twitter - six hours in advance of other media outlets covering the story.

(10)

Social Media is Fast

In November 2013, The Globe and Mail tweeted that BlackBerry’s $4.7 billion buyout had been scuttled. The tweet went out at 8:12 a.m., and by 8:19 a.m. BlackBerry stock had fallen 20%.


(11)

Why Social Media?

Provides collective sentiment indicators

(12)
(13)

Social Media

(14)

The Growth of Social Data

Source: http://www.searchenginejournal.com/growth-social-media-2-0-infographic/77055/


(15)

Challenges in leveraging social media

- The data volume is enormous

- How can we make a machine analyze sentiment correctly?

- How can we deal with data that is “noise”?

- How do we present this huge amount of data in a way that a human can easily understand?

(16)

Looking at the challenges

- The data volume is enormous

- How can we make a machine analyze sentiment correctly?

- How can we deal with data that is “noise”?

- How do we present this huge amount of data in a way that a human can easily understand?


(17)

Emerging Technology Trends: Big Data

File System:

Document Store:

Wide-column Store:

Tool:

(18)

How big is our Data?

• Millions of tweets with a cashtag (e.g. $AAPL) per quarter

• More than 1 terabyte of compressed data

• Around 50 GB of data flowing into our system daily
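The cashtag convention mentioned above (e.g. $AAPL) can be picked out of tweet text with a regular expression. The following is a minimal sketch, not the filter the speakers used: the pattern (a dollar sign followed by one to six letters) and the sample tweets are assumptions made for illustration.

```python
import re

# Assumed pattern: "$" followed by 1-6 letters, not preceded by a word character.
CASHTAG = re.compile(r"(?<!\w)\$([A-Za-z]{1,6})\b")


def extract_cashtags(text: str) -> list:
    """Return the upper-cased ticker symbols written as cashtags in text."""
    return [symbol.upper() for symbol in CASHTAG.findall(text)]


# Hypothetical sample tweets.
tweets = [
    "Thinking about adding $AAPL before earnings",
    "Lunch cost me $20 today",  # a price, not a cashtag
    "$aapl and $MSFT both look strong",
]

# Keep only the tweets that mention at least one cashtag.
tagged = [t for t in tweets if extract_cashtags(t)]
```

Requiring letters after the dollar sign keeps plain prices such as "$20" out of the result, which matters when the filter has to run over millions of tweets.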


(19)

Social Media Monitor Data Sources

45,000 Entries/Day

45,000,000 Entries/Day

215,000 Entries/Day

Social Media Ingestor

Filter
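As a back-of-the-envelope check, the three daily figures on this slide add up to roughly 45 million entries per day, or about 520 entries per second on average. The sketch below only reproduces that arithmetic; the slide text does not say which source each figure belongs to, so the numbers are kept as an unlabeled list, and real traffic is unlikely to be evenly spread over the day.

```python
# Daily entry counts as listed on the slide; the source behind each
# figure is not stated in the text, so they are left unlabeled.
entries_per_day = [45_000, 45_000_000, 215_000]

SECONDS_PER_DAY = 24 * 60 * 60

total_per_day = sum(entries_per_day)
avg_per_second = total_per_day / SECONDS_PER_DAY
largest_share = max(entries_per_day) / total_per_day

print(f"total entries/day: {total_per_day:,}")
print(f"average entries/second: {avg_per_second:.0f}")
print(f"share of largest source: {largest_share:.1%}")
```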

(20)

What is used to handle such big data?

• Distributed

• High Availability

• Full-Text Search

• Document Oriented

• Schema Free

• RESTful API

• Apache 2 Open Source License

• Low Cost
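Assuming the store described here is Elasticsearch (which the next slide names), "document oriented", "schema free", and "RESTful API" together mean that a tweet can be stored as a plain JSON document and searched with a JSON query over HTTP. The sketch below only builds the requests and prints them; nothing is sent. The index name, field names, and document values are hypothetical, and the `_doc` path follows recent Elasticsearch versions (older releases used a type name in that position).

```python
import json

INDEX = "tweets"  # hypothetical index name

# A schema-free JSON document: no table definition is needed up front.
doc = {
    "text": "Thinking about adding $AAPL before earnings",
    "cashtags": ["AAPL"],
    "sentiment": "positive",
    "created_at": "2015-01-15T08:12:00Z",
}

# Full-text search expressed in the JSON query DSL ("match" query).
search = {"query": {"match": {"text": "earnings"}}}


def build_request(method: str, path: str, body: dict) -> str:
    """Render an HTTP request line and JSON body for inspection (nothing is sent)."""
    return f"{method} {path}\n{json.dumps(body, indent=2)}"


index_request = build_request("PUT", f"/{INDEX}/_doc/1", doc)
search_request = build_request("POST", f"/{INDEX}/_search", search)
print(index_request)
print(search_request)
```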


(21)

Our experience using Elasticsearch & Hadoop

Strengths

• Clean distributed deployment, with prior in-house testing already done

Challenges

• Determining the size of the cluster

• Resource contention / Resource Sharing

• Large dataset

(22)

Looking at the challenges

- The data volume is enormous

- How can we make a machine analyze sentiment correctly?

- How can we deal with data that is “noise”?

- How do we present this huge amount of data in a way that a human can easily understand?


(23)

Analyzing Sentiments

Natural Language Processing

• Tokenize

Apples, are, red, …

• Part-of-speech tagging

Apples = noun, are = verb

• Lemmatization

Apples = apple, are = be

• Named entity recognition

Apples = fruit, red = color

• Coreference resolution

Apples are red. They are very delicious. (They = Apples)
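The first three steps can be illustrated on the slide's own sentence with a toy, dictionary-based sketch. It is not how a real NLP library works (real taggers and lemmatizers are statistical or rule-based and cover the whole language); the lookup tables below are assumptions that cover only the example words.

```python
import re

# Toy lookup tables covering only the example sentence (assumptions).
LEMMAS = {"apples": "apple", "are": "be"}
POS = {"apples": "NOUN", "are": "VERB", "red": "ADJ"}


def tokenize(text: str) -> list:
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def analyze(text: str) -> list:
    """Return a (token, part of speech, lemma) tuple for each token."""
    rows = []
    for token in tokenize(text):
        key = token.lower()
        # Unknown words get the tag "UNK" and are their own lemma.
        rows.append((token, POS.get(key, "UNK"), LEMMAS.get(key, key)))
    return rows


for row in analyze("Apples are red."):
    print(row)
```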

(24)

Analyzing Sentiments

Machine Learning


(25)

Processing Tweets

Tweets

SM Ingestor

Tweets + Sentiments / Bullish, Bearish

Search

SM Statistic Aggregator

• Count of ‘positive’ tweets

• Count of ‘negative’ tweets

• Count of ‘neutral’ tweets

• Count of ‘bullish’ tweets

• Count of ‘bearish’ tweets

• Total tweet count

Processed in milliseconds

NLP

Sentiment Analysis

(Machine Learning)
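The counts produced by the statistic aggregator can be sketched with a simple counter. This is only an illustration of the aggregation step, under the assumption that each processed tweet carries one sentiment label and, optionally, a bullish or bearish label; the sample data is hypothetical.

```python
from collections import Counter

# Hypothetical labeled tweets: (sentiment, market stance or None).
labeled = [
    ("positive", "bullish"),
    ("negative", "bearish"),
    ("neutral", None),
    ("positive", None),
]


def aggregate(rows):
    """Count tweets per sentiment and per bullish/bearish stance, plus a total."""
    counts = Counter()
    for sentiment, stance in rows:
        counts[sentiment] += 1
        if stance is not None:
            counts[stance] += 1
        counts["total"] += 1
    return counts


stats = aggregate(labeled)
```

Because each tweet only increments a few counters, the aggregation is a single pass over the labeled tweets.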

(26)

Looking at the challenges

- The data volume is enormous

- How can we make a machine analyze sentiment correctly?

- How can we deal with data that is “noise”?

- How do we present this huge amount of data in a way that a human can easily understand?


(27)

What is the noise?

• What if people tweet about a company with strong bias?

• What if someone tweets jokes? Will this affect the analysis?

• Example:

“Buy $Apple?” Is it positive or negative?

(28)

Minimizing the noise

• Use the proper filter for the PowerTrack API

• Weight the sentiment score using the Klout score

• Focus on collective sentiment during a specific time period, instead of individual tweets

• Provide enough training data to train our sentiment engine


(29)

Klout score

• The Klout Score is a number between 1 and 100 that represents your influence. The more influential you are, the higher your Klout Score.
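One plausible way to combine the two ideas, weighting by Klout score and looking at a whole time window rather than single tweets, is a weighted average. The slides do not give the actual formula, so the linear weighting, the score range of -1 to +1, and the sample window below are all assumptions.

```python
def weighted_sentiment(tweets):
    """Klout-weighted mean of sentiment scores in [-1, 1]; 0.0 for an empty window."""
    total_weight = sum(klout for _, klout in tweets)
    if total_weight == 0:
        return 0.0
    return sum(score * klout for score, klout in tweets) / total_weight


# One hypothetical time window: (sentiment score, Klout score of the author).
window = [(+1.0, 80), (-1.0, 20), (+1.0, 10), (-1.0, 10)]

# Plain average treats every author equally.
unweighted = sum(score for score, _ in window) / len(window)
# Weighted average lets influential authors count for more.
weighted = weighted_sentiment(window)
```

In this window the plain average is neutral, while the weighted score leans positive because the most influential author is positive. A single joke or biased tweet from a low-influence account moves the result very little.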

(30)

Looking at the challenges

- The data volume is enormous

- How can we make a machine analyze sentiment correctly?

- How can we deal with data that is “noise”?

- How do we present this huge amount of data in a way that a human can easily understand?


(31)

Data Visualization – Bubble Chart

(32)

Data Visualization – Heatmap


(33)

Data Visualization Technology

(34)

Strengths and Challenges

Strengths

• Server-side deployment

No installation on client machines

• Offload presentation logic to client machines

Reduces resource requirements on the server side, which makes it more scalable (good code needed)

• Scalable

Node.js is single-threaded with non-blocking I/O, so there is no context-switching overhead
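The single-threaded, non-blocking model can be illustrated with Python's `asyncio`, used here only as an analogy for the Node.js event loop and not as the speakers' code. In this sketch, `asyncio.sleep` stands in for a non-blocking I/O wait, and the task names and delays are made up.

```python
import asyncio
import threading
from typing import List, Tuple


async def fetch(name: str, delay: float) -> Tuple[str, int]:
    """Simulate a non-blocking I/O call and report which thread ran it."""
    await asyncio.sleep(delay)  # yields to the event loop instead of blocking
    return name, threading.get_ident()


async def main() -> List[Tuple[str, int]]:
    # All three waits overlap on one thread; gather returns results in call order.
    return await asyncio.gather(
        fetch("a", 0.02),
        fetch("b", 0.01),
        fetch("c", 0.03),
    )


results = asyncio.run(main())
thread_ids = {thread_id for _, thread_id in results}
print(f"tasks: {[name for name, _ in results]}, threads used: {len(thread_ids)}")
```

All three simulated requests are served by the same thread, which is why there is no thread context switching to pay for. The flip side is that any long CPU-bound step blocks every other request.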


(35)

Strengths and Challenges

Challenges

• Developer skills with the AngularJS framework

• JavaScript performance

• Node.js is quite sensitive to unhandled exceptions, which can cause excessive memory usage

(36)

Q&A


(37)

Thank you
