Big Data Analytics
Analysis of high-volume and unstructured Data
Stefan Weingaertner, DYMATRIX CONSULTING GROUP
KNIME Meetup Italia, 10th October 2013
Agenda
1 Company Introduction
2 Big Data - an Introduction
3 Big Data Analytics on high-volume Data
4 Big Data Analytics on unstructured Data
5 Live Demo: Advanced Email Classification
DYMATRIX – The analytical CRM Company
» Solution provider for Customer Intelligence, Marketing Automation and Advanced Predictive Analytics
» Consulting, development and implementation know-how, based upon more than 900 projects with mid- and large-cap companies across industries
» Goal- and client-oriented project execution based upon award-winning, established solutions
Our Consulting Competence Centers
Business Intelligence
» Conception of (big) data warehouse and business intelligence architectures
» Enterprise Reporting Systems
» Dashboards
» Sales Controlling
» Planning & Forecasting
» Balanced Scorecard
Advanced Analytics
» Customer Segmentation
» Customer Value Analysis
» Propensity Modeling (Cross-/Upsell/Churn)
» Shopping Basket Analysis
» Credit Rating Analysis & Credit Scoring
» Text Mining
» Data Mining Automation
» Big Data Analytics
Campaign Management
» Design and Optimization of Campaign Processes and Workflows
» Implementation of Campaign Management Systems
» Integration of Data Mining Models in Campaign Processes
» Campaign Optimization
» Consulting & Implementation of Next Best Activity Processes
E-commerce insight
» Web Tracking
» Web Controlling
» Web Mining
» Real Time Recommendation
» Social Media Tracking & Analysis
» Web Performance Measurement
» Customer Journey Analytics
Analysis of client oriented processes
Initial situation – Analysis – Conception of processes for customer retention and its optimization - customer reactivation and new customer activation – benchmarking against industry leaders
Solution Portfolio – The Customer Insight Suite
DynaCampaign
» Intelligent multi-touchpoint campaign management platform
» Planning, target group selection, execution and response measurement of campaigns
» Event-triggered realtime campaigning
DynaMine
» End2end automation of data mining processes
» Intelligent model management for automation of preprocessing, training & scoring of models
DynaCision
» Realtime decision management platform
» Design & execution of complex embedded decision processes
DynaSocial
» Social CRM platform to listen, track, identify and quantify customer needs and sentiments
Our KNIME Solution Nodes & KNIME Consulting Services
PMML2SQL / PMML2SAS Converter
» Convert PMML to executable SQL Code for In-Database-Scoring
» Convert PMML to executable SAS Code for Model Scoring within SAS
Big Data Integration
» Access any Hadoop large-scale distributed batch processing infrastructure from KNIME
» Efficiently distribute large amounts of data & preprocessing across a set of machines
Uplift Modeling
» Predictive Modeling Nodes to predict the incremental response to marketing actions
» For up-sell, cross-sell, churn and retention activities
Interactive Scorecard Builder
» Interactive Scorecard Building Nodes for Design of Credit or Marketing Scorecards
+ Business Consulting
+ Analytical Consulting
+ Technical Consulting
+ Trainings
References
Media
Banks, Insurances
Utilities, Industries, Public
Schwäbisch Hall
A Characterization of Big Data
[Figure: Big Data characterized by Volume (Terabyte to Zettabyte), Structured vs. Structured & Unstructured data, and Batch vs. Streaming processing]
Challenge: Big Data Collection & Integration
[Figure: stages Needs, Possibilities, Decisions, Approach, Purchase, Delivery, Usage, Service & Support, Remember]
Big Data Analytics: Learn, Target & Influence!
Big Data Analytics on high-volume Data
Big Data Access
[Figure: Hadoop stack. Big Data Sources; Hadoop Core: Hadoop Distributed File System (HDFS), MapReduce; Hadoop Extensions: Hive, HBase; Analytic Applications: Mahout, MapReduce Routines]
Big Data Analytics
[Figure: Hadoop stack (Big Data Sources; Hadoop Core: HDFS, MapReduce; Hadoop Extensions: Hive, HBase; Analytic Applications: Mahout, MapReduce Routines) with the PMML2SQL Converter added]
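The idea behind converting PMML to SQL for in-database scoring can be sketched as follows. This is a minimal illustration, not the DYMATRIX converter: it assumes a namespace-free PMML-style TreeModel fragment with numeric SimplePredicate values only, and the model, the field names (complaints, tenure_months) and the table name (customers) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free PMML-style decision tree fragment.
PMML_FRAGMENT = """
<TreeModel functionName="classification">
  <Node score="no_churn">
    <True/>
    <Node score="churn">
      <SimplePredicate field="complaints" operator="greaterThan" value="2"/>
    </Node>
    <Node score="no_churn">
      <SimplePredicate field="complaints" operator="lessOrEqual" value="2"/>
      <Node score="churn">
        <SimplePredicate field="tenure_months" operator="lessThan" value="6"/>
      </Node>
      <Node score="no_churn">
        <SimplePredicate field="tenure_months" operator="greaterOrEqual" value="6"/>
      </Node>
    </Node>
  </Node>
</TreeModel>
"""

# PMML SimplePredicate operators mapped to SQL comparison operators.
SQL_OPERATORS = {
    "equal": "=", "notEqual": "<>", "lessThan": "<",
    "lessOrEqual": "<=", "greaterThan": ">", "greaterOrEqual": ">=",
}


def predicate_sql(node):
    """Return the node's own condition as SQL, or None for <True/>."""
    pred = node.find("SimplePredicate")
    if pred is None:
        return None
    op = SQL_OPERATORS[pred.get("operator")]
    return f'{pred.get("field")} {op} {pred.get("value")}'


def leaf_rules(node, conditions=()):
    """Yield (conditions, score) for every leaf below the given node."""
    own = predicate_sql(node)
    if own:
        conditions = conditions + (own,)
    children = node.findall("Node")
    if not children:
        yield conditions, node.get("score")
        return
    for child in children:
        yield from leaf_rules(child, conditions)


def tree_to_sql(pmml_text, table):
    """Turn each root-to-leaf path into one WHEN branch of a CASE expression."""
    root = ET.fromstring(pmml_text).find("Node")
    branches = []
    for conditions, score in leaf_rules(root):
        when = " AND ".join(conditions) or "1 = 1"
        branches.append(f"    WHEN {when} THEN '{score}'")
    body = "\n".join(branches)
    return f"SELECT *,\n  CASE\n{body}\n  END AS prediction\nFROM {table}"


sql = tree_to_sql(PMML_FRAGMENT, "customers")
print(sql)
```

Because the generated statement is plain SQL, the scoring runs where the data lives instead of moving the data to the model.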
Big Data Analytics on unstructured Data
Big Data is not just about structured data…
80% of the world’s data is unstructured.
Unstructured data is growing at 15 times the rate of structured data.
Source: Google Trends April 6, 2012
Imagine…
» …to classify all customer related text messages by
• Source / Origin
• Sentiment
• Product or Service
• Business Transaction
• Context
• etc.
» …to identify unknown trends
» …to identify cause and effect relations
» …to react to that information, e.g.
• Technical Problems
• Needs
• Usability
• Competition
• etc.
The KNIME platform supports these efforts with comprehensive Text Analytics & Network Analytics capabilities!
Deutsche Telekom: Social Earthquake
[Chart: Facebook Posts & Comments, March & April 2013, counted by sentiment (Negative, Neutral, Positive); vertical axis 0 to 1000; horizontal axis in weekly steps from 1 March to 26 April]
» First Rumours: Limitation of Bandwidth (21.3. – 23.3.)
» „DSL-Drossel“ (DSL throttling): Official press release on limitation of bandwidth leads to a Social Earthquake (22.4. – 27.4.)
DYMATRIX Text Mining Process (KNIME Text Processing)
Text Datasources
Datasources:
• Emails
• Data Provider like GNIP, Datasift etc.
• Crawled Data
• etc.
For Machine Learning
• Provide Training Data for Classification (e.g. Sentiment)

Text Enrichment
Language Detection
• English
• German
• Many more…
Language individual NLP POS Tagging
• Penn Treebank Tagger
• STTS Tagger
Text Cleansing
• Stop Words
• Punctuations
• Stemming
Sentiment Amplifier
• Matching of Sentiment- & Emoticon-Dictionaries

Subject Matching
Text Tagging with any Subjects
• Products
• Brands
• Business Transactions
• Service
• Complaints
• Requests
• etc.
Fuzzy Matching with Dictionary Tagger
• Matching of Subject-Dictionaries

Sentiment Classification
Text Vectorization
• Creation of text predictors to predict sentiments
Machine Learning
• Classification with Predictive Analytics (e.g. Decision Tree)
Retraining Interface
• Adjustment of misclassified messages for permanent optimization of classification

Information Delivery
Text Data Mart
• Make information available in central Text Data Mart for visualization, alerting etc.
Fields of Application
• Email-Routing
• Event triggered Campaign Management
• etc.
DYMATRIX Text Mining Process: Datasources
Access any Text Datasource to start the Text Mining Process
» Emails
» Crawler
» Data Provider like GNIP, Datasift etc.
Exemplified contribution on Facebook Fanpage
DYMATRIX Text Mining Process: Text Enrichment
Original Facebook Message
Why not sort your signal issues out instead of bringing new phones out!!!! Wk 3 of crap signal but yet paying FULL monthly contract! Vodafone sort it.

Penn Treebank POS Tagger (English Messages)
Why[WRB] not[RB] sort[VBG] your[PRP] signal[VBP] issues[VBZ] out[IN] instead[RB] of[IN] bringing[VBG] new[JJ] phones[NNS] !!!![SYM] Wk[NNP] 3[CD] of[IN] crap[NN] but[CC] yet[RB] paying[VBG] FULL[NNP] monthly[RB] contract[NN] ![SYM] Vodafone[NNP] sort[VBG] it[PRP] .[SYM]

Removal of Stop Words & Punctuations
sort[VBG] signal[VBP] issues[VBZ] instead[RB] bringing[VBG] phones[NNS] Wk[NNP] 3[CD] crap[NN] paying[VBG] monthly[RB] contract[NN] Vodafone[NNP]

Sentiment Amplifier
Why not sort your signal issues out instead of bringing new phones out!!!! Wk 3 of crap [----] signal but yet paying FULL monthly contract! Vodafone sort it.
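The cleansing and amplifier steps on this slide can be approximated with a short sketch. It is not the KNIME Text Processing implementation: the stop word list, the sentiment and emoticon dictionaries and their scores are hypothetical, stemming and POS tagging are left out, and the "[----]" marker simply mimics the notation shown on the slide.

```python
import re

# Hypothetical dictionaries; a real setup would load language-specific lists.
STOP_WORDS = {"why", "not", "your", "out", "of", "new", "but", "yet", "full", "it"}
SENTIMENT_DICT = {"crap": -4, "great": 3}
EMOTICON_DICT = {":(": -2, ":)": 2}

MESSAGE = (
    "Why not sort your signal issues out instead of bringing new phones out!!!! "
    "Wk 3 of crap signal but yet paying FULL monthly contract! Vodafone sort it."
)


def cleanse(message):
    """Drop punctuation and stop words, keeping the remaining tokens in order."""
    tokens = re.findall(r"[A-Za-z0-9]+", message)
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def amplify(message):
    """Append a marker such as [----] after tokens found in the dictionaries."""
    out = []
    for token in message.split():
        out.append(token)
        key = token.lower().strip(".,!?")
        score = SENTIMENT_DICT.get(key, EMOTICON_DICT.get(token))
        if score:
            sign = "+" if score > 0 else "-"
            out.append("[" + sign * abs(score) + "]")
    return " ".join(out)


print(cleanse(MESSAGE))
print(amplify(MESSAGE))
```

The marker length encodes the strength of the dictionary entry, so later steps can use it as a predictor without looking the word up again.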
DYMATRIX Text Mining Process: Subject Matching
Original Facebook Message
Why not sort your signal issues out instead of bringing new phones out!!!! Wk 3 of crap signal but yet paying FULL monthly contract! Vodafone sort it.

Subject Matching (Fuzzy Matching)
Why not sort your signal issues out instead of bringing new phones out!!!! Wk 3 of crap signal [NETWORK] but yet paying FULL monthly contract! Vodafone sort it [COMPLAINT].

BUSINESS TRANSACTION: Complaint
NETWORK: No Signal
PRODUCT: Nokia Lumia 925
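Fuzzy matching against a subject dictionary can be sketched with Python's standard difflib module. This is only an illustration of the principle, not the KNIME Dictionary Tagger: the dictionary entries, the similarity cutoff of 0.8 and the sample message with its typos are assumptions.

```python
import difflib
import re

# Hypothetical subject dictionary: term -> (category, subject).
SUBJECT_DICT = {
    "signal": ("NETWORK", "No Signal"),
    "lumia": ("PRODUCT", "Nokia Lumia 925"),
    "complaint": ("BUSINESS TRANSACTION", "Complaint"),
}


def match_subjects(message, cutoff=0.8):
    """Tag a message with subjects whose dictionary term is close to a token."""
    tags = {}
    for token in re.findall(r"[a-z]+", message.lower()):
        close = difflib.get_close_matches(
            token, list(SUBJECT_DICT), n=1, cutoff=cutoff
        )
        if close:
            category, subject = SUBJECT_DICT[close[0]]
            tags[category] = subject
    return tags


# Misspelled terms ("signall", "complant") still reach their dictionary entry.
tags = match_subjects("No signall on my Lumia 925, I want to file a complant")
for category, subject in sorted(tags.items()):
    print(f"{category}: {subject}")
```

A lower cutoff tolerates more typos at the price of more false matches, so the threshold would need tuning per dictionary.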
DYMATRIX Text Mining Process: Sentiment Classification
Original Facebook Message
Why not sort your signal issues out instead of bringing new phones out!!!! Wk 3 of crap signal but yet paying FULL monthly contract! Vodafone sort it.

Output from Text Enrichment
Text Vectorization (Transformation)
Predictors relevant for Text Classification, e.g.
- Emoticons positive/negative
- Fragments positive/negative
- Words positive/negative
- Author-related Inputs
- Length of message
- Likes
- Comments
- Other linguistic Inputs
Text Classification with Decision Tree
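A rough sketch of how a message could be turned into such predictors and then classified is shown below. The word and emoticon lists are hypothetical, and the hand-written rules only stand in for a decision tree that would normally be trained on labeled messages.

```python
import re

# Hypothetical word and emoticon lists used to build the predictors.
POSITIVE_WORDS = {"great", "love", "thanks"}
NEGATIVE_WORDS = {"crap", "issues", "problem"}
POSITIVE_EMOTICONS = (":)", ":-)")
NEGATIVE_EMOTICONS = (":(", ":-(")

MESSAGE = (
    "Why not sort your signal issues out instead of bringing new phones out!!!! "
    "Wk 3 of crap signal but yet paying FULL monthly contract! Vodafone sort it."
)


def vectorize(message, likes=0, comments=0):
    """Transform one message into a flat dictionary of numeric predictors."""
    tokens = re.findall(r"[a-z']+", message.lower())
    return {
        "words_positive": sum(t in POSITIVE_WORDS for t in tokens),
        "words_negative": sum(t in NEGATIVE_WORDS for t in tokens),
        "emoticons_positive": sum(message.count(e) for e in POSITIVE_EMOTICONS),
        "emoticons_negative": sum(message.count(e) for e in NEGATIVE_EMOTICONS),
        "length": len(message),
        "likes": likes,
        "comments": comments,
    }


def classify(features):
    """Hand-written rules standing in for a trained decision tree."""
    negative = features["words_negative"] + features["emoticons_negative"]
    positive = features["words_positive"] + features["emoticons_positive"]
    if negative > positive:
        return "negative"
    if positive > negative:
        return "positive"
    return "neutral"


features = vectorize(MESSAGE, likes=3, comments=1)
print(features)
print(classify(features))
```

In practice the splits and thresholds would be learned from training data, and misclassified messages corrected through a retraining step would feed back into that data.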
DYMATRIX Text Mining Process: Information Delivery
Make information available in central Text Data Mart
Visualization in DynaSocial
Original Facebook Message
Why not sort your signal issues out instead of bringing new phones out!!!! Wk 3 of crap signal but yet paying FULL monthly contract! Vodafone sort it.
Sentiment + Business Transaction + Product + Relevance + Network
Other Fields of Application
» Subject-oriented Email-Classification & Email-Routing
KNIME Server: Develop once, deploy everywhere!
» Text Enrichment & Classification Workflows can be used for classification of any electronic text message (e.g. Social Content, Blogs, Emails).
» KNIME Server-based Text Enrichment & Classification Workflows can be deployed as a webservice and called easily from any other application.
Benefits
» Uniform Sentiment- and Classification-Handling for all customer-related messages.
Application Integration I: DynaSocial
DynaSocial – Social Media Excellence Architecture
» Data Sources: Client individual Sources, Social Media Data Provider, Social Service Platforms, Emails
» Social Media Analytics Content Extractor
» Social Media Analytics Data Management: Generic Big Data Model
» Advanced Social Media Analytics (Text Mining & Network Mining): Text Enrichment & Classification, Network Insights
» Social Media Analytics Dashboard
» Social Engagement: Integrated Social Inbox including all Social Touchpoints
Data Sources, Sentiments & Classifications, Reports & Dashboard
DynaSocial Management Dashboard
» Activities
» Sentiment Ratio
» Key Influencer
» Platform Distribution
» Trends compared to competition (Share of Voice)
» Geographic Distribution
» Overall Sentiments
» Top Keywords
» Flexible Selection of Time Windows
…
Application Integration II: Advanced Email-Classification
Email Classification: MS Exchange Connector
Components: Microsoft Outlook, Microsoft Exchange, .NET Batch, Webservice, KNIME Server
1 Incoming Email
2 Call .NET Procedure and transfer email contents to KNIME Server via Webservice Call.
3 Call KNIME Text Enrichment & Classification Workflows and return classification results.
4 Classification results are returned to Exchange Server and are saved persistently with object categories.
5 Any clients having access to Exchange Server get the same classification.
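The five steps can be pictured with a small sketch. It does not use the real Exchange, .NET or KNIME Server interfaces: the Email class, the keyword rule inside classify_via_service and all category names are hypothetical stand-ins, and the remote webservice call is replaced by a local function so that the flow stays runnable.

```python
from dataclasses import dataclass, field


@dataclass
class Email:
    """Hypothetical stand-in for a mail item stored on the mail server."""
    subject: str
    body: str
    categories: list = field(default_factory=list)


def classify_via_service(text):
    """Stand-in for steps 2 and 3: the webservice call to the workflows."""
    lowered = text.lower()
    result = {"Sentiment": "Neutral"}
    if "signal" in lowered:
        result["Network"] = "No Signal"
        result["Sentiment"] = "Negative"
    return result


def process_incoming(email, classify=classify_via_service):
    """Steps 1 to 4: take an incoming email, classify it, store the result."""
    result = classify(email.subject + " " + email.body)
    # Saving the result on the item itself is what lets every client
    # that reads the item see the same classification (step 5).
    email.categories = [f"{key}: {value}" for key, value in sorted(result.items())]
    return email


mail = process_incoming(Email("No signal", "Third week without signal, sort it."))
print(mail.categories)
```

Passing the classifier in as a parameter keeps the mail handling independent of how the classification service is actually reached.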
Live Demo: Realtime Email-Classification
Thank you for your attention.
We are happy to answer any of your questions!
DYMATRIX CONSULTING GROUP GmbH
Zeppelin Carré
Lautenschlagerstrasse 2
D-70173 Stuttgart
Your Contact: Stefan Weingaertner
Phone: +49.711.22.007.88 - 12
Fax: +49.711.22.007.88 - 88
E-Mail: [email protected]
Web: www.dymatrix.de