BIG DATA BEST PRACTICE-1
THE IDEA IN BRIEF…
What are the questions at the heart of the problem?
Formulate the hypotheses/questions at the heart of the issue! Distill them into a clear set of hypotheses to be tested
Remember Hadoop and associated technology components are a means
Isolate $ Denting Analytical Use Case
REAL LIFE EXAMPLE: CURATING USE CASE IN TELECOM SECURITY INTELLIGENCE
Business Context
What new signals to listen to, to prevent adverse events from happening?
4 Data Pools
Netsweeper logs
RADIUS logs
Switch CDR
MMS logs
2 Use cases
Watch list analysis + Network link analysis
Have an intensive ½ day cross functional workshop with business to boil down the game changing use case
Is it a “nice to have” use case or a “$ impacting use case” ?
Who is the consumer of the use case ?
How does it help him optimize cost or reduce risk or increase revenue ?
BIG DATA BEST PRACTICE-2
IMPACT “AHA” MOMENT IN 60-90 DAYS.
THE IDEA IN BRIEF
DELIVER FIRST BIG DATA “AHA” MOMENT IN 60-90 DAYS
Skeletal MVP: End to end implementation that links all architectural components together
Could be the answer to a previously unanswered question
Propels momentum
A REAL LIFE EXAMPLE
Industry = OTA
Context: Important to improve look to book
Is there a correlation between response time of a web page and the look to book ratio?
Hadoop cluster + Infobright + Hive jobs ready in 3 weeks
Scaled data and improvised dashboard experience for another 3 weeks
Business readout in 6 weeks
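The question in this example (does page response time move with the look to book ratio?) can be checked with a plain correlation coefficient before any cluster is involved. The sketch below is only an illustration: the hourly numbers are made up, and "look to book" is assumed here to mean looks per booking.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Made-up illustrative data, one point per hour (not from the source).
response_time_ms = [200, 400, 600, 800, 1000]
looks = [1000, 1000, 1000, 1000, 1000]
books = [50, 45, 40, 35, 30]

# Assumed definition: look to book = looks per booking (lower is better).
look_to_book = [l / b for l, b in zip(looks, books)]

r = pearson(response_time_ms, look_to_book)
print(f"correlation(response time, look to book) = {r:.3f}")
```

A coefficient near +1 or -1 on real data would justify the business readout; a value near 0 would send the team back to the hypothesis list.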
THEREFORE
Break it into 3 chunks
30 day milestones
60 day milestones
90 day milestones
In 30 days plan to cover functional breadth
Hadoop infrastructure + cluster
Integrate disparate components – data pipeline, columnar database, machine learning process, Hadoop cluster
Have a small file go from start to end through the process chain
In 60 days plan to cover scalability
Scale for at least 12 months of data
Tableau / Pentaho
In 90 days plan to cover bells and whistles
Configurators
Alerters
Additional abstraction
BIG DATA BEST PRACTICE-3
ACTIONS NOT INSIGHTS
DATA
INSIGHTS
Actions are executed in the frontline
Call centre
Mobile
Store channel
Digital channel
Actions could be
Behaviour based discounts
Help close a digital transaction
Serve customized webpage
Take proactive actions
Insights are nice to know. Actions impact $.
THEREFORE
WHAT ACTIONS ARE DRIVEN AS A RESULT OF THESE INSIGHTS?
HOW ARE WE DISSEMINATING INSIGHTS TO FRONT LINE CHANNELS?
ASK “SO WHAT” 5 TIMES!!!
BIG DATA BEST PRACTICE-4 :
REAL LIFE EXAMPLE
Keyword frequency
“Leaks”, “Leakage”, “Noise”, “Sound”, “Vibrations”
Noise / leakage frequency is a better predictor of repeat sales than any other indicator, including marketing spend!!!
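Keyword frequency of this kind can be computed with a few lines of counting. The sketch below uses the five keywords from the slide; the sample texts are hypothetical and only show the shape of the input.

```python
import re
from collections import Counter

# Keywords from the slide, lower-cased for matching.
KEYWORDS = {"leaks", "leakage", "noise", "sound", "vibrations"}

def keyword_frequency(texts, keywords=KEYWORDS):
    """Count how often each keyword appears across a collection of texts."""
    counts = Counter()
    for text in texts:
        # Split into lower-case alphabetic tokens, then keep only the keywords.
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in keywords:
                counts[token] += 1
    return counts

# Hypothetical customer comments (not from the source).
comments = [
    "Loud noise after two months, and some leakage.",
    "Strange sound and vibrations at high speed.",
    "Noise from the rear; the dealer found two leaks.",
]

freq = keyword_frequency(comments)
for word, count in freq.most_common():
    print(word, count)
```

Aggregating these counts per period or per product gives the series that can then be tested as a predictor of repeat sales.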
A REAL LIFE EXAMPLE
XYZ Online Buzz analysis
Business Question
How can we create a strategy to respond to what we are hearing about XYZ’s buzz online?
Statistical Technique
• Text mining
• Visual data exploration
• Hypothesis testing
• Affinity analysis
Insights derived
• Sentiment trends: +/-
• Sentiment benchmark with McDonald’s
• Top keywords for XYZ
• Top keywords for McDonald’s
• Keyword affinities
Business Action
• Theme specific campaigns
• NPD process
• Instore experience
• Reverse impact of negative buzz
Raw data
www.yelp.com
WHERE DO CUSTOMERS EXPRESS THEMSELVES?
Universe of XYZ sentiment data = 5 sources, 5556 posts, 3 years data
Phase-1 analysis = www.yelp.com, 136 posts, 2 years data
Yelp.com: 136 posts
Epinions.com: 552 posts
planetfeedback.com: 2854 posts
Twitter.com: 1500 posts
Facebook.com: 500 posts
SOURCE = TWITTER.COM
SOURCE = YELP.COM
SOURCE = FACEBOOK.COM
STEP BY STEP SENTIMENT TEXT MINING PROCESS
Input
• Blogs
• Customer review sites
• Online consumer forum
• Customers/Vendors emails
• Unstructured data from applications
Process
Output
• Inferences
• Customers’ sentiments
OVERALL SENTIMENTS DASHBOARD
THEREFORE
R text mining algorithm
RHadoop
BIG DATA BEST PRACTICE-5 :
COLUMNAR & IN-MEMORY ARCHITECTURES TO SPEED UP CHAIN OF THOUGHT
Which devices are infected by a malicious attack?
HOW TO HANDLE “NEEDLE IN A HAYSTACK” WORKLOADS?
What happened on firewall-3 between 3:17 and 3:21 am?
How many payment gateway drops happened between 9:47 am and 9:52 am on 15-Nov-2012?
Data forensic queries supporting chain of thought
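A forensic question such as the firewall-3 one is a narrow filter on device and time window. The sketch below shows the query shape with Python's built-in sqlite3 running in memory. SQLite is a row store and is used here only because it ships with Python; it is not one of the columnar engines discussed in this section. The table layout, the log messages and the date are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (device TEXT, ts TEXT, message TEXT)")

# Hypothetical log lines; ISO-8601 timestamps sort correctly as plain text.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("firewall-3", "2012-11-15 03:16:59", "accept tcp/443"),
        ("firewall-3", "2012-11-15 03:18:05", "deny tcp/23"),
        ("firewall-3", "2012-11-15 03:20:41", "deny tcp/23"),
        ("firewall-1", "2012-11-15 03:19:00", "accept tcp/80"),
        ("firewall-3", "2012-11-15 03:22:10", "accept tcp/443"),
    ],
)
# An index on (device, ts) lets the engine seek to the window instead of scanning.
conn.execute("CREATE INDEX idx_device_ts ON events (device, ts)")

def events_between(conn, device, start, end):
    """Return (ts, message) rows for one device inside an inclusive time window."""
    cursor = conn.execute(
        "SELECT ts, message FROM events "
        "WHERE device = ? AND ts BETWEEN ? AND ? ORDER BY ts",
        (device, start, end),
    )
    return cursor.fetchall()

hits = events_between(conn, "firewall-3", "2012-11-15 03:17:00", "2012-11-15 03:21:00")
for ts, message in hits:
    print(ts, message)
```

Each answer typically prompts the next, narrower query, which is why response time matters so much for this workload.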
Id   Name      Designation          Tenure
S1   Prem      Founder              8
S2   Simon     Security Architect   5
S3   Bhavana   Sales Head           6
S4   Ram       CEO                  3
S5   Shyam     Developer            1
Row-oriented storage: S1, Prem, Founder, 8 | S2, Simon, Security Architect, 5 | S3, Bhavana, Sales Head, 6 | S4, Ram, CEO, 3 | S5, Shyam, Developer, 1
Column-oriented storage: S1, S2, S3, S4, S5 | Prem, Simon, Bhavana, Ram, Shyam | Founder, Security Architect, Sales Head, CEO, Developer | 8, 5, 6, 3, 1
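The two storage layouts for the employee table above can be sketched in a few lines. This is a toy illustration of the layout only: plain Python lists do not reproduce the disk and cache behaviour of a real columnar engine, and the field names are chosen here for readability.

```python
# Row-oriented: each record is stored together.
rows = [
    ("S1", "Prem", "Founder", 8),
    ("S2", "Simon", "Security Architect", 5),
    ("S3", "Bhavana", "Sales Head", 6),
    ("S4", "Ram", "CEO", 3),
    ("S5", "Shyam", "Developer", 1),
]
FIELDS = ("id", "name", "designation", "tenure")

# Column-oriented: all values of one field are stored together.
columns = {field: [row[i] for row in rows] for i, field in enumerate(FIELDS)}

def avg_tenure_row_store(rows):
    """Walk every record and pick the tenure field out of each one."""
    return sum(row[3] for row in rows) / len(rows)

def avg_tenure_column_store(columns):
    """Read only the tenure column; the other columns are never visited."""
    tenure = columns["tenure"]
    return sum(tenure) / len(tenure)

print(avg_tenure_row_store(rows), avg_tenure_column_store(columns))
```

An aggregate over one field needs one column in the second layout, which is the property analytical queries over wide tables benefit from.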
Interactive or real-time query for large datasets = key to analyst productivity (supports chain of thought analysis).
Chain of thought analysis = Explore data torrent by quickly running off a series of iterative queries, each informed by the last.
Most solutions aren’t fast enough and reduce analytical effectiveness when the user’s chain of thought process is interrupted
In-memory DB Tools
Dremel at Google, Druid at Metamarkets, Sting at Netflix, Cloudera’s Impala, UC Berkeley AMPLab’s Spark, SAP HANA, Platfora.
THEREFORE
Examine columnar databases and in-memory databases to speed up important query workloads
Download evaluation version of Actian, Infobright and do a
BEST PRACTICE-6
HOW TO PLAN FOR 100X SCALABILITY?
BIG DATA BEST PRACTICE-6 :
THINK 100X SCALABILITY!!!
REAL LIFE EXAMPLE
Industry = Telecom
Business context
National content filtering solution
Events Generated Per Day: 1 Billion Events
New URLs Classified per Day: 1 Million
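A quick back-of-the-envelope calculation turns the daily volume into the sustained rate the pipeline must absorb, today and at 100x. The sketch assumes events are spread evenly over the day; real traffic peaks will be higher than this average.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 1_000_000_000  # from the example: 1 billion events per day
scale_factor = 100              # the "think 100x" planning target

def per_second(per_day):
    """Average rate per second, assuming an even spread over 24 hours."""
    return per_day / SECONDS_PER_DAY

current_rate = per_second(events_per_day)
target_rate = per_second(events_per_day * scale_factor)

print(f"today:   ~{current_rate:,.0f} events/s on average")
print(f"at 100x: ~{target_rate:,.0f} events/s on average")
```

On these figures the average is roughly 11,600 events per second today and about 1.16 million per second at 100x, which is the number to size ingestion and storage against.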
The data torrent
Price sensitive search
Store search
Ratings based ordering
Comparator events
Basket add events
Payment Gateway events
The Organisation
BIG DATA BEST PRACTICE-7 :
DETECT DATA PATTERNS IN REAL TIME!!!
THE CONTEXT
Velocity is high
Decision making window is low
REAL TIME EXAMPLE
Decision window = 8 mins
If a high value customer (decile = 1 on last 36 months revenue)
and intra book interval > threshold and recency of search < 70
then route to call center channel
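The routing rule above can be written as a small predicate that a stream processor would evaluate per event. This is a minimal sketch: the field names, the threshold value and the fallback channel name are hypothetical, and the units of the interval and recency are not given in the source.

```python
from dataclasses import dataclass

@dataclass
class CustomerEvent:
    revenue_decile: int         # 1 = top decile on last 36 months revenue
    intra_book_interval: float  # units not stated in the source
    search_recency: float       # units not stated in the source

INTRA_BOOK_THRESHOLD = 30.0  # hypothetical; the source only says "threshold"
SEARCH_RECENCY_LIMIT = 70    # from the rule: recency of search < 70

def route(event, threshold=INTRA_BOOK_THRESHOLD):
    """Return the channel for this event according to the rule."""
    if (
        event.revenue_decile == 1
        and event.intra_book_interval > threshold
        and event.search_recency < SEARCH_RECENCY_LIMIT
    ):
        return "call_center"
    return "default"  # hypothetical fallback channel

print(route(CustomerEvent(revenue_decile=1, intra_book_interval=45.0, search_recency=10)))
```

Because the decision window is minutes, the rule has to run on the event stream itself rather than in a nightly batch job.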
THEREFORE
Include S4 and other real time analytics in your Big Data reference architecture
BIG DATA BEST PRACTICE-8
THE BASICS
Captology = Persuasion through technology
DESIGN FOR BEHAVIOURAL CHANGE
Persuasion examples
Get users to change channel behaviour (move from Desktop to Mobile channel)
CAPTOLOGY IN ACTION
Captology in Insurance
Reduce rates each time a person reports his or her exercise behaviour to a group of peers online
THERE ARE TOO MANY GOOD PRODUCTS HIDDEN BEHIND BAD USER INTERFACES
PRODUCT = INTERFACE
BIG DATA BEST PRACTICE-9
INTERSECTIONS OF MOVING PARTS ARE THE WEAK LINKS
Big Data moving parts
Columnar databases
Hadoop clusters
Advanced visualisation layer
Real time components
Data pipelines
APIs / scrapers to syndicate info
Bridge to existing DW
The intersections can give way as data / user volumes increase
A real life big data architecture
Event loggers
Hbase/Cassandra for high velocity event absorption
Sqoop/Flume for data ingestion
Hadoop cluster for massive data crunching
R for extracting patterns
Columnar database for 10x lightning retrieval
Tableau for advanced visualisation
S4 for real time analytics
Channel integration components
Hadoop Cluster
R Predictor ranking
Infobright Columnar DB
THEREFORE … WATCH THE FOLLOWING 4 WEAK LINKS
1. Link between operational event streams and Hadoop cluster
2. Link between Hadoop cluster and columnar database
3. Link between columnar database and the visualisation tool
4. Time it takes for the machine learning algorithm to run
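One way to watch these four links is to time each hand-off separately, so the slowest one is visible as volumes grow. The sketch below is an assumption-laden illustration: the four stage functions are hypothetical stand-ins, to be replaced with the real calls into each component.

```python
import time
from contextlib import contextmanager

timings = {}  # link name -> elapsed seconds

@contextmanager
def timed(link_name):
    """Record wall-clock time spent inside the with-block under link_name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[link_name] = time.perf_counter() - start

# Hypothetical stand-ins for the real hand-offs between components.
def load_events_into_hadoop(events):
    return list(events)

def export_to_columnar_db(records):
    return sorted(records)

def refresh_dashboard(table):
    return len(table)

def run_ml_algorithm(table):
    return sum(table) / len(table)

events = range(100_000, 0, -1)

with timed("1. event streams -> Hadoop cluster"):
    records = load_events_into_hadoop(events)
with timed("2. Hadoop cluster -> columnar database"):
    table = export_to_columnar_db(records)
with timed("3. columnar database -> visualisation tool"):
    refresh_dashboard(table)
with timed("4. machine learning run time"):
    run_ml_algorithm(table)

for link, seconds in timings.items():
    print(f"{link}: {seconds * 1000:.1f} ms")

slowest_link = max(timings, key=timings.get)
print("slowest:", slowest_link)
```

Tracking these four numbers at each data-volume milestone shows which link gives way first.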
HIGH VELOCITY DATA PIPELINE
BIG DATA BEST PRACTICE-10
7 CORE MACHINE LEARNING BUILDING BLOCKS FOR ORCHESTRATING ANALYTICAL PROCESSES