Copyright © 2014 Splunk Inc.
Hunk & Elastic MapReduce:
Big Data Analytics on AWS
Dritan Bitincka
Disclaimer
During the course of this presentation, we may make forward looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and
estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as
of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change
at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to
About Me
• Member of BD Solution Architecture team
• Large scale deployments
• Cloud and Big Data
Agenda
• Hunk
• Amazon EMR
• Understanding how Hunk and EMR can work together
• Demo
– Analyzing HDFS/S3 data with Hunk on EMR
Introduction to Hunk
Splunk as a single pane of glass for your machine data
[Diagram labels: RDBM, NoSQL, Splunk>]
Hunk for Hadoop and NoSQL Data Stores
Explore
Analyze
Visualize
Hadoop Components
HDFS
– NameNode
– DataNode
– Distributed, replicated, massively scalable file system
MapReduce
– JobTracker
– TaskTracker
– Programming paradigm; two phase processing of large datasets
  • We also use it, though a simplified version of it
– Scalable, fault tolerant etc.
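The two-phase idea can be sketched in a few lines of plain Python. This is only an illustration of the paradigm on a single machine, not how Hadoop or Hunk implements it; the log format, the sample lines, and every function name here are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(lines, mapper):
    """Phase 1: apply the mapper to every input record, emitting (key, value) pairs."""
    for line in lines:
        yield from mapper(line)

def shuffle(pairs):
    """Group values by key, as the framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reducer):
    """Phase 2: collapse each key's list of values into one result."""
    return {key: reducer(key, values) for key, values in grouped.items()}

def status_mapper(line):
    # Hypothetical log format: "<method> <path> <status>"
    fields = line.split()
    if len(fields) == 3:
        yield fields[2], 1

def count_reducer(key, values):
    return sum(values)

def run_job(lines):
    """Count requests per HTTP status code using map, shuffle, reduce."""
    pairs = map_phase(lines, status_mapper)
    return reduce_phase(shuffle(pairs), count_reducer)

# Made-up sample input, only for illustration.
SAMPLE_LOG = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
]
```

Calling `run_job(SAMPLE_LOG)` counts two requests with status 200 and one with 404. On a real cluster the map and reduce steps run in parallel across many nodes.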
COMPUTE
Splunk and Hadoop Data
Export: Write data out to Hadoop, search based (push)
Explore: Read data from Hadoop and analyze on SH
Splunk Hadoop Connect
[Diagram labels: STORAGE, PULL, ✓, ✗]
Splunk and Hadoop Data – Today
COMPUTE
STORAGE
Explore Analyze Visualize Dashboards Share
✓
Splunk Stack
64-bit Linux OS
splunkweb
• Web and Application server
• Python, AJAX, CSS, XSLT, XML
splunkd
• Search Head
• Virtual Indexes
• C++, Web Services
REST API, COMMAND LINE, ODBC
Explore, Analyze, Visualize, Dashboards, Share
Hadoop Interface
• Hadoop Client Libraries
• JAVA
Scaling with Hadoop
Connect Hunk to multiple Hadoop clusters
Hadoop Cluster 3 Hadoop Cluster 2 Hadoop Cluster 1
What Makes it Stick?
[Diagram labels: Provider Family (Hadoop); ERP1 (prod), ERP2 (test); VIX-1, VIX-2, VIX-3, VIX-4]
In order to access and process data in external data stores (HDFS is supported out of the box), Hunk External Resource Providers (ERPs) carry out the store-specific file system implementation and computational semantics.
A Provider Family is a logical grouping of data store frameworks that access the same “kind” of external system and share a global set of configurations.
A provider is a collection of specific Hunk ERP helper process implementations within the provider family that share a cluster-specific configuration.
After you set up a provider, you configure virtual indexes (VIX) by giving Hunk information about the data location. Hunk then uses this information and its underlying implementation to distribute searches.
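The relationship between provider family, provider, and virtual index can be modeled as a small data structure. The sketch below is purely illustrative: the class names, configuration keys, host names, paths, and the assignment of VIX-1 to ERP1 and VIX-3 to ERP2 are all assumptions, and the rule that cluster-specific settings override family-wide ones is an assumption as well, not documented Hunk behavior.

```python
from dataclasses import dataclass, field

@dataclass
class ProviderFamily:
    """Logical grouping that shares a global set of configurations."""
    name: str
    global_config: dict = field(default_factory=dict)

@dataclass
class Provider:
    """One external cluster within a family, with cluster-specific settings."""
    name: str
    family: ProviderFamily
    cluster_config: dict = field(default_factory=dict)

    def effective_config(self):
        # Assumption: cluster-specific values win over family-wide ones.
        return {**self.family.global_config, **self.cluster_config}

@dataclass
class VirtualIndex:
    """Points a search at a data location through a provider."""
    name: str
    provider: Provider
    data_path: str

    def search_plan(self):
        return {
            "index": self.name,
            "provider": self.provider.name,
            "path": self.data_path,
            "config": self.provider.effective_config(),
        }

# Hypothetical values; none of these keys are real Hunk settings.
hadoop_family = ProviderFamily("hadoop", {"mode": "mapreduce", "timeout_s": 60})
erp1 = Provider("ERP1", hadoop_family, {"namenode": "hdfs://prod-master:8020", "timeout_s": 120})
erp2 = Provider("ERP2", hadoop_family, {"namenode": "hdfs://test-master:8020"})
vix1 = VirtualIndex("VIX-1", erp1, "/data/weblogs")
vix3 = VirtualIndex("VIX-3", erp2, "/data/weblogs-sample")
```

With this model, `vix1.search_plan()` carries the production cluster's settings and `vix3.search_plan()` carries the test cluster's, while both inherit `mode` from the shared family.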
Explore, Analyze, Visualize Data in Hadoop
• No fixed schema to search unstructured data
• Preview results while MapReduce jobs start
• Easier app development than in raw Hadoop
• Unlock business value of data in Hadoop
• Fast to learn instead of scarce skills
Integrated Analytics Platform for Hadoop Data
Full-featured, Integrated Product
Insights for Everyone
Works with What You Have Today
Explore, Visualize, Dashboards, Share
Hadoop (MapReduce & HDFS)
Amazon EMR
• Amazon EMR is a Hadoop framework in the cloud, offered as a managed service
• Used in a “variety of applications, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics”
Provisioning Hadoop on AWS
1. Log in to AWS Console
2. Fill in a form
3. Click “Create Cluster”
4. Wait a few minutes for a fully operational
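The console form in the steps above has a programmatic counterpart. As a rough sketch, the function below builds the kind of request that the boto3 EMR client's `run_job_flow` call accepts. The cluster name, release label, instance types, and the use of the default EMR roles are placeholder assumptions, not values from this presentation; the function only builds a dictionary and does not contact AWS.

```python
def build_cluster_request(name, worker_count):
    """Build keyword arguments for an EMR cluster with one master and N workers."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",           # placeholder release
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # placeholder instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": worker_count + 1,  # total count includes the master
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumes the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   emr = boto3.client("emr")
#   response = emr.run_job_flow(**build_cluster_request("hunk-demo", 2))
#   print(response["JobFlowId"])
```

Keeping the request in a plain function makes it easy to review the settings before anything is launched, since a running cluster incurs cost.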
Why is EMR Compelling?
• No Hadoop/HDFS management
• Native support for AWS S3
– Vast amounts of data in S3
• Cluster Elasticity
• Spot vs. Reserved Instances
– Long running vs. transient
• Pay for what you use
• Thousands of customers
[Diagram labels: Master, HDFS, S3]
Managed Hadoop framework on the cloud with access to vast amounts of data in HDFS and S3
Explore, analyze and visualize data from a central place
Full analytics solution for Big Data on the cloud
Integrating Hunk with EMR
Hunk on EMR: Option 1
• Classic Hunk + Hadoop
– Provision an EMR cluster
– Provision a Hunk EC2 instance using the AWS Marketplace Hunk AMI
– Bring Your Own License (BYOL)
– Configure Hunk with EMR cluster
  • Edit Security Groups to allow access
  • Master IP addresses & Ports
  • Create provider
  • Create Virtual Index
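Before creating the provider, it can help to confirm that the security group change actually lets the Hunk instance reach the EMR master. The sketch below is a generic TCP reachability check using only the Python standard library; the host name and port numbers in the usage comment are placeholders, and the real ports depend on the EMR release in use.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

def check_master(host, ports, timeout=3.0):
    """Check several ports on the master node; returns {port: reachable}."""
    return {port: port_is_open(host, port, timeout) for port in ports}

# Usage from the Hunk instance (hypothetical host and ports):
#   results = check_master("ip-10-0-0-12.ec2.internal", [9000, 9001])
#   for port, reachable in results.items():
#       print(port, "reachable" if reachable else "blocked or closed")
```

A port reported as blocked usually points to a missing inbound rule in the security group or to a service that is not listening on that port.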
Hunk on EMR: Option 2
Demo
• Analyze ELB or S3 Access Logs
• Analyze CloudTrail Access Logs