Dashboard Engine for Hadoop
June 2015
Matt McDevitt
Sr. Project Manager
Pavan Challa
Sr. Data Engineer
CONFIDENTIAL | 2
Agenda
• Think Big Overview
• Engagement Model
• Solution Offerings
• Dashboard Engine
• Demo
• Q&A
© 2015 Think Big, a Teradata Company
Think Big Overview
• Founded in 2010, acquired in 2014, international in 2015
• First and leading professional services firm exclusively focused on big data
• End-to-end services: Strategy, Design, Implementation, IP/Software, Support and Managed Services
• Academy to scale delivery capability
• Extend and integrate open source with UDA
• Team-based delivery with Solution Center
• Growing quickly: we’re hiring!
Big Data Program Mgt
Business Analytics
Managed Services
Data Engineering
Think Big Analytics
VELOCITY Methodology
• Solutions
• Planning and Design
• Prioritization
• Capability Backlog
• Grooming for engineering
• Engineering
• Sprint(s)
• Releases
• Quality Assurance & Test
• Managed Support
• Break Fix
• Sustaining Engineering
• New Models
• New Analytics
• New Insights
• New Data Requirements
• New Data
• Big Data Approach
• Use Cases
• Roadmap
• Data Science
• Discovery
• R&D
Big Data Lab
Think Big Solution Offerings
1. Big Data Strategy Roadmap
2. Data Lake Starter Program
3. Data Lake Optimization
4. Data Lake Managed Services
5. Presto for the Enterprise – new as of June 10, 2015
6. Big Data Managed Services
7. Think Big Academy
Custom Analytics Solution Services
• Device Data Manufacturing Operations
• Omni-Channel Marketing Analytics
• Financial Services Fraud/Risk Analytics
• Healthcare personalization
• Device Data Behavior Analytics
• IT Threat Detection
• Public Sector Risk Analysis
• Gaming Analytics
MAKING BIG DATA COME ALIVE
Data Lake: Starter Program
− Stand up a Data Lake and build 3 governed batch data ingest streams
− Includes Services and Subscription Software Frameworks
Data Lake: Optimization
− Add governance to your Data Lake
− For Data Lakes not originally built by Think Big
Data Lake: Dashboard Engine Reporting
− Install and configure engine with Data Lake to build dashboard analytics for deep dimensional rollup reporting capabilities with Tableau on Hadoop
Data Lake: Security
− Data Security & InfoSec, Cluster Hardening, Perimeter, Connectivity
Data Lake: Managed Services
− Only for Data Lakes that Think Big Designs and Builds
− On Premise, Public Cloud (AWS) and Private Cloud (Teradata and Altiscale)
Design
Build & Test
Integrate & Tune
Assess, Mentor & Plan
• Collaborative workshops with business groups
• Identification and prioritization of high-value data streams
• Gap analysis
• Develop Ingest workflows
• Install Metadata and Info Security Services
• Prepare Cluster for Integration test
• Install Ingest & System Test
• Begin Profiling Data
• Learn about Information Security and data wrangling
• Begin Building DL Reporting
• Final tuning, assessment and next steps
Think Big Data Lake Starter Program
(8 Week Engagement)
Develop & Unit Testing
Data Stream Prioritization
Info Security Objectives
Data Profiling and Capability
Follow-up Roadmap
2 weeks
2 weeks
2 weeks
2 weeks
Executive Presentation
Objective: Design, Develop and Deploy Data Lake Ingestion with Governance
Software Component Installation
Data Sources
Organization & Training
Cluster configuration & Integration
System Integration Testing
Enterprise Data Lake
Information Sources
Evaluate Source Data
Ingest
Collect & Manage Metadata
Apply Structure
Sequence
Compress
Automate
Protect
Prepare Data for Ingest
Prepare Source Metadata
Perimeter-Authentication-Authorization
InfoSec
Downstream Applications
Dashboard Engine
Data Lab
Data Repository
Security, Archival
RainStor – System of Record, Archive
Governed Ingestion
CDC
Buffer Server
Spark
Msg Queue
Kafka
Experimental Data
Raw
Data
Processing
Derived
Views
Loom – integrated Metadata, lineage, Wrangling
Metadata Repository
Dashboard Engine
API
Realtime
Processing
API
Discovery Zone
Statistics
Machine Learning
Graph
Analytics
Why a Dashboard Engine?
Events
Hadoop
• Near real-time analytics
• Easily scales to 100s of simultaneous users
• Query latency typically under 100 ms
• Deep dimensional drill-down
• Works with popular BI tools
− JavaScript, jQuery
− Tableau
− others to be announced soon
Using Tableau without Dashboard Engine
Hadoop
Middle Tier Server
Extract
• Queryable data limited by size of Server.
• Doesn’t scale as users grow.
• For the time the query is running, most or all of the cluster is dedicated to that one query.
− Has limitations if the cluster has other loads
− Has limitations for simultaneous dashboard users
• Low latencies possible only if all the event data is in RAM at query time.
Think Big’s Dashboard Engine for Hadoop
• Uses the power of Apache Spark to pre-aggregate data
• Scales as event volume grows.
• Scales as number of users grows.
Store cube data
[Figure: cube data stored as key-value pairs]
Keys shown:
Arrivals-2014-01-02, Arrivals-2014-01-03, Arrivals-2014-01-04
Arrivals-s:CA-2014-01-02, Arrivals-s:CA-2014-01-03, Arrivals-s:CA-2014-01-04
Arrivals-a:SFO-s:CA-2014-01-02, Arrivals-a:SFO-s:CA-2014-01-03, Arrivals-a:SFO-s:CA-2014-01-04
…
Values shown: 2053, 1911, 1965; 14147, 14158, 14269; 429, 479, 433
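The key layout in the figure suggests one counter per metric, dimension combination, and day. A minimal sketch of that pre-aggregation idea follows, in plain Python rather than Apache Spark so it stays self-contained. The dimension codes (`a` for Airport, `s` for State), the `ROLLUPS` list, the helper names, and the sample events are all assumptions made for illustration; they are not taken from the engine itself.

```python
from collections import Counter

# Assumed short codes, inferred from keys such as
# "Arrivals-a:SFO-s:CA-2014-01-02" (a = Airport, s = State).
DIM_CODES = {"Airport": "a", "State": "s"}

# Assumed rollup levels, matching the three key shapes in the figure:
# no dimension, State only, Airport plus State.
ROLLUPS = [(), ("State",), ("Airport", "State")]


def cube_key(metric, event, dims):
    """Build a key like 'Arrivals-s:CA-2014-01-02' for one rollup level."""
    parts = [metric]
    parts += [f"{DIM_CODES[d]}:{event[d]}" for d in dims]
    parts.append(event["Date"])
    return "-".join(parts)


def pre_aggregate(events, metric="Arrivals"):
    """Count every event once under each rollup level."""
    cube = Counter()
    for event in events:
        for dims in ROLLUPS:
            cube[cube_key(metric, event, dims)] += 1
    return cube


# Made-up events, only to exercise the sketch.
events = [
    {"Date": "2014-01-02", "State": "CA", "Airport": "SFO"},
    {"Date": "2014-01-02", "State": "CA", "Airport": "LAX"},
    {"Date": "2014-01-02", "State": "NY", "Airport": "JFK"},
]
cube = pre_aggregate(events)
```

Because every rollup is computed ahead of time, a dashboard query in this sketch becomes a single key lookup rather than a scan over the raw events.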
API - Connecting to the Dashboard Engine
• Aggregate API that understands metrics, dimensions, time ranges.
• Relational API that understands (some) SQL.
Aggregate API
Flight Data Statistics for Demo
• Running on a 16-node cluster (TD Appliance for Hadoop)
• Process and store all data in ~2 hours

                 Rows          Storage space
Flight records   160 million   30 GB
MOLAP cube       35 billion    2.1 TB
SQL Query to REST API Example
• Sends SQL queries to the API

SELECT FlightData.Date AS "none_Date_ok",
       FlightData.State AS "none_State_nk",
       SUM(FlightData.Arrivals) AS "sum_Arrivals_nk"
FROM "default"."FlightData" "FlightData"
GROUP BY "none_Date_ok", "none_State_nk"
• Translated to Aggregate API queries

http://10.25.12.241:52080/clickstream/aggregate/v1/?period=day&start=1970-01-01&dimension=State:&metric=Arrivals
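As a rough illustration of the query string shown above, the sketch below assembles an Aggregate API URL from a period, a start date, a list of dimensions, and a metric. The function name `aggregate_url` and its parameters are hypothetical; only the endpoint and the parameter names (`period`, `start`, `end`, `dimension`, `metric`) come from the slide, and the actual SQL translation logic in the engine is not shown in the source.

```python
from urllib.parse import urlencode

# Endpoint as it appears on the slide.
BASE_URL = "http://10.25.12.241:52080/clickstream/aggregate/v1/"


def aggregate_url(period, start, dimensions, metric, end=None, base_url=BASE_URL):
    """Build an Aggregate API URL (hypothetical helper).

    `dimensions` is a list of (name, value) pairs; an empty value
    produces "Name:", the form the slide uses for an unfiltered dimension.
    """
    params = [("period", period), ("start", start)]
    if end is not None:
        params.append(("end", end))
    for name, value in dimensions:
        params.append(("dimension", f"{name}:{value}"))
    params.append(("metric", metric))
    # safe=":" keeps the colon literal, as in the slide's URLs.
    return base_url + "?" + urlencode(params, safe=":")


url = aggregate_url("day", "1970-01-01", [("State", "")], "Arrivals")
```

Passing `[("Airport", ""), ("State", "NY")]` with an `end` date would produce a filtered query in the same shape.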
<index name="AirportsByState">
  <periods>
    <period>day</period>
  </periods>
  <indexDimensions>
    <dimension name="State"/>
  </indexDimensions>
  <listDimensions>
    <dimension name="Airport"/>
  </listDimensions>
</index>
Aggregate use: Show arrivals for all airports for NY
http://10.25.12.241:52080/clickstream/aggregate/v1/?period=day&start=2014-01-04&end=2014-01-05&dimension=Airport:&dimension=State:NY&metric=Arrivals&headers=on
Day Start    Airport  State  Arrivals
2014-01-04   ALB      NY     20
2014-01-04   ART      NY     1
2014-01-04   BUF      NY     40
...
2014-01-04   JFK      NY     167
2014-01-04   LGA      NY     206
2014-01-04   ROC      NY     17
2014-01-04   SWF      NY     2
2014-01-04   SYR      NY     14
<index name="ListFlightNoCarrierCityState">
  <periods>
    <period>day</period>
  </periods>
  <indexDimensions>
  </indexDimensions>
  <listDimensions>
    <dimension name="State"/>
    <dimension name="City"/>
    <dimension name="Carrier"/>
    <dimension name="FlightNo"/>
  </listDimensions>
</index>
Dimensions use: Show all Flight/Carrier/City/State
http://10.25.12.241:52080/clickstream/dimensions/v1/?period=day&start=2014-01-04&end=2014-01-05&dimension=State:&dimension=City:&dimension=Carrier:&dimension=FlightNo:
"results":[
["AK","Anchorage, AK","AS","101"],
["AK","Anchorage, AK","AS","102"],
["AK","Anchorage, AK","AS","103"],
["AK","Anchorage, AK","AS","106"],
["AK","Anchorage, AK","AS","108"],
...
["AL","Huntsville, AL","DL","1782"],
["AL","Huntsville, AL","DL","2077"],
...
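A client consuming this output might turn each result row into a labelled record. The sketch below assumes the response is a JSON object with a top-level "results" array (the slide shows only a fragment, so the wrapper is a guess) and that row values follow the order of the `dimension` parameters in the request. The function name `rows_to_records` is hypothetical.

```python
import json


def rows_to_records(payload, dimension_names):
    """Pair each value in a result row with its dimension name."""
    data = json.loads(payload)
    return [dict(zip(dimension_names, row)) for row in data["results"]]


# Two rows copied from the slide, wrapped in an assumed JSON object.
sample = (
    '{"results": ['
    '["AK", "Anchorage, AK", "AS", "101"], '
    '["AL", "Huntsville, AL", "DL", "1782"]'
    "]}"
)
records = rows_to_records(sample, ["State", "City", "Carrier", "FlightNo"])
```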
<index name="ListFlightNoByCarrierState">
  <periods>
    <period>day</period>
  </periods>
  <indexDimensions>
    <dimension name="State"/>
    <dimension name="Carrier"/>
  </indexDimensions>
  <listDimensions>
    <dimension name="FlightNo"/>
  </listDimensions>
</index>
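Index definitions like the ones in this deck share one shape: a name, a list of periods, index dimensions, and list dimensions. As a sketch only, they could be generated with Python's standard `xml.etree.ElementTree` instead of being written by hand. The helper `build_index` is hypothetical, and how the engine loads such definitions is not covered in the source.

```python
import xml.etree.ElementTree as ET


def build_index(name, periods, index_dims, list_dims):
    """Build an <index> element in the shape shown on the slides."""
    index = ET.Element("index", {"name": name})
    periods_el = ET.SubElement(index, "periods")
    for period in periods:
        ET.SubElement(periods_el, "period").text = period
    index_el = ET.SubElement(index, "indexDimensions")
    for dim in index_dims:
        ET.SubElement(index_el, "dimension", {"name": dim})
    list_el = ET.SubElement(index, "listDimensions")
    for dim in list_dims:
        ET.SubElement(list_el, "dimension", {"name": dim})
    return index


index = build_index(
    "ListFlightNoByCarrierState",
    periods=["day"],
    index_dims=["State", "Carrier"],
    list_dims=["FlightNo"],
)
xml_text = ET.tostring(index, encoding="unicode")
```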
Index Question
Q: How do you drill down to a list of Delta flights that caused delays in Colorado?
A: Create the index below, rerun index creation step, query delay metrics for