Session 1: IT Infrastructure Security
James Campbell
Corporate Systems Engineer
HP – Vertica
[email protected]
Vertica / Hadoop Integration and Analytic
Capabilities for Federal Big Data Challenges
Big Data - Revisited
• Are the terms “Big Data” and Hadoop synonymous?
• What are the primary drivers for government agencies in addressing Big Data?
• What other types of tools are available to work with Big Data?
The Big Data Challenge
Volume
Variety
Velocity
Value
1000x
Social Media Video
Audio Email Texts Mobile
Transactional Data Machine/Sensor Docs
Search Engine Images
New Solutions Are Needed
Diverse Users
Ad Hoc Questions
BIG DATA
In Data, There is Gold
• What value are you looking to find in your data?
• How fast do you need to find gold?
• Make sure you don’t get fool’s gold
Approaches to Finding Gold in Big Data
Mining for Gold in Big Data
• Analysis and reporting are not the same thing
– Organizations should not equate reporting with analysis
– Reporting Environments
• Select reports to run
• Execute reports
• View results
• Analysis is an interactive process of analyzing data
– Frame research/investigation question
– Identify data requirements
– Analyze data (interactive process)
– Interpret the results
• Inflexible
• Predefined
• Flexible
• Custom
• Focused on finding answers
Reporting Vs. Analytics
Reporting
• Standard views of data
• Answers a standard set of questions
• Does not require a human
• Is inflexible
Analytics
• Interactive Process
• Correctly Frame Problem
• Collection of Data
• Analyze Data
• Interpret Results
• Provides answers
• Customized
• Involves human interaction
• Flexible
• Real‐Time
Analytic Pain Points
• Low performance
• Limited functionality
• Complexity in deployment and use
• Results are not delivered in time to meet analytics demands
• Does not interoperate with other big data platforms (e.g., Hadoop)
• Skilled labor requirements of newer technologies
• Older technologies unable to answer “big data” challenges
Hadoop Answers Many Big Data Challenges
Varied Data Structures
Large Data Volumes
Rich Set of Analytics
Varied Data Sources
Quick analysis of complex relationships
Interactive Analysis
Performance
Enhanced Queries
Hadoop Architectural Components
Process Layer
Map Reduce
Map Step – Creates key/value tuples
Reduce Step – Receives sorted key/value tuples and runs a user-provided program
Storage Layer
HDFS – Cluster file system written in Java that sits on top of the host file system
Other Storage – Amazon S3, CloudStore, FTP file system, and other distributed file systems available through a file:// URL
Job Manager
Job Manager – Manages jobs, which include tasks across all nodes
Task Manager – Manages each individual task (could be one or more per node)
Resource Management – Added in the latest Hadoop release
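The Map and Reduce steps listed above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the data flow (map to key/value tuples, sort by key, reduce each key's values), not Hadoop's API; the function names `map_step`, `reduce_step`, and `run_job` and the word-count task are assumptions chosen for the example.

```python
from itertools import groupby
from operator import itemgetter

def map_step(record):
    """Map step: emit one (key, value) tuple per word in the record."""
    for word in record.lower().split():
        yield (word, 1)

def reduce_step(key, values):
    """Reduce step: the user-provided program, here a simple sum."""
    return (key, sum(values))

def run_job(records):
    # Map every input record to key/value tuples.
    pairs = [kv for record in records for kv in map_step(record)]
    # The framework sorts tuples by key before handing them to reduce.
    pairs.sort(key=itemgetter(0))
    # Each key arrives at the reducer together with all of its values.
    return [reduce_step(key, (value for _, value in group))
            for key, group in groupby(pairs, key=itemgetter(0))]

counts = run_job(["big data", "Big Data analytics"])
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the sort happens while data moves between nodes; the single-process version above only shows the contract between the two steps.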
Hadoop Key-Value and Database Storage Systems
HDFS or other Distributed File System
Key Value / Database System
Client Applications
HDFS provides the underlying distributed storage mechanism
May create files, indexes, depending on Apache project
Clients can be SQL, NoSQL, programs, etc.
Map / Reduce
May use Map/Reduce Framework
Choosing The Right Tool for the Job
• Vertica for Interactive and Real-time Analytics
• Hadoop for Long-running Batch Analytics (fault tolerance)
• MapReduce works best when there is a large set of input data where only a small portion of the data is required for analysis
A Platform Designed for Big Data
Real Time Massively Parallel Processing
Native and Performance Optimized High Availability
Native Columnar RDBMS
Columnar Compression
Concurrent Load & Query
Elastic Cluster
SQL Analytics
User-Defined Analytics
Optimized Connectors
Standard Interface
Next Generation Administration and Design Tools
What Analytics can HP Vertica handle?
SQL
• SQL analytic functions
• Graph
• Monte Carlo
• Statistical
• Geospatial
Extended SQL
• Sessionization
• Time series
• Pattern matching
• Event series joins
SDKs
• C++
• R
• ?
Check out: https://github.com/vertica/Vertica-Extension-Packages
SQL Analytics + ‐ Built for Big Data
Features
• Time series gap filling and interpolation
• Event-based window functions and sessionization
• Pattern matching
• Event series join
• Statistical functions
• Geospatial functions
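Sessionization, one of the features listed above, can be illustrated with a short Python sketch: events from the same user belong to one session until the gap between consecutive events exceeds a timeout. This is not HP Vertica's implementation or SQL syntax; the function name `sessionize`, the `(user, timestamp)` event shape, and the 30-unit timeout are assumptions made for the example.

```python
def sessionize(events, timeout):
    """Assign a session id to each (user, timestamp) event.

    A new session starts when the gap since the same user's previous
    event exceeds `timeout` (same units as the timestamps).
    """
    last_seen = {}   # user -> timestamp of that user's previous event
    session_of = {}  # user -> current session counter
    out = []
    for user, ts in sorted(events):
        if user not in last_seen:
            session_of[user] = 0
        elif ts - last_seen[user] > timeout:
            session_of[user] += 1
        last_seen[user] = ts
        out.append((user, ts, session_of[user]))
    return out

clicks = [("u1", 0), ("u1", 20), ("u1", 100), ("u2", 5), ("u2", 400)]
sessions = sessionize(clicks, timeout=30)
print(sessions)
```

The sort by user and timestamp plays the role of a window partitioned by user and ordered by time, which is why this kind of logic fits an event-based window function.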
Benefits
• High performance (Keep Data close to CPU)
• Low cost (Industry Standard building blocks)
• Ease of use (Automated + Available)
Use Cases
‒ Tickstore data cleanups
‒ CDR/VOD data analysis
‒ Clickstream sessionization
‒ Data aggregation and compression
‒ Monte Carlo simulation
‒ Social graph analysis
‒ Sensor Data
‒ SmartGrid
‒ Predictive maintenance
‒ …
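Use cases such as tickstore cleanup and sensor data rely on time series gap filling and interpolation. The Python sketch below shows the idea under simple assumptions: samples are sorted `(time, value)` pairs with numeric times, and missing slots on a regular grid are filled by linear interpolation. The function name `gap_fill` and the sample data are hypothetical, and the sketch does not reproduce HP Vertica's SQL syntax.

```python
def gap_fill(samples, step):
    """Resample sorted (time, value) samples onto a regular grid.

    Slots without a sample are filled by linear interpolation between
    the nearest surrounding samples.
    """
    start, end = samples[0][0], samples[-1][0]
    filled = []
    i = 0  # index of the last sample at or before the current slot
    t = start
    while t <= end:
        while i + 1 < len(samples) and samples[i + 1][0] <= t:
            i += 1
        t0, v0 = samples[i]
        if t0 == t or i + 1 == len(samples):
            value = v0  # a real sample exists for this slot
        else:
            t1, v1 = samples[i + 1]
            value = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        filled.append((t, value))
        t += step
    return filled

ticks = [(0, 10.0), (3, 16.0), (4, 20.0)]
print(gap_fill(ticks, step=1))
```

Here the slots at times 1 and 2 have no tick, so they receive values on the straight line between the ticks at times 0 and 3.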
Vertica Cluster
User-Defined Extensions in R
• What is R?
– Open source language for statistical computing
– Wide range of packages available for advanced data mining and statistical analysis
• Advantages of UDx in R
– HP Vertica automatically parallelizes the execution of user-defined R code
– Optimized data transfer between HP Vertica and R
Function Setup + Usage
-- Define function
CREATE LIBRARY rlib AS '/path/rcode.R' LANGUAGE 'R';
CREATE TRANSFORM FUNCTION Kmeans
AS LANGUAGE 'R' NAME 'kmeansCluFactory' LIBRARY rlib;

-- Use function
CREATE TABLE point_data ( x FLOAT, y FLOAT );
SELECT Kmeans(x, y) OVER() FROM point_data;
R Source Code
UDx in R Example: K‐Means Clustering
# Example: K-means (k=5)
# Input: two-dimensional points
# Output: the point coordinates plus their assigned cluster
kmeansClu <- function(x) {
  # 5 centers, at most 10 iterations
  cl <- kmeans(x, 5, 10)
  res <- data.frame(x[, 1:2], cl$cluster)
  res
}

# Factory function that describes the UDx to HP Vertica
kmeansCluFactory <- function() {
  list(name = kmeansClu,
       udxtype = c("transform"),
       intype = c("float", "float"),
       outtype = c("float", "float", "int"),
       outnames = c("x", "y", "cluster"))
}
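To make the transform's output shape concrete, here is a minimal Python sketch of k-means that returns the same three columns (x, y, cluster). It uses Lloyd's algorithm with a seeded random start, which is an assumption made for readability; R's `kmeans` uses a different algorithm by default, so cluster numbering and exact assignments may differ from the R UDx above. The function name and sample points are hypothetical.

```python
import random

def kmeans(points, k, max_iter=10, seed=0):
    """Lloyd's algorithm on 2-D points; returns (x, y, cluster) rows."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct starting points
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k),
                key=lambda c: (px - centers[c][0]) ** 2 + (py - centers[c][1]) ** 2)
            for px, py in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    # Cluster ids are 1-based, matching R's convention.
    return [(x, y, label + 1) for (x, y), label in zip(points, labels)]

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (100.0, 100.0), (100.0, 101.0), (101.0, 100.0)]
for row in kmeans(pts, k=2):
    print(row)
```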
HP Vertica and Hadoop are Complementary
HP Vertica
• Designed for performance
• Interactive analytics
• A rich SQL ecosystem
Hadoop
• Designed for fault‐tolerance
• Batch analytics
• A rich Programming Model
Both Purpose-Built Scalable Analytics Platforms
Hadoop + HP Vertica: Joint Use Cases
Use Case 1:
• Hadoop for data integration, transformation, and data quality management
• HP Vertica for structured analytics, traditional business intelligence data warehousing, and analysis and reporting
• Assumes a balanced composition of developers fluent in Hadoop and SQL
Use Case 2:
• Hadoop as an operational data store
• HP Vertica for data augmentation of data in Hadoop
• Assumes more SQL developers than Hadoop developers
• Leverages the strength of team mix
Use Case 3:
• Data federation across Hadoop and HP Vertica
• Variety of user interfaces for data interaction and an analysis data store
HDFS for Storage, HP Vertica + Hadoop for Analytics
• Real-time analytics on HP Vertica (needs speed)
• Long-running/exploratory analytics on Hadoop (needs fault tolerance)
• Load from HDFS directly to HP Vertica
• HP Vertica SQL access to HDFS
HP Vertica - Hadoop Connector
• Allows flexibility & interoperability
• Integrates with Hadoop/MapReduce and Pig
– HP Vertica-aware extension to Hadoop
– Specialized adapter for distributed streaming between Hadoop and HP Vertica
• Developers need access to a fast DBMS that co-exists with Hadoop rather than being embedded
– Operated on different clusters, generally by different groups of people
– Allows customers to scale computation independent of DBMS
Hadoop / HP Vertica: Advanced Analytics
[Diagram: MapReduce / Pig job with DFS Blocks 1-3 feeding Map tasks, a Reduce step, and HP Vertica]
Hadoop / Vertica: ETL
Native Load and Query from HDFS
HDFS File
HP Vertica
Goal:
‐ Query data residing on HDFS directly from Vertica
Method:
‐ Develop a User‐Defined Loader for HDFS data files
‐ Define an External Table for a “virtual table” view of HDFS data
Benefits:
‐ Simple, direct integration with HDFS (no MapReduce)
‐ Data remains in Hadoop – no synchronization required
‐ Queries access latest information in HDFS
Goal:
‐ Load data staged on HDFS into BI schema in Vertica
Method:
‐ Develop a User‐Defined Loader for HDFS data files
‐ Load data directly into Vertica from HDFS
Benefits:
‐ Simple, direct integration with HDFS (no MapReduce)
‐ Data stored in Vertica’s query‐optimized format for near real‐time analysis and reporting
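The difference between the two methods above (querying in place through an external table versus loading a copy) can be sketched in Python. A local CSV file stands in for an HDFS data file here, and the class names `ExternalTable` and `LoadedTable` are hypothetical; the sketch only illustrates when the data is read, not how HP Vertica implements either path.

```python
import csv
import os
import tempfile

def read_rows(path):
    """Read every row of a CSV file (a stand-in for an HDFS data file)."""
    with open(path, newline="") as f:
        return [row for row in csv.reader(f)]

class ExternalTable:
    """Keeps only the file location and reads the file at query time."""
    def __init__(self, path):
        self.path = path

    def query(self):
        return read_rows(self.path)

class LoadedTable:
    """Copies the rows once at load time and queries the local copy."""
    def __init__(self, path):
        self.rows = read_rows(path)

    def query(self):
        return list(self.rows)

def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "events.csv")
        with open(path, "w", newline="") as f:
            f.write("1,a\n")
        external, loaded = ExternalTable(path), LoadedTable(path)
        # New data lands in the file after both tables were defined.
        with open(path, "a", newline="") as f:
            f.write("2,b\n")
        return external.query(), loaded.query()

external_rows, loaded_rows = demo()
print(external_rows)  # includes the row appended later
print(loaded_rows)    # snapshot taken at load time
```

The external table always reflects the latest file contents, while the loaded table trades freshness for a local copy that can be stored in a query-optimized format.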
External Table
Custom Connectors with User Defined Load
• Override any part of HP Vertica’s normal load process
• Source (stream data from any source)
• Filter (transform data to a new format)
• Parser (convert data stream into database tuples)
• E.g., use source and filter to load audio data directly into Vertica:
COPY music (filename AS 'Sample', time_index, data filler int, L AS data, R AS data)
FIXEDWIDTH COLSIZES (17, 18)
WITH SOURCE ExternalSource(cmd='arecord -d 10')
FILTER ExternalFilter(cmd='sox --type wav - --type dat -');
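The three stages of a user-defined load (source, filter, parser) compose naturally as a streaming pipeline. The Python sketch below mimics that shape with generators; it is not the HP Vertica SDK, and the stage names, the `|` delimiter, and the carriage-return filter are assumptions chosen to keep the example small.

```python
def source(chunks):
    """Source: stream raw bytes from wherever the data lives."""
    for chunk in chunks:
        yield chunk

def crlf_filter(stream):
    """Filter: rewrite the stream in a new format (drop carriage returns)."""
    for chunk in stream:
        yield chunk.replace(b"\r", b"")

def parser(stream, delimiter="|"):
    """Parser: convert the byte stream into tuples, one per line."""
    buffer = b""
    for chunk in stream:
        buffer += chunk
        # Keep the trailing partial line in the buffer for the next chunk.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line:
                yield tuple(line.decode().split(delimiter))
    if buffer:
        yield tuple(buffer.decode().split(delimiter))

def load(chunks):
    # Stages are chained lazily, so rows flow through without staging files.
    return list(parser(crlf_filter(source(chunks))))

rows = load([b"1|al", b"pha\r\n2|be", b"ta\r\n"])
print(rows)
```

Because each stage only sees a stream, any one of them can be swapped out independently, which is the point of being able to override a single part of the load process.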