Session 1: IT Infrastructure Security - Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

(1)

Session 1: IT Infrastructure Security

James Campbell

Corporate Systems Engineer

HP – Vertica

[email protected]

Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

(2)

Big Data - Revisited

• Are the terms “Big Data” and Hadoop synonymous?

• What are the primary drivers for government agencies in addressing Big Data?

• What other types of tools are available to work with Big Data?

(3)

The Big Data Challenge

Volume, Variety, Velocity, Value

1000x

Data types: Social Media, Video, Audio, Email, Texts, Mobile, Transactional Data, Machine/Sensor, Docs, Search Engine, Images

New Solutions Are Needed

• Diverse Users

• Ad Hoc Questions

(4)

In Data, There is Gold

• What value are you looking to find in your data?

• How fast do you need to find gold?

• Make sure you don’t get fool’s gold

(5)

Approaches to Finding Gold in Big Data

Mining for Gold in Big Data

• Analysis and reporting are not the same thing

– Organizations should not equate reporting with analysis

– Reporting Environments

• Select reports to run

• Execute reports

• View results

• Analysis is an interactive process of analyzing data

– Frame research/investigation question

– Identify data requirements

– Analyze data (interactive process)

– Interpret the results

Reporting is:

• Inflexible

• Predefined

Analysis is:

• Flexible

• Custom

• Focused on finding answers

(6)

Reporting Vs. Analytics

Reporting

• Standard views of data

• Answers a standard set of questions

• Does not require a human

• Is inflexible

Analytics

• Interactive Process

– Correctly Frame Problem

– Collection of Data

– Analyze Data

– Interpret Results

• Provides answers

• Customized

• Involves human interaction

• Flexible

• Real‐Time

(7)

Analytic Pain Points

• Low performance

• Limited functionality

• Complexity in deployment and use

• Results are not delivered in time to meet the demand for analytics

• Does not interoperate with other big data platforms (e.g., Hadoop)

• Skilled labor requirements of newer technologies

• Older technologies unable to answer “big data” challenges

(8)

Hadoop Answers Many Big Data Challenges

• Varied Data Structures

• Large Data Volumes

• Rich Set of Analytics

• Varied Data Sources

• Quick analysis of complex relationships

• Interactive Analysis

• Performance

• Enhanced Queries

(9)

Hadoop Architectural Components

Process Layer

Map Reduce

• Map Step – Creates key/value tuples

• Reduce Step – Receives sorted key/value tuples and runs a user-provided program

Storage Layer

• HDFS – Cluster file system written in Java that sits on top of the host file system

• Other Storage – Amazon S3, CloudStore, FTP Filesystem, and other distributed file systems available through a file:// URL

Job Manager

• Job Manager – Manages jobs, which include tasks across all nodes

• Task Manager – Manages each individual task (could be one or more per node)

• Resource Management – Added in latest Hadoop release
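The map and reduce steps described above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the programming model, not Hadoop code: the function names (map_step, shuffle_sort, reduce_step, run_job) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_step(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit one (key, value) tuple per word in the input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_sort(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Group values by key and sort by key, as the framework does between steps."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_step(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: the user-provided program; here it sums the counts for one key."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle/sort, and reduce over all input lines in one process."""
    pairs = (pair for line in lines for pair in map_step(line))
    return dict(reduce_step(key, values) for key, values in shuffle_sort(pairs))


if __name__ == "__main__":
    print(run_job(["big data", "big cluster"]))
```

In a real cluster the map calls run in parallel on the nodes that hold each block of input, and the grouping step moves data across the network; the sketch keeps only the data flow.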

(10)

Hadoop Key-Value and Database Storage Systems

• HDFS or other Distributed File System – HDFS provides the underlying distributed storage mechanism

• Key Value / Database System – May create files, indexes, depending on Apache project

• Client Applications – Clients can be SQL, NoSQL, programs, etc.

• Map / Reduce Framework – May use Map/Reduce

(11)

Choosing The Right Tool for the Job

• Vertica for Interactive and Real-time Analytics

• Hadoop for Long-running Batch Analytics (fault tolerance)

• MapReduce works best when there is a large set of input data where only a small portion of the data is required for analysis

(12)

A Platform Designed for Big Data

Real Time Massively Parallel Processing

Native and Performance Optimized High Availability

Native Columnar RDBMS

• Columnar Compression

• Concurrent Load & Query

• Elastic Cluster

• SQL Analytics

• User‐Defined Analytics

• Optimized Connectors

• Standard Interface

Next Generation Administration and Design Tools

(13)

What Analytics can HP Vertica handle?

SQL

• SQL analytic functions

• Graph

• Monte Carlo

• Statistical

• Geospatial

Extended SQL

• Sessionization

• Time series

• Pattern matching

• Event series joins

SDKs

• C++

• R

• ?

Check out: https://github.com/vertica/Vertica-Extension-Packages

(14)

SQL Analytics⁺ - Built for Big Data

Features

• Time series gap filling and interpolation

• Event-based window functions and sessionization

• Pattern matching

• Event series join

• Statistical functions

• Geospatial functions
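To make the first feature concrete, here is a minimal sketch of time series gap filling with linear interpolation. It illustrates the idea only and is not HP Vertica's implementation; the function name gap_fill_linear, the integer timestamps, and the choice to return None outside the observed range are assumptions made for the example.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def gap_fill_linear(
    samples: List[Tuple[int, float]], start: int, stop: int, step: int
) -> List[Tuple[int, Optional[float]]]:
    """Resample irregular (time, value) samples onto a regular time grid.

    `samples` must be sorted by time with unique timestamps. Grid points
    between two samples are linearly interpolated; grid points before the
    first sample or after the last one get None.
    """
    times = [t for t, _ in samples]
    filled: List[Tuple[int, Optional[float]]] = []
    for t in range(start, stop + 1, step):
        i = bisect_right(times, t)  # number of samples at or before t
        if i == 0:
            filled.append((t, None))  # nothing observed yet
            continue
        t0, v0 = samples[i - 1]
        if t0 == t:
            filled.append((t, v0))  # exact observation, no fill needed
        elif i == len(samples):
            filled.append((t, None))  # no later sample to interpolate toward
        else:
            t1, v1 = samples[i]
            filled.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
    return filled


if __name__ == "__main__":
    ticks = [(0, 10.0), (3, 16.0), (4, 20.0)]
    for row in gap_fill_linear(ticks, start=0, stop=4, step=1):
        print(row)
```

A constant (carry the last value forward) fill would replace the interpolation line with v0; which one is appropriate depends on the data, for example last-traded price versus a smoothly varying sensor reading.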

Benefits

• High performance (Keep Data close to CPU)

• Low cost (Industry Standard building blocks)

• Ease of use (Automated + Available)

Use Cases

‒ Tickstore data cleanups

‒ CDR/VOD data analysis

‒ Clickstream sessionization

‒ Data aggregation and compression

‒ Monte Carlo simulation

‒ Social graph analysis

‒ Sensor Data

‒ SmartGrid

‒ Predictive maintenance

‒ …
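Clickstream sessionization, one of the use cases above, can be sketched as follows. This is a hedged illustration of the technique rather than the SQL that HP Vertica provides: the function name sessionize, the (user, timestamp) event shape, and the 30-unit timeout are assumptions for the example.

```python
from typing import Iterable, List, Tuple


def sessionize(
    events: Iterable[Tuple[str, int]], timeout: int
) -> List[Tuple[str, int, int]]:
    """Assign a per-user session id to each (user, timestamp) event.

    Events are ordered by user and then by time. A new session starts
    whenever the gap since the same user's previous event exceeds `timeout`.
    Session ids restart at 0 for every user.
    """
    result: List[Tuple[str, int, int]] = []
    last_user = None
    last_ts = 0
    session = 0
    for user, ts in sorted(events):
        if user != last_user:
            session = 0  # first event seen for this user
        elif ts - last_ts > timeout:
            session += 1  # idle gap too long: open a new session
        result.append((user, ts, session))
        last_user, last_ts = user, ts
    return result


if __name__ == "__main__":
    clicks = [("a", 0), ("a", 10), ("a", 100), ("b", 5), ("a", 105)]
    for row in sessionize(clicks, timeout=30):
        print(row)
```

In a database the same logic is a window computation partitioned by user and ordered by time, which is why it parallelizes well across users.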

(15)

Vertica Cluster

User-Defined Extensions in R

• What is R?

– Open source language for statistical computing

– Wide range of packages available for advanced data mining and statistical analysis

• Advantages of UDx in R

– HP Vertica automatically parallelizes the execution of user-defined R code

– Optimized data transfer between HP Vertica and R

(16)

UDx in R Example: K-Means Clustering

Function Setup + Usage

    -- Define function
    CREATE LIBRARY rlib AS '/path/rcode.R' LANGUAGE 'R';
    CREATE TRANSFORM FUNCTION Kmeans
      AS LANGUAGE 'R' NAME 'kmeansCluFactory' LIBRARY rlib;

    -- Use function
    CREATE TABLE point_data ( x FLOAT, y FLOAT );
    SELECT Kmeans(x, y) OVER() FROM point_data;

R Source Code

    # Example: K-means (k=5)
    # Input: two-dimensional points
    # Output: the point coordinates plus their assigned cluster
    kmeansClu <- function(x) {
      cl <- kmeans(x, 5, 10)                  # 5 clusters, at most 10 iterations
      res <- data.frame(x[,1:2], cl$cluster)  # x, y, cluster id
      res
    }

    # Factory function: tells HP Vertica the UDx type and its signature
    kmeansCluFactory <- function() {
      list(name=kmeansClu,
           udxtype=c("transform"),
           intype=c("float","float"),
           outtype=c("float","float","int"),
           outnames=c("x","y","cluster"))
    }
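For readers unfamiliar with what the R kmeans call does to each batch of points, the following is a minimal pure-Python sketch of the standard assign-and-update (Lloyd) iteration. It is not the algorithm variant R uses by default and not part of the UDx: seeding the centers from the first k points, the 0-based cluster ids, and the function name kmeans_2d are assumptions made to keep the example deterministic.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def kmeans_2d(
    points: Sequence[Point], k: int, max_iter: int = 10
) -> List[Tuple[float, float, int]]:
    """Cluster 2-D points and return (x, y, cluster) rows like the UDx output."""
    if len(points) < k:
        raise ValueError("need at least k points")
    centers: List[Point] = list(points[:k])  # assumption: seed from first k points
    labels: Optional[List[int]] = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers would not move
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    assert labels is not None
    return [(x, y, label) for (x, y), label in zip(points, labels)]


if __name__ == "__main__":
    data = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
    for row in kmeans_2d(data, k=2):
        print(row)
```

The output rows have the same shape as the outnames declared in the factory function (x, y, cluster), which is what the SELECT Kmeans(x, y) OVER() query returns.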

(17)

HP Vertica and Hadoop are Complementary

HP Vertica

• Designed for performance

• Interactive analytics

• A rich SQL ecosystem

Hadoop

• Designed for fault‐tolerance

• Batch analytics

• A rich Programming Model

Both: Purpose‐Built Scalable Analytics Platforms

(18)

Hadoop + HP Vertica: Joint Use Cases

Use Case 1:

• Hadoop for data integration, transformation, and data quality management

• HP Vertica for structured analytics, traditional business intelligence data warehousing, and analysis and reporting

• Assumes a balanced mix of developers fluent in Hadoop and SQL

Use Case 2:

• Hadoop as an operational data store

• HP Vertica for augmenting data held in Hadoop

• Assumes more SQL developers than Hadoop developers

• Leverages the strength of the team mix

Use Case 3:

• Data federation across Hadoop and HP Vertica

• Variety of user interfaces for data interaction and an analysis data store

HDFS for Storage, HP Vertica + Hadoop for Analytics

• Real-time analytics on HP Vertica (needs speed)

• Long-running/exploratory analytics on Hadoop (needs fault tolerance)

• Load from HDFS directly to HP Vertica

• HP Vertica SQL access to HDFS

(19)

HP Vertica - Hadoop Connector

• Allows flexibility & interoperability

• Integrates with Hadoop/MapReduce and Pig

– HP Vertica-aware extension to Hadoop

– Specialized adapter for distributed streaming between Hadoop and HP Vertica

• Developers need access to a fast DBMS that co-exists with Hadoop rather than being embedded

– Operate on different clusters, generally by different groups of people

– Allows customers to scale computation independent of DBMS

Diagram: Hadoop / HP Vertica: Advanced Analytics (HP Vertica, DFS Blocks 1-3, MapReduce / Pig Job with Map and Reduce steps)

Diagram: Hadoop / Vertica: ETL (HDFS File, DFS Blocks 1-3, MapReduce / Pig Job with Map and Reduce steps, HP Vertica)

Native Load and Query from HDFS

External Table

Goal:

‐ Query data residing on HDFS directly from Vertica

Method:

‐ Develop User‐Defined Loader to HDFS data files

‐ Define External Table for a “virtual table” view of HDFS data

Benefits:

‐ Simple, direct integration with HDFS (no MapReduce)

‐ Data remains in Hadoop – no synchronization required

‐ Queries access latest information in HDFS

Goal:

‐ Load data staged on HDFS into BI schema in Vertica

Method:

‐ Develop User‐Defined Loader to HDFS data files

‐ Load data directly into Vertica from HDFS

Benefits:

‐ Simple, direct integration with HDFS (no MapReduce)

‐ Data stored in Vertica’s query‐optimized format for near real‐time analysis and reporting
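The practical difference between the two approaches on this slide (query in place versus load a copy) can be shown with a small sketch. It uses a local CSV file as a stand-in for an HDFS file and involves no Vertica or Hadoop APIs; the names ExternalTable, load_table, and demo are hypothetical and exist only for this illustration.

```python
import csv
import os
import tempfile
from typing import List, Tuple

Row = Tuple[str, ...]


def read_rows(path: str) -> List[Row]:
    """Parse a CSV file into a list of row tuples."""
    with open(path, newline="") as handle:
        return [tuple(record) for record in csv.reader(handle)]


class ExternalTable:
    """A "virtual table": keeps only the file location and re-reads it per query."""

    def __init__(self, path: str) -> None:
        self.path = path

    def rows(self) -> List[Row]:
        return read_rows(self.path)  # always sees the latest file contents


def load_table(path: str) -> List[Row]:
    """A loaded table: copies the rows once; later file changes are not seen."""
    return read_rows(path)


def demo() -> Tuple[int, int]:
    """Return (rows seen by the external table, rows in the loaded copy)."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "events.csv")
        with open(path, "w", newline="") as handle:
            csv.writer(handle).writerow(["1", "login"])
        external = ExternalTable(path)
        loaded = load_table(path)
        # New data arrives in the source file after the load.
        with open(path, "a", newline="") as handle:
            csv.writer(handle).writerow(["2", "logout"])
        return len(external.rows()), len(loaded)


if __name__ == "__main__":
    print(demo())
```

The external table pays the parsing cost on every query but never goes stale; the loaded copy is parsed once and can be stored in a query-optimized layout, at the price of needing a reload when the source changes.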

(21)

Custom Connectors with User Defined Load

• Override any part of HP Vertica’s normal load process

– Source (stream data from any source)

– Filter (transform data to a new format)

– Parser (convert data stream into database tuples)

• E.g. Use source and filter to load audio data directly into Vertica:

    COPY music (filename AS 'Sample', time_index, data filler int, L AS data, R AS data)
    FIXEDWIDTH COLSIZES (17, 18)
    WITH SOURCE ExternalSource(cmd='arecord -d 10')
    FILTER ExternalFilter(cmd='sox --type wav - --type dat -');


Read: http://www.vertica.com/2012/07/09/on-the-trail-of-a-red-tailed-hawk-part-2/
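The three load stages named on this slide (source, filter, parser) compose naturally as a chain of streaming steps. The sketch below shows that shape with Python generators; it does not use the HP Vertica SDK, and the stage functions, the zlib-compressed input, and the comma-delimited record format are all assumptions chosen for the example.

```python
import zlib
from typing import Iterable, Iterator, List, Tuple


def source(blob: bytes, chunk_size: int = 8) -> Iterator[bytes]:
    """Source: stream raw bytes in small chunks, as if read from a device or socket."""
    for offset in range(0, len(blob), chunk_size):
        yield blob[offset:offset + chunk_size]


def filter_decompress(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Filter: transform the stream to a new format (here, zlib to plain text)."""
    decompressor = zlib.decompressobj()
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


def parser(chunks: Iterable[bytes], delimiter: str = ",") -> Iterator[Tuple[str, ...]]:
    """Parser: turn the byte stream into tuples, one per newline-terminated record."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        *lines, buffer = buffer.split(b"\n")  # keep any partial record for later
        for line in lines:
            yield tuple(line.decode().split(delimiter))
    if buffer:
        yield tuple(buffer.decode().split(delimiter))


def load(blob: bytes) -> List[Tuple[str, ...]]:
    """Chain the stages: source -> filter -> parser."""
    return list(parser(filter_decompress(source(blob))))


if __name__ == "__main__":
    compressed = zlib.compress(b"0.0,1,2\n0.1,3,4\n")
    print(load(compressed))
```

Because each stage only consumes an iterator and yields another, any one of them can be swapped out independently, which mirrors the slide's point that any part of the normal load process can be overridden.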

(22)

QUESTIONS?

Jim Campbell HP Vertica

[email protected] [email protected] P: 703-753-5970