Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

Academic year: 2021

(1)

Session 1: IT Infrastructure Security

James Campbell

Corporate Systems Engineer

HP – Vertica

[email protected]

Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

(2)

Big Data - Revisited

• Are the terms “Big Data” and Hadoop synonymous?

• What are the primary drivers for government agencies in addressing Big Data?

• What other types of tools are available to work with Big Data?

(3)

The Big Data Challenge

BIG DATA: Volume, Variety, Velocity, Value

1000x

Data types: Social Media, Video, Audio, Email, Texts, Mobile, Transactional Data, Machine/Sensor, Docs, Search Engine, Images

Diverse Users

Ad Hoc Questions

New Solutions Needed

(4)

In Data, There is Gold

• What value are you looking to find in your data?

• How fast do you need to find gold?

• Make sure you don’t get fool’s gold

(5)

Approaches to Finding Gold in Big Data

Mining for Gold in Big Data

• Analysis and reporting are not the same thing

– Organizations should not equate reporting with analysis

– Reporting Environments

• Select reports to run

• Execute reports

• View results

• Analysis is an interactive process of analyzing data

– Frame research/investigation question

– Identify data requirements

– Analyze data (interactive process)

– Interpret the results

Reporting: inflexible, predefined

Analysis: flexible, custom, focused on finding answers

(6)

Reporting Vs. Analytics

Reporting

• Standard views of data

• Answers a standard set of questions

• Does not require a human

• Is inflexible

Analytics

• Interactive Process

– Correctly Frame Problem

– Collection of Data

– Analyze Data

– Interpret Results

• Provides answers

• Customized

• Involves human interaction

• Flexible

• Real‐Time

(7)

Analytic Pain Points

• Low performance

• Limited functionality

• Complexity in deployment and use

• Results are not delivered in time to meet analytic demands

• Does not interoperate with other big data platforms (e.g., Hadoop)

• Skilled labor requirements of newer technologies

• Older technologies unable to answer “big data” challenges

(8)

Hadoop Answers Many Big Data Challenges

Varied Data Structures

Large Data Volumes

Rich Set of Analytics

Varied Data Sources

Quick analysis of complex relationships

Interactive Analysis

Performance Enhanced Queries

(9)

Hadoop Architectural Components

Process Layer

Map Reduce

Map Step – Create key/value tuples

Reduce Step – Receives sorted key/value tuples and runs a user-provided program
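The two steps above can be illustrated with a small in-memory sketch. This is not Hadoop code: `map_step`, `reduce_step`, and `run_job` are hypothetical names chosen for this example, and a word count stands in for the user-provided program.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple


def map_step(record: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit one (key, value) tuple per word in the record."""
    for word in record.lower().split():
        yield (word, 1)


def reduce_step(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: the user-provided program, here a simple sum."""
    return (key, sum(values))


def run_job(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> List[Tuple[str, int]]:
    # Shuffle: group every mapped value under its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # The reducer is called once per key, with keys visited in sorted order.
    return [reducer(key, groups[key]) for key in sorted(groups)]


counts = run_job(["big data", "Big Data challenges"], map_step, reduce_step)
print(counts)  # [('big', 2), ('challenges', 1), ('data', 2)]
```

In a real cluster the map and reduce calls run in parallel on different nodes; the single-process loop here only shows the data flow.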

Storage Layer

HDFS – Cluster file system written in Java that sits on top of the host file system

Other Storage – Amazon S3, CloudStore, FTP Filesystem, other distributed file systems available through file:// URL

Job Manager

Job Manager – Manages jobs, which include tasks across all nodes

Task Manager – Manages each individual task (could be one or more per node)

Resource Management – Added in latest Hadoop release

(10)

Hadoop Key-Value and Database Storage Systems

Client Applications – Clients can be SQL, NoSQL, programs, etc.

Key Value / Database System – May create files, indexes, depending on Apache project

Map / Reduce – May use Map/Reduce framework

HDFS or other Distributed File System – HDFS provides underlying distributed storage mechanism

(11)

Choosing The Right Tool for the Job

• Vertica for Interactive and Real-time Analytics

• Hadoop for Long-running Batch Analytics (fault tolerance)

• MapReduce works best when there is a large set of input data where only a small portion of the data is required for analysis

(12)

A Platform Designed for Big Data

Real Time Massively Parallel Processing

Native and Performance Optimized High Availability

Native Columnar RDBMS

Columnar Compression

Concurrent Load & Query

Elastic Cluster

SQL Analytics

User-Defined Analytics

Optimized Connectors

Standard Interface

Next Generation Administration and Design Tools

(13)

What Analytics can HP Vertica handle?

SQL

• SQL analytic functions

• Graph

• Monte Carlo

• Statistical

• Geospatial

Extended SQL

• Sessionization

• Time series

• Pattern matching

• Event series joins

SDKs

• C++

• R

• ?

Check out: https://github.com/vertica/Vertica-Extension-Packages

(14)

SQL Analytics+ - Built for Big Data

Features

• Time series gap filling and interpolation

• Event-based window functions and sessionization

• Pattern matching

• Event series join

• Statistical functions

• Geospatial functions
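To make "gap filling and interpolation" concrete, here is a minimal sketch in plain Python rather than Vertica SQL. It assumes sorted (time, value) samples with integer timestamps and linear interpolation; `gap_fill` is a hypothetical helper for this example and does not reflect how Vertica implements the feature.

```python
from typing import List, Tuple


def gap_fill(samples: List[Tuple[int, float]], step: int) -> List[Tuple[int, float]]:
    """Resample sorted (time, value) points onto a regular grid of `step` units,
    linearly interpolating between the two surrounding observations."""
    if not samples:
        return []
    filled = []
    idx = 0
    t = samples[0][0]
    while t <= samples[-1][0]:
        # Move to the last observation at or before t.
        while idx + 1 < len(samples) and samples[idx + 1][0] <= t:
            idx += 1
        t0, v0 = samples[idx]
        if t0 == t or idx + 1 == len(samples):
            value = v0  # an observation exists exactly at t
        else:
            t1, v1 = samples[idx + 1]
            value = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        filled.append((t, value))
        t += step
    return filled


ticks = [(0, 10.0), (3, 16.0), (4, 20.0)]
filled_ticks = gap_fill(ticks, 1)
print(filled_ticks)
# [(0, 10.0), (1, 12.0), (2, 14.0), (3, 16.0), (4, 20.0)]
```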

Benefits

• High performance (Keep Data close to CPU)

• Low cost (Industry Standard building blocks)

• Ease of use (Automated + Available)

Use Cases

Tickstore data cleanups

CDR/VOD data analysis

Clickstream sessionization

Data aggregation and compression

Monte Carlo simulation

Social graph analysis

Sensor Data

SmartGrid

Predictive maintenance
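Clickstream sessionization, listed among the use cases above, can be sketched as follows. This is an illustration in Python, not Vertica's event-based window functions; the `sessionize` helper and the 30-unit timeout are assumptions made for the example. A new session starts whenever the gap since the same user's previous click exceeds the timeout.

```python
from typing import Dict, List, Tuple


def sessionize(events: List[Tuple[str, int]], timeout: int) -> List[Tuple[str, int, int]]:
    """Return (user, timestamp, session_id) rows, ordered by user and time.

    Session ids start at 0 for each user and increase by one whenever the gap
    since that user's previous event is larger than `timeout`.
    """
    last_seen: Dict[str, int] = {}
    session_id: Dict[str, int] = {}
    out = []
    for user, ts in sorted(events):
        if user not in last_seen:
            session_id[user] = 0
        elif ts - last_seen[user] > timeout:
            session_id[user] += 1
        last_seen[user] = ts
        out.append((user, ts, session_id[user]))
    return out


clicks = [("u1", 0), ("u1", 20), ("u2", 5), ("u1", 100), ("u2", 400)]
sessions = sessionize(clicks, timeout=30)
for row in sessions:
    print(row)
```

Here user `u1` gets two sessions (clicks at 0 and 20, then a click at 100), and `u2` also gets two.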

(15)

Vertica Cluster

User-Defined Extensions in R

• What is R?

Open source language for statistical computing

Wide range of packages available for advanced data mining and statistical analysis

• Advantages of UDx in R

HP Vertica automatically parallelizes the execution of user-defined R code

Optimized data transfer between HP Vertica and R

(16)

UDx in R Example: K-Means Clustering

Function Setup + Usage

-- Define function
CREATE LIBRARY rlib AS '/path/rcode.R' LANGUAGE 'R';
CREATE TRANSFORM FUNCTION Kmeans
  AS LANGUAGE 'R' NAME 'kmeansCluFactory' LIBRARY rlib;

-- Use function
CREATE TABLE point_data ( x FLOAT, y FLOAT );
SELECT Kmeans(x, y) OVER() FROM point_data;

R Source Code

# Example: K-means (k=5)
# Input: two-dimensional points
# Output: the point coordinates plus their assigned cluster
kmeansClu <- function(x) {
  # 5 centers, at most 10 iterations
  cl <- kmeans(x, 5, 10)
  res <- data.frame(x[, 1:2], cl$cluster)
  res
}

# Factory: tells HP Vertica which R function to call and its input/output types
kmeansCluFactory <- function() {
  list(name = kmeansClu,
       udxtype = c("transform"),
       intype = c("float", "float"),
       outtype = c("float", "float", "int"),
       outnames = c("x", "y", "cluster"))
}

(17)

HP Vertica and Hadoop are Complementary

HP Vertica

• Designed for performance

• Interactive analytics

• A rich SQL ecosystem

Hadoop

• Designed for fault‐tolerance

• Batch analytics

• A rich Programming Model

Both: Purpose-Built Scalable Analytics Platforms

(18)

Hadoop + HP Vertica: Joint Use Cases

Use Case 1:

Hadoop for data integration, transformation, and data quality management

HP Vertica for structured analytics, traditional business intelligence data warehousing, and analysis and reporting

Assumes a balanced composition of developers fluent in Hadoop and SQL

Use Case 2:

Hadoop as an operational data store

HP Vertica for data augmentation of data in Hadoop.

Assumes more SQL developers than Hadoop developers

Leverages the strength of team mix

Use Case 3:

Data federation across Hadoop and HP Vertica

Variety of user interfaces for data interaction and an analysis data store

HDFS for Storage, HP Vertica + Hadoop for Analytics

Real-time analytics on HP Vertica (needs speed)

Long-running/exploratory analytics on Hadoop (needs fault tolerance)

Load from HDFS directly to HP Vertica

HP Vertica SQL access to HDFS

(19)

HP Vertica - Hadoop Connector

• Allows flexibility & interoperability

• Integrate with Hadoop/MapReduce and Pig

– HP Vertica-aware extension to Hadoop

– Specialized adapter for distributed streaming between Hadoop and HP Vertica

• Developers need access to fast DBMS that co-exists with Hadoop rather than being embedded

– Operate on different clusters, generally by different groups of people

– Allows customers to scale computation independent of DBMS

[Figure: Hadoop / HP Vertica: Advanced Analytics. Elements shown: MapReduce / Pig job (DFS Blocks 1–3, Map, Reduce), HP Vertica]

[Figure: Hadoop / Vertica: ETL. Elements shown: MapReduce / Pig job (DFS Blocks 1–3, Map, Reduce), HDFS file, HP Vertica]

(20)

Native Load and Query from HDFS

Query from HDFS (External Table)

Goal:

- Query data residing on HDFS directly from Vertica

Method:

- Develop a User-Defined Loader for HDFS data files

- Define External Table for a “virtual table” view of HDFS data

Benefits:

- Simple, direct integration with HDFS (no MapReduce)

- Data remains in Hadoop; no synchronization required

- Queries access latest information in HDFS

Load from HDFS

Goal:

- Load data staged on HDFS into BI schema in Vertica

Method:

- Develop a User-Defined Loader for HDFS data files

- Load data directly into Vertica from HDFS

Benefits:

- Simple, direct integration with HDFS (no MapReduce)

- Data stored in Vertica’s query-optimized format for near real-time analysis and reporting
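The difference between querying HDFS data in place and loading it can be shown with a toy sketch. A local temporary file stands in for an HDFS file, and `ExternalTable` and `LoadedTable` are hypothetical classes for illustration only, not Vertica APIs. The external view re-reads the file on every query, while the loaded copy reflects the file as it was at load time.

```python
import os
import tempfile
from typing import List


class ExternalTable:
    """Reads the backing file on every query, so results track its current contents."""

    def __init__(self, path: str) -> None:
        self.path = path

    def query(self) -> List[str]:
        with open(self.path) as f:
            return [line.rstrip("\n") for line in f]


class LoadedTable:
    """Copies rows once at load time; later changes to the file are not visible."""

    def __init__(self, path: str) -> None:
        with open(path) as f:
            self.rows = [line.rstrip("\n") for line in f]

    def query(self) -> List[str]:
        return list(self.rows)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.txt")  # stand-in for a file on HDFS
    with open(path, "w") as f:
        f.write("a\n")

    external = ExternalTable(path)
    loaded = LoadedTable(path)

    # New data arrives in the file after the load.
    with open(path, "a") as f:
        f.write("b\n")

    external_rows = external.query()
    loaded_rows = loaded.query()

print(external_rows)  # ['a', 'b']
print(loaded_rows)    # ['a']
```

The trade-off matches the slide: the external view always sees the latest data, while the loaded copy can be stored in a query-optimized form.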

(21)

Custom Connectors with User Defined Load

• Override any part of HP Vertica’s normal load process

– Source (stream data from any source)

– Filter (transform data to a new format)

– Parser (convert data stream into database tuples)

• E.g. Use source and filter to load audio data directly into Vertica:

COPY music (filename AS 'Sample', time_index, data FILLER int, L AS data, R AS data)
FIXEDWIDTH COLSIZES (17, 18)
WITH SOURCE ExternalSource(cmd='arecord -d 10')
FILTER ExternalFilter(cmd='sox --type wav - --type dat -');

Read: http://www.vertica.com/2012/07/09/on-the-trail-of-a-red-tailed-hawk-part-2/
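The Source, Filter, and Parser stages described on this slide can be pictured as a chain of generators. The sketch below is plain Python, not the HP Vertica SDK: the function names and the three-column text format are assumptions for illustration. Each stage consumes the previous stage's output as a stream.

```python
from typing import Iterable, Iterator, Tuple


def source(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Source stage: stream raw bytes from anywhere (here, an in-memory list)."""
    for chunk in chunks:
        yield chunk


def line_filter(stream: Iterable[bytes]) -> Iterator[str]:
    """Filter stage: transform raw byte chunks into complete text lines."""
    buffer = ""
    for chunk in stream:
        buffer += chunk.decode("utf-8")
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            yield line
    if buffer:
        yield buffer  # final line without a trailing newline


def parser(lines: Iterable[str]) -> Iterator[Tuple[float, float, float]]:
    """Parser stage: convert each line into a (time_index, left, right) tuple."""
    for line in lines:
        t, left, right = line.split()
        yield (float(t), float(left), float(right))


# Chunk boundaries do not need to align with row boundaries.
raw = [b"0.0 0.10 -0.10\n0.5 0.", b"20 -0.20\n"]
rows = list(parser(line_filter(source(raw))))
print(rows)  # [(0.0, 0.1, -0.1), (0.5, 0.2, -0.2)]
```

Because each stage is independent, any one of them can be swapped out, which is the point the slide makes about overriding any part of the load process.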

(22)

QUESTIONS?

Jim Campbell HP Vertica

[email protected] [email protected] P: 703-753-5970
