Teradata Unified Big Data Architecture

(1)

(2)

• Recap the challenges of Big Analytics

• The 2 analytical gaps for most enterprises

• Teradata Unified Data Architecture

- How we bridge the gaps

- The 3 core elements of the architecture

- Teradata’s solutions in the architecture

• Bring it all together Teradata, Teradata Aster, and

Hadoop.

(3)

Recap of the Big Data

Analytics Challenge

(4)

New and Emerging Sources of Data

BIG DATA

User Generated

Content

Mobile Web

SMS/MMS

Sentiment

External Demographics

HD Video

Speech to Text

Product/Service Logs

Social Network

Business Data Feeds

Petabytes

User Click Stream

Web logs

Web

Offer history

A/B testing

Dynamic Pricing

Affiliate Networks

Search marketing

Behavioral Targeting

Dynamic Funnels

Terabytes

Segmentation

Offer details

Customer Touches

Support Contacts

CRM

Gigabytes

Megabytes

Purchase record

Purchase detail

Payment record

ERP

And using an RDBMS/SQL alone is difficult or impossible

So it’s the data, right? …Yes

So it’s the analytics, right? …. Yes

So it’s the need for iterative visualisation …. Yes

(5)

Big Data Analytics …MORE Analytics on ALL the data

Enabling All Users, All Tools and Any Data for Capture to Analysis

Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualisation, etc.

Discover and Explore

Reporting and Execution

_{in the Enterprise}

Capture, Store and Refine

Audio/

(6)

Database and Analytic Processing Layer

The Big Data Architecture Today Has Gaps

Data Storage and

Refining

Data Warehouse

Audio/

Video

Images

Text

Web and

Social

Machine

Logs

CRM

SCM

ERP

Engineers

Data Scientists

Quants

Business Analysts

Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualisation, etc.

Gap 2: File system lacks

optimisers, data locality,

indexes

MapReduce

(Processing)

Gap 1:

(7)

Capture, Store, Refine

Teradata Unified Big Data

Architecture

for the Enterprise

Audio/

Video

Images

Text

Web and

Social

Machine

Logs

CRM

SCM

ERP

Engineers

Data Scientists

Quants

Business Analysts

Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualisation, etc.

Discovery Platform

_{Integrated Data}

Warehouse

SQL-H

Aster MapReduce Portfolio

Teradata SQL Analytics Portfolio

(8)

Teradata Aster Discovery Platform 5.10

Fastest path to big data apps and new business insights

• Integrated hardware and

software appliance

• Relational-data architecture can

be extended for non-relational types

and procedural M-R analytics

• SQL-MapReduce framework

• Analyze both multi-structured

complex and relational data

Growing the Development Bucket

• 70+ pre-built functions for data

acquisition, preparation, analysis &

visualization

• Richest Add-On Capabilities:

Attensity, Zementis, SAS, R

• Visual IDE & VM-based dev

environment: develop apps very fast

Data Scientists

Analysts

Customers

Business

Interactive & Visual Big Data Analytic Apps

Store

Process

Develop

SQL

Platform Services

(e.g. query planning, dynamic workload management, security…)

SQL-MapReduce®

Row Store

Column Store

Data Acquisition Module SQL-H Teradata RDBMS Data Preparation Module Unpack Pivot Apache Log Parser Analytics Module Pathing Graph Statistical Viz Module Flow Viz Hierarchy Viz Affinity Viz Partner & Add-On Modules Attensity Zementis SAS, R

(9)

Big Data Apps in Days – not Weeks or Months

Data

Scientists

Analysts

Customers

Business

DATA SOURCES

ASTER DISCOVERY PORTFOLIO

Structured

Data

Multi-Structured

Data

Hadoop

Data

OLTP

DBMS’s

Data

Acquisition

Module

• Hadoop access

• Teradata

access

• RDBMS access

Data

Preparation

Module

• Data Adaptors

• Data

Transformers

- JSON, XML,

Apache, etc

Analytics

Module

• Statistical

• Pattern

Matching

• Pathing

• Graph

Algorithms

• Text

• More …

Visualisation

Module

• Flow Visualizer

• Hierarchy

• Flow

• Sankey

• Affinity

• More ….

CUSTOM BIG ANALYTICS APPS

PACKAGED BIG ANALYTICS APPS

(10)

MapReduce vs. SQL - Reduce Function

Data output from Mass Spectrometer

335.2094368 0 335.2105961 0 335.2117553 0 335.2129146 53.024086 335.2140739 184.1607361 335.2152332 264.3601074 335.2163925 259.6187134 335.2175518 239.7870178 335.2187111 313.8243713 335.2198704 490.8760071 335.2210297 634.064209 335.222189 589.8432007 335.2233483 351.9743347 335.2245077 65.21440887 335.225671 0 336.890869 0 336.892037 75.75605011 336.893205 179.8110657 336.894373 247.535553 336.895541 225.6489563 336.8967091 140.6246338 337.1257588 0 337.1280972 86.48993683 337.1292664 170.0835876 337.1304357 215.8146362 337.1316049 188.9733276 337.1327741 110.2854233 337.1912444 0 337.192414 0 337.1935835 143.2112122 337.1947531 357.401123 337.1959227 467.1167297 337.1970923 411.569458 337.1982619 245.5514221 337.1994315 80.80451202

Detecting centroids of peaks is highly complex

using SQL as it is not a set based operation

(11)

SELECT file_id ,scan_id ,ren_tm ,ms_lvl ,mz ,i AS n_

,SUM(i) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS p_i ,(CASE WHEN (i > 0) THEN 1 ELSE 0 END) AS Ind

,(Ind - SUM(ind) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)) ,CAST((CASE WHEN B = 1 THEN CSUM(1,Ind)

WHEN B = 0 AND Ind = 1 THEN 0

ELSE NULL END) AS DECIMAL(38,0)) AS CurveID FROM dd_stg.mzml

WHERE ms_lvl = 1 ) WITH DATA

PRIMARY INDEX (mz)

Almost 800 lines of complex SQL

,(weighted_peak_mz * chrg) / 700000.000000000000000 AS delta_mz ,CASE WHEN (

(CASE WHEN

SUM((weighted_peak_mz * chrg)) OVER (PARTITION BY file_id, ms_lvl ORDER BY Weighted_peak_mz, scan_id ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)

BETWEEN ((weighted_peak_mz * chrg) - delta_mz) AND ((weighted_peak_mz * chrg) + delta_mz) THEN 'Y'

ELSE NULL END) = 'Y'

OR (CASE WHEN

SUM((weighted_peak_mz * chrg)) OVER (PARTITION BY file_id, ms_lvl ORDER BY Weighted_peak_mz, scan_id ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING)

BETWEEN ((weighted_peak_mz * chrg) - delta_mz) AND ((weighted_peak_mz * chrg) + delta_mz) THEN 'Y'

ELSE NULL END) = 'Y'

OR

(CASE WHEN

SUM((weighted_peak_mz * chrg)) OVER (PARTITION BY file_id, ms_lvl ORDER BY Weighted_peak_mz, scan_id ROWS BETWEEN 2 PRECEDING AND 2 PRECEDING)

BETWEEN ((weighted_peak_mz * chrg) - delta_mz) AND ((weighted_peak_mz * chrg) + delta_mz) THEN 'Y' ELSE NULL END) = 'Y' OR (CASE WHEN SELECT file_id,scan_id,ren_tm,ms_lvl,mz ,i

,CASE WHEN ind = 1 THEN SUM(CurveID+Mark) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz, ind ROWS UNBOUNDED PRECEDING)

ELSE NULL END AS CurveNum

FROM (SELECT file_id,scan_id,ren_tm,ms_lvl,mz,n_I AS i ,CASE

WHEN ( (CASE

WHEN n_i - p_i > 0 THEN 1 WHEN n_i - p_i < 0 THEN -1 ELSE 0

END) - SUM(CASE

WHEN n_i - p_i > 0 THEN 1 WHEN n_i - p_i < 0 THEN -1 ELSE 0

END) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) ) = 2 THEN 1 ELSE 0 END AS Mark ,Ind ,B

SELECT A.file_id ,A.ren_tm ,A.scan_id ,A.ms_lvl ,A.CurveNum A.Weighted_Peak_mz ,A.ren_tm ,A.sum_i ,A.ren_tm - B.ren_tm AS Diff_Ren_Tm

,A.Weighted_Peak_mz - B.Weighted_Peak_mz AS Diff_WP ,B.CurveNum AS L_CurveNum

,B.Weighted_Peak_mz AS L_Weighted_Peak_mz ,B.ren_tm AS L_ren_tm

,B.sum_i AS L_Sum_I

FROM DD_STG.S2_WEIGHTED_CURVE AS A

INNER JOIN DD_STG.S2_WEIGHTED_CURVE AS B

ON (A.Weighted_Peak_mz - B.Weighted_Peak_mz) BETWEEN 0.00000 AND 1.000000 AND A.ren_tm = B.ren_tm

AND A.CurveNum <> B.CurveNum AND B.max_i > (0.66667 * A.max_i)

,A.Weighted_Peak_mz - B.Weighted_Peak_mz AS Diff_WP ,B.CurveNum AS L_CurveNum

,B.Weighted_Peak_mz AS L_Weighted_Peak_mz ,B.ren_tm AS L_ren_tm

,B.sum_i AS L_Sum_I

FROM DD_STG.S2_WEIGHTED_CURVE AS A

INNER JOIN DD_STG.S2_WEIGHTED_CURVE AS B

ON (A.Weighted_Peak_mz - B.Weighted_Peak_mz) BETWEEN 0.00000 AND 1.000000 AND A.ren_tm = B.ren_tm

AND A.CurveNum <> B.CurveNum AND B.max_i > (0.66667 * A.max_i)

(12)

Procedural code … declared to the Aster as new

new MapReduce function called PeakPick

while (inputIterator.advanceToNextRow()) {

currIntensity=inputIterator.getDoubleAt(5); maxIntensity=0.0;

//Initialise Temp Array

for (int i=0; i <= 50; i++){ curveArray[0][i]=0; curveArray[1][i]=0; } if (overlapFlag==1){ count = 1; } else { count = 0; }

//Find start of Curve, lastintensity is 0

//or previous lastintensity is higher than lastintensity – overlapping peaks (double peak curve)

if (currIntensity > 0 && lastIntensity == 0 || overlapFlag==1){

//Populate Temp Array with Curve points and find maxIntensity to derive threshold

while (currIntensity > 0){ if(maxIntensity < currIntensity) maxIntensity=currIntensity; if (overlapFlag==1){ overlapFlag=0; curveArray[0][count-1]=overlapMZ; curveArray[1][count-1]=overlapIntensity; PI = overlapIntensity; currIntensity=inputIterator.getDoubleAt(5); } curveArray[0][count]=inputIterator.getDoubleAt(4); curveArray[1][count]=inputIterator.getDoubleAt(5); count++; inputIterator.advanceToNextRow(); PI2 = PI; PI = currIntensity;

if (currIntensity > PI && PI2 > PI){

//Overlapping Peak found, store MZ and Intensity and start new Curve for next Iteration

overlapFlag=1; overlapMZ=inputIterator.getDoubleAt(4); overlapIntensity=inputIterator.getDoubleAt(5); break; } }

//Process Temp Array to create intermediate metrics

while (curveArray[1][curveCount] > 0){ if (curveArray[1][curveCount] > intensityThreshold){ if (maxMZ < curveArray[0][curveCount]){ maxMZ=curveArray[0][curveCount]; } if (minIntensity > curveArray[1][curveCount] || minIntensity == 0){ minIntensity=curveArray[1][curveCount]; }

if (minMZ > curveArray[0][curveCount] || minMZ == 0){

minMZ=curveArray[0][curveCount]; } sumIntensity=sumIntensity+curveArray[1][curveCount]; sumMZ=sumMZ+curveArray[0][curveCount]; sumMZByIntensity=sumMZByIntensity+(curveArray[0][curveCou nt]*curveArray[1][curveCount]); curvePoints++; } curveCount++; }

(13)

SQL MapReduce – Reduce Function

In Teradata Aster SQL-MR code run by analyst becomes trivial

SELECT *

FROM

PeakPick

**(ON SELECT * FROM STG.MassSpecLoad)**

Parameters can

easily be included in

the function and

exposed to the

analyst

In Hadoop, command line interface means Engineers involved at all

times

(14)

TERADATA UNIFIED DATA ARCHITECTURE

DISCOVERY PLATFORM

CAPTURE | STORE | REFINE

INTEGRATED DATA WAREHOUSE

LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS

Engineers

Data Scientists

Business Analysts

Front-Line Workers

Customers / Partners

Quants

Operational Systems

Executives

Big Data Analytics

Big Data Management

(15)

• Single View of the Business,

Cross-Functional

• SQL based

• Structured schema

• Productionised Analytics

• Active

• Complex mixed workloads

• Highest service level goals

• Highest resilience

• 1000’ users

The Integrated Data Warehouse

Business Analysts

Knowledge Workers

DATA MINING

BUSINESS INTELLIGENCE APPLICATIONS

Customers/Partners

Marketing

Executives

Front-line Workers

Operational Systems

(16)

• Project-led view of data

approach for big analytics

• Big Analytics using SQL-MR

• Schema-Lite

• Interactive Discovery Analytics

• Load fast, act fast, fail fast

analytical workload

• Interactive

• Limited service levels

• Resilience

• 10’s users

The Discovery Environment

Business Analysts

BIG ANALYTICS

SQL AND MAP-REDUCE DATA VISUALISATION

Data Scientists

Power Analysts

Rules Discovery

(17)

• Lowest Cost Storage footprint

• NoSchema design, load raw

files

• MapReduce based

• Deep history and 1

st

_{level data}

transformations

• Simple single use workloads

• Batch and open source

analytics

• High Data Availability service

level goal

• High resilience

Hadoop Big Data Management

Power Analysts

TRANSFORMATIONS

SPECIAL PURPOSE ANALYTIC REGULATORY

Data Scientists

IT Professionals

Single use Systems

(18)

TERADATA UNIFIED DATA ARCHITECTURE

CAPTURE | STORE | REFINE

LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS