• No results found

Teradata Unified Big Data Architecture

N/A
N/A
Protected

Academic year: 2021

Share "Teradata Unified Big Data Architecture"

Copied!
19
0
0

Loading.... (view fulltext now)

Full text

(1)
(2)

• Recap the challenges of Big Analytics

• The 2 analytical gaps for most enterprises

• Teradata Unified Data Architecture

- How we bridge the gaps

- The 3 core elements of the architecture

- Teradata’s solutions in the architecture

• Bring it all together Teradata, Teradata Aster, and

Hadoop.

(3)

Recap of the Big Data

Analytics Challenge

(4)

New and Emerging Sources of Data

BIG DATA

User Generated

Content

Mobile Web

SMS/MMS

Sentiment

External Demographics

HD Video

Speech to Text

Product/Service Logs

Social Network

Business Data Feeds

Petabytes

User Click Stream

Web logs

Web

Offer history

A/B testing

Dynamic Pricing

Affiliate Networks

Search marketing

Behavioral Targeting

Dynamic Funnels

Terabytes

Segmentation

Offer details

Customer Touches

Support Contacts

CRM

Gigabytes

Megabytes

Purchase record

Purchase detail

Payment record

ERP

And using an RDBMS/SQL alone is difficult or impossible

So it’s the data, right? …Yes

So it’s the analytics, right? …. Yes

So it’s the need for iterative visualisation …. Yes

(5)

Big Data Analytics …MORE Analytics on ALL the data

Enabling All Users, All Tools and Any Data for Capture to Analysis

Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualisation, etc.

Discover and Explore

Reporting and Execution

in the Enterprise

Capture, Store and Refine

Audio/

(6)

Database and Analytic Processing Layer

The Big Data Architecture Today Has Gaps

Data Storage and

Refining

Data Warehouse

Audio/

Video

Images

Text

Web and

Social

Machine

Logs

CRM

SCM

ERP

Engineers

Data Scientists

Quants

Business Analysts

Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualisation, etc.

Gap 2: File system lacks

optimisers, data locality,

indexes

MapReduce

(Processing)

Gap 1:

(7)

Capture, Store, Refine

Teradata Unified Big Data

Architecture

for the Enterprise

Audio/

Video

Images

Text

Web and

Social

Machine

Logs

CRM

SCM

ERP

Engineers

Data Scientists

Quants

Business Analysts

Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualisation, etc.

Discovery Platform

Integrated Data

Warehouse

SQL-H

Aster MapReduce Portfolio

Teradata SQL Analytics Portfolio

(8)

Teradata Aster Discovery Platform 5.10

Fastest path to big data apps and new business insights

Integrated hardware and

software appliance

Relational-data architecture can

be extended for non-relational types

and procedural M-R analytics

SQL-MapReduce framework

Analyze both multi-structured

complex and relational data

Growing the Development Bucket

70+ pre-built functions for data

acquisition, preparation, analysis &

visualization

Richest Add-On Capabilities:

Attensity, Zementis, SAS, R

Visual IDE & VM-based dev

environment: develop apps very fast

Data Scientists

Analysts

Customers

Business

Interactive & Visual Big Data Analytic Apps

Store

Process

Develop

SQL

Platform Services

(e.g. query planning, dynamic workload management, security…)

SQL-MapReduce®

Row Store

Column Store

Data Acquisition Module SQL-H Teradata RDBMS Data Preparation Module Unpack Pivot Apache Log Parser Analytics Module Pathing Graph Statistical Viz Module Flow Viz Hierarchy Viz Affinity Viz Partner & Add-On Modules Attensity Zementis SAS, R

(9)

Big Data Apps in Days – not Weeks or Months

Data

Scientists

Analysts

Customers

Business

DATA SOURCES

ASTER DISCOVERY PORTFOLIO

Structured

Data

Multi-Structured

Data

Hadoop

Data

OLTP

DBMS’s

Data

Acquisition

Module

• Hadoop access

• Teradata

access

• RDBMS access

Data

Preparation

Module

• Data Adaptors

• Data

Transformers

- JSON, XML,

Apache, etc

Analytics

Module

• Statistical

• Pattern

Matching

• Pathing

• Graph

Algorithms

• Text

• More …

Visualisation

Module

• Flow Visualizer

• Hierarchy

• Flow

• Sankey

• Affinity

• More ….

CUSTOM BIG ANALYTICS APPS

PACKAGED BIG ANALYTICS APPS

(10)

MapReduce vs. SQL - Reduce Function

Data output from Mass Spectrometer

335.2094368 0 335.2105961 0 335.2117553 0 335.2129146 53.024086 335.2140739 184.1607361 335.2152332 264.3601074 335.2163925 259.6187134 335.2175518 239.7870178 335.2187111 313.8243713 335.2198704 490.8760071 335.2210297 634.064209 335.222189 589.8432007 335.2233483 351.9743347 335.2245077 65.21440887 335.225671 0 336.890869 0 336.892037 75.75605011 336.893205 179.8110657 336.894373 247.535553 336.895541 225.6489563 336.8967091 140.6246338 337.1257588 0 337.1280972 86.48993683 337.1292664 170.0835876 337.1304357 215.8146362 337.1316049 188.9733276 337.1327741 110.2854233 337.1912444 0 337.192414 0 337.1935835 143.2112122 337.1947531 357.401123 337.1959227 467.1167297 337.1970923 411.569458 337.1982619 245.5514221 337.1994315 80.80451202

Detecting centroids of peaks is highly complex

using SQL as it is not a set based operation

(11)

SELECT file_id ,scan_id ,ren_tm ,ms_lvl ,mz ,i AS n_

,SUM(i) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS p_i ,(CASE WHEN (i > 0) THEN 1 ELSE 0 END) AS Ind

,(Ind - SUM(ind) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)) ,CAST((CASE WHEN B = 1 THEN CSUM(1,Ind)

WHEN B = 0 AND Ind = 1 THEN 0

ELSE NULL END) AS DECIMAL(38,0)) AS CurveID FROM dd_stg.mzml

WHERE ms_lvl = 1 ) WITH DATA

PRIMARY INDEX (mz)

Almost 800 lines of complex SQL

,(weighted_peak_mz * chrg) / 700000.000000000000000 AS delta_mz ,CASE WHEN (

(CASE WHEN

SUM((weighted_peak_mz * chrg)) OVER (PARTITION BY file_id, ms_lvl ORDER BY Weighted_peak_mz, scan_id ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)

BETWEEN ((weighted_peak_mz * chrg) - delta_mz) AND ((weighted_peak_mz * chrg) + delta_mz) THEN 'Y'

ELSE NULL END) = 'Y'

OR (CASE WHEN

SUM((weighted_peak_mz * chrg)) OVER (PARTITION BY file_id, ms_lvl ORDER BY Weighted_peak_mz, scan_id ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING)

BETWEEN ((weighted_peak_mz * chrg) - delta_mz) AND ((weighted_peak_mz * chrg) + delta_mz) THEN 'Y'

ELSE NULL END) = 'Y'

OR

(CASE WHEN

SUM((weighted_peak_mz * chrg)) OVER (PARTITION BY file_id, ms_lvl ORDER BY Weighted_peak_mz, scan_id ROWS BETWEEN 2 PRECEDING AND 2 PRECEDING)

BETWEEN ((weighted_peak_mz * chrg) - delta_mz) AND ((weighted_peak_mz * chrg) + delta_mz) THEN 'Y' ELSE NULL END) = 'Y' OR (CASE WHEN SELECT file_id,scan_id,ren_tm,ms_lvl,mz ,i

,CASE WHEN ind = 1 THEN SUM(CurveID+Mark) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz, ind ROWS UNBOUNDED PRECEDING)

ELSE NULL END AS CurveNum

FROM (SELECT file_id,scan_id,ren_tm,ms_lvl,mz,n_I AS i ,CASE

WHEN ( (CASE

WHEN n_i - p_i > 0 THEN 1 WHEN n_i - p_i < 0 THEN -1 ELSE 0

END) - SUM(CASE

WHEN n_i - p_i > 0 THEN 1 WHEN n_i - p_i < 0 THEN -1 ELSE 0

END) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) ) = 2 THEN 1 ELSE 0 END AS Mark ,Ind ,B

SELECT A.file_id ,A.ren_tm ,A.scan_id ,A.ms_lvl ,A.CurveNum A.Weighted_Peak_mz ,A.ren_tm ,A.sum_i ,A.ren_tm - B.ren_tm AS Diff_Ren_Tm

,A.Weighted_Peak_mz - B.Weighted_Peak_mz AS Diff_WP ,B.CurveNum AS L_CurveNum

,B.Weighted_Peak_mz AS L_Weighted_Peak_mz ,B.ren_tm AS L_ren_tm

,B.sum_i AS L_Sum_I

FROM DD_STG.S2_WEIGHTED_CURVE AS A

INNER JOIN DD_STG.S2_WEIGHTED_CURVE AS B

ON (A.Weighted_Peak_mz - B.Weighted_Peak_mz) BETWEEN 0.00000 AND 1.000000 AND A.ren_tm = B.ren_tm

AND A.CurveNum <> B.CurveNum AND B.max_i > (0.66667 * A.max_i)

,A.Weighted_Peak_mz - B.Weighted_Peak_mz AS Diff_WP ,B.CurveNum AS L_CurveNum

,B.Weighted_Peak_mz AS L_Weighted_Peak_mz ,B.ren_tm AS L_ren_tm

,B.sum_i AS L_Sum_I

FROM DD_STG.S2_WEIGHTED_CURVE AS A

INNER JOIN DD_STG.S2_WEIGHTED_CURVE AS B

ON (A.Weighted_Peak_mz - B.Weighted_Peak_mz) BETWEEN 0.00000 AND 1.000000 AND A.ren_tm = B.ren_tm

AND A.CurveNum <> B.CurveNum AND B.max_i > (0.66667 * A.max_i)

(12)

Procedural code … declared to the Aster as new

new MapReduce function called PeakPick

while (inputIterator.advanceToNextRow()) {

currIntensity=inputIterator.getDoubleAt(5); maxIntensity=0.0;

//Initialise Temp Array

for (int i=0; i <= 50; i++){ curveArray[0][i]=0; curveArray[1][i]=0; } if (overlapFlag==1){ count = 1; } else { count = 0; }

//Find start of Curve, lastintensity is 0

//or previous lastintensity is higher than lastintensity – overlapping peaks (double peak curve)

if (currIntensity > 0 && lastIntensity == 0 || overlapFlag==1){

//Populate Temp Array with Curve points and find maxIntensity to derive threshold

while (currIntensity > 0){ if(maxIntensity < currIntensity) maxIntensity=currIntensity; if (overlapFlag==1){ overlapFlag=0; curveArray[0][count-1]=overlapMZ; curveArray[1][count-1]=overlapIntensity; PI = overlapIntensity; currIntensity=inputIterator.getDoubleAt(5); } curveArray[0][count]=inputIterator.getDoubleAt(4); curveArray[1][count]=inputIterator.getDoubleAt(5); count++; inputIterator.advanceToNextRow(); PI2 = PI; PI = currIntensity;

if (currIntensity > PI && PI2 > PI){

//Overlapping Peak found, store MZ and Intensity and start new Curve for next Iteration

overlapFlag=1; overlapMZ=inputIterator.getDoubleAt(4); overlapIntensity=inputIterator.getDoubleAt(5); break; } }

//Process Temp Array to create intermediate metrics

while (curveArray[1][curveCount] > 0){ if (curveArray[1][curveCount] > intensityThreshold){ if (maxMZ < curveArray[0][curveCount]){ maxMZ=curveArray[0][curveCount]; } if (minIntensity > curveArray[1][curveCount] || minIntensity == 0){ minIntensity=curveArray[1][curveCount]; }

if (minMZ > curveArray[0][curveCount] || minMZ == 0){

minMZ=curveArray[0][curveCount]; } sumIntensity=sumIntensity+curveArray[1][curveCount]; sumMZ=sumMZ+curveArray[0][curveCount]; sumMZByIntensity=sumMZByIntensity+(curveArray[0][curveCou nt]*curveArray[1][curveCount]); curvePoints++; } curveCount++; }

(13)

SQL MapReduce – Reduce Function

In Teradata Aster SQL-MR code run by analyst becomes trivial

SELECT *

FROM

PeakPick

(ON SELECT * FROM STG.MassSpecLoad)

Parameters can

easily be included in

the function and

exposed to the

analyst

In Hadoop, command line interface means Engineers involved at all

times

(14)

TERADATA UNIFIED DATA ARCHITECTURE

DISCOVERY PLATFORM

CAPTURE | STORE | REFINE

INTEGRATED DATA WAREHOUSE

LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS

Engineers

Data Scientists

Business Analysts

Front-Line Workers

Customers / Partners

Quants

Operational Systems

Executives

Big Data Analytics

Big Data Management

(15)

Single View of the Business,

Cross-Functional

SQL based

Structured schema

Productionised Analytics

Active

Complex mixed workloads

Highest service level goals

Highest resilience

1000’ users

The Integrated Data Warehouse

Business Analysts

Knowledge Workers

DATA MINING

BUSINESS INTELLIGENCE APPLICATIONS

Customers/Partners

Marketing

Executives

Front-line Workers

Operational Systems

INTEGRATED DATA WAREHOUSE

(16)

Project-led view of data

approach for big analytics

Big Analytics using SQL-MR

Schema-Lite

Interactive Discovery Analytics

Load fast, act fast, fail fast

analytical workload

Interactive

Limited service levels

Resilience

10’s users

The Discovery Environment

Business Analysts

BIG ANALYTICS

SQL AND MAP-REDUCE DATA VISUALISATION

Data Scientists

Power Analysts

Rules Discovery

DISCOVERY PLATFORM

(17)

Lowest Cost Storage footprint

NoSchema design, load raw

files

MapReduce based

Deep history and 1

st

level data

transformations

Simple single use workloads

Batch and open source

analytics

High Data Availability service

level goal

High resilience

Hadoop Big Data Management

Power Analysts

TRANSFORMATIONS

SPECIAL PURPOSE ANALYTIC REGULATORY

Data Scientists

IT Professionals

Single use Systems

(18)

TERADATA UNIFIED DATA ARCHITECTURE

DISCOVERY PLATFORM

CAPTURE | STORE | REFINE

INTEGRATED DATA WAREHOUSE

LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS

Engineers

Data Scientists

Business Analysts

Front-Line Workers

Customers / Partners

Quants

Operational Systems

Executives

Big Data Analytics

Big Data Management

(19)

Unified Data Architecture™

To leverage Big Data you must give all the

business analysts in your organization the right

analytical tool on all the existing and new data

available

1

Unified Data Architecture™ - architecture that

leverages the right technology on the right

analytical problems - leveraging best-of-breed

technologies

2

Big Data Analytics – Teradata and Aster harness the business value of

Big Data. Every company needs both a Data Warehouse and a

Discovery Platform

Big Data Management – Hadoop for landing, storing, and refining

data

3

Democratise Big Data and Maximise Enterprise Adoption

References

Related documents

4.0 A DIVERGING DOS STRATEGY: APPLICATION OF ALLENIC RHODIUM(I)-CATALYZED CYCLOCARBONYLATION, MICROWAVE-ASSISTED [2 + 2] CYCLOADDITION, AND SILVER(I)-CATALYZED CYCLOISOMERIZATION

These cavities spent the least amount of time above 35˚C and 40˚C (Fig 9A-F) and thus a model cannot be run because there are so few non- diapausing individuals spending

Research in the field of systemic digitization of the urban environment identi- fies a set of measures, usually reduced to three stages: (1) the formation of a common structure to

The aim of this study is to examine the evolution of the language of Sinhalese fiction from i860 - 1970* As there is a large number of authors in this period, our

The supply of market information for the agricultural sector in West Africa is highly variable. A number of MIS provide data on cereals, including public, private,

Van Kleef young jenever Van Kleef old jenever Van Kleef corenwijn Hoppe Vieux Rum Bacardi superior Bacardi limón Bacardi razz Bacardi oakheart Bacardi 8 años Bacardi

277 In contrast with a democratic political logic, which, broadly speaking, only supports disobedience when it is construed as a form of political participation, non-violent,

Besides the general information, such as capacity, used space or file system type it provides the possibility to get info on hard disk geometry, cluster size, exact