• Recap the challenges of Big Analytics
• The 2 analytical gaps for most enterprises
• Teradata Unified Data Architecture
- How we bridge the gaps
- The 3 core elements of the architecture
- Teradata’s solutions in the architecture
• Bring it all together Teradata, Teradata Aster, and
Hadoop.
Recap of the Big Data
Analytics Challenge
New and Emerging Sources of Data
BIG DATA
User Generated
Content
Mobile Web
SMS/MMS
Sentiment
External Demographics
HD Video
Speech to Text
Product/Service Logs
Social Network
Business Data Feeds
Petabytes
User Click Stream
Web logs
Web
Offer history
A/B testing
Dynamic Pricing
Affiliate Networks
Search marketing
Behavioral Targeting
Dynamic Funnels
Terabytes
Segmentation
Offer details
Customer Touches
Support Contacts
CRM
Gigabytes
Megabytes
Purchase record
Purchase detail
Payment record
ERP
And using an RDBMS/SQL alone is difficult or impossible
So it’s the data, right? …Yes
So it’s the analytics, right? …. Yes
So it’s the need for iterative visualisation …. Yes
Big Data Analytics …MORE Analytics on ALL the data
Enabling All Users, All Tools and Any Data for Capture to Analysis
Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualisation, etc.
Discover and Explore
Reporting and Execution
in the Enterprise
Capture, Store and Refine
Audio/
Database and Analytic Processing Layer
The Big Data Architecture Today Has Gaps
Data Storage and
Refining
Data Warehouse
Audio/
Video
Images
Text
Web and
Social
Machine
Logs
CRM
SCM
ERP
Engineers
Data Scientists
Quants
Business Analysts
Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualisation, etc.
Gap 2: File system lacks
optimisers, data locality,
indexes
MapReduce
(Processing)
Gap 1:
Capture, Store, Refine
Teradata Unified Big Data
Architecture
for the Enterprise
Audio/
Video
Images
Text
Web and
Social
Machine
Logs
CRM
SCM
ERP
Engineers
Data Scientists
Quants
Business Analysts
Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualisation, etc.
Discovery Platform
Integrated Data
Warehouse
SQL-H
Aster MapReduce Portfolio
Teradata SQL Analytics Portfolio
Teradata Aster Discovery Platform 5.10
Fastest path to big data apps and new business insights
•
Integrated hardware and
software appliance
•
Relational-data architecture can
be extended for non-relational types
and procedural M-R analytics
•
SQL-MapReduce framework
•
Analyze both multi-structured
complex and relational data
Growing the Development Bucket
•
70+ pre-built functions for data
acquisition, preparation, analysis &
visualization
•
Richest Add-On Capabilities:
Attensity, Zementis, SAS, R
•
Visual IDE & VM-based dev
environment: develop apps very fast
Data Scientists
Analysts
Customers
Business
Interactive & Visual Big Data Analytic Apps
Store
Process
Develop
SQL
Platform Services
(e.g. query planning, dynamic workload management, security…)
SQL-MapReduce®
Row Store
Column Store
Data Acquisition Module SQL-H Teradata RDBMS Data Preparation Module Unpack Pivot Apache Log Parser Analytics Module Pathing Graph Statistical Viz Module Flow Viz Hierarchy Viz Affinity Viz Partner & Add-On Modules Attensity Zementis SAS, R
Big Data Apps in Days – not Weeks or Months
Data
Scientists
Analysts
Customers
Business
DATA SOURCES
ASTER DISCOVERY PORTFOLIO
Structured
Data
Multi-Structured
Data
Hadoop
Data
OLTP
DBMS’s
Data
Acquisition
Module
• Hadoop access
• Teradata
access
• RDBMS access
Data
Preparation
Module
• Data Adaptors
• Data
Transformers
- JSON, XML,
Apache, etc
Analytics
Module
• Statistical
• Pattern
Matching
• Pathing
• Graph
Algorithms
• Text
• More …
Visualisation
Module
• Flow Visualizer
• Hierarchy
• Flow
• Sankey
• Affinity
• More ….
CUSTOM BIG ANALYTICS APPS
PACKAGED BIG ANALYTICS APPS
MapReduce vs. SQL - Reduce Function
Data output from Mass Spectrometer
335.2094368 0 335.2105961 0 335.2117553 0 335.2129146 53.024086 335.2140739 184.1607361 335.2152332 264.3601074 335.2163925 259.6187134 335.2175518 239.7870178 335.2187111 313.8243713 335.2198704 490.8760071 335.2210297 634.064209 335.222189 589.8432007 335.2233483 351.9743347 335.2245077 65.21440887 335.225671 0 336.890869 0 336.892037 75.75605011 336.893205 179.8110657 336.894373 247.535553 336.895541 225.6489563 336.8967091 140.6246338 337.1257588 0 337.1280972 86.48993683 337.1292664 170.0835876 337.1304357 215.8146362 337.1316049 188.9733276 337.1327741 110.2854233 337.1912444 0 337.192414 0 337.1935835 143.2112122 337.1947531 357.401123 337.1959227 467.1167297 337.1970923 411.569458 337.1982619 245.5514221 337.1994315 80.80451202
Detecting centroids of peaks is highly complex
using SQL as it is not a set based operation
SELECT file_id ,scan_id ,ren_tm ,ms_lvl ,mz ,i AS n_
,SUM(i) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS p_i ,(CASE WHEN (i > 0) THEN 1 ELSE 0 END) AS Ind
,(Ind - SUM(ind) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)) ,CAST((CASE WHEN B = 1 THEN CSUM(1,Ind)
WHEN B = 0 AND Ind = 1 THEN 0
ELSE NULL END) AS DECIMAL(38,0)) AS CurveID FROM dd_stg.mzml
WHERE ms_lvl = 1 ) WITH DATA
PRIMARY INDEX (mz)
Almost 800 lines of complex SQL
,(weighted_peak_mz * chrg) / 700000.000000000000000 AS delta_mz ,CASE WHEN (
(CASE WHEN
SUM((weighted_peak_mz * chrg)) OVER (PARTITION BY file_id, ms_lvl ORDER BY Weighted_peak_mz, scan_id ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
BETWEEN ((weighted_peak_mz * chrg) - delta_mz) AND ((weighted_peak_mz * chrg) + delta_mz) THEN 'Y'
ELSE NULL END) = 'Y'
OR (CASE WHEN
SUM((weighted_peak_mz * chrg)) OVER (PARTITION BY file_id, ms_lvl ORDER BY Weighted_peak_mz, scan_id ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING)
BETWEEN ((weighted_peak_mz * chrg) - delta_mz) AND ((weighted_peak_mz * chrg) + delta_mz) THEN 'Y'
ELSE NULL END) = 'Y'
OR
(CASE WHEN
SUM((weighted_peak_mz * chrg)) OVER (PARTITION BY file_id, ms_lvl ORDER BY Weighted_peak_mz, scan_id ROWS BETWEEN 2 PRECEDING AND 2 PRECEDING)
BETWEEN ((weighted_peak_mz * chrg) - delta_mz) AND ((weighted_peak_mz * chrg) + delta_mz) THEN 'Y' ELSE NULL END) = 'Y' OR (CASE WHEN SELECT file_id,scan_id,ren_tm,ms_lvl,mz ,i
,CASE WHEN ind = 1 THEN SUM(CurveID+Mark) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz, ind ROWS UNBOUNDED PRECEDING)
ELSE NULL END AS CurveNum
FROM (SELECT file_id,scan_id,ren_tm,ms_lvl,mz,n_I AS i ,CASE
WHEN ( (CASE
WHEN n_i - p_i > 0 THEN 1 WHEN n_i - p_i < 0 THEN -1 ELSE 0
END) - SUM(CASE
WHEN n_i - p_i > 0 THEN 1 WHEN n_i - p_i < 0 THEN -1 ELSE 0
END) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) ) = 2 THEN 1 ELSE 0 END AS Mark ,Ind ,B
SELECT A.file_id ,A.ren_tm ,A.scan_id ,A.ms_lvl ,A.CurveNum A.Weighted_Peak_mz ,A.ren_tm ,A.sum_i ,A.ren_tm - B.ren_tm AS Diff_Ren_Tm
,A.Weighted_Peak_mz - B.Weighted_Peak_mz AS Diff_WP ,B.CurveNum AS L_CurveNum
,B.Weighted_Peak_mz AS L_Weighted_Peak_mz ,B.ren_tm AS L_ren_tm
,B.sum_i AS L_Sum_I
FROM DD_STG.S2_WEIGHTED_CURVE AS A
INNER JOIN DD_STG.S2_WEIGHTED_CURVE AS B
ON (A.Weighted_Peak_mz - B.Weighted_Peak_mz) BETWEEN 0.00000 AND 1.000000 AND A.ren_tm = B.ren_tm
AND A.CurveNum <> B.CurveNum AND B.max_i > (0.66667 * A.max_i)
,A.Weighted_Peak_mz - B.Weighted_Peak_mz AS Diff_WP ,B.CurveNum AS L_CurveNum
,B.Weighted_Peak_mz AS L_Weighted_Peak_mz ,B.ren_tm AS L_ren_tm
,B.sum_i AS L_Sum_I
FROM DD_STG.S2_WEIGHTED_CURVE AS A
INNER JOIN DD_STG.S2_WEIGHTED_CURVE AS B
ON (A.Weighted_Peak_mz - B.Weighted_Peak_mz) BETWEEN 0.00000 AND 1.000000 AND A.ren_tm = B.ren_tm
AND A.CurveNum <> B.CurveNum AND B.max_i > (0.66667 * A.max_i)
Procedural code … declared to the Aster as new
new MapReduce function called PeakPick
while (inputIterator.advanceToNextRow()) {
currIntensity=inputIterator.getDoubleAt(5); maxIntensity=0.0;
//Initialise Temp Array
for (int i=0; i <= 50; i++){ curveArray[0][i]=0; curveArray[1][i]=0; } if (overlapFlag==1){ count = 1; } else { count = 0; }
//Find start of Curve, lastintensity is 0
//or previous lastintensity is higher than lastintensity – overlapping peaks (double peak curve)
if (currIntensity > 0 && lastIntensity == 0 || overlapFlag==1){
//Populate Temp Array with Curve points and find maxIntensity to derive threshold
while (currIntensity > 0){ if(maxIntensity < currIntensity) maxIntensity=currIntensity; if (overlapFlag==1){ overlapFlag=0; curveArray[0][count-1]=overlapMZ; curveArray[1][count-1]=overlapIntensity; PI = overlapIntensity; currIntensity=inputIterator.getDoubleAt(5); } curveArray[0][count]=inputIterator.getDoubleAt(4); curveArray[1][count]=inputIterator.getDoubleAt(5); count++; inputIterator.advanceToNextRow(); PI2 = PI; PI = currIntensity;
if (currIntensity > PI && PI2 > PI){
//Overlapping Peak found, store MZ and Intensity and start new Curve for next Iteration
overlapFlag=1; overlapMZ=inputIterator.getDoubleAt(4); overlapIntensity=inputIterator.getDoubleAt(5); break; } }
//Process Temp Array to create intermediate metrics
while (curveArray[1][curveCount] > 0){ if (curveArray[1][curveCount] > intensityThreshold){ if (maxMZ < curveArray[0][curveCount]){ maxMZ=curveArray[0][curveCount]; } if (minIntensity > curveArray[1][curveCount] || minIntensity == 0){ minIntensity=curveArray[1][curveCount]; }
if (minMZ > curveArray[0][curveCount] || minMZ == 0){
minMZ=curveArray[0][curveCount]; } sumIntensity=sumIntensity+curveArray[1][curveCount]; sumMZ=sumMZ+curveArray[0][curveCount]; sumMZByIntensity=sumMZByIntensity+(curveArray[0][curveCou nt]*curveArray[1][curveCount]); curvePoints++; } curveCount++; }
SQL MapReduce – Reduce Function
In Teradata Aster SQL-MR code run by analyst becomes trivial
SELECT *
FROM
PeakPick
(ON SELECT * FROM STG.MassSpecLoad)
Parameters can
easily be included in
the function and
exposed to the
analyst
In Hadoop, command line interface means Engineers involved at all
times
TERADATA UNIFIED DATA ARCHITECTURE
DISCOVERY PLATFORM
CAPTURE | STORE | REFINE
INTEGRATED DATA WAREHOUSE
LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS
Engineers
Data Scientists
Business Analysts
Front-Line Workers
Customers / Partners
Quants
Operational Systems
Executives
Big Data Analytics
Big Data Management
•
Single View of the Business,
Cross-Functional
•
SQL based
•
Structured schema
•
Productionised Analytics
•
Active
•
Complex mixed workloads
•
Highest service level goals
•
Highest resilience
•
1000’ users
The Integrated Data Warehouse
Business Analysts
Knowledge Workers
DATA MINING
BUSINESS INTELLIGENCE APPLICATIONS
Customers/Partners
Marketing
Executives
Front-line Workers
Operational Systems
INTEGRATED DATA WAREHOUSE
•
Project-led view of data
approach for big analytics
•
Big Analytics using SQL-MR
•
Schema-Lite
•
Interactive Discovery Analytics
•
Load fast, act fast, fail fast
analytical workload
•
Interactive
•
Limited service levels
•
Resilience
•
10’s users
The Discovery Environment
Business Analysts
BIG ANALYTICS
SQL AND MAP-REDUCE DATA VISUALISATION
Data Scientists
Power Analysts
Rules Discovery
DISCOVERY PLATFORM
•
Lowest Cost Storage footprint
•
NoSchema design, load raw
files
•
MapReduce based
•
Deep history and 1
st
level data
transformations
•
Simple single use workloads
•
Batch and open source
analytics
•
High Data Availability service
level goal
•
High resilience
Hadoop Big Data Management
Power Analysts
TRANSFORMATIONS
SPECIAL PURPOSE ANALYTIC REGULATORY
Data Scientists
IT Professionals
Single use Systems
TERADATA UNIFIED DATA ARCHITECTURE
DISCOVERY PLATFORM
CAPTURE | STORE | REFINE
INTEGRATED DATA WAREHOUSE
LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS