• No results found

Unified Batch & Stream Processing Platform

N/A
N/A
Protected

Academic year: 2021

Share "Unified Batch & Stream Processing Platform"

Copied!
20
0
0

Loading.... (view fulltext now)

Full text

(1)

Director Product Management

Unified Batch & Stream Processing

Platform

Himanshu Bari

(2)

Most Big Data Use Cases Are About

Improving/Re-write EXISTING

solutions To KNOWN problems…

(3)

Current Solutions Were Built On

A. Imperfect information

B. Expensive s/w & h/w infrastructure

C. Relational data stores

(4)

Inevitable Course of the Re-write

Specialized solutions

Near perfect information

Commodity hardware

Open source software

Next gen data management platform

Hadoop

NoSQL Graph Search RDBMSEDW +

In Memory Processing

(5)

Data Processing Categories in Big Data Use Cases

Known Unknown

Questions known?

Data Velocity

Static Streaming

Batch Processing Stream Processing

Ad-hoc Query

N/A

(6)

Every Batch Process *Could* Have Been A Stream Process

• Every ‘Static’ data point was ‘Streaming’ at some point

• We choose to wait and collect a bunch of data points and then process them at once in ‘Batch Mode’

• Move processing time closer to the data generation or ‘Event’ occurrence time

• Reduces time to insight and allows you to be proactive with timely actions

(7)

But We Still Need Batch

• Historical data analysis

• What-if analysis

• Experimentation

• Data re-statement

• Transaction processing and re-conciliation

• Audit

• Machine learning model training

• ……..

(8)

YARN HDFS

Need A Unified Platform

Enterprise Repositories

RDBMS EDW

NoSQL Graph

Search Other…

Any Notification Engine Any BPM Engine

Stale Data

Change Data

‘Fresh’

Streaming Data

Unified Batch & Stream Processing Engine to execute data pipelines

Streaming only pipelines OR

Batch processing only pipelines OR

Dual Batch & Stream processing pipelines

(9)

DataTorrent RTS Provides A Unified Batch & Streaming Platform

Graphical Application Assembly Real-Time Data Visualization

Re-Usable Java Operator Library Apache 2.0 licensed – Project Malhar

Scalable, High Performance, Fault Tolerant In-Memory Data Processing Platform Apache 2.0 licensed - Project Apex

Hadoop 2.0 —YARN + HDFS Physical Virtual Cloud

Transform Normalize

Descriptive &

Predictive Analytics

Alert Action

Management & Monitoring

Unified Batch & Streaming Data Ingestion & Distribution Pipeline for Hadoop

(10)

DataTorrent Project Apex- Unified Batch & Stream Processing Engine

In-memory compute engine

1

Hadoop native (YARN + HDFS) architecture

2

Sub-second event processing

3

Event order guarantees

4

Flexible stream partitioning mechanism

5

Stateful fault tolerance with NO data loss 6

Automatic & incremental recovery 7

Rolling & tumbling

windows 8

Processing guarantees 9

Ability to update application

without downtime 10

(11)

Unified Data Ingestion & Distribution Pipeline

• FTP, S3 etc.

• Kafka & JMS

• Change data capture Input & Output

Variety

• Overcome HDFS small file problem

• Skew management Tackle data size

& speed fluctuations

• Parameters like filtering criteria, bandwidth utilization & polling interval should be updateable at runtime

Run-time updates

• Graphical UI & API

• End to end metrics

Simple to build &

Operate

• Easily extend and insert

operations for data preparation Easily

customizable

• Runs within the Hadoop cluster over YARN & HDFS

Hadoop Native

(12)

Hadoop Data Ingestion & Distribution Application

(13)

Data Prep & Analytics Layer Requirements

• Truly scale horizontally across the Hadoop cluster

• Pre-built operators

ᵒ Re-ordering ᵒ Normalization ᵒ Transpose ᵒ De-duplication ᵒ Tagging

ᵒ Filtering ᵒ Enrichment

• Operators work seamlessly in both streaming & batch mode

ᵒ Data local HDFS read & process

ᵒ Ability to do computations on time window as well as file boundaries

• Ability to re-use existing business logic

• Simple workflow & scheduling capabilities

ᵒ Built-in or integrations with Oozie or other schedulers

(14)

Development API Requirements

• Consistent between Streaming & Batch pipelines

• No mapreduce

• No exposure of low level processing engine concepts

• Easily extendible

(15)

Malhar Operator Library Overview

Malhar Operator Library

Input / Output Connectors

Compute Operators

Parsers Stream

Manipulators

Query &

Scripting Stats & Math

Pattern Matching Machine Learning &

Algorithms

File Systems RDBMS NoSQL Messaging

Notification In Memory Databases Social Media Protocol read/write

(16)

Visual Application Assembly

• Simple to extend

ᵒ Simple API to enable any existing DataTorrent operator ᵒ Ability to plug any business logic using a custom operator

• Easy to Use

ᵒ Web based drag-n-drop development environment ᵒ Automatic port compatibility

validation

ᵒ Simple schema management ᵒ Generic property

configurator

• Easy to Operate

ᵒ No external component dependency - Runs natively in Hadoop

ᵒ Integrated with DataTorrent management platform

(17)

Streaming or Batch Data Processing Visualization

• Intuitive

user interface

ᵒ Auto-generate or custom create ᵒ One dashboard for multiple apps ᵒ Supports bar, line, pie, area

charts & data tables

• Easy to Operate

ᵒ No external component dependency - Runs natively in Hadoop

ᵒ Integrated with DataTorrent management platform

• Simple to extend

ᵒ Any DAG operator can be

(18)

Summary

• World is moving from ‘Batch’ to

‘Streaming’ BUT both are required

• Need a Hadoop native in memory compute engine that is scalable &

fault tolerant in BOTH batch &

streaming modes

• With out -of-the box data prep &

analytics operators

• Using a consistent & functional development API

• Operationalized through a common management & monitoring layer Project Apex

https://www.datatorrent.com/product/project -apex/

DataTorrent RTS Sandbox

https://www.datatorrent.com/download/

Graphical Application Assembly Real-Time Data Visualization

Malhar Apache 2.0 license Re-Usable Java Operator Library

Scalable, High Performance, Fault Tolerant In-Memory Data Processing Platform Apache 2.0 licensed - Project Apex

Hadoop 2.0 —YARN + HDFS Physical Virtual Cloud Transform

Normalize

Descriptive &

Predictive Analytics

Alert Action

Management & Monitoring

Unified Batch & Streaming Data Ingestion & Distribution Pipeline for Hadoop

(19)

Big Data. Big Actions. Now.

Big Data. Big Actions. Now.

(20)

Some Verticals & Use Cases

Ad-Tech Telco & Cable

• Real-time customer facing dashboards on key performance indicators

• Click fraud detection

• Billing optimization

• Call detail record (CDR) & extended data record (XDR) analysis for

• Service quality improvement

• Capacity planning/optimization

• Understanding customer behavior AND context

• Packaging and selling anonymous customer data

Financial Services IoT

• Fraud & risk monitoring

• Sentiment based trading strategies

• Usage based insurance

• Improved credit risk assessment

• Improving turn around time of trade settlement processes

• Process optimization

• Proactive maintenance prediction

• Remote monitoring & diagnostics

References

Related documents

3 Weeks course on Christian Meditation Good Shepherd Room, Parish Center, 7 - 8 p.m?. March 2: What is

John the Baptist, Plum City Parochial Admin: Rev.. Czerwonka

Accepted 2017 October 9. The broad-band XMM–Newton + NuSTAR spec- trum of P13 is qualitatively similar to the rest of the ULX sample with broad-band coverage, suggesting that

We found that, years in profession and income levels did not affect emotional exhaustion (p>0.05), whilst having negative feelings about profession increased

9 This estimate is derived using estimates of the total number of rental occupied housing units from the American Community Survey (2009-2013 five-year estimates) in combination

In this study we try to know the user needs and measure their satisfaction towards library performance of the residential hall libraries of Dhaka University.. This study covers the

Chairman and Chairman of the Nomination Committee Harvey McGrath was appointed as an independent non-executive director of Prudential in September 2008 and became the Chairman

While there were 300 NGOs working in southern India, there were just five in the Andaman and Nicobar Islands.. At this time, there had been a lot of publicity about a movie