(1)

Creating Big Data Applications with Spring XD

Thomas Darimont

@thomasdarimont

(2)

THE FASTEST PATH TO NEW BUSINESS VALUE

(3)

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/

Journey

Introduction

Concepts

Applications

Outlook

(4)

Introduction

(5)

Spring XD - Overview

Platform for Big Data Applications

Ingestion, Processing, Movement, Analytics

Stream and Batch Processing

Scalable Distributed Runtime

Support for Deep Analytics

Proven Spring Technologies

(6)

Spring XD - Why yet another Big Data Platform?

Alternative to Frameworks like Flume, Oozie, Sqoop, Storm

Just one Platform instead of many

Common things easy, complex things possible

Complementary to many technologies

Big SQL / MPP Databases - Impala, HAWQ

Stream Processing - Apache Spark

NoSQL DataStores - Cassandra, MongoDB

(7)

eXtreme Data (XD)

“Spring XD - one stop shop for developing and deploying Big Data Apps”

(8)

Spring XD - 10,000 Foot View

[Diagram: the Spring XD Runtime with Streams (ingest, export, taps) and Jobs (workflow), connected bidirectionally to Compute, HDFS, RDBMS, NoSQL and R / SAS (Predictive Modelling), with Redis, a shell (>_) and a REST interface]

(9)

Spring XD - Easy to Setup and Run

Store incoming HTTP data into HDFS

(10)

Spring XD - Easy to Setup and Run

1. Install via package manager / unzip

2. Start

$ xd-singlenode

$ xd-shell

3. Define

xd:> stream create ingest --definition "http | hdfs"

4. Run

xd:> stream deploy ingest

Yes, writing HTTP data to HDFS …can be that simple!

(11)

Core Concepts

(12)

Spring XD - Core Concepts

Runtime

Modules

Streams

Taps

Analytics

Jobs

Extensibility

Deployment

(13)

Spring XD - Runtime

Hosts Stream Processing & Batch Workflows & Analytics

Manages Component Distribution

Communication via MessageBus

Additional Services

Configuration / Cluster State: ZooKeeper

Analytics: Redis, In-Memory

Message Bus: Redis, RabbitMQ, Kafka, Local

(14)

Spring XD - Instance Types

XD-Admin

Assigns Modules to Containers

Manages Cluster

Failover & HA

XD-Container

Loads / Executes Modules

Connects to Data Bus

Standalone, YARN, Cloud Foundry

[Diagram: XD UI and XD Shell; several XD Admin instances with a Leader; ZK; XD Containers hosting modules; Kafka/RabbitMQ/Redis; Batch Job State DB; Analytics Repository]

(15)

Spring XD - Runtime Modes

single-node standalone: Development

[Diagram: XD Admin, XD Container with Module, ZK, DB and MB in one JVM]

multi-node distributed: Production

[Diagram: XD Admin, XD Containers with Modules, ZK, DB and MB, each in its own JVM]

(16)

Spring XD - Distributed Runtime

[Diagram: two XDA and three XDC instances coordinated through Zookeeper; the modules "time" and "log" are deployed to containers and bound to the Message Bus]

XDA = XD Admin

XDC = XD Container

(17)

Modules

(18)

Modules

Unit of execution

Source, Sink, Processor, Jobs

Defined in XML or JVM Language

Spring config file with Spring Bean Definitions

Can have Parameters

50+ already included in XD

Define new Modules via Composition

(19)

Modules - Overview

Source #20: HTTP, SFTP, Tail, File, Mail, Syslog, TCP / TCP Client, Reactor IP, RabbitMQ, JMS, Time, MQTT, Mongo, Kafka, JDBC, Gemfire CQ, Gemfire Source, Twitter Search, Twitter Stream, Stdout Capture

Processor #13: Filter, Transform, Splitter, Aggregator, HTTP Client, Shell Command, Script, Groovy, Python, Java, JPMML-Evaluator, JSON-to-Tuple, Object-to-JSON

Sink #20: Log, File, JDBC, MQTT, TCP, Mongo, Mail, Null Sink, Redis, RabbitMQ, HDFS, HDFS Dataset, Shell Command, GemFire Server, Splunk Server, Dynamic Router, Counter + 1, Gauge + 1

(20)

Streams

(21)

Streams

Programming model for real-time processing

How data is collected, processed, and stored or forwarded

DSL analogous to Unix Pipes and Filters

Source | Processor 0…* | Sink

Data is pumped through MessageBus

Spring Integration Components

[Diagram: a Stream consisting of Source, Processor and Sink connected through the Message Bus]
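The pipes-and-filters idea behind "Source | Processor 0…* | Sink" can be illustrated with a small Python sketch. This is not how Spring XD is implemented: it chains plain generators in one process rather than passing messages over a message bus, and the names `http_source`, `transform` and `run_stream` are hypothetical, chosen only to echo the module names on the slides.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical names for illustration only; none of this is Spring XD API.
Processor = Callable[[Iterator[str]], Iterator[str]]


def http_source(requests: Iterable[str]) -> Iterator[str]:
    """Source: emits one message per incoming payload."""
    for payload in requests:
        yield payload


def transform(expression: Callable[[str], str]) -> Processor:
    """Processor: applies an expression to every payload passing through."""
    def apply(messages: Iterator[str]) -> Iterator[str]:
        for payload in messages:
            yield expression(payload)
    return apply


def run_stream(source: Iterator[str],
               processors: List[Processor],
               sink: Callable[[str], None]) -> None:
    """Chains source | processor 0..* | sink and drains the pipeline."""
    messages = source
    for processor in processors:
        messages = processor(messages)
    for payload in messages:
        sink(payload)


log: List[str] = []  # stand-in for the log sink
run_stream(http_source(["hello", "spring xd"]), [transform(str.upper)], log.append)
```

With zero processors the source is wired straight to the sink, which matches the "0…*" in the stream definition.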

(22)

Streams - Example

stream create test1 --definition "http | transform --expression=payload.toUpperCase() | log" --deploy

http = Source, transform = Processor, --expression = Option, log = Sink

“Transform payload incoming from HTTP to uppercase and send to log”

(23)

Taps

(24)

Taps

Special type of Stream

Consume data along the processing pipeline

Original stream stays unaffected

Collect metrics and perform analytics

[Diagram: a Stream (Source, Processor, Sink) on the Message Bus, with a Tap (Processor, Sink) branching off the stream]

(25)

Taps Example

First create the stream

stream create test1 --definition "http | transform --expression=payload.toUpperCase() | log" --deploy

Then create the tap: “onto transform stage, add prefix and send to log”

stream create test1tap --definition "tap:stream:test1.transform > transform --expression='tapped: '+payload | log" --deploy

tap:stream:test1.transform = Tap Source, > = Redirection
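The way a tap reads from a running stream without disturbing it can be sketched with a toy publish/subscribe bus in Python. The `MessageBus` class and the channel names below are hypothetical stand-ins, not Spring XD classes, and the sketch assumes the tap sees the output of the transform stage.

```python
from typing import Callable, Dict, List


class MessageBus:
    """Toy in-memory bus: every subscriber of a channel receives each message."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = {}

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subscribers.setdefault(channel, []).append(handler)

    def publish(self, channel: str, payload: str) -> None:
        for handler in list(self._subscribers.get(channel, [])):
            handler(payload)


bus = MessageBus()
stream_log: List[str] = []  # sink of the original stream
tap_log: List[str] = []     # sink of the tap

# Original stream: http -> transform (upper case) -> log
bus.subscribe("test1.http", lambda p: bus.publish("test1.transform", p.upper()))
bus.subscribe("test1.transform", stream_log.append)

# Tap: an additional consumer on the transform output that adds a prefix
bus.subscribe("test1.transform", lambda p: tap_log.append("tapped: " + p))

bus.publish("test1.http", "hello")
```

The original stream's sink receives exactly what it would receive without the tap; the tap only adds a second consumer on the same channel.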

(26)

Analytics

(27)

Analytics

Counters

Simple Counter - how many tweets?

Field Value Counter - how many for tag=#java?

Aggregate Counter - how many tweets for #java per time interval?

Gauges

Gauge - what was the last seen value?

Rich Gauge - what was the last seen value/avg/min/max?

Backed by Redis, In-Memory via Spring Data Repositories

Accessible via XD-Shell and REST API on XD-Admin
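The difference between the counter and gauge types can be made concrete with a small in-memory Python sketch. The sample tweets are made-up data, the class and variable names are hypothetical, and the per-minute bucketing is only one possible time interval; Spring XD's own metric repositories are not modelled here.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import DefaultDict


@dataclass
class RichGauge:
    """Tracks the last seen value plus running average, minimum and maximum."""
    last: float = 0.0
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def set(self, value: float) -> None:
        self.last = value
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


# Made-up sample data for illustration
tweets = [
    {"tag": "#java", "at": datetime(2015, 1, 1, 10, 0, 5), "retweets": 3},
    {"tag": "#spring", "at": datetime(2015, 1, 1, 10, 0, 40), "retweets": 7},
    {"tag": "#java", "at": datetime(2015, 1, 1, 10, 1, 10), "retweets": 2},
]

simple_counter = 0                                                # how many tweets?
field_value_counter: Counter = Counter()                          # how many per tag?
aggregate_counter: DefaultDict[datetime, int] = defaultdict(int)  # #java per minute
retweet_gauge = RichGauge()                                       # last/avg/min/max

for tweet in tweets:
    simple_counter += 1
    field_value_counter[tweet["tag"]] += 1
    if tweet["tag"] == "#java":
        minute = tweet["at"].replace(second=0, microsecond=0)
        aggregate_counter[minute] += 1
    retweet_gauge.set(tweet["retweets"])
```

A plain gauge corresponds to keeping only `last`; the rich gauge adds the running statistics.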

(28)

Advanced Analytics

Processor Modules

Python: numpy, pandas, scikit-learn, NLTK, SimpleCV

Shell: R-Project rscript, OpenCV

Java / Groovy

PMML Processor Module

Predictive Model Markup Language

Description of Parameterised Data Mining Models

Allows operationalising Predictive Models

Real-time evaluation and scoring

(29)

Jobs

(30)

Jobs

Programming Model for Batch Processing

Create, Schedule, Execute and Monitor

Spring Batch and Spring Hadoop Components

Jobs #5: CSV to JDBC, FTP to HDFS, JDBC to HDFS, HDFS to JDBC, HDFS to MongoDB

(31)

Jobs - Example

Create job from existing job definition

job create --name "helloworld-job" --definition "helloworld" --deploy

Run job once

job launch --name "helloworld-job"

Run job periodically

stream create --name "hw-cron" --definition "trigger --cron='0/5 * * * * *' > queue:job:helloworld-job" --deploy

(32)

Management

(33)

Spring XD - Shell

CLI based on Spring Shell

Manages Streams, Jobs, Analytics and Deployment

Completion / Assist

Many built-in Commands, try help

Started via xd-shell

(34)

Spring XD - Admin UI

Management Interface accessible from XD-Admin Node

XD-ADMIN:9393/admin-ui

(35)

Spring XD - REST Interface

Accessible from XD-Admin Node

Used by XD-Shell and Admin-UI

http://xd-admin:9393

(36)

Extensibility

(37)

Extensibility

Custom Modules

Source, Sink, Processor, Job

Spring Integration, Spring Batch

E.g. to wrap a Java Library

Upload new modules via XD-Shell / REST

Register custom Spring Expression Language Aliases

from java.lang.Double.parseDouble(payload.sensorValue) to #parseDouble(payload.sensorValue)

Scripts

Collection of XD commands

Automation

(38)

Deployment

(39)

Deployment

deploy or --deploy

stream deploy firstStream

stream create secondStream … --deploy

Deployment Manifest

Customize via --properties Parameter

Control # of Module Instances

Define Target Server or Group

Direct Binding

Stream Data Partitioning

(40)

Deployment Manifest - Module Count

http | worker | hdfs

stream deploy … --properties module.http.count=2, module.worker.count=4, module.hdfs.count=3

[Diagram: 2 http instances, 4 worker instances, 3 hdfs instances]

(41)

Deployment Manifest - Module Placement

http | worker | hdfs

stream deploy … --properties module.http.count=2, module.worker.count=4, module.hdfs.count=3, module.http.criteria=group.contains('WEB')

xd/bin/xd-container --groups="WEB"

[Diagram: 2 http instances placed on containers in the WEB group, 4 worker instances, 3 hdfs instances]

(42)

Deployment Manifest - Data Partitioning

http | worker | hdfs

stream deploy … --properties module.worker.count=4, module.http.producer.partitionKeyExpression=payload.customerId

[Diagram: 2 http instances (WEB), 4 worker instances receiving partitions 0 to 3, 3 hdfs instances]

partition := hash(payload.customerId) % worker.count
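The partition formula can be tried out with a short Python sketch. It assumes a CRC32 hash purely for illustration, since Python's built-in `hash()` for strings changes between processes; the hash function Spring XD actually uses is not stated on the slide and may differ. The function name `partition_for` and the sample payloads are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List

WORKER_COUNT = 4  # mirrors module.worker.count=4 from the deployment manifest


def partition_for(customer_id: str, worker_count: int = WORKER_COUNT) -> int:
    """partition := hash(customerId) % worker.count, with a stable hash.

    CRC32 is an assumption made for this sketch, not Spring XD's hash.
    """
    return zlib.crc32(customer_id.encode("utf-8")) % worker_count


# Made-up payloads: 20 messages spread over 5 customers
payloads = [{"customerId": f"c{i % 5}", "amount": i} for i in range(20)]

# Route every payload to the worker instance that owns its partition
assignments: Dict[int, List[dict]] = defaultdict(list)
for payload in payloads:
    assignments[partition_for(payload["customerId"])].append(payload)
```

The property that matters is that all messages with the same customerId reach the same worker instance, so per-customer state can be kept local to that instance.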

(43)

Applications

(44)

Spring XD - Measuring Live Usage for a Major Sports League

Measuring live video usage through mobile applications

(45)

Spring XD - IoT Connected Car

Journey and Range Prediction

(46)

Spring XD - Smartgrid

ACM Distributed Event Based Systems 2014

Scalable, Real-Time Analytics, High Volume Sensor Data

Short-Term Load Forecasting in a Power Grid

Sensor Data from Smart Plugs

Stream Components

Sensor Data Ingestion

Data Aggregation

Load Prediction

Analytics via REST

Demo

(47)

What’s next?

(48)

Roadmap - 1.2 and beyond

Custom Modules in HDFS

More OOTB Modules

Web based Editor for Streams & Jobs

Apache Ambari Support

Security Enhancements

Spring XD on Pivotal Cloud Foundry

GA Release Planned for May 2015

(49)

Learn more

Project http://projects.spring.io/spring-xd

GitHub https://github.com/spring-projects/spring-xd

Wiki http://docs.spring.io/spring-xd/docs/current/reference/html/

Samples https://github.com/spring-projects/spring-xd-samples

Modules https://github.com/spring-projects/spring-xd-modules

JIRA https://jira.spring.io/browse/XD

Stackoverflow http://stackoverflow.com/questions/tagged/spring-xd

(50)

Spring XD - Takeaway

Unified runtime for both Real-time and Batch use cases

Scalable, Distributed and Fault Tolerant Runtime

Increased Productivity through out-of-the-box components

Closed Loop Analytics through online (stream) and offline (batch) data

Swiss-army knife of data movement and data pipelines

Repeatable ‘turnkey’ solution for next generation data-centric use cases

Data Ingestion, Processing, Movement, Analytics

(51)

Learn More. Stay Connected.

Twitter: twitter.com/springcentral

YouTube: spring.io/video

LinkedIn: spring.io/linkedin

Google Plus: spring.io/gplus

(52)

Backup Slides

(53)

Lambda Architecture

(54)

Lambda Architecture

(55)

Lambda Architecture - Spring XD

[Diagram: Lambda Architecture built with Spring XD (XD>). Labels: Data Lake, HAWQ, Gemfire, Spring, Speed Layer, Batch Layer, Serving Layer (four Spring Boot applications), Batch Views, Real-time Views, Stream Processing, Batch Processing, Analytics, Ingest, Export, Workflow Orchestration, Predictive Analytics]

(56)

Predictive Models

Model

Parameterised Algorithm

Model Building

Derive a parameterised algorithm from the data

Slow process

Usually large data volume -> done offline as a batch process

Model Scoring

Use the model to predict new information

Fast process

Can be done as part of stream processing
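The split between slow model building and fast model scoring can be sketched in Python with a deliberately tiny model. A one-variable least-squares line stands in for a real predictive model here; the names `LinearModel`, `build_model` and `score_stream` and the training data are hypothetical, and nothing in the sketch uses PMML or Spring XD.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Sequence, Tuple


@dataclass(frozen=True)
class LinearModel:
    """The 'parameterised algorithm': y = slope * x + intercept."""
    slope: float
    intercept: float

    def score(self, x: float) -> float:
        return self.slope * x + self.intercept


def build_model(history: Sequence[Tuple[float, float]]) -> LinearModel:
    """Model building: least-squares fit over the whole historical data set.

    This is the slow, offline step that would run as a batch job.
    """
    n = len(history)
    mean_x = sum(x for x, _ in history) / n
    mean_y = sum(y for _, y in history) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in history)
    variance = sum((x - mean_x) ** 2 for x, _ in history)
    slope = covariance / variance
    return LinearModel(slope, mean_y - slope * mean_x)


def score_stream(model: LinearModel, events: Iterable[float]) -> Iterator[float]:
    """Model scoring: one cheap evaluation per incoming event."""
    for x in events:
        yield model.score(x)


history = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]  # made-up data on y = 2x + 1
model = build_model(history)
predictions = list(score_stream(model, [5.0, 10.0]))
```

Building touches every historical record, while scoring is a single arithmetic expression per event, which is why scoring fits into a stream processor.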

(57)

PMML

Predictive Model Markup Language

Open Standard

Maintained by Data Mining Group (DMG)

XML based DSL for predictive models

Can be interpreted

15 Model Types (Naive Bayes, General Regression, Neural Networks, etc.)

First Version (1999) – Current Version 4.2.1

“Lingua Franca for Predictive Models”

“Bridge the Gap between Data Scientists and Engineers”

(58)

Anatomy of a PMML Model

Predictive Model

Algorithm description(s)

Parameterisation

“trained model”

Pre Processing

Post Processing

Transform model output

Thresholds / Business rules

Source: PMML in Action, 2nd Edition, 2012, p. 7.

(59)

Predictive Analytics with Spring XD

XD Module analytic-pmml

Introduced in Spring XD 1.0.0 M6 (April 2014)

Real-time evaluation and scoring

Based on JPMML-Evaluator

Wide range of Model types

spring-xd-modules/analytics-ml-pmml on GitHub
