Creating Big Data Applications
with Spring XD
Thomas Darimont
@thomasdarimont
THE FASTEST PATH TO NEW BUSINESS VALUE
Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
Journey
Introduction
Concepts
Applications
Outlook
Introduction
Spring XD - Overview
Platform for Big Data Applications
Ingestion, Processing, Movement, Analytics
Stream and Batch Processing
Scalable Distributed Runtime
Support for Deep Analytics
Proven Spring Technologies
Spring XD - Why yet another Big Data Platform?
Alternative to Frameworks like Flume, Oozie, Sqoop, Storm
Just one Platform instead of many
Common things easy, complex things possible
Complementary to many technologies
Big SQL / MPP Databases - Impala, HAWQ
Stream Processing - Apache Spark
NoSQL DataStores - Cassandra, MongoDB
eXtreme Data (XD)
“Spring XD - one stop shop for developing and deploying Big Data Apps”
Spring XD - 10,000 Foot View
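[Diagram: the Spring XD Runtime in the centre, running Streams and Jobs (ingest, workflow, export, taps). Labelled connections: Compute, HDFS, RDBMS, NoSQL, Redis, R / SAS (Predictive Modelling), BIDIRECTIONAL. Access via shell (>_) and REST.]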
Spring XD - Easy to Setup and Run
Store incoming HTTP data into HDFS
Spring XD - Easy to Setup and Run
1. Install via package manager / unzip
2. Start
$ xd-singlenode
$ xd-shell
3. Define
xd:> stream create ingest --definition "http | hdfs"
4. Run
xd:> stream deploy ingest
Yes, writing HTTP data to HDFS can be that simple!
Core Concepts
Spring XD - Core Concepts
Runtime
Modules
Streams
Taps
Analytics
Jobs
Extensibility
Deployment
Spring XD - Runtime
Hosts Stream Processing & Batch Workflows & Analytics
Manages Component Distribution
Communication via Message Bus
Additional Services
Configuration / Cluster State: ZooKeeper
Analytics: Redis, In-Memory
Message Bus: Redis, RabbitMQ, Kafka, Local
Spring XD - Instance Types
XD-Admin
Assigns Modules to Containers
Manages Cluster
Failover & HA
XD-Container
Loads / Executes Modules
Connects to Data Bus
Standalone, YARN, Cloud Foundry
[Diagram: several XD Admin instances with a Leader, coordinated through ZooKeeper (ZK) and used by the XD UI and XD Shell. XD Containers host the modules and communicate over Kafka / RabbitMQ / Redis. Supporting stores: Batch Job State DB, Analytics Repository.]
Spring XD - Runtime Modes
single-node standalone: Development
[Diagram: XD Admin, XD Container with Modules, ZK, DB and MB together in one JVM]
multi-node distributed: Production
[Diagram: XD Admin and each XD Container (with Modules) in separate JVMs; ZK, DB and MB also run in their own JVMs]
Spring XD - Distributed Runtime
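[Diagram: XD Admin nodes (XDA) and XD Containers (XDC) coordinate through ZooKeeper. Modules such as time and log are deployed to containers (deploy) and connected to the Message Bus (bind).]
XDA = XD Admin
XDC = XD Container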
Modules
Modules
Unit of execution
Source, Sink, Processor, Jobs
Defined in XML or JVM Language
Spring config file with Spring Bean Definitions
Can have Parameters
50+ already included in XD
Define new Modules via Composition
Modules - Overview
Source (20): HTTP, SFTP, Tail, File, Mail, Syslog, TCP / TCP Client, Reactor IP, RabbitMQ, JMS, Time, MQTT, Mongo, Kafka, JDBC, Gemfire (CQ, Source), Twitter (Search, Stream), Stdout Capture
Processor (13): Filter, Transform, Splitter, Aggregator, HTTP Client, Shell Command, Script, Groovy, Python, Java, JPMML-Evaluator, JSON-to-Tuple, Object-to-JSON
Sink (20): Log, File, JDBC, MQTT, TCP, Mongo, Mail, Null Sink, Redis, RabbitMQ, HDFS, HDFS Dataset, Shell Command, GemFire Server, Splunk Server, Dynamic Router, Counter + 1, Gauge + 1
Streams
Streams
Programming model for real-time processing
How data is collected, processed, and stored or forwarded
DSL analogous to Unix Pipes and Filters
Source | Processor 0…* | Sink
Data is pumped through the Message Bus
Spring Integration Components
[Diagram: Stream = Source, Processor, …, Sink, connected through the Message Bus]
Streams - Example
stream create test1 --definition "http | transform --expression=payload.toUpperCase() | log" --deploy
http = Source, transform = Processor, --expression = Option, log = Sink
“Transform payload incoming from HTTP to uppercase and send to log”
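The pipes-and-filters shape of this stream can be sketched outside XD in plain Python. This is only an illustration of the Source | Processor | Sink model, not how Spring XD is implemented; the names `http_source`, `transform`, `log_sink` and `run_stream` are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def http_source(requests: Iterable[str]) -> Iterator[str]:
    """Stand-in for the http source: emits one payload per request body."""
    for body in requests:
        yield body


def transform(payloads: Iterable[str], expression: Callable[[str], str]) -> Iterator[str]:
    """Stand-in for the transform processor: applies an expression to each payload."""
    for payload in payloads:
        yield expression(payload)


def log_sink(payloads: Iterable[str]) -> List[str]:
    """Stand-in for the log sink: prints each payload and returns what was logged."""
    logged = []
    for payload in payloads:
        print(payload)
        logged.append(payload)
    return logged


def run_stream(requests: Iterable[str]) -> List[str]:
    """Source | Processor | Sink, analogous to 'http | transform | log'."""
    return log_sink(transform(http_source(requests), str.upper))


result = run_stream(["hello", "spring xd"])
```

Each stage only sees the output of the previous one, which is what lets XD place the stages in different containers and connect them through the message bus.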
Taps
Taps
[Diagram: Stream = Source, Processor, …, Sink over the Message Bus; a Tap branches off the stream into its own Processor, …, Sink]
Special type of Stream
Consume data along the processing pipeline
Original stream stays unaffected
Collect metrics and perform analytics
Taps Example
First create the stream:
stream create test1 --definition "http | transform --expression=payload.toUpperCase() | log" --deploy
Then create the tap: “onto transform stage, add prefix and send to log”
stream create test1tap --definition "tap:stream:test1.transform > transform --expression='tapped: '+payload | log" --deploy
tap:stream:test1.transform = Tap Source, > = Redirection
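The tap semantics (observe a stage's output without disturbing the original stream) can be sketched in plain Python. The `Channel` class below is a hypothetical stand-in for the message bus channel behind the transform stage; it is an assumption for illustration, not an XD API.

```python
from typing import Callable, List


class Channel:
    """Hypothetical stand-in for the bus channel between two modules."""

    def __init__(self) -> None:
        self._consumer: Callable[[str], None] = lambda payload: None
        self._taps: List[Callable[[str], None]] = []

    def bind(self, consumer: Callable[[str], None]) -> None:
        """Attach the regular downstream module of the stream."""
        self._consumer = consumer

    def tap(self, listener: Callable[[str], None]) -> None:
        """Attach a tap; it receives every payload in addition to the consumer."""
        self._taps.append(listener)

    def send(self, payload: str) -> None:
        for listener in self._taps:
            listener(payload)
        self._consumer(payload)


stream_log: List[str] = []  # what test1's log sink receives
tap_log: List[str] = []     # what test1tap's log sink receives

transform_output = Channel()
transform_output.bind(stream_log.append)
transform_output.tap(lambda payload: tap_log.append("tapped: " + payload))

for body in ["a", "b"]:
    transform_output.send(body.upper())  # the transform stage of test1
```

The original stream's sink sees exactly the same payloads with or without the tap attached.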
Analytics
Analytics
Counters
Simple Counter - how many tweets?
Field Value Counter - how many for tag=#java?
Aggregate Counter - how many tweets for #java per time interval?
Gauges
Gauge - what was the last seen value?
Rich Gauge - what was the last seen value/avg/min/max?
Backed by Redis, In-Memory via Spring Data Repositories
Accessible via XD-Shell and REST API on XD-Admin
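What each metric type keeps track of can be sketched as small in-memory structures in Python. The class names, the one-minute bucket size and the sample data are assumptions for illustration; XD's own implementations are backed by Redis or in-memory repositories.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AggregateCounter:
    """Counts events per time bucket (here: per minute)."""
    bucket_seconds: int = 60
    buckets: Dict[int, int] = field(default_factory=dict)

    def record(self, timestamp: int) -> None:
        bucket_start = timestamp // self.bucket_seconds * self.bucket_seconds
        self.buckets[bucket_start] = self.buckets.get(bucket_start, 0) + 1


@dataclass
class RichGauge:
    """Remembers the last value plus running min, max and average."""
    count: int = 0
    total: float = 0.0
    last: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def record(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.last = value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


# Hypothetical input: (timestamp in seconds, hashtag) pairs taken from tweets.
tweets = [(0, "#java"), (30, "#scala"), (65, "#java")]

simple_counter = 0                    # how many tweets?
field_value_counter = Counter()       # how many per tag?
java_per_minute = AggregateCounter()  # how many #java tweets per minute?

for timestamp, tag in tweets:
    simple_counter += 1
    field_value_counter[tag] += 1
    if tag == "#java":
        java_per_minute.record(timestamp)

rich_gauge = RichGauge()
for value in (3.0, 1.0, 5.0):
    rich_gauge.record(value)
```

A plain gauge corresponds to the `last` field alone; the rich gauge adds the running statistics.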
Advanced Analytics
Processor Modules
Python: numpy, pandas, scikit-learn, NLTK, SimpleCV
Shell: R-Project rscript, OpenCV
Java / Groovy
PMML Processor Module
Predictive Model Markup Language
Description of Parameterised Data Mining Models
Allows Operationalising Predictive Models
Real-time evaluation and scoring
Jobs
Jobs
Programming Model for Batch Processing
Create, Schedule, Execute and Monitor
Spring Batch and Spring Hadoop Components
Jobs (5): CSV to JDBC, FTP to HDFS, JDBC to HDFS, HDFS to JDBC, HDFS to MongoDB
Jobs - Example
Create job from existing job definition:
job create --name "helloworld-job" --definition "helloworld" --deploy
Run job once:
job launch --name "helloworld-job"
Run job periodically:
stream create --name "hw-cron" --definition "trigger --cron='0/5 * * * * *' > queue:job:helloworld-job" --deploy
Management
Spring XD - Shell
CLI based on Spring Shell
Manages Streams, Jobs, Analytics and Deployment
Completion / Assist
Many built-in Commands (try help)
Started via xd-shell
Spring XD - Admin UI
Management Interface accessible from XD-Admin Node
XD-ADMIN:9393/admin-ui
Spring XD - REST Interface
Accessible from XD-Admin Node
Used by XD-Shell and Admin-UI
http://xd-admin:9393
Extensibility
Extensibility
Custom Modules
Source, Sink, Processor, Job
Spring Integration, Spring Batch
E.g. to wrap a Java Library
Upload new modules via XD-Shell / REST
Register custom Spring Expression Language Aliases
from java.lang.Double.parseDouble(payload.sensorValue)
to #parseDouble(payload.sensorValue)
Scripts
Collection of XD commands
Automation
Deployment
Deployment
deploy or --deploy
stream deploy firstStream
stream create secondStream … --deploy
Deployment Manifest
Customize via --properties Parameter
Control # of Module Instances
Define Target Server or Group
Direct Binding
Stream Data Partitioning
Deployment Manifest - Module Count
http | worker | hdfs
stream deploy … --properties "module.http.count=2,module.worker.count=4,module.hdfs.count=3"
[Diagram: 2 http instances, 4 worker instances, 3 hdfs instances]
Deployment Manifest - Module Placement
http | worker | hdfs
stream deploy … --properties "module.http.count=2,module.worker.count=4,module.hdfs.count=3,module.http.criteria=groups.contains('WEB')"
[Diagram: the 2 http instances are placed on containers in group WEB; 4 worker and 3 hdfs instances run on the remaining containers]
Containers join a group at startup:
xd/bin/xd-container --groups="WEB"
Deployment Manifest - Data Partitioning
http | worker | hdfs
stream deploy … --properties
…
module.worker.count=4,
…
module.http.producer.partitionKeyExpression=payload.customerId
[Diagram: the http instances (group WEB) route each message to one of the 4 worker instances, numbered 0 to 3, followed by 3 hdfs instances]
partition := hash(payload.customerId) % worker.count
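The partition formula can be tried out with a short Python sketch. It assumes CRC32 as a deterministic stand-in for the hash function (the slide does not say which hash XD uses), and `select_partition` and the sample messages are hypothetical.

```python
import zlib
from typing import Dict, List

WORKER_COUNT = 4  # corresponds to module.worker.count=4


def select_partition(partition_key: str, partition_count: int) -> int:
    """partition := hash(key) % count, with CRC32 standing in for the hash."""
    return zlib.crc32(partition_key.encode("utf-8")) % partition_count


# Hypothetical payloads; customerId plays the role of payload.customerId.
messages = [
    {"customerId": "c-17", "amount": 5},
    {"customerId": "c-42", "amount": 9},
    {"customerId": "c-17", "amount": 1},
]

# Group messages by the worker instance that would receive them.
assignments: Dict[int, List[dict]] = {}
for message in messages:
    partition = select_partition(message["customerId"], WORKER_COUNT)
    assignments.setdefault(partition, []).append(message)
```

Because the partition depends only on the key, all messages for one customer reach the same worker instance, which is what makes per-customer state in the worker possible.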
Applications
Spring XD - Measuring Live Usage for a Major Sports League
Measuring live video usage through mobile applications
Spring XD - IoT Connected Car
Journey and Range Prediction
Spring XD - Smartgrid
ACM Distributed Event Based Systems 2014
Scalable, Real-Time Analytics, High Volume Sensor Data
Short-Term Load Forecasting in a Power Grid
Sensor Data from Smart Plugs
Stream Components
Sensor Data Ingestion
Data Aggregation
Load Prediction
Analytics via REST
Demo
What’s next?
Roadmap - 1.2 and beyond
Custom Modules in HDFS
More OOTB Modules
Web based Editor for Streams & Jobs
Apache Ambari Support
Security Enhancements
Spring XD on Pivotal Cloud Foundry
GA Release Planned for May 2015
Learn more
Project http://projects.spring.io/spring-xd
GitHub https://github.com/spring-projects/spring-xd
Wiki http://docs.spring.io/spring-xd/docs/current/reference/html/
Samples https://github.com/spring-projects/spring-xd-samples
Modules https://github.com/spring-projects/spring-xd-modules
JIRA https://jira.spring.io/browse/XD
Stackoverflow http://stackoverflow.com/questions/tagged/spring-xd
Spring XD - Takeaway
Unified runtime for both Real-time and Batch use cases
Scalable, Distributed and Fault Tolerant Runtime
Increased Productivity through out-of-the-box components
Closed Loop Analytics through online (stream) and offline (batch) data
Swiss-army knife of data movement and data pipelines
Repeatable ‘turnkey’ solution for next generation data-centric use cases
Data Ingestion, Processing, Movement, Analytics
Learn More. Stay Connected.
Twitter: twitter.com/springcentral
YouTube: spring.io/video
LinkedIn: spring.io/linkedin
Google Plus: spring.io/gplus
Backup Slides
Lambda Architecture
Lambda Architecture
Lambda Architecture - Spring XD
[Diagram: Lambda Architecture mapped to Spring XD. Labels: Data Lake, HAWQ, Gemfire; Batch Layer (Batch Processing, Batch Views); Speed Layer (Stream Processing, Real-time Views); Serving Layer (Spring Boot applications); Spring XD: Ingest, Export, Analytics, Workflow Orchestration, Predictive Analytics]
Predictive Models
Model
Parameterised Algorithm
Model Building
Derive a parameterised algorithm from the data
Slow process
Usually large data volume -> done offline as a batch process
Model Scoring
Use the model to predict new information
Fast process
Can be done as part of stream processing
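The split between model building and model scoring can be sketched in Python with a deliberately small example. A one-variable least-squares line stands in for a real predictive model; `build_model`, `score_stream` and the sample data are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class LinearModel:
    """The 'parameterised algorithm': y = slope * x + intercept."""
    slope: float
    intercept: float

    def score(self, x: float) -> float:
        return self.slope * x + self.intercept


def build_model(history: List[Tuple[float, float]]) -> LinearModel:
    """Model building: derive the parameters from historical data (batch step)."""
    n = len(history)
    mean_x = sum(x for x, _ in history) / n
    mean_y = sum(y for _, y in history) / n
    variance_x = sum((x - mean_x) ** 2 for x, _ in history)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in history)
    slope = covariance / variance_x
    return LinearModel(slope=slope, intercept=mean_y - slope * mean_x)


def score_stream(model: LinearModel, readings: Iterable[float]) -> Iterator[float]:
    """Model scoring: apply the fixed model to each incoming value (stream step)."""
    for reading in readings:
        yield model.score(reading)


# Offline: fit once on historical (x, y) pairs that lie on y = 2x + 1.
model = build_model([(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])

# Online: score new readings one at a time as they arrive.
predictions = list(score_stream(model, [4.0, 10.0]))
```

Building touches the whole history, while scoring is a constant-time calculation per message, which is why only scoring fits comfortably inside a stream.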
PMML
Predictive Model Markup Language
Open Standard
Maintained by Data Mining Group (DMG)
XML based DSL for predictive models
Can be interpreted
15 Model Types (Naive Bayes, General Regression, Neural Networks, etc.)
First Version (1999), Current Version 4.2.1
“Lingua Franca for Predictive Models”
“Bridge the Gap between Data Scientists and Engineers”
Anatomy of a PMML Model
Predictive Model
Algorithm description(s)
Parameterisation
“trained model”
Pre Processing
Post Processing
Transform model output
Thresholds / Business rules
Source: PMML in Action, 2nd Edition, 2012, p. 7.
Predictive Analytics with Spring XD
XD Module analytic-pmml
Introduced in Spring XD 1.0.0 M6 (April 2014)
Real-time evaluation and scoring
Based on JPMML-Evaluator
Wide range of Model types
spring-xd-modules/analytics-ml-pmml on GitHub