Getting “Real”
Real Time Data Integration Patterns and Architectures
Nelson Petracek
Senior Director, Enterprise Technology Architecture Informatica
Digital Government Institute’s Enterprise Architecture Conference, May 1, 2014, Washington, DC
User Expectations
MORE AGILITY RIGHT Time PROACTIVE vs. REACTIVE INSTANT TRUST Self-Service Fresh Information All Data One Place Immediate Response Times 100% Uptime
Representative Use Cases
Sensor Monitoring
Customer Interaction
Security
• It is no longer sufficient to view information “after the fact”.
• Business demands information sooner, with more accuracy, in order to meet competitive and regulatory demands.
• Business needs to respond to “threats” and “opportunities” sooner.
• Reduce decision latency.
• Proactive alerts and notifications.
• Improve TTA (time to answer).
Traditional Data Management Approaches
Store / Analyze / Act
Data Integration / EDW / BI
Valuable for:
• Reporting
• Historical Activity
• Strategic Analysis
The Challenges with Traditional Approaches
Store / Analyze / Act
• Takes too long to deliver what is needed.
• Lots of “wait” and “waste” in the process.
• No common and trusted data access.
• Information is missing or is stale / delayed.
• Too much “decision
Next Generation Data Integration
A Shift in Thinking is Needed…
• Need to shift from building large, monolithic applications to smaller sets of distributed “micro-applications” based on the principles of “Reactive Applications”*.
  • Resilient
  • Scalable
  • Event Driven
  • Responsive
• Move away from a “store first” approach; provide the ability to process event data as it arrives.
• Focus on hybrid architectures that facilitate both batch and real-time processing.
Reactive Applications: Characteristics
Resilient
• Able to recover at all levels.
• Utilize fine-grained resilience at the component level.
• “Bulkhead pattern”.
Scalable
• Avoid contention on shared resources.
• Scale out or up as needed (without rewrites).
• Maintain programming model as system is scaled.
Event-Driven
• Systems communicate via events.
• Loosely coupled, asynchronous, Amdahl’s Law.
• Efficient use of resources.
Responsive
• Honor response time guarantees regardless of load.
• Provide users with a rich, interactive experience.
• Observable models, event streams, stateful clients.
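The “bulkhead pattern” named under Resilient can be illustrated with a minimal sketch. It assumes one common reading of the pattern: each component gets its own fixed pool of concurrency slots, so a slow or failing component cannot drain capacity from the others. The class name `Bulkhead`, the exception `BulkheadFull`, and the fail-fast behavior are illustrative choices, not something the slides prescribe.

```python
import threading


class BulkheadFull(RuntimeError):
    """Raised when a compartment has no free slot (hypothetical name)."""


class Bulkhead:
    """Caps concurrent calls into one component."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Fail fast instead of queueing when the compartment is full,
        # so callers of other components are never blocked by this one.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"bulkhead '{self.name}' is full")
        try:
            return fn(*args, **kwargs)
        finally:
            # Always return the slot, even if fn raised.
            self._slots.release()
```

Giving, say, reporting and alerting their own `Bulkhead` instances means a flood of slow report requests is rejected at the reporting compartment while alerting keeps its full capacity.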
Sample Architectural Approach: Reactive Applications
Diagram labels:
• Operational Data (Field Devices, Applications, Clickstream, IoT, logs, etc.)
• Event Based Applications
• Various Source Applications / Technologies
• Data Warehouse
• Hadoop / NoSQL
• Analytics
• Streaming Collection: Vibe Data Stream
• Data Integration: PowerCenter
• Event Processing / Streaming Analytics: RulePoint
• CDC / Data Access: CDC, PWX
• Real Time Stream Transport / Delivery: Ultra Messaging
• Stream Transformation: B2B Data Transformation, Power Exchange
OI System
Action
• Proactive actions instead of reactive.
• Allows the end-user to define conditions and rules through self-service capabilities.
• Users are “pushed” the information they need, when they need it, in the system where they need it.
EVENTS / DATA / ALERTS
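A rough sketch of the idea above: user-defined conditions are evaluated against each event as it arrives, and matching events are pushed to the user as alerts. This is plain Python for illustration only and is not the API of any product named in this deck; `Rule`, `process`, and the event layout are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]  # assumed event shape: a flat dictionary


@dataclass
class Rule:
    name: str
    condition: Callable[[Event], bool]     # user-defined predicate
    notify: Callable[[str, Event], None]   # where the alert is pushed


def process(event: Event, rules: List[Rule]) -> List[str]:
    """Evaluate every rule against one incoming event.

    Returns the names of the rules that fired, in rule order.
    """
    fired = []
    for rule in rules:
        if rule.condition(event):
            rule.notify(rule.name, event)  # push, rather than wait for a query
            fired.append(rule.name)
    return fired
```

A user might register `Rule("overheat", lambda e: e["temp"] > 90, send_to_dashboard)`, where `send_to_dashboard` stands in for whatever delivery channel the user works in.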
Sample “Big Data” Reference Architectures
* Source: http://hortonworks.com/hdp/
* Source: http://www.cloudera.com/content/cloudera/en/products-and-services/
“Real-Time” Component
Hybrid Architecture: Batch Plus Real-Time
“Big Data Supply Chain”
Data Sources (Devices, Apps, Clickstream, IoT, logs, etc.)
Historical Batch Computation
• Batch
• Map / Reduce, YARN
• Data Analytics
• Long term Persistence, High Latency
• e.g. Purchase history analysis.
Distributed Real-Time Computation
• Real-Time
• Continuous Computations
• Streaming Analytics / Event Processing
• Incremental, Low Latency
• e.g. Sensor / infrastructure monitoring.
Data Targets (Dashboards, BI,
Stream Collection
• Separate from “batch” or “bulk” data loading.
• Involves the collection of event data (“streams”) as they occur, from various endpoints, systems, and people.
• Multiple options available:
  • “Micro-batch” or near real-time data integration.
  • Data integration hub pattern.
  • Real-time collection.
  • Data replication, etc.
• Number of factors to look at when determining the right pattern to utilize.
Stream Collection: Replication
Diagram labels: Source System, Extract, Server Manager, Intermediate Files, Apply, Server Manager, Target System, Console (http://); SQL Apply, Merge Apply, Audit Apply; Committed Checkpoint, Checkpoint; High Speed Extraction, High Speed Parallel Apply
• Utilize replication beyond the “copying” of data from one data store to another.
• Event-enable back-end data stores.
• Non-intrusively detect changes in data, publish data changes to one or more targets.
• Real-time delivery of the latest data changes to target systems.
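The extract / checkpoint / apply flow described above can be sketched as follows. This is a simplified illustration, not the behavior of any specific replication product: the change log is modeled as an in-memory list, targets are plain dictionaries, and the record layout `(sequence, op, key, row)` and all function names are assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical change record: (sequence number, operation, key, row)
Change = Tuple[int, str, str, dict]


def extract(log: List[Change], checkpoint: int) -> List[Change]:
    """Return committed changes newer than the last checkpoint."""
    return [c for c in log if c[0] > checkpoint]


def apply_changes(changes: List[Change], target: Dict[str, dict], checkpoint: int) -> int:
    """Apply changes in sequence order; return the new checkpoint."""
    for seq, op, key, row in sorted(changes, key=lambda c: c[0]):
        if op == "delete":
            target.pop(key, None)
        else:  # "insert" or "update"
            target[key] = row
        checkpoint = seq
    return checkpoint


def replicate(log: List[Change], targets: List[Dict[str, dict]], checkpoint: int) -> int:
    """Publish every change after the checkpoint to all targets."""
    changes = extract(log, checkpoint)
    new_checkpoint = checkpoint
    for target in targets:
        new_checkpoint = apply_changes(changes, target, checkpoint)
    return new_checkpoint
```

Because each run starts from the stored checkpoint, the source is only read for changes it has already committed, and a restart resumes where the previous run stopped instead of recopying the data store.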
Stream Collection: Data Integration Hub Pattern
• Eliminate point-to-point collection / delivery interfaces.
• Provide a location independent mechanism for data producers (and consumers) to “talk” to one another.
• “Publish and Subscribe”
• Manage data delivery impedance mismatches.
• Provide self-service capabilities.
• Centralize data quality, masking, transformation logic.
Stream Collection: Distributed Agents
• Distribute collection across thousands of endpoints.
• Perform filtering, transformation, etc. “close to the source”.
• Focus on daemon-less or broker-less designs for improved performance and scalability.
• Provide varying qualities of service.
  • Streaming, guaranteed, etc.
• Allow for dynamic configuration.
Diagram labels: Sources, Stream Nodes, Targets
Stream Collection: Distributed Agents with Collectors
Diagram labels: Agents (edge data filtering and processing); Streaming Data Collection; Data Transfer; Local Hub, Regional Hub, Central Hub; Event Processing, Data Integration, HDFS, EDW, Real Time Actions
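A small sketch of agents that filter and transform “close to the source” and then hand events up a chain of hubs. The `Hub` and `Agent` classes, the in-process forwarding, and the local / regional / central chain are illustrative assumptions; a real deployment would move events over a network transport with a chosen quality of service.

```python
from typing import Any, Callable, Dict, List, Optional

Event = Dict[str, Any]  # assumed event shape


class Hub:
    """Collector that records events and forwards them to its parent hub."""

    def __init__(self, name: str, parent: Optional["Hub"] = None):
        self.name = name
        self.parent = parent
        self.received: List[Event] = []

    def forward(self, event: Event) -> None:
        self.received.append(event)
        if self.parent is not None:
            self.parent.forward(event)


class Agent:
    """Edge agent: drops unwanted events before they leave the endpoint."""

    def __init__(self, hub: Hub,
                 keep: Callable[[Event], bool],
                 transform: Callable[[Event], Event] = lambda e: e):
        self.hub = hub
        self.keep = keep
        self.transform = transform

    def collect(self, event: Event) -> bool:
        """Return True if the event was forwarded, False if filtered out."""
        if not self.keep(event):
            return False
        self.hub.forward(self.transform(event))
        return True
```

Filtering at the agent means discarded events never consume bandwidth between the endpoint and the hubs, which matters when collection is spread across thousands of endpoints.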
Event Streaming Analytics
• Execute logic against real-time streams.
• Utilize streaming language constructs.
• Logic may be executed at a point-in-time, or over time.
• Temporal reasoning.
• Join or merge multiple streams together for real-time pattern recognition, correlation, etc. across data sources.
• Timely and contextual.
• Augment real-time streams with historical context.
Distributed Real-Time Computation
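As an illustration of logic executed “over time” rather than at a single point in time, the sketch below keeps a sliding time window over a stream and signals when enough events fall inside it. It assumes events arrive in timestamp order; the class name `SlidingWindowCounter` and its parameters are hypothetical, and a streaming engine would express the same idea in its own window constructs.

```python
from collections import deque


class SlidingWindowCounter:
    """Detects `threshold` or more events within `window_seconds`."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window = window_seconds
        self.threshold = threshold
        self._times = deque()  # timestamps still inside the window

    def observe(self, timestamp: float) -> bool:
        """Record one event; return True when the pattern is matched."""
        self._times.append(timestamp)
        # Evict events that are older than the window ending at `timestamp`.
        while self._times and self._times[0] <= timestamp - self.window:
            self._times.popleft()
        return len(self._times) >= self.threshold
```

For example, `SlidingWindowCounter(60, 3)` flags three failed logins from one account inside a minute, while the same three failures spread over an hour never trigger it.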
Event Delivery
Data Integration Hub
• Allow data consumers to “subscribe” to data previously pushed to the hub.
• Batch + near real-time feed.
Data Integration
• Feed content into back-end systems through application interfaces.
• Batch + near real-time feed.
Streaming Delivery
• Push content to end applications, dashboards, etc.
• Content may consist of derived or raw events.
• Near real-time + real-time feed.
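The hub delivery style above, where consumers subscribe to data previously pushed to the hub, can be sketched as a publish / subscribe registry that retains messages per topic. This is an in-memory illustration under assumed names (`IntegrationHub`, `publish`, `subscribe`, `replay`); it says nothing about how any particular hub product stores or delivers data.

```python
from collections import defaultdict
from typing import Any, Callable


class IntegrationHub:
    """Decouples producers from consumers through named topics."""

    def __init__(self):
        self._history = defaultdict(list)      # topic -> retained messages
        self._subscribers = defaultdict(list)  # topic -> consumer callbacks

    def publish(self, topic: str, message: Any) -> None:
        # Retain the message so later subscribers can still receive it.
        self._history[topic].append(message)
        for callback in list(self._subscribers[topic]):
            callback(message)

    def subscribe(self, topic: str, callback: Callable[[Any], None],
                  replay: bool = True) -> None:
        # Optionally deliver what was pushed before this consumer arrived.
        if replay:
            for message in list(self._history[topic]):
                callback(message)
        self._subscribers[topic].append(callback)
```

Replay covers the batch-style catch-up feed, and the live callbacks cover the near real-time feed, so one subscription serves both delivery modes.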
Lambda Architecture
* Source: http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting
• Data is distributed to both a “Batch Layer” and “Speed Layer” for processing.
• Batch layer manages the append-only master set of raw data.
• “Serving Layer” indexes batch views for low-latency queries.
• “Speed Layer” covers recent data not in the Batch Layer.
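A toy sketch of how the three layers fit together, using event counts per key as the query. It is heavily simplified and rests on stated assumptions: everything lives in one process, and the batch run is instantaneous, so the speed view can simply be cleared when it finishes. In a real system the batch run takes time and the speed layer only discards data the batch views have absorbed. The class name `LambdaCounts` is hypothetical.

```python
from collections import Counter


class LambdaCounts:
    def __init__(self):
        self.master = []             # batch layer: append-only raw events
        self.batch_view = Counter()  # serving layer: precomputed counts
        self.speed_view = Counter()  # speed layer: counts since last batch run

    def ingest(self, key: str) -> None:
        """Send each event to both the batch layer and the speed layer."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self) -> None:
        """Recompute the batch view from the full master data set."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()  # recent data is now covered by the batch view

    def query(self, key: str) -> int:
        """Merge the batch view with the recent data in the speed view."""
        return self.batch_view[key] + self.speed_view[key]
```

Queries stay correct between batch runs because anything not yet in the batch view is still counted in the speed view.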