Taming the Elephant with
Big Data Management
Safe Harbor
The information being provided today is for informational purposes only. The
development, release and timing of any Informatica product or functionality
described today remain at the sole discretion of Informatica and should not be
relied upon in making a purchasing decision. Statements made today are based
on currently available information, which is subject to change. Such statements
should not be relied upon as a representation, warranty or commitment to deliver
specific products or functionality in the future.
Overview of Data Integration Solutions
Traditional Workloads (PowerCenter)
• Data Warehousing
• Agile BI
• Real-time DI
• Data Migration
• Apps Integration (on-prem)
Next-Gen Workloads (Big Data Management)
• DW Offloading / Optimization
• Data Lakes
• Big Data Analytics
• NoSQL Integration
Cloud & SaaS Workloads (Cloud Data Integration)
• Apps Integration (Hybrid)
• Cloud & Hybrid DI
• DW & Analytics (Cloud DBs)
Informatica’s big data Journey – 2012
• 2012 – 1st release of Informatica Big Data Edition
• 1st Data Integration Platform to
  • Natively execute on Hadoop
  • Support for Map Reduce
  • Support for HDFS/Hive/HBase
  • Profile Natively on Hadoop
[Diagram: Hadoop 1.0 stack. Map Reduce (Processing & Resource Management) on top of HDFS (Distributed Storage).]
Informatica’s big data Journey – 2016
[Diagram: Informatica Big Data Management with the Smart Executor on top of the execution engines: Blaze (INFA engine), Spark (Spark Core), Hive on Tez, Hive on Spark, and Hive on Map Reduce, all running on YARN over HDFS.]
• Polyglot computing: Map Reduce, Blaze, Tez, Spark
• Multi-distribution support on both on-prem and cloud
• End to End Big Data Management
Big data modes of execution
• Run on Informatica Node(s)
  • Connect to Hadoop sources/targets
• Run on Hadoop cluster
  • Connect to Hadoop sources/targets
  • Connect to non-Hadoop sources/targets
[Diagram: Informatica Big Data Management. Informatica Mappings hold the business logic and run through Polyglot Computing: Informatica Native, SQL Pushdown, or Hadoop Pushdown (Map Reduce, Tez, Spark, Blaze).]
Why Informatica BDM?
Big Data Challenges (Source: Gartner) and Solution
• 36%: Obtaining skills and capabilities needed
  → Mapping based development → PC Reuse → SQL to Mapping
• 33%: Security, Privacy & Data Quality
  → Kerberos Support → Sentry / Ranger Support → Data masking, OS Profiles → DQ, Profiling on Hadoop
• 26%: Integrating multiple data sources
  → Power Exchange → Data Processor → SQOOP
• 26%: Integrating big data technology with existing infrastructure
  → On-Prem distro support → Cloud distro support
3 pillars of Informatica Big Data Management
• Data Integration
• Data Quality & Governance
• Data Security
Single, Comprehensive and Integrated Platform for End-to-End Big Data Management
• 100+ pre-built parsers
• 200+ pre-built connectors
• Out-of-the-box business rules and data standardization
Universal connectivity
WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft,
Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, ADABAS, Datacom, DB2, IDMS, IMS,
Word, Excel, PDF, StarOffice, WordPerfect, Email (POP, IMAP), HTTP,
Informix, Teradata, Netezza, ODBC, JDBC, VSAM, C-ISAM, Binary Flat Files, Tape Formats…,
Web Services, TIBCO, webMethods, Flat files, ASCII reports, HTML, RPG, ANSI, LDAP,
EDI–X12, EDI-Fact, RosettaNet, HL7, HIPAA, XML, LegalXML, IFX, cXML, AST, FIX, SWIFT, Cargo IMP, MVR,
Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand,
Facebook, Twitter, LinkedIn, Kapow, Pivotal, Vertica, Netezza, Teradata, Aster
Pre-Built Parsers for Industry Standards
• Data Storage & Transport Formats: XML, JSON, Parquet, AVRO
• Industry Standard Formats: Financial Services, Healthcare, EDI
• Organizational Formats: Delimited Files, PDF, Word, Excel
[Diagram labels: Hadoop Cluster, Informatica IDE]
SQOOP
• JDBC based universal connectivity to many sources
• No need for installation of database clients on Hadoop cluster to read / write data
• Seamless integration into Informatica mappings
• Integration at both connection and data object level
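To make "JDBC based connectivity" concrete, here is a minimal Python sketch that assembles a standalone `sqoop import` command line. It only illustrates what a Sqoop import driven by a JDBC URL looks like outside of Informatica; it is not the command that Informatica mappings generate. The host, service name, user, password file path, and table names are hypothetical.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table, hive_table, num_mappers=4):
    """Return the argument list for a `sqoop import` that lands a table in Hive.

    Only a JDBC URL and credentials are passed in; no database client
    installation is referenced anywhere in the command.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the source database
        "--username", username,
        "--password-file", password_file,   # file holding the password
        "--table", table,                   # source table to read
        "--hive-import",                    # load the result into Hive
        "--hive-table", hive_table,
        "--num-mappers", str(num_mappers),  # parallel import tasks
    ]


# Hypothetical connection details, for illustration only.
cmd = build_sqoop_import(
    jdbc_url="jdbc:oracle:thin:@//db.example.com:1521/ORCL",
    username="etl_user",
    password_file="/user/etl_user/.oracle_password",
    table="STAGED_FLIGHTS",
    hive_table="flights_raw",
)
print(shlex.join(cmd))
```

Building the command as a list rather than one string avoids quoting problems if it is later handed to `subprocess.run`.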
Profiling on Hadoop
Analyst
• Statistics to identify anomalies
• Value & Pattern Analysis
• Drill down analysis
• Multi tenancy
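As a rough illustration of value and pattern analysis, the sketch below profiles one column in plain Python: it counts nulls, distinct values, value frequencies, and character patterns (digits become `9`, letters become `X`). This is a generic profiling technique, not Informatica's implementation, and the sample flight numbers are made up.

```python
from collections import Counter


def to_pattern(value):
    """Map each digit to '9' and each letter to 'X'; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("X")
        else:
            out.append(ch)
    return "".join(out)


def profile_column(values):
    """Compute basic statistics plus value and pattern frequencies for one column."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "value_frequencies": Counter(non_null),
        "pattern_frequencies": Counter(to_pattern(v) for v in non_null),
    }


# Hypothetical sample column, for illustration only.
flight_numbers = ["IA101", "IA202", "IA101", "", "7X-33", None]
stats = profile_column(flight_numbers)

# Patterns that occur rarely are candidates for anomalies worth drilling into.
for pattern, freq in stats["pattern_frequencies"].most_common():
    print(pattern, freq)
```

In this sample the dominant pattern is `XX999`, so the single value with a different pattern stands out as the one to inspect.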
Data Quality on Hadoop
Data Quality
Address validation
Parse
Match
Standardize
Security has many aspects
Layers: Infrastructure, Data, Application
Aspects: Authentication, Authorization, Auditing, Monitoring, Encryption, Data Masking+, Multi-tenancy+
http://blogs.informatica.com/2015/07/24/bigdatasecurity-2/
Authentication: Kerberos
Industry standard authentication for Hadoop clusters
• Informatica BDM Supports:
  • Kerberos authentication in INFA domains
  • Connecting to Kerberos enabled Hadoop clusters
• 360° support:
  • Client & Server
  • Metadata access & Data access
Blaze Security Integration – Ranger/Sentry
[Diagram labels: Informatica node; Hadoop Cluster; Blaze Runtime; Blaze Container; Mapping at runtime (in-memory): Source, Transforms, Target; Ranger/Sentry; Blaze Executor; HDFS Data files; HDFS Service / Hive Server 2; Optimizer call; Informatica Monitoring; numbered steps 1 to 3]
Data Masking
Mask sensitive data while ingesting and processing
• Supports Persistent Data Masking
• 16 different techniques supported including
  • SSN
  • Credit Card
  • First & Last names, Emails
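The kinds of masking listed above can be sketched in a few lines of Python. These are generic, simplified techniques for illustration and not the algorithms Informatica ships; the function names and the `secret` salt are hypothetical.

```python
import hashlib
import re


def mask_ssn(ssn):
    """Hide all but the last four digits of a US Social Security number."""
    digits = re.sub(r"\D", "", ssn)
    return "XXX-XX-" + digits[-4:]


def mask_credit_card(number):
    """Replace every digit except the last four with '*'."""
    digits = re.sub(r"\D", "", number)
    return "*" * (len(digits) - 4) + digits[-4:]


def mask_email(email, secret="demo-salt"):
    """Replace the local part with a salted hash so the same input always masks the same way."""
    local, _, domain = email.partition("@")
    token = hashlib.sha256((secret + local).encode("utf-8")).hexdigest()[:10]
    return f"user_{token}@{domain}"


print(mask_ssn("123-45-6789"))
print(mask_credit_card("4111 1111 1111 1111"))
print(mask_email("jane.doe@example.com"))
```

Deterministic masking, as in `mask_email`, keeps joins across masked tables consistent, which matters when the masked values are persisted rather than masked on the fly.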
• Polyglot engine:
  • Supported in Native mode
  • Supported in Hive mode
Multi-tenancy
Application Binding
• Bind multiple Informatica users to one or more system accounts
• System accounts can be OS / Hadoop accounts
• Primarily used in batch use-cases, mappings
User Binding
• Also known as pass through security
• Bind individual Informatica users to their corresponding OS / Hadoop accounts
3 pillars of Informatica Big Data Management
• Data Integration
  • SQOOP
  • Blaze
  • DI on Spark
• Data Quality & Governance
  • SQOOP for Profiling
  • Blaze for Profiling
  • JDBC for reference data*
• Data Security
  • Kerberos
  • Sentry / Ranger
Single, Comprehensive and Integrated Platform for End-to-End Big Data Management
Deep Dive
DEMO – Use case
Scenario:
INFA Air receives information from multiple airports on the expected and actual schedules of various flights. They need to consolidate this information into a Hadoop environment to perform analytics such as flight-on-time analysis.
Challenges:
• Data is collected in various formats and at various intervals: some airports provide flat files and some stage their data in an Oracle table
• All this data is ingested into a Hive table for cleansing and analysis
• The data from the Hive table is subsequently sent to an alerting system that sends individual alerts to travelers
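As a minimal sketch of the flight-on-time analysis mentioned in the scenario, the Python below compares expected and actual times and computes an on-time rate. The record layout, the sample flights, and the 15-minute threshold are assumptions made for illustration; the deck does not define them.

```python
from datetime import datetime

# Assumption: a flight counts as on time if it is at most 15 minutes late.
ON_TIME_THRESHOLD_MINUTES = 15


def delay_minutes(scheduled, actual, fmt="%Y-%m-%d %H:%M"):
    """Return how many minutes after the scheduled time the flight actually ran."""
    delta = datetime.strptime(actual, fmt) - datetime.strptime(scheduled, fmt)
    return delta.total_seconds() / 60


def on_time_rate(flights):
    """Return the fraction of flights whose delay is within the threshold."""
    if not flights:
        return 0.0
    on_time = sum(
        1
        for f in flights
        if delay_minutes(f["scheduled"], f["actual"]) <= ON_TIME_THRESHOLD_MINUTES
    )
    return on_time / len(flights)


# Hypothetical records, as they might look after consolidation.
flights = [
    {"flight_no": "IA101", "scheduled": "2016-05-23 08:00", "actual": "2016-05-23 08:10"},
    {"flight_no": "IA202", "scheduled": "2016-05-23 09:30", "actual": "2016-05-23 09:30"},
    {"flight_no": "IA303", "scheduled": "2016-05-23 10:00", "actual": "2016-05-23 10:45"},
    {"flight_no": "IA404", "scheduled": "2016-05-23 11:00", "actual": "2016-05-23 11:20"},
]
rate = on_time_rate(flights)
print(f"On-time rate: {rate:.0%}")
```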
Lab environment
[Diagram: Private Network containing a Hadoop Cluster (Hadoop Node 1, Hadoop Node 2), the Informatica Server, and the Informatica Client]
Login credentials
Machine                     Host name             Username        Password
Hadoop Node 1               psvrl65iw2016hdp001   iw2016          iw2016
Hadoop Node 2               psvrl65iw2016hdp002   iw2016          iw2016
INFA Server                 psvrl65iw2016i1001    iw2016          iw2016
INFA Client                 psvw7iw2016i1001      Administrator   iw2016
Administrator, Monitoring                         Administrator   Administrator
Lab access:
https://informatica.instructorled.training
Access code: 34762748
xx
• Lab 1 – High speed Ingestion in pushdown mode
  • Read from flat file
  • Read from Oracle
  • Union the data
  • Write to Hive
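The data flow of Lab 1 (two sources, a union, one target) can be sketched outside of Informatica as follows. This is only a stand-in to show the shape of the flow: an in-memory SQLite database plays the role of both the Oracle staging table and the Hive target, and all table names, column names, and rows are hypothetical. In the lab itself the flow is built as an Informatica mapping and pushed down to the cluster.

```python
import csv
import io
import sqlite3

COLUMNS = ("flight_no", "airport", "scheduled", "actual")

# Stand-in for the flat file delivered by an airport (hypothetical rows).
FLAT_FILE = """flight_no,airport,scheduled,actual
IA101,SFO,2016-05-23 08:00,2016-05-23 08:10
IA202,JFK,2016-05-23 09:30,2016-05-23 09:30
"""


def read_flat_file(text):
    """Read delimited rows and project them onto the common column order."""
    reader = csv.DictReader(io.StringIO(text))
    return [tuple(row[c] for c in COLUMNS) for row in reader]


def read_staged_table(conn):
    """Read the relational source (SQLite stands in for Oracle here)."""
    cur = conn.execute(
        "SELECT flight_no, airport, scheduled, actual FROM staged_flights"
    )
    return [tuple(r) for r in cur.fetchall()]


def union_all(*sources):
    """Concatenate rows from sources that share the same schema (no de-duplication)."""
    rows = []
    for source in sources:
        rows.extend(source)
    return rows


def write_target(conn, rows):
    """Load the combined rows into the target table (SQLite stands in for Hive)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS flights_target "
        "(flight_no TEXT, airport TEXT, scheduled TEXT, actual TEXT)"
    )
    conn.executemany("INSERT INTO flights_target VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staged_flights "
    "(flight_no TEXT, airport TEXT, scheduled TEXT, actual TEXT)"
)
conn.execute(
    "INSERT INTO staged_flights VALUES "
    "('IA303', 'ORD', '2016-05-23 10:00', '2016-05-23 10:45')"
)

rows = union_all(read_flat_file(FLAT_FILE), read_staged_table(conn))
write_target(conn, rows)
loaded = conn.execute("SELECT COUNT(*) FROM flights_target").fetchone()[0]
print(f"Loaded {loaded} rows")
```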
• Lab 2 – Extraction with schema-on-read
  • Read from Hive
  • Write to flat file
  • Dynamically update the schema
  • Use Blaze
Questions…?
Informatica User Groups are a great way for you to invest in your professional development and learn about new Informatica offerings.
• Local Chapter Leaders manage each IUG online and via in-person meetings
• Network and Socialize
• Find and share content, best practices & tips
• Learn about the latest technologies and solutions from Informatica
• Discover how colleagues and peers use Informatica
• https://network.informatica.com/welcome/
• LEARN MORE AT IW16: Go to the Solutions Expo Informatica Pavilion / Ecosystem & Innovation Area:
  • Talk to regional user group leaders
  • Learn about meeting plans
  • Join your regional user group
• When:
  • Monday 6:00pm – 8:30pm
  • Tuesday 10:45am – 2:15pm
  • Wednesday 10:30am – 1:45pm
• Where:
  • Moscone West Hall Level One