Taming the Elephant with Big Data Management. Deep Dive

(1)

Taming the Elephant with Big Data Management

(2)

Big Data Management

(3)

Safe Harbor

The information being provided today is for informational purposes only. The development, release and timing of any Informatica product or functionality described today remain at the sole discretion of Informatica and should not be relied upon in making a purchasing decision. Statements made today are based on currently available information, which is subject to change. Such statements should not be relied upon as a representation, warranty or commitment to deliver specific products or functionality in the future.

(4)

Overview of Data Integration Solutions

Traditional Workloads (PowerCenter)
• Data Warehousing
• Agile BI
• Real-time DI
• Data Migration
• Apps Integration (on-prem)

Next-Gen Workloads (Big Data Management)
• DW Offloading / Optimization
• Data Lakes
• Big Data Analytics
• NoSQL Integration

Cloud & SaaS Workloads (Cloud Data Integration)
• Apps Integration (Hybrid)
• Cloud & Hybrid DI
• DW & Analytics (Cloud DBs)

(5)

Informatica’s Big Data Journey – 2012

• 2012: 1st release of Informatica Big Data Edition
• 1st Data Integration Platform to:
  • Natively execute on Hadoop
  • Support Map Reduce
  • Support HDFS/Hive/HBase
  • Profile natively on Hadoop

Hadoop 1.0
• Map Reduce: Processing & Resource Management
• HDFS: Distributed Storage

(6)

Informatica’s Big Data Journey – 2016

[Diagram: Informatica Big Data Management and its Smart Executor on top of the execution engines Blaze (INFA engine), Spark, Hive on Tez, Hive on Spark, and Hive on Map Reduce, running on YARN and HDFS]

• Polyglot computing: Map Reduce, Blaze, Tez, Spark
• Multi-distribution support, both on-prem and cloud
• End-to-End Big Data Management

(7)

Big data modes of execution

• Run on Informatica Node(s)
  • Connect to Hadoop sources/targets

• Run on Hadoop cluster
  • Connect to Hadoop sources/targets
  • Connect to non-Hadoop sources/targets

(8)

Why Informatica BDM?

[Diagram: Solution. Business logic is captured as Informatica Mappings; Informatica Big Data Management uses polyglot computing to run them in Native mode, as SQL Pushdown, or as Hadoop Pushdown (Map Reduce, Tez, Spark, Blaze)]

(9)

Big Data Challenges

• 36%: Obtaining skills and capabilities needed
  → Mapping-based development → PC Reuse → SQL to Mapping
• 33%: Security, Privacy & Data Quality
  → Kerberos support → Sentry / Ranger support → Data masking, OS Profiles → DQ, Profiling on Hadoop
• 26%: Integrating multiple data sources
  → PowerExchange → Data Processor → SQOOP
• 26%: Integrating big data technology with existing infrastructure
  → On-prem distro support → Cloud distro support

Source: Gartner

(10)

3 pillars of Informatica Big Data Management

• Data Integration
• Data Quality & Governance
• Data Security

Single, Comprehensive and Integrated Platform for End-to-End Big Data Management

(11)

Universal connectivity

• 100+ pre-built parsers
• 200+ pre-built connectors
• Out-of-the-box business rules and data standardization

Supported sources, targets and formats include: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, ADABAS, Datacom, DB2, IDMS, IMS, Word, Excel, PDF, StarOffice, WordPerfect, Email (POP, IMAP), HTTP, Informix, Teradata, Netezza, ODBC, JDBC, VSAM, C-ISAM, Binary Flat Files, Tape Formats…, Web Services, TIBCO, webMethods, Flat files, ASCII reports, HTML, RPG, ANSI, LDAP, EDI-X12, EDI-Fact, RosettaNet, HL7, HIPAA, XML, LegalXML, IFX, cXML, AST, FIX, SWIFT, Cargo IMP, MVR, Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand, Facebook, Twitter, LinkedIn, Kapow, Pivotal, Vertica, Netezza, Teradata Aster

(12)

Pre-Built Parsers for Industry Standards

Data Storage & Transport Formats
• XML
• JSON
• Parquet
• AVRO

Industry Standard Formats
• Financial Services
• Healthcare
• EDI

Organizational Formats
• Delimited Files
• PDF
• Word
• Excel

[Diagram labels: Hadoop Cluster, Informatica IDE]

(13)

SQOOP

• JDBC-based universal connectivity to many sources
• No need to install database clients on the Hadoop cluster to read / write data
• Seamless integration into Informatica mappings
• Integration at both connection and data object level

(14)

Profiling on Hadoop

Analyst
• Statistics to identify anomalies
• Value & Pattern Analysis
• Drill-down analysis
• Multi-tenancy
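As an illustration of what value and pattern analysis means, here is a minimal Python sketch of profiling a single column. It is not how Informatica's profiling runs on Hadoop; the function names `pattern_of` and `profile_column` and the sample flight numbers are hypothetical, chosen only to show how a pattern count exposes an anomaly.

```python
from collections import Counter


def pattern_of(value: str) -> str:
    """Map each character to a class: 9 for digits, X for letters, others kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("X")
        else:
            out.append(ch)
    return "".join(out)


def profile_column(values):
    """Return basic statistics plus value and pattern frequencies for one column."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "values": Counter(non_null),
        "patterns": Counter(pattern_of(v) for v in non_null),
    }


# Hypothetical flight numbers; "1234" breaks the dominant XX999 pattern.
flight_numbers = ["UA123", "DL456", "UA123", "", "1234", None]
report = profile_column(flight_numbers)
print(report["patterns"].most_common())
```

Drilling down would then mean listing the rows behind a rare pattern such as `9999`.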

(15)

Data Quality on Hadoop

• Address validation
• Parse
• Match
• Standardize
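To give a feel for the standardize and match steps, the following Python sketch normalizes free-text airport names and compares them with a similarity score. It uses the standard-library `difflib` as a stand-in and does not represent Informatica's data quality transformations; the abbreviation table, the 0.9 threshold, and the function names are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table used for standardization.
ABBREVIATIONS = {"intl": "international", "apt": "airport"}


def standardize(name: str) -> str:
    """Lowercase, strip punctuation, and expand known abbreviations."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def is_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two names as the same entity if their standardized forms are similar."""
    ratio = SequenceMatcher(None, standardize(a), standardize(b)).ratio()
    return ratio >= threshold


print(standardize("San Francisco Intl. Apt"))
print(is_match("San Francisco Intl Airport", "SAN FRANCISCO INTERNATIONAL AIRPORT"))
print(is_match("San Francisco Intl Airport", "Denver Intl Airport"))
```

Standardizing before matching is the design point: the comparison then sees real differences rather than casing and abbreviation noise.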

(16)

Security has many aspects

Layers: Infrastructure, Data, Application

• Authentication
• Authorization
• Auditing
• Monitoring
• Encryption
• Data Masking+
• Multi-tenancy+

http://blogs.informatica.com/2015/07/24/bigdatasecurity-2/

(17)

Authentication: Kerberos

Industry standard authentication for Hadoop clusters

• Informatica BDM supports:
  • Kerberos authentication in INFA domains
  • Connecting to Kerberos-enabled Hadoop clusters
• 360° support:
  • Client & Server
  • Metadata access & Data access

(18)

Blaze Security Integration – Ranger/Sentry

[Diagram labels: Informatica node; Hadoop Cluster; Blaze Runtime; Blaze Container; Mapping at runtime (in-memory): Source, Transforms, Target; Blaze Executor; Ranger/Sentry; HDFS Service / Hive Server 2; HDFS data files; Optimizer call]

(19)

Informatica Monitoring


(20)

Informatica Monitoring


(21)

Informatica Monitoring


(22)

Data Masking

Mask sensitive data while ingesting and processing

• Supports Persistent Data Masking
• 16 different techniques supported, including:
  • SSN
  • Credit Card
  • First & Last names, Emails
• Polyglot engine:
  • Supported in Native mode
  • Supported in Hive mode
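The idea behind persistent masking (the masked value is stored, and the same input always produces the same masked output) can be sketched in Python as below. This is an illustrative assumption, not one of the 16 Informatica techniques: the HMAC-based digit derivation, the `SECRET_KEY` value, and all function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; a real deployment would load this from a key store.
SECRET_KEY = b"demo-only-key"


def _digits_from(value: str, n: int) -> str:
    """Derive n digits from value; the same input always gives the same digits."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "".join(str(int(c, 16) % 10) for c in digest[:n])


def mask_ssn(ssn: str) -> str:
    """Replace an SSN with a repeatable fake value in NNN-NN-NNNN format."""
    d = _digits_from(ssn, 9)
    return f"{d[:3]}-{d[3:5]}-{d[5:]}"


def mask_credit_card(number: str) -> str:
    """Hide all but the last four digits, keeping separators in place."""
    total_digits = sum(c.isdigit() for c in number)
    out, seen = [], 0
    for c in number:
        if c.isdigit():
            seen += 1
            out.append(c if seen > total_digits - 4 else "X")
        else:
            out.append(c)
    return "".join(out)


def mask_email(email: str) -> str:
    """Replace the local part with a repeatable pseudonym, keeping the domain."""
    local, _, domain = email.partition("@")
    return f"user{_digits_from(local, 6)}@{domain}"


print(mask_ssn("123-45-6789"))
print(mask_credit_card("4111-1111-1111-1111"))
print(mask_email("jane.doe@example.com"))
```

Because the output is deterministic, masked keys still join consistently across tables, which is what makes this style of masking usable during ingestion.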

(23)

Multi-tenancy

Application Binding
• Bind multiple Informatica users to one or more system accounts
• System accounts can be OS / Hadoop accounts
• Primarily used in batch use cases and mappings

User Binding
• Also known as pass-through security
• Bind individual Informatica users to their corresponding OS / Hadoop accounts
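The difference between the two binding styles can be shown with a small Python sketch. It models only the lookup, under the assumption that a binding is a mapping from an Informatica user to the account a job runs as; the `AccountResolver` class and the user and account names are hypothetical and are not part of any Informatica API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AccountResolver:
    # Application binding: many Informatica users share a system account.
    application_bindings: Dict[str, str] = field(default_factory=dict)
    # User binding (pass-through): each user maps to their own account.
    user_bindings: Dict[str, str] = field(default_factory=dict)

    def resolve(self, informatica_user: str, pass_through: bool = False) -> str:
        """Return the OS / Hadoop account a job should run as."""
        bindings = self.user_bindings if pass_through else self.application_bindings
        if informatica_user not in bindings:
            raise PermissionError(f"no binding for {informatica_user!r}")
        return bindings[informatica_user]


resolver = AccountResolver(
    application_bindings={"alice": "svc_etl", "bob": "svc_etl"},
    user_bindings={"alice": "alice_hdp", "bob": "bob_hdp"},
)
print(resolver.resolve("alice"))                     # shared batch account
print(resolver.resolve("alice", pass_through=True))  # alice's own account
```

With application binding, cluster-side audit logs show the shared account; with user binding they show the individual user.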

(24)

3 pillars of Informatica Big Data Management

Data Integration
• SQOOP
• Blaze
• DI on Spark

Data Quality & Governance
• SQOOP for Profiling
• Blaze for Profiling
• JDBC for reference data*

Data Security
• Kerberos
• Sentry / Ranger

Single, Comprehensive and Integrated Platform for End-to-End Big Data Management

(25)

Deep Dive

(26)

DEMO – Use case

Scenario:

INFA Air receives information from multiple airports on the expected and actual schedules of various flights. It needs to consolidate this information in a Hadoop environment to run analytics such as flight-on-time analysis.

Challenges:

• Data is collected in various formats and at various intervals: some airports provide flat files, and others stage data in an Oracle table
• All this data is ingested into a Hive table for cleansing and analysis
• The data from the Hive table is subsequently sent to an alerting system that sends individual alerts to travelers
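A flight-on-time analysis of the kind described in the scenario could look like the following Python sketch. The record layout (flight, expected time, actual time), the timestamp format, the sample flights, and the 15-minute tolerance are all assumptions for illustration; the demo's actual Hive schema is not given here.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed timestamp layout


def delay_minutes(expected: str, actual: str) -> float:
    """Minutes between expected and actual time; negative means early."""
    delta = datetime.strptime(actual, TIME_FORMAT) - datetime.strptime(expected, TIME_FORMAT)
    return delta.total_seconds() / 60


def on_time_rate(flights, tolerance_minutes: float = 15) -> float:
    """Share of flights whose delay does not exceed the tolerance."""
    if not flights:
        return 0.0
    on_time = sum(
        1
        for _, expected, actual in flights
        if delay_minutes(expected, actual) <= tolerance_minutes
    )
    return on_time / len(flights)


# Hypothetical consolidated records: (flight, expected, actual).
flights = [
    ("IA100", "2016-05-23 08:00", "2016-05-23 08:05"),
    ("IA200", "2016-05-23 09:30", "2016-05-23 10:10"),
    ("IA300", "2016-05-23 11:00", "2016-05-23 10:55"),
    ("IA400", "2016-05-23 12:00", "2016-05-23 12:20"),
]
print(on_time_rate(flights))
```

The same delay calculation could drive the traveler alerts: any flight whose delay exceeds the tolerance would be forwarded to the alerting system.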

(27)

Lab environment

[Diagram: Private Network containing the Hadoop Cluster (Hadoop Node 1, Hadoop Node 2), the Informatica Server, and the Informatica Client]

(28)

Login credentials

System                      Host name             Username        Password
Hadoop Node 1               psvrl65iw2016hdp001   iw2016          iw2016
Hadoop Node 2               psvrl65iw2016hdp002   iw2016          iw2016
INFA Server                 psvrl65iw2016i1001    iw2016          iw2016
INFA Client                 psvw7iw2016i1001      Administrator   iw2016
Administrator, Monitoring                         Administrator   Administrator

Lab access:

https://informatica.instructorled.training

Access code: 34762748

xx

(29)

(30)

(31)

• Lab 1 – High-speed ingestion in pushdown mode
  • Read from flat file
  • Read from Oracle
  • Union the data
  • Write to Hive
• Lab 2 – Extraction with schema-on-read
  • Read from Hive
  • Write to flat file
  • Dynamically update the schema
  • Use Blaze
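The data flow of the two labs can be approximated outside Informatica with a short Python sketch. It is not a mapping and does not run on Hadoop: an in-memory SQLite table stands in for the Oracle staging table, the Hive step is skipped, and the column names and sample rows are hypothetical. It shows the union of two sources and an output header derived from the data at read time, which is one reading of "dynamically update the schema".

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source.
FLAT_FILE = "flight,airport,status\nIA100,SFO,ON_TIME\nIA200,DEN,DELAYED\n"


def read_flat_file(text):
    """Read delimited text into a list of dicts keyed by the header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_table(conn, table):
    """Read every row of a table, taking column names from the cursor."""
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def write_flat_file(rows):
    """Write rows as CSV; the header is discovered from the rows themselves."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# SQLite stands in for the Oracle staging table; note the extra "gate" column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staged_flights (flight TEXT, airport TEXT, status TEXT, gate TEXT)")
conn.execute("INSERT INTO staged_flights VALUES ('IA300', 'JFK', 'ON_TIME', 'B12')")

combined = read_flat_file(FLAT_FILE) + read_table(conn, "staged_flights")  # union
output = write_flat_file(combined)
print(output)
```

Because the header is computed from the rows, adding a column to either source changes the output layout without touching the write logic.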

(32)

Questions…?

(33)

Informatica User Groups are a great way for you to invest in your professional development and learn about new Informatica offerings.

• Local Chapter Leaders manage each IUG online and via in-person meetings
• Network and socialize
• Find and share content, best practices & tips
• Learn about the latest technologies and solutions from Informatica
• Discover how colleagues and peers use Informatica
• https://network.informatica.com/welcome/

• LEARN MORE AT IW16: Go to the Solutions Expo Informatica Pavilion / Ecosystem & Innovation Area:
  • Talk to regional user group leaders
  • Learn about meeting plans
  • Join your regional user group
• When:
  • Monday 6:00pm – 8:30pm
  • Tuesday 10:45am – 2:15pm
  • Wednesday 10:30am – 1:45pm
• Where:
  • Moscone West Hall Level One