Sharon Sophia Stephen
Big Data PreSales Consultant
February 21, 2015
Constructing a Data Lake: Hadoop
and Oracle Database United!
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Safe Harbor
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Program Agenda
1. Hadoop Overview
2. Data Lake
3. How the technologies can work together
4. Tools for Integration
5. Q&A
What Is Hadoop?
"The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures."
What Is HDFS?
• HDFS is the primary storage system underlying Hadoop
• Fault tolerant, scalable, highly available
• Designed to be well-suited to distributed processing
  – Splits large files into blocks
  – Multiple copies stored on different disks on separate nodes
• Is superficially structured like a UNIX file system
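The block-splitting and replica-placement idea can be sketched in a few lines. This is an illustration only, not HDFS code: the tiny block size, node names, and round-robin placement are made up for the example (real HDFS defaults to 128 MB blocks and a replication factor of 3, with rack-aware placement).

```python
# Illustrative sketch: how HDFS conceptually splits a file into fixed-size
# blocks and stores replicas of each block on different nodes.
BLOCK_SIZE = 10          # bytes; tiny on purpose (HDFS default is 128 MB)
REPLICATION = 3
NODES = ["node1", "node2", "node3", "node4"]   # hypothetical cluster

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks (last block may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes=NODES, replication: int = REPLICATION):
    """Assign each block's replicas to distinct nodes, round-robin."""
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

data = b"a" * 25                      # a 25-byte "file"
blocks = split_into_blocks(data)      # three blocks: 10 + 10 + 5 bytes
placement = place_replicas(len(blocks))
```

Losing one node leaves at least two copies of every block, which is the fault-tolerance property the bullets above describe.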
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
A MapReduce (True distributed computing) Analogy
A MapReduce (True distributed computing) Example
MapReduce Phases
– Map
  • Each Map task usually works on a single input split
  • Hadoop tries to run map tasks on the slave node that contains the stored HDFS data block (data locality)
  • The input is presented to the Map phase as a key-value pair
– Shuffle and Sort
  • Groups all the values together for each key (using the intermediate output data from all of the completed mappers)
  • Sorts the keys
– Reduce
  • The intermediate output of the Shuffle and Sort phase is the input to the Reduce phase
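The three phases above can be sketched as a single-process word count. This is a conceptual illustration, not Hadoop code: real MapReduce runs the mappers and reducers in parallel on separate nodes, but the data flow per phase is the same.

```python
# Illustrative word count walked through the three MapReduce phases.
from collections import defaultdict

def map_phase(splits):
    """Map: each input key-value pair (byte offset, line) emits (word, 1)."""
    for offset, line in splits:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    """Shuffle and Sort: group all values together for each key, sort the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())        # [(key, [values...]), ...]

def reduce_phase(grouped):
    """Reduce: aggregate each key's grouped values into the final output."""
    return {key: sum(values) for key, values in grouped}

splits = [(0, "big data big lake"), (18, "data lake")]   # two input splits
counts = reduce_phase(shuffle_and_sort(map_phase(splits)))
# counts == {"big": 2, "data": 2, "lake": 2}
```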
What Else Is Hadoop?
• Apache Hadoop
• Apache Hive
• Apache Pig
• Apache HBase
• Apache ZooKeeper
• Apache Flume
• Apache Sqoop
• Apache Mahout
• Apache Whirr
• Apache Oozie
• Fuse-DFS
• Hue
• Plus additional projects:
  – Impala
  – BDR
Two Great Tastes That Go Great Together
(Venn diagram: the big oval is the things you can do; the small oval is the things you can do cost-effectively)
A. Hadoop can do some things a Database reasonably does not.
B. Hadoop expands the number of things you can do cost-effectively.
Data Lake
A data lake is a storage repository that holds a vast
amount of raw data in its native format until it is
needed. While a hierarchical data warehouse
stores data in files or folders, a data lake uses a flat
architecture to store data.
Focus in this Session

Relational:
• Data organized for fast query
• Structured schema
• Complex programming models
• Read, write, delete, update
• Access specific record

Hadoop:
• Data in files
• Schema on read
• Simple programming model for large-scale data processing
• Append only
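The "schema on read" contrast is worth a small illustration. In a relational database the schema is enforced when data is written; in Hadoop the raw file carries no schema, and structure is imposed only when the data is read. The field names and sample rows below are made up for the example.

```python
# Illustrative "schema on read": the raw file is just text; the schema
# (id: int, name: str, date: str) is applied at query time, not load time.
import csv
import io

RAW = "101,Alice,2015-02-21\n102,Bob,2015-02-20\n"   # raw file contents, no schema attached

def read_with_schema(raw_text):
    """Impose a column structure on raw CSV lines at read time."""
    reader = csv.reader(io.StringIO(raw_text))
    return [{"id": int(r[0]), "name": r[1], "date": r[2]} for r in reader]

rows = read_with_schema(RAW)
```

A different consumer could read the same raw bytes with a different schema, which is exactly what a data lake's "raw data in its native format" storage model enables.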
Oracle Big Data Connectors
• Data Load: Oracle Loader for Hadoop
• Data Access: Oracle SQL Connector for HDFS
• Oracle Data Integrator
• R Analytics: Oracle R Advanced Analytics for Hadoop
• XML/XQuery: Oracle XQuery for Hadoop

• Optimized for Hadoop: maximize parallelism
• Fast performance
• Analyze data on Hadoop using familiar client tools (XQuery, R client)
• Oracle Big Data SQL (for Engineered Systems)
Integrating Data – Hadoop and Oracle Databases

Transferring from Hadoop to Oracle RDBMS:
• OSCH – CSV text
• OSCH – Hive, non-partitioned
• OLH – CSV text
• OLH – Hive, non-partitioned
• OLH – Data Pump (offline)
• OLH – Hive, partitioned
• OLH – Hive, partitioned Parquet
• Sqoop – CSV text
• Sqoop – Hive, partitioned
• Sqoop – Hive, partitioned Parquet
• Sqoop 2 – Hive, partitioned Parquet

Transferring from Oracle RDBMS to Hadoop:
• CopyToBDA (Engineered Systems platform only)
• Oracle Data Integrator
• Oracle GoldenGate
• Sqoop
Oracle Loader for Hadoop
• Parallel load, optimized for Hadoop
• Automatic load balancing
• Convert to Oracle format on Hadoop
  – Save database CPU
• Load specific Hive partitions
• Kerberos authentication
• Load directly into In-Memory

Input formats: JSON, log files, Hive, text, Avro, Parquet, sequence files, compressed data
Oracle Loader for Hadoop: Performance
• Extremely fast performance
• Sample numbers (on Oracle Engineered Systems):
  – 4.4 TB/hour end-to-end (load + Hadoop process)
• Much higher than typical customer requirements
• Optimized for Oracle Big Data Appliance and Oracle Exadata: InfiniBand connectivity
Oracle SQL Connector for HDFS (OSCH)
• Parallel query and load
• Load into database or query in place
• Access text, or Hive over text
• Access compressed data
• Access specific Hive partitions
• Kerberos authentication

OSCH exposes data in HDFS (text files, Hive tables over text, compressed files) to the database through an external table, for example (the ACCESS PARAMETERS and LOCATION clauses below are illustrative; OSCH's ExternalTable tool generates the real ones):

create table customer_address
( ca_customer_id    number(10,0)
, ca_street_number  char(10)
, ca_state          char(2)
, ca_zip            char(10))
organization external (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'   -- streams HDFS content to the database
    FIELDS TERMINATED BY ','
  )
  LOCATION ('osch-location-file')                -- generated location file; name illustrative
)
PARALLEL
REJECT LIMIT UNLIMITED;
Oracle SQL Connector for HDFS
• Includes a tool to generate the external table
• Performance on Engineered Systems:
  – 15 TB/hour load time
• Query and load Oracle Data Pump files
  – Binary files in Oracle format
Oracle SQL Connector for HDFS works with multiple versions

Database versions (on any operating system*):
• 10.2.0.5 and greater
• 11.2.0.3 and greater
• 12c

Hadoop versions (and who certified them):
• Apache Hadoop 2.x – certified by Oracle
• CDH 4.x (Cloudera) – certified by Oracle
• CDH 5.x (Cloudera) – certified by Oracle
• HDP 1.3 (Hortonworks) – certified by Hortonworks
• HDP 2.1 (Hortonworks) – certified by Hortonworks

* Oracle SQL Connector for HDFS requires the Hadoop client to be supported on the operating system
Test Case: Data Generation
• Data was generated using a tool:
  – Built by the product development team
  – Used for benchmarking
  – Generates data within HDFS
  – Random data
• The data files were a combination of various data types such as int, float, varchar, date and timestamp.
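A generator of this kind can be sketched as follows. This is a minimal stand-in, not the actual internal tool: the column layout (one each of int, float, varchar, date, timestamp), the output file name, and the value ranges are all assumptions for illustration.

```python
# Minimal sketch of a random test-data generator producing CSV rows with
# int, float, varchar, date, and timestamp columns (layout is illustrative).
import csv
import random
import string
from datetime import datetime, timedelta

def random_row(base=datetime(2015, 1, 1)):
    """Produce one row: [int, float, varchar, date, timestamp]."""
    return [
        random.randint(0, 10**6),                                          # int
        round(random.uniform(0, 1000), 2),                                 # float
        "".join(random.choices(string.ascii_lowercase, k=8)),              # varchar
        (base + timedelta(days=random.randint(0, 365))).date().isoformat(),          # date
        (base + timedelta(seconds=random.randint(0, 10**6))).isoformat(sep=" "),     # timestamp
    ]

def generate(path, rows):
    """Write the requested number of random rows to a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(rows):
            writer.writerow(random_row())

generate("testdata.csv", 100)
```

In the real setup the tool writes into HDFS; here the CSV lands on the local file system to keep the sketch self-contained.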
Integrating Data – Hadoop to Oracle Databases
Things to consider while moving data from Hadoop to Oracle RDBMS:
• OSCH has consistently proven to be the best tool if you have to move massive amounts of text data. OSCH moved 14 TB in an hour on Engineered Systems.
• OLH is the best tool if you are dealing with a large number of partitioned tables on both source and target.
• Generally, align the number of files at the source with the number of partitions at the target.
Oracle Loader for Hadoop vs. Oracle SQL Connector for HDFS
(comparison table: Oracle Loader for Hadoop | Oracle SQL Connector for HDFS)
Oracle Data Integration for Big Data and Hadoop
Comprehensive data integration platform designed to work with all data:
• Data Replication – Oracle GoldenGate: continuous data staging into Hadoop
• Data Transformation – Oracle Data Integrator: pushdown processing in Hadoop
• Data Federation – Oracle Data Service Integrator: query Hadoop SQL via JDBC
• Data Quality – Enterprise Data Quality (profile, cleanse, match and de-duplicate): fix quality at the source or invoke Machine Learning in Hadoop
• Metadata Management – Enterprise Metadata Management: lineage, impact analysis and data provenance with Hadoop