Constructing a Data Lake: Hadoop and Oracle Database United!

(1)

Sharon Sophia Stephen

Big Data PreSales Consultant

February 21, 2015

Constructing a Data Lake: Hadoop and Oracle Database United!

(2)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Safe Harbor

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

(3)

Program Agenda

1. Hadoop Overview
2. Data Lake
3. How the technologies can work together
4. Tools for Integration
5. Q&A

(Slide graphic: Exadata + Oracle Database)

(4)


What Is Hadoop?

“The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

(5)

What is HDFS?

HDFS is the primary storage system underlying Hadoop

Fault tolerant, scalable, highly available

Designed to be well-suited to distributed processing

Splits large files into blocks

Multiple copies stored on different disks on separate nodes

Is superficially structured like a UNIX file system
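The block-splitting and replication behavior described above can be sketched in a few lines of Python. This is a toy model of the idea, not HDFS itself; the block size, node names, and placement policy are invented for illustration:

```python
# Toy model of HDFS-style block placement: split a file into fixed-size
# blocks and assign each block to several distinct nodes (replicas).
# Block size and node names are illustrative only.

def place_blocks(file_size, block_size, nodes, replication=3):
    """Return a list of (block_index, [nodes holding a replica])."""
    placement = []
    num_blocks = -(-file_size // block_size)  # ceiling division
    for i in range(num_blocks):
        # Rotate through the nodes so replicas land on distinct machines.
        replicas = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        placement.append((i, replicas))
    return placement

# A 600 MB file with 128 MB blocks yields 5 blocks, each on 3 nodes.
layout = place_blocks(file_size=600, block_size=128,
                      nodes=["node1", "node2", "node3", "node4"])
for block, replicas in layout:
    print(block, replicas)
```

Because every block lives on multiple separate nodes, losing any single disk or node still leaves at least two readable copies, which is the fault tolerance the slide refers to.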

(6)


A MapReduce (True distributed computing) Analogy

(7)

A MapReduce (True distributed computing) Example

(8)


MapReduce Phases

Map
Each Map task usually works on a single input split.
Hadoop tries to run map tasks on the slave node that contains the stored HDFS data block (data locality).
The input is presented to the Map phase as a key-value pair.

Shuffle and Sort
Groups all the values together for each key (using the intermediate output data from all of the completed mappers).
Sorts the keys.

Reduce
The intermediate output of the Shuffle and Sort phase is the input to the Reduce phase.
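The three phases above can be illustrated with a toy word count in Python, a single-process simulation of what Hadoop distributes across a cluster (the input splits here are made up):

```python
from collections import defaultdict

# Map: each input split yields (key, value) pairs -- here (word, 1).
def map_phase(split):
    return [(word, 1) for word in split.split()]

# Shuffle and Sort: group all values by key, then sort the keys.
def shuffle_and_sort(mapped):
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return sorted(groups.items())

# Reduce: fold each key's value list into a final result.
def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped}

splits = ["big data big lake", "data lake"]  # two input splits
mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle_and_sort(mapped))
print(counts)  # {'big': 2, 'data': 2, 'lake': 2}
```

In real Hadoop, each call to `map_phase` would run on the node holding that split's HDFS block, and the shuffle would move intermediate pairs across the network to the reducers.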

(9)

What else is Hadoop?

Apache Hadoop

Apache Hive

Apache Pig

Apache HBase

Apache Zookeeper

Apache Flume

Apache Sqoop

Apache Mahout

Apache Whirr

Apache Oozie

Fuse-DFS

Hue

Plus Additional projects

Impala

BDR

(10)


Two Great Tastes That Go Great Together

(Slide graphic: for each of Database and Database + Hadoop, a big oval of "things you can do" and a small oval of "things you can do cost-effectively")

A. Hadoop can do some things a database reasonably does not.
B. Hadoop expands the amount of things you can do cost-effectively.

(11)
(12)


Data lake

(13)

Data Lake

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.

(14)


Focus in this Session

Relational                      | Hadoop
Data organized for fast query   | Data in files
Structured schema               | Schema on read
Complex programming models      | Simple programming model for large-scale data processing
Read, write, delete, update     | Append only
Access specific record          |
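The schema-on-read idea in the Hadoop column can be sketched in Python: the raw file carries no structure on write, and a schema is imposed only at read time. The field names and types below are invented for illustration:

```python
import io

# Raw data lands in the lake as plain text; nothing is enforced on write.
raw = io.StringIO("1001,ACME,2014-02-21\n1002,Initech,2014-03-05\n")

# Schema on read: structure is applied only when the data is queried.
schema = [("customer_id", int), ("name", str), ("signup_date", str)]

def read_with_schema(fileobj, schema):
    """Parse each raw line into a typed record using the given schema."""
    for line in fileobj:
        fields = line.strip().split(",")
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

rows = list(read_with_schema(raw, schema))
print(rows[0])  # {'customer_id': 1001, 'name': 'ACME', 'signup_date': '2014-02-21'}
```

A relational database would instead validate types and structure at insert time (schema on write), which is what makes it fast to query but stricter about what it will accept.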

(15)

Oracle Big Data Connectors

Data Load: Oracle Loader for Hadoop
Data Access: Oracle SQL Connector for HDFS; Oracle Data Integrator
R Analytics: Oracle R Advanced Analytics for Hadoop
XML/XQuery: Oracle XQuery for Hadoop

Optimized for Hadoop: maximize parallelism
Fast performance
Analyze data on Hadoop using familiar client tools

Oracle Big Data SQL (for engineered systems)

(16)


Integrating data – Hadoop and Oracle databases

Transferring from Hadoop to Oracle RDBMS:
OSCH – CSV text
OSCH – Hive, non-partitioned
OLH – CSV text
OLH – Hive, non-partitioned
OLH – Data Pump (offline)
OLH – Hive, partitioned
OLH – Hive, partitioned, Parquet

Transferring from Oracle RDBMS to Hadoop:
CopyToBDA (only on engineered-system platforms)
Oracle Data Integrator
Oracle GoldenGate
Sqoop – CSV text
Sqoop – Hive, partitioned
Sqoop – Hive, partitioned, Parquet
Sqoop 2 – Hive, partitioned, Parquet
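As a rough analogue of what these loaders do in the Hadoop-to-database direction, the sketch below parses CSV text and bulk-inserts it into a relational table. SQLite stands in for the Oracle target, and the table, columns, and data are invented for illustration; the real tools add parallelism, format conversion, and partition awareness on top of this basic pattern:

```python
import csv
import io
import sqlite3

# CSV text as it might sit in HDFS (contents are made up).
csv_text = "ca_customer_id,ca_state,ca_zip\n1,CA,94065\n2,TX,73301\n"

conn = sqlite3.connect(":memory:")  # SQLite stands in for the Oracle target
conn.execute(
    "CREATE TABLE customer_address (ca_customer_id INT, ca_state TEXT, ca_zip TEXT)"
)

# Parse the text, convert fields to target types, and bulk-insert.
reader = csv.DictReader(io.StringIO(csv_text))
rows = [(int(r["ca_customer_id"]), r["ca_state"], r["ca_zip"]) for r in reader]
conn.executemany("INSERT INTO customer_address VALUES (?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM customer_address").fetchone()[0]
print(count)  # 2
```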

(17)

Oracle Loader for Hadoop

Parallel load, optimized for Hadoop
Automatic load balancing
Convert to Oracle format on Hadoop (saves database CPU)
Load specific Hive partitions
Kerberos authentication
Load directly into In-Memory

Input formats: JSON, log files, Hive, text, Avro, Parquet, sequence files, compressed files.

(18)


Oracle Loader for Hadoop – Performance

Extremely fast performance
Sample numbers (on Oracle engineered systems): 4.4 TB/hour end-to-end (load + Hadoop processing), much higher than typical customer requirements
Optimized for Oracle Big Data Appliance and Oracle Exadata: InfiniBand connectivity

(19)

Oracle SQL Connector for HDFS

Parallel query and load
Load into database or query in place
Access text, or Hive over text
Access compressed data
Access specific Hive partitions
Kerberos authentication

Example external table DDL (excerpt):

create table customer_address
( ca_customer_id    number(10,0)
, ca_street_number  char(10)
, ca_state          char(2)
, ca_zip            char(10))
organization external (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY DEFAULT_DIR

(20)


Oracle SQL Connector for HDFS

Includes a tool to generate the external table
Performance on engineered systems: 15 TB/hour load rate
Query and load Oracle Data Pump files (binary files in Oracle format)

(21)

Oracle SQL Connector for HDFS works with multiple versions

Database versions (on any operating system*):
10.2.0.5 and greater
11.2.0.3 and greater
12c

Hadoop versions         | Certified by
Apache Hadoop 2.x       | Oracle
CDH 4.x (Cloudera)      | Oracle
CDH 5.x (Cloudera)      | Oracle
HDP 1.3 (Hortonworks)   | Hortonworks
HDP 2.1 (Hortonworks)   | Hortonworks

* Oracle SQL Connector for HDFS requires the Hadoop client to be supported on the operating system.

(22)


Test Case

(23)

Data Generation

Data was generated using a tool built by the product development team and used for benchmarking. It generates random data within HDFS. The data files were a combination of various data types such as int, float, varchar, date and timestamp.
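A generator producing random rows of mixed types like those described might look like the sketch below. This is not the product team's tool; the column layout, value ranges, and seed are invented for illustration:

```python
import random
import string
from datetime import datetime, timedelta

random.seed(42)  # deterministic output for the example

def random_row():
    """One record mixing int, float, varchar, date and timestamp fields."""
    base = datetime(2014, 1, 1)
    return {
        "id": random.randint(1, 10**6),                                # int
        "amount": round(random.uniform(0, 1000), 2),                   # float
        "code": "".join(random.choices(string.ascii_uppercase, k=8)),  # varchar
        "day": (base + timedelta(days=random.randint(0, 364))).date().isoformat(),
        "ts": (base + timedelta(seconds=random.randint(0, 10**7))).isoformat(),
    }

rows = [random_row() for _ in range(1000)]
print(rows[0])
```

A real benchmark generator would write rows like these in parallel directly into HDFS files rather than keeping them in memory.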

(24)


Integrating data – Hadoop to Oracle databases

Things to consider when moving data from Hadoop to Oracle RDBMS:

OSCH has consistently proven to be the best tool if you have to move massive amounts of text data; OSCH moved 14 TB in an hour on engineered systems.

OLH is the best tool if you are dealing with a large number of partitioned tables on both source and target.

Generally, align the number of files at the source with the number of partitions at the target.

(25)

Oracle Loader for Hadoop vs. Oracle SQL Connector for HDFS

(Slide: side-by-side comparison of the two connectors)

(26)


Oracle Data Integration for Big Data and Hadoop

Comprehensive data integration platform designed to work with all data:
Oracle Data Integrator (data transformation)
Enterprise Data Quality (profile, cleanse, match and de-duplicate)
Oracle GoldenGate (data replication)
Enterprise Metadata Management (lineage, impact analysis and data provenance)
Data Service Integrator (data federation)

Capabilities with Hadoop:
Data Replication – continuous data staging into Hadoop
Data Transformation – pushdown processing in Hadoop
Data Federation – query Hadoop SQL via JDBC
Data Quality – fix quality at the source or invoke machine learning in Hadoop
Metadata Management – lineage and impact analysis with Hadoop

(27)

Oracle Big Data SQL

(28)


Summary

Fast, easy, integration of all data in your Big Data solution

Oracle Big Data Connectors

Oracle Big Data SQL (on Oracle Engineered Systems)

Oracle Data Integrator

Oracle GoldenGate

(29)

Additional information

http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html

http://www.oracle.com/technetwork/database/bigdata-appliance/overview/index.html

(30)
(31)
(32)
