What Next for DBAs in the Big Data Era

(1)


(2)

Satyendra Kumar Pasalapudi

Associate Practice Director – IMS @ Apps Associates

Co-Founder & President of AIOUG

(3)

Agenda

• Technology Trends

• Big Data Overview

• Hadoop Basics

• NoSQL Databases

• Big Data SQL

(4)

Cost-effectively manage and analyze all available data in its native form: unstructured, structured, streaming

Data sources: ERP, CRM, RFID, Website, Network Switches, Social Media, Billing

(5)

(6)

History of databases

• Pre-computer technologies: printing press, Dewey decimal system, punched cards
• Magnetic tape “flat” (sequential) files
• Magnetic disk
• Indexed-Sequential Access Mechanism (ISAM)
• Hierarchical model, 1960-70
• Network model
• IMS, IDMS, ADABAS
• Relational Model defined
• System R, Oracle V2, Ingres, dBase, DB2, Informix, Sybase, SQL Server, Access, Postgres, MySQL
• Cassandra, Hadoop, Vertica, Riak, HBase, Dynamo, MongoDB, Redis, VoltDB, Hana, Neo4J, Aerospike

(7)

Why?

• 3rd Platform drives new demands on the database:
– Global High Availability
– Data volumes
– Unstructured data
– Transaction rates
– Latency
• A single architecture cannot meet all those demands

(8)

• Operational RDBMS (Oracle, SQL Server, …)
• Web DBMS (MySQL, Mongo, Cassandra)
• Data Warehouse RDBMS (Oracle, Teradata, …)
• In-memory Analytics (HANA, Exalytics, …)
• In-memory processing (Spark)
• Hadoop
• ERP & in-house CRM
• Web Server
• Analytic/BI software (SAS, Tableau)

(9)

(10)

(11)

Biggest IT inflection point in our generation

• Cloud
• Mobile
• Social
• Big Data

(12)

(13)

The instrumented human

• Bluetooth Personal Area Network

• 3G/WiFi Wide Area Network

• GPS

• Storage

• Pulse, temp monitor

• Silent alarms

• Pedometer, sleep monitoring

• Compass

• Camera

• Mic/earphones

• Heads up display

• Emotion/Attention monitor

(14)

(15)

Google Software Architecture (circa 2005)

• Google Applications
• MapReduce, BigTable
• Google File System (GFS)

(16)

MapReduce

[Figure: a job starts, fans out into many parallel Map tasks, and the results are combined by Reduce]
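The map and reduce steps can be illustrated with a small single-process word count. This is only a sketch of the programming model, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample input are made up for illustration, and a real cluster would run the map calls in parallel on the nodes that hold each input split.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in split.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input splits; on a cluster each would live on a different node.
splits = [
    "big data big clusters",
    "data moves to compute",
    "compute moves to data",
]

mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each map call only sees its own split and each reduce call only sees one key, both phases can be spread across many machines without coordination between tasks.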

(17)

Hadoop Design Principles

• System shall manage and heal itself
– Automatically and transparently route around failure
– Speculatively execute redundant tasks if certain nodes are detected to be slow
• Performance shall scale linearly
– Proportional change in capacity with resource change
• Compute should move to data
– Lower latency, lower bandwidth

(18)

Hadoop History

• Dec 2004 – Google MapReduce paper published

• July 2005 – Nutch uses MapReduce

• Feb 2006 – Hadoop starts as a Lucene subproject

• Apr 2007 – Yahoo! on 1000-node cluster

• Jan 2008 – An Apache Top Level Project

• Jul 2008 – A 4000 node test cluster

(19)

Hadoop Ecosystem

• Client Access: Hue, Hive (SQL), Pig (PL/SQL)
• Data Access: Sqoop, Flume
• Data Mining: Mahout
• Orchestration
• MapReduce (Job Scheduling/Execution System), with Streaming/Pipes APIs
• HBase (key-value store)
• HDFS (Hadoop Distributed File System)
• ZooKeeper (Coordination)
• Chukwa (Monitoring)
• Java Virtual Machine, Networking
• OS – Red Hat, SUSE, Ubuntu, Windows
• Commodity Hardware

(20)

(21)

(22)

Hadoop at Yahoo

• 2010 (biggest cluster):
– 4000 nodes, 16 PB disk
– 64 TB of RAM
– 32,000 cores
• 2014:
– 16 clusters

(23)

(24)

(25)

Database Market Disruption

(26)

(27)

Name  Site             Counter
Dick  Ebay             507,018
Dick  Google           690,414
Jane  Google           716,426
Dick  Facebook         723,649
Jane  Facebook         643,261
Jane  ILoveLarry.com   856,767
Dick  MadBillFans.com  675,230

NameId  Name
1       Dick
2       Jane

SiteId  SiteName
1       Ebay
2       Google
3       Facebook
4       ILoveLarry.com
5       MadBillFans.com

NameId  SiteId  Counter
1       1       507,018
1       2       690,414
2       2       716,426
1       3       723,649
2       3       643,261
2       4       856,767
1       5       675,230

Id  Name  Ebay     Google   Facebook  (other columns)  MadBillFans.com
1   Dick  507,018  690,414  723,649   . . .            675,230

Id  Name  Google   Facebook  (other columns)  ILoveLarry.com
2   Jane  716,426  643,261   . . .            856,767
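The move from the (Name, Site, Counter) rows to one wide row per name can be sketched in a few lines of Python. This is an illustration of the data layout only, using the figures from the slide; the helper name `to_wide_rows` is hypothetical, and a nested dict stands in for what a table-based store such as BigTable or HBase would keep on disk.

```python
from collections import defaultdict

# The relational rows from the slide: (Name, Site, Counter).
rows = [
    ("Dick", "Ebay", 507018),
    ("Dick", "Google", 690414),
    ("Jane", "Google", 716426),
    ("Dick", "Facebook", 723649),
    ("Jane", "Facebook", 643261),
    ("Jane", "ILoveLarry.com", 856767),
    ("Dick", "MadBillFans.com", 675230),
]


def to_wide_rows(rows):
    """Pivot (name, site, counter) tuples into one sparse row per name.

    Each row holds only the columns (sites) that actually have a value,
    so different rows may have different sets of columns.
    """
    wide = defaultdict(dict)
    for name, site, counter in rows:
        wide[name][site] = counter
    return dict(wide)


wide = to_wide_rows(rows)
for name, columns in wide.items():
    print(name, columns)
```

Note that Jane's row has no Ebay column at all rather than a NULL, which is the sparse, per-row column set shown in the last two tables.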

(28)

Financial services: Discover fraud patterns based on multiple years' worth of credit card transactions, on a time scale that does not allow new patterns to accumulate significant losses. Measure transaction processing latency across many business processes by processing and correlating system log data.

Internet retailer: Discover fraud patterns in Internet retailing by mining Web click logs. Assess risk by product type and session/Internet Protocol (IP) address activity.

Retailers: Perform sentiment analysis by analyzing social media data.

Drug discovery: Perform large-scale text analytics on publicly available information sources.

Healthcare: Analyze medical insurance claims data for financial analysis, fraud detection, and preferred patient treatment plans. Analyze patient electronic health records for evaluation of patient care regimes and drug safety.

Mobile telecom: Discover mobile phone churn patterns based on analysis of call detail records (CDRs) and correlation with activity in subscribers’ networks of callers.

IT technical support: Perform large-scale text analytics on help desk support data and publicly available support forums to correlate system failures with known problems.

Scientific research: Analyze scientific data to extract features (e.g., identify celestial objects from telescope imagery).

Internet travel: Improve product ranking (e.g., of hotels) by analysis of multiple years' worth of Web click logs.

(29)

Document databases

• Structured documents – XML and JSON (JavaScript Object Notation) – become more prevalent within applications

• Web programmers start storing these in BLOBs in MySQL

• Emergence of XML and JSON databases
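The BLOB approach can be sketched as follows, to show why it pushed people toward document databases. This is an illustration under stated assumptions: Python's built-in `sqlite3` stands in for MySQL, and the table name `docs`, the sample orders, and the helper `customers_who_bought` are all hypothetical.

```python
import json
import sqlite3

# In-memory SQLite database standing in for a relational store such as MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body BLOB)")

orders = [
    {"customer": "Dick", "city": "Austin",
     "items": [{"sku": "A1", "qty": 2}]},
    {"customer": "Jane", "city": "Boston",
     "items": [{"sku": "B7", "qty": 1}, {"sku": "A1", "qty": 5}]},
]

# Each JSON document is serialized and stored as an opaque BLOB.
for order in orders:
    conn.execute(
        "INSERT INTO docs (body) VALUES (?)",
        (json.dumps(order).encode("utf-8"),),
    )


def customers_who_bought(sku):
    """Find customers by a field inside the document.

    The database cannot see inside the BLOB, so every row is fetched
    and parsed in the application before it can be filtered.
    """
    matches = []
    for (body,) in conn.execute("SELECT body FROM docs ORDER BY id"):
        doc = json.loads(body.decode("utf-8"))
        if any(item["sku"] == sku for item in doc["items"]):
            matches.append(doc["customer"])
    return matches


print(customers_who_bought("A1"))
```

Storing is trivial, but every query on a nested field becomes a full scan plus parsing on the client, which is the gap that native XML and JSON databases set out to close.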

(30)

• Graph Database: Neo4J, InfiniteGraph, FlockDB
• Document
– JSON based: MongoDB, CouchDB, RethinkDB
– XML based: MarkLogic, BerkeleyDB XML
• Key Value: MemcacheDB, Oracle NoSQL, Dynamo, Voldemort, DynamoDB, Riak
• Table Based: BigTable, Cassandra, HBase, Hypertable

(31)

(32)

(33)

(34)

(35)

(36)

(37)

(38)

(39)

(40)

(41)

Big Data Architecture

• DATA SOURCES
• DATA CONNECTORS
• DATA LAKE
– Option 1: Data lake on Oracle Big Data Appliance
– Option 2: Data lake on AWS Big Data infrastructure
– Option 3: Data lake on on-premise Hadoop infrastructure
• ANALYTICS

(42)

On-Premise Hadoop as RDBMS “active archive”

Oracle Database: SALES 2013, SALES 2012, SALES 2011, SALES 2010
• Structured Data Analytics from Apps

Hadoop for Structured Archive and Unstructured data: SALES 2011, SALES 2010
• “Hive” provides an SQL-like query layer over Hadoop and MapReduce
• Unstructured + Structured Data Analytics from Apps

(43)

AWS EMR as RDBMS “active archive”

Oracle Database: SALES 2013, SALES 2012, SALES 2011, SALES 2010
• Structured Data Analytics from Apps

AWS EMR for Structured Archive and Unstructured data: SALES 2011, SALES 2010
• “Hive” provides an SQL-like query layer over Amazon EMR
• Unstructured + Structured Data Analytics from Apps

(44)

Oracle Database Support for All Data

• Structured Data
– Numeric, String, Date, …
– Row and column formats
• Unstructured Data
– LOB
– Text
– XML
– JSON
– Spatial
– Graph


(45)

Oracle Support for Any Data Management System

Relational – Run the Business
• Scale-out and scale-up
• Collect any data
• SQL
• Transactional and analytic applications for the enterprise
• Secure and highly available

Hadoop – Change the Business
• Scale-out, low cost store
• Collect any data
• Map-reduce, SQL
• Analytic applications

NoSQL – Scale the Business
• Scale-out, low cost store
• Collect key-value data
• Find data by key

(46)

Big Data SQL

SELECT w.sess_id, c.name
FROM web_logs w, customers c
WHERE w.source_country = 'Brazil'
AND w.cust_id = c.customer_id;

• Oracle Database: CUSTOMERS
• Hadoop Cluster (Big Data SQL): WEB_LOGS, tens of gigabytes of data
• Relevant SQL runs on BDA nodes
• Only columns and rows needed to answer query are returned

SQL Push Down in Big Data SQL

• Hadoop Scans on Unstructured Data

• WHERE Clause Evaluation

• Column Projection

• Bloom Filters for Better Join Performance
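The three push-down steps listed above (WHERE clause evaluation, column projection, Bloom filter on the join key) can be sketched in plain Python. This is a conceptual illustration only and says nothing about how Oracle Big Data SQL is implemented: the `BloomFilter` class, the `scan_web_logs` function, and the sample rows are all hypothetical, modeled on the `web_logs`/`customers` query from the slide.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def scan_web_logs(rows, country, wanted_columns, join_filter):
    """Storage-side scan: only matching rows and needed columns leave the node."""
    for row in rows:
        if row["source_country"] != country:               # WHERE clause evaluation
            continue
        if not join_filter.might_contain(row["cust_id"]):  # Bloom filter on join key
            continue
        yield {col: row[col] for col in wanted_columns}    # column projection


# Hypothetical data: customers live in the database, web_logs on the cluster.
customers = {101: "Ana", 102: "Bruno"}
web_logs = [
    {"sess_id": "s1", "cust_id": 101, "source_country": "Brazil", "url": "/a"},
    {"sess_id": "s2", "cust_id": 102, "source_country": "India", "url": "/b"},
    {"sess_id": "s3", "cust_id": 999, "source_country": "Brazil", "url": "/c"},
    {"sess_id": "s4", "cust_id": 102, "source_country": "Brazil", "url": "/d"},
]

# The database side builds the filter from its join keys and sends it to the scan.
bloom = BloomFilter()
for customer_id in customers:
    bloom.add(customer_id)

shipped = list(scan_web_logs(web_logs, "Brazil", ["sess_id", "cust_id"], bloom))

# The exact join still runs on the database side, so a Bloom false positive
# only costs a little extra transfer and never produces a wrong row.
result = [(r["sess_id"], customers[r["cust_id"]])
          for r in shipped if r["cust_id"] in customers]
print(result)
```

The design point is that filtering and projection happen where the data sits, so the amount of data crossing the network is proportional to the answer rather than to the size of the log table.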

(47)

Data Analytics Challenge

Separate silos of information to analyze


(48)

Data Analytics Challenge

Separate data access interfaces


(49)

SQL on Hadoop is Obvious


(50)

Data Analytics Challenge

No comprehensive SQL interface across Oracle, Hadoop and NoSQL


(51)

Oracle Big Data Management System

Rich, comprehensive SQL access to all enterprise data


(52)

What Does Unified Query Mean for You?

Data Science

Before

After

PhD

???

(53)

What Does Unified Query Mean for You?

Application Development

Before

After

(54)

Big Data SQL: A New Hadoop Processing Engine

Processing Layer
• MapReduce and Hive
• Spark
• Impala
• Search
• Big Data SQL

Resource Management (YARN, cgroups)

Storage Layer
• Filesystem (HDFS)
• NoSQL Databases (Oracle NoSQL DB, HBase)

(55)

What Next for DBAs in the Big Data Era?

• NoSQL
• Hadoop
• Big Data SQL
• 12c New Features on Big Data
• Engineered Systems Knowledge

(56)

Connect with Us

Web: www.appsassociates.com

YouTube: www.youtube.com/user/AppsAssociates

LinkedIn: www.us.linkedin.com/company/apps-associates

Twitter: @AppsAssociates

(57)