Big Data Landscape for Databases

(1)

Big Data Landscape

for Databases

Bob Baran

Senior Sales Enginee

[email protected]

!

(2)

Typical Database Workloads

OLTP Applications Real-Time Web,

Mobile, and IoT Applications

Real-Time, Operational

Reporting

Ad-Hoc Analytics Enterprise Data Warehouses Typical Databases • MySQL • Oracle • MongoDB • Cassandra • MySQL • Oracle • MySQL • Oracle • Greenplum • Paraccel • Netezza • Teradata • Oracle • Sybase IQ Use Cases • ERP, CRM, Supply

Chain

• Web, mobile, social

• IoT • Operational Datastores • Crystal Reports • Exploratory Analytics • Data Mining • Enterprise Reporting Workload Strengths • Real-time updates • ACID transactions • High concurrency of small reads/ writes • Range queries • Real-time updates

• High ingest rates

• High concurrency of small reads/ writes

• Range queries • Real-time updates • Canned, parameterized reports • Range queries • Complex queries requiring full table scans • Append only • Parameterized reports against historical data

(3)

Recent History of RDBMSs

▪ RDBMS Definition

▪ Relational with joins

▪ ACID transactions

▪ Secondary indexes

▪ Typically row-oriented

▪ Operational and/or analytical workloads

▪ By early 2000s

▪ Limited innovation

(4)

Hadoop Shakes Up Batch Analytics

▪ Data processing framework

▪ Cheap distributed file system

▪ Brute force, batch processing through MapReduce

▪ Great for batch analytics

(5)

NoSQL Shakes Ups Operational DBs

▪ NoSQL wave

▪ Companies like Google, Amazon and

LinkedIn needed greater scalability &

schema flexibility

▪ New databases developed by developers,

not database people

▪ Provided scale-out, but lost SQL

▪ Worked well at web startups because:

▪ In some cases, use cases did not need ACID

(6)

Convoluted Evolution of Databases

Sc ala bili ty Hierarchical/ Network Databases 1970s Indexed Files (ISAM) 1960s Traditional RDBMSs 1980s-2000s Hadoop 2005 NoSQL Databases 2010 _Scale-out SQL Databases 2013 Scale Out Scale Up

(7)

Mainstream user changes

▪ Driven by web, social, mobile, and Internet of Things

▪ Major increases in scale – 30% annual data growth

▪ Significant requirements for semi-structured data

▪ Though relatively little unstructured

▪ Technology adoption continuum

What is it? Should I

use it? Why wouldn’t I use it?

Cloud NoSQL for

web apps Scale-out SQL DBs

(8)

Schema on Ingest vs. Schema on Read

▪ Even “schemaless” MongoDB requires “schema”

- 10 Things You Should Know About Running MongoDB At Scale

• By Asya Kamsky, Principal Solutions Architect at MongoDB

Schema 

on Ingest Schema on Read

• Schema on Read if

you only use data a few times a year

• Structured data

should always remain structured

• Add schema if data

used regularly

(9)

Scale-out is the future of databases

Scale Up Scale Out

NoSQL NewSQL

SQL-on-Hadoop Hadoop

RDBMS Analytic Engines

How do I scale?

(10)

NoSQL

Pros

▪ Easy scale-out

▪ Flexible schema

▪ Easier web development with

hierarchical data structures (MongoDB)

▪ Cross-data center replication

(Cassandra)

Cons

▪ No SQL – requires retraining

and app rewrites

▪ No joins – i.e., no cross row/

document dependencies

▪ No reliable updates through

transactions across rows/tables

▪ Eventual consistency

(Cassandra)

▪ Not designed to do

aggregations required for analytics

(11)

NewSQL

Pros

▪ Easy scale-out

▪ ANSI SQL – eliminates

retraining and app rewrites

▪ Reliable updates through ACID

transactions

▪ RDBMS functionality

▪ Strong cross-data center

replication (NuoDB)

Cons

▪ Proprietary scale-out,

unproven into petabytes

▪ Must manage another

distributed infrastructure beyond Hadoop

▪ Can not leverage Hadoop

(12)

NewSQL – In-Memory

Pros

▪ Easy scale-out

▪ High performance because

everything in memory

▪ ACID transactions within nodes

Cons

▪ Memory 10-20x more expensive

▪ Limited SQL

▪ Limited cross-node transactions

▪ Proprietary scale-out, unproven

into petabytes

▪ Must manage another distributed

infrastructure beyond Hadoop

▪ Can not leverage Hadoop

(13)

Operational RDBMS on Hadoop

Pros

▪ Easy scale-out

▪ Scale-out infrastructure proven

into petabytes

▪ ANSI SQL – eliminates

retraining and app rewrites

▪ Reliable updates through ACID

transactions

▪ Leverages Hadoop distributed

infrastructure and tool ecosystem

Cons

▪ Full table scans slower than MPP

DBs, but faster than traditional RDBMSs

▪ Existing HDFS data must be

(14)

MPP Analytical Databases

Pros

▪ Easy scale-out

▪ Very fast performance for full

table scans

▪ Highly parallelized, shared

nothing architectures

▪ May have columnar storage

(Vertica)

▪ No maintenance of indexes

(Netezza)

Cons

▪ Poor concurrency models prevent

support of real-time apps

▪ Poor performance for range

queries

▪ Need to redistribute all data to

add nodes (hash partitioning)

▪ May require specialized hardware

(Netezza)

▪ Proprietary scale out - can not

leverage Hadoop ecosystem of tools

(15)

SQL-on-Hadoop – Analytical Engines

Pros

▪ Easy scale-out

▪ Scale-out proven into

petabytes

▪ Leverages Hadoop distributed

infrastructure

▪ Can leverage Hadoop

ecosystem of tools

Cons

▪ Relatively immature, especially

compared to MPP DBs

▪ Limited SQL

▪ Poor concurrency models prevent

support of real-time apps

▪ No reliable updates through

transactions

▪ Intermediate results must fit in

(16)

Future: Hybrid In-Memory Architectures

Memory Cache

with Disk

_In-Memory

Pure 

Hybrid 

In-Memory

- Very expensive - Unsophisticated memory management - Flexible, cost-effective - Controlled by optimizer - In-memory materialized views?

(17)

Summary – Future of Databases

▪ Predicted Trends

▪ Scale-out dominates databases

▪ Developers stop worrying about data size and

develop new data-driven apps

▪ Hybrid in-memory architecture becomes

mainstream

▪ Predicted Winners

▪ Hadoop becomes de facto distributed file system

▪ NoSQL used for simple web apps

(18)

Questions?

Bob Baran

Senior Sales Engineer

[email protected]

!

(19)

Powering Real-Time

Apps on Hadoop

Bob Baran

[email protected]

!

(20)

Who Are We?

THE ONLY

HADOOP RDBMS

Power operational applications 

on Hadoop

Affordable, Scale-Out – Commodity hardware Elastic – Easy to expand or scale back

Transactional – Real-time updates & ACID

Transactions

ANSI SQL – Leverage existing SQL code, tools, &

skills

10x

 

Better Price/Perf

(21)

What People are Saying…

Recognized as a key innovator in databases

Scaling out on Splice Machine presented   some major benefits  

over Oracle 

...automatic balancing between clusters...avoiding the costly

licensing issues. Quotes Awards   An alternative to today’s RDBMSes,  Splice Machine effectively   combines traditional relational

database technology with   the scale-out capabilities  

of Hadoop.

The uniqueclaim of … Splice Machine is that it can run

transactional applications 

as well as support analytics on  

(22)

Advisory Board

Advisory Board includes luminaries in databases and technology

Roger Bamford 

Former Principal Architect at Oracle

Father of Oracle RAC

Mike Franklin 

Computer Science Chair, UC Berkeley

Director, UC Berkeley AmpLab Founder of Apache Spark

Marie-Anne Neimat

Co-Founder, Times-Ten Database Former VP, Database Eng. at Oracle

Ken Rudin

Head of Analytics at Facebook Former GM of Oracle Data Warehousing

(23)

Combines the Best of Both Worlds

▪ _{Scale-out on commodity servers} ▪ _{Proven to 100s of petabytes} ▪ _{Efficiently handle sparse data} ▪ _{Extensive ecosystem}

RDBMS

▪ _{ANSI SQL}

▪ _{Real-time, concurrent updates} ▪ _{ACID transactions}

(24)

Focused on OLTP and Real-Time Workloads

OLTP Applications Real-Time Web,

Mobile, and IoT Applications

Real-Time, Operational

Reporting

Ad-Hoc Analytics Enterprise Data Warehouses Typical Databases • MySQL • Oracle • MySQL • Oracle • MongoDB • Cassandra • MySQL • Oracle • Greenplum • Paraccel • Netezza • Teradata • Oracle • Sybase IQ Use Cases • ERP, CRM, Supply

Chain

• Web, mobile, social

• IoT • Operational Datastores • Crystal Reports • Exploratory Analytics • Data Mining • Enterprise Reporting Workload Strengths • Real-time updates • ACID transactions • High concurrency of small reads/ writes • Range queries • Real-time updates

• High ingest rates

• High concurrency of small reads/ writes • Range queries • Real-time updates • Canned, parameterized reports • Range queries • Complex queries requiring full table scans • Append only • Parameterized reports against historical data

(25)

OLTP Campaign Management: Harte-Hanks

Overview _{Digital marketing services provider}

Unified Customer Profile

Real-time campaign management OLTP environment with BI reports

Challenges

Oracle RAC too expensive to scale Queries too slow – even up to ½ hour

Getting worse – expect 30-50% data growth Looked for 9 months for a cost-effective solution

Solution Diagram Initial Results

¼

cost

with commodity scale out

10-20x price/perf

with no application, BI or ETL rewrites

Cross-Channel  Campaigns

Real-Time Personalization

(26)

Reference Architecture: Operational Data Lake

Offload real-time reporting and analytics from expensive OLTP and DW systems

OLTP Systems Ad Hoc Analytics Operational Data Lake Executive Business Reports Operational Reports & ERP CRM Supply Chain HR … Data Warehouse Datamart Stream or Batch Updates ETL Real-Time, Event-Driven

(27)

Streamlining the Structured Data Pipeline in Hadoop

Source Systems ERP … CRM Sqoop Apply Inferred Schema Stored as flat files

SQL Query Engines BI Tools

Traditional Hadoop Pipeline

vs. Source Systems ERP CRM Exisiting ETL Tool BI Tools

Streamlined Hadoop Pipeline

Advantages

• Reduced operational costs with less complexity

• Reduced processing time and errors with fewer translations • Real-time updates for data

(28)

Complementing Existing Hadoop-Based Data Lakes

Optimizing storage and querying of structured data as part of ELT or Hadoop query engines

OLTP Systems ERP CRM Supply Chain HR … SCHEMA ON INGEST: Streamlined, structured-to-structured integration Structured

Data Unstructured Data

1

2 3

SCHEMA BEFORE READ:

Repository for structured data or metadata from ELT process on unstructured data

HCATALOG Pig

SCHEMA ON READ:

Ad-hoc Hadoop queries across structured and unstructured data

(29)

Proven Building Blocks: Hadoop and Derby

APACHE DERBY

▪ ANSI SQL-99 RDBMS ▪ Java-based ▪ ODBC/JDBC Compliant

!

APACHE HBASE/HDFS

▪ Auto-sharding ▪ Real-time updates ▪ Fault-tolerance ▪ Scalability to 100s of PBs ▪ Data replication

(30)

HBase: Proven Scale-Out

(31)

Splice Optimizations to HBase

▪ Splice Storage is optimized over raw HBase

▪ We use Bitmap Indexes to store data in packed byte arrays

▪ This approach allows us to store data in a much smaller footprint than traditional HBase ▪ With a TPCH schema, we found a 10X reduction in data size reduction

▪ _{Requires far less hardware and resources to perform the same workload}

▪ Asynchronous Write Pipeline

▪ HBase writes (puts) are not pipelined and block while the call is being made

▪ Splice’s write pipeline allows us to reach speeds of over 100K writes / second per HBase node

▪ This allows extremely high ingest speeds without requiring more hardware and custom code ▪ Transactions

▪ As scalability increases, the likelihood of failures increases

▪ We utilize Snapshot Isolation to make sure if there is a failure, it does not corrupt existing data

▪ RDBMS Capabilities

▪ The use of SQL vs. custom scans and the ability for an optimizer to choose the best access path to the data

(32)

Distributed, Parallelized Query Execution

Parallelized computation across cluster Moves computation to   the data Utilizes HBase   co-processors No MapReduce HBase   Co-Processor

!

HBase Server Memory Space

(33)

ANSI SQL-99 Coverage

▪ Data types – e.g., INTEGER, REAL,

CHARACTER, DATE, BOOLEAN, BIGINT

▪ DDL – e.g., CREATE TABLE, CREATE

SCHEMA, ALTER TABLE, DELETE, UPDATE

▪ Predicates – e.g., IN, BETWEEN, LIKE,

EXISTS

▪ DML – e.g., INSERT, DELETE, UPDATE,

SELECT

▪ Query specification – e.g., SELECT

DISTINCT, GROUP BY, HAVING

▪ SET functions – e.g., UNION, ABS, MOD,

ALL, CHECK

▪ Aggregation functions – e.g., AVG, MAX,

▪ _{Conditional functions – e.g., CASE,}

searched CASE

▪ _{Privileges – e.g., privileges for SELECT,} DELETE, INSERT, EXECUTE

▪ _{Cursors – e.g., updatable, read-only,}

positioned DELETE/UPDATE

▪ _{Joins – e.g., INNER JOIN, LEFT OUTER}

JOIN

▪ _{Transactions – e.g., COMMIT, ROLLBACK,}

READ COMMITTED, REPEATABLE READ, READ UNCOMMITTED, Snapshot Isolation

▪ _Sub-queries

(34)

Window Functions (Advanced Analytics Functions)

▪ Analytics such as Running total, Moving averages, Top-N Queries

▪ Performs calculations across a set of table rows related to the current

row in the window

▪ Similar to aggregate functions with two significant differences:

▪ Outputs one row for each input value it operates upon.

▪ Groups rows with window partitioning and frame clauses vs. Group BY

▪ SPLICE MACHINE Currently Supports

▪ RANK ▪ DENSE_RANK ▪ ROW NUMBER ▪ AVG ▪ SUM ▪ COUNT ▪ MAX ▪ MIN

(35)

Lockless, ACID transactions

• Adds multi-row, multi-table

transactions to HBase with rollback

• Fast, lockless, high concurrency

• Extends research from Google Percolator, Yahoo Labs, U of Waterloo

(36)

Customer Performance Benchmarks

Typically 10x price/performance improvement

30x 3-7x 10-20x _10x 20x 7x SPEED PRICE/  PERFORMANCE VS. FASTER LOWER

(37)

(38)

1 day 5 days (including prep) 2 weeks 3-6 weeks 3-10 months

Splice Machine Safe Journey Process

Initial  Overview

• Splice Machine overview

• Set the stage for Rapid Assessment

Rapid  Assessment

• Half day workshop • Assess Splice Machine

fit

• Identity target use cases

• Risk assessment of use cases

• Agree upon success criteria

Proof of  Concept

• Prove client use case on Splice Machine hosted environment

• Benchmark using customer queries and schema

• On Customer data or generated data that resembles customer data

Pilot  Project

• Identify paid pilot use case with limited change management impact

• Install Splice Machine on client environment • Deploy use case/

application on client data

• Prove Splice Machine against key requirements Enterprise Implementation • Kickstart • Requirements • Design/Dev • QA Test • Cutover • Hypercare

(39)

Safe Journey Enterprise Implementation Stages

Kickstart Packaged 2 week program to get new client off to strong start on solid foundation

!

Incorporates: • Splice Architecture & Development courses • Risk Assessment Workshop • Implementation Blueprint Requirements Establish clear functional and performance requirements document

!

Can be a “refresh only” if project is a port of an existing app to Splice Design/Dev Based on Agile method. Phase is divided into  2 week sprints

!

Stories covering a set of required capabilities are assigned to each developer

!

A design doc is created, code is written, unit tests are QA Test The QA test period includes: • Performance Test • End-to-End System Integration Test • User Acceptance Test

!

Depending on scale of project there may be multiple iterations of each test with break/ fix cycles in between Cutover Formal period in which Splice-based solution goes-live and pre-existing system is deprecated Parallel Ops Used when an existing system is being ported to Splice Machine from another database

!

The new Splice Machine-based system runs side by side with the old system for a period of time

Hypercare Period of on-site support during cutover and for a period immediately following go-live

(40)

Common Risks and Mitigation Strategies

Data migration

• Risk: Clients are typically migrating very large data sets to Splice

Machine. Issues with migration of certain data types such as dates can waste a lot of time reloading large amounts of data

• Solution: First migrate a small subset of tables that contain all required data types. Ensure these migrate successfully before migrating the entire database

Changes to source schema during implementation

• Risk: Changes to the schema of the source database to be migrated

during the course of the implementation will lead to a significant amount of rework and reloading of data, adding unplanned time to the project • Solution: All stakeholders agree up front to freeze the schema as of an

agreed upon date prior to the Design/Development stage. Stored procedure conversion

• Risk: Stored procedures need to be converted from the original language (e.g., PL/SQL) to Java. Complex stored procedures make include

(41)

Common Risks and Mitigation Strategies

SQL compatibility

• Risk: Even though Splice Machine conforms to the ANSI 99+ SQL standard, virtually every database has unique syntax and some queries may need to be modified. Additionally, SQL generated by packaged applications may not be modifiable.

• Solution: Formal review of SQL syntax during the requirements phase. Modify relevant queries during the Design/Dev phase. If not modifiable an enhancement request for Splice Machine to support the required syntax out of the box may needed.

Indexing

• Risk: Proper indexing is usually important to maximize the performance of Splice Machine. Splice Machine indexes are likely to differ from the indexes required for a traditional RDBMS

• Solution: Ensure that query performance SLAs are clearly defined in the Requirements phase. Incorporate proper index design early in the Design/Dev phase. Assume some iteration will be required to achieve the optimal indexes

Hadoop knowledge

(42)

Summary

THE ONLY

HADOOP RDBMS

Power operational applications 

on Hadoop

Affordable, Scale-Out – Commodity hardware Elastic – Easy to expand or scale back

Transactional – Real-time updates & ACID

Transactions

ANSI SQL – Leverage existing SQL code, tools, &

skills

10x

 

Better Price/Perf

(43)

Questions?

Bob Baran

[email protected]