• No results found

Big Data Landscape for Databases

N/A
N/A
Protected

Academic year: 2021

Share "Big Data Landscape for Databases"

Copied!
43
0
0

Loading.... (view fulltext now)

Full text

(1)

Big Data Landscape

for Databases

Bob Baran

Senior Sales Enginee

[email protected]

!

(2)

Typical Database Workloads

OLTP Applications Real-Time Web,

Mobile, and IoT Applications

Real-Time, Operational

Reporting

Ad-Hoc Analytics Enterprise Data Warehouses Typical DatabasesMySQL OracleMongoDB Cassandra MySQL OracleMySQL OracleGreenplum Paraccel NetezzaTeradata Oracle Sybase IQ Use CasesERP, CRM, Supply

Chain

Web, mobile, social

IoTOperational Datastores Crystal ReportsExploratory Analytics Data MiningEnterprise Reporting Workload StrengthsReal-time updates ACID transactions High concurrency of small reads/ writes Range queriesReal-time updates

High ingest rates

High concurrency of small reads/ writes

Range queriesReal-time updates Canned, parameterized reports Range queriesComplex queries requiring full table scans Append onlyParameterized reports against historical data

(3)

Recent History of RDBMSs

RDBMS Definition

▪ Relational with joins

▪ ACID transactions

▪ Secondary indexes

▪ Typically row-oriented

▪ Operational and/or analytical workloads

By early 2000s

▪ Limited innovation

(4)

Hadoop Shakes Up Batch Analytics

Data processing framework

▪ Cheap distributed file system

▪ Brute force, batch processing through MapReduce

Great for batch analytics

(5)

NoSQL Shakes Ups Operational DBs

NoSQL wave

▪ Companies like Google, Amazon and

LinkedIn needed greater scalability &

schema flexibility

▪ New databases developed by developers,

not database people

▪ Provided scale-out, but lost SQL

Worked well at web startups because:

▪ In some cases, use cases did not need ACID

(6)

Convoluted Evolution of Databases

Sc ala bili ty Hierarchical/ Network Databases 1970s Indexed Files (ISAM) 1960s Traditional RDBMSs 1980s-2000s Hadoop 2005 NoSQL Databases 2010 Scale-out SQL Databases 2013 Scale Out Scale Up

(7)

Mainstream user changes

Driven by web, social, mobile, and Internet of Things

▪ Major increases in scale – 30% annual data growth

▪ Significant requirements for semi-structured data

▪ Though relatively little unstructured

Technology adoption continuum

What is it? Should I

use it? Why wouldn’t I use it?

Cloud NoSQL for

web apps Scale-out SQL DBs

(8)

Schema on Ingest vs. Schema on Read

Even “schemaless” MongoDB requires “schema”

- 10 Things You Should Know About Running MongoDB At Scale

• By Asya Kamsky, Principal Solutions Architect at MongoDB

Schema


on Ingest Schema
on Read

• Schema on Read if

you only use data a few times a year

• Structured data

should always remain structured

• Add schema if data

used regularly

(9)

Scale-out is the future of databases

Scale Up Scale Out

NoSQL NewSQL

SQL-on-Hadoop Hadoop

RDBMS Analytic Engines

How do I scale?

(10)

NoSQL

Pros

▪ Easy scale-out

▪ Flexible schema

▪ Easier web development with

hierarchical data structures (MongoDB)

▪ Cross-data center replication

(Cassandra)

Cons

▪ No SQL – requires retraining

and app rewrites

▪ No joins – i.e., no cross row/

document dependencies

▪ No reliable updates through

transactions across rows/tables

▪ Eventual consistency

(Cassandra)

▪ Not designed to do

aggregations required for analytics

(11)

NewSQL

Pros

▪ Easy scale-out

▪ ANSI SQL – eliminates

retraining and app rewrites

▪ Reliable updates through ACID

transactions

▪ RDBMS functionality

▪ Strong cross-data center

replication (NuoDB)

Cons

▪ Proprietary scale-out,

unproven into petabytes

▪ Must manage another

distributed infrastructure beyond Hadoop

▪ Can not leverage Hadoop

(12)

NewSQL – In-Memory

Pros

▪ Easy scale-out

▪ High performance because

everything in memory

▪ ACID transactions within nodes

Cons

▪ Memory 10-20x more expensive

▪ Limited SQL

▪ Limited cross-node transactions

▪ Proprietary scale-out, unproven

into petabytes

▪ Must manage another distributed

infrastructure beyond Hadoop

▪ Can not leverage Hadoop

(13)

Operational RDBMS on Hadoop

Pros

▪ Easy scale-out

▪ Scale-out infrastructure proven

into petabytes

▪ ANSI SQL – eliminates

retraining and app rewrites

▪ Reliable updates through ACID

transactions

▪ Leverages Hadoop distributed

infrastructure and tool ecosystem

Cons

▪ Full table scans slower than MPP

DBs, but faster than traditional RDBMSs

▪ Existing HDFS data must be

(14)

MPP Analytical Databases

Pros

▪ Easy scale-out

▪ Very fast performance for full

table scans

▪ Highly parallelized, shared

nothing architectures

▪ May have columnar storage

(Vertica)

▪ No maintenance of indexes

(Netezza)

Cons

▪ Poor concurrency models prevent

support of real-time apps

▪ Poor performance for range

queries

▪ Need to redistribute all data to

add nodes (hash partitioning)

▪ May require specialized hardware

(Netezza)

▪ Proprietary scale out - can not

leverage Hadoop ecosystem of tools

(15)

SQL-on-Hadoop – Analytical Engines

Pros

▪ Easy scale-out

▪ Scale-out proven into

petabytes

▪ Leverages Hadoop distributed

infrastructure

▪ Can leverage Hadoop

ecosystem of tools

Cons

▪ Relatively immature, especially

compared to MPP DBs

▪ Limited SQL

▪ Poor concurrency models prevent

support of real-time apps

▪ No reliable updates through

transactions

▪ Intermediate results must fit in

(16)

Future: Hybrid In-Memory Architectures

Memory Cache

with Disk

In-Memory

Pure


Hybrid


In-Memory

- Very expensive - Unsophisticated memory management - Flexible, cost-effective - Controlled by optimizer - In-memory materialized views?

(17)

Summary – Future of Databases

Predicted Trends

▪ Scale-out dominates databases

▪ Developers stop worrying about data size and

develop new data-driven apps

▪ Hybrid in-memory architecture becomes

mainstream

Predicted Winners

▪ Hadoop becomes de facto distributed file system

▪ NoSQL used for simple web apps

(18)

Questions?

Bob Baran

Senior Sales Engineer

[email protected]

!

(19)

Powering Real-Time

Apps on Hadoop

Bob Baran

Senior Sales Engineer

[email protected]

!

(20)

Who Are We?

THE ONLY

HADOOP RDBMS

Power operational applications


on Hadoop

Affordable, Scale-Out – Commodity hardware Elastic – Easy to expand or scale back

Transactional – Real-time updates & ACID

Transactions

ANSI SQL – Leverage existing SQL code, tools, &

skills

10x

Better Price/Perf

(21)

What People are Saying…

Recognized as a key innovator in databases

Scaling out on Splice Machine presented 
 some major benefits 


over Oracle


...automatic balancing between clusters...avoiding the costly

licensing issues. Quotes Awards 
 An alternative to today’s RDBMSes,
 Splice Machine effectively 
 combines traditional relational

database technology with 
 the scale-out capabilities 


of Hadoop.

The uniqueclaim of … Splice Machine is that it can run

transactional applications


as well as support analytics on 


(22)

Advisory Board

Advisory Board includes luminaries in databases and technology

Roger Bamford


Former Principal Architect at Oracle

Father of Oracle RAC

Mike Franklin


Computer Science Chair, UC Berkeley

Director, UC Berkeley AmpLab Founder of Apache Spark

Marie-Anne Neimat

Co-Founder, Times-Ten Database Former VP, Database Eng. at Oracle

Ken Rudin

Head of Analytics at Facebook Former GM of Oracle Data Warehousing

(23)

Combines the Best of Both Worlds

Scale-out on commodity servers Proven to 100s of petabytes Efficiently handle sparse data Extensive ecosystem

RDBMS

ANSI SQL

Real-time, concurrent updates ACID transactions

(24)

Focused on OLTP and Real-Time Workloads

OLTP Applications Real-Time Web,

Mobile, and IoT Applications

Real-Time, Operational

Reporting

Ad-Hoc Analytics Enterprise Data Warehouses Typical DatabasesMySQL OracleMySQL Oracle MongoDB CassandraMySQL OracleGreenplum Paraccel NetezzaTeradata Oracle Sybase IQ Use CasesERP, CRM, Supply

Chain

Web, mobile, social

IoTOperational Datastores Crystal ReportsExploratory Analytics Data MiningEnterprise Reporting Workload StrengthsReal-time updates ACID transactions High concurrency of small reads/ writes Range queriesReal-time updates

High ingest rates

High concurrency of small reads/ writes Range queriesReal-time updates Canned, parameterized reports Range queriesComplex queries requiring full table scans Append onlyParameterized reports against historical data

(25)

OLTP Campaign Management: Harte-Hanks

Overview Digital marketing services provider

Unified Customer Profile

Real-time campaign management OLTP environment with BI reports

Challenges

Oracle RAC too expensive to scale Queries too slow – even up to ½ hour

Getting worse – expect 30-50% data growth Looked for 9 months for a cost-effective solution

Solution Diagram Initial Results

¼

cost

with commodity scale out

10-20x price/perf

with no application, BI or ETL rewrites

Cross-Channel
 Campaigns

Real-Time Personalization

(26)

Reference Architecture: Operational Data Lake

Offload real-time reporting and analytics from expensive OLTP and DW systems

OLTP Systems Ad Hoc Analytics Operational Data Lake Executive Business Reports Operational Reports & ERP CRM Supply Chain HR … Data Warehouse Datamart Stream or Batch Updates ETL Real-Time, Event-Driven

(27)

Streamlining the Structured Data Pipeline in Hadoop

Source Systems ERP … CRM Sqoop Apply Inferred Schema Stored as flat files

SQL Query Engines BI Tools

Traditional Hadoop Pipeline

vs. Source Systems ERP CRM Exisiting ETL Tool BI Tools

Streamlined Hadoop Pipeline

Advantages

• Reduced operational costs with less complexity

• Reduced processing time and errors with fewer translations • Real-time updates for data

(28)

Complementing Existing Hadoop-Based Data Lakes

Optimizing storage and querying of structured data as part of ELT or Hadoop query engines

OLTP Systems ERP CRM Supply Chain HR … SCHEMA ON INGEST: Streamlined, structured-to-structured integration Structured

Data Unstructured Data

1

2 3

SCHEMA BEFORE READ:

Repository for structured data or metadata from ELT process on unstructured data

HCATALOG Pig

SCHEMA ON READ:

Ad-hoc Hadoop queries across structured and unstructured data

(29)

Proven Building Blocks: Hadoop and Derby

APACHE DERBY

▪ ANSI SQL-99 RDBMS ▪ Java-based ▪ ODBC/JDBC Compliant

!

APACHE HBASE/HDFS

▪ Auto-sharding ▪ Real-time updates ▪ Fault-tolerance ▪ Scalability to 100s of PBs ▪ Data replication

(30)

HBase: Proven Scale-Out

(31)

Splice Optimizations to HBase

Splice Storage is optimized over raw HBase

▪ We use Bitmap Indexes to store data in packed byte arrays

▪ This approach allows us to store data in a much smaller footprint than traditional HBase ▪ With a TPCH schema, we found a 10X reduction in data size reduction

Requires far less hardware and resources to perform the same workload

Asynchronous Write Pipeline

▪ HBase writes (puts) are not pipelined and block while the call is being made

▪ Splice’s write pipeline allows us to reach speeds of over 100K writes / second per HBase node

▪ This allows extremely high ingest speeds without requiring more hardware and custom code ▪ Transactions

▪ As scalability increases, the likelihood of failures increases

▪ We utilize Snapshot Isolation to make sure if there is a failure, it does not corrupt existing data

RDBMS Capabilities

▪ The use of SQL vs. custom scans and the ability for an optimizer to choose the best access path to the data

(32)

Distributed, Parallelized Query Execution

Parallelized computation across cluster Moves computation to 
 the data Utilizes HBase 
 co-processors No MapReduce HBase 
 Co-Processor

!

HBase Server Memory Space

(33)

ANSI SQL-99 Coverage

Data types – e.g., INTEGER, REAL,

CHARACTER, DATE, BOOLEAN, BIGINT

DDL – e.g., CREATE TABLE, CREATE

SCHEMA, ALTER TABLE, DELETE, UPDATE

Predicates – e.g., IN, BETWEEN, LIKE,

EXISTS

DML – e.g., INSERT, DELETE, UPDATE,

SELECT

Query specification – e.g., SELECT

DISTINCT, GROUP BY, HAVING

SET functions – e.g., UNION, ABS, MOD,

ALL, CHECK

Aggregation functions – e.g., AVG, MAX,

Conditional functions – e.g., CASE,

searched CASE

Privileges – e.g., privileges for SELECT, DELETE, INSERT, EXECUTE

Cursors – e.g., updatable, read-only,

positioned DELETE/UPDATE

Joins – e.g., INNER JOIN, LEFT OUTER

JOIN

Transactions – e.g., COMMIT, ROLLBACK,

READ COMMITTED, REPEATABLE READ, READ UNCOMMITTED, Snapshot Isolation

Sub-queries

(34)

Window Functions (Advanced Analytics Functions)

▪ Analytics such as Running total, Moving averages, Top-N Queries

▪ Performs calculations across a set of table rows related to the current

row in the window

▪ Similar to aggregate functions with two significant differences:

▪ Outputs one row for each input value it operates upon.

▪ Groups rows with window partitioning and frame clauses vs. Group BY

▪ SPLICE MACHINE Currently Supports

▪ RANK ▪ DENSE_RANK ▪ ROW NUMBER ▪ AVG ▪ SUM ▪ COUNT ▪ MAX ▪ MIN

(35)

Lockless, ACID transactions

• Adds multi-row, multi-table

transactions to HBase with rollback

• Fast, lockless, high concurrency

• Extends research from Google Percolator, Yahoo Labs, U of Waterloo

(36)

Customer Performance Benchmarks

Typically 10x price/performance improvement

30x 3-7x 10-20x 10x 20x 7x SPEED PRICE/
 PERFORMANCE VS. FASTER LOWER

(37)
(38)

1 day 5 days (including prep) 2 weeks 3-6 weeks 3-10 months

Splice Machine Safe Journey Process

Initial
 Overview

• Splice Machine overview

• Set the stage for Rapid Assessment

Rapid
 Assessment

• Half day workshop • Assess Splice Machine

fit

• Identity target use cases

• Risk assessment of use cases

• Agree upon success criteria

Proof of
 Concept

• Prove client use case on Splice Machine hosted environment

• Benchmark using customer queries and schema

• On Customer data or generated data that resembles customer data

Pilot
 Project

• Identify paid pilot use case with limited change management impact

• Install Splice Machine on client environment • Deploy use case/

application on client data

• Prove Splice Machine against key requirements Enterprise Implementation • Kickstart • Requirements • Design/Dev • QA Test • Cutover • Hypercare

(39)

Safe Journey Enterprise Implementation Stages

Kickstart Packaged 2 week program to get new client off to strong start on solid foundation

!

Incorporates: • Splice Architecture & Development courses • Risk Assessment Workshop • Implementation Blueprint Requirements Establish clear functional and performance requirements document

!

Can be a “refresh only” if project is a port of an existing app to Splice Design/Dev Based on Agile method. Phase is divided into
 2 week sprints

!

Stories covering a set of required capabilities are assigned to each developer

!

A design doc is created, code is written, unit tests are QA Test The QA test period includes: • Performance Test • End-to-End System Integration Test • User Acceptance Test

!

Depending on scale of project there may be multiple iterations of each test with break/ fix cycles in between Cutover Formal period in which Splice-based solution goes-live and pre-existing system is deprecated Parallel Ops Used when an existing system is being ported to Splice Machine from another database

!

The new Splice Machine-based system runs side by side with the old system for a period of time

Hypercare Period of on-site support during cutover and for a period immediately following go-live

(40)

Common Risks and Mitigation Strategies

Data migration

• Risk: Clients are typically migrating very large data sets to Splice

Machine. Issues with migration of certain data types such as dates can waste a lot of time reloading large amounts of data

• Solution: First migrate a small subset of tables that contain all required data types. Ensure these migrate successfully before migrating the entire database

Changes to source schema during implementation

• Risk: Changes to the schema of the source database to be migrated

during the course of the implementation will lead to a significant amount of rework and reloading of data, adding unplanned time to the project • Solution: All stakeholders agree up front to freeze the schema as of an

agreed upon date prior to the Design/Development stage. Stored procedure conversion

• Risk: Stored procedures need to be converted from the original language (e.g., PL/SQL) to Java. Complex stored procedures make include

(41)

Common Risks and Mitigation Strategies

SQL compatibility

• Risk: Even though Splice Machine conforms to the ANSI 99+ SQL standard, virtually every database has unique syntax and some queries may need to be modified. Additionally, SQL generated by packaged applications may not be modifiable.

• Solution: Formal review of SQL syntax during the requirements phase. Modify relevant queries during the Design/Dev phase. If not modifiable an enhancement request for Splice Machine to support the required syntax out of the box may needed.

Indexing

• Risk: Proper indexing is usually important to maximize the performance of Splice Machine. Splice Machine indexes are likely to differ from the indexes required for a traditional RDBMS

• Solution: Ensure that query performance SLAs are clearly defined in the Requirements phase. Incorporate proper index design early in the Design/Dev phase. Assume some iteration will be required to achieve the optimal indexes

Hadoop knowledge

(42)

Summary

THE ONLY

HADOOP RDBMS

Power operational applications


on Hadoop

Affordable, Scale-Out – Commodity hardware Elastic – Easy to expand or scale back

Transactional – Real-time updates & ACID

Transactions

ANSI SQL – Leverage existing SQL code, tools, &

skills

10x

Better Price/Perf

(43)

Questions?

Bob Baran

Senior Sales Engineer

[email protected]

!

References

Related documents

Adeleke (1999) reported in her study of student teachers attitude to teaching that 87% of undergraduate student teachers in the Faculty of Education in one of the first

In this paper, the influence of construction details on the overall thermal insulation and energy demand in Belgian houses is examined.. The research was carried out for

This B section melody is based completely in the C# Aeolian mode and uses different bass notes to create movement and different qualities within the mode while still staying true to

• From the outset, I extend special thanks to the Minister of Tertiary Education and Skills Training, Senator the Honourable Fazal Karim for his leadership and vision to

In the table below, Interest Rates for Computing Non-Federal Repayments, the interest rates are based on Section 106 of WRDA ’86, which says that the interest rate applicable

The specificity of the approach that I suggest in this essay lies in the fact that the starting point of our analysis is neither the history of feminism nor the development of

Currently, Nike and Adidas are dominating the market for athletic footwear, leaving little room for Under Armour (Van Riper, 2013).. Nike and Adidas are Under Armour’s

Secondly, we contribute to the development of branch-and-price meth- ods for routing with synchronization in general: We provide solutions to the pricing subproblem, which is