Typical Database Workloads
OLTP Applications Real-Time Web,Mobile, and IoT Applications
Real-Time, Operational
Reporting
Ad-Hoc Analytics Enterprise Data Warehouses Typical Databases • MySQL • Oracle • MongoDB • Cassandra • MySQL • Oracle • MySQL • Oracle • Greenplum • Paraccel • Netezza • Teradata • Oracle • Sybase IQ Use Cases • ERP, CRM, Supply
Chain
• Web, mobile, social
• IoT • Operational Datastores • Crystal Reports • Exploratory Analytics • Data Mining • Enterprise Reporting Workload Strengths • Real-time updates • ACID transactions • High concurrency of small reads/ writes • Range queries • Real-time updates
• High ingest rates
• High concurrency of small reads/ writes
• Range queries • Real-time updates • Canned, parameterized reports • Range queries • Complex queries requiring full table scans • Append only • Parameterized reports against historical data
Recent History of RDBMSs
▪
RDBMS Definition
▪ Relational with joins
▪ ACID transactions
▪ Secondary indexes
▪ Typically row-oriented
▪ Operational and/or analytical workloads
▪
By early 2000s
▪ Limited innovation
Hadoop Shakes Up Batch Analytics
▪
Data processing framework
▪ Cheap distributed file system
▪ Brute force, batch processing through MapReduce
▪
Great for batch analytics
NoSQL Shakes Ups Operational DBs
▪
NoSQL wave
▪ Companies like Google, Amazon and
LinkedIn needed greater scalability &
schema flexibility
▪ New databases developed by developers,
not database people
▪ Provided scale-out, but lost SQL
▪
Worked well at web startups because:
▪ In some cases, use cases did not need ACID
Convoluted Evolution of Databases
Sc ala bili ty Hierarchical/ Network Databases 1970s Indexed Files (ISAM) 1960s Traditional RDBMSs 1980s-2000s Hadoop 2005 NoSQL Databases 2010 Scale-out SQL Databases 2013 Scale Out Scale UpMainstream user changes
▪
Driven by web, social, mobile, and Internet of Things
▪ Major increases in scale – 30% annual data growth
▪ Significant requirements for semi-structured data
▪ Though relatively little unstructured
▪
Technology adoption continuum
What is it? Should I
use it? Why wouldn’t I use it?
Cloud NoSQL for
web apps Scale-out SQL DBs
Schema on Ingest vs. Schema on Read
▪ Even “schemaless” MongoDB requires “schema”
- 10 Things You Should Know About Running MongoDB At Scale
• By Asya Kamsky, Principal Solutions Architect at MongoDB
Schema
on Ingest Schema on Read
• Schema on Read if
you only use data a few times a year
• Structured data
should always remain structured
• Add schema if data
used regularly
Scale-out is the future of databases
Scale Up Scale Out
NoSQL NewSQL
SQL-on-Hadoop Hadoop
RDBMS Analytic Engines
How do I scale?
NoSQL
Pros
▪ Easy scale-out
▪ Flexible schema
▪ Easier web development with
hierarchical data structures (MongoDB)
▪ Cross-data center replication
(Cassandra)
Cons
▪ No SQL – requires retraining
and app rewrites
▪ No joins – i.e., no cross row/
document dependencies
▪ No reliable updates through
transactions across rows/tables
▪ Eventual consistency
(Cassandra)
▪ Not designed to do
aggregations required for analytics
NewSQL
Pros
▪ Easy scale-out
▪ ANSI SQL – eliminates
retraining and app rewrites
▪ Reliable updates through ACID
transactions
▪ RDBMS functionality
▪ Strong cross-data center
replication (NuoDB)
Cons
▪ Proprietary scale-out,
unproven into petabytes
▪ Must manage another
distributed infrastructure beyond Hadoop
▪ Can not leverage Hadoop
NewSQL – In-Memory
Pros
▪ Easy scale-out
▪ High performance because
everything in memory
▪ ACID transactions within nodes
Cons
▪ Memory 10-20x more expensive
▪ Limited SQL
▪ Limited cross-node transactions
▪ Proprietary scale-out, unproven
into petabytes
▪ Must manage another distributed
infrastructure beyond Hadoop
▪ Can not leverage Hadoop
Operational RDBMS on Hadoop
Pros
▪ Easy scale-out
▪ Scale-out infrastructure proven
into petabytes
▪ ANSI SQL – eliminates
retraining and app rewrites
▪ Reliable updates through ACID
transactions
▪ Leverages Hadoop distributed
infrastructure and tool ecosystem
Cons
▪ Full table scans slower than MPP
DBs, but faster than traditional RDBMSs
▪ Existing HDFS data must be
MPP Analytical Databases
Pros
▪ Easy scale-out
▪ Very fast performance for full
table scans
▪ Highly parallelized, shared
nothing architectures
▪ May have columnar storage
(Vertica)
▪ No maintenance of indexes
(Netezza)
Cons
▪ Poor concurrency models prevent
support of real-time apps
▪ Poor performance for range
queries
▪ Need to redistribute all data to
add nodes (hash partitioning)
▪ May require specialized hardware
(Netezza)
▪ Proprietary scale out - can not
leverage Hadoop ecosystem of tools
SQL-on-Hadoop – Analytical Engines
Pros
▪ Easy scale-out
▪ Scale-out proven into
petabytes
▪ Leverages Hadoop distributed
infrastructure
▪ Can leverage Hadoop
ecosystem of tools
Cons
▪ Relatively immature, especially
compared to MPP DBs
▪ Limited SQL
▪ Poor concurrency models prevent
support of real-time apps
▪ No reliable updates through
transactions
▪ Intermediate results must fit in
Future: Hybrid In-Memory Architectures
Memory Cache
with Disk
In-Memory
Pure
Hybrid
In-Memory
- Very expensive - Unsophisticated memory management - Flexible, cost-effective - Controlled by optimizer - In-memory materialized views?Summary – Future of Databases
▪
Predicted Trends
▪ Scale-out dominates databases
▪ Developers stop worrying about data size and
develop new data-driven apps
▪ Hybrid in-memory architecture becomes
mainstream
▪
Predicted Winners
▪ Hadoop becomes de facto distributed file system
▪ NoSQL used for simple web apps
Who Are We?
THE ONLY
HADOOP RDBMS
Power operational applications
on Hadoop
Affordable, Scale-Out – Commodity hardware Elastic – Easy to expand or scale back
Transactional – Real-time updates & ACID
Transactions
ANSI SQL – Leverage existing SQL code, tools, &
skills
10x
Better Price/Perf
What People are Saying…
Recognized as a key innovator in databasesScaling out on Splice Machine presented some major benefits
over Oracle
...automatic balancing between clusters...avoiding the costly
licensing issues. Quotes Awards An alternative to today’s RDBMSes, Splice Machine effectively combines traditional relational
database technology with the scale-out capabilities
of Hadoop.
The uniqueclaim of … Splice Machine is that it can run
transactional applications
as well as support analytics on
Advisory Board
Advisory Board includes luminaries in databases and technology
Roger Bamford
Former Principal Architect at Oracle
Father of Oracle RAC
Mike Franklin
Computer Science Chair, UC Berkeley
Director, UC Berkeley AmpLab Founder of Apache Spark
Marie-Anne Neimat
Co-Founder, Times-Ten Database Former VP, Database Eng. at Oracle
Ken Rudin
Head of Analytics at Facebook Former GM of Oracle Data Warehousing
Combines the Best of Both Worlds
▪ Scale-out on commodity servers ▪ Proven to 100s of petabytes ▪ Efficiently handle sparse data ▪ Extensive ecosystem
RDBMS
▪ ANSI SQL
▪ Real-time, concurrent updates ▪ ACID transactions
Focused on OLTP and Real-Time Workloads
OLTP Applications Real-Time Web,Mobile, and IoT Applications
Real-Time, Operational
Reporting
Ad-Hoc Analytics Enterprise Data Warehouses Typical Databases • MySQL • Oracle • MySQL • Oracle • MongoDB • Cassandra • MySQL • Oracle • Greenplum • Paraccel • Netezza • Teradata • Oracle • Sybase IQ Use Cases • ERP, CRM, Supply
Chain
• Web, mobile, social
• IoT • Operational Datastores • Crystal Reports • Exploratory Analytics • Data Mining • Enterprise Reporting Workload Strengths • Real-time updates • ACID transactions • High concurrency of small reads/ writes • Range queries • Real-time updates
• High ingest rates
• High concurrency of small reads/ writes • Range queries • Real-time updates • Canned, parameterized reports • Range queries • Complex queries requiring full table scans • Append only • Parameterized reports against historical data
OLTP Campaign Management: Harte-Hanks
Overview Digital marketing services providerUnified Customer Profile
Real-time campaign management OLTP environment with BI reports
Challenges
Oracle RAC too expensive to scale Queries too slow – even up to ½ hour
Getting worse – expect 30-50% data growth Looked for 9 months for a cost-effective solution
Solution Diagram Initial Results
¼
costwith commodity scale out
10-20x price/perf
with no application, BI or ETL rewrites
Cross-Channel Campaigns
Real-Time Personalization
Reference Architecture: Operational Data Lake
Offload real-time reporting and analytics from expensive OLTP and DW systemsOLTP Systems Ad Hoc Analytics Operational Data Lake Executive Business Reports Operational Reports & ERP CRM Supply Chain HR … Data Warehouse Datamart Stream or Batch Updates ETL Real-Time, Event-Driven
Streamlining the Structured Data Pipeline in Hadoop
Source Systems ERP … CRM Sqoop Apply Inferred Schema Stored as flat filesSQL Query Engines BI Tools
Traditional Hadoop Pipeline
vs. Source Systems ERP CRM Exisiting ETL Tool BI Tools
Streamlined Hadoop Pipeline
Advantages
• Reduced operational costs with less complexity
• Reduced processing time and errors with fewer translations • Real-time updates for data
Complementing Existing Hadoop-Based Data Lakes
Optimizing storage and querying of structured data as part of ELT or Hadoop query engines
OLTP Systems ERP CRM Supply Chain HR … SCHEMA ON INGEST: Streamlined, structured-to-structured integration Structured
Data Unstructured Data
1
2 3
SCHEMA BEFORE READ:
Repository for structured data or metadata from ELT process on unstructured data
HCATALOG Pig
SCHEMA ON READ:
Ad-hoc Hadoop queries across structured and unstructured data
Proven Building Blocks: Hadoop and Derby
APACHE DERBY
▪ ANSI SQL-99 RDBMS ▪ Java-based ▪ ODBC/JDBC Compliant!
APACHE HBASE/HDFS
▪ Auto-sharding ▪ Real-time updates ▪ Fault-tolerance ▪ Scalability to 100s of PBs ▪ Data replicationHBase: Proven Scale-Out
Splice Optimizations to HBase
▪ Splice Storage is optimized over raw HBase
▪ We use Bitmap Indexes to store data in packed byte arrays
▪ This approach allows us to store data in a much smaller footprint than traditional HBase ▪ With a TPCH schema, we found a 10X reduction in data size reduction
▪ Requires far less hardware and resources to perform the same workload
▪ Asynchronous Write Pipeline
▪ HBase writes (puts) are not pipelined and block while the call is being made
▪ Splice’s write pipeline allows us to reach speeds of over 100K writes / second per HBase node
▪ This allows extremely high ingest speeds without requiring more hardware and custom code ▪ Transactions
▪ As scalability increases, the likelihood of failures increases
▪ We utilize Snapshot Isolation to make sure if there is a failure, it does not corrupt existing data
▪ RDBMS Capabilities
▪ The use of SQL vs. custom scans and the ability for an optimizer to choose the best access path to the data
Distributed, Parallelized Query Execution
Parallelized computation across cluster Moves computation to the data Utilizes HBase co-processors No MapReduce HBase Co-Processor!
HBase Server Memory SpaceANSI SQL-99 Coverage
▪ Data types – e.g., INTEGER, REAL,
CHARACTER, DATE, BOOLEAN, BIGINT
▪ DDL – e.g., CREATE TABLE, CREATE
SCHEMA, ALTER TABLE, DELETE, UPDATE
▪ Predicates – e.g., IN, BETWEEN, LIKE,
EXISTS
▪ DML – e.g., INSERT, DELETE, UPDATE,
SELECT
▪ Query specification – e.g., SELECT
DISTINCT, GROUP BY, HAVING
▪ SET functions – e.g., UNION, ABS, MOD,
ALL, CHECK
▪ Aggregation functions – e.g., AVG, MAX,
▪ Conditional functions – e.g., CASE,
searched CASE
▪ Privileges – e.g., privileges for SELECT, DELETE, INSERT, EXECUTE
▪ Cursors – e.g., updatable, read-only,
positioned DELETE/UPDATE
▪ Joins – e.g., INNER JOIN, LEFT OUTER
JOIN
▪ Transactions – e.g., COMMIT, ROLLBACK,
READ COMMITTED, REPEATABLE READ, READ UNCOMMITTED, Snapshot Isolation
▪ Sub-queries
Window Functions (Advanced Analytics Functions)
▪ Analytics such as Running total, Moving averages, Top-N Queries
▪ Performs calculations across a set of table rows related to the current
row in the window
▪ Similar to aggregate functions with two significant differences:
▪ Outputs one row for each input value it operates upon.
▪ Groups rows with window partitioning and frame clauses vs. Group BY
▪ SPLICE MACHINE Currently Supports
▪ RANK ▪ DENSE_RANK ▪ ROW NUMBER ▪ AVG ▪ SUM ▪ COUNT ▪ MAX ▪ MIN
Lockless, ACID transactions
• Adds multi-row, multi-table
transactions to HBase with rollback
• Fast, lockless, high concurrency
• Extends research from Google Percolator, Yahoo Labs, U of Waterloo
Customer Performance Benchmarks
Typically 10x price/performance improvement30x 3-7x 10-20x 10x 20x 7x SPEED PRICE/ PERFORMANCE VS. FASTER LOWER
1 day 5 days (including prep) 2 weeks 3-6 weeks 3-10 months
Splice Machine Safe Journey Process
Initial Overview
• Splice Machine overview
• Set the stage for Rapid Assessment
Rapid Assessment
• Half day workshop • Assess Splice Machine
fit
• Identity target use cases
• Risk assessment of use cases
• Agree upon success criteria
Proof of Concept
• Prove client use case on Splice Machine hosted environment
• Benchmark using customer queries and schema
• On Customer data or generated data that resembles customer data
Pilot Project
• Identify paid pilot use case with limited change management impact
• Install Splice Machine on client environment • Deploy use case/
application on client data
• Prove Splice Machine against key requirements Enterprise Implementation • Kickstart • Requirements • Design/Dev • QA Test • Cutover • Hypercare
Safe Journey Enterprise Implementation Stages
Kickstart Packaged 2 week program to get new client off to strong start on solid foundation!
Incorporates: • Splice Architecture & Development courses • Risk Assessment Workshop • Implementation Blueprint Requirements Establish clear functional and performance requirements document!
Can be a “refresh only” if project is a port of an existing app to Splice Design/Dev Based on Agile method. Phase is divided into 2 week sprints!
Stories covering a set of required capabilities are assigned to each developer!
A design doc is created, code is written, unit tests are QA Test The QA test period includes: • Performance Test • End-to-End System Integration Test • User Acceptance Test!
Depending on scale of project there may be multiple iterations of each test with break/ fix cycles in between Cutover Formal period in which Splice-based solution goes-live and pre-existing system is deprecated Parallel Ops Used when an existing system is being ported to Splice Machine from another database!
The new Splice Machine-based system runs side by side with the old system for a period of time
Hypercare Period of on-site support during cutover and for a period immediately following go-live
Common Risks and Mitigation Strategies
Data migration• Risk: Clients are typically migrating very large data sets to Splice
Machine. Issues with migration of certain data types such as dates can waste a lot of time reloading large amounts of data
• Solution: First migrate a small subset of tables that contain all required data types. Ensure these migrate successfully before migrating the entire database
Changes to source schema during implementation
• Risk: Changes to the schema of the source database to be migrated
during the course of the implementation will lead to a significant amount of rework and reloading of data, adding unplanned time to the project • Solution: All stakeholders agree up front to freeze the schema as of an
agreed upon date prior to the Design/Development stage. Stored procedure conversion
• Risk: Stored procedures need to be converted from the original language (e.g., PL/SQL) to Java. Complex stored procedures make include
Common Risks and Mitigation Strategies
SQL compatibility• Risk: Even though Splice Machine conforms to the ANSI 99+ SQL standard, virtually every database has unique syntax and some queries may need to be modified. Additionally, SQL generated by packaged applications may not be modifiable.
• Solution: Formal review of SQL syntax during the requirements phase. Modify relevant queries during the Design/Dev phase. If not modifiable an enhancement request for Splice Machine to support the required syntax out of the box may needed.
Indexing
• Risk: Proper indexing is usually important to maximize the performance of Splice Machine. Splice Machine indexes are likely to differ from the indexes required for a traditional RDBMS
• Solution: Ensure that query performance SLAs are clearly defined in the Requirements phase. Incorporate proper index design early in the Design/Dev phase. Assume some iteration will be required to achieve the optimal indexes
Hadoop knowledge
Summary
THE ONLY
HADOOP RDBMS
Power operational applications
on Hadoop
Affordable, Scale-Out – Commodity hardware Elastic – Easy to expand or scale back
Transactional – Real-time updates & ACID
Transactions
ANSI SQL – Leverage existing SQL code, tools, &
skills
10x
Better Price/Perf