How, What, and Where of Data Warehouses for MySQL

(1)

How, What, and Where of Data Warehouses for MySQL

Robert Hodges

(2)

Introducing Continuent

•

The leading provider of clustering and replication for open source DBMS

•

Our Product: Continuent Tungsten

•

Clustering - Commercial-grade HA, performance scaling and data management for MySQL

•

Replication - Flexible, high-performance data movement

(3)

Why Do MySQL Applications

Need a Data Warehouse?

(4)

De!ning the Problem

“In Retail War, Prices on Web Change Hourly”

(New York Times, Dec 1st, 2012)

(5)

Typical Schema for Sales Analytics

Sales

* customer

* product

* quantity

* sale type

* location

* discount

* sale_amount

* sale_time

* period

* payment_type

* campaign

Location

* city

* county

* state

Product

* sku

* product_type ...

Customer

* first_name

* last_name

* loyalty_rank

* street ...

Period

* hour

* day_of_week

* day_of_month

* week

(6)

InnoDB = Row Store

id cust_id prod_id ...

1 335301 532 ...

2 2378 6235 ...

3 ... ... ...

Sales Table

cust_id id

2378 2

335301 1

... ...

Cust_ID Index

prod_id id

532 1

6235 2

... ...

Prod_ID Index Clustered

by primary key

Row data stored together

Indexes use primary key Indexes

slow writes

(7)

Row Store + MySQL Server = OLTP

•

Fast update of small number of rows

•

Limited indexing (few, B-Tree only)

•

Minimal compression

•

Nested loop joins

•

Single-threaded query

•

Sharded data sets

(8)

OLTP != Analytics

•

Parallel execution

•

E"cient search on any column

•

Time series

•

Spatial query

•

Recursive query

•

Star schema organization

•

Data cubes/pivot tables (OLAP)

•

Business Intelligence (BI) tool integration

(9)

Solution: MySQL + Data Warehouse

Sharded MySQL for high transaction throughput

Data warehouse for fast analytics Near-realtime

loading

(10)

Data Warehouse Options

(11)

Commercial DBMS -- Oracle

•

Parallel query (automatic in 11G)

•

Hash, bitmap indexes

•

Wide variety of compression options

•

Amazingly advanced query optimizer

•

Star schemas with dimensions & hierarchies

•

Excellent vertical performance scaling

•

Stable and well-known BI tools

(12)

Column Store Architecture

Sales Table

cust_id 335301 2378 ...

prod_id 532

6235 ...

Column data

stored together Updates to entire row are hideously slow

quantity 1

3 ...

...

...

Every column

is an index Good

compression

(13)

Column Stores -- Vertica

•

PostgreSQL syntax (but little/no code)

•

Parallel query

•

Built-in star schema support

•

Time series support

•

Multiple compression methods

•

Built-in HA model

•

Widely used, excellent scaling

(14)

Column Store--Calpont In!niDB

•

Looks like MySQL to apps (with minor di#erences)

•

Distributed architecture with parallel query

•

Columns compressed and fully indexed

•

Automatic partitioning of data

•

Built-in HA using distributed data copies

(15)

NoSQL/Hadoop

•

Minimal SQL dialect (subset of SQL-92)

•

Data access is non-transparent

•

Hadoop is batch-oriented

•

Excellent horizontal scaling in cloud

•

Parallel query using map/reduce

•

Handles failures by automatic job resubmit

•

HiveQL is getting better fast

(16)

Real-Time Data Loading

(17)

Options for Loading Data Warehouse

1. Extract/Transfer/Load (ETL) software

• Stable & good GUI tools but slow, resource intensive, has app a#ects

2. Do-it-yourself reads from the binlog

• Unstable and hard to maintain (ask me how I know)

3. Real-time replication with Tungsten Replicator

• Fast with minimal application load or disruption

(18)

DEMO

MySQL to Vertica replication with some bells and a whistle MySQL

db01 db02 db03

db01

renamed02

X

sysbench

(19)

Understanding Tungsten Replicator

Master

(Transactions + Metadata)

Slave

THL

DBMS Logs

Replicator

(Transactions + Metadata) THL

Replicator Download

transactions via network

Apply using JDBC

(20)

Pipelines with Parallel Apply

Extract Filter Apply

Stage

Extract Filter Apply

Stage Stage

Pipeline

Master

DBMS Transaction

History Log

In-Memory

Queue Slave

DBMS Extract Filter Apply Extract Filter Apply Extract Filter Apply

(21)

Real-Time Heterogeneous Transfer

MySQL Tungsten Master Replicator

Oracle

Service oracle

MySQLExtractor Special Filters

* Transform enum to string

binlog_format=row

Tungsten Slave Replicator Service oracle

Special Filters

* Ignore extra tables

* Map names to upper case

* Optimize updates to remove unchanged columns

MySQL Binlog

(22)

Column Store--Real-Time Batches

MySQL Tungsten Master Replicator Service my2vr

MySQLExtractor Special Filters

* pkey - Fill in pkey info

* colnames - Fill in names

* replicate - Ignore tables

binlog_format=row

Tungsten Slave Replicator Service my2vr

MySQL Binlog

CSV FilesCSV

FilesCSV FilesCSV

FilesCSV Files

Large transaction batches to leverage

load parallelization

(23)

Batch Loading--The Gory Details

Replicator

Service my2vr

Transactions from master

CSV FilesCSV

FilesCSV Files

Staging TablesStaging

TablesStaging Tables

Base TablesBase

TablesBase Tables

Merge

(or) COPY directly to COPY to

stage tables SELECT to base tables

(24)

Vertica Implementation Steps

(25)

0. Get Software and Documentation

•

Get the software:

http://code.google.com/p/tungsten-replicator

•

Get the documentation:

https://docs.continuent.com/wiki/display/TEDOC

(26)

1. Best Practices for MySQL

•

Single column keys

•

(Currently required by Tungsten)

•

^{UTF-8 data}

•

GMT timezone

•

Row replication enabled

(27)

2. Handle Availability

•

What happens if MySQL fails?

•

What happens if a replicator fails?

•

What happens if Vertica fails?

(28)

3. Create Base Tables

/* MySQL table definition */

CREATE TABLE `sbtest` (

`id` int(10) unsigned NOT NULL AUTO_INCREMENT, `k` int(10) unsigned NOT NULL DEFAULT '0',

`c` char(120) NOT NULL DEFAULT '', `pad` char(60) NOT NULL DEFAULT '', PRIMARY KEY (`id`),

KEY `k` (`k`));

/* Vertica table definition */

create table db01.sbtest(

id int, k int,

c char(120), pad char(60) );

(29)

4. Provision Initial Data

•

Option 1 (Large data sets): CSV Loading

•

Option 2 (Small data sets): Run transactions through replicator itself

mysql> SELECT * from foo INTO OUTFILE ‘foo.csv’;

...

(Fix up data if necessary) ...

vsql> COPY foo FROM 'foo.csv'

DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"';

(30)

5. Select Tungsten Filter Options

•

Tables to ignore/include?

•

Schema/table/column renaming?

•

Map names to upper/lower case?

•

Custom !lters?

tungsten-installer --master-slave -a \ --service-name=mysql2vertica \

...

--svc-extractor-filters=replicate \ --svc-applier-filters=dbtransform \

--property=replicator.filter.replicate.do=db01.*,db02.* \ --property=replicator.filter.dbtransform.from_regex1=db02 \ --property=replicator.filter.dbtransform.to_regex1=renamed02 \ ...

(31)

5. Customize Merge Script

# Hacked load script for Vertica--deletes always precede inserts, so

# inserts can load directly.

# Extract deleted data keys and put in temp CSV file for deletes.

!egrep '^"D",' %%CSV_FILE%% |cut -d, -f4 > %%CSV_FILE%%.delete COPY %%STAGE_TABLE_FQN%% FROM '%%CSV_FILE%%.delete'

DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"'

# Delete rows using an IN clause. You could also set a column value to

# mark deleted rows.

DELETE FROM %%BASE_TABLE%% WHERE %%BASE_PKEY%% IN (SELECT %%STAGE_PKEY%% FROM %%STAGE_TABLE_FQN%%)

# Load inserts directly into base table from a separate CSV file.

!egrep '^"I",' %%CSV_FILE%% |cut -d, -f4- > %%CSV_FILE%%.insert COPY %%BASE_TABLE%% FROM '%%CSV_FILE%%.insert'

DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"'

(32)

6. Create Staging Tables

/* Full staging table */

create table db01.stage_xxx_sbtest(

tungsten_opcode char(1), tungsten_seqno int,

tungsten_row_id int, id int,

k int,

c char(120), pad char(60));

(OR)

/* Staging table with delete keys only. */

create table db01.stage_xxx_sbtest(id int);

(33)

7. Install Replicators

•

Master/slave vs. direct replication

•

Filter settings (and where to run them)

•

Directory to hold CSV !les

•

How long to preserve logs

•

Memory size (Java heap)

•

Run replicator locally or on separate host(s)

(34)

8. Test and Deploy!

•

Typical test cycles for DW loading run to months

•

Not weeks or days

•

Use production data

•

Monitoring/alerting

(35)

Advanced Replication Features

(36)

More Possibilities for Analytics...

OLTP Data

MySQL Master

Complex, near real-time

reporting

Light-weight, real-time operational status

Web-facing mini- data marts for SaaS users

(37)

Adding Clustering to MySQL

Replicator

Replicator nyc (master)

fra (master)

sfo (master)

nyc (slave) fra (slave) sfo (slave)

Replicator New York

Frankfurt

Replicator

WarehouseData

(38)

Conclusion

•

Data warehouses enable fast analytics on MySQL transactions

•

Multiple data warehouse technologies

•

Heterogenous data replication solves the problem of real-time loading

(39)

One more thing:

WE’RE HIRING!!!

(40)

Continuent Web Page:

http://www.continuent.com Tungsten Replicator 2.0:

http://code.google.com/p/tungsten-replicator

Our Blogs:

http://scale-out-blog.blogspot.com http://datacharmer.org/blog

http://www.continuent.com/news/blogs 560 S. Winchester Blvd., Suite 500

San Jose, CA 95128 Tel +1 (866) 998-3642 Fax +1 (408) 668-1009

e-mail: [email protected]