How, What, and Where of Data Warehouses for MySQL
Robert Hodges
Introducing Continuent
•
The leading provider of clustering and replication for open source DBMS•
Our Product: Continuent Tungsten•
Clustering - Commercial-grade HA, performance scaling and data management for MySQL•
Replication - Flexible, high-performance data movementWhy Do MySQL Applications
Need a Data Warehouse?
De!ning the Problem
“In Retail War, Prices on Web Change Hourly”
(New York Times, Dec 1st, 2012)
Typical Schema for Sales Analytics
Sales
* customer
* product
* quantity
* sale type
* location
* discount
* sale_amount
* sale_time
* period
* payment_type
* campaign
Location
* city
* county
* state
Product
* sku
* product_type ...
Customer
* first_name
* last_name
* loyalty_rank
* street ...
Period
* hour
* day_of_week
* day_of_month
* week
InnoDB = Row Store
id cust_id prod_id ...
1 335301 532 ...
2 2378 6235 ...
3 ... ... ...
Sales Table
cust_id id
2378 2
335301 1
... ...
Cust_ID Index
prod_id id
532 1
6235 2
... ...
Prod_ID Index Clustered
by primary key
Row data stored together
Indexes use primary key Indexes
slow writes
Row Store + MySQL Server = OLTP
•
Fast update of small number of rows•
Limited indexing (few, B-Tree only)•
Minimal compression•
Nested loop joins•
Single-threaded query•
Sharded data setsOLTP != Analytics
•
Parallel execution•
E"cient search on any column•
Time series•
Spatial query•
Recursive query•
Star schema organization•
Data cubes/pivot tables (OLAP)•
Business Intelligence (BI) tool integrationSolution: MySQL + Data Warehouse
Sharded MySQL for high transaction throughput
Data warehouse for fast analytics Near-realtime
loading
Data Warehouse Options
Commercial DBMS -- Oracle
•
Parallel query (automatic in 11G)•
Hash, bitmap indexes•
Wide variety of compression options•
Amazingly advanced query optimizer•
Star schemas with dimensions & hierarchies•
Excellent vertical performance scaling•
Stable and well-known BI toolsColumn Store Architecture
Sales Table
cust_id 335301 2378 ...
prod_id 532
6235 ...
Column data
stored together Updates to entire row are hideously slow
quantity 1
3 ...
...
...
...
...
Every column
is an index Good
compression
Column Stores -- Vertica
•
PostgreSQL syntax (but little/no code)•
Parallel query•
Built-in star schema support•
Time series support•
Multiple compression methods•
Built-in HA model•
Widely used, excellent scalingColumn Store--Calpont In!niDB
•
Looks like MySQL to apps (with minor di#erences)•
Distributed architecture with parallel query•
Columns compressed and fully indexed•
Automatic partitioning of data•
Built-in HA using distributed data copiesNoSQL/Hadoop
•
Minimal SQL dialect (subset of SQL-92)•
Data access is non-transparent•
Hadoop is batch-oriented•
Excellent horizontal scaling in cloud•
Parallel query using map/reduce•
Handles failures by automatic job resubmit•
HiveQL is getting better fastReal-Time Data Loading
Options for Loading Data Warehouse
1. Extract/Transfer/Load (ETL) software
• Stable & good GUI tools but slow, resource intensive, has app a#ects
2. Do-it-yourself reads from the binlog
• Unstable and hard to maintain (ask me how I know)
3. Real-time replication with Tungsten Replicator
• Fast with minimal application load or disruption
DEMO
MySQL to Vertica replication with some bells and a whistle MySQL
db01 db02 db03
db01
renamed02
X
sysbench
sysbench
sysbench
Understanding Tungsten Replicator
Master
(Transactions + Metadata)
Slave
THL
DBMS Logs
Replicator
(Transactions + Metadata) THL
Replicator Download
transactions via network
Apply using JDBC
Pipelines with Parallel Apply
Extract Filter Apply
Stage
Extract Filter Apply
Stage Stage
Pipeline
Master
DBMS Transaction
History Log
In-Memory
Queue Slave
DBMS Extract Filter Apply Extract Filter Apply Extract Filter Apply
Real-Time Heterogeneous Transfer
MySQL Tungsten Master Replicator
Oracle
Service oracle
MySQLExtractor Special Filters
* Transform enum to string
binlog_format=row
Tungsten Slave Replicator Service oracle
Special Filters
* Ignore extra tables
* Map names to upper case
* Optimize updates to remove unchanged columns
MySQL Binlog
Column Store--Real-Time Batches
MySQL Tungsten Master Replicator Service my2vr
MySQLExtractor Special Filters
* pkey - Fill in pkey info
* colnames - Fill in names
* replicate - Ignore tables
binlog_format=row
Tungsten Slave Replicator Service my2vr
MySQL Binlog
CSV FilesCSV
FilesCSV FilesCSV
FilesCSV Files
Large transaction batches to leverage
load parallelization
Batch Loading--The Gory Details
Replicator
Service my2vr
Transactions from master
CSV FilesCSV
FilesCSV Files
Staging TablesStaging
TablesStaging Tables
Base TablesBase
TablesBase Tables
Merge
(or) COPY directly to COPY to
stage tables SELECT to base tables
Vertica Implementation Steps
0. Get Software and Documentation
•
Get the software:http://code.google.com/p/tungsten-replicator
•
Get the documentation:https://docs.continuent.com/wiki/display/TEDOC
1. Best Practices for MySQL
•
Single column keys•
(Currently required by Tungsten)•
UTF-8 data•
GMT timezone•
Row replication enabled2. Handle Availability
•
What happens if MySQL fails?•
What happens if a replicator fails?•
What happens if Vertica fails?3. Create Base Tables
/* MySQL table definition */
CREATE TABLE `sbtest` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT, `k` int(10) unsigned NOT NULL DEFAULT '0',
`c` char(120) NOT NULL DEFAULT '', `pad` char(60) NOT NULL DEFAULT '', PRIMARY KEY (`id`),
KEY `k` (`k`));
/* Vertica table definition */
create table db01.sbtest(
id int, k int,
c char(120), pad char(60) );
4. Provision Initial Data
•
Option 1 (Large data sets): CSV Loading•
Option 2 (Small data sets): Run transactions through replicator itselfmysql> SELECT * from foo INTO OUTFILE ‘foo.csv’;
...
(Fix up data if necessary) ...
vsql> COPY foo FROM 'foo.csv'
DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"';
5. Select Tungsten Filter Options
•
Tables to ignore/include?•
Schema/table/column renaming?•
Map names to upper/lower case?•
Custom !lters?tungsten-installer --master-slave -a \ --service-name=mysql2vertica \
...
--svc-extractor-filters=replicate \ --svc-applier-filters=dbtransform \
--property=replicator.filter.replicate.do=db01.*,db02.* \ --property=replicator.filter.dbtransform.from_regex1=db02 \ --property=replicator.filter.dbtransform.to_regex1=renamed02 \ ...
5. Customize Merge Script
# Hacked load script for Vertica--deletes always precede inserts, so
# inserts can load directly.
# Extract deleted data keys and put in temp CSV file for deletes.
!egrep '^"D",' %%CSV_FILE%% |cut -d, -f4 > %%CSV_FILE%%.delete COPY %%STAGE_TABLE_FQN%% FROM '%%CSV_FILE%%.delete'
DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"'
# Delete rows using an IN clause. You could also set a column value to
# mark deleted rows.
DELETE FROM %%BASE_TABLE%% WHERE %%BASE_PKEY%% IN (SELECT %%STAGE_PKEY%% FROM %%STAGE_TABLE_FQN%%)
# Load inserts directly into base table from a separate CSV file.
!egrep '^"I",' %%CSV_FILE%% |cut -d, -f4- > %%CSV_FILE%%.insert COPY %%BASE_TABLE%% FROM '%%CSV_FILE%%.insert'
DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"'
6. Create Staging Tables
/* Full staging table */
create table db01.stage_xxx_sbtest(
tungsten_opcode char(1), tungsten_seqno int,
tungsten_row_id int, id int,
k int,
c char(120), pad char(60));
(OR)
/* Staging table with delete keys only. */
create table db01.stage_xxx_sbtest(id int);
7. Install Replicators
•
Master/slave vs. direct replication•
Filter settings (and where to run them)•
Directory to hold CSV !les•
How long to preserve logs•
Memory size (Java heap)•
Run replicator locally or on separate host(s)8. Test and Deploy!
•
Typical test cycles for DW loading run to months•
Not weeks or days•
Use production data•
Monitoring/alertingAdvanced Replication Features
More Possibilities for Analytics...
OLTP Data
MySQL Master
Complex, near real-time
reporting
Light-weight, real-time operational status
Web-facing mini- data marts for SaaS users
Adding Clustering to MySQL
Replicator
Replicator nyc (master)
fra (master)
sfo (master)
nyc (slave) fra (slave) sfo (slave)
Replicator New York
Frankfurt
Replicator
WarehouseData
Conclusion
•
Data warehouses enable fast analytics on MySQL transactions•
Multiple data warehouse technologies•
Heterogenous data replication solves the problem of real-time loadingOne more thing:
WE’RE HIRING!!!
Continuent Web Page:
http://www.continuent.com Tungsten Replicator 2.0:
http://code.google.com/p/tungsten-replicator
Our Blogs:
http://scale-out-blog.blogspot.com http://datacharmer.org/blog
http://www.continuent.com/news/blogs 560 S. Winchester Blvd., Suite 500
San Jose, CA 95128 Tel +1 (866) 998-3642 Fax +1 (408) 668-1009
e-mail: [email protected]