• No results found

How, What, and Where of Data Warehouses for MySQL

N/A
N/A
Protected

Academic year: 2021

Share "How, What, and Where of Data Warehouses for MySQL"

Copied!
40
0
0

Loading.... (view fulltext now)

Full text

(1)

How, What, and Where of Data Warehouses for MySQL

Robert Hodges

(2)

Introducing Continuent

The leading provider of clustering and replication for open source DBMS

Our Product: Continuent Tungsten

Clustering - Commercial-grade HA, performance scaling and data management for MySQL

Replication - Flexible, high-performance data movement

(3)

Why Do MySQL Applications

Need a Data Warehouse?

(4)

De!ning the Problem

“In Retail War, Prices on Web Change Hourly”

(New York Times, Dec 1st, 2012)

(5)

Typical Schema for Sales Analytics

Sales

* customer

* product

* quantity

* sale type

* location

* discount

* sale_amount

* sale_time

* period

* payment_type

* campaign

Location

* city

* county

* state

Product

* sku

* product_type ...

Customer

* first_name

* last_name

* loyalty_rank

* street ...

Period

* hour

* day_of_week

* day_of_month

* week

(6)

InnoDB = Row Store

id cust_id prod_id ...

1 335301 532 ...

2 2378 6235 ...

3 ... ... ...

Sales Table

cust_id id

2378 2

335301 1

... ...

Cust_ID Index

prod_id id

532 1

6235 2

... ...

Prod_ID Index Clustered

by primary key

Row data stored together

Indexes use primary key Indexes

slow writes

(7)

Row Store + MySQL Server = OLTP

Fast update of small number of rows

Limited indexing (few, B-Tree only)

Minimal compression

Nested loop joins

Single-threaded query

Sharded data sets

(8)

OLTP != Analytics

Parallel execution

E"cient search on any column

Time series

Spatial query

Recursive query

Star schema organization

Data cubes/pivot tables (OLAP)

Business Intelligence (BI) tool integration

(9)

Solution: MySQL + Data Warehouse

Sharded MySQL for high transaction throughput

Data warehouse for fast analytics Near-realtime

loading

(10)

Data Warehouse Options

(11)

Commercial DBMS -- Oracle

Parallel query (automatic in 11G)

Hash, bitmap indexes

Wide variety of compression options

Amazingly advanced query optimizer

Star schemas with dimensions & hierarchies

Excellent vertical performance scaling

Stable and well-known BI tools

(12)

Column Store Architecture

Sales Table

cust_id 335301 2378 ...

prod_id 532

6235 ...

Column data

stored together Updates to entire row are hideously slow

quantity 1

3 ...

...

...

...

...

Every column

is an index Good

compression

(13)

Column Stores -- Vertica

PostgreSQL syntax (but little/no code)

Parallel query

Built-in star schema support

Time series support

Multiple compression methods

Built-in HA model

Widely used, excellent scaling

(14)

Column Store--Calpont In!niDB

Looks like MySQL to apps (with minor di#erences)

Distributed architecture with parallel query

Columns compressed and fully indexed

Automatic partitioning of data

Built-in HA using distributed data copies

(15)

NoSQL/Hadoop

Minimal SQL dialect (subset of SQL-92)

Data access is non-transparent

Hadoop is batch-oriented

Excellent horizontal scaling in cloud

Parallel query using map/reduce

Handles failures by automatic job resubmit

HiveQL is getting better fast

(16)

Real-Time Data Loading

(17)

Options for Loading Data Warehouse

1. Extract/Transfer/Load (ETL) software

Stable & good GUI tools but slow, resource intensive, has app a#ects

2. Do-it-yourself reads from the binlog

Unstable and hard to maintain (ask me how I know)

3. Real-time replication with Tungsten Replicator

Fast with minimal application load or disruption

(18)

DEMO

MySQL to Vertica replication with some bells and a whistle MySQL

db01 db02 db03

db01

renamed02

X

sysbench

sysbench

sysbench

(19)

Understanding Tungsten Replicator

Master

(Transactions + Metadata)

Slave

THL

DBMS Logs

Replicator

(Transactions + Metadata) THL

Replicator Download

transactions via network

Apply using JDBC

(20)

Pipelines with Parallel Apply

Extract Filter Apply

Stage

Extract Filter Apply

Stage Stage

Pipeline

Master

DBMS Transaction

History Log

In-Memory

Queue Slave

DBMS Extract Filter Apply Extract Filter Apply Extract Filter Apply

(21)

Real-Time Heterogeneous Transfer

MySQL Tungsten Master Replicator

Oracle

Service oracle

MySQLExtractor Special Filters

* Transform enum to string

binlog_format=row

Tungsten Slave Replicator Service oracle

Special Filters

* Ignore extra tables

* Map names to upper case

* Optimize updates to remove unchanged columns

MySQL Binlog

(22)

Column Store--Real-Time Batches

MySQL Tungsten Master Replicator Service my2vr

MySQLExtractor Special Filters

* pkey - Fill in pkey info

* colnames - Fill in names

* replicate - Ignore tables

binlog_format=row

Tungsten Slave Replicator Service my2vr

MySQL Binlog

CSV FilesCSV

FilesCSV FilesCSV

FilesCSV Files

Large transaction batches to leverage

load parallelization

(23)

Batch Loading--The Gory Details

Replicator

Service my2vr

Transactions from master

CSV FilesCSV

FilesCSV Files

Staging TablesStaging

TablesStaging Tables

Base TablesBase

TablesBase Tables

Merge

(or) COPY directly to COPY to

stage tables SELECT to base tables

(24)

Vertica Implementation Steps

(25)

0. Get Software and Documentation

Get the software:

http://code.google.com/p/tungsten-replicator

Get the documentation:

https://docs.continuent.com/wiki/display/TEDOC

(26)

1. Best Practices for MySQL

Single column keys

(Currently required by Tungsten)

UTF-8 data

GMT timezone

Row replication enabled

(27)

2. Handle Availability

What happens if MySQL fails?

What happens if a replicator fails?

What happens if Vertica fails?

(28)

3. Create Base Tables

/* MySQL table definition */

CREATE TABLE `sbtest` (

`id` int(10) unsigned NOT NULL AUTO_INCREMENT, `k` int(10) unsigned NOT NULL DEFAULT '0',

`c` char(120) NOT NULL DEFAULT '', `pad` char(60) NOT NULL DEFAULT '', PRIMARY KEY (`id`),

KEY `k` (`k`));

/* Vertica table definition */

create table db01.sbtest(

id int, k int,

c char(120), pad char(60) );

(29)

4. Provision Initial Data

Option 1 (Large data sets): CSV Loading

Option 2 (Small data sets): Run transactions through replicator itself

mysql> SELECT * from foo INTO OUTFILE ‘foo.csv’;

...

(Fix up data if necessary) ...

vsql> COPY foo FROM 'foo.csv'

DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"';

(30)

5. Select Tungsten Filter Options

Tables to ignore/include?

Schema/table/column renaming?

Map names to upper/lower case?

Custom !lters?

tungsten-installer --master-slave -a \ --service-name=mysql2vertica \

...

--svc-extractor-filters=replicate \ --svc-applier-filters=dbtransform \

--property=replicator.filter.replicate.do=db01.*,db02.* \ --property=replicator.filter.dbtransform.from_regex1=db02 \ --property=replicator.filter.dbtransform.to_regex1=renamed02 \ ...

(31)

5. Customize Merge Script

# Hacked load script for Vertica--deletes always precede inserts, so

# inserts can load directly.

# Extract deleted data keys and put in temp CSV file for deletes.

!egrep '^"D",' %%CSV_FILE%% |cut -d, -f4 > %%CSV_FILE%%.delete COPY %%STAGE_TABLE_FQN%% FROM '%%CSV_FILE%%.delete'

DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"'

# Delete rows using an IN clause. You could also set a column value to

# mark deleted rows.

DELETE FROM %%BASE_TABLE%% WHERE %%BASE_PKEY%% IN (SELECT %%STAGE_PKEY%% FROM %%STAGE_TABLE_FQN%%)

# Load inserts directly into base table from a separate CSV file.

!egrep '^"I",' %%CSV_FILE%% |cut -d, -f4- > %%CSV_FILE%%.insert COPY %%BASE_TABLE%% FROM '%%CSV_FILE%%.insert'

DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"'

(32)

6. Create Staging Tables

/* Full staging table */

create table db01.stage_xxx_sbtest(

tungsten_opcode char(1), tungsten_seqno int,

tungsten_row_id int, id int,

k int,

c char(120), pad char(60));

(OR)

/* Staging table with delete keys only. */

create table db01.stage_xxx_sbtest(id int);

(33)

7. Install Replicators

Master/slave vs. direct replication

Filter settings (and where to run them)

Directory to hold CSV !les

How long to preserve logs

Memory size (Java heap)

Run replicator locally or on separate host(s)

(34)

8. Test and Deploy!

Typical test cycles for DW loading run to months

Not weeks or days

Use production data

Monitoring/alerting

(35)

Advanced Replication Features

(36)

More Possibilities for Analytics...

OLTP Data

MySQL Master

Complex, near real-time

reporting

Light-weight, real-time operational status

Web-facing mini- data marts for SaaS users

(37)

Adding Clustering to MySQL

Replicator

Replicator nyc (master)

fra (master)

sfo (master)

nyc (slave) fra (slave) sfo (slave)

Replicator New York

Frankfurt

Replicator

WarehouseData

(38)

Conclusion

Data warehouses enable fast analytics on MySQL transactions

Multiple data warehouse technologies

Heterogenous data replication solves the problem of real-time loading

(39)

One more thing:

WE’RE HIRING!!!

(40)

Continuent Web Page:

http://www.continuent.com Tungsten Replicator 2.0:

http://code.google.com/p/tungsten-replicator

Our Blogs:

http://scale-out-blog.blogspot.com http://datacharmer.org/blog

http://www.continuent.com/news/blogs 560 S. Winchester Blvd., Suite 500

San Jose, CA 95128 Tel +1 (866) 998-3642 Fax +1 (408) 668-1009

e-mail: [email protected]

References

Related documents

Cole submitted, however, that such a reading of the paragraph ignored the references therein to section 2(a)(iv) and (2)(a)(ii) of the Act which, so he submitted, made it clear

Strategy #4: Review and update comprehensive park plan Action Step #1: Identify financial costs of plan components Action Step #2: Research funding options. Action Step #3:

Representative Jim Leach, one of the writers of the Gramm-Leach-Bliley Act, which repealed the Glass- Steagall Act, believes that the repeal made the current crisis

Zhang Guobin & Long Yu, Connectivity and International Law in the 21st Century Maritime Silk Road, in R ETHINKING S ILK R OAD , supra note 8, at 57, 58; see also Björn

Providing high quality administrative support, the Portfolio Office will liaise with the SLS Directorate Office, other Directorates and Faculties as appropriate to ensure

Provide (or sub-contract) optical fiber connectivity between UNHCR 94 Rue de Montbrillant, 1202 Geneva and the Supplier co-location center.. The Supplier shall provide the

For each of the three types of events, Clemson and USC football games or races at Darlington Raceway, Table 2 reports the mean real accommodations tax collections and the

Unknown Function Metabolic pathways Biosynthesis of secondary metabolites Transmembrane transport Carbon metabolism Biosynthesis of amino acids Meiosis - yeast