Determine the Right Analytic Database: A Survey of New Data Technologies

(1)

Determine the Right

Analytic Database:

A Survey of New

Data Technologies

O’Reilly Strata Conference February 1, 2011

Mark R. Madsen

http://ThirdNature.net

Twitter: @markmadsen

Atomic Avenue #1 by Glen Orbik

Key

Questions

▪

What

technologies

are

available?

▪

What

are

they

good

for?

▪

How

do

you

decide

which

to

use?

Page 2

(2)

Consequences of Commoditization: Data Volume

Time Data Generated Chipping GPS RFID Sensors Spimes You are here

H

(3)

Lots

of

H

“More” can become a qualitative rather than quantitative difference

Really

lots

of

H

(4)

An Unexpected Consequence of Data Volumes

Sums, counts and sorted results only get you so far.

An Unexpected Consequence of Data Volumes

Our ability to collect data is still outpacing our ability to derive meaning from it.

(5)

Don’t

worry

about

it.

We’ll

just

buy

more

hardware.

CPUs, memory and

storage track to very

similar curves

RIP

Moore’s

Law:

it

nearly

ground

to

a

halt

for

(6)

Technology Has Changed (a lot) But We Haven’t

1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 2000 1010 10 9 10 8 107 106 105 104 103 102 101 10 10‐1 01‐2 10‐3 10‐4 10‐5 10‐6 Calculations per second per $1000

Mechanical Relay Vacuum tube Transistor Integrated circuit Data: Ray Kurzweil, 2001

10,000 X improvement

Current DW architecture and methods start here in the mid-1980s

(7)

Technology Maturity (time + engineering effort)

New

Technology

Evolution

Means

New

Problems

1970 1980 1990 2000 2010 2020 Uniprocessor and custom CPU era Symmetric multi‐ processing era Massively parallel era Early engineering phase

Exploring, learning, inventing

Investment phase

Improving, perfecting, applying

Core problems solved 1010 10 9 10 8 107 106 105 104 103 102 101 10 10‐1 01‐2 10‐3 10‐4 10‐5 10‐6

What’s different?

Parallelism

We’re not getting more CPU

power, but more CPUs.

There are too many CPUs

relative to other resources,

creating an imbalance in

hardware platforms.

Most software is designed

for a single worker, not

high degrees of parallelism

(8)

Core

problem:

software

is

not

designed

for

parallel

work

Databases must be designed to permit local work with minimal global coordination and data redistribution.

(9)

Storage Improvements

For

data

workloads,

disk

throughput

still

key.

Improvements:

▪ Spinning disks at .05/GB

▪ Solid state disks remove

some latencies, read speed

of ~250MB/sec

▪ SSD capacity still rising

▪ Card storage (PCI), e.g.

FusionIO at 1.5GB/sec ▪ SSD is still costly at $2/GB up to $30/GB

Compression Applied to Stored Data

10x compression means 1 disk I/O can read

10x as much data, stretching your current

hardware investment

But it eats CPU and

memory.

YMMV

(10)

Scale‐up vs. Scale‐out Parallelism

Uniprocessor environments

required

chip

upgrades.

SMP

servers

can

grow

to

a

point,

then

it’s

a

forklift

upgrade

to

a

bigger

box.

MPP

servers

grow

by

adding

mode

nodes.

Database and Hardware Deployment Models

Three

levels

of

software

‐

hardware

integration:

▪ Database appliance (specialized hardware and software)

▪ Preconfigured (commodity) hardware with software

▪ Software on generic hardware

Then

there

are

the

hardware

‐

database

parallel

models:

Page 20

Shared Everything Shared Disk Shared Nothing

Database DB DB Database

(11)

In‐Memory Processing

1. Maybe

not

as

fast

you

think.

Depends

entirely

on

the

database

(e.g.

VectorWise)

2. So

far,

applied

mainly

to

shared

‐

nothing

models

3. Very

large

memories

are

more

applicable

to

shared

‐

nothing

than

shared

‐

memory

systems

Box‐limited Limited by node scaling

e.g. 2 TB max e.g. 16 nodes, 512MB per = 8TB

4. Still

an

expensive

way

to

get

performance

Columnar Databases

ID Name Salary 1 Marge Inovera $50,000 2 Anita Bath $120,000 3 Nadia Geddit $36,000 Marge Inovera Anita Bath Nadia Geddit $50,000 $120,000 $36,000 1 2 3 In a row-store model these three rows would be stored in sequential order as shown here, packed into a block.

In a column store model database they would be divided by columns and stored in different blocks.

Not just changing the storage layout. Also involves changes to the execution engine and query optimizer.

(12)

Column Stores Rule the TPC‐H Benchmark

Columnar Advantages and Disadvantages

+ Reduced I/O for queries not reading all columns

+ Better compression characteristics, meaning database size < raw data size (unlike row store) and less I/O + Ability to operate on compressed data, improving

overall system performance

+ Less manual tuning

‐ Slower inserts and updates (causing ELT and trickle‐ feed problems*)

‐ Worse for small retrievals and random I/O ‐ Uses more system memory and CPU

(13)

Advanced Analytic Methods Machine learning Statistics Numerical methods Text mining & text analytics Rules engines & constraint programming Information theory & IR Visualization

Explosion of Analytic Techniques

GIS

Map

‐

Reduce

is

a

parallel

programming

framework

that

allows

one

to

code

more

easily

across

a

distributed

computing

environment,

not

a

database.

So how do I query the database? It’s not a database, it’s a key-value store!

Ok, it’s not a database How do I query it? You write a distributed mapreduce function in erlang. Did you just tell me to go to hell? I believe I did, Bob.

(14)

What’s Different

No

database

No

schema

No

metadata

No

query

language*

Good

for:

▪ Processing lots of complex

or non‐relational data

▪ Batch processing for very

large amounts of data

* Hive, Hbase, Pig, others

Using MapReduce / Hadoop

28

Hadoop is one implementation of MapReduce. There are

different variations with different performance and resource

characteristics e.g. Dryad, CGL‐MR, MPI variants

Hadoop is only part of the solution. You need more for

enterprise deployment. Cloudera’s distribution for Hadoop

shows what a complete environment could look like.

(15)

How

Hadoop fits

into

a

traditional

BI

environment

Databases Documents Flat Files XML Queues ERP Applications

Source Environments

File loads ETL

Data Warehouse

Developers Analysts End Users

Development tools and IDEs

Analysis tools, BI BI, Applications

Data stores that augment or replace relational access and storage models with other methods.

Different storage models:

• Key‐value stores • Column families

• Object / document stores • Graphs

Different access models:

• SQL (rarely) • programming API • get/put

Reality: mostly suck for BI & analytics Analytic DB vendors are coming from the other direction:

• Aster Data – SQL wrapped around MR

• EMC (Greenplum) – MR on top of the database

NoSQL theoretically = “not only sql”, in reality…

(16)

Some

realities

to

consider

Cheap

performance?

▪ Do you have 20 blades

lying around unused?

▪ How much concurrency?

▪ How much effort to write

queries? Debug them?

▪ Performance comparisons:

10x slower on the same

hardware?

The

key

is

the

workload

type

and

the

scale

of

it.

Page 31

Do

you

really

need

a

rack

of

blades

for

computing?

Graphics co‐processors have

been used for certain problems

for years.

Offer single‐system solution to

offload very large compute‐

intensive problems.

Order of magnitude cost

reduction, order of magnitude

performance increase with

current technology today (for

compute‐intensive problems).

(17)

Other Options for analytic software deployment

The basic models.

1. Separate tools and systems

(MapReduce and nosql are a

simple variation on this theme)

2. Integrated with a database

3. Embedded in a database

The primary arguments about

deployment models center on

whether to take data to the

code or code to the data.

33

Leveraging the Database

Levels

of

database

integration:

▪ Native DB connector

▪ External integration

▪ Internal integration

▪ Embedded

+

Less

data

movement

+

Possible

dev

process

support

+

Hardware

/

environment

savings

+

Possible

“sandboxing”

support

‐

Limitations

on

techniques

(18)

In‐database Execution

You

can

do

a

lot

with

standards

‐

compliant

SQL

If

the

database

has

UDFs,

you

can

code

too

(but

it’s

harder)

Parallel

support

for

UDFs varies

Some

vendors

build

functions

directly

into

the

database,

(usually

scalar)

Iterative

algorithms

(ones

that

converge

on

a

solution)

are

problematic,

more

so

in

MPP

35

What are factors in the decision?

User

concurrency:

one

job

or

many

Repetition

is

a

key

element:

▪ Execute once and apply (build a response

or mortality model)

▪ Many executions daily (web cross‐sells)

In

‐

process

or

Batch?

▪ Batch and use results – segment, score

▪ In‐process reacts on demand – detect

fraud, recommend

In

‐

process

requires

thinking

about

how

it

integrates

with

the

calling

application.

(SQL

(19)

MATCHING

THE

PROBLEMS

TO

TECHNOLOGIES

The problem of size is three problems of volume.

Number of users! Computations! Amount of data!

(20)

Hardware Architectures and Deployment

Compute

and

data

sizes

are

the

key

requirements

39 Data volume <10s GB 100s GB 1s TB 10s TB 100sTB PB PC Shared everything or shared disk Shared nothing MR and related Computations MF GF TF PF

Hardware Architectures and Deployment

40 Data volume <10s GB 100s GB 1s TB 10s TB 100sTB PB Computations MF GF TF PF

Today’s

reality,

and

true

for

a

while

in

most

businesses.

The bulk of the market resides here!

(21)

Hardware Architectures and Deployment

41 Data volume <10s GB 100s GB 1s TB 10s TB 100sTB PB Computations MF GF TF PF

Today’s

reality,

and

true

for

a

while

in

most

businesses.

The bulk of the market resides here!

…but analytics pushes many things into the MPP zone.

The

real

question:

why

do

you

want

a

new

platform?

Trouble

doing

what

you

already

do

today

▪ Poor response times

▪ Not meeting availability deadlines

Doing

more

of

what

you

do

today

▪ Adding users, mining more data

Doing

something

new

with

your

data

▪ Data mining, recommendations, embedded real‐time

process support

What’s

desired

is

possible

but

limited

by

the

cost

of

supporting

or

growing

the

existing

environment.

(22)

The

World

According

to

Gartner:

One

Magical

Quadrant

SQL Server 2008 R2 (PDW)

Official production customers?

EMC / Greenplum

SQL limitations

Memory / concurrency issues

Ingres

OLTP database

Illuminate

SQL limitations

Very limited scalability

Sun

MySQL for a DW, is this a joke? 43

Magic Quadrant for Data Warehouse Database Management Systems

The

assumption

of

the

warehouse

as

a

database

is

gone

44 Traditional tabular or structured data Data at rest Non-traditional

data (logs, audio, documents) Parallel programming platforms Databases Streaming DBs/engines Message streams Data in motion Slide 44

(23)

Data Access Differences

Basic

data

access

styles:

▪ Standard BI and reporting

▪ Dashboards / scorecards

▪ Operational BI

▪ Ad‐hoc query and analysis

▪ Batch analytics

▪ Embedded analytics

Data

loading

styles:

▪ Refresh

▪ Incremental

▪ Constant

Evaluating ADB Options

Storage style:

▪ Files, tables, columns, cubes, KV

Storage type:

▪ Memory, disk, hybrid, compressed

Scaling model:

▪ SMP, clustered, MPP, distributed

Deployment model:

▪ Appliance, cloud, SaaS, on‐premise

Data access model:

▪ SQL, MapReduce, R, languages, etc.

License options:

▪ CPU, data size, subscription

(24)

What’s it going to cost? A small sample at list:

Solution Pricing model Price/unit 1 TB solution Remarks

DatAupia Node $ 19,500/2TB $ 19,500 You can’t buy a 1

TB Satori server

Kickfire (out of business)

Data Volume

(raw) $ 50,000,-/TB $ 50,000 Includes MySQL5.1 Enterprise

Vertica Data Volume

(raw) $ 100,000/TB $ 200,000 Based on 5 nodes, $ 20,000 each

ParAccel Data Volume

(raw) $ 100,000/TB $ 200,000 Based on 5 nodes, $ 20,000 each

EXASOL Data Volume

(active) $ 1,350/GB(€1,000/GB)

$ 350,000* Based on 4 nodes, $ 20,000 each

Teradata Node $ 99,000 / TB $ 99,000** Based on 2550

base configuration

* 1TB raw ±200 GB active, **realistic configuration likely 2x this price

47

Factors and Tradeoffs

The core tradeoff is not always

money for performance.

What else do you trade?

• Load time

• Trickle feeds

• New ETL tools

• New BI tools

• Operational complexity:

•Data integration and

management

•Backups

•Hardware maintenance

(25)

The

Path

to

Performance

1. Laborware – tuning

2. Upgrade – try to solve the

problem without changing

out the database

3. Extend – add an ADB or

Hadoop cluster to the

environment to offload a

specific workload

4. Replace – out with the old,

in with the new

Page 49

(26)

The Future

Assuming

database

market

embraces

MPP,

you

have

compute

power

that

exceeds

what

the

DB

itself

needs.

Why

not

execute

the

code

at

the

data?

Even

without

MPP,

moving

to

in

‐

database

analytic

processing

is

a

future

direction

and

is

workable

for

a

large

number

of

people.

51

(27)

Image Attributions

Thanks to the people who supplied the images used in this presentation:

Atomic Avenue #1 by Glen Orbikhttp://www.orbikart.com/gallery/displayimage.php?album=4&pos=5

spices.jpg ‐http://flickr.com/photos/oberazzi/387992959/

Black hole galaxy ‐http://www.flickr.com/photos/badastronomy/3176565627/

weaver peru.jpg ‐http://flickr.com/photos/slack12/442373910/

rc toy truck.jpg ‐http://flickr.com/photos/texas_hillsurfer/2683650363/

automat purple2.jpg ‐http://flickr.com/photos/alaina/288199169/

open_air_market_bologna‐http://flickr.com/photos/pattchi/181259150/

bored_girl.jpg ‐http://www.flickr.com/photos/alejandrosandoval/280691168/

path_vecchia.jpg ‐http://www.flickr.com/photos/funadium/2320388358/

fast kids truck peru.jpg ‐http://flickr.com/photos/zerega/1029076197/

What’s best for which types of problems?*

Page 54

Shared nothing will be best for solving large data problems, regardless

of workload or concurrency.

Column‐stores will improve query response time problems for most

traditional query and aggregation workloads.

Row‐stores will be better for operational BI or embedded BI.

Fast storage always makes things better, but is only cost‐effective for

medium scale or smaller data.

Compression will help everyone, but column‐stores more than row

stores because of how the engines work.

Map‐Reduce and distributed filesystems offer advantages of a schema‐

less storage & analytic layer that can process into relational databases. SMP and in‐memory will be better for high complexity problems under

moderate data scale, shared‐nothing and MR for large data scale. *The answer is always “it depends”

(28)

About the Presenter

Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, analytics and

performance management. Mark is an award-winning author, architect and former CTO whose work has been featured in numerous industry publications. During his career Mark received awards from the American Productivity & Quality Center, TDWI, Computerworld and the Smithsonian Institute. He is an international speaker, contributing editor at Intelligent Enterprise, and manages the open source channel at the Business Intelligence Network. For more information or to contact Mark, visit http://ThirdNature.net.

About Third Nature

Third Nature is a research and consulting firm focused on new and emerging technology and practices in business intelligence, data

integration and information management. If your question is related to BI, open source, web 2.0 or data integration then you‘re at the right place. Our goal is to help companies take advantage of information-driven management practices and applications. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.

We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating the products rather than vendor market positions.