
Determine the Right Analytic Database: A Survey of New Data Technologies


O’Reilly Strata Conference February 1, 2011

Mark R. Madsen

http://ThirdNature.net

Twitter: @markmadsen

Atomic Avenue #1 by Glen Orbik

Key Questions

What technologies are available?

What are they good for?

How do you decide which to use?

Consequences of Commoditization: Data Volume

[Chart: data generated over time, with successive sources (chipping, GPS, RFID, sensors, spimes) and a "You are here" marker]

Lots of H

“More” can become a qualitative rather than quantitative difference

Really lots of H

An Unexpected Consequence of Data Volumes

Sums, counts and sorted results only get you so far.


Our ability to collect data is still outpacing our ability to derive meaning from it.

Don’t worry about it. We’ll just buy more hardware.

CPUs, memory and storage track to very similar curves

RIP Moore’s Law: it nearly ground to a halt for

Technology Has Changed (a lot) But We Haven’t

[Chart: calculations per second per $1000 on a log scale from 10^-6 to 10^10, by decade from 1900 to 2000, across mechanical, relay, vacuum tube, transistor and integrated circuit technologies. Data: Ray Kurzweil, 2001]

10,000 X improvement

Current DW architecture and methods start here in the mid-1980s

New Technology Evolution Means New Problems

[Chart: technology maturity (time + engineering effort) from 1970 to 2020, spanning the uniprocessor and custom CPU era, the symmetric multi‐processing era and the massively parallel era]

Early engineering phase: exploring, learning, inventing

Investment phase: improving, perfecting, applying

Core problems solved

What’s different?

Parallelism

We’re not getting more CPU power, but more CPUs.

There are too many CPUs relative to other resources, creating an imbalance in hardware platforms.

Most software is designed for a single worker, not high degrees of parallelism.

Core problem: software is not designed for parallel work

Databases must be designed to permit local work with minimal global coordination and data redistribution.
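The idea of local work with minimal global coordination can be sketched in a few lines of Python. This is an illustrative toy, not how any particular database is built: the function names, the three-node setup and the sample rows are all assumptions made for the example. Rows are hash-partitioned by key so that each node can aggregate on its own, and the only global step is collecting the per-node results.

```python
from collections import defaultdict


def partition(rows, num_nodes):
    """Hash-partition (key, value) rows so each key lives on exactly one node."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in rows:
        nodes[hash(key) % num_nodes].append((key, value))
    return nodes


def local_sum(node_rows):
    """Work done independently on one node: sum values per key."""
    totals = defaultdict(int)
    for key, value in node_rows:
        totals[key] += value
    return dict(totals)


def merge(partials):
    """The only global step: combine per-node results.

    Because keys were partitioned, no key appears on two nodes,
    so merging needs no further arithmetic or data redistribution.
    """
    result = {}
    for partial in partials:
        result.update(partial)
    return result


# Hypothetical sample data: (region, amount) pairs.
rows = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
partials = [local_sum(node) for node in partition(rows, num_nodes=3)]
totals = merge(partials)
print(totals)
```

If the data were partitioned on a different column than the one being grouped, the nodes would first have to exchange rows, which is the redistribution cost the slide warns about.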

Storage Improvements

For data workloads, disk throughput is still key.

Improvements:

▪ Spinning disks at $.05/GB

▪ Solid state disks remove some latencies, read speed of ~250MB/sec

▪ SSD capacity still rising

▪ Card storage (PCI), e.g. FusionIO at 1.5GB/sec

▪ SSD is still costly at $2/GB up to $30/GB

Compression Applied to Stored Data

10x compression means 1 disk I/O can read 10x as much data, stretching your current hardware investment.

But it eats CPU and memory.

YMMV
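The trade described above (fewer bytes to read, more CPU spent) can be observed with Python's standard zlib module. This is only a rough sketch: the sample record is invented and artificially repetitive, so the ratio it produces is far better than real data would give, and zlib is a stand-in for whatever codec a given database actually uses.

```python
import time
import zlib

# Hypothetical, highly repetitive sample data (an assumption for the demo).
raw = b"2011-02-01,store_042,SKU-1001,3,19.99\n" * 100_000

start = time.perf_counter()
packed = zlib.compress(raw, 6)          # CPU cost paid when writing
compress_seconds = time.perf_counter() - start

start = time.perf_counter()
unpacked = zlib.decompress(packed)      # CPU cost paid on every read
decompress_seconds = time.perf_counter() - start

ratio = len(raw) / len(packed)          # how much less data one I/O must move
print(f"ratio {ratio:.0f}x, "
      f"compress {compress_seconds:.4f}s, decompress {decompress_seconds:.4f}s")
```

Timings depend entirely on the machine and the data, which is the point of "YMMV".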

Scale‐up vs. Scale‐out Parallelism

Uniprocessor environments required chip upgrades.

SMP servers can grow to a point, then it’s a forklift upgrade to a bigger box.

MPP servers grow by adding more nodes.

Database and Hardware Deployment Models

Three levels of software‐hardware integration:

▪ Database appliance (specialized hardware and software)

▪ Preconfigured (commodity) hardware with software

▪ Software on generic hardware

Then there are the hardware‐database parallel models:

▪ Shared everything

▪ Shared disk

▪ Shared nothing

In‐Memory Processing

1. Maybe not as fast as you think. Depends entirely on the database (e.g. VectorWise)

2. So far, applied mainly to shared‐nothing models

3. Very large memories are more applicable to shared‐nothing than shared‐memory systems

▪ Shared memory: box‐limited, e.g. 2 TB max

▪ Shared nothing: limited by node scaling, e.g. 16 nodes, 512GB per = 8TB

4. Still an expensive way to get performance

Columnar Databases

ID | Name | Salary
1 | Marge Inovera | $50,000
2 | Anita Bath | $120,000
3 | Nadia Geddit | $36,000

In a row-store model these three rows would be stored in sequential order as shown here, packed into a block.

In a column store model database they would be divided by columns and stored in different blocks:

1 | 2 | 3
Marge Inovera | Anita Bath | Nadia Geddit
$50,000 | $120,000 | $36,000

Not just changing the storage layout. Also involves changes to the execution engine and query optimizer.
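The two layouts can be mimicked with plain Python lists, using the three rows from the slide. This is a toy model of the storage layout only (lists standing in for disk blocks, and the function names are made up for the example); it does not model the execution engine or optimizer changes mentioned above.

```python
rows = [
    (1, "Marge Inovera", 50_000),
    (2, "Anita Bath", 120_000),
    (3, "Nadia Geddit", 36_000),
]

# Row store: one block holds complete rows, one after another.
row_block = list(rows)

# Column store: each column is stored in its own block.
ids, names, salaries = (list(column) for column in zip(*rows))
column_blocks = {"id": ids, "name": names, "salary": salaries}


def total_salary_row_store(block):
    # Every row is read in full, even though only one field is needed.
    return sum(row[2] for row in block)


def total_salary_column_store(blocks):
    # Only the salary block is read; id and name are never touched.
    return sum(blocks["salary"])


print(total_salary_row_store(row_block))
print(total_salary_column_store(column_blocks))
```

Fetching one complete row is the opposite case: the row store returns it from one block, while the column store has to visit every column block.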


Column Stores Rule the TPC‐H Benchmark

Columnar Advantages and Disadvantages

+ Reduced I/O for queries not reading all columns

+ Better compression characteristics, meaning database size < raw data size (unlike row store) and less I/O

+ Ability to operate on compressed data, improving overall system performance

+ Less manual tuning

‐ Slower inserts and updates (causing ELT and trickle‐feed problems*)

‐ Worse for small retrievals and random I/O

‐ Uses more system memory and CPU
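One way to picture "operating on compressed data" is run-length encoding, where a sorted or low-cardinality column is stored as (value, run length) pairs. The slides do not say which encodings any product uses, so RLE here is an assumption chosen for illustration, and the column values are invented. The aggregates below are computed from the encoded pairs without expanding them back into individual values.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_sum(encoded):
    """Sum the column using one multiplication per run, not one add per row."""
    return sum(value * count for value, count in encoded)


def rle_count_equal(encoded, target):
    """Count rows matching a predicate by inspecting runs only."""
    return sum(count for value, count in encoded if value == target)


# Hypothetical column of region codes, already sorted.
region_codes = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
encoded = rle_encode(region_codes)

print(encoded)
print(rle_sum(encoded), rle_count_equal(encoded, 3))
```

The same property explains the insert penalty in the list above: adding a value in the middle of a run-encoded column means rewriting runs rather than appending a row.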

Explosion of Analytic Techniques

Advanced analytic methods:

▪ Machine learning

▪ Statistics

▪ Numerical methods

▪ Text mining & text analytics

▪ Rules engines & constraint programming

▪ Information theory & IR

▪ Visualization

▪ GIS

Map‐Reduce is a parallel programming framework that allows one to code more easily across a distributed computing environment, not a database.

So how do I query the database?

It’s not a database, it’s a key-value store!

Ok, it’s not a database. How do I query it?

You write a distributed mapreduce function in erlang.

Did you just tell me to go to hell?

I believe I did, Bob.
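To make the exchange above concrete, here is roughly what "writing a mapreduce function" involves, shown as a single-process word count. It is written in Python rather than Erlang, runs on one machine, and uses made-up function names and sample documents; a real framework would run the map and reduce calls on many nodes and perform the shuffle over the network.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the framework's job)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input records.
documents = ["the data warehouse", "the data", "big data"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Even this trivial count takes three functions, which is the effort the dialogue is poking fun at when compared with a one-line query.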

What’s Different

No database

No schema

No metadata

No query language*

Good for:

▪ Processing lots of complex or non‐relational data

▪ Batch processing for very large amounts of data

* Hive, HBase, Pig, others

Using MapReduce / Hadoop

Hadoop is one implementation of MapReduce. There are different variations with different performance and resource characteristics, e.g. Dryad, CGL‐MR, MPI variants.

Hadoop is only part of the solution. You need more for enterprise deployment. Cloudera’s distribution for Hadoop shows what a complete environment could look like.

How Hadoop fits into a traditional BI environment

[Diagram: source environments (databases, documents, flat files, XML, queues, ERP, applications) feed the data warehouse through file loads and ETL. Developers use development tools and IDEs, analysts use analysis tools and BI, end users use BI and applications.]

NoSQL theoretically = “not only sql”, in reality…

Data stores that augment or replace relational access and storage models with other methods.

Different storage models:

• Key‐value stores

• Column families

• Object / document stores

• Graphs

Different access models:

• SQL (rarely)

• programming API

• get/put

Reality: mostly suck for BI & analytics

Analytic DB vendors are coming from the other direction:

• Aster Data – SQL wrapped around MR

• EMC (Greenplum) – MR on top of the database
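A toy key-value store shows why the get/put access model is awkward for BI and analytics. The class below is a hypothetical in-memory stand-in, not the API of any real product, and the order records are invented. Lookup by key is a single call, but anything resembling GROUP BY has to be written by hand as a full scan.

```python
class KeyValueStore:
    """Minimal in-memory stand-in for a get/put data store (illustrative only)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def scan(self):
        """Iterate over every (key, value) pair; the only way to 'query'."""
        return iter(self._data.items())


store = KeyValueStore()
store.put("order:1", {"region": "east", "amount": 10})
store.put("order:2", {"region": "west", "amount": 5})
store.put("order:3", {"region": "east", "amount": 7})

# Operational access: fetch one record by its key.
one_order = store.get("order:2")

# Analytic access: no query language, so the aggregation is application code.
totals = {}
for _key, order in store.scan():
    totals[order["region"]] = totals.get(order["region"], 0) + order["amount"]

print(one_order, totals)
```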

Some realities to consider

Cheap performance?

▪ Do you have 20 blades lying around unused?

▪ How much concurrency?

▪ How much effort to write queries? Debug them?

▪ Performance comparisons: 10x slower on the same hardware?

The key is the workload type and the scale of it.

Do you really need a rack of blades for computing?

Graphics co‐processors have been used for certain problems for years.

Offer single‐system solution to offload very large compute‐intensive problems.

Order of magnitude cost reduction, order of magnitude performance increase with current technology today (for compute‐intensive problems).

Other Options for analytic software deployment

The basic models.

1. Separate tools and systems (MapReduce and nosql are a simple variation on this theme)

2. Integrated with a database

3. Embedded in a database

The primary arguments about deployment models center on whether to take data to the code or code to the data.

Leveraging the Database

Levels of database integration:

▪ Native DB connector

▪ External integration

▪ Internal integration

▪ Embedded

+ Less data movement

+ Possible dev process support

+ Hardware / environment savings

+ Possible “sandboxing” support

‐ Limitations on techniques

In‐database Execution

You can do a lot with standards‐compliant SQL

If the database has UDFs, you can code too (but it’s harder)

Parallel support for UDFs varies

Some vendors build functions directly into the database (usually scalar)

Iterative algorithms (ones that converge on a solution) are problematic, more so in MPP
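The difference between a scalar UDF and a user-defined aggregate can be tried out with Python's built-in sqlite3 module. SQLite is used here only because it ships with Python; it is not an analytic database, it runs in a single process, and so it says nothing about the parallel UDF support mentioned above. The table, the `discount` function and the `value_range` aggregate are all invented for the example.

```python
import sqlite3


def discount(amount, rate):
    """Scalar UDF: called once per row, returns one value per row."""
    return round(amount * (1 - rate), 2)


class ValueRange:
    """Aggregate UDF: sees many rows, returns one value per group."""

    def __init__(self):
        self.low = None
        self.high = None

    def step(self, value):
        if value is None:
            return
        self.low = value if self.low is None else min(self.low, value)
        self.high = value if self.high is None else max(self.high, value)

    def finalize(self):
        if self.low is None:
            return None
        return self.high - self.low


conn = sqlite3.connect(":memory:")
conn.create_function("discount", 2, discount)
conn.create_aggregate("value_range", 1, ValueRange)

conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 40.0), ("west", 75.0)],
)

# The code runs inside the query, next to the data, instead of pulling rows out.
scalar_rows = conn.execute(
    "SELECT region, discount(amount, 0.1) FROM sales ORDER BY rowid"
).fetchall()
range_rows = conn.execute(
    "SELECT region, value_range(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(scalar_rows, range_rows)
```

An iterative algorithm would need to run queries like these repeatedly until its result stops changing, which hints at why convergence-style methods fit this model poorly.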


What are factors in the decision?

User concurrency: one job or many

Repetition is a key element:

▪ Execute once and apply (build a response or mortality model)

▪ Many executions daily (web cross‐sells)

In‐process or Batch?

▪ Batch and use results – segment, score

▪ In‐process reacts on demand – detect fraud, recommend

In‐process requires thinking about how it integrates with the calling application. (SQL

MATCHING THE PROBLEMS TO TECHNOLOGIES

The problem of size is three problems of volume.

Number of users! Computations! Amount of data!

Hardware Architectures and Deployment

Compute and data sizes are the key requirements

[Chart: data volume (<10s GB, 100s GB, 1s TB, 10s TB, 100s TB, PB) against computations (MF, GF, TF, PF), with zones for PC, shared everything or shared disk, shared nothing, and MR and related]

Today’s reality, and true for a while in most businesses.

The bulk of the market resides here!

…but analytics pushes many things into the MPP zone.

The real question: why do you want a new platform?

Trouble doing what you already do today

▪ Poor response times

▪ Not meeting availability deadlines

Doing more of what you do today

▪ Adding users, mining more data

Doing something new with your data

▪ Data mining, recommendations, embedded real‐time process support

What’s desired is possible but limited by the cost of supporting or growing the existing environment.

The World According to Gartner: One Magical Quadrant

▪ SQL Server 2008 R2 (PDW): official production customers?

▪ EMC / Greenplum: SQL limitations, memory / concurrency issues

▪ Ingres: OLTP database

▪ Illuminate: SQL limitations, very limited scalability

▪ Sun: MySQL for a DW, is this a joke?

Magic Quadrant for Data Warehouse Database Management Systems

The assumption of the warehouse as a database is gone

[Diagram: data at rest (traditional tabular or structured data; non-traditional data such as logs, audio, documents) and data in motion (message streams), served by databases, parallel programming platforms and streaming DBs/engines]

Data Access Differences

Basic data access styles:

▪ Standard BI and reporting

▪ Dashboards / scorecards

▪ Operational BI

▪ Ad‐hoc query and analysis

▪ Batch analytics

▪ Embedded analytics

Data loading styles:

▪ Refresh

▪ Incremental

▪ Constant

Evaluating ADB Options

Storage style:

▪ Files, tables, columns, cubes, KV

Storage type:

▪ Memory, disk, hybrid, compressed

Scaling model:

▪ SMP, clustered, MPP, distributed

Deployment model:

▪ Appliance, cloud, SaaS, on‐premise

Data access model:

▪ SQL, MapReduce, R, languages, etc.

License options:

▪ CPU, data size, subscription


What’s it going to cost? A small sample at list:

Solution | Pricing model | Price/unit | 1 TB solution | Remarks
Dataupia | Node | $19,500/2TB | $19,500 | You can’t buy a 1 TB Satori server
Kickfire (out of business) | Data volume (raw) | $50,000/TB | $50,000 | Includes MySQL 5.1 Enterprise
Vertica | Data volume (raw) | $100,000/TB | $200,000 | Based on 5 nodes, $20,000 each
ParAccel | Data volume (raw) | $100,000/TB | $200,000 | Based on 5 nodes, $20,000 each
EXASOL | Data volume (active) | $1,350/GB (€1,000/GB) | $350,000* | Based on 4 nodes, $20,000 each
Teradata | Node | $99,000/TB | $99,000** | Based on 2550 base configuration

* 1TB raw ±200 GB active, ** realistic configuration likely 2x this price
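The "1 TB solution" figures for the volume-priced products appear to be license cost plus hardware. That reading is an assumption, since the slide does not state the formula, but it does reproduce the Vertica, ParAccel and EXASOL numbers, as this small calculation shows (the function name is made up for the example):

```python
def one_tb_solution_cost(price_per_unit, units, nodes=0, node_price=0):
    """Assumed formula: license (price per unit x units) plus node hardware."""
    return price_per_unit * units + nodes * node_price


# Vertica / ParAccel: $100,000 per raw TB, 1 TB, on 5 nodes at $20,000 each.
vertica = one_tb_solution_cost(100_000, 1, nodes=5, node_price=20_000)

# EXASOL: $1,350 per active GB; 1 TB raw is taken as about 200 GB active,
# per the footnote, on 4 nodes at $20,000 each.
exasol = one_tb_solution_cost(1_350, 200, nodes=4, node_price=20_000)

print(vertica, exasol)
```

The comparison also shows how much the unit of pricing matters: raw terabytes, active gigabytes and nodes are not interchangeable, so list prices only become comparable once they are converted to the same workload.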


Factors and Tradeoffs

The core tradeoff is not always money for performance.

What else do you trade?

• Load time

• Trickle feeds

• New ETL tools

• New BI tools

• Operational complexity:

  • Data integration and management

  • Backups

  • Hardware maintenance

The Path to Performance

1. Laborware – tuning

2. Upgrade – try to solve the problem without changing out the database

3. Extend – add an ADB or Hadoop cluster to the environment to offload a specific workload

4. Replace – out with the old, in with the new

The Future

Assuming the database market embraces MPP, you have compute power that exceeds what the DB itself needs.

Why not execute the code at the data?

Even without MPP, moving to in‐database analytic processing is a future direction and is workable for a large number of people.

Image Attributions

Thanks to the people who supplied the images used in this presentation:

Atomic Avenue #1 by Glen Orbik ‐ http://www.orbikart.com/gallery/displayimage.php?album=4&pos=5

spices.jpg ‐ http://flickr.com/photos/oberazzi/387992959/

Black hole galaxy ‐ http://www.flickr.com/photos/badastronomy/3176565627/

weaver peru.jpg ‐ http://flickr.com/photos/slack12/442373910/

rc toy truck.jpg ‐ http://flickr.com/photos/texas_hillsurfer/2683650363/

automat purple2.jpg ‐ http://flickr.com/photos/alaina/288199169/

open_air_market_bologna ‐ http://flickr.com/photos/pattchi/181259150/

bored_girl.jpg ‐ http://www.flickr.com/photos/alejandrosandoval/280691168/

path_vecchia.jpg ‐ http://www.flickr.com/photos/funadium/2320388358/

fast kids truck peru.jpg ‐ http://flickr.com/photos/zerega/1029076197/

What’s best for which types of problems?*

Shared nothing will be best for solving large data problems, regardless of workload or concurrency.

Column‐stores will improve query response time problems for most traditional query and aggregation workloads.

Row‐stores will be better for operational BI or embedded BI.

Fast storage always makes things better, but is only cost‐effective for medium scale or smaller data.

Compression will help everyone, but column‐stores more than row stores because of how the engines work.

Map‐Reduce and distributed filesystems offer advantages of a schema‐less storage & analytic layer that can process into relational databases.

SMP and in‐memory will be better for high complexity problems under moderate data scale, shared‐nothing and MR for large data scale.

*The answer is always “it depends”

About the Presenter

Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, analytics and performance management. Mark is an award-winning author, architect and former CTO whose work has been featured in numerous industry publications. During his career Mark received awards from the American Productivity & Quality Center, TDWI, Computerworld and the Smithsonian Institute. He is an international speaker, contributing editor at Intelligent Enterprise, and manages the open source channel at the Business Intelligence Network. For more information or to contact Mark, visit http://ThirdNature.net.

About Third Nature

Third Nature is a research and consulting firm focused on new and emerging technology and practices in business intelligence, data integration and information management. If your question is related to BI, open source, web 2.0 or data integration then you’re at the right place. Our goal is to help companies take advantage of information-driven management practices and applications. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.

We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating the products rather than vendor market positions.
