Determine the Right
Analytic Database:
A Survey of New
Data Technologies
O’Reilly Strata Conference February 1, 2011
Mark R. Madsen
http://ThirdNature.net
Twitter: @markmadsen
Atomic Avenue #1 by Glen Orbik
Key
Questions
▪
What
technologies
are
available?
▪
What
are
they
good
for?
▪
How
do
you
decide
which
to
use?
Page 2
Consequences of Commoditization: Data Volume
Time Data Generated Chipping GPS RFID Sensors Spimes You are hereH
Lots
of
H
“More” can become a qualitative rather than quantitative difference
Really
lots
of
H
An Unexpected Consequence of Data Volumes
Sums, counts and sorted results only get you so far.
An Unexpected Consequence of Data Volumes
Our ability to collect data is still outpacing our ability to derive meaning from it.
Don’t
worry
about
it.
We’ll
just
buy
more
hardware.
CPUs, memory and
storage track to very
similar curves
RIP
Moore’s
Law:
it
nearly
ground
to
a
halt
for
Technology Has Changed (a lot) But We Haven’t
1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 2000 1010 10 9 10 8 107 106 105 104 103 102 101 10 10‐1 01‐2 10‐3 10‐4 10‐5 10‐6 Calculations per second per $1000Mechanical Relay Vacuum tube Transistor Integrated circuit Data: Ray Kurzweil, 2001
10,000 X improvement
Current DW architecture and methods start here in the mid-1980s
Technology Maturity (time + engineering effort)
New
Technology
Evolution
Means
New
Problems
1970 1980 1990 2000 2010 2020 Uniprocessor and custom CPU era Symmetric multi‐ processing era Massively parallel era Early engineering phase
Exploring, learning, inventing
Investment phase
Improving, perfecting, applying
Core problems solved 1010 10 9 10 8 107 106 105 104 103 102 101 10 10‐1 01‐2 10‐3 10‐4 10‐5 10‐6
What’s different?
ParallelismWe’re not getting more CPU
power, but more CPUs.
There are too many CPUs
relative to other resources,
creating an imbalance in
hardware platforms.
Most software is designed
for a single worker, not
high degrees of parallelism
Core
problem:
software
is
not
designed
for
parallel
work
Databases must be designed to permit local work with minimal global coordination and data redistribution.
Storage Improvements
For
data
workloads,
disk
throughput
still
key.
Improvements:
▪ Spinning disks at .05/GB
▪ Solid state disks remove
some latencies, read speed
of ~250MB/sec
▪ SSD capacity still rising
▪ Card storage (PCI), e.g.
FusionIO at 1.5GB/sec ▪ SSD is still costly at $2/GB up to $30/GB
Compression Applied to Stored Data
10x compression means 1 disk I/O can read
10x as much data, stretching your current
hardware investment
But it eats CPU and
memory.
YMMV
Scale‐up vs. Scale‐out Parallelism
Uniprocessor environments
required
chip
upgrades.
SMP
servers
can
grow
to
a
point,
then
it’s
a
forklift
upgrade
to
a
bigger
box.
MPP
servers
grow
by
adding
mode
nodes.
Database and Hardware Deployment Models
Three
levels
of
software
‐
hardware
integration:
▪ Database appliance (specialized hardware and software)
▪ Preconfigured (commodity) hardware with software
▪ Software on generic hardware
Then
there
are
the
hardware
‐
database
parallel
models:
Page 20
Shared Everything Shared Disk Shared Nothing
Database DB DB Database
In‐Memory Processing
1. Maybe
not
as
fast
you
think.
Depends
entirely
on
the
database
(e.g.
VectorWise)
2. So
far,
applied
mainly
to
shared
‐
nothing
models
3. Very
large
memories
are
more
applicable
to
shared
‐
nothing
than
shared
‐
memory
systems
Box‐limited Limited by node scaling
e.g. 2 TB max e.g. 16 nodes, 512MB per = 8TB
4. Still
an
expensive
way
to
get
performance
Columnar Databases
ID Name Salary 1 Marge Inovera $50,000 2 Anita Bath $120,000 3 Nadia Geddit $36,000 Marge Inovera Anita Bath Nadia Geddit $50,000 $120,000 $36,000 1 2 3 In a row-store model these three rows would be stored in sequential order as shown here, packed into a block.In a column store model database they would be divided by columns and stored in different blocks.
Not just changing the storage layout. Also involves changes to the execution engine and query optimizer.
Column Stores Rule the TPC‐H Benchmark
Columnar Advantages and Disadvantages
+ Reduced I/O for queries not reading all columns+ Better compression characteristics, meaning database size < raw data size (unlike row store) and less I/O + Ability to operate on compressed data, improving
overall system performance
+ Less manual tuning
‐ Slower inserts and updates (causing ELT and trickle‐ feed problems*)
‐ Worse for small retrievals and random I/O ‐ Uses more system memory and CPU
Advanced Analytic Methods Machine learning Statistics Numerical methods Text mining & text analytics Rules engines & constraint programming Information theory & IR Visualization
Explosion of Analytic Techniques
GIS
Map
‐
Reduce
is
a
parallel
programming
framework
that
allows
one
to
code
more
easily
across
a
distributed
computing
environment,
not
a
database.
So how do I query the database? It’s not a database, it’s a key-value store!
Ok, it’s not a database How do I query it? You write a distributed mapreduce function in erlang. Did you just tell me to go to hell? I believe I did, Bob.
What’s Different
No
database
No
schema
No
metadata
No
query
language*
Good
for:
▪ Processing lots of complex
or non‐relational data
▪ Batch processing for very
large amounts of data
* Hive, Hbase, Pig, others
Using MapReduce / Hadoop
28
Hadoop is one implementation of MapReduce. There are
different variations with different performance and resource
characteristics e.g. Dryad, CGL‐MR, MPI variants
Hadoop is only part of the solution. You need more for
enterprise deployment. Cloudera’s distribution for Hadoop
shows what a complete environment could look like.
How
Hadoop fits
into
a
traditional
BI
environment
Databases Documents Flat Files XML Queues ERP Applications
Source Environments
File loads ETL
Data Warehouse
Developers Analysts End Users
Development tools and IDEs
Analysis tools, BI BI, Applications
Data stores that augment or replace relational access and storage models with other methods.
Different storage models:
• Key‐value stores • Column families
• Object / document stores • Graphs
Different access models:
• SQL (rarely) • programming API • get/put
Reality: mostly suck for BI & analytics Analytic DB vendors are coming from the other direction:
• Aster Data – SQL wrapped around MR
• EMC (Greenplum) – MR on top of the database
NoSQL theoretically = “not only sql”, in reality…
Some
realities
to
consider
Cheap
performance?
▪ Do you have 20 bladeslying around unused?
▪ How much concurrency?
▪ How much effort to write
queries? Debug them?
▪ Performance comparisons:
10x slower on the same
hardware?
The
key
is
the
workload
type
and
the
scale
of
it.
Page 31
Do
you
really
need
a
rack
of
blades
for
computing?
Graphics co‐processors have
been used for certain problems
for years.
Offer single‐system solution to
offload very large compute‐
intensive problems.
Order of magnitude cost
reduction, order of magnitude
performance increase with
current technology today (for
compute‐intensive problems).
Other Options for analytic software deployment
The basic models.
1. Separate tools and systems
(MapReduce and nosql are a
simple variation on this theme)
2. Integrated with a database
3. Embedded in a database
The primary arguments about
deployment models center on
whether to take data to the
code or code to the data.
33
Leveraging the Database
Levels
of
database
integration:
▪ Native DB connector
▪ External integration
▪ Internal integration
▪ Embedded
+
Less
data
movement
+
Possible
dev
process
support
+
Hardware
/
environment
savings
+
Possible
“sandboxing”
support
‐
Limitations
on
techniques
In‐database Execution
You
can
do
a
lot
with
standards
‐
compliant
SQL
If
the
database
has
UDFs,
you
can
code
too
(but
it’s
harder)
Parallel
support
for
UDFs varies
Some
vendors
build
functions
directly
into
the
database,
(usually
scalar)
Iterative
algorithms
(ones
that
converge
on
a
solution)
are
problematic,
more
so
in
MPP
35
What are factors in the decision?
User
concurrency:
one
job
or
many
Repetition
is
a
key
element:
▪ Execute once and apply (build a response
or mortality model)
▪ Many executions daily (web cross‐sells)
In
‐
process
or
Batch?
▪ Batch and use results – segment, score
▪ In‐process reacts on demand – detect
fraud, recommend
In
‐
process
requires
thinking
about
how
it
integrates
with
the
calling
application.
(SQL
MATCHING
THE
PROBLEMS
TO
TECHNOLOGIES
The problem of size is three problems of volume.
Number of users! Computations! Amount of data!Hardware Architectures and Deployment
Compute
and
data
sizes
are
the
key
requirements
39 Data volume <10s GB 100s GB 1s TB 10s TB 100sTB PB PC Shared everything or shared disk Shared nothing MR and related Computations MF GF TF PF
Hardware Architectures and Deployment
40 Data volume <10s GB 100s GB 1s TB 10s TB 100sTB PB Computations MF GF TF PFToday’s
reality,
and
true
for
a
while
in
most
businesses.
The bulk of the market resides here!
Hardware Architectures and Deployment
41 Data volume <10s GB 100s GB 1s TB 10s TB 100sTB PB Computations MF GF TF PFToday’s
reality,
and
true
for
a
while
in
most
businesses.
The bulk of the market resides here!
…but analytics pushes many things into the MPP zone.
The
real
question:
why
do
you
want
a
new
platform?
Trouble
doing
what
you
already
do
today
▪ Poor response times
▪ Not meeting availability deadlines
Doing
more
of
what
you
do
today
▪ Adding users, mining more data
Doing
something
new
with
your
data
▪ Data mining, recommendations, embedded real‐time
process support
What’s
desired
is
possible
but
limited
by
the
cost
of
supporting
or
growing
the
existing
environment.
The
World
According
to
Gartner:
One
Magical
Quadrant
SQL Server 2008 R2 (PDW)
Official production customers?
EMC / Greenplum
SQL limitations
Memory / concurrency issues
Ingres
OLTP database
Illuminate
SQL limitations
Very limited scalability
Sun
MySQL for a DW, is this a joke? 43
Magic Quadrant for Data Warehouse Database Management Systems
The
assumption
of
the
warehouse
as
a
database
is
gone
44 Traditional tabular or structured data Data at rest Non-traditional
data (logs, audio, documents) Parallel programming platforms Databases Streaming DBs/engines Message streams Data in motion Slide 44
Data Access Differences
Basic
data
access
styles:
▪ Standard BI and reporting
▪ Dashboards / scorecards
▪ Operational BI
▪ Ad‐hoc query and analysis
▪ Batch analytics
▪ Embedded analytics
Data
loading
styles:
▪ Refresh
▪ Incremental
▪ Constant
Evaluating ADB Options
Storage style:
▪ Files, tables, columns, cubes, KV
Storage type:
▪ Memory, disk, hybrid, compressed
Scaling model:
▪ SMP, clustered, MPP, distributed
Deployment model:
▪ Appliance, cloud, SaaS, on‐premise
Data access model:
▪ SQL, MapReduce, R, languages, etc.
License options:
▪ CPU, data size, subscription
What’s it going to cost? A small sample at list:
Solution Pricing model Price/unit 1 TB solution Remarks
DatAupia Node $ 19,500/2TB $ 19,500 You can’t buy a 1
TB Satori server
Kickfire (out of business)
Data Volume
(raw) $ 50,000,-/TB $ 50,000 Includes MySQL5.1 Enterprise
Vertica Data Volume
(raw) $ 100,000/TB $ 200,000 Based on 5 nodes, $ 20,000 each
ParAccel Data Volume
(raw) $ 100,000/TB $ 200,000 Based on 5 nodes, $ 20,000 each
EXASOL Data Volume
(active) $ 1,350/GB(€1,000/GB)
$ 350,000* Based on 4 nodes, $ 20,000 each
Teradata Node $ 99,000 / TB $ 99,000** Based on 2550
base configuration
* 1TB raw ±200 GB active, **realistic configuration likely 2x this price
47
Factors and Tradeoffs
The core tradeoff is not always
money for performance.
What else do you trade?
• Load time
• Trickle feeds
• New ETL tools
• New BI tools
• Operational complexity:
•Data integration and
management
•Backups
•Hardware maintenance
The
Path
to
Performance
1. Laborware – tuning
2. Upgrade – try to solve the
problem without changing
out the database
3. Extend – add an ADB or
Hadoop cluster to the
environment to offload a
specific workload
4. Replace – out with the old,
in with the new
Page 49
The Future
Assuming
database
market
embraces
MPP,
you
have
compute
power
that
exceeds
what
the
DB
itself
needs.
Why
not
execute
the
code
at
the
data?
Even
without
MPP,
moving
to
in
‐
database
analytic
processing
is
a
future
direction
and
is
workable
for
a
large
number
of
people.
51
Image Attributions
Thanks to the people who supplied the images used in this presentation:
Atomic Avenue #1 by Glen Orbikhttp://www.orbikart.com/gallery/displayimage.php?album=4&pos=5
spices.jpg ‐http://flickr.com/photos/oberazzi/387992959/
Black hole galaxy ‐http://www.flickr.com/photos/badastronomy/3176565627/
weaver peru.jpg ‐http://flickr.com/photos/slack12/442373910/
rc toy truck.jpg ‐http://flickr.com/photos/texas_hillsurfer/2683650363/
automat purple2.jpg ‐http://flickr.com/photos/alaina/288199169/
open_air_market_bologna‐http://flickr.com/photos/pattchi/181259150/
bored_girl.jpg ‐http://www.flickr.com/photos/alejandrosandoval/280691168/
path_vecchia.jpg ‐http://www.flickr.com/photos/funadium/2320388358/
fast kids truck peru.jpg ‐http://flickr.com/photos/zerega/1029076197/
What’s best for which types of problems?*
Page 54
Shared nothing will be best for solving large data problems, regardless
of workload or concurrency.
Column‐stores will improve query response time problems for most
traditional query and aggregation workloads.
Row‐stores will be better for operational BI or embedded BI.
Fast storage always makes things better, but is only cost‐effective for
medium scale or smaller data.
Compression will help everyone, but column‐stores more than row
stores because of how the engines work.
Map‐Reduce and distributed filesystems offer advantages of a schema‐
less storage & analytic layer that can process into relational databases. SMP and in‐memory will be better for high complexity problems under
moderate data scale, shared‐nothing and MR for large data scale. *The answer is always “it depends”
About the Presenter
Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, analytics and
performance management. Mark is an award-winning author, architect and former CTO whose work has been featured in numerous industry publications. During his career Mark received awards from the American Productivity & Quality Center, TDWI, Computerworld and the Smithsonian Institute. He is an international speaker, contributing editor at Intelligent Enterprise, and manages the open source channel at the Business Intelligence Network. For more information or to contact Mark, visit http://ThirdNature.net.
About Third Nature
Third Nature is a research and consulting firm focused on new and emerging technology and practices in business intelligence, data
integration and information management. If your question is related to BI, open source, web 2.0 or data integration then you‘re at the right place. Our goal is to help companies take advantage of information-driven management practices and applications. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.
We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating the products rather than vendor market positions.