1
Session# - AaS 2.1
Title – “SQL On Big Data
-Technology, Architecture and
Roadmap”
Sumit Pal
Independent Big Data and Data Science
Consultant, Boston
Data Center World – Certified Vendor Neutral
Each presenter is required to certify that their
presentation will be vendor-neutral.
As an attendee you have a right to enforce this policy of
having no sales pitch within a session by alerting the
speaker if you feel the session is not being presented in
a vendor neutral fashion. If the issue continues to be a
problem, please alert Data Center World staff after the
session is complete.
3
SQL On Big Data - Technology, Architecture and Roadmap
What is Big Data
What is SQL
SQL on Big Data
Why SQL on Hadoop
Limitations of SQL on Hadoop, Challenges & Solution
Types of SQL (Volume, Velocity & Variety)
Architectures – Batch, Interactive, Streaming
Innovations happening in this space
5
What is SQL (Structured Query Language)
- ANSI Standard
- Manipulate and work with Relational Databases
- Create Database
- Insert / Update / Delete data into Databases
- Retrieve data from Databases with filtering
criteria
What is OLAP (On Line Analytical Processing)
- Used in BI Tools
- Calculating Trends from Historical Data
- Building Business Scenarios
- Building Aggregates and View Data
Dimensionally ( Time, Geography etc.)
- Navigate and Explore – Adhoc Analysis
- Drill Down
7
What is Hadoop
New Approach to Distributed Data Processing
Data
is
Local,
NOT
move
across
Network
Shared
Nothing
Architecture
1. No synchronization requirement among the nodes
2. Designed for failure – Multiple copies of data
3. Consistent Architecture – individual failures does not fail the job
9
What is HDFS
Why SQL on Hadoop
More and more data is getting available on Hadoop
SQL is an incredibly popular data querying language
Is a bridge between business analysts and organizations’ big data
Analysts would like to query data on Hadoop without using MapReduce
Seamless Integration with BI Tools
11
SQL on Hadoop Goals
•
Distributed,
Scale
Out
Architecture
•
Avoid
Expensive
Analytic
DBs and
Appliances
•
Avoid
Data
Movement from
HDFS
to
Analytic
DBs
•
High
Concurrency of
end
users
Appliances
13
SQL
in
Hadoop Landscape
Challenges of SQL on Hadoop
Map
Reduce and
HDFS
– traditionally
meant
to
solve
Batch
Oriented
Data
Map
Reduce
is
high
‐
latency
Map
Reduce
Not
designed
for
long
Data
Pipelines
(Complex
SQL
is
inefficiently
expressed
as
many
MR
stages)
15
Approaches to solve the challenges
Storage
layer
optimizations
Data
retrieval
‐
data
locality,
storage
layout
/
formats
&
indexing
Indexing
(JethroData – later
slide)
File
Formats
– Avro,
ORC,
Parquet,
Sequence
Files
Choosing
the
optimal
file
format
in
Hadoop is
one
of
the
most
essential
drivers
of
functionality
and
performance
for
big
data
processing
and
query
Data
Compression
(
Reduce
IO)
– GZIP,
BZIP2,
LZO,
Snappy,
LZ4
Workloads
are
IO
Bound – Reduce
IO
– Compression
Algorithms
Compression has
a
gotcha
– Must
be
Splittable for
Hadoop
Tradeoff
– Storage
/
Network
Bandwidth
/
CPU
17
Analytic Types
Batch SQL on Hadoop
•
Hive is designed for batch queries
•
Uses Map Reduce in the background
•
Primarily for queries on large data sets and large ETL jobs for batch workloads
•
Not designed for OLTP OR real‐time
•
What Hive values most are scalability (scale out add machines to cluster)
•
Extensibility (with Map Reduce framework and UDF/UDAF/UDTF)
19
Slow
– multiple
stages
of
Map
Reduce
Complex
SQL
Queries
need
MR
job
Chaining,
incurs
Multiple
Shuffles
and
Disk
Writes
Batch SQL on Hadoop
Optimizations – Hive ~ Interactive SQL
Hints ( /*MAP JOIN */) ( /*STREAMABLE */)
Partitioning & Bucketing‐ Done in relational DBs, Bucketing takes care of Skewness
Vectorization
Reduces the CPU usage, for query scans, filters, aggregates, and joins.
Processing a block of 1024 rows at a time Within block, each column is stored as a vector
Uses few instructions and finishes each instruction in fewer clock cycles, by using processor
21
Interactive SQL
23
Data Node
Client: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day
Data Node Data Node Data Node Data Node Query Executor Query Executor Query Executor Query Executor Query Executor Query Planner /Mgr Query Planner/ Mgr Query Planner/ Mgr Query Planner/ Mgr Query Planner/ Mgr
Performance and resources based on the size of the dataset
MPP
/
Full
Scan
Architecture
Impala
Architecture
Biggest Advantage
No Data Movement out of the clusters, No SPOF
Dis-Advantage
Processes/Deamon on the each Data Node
Reasons for High Performance
•
C++ Instead of Java
•
Runtime Code Generation
•
New Execution Engine ( Not Map Reduce )
25
Fast
and
Efficient
IO
manager
‐
handle
large
data
spread
across
array
of
hard
drives
(rotational,
or
SSD)
Designed
to
run
on
modern
architecture,
recommended
chipsets
(i.e.
Sandy
Bridge,
Bulldozer),
as
the
LLVM
‐
IR
compiler
will
use
newer
hardware
instructions
to
help
maximize
IO
throughput
Impala’s
execution
engine
is
decoupled
from
the
storage
engine,
allowing
it
to
plug
other
storage
engines
underneath
Impala
Architecture
27
Impala
Architecture
29
Impala
Architecture
Impala
Architecture
– LLVM
&
Others
Runtime
code
generation
‐
to
improve
execution
times
Perform
just
in
‐
time
(JIT)
compilation
to
generate
machine
code
Produce
query
specific
versions
of
functions
critical
to
performance
Virtual
function
calls
incur
a
large
performance
penalty.
If
object
type
is
known,
use
code
generation
to
replace
the
virtual
function
call
with
inline
HDFS
feature
short
‐
circuit local
reads
to
bypass
the
DataNode protocol
31
ImpalaToGo – New
Kid
on
the
Block
Thin
Layer
on
top
of
Impala
(
C++
)
Moves
out
Impala
to
its
own
Cluster
(DeCouples Storage
From
Compute
)
ImpalaToGo can
now
reside
in
its
own
Cluster
(
Separate
from
Hadoop Cluster
)
Challenge
deploying
the
MPP/full
‐
scan
compute
architecture
against
data
that
is
now
separated
by
network
Deals
with
it
by
copying
relevant
data
to
local
cluster
Uses
Tachyon
to
optimize
it.
This
is
great
when
all
the
data
used
by
all
current
users
can
fit
fully
in
cache.
Once
it
doesn’t
the
network
cost
kicks
back
in
Vertica – MPP
Based
Analytic
DB
Engine
•
MPP
Columnar
RDBMS
•
10x
‐
100x
compared
to
RDBMS
on
Analytic
Queries
•
Scales
Linearly
•
Commodity
Hardware
•
Built
in
Fault
Tolerance
33
Vertica with
Hadoop (2
Use
Cases)
35
Spark
SQL
37
JethroData – Indexes
in
Hadoop
Data Node
Indexed
Architecture
Data Node Data Node Data Node Data Node Jethro Query Node Query NodeClient: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day
39
Streaming SQL - Architecture
Store
‐
first,
process
‐
second
are
unable
to
scale
for
real
‐
time
Big
Data
Hadoop unable
to
offer
latency
and
throughput
for
real
‐
time
applications
Telecoms,
IOT
and
Cybersecurity
Streams
are
infinite
tables
Standing
query
that
executes
over
data
Operating
continuously
on
data
as
they
arrive
and
by
updating
results
incrementally
in
real
‐
time
41
Streaming
Aggregation
AVG, COUNT, MAX, MIN, SUM, STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP
Sliding
Windowed
analytics
Streaming
analytics
and
sliding
windows
Output
record
(or
row)
is
produced
for
each
new
input
record.
Each
field
(or
column)
in
the
output
record
may
be
calculated
using
a
different
window
or
partition.
Windows
can
be
time
or
row
‐
based.
Streaming SQL - Architecture
•
In
Memory
Processing
•
Lock
‐
free
Data
Structures
•
Stateless
implementation
enables
SQL
queries
to
be
distributed
over
Nodes
43
Innovations
SQL
Improvements
Big
Data
Analytics
(Hive
Specific)
SQL
on
Un
Structured
Data
OLAP
on
Hadoop with
SQL
Probabilistic
Query
Engines
– BlinkDB
ACID transactions (HIVE-5317)
INSERT INTO tbl SELECT
INSERT INTO tbl VALUES
UPDATE tbl SET … WHERE …
DELETE FROM tbl WHERE …
MERGE INTO tbl USING src ON … WHEN MATCHED THEN ... WHEN NOT MATCHED
THEN ...
SET TRANSACTION LEVEL …
BEGIN/END TRANSACTION
How is it being done (look at the paper in reference)
The heart of the approach is the client side merge of the HDFS files and
directories
45
Window/Analytic Query (HIVE-4197)
Windowing functions
LEAD, LAG, FIRST_VALUE, LAST_VALUE,
OVER with standard aggregates:
COUNT, SUM, MIN, MAX, AVG
OVER with a PARTITION BY
OVER with PARTITION BY and ORDER BY OVER with a window specification
Window specifications support these standard options:
ROWS ((CURRENT ROW) | (UNBOUNDED | [num]) PRECEDING) AND (UNBOUNDED |
[num]) FOLLOWING Ranking functions:
Rank, NTile, DenseRank, CumeDist, PercentRank, NTILE
47
•
Extremely
Fast
OLAP
Engine
at
Scale
•
ANSI
SQL
Interface
on
Hadoop
•
Interactive
Query
Capability
•
MOLAP
Cube
•
Seamless
Integration
with
BI
Tools
OLAP
on
Hadoop
‐
Apache
Kylin
49
Choose Two
Fast Response
Low Latency
SQL Engines
Accurate Big Volume
51
Probabilistic SQL
Query
Engine
– Blink
DB
Hybrid
Transactional
Analytical
Processing
Hybrid Transaction/Analytical Processing
Coined in early 2014 by Gartner
New generation of in-memory data platforms
OLTP & OLAP No data duplication/movement
•
In
‐
Memory
technology
Enabler
•
Analytics
over
changing
data
•
ETL
is
not
used
any
more
SAP
with
the
SAPHANA
platform
53
55 When you put data into a structure, like SQL, you limit what you can do with the data in future. A question of Waterfall versus Agile in Data Analytics.
SQL requires a waterfall design approach and is limited.
Hadoop truly enables an agile analytics approach. You first collect all the data, and then you pull
out whatever you want.
http://www.kdnuggets.com/2015/08/interview‐stefan‐groschupf‐sql‐ hadoop.html#.VcItFXCTR9c.linkedin
Why
SQL
on
Hadoop is
a
Bad
Idea
Stefan
Groschupf
‐
CEO
and
Chairman
of
Datameer
JethroData – http://www.jethrodata.com/
Hive Paper ‐http://infolab.stanford.edu/~ragho/hive‐icde2010.pdf
Hive Transactions ‐
https://issues.apache.org/jira/secure/attachment/12604051/InsertUpdatesinHive.pdf
SQL to MR Translator –
http://web.cse.ohio‐state.edu/hpcs/WWW/HTML/publications/papers/TR‐11‐7.pdf
Impala in Action – Manning – I was one of the reviewers
Hive Cost Based Optimization ‐
https://issues.apache.org/jira/secure/attachment/12612663/CBO‐2.pdf
Hive on Spark ‐https://issues.apache.org/jira/browse/HIVE‐7292
Hive Speed ‐http://www.slideshare.net/hortonworks/hive‐on‐spark‐is‐blazing‐fast‐or‐is‐it‐ final
Apache Kylin‐http://kylin.incubator.apache.org/
BlinkDB‐http://blinkdb.org/
BlinkDB‐http://www.slideshare.net/Hadoop_Summit/t‐1205p212agarwalv2
Presto ‐http://www.slideshare.net/frsyuki/presto‐hadoop‐conference‐japan‐ 2014?related=1
57 Hive Vectorization‐https://issues.apache.org/jira/browse/HIVE‐4160
Hive LLAP ‐http://www.slideshare.net/Hadoop_Summit/llap‐longlived‐execution‐in‐hive
Impala Paper ‐http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf
Requiremtns of Stream Processing ‐http://cs.brown.edu/~ugur/8rulesSigRec.pdf
References
(
Q &
A
)
Select
Questions
from
Audience
59