Session# - AaS 2.1 Title SQL On Big Data - Technology, Architecture and Roadmap

(1)

1

Session# - AaS 2.1

Title – “SQL On Big Data

-Technology, Architecture and

Roadmap”

Sumit Pal

Independent Big Data and Data Science

Consultant, Boston

Data Center World – Certified Vendor Neutral

Each presenter is required to certify that their

presentation will be vendor-neutral.

As an attendee you have a right to enforce this policy of

having no sales pitch within a session by alerting the

speaker if you feel the session is not being presented in

a vendor neutral fashion. If the issue continues to be a

problem, please alert Data Center World staff after the

session is complete.

(2)

3

SQL On Big Data - Technology, Architecture and Roadmap

What is Big Data

What is SQL

SQL on Big Data

Why SQL on Hadoop

Limitations of SQL on Hadoop, Challenges & Solution

Types of SQL (Volume, Velocity & Variety)

Architectures – Batch, Interactive, Streaming

Innovations happening in this space

(3)

5

What is SQL (Structured Query Language)

- ANSI Standard

- Manipulate and work with Relational Databases

- Create Database

- Insert / Update / Delete data into Databases

- Retrieve data from Databases with filtering

criteria

What is OLAP (On Line Analytical Processing)

- Used in BI Tools

- Calculating Trends from Historical Data

- Building Business Scenarios

- Building Aggregates and View Data

Dimensionally ( Time, Geography etc.)

- Navigate and Explore – Adhoc Analysis

- Drill Down

(4)

7

What is Hadoop

New Approach to Distributed Data Processing

Data

is

Local,

NOT

move

across

Network

Shared

Nothing

Architecture

1. No synchronization requirement among the nodes

2. Designed for failure – Multiple copies of data

3. Consistent Architecture – individual failures does not fail the job

(5)

9

What is HDFS

Why SQL on Hadoop

More and more data is getting available on Hadoop

SQL is an incredibly popular data querying language

Is a bridge between business analysts and organizations’ big data

Analysts would like to query data on Hadoop without using MapReduce

Seamless Integration with BI Tools

(6)

11

SQL on Hadoop Goals

• Distributed,

Scale

Out

Architecture

• Avoid

Expensive

Analytic

DBs and

Appliances

• Avoid

Data

Movement from

HDFS

to

Analytic

DBs

• High

Concurrency of

end

users

Appliances

(7)

13

SQL

in

Hadoop Landscape

Challenges of SQL on Hadoop

Map

Reduce and

HDFS

– traditionally

meant

to

solve

Batch

Oriented

Data

Map

Reduce

is

high

‐

latency

Map

Reduce

Not

designed

for

long

Data

Pipelines

(Complex

SQL

is

inefficiently

expressed

as

many

MR

stages)

(8)

15

Approaches to solve the challenges

Storage

layer

optimizations

Data

retrieval

‐

data

locality,

storage

layout

/

formats

&

indexing

Indexing

(JethroData – later

slide)

File

Formats

– Avro,

ORC,

Parquet,

Sequence

Files

Choosing

the

optimal

file

format

in

Hadoop is

one

of

the

most

essential

drivers

of

functionality

and

performance

for

big

data

processing

and

query

Data

Compression

(

Reduce

IO)

– GZIP,

BZIP2,

LZO,

Snappy,

LZ4

Workloads

are

IO

Bound – Reduce

IO

– Compression

Algorithms

Compression has

a

gotcha

– Must

be

Splittable for

Hadoop

Tradeoff

– Storage

/

Network

Bandwidth

/

CPU

(9)

17

Analytic Types

Batch SQL on Hadoop

• Hive is designed for batch queries

• Uses Map Reduce in the background

• Primarily for queries on large data sets and large ETL jobs for batch workloads

• Not designed for OLTP OR real‐time

• What Hive values most are scalability (scale out add machines to cluster)

• Extensibility (with Map Reduce framework and UDF/UDAF/UDTF)

(10)

19

Slow

– multiple

stages

of

Map

Reduce

Complex

SQL

Queries

need

MR

job

Chaining,

incurs

Multiple

Shuffles

and

Disk

Writes

Batch SQL on Hadoop

Optimizations – Hive ~ Interactive SQL

Hints ( /*MAP JOIN */) ( /*STREAMABLE */)

Partitioning & Bucketing‐ Done in relational DBs, Bucketing takes care of Skewness

Vectorization

Reduces the CPU usage, for query scans, filters, aggregates, and joins.

Processing a block of 1024 rows at a time Within block, each column is stored as a vector

Uses few instructions and finishes each instruction in fewer clock cycles, by using processor

(11)

21

Interactive SQL

(12)

23

Data Node

Client: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day

Data Node Data Node Data Node Data Node Query Executor Query Executor Query Executor Query Executor Query Executor Query Planner /Mgr Query Planner/ Mgr Query Planner/ Mgr Query Planner/ Mgr Query Planner/ Mgr

Performance and resources based on the size of the dataset

MPP

/

Full

Scan

Architecture

Impala

Architecture

Biggest Advantage

No Data Movement out of the clusters, No SPOF

Dis-Advantage

Processes/Deamon on the each Data Node

Reasons for High Performance

• C++ Instead of Java

• Runtime Code Generation

• New Execution Engine ( Not Map Reduce )

(13)

25

Fast

and

Efficient

IO

manager

‐

handle

large

data

spread

across

array

of

hard

drives

(rotational,

or

SSD)

Designed

to

run

on

modern

architecture,

chipsets

(i.e.

Sandy

Bridge,

Bulldozer),

as

the

LLVM

‐

IR

compiler

will

use

newer

hardware

instructions

to

help

maximize

IO

throughput

Impala’s

execution

engine

is

decoupled

from

the

storage

engine,

allowing

it

to

plug

other

storage

engines

underneath

Impala

Architecture

(14)

27

Impala

Architecture

(15)

29

Impala

Architecture

Impala

Architecture

– LLVM

&

Others

Runtime

code

generation

‐

to

improve

execution

times

Perform

just

in

‐

time

(JIT)

compilation

to

generate

machine

code

Produce

query

specific

versions

of

functions

critical

to

performance

Virtual

function

calls

incur

a

large

performance

penalty.

If

object

type

is

known,

use

code

generation

to

replace

the

virtual

function

call

with

inline

HDFS

feature

short

‐

circuit local

reads

to

bypass

the

DataNode protocol

(16)

31

ImpalaToGo – New

Kid

on

the

Block

Thin

Layer

on

top

of

Impala

(

C++

)

Moves

out

Impala

to

its

own

Cluster

(DeCouples Storage

From

Compute

)

ImpalaToGo can

now

reside

in

its

own

Cluster

(

Separate

from

Hadoop Cluster

)

Challenge

deploying

the

MPP/full

‐

scan

compute

architecture

against

data

that

is

now

separated

by

network

Deals

with

it

by

copying

relevant

data

to

local

cluster

Uses

Tachyon

to

optimize

it.

This

is

great

when

all

the

data

used

by

all

current

users

can

fit

fully

in

cache.

Once

it

doesn’t

the

network

cost

kicks

back

in

Vertica – MPP

Based

Analytic

DB

Engine

• MPP

Columnar

RDBMS

• 10x

‐

100x

compared

to

RDBMS

on

Analytic

Queries

• Scales

Linearly

• Commodity

Hardware

• Built

in

Fault

Tolerance

(17)

33

Vertica with

Hadoop (2

Use

Cases)

(18)

35

Spark

SQL

(19)

37

JethroData – Indexes

in

Hadoop

Data Node

Indexed

Architecture

Data Node Data Node Data Node Data Node Jethro Query Node Query Node

Client: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day

(20)

39

Streaming SQL - Architecture

Store

‐

first,

process

‐

second

are

unable

to

scale

for

real

‐

time

Big

Data

Hadoop unable

to

offer

latency

and

throughput

for

real

‐

time

applications

Telecoms,

IOT

and

Cybersecurity

Streams

are

infinite

tables

Standing

query

that

executes

over

data

Operating

continuously

on

data

as

they

arrive

and

by

updating

results

incrementally

in

real

‐

time

(21)

41

Streaming

Aggregation

AVG, COUNT, MAX, MIN, SUM, STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP

Sliding

Windowed

analytics

Streaming

analytics

and

sliding

windows

Output

record

(or

row)

is

produced

for

each

new

input

record.

Each

field

(or

column)

in

the

output

record

may

be

calculated

using

a

different

window

or

partition.

Windows

can

be

time

or

row

‐

based.

Streaming SQL - Architecture

• In

Memory

Processing

• Lock

‐

free

Data

Structures

• Stateless

implementation

enables

SQL

queries

to

be

distributed

over

Nodes

(22)

43

Innovations

SQL

Improvements

Big

Data

Analytics

(Hive

Specific)

SQL

on

Un

Structured

Data

OLAP

on

Hadoop with

SQL

Probabilistic

Query

Engines

– BlinkDB

ACID transactions (HIVE-5317)

INSERT INTO tbl SELECT

INSERT INTO tbl VALUES

UPDATE tbl SET … WHERE …

DELETE FROM tbl WHERE …

MERGE INTO tbl USING src ON … WHEN MATCHED THEN ... WHEN NOT MATCHED

THEN ...

SET TRANSACTION LEVEL …

BEGIN/END TRANSACTION

How is it being done (look at the paper in reference)

The heart of the approach is the client side merge of the HDFS files and

directories

(23)

45

Window/Analytic Query (HIVE-4197)

Windowing functions

LEAD, LAG, FIRST_VALUE, LAST_VALUE,

OVER with standard aggregates:

COUNT, SUM, MIN, MAX, AVG

OVER with a PARTITION BY

OVER with PARTITION BY and ORDER BY OVER with a window specification

Window specifications support these standard options:

ROWS ((CURRENT ROW) | (UNBOUNDED | [num]) PRECEDING) AND (UNBOUNDED |

[num]) FOLLOWING Ranking functions:

Rank, NTile, DenseRank, CumeDist, PercentRank, NTILE

(24)

47

• Extremely

Fast

OLAP

Engine

at

Scale

• ANSI

SQL

Interface

on

Hadoop

• Interactive

Query

Capability

• MOLAP

Cube

• Seamless

Integration

with

BI

Tools

OLAP

on

Hadoop

‐

Apache

Kylin

(25)

49

Choose Two

Fast Response

Low Latency

SQL Engines

Accurate Big Volume

(26)

51

Probabilistic SQL

Query

Engine

– Blink

DB

Hybrid

Transactional

Analytical

Processing

Hybrid Transaction/Analytical Processing

Coined in early 2014 by Gartner

New generation of in-memory data platforms

OLTP & OLAP No data duplication/movement

• In

‐

Memory

technology

Enabler

• Analytics

over

changing

data

• ETL

is

not

used

any

more

SAP

with

the

SAPHANA

platform

(27)

53

(28)

55 When you put data into a structure, like SQL, you limit what you can do with the data in future. A question of Waterfall versus Agile in Data Analytics.

SQL requires a waterfall design approach and is limited.

Hadoop truly enables an agile analytics approach. You first collect all the data, and then you pull

out whatever you want.

http://www.kdnuggets.com/2015/08/interview‐stefan‐groschupf‐sql‐ hadoop.html#.VcItFXCTR9c.linkedin

Why

SQL

on

Hadoop is

a

Bad

Idea

Stefan

Groschupf

‐

CEO

and

Chairman

of

Datameer

JethroData – http://www.jethrodata.com/

Hive Paper ‐http://infolab.stanford.edu/~ragho/hive‐icde2010.pdf

Hive Transactions ‐

https://issues.apache.org/jira/secure/attachment/12604051/InsertUpdatesinHive.pdf

SQL to MR Translator –

http://web.cse.ohio‐state.edu/hpcs/WWW/HTML/publications/papers/TR‐11‐7.pdf

Impala in Action – Manning – I was one of the reviewers 

Hive Cost Based Optimization ‐

https://issues.apache.org/jira/secure/attachment/12612663/CBO‐2.pdf

Hive on Spark ‐https://issues.apache.org/jira/browse/HIVE‐7292

Hive Speed ‐http://www.slideshare.net/hortonworks/hive‐on‐spark‐is‐blazing‐fast‐or‐is‐it‐ final

Apache Kylin‐http://kylin.incubator.apache.org/

BlinkDB‐http://blinkdb.org/

BlinkDB‐http://www.slideshare.net/Hadoop_Summit/t‐1205p212agarwalv2

Presto ‐http://www.slideshare.net/frsyuki/presto‐hadoop‐conference‐japan‐ 2014?related=1

(29)

57 Hive Vectorization‐https://issues.apache.org/jira/browse/HIVE‐4160

Hive LLAP ‐http://www.slideshare.net/Hadoop_Summit/llap‐longlived‐execution‐in‐hive