• No results found

Session# - AaS 2.1 Title SQL On Big Data - Technology, Architecture and Roadmap

N/A
N/A
Protected

Academic year: 2021

Share "Session# - AaS 2.1 Title SQL On Big Data - Technology, Architecture and Roadmap"

Copied!
30
0
0

Loading.... (view fulltext now)

Full text

(1)

1

Session# - AaS 2.1

Title – “SQL On Big Data

-Technology, Architecture and

Roadmap”

Sumit Pal

Independent Big Data and Data Science

Consultant, Boston

Data Center World – Certified Vendor Neutral

Each presenter is required to certify that their

presentation will be vendor-neutral.

As an attendee you have a right to enforce this policy of

having no sales pitch within a session by alerting the

speaker if you feel the session is not being presented in

a vendor neutral fashion. If the issue continues to be a

problem, please alert Data Center World staff after the

session is complete.

(2)

3

SQL On Big Data - Technology, Architecture and Roadmap

What is Big Data

What is SQL

SQL on Big Data

Why SQL on Hadoop

Limitations of SQL on Hadoop, Challenges & Solution

Types of SQL (Volume, Velocity & Variety)

Architectures – Batch, Interactive, Streaming

Innovations happening in this space

(3)

5

What is SQL (Structured Query Language)

- ANSI Standard

- Manipulate and work with Relational Databases

- Create Database

- Insert / Update / Delete data into Databases

- Retrieve data from Databases with filtering

criteria

What is OLAP (On Line Analytical Processing)

- Used in BI Tools

- Calculating Trends from Historical Data

- Building Business Scenarios

- Building Aggregates and View Data

Dimensionally ( Time, Geography etc.)

- Navigate and Explore – Adhoc Analysis

- Drill Down

(4)

7

What is Hadoop

New Approach to Distributed Data Processing

Data

 

is

 

Local,

 

NOT

 

move

 

across

 

Network

Shared

 

Nothing

 

Architecture

1. No synchronization requirement among the nodes 

2. Designed for failure – Multiple copies of data 

3. Consistent Architecture – individual failures does  not fail the job 

(5)

9

What is HDFS

Why SQL on Hadoop

More and more data is getting available on Hadoop

SQL is an incredibly popular data querying language

Is a bridge between business analysts and organizations’ big data

Analysts would like to query data on Hadoop without using MapReduce

Seamless Integration with BI Tools

(6)

11

SQL on Hadoop Goals

Distributed,

 

Scale

 

Out

 

Architecture

Avoid

 

Expensive

 

Analytic

 

DBs and

 

Appliances

Avoid

 

Data

 

Movement from

 

HDFS

 

to

 

Analytic

 

DBs

High

 

Concurrency of

 

end

 

users

Appliances

(7)

13

SQL

 

in

 

Hadoop Landscape

Challenges of SQL on Hadoop

Map

 

Reduce and

 

HDFS

 

– traditionally

 

meant

 

to

 

solve

 

Batch

Oriented

 

Data

Map

 

Reduce

 

is

 

high

latency

Map

 

Reduce

 

Not

 

designed

 

for

 

long

 

Data

 

Pipelines

(Complex

 

SQL

 

is

 

inefficiently

 

expressed

 

as

 

many

 

MR

 

stages)

(8)

15

Approaches to solve the challenges

Storage

 

layer

 

optimizations

Data

 

retrieval

  ‐

data

 

locality,

 

storage

 

layout

 

/

 

formats

 

&

 

indexing

Indexing

(JethroData – later

 

slide)

File

 

Formats

 

– Avro,

 

ORC,

 

Parquet,

 

Sequence

 

Files

Choosing

 

the

 

optimal

 

file

 

format

 

in

 

Hadoop is

 

one

 

of

 

the

 

most

 

essential

 

drivers

 

of

 

functionality

 

and

 

performance

 

for

 

big

 

data

 

processing

 

and

 

query

Data

 

Compression

 

(

 

Reduce

 

IO)

 

– GZIP,

 

BZIP2,

 

LZO,

 

Snappy,

 

LZ4

Workloads

 

are

 

IO

 

Bound – Reduce

 

IO

 

– Compression

 

Algorithms

Compression has

 

a

 

gotcha

 

– Must

 

be

 

Splittable for

 

Hadoop

Tradeoff

 

– Storage

 

/

 

Network

 

Bandwidth

 

/

 

CPU

(9)

17

Analytic Types

Batch SQL on Hadoop

Hive is designed for batch queries 

Uses Map Reduce in the background

Primarily for queries on large data sets and large ETL jobs for batch workloads

Not designed for OLTP OR real‐time

What Hive values most are scalability (scale out add machines to cluster)

Extensibility (with Map Reduce framework and UDF/UDAF/UDTF)

(10)

19

Slow

 

– multiple

 

stages

 

of

 

Map

 

Reduce

Complex

 

SQL

 

Queries

 

need

 

MR

 

job

 

Chaining,

 

incurs

 

Multiple

 

Shuffles

 

and

 

Disk

 

Writes

Batch SQL on Hadoop

Optimizations – Hive ~ Interactive SQL

Hints ( /*MAP JOIN */) ( /*STREAMABLE */)

Partitioning & Bucketing‐ Done in relational DBs, Bucketing takes care of Skewness

Vectorization

Reduces the CPU usage,  for query scans, filters, aggregates, and joins. 

Processing a block of 1024 rows at a time Within block, each column is stored as a vector

Uses few instructions and finishes each instruction in fewer clock cycles, by using processor 

(11)

21

Interactive SQL

(12)

23

Data Node

Client: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day

Data Node Data Node Data Node Data Node Query Executor Query Executor Query Executor Query Executor Query Executor Query Planner /Mgr Query Planner/ Mgr Query Planner/ Mgr Query Planner/ Mgr Query Planner/ Mgr

Performance and resources based on the size of the dataset

MPP

 

/

 

Full

 

Scan

 

Architecture

Impala

 

Architecture

Biggest Advantage

No Data Movement out of the clusters, No SPOF

Dis-Advantage

Processes/Deamon on the each Data Node

Reasons for High Performance

C++ Instead of Java

Runtime Code Generation

New Execution Engine ( Not Map Reduce )

(13)

25

Fast

 

and

 

Efficient

 

IO

 

manager

handle

 

large

 

data

 

spread

 

across

 

array

 

of

 

hard

 

drives

 

(rotational,

 

or

 

SSD)

Designed

 

to

 

run

 

on

 

modern

 

architecture,

 

recommended

 

chipsets

 

(i.e.

 

Sandy

 

Bridge,

 

Bulldozer),

 

as

 

the

 

LLVM

IR

 

compiler

 

will

 

use

 

newer

 

hardware

 

instructions

to

 

help

 

maximize

 

IO

 

throughput

Impala’s

 

execution

 

engine

 

is

 

decoupled

 

from

 

the

 

storage

 

engine,

 

allowing

 

it

 

to

 

plug

 

other

 

storage

 

engines

 

underneath

Impala

 

Architecture

(14)

27

Impala

 

Architecture

(15)

29

Impala

 

Architecture

Impala

 

Architecture

 

– LLVM

 

&

 

Others

Runtime

 

code

 

generation

to

 

improve

 

execution

 

times

Perform

 

just

 

in

time

 

(JIT)

 

compilation

 

to

 

generate

 

machine

 

code

Produce

 

query

 

specific

 

versions

 

of

 

functions

 

critical

 

to

 

performance

Virtual

 

function

 

calls

 

incur

 

a

 

large

 

performance

 

penalty.

 

If

 

object

 

type

 

is

 

known,

 

use

 

code

 

generation

 

to

 

replace

 

the

 

virtual

 

function

 

call

 

with

 

inline

HDFS

 

feature

 

short

circuit local

 

reads

 

to

 

bypass

 

the

 

DataNode protocol

 

(16)

31

ImpalaToGo – New

 

Kid

 

on

 

the

 

Block

Thin

 

Layer

 

on

 

top

 

of

 

Impala

 

(

 

C++

 

)

Moves

 

out

 

Impala

 

to

 

its

 

own

 

Cluster

 

(DeCouples Storage

 

From

 

Compute

 

)

ImpalaToGo can

 

now

 

reside

 

in

 

its

 

own

 

Cluster

 

(

 

Separate

 

from

 

Hadoop Cluster

 

)

Challenge

 

deploying

 

the

 

MPP/full

scan

 

compute

 

architecture

 

against

 

data

 

that

 

is

 

now

 

separated

 

by

 

network

Deals

 

with

 

it

 

by

 

copying

 

relevant

 

data

 

to

 

local

 

cluster

 

Uses

 

Tachyon

 

to

 

optimize

 

it.

 

This

 

is

 

great

 

when

 

all

 

the

 

data

 

used

 

by

 

all

 

current

 

users

 

can

 

fit

 

fully

 

in

 

cache.

 

Once

 

it

 

doesn’t

 

the

 

network

 

cost

 

kicks

 

back

 

in

Vertica – MPP

 

Based

 

Analytic

 

DB

 

Engine

MPP

 

Columnar

 

RDBMS

10x

100x

 

compared

 

to

 

RDBMS

 

on

 

Analytic

 

Queries

Scales

 

Linearly

Commodity

 

Hardware

Built

 

in

 

Fault

 

Tolerance

(17)

33

Vertica with

 

Hadoop (2

 

Use

 

Cases)

(18)

35

Spark

 

SQL

(19)

37

JethroData – Indexes

 

in

 

Hadoop

Data Node

Indexed

Architecture

Data Node Data Node Data Node Data Node Jethro Query Node Query Node

Client: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day

(20)

39

Streaming SQL - Architecture

Store

first,

 

process

second

 

are

 

unable

 

to

 

scale

 

for

 

real

time

 

Big

 

Data

Hadoop unable

 

to

 

offer

 

latency

 

and

 

throughput

 

for

 

real

time

 

applications

 

Telecoms,

  

IOT

 

and

 

Cybersecurity

Streams

 

are

 

infinite

 

tables

Standing

 

query

 

that

 

executes

 

over

 

data

Operating

 

continuously

 

on

 

data

 

as

 

they

 

arrive

 

and

 

by

 

updating

 

results

 

incrementally

 

in

 

real

time

(21)

41

Streaming

 

Aggregation

AVG, COUNT, MAX, MIN, SUM, STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP

Sliding

 

Windowed

 

analytics

Streaming

 

analytics

 

and

 

sliding

 

windows

Output

 

record

 

(or

 

row)

 

is

 

produced

 

for

 

each

 

new

 

input

 

record.

 

Each

 

field

 

(or

 

column)

 

in

 

the

 

output

 

record

 

may

 

be

 

calculated

 

using

 

a

 

different

 

window

 

or

 

partition.

 

Windows

 

can

 

be

 

time

 

or

 

row

based.

Streaming SQL - Architecture

In

 

Memory

 

Processing

Lock

free

 

Data

 

Structures

  

Stateless

 

implementation

 

enables

 

SQL

 

queries

 

to

 

be

 

distributed

 

over

 

Nodes

(22)

43

Innovations

SQL

 

Improvements

 

Big

 

Data

 

Analytics

 

(Hive

 

Specific)

SQL

 

on

 

Un

 

Structured

 

Data

OLAP

 

on

 

Hadoop with

 

SQL

Probabilistic

 

Query

 

Engines

 

– BlinkDB

ACID transactions (HIVE-5317)

INSERT INTO tbl SELECT

INSERT INTO tbl VALUES

UPDATE tbl SET … WHERE …

DELETE FROM tbl WHERE … 

MERGE INTO tbl USING src ON … WHEN MATCHED THEN ... WHEN NOT MATCHED 

THEN ... 

SET TRANSACTION LEVEL … 

BEGIN/END TRANSACTION

How is it being done (look at the paper in reference)

The heart of the approach is the client side merge of the HDFS files and 

directories

(23)

45

Window/Analytic Query (HIVE-4197)

Windowing functions

LEAD,  LAG,  FIRST_VALUE,  LAST_VALUE,

OVER with standard aggregates:

COUNT,  SUM,  MIN,  MAX,  AVG

OVER with a PARTITION BY

OVER with PARTITION BY and ORDER BY OVER with a window specification

Window specifications support these standard options:

ROWS ((CURRENT ROW) | (UNBOUNDED | [num]) PRECEDING) AND (UNBOUNDED | 

[num]) FOLLOWING Ranking functions: 

Rank, NTile, DenseRank, CumeDist, PercentRank, NTILE

(24)

47

Extremely

 

Fast

 

OLAP

 

Engine

 

at

 

Scale

ANSI

 

SQL

 

Interface

 

on

 

Hadoop

Interactive

 

Query

 

Capability

MOLAP

 

Cube

Seamless

 

Integration

 

with

 

BI

 

Tools

OLAP

 

on

 

Hadoop

Apache

 

Kylin

(25)

49

Choose Two

Fast Response 

Low Latency

SQL Engines

Accurate Big Volume

(26)

51

Probabilistic SQL

 

Query

 

Engine

 

– Blink

 

DB

Hybrid

 

Transactional

 

Analytical

 

Processing

Hybrid Transaction/Analytical Processing

Coined in early 2014 by Gartner

New generation of in-memory data platforms

OLTP & OLAP No data duplication/movement

In

Memory

 

technology

 

Enabler

Analytics

 

over

 

changing

 

data

ETL

 

is

 

not

 

used

 

any

 

more

SAP

 

with

 

the

SAPHANA

platform

(27)

53

(28)

55 When you put data into a structure, like SQL, you limit what you can do with the data in future. A question of Waterfall versus Agile in Data Analytics. 

SQL requires a waterfall design approach and is limited. 

Hadoop truly enables an agile analytics approach. You first collect all the data, and then you pull 

out whatever you want.

http://www.kdnuggets.com/2015/08/interview‐stefan‐groschupf‐sql‐ hadoop.html#.VcItFXCTR9c.linkedin

Why

 

SQL

 

on

 

Hadoop is

 

a

 

Bad

 

Idea

Stefan

 

Groschupf

CEO

 

and

 

Chairman

 

of

  

Datameer

JethroData – http://www.jethrodata.com/

Hive Paper ‐http://infolab.stanford.edu/~ragho/hive‐icde2010.pdf

Hive Transactions ‐

https://issues.apache.org/jira/secure/attachment/12604051/InsertUpdatesinHive.pdf

SQL to MR Translator –

http://web.cse.ohio‐state.edu/hpcs/WWW/HTML/publications/papers/TR‐11‐7.pdf

Impala in Action – Manning – I was one of the reviewers 

Hive Cost Based Optimization ‐

https://issues.apache.org/jira/secure/attachment/12612663/CBO‐2.pdf

Hive on Spark ‐https://issues.apache.org/jira/browse/HIVE‐7292

Hive Speed ‐http://www.slideshare.net/hortonworks/hive‐on‐spark‐is‐blazing‐fast‐or‐is‐it‐ final

Apache Kylin‐http://kylin.incubator.apache.org/

BlinkDB‐http://blinkdb.org/

BlinkDB‐http://www.slideshare.net/Hadoop_Summit/t‐1205p212agarwalv2

Presto ‐http://www.slideshare.net/frsyuki/presto‐hadoop‐conference‐japan‐ 2014?related=1

(29)

57 Hive Vectorization‐https://issues.apache.org/jira/browse/HIVE‐4160

Hive LLAP ‐http://www.slideshare.net/Hadoop_Summit/llap‐longlived‐execution‐in‐hive

Impala Paper ‐http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf

Requiremtns of Stream Processing ‐http://cs.brown.edu/~ugur/8rulesSigRec.pdf

References

(

 

Q &

 

A

 

)

Select

 

Questions

 

from

 

Audience

(30)

59

3 Key Things You Have Learned During this Session

1. What is Big Data

2. What is SQL

3. SQL on Big Data

References

Related documents

These cavities spent the least amount of time above 35˚C and 40˚C (Fig 9A-F) and thus a model cannot be run because there are so few non- diapausing individuals spending

ing S1s!re-i on having the right product at the right place and at the ri stems facilitate order taking and information gathering form the customer and require

We have argued previously that our model presents estimates of both scale and technique effects, which are interpreted as the change in pollution concentrations due to change in

Although constrained by data availability, the evidence suggests that the dominant effect of subsidies was to increase social security registration of firms and workers rather

Keywords: developmental dyscalculia, developmental perspective, heterogeneity, individual differences, diagnosis, classification, research criteria.. Developmental dyscalculia (DD)

Casa Clementina, which launched its courses in March 2011, welcomes weaving teachers, natural dye experts and teachers of natural dyeing, textile researchers and artists, as well

This cone consists of the matrices which admit a Gram representation by (a specific class of) positive semidefinite operators on a possibly infinite dimensional Hilbert space instead