(1)

Processing Big Data

With SQL on Hadoop

Jens Albrecht

(2)

Prof. Dr. Jens Albrecht 3

Why SQL for Big Data?

Mature technology
▸Broad knowledge available
▸Powerful query language
▸High interactive performance
▸Many third-party tools for data analysis and visualization

Flexible data structures
▸Semi-structured data
▸Changing schemas

Self-service
▸Data integration on-the-fly

Scalability
▸Analysis, integration, volume

(Relatively) low cost
▸Commodity hardware, open source

(3)

Why SQL for Big Data?

(Figure: SQL + extensions → agility)

(4)


Hadoop-SQL Integration

▸Hive (native Hadoop): HiveQL on HDFS via MR/Tez
▸Pure Hadoop SQL engines: distributed SQL engine directly on HDFS
▸Format-agnostic SQL engines: distributed SQL engine over HDFS/Hive, NoSQL and relational sources
▸RDBMS with Hadoop access: relational DBMS reaching into HDFS

(5)

Hadoop-SQL Integration

(6)

Hive

General
▸Developed initially by Facebook
▸SQL processing for HDFS and HBase
▸Table definitions in the Hive Metastore
▸Generation of MapReduce code
▸Schema-on-read via SerDe

Advantages
▸Mature part of every Hadoop distribution
▸Simple setup
▸Java API for UDFs
▸Usage of many data formats via SerDe

Disadvantages
▸Batch-oriented, slow

(Diagram: HiveQL → execution engine with SerDe and Metastore on Hadoop HDFS/HBase)

(7)

Schema-on-Write vs. Schema-on-Read

Relational database (schema-on-write):
multi-structured source data → ETL → relational DBMS → SQL

Big data processing (schema-on-read):
multi-structured source data → load as-is into Hadoop → SQL
(schema mapped to the original files)
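The contrast can be sketched in a few lines of Python (a toy illustration using the tab-separated GPS sample from the Hive example, not a real engine): with schema-on-read, the raw line is stored untouched, and the column mapping is applied only when the query runs.

```python
# Schema-on-read, toy version: raw lines are kept as-is;
# the schema (column names + positions) is applied at read time.
raw = ["4711\t542815\t49.454N\t11.077E\t10/01/2014@10:00:00UTC"]

schema = ["userid", "deviceid", "longitude", "latitude", "utctime"]

def query(lines, schema):
    # The "SerDe": split each line and map fields to column names on read.
    return [dict(zip(schema, line.split("\t"))) for line in lines]

rows = query(raw, schema)
print(rows[0]["utctime"])  # 10/01/2014@10:00:00UTC
```

Changing the schema means changing only this mapping; the files on disk stay untouched.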

(8)


Schema-on-Read: Hive & CSV

CREATE TABLE gps_data (
  userid    INT,
  deviceid  INT,
  longitude STRING,
  latitude  STRING,
  utctime   STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- load data = copy
LOAD DATA LOCAL INPATH 'new_data/gps.dat'
OVERWRITE INTO TABLE gps_data;

SELECT COUNT(*) FROM gps_data;

Sample data:
4711  542815  49.454N  11.077E  10/01/2014@10:00:00UTC

(9)

Schema-on-Read: Hive & Regexp

CREATE TABLE weblog (
  host STRING, identity STRING, user STRING, time STRING,
  request STRING, status STRING, size STRING,
  referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE;
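The SerDe's regex can be checked locally in Python before deploying the table (the log line below is a made-up sample in Apache combined-log format; the doubled backslashes from the Java string literal are undone in the raw string):

```python
import re

# The input.regex from the DDL, with Java string escaping removed.
pattern = (r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
           r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?')

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://example.com/start" "Mozilla/4.08"')

m = re.fullmatch(pattern, line)
# Capture groups map 1:1 to the table columns: host ... agent.
host, status, size = m.group(1), m.group(6), m.group(7)
print(host, status, size)  # 127.0.0.1 200 2326
```

Each capture group becomes one column of the weblog table, in declaration order.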

(10)

Prof. Dr. Jens Albrecht 11

Schema-on-Read: Hive & JSON

Source: http://thornydev.blogspot.de/2013/07/querying-json-records-via-hive.html
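The linked post demonstrates querying JSON records through a Hive JSON SerDe; the same schema-on-read idea can be sketched in plain Python (hypothetical records, not the SerDe itself):

```python
import json

# One JSON record per line, as a Hive JSON SerDe would expect.
lines = ['{"user": {"id": 1, "name": "alice"}, "action": "login"}',
         '{"user": {"id": 2, "name": "bob"}, "action": "logout"}']

# Roughly: SELECT user.name FROM log WHERE action = 'login',
# with parsing and field navigation happening at read time.
names = [r["user"]["name"]
         for r in map(json.loads, lines)
         if r["action"] == "login"]
print(names)  # ['alice']
```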

(11)

Hive on Tez (Stinger)

Stinger Initiative
▸"Make Hive 100x faster"
▸Finished with Hive 0.13 (April 2014)
▸Replaces MapReduce with Tez
▸Native columnar data format (ORC)

Stinger.Next
▸Phase 1: Hive 0.14 (November 2014)
  - ACID transactions
▸Phase 2 (Q2 2015)
  - Subsecond queries with LLAP
  - Machine learning integration
▸Phase 3 (Q4 2015)
  - SQL:2011 analytic functions
  - Materialized views

(Diagram: HiveQL → execution engine with SerDe and Metastore on Tez/YARN over Hadoop HDFS/HBase)

(12)


Hadoop-SQL Integration

(13)

Pure Hadoop SQL Engines

Approach
▸Distributed, parallel SQL engine
▸Often uses Hive metadata
▸Support for optimized data formats
▸Hadoop as mandatory basis

Advantages and Disadvantages
▸Significantly faster than Hive
▸Low latency through a dedicated engine
▸Operator pipelining and result caching

Differentiation of solutions
▸Supported SQL functionality
▸Point querying
▸Cost-based optimizer / performance
▸Transaction support

(Diagram: SQL query coordination over local agents reading data files in HDFS/HBase)

(14)

Pure Hadoop SQL Engines

(Figure: the same distributed SQL engine architecture, annotated with vendor products such as IBM BigInsights)

(15)

Pure Hadoop SQL Engines

(16)

Apache Spark & Spark SQL

General
▸SQL engine based on Spark
▸Data access via data frames (formerly SchemaRDD)
▸In-memory columnar format
▸HDFS / HBase as storage layer

Advantages
▸Spark as general-purpose parallel computing framework
▸Support for Hive extensions like UDFs and SerDes, and Hive metadata

Disadvantages
▸Not yet fully mature
▸Not yet as fast as competitors

(Diagram: Spark SQL with SchemaRDD, SQL execution and SerDe on Apache Spark over HDFS/HBase)

(17)

Apache Spark

Distributed In-Memory Computing Framework
▸Data caching
▸General framework for all kinds of SQL and non-SQL analytics
▸Support for out-of-the-box libraries as well as Java, Python and Scala in the same engine and for the same data
▸New data sources API allows writing plugins for non-Hadoop sources

(Diagram: Spark execution engine running Spark SQL, Spark Streaming, GraphX and MLlib on Hadoop: HDFS, YARN (optional), ZooKeeper)

(18)

Apache Spark

// Scala: the first four lines are transformations building
// Resilient Distributed Datasets (RDDs); collect() is an action
// that triggers execution.
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

// Return the time fields of errors mentioning HDFS,
// assuming time is field number 3 in a tab-separated file
val hdfs_errors = errors.filter(_.contains("HDFS"))
val time_fields = hdfs_errors.map(_.split('\t')(3)).collect()

Source: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

(19)

Spark Transformations

Transformations: create a new RDD from an existing one
▸Lazy evaluation – results are not materialized
▸Much more than map and reduce
▸Examples: map, filter, sample, groupByKey, sortByKey, reduceByKey, union, pipe, repartition, join, leftOuterJoin, rightOuterJoin

Actions: return a value or dataset to the calling program
▸Examples: reduce, collect, count, first, take(n), saveAsTextFile
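The lazy-evaluation idea can be illustrated without Spark: Python generators behave like transformations (building the pipeline does no work), and only a consuming call, analogous to an action such as collect(), materializes results. A toy sketch with made-up log lines:

```python
# Lazy "transformations" with generators: nothing runs yet.
lines = ["ERROR disk full", "INFO ok", "ERROR HDFS timeout"]

errors = (l for l in lines if l.startswith("ERROR"))  # like filter()
words  = (l.split(" ")[1] for l in errors)            # like map()

# Only the "action" pulls data through the pipeline, like collect().
result = list(words)
print(result)  # ['disk', 'HDFS']
```

As in Spark, the engine sees the whole pipeline before any data flows, which is what makes whole-plan optimization possible.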

(20)

Adding Schema to RDDs
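The idea behind this slide can be sketched in plain Python (hypothetical data, no Spark required): an RDD of plain tuples gains a schema by pairing each position with a column name, which is roughly what a SchemaRDD/DataFrame does.

```python
# An "RDD" of plain tuples plus a schema = queryable named records.
rdd = [(1, "alice", 34), (2, "bob", 29)]
schema = ("id", "name", "age")

records = [dict(zip(schema, row)) for row in rdd]

# With names attached, SQL-like expressions become possible:
adults = [r["name"] for r in records if r["age"] > 30]
print(adults)  # ['alice']
```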

(21)

Hadoop-SQL Integration

(22)

SQL Engines with Pluggable Storage

Approach
▸Distributed, parallel SQL engine
▸Often uses Hive metadata
▸Support for optimized file formats
▸Hadoop as mandatory basis

General advantages
▸Significantly faster than Hive
▸Low latency by avoiding MapReduce
▸Operator pipelining and caching
▸Scalable

General disadvantages
▸Usually no transaction support

(Diagram: SQL query coordination over local agents; pluggable sources: HDFS, Hive, JSON, Parquet, Cassandra, MySQL, Oracle, …)

(23)

Google Dremel

(24)

Apache Drill

Self-Service Data Integration
▸No metadata repository required
▸Dynamic schema discovery: metadata automatically extracted from data sources
  - RDBMS and Hive (comprehensive), HBase (partial) or files (on-the-fly)
  - Utilizes self-describing data formats (Parquet, JSON, Avro)
▸SQL DDL can be used to create metadata explicitly

ANSI SQL plus Flexible Data Model
▸Fully ANSI-compliant SQL
▸DrQL with SQL extensions for nested data structures (like JSON)

(25)

Apache Drill: SQL for JSON

select name, flatten(fillings) as f
from dfs.users.`/donuts.json`
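What flatten does can be sketched in plain Python (the record mirrors the donuts.json example; the field values are assumptions): one output row is emitted per element of the nested array.

```python
import json

# A donuts.json-style record with a nested array (hypothetical values).
record = json.loads('{"name": "Cake", "fillings": ["custard", "jelly"]}')

# FLATTEN: one output row per element of the nested array.
rows = [(record["name"], f) for f in record["fillings"]]
print(rows)  # [('Cake', 'custard'), ('Cake', 'jelly')]
```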

(26)

Apache Drill: SQL for Heterogeneous Data

Formats: JSON, CSV, ORC, Parquet, HBase tables, MongoDB

select USERS.name, USERS.emails.work
from dfs.logs.`/data/logs` LOGS,
     dfs.users.`/profiles.json` USERS
where LOGS.uid = USERS.uid
  and errorLevel > 5

(27)

Hadoop-SQL Integration

(28)

External Tables in HDFS

Logical mapping of HDFS files into the RDBMS:

CREATE TABLE SCOTT.SALES_HDFS_EXT_TAB (
  PROD_ID        NUMBER(6),
  CUST_ID        NUMBER,
  TIME_ID        DATE,
  CHANNEL_ID     CHAR(1),
  PROMO_ID       NUMBER(6),
  QUANTITY_SOLD  NUMBER(3),
  AMOUNT_SOLD    NUMBER(10,2) )
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY SALES_EXT_DIR
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
    ( PROD_ID DECIMAL EXTERNAL, … )
    PREPROCESSOR HDFS_BIN_PATH:hdfs_stream )
  LOCATION ('file_sales_1', 'file_sales_2', 'files_sales_3') );

(29)

RDBMS with Hadoop Integration

Approach
▸Towards genuine integration of Hadoop into the RDBMS
▸Utilize Hadoop's computational power
▸Cost-based choice of execution location

Advantages
▸Easiest way to use Hadoop as a data source
▸Combined access to traditional and new data sources

Disadvantages
▸Cost
▸Limited data sources
▸Vendor lock-in

(Diagram: parallel database with relational and external tables; SQL query coordination and MR loader over HDFS/HBase with MapReduce)

(30)

RDBMS with Hadoop Integration

Products
▸Microsoft PolyBase (part of the MS Analytics Platform)
▸Oracle Big Data SQL (part of the Oracle Big Data Appliance in combination with Exadata)

Use Cases
▸Extension of a traditional BI system
▸Data-lake scenario with the RDBMS as primary system and Hadoop for mass data
▸Mix of analytic and transactional load

(31)

Oracle Big Data SQL

▸Integrates Hive metadata
▸Allows hybrid queries: includes Hadoop and NoSQL data in relational queries
▸Uses Exadata

(32)

Hadoop-SQL Integration

(33)


(34)


File Formats for Big Data

Text formats
▸High storage usage, low compression
▸Bad scan performance

Dedicated formats (e.g. DB-internal)
▸Not open, no interoperability

Requirements for big data
▸Interoperability
▸Low storage usage / good compression
▸High performance
▸Flexible schema

A schema for a file format? Query performance for a file format?
Big data often has a nested structure and multiple schema variants and versions.

(35)
(36)


Considerations for File Formats

Query Tools
• None
• Frameworks like MapReduce, Spark, Cascading
• Query engines like Pig, Hive, Impala

Schema Versioning
• Schema present? If so, can it change?

Splittability
• Partitioning: can files be split for distributed processing?
• Example: CSV: yes, XML: partial, MP4: no

Block Compression
• Can blocks be independently compressed and distributed?
• Block compression is a prerequisite for partition compression!

File Size
• Size in bytes and number of files?
• Hadoop likes big, splittable files!
• Lots of small files cost performance

Load Profile
• Write performance
• Filter operations
• Reading of single columns
• Full scans

(37)

Example Row Format: Avro

Schema specification
▸Stored internally in binary format
▸Self-describing files

Reader vs. writer schema
▸Allows different "views" on files: data is written with the writer schema and read with the reader schema; resolution rules map one to the other

Example schema (Avro IDL):
record Person {
  string userName;
  union { null, long } favouriteNumber;
  array<string> interests;
}
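The reader-vs-writer-schema resolution can be sketched in plain Python (a simplification, not the real Avro library: only missing-field defaults and field dropping are shown, not type promotion):

```python
def resolve(record, reader_schema):
    """Avro-style schema resolution, simplified: keep only fields the
    reader knows, falling back to the reader's default for missing ones."""
    return {name: record.get(name, default)
            for name, default in reader_schema.items()}

# The writer wrote userName and interests; the reader also
# expects favouriteNumber (nullable, default null/None).
written = {"userName": "alice", "interests": ["sql", "hadoop"]}
reader_schema = {"userName": None, "favouriteNumber": None, "interests": None}

print(resolve(written, reader_schema))
# {'userName': 'alice', 'favouriteNumber': None, 'interests': ['sql', 'hadoop']}
```

This is what lets old readers consume new files and vice versa: the two schemas evolve independently as long as the resolution rules can bridge them.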

(38)

Example Column Format: Parquet

Columnar formats in general
▸Trade faster reads for slower writes
▸Very good compression

Parquet files
▸Hybrid partitioning – sets of records in blocks, columnar within blocks
▸Zone maps per block as a kind of index (min/max values per column)
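Zone-map pruning can be sketched in a few lines of Python (toy data; real Parquet stores such statistics per column chunk): blocks whose min/max range cannot contain the predicate value are skipped without being read.

```python
# Per-block min/max statistics ("zone maps") let a scan skip blocks
# that cannot contain matching rows, here for the predicate value == 42.
blocks = [
    {"min": 1,  "max": 10, "values": [1, 5, 10]},
    {"min": 40, "max": 50, "values": [40, 42, 50]},
    {"min": 90, "max": 99, "values": [90, 99]},
]

def scan_eq(blocks, target):
    hits, scanned = [], 0
    for b in blocks:
        if b["min"] <= target <= b["max"]:  # block might contain the value
            scanned += 1
            hits += [v for v in b["values"] if v == target]
    return hits, scanned

hits, scanned = scan_eq(blocks, 42)
print(hits, scanned)  # [42] 1  -> two of three blocks were skipped
```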

(39)

Databases as a Lego Construction Kit!?

Traditional monolithic RDBMS: SQL processor, distributed execution, storage management and data dictionary in one closed system.

Hadoop DB building blocks: SQL processor on top of generic execution engines (MapReduce, Spark) over open file formats (CSV, Seq, Avro, JSON, ORC, Parquet)
• Generic execution engine
• Metadata sharing in the Hive repository or self-describing file formats

(40)


(41)

Considerations for SQL on Hadoop Solutions

SQL Functionality

▸Coverage of SQL standard

▸User-defined functions

▸Transactional Safety

Performance and Stability

▸Multi-user workloads

▸Efficiency of joins and aggregations (I/O problems? Size limits?)

Supported Data and Storage Formats

▸Logical Format: relational, JSON, none, …

▸Physical Formats: CSV, Parquet, ORC, Avro, …

Intelligent Storage Plugins / Data Federation

▸Access to various data sources beyond Hadoop

▸Pushdown predicates, access selected columns only
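What predicate and projection pushdown buys can be sketched with a hypothetical storage-plugin interface in Python (names and data are illustrative): the source applies the filter and returns only the requested columns, so far less data crosses the wire.

```python
# Hypothetical storage-plugin interface illustrating predicate and
# projection pushdown: filtering and projection happen at the source,
# not in the query engine.
table = [
    {"uid": 1, "name": "alice", "errorLevel": 7},
    {"uid": 2, "name": "bob",   "errorLevel": 2},
    {"uid": 3, "name": "carol", "errorLevel": 9},
]

def scan(rows, columns, predicate):
    """The source applies the predicate and projects columns itself."""
    return [{c: r[c] for c in columns} for r in rows if predicate(r)]

result = scan(table, ["name"], lambda r: r["errorLevel"] > 5)
print(result)  # [{'name': 'alice'}, {'name': 'carol'}]
```

Without pushdown, the engine would pull all rows and all columns and filter locally; with it, only two single-column rows leave the source.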

(42)

Hadoop vs. SQL

Hadoop SQL technologies
▸Supplement traditional RDBMS
▸Extend traditional RDBMS
▸Develop new RDBMS

The SQL universe gets wider. Hadoop and SQL move closer together. Database systems become open and modular.

(43)

References

▸ L. Chang et al.: HAWQ: A Massively Parallel Processing SQL Engine in Hadoop. Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1223-1234
▸ A. Floratou, U. Minhas, F. Özcan: SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures. Proceedings of the VLDB Endowment, Vol. 7, No. 12, 2014
▸ M. Hausenblas, J. Nadeau: Apache Drill: Interactive Ad-Hoc Analysis at Scale. Big Data Magazine, June 2013
▸ M. Kornacker et al.: Impala: A Modern, Open-Source SQL Engine for Hadoop. 7th Biennial Conference on Innovative Data Systems Research (CIDR'15)
▸ D. J. DeWitt et al.: Split Query Processing in Polybase. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 1255-1266
▸ S. Melnik et al.: Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB, 3(1-2):330-339, 2010
▸ M. Zaharia et al.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012
▸ J. Albrecht, S. Alexander: Hadoop und SQL rücken enger zusammen. Computerwoche, Nov. 2013, http://www.computerwoche.de/a/hadoop-und-sql-ruecken-enger-zusammen,2549475
▸ C. Deptula: Hadoop File Formats: It's Not Just CSV Anymore. Blog post, 2014, http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore
▸ J. Le Dem: Efficient Data Storage for Analytics with Apache Parquet 2.0. Hadoop Summit 2014, http://de.slideshare.net/cloudera/hadoop-summit-36479635
▸ M. Rathbone: 8 SQL-on-Hadoop Frameworks Worth Checking Out. Blog post, 2014, http://blog.matthewrathbone.com/2014/06/08/sql-engines-for-hadoop.html
▸ P. Srivati: Resilient Distributed Datasets (RDD) for the Impatient. Blog post, 2014, http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html
▸ S. Yegulalp: 10 Ways to Query Hadoop with SQL.
