Processing Big Data
With SQL on Hadoop
Jens Albrecht
Prof. Dr. Jens Albrecht 3
Why SQL for Big Data?
Mature technology
Broad knowledge available Powerful query language
High interactive performance Many third party tools for data analysis and visualization
Flexible data structures
Semi-structured data Changing schemas
Self-Service
Data integration on-the-fly
Scalability
Analysis, integration, volumen
(Relatively) Low Cost
Commodity Hardware, Open Source
Agility
Why SQL for Big Data?
+
Extended
Prof. Dr. Jens Albrecht 5
Hadoop-SQL Integration
HDFS MR / Tez
HiveQL
Hive (Native Hadoop) Pure Hadoop SQL Engines
HDFS Distributed SQL Engine HDFS Distributed SQL Engine NoSQL Hive Format-agnostic SQL Engines RDBMS Relational
RDBMS with Hadoop Access
Hadoop-SQL Integration
HDFS MR / Tez
HiveQL
Hive (Native Hadoop) Pure Hadoop SQL Engines
HDFS Distributed SQL Engine HDFS Distributed SQL Engine NoSQL Hive Format-agnostic SQL Engines RDBMS Relational
RDBMS with Hadoop Access
Prof. Dr. Jens Albrecht 7 MapReduce
Hive
Hive
General
▸Developed initially by Facebook ▸SQL-processing for HDFS and HBase ▸Table definitions in Hive Meta Store ▸Generation of MapReduce Code
▸Schema-on-Read via SerDe
Advantages
▸Mature part of every Hadoop
distribution
▸Simple setup
▸Java-API for UDFs
▸Usage of many data formats via SerDe Disadvantages ▸Batch-oriented, slow Hadoop HDFS/HBase Meta Store HiveQL Execution SerDe
Schema-on-Write vs. Schema-on-Read
Relational Database: Schema-on-Write
Big Data Processing: Schema-on-Read
Multi-structured Source Data ETL Relational DBMS SQL Multi-structured Source Data Load as-is Hadoop SQL Schema mapped to original filesProf. Dr. Jens Albrecht 9
Schema-on-Read: Hive & CSV
CREATE TABLE gps_data( userid INT,
deviceid INT,
longitude STRING, latitude STRING, utctime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
-- load data = copy
LOAD DATA LOCAL INPATH 'new_data/gps.dat' OVERWRITE INTO TABLE gps_data;
SELECT COUNT(*) FROM gps_data;
4711 542815 49.454N 11.077E 10/01/2014@10:00:00UTC Sample Data
Schema-on-Read: Hive & Regexp
CREATE TABLE weblog ( host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING, size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?" ) STORED AS TEXTFILE;
Prof. Dr. Jens Albrecht 11
Schema-Read: Hive & JSON
Source: http://thornydev.blogspot.de/2013/07/querying-json-records-via-hive.html
MapReduce Hive
Hive on Tez (Stinger)
Stinger-Initiative
▸"Make Hive 100x faster"
▸Finished with Hive 0.13 (April 2014)
▸Replace MapReduce with Tez
▸Native columnar data format (ORC)
Stinger.Next
▸Phase 1: Hive 0.14 (November 2014)
▸ACID transactions
▸Phase 2: (Q2 2015)
▸Subsecond Queries mit LLAP ▸Machine Learning Integration
▸Phase 3: (Q4 2015) ▸SQL:2011 Analytic Functions ▸Materialized Views Hadoop HDFS/HBase Tez / Yarn Meta Store HiveQL Execution SerDe
Prof. Dr. Jens Albrecht 13
Hadoop-SQL Integration
HDFS MR / Tez
HiveQL
Hive (Native Hadoop) Pure Hadoop SQL Engines
HDFS Distributed SQL Engine HDFS Distributed SQL Engine NoSQL Hive Format-agnostic SQL Engines RDBMS Relational
RDBMS with Hadoop Access
Distributed SQL Engine
Pure Hadoop SQL Engines
Approach
▸Distributed, parallel SQL engine ▸Often usage of Hive Metadata
▸Support of optimized data formats ▸Hadoop as mandatory basis
Advantages and Disadvantages ▸Significantly faster as Hive
▸Low latency through dedicated engine ▸Operator pipelining and result caching Diffentiation of solutions
▸Supported SQL functionality ▸Point querying
▸Cost-based optimizer / performance ▸Transaction support HDFS/HBase Local Agent Local Agent Local Agent Local Agent SQL Query Coordination Data Files Data Files Data Files Data Files
Prof. Dr. Jens Albrecht 15
Distributed SQL Engine
Pure Hadoop SQL Engines
HDFS/HBase Local Agent Local Agent Local Agent Local Agent SQL Query Coordination Data Files Data Files Data Files Data Files Big Insights
Pure Hadoop SQL Engines
Prof. Dr. Jens Albrecht 17
Apache Spark & Spark SQL
General
▸SQL engine based on Spark
▸Data access via data frames (former
SchemaRDD)
▸In-Memory columnar format ▸HDFS / HBase as file format Advantages
▸Spark as general-purpose parallel
computing framework
▸Support of Hive extensions like UDFs and
SerDes and Hive metadata
Disadvantages
▸Not yet fully mature
▸Not yet as fast as competitors Spark SQL HDFS/HBase Apache Spark Schema RDD SQL Execution SerDe
Apache Spark
Distributed In-Memory Computing Framework
▸Data caching
▸General framework for all kinds of SQL and non-SQL analytics
▸Support for out-of-the box libraries as well as Java, Python and Scala in the same
engine and for the same data
▸New datasources API allows to write plugins for non-Hadoop sources
Spark Execution Engine
Hadoop ZooKeeper YARN (optional) HDFS Spark SQL Spark Streaming Graph Computation (GraphX) Machine Learning (MLlib)
Prof. Dr. Jens Albrecht 19
Apache Spark
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith(ERROR)) errors.persist()
// Return the time fields of errors mentioning // assuming time is field number 3 in a tsv file hdfs_errors = errors.filter(_.contains(HDFS))
time_fields = hdfs_errors.map(_.split(’\t’)(3)).collect()
Source: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf Action Resilient Distributed Data Sets (RDDs) Transformations
Spark Transformations
Transformations: Create a new RDD from an existing one
▸
Lazy evaluation – results are not materialized
▸
Much more than map reduce
Actions: Return a value or dataset to the calling program
map filter sample
groupByKey sortByKey reduceByKey union pipe repartition join leftOuterJoin rightOuterJoin …
reduce collect count
first take(n) saveAsTextFile …
Prof. Dr. Jens Albrecht 22
Adding Schema to RDDs
Hadoop-SQL Integration
HDFS MR / Tez
HiveQL
Hive (Native Hadoop) Pure Hadoop SQL Engines
HDFS Distributed SQL Engine HDFS/ Hive Distributed SQL Engine NoSQL SQL Format-agnostic SQL Engines RDBMS Relational
RDBMS with Hadoop Access
Prof. Dr. Jens Albrecht 24
Distributed SQL Engine
SQL-Engine with Pluggable Storage
Ansatz
▸Verteilte, parallele SQL-Engine
▸Oftmals Nutzung von Hive Metadaten ▸Unterstützung optimierter Dateiformate ▸Hadoop obligatorisch als Basis
Allgemeine Vorteile
▸Deutlich schneller als Hive
▸Geringe Latenzzeiten durch Vermeidung
von Map-Reduce
▸Operatoren-Pipelining und Caching ▸Skalierbar
Allgemeine Nachteile ▸In der Regel keine
Transaktionsunterstützung Local Agent Local Agent Local Agent Local Agent SQL Query Coordination
HDFS Hive JSON Parquet
Cas-sandra MySQL Oracle …
Google Dremel
Prof. Dr. Jens Albrecht 26
Apache Drill
Self-Service Data Integration
▸No metadata repository required
▸Dynamic schema discovery: Metadata automatically extracted for data sources
▸ RDBMS and Hive (comprehensive), HBase (partial) or files (on-the-fly) ▸ Utilizes self-describing data formats (Parquet, JSON, AVRO)
▸SQL-DDL can be used to create metadata explicitly ANSI-SQL plus Flexible Data Model
▸Fully ANSI compliant SQL
▸DrQL with SQL extensions for nested data structures (like JSON)
Apache Drill: SQL for JSON
select name, flatten(fillings) as f from dfs.users.`/donuts.json`
Prof. Dr. Jens Albrecht 28
Apache Drill: SQL for Heterogeneous Data
Formats
▸
JSON
▸
CSV
▸
ORC
▸
Parquet
▸
HBase tables
▸
MongoDB
select USERS.name, USERS.emails.work from
dfs.logs.`/data/logs` LOGS,
dfs.users.`/profiles.json` USERS, where
LOGS.uid = USERS.uid and errorLevel > 5
Hadoop-SQL Integration
HDFS MR / Tez
HiveQL
Hive (Native Hadoop) Pure Hadoop SQL Engines
HDFS Distributed SQL Engine HDFS Distributed SQL Engine NoSQL Hive Format-agnostic SQL Engines RDBMS Relational
RDBMS with Hadoop Access
Prof. Dr. Jens Albrecht 30
External Tables in HDFS
HDFS RDBMS
Logical Mapping
CREATE TABLE SCOTT.SALES_HDFS_EXT_TAB ( PROD_ID NUMBER(6), CUST_ID NUMBER, TIME_ID DATE, CHANNEL_ID CHAR(1), PROMO_ID NUMBER(6), QUANTITY_SOLD NUMBER(3), AMOUNT_SOLD NUMBER(10,2) ) ORGANIZATION EXTERNAL ( TYPE ORACLE_LOADER
DEFAULT DIRECTORY SALES_EXT_DIR ACCESS PARAMETERS
( RECORDS DELIMITED BY NEWLINE FIELDS TERMINATED BY ','
(
PROD_ID DECIMAL EXTERNAL, … ) PREPROCESSOR HDFS_BIN_PATH:hdfs_stream ) LOCATION ( 'file_sales_1', 'file_sales_2', 'files_sales_3') );
RDBMS with Hadoop Integration
Approach
▸Towards genuine integration of
Hadoop into RDBMS
▸Utilize Hadoop's computational
power
▸Cost-based choice Advantages
▸Easiest way to use Hadoop as data
source
▸Combined access to traditional and
new data sources
Disadvantages
▸Cost
▸Limited data sources ▸Vendor lock-in HDFS/HBase Map Reduce Parallel Database Relational Tables External Tables SQL Query Coordination MR Loader
Prof. Dr. Jens Albrecht 32
RDBMS with Hadoop Integration
Products
▸Microsoft Polybase (part of MS
Analytics Platform)
▸Oracle Big Data SQL (part of Oracle
Big Data Appliance in combination with Exadata)
Use Cases
▸Extension of traditional BI System ▸Data-lake scenario with RDBMS as
primary system and Hadoop for mass data
▸Mix of analytic and transactional
load HDFS/HBase Map Reduce Parallel Database Relational Tables External Tables SQL Query Coordination MR Loader
Oracle Big Data SQL
Integrates
Hive Metadata
Allows hybrid queries
▸
Include Hadoop and NoSQL in
relational queries
▸
Use Exadata
Prof. Dr. Jens Albrecht 34
Hadoop-SQL Integration
HDFS MR / Tez
HiveQL
Hive (Native Hadoop) Pure Hadoop SQL Engines
HDFS Distributed SQL Engine HDFS Distributed SQL Engine NoSQL Hive Format-agnostic SQL Engines RDBMS Relational
RDBMS with Hadoop Access
Hadoop
>
Prof. Dr. Jens Albrecht 36
File formats for Big Data
Text Formats
High storage usage bad scan performance Low compression
bad scan performance
Dedicated Formats (e.g. DB internal) Not open, no interoperability
Requirements for Big Data Interoperability
Low storage / good compression
High performance Flexible schema
A schema for a file format??
Query performance for a file format??
Big Data often have a nested structure and multiple schema variants and –versions
Prof. Dr. Jens Albrecht 38
Considerations for File Formats
Query Tools• none
• Frameworks like MapReduce, Spark, Cascading
• Query Engines like Pig, Hive, Impala
Schema Versioning • Schema present?
• If so, can it change?
Splittability • Partitioning
• Splitting of files possible for distributed processing? • Example: CSV: yes,
XML: partial, MP4: no
Block Compression • Can blocks be independently
compressed and distributed? • Block compression is a
prerequisite for partition compresssion!
File Size
• Size in bytes and number of files?
• Hadoop likes big, splittable files!
• Lots of small files cost performance
Load Profile • Write Performance • Filter operations
• Reading of single columns • Full scans
Example Row-Format: Avro
Schema specification ▸internally stored
in binary format
▸self-describing files
Reader- vs. Writer Schema
▸Allows different"Views" on files
Writer Schema Writer Schema Reader Schema Avro Parser Avro Data Resolution Rules Read Write record Person { string userName;
union { null, long } favouriteNumber; array<string> interests;
Prof. Dr. Jens Albrecht 40
Example Column Format:
Columnar formats in general
▸Trade faster reads for slower writes ▸Very good compression
Parquet Files
▸Hybrid Partitioning – sets of records in blocks, columnar within blocks ▸Zone maps per block as kind of index (min/max values per column)
Databases as Lego Construction Kit!?
SQL Prozessor Verteilte Ausführung Speicherverwaltung D at a D ic ti o n ar y SQL Traditional monolithic RDBMS SQL Prozessor Map Reduce Spark CSV Seq AvroJSON ORC Parquet
SQL
Hadoop DB building blocks
• Generic Execution Engine
• Metadata sharing in Hive Repository or self-describing file formats
Prof. Dr. Jens Albrecht 42
>
Considerations for SQL on Hadoop Solutions
SQL Functionality▸Coverage of SQL standard
▸User-defined functions
▸Transactional Safety
Performance and Stability
▸Multi-user workloads
▸Efficiency of joins and aggregations (I/O problems? Size limits?)
Supported Data and Storage Formats
▸Logical Format: relational, JSON, none, …
▸Physical Formats: CSV, Parquet, ORC, Avro, …
Intelligent Storage Plugins / Data Federation
▸Access to various data sources beyond Hadoop
▸Pushdown predicates, access selected columns only
Prof. Dr. Jens Albrecht 44 Hadoop Hadoop SQL SQL
Hadoop vs. SQL
Hadoop SQL Technologies ▸Supplement traditional RDBMS ▸Extend traditional RDBMS ▸Develop new RDBMSSQL universe
gets wider.
Hadoop and SQL
move closely
together.
Database
systems become
open and
modular.
References
▸ L. Chang et al.: HAWQ: A Massively Parallel Processing SQL Engine in Hadoop.
In: Proceedings of the 2014 ACM SIGMOD international conference on Management of Data, Pages 1223-1234
▸ A. Floratour, U. Minhas, F. Özcan: SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures. Proceedings of the VLDB Endowment, Vol. 7, No. 12, 2014
▸ M. Hausenblas, J. Nadeau: Apache Drill: Interactive Ad-Hoc Analysis at Scale.
Big Data Magazine, June 2013
▸ M. Kornacker, e.a.: Impala: A Modern, Open-Source SQL Engine for Hadoop. 7th Biennial Conference on Innovative Data Systems Research (CIDR’15)
▸ D. J. DeWitt, e.a.: Split Query Processing in Polybase. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Pages 1255-1266
▸ S. Melnik, e.a.: Dremel: Interactive Analysis of Web-scale Datasets. PVLDB, 3(1-2):330–339, 2010
▸ M. Zaharia, e.a.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012
▸ J. Albrecht, S. Alexander: Hadoop und SQL rücken enger zusammen.
Computerwoche, Nov. 2013, http://www.computerwoche.de/a/hadoop-und-sql-ruecken-enger-zusammen,2549475 ▸ C. Deptula: Hadoop File Formats: Ist not Just CSV Anymore.
Blog Eintrag, 2014, http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore ▸ J. Le Dem: Efficient Data Storage Analytics with Apache Parquet 2.0,
Hadoop Summit 2014, http://de.slideshare.net/cloudera/hadoop-summit-36479635 ▸ M. Rathbone: 8 SQL-on-Hadoop frameworks worth checking out.
Blog Eintrag, 2014, http://blog.matthewrathbone.com/2014/06/08/sql-engines-for-hadoop.html ▸ P. Srivati: Resilient Distributed Datasets (RDD) for the impatient.
Blog Eintrag, 2014, http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html ▸ S. Yegulalp: 10 ways to query Hadoop with SQL.