Self-service BI for big data applications using Apache Drill

(1)

(2)

[Figure: MapR Data Platform for Hadoop and NoSQL, with management through MCS, underneath the Apache Hadoop and OSS ecosystem. Category labels: Execution Engines (Batch, Streaming, NoSQL & Search, ML/Graph, SQL) and Data Governance and Operations (Security, Workflow & Data Governance, Provisioning & coordination, Data Integration & Access). Components shown: YARN, Pig, Cascading, Spark, Spark Streaming, Storm, HBase, Solr, Mahout, MLLib, GraphX, MapReduce v1 & v2, Tez*, Hive, Impala, Spark SQL, Drill, Sentry, Knox, Oozie, Falcon, ZooKeeper, Juju, Savannah, Whirr, Sqoop, Flume, HttpFS, Hue. Platform attributes: Enterprise-grade, Interoperability, Performance, Multi-tenancy, Security, Operational.]

(3)

Data Is Doubling Every Two Years

Unstructured data will account for more than 80% of the data collected by organizations.

[Figure: Total Data Stored plotted from 1980 to 2020. Labels: Structured Data, Semi-Structured Data, IT Resources.]

(4)

Data Increasingly Stored in Non-Relational Datastores

| RELATIONAL DATABASES | NON-RELATIONAL DATASTORES
Volume | GBs-TBs | TBs-PBs
Database | Fixed schema; DBA controls structure | Dynamic / flexible schema; application controls structure
Structure | Structured | Structured, semi-structured and unstructured
Development | Planned (release cycle = months-years) | Iterative (release cycle = days-weeks)

[Figure: timeline from 1980 to 2020.]

(5)

How To Bring SQL Into An Unstructured Future?

Familiarity of SQL

Agility & Flexibility of NoSQL

• SQL

• BI (Tableau, MicroStrategy, etc.)

• Low latency

• Scalability

• No schema management

– HDFS (Parquet, JSON, etc.)

– HBase

– …

(6)

Industry's First Schema-free SQL engine

(7)

Apache Drill Brings Flexibility & Performance

Access to any data type, any data source

• Relational
• Nested data
• Schema-less

Rapid time to insights

• Query data in-situ
• No schemas required
• Easy to get started

Integration with existing tools

• ANSI SQL

• BI tool integration

Scale in all dimensions

• TB-PB of scale
• 1000s of users
• 1000s of nodes
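To make "query data in-situ" concrete, here is a minimal sketch of how a SQL query over a raw JSON file might be submitted to Drill's REST interface. It assumes a Drillbit web endpoint on localhost at Drill's default HTTP port 8047; the file path and the helper name `build_drill_request` are hypothetical. The request is only built, not sent, so the sketch runs without a cluster.

```python
import json
import urllib.request

# Assumption: a Drillbit web endpoint is reachable here (8047 is Drill's
# default HTTP port). Adjust host and port for a real cluster.
DRILL_URL = "http://localhost:8047/query.json"

# Hypothetical file path. The query names the file directly through the
# dfs storage plugin; no table or schema is declared beforehand.
SQL = "SELECT * FROM dfs.`/data/clicks/2015-05.json` LIMIT 10"


def build_drill_request(sql, url=DRILL_URL):
    """Build (but do not send) an HTTP POST that submits a SQL query to Drill."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_drill_request(SQL)

# Against a live Drillbit, the request could be sent like this:
#   with urllib.request.urlopen(request) as response:
#       rows = json.load(response)["rows"]
```

BI tools would normally connect through ODBC or JDBC instead; the REST call is used here only because it needs nothing beyond the Python standard library.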

(8)

Extending Self Service to Schema-free data

[Figure: use cases for BI, ordered by increasing agility and business value.]

• IT-Driven BI: IT-created reports, spreadsheets
• Self-Service BI: analyst-driven with IT support for ETL
• Schema-Free Data Exploration: analyst-driven with no IT dependency

(9)

Enabling “As-It-Happens” Business with Instant Analytics

Governed approach: Hadoop data -> data modeling -> transformation -> data movement (optional) -> users
Total time to insight: weeks to months

Exploratory approach: Hadoop data -> users

[Figure labels: New business questions, Source data evolution.]

(10)

Drill’s Role in the Enterprise Data Architecture

Raw data

• JSON, CSV, ...

“Optimized” data

• Parquet, …

Centrally-structured data

• Schemas in Hive Metastore

Relational data

• Highly-structured data

Hive, Impala, Spark SQL

Oracle, Teradata

Exploration

(11)

Access control that scales

• PAM authentication + user impersonation
• Fine-grained row and column level access control with Drill views; no centralized security repository required

[Figure: users (U) reach Files, HBase and Hive through Drill View 1 and Drill View 2.]

(12)

Granular security permissions through Drill views

Raw File (/raw/cards.csv)
Owner: Admins. Permission: Admins.

Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-7914-3865-4817
John | Boulder | CO | 1374-9735-1794-9711

Data Scientist View (/views/maskedcards.csv)
Owner: Admins. Permission: Data Scientists. Not a physical data copy.

Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-1111-1111-1111
John | Boulder | CO | 1374-1111-1111-1111

Business Analyst View
Owner: Admins. Permission: Business Analysts. Not a physical data copy.

Name | City | State
Dave | San Jose | CA
John | Boulder | CO
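The two views on this slide can be imitated in a few lines of plain Python. This is only an illustration of the idea, not Drill itself: in Drill the views would be defined in SQL and protected by permissions on the view definitions. The function names are hypothetical, and the masking rule (keep the first group of digits, replace the rest with 1s) is inferred from the sample rows on the slide.

```python
# Sample rows copied from the slide; in Drill they would live in /raw/cards.csv.
RAW_CARDS = [
    {"name": "Dave", "city": "San Jose", "state": "CA", "card": "1374-7914-3865-4817"},
    {"name": "John", "city": "Boulder", "state": "CO", "card": "1374-9735-1794-9711"},
]


def mask_card(card):
    """Keep the first group of digits and replace every other digit with 1."""
    first, *rest = card.split("-")
    return "-".join([first] + ["1" * len(group) for group in rest])


def data_scientist_view(rows):
    """Column-level masking: all columns are visible, the card number is masked."""
    for row in rows:
        yield {**row, "card": mask_card(row["card"])}


def business_analyst_view(rows):
    """Column-level restriction: the card number is not exposed at all."""
    for row in rows:
        yield {key: row[key] for key in ("name", "city", "state")}


# Both views are computed lazily from the raw rows on each read; nothing is
# copied or stored, which mirrors "not a physical data copy".
masked = list(data_scientist_view(RAW_CARDS))
restricted = list(business_analyst_view(RAW_CARDS))
```

The raw rows stay untouched, so access to the original card numbers can remain limited to the Admins group while each audience reads only through its own view.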

(13)

Business Benefits

Rapid time-to-value for business analysts:
SQL specialists and BI analysts can query any dataset, including complex nested data, instantly, versus waiting several weeks for data preparation by IT.

Efficiency with easy governance for IT:
IT can avoid unnecessary ETL cycles and schema maintenance activities, but still ensure governance through easy-to-deploy granular access controls.

Accelerated big data adoption for businesses:

(14)

Quick Tour

(15)

Data is growing fast and is scattered across various silos:

Website click logs

• JSON files

Customers

(16)

Apache Drill: SQL in a Non-Relational World

• ANSI SQL

• BI (Tableau, MicroStrategy, etc.)

• Low latency

• Scalability

• Agility

• Create and maintain schemas in advance:

– HDFS (Parquet, JSON, etc.)

– HBase

– …

• Transform, copy, or move data

(17)

Closing The Gap Between Different Datasources using Drill

Product database

• Prod_id
• Productname
• Category
• Price

Website click logs

• Trans_id
• Sess_date
• Cust_id
• Device
• Prod_id
• Purch_flag

Customers

• Cust_id
• Customername
• State
• Gender
• Agg_rev
• Age
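The three sources above share the keys Prod_id and Cust_id, which is what makes a single cross-source query possible. The sketch below shows the shape of such a join in plain Python over in-memory stand-ins; with Drill the same logic would be one SQL statement joining the three sources. All sample values and the function name `revenue_by_state` are hypothetical.

```python
import json

# Hypothetical click-log lines, one JSON document per line.
CLICK_LOG_JSON = """
{"trans_id": 1, "sess_date": "2015-05-01", "cust_id": 10, "device": "mobile", "prod_id": 100, "purch_flag": true}
{"trans_id": 2, "sess_date": "2015-05-01", "cust_id": 11, "device": "desktop", "prod_id": 101, "purch_flag": false}
{"trans_id": 3, "sess_date": "2015-05-02", "cust_id": 10, "device": "mobile", "prod_id": 101, "purch_flag": true}
"""

# Hypothetical rows from the product database, keyed by prod_id.
PRODUCTS = {
    100: {"productname": "Widget", "category": "Tools", "price": 20.0},
    101: {"productname": "Gadget", "category": "Toys", "price": 5.0},
}

# Hypothetical rows from the customer source, keyed by cust_id.
CUSTOMERS = {
    10: {"customername": "Dave", "state": "CA"},
    11: {"customername": "John", "state": "CO"},
}


def revenue_by_state(click_lines, products, customers):
    """Join clicks to products and customers, then sum purchase prices per state."""
    totals = {}
    for line in click_lines.strip().splitlines():
        click = json.loads(line)
        if not click["purch_flag"]:
            continue  # keep only sessions that ended in a purchase
        state = customers[click["cust_id"]]["state"]
        price = products[click["prod_id"]]["price"]
        totals[state] = totals.get(state, 0.0) + price
    return totals


result = revenue_by_state(CLICK_LOG_JSON, PRODUCTS, CUSTOMERS)
```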

(18)

(19)

In lieu of the live demonstration, please find links below:

• Apache Drill with Tableau (4:28):

https://www.youtube.com/watch?v=EH0_vRTAkyk

• Twitter analytics with Apache Drill and MicroStrategy (5:02):

https://www.youtube.com/watch?v=-gqwgahtc2Y

• Analyzing JSON and Packet Data with SAP Lumira and Apache

(20)

(21)

[Figure: Drill use cases (Raw Data Exploration, JSON Analytics, DWH Offload, …) over data sources such as {JSON}, Parquet, Text Files, …]

Self-Service Data Exploration

(22)

Data Warehouse Offload with Drill & MapR

Ultimately replace existing expensive SQL analytics platform with Hadoop

OBJECTIVES

• Mine credit card data and compare consumer shopping habits

CHALLENGES

• Require internal SQL specialists to gain instant access to data at all times
• Want to preserve instant access to data but at a lower price point
• Need a system that is reliable, does not lose data and is fast
• Must be able to leverage the SQL skill sets in the company

SOLUTION

• Apache Drill allows interactive analysis on large datasets with MapR as the underlying platform that meets scale, reliability and data protection needs
• SQL users did not have to learn Pig, HiveQL or any other language and continue to use Tableau and SQuirreL on top of Drill

Business Impact Potential

• Hadoop and Drill dramatically reduce the price point to less than $1,000 / TB
• MapR platform with Drill delivers reliability and performance for the end users
• Leverage existing BI and SQL skill-sets on Hadoop without retraining

(23)

Telecom OEM application with Drill & MapR

Leverage Drill’s JSON capabilities to create revenue-generating IoT services

OBJECTIVES

• Offer service to mobile operators to proactively monitor and improve their subscriber experience

CHALLENGES

• Instant availability of data from diverse and disparate sources
• Data is very diverse and dynamic using JSON as the key format
• Require interactive, ad-hoc analysis capabilities via standard BI tools such as Tableau and Spotfire

SOLUTION

• Apache Drill is being used to build the engine for the interactive experience
• Drill allows SQL queries on incoming JSON structures natively without requiring any centralized schema definitions
• Drill connects to all BI tools using standard ODBC connectors

Business Impact

• Provide new revenue-generating services to mobile operators
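The point about querying diverse, dynamic JSON without a centralized schema can be illustrated with a small schema-on-read sketch. This is plain Python, not Drill: it only shows the principle that field paths are discovered while the data is read and that a missing field yields no value rather than an error. The event records and all function names are hypothetical.

```python
import json

# Hypothetical subscriber events; the records do not share one fixed shape.
EVENTS = """
{"subscriber": {"id": "s1", "plan": "prepaid"}, "metrics": {"latency_ms": 120}}
{"subscriber": {"id": "s2"}, "metrics": {"latency_ms": 340, "dropped_calls": 2}}
{"subscriber": {"id": "s1", "plan": "prepaid"}, "metrics": {"latency_ms": 80}}
"""


def get_path(record, path):
    """Follow a dotted path such as 'metrics.latency_ms'; return None if absent."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def discovered_columns(records):
    """Collect the dotted paths of every leaf field seen while reading."""
    columns = set()

    def walk(node, prefix):
        for key, child in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(child, dict):
                walk(child, path)
            else:
                columns.add(path)

    for record in records:
        walk(record, "")
    return sorted(columns)


def average_latency_by_subscriber(records):
    """Group by subscriber.id and average metrics.latency_ms."""
    sums = {}
    for record in records:
        subscriber = get_path(record, "subscriber.id")
        latency = get_path(record, "metrics.latency_ms")
        if subscriber is None or latency is None:
            continue  # skip records that lack either field
        total, count = sums.get(subscriber, (0, 0))
        sums[subscriber] = (total + latency, count + 1)
    return {sub: total / count for sub, (total, count) in sums.items()}


records = [json.loads(line) for line in EVENTS.strip().splitlines()]
columns = discovered_columns(records)
averages = average_latency_by_subscriber(records)
```

A new field such as `dropped_calls` appears in the discovered columns as soon as one record carries it, with no schema change needed anywhere else.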

(24)

Recap: Apache Drill enables Self Service SQL for Big data

AGILITY: INSTANT INSIGHTS TO BIG DATA

• Direct queries on self-describing data
• No schemas or ETL required

FLEXIBILITY: ONE INTERFACE FOR HADOOP & NOSQL

• Query HBase and other NoSQL stores
• Use SQL to natively operate on complex data types (such as JSON)

FAMILIARITY: EXISTING SKILLS & TECHNOLOGIES

• Leverage ANSI SQL skills and BI tools
• Plug-n-play with Hive schema, file formats, UDFs

(25)

Learn more and get started with Apache Drill

New to MapR and/or Drill?

– Get started with Free MapR On Demand training

– Test Drive Drill on cloud with Amazon EMR

– Learn how to use Drill with Hadoop using MapR sandbox

Ready to play with your data?

– Try the Apache Drill in 10 minutes guide on your desktop

– Download Drill for your MapR cluster and start exploration

• Use both with relational and JSON datasets

– Comprehensive tutorials and documentation available

Ask questions

(26)

Thank You

@mapr

maprtech

muddenfeldt@mapr.com

mkieboom@mapr.com

MapRTechnologies

maprtech

mapr-technologies

(27)

(28)

MapR with Drill is Top-Ranked SQL-on-Hadoop

Source: Gigaom Research, 2015

Key:

• Number indicates company’s relative strength across all vectors

• Size of ball indicates company’s relative strength along individual vector

Like other vendors’ offerings, Drill handles BI and interactive queries with great aplomb, but it is designed to serve these workloads with data complexity that goes well beyond the flat structured data that other SQL-on-Hadoop systems deal with.

(29)

| Drill | Hive | Impala | Spark SQL
Key Use Cases | Self-service Data Exploration; Interactive BI / Ad-hoc queries | Batch / ETL / Long-running jobs | Interactive BI / Ad-hoc queries | SQL as part of Spark pipelines / Advanced analytic workflows

Data Sources
Files support | Parquet, JSON, Text, all Hive file formats | Yes (all Hive file formats) | Yes (Parquet, Sequence, RC, Text, AVRO…) | Parquet, JSON, Text, all Hive file formats
HBase/MapR-DB | Yes | Yes, performance issues | Yes, performance issues | Same as Hive
Beyond Hadoop | Yes | No | No | Yes

Data Types
Relational | Yes | Yes | Yes | Yes
Complex/Nested | Yes | Limited | No | Limited

Metadata
Schema-less / Dynamic schema | Yes | No | No | Limited
Hive Metastore | Yes | Yes | Yes | Yes

SQL / BI tools
SQL support | ANSI SQL | HiveQL | HiveQL | ANSI SQL (limited) & HiveQL
Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC
Beyond Memory | Yes | Yes | Yes | Yes

(30)

MapR: Best Solution for Customer Success

Premier Investors

High Growth

• 2X Growth In Annual Subscriptions (ACV)
• 2X Growth In Direct Customers
• 90% Subscription Licenses, Software Margins
• 140% Dollar-based Net Expansion
• 700+ Customers

Best Product

(31)

Key Reasons for Selecting MapR

(32)

• Analytics with 1st generation SQL-on-Hadoop requires ETL and schema creation.
• Operational apps on HBase/Accumulo must be run in a separate cluster from the analytics cluster.
• HBase/Accumulo suffer from service disruptions due to compactions, garbage collection, and region splits.
• All data movement into HDFS forces batch processing.

MapR Provides the Only Real-Time Distribution

• Apache Drill provides immediate self-service data exploration with no waiting on IT.
• MapR-DB runs in the same cluster as the analytics cluster (Hadoop), to avoid batch data copies across clusters.
• MapR-DB architecture ensures consistently high responsiveness (low latency).
• MapR ingests data in real-time via MapR-DB, HDFS API, and NFS.

(33)

MapR: The Only Platform Architected For Big, Fast, Reliable

[Figure: Apache Hadoop and OSS ecosystem on the MapR platform. Category labels: Execution Engines (Batch, Streaming, NoSQL & Search, ML/Graph, SQL) and Data Governance and Operations (Security, Workflow & Data Governance, Provisioning & coordination, Data Integration & Access). Components shown: YARN, Pig, Cascading, Spark, MapReduce v1 & v2, Tez, Spark Streaming, Storm, HBase, Solr, Mahout, MLLib, GraphX, Hive, Impala, Spark SQL, Drill, Sentry, Oozie, ZooKeeper, Juju, Savannah, Sqoop, Flume, HttpFS, Hue.]