MapR Data Platform for Hadoop and NoSQL
[Diagram: APACHE HADOOP AND OSS ECOSYSTEM running on the MapR Data Platform]
EXECUTION ENGINES
• Batch: MapReduce v1 & v2, Pig, Cascading, Spark
• Streaming: Spark Streaming, Storm
• SQL: Drill, Hive, Impala, Spark SQL
• NoSQL & Search: HBase, Solr
• ML, Graph: Mahout, MLLib, GraphX
• Tez*
DATA GOVERNANCE AND OPERATIONS
• Data Integration & Access: Sqoop, Flume, HttpFS, Hue
• Workflow & Data Governance, Security, Provisioning & coordination: Oozie, Falcon, Sentry, Knox, ZooKeeper, Whirr, Juju, Savannah
Enterprise-grade: Interoperability, Performance, Multi-tenancy, Security, Operational
Management: MCS
Data Is Doubling Every Two Years
Unstructured data will account for more than 80% of the data collected by organizations
[Chart: Total Data Stored and IT Resources, 1980 to 2020; labels: STRUCTURED DATA, SEMI-STRUCTURED DATA]
Data Increasingly Stored in Non-Relational Datastores
 | RELATIONAL DATABASES | NON-RELATIONAL DATASTORES
Volume | GBs-TBs | TBs-PBs
Structure | Structured | Structured, semi-structured and unstructured
Development | Planned (release cycle = months-years) | Iterative (release cycle = days-weeks)
Database | Fixed schema; DBA controls structure | Dynamic / Flexible schema; Application controls structure
How To Bring SQL Into An Unstructured Future?
Familiarity of SQL
Agility & Flexibility of NoSQL
• SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability
• No schema management
– HDFS (Parquet, JSON, etc.)
– HBase
– …
Industry's First Schema-free SQL engine
Apache Drill Brings Flexibility & Performance
Access to any data type, any data source
• Relational • Nested data • Schema-less
Rapid time to insights
• Query data in-situ • No schemas required • Easy to get started
Integration with existing tools
• ANSI SQL
• BI tool integration
Scale in all dimensions
• TB-PB of scale • 1000s of users • 1000s of nodes
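As a rough illustration of what "nested data", "schema-less" and "query data in-situ" mean in practice, the Python sketch below reads JSON records as they are, discovers the columns while scanning, and projects a few of them. It is not Drill code and says nothing about how Drill is implemented; the sample records, the field names and the helper names (flatten, scan, select) are assumptions made for this example.

```python
import json

# Hypothetical newline-delimited JSON; fields vary per record and one is nested.
RAW = """\
{"name": "Dave", "address": {"city": "San Jose", "state": "CA"}}
{"name": "John", "address": {"city": "Boulder", "state": "CO"}, "device": "mobile"}
"""


def flatten(record, prefix=""):
    """Flatten nested objects into dotted column names, e.g. address.city."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def scan(text):
    """Parse each line where it sits and discover the columns while reading."""
    rows = [flatten(json.loads(line)) for line in text.splitlines() if line.strip()]
    columns = sorted({col for row in rows for col in row})
    return columns, rows


def select(rows, wanted):
    """Project the wanted columns; a field missing from a record becomes None."""
    return [tuple(row.get(col) for col in wanted) for row in rows]


columns, rows = scan(RAW)
result = select(rows, ["name", "address.state", "device"])
```

No schema is declared before the scan: the column list is a by-product of reading the data, and a record that lacks a field simply yields an empty value for it.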
Extending Self Service to Schema-free data
[Chart: use cases for BI, ordered by Agility & Business Value]
• IT-Driven BI: IT-created reports, spreadsheets
• Self-Service BI: Analyst-driven with IT support for ETL
• Schema-Free Data Exploration: Analyst-driven with no IT dependency
Enabling “As-It-Happens” Business with Instant Analytics
Governed approach
• Hadoop data → Data modeling → Transformation → Data movement (optional) → Users
• New business questions, source data evolution
• Total time to insight: weeks to months
Exploratory approach
• Hadoop data → Users
Drill’s Role in the Enterprise Data Architecture
Raw data
• JSON, CSV, ...
“Optimized” data
• Parquet, …
Centrally-structured data
• Schemas in Hive Metastore
Relational data
• Highly-structured data
Hive, Impala, Spark SQL
Oracle, Teradata
Exploration
Access control that scales
• PAM Authentication + User Impersonation
• Fine-grained row and column level access control with Drill Views; no centralized security repository required
[Diagram: Users (U) access Drill View 1 and Drill View 2, defined over Files, HBase and Hive]
Granular security permissions through Drill views

Raw File (/raw/cards.csv)
Owner: Admins; Permission: Admins
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-7914-3865-4817
John | Boulder | CO | 1374-9735-1794-9711

Data Scientist View (/views/maskedcards.csv)
Owner: Admins; Permission: Data Scientists
Not a physical data copy
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-1111-1111-1111
John | Boulder | CO | 1374-1111-1111-1111

Business Analyst View
Owner: Admins; Permission: Business Analysts
Name | City | State
Dave | San Jose | CA
John | Boulder | CO
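The layering on this slide can be modeled in a few lines of Python. This is a minimal sketch of the idea only, not of Drill's mechanism: in Drill a view is defined in SQL, whereas here each view is a function computed over the raw rows at query time, so no copy of the data is stored. The permission table and the names mask_card, query, PERMISSIONS and VIEWS are hypothetical; the rows and group names are the ones shown on the slide.

```python
import csv
import io

# Raw rows from the slide's example file (/raw/cards.csv).
RAW_CSV = """\
Name,City,State,Credit Card #
Dave,San Jose,CA,1374-7914-3865-4817
John,Boulder,CO,1374-9735-1794-9711
"""


def read_raw():
    return list(csv.DictReader(io.StringIO(RAW_CSV)))


def mask_card(number):
    """Keep the first group of digits and overwrite the rest, as in the masked view."""
    first, *rest = number.split("-")
    return "-".join([first] + ["1111"] * len(rest))


def data_scientist_view():
    # Column-level masking: same rows, card number obscured.
    return [{**row, "Credit Card #": mask_card(row["Credit Card #"])} for row in read_raw()]


def business_analyst_view():
    # Column-level restriction: the card number is not exposed at all.
    return [{key: row[key] for key in ("Name", "City", "State")} for row in read_raw()]


# Hypothetical permission table: which group may read which object.
PERMISSIONS = {
    "raw": {"Admins"},
    "data_scientist_view": {"Data Scientists"},
    "business_analyst_view": {"Business Analysts"},
}
VIEWS = {
    "raw": read_raw,
    "data_scientist_view": data_scientist_view,
    "business_analyst_view": business_analyst_view,
}


def query(group, obj):
    """Evaluate a view (or the raw file) only if the group is permitted to read it."""
    if group not in PERMISSIONS[obj]:
        raise PermissionError(f"{group} may not read {obj}")
    return VIEWS[obj]()
```

For example, query("Business Analysts", "business_analyst_view") returns names, cities and states only, while query("Business Analysts", "raw") raises PermissionError.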
Business Benefits
Rapid time-to-value for business analysts:
SQL specialists and BI analysts can query any dataset—including complex
nested data—instantly, versus waiting several weeks for data preparation by IT.
Efficiency with easy governance for IT:
IT can avoid unnecessary ETL cycles and schema maintenance activities, but
still ensure governance through easy-to-deploy granular access controls.
Accelerated big data adoption for businesses:
Quick Tour
Data is growing fast and is scattered across various silos:
Website click logs
• JSON files
Customers
Apache Drill: SQL in a Non-Relational World
• ANSI SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability
• Agility
• Create and maintain schemas in advance:
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• Transform, copy, or move data
Closing The Gap Between Different Datasources using Drill
Product database
• Prod_id
• Productname
• Category
• Price
Website click logs
• Trans_id
• Sess_date
• Cust_id
• Device
• Prod_id
• Purch_flag
Customers
• Cust_id
• Customername
• State
• Gender
• Agg_rev
• Age
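The three sources above share keys (Prod_id and Cust_id), which is what makes a single query across them possible. The Python sketch below illustrates that join with in-memory stand-ins; it is not Drill and does not use any Drill API. Only the field names come from the slide. The sample rows, their values and the function name revenue_by_state are hypothetical.

```python
import json

# Hypothetical sample rows; only the field names are taken from the slide.
products = [  # stands in for the product database
    {"prod_id": 1, "productname": "Phone", "category": "Electronics", "price": 300.0},
    {"prod_id": 2, "productname": "Desk", "category": "Furniture", "price": 150.0},
]
customers = [  # stands in for the customer source
    {"cust_id": 10, "customername": "Dave", "state": "CA"},
    {"cust_id": 11, "customername": "John", "state": "CO"},
]
click_log_json = """\
{"trans_id": 100, "cust_id": 10, "device": "mobile", "prod_id": 1, "purch_flag": true}
{"trans_id": 101, "cust_id": 11, "device": "desktop", "prod_id": 2, "purch_flag": false}
{"trans_id": 102, "cust_id": 11, "device": "mobile", "prod_id": 1, "purch_flag": true}
"""


def revenue_by_state():
    """Join clicks to products (prod_id) and customers (cust_id); sum purchases per state."""
    price = {p["prod_id"]: p["price"] for p in products}
    state = {c["cust_id"]: c["state"] for c in customers}
    totals = {}
    for line in click_log_json.splitlines():
        click = json.loads(line)
        if click["purch_flag"]:  # count purchases only, not plain page views
            key = state[click["cust_id"]]
            totals[key] = totals.get(key, 0.0) + price[click["prod_id"]]
    return totals
```

The click logs stay in their JSON form and are joined as they are read; nothing is copied into a common store first.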
In lieu of the live demonstration, please find links below:
• Apache Drill with Tableau (4:28):
https://www.youtube.com/watch?v=EH0_vRTAkyk
• Twitter analytics with Apache Drill and MicroStrategy (5:02):
https://www.youtube.com/watch?v=-gqwgahtc2Y
• Analyzing JSON and Packet Data with SAP Lumira and Apache
Self-Service Data Exploration
[Diagram: Raw Data Exploration, JSON Analytics, DWH Offload, … over {JSON}, Parquet, Text Files, …]
Data Warehouse Offload with Drill & MapR
Ultimately replace existing expensive SQL analytics platform with Hadoop
• Apache Drill allows interactive analysis on large datasets, with MapR as the underlying platform that meets scale, reliability and data protection needs
• SQL users did not have to learn Pig, HiveQL or any other language and continue to use Tableau and SQuirreL on top of Drill
OBJECTIVES
CHALLENGES
SOLUTION
• Hadoop and Drill dramatically reduce the price point to less than $1,000/TB
• MapR platform with Drill delivers reliability and performance for the end users
• Leverage existing BI and SQL skill-sets on Hadoop without retraining
Business Impact Potential
• Mines credit card data and compares consumer shopping habits
• Require internal SQL specialists to gain instant access to data at all times
• Want to preserve instant access to data but at a lower price point
• Need a system that is reliable, does not lose data and is fast
• Must be able to leverage the SQL skill sets in the company
Telecom OEM application with Drill & MapR
Leverage Drill’s JSON capabilities to create revenue-generating IoT services
• Apache Drill is being used to build the engine for the interactive experience
• Drill allows SQL queries on incoming JSON structures natively without requiring any centralized schema definitions
• Drill connects to all BI tools using standard ODBC connectors
OBJECTIVES
CHALLENGES
SOLUTION
• Provide new revenue-generating services to mobile operators
Business
• Offer service to mobile operators to proactively monitor and improve their subscriber experience
• Instant availability of data from diverse and disparate sources
• Data is very diverse and dynamic using JSON as the key format
• Require interactive, ad-hoc analysis capabilities via standard BI tools such as Tableau and Spotfire
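The point that incoming JSON can be queried without a centralized schema definition can be sketched as follows. The Python below turns a repeated (array) field into one row per element and aggregates over it, loosely analogous to flattening a repeated field in a SQL query. It is an illustration under stated assumptions, not the OEM's application and not Drill code: the event layout, the field names (subscriber, cells, drops, roaming) and the function names are all hypothetical.

```python
import json

# Hypothetical subscriber events; the second record carries a field the first lacks.
EVENTS = """\
{"subscriber": "s1", "cells": [{"id": "c1", "drops": 0}, {"id": "c2", "drops": 2}]}
{"subscriber": "s2", "cells": [{"id": "c2", "drops": 1}], "roaming": true}
"""


def flatten_cells(text):
    """Emit one row per element of the repeated 'cells' field."""
    for line in text.splitlines():
        event = json.loads(line)
        for cell in event.get("cells", []):
            yield {
                "subscriber": event["subscriber"],
                "roaming": event.get("roaming"),  # None when the record has no such field
                "cell_id": cell["id"],
                "drops": cell["drops"],
            }


def drops_per_cell(text):
    """Aggregate dropped connections per cell across all flattened rows."""
    totals = {}
    for row in flatten_cells(text):
        totals[row["cell_id"]] = totals.get(row["cell_id"], 0) + row["drops"]
    return totals
```

A new field such as roaming appears in the output as soon as a record carries it, and earlier records are still readable without any schema change.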
Recap: Apache Drill enables Self Service SQL for Big data
AGILITY: INSTANT INSIGHTS TO BIG DATA
• Direct queries on self-describing data
• No schemas or ETL required
FLEXIBILITY: ONE INTERFACE FOR HADOOP & NOSQL
• Query HBase and other NoSQL stores
• Use SQL to natively operate on complex data types (such as JSON)
FAMILIARITY: EXISTING SKILLS & TECHNOLOGIES
• Leverage ANSI SQL skills and BI tools
• Plug-n-play with Hive schemas, file formats, UDFs
Learn more and get started with Apache Drill
New to MapR and/or Drill?
– Get started with Free MapR On Demand training
– Test Drive Drill on cloud with Amazon EMR
– Learn how to use Drill with Hadoop using MapR sandbox
Ready to play with your data?
– Try out the “Apache Drill in 10 mins” guide on your desktop
– Download Drill for your MapR cluster and start exploration
• Use both with relational and JSON datasets
– Comprehensive tutorials and documentation available
Ask questions
Thank You
@mapr
maprtech
muddenfeldt@mapr.com
mkieboom@mapr.com
MapRTechnologies
maprtech
mapr-technologies
MapR with Drill is Top-Ranked SQL-on-Hadoop
Source: Gigaom Research, 2015
Key:
• Number indicates company’s relative strength across all vectors
• Size of ball indicates company’s relative strength along individual vector
Like other vendors’ offerings, Drill handles BI and interactive queries with great aplomb, but it is designed to serve these workloads with data complexity that goes well beyond the flat structured data that other SQL-on-Hadoop systems deal with.
 | Drill | Hive | Impala | Spark SQL
Key Use Cases | Self-service Data Exploration; Interactive BI / Ad-hoc queries | Batch / ETL / Long-running jobs | Interactive BI / Ad-hoc queries | SQL as part of Spark pipelines / Advanced analytic workflows
Data Sources
Files support | Parquet, JSON, Text, all Hive file formats | Yes (all Hive file formats) | Yes (Parquet, Sequence, RC, Text, AVRO…) | Parquet, JSON, Text, all Hive file formats
HBase/MapR-DB | Yes | Yes, performance issues | Yes, performance issues | Same as Hive
Beyond Hadoop | Yes | No | No | Yes
Data Types
Relational | Yes | Yes | Yes | Yes
Complex/Nested | Yes | Limited | No | Limited
Metadata
Schema-less / Dynamic schema | Yes | No | No | Limited
Hive Metastore | Yes | Yes | Yes | Yes
SQL / BI tools
SQL support | ANSI SQL | HiveQL | HiveQL | ANSI SQL (limited) & HiveQL
Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC
Beyond Memory | Yes | Yes | Yes | Yes
MapR: Best Solution for Customer Success
Premier Investors
High Growth
• 2X Growth In Direct Customers
• 90% Subscription Licenses, Software Margins
• 140% Dollar-based Net Expansion
• 700+ Customers
• 2X Growth In Annual Subscriptions (ACV)
Best Product
Key Reasons for Selecting MapR
• Analytics with 1st generation SQL-on-Hadoop requires ETL and schema creation.
• Operational apps on HBase/Accumulo must be run in a separate cluster from the analytics cluster.
• HBase/Accumulo suffer from service disruptions due to compactions, garbage collection, and region splits.
• All data movement into HDFS forces batch processing.
MapR Provides the Only Real-Time Distribution
• Apache Drill provides immediate self-service data exploration with no waiting on IT.
• MapR-DB runs in the same cluster as the analytics cluster (Hadoop), to avoid batch data copies across clusters.
• MapR-DB architecture ensures consistently high responsiveness (low latency).
• MapR ingests data in real-time via MapR-DB, HDFS API, and NFS.
MapR: The Only Platform Architected For Big, Fast, Reliable
[Diagram: APACHE HADOOP AND OSS ECOSYSTEM (EXECUTION ENGINES; DATA GOVERNANCE AND OPERATIONS) running on the MapR Data Platform]
MapR Data Platform (Random Read/Write)
• MapR-FS (HDFS and NFS APIs)
• MapR-DB (High-Performance NoSQL)