How To Create A Data Visualization With Apache Spark And Zeppelin

(1)

Big Data Visualization

using

(2)

Agenda

• Big Data and Ecosystem tools

• Apache Spark

• Apache Zeppelin

• Data Visualization

• Combining Spark and Zeppelin

2

(3)

(4)

Big Data

• Data size beyond systems capability

– Terabyte, Petabyte, Exabyte

• Storage

– Commodity servers, RAID, SAN

• Processing

– In reasonable response time – A challenge here

4

(5)

Tradition processing tools

• Move what ?

– the data to the code or – the code to the data

Data

move data to code

move code to data

Code

(6)

Traditional processing tools

• Traditional tools

– RDBMS, DWH, BI – High cost

– Difficult to scale beyond certain data size

• price/performance skew

• data variety not supported

6

(7)

Map-Reduce and NoSQL

• Hadoop toolset

– Free and open source – Commodity hardware

– Scales to exabytes(10¹⁸), maybe even more

• Not only SQL

– Storage and query processing only – Complements Hadoop toolset

(8)

All is well ?

• Hadoop was designed for batch processing

• Disk based processing: slow

• Many tools to enhance Hadoop’s capabilities

– Distributed cache, Haloop, Hive, HBase

• Not for interactive and iterative

8

(9)

(10)

What is singularity ?

10

0 1000 2000 3000 4000 5000 6000 7000 8000

1 2 3 4 5 6 7

AI capacity

Decade

Decade vs AI capacity

Point of singularity

(11)

Technological singularity

• When AI capability exceeds Human capacity

• AI or non-AI singularity

• 2045: http://en.wikipedia.org/wiki/Ray_Kurzweil

– The predicted year

(12)

APACHE SPARK

(13)

History of Spark

Spark 1.3.0 released

2015 March

Spark 1.0.0 released. 100TB sort achieved in 23 mins

2014

Spark donated to Apache Software Foundation

2013

Spark is made open source.

Available on github

2010

Spark created by PhD student at UC Berkeley, Matei

Zaharia

2009

(14)

Contributors in Spark

• Yahoo

• Intel

• UC Berkeley

• …

• 50+ organizations

14

(15)

Hadoop and Spark

• Spark complements the Hadoop ecosystem

• Replaces: Hadoop MR

• Spark integrates with

– HDFS – Hive – HBase – YARN

(16)

Other big data tools

• Spark also integrates with

– Kafka – ZeroMQ – Cassandra – Mesos

16

(17)

Programming Spark

• Java

• Scala

• Python

• R

(18)

Spark toolset

18

Apache Spark

Spark SQL

Spark Streamin

g

MLlib GraphX

Spark R

Blink DB Spark Cassandra

Tachyon

(19)

What is Spark for ?

Batch

(20)

The main difference: speed

• RAM access vs Disk access

– RAM access is 100,000 times faster !

20

https://gist.github.com/hellerbarde/2843375

(21)

Lambda Architecture pattern

• Used for Lambda architecture

implementation

– Batch layer – Speed layer – Serving layer

Batch Layer Speed

Layer

Data Input

(22)

Worker Node Executor

Deployment Architecture

22

Master Node

Executor

TaskTask_Task ^Cache

Worker Node Executor

Executor^Executor

HDFS Data Node HDFS

Data Node

TaskTask_Task ^Cache

Spark’s Cluster Manager Spark Driver

HDFS Name Node

(23)

(24)

Interactive data analytics

• For Spark and Flink

• Web front end

• At the back, it connects to

– SQL systems(Eg: Hive) – Spark

– Flink

24

(25)

Deployment Architecture

Spark / Flink / Hive

Web browser 1

Web browser 2

Web Server

Local

Optional

Remote

(26)

Notebook

• Is where you do your data analysis

• Web UI REPL with pluggable interpreters

• Interpreters

– Scala, Python, Angular, SparkSQL, Markdown and Shell

26

(27)

User Interface features

• Markdown

• Dynamic HTML generation

• Dynamic chart generation

• Screen sharing via websockets

(28)

SQL Interpreter

• SQL shell

– Query spark data using SQL queries

– Return normal text, HTML or chart type results

28

(29)

Scala interpreter for Spark

• Similar to the Spark shell

• Upload your data into Spark

• Query the data sets(RDDs) in your Spark server

• Execute map-reduce tasks

• Actions on RDD

• Transformations on RDD

(30)

DATA VISUALIZATION

(31)

Visualization tools

(32)

D3 Visualizations

Source: https://github.com/mbostock/d3/wiki/Gallery 32

(33)

The need for visualization

Big Data

_something^Do

to data

User gets

comprehensible

(34)

Tools for Data Presentation Architecture

34

1.Identify

2.Locate

3.Manipulate

4.Format

5.Present A data analysis tool/toolset would support:

(35)

(36)

Spark and Zeppelin

36

Spark Worker

Node

Spark Master

Node

Spark Worker

Node

Zeppelin daemon Web

browser 1

Web browser 2

Web browser 3

Web Server

Local Interpreters

Remote Interpreters

(37)

Zeppelin views: Table from SQL

(38)

Zeppelin views: Table from SQL

38

%sql select age, count(1) from bank where

marital="${marital=single,single|divorced|married}"

group by age order by age

(39)

Zeppelin views: Pie chart from SQL

(40)

Zeppelin views: Bar chart from SQL

40

(41)

Zeppelin views: Angular

(42)

Share variables: MVVM

• Between Scala/Python/Spark and Angular

• Observe scala variables from angular

42

Scala-Spark Angular

x = “foo” x = “bar”

Zeppelin

(43)

Screen sharing using Zeppelin

• Share your graphical reports

– Live sharing

– Get the share URL from zeppelin and share with others – Uses websockets

• Embed live reports in web pages

(44)

FUTURE

(45)

Spark and Zeppelin

• Spark

– Machine Learning using Spark – GraphX and MLlib

• Zeppelin

– Additional interpreters – Better graphics

– Report persistence

(46)

SUMMARY

(47)

Summary

• Spark and tools

• The need for visualization

• The role of Zeppelin

• Zeppelin – Spark integration

(48)