Big Data for Architects

(1)

Sam Babad @ MidLink

[email protected]

052-595-8885

Agenda

• What is BigData

• BigData challenges

• Textbook big data architecture

• Analysis, visualization & self service

• BigData in the cloud

(2)

What is BigData

Why is everyone talking about BigData?

• BigData is a Big Buzzword coined by Big Companies to make Big Money

• There are a lot of new and exciting technologies behind the buzzword

• A lot of money has been and is being invested into BigData companies

(3)

What is BigData

• Commonly defined as a combination of the three Vs

– Volume

– Velocity

– Variety

• My Definition

– Data so large that you cannot run the algorithms you want on it in a reasonable amount of time.

What is BigData?

• Big Data is not a buzzword. Big Data is something more and more organizations have

• Today most consider > 15TB Big Data

• For real time > 5TB is considered Big Data

• The internet entities are managing PBs of data

(4)

Data sources

• Machine generated

• Handheld devices

• Optimizers

• Analyzers

• Loggers

• Probes

• And a billion sensors

Use Cases

• Advertisement optimization and verification

• Security and fraud detection

• Algo-trading

• Behavioral analysis

• Network optimization

• Real time billing

(5)

Some numbers

• In-memory databases of 50TB are reasonable

• Columnar databases of 2-5PB are reasonable

• We usually do not consider 15TB big data for analytics

• This is all on commodity hardware

Some numbers

• AirBus A380 => ~640TB per flight

• Twitter ~12TB per day

• NYSE 1TB per day

• Storage capacity doubles every 3 years

(6)

Is there value in BigData

• In 2014 Big Data vendors will pocket nearly $30 billion from hardware, software and professional services (2020 => $76B). Source: marketwatch.com

• How did Target know your 16-year-old was pregnant before you did?

• Would you like to know the weather tomorrow?

• Take a look at Goldcorp Challenge

• Fight Cancer, Crime, Starvation & pollution

• There is one CCTV camera for every 32 people in the UK

Leveraging BigData technologies

• BigData technologies can be used to solve

– 50GB is not Big Data, but reducing footprint is big!

– PostgreSQL, the new DB since 1989

– CEP (Complex Event Processing) has replaced rules engines

(7)

Big Data Challenge – Data and Analytical Complexity

• Extracting actual business value

• Data-driven decision making

BigData challenges

• Organizations capture far more data than they ever have

• Storage is usually not the issue

• No two organizations are the same, nor are their data sets

(8)

Data Challenges & Complexity

• Volume

– Collecting

– Storage & backup

– Scanning (querying)

• Velocity

– Data that has to be processed in real time

– Parallelism

– Correlation (over time and between series)

Data Challenges & Complexity

• Variety

– Unstructured data alongside structured data

– Internal alongside external data

– A variety of formats: MP4, AVI, Avro etc.

• Distribution

– Multiple data centers

– Regulation

(9)

Data Challenges & Complexity

• Life cycle management

– Retention policy

• What is enough sampling? 2, 3 Christmases?

• Stale data (in different systems)

• Coherence/Consistency

– Between systems & data sources

• Integration

– Data collection and augmentation

– Multiple systems integration (islands of data)

Data Challenges & Complexity

• Security

– All your data is in one location

– Who is allowed to access which data?

– Multiple tenancy

• One version of the truth

– Where do you live right now?

(10)

Data Challenges & Complexity

• Working with legacy systems

– Prolong lifetime

• Self-service

– Investigative access to ALL users

– Non techies included

– Keeping access permissions in mind

– Ever changing data model

– High touch vs. Low touch data

(11)

BigData the BigPicture – partial

No SQL Columnar Hadoop In-Memory Appliance OLTP Cloud Visualization Sharding IBM EMC HP Oracle Google Yahoo Amazon SAP Actian Datastax EnterpriseDB Apache Twitter Tableau LinkedIn Datameer

(12)

BigData ecosystem

ETL Enrichment Storage Lifecycle Virtualization Deep Analytics BI Code

Considerations

• Is data your business?

• How much data do you have?

• Where is it now?

• What is your growth projection?

• Can you take it outside the organization?

• Do you need to augment?

(13)

Considerations

• For how long do you need the data?

• Can you save partial data/aggregates?

• Can you join/dedup/aggregate historical data?

• Do you need real-time processing?

• Do you have frequent schema changes?

• Do you need to do OLAP & OLTP on the same system?

• Any limitations HW/OS/MEM?

Climbing up the ladder

• Will fast storage be enough?

• Will open source cut it? Or, what part do you use open source for?

• Which route is best? Hadoop, NoSQL, InMemory, Columnar etc…

• What do you use to code?

• Which ETL do you use?

(14)

Textbook architecture (lambda)

(15)

Introducing the major players

Kafka.apache.org

• Apache Kafka is publish-subscribe messaging rethought as a distributed commit log

• Originally developed and open sourced by LinkedIn, now an Apache project

• Fast

• Scalable

• Durable

• Distributed by Design

(16)

Kafka – how it works

• Publishers send messages to a cluster of brokers

• The brokers persist the messages to disk

• Consumers can request a range of messages

– Offset and Length

• Everything is distributed: Pub, Sub & Queues

• Consumers maintain their own state (No TX)

• Throughput oriented

Kafka – log based queue

• Messages are persisted to append-only logs

– Sequential writes and reads

• Topics are distributed queues (partitions)

– Partitions are replicated

– Leader & followers

– Producers load balance

– Instances know each other
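The log and offset ideas above can be sketched in a few lines of Python. This is a toy model, not Kafka's client API: `PartitionLog` and `Consumer` are hypothetical names, and a real broker persists to disk and replicates partitions.

```python
class PartitionLog:
    """Append-only log for one partition; messages are addressed by offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, length):
        """Return up to `length` messages starting at `offset`."""
        return self._messages[offset:offset + length]


class Consumer:
    """Consumer that tracks its own position; the log keeps no per-consumer state."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_messages):
        batch = self._log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = PartitionLog()
for event in ["a", "b", "c", "d", "e"]:
    log.append(event)

fast, slow = Consumer(log), Consumer(log)
first_batch = fast.poll(3)
second_batch = fast.poll(3)
slow_batch = slow.poll(2)   # unaffected by what `fast` has already read
```

Because each consumer owns its offset, a slow consumer can re-read or lag behind without the log doing any bookkeeping for it.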

(17)

Kafka

• A single broker can handle hundreds of megabytes of reads and writes per second from thousands of clients

• A cluster is a great data transfer backbone

• Data is retained for a preset time (e.g. 2 days)

• Guarantees order of messages per partition

• Uses java.nio.channels.FileChannel#transferTo (sendfile system call)

Storm

• CEP – Complex Event Processor

• Platform for analyzing streams of data as they “occur”

• Highly distributed, real-time computation system

• Provides primitives for real time computation

• Simplifies working with queues and workers

• Fault tolerant and scalable

• Complementary to Hadoop

• Created at BackType, which was acquired by Twitter in 2011

• In the Apache Incubator since 2013

(18)

Storm Components

• Master node - Nimbus daemon

– Distributes code around the cluster, assigns the tasks and monitors failures

• Worker node - Supervisor daemon

– Listens to the work assigned, runs the worker process

• Zookeeper

– maintains the coordination service between the supervisor and master

Storm Components

• Stream - An unbounded sequence of tuples

• Spout - A stream source. Reads data from real data sources (logs, API calls, event data...) and generates a stream.

• Bolt - Processes input streams (joins, filters, aggregations...) and produces output streams. Contains data processing, persistence, and messaging/alert logic.

(19)

Storm Components

• Spouts and Bolts are packaged into a Topology

• A Topology is submitted to a Storm cluster for execution and runs indefinitely until it is manually terminated
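The spout, bolt and topology roles can be illustrated with plain Python generators. This is only a sketch of the data flow and does not use Storm's API. All function names are made up for the example, and it drains a finite list where a real topology runs indefinitely on a cluster.

```python
from collections import Counter


def sentence_spout(lines):
    """Spout: turns a data source into a stream of tuples."""
    for line in lines:
        yield (line,)


def split_bolt(stream):
    """Bolt: splits each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits (word, count) updates."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(lines):
    """Topology: wires spout -> split -> count and keeps the latest count per word."""
    latest = {}
    for word, count in count_bolt(split_bolt(sentence_spout(lines))):
        latest[word] = count
    return latest


word_counts = run_topology(["big data", "big streams of data"])
```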

Storm use-cases

• Stream processing of tweets

• Real-time log processing

• Sensor data analysis

• Financial market analysis

• Natural Language Processing

(20)

Hadoop

• High Availability Distributed Object Operation Platform

• Distributed infrastructure for parallel processing

• Initially HDFS & Map/Reduce

• Ability to grow to thousands of machines

• Virtualizes the hardware

• Redundant

• Simplifies growth & parallel processing

• Distros: Cloudera, Hortonworks, Apache, IBM …

HDFS

• Hadoop Distributed File System

• Distributed file system for redundant storage

• Uses commodity hardware

• Supports big files (PBs) + large quantities of files

• Write-once-read-many

• Built for HW failure

(21)

HDFS Architecture

HDFS – Master/Slave

• Data is organized into files and directories

• Files are divided into uniform sized blocks

• Master “Namenode”

– Manages the file system namespace in memory

– Maintains the mapping from file name to block list and block locations

– Manages block allocation/replication

– Checkpoints the namespace and journals namespace changes for reliability

(22)

HDFS – Master/Slave

• Slaves “Datanodes” handle block storage

• Blocks are stored using the underlying OS’s files

• Clients access the blocks directly from datanodes

• Datanodes periodically send block reports to the Namenode

• Datanodes periodically check block integrity

• Blocks are replicated (replication factor = 3)

• File system keeps checksums of data for corruption detection and recovery

• HDFS exposes block placement so that computation can be migrated to data
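As a rough illustration of the block and replica bookkeeping described above, the sketch below splits a file into blocks and assigns each block to three datanodes. The 128 MB block size and the round-robin placement are assumptions made for the example. Real HDFS placement is rack-aware, and the function names are hypothetical.

```python
def split_into_blocks(file_size, block_size):
    """Return the size of each block for a file of `file_size` bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes (simple round-robin)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)   # assumed block size
namenode_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

The `namenode_map` dictionary plays the role of the Namenode's block-to-location mapping: a client would look up the datanodes there and then read the blocks from them directly.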

Map/Reduce

• Programming model for distributed computation jobs at a massive scale

• A framework to organize and execute such jobs

• The idea is to take the logic to the data and not vice versa

(23)

Map/Reduce

• Map

– Inspect a huge amount of data

– Get something of interest

– Shuffle and sort the interesting data

• Reduce

– Aggregate the interesting data

– Generate a final report

• Think of grep | sort | aggregate by customer ID
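The "grep | sort | aggregate by customer ID" analogy can be written out as a single-process Python sketch. The record layout (`customer,status,amount`) and the ERROR filter are invented for the example. A real job would run the map and reduce functions in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(records):
    """Map: inspect every record and emit (customer_id, amount) for the interesting ones."""
    for line in records:
        customer_id, status, amount = line.split(",")
        if status == "ERROR":
            yield customer_id, float(amount)


def shuffle(pairs):
    """Shuffle and sort: group the emitted values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: aggregate the values of each key into one result."""
    return {key: sum(values) for key, values in grouped}


records = ["c1,OK,10", "c2,ERROR,5", "c1,ERROR,7", "c2,ERROR,1", "c3,OK,2"]
report = reduce_phase(shuffle(map_phase(records)))
```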

(24)

Hadoop – the big picture

Pig

• Map/Reduce is a lot of work to write

• Pig is a High-level data flow language

• Data processing language

• Compiler to translate Pig Latin to Map/Reduce

(25)

Pig example

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
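For readers who do not know Pig Latin, here is roughly what the script computes, written as in-memory Python. It is a sketch of the logic only. It assumes user names are unique (so the join does not multiply rows), and the sample data is invented.

```python
from collections import Counter


def top_sites(users, pages, limit=5):
    """Mirror the Pig script: filter, join, group, count, order, limit."""
    # filter Users by age >= 18 and age <= 25
    young = {name for name, age in users if 18 <= age <= 25}
    # join on name/user, group by url and count the clicks per url
    clicks = Counter(url for user, url in pages if user in young)
    # order by clicks desc, then keep the first `limit` entries
    return clicks.most_common(limit)


users = [("ann", 22), ("bob", 40), ("cid", 19)]
pages = [("ann", "a.com"), ("cid", "a.com"), ("bob", "b.com"), ("ann", "c.com")]
top = top_sites(users, pages)
```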

Hive – SQL with Hadoop

• Puts a schema/structure on log data stored in HDFS

• Provides an SQL-like query language

• SQL is compiled into a chain of M/R jobs

• Usually faster than Java M/R because of the optimizer in the compiler

• Has a command shell interface

• Tables can be associated with a serializer/deserializer class that parses data into the table

(26)

Hive high level architecture

HBase

• Open source non-relational distributed column-oriented database on top of HDFS

– (row key, column family, column, timestamp) -> value

• Real time read/write access to data in HDFS

• No: SQL, joins, transactions

• Yes: Scalable, Redundant, Consistent
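The `(row key, column family, column, timestamp) -> value` model above is essentially a sparse, versioned, nested map. The class below is a toy illustration of that shape and not the HBase client API. `TinyTable` and its methods are hypothetical.

```python
from collections import defaultdict


class TinyTable:
    """Sparse map keyed by (row key, column family, column, timestamp)."""

    def __init__(self):
        # row -> (family, column) -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, column, timestamp, value):
        self._rows[row][(family, column)][timestamp] = value

    def get(self, row, family, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        versions = self._rows.get(row, {}).get((family, column), {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None


table = TinyTable()
table.put("user42", "info", "city", 100, "Haifa")
table.put("user42", "info", "city", 200, "Tel Aviv")
latest = table.get("user42", "info", "city")
older = table.get("user42", "info", "city", timestamp=150)
missing = table.get("user99", "info", "city")
```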

(27)

Cloudera Impala

• High performance general purpose SQL query engine for Hadoop, not based on M/R

• Written in C++ with dynamic code generation

• Impalad runs on every node

– Accepts client requests, plans & executes the query

• Statestored is used to find the data

– Provides name service & metadata distribution

• Catalogd propagates metadata updates to all nodes

Cloudera Impala performance

• I/O-bound queries: 3-4x faster than Hive

• Queries with multiple M/R phases in Hive: 45x faster in Impala

(28)

Parquet

• Columnar storage across the Hadoop platform

– Reduces IO requirement

– Saves space (better compression)

– Different encoding for different types of data

– Additional metadata Page/Column/File statistics

• Implemented both in Java for M/R & C++(Impala)

– Can be used with Impala, Hive, Pig & M/R

– Impala & Hive can access the same parquet tables

Columnar databases

• Stores information by columns, rather than rows

• This is much better for data crunching applications and DWH than using row storage.

• The table below will be stored on disk in this order:

– 1, 2, 3 - 10:17, 11:11, 13:15 - 1:37, 5:13, 3:59 - 12345, 23456, 34567

• Brands

– ParAccel a.k.a. Matrix

– Vertica

– RedShift (cloud)

Call ID   Time    Duration   Number
1         10:17   1:37       12345
2         11:11   5:13       23456
3         13:15   3:59       34567
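The column-by-column storage order listed above can be reproduced with a short Python sketch. The third row is taken from the storage-order line on the slide. The column names and helper functions are illustrative.

```python
rows = [
    (1, "10:17", "1:37", 12345),
    (2, "11:11", "5:13", 23456),
    (3, "13:15", "3:59", 34567),
]
columns = ("call_id", "time", "duration", "number")


def to_columnar(rows, columns):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def disk_order(store, columns):
    """Flatten the column lists in the order they would be written."""
    return [value for name in columns for value in store[name]]


store = to_columnar(rows, columns)
layout = disk_order(store, columns)
durations = store["duration"]   # a query on one column touches only this list
```

Each list in `store` holds values of a single type, which is what makes per-column compression and encoding effective.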

(29)

Columnar database features

• Homogeneous blocks on disk allow for better compression/encoding of data

• Storage reduction by X3 to X12

• Massively Parallel Processing (MPP) on commodity HW enabled by correct distribution keys

• Take the code to the data

• Shared nothing

• Standard SQL language and relational schema

• UDF (User Defined Functions) like stored procedures

Columnar database performance

• Columnar addresses the disk I/O bottleneck, specifically targeting seek time

• Designed for BI oriented queries

– Select 7 columns from a table with 28 columns will fetch ¼ of the data as each block has info from only one column

– The data will also be mostly sequential so no seek time

– Compression will reduce amount of data fetched even more

• Local processing on each node (shared nothing)

(30)

Columnar database pitfalls

• Trickle loading is tricky

• Concurrency might be an issue

• Internal network speed is crucial

• Data redistribution is an issue

• Updates are out of the question

• Joins can be an issue

• However, in return you get X10 to X100 faster queries, less storage, bulk inserts of X TB an hour and excellent simultaneous load and query, so it is worth it

In-Memory database

• A DBMS that only uses main memory (RAM) to store data

• In-memory replication is used for High Availability

• Usually these DBs do an async backup to an external file or DB

• NVDIMM now makes it possible to run at full speed and retain data in the event of power failure

(31)

In-Memory database

• Designed for extreme OLTP processing

• Some IMDBs (In Memory DB) are MPP

• Close to linear scaling (some claim a=0.85)

• Some use smart mechanisms to only hold HOT data in memory

• Memory is 3,000 times faster than disk

• Cost is constantly dropping while size is increasing

In-Memory examples

• Hana

– Column-oriented, Relational, MPP RDBMS

– Convergence of OLTP and OLAP Analytics

• VoltDB

– MPP, Relational, RDBMS, ACID, ANSI SQL, Java UDF

– No write-ahead log, no redo log, no disk access within a transaction

– Real time response & async-offload (File,Columnar,Hadoop)

(32)

Analysis, visualization and self service

ETL

• ETL is still mostly a traditional realm

– Goldengate, Informatica & Python are very much in use as an addition/instead of Hadoop

• More sources and targets as each BigData

technology is targeted at a different requirement

• Data travels more

– Multiple new BigData technologies side by side with

(33)

Traditional BI tools

• Traditional BI tools still play a big role

– OBIEE, Cognos, BO, Microstrategy

– Have connectors to BD technologies like Impala & Hive

– Most use JDBC or ODBC connectivity

– Enable federated reports from multiple sources

– Still require building a “world”

– Are still relatively cumbersome

– Provide most users with limited access to limited data

Self service BI

• With more and more data collected & its value rising, accessibility is becoming a big issue

• Canned reports simply don’t cut it anymore

• Business users want to “add” a local Excel file they have to the data and run correlations

• Exported partial data is less valuable

(34)

Self service

• Data consumers want to run their own analysis, do their own exploration and use simple tools

• They are not techies

• Security and constraints are required for self service, especially in multi-tenant environments

• The trick in self service is to give access to all the data required and only that data

• Ability to easily manipulate data is critical

New visualization tools

• The “new” visualization tools are still traditional

– Tableau, QlikView and others offer a sexier, easier to use interface, intuitive for more users

– Point and click discovery and reporting

– Do not offer a paradigm shift

• New BI & visualization tools are just emerging

– Datameer, Platfora

(35)

New BI tools Datameer as an example

• Paradigm shifts only happen with new BI tools that step out of the old realm

• Changing the way we think about BI

– No ETL – just EL

– No building of “worlds”

– Working directly and exclusively on Hadoop

– Easy integration and import of data

– An “Excel” like interface for analytics

(36)

Datameer capabilities

• Easy integration and import of information from files, logs, database and many other sources

• Data manipulation is done via an Excel-like interface; SQL or M/R knowledge is not required

• Administrator can control data access at the column level

• “Power Point” like visualization

• A new way of looking at self service BI

(37)

BigData in the cloud

• More and more organizations place their data in the cloud, thus the data in the cloud is exploding

– Pay as you go, for what you use

– Elasticity – Grow when required

• Think of changes as: duplicate, change, test, switch, drop the old

– Decrease administration overhead

– Use state of the art technologies

Google BigQuery

• Big Data analysis requires expensive hardware and skilled DB administrators

• Managing data centers and tuning software is time consuming and expensive

• Why not use web services as analytic tools instead?

• BigQuery - a fully managed data analytics service in the cloud

(38)

What is BigQuery?

• Service for interactive analysis of big datasets

• Works in conjunction with Google Storage

• Uses SQL-like query syntax

• Web service accessed through a RESTful API

• 99.9% reliable and secure:

– Data replicated across multiple data centers

– Secured through Access Control Lists

• Scalability to any number of users

(39)

BigQuery – ecosystem

• Part of the Google cloud offering

– App Engine (app execution)

– Compute Engine (Linux VMs)

– Storage

– BigQuery

• Works with most BI & Visualization tools

• Direct access from Google App Engine

• Securely share & distribute the results

• Offered with Premium support

BigQuery - Technology

• Columnar

• MPP

• Uses tree structure for distribution

• Limitations

– Only one join per query

– Relatively slow and runs out of resources (rare)

(40)

Amazon RedShift

• Data warehousing done the AWS way

– Easy to provision and scale up massively

– Pay as you go

– Really fast performance

– Open and flexible with support for popular tools

– Petabyte scale

Amazon RedShift

• Dramatically reduce I/O by

– Column storage

– Data compression

– Zone maps

– Direct-attached storage

– Large data block sizes
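Of the techniques listed above, zone maps are the least self-explanatory: the engine keeps the minimum and maximum value of each block and skips blocks whose range cannot contain the value being searched for. The sketch below illustrates that idea in plain Python. The block size and function names are assumptions, and it is not Redshift's internal format.

```python
def build_zone_map(values, block_size):
    """Split values into blocks and record (min, max) for each block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return blocks, [(min(b), max(b)) for b in blocks]


def scan_equal(blocks, zone_map, target):
    """Return matching values and how many blocks actually had to be read."""
    hits, blocks_read = [], 0
    for block, (low, high) in zip(blocks, zone_map):
        if low <= target <= high:          # only then can the block hold the target
            blocks_read += 1
            hits.extend(v for v in block if v == target)
    return hits, blocks_read


values = list(range(1000))                 # a sorted column, e.g. a timestamp
blocks, zone_map = build_zone_map(values, block_size=100)
hits, blocks_read = scan_equal(blocks, zone_map, target=742)
```

The benefit depends on the data being sorted or clustered on the filtered column. On randomly ordered data most block ranges overlap the target and few blocks can be skipped.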

(41)

Amazon RedShift

• Parallelize and distribute everything

– Query

– Load

– Backup

– Restore

– Resize

Amazon RedShift – more features

• Monitor query performance

• Point & click resize or recreate

• Built in security

• Automatic backups

• Integrates with multiple data sources

• Use existing analysis tools

(42)

DynamoDB

• Fully managed noSQL DB service on AWS

• Table based (each table is independent)

• Data stored in the form of name - value attributes

• Automatic scaling

– Provisioned throughput

– Storage scaling

– Distributed architecture

• Monitoring tables with CloudWatch

• Integration with Elastic MapReduce

– Analyze and store in S3

DynamoDB

• Schema free

• Fast to find using primary and range keys

• Support for complex queries (scan)

• Eventually consistent by default, with a higher cost to ensure strong consistency

• Must use SDK/API to access

• Complex queries are made using a sequential/full table scan (high cost)

(43)

DynamoDB Architecture

• Data spread across hundreds of servers (nodes)

• Multiple versions of data across multiple nodes

• Conflict resolution during reads (not writes)

• Servers form a cluster in the form of a “ring”

• Client connection through:

– Routing using a load balancer

– Client library that reflects Dynamo’s partitioning scheme and can determine the storage host to connect to

DynamoDB Limitations

• 64 KB limit on item size (row size)

• 1 MB limit on fetching data

• Pay more if you want strongly consistent data

• Size is multiple of 4KB (provisioning throughput wastage)

• No table joins

• Indexes are created during table creation only

• No triggers

• Limited comparison capability
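The "multiple of 4KB" point is easiest to see with arithmetic. The sketch below assumes, as the slide states, that item sizes are rounded up to the next 4 KB for throughput accounting and that items are capped at 64 KB. Current service limits and units may differ, and the function names are illustrative.

```python
import math

UNIT_BYTES = 4 * 1024          # rounding unit, taken from the slide
MAX_ITEM_BYTES = 64 * 1024     # item size limit, taken from the slide


def billed_units(item_bytes):
    """Units consumed when the item size is rounded up to the next 4 KB."""
    if not 0 < item_bytes <= MAX_ITEM_BYTES:
        raise ValueError("item size outside the allowed range")
    return math.ceil(item_bytes / UNIT_BYTES)


def wasted_fraction(item_bytes):
    """Share of the billed capacity that carries no data."""
    billed = billed_units(item_bytes) * UNIT_BYTES
    return (billed - item_bytes) / billed


exact_fit = billed_units(4096)
just_over = billed_units(4097)
waste_just_over = wasted_fraction(4097)
```

An item one byte over the boundary consumes two units and wastes almost half of the provisioned capacity, so it pays to keep items just under a multiple of the unit size.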

(44)

Amazon Elastic MapReduce (EMR)

• A web service to process vast amounts of data

• “On demand” Hadoop cluster

• Store data on Amazon S3

• Scale the number of virtual servers in your cluster to manage your computation needs

• Start Hadoop cluster to process data

• Turn off when done

• Pay for the hours used

• EMR integrates seamlessly with AWS services

(45)

EMR

• Hadoop clusters running on Amazon EMR use:

– EC2 instances as virtual Linux servers for the master and slave nodes

– Amazon S3 for bulk storage of input and output data

– CloudWatch to monitor cluster performance and raise alarms

• Move data into and out of DynamoDB using

Amazon EMR and Hive (Used by EMR)

EMR considerations

• EMR reduces Hadoop management complexity/costs

• Compared to a local cluster - Lower performance of M/R jobs

• Reduced data throughput (S3 cannot be compared to a local hdd)

• EMR Hadoop is usually not the latest version

(46)

Kinesis

• AWS service for real time processing of streaming data

• Easy administration

• Performs continuous processing on streaming big data

• Scales seamlessly to match the operational needs

• Redshift and DynamoDB integration

• Client libraries allow designing and operating real-time streaming data applications

• Ensures high durability and availability of data by replicating across multiple Availability Zones

• Cost efficient for workloads of any scale

(47)

Kinesis architecture - Input

• Kinesis streams are sharded

• Each shard ingests up to 1MB/sec of data and up to 1000 tps

• Data is stored for 24 hours

• Scaling is done by adding or removing shards

• To store data in a stream producers use a PUT call

• PUTs are distributed across the shards by using a Partition Key
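Two small calculations follow from the bullets above: how many shards a workload needs, and how a partition key selects a shard. The per-shard limits are the ones quoted on the slide. Kinesis hashes the partition key with MD5 into a 128-bit integer, but the equal-sized hash ranges and the function names below are simplifying assumptions for this sketch.

```python
import hashlib
import math


def shards_needed(mb_per_sec, records_per_sec):
    """Smallest shard count that satisfies both per-shard ingest limits."""
    return max(1, math.ceil(mb_per_sec / 1.0), math.ceil(records_per_sec / 1000))


def shard_for_key(partition_key, num_shards):
    """Map a partition key to a shard by hashing it into equal hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")      # 128-bit integer
    range_size = 2 ** 128 // num_shards
    return min(hash_value // range_size, num_shards - 1)


bandwidth_bound = shards_needed(mb_per_sec=3.5, records_per_sec=2500)
record_bound = shards_needed(mb_per_sec=0.2, records_per_sec=4200)
shard = shard_for_key("device-17", num_shards=bandwidth_bound)
```

Records with the same partition key always land on the same shard, so a low-cardinality key can overload one shard while the others sit idle.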

Kinesis architecture -Output

• You must design a distributed, fault tolerant and scalable application that can keep up with the stream

• Use the Kinesis Client Library to:

– Simplify reading from the stream

– Automatically start Kinesis workers

– Adjust the number of workers as the number of shards changes

– Restart workers if they fail and redistribute them to use the new EC2 instances

(48)

Kinesis pricing

• Pay as you go. No up-front costs

– Hourly shard rate - $0.015

– Per 1,000,000,000 PUTs - $0.028

• Customers specify throughput requirements in shards

• Each shard delivers 1MB/s on ingest and 2MB/s on egress

• Inbound data transfer is free

• EC2 charges apply for Kinesis processing apps

(49)

Will hardware catch up?

(50)

Technology to balance growth

• A lot of the BigData technology revolves around tackling disk speed and size

– Avoiding seek time

– Working in memory

– Smart Caching

– Order data homogeneously for better compression

– Duplicate data for bandwidth purposes

– Eventually consistent

The storage world is not idle

• Storage today is faster, “smarter” and has more capacity

• 1 million IOPS is “easy”

– Violin – 70TB of usable space in a 3U SSD box

– XtremIO – Extreme deduplication in SSD arrays

– Infinidat – 2PB in a 42U rack

– ExaData – Database logic executed at hardware level

(51)

Storage revives legacy

• Legacy system longevity can be greatly prolonged by advanced storage

• The easiest “upgrade” path – just copy the data

• Limitations like data size, TX per second and other boundaries can be taken down by storage