• No results found

Big Data for Architects

N/A
N/A
Protected

Academic year: 2021

Share "Big Data for Architects"

Copied!
53
0
0

Loading.... (view fulltext now)

Full text

(1)

Big Data for Architects

Sam Babad @ MidLink

[email protected]

052-595-8885

Agenda

What is BigData

BigData challenges

Textbook big data architecture

Analysis, visualization & self service

BigData in the cloud

(2)

What is BigData

Why is everyone talking about BigData?

BigData is a Big Buzzword coined by Big

Companies to make Big Money

There are a lot of new and exciting

technologies behind the buzzword

A lot of money has been and is being

invested into BigData companies

(3)

What is BigData

Commonly defined

as a combination of

three

(the three Vs)

Volume

Velocity

Variety

My Definition

– Data that prevents running the algorithms you want on it in a reasonable amount of time.

What is BigData?

Big Data is not a buzzword. Big Data is something

more and more organizations have

Today most consider > 15TB Big Data

For real time > 5TB is considered Big Data

The internet entities are managing PBs of data

(4)

Data sources

Machine generated

Handheld devices

Optimizers

Analyzers

Loggers

Probes

And a billion sensors

Use Cases

Advertisement optimization and verification

Security and fraud detection

Algo-trading

Behavioral analysis

Network optimization

Real time billing

(5)

Some numbers

In memory data bases of 50TB are

reasonable

Columnar of 2-5PB are reasonable

We usually do not consider 15TB big

data for analytics

This is all on commodity hardware

Some numbers

AirBus A380 => ~640TB per flight

Twitter ~12TB per day

NYSE 1TB per day

Storage capacity doubles every 3 years

(6)

Is there value in BigData

• In 2014 Big Data vendors will pocket nearly $30 Billion from HW, SW and PS (2020 => $76B) :marketwatch.com

• How did Target know your 16 years old is pregnant before you did?

• Would you like to know the weather tomorrow?

• Take a look at Goldcorp Challenge

• Fight Cancer, Crime, Starvation & pollution

• There is one CCTV camera for every 32 people in the UK

Leveraging BigData technologies

BigData technologies can be used to solve

other problems

50GB is not BD but reducing foot print is big!

Postgre the new DB since 1989

CEP (Complex Event Processing) have

replaced rules engines

(7)

Big Data Challenge

Data and Analytical Complexity

Extracting actual business value Data-driven decision making

BigData challenges

Organizations capture far more data than

they ever have

Storage is usually not the issue

No two organizations are the same, nor are

their data sets

(8)

Data Challenges & Complexity

Volume

– Collecting

– Storage & backup

– Scanning (querying)

Velocity

– Data that has to be processed in real time

– Parallelism

– Correlation (over time and between series)

Data Challenges & Complexity

Variety

– Unstructured data along side structured data

– Internal along side external data

– A variety of formats MP4, AVI, Avro etc…

Distribution

– Multiple data centers

– Regulation

(9)

Data Challenges & Complexity

• Life cycle management – Retention policy

• What is a enough sampling? 2, 3 xmases? • Stale data (in different systems)

• Coherence/Consistency

– Between systems & data sources

• Integration

– Data collection and augmentation

– Multiple systems integration (islands of data)

Data Challenges & Complexity

Security

All your data is in one location

Who is allowed to access which data?

Multiple tenancy

One version of the truth

Where do you live right now?

(10)

Data Challenges & Complexity

Working with Legacy systems

prolong life time

Self-service

– Investigative access to ALL users

– Non techies included

– Keeping access permissions in mind

– Ever changing data model

– High touch vs. Low touch data

(11)

BigData the BigPicture

partial

No SQL Columnar Hadoop In-Memory Appliance OLTP Cloud Visualization Sharding IBM EMC HP Oracle Google Yahoo Amazon SAP Actian Datastax EnterpriseDB Apache Twitter Tableau LinkedIn Datameer

(12)

BigData echo system

ETL Enrichment Storage Lifecycle Virtualization Deep Analytics BI Code

Considerations

Is data your business?

How much data do you have?

Where is it now?

What is your growth projection?

Can you take it outside the organization?

Do you need to augment?

(13)

Considerations

• For how long do you need the data?

• Can you save partial data/aggregates?

• Can you join/dedup/aggregate historical data?

• Do you need real-time processing?

• Do you have frequent schema changes?

• Do you need to do OLAP & OLTP on the same system?

• Any limitations HW/OS/MEM?

Climbing up the ladder

• Will fast storage be enough?

• Will open source cut it? Or, what part do you use open source for?

• Which route is best? Hadoop, NoSQL, InMemory, Columnar etc…

• What do you use to code?

• Which ETL do you use?

(14)

Textbook architecture (lambda)

(15)

Introducing the major players

Kafka.apache.org

• Apache Kafka is publish-subscribe messaging rethought as a distributed commit log

• Open sourced and maintained by LinkedIn

• Fast

• Scalable

• Durable

• Distributed by Design

(16)

Kafka

how it works

Publishers send messages to a cluster of brokers

The brokers persist the messages to disk

Consumers can request a range of messages

– Offset and Length

Everything is distributed Pub, Sub & Queues

Consumers maintain their own state (No TX)

Throughput oriented

Kafka

log based queue

Messages are persisted to append-only logs

– Sequential writes and reads

Topics are distributed queues (partitions)

– Partitions are replicated

– Leader & followers

– Producers load balance

– Instances know each other

(17)

Kafka

A single broker can handle hundreds of

megabytes of reads and writes per second from

thousands of clients

A cluster is a great data transfer backbone

Data is retained based on a preset time (2 days)

Guaranties order of messages per partition

Uses java.nio.FileChannel#transferTo (sendfile

system call)

Storm

• CEP – Complex Event Processor

• Platform for analyzing streams of data as they “occur” • Highly distributed, real-time computation system

• Provides primitives for real time computation • Simplifies working with queues and workers • Fault tolerant and scalable

• Complementary to Hadoop

• Created at Backtype and acquired by Twitter in 2011 • Component of Apache Incubator from 2013

(18)

Storm Components

Master node - Nimbus daemon

– Distributes code around the cluster, assigns the tasks and monitors failures

Worker node - Supervisor daemon

– Listens to the work assigned, runs the worker process

Zookeeper

– maintains the coordination service between the supervisor and master

Storm Components

• Stream - An unbounded sequence of tuples

• Spout - A stream source. Reads data from real data sources (logs, API calls, event data..) and generates a stream..

• Bolt - processes input streams (joins, filters,

aggregations...) and produces output streams. Contains data processing, persistence, and messaging alert logic.

(19)

Storm Components

Spouts and Bolts are packaged into a

Topology

Topology is submitted to Storm clusters for

execution and runs indefinitely until it is

manually terminated

Storm use-cases

Stream processing of tweets

Real-time log processing

Sensor data analysis

Financial market analysis

Natural Language Processing

(20)

Hadoop

H

igh

A

vailability

D

istributed

O

bject

O

peration

P

latform

Distributed infrastructure for parallel processing

Initially HDFS & Map/Reduce

Ability to grow to thousands of machines

Virtualizes the hardware

Redundant

Simplifies growth & parallel processing

Distros: Cloudera, Hortonworks, Apache, IBM

HDFS

H

adoop

D

istributed

F

ile

S

ystem

Distributed file system for redundant storage

Uses commodity hardware

Supports big files (PBs) + large quantities of files

Write-once-ready-many

Built for HW failure

(21)

HDFS Architecture

HDFS

Master

/Slave

Data is organized into files and directories

Files are divided into uniform sized blocks

Master

Namenode

• Manages the file system namespace in memory

• Maintain file name to list blocks + location mapping

• Manages block allocation/replication

• Checkpoints namespace and journals namespace

changes for reliability

(22)

HDFS

Master/

Slave

• Slaves “Datanodes” handle block storage

• Blocks are stored using the underlying OS’s files

• Clients access the blocks directly from datanodes

• Periodically sends block reports to Namenode

• Periodically check block integrity

• Blocks are replicated (=3)

• File system keeps checksums of data for corruption detection and recovery

• HDFS exposes block placement so that computation can be migrated to data

Map/Reduce

Programming model for distributed

computation jobs at a massive scale

A framework to organize and execute such

job

The idea is take the logic to the data and

not vice versa

(23)

Map/Reduce

Map

– Inspect a huge amount of data

– Get something of interest

– Shuffle and sort the interesting data

Reduce

– Aggregate the interesting data

– Generate a final report

Think of grep | sort | aggregate by customer ID

(24)

Hadoop

the big picture

Pig

Map/Reduce is a lot of work to write

Pig is a High-level data flow language

Data processing language

Compiler to translate Pig Latin to

Map/Reduce

(25)

Pig example

Users = load‘users’as (name, age);

Fltrd = filter Users by age >= 18 and age <= 25; Pages = load ‘pages’ as (user, url);

Jnd = joinFltrdby name, Pages by user; Grpd = groupJndbyurl;

Smmd = foreachGrpdgenerate group,COUNT(Jnd) as clicks; Srtd = orderSmmdby clicks desc;

Top5 = limitSrtd 5;

store Top5 into‘top5sites’;

Hive

SQL with Hadoop

• Puts a schema/structure to log data stored in HDFS

• Provides an SQL-like query language

• SQL is complied into a chain of M/R jobs

• Usually faster than Java M/R because of the optimizer in the complier

• Has a command shell interface

• Tables can be associated with a serializer /deserializer class - parse data into the table

(26)

Hive high level architecture

HBase

Open source non-relational distributed

column-oriented database on top of HDFS

– (row key, column family, column, timestamp) -> value

Real time read/write access to data in HDFS

No: Sql, Join, TX

Yes: Scalable, Redundant, Consistent

(27)

Cloudera Impala

High performance general purpose SQL query

language for Hadoop not based on M/R

Written in C++ with dynamic code generation

Impalad runs on every node

– Accepts client request, plans & executes the query

Statestored used to find the data

– Provides name service & metadata distribution

Catalogd metadata updates to all nodes

Cloudera impala performance

I/O bound x3-4 then hive

Multi M/R phases in Hive x45 faster in

impala

(28)

Parquet

Columnar storage across the Hadoop platform

– Reduces IO requirement

– Saves space (better compression)

– Different encoding for different types of data

– Additional metadata Page/Column/File statistics

Implemented both in Java for M/R & C++(Impala)

– Can be used with Impala, Hive, Pig & M/R

– Impala & Hive can access the same parquet tables

Columnar databases

Stores information by columns, rather than rows

This is much better for data crunching applications

and DWH than using row storage.

The table below will be stored on disk in this order:

– 1, 2, 3 -10:17, 11:11, 13:15 -1:37, 5:13, 3:59 -12345, 23456, 34567

Brands

– ParAccel a.k.a Matrix

– Vertica

– RedShift (cloud)

Call ID Time Duration Number

1 10:17 1:37 12345

2 11:11 5:13 23456

(29)

Columnar database features

• Homogenous blocks on disk allow for better compression/encoding of data

• Storage reduction by X3 to X12

• Massive Parallel Processing (MPP) on commodity HW enabled by correct distribution keys

• Take the code to the data

• Shared nothing

• Standard SQL language and relational schema

• UDF (User Defined Functions) like stored procedures

Columnar database performance

Columnar addressed the disk I/O bottle neck

specifically targeting seek time

Designed for BI oriented queries

– Select 7 columns from a table with 28 columns will fetch ¼ of the data as each block has info from only one column

– The data will also be mostly sequential so no seek time

– Compression will reduce amount of data fetched even more

Local processing on each node (shared nothing)

(30)

Columnar database pitfalls

• Trickle loading is a tricky

• Concurrency might be an issue

• Internal network speed is crucial

• Data redistribution is an issue

• Updates are out of the question

• Joins can be an issue

• However, for that you get X10 to X100 faster queries, less storage, bulk inserts of XTB an hour and excellent

simultaneous load and query it is worth it

In-Memory database

A DBMS the only uses main memory (RAM) to

store data

In Memory replication is used for High Availability

Usually these DBs do an Asyncbackup to an

external file or DB

NVDIMM now enables to run at full speed and

maintain data in the event of power failure

(31)

In-Memory database

• Designed for extreme OLTP processing

• Some IMDBs (In Memory DB) are MPP

• Close to linear scaling (some claim a=0.85)

• Some use smart mechanisms to only hold HOT data in memory

• Memory is 3,000 times faster then disk

• Cost in constantly dropping while size is increasing

In-Memory examples

Hana

– Column-oriented, Relational, MPP RDBMS

– Convergence of OLTP and OLAP Analytics

VoltDB

– MPP, Relational, RDBMS, ACID, ANSI SQL, Java UDF

– No write a head, no redo log, no access to disk in TX

– Real time response & async-offload (File,Columnar,Hadoop)

(32)

Analysis, visualization and self service

ETL

ETL is still mostly a traditional realm

– Goldengate, Informatica & Python are very much in use as an addition/instead of Hadoop

More sources and targets as each BigData

technology is targeted at a different requirement

Data travels more

– Multiple new BigData technologies side by side with

(33)

Traditional BI tools

Traditional BI tools still play a big roll

– OBIEE, Cognos, BO, Microstrategy

– Have connectors to BD technologies like Impala & Hive • Most use JDBC or ODBC connectivity

– Enable federated reports from multiple sources

– Still require building a “world”

– Are still relatively cumbersome

– Provide most users with limited access to limited data

Self service BI

With more and more data collected & its value

rising, accessibility is becoming a big issue

Canned reports simply don

t cut it anymore

Business users want to

add

a local excel file

they have to the data and run correlations

Exported partial data is less valuable

(34)

Self service

Data consumers want to run their own analysis,

do their own exploration and use simple tools

They are not techies

Security and constraints are required for self

service, especially in multi tenants environments

The trick in self service is to give access to all the

data required and only that data

Ability to easily manipulate data is critical

New visualization tools

The

new

visualization tools are still traditional

– Tableau, Qlickview and others offer a sexier, easier to use interface, intuitive for more users

– Point and click discovery and reporting

– Do not offer a paradigm shift

New BI & visualization tools are just emerging

– Datameer, Platfora

(35)

New BI tools Datameer as an example

Paradigm shifts only happen with new BI tools that

step out of the old realm

Changing the way we think about BI

– No ETL – just EL

– No building of “worlds”

– Working directly and exclusively on Hadoop

– Easy integration and import of data

– An “Excel” like interface for analytics

(36)

Datameer capabilities

• Easy integration and import of information from files, logs, database and many other sources

• Data manipulation done via excel. SQL or M/R knowledge is not required

• Administrator can control data access at the column level

• “Power Point” like visualization

• A new way of looking at self service BI

(37)

BigData in the cloud

More and more organization place their data in the

cloud, thus the data in the cloud is exploding

– Pay as you go, for what you use

– Elasticity – Grow when required

• think of changes as duplicate, change, test, switch drop the old

– Decrease administration overhead

– Use state of the art technologies

Google BigQuery

Big Data analysis requires expensive hardware

and skilled DB administrators

Managing data centers and tuning software is

time consuming and expensive

Why not to use web services as analytic tools

instead?

BigQuery - a fully managed data analytics service

in the cloud

(38)

What is BigQuery?

Service for interactive analyzing of big datasets

Works in conjunction with Google Storage

Uses SQL like query syntax

Web-service accessed by a RESTful API

99.9% reliable and secure:

• Data replicated across multiple data centers

• Secured through Access Control Lists

Scalability to any number of users

(39)

BigQuery

echo system

Part of the Google cloud offering

App engine(app execution)

Compute Engine (linuxVMs) Storage

BigQuery

Works with most BI & Visualization tools

Direct access from Google App engine

Securely share & distribute the results

Offered with Premium support

BigQuery - Technology

Columnar

MPP

Uses tree structure for distribution

Limitations

Only one join per query

Relatively slow and runs out of resources

(rare)

(40)

Amazon RedShift

Data warehousing done the AWS way

Easy to provision and scale up massively

Pay as you go

Really fast performance

Open and flexible with support for popular tools

Petabytescale

Amazon RedShift

Dramatically reduce I/O by

– Column storage

– Data compression

– Zone maps

– Direct-attached storage

– Large data block sizes

(41)

Amazon RedShift

Parallelize and distribute everything

Query

Load

Backup

Restore

Resize

Amazon RedShift

more features

Monitor query performance

Point & click resize or recreate

Built in security

Automatic backups

Integrates with multiple data sources

Use existing analysis tools

(42)

DynamoDB

• Fully managed noSQL DB service on AWS

• Table based (each table is independent)

• Data stored in the form of name - value attributes

• Automatic scaling – Provisioned throughput – Storage scaling

– Distributed architecture

• Monitoring tables with CloudWatch

• Integration with Elastic MapReduce – Analyze and store in S3

DynamoDB

Schema free

Fast to find using primary and range keys.

Support for complex queries (scan)

Consistency by default with high cost to ensure it

Must use SDK/API to access

Complex queries are made using Sequential/Full

table scan (high cost)

(43)

DynamoDB Architecture

Data spread across hundreds of servers (nodes)

Multiple versions of data across multiple nodes

Conflict resolutions during reads (not writes)

Servers form a cluster in a form of a

ring

Client connection through:

– Routing using a load balancer

– Client-library that reflects Dynamo’s partitioning

scheme and can determine the storage host to connect

DynamoDB Limitations

• 64 KB limit on item size (row size)

• 1 MB limit on fetching data

• Pay more if you want strongly consistent data

• Size is multiple of 4KB (provisioning throughput wastage)

• No table joins

• Indexes are created during table creation only

• No triggers

• Limited comparison capability

(44)

Amazon Elastic MapReduce (EMR)

• A web service to process vast amounts of data

• “On demand” Hadoop cluster

• Store data on Amazon S3

• Scale the number of virtual servers in your cluster to manage your computation needs

• Start Hadoop cluster to process data

• Turn off when done

• Pay for the hours used

• EMR integrates seamlessly with AWS services

(45)

EMR

Hadoop clusters running on Amazon EMR use:

– EC2 instances as virtual Linux servers for the master and slave nodes

– Amazon S3 for bulk storage of input and output data

– CloudWatch to monitor cluster performance and raise alarms

Move data into and out of DynamoDB using

Amazon EMR and Hive (Used by EMR)

EMR considerations

• EMR reduces Hadoop management complexity/costs

• Compared to a local cluster - Lower performance of M/R jobs

• Reduced data throughput (S3 cannot be compared to a local hdd)

• EMR Hadoop is usually not the latest version

(46)

Kinesis

• AWS service for real time processing of streaming data

• Easy administration

• Performs continuous processing on streaming big data

• Scales seamlessly to match the operational needs

• Redshift and DynamoDB integration

• Client libraries allow designing and operating real-time streaming data applications

• Ensures high durability and availability of data by replicating across multiple Availability Zones

• Cost efficient for workloads of any scale

(47)

Kinesis architecture - Input

• Kinesis streams are sharded

• Each shard ingests up to 1MB/sec of data and up to 1000 tps

• Data is stored for 24 hours

• Scaling is done by adding or removing shards

• To store data in a stream producers use a PUT call

• PUTs are distributed across the shards by using a Partition Key

Kinesis architecture -Output

• You must design distributed, fault tolerant and scalable application that can keep up with the stream

• Use Kinesis Client Library to:

– Simplify reading from the stream – Automatically start Kinesis workers

– Adjust the number of workers as the number of shards

changes

– Restart workers if they fail and redistribute to use the

new EC2 instances

(48)

Kinesis pricing

• Pay as you go. No up-front costs Hourly shard rate - $0.015

Per 1,000,000,000 PUTs - $0.028

● Customers specify throughput requirements in shards ● Each shard delivers 1MB/s on ingest and 2MB/s on

egress

● Inbound data transfer is free

● EC2 charges apply for Kinesis processing apps

(49)

Will hardware catch up?

(50)

Technology to balance growth

A lot of the BigData technology revolved around

tackling disk speed and size

– Avoiding seek time

– Working in memory

– Smart Caching

– Order data homogeneously for better compression

– Duplicate data for bandwidth purposes

– Eventually consistent

The storage world is not idle

Storage today is faster with more capacity and

smarter

1 million IOPs is

easy

– Violin – 70TB of usable space in a 3U SSD box

– XtreamIO – Extreme deduplication in SSD arrays

– Infinidat – 2PB in a 42U rack

– ExaData – Database logic executed at hardware level

(51)

Storage revives legacy

• Legacy system longevity can be greatly prolonged by advanced storage

• The easiest “upgrade” path – just copy the data

• Limitations like data size, TX per second and other boundaries can be taken down by storage

• And that is before DNA Storage – 5.5PB on one cubic millimeter

What does the future hold?

(52)

Information Week

innovators

1.

MongoDB

2.

Amazon (Redshift,

EMR, DynamoDB)

3.

Cloudera(CDH, Impala)

4.

Couchbase

5.

Datameer

6.

Datastax

7.

Hadapt

8.

Hortonworks

9.

Karmasphere

10.

MapR

11.

Neo Technology

12.

Platfora

13.

Splunk

A few facts

• Data will continue to explode

• Value will become harder to harvest

• Storage will get cheaper

• CPUs will be faster

• Cost will be a big player

• The realm remains prone to disruption

(53)

Sam Babad @ MidLink

[email protected]

References

Related documents

Simulating clinical concentrations and delivery rates of a typical intravenous infusion, a variety of routinely used pharmaceutical drugs were tested for potential binding to

  Campus and Location 

In conclusion, for the studied Taiwanese population of diabetic patients undergoing hemodialysis, increased mortality rates are associated with higher average FPG levels at 1 and

The main wall of the living room has been designated as a &#34;Model Wall&#34; of Delta Gamma girls -- ELLE smiles at us from a Hawaiian Tropic ad and a Miss June USC

On the other hand, methylene blue which happens to have molecular weight of 319.85 g/mol has a slightly lower diffusion rate compared to potassium permanganate

German copyr ight app l ies.. In trans igence of the targe t commun i ty on changes in l ifes ty les may a lso lead to un in tended consequences. car dependency and

Players can create characters and participate in any adventure allowed as a part of the D&amp;D Adventurers League.. As they adventure, players track their characters’

This recommended standard specification has been formulated as a guide to users, industry and government to ensure the proper use, maintenance and inspection of Load binders designed