
(1)

NextGen Infrastructure for Big Data

Anil Vasudeva, President & Chief Analyst, IMEX Research

(2)

SNIA Legal Notice

The material contained in this tutorial is copyrighted by the SNIA and author unless otherwise noted.

Member companies and individual members may use this material in presentations and literature under the following conditions:

Any slide or slides used must be reproduced in their entirety without modification

The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.

This presentation is a project of the SNIA Education Committee.

Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney.

The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information.

(3)

NextGen Infrastructure for Big Data

© 2012 Storage Networking Industry Association. All Rights Reserved.

Source: © IMEX Research – Big Data Industry Report ©2011-12

Abstract

This session will appeal to Business Planning, Marketing, Technology System Integrators and Data Center Managers seeking to understand the drivers behind the demand for and rise of Big Data.

The internet has spawned an explosion in data growth in the form of data sets, called Big Data, that are so large they are difficult to store, manage and analyze using traditional RDBMS, which are tuned for Online Transaction Processing (OLTP) only. This new data is heavily unstructured, voluminous, fast-streaming and difficult to harness. Even more importantly, the infrastructure cost of the HW and SW required to crunch it using traditional RDBMS, to derive any analytics or business intelligence online (OLAP) from it, is prohibitive.

To capitalize on the Big Data trend, a new breed of Big Data technologies (such as Hadoop and others) and many companies have emerged. They leverage parallelized processing, commodity hardware, open source software and tools to capture and analyze these new data sets, and provide a price/performance that is 10 times better than existing Database/Data Warehousing/Business Intelligence Systems.

Learning Objectives

The presentation will illustrate the operational challenges businesses face today using RDBMS systems despite using fast access in-memory and solid state storage technologies. It details how IT is harnessing the emergent Big Data to manage massive amounts of data, and new techniques such as parallelization and virtualization to solve complex problems, in order to empower businesses with knowledgeable decision-making.

It lays out the rapidly evolving big data technology ecosystem: the different big data technologies, from Hadoop and Distributed File Systems to emerging NoSQL derivatives for implementation in private and hybrid cloud-based environments, and the Storage Infrastructure Requirements to Store, Access, Secure and Prepare data for analytics and visualization while manipulating it rapidly to derive business intelligence online, to run businesses smartly.

(4)

Big Data in IT Industry Roadmap

IT Industry Roadmap stages:

Integration/Consolidation: Integrate Physical Infrast./Blades to meet CAPSIMS ®IMEX (Cost, Availability, Performance, Scalability, Inter-operability, Manageability & Security)

Standardization: Standard IT Infrastructure, Volume Economics HW/Syst SW

Virtualization: Pools Resources. Provisions, Optimizes, Monitors. Shuffles Resources to optimize Delivery of various Business Services

Automation: Automatically Maintains Application SLAs (Self-Configuration, Self-Healing ©IMEX, Self-Acctg. Charges etc.)

Cloudization: On-Premises > Private Clouds > Public Clouds. DC to Cloud-Aware Infrast. & Apps. Cascade migration to SPs/Public Clouds.

Analytics – BI: Predictive Analytics - Unstructured Data

(5)


NextGen IT Infrastructure

Diagram labels:

Clients: Home Networks, Web 2.0 Social Ntwks. (Facebook, Twitter, YouTube…), Cable/DSL, Cellular, Wireless, Supplier/Partners, Remote/Branch Office

Network: Internet, ISPs, Core Optical Edge

Public Cloud Center: Servers, VPN, IaaS, PaaS, SaaS, Vertical Clouds

Enterprise VZ Data Center / On-Premise Cloud:
Tier-1 Edge Apps: Caching, Proxy, FW, SSL, IDS, DNS, LB, Web Servers
Tier-2 Apps: Application Servers, HA, File/Print, ERP, SCM, CRM Servers
Tier-3 Data Base Servers: Database Servers, Middleware, Data Mgmt.
Switches: Layer 4-7, Layer 2, 10GbE, FC Stg.
FC/IP SANs
Management, Directory, Security, Policy, Middleware Platform

Request for data from a remote client to a Data Center or Cloud crosses a myriad of

systems and devices. Key is identifying bottlenecks & improving performance

(6)

Harnessing Big Data for Business Insights

Information is at the center of a New Wave of opportunity.

The majority of data growth is being driven by unstructured data and billions of large objects.

80% of the world’s data is unstructured, driven by the rise in mobility devices, collaboration and machine-generated data.

Data Sources > Big Data Infrastructure > Business Insights

(7)


Corporate Need: Business Perf... Optimization

Unstructured Big Data can provide NextGen Analytics to help businesses make better-informed decisions in:

• Product Strategy
• Targeting Sales
• Just-In-Time Supply-Chain Economics
• Business Performance Optimization
• Predictive Analytics & Recommendations
• Country Resources Management

(8)
(9)


Corporate Need: Business Insights

Store
Issue: Information Exploding. Volume: Digital Content doubling every 18 months. Velocity: >80% growth driven from unstructured data. Variety: sources of data changing.
Solution: A unified information/content storage methodology that enables users to manage the volume, velocity and variety of information from multiple sources.

Manage
Issue: Complexity in "managing" information. Need to classify, synchronize, aggregate, integrate, share, transform, profile, move, cleanse, protect, retire.
Solution: A solution portfolio of tools and services to manage all types of information in a hybrid storage environment.

Analyze
Issue: Current solutions limited to BI tools focused on structured and lagging information.
Solution: Build/buy packaged Real-Time Predictive Analytical Solutions for unstructured analytics tools.

Collaborate
Issue: Multiple access methods needed to meet needs of a diverse audience.
Solution: Centralized share, collaborate and act on insights anytime, anywhere on any device.

Model/Adapt
Issue: Ability to understand how the information impacts the business. How to transfer to action.
Solution: Model Information on current operations w/potential strategy impact. Leverage Tech. to adapt.

(10)

Opportunity: Converting Big Data Deluge into Predictive Analytics & Insights

Personal Location Services Data

Data Generated by | TB/Year
Navigation Devices | 600
Navigation Apps on Phones | 20
Smart Phone Opt-In Tracking | 1000
Geo Targeted Ads | 20
People Locator (Emergency Calls/Search..) | 10
Location based Services (e.g. Games) | 5
Other | 45

(11)


Issues with Existing RDBMS

Key Issues with RDBMS Technologies

Handling Mixed Unstructured Data - RDBMS don’t handle non-tabular data (notorious for doing a poor job on recursive data structures)

Legacy Archaic Architecture - RDBMS don’t parallelize well to accommodate commodity HW clusters

Speed - Seek time of physical storage has not kept pace with network speed improvements

Scale - Difficult to scale out RDBMS efficiently; clustering beyond a few servers is notoriously hard

Integration - Data processing tasks need to combine data from non-related sources, over a network

Volume - Data volumes have grown from 10s of GB > 100s of TB > PBs in recent years. Existing tabular RDBMS can’t handle such large DBs

Diagram labels (Big Data Analytics): Fault Tolerance, Availability, Deep Insights, EU Adhoc Analytics, Cost, Unstructured Data, Latency, Scalability

(12)

Issues with Existing RDBMS

(13)


Big Data - Database Solutions

(14)

Big Data – The New Face of DBs

Big Data Paradigm - The New face of DB Systems

• Adopts Schema-Free Architecture

• Can do away with Legacy Relational DB Systems

Some data have sparse attributes, do not need relational property

• Key Oriented Queries

Some data stored/retrieved mainly by primary key, w/o complex joins

• Trade-off of Consistency, Availability & Partition Tolerance

• Scale Out, not up, - Online Load balancing cluster growth
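
The key-oriented, scale-out points above can be sketched in a few lines. This is only an illustration under stated assumptions, not any product's implementation: the class name `ShardedKeyValueStore`, the choice of MD5 hashing and the 64 virtual nodes per server are all hypothetical choices made for the example. Keys are stored and retrieved by primary key only, and adding a node moves just the keys whose owner changed, which is one way online cluster growth can rebalance load.

```python
import hashlib
from bisect import bisect_right


class ShardedKeyValueStore:
    """Toy key-value store that hash-partitions keys across nodes.

    Hypothetical sketch: names and parameters are illustrative only.
    """

    def __init__(self, node_names, vnodes=64):
        self.vnodes = vnodes
        self.ring = []   # sorted list of (hash, node_name) points on the ring
        self.data = {}   # node_name -> {key: value}
        for name in node_names:
            self.add_node(name)

    @staticmethod
    def _hash(text):
        return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)

    def _owner(self, key):
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect_right(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

    def add_node(self, name):
        """Scale out: add a node, then move only the keys it now owns."""
        self.data[name] = {}
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{name}#{i}"), name))
        self.ring.sort()
        for node, items in list(self.data.items()):
            for key in list(items):
                owner = self._owner(key)
                if owner != node:
                    self.data[owner][key] = items.pop(key)

    def put(self, key, value):
        self.data[self._owner(key)][key] = value

    def get(self, key):
        # Primary-key lookup only: no joins, no secondary indexes.
        return self.data[self._owner(key)].get(key)
```

A caller would create the store with a few node names, `put` and `get` values by key, and call `add_node` to grow the cluster while existing keys stay readable.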

(15)


Analytics – The Next Frontier in IT

(16)
(17)


Key Innovations – Solid State Storage

Source: IMEX Research SSD Industry Report ©2011

Chart: Units (Millions) and IOPS/GB for HDD vs. SSD, 2009 to 2014.

Note: 2U storage rack, • 2.5” HDD max cap = 400GB / 24 HDDs, de-stroked to 20%, • 2.5” SSD max cap = 800GB / 36 SSDs

Key to Database performance are random IOPS. SSDs outshine HDDs in IO price/performance, a major reason, besides better space and power, for their explosive growth.

(18)

Innovations – DB SW Technologies

Tech Innovation timeline, 1985 to 2015:

OLTP Transactions DB SW: Rows Locking, Optimizer, Parallel Query, Clustering, XML, Grid, Open Source / Hadoop

OLAP Analytics DB SW: Indexing, Partitioning, Columnar, Materialized View, Bit Mapped Index, In-Memory, Query Binding

Hardware: 32 bit, SMP, NUMA, 64 bit, Multi-core/Blades, Flash, MPP

Big Data: Multi-core, Columnar, In-Memory, MPP, Visualization

Chart: OLTP Database Innovation Progress. $/TPMc and TPMc/Processor (log scales) by year, 1985 to 2015.

Chart: Big Data: Analytics DB Technology Impact. DW Size (TB) and $/GB (log scales) by year, 1985 to 2015.

(19)


Big Data - Key Requirements

Types of Data Organizations Analyze

59% 44% 69% 64% 41% 33% 51% 28% 46% 36% 36% 18% 18% 21% 8% 3% 3% 68% 68% 37% 23% 33% 34% 26% 32% 21% 15% 11% 15% 11% 9% 3% 6% 5%

Customer/member data

Transactional data from applications

Application Logs

Other Types of Event Data

Network Monitoring/Network Traffic

Online Retail Transactions

Other Log Files

Call Data Records

Web Logs

Text data from social media and online

Search logs

Trade/quote data

Intelligence/defense data

Multimedia (audio/video/images)

Weather

Smartmeter data

Other (please specify)

Hadoop

Non Hadoop

(20)

Big Data – Architectural Goals

Big Data

Platform

Meet Enterprise Criterion

Meet Requirements of V3

(21)


Big Data - Market Requirements

Unified system: Pre-integrated for Ease of Installation and Management

• Platform – Pre-integrated using Hadoop Foundation, Large Scale Indexing

• Integrated Text Analytics - Address Unstructured Data

Usability - User Friendly Admin Console including HDFS Explorer, Query Languages

Enterprise Class Features – Provisioning, Storage, Scheduler, Advanced Security

• Supports search-centric, document-based XML data model

o Store documents within a transactional repository.

Schema-Free:

o No advance knowledge of the document structure (its "schema") needed

o Index words and values from each of the loaded documents together with its document structure.
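
The schema-free requirement above (index words and values together with the document structure, without declaring a schema first) can be illustrated with a small sketch. It is an assumption-laden toy: nested Python dicts and lists stand in for the XML documents the slide mentions, and the names `walk` and `SchemaFreeIndex` are hypothetical, not part of any product named in this presentation.

```python
from collections import defaultdict


def walk(node, path=""):
    """Yield (path, value) for every leaf of a nested dict/list document."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, f"{path}/{key}")
    elif isinstance(node, list):
        for child in node:
            yield from walk(child, path)
    else:
        yield path, node


class SchemaFreeIndex:
    """Indexes words on their own and together with their structural path."""

    def __init__(self):
        self.word_index = defaultdict(set)   # word -> document ids
        self.path_index = defaultdict(set)   # (path, word) -> document ids

    def load(self, doc_id, document):
        # No schema is declared: structure is discovered while loading.
        for path, value in walk(document):
            for word in str(value).lower().split():
                self.word_index[word].add(doc_id)
                self.path_index[(path, word)].add(doc_id)

    def search(self, word, path=None):
        word = word.lower()
        if path is None:
            return sorted(self.word_index.get(word, set()))
        return sorted(self.path_index.get((path, word), set()))
```

Two documents with entirely different fields can be loaded into the same index, then searched either by word alone or by word within a structural path such as `/author/name`.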

(22)

Big Data – Market Requirements

Architectural

Shared-nothing clustered DB architecture

o Programmable and extensible application servers.

• Support massive scalability to petabytes of source data

• Support open-source XQuery- and XSLT-driven architecture

• Simple to Deploy, Develop and Manage (UI & RESTful Interface)

• Support extreme mixed workloads - a wide variety of data types including arbitrarily hierarchical data structures, images, waveforms, data logs etc.

• Support thousands of geographically dispersed on-line users and programs executing a variety of requests from ad hoc queries to strategic analysis

Loading data before declaring or discovering its structure

Load data in batch and streaming fashion

Integrate data from multiple sources during load process at very high rates

Spread I/O and data across instances

• Provide consistent performance with linear cost

• Leverage Open Source SW: Low Costs, Multiple Sources, Hadoop Foundation Tools

• Connectivity with Oracle DB, Teradata Warehouse, JDBC Connectivity,

(23)


Real Time Analytics Execution

• Execute “streaming” analytic queries in real time on incoming load data

Updating data in place at full load speeds

• Scheduling and execution of complex multi-hundred node workflows

• Join a billion row dimension table to a trillion row fact table without pre-clustering the dimension table with the fact table

Performance

• Analyze data, at very high rates >GB/sec

Predictable Sub-ms response time for highly constrained standard SQL queries

Availability

• Ability to configure without any single point of failure

Auto-Failover Extreme High Availability

o Automated failover and process continuation without operational interruption when processing nodes fail
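
As a rough illustration of a "streaming" analytic query evaluated on incoming data, the sketch below keeps a time-windowed running average that is updated as each event arrives, rather than re-scanning stored data. The class name `SlidingWindowAverage` and the choice of a time-based window are assumptions made for the example; the slide does not prescribe a particular query or window type.

```python
from collections import deque


class SlidingWindowAverage:
    """Running average over the last `window_seconds` of events."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, value), oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Ingest one event and return the current windowed average."""
        self.events.append((timestamp, value))
        self.total += value
        cutoff = timestamp - self.window
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)
```

Each call to `add` does a small, bounded amount of work per event, which is the property that lets such a query keep up with the load stream.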

(24)

Big Data – Product Metrics Choices

Big Data - Product Metrics

Data Set Size: PB, TB, GB

Data Structure: Transaction, Machine, Unstructured, Other

Access/Use: Transaction, Search, Analytics

Parallel Processing: Appliance, Cluster < 1K, Cluster > 1K

Memory: In-Memory, Flash

DB Technique: Columnar, Zero Sharing, NoSQL

Data Cataloging SW: Text, Image, Audio

(25)


Advantage: Big Data Products

Characteristic | Legacy Paradigm | Big Data Paradigm
Structure | Transactional/Corporate | Unstructured/Derivative/Internet
Mode | Data Collection | Data Analysis
Focus | Find Answers | Find Questions
Facility | Reportive / What Happened? | Analytic / Why did it Happen?; Predictive / What will Happen Next?
Opportunity | Very Small Growth | Massive Growth
Players | Legacy Players | Agile Start Ups, well funded
Impact | Analyze Existing Businesses | Create New Businesses

(26)

Characteristic | Traditional RDBMS | Big Data/MapReduce
Data Size | GB | PB
Access | Interactive | Batch/Near Real-Time
Latency | Low | High
Data Updates | Read & Write Many Times | Write Once Read Many Times
Schema/Structure | Static Schema | Dynamic Schema
Language | SQL | UQL/Procedural (Java, C++..)
Integrity | High | Not 100%
Works Well for | Process Intensive Jobs | Data Intensive Jobs
Works Well w Data Size | Gigabytes | Petabytes
Data/Processing Interactions | Low Latency/High BW – precursor to success. Ntwk. BW can be a bottleneck causing nodes to be idle | Sends Code to Data, instead of Sending Data to other Nodes (Requiring Lower BW in Cluster)
Fault Tolerance | Coordinating Processes with Node Failures – a challenge | Fault Tolerant for HW/SW Failures
Scaling | Non-linear | Linear
Pgm-Distribution of Jobs | Difficult | Simple & Effective

(27)


Big Data Ecosystem

(28)

Big Data Stack

Diagram labels:

Data Sources: Databases (Oracle, IBM, Teradata…), Streaming Data, Devices, Applications

Ingest and export: Collector, Data Importer, Data Exporter, ETL

Data Store (Hadoop)

Analytics Platform: Advanced Analytics, OLAP Server, OLTP Server, Ad-hoc query, Search, REST/JSON API, RHive

Enterprise: PerfMon, Query Plan, Hive Workflow, Oracle-to-Hive

DBA

(29)


Hadoop’s Fit in Enterprise Stack

(30)

Big Data - Hadoop Architecture

Programming Languages: Data Flow (Pig), SQL (Hive)

Computations: Distributed Computing Framework (MapReduce)

Metadata (HCatalog)

Tabular Storage: Column-Stg (HBase)

Object Storage: Hadoop Distributed File System (HDFS)

Coordination (Zookeeper)

Management (HMS)

(31)


Big Data Connectors to EDW/BI

BI Framework - Interoperable with Enterprise Data Warehousing

Diagram labels:

Servers
Operating System
Hypervisor/VMs
Big Data Storage Framework (HDFS)
Big Data Processing Framework (MapReduce)
Big Data Access Framework: Pig, Hive, Sqoop
Big Data (Connectors)
Big Data Orchestration Framework: HBase, Avro, Flume, ZooKeeper
BI APPLICATIONS (Query, Analytics, Reporting, Statistics)
EDW
Network

Cross-layer services: Backup & Recovery, Management, Security

(32)

Big Data Infrastructure – Map Reduce

Map Reduce: A Distributed Computing Model

Typical Pipeline: Input > Map > Shuffle/Sort > Reduce > Output

• Easy to use: the developer writes a few functions

• Moves compute to data: schedules work on the HDFS node that holds the data

• Scans through data, reducing seeks

• Automatic reliability and re-execution on failure
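
The Input > Map > Shuffle/Sort > Reduce > Output pipeline can be sketched on a single machine with a word count, the customary teaching example. This is an in-memory illustration only, not Hadoop's API: the function names `map_phase`, `shuffle_sort`, `reduce_phase` and `run_job` are hypothetical, and a real cluster would run the map and reduce functions in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle_sort(pairs):
    """Shuffle/Sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse the values collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run Input > Map > Shuffle/Sort > Reduce > Output in memory."""
    mapped = (pair for record in records for pair in map_phase(record))
    return [reduce_phase(key, values) for key, values in shuffle_sort(mapped)]
```

The developer supplies only `map_phase` and `reduce_phase`; everything else is framework plumbing, which is the sense in which the model is easy to use.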

(33)


HDFS Architecture

Diagram: Name Node and Data Nodes

Name Node: holds the Hierarchical Namespace (File name >> Block IDs >> Block Locations), the Namespace State and the Block Map. Persistent Namespace Metadata & Journal are kept on NFS.

Data Nodes: JBOD storage holding blocks (b1, b2, b3), mapping Block ID >> Data. Horizontally Scale I/O & Storage. Data Nodes send Heartbeats & Block Reports and periodically check block checksums.

Actively Maintaining High Availability: 0. Bad/Lost Block, 1. Replicate, 2. Copy, 3. Block Received

HDFS

• Immutable File System – Read, Write, Sync/Flush – No random writes

• Storage Server used for Computation – Move Computation to Data

• Fault Tolerant & Easy Management – Built-In Redundancy, Tolerates Disk & Node Failure, Auto-Managing addition/removal of nodes, One operator/8K nodes

• Not a SAN but high bandwidth network access to data via Ethernet

• Used typically to solve problems not feasible with traditional systems: Large Storage Capacity >100PB raw, Large IO/computational BW >4K node/cluster, scale by adding commodity HW, Cost ~$1.5/GB incl. MR cluster
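
The name node's two mappings (file name to block IDs, block ID to block locations) and the re-replication of blocks after a node is lost can be sketched as a toy model. Everything here is an assumption for illustration: the class `ToyNameNode`, its method names, random replica placement and the replication factor of 3 are not taken from the slide, and the real HDFS placement policy is rack-aware and considerably more involved.

```python
import random


class ToyNameNode:
    """Toy model of name node metadata and re-replication (hypothetical)."""

    def __init__(self, data_nodes, replication=3):
        self.replication = replication
        self.namespace = {}              # file name -> list of block ids
        self.block_map = {}              # block id -> set of data nodes
        self.live_nodes = set(data_nodes)
        self._next_block = 0

    def add_file(self, name, num_blocks, rng=random):
        """Allocate blocks for a new file and place each replica set."""
        blocks = []
        for _ in range(num_blocks):
            block_id = f"b{self._next_block}"
            self._next_block += 1
            holders = rng.sample(sorted(self.live_nodes), self.replication)
            self.block_map[block_id] = set(holders)
            blocks.append(block_id)
        self.namespace[name] = blocks

    def locate(self, name):
        """File name >> block ids >> block locations."""
        return [(b, sorted(self.block_map[b])) for b in self.namespace[name]]

    def node_failed(self, node, rng=random):
        """Missed heartbeats: drop the node and re-replicate its blocks."""
        self.live_nodes.discard(node)
        for holders in self.block_map.values():
            if node in holders:
                holders.discard(node)
                candidates = sorted(self.live_nodes - holders)
                if candidates and holders:
                    # Copy the block from a surviving replica to a new node.
                    holders.add(rng.choice(candidates))
```

After `node_failed` runs, every block that lost a replica has been copied to another live node where one is available, which mirrors the replicate, copy and block-received steps in the diagram.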

(34)

Hadoop Distributed File System

HDFS Architecture

HDFS Characteristics

• Based on Google GFS (Google File System)

• Redundant Storage for massive amounts of data

• Data is distributed across all nodes at load time – efficient MapReduce processing

• Runs on commodity hardware – assumes high failure rate for components

• Works well with lots of large files

• Built around “Write once – Read many times”

• Large Streaming Reads – Not random access

• High Throughput more important than low latency

(35)


Hadoop Architecture - Overview

Hadoop Data Processing Architecture

(36)

Key Technologies Required for Big Data

Cloud Infrastructure

Virtualization

Networking

Storage

o In-Memory Data Base (Solid State Memory)

o Tiered Storage Software (Performance Enhancement)

o Deduplication (Cost Reduction)

(37)


Cloud Infrastructure for Big Data


Examples

eMail - Yahoo!,Google…

Collaboration - Facebook,Twitter …

Bus.Apps - SalesForce, GoogleApps, Intuit…

Examples

Amazon EC2

Force.com

Navitaire

Examples

Amazon S3

Nirvanix

Infrastructure HW &

Services

- Servers, Network, Storage

- Management, Reporting

SaaS

PaaS

IaaS

Platform Tools & Services

- Deploy developed platforms

ready for Application SW on

Cloud Aware Infrastructure

Software-as-a-Service

- Servers, Network, Storage

- Management, Reporting

Service

Providers

Examples Public - BT, Telstra, T-Systems France Telecom Private –

Hybrid – IBM/Cloudburst,

Cloud Services Providers

Public

– Multitenancy, On-Demand

Private

- On Premises, Enterprise

(38)

Platform Tools & Services

Operating Systems

Cloud Computing

Public Cloud Service

Providers

Private Cloud

Enterprise

App

SLA

SaaS

Applications

….. .Net, Python, EJB, Ruby, PHP …..

PaaS

IaaS

SaaS

Virtualization

Resources (Servers, Storage, Networks)

Hybrid

Cloud

App

SLA

App

SLA

App

SLA

App

SLA

Management

(39)


Private Cloud Requirements for Big Data

Diagram labels: App Silos, VZ, Private Cloud Storage, Public Cloud Storage; axes: Costs and Operational Flexibility.

Private Cloud Storage requirements:

Automation: Automated Provisioning - Moving Data & Processes Seamlessly. Allocation, Self Tuning of Resources to meet Workload Requirements

Self-Service: Access Resources on demand to speed deployment and delivery. Scale Resources Up/down to optimize their usage, Release when not needed

Service Catalog: Choose pre-defined IT Services/user/dept. Define SLA to efficiently meet services

Service Analytics: Monitor & Analyze Usage for Charge back. Interactively auto-tune performance, Availability with SLA requirements

(40)

Virtualization: Workloads Consolidation

• A single server 1.5x larger than a standard 2-way server will handle the consolidated load of 6 servers.

• VZ manages the workloads, and important apps get the compute resources they need automatically w/o operator intervention.

• Physical consolidation of 15-20:1 is easily possible.

• Reasonable goal for VZ x86 servers: 40-50% utilization on large systems (>4-way), rising as dual/quad core processors become available.

• Savings result in Real Estate, Power & Cooling, High Availability, Hardware, Management.

(41)


Virtualization: TCO Savings

Chart: Cost over 3 years (axis $- to $16,000), w/o VZ (995 Pre-Virtualization (VZ) Servers) vs. w VZ (78 VZ Servers). Cost categories: Provisioning, Hardware, SAN, Network, Power & Cooling, DC Real Estate, Disaster Recovery, Downtime, VZ SW & Support.

For TCO Analysis, EM: (408) 268-0800

(42)

Storage Infrastructure for Big Data

Storage

Efficiency

Virtualization

Mapping P > V, VM Management

Performance

In-Memory DB, Auto-Tiering-SSD/HDD

Costs Reduction

Thin Provisioning

Deduplication

Availability

RAID/Auto recover HA, Snapshots, CDP, Cloning, DRS

Security

Encryption/DLP

Service

Efficiency

Storage-as-a-Service

Service Catalogs by Workloads etc.

Policy Infrastructure

Service Level Attributes

Service Measurements

Performance Analytics

IOPS/Response Time, Bandwidth

Automation

Unified SAN/NAS Protocols

(43)


Storage Architecture - Impact from VZ

Source: IMEX Research SSD Industry Report ©2011


Replication

RAID – 0,1,5,6,10

Virtual Tape

Back Up/Archive/DR

Data Protection

Virtualization

MAID

Deduplication

Thin Provisioning

Storage Efficiency

Auto Tiering

Virtualization (VZ)

requires Shared Storage for

- VMotion

- Storage VMotion

- HA/DRS

- Fault Tolerance

Additional Capacity

Consumed for

- VZ snapshots,

- VM Kernel etc.

(44)

Storage – Issues & Solutions

Chart: Capacity Requirements under Traditional Data Growth vs. Storage Costs Reduction techniques.

Capacity savings by technique:

Snapshots ~75%
Thin Provisioning ~30%
DeDuplication ~25-95%
Auto-Tiering 65-95%
Thin Replication ~95%
RAID-DP ~40% vs. R10
Virtual Clones ~80%

CAPACITY SAVINGS ~ xx %

(45)


Storage Architecture Impacting Big Data

Auto-Tiering System

using Flash SSDs

Data Class

(Tiers 0,1,2,3)

Storage Media Type

(Flash/Disk/Tape)

Policy Engines

(Workload Mgmt.)

Transparent Migration

(Data Placement)

File Virtualization

(Uninterrupted App.Opns.in Migration)

Source: IMEX Research SSD Industry Report ©2011

Replication

RAID – 0,1,5,6,10

Virtual Tape

Back Up/Archive/DR

Data Protection

Storage Virtualization

MAID

Deduplication

Thin Provisioning

Storage Efficiency

Auto-Tiering


DRAM

Flash

SSD

Performance Disk

Capacity Disk

Tape

(46)

I/O Access Frequency vs. Percent of Corporate Data

Chart: % of I/O Accesses vs. % of Corporate Data, by storage tier:

SSD: Logs, Journals, Temp Tables, Hot Tables

FCoE/SAS Arrays: Tables, Indices, Hot Data

Cloud Storage SATA: Back Up Data, Archived Data, Offsite DataVault

(47)


SSD Storage: Filling Price/Perf.Gaps

HDD

Tape

DRAM

CPU SDRAM

Performance

I/O

Access Latency

HDD becoming

Cheaper, not faster

DRAM getting

Faster (to feed faster CPUs) &

Larger (to feed Multi-cores &

Multi-VMs from Virtualization)

SCM

NOR

NAND

PCIe

SSD

SATA

SSD

Price

$/GB

Source: IMEX Research SSD Industry Report ©2010-12

SSD segmenting into

PCIe SSD Cache

- as backend to DRAM

&

SATA SSD

(48)

SSD Storage - Performance & TCO

SAN TCO using HDD vs. Hybrid Storage (Cost $K)

Cost component | HDD Only | HDD/SSD
Power & Cooling | 14.2 | 5.2
Rack Space | 75 | 28
SSDs | 0 | 64
HDD SATA | 0 | 36
HDD FC | 145 | 0

Chart: SAN Performance Improvements using SSD. Performance (IOPS) and $/IOP for FC-HDD Only vs. SSD/SATA-HDD.

IOPS Improvement: 475%

$/IOPS Improvement: 800%

(49)


Workloads Characterization

Chart: workloads plotted by IOPS* (axis 10 to 1000K) against bandwidth in MB/sec (axis 1 to 500).

Workloads shown: Transaction Processing (OLTP, eCommerce), Business Intelligence (Data Warehousing, OLAP), TP, HPC, Web 2.0, Audio, Video, Scientific Computing, Imaging

RAID levels noted on the chart: RAID - 0, 3 and RAID - 1, 5, 6

*IOPS for a required response time (ms): IOPS* = #Channels × Latency^-1
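
The chart footnote relates achievable IOPS to the number of channels and the inverse of latency. A minimal sketch of that arithmetic follows, reading the footnote as IOPS = #Channels × (1 / Latency) with latency expressed in milliseconds; the function name `iops_at_latency` and the unit conversion are assumptions made for the example.

```python
def iops_at_latency(channels, latency_ms):
    """IOPS sustainable when each channel completes one I/O per latency period.

    Assumes the slide's footnote means IOPS = channels * (1 / latency),
    with latency given in milliseconds.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return channels * 1000.0 / latency_ms
```

For instance, under this reading 4 channels at a 5 ms response time sustain 4 × 1000 / 5 = 800 IOPS, and halving the latency doubles the figure.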

(50)

Storage performance, management and costs are big issues in running Databases

Data Warehousing Workloads are I/O intensive

• Predominantly read based with low hit ratios on buffer pools

• High concurrent sequential and random read levels

Sequential Reads require a high level of I/O Bandwidth (MB/sec)

Random Reads require high IOPS

• Write rates driven by life cycle management and sort operations

OLTP Workloads are strongly random I/O intensive

• Random I/O is more dominant

• Read/write ratios of 80/20 are most common but can be 50/50

• Can be difficult to build out test systems with sufficient I/O characteristics

Batch Workloads are more write intensive

• Sequential Writes require a high level of I/O Bandwidth (MB/sec)

Backup & Recovery times are critical for these workloads

• Backup operations drive high levels of sequential I/O

• Recovery operations drive high levels of random I/O

(51)


Best Practices – Storage in Big Data Apps

Goals & Implementation

• Establish Goals for SLAs (Performance/Cost/Availability), BC/DR (RPO/RTO) & Compliance

• Increase Performance for DB, OLTP and OLAP Apps: Random I/O > 20x, Sequential I/O Bandwidth > 5x

• Remove stale data from Production Resources to improve performance

• Use Partitioning Software to Classify Data by Frequency of Access (Recent Usage) and Capacity (by percent of total Data), using general guidelines such as: Hyperactive (1%), Active (5%), Less Active (20%), Historical (74%)

Implementation

• Optimize Tiering by Classifying Hot & Cold Data

Improve Query Performance by reducing the number of I/Os

Reduce the number of Disks Needed by 25-50% using advanced compression software achieving 2-4x compression

• Match Data Classification vs. Tiered Devices accordingly

Flash, High Perf Disk, Low Cost Capacity Disk, Online Lowest Cost Archival Disk/Tape

• Balance Cost vs. Performance of Flash

More Data in Flash > Higher Cache Hit Ratio > Improved Data Performance

• Create and Auto-Manage Tiering (Monitoring, Migrations, Placements) without manual intervention
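
The classification guideline above (Hyperactive 1%, Active 5%, Less Active 20%, Historical 74%) can be sketched as a ranking by access count followed by a split at those shares. The sketch makes several assumptions that are not in the slide: items are treated as equal-sized, so a share of items stands in for a share of capacity; the function name `classify_by_access` is hypothetical; and pairing the four classes one-to-one with the four device tiers in `TIER_FOR_CLASS` is an illustrative reading of "match accordingly".

```python
# Guideline shares from the slide: class name -> fraction of total data.
GUIDELINE = (
    ("Hyperactive", 0.01),
    ("Active", 0.05),
    ("Less Active", 0.20),
    ("Historical", 0.74),
)

# Assumed one-to-one pairing of classes with the device tiers listed above.
TIER_FOR_CLASS = {
    "Hyperactive": "Flash",
    "Active": "High Perf Disk",
    "Less Active": "Low Cost Capacity Disk",
    "Historical": "Online Lowest Cost Archival Disk/Tape",
}


def classify_by_access(access_counts, shares=GUIDELINE):
    """Rank items by access count and split them by the guideline shares.

    `access_counts` maps an item id to its recent access count. Items are
    assumed to be equal-sized, so item share approximates capacity share.
    """
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    total = len(ranked)
    result = {}
    start = 0
    cumulative = 0.0
    for position, (name, share) in enumerate(shares):
        cumulative += share
        last = position == len(shares) - 1
        end = total if last else round(total * cumulative)
        for item in ranked[start:end]:
            result[item] = name
        start = end
    return result
```

Looking up `TIER_FOR_CLASS[classify_by_access(counts)[item]]` then gives a candidate placement for each item; a production tiering engine would also weigh item size and migration cost.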

(52)

Best Practices: I/O Forensics in Storage-Tiering

LBA Monitoring and Tiered Placement

Every workload has unique I/O access signature

Historical performance data for a LUN can

identify performance skews & hot data regions

by LBAs

Storage-Tiering at LBA / Sub-LUN Level

[Figure: Storage-Tiered Virtualization. A logical volume is mapped onto physical storage made up of SSD arrays and HDD arrays, with policy-based automatic migration of hot data and cold data between them.]
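As a rough illustration of LBA monitoring, the sketch below counts I/Os per fixed-size extent of a LUN and picks the hottest extents as candidates for the SSD tier. The trace format (a plain list of accessed LBAs), the extent size and the function names are assumptions made for the example; a real auto-tiering product gathers this data inside the array or the virtualization layer.

```python
from collections import Counter


def extent_heat(lba_trace, extent_blocks):
    """Count I/Os per extent, where an extent is extent_blocks consecutive LBAs."""
    return Counter(lba // extent_blocks for lba in lba_trace)


def hot_extents(lba_trace, extent_blocks, flash_extents):
    """Return up to flash_extents extent indexes, hottest first."""
    heat = extent_heat(lba_trace, extent_blocks)
    return [extent for extent, _ in heat.most_common(flash_extents)]


def coverage(lba_trace, extent_blocks, chosen):
    """Fraction of all I/Os in the trace that land on the chosen extents."""
    heat = extent_heat(lba_trace, extent_blocks)
    total = sum(heat.values())
    return sum(heat[e] for e in chosen) / total if total else 0.0
```

A high `coverage` value for a small number of extents indicates the kind of performance skew the slide describes, where migrating a small hot region serves most of the I/O.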

(53)

Best Practices: Cached Storage

App/DB Server

LSI MegaRAID CacheCade Pro 2.0

Application: Improvement, Cached vs. HDD only

Oracle OLTP Benchmarks: 681%

SQL Server OLTP Benchmark: 1251%

Neoload (Web Server Simulation): 533%

SysBench (MySQL OLTP Server): 150%

[Chart: TPS for All HDD, Smart Flash Cache, and Persist Data on Warpdrive; data labels 0, 373, 655 on a 0-700 axis]

[Chart: Response Time for All HDD, Smart Flash Cache, and Persist Data on Warpdrive; data labels 0, 330, 660 on a 0-700 axis]

(54)

Big Data Targets: Analytics

Key Industries Benefitting from Big Data Analytics

[Chart: Global IT Spending by Industry Verticals 2010-15 ($B) vs. CAGR % (2010-15)]

(55)

Big Data Targets – Storage Infrastructure

Data Intensity by Industry Vertical

[Chart: Installed Terabytes/$M Rev 2010, axis 0.0 to 0.9, for the verticals listed below]

Banking & Financial Services

Media & Entertainment

Healthcare Providers

Professional Services

Telecommunications

Pharma, Life Sciences & Medical Products

Retail & Wholesale

Utilities

Industrial Electronics & Electrical Equipment

Software Publishing & Internet Services

Consumer Products

Insurance

Transportation

Energy

Value Potential of Using Big Data by Data Intensive Verticals

(56)

Big Data Targets: Storage Infrastructure

Big Data Storage Potential: Data Stored by Large US Enterprises

Stored Data by Industry (in US 2009 PB), share of total:

Discrete Manufacturing: 14%

Government: 12%

Communications and Media: 10%

Process Manufacturing: 10%

Banking: 9%

Health Care Providers: 6%

Securities & Investment Services: 6%

Professional Services: 6%

Retail: 5%

Education: 4%

Insurance: 4%

Transportation: 3%

Wholesale: 3%

Utilities: 3%

Resource Industries: 2%

Consumer and Recreational: 2%

Stored Data TB/Firm (>1K Employees US), chart data labels: 1,312; 1,792; 831; 1,931; 370; 3,866; 278; 697; 319; 870; 801; 536; 1,507; 825; 150; 231

(57)

Big Data Targets: Savings with Open Source

Legacy BI vs. Open Source Big Data Analytics

(58)
(59)

Key Takeaways

Big Data is creating a paradigm shift in the IT industry

Leverage the opportunity to optimize your computing infrastructure with Big Data Infrastructure after due diligence in the selection of vendors/products, industry testing and interoperability.

Apply the best storage technologies listed in this presentation and elsewhere

Optimize Big Data Analytics for Query Response Time vs. # of Users

Improve query response time for a given number of users (IOPS), or serve more users (IOPS) for a given query response time
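One common way to reason about the trade-off between query response time and number of users is Little's Law (concurrent requests = throughput x response time). The slides do not name this law, so treat the sketch below as an assumption-laden aid: it assumes a steady-state system in which every user always has one query outstanding (no think time), and the function names are hypothetical.

```python
def response_time_s(concurrent_users: float, throughput_per_s: float) -> float:
    """Average response time implied by Little's Law for a fixed throughput."""
    if throughput_per_s <= 0:
        raise ValueError("throughput must be positive")
    return concurrent_users / throughput_per_s


def supported_users(throughput_per_s: float, target_response_s: float) -> float:
    """Concurrent users that a given throughput can serve at a target response time."""
    return throughput_per_s * target_response_s


if __name__ == "__main__":
    # Hypothetical figures: doubling throughput halves response time for the
    # same users, or doubles the users served at the same response time.
    print(response_time_s(100, 50.0), response_time_s(100, 100.0))
    print(supported_users(50.0, 2.0), supported_users(100.0, 2.0))
```

Both functions express the same relationship, which is why the takeaway can be read in either direction: hold users fixed and cut response time, or hold response time fixed and add users.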

Select Automated Storage Management Software

Data Forensics and Tiered Placement

Every workload has a unique I/O access signature

Historical performance data for a LUN can identify performance skews & hot data regions by LBAs. Non-disruptively migrate hot data using auto-tiering software

Optimize Infrastructure to meet the needs of Applications/SLA

Performance Economics/Benefits

Typically 4-8% of data becomes a candidate for migration, and when migrated to a higher-performance tier it can provide a response time reduction of ~65% at peak loads. Many industry verticals and applications will benefit from using Big Data
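The figures above translate into a quick sizing exercise. The sketch below applies the 4-8% candidate range and the ~65% response time reduction quoted in the slide to a data set size and a baseline response time supplied by the reader. The function names and the 100 TB / 20 ms inputs are illustrative assumptions, and actual results depend on the workload.

```python
def flash_tier_range_tb(total_tb: float, low: float = 0.04, high: float = 0.08):
    """Capacity range (TB) of hot data that is a candidate for the fast tier."""
    return total_tb * low, total_tb * high


def reduced_response_ms(baseline_ms: float, reduction: float = 0.65) -> float:
    """Peak-load response time after applying the quoted reduction."""
    return baseline_ms * (1.0 - reduction)


if __name__ == "__main__":
    low_tb, high_tb = flash_tier_range_tb(100.0)   # hypothetical 100 TB data set
    print(f"fast tier candidate: {low_tb:.0f}-{high_tb:.0f} TB")
    print(f"peak response: {reduced_response_ms(20.0):.1f} ms (from 20 ms)")
```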

(60)

Q&A / Feedback

Many thanks to the following individuals for their contributions to this tutorial.

Source: IMEX Research

Joseph White

Anil Vasudeva
