
2014 Storage Developer Conference. © IGATE. All Rights Reserved.

Big Data Storage

Apurva Vaidya

Anshul Kosarwal


Agenda

Abstract

Big Data and its characteristics

Key requirements of Big Data Storage

Triage and Analytics Framework

Big Data Storage Choices

Hyper scale, Scale Out NAS, Object Storage

Impact of Flash memory and tiered storage technologies

Big Data Storage Architecture

Performance Optimization

Conclusion


Abstract

Big Data is a key buzzword in IT. The amount of data to be stored and processed has exploded to a mind-boggling degree.

Although "Big Data Analytics" has evolved to get a handle on this information content, the data must be appropriately stored for easy and efficient retrieval.

And for this, proper storage options should be in place to handle


“The pessimist – complains about the wind

The optimist – expects it to change

The leader – adjusts the sails”


Storage Capacity

[Figure: Global Information Storage Capacity, analog vs. digital, 1986 to 2007]

1986: 1% digital (analog 2.6 Exabytes, digital 0.02 Exabytes)
1993: 3% digital
2000: 25% digital
2002: 50% digital (beginning of the digital age)
2007: 94% digital (analog 19 Exabytes, digital 280 Exabytes)
2017: 7,235 Exabytes

Source: Business Computing World
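The "% Digital" values on this slide can be cross-checked against the analog and digital capacities it quotes. The following is a minimal sketch of that arithmetic; the function name `digital_share` is an illustrative choice and not part of the original deck.

```python
def digital_share(digital_eb: float, analog_eb: float) -> float:
    """Return the digital fraction (0.0 to 1.0) of total storage capacity."""
    total = digital_eb + analog_eb
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return digital_eb / total


# Capacities in exabytes, as quoted on the slide.
capacity = {
    1986: {"analog": 2.6, "digital": 0.02},
    2007: {"analog": 19.0, "digital": 280.0},
}

for year, eb in capacity.items():
    share = digital_share(eb["digital"], eb["analog"])
    print(f"{year}: {share:.1%} digital")
```

Rounded to whole percentages, the two results agree with the 1% and 94% labels shown for 1986 and 2007.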


Key Characteristics of Big Data

The Four V's of Big Data

Volume (SCALE OF DATA)
• It's estimated that 2.5 TRILLION GIGABYTES of data are created each day
• Terabytes, Records, Transactions, Tables, files

Velocity (ANALYSIS)
• By 2016 there will be 18.9 Billion Network Connections
• Sensor Data
• Batch, Near time, Real time, Streams

Variety (DIFFERENT FORMS OF DATA)
• Structured, Unstructured, Semistructured

Veracity (UNCERTAINTY OF DATA)
• 1 IN 3 BUSINESS LEADERS don't trust the information they use to make decisions

Source: http://www.ibmbigdatahub.com/infographic/four-vs-big-data


Prioritizing Things

Concerns
• Data growth
• System performance and scalability
• Network congestion and connectivity

Business Priorities
• Enterprise growth
• Attracting & retaining new customers
• Reducing enterprise costs

Technology Preferences
• Analytics/business intelligence
• Mobile technologies
• Cloud computing


Key Requirements of Big Data Storage

• Provide tiered storage
• Scalable
• Self-Managing
• Highly Available
• Security
• Widely Accessible
• Self Healing
• Integrate with


TAF Overview

[Figure: Triage and Analytics Framework (TAF). Components shown: Device, Logs, Sensor Data, Log Analysis, Sensor Data Analytics, Pattern Identification, Pattern Matching, Real-time Analytics, Query Expression, Hadoop, MPP Database, Columnar Storage, Triage, Predictive Analysis, Actions.]


The way we discovered our needs

IGATE has its 'Triage and Analytics framework' for proactive diagnostic and predictive analysis.

Working: Scale-out nodes on a Hadoop framework.
• Good
• Fast for computation: leverages individual nodes, servers, CPU, memory and RAM
• Efficient: for cost and performance
• RAID support

[Figure labels: Hadoop, Hadoop DAS Storage, High Availability, Global Namespace]

But as data grew, there was a performance drop.

Maybe it was time for us to think about the storage that goes below and its scalability.

So, in order to OPTIMIZE storage, we went on to discover storage options which can scale.


Storage Options

• Traditional Storage (DAS, Tape, etc.; NAS, SAN)
• Scale Out NAS
• Object Storage
• Hyper Scale Storage


Traditional Storage Options

Legacy Storage Options
• Traditional Storage
• Tape Storage
• Disk Storage

Forms of data: Unstructured Data, Semi-Structured Data, Structured Data

Sources of data: Machine Data, Human Information, Business Data

Traditional storage options fail when it comes to BIG DATA needs.


Scale Out NAS

Built to
• Deliver a flexible and scalable platform for Big Data.
• Simplify management of big data storage infrastructure.
• Increase efficiency and reduce costs.

Why can it be useful?
• Simple to scale: Scale Out NAS is a platform for Big Data that is very simple to scale.
• Predictable: Performance and reliability are predictable.
• Efficient: It leverages all the resources in an efficient way.
• Available: High Availability support is built into the file system (FS) and hence can be guaranteed.


Object Storage

Built to
• Provide optimized object storage systems, purpose-built for petabyte scale, that enable organizations to build cost-efficient storage pools for all their unstructured data needs.
• Provide high reliability, very large scalability, flexible pricing options, and the ability to store, retrieve, and leverage data the way you want to.

Key features
• Immutable data: Provides the best option for unstructured immutable data.
• Scalable to petabytes and beyond: Object storage systems are scalable to petabytes of data, billions of files and beyond.
• Maximum portability: Provides a set of interfaces including C++ and Java APIs, REST for web applications, and file gateways to support file-based workflows.
• Cloud capability (multi-tenancy): Storage can be carved up into virtual containers based on who owns the data.
• Index and search: Indexing of the metadata allows extremely fast filtering and search capabilities.
• Lowest cost of ownership: It is delivered on a commodity-based platform for the lowest cost of ownership.
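The immutability, multi-tenancy and metadata-indexing properties listed on this slide can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the class `ObjectStore` and its `put`, `get` and `search` methods are hypothetical names, not the API of any real object storage product, and content hashing is used here merely as one convenient way to derive object identifiers.

```python
import hashlib
from collections import defaultdict


class ObjectStore:
    """Toy in-memory object store (hypothetical, for illustration only)."""

    def __init__(self):
        # (container, object_id) -> (data, metadata)
        self._objects = {}
        # (container, metadata key, metadata value) -> set of object ids
        self._index = defaultdict(set)

    def put(self, container: str, data: bytes, metadata: dict) -> str:
        """Store data in a tenant's container and index its metadata."""
        object_id = hashlib.sha256(data).hexdigest()
        key = (container, object_id)
        if key not in self._objects:  # immutable: existing objects are never overwritten
            self._objects[key] = (data, dict(metadata))
            for name, value in metadata.items():
                self._index[(container, name, value)].add(object_id)
        return object_id

    def get(self, container: str, object_id: str) -> bytes:
        """Return the stored bytes; raises KeyError for unknown objects."""
        return self._objects[(container, object_id)][0]

    def search(self, container: str, **criteria) -> set:
        """Return ids in one container whose metadata matches all criteria."""
        matches = [self._index.get((container, k, v), set())
                   for k, v in criteria.items()]
        return set.intersection(*matches) if matches else set()


store = ObjectStore()
log_id = store.put("tenant-a", b"disk error on bay 3", {"kind": "log", "device": "d1"})
print(store.search("tenant-a", kind="log"))
```

Because each container keeps its own index entries, a search issued for one tenant never returns another tenant's objects, which is the essence of the "virtual containers" idea on the slide.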


Hyper Scale Storage

Built for
• The term was coined to represent the scale-out design of IT systems with the ability to expand massively under a controlled, efficiently managed, software-defined framework.
• It enables the construction of storage facilities that cater for very high volumes of processing and that manage and store petabytes (PB) or even greater quantities of data.

Key Criteria
• Elasticity: It treats data as independent from the hardware, making it easy to grow or shrink volumes as needed, with no interruption in service.
• Scalability: It supports not just terabytes, but petabytes of data, right from the start.
• High performance and availability: It eliminates hot spots, I/O bottlenecks and latency, hence offering very fast access to information.
• Standards based and open source: It supports any and every file type, benefiting from the collective collaboration inherent in any open-source system, resulting in constant testing, feedback and improvements.
• Leading-edge architecture and design: It supports or includes a complete storage OS stack and hence has a modular, stackable architecture that enables data to be stored in native formats. It also avoids metadata indexing, instead storing and locating data algorithmically.
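To make "locating data algorithmically" concrete, the sketch below computes an object's location from its name alone, with no metadata lookup. The slide does not say which algorithm a given hyper scale system uses; rendezvous (highest-random-weight) hashing is swapped in here purely as an illustrative technique, and the names `locate` and `_score` are assumptions.

```python
import hashlib
from typing import List


def _score(node: str, key: str) -> int:
    """Deterministic pseudo-random weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def locate(key: str, nodes: List[str], replicas: int = 1) -> List[str]:
    """Return the nodes that should hold `key`, highest score first.

    Every client computes the same answer from the key and the node list,
    so no central metadata index has to be consulted.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    ranked = sorted(nodes, key=lambda n: _score(n, key), reverse=True)
    return ranked[:replicas]


nodes = ["node-1", "node-2", "node-3", "node-4"]
print(locate("sensor/2014-09-16/reading-42", nodes, replicas=2))
```

A useful property of this scheme is that removing a node only relocates the objects that node was holding; all other placements stay the same, which fits the elasticity criterion of growing or shrinking without interrupting service.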


Analytical Comparison

Scalability
• Scale Out NAS: Addresses the problem with a software management approach that enables a virtualization layer to make the nodes behave like a single system.
• Object Storage: Can scale seamlessly in number of objects, to exabyte capacity, and across distributed sites.
• Hyper Scale: Hyperscale architectures typically include dozens to hundreds of generic servers running either a clustered application or a hypervisor.

Accessibility / Availability
• Scale Out NAS: Assuming that hardware will fail, scale-out infrastructure is built on commodity hardware and designed to deal with a higher rate of hardware failures.
• Object Storage: This type of storage provides omni-protocol support (reconfigurable bit-stream processing in high speed network). It also supports a wide choice of protocols and a rich application ecosystem.
• Hyper Scale: Automatic replication delivers a high level of data protection and resiliency. It also eliminates hot spots, I/O bottlenecks and latency, hence offering very fast access to information.

Efficiency
• Scale Out NAS: Scale-out NAS for big data is intelligent enough to leverage all the resources in the storage system, regardless of where they are. By moving data around, it optimizes performance and capacity.
• Object Storage: Provides the highest efficiency with low overhead, no bandwidth kill and the lowest management cost.
• Hyper Scale: Hyperscale gives stripped storage, which provides rapid and efficient expansion to handle big data in different use cases, such as Web serving or some database applications.

Reliability
• Scale Out NAS: Reliability is achieved by features like hardware RAID, redundant hot-swappable power supply modules, and full access to data even if a compute node goes down.
• Object Storage: Provides all-round data protection with smart replication.
• Hyper Scale: Reliability can be achieved by installing a number of hyper scale manager servers.

Performance
• Scale Out NAS: A scale-out storage system expands to maintain functionality and performance as it grows, in line with data centre demands as business needs change over time.
• Object Storage: Gives high performance for throughput and IOPS, with low latency.
• Hyper Scale: Hyperscale systems auto-balance workloads (which can also be given priority) when nodes are added.


Current Storage Architectures

Direct Access Storage Architecture

The problems with this type of architecture:
• Underutilized storage
• Localization of data
• Dependence on server OS for replication, etc.

Networked Storage Architecture

The major problem with this model:
• The explosive increase in random I/O coming from high performance server systems is the least efficient workload for mechanical hard disk drives.


Future System and Storage Architectures

Flash on Server (FOS)
This architectural approach is revolutionary and will provide the largest benefit but will require the greatest change in infrastructure.

Flash on Storage Controller (FOSC)
This approach is more evolutionary and will be easier to incorporate into existing application infrastructures.


Big Data Storage Architecture


Considerations

Three questions every business should consider when architecting the storage component of a big data plan:

What kind of storage will you use?
• Own data center or data warehouse, or go with a cloud system.
• Newer storage technologies: object storage, software-defined storage, and data-defined storage.

How much data is there, and are you planning for growth?
• The amount of data produced, collected, and managed is going to grow, and it may do so exponentially.

What kind of data do you have, and what is the intent of storing it?
• Content-driven or analytical application
• Structured/Unstructured
• In the big data era, it's not uncommon to encounter a "bigger is better" mindset.
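The growth question can be turned into a rough capacity estimate. The sketch below assumes compound annual growth, a replication factor (3 is the HDFS default for `dfs.replication`), and a free-space headroom; the function name, the headroom parameter and all example numbers are hypothetical and only illustrate the shape of the calculation.

```python
def projected_raw_capacity_tb(current_tb: float,
                              annual_growth: float,
                              years: int,
                              replication: int = 3,
                              headroom: float = 0.25) -> float:
    """Estimate raw capacity (TB) needed after `years` of compound growth.

    annual_growth: 0.5 means the logical data set grows 50% per year.
    replication:   number of copies kept of every byte.
    headroom:      fraction of raw capacity to keep free (0 <= headroom < 1).
    """
    if current_tb < 0 or years < 0:
        raise ValueError("current_tb and years must be non-negative")
    if replication < 1 or not 0 <= headroom < 1:
        raise ValueError("replication must be >= 1 and headroom in [0, 1)")
    logical_tb = current_tb * (1 + annual_growth) ** years
    return logical_tb * replication / (1 - headroom)


# Hypothetical example: 100 TB today, doubling every year, planned 3 years out.
print(projected_raw_capacity_tb(100, annual_growth=1.0, years=3))
```

Even this simple model shows why exponential growth dominates the plan: the replication factor and headroom are fixed multipliers, while the growth term compounds every year.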


Big Data Storage Architecture Decisions

Protecting Data Sets
• RAID, HDFS and more

Capacity Intensive vs Compute Intensive
• Loads of data vs processing

Scale Out vs Object
• Global namespace file system for easy management or object storage's use of metadata

Compliance and Security needs
• Regulatory compliance (HIPAA, etc.) and security (PCI standards)


Optimizations for Big Data


Optimizing Hadoop for MapReduce

dfs.block.size (renamed to dfs.blocksize in Hadoop 2.x)
• Specifies the size of the data blocks into which the input data set is split. Default is 64MB (128MB in Hadoop 2.x).

mapred.compress.map.output
• Specifies whether to compress the output of maps before sending it across the network. Default is 'false'. Uses SequenceFile compression.

mapred.map/reduce.tasks.speculative.execution
• Multiple instances of map or reduce tasks can be run in parallel. Default is 'true'. The output of the task which finishes first is taken and the other task is killed.

• The maximum number of map/reduce tasks that will be run simultaneously by a task tracker. Default is 2.
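The effect of the block size setting can be estimated with simple arithmetic: with the default input format, each HDFS block typically becomes one input split and therefore one map task. The sketch below is a simplified model that ignores split-size overrides and small-file effects; `num_blocks` is an illustrative name, not a Hadoop API.

```python
MB = 1024 * 1024


def num_blocks(file_size_bytes: int, block_size_bytes: int = 64 * MB) -> int:
    """Number of HDFS blocks (and, roughly, map tasks) for a single file."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    # Integer ceiling division avoids floating-point rounding on large files.
    return -(-file_size_bytes // block_size_bytes)


one_gib = 1024 * MB
print("64MB blocks: ", num_blocks(one_gib, 64 * MB))
print("128MB blocks:", num_blocks(one_gib, 128 * MB))
```

Doubling the block size halves the number of map tasks for the same input, which reduces task start-up overhead but also reduces the available parallelism; the right value depends on the job and the cluster.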


Optimizing Hadoop for MapReduce

io.sort.mb
• The size of the in-memory buffer (in MB) used by a map task for sorting its output. Default is 100MB.

io.sort.factor
• The maximum number of streams to merge at once when sorting files. This property is also used in the reduce phase. Default is 10.

mapred.job.reuse.jvm.num.tasks
• The maximum number of tasks to run for a given job per JVM on a tasktracker. A value of –1 indicates no limit: the same JVM may be used for all tasks for a job. Default is 1.

• The number of threads used to copy map outputs to the Reducer. Default is 5.
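A back-of-the-envelope model shows how `io.sort.mb` and `io.sort.factor` interact. The sketch assumes the buffer spills when it reaches the `io.sort.spill.percent` threshold (0.80 by default in Hadoop 1.x) and that each merge round combines up to `io.sort.factor` files; it ignores record metadata overhead and Hadoop's actual merge scheduling, so the numbers are rough estimates only. The function names are illustrative.

```python
import math


def estimate_spills(map_output_mb: float,
                    io_sort_mb: int = 100,
                    spill_percent: float = 0.80) -> int:
    """Approximate number of spill files written by one map task."""
    if map_output_mb <= 0:
        return 0
    return math.ceil(map_output_mb / (io_sort_mb * spill_percent))


def merge_rounds(num_files: int, io_sort_factor: int = 10) -> int:
    """Approximate number of merge rounds needed to reach a single file."""
    if io_sort_factor < 2:
        raise ValueError("io_sort_factor must be at least 2")
    rounds = 0
    while num_files > 1:
        num_files = math.ceil(num_files / io_sort_factor)
        rounds += 1
    return rounds


spills = estimate_spills(1000)  # hypothetical map task emitting ~1000 MB
print(spills, "spills,", merge_rounds(spills), "merge rounds")
```

Under this model, a larger sort buffer reduces the number of spills, and a larger merge factor reduces the number of rounds in which the same data is re-read and re-written on disk.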


[Figure: MapReduce data flow. Each Input Split feeds a map task (Map1, Map2, ... Map 'n'). Map side: output goes to an In-Memory Buffer, is partitioned, sorted and spilled to disk (Spills); a Merger combines the spills into Partitions (r1, r2, ... r 'n'), giving sorted map output for each reducer. Reduce side (Reduce1 ... Reduce 'n'): Copy Phase, Sort Phase (merge until last batch), Reduce Phase, then Output to HDFS.]
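The phases named in this diagram (buffer, partition, sort, spill, merge, copy, reduce) can be mimicked in a few lines of plain Python. This is a toy, single-process word count meant only to show the order of the steps; it is not Hadoop code, and the function names and the tiny `buffer_records` limit are assumptions chosen to force several spills.

```python
import heapq
import zlib
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def partition(key: str, num_reducers: int) -> int:
    """Pick the reducer for a key (stable across runs, unlike hash())."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map(split, num_reducers, buffer_records=4):
    """Map one split; return spills, each {partition: sorted (word, 1) list}."""
    spills, buffer = [], []

    def spill():
        parts = defaultdict(list)
        for key, value in sorted(buffer):  # sort, then split by partition
            parts[partition(key, num_reducers)].append((key, value))
        spills.append(parts)
        buffer.clear()

    for line in split:
        for word in line.split():
            buffer.append((word, 1))
            if len(buffer) >= buffer_records:  # buffer full: spill to "disk"
                spill()
    if buffer:
        spill()
    return spills


def merge_spills(spills, num_reducers):
    """Map-side merge: one sorted run per partition."""
    return {p: list(heapq.merge(*(s.get(p, []) for s in spills)))
            for p in range(num_reducers)}


def run_reduce(map_outputs, p):
    """Copy partition p from every map, merge the runs, sum counts per key."""
    merged = heapq.merge(*(out[p] for out in map_outputs))
    return {key: sum(v for _, v in group)
            for key, group in groupby(merged, key=itemgetter(0))}


def word_count(splits, num_reducers=2):
    map_outputs = [merge_spills(run_map(s, num_reducers), num_reducers)
                   for s in splits]
    result = {}
    for p in range(num_reducers):
        result.update(run_reduce(map_outputs, p))
    return result


splits = [["big data storage", "big data"], ["object storage scale out"]]
print(word_count(splits))
```

Because every spill is already sorted, both the map-side merger and the reduce-side sort phase only need a streaming merge (`heapq.merge` here) rather than a full re-sort.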


Efficient MapReduce for problem solving

Algorithmic Performance
• Algorithmic complexities, data structures, join strategies

Implementation Performance
• Follow general Java best practices
• MapReduce patterns (Numerical Summarizations, Filtering, Reduce Side Join)

Physical Performance
• Storage, CPU, Hadoop environment, etc.

MapReduce and HDFS are complex distributed systems that run arbitrary user code. There are no fixed rules to achieve optimal performance.

MapReduce is a framework. The solution should fit the framework.
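Of the patterns named on this slide, the reduce-side join is the one whose mechanics are least obvious. The sketch below imitates it in plain Python: each mapper tags its records with their source, the shuffle groups tagged values by join key, and the reducer pairs the two sides. The data sets (devices and sensor readings) and all function names are hypothetical.

```python
from collections import defaultdict


def map_devices(record):
    """Tag a (device_id, model) record as coming from the device table."""
    device_id, model = record
    return device_id, ("D", model)


def map_readings(record):
    """Tag a (device_id, value) record as coming from the readings table."""
    device_id, value = record
    return device_id, ("R", value)


def shuffle(tagged_pairs):
    """Group tagged values by join key, as the framework's shuffle would."""
    groups = defaultdict(list)
    for key, tagged in tagged_pairs:
        groups[key].append(tagged)
    return groups


def reduce_join(key, tagged_values):
    """Inner join: emit one row per (device, reading) pair sharing the key."""
    models = [v for tag, v in tagged_values if tag == "D"]
    readings = [v for tag, v in tagged_values if tag == "R"]
    return [(key, m, r) for m in models for r in readings]


def reduce_side_join(devices, readings):
    pairs = [map_devices(d) for d in devices] + [map_readings(r) for r in readings]
    out = []
    for key, values in sorted(shuffle(pairs).items()):
        out.extend(reduce_join(key, values))
    return out


devices = [("d1", "model-A"), ("d2", "model-B")]
readings = [("d1", 10), ("d1", 12), ("d3", 7)]
print(reduce_side_join(devices, readings))
```

The pattern is general but shuffle-heavy, since every record from both inputs crosses the network; this is why join strategy appears under algorithmic performance on the slide.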


Conclusion

Data is at the heart of business and life in 2014; survival and success for both organizations and individuals increasingly depend on the ability to access and act on all relevant information. The key is to craft effective strategies to distill raw data into meaningful, actionable analytics. The payoff can include better, smarter, faster business decisions, creating a truly agile organization and empowering people with the data they need.

Any analysis of Big Data will require a new kind of storage solution, one that can support and manage unstructured and semi-structured data as readily as if it were structured information. Hyperscale computing and storage leverage vast numbers of commodity servers with direct-attached storage (DAS) to ensure scalability and redundancy, and to provide the input/output operations per second (IOPS) necessary to deliver data immediately to analytics tools and the end users who use them. Companies ready to embrace this high-performance yet cost-effective solution will see business benefits, including better and faster decision making, improved customer relationships and retention, and a clear competitive advantage.

Also, an effective and efficient MapReduce implementation can make a significant performance difference. Hadoop parameters can be configured and tuned to suit the requirements of the problem being solved.


References

1. "The Impact of Flash on Future System and Storage Architectures"; Author: David Floyer; http://wikibon.org/wiki/v/The_Impact_of_Flash_on_Future_System_and_Storage_Architectures; 2014.

2. "Hyper scale and Software-led Infrastructure"; Author: David Floyer; http://wikibon.org/wiki/v/Hyperscale_and_Software-led_Infrastructure; 2013.

3. "Ten Properties of the Perfect Big Data Storage Architecture"; Author: Dan Woods; http://www.forbes.com/sites/danwoods/2012/07/23/ten-properties-of-the-perfect-big-data-storage-architecture/; 2012.

4. "A look at a 7,235 Exabyte world"; Author: Larry Dignan; http://www.zdnet.com/a-look-at-a-7235-exabyte-world-7000022200/

5. "High Performance Storage For Hyper scale Data Centers"; Author: George Crump; http://www.storage-switzerland.com/Articles/Entries/2013/5/28_High_Performance_Storage_For_Hyperscale_Data_Centers.html; 2013.

6. "What Is A Hyper scale Data Center"; Author: George Crump; http://www.storage-switzerland.com/Articles/Entries/2013/5/20_What_Is_A_Hyperscale_Data_Center.html

7. "An object storage device for a data overload"; Author: Randy Kerns; http://searchdatacenter.techtarget.com/feature/An-object-storage-device-for-a-data-overload; 2014.

8. "From TPC-C to Big Data Benchmarks: A Functional Workload Model"; Author: Yanpei Chen, Francois Raab and Randy Katz.

9. "TPC Benchmark B Standard Specification"; Revision 2.0; http://www.tpc.org/tpca/spec/tpcb_current.pdf; 1994.

10. "Tips for improving MapReduce Performance"; Author: Todd Lipcon; http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/

11. "A Beginner's guide to Next Generation Object Storage"; Author: Tom Leyden; http://www.ddn.com/pdfs/BeginnersGuideToObjectStorage_whitepaper.pdf
