
Randy Kerns

Senior Strategist

Evaluator Group

IT and Storage for Big Data Analytics


Overview

“Big data” can mean two different things

- Storage for large amounts of data

- Analytics against very large amounts of data

Usually from machine-to-machine data

- Called pervasive computing

So, what does this mean for storage?


The Storage Way to Say Big Data

Defined by architectural platform, big data storage is:

‒ Scale-out NAS

‒ Global NameSpace File System

‒ NAS gateway to SAN and Scale-out SAN

Defined by application, big data storage is:

‒ Storage for applications that handle large files and require high performance

‒ Storage for an extremely large number of files

‒ Examples: Media & entertainment, oil & gas exploration, life sciences, etc.


The Analytics Way to Say Big Data

Big data analytics is:

- A term for business intelligence (BI) processes that are different from traditional data warehousing

- The ability to tap unstructured data as a source for BI processes

- Information delivered to users in real time or near real time (though this is not an absolute requirement)

- Convergence of multiple data sources

Latency introduced by storage, including networked storage, is often assiduously avoided


[Figure: batch and low-latency paths for a real-time offer. Batch side: logs, tweets, and location data; HDFS; high-scale data reductions; BI and analytics; NoSQL DB of customer profiles; predictions on buying behavior. Low-latency side: POS; expert system; NoSQL DB. Numbered steps: 1) Identify user. 2a) Look up user profile. 2b) Look up location. 3) Input into data analytics model. 4) Real-time: determine best offer for this user.]
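The numbered steps in the figure can be sketched in a few lines of Python. This is only an illustration of the flow, not anything from the talk: the dictionaries stand in for the low-latency NoSQL lookups and for the output of the batch analytics model, and every name and value in them (`USER_PROFILES`, `USER_LOCATIONS`, `OFFER_MODEL`, `best_offer`, the sample user) is hypothetical.

```python
# Hypothetical stand-ins for the low-latency NoSQL lookups in the figure.
USER_PROFILES = {"u42": {"segment": "coffee"}}
USER_LOCATIONS = {"u42": "downtown"}

# Hypothetical output of the batch side: predictions on buying behavior,
# keyed by (customer segment, location).
OFFER_MODEL = {("coffee", "downtown"): "10% off espresso"}
DEFAULT_OFFER = "no offer"


def best_offer(user_id: str) -> str:
    """Walk steps 1-4 of the figure for an already identified user (step 1)."""
    profile = USER_PROFILES.get(user_id)    # 2a) look up user profile
    location = USER_LOCATIONS.get(user_id)  # 2b) look up location
    if profile is None or location is None:
        return DEFAULT_OFFER
    key = (profile["segment"], location)    # 3) input into the analytics model
    return OFFER_MODEL.get(key, DEFAULT_OFFER)  # 4) determine best offer
```

The point of the sketch is that the request path touches only keyed lookups. The heavy work happens earlier, in batch, which is why storage latency on this path matters so much.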


Why Should Storage Professionals Care?

Distributed computing for analytics (Hadoop, for example) is moving from science experiment to mission-critical

As this happens, data encompassed by these applications becomes the responsibility of people who worry about:

- Security

- Data protection/disaster recovery/business continuance

- Data governance and compliance


Shared Storage for the Traditional Data Warehouse

[Figure: traditional data warehouse on shared storage. Labels: files/XML data, log files, OLTP, operational, Extract, Transform, Load (ETL), data warehouse, reports, dashboards, notifications, archive, schedules, ad hoc queries.]


Distributed, Shared-Nothing Architectures for Big Data Analytics

[Figure: nodes 1 through n, each with its own direct-attached storage (DAS), plus a control node, connected by a network switch. Layers shown: network, compute, storage.]


CAP Theorem

It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

- Consistency (all nodes see the same data at the same time)

- Availability (a guarantee that every request receives a response about whether it was successful or failed)

- Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

A distributed system can satisfy any two of these guarantees at the same time, but not all three
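A toy model makes the trade-off concrete. The sketch below is an assumption-laden illustration, not a real replication protocol: the two-replica `TwoNodeStore` class and its `prefer_consistency` switch are hypothetical, and a rejected write is treated here as the system giving up availability for that request. Once the link between replicas is cut, the store must either reject the write (keeping the replicas consistent) or accept it on one side (staying available but letting the replicas diverge).

```python
class TwoNodeStore:
    """Two replicas of one value; `partitioned` models a cut link between them."""

    def __init__(self, prefer_consistency: bool):
        self.a = None  # value held by replica A
        self.b = None  # value held by replica B
        self.partitioned = False
        self.prefer_consistency = prefer_consistency

    def write(self, value) -> bool:
        """Write via replica A. Returns True if the write was accepted."""
        if not self.partitioned:
            self.a = value
            self.b = value  # replication succeeds while the link is up
            return True
        if self.prefer_consistency:
            return False    # consistent but not available for this write
        self.a = value      # available, but replica B is now stale
        return True

    def consistent(self) -> bool:
        return self.a == self.b


cp = TwoNodeStore(prefer_consistency=True)
cp.write(1)
cp.partitioned = True
cp_accepted = cp.write(2)   # rejected during the partition

ap = TwoNodeStore(prefer_consistency=False)
ap.write(1)
ap.partitioned = True
ap_accepted = ap.write(2)   # accepted; replicas diverge
```

With the link up, both stores accept writes and stay consistent. The choice only appears once a partition occurs.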


Issue for IT

How to store information for big data

- How much data is there?

- Where did this idea come from?

What are the requirements?

Is it from analytics operations?

- Store original data – capture in flight as part of the analytics operation?

- Store as secondary process?

- Don’t save anything, except results?


Shared Storage as Secondary Storage

[Figure: nodes 1 through n and a control node connected by a network switch. Layers shown: network, compute, storage.]

Storage Decisions 2012 | © TechTarget

Is there a place for shared storage in shared-nothing?

If so, what does it look like?


[Figure: nodes 1 through n and a control node connected by a network switch. Layers shown: network, compute, storage. The storage layer is SAN or NAS, but more commonly scale-out NAS.]


Shared Primary/Secondary Storage

Advantages

- Can reduce latency for queries that span nodes

- Enhances system availability

- Addresses the enterprise storage requirements

– Security

– Data protection/disaster recovery/business continuance

– Data governance and compliance

– Digital records management and archiving

Disadvantages

- Additional cost

- Crosses a “cultural” boundary


Big Data Storage for Big Data Analytics

Shared storage as secondary storage for big data analytics

- Data Protection, Database of Record, Archive

- Examples: NetApp and ParAccel, EMC Data Domain/VMAX and Greenplum, RainStor

Shared storage as primary storage for big data analytics

- Examples: Calpont, Red Hat Gluster, IBM GPFS, Nexenta ZFS, Hadoop nodes in Virtual Machines


Is Hadoop a Storage Device?

NO

- It’s a distributed computing platform

YES

- A 1,000-node cluster with 1 TB of RAM per node = 1 PB of very high-performance storage

- Data protection built-in (multiple data copies but not RAID)

- HDFS - Embedded, distributed file system (like scale-out NAS)
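The capacity claim above is simple arithmetic, and the same arithmetic shows what keeping multiple data copies costs. The helper names below (`aggregate_pb`, `usable_pb`) are hypothetical, decimal units are assumed (1,000 TB = 1 PB), and the default of three copies is taken from the HDFS slide that follows.

```python
TB_PER_PB = 1000  # decimal units assumed: 1,000 TB = 1 PB


def aggregate_pb(nodes: int, tb_per_node: float) -> float:
    """Total capacity across the cluster, in PB."""
    return nodes * tb_per_node / TB_PER_PB


def usable_pb(raw_pb: float, copies: int = 3) -> float:
    """Capacity left for unique data when every block is stored `copies` times."""
    return raw_pb / copies


cluster_pb = aggregate_pb(nodes=1000, tb_per_node=1)  # the slide's example
```

For the slide's example, `cluster_pb` is 1.0 PB. If that capacity held three copies of every block, roughly a third of it would be available for unique data.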


HDFS – Hadoop Distributed File System

Very large Distributed File System (DFS)

– 10K nodes, 100 million files, 10 PB

Uses standard servers with direct attached storage

– Files are replicated to handle hardware failure – 3 copies

– Detects failures and recovers from them

Optimized for batch processing

– Data locations exposed so that computations can move to where data resides

– Provides very high aggregate bandwidth

Runs in user space - heterogeneous OS
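Two of the ideas on this slide, replication across servers and moving computation to the data, can be sketched together. This is a simplified illustration and not HDFS code: the round-robin placement is an assumption made for brevity (real HDFS placement is rack-aware), and `place_blocks` and `schedule` are hypothetical names.

```python
def place_blocks(block_ids, nodes, copies=3):
    """Assign each block to `copies` distinct nodes, round-robin (an assumption)."""
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(copies)]
    return placement


def schedule(block, placement, alive):
    """Pick a live node that already holds the block, so the task runs locally."""
    for node in placement[block]:
        if node in alive:
            return node
    return None  # every copy is on a failed node


nodes = ["n1", "n2", "n3", "n4"]
placement = place_blocks(["b0", "b1"], nodes)

# n1 has failed: work on b0 moves to another node that holds a copy.
chosen = schedule("b0", placement, alive={"n2", "n3", "n4"})
```

Because block locations are exposed to the scheduler, losing one node only changes where the task runs. The data is lost only if every node holding a copy fails.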


Hadoop File System on Standard Servers


Typical Hadoop Configuration

[Figure: nodes 1 through n, each with its own direct-attached storage (DAS), plus a control node with DAS, connected by a network switch. Layers shown: network, compute, storage.]


Hadoop Key Milestones

Dec 2004 – Google MapReduce paper published (the GFS paper appeared in 2003)

July 2005 – MapReduce first used

Feb 2006 – Becomes Lucene subproject

Apr 2007 – Yahoo! on 1000-node cluster

Jan 2008 – Apache Top Level Project

May 2009 – Hadoop sorts a Petabyte in 17 hours

Aug 2010 – World’s largest Hadoop cluster at Facebook

- 2900 nodes


Evaluating Hadoop as a Storage Device

Snapshots?

Scale capacity and performance concurrently?

SSD and automated tiering?

Dedupe?

Insert your hot-button storage feature here: __________


IT and Big Data Analytics

There will be big data

Circumstances may vary… and change

Participate early

- Data scientists may not have the same concerns or requirements

- Decisions can limit choices

Understand options

- Products / software


Randy Kerns: [email protected] Twitter: @rgkerns

Blog: http://itknowledgeexchange.techtarget.com/storage-soup/

Thank You! Questions?
