IT and Storage for Big Data Analytics

(1)

Randy Kerns

Senior Strategist

Evaluator Group


(2)

Overview

● “Big data” can mean two different things

- Storage for large amounts of data

- Analytics against very large amounts of data

● The data usually comes from machine-to-machine sources

- Called pervasive computing

● So, what does this mean for storage?

(3)

(4)

The Storage Way to Say Big Data

● Defined by architectural platform, big data storage is:

‒ Scale-out NAS

‒ Global NameSpace File System

‒ NAS gateway to SAN and Scale-out SAN

● Defined by application, big data storage is:

‒ Storage for applications that handle large files and require high performance

‒ Storage for an extremely large number of files

‒ Examples: Media & entertainment, oil & gas exploration, life sciences, etc.

(5)

The Analytics Way to Say Big Data

● Big data analytics is:

- A term for business intelligence (BI) processes that are different from traditional data warehousing

- The ability to tap unstructured data as a source for BI processes

- Information delivered to users in real time or near real time (though this is not an absolute requirement)

- Convergence of multiple data sources

● Latency introduced by storage, including networked storage, is often assiduously avoided

(6)

[Figure: big data analytics data flow. Inputs: logs, tweets, location, customer profiles, POS. Components: HDFS, NoSQL DB, high-scale data reductions, BI and analytics, expert system (predictions on buying behavior). Paths: batch and low latency. Steps: 1) identify user; 2a) look up user profile; 2b) look up location; 3) input into data analytics model; 4) real time: determine best offer for this user.]
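The four numbered steps in the figure can be sketched in a few lines of Python. This is only an illustration: the dictionaries stand in for the profile store, the location lookup, and the output of the batch analytics model, and every name and value in them (`PROFILES`, `LOCATIONS`, `MODEL`, the user id, the offers) is hypothetical rather than taken from the presentation.

```python
# Hypothetical stand-ins for the stores shown in the figure.
PROFILES = {"u42": {"segment": "runner"}}   # customer profiles (e.g. a NoSQL DB)
LOCATIONS = {"u42": "downtown"}             # latest known location per user
# Scores assumed to be precomputed by the batch analytics model.
MODEL = {("runner", "downtown"): {"shoes-10%": 0.8, "coffee-free": 0.3}}


def identify_user(event):
    """Step 1: identify the user behind an incoming event."""
    return event["user_id"]


def lookup_profile(user_id):
    """Step 2a: look up the user profile (empty if unknown)."""
    return PROFILES.get(user_id, {})


def lookup_location(user_id):
    """Step 2b: look up the user's location (None if unknown)."""
    return LOCATIONS.get(user_id)


def score_offers(profile, location):
    """Step 3: feed profile and location into the analytics model."""
    return MODEL.get((profile.get("segment"), location), {})


def best_offer(event):
    """Step 4: pick the highest-scoring offer in real time, if any."""
    user_id = identify_user(event)
    scores = score_offers(lookup_profile(user_id), lookup_location(user_id))
    return max(scores, key=scores.get) if scores else None


print(best_offer({"user_id": "u42"}))
```

The point of the sketch is that steps 1, 2a, 2b, and 4 are simple lookups on the low-latency path, while the model consulted in step 3 is built separately on the batch path.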

(7)

Why Should Storage Professionals Care?

● Distributed computing for analytics (Hadoop, for example) is moving from science experiment to mission-critical

● As this happens, data encompassed by these applications becomes the responsibility of people who worry about:

- Security

- Data protection/disaster recovery/business continuance

- Data governance and compliance

(8)

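Shared Storage for the Traditional Data Warehouse

[Figure: traditional data warehouse on shared storage. Sources: files/XML data, log files, OLTP. Processing: Extract, Transform, Load (ETL) into an operational data warehouse. Outputs: reports, dashboards, notifications, archive. Access: schedules and ad hoc queries.]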

(9)

Distributed, Shared-Nothing Architectures for Big Data Analytics

[Figure: a control node and nodes 1 to n, shown as a network layer, a compute layer, and a storage layer; each node has its own direct-attached storage (DAS).]

(10)

CAP Theorem

● It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

- Consistency (all nodes see the same data at the same time)

- Availability (a guarantee that every request receives a response about whether it was successful or failed)

- Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

● A distributed system can satisfy any two of these guarantees at the same time, but not all three
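The trade-off can be illustrated with a toy two-replica key-value store. This is a minimal sketch, not a real database: the class `TwoNodeStore`, its `"CP"` and `"AP"` modes, and the `partitioned` flag are all hypothetical names introduced for illustration. During a partition, the CP mode refuses writes to keep the replicas identical, while the AP mode accepts them and lets the replicas diverge.

```python
class Unavailable(Exception):
    """Raised when a request is refused to protect consistency."""


class TwoNodeStore:
    """Toy key-value store with two replicas, 'a' and 'b'."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = {"a": {}, "b": {}}
        self.partitioned = False  # True when a and b cannot talk to each other

    def write(self, via, key, value):
        if not self.partitioned:
            # Healthy network: update both replicas together.
            for data in self.replicas.values():
                data[key] = value
        elif self.mode == "CP":
            # Give up availability: refuse the write rather than diverge.
            raise Unavailable("write refused during partition")
        else:
            # Give up consistency: only the reachable replica is updated.
            self.replicas[via][key] = value

    def read(self, via, key):
        # Reads are served from the replica the client reached.
        return self.replicas[via].get(key)


cp = TwoNodeStore("CP")
cp.write("a", "k", 1)
cp.partitioned = True
try:
    cp.write("a", "k", 2)
except Unavailable as err:
    print("CP:", err)

ap = TwoNodeStore("AP")
ap.write("a", "k", 1)
ap.partitioned = True
ap.write("a", "k", 2)
print("AP:", ap.read("a", "k"), ap.read("b", "k"))
```

In the CP run both replicas still agree after the partition, at the cost of a failed write; in the AP run the write succeeds but the two replicas return different values until they are reconciled.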

(11)

Issue for IT

● How to store information for big data

- How much data is there?

- Where did this idea come from?

● What are the requirements?

● Is it from analytics operations?

- Store original data – capture in flight as part of the analytics operation?

- Store as secondary process?

- Don’t save anything, except results?

(12)

Shared Storage as Secondary Storage

● Is there a place for shared storage in shared-nothing? If so, what does it look like?

[Figure: shared-nothing cluster with a control node and nodes 1 to n, shown as a network layer, a compute layer, and a storage layer.]

Storage Decisions 2012 | © TechTarget

(13)

[Figure: cluster with a control node and nodes 1 to n, shown as a network layer, a compute layer, and a storage layer. The storage layer is SAN or NAS, but more commonly scale-out NAS.]

(14)

Shared Primary/Secondary Storage

● Advantages

- Can reduce latency for queries that span nodes

- Enhances system availability

- Addresses the enterprise storage requirements

  ‒ Security

  ‒ Data protection/disaster recovery/business continuance

  ‒ Data governance and compliance

  ‒ Digital records management and archiving

● Disadvantages

- Additional cost

- Crosses a “cultural” boundary

(15)

(16)

Big Data Storage for Big Data Analytics

● Shared storage as secondary storage for big data analytics

- Data Protection, Database of Record, Archive

- Examples: NetApp and ParAccel, EMC Data Domain/VMAX and Greenplum, RainStor

● Shared storage as primary storage for big data analytics

- Examples: Calpont, Red Hat Gluster, IBM GPFS, Nexenta ZFS, Hadoop nodes in Virtual Machines

(17)

Is Hadoop a Storage Device?

● NO

- It’s a distributed computing platform

● YES

- A 1,000-node cluster with 1 TB of RAM per node = 1 PB of very high-performance storage

- Data protection built-in (multiple data copies but not RAID)

- HDFS - Embedded, distributed file system (like scale-out NAS)
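The capacity figure in the YES case is plain multiplication, and it helps to see how keeping multiple copies cuts into it. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes); the helper names are hypothetical, and the three-copy value is an assumption based on the HDFS default, used here only to show the effect of replication on unique capacity.

```python
TB = 10**12  # decimal terabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes


def aggregate_bytes(nodes, bytes_per_node):
    """Raw capacity of a cluster: nodes times capacity per node."""
    return nodes * bytes_per_node


def unique_bytes(raw_bytes, copies):
    """Unique data that fits when every byte is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return raw_bytes // copies


raw = aggregate_bytes(1000, 1 * TB)
print(raw / PB)                   # 1.0 (the 1 PB on the slide)
print(unique_bytes(raw, 3) / TB)  # roughly a third of the raw capacity, in TB
```

The same arithmetic applies to any tier of a shared-nothing cluster: protection by copying trades raw capacity for resilience instead of relying on RAID.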

(18)

HDFS – Hadoop Distributed File System

● Very large Distributed File System (DFS)

– 10K nodes, 100 million files, 10 PB

● Uses standard servers with direct attached storage

– Files are replicated to handle hardware failure – 3 copies

– Detects failures and recovers from them

● Optimized for batch processing

– Data locations exposed so that computations can move to where data resides

– Provides very high aggregate bandwidth

● Runs in user space on heterogeneous operating systems
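Two of the ideas above, three copies of each block and moving computation to where the data resides, can be sketched as follows. This is a simplified illustration and not the HDFS implementation: placement here is plain round-robin (real HDFS placement is rack-aware), and the function names, node names, and block names are all hypothetical.

```python
from collections import defaultdict


def place_replicas(block_ids, nodes, copies=3):
    """Assign each block to `copies` distinct nodes, round-robin."""
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(copies)]
    return placement


def schedule_local(placement):
    """Run each block's task on the least-loaded node that holds a copy."""
    load = defaultdict(int)
    assignment = {}
    for block, holders in placement.items():
        node = min(holders, key=lambda n: load[n])
        assignment[block] = node
        load[node] += 1
    return assignment


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one copy on a surviving node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


nodes = ["n1", "n2", "n3", "n4", "n5"]
blocks = [f"b{i}" for i in range(10)]
placement = place_replicas(blocks, nodes)
assignment = schedule_local(placement)
print(assignment["b0"], placement["b0"])
print(len(readable_blocks(placement, {"n1", "n2"})))
```

Because every task lands on a node that already holds its block, no block crosses the network to be processed, and with three copies on distinct nodes any two node failures still leave every block readable.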

(19)

Hadoop File System on Standard Servers

(20)

[Figure: a control node and nodes 1 to n, each with direct-attached storage (DAS), shown as a network layer, a compute layer, and a storage layer.]