Randy Kerns
Senior Strategist
Evaluator Group
IT and Storage for
Big Data Analytics
Overview
●
“Big data” can mean two
different things
- Storage for large amounts of data
- Analytics against very large amounts of data
●
Usually from
machine-to-machine data
- Called pervasive computing
●
So, what does this mean for
storage?
The Storage Way to Say Big Data
●
Defined by architectural platform, big data storage is:
‒ Scale-out NAS
‒ Global NameSpace File System
‒ NAS gateway to SAN and Scale-out SAN
●
Defined by application, big data storage is:
‒ Storage for applications that handle large files and require performance
‒ Storage for an extremely large number of files
‒ Examples: Media & entertainment, oil & gas exploration, life sciences, etc.
The Analytics Way to Say Big Data
●
Big data analytics is:
- A term for business intelligence (BI) processes that are different from traditional data warehousing
- The ability to tap unstructured data as a source for BI processes
- Information delivered to users in real time or near real time (though this is not an absolute requirement)
- Convergence of multiple data sources
●
Latency introduced by storage, including networked
storage, is often assiduously avoided
[Figure: real-time offer pipeline. Labels: Logs, Tweets, Location, Customer Profiles, POS, HDFS, NoSQL DB (two instances), High Scale Data Reductions, BI and Analytics, Expert System, Predictions on Buying Behavior, Batch, Low Latency. Steps: 1) Identify User; 2a) Lookup User Profile; 2b) Lookup Location; 3) Input Into Data Analytics Model; 4) Real-time: Determine Best Offer For This User.]
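The numbered steps in the figure can be read as a small lookup-then-score pipeline. The sketch below is illustrative only: the in-memory dictionaries stand in for the NoSQL stores, and `PROFILES`, `LOCATIONS`, `OFFERS`, `score`, and `best_offer` are hypothetical names, not part of any product shown on the slide. The scoring rule is a placeholder for whatever analytics model is actually used.

```python
# Hypothetical stand-ins for the low-latency NoSQL stores in the figure.
PROFILES = {"u42": {"likes": {"coffee", "books"}}}
LOCATIONS = {"u42": "downtown"}
OFFERS = [
    {"name": "coffee coupon", "category": "coffee", "area": "downtown"},
    {"name": "ski rental", "category": "sports", "area": "mountain"},
]


def score(profile, location, offer):
    """Placeholder model: one point per matching attribute."""
    return (offer["category"] in profile["likes"]) + (offer["area"] == location)


def best_offer(session):
    user_id = session["user_id"]       # 1) identify user
    profile = PROFILES[user_id]        # 2a) look up user profile
    location = LOCATIONS[user_id]      # 2b) look up location
    # 3) feed profile and location into the (placeholder) analytics model
    scored = [(score(profile, location, o), o["name"]) for o in OFFERS]
    # 4) real time: pick the highest-scoring offer for this user
    return max(scored)[1]
```

In a real deployment the batch side (HDFS, large-scale data reduction) would presumably refresh the profiles and the model offline, while only steps 1 through 4 sit on the low-latency path.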
Why Should Storage Professionals Care?
●
Distributed computing for analytics (Hadoop, for example)
is moving from science experiment to mission-critical
●
As this happens, data encompassed by these
applications becomes the responsibility of people who
worry about:
- Security
- Data protection/disaster recovery/business continuance
- Data governance and compliance
Shared Storage for the Traditional
Data Warehouse
[Figure: traditional data warehouse. Labels: Files/XML data, Log Files, OLTP, Operational, Extract, Transform, Load (ETL), Data Warehouse, Schedules, Ad hoc Queries, Reports, Dashboards, Notifications, Archive.]
Distributed, Shared-Nothing Architectures for
Big Data Analytics
[Figure: Node 1 through Node n, each with its own DAS, plus a Control node, connected by a network switch. Labels: Network Layer, Compute Layer, Storage Layer.]
CAP Theorem
●
It is impossible for a distributed computer system to
simultaneously provide all three of the following guarantees:
- Consistency (all nodes see the same data at the same time)
- Availability (a guarantee that every request receives a response about whether it was successful or failed)
- Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
●
A distributed system can satisfy any two of these
guarantees at the same time, but not all three
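The trade-off is easiest to see when a network partition actually occurs. The toy below is a minimal sketch, not a real replication protocol: `TwoNodeStore`, its `mode` values `"CP"` and `"AP"`, and `demo` are hypothetical names, and for the purposes of this sketch a refused write is treated as a loss of availability.

```python
class Replica:
    """One copy of a tiny key-value store."""

    def __init__(self):
        self.data = {}


class TwoNodeStore:
    """Toy two-replica store; mode is 'CP' or 'AP' (illustrative labels)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False

    def write(self, replica, key, value):
        """Write through one replica; return True if the write was accepted."""
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent: refuse the write, giving up availability.
                return False
            # Stay available: accept locally, so the replicas may diverge.
            replica.data[key] = value
            return True
        # No partition: replicate synchronously to both copies.
        replica.data[key] = value
        other.data[key] = value
        return True

    def consistent(self, key):
        return self.a.data.get(key) == self.b.data.get(key)


def demo(mode):
    store = TwoNodeStore(mode)
    store.write(store.a, "x", 1)
    store.partitioned = True          # the network splits
    accepted = store.write(store.a, "x", 2)
    return accepted, store.consistent("x")
```

Under these assumptions, `demo("CP")` rejects the second write and the replicas still agree, while `demo("AP")` accepts it and the replicas disagree until the partition heals and some reconciliation step runs.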
Issue for IT
●
How to store information for big data
- How much data is there?
- Where did this idea come from?
●
What are the requirements?
●
Is it from analytics operations?
- Store original data – capture in flight as part of the analytics
operation?
- Store as secondary process?
- Don’t save anything, except results?
Shared Storage as Secondary Storage
[Figure: Node 1 through Node n plus a Control node, connected by a network switch. Labels: Network Layer, Compute Layer, Storage Layer.]
Storage Decisions 2012 | © TechTarget
●
Is there a place for shared storage in shared-nothing?
If so, what does it look like?
Shared Primary/Secondary Storage
[Figure: Node 1 through Node n plus a Control node, connected by a network switch. Labels: Network Layer, Compute Layer, Storage Layer. Storage layer: SAN or NAS, but more commonly scale-out NAS.]
●
Advantages
- Can reduce latency for queries that span nodes
- Enhances system availability
- Addresses the enterprise storage requirements:
    - Security
    - Data protection/disaster recovery/business continuance
    - Data governance and compliance
    - Digital records management and archiving
●
Disadvantages
- Additional cost
- Crosses a “cultural” boundary
Big Data Storage for Big Data Analytics
●
Shared storage as secondary storage for big data
analytics
- Data Protection, Database of Record, Archive
- Examples: NetApp and ParAccel, EMC Data Domain/VMAX and Greenplum, RainStor
●
Shared storage as primary storage for big data
analytics
- Examples: Calpont, Red Hat Gluster, IBM GPFS, Nexenta ZFS, Hadoop nodes in Virtual Machines
Is Hadoop a Storage Device?
●
NO
- It’s a distributed computing platform
●
YES
- 1K node cluster w/ 1TB RAM per node = 1PB of very high performance storage
- Data protection built-in (multiple data copies but not RAID)
- HDFS - Embedded, distributed file system (like scale-out NAS)
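The capacity figure above is simple multiplication, and the built-in data copies reduce how much unique data the cluster can hold. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 1,000 TB); the function name `cluster_capacity_tb` and the `replication_factor` parameter are illustrative, and the value 3 is the copy count quoted for HDFS elsewhere in this deck:

```python
def cluster_capacity_tb(nodes, tb_per_node, replication_factor=1):
    """Return (raw_tb, unique_tb) for a cluster.

    raw_tb is total capacity across all nodes; unique_tb is what remains
    for distinct data when every block is stored replication_factor times.
    """
    raw_tb = nodes * tb_per_node
    return raw_tb, raw_tb / replication_factor


# The slide's example: 1K nodes x 1 TB each = 1,000 TB (1 PB) raw.
raw, unique = cluster_capacity_tb(1000, 1)

# With three copies of every block, roughly a third of that is unique data.
raw3, unique3 = cluster_capacity_tb(1000, 1, replication_factor=3)
```

This ignores file system overhead and space reserved for intermediate results, so real usable capacity would be lower still.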
HDFS – Hadoop Distributed File System
●
Very large Distributed File System (DFS)
– 10K nodes, 100 million files, 10 PB
●
Uses standard servers with direct attached storage
– Files are replicated to handle hardware failure – 3
copies
– Detects failures and recovers from them
●
Optimized for batch processing
– Data locations exposed so that computations can move
to where data resides
– Provides very high aggregate bandwidth
●
Runs in user space - heterogeneous OS
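The point about exposing data locations can be illustrated with a toy scheduler that prefers to run each task on a node already holding a copy of the block. This is a simplified sketch, not Hadoop's actual scheduling logic; `assign_tasks`, `block_locations`, and `free_slots` are hypothetical names.

```python
def assign_tasks(block_locations, free_slots):
    """Greedily assign one task per block, preferring data-local nodes.

    block_locations: dict of block id -> list of nodes holding a replica
    free_slots: dict of node -> number of free task slots
    Returns a dict of block id -> (node, is_data_local).
    """
    slots = dict(free_slots)  # work on a copy; leave the caller's dict intact
    assignment = {}
    for block, replicas in block_locations.items():
        # First choice: a node that already stores a replica and has a free slot.
        local = [n for n in replicas if slots.get(n, 0) > 0]
        # Fallback: any node with a free slot (the block must cross the network).
        candidates = local or [n for n, s in slots.items() if s > 0]
        if not candidates:
            raise RuntimeError("no free task slots left")
        node = candidates[0]
        slots[node] -= 1
        assignment[block] = (node, node in replicas)
    return assignment


# Three blocks, three replicas each; n3 and n4 are busy, n5 holds no data.
blocks = {
    "b1": ["n1", "n2", "n3"],
    "b2": ["n1", "n2", "n4"],
    "b3": ["n1", "n3", "n4"],
}
slots = {"n1": 1, "n2": 1, "n3": 0, "n4": 0, "n5": 1}
plan = assign_tasks(blocks, slots)
```

In this example the first two tasks run where their data lives, and only the third has to read its block over the network, which is the effect the three-copy layout is meant to make common.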
Hadoop File System on Standard Servers
[Figure: Typical Hadoop Configuration. Node 1 through Node n, each with its own DAS, plus a Control node, connected by a network switch. Labels: Network Layer, Compute Layer, Storage Layer.]
Hadoop Key Milestones
●
Dec 2004 – Google MapReduce paper published
●
July 2005 – MapReduce first used
●
Feb 2006 – Becomes Lucene subproject
●
Apr 2007 – Yahoo! on 1000-node cluster
●
Jan 2008 – Apache Top Level Project
●
May 2009 – Hadoop sorts a Petabyte in 17 hours
●
Aug 2010 – World’s largest Hadoop cluster at Facebook
- 2900 nodes
Evaluating Hadoop as a Storage Device
●
Snapshots?
●
Scale capacity and performance concurrently?
●
SSD and automated tiering?
●
Dedupe?
●
Insert your hot-button storage feature here: __________
IT and Big Data Analytics
●
There will be big data
●
Circumstances may vary… and change
●
Participate early
- Data scientists may not have the same concerns or requirements
- Decisions can limit choices
●
Understand options
- Products / software
Randy Kerns: [email protected] Twitter: @rgkerns
Blog: http://itknowledgeexchange.techtarget.com/storage-soup/