• No results found

Big Data Trends to Watch

N/A
N/A
Protected

Academic year: 2021

Share "Big Data Trends to Watch"

Copied!
47
0
0

Loading.... (view fulltext now)

Full text

(1)

Dale Wickizer

Chief Technology Officer

U. S. Public Sector, NetApp, Inc.

Big Data – Trends

to Watch

Digital Government Institute’s Government Big Data Conference October 11, 2012

(2)

What is “Big Data”?

Complexity

Volume Speed

“Big Data” refers to datasets whose volume, speed and

complexity is beyond the ability of typical tools to capture,

store, manage and analyze.

“Big Data” refers to datasets whose volume, speed and

complexity is beyond the ability of typical tools to capture,

store, manage and analyze.

Coined by Francis

Diebold, professor of

economics at the

University of PA in 2000, when “Big” meant

(3)

Quantifying The Big Data Challenge

Estimated size of the digital universe in 2020

60 Zettabytes

5 Billion

smart phones

30 Billion

pieces of new content to Facebook per month

Sensors Video Music Location Weblogs

Growth Over the Next Decade:

Servers (Phys/VM): 10x Data/Information: 50x #Files: 75x

IT Professionals: <1.5x

(4)

A Familiar Example: The 2012 Olympics

(5)

The Big Data Push

IT Infra Tier 1 BP OLTP DSS/DW Collaboration Tier-2 BP OLTP App Dev Web Infra No SQL Columnar DBs

Performance Small Block,

(6)

What Does This Mean to You?

B u s in e s s o r M is s io n V e lo c it y Inflection Point Information Becomes

a Propellant to the Organzation

Data Becomes a

Burden to IT Infrastructure

2010 2020

(7)

Dispelling the

(8)

Big Data Is NOT New

(9)

Big Data = Big Analytics = Hadoop?

 That’s What The Media Hype Implies, but it is NOT true!

 Traditional analytics (BI/DSS/DW) dominates the analytics market

 Like other technologies vying to gain broad adoption in Enterprise IT (e.g., Traditional Analytics, HPC & Cloud), it shows promise

(10)

Worldwide Market Numbers

2011 2012 2013 2014 2015 2016 CAGR Comments

Hadoop 0.077 0.123 0.198 0.317 0.507 0.812 60% Source: IDC -http://bit.ly/S97FcP

All Analytics 31.8 34.9 38.3 42.1 46.2 50.7 9.8% Source: IDC -http://bit.ly/S97FcP

HPC 27.5 29.4 31.4 33.6 36.0 38.5 7.0% Source: Intersect360 -http://bit.ly/S9bQWf

Public Cloud

Services 17.0 22.5 29.4 36.6 45.0 52.9 25.5% w/o BPaaS

.

Source: Gartner -http://bit.ly/S9HpPK

Public Cloud

Services 91 109 130 154 180 207 17.9% w/ BPaaS

.

Source: Gartner -http://bit.ly/S9LLGz

Overall IT 3500 3640 3786 3937 4095 4258 4.0% Source: Gartner -http://bit.ly/S9LLGz In Billions US$

(11)

Big Data Solution Portfolio

Insight from extremely large datasets

Performance for data intensive workloads Secure boundless

data storage

(12)
(13)
(14)

What Is Hadoop?



Two key components

– HDFS

 Hadoop Distributed

File System (64+ MB Blocks)

– MapReduce

 Programming model for processing and generating large datasets

(15)
(16)

Analytics & Enterprise Apps Environment

Sensors Applications Logs Location/GPS Mobile Devices Storage

(All other storage, i.e. internal DAS)

Content Repositories

Shared Storage Infrastructure

(17)

CHALLENGE NETAPP SOLUTION BENEFITS

4 weeks to run a query on

24 billion unstructured records

Impossible to run a query:

240 billion unstructured records

10 Node Hadoop

Cluster

Time reduced from

4 weeks to 10.5

hours

Previously impossible, now achievable in

AutoSupport: Hadoop Use Case at NetApp

 “Call-home” service for all NetApp® systems

 Foundation of NetApp proactive support strategies  Machine-generated data doubles every 16 months

10-node Hadoop Cluster w/

(18)

Myth #1: Hadoop Is “Cheap”



Runs on a collection of “cheap, commodity

servers”, in a distributed, shared nothing

architecture (no RAID, no HA)



Eliminates the need for “expensive”

infrastructure

(19)

Reality: Processor Core Trends

Hadoop Guidance Today:

1-2 disks per core

Growth in CPU cores makes it increasingly

difficult to call these servers “cheap”

(20)

Myth#2: Replicas Make Things Faster



By doing the “triplicas”, not only does this

provide resiliency, but jobs process faster,

because:

– Data is moved closer to processor that needs it

– This avoids the network latency of pulling data

(21)

Reality: Triplicas Slow Things Down

Network Overhead Useful Work

Availability and Resiliency

Burst Handling and Queuing

Oversubscription Ratio

Data Node Network Speed

(22)

Ethernet’s Relentless March

(23)
(24)

Faster Job Completion During Disk Failure



Compared to a Hadoop cluster with the same number

of nodes, but using only internal disk (no RAID) and

triple copies, the NetApp Open Solution for Hadoop

shows only a slight performance decrease during a

disk failure with less impact to job completion.

(25)

Myth #3: Hadoop Brings A Lot of Value



It can process ALL types of data very fast,

which traditional tools don’t handle very well



It is an Open Source suite of tools (no lock-in)



It helps bring “high performance computing” to

(26)

Reality: Too Risky For Many CIOs

Things Enterprise operations is NOT going to like about Hadoop:

•Part of a risky emerging market

•Requires highly specialized skills to “glue it” to analytics tools •Analytics folks are typically a subculture outside of enterprise IT •Low infrastructure efficiencies (33% or worse) – Not very Cloudy •Expensive, physical servers – Not cheap, commodity as promised •70% of network band bandwidth chewed up by replicas

•15%-20% of the nodes down due to disk failures

In the Enterprise, it’s all about ROI, TTM and Risk!

Shift of Federal Spending to Cloud

(27)

Analytics of Tomorrow



Traditional & Big Analytics side-by-side for years to come



Hadoop moves to shared, virtualized infrastructure, for

better efficiency and ease of management, either:

– Logically distributed, shared nothing on physically shared everything, or

– Same as above, except Hadoop becomes logically shared everything, as HDFS is replaced by a parallel file system (e.g., Lustre Cluster, StorNext or GPFS)



Enterprise class resiliency (no SPoF) and reliability with

HPC-like performance (no need for triplicas)



Use of a single copy of data for the map phase (higher

storage utilization)

(28)
(29)

Big Bandwidth Use Cases

Full Motion Video/ISR

Scalable density and performance to ingest and simultaneously

analyze UAV, satellite and other data

Video Surveillance

High bandwidth & density supporting

hundreds or thousands of HD cameras

Media Content Management

(30)

Bandwidth Driven by HPC



THE driver for innovations in bandwidth are clearly high

performance computing (HPC) applications.



HPC falls into two camps:

– High Performance Technical Computing (HPTC),

 Supports scientific and engineering modeling and simulation.

 HPTC makes up about 70% of the HPC market

– High Performance Business Computing (HPBC)

(31)
(32)

State-of-the-Science in HPC Today

 16 Max / 20 Peak PetaFLOPS (20 x 1015)

 1.6 Million Cores

 1.6 PB main memory

 > 1 Terabyte/sec using:

– 55 PB of usable NetApp storage – Lustre Cluster Parallel File System – QDR (40 Gigabit) Infiniband

 ~8 MW of Power

 Simulations for

– Nuclear Weapons Viability – Counter Terrorism

– Energy Security – Climate Change

LLNL Sequoia - designed to be the fastest

supercomputer and storage combination on the planet

Now #1 of the Top 500 HPC Systems in the world.

(33)

HPC of Tomorrow



Labs are already hard at work architecting the next

system, pushing technologies to their limits



Next stop: Exascale (10

18

FLOPS) in 2016-2018



This will require 50 – 100X that of Sequoia

– 80-160 million CPU cores

– 80-160 PB’s of main memory

– 266 – 533 TB/sec of sequential I/O from main

memory during 5 minute checkpoint



> 50 MW of power



Existing storage technologies won’t scale within

(34)
(35)

Big Content Market Difficult to Quantify



The content repository market is very difficult

to quantify:

– It crosses application areas (e.g., business

applications, HPC and analytics applications)

and may be counted in those market numbers

– A sizable amount of it is tape, but is sometimes

confused with backup/recovery

(36)

Experience Managing Data at Scale

100 Customers

50 Customers

10 Customers

4 Customers

100 PB

50 PB 20 PB 10 PB

NetApp’s Largest Customer

1000

(37)

1000 PB – More to the Story…



> 20 Data Centers



> 750 Million Customers



> 48,000 Emails / Second



> 1 Million Disks

(38)

Trends Shaping the Big Content Market



Data at scale

– Objects replace files to overcome filesystem limitations

– Containers 10’s PB in size, 10’s billions of objects, Millions of users

– “Life time retention” of enterprise and consumer data



Policy driven infrastructure - “Set it and forget it”

– Object/VM granular automated data mobility/management



Simpler, efficient, self-managing systems; low price

– Systems with lower performance & resiliency for less $/GB

– Space, power, density @ lower infrastructure cost (~2 PB/rack)



Centralized



distributed, cloud-friendly “system”

(39)

StorageGRID

Distributed, Object-Based Repository

CIFS NFS HTTP CIFS NFS HTTP CIFS NFS HTTP CIFS NFS HTTP MULTIPLE: APPLICATIONS + SITES + LOCAL PROTOCOLS

MULTIPLE: APPLICATIONS + SITES + LOCAL PROTOCOLS

MULTIPLE: TENANTS + POLICIES + ADMINISTRATORS MULTIPLE: TENANTS + POLICIES + ADMINISTRATORS

Site 1 Site 2 Site 3 … Site N

APPLICATIONS APPLICATIONS APPLICATIONS APPLICATION

High Density Storage Tape

(40)

DCR of Tomorrow



Federated CDMI (Cloud to Cloud)



Strong ecosystem of ISVs supporting CDMI



Exabyte size containers



Trillions of objects



Rack densities > 5 PB/rack



Power densities < 1 kW / PB

(41)

Summary

 Despite the hype, Big Data is not new and is more than just

analytics! (Many agencies and private companies have

struggled with Big Data for decades)

 Analytics: Traditional BI/DSS analytics still dominate.

Importance of newer NoSQL & Columnar DB applications, enabled by MapReduce will grow with the growth of multi-structured data

 Data bandwidth will be driven by HPC Exascale initiatives

 Continued data growth that outstrips network bandwidth growth and file system scalability will drive low cost, geographically

distributed, object-based, Content Repositories with federated access through CDMI

 Big Data applications, such as Hadoop, will need to adopt

(42)

So What? What Can You Do Now?



Realize that becoming “Big Data Capable” will be an

iterative, cyclical process (not revolutionary)



Don’t focus on technology; focus on a burning business

or mission need

– “Fit for Purpose” not “Field of Dreams” approach



Target Big Data investments at operational gaps in

your existing analytics capabilities (this implies you

know what those gaps are)



If you don’t have the skillsets, collaborate or grow them



Once successful, Identify adjacent use cases, rinse

and repeat.

(43)

Dale Wickizer

CTO, U.S. Public Sector NetApp, Inc.

[email protected]

(44)
(45)
(46)
(47)

References

Related documents

Important factors that could cause or contribute to such differences include risks relating to: our ability to develop and commercialize additional pharmaceutical

Specific objectives are to study (1) thermal degradation and the effect of physical aging on thermal-mechanical properties of PLA/starch composites; (2) the influence of starch

One of the reasons for the lower-than-expected performance of this market segment was a larger-than-expected shift in revenue to query, reporting, and analysis tools as well as

The estimated high frequency sub-bands (edges) are enhanced by introducing an intermediate stage using a stationary Wavelet Transform (SWT), and then all these sub-bands and

Scrooge said to the Ghost, 'Oh, please tell me who that dead man was!' The Ghost took him near his office, but it didn't stop. 'Wait!'

To answer our initial question (concerning the probability of event occurrence and a global warming influence), regardless of how much faith is placed in the exact quantitative

That, subject to the passing of Resolution 9 to be proposed at the Annual General Meeting of the Company convened for 21 September 2010 (‘‘Resolution 9’’), the Directors of

Resolution 11 – is to empower the directors to offer and grant awards pursuant to the Sembcorp Industries Performance Share Plan 2010 and the Sembcorp Industries Restricted Share