Data Centric Computing Revisited

(1)


SPXXL/SCICOMP – Summer 2013

Piyush Chaudhary

Technical Computing Solutions

(2)

Bottom line: It is a time of Powerful Information

Dimensions of data growth

• Volume: terabytes to exabytes of existing data to process

• Variety: structured, unstructured, text, multimedia

• Velocity: streaming data, milliseconds to seconds to respond

• Veracity: uncertainty from inconsistency, ambiguities, etc.

Data volume is on the rise

[Chart: data volume through 2015 by source (sensors & devices, VoIP, enterprise data, social media); vertical axis ticks from 3000 to 9000]

Big Data and High Performance Computing are driving systems requirements:

Move the Compute to the Data!

(3)

Directly integrating Reactive and Deep Analytics enables feedback-driven insight optimization

[Figure: Data Scale (Kilo, Mega, Giga, Tera, Peta, Exa) plotted against Decision Frequency (occasional, frequent, real-time; years down to milliseconds). Labeled regions: Traditional Data Warehouse and Business Intelligence; Deep Analytics (High Performance Computing on large data sets, creating a world model or context; history, deep predictions, hypotheses); Reactive Analytics (High Performance Computing on large streams of data, analyzing real time against the world model or context; reality, fast observations, actions). Integration and Feedback connect Deep and Reactive Analytics.]

Maximum Insight Requires Combining Deep and Reactive Analytics

(4)

2020: The Context-Centric Future

• Trillions of data sources

• Billions of agents & user applications

• Millions of analytics

• Exabytes of context

Data types: streaming data, text data, multi-dimensional, time series, geospatial, relational, social network, video & image, etc.

Massive parallelism, storage density, high-bandwidth, low-latency networks and other data-centric principles must be fundamental to the ultimate solution architecture.

(5)

What is Driving the Explosive Growth of Big Data?

• Compute processing is becoming very cheap, allowing us to instrument everything

– More sensors (more sources of data)

– Increased resolution in sensor data (bigger data)

– Cheaper storage (saving more data)

• An increasingly networked world allows us to gather data quickly and cheaply

– Data can be centralized easily and can be acted on more effectively

• Mobile computing allows for newer ways to collect data

– Smartphones are equipped with a variety of sensors and can continuously collect data

• Growth in social media is driving more sharing of data


(6)

Big Data Workloads and Their Evolution

• Genomics

– The Human Genome Project took over 10 years to complete and cost over $3 billion

– Next-generation sequencers can sequence a genome in a few days for about $1000 and generate a terabyte of data. That means that big genomic centers can produce petabytes of data every month

• Smart Utilities

– Many electric utility companies are wiring their customers with smart meters

– These smart meters generate 100,000 data points per month per customer

– Utility companies need to analyze all this data for capacity planning, pricing and future investment

• Oil and Gas

– Seismic exploration data is growing so fast that it has to be stored primarily on tape

– It is migrated to disk-based storage before it can be operated on, and then deleted

• Financial Services

– Algorithmic trading and the requirement to react quickly to changes in the market are driving the need for low-latency access to data

• Telecommunications

– Mobile phones generate many call detail records (CDRs) for each call, text or data usage

– Telecom providers must analyze billions of CDRs a day to improve quality, deliver services and make investment decisions

• Real Time Traffic Management

– Uses a mixture of real-time sensors and historical data to reduce congestion, increase capacity and reduce emissions
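The volumes quoted for smart meters and sequencers are easier to judge as rates. The sketch below only redoes the arithmetic from the figures on this slide; the 30-day month, the decimal units (1 PB = 1000 TB) and the one-million-customer utility are assumptions introduced for illustration, not numbers from the talk.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def seconds_between_readings(points_per_month: int) -> float:
    """Average interval between data points for a single smart meter."""
    return SECONDS_PER_MONTH / points_per_month


def fleet_points_per_month(customers: int, points_per_month: int = 100_000) -> int:
    """Total data points per month for a utility with the given customer count."""
    return customers * points_per_month


def genomes_per_month(petabytes: float, tb_per_genome: float = 1.0) -> float:
    """Genomes needed to produce the given monthly volume (1 PB = 1000 TB)."""
    return petabytes * 1000.0 / tb_per_genome


if __name__ == "__main__":
    # 100,000 points per month is roughly one reading every 26 seconds.
    print(f"{seconds_between_readings(100_000):.2f} s between readings")
    # Hypothetical utility with one million customers.
    print(f"{fleet_points_per_month(1_000_000):,} points per month")
    # At about 1 TB per genome, each petabyte per month is 1000 genomes.
    print(f"{genomes_per_month(1.0):.0f} genomes per PB")
```

Under these assumptions a single meter reports about every 26 seconds, and a center producing one petabyte a month is sequencing on the order of a thousand genomes in that time.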

(7)

Hardware and Software Challenges of Big Data Workloads

• Big Data storage has typically grown outside of enterprise storage control. This poses a serious management problem for data center managers to implement security control, audit capability, backup and archiving capability, centralized management of storage, etc.

• Growth of scale out systems in business has introduced the challenges of managing a large number of servers and big networks to commercial IT staff

• Big data workloads tend to not share infrastructure with other applications. This has caused businesses to duplicate infrastructure for their big data applications

• Adoption of a Map Reduce framework forces language and storage choices that may not be ideal for the application
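To make the last point concrete, here is a minimal sketch of the Map Reduce programming model in plain Python. It is not the Hadoop or Platform Symphony API; the function names are hypothetical and everything runs in a single process. What it shows is the shape the framework imposes: the application has to be expressed as a map function emitting key/value pairs and a reduce function folding the values for each key.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (done by the framework)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: fold the values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input records."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big compute"]))
```

An algorithm that does not decompose naturally into independent per-record maps and per-key reductions has to be bent into this form, which is the mismatch the bullet describes.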


(8)

Explosive Storage Growth Requires New Storage Solutions

“From the dawn of civilization until 2003, humankind generated 5 exabytes of data. Now we produce 5 exabytes every two days… and the pace is accelerating.”

— Eric Schmidt, Executive Chairman, Google


Picture of a 5 MB IBM 305 hard drive being loaded into an airplane in 1956. The unit weighed 1000 kg

• UPS stores more than 16 PB of data, from deliveries to event planning

• Monster, the online careers company, stores 5 PB of data, largely from nearly 40 million resumes

• Zynga stores 3 PB of data on the gaming habits of nearly 300 million monthly online game players

• Facebook adds 7 PB of storage every month to its exabyte trove

• The Boeing 787 Dreamliner generates 1 TB of data for every round trip, equating to hundreds of TB daily for the entire fleet

• CERN has collected more than 100 PB of data from high-energy physics experiments over the past two decades, 75 PB of it from the Large Hadron Collider in just the past three years*

* K. Davies, “Best Practices in Big Data Storage”, Tabor Communications, April 2013

(9)

Technologies in Big Data Storage Architectures


Businesses recognize the value of their data, but to extract value from it they must first tame the data deluge. They must store it efficiently, organize it and manage it before they can operate on it to gain meaningful insight

• Scale out data architecture can be an efficient and scalable way to add capacity and performance for Big Data solutions

• The astounding growth in data means that tape has become integral to many big data storage solutions

• High-speed analytics and real-time applications require low-latency access to data and are incorporating flash-based storage

• There is a need for capacity as well as performance, which means that tiering of storage and the movement of data between the tiers is necessary

• Take advantage of new storage technologies, such as shingled magnetic recording (SMR), to create very dense storage pools without sacrificing performance

• Processing of data is done by a variety of traditional and emerging workloads that have different access requirements but need to be managed seamlessly

• It is no longer enough to capture the data; it is increasingly important to collect context and annotate the data. This annotated context is used to preprocess the data before analysis, make data management decisions, correlate data with other data sources, etc.
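The tiering bullet above can be sketched as a small placement policy. This is an illustration only, written in plain Python rather than in any storage product's policy language; the tier names follow the slide (flash, disk, tape), while the `FileInfo` type, the age thresholds and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical thresholds, chosen only for illustration.
HOT_DAYS = 7     # accessed within a week: keep on flash
WARM_DAYS = 90   # accessed within three months: keep on disk


@dataclass
class FileInfo:
    name: str
    days_since_access: int
    tier: str  # current tier: "flash", "disk" or "tape"


def target_tier(f: FileInfo) -> str:
    """Pick a tier from access recency: hot data on flash, cold data on tape."""
    if f.days_since_access <= HOT_DAYS:
        return "flash"
    if f.days_since_access <= WARM_DAYS:
        return "disk"
    return "tape"


def plan_migrations(files: Iterable[FileInfo]) -> List[Tuple[str, str, str]]:
    """Return (name, from_tier, to_tier) for every file on the wrong tier."""
    return [
        (f.name, f.tier, target_tier(f))
        for f in files
        if target_tier(f) != f.tier
    ]


if __name__ == "__main__":
    catalog = [
        FileInfo("survey_2013.segy", days_since_access=2, tier="tape"),
        FileInfo("trades_q1.csv", days_since_access=30, tier="flash"),
        FileInfo("archive_2009.tar", days_since_access=900, tier="tape"),
    ]
    for name, src, dst in plan_migrations(catalog):
        print(f"move {name}: {src} -> {dst}")
```

A real policy would weigh more than access age (file size, owner, the annotated context mentioned in the last bullet), but the structure stays the same: evaluate a rule per file, then move only the files whose current tier no longer matches.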

(10)

Enterprise-class Map Reduce Solution


CUSTOMER REQUIREMENT:

• Leverage a shared distributed set of resources, and run a variety of heterogeneous compute and data intensive applications without the need to duplicate infrastructure

• Solution should be easy to deploy, guarantee high reliability and availability, should be easy to manage, and support multiple lines of business and applications

• Deploy a combined Platform Symphony Map Reduce + GPFS-FPO solution to realize dramatic performance improvements and financial savings while delivering a more robust and flexible solution

• Result: IBM Platform Symphony and GPFS-FPO can help accelerate Hadoop workloads while reducing cost and improving workload reliability

Using HPC to Help Big Data

(11)

Enterprise-class Map Reduce Solution

Key Benefits

Platform Symphony Map Reduce

• Breakthrough Hadoop performance

• Deliver faster and more accurate analysis for Big Data applications by doing greater processing with less infrastructure

• Lower costs through reduction in infrastructure and administration overhead

• Enable business agility by supporting multiple groups and diverse workloads on a single shared cluster

GPFS-FPO

• GPFS-FPO allows coexistence of various analytic architectures

• Better overall performance for analytics

• Provides a more robust architecture with no single point of failure

• Provides POSIX compliance and end-to-end data management capability

• Policy driven failure handling and faster recovery

[Figure: normalized execution time, HDFS versus GPFS. Left panel: Postmark and Terasort (axis 0 to 2). Right panel: CacheTest (axis 0 to 20).]


(12)

Using HPC to Help Big Data

• Use the energy-aware scheduling capability, developed to support the needs of high-end HPC customers, to deliver better energy management functions integrated in a big data solution

• Most big data workloads are based on a sockets communication API, which does not provide a low-latency transport. Exploit user-space sockets to leverage RDMA and minimize stack overhead, delivering low-latency messaging without changing the applications

• Use GPFS data management capabilities to provide a flexible storage architecture that meets the needs of different applications in the enterprise, both big data and traditional
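For readers less familiar with the sockets point, the sketch below shows the kind of application code involved: an echo exchange over loopback written against the standard sockets API, using Python's standard `socket` module. It does not use RDMA and is not taken from the talk. The assumption is that a user-space sockets layer would sit underneath calls like these and replace the kernel TCP path, which is why the application itself would not need to change. The function names are hypothetical.

```python
import socket
import threading


def read_all(conn: socket.socket) -> bytes:
    """Read from the connection until the peer closes its sending side."""
    chunks = []
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def echo_once(server: socket.socket) -> None:
    """Accept one connection and send back whatever was received."""
    conn, _ = server.accept()
    with conn:
        conn.sendall(read_all(conn))


def round_trip(payload: bytes) -> bytes:
    """Send payload to a local echo server using only standard socket calls."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        worker = threading.Thread(target=echo_once, args=(server,))
        worker.start()
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(payload)
            client.shutdown(socket.SHUT_WR)  # signal end of request
            reply = read_all(client)
        worker.join()
    return reply


if __name__ == "__main__":
    print(round_trip(b"ping"))
```

Every call here (`bind`, `listen`, `accept`, `recv`, `sendall`) goes through the operating system's network stack by default; the slide's proposal is to keep this interface and change only the transport beneath it.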
