© 2011 IBM Corporation
Data Centric Computing Revisited
SPXXL/SCICOMP – Summer 2013
Piyush Chaudhary
Technical Computing Solutions
Bottom line: It is a time of Powerful Information
Dimensions of data growth
Volume: Terabytes to exabytes of existing data to process
Variety: Structured, unstructured, text, multimedia
Velocity: Streaming data, milliseconds to seconds to respond
Veracity: Uncertainty from inconsistency, ambiguities, etc.
Data volume is on the rise
[Chart: data volume by source (Sensors & Devices, VoIP, Enterprise Data, Social Media); year shown: 2015]
Big Data and High Performance Computing are driving systems requirements:
Move the Compute to the Data!
Directly integrating Reactive and Deep Analytics enables feedback-driven insight optimization
[Figure: Data Scale (Kilo, Mega, Giga, Tera, Peta, Exa) versus Decision Frequency (Occasional, Frequent, Real-time; yr, mo, wk, day, hr, min, sec … ms, µs)]
Traditional Data Warehouse and Business Intelligence
Deep Analytics: High Performance Computing on Large Data Sets (Creating a World Model … Context). Labels: History, Deep, Predictions, Hypotheses
Reactive Analytics: High Performance Computing on Large Streams of Data (Analyzing Real Time against the World Model … Context). Labels: Reality, Fast, Observations, Actions
Integration and Feedback connect Deep and Reactive Analytics
Maximum Insight Requires Combining Deep and Reactive Analytics
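The feedback loop between deep and reactive analytics can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the "world model" is reduced to a mean and standard deviation, and the function names, threshold, and sample values are hypothetical, not part of any IBM product.

```python
from statistics import mean, stdev

def build_model(history):
    """Deep analytics step: summarize historical data into a toy 'world model'."""
    return {"mean": mean(history), "std": stdev(history)}

def react(model, observation, threshold=3.0):
    """Reactive analytics step: score one streamed observation against the model."""
    z = abs(observation - model["mean"]) / model["std"]
    return "alert" if z > threshold else "ok"

history = [10.0, 11.0, 9.0, 10.5, 9.5]   # hypothetical historical data
model = build_model(history)

stream = [10.2, 25.0, 9.8]               # hypothetical live observations
actions = []
for obs in stream:
    actions.append(react(model, obs))    # fast decision against the context
    history.append(obs)                  # feedback: observations become history

model = build_model(history)             # periodic deep re-analysis refreshes the model
```

The design point is only that the fast path reads a model it does not build, while the slow path rebuilds that model from everything the fast path has seen.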
2020: The Context-Centric Future
Trillions of data sources
Billions of Agents & User Applications
Millions of Analytics
Exabytes of Context
Data types: Streaming Data, Text Data, Multi-dimensional, Time Series, Geo Spatial, Relational, Social Network, Video & Image, etc.
Massive parallelism, storage density, high-bandwidth, low-latency networks and other data-centric principles must be fundamental to the ultimate solution architecture.
What is Driving the Explosive Growth of Big Data?
Compute processing is becoming very cheap, allowing us to instrument everything
– More sensors (more sources of data)
– Increased resolution in sensor data (bigger data)
– Cheaper storage (saving more data)
An increasingly networked world allows us to gather data quickly and cheaply
– Data can be centralized easily and can be acted on more effectively
Mobile computing allows for newer ways to collect data
– Smartphones are equipped with a variety of sensors and can continuously collect data
Growth in social media is driving more sharing of data
Big Data Workloads and Their Evolution
• Genomics
• The Human Genome Project took over 10 years to complete and cost over $3 billion
• Next-generation sequencers can sequence a human genome in a few days for about $1000 and generate a terabyte of data, so big genomic centers can produce petabytes of data every month
• Smart Utilities
• Many electric utility companies are wiring their customers with smart meters
• These smart meters generate 100,000 data points per month per customer
• Utility companies need to analyze all this data for capacity planning, pricing and future investment
• Oil and Gas
• Seismic exploration data is growing so fast it has to be primarily stored on tape
• It is migrated to disk-based storage before it can be operated on and then deleted
• Financial Services
• Algorithmic trading and the requirement to be able to react quickly to changes in the market are driving the need for low latency access to data
• Telecommunications
• Mobile phones generate many call detail records (CDRs) for each call, text or data usage
• Telecom providers must analyze billions of CDRs a day to improve quality, deliver services and make investment decisions
• Real Time Traffic Management
• Uses a mixture of real-time sensors and historical data to lower congestion, increase capacity and reduce emissions
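The rates quoted for these workloads are easier to compare after some back-of-the-envelope arithmetic. The sketch below uses only the slide's figures plus two labeled assumptions: a 30-day month and decimal units (1 PB = 1000 TB). The customer count in the usage line is hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600       # assumption: 30-day month

# Smart utilities: 100,000 data points per month per customer (from the slide)
POINTS_PER_CUSTOMER_MONTH = 100_000
seconds_between_readings = SECONDS_PER_MONTH / POINTS_PER_CUSTOMER_MONTH

def monthly_points(customers):
    """Total meter readings a utility must ingest per month."""
    return customers * POINTS_PER_CUSTOMER_MONTH

# Genomics: about 1 TB per sequenced genome (from the slide)
TB_PER_GENOME = 1
TB_PER_PB = 1000                         # assumption: decimal units
genomes_per_petabyte = TB_PER_PB // TB_PER_GENOME

# Hypothetical utility with one million metered customers
example_points = monthly_points(1_000_000)
```

Under these assumptions each meter reports roughly every 26 seconds, and a center reaches a petabyte after about a thousand genomes.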
Hardware and Software Challenges of Big Data Workloads
• Big Data storage has typically grown outside of enterprise storage control. This poses a serious management problem for data center managers to implement security control, audit capability, backup and archiving capability, centralized management of storage, etc.
• Growth of scale out systems in business has introduced the challenges of managing a large number of servers and big networks to commercial IT staff
• Big data workloads tend to not share infrastructure with other applications. This has caused businesses to duplicate infrastructure for their big data applications
• Adoption of a Map Reduce framework forces language and storage choices that may not be ideal for the application
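To make the last point concrete, here is a minimal sketch of the Map Reduce programming model in plain Python. It is not the Hadoop or Platform Symphony API; the function names and the word-count task are illustrative assumptions. It shows the constraint the model imposes: the whole job must be expressed as a map function and a reduce function over key/value pairs.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = map_reduce(["big data", "big compute", "data data"])
```

An application whose logic does not decompose naturally into these two steps has to be reshaped to fit, which is the cost the slide refers to.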
Explosive Storage Growth Requires New Storage Solutions
“From the dawn of civilization until 2003, humankind generated 5 exabytes of data. Now we
produce 5 exabytes every two days… and the pace is accelerating.”
— Eric Schmidt, Executive Chairman, Google
Picture of a 5 MB IBM 305 hard drive being loaded into an airplane in 1956. The unit weighed 1000 kg.
• UPS stores more than 16 PB of data, from deliveries to event planning
• Monster, the online careers company, stores 5 PB of data, largely from nearly 40 million resumes
• Zynga stores 3 PB of data on the gaming habits of nearly 300 million monthly online game players
• Facebook adds 7 PB of storage every month onto its exabyte trove
• The Boeing 787 Dreamliner generates 1 TB of data for every roundtrip, equating to hundreds of TB daily for the entire fleet
• CERN has collected more than 100 PB of data from high-energy physics experiments over the past two decades, of which 75 PB came from the Large Hadron Collider in just the past three years*
* K. Davies, “Best Practices in Big Data Storage”, Tabor Communications, April 2013
Technologies in Big Data Storage Architectures
Businesses recognize the value of their data, but to extract that value they must first tame the data deluge. They must store it efficiently, organize it and manage it before they can operate on it to gain meaningful insight.
• Scale out data architecture can be an efficient and scalable way to add capacity
and performance for Big Data solutions
• The astounding growth in data means that tape has become integral to lots of big
data storage solutions
• High speed analytics and real time applications require low latency access to data
and are incorporating flash based storage
• There is a need for capacity as well as performance which means that tiering of
storage and the movement of data between the tiers is necessary
• Taking advantage of new storage technologies, like shingled magnetic recording
(SMR), for creating really dense storage pools without sacrificing performance
• Processing of data is done by a variety of traditional and emerging workloads that
have different access requirements but need to be managed seamlessly
• It is no longer enough to capture the data; it is increasingly important to collect
context and annotate the data. This annotated context is used to preprocess the
data before analysis, make data management decisions, correlate data with other
data sources, etc.
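The tiering point above (capacity plus performance, with data moving between tiers) can be sketched as a simple age-based placement rule. This is an illustrative assumption only: the tier names, the 7-day and 90-day thresholds, the file names, and the `FileInfo` type are hypothetical, and this is not GPFS policy syntax.

```python
from dataclasses import dataclass

@dataclass
class FileInfo:
    name: str
    days_since_access: int
    tier: str = "disk"          # where the file currently lives

def target_tier(f, hot_days=7, cold_days=90):
    """Pick a tier from access recency: flash for hot, tape for cold, disk otherwise."""
    if f.days_since_access <= hot_days:
        return "flash"
    if f.days_since_access >= cold_days:
        return "tape"
    return "disk"

def plan_migrations(files):
    """Return (name, from_tier, to_tier) for every file that should move."""
    return [(f.name, f.tier, target_tier(f))
            for f in files if f.tier != target_tier(f)]

files = [
    FileInfo("trades.db", days_since_access=1),
    FileInfo("survey.segy", days_since_access=400),
    FileInfo("report.csv", days_since_access=30),
]
moves = plan_migrations(files)
```

A real policy would likely also weigh the annotated context mentioned in the last bullet (owner, workload type, retention rules) rather than access age alone.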
Enterprise-class Map Reduce Solution
CUSTOMER REQUIREMENT:
• Leverage a shared distributed set of resources, and run a variety of heterogeneous compute and data intensive applications without the need to duplicate infrastructure
• Solution should be easy to deploy, guarantee high reliability and availability, should be easy to manage, and support multiple lines of business and applications
• Deploy a combined Platform Symphony Map Reduce + GPFS-FPO solution to realize dramatic performance improvements and financial savings while delivering a more robust and flexible solution
• Result: IBM Platform Symphony and GPFS-FPO can help accelerate Hadoop workloads while reducing cost and improving workload reliability
Using HPC to Help Big Data
Enterprise-class Map Reduce Solution
Key Benefits
Platform Symphony Map Reduce
• Breakthrough Hadoop performance
• Deliver faster and more accurate analysis for Big Data applications by doing greater processing with less infrastructure
• Lower costs through reduction in infrastructure and administration overhead
• Enable business agility by supporting multiple groups and diverse workloads on a single shared cluster
GPFS-FPO
• GPFS-FPO allows coexistence of various analytic architectures
• Better overall performance for analytics
• Provides a more robust architecture with no single point of failure
• Provides POSIX compliance and end-to-end data management capability
• Policy driven failure handling and faster recovery
[Chart: Execution Time (normalized), HDFS vs. GPFS, for Postmark and Terasort]
[Chart: Execution Time (normalized), HDFS vs. GPFS, for CacheTest]
Using HPC to Help Big Data
Use the energy-aware scheduling capability, developed to support the needs of high-end
HPC customers, to deliver better energy management functions integrated in a big data
solution
Most big data workloads are based on a sockets communication API, which does not provide
a low-latency transport. Exploit user-space sockets to leverage RDMA and minimize stack
overhead, delivering low-latency messaging without changing the applications
Use GPFS data management capabilities to provide a flexible storage architecture that meets
the needs of different applications in the enterprise, both big data and traditional
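For the sockets point above, the sketch below shows the kind of ordinary sockets code the claim is about: the application calls only the standard API, so a lower-latency transport would have to be substituted underneath it. This example runs over the regular kernel TCP stack on loopback and does not use RDMA; the function names and the timing approach are illustrative assumptions.

```python
import socket
import threading
import time

def echo_once(server_sock):
    """Accept one connection and echo back whatever arrives first."""
    conn, _ = server_sock.accept()
    with conn:
        conn.sendall(conn.recv(1024))

def round_trip(payload=b"ping"):
    """Send a small message over loopback TCP and time the echo."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    worker = threading.Thread(target=echo_once, args=(server,))
    worker.start()
    with socket.create_connection(("127.0.0.1", port)) as client:
        start = time.perf_counter()
        client.sendall(payload)
        reply = client.recv(1024)
        elapsed = time.perf_counter() - start
    worker.join()
    server.close()
    return reply, elapsed

reply, elapsed = round_trip()
```

The measured `elapsed` includes the kernel network stack on both sides, which is the overhead a user-space transport aims to remove.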