HPC2012 Workshop Cetraro, Italy
Supercomputing and Big Data:
Where are the Real Boundaries
and Opportunities for Synergy?
Bill Blake
CTO Cray, Inc.
The “Big Data” Challenge
6/30/2012 2
Supercomputing minimizes data movement –
The focus is loading the “mesh”
in distributed memory,
computing the answer with fast interconnects, and
visualizing the answer in memory –
the high performance
“data movement”
is for loading, check pointing or archiving.
Data-intensive computing is all about data movement –
The focus is scanning, sorting, streaming and aggregating
"all the data all the time" to get the answer or discover
new knowledge from unstructured/structured data sources.
Big Data is “Data in Motion”
6/30/2012 3
•
Set the stage for the fusion of numerically intensive and
data intensive computing in future Cray systems
•
Build on Cray's success of delivering the most scalable
systems through heterogeneous and specialized nodes
•
• Nodes not only optimized for compute, but also storage and
network I/O, all connected with the highest level of interconnect performance. Add system capability to the edge of the fabric
•
We see this effort as increasing key application
performance with an "appliance style" approach
using
Cray's primary supercomputing products with extensions
configured as optimized HW/SW stacks –
adding value
around the edge of the high performance system network
Maximum Scalability:
System Node Specialization
Net I/O System Support Service Sys Admin Users File I/O Compute /home Price Performance PCs and Workstations •10,000,000s units Enterprise servers • 100,000s units High-Performance Computing •1,000s units
Courtesy of Dr. Bill Camp, Sandia National Laboratories, circa 2000
Key to Cray’s MPP scalability is system node o/s specialization combined with very high bandwidth, low latency interconnects
Back in Time, at the Beginnings of the
Web, there was Dr. Codd and his
Relational Database …
Algebraic Set Theory
Analyzing Big Data Analytics:
A “Swim Lane” View of Supporting Business and Technical Decisions
Key Function Language Data Approach “Airline” Example OLTP Declarative (SQL) Structured (relational) ATM transactions
Buying a seat on an airplane OLAP Ad Hoc Declarative (SQL+UDF) Structured (relational)
BI aggregate and analyze bookings for new ad placements Semantic Ad hoc Declarative (SPARQL) Linked, Open (graph-based)
Analyze social graphs and infer who might travel where OLAP Ad Hoc Procedural (MapReduce) Unstructured (Hadoop files)
Application Framework for weblog analysis Optimize Models Procedural (Solver Libs) Optimization <-> Simulation Complex Scheduling Estimating empty seats Simulate Models Procedural (Fortran, C++) Matrix Math (Systems of Eq’s)
Mathematical Modeling and simulation (design airplane)
For Perspective (1980’s) …
●
The relational database was invented
on a system that merged server,
storage and database
●
It was called a mainframe
●
Change focus to vector processing
and memory performance and the
mainframe becomes a supercomputer
Key Function Language Data Approach SMP Server Cluster And MPP Cloud And Grid Web Scale OLTP Declarative (SQL) Structured (relational)
CPU
MemoryIOP
IOP
OLTP: Processing Tables with Queries
Processing millions of transactions without a
hitch
●
Do entire transaction or nothing
(don’t debit account without dispensing cash)
●
Disciplined approach to data
forms data models/schema tables on disk
●
Tables at first glance look like the rows and
columns of a spreadsheet
●
Transaction processing uses highly repetitive
queries for entering/updating
Key Function Language Data Approach SMP Server Cluster And MPP Cloud And Grid Web Scale OLTP Declarative (SQL) Structured (relational)
OLTP: Processing Tables with Queries
●
A transaction “touches” only the data it
needs to accomplish its purpose
●
Since the workload involves many, many small
data payloads, speedups can be gained through
caching and the use of indices that prevent “full
table scans”.
●
The query states what it needs, and the
database uses a (cost-based)
planner/optimizer to create the program that
actually manipulates the tables
Then The Rules Changed
●
Mainframes attacked by
killer micros!
●
Memory grew large
●
I/O became weak
●
System costs dropped
●
Storage moved off to
the network
CPU CPU CPU CPU
CPU CPU CPU CPU
Very Large Memory
I/O
Storage Area Network
Capacity Was Added By Clustering
CPU CPU CPU CPU Memory
I/O
CPU CPU CPU CPU Memory I/O CPU CPU CPU CPU Me m o ry I/O
CPU CPU CPU CPU Memory
I/O
CPU CPU CPU CPU Memory
I/O Storage Area
Network
The Online Analytical Processing Problem
● Business Intelligence Applications generate reports of aggregations
● Need to read at all the data all the time (telecom, retail, finance, advertising, etc)
● BI Analytics require ad hoc queries since you don’t know the next question to ask until you understand the answer to the last question
● Standard SQL is limited by the Algebraic Set Theory basis of
RDBMS, if you need Calculus then insert User Defined Functions into the SQL
● Programming models in conflict as Modeling and Simulation combine with Business Intelligence in Predictive Analytics
Key Function Language Data Approach SMP Server Cluster And MPP Cloud And Grid Web Scale OLAP Ad Hoc Declarative (SQL+UDF ) Structured (relational)
OLAP Databases
OLAP Databases
aka Data Warehouse
aka Data Warehouse
Processing Hundreds Processing Hundreds of Terabytes/hour of Terabytes/hour Extraction /Cleansing Trans-formation Loading Reports Reports OLTP Databases OLTP Databases Processing Millions Processing Millions Of Transactions/sec Of Transactions/sec
Operational vs. Analytical RDBMS
Client Applications Local Applications
Hours or Days
$$$ Processing $$$ Storage $$$ Fiber Channels SMP HOST 1 SMP HOST 2 SMP HOST N C6 F7 C12 F13 C38 F39 G13 G22 $$$ $$$The Netezza Idea:
Moving Processing to the Data
●
Active Disk architectures
● Integrated processing power and memory into disk units ● Scaled processing power as the dataset grew
●
Decision support algorithms offloaded to Active Disks to
support key decision support tasks
● Active Disk architectures use stream-based model ideal for the
software architecture of relational databases
●
Influenced by the success of Cisco and NetApp
appliances, the approach combined software, processing,
networking and storage leading to the first database
warehouse appliance!
The HW Architecture Responded to
the Needs of the Database Application
SQL Compiler Query Plan Optimize Admin Front End DBOS Execution Engine Linux
SMP Host Massively Parallel Intelligent Storage 1 2 3 1000+ Gigabit Ethernet Processor + streaming DB logic High-Performance Database Engine Streaming joins, aggregations, sorts, etc. Processor + streaming DB logic Processor + streaming DB logic
Snippet Processing Unit (SPU)
Netezza Performance Server
Client BI Applications Fast Loader/Unloader Local Applications ODBC 3.X JDBC Type 4 SQL/92 Processor + streaming DB logic
Move processing to the data (maximum I/O to a single table)
Shifting from Analyst to Programmer
●
Google, Yahoo and their friends are using a data-intensive
application framework to analyze very large datasets
(e.g.,weblogs) without transactions or structured data
●
What will this look like in 10 years?
● MapReduce/Hadoop: an programming model/application framework
performing “group-by (map), sort and aggregation (reduce)”
● Not queries, but programs willing to forgo the need for transactional
integrity or the performance of structured data (4X-5X disadvantage
on equal hardware, but with excellent scaling on cheap hardware)
● An increasingly popular approach with organizations that have the
programming talent to use it, especially research organizations
● Another frontal assault on the $40K per socket RDBMS licensing
Key Function Language Data Approach SMP Server Cluster And MPP Cloud And Grid Web Scale OLAP Ad Hoc Procedural (MapReduce) Unstructured (Hadoop Files)
●
Multi-dimensional predictive queries
● Connection networks, social network, Time, Space, Reasoning
Find why the customers who live on the 25th
street in Zurich did not switch their phone company between March 23 to August 19 while their friends and family switched their phone company and predict the trend moving forward?
●
Need to pick weak signals from the noise
● Data size will continue to grow and be more heterogeneous
● Finding needle in a hay stack will become more important
●
Will require real time or shorter turnaround time
● Differentiation will be based on the speed of analysis of
changing data
●
INFERENCE – “maybe” as a legitimate answer
Analysis is Getting Complex
Advancing from Generating Reports
to Inference and Discovery
●The International W3C standards body has approved the key standards, called RDF and OWL to support the Semantic Web aka Web 3.0 with “machine readable” open linked data
●Future databases will use triples (subject-predicate-object) vs tables and with RDF/OWL federate heterogeneous data
●Future databases will support reasoning not just reporting
●This work started as a combined European Defense and DARPA effort
●Major RDBMS vendors are admitting Relational and XML are
ill-suited to the needs of the semantic web of the future
Key Function Language Data Approach SMP Server Cluster And MPP Cloud And Grid Web Scale Semantic Ad hoc Declarative (SPARQL) Linked, Open (graph-based)
The Idea:
Address Memory and Network Latency
●
Decision support algorithms offloaded to multi-
threaded processors and in-memory database with
new complex key decision support tasks
● Supports new SPARQL query processing for RDF triples database
offering the speedup of XMT processing without low level API
●
Shared memory, multi-threaded Cray technology
● Integrated with semantic database (open source std compliant)
● Fastest complex query response on open linked data in the industry
●
Influenced by the success of database warehouse
appliances, our combined (database) software,
processing, networking and memory led to the first
graph appliance!
● Deliver easy to deploy solutions requiring knowledge
?
Performs like a supercomputer Uses open web 3.0 standards
Scales like a web engine Operates like a data warehouse
21