• No results found

Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy?

N/A
N/A
Protected

Academic year: 2021

Share "Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy?"

Copied!
24
0
0

Loading.... (view fulltext now)

Full text

(1)

HPC2012 Workshop Cetraro, Italy

Supercomputing and Big Data:

Where are the Real Boundaries

and Opportunities for Synergy?

Bill Blake

CTO Cray, Inc.

(2)

The “Big Data” Challenge

6/30/2012 2

Supercomputing minimizes data movement –

The focus is loading the “mesh”

in distributed memory,

computing the answer with fast interconnects, and

visualizing the answer in memory –

the high performance

“data movement”

is for loading, check pointing or archiving.

Data-intensive computing is all about data movement –

The focus is scanning, sorting, streaming and aggregating

"all the data all the time" to get the answer or discover

new knowledge from unstructured/structured data sources.

(3)

Big Data is “Data in Motion”

6/30/2012 3

Set the stage for the fusion of numerically intensive and

data intensive computing in future Cray systems

Build on Cray's success of delivering the most scalable

systems through heterogeneous and specialized nodes

• Nodes not only optimized for compute, but also storage and

network I/O, all connected with the highest level of interconnect performance. Add system capability to the edge of the fabric

We see this effort as increasing key application

performance with an "appliance style" approach

using

Cray's primary supercomputing products with extensions

configured as optimized HW/SW stacks –

adding value

around the edge of the high performance system network

(4)

Maximum Scalability:

System Node Specialization

Net I/O System Support Service Sys Admin Users File I/O Compute /home Price Performance PCs and Workstations •10,000,000s units Enterprise servers • 100,000s units High-Performance Computing •1,000s units

Courtesy of Dr. Bill Camp, Sandia National Laboratories, circa 2000

Key to Cray’s MPP scalability is system node o/s specialization combined with very high bandwidth, low latency interconnects

(5)

Back in Time, at the Beginnings of the

Web, there was Dr. Codd and his

Relational Database …

Algebraic Set Theory

(6)

Analyzing Big Data Analytics:

A “Swim Lane” View of Supporting Business and Technical Decisions

Key Function Language Data Approach “Airline” Example OLTP Declarative (SQL) Structured (relational) ATM transactions

Buying a seat on an airplane OLAP Ad Hoc Declarative (SQL+UDF) Structured (relational)

BI aggregate and analyze bookings for new ad placements Semantic Ad hoc Declarative (SPARQL) Linked, Open (graph-based)

Analyze social graphs and infer who might travel where OLAP Ad Hoc Procedural (MapReduce) Unstructured (Hadoop files)

Application Framework for weblog analysis Optimize Models Procedural (Solver Libs) Optimization <-> Simulation Complex Scheduling Estimating empty seats Simulate Models Procedural (Fortran, C++) Matrix Math (Systems of Eq’s)

Mathematical Modeling and simulation (design airplane)

(7)

For Perspective (1980’s) …

The relational database was invented

on a system that merged server,

storage and database

It was called a mainframe

Change focus to vector processing

and memory performance and the

mainframe becomes a supercomputer

Key Function Language Data Approach SMP Server Cluster And MPP Cloud And Grid Web Scale OLTP Declarative (SQL) Structured (relational)

CPU

Memory

IOP

IOP

(8)

OLTP: Processing Tables with Queries

Processing millions of transactions without a

hitch

Do entire transaction or nothing

(don’t debit account without dispensing cash)

Disciplined approach to data

forms  data models/schema  tables on disk

Tables at first glance look like the rows and

columns of a spreadsheet

Transaction processing uses highly repetitive

queries for entering/updating

Key Function Language Data Approach SMP Server Cluster And MPP Cloud And Grid Web Scale OLTP Declarative (SQL) Structured (relational)

(9)

OLTP: Processing Tables with Queries

A transaction “touches” only the data it

needs to accomplish its purpose

Since the workload involves many, many small

data payloads, speedups can be gained through

caching and the use of indices that prevent “full

table scans”.

The query states what it needs, and the

database uses a (cost-based)

planner/optimizer to create the program that

actually manipulates the tables

(10)

Then The Rules Changed

Mainframes attacked by

killer micros!

Memory grew large

I/O became weak

System costs dropped

Storage moved off to

the network

CPU CPU CPU CPU

CPU CPU CPU CPU

Very Large Memory

I/O

Storage Area Network

(11)

Capacity Was Added By Clustering

CPU CPU CPU CPU Memory

I/O

CPU CPU CPU CPU Memory I/O CPU CPU CPU CPU Me m o ry I/O

CPU CPU CPU CPU Memory

I/O

CPU CPU CPU CPU Memory

I/O Storage Area

Network

(12)

The Online Analytical Processing Problem

● Business Intelligence Applications generate reports of aggregations

Need to read at all the data all the time (telecom, retail, finance, advertising, etc)

BI Analytics require ad hoc queries since you don’t know the next question to ask until you understand the answer to the last question

● Standard SQL is limited by the Algebraic Set Theory basis of

RDBMS, if you need Calculus then insert User Defined Functions into the SQL

● Programming models in conflict as Modeling and Simulation combine with Business Intelligence in Predictive Analytics

Key Function Language Data Approach SMP Server Cluster And MPP Cloud And Grid Web Scale OLAP Ad Hoc Declarative (SQL+UDF ) Structured (relational)

(13)

OLAP Databases

OLAP Databases

aka Data Warehouse

aka Data Warehouse

Processing Hundreds Processing Hundreds of Terabytes/hour of Terabytes/hour Extraction /Cleansing Trans-formation Loading Reports Reports OLTP Databases OLTP Databases Processing Millions Processing Millions Of Transactions/sec Of Transactions/sec

Operational vs. Analytical RDBMS

(14)

Client Applications Local Applications

Hours or Days

$$$ Processing $$$ Storage $$$ Fiber Channels SMP HOST 1 SMP HOST 2 SMP HOST N C6 F7 C12 F13 C38 F39 G13 G22 $$$ $$$

(15)

The Netezza Idea:

Moving Processing to the Data

Active Disk architectures

● Integrated processing power and memory into disk units ● Scaled processing power as the dataset grew

Decision support algorithms offloaded to Active Disks to

support key decision support tasks

● Active Disk architectures use stream-based model ideal for the

software architecture of relational databases

Influenced by the success of Cisco and NetApp

appliances, the approach combined software, processing,

networking and storage leading to the first database

warehouse appliance!

(16)

The HW Architecture Responded to

the Needs of the Database Application

SQL Compiler Query Plan Optimize Admin Front End DBOS Execution Engine Linux

SMP Host Massively Parallel Intelligent Storage 1 2 3 1000+ Gigabit Ethernet Processor + streaming DB logic High-Performance Database Engine Streaming joins, aggregations, sorts, etc. Processor + streaming DB logic Processor + streaming DB logic

Snippet Processing Unit (SPU)

Netezza Performance Server

Client BI Applications Fast Loader/Unloader Local Applications ODBC 3.X JDBC Type 4 SQL/92 Processor + streaming DB logic

Move processing to the data (maximum I/O to a single table)

(17)

Shifting from Analyst to Programmer

Google, Yahoo and their friends are using a data-intensive

application framework to analyze very large datasets

(e.g.,weblogs) without transactions or structured data

What will this look like in 10 years?

● MapReduce/Hadoop: an programming model/application framework

performing “group-by (map), sort and aggregation (reduce)”

● Not queries, but programs willing to forgo the need for transactional

integrity or the performance of structured data (4X-5X disadvantage

on equal hardware, but with excellent scaling on cheap hardware)

● An increasingly popular approach with organizations that have the

programming talent to use it, especially research organizations

● Another frontal assault on the $40K per socket RDBMS licensing

Key Function Language Data Approach SMP Server Cluster And MPP Cloud And Grid Web Scale OLAP Ad Hoc Procedural (MapReduce) Unstructured (Hadoop Files)

(18)

Multi-dimensional predictive queries

● Connection networks, social network, Time, Space, Reasoning

Find why the customers who live on the 25th

street in Zurich did not switch their phone company between March 23 to August 19 while their friends and family switched their phone company and predict the trend moving forward?

Need to pick weak signals from the noise

● Data size will continue to grow and be more heterogeneous

● Finding needle in a hay stack will become more important

Will require real time or shorter turnaround time

● Differentiation will be based on the speed of analysis of

changing data

INFERENCE – “maybe” as a legitimate answer

Analysis is Getting Complex

(19)

Advancing from Generating Reports

to Inference and Discovery

●The International W3C standards body has approved the key standards, called RDF and OWL to support the Semantic Web aka Web 3.0 with “machine readable” open linked data

●Future databases will use triples (subject-predicate-object) vs tables and with RDF/OWL federate heterogeneous data

Future databases will support reasoning not just reporting

●This work started as a combined European Defense and DARPA effort

Major RDBMS vendors are admitting Relational and XML are

ill-suited to the needs of the semantic web of the future

Key Function Language Data Approach SMP Server Cluster And MPP Cloud And Grid Web Scale Semantic Ad hoc Declarative (SPARQL) Linked, Open (graph-based)

(20)

The Idea:

Address Memory and Network Latency

Decision support algorithms offloaded to multi-

threaded processors and in-memory database with

new complex key decision support tasks

● Supports new SPARQL query processing for RDF triples database

offering the speedup of XMT processing without low level API

Shared memory, multi-threaded Cray technology

● Integrated with semantic database (open source std compliant)

● Fastest complex query response on open linked data in the industry

Influenced by the success of database warehouse

appliances, our combined (database) software,

processing, networking and memory led to the first

graph appliance!

Deliver easy to deploy solutions requiring knowledge

(21)

?

Performs like a supercomputer Uses open web 3.0 standards

Scales like a web engine Operates like a data warehouse

21

(22)

Adapt the system to the application

not the application to the system

Vision:

Exascale

Cray’s Adaptive Supercomputing combines

multiple processing architectures into a single

scalable system—CPU, GPU, or

Multi-threaded

The focus in on the user’s application where

the adaptive software, the compiler or query

processor, knows what types of processors

are available on the heterogeneous system

and targets code to the most appropriate

processor

The next step is to evolve Adaptive

(23)

Enabling Simulation and Data Science

(24)

References

Related documents

National National Board Board for for Standards Standards in Health in Health Informatics Informatics BSHI

 SAP NetWeaver provides core functions for the technical infrastructure of your business solutions in four integration levels.. – People Integration: This ensures the employees to

We provide an affirmative answer to this question in Section 7 by showing that fortunately two well-known generic construction of optimistic fair exchange (one is based on

variance of general combining ability were lower than those of variance of specific combining ability for all the traits e.g panicle number per plant (0.54), 1000 grain weight

Number of effective tillers per plant was highly significantly (P&lt;0.01) affected by the main effects of rate and timing of N fertilizer application and

Asian retirees need other means of income in order to supplement public pension benefits and to meet their living expenses in retirement.. The need for effective retirement

Area 44 (Pars Opercularis) (left column in figure 3) in the left hemisphere showed positive RSFC with bilateral ventrolateral and dorsal frontal regions (including areas 45

F IGURE 4.37 T IME DOMAIN RESULTS OF ANISOTROPIC CYLINDER WITH INCLINED IMPACT AT THE MIDDLE CONSIDERING TIMBER POLE SITUATION ( DOWN GOING WAVE )