Big Data for Big Science. Bernard Doering Business Development, EMEA Big Data Software

(1)

Big Data for Big Science

Bernard Doering

Business Development, EMEA

Big Data Software

(2)

Internet of Things

INTELLIGENT CLOUD: Richer data to analyze

SMART CLIENTS: Richer user experiences

INTELLIGENT THINGS: Richer data from devices

2.8 Zettabytes of data generated WW in 2012 (1)

40 Zettabytes of data will be generated WW in 2020 (1)

Sources: (1) IDC Digital Universe 2020, (2) IDC
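The two IDC figures on this slide imply a yearly growth rate, which a short calculation makes concrete. This is a sketch that assumes smooth exponential growth between 2012 and 2020; the function name `annual_growth_factor` is illustrative and does not come from the slides.

```python
def annual_growth_factor(start_value: float, end_value: float, years: int) -> float:
    """Return the constant yearly factor that turns start_value into end_value."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1.0 / years)


# Figures from the slide: 2.8 ZB generated in 2012, 40 ZB projected for 2020.
factor = annual_growth_factor(2.8, 40.0, 2020 - 2012)
print(f"Implied growth: about {100 * (factor - 1):.0f}% per year")
```

Under that assumption the data volume grows by roughly 39% per year, which means it doubles a little faster than every two and a half years.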

(3)

Transformative Forces in Computing Science

HPC: Enabling exascale computing on massive data sets

Cloud: Helping enterprises build open interoperable clouds

Open Source: Contributing code and fostering ecosystem

(4)

Intel® Distribution for Apache Hadoop* software

Hardware-enhanced and optimised for industry leading performance & security

Strengthens Apache Hadoop* ecosystem

(5)

Intel® Distribution for Apache Hadoop* v3.0

Intel® Manager for Apache Hadoop software: Deployment, Configuration, Monitoring, Alerts, and Security

HDFS: Hadoop Distributed File System

YARN (MRv2): Distributed Processing Framework

HBase 0.96.1: Columnar Store
ZooKeeper 3.4.5: Coordination
Flume 1.3.0: Log Collector
Sqoop 1.4.1: Data Exchange
Pig 0.9.2: Scripting
Hive 0.10.0: SQL Query
Oozie 3.3.0: Workflow
Mahout 0.7: Machine Learning
HCatalog: Metadata
Connectors

(6)

INTEL CONFIDENTIAL

Project Gryphon

(7)

Deploying SQL applications on Hadoop

Problem Statement

HiveQL currently accepts only a small subset of SQL as valid queries

Current approaches to enabling SQL on Hadoop provide incomplete SQL

Enterprises need open source coverage & real-time performance of analytic SQL queries on Hadoop

[Diagram labels: SQL-92, HiveQL, Hive, MapReduce, HBase, HDFS Data Nodes]

(8)

Introducing Project Gryphon

Enables full SQL-92 coverage for OLAP applications on Hadoop with Hive as the execution back-end

Enables low-latency SQL queries on HBase with a more efficient storage engine and better performing JDBC drivers

Enables real-time SQL using the HBase co-processor framework and several Hive query optimizations

Is open source under the Apache Software License (ASL)

(9)

Intel Distribution for Apache Hadoop* software

Security

Performance

Management

(10)

Backed by portfolio of datacenter products

Software

Network

Storage & Memory

Server

Cache

(11)

Intel portfolio delivers balanced performance

Baseline (Intel® Xeon 5690, 7200 HDD, 1GbE Adapter): >4 hours

Intel® Xeon® processor: ~50% improved

Intel® SSD 520 Series: ~80% improved

Intel® 10GbE Adapters: ~50% improved

Intel® Distribution for Apache Hadoop* software: ~40% improved

Result: ~7 minutes

Other brands and names are the property of their respective owners

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Source: Intel internal testing. For more information go to: intel.com/performance

Shown to improve 1 Terabyte sort from 4 hours to 7 minutes
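One way to read the percentages on this slide is that each upgrade shortens the run time left over from the previous step. The sketch below works through that reading; the cumulative interpretation is an assumption, not something the slide states, and the names `UPGRADES` and `apply_upgrades` are illustrative.

```python
BASELINE_MINUTES = 4 * 60  # the slide quotes a baseline of about 4 hours

# (component, assumed fractional reduction of the remaining run time)
UPGRADES = [
    ("Intel Xeon processor", 0.50),
    ("Intel SSD 520 Series", 0.80),
    ("Intel 10GbE Adapters", 0.50),
    ("Intel Distribution for Apache Hadoop software", 0.40),
]


def apply_upgrades(baseline_minutes, upgrades):
    """Apply each reduction to the time remaining after the previous upgrade."""
    remaining = baseline_minutes
    timeline = []
    for name, reduction in upgrades:
        remaining *= 1.0 - reduction
        timeline.append((name, remaining))
    return timeline


timeline = apply_upgrades(BASELINE_MINUTES, UPGRADES)
for name, minutes in timeline:
    print(f"after {name}: {minutes:.1f} min")
```

Under this reading the final step lands at about 7.2 minutes, which is consistent with the "4 hours to 7 minutes" figure quoted on the slide.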

(12)

Why Intel for Hadoop?

Transparent encryption in Hive, Pig, MapReduce, HDFS

Up to 20x faster en/decryption with Intel AES-NI (1)

Up to 30x faster Terasort with Xeon, SSD, 10GbE (1)

Up to 8.5X faster queries in Hive* & HBase (1)

Support for Lustre* filesystem

(13)

Why Hadoop* + Lustre* ?

As HPC moves to Exascale, bigger simulations require better tools for analytics

Hadoop* is the de-facto software platform for big data analytics but…

HDFS* expects compute nodes with direct attached storage

HPC clusters have decoupled storage and compute nodes

Lustre* is the file system of choice for most HPC clusters

Lustre* is POSIX compliant: uses Java native file system

Lustre* – as the single storage platform for HPC & analytics – is easier to manage
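Because a POSIX-compliant file system is reached through ordinary file calls, analytics code needs no storage-specific client library to read data that a simulation wrote. A minimal sketch, assuming the shared file system is mounted as a normal directory; the helper `write_then_read` is hypothetical, and a temporary directory stands in for the mount point so that the example runs anywhere.

```python
import os
import tempfile


def write_then_read(mount_point: str, name: str, payload: bytes) -> bytes:
    """Write a file with plain POSIX-style calls, then read it back the same way."""
    path = os.path.join(mount_point, name)
    with open(path, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push the data to storage
    with open(path, "rb") as handle:
        return handle.read()


# A temporary directory stands in for a shared mount point (assumption).
with tempfile.TemporaryDirectory() as stand_in_mount:
    result = write_then_read(stand_in_mount, "simulation_output.bin", b"checkpoint-001")
    print(result == b"checkpoint-001")
```

The same calls would work unchanged against any directory the operating system exposes, which is the property the slide relies on when it proposes one storage platform for both HPC and analytics.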

(14)
(15)

Basic Science

Computing Sciences to make a better world

Government & Research

Commerce & Industry

New Users & New Uses

Business Transformation

Data-Driven Discovery

Better Products

Faster Time to Market

Reduced R&D

From Diagnosis to personalized treatments quickly

Genomics

Clinical Information

Transform data into useful knowledge

“My goal is simple. It is complete understanding of the universe, why it is as it is, and why it exists at all”

(16)

Computing Science to help save lives

(17)

Data-Driven Discovery

[Diagram] Data sources: Genome Data, EMR, Clinical Trials (Life Sciences); Sensor Data, Images, Sim Data (Physical Sciences); Census Data, Text, A/V, Surveys (Social Sciences)

Uses: Drug Discovery, Treatment Optimization, Hypothesis Formation, Modeling & Prediction, Astronomy, Particle Physics, Public Policy, Trend Analysis

(18)

Data-Driven Discovery in Science

1 human genome = 1 petabyte

Finding patterns in clinical and genome data at scale can help cure cancer and other diseases.

(19)

Reducing the Cost of

[Chart: cost on a logarithmic axis from $100,000,000 down to $1,000, plotted by year from 2001 to 2013]

Source: National Human Genome Research Project

(20)

Data-Intensive Discovery: Genomics

Value

Enable researchers to discover biomarkers and drug targets by correlating genomic data sets

Analytics

Provide curated data sets with pre-computed analysis (classification, correlation, biomarkers)

Provide APIs for applications to combine and analyze public and private data sets

Data Management

Use Hive and Hadoop for query and search

Dynamically partition and scale HBase

Intel Distribution

(21)

Computing with Hadoop to make a better world

Government & Research

80,000 Scientific Documents

No doctor can read or analyse

Mahout Library for analytics

Data stored on HDFS

EU Project with leading universities and research hospitals.

(22)

Data-Driven Business

[Diagram] Layers: Data Management, Data Analysis, Data Value

Data sources: Content, CDR, IP Traffic (Telco); Product, Shop, Customer Behavior (Retail); Customer Behavior, Transactions (FSI)

Uses: Customer Service, Network Optimization, Product Innovation, Market Insight, Business Efficiency, Behavior Modeling, Fraud Analytics, Client Engagement

(23)

Enterprise Data Store with Hadoop

Value

300 million wireless subscribers

Enable subscriber access to billing data

30X gain in performance; lower TCO

Analytics

Provides real-time retrieval of 6 months data

Supports new BI with 15 types of queries

Enables targeted ad serving and promotions

Data Management

Use Hadoop/HBase for search and analysis

30 TB/month of billing data

300K reads/second; 800K inserts/second

133-node cluster / Intel Xeon E5 processors

CDR
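The figures on this slide can be turned into rough sizing numbers. The sketch below assumes the quoted read and insert rates are cluster-wide, that load is spread evenly across nodes, and that retained volume is raw data before replication or compression; none of these assumptions is stated on the slide, and `cluster_sizing` is an illustrative name.

```python
def cluster_sizing(tb_per_month, retention_months, nodes, reads_per_s, inserts_per_s):
    """Derive retained volume and per-node request rates from the slide figures."""
    return {
        "retained_tb": tb_per_month * retention_months,
        "reads_per_node": reads_per_s / nodes,
        "inserts_per_node": inserts_per_s / nodes,
    }


# Slide figures: 30 TB/month, 6 months retrievable, 133 nodes,
# 300K reads/second and 800K inserts/second.
sizing = cluster_sizing(30, 6, 133, 300_000, 800_000)
for key, value in sizing.items():
    print(f"{key}: {value:,.0f}")
```

On these assumptions the store holds about 180 TB of billing data, and each node handles on the order of 2,300 reads and 6,000 inserts per second.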

(24)

Intel IT Big Data Platform Components

MPP* Platform

– 3rd-party solution

– 100x faster than traditional systems

– Intel® Xeon® processor E7 family blades scale easily

Intel Distribution Of Hadoop

– Based on Apache Hadoop

– Optimized for Intel® Xeon processors, SSD and 10GbE (Up to 20x performance boost)

– Distributed file system that can scale linearly

– HBase NoSql DB

Predictive Analytics Engine

– In house development

– Enables real time, on-going Predictive service

– Intel® Xeon® processor E7 family

(25)

Big Data in Action at Intel

Test Time Reduction:

Predictive analytics in manufacturing to identify failing parts

Improve Quality & Increase Yield

Expected to save ~$200M in 2013

Malware Detection:

Analyzing ~4B access events per day at the system, network, & application levels to discover new malware threats before they arise

Reduce and prevent network intrusion

(26)

Data-Rich Communities: Smart City

Value

Enforce traffic laws and detect license fraud

Monitor and predict traffic patterns

In a city of 31 million people

Analytics

Detect traffic law violations automatically

Detect driver license fraud by data mining

Forecast traffic with predictive analytics

Data Management

30,000 cameras

6Mb/s stream rate per camera

15 PB of images in active use

2 billion records in HBase

Detection Prevention

Regional
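The camera figures on this slide imply an aggregate ingest rate and a rough retention window for the active image store. The sketch below assumes that "6Mb/s" means megabits per second, that every camera streams continuously, and that decimal units apply (1 PB = 1,000,000 GB); these are assumptions, and `ingest_profile` is an illustrative name.

```python
SECONDS_PER_DAY = 86_400


def ingest_profile(cameras, megabits_per_second_per_camera, active_store_pb):
    """Return (GB per second, PB per day, days of video the active store can hold)."""
    total_megabits_per_second = cameras * megabits_per_second_per_camera
    gigabytes_per_second = total_megabits_per_second / 8 / 1_000
    petabytes_per_day = gigabytes_per_second * SECONDS_PER_DAY / 1_000_000
    days_in_store = active_store_pb / petabytes_per_day
    return gigabytes_per_second, petabytes_per_day, days_in_store


# Slide figures: 30,000 cameras, 6 Mb/s per camera, 15 PB of images in active use.
gb_per_s, pb_per_day, days = ingest_profile(30_000, 6, 15)
print(f"{gb_per_s:.1f} GB/s, {pb_per_day:.3f} PB/day, {days:.1f} days in active store")
```

Under those assumptions the cameras produce about 22.5 GB/s, or just under 2 PB per day, so 15 PB of active images corresponds to roughly a week of footage.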

(27)

Driving innovation with big data analytics

European car manufacturer uses big data analytics to predict machine failure and build faster and safer cars.

Data is collected from sensors and CPUs embedded in the cars, and signals are sent to the Big Data Cloud for analysis.

Manufacturer predicts growth to >30 PB by 2015 and ~300 PB by 2018.

(28)

With strong support from strategic partners

(29)

Match methods to data

*Other brands and names are the property of their respective owners.

Structured Data

Poly-structured Data

Relational Databases

Next-Gen Analytics

Hadoop + NoSQL

(30)
(31)

Data-Driven Discovery in Science

600 million collisions / sec

Detecting 1 in 1 trillion events to help find the Higgs Boson
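Combining the two figures on this slide gives a feel for how rare the events of interest are. The sketch assumes the "1 in 1 trillion" ratio applies per collision and that the collision rate is constant; both are simplifying assumptions, and `expected_signal_rate` is an illustrative name.

```python
SECONDS_PER_DAY = 86_400


def expected_signal_rate(collisions_per_second, signal_fraction):
    """Return (events per second, seconds between events, events per day)."""
    per_second = collisions_per_second * signal_fraction
    return per_second, 1.0 / per_second, per_second * SECONDS_PER_DAY


# Slide figures: 600 million collisions per second, 1 event in 1 trillion.
per_second, seconds_between, per_day = expected_signal_rate(600e6, 1 / 1e12)
print(f"one event every {seconds_between / 60:.0f} minutes, about {per_day:.0f} per day")
```

On those assumptions an event of interest turns up about once every 28 minutes, or roughly 52 times per day, among more than 50 trillion collisions.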

What else is possible?

OpenLab with Intel - Intel Distribution for Apache Hadoop?

(32)

Bringing Hadoop* MapReduce to Lustre* Data

Hadoop* Adaptor for Lustre*

Available with Intel® Distribution of Apache Hadoop* software 3.0

Based on YARN (Apache Hadoop 2.x)

Packaged as a single Java* library (JAR)

Easy to deploy with minor changes

No change in the way jobs are submitted

[Diagram labels: Hadoop Compute Nodes, InfiniBand Interconnect, Lustre Storage Nodes]

(33)

Addressing the HPC Big Data Challenge

Intel® HPC Distribution for Apache Hadoop* Software

Intel® Manager for Hadoop* Software: Deployment, Configuration, Monitoring, Alerting and Security

Intel® Manager for Lustre* Software

Sqoop: Data Exchange
Flume: Log Collector
ZooKeeper: Coordination
Oozie: Workflow
Pig: Scripting
R: Statistics
Hive: SQL Query
Mahout: Machine Learning
HBase: Columnar Storage
Connectors
MPI

YARN (MRv2): Distributed Processing Framework

Moab, “Slurm”,…

HDFS: Hadoop Distributed File System

Lustre

(34)

Intel® HPC Distribution: Open Platform for High Performance Data Analytics

Performance

• Bring compute to the data: Run MapReduce* on Lustre* without code changes

• Run MapReduce* faster: Avoid the intermediate file shuffle with shared storage

Efficiency

• Avoid Hadoop* islands in the sea of HPC systems

• Run MapReduce jobs alongside HPC workloads with full access to the cluster resources

Manageability

• Use the seamless integration to manage one common platform for Hadoop and HPC

• Develop with multiple programming models and deploy on shared storage
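The point about avoiding the intermediate shuffle can be illustrated with a toy word count. In stock Hadoop, reducers fetch map output over the network from each mapper's local disk; with shared storage, map tasks can write partitioned output to a common directory and reducers can read it in place. The sketch below is a simplified illustration of that idea only, not the Hadoop adaptor's implementation: the function names and file layout are hypothetical, and a temporary directory stands in for the shared mount.

```python
import os
import tempfile
from collections import Counter, defaultdict


def run_map_task(task_id, lines, shared_dir, num_reducers):
    """Count words in one input split and write one file per reducer partition."""
    partitions = defaultdict(Counter)
    for line in lines:
        for word in line.split():
            # Deterministic partitioner so every mapper routes a word the same way.
            partitions[sum(word.encode()) % num_reducers][word] += 1
    for partition in range(num_reducers):
        path = os.path.join(shared_dir, f"map{task_id}-part{partition}.txt")
        with open(path, "w", encoding="utf-8") as out:
            for word, count in sorted(partitions[partition].items()):
                out.write(f"{word}\t{count}\n")


def run_reduce_task(partition, num_map_tasks, shared_dir):
    """Read this partition's files straight from shared storage and sum the counts."""
    totals = Counter()
    for task_id in range(num_map_tasks):
        path = os.path.join(shared_dir, f"map{task_id}-part{partition}.txt")
        with open(path, encoding="utf-8") as src:
            for row in src:
                word, count = row.rstrip("\n").split("\t")
                totals[word] += int(count)
    return totals


splits = [["big data for big science"], ["big science on lustre"]]
NUM_REDUCERS = 2

with tempfile.TemporaryDirectory() as shared_dir:  # stand-in for a shared mount
    for task_id, lines in enumerate(splits):
        run_map_task(task_id, lines, shared_dir, NUM_REDUCERS)
    word_counts = Counter()
    for partition in range(NUM_REDUCERS):
        word_counts.update(run_reduce_task(partition, len(splits), shared_dir))

print(dict(word_counts))
```

No data is copied between tasks here: each reducer opens the files for its partition directly, which is the step that a shared file system makes possible.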

(35)

Join the BETA program

Early adopters of the combined “Intel Distribution for Apache Hadoop” Software and “Intel EE for Lustre” Software solution will receive a free, exclusive limited-use version of the software and exchange insights with Intel experts.

To be considered for the BETA, please contact Intel:

[email protected]

(36)
(37)

For more information

hadoop.intel.com

intel.com/BigData

(38)

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel, Intel Xeon, Intel Xeon Phi, the Intel Xeon Phi logo, the Intel Xeon logo and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.

Other names and brands may be claimed as the property of others. Copyright © 2013, Intel Corporation. All rights reserved.
