• No results found

Mambo Running Analytics on Enterprise Storage

N/A
N/A
Protected

Academic year: 2021

Share "Mambo Running Analytics on Enterprise Storage"

Copied!
67
0
0

Loading.... (view fulltext now)

Full text

(1)

Mambo

Running Analytics on

Enterprise Storage

 Jingxin Feng,

Xing Lin

1

, Gokul Soundararajan

(2)

Motivation

(3)

Motivation

(4)

Motivation

No easy way to analyze data stored in enterprise storage (NFS)

Analytics System

Production System

Data

Data

(5)

Motivation

No easy way to analyze data stored in enterprise storage (NFS)

(6)

Motivation

No easy way to analyze data stored in enterprise storage (NFS)

Analytics System

Production System

Data

Data

§

Bank

§

AutoSupport

(7)

Motivation

(8)

Motivation

No easy way to analyze data stored in enterprise storage (NFS)

Analytics System

§

Separate infrastructures for production systems and analytics systems

Production System

Data

Data

(9)

Motivation

No easy way to analyze data stored in enterprise storage (NFS)

(10)

Motivation

No easy way to analyze data stored in enterprise storage (NFS)

Analytics System

Data Copying

§

Separate infrastructures for production systems and analytics systems

data silos

Production System

Data

Data

(11)

Motivation

No easy way to analyze data stored in enterprise storage (NFS)

§

Separate infrastructures for production systems and analytics systems

§

Problems

§ 

Copying PBs of data is time consuming

(12)

Motivation

No easy way to analyze data stored in enterprise storage (NFS)

Analytics System

Data Copying

§

Separate infrastructures for production systems and analytics systems

§

Problems

§ 

Copying PBs of data is time consuming

§ 

3 × storage overhead in HDFS

data silos

Production System

Data

Data

(13)

Motivation

No easy way to analyze data stored in enterprise storage (NFS)

§

Separate infrastructures for production systems and analytics systems

§

Problems

§ 

Copying PBs of data is time consuming

§ 

3 × storage overhead in HDFS

data silos

(14)

Motivation

No easy way to analyze data stored in enterprise storage (NFS)

Analytics System

Data Copying

§

Separate infrastructures for production systems and analytics systems

§

Problems

§ 

Copying PBs of data is time consuming

§ 

3 × storage overhead in HDFS

data silos

Production System

Data

Data

§ 

Periodical re-synchronization later on

(15)

Mambo

An NFS connector, enabling direct analytics for data on Enterprise storage (NFS)

Production System

Analytics System

Data

Data

(16)

Mambo

An NFS connector, enabling direct analytics for data on Enterprise storage (NFS)

Production System

Analytics System

Data

(17)

Mambo

§ 

Remove data copying

§ 

Remove storage overhead (single copy)

An NFS connector, enabling direct analytics for data on Enterprise storage (NFS)

Production System

Analytics System

Data

Data

Connector

§ 

Remove data re-synchronization

(18)

Mambo

§ 

Remove data copying

§ 

Remove storage overhead (single copy)

An NFS connector, enabling direct analytics for data on Enterprise storage (NFS)

Production System

Analytics System

Data

Data

Connector

Copying is not required; you can do analytics in-place

§ 

Remove data re-synchronization

(19)
(20)

Project History

From Research to Product

Talked with customers

Developed initial prototype

Madalin Mihailescu refined prototype

Added a distributed cache

Obtained traces from UC Berkeley

Published in FAST’13

Xing Lin refactored code for Hadoop 2.0

Optimized for 10 Gb networks

Obtained legal approval for open-source

Posted to GitHub

Customer Proof-of-Concepts (PoCs)

Pushing to merge into Hadoop

2011

2012

(21)

Use Cases

(22)

Analyze Enterprise Data In-place

Job

User jobs

Compute layer

MapReduce

File System

Yarn

HDFS

Resource management

layer

Storage layer

Generic file system layer

(23)

Analyze Enterprise Data In-place

Job

User jobs

Compute layer

MapReduce

File System

Yarn

Resource management

layer

MapReduce

File System

Yarn

Generic file system layer

(24)

Analyze Enterprise Data In-place

Job

User jobs

Compute layer

MapReduce

File System

Yarn

HDFS

Resource management

layer

Storage layer

MapReduce

File System

Yarn

NFS

• HDFS gets swapped out with NFS*

• Apache Hadoop does not get modified.

• User programs (jobs) are not modified.

(25)

§

Use NetApp FlexClones for creating test environments quickly

Production

NetApp Storage

On Premises

(26)

§

Use NetApp FlexClones for creating test environments quickly

§

Use a copy of production data for realistic Test/QA environments (e.g.,

AutoSupport)

Production

NetApp Storage

On Premises

Test/QA

FlexClone

(27)

Use cloud to Analyze Data

Private Storage

On-premise

(onsite)

Public cloud

Secondary private storage at a colocation facility (e.g., Equinix), for backup

and fast restoration with cloud

(28)

Use cloud to Analyze Data

Private Storage

On-premise

(onsite)

Private Storage

Colocated facility

Public cloud

Secondary private storage at a colocation facility (e.g., Equinix), for backup

and fast restoration with cloud

Express

interconnection

(29)

Use cloud to Analyze Data

SnapMirror

Private Storage

On-premise

(onsite)

Private Storage

Colocated facility

Public cloud

Secondary private storage at a colocation facility (e.g., Equinix), for backup

and fast restoration with cloud

Express

interconnection

(30)

Use cloud to Analyze Data

SnapMirror

Launch Hadoop in the cloud and use data on private storage

Private Storage

On-premise

(onsite)

Private Storage

Colocated facility

Public cloud

Secondary private storage at a colocation facility (e.g., Equinix), for backup

and fast restoration with cloud

Express

interconnection

(31)
(32)

Architecture Overview

(33)

Architecture Overview

(34)

Architecture Overview

Mambo: an NFS client in Java, implementing the Hadoop generic file system API

(35)

Architecture Overview

Mambo: an NFS client in Java, implementing the Hadoop generic file system API

(36)

Architecture Overview

Mambo: an NFS client in Java, implementing the Hadoop generic file system API

HDFS

Amazon S3 GlusterFS

Azure

NFS

§ 

No changes to Hadoop framework

(37)

Tight integration with Hadoop/MapReduce

§

Optimized for large sequential I/O (e.g., 1MB IO)

§

Commit data to disk only when a task succeeds

§

Intelligent prefetching for streaming reads; aware of

(38)

Implementation

File System

Hadoop generic filesystem API

YARN

MapReduce

Computation frameworks

(39)

Implementation

File System

Hadoop generic filesystem API

YARN

MapReduce

NFS File System

ü  FS metadata OPs

ü  NFS client protocol

ü  File reads

Computation frameworks

Resource management layer

(40)

Implementation

File System

Hadoop generic filesystem API

YARN

MapReduce

NFS File System

ü  FS metadata OPs

ü  NFS client protocol

ü  File reads

ü  File writes

Computation frameworks

Resource management layer

NFSv3FileSystem.java

NFSv3FileSystemStore.java

NFSv3InputStream.java

(41)

Implementation

File System

Hadoop generic filesystem API

YARN

MapReduce

NFS File System

ü  FS metadata OPs

ü  NFS client protocol

ü  File reads

Computation frameworks

Resource management layer

NFSv3FileSystem.java

NFSv3FileSystemStore.java

NFSv3InputStream.java

(42)

How to Use it?

§

Source code (jar library file)

§

Get code from GitHub

§

Compile the code

§

Install the jar file

§

Copy the jar file to the library directory for Hadoop installation

§

Only need to modify two configuration files

§

core-site.xml (hadoop core configuration file)

(43)

How to Use it?

§

Source code (jar library file)

§

Get code from GitHub

§

Compile the code

§

Install the jar file

§

Copy the jar file to the library directory for Hadoop installation

§

Only need to modify two configuration files

§

core-site.xml (hadoop core configuration file)

§

nfs-mapping.json (nfs configuration file)

(44)

Configure core-site.xml

Property

Value

fs.defaultFS

hdfs://namenode:54310/

HDFS

NFS

Property

Value

fs.defaultFS

nfs://nfsserver:2049/

fs.nfs.configuration

<path-to-configuration-file>

fs.nfs.impl

org.apache.hadoop.fs.nfs.

NFSv3FileSystem

fs.AbstractFileSystem.nfs.impl

org.apache.hadoop.fs.nfs.

NFSv3AbstractFileSystem

(45)

Configure nfs-mapping.json

§

Configurable properties

§

Export path

§

Read/write sizes

§

Split size (Hadoop task granularity)

§

Authentication method (supporting AUTH_NONE or AUTH_UNIX)

§

§

Supports multiple controllers (for NetApp clustered ONTAP)

(46)
(47)

Highlights from MixApart

MixApart: NFS connector + data prefetcher + local disk as

cache

Better performance with NFS connector than Hadoop with

ingest (18%~26% reduction in job duration)

• 

Overlaps data ingest with task computation

Matches ideal Hadoop (data ingested into HDFS

(48)

Scaling experiments

How does the NFS Connector scale with more storage and compute?

8 Nodes (FAS 8080) with 48 HDDs each and 8 10Gb links each

28 Nodes (UCS B230M2) with 20 CPU cores and 256 GB RAM

Cluster of

NFS servers

(49)

Scaling TeraGen

0.00

0.50

1.00

1.50

2.00

2.50

3.00

3.50

4.00

Running T

ime

(N

o

rm

al

ize

d

)

1 HDD

8 HDD

(50)

Scaling TeraGen

0.00

0.50

1.00

1.50

2.00

2.50

3.00

3.50

4.00

279

373

466

559

652

745

838

931

1024

Running T

ime

(N

o

rm

al

ize

d

)

Data Size

(in GB)

1 HDD

8 HDD

5 ×

(51)

Scaling TeraGen

0.00

0.50

1.00

1.50

2.00

2.50

3.00

3.50

4.00

Running T

ime

(N

o

rm

al

ize

d

)

1 HDD

8 HDD

5 ×

NFS connector scales well

for large datasets.

(52)

Scaling TeraSort

0.00

0.50

1.00

1.50

2.00

2.50

3.00

3.50

4.00

279

373

466

559

652

745

838

931

1024

Running T

ime

(N

o

rm

al

ize

d

)

Data Size

(in GB)

1 HDD

8 HDD

(53)

Overcome NFS server bottleneck

Optimize with Caching

Compute

nodes

(54)

Overcome NFS server bottleneck

§

Real workloads are cacheable

1

Optimize with Caching

NFS servers

Compute

nodes

(55)

Overcome NFS server bottleneck

§

Real workloads are cacheable

1

Optimize with Caching

Compute

nodes

FlashCache

FlashCache

FlashCache

FlashCache

(56)

Overcome NFS server bottleneck

§

Real workloads are cacheable

1

Optimize with Caching

NFS servers

Compute

nodes

FlashCache

FlashCache

FlashCache

FlashCache

Flash cache tier

(57)

Overcome NFS server bottleneck

§

Real workloads are cacheable

1

Optimize with Caching

Distributed in-memory cache tier

Compute

nodes

FlashCache

FlashCache

FlashCache

FlashCache

Flash cache tier

(58)

Next steps

(59)

Future Work

Productization within NetApp

Support pNFS protocol

(60)

Future Work

Productization within NetApp

Support pNFS protocol

Security (Kerberos authentication)

Integration tests with other frameworks

(61)

Future Work

Productization within NetApp

Support pNFS protocol

Security (Kerberos authentication)

Integration tests with other frameworks

Tachyon, HBase, Spark, and etc.

Production System Integration

NetApp Auto Support (ASUP) Team

(62)

We Need Your Help

§

Anyone interested

§ 

Try it out and tell us how it works

§ 

Filing bugs

§

Hadoop committers

§ 

Help to push NFS connector into Hadoop mainstream

§

Help integration tests with other frameworks (Tachyon, HBase, etc)

(63)

References

§

Connector Information

§ 

http://www.netapp.com/us/solutions/big-data/nfs-connector-hadoop.aspx

§

Public on GitHub:

§ 

https://github.com/NetApp/NetApp-Hadoop-NFS-Connector

§

Technical Report:

§ 

http://www.netapp.com/us/media/tr-4382.pdf

§

Paper at FAST’13

§ 

MixApart: De-Coupled Analytics for Shared Storage Servers

(64)

Summary

§

NetApp NFS connector for Hadoop

§ 

Allows analytics to use any NFS

§ 

An open implementation (no proprietary code) – contribute back to Hadoop

§ 

Works with Apache Hadoop, Apache Spark, Tachyon, and Apache HBase

(65)

Summary

§

NetApp NFS connector for Hadoop

§ 

Allows analytics to use any NFS

§ 

An open implementation (no proprietary code) – contribute back to Hadoop

§ 

Works with Apache Hadoop, Apache Spark, Tachyon, and Apache HBase

§ 

In many cases, only configuration file change is needed (no source code changes)

§

NetApp NFS connector for Hadoop is being deployed

§ 

Internal testing with other teams

(66)

Acknowledgements

§

Madalin Mihailescu for starting down this path

§

Kaladhar Voruganti, Scott Dawkins, Jeff Heller, AJ Mahajan, and Siva

Jayasenan for supporting this effort

§

Karthikeyan Nagalingam for validation and customer PoCs

§

NetApp AutoSupport team for testing it in production

(67)

Thank you

References

Related documents

KEY WORDS area at risk, clinical trial, edema, endpoint, infarct size, magnetic resonance imaging, myocardial infarction, STEMI. APPENDIX For supplemental material, please see

These research areas include models developed for policy assessment, generation capacity expansion, financial instruments, demand side management, mixing methods,

Switching function: 3/2 way, 2/2 way possible with accessoires De-energized state: NC (normally closed), NO (normally open) Valve body: aluminium, coated.

• Optimal system design ü Power generation maximization ü Cost reduction • Monitoring system configuration design • Management of construction schedule and safety

When assessing a planning application for both single and two-storey extensions, two methods for applying the 45-degree rule will be used: Method 1: Considers the depth and width

With the USA the most dominant economic power after World War II, it has not surprisingly been common in many countries in Europe to try to reshape their doctoral training so that

Role: Involving the use of forensic applications to identify and recover evidence held on computer hard drives, mobile telephones and other digital media which have been

Conclusions: Our data demonstrate a different immune-mediated inflammatory burden in heart failure patients with reduced or preserved ejection fraction, as well as