• No results found

Cloud-based Analytics and Map Reduce

N/A
N/A
Protected

Academic year: 2021

Share "Cloud-based Analytics and Map Reduce"

Copied!
40
0
0

Loading.... (view fulltext now)

Full text

(1)

1 1

Cloud-based Analytics and Map Reduce

(2)

INTEL CONFIDENTIAL 2

2

Datasets

• Many technologies converging around “Big Data” theme

Cloud Computing, NoSQL, Graph Analytics

• Biology is becoming increasingly data intensive

Sequencing, imaging, and other instruments

• Computing with big datasets is fundamentally different than big compute on small datasets

(3)

INTEL CONFIDENTIAL 3

3

Cloud-based Analytics

• Must be specific when talking cloud

IaaS, PaaS, SaaS, ...

• Basic cloud ingredients

Self-service on-demand compute and storage Commodity hardware

High capacity

Everything has an API

• Leading example: Amazon Web Services

• Honorable mention: Google Compute Engine, OpenStack, CloudStack, VMWare

(4)

INTEL CONFIDENTIAL 4

4

Good News

Mapping Informatics to the Cloud

Good News: You don’t have to change much

Using the basic IaaS building blocks we can handle

most traditional use cases

Clusters, client-server, N-tier deployments

Running standard HPC clusters in the cloud is simple

StarCluster:

http://star.mit.edu/cluster/

Create EC2 clusters in minutes

OpenMPI, ATLAS, Lapack, NumPy, SciPy SGE, IPython, Condor, MPICH2 plugins

(5)

INTEL CONFIDENTIAL 5

5

Mapping Informatics to the Cloud

• Cloud computing is democratizing access to IT infrastructure resources

Anyone can have access to massive compute and storage resources in

minutes

This changes the way we solve scientific problems

• Cloud computing is not a silver bullet for scalability

• Cloud providers have primarily focused on horizontal scaling and not on HPC

(6)

INTEL CONFIDENTIAL 6

6

Map Reduce

• We need a computing framework that is...

able to handle huge datasets - 1TB+

massively parallel - runs on commodity hardware fault tolerant - hardware fails

locality aware - moving computation is cheaper than moving data easy to use

(7)

INTEL CONFIDENTIAL 7

7

Map Reduce

• Original paper by Google in 2004

Introduced a simplified parallel processing model Used to build Google search indexes

Users specify a Map function and Reduce function

Framework manages task distribution, orchestration, data

(8)

INTEL CONFIDENTIAL 8

8

Map Reduce

• Map Reduce is a simple approach to parallel programming • Existing algorithms must be translated into one or more

map/reduce steps • Batch oriented

• Requires a distributed filesystem

• Map Reduce can be implemented on top of MPI

(9)

INTEL CONFIDENTIAL 9

9

Apache Hadoop

• Free implementation derived from Google MapReduce - written in Java

• Composed of many complementary projects

– Core – set of components and interfaces for distributed filesystems and general I/O serialization

– MapReduce – distributed data processing model and execution environment

(10)

INTEL CONFIDENTIAL 10

1

Hadoop-related projects at Apache

Hadoop Ecosystem

• Hadoop has a large ecosystem of tools

– HBase - non-relational, column-orientated database that runs on HDFS

– Pig - data flow language for exploring datasets

– Hive - distributed data warehouse with SQL-like query language

(11)

INTEL CONFIDENTIAL 11

1

Hadoop Components

• Core consists of compute layer (MapReduce) and storage layer (HDFS)

• Alternatives to HDFS – Amazon S3

GlusterFS Lustre

• Many distributions/flavors of Hadoop exist – Apache

Cloudera

Amazon Elastic Map Reduce Intel

(12)

INTEL CONFIDENTIAL 12

1

Core improvements and enterprise features

Intel Hadoop

• Encrypted HDFS • Faster job launch

• Optimized for SSDs and 10GbE networking • Accelerated Hive queries

• Premium support • Intel Manager

(13)

INTEL CONFIDENTIAL 13

1

Hadoop Job Flow

(14)

INTEL CONFIDENTIAL 14

1

(15)

INTEL CONFIDENTIAL 15

1

Translating Workloads to Map Reduce

• Parallel programming requires parallel thinking

Domain decomposition

Exploit natural parallelism

• A map reduce job assumes independent mappers and reducers running in parallel on individual slices of data • Share-nothing architecture - avoid communication and

global data structures

(16)

INTEL CONFIDENTIAL 16

1

Translating Workloads to Map Reduce

• Genomic analysis is ideally suited for the Hadoop framework

large, semi-structured, file-based data, parallel IO parallel processing by reads, samples, genes, etc.

• Hadoop interfaces exist for C, Java, Perl, R, ...

• Hadoop Streaming allows any executables to be mappers and reducers

(17)

INTEL CONFIDENTIAL 17

1

Unix Pipes for Hadoop

Enter Hadoop Streaming

• Utility that comes with Hadoop distribution • Use any mapper and reducer

Perl Bash cat/wc Binaries

• Easiest way to get started with Hadoop

(18)

INTEL CONFIDENTIAL 18

1

Revolution R with Hadoop

• Series of R connectors to Hadoop features

• Write MapReduce jobs in R using Hadoop Streaming • Import tables from HDFS and HBase

• https://github.com/RevolutionAnalytics/RHadoop

(19)

INTEL CONFIDENTIAL 19

1

Hadoop on AWS

Elastic MapReduce

• Amazon realized customers were spending a lot of time configuring and operating Hadoop clusters on EC2

• EMR is a service that runs on EC2 and handles set-up, teardown, and other Hadoop details

• Charged by instance-hour

Instances can be terminated automatically when your job finishes

• Adds ‘Job Flows’ feature

(20)

INTEL CONFIDENTIAL 20

2

Features

Elastic MapReduce

• Elastic

Add nodes to a running cluster

New: variable node count in each flow step

New: easier to dynamically resize # nodes in use

• Stores inputs and outputs in S3

New: can use multi-part HTTP upload if configured

• Easy to Use – Job Flows, Debugging

• Support for On-Demand and Reserved Instances

Recent support for Spot and VPC instances!

• Support for bootstrap actions

(21)

INTEL CONFIDENTIAL 21

2

Creating Job Flows

(22)

INTEL CONFIDENTIAL 22

2

Monitoring and Debugging

(23)

INTEL CONFIDENTIAL 23

2

Debugging

(24)

INTEL CONFIDENTIAL 24

2

Translating Workloads to Map Reduce

• The most efficient programs require use of Java and

understanding Hadoop internals

• Hadoop has more scheduling and execution overhead

compared to some traditional HPC environments

Hadoop can be integrated with cluster schedulers like SGE

• Moving large amounts of data into HDFS can be slow

Filesystem alternatives include S3, GlusterFS, Lustre, http HDFS can greatly benefit from SSDs

(25)

INTEL CONFIDENTIAL 25

2

A few tips

Data Movement

• Primary hurdle in adopting the cloud

• Avoid using SCP or other TCP based transfers unless you tune your settings

http://www.psc.edu/index.php/networking/641-tcp-tune

• Alternative transports: GridFTP, Aspera fasp, Bit Torrent • AWS offers physical import/export via FedEx

• Aggregate the data within your job if possible

(26)

INTEL CONFIDENTIAL 26

2

Cloud-scale RNA-sequencing differential expression

analysis with Myrna

Life Science Example

• Langmead et al. Genome Biology 2010, 11:R83 • RNA-Seq analysis pipeline

Focused on differential expression analysis between genes

Complementary to whole transcriptome assembly (e.g. cufflinks)

• Workflow contains 7 stages

(27)

INTEL CONFIDENTIAL 27

2

Stage 1 - Preprocess

• Process FASTQ input list

Optional dump from .sra format

• Assign sample names • Copy into HDFS

(28)

INTEL CONFIDENTIAL 28

2

Stage 2 - Align

• Align reads to reference genome using bowtie

• Each node independently obtains the bowtie index from local or shared filesystem (hdfs)

(29)

INTEL CONFIDENTIAL 29

2

Stage 3 - Overlap

• Calculate overlaps between alignment and predefined gene intervals

• Aggregate counts for each genomic feature • Parallel across alignments

(30)

INTEL CONFIDENTIAL 30

3

Stage 4 - Normalize

• Calculate normalization factor based on count distribution • Parallel across genetic feature labels

(31)

INTEL CONFIDENTIAL 31

3

Stage 5 - Statistical analysis

• Fits a linear model relating the counts to the outcome using R

• Uses values calculated from Align and Overlap stages • Parallel across genes

(32)

INTEL CONFIDENTIAL 32

3

Stage 6 - Summarize

• Significance summaries such as P-values and gene-specific counts are calculated

• Outputs a list of top N genes ranked by false discovery rate

Hadoop takes care of sorting

• This stage is serial

(33)

INTEL CONFIDENTIAL 33

3

Stage 7 - Postprocess

Discards overlaps not belonging to top genes

Creates output files, summary tables, and plots

Compressed and stored in user specified output

directory

(34)

INTEL CONFIDENTIAL 34

3

Myrna Performance

• Uses standard bioinformatics tools bowtie and R/Bioconductor in a Hadoop job flow

• Workflow broken into many stages to take advantage of parallelism

(35)

INTEL CONFIDENTIAL 35

3

Summary

• The most popular genomics algorithms will eventually get Hadoop implementations

The other 80% will not...

• Hadoop is well suited for processing large unstructured data offline

• Hadoop is not well suited for communication heavy jobs or real-time processing

• Hadoop can be run locally or integrated into existing HPC infrastructure

• HBase and other products run on Hadoop taking advantage of framework features

(36)

INTEL CONFIDENTIAL 36

3

Summary

• Building a Hadoop cluster

Use dense storage nodes

Boosting HDFS performance

Use SSD drives

Faster interconnects

Replace HDFS with S3, GlusterFS, Lustre

• Running a Hadoop cluster

Job monitoring and debugging requires additional tooling AWS Elastic Map Reduce product for EC2

(37)

INTEL CONFIDENTIAL 37

3

Observations

• The cloud is making good parallel programming techniques more important than ever

Message passing, threading, distributed systems

• Understand the difference between vertical and horizontal scaling

Use both!

• Cloud best practices are finding their way back into local infrastructure/HPC

(38)

INTEL CONFIDENTIAL 38

3

References

• http://developer.yahoo.com/hadoop/tutorial/module4.html#d ataflow • http://markusklems.files.wordpress.com/2008/07/mapreduc e.png • http://bowtie-bio.sourceforge.net/myrna/index.shtml

(39)

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY

ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED

WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT,

COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results

to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel

microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice

Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

(40)

References

Related documents

● Destroy two monitoring wells, MW-37 and MW-42, in accordance with the Riverside County Environmental Health Department and the State of California well destruction procedures

Conclusion: Women in this study had inadequate knowledge and inappropriate practice related to mammography as a procedure for breast cancer investigation.. Pan African

For it can be assumed that the person defines his or her life authorship and perceives its authorship aspects (subjectivity, personal resources, autonomy,

Chapter 2: Developing approaches to managing municipal revenue in a sustainable manner.4.

Although constrained by data availability, the evidence suggests that the dominant effect of subsidies was to increase social security registration of firms and workers rather

Job satisfaction is an important field of research. Job satisfaction is a crucial element for every enterprise. A bank or any enterprise cannot reach its target if

Firstly, we compare SRIG with popular penalized methods such as Lasso, Ridge regression, Adaptive Lasso (ALasso) and Elastic net (Enet) which do not use the predictor graph

The goal of this study was stability analysis of the upstream slope of earthen dams using the finite element method against sudden change in the water surface of the reservoir in