Cloud-based Analytics and Map Reduce

(1)

1 1

Cloud-based Analytics and Map Reduce

(2)

INTEL CONFIDENTIAL 2

2 Datasets

• Many technologies converging around “Big Data” theme

– _{Cloud Computing, NoSQL, Graph Analytics}

• Biology is becoming increasingly data intensive

– _{Sequencing, imaging, and other instruments}

• Computing with big datasets is fundamentally different than big compute on small datasets

(3)

3 Cloud-based Analytics

• Must be specific when talking cloud

– _{IaaS, PaaS, SaaS, ...}

• Basic cloud ingredients

– _{Self-service on-demand compute and storage} – _{Commodity hardware}

– _{High capacity}

– _{Everything has an API}

• Leading example: Amazon Web Services

• Honorable mention: Google Compute Engine, OpenStack, CloudStack, VMWare

(4)

4 Good News

Mapping Informatics to the Cloud

• Good News: You don’t have to change much

• Using the basic IaaS building blocks we can handle

most traditional use cases

– _{Clusters, client-server, N-tier deployments}

• Running standard HPC clusters in the cloud is simple

• StarCluster:

http://star.mit.edu/cluster/

– _{Create EC2 clusters in minutes}

– _{OpenMPI, ATLAS, Lapack, NumPy, SciPy} – _{SGE, IPython, Condor, MPICH2 plugins}

(5)

5 Mapping Informatics to the Cloud

• Cloud computing is democratizing access to IT infrastructure resources

– _{Anyone can have access to massive compute and storage resources in}

minutes

– _{This changes the way we solve scientific problems}

• Cloud computing is not a silver bullet for scalability

• Cloud providers have primarily focused on horizontal scaling and not on HPC

(6)

6 Map Reduce

• We need a computing framework that is...

– _{able to handle huge datasets - 1TB+}

– _{massively parallel - runs on commodity hardware} – _{fault tolerant - hardware fails}

– _{locality aware - moving computation is cheaper than moving data} – _{easy to use}

(7)

7 Map Reduce

• Original paper by Google in 2004

– _{Introduced a simplified parallel processing model} – _{Used to build Google search indexes}

– _{Users specify a Map function and Reduce function}

– _{Framework manages task distribution, orchestration, data}

(8)

8 Map Reduce

• Map Reduce is a simple approach to parallel programming • Existing algorithms must be translated into one or more

map/reduce steps • Batch oriented

• Requires a distributed filesystem

• Map Reduce can be implemented on top of MPI

(9)

9 Apache Hadoop

• Free implementation derived from Google MapReduce - written in Java

• Composed of many complementary projects

– Core – set of components and interfaces for distributed filesystems and general I/O serialization

– MapReduce – distributed data processing model and execution environment

(10)

1 Hadoop-related projects at Apache

Hadoop Ecosystem

• Hadoop has a large ecosystem of tools

– HBase - non-relational, column-orientated database that runs on HDFS

– Pig - data flow language for exploring datasets

– Hive - distributed data warehouse with SQL-like query language

(11)

1 Hadoop Components

• Core consists of compute layer (MapReduce) and storage layer (HDFS)

• Alternatives to HDFS – _{Amazon S3}

– _GlusterFS – _Lustre

• Many distributions/flavors of Hadoop exist – _Apache

– _Cloudera

– _{Amazon Elastic Map Reduce} – _Intel

(12)

1 Core improvements and enterprise features

Intel Hadoop

• Encrypted HDFS • Faster job launch

• Optimized for SSDs and 10GbE networking • Accelerated Hive queries

• Premium support • Intel Manager

(13)

1 Hadoop Job Flow

(14)

1

(15)

1

Translating Workloads to Map Reduce

• Parallel programming requires parallel thinking

– _{Domain decomposition}

– _{Exploit natural parallelism}

• A map reduce job assumes independent mappers and reducers running in parallel on individual slices of data • Share-nothing architecture - avoid communication and

global data structures

(16)

1

Translating Workloads to Map Reduce

• Genomic analysis is ideally suited for the Hadoop framework

– _{large, semi-structured, file-based data, parallel IO} – _{parallel processing by reads, samples, genes, etc.}

• Hadoop interfaces exist for C, Java, Perl, R, ...

• Hadoop Streaming allows any executables to be mappers and reducers

(17)

1 Unix Pipes for Hadoop

Enter Hadoop Streaming

• Utility that comes with Hadoop distribution • Use any mapper and reducer

– _Perl – _Bash – _cat/wc – _Binaries

• Easiest way to get started with Hadoop

(18)

1 Revolution R with Hadoop

• Series of R connectors to Hadoop features

• Write MapReduce jobs in R using Hadoop Streaming • Import tables from HDFS and HBase

• https://github.com/RevolutionAnalytics/RHadoop

•

(19)

1 Hadoop on AWS

Elastic MapReduce

• Amazon realized customers were spending a lot of time configuring and operating Hadoop clusters on EC2

• EMR is a service that runs on EC2 and handles set-up, teardown, and other Hadoop details

• Charged by instance-hour

– _{Instances can be terminated automatically when your job finishes}

• Adds ‘Job Flows’ feature

(20)

2 Features

Elastic MapReduce

• Elastic

– _{Add nodes to a running cluster}

– _{New: variable node count in each flow step}

– _{New: easier to dynamically resize # nodes in use}

• Stores inputs and outputs in S3

– _{New: can use multi-part HTTP upload if configured}

• Easy to Use – Job Flows, Debugging

• Support for On-Demand and Reserved Instances

– _{Recent support for Spot and VPC instances!}

• Support for bootstrap actions

(21)

2 Creating Job Flows

(22)

2 Monitoring and Debugging

(23)

2 Debugging

(24)

2

Translating Workloads to Map Reduce

• The most efficient programs require use of Java and

understanding Hadoop internals

• Hadoop has more scheduling and execution overhead

compared to some traditional HPC environments

– _{Hadoop can be integrated with cluster schedulers like SGE}

• Moving large amounts of data into HDFS can be slow

– _{Filesystem alternatives include S3, GlusterFS, Lustre, http} – _{HDFS can greatly benefit from SSDs}

(25)

2 A few tips

Data Movement

• Primary hurdle in adopting the cloud

• Avoid using SCP or other TCP based transfers unless you tune your settings

– _{http://www.psc.edu/index.php/networking/641-tcp-tune}

• Alternative transports: GridFTP, Aspera fasp, Bit Torrent • AWS offers physical import/export via FedEx

• Aggregate the data within your job if possible

(26)

2 Cloud-scale RNA-sequencing differential expression

analysis with Myrna

Life Science Example

• Langmead et al. Genome Biology 2010, 11:R83 • RNA-Seq analysis pipeline

– _{Focused on differential expression analysis between genes}

– _{Complementary to whole transcriptome assembly (e.g. cufflinks)}

• Workflow contains 7 stages

(27)

2 Stage 1 - Preprocess

• Process FASTQ input list

– _{Optional dump from .sra format}

• Assign sample names • Copy into HDFS

(28)

2 Stage 2 - Align

• Align reads to reference genome using bowtie

• Each node independently obtains the bowtie index from local or shared filesystem (hdfs)

(29)

2 Stage 3 - Overlap

• Calculate overlaps between alignment and predefined gene intervals

• Aggregate counts for each genomic feature • Parallel across alignments

(30)

3 Stage 4 - Normalize

• Calculate normalization factor based on count distribution • Parallel across genetic feature labels

(31)

3 Stage 5 - Statistical analysis

• Fits a linear model relating the counts to the outcome using R

• Uses values calculated from Align and Overlap stages • Parallel across genes

(32)

3 Stage 6 - Summarize

• Significance summaries such as P-values and gene-specific counts are calculated

• Outputs a list of top N genes ranked by false discovery rate

– _{Hadoop takes care of sorting}

• This stage is serial

(33)

3 Stage 7 - Postprocess

• Discards overlaps not belonging to top genes

• Creates output files, summary tables, and plots

• Compressed and stored in user specified output

3 Myrna Performance

• Uses standard bioinformatics tools bowtie and R/Bioconductor in a Hadoop job flow

• Workflow broken into many stages to take advantage of parallelism

(35)

3 Summary

• The most popular genomics algorithms will eventually get Hadoop implementations

– _{The other 80% will not...}

• Hadoop is well suited for processing large unstructured data offline

• Hadoop is not well suited for communication heavy jobs or real-time processing

• Hadoop can be run locally or integrated into existing HPC infrastructure

• HBase and other products run on Hadoop taking advantage of framework features

(36)

3 Summary

• Building a Hadoop cluster

– _{Use dense storage nodes}

– _{Boosting HDFS performance}

– _{Use SSD drives}

– _{Faster interconnects}

– _{Replace HDFS with S3, GlusterFS, Lustre}

• Running a Hadoop cluster

– _{Job monitoring and debugging requires additional tooling} – _{AWS Elastic Map Reduce product for EC2}

(37)

3 Observations

• The cloud is making good parallel programming techniques more important than ever

– _{Message passing, threading, distributed systems}

• Understand the difference between vertical and horizontal scaling

– _{Use both!}

• Cloud best practices are finding their way back into local infrastructure/HPC

(38)

3 References

• http://developer.yahoo.com/hadoop/tutorial/module4.html#d ataflow • http://markusklems.files.wordpress.com/2008/07/mapreduc e.png • http://bowtie-bio.sourceforge.net/myrna/index.shtml

(39)

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY

ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED

WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT,

COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results

to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel

microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice

(40)