CSE-E5430 Scalable Cloud Computing. Lecture 4

(1)

CSE-E5430 Scalable Cloud Computing

Lecture 4

Keijo Heljanko

Department of Computer Science School of Science

Aalto University [email protected]

(2)

CSE-E5430 Scalable Cloud Computing 2/23

Hadoop - Linux of Big Data

I Hadoop = Open Source Distributed Operating System

Distribution for Big Data

I _{Based on Google architecture design}

I _{Fault tolerant distributed filesystem: HDFS}

I _{Batch processing systems: Hadoop MapReduce and}

Apache Pig (HDD), Apache Spark (RAM)

I Parallel SQL implementations for analytics: Apache Hive,

Cloudera Impala, Spark SQL, Facebook Presto

I Fault tolerant distributed database: HBase

I _{Distributed machine learning libraries, text indexing &}

(3)

(4)

“SQL on Hadoop is the New Black”

I Many new analytics SQL implementations on top of Hadoop

I Apache Hive

I Cloudera Impala

I _{Spark SQL}

(5)

Hadoop Books

I I can warmly recommend the Book:

I _{Tom White: “Hadoop: The Definitive Guide, Second}

Edition”, O’Reilly Media, 2010. Print ISBN:

978-1-4493-8973-4, Ebook ISBN: 978-1-4493-8974-1.

http://www.hadoopbook.com/

I An alternative book is:

I _{Chuck Lam: “Hadoop in Action”, Manning, 2010. Print}

(6)

Recap of Hadoop Commands Used

I start-all.sh

I Starts all Hadoop daemons.

Hint: See the homework virtual machine scriptstart all

in theWordCountdirectory for a script to clean up HDFS

and start Hadoop daemons in case something fails (due to unclean shutdown of Hadoop daemons, etc.)

(7)

Recap of Hadoop Commands Used (cnt.)

I hadoop namenode -format

I Formats the HDFS filesystem

I hadoop fs -mkdir foo

I _{Creates a directory called “}foo” in the HDFS home

directory of the user

I hadoop fs -rmr foo

I _{Removes the directory/file called “}foo” in the HDFS home

directory of the user

I hadoop fs -copyFromLocal foo.txt bar

I Copies the file “foo.txt” to the HDFS directory/file called

(8)

Recap of Hadoop Commands Used (cnt.)

I hadoop fs -ls foo

I _{Does a directory/file listing for directory called “}foo” in the

HDFS home directory of the user

I hadoop jar wc.jar WordCount foo bar

I _{Runs MapReduce job in file “}wc.jar” in the q class

“WordCount” using the HDFS directory “foo” as input and

stores the output in a new HDFS directory “bar” to be

(9)

Recap of Hadoop Commands Used (cnt.)

I hadoop fs -copyToLocal foo bar

I Copies the directory/file “foo” in the user HDFS home

directory to the local file system directory/file “bar”

I stop-all.sh

(10)

Hadoop Default ports

I Use the URL:http://localhost:50070to access the NameNode through a Web browser. Hint: You can also browse HDFS filesystem through this URL.

I Use the URL:http://localhost:8088to access the YARN resource manager to see the status of your MapReduce jobs, hit refresh to refresh the view

(11)

Hadoop NameNode and SecondaryNameNode

Locking

I Hadoop Namenode and SecondaryNameNode use locking to disallow the accidental starting of two daemons

modifying the same files and leading to HDFS inconsistency

I If either daemon gets killed uncleanly, it they might leave lock files around and prevent a new NameNode or SecondaryNameNode from starting

I The lock files are calleddata/in use.lock,

name/in use.lock, and

namesecondary/in use.lock.

I They are located indfssubdirectory ofhadoop.tmp.dir

that can be set incore-site.xml(by default

(12)

Commercial Hadoop Support

I Cloudera: A Hadoop distribution based on Apache Hadoop + patches. Available from:

http://www.cloudera.com/

I Hortonworks: Yahoo! spin-off of their Hadoop development team, will be working on mainline Apache Hadoop.

http://www.hortonworks.com/

I MapR: A rewrite of much of Apache Hadoop in C++, including a new filesystem. API-compatible with Apache Hadoop.

(13)

Hadoop YARN - MapReduce 2.0

I Jobtracker of MapReduce 1.0 is replaced by a general purpose cluster resource manager to allow MapReduce to co-exists more nicely with other programming frameworks, e.g., Apache Spark on the same cluster

(14)

Example of Hadoop Use: Hadoop-BAM + SeqPig

The rest of Lecture 4 will be an example of Hadoop research use andis not be part of the exam requirements.

(15)

Genomics Research

I Genomics research has high economic relevance: It is used by biomedicine researchers, hospital diagnostics, food industries, agronomy, pharmaceutical industries, . . .

I Example use cases of NGS data analytics include:

I _{Human genetics}_{(involving the detection of genotype}

variations that relate to diseases)

I _{Personalized medicine}_{(which will rely critically on the}

ability to rapidly and reliably map the genetic make-up of individual patients)

I Genomics selection for agriculture(which will exploit

genomics data in farmed livestock to select the next generation of breeding animals and crop plants)

I All these use cases require advanced analytics methods to be developed to cope with the data growth rate

(16)

Next Generation Sequencing and Big Data

I The amount ofNGS data worldwide is predicted to double every 5 months

I This growth is much faster thanMoore’s lawfor the growth

rate of computing (historically transistor counts have

doubled every18-24 months),Kryder’s lawfor the growth of

storage capacity (historically doubling approx every13

months), andButter’s lawfor growth in optical

communications bandwidth (historically doubling approx

every9 months)

I Without increased expenditure in distributed computing methods genomics research will hit computational limits

(17)

NGS Datasets

I The datasets are large: The 1000 Genomes project

(http://www.1000genomes.org) is a freely available>

450 TB data set of human genomes as of March 2013 I A single file to be processed can be in the order of 50-100

(18)

Hadoop-BAM

I A library to interface NGS data formats with Hadoop

I Includes tools for e.g., sorting of reads, as needed by merging results of parallel read aligners

I Supported fileformats: BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF

I Version 7.0 of thehadoop-BAMreleased:

http://sourceforge.net/projects/hadoop-bam/ I 2500+ Downloads of the library

I Niemenmaa, M., Kallio, A., Schumacher, A., Klemel ¨a, P.,

Korpelainen, E., and Heljanko, K.: Hadoop-BAM: Directly Manipulating Next Generation Sequencing Data in the

Cloud. Bioinformatics 28(6):876-877, 2012. (http:

(19)

SeqPig

I Parallel scripting language for NGS data setsbased on the Apache Pig language

I Compiles into Java, executed by Hadoop MapReduce

I SQL-like functionality with helper functions for NGS data: Filtering data, computing aggregate statistics, doing joins

I Supported fileformats: BAM, SAM, FASTQ, QSEQ, and FASTA

I Schumacher, A., Pireddu, L., Niemenmaa, M., Kallio, A.,

Korpelainen, E., Zanetti, G., and Heljanko, K.: SeqPig: Simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics 30 (1): 119-120, 2014. (dx.doi.org/10.1093/bioinformatics/btt601.)

I See also supplement:

http://bioinformatics.oxfordjournals.org/ content/suppl/2013/10/17/btt601.DC1/

(20)

Scalability of SeqPig

0 10 20 30 40 50 60 0 10 20 30 40 50 60 70

Speedup versus FastQC

Worker nodes avg readqual read length basequal stats GC contents all at once

Figure:Scalabilty of SeqPig vs sequential FastQC. Computing statistics on 61.4 GB input file with up to 63 computer Hadoop cluster

(21)

SeqPig Benefits and Drawbacks

I Benefits:

I _{Automatic parallelization of data processing scripts}

I _{Easy to learn scripting language with full power of}

MapReduce

I _{Most scripts are at most tens of lines of code vs. hundreds}

to thousands of lines of Java

I _{Also allows calling back user defined functions written in}

Java/Python

I _{Implements SQL like functionality}

I Drawbacks:

I _{MapReduce has 10+ second startup delay: No for}

interactive use

(22)

MapReduce for Read Alignment

I The SEAL system is a parallelization of the BWA read alignment tool on top of Hadoop

(http://biodoop-seal.sourceforge.net/)

I It allows multiple BWA instances running on the same physical machine to share the large index of the reference sequence

I For more details, see the paper: “Luca Pireddu, Simone Leo, Gianluigi Zanetti: SEAL: a distributed short read mapping and duplicate removal tool. Bioinformatics 27(15): 2159-2160 (2011)”

(23)

Current Research Topics

I Focus on cloud based data analysis for Big Data

I MapReduce (Hadoop) and Apache Spark scalable batch processing technologies in the cloud

I Scalable datastores: HBase (Hadoop Database) is a potentially interesting cloud based datastore for also bioinformatics datasets

I Interactive analytics using Hadoop based parallel SQL engines for analytics: Hive, Spark SQL, Cloudera Impala, Facebook Presto

I Lambda architecture: Adding real-time processing and serving layer into Hadoop stack