CSE-E5430 Scalable Cloud Computing
Lecture 4
Keijo Heljanko
Department of Computer Science School of Science
Aalto University [email protected]
Hadoop - Linux of Big Data
I Hadoop = Open Source Distributed Operating System
Distribution for Big Data
I Based on Google architecture design
I Fault tolerant distributed filesystem: HDFS
I Batch processing systems: Hadoop MapReduce and
Apache Pig (HDD), Apache Spark (RAM)
I Parallel SQL implementations for analytics: Apache Hive,
Cloudera Impala, Spark SQL, Facebook Presto
I Fault tolerant distributed database: HBase
I Distributed machine learning libraries, text indexing &
“SQL on Hadoop is the New Black”
I Many new analytics SQL implementations on top of Hadoop
I Apache Hive
I Cloudera Impala
I Spark SQL
Hadoop Books
I I can warmly recommend the book:
I Tom White: “Hadoop: The Definitive Guide, Second
Edition”, O’Reilly Media, 2010. Print ISBN:
978-1-4493-8973-4, Ebook ISBN: 978-1-4493-8974-1.
http://www.hadoopbook.com/
I An alternative book is:
I Chuck Lam: “Hadoop in Action”, Manning, 2010. Print
Recap of Hadoop Commands Used
I start-all.sh
I Starts all Hadoop daemons.
Hint: The homework virtual machine has a script start_all
in the WordCount directory that cleans up HDFS
and starts the Hadoop daemons in case something fails (due to an unclean shutdown of the Hadoop daemons, etc.)
Recap of Hadoop Commands Used (cont.)
I hadoop namenode -format
I Formats the HDFS filesystem
I hadoop fs -mkdir foo
I Creates a directory called “foo” in the HDFS home
directory of the user
I hadoop fs -rmr foo
I Removes the directory/file called “foo” in the HDFS home
directory of the user
I hadoop fs -copyFromLocal foo.txt bar
I Copies the file “foo.txt” to the HDFS directory/file called
Recap of Hadoop Commands Used (cont.)
I hadoop fs -ls foo
I Does a directory/file listing for directory called “foo” in the
HDFS home directory of the user
I hadoop jar wc.jar WordCount foo bar
I Runs the MapReduce job in the file “wc.jar”, in the class
“WordCount” using the HDFS directory “foo” as input and
stores the output in a new HDFS directory “bar” to be
Recap of Hadoop Commands Used (cont.)
I hadoop fs -copyToLocal foo bar
I Copies the directory/file “foo” in the user HDFS home
directory to the local file system directory/file “bar”
I stop-all.sh
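The commands recapped above can be strung together into one session. The following is a minimal sketch, not the course's own script: the `run` helper, the `DRY_RUN` variable, and the choice of "foo" as the upload target are assumptions made for illustration. By default it only prints the commands, so it can be read or tried without a Hadoop installation. The one-time `hadoop namenode -format` step is deliberately left out, since it wipes the HDFS metadata.

```shell
#!/bin/sh
# Sketch of a WordCount session using only the commands recapped above.
# Assumption: DRY_RUN defaults to 1, so commands are printed, not executed.
# Set DRY_RUN=0 on a machine with Hadoop installed to actually run them.
# Error handling is omitted to keep the sketch short.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

wordcount_workflow() {
    run start-all.sh                          # start all Hadoop daemons
    run hadoop fs -mkdir foo                  # create input directory in HDFS home
    run hadoop fs -copyFromLocal foo.txt foo  # upload the local input file
    run hadoop jar wc.jar WordCount foo bar   # run the job, output to new dir "bar"
    run hadoop fs -ls bar                     # list the job output
    run hadoop fs -copyToLocal bar bar        # fetch the results to local disk
    run stop-all.sh                           # stop all Hadoop daemons
}

wordcount_workflow
```

Note that the job fails if the output directory "bar" already exists in HDFS, so remove it with `hadoop fs -rmr bar` before rerunning.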
Hadoop Default ports
I Use the URL: http://localhost:50070 to access the NameNode through a Web browser. Hint: You can also browse the HDFS filesystem through this URL.
I Use the URL: http://localhost:8088 to access the YARN resource manager to see the status of your MapReduce jobs; hit refresh to update the view
Hadoop NameNode and SecondaryNameNode
Locking
I Hadoop NameNode and SecondaryNameNode use locking to prevent accidentally starting two daemons
that modify the same files, which would lead to HDFS inconsistency
I If either daemon gets killed uncleanly, it might leave lock files around and prevent a new NameNode or SecondaryNameNode from starting
I The lock files are called data/in_use.lock,
name/in_use.lock, and
namesecondary/in_use.lock.
I They are located in the dfs subdirectory of hadoop.tmp.dir
that can be set in core-site.xml (by default
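To check for leftover lock files after an unclean shutdown, something like the following sketch could be used. The function name `list_hdfs_locks` and passing the hadoop.tmp.dir location as an argument are assumptions for illustration; the script only lists the three lock files named above and deletes nothing.

```shell
#!/bin/sh
# List leftover HDFS lock files under a given hadoop.tmp.dir.
# Assumption: the directory is passed as the first argument.
# The function only prints what it finds; it never removes anything.
list_hdfs_locks() {
    tmp_dir="$1"
    for sub in data name namesecondary; do
        lock="$tmp_dir/dfs/$sub/in_use.lock"
        if [ -e "$lock" ]; then
            echo "$lock"
        fi
    done
}
```

Usage: `list_hdfs_locks /path/to/hadoop.tmp.dir`. Remove a listed file only after confirming that no Hadoop daemon is still running (for example with the JDK tool `jps`), because the locks exist precisely to stop two daemons from modifying the same files.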
Commercial Hadoop Support
I Cloudera: A Hadoop distribution based on Apache Hadoop + patches. Available from:
http://www.cloudera.com/
I Hortonworks: A Yahoo! spin-off of its Hadoop development team, working on mainline Apache Hadoop.
http://www.hortonworks.com/
I MapR: A rewrite of much of Apache Hadoop in C++, including a new filesystem. API-compatible with Apache Hadoop.
Hadoop YARN - MapReduce 2.0
I The JobTracker of MapReduce 1.0 is replaced by a general-purpose cluster resource manager to allow MapReduce to co-exist more nicely with other programming frameworks, e.g., Apache Spark, on the same cluster
Example of Hadoop Use: Hadoop-BAM + SeqPig
The rest of Lecture 4 is an example of Hadoop research use and is not part of the exam requirements.
Genomics Research
I Genomics research has high economic relevance: It is used by biomedicine researchers, hospital diagnostics, food industries, agronomy, pharmaceutical industries, . . .
I Example use cases of NGS data analytics include:
I Human genetics (involving the detection of genotype
variations that relate to diseases)
I Personalized medicine (which will rely critically on the
ability to rapidly and reliably map the genetic make-up of individual patients)
I Genomics selection for agriculture (which will exploit
genomics data in farmed livestock to select the next generation of breeding animals and crop plants)
I All these use cases require advanced analytics methods to be developed to cope with the data growth rate
Next Generation Sequencing and Big Data
I The amount of NGS data worldwide is predicted to double every 5 months
I This growth is much faster than Moore’s law for the growth
rate of computing (historically transistor counts have
doubled every 18-24 months), Kryder’s law for the growth of
storage capacity (historically doubling approx. every 13
months), and Butters’ law for growth in optical
communications bandwidth (historically doubling approx.
every 9 months)
I Without increased expenditure in distributed computing methods genomics research will hit computational limits
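To make the gap between these doubling times concrete, the growth over a fixed period can be computed as 2^(months / doubling time). The sketch below does this for a 5-year horizon; the helper name `growth_factor` and the 60-month horizon are assumptions chosen for illustration. With a 5-month doubling time the data grows 2^12 = 4096-fold in 5 years, while even the fastest of the hardware trends (9-month doubling) yields only about a 100-fold increase.

```shell
#!/bin/sh
# growth_factor MONTHS DOUBLING_MONTHS
# Prints 2^(MONTHS / DOUBLING_MONTHS): the growth over the given period
# for a quantity that doubles every DOUBLING_MONTHS months.
growth_factor() {
    LC_ALL=C awk -v m="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 2 ^ (m / d) }'
}

# Growth over 5 years (60 months) for the doubling times quoted above.
echo "NGS data, 5 months:        $(growth_factor 60 5)x"
echo "Moore's law, 18 months:    $(growth_factor 60 18)x"
echo "Moore's law, 24 months:    $(growth_factor 60 24)x"
echo "Kryder's law, 13 months:   $(growth_factor 60 13)x"
echo "Butters' law, 9 months:    $(growth_factor 60 9)x"
```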
NGS Datasets
I The datasets are large: The 1000 Genomes project
(http://www.1000genomes.org) is a freely available data set
of human genomes, > 450 TB as of March 2013
I A single file to be processed can be in the order of 50-100
Hadoop-BAM
I A library to interface NGS data formats with Hadoop
I Includes tools for, e.g., sorting reads, as needed when merging the results of parallel read aligners
I Supported file formats: BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF
I Version 7.0 of hadoop-BAM released:
http://sourceforge.net/projects/hadoop-bam/
I 2500+ downloads of the library
I Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P.,
Korpelainen, E., and Heljanko, K.: Hadoop-BAM: Directly Manipulating Next Generation Sequencing Data in the
Cloud. Bioinformatics 28(6):876-877, 2012. (http:
SeqPig
I Parallel scripting language for NGS data sets based on the Apache Pig language
I Compiles into Java, executed by Hadoop MapReduce
I SQL-like functionality with helper functions for NGS data: Filtering data, computing aggregate statistics, doing joins
I Supported file formats: BAM, SAM, FASTQ, QSEQ, and FASTA
I Schumacher, A., Pireddu, L., Niemenmaa, M., Kallio, A.,
Korpelainen, E., Zanetti, G., and Heljanko, K.: SeqPig: Simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics 30 (1): 119-120, 2014. (dx.doi.org/10.1093/bioinformatics/btt601.)
I See also supplement:
http://bioinformatics.oxfordjournals.org/content/suppl/2013/10/17/btt601.DC1/
Scalability of SeqPig
[Plot: speedup versus FastQC (y-axis) against number of worker nodes (x-axis); series: avg readqual, read length, basequal stats, GC contents, all at once]
Figure: Scalability of SeqPig vs. sequential FastQC. Computing statistics on a 61.4 GB input file with a Hadoop cluster of up to 63 computers
SeqPig Benefits and Drawbacks
I Benefits:
I Automatic parallelization of data processing scripts
I Easy to learn scripting language with full power of
MapReduce
I Most scripts are at most tens of lines of code vs. hundreds
to thousands of lines of Java
I Also allows calling back user defined functions written in
Java/Python
I Implements SQL-like functionality
I Drawbacks:
I MapReduce has a 10+ second startup delay: Not for
interactive use
MapReduce for Read Alignment
I The SEAL system is a parallelization of the BWA read alignment tool on top of Hadoop
(http://biodoop-seal.sourceforge.net/)
I It allows multiple BWA instances running on the same physical machine to share the large index of the reference sequence
I For more details, see the paper: “Luca Pireddu, Simone Leo, Gianluigi Zanetti: SEAL: a distributed short read mapping and duplicate removal tool. Bioinformatics 27(15): 2159-2160 (2011)”
Current Research Topics
I Focus on cloud based data analysis for Big Data
I MapReduce (Hadoop) and Apache Spark scalable batch processing technologies in the cloud
I Scalable datastores: HBase (Hadoop Database) is a potentially interesting cloud-based datastore also for bioinformatics datasets
I Interactive analytics using Hadoop based parallel SQL engines for analytics: Hive, Spark SQL, Cloudera Impala, Facebook Presto
I Lambda architecture: Adding a real-time processing and serving layer to the Hadoop stack