Hadoop in Practice 2nd Edition {PRG} pdf

(1)

Alex Holmes

SECOND EDITION

M A N N I N G

IN

P

RACTICE

(2)

Praise for the First Edition of

Hadoop in Practice

A new book from Manning, Hadoop in Practice, is definitely the most modern book on the topic. Important subjects, like what commercial variants such as MapR offer, and the many different releases and APIs get uniquely good coverage in this book.

—Ted Dunning, Chief Application Architect, MapR Technologies

Comprehensive coverage of advanced Hadoop usage, including high-quality code samples.

—Chris Nauroth, Senior Staff Software Engineer The Walt Disney Company

A very pragmatic and broad overview of Hadoop and the Hadoop tools ecosystem, with a wide set of interesting topics that tickle the creative brain.

—Mark Kemna, Chief Technology Officer, Brilig

A practical introduction to the Hadoop ecosystem.

—Philipp K. Janert, Principal Value, LLC

This book is the horizontal roof that each of the pillars of individual Hadoop technology books hold. It expertly ties together all the Hadoop ecosystem technologies. —Ayon Sinha, Big Data Architect, Britely

I would take this book on my path to the future.

—Alexey Gayduk, Senior Software Engineer, Grid Dynamics

A high-quality and well-written book that is packed with useful examples. The breadth and detail of the material is by far superior to any other Hadoop reference guide. It is perfect for anyone who likes to learn new tools/technologies while following pragmatic, real-world examples.

(3)

(4)

Hadoop in Practice

Second Edition

ALEX HOLMES

(5)

www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department Manning Publications Co. 20 Baldwin Road

PO Box 761

Shelter Island, NY 11964 Email: [email protected]

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning

Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Development editor: Cynthia Kane Manning Publications Co. Copyeditor: Andy Carroll 20 Baldwin Road Proofreader: Melody Dolab Shelter Island, NY 11964 Typesetter: Gordan Salinovic

Cover designer: Marija Tudor

ISBN 9781617292224

Printed in the United States of America

(6)

v

brief contents

P

ART

1 B

ACKGROUNDANDFUNDAMENTALS

...1

1

■

Hadoop in a heartbeat

3

2

■

Introduction to YARN

22 P

ART

2 D

ATALOGISTICS

...59

3

■

Data serialization—working with text and beyond

61

4

■

Organizing and optimizing data in HDFS

139

5

■

Moving data into and out of Hadoop

174 P

ART

3 B

IG DATAPATTERNS

... 253

6

■

Applying MapReduce patterns to big data

255

7

■

Utilizing data structures and algorithms at scale

302

8

■

Tuning, debugging, and testing

337 P

ART

4 B

EYOND

M

AP

R

EDUCE

... 385

9

■

SQL on Hadoop

387

(7)

(8)

vii

P

ART

1 B

ACKGROUND

AND

FUNDAMENTALS

...1

1 Hadoop in a heartbeat

3

1.1 What is Hadoop?

4

Core Hadoop components 5 ■ The Hadoop ecosystem 10 Hardware requirements 11 ■ Hadoop distributions 12 ■ Who’s using Hadoop? 14 ■ Hadoop limitations 15

1.2 Getting your hands dirty with MapReduce

17

1.3 Summary

21

2 Introduction to YARN

22

2.1 YARN overview

23

Why YARN? 24 ■ YARN concepts and components 26 YARN configuration 29

(9)

TECHNIQUE 2 Running a command on your YARN cluster 31 TECHNIQUE 3 Accessing container logs 32

TECHNIQUE 4 Aggregating container log files 36 YARN challenges 39

2.2 YARN and MapReduce

40

Dissecting a YARN MapReduce application 40 ■ Configuration 42 Backward compatibility 46

TECHNIQUE 5 Writing code that works on Hadoop versions 1 and 2 47

Running a job 48

TECHNIQUE 6 Using the command line to run a job 49 Monitoring running jobs and viewing archived jobs 49 Uber jobs 50

TECHNIQUE 7 Running small MapReduce jobs 50

2.3 YARN applications

52

NoSQL 53 ■ Interactive SQL 54 ■ Graph processing 54 Real-time data processing 55 ■ Bulk synchronous parallel 55 MPI 56 ■ In-memory 56 ■ DAG execution 56

2.4 Summary

57 P

ART

2 D

ATA

LOGISTICS

...59

3 Data serialization—working with text and beyond

61

3.1 Understanding inputs and outputs in MapReduce

62

Data input 63 ■ Data output 66

3.2 Processing common serialization formats

68

XML 69

TECHNIQUE 8 MapReduce and XML 69 JSON 72

TECHNIQUE 9 MapReduce and JSON 73

3.3 Big data serialization formats

76

Comparing SequenceFile, Protocol Buffers, Thrift, and Avro 76 SequenceFile 78

TECHNIQUE 10 Working with SequenceFiles 80 TECHNIQUE 11 Using SequenceFiles to encode Protocol

Buffers 87

Protocol Buffers 91 ■ Thrift 92 ■ Avro 93

(10)

CONTENTS _ix

TECHNIQUE 13 Selecting the appropriate way to use Avro in MapReduce 98

TECHNIQUE 14 Mixing Avro and non-Avro data in MapReduce 99 TECHNIQUE 15 Using Avro records in MapReduce 102

TECHNIQUE 16 Using Avro key/value pairs in MapReduce 104 TECHNIQUE 17 Controlling how sorting works in

MapReduce 108 TECHNIQUE 18 Avro and Hive 108 TECHNIQUE 19 Avro and Pig 111

3.4 Columnar storage

113

Understanding object models and storage formats 115 ■ Parquet and the Hadoop ecosystem 116 ■ Parquet block and page sizes 117

TECHNIQUE 20 Reading Parquet files via the command line 117

TECHNIQUE 21 Reading and writing Avro data in Parquet with Java 119

TECHNIQUE 22 Parquet and MapReduce 120 TECHNIQUE 23 Parquet and Hive/Impala 125

TECHNIQUE 24 Pushdown predicates and projection with Parquet 126

Parquet limitations 128

3.5 Custom file formats

129

Input and output formats 129

TECHNIQUE 25 Writing input and output formats for CSV 129 The importance of output committing 137

3.6 Chapter summary

138

4 Organizing and optimizing data in HDFS

139

4.1 Data organization

140

Directory and file layout 140 ■ Data tiers 141 ■ Partitioning 142 TECHNIQUE 26 Using MultipleOutputs to partition your

data 142

TECHNIQUE 27 Using a custom MapReduce partitioner 145 Compacting 148

TECHNIQUE 28 Using filecrush to compact data 149 TECHNIQUE 29 Using Avro to store multiple small binary

files 151 Atomic data movement 157

4.2 Efficient storage with compression

158

(11)

TECHNIQUE 31 Compression with HDFS, MapReduce, Pig, and Hive 163

TECHNIQUE 32 Splittable LZOP with MapReduce, Hive, and Pig 168

4.3 Chapter summary

173

5 Moving data into and out of Hadoop

174

5.1 Key elements of data movement

175

5.2 Moving data into Hadoop

177

Roll your own ingest 177

TECHNIQUE 33 Using the CLI to load files 178 TECHNIQUE 34 Using REST to load files 180

TECHNIQUE 35 Accessing HDFS from behind a firewall 183 TECHNIQUE 36 Mounting Hadoop with NFS 186

TECHNIQUE 37 Using DistCp to copy data within and between clusters 188

TECHNIQUE 38 Using Java to load files 194

Continuous movement of log and binary files into HDFS 196 TECHNIQUE 39 Pushing system log messages into HDFS with

Flume 197

TECHNIQUE 40 An automated mechanism to copy files into HDFS 204

TECHNIQUE 41 Scheduling regular ingress activities with Oozie 209

Databases 214

TECHNIQUE 42 Using Sqoop to import data from MySQL 215 HBase 227

TECHNIQUE 43 HBase ingress into HDFS 227

TECHNIQUE 44 MapReduce with HBase as a data source 230 Importing data from Kafka 232

5.3 Moving data into Hadoop

234

TECHNIQUE 45 Using Camus to copy Avro data from Kafka into HDFS 234

5.4 Moving data out of Hadoop

241

Roll your own egress 241

TECHNIQUE 46 Using the CLI to extract files 241 TECHNIQUE 47 Using REST to extract files 242 TECHNIQUE 48 Reading from HDFS when behind a

firewall 243

TECHNIQUE 49 Mounting Hadoop with NFS 243

(12)

CONTENTS _xi

TECHNIQUE 51 Using Java to extract files 245 Automated file egress 246

TECHNIQUE 52 An automated mechanism to export files from HDFS 246

Databases 247

TECHNIQUE 53 Using Sqoop to export data to MySQL 247 NoSQL 251

5.5 Chapter summary

252 P

ART

3 B

IG

DATA

PATTERNS

...253

6 Applying MapReduce patterns to big data

255

6.1 Joining

256

TECHNIQUE 54 Picking the best join strategy for your data 257 TECHNIQUE 55 Filters, projections, and pushdowns 259

Map-side joins 260

TECHNIQUE 56 Joining data where one dataset can fit into memory 261

TECHNIQUE 57 Performing a semi-join on large datasets 264 TECHNIQUE 58 Joining on presorted and prepartitioned

data 269 Reduce-side joins 271

TECHNIQUE 59 A basic repartition join 271

TECHNIQUE 60 Optimizing the repartition join 275 TECHNIQUE 61 Using Bloom filters to cut down on shuffled

data 279 Data skew in reduce-side joins 283

TECHNIQUE 62 Joining large datasets with high join-key cardinality 284

TECHNIQUE 63 Handling skews generated by the hash partitioner 286

6.2 Sorting

287

Secondary sort 288

TECHNIQUE 64 Implementing a secondary sort 289 Total order sorting 294

TECHNIQUE 65 Sorting keys across multiple reducers 294

6.3 Sampling

297

(13)

7 Utilizing data structures and algorithms at scale

302

7.1 Modeling data and solving problems with graphs

303

Modeling graphs 304 ■ Shortest-path algorithm 304 TECHNIQUE 67 Find the shortest distance between two

users 305 Friends-of-friends algorithm 313 TECHNIQUE 68 Calculating FoFs 313

Using Giraph to calculate PageRank over a web graph 319

7.2 Modeling data and solving problems with graphs

321

TECHNIQUE 69 Calculate PageRank over a web graph 322

7.3 Bloom filters

326

TECHNIQUE 70 Parallelized Bloom filter creation in MapReduce 328

7.4 HyperLogLog

333

A brief introduction to HyperLogLog 333

TECHNIQUE 71 Using HyperLogLog to calculate unique counts 335

7.5 Chapter summary

336

8 Tuning, debugging, and testing

337

8.1 Measure, measure, measure

338

8.2 Tuning MapReduce

339

Common inefficiencies in MapReduce jobs 339 TECHNIQUE 72 Viewing job statistics 340

Map optimizations 343

TECHNIQUE 73 Data locality 343

TECHNIQUE 74 Dealing with a large number of input splits 344

TECHNIQUE 75 Generating input splits in the cluster with YARN 346

Shuffle optimizations 347

TECHNIQUE 76 Using the combiner 347 TECHNIQUE 77 Blazingly fast sorting with binary

comparators 349

TECHNIQUE 78 Tuning the shuffle internals 353 Reducer optimizations 356

(14)

CONTENTS _xiii

TECHNIQUE 80 Using stack dumps to discover unoptimized user code 358

TECHNIQUE 81 Profiling your map and reduce tasks 360

8.3 Debugging

362

Accessing container log output 362

TECHNIQUE 82 Examining task logs 362 Accessing container start scripts 363

TECHNIQUE 83 Figuring out the container startup command 363

Debugging OutOfMemory errors 365

TECHNIQUE 84 Force container JVMs to generate a heap dump 365

MapReduce coding guidelines for effective debugging 365 TECHNIQUE 85 Augmenting MapReduce code for better de

bugging 365

8.4 Testing MapReduce jobs

368

Essential ingredients for effective unit testing 368 ■ MRUnit 370 TECHNIQUE 86 Using MRUnit to unit-test MapReduce 371

LocalJobRunner 378

TECHNIQUE 87 Heavyweight job testing with the LocalJobRunner 378

MiniMRYarnCluster 381

TECHNIQUE 88 Using MiniMRYarnCluster to test your jobs 381 Integration and QA testing 382

8.5 Chapter summary

383 P

ART

4 B

EYOND

M

AP

R

EDUCE

...385

9 SQL on Hadoop

387

9.1 Hive

388

Hive basics 388 ■ Reading and writing data 391 TECHNIQUE 89 Working with text files 391 TECHNIQUE 90 Exporting data to local disk 395

User-defined functions in Hive 396 TECHNIQUE 91 Writing UDFs 396

Hive performance 399

(15)

9.2 Impala

409

Impala vs. Hive 410 ■ Impala basics 410 TECHNIQUE 94 Working with text 410 TECHNIQUE 95 Working with Parquet 412 TECHNIQUE 96 Refreshing metadata 413

User-defined functions in Impala 414

TECHNIQUE 97 Executing Hive UDFs in Impala 415

9.3 Spark SQL

416

Spark 101 417 ■ Spark on Hadoop 419 ■ SQL with Spark 419 TECHNIQUE 98 Calculating stock averages with Spark SQL 420 TECHNIQUE 99 Language-integrated queries 422

TECHNIQUE 100 Hive and Spark SQL 423

9.4 Chapter summary

423

10 Writing a YARN application

425

10.1 Fundamentals of building a YARN application

426

Actors 426 ■ The mechanics of a YARN application 427

10.2 Building a YARN application to collect cluster statistics

429

TECHNIQUE 101 A bare-bones YARN client 429 TECHNIQUE 102 A bare-bones ApplicationMaster 434

TECHNIQUE 103 Running the application and accessing logs 438 TECHNIQUE 104 Debugging using an unmanaged application

master 440

10.3 Additional YARN application capabilities

443

RPC between components 443 ■ Service discovery 444

Checkpointing application progress 444 ■ Avoiding split-brain 444 Long-running applications 444 ■ Security 445

10.4 YARN programming abstractions

445

Twill 446 ■ Spring 448 ■ REEF 450 ■ Picking a YARN API abstraction 450

10.5 Summary

450 appendix

Installing Hadoop and friends

451 index

475 bonus chapters available for download from www.manning.com/holmes2

(16)

xv

preface

I first encountered Hadoop in the fall of 2008 when I was working on an internet crawl-and-analysis project at Verisign. We were making discoveries similar to those that Doug Cutting and others at Nutch had made several years earlier about how to effi-ciently store and manage terabytes of crawl-and-analyzed data. At the time, we were getting by with our homegrown distributed system, but the influx of a new data stream and requirements to join that stream with our crawl data couldn’t be supported by our existing system in the required timeline.

After some research, we came across the Hadoop project, which seemed to be a perfect fit for our needs—it supported storing large volumes of data and provided a compute mechanism to combine them. Within a few months, we built and deployed a MapReduce application encompassing a number of MapReduce jobs, woven together with our own MapReduce workflow management system, onto a small cluster of 18 nodes. It was a revelation to observe our MapReduce jobs crunching through our data in minutes. Of course, what we weren’t expecting was the amount of time that we would spend debugging and performance-tuning our MapReduce jobs. Not to men-tion the new roles we took on as producmen-tion administrators—the biggest surprise in this role was the number of disk failures we encountered during those first few months supporting production.

(17)

The greatest challenge we faced when working with Hadoop, and specifically MapReduce, was relearning how to solve problems with it. MapReduce is its own fla-vor of parallel programming, and it’s quite different from the in-JVM programming that we were accustomed to. The first big hurdle was training our brains to think MapReduce, a topic which the book Hadoop in Action by Chuck Lam (Manning Publi-cations, 2010) covers well.

After one is used to thinking in MapReduce, the next challenge is typically related to the logistics of working with Hadoop, such as how to move data in and out of HDFS and effective and efficient ways to work with data in Hadoop. These areas of Hadoop haven’t received much coverage, and that’s what attracted me to the potential of this book—the chance to go beyond the fundamental word-count Hadoop uses and cover-ing some of the trickier and dirtier aspects of Hadoop.

(18)

xvii

acknowledgments

First and foremost, I want to thank Michael Noll, who pushed me to write this book. He provided invaluable insights into how to structure the content of the book, reviewed my early chapter drafts, and helped mold the book. I can’t express how much his support and encouragement has helped me throughout the process.

I’m also indebted to Cynthia Kane, my development editor at Manning, who coached me through writing this book and provided invaluable feedback on my work. Among the many notable “aha!” moments I had when working with Cynthia, the big-gest one was when she steered me into using visual aids to help explain some of the complex concepts in this book.

All of the Manning staff were a pleasure to work with, and a special shout out goes to Troy Mott, Nick Chase, Tara Walsh, Bob Herbstman, Michael Stephens, Marjan Bace, Maureen Spencer, and Kevin Sullivan.

I also want to say a big thank you to all the reviewers of this book: Adam Kawa, Andrea Tarocchi, Anna Lahoud, Arthur Zubarev, Edward Ribeiro, Fillipe Massuda, Gerd Koenig, Jeet Marwah, Leon Portman, Mohamed Diouf, Muthuswamy Manigan-dan, Rodrigo Abreu, and Serega Sheypack. Jonathan Siedman, the primary technical reviewer, did a great job of reviewing the entire book.

Many thanks to Josh Wills, the creator of Crunch, who kindly looked over the chap-ter that covered that topic. And more thanks go to Josh Patchap-terson, who reviewed my Mahout chapter.

(19)

xviii

about this book

Doug Cutting, the creator of Hadoop, likes to call Hadoop the kernel for big data, and I would tend to agree. With its distributed storage and compute capabilities, Hadoop is fundamentally an enabling technology for working with huge datasets. Hadoop provides a bridge between structured (RDBMS) and unstructured (log files, XML, text) data and allows these datasets to be easily joined together. This has evolved from traditional use cases, such as combining OLTP and log files, to more sophisti-cated uses, such as using Hadoop for data warehousing (exemplified by Facebook) and the field of data science, which studies and makes new discoveries about data.

This book collects a number of intermediary and advanced Hadoop examples and presents them in a problem/solution format. Each technique addresses a specific task you’ll face, like using Flume to move log files into Hadoop or using Mahout for pre-dictive analysis. Each problem is explored step by step, and as you work through them, you’ll find yourself growing more comfortable with Hadoop and at home in the world of big data.

This hands-on book targets users who have some practical experience with Hadoop and understand the basic concepts of MapReduce and HDFS. Manning’s Hadoop in Action by Chuck Lam contains the necessary prerequisites to understand and apply the techniques covered in this book.

(20)

ABOUT THIS BOOK xix

Roadmap

This book has 10 chapters divided into four parts.

Part 1 contains two chapters that form the introduction to this book. They review Hadoop basics and look at how to get Hadoop up and running on a single host. YARN, which is new in Hadoop version 2, is also examined, and some operational tips are provided for performing basic functions in YARN.

Part 2, “Data logistics,” consists of three chapters that cover the techniques and tools required to deal with data fundamentals, how to work with various data formats, how to organize and optimize your data, and getting data into and out of Hadoop. Picking the right format for your data and determining how to organize data in HDFS are the first items you’ll need to address when working with Hadoop, and they’re cov-ered in chapters 3 and 4 respectively. Getting data into Hadoop is one of the bigger hurdles commonly encountered when working with Hadoop, and chapter 5 is dedi-cated to looking at a variety of tools that work with common enterprise data sources.

Part 3 is called “Big data patterns,” and it looks at techniques to help you work effec-tively with large volumes of data. Chapter 6 covers how to represent data such as graphs for use with MapReduce, and it looks at several algorithms that operate on graph data. Chapter 7 looks at more advanced data structures and algorithms such as graph pro-cessing and using HyperLogLog for working with large datasets. Chapter 8 looks at how to tune, debug, and test MapReduce performance issues, and it also covers a number of techniques to help make your jobs run faster.

Part 4 is titled “Beyond MapReduce,” and it examines a number of technologies that make it easier to work with Hadoop. Chapter 9 covers the most prevalent and promising SQL technologies for data processing on Hadoop, and Hive, Impala, and Spark SQL are examined. The final chapter looks at how to write your own YARN appli-cation, and it provides some insights into some of the more advanced features you can use in your applications.

The appendix covers instructions for the source code that accompanies this book, as well as installation instructions for Hadoop and all the other related technologies covered in the book.

Finally, there are two bonus chapters available from the publisher’s website at

www.manning.com/HadoopinPracticeSecondEdition: chapter 11 “Integrating R and Hadoop for statistics and more” and chapter 12 “Predictive analytics with Mahout.”

What’s new in the second edition?

(21)

Parquet has also recently emerged as a new way to store data in HDFS—its columnar format can yield both space and time efficiencies in your data pipelines, and it’s quickly becoming the ubiquitous way to store data. Chapter 4 includes extensive coverage of Parquet, which includes how Parquet supports sophisticated object models such as Avro and how various Hadoop tools can use Parquet.

How data is being ingested into Hadoop has also evolved since the first edition, and Kafka has emerged as the new data pipeline, which serves as the transport tier between your data producers and data consumers, where a consumer would be a sys-tem such as Camus that can pull data from Kafka into HDFS. Chapter 5, which covers moving data into and out of Hadoop, now includes coverage of Kafka and Camus.

There are many new technologies that YARN now can support side by side in the same cluster, and some of the more exciting and promising technologies are covered in the new part 4, titled “Beyond MapReduce,” where I cover some compelling new SQL technologies such as Impala and Spark SQL. The last chapter, also new for this edition, looks at how you can write your own YARN application, and it’s packed with information about important features to support your YARN application.

Getting help

You’ll no doubt have many questions when working with Hadoop. Luckily, between the wikis and a vibrant user community, your needs should be well covered:

■ The main wiki is located at http://wiki.apache.org/hadoop/, and it contains

useful presentations, setup instructions, and troubleshooting instructions.

■ The Hadoop Common, HDFS, and MapReduce mailing lists can all be found at

http://hadoop.apache.org/mailing_lists.html.

■ “Search Hadoop” is a useful website that indexes all of Hadoop and its

ecosys-tem projects, and it provides full-text search capabilities: http://search-hadoop.com/.

■ You’ll find many useful blogs you should subscribe to in order to keep on top of

current events in Hadoop. This preface includes a selection of my favorites:

o Cloudera and Hortonworks are both prolific writers of practical applications

on Hadoop—reading their blogs is always educational: http://www.cloudera .com/blog/ and http://hortonworks.com/blog/.

o Michael Noll is one of the first bloggers to provide detailed setup instructions

for Hadoop, and he continues to write about real-life challenges:

www.michael-noll.com/.

o There’s a plethora of active Hadoop Twitter users that you may want to follow,

(22)

ABOUT THIS BOOK xxi

Code conventions and downloads

All source code in listings or in text is presented in a fixed-widthfontlikethis to separate it from ordinary text. Code annotations accompany many of the listings, highlighting important concepts.

All of the text and examples in this book work with Hadoop 2.x, and most of the MapReduce code is written using the newer org.apache.hadoop.mapreduce Map-Reduce APIs. The few examples that use the older org.apache.hadoop.mapred pack-age are usually the result of working with a third-party library or a utility that only works with the old API.

All of the code used in this book is available on GitHub at https://github.com/ alexholmes/hiped2 and also from the publisher’s website at www.manning.com/ HadoopinPracticeSecondEdition. The first section in the appendix shows you how to download, install, and get up and running with the code.

Third-party libraries

I use a number of third-party libraries for convenience purposes. They’re included in the Maven-built JAR, so there’s no extra work required to work with these libraries.

Datasets

Throughout this book, you’ll work with three datasets to provide some variety in the examples. All the datasets are small to make them easy to work with. Copies of the exact data used are available in the GitHub repository in the https://github.com/ alexholmes/hiped2/tree/master/test-data directory. I also sometimes use data that’s specific to a chapter, and it’s available within chapter-specific subdirectories under the same GitHub location.

NASDAQ financial stocks

I downloaded the NASDAQ daily exchange data from InfoChimps (www.infochimps .com). I filtered this huge dataset down to just five stocks and their start-of-year values from 2000 through 2009. The data used for this book is available on GitHub at https:// github.com/alexholmes/hiped2/blob/master/test-data/stocks.txt.

The data is in CSV form, and the fields are in the following order:

Symbol,Date,Open,High,Low,Close,Volume,Adj Close

Apache log data

I created a sample log file in Apache Common Log Format1_{with some fake Class}_E IP addresses and some dummy resources and response codes. The file is available on GitHub at https://github.com/alexholmes/hiped2/blob/master/test-data/ apachelog.txt.

(23)

Names

Names were retrieved from the U.S. government census at www.census.gov/genealogy/ www/data/1990surnames/dist.all.last, and this data is available at https:// github.com/alexholmes/hiped2/blob/master/test-data/names.txt.

Author Online

Purchase of Hadoop in Practice, Second Edition includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the authors and from other users. To access the forum and subscribe to it, point your web browser to www.manning.com/ HadoopinPractice, SecondEdition. This page provides information on how to get on the forum once you are registered, what kind of help is available, and the rules of con-duct on the forum. It also provides links to the source code for the examples in the book, errata, and other downloads.

Manning’s commitment to our readers is to provide a venue where a meaningful dia-log between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the Author Online forum remains voluntary (and unpaid). We suggest you try asking the author challenging questions lest his interest strays!

(24)

xxiii

about the cover illustration

The figure on the cover of Hadoop in Practice, Second Edition is captioned “Momak from Kistanja, Dalmatia.” The illustration is taken from a reproduction of an album of tra-ditional Croatian costumes from the mid-nineteenth century by Nikola Arsenovic, published by the Ethnographic Museum in Split, Croatia, in 2003. The illustrations were obtained from a helpful librarian at the Ethnographic Museum in Split, itself sit-uated in the Roman core of the medieval center of the town: the ruins of Emperor Diocletian’s retirement palace from around AD304. The book includes finely colored illustrations of figures from different regions of Croatia, accompanied by descriptions of the costumes and of everyday life.

Kistanja is a small town located in Bukovica, a geographical region in Croatia. It is situated in northern Dalmatia, an area rich in Roman and Venetian history. The word “momak” in Croatian means a bachelor, beau, or suitor—a single young man who is of courting age—and the young man on the cover, looking dapper in a crisp, white linen shirt and a colorful, embroidered vest, is clearly dressed in his finest clothes, which would be worn to church and for festive occasions—or to go calling on a young lady.

(25)

(26)

Part 1 Background

and fundamentals

P

art 1 of this book consists of chapters 1 and 2, which cover the important Hadoop fundamentals.

Chapter 1 covers Hadoop’s components and its ecosystem and provides instructions for installing a pseudo-distributed Hadoop setup on a single host, along with a system that will enable you to run all of the examples in the book. Chapter 1 also covers the basics of Hadoop configuration, and walks you through how to write and run a MapReduce job on your new setup.

(27)

(28)

3

Hadoop in a heartbeat

We live in the age of big data, where the data volumes we need to work with on a day-to-day basis have outgrown the storage and processing capabilities of a single host. Big data brings with it two fundamental challenges: how to store and work with voluminous data sizes, and more important, how to understand data and turn it into a competitive advantage.

Hadoop fills a gap in the market by effectively storing and providing computa-tional capabilities for substantial amounts of data. It’s a distributed system made up of a distributed filesystem, and it offers a way to parallelize and execute programs on a cluster of machines (see figure 1.1). You’ve most likely come across Hadoop because it’s been adopted by technology giants like Yahoo!, Facebook, and Twitter to address their big data needs, and it’s making inroads across all industrial sectors. Because you’ve come to this book to get some practical experience with Hadoop and Java,1_{I’ll start with a brief overview and then show you how to install}

This chapter covers

■ Examining how the core Hadoop system works

■ Understanding the Hadoop ecosystem

■ Running a MapReduce job

1 _{To benefit from this book, you should have some practical experience with Hadoop and understand the}

(29)

Hadoop and run a MapReduce job. By the end of this chapter, you’ll have had a basic refresher on the nuts and bolts of Hadoop, which will allow you to move on to the more challenging aspects of working with it.

Let’s get started with a detailed overview.

1.1 What is Hadoop?

Hadoop is a platform that provides both distributed storage and computational capa-bilities. Hadoop was first conceived to fix a scalability issue that existed in Nutch,2_an open source crawler and search engine. At the time, Google had published papers that described its novel distributed filesystem, the Google File System (GFS), and MapReduce, a computational framework for parallel processing. The successful implementation of these papers’ concepts in Nutch resulted in it being split into two separate projects, the second of which became Hadoop, a first-class Apache project.

In this section we’ll look at Hadoop from an architectural perspective, examine how industry uses it, and consider some of its weaknesses. Once we’ve covered this background, we’ll look at how to install Hadoop and run a MapReduce job.

Hadoop proper, as shown in figure 1.2, is a distributed master-slave architecture3 that consists of the following primary components:

2 _{The Nutch project, and by extension Hadoop, was led by Doug Cutting and Mike Cafarella.}

3 _{A model of communication where one process, called the}_master_{, has control over one or more other}

pro-cesses, called slaves.

Server cloud Distributed computation

Distributed storage

Hadoop runs on commodity hardware. The computation tier is a general-purpose scheduler and

a distributed processing framework called MapReduce.

Storage is provided via a distributed ﬁlesystem

[image:29.531.93.359.58.281.2]

called HDFS.

(30)

5

What is Hadoop?

■ Hadoop Distributed File System (HDFS) for data storage.

■ Yet Another Resource Negotiator (YARN), introduced in Hadoop 2, a general-purpose scheduler and resource manager. Any YARN application can run on a Hadoop cluster.

■ _{MapReduce, a batch-based computational engine. In Hadoop 2, MapReduce is}

implemented as a YARN application.

Traits intrinsic to Hadoop are data partitioning and parallel computation of large datasets. Its storage and computational capabilities scale with the addition of hosts to a Hadoop cluster; clusters with hundreds of hosts can easily reach data volumes in the petabytes.

In the first step in this section, we’ll examine the HDFS, YARN, and MapReduce architectures.

1.1.1 Core Hadoop components

To understand Hadoop’s architecture we’ll start by looking at the basics of HDFS. HDFS

HDFS is the storage component of Hadoop. It’s a distributed filesystem that’s modeled after the Google File System (GFS) paper.4_HDFS_{is optimized for high throughput and} works best when reading and writing large files (gigabytes and larger). To support this throughput, HDFS uses unusually large (for a filesystem) block sizes and data locality optimizations to reduce network input/output (I/O).

Scalability and availability are also key traits of HDFS, achieved in part due to data replication and fault tolerance. HDFS replicates files for a configured number of times, is tolerant of both software and hardware failure, and automatically re-replicates data blocks on nodes that have failed.

4 _{See “The Google File System‚”}_{http://research.google.com/archive/gfs.html}_.

The HDFS master is responsible for partitioning the storage across

the slave nodes and keeping track of where data is located. The MapReduce master is

responsible for organizing where computational work should be scheduled on the slave nodes. The YARN master performs

the actual scheduling of work for YARN applications.

YARN slave MapReduce slave HDFS slave

YARN master MapReduce master HDFS master

YARN slave MapReduce slave HDFS slave

(31)

Figure 1.3 shows a logical representation of the components in HDFS: the NameNode and the DataNode. It also shows an application that’s using the Hadoop filesystem library to access HDFS.

Hadoop 2 introduced two significant new features for HDFS—Federation and High Availability (HA):

■ _{Federation allows}_HDFS_{metadata to be shared across multiple NameNode}

hosts, which aides with HDFS scalability and also provides data isolation, allow-ing different applications or teams to run their own NameNodes without fear of impacting other NameNodes on the same cluster.

■ High Availability in HDFS removes the single point of failure that existed in Hadoop 1, wherein a NameNode disaster would result in a cluster outage. HDFS HA also offers the ability for failover (the process by which a standby Name-Node takes over work from a failed primary NameName-Node) to be automated.

The HDFS NameNode keeps in memory the metadata about the ﬁlesystem such as which

DataNodes manage the blocks for each ﬁle.

Files are made up of blocks, and each ﬁle can be replicated multiple times, meaning

there are many identical copies of each block for the ﬁle (by default, 3). DataNodes communicate

with each other for pipelining ﬁle reads

and writes. Client

application

Hadoop filesystem

client HDFS clients talk to the NameNode for metadata-related

activities and DataNodes for reading and writing ﬁles.

/tmp/ﬁle1.txt Block A

Block B

DataNode 2

DataNode 3

DataNode 1

DataNode 3 NameNode

C

DataNode 1

D

B A

DataNode 2

C

D B

DataNode 3

[image:31.531.94.467.57.401.2]

A C

(32)

7

What is Hadoop?

Now that you have a bit of HDFS knowledge, it’s time to look at YARN, Hadoop’s scheduler. YARN

YARN is Hadoop’s distributed resource scheduler. YARN is new to Hadoop version 2 and was created to address challenges with the Hadoop 1 architecture:

■ Deployments larger than 4,000 nodes encountered scalability issues, and add-ing additional nodes didn’t yield the expected linear scalability improvements.

■ Only MapReduce workloads were supported, which meant it wasn’t suited to run execution models such as machine learning algorithms that often require iterative computations.

For Hadoop 2 these problems were solved by extracting the scheduling function from MapReduce and reworking it into a generic application scheduler, called YARN. With this change, Hadoop clusters are no longer limited to running MapReduce workloads; YARN enables a new set of workloads to be natively supported on Hadoop, and it allows alternative processing models, such as graph processing and stream pro-cessing, to coexist with MapReduce. Chapters 2 and 10 cover YARN and how to write YARN applications.

YARN’s architecture is simple because its primary role is to schedule and manage resources in a Hadoop cluster. Figure 1.4 shows a logical representation of the core components in YARN: the ResourceManager and the NodeManager. Also shown are the components specific to YARN applications, namely, the YARN application client, the ApplicationMaster, and the container.

To fully realize the dream of a generalized distributed platform, Hadoop 2 intro-duced another change—the ability to allocate containers in various configurations.

A YARN client is responsible for creating

the YARN application.

Client ResourceManager

ApplicationMaster

NodeManager

Container The ResourceManager is the

YARN master process and is responsible for scheduling and managing resources,

called “containers.”

The ApplicationMaster is created by the ResourceManager and is responsible

for requesting containers to perform application-speciﬁc work.

The NodeManager is the slave YARN process that runs on each node.

It is responsible for launching and managing containers.

Containers are YARN application-speciﬁc processes

[image:32.531.102.446.371.578.2]

that perform some function pertinent to the application.

(33)

Hadoop 1 had the notion of “slots,” which were a fixed number of map and reduce pro-cesses that were allowed to run on a single node. This was wasteful in terms of cluster utilization and resulted in underutilized resources during MapReduce operations, and it also imposed memory limits for map and reduce tasks. With YARN, each container requested by an ApplicationMaster can have disparate memory and CPU traits, and this gives YARN applications full control over the resources they need to fulfill their work.

You’ll work with YARN in more detail in chapters 2 and 10, where you’ll learn how YARN works and how to write a YARN application. Next up is an examination of MapReduce, Hadoop’s computation engine.

MAPREDUCE

MapReduce is a batch-based, distributed computing framework modeled after Google’s paper on MapReduce.5_{It allows you to parallelize work over a large amount} of raw data, such as combining web logs with relational data from an OLTP database to model how users interact with your website. This type of work, which could take days or longer using conventional serial programming techniques, can be reduced to min-utes using MapReduce on a Hadoop cluster.

The MapReduce model simplifies parallel processing by abstracting away the com-plexities involved in working with distributed systems, such as computational paral-lelization, work distribution, and dealing with unreliable hardware and software. With this abstraction, MapReduce allows the programmer to focus on addressing business needs rather than getting tangled up in distributed system complications.

MapReduce decomposes work submitted by a client into small parallelized map and reduce tasks, as shown in figure 1.5. The map and reduce constructs used in

5 _{See “MapReduce: Simplified Data Processing on Large Clusters,”}_{http://research.google.com/archive/} mapreduce.html.

Hadoop MapReduce master

Map

Reduce Client

Input data

Output data The client submits

a MapReduce job.

MapReduce decomposes the job into map and reduce tasks and schedules them for remote

execution on the slave nodes. Job

Job parts Job parts

[image:33.531.69.422.358.580.2]

Reduce

(34)

9

What is Hadoop?

MapReduce are borrowed from those found in the Lisp functional programming lan-guage, and they use a shared-nothing model to remove any parallel execution interde-pendencies that could add unwanted synchronization points or state sharing.6

The role of the programmer is to define map and reduce functions where the map function outputs key/value tuples, which are processed by reduce functions to pro-duce the final output. Figure 1.6 shows a pseudocode definition of a map function with regard to its input and output.

The power of MapReduce occurs between the map output and the reduce input in the shuffle and sort phases, as shown in figure 1.7.

6 _{A shared-nothing architecture is a distributed computing concept that represents the notion that each node}

is independent and self-sufficient.

The map function takes as input a key/value pair, which represents a logical record from the input data source. In the case of a ﬁle, this could be a line, or if the input source is a table in a database, it could be a row.

list(key2, value2) map(key1, value1)

The map function produces zero or more output key/value pairs for one input pair. For example, if the map function is a ﬁltering map function, it may only produce output if a certain condition is

met. Or it could be performing a demultiplexing operation, where a single key/value yields multiple key/value output pairs.

Figure 1.6 A logical view of the map function that takes a key/value pair as input

The shufﬂe and sort phases are responsible for two primary activities: determining the reducer that should receive the map output key/value pair (called partitioning);

and ensuring that all the input keys for a given reducer are sorted.

cat,doc1

dog,doc1 hamster,doc1

cat,doc2

dog,doc2

hampster,doc2 chipmunk,doc2

Map output Shufﬂe + sort

Mapper 1

Mapper 2

cat,list(doc1,doc2)

dog,list(doc1,doc2)

hamster,list(doc1,doc2) chipmunk,list(doc2)

Reducer 2 Sorted reduce Input

Map outputs for the same key (such as “hamster”) go to the same reducer and are then combined to

form a single input record for the reducer.

Each reducer has all of its input keys sorted.

Reducer 1

[image:34.531.97.462.345.555.2]

Reducer 3

(35)

Figure 1.8 shows a pseudocode definition of a reduce function.

With the advent of YARN in Hadoop 2, MapReduce has been rewritten as a YARN application and is now referred to as MapReduce 2 (or MRv2). From a developer’s per-spective, MapReduce in Hadoop 2 works in much the same way it did in Hadoop 1, and code written for Hadoop 1 will execute without code changes on version 2.7 There are changes to the physical architecture and internal plumbing in MRv2 that are examined in more detail in chapter 2.

With some Hadoop basics under your belt, it’s time to take a look at the Hadoop ecosystem and the projects that are covered in this book.

1.1.2 The Hadoop ecosystem

The Hadoop ecosystem is diverse and grows by the day. It’s impossible to keep track of all of the various projects that interact with Hadoop in some form. In this book the focus is on the tools that are currently receiving the greatest adoption by users, as shown in figure 1.9.

MapReduce and YARN are not for the faint of heart, which means the goal for many of these Hadoop-related projects is to increase the accessibility of Hadoop to programmers and nonprogrammers. I’ll cover many of the technologies listed in fig-ure 1.9 in this book and describe them in detail within their respective chapters. In addition, the appendix includes descriptions and installation instructions for technol-ogies that are covered in this book.

Coverage of the Hadoop ecosystem in this book The Hadoop ecosystem grows by the day, and there are often multiple tools with overlapping features and benefits. The goal of this book is to provide practical techniques that cover the core Hadoop technologies, as well as select ecosystem technologies that are ubiquitous and essential to Hadoop.

Let’s look at the hardware requirements for your cluster.

7 _{Some code may require recompilation against Hadoop 2 binaries to work with MRv2; see chapter 2 for more}

details.

The reduce function is called once per unique

map output key.

All of the map output values that were emied across all the mappers

for "key2" are provided in a list.

Like the map function, the reduce can output zero-to-many key/value pairs. Reducer output can write to ﬂat ﬁles in HDFS, insert/update rows in a NoSQL database, or write

to any data sink, depending on the requirements of the job.

list(key3, value3) reduce (key2, list (value2's))

(36)

11

What is Hadoop?

1.1.3 Hardware requirements

The term commodity hardware is often used to describe Hadoop hardware require-ments. It’s true that Hadoop can run on any old servers you can dig up, but you’ll still want your cluster to perform well, and you don’t want to swamp your operations department with diagnosing and fixing hardware issues. Therefore, commodity refers to mid-level rack servers with dual sockets, as much error-correcting RAM as is affordable, and SATA drives optimized for RAID storage. Using RAID on the DataNode filesystems used to store HDFS content is strongly discouraged because HDFS already has replica-tion and error-checking built in; on the NameNode, RAID is strongly recommended for additional security.8

From a network topology perspective with regard to switches and firewalls, all of the master and slave nodes must be able to open connections to each other. For small clusters, all the hosts would run 1 GB network cards connected to a single, good-quality switch. For larger clusters, look at 10 GB top-of-rack switches that have at least multiple 1 GB uplinks to dual-central switches. Client nodes also need to be able to talk to all of the master and slave nodes, but if necessary, that access can be from behind a firewall that permits connection establishment only from the client side.

8 _{HDFS uses disks to durably store metadata about the filesystem.}

High-level languages

Predictive analytics

Alternative processing

Miscellaneous

SQL-on-Hadoop Weave

Scalding

Cascalog

Crunch

Cascading

Pig

Impala

Hive

RHadoop

Rhipe

R

Summingbird

Spark

Storm

ElephantDB

HDFS YARN + MapReduce

[image:36.531.103.472.52.298.2]

Hadoop

(37)

After reviewing Hadoop from a software and hardware perspective, you’ve likely developed a good idea of who might benefit from using it. Once you start working with Hadoop, you’ll need to pick a distribution to use, which is the next topic.

1.1.4 Hadoop distributions

Hadoop is an Apache open source project, and regular releases of the software are available for download directly from the Apache project’s website (http:// hadoop.apache.org/releases.html#Download). You can either download and install Hadoop from the website or use a quickstart virtual machine from a commercial dis-tribution, which is usually a great starting point if you’re new to Hadoop and want to quickly get it up and running.

After you’ve whet your appetite with Hadoop and have committed to using it in production, the next question that you’ll need to answer is which distribution to use. You can continue to use the vanilla Hadoop distribution, but you’ll have to build the in-house expertise to manage your clusters. This is not a trivial task and is usually only successful in organizations that are comfortable with having dedicated Hadoop DevOps engineers running and managing their clusters.

Alternatively, you can turn to a commercial distribution of Hadoop, which will give you the added benefits of enterprise administration software, a support team to con-sult when planning your clusters or to help you out when things go bump in the night, and the possibility of a rapid fix for software issues that you encounter. Of course, none of this comes for free (or for cheap!), but if you’re running mission-critical ser-vices on Hadoop and don’t have a dedicated team to support your infrastructure and services, then going with a commercial Hadoop distribution is prudent.

Picking the distribution that’s right for you It’s highly recommended that you engage with the major vendors to gain an understanding of which distribu-tion suits your needs from a feature, support, and cost perspective. Remem-ber that each vendor will highlight their advantages and at the same time expose the disadvantages of their competitors, so talking to two or more ven-dors will give you a more realistic sense of what the distributions offer. Make sure you download and test the distributions and validate that they integrate and work within your existing software and hardware stacks.

There are a number of distributions to choose from, and in this section I’ll briefly summarize each distribution and highlight some of its advantages.

APACHE

(38)

13

What is Hadoop?

responses to problems are usually rapid, even if the actual fixes will likely take longer than you may be able to afford.

The Apache Hadoop distribution has become more compelling now that adminis-tration has been simplified with the advent of Apache Ambari, which provides a GUI to help with provisioning and managing your cluster. As useful as Ambari is, though, it’s worth comparing it against offerings from the commercial vendors, as the com-mercial tooling is typically more sophisticated.

CLOUDERA

Cloudera is the most tenured Hadoop distribution, and it employs a large number of Hadoop (and Hadoop ecosystem) committers. Doug Cutting, who along with Mike Caferella originally created Hadoop, is the chief architect at Cloudera. In aggregate, this means that bug fixes and feature requests have a better chance of being addressed in Cloudera compared to Hadoop distributions with fewer committers.

Beyond maintaining and supporting Hadoop, Cloudera has been innovating in the Hadoop space by developing projects that address areas where Hadoop has been weak. A prime example of this is Impala, which offers a SQL-on-Hadoop system, simi-lar to Hive but focusing on a near-real-time user experience, as opposed to Hive, which has traditionally been a high-latency system. There are numerous other projects that Cloudera has been working on: highlights include Flume, a log collection and distribution system; Sqoop, for moving relational data in and out of Hadoop; and Cloudera Search, which offers near-real-time search indexing.

HORTONWORKS

Hortonworks is also made up of a large number of Hadoop committers, and it offers the same advantages as Cloudera in terms of the ability to quickly address problems and feature requests in core Hadoop and its ecosystem projects.

From an innovation perspective, Hortonworks has taken a slightly different approach than Cloudera. An example is Hive: Cloudera’s approach was to develop a whole new SQL-on-Hadoop system, but Hortonworks has instead looked at innovating inside of Hive to remove its high-latency shackles and add new capabilities such as sup-port for ACID. Hortonworks is also the main driver behind the next-generation YARN platform, which is a key strategic piece keeping Hadoop relevant. Similarly, Horton-works has used Apache Ambari for its administration tooling rather than developing an in-house proprietary administration tool, which is the path taken by the other dis-tributions. Hortonworks’ focus on developing and expanding the Apache ecosystem tooling has a direct benefit to the community, as it makes its tools available to all users without the need for support contracts.

MAPR

MapR has fewer Hadoop committers on its team than the other distributions dis-cussed here, so its ability to fix and shape Hadoop’s future is potentially more bounded than its peers.

(39)

enterprise-ready filesystem, and instead developed its own proprietary filesystem, which offers compelling features such as POSIX compliance (offering random-write support and atomic operations), High Availability, NFS mounting, data mirroring, and snapshots. Some of these features have been introduced into Hadoop 2, but MapR has offered them from the start, and, as a result, one can expect that these features are robust.

As part of the evaluation criteria, it should be noted that parts of the MapR stack, such as its filesystem and its HBase offering, are closed source and proprietary. This affects the ability of your engineers to browse, fix, and contribute patches back to the community. In contrast, most of Cloudera’s and Hortonworks’ stacks are open source, especially Hortonworks’, which is unique in that the entire stack, including the man-agement platform, is open source.

MapR’s notable highlights include being made available in Amazon’s cloud as an alternative to Amazon’s own Elastic MapReduce and being integrated with Google’s Compute Cloud.

I’ve just scratched the surface of the advantages that the various Hadoop distribu-tions offer; your next steps will likely be to contact the vendors and start playing with the distributions yourself.

Next, let’s take a look at companies currently using Hadoop, and in what capacity they’re using it.

1.1.5 Who’s using Hadoop?

Hadoop has a high level of penetration in high-tech companies, and it’s starting to make inroads in a broad range of sectors, including the enterprise (Booz Allen Hamil-ton, J.P. Morgan), government (NSA), and health care.

Facebook uses Hadoop, Hive, and HBase for data warehousing and real-time appli-cation serving.9_{Facebook’s data warehousing clusters are petabytes in size with} thou-sands of nodes, and they use separate HBase-driven, real-time clusters for messaging and real-time analytics.

Yahoo! uses Hadoop for data analytics, machine learning, search ranking, email antispam, ad optimization, ETL,10_{and more. Combined, it has over 40,000 servers} run-ning Hadoop with 170 PB of storage. Yahoo! is also runrun-ning the first large-scale YARN deployments with clusters of up to 4,000 nodes.11

Twitter is a major big data innovator, and it has made notable contributions to Hadoop with projects such as Scalding, a Scala API for Cascading; Summingbird, a

9 _{See Dhruba Borthakur, “Looking at the code behind our three uses of Apache Hadoop” on Facebook at} http://mng.bz/4cMc. Facebook has also developed its own SQL-on-Hadoop tool called Presto and is migrat-ing away from Hive (see Martin Traverso, “Presto: Interactmigrat-ing with petabytes of data at Facebook,” http:// mng.bz/p0Xz).

10_{Extract, transform, and load (ETL) is the process by which data is extracted from outside sources,}

trans-formed to fit the project’s needs, and loaded into the target data sink. ETL is a common process in data ware-housing.

11_{There are more details on YARN and its use at Yahoo! in “Apache Hadoop YARN: Yet Another Resource}

(40)

15

What is Hadoop?

component that can be used to implement parts of Nathan Marz’s lambda architec-ture; and various other gems such as Bijection, Algebird, and Elephant Bird.

eBay, Samsung, Rackspace, J.P. Morgan, Groupon, LinkedIn, AOL, Spotify, and StumbleUpon are some other organizations that are also heavily invested in Hadoop. Microsoft has collaborated with Hortonworks to ensure that Hadoop works on its platform.

Google, in its MapReduce paper, indicated that it uses Caffeine,12_{its version of} MapReduce, to create its web index from crawl data. Google also highlights applica-tions of MapReduce to include activities such as a distributed grep, URL access fre-quency (from log data), and a term-vector algorithm, which determines popular keywords for a host.

The number of organizations that use Hadoop grows by the day, and if you work at a Fortune 500 company you almost certainly use a Hadoop cluster in some capacity. It’s clear that as Hadoop continues to mature, its adoption will continue to grow.

As with all technologies, a key part to being able to work effectively with Hadoop is to understand its shortcomings and design and architect your solutions to mitigate these as much as possible.

1.1.6 Hadoop limitations

High availability and security often rank among the top concerns cited with Hadoop. Many of these concerns have been addressed in Hadoop 2; let’s take a closer look at some of its weaknesses as of release 2.2.0.

Enterprise organizations using Hadoop 1 and earlier had concerns with the lack of high availability and security. In Hadoop 1, all of the master processes are single points of failure, which means that a failure in the master process causes an outage. In Hadoop 2, HDFS now has high availability support, and the re-architecture of Map-Reduce with YARN has removed the single point of failure. Security is another area that has had its wrinkles, and it’s receiving focus.

HIGH AVAILABILITY

High availability is often mandated in enterprise organizations that have high uptime SLA requirements to ensure that systems are always on, even in the event of a node going down due to planned or unplanned circumstances. Prior to Hadoop 2, the mas-ter HDFS process could only run on a single node, resulting in single points of fail-ure.13_{Hadoop 2 brings NameNode High Availability (}_HA_{) support, which means that} multiple NameNodes for the same Hadoop cluster can be running. With the current design, one of the NameNodes is active and the other NameNode is designated as a standby process. In the event that the active NameNode experiences a planned or

12_{In 2010 Google moved to a real-time indexing system called Caffeine; see “Our new search index: Caffeine”}

on the Google blog (June 8, 2010), http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html.

13_{In reality, the HDFS single point of failure may not be terribly significant; see “NameNode HA” by Suresh}

(41)

unplanned outage, the standby NameNode will take over as the active NameNode, which is a process called failover. This failover can be configured to be automatic, negating the need for human intervention. The fact that a NameNode failover occurred is transparent to Hadoop clients.

The MapReduce master process (the JobTracker) doesn’t have HA support in Hadoop 2, but now that each MapReduce job has its own JobTracker process (a sepa-rate YARN ApplicationMaster), HA support is arguably less important.

HA support in the YARN master process (the ResourceManager) is important, how-ever, and development is currently underway to add this feature to Hadoop.14

MULTIPLE DATACENTERS

Multiple datacenter support is another key feature that’s increasingly expected in enterprise software, as it offers strong data protection and locality properties due to data being replicated across multiple datacenters. Apache Hadoop, and most of its commercial distributions, has never had support for multiple datacenters, which poses challenges for organizations that have software running in multiple datacenters. WAN-disco is currently the only solution available for Hadoop multidatacenter support.

SECURITY

Hadoop does offer a security model, but by default it’s disabled. With the security model disabled, the only security feature that exists in Hadoop is HDFS file- and directory-level ownership and permissions. But it’s easy for malicious users to sub-vert and assume other users’ identities. By default, all other Hadoop services are wide open, allowing any user to perform any kind of operation, such as killing another user’s MapReduce jobs.

Hadoop can be configured to run with Kerberos, a network authentication proto-col, which requires Hadoop daemons to authenticate clients, both users and other Hadoop components. Kerberos can be integrated with an organization’s existing Active Directory and therefore offers a single-sign-on experience for users. Care needs to be taken when enabling Kerberos, as any Hadoop tool that wishes to interact with your cluster will need to support Kerberos.

Wire-level encryption can be configured in Hadoop 2 and allows data crossing the network (both HDFS transport15_{and MapReduce shuffle data}16_{) to be encrypted.} Encryption of data at rest (data stored by HDFS on disk) is currently missing in Hadoop.

Let’s examine the limitations of some of the individual systems.

14_{For additional details on YARN HA support, see the JIRA ticket titled “ResourceManager (RM) High-Availability}

(HA),” https://issues.apache.org/jira/browse/YARN-149.

15_{See the JIRA ticket titled “Add support for encrypting the DataTransferProtocol” at}_{https://issues.apache.org/} jira/browse/HDFS-3637.

(42)

17

Getting your hands dirty with MapReduce

HDFS

The weakness of HDFS is mainly its lack of high availability (in Hadoop 1.x and ear-lier), its inefficient handling of small files,17_{and its lack of transparent compression.} HDFS doesn’t support random writes into files (only appends are supported), and it’s generally designed to support high-throughput sequential reads and writes over large files.

MAPREDUCE

MapReduce is a batch-based architecture, which means it doesn’t lend itself to use cases that need real-time data access. Tasks that require global synchronization or sharing of mutable data aren’t a good fit for MapReduce, because it’s a shared-nothing architec-ture, which can pose challenges for some algorithms.

VERSION INCOMPATIBILITIES

The Hadoop 2 release brought with it some headaches with regard to MapReduce API runtime compatibility, especially in the org.hadoop.mapreduce package. These problems often result in runtime issues with code that’s compiled against Hadoop 1 (and ear-lier). The solution is usually to recompile against Hadoop 2, or to consider a tech-nique outlined in chapter 2 that introduces a compatibility library to target both Hadoop versions without the need to recompile code.

Other challenges with Hive and Hadoop also exist, where Hive may need to be recompiled to work with versions of Hadoop other than the one it was built against. Pig has had compatibility issues, too. For example, the Pig 0.8 release didn’t work with Hadoop 0.20.203, and manual intervention was required to work around this prob-lem. This is one of the advantages of using a Hadoop distribution other than Apache, as these compatibility problems have been fixed. If using the vanilla Apache distribu-tions is desired, it’s worth taking a look at Bigtop (http://bigtop.apache.org/), an Apache open source automated build and compliance system. It includes all of the major Hadoop ecosystem components and runs a number of integration tests to ensure they all work in conjunction with each other.

After tackling Hadoop’s architecture and its weaknesses, you’re probably ready to roll up your sleeves and get hands-on with Hadoop, so let’s look at running the first example in this book.

1.2 Getting your hands dirty with MapReduce

This section shows you how to run a MapReduce job on your host.

Installing Hadoop and building the examples To run the code example in this section, you’ll need to follow the instructions in the appendix, which explain how to install Hadoop and download and run the examples bundled with this book.

17_{Although HDFS Federation in Hadoop 2 has introduced a way for multiple NameNodes to share file}