
June 2011

White Paper: What You Need To Know About Hadoop

Inside:

• What is Hadoop, really?

• Issues the Hadoop stack can address

• Common Terms

A White Paper providing succinct information for the enterprise technologist.


What You Need To Know About Hadoop

Executive Summary

Apache Hadoop is a project operating under the auspices of the Apache Software Foundation (ASF). The Hadoop project develops open source software for reliable, scalable, distributed computing. Hadoop holds great promise in helping enterprises deal with the need to conduct analysis over large data sets.

What is Hadoop?

Everyone in federal IT seems to be talking about Apache Hadoop. But why? What is so special about this word? What do people mean when they say it? This paper provides information designed to put those questions and their answers into context relevant for your mission.

The term Hadoop is used two different ways in the federal IT community. Technologists consider it a framework of tools that enable distributed analysis over large quantities of data using commodity hardware. Hadoop is commonly distributed with a selection of related capabilities that make data analysis and feature access easier, such as Cloudera’s Distribution including Apache Hadoop (CDH), and this bundle of Hadoop plus related capabilities is also often referred to as Hadoop. The rest of this paper uses the term Hadoop in this broader sense, since for most of us the bundle of capabilities in CDH is Hadoop.

Why Might You Need Hadoop?

Organizations today are collecting and generating more data than ever before, and the formats of this data vary widely. Older methods of dealing with data (such as relational databases) are not keeping up with the size and diversity of data and do not enable fast analysis over large data sets.

What Issues Can Hadoop Address?

Hadoop capabilities come in two key categories: storage and analysis.

Storage of data is in the Hadoop Distributed File System (HDFS). HDFS is great for storing very large data sets on clusters of commodity hardware.
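To make this concrete, the sketch below (not from the original paper) shows how a Java client might write and then read a small file in HDFS using the standard Hadoop FileSystem API. The file path, file contents and cluster configuration are illustrative assumptions; the API calls are the standard ones shipped with Hadoop.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath, so the client
        // talks to whatever cluster this machine is configured for.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; HDFS splits files into blocks and replicates each
        // block across several datanodes automatically.
        Path file = new Path("/user/analyst/reports/sample.txt");

        FSDataOutputStream out = fs.create(file);
        out.writeUTF("A small record written into HDFS");
        out.close();

        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
    }
}
```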


Analysis is performed using a model known as MapReduce. MapReduce is a way of using the power of distributed computers to analyze data and then bring the results back together to be correlated, in a “divide and conquer” approach. MapReduce lets you run computations on the nodes where the data resides. Google, Twitter, Facebook and other organizations with large, data-intensive operations use this method by running MapReduce over HDFS on thousands of compute nodes. Twitter is a case of special note, where a Hadoop infrastructure is used to monitor and analyze tweets.
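As an illustration of the “divide and conquer” pattern, the sketch below shows the canonical word-count job written against the Hadoop Java MapReduce API. It is not taken from this paper; the class names and input/output paths are assumptions. The mapper runs on the nodes holding each block of input and emits partial results; the reducer correlates those partial results into final counts.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each map task processes one block of the input where it is stored ("divide").
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit a partial count for each word occurrence
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // correlate partial counts into one total ("conquer")
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // Job.getInstance(conf, ...) in later Hadoop releases
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS (must not yet exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this is packaged as a jar and submitted to the cluster with the hadoop jar command, with its input and output living in HDFS.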

Since CDH uses the MapReduce model it is very fast, which is important when working with large data sets. Yahoo, for example, used it to sort a terabyte of data in 62 seconds. This is a well-known benchmark in the IT world and just one of many examples demonstrating that if you need to perform fast analysis over large data, you should be using Hadoop.

CDH also helps address cost issues in IT, since the computing power is delivered using commodity hardware. It runs on computers available from any vendor and does not require high-end machines. CDH itself is available for free, which helps with cost as well.

With these economical solutions to storage and analysis, new challenges can be addressed. Consider, for example, the need to search and discover indicators of fraud in visa applications. Huge quantities of information must be searched and correlated from multiple sources to find indications of fraud, and the quicker the better. Hadoop is perfect for this sort of fast analysis.

Other users are leveraging Hadoop to rapidly search and correlate across vast stores of SIGINT, ELINT and unstructured text on adversaries to seek battlefield advantage for our forces. More info on these use cases is available separately.

Or consider challenges where no solution has ever been attempted. For example, consider the

government data on weather, climate, the environment, pollution, health, quality of life, the economy, natural resources, energy and transportation. Data on those topics exist in many stores across

the federal enterprise. The government also has years of information from research conducted at academic institutions across the nation. Imagine the conclusions that could be drawn from analysis over datastores like this. Imagine the benefits to our citizen’s health, commodity prices, education and employment of better analysis over these data stores.


What is a Hadoop Distribution?

Hadoop itself provides a great framework of tools for storing and analyzing data, but enterprises make use of other tools to enable IT staff and IT users to write complex queries faster, enable better security, and facilitate very complex analysis and special-purpose computation on large datasets in a scalable, cost-effective manner.

Cloudera’s Distribution including Apache Hadoop (CDH) provides a single bundle of all Hadoop-related projects in a package that is tested together and maintained in a way enterprise CIOs expect software to be supported.

Summary

CDH is an ideal platform for consolidating large-scale data from a variety of new and legacy sources. It complements existing data management solutions with new analysis and processing tools. It delivers immediate value to federal organizations in need of better understanding of data.

For more on Hadoop see: http://Cloudera.com


For Further Reference

Many other terms are used by technologists to describe the detailed features and functions provided in CDH. The following list may help you decipher the language of Big Data:

CDH: Cloudera’s Distribution including Apache Hadoop. It contains HDFS, Hadoop MapReduce, Hive, Pig, HBase, Sqoop, Flume, Oozie, Zookeeper and Hue. When most people say Hadoop they mean CDH.

HDFS: The Hadoop Distributed File System. This is a scalable means of distributing data that takes advantage of commodity hardware. HDFS ensures all data is replicated in a location-aware manner so as to lessen internal datacenter network load, which frees the network for more complicated transactions.

Hadoop MapReduce: This process breaks down jobs across the Hadoop datanodes and then reassembles the results from each into a coherent answer.

Hive: A data warehouse infrastructure that leverages the power of Hadoop. Hive provides tools for easy data summarization and ad hoc querying. Hive puts structure on the data and gives users the ability to query it using familiar methods (like SQL). Hive also allows MapReduce programmers to plug in their own mappers and reducers when a query calls for custom logic.
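For readers who want to see what “querying using familiar methods” looks like, here is an illustrative sketch (not from the paper) of a Java program issuing a HiveQL query over JDBC. The table name and query are hypothetical; the driver class and URL shown are those of the original HiveServer, while HiveServer2 deployments use a different driver class and a jdbc:hive2:// URL.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer (version 1) JDBC driver and endpoint; adjust for your deployment.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // "applications" is a hypothetical Hive table; Hive compiles this SQL-like
        // query into MapReduce jobs that run across the cluster.
        ResultSet rs = stmt.executeQuery("SELECT country, COUNT(*) FROM applications GROUP BY country");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}
```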

Pig: A high-level data-flow language that enables advanced parallel computation. Pig makes parallel programming much easier.

HBase: A scalable, distributed database that supports structured data storage for large tables. Used when you need random, realtime read/write access to your Big Data. It enables hosting of very large tables (billions of rows by millions of columns) atop commodity hardware. It is a column-oriented store modeled after Google’s BigTable and is optimized for realtime data. HBase has replaced Cassandra at Facebook.
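The sketch below, which is illustrative rather than taken from the paper, shows the kind of random, realtime read/write access HBase provides, using the HBase Java client API of that era. The table and column names are hypothetical, and the table is assumed to already exist (created, for example, via the HBase shell).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate the cluster.
        Configuration conf = HBaseConfiguration.create();

        // Hypothetical "tweets" table with a "content" column family.
        HTable table = new HTable(conf, "tweets");

        // Write one cell: row key, column family, qualifier, value.
        Put put = new Put(Bytes.toBytes("user123-20110601-0001"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("text"), Bytes.toBytes("Testing Hadoop today"));
        table.put(put);

        // Random, realtime read of the same row.
        Get get = new Get(Bytes.toBytes("user123-20110601-0001"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("content"), Bytes.toBytes("text"))));

        table.close();
    }
}
```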

Sqoop: A tool for importing data from SQL databases into Hadoop and exporting it back out.

Flume: A distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of streaming data.


Oozie: A workflow engine to enhance management of data processing jobs for Hadoop. Manages dependencies of jobs between HDFS, Pig and MapReduce.

Zookeeper: A very high performance coordination service for distributed applications.
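As a small, illustrative example of what “coordination” means here (not part of the original paper, with a hypothetical znode path and server address), a client can use the ZooKeeper Java API to publish an ephemeral node that other processes in the cluster can read, for instance to agree on which node is currently the active master:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper ensemble address; 3000 ms session timeout, no watcher.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, null);

        // Publish an ephemeral znode; it disappears automatically if this client's
        // session ends, which is what makes it useful for coordination.
        zk.create("/active-master", "node-07".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Any other process can read the same znode to learn who the master is.
        byte[] data = zk.getData("/active-master", false, null);
        System.out.println("Active master: " + new String(data));

        zk.close();
    }
}
```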

Hue: A browser-based desktop interface for interacting with Hadoop. It supports a file browser, job tracker interface, cluster health monitor and many other easy-to-use features.

What Do You Need To Know About Hadoop?

CDH is 100% open source and 100% Apache licensed. It is simplified, with all required component versions and dependencies managed for you. It is integrated, with all components and functions able to interoperate through standard APIs. It is reliable, with predictable release schedules, patches for stability, and stress testing. It is an industry standard, so your existing RDBMS, ETL and BI systems work with it.

And it is the tool you will need to manage the coming age of Big Data.

CTOlabs.com is a technology research, consulting and services agency of the firm Crucial Point, LLC.

We are producing more data than we can analyze with traditional methods. Our future requires Hadoop.


About the Author

Bob Gourley is the founder of Crucial Point, LLC and CTOlabs.com, a provider of technology concepts, vendor evaluations and technology assessments focused on enterprise grade mission needs. Mr. Gourley’s first career was as a naval intelligence officer, which included operational tours afloat and ashore. He was the first J2 at DoD’s cyber defense organization, the JTF-CND.

Following retirement from the Navy, Mr. Gourley was a senior executive with TRW and Northrop Grumman, and then returned to government service as the Chief Technology Officer of the Defense Intelligence Agency.

Mr. Gourley was named one of the top 25 most influential CTOs in the world by Infoworld in 2007, and selected for AFCEA’s award for meritorious service to the intelligence community in 2008. He was named by Washingtonian magazine as one of DC’s “Tech Titans” in 2009, and one of the “Top 25 Most Fascinating Communicators in Government IT” by the Gov2.0 community GovFresh.

He holds three master’s degrees: a master of science degree in scientific and technical intelligence from the Naval Postgraduate School, a master of science degree in military science from USMC University, and a master of science degree in computer science from James Madison University.

Mr. Gourley has published more than 40 articles on a wide range of topics and is a contributor to the book Threats in the Age of Obama (2009). He is a founding and current member of the board of directors of the Cyber Conflict Studies Association, and serves on the board of the Naval Intelligence Professionals, the Intelligence Committee of AFCEA, and the Cyber Committee of INSA.

Bob Gourley


Contact:

Bob Gourley

bob@crucialpointllc.com 703-994-0549

All information/data ©2011 CTOLabs.com.
