
THE PLATFORM FOR BIG DATA

A Blueprint for Next-Generation Data Management


Table of Contents

Introduction
Data in Crisis
The Data Brain
Anatomy of the Platform
Essentials of Success
A Data Platform
The Road Ahead
About Cloudera


Introduction

The modern era of Big Data threatens to upend the status quo across organizations, from the data center to the boardroom. Companies are turning to Apache Hadoop™ (Hadoop) as the foundation for systems and tools capable of tackling the challenges of massive data growth. Some may doubt that Hadoop is the engine of the new era of data management, yet with the latest advances like Cloudera Impala™, Hadoop enables organizations to deploy a central platform to solve end-to-end data problems, from batch processing to real-time applications, and to ask bigger questions of their data.

DATA IN CRISIS

A brief history of the data growth challenges facing organizations today illustrates why Hadoop has become the central platform for Big Data.

During the mid-2000s, the stress of considerable data growth at innovative consumer web companies like Facebook, Google, and Yahoo! exposed flaws in existing data management technologies.

The assumed operational model, which dictated massive storage arrays connected to massive computing arrays via a small network pipe, was showing its age. The network capabilities were failing to keep pace with computing demands as data sets increased in size and flow. The lack of affordable and predictable scale-out architectures muted any potential benefits wrung from the growing volumes of data.

The data itself was changing as well. Non-traditional types and formats became valuable in reporting and analysis. Business teams were trying to collect new data to combine with existing customer and transaction data. To get a more refined picture of consumer activity, businesses wanted data in much larger volumes and from unstructured sources, including web server logs, images, blogs, and social media streams.

The sheer scale of this new data overwhelmed existing systems and demanded significant effort for even simple changes to data structures or reporting metrics. When making changes as simple as adding a single dimension or value, organizations were lucky to see the update within two weeks; more often, changes required six months to implement. This latency meant that the questions the business was asking had often changed by the time IT had adjusted the infrastructure to answer them.

Even more troubling was the emergence of new functional limits in the existing systems. To borrow from former US Secretary of Defense Donald Rumsfeld: these systems had been designed to examine the “known unknowns” – the questions a business knows to ask but does not yet have answers to. The teams at Facebook, Google, and Yahoo! were encountering a different set of questions, the “unknown unknowns” – the questions a business has yet to ask but is actively seeking to discover. “We knew that the business needed more than just queries,” explained Jeff Hammerbacher, former data management leader at Facebook and current chief scientist at Cloudera. “Business now cared about processing as well. Yet our existing systems were optimized for queries, not processing.” The business needed answers from data sets that required significantly more processing, and it needed a way to explore the questions that these new data sources brought to light.

The data exploration challenge stemmed from a fundamental shift in the way organizations consumed data. Data evolved from simply a source of information for reactive decisions – the data in the report on the CFO’s desk – into the driver of proactive decisions. Data powered the content targeting campaigns and the recommendation engines of the social era. Organizations realized that through the discipline of data science and the breadth and permanence of their raw data sources and refined results, they could produce new revenue streams and cost avoidance strategies. Data had new intrinsic value. Data was now a financial asset, not a byproduct.

During the mid-2000s, the stress of massive data growth exposed flaws in existing data management technologies.


These experiences at Facebook, Google, and Yahoo! foreshadowed the challenges that now confront all industries. Back in 2004, these companies were increasingly desperate to find a solution to their data management needs.

THE DATA BRAIN

The search for a solution ranged beyond the traditional database and data mart products, considering high performance computing (HPC), innovative scale-out storage, and virtualization solutions. Each of these systems had components that solved elements of the broader Big Data challenge, but none provided a comprehensive and cohesive structure for addressing the issues in their entirety. HPC suffered from the same network bottlenecks as legacy systems. New storage systems were able to optimize the cost per byte, but had no compute capacity. Virtualization excelled at making efficient use of individual machines, but had no mechanism for combining multiple machines to act as one.

Back then, Hammerbacher referred to the ideal solution as the “Data Brain,” which he described as “a place to put all our data, no matter what it is, extract value, and be intelligent about it.” At this time, Apache Hadoop, a nascent technology based on work pioneered by Google and created by former Yahoo! engineer and current chief architect at Cloudera, Doug Cutting, entered the IT marketplace. The initial focus of Hadoop was to improve and scale storage and processing during search indexing. Early adopters quickly realized that Hadoop, at its core, was more than just a system for building search indexes. The platform addressed the data needs of Facebook and Yahoo! as well as many others in the web and online advertising space. Hadoop offered deep, stable foundations for growth and opportunity. It was the answer to the coming wave of Big Data.

For these reasons, Cloudera was formed to focus on growing the Hadoop-based technology platform to meet the Big Data challenges the rest of the world would soon face. Cloudera has propelled Apache Hadoop to become the premier technology for real-time and batch-oriented processing workloads on extremely large and hybrid data sets. With Cloudera Enterprise, Hadoop becomes the central system in which organizations can solve end-to-end data problems that involve any combination of data ingestion, storage, exploration, processing, analytics, and serving.

With Cloudera Enterprise, Hadoop becomes the central system in which organizations can solve end-to-end data problems.

Figure 1. “Data Brain” Lifecycle: ingest, store, explore, process, analyze, serve.


ANATOMY OF THE PLATFORM

Apache Hadoop is open source software that couples elastic and versatile distributed storage with parallel processing of varied, multi-structured data using industry standard servers. Hadoop also has a rich and diverse ecosystem of supporting tools and applications.

> Core Architecture: The core of Hadoop is an architecture that marries self-healing, high-bandwidth clustered storage (Hadoop Distributed File System, or HDFS) with fault-tolerant distributed processing (MapReduce). These core components of processing power and storage capacity scale linearly as additional servers are added to a Hadoop cluster. Data housed in HDFS is divided into smaller parts, called blocks, which are distributed and replicated across the nodes of the cluster to ensure data reliability and availability. The MapReduce framework operates in a similar fashion; it divides work into input splits that align with those blocks and exploits the block distribution during code execution to minimize data movement and ensure optimal data availability. Deploying Hadoop means no practical limit on data volume, and computing that is both immediate and usable.

The underlying storage in HDFS is a flexible file system that accepts any data format and stores information permanently. Hadoop supports pluggable serialization that avoids normalization or restructuring, allowing efficient and reliable storage in the data’s original format. As a result, if an application needs to reprocess a data set or read data in a different format, the original data is both local and in its high-fidelity, native state. Hadoop reads the format at query time, a process known as “schema on read” or late-binding, which offers a significant advantage over traditional systems that require data to be formatted first (“schema on write,” or early-binding) before storage and processing. This latter approach often loses relevant but latent details and requires that an organization re-run the time-consuming full data lifecycle processing to regain any lost information.
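As a rough illustration of how MapReduce pairs with schema on read, the sketch below uses Hadoop Streaming (which lets map and reduce tasks be plain scripts) to parse raw web server log lines at read time and count requests per URL. The field position, file paths, and script names are hypothetical, and this is a minimal sketch rather than a production job; the point is that the raw bytes in HDFS are only interpreted when the mapper runs.

```python
#!/usr/bin/env python
# mapper.py -- a minimal "schema on read" sketch for Hadoop Streaming.
# Assumes whitespace-delimited log lines with the requested URL in the
# seventh field (a hypothetical layout; adjust to the real log format).
import sys

def main():
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 7:
            continue           # malformed line: skip it; the raw data stays in HDFS
        url = fields[6]        # interpret the bytes only now, at read time
        print("%s\t1" % url)   # emit key<TAB>value for the shuffle phase

if __name__ == "__main__":
    main()
```

```python
#!/usr/bin/env python
# reducer.py -- sums the counts for each URL; Hadoop Streaming delivers
# the mapper output to the reducer grouped and sorted by key.
import sys

def main():
    current_url, count = None, 0
    for line in sys.stdin:
        url, _, value = line.rstrip("\n").partition("\t")
        if url != current_url:
            if current_url is not None:
                print("%s\t%d" % (current_url, count))
            current_url, count = url, 0
        count += int(value)
    if current_url is not None:
        print("%s\t%d" % (current_url, count))

if __name__ == "__main__":
    main()
```

A pair like this would typically be submitted with the streaming jar that ships with the distribution, for example: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/raw_logs -output /data/url_counts (jar location and paths are illustrative).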

Deploying Hadoop means no practical limit on data volume, and computing that is both immediate and usable.

Figure 2. Storage and Compute in Hadoop: HDFS distributes replicated data blocks across the nodes of a cluster (Node A through Node E), and MapReduce distributes computation across those same nodes, transforming an input file into an output file.


Hadoop runs on industry standard hardware, and the cost per terabyte of Hadoop-based storage is typically about one-tenth that of traditional relational technology. Hadoop uses servers with local storage, thereby optimizing for high I/O workloads. Servers are connected using standard gigabit Ethernet, which lowers overall system cost yet still allows near limitless storage and processing, thanks to Hadoop’s scale-out features. In addition to its use of local storage and standard networking, Hadoop can reduce total hardware requirements, since a single cluster provides both storage and processing.

> Extending the Core: Over time, the Apache Hadoop ecosystem has matured to make these foundational elements easier to use:

> Higher-level languages, like Apache Pig for procedural programming and Apache Hive for SQL-like manipulation and query (HiveQL), streamline integration and ease adoption for non-developers.

> Data acquisition tools, like Apache Flume for log file and stream ingestion and Apache Sqoop for bi-directional data movement to and from relational databases, broaden the range of data available within a Hadoop cluster.

> End user access tools, like Cloudera Hue for efficient, user-level interaction and Apache Oozie for workflow and scheduling, give IT operations and users alike direct and manageable means to maximize their efforts.

> For high-end, real-time serving and delivery, the ecosystem includes Apache HBase, a distributed, column-family database.

> Cloudera has further extended the distributed processing architecture beyond batch analysis with the introduction of Impala.

Cloudera Impala is the next step in real-time query engines, allowing users to query data stored in HDFS and HBase in seconds via a SQL interface. It leverages the metadata, SQL syntax, ODBC driver, and Hue user interface from Hive. Rather than using MapReduce, Impala uses its own processing framework to execute queries. The result is a 10x-50x performance improvement over Hive, enabling interactive data exploration.
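To give a feel for what this interactivity looks like from client code, here is a minimal sketch using the open source impyla client, one of several ways to reach Impala; impyla itself, the host name, port, and table are assumptions for illustration and do not come from this paper.

```python
# A minimal sketch using the impyla client (DB API 2.0 style). The host,
# port, table, and column names below are placeholders, not values from
# this paper.
from impala.dbapi import connect

conn = connect(host="impalad.example.com", port=21050)  # assumed impalad host and default port
cursor = conn.cursor()

# The same HiveQL-flavoured SQL that Hive understands; Impala executes it
# with its own engine rather than compiling it to MapReduce.
cursor.execute("""
    SELECT account_id, COUNT(*) AS suspect_txns
    FROM transactions
    WHERE amount > 10000
    GROUP BY account_id
    ORDER BY suspect_txns DESC
    LIMIT 20
""")

for account_id, suspect_txns in cursor.fetchall():
    print(account_id, suspect_txns)

cursor.close()
conn.close()
```

Because the data never leaves HDFS or HBase, an analyst can change the WHERE clause and re-run this loop repeatedly without exporting anything to a separate system.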

Cloudera Impala is the next step in real-time query engines.

Figure 3. Improved response times with Cloudera Impala versus Hive/MapReduce for typical fraud analysis queries on 300 GB and 1,000 GB data sets (average time in seconds).


ESSENTIALS OF SUCCESS

What makes a successful Big Data platform? Based on years of experience as the leading vendor and solution provider for Hadoop, Cloudera defines four requirements for a successful platform: volume, velocity, variety, and value. While competing systems and technologies satisfy some of these demands, all have shortcomings that ultimately make them inadequate platforms for Big Data.

> Volume: Big Data is just that – data sets so massive that typical software systems are incapable of economically storing, let alone managing and computing, the information. A Big Data platform must capture and readily provide such quantities in a comprehensive and uniform storage framework to enable straightforward management and development.

While scalable data volume is a common refrain from vendors, and many systems claim to handle petabyte- and exabyte-scale data stores, these statements can be misleading. The only commercially available system proven to reach 100 PB is built on Apache Hadoop. For other systems that do approach these volumes, the typical architectural pattern is to split or shard the data into infrastructure silos to overcome performance and storage issues. Others tie together multiple systems via federation and other virtual means, and are typically subject to network latency, capability mismatches, and security constraints.

> Velocity: As organizations continue to seek new questions, patterns, and metrics within their data sets, they demand rapid and agile modeling and query capabilities. A Big Data platform should maintain the original format and precision of all ingested data to ensure full latitude for future analysis and processing cycles. The platform should deliver this raw, unfettered data at any time during these cycles.

This requirement is a true “litmus test” for systems claiming the title of a Big Data platform. If data import requires a schema, then most likely the system has static schemas and proprietary serialization formats that are incapable of easy and rapid changes. Such models make answering the “unknown unknowns” challenge extremely difficult. This is a key differentiator between legacy relational technology systems and most Big Data solutions.

> Variety: One of the tenets of Big Data is the exponential growth of unstructured data. The vast majority of data now originates from sources with either limited or variable structure, such as social media and telemetry. A Big Data platform must accommodate the full spectrum of data types and forms.

Some solutions highlight their flexibility with both unstructured and structured data, but in reality most employ opaque binary large object (BLOB) storage to dump unstructured data wholesale into columns within rigid relational schemas. In essence, the database becomes a file system, and while this technique appears to meet the goal of data flexibility, the system overhead undermines the economics and degrades performance. Relational technologies are simply not the right tool to handle a wide variety of formats, especially variable ones. Some systems support native XML; however, this is a single data format and suffers the same disadvantages as its relational counterparts.

One of the tenets of Big Data is the exponential growth of unstructured data.


> Value: Driving relevant value, whether as revenue or cost savings, from data is the primary motivator for many organizations. The popularity of long tail business models has forced companies to examine their data in detail to find the patterns, affiliations, and connections that drive these new opportunities. Data scientists and developers need the full fidelity of their data, not a clustered sampling, to pursue these opportunities; otherwise they risk omitting a potential match that could prove wildly successful or downright catastrophic. A Big Data platform should offer organizations a range of languages, frameworks, and entry points to explore, process, analyze, and serve their data in pursuit of these goals.

Some practitioners state that they have been providing this capability for many years, yet most have been using only SQL, which as a query language is not ideal for data processing. While user-defined functions (UDFs), code embedded within a query to extend its capabilities, do enhance SQL, organizations can only exploit the full power of data processing through a Turing-complete language, such as Java, Python, or Ruby, within a MapReduce job.

Hadoop meets and exceeds all requirements of a Big Data platform:

> Hadoop houses all data together under a single namespace and metadata model, on a single set of nodes, with a single security and governance framework, on linearly scalable, industry standard hardware.

> Hadoop is format agnostic due to its open and extensible data serialization framework and employs the “schema on read” approach, which allows the ingestion and retrieval of any and all data formats in their native fidelity.

> Hadoop, through its “schema on read” and format-free approach, provides complete control over changes in data formats at any time and at any point during query and processing.

> Hadoop, its MapReduce framework, and the broader ecosystem, including Apache Hive, Apache Pig, and Cloudera Impala, grant developers and analysts a diverse yet inclusive set of both low-level and high-level tools for manipulating and querying data. With Hadoop, organizations can support multiple, simultaneous formats and analytic approaches.

Only Apache Hadoop offers all these features, and with Cloudera Enterprise, organizations benefit from a single, centralized management console with a single set of dependencies from one vendor, while still enjoying the advantages of open source software, like code transparency and no vendor lock-in. The end result is streamlined management for operators; batch, iterative, and real-time analysis for data consumers; and faster return on investment for forward-thinking IT leaders.

Data scientists and developers need the full fidelity of their data.


A DATA PLATFORM

Apache Hadoop and Cloudera offer organizations immediate opportunities to maximize their investment in Big Data and help establish foundations for future growth and discovery. Cloudera sees three near-term activities for Hadoop in the modern enterprise: optimized infrastructure, predictive modeling, and data exploration.

> Optimized Infrastructure: Hadoop can improve and accelerate many existing IT workloads, including archiving, general processing, and, most notably, extract-transform-load (ETL) processes. Current technologies for ETL tightly couple schema mappings within the processing pipeline. If an upstream structure breaks or inadvertently changes, the error cascades through the rest of the pipeline. Changes typically require an all-or-nothing approach, which results in longer cycles to make adjustments or fixes.

Using MapReduce, all stages in the data pipeline are persisted to local disk, which offers a high degree of fault tolerance. Stage transitions gain flexibility via the “schema on read” capability. These two features enable iterative and incremental updates to processing flows, as transitional states in MapReduce are related but not dependent on each other. Thus, as developers encounter errors, updates may be applied and processing restarted at the point of failure rather than at the start of the entire pipeline.

Current ETL practices also involve considerable network traffic as data is moved into and out of the ETL grid. This movement translates into either high latency or high provisioning costs. With Hadoop and the MapReduce framework, computing is performed where the data resides, confining the expense of moving large volumes of data to the initial ingestion stage.
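To make the restart-at-the-point-of-failure idea concrete, the sketch below drives a two-stage pipeline of Hadoop Streaming jobs and skips any stage whose output is already persisted in HDFS, so a rerun after a failure resumes at the stage that broke rather than at the beginning. The stage names, scripts, HDFS paths, and streaming jar location are all illustrative assumptions, not details from this paper.

```python
# Illustrative driver for a staged ETL pipeline on Hadoop Streaming.
# Stage scripts, HDFS paths, and the streaming jar location are placeholders.
import subprocess

STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"  # location varies by install

STAGES = [
    # (name, mapper script, reducer script, HDFS input, HDFS output)
    ("parse",  "parse_map.py",  "parse_reduce.py",  "/etl/raw",    "/etl/parsed"),
    ("enrich", "enrich_map.py", "enrich_reduce.py", "/etl/parsed", "/etl/enriched"),
]

def stage_completed(out_dir):
    # A successfully finished MapReduce job leaves a _SUCCESS marker in its output directory.
    return subprocess.call(["hadoop", "fs", "-test", "-e", out_dir + "/_SUCCESS"]) == 0

def run_stage(name, mapper, reducer, in_dir, out_dir):
    # Clear any partial output left by a failed attempt, then rerun just this stage.
    subprocess.call(["hadoop", "fs", "-rm", "-r", out_dir])
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-files", "%s,%s" % (mapper, reducer),
        "-mapper", mapper,
        "-reducer", reducer,
        "-input", in_dir,
        "-output", out_dir,
    ])

for stage in STAGES:
    name, out_dir = stage[0], stage[4]
    if stage_completed(out_dir):
        print("skipping completed stage:", name)   # its output is already persisted in HDFS
        continue
    print("running stage:", name)
    run_stage(*stage)
```

Because each stage reads the previous stage's persisted HDFS output, a fix to the enrich step does not force the parse step to be rerun.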

While Hadoop offers many advantages for organizations, it is not a wholesale replacement for traditional relational systems and other storage and analysis solutions. Rather, Hadoop is a strong complement to many existing systems. The combination of these technologies offers enterprises tremendous opportunities to maximize IT investments and expand business capabilities by aligning IT workloads to the strengths of each system.

Apache Hadoop and Cloudera offer organizations immediate opportunities to maximize their investment in Big Data.

Figure 4. Hadoop in the Enterprise: Cloudera Hadoop, administered through Cloudera Manager, ingests data from system logs, web logs, files, and RDBMSs, and connects to metadata/ETL tools, the enterprise data warehouse, online serving systems, and web/mobile applications; it is used through developer tools, data modeling, and BI/analytics and enterprise reporting by data architects, system operators, engineers, data scientists, analysts, and business users.


For example, many data warehouses run workloads that are poorly aligned with their strengths because organizations had no other alternatives. Now, organizations can shift many of those tasks, such as large-scale data processing and the exploration of historical data, to Hadoop. The unburdened data warehouse is then free to focus on its specialized workloads, like current operational analytics and interactive online analytical processing (OLAP) reporting, while still benefiting from the processing and output of the Hadoop cluster.

This architectural pattern has several benefits, including a lower cost to store massive data sets, faster data transformations of large data sets, and a reduced data load into the data warehouse, which results in faster overall ETL processing and greater data warehouse capacity and agility. In short, each system – the data warehouse and Hadoop – focuses on its strengths to achieve business goals.

> Predictive Modeling: Hadoop is an ideal system for gathering and organizing large volumes of varied data, and its processing frameworks provide data scientists and developers a rich toolset for extracting signals and patterns from bodies of disparate knowledge.

Organizations can exploit Hadoop’s collection tools, like Flume and Sqoop, to import a sufficient corpus and use tools like Pig, Hive, Apache Crunch, DataFu, and Oozie to execute profiling, quality checks, enrichment, and other necessary steps during data preparation. Model fitting efforts can employ common implementations, such as recommendation engines and Bayes classifiers, using Apache Mahout, which is built upon MapReduce, or construct models directly in MapReduce itself. Organizations can use the same collection of data preparation tools for validation steps too. Commonly, the resulting cleansed data set is exported to a specialized statistical system for final computation and service.
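As a hedged sketch of one such data-preparation step, the Hadoop Streaming reducer below rolls cleansed purchase events up into simple per-user features. The input layout (user_id<TAB>amount per record) and the chosen features are hypothetical; the output is the kind of per-user summary table that could be fed to Mahout or exported to a statistical system for final model fitting.

```python
# A sketch of one data-preparation step: a Hadoop Streaming reducer that
# rolls cleansed purchase events (user_id<TAB>amount, a hypothetical layout)
# up into simple per-user features for a downstream recommender or classifier.
import sys

def emit(user, count, total, largest):
    # Columns: user_id, purchase_count, total_spend, largest_purchase
    print("%s\t%d\t%.2f\t%.2f" % (user, count, total, largest))

current_user = None
count, total, largest = 0, 0.0, 0.0

for line in sys.stdin:
    user, _, amount = line.rstrip("\n").partition("\t")
    try:
        value = float(amount)
    except ValueError:
        continue  # quality check: drop records that fail to parse
    if user != current_user:
        if current_user is not None:
            emit(current_user, count, total, largest)
        current_user, count, total, largest = user, 0, 0.0, 0.0
    count += 1
    total += value
    largest = max(largest, value)

if current_user is not None:
    emit(current_user, count, total, largest)
```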

> Data Exploration: While Hadoop is a natural platform for large and dynamic data set analytics, the platform’s batch processing framework, MapReduce, has not always fit within an organization’s interactivity and usability requirements. The design of MapReduce emphasized processing capabilities rather than rapid exploration and ease of use. The introduction of HBase was the first step towards low-latency data delivery, while Hive offered a SQL-based experience on top of MapReduce. Despite these advancements, developers and data scientists still lacked an interactive data exploration tool that was native to Hadoop, so they often shifted these workloads to traditional, purpose-built relational systems.

With the addition of Cloudera Impala, Hadoop-based systems have entered the world of real-time interactivity. By allowing users to query data stored in HDFS and HBase in seconds, Impala makes Hadoop usable for iterative analytical processes. Now, developers and data scientists can interact with data at interactive speeds without migrating it out of Hadoop.


THE ROAD AHEAD

The evolution of Hadoop as an enterprise system is accelerating, as clearly demonstrated by innovations like HBase, Hive, and now Impala. The road ahead is one of convergence. When powered by a unified, fully audited, centrally managed solution like Cloudera Enterprise, immediate opportunities – optimized infrastructure, predictive modeling, and data exploration – become the stepping stones to achieving Hammerbacher’s vision of “a place to put all our data, no matter what it is, extract value, and be intelligent about it.”

This goal is within sight; Hadoop is now the scalable, flexible, and interactive data refinery for modern enterprises and organizations. Cloudera Enterprise is the platform for solving demanding, end-to-end data problems. Cloudera Enterprise empowers people and businesses with:

> Speed-to-Insight through iterative, real-time queries and serving;

> Usability and Ecosystem Innovation with low-latency query engines and powerful SQL-based interfaces and ODBC/JDBC connectors;

> Discovery and Governance by using common metadata and security frameworks;

> Data Fidelity and Optimization resulting from local data and compute proximity that brings analysis to “on read” data where needed;

> Cost Savings from lower costs per terabyte, reduced lineage tracking across systems, and agile data modeling.

With Cloudera, people now have access to responsive and comprehensive high-performance storage and analysis from a single platform. People are free to explore the unknowns as well as the knowns in a single platform. People get answers as fast as they ask questions. It is time to ask bigger questions.

Hadoop is now the scalable, flexible, and interactive data hub for modern enterprises and organizations.


About Cloudera

Cloudera, the leader in Apache Hadoop-based software and services, enables data-driven enterprises to easily derive business value from all their structured and unstructured data. As the top contributor to the Apache open source community and with tens of thousands of nodes under management across customers in financial services, government, telecommunications, media, web, advertising, retail, energy, bioinformatics, pharma/healthcare, university research, oil and gas, and gaming, Cloudera's depth of experience and commitment to sharing expertise are unrivaled.

Cloudera provides no representations or warranties regarding the accuracy, reliability, or serviceability of any information or recommendations provided in this publication, or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS, and the use of this information or the implementation of any recommendations or techniques herein is a customer’s responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment.

©2013 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies. Information is subject to change without notice.
