Cloudera in the Public Cloud

Deployment Options for the Enterprise Data Hub

WHITE PAPER

Executive Summary

The Case for Public Cloud

Public Cloud vs. On-Premise

Public Cloud Deployment Patterns

Cloudera Director: Hadoop in the Cloud Without Compromise

The Cloudera Difference

About Cloudera

Executive Summary

Information-driven enterprises have long held the common business and IT objective of unified data management to improve insight and build knowledge. For many, the conventional data warehouse and data mart built on relational technology offered the only avenue to enterprise-grade analytics, while storage arrays and archives provided the only methods for keeping diverse data accessible for longer time periods. Today, these organizations have a better way to address the challenge of data management with an enterprise data hub (EDH). The Cloudera enterprise data hub, built with Apache Hadoop, provides a flexible, scalable, and economical data management platform that can perform a variety of enterprise workloads (including batch processing, interactive SQL, enterprise search, advanced analytics, and more) on a single, shared copy of data on a common storage substrate.

Enterprises are embracing the enterprise data hub as the centerpiece of their data management strategy, and they are evaluating the public cloud as a deployment option. While deployment choice does not fundamentally change the architecture of the enterprise data hub, the additional benefit of on-demand provisioning and elasticity in the public cloud does open new possibilities for this evolution in data management.

On-demand provisioning and elasticity in the public cloud open new possibilities for Cloudera’s enterprise data hub, yet this deployment option does not fundamentally change the architecture.

[Figure: the enterprise data hub. Labels: Discover, Model, Serve, Process, Security and Administration, Unlimited Storage]

Organizations that realize an enterprise data hub with Cloudera gain numerous benefits, including the full technology stack and ecosystem offerings built for Hadoop, comprehensive system and data management tools, and limitless data storage (fine-grained, durable, readily available, and cost-effective) for all data. Moreover, enterprise IT teams receive mission-critical support for their EDH systems, so business users have confidence that the data and applications are ready and able to meet the challenges in today’s environment.

With Cloudera, enterprises can bring this same EDH experience to their cloud installations with no restrictions on their choice of cloud vendor. Deploying a Cloudera EDH to the cloud means that organizations can leverage the elasticity and on-demand consumption models best suited to their particular business and processing needs, yet still profit from the advantages offered by the EDH.

The Case for Public Cloud

The public cloud is a set of compute, storage, and networking resources, ranging from bare-bones architecture to fully automated infrastructure-as-a-service stacks, that a service provider offers to the general public through an on-demand model. The value and importance of the public cloud, and of cloud computing in general, have grown as more enterprises discover the convenience and flexibility of this deployment platform. Enterprises consider a number of key business drivers when weighing the public cloud option.

Procurement and Capacity

Enterprise IT teams typically need flexibility with proofs of concept (POCs), pilots, and trials to demonstrate the proper architecture for an enterprise data hub. As a result, enterprises tend to build their production environments after the POC completes as a way to mitigate the capital risk associated with procurement. Public cloud environments meet these needs well: enterprises can provision and change their evaluation environments very quickly, use them for the duration of the POC, incur only limited usage costs, and avoid misaligned hardware purchases. IT teams can thus develop the right architecture and configuration with minimal capital exposure and then confidently procure and provision the on-premise production environment.

Enterprises that can procure infrastructure quickly to deploy an enterprise data hub in production sometimes encounter physical capacity constraints in their data center. These organizations often use the public cloud to gain the needed capacity and avoid provisioning delays.

Furthermore, an organization’s first foray into enterprise data hub deployment is typically non-production, where the focus of the effort is on evaluation as well as training for cluster management, data management, and the various frameworks in the EDH. A POC or pilot program typically needs limited hardware to get started, and in most cases, time-to-value rather than performance via hardware is the most important criterion.

Strategic Flexibility

Enterprises often consider new projects and systems, like an EDH, candidates for public cloud deployment after adopting an infrastructure-level or corporate-level decision to embrace a cloud model. Some of the corporate drivers for such decisions include cloud backup, instant geo-locality, and elasticity. The case for Hadoop in the public cloud can be even stronger if the data itself is generated in the cloud, because deploying there minimizes data movement. Over time, enterprises might run clusters both in the public cloud and on-premise to find the set of features that best fits their business and technology needs, and thus the enterprise data hub will span these two environments.

As enterprise IT leaders plan their enterprise data hub strategy, they need to ensure that their choice of cloud vendor does not dictate the EDH strategy, and vice versa, and they should avoid running a different EDH with each cloud vendor. These deployment considerations might not be immediate, but they are critical to a forward-thinking and adaptable IT strategy.

Acute capacity constraints in the data center and the relative importance of time-to-value rather than performance are often key drivers to public cloud deployment decisions.

Public Cloud vs. On-Premise

The decision to use public cloud infrastructure for an enterprise data hub is a fairly simple one for IT teams who have an immediate need for storage and computing or who are driven by an organization-wide initiative. For those weighing their options between on-premise and public cloud, there are several criteria to consider in deciding on the best deployment route.

Data Location

Where is the data generated? Data can be viewed as having “mass” and thus can prove difficult (and expensive) to move from storage to computing. If the EDH is not the primary location for data, best practices suggest establishing the enterprise data hub as close as possible to where the data is generated or stored to help mitigate the costs and effort, especially for the large volumes that are common to EDH workloads. That said, IT teams should explore the nature and use of the data closely, as volume and velocity might allow for streaming in small quantities or transfers of large, single blocks to an on-premise environment. Often, if data is generated in the public cloud or if the data is stored long term in cloud storage, such as an object store for backup or geo-locality, public cloud deployment becomes a more natural choice.
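The “mass” of data can be made concrete with a rough transfer-time estimate. The following is a minimal sketch rather than a Cloudera tool: the function name, the decimal-terabyte convention, and the efficiency factor (the fraction of nominal link bandwidth actually sustained) are all illustrative assumptions.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(data_terabytes: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Estimate the days needed to move a dataset over a network link.

    data_terabytes: dataset size in decimal terabytes (1 TB = 10**12 bytes).
    link_gbps:      nominal link speed in gigabits per second.
    efficiency:     assumed fraction of nominal bandwidth sustained (0 < e <= 1).
    """
    if data_terabytes < 0 or link_gbps <= 0 or not (0 < efficiency <= 1):
        raise ValueError("size must be >= 0, bandwidth > 0, and 0 < efficiency <= 1")
    total_bits = data_terabytes * 1e12 * 8
    effective_bits_per_second = link_gbps * 1e9 * efficiency
    return total_bits / effective_bits_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical case: 100 TB over a 1 Gbps link at 70% sustained throughput.
    print(f"{transfer_days(100, 1):.1f} days")
```

An estimate of this kind helps indicate whether periodic bulk transfers to an on-premise cluster are practical or whether the cluster should sit next to the data.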

Workload Types

What are the workload characteristics? For periodic batch workloads such as MapReduce jobs, enterprises can realize cost savings by running the cluster only for the duration of the job and paying for the usage, as opposed to keeping the cluster active at all times. This is especially true if the workload runs only a couple of hours a day or a couple of days a week. For workloads that have continuous and long-running performance needs, such as Apache HBase and Cloudera Impala, the overhead of commissioning and decommissioning a cluster for the term of the event may not be justified.
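The trade-off between a cluster that runs only for the duration of each job and one that stays up can be sketched as a simple cost comparison. This is an illustration under stated assumptions: the hourly rates, the per-run commissioning overhead, the always-on discount, and the 730-hour month are hypothetical inputs, not published prices.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours


def on_demand_monthly_cost(nodes: int, hourly_rate: float, job_hours_per_run: float,
                           runs_per_month: int, overhead_hours_per_run: float = 0.5) -> float:
    """Cost of provisioning a cluster for each run and tearing it down afterwards.

    overhead_hours_per_run models the billable time spent commissioning and
    decommissioning the cluster around each job (an assumption).
    """
    billable_hours = (job_hours_per_run + overhead_hours_per_run) * runs_per_month
    return nodes * hourly_rate * billable_hours


def always_on_monthly_cost(nodes: int, hourly_rate: float, discount: float = 0.0) -> float:
    """Cost of keeping the cluster up all month, with an optional committed-use discount."""
    return nodes * hourly_rate * (1.0 - discount) * HOURS_PER_MONTH


if __name__ == "__main__":
    # Hypothetical: a 10-node cluster at $0.50 per node-hour, one 2-hour job per day.
    periodic = on_demand_monthly_cost(10, 0.50, job_hours_per_run=2, runs_per_month=30)
    steady = always_on_monthly_cost(10, 0.50)
    print(f"run-per-job: ${periodic:,.2f}  always-on: ${steady:,.2f}")
```

As the job hours per month approach the length of the month, or as the per-run overhead grows, the gap narrows, which is the situation described above for continuous workloads.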

Performance Demands

What are the performance needs? One of the underlying tenets of Hadoop is tightly coupled units of compute and local storage that scale out linearly and simultaneously. This computation proximity enables Hadoop to parallelize the workload and significantly accelerate the processing of massive amounts of data within a short period of time. However, a common foundation of cloud architectures is pools of shared storage and virtualized compute capacity that are connected via a network pipe.

These capabilities scale independently, but the network adds latency, and shared storage can become a performance bottleneck for a high-throughput MapReduce job. The exact performance needs vary from workload to workload. The ecosystem of cloud vendors offers enterprises many architectural options and configurations, from fully virtual instances to standalone, bare-metal systems, that can address the particular needs of a workload more directly. For example, IT teams should examine the proximity of storage to compute as well as the degree of shared resources within the service as potential factors in performance. Performance is often an important criterion when processing the large volumes of data typical of Hadoop workloads. For non-production, development, or test workloads, this factor

Data location, like cloud-based storage, and types of workloads, like periodic batch processing, are strong influencers on the decision to deploy into the public cloud, yet many see the total cost of ownership (in terms of rapid procurement and provisioning of resources and the associated opportunity costs) as the most important motivator.

Cloud TCO

What is the difference in Total Cost of Ownership (TCO)? Calculating the TCO of a public cloud deployment can extend beyond the options for compute, storage, data transfer, and the pricing thereof. A good starting point to narrow down the options is to use reference architectures from Cloudera for the cloud environment of choice. Based on the options from the reference architecture best suited for the workload or workloads, enterprises can further develop their expected usage patterns and arrive at a more accurate TCO for deploying an EDH in the public cloud. Cloudera and its partners can further assist with TCO evaluations for any environment, including those that span on-premise and public cloud.
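Once expected usage patterns are known, the compute, storage, and data transfer line items named above can be combined into a first-pass monthly estimate. The sketch below is an assumption-laden illustration: the class names, fields, and prices are hypothetical, and a full TCO would include further items (support, staffing, licensing) that are out of scope here.

```python
from dataclasses import dataclass


@dataclass
class UsagePattern:
    node_hours: float  # total instance-hours consumed per month
    stored_tb: float   # average terabytes held in cloud storage
    egress_tb: float   # terabytes transferred out of the cloud per month


@dataclass
class PriceSheet:
    per_node_hour: float  # price per instance-hour (hypothetical)
    per_tb_month: float   # price per stored terabyte per month (hypothetical)
    per_tb_egress: float  # price per terabyte transferred out (hypothetical)


def monthly_cost(usage: UsagePattern, prices: PriceSheet) -> dict:
    """Return a per-line-item cost breakdown plus the total."""
    breakdown = {
        "compute": usage.node_hours * prices.per_node_hour,
        "storage": usage.stored_tb * prices.per_tb_month,
        "transfer": usage.egress_tb * prices.per_tb_egress,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    usage = UsagePattern(node_hours=1000, stored_tb=50, egress_tb=5)
    prices = PriceSheet(per_node_hour=0.50, per_tb_month=25.0, per_tb_egress=90.0)
    for item, cost in monthly_cost(usage, prices).items():
        print(f"{item:>8}: ${cost:,.2f}")
```

Keeping the breakdown by line item makes it easier to see which usage assumption dominates the estimate before refining it against a reference architecture.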

Public Cloud Deployment Patterns

The decision to employ a public cloud as part of a company’s IT strategy is typically driven by a number of independent factors, and an EDH is commonly a component of this larger process. However, there are a number of cases where a Hadoop-based EDH is especially well suited to the elasticity of cloud computing and is itself the driver of a cloud deployment model. Examples such as the parallel processing desired for search indexing and interactive query, and the temporary influx of workload for batch processing, coalesce into two primary deployment patterns that take advantage of EDH cloud environments.

Long-Running Clusters

The full-fidelity data experience of the enterprise data hub is based on the concept of collocated storage and compute on a cluster of industry-standard servers. This tenet implies a long-running cluster within the cloud environment that provides the base storage for the data and the compute power for typical day-to-day activities, and this type of cluster is not very different from a typical on-premise deployment. The EDH, once established in the cloud, is managed exactly as an on-premise deployment, but there are some unique benefits to the cloud environment.

For example, one key advantage is that IT teams can provision new capacity with a few simple commands. In a matter of minutes, enterprise IT teams can bring online a new cluster that meets additional business needs or grow the storage or computing capacity of an existing cluster for a current business process. Enterprises gain IT agility without having to worry about data center capacity issues and long procurement processes.

A further benefit of a cloud environment is that enterprises are not restricted to current server or cluster configurations if business needs change. For a typical on-premise environment, IT teams must determine CPU, memory, and disk capacity at the time of procurement and often purchase servers with more capacity than currently necessary to “future proof” the infrastructure investment. In the cloud model, however, IT administrators can provision servers with different configurations at will. Enterprises can therefore provision clusters exactly as needed for today, not tomorrow, thus maximizing working capital, yet also adapt to changing business needs by allocating new servers with more CPU, memory, or disk and decommissioning older or obsolete servers.

Separating metadata from data gives Hadoop a scalable design for achieving high availability and tunable replication without sacrificing performance.

[Figure: business services and data on provisioned servers, shown for cloud and on-premise deployments]

Periodic and Transient Workloads

Even when operating a long-running cluster, businesses might need additional capacity for periodic workloads. Monthly or bi-weekly reporting processes are typical examples of such additional computing capacity needs. Once an enterprise has established a production EDH in the cloud, IT teams can dynamically grow and shrink computing capacity in response to these periodic jobs. Administrators simply commission the new “report” servers as needed, process the reports, store the resulting information back into the EDH, and then decommission the servers. This periodic lifecycle translates into reduced costs: instead of paying for extra machines that are only partially utilized, an enterprise pays for only the hours utilized.

Some workloads are even more transient and might not require a long-running cluster. For example, an organization may have a large amount of data to process whose results might require significant time to interpret as useful or to determine the next task. To procure servers for this kind of transient or sporadic activity might not make economic sense for some organizations. The cloud offers a compelling solution to this type of workload by combining rapid cluster provisioning and low-cost storage capabilities, such as Amazon S3. In this workload lifecycle, administrators provision a Hadoop cluster, import the data from a cloud object store, process the data, write the result back to the object store, and then decommission the cluster. This approach can be very cost-effective when processing massive amounts of data if the workload is highly transient. For the occasional execution of batch jobs, elastic cloud environments might be more cost-efficient than dedicated long-running clusters. However, IT administrators should consider that multiple users might run periodic, transient jobs against the same dataset that is stored in an object store, for example. In this situation, the aggregate utilization of the cluster is a more relevant metric for calculating the cost benefits. IT teams might discover that “always-on” clusters are more economical than ones repeatedly provisioned for each user.
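The point about aggregate utilization can be illustrated with a small comparison between per-job transient clusters and one shared always-on cluster. This is a sketch under simplifying assumptions: every job uses a cluster of the same size and rate, each transient cluster pays a fixed import time to reload the dataset from the object store, and the shared cluster can absorb all jobs within the period. The function names and figures are hypothetical.

```python
from typing import Sequence


def aggregate_utilization(job_hours: Sequence[float], period_hours: float) -> float:
    """Fraction of the period a single shared cluster would spend running jobs."""
    return sum(job_hours) / period_hours


def transient_cost(job_hours: Sequence[float], import_hours: float,
                   nodes: int, hourly_rate: float) -> float:
    """Each job provisions its own cluster and re-imports the dataset first."""
    cluster_hours = sum(hours + import_hours for hours in job_hours)
    return cluster_hours * nodes * hourly_rate


def always_on_cost(period_hours: float, nodes: int, hourly_rate: float) -> float:
    """One shared cluster stays up for the whole period; data is imported once."""
    return period_hours * nodes * hourly_rate


if __name__ == "__main__":
    week = 168.0
    jobs = [3.0] * 50  # hypothetical: 50 three-hour jobs from many users in one week
    print(f"aggregate utilization: {aggregate_utilization(jobs, week):.0%}")
    print(f"transient clusters:    ${transient_cost(jobs, 1.0, 10, 0.50):,.2f}")
    print(f"always-on cluster:     ${always_on_cost(week, 10, 0.50):,.2f}")
```

With few jobs the transient approach wins easily; as aggregate utilization rises and the repeated import overhead accumulates, the shared cluster becomes the cheaper option.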

[Figure: periodic workloads (reporting tasks run on temporary servers alongside provisioned servers and produce reports) and transient workloads (processing tasks run on temporary servers that import from and export to cloud storage)]

Cloudera Director: Hadoop in the Cloud Without Compromise

Cloudera Director, part of Cloudera’s platform, brings consistency and ease for users looking to deploy in the cloud, while still maintaining the benefits of Cloudera’s enterprise data hub. Cloudera Director is the first portable, self-service solution for deploying and managing enterprise-grade Hadoop in the cloud. It provides a single-pane-of-glass administration experience for central IT to reduce costs and deliver agility, and for end users to self-service provision and elastically scale clusters, all while ensuring auditability. Integrated with Cloudera’s enterprise data hub, users not only get all the features necessary for cloud deployments, but also continue to get all of the enterprise-grade features available with Cloudera’s platform, including the security, governance, and administration necessary for production-ready deployments.

With Cloudera Director, users can deploy one or more clusters in their preferred VPC environment, running on an EC2 instance. Cloudera Director offers the choice of a simple web user interface, command line interface (CLI), or REST API for deploying and managing CDH or Cloudera Enterprise clusters. The web UI provides a single dashboard view of all clusters deployed through Cloudera Director and includes a self-service experience for deploying, cloning, dynamically scaling, and terminating clusters. The CLI and API provide advanced support for more customized and complex cluster topologies that are well suited to a wider variety of workloads. Additionally, both administrators and users can repeatedly deploy multiple clusters on demand using cluster blueprints. This reliable, cloud-centric experience can be leveraged across multiple cloud providers, with current support available for Amazon Web Services and other cloud environments planned for future releases. Key benefits of Cloudera Director include:

Customer Benefit: Simplify Cluster Lifecycle Management
Unique Capability: Simple UI to spin up, scale, and spin down clusters
Enabling Features:
• Self-service spin up/teardown
• Dynamic scaling for spiky workloads
• Simple cloning of clusters
• Cloud blueprints for repeatable deployments

Customer Benefit: Eliminate Lock-in
Unique Capability: Flexible, open platform
Enabling Features:
• 100% open source Hadoop distribution
• Native support for hybrid deployments
• Third-party software deployment within same workflow
• Support for custom, workload-specific deployments

Customer Benefit: Accelerate Time-to-Value
Unique Capability: Enterprise-ready security and administration
Enabling Features:
• Support for complex cluster topologies
• Minimum size cluster when capacity constrained
• Management tooling
• Compliance-ready security and governance
• Backup and disaster recovery with an optimized cloud storage connector

Customer Benefit: Reduce Support Costs
Unique Capability: Monitoring & metering tools
Enabling Features:
• Multi-cluster dashboard

The long-term vision of Cloudera is to embrace the potential and flexibility of the hybrid model, where the enterprise data hub can operate transparently between on-premises, private cloud, and public cloud deployments. By bringing together a diverse partner ecosystem of cloud providers, Cloudera is helping customers bring Hadoop and the EDH to more enterprise users and applications. Cloudera continues to be the industry standard for next-generation enterprise data management and analytics, wherever data and workloads live. To learn more about Cloudera’s broad partner ecosystem, visit http://www.cloudera.com/content/cloudera/en/solutions/partner.html

About Cloudera

Cloudera is revolutionizing enterprise data management by offering the first unified Platform for Big Data, an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, access, process, secure, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data. Cloudera’s open source Big Data platform is the most widely adopted in the world, and Cloudera is the most prolific contributor to the open source Hadoop ecosystem. As the leading educator of

The Cloudera Difference

Enterprises that deploy a Cloudera enterprise data hub in the public cloud can leverage several benefits unique to Cloudera. Business and technology teams gain the same full-fidelity EDH experience as in an on-premise environment, from technology capabilities to system and data management tools, coupled with mission-critical support. Organizations also do not have to compromise on enterprise-grade capabilities such as data security, data governance, and the latest innovations in the Hadoop platform, such as Cloudera Impala, Apache Sentry, Cloudera Search, and others, when operating in the public cloud.

In addition, Cloudera has designed an expanded partner program that includes a cloud services and solution provider division called Cloudera Connect: Cloud. This program can meet the growing needs of organizations looking to optimize Hadoop deployments in cloud environments for unified data management and analytics, like the EDH, by offering the utmost flexibility in deployment, consumption, and choice of vendor.

Enterprises now have a choice of multiple pricing and support models for the enterprise data hub in the cloud. Organizations can choose either a traditional subscription model or a usage-based model for Cloudera’s offerings while purchasing infrastructure separately from the cloud partner. Alternatively, organizations can purchase both Cloudera products and cloud infrastructure directly through their cloud vendor of choice as one offering and pay one bill.

Moreover, IT strategists should anticipate EDH deployments in any environment, from on-premise to cloud, in order to meet more fully the particular demands and restrictions of a workload, data set, or business user. In all of these situations, the full-fidelity experience of an EDH and the continuity of the experience, no matter the environment, are critical to achieving maximum efficiency of applications and personnel. Cloudera is unique in providing this advantage to enterprises while leaving the choice of cloud vendor to the customer. With upcoming enhancements to the Cloudera product suite that streamline cloud operations, enterprises can easily leverage the elasticity and on-demand consumption models of the public cloud for their Hadoop installations and consider platforms like OpenStack and VMware for private cloud deployments.

Organizations need to consider multiple factors when deciding what part of the EDH footprint resides where. Cloudera is well positioned to help enterprises explore these factors and enable all available deployment options. With Cloudera, enterprises can take full advantage of the enterprise data hub and the next generation in data management across all deployment options and environments, from on-premise to public cloud.