Cloudera in the Public Cloud
Deployment Options for the Enterprise Data Hub
WHITE PAPER
Executive Summary
The Case for Public Cloud
Public Cloud vs. On-Premise
Public Cloud Deployment Patterns
Cloudera Director: Hadoop in the Cloud Without Compromise
The Cloudera Difference
About Cloudera
Executive Summary
Information-driven enterprises have long held the common business
and IT objective of unified data management to improve insight and
build knowledge. For many, the conventional data warehouse and
data mart built on relational technology offered the only avenue
to enterprise-grade analytics, while storage arrays and archives
provided the only methods for keeping diverse data accessible for
longer time periods. Today, these organizations have a better way
to address the challenge of data management with an enterprise
data hub (EDH). The Cloudera enterprise data hub, built with
Apache Hadoop, provides a flexible, scalable, and economical data
management platform that can perform a variety of enterprise
workloads—including batch processing, interactive SQL, enterprise
search, advanced analytics and more—on a single, shared copy of
data on a common storage substrate.
Enterprises are embracing the enterprise data hub as the
centerpiece of their data management strategy, and they are
evaluating the public cloud as a deployment option. While
deployment choice does not fundamentally change the architecture
of the enterprise data hub, the additional benefit of on-demand
provisioning and elasticity in the public cloud does open new
possibilities for this evolution in data management.
On-demand provisioning and
elasticity in the public cloud
open new possibilities for
Cloudera’s enterprise data hub,
yet this deployment option does
not fundamentally change the
architecture.
[Figure: the enterprise data hub. Labels: Discover, Model, Serve, Process; Security and Administration; Unlimited Storage.]
Organizations that realize an enterprise data hub with Cloudera
gain numerous benefits including the full technology stack and
ecosystem offerings built for Hadoop, comprehensive system and
data management tools, and limitless data storage—fine-grained,
durable, readily available, and cost-effective—for all data. Moreover,
enterprise IT teams receive mission-critical support for their EDH
systems, so business users have confidence that the data and
applications are ready and able to meet the challenges in today’s
environment.
With Cloudera, enterprises can bring this same EDH experience
to their cloud installations with no restrictions to their choice
of cloud vendor. Deploying a Cloudera EDH to the cloud means
that organizations can leverage the elasticity and on-demand
consumption models best suited to their particular business and
processing needs, yet still profit from the advantages offered by
the EDH.
The Case for Public Cloud
The public cloud is a set of compute, storage, and networking resources, ranging from bare-bones architecture to fully automated infrastructure-as-a-service stacks, that a service provider offers to the general public through an on-demand model. The value and importance of the public cloud, and of cloud computing in general, have grown as more enterprises discover the convenience and flexibility of this deployment platform. Enterprises consider several key business drivers when weighing the public cloud option.
Procurement and Capacity
Enterprise IT teams typically need flexibility during proofs of concept (POCs), pilots, and trials to demonstrate the proper architecture for an enterprise data hub. As a result, enterprises tend to build their production environments after the POC completes as a way to mitigate the capital risk associated with procurement. Public cloud environments meet these needs well: enterprises can provision and change their evaluation environments very quickly, use them for the duration of the POC, incur only limited usage costs, and avoid misaligned hardware purchases. IT teams can thus develop the right architecture and configuration with minimal capital exposure and then confidently procure and provision the on-premise production environment.
Enterprises that can procure infrastructure quickly to deploy an enterprise data hub in production sometimes still encounter physical capacity constraints in their data center. These organizations often use the public cloud to gain the needed capacity and avoid provisioning delays.
Furthermore, an organization’s first foray into enterprise data hub deployment is typically non-production, where the effort focuses on evaluation as well as training for cluster management, data management, and the various frameworks in the EDH. A POC or pilot program typically needs limited hardware to get started, and in most cases time-to-value, rather than performance via hardware, is the most important criterion.
Strategic Flexibility
Enterprises often consider new projects and systems, like an EDH, candidates for public cloud deployment after adopting an infrastructure-level or corporate-level decision to embrace a cloud model. Corporate drivers for such decisions include cloud backup, instant geo-locality, and elasticity. The case for Hadoop in the public cloud can be even stronger if the data itself is generated in the cloud, because deploying there minimizes data movement. Over time, enterprises might run clusters both in the public cloud and on-premise to find the set of features that best fits their business and technology needs, and thus the enterprise data hub will span these two environments.
As enterprise IT leaders plan their enterprise data hub strategy, they will need to ensure that their choice of cloud vendor does not dictate the EDH strategy, and vice versa, and they should avoid running a different EDH in each cloud vendor. These deployment considerations might not be immediate but are critical to a forward-thinking and adaptable IT strategy.
Acute capacity constraints in
the data center and the relative
importance of time-to-value rather
than performance are often key
drivers to public cloud deployment
decisions.
Public Cloud vs. On-Premise
The decision to use public cloud infrastructure for an enterprise data hub is a fairly simple one for IT teams who have an immediate need for storage and computing or who are driven by an organization-wide initiative. For those weighing their options between on-premise and public cloud, there are several criteria to consider in deciding on the best deployment route.
Data Location
Where is the data generated? Data can be viewed as having “mass” and thus can prove difficult (and expensive) to move from storage to computing. If the EDH is not the primary location for data, best practices suggest establishing the enterprise data hub as close as possible to where the data is generated or stored to help mitigate the cost and effort of moving it, especially for the large volumes that are common to EDH workloads. That said, IT teams should explore the nature and use of the data closely, as volume and velocity might allow for streaming in small quantities or transfers of large, single blocks to an on-premise environment. Often, if data is generated in the public cloud or stored long term in cloud storage, such as an object store for backup or geo-locality, public cloud deployment becomes the more natural choice.
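To make the idea of data “mass” concrete, the following minimal sketch estimates how long a bulk transfer would take over a network link. The function name, the decimal-terabyte convention, and the default efficiency factor of 0.8 are illustrative assumptions, not figures from Cloudera or any cloud vendor.

```python
def transfer_hours(data_terabytes: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move a dataset over a network link.

    efficiency is an assumed fraction of the nominal bandwidth that the
    transfer actually achieves (protocol overhead, contention, and so on).
    """
    if data_terabytes < 0:
        raise ValueError("data_terabytes must be non-negative")
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits_to_move = data_terabytes * 1e12 * 8          # decimal terabytes to bits
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_second / 3600


if __name__ == "__main__":
    # Hypothetical volumes moved over a 1 Gbps link.
    for terabytes in (1, 50, 500):
        print(f"{terabytes} TB over 1 Gbps: {transfer_hours(terabytes, 1.0):.1f} hours")
```

Under these assumptions, even a fully utilized 1 Gbps link needs more than nine days to move 100 TB, which is why placing the EDH near the data is often the cheaper option.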
Workload Types
What are the workload characteristics? For periodic batch workloads such as MapReduce jobs, enterprises can realize cost savings by running the cluster only for the duration of the job and paying for that usage, as opposed to keeping the cluster running at all times. This is especially true if the workload runs only a couple of hours a day or a couple of days a week. For workloads that have continuous and long-running performance needs, such as Apache HBase and Cloudera Impala, the overhead of commissioning and decommissioning a cluster for each period of use may not be justified.
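The trade-off between a cluster that runs only for the job and one that is always on can be sketched with simple arithmetic. The example below is a rough illustration only: the node count, hourly rate, and per-run commissioning overhead are hypothetical values, and real pricing models vary by vendor.

```python
HOURS_PER_WEEK = 24 * 7


def weekly_cost_always_on(nodes: int, hourly_rate: float) -> float:
    """Cost of keeping every node running for the whole week."""
    return nodes * hourly_rate * HOURS_PER_WEEK


def weekly_cost_on_demand(nodes: int, hourly_rate: float, job_hours_per_week: float,
                          runs_per_week: int = 0, overhead_hours_per_run: float = 0.0) -> float:
    """Cost of running the cluster only while jobs execute.

    overhead_hours_per_run is an assumed allowance for commissioning and
    decommissioning the cluster around each run.
    """
    billed_hours = job_hours_per_week + runs_per_week * overhead_hours_per_run
    return nodes * hourly_rate * billed_hours


if __name__ == "__main__":
    # Hypothetical: 10 nodes at 0.50 per node-hour, one 2-hour job per day,
    # with 15 minutes of provisioning overhead per run.
    always_on = weekly_cost_always_on(10, 0.50)
    on_demand = weekly_cost_on_demand(10, 0.50, job_hours_per_week=14,
                                      runs_per_week=7, overhead_hours_per_run=0.25)
    print(f"always-on: {always_on:.2f}  on-demand: {on_demand:.2f}")
```

As the billed hours approach the full week, as they would for a continuously serving HBase or Impala cluster, the two costs converge and the provisioning overhead is no longer worth paying.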
Performance Demands
What are the performance needs? One of the underlying tenets of Hadoop is tightly coupled units of compute and local storage that scale out linearly and simultaneously. This computation proximity enables Hadoop to parallelize the workload and significantly accelerate the processing of massive amounts of data within a short period of time. However, a common foundation of cloud architectures is pools of shared storage and virtualized compute capacity that are connected via a network pipe.
These capabilities scale independently, but the network adds latency, and shared storage can become a performance bottleneck for a high-throughput MapReduce job. The exact performance needs vary from workload to workload. The ecosystem of cloud vendors offers enterprises many architectural options and configurations, from fully virtual instances to standalone, bare-metal systems, that can address the particular needs of a workload more directly. For example, IT teams should examine the proximity of storage to compute as well as the degree of resource sharing within the service as potential factors in performance. Performance is often an important criterion when processing the large volumes of data typical of Hadoop workloads. For non-production, development, or test workloads, this factor
Data location, like cloud-based
storage, and types of workloads,
like periodic batch processing, are
strong influencers on the decision
to deploy into the public cloud,
yet many see the total cost of
ownership—in terms of rapid
procurement and provisioning
of resources and the associated
opportunity costs—as the most
important motivator.
Cloud TCO
What is the difference in Total Cost of Ownership (TCO)? Calculating the TCO of a public cloud deployment can extend beyond the options for compute, storage, data transfer, and the pricing thereof. A good starting point to narrow down the options is to use reference architectures from Cloudera for the cloud environment of choice. Based on the options from the reference architecture best suited for the workload or workloads, enterprises can further develop their expected usage patterns and arrive at a more accurate TCO for deploying an EDH in the public cloud. Cloudera and its partners can further assist with TCO evaluations for any environment, including those that span on-premise and public cloud.
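As a starting point for such an evaluation, the infrastructure portion of the cost can be broken down by compute, storage, and data transfer. The sketch below is a simplified illustration: the class names, fields, and all prices are hypothetical, and it deliberately omits the items that extend a TCO beyond infrastructure, such as software subscriptions, staffing, and opportunity costs.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    node_hours: float        # total node-hours of compute consumed
    stored_tb: float         # average terabytes held in cloud storage
    transfer_out_tb: float   # terabytes transferred out of the cloud


@dataclass
class Prices:
    per_node_hour: float         # hypothetical price per node-hour
    per_tb_month: float          # hypothetical price per stored TB per month
    per_tb_transfer_out: float   # hypothetical price per TB transferred out


def monthly_infrastructure_cost(usage: MonthlyUsage, prices: Prices) -> dict:
    """Break the monthly infrastructure bill into its three line items."""
    compute = usage.node_hours * prices.per_node_hour
    storage = usage.stored_tb * prices.per_tb_month
    transfer = usage.transfer_out_tb * prices.per_tb_transfer_out
    return {
        "compute": compute,
        "storage": storage,
        "transfer": transfer,
        "total": compute + storage + transfer,
    }


if __name__ == "__main__":
    usage = MonthlyUsage(node_hours=1000, stored_tb=50, transfer_out_tb=5)
    prices = Prices(per_node_hour=0.5, per_tb_month=25, per_tb_transfer_out=90)
    for item, cost in monthly_infrastructure_cost(usage, prices).items():
        print(f"{item:>8}: {cost:10.2f}")
```

Substituting the usage pattern expected for a given workload and the options from the chosen reference architecture gives a first estimate to refine with the vendor's actual pricing.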
Public Cloud Deployment Patterns
The decision to employ a public cloud as part of a company’s IT strategy is typically driven by a number of independent factors, and an EDH is commonly one component of this larger process. However, in a number of cases a Hadoop-based EDH is especially well suited to the elasticity of cloud computing and is itself the driver of a cloud deployment model. Examples such as the parallel processing desired for search indexing and interactive query, and the temporary influx of workload for batch processing, coalesce into two primary deployment patterns that take advantage of EDH cloud environments.
Long-Running Clusters
The full-fidelity data experience of the enterprise data hub is based on the concept of collocated storage and compute on a cluster of industry-standard servers. This tenet implies a long-running cluster within the cloud environment that provides the base storage for the data and the compute power for typical day-to-day activities, and this type of cluster is not very different from a typical on-premise deployment. The EDH, once established in the cloud, is managed exactly as an on-premise deployment, but there are some unique benefits to the cloud environment.
For example, one key advantage is that IT teams can provision new capacity with a few simple commands. In a matter of minutes, enterprise IT teams can bring online a new cluster that meets additional business needs or grow the storage or computing capacity of an existing cluster for a current business process. Enterprises gain IT agility without having to worry about data center capacity issues and long procurement processes.
A further benefit of a cloud environment is that enterprises are not restricted to current server or cluster configurations if business needs change. For a typical on-premise environment, IT teams must determine CPU, memory, and disk capacity at the time of procurement and often purchase servers with more capacity than currently necessary to “future proof” the infrastructure investment. In the cloud model, however, IT administrators can provision servers with different configurations at will. Enterprises can therefore provision clusters exactly as needed for today, not tomorrow, thus maximizing working capital, yet still adapt to changing business needs by allocating new servers with more CPU, memory, or disk and decommissioning older or obsolete servers.
Separating metadata from data
gives Hadoop a scalable design
for achieving high availability
and tunable replication without
sacrificing performance.
[Figure: long-running clusters in the cloud and on-premise. Labels: Business Services, Data, Provisioned Servers; Cloud, On-Premise.]
Periodic and Transient Workloads
Even when operating a long-running cluster, businesses might need additional capacity for periodic workloads. Monthly or bi-weekly reporting processes are typical examples that represent additional computing capacity needs. Once an enterprise has established a production EDH in the cloud, IT teams can dynamically grow and shrink computing capacity in response to these periodic jobs. Administrators simply commission the new “report” servers as needed, process the reports, store the resulting information back into the EDH, and then decommission the servers. This periodic lifecycle translates into reduced costs: instead of paying for extra machines that are only partially utilized, an enterprise pays for only the hours utilized.
Some workloads are even more transient and might not require a long-running cluster. For example, an organization may have a large amount of data to process whose results require significant time to interpret or to determine the next task. Procuring servers for this kind of transient or sporadic activity might not make economic sense for some organizations. The cloud offers a compelling solution to this type of workload by combining rapid cluster provisioning with low-cost storage capabilities, such as Amazon S3. In this workload lifecycle, administrators provision a Hadoop cluster, import the data from a cloud object store, process the data, write the result back to the object store, and then decommission the cluster. This approach can be very cost-effective when processing massive amounts of data if the workload is highly transient. For the occasional execution of batch jobs, elastic cloud environments might be more cost-efficient than dedicated long-running clusters. However, IT administrators should consider that multiple users might run periodic, transient jobs against the same dataset, for example one stored in an object store. In this situation, the aggregate utilization of the cluster is a more relevant metric for calculating the cost benefits. IT teams might discover that “always-on” clusters are more economical than ones repeatedly provisioned for each user.
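The transient lifecycle described above (provision, import, process, write back, decommission) can be sketched as follows. This is an illustration only: InMemoryCloud and its methods are hypothetical stand-ins invented for the example and do not correspond to any real cloud SDK or to Cloudera Director's API. The point of the sketch is the structure, in which the cluster is always decommissioned, even when the job fails, so that no servers are left running and billed.

```python
from contextlib import contextmanager


class InMemoryCloud:
    """Hypothetical stand-in for a cloud provider; not a real SDK."""

    def __init__(self, object_store=None):
        self.object_store = dict(object_store or {})  # plays the role of the object store
        self.running_clusters = set()
        self._next_id = 0

    def provision_cluster(self, nodes):
        self._next_id += 1
        cluster_id = f"cluster-{self._next_id}"
        self.running_clusters.add(cluster_id)
        return cluster_id

    def decommission_cluster(self, cluster_id):
        self.running_clusters.discard(cluster_id)


@contextmanager
def transient_cluster(cloud, nodes):
    """Provision a cluster and guarantee that it is decommissioned afterwards."""
    cluster_id = cloud.provision_cluster(nodes)
    try:
        yield cluster_id
    finally:
        cloud.decommission_cluster(cluster_id)


def run_transient_job(cloud, input_key, output_key, process, nodes=3):
    """Import from the object store, process, and write the result back."""
    with transient_cluster(cloud, nodes):
        data = cloud.object_store[input_key]      # import the data
        result = process(data)                    # process it on the cluster
        cloud.object_store[output_key] = result   # write the result back
    return result


if __name__ == "__main__":
    cloud = InMemoryCloud({"logs": ["a b", "b c"]})
    words = run_transient_job(cloud, "logs", "word_count",
                              lambda lines: sum(len(line.split()) for line in lines))
    print(words, cloud.object_store["word_count"], cloud.running_clusters)
```

Because the result lives in the object store rather than on the cluster, nothing is lost when the cluster is torn down, which is what makes the pay-per-run model workable for this pattern.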
[Figure: periodic and transient workloads. Panels: Periodic Processing, Transient. Labels: Reporting Task, Reports, Provisioned Servers, Temporary Servers, Task, Import & Export, Cloud Storage.]
Cloudera Director: Hadoop in the Cloud Without Compromise
Cloudera Director, part of Cloudera’s platform, brings consistency and ease for users looking to deploy in the cloud, while still maintaining the benefits of Cloudera’s enterprise data hub. Cloudera Director is the first portable, self-service solution for deploying and managing enterprise-grade Hadoop in the cloud. It provides a single-pane-of-glass administration experience for central IT to reduce costs and deliver agility, and for end users to self-service provision and elastically scale clusters, all while ensuring auditability. Integrated with Cloudera’s enterprise data hub, Cloudera Director gives users not only all the features necessary for cloud deployments but also all of the enterprise-grade features available with Cloudera’s platform, including the security, governance, and administration necessary for production-ready deployments.
With Cloudera Director, users can deploy one or more clusters in their preferred VPC environment, running on an EC2 instance. Cloudera Director offers the choice of a simple web user interface, command line interface (CLI), or REST API for deploying and managing CDH or Cloudera Enterprise clusters. The web UI provides a single dashboard view of all clusters deployed through Cloudera Director and includes a self-service experience for deploying, cloning, dynamically scaling, and terminating clusters. The CLI and API provide advanced support for more customized and complex cluster topologies that are well-suited for a wider variety of workloads. Additionally, both administrators and users can repeatedly deploy multiple clusters on-demand, using cluster blueprints. This reliable, cloud-centric experience can be leveraged across multiple cloud providers, with current support available with Amazon Web Services, and other cloud environments planned for future releases. Key benefits of Cloudera Director include:
The long-term vision of Cloudera
is to embrace the potential and
flexibility of the hybrid model,
where the enterprise data hub can
operate transparently between
on-premises, private cloud, and public
cloud deployments. By bringing
together a diverse partner ecosystem
of cloud providers, Cloudera is
helping customers bring Hadoop
and the EDH to more enterprise
users and applications. Cloudera
continues to be the industry standard
for next-generation enterprise
data management and analytics,
wherever data and workloads live.
To learn more about Cloudera’s
broad partner ecosystem, visit
http://www.cloudera.com/content/
cloudera/en/solutions/partner.html
Customer Benefit: Simplify Cluster Lifecycle Management
Unique Capability: Simple UI to spin up, scale, and spin down clusters
Enabling Features:
• Self-service spin up/teardown
• Dynamic scaling for spiky workloads
• Simple cloning of clusters
• Cloud blueprints for repeatable deployments
Customer Benefit: Eliminate Lock-in
Unique Capability: Flexible, open platform
Enabling Features:
• 100% open source Hadoop distribution
• Native support for hybrid deployments
• Third-party software deployment within same workflow
• Support for custom, workload-specific deployments
Customer Benefit: Accelerate Time-to-Value
Unique Capability: Enterprise-ready security and administration
Enabling Features:
• Support for complex cluster topologies
• Minimum size cluster when capacity constrained
• Management tooling
• Compliance-ready security and governance
• Backup and disaster recovery with an optimized cloud storage connector
Customer Benefit: Reduce Support Costs
Unique Capability: Monitoring & metering tools
Enabling Features:
• Multi-cluster dashboard
About Cloudera
Cloudera is revolutionizing enterprise data management by offering the first unified Platform for Big Data, an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, access, process, secure, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data. Cloudera’s open source Big Data platform is the most widely adopted in the world, and Cloudera is the most prolific contributor to the open source Hadoop ecosystem. As the leading educator of
The Cloudera Difference
Enterprises that deploy a Cloudera enterprise data hub in the public cloud can leverage several benefits unique to Cloudera. Business and technology teams gain the same full-fidelity EDH experience as in an on-premise environment, from technology capabilities to system and data management tools, coupled with mission-critical support. Organizations also do not have to compromise on enterprise-grade capabilities such as data security, data governance, and the latest innovations in the Hadoop platform, such as Cloudera Impala, Apache Sentry, Cloudera Search, and others, when operating in the public cloud.
In addition, Cloudera has designed an expanded partner program that includes a cloud services and solution provider division, called Cloudera Connect: Cloud. The program can meet the growing needs of organizations looking to optimize Hadoop deployments in cloud environments for unified data management and analytics, like the EDH, by offering the utmost flexibility in deployment, consumption, and choice of vendor. Enterprises now have a choice of multiple pricing and support models for the enterprise data hub in the cloud. Organizations can choose either a traditional subscription model or a usage-based model for Cloudera’s offerings while purchasing infrastructure separately from the cloud partner. Alternatively, organizations can purchase both Cloudera products and cloud infrastructure directly through their cloud vendor of choice as one offering and pay one bill.
Moreover, IT strategists should anticipate EDH deployments in any environment, from on-premise to cloud, in order to meet more fully the particular demands and restrictions of a workload, data set, or business user. In all of these situations, the full-fidelity experience of an EDH and the continuity of that experience, no matter the environment, are critical to achieving maximum efficiency of applications and personnel. Cloudera is unique in providing this advantage to enterprises while leaving the choice of cloud vendor to the customer. With upcoming enhancements to the Cloudera product suite that streamline cloud operations, enterprises can easily leverage the elasticity and on-demand consumption models of the public cloud for their Hadoop installations and consider platforms like OpenStack and VMware for private cloud deployments.
Organizations need to consider multiple factors when deciding which part of the EDH footprint resides where. Cloudera is well positioned to help enterprises explore these factors and enable all available deployment options. With Cloudera, enterprises can take full advantage of the enterprise data hub and the next generation of data management across all deployment options and environments, from on-premise to public cloud.