Solving performance and data protection problems with active-active Hadoop
Many Hadoop deployments are not realizing their full business potential, with performance¹ and data protection² cited by 62% of IT professionals as barriers to moving into full production use. Meanwhile, 70% of Hadoop early adopters are already using multiple siloed installations in separate data centers³.
Active-active replication turns those siloed installations into a single unified HDFS cluster that provides total data protection and better performance for Hadoop applications.
Barriers to realizing full business value from Hadoop
Let’s first consider each of the problem areas in more detail.
Performance at scale
Hadoop deployments typically start small and then see viral adoption as the value of Big Data becomes clear. Rapid adoption and increased load from new applications can lead to serious performance challenges. For example, one national energy services firm found that ingesting the largest table from its legacy ERP system caused severe performance problems for other applications. Likewise, a consumer science company had to place restrictions on new machine learning applications for the same reason, limiting eager data scientists to weekend hours on the production cluster.
Even among those who have already adopted YARN, resource management in Hadoop is an unsolved problem as these examples illustrate. YARN is designed to allocate based on capacity queues or fair division of resources. It was not built for the current generation of mixed-tenant workloads, where applications like Spark require high-memory nodes. Even recent improvements like node labels do not guarantee that the right data is always local to the right nodes.
1 http://www.wsj.com/articles/the-joys-and-hype-of-software-called-hadoop-1418777627?mod=WSJ_hp_EditorsPicks
2 http://www.techvibes.com/blog/17-billion-the-annual-cost-of-data-loss-and-downtime-in-canada-2014-12-04
Figure 1: YARN scheduling based on capacity queues, granting minimum and maximum resource allocation to different roles
Figure 2: YARN has trouble managing mixed hardware profiles and diverse workloads
In data processing pipelines such as the Lambda architecture, multiple processing stages run different applications with very different resource profiles, and YARN does not provide ideal resource management in this case. For example, ingest applications like Sqoop can experience performance degradation of up to 81% when running on a cluster that is also loaded with batch processing applications. The batch applications likewise see a degradation of as much as 131%. In-memory frameworks like Spark can see an order-of-magnitude performance improvement when run on dedicated high-memory nodes.
Data protection
Hadoop’s file system (HDFS) provides redundancy in one Hadoop installation by distributing data between nodes and racks. It has no provision for consistent real-time backups. The backup tools used by most distributions rely on DistCp, an asynchronous batch transfer program. As simple performance testing demonstrates, DistCp is a problematic tool when used as a primary backup solution:
• It consumes valuable processing (MapReduce) resources on the production cluster. Some Hadoop administrators report that DistCp prevents other applications from running simultaneously. The problem is exacerbated as the size of a cluster grows, with large deployments able to run DistCp only once every 12 or 24 hours.
• It is a file-based program and fails if a file copy is interrupted or corrupted. Manual intervention is then required.
• There is no guarantee of consistency when DistCp runs, and no automated way to check the consistency of backups after the fact.
Furthermore, backup clusters can only be used for a limited set of read-only operations. DistCp is unable to reconcile changes made at multiple locations, and even read-only MapReduce applications generate intermediate data that must be managed carefully to avoid conflicts with backup jobs. The result is that a significant portion of the investment in hardware and operations is not contributing processing capability, negatively impacting Hadoop’s cost efficiency advantage.
Data silos
Most companies end up using multiple Hadoop clusters for one or more of these reasons:
• Maintaining different sets of users and permissions. Hadoop security tools are only now maturing, so in the past it was simpler to isolate data that had different security requirements.
• Lack of holistic planning. Many teams and business units might stand up a new cluster just for experimentation.
• Cost model. Providing individual installations to different business units is a simple way to manage cost allocation.
Maintaining siloed clusters makes sharing data between Hadoop installations difficult. Without appropriate data sharing, data scientists only have a partial view of the information, making roll-up reporting between business units difficult. Since obtaining a complete view of business operations is an important benefit of Hadoop, companies must rely on DistCp-based data transfer tools. Workflow management tools like Oozie and Falcon are very useful for building complete data pipelines, but in a cross-cluster situation require Hadoop administrators to build data transfer stages into the pipeline along with verification steps. As noted earlier, DistCp introduces performance and consistency problems that complicate and slow down data pipelines.
Figure 3: Periodic data transfer using DistCp
A single HDFS cluster spanning several Hadoop installations and data centers
Fortunately, there is a solution. WANdisco’s active-active replication turns multiple Hadoop silos running in one or more data centers into a unified HDFS cluster with separate processing layers.
Figure 4: Non-Stop Hadoop provides a single HDFS cluster underneath several Hadoop installations at one or more locations
Total data protection
Non-Stop Hadoop provides synchronous real-time active-active replication of HDFS metadata. Every Hadoop installation, even at data centers across the WAN, will see a consistent view of the data. In the event of a failure or a network partition, the system heals automatically with no need for manual reconciliation. Non-Stop Hadoop also uses an efficient WAN block replicator to transfer data blocks to other installations without consuming processing (MapReduce) resources. Customer experience shows that even large data ingests are transferred to another data center in minutes with no performance impact on the source Hadoop installation, compared to hours of transfer time and severe performance degradation using DistCp.
[Figure 4 diagram labels: Applications (MapReduce, Spark, HBase), Access Layer (YARN), and Security and governance for each installation, above a shared Data Layer (Non-Stop Hadoop) with active-active NameNodes and a WAN Block Replicator]
[Figure 3 diagram labels: Data Center 1 (Hadoop A) and Data Center 2 (Hadoop B), each with Data Nodes and an Active NameNode, connected by VPN. Step 2: Data from Hadoop A is periodically DistCp'd into Hadoop B]
Figure 5: Non-Stop Hadoop architecture with two data centers separated by a WAN. HDFS writes are coordinated in real time followed by asynchronous block replication.
As a result, Non-Stop Hadoop provides a Recovery Point Objective (RPO) of minutes instead of hours or days, and a Recovery Time Objective (RTO) of zero. Other data centers are available for immediate use even if one data center is lost entirely.
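To make the RPO comparison concrete, the following is a minimal sketch of the worst-case data-loss window for periodic batch backups. It rests on a simplifying assumption that is not stated in this paper: each backup run captures the data as it stood when the run started, and a failure strikes just before the next run completes. The function name and the 2-hour transfer time are hypothetical, chosen only to illustrate the arithmetic for the 12- and 24-hour DistCp schedules mentioned earlier.

```python
def worst_case_rpo_hours(backup_interval_hours: float, transfer_hours: float) -> float:
    """Worst-case data-loss window for periodic batch backups.

    Assumption: a run copies the data as of its start time, and a failure
    occurs just before the following run finishes, so the newest usable
    copy is one full interval plus one transfer time old.
    """
    return backup_interval_hours + transfer_hours


# Hypothetical transfer time; the paper only says DistCp transfers take "hours".
TRANSFER_HOURS = 2.0

for interval in (12.0, 24.0):
    rpo = worst_case_rpo_hours(interval, TRANSFER_HOURS)
    print(f"DistCp every {interval:.0f} h: worst-case RPO of about {rpo:.0f} h")
```

Under these assumptions the exposure is dominated by the backup interval, which is why shortening the interval to continuous replication moves the RPO from hours to minutes.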
Improved performance for applications
Non-Stop Hadoop presents a single HDFS cluster while preserving the independence of the processing layers. As a result, applications can be run in separate installations or zones without any extra data transfer steps. For example, one zone could run critical business applications with rigorous response SLAs, and another zone could run experimental machine learning applications that use in-memory analytics. Meanwhile, other zones in other data centers can handle ingest jobs. Each zone has all the advantages of fast local access to data, making it a more effective approach than YARN’s experimental node labels, which do not guarantee that the selected node is the closest to the data.
[Figure 5 diagram labels: Data Center 1 (World File System, DC1 Data Nodes) and Data Center 2 (World File System), each with Active NameNodes A, B, and C; Coordinated Metadata Replication and Block Replication across the WAN]
Figure 6: Non-Stop Hadoop presents a single HDFS cluster with independent processing tiers across zones
As noted earlier, ingest applications like Sqoop can experience up to a 45% performance improvement when run in a separate zone from batch processing applications, and the batch processing applications may see up to a 57% improvement when isolated from Sqoop. Likewise, Spark applications can see an order of magnitude improvement when run on a small zone with dedicated high-memory nodes.
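The improvement figures here can be related to the degradation figures quoted earlier with a short calculation. The sketch below assumes (this is an interpretation, not something the paper states) that "degradation" is the percentage increase in runtime on a shared cluster relative to an isolated run, and that "improvement" is the percentage reduction in runtime when moving from the shared cluster to a separate zone. The function name is hypothetical.

```python
def improvement_from_degradation(degradation_pct: float) -> float:
    """Convert a slowdown on a shared cluster into the gain from isolation.

    Assumption: a degradation of d% means the shared runtime is (1 + d/100)
    times the isolated runtime, so isolating the job removes
    (d/100) / (1 + d/100) of the shared runtime.
    """
    slowdown = degradation_pct / 100.0
    return 100.0 * slowdown / (1.0 + slowdown)


for name, degradation in [("Sqoop ingest", 81.0), ("Batch processing", 131.0)]:
    gain = improvement_from_degradation(degradation)
    print(f"{name}: {degradation:.0f}% degradation -> {gain:.0f}% improvement")
```

Under that interpretation, an 81% degradation corresponds to roughly a 45% improvement and a 131% degradation to roughly a 57% improvement, matching the figures above.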
Further, every Hadoop installation is available for full active processing. Read-only backup clusters become fully writable processing clusters. As a result, Hadoop deployments effectively double the processing node count and require less hardware to support the same processing requirements.
Breaking down data silos
Each Hadoop installation in a Non-Stop Hadoop deployment uses a single HDFS cluster, even when located across the WAN. This avoids the need for expensive data transfer stages in tools like Oozie or Falcon and provides total data visibility to data scientists.
Overcome performance and data protection problems
Non-Stop Hadoop turns multiple Hadoop data silos into a single HDFS cluster that provides total data protection and improved performance for Hadoop applications. The single HDFS cluster also overcomes data sharing problems while delivering improved utilization of valuable Hadoop processing resources. Alternative approaches, however, are problematic:
• Building a larger Hadoop cluster to add processing power magnifies the backup burdens to the point where system RPO becomes unacceptable.
• Another option is to rely on the network to move data to processing, discarding Hadoop’s natural preference for data locality. This technique is not proven at scale and may prove very difficult in a WAN situation, and of course it doesn’t satisfy backup/DR requirements.
Active-active replication is recognized as a vital capability for data protection, and offers much more than just data safety. A Hadoop cluster built with active-active technology weaves independent Hadoop installations into a unified HDFS cluster that alleviates several barriers to productive Hadoop deployment. For more information including architectural white papers, visit http://www.wandisco.com/hadoop.
World Headquarters: 5000 Executive Pkwy, Suite 270, San Ramon, CA 94583
Europe: Electric Works, Sheffield Digital Campus, Sheffield S1 2BJ
Japan: Level 15 Cerulean Tower, 26-1 Sakuragaoka-cho, Shibuya-ku, Tokyo, Japan 150-8512
China: Financial Street Centre, Level 10, South Tower, No.9A Financial Street, XiCheng District, Beijing 100033