Apache Hadoop Storage Provisioning Using VMware vSphere Big Data Extensions TECHNICAL WHITE PAPER


Table of Contents

Apache Hadoop Deployment on VMware vSphere Using vSphere Big Data Extensions
Local Storage and Shared Storage
  Basic vSphere Storage Concepts
  Using Local and Shared Storage for Hadoop
Storage Provisioning by BDE
  Datastore Management
  Cluster Specification of Storage
  Disk Placement and Storage Allocation
Storage Management After Cluster Deployment
  Allocation of Unused Datastore Storage
  Storage Failure and Recovery
  Disk Replacement and Node Data Disk Recovery
  Disk Replacement and Node Recovery
  Recoverable Disk Failure
Storage Configuration for Hadoop Outside of BDE
  Data Disk Resizing
  Utilization of Additional Disks


Apache Hadoop Deployment on VMware vSphere Using vSphere Big Data Extensions

The Apache™ Hadoop® software library is a framework that enables the distributed processing of large data sets across clusters of computers. It is designed to scale up from single servers to thousands of machines, with each offering local computation and storage. Hadoop is being used by enterprises across verticals for big data analytics, to help make better business decisions based on large data sets.

Serengeti™ is an open-source project initiated by VMware to automate deployment and management of Hadoop clusters on virtualized environments such as VMware vSphere®. Serengeti offers the following key benefits:

• Deploy a Hadoop cluster on vSphere in minutes via one command
• Employ a fully customizable configuration profile to specify compute, storage and network resources as well as node placement
• Provide better Hadoop manageability and usability, enabling fast and simple cluster scale-out and Hadoop tuning
• Enable separation of data and compute nodes without losing data locality
• Improve Hadoop cluster availability by leveraging VMware vSphere High Availability (vSphere HA), VMware vSphere Fault Tolerance (vSphere FT) and VMware vSphere vMotion®
• Support multiple Hadoop distributions, including Apache Hadoop, Cloudera CDH, Pivotal HD, MapR, Hortonworks Data Platform (HDP) and Intel IDH

Through its sponsorship of Project Serengeti, VMware has been investing in making it easier for users to run big data and Hadoop workloads. VMware has introduced Big Data Extensions (BDE) as a commercially supported version of Project Serengeti designed for enterprises seeking VMware support. BDE enables customers to run clustered, scale-out Hadoop applications through vSphere, delivering all the benefits of virtualization to Hadoop users. BDE provides increased agility through an easy-to-use interface, elastic scaling through the separation of compute and storage resources, and increased reliability and security by leveraging proven vSphere technology. VMware has built BDE to support all major Hadoop distributions and associated Hadoop projects such as Pig, Hive and HBase.


Local Storage and Shared Storage

Basic vSphere Storage Concepts

VMware ESXi™ provides host-level storage virtualization, which logically abstracts the physical storage layer from virtual machines. An ESXi virtual machine uses one or more virtual disks to store its operating system (OS), program files and other data. Each virtual disk is a large physical file, or a set of files, that resides on a VMware vSphere VMFS datastore, a datastore based on some other technology such as Network File System (NFS) or VMware Virtual SAN™, or a raw disk. To access virtual disks, a virtual machine uses virtual SCSI controllers. From the standpoint of the virtual machine, each virtual disk appears as if it were a SCSI drive connected to a SCSI controller. The underlying physical storage for the virtual disk—whether accessed through parallel SCSI, iSCSI, network or Fibre Channel adapters on the ESXi host—is transparent to the guest OS and to applications running on the virtual machine.

ESXi supports two types of storage: local storage and shared storage. Local storage maintains virtual machine files on internal or directly attached external disks that are managed exclusively by that single host, whereas shared storage maintains virtual machine files on disks or storage arrays shared among more than one host, such as those connected through an IP-based or Fibre Channel network. Datastores are logical containers that hide specifics of each storage device and provide a uniform model for storing virtual machines. For more details about vSphere storage, refer to vSphere documentation.

Using Local and Shared Storage for Hadoop

When deploying a Hadoop cluster on vSphere, users can choose to use either local or shared storage for each node. BDE reads the setting from the cluster configuration file and creates virtual disks for the nodes in specified datastore(s) accordingly.

Local and shared storage offer distinct benefits. Shared storage in a vSphere environment enables advanced capabilities such as vSphere HA, vSphere FT and vSphere vMotion in the vSphere cluster to protect Hadoop nodes. Shared storage is typically provided by network-attached storage (NAS) or storage area network (SAN) storage arrays, which not only offer high and scalable capacity but also add another layer of storage availability through RAID, hardware redundancy and multipathing. Local storage, on the other hand, offers better I/O performance by eliminating network overhead and latency.

To improve performance and conserve bandwidth, Hadoop is designed around data locality: data is processed on the same machine that stores it. Data locality is preserved in a Hadoop deployment on vSphere when a slave node uses local storage. Virtual disk I/Os from the slave node are directed to the local disks attached to the ESXi host and never have to cross the network.

vSphere virtualization makes it possible to deploy multiple slave nodes on a single ESXi host with none of them losing data locality. When two virtual machines are deployed on the same ESXi host, communication between them travels over the virtual network that logically connects them in the host. This traffic is handled in ESXi host memory and never leaves the host. As a result, data and compute can be separated for a Hadoop cluster without compromising data locality.


VMware recommends the following best practices for configuring storage for a Hadoop cluster deployed on vSphere:

• Place the Hadoop master node (including NameNode and JobTracker) on shared storage to enable vSphere HA, vSphere FT and VMware vSphere Distributed Resource Scheduler™ (vSphere DRS) features. These features prevent the master node from being the single point of failure (SPOF) in the Hadoop cluster.
• Place the Hadoop data nodes on local storage for locality and performance. Follow similar best practices of storage provisioning (disk types, number of drives per node, no RAID, and so on) as for Hadoop deployment on physical infrastructure.
• If separating data and compute in the cluster, or deploying a compute-only cluster, place the compute nodes on local storage or use NFS in the form previously described.
• Place the Hadoop client nodes and other Hadoop ecosystem nodes on either local storage or shared storage.
• When using local storage, set the server RAID controller cache policy to “write back” instead of “write through” if a cache battery backup unit (BBU) module is installed. Initial I/Os to a disk formatted as either “thin provision” or “thick provision lazy zeroed” will result in disk zeroing on demand, leading to degraded performance until the entire disk has been zeroed. The “write back” cache mode helps eliminate this performance degradation. By default, BDE formats node data disks to the “thick provision lazy zeroed” format, so initial Hadoop performance might not be optimal unless the “write back” mode is applied on the RAID controller.

Storage Provisioning by BDE

Datastore Management

BDE enables users to specify datastores to be selected for Hadoop deployment. In the BDE CLI, the following is the command syntax:

datastore add --name <storagepool name in BDE> --spec <datastore name in vSphere> --type <LOCAL|SHARED>

BDE defines the type of storage pool to be used for cluster deployment. A pool can contain one or many vSphere datastores. The datastore name can be specified using a wildcard to include a set of datastores for cluster use. BDE currently does not check whether the datastore actually exists in VMware® vCenter™. Use of a nonexistent datastore will cause cluster creation to fail. Two other commands, datastore delete and datastore list, are provided for deleting and listing BDE storage pools.
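As a rough illustration of how a wildcard in --spec could select several datastores at once, the sketch below matches a pattern against a list of datastore names. It assumes shell-style wildcards (* and ?) as implemented by Python's fnmatch module. The datastore names and the helper select_datastores are hypothetical, and BDE's own matching rules may differ.

```python
from fnmatch import fnmatchcase


def select_datastores(spec, datastore_names):
    """Return the names that match a shell-style wildcard spec, in input order."""
    return [name for name in datastore_names if fnmatchcase(name, spec)]


# Hypothetical datastore names as they might appear in vCenter.
available = ["localDS0_esx1", "localDS1_esx1", "sharedDS", "localDS0_esx2"]

local_pool = select_datastores("localDS*", available)
print(local_pool)
```

An empty result corresponds to the case the text warns about: a spec that matches no existing datastore is not rejected when the pool is defined and only surfaces as a failure at cluster creation.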

Cluster Specification of Storage


{
  "name": "data",
  "roles": [
    "hadoop_datanode"
  ],
  "instanceNum": 4,
  "cpuNum": 2,
  "memCapacityMB": 2048,
  "storage": {
    "type": "LOCAL",
    "sizeGB": 50
  }
}
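To make the storage demand of such a node group concrete, the following sketch parses the specification above and multiplies the node count by the per-node size. It assumes that sizeGB is the data capacity requested per node and that system and swap disks are accounted for separately. The helper group_storage_demand is hypothetical and is not part of BDE.

```python
import json

NODE_GROUP_SPEC = """
{
  "name": "data",
  "roles": ["hadoop_datanode"],
  "instanceNum": 4,
  "cpuNum": 2,
  "memCapacityMB": 2048,
  "storage": {"type": "LOCAL", "sizeGB": 50}
}
"""


def group_storage_demand(group):
    """Return (storage type, total GB) requested by one node group.

    Assumes sizeGB is a per-node figure, so the group total is
    instanceNum * sizeGB.
    """
    storage = group["storage"]
    return storage["type"], group["instanceNum"] * storage["sizeGB"]


group = json.loads(NODE_GROUP_SPEC)
storage_type, total_gb = group_storage_demand(group)
print(f"{group['name']}: {total_gb} GB of {storage_type} storage")
```

Under that assumption, the four data nodes together need 200 GB of local datastore capacity for their data disks.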

The cluster specification can also be used to place system and data disks on separate datastores. In this “storage” clause, data disks are placed on dsNames4Data datastores, and system disks are placed on dsNames4System datastores:

"storage": {
  "type": "LOCAL",
  "sizeGB": 50,
  "dsNames4Data": ["DSLOCALSSD"],
  "dsNames4System": ["DSNDFS"]
}

The cluster create command uses the --dsNames parameter to specify the list of BDE storage pools to be used for cluster creation. These storage pools must collectively meet the size and type requirements in the cluster specification. Otherwise, cluster creation will fail.
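The requirement that the selected pools collectively meet the size and type requirements can be sketched as a simple check. This is only an illustration under stated assumptions: the pool dictionary layout, the free-space figures and the function pools_satisfy are hypothetical and do not reflect how BDE stores or evaluates pools internally.

```python
def pools_satisfy(required_type, required_gb, pools):
    """Check that pools of the required type jointly offer enough free space.

    `pools` maps a pool name to a (type, free_gb) tuple; this layout is
    a hypothetical stand-in, not BDE's internal representation.
    """
    usable_gb = sum(free for ptype, free in pools.values() if ptype == required_type)
    return usable_gb >= required_gb


# Hypothetical pools passed via --dsNames.
pools = {
    "localPool": ("LOCAL", 150),
    "localPool2": ("LOCAL", 100),
    "sharedPool": ("SHARED", 500),
}

fits_200 = pools_satisfy("LOCAL", 200, pools)
fits_300 = pools_satisfy("LOCAL", 300, pools)
print(fits_200, fits_300)
```

Here the two local pools together cover a 200 GB request but not a 300 GB one. The shared pool does not count toward a LOCAL requirement, however much free space it has.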

Disk Placement and Storage Allocation

To illustrate how BDE creates virtual disks and allocates storage for a node according to the cluster specification, a simple example is provided here. The same placement and allocation policy and algorithm apply to both local and shared storage.


Applying this disk placement and storage allocation policy to the example, Table 1 shows how storage is configured on each of the cluster nodes.

NODE                  VIRTUAL DISK  USAGE   DATASTORE      SIZE  VDISK TYPE
Master                /dev/sda      System  sharedDS       20GB  Thin provision
                      /dev/sdb      Swap    sharedDS       ~4GB  Thin provision
                      /dev/sdc      Data    sharedDS       20GB  Thin provision
Client                /dev/sda      System  sharedDS       20GB  Thin provision
                      /dev/sdb      Swap    sharedDS       ~4GB  Thin provision
                      /dev/sdc      Data    sharedDS       20GB  Thin provision
Data                  /dev/sda      System  localDS0_esx#  20GB  Thin provision
(one per ESXi host)   /dev/sdb      Swap    localDS0_esx#  ~4GB  Thick provision lazy zeroed
                      /dev/sdc      Data    localDS0_esx#  30GB  Thick provision lazy zeroed
                      /dev/sdd      Data    localDS1_esx#  30GB  Thick provision lazy zeroed
                      /dev/sde      Data    localDS2_esx#  30GB  Thick provision lazy zeroed
                      /dev/sdf      Data    localDS3_esx#  30GB  Thick provision lazy zeroed
                      /dev/sdg      Data    localDS4_esx#  30GB  Thick provision lazy zeroed
Compute               /dev/sda      System  localDS0_esx#  20GB  Thin provision
(two per ESXi host)   /dev/sdb      Swap    localDS0_esx#  ~4GB  Thick provision lazy zeroed
                      /dev/sdc      Data    localDS0_esx#  5GB   Thick provision lazy zeroed
                      /dev/sdd      Data    localDS1_esx#  5GB   Thick provision lazy zeroed
                      /dev/sde      Data    localDS2_esx#  5GB   Thick provision lazy zeroed
                      /dev/sdf      Data    localDS3_esx#  5GB   Thick provision lazy zeroed
                      /dev/sdg      Data    localDS4_esx#  5GB   Thick provision lazy zeroed

Table 1. Hadoop Cluster Node Storage Provisioning
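The data disk sizes in Table 1 are consistent with a node's data capacity being divided evenly across the available local datastores, one virtual disk per datastore. The sketch below reproduces that arithmetic. The per-node totals of 150 GB and 25 GB are inferred from the table (5 × 30 GB and 5 × 5 GB) rather than stated in the text, and split_evenly is a hypothetical helper, not BDE's actual allocation algorithm.

```python
def split_evenly(total_gb, datastores):
    """Divide a node's data capacity evenly, one virtual disk per datastore."""
    per_disk_gb = total_gb / len(datastores)
    return {ds: per_disk_gb for ds in datastores}


# Datastore names follow the localDS<n>_esx# pattern used in Table 1.
local_datastores = [f"localDS{i}_esx#" for i in range(5)]

# Assumed per-node data capacities, inferred from the table.
data_node_disks = split_evenly(150, local_datastores)
compute_node_disks = split_evenly(25, local_datastores)

for ds, size_gb in data_node_disks.items():
    print(f"{ds}: {size_gb:g} GB")
```

Spreading the disks this way gives each node one data disk on every physical drive, which matches the usual Hadoop practice of striping I/O across independent spindles.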


Storage Management After Cluster Deployment

Allocation of Unused Datastore Storage

In a Hadoop cluster deployed by BDE, virtual disks of the cluster nodes cannot be resized through BDE to use available free space in the underlying vSphere datastores, nor can BDE create additional disks for the nodes. The unused datastore storage can be utilized in other ways:

• Scale out the existing Hadoop cluster to create more slave nodes
• Create another Hadoop cluster
• Allocate the storage to other applications

However, these methods—the last two in particular—of using available free space in the datastores will inevitably lead to disk contention with the existing Hadoop cluster when running concurrently, resulting in significant performance degradation. It is not a recommended practice unless applications and workloads can be scheduled reasonably to avoid contention.

Therefore, it is very important to plan and size storage carefully at both the physical and virtual layers prior to cluster deployment, taking into consideration the existing storage requirements as well as the prospects for data growth.
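A back-of-the-envelope sizing calculation can help with this planning. The sketch below estimates raw capacity from the logical data size, the HDFS replication factor, an expected growth rate and a free-space reserve. Every parameter value is an illustrative assumption: a replication factor of 3 is the HDFS default, while the 25 percent headroom and the growth figures are placeholders to be replaced with site-specific numbers.

```python
import math


def raw_capacity_gb(dataset_gb, replication=3, annual_growth=0.0, years=0, headroom=0.25):
    """Estimate raw HDFS capacity needed after `years` of compound growth.

    headroom is the fraction of raw capacity kept free for temporary
    and intermediate data; all defaults here are illustrative.
    """
    grown = dataset_gb * (1 + annual_growth) ** years
    return math.ceil(grown * replication / (1 - headroom))


today_gb = raw_capacity_gb(1000)
in_two_years_gb = raw_capacity_gb(1000, annual_growth=0.5, years=2)
print(today_gb, in_two_years_gb)
```

With these assumptions, 1 TB of logical data needs roughly 4 TB of raw capacity today and about 9 TB after two years of 50 percent annual growth, which shows how quickly an undersized datastore layout can be outgrown.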

Storage Failure and Recovery

Enterprise-class SAN and NAS storage rarely fails, due to the sophisticated set of high-availability capabilities built into the arrays. However, individual commercial off-the-shelf hard disk drives have a much higher probability of failure, particularly the lower-grade SATA drives that are often used for Hadoop deployment. When the underlying storage fails, the vSphere datastore becomes unavailable, resulting in loss of access to data among virtual machines using the datastore. If a virtual machine’s system disk resides in the datastore, the virtual machine is completely inaccessible. This section discusses the impact of a local disk failure on the Hadoop cluster using the disk and how to recover from the failure.

There are a few different scenarios related to hard disk failure:

• The failed disk is used by Hadoop node(s) for data disks only. The disk is not recoverable and must be replaced.
• The failed disk is used by Hadoop node(s) for both system and data disks. The disk is not recoverable and must be replaced.
• The failed disk is recoverable, with its VMFS partition undamaged and its content uncorrupted.

Disk Replacement and Node Data Disk Recovery

If the datastore created from the failed HDD contains Hadoop data disks only, each node using the datastore goes through the following sequence of states:

• The node loses one of the data disks. Consequently, the Hadoop Distributed File System (HDFS) blocks stored on the disk are missing from the node.
• The node can resume service after removal of the inaccessible data disk in accordance with the procedure described in VMware knowledge base article 1009854.
• After power-on of the node, BDE reprovisions the node appropriately and updates relevant Hadoop configuration files on the node to exclude the lost data disk. BDE then reports the node to be back in service.
• Hadoop reports the node to be alive again. The cluster is fully functional with all nodes, although this particular node has one fewer data disk.

After a new physical disk has replaced the failed one, the following procedure can be used to make it available to the Hadoop cluster to recover each of the affected nodes with a recreated data disk:

1. Create a VMFS datastore on the new disk, as detailed in the BDE User’s Guide.
2. Power off the node.
3. Add a virtual disk to the node.
   a. Click Edit Settings.
   b. Click Add in the virtual machine Properties window.
   c. Select Hard Disk as the type of device to add.
   d. Select Create a new virtual disk.
   e. Specify the disk size to be exactly the same as the other data disks on the node, and choose Thick Provision Lazy Zeroed as the provisioning type.
   f. Select Specify a datastore or datastore cluster for the disk location, and browse to choose the datastore created in step 1.
   g. Place the disk on the same SCSI controller and target location as the previously removed disk.
   h. Make the disk Independent in the Persistent mode.
4. Power on the node. BDE reprovisions the node appropriately and updates relevant Hadoop configuration files on the node to include the newly provisioned data disk. BDE then reports the node to be back in service.

It is recommended that an HDFS fsck be run after all affected nodes have been recovered. At this point, the Hadoop cluster is fully functional with all nodes. Each node has the same number of data disks as initially deployed by BDE. There should be no data loss throughout this entire failure and recovery process. Hadoop will not try to balance data blocks across the newly replaced data disks but will likely place blocks of newly created files on these data disks first.

Disk Replacement and Node Recovery

If a Hadoop node has both its system disk and data disk in the datastore created from the failed HDD, the cluster and node go into the following state:

• The node is completely dead due to the loss of system disk. BDE reports the node to be down.
• The cluster loses the node and places it on the deadNodes list.
• The cluster remains fully functional with the remaining nodes, although at a reduced capacity. There is no data loss because HDFS has replicas of the blocks elsewhere. Over time, HDFS will detect underreplicated blocks and replicate them automatically.

There is currently no way of recovering the node after a new physical disk has replaced the failed one. To preserve the cluster size, users can run the following command to scale out the slave node group by one:


Recoverable Disk Failure

Some hard disk failures can be repaired physically or through utility software that corrects logical sector errors. In other cases, the disk itself might be fine but there can be a problem with the cable or cable connection. In these cases, the hard disk can be reused after the problem has been corrected. Normally the disk contains the original VMFS partition and data. Therefore, the vSphere datastore is recovered when the disk has been made available again. In turn, the virtual disk(s) created in the datastore become available to the Hadoop nodes. All of this happens automatically within minutes after the disk has been made ready on the ESXi host. There is no specific action that must be taken in BDE and the Hadoop cluster. Any node that uses the datastore is back in service after powering up. Depending on whether both system and data disks are impacted, during this failure and recovery process the affected nodes might go into states described in the previous scenarios. Nevertheless, except for rare and extreme cases where nodes are too concentrated on the failed disk, severely undermining Hadoop functionality and availability, the Hadoop cluster remains functional.

Storage Configuration for Hadoop Outside of BDE

Data Disk Resizing

During cluster deployment, BDE calculates disk sizes based on cluster specification and availability of datastores. After the cluster has been deployed, BDE manages node storage only in terms of maintaining the number of disks as provisioned and presenting them for Hadoop to use. It does not keep track of other aspects, including disk size. Therefore, even though BDE currently provides no function to enable disk resizing for the cluster, it does not prohibit the disks from being resized manually.

System disks should not be resized. The following procedure can be used to resize a data disk:

1. Follow Hadoop best practices to move data blocks from the disk to other disks or nodes to maintain HDFS consistency.

2. Power off the node.
3. Resize the disk in vSphere.
   a. Click Edit Settings on the node.
   b. Select the disk to be resized, and enter the new disk size for Provisioned Size.
   c. Click OK to submit the change.
4. Power on the node.
5. Utilize the newly added disk space:
   Option 1:
   a. Install a partition management utility such as GParted in the node to expand the partition on the disk.
   b. Run the resize2fs command to expand the ext4 file system on the partition.
   Option 2:
   a. Run the fdisk command to remove the existing partition.
   b. Reboot the node. BDE will prepare the disk appropriately, including partition, file system, and Hadoop directory structure creation.


Utilization of Additional Disks

Disk resizing is a way of scaling up storage on a node. Another possible way of scaling up storage is to add more data disks to the node, which is much more complicated than disk resizing. The following procedure can be used to manually add a data disk to a node:

1. Power off the node.

2. Follow the previously described procedure to add a new disk to the node.
   a. Place the disk in the selected datastore.
   b. Attach the disk at the next SCSI controller and target location in sequence after the existing disks.
3. Power on the node.
4. Run the sfdisk command to create a single partition on the new disk.
5. Run mkfs to create an ext4 file system on the partition.
6. Create a mount point under /mnt and add the mount point to the /etc/fstab file.
7. Mount the file system to the mount point.

8. Create the following directory structure in the file system:

# mkdir hadoop
# chown -R hdfs:hadoop hadoop
# cd hadoop
# mkdir hdfs mapred
# cd hdfs
# mkdir data name secondary
# chown hdfs:hadoop data name secondary
# chmod 700 name secondary
# cd ../mapred
# mkdir local
# chown mapred:hadoop local

9. Edit the /usr/lib/hadoop-1.0.1/conf/hdfs-site.xml file to add the new HDFS name and data locations for the dfs.name.dir and dfs.data.dir properties respectively.

10. Edit the /usr/lib/hadoop-1.0.1/conf/mapred-site.xml file to add the new MapReduce local directory for property mapred.local.dir.

11. Restart the hadoop-0.20-datanode and hadoop-0.20-tasktracker services on the node.
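The configuration edits in steps 9 and 10 amount to appending a directory to a comma-separated property value in a Hadoop XML configuration file. The sketch below shows one way to script that change with Python's standard xml.etree.ElementTree module. The mount paths and the sample configuration are hypothetical, and the function works on an XML string so that it can be tried safely before it is applied to a real hdfs-site.xml or mapred-site.xml.

```python
import xml.etree.ElementTree as ET


def append_dir(config_xml, prop_name, new_dir):
    """Append new_dir to a comma-separated Hadoop directory property."""
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == prop_name:
            value = prop.find("value")
            dirs = [d for d in (value.text or "").split(",") if d]
            if new_dir not in dirs:  # keep the edit idempotent
                dirs.append(new_dir)
            value.text = ",".join(dirs)
            return ET.tostring(root, encoding="unicode")
    raise KeyError(f"property {prop_name!r} not found")


# Hypothetical existing configuration with a single data directory.
SAMPLE = """<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/sdc1/hadoop/hdfs/data</value>
  </property>
</configuration>"""

updated = append_dir(SAMPLE, "dfs.data.dir", "/mnt/sdh1/hadoop/hdfs/data")
print(updated)
```

The same helper applies to dfs.name.dir and mapred.local.dir, since both also take comma-separated directory lists. Back up the original file before overwriting it, because ElementTree does not preserve comments or the XML declaration when parsing from a string.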


Conclusion

An Apache Hadoop cluster deployed on VMware vSphere can leverage advanced vSphere HA, vSphere FT and vSphere vMotion features for enhanced availability by using shared storage, while also preserving data locality by using local storage for data nodes. Virtualization enables data and compute separation without
