Gluster Filesystem 3.3 Beta 2 Hadoop Compatible Storage

(1)

(2)

ds

Copyright

(3)

ds

1. About this Guide

This guide describes Gluster Hadoop Compatible Storage feature and its installation and management.

1.1. Disclaimer

Gluster, Inc. has designated English as the official language for all of its product documentation and other documentation, as well as all our customer communications. All documentation prepared or delivered by Gluster will be written, interpreted and applied in English, and English is the official and controlling language for all our documents, agreements, instruments, notices, disclosures and

communications, in any form, electronic or otherwise (collectively, the “Gluster Documents”). Any customer, vendor, partner or other party who requires a translation of any of the Gluster Documents is responsible for preparing or obtaining such translation, including associated costs. However, regardless of any such translation, the English language version of any of the Gluster Documents prepared or delivered by Gluster shall control for any interpretation, enforcement, application or resolution.

1.2. Audience

This guide is intended for Apache Hadoop users interested in using GlusterFS as filesystem for Hadoop.

1.3. Prerequisite

This document assumes that you are familiar with the Linux operating system, concepts of File System, GlusterFS concepts, Apache Hadoop, and MapReduce framework.

1.4. Terms

Term Description

master Master manages scheduling of jobs, assigns tasks to slaves, monitors tasks and re-executes _{the failed tasks.} slave Program which submits a job to the master.

job

A set of map and/or reduce tasks, coordinated by the master. When the master receives a job, it assigns a unique name for the job, and assigns the tasks to workers until they are all completed.

map The first phase of a job, in which tasks are usually scheduled on the same node where their input data is hosted, so that local computation can be performed. Generally there is one map task per input.

(5)

ds

Term Description

mapreduce A paradigm and associated framework for distributed computing, which decouples _{application code from the core challenges of fault tolerance and data locality.} task A task is essentially a unit of work, provided to a worker.

worker

A worker is responsible for carrying out a task. A job specifies the executable that is the worker. Workers are scheduled to run on the nodes, close to the data they are supposed to be processing.

1.5. Typographical Conventions

The following table lists the formatting conventions that are used in this guide to make it easier for you to recognize and use specific types of information.

Convention Description Example

Courier Text Commands formatted as courier indicate shell commands.

gluster volume start volname

Italicized Text _{Within a command, italicized text}

represents variables, which must be substituted with specific values.

gluster volume start volname

Square Brackets _{Within a command, optional parameters} are shown in square brackets.

gluster volume start volname [force]

Curly Brackets Within a command, alternative parameters are grouped within curly brackets and separated by the vertical OR bar.

gluster volume { start | stop | delete } volname

1.6. Feedback

Gluster welcomes your comments and suggestions on the quality and usefulness of its documentation. If you find any errors or have any other suggestions, write to us at [email protected] for clarification and provide the chapter, section, and page number, if available.

Gluster offers a range of resources related to Gluster software: Discuss technical problems and solutions on the Discussion Forum

(http://community.gluster.org) Get hands-on step-by-step tutorials

(6)

ds

2. Introducing Hadoop Compatible Storage of

GlusterFS

GlusterFS 3.3 beta 2 includes compatibility for Apache Hadoop and it uses the standard file system APIs available in Hadoop to provide a new storage option for Hadoop deployments. Existing

MapReduce based applications can use GlusterFS seamlessly. This new functionality opens up data within Hadoop deployments to any file-based or object-based application.

A MapReduce framework typically divides the input data-set into independent tasks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the jobs are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.

2.1. Architecture Overview

The following diagram illustrates Hadoop integration with Gluster:

2.2. Advantages

The following are the advantages of Hadoop Compatible Storage with GlusterFS: Provides simultaneous file-based and object-based access within Hadoop. Eliminates the centralized metadata server.

(7)

ds

3. Preparing to Install Hadoop Compatible Storage

This section provides information on pre-requisites and list of dependencies that will be installed during installation of Hadoop compatible storage.

3.1. Pre-requisites

The following are the pre-requisites to install and configure GlusterFS with Hadoop Compatible Storage:

Hadoop 0.20.2 is installed, configured, and is running on all the machines in the cluster. Java Runtime Environment

Maven (mandatory only if you are building the plugin from the source) JDK (mandatory only if you are building the plugin from the source)

Source code is available at https://github.com/gluster/hadoop-glusterfs.

3.2. Dependencies

(8)

ds

4. Installing and Configuring Hadoop Compatible

Storage

This section describes how to install and configure Hadoop Compatible Storage in your storage environment and verify that it is functioning correctly.

1. Download the GlusterFS RPM files on all servers in your cluster. You can download the software at http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/.

2. For each RPM file, get the md5sum (using the following command) and compare it against the md5sum file available at http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/CentOS/.

$ md5sum RPM_file.rpm

3. Install GlusterFS on all servers using the following commands: # rpm -Uvh core_RPM_file # rpm -Uvh fuse_RPM_file # rpm -Uvh geo-replication_RPM_file For example: # rpm -Uvh glusterfs-core-3.3beta2-1.x86_64.rpm # rpm -Uvh glusterfs-fuse-3.3beta2-1.x86_64.rpm # rpm -Uvh glusterfs-geo-replication-3.3beta2-1.x86_64.rpm 4. Verify that 3.3beta2 version of GlusterFS is installed, using the following command:

# glusterfs –version

For more information on installing GlusterFS, refer to GlusterFS Installation at

http://www.gluster.com/community/documentation/index.php/Gluster_3.2_Filesystem_Installa tion_Guide

5. Download the glusterfs-hadoop-0.20.2-0.1.x86_64.rpm on all servers in your cluster. You can download the software at http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/.

6. To install Hadoop Compatible Storage on all servers in your cluster, run the following command: # rpm –ivh --nodpes glusterfs-hadoop-0.20.2-0.1.x86_64.rpm

The following files will be extracted:

 /usr/local/lib/glusterfs-<Hadoop-version>-<gluster-plugin-version>.jar -  /usr/local/lib/conf/core-site.xml

7. (Optional) To install Hadoop Compatible Storage in a different location, run the following command:

(9)

ds

8. Edit the conf/core-site.xml file. The following is the sample conf/core-site.xml file: <configuration> <property> <name>fs.glusterfs.impl</name> <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value> </property> <property> <name>fs.default.name</name> <value>glusterfs://fedora1:9000</value> </property> <property> <name>fs.glusterfs.volname</name> <value>hadoopvol</value> </property> <property> <name>fs.glusterfs.mount</name> <value>/mnt/glusterfs</value> </property> <property> <name>fs.glusterfs.server</name> <value> fedora2</value> </property> <property> <name>quick.slave.io</name> <value>Off</value> </property> </configuration>

The following are the configurable fields:

Property Name Default Value Description

fs.default.name glusterfs://fedora1:9000 Any hostname in the cluster as the server and any port number.

fs.glusterfs.volname hadoopvol GlusterFS volume to mount.

(10)

ds

Property Name Default Value Description

quick.slave.io Off Performance tunable option. If this option is set to On, the plugin will try to perform I/O directly from the disk filesystem (like ext3 or ext4) the file resides on. Hence read performance will improve and job would run faster.

Note: This option is not tested widely. 9. Create a soft link in Hadoop’s library and configuration directory for the downloaded files (in

Step 7) using the following commands:

# ln -s <target location> <source location> For example,

# ln –s /usr/local/lib/glusterfs-0.20.2-0.1.jar $HADOOP_HOME/lib/glusterfs-0.20.2-0.1.jar

# ln –s /usr/local/lib/conf/core-site.xml $HADOOP_HOME/conf/core-site.xml 10. (Optional) You can run the following command on Hadoop master to build the plugin and deploy

(11)

ds

5. Starting and Stopping the Hadoop MapReduce

Daemon on GlusterFS

The MapReduce daemon serves to run MapReduce jobs on Gluster. Note: You must start Hadoop MapReduce daemon on all servers.

5.1. Starting and Stopping MapReduce Daemon

To start MapReduce daemon manually, enter the following command: # $HADOOP_HOME/bin/start-mapred.sh

To stop MapReduce daemon manually, enter the following command: # $HADOOP_HOME/bin/stop-mapred.sh

(12)

ds

6. Troubleshooting Hadoop Compatible Storage

This section describes the most common troubleshooting issues related to Hadoop Compatible Storage.

6.1. Time Sync

Running MapReduce job may throw exceptions if the time is out-of-sync on the hosts in the cluster. Solution: Sync the time on all hosts using ntpd program.

6.2. Socket Creation Errors

The CLI commands may not work with Centos 5.x, gluster ipv6 module and the socket creation may fail (error message will be logged in the log file).

(13)

ds

7. Creating GlusterFS Volumes

From GlusterFS 3.3 beta 2 onwards, you can create volumes of the following types in your storage environment:

Distributed Striped Replicated – Distributes and stripes data across replicated bricks in the volume. For more information, see Creating Distributed Striped Replicated Volumes. Striped Replicated – Stripes and replicates data across bricks in the volume. For more

information, see Creating Striped Replicated Volumes

7.1. Creating Distributed Striped Replicated Volumes

Distributed striped replicated volumes’ distributes and stripes data across replicated bricks in the cluster. For best results, you should use distributed striped replicated volumes where the

requirement is to scale storage, high concurrency environments accessing very large files, and performance is critical.

To configure a distributed striped replicated volume

1. Create a trusted storage pool consisting of the storage servers that will comprise the volume. For information on creating trusted storage pool, see

http://www.gluster.com/community/documentation/index.php/Gluster_3.2:_Adding_Servers_to _Trusted_Storage_Pool.

2. Create the volume using the following command:

Note: The number of bricks should be a multiples of number of stripe count and replica count for a distributed striped replicated volume.

# gluster volume create NEW-VOLNAME [stripe COUNT] [replica COUNT] [transport tcp | rdma | tcp,rdma] NEW-BRICK...

To create a distributed replicated striped volume across eight storage servers:

# gluster volume create test-volume stripe 2 replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4 server5:/exp5 server6:/exp6 server7:/exp7 server8:/exp8

Creation of test-volume has been successful Please start the volume to access data.

(Optional) Set additional options if required, such as auth.allow or auth.reject. For example:

(14)

ds

7.2. Creating Striped Replicated Volumes

Stripes data across replicated bricks in the cluster. For best results, you should use striped replicated volumes where the requirement is high concurrency environments accessing very large files and performance is critical.

To configure a striped replicated volume

1. Create a trusted storage pool consisting of the storage servers that will comprise the volume. For information on creating trusted storage pool, see

http://www.gluster.com/community/documentation/index.php/Gluster_3.2:_Adding_Servers_to _Trusted_Storage_Pool.

2. Create the volume using the following command:

Note: The number of bricks should be a multiple of the replicate count and stripe count for a striped replicated volume.

# gluster volume create NEW-VOLNAME [stripe COUNT] [replica COUNT] [transport tcp | rdma | tcp,rdma] NEW-BRICK...

To create a striped replicated volume across four storage servers:

# gluster volume create test-volume stripe 2 replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4

To create a striped replicated volume across six storage servers:

# gluster volume create test-volume stripe 3 replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4 server5:/exp5 server6:/exp6

3. (Optional) Set additional options if required, such as auth.allow or auth.reject. For example:

# gluster volume set test-volume auth.allow 10.*

Note: Make sure you start your volumes before you try to mount them or else client operations after the mount will hang. For information on starting volumes, see

(15)

ds

8. Managing Your Gluster Filesystem

The GlusterFS Administration Guide is available at:

Gluster Filesystem 3.3 Beta 2 Hadoop Compatible Storage

Copyright

Table of Contents

1. About this Guide

1.1. Disclaimer

1.2. Audience

1.3. Prerequisite

1.4. Terms

1.5. Typographical Conventions

1.6. Feedback

2. Introducing Hadoop Compatible Storage of

GlusterFS

2.1. Architecture Overview

2.2. Advantages

3. Preparing to Install Hadoop Compatible Storage

3.1. Pre-requisites

3.2. Dependencies

4. Installing and Configuring Hadoop Compatible

Storage

5. Starting and Stopping the Hadoop MapReduce

Daemon on GlusterFS

5.1. Starting and Stopping MapReduce Daemon

6. Troubleshooting Hadoop Compatible Storage

6.1. Time Sync

6.2. Socket Creation Errors

7. Creating GlusterFS Volumes

7.1. Creating Distributed Striped Replicated Volumes

7.2. Creating Striped Replicated Volumes

8. Managing Your Gluster Filesystem