Deploying a big data
solution using
IBM GPFS-FPO
Best practices to make smart decisions for optimal
performance and scalability
Abstract
Every day, the world creates 2.5 quintillion bytes of data. In fact, 90 percent of the data in the world today has been created in the last two years alone. While there is much talk about big data, it is not mere hype. Businesses are realizing tangible results from investments in big data analytics, and IBM’s big data platform is helping enterprises across all industries. IBM is unique in having developed an enterprise-class big data platform that allows you to address the full spectrum of big data business challenges.
IBM® General Parallel File System (GPFS™) offers an enterprise-class alternative to Hadoop Distributed File System (HDFS) for building big data platforms. GPFS is a POSIX-compliant, high-performing and proven technology that is found in thousands of mission-critical commercial installations worldwide. GPFS provides a range of enterprise-class data management features.
GPFS can be deployed independently or with IBM’s big data platform, consisting of IBM InfoSphere® BigInsights™ and IBM Platform™ Symphony. This document describes best practices for deploying GPFS in such environments to help ensure optimal performance and reliability.
Introduction
Businesses are discovering the huge potential of big data analytics across all dimensions of the business, from defining corporate strategy to managing customer relationships, and from improving operations to gaining competitive edge. The open source Apache Hadoop project, a software framework that enables high-performance analytics on unstructured data sets, is the centerpiece of big data solutions. Hadoop is designed to process data-intensive computational tasks, in parallel and at a scale, that previously were possible only in high-performance computing (HPC) environments.
Contents:
1 Abstract 1 Introduction 2 Assumptions3 Understanding big data environments 6 GPFS-FPO key concepts and
terminology
10 GPFS enterprise functions overview 13 Understanding data flow and
workload
14 Setting up a GPFS-FPO cluster 25 Preparing the GPFS file system layout 26 Modifying the Hadoop configuration
to use GPFS
29 Ingesting data into GPFS clusters 30 Exporting data out of GPFS clusters 30 Monitoring and administering GPFS
clusters
31 GPFS-FPO restrictions 31 Conclusion
The Hadoop ecosystem consists of many open source projects. One of the central components is the HDFS, a distributed file system designed to run on commodity hardware. Other related projects facilitate workflow and the coordination of jobs, support data movement between Hadoop and other systems, and implement scalable machine learning and data mining algorithms. However, HDFS lacks the enterprise-class functions necessary for reliability, data management and data governance.
GPFS is a POSIX-compliant file system that offers an enterprise-class alternative to HDFS. GPFS has been used in many of the world’s fastest supercomputers, including IBM Blue Gene®, IBM Watson™ (the supercomputer featured on
Jeopardy!) and the Argonne National Labs MIRA system.
Apart from supercomputers, GPFS is commonly found in thousands of commercial mission-critical installations worldwide, from biomedical research to financial analytics. In 2009, GPFS was extended to work seamlessly in the Hadoop ecosystem and is available through a feature called GPFS File Placement Optimizer (GPFS-FPO). Storing your Hadoop data using GPFS-FPO allows you to gain advanced functions and the high I/O performance required for many big data operations. GPFS-FPO provides Hadoop
compatibility extensions to replace HDFS in a Hadoop ecosystem, with no changes required to Hadoop applications. The best practices presented in this paper show how to deploy GPFS-FPO as a file system platform for big data analytics. The paper covers a variety of Hadoop deployment architectures, including InfoSphere BigInsights, Platform Symphony, direct-from-Hadoop open source (also referred as Do-It-Yourself) or with a Hadoop distribution from another vendor to work with GPFS. Architecture-specific notes are included where applicable.
Assumptions
The goal of this paper is to guide the administrator through various decision points to ensure optimal configuration based on the Hadoop application components being deployed. During the discussion, the following are assumed:
• A decision to use GPFS has already been made. The paper does not focus on why GPFS makes a better choice
compared to using HDFS—although differences with HDFS are highlighted where appropriate.
• The reader is already familiar with Hadoop components and other IBM products referred to in this paper. The paper covers high-level concepts only to establish the required context for follow-on topics. For further details, please use references listed at the end of this document or the respective product documentation.
• The reader understands this paper is not intended to serve as a comprehensive GPFS overview. Necessary concepts are described at a high level. Please use references for comprehensive understanding of the GPFS.
Understanding big data environments
Making the most of your big data environment requires that you include data from many different sources. Data sources vary by source and structure. Some are text-based documents such as logs from web applications and data feeds from social-networking applications, some are spreadsheets, and others can include transactional data from relational databases and data warehouses. Integration of these data sources forms the foundation of federated analytics. Data in these various forms tends to be large and continues to grow at anunprecedented rate. The speed at which data is analyzed and turned into business intelligence is critical to enterprises.
Figure 1 shows components of a big data environment. This environment includes core Hadoop open source components along with IBM products that seamlessly extend or replace Hadoop components to make the solution more enterprise-ready. The components applicable to your deployment depend on the data sources that you plan to integrate. Consider your current plans and future needs when making decisions for initial deployment so that the infrastructure can easily grow to support your business as it changes.
Figure 1. Big data environment.
Data sources/ connectors InfoSphere Streams IBM InfoSphere DataStage® IBM Netezza® IBM DB2® JDBC Flume BoardReader CSV/XML/JSON WebCrawler R IBM SPSS® Visualization and
discovery Development toolsEclipse plug-ins Systems management
Workload optimization
Runtime
Data store
File system
IBM BigSheets Web admin console
Text
analytics MapReduceHadoop
Jaql Hive Query
Analytics
ML analytics Text analytics
Enhanced security Big SQL BigIndex
IBM LZO-based
compression ZooKeeper Flexible scheduler
Jaql Lucene Avro
Oozie Pig Hive
Hadoop MapReduce Platform Symphony/AdaptiveMR
HBase Column store
HDFS GPFS-FPO Operational system Log and transactions Clickstreams Text, blog, weblog Existing data sources Biological sequences
IBM extensions/options (GPFS, InfoSphere BigInsights, Platform Symphony) Open source components Generic document formats
Hadoop components overview
The Hadoop project is comprised of three parts: HDFS, the
Hadoop MapReduce model and Hadoop Common. The goal of
Hadoop is to use commodity servers in a very large cluster to provide high I/O performance with minimal cost.
HDFS is designed for workloads that: • Use large files
• Access those files with sequential writes in append-only mode to store the data
• Use large sequential reads needed to run analytics jobs When writing a large file, HDFS distributes the contents of each file across the available storage by splitting it into large blocks and distributing each block across available servers (called data nodes). One of the nodes in the HDFS cluster, called the name node, coordinates access to the data in the cluster. The name node is assigned the critical role of managing data placement logic and storing metadata describing location details of the contents of a file. For optimal I/O performance, MapReduce uses the file metadata to assign workloads to the servers where the data is stored. Processing the data locally on a node eliminates network traffic overhead associated with storage area network (SAN) and network attached storage (NAS). To help ensure fault tolerance for common hardware failures, HDFS stores copies of each data block on more than one server. HDFS does not support in-place updates to stored data. To modify a file, it is necessary to move the file out of HDFS, modify the file and then move it back into HDFS. Since HDFS is not POSIX-compliant, you must use the HDFS shell and application programming interfaces (APIs) to work with the data stored within HDFS.
MapReduce is the heart of Hadoop. The term MapReduce refers to two separate distinct tasks that Hadoop programs perform. The first is the map function, which takes a set of data and processes it to capture interesting aspects of the data. The reduce function takes the output from map jobs and combines it into a larger set of output to provide insights into the data. In a Hadoop cluster, jobs are managed by a
component called JobTracker, which communicates with the name node to create multiple map-and-reduce tasks based on data locality. The tasks are managed by TaskTracker agents
Hadoop Common Components are a set of libraries that support various Hadoop subprojects. These components are intended to simplify the complex task of creating MapReduce applications. Given that HDFS is not POSIX-compliant, and therefore, common Linux or UNIX tools and APIs cannot be used, it is necessary to use the hdfs dfs <args> (or in Hadoop version1.x, hadoop dfs <args>) file system shell command interface and Hadoop file system APIs. Flume is another component used for ingesting data into HDFS. The MapReduce application development environment includes several components to reduce the time and skill needed for application development. These components include Hive, Pig, Oozie, Zookeeper, HBase and Avro, to name a few in the fast-growing list.
GPFS-FPO overview
GPFS was originally designed as a SAN file system, which typically isn’t suitable for a Hadoop cluster, since Hadoop clusters use locally attached disks to drive high performance for MapReduce applications. GPFS-FPO is an implementation of shared-nothing architecture that enables each node to operate independently, reducing the impact of failure events across multiple nodes. By storing your data in GPFS-FPO, you are freed from the architectural restrictions of HDFS. Additionally, you can take advantage of the GPFS-FPO pedigree as a multipurpose file system to gain tremendous management flexibility.
You can manage your data using existing toolsets and processes along with a wide range of enterprise-class data management functions offered by GPFS, including: • Full POSIX compliance
• Snapshot support for point-in-time data capture • Simplified capacity management by using GPFS for all
storage needs
• Policy-based information lifecycle management capabilities to manage petabytes of data
• Control over placement of each replica at file-level granularity • Infrastructure to manage multi-tenant Hadoop clusters based
on service-level agreements (SLAs)
GPFS Hadoop connector
All big data applications run seamlessly with GPFS—no application changes are required. GPFS provides seamless integration with Hadoop applications using a Hadoop connector. A GPFS Hadoop connector is shipped with GPFS, and a customized version is shipped with InfoSphere BigInsights. Platform Symphony uses the GPFS connector shipped with the GPFS product.
As shown in Figure 2, the GPFS Hadoop connector implements Hadoop file system APIs using GPFS APIs to direct file system access requests to GPFS. This redirection of data access from HDFS to GPFS is enabled by changing Hadoop XML configuration files that are described later in the document.
Figure 2. GPFS Hadoop connector.
InfoSphere BigInsights overview
InfoSphere BigInsights is a platform for Hadoop with the simple goal of making Hadoop enterprise-ready. Enterprise-readiness includes three key aspects:
• Analytics support: Enabling different classes of analysts to gain value directly from data that is stored in InfoSphere BigInsights, without the need for a team of Hadoop programmers or costly consulting.
• Data integration: Providing a comprehensive view of the business by ensuring the Hadoop system integrates with the rest of the IT infrastructure—for example, providing the ability to query Hadoop data from a data warehouse environment and vice versa.
• Operational excellence: Ensuring that the data stored in Hadoop adheres to the same business controls as the current infrastructure, including security, governance and administration.
InfoSphere BigInsights features Apache Hadoop and its related open source projects as core components, along with several IBM extensions to provide enterprise-class capabilities. These extensions include a web management console; development tools such as Eclipse plug-ins and a text analytics workbench; and analytics accelerators, visualization tools and connectors to ingest and integrate data from variety of data sources.
Platform Symphony overview
Platform Symphony is a high-performance computing infrastructure and service-oriented architecture (SOA) platform for building, running and managing compute and data-intensive applications in large-scale and shared-cluster environments. The MapReduce framework in Platform Symphony specifically enables application developers to write or integrate Hadoop-compatible MapReduce applications. It provides an application adapter that enables users to execute Hadoop MapReduce jobs without changing application code or recompiling.
The MapReduce framework in Platform Symphony is integrated with InfoSphere BigInsights as the Adaptive MapReduce component. This integration provides simplicity and ease of use for installation, configuration and integration of MapReduce applications by hiding product-specific details. Platform Symphony supports Apache Hadoop distribution and other distributions such as Cloudera. The GPFS integration is enabled in Platform Symphony Version 6.1.1 and is the required minimum for this discussion.
GPFS Hadoop Connector
Distributed file system: GPFS Distributed file system: HDFS Hadoop FS APIs
MapReduce API Applications
Figure 3 illustrates Hadoop applications interfacing with the Platform Symphony MapReduce application adapter. The adapter interfaces with Platform Symphony APIs and Hadoop common libraries, providing seamless application integration to GPFS through the Hadoop connector.
Figure 3. IBM Platform Symphony in a big data solution. The MapReduce framework is available with Platform Symphony—Advanced Edition. An entitlement key is
required to enable this feature. For detailed information about supported Hadoop distribution and licensing, see the Platform
Symphony Integration Guide for MapReduce Applications listed
under References and resources. Product versions
The following product versions are used as the basis for this document:
1. GPFS V3.5.0.11—Use of the latest available GPFS 3.5 PTF is recommended, unless BigInsights is used. Refer to GPFS
GPFS-FPO key concepts and terminology
GPFS-FPO represents a set of features that have been added in GPFS Version 3.5.0.11. This version provides formal support for GPFS-FPO. FPO extends the core GPFS architecture, providing greater control and flexibility to leverage data location, reduce hardware costs and improve I/O performance. It is important to understand key concepts and terms before you configure GPFS-FPO:• Block: A block is the largest unit for single I/O operation and space allocation in a GPFS file system. The block size is specified when a file system is created. The block size defines the stripe width for distributing data on each disk used by GPFS. GPFS supports block sizes ranging from 16 KB to 16 MB and defaults to 256 KB.
GPFS allows different block sizes for metadata and data, if disks for data and metadata are kept separate.
• Chunk: A chunk is a logical grouping of blocks that behaves like one large block. A block group factor dictates the number of blocks that are laid out on the disks attached to a node in a FPO configuration to form a chunk. The file data is divided into chunks, and each chunk is mapped to a node according to write-affinity depth setting. A chunk is then mapped to all available disks within a node. This allocation scheme is performed on a best-effort basis. For example, if a node does not have enough contiguous free blocks, GPFS attempts to allocate blocks on disks attached to other nodes. Block group factor allows applications to choose a chunk size at the file-level granularity needed to support data access requirements.
The chunk size is defined by multiplying block group factor and block size. The block group factor ranges from 1 to 1,024. The default block group factor is 1 for compatibility with standard GPFS file systems. Setting block size to 1 MB and block group factor to 128 results in a chunk size of 128 MB. Platform Symphony
Hadoop MapReduce application adapter
Platform Symphony APIs Platform Symphony
GPFS file system connector GPFS
Hadoop MapReduce apps, Oozie, Pig, Hive and so on
Hadoop MapReduce APIs
Hadoop common libraries HDFS file system connector
HDFS org.apache.hadoop.mapreduce org.apache.hadoop.mapred org.apache.hadoop.fs org.apache.hadoop.io org.apache.hadoop.util
• Network shared disk: When a LUN provided by a storage subsystem is configured for use by GPFS, it is referred to as a network shared disk (NSD). In a traditional GPFS installation, a LUN is typically made up of multiple physical disks provided by a RAID device. Once the LUN is defined as an NSD, GPFS allows all cluster nodes to have direct access to the NSDs. Alternatively, a subset of the nodes connected to the disks may provide access to the LUNs as an NSD server. In GPFS-FPO deployments, a physical disk and NSD can have a 1:1 mapping. In this case, each node in the cluster is a NSD server providing access to the disks from the rest of the cluster. In GPFS, use of the term “disk” is synonymous to NSD unless specified otherwise, mostly for historical reasons. NSDs are specified in an input file when using the mmcrnsd command or when creating a file system using the command mmcrfs. • Storage pool: A storage pool is a collection of NSDs. Storage
pools in GPFS allow you to group storage devices based on performance, locality or reliability characteristics within a file system.
Storage pools provide a method to partition file system storage to offer several benefits, including improved price-performance, reduced contention on premium resources, seamless archiving and advanced failure containment. For example, one pool could be an enterprise-class storage system using high-performance flash devices, and another pool might consist of numerous disk controllers that host a large set of economical Serial ATA (SATA) disks.
GPFS has two types of storage pools, internal and external. GPFS manages the data storage for internal storage pools. For external storage pools, GPFS does the policy processing but the data is managed by an external application such as IBM Tivoli® Storage Manager. Storage pools are declared as an attribute of the disk when defining disks (NSDs).
• Failure group: A failure group is a set of disks that share a common point of failure that could cause them all to become simultaneously unavailable. When creating multiple replicas of a given block, GPFS uses failure group information to ensure that no two replicas exist within the same failure group.
Traditionally, GPFS failure groups have been identified by simple integers. FPO introduced a multipart failure group concept to further convey the topology information of the nodes in the cluster that GPFS can exploit when making data placement decisions.
A failure group can be specified as a list of up to three comma-separated numbers that convey topology information. In general, this list is a way to specify which disks are closer together and which are farther away. In practice, the three elements of the failure group definition may represent the rack number of the node to which the disk belongs, a position within the rack (lower or upper half) and a node number. For example, the failure group 2,1,0 identifies rack 2, bottom half, first node. Note that the second number in the locality information is restricted to be either 0 or 1. When considering two disks for striping or replica placement purposes, it is important to understand the following:
• Disks that differ in the first of the three numbers are considered farthest apart. In the rack topology example, this would be disks located in different racks.
• Disks that have the same first number but differ in the second number are considered closer. In the rack topology example, this would be disks located in a different half of the rack.
• Disks that differ only in the third number are assumed to reside in different nodes placed close together. In the rack topology example, this would be disks located in the same half of the rack.
• Only disks that have all three numbers in common reside in the same node.
The use of the terms “farthest apart” and “close together” in failure group definitions refer to network topology. Two nodes “close together” may be on the same Ethernet switch, whereas nodes farther apart may traverse multiple switches to communicate. These parameters allow you the flexibility to design your data access patterns to best fit your environment, whether it is in a single data center or spread across many kilometers.
The failure group definition allows locality information of the cluster nodes and the associated disks to be described to GPFS and enables intelligent decision making for data block placements. Failure group is declared as an attribute of the disk when defining disks (NSDs).
• Metadata and data placement: GPFS allows for separation of storage used for metadata and data blocks since they may have different requirements for availability, access and performance. Metadata includes file metadata (file size, name, owner, last modified date); internal data structures such as quota information allocation maps, file system configuration and log files; and user-specified metadata such as inode, directory blocks, extended attributes and access control lists. The data includes data contents of a file.
Metadata is vital to file system availability and should be stored on the most reliable storage media. The file data can be placed on the storage media based on the importance of the data. GPFS provides rich policies for initial placement and subsequent movement of data between storage pools (to be discussed later).
GPFS places metadata in a storage pool named system. Within the system storage pool, disks can be assigned to hold metadata only, or both data and metadata. Unless specified otherwise, using placement policies, all data is stored in the system storage pool. Metadata is striped across designated disks in the system pool.
For better resiliency, distributing metadata across all available nodes is not recommended in an FPO environment. Having the metadata on too many nodes reduces the availability of the system because the probability of three nodes failing at the same time increases with the number of nodes. Therefore, it is recommended that you limit the number of nodes with metadata disks to one or two nodes per rack.
• Replication: GPFS supports replication for failure containment and disaster recovery. Replication can incur costs in terms of disk capacity and performance, and
therefore should be used carefully. GPFS supports replication of both metadata and data independently and provides fine-grained control of the replication. You can choose to replicate data for a single file, a set of files or the entire file system. Replication is an attribute of a file and can be managed individually or automated through policies.
If replication is enabled, the target storage pool for the data must have at least two failure groups defined within the storage pool. Replication means that a copy of each block of data exists in at least two failure groups for two copies, and three failure groups for three copies. GPFS supports up to three replicas for both metadata and data. The default replication factor is 1, meaning no replicas are created. In many GPFS deployments, data protection is handled by the underlying RAID storage subsystem. In FPO
environments, it is more common to rely on software replication for data protection.
Data block placement decisions in FPO environments are affected by the level of replication and the value of the
write-affinity depth and write-affinity failure group parameters
that are described in this section.
• Write-affinity depth (WAD): Introduced for FPO, WAD is a policy for directing writes. It indicates that the node writing the data directs the write to disks on its own node for the first copy and to the disks on other nodes for the second and third copies (if specified). The policy allows the application to control placement of replicas within the cluster to optimize for typical access patterns.
Write affinity is specified by a depth that indicates the number of replicas to be kept closer to the node ingesting data. This attribute can be specified at the storage pool or individual file level. The possible values are 0, 1 and 2. A write-affinity depth of 0 indicates that each chunk replica is to be striped across all the available nodes, with the restriction that no two replicas are in the same failure group.
A write-affinity depth of 1 indicates that the first replica is written to the node writing the data. The second and third replicas are striped at the chunk level among the nodes that are farthest apart (based on failure group definition) from the node writing the data, with the restriction that no two replicas are in the same failure group. Using the rack topology example, the rack used for the first replica is not used by the second and the third replicas. This allows data to remain available even if the entire rack fails.
A write-affinity depth of 2 indicates that the first replica is written to the node writing the data. The second replica is written to a node in the same rack as the first replica (based on the failure group definition), but in the other half of the rack. The third replica is striped at the chunk level across all available nodes that are farthest (per the failure group definition) from the nodes used by the first two replicas. Using the rack topology example, the second replica is placed in the same rack as the first replica, but in the second half of the rack. The third replica is striped across nodes that are not located in the rack used by the first and second replicas. For compatibility reasons, the default write-affinity depth is 0 and the unit of striping is a block; however, if block group factor is specified, striping is done at the chunk level. The replica placement is done according to the specified policy at the best-effort level. If a node does not have sufficient free space, GPFS allocates space in other failure groups, ensuring that no two replicas of a chunk are assigned to the same node.
For MapReduce and data warehousing applications, a write-affinity depth of 1 is recommended, as the application benefits from the locality of the first replica of the block. • Write-affinity failure group: The write-affinity failure
group extends the write-affinity depth concept by allowing the application to control placement of each replica of a file and is therefore applicable in FPO-enabled environments only. It defines a policy that indicates the range of nodes where replicas of blocks of a particular file are to be written. This enables applications to control the layout of a file in the cluster to align with data access patterns.
The write-affinity failure group specification function uses a failure group list to specify the nodes to be used for each of the replicas. A write-affinity failure group can be specified for each replica of a file.
When the write-affinity failure group provides specification for all replicas, write-affinity depth is completely
overridden. Write-affinity depth policy is used only for replica specification that is missing in the write-affinity failure group specification. The default policy is a null specification and uses the write-affinity depth settings for replica placement.
The following format is used to specify write-affinity failure groups:
FailureGroup1[;FailureGroup2[;FailureGroup3]]
where each FailureGroupN identifies nodes on which to place the first, second and third replicas of each chunk of data. FailureGroupN is a comma-separated string of up to three failure groups. Failure group list notation is extended to include more than one node or range of nodes in the following format:
Rack1{:Rack2{:...{:Rackx}}},Location1{:Location2{:...{:Locationx}}},Ext Lg1{:ExtLg2{:...{:ExtLgx}}}
• Wildcard characters (*) are supported in these fields. • Range can be specified using ‘-’
• Specific numbers are specified using ‘:’
• If any part of the field is missing, it is interpreted as 0. For example, the attribute 1,1,1:2;2,1,1-3;2,0,* using the rack topology terminology indicates that the first replica is striped on rack 1, rack location 1, nodes 1 and 2; the second replica is striped on rack 2, rack location 1, nodes 1, 2 and 3; and the third replica is on rack 2, rack location 0 and all nodes in that location. The attribute 1,*,*; 2,*,*;3,*,* results in striping the first replica on all nodes in rack 1, the second replica on all nodes in rack 2 and the third replica on all nodes in rack 3.
This attribute can be set using the policy rule or by setting the extended attribute write-affinity-failure-group using the command mmchattr on a file before writing data blocks.
• File placement policy: A file placement policy can be defined at the file system, storage pool or file level of granularity. For this purpose, GPFS has a policy engine that allows you to set data placement attributes based on the file name, fileset and other attributes. The policy engine offers a rich set of language to query and search files, take action and specify rules for data placement.
In most cases, file placement policies apply to groups of files based on a set of attributes. Some of the placement
attributes that can be affected by using policy rules include destination storage pool, chunk size, write-affinity depth, write-affinity failure group and replication factor. • Recovery from failures: In a typical FPO cluster, nodes
have direct-attached disks. These disks are not shared between nodes as in a traditional GPFS cluster, so if the node is inaccessible, the associated disks are also inaccessible. GPFS provides ways to automatically recover from these and similar common disk failure situations.
In FPO environments, automated recovery from disk failures can be enabled using the restripeOnDiskFailure=yes configuration option.
Whether an FPO-enabled file system is a subject of an automated recovery attempt is determined by the max replication values for the file system. If either the metadata or data replicas are greater than one, then a recovery action is triggered. The recovery actions are asynchronous, and GPFS continues processing while the recovery takes place. The results from the recovery actions and any errors encountered are recorded in the GPFS logs.
When auto-recovery is enabled, GPFS delays recovery action for disk failures to avoid recovery from transient failures such as node reboot. The disk recovery wait period can be customized as needed. The default setting is 300 seconds for disks containing metadata and 600 seconds for disks containing data only. The recovery actions include restoring proper replication of data blocks from the failed disks by creating additional copies on other available disks. In case of a temporary failure that occurs within the recovery wait period, recovery actions include the restarting of failed disks so they are available for allocation.
GPFS enterprise functions overview
GPFS offers a rich set of features to simplify datamanagement and data protection, and this section provides a brief summary of the key features. Please refer to GPFS documentation for additional details.
Fileset
In addition to a traditional file system directory based
hierarchical structure, GPFS utilizes a file system object called a fileset. A fileset is a sub-tree of a file system namespace, and in many respects behaves like an independent file system. Filesets provide a means of partitioning the file system to allow administrative operations at a finer granularity than the entire file system:
• Filesets can be used to define quotas.
• The fileset is an attribute of each file and can be specified in a policy to control initial data placement, migration and replication of the file’s data.
• Fileset snapshots can be created instead of creating a snapshot of an entire file system.
• The active file management feature is enabled on a per-fileset level.
GPFS supports independent and dependent filesets. An independent fileset is a fileset with its own inode space. An inode space is a collection of inode number ranges reserved for an independent fileset. An inode space enables more efficient per-fileset functions, such as fileset snapshots. A dependent fileset shares the inode space of an existing, independent fileset. Files created in a dependent fileset are assigned inodes in the same inode number range that is reserved for the independent fileset from which it was created. In GPFS-FPO environments, consider snapshot and
placement policy requirements as you plan GPFS file system layout. For example, temporary and transient data that does not need to be replicated can be placed in a fileset limited to one replica. Similarly, data that cannot be easily re-created may require frequent snapshots and therefore can reside in an independent fileset.
Information lifecycle management (ILM) GPFS can help you achieve data lifecycle management efficiencies through policy-driven, automated tiered storage management. Using storage pools, filesets and policies together, you can match the cost of your storage resources to the value of your data.
With these tools, GPFS can automatically determine where to physically store your data regardless of its placement in the logical directory structure. You can choose to place the data in different storage tiers and move it to a lower-cost tier as the value of the data reduces over time.
GPFS allows you to transparently move data to storage pools managed internally or to storage pools managed externally, such as Tivoli Storage Manager or IBM Linear Tape File System™ (LTFS).
Snapshot
You can snapshot an entire GPFS file system or only a fileset to preserve its contents at a single point in time. Snapshots are read-only, so changes can be made only to the active (that is, normal, non-snapshot) files and directories. The snapshot function allows a backup or mirror program to run concurrently with user updates and still obtain a consistent copy of the data as of the time that the snapshot was created. Snapshots can provide an online data protection capability that allows easy recovery from common problems such as accidental deletion of a file, and comparison with older versions of a file. GPFS allows snapshots of an entire file system, an
independent fileset or an individual file. A file-level snapshot is called a clone.
File clone
A file clone is a writable snapshot of an individual file. Cloning a file is similar to creating a copy of a file, but the creation process is faster and more space-efficient because no additional disk space is consumed until the clone or the original file is modified. Multiple clones of the same file can be created with no additional space overhead.
In a Hadoop workload, clones can be used to run MapReduce jobs while the original copy is being used to ingest new data to ensure your applications get a consistent view of data. Clones can be used to provision virtual machines by cloning a common base image to create a virtual disk for each machine. You can clone the virtual disk image of an individual machine as part of taking a snapshot of the machine state.
Access control lists
Access control protects directories and files by providing a means of specifying who is granted access. GPFS access control lists are either traditional ACLs based on the POSIX model, or network file system (NFS) V4 ACLs. NFS V4 ACLs are very different than POSIX ACLs and provide much more control of file and directory access. A comprehensive multi-tenancy solution can be built by leveraging GPFS ACLs to isolate and secure data.
Quotas
The GPFS quota system helps you to control the number of files and the amount of file data in a file system. GPFS quotas can be defined for individual users, groups of users and individual filesets. Quotas can be enabled by the system administrator to control the amount of space used by the individuals, groups of users or individual filesets. By default, user and group quota limits are enforced across the entire file system. Optionally, the scope of quota enforcement can be limited to an individual fileset boundary.
In big data environments, quotas can be applied based on the data sources to effectively manage storage capacity. In multi-tenant and cloud environments, quotas can be applied on a per-tenant or application basis and can be leveraged in chargeback.
Active file management (AFM)
Active file management is a scalable, high-performance, file-system caching toolset integrated with the GPFS cluster file system. AFM lets you create associations from a local GPFS cluster to a remote cluster, and to define the location and flow of file data to automate the management of the data. This feature allows you to implement a single namespace view across sites around the world.
AFM masks wide-area network latencies and outages by using GPFS to cache data sets, allowing data access and modifications even when the remote storage cluster is unavailable. In addition, AFM performs updates to the remote cluster asynchronously, which allows applications to continue operating unconstrained by limited outgoing network bandwidth.
AFM leverages the inherent scalability of GPFS to provide a multi-node, consistent cache of data located at another GPFS cluster. By integrating with the file system, AFM offers a POSIX-compliant interface, making the cache completely transparent to applications. AFM can be enabled at the file system level or at the fileset (independent mode) level.
In big data environments, AFM can be used to ingest data in locations that are closer to the data source. This data can then be distributed, using AFM to move data between multiple clusters for further processing. You can use AFM to maintain an asynchronous copy of the data at a separate physical location or use GPFS synchronous replication (used by FPO replicas). For synchronous replication, data at the remote site is guaranteed to be kept in sync with the main site, but latency between sites can affect the performance of any GPFS file system using synchronous replication. Asynchronous replication as used by GPFS-AFM reduces the effect of latency or limited bandwidth between sites, but does not guarantee that the remote copy is kept in exact
synchronization with the data at the main site. Clustered NFS
GPFS allows you to configure a subset of nodes in the cluster to provide a highly available solution for exporting GPFS file systems NFS.
The participating nodes are designated as clustered NFS (CNFS) member nodes, and the entire setup is frequently referred to as CNFS or a CNFS cluster. In this solution, all CNFS nodes export the same file systems to the NFS clients. When one of the CNFS nodes fails, the NFS serving load moves from the failing node to another node in the CNFS cluster.
In GPFS-FPO clusters, all data accessed over NFS should have at least two replicas (three are preferred) to allow file serving to continue as nodes are rebooted. A single replica can cause unexpected reboot of multiple nodes as the NFS high-availability component attempts to find a node to serve data. In the event of errors, it triggers node reboot to correct data access issues and moves the NFS IP to another node. This action can lead to node reboot if the node containing data is not yet available. It is recommended that you dedicate two or more nodes outside the FPO data nodes for CNFS. The CNFS nodes should not have local GPFS NSDs. This ensures that NFS data is distributed evenly across all FPO nodes and also helps reduce the rolling node failure issue. A CNFS node requires a
GPFS server license.
Efficient backup with Tivoli Storage Manager Since GPFS is POSIX-compliant, you can use any standard backup solution on the file data. The GPFS policy engine can be used to scan the file system for changed files. GPFS provides an integrated solution for faster and efficient backup to Tivoli Storage Manager (TSM) using the mmbackup command. The mmbackup command utilizes all the scalable, parallel processing capabilities of the policy engine to:
• Scan the file system
• Evaluate the metadata of all the objects in the file system • Determine which files should be sent to backup in TSM • Determine which deleted files should be expired from TSM Both backup and expiration take place when running mmbackup in the incremental backup mode. The backup process causes a high volume of metadata I/O. Therefore, use of SSD/flash devices for storing file system metadata is highly recommended for fast file system scans when using the mmbackup command for backup to TSM. These devices are also recommended when creating a backup solution using the GPFS policy engine with another data protection product from another vendor.
Retention and immutability
To prevent files from being changed or deleted unexpectedly, GPFS provides immutability and appendOnly restrictions. You can apply immutability and appendOnly restrictions either to individual files within a fileset or to a directory. A retention period can be set on a file to prevent its deletion before the retention period is expired.
An immutable file cannot be changed or renamed. An appendOnly file allows append operations, but not delete, modify or rename operations. An immutable directory cannot be deleted or renamed, and files cannot be added or deleted under such a directory. An appendOnly directory allows new files or subdirectories to be created with 0 byte length; all such new created files and subdirectories are marked as appendOnly automatically.
In big data environments, write workloads tend to be mostly append. The GPFS appendOnly restriction can be useful in some business environments to ensure authenticity of the data by protecting against inadvertent updates or deletion.
Understanding data flow and workload
Driving optimal performance from your resources requires understanding the workload to ensure the application data access pattern is aligned with the file system’s storage layout. For instance, any temporary data that can be easily re-created does not need to be replicated. Hadoop consists of several different application components, and each may have different data storage patterns (file size, directory layout) and access patterns (random, sequential), as well as appropriate chunk sizes for MapReduce tasks.Storage workload characteristics
To help you make informed decisions about the placement of data, Table 1 describes common Hadoop components along with their associated I/O workload characteristics and storage needs. The workload characteristics are described based on common production applications that might be different in your environment. Use this table as a guide to evaluate storage characteristics and to configure GPFS accordingly.
Hadoop component Workload characteristics Comments
Flume (Data collection and aggregation) Sqoop (Bulk data transfer between Hadoop and relational databases)
Misc. data sources (for example, web server, system log, application logs)
Data files—Large1 files
• Read: Does not apply • Write: Sequential append
Given that this data represents source data, replication should be set to 3. Refer to the Ingesting data into GPFS Clusters section to ensure replicas are striped across all data nodes for evenly distributed MapReduce jobs.
Hive (Data warehouse for data summarization, query, and analysis)
Big SQL (Data summarization and query) Pig (Programming and query language)
Data files (application-specific) Large/medium2
• Read: Sequential • Write: Sequential
Recommended chunk size 128 MB (block size =1 MB, block group factor =128)
If ingesting bulk data from other sources, refer to the Ingesting data into GPFS Clusters section; otherwise, use the recommended default of WAD=1.
HBase (Real-time read and write database) HBase data files (application-specific) Large-medium2
• Read: Random (~64 K) • Write: Sequential (~64 K) HBase log files: Small3
• Read: Sequential
• Write: Sequential (200–1024 bytes)
Data and log: Recommended chunk size 128 MB (block size =1 MB, block group factor =128) If ingesting bulk data from other sources, refer to the Ingesting data into GPFS Clusters section; otherwise, use the recommended default of WAD=1.
Lucene (Text search) Data files (Index files) • Read: Random • Write: Sequential
Data files (Term/dictionary files) • Read: Random
• Write: Random
No location affinity needed. Stripe over all available data nodes. Use block group factor of 1 (default) to optimize access time to indices files.
MapReduce Data files: See application-specific entries Intermediate and output files (application-specific)
• Read—sequential (~64K) • Write—write (~64K)
If the data is temporary, use single replica by using a fileset with replication set to 1 or use local file systems as per best practices.
Oozie (Work flow and job orchestration) See comments Uses local file systems for configuration files. Application-specific Jar files stored in GPFS. Use WAD=0 or follow the WAFG setting described in the Ingesting data into GPFS Clusters section.
Chukwa (Monitoring large clustered systems) See comments Chukwa uses HBase. Refer to HBase for details.
Table 1. Hadoop components with storage workload characteristics. 1. Large file size: 1 GB and above 2. Medium file size: 128 MB to 1 GB 3. Small file size: Under 128 MB
Setting up a GPFS-FPO cluster
IMPORTANT NOTES:1. Do not copy and paste commands from this document directly, as the hidden control characters could lead to unexplainable errors.
2. If using InfoSphere BigInsights Adaptive MapReduce, follow the steps under If using InfoSphere BigInsights. Steps described under If using Platform Symphony should be followed only if Platform Symphony was installed explicitly and InfoSphere BigInsights is not being used.
Planning considerations for optimum performance As you plan your GPFS-FPO cluster architecture, consider the following:
• For consistent performance, all cluster nodes should have similar hardware configuration, including the number of processors, amount of memory, and the number and type of hard disks used by GPFS.
• For resiliency, a minimum of two racks with a minimum of two nodes per rack are recommended for a production deployment. This implies a minimum of four nodes in the cluster.
• If rack configuration layout for failure group notation is being followed, an even number of nodes per rack is recommended to allow balanced grouping of nodes in rack halves.
Prepare physical disks for GPFS consumption as follows: • To obtain best price-performance, do not use hardware
RAID configuration for disks used by GPFS for metadata and data.
• Ensure write-cache is enabled for all physical disks used by GPFS for storing data only. The disks used for metadata should have the write-cache disabled, since GPFS flushes metadata to the disk as necessary to ensure file system consistency.
[root@c150f1ap06 ~]# parted /dev/sdy print /dev/sdy: unrecognised disk label
[root@c150f1ap06 ~]# parted /dev/sdy mklabel gpt [root@c150f1ap06 ~]# parted –s –a optimal /dev/sdy mkpart primary 0% 100%
[root@c150f1ap06 ~]# parted /dev/sdy print Model: IBM MBF2300RC (scsi)
Disk /dev/sdy: 300GB
Sector size (logical/physical): 512B/512B Partition Table: gpt
Number Start End Size Type File system Flags 1 1049kB 300GB 300GB primary
For Linux only
GPFS on Linux does not label disks in a way that is
recognized by the operating system and other system utilities. To all other software running on GPFS nodes, direct attach NSDs appear to be unused devices. As a result, GPFS data can be overwritten inadvertently, thus corrupting GPFS data. This often happens because the system utilities have no way to detect when a device contains GPFS data.
To prevent this risk of data corruption, you should create a GUID Partition Table (GPT) and at least one partition on each disk to be used as GPFS NSD. The example below shows disk /dev/sdy being configured in a single partition consuming the entire disk capacity.
If your analytics workload generates many temporary files smaller than 128 MB (refer to Table 1), it is strongly
recommended that you use local file systems such as ext3 and ext4. You can partition physical disks that are used for data (using fdisk or parted on Linux). Use the first 90 percent of the capacity for GPFS and leave the remaining 10 percent for a local file system.
When creating multiple partitions on a disk, always use the first partition for GPFS to ensure GPFS and disk block boundaries are aligned.
Create an ext3/ext4 file system independently on each of the second partitions. Do not use Linux Volume Manager to create a single file system combining all partitions, as that leads to performance degradation. To increase performance,
Metadata storage
To optimize performance and reliability, metadata nodes and disks should be planned carefully. Having too many nodes with metadata disks reduces overall reliability, while having too few nodes affects resiliency. Use the following guidelines to assign metadata nodes and disks:
• A GPFS cluster should have a minimum of four nodes with metadata disks.
• In a large cluster, designate one metadata node per rack. • In a very small cluster with four or less nodes, assign
metadata disks on each node.
• Each metadata node should have a minimum of two disks designated for metadata.
• Ensure metadata disks are distributed evenly across all metadata nodes for balanced workloads—for example, each of the four nodes should have four SSD drives, as opposed to a mixed configuration of one node with eight drives, one with four drives and two nodes with two drives each.
• If you plan to use SSD for metadata, the system pool should include all SSD disks designated metadataOnly.
• If you plan to use the same types of disks for metadata and data, it is still best to designate all disks in the system pool as metadataOnly. However, this might result in larger space allocated for metadata than might be needed. If storage utilization is a concern, disks in the system pool can be designated as dataAndMetadata for some performance trade-off. In that case, the file system block size must match the metadata block size. When using more than one pool, you must create GPFS policy rules to place selected data in the system pool, while directing data driving the MapReduce workload to the data pool that is FPO enabled.
Identify the nodes in your cluster for the following GPFS node roles (a physical node can take on more than one role): • Manager nodes: These nodes are part of the node pool
from which nodes are selected for various management tasks within the cluster (some are described below). A minimum of one per rack is recommended. Provide additional memory of about 600 MB on these nodes to serve management functions.
• Primary and secondary cluster configuration server: It is best to use nodes in different racks for this purpose. Consider using nodes with disks holding metadata.
• Quorum node: Nodes in the configuration server and manager roles are expected to be more reliable in terms of availability and tend to get more attention from
administrators. For quorum, use one node per rack minimum and a total count resulting in an odd number. A cluster-wide maximum of seven quorum nodes is recommended.
Determine the number of required file systems for your cluster. For better performance, a single file system is recommended for the entire cluster. With GPFS, you can partition the data and storage as needed using one file system. In some cases, you may want to have a small file system for testing purposes and one for production. Note that you can use only one file system at a time with a given Hadoop installation.
To optimize performance of a MapReduce workload, it is recommended that you use a larger inode size of 4,096 bytes (default 512 bytes), especially if the data files are smaller than 100 MB. This helps to improve performance of JobTracker when it is mapping out tasks based on the location of data blocks, by avoiding the overhead of reading additional block addresses from the disk.
Block group factor (BGF)
A block group factor of 128 works well for most workloads. However, further tuning may be needed for specific workloads based on empirical measurements. For example, in MapReduce environments, the file block size is set to 1 MB and the block group factor is set to 128, leading to an effective large block size of 128 MB. You should experiment with block group factor settings to determine the right block group factor or validate the recommended settings for your environment.
Using the policy rules or the mmchattr command, BGF can be applied to a specific file or a group of files. Use the following guidelines:
• Identify the number of storage pools you plan to have. Storage pools are typically determined by the characteristic of the storage types. In a GPFS-FPO cluster, you will have a minimum of two storage pools: one for metadata (with standard block allocation) and one for data that is FPO-enabled.
• Carefully assign a failure group to each node in the cluster based on the failure boundaries for your rack configuration, power distribution and other factors. Typically, each node with disks should map to a unique failure group notation.
• All nodes should have the same number of data disks with roughly equal capacity. If a node has fewer disks, as might be the case for nodes with metadata disks, ensure fewer MapReduce tasks are assigned to those nodes. If a node contains no data disks, no MapReduce tasks should be assigned to those nodes.
For additional information on GPFS cluster planning, see the
GPFS Concepts, Planning, and Installation Guide.
Sample GPFS-FPO configuration
Figure 4 describes a sample environment used as a basis to describe GPFS-FPO cluster planning and configuration in this section.
The example system in Figure 4 has three racks, each containing four nodes. The nodes in each rack are logically grouped in top and bottom halves of the rack for topology specification. Each of the nodes has four SAS drives that are
drives used in this example is smaller than typically used in a real cluster. All nodes have the same hardware characteristics. One node in each rack has two SSD drives that are designated for metadata. It is sufficient to have one node per rack for metadata. To have three replicas for metadata, a minimum of three failure groups are required. For configuration with fewer racks, designate multiple nodes per rack for metadata storage. Failure group notation is used to describe the node layout. In failure group notation, the first number is used to describe the rack number (1, 2 or 3), and the second number is used to describe the bottom (0) or top (1) halves of the rack. The third number denotes the node’s position (1 or 2) in the respective halves of the rack. Failure group and the unique node name (used in commands and input data for cluster setup) are identified for each of the nodes. Note that metadata disks for a node have the same failure group as the data disks for that node. This configuration results in 12 failure group notations. Figure 4. Sample cluster configuration.
Node 2 (n1_1_2) Node 1 (n1_1_1) Node 2 (n2_1_2) Node 1 (n2_1_1) Node 2 (n3_1_2) Node 1 (n3_1_1) Node 2 (n1_0_2) Node 1 (n1_0_1) Node 2 (n2_0_2) Node 1 (n2_0_1) Node 2 (n3_0_2) Node 1 (n3_0_1) FG:1,1,2 FG:1,1,1 FG:2,1,2 FG:2,1,1 FG:3,1,2 FG:3,1,1 FG:1,0,2 FG:1,0,1 FG:2,0,2 FG:2,0,1 FG:3,0,2 FG:3,0,1 Top Half: 1 Bottom Half: 0
Rack No. 1 Rack No. 2 Rack No. 3
Metadata disks—SSD Data disks—SAS
Installing and configuring GPFS clusters
GPFS-FPO was introduced with GPFS 3.5.0.7. It is highly recommended that you use GPFS 3.5.0.12 or above to leverage all GPFS-FPO functions described in this document. See the GPFS Concepts, Planning, and Installation Guide for directions specific to your operating system and hardware. To create a cluster, the first step is to assign cluster roles as described earlier. For this sample configuration:
• Quorum nodes: A total of five quorum nodes are selected. These include three metadata nodes (n1_0_2, n2_1_2, n3_0_2) and two configuration server nodes (n1_1_2, n3_1_1). The configuration servers are located in different rack halves than the metadata nodes.
• Primary cluster configuration server: Node n1_1_2. • Secondary cluster configuration server: Node n2_1_2.
Note that the primary and secondary configuration server nodes are in different racks to spread out the failure points. • Manager nodes: Use a subset of quorum nodes and
configuration server nodes for the manager role. In this case, n1_0_2, n2_1_2 and n3_1_1 are used as the manager nodes.
For large clusters, it is easier to create a file with a list of nodes and their roles. The file gpfs-fpo-nodefile describes the nodes. n1_0_1.ibm.com:: n1_0_2.ibm.com:quorum-manager: n1_1_1.ibm.com:: n1_1_2.ibm.com:quorum: n2_0_1.ibm.com:: n2_0_2.ibm.com:: n2_1_1.ibm.com:: n2_1_2.ibm.com: quorum-manager: n3_0_1.ibm.com:: n3_0_2.ibm.com:quorum: n3_1_1.ibm.com:quorum-manager: n3_1_2.ibm.com::
The following mmcrcluster command is used to create a cluster named gpfs-fpo-cluster.
mmcrcluster –N gpfs-fpo-nodefile –p n1_1_2.ibm.com –s n2_0_2.ibm.com –C gpfs-fpo-cluster –A –r /usr/bin/ ssh –R
/usr/bin/scp
#Ensure GPFS daemon is running on all nodes in the GPFS cluster.
mmgetstate -a
Option “–A” ensures that the GPFS daemon is started when a node is rebooted.
Creating failure groups and storage pools
The next step is to define the storage pools along with failure groups. All of this is accomplished with the command mmcrnsd, which takes an input file containing definitions for storage pools and NSD configuration.
%pool: pool=system layoutMap=cluster
blocksize=256K #using smaller blocksize for metadata %pool:
pool=fpodata layoutMap=cluster blocksize=1024K
allowWriteAffinity=yes #this option enables FPO feature
writeAffinityDepth=1 #place 1st copy on disks local to the node writing data blockGroupFactor=128 #yields chunk size of 128MB
#Disks in system pool are defined for metadata use only
%nsd: nsd=n1_0_2_ssd_1 device=/dev/sdm servers=n1_0_2.ibm.com usage=metadataOnly failureGroup=102 pool=system
%nsd: nsd=n1_0_2_ssd_2 device=/dev/sdn servers=n1_0_2.ibm.com usage=metadataOnly failureGroup=102 pool=system
%nsd: nsd=n2_1_2_ssd_1 device=/dev/sdm servers=n2_1_2.ibm.com usage=metadataOnly failureGroup=212 pool=system
%nsd: nsd=n2_1_2_ssd_2 device=/dev/sdn servers=n2_1_2.ibm.com usage=metadataOnly failureGroup=212 pool=system
%nsd: nsd=n3_0_2_ssd_1 device=/dev/sdm servers=n3_0_2.ibm.com usage=metadataOnly failureGroup=302 pool=system
%nsd: nsd=n3_0_2_ssd_2 device=/dev/sdn servers=n3_0_2.ibm.com usage=metadataOnly failureGroup=302 pool=system
#Disks in fpodata pool
%nsd: nsd=n1_0_1_disk1 device=/dev/sdb servers=n1_0_1.ibm.com usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_1_disk2 device=/dev/sdc servers=n1_0_1.ibm.com usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_1_disk3 device=/dev/sdd servers=n1_0_1.ibm.com usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_1_disk4 device=/dev/sde servers=n1_0_1.ibm.com usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_2_disk1 device=/dev/sdb servers=n1_0_2.ibm.com usage=dataOnly failureGroup=1,0,2 pool=fpodata
%nsd: nsd=n1_0_2_disk2 device=/dev/sdc servers=n1_0_2.ibm.com usage=dataOnly failureGroup=1,0,2 pool=fpodata
%nsd: nsd=n1_0_2_disk3 device=/dev/sdd servers=n1_0_2.ibm.com usage=dataOnly failureGroup=1,0,2 pool=fpodata
%nsd: nsd=n1_0_2_disk4 device=/dev/sde servers=n1_0_2.ibm.com usage=dataOnly failureGroup=1,0,2 pool=fpodata
%nsd: nsd=n1_1_1_disk1 device=/dev/sdb servers=n1_1_1.ibm.com usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_1_disk2 device=/dev/sdc servers=n1_1_1.ibm.com usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_1_disk3 device=/dev/sdd servers=n1_1_1.ibm.com usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_1_disk4 device=/dev/sde servers=n1_1_1.ibm.com usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_2_disk1 device=/dev/sdb servers= n1_1_2.ibm.com usage=dataOnly failureGroup=1,1,2 pool=fpodata
%nsd: nsd=n1_1_2_disk2 device=/dev/sdc servers= n1_1_2.ibm.com usage=dataOnly failureGroup=1,1,2 Below is the stanza file fpo-poolfile for the example system.
In the stanza file, NSD names include the host server name for easier illustration. Use the mmcrnsd command to create storage pools and NSDs. You can view the NSD created using the mmlsnsd command. Refer to the GPFS documentation for details on storage pool and NSD stanza format.
mmcrnsd -F fpo-poolfile mmlsnsd –m
Creating a file system
A file system can now be created using the newly created NSDs. The mmcrfs command is used to create the file system with inode size of 4,096 bytes. Refer to the GPFS
documentation for details on various options. The following commands create a file system named called bigfs with three copies for metadata and data, and with options enabled for optimized performance (-E, -S). They also mount the file system on all nodes in the cluster.
mmcrfs bigfs -F fpo-poolfile -A yes –i 4096 -m 3 -M 3 -n 32 -r 3 -R 3 –S relatime –E no
mmmount bigfs /mnt/bigfs –N all #/mnt/bigfs direc-tory must exist on all nodes
In this example, the FPO feature has already been enabled for the storage pool fpodata in the storage pool and NSD
configuration. The storage pool stanza includes desired FPO-specific attributes, including the block size for system and data pools. You can view the storage pool and NSD configuration in the newly created file system using mmlspool and mmlsdisk commands.
The option –S specifies atime update mode. A value of relatime suppresses periodic updating of atime for files and reduces the performance overhead caused due to frequent updates triggered by frequent file read operations.
If using InfoSphere BigInsights: Version 2.1.0.1 creates the file system with an option –S setting of “yes,” which results in suppressing atime updates altogether. You can use the command mmchfs to change this value.
mmlspool bigfs all –L mmlsdisk bigfs –L
Creating an initial placement policy
Before any data can be placed in this file system, it is necessary to create a data placement policy.
By default, GPFS places all the data in a system pool, unless specified. Best practices for FPO require separation of data and metadata disks in different storage pools. This practice allows the use of a smaller block size for metadata and a larger block size for data. In the sample configuration, a different class of storage (SSD) is being used for metadata, so it is best to create a separate pool. When you have more than one pool, you need to define a placement policy.
The following can be used to set attributes for .tmp files and direct all data blocks to the fpopool defined in the sample configuration. Note that granular policies (per file level) can be added to this default policy with different FPO attributes as needed. Use the following steps:
• Create a policy file containing placement rule in a file named
bigfs.pol.
$cat bigfs.pol
RULE ‘bgf’ SET POOL ‘fpopool’ REPLICATE(1) WHERE NAME LIKE ‘%.tmp’ AND setBGF(4) AND setWAD(1) RULE ‘default’ SET POOL ‘fpopool’
The first rule places all files with extension .tmp in fpopool pool using a chunk factor of four blocks, with write-affinity depth set to 1 and with data replication count set to 1 for these temporary files. The second rule is the default, placing all other files in the fpopool pool using the attributes specified at the storage pool level. This rule applies to all files that do not match any other prior rules and must be specified as the last rule. The first rule is optional and is added here only for demonstration purposes.
• Install the policy using the command mmchpolicy. The policies that are currently effective can be listed using the command mmlspolicy.
mmchpolicy bigfs bigfs.pol
• Use the GPFS command mmlsattr –L <path to a file in GPFS> to check these settings for individual files. Note that if an attribute is not listed in the output of mmlsattr, the settings provided at the pool or file system level are being used.
For additional details, refer to the GPFS Advanced
Administration Guide > Information Lifecycle Management for GPFS > Policies and rules
Tuning GPFS configuration for FPO
GPFS-FPO clusters have a different architecture than traditional GPFS deployments. Therefore, default values of many of the configuration parameters are not suitable for FPO deployments and should be changed. Use the following table as a guide to update your cluster configuration settings. If using InfoSphere BigInsights, many of these parameters are already set to the values optimal for FPO configuration. There are some parameters that are not currently set by the installer for InfoSphere BigInsights or need to be further tuned for your cluster. These parameters are marked with ‘*’ in the table.
Use the mmchconfig and mmlsconfig commands to change and list the current value of any given parameter—for example, to change readReplicaPolicy to the local setting. You can specify multiple configuration parameters in a single command as shown below.
#shows current setting of readReplicaPolicy param-eter
mmlsconfig |grep -i readReplicaPolicy
#changes the value to policy. Use –i ensure change immediate and persistent across node reboots. mmchconfig readReplicaPolicy=local –i
mmchconfig maxStatCache=100000,maxFilesToCache=100000 Replace the parameter name and values as necessary. Simply run these commands on one node in the cluster, and GPFS propagates the changes to all other nodes.
Note: Some parameters such as maxStatCache and
maxFilesToCache do not take effect until GPFS is restarted. GPFS can be restarted using the following command. /usr/lpp/mmfs/bin/mmumount all –a
/usr/lpp/mmfs/bin/mmshutdown -a /usr/lpp/mmfs/bin/mmstartup –a
#Ensure GPFS daemon has started on all nodes in the GPFS cluster.
Parameter Name (*not set by InfoSphere BigInsights 2.1.0.1 or needs review)
Default value New value Comment
readReplicaPolicy random local Enables the policy to read replicas from
local disks. enableRepWriteStream*
(undocumented)
1-if FPO enabled 0–for non-FPO cluster
0 Uses non-FPO replication mode for
FPO configuration as well.
restripeOnDiskFailure No Yes Specifies whether GPFS attempts to
automatically recover from certain common disk failures.
metadataDiskWaitTimeForRecovery* 300 seconds <see comments to customize > Sets delay period before start of recovery for failed metadata disks. Used when RestripeOnDiskFailure is set to yes.
Ensure the delay period is large enough to cover reboot time of the nodes hosting metadata disk servers. dataDiskWaitTimeForRecovery* 600 seconds <see comments to customize > Sets delay period before start of
recovery for failed data disks. Used when RestripeOnDiskFailure is set to yes.
Data disks are controlled through a separate tunable due to their distribution across the variety of node types.
Ensure the delay period is large enough to cover the reboot of the slowest node in the cluster.
syncBuffsPerIteration (undocumented) 100 1 Used to expedite buffer flush and the rename operations done by MapReduce jobs.
minMissedPingTimeout* (undocumented)
3 (seconds) 10-60 (seconds) Sets the lower bound on a missed ping timeout. For FPO clusters, a longer grace time is desirable before marking a node as dead, as it impacts all associated disks. Additionally, when running MapReduce workloads, the CPU can become overly busy and cause delayed ping responses. However, a longer timeout implies delay in recovery. A value between 10–60 seconds is recommended. This value generally provides a good balance between the time to detect the real failures and the rate of false failure detection triggered by a delayed ping response due to CPU or network overload.
leaseRecoveryWait (undocumented) 35 65 Allows a larger grace window before
starting the recovery.
Pagepool* Varied 25% of system memory Sets the amount of physical memory
reserved for cache on a node (use –N to list nodes that apply).