Next Generation Storage for the HLTCOE


The Johns Hopkins University


Scott Roberts

Technical Report 9


© JHU HLTCOE, 2013

Acknowledgements: This work is supported, in part, by the Human Language Technology Center of Excellence. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the sponsor.

HLTCOE

810 Wyman Park Drive

Baltimore, Maryland 21211

http://hltcoe.jhu.edu



Human Language Technology Center of Excellence

Johns Hopkins University

[email protected]

Abstract

The explosion of unstructured data in high performance computing presents a challenge for existing storage architecture and design. We present a combination of hardware and software which addresses the storage needs of our center’s compute cluster. We also demonstrate that at a constant total cost of ownership our proposed solution provides an order of magnitude better performance than the Johns Hopkins University’s GrayWulf cluster and is two orders of magnitude faster than the center’s existing storage array.

1 Introduction

Over the past few years, the center’s needs have changed from simply high capacity to high capacity with high performance. The storage array supports very diverse needs, ranging from researchers editing code in their home directories to delivering source files and storing output for High Performance Compute Cluster jobs. It also supports everything from millions of small files (<1K in size) to thousands of large files (>1G in size) and everything in between.

1.1 Current system

The center currently utilizes two OpenSolaris-based NFS storage servers using ZFS for the underlying file storage. Each storage server costs approximately US $20,000, provides 97TB of initial usable capacity, 150TB maximum usable capacity thanks to ZFS compression, 1,000 Input/Output Operations Per Second (IOPS), and 800 MB/s sustained throughput for large files. These servers provide storage for user home directories as well as for the center’s High Performance Compute Cluster. While these servers have performed well over the last two years, the explosion of unstructured data at the center has uncovered a number of performance issues with the current setup. Specifically, the existing storage arrays cannot quickly and efficiently serve the mixed workload consisting of an ever-increasing number of small file accesses for user home directories as well as serve the thousands of small and large file accesses per second required for the High Performance Computing Cluster jobs.

When these storage servers were constructed the mandate was to obtain as much usable disk space as possible. What we have noticed over the last year is that the usage pattern has changed and the system does not deliver the performance required, specifically because one is constrained in the number of disks that any given server can handle by the I/O bus. Overloading this bus creates noticeable lags in what should be near-instantaneous operations (e.g. directory listings or opening a file for editing). Thus, when a large number of grid jobs are running the filesystem is slow to respond, making it essentially unusable for user interaction.

Another issue is that every time we add a new server it requires re-configuring the entire HPC cluster with another NFS mount point. Twelve 7500 RPM SATA drives on a single server will saturate a 10Gb network link. Adding more drives to a single server can increase the number of IOPS (until the I/O bus is saturated) but the throughput is still limited by the I/O bus and network connection. Using traditional NFS this will require a separate mount point for each server which quickly becomes a large system administration burden and can lead to artificial bottlenecks. For instance, one storage server may sit largely unused with a lot of wasted disk and available bandwidth while another server fills to capacity while its network connection is completely saturated. The I/O capability of storage has a direct effect on the efficiency of a high performance compute cluster. The faster that data can be served and output can be stored, the more quickly the grid jobs will complete.
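The claim that twelve drives saturate a 10Gb link can be checked with simple arithmetic. The following is a minimal sketch, assuming the 150 MB/s per-drive streaming rate quoted later in this report and ignoring network protocol overhead; the function names are illustrative only.

```python
import math


def link_capacity_mb_s(link_gbps: float) -> float:
    """Convert a link speed in Gb/s to MB/s (8 bits per byte, 1 GB = 1000 MB)."""
    return link_gbps * 1000 / 8


def drives_to_saturate(link_gbps: float, drive_mb_s: float) -> int:
    """Smallest number of drives whose combined streaming rate meets the link rate."""
    return math.ceil(link_capacity_mb_s(link_gbps) / drive_mb_s)


if __name__ == "__main__":
    drive_mb_s = 150.0  # assumed per-drive sequential rate
    print("10Gb link capacity (MB/s):", link_capacity_mb_s(10))
    print("12 drives aggregate (MB/s):", 12 * drive_mb_s)
    print("drives needed to saturate the link:", drives_to_saturate(10, drive_mb_s))
```

Under these assumptions the aggregate streaming rate of twelve drives exceeds what a single 10Gb link can carry, so additional drives in the same server add capacity and IOPS but not network throughput.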

1.2 Design rationale

As discussed, the current storage system does not provide the performance our researchers require. After several meetings with the research staff, we noted the following issues and pain points with our existing storage array:

• Lack of I/O performance

• 90% year-over-year storage growth projection

• Unable to quickly expand the center’s Hadoop Distributed Filesystem storage space

• Lack of integration with cloud storage technologies such as Amazon’s S3 API

• Adding more NFS storage arrays dramatically increases system administration overhead

Clearly, there had to be a better solution. There are two widely accepted choices in the HPC storage field to address these issues as they relate to large capacity and high performance: Create a Storage Area Network (SAN) or utilize a parallel filesystem.

SANs were effectively replaced by Network Attached Storage (NAS). In fact, the previous generation storage array from BlueArc (now Hitachi Data Systems) was actually a SAN with a NAS front-end.

A SAN traditionally consists of a large number of low capacity disks in order to decrease the amount of time it takes to serve data. More physical disks means you can provide a higher number of IOPS and smaller capacity disks usually provide shorter seek times. A SAN is extremely complex, costs more, and provides less functionality than a parallel filesystem. The BlueArc storage array cost over 17 times more than the current system that replaced it.

A parallel filesystem aggregates the disk and network connections of multiple storage servers while providing a global namespace. That is to say, a parallel filesystem can deliver the full performance of each available server. When a new server is added the additional space is instantly available for new data under the same mount point.

After discussing this issue with a number of peers and vendors, researching the literature, and performing our own experiments it was determined that the next generation of storage for the center should be based on a parallel filesystem. This decision in turn led to a number of questions, namely what type of hardware and which parallel filesystem to use.

2 Requirements and Considerations

The center researchers work with both speech and text data. They have diverse compute and storage requirements and work on the same compute cluster. In order to effect economy of scale, both home directories and high performance compute jobs are served by the same network storage system. Thus, the system must be efficient at serving both random small file accesses as well as large file streaming. While this may initially seem unreasonable and that they should be served by different storage systems optimized for different use cases, the reality is that the cutting-edge research performed at the center results in user access patterns that change on a daily basis. Taking this into consideration, it makes more sense to tune the system for a wide range of use cases and minimize the system administrative overhead and wasted disk space.

Taking into consideration the center’s compute cluster usage over the previous year, the desired outcome is a solution that will deliver at least 22GB/s and 32,000 IOPS read performance, 16GB/s and 16,000 IOPS write performance, 300TB of initial usable (not raw) capacity, at least 10 PB maximum design capacity, ability to quickly expand the capacity, and cost no more than US $100,000. The solution should be able to provide directory listings in ≤1 second for directories containing one million files or less while HPC cluster jobs are simultaneously accessing the filesystem. The solution should also be able to provide storage for the Hadoop cluster (HDFS API) as well as an Amazon S3 API. Other considerations are space, power, cooling, and maintenance as measured by rack units, watts, BTUs/h, and system administration time.

2.1 Why ZFS?

We currently have two NFS storage servers in our environment both running OpenSolaris and utilizing native ZFS as the backing file store. ZFS is provided as part of the OpenSolaris operating system. ZFS offers us a number of benefits, namely:

• Fast, lightweight file compression (LZJB) delivering a nominal 1.65:1 ratio for our environment

• Explicit data integrity for every block stored and served

• Ability to take snapshots of data to easily recover deleted files

• Ability to de-duplicate data for a storage pool containing VDI images

• Decently performing built-in NFS server

• Use of various Web GUIs such as Napp-It[1] for easy administration

2.2 ZFS on Linux

Most parallel filesystems do not easily support OpenSolaris but we did not wish to lose the advantages inherent to ZFS as the underlying storage filesystem. After testing various configurations we determined that we would use ZFS on Linux[2]. This allows us to build a native Linux kernel module to support ZFS and also allows us to use our preexisting Linux management tools.

The ZFS on Linux port was produced at the Lawrence Livermore National Laboratory (LLNL) under Contract No. DE-AC52-07NA27344 (Contract 44) between the U.S. Department of Energy (DOE) and Lawrence Livermore National Security, LLC (LLNS) for the operation of LLNL. It has been approved for release under LLNL-CODE-403049.[3]

As per the ZFS on Linux FAQ[4], the licensing concern¹ is not an issue for the center. We can use ZFS on Linux if we compile and install the kernel module ourselves.

2.3 Data Availability

Petascale storage does not lend itself to being backed up easily. Offsite backups for anything other than critical data are infeasible due to the relatively slow 10Gbps network link and the time required to copy data to tape or external storage for transport. Therefore, the storage system should be built as robustly as possible to allow for hardware failure. This will be accommodated by utilizing RAID10 on the underlying ZFS block storage and by allowing for additional storage redundancy for home directories. The HPC cluster will utilize RAID10 storage exclusively while user home directories will use RAID10 as well as a highly-available copy on a second storage server. This will ensure that user home directories will remain available should a catastrophic server issue occur, at the expense of all home directory data being replicated 4 times.

¹ ZFS is licensed under the Common Development and Distribution License (CDDL), and the Linux kernel is licensed under the GNU General Public License Version 2 (GPLv2). While both are free open source licenses they are restrictive licenses. The combination of them causes problems because it prevents using pieces of code exclusively available under one license with pieces of code exclusively available under the other in the same binary. In the case of the [Linux] kernel, this prevents us from distributing ZFS as part of the kernel binary. However, there is nothing in either license that prevents distributing it in the form of a binary module or in the form of source code.

Currently, 60% of storage is allocated to home directories. Many of our users store experiment results in their home directories, and if we move to a highly available solution we will need to curtail home directory usage to critical data only. We anticipate that home directory access will be slightly slower than the storage used for the HPC cluster, which will encourage users to move their data to the faster storage array.

RAID10 was chosen for a number of reasons, specifically:

• RAID10 rebuild time is less than 8 hours for a 2TB disk replacement, as compared to 72 hours for a RAID6 configuration

• Higher I/O bandwidth:

– A 12-drive RAID10 consisting of 6 sets of 2-drive RAID1 mirrors will read data at N*1.5 sets and write data at N sets; e.g. if a single drive delivers 150MB/s and 150 IOPS, the system should deliver sequential read I/O of 1,350MB/s, sequential write I/O of 900MB/s, and 1,350 IOPS

– A 12-drive RAID50 consisting of 2 sets of 6-drive RAID5 (on ZFS, RAIDZ) will read data at approximately N*1.25 sets and write data at N sets where the speed of a set is limited to the speed of a single drive in the set; e.g. if a single drive delivers 150MB/s and 150 IOPS, the system will deliver sequential read I/O of 375MB/s, sequential write I/O of 300MB/s, and 375 IOPS

• Higher resiliency because one could lose up to 6 drives in a 12-drive RAID10 set versus only 2 in a RAID50 configuration
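The throughput estimates in the list above follow from a simple rule of thumb: reads scale with the number of sets times a layout-specific factor, and writes scale with the number of sets. The sketch below reproduces that arithmetic, assuming the per-drive figures of 150 MB/s and 150 IOPS used in the examples; the function name and dictionary keys are hypothetical, and this is an estimate rather than a measurement.

```python
def raid_estimate(sets: int, read_factor: float,
                  drive_mb_s: float = 150.0, drive_iops: float = 150.0) -> dict:
    """Estimate array performance from the per-set rule of thumb.

    sets        -- number of mirror or parity sets (N in the text)
    read_factor -- read multiplier per set (1.5 for RAID10, 1.25 for RAID50)
    """
    return {
        "seq_read_mb_s": sets * read_factor * drive_mb_s,
        "seq_write_mb_s": sets * drive_mb_s,
        "iops": sets * read_factor * drive_iops,
    }


if __name__ == "__main__":
    # 12 drives as 6 two-drive mirrors versus 2 six-drive RAID5 sets
    print("RAID10:", raid_estimate(sets=6, read_factor=1.5))
    print("RAID50:", raid_estimate(sets=2, read_factor=1.25))
```

With the same twelve drives, the estimate favors RAID10 because it exposes three times as many independent sets as the RAID50 layout.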

The downside to RAID10 is that half of the raw capacity is lost to replication. That is to say, if you have 100TB of raw storage, only 50TB will be usable. It is important to note that if we did not leverage RAID10 and the snapshot capability built into ZFS, explicit backups would still be required, which require additional storage space and system administration time. Therefore we believe the benefits gained by switching to RAID10 offset the loss of usable capacity.

3 Hardware

Most storage vendors provide high density storage consisting of 36-60 drives in 4U of rack space. While this is in fact desirable for a pure archival or backup solution, it has a serious performance penalty when used as a live filesystem. This is one of the issues we encountered with our existing NFS storage arrays which contain (81) 2TB drives. An individual storage server is limited by the motherboard and SAS/SATA backplanes. In other words, the drives together are capable of greater performance than the I/O bus can provide.

After careful evaluation and demonstration of available hardware, we selected the Quanta 12-drive server based on its performance characteristics and total cost of ownership. The examples given in the previous section are based on 12 hard drives for the purpose of comparing different RAID levels with the Quanta server. This section provides the background for that analysis by describing the current vendor hardware offerings we evaluated for this project.

3.1 Dell

Dell and Omnibond published the OrangeFS Reference Architecture[5] which utilizes the Dell PowerEdge R710 server and PowerVault MD3200/MD1200 storage stack containing four MD1200 expansion chassis for a total of 60 data drives. The MD3200 server contains the RAID controller hardware and 12 hard drives. Each MD1200 storage chassis contains 12 hard drives. However, Dell released the next-generation PowerVault storage servers - the MD3620i - which provide for 60 drives in a single enclosure.

According to the Reference Architecture document, the Dell stack provides for 1.6GB/s read and 1.5GB/s write throughput with 1MB I/O chunks. The Seagate Constellation ES.2 drives are each capable of approximately 150MB/s[6], meaning that 60 drives are theoretically capable of delivering 9GB/s. The chassis are interconnected with two 6Gbps SAS lanes which provide 1.5GB/s throughput. This leads to the conclusion that the MD storage stack I/O bus is completely saturated even considering that the chassis were optimized as per Dell’s “MD3200 High Performance Tier Implementation”[7] document.
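The saturation argument above can be restated as a ratio between what the drives could deliver and what the interconnect can carry. The following is a minimal sketch using only the figures quoted in this section; it follows the report's arithmetic (8 bits per byte) and ignores encoding and protocol overhead, and the function names are illustrative.

```python
def aggregate_drive_gb_s(n_drives: int, drive_mb_s: float) -> float:
    """Theoretical combined streaming rate of all drives, in GB/s."""
    return n_drives * drive_mb_s / 1000


def sas_link_gb_s(lanes: int, lane_gbps: float) -> float:
    """Nominal interconnect bandwidth in GB/s."""
    return lanes * lane_gbps / 8


def oversubscription(n_drives: int, drive_mb_s: float,
                     lanes: int, lane_gbps: float) -> float:
    """How many times the drives' aggregate rate exceeds the interconnect."""
    return aggregate_drive_gb_s(n_drives, drive_mb_s) / sas_link_gb_s(lanes, lane_gbps)


if __name__ == "__main__":
    print("drives (GB/s):", aggregate_drive_gb_s(60, 150.0))
    print("SAS link (GB/s):", sas_link_gb_s(2, 6.0))
    print("oversubscription:", oversubscription(60, 150.0, 2, 6.0))
```

A ratio well above 1 indicates that the bus, not the drives, bounds throughput, which is consistent with the measured 1.6GB/s read and 1.5GB/s write figures sitting at the nominal link rate.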

3.2 Quanta

Quanta offers a 4U 60-drive storage Just a Bunch of Disks (JBOD) chassis (MESOS M4600H) and a 1U 12-drive complete storage server (STRATOS S100-L11SL). The published performance numbers[8] on the 60-drive unit are in line with the Dell PowerVault stack, meaning that the I/O bus becomes saturated before the drives can deliver their maximum throughput. The 12-drive unit, however, shows particular promise in that twelve Seagate Constellation ES.2 drives deliver 1.8GB/s and 1,800 IOPS which compares favorably to the real world performance numbers in Section 5. We benchmarked the 12-drive Quanta server, hereinafter referred to as “Quanta”.

3.3 Supermicro

The center’s existing OpenSolaris-based storage array contains a total of 81 drives via two sets of 36-bay SC847s with attached 45-drive SC847 JBODs set up as 2 system drives, 2 global hot spares, and 11 sets of 7-drive RAIDZ2 (RAID6 equivalent) which deliver approximately 800MB/s sequential read I/O and 1,000 IOPS. In this configuration, the I/O bus and network links are fully saturated well below the maximum drive capability.

Supermicro now offers a 60-drive 4U JBOD unit but given the bus architecture and drive layout the I/O bus would be saturated well below the maximum drive throughput.

4 Parallel Filesystems

There are many parallel filesystems on the market today. However, being a research center we have a mandate to limit recurring costs which means that we restricted our selections to Free and Open Source Software. There are four well-known options used in HPC clusters today: Ceph, GlusterFS, Lustre, and OrangeFS (PVFS2). After evaluating the available parallel filesystems we selected OrangeFS.

4.1 Ceph

Ceph[9] is “a unified, distributed storage system designed for excellent performance, reliability and scalability.” The very first version was made available in July 2012 and is not recommended for production use. Ceph has a requirement to use Btrfs[10] as the underlying block storage. Btrfs is still under heavy development as well and is also not recommended for production use. While initial performance statistics available from the Ceph Website appear promising, we eliminated Ceph from consideration based on their own recommendations to not use it for primary storage.

4.2 GlusterFS

GlusterFS[11] is “an open source, distributed file system capable of scaling to several petabytes (actually, 72 brontobytes!) and handling thousands of clients.”[12] GlusterFS is unique in that it does not require a separate metadata server. Instead, the client software determines where files are stored based on a hash algorithm and GlusterFS metadata is stored directly with each file in the form of Linux extended file attributes. This means that performance remains consistent without regard to the number of files stored. GlusterFS also provides data replication, data striping, and directory quotas. However, in our testing it provided an order of magnitude slower performance than any other parallel filesystem. We also discovered a bug: it does not handle symbolic links which are used extensively by the center. We removed GlusterFS from consideration because of the poor performance and lack of symbolic link support.

4.3 Lustre

“Lustre is a parallel distributed file system, generally used for large scale cluster computing. The name Lustre is a portmanteau word derived from Linux and cluster. Lustre file systems are scalable and can support tens of thousands of client systems, tens of petabytes (PB) of storage, and more than a terabyte per second (TB/s) of aggregate I/O throughput.”[13]

The University of California at San Diego has a 1 petabyte Lustre-based storage array named “Data Oasis.”[14] However, based on a review of their architecture and the related mailing list, we discovered that their configuration requires two racks of equipment and two full time storage administrators. They ran into a number of configuration and stability issues during the first year of operation. This being said, it does provide excellent performance but at a poor price point: 1 petabyte cost UCSD almost $1 million in hardware. Given the complexity of the setup and pricing we eliminated Lustre from consideration.

4.4 OrangeFS / PVFS2

“Orange File System is a branch of the Parallel Virtual File System. Like PVFS, Orange is a parallel file system designed for use on high end computing (HEC) systems that provides very high performance access to disk storage for parallel applications. OrangeFS is different from PVFS in that we have developed features for OrangeFS that are not presently available in the PVFS main distribution. While PVFS development tends to focus on specific very large systems, Orange considers a number of areas that have not been well supported by PVFS in the past.”[15]

During our evaluation we found OrangeFS extremely easy to set up, configure, and administer. Additional performance is gained by storing the filesystem metadata on SSDs. In our limited testing we found that the performance scaled nearly linearly with each additional server, which means that when we add a server we obtain both additional disk space and I/O performance.

5 Performance Tests

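Figure 1: GrayWulf’s Amdahl Number

$$\frac{1.5\ \text{GB/s} \times 8\ \text{bits}}{21.3\ \text{GHz}} = \frac{12\ \text{Gb/s}}{21.3\ \text{GHz}} = 0.56$$

Figure 2: GrayWulf’s Amdahl Memory Ratio

$$\frac{24\ \text{GB}}{21.3\ \text{GHz}} = 1.13$$

Figure 3: GrayWulf’s Amdahl IOPS Ratio

$$\frac{6{,}000\ \text{IOPS}}{\dfrac{21.3\ \text{billion (GHz)}}{50{,}000\ \text{instructions}}} = \frac{6{,}000}{\dfrac{21{,}300{,}000{,}000}{50{,}000}} = \frac{6{,}000}{426{,}000} = 0.014$$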

Gene Amdahl describes three laws of a balanced system resulting in an Amdahl number, an Amdahl Memory Ratio, and an Amdahl IOPS Ratio.[16] The Amdahl Number is calculated as the Sequential IO performance in Gb/s divided by the CPU in GHz. GrayWulf’s Amdahl Number is 0.56 as shown in Figure 1. The Amdahl Memory Ratio is calculated as the RAM in GB divided by the CPU in GHz as shown in Figure 2. The Amdahl IOPS Ratio is calculated as 1 IOP per 50,000 instructions as shown in Figure 3.

Table 1: Performance, power, and cost characteristics of various data-intensive architectures

Note: “Amdahl Seq” is the Amdahl Number, “Amdahl Mem” is the Amdahl Memory Ratio, and “Amdahl Rand” is the Amdahl IOPS Ratio

| System | CPU [GHz] | Mem [GB] | SeqIO [GB/s] | RandIO [kIOPS] | Disk [TB] | Power [W] | Cost [$] | Relative Power | Amdahl Seq | Amdahl Mem | Amdahl Rand |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GrayWulf | 21.3 | 24 | 1.5 | 6.0 | 22.5 | 1150 | 19253 | 1.000 | 0.56 | 1.13 | 0.014 |
| Intel | 3.2 | 2 | 0.5 | 10.4 | 0.5 | 28 | 1177 | 0.024 | 1.25 | 0.63 | 0.163 |
| Zotac | 3.2 | 4 | 0.5 | 10.4 | 0.5 | 30 | 1189 | 0.026 | 1.25 | 1.25 | 0.163 |
| Supermicro | 20.2 | 24 | 8.1 | 12.2 | 162.0 | 5600 | 20004 | 4.870 | 3.21 | 1.19 | 0.030 |
| Quanta HDD | 16.0 | 32 | 3.0 | 1.9 | 24.0 | 200 | 5001 | 0.174 | 1.50 | 2.00 | 0.006 |
| Quanta 12-SSD | 16.0 | 32 | 6.0 | 540.0 | 6.0 | 150 | 8001 | 0.130 | 3.00 | 2.00 | 1.688 |
| Quanta 7-SSD | 16.0 | 32 | 3.5 | 315.0 | 3.5 | 150 | 5801 | 0.130 | 1.75 | 2.00 | 0.984 |
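The three ratios defined above are straightforward to compute for any candidate system. The following is a minimal sketch of those definitions, assuming 8 bits per byte and 50,000 instructions per I/O operation as in Figures 1-3; the function names are illustrative, and the inputs are the GrayWulf and Quanta 7-SSD figures from Table 1.

```python
def amdahl_number(seq_io_gb_s: float, cpu_ghz: float) -> float:
    """Sequential I/O in Gb/s divided by CPU capacity in GHz."""
    return seq_io_gb_s * 8 / cpu_ghz


def amdahl_memory_ratio(mem_gb: float, cpu_ghz: float) -> float:
    """RAM in GB divided by CPU capacity in GHz."""
    return mem_gb / cpu_ghz


def amdahl_iops_ratio(iops: float, cpu_ghz: float,
                      instructions_per_io: int = 50_000) -> float:
    """IOPS divided by the number of 50,000-instruction blocks executed per second."""
    return iops / (cpu_ghz * 1e9 / instructions_per_io)


if __name__ == "__main__":
    # GrayWulf: 21.3 GHz, 24 GB RAM, 1.5 GB/s sequential, 6,000 IOPS
    print(round(amdahl_number(1.5, 21.3), 2))
    print(round(amdahl_memory_ratio(24, 21.3), 2))
    print(round(amdahl_iops_ratio(6_000, 21.3), 3))
    # Quanta 7-SSD: 16 GHz, 315,000 IOPS
    print(round(amdahl_iops_ratio(315_000, 16.0), 3))
```

Rounded to the precision used in Table 1, these functions reproduce the GrayWulf row and the Quanta 7-SSD IOPS ratio.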

The goal is Amdahl unity, which is described as a system that has an Amdahl Number of 1.0, an Amdahl Memory Ratio of 1.0, and an Amdahl IOPS Ratio of 1.0. At the value of 1 for all three metrics, the system is perfectly balanced: data is provided to the CPU as quickly as it can process it and the entire system is capable of being 100% utilized. If any of the three values is less than 1.0, the CPU is waiting for data, which in Linux would show as I/O wait. If any of the three values is greater than 1.0, the I/O subsystem is partially idle waiting for the next CPU instruction, which means that it is over provisioned and thus inefficient. We note, however, that an Amdahl Memory Ratio greater than 1.0 is not necessarily inefficient because operating systems like Linux use excess RAM for disk and application cache. The Amdahl Memory Ratio is more important for applications which do not require a lot of RAM and are strictly I/O bound.

In the spirit of the Amdahl Blade comparison paper[17], in Table 1 we compare our four solutions against the GrayWulf cluster and the best two “Amdahl Blades” (Intel and Zotac-based boards) as benchmarks: one of the existing NFS OpenSolaris-based Supermicro servers, a Quanta S100-L11SL server containing 12 Seagate Constellation ES.2 2TB hard drives, a Quanta S100-L11SL server containing 12 Crucial M4 500GB SSDs, and a Quanta S100-L11SL server containing 7 Crucial M4 500GB SSDs.

These numbers were used in 2009 when designing the GrayWulf[18] system located at the Johns Hopkins University Physics & Astronomy Department. Even though it was built four years ago, due to its excellent performance (in fact, exceeding many current vendor offerings) we consider the GrayWulf system as the benchmark against which we measure our other systems. Therefore, we show how our existing Supermicro and planned Quanta systems compare to the GrayWulf in Table 1. All three Quanta solutions are based around the Quanta S100-L11SL server which we show as containing either 12 hard drives, 12 Solid State Drives, or 7 Solid State Drives.

It is interesting to note that the Quanta HDD solution has an Amdahl number of 1.5 and an Amdahl Memory Ratio of 2.0 but has a relatively poor Amdahl IOPS Ratio of 0.006. More intriguing is the Quanta server with 12 SSDs, which exceeds 1.0 on all three numbers, meaning that in theory the CPU should never be starved for data, at the expense of inefficiency. The closest we can get to Amdahl unity is a Quanta server containing 7 SSDs which has an Amdahl number of 1.75, an Amdahl Memory Ratio of 2.0, and an Amdahl IOPS ratio of 0.984, but we cannot afford it. A completely SSD-based solution does not provide the disk capacity we require for less than US $100,000.

Table 2 can be used in many different ways depending on which metric is under consideration. The “single node capability” section is a baseline which shows the performance of an individual node wherein we see that the Quanta HDD solution has 25% fewer CPU resources than GrayWulf (16 GHz vs 21.3 GHz), has 100% better sequential I/O performance (3 GB/s vs 1.5 GB/s), is capable of 68% fewer random operations (1,900 IOPS vs 6,000 IOPS), contains 7% more raw disk (24 TB vs 22.5 TB), consumes 83% less power (200 watts vs 1,150 watts) and costs 74% less ($5,000 vs $19,250). The only metric where the Quanta HDD solution falls short is in IOPS performance, but this is overcome as we discuss in our total cost analysis.

Overall cost was one of the most important factors in designing our new storage array. Scaling the systems to constant total cost is most revealing because the Quanta HDD solution bests the GrayWulf in every category except for the number of nodes, which we will discuss in a moment. The Quanta HDD solution has 190% more processing power, 669% more sequential I/O performance, 22% better random I/O performance, 311% more raw disk storage, and consumes 33% less power. The GrayWulf is based on the Dell PowerEdge 2950 server which is a 2U rack mount server, thus 5.19 GrayWulf nodes would occupy 10.38 rack units. The Quanta server chassis is a 1U rack mount server, therefore 20 nodes occupy 20 rack units. While this represents a 93% increase in rack space, we postulate that the increase in processing power, I/O performance, disk space, and lower power consumption offsets the increase in physical space. It would have been ideal to have found a solution which provided better performance, more capacity, and used less rack space than the GrayWulf system. However, we do have plenty of available rack space at this time to accommodate our proposed solution.

5.1 Testing methodology

We tested individual drives formatted with EXT4 partitions to determine the maximum throughput as shown in Table 3. This established the baseline performance before we implemented ZFS for the block storage and OrangeFS for the parallel filesystem.

The Quanta server’s drives are physically connected as follows: Eight are attached to an LSI2008 SAS controller and four are directly attached to four SATA connectors. The LSI2008 controller has two lanes and four hard drives are attached to each lane as shown in Figure 4.

Figure 4: Quanta S100-L11SL Bus Diagram

chassis front

Bus A (SATA): drives 3 2 1 0

Bus B (SAS): drives 7 6 5 4

Bus C (SAS): drives 11 10 9 8

motherboard

chassis rear

We then incrementally tested different ZFS and Linux disk I/O scheduler configurations to determine differences from baseline performance as shown in Table 4.

5.2 Test Results

Bonnie++[19] version 1.0.3e was used for all performance tests in Tables 3-7 unless otherwise noted. We first tested the individual drive performance of the Quanta server as shown in Table 3. We found that the drives connected to the SAS controller provided 4.5% faster write throughput, 4.6% faster read throughput, and a statistically insignificant difference in the number of IOPS as compared to the drives connected to the SATA interfaces.

Table 2: Comparison of the systems scaled to various dimensions

Single node capability

| System | CPU [GHz] | SeqIO [GB/s] | RandIO [kIOPS] | Disk [TB] | Power [W] | Cost [$] | Relative Power | Nodes |
|---|---|---|---|---|---|---|---|---|
| GrayWulf | 21.3 | 1.5 | 6.0 | 22.5 | 1150 | 19250 | 1.000 | 1.00 |
| Supermicro | 20.2 | 8.1 | 12.2 | 162.0 | 5600 | 20000 | 4.870 | 1.00 |
| Quanta HDD | 16.0 | 3.0 | 1.9 | 24.0 | 200 | 5000 | 0.174 | 1.00 |
| Quanta 12-SSD | 16.0 | 6.0 | 540.0 | 6.0 | 150 | 8000 | 0.130 | 1.00 |
| Quanta 7-SSD | 16.0 | 3.5 | 315.0 | 3.5 | 150 | 5800 | 0.130 | 1.00 |

Scaled to constant total cost

Scaled to constant sequential read

Scaled to constant random read

Scaled to constant power

Scaled to constant disk space

The simultaneous four drive tests were designed to detect bottlenecks in the SAS and SATA busses. We found that the SAS channel provided 2.6% faster write throughput but insignificant differences in read throughput and number of IOPS. However, as compared to the single drive tests, the four drive tests showed slightly improved write throughput but significantly worse read throughput and IOPS. In particular, there was a 42.1% drop in IOPS performance (from 326.8 to 189.1 IOPS on the SAS channel, per Table 3) which we were unable to explain.

When we tested all twelve hard drives simultaneously, the mean statistics per drive were 190 MB/s write, 250 MB/s read, and 160 IOPS. The massive 51% drop in IOPS performance between the single SAS drive test and the twelve drive simultaneous test was due to a CPU context switching constraint: there were 12 threads running performance tests and only 4 CPU threads were available because the Quanta server contains a single four-core CPU. We note that Seagate[6] specifies that the drives are only capable of 150 IOPS, and though this test delivered the worst results the drives delivered better performance than advertised.

We then installed ZFS on Linux version 0.6.0rc14 and performed a single drive test as shown in Table 4. This test yielded better performance numbers due to ZFS compression: 1.1 GB/s write, 989.5 MB/s read, and 290.2 IOPS. ZFS compression stores fewer blocks on disk, which dramatically increased the single-drive performance while using approximately 2% system CPU. We deemed this an acceptable tradeoff for a pure storage server. We then tested a variety of ZFS configurations, taking into account the settings used in the Dell OrangeFS Reference Architecture whitepaper.

The best performance was reached when we enabled compression, disabled explicit per-transaction write flushing (sync=disabled), and accounted for the fact that the Constellation ES.2 drives used for testing were 4k sector “Advanced Format” drives (ashift=12). Disabling sync means that transactions are batched, sorted for sequential writing, and written every 5 seconds. This is acceptable in our environment because the storage servers are connected to an uninterruptible power supply. We gained further performance by telling the Linux kernel to use the Completely Fair Queuing (CFQ) I/O scheduler (scheduler=cfq), setting the maximum number of kilobytes the block layer would allow for a request to 1MB (max_sectors_kb=1024), and increasing the drive transaction queue to 512 requests (nr_requests=512). This yielded 3.9 GB/s write, 2.3 GB/s read, and 776.6 IOPS. This is the last configuration shown in Table 4 which we selected as the optimum configuration for this server.
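The settings described above map onto a small number of ZFS properties and Linux block-queue parameters. The following is a dry-run sketch that only builds the commands and sysfs writes rather than executing them. The pool name `tank`, the device names, and the helper functions are hypothetical; the report does not give the exact commands used, so the mirror layout shown is an assumption based on the RAID10 design in Section 2.3.

```python
from pathlib import PurePosixPath


def zfs_commands(pool: str, mirrors: list[tuple[str, str]]) -> list[list[str]]:
    """Build zpool/zfs invocations for a striped-mirror (RAID10-style) pool."""
    create = ["zpool", "create", "-o", "ashift=12", pool]  # 4k-sector alignment
    for first, second in mirrors:
        create += ["mirror", first, second]
    return [
        create,
        ["zfs", "set", "compression=on", pool],
        ["zfs", "set", "sync=disabled", pool],  # batch writes instead of flushing each one
    ]


def block_queue_settings(device: str) -> dict[str, str]:
    """Map sysfs queue files to the values quoted in the text for one block device."""
    queue = PurePosixPath("/sys/block") / device / "queue"
    return {
        str(queue / "scheduler"): "cfq",
        str(queue / "max_sectors_kb"): "1024",
        str(queue / "nr_requests"): "512",
    }


if __name__ == "__main__":
    # Hypothetical device names; a 12-drive server would list six mirror pairs.
    for command in zfs_commands("tank", [("sda", "sdb"), ("sdc", "sdd")]):
        print(" ".join(command))
    for path, value in block_queue_settings("sda").items():
        print(f"echo {value} > {path}")
```

Printing the commands first makes it easy to review them before applying anything to a live server, which matters because `sync=disabled` trades durability on power loss for throughput.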

Finally we installed OrangeFS version 2.8.7 using ZFS as the backing file storage. We ran the same ZFS benchmarks and our results are shown in Table 5. There was a noticeable drop in performance due to the extra layer of abstraction. When we used the optimum ZFS configuration we determined earlier, we obtained 597 MB/s write, 559 MB/s read, and 444.5 IOPS. Relative to the 3.9 GB/s write, 2.3 GB/s read, and 776.6 IOPS obtained using ZFS alone, this represents an 85% drop in write throughput, a 76% drop in read throughput, and a 43% drop in the number of IOPS. However, we believe this performance bottleneck can be overcome with additional OrangeFS tuning for our environment.

We then replicated the Dell OrangeFS whitepaper benchmarks, as shown in Table 6. The whitepaper provides only graphs, so the Dell MD storage stack figures in Table 7 are estimates read from those graphs, included for comparison against the Quanta S100. The full MD storage stack contained 60 drives and so can provide more throughput. However, in the LUN test the 12-drive Quanta shows write performance equivalent to the 15-drive Dell MD stack, while the Dell stack reaches only about 44.5% of the Quanta’s read throughput (an estimated 1,050 MB/s versus 2,359 MB/s). We believe the higher read performance is due to the newer Quanta hardware architecture coupled with ZFS.
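Because the two LUN tests use different drive counts, it can help to normalize the read throughput per drive. The sketch below does this for the 4G test file with 1M record size, using the Quanta value from Table 6 and the graph-derived Dell estimate from Table 7. The per-drive normalization is our own illustration and does not appear in the whitepaper.

```python
def per_drive(throughput_mb_s: float, drives: int) -> float:
    """Average throughput contributed by each drive, in MB/s."""
    return throughput_mb_s / drives


quanta_read, quanta_drives = 2359.021, 12  # Table 6, LUN test, 4G file, 1M records
dell_read, dell_drives = 1050.0, 15        # Table 7, estimated from the whitepaper graphs

print(f"Dell read as a share of Quanta read: {dell_read / quanta_read:.1%}")
print(f"Quanta read per drive: {per_drive(quanta_read, quanta_drives):.1f} MB/s")
print(f"Dell read per drive:   {per_drive(dell_read, dell_drives):.1f} MB/s")
# Dell read as a share of Quanta read: 44.5%
# Quanta read per drive: 196.6 MB/s
# Dell read per drive:   70.0 MB/s
```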


Table 3: Quanta S100-L11SL Individual Drive Testing

Test                                                                       Seq Write [MB/s]  Seq Read [MB/s]  Rand IO [IOPS]
SAS single drive performance                                                        191.999          258.996           326.8
SATA single drive performance                                                       183.693          247.626           327.5
Simultaneous single drive test spanning two SAS channels (mean per drive)           189.644          249.573           324.8
Simultaneous four drive test on a single SAS channel (mean per drive)               195.257          206.470           189.1
Simultaneous four drive test on all four SATA channels (mean per drive)             190.305          208.075           191.2
All twelve drives simultaneously (mean per drive)                                   190.000          250.000           160.0

Table 4: ZFS Testing

ZFS Configuration  ZFS Options and Testing Notes                             Seq Write [MB/s]  Seq Read [MB/s]  Rand IO [IOPS]
1-drive RAID0      compression=on, sync=disabled                                     1085.591          989.514           290.2
6-drive RAID10     compression=on, sync=disabled                                     1076.090         1139.736           639.8
6-drive RAID10     compression=off, sync=disabled                                     428.093          496.323           349.9
12-drive RAID10    compression=on, sync=disabled                                     1024.287         1308.707           708.2
12-drive RAID10    ashift=12, recordsize=4k, compression=on, sync=disabled             83.124           61.312           106.7
12-drive RAID10    ashift=12, recordsize=64k, compression=on, sync=disabled           948.632          961.203           412.8
12-drive RAID10    ashift=12, compression=on, sync=disabled                          1093.201         1302.865           718.4
12-drive RAID10    ashift=12, compression=on, sync=disabled                          3137.117         2792.309           704.0
                   64G test file with 128K record size
12-drive RAID10    ashift=12, compression=on, sync=disabled,                         3766.921           41.895           706.5
                   zfs_prefetch_disable=1
                   64G test file with 128K record size
12-drive RAID10    ashift=12, compression=on, sync=disabled,                         3538.552         2050.413           713.6
                   scheduler=deadline
12-drive RAID10    ashift=12, compression=on, sync=disabled,


Table 5: OrangeFS on ZFS Testing

ZFS Configuration  ZFS Options and Testing Notes                             Seq Write [MB/s]  Seq Read [MB/s]  Rand IO [IOPS]
12-drive RAID10    ashift=12, compression=on, sync=disabled                           546.674          551.286           479.9
12-drive RAID10    ashift=12, compression=on, sync=disabled, checksum=off             549.339           35.098           590.7
12-drive RAID10    ashift=12, compression=on, sync=disabled,                          597.042          559.864           444.5
                   scheduler=cfq, max_sectors_kb=1024, nr_requests=512

Table 6: Replication of Dell OrangeFS Reference Architecture Benchmarks on Quanta S100-L11SL

Test                                          Seq Write [MB/s]  Seq Read [MB/s]  Rand IO [IOPS]
Thread Test
  1 thread                                            1321.962          265.899
  2 threads                                           1533.855          391.251
  4 threads                                           1623.011          783.698
  10 threads                                          1554.623          938.060
24G test file with 128K record size                    597.098          675.952            1609
24G test file with 1M record size                     1604.950         2291.204            1791
LUN Test (12 drives)
  4G test file with default Bonnie++ options          1583.231         1558.290
  4G test file with 128K record size                   648.327          770.118           11859
  4G test file with 1M record size                    1469.250         2359.021            1799

Table 7: Dell OrangeFS Reference Architecture Benchmarks on Dell MD Stack

Test                                          Seq Write [MB/s]  Seq Read [MB/s]
Thread Test
  1 thread                                                 810              800
  2 threads                                               1000             1400
  4 threads                                               1800             2100
  10 threads                                              2100             3000
24G test file with 1M record size                         2259             3014
LUN Test (4G test file with 1M record size)
  5 drives                                                 400              350
  10 drives                                               1050              700
  15 drives                                               1400             1050


6 Conclusion

The final system design will consist of 20 Quanta S100-L11SL servers, each containing twelve 2TB traditional hard drives. Leveraging OrangeFS for the parallel filesystem, the system as a whole is capable of delivering 30 GB/s write, 46 GB/s read, and between 37,260 and 237,180 IOPS. The variation in IOPS depends on the file size and the number of bytes written per commit, as documented in the Test Results section. For reference, the combined performance of the two existing OpenSolaris-based NFS storage servers is 1.6 GB/s read and 2,000 IOPS. The final system design represents a 2,775% increase in read performance and a 1,763-11,759% increase in IOPS. In other words, the proposed system is over an order of magnitude faster in read performance and provides between one and two orders of magnitude faster random file accesses.

We need to be able to add storage servers on an as-needed basis while delivering scalable performance, and that is precisely what OrangeFS provides in our computing environment.
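The headline figures in this conclusion can be checked with a few lines of arithmetic. The sketch below recomputes the percentage increases from the quoted existing and proposed performance, and divides the IOPS range by the 20 servers to show the per-server values it implies. The even split across servers is our simplifying assumption; it presumes that OrangeFS scales linearly, which the single-server tests do not establish.

```python
SERVERS = 20  # planned number of Quanta S100-L11SL servers


def percent_increase(old: float, new: float) -> float:
    """Return the relative increase from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


# Figures quoted in the conclusion.
existing_read_gb_s, existing_iops = 1.6, 2_000
proposed_read_gb_s = 46.0
proposed_iops_range = (37_260, 237_180)

read_gain = percent_increase(existing_read_gb_s, proposed_read_gb_s)
iops_gain = tuple(percent_increase(existing_iops, v) for v in proposed_iops_range)

# Assumption: aggregate IOPS is split evenly across servers (linear scaling).
per_server_iops = tuple(v / SERVERS for v in proposed_iops_range)

print(f"Read increase: {read_gain:,.0f}%")
print(f"IOPS increase: {iops_gain[0]:,.0f}% to {iops_gain[1]:,.0f}%")
print(f"Implied per-server IOPS: {per_server_iops[0]:,.0f} to {per_server_iops[1]:,.0f}")
# Read increase: 2,775%
# IOPS increase: 1,763% to 11,759%
# Implied per-server IOPS: 1,863 to 11,859
```

The upper per-server value, 11,859 IOPS, matches the 4G test file with 128K record size in the LUN test of Table 6.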

Acknowledgements

A hearty thank you to Matt DiFabion and Glen Coppersmith for hardware installation and providing feedback, respectively. We are very appreciative of the assistance provided by Omnibond for the OrangeFS configuration and Seneca Data Distributors for providing the demonstration hardware.

References

[1] http://napp-it.org.

[2] http://zfsonlinux.org.

[3] http://zfsonlinux.org/zfs-disclaimer.

[4] http://zfsonlinux.org/faq.html.

[5] http://content.dell.com/us/en/enterprise/d/business~solutions~engineeringdocs~en/Documents~orange_fs-reference-architecture.pdf.aspx.

[6] http://origin-www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation.

[7] https://www.dell.com/downloads/global/products/pvaul/en/powervault_md3200-high-performance-tier-implementation.pdf.

[8] http://www.quantaqct.com/en/01_product/02_detail.php?mid=29&sid=143&id=144&qs=96.

[9] http://ceph.com.

[10] https://oss.oracle.com/projects/btrfs/.

[11] http://www.gluster.org.

[12] http://www.gluster.org/about.

[13] http://lustre.org.

[14] http://tritonresource.sdsc.edu/oasis.

[15] http://orangefs.org.

[16] G. M. Amdahl, “Computer architecture and Amdahl’s law,” Solid-State Circuits Newsletter, IEEE, vol. 12, no. 3, pp. 4–9, 2007, ISSN: 1098-4232. DOI: 10.1109/N-SSC.2007.4785612.

[17] A. S. Szalay, G. C. Bell, H. H. Huang, A. Terzis, and A. White, “Low-power Amdahl-balanced blades for data intensive computing,” SIGOPS Oper. Syst. Rev., vol. 44, no. 1, pp. 71–75, Mar. 2010, ISSN: 0163-5980. DOI: 10.1145/1740390.1740407. [Online]. Available: http://doi.acm.org/10.1145/1740390.1740407.

[18] “Graywulf: scalable clustered architecture for data intensive computing,” in Proceedings of the 42nd Hawaii International Conference on System Sciences, ser. HICSS ’09, Washington, DC, USA: IEEE Computer Society, 2009, pp. 1–10, ISBN: 978-0-7695-3450-3. DOI: 10.1109/HICSS.2009.234. [Online]. Available: http://dx.doi.org/10.1109/HICSS.2009.234.

[19] http://www.coker.com.au/bonnie++.