Research on Distributed File System Framework Based on Solid State Drive


2017 International Conference on Computer Science and Application Engineering (CSAE 2017) ISBN: 978-1-60595-505-6


Pengfei Wang*, Guangming Liu and Jie Yu

College of Computer Science, National University of Defense Technology, Changsha, Hunan Province, China

ABSTRACT

As the computing power of high-performance computing (HPC) systems approaches exascale and big data applications become increasingly I/O intensive, storage systems face heavy I/O traffic. The poor performance of disks has gradually become the bottleneck of the storage system. Flash technology has developed rapidly in recent years, and flash-based solid state drives (SSDs) are now widely used in HPC; adding SSDs to a storage system can improve its performance. In this paper, we design DSFS, a user-level distributed file system based on SSD. DSFS uses SSDs as the underlying storage medium and adopts the InfiniBand RDMA (Remote Direct Memory Access) transmission mechanism. Test results show that DSFS effectively improves the performance of the HPC storage system.

INTRODUCTION

With the rapid growth of high-performance computing power, applications can handle larger data sets. Typical I/O-intensive applications, such as oil seismic data processing and genetic sequencing data analysis, require the storage system to deliver high bandwidth as well as low latency. At present, HPC storage systems are generally based on disks, whose random read and write throughput is low. This creates a gap between HPC computing performance and storage performance, and the storage bottleneck is becoming increasingly prominent.


RELATED WORK

SSD in Storage Architecture

There are several ways to apply SSDs in a storage system, such as replacing HDDs directly with SSDs, using SSDs as a burst buffer [4], or using SSDs as an HDD cache.

The Gordon supercomputer at the San Diego Supercomputer Center uses 300 TB of flash in place of traditional HDDs to provide high-performance support for data-intensive applications [5].

The National Energy Research Scientific Computing Center (NERSC) in the US uses Cray DataWarp to implement a burst buffer [6], which accelerates small I/O operations and increases the bandwidth available to applications.

Chen et al. proposed Hystor [7], a hybrid storage system based on SSD and HDD. Hystor identifies critical data blocks by monitoring application I/O requests, data block size and access frequency, and migrates them to the SSD. File system metadata, identified by its semantics, is also stored on the SSD to improve metadata access performance. In addition, the SSD serves as a write cache to improve the performance of write-intensive applications.

With the development of transistor technology, SSD prices are falling, and replacing HDDs with SSDs is becoming the trend. In this paper, we study an SSD-oriented parallel file system framework, which is a meaningful step toward parallel file systems built entirely on SSD.

MooseFS

MooseFS (MFS) is an open-source implementation of GFS [8], designed for large-scale distributed cluster environments. MFS provides consistency guarantees and fault-tolerance mechanisms. Its advantages include high availability, easy expansion, multi-copy support, a recycle bin and easy deployment.

MFS inherits the structural characteristics and working pattern of GFS. It consists of a metadata server, a metalogger, and multiple chunkservers and clients. MFS is designed for disk storage media, and its underlying network communication is based on sockets. File data is divided into chunks, which are stored in the local file system of the chunkservers.

SYSTEM ARCHITECTURE

DSFS is developed on the basis of MFS. To make it suitable for high-performance computing environments, we apply InfiniBand RDMA for data transmission, which effectively improves I/O performance. This section introduces the system structure and its working mechanism.

System Structure



Figure 1. System structure.

(1) Master

The Master manages the entire file system: it stores and maintains the metadata, controls the read and write processes, manages the recycle bin, and rebalances load among the data storage servers.

(2) Chunkserver

File data is divided into fixed-size data blocks called chunks, which are stored on the Chunkservers.

(3) Client

The file system is mounted via the FUSE (Filesystem in Userspace) [9] kernel interface, which supports POSIX semantics. User programs can therefore access DSFS directly without modification.
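The division of file data into fixed-size chunks can be illustrated with a minimal sketch. The function name and the 64 MiB chunk size are assumptions for illustration only (64 MiB is the chunk size commonly documented for MooseFS; the paper does not state the value used by DSFS). The sketch maps a byte range of a file onto the chunks it touches.

```python
from typing import List, Tuple

# Assumed chunk size for illustration; the paper does not give the DSFS value.
CHUNK_SIZE = 64 * 1024 * 1024


def split_into_chunks(offset: int, length: int,
                      chunk_size: int = CHUNK_SIZE) -> List[Tuple[int, int, int]]:
    """Map a file byte range onto (chunk_index, offset_in_chunk, length) pieces."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // chunk_size              # which chunk holds this byte
        within = offset % chunk_size              # position inside that chunk
        take = min(chunk_size - within, end - offset)
        pieces.append((index, within, take))
        offset += take
    return pieces
```

A request that crosses a chunk boundary is split into one piece per chunk, so each piece can be sent to the Chunkserver that stores the corresponding chunk.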

Working Mechanism

DSFS is a user-level distributed file system: it is mounted through FUSE and can be used on any operating system that supports FUSE. The application runs at the user level of the compute node. The operating system forwards the application's I/O request to the FUSE kernel module, and the FUSE user-space interface intercepts the file operation and returns control to the system once the request has finished.


Metadata is managed by the Master. The Client receives read or write requests from the user program and requests the metadata from the Master. After receiving the metadata, the Client communicates with the corresponding Chunkserver to complete the read or write operation.

Figure 2 shows the Write operation workflow:

[Figure 2 diagram, message labels: 1 ask for metadata; 2 create chunk; 3 success ack; 4 metadata ack; 5 write data (clients to chunkservers); 6 success ack; 7 write success.]

Figure 2. Write operation workflow.


(1) The system uses asynchronous writes: when the Client receives a write request, it writes the data to its write buffer and returns a success message to the user program. At the same time, the Client requests metadata from the Master.

(2) The Master chooses a Chunkserver according to the load-balancing policy and sends it a chunk-creation message.

(3) The Chunkserver receives the message, creates the chunk, and returns the metadata to the Master.

(4) The Master returns the metadata to the Client. The Client establishes a connection with the Chunkserver according to the metadata, writes the data, and reports the write status to the Master. The Master then updates the metadata, and the write operation is complete.
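The write steps above can be sketched as a small in-memory simulation. This is not the DSFS implementation: all class and method names are hypothetical, the load-balancing policy (fewest chunks first) is an assumption, and the metadata request is performed at flush time rather than concurrently with buffering, which simplifies step (1).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Chunkserver:
    name: str
    chunks: Dict[int, bytes] = field(default_factory=dict)

    def create_chunk(self, chunk_id: int) -> int:
        self.chunks[chunk_id] = b""            # step (3): create the chunk, ack to Master
        return chunk_id

    def write(self, chunk_id: int, data: bytes) -> bool:
        self.chunks[chunk_id] += data          # stands in for the SSD-backed store
        return True


@dataclass
class Master:
    servers: List[Chunkserver]
    metadata: Dict[str, Tuple[str, int, int]] = field(default_factory=dict)
    next_id: int = 0

    def allocate(self) -> Tuple[Chunkserver, int]:
        # step (2): assumed policy, pick the server holding the fewest chunks
        server = min(self.servers, key=lambda s: len(s.chunks))
        chunk_id = server.create_chunk(self.next_id)
        self.next_id += 1
        return server, chunk_id

    def commit(self, path: str, server: Chunkserver, chunk_id: int, size: int) -> None:
        # end of step (4): record where the data lives
        self.metadata[path] = (server.name, chunk_id, size)


class Client:
    def __init__(self, master: Master) -> None:
        self.master = master
        self.write_buffer: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> bool:
        # step (1): buffer the data and report success to the caller immediately
        self.write_buffer[path] = self.write_buffer.get(path, b"") + data
        return True

    def flush(self, path: str) -> bool:
        # step (4): obtain metadata, write to the Chunkserver, report to the Master
        data = self.write_buffer.pop(path)
        server, chunk_id = self.master.allocate()
        ok = server.write(chunk_id, data)
        if ok:
            self.master.commit(path, server, chunk_id, len(data))
        return ok
```

Note that `write` returns before any Chunkserver has seen the data, which is what makes the write asynchronous from the user program's point of view; the Master's metadata is updated only after the Chunkserver write succeeds.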


RDMA-BASED DATA TRANSFER

Figure 3. Communication structure with Socket / RDMA.

In MFS, data is transferred between clients and chunkservers through socket-based communication. As shown in Figure 3, one read/write operation involves six data copies and context switches between user mode and kernel mode, which consumes significant CPU resources and limits the I/O bandwidth.

Most HPC systems interconnect compute nodes through an RDMA-enabled network. We adopt RDMA-based communication in DSFS to reduce context switches and data copies. Clients are mounted with the DIRECT_IO parameter to bypass the VFS page cache, and we have designed an RDMA buffer on both the client and chunkserver sides.

With these optimizations, as shown in Figure 3, one read/write operation involves only two data copies and one context switch. In a comparison experiment, the write bandwidth of DSFS-RDMA was 1.2 times higher than that of DSFS-Socket when the data transfer size was larger than 64 KB.

EVALUATION

Testbed

We evaluate DSFS on the TH-1A supercomputer, located at the National Supercomputer Center in Tianjin, China. Each compute node is equipped with two 6-core Intel Xeon X5670 processors and 24 GB of memory, and the nodes communicate via InfiniBand. The minimum one-way latency is 1.57 μs and the unidirectional bandwidth is up to 6.34 GB/s [10].

[Figure 3 diagram labels, two panels (Socket and RDMA): Client, Chunkserver, user application buffer, VFS page cache, client buffer, socket buffer, RDMA buffer, SSD, user-mode, kernel-mode.]

DSFS is a user-space file system built on the SSDs of compute nodes in a high-performance computing cluster; it provides high-performance storage services to any compute node in the cluster on demand. The Master and Clients can be deployed on any nodes in the cluster according to application requirements, while Chunkservers run on compute nodes equipped with SSDs.

Benchmark Test

We evaluate the sequential write/read bandwidth of DSFS using IOR version 2.10.1 with a test file size of 1 GB, one client and one chunkserver, in Direct I/O mode. Before conducting the test, we first write 128 GB of data to avoid the influence of the client cache. The IOR command is ‘mpirun -n 1 IOR -t $blocksize -b 1g -d 3 -F -C -B -i 24 -k -w -r -o testfile’.
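A sweep over transfer sizes with this command could be scripted along the following lines. This is only a sketch: the flags are copied verbatim from the command quoted above, while the list of sizes, their interpretation as KiB with a `k` suffix, and the helper names are assumptions. The script only builds and prints the command lines; it does not launch IOR.

```python
from typing import List

# Flags taken from the command quoted in the text; only the -t value varies.
FIXED_FLAGS = ["-b", "1g", "-d", "3", "-F", "-C", "-B",
               "-i", "24", "-k", "-w", "-r", "-o", "testfile"]

# Assumed sweep values (interpreted as KiB); the paper does not list them in the text.
SIZES_KIB = [16, 64, 128, 512, 1024, 4096]


def ior_command(transfer_size: str) -> List[str]:
    """Build the argv list for one IOR run with the given -t value."""
    return ["mpirun", "-n", "1", "IOR", "-t", transfer_size] + FIXED_FLAGS


def sweep_commands(sizes_kib: List[int] = SIZES_KIB) -> List[List[str]]:
    """Build one command per transfer size in the sweep."""
    return [ior_command(f"{size}k") for size in sizes_kib]


if __name__ == "__main__":
    for argv in sweep_commands():
        print(" ".join(argv))
```

Keeping each command as an argv list rather than a single string makes it straightforward to hand to `subprocess.run` later without shell quoting issues.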

Figure 4. DSFS and Lustre sequential read/write bandwidth test results.

Figure 4 shows the bandwidth test results. Compared with disk-based Lustre, DSFS greatly improves sequential read and write performance: sequential read performance improves by 3.2 times and sequential write performance by 3.9 times. The write/read performance of DSFS-RDMA improves by nearly 1.2 times over DSFS-Socket when the block size is larger than 64 KB. When the block size is less than 64 KB, allocating and releasing the RDMA buffer takes much more time than the data transfer itself, so DSFS-RDMA performs worse than DSFS-Socket.
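The crossover between the two transports can be reasoned about with a simple cost model, sketched below under strong assumptions: each transfer costs a fixed per-operation overhead plus size divided by bandwidth, the RDMA path has the larger fixed overhead (buffer allocation and release) and the larger bandwidth, and every numeric value in the usage example is a made-up placeholder rather than a measurement from the paper.

```python
def transfer_time(size_bytes: float, setup_s: float, bandwidth_bps: float) -> float:
    """Time for one transfer: fixed per-operation overhead plus streaming time."""
    return setup_s + size_bytes / bandwidth_bps


def crossover_size(setup_fast: float, bw_fast: float,
                   setup_slow: float, bw_slow: float) -> float:
    """Transfer size at which the high-overhead, high-bandwidth path breaks even.

    Solves setup_fast + s / bw_fast == setup_slow + s / bw_slow for s.
    """
    if bw_fast <= bw_slow or setup_fast <= setup_slow:
        raise ValueError("expected higher overhead and higher bandwidth on the fast path")
    return (setup_fast - setup_slow) / (1.0 / bw_slow - 1.0 / bw_fast)


if __name__ == "__main__":
    # Placeholder parameters, NOT measured values from the paper.
    rdma = {"setup_s": 20e-6, "bandwidth_bps": 1.2e9}
    sock = {"setup_s": 10e-6, "bandwidth_bps": 1.0e9}
    size = crossover_size(rdma["setup_s"], rdma["bandwidth_bps"],
                          sock["setup_s"], sock["bandwidth_bps"])
    print(f"break-even transfer size: {size:.0f} bytes")
```

Below the break-even size the fixed overhead dominates and the socket path is faster in this model; above it the bandwidth advantage dominates, which matches the qualitative behavior described for DSFS-RDMA and DSFS-Socket.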

I/O Intensive Application Test

The Gathering program is one of the core programs in oil seismic exploration data processing and is a typical data-intensive application. We evaluate DSFS with this program.

As shown in Figure 5, DSFS improves the performance of the program by 2.3 times, effectively shortening the running time of this I/O-intensive application.

[Figure 4 plot labels: horizontal axis "file block size" with ticks 16, 64, 128, 512, 1024, 4096; vertical axis ticks from 0 to 1600 in steps of 200; legend entry "Lustre_write".]

(a) The bandwidth of the program running in Lustre


(b) The bandwidth of the program running in DSFS

Figure 5. Comparison of the bandwidth of the program running in Lustre and DSFS.

CONCLUSIONS

Poor disk I/O performance has gradually become the bottleneck of high-performance storage systems. Meanwhile, flash technology is developing rapidly, and SSDs are widely applied in high-performance computing clusters. In this paper, we study a user-level distributed file system framework based on SSD, and design and implement DSFS, which adopts InfiniBand RDMA for data transmission and can effectively improve the performance of the storage system. We evaluated DSFS with a benchmark and a data-intensive application; the results show that DSFS improves sequential read performance by 3.2 times and sequential write performance by 3.9 times.

REFERENCES

1. Catalyst. http://computation.llnl.gov/computers/catalyst

2. Vetter J. S., and Mittal S. 2015. “Opportunities for Nonvolatile Memory Systems in Extreme-Scale High-Performance Computing,” Computing in Science & Engineering, 2015, 17(2):73-82.

3. The InfiniBand Trade Association. http://www.infinibandta.org

4. Wang T., Oral S., Wang Y., et al. 2014. “BurstMem: A high-performance burst buffer system for scientific applications,” in IEEE International Conference on Big Data, IEEE, 2014:71-79.

5. "Meet Gordon, the World's First Flash Supercomputer," http://www.wired.com/2011/12/gordon-supercomputer/.

6. Cray DataWarp applications I/O accelerator. http://www.cray.com/products/storage/datawarp

7. F. Chen, D. A. Koufaty and X. Zhang. 2011. "Hystor: making the best use of solid state drives in high performance storage systems," in Int. Conf. on Supercomputing, 2011, pp. 22-32.

8. Ghemawat S., Gobioff H., Leung S.-T. 2003. “The Google File System,” ACM SIGOPS Operating Systems Review, 2003, 37(5):29-43.

9. Szeredi M. 2009. “FUSE: Filesystem in Userspace.”

10. Xie M., Lu Y., Liu L., et al. 2011. “Implementation and Evaluation of Network Interface and Message Passing Services for TianHe-1A Supercomputer.”