Parallel file systems are mandatory for large-scale parallel systems because they provide parallel access to shared file data from a very large number of computing elements. The parallel file system can be seen as a third component of a supercomputer. In large-scale systems, the file system itself consists of several server components that together form a parallel cluster with special support for I/O. In general, a parallel file system consists of two components: a hardware environment and a software environment. Software is needed on the server nodes, where file-system daemons give network-attached clients access to the local file system. On the client side (e.g. compute nodes), software is also needed to interact with the remote file system. Depending on the file system, this software is either fully integrated into the Linux kernel (e.g. Lustre) or runs partly as an independent daemon (e.g. GPFS).
While HPC sites deploy a parallel file system for those data spaces that require high I/O bandwidth and support for parallel access, they often use the Network File System (NFS) protocol to mount a center-wide HOME file system on the HPC system. Although NFS is not designed to support parallel access to files from a large number of clients, it allows a file system to be distributed easily to different computers of an HPC site. The HOME file system is typically used to store program source files and small configuration files. Parallel access is not necessarily needed for such a file system: compilation is typically done on the login nodes of the HPC systems, and job execution directories are typically located on the parallel file system. Storing binary files such as executables or shared libraries on an NFS-mounted file system may result in bottlenecks and is therefore not allowed at some HPC sites. These sites require files to be copied to an execution directory on the parallel file system before job execution. Nevertheless, the limitations of loading shared libraries in parallel at large scale have motivated the design and implementation of a tool to support dynamic loading in this work (cf. Section 2.3.2).
Both SIONlib and Spindle, whose design and implementation are discussed in this dissertation, were motivated by I/O limitations on HPC systems equipped with one of two parallel file systems, GPFS or Lustre. While most of the optimization strategies will also be useful for other parallel file systems [76], the following description focuses on these two file systems.
GPFS
The General Parallel File System (GPFS) from IBM has its origins in 1993, when IBM started the development of the Tiger Shark file system. The first commercial versions of GPFS were available under the name Parallel I/O File System (PIOFS). Since 1998, the file system has been distributed under the name GPFS [80, 44]. A large number of HPC systems listed in the current Top500 ranking list [70] use GPFS to provide fast storage capabilities. The IBM Blue Gene/Q system JUQUEEN in Jülich is one of these. The GPFS file system of JUQUEEN is installed on a dedicated I/O cluster named JUST (Jülich storage server, [56]), which is an IBM System x GPFS Storage Server (GSS). It implements a new approach in which the functionality of Redundant Array of Independent Disks, Level 6 (RAID-6) is not built into hardware controllers [21]. Instead, the RAID-6 functionality is performed at the software level by the GPFS native RAID feature (GNR). The main reason for moving this functionality from hardware to software is that the rebuild speed after a disk failure is significantly improved.

As a cluster file system, GPFS provides access to file-system data from multiple nodes. To this end, GPFS deploys a daemon on each node of the cluster. File systems are built on disk storage that is attached to one, some, or all of these cluster nodes. The file system is globally visible on all nodes and provides file storage with a global name space. Figure 1.6a shows a GPFS configuration using the Network Shared Disk (NSD) server model. In this configuration, the GPFS daemons (NSD clients) that are running on the application nodes use the NSD block I/O protocol to communicate over the network with the GPFS daemons (NSD servers) that are running on the file-system nodes. In general, GPFS daemons can take over file management tasks, which is a key element for scalability. For example, the metadata of a file is handled by the first client that opens or creates the file. Only a few file-system tasks are centralized and assigned to one GPFS daemon (e.g. the management of the file system itself).

Figure 1.6: General structure of GPFS and Lustre. (a) GPFS Network Shared Disk (NSD) model: applications with NSD clients on the application nodes communicate over the network with NSD servers on the file-system nodes. (b) Lustre architecture: applications on the application nodes communicate over LNet with the OSSs and their OSTs, and with the MDSs and the MDT.
Generally, parallel access to different files and concurrent access from different nodes to the same file are possible. To ensure protected access, GPFS provides a flexible and scalable byte-range locking mechanism. Initially, write or read locks for new or unused files have to be requested from a central token manager. In contrast to other file-system implementations, the lock tokens are then delegated to the requesting daemon. Further daemons requesting locks for byte ranges of the same file must request a lock from the daemon holding the corresponding lock and will inherit a lock token for the requested byte range. This mechanism offers a scalable parallel distribution of write/read locks in the case of parallel access to a shared file.

In GPFS, data is stored in units of file-system blocks. These blocks can be described as the smallest unit of data that can be handled efficiently. For example, GPFS daemons allocate an internal GPFS page pool in memory as a buffer for incoming or outgoing data. GPFS pages are moved between memory and disk storage only as whole blocks. Therefore, applications achieve the best performance on the file system when they write and read data according to the file-system block structure. This means that I/O requests have to be aligned to file-system block boundaries and that chunk sizes should be a multiple of the block size. Efficient and easy-to-use I/O libraries should be designed to hide these constraints internally.
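The alignment constraints described above can be illustrated with a minimal Python sketch. It is not taken from GPFS or from any I/O library. The helper names (`align_request`, `aligned_chunk_size`, `chunk_offsets`) and the block size used in the example are assumptions chosen for illustration only. The sketch rounds a per-task chunk size up to a multiple of the block size and derives block-aligned start offsets for the tasks that share one file.

```python
from typing import List, Tuple


def align_request(offset: int, length: int, block_size: int) -> Tuple[int, int]:
    """Expand [offset, offset + length) to the enclosing block-aligned range.

    Returns the aligned start offset and the aligned length in bytes.
    """
    if block_size <= 0 or offset < 0 or length < 0:
        raise ValueError("block_size must be positive, offset and length non-negative")
    start = (offset // block_size) * block_size        # round down to a block boundary
    end = -(-(offset + length) // block_size) * block_size  # round up to a block boundary
    return start, end - start


def aligned_chunk_size(nbytes: int, block_size: int) -> int:
    """Round a chunk size up to the next multiple of the block size."""
    if block_size <= 0 or nbytes < 0:
        raise ValueError("block_size must be positive, nbytes non-negative")
    return -(-nbytes // block_size) * block_size


def chunk_offsets(ntasks: int, bytes_per_task: int, block_size: int) -> List[int]:
    """Block-aligned start offset of each task's chunk in a shared file."""
    chunk = aligned_chunk_size(bytes_per_task, block_size)
    return [rank * chunk for rank in range(ntasks)]


if __name__ == "__main__":
    BLOCK = 2 * 1024 * 1024  # hypothetical 2 MiB file-system block size
    for rank, off in enumerate(chunk_offsets(3, 3_000_000, BLOCK)):
        print(f"task {rank}: offset {off}")
```

With such a layout, no two tasks write into the same file-system block, at the cost of unused padding at the end of each chunk.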
Lustre
Development of the parallel file system Lustre [95] began in 1999 under the name object-based disk file system (obdfs). The first version of Lustre [11] was deployed in 2003 on a production cluster system at the Lawrence Livermore National Laboratory (LLNL). The development of Lustre has continued, and the file system is installed on a wide variety of large-scale HPC systems. Most of the HPC systems at LLNL now use Lustre file systems. In particular, the Sierra cluster, which was used for the development and evaluation of Spindle, one of the tools in this dissertation, has access to one of these Lustre file systems.
Lustre is also used on the general-purpose cluster JUROPA at JSC. Two of the largest Lustre installations can be found at the Oak Ridge National Laboratory [84] and, under the name FEFS (Fujitsu Exabyte File System), at the RIKEN institute in Japan [79], where it serves as the file system for the K computer.
Lustre implements distributed, object-based storage. Internally, it uses objects to store data: files are stored in data objects, and special index objects store directory information and file metadata. The file system is implemented with Object Storage Targets (OSTs), which run on the Object Storage Server (OSS) nodes (cf. Figure 1.6b). The OSTs maintain local file systems to store the data, using ext4 (ldiskfs) or other Linux file-system implementations. Lustre builds the parallel file system by distributing file data over these OSTs. In contrast to GPFS, Lustre allows users and applications to influence this file distribution. To facilitate this, the OSTs of a Lustre file system are visible to users and each has a unique identifier. Users can select how a file should be distributed over the OSTs by using special lfs sub-commands before the file is created. The distribution model has only a few parameters: the number of OSTs, a starting offset number, and the chunk size. According to the specified parameters, the file byte range is divided into chunks, which are distributed in a round-robin fashion over the selected OSTs during the write operation. Enabling users to select OSTs for their I/O operations could potentially cause an imbalance in OST usage. As we will see later in Section 4.3, such an imbalance can directly cause delays in parallel applications. However, the simple model does not allow a set of OSTs to be selected by specifying their identifiers, which limits the usability of the lfs sub-commands for implementing optimizations that prevent such an imbalance.
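The round-robin distribution described above can be sketched in a few lines of Python. This is a simplified model, not Lustre code. The names `StripeLayout`, `ost_for_offset`, and `bytes_per_ost` are hypothetical, and the sketch assumes that the selected OSTs have consecutive indices beginning at the starting offset number and wrapping around at the total number of OSTs; the selection made by a real file system may differ.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StripeLayout:
    """Hypothetical description of a file's distribution over OSTs."""
    stripe_count: int  # number of OSTs the file is distributed over
    start_index: int   # starting offset number (index of the first OST)
    stripe_size: int   # chunk size in bytes
    total_osts: int    # number of OSTs in the file system (assumption)


def ost_for_offset(layout: StripeLayout, offset: int) -> int:
    """Return the index of the OST that stores the byte at `offset`."""
    chunk = offset // layout.stripe_size
    # Assumption: OSTs are used consecutively, starting at start_index.
    return (layout.start_index + chunk % layout.stripe_count) % layout.total_osts


def bytes_per_ost(layout: StripeLayout, file_size: int) -> Dict[int, int]:
    """Count how many bytes of a file of `file_size` bytes land on each OST."""
    usage: Dict[int, int] = {}
    offset = 0
    while offset < file_size:
        n = min(layout.stripe_size, file_size - offset)
        ost = ost_for_offset(layout, offset)
        usage[ost] = usage.get(ost, 0) + n
        offset += n
    return usage


if __name__ == "__main__":
    MIB = 1024 * 1024
    layout = StripeLayout(stripe_count=4, start_index=2, stripe_size=MIB, total_osts=8)
    for ost, nbytes in sorted(bytes_per_ost(layout, 5 * MIB).items()):
        print(f"OST {ost}: {nbytes // MIB} MiB")
```

In this model, a file whose size is not a multiple of the stripe count times the chunk size places more data on the first OSTs of the selection. If many files share the same starting offset number, this is one way in which an imbalance in OST usage can arise.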
The Lustre client is integrated into the Linux kernel. It can use the Linux file cache in memory to store Lustre file data. Therefore, Lustre needs neither to start an additional daemon on the client nodes nor to allocate additional memory for caching. The Lustre kernel extension communicates with the Lustre servers over the Lustre network abstraction layer (LNet).

In addition to object targets, Lustre implements Metadata Targets (MDTs) on the Metadata Servers (MDSs), which are responsible for maintaining the name space of the file system. In contrast to the OSTs, typically only one metadata target is active for a file system; multiple MDTs are only used to build failover groups. The MDT stores the metadata of files and directories (e.g. their names, access rights, and ownership). In addition, the MDS is responsible for selecting the OSTs that store file data according to the user specifications. The corresponding information about OST mapping and file distribution is returned to those Lustre clients that open the file. After file creation, the MDS is no longer involved in the I/O operation: the Lustre clients interact directly with the assigned OSTs. The file metadata on the MDS is updated after the file is closed, and the MDS is not involved in file block allocation. In contrast to GPFS, MDS functionality is not distributed over multiple servers. This causes bottlenecks when a large number of clients perform file look-up operations in parallel.

Lustre implements an additional service for file locking on each storage target, the Lustre Distributed Lock Manager. It protects concurrent file and metadata operations by serializing them and thereby ensures a consistent view. Similar to GPFS, Lustre implements byte-range locking. This enables shared-file I/O in which different clients write disjoint chunks of a file.
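Why byte-range locking permits concurrent writes to disjoint chunks can be shown with a toy lock table in Python. This sketch models neither the Lustre Distributed Lock Manager nor the GPFS token mechanism. The class `ByteRangeLockTable` and its methods are hypothetical, and the compatibility rule used here (two locks of different owners conflict only if their byte ranges overlap and at least one of them is a write lock) is a common convention assumed for illustration.

```python
from typing import List, Tuple


class ByteRangeLockTable:
    """Toy lock table for one file; ranges are half-open [start, end)."""

    def __init__(self) -> None:
        # Each entry is (start, end, owner, mode) with mode "read" or "write".
        self._held: List[Tuple[int, int, str, str]] = []

    def try_lock(self, owner: str, start: int, end: int, mode: str = "write") -> bool:
        """Grant the lock if it conflicts with no lock held by another owner."""
        if start >= end:
            raise ValueError("empty or inverted byte range")
        if mode not in ("read", "write"):
            raise ValueError("mode must be 'read' or 'write'")
        for s, e, o, m in self._held:
            overlaps = start < e and s < end
            if overlaps and o != owner and (mode == "write" or m == "write"):
                return False  # conflicting lock held by another client
        self._held.append((start, end, owner, mode))
        return True

    def unlock(self, owner: str) -> None:
        """Release all locks held by `owner`."""
        self._held = [h for h in self._held if h[2] != owner]


if __name__ == "__main__":
    table = ByteRangeLockTable()
    print(table.try_lock("client-a", 0, 100))    # disjoint chunk, granted
    print(table.try_lock("client-b", 100, 200))  # disjoint chunk, granted
    print(table.try_lock("client-c", 50, 150))   # overlaps both, refused
```

Clients that write disjoint chunks never conflict in this model, so their operations need not be serialized. Only overlapping requests have to wait until the conflicting lock is released.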