Contribution of this Thesis - Efficient Task-Local I/O Operations of Massively Parallel Applica

The characteristic that load sequences of libraries are similar on all processes and are concentrated to short time interval can only be assumed at program startup and not during runtime. For example, dlopen may only be used on a subset of application processes and these processes can load different sets of libraries. These choices are determined by input configurations or intermediate results of computation. The load operations may also not concentrated on a short time interval because even if all processes load the same DSOs via dlopen, the load operations may not start at the same time or in the same order.

1.5 Contribution of this Thesis

Common use cases of parallel task-local I/O are application checkpointing and the non-volatile storage of temporary data in tools and applications during the runtime of a job. Both require a high I/O bandwidth from the parallel file system, because the size of the data is often on the order of the size of the available memory. Particularly at large scales, the performance of traditional parallel task-local I/O is limited. As we will discuss in Chapter 2, this limitation is mainly caused by metadata handling which can result in the creation of tens of thousands of files in a directory. Bottlenecks can originate from the serial nature of the POSIX I/O interface, which ignores the parallelism of the I/O operation and passes the burden of metadata handling onto the file system. Additionally, POSIX I/O is ultimately a limiting factor when starting parallel applications, which are dynamically linked and which add additional binary library data to the process during startup or runtime. The dynamic loader is designed as a serial tool and uses POSIX I/O internally for file access. Therefore, in this way it does not exploit the parallelism of the loaded applications.

POSIX I/O, is provided by the file system as the general interface to interact with the operating systems and applications. The interface has a long history. Enhancements to this interface require a long time to be implemented and manifested. However, the increasing size of parallel applications and the underlying parallel HPC systems, requires a modernization of the POSIX I/O interface, at least for the previously described use cases. The main goals of such an enhancement are (i) to provide high I/O bandwidths at large scale and (ii) to simultaneously limit the metadata management overhead for applications and tools, which perform task-local I/O operations at large scale. To accomplish these goals, a solution should exploit application and HPC parallelism for task-local I/O operations. Such a solution should operate in user- space without modification of the runtime and file system. The solution should leave the task- local file I/O strategy unmodified and it should be scalable, both in terms of I/O bandwidth and metadata overhead.

This dissertation introduces two new approaches that fulfill the above-mentioned objectives. These approaches are implemented in the parallel I/O library SIONlib and with the tool Spin- dle, which supports the efficient loading of dynamically linked executables at large scale. The main issue of traditional parallel task-local I/O is the generation of a large number of individual files, which causes overhead in file creation and management. The first major contribution of this dissertation, the parallel I/O Library SIONlib, replaces the individual files by

a small number of shared files preventing such overhead. Application tasks are assigned one or more chunks of these shared files, to which the task has exclusive access. This approach leaves the task-local file I/O strategy of applications unchanged. Furthermore, SIONlib uses the communication layer of the application to aggregate and distribute metadata among the tasks, which guarantees scalability to the size of the simulation. Efficiency is further improved by exploiting the I/O infrastructure and file-system properties, for example, to align data or to organize I/O aggregation according to the hierarchy of the I/O infrastructure. However, metadata performance in shared file I/O also suffers at very large scale (more than 64k tasks). Therefore, SIONlib can also use multiple files to build a virtual shared file. This strategy par- allelizes metadata handling, because the physical files are handled by different components of the file system. As a result, the techniques, which are presented in this dissertation and implemented into SIONlib, enable applications to perform parallel task-local I/O on shared files with comparable I/O bandwidths and considerably less metadata overhead. For example, as described in Section 4.4.1, SIONlib improves I/O efficiency of the application MP2C. Now it can scale to the full BG/Q JUQUEEN system, running with 1.8 million tasks and writing checkpoint files of several TiB with more than 50 % of file-system peak.

The second major contribution is the efficient support of the dynamic loading of parallel applications. The approach, which will be presented in this thesis, consists of three main techniques, (i) the interception of the dynamic loader, (ii) the distributed caching of load-path information and the library data, and (iii) the deployment of an overlay network to connect the distributed cache servers. All these components are running in user space. Modifications to the runtime system and to the applications are not necessary. For example, the standard GNU Linux loader provides a user-space interception technique, which allows Spindle to monitor library load requests and to modify the results of look-up operations. The distributed cache buffers the results of library look-up operations and thus prevents multiple similar accesses to the file system. In addition, with the help of the overlay network library data is transferred to a location near the process that needs it. Taken these techniques reduce the number of file- system operations for loading a parallel, dynamically linked application to those needed to load a single program instance; the dynamic loading overhead remains constant.

Both, SIONlib and Spindle represent transient solutions that implement approaches which even demonstrate their efficiency at scale. They should, therefore, be adopted in the next generation of large-scale runtime and file systems.

The dissertation is organized as follows. After the introduction of both, parallel task-local I/O and dynamic loading in Chapter 1, Chapter 2 discusses the limitations of these techniques at large scale and the root causes of these limitations. Chapter 3 presents the design of SIONlib, which implements the approach of improved parallel task-local I/O. Next, Chapter 4 shows the results of evaluating these techniques with artificial I/O workloads as well as with real-world applications. Chapter 5 presents the design of Spindle, an approach designed to aggressively cache library look-up and load operations. Spindle performance is then evaluated on two different platforms in Chapter 6. Finally, Chapter 7 gives a concluding summary of the dissertation and provides an outlook for future research based on both SIONlib and Spindle.

2 I/O Limitations at Large Scale

With the increasing number of tasks of parallel applications, the efficiency of application I/O becomes more crucial. Similar to the parallelization of the computational part of the application with programming models like MPI and OpenMP, the I/O of the application has also to be parallelized. Common tools for this are, parallel I/O libraries like MPI I/O, pNetCDF, or parallel HDF5 on a higher, application oriented abstraction level, whereas task-local I/O patterns are typically realized with native POSIX I/O operations on a lower file-system oriented abstraction level (cf. Figure 1.5 on page 11). As already described, this work focuses on task-local I/O patterns that are given by I/O operations to individual files of parallel applications and dynamic loading of library data for starting parallel applications. Both patterns have scalability issues at larger scale. This chapter discusses the limiting factors, which hinder applications with such I/O patterns to scale to very large number of tasks. The limitations are addressed by I/O strategies that are designed in this work and which are implemented in two tools SIONlib and Spindle. In the following, an abstract model for parallel I/O data flow on complex I/O infrastructures is introduced, which helps to explain the limitation of the different types of parallel task-local I/O and dynamic loading patterns.

2.1 Schematic View of the Parallel I/O Data Flow

The objective of I/O is to move data between different storage resources. In the simplest case, data is moved by an I/O call from a location in the local memory of a compute node to a file on a local disk. In systems, which are more complex and support I/O from multiple compute nodes, the data will be moved in a number of steps from a compute node’s memory to a disk of a shared parallel file system.

Figure 2.1 shows a simple model of an abstract I/O node, which represents one processing step in the data flow from compute node to the file system. In addition to the input and output

Input … Buffer Buffer Data Processing Output Input Input

I/O-Node Parallel File System

…

Compute-Nodes

streams, an abstract I/O node has also memory buffers for incoming and outgoing data. With the use of these buffers, I/O-operations of the application can be asynchronous to the disk I/O, because data is written to the memory buffer first and transferred to disk later in time. The I/O handling within a compute node is similar. Therefore it can be modeled also with this abstraction. The Unix kernel will process the data, which is handed over by system calls from the application. The data itself is stored in application memory or in system memory (cf. Figure 2.2). GPFS starts an own client daemon and allocates a page pool in memory to maintain file system operations. The page pool is used for managing file-system pages, which are read or modified by client processes. In contrast, Lustre uses the system I/O memory caches to store data, which have to be transferred between compute nodes and file system. Further, the model can be applied also to the file system. A file system consists typically of multiple server nodes, which are connected to the file system client daemons on compute- or I/O-nodes. The server nodes manage also the file system disk storage, which is attached to these nodes. In principle, the server has the same functionality as I/O nodes. The data will be transferred between disks and connected client nodes, using an internal memory buffer. The data transfer in parallel I/O can be modeled by a network of these abstract I/O components. This network will contain compute nodes, I/O forwarding nodes, and file-system nodes as depicted in Figure 2.2. Typically, the number of components will decrease on the way from application to file-system disk. Multiple incoming I/O streams will be serialized to a smaller number of outgoing I/O streams. Those points in the network are therefore typical bottlenecks at large scale.

Metadata plays a special role in this model: Depending on the underlying file system, metadata will be handled either together with the file data or it will be handled on separate servers. In the first case, the network can also reflect these operations. In the second case, additional

… … …

N

1,1

N

1,m

…

N

2,1

N

_2,m

…

N

n,1

N

_n,m

…

_…

… I/O-Nodes Compute-Nodes

N

…

OS

Parallel File System

… …

…

In document Efficient Task-Local I/O Operations of Massively Parallel Applications (Page 44-48)