• No results found

2.4 Introduction to Parallel I/O

2.4.3 High-level I/O Libraries and Middleware

The IEEE Portable Operating System Interface (POSIX) [48] defines the primary application programming interface (API) for UNIX variants and other operating systems, along with command line interpreters and utility interfaces, The first standard was released in 1988, when a single computer owned its own file system. The POSIX API presents a useful, ubiquitous interface for basic I/O, but lacks useful constructs for parallel I/O. When using POSIX I/O, a cluster application is basically one program running on N nodes, but it looks likes N programs to the file system. In addition, POSIX has no support for noncontiguous I/O and no hinting

or prefetching mechanism. Together with its rules such as atomic writes and read-

after-write consistency, POSIX I/O has a negative impact on parallel applications.

To overcome the limitations of POSIX I/O, high-level I/O libraries and middleware solutions have been developed to provide data abstraction and to organize accesses from many processes, especially those using collective I/O. The following sections introduce MPI-IO and selected libraries.

2.4.3.1 MPI-IO

in distributed memory architectures, parallel I/O systems needs a mechanism to specify collective operations, user-defined data types to describe noncontiguous data layouts in both memory and file, communicators to separate application-level message passing from I/O-related message passing, and non-blocking operations. MPI-IO, which was released in 1997 as part of MPI-2, is an I/O interface specification for use in MPI applications and provides a low-level interface for carrying out parallel I/O. Internally, it uses the same data model as POSIX, i.e., streams of bytes in a file. MPI-IO’s features include collective I/O, noncontiguous I/O with MPI data types and file views, non-blocking I/O, and Fortran and additional language bindings. Collective I/O is a critical optimization strategy for reading from, and writing to, a parallel file system. The collective read and write calls force all processes in a communicator to read/write data simultaneously and to wait for each other. The MPI implementation optimizes the read/write request based on the combined requests of all processes and can merge the requests of different processes for efficiently servicing the requests. Given N processes, each process participates in reading or writing a portion of a common file.

There are several different MPI-IO implementations. One of the widely used ones is ROMIO [49] from the Argonne National Laboratory. ROMIO leverages MPI-1 communication and supports local file systems, network file systems, and parallel file systems such as GPFS and Lustre. The implementation includes data sieving techniques and two-phase optimizations.

2.4.3.2 HDF5 and Parallel HDF5

The Hierarchical Data Format v5 (HDF5) [50] consists of three components: a data model, an open source software, and an open file format. The HDF5 file format is defined by the HDF5 file format specification [51] and defines the bit- level organization of an HDF5 file on storage media. HDF5 is designed to support high volume and/or complex data, every size and type of system (portability), and

2 Communication and I/O in HPC Systems

flexible, efficient storage and I/O. An HDF5 file is basically a container consisting of data objects, which can either be datasets or groups. Datasets organize and contain multidimensional arrays of a homogeneous type, while groups are container structures consisting of datasets or other groups.

Altogether, HDF5 provides a simple way to store scientific data in a database-like organization. The data model complements the file format by providing an abstract data model independent of the storage medium or programming environment. It specifies the logical organization and access of HDF5 data from an application perspective, and enables scientists to focus on high-level concepts of relationships between data objects rather than descending into the details of the specific layout. The HDF5 software provides a portable I/O library, language interfaces, and tools for managing, manipulating, viewing, and analyzing data in the HDF5 format. The I/O library is written in C, and includes optional C++, Fortran 90, and high-level APIs. For flexibility, the API is extensive with over 300 functions. Parallel HDF5 is a configuration of the HDF5 library, which can be used to share open files across multiple parallel processes. It uses the MPI standard for interprocess communication.

2.4.3.3 ADIOS

The Adaptable I/O System (ADIOS) [52, 53] is an extendable framework which allows scientists to plug-in several different I/O methods, data management services, file formats, and other services such as analytic and visualization tools. It provides a simple I/O application programming interface (API) and allows the usage of different computational technologies to achieve good, predictable, performance. ADIOS provides a mechanism to externally describe an application’s I/O requirements using an XML-based configuration file. One of its salient features is that I/O methods can be exchanged between runs by modifying the configuration file without the need to modify or recompile the application code. ADIOS has demonstrated impressive I/O performance results on leadership class machines and clusters, and has been deployed on several supercomputing sites including the Argonne Leadership Computing Facility (ALCF), the Oak Ridge Leadership Computing Facility (OLCF), the National Energy Research Scientific Computing Center (NERSC), the Swiss National Supercomputing Centre (CSCS), Tianhe-1 and 2, and the Pawsey Supercomputing Centre.

ADIOS is implemented as a user library. It can be used like any other I/O library, except that it has a declarative approach for I/O. The user defines in the application source code the “what” and “when” while the framework takes care of the “how”. There are two key ideas. First, users do not need to be aware of the low-level layout 30

and organization of data. Second, application developers should not be burdened with optimizations for each platform they use. It is capable of I/O aggregation on behalf of the application to increase the I/O performance and scalability.

2.4.3.4 SIONlib

The Scalable I/O library (SIONlib) [54] implements the idea of writing and reading binary data to or from several thousands of processors into one or a few physical file(s). In other words, SIONlib maps the task-local I/O paradigm onto shared I/O. SIONlib is implemented as an additional software layer in between the parallel file system and a scientific application, and provides an extension to the traditional POSIX or standard C file I/O API. To enable parallel access to files, SIONlib provides collective open and close functions, while writing and reading files can be done asynchronously. Its main application area are internal file formats such as scratch files and checkpointing files. SIONlib requires only minimal changes to the application source code, mainly to the open and close function calls. Since each task writes its own data segment to the same file, SIONlib assigns different regions of the file to a task. This minimizes the file lock contention and I/O serialization for collective I/O. In order to do so, SIONlib needs to know the estimated (or known) data size to place the data accordingly. If a write call exceeds the data size specified in the open call, SIONlib moves forward to next chunk. SIONlib provides support for different languages including C, C++, and Fortran, and supports MPI, OpenMP, and MPI+OpenMP.