Support for Tools - Efficient Task-Local I/O Operations of Massively Parallel Applications

start at the beginning of the file and has to traverse all metadata blocks until the block of the requested key is found. To optimize the read and look-up procedure, SIONlib maintains an in-memory hash table that contains the key, size, and offset of each key-value pair already found. As indicated in Figure 3.13, data blocks can be extended into the next chunk if the space in the current chunk is not sufficient.
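The look-up cache described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not SIONlib's implementation: the metadata blocks are modeled as an in-memory list of (key, size) pairs, the data of consecutive blocks is assumed to be contiguous, the scan is assumed to resume where the previous one stopped, and all names (`Entry`, `KeyIndex`, `find`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    key: int
    size: int
    offset: int  # byte offset of the block's data (metadata size ignored here)


class KeyIndex:
    """Caches every metadata block seen so far to avoid rescanning the file."""

    def __init__(self, blocks):
        # blocks: list of (key, size) in file order; a hypothetical stand-in
        # for the metadata blocks that would be read from disk.
        self._blocks = blocks
        self._next = 0            # index of the first block not yet scanned
        self._offset = 0          # data offset of that block
        self._table = {}          # hash table: key -> list of Entry
        self.blocks_scanned = 0   # counts simulated metadata reads

    def find(self, key):
        # Serve the request from the hash table if the key was already found.
        if key in self._table:
            return self._table[key][0]
        # Otherwise continue traversing metadata blocks, caching each one.
        while self._next < len(self._blocks):
            k, size = self._blocks[self._next]
            entry = Entry(k, size, self._offset)
            self._table.setdefault(k, []).append(entry)
            self._next += 1
            self._offset += size
            self.blocks_scanned += 1
            if k == key:
                return entry
        return None  # key does not occur in the file
```

A repeated look-up of a key that was passed during an earlier traversal is then answered from the table without touching further metadata blocks.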

3.10 Support for Tools

Parallel tools like Scalasca [36] or Score-P [59] place different requirements on parallel I/O libraries. These tools interact with applications and inherit additional restrictions from them concerning the parallelization scheme and runtime configuration. Therefore, tools like Score-P have to support applications with different parallelization schemes, as they instrument the application code to obtain event-trace data.

The Score-P instrumentation and measurement infrastructure, which is used by Scalasca and Vampir, supports applications whose number of threads is variable and not defined in advance; hence, the I/O layer that stores event traces on disk must support this as well. SIONlib fulfills this requirement by providing the key-value container. For hybrid applications, Score-P records event traces on each thread but writes them to output files from only one thread. These files are written in the OTF2 format. Conversely, VampirServer, the parallel version of Vampir, and Scalasca read OTF2 data from multiple threads concurrently. Therefore, SIONlib provides a function to duplicate a SIONlib file handle (sion_dup) as an additional enhancement. Once a SIONlib file has been opened in parallel, this function can be used to replicate the internal data structures for each thread, which allows each thread to perform independent read operations. Further tool-specific support is available in SIONlib through the generic API, the reinitialization of already opened files, and the mapped parallel open mode, which supports opening SIONlib files with a number of tasks different from the number used at file creation time.

3.10.1 Generic API

The generic interface of SIONlib is primarily designed for the integration of SIONlib into the Score-P infrastructure. Score-P internally uses an abstract communication layer and provides callback functions to propagate the communication methods to underlying software layers. As described in Section 3.7, the interfaces of SIONlib for MPI, OpenMP, and hybrid applications provide communication methods via callback functions to the generic parallel layer in the same way. SIONlib needs only a few communication functions for metadata management in the generic parallel layer: broadcast, gather, and scatter methods to distribute and collect metadata, as well as a barrier method to synchronize the tasks. All methods are collective and have to be performed either on all tasks or on a subset of tasks. The global communicator group consists of all tasks participating in SIONlib I/O, whereas the local communicator group includes only those tasks that access the same physical file. SIONlib's generic interface provides an API to define and register a user-defined set of callback functions and in this way helps to implement a new parallel API for SIONlib, for example, to support applications and tools that are based on parallelization paradigms other than MPI or OpenMP (cf. Figure 3.11). This has simplified the integration of SIONlib, because the Score-P callback functions can be propagated directly to the generic interface of SIONlib. A more practical advantage of the generic interface is that none of the required software layers depends on external libraries. This eases the integration of SIONlib into the build environment of tools.
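The callback idea can be illustrated with a small Python sketch. It is not SIONlib's generic API: the names `CommCallbacks`, `serial_callbacks`, and `exchange_chunk_sizes` are hypothetical, and the offset computation is only an assumed example of a metadata exchange. The point is that the I/O layer is written solely against the four collectives named above, so any parallelization paradigm that can supply them can be plugged in.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class CommCallbacks:
    """Hypothetical bundle of the four collectives required by the I/O layer."""
    barrier: Callable[[], None]
    bcast: Callable[[Any, int], Any]                      # (value, root) -> value on every task
    gather: Callable[[Any, int], Optional[List[Any]]]     # (value, root) -> list on root, None elsewhere
    scatter: Callable[[Optional[List[Any]], int], Any]    # (values on root, root) -> one value per task


def serial_callbacks() -> CommCallbacks:
    """Trivial single-task implementation of the callbacks."""
    return CommCallbacks(
        barrier=lambda: None,
        bcast=lambda value, root: value,
        gather=lambda value, root: [value],
        scatter=lambda values, root: values[0],
    )


def exchange_chunk_sizes(comm: CommCallbacks, my_chunk_size: int, root: int = 0) -> int:
    """Return this task's start offset, using only the registered callbacks."""
    sizes = comm.gather(my_chunk_size, root)   # collect requested sizes on root
    offsets = None
    if sizes is not None:                      # only the root computes offsets
        offsets = []
        position = 0
        for size in sizes:
            offsets.append(position)
            position += size
    my_offset = comm.scatter(offsets, root)    # hand each task its own offset
    comm.barrier()                             # synchronize before I/O starts
    return my_offset
```

With `serial_callbacks()` the function runs without any communication library, which mirrors the practical advantage mentioned above: the layer itself has no external dependencies.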

3.10.2 Reinit

The reinit function of SIONlib allows the exact specification of chunk sizes to be postponed to a later point of the execution, which is possible as long as no data has been written to the file. Reinit was implemented as a special feature for Scalasca 1.x to improve the efficiency of SIONlib I/O for trace data. Scalasca provides an internal buffer on each task, which is filled with trace data during runtime. Typically, the buffer is large enough to store all data of the run, and it is written to a SIONlib file at the end of the execution. Only in the rare case of insufficient free buffer space do the buffers have to be flushed during runtime, as illustrated in Figure 3.14 (left). Therefore, the file has to be created and opened at start time to be prepared for intermediate flushes. However, at this time the exact chunk size is not known, and Scalasca has to specify the size of the memory buffer as the maximum chunk size. If the memory buffer is much larger than the recorded event set, the resulting SIONlib file is very sparse and I/O is less efficient, because file-system blocks are not filled completely. The exact chunk size is known at the end of the execution and can be passed to SIONlib via the reinit call. SIONlib then reinitializes its internal data structures without recreating and reopening the physical files. This feature becomes more important if the data in the memory buffer is compressed in place and not during the write operation, as it is implemented in Scalasca (1.x). In this case, the chunk sizes can be further reduced to the size of the compressed data. Especially in combination with the coalescing I/O feature of SIONlib, files can be much denser, as depicted in Figure 3.14 (right).
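The effect of the chunk size on file density can be made concrete with a short calculation. The sketch below assumes that each chunk is rounded up to a whole number of file-system blocks; the function names and all sizes are hypothetical and chosen only for illustration.

```python
MIB = 1024 * 1024


def aligned(size: int, fs_block: int) -> int:
    """Round size up to a whole number of file-system blocks."""
    return -(-size // fs_block) * fs_block


def file_extent(chunk_sizes, fs_block: int) -> int:
    """Bytes spanned by one aligned chunk per task."""
    return sum(aligned(size, fs_block) for size in chunk_sizes)


def fill_ratio(data_sizes, chunk_sizes, fs_block: int) -> float:
    """Fraction of the spanned bytes that actually holds trace data."""
    return sum(data_sizes) / file_extent(chunk_sizes, fs_block)


# Hypothetical run: four tasks, 2 MiB file-system blocks, 100 MiB trace buffers.
fs_block = 2 * MIB
trace = [6 * MIB, 3 * MIB, 9 * MIB, 5 * MIB]    # data actually recorded per task

# Chunk size fixed at open time: the full buffer size on every task.
ratio_at_open = fill_ratio(trace, [100 * MIB] * 4, fs_block)

# Chunk size passed at reinit: the exact amount of recorded data per task.
ratio_at_reinit = fill_ratio(trace, trace, fs_block)
```

In this example the file spans 400 MiB for 23 MiB of data when the buffer size is used as chunk size (fill ratio of about 0.06), but only 26 MiB when the exact sizes are supplied later (fill ratio of about 0.88).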

[Figure 3.14 (diagram): two timelines along a time axis, each running from start of application to end of application and showing the tasks (T), the parallel open, the write operations, and the parallel close. Left: potential intermediate flushes; chunk size specified at open. Right: data compressed in memory; a parallel reinit recalculates the exact chunk size; chunk size specified at reinit.]

Figure 3.14: Example of using the SIONlib reinit feature in Scalasca. The instrumented application records event traces in memory buffers. The SIONlib reinit function reduces the chunk size of the file, which was opened in advance, to the sizes actually required (right), compared to the original scheme (left).


3.10.3 Mapped open

Parallel access to SIONlib files requires that the number of participating tasks is equal to the number of chunks in the file. Therefore, applications reading a previously created SIONlib file have to run with the same number of tasks as the creating application. Applications or parallel tools that run with fewer or more tasks can only fall back to the serial API and open the file container on each task individually. As the memory and logistical overhead of this serial open is very high compared to the parallel collective open, the mapped open feature was added to SIONlib. This feature is not only useful for restarting applications on a different number of tasks. It also supports parallel tools, which often run in smaller configurations than the simulation itself. Parallel post-processing tools are becoming more important as application output files grow to sizes at which they cannot be moved off-site and have to be processed at the same location.

A special use case is VampirServer, the parallel version of the performance analysis tool Vampir. The server part of VampirServer is used to access large trace data directly on the HPC system, instead of first moving it to a local desktop system or reading it serially on a login node. VampirServer interacts directly with a Vampir client, which runs on a local desktop system or a login node of the HPC system. As VampirServer runs in parallel, it can read and process trace data in parallel on the HPC system; it therefore provides faster interaction with the user and requires less data movement than reading the data locally on a desktop system. However, VampirServer typically runs on a moderate number of tasks, which is mainly determined by the main memory required to store the trace data of large runs. Figure 3.15 shows a simple example of a mapped open, in which a SIONlib file is read by an application running with only half the number of tasks.

With mapped open, each task can specify an individual list of global rank numbers whose data chunks are to be read by this task. The mapped open is a collective operation. Therefore, the metadata of the SIONlib files is read only once by one task and is distributed to the other tasks with collective communication operations, similar to the parallel open. Furthermore, SIONlib has to maintain only the metadata of the specified tasks in memory and has to open only those physical files that contain a requested chunk. Compared with a native serial open from each task, the operation uses less memory and causes less file-system activity.
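How a reader task might build its list of global rank numbers can be sketched in a few lines of Python. This is only an illustration under two assumptions that the text does not prescribe: writer ranks are distributed to readers in contiguous blocks, and writer ranks were assigned to physical files in contiguous groups of `tasks_per_file`. Both function names are hypothetical and not part of SIONlib.

```python
def mapped_ranks(reader: int, n_readers: int, n_writers: int) -> list:
    """Global writer ranks whose chunks one reader task would request."""
    base, extra = divmod(n_writers, n_readers)
    # The first `extra` readers take one additional writer rank each.
    start = reader * base + min(reader, extra)
    count = base + (1 if reader < extra else 0)
    return list(range(start, start + count))


def files_to_open(ranks, tasks_per_file: int) -> set:
    """Indices of the physical files that hold at least one requested chunk."""
    return {rank // tasks_per_file for rank in ranks}


# Eight writer tasks read back by four reader tasks (1:2 mapping).
mapping = {reader: mapped_ranks(reader, 4, 8) for reader in range(4)}
```

For eight writers and four readers, every reader requests the chunks of two writer ranks; with four writer tasks per physical file, each reader then has to open only one of the two files.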

[Figure 3.15 (diagram): tasks (T) and their chunks laid out over file-system blocks (FS Block); panels: Write (1:1) and Mapped read (1:2).]

Figure 3.15: Example of using the SIONlib mapped open to read a file container with four reader tasks, although it was originally written using eight tasks. Each of these reader tasks now has to read chunks from multiple writer tasks (two in this example).

The mapped open also supports the creation of SIONlib files with a different number of tasks. This is typically used in parallel preprocessing tools to prepare input files for larger simulation runs.
