regions. One example is the performance analysis tool Scalasca (cf. Section 1.3.3), which in its version 1.x supports hybrid applications with a fixed number of OpenMP threads and requires each thread to write and read its own trace data. Provided that the number of threads is known in advance when the file is created, the proposed container format can support parallel task-local I/O from hybrid applications without modification. Instead of assigning a chunk of the file to each MPI task, a chunk is assigned to each OpenMP thread. In this case, all I/O operations have to be performed inside the parallel region, where all OpenMP threads are forked and active.
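The per-thread chunk assignment can be illustrated with a minimal sketch. It assumes that every MPI task runs the same fixed number of threads and that chunks are numbered consecutively, task by task. The function names `chunk_index` and `total_chunks` are hypothetical and not part of any existing library interface.

```python
def chunk_index(mpi_rank: int, thread_id: int, threads_per_task: int) -> int:
    """Return the chunk number owned by one thread of one MPI task.

    Assumption: chunks are laid out task by task, and every task uses the
    same fixed number of threads, known when the file is created.
    """
    if mpi_rank < 0:
        raise ValueError("mpi_rank must be non-negative")
    if not 0 <= thread_id < threads_per_task:
        raise ValueError("thread_id out of range")
    return mpi_rank * threads_per_task + thread_id


def total_chunks(num_tasks: int, threads_per_task: int) -> int:
    """Number of chunks that must be reserved when the container is created."""
    return num_tasks * threads_per_task
```

With 3 MPI tasks and 4 threads each, the container would hold 12 chunks, and thread 3 of task 2 would own chunk 11. Because the mapping depends only on values fixed at file creation, no coordination between threads is needed at runtime.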
Requiring that the number of threads be known in advance excludes applications that determine the number of threads dynamically at runtime. For example, an application may choose the number of threads depending on the computational algorithm used, the input data, or the maximum number of threads available on the compute node. To support such applications, the current scheme of the shared file container would have to be extended so that chunks can be created and added to the existing container on the fly after the file has been opened. Such changes to the container structure would have to be communicated directly to all other threads in the application, which introduces a potentially large number of synchronization points during runtime. Similar to the implementation with interleaving records, the overhead of guaranteeing a consistent view of the container would limit scalability; this extension is therefore not considered for SIONlib.
Partial support for multi-level parallelization with a variable number of threads can be implemented if parallel write operations can be performed outside the parallel region. The file container could then be organized in the following way: chunks are assigned only to MPI tasks. Similar to the coalescing approach, only one thread per MPI task writes the data on behalf of all threads. To allow the different data streams to be separated at read time, the data blocks have to be annotated with their corresponding thread number. This strategy fixes the number of chunks in the file container to the number of MPI tasks on the outer parallelization level. Furthermore, parallel read operations can be threaded on the inner parallelization level, because they do not change the container structure. One application that benefits from this strategy is the Score-P instrumentation and measurement infrastructure, because it supports hybrid programs with a variable number of threads (cf. Section 1.3.3). Score-P writes trace data on the outer level, whereas reading of trace data is multi-threaded.
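The annotation of data blocks with thread numbers can be sketched as follows. The record layout (a little-endian header with thread number and payload length, followed by the payload) and the function names `append_block` and `split_streams` are assumptions made for illustration only; they do not describe the format actually used by SIONlib or Score-P. An in-memory buffer stands in for the chunk of one MPI task.

```python
import io
import struct
from collections import defaultdict

# Hypothetical block header: thread number and payload length, both uint32.
HEADER = struct.Struct("<II")


def append_block(chunk: io.BytesIO, thread_id: int, payload: bytes) -> None:
    """Called by the single writing thread on behalf of thread `thread_id`."""
    chunk.write(HEADER.pack(thread_id, len(payload)))
    chunk.write(payload)


def split_streams(chunk_bytes: bytes) -> dict:
    """Separate the interleaved blocks of one chunk into per-thread streams."""
    streams = defaultdict(bytearray)
    offset = 0
    while offset < len(chunk_bytes):
        thread_id, length = HEADER.unpack_from(chunk_bytes, offset)
        offset += HEADER.size
        streams[thread_id] += chunk_bytes[offset:offset + length]
        offset += length
    return {tid: bytes(data) for tid, data in streams.items()}
```

Writing blocks for threads 0, 1, and 0 again and then calling `split_streams` on the chunk contents yields one contiguous byte stream per thread, in the order in which each thread's blocks were written. Since splitting only reads the chunk, it could be run by several reader threads at once without changing the container structure.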
3.3 Objectives and Strategy
The major objective of SIONlib is to provide efficient support for task-local I/O patterns at large scale. Recapitulating the findings from the previous discussion about traditional parallel task-local I/O and shared file I/O, SIONlib should fulfill the following goals:
• The general structure of application I/O patterns should be unchanged. In particular, the task-local representation of application data has to be kept. This eases the transition from standard POSIX I/O to parallel I/O with support of an I/O library.
• The library has to mitigate the limitations of parallel task-local I/O on current file systems, which are mainly caused by the large number of individual files. A transition from a file-per-task scheme to a shared file container with parallel I/O is mandatory.
• The solution should also eliminate the limitations of parallel I/O to a shared file at large scale, as described in the previous section. This means that the metadata overhead of accessing one big shared file from a large number of tasks has to be reduced.
• Existing software layers on current HPC systems should not be modified. Furthermore, all components should run in user space. Therefore, it is advisable to use POSIX I/O in the low-level interface to access the file system. On the application level, the solution should be integrated as a library into the parallel application, where it can use the communication layer of the application to exchange data internally.
• Access to the file data should be possible from parallel applications and tools as well as from serial applications. This allows an easy integration into existing workflows.
• The solution should support applications with different requirements with respect to data size and distribution. Examples are the support of small data chunks per task and the support of hybrid applications.
As required, SIONlib has to avoid large numbers of files due to limited metadata scalability. This leads to the basic strategy of SIONlib to use a file container instead of individual files as illustrated in Figure 3.6. Located as an additional software layer between a parallel application and the underlying parallel file system, SIONlib maps a large collection of logical task-local files onto a number of SIONlib shared files. The limitations of shared-file I/O at large scale motivated the design of the multi-file approach of SIONlib, which uses multiple physical files to represent the virtual shared file container. This strategy is described in more detail in Section 3.4, whereas the organization of the file container into multiple files and chunks for each
[Figure 3.6 diagram. Labels: Application tasks (T1, T2, T3, …, Tn-2, Tn-1, Tn), Serial program, SIONlib, Logical task-local files, Physical multi-file, Parallel file system.]
Figure 3.6: File-container concept of SIONlib. A large number of logical task-local files is mapped onto a single physical file (or a small set of physical files), called a multi-file. The multi-file can be accessed from both a parallel and a serial application.