5.8.4 Loading as a parallel service

Parallel loading is an example of a case in which a sequential solution applied by each of P processes does not yield good parallel performance. The key observation is that most processes request the same objects from the file system, or at least objects that have also been loaded by nearby processes. Rather than accessing the remote file system each time a file is needed, the likelihood that a neighbor has already requested the file should be exploited. Thus, I/O can be coordinated to distribute files much more efficiently and the stress on the limited file system resources can be reduced.
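The coordination idea can be illustrated with a small simulation. The following is a minimal sketch, not Spindle's implementation: it assumes that the processes are arranged in a binary tree (a topology chosen here purely for illustration), that each process forwards a request to its parent when the file is not in its local cache, and that only the root accesses the file system. `CountingFS`, `Node`, and `build_tree` are names invented for this example.

```python
class CountingFS:
    """Stand-in for the shared file system that counts read requests."""

    def __init__(self, files):
        self.files = files
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.files[path]


class Node:
    """One process with a local cache and an optional parent process."""

    def __init__(self, parent=None):
        self.parent = parent
        self.cache = {}

    def load(self, path, fs):
        if path in self.cache:
            return self.cache[path]
        if self.parent is None:
            data = fs.read(path)               # only the root hits the file system
        else:
            data = self.parent.load(path, fs)  # ask the neighbor first
        self.cache[path] = data
        return data


def build_tree(num_procs):
    """Arrange processes in a binary tree; process 0 is the root."""
    nodes = []
    for i in range(num_procs):
        parent = nodes[(i - 1) // 2] if i > 0 else None
        nodes.append(Node(parent))
    return nodes


if __name__ == "__main__":
    libs = {"libA.so": b"A", "libB.so": b"B"}
    fs = CountingFS(libs)
    for node in build_tree(16):
        for path in libs:
            node.load(path, fs)
    # Uncoordinated loading would issue 16 * 2 = 32 reads.
    print("file system reads with coordination:", fs.reads)
```

In this model the number of file system reads depends only on the number of distinct files, not on the number of processes.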

Alternative techniques, like DVS (discussed above), attempt to increase the level of I/O coordination transparently at the file-system level while keeping loader behavior fixed. These techniques have the advantage of maintaining existing abstractions, such as POSIX I/O, which are familiar to users of Unix-like operating systems and which make sense in a sequential environment. However, the strict semantics of such abstractions can limit their scalability in a parallel environment. As an example, POSIX file I/O semantics disallow caching of failed open calls, forcing every library search query to go all the way to the file system. Further, traditional file I/O abstractions are oblivious to the type of data being transferred, which precludes many parallel optimizations.
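The cost of uncached failed opens can be made concrete by counting probes during a library search. The sketch below is an illustration under stated assumptions, not a description of any real loader: `search` is a hypothetical helper, the search path and file names are made up, and the optional `negative_cache` stands for a loader-level cache of failed lookups that a layer bound to strict POSIX semantics could not keep.

```python
def search(lib, search_path, existing, negative_cache=None):
    """Look for `lib` in each directory of `search_path`.

    `existing` is the set of paths present on the (simulated) file system.
    Returns (found_path_or_None, number_of_file_system_probes).
    """
    probes = 0
    for directory in search_path:
        candidate = f"{directory}/{lib}"
        if negative_cache is not None and candidate in negative_cache:
            continue                      # known miss, no file system access
        probes += 1                       # one open() attempt
        if candidate in existing:
            return candidate, probes
        if negative_cache is not None:
            negative_cache.add(candidate)  # remember the failed open
    return None, probes


if __name__ == "__main__":
    path = ["/lib", "/usr/lib", "/opt/app/lib"]   # hypothetical search path
    existing = {"/opt/app/lib/libfoo.so"}
    uncached = sum(search("libfoo.so", path, existing)[1] for _ in range(4))
    cache = set()
    cached = sum(search("libfoo.so", path, existing, cache)[1] for _ in range(4))
    print("probes without negative cache:", uncached)
    print("probes with negative cache:", cached)
```

A cache of failed lookups becomes stale if a file appears later. That is acceptable only when the cached files are known not to change during the run, which is the kind of knowledge available at the loader level.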

In the Spindle approach, the abstraction is raised to the level of the loader, which allows the loader to perform its own coordinated I/O and to exploit knowledge about which files are used in parallel. Object code is nearly always read-only, and parallel loading offers ample parallelism. The performance advantages of exploiting both of these characteristics are too great to be ignored in a parallel environment. For this reason, the loader architecture should be changed for parallel machines. Spindle represents a significant step towards such a truly massively parallel loading service architecture.

As described, Spindle implements a new approach for supporting dynamic loading at large scale that does not require changes to system software or hardware. Further, executables do not have to be recompiled or relinked to be supported by Spindle. The next chapter presents measurements with Spindle that demonstrate its scalability.

6 Evaluation of Parallel Loading with Spindle

As described in the previous chapter, Spindle is designed to reduce the load time of parallel applications that make extensive use of DSOs. To verify this and to show the scalability of the approach, Spindle was tested with two benchmarks that run automatically generated code and load DSOs at program start or during runtime. The first is CLoadtest, a benchmark that generates a pure C program and library sources. The second is the synthetic benchmark Pynamic [61], which was co-designed with a real application at LLNL to emulate that application's behavior as a proxy benchmark. CLoadtest was evaluated on the JSC system JUROPA and was used for the initial performance studies on JUQUEEN; Pynamic was tested on the Sierra Linux cluster installed at LLNL. The following sections introduce both benchmark programs and present the results of running them on the two systems JUROPA and Sierra. Finally, the memory footprint of Spindle is discussed.

6.1 Simple Loader Benchmark

Essentially, the simple loader benchmark CLoadtest is a code generator that produces the benchmark code and the corresponding libraries. The main program and the libraries are generated as pure C code and have to be compiled before running the benchmark. In this simple benchmark, dynamic loading is tested only for DSOs that are directly linked to the main program. Therefore, only the application startup phase can be measured. More sophisticated configurations, such as DSO loading at runtime, are tested with the Pynamic benchmark, which is described in the next section.

[Figure 6.1 diagram: the code generator takes the inputs # libs, code size, # functions, and # C-files/library, and produces the main program (driver_main.rts), the DSOs (lib_001.so, … containing function_001(), function_002(), function_003(), …, function_<n>()), and a Makefile.]

Figure 6.1: The code generator of CLoadtest creates the benchmark program and the DSOs. Libraries and main program are compiled from pure C-code sources.

As depicted on the left of Figure 6.1, the code generator can be configured with a number of input parameters: the number of libraries, the code size, the number of functions in a library, and the number of C source code files for each library. This set of parameters allows the benchmark to test different aspects of the dynamic loader. For example, on the one hand, large numbers of libraries produce high load on the metadata servers. On the other hand, a large code size generates large library files, which require high bandwidth to the file system (or Spindle caches) during program startup. The third parameter describes the number of functions in each library file. The main program calls only the first function of each library directly. A cascading call tree links together the remaining library functions, so that in the end each function is executed exactly once. The code generator can produce large C files, depending on the selected values for code size and number of functions. To limit the file size and the time to compile a library, the code generator splits the library’s source code into multiple files. The number of C files per library is therefore an extra input parameter. In addition to the code files, the generator produces a Makefile, which is used to compile all libraries and to generate the dynamically linked program executable. For comparison, the Makefile also produces a statically linked program executable.
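The structure of the generated code can be sketched with a small generator. This is an illustrative reconstruction under assumptions, not the CLoadtest source: the cascading call tree is simplified to a linear chain in which each function calls the next one, functions are distributed round-robin over the C files, the code-size parameter and the Makefile are omitted, and all identifiers (`function_name`, `generate_library`, `generate_main`, the `lib_001_function_001` naming scheme, and the file names) are hypothetical.

```python
def function_name(lib_id, func_id):
    # Prefix with the library id so that symbols from different DSOs do not clash.
    return f"lib_{lib_id:03d}_function_{func_id:03d}"


def generate_library(lib_id, num_functions, num_files):
    """Return {file name: C source} for one library.

    Function j calls function j + 1, so a single call to the first
    function executes every function of the library exactly once.
    """
    file_names = [f"lib_{lib_id:03d}_{k:03d}.c" for k in range(1, num_files + 1)]
    sources = {name: "" for name in file_names}
    for j in range(1, num_functions + 1):
        name = function_name(lib_id, j)
        if j < num_functions:
            nxt = function_name(lib_id, j + 1)
            code = (f"int {nxt}(int calls);\n"
                    f"int {name}(int calls) {{ return {nxt}(calls + 1); }}\n")
        else:
            code = f"int {name}(int calls) {{ return calls + 1; }}\n"
        # Round-robin split keeps each C file small.
        sources[file_names[(j - 1) % num_files]] += code
    return sources


def generate_main(num_libs, num_functions):
    """Return C source of a main program that calls the first function of each library."""
    entry_points = [function_name(i, 1) for i in range(1, num_libs + 1)]
    decls = "".join(f"int {f}(int calls);\n" for f in entry_points)
    calls = "".join(f"    total += {f}(0);\n" for f in entry_points)
    expected = num_libs * num_functions  # every function runs once
    return (f"{decls}int main(void) {{\n    int total = 0;\n{calls}"
            f"    return total == {expected} ? 0 : 1;\n}}\n")


if __name__ == "__main__":
    for file_name, source in generate_library(1, 3, 2).items():
        print(f"/* {file_name} */\n{source}")
    print(generate_main(2, 3))
```

In this sketch the generated main program returns 0 only if the call counts add up to the total number of functions, which gives a simple check that every function was reached.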

The benchmark CLoadtest was used for initial measurements of traditional dynamic loading on JUQUEEN and JUROPA, as described in Section 2.3. In addition, it was used to evaluate Spindle’s performance on the JUROPA system (cf. Section 6.3). In particular, the simple approach of using only libraries with automatically generated C functions and no external dependencies made it easy to port and run the benchmark on these systems. Furthermore, this approach excludes external sources that could interfere with the measurements (e.g., external library files on other file systems or the loading of additional files); CLoadtest therefore represents a low-level benchmark for dynamic loading.

In document Efficient Task-Local I/O Operations of Massively Parallel Applications (Page 130-133)