Small data I/O at large scale - Efficient Task-Local I/O Operations of Massively Parallel Appli

4.2 I/O Benchmarks

4.2.6 Small data I/O at large scale

With the coalescing approach, SIONlib supports applications that need to store only a small amount of data per task (cf. Section 3.8). Because SIONlib normally extends each chunk to at least one file-system block, writing such data without the coalescing feature would leave a large amount of unused disk space in the SIONlib file container. SIONlib solves this issue by collectively aggregating the data on a smaller number of collector tasks, which write the data to the file container on behalf of the other tasks. A critical configuration parameter of coalescing I/O is the number of sender tasks that send their data to one collector. SIONlib implements a heuristic that derives a default number of collectors from the specified chunk sizes. Optionally, users can override this default number of tasks per collector (collsize).

Figure 4.13 shows the results of a parameter study on JUQUEEN to find the optimal parameter ranges for different chunk sizes. The tests were performed on one midplane of JUQUEEN with 64 tasks per compute node (CN), using one file per I/O node (ION), which results in four physical files.

As an example, the write measurement with a chunk size of 1 MiB is explained in the following. With a file-system block size of 4 MiB and fewer than four tasks per collector, the file-system blocks in the SIONlib file container are not filled. Therefore, GPFS has to handle up to four times as many file-system blocks. This leads to a linearly decreasing write time from one to four tasks per collector. In the next range, from four to 64 tasks per collector, the file-system blocks are completely filled (as all measurement points are multiples of four), which leads to a nearly constant writing time. Up to a collsize of 64 tasks, at least one collector is located on each CN. This changes with collsize values of 128 and more. In this case, collector tasks run only on a subset of the CNs, and only those CNs actively communicate with the IONs.
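
The block-filling argument above can be checked with a few lines of arithmetic. The following is a minimal sketch, not SIONlib code: it assumes that each collector pads its aggregated data to a whole number of file-system blocks, and it uses 512 nodes per midplane, which is the Blue Gene/Q midplane size and is not stated in the text. The function names `blocks_written` and `collectors_per_node` are hypothetical.

```python
MIB = 1024 * 1024

TASKS_PER_NODE = 64        # from the text
NODES_PER_MIDPLANE = 512   # Blue Gene/Q midplane size (assumption, not in the text)
FS_BLOCK = 4 * MIB         # file-system block size from the text


def blocks_written(n_tasks, chunk_bytes, collsize, fs_block=FS_BLOCK):
    """Count file-system blocks, assuming every collector pads its
    aggregated data to a whole number of blocks (assumption)."""
    if n_tasks % collsize:
        raise ValueError("sketch assumes collsize divides the task count")
    n_collectors = n_tasks // collsize
    aggregate = collsize * chunk_bytes
    blocks_per_collector = -(-aggregate // fs_block)  # ceiling division
    return n_collectors * blocks_per_collector


def collectors_per_node(collsize, tasks_per_node=TASKS_PER_NODE):
    """Average number of collectors per compute node."""
    return tasks_per_node / collsize


n_tasks = NODES_PER_MIDPLANE * TASKS_PER_NODE
fully_packed = n_tasks * MIB // FS_BLOCK  # blocks needed without any padding

for collsize in (1, 2, 4, 8, 64, 128):
    blocks = blocks_written(n_tasks, MIB, collsize)
    print(f"collsize {collsize:4d}: {blocks:6d} blocks "
          f"({blocks / fully_packed:.0f}x), "
          f"{collectors_per_node(collsize):.2f} collectors per CN")
```

Under these assumptions, a collsize of 1 touches four times as many blocks as a collsize of 4 or more, and from a collsize of 128 onwards there is less than one collector per CN on average.
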
As the I/O traffic on the ION is handled by a multi-threaded daemon, the concurrency of its threads is reduced as the number of collectors decreases, which has

[Figure 4.13, plots not reproduced. Panels: (a) Writing data, (b) Reading data. x-axis: # Tasks per Collector (1 to 8,192 in powers of two). y-axis: Time (seconds), with ticks at 1, 10, and 100. Series per panel: 100 bytes, 1 KiB, 100 KiB, 1 MiB.]

Figure 4.13: Evaluation of the influence of the parameter collsize on the time for writing and reading data of different size with coalescing I/O on one midplane of JUQUEEN (64 tasks per node).

[Figure 4.14, plots not reproduced. Panels: (a) Write Time, y-axis: Time (seconds), 0 to 80; (b) Write bandwidth, y-axis: Bandwidth (MiB/s), 0 to 160,000. x-axis in both panels: # Tasks (131,072 to 2,097,152). Series per panel: Write (1 KiB), Write (32 KiB), Write (1 MiB), Write (4 MiB).]

Figure 4.14: Scalability of coalescing I/O with SIONlib on JUQUEEN, using a collsize of 512 tasks: write time and bandwidth for different data sizes and numbers of nodes (64 tasks per node).

another positive effect on the writing time. The writing time decreases further until 256 tasks per collector and remains constant up to 512 tasks. Starting from 1024 tasks per collector, the bandwidth decreases linearly from initially 3500 MiB/s to 500 MiB/s at 8192 tasks per collector. The reason for this decrease is that too few I/O streams to each ION are in use. The bandwidth of a single I/O stream seems to saturate at less than 500 MiB/s. Consequently, at least four collectors should be used per ION to exploit the overall ION bandwidth of about 2 GiB/s. The read times and the runs with smaller data sizes show similar behavior. However, the runs with smaller data sizes are faster because the collectors have to transfer less data.

Following from the discussion above, a collsize of 512 tasks is optimal for the tested data sizes. Therefore, this number of tasks per collector was configured for the scalability benchmarks shown in Figure 4.14. The benchmarks were run for data sizes from 1 KiB to 4 MiB, from two racks of JUQUEEN up to the full system. Because of the reduced number of tasks that interact with the file system, the tests were configured to create one file per ION. While the two tests with the small chunk sizes of 1 KiB and 32 KiB created only small data sets of at most about 56 GiB, the other two tests with 1 MiB and 4 MiB were dominated by the I/O bandwidth. For example, the test with a chunk size of 4 MiB created about 7 TiB of data on disk on 28 racks. For the two smaller data sizes, the coalescing approach of SIONlib scales with nearly constant writing time up to one million tasks; the writing time ranges from below ten seconds to only 17 seconds at full scale with 1.8 million tasks.
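
To make the number of I/O streams per ION explicit, the following sketch counts the collectors that share one ION for the collsize values discussed above. It is plain arithmetic under stated assumptions, not SIONlib code: the midplane size of 512 nodes is the Blue Gene/Q value and is not given in the text, the four IONs per midplane follow from the four physical files, and the function name `collectors_per_ion` is hypothetical. The sketch makes no statement about the measured bandwidth.

```python
TASKS_PER_NODE = 64        # from the text
NODES_PER_MIDPLANE = 512   # Blue Gene/Q midplane size (assumption, not in the text)
IONS_PER_MIDPLANE = 4      # from the text: one file per ION gives four files


def collectors_per_ion(collsize,
                       nodes=NODES_PER_MIDPLANE,
                       ions=IONS_PER_MIDPLANE):
    """Average number of collector tasks (I/O streams) per ION."""
    n_tasks = nodes * TASKS_PER_NODE
    n_collectors = n_tasks / collsize
    return n_collectors / ions


for collsize in (512, 1024, 2048, 4096, 8192):
    print(f"collsize {collsize:5d}: "
          f"{collectors_per_ion(collsize):4.0f} collectors per ION")
```

With the selected collsize of 512, this gives 16 collectors per ION, and the ratio stays the same at larger scale as long as the number of IONs grows with the number of nodes.
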

As coalescing I/O is intended for large-scale applications with a small data size per task, the measurements demonstrate that SIONlib can also enable task-local I/O for those applications without structural changes to the code. Additionally, Figure 4.14 shows that the coalescing approach also scales for applications with larger chunk sizes. The achieved write bandwidth of 108 GiB/s for a chunk size of 4 MiB is in the same range as the full-scale measurements without coalescing I/O (cf. previous section).
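
The data volumes quoted for the scalability benchmark can be reproduced with a short calculation. This is a sketch under assumptions: 1,024 nodes per rack is the Blue Gene/Q value, which is not stated in the text but is consistent with 28 racks giving about 1.8 million tasks, and the time estimate assumes that the 108 GiB/s figure refers to the full-system run with 4 MiB chunks. All function names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB

TASKS_PER_NODE = 64    # from the text
NODES_PER_RACK = 1024  # Blue Gene/Q rack size (assumption, not in the text)


def n_tasks(racks):
    """Number of tasks on the given number of racks."""
    return racks * NODES_PER_RACK * TASKS_PER_NODE


def data_volume(racks, chunk_bytes):
    """Payload in bytes if every task writes one chunk."""
    return n_tasks(racks) * chunk_bytes


def transfer_time(volume_bytes, bandwidth_bytes_per_s):
    """Ideal transfer time in seconds at a sustained bandwidth."""
    return volume_bytes / bandwidth_bytes_per_s


for chunk in (KIB, 32 * KIB, MIB, 4 * MIB):
    volume = data_volume(28, chunk)
    print(f"{chunk // KIB:5d} KiB per task -> {volume / GIB:8.2f} GiB")

estimate = transfer_time(data_volume(28, 4 * MIB), 108 * GIB)
print(f"4 MiB chunks at 108 GiB/s: about {estimate:.0f} s")
```

The sketch yields 56 GiB for 32 KiB chunks and 7 TiB for 4 MiB chunks on 28 racks, matching the volumes given above; the time estimate ignores open, close, and communication costs.
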
