Scalability of Parallel Task-Local I/O - Efficient Task-Local I/O Operations of Massively Paral

servers and meta-data storage have to be added to the network as new abstract I/O nodes. For example, GPFS has its own meta-data server with own disk storage, but meta-data will be handled directly by the client daemons itself (cf. Figure 2.3).

2.2 Scalability of Parallel Task-Local I/O

The main reason to use task-local I/O in parallel applications is that the I/O can be easily implemented by using built-in function calls (e.g., with write/read-calls of POSIX, ANSI C, or Fortran). No additional I/O libraries are needed. These I/O calls are serial and they are not aware of the collective nature of application I/O. This leads to independent I/O streams, which are characteristic for task-local I/O in that way that the I/O calls are not synchronized between the tasks of a parallel application. Each I/O stream has one source (e.g., a process) and a destination, which is the individual file on the file system. Therefore, in terms of parallelism, task-local I/O can be seen as embarrassingly parallel.

On an ideal I/O infrastructure and file system, this approach should give the best performance, since waiting time at synchronization points between I/O tasks will not occur. Depending on the underlying I/O infrastructure, this type of I/O will give the best I/O throughput on a system. However, the I/O throughput can be limited if the I/O infrastructure has nodes/hubs on which multiple I/O streams are interleaved.

Another downside of the traditional parallel task-local I/O approach is the large number of files. They can cause problems on the file system and are difficult to handle by users. The next two sections will cover these aspects.

2.2.1 Parallel file creation

Trying to create tens of thousands of files simultaneously in the same directory may be serialized on the metadata servers. For example, on the IBM Blue Gene/Q system JUQUEEN described later in this document, the parallel creation of 1.8 million files in the same directory will take about 13 minutes. In general, parallel file systems handle metadata in a different way than the file data itself. Two types of metadata have to be considered for file creation: the inode of the created file and the inode of the directory that contains the files. The first one is not critical, because, the file inode is only accessed by the task that creates the file. In contrast, the inode of the directory is shared by all tasks. The directory’s inode stores the pointers to the inodes of files in this directory in a list data structure (cf. Figure 2.3a). Typically, the inodes are stored on special metadata disks, and can therefore be seen as special file objects. The file system has to serialize the access to these objects to ensure consistency (e.g. by file locking). Depending on the file system, the management of the directory’s inode is done by different components. Lustre has a special metadata server (MDS), which manages all inodes. GPFS has implemented the delegation of the responsibility for the management of a file to the first client that accesses an inode. As shown in Figure 2.3b, one of the GPFS clients has control over the inode; all other clients have to contact this client to get access to the inode. In addition, GPFS implements load balancing for directory inodes by distributing the inode over

directory inode f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 fn List of Entries … Tasks Tasks Tasks Tasks Serialization: file create

FS Block FS Block FS Block

(a) Inode update

… … N1,1 N1,m … Nn,1 Nn,m … … … … I/O Nodes Compute

Nodes Parallel File System

… … … … … Meta data server Meta data Meta data

(b) Inode control delegation in GPFS

Figure 2.3: Schematic view of parallel file creation. Tasks creating a file have to add an entry to the inode of the directory. As the directory inode exists only once, concurrent file creation of multiple tasks has to be serialized. GPFS optimize metadata handling by delegating the control over metadata to the first GPFS clients that opens a file.

multiple file-system blocks and distributing file entries across file-system blocks by comput- ing hash values from the file name. This optimization allows moderate parallel access to the inode, because the file-system blocks will be locked separately (GPFS FGDL, Fine Granular Directory Locking).

Figure 2.4 shows the timings for parallel file creation on JUQUEEN in a directory on the GPFS scratch file system (blue curve). The time for file creation scales approximately linearly with the number tasks. The file creation rate (red curve) increases first from 1400 files/s to a maximum 3800 files/s at 128k task. The rate decreases again to a value around 2200 files/s at larger scale, which seems to remain constant up to the full system. A reason for the decrease of the file creation rate can be that the parallelization strategy of GPFS FGDL scales only up to certain number of tasks, which seems to be reached on the JUQUEEN file system at 128k tasks.

The measurement results for file creation show impressively that task-local file I/O is not

24.8 26.9 34.7 73.8 240.6 410.2 777.8 1,388 2,441 3,782 3,554 2,179 2,556 2,359 0 5,000 10,000 15,000 20,000 25,000 10 100 1000 32,768 131,072 524,288 2,097,152 Fi le C re at ion R at e ( fil es /s ec on d) Cr ea te T ime (s ec onds ) # Tasks

File create time Files creation rate

1.835.008

Figure 2.4: Parallel file creation in one directory of the GPFS scratch file system on the Blue Gene/Q system JUQUEEN.

2.2 Scalability of Parallel Task-Local I/O

applicable for massively parallel applications. When using this I/O pattern for output files, the creation time has to be spent for each new set of files. Although checkpoint files can be reused after creating it at program start, in general, massively parallel applications have to abandon the traditional task-local file approach and need to implement parallel I/O to shared files. A popular workaround for the parallel file creation issue is to pre-create a reasonable number of sub-directories, and to distribute the files over those sub-directories. The inodes of the sub- directories are then managed from multiple GPFS clients in parallel. In this case, the number of directories should have the same order of magnitude as the number of tasks, to guaran- tee low file creation overhead. However, at large scale, this technique only multiplies the manageability issues of using task-local files as it increases the number of inodes needed for directories and files. In addition, this workaround only shifts the problem to the parallel create- operation of the sub-directories in the same parent directory. Albeit less expensive in terms of compute time, creating the files beforehand is inconvenient and requires maintaining some of the I/O functionality of an application separate from the main code. A script to generate the files during a preceding serial job would have to know number, names, and locations of the files, needing some form of agreement between the application and the script. Furthermore, the large number of files and directories has to be handled in all following post-processing steps of the simulation. These issues are described in the next section.

2.2.2 File management

Even with those workarounds described in the last sections, large numbers of files severely complicate file management on different levels. For example, copying files to a tape archive (e.g., during backup) may be significantly slowed down. Especially when archival requests from different users are executed in an interleaved fashion, different files of the same directory may end up on different tapes; making their later retrieval challenging or even impractical if the tape cartridge has to be exchanged too often.

Merging all of the files into a single file during a post-processing step, for example, using the tar command, also has disadvantages both in terms of the time needed to perform the operation and the at least temporary duplication of the required storage space. Not only access from a parallel application to individual files, but also from serial tools will become a bottleneck at large scale, as typical Unix-commands like ls or tar will process the files in a list-based execution order. The large execution time, which is caused by metadata operations (e.g., file open), will grow linearly with the number of files. This makes the administration of directories with tens of thousands of entries without support for group operations and automated filtering ineffective. Besides the high complexity of managing large numbers of files, large-scale file operations can cause side effects including temporary service disruptions that are noticeable by other users and that can jeopardize the stability of the overall system. To avoid such phe- nomena, some environments impose limits on the total number of files a user or a group of users can have, offering another good reason not to use one physical file per task.

In summary, the file management issues described in this section reveal another important reason to abandon task-local I/O patterns. As a solution, these task-local I/O patterns have to be replaced with shared-file I/O patterns.

In document Efficient Task-Local I/O Operations of Massively Parallel Applications (Page 48-51)