
2.4 Introduction to Parallel I/O

2.4.4 Access Patterns

Typically, scientific I/O consists of large amounts of data written to files in a structured or even sequential, append-only way. Most HPC storage systems employ a parallel file system such as Lustre or GPFS to hide the complexity of the underlying storage infrastructure, e.g., spinning disks and RAID arrays, and to provide a single address space for reading and writing files. Applications commonly use one of three I/O patterns [44] to interact with the parallel file system.

2.4.4.1 Single Writer I/O

In the single writer I/O pattern, also known as sequential I/O, one process aggregates data from all other processes and then performs I/O operations on one or more files. The ratio of writers to running processes is 1 to N, as depicted in Figure 2.13.

2 Communication and I/O in HPC Systems

Figure 2.13: Single writer I/O pattern.

Figure 2.14: File-per-process I/O pattern.

This pattern is simple to implement and easy to manage, and it can provide good performance for very small I/O sizes. However, single writer I/O does not scale to large application runs because it is limited by a single I/O process. It performs inefficiently for large numbers of processes or large data sizes, leading to a linear increase in the time spent writing.
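The single writer pattern can be sketched in a few lines. The following is a minimal, single-process simulation and not an MPI program: the dictionary `rank_data` and the function name `single_writer` are hypothetical, and the gather step stands in for what a real application would do with a collective gather to rank 0.

```python
import os
import tempfile


def single_writer(rank_data, path):
    """Simulate single writer I/O: one process gathers all buffers and writes them."""
    # Gather step: concatenate the per-rank buffers in rank order.
    # In an MPI program this would be a gather to the writing rank.
    gathered = b"".join(rank_data[rank] for rank in sorted(rank_data))
    # Write step: a single process issues one sequential write stream.
    with open(path, "wb") as f:
        f.write(gathered)
    return len(gathered)


if __name__ == "__main__":
    # Four simulated ranks, each holding four bytes.
    data = {rank: bytes([65 + rank]) * 4 for rank in range(4)}
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "output.dat")
        print(single_writer(data, out), "bytes written by one process")
```

Because every byte passes through the one writing process, both the gather and the write are serialized, which is where the scaling limit described above comes from.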

2.4.4.2 File-Per-Process I/O

In the file-per-process I/O pattern, each process performs I/O operations on its own files, as shown in Figure 2.14. If an application runs with N processes, N or more files are created and accessed (N:M ratio with N ≤ M). Up to a certain point, this pattern can be very fast, but its performance is bounded by each individual process that performs I/O. It is the simplest implementation of parallel I/O that can take advantage of an underlying parallel file system.

On the downside, file-per-process I/O can quickly accumulate many files. Parallel file systems often perform well with this type of access up to several thousand files, but synchronizing metadata for a large collection of files introduces a potential bottleneck; e.g., even a simple ls can break when run on a directory containing thousands of files. In addition, an increasing number of simultaneous disk accesses creates contention for file system resources.
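A minimal sketch of the file-per-process pattern follows. It is an illustration under stated assumptions: threads stand in for processes, and the file naming scheme `out.<rank>.dat` and the function names are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_own_file(directory, rank, payload):
    """Each rank writes only to its own file, so no coordination is needed."""
    path = os.path.join(directory, f"out.{rank:05d}.dat")
    with open(path, "wb") as f:
        f.write(payload)
    return path


def file_per_process(directory, rank_data):
    """Run one writer per rank concurrently and return the created paths."""
    # Threads are used here only to simulate independent processes.
    with ThreadPoolExecutor(max_workers=len(rank_data)) as pool:
        futures = [
            pool.submit(write_own_file, directory, rank, payload)
            for rank, payload in sorted(rank_data.items())
        ]
        return [future.result() for future in futures]


if __name__ == "__main__":
    data = {rank: bytes([65 + rank]) * 4 for rank in range(8)}
    with tempfile.TemporaryDirectory() as tmp:
        paths = file_per_process(tmp, data)
        print(len(paths), "files created for", len(data), "ranks")
```

The number of files in the directory grows with the number of ranks, which is the source of the metadata pressure described above.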

2.4.4.3 Single Shared File I/O

Figure 2.15: Single shared file I/O patterns. (a) Independent. (b) Collective buffering.

The single shared file I/O pattern allows many processes to share a common file handle but write to exclusive regions of a file. Figure 2.15a displays the independent variant of this I/O pattern, where all processes of an application write to the same file. The data layout within the file is very important to prevent concurrent accesses to the same region. Contending processes can introduce significant overhead, since the file system uses a lock manager to serialize access and guarantee file consistency. The advantage of the single shared file I/O pattern lies in data management and portability, e.g., when using a high-level I/O library such as HDF5.
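The independent variant depends on every process writing to its own, non-overlapping region. The sketch below illustrates one possible layout, assuming fixed-size blocks so that rank r owns the byte range starting at r * block_size. Threads stand in for processes, the function names are hypothetical, and `os.pwrite` is only available on Unix-like systems.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_region(fd, rank, payload, block_size):
    """Write one rank's block at its exclusive offset in the shared file."""
    if len(payload) != block_size:
        raise ValueError("every rank must write exactly block_size bytes")
    offset = rank * block_size  # regions of different ranks never overlap
    written = 0
    while written < block_size:
        # pwrite takes an explicit offset, so no shared file position is used.
        written += os.pwrite(fd, payload[written:], offset + written)
    return offset


def shared_file_independent(path, rank_data, block_size):
    """All ranks write concurrently to one file through a shared descriptor."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    try:
        with ThreadPoolExecutor(max_workers=len(rank_data)) as pool:
            futures = [
                pool.submit(write_region, fd, rank, payload, block_size)
                for rank, payload in rank_data.items()
            ]
            return sorted(future.result() for future in futures)
    finally:
        os.close(fd)


if __name__ == "__main__":
    data = {rank: bytes([65 + rank]) * 4 for rank in range(4)}
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "shared.dat")
        print("offsets:", shared_file_independent(out, data, block_size=4))
```

With variable-sized contributions, the offsets would instead have to be derived from the sizes of all lower ranks, for example with a prefix sum.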

Figure 2.15b displays the collective buffering variant of this pattern. This technique improves the performance of shared file access by offloading some of the coordination work from the file system to the application. Subsets of processes are grouped under so-called aggregators, which perform parallel I/O to shared files. This increases the number of shared files and improves file system utilization. It also decreases the number of processes that access a shared file and therefore mitigates file system contention.
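The grouping step of collective buffering can be sketched as follows. This is a simplified illustration and not the implementation of any particular MPI-IO library: it assumes ranks numbered 0 to N-1 with equal-sized blocks, a single shared file, and the first rank of each group acting as its aggregator. The aggregators run one after another here, whereas in practice they would write in parallel. All names are hypothetical, and `os.pwrite` requires a Unix-like system.

```python
import os
import tempfile


def assign_aggregators(num_ranks, group_size):
    """Map each aggregator rank to the ranks whose data it collects."""
    return {
        start: list(range(start, min(start + group_size, num_ranks)))
        for start in range(0, num_ranks, group_size)
    }


def collective_write(path, rank_data, block_size, group_size):
    """Let only the aggregators touch the file; return how many of them wrote."""
    groups = assign_aggregators(len(rank_data), group_size)
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    aggregator_writes = 0
    try:
        for aggregator, members in groups.items():
            # Collect the group's blocks into one contiguous buffer.
            buffer = b"".join(rank_data[rank] for rank in members)
            # The group's region starts at the aggregator's own block.
            offset = aggregator * block_size
            done = 0
            while done < len(buffer):
                done += os.pwrite(fd, buffer[done:], offset + done)
            aggregator_writes += 1
    finally:
        os.close(fd)
    return aggregator_writes


if __name__ == "__main__":
    data = {rank: bytes([65 + rank]) * 2 for rank in range(8)}
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "shared.dat")
        count = collective_write(out, data, block_size=2, group_size=4)
        print(count, "aggregators wrote on behalf of", len(data), "ranks")
```

In this sketch, eight ranks reach the file through two aggregators, so the file system sees two larger contiguous requests instead of eight small ones.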

Chapter 3

Extoll System Environment

Interconnection networks are a key component in large-scale HPC deployments and play a vital role in the overall system performance. They must offer high communication bandwidth and low latency to cope with the nature of the mixed system workloads, which include data-intensive and communication-intensive applications. Essentially, networks are the backbone for both communication and I/O, and connect compute nodes, I/O devices, and storage devices. The design of future HPC interconnect technologies faces multiple challenges, including scalability, efficiency, heterogeneity, resiliency, virtualization, cost and power consumption.

Extoll is such a high-performance interconnect technology [55, 56], which started off as a research project at Heidelberg University in 2009. Its key objective is to fit the needs of future HPC applications in terms of latency, bandwidth, message rate [57], and scalability. Released in 2016, the first Application-Specific Integrated Circuit (ASIC) version of Extoll is the Tourmalet chip. Extoll has been chosen as the fundamental technology for the research work presented in Chapters 4 through 6 and is described in the following sections.

3.1 Technology Overview

The top-level diagram of the Extoll technology is displayed in Figure 3.1. The design is divided into three parts: the host interface, the network interface, and the network.

Figure 3.1: Overview of the Extoll NIC top-level diagram. (LP = Link Port, NP = Network Port, Req = Requester, Resp = Responder, Cmp = Completer)

The network part of the NIC provides six links to build a 3D torus network topology. The Extoll NIC implements a direct network approach by integrating switching logic on-chip, which removes the need for external switches. An additional seventh link can be used to connect network-attached memory [58], accelerators, or special purpose cards to the network, or to provide network connectivity to the outside world without breaking up the torus topology. The link ports are connected to a crossbar (network switch) that either routes messages between the link ports or from the network to the functional units. The crossbar provides features such as virtual channels, packet retransmission, multicast groups, and table-based routing. Each crossbar port has its own local routing table. Routing entries specify whether to forward an incoming packet to the NIC's network interface units or to reroute it to another crossbar port, which forwards the packet to a node that is closer to its destination. Along with deterministic routing for each link port, adaptive routing is supported to avoid network congestion and to improve network utilization.
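To make the six-link torus and the idea of table-based routing more concrete, the following sketch computes the six neighbors of a node and fills a routing table that maps each destination either to an outgoing link port or to the local network interface. It is an illustration only: the text does not specify the contents of Extoll's routing tables, so dimension-order routing is used here as an assumed stand-in, one table per node replaces the per-port tables, and the port names (`x+`, `x-`, ...) and the `LOCAL` marker are hypothetical.

```python
def torus_neighbors(coord, dims):
    """Return the six direct neighbors of a node in a 3D torus, keyed by port."""
    x, y, z = coord
    dx, dy, dz = dims
    return {
        "x+": ((x + 1) % dx, y, z),
        "x-": ((x - 1) % dx, y, z),
        "y+": (x, (y + 1) % dy, z),
        "y-": (x, (y - 1) % dy, z),
        "z+": (x, y, (z + 1) % dz),
        "z-": (x, y, (z - 1) % dz),
    }


def next_port(src, dst, dims):
    """Assumed dimension-order routing: resolve x, then y, then z,
    taking the shorter way around each ring."""
    for axis, name in enumerate("xyz"):
        size = dims[axis]
        forward = (dst[axis] - src[axis]) % size  # hops in the + direction
        if forward == 0:
            continue  # this dimension is already resolved
        return f"{name}+" if forward <= size - forward else f"{name}-"
    return "LOCAL"  # packet has arrived; hand it to the network interface


def build_routing_table(node, dims):
    """Deterministic table for one node: destination -> port or LOCAL."""
    dx, dy, dz = dims
    return {
        (x, y, z): next_port(node, (x, y, z), dims)
        for x in range(dx)
        for y in range(dy)
        for z in range(dz)
    }


if __name__ == "__main__":
    dims = (4, 4, 4)
    table = build_routing_table((0, 0, 0), dims)
    print(table[(3, 0, 0)], table[(1, 2, 0)], table[(0, 0, 0)])
```

Each lookup yields exactly one outgoing port per destination, which corresponds to the deterministic case; an adaptive scheme would choose among several candidate ports based on congestion.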

The Tourmalet chip comes with a PCI Express Gen3 x16 host interface. An on-chip network with high throughput and low latency connects the functional units of the network interface with the host interface. Another feature of the Extoll NIC is that it can be configured to act as a PCIe root port. With this functionality, the NIC can configure other PCIe endpoint devices connected to its host interface.

The network interface provides hardware support for different communication models, including Remote Direct Memory Access (RDMA), Message Passing Interface (MPI), and Partitioned Global Address Space (PGAS).

In document Accelerating Network Communication and I/O in Scientific High Performance Computing Environments (Page 47-52)