UVAS and remote memory access - The Heterogeneous Parallel Execution model

5.2 The Heterogeneous Parallel Execution model

5.2.2 UVAS and remote memory access

A common characteristic of the codes in Listings 5.1 and 5.2 is that the same virtual memory address might refer to multiple host and GPU memory locations on different computational units (i.e., CPU or accelerator). The most important consequence of this virtual memory aliasing is the inability to perform remote memory accesses between GPUs. Neither the host nor the GPUs can determine at run time which physical memory a given pointer variable refers to. Therefore, memory copy operations must explicitly name a source and destination GPU/host (e.g., cudaMemcpyHostToDevice in Listing 5.1, and device[id]/host in Listing 5.2).

CHAPTER 5. HETEROGENEOUS MULTI-GPU EXECUTION

 1 for (int i = 0; i < time_steps; ++i) {
 2   launch_stencil_kernel(out + left_off, in + left_off, size_left, stream3);
 3   launch_stencil_kernel(out + right_off, in + right_off, size_right, stream4);
 4   launch_stencil_kernel(out + center_off, in + center_off, size_center, stream1);
 5
 6   if (id == num_gpus - 1 && global_id < last_id) {
 7     // MPI send right and receive left
 8     cudaMemcpyAsync(r_bound_host, &out[r_bound_off], size, stream4);
 9     cudaStreamSynchronize(stream4);
10     MPI_Sendrecv(r_bound_host, size, MPI_FLOAT,
11                  r_neighbor, 0, MPI_COMM_WORLD,
12                  l_halo_host, size, MPI_FLOAT,
13                  l_neighbor, 0, MPI_COMM_WORLD, &status);
14     if (l_neighbor != MPI_PROC_NULL)
15       cudaMemcpyAsync(&in[l_halo_off], l_halo_host, size, stream3);
16   }
17   if (id == 0 && global_id > 0) { // MPI send left and receive right
18     cudaMemcpyAsync(l_bound_host, &out[l_bound_off], size, stream3);
19     cudaStreamSynchronize(stream3);
20     MPI_Sendrecv(l_bound_host, size, MPI_FLOAT,
21                  l_neighbor, 0, MPI_COMM_WORLD,
22                  r_halo_host, size, MPI_FLOAT,
23                  r_neighbor, 0, MPI_COMM_WORLD, &status);
24     if (r_neighbor != MPI_PROC_NULL)
25       cudaMemcpyAsync(&in[r_halo_off], r_halo_host, size, stream4);
26   }
27
28   if (id > 0) { // Right halo is ready (1a)
29     cudaStreamSynchronize(stream3);
30     sem_post(&write_sem[id-1]);
31   }
32   if (id < num_gpus - 1) { // Left halo is ready (1b)
33     cudaStreamSynchronize(stream4);
34     sem_post(&write_sem[id+1]);
35   }
36   if (id > 0) // Wait for neighbors (2)
37     sem_wait(&write_sem[id]);
38   if (id < num_gpus - 1)
39     sem_wait(&write_sem[id]);
40   cudaStreamSynchronize(stream1);
41
42   tmp = in; in = out; out = tmp; // Exchange pointers
43 }


HPE defines a Unified Virtual Address Space (UVAS), where a virtual memory address unequivocally identifies a single location in a GPU/host physical memory. This feature allows the host or any GPU to easily determine the source and destination memories of memory transfer operations. By coupling the UVAS with hardware support for remote memory transfers, GPUs can transparently access remote memory locations through regular pointers. Listing 5.3 illustrates the programmability benefits provided by the UVAS in our RTM example. Figure 5.4 also shows that the synchronization scheme is much simpler. First, the GPU-to-GPU memory copies that implement the domain boundary exchange between accelerators in the same node are removed, since the kernel code directly accesses the boundary data of the neighboring domains. The kernel launch now receives an additional parameter id that identifies the current domain and is used by the kernel code to determine the index of the pointers that belong to the neighboring domains. The outermost CPU threads must still copy the boundary data to an intermediate host memory buffer (lines 8 and 18) before exchanging halo data with neighboring MPI processes for the next iteration (lines 10 and 20). Another host-to-GPU copy is needed to update the halo data in the corresponding GPU memories (lines 15 and 25). Then, all CPU threads signal that the boundary data (i.e., halo data for other CPU threads) is available for the next kernel call using the write_sem semaphore of the left and right neighbors (lines 30 and 34, arcs 1a and 1b). Finally, each CPU thread waits for the neighbor CPU threads to finish (lines 37 and 39) before exchanging the input and output pointers.

Another benefit of the UVAS is that memory copy calls require fewer parameters (lines 8, 15, 18 and 25). The UVAS enables the runtime system to determine both the source and destination GPU/host of a memory copy by inspecting the source and destination addresses, eliminating the need for programmers to specify them.
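The address inspection described above can be sketched in plain C. This is a minimal illustration, not the HPE runtime: the range table, the names uvas_range_t, uvas_owner and uvas_route_copy, and the convention that -1 denotes host memory are all assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define UVAS_HOST    (-1)  /* assumed marker for host memory */
#define UVAS_UNKNOWN (-2)  /* address not owned by any registered memory */

/* Hypothetical descriptor: one contiguous UVAS range per physical memory. */
typedef struct {
    uint64_t base;    /* first UVAS address owned by this memory */
    uint64_t size;    /* length of the range in bytes */
    int      device;  /* UVAS_HOST or a GPU index */
} uvas_range_t;

/* Return the owner of addr, or UVAS_UNKNOWN if no range contains it. */
int uvas_owner(const uvas_range_t *ranges, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; ++i) {
        if (addr >= ranges[i].base && addr - ranges[i].base < ranges[i].size)
            return ranges[i].device;
    }
    return UVAS_UNKNOWN;
}

/* Source and destination memories inferred from the two addresses alone. */
typedef struct {
    int src_device;
    int dst_device;
} copy_route_t;

/* Argument order follows memcpy: destination first, then source. */
copy_route_t uvas_route_copy(const uvas_range_t *ranges, size_t n,
                             uint64_t dst, uint64_t src)
{
    copy_route_t route;
    route.src_device = uvas_owner(ranges, n, src);
    route.dst_device = uvas_owner(ranges, n, dst);
    return route;
}
```

Because every address belongs to exactly one range, a copy call only needs the two pointers and a size; the direction argument of the aliased model becomes redundant.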

Implementation notes

In order to provide the UVAS in systems with pre-Fermi GPUs, we propose a software implementation based on memory segmentation (Figure 5.5). A virtual memory subspace is assigned to the host and to each GPU present in the system. The maximum size of each memory subspace is given by the number of bits in GPU physical addresses (e.g., 40 bits for NVIDIA GPUs, i.e., 1 TB). Each memory subspace only contains mappings for data hosted in one physical memory. We use the upper bits of the virtual address to identify the GPU where the data is hosted. The HPE runtime assigns a bit pattern to each GPU at initialization time. On API calls taking pointers as input parameters (e.g., memory copy operations), HPE determines the virtual address subspaces involved in the operation. In the GPU code, the bits that identify the virtual address subspace must be discarded in each memory access. Some processors already ignore the upper bits of the address; otherwise, this transformation can be transparently inserted by the compiler. For example, a pointer to virtual address 0x00020000001000 is truncated to 0x0000001000, which is a valid GPU physical address, and 2 is used to identify the GPU that holds the data.

Figure 5.5: Software-based Unified Virtual Address Space implementation using segmentation.
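The bit manipulation behind this segmentation scheme can be sketched as follows. The 40-bit physical address width comes from the text; the helper names (uvas_make, uvas_segment, uvas_physical) and the use of a 64-bit integer to hold a virtual address are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* 40-bit GPU physical addresses, as in the NVIDIA example in the text. */
#define UVAS_PHYS_BITS 40u
#define UVAS_PHYS_MASK ((UINT64_C(1) << UVAS_PHYS_BITS) - 1)

/* Hypothetical helper: combine a subspace id with a physical address. */
uint64_t uvas_make(uint64_t segment, uint64_t phys)
{
    assert(phys <= UVAS_PHYS_MASK);  /* must fit inside one subspace */
    assert(segment < (UINT64_C(1) << (64u - UVAS_PHYS_BITS)));
    return (segment << UVAS_PHYS_BITS) | phys;
}

/* Upper bits: which host/GPU subspace holds the data. */
uint64_t uvas_segment(uint64_t vaddr)
{
    return vaddr >> UVAS_PHYS_BITS;
}

/* Lower 40 bits: the address actually issued to the GPU memory. */
uint64_t uvas_physical(uint64_t vaddr)
{
    return vaddr & UVAS_PHYS_MASK;
}
```

Applied to the example address, uvas_segment yields 2 and uvas_physical yields 0x1000. The masking step in uvas_physical is the transformation that the compiler would insert before each memory access on processors that do not ignore the upper address bits.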

However, the software segmentation technique has some limitations. For example, since each virtual address subspace is mapped to the whole contiguous physical address space of an accelerator, it is not possible to transparently distribute data structures by mapping different virtual address ranges across GPUs. Therefore, data structures must be split into chunks, and different pointers must be used to access the appropriate chunks. Conversely, GPUs with virtual memory support (e.g., Fermi and later) can have a contiguous representation of a distributed data structure in the UVAS and, therefore, only need a single pointer for the whole data structure.
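The chunking burden that segmentation places on the programmer can be illustrated with a small sketch. The chunked_array_t type and the chunked_at helper are hypothetical, and equal-sized chunks are an assumption; ordinary host arrays can stand in for per-GPU allocations when trying it out.

```c
#include <stddef.h>

/* Hypothetical view of an array split into equal chunks, one per GPU. */
typedef struct {
    float **chunks;       /* one base pointer per chunk */
    size_t  num_chunks;   /* number of chunks (e.g., number of GPUs) */
    size_t  chunk_elems;  /* elements stored in each chunk */
} chunked_array_t;

/* Translate a global index into a pointer inside the owning chunk.
   Returns NULL when the index falls outside the distributed array. */
float *chunked_at(const chunked_array_t *a, size_t i)
{
    size_t chunk = i / a->chunk_elems;
    if (chunk >= a->num_chunks)
        return NULL;
    return &a->chunks[chunk][i % a->chunk_elems];
}
```

With hardware virtual memory, this index translation disappears: the chunks can be mapped back to back in the UVAS and addressed through one base pointer.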

In document On the programmability of multi-GPU computing systems (Page 72-75)