Partitioned Global Address Space Models - X10 for high performance scientific computing

work items, which are processed in work groups (mirroring the architecture of a typical GPU). There are mechanisms for allocating and sharing memory between threads in a work group, and for synchronizing within a group. Kernels are executed asynchronously on the accelerator through a queuing system, and may also be executed on a standard CPU. The OpenCL memory model distinguishes between the main memory of the host device and the memory of the accelerator device, and an API is provided to manage transfers between the two. Device memory is further divided into private memory, local memory accessible by all threads in a work group, constant memory that is readable by all threads, and global memory. The Compute Unified Device Architecture (CUDA) framework [NVIDIA, 2013] is similar to OpenCL, but is specific to NVIDIA GPUs.

The OpenCL and CUDA programming models reflect the divisions in memory of typical GPU architectures, and the cost of transferring data between portions of the memory. However, they also allow threads to share memory using a relaxed memory consistency model. As such, they combine elements of both distributed and shared memory models. The close mapping between these models and their target architectures (GPUs) supports the development of high performance codes; however, it may not be possible to achieve performance portability to non-GPU architectures.

2.3 Partitioned Global Address Space Models

The partitioned global address space (PGAS) model is similar to the shared memory programming model in that all threads of execution have access to a global shared memory space. However, in the PGAS model, the memory space is divided into local partitions, with an implied cost to moving data between partitions. The programmer has control over the placement of data and consequently computation over those data, which is critical for high performance on modern computers [Yelick et al., 2007a].
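The cost of crossing partitions can be illustrated with a small single-process model. The following is only a sketch: the class PartitionedArray, its block layout, and the remote-access counter are hypothetical names introduced here for illustration and do not correspond to any particular PGAS language or runtime.

```python
class PartitionedArray:
    """A global array of n elements split into contiguous blocks, one per place.

    Hypothetical illustration only: every access names the place it is issued
    from, and accesses that cross a partition boundary are counted.
    """

    def __init__(self, n, num_places):
        self.n = n
        self.num_places = num_places
        self.block = -(-n // num_places)  # ceiling division
        self.parts = [
            [0] * max(0, min(self.block, n - p * self.block))
            for p in range(num_places)
        ]
        self.remote_accesses = 0

    def owner(self, i):
        """Place whose local partition holds global index i."""
        return i // self.block

    def read(self, here, i):
        p = self.owner(i)
        if p != here:
            self.remote_accesses += 1  # data must move between partitions
        return self.parts[p][i - p * self.block]

    def write(self, here, i, value):
        p = self.owner(i)
        if p != here:
            self.remote_accesses += 1
        self.parts[p][i - p * self.block] = value


def sum_from_one_place(arr, here=0):
    """One place reads every element, local or not."""
    return sum(arr.read(here, i) for i in range(arr.n))


def sum_owner_computes(arr):
    """Each place sums only its own partition; partial sums are then combined.

    Combining the partial sums would itself need a few small transfers in a
    real system; that cost is not modelled here.
    """
    partials = []
    for p in range(arr.num_places):
        lo = p * arr.block
        partials.append(sum(arr.read(p, lo + k) for k in range(len(arr.parts[p]))))
    return sum(partials)


def demo(n=8, num_places=4):
    arr = PartitionedArray(n, num_places)
    for i in range(n):
        arr.write(arr.owner(i), i, i * i)  # initialise each element at its owner
    central = sum_from_one_place(arr, here=0)
    remote_central = arr.remote_accesses
    local = sum_owner_computes(arr)
    remote_local = arr.remote_accesses - remote_central
    return central, remote_central, local, remote_local
```

Both strategies compute the same sum, but only the first touches data outside the issuing place's partition. Placing the computation with the data is the kind of control the PGAS model gives the programmer.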

2.3.1 UPC

A reliable approach to implementing the PGAS model is to extend an existing programming language. UPC [UPC Consortium, 2005] extends ANSI C with shared (global) objects. Shared objects may be divided into portions local to each thread in the computation, but each thread may access remote data using shared pointers. UPC also defines a user-controlled memory consistency model, in which each memory access is either strict in the sense of sequential consistency [Lamport, 1979] or relaxed, in which case ordering of memory accesses is only preserved from the point of view of the local thread. UPC also introduced a split-phase barrier, in which threads signal arrival at the barrier and may then continue to do useful (purely local) work until all other threads have arrived at the barrier. This may be used to reduce idle time due to barrier synchronization, in applications where the work may be divided into distributed and local portions. UPC also provides a number of collective 'relocalization' functions (broadcast, scatter, exchange) and computation functions (reduce, prefix reduce) similar to those defined by MPI.
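The split-phase barrier can be sketched with ordinary threads. In UPC the two phases are written as the statements upc_notify and upc_wait; the Python class below is a hypothetical, single-use stand-in for them, built on threading.Condition, and says nothing about how a UPC runtime actually implements the barrier.

```python
import threading


class SplitPhaseBarrier:
    """Single-use split-phase barrier (illustrative sketch, not UPC itself).

    notify() signals arrival without blocking; wait() blocks until every
    participating thread has called notify().
    """

    def __init__(self, parties):
        self.parties = parties
        self.arrived = 0
        self.cond = threading.Condition()

    def notify(self):
        with self.cond:
            self.arrived += 1
            if self.arrived == self.parties:
                self.cond.notify_all()

    def wait(self):
        with self.cond:
            self.cond.wait_for(lambda: self.arrived == self.parties)


def run(num_threads):
    shared = [None] * num_threads  # data that other threads will read
    seen = [None] * num_threads    # what each thread observed after the barrier
    barrier = SplitPhaseBarrier(num_threads)

    def worker(tid):
        shared[tid] = tid + 1                    # distributed part: publish a value
        barrier.notify()                         # phase 1: signal arrival, keep going
        local = sum(k * k for k in range(1000))  # purely local work overlaps the wait
        barrier.wait()                           # phase 2: block until all have notified
        neighbour = shared[(tid + 1) % num_threads]
        seen[tid] = (neighbour, local)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Each thread reads its neighbour's value only after wait() returns, so the read is safe, while the local computation between notify() and wait() fills time that would otherwise be spent idle at a conventional barrier.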

UPC's simple locality model provides performance portability across a range of shared and distributed memory systems. SPMD-style applications written in UPC have achieved high performance on the largest computing clusters. It is less well suited to irregular applications requiring load balancing, as data decompositions are static and new threads cannot be created. Min et al. [2011] propose a dynamic tasking library and API as an extension to UPC to support such applications.

2.3.2 Coarray Fortran

CAF is an extension of Fortran for SPMD programming with the partitioned global address space model [Numrich and Reid, 1998; Mellor-Crummey et al., 2009]. Multiple process images execute the same program, each with its own local data. The key concept in CAF is the coarray, which is an array shared between multiple images in which each image has a local portion of the array, but may directly access data local to other images. The original coarray extensions (which have since been adopted into the Fortran 2008 standard) required coarrays to be statically allocated across all processes. Later work [Mellor-Crummey et al., 2009] expands CAF to support dynamically allocated arrays over subsets of images, and global pointers for the creation of general distributed data structures.
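The coarray idea can be mimicked in a few lines. This is a sequential sketch under stated assumptions: the Coarray class and its get and local methods are hypothetical names chosen here, images are numbered from 1 as in Fortran, and the SPMD execution of the images is simulated by a loop rather than by real processes.

```python
class Coarray:
    """Sketch of a coarray: every image holds a local portion of equal size.

    get(index, image) plays the role of a Fortran reference such as
    a(index)[image]; local(image) is the portion an image can use directly.
    """

    def __init__(self, local_size, num_images):
        self.num_images = num_images
        self._data = [[0] * local_size for _ in range(num_images)]

    def local(self, image):
        return self._data[image - 1]  # images are numbered from 1

    def get(self, index, image):
        return self._data[image - 1][index]


def right_neighbour_values(num_images, local_size):
    a = Coarray(local_size, num_images)

    # Every image runs the same code on its own local portion.
    for me in range(1, num_images + 1):
        mine = a.local(me)
        for i in range(local_size):
            mine[i] = 10 * me + i

    # In Fortran 2008 a "sync all" statement would separate the two phases;
    # the sequential loops here make that ordering implicit.

    # Each image reads the first element held by its right-hand neighbour.
    result = []
    for me in range(1, num_images + 1):
        neighbour = me % num_images + 1
        result.append(a.get(0, neighbour))
    return result
```

With four images, each image obtains the first element of the next image's portion, with the last image wrapping around to image 1.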

By extending Fortran, CAF builds on decades of effort in the development of high-performance compilers and application codes, and provides an evolutionary pathway for existing codes to exploit parallelism using the PGAS model.

2.3.3 Titanium

Titanium is a parallel dialect of Java designed for high performance scientific computing [Yelick et al., 1998]. Titanium extends serial Java with multi-dimensional arrays, separating the index space (domain) from the underlying data, and an unordered loop construct, foreach, which allows arbitrary reordering of loop iterations to support optimizations such as cache blocking and tiling [Yelick et al., 2007b]. To avoid the overheads associated with boxed types in Java, Titanium supports the definition of immutable classes, which save memory and preserve locality by dispensing with pointers.
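The separation of domain from data, and the freedom an unordered loop gives, can be sketched as follows. The helper names rect_domain, tiled and scale are hypothetical, and the half-open index bounds follow Python convention rather than Titanium's syntax. The point is only that a loop body with no dependence between iterations gives the same result in row-major and in tiled order.

```python
def rect_domain(lo, hi):
    """Index points (i, j) with lo <= (i, j) < hi, in row-major order."""
    return [(i, j) for i in range(lo[0], hi[0]) for j in range(lo[1], hi[1])]


def tiled(lo, hi, tile):
    """The same index points, visited one tile at a time (cache blocking)."""
    points = []
    for ti in range(lo[0], hi[0], tile):
        for tj in range(lo[1], hi[1], tile):
            for i in range(ti, min(ti + tile, hi[0])):
                for j in range(tj, min(tj + tile, hi[1])):
                    points.append((i, j))
    return points


def scale(data, points, factor):
    """Apply an order-independent body to every point of the domain."""
    out = dict(data)
    for p in points:
        out[p] = data[p] * factor  # each iteration touches only its own point
    return out


domain = rect_domain((0, 0), (4, 4))
data = {p: 4 * p[0] + p[1] for p in domain}  # data is stored separately from the domain
row_major_result = scale(data, domain, 2)
tiled_result = scale(data, tiled((0, 0), (4, 4), 2), 2)
```

Because the body of scale never reads a value written by another iteration, a compiler is free to choose the tiled order, which is the latitude an unordered loop construct grants.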

Titanium follows the SPMD model of parallelism; processes synchronize at barrier statements and a single-qualification analysis is used to ensure that all processes encounter the same sequence of barriers. Distributed data structures such as distributed arrays may be constructed using global pointers, which may refer to objects in the local partition or a remote partition. Difficulties in implementing full distributed garbage collection motivated the introduction of memory regions, into which objects may be allocated and an entire region de-allocated with a single method call [Yelick et al., 2007b]. (A better solution to this problem was subsequently implemented for the X10 language; see 2.4.1.)
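The region idea can be shown as simple bookkeeping. The Region class below is a hypothetical sketch and not Titanium's API; since Python is itself garbage collected, it models only the lifetime rule (objects are allocated into a region and all released by one call), not the actual memory management.

```python
class Region:
    """Sketch of a memory region: objects live until the whole region is deleted."""

    def __init__(self):
        self._objects = []
        self.live = True

    def allocate(self, factory, *args):
        """Create an object via factory(*args) and record it in this region."""
        if not self.live:
            raise RuntimeError("region has already been deleted")
        obj = factory(*args)
        self._objects.append(obj)
        return obj

    def delete(self):
        """Release every object in the region with a single call."""
        released = len(self._objects)
        self._objects.clear()
        self.live = False
        return released


def build_and_release(count):
    region = Region()
    for k in range(count):
        region.allocate(list, range(k))  # temporary objects for one phase of work
    return region.delete()
```

Freeing a region needs no tracing of individual references, which is why regions sidestep distributed garbage collection, at the price of requiring the programmer to group objects by lifetime.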
