
The host and the device have separate memory. In CUDA, data can be stored in several types of memory, and each type performs best in particular situations [70].
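The separation of host and device memory can be illustrated with a minimal sketch. The kernel `scale_by_two` and the helper `double_on_device` are hypothetical names introduced here for illustration; only the CUDA runtime calls (`cudaMalloc`, `cudaMemcpy`, `cudaFree`) are part of the real API.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Doubles every element in place; one thread per element.
__global__ void scale_by_two(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: round-trips a host vector through device memory.
// Returns false if an allocation or copy fails.
bool double_on_device(std::vector<float> &host)
{
    if (host.empty()) return true;  // nothing to do, avoid a zero-block launch
    const int n = static_cast<int>(host.size());
    const size_t bytes = n * sizeof(float);
    float *dev = nullptr;

    // The device pointer is only meaningful on the GPU; the host cannot
    // dereference it, so data must be copied explicitly in both directions.
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        scale_by_two<<<(n + 255) / 256, 256>>>(dev, n);
        ok = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

The two `cudaMemcpy` calls are often a significant part of the total run time, which is one reason to keep data resident on the device between kernel launches where possible.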

GPUs contain a number of specific memory types which can be used explicitly by developers to cache memory access:

Global memory is accessible from anywhere on the device - this means any thread running the program. It is the largest memory on the GPU and is off-chip, having the slowest

SECTION 3.3: CUDA MEMORY 41

performance of global memory but this is not always possible depending upon the application. Coalescing generally refers to consecutive threads accessing consecutive memory addresses. When designing algorithms for use with the GPU, care must be taken to make sure memory is laid out and accessed efficiently for the best performance possible. Trade-offs between memory access complexity, number of memory accesses, and number of computations performed need to be taken into account because coalesced memory access is significantly faster than non-coalesced access [3].
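The difference between the two access patterns can be sketched as follows. The kernels and the driver `sum_after_copy` are hypothetical names used only for illustration, and error checking is omitted for brevity. Both kernels copy the same data; only the address each thread reads differs.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Coalesced: thread i reads element i, so the threads of a warp
// touch one contiguous run of addresses.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read addresses `stride` elements apart,
// so the reads of a warp are scattered over many memory segments.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(static_cast<long long>(i) * stride) % n];
}

// Hypothetical driver: fills in[i] = i, copies with the chosen pattern
// (stride 1 selects the coalesced kernel) and returns the sum of the output.
double sum_after_copy(int n, int stride)
{
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i);

    const size_t bytes = n * sizeof(float);
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);
    cudaMemcpy(in, h.data(), bytes, cudaMemcpyHostToDevice);

    const int blocks = (n + 255) / 256;
    if (stride == 1) copy_coalesced<<<blocks, 256>>>(in, out, n);
    else             copy_strided<<<blocks, 256>>>(in, out, n, stride);

    cudaMemcpy(h.data(), out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);

    double sum = 0.0;
    for (float v : h) sum += v;
    return sum;
}
```

When the stride shares no common factor with `n`, the strided kernel reads every element exactly once, so both variants produce the same sum. Timing the two launches (for example with `cudaEvent_t` timers or a profiler) is one way to observe the cost of non-coalesced access on a particular card.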

L1 and L2 caches. The Fermi architecture introduced a single unified memory request path for loads and stores - the L1 cache providing store and load operation service per streaming multiprocessor (SM), and the unified L2 cache providing service for all operations (store, load and texture) [94]. The size of the L1 cache can be configured as needed; the total on-chip memory is 64KB, which is split between the shared memory and the L1 cache. It may be divided into 48KB shared and 16KB L1, or 16KB shared and 48KB L1, on Fermi architectures, while Kepler architectures allow an additional split of 32KB to each. The L1 cache caches local memory operations on the standard Kepler architecture; however, Tesla cards may also use the L1 cache for global memory operations. The L2 cache provides fast and efficient data sharing across the entire GPU. GK110 provides an L2 cache size of 1536KB (up from 768KB in Fermi architectures).
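The split between shared memory and L1 cache can be requested through the CUDA runtime, as in the sketch below. The kernel `tile_kernel` and the helper `configure_cache_split` are hypothetical; `cudaDeviceSetCacheConfig` and `cudaFuncSetCacheConfig` are runtime API calls. Both settings are preferences only, and the runtime may choose a different configuration or ignore the request on devices where the split is not configurable.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that stages data in shared memory.
// Assumes it is launched with at most 256 threads per block.
__global__ void tile_kernel(float *data, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i];
        data[i] = tile[threadIdx.x] + 1.0f;
    }
}

// Hypothetical helper: sets a device-wide default and a per-kernel override.
cudaError_t configure_cache_split()
{
    // Device-wide preference: favour L1 (16KB shared / 48KB L1).
    cudaError_t err = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
    if (err != cudaSuccess) return err;

    // Per-kernel preference: favour shared memory (48KB shared / 16KB L1).
    // cudaFuncCachePreferEqual would request the 32KB / 32KB split instead.
    return cudaFuncSetCacheConfig(tile_kernel, cudaFuncCachePreferShared);
}
```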

The read-only data cache is a new method of utilizing the on-chip texture cache for read-only data. Originally this cache was accessible only by mapping data as textures; the Kepler architecture modified the cache to be directly accessible to the streaming multiprocessor. The size of the read-only data cache is 48KB [94].
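A minimal sketch of using the read-only data cache without textures is given below, assuming a device of compute capability 3.5 or higher. The kernel `saxpy_readonly` and the helper `saxpy_on_device` are hypothetical names. Marking the input pointer `const` and `__restrict__` allows the compiler to route loads through the read-only cache, and the `__ldg()` intrinsic requests this explicitly.

```cuda
#include <cuda_runtime.h>
#include <vector>

// y = a * x + y, with x loaded through the read-only data cache.
__global__ void saxpy_readonly(int n, float a,
                               const float *__restrict__ x,
                               float *__restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * __ldg(&x[i]) + y[i];
}

// Hypothetical helper: runs the kernel on host vectors of equal length.
// Allocation errors are not checked, for brevity.
bool saxpy_on_device(float a, const std::vector<float> &x, std::vector<float> &y)
{
    const int n = static_cast<int>(x.size());
    if (n == 0 || y.size() != x.size()) return false;
    const size_t bytes = n * sizeof(float);

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), bytes, cudaMemcpyHostToDevice);

    saxpy_readonly<<<(n + 255) / 256, 256>>>(n, a, dx, dy);

    bool ok = cudaMemcpy(y.data(), dy, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dx);
    cudaFree(dy);
    return ok;
}
```

The data read through `__ldg()` must not be written by any thread for the lifetime of the kernel, which is why only `x` is loaded this way.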

Shared memory is on-chip memory shared between all the threads on an SMX. It is relatively fast memory and can be used by threads in the same block to communicate with one another when needed. It is particularly advantageous when a thread block can load a block of data into on-chip shared memory, process it there, and then write the final result back out into external memory [70]. As mentioned previously, the total on-chip memory is split between shared memory and the L1 cache.
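The load, process and write-back pattern can be sketched with a per-block sum. The names `TILE`, `block_sum` and `sum_on_device` are hypothetical, and the sketch assumes a block size that is a power of two; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 256  // threads per block; must be a power of two here

// Each block loads TILE elements into shared memory, reduces them there,
// and writes a single total back to global memory.
__global__ void block_sum(const float *in, float *block_totals, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // load (pad with zeros)
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) block_totals[blockIdx.x] = tile[0];  // write back
}

// Hypothetical driver: sums a host vector, finishing the last step on the host.
float sum_on_device(const std::vector<float> &h)
{
    const int n = static_cast<int>(h.size());
    if (n == 0) return 0.0f;
    const int blocks = (n + TILE - 1) / TILE;

    float *in = nullptr, *totals = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&totals, blocks * sizeof(float));
    cudaMemcpy(in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, TILE>>>(in, totals, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), totals, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(totals);

    float sum = 0.0f;
    for (float v : partial) sum += v;
    return sum;
}
```

Each input element is read from global memory once, and all intermediate additions take place in shared memory. The `__syncthreads()` calls are what allow threads in the same block to communicate safely through the tile.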

Texture memory is a section of global memory which has been cached for efficient access. It is specifically designed for fast access to images used for texturing in computer graphics. Coalesced global memory reads are the most efficient way to read device memory, but sometimes algorithms are unable to read memory using such a regular pattern. Texture memory is thus often used for data that is unable to be coalesced in global memory. It performs spatial caching and can potentially improve performance if consecutive reads access data at addresses which are within a one-, two- or three-dimensional block of memory. Texture fetching, the process of reading a texture, is typically done

Constant memory is read-only memory, stored in global memory but also cached for efficient access. Constant variables are often used to provide input values to kernels.

Registers are the fastest type of memory on the device and are used by single threads to store local variables. In the GK110 architecture, each thread has access to up to 255 registers.
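As a minimal sketch of constant memory providing input values to a kernel, the example below stores polynomial coefficients in a `__constant__` array. The names `coeffs`, `apply_poly` and `eval_poly` are hypothetical; `cudaMemcpyToSymbol` is the runtime call used to fill constant memory from the host. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Read-only from the device's point of view; written from the host.
__constant__ float coeffs[4];

// Evaluates c0 + c1*v + c2*v^2 + c3*v^3 for each input using Horner's rule.
// Every thread reads the same coefficients, which suits the constant cache.
__global__ void apply_poly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = coeffs[0] + v * (coeffs[1] + v * (coeffs[2] + v * coeffs[3]));
    }
}

// Hypothetical driver: uploads the coefficients and evaluates the polynomial.
std::vector<float> eval_poly(const float (&c)[4], const std::vector<float> &x)
{
    const int n = static_cast<int>(x.size());
    std::vector<float> y(n);
    if (n == 0) return y;
    const size_t bytes = n * sizeof(float);

    cudaMemcpyToSymbol(coeffs, c, sizeof(coeffs));

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);

    apply_poly<<<(n + 255) / 256, 256>>>(dx, dy, n);

    cudaMemcpy(y.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return y;
}
```

In this sketch the loop-free kernel keeps `i` and `v` in registers, while the coefficients are shared by all threads through the constant cache.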

Local memory is memory which is private to a single thread. It is used once other streaming multiprocessor resources have been used up, for example when registers spill. It is stored in the L1 cache first, but if this cache is full it is moved to the L2 cache, and if it cannot be stored there, it is stored in global memory. Regardless of where it is stored, each thread still has access to its own local space.

While it is possible to achieve speed-ups of simulations using only global memory on the device, true performance optimization comes from good management of all types of memory, as each type has strengths and weaknesses. This thesis attempts to show most of the memory types used to good effect in different simulations. There is always room for improvement, especially as graphics cards quickly become more powerful and more programming options are added as the CUDA programming model evolves.