Vendor-specific SIMD implementations

So far in this chapter we have discussed aspects of GPU hardware from a general perspective. In this section we discuss the hardware implementations on offer from AMD and NVIDIA, the leading GPU vendors. Specifically, we discuss the implementation details of their latest hardware architectures, focusing on the parts that are directly related to running general-purpose applications on the GPU.

Chapter 5. Background on Parallel Computing with General Purpose GPUs 80

Figure 5.7: Generalized block diagram of AMD's GCN architecture.

5.3.1 The Graphics Core Next architecture (AMD)

The Graphics Core Next (GCN) architecture was first introduced by AMD during their Fusion Developer Summit (AFDS) in 2011 [102,109] and later launched with their Southern Islands devices, which carry model names of the form Radeon HD 7xxx. For instance, at the time of launch, their flagship model was the Radeon HD 7970. GPU devices based on the GCN architecture are available as discrete peripheral devices or as part of AMD's system-on-chip product line called accelerated processing units (APUs), which combine a GPU and a CPU on the same die to provide a heterogeneous solution for computing tasks.

At the heart of the GCN architecture are the streaming multiprocessors and, without loss of generality, Figure 5.7 depicts a simple block diagram of a GPU with four streaming multiprocessors. Each multiprocessor comprises processing elements grouped into vector units, and each vector unit is made up of 16 processing elements. There are 4 vector units in each streaming multiprocessor, which amounts to a total of 64 processing elements per multiprocessor. At any point during the execution of a program on the GPU, a vector unit is responsible for executing one wavefront of 64 threads, and each thread in the wavefront is assigned to a single processing element. Because a vector unit has only 16 processing elements, a wavefront is scheduled in quarters: 16 threads are scheduled at a time until all 64 threads in the wavefront have been scheduled.

Figure 5.8: Generalized block diagram of NVIDIA's Kepler architecture.

Each processing element has its own private memory, or registers, visible only to the thread it is executing. A form of high-speed, low-latency memory, known as the local data store (LDS), also exists on each multiprocessor and is visible to all threads in a group executing on that multiprocessor. The LDS supports scatter and gather operations and gives threads a means to share data with each other. The frame buffer, or video memory, resides off-chip and provides a large amount of storage but incurs the highest latency. A memory controller with several memory channels gives the streaming multiprocessors high-speed, high-bandwidth access to the frame buffer.

To put this into perspective, let us consider an actual GCN device, the AMD Radeon HD 7970 GPU. This GPU features 32 SMs, which equates to a total of 2,048 PEs. Each SM has an LDS size of 32 kB and the reference design has a 3 GB GDDR5 frame buffer with a 384-bit wide memory interface.
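The HD 7970 figures quoted above follow directly from the per-multiprocessor layout. The short Python sketch below reproduces that arithmetic; all constants are the values given in the text, while the variable names are illustrative only.

```python
# Values quoted in the text for the AMD Radeon HD 7970.
SMS = 32                    # streaming multiprocessors
VECTOR_UNITS_PER_SM = 4
PES_PER_VECTOR_UNIT = 16
LDS_KB_PER_SM = 32          # local data store per multiprocessor, in kB
BUS_WIDTH_BITS = 384        # width of the memory interface

# Derived quantities.
pes_per_sm = VECTOR_UNITS_PER_SM * PES_PER_VECTOR_UNIT
total_pes = SMS * pes_per_sm
total_lds_kb = SMS * LDS_KB_PER_SM
bytes_per_transfer = BUS_WIDTH_BITS // 8  # bytes moved by one bus-wide transfer

if __name__ == "__main__":
    print("PEs per SM:", pes_per_sm)
    print("total PEs:", total_pes)
    print("aggregate LDS (kB):", total_lds_kb)
    print("bytes per bus-wide transfer:", bytes_per_transfer)
```

The aggregate LDS value is simply the per-multiprocessor size summed over the device; each multiprocessor can still only address its own LDS.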

5.3.2 The Kepler architecture (NVIDIA)

NVIDIA's Kepler architecture is the successor to the Fermi architecture and was introduced with the NVIDIA GeForce GTX 6xx series. Kepler GPUs are also available as discrete GPUs or as embedded devices, such as the Tegra line of mobile GPUs. The Kepler architecture is also based around the streaming multiprocessor, which NVIDIA refers to as SMX [82]. The block diagram in Figure 5.8 illustrates how the compute components are arranged from a general point of view.

The heart of the Kepler architecture for compute lies in the Graphics Processing Cluster (GPC), and a single GPU comprises a number of these. A GPC is simply a group of 2 streaming multiprocessors. Each SMX in a GPC is made up of 6 CUDA (Compute Unified Device Architecture) arrays and each CUDA array in turn consists of 32 CUDA cores. A CUDA array is akin to the vector unit in AMD's GCN architecture and the CUDA cores are simply the processing elements. This implies that a single SMX consists of 192 CUDA cores, or processing elements. A CUDA array is responsible for executing a warp, with each thread executed by one CUDA core.

A shared memory also exists on each SMX and is accessible only to warps executing on that SMX. The shared memory also supports scatter and gather operations and allows warps on the same SMX to share data. Each CUDA core also has registers that are private to the thread it is executing. As with GCN, a large video memory resides off-chip and a memory interface provides high-speed, high-bandwidth access for the multiprocessors.

As an example, the NVIDIA GeForce GTX 680 GPU features 4 GPCs, which equates to 8 SMXs for a total of 1,536 CUDA cores. The standard video memory configuration is 2 GB of GDDR5 memory with a 256-bit wide memory interface.
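The GPC, SMX, CUDA array and CUDA core hierarchy of the GTX 680 can be sketched in a few lines of Python. The counts are those given in the text; the flat core numbering and the `locate_core` helper are hypothetical, introduced only to show how the levels nest, and do not correspond to any NVIDIA API.

```python
# Counts quoted in the text for the NVIDIA GeForce GTX 680.
GPCS = 4
SMX_PER_GPC = 2
ARRAYS_PER_SMX = 6
CORES_PER_ARRAY = 32

# Derived quantities.
CORES_PER_SMX = ARRAYS_PER_SMX * CORES_PER_ARRAY
TOTAL_SMX = GPCS * SMX_PER_GPC
TOTAL_CORES = TOTAL_SMX * CORES_PER_SMX


def locate_core(core_id):
    """Decompose a flat core index into (gpc, smx, array, core).

    The flat numbering is an illustrative assumption: cores are counted
    array by array, SMX by SMX, GPC by GPC.
    """
    if not 0 <= core_id < TOTAL_CORES:
        raise ValueError("core_id out of range")
    smx_global, within_smx = divmod(core_id, CORES_PER_SMX)
    gpc, smx = divmod(smx_global, SMX_PER_GPC)
    array, core = divmod(within_smx, CORES_PER_ARRAY)
    return gpc, smx, array, core


if __name__ == "__main__":
    print("CUDA cores per SMX:", CORES_PER_SMX)
    print("SMX units:", TOTAL_SMX)
    print("CUDA cores in total:", TOTAL_CORES)
    print("last core sits at (gpc, smx, array, core):",
          locate_core(TOTAL_CORES - 1))
```

Changing the four constants at the top is enough to describe a differently sized device with the same hierarchy.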

In document A Study of Time and Energy Efficient Algorithms for Parallel and Heterogeneous Computing (Page 96-99)

Vendor-specic SIMD implementations