2.3 GPU-based Processing
2.3.1 GPU Computing
The modern graphics pipeline consists of three main stages: the vertex, geometry and fragment stages. Figure 2.3 depicts these stages and the workflow in the pipeline. With the advent of the programmable pipeline, which has largely replaced the fixed-function pipeline, the user can define a program specific to each stage.
The three stages operate as follows:
• In the vertex stage a vertex program operates on incoming vertices, manipulating them according to the application’s objectives. Traditionally this includes vertex coordinate transformation and lighting calculations.
• In the geometry stage, assembled primitives can be modified, extended or deleted by a geometry program. Here, additional rendering primitives can be emitted, or geometry can be generated procedurally by adding or removing vertices.
points in the original cluster are considered. Thus yielding, as for bisecting k-means, a nested hierarchy.
However, in the publicly available implementation of the X-means algorithm [PM] this restriction is relaxed and the points can be assigned also to neighboring clusters.
[Figure 2.3 diagram; labels: Graphics Memory, vertices, vertex program, geometry program, primitives, Rasterization, fragments, fragment program, Frame Buffer, TF]
Figure 2.3: The graphics pipeline. TF: transform feedback.
• In the fragment stage, a fragment program is executed for each rasterized fragment. Here, the color of each output pixel is calculated and written to the frame buffer, to a render target, or to multiple render targets⁵ (MRT).
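The three stages described above can be illustrated with a small CPU-side emulation. This is only a sketch under stated assumptions, not actual shader code: the function names, the 2D integer coordinates, the point-only primitives and the toy "color" value are all hypothetical choices made for illustration.

```python
# Toy CPU emulation of the three programmable stages (illustrative only).
# All names and the integer 2D coordinates are assumptions of this sketch.

def vertex_program(v):
    # Vertex stage: transform one incoming vertex (here: a uniform scale by 2).
    x, y = v
    return (2 * x, 2 * y)


def geometry_program(point):
    # Geometry stage: may emit zero, one or several primitives per input.
    # Here the point is kept and one shifted copy is emitted in addition.
    x, y = point
    return [(x, y), (x + 1, y)]


def rasterize(point, width, height):
    # Fixed-function rasterization: a point covers at most one pixel;
    # points outside the viewport produce no fragment.
    x, y = point
    if 0 <= x < width and 0 <= y < height:
        return [(x, y)]
    return []


def fragment_program(frag):
    # Fragment stage: compute the output value for one fragment.
    x, y = frag
    return x + y


def draw_points(vertices, width, height):
    # Run every vertex through the stages and write the results
    # into a frame buffer indexed as frame_buffer[y][x].
    frame_buffer = [[0] * width for _ in range(height)]
    for v in vertices:
        for prim in geometry_program(vertex_program(v)):
            for (x, y) in rasterize(prim, width, height):
                frame_buffer[y][x] = fragment_program((x, y))
    return frame_buffer
```

For example, `draw_points([(0, 0), (1, 1)], 4, 4)` writes two pixels per input vertex, because the toy geometry program emits one extra point for each incoming point.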
Each of these stages is capable of memory gathering, i.e. fetching data from different positions in texture memory. However, only the vertex stage is capable of scattering, i.e. altering the output position of an element. This can be achieved by drawing the input vertices as points with correspondingly selected output locations. This procedure, however, may lead to memory and rasterization coherence problems, which can ultimately degrade performance [OLG∗07]. In contrast, the output address of a fragment is fixed before the fragment is processed. This limitation needs to be taken into account because it dictates the processing workflow of the fragment stage.
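The difference between gathering and scattering can be sketched on the CPU as follows. The code is a hypothetical illustration using plain lists as stand-ins for 1D textures. The neighborhood sum, the modulo address computation and the accumulation of colliding writes (comparable to additive blending) are assumptions of this sketch, not part of the description above.

```python
def gather_pass(texture, radius=1):
    # Fragment-style pass: the output address i is fixed in advance,
    # but the kernel may read (gather) from several input addresses.
    n = len(texture)
    out = [0] * n
    for i in range(n):  # each output element is computed independently
        lo = max(0, i - radius)
        hi = min(n, i + radius + 1)
        out[i] = sum(texture[lo:hi])
    return out


def scatter_pass(values, size):
    # Vertex-style pass: each input element computes its own output
    # address, as when drawing vertices as points. Several elements may
    # land on the same address; this sketch accumulates such collisions.
    out = [0] * size
    for v in values:
        address = v % size  # data-dependent output position
        out[address] += 1
    return out
```

Here `gather_pass` computes a neighborhood sum per output element, while `scatter_pass` builds a small histogram, a task that is awkward to express with a fixed output address per element.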
Earlier hardware architectures implemented each of these programmable stages in special dedicated hardware units called shader units, each optimized for its task. However, NVIDIA’s GeForce 8 series and ATI’s Radeon HD 2000 series GPUs introduced the Unified Shader Architecture. Such hardware is composed of a bank of computing units (shader units), each capable of performing any of the pipeline steps. Thus, the GPU can dynamically schedule the computing units for better load balance, thereby significantly increasing the GPU throughput.
⁵ The Multiple Render Targets (MRT) mechanism allows writing simultaneously to a given position in all render targets.
The programmable units of the GPU follow a single program multiple data (SPMD) programming model, i.e. all elements are processed in parallel using the same program. The elements are processed independently of one another and cannot communicate with each other. Currently, there are two main graphics application programming interfaces (APIs): OpenGL and DirectX. For programming the shader units on the graphics card, the so-called shading languages emerged: Cg (C for Graphics) [MGAK03], GLSL (OpenGL Shading Language) [Ros06] and HLSL (High Level Shader Language). For our GPU-based implementation we use OpenGL with GLSL.
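The SPMD model described above can be mimicked with a minimal CPU-side sketch. The helper `run_spmd` is a hypothetical name. The random visiting order is an assumption used only to emphasize that a kernel must not rely on any processing order or on results of other elements.

```python
import random


def run_spmd(kernel, elements):
    # Apply the same program (kernel) to every element. The elements are
    # visited in a random order to mimic the absence of ordering
    # guarantees; each result depends only on its own input element.
    order = list(range(len(elements)))
    random.shuffle(order)
    out = [None] * len(elements)
    for i in order:
        out[i] = kernel(elements[i])
    return out
```

Because the kernel sees only its own element, the result is the same for every visiting order, e.g. `run_spmd(lambda x: x * x, [1, 2, 3])` always yields `[1, 4, 9]`.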
General-Purpose Computing on the GPU
Utilizing the GPU for non-graphics applications has evolved into a field of its own, General-Purpose Computation on the GPU (GPGPU)⁶. This was mainly driven by the steady increase in the computational power of the GPU compared to the CPU. As a result, many applications and simulations can be accelerated by this highly parallel streaming processor [OLG∗07], [OHL∗08], [SDK05].
Programming the GPU for general-purpose applications can be achieved in two ways [OHL∗08]:
1. Using a graphics API, i.e. shading languages Cg, GLSL or HLSL. Here the program is structured in terms of the general graphics pipeline stages, see Figure 2.3. Usually, the programmer redefines a non-graphics problem using graphics terminology and data structures such as textures, vertices, fragments, buffers, etc. There are many GPGPU techniques which efficiently map complex applications to the GPU. We refer the reader to the state-of-the-art report of Owens et al. [OLG∗07] and to [OHL∗08], where many of these techniques are described in detail.
2. Using non-graphics interfaces to the hardware. Programming a non-graphics general-purpose application using a graphics API is in many cases cumbersome, and the programmable units are only accessible at an intermediate step in the pipeline. For most programmers, a familiar high-level language which gives direct access to the GPU programming units is more desirable. NVIDIA’s CUDA⁷ (Compute Unified Device Architecture) programming language is a good example. It provides a more flexible environment that is better suited to general parallel processing. However, this flexibility comes at a cost: the user needs to understand the low-level details of the hardware to achieve good performance. OpenCL⁸ (Open Computing Language), which is an open standard like OpenGL and supports GPUs of multiple vendors, is about to become an alternative to CUDA.
⁶ http://www.gpgpu.org/
⁷ http://www.nvidia.com/cuda
⁸ http://www.khronos.org/opencl
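To illustrate how a non-graphics problem can be recast in graphics terms (the first approach above), the following sketch emulates a multi-pass sum reduction on the CPU. It is a hypothetical example, not a technique taken from this text: the list stands in for a 1D texture, each call to `reduction_pass` stands in for one render pass whose fragment program gathers two texels, and reusing the output of one pass as the input of the next mimics ping-pong rendering between two textures.

```python
def reduction_pass(src):
    # One emulated render pass: output texel i gathers source texels
    # 2*i and 2*i + 1 (a missing partner is treated as zero).
    half = (len(src) + 1) // 2
    dst = [0] * half
    for i in range(half):  # every output texel is independent
        a = src[2 * i]
        b = src[2 * i + 1] if 2 * i + 1 < len(src) else 0
        dst[i] = a + b
    return dst


def gpu_style_sum(data):
    # Repeat passes until one texel remains. Returns the total and the
    # number of passes that were needed.
    texture = list(data)
    if not texture:
        return 0, 0
    passes = 0
    while len(texture) > 1:
        texture = reduction_pass(texture)  # output becomes the next input
        passes += 1
    return texture[0], passes
```

Each pass halves the number of texels, so the number of passes grows logarithmically with the input size. The same computation in a non-graphics interface would typically be written as a single kernel without the texture and render-pass vocabulary.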
GPUs have been used for general-purpose computing in many areas, and the field continues to grow. Examples include physically-based simulation (game physics, biophysics, climate research), signal and image processing (image segmentation, computer vision, medical imaging), computational finance and many others. We refer the reader to [OLG∗07] and [OHL∗08] for a more comprehensive overview of these applications. The GPGPU web page⁶ also keeps track of most of the ongoing developments in this field. In the next section we review the GPU-based clustering approaches in detail.