To this end, we have introduced several approaches to assessing associations with complex diseases, along with various approaches to tackling the MHT problem that they induce. The studies underlying these approaches typically comprise exceptionally large samples of data. For example, within a GWAS it is not uncommon to analyze hundreds of thousands of SNP markers upon thousands of study subjects [21, 22]. The analysis of these immense data sets demands both high computational power and appropriate tools for its implementation. Parallel computing is an approach well suited to delivering high computational power for the analysis of such data sets. Within this section, along with its corresponding subsections, we introduce the notion of parallel computing upon the graphics processing unit (GPU) of the personal computer, and outline the requirements (tools) for its implementation.
It is this author’s opinion that the dynamic evolution of the personal computer is one of the most intriguing phenomena occurring within today’s research practices; in particular, the recent (mid-to-late 2006) birth of both the multi-core central processing unit (CPU) and the programmable manycore GPU. Each of these advancements for the personal computer improves computational power and reshapes the way one is required to think about solving complex problems. It is through advancements in computing architecture such as these that one is able to delve into the analysis of increasingly complex subject matter. Whether it be analyzing the tertiary structure of a protein (proteomics) or attempting to locate genetic markers that are associated with an increased risk for a disease trait (genetic epidemiology), the field of genetics fits squarely within the domain of analyzing complex subject matter. The demand for computational power in this field is steadily increasing, and it will continue to increase as genetic technology advances (e.g., through more efficient methods of ascertaining genetic information, or through methods that allow one to obtain more genetic information).
1.4.1 The Programming Paradigm for the Future of High Performance Computing upon the Personal Computer
As defined by Almasi and Gottlieb (1989), parallel computing is a form of computation in which many calculations are carried out simultaneously: a large problem is broken down into two or more smaller problems, and these smaller problems are solved at the same time [64]. As opposed to solving the larger problem as it exists (serial computing), simultaneously solving the partitioned smaller problems can lead to computational results being obtained at a quicker rate. For example, suppose it is desired to compute the sum of the initial four counting numbers. We could solve this problem by: summing the initial two counting numbers; adding the resultant to the third counting number; and adding that resultant to the fourth counting number. Note that this solution does not adhere to the Almasi and Gottlieb definition of parallel computing (i.e., this solution is serial computing). Its computations comprise a total of three sums, each of which entails storing the corresponding resultant (say, to computer system memory). In other words, this serial solution requires a total computational time equal to that of performing three pairwise sums and storing three elements (i.e., positive integers).
On the other hand, noting that addition is associative, to compute the sum of the initial four counting numbers we could break this problem down into two disjoint smaller problems (i.e., adhering to the Almasi and Gottlieb definition of parallel computing), each handled by an independent thread:5 the sum of the first two counting numbers, denoting the resultant by $s_1$; and the sum of the third and fourth counting numbers, denoting the resultant by $s_2$. These two disjoint problems are solved simultaneously, so that to this point the computational time is equal to that of one sum and the storing of one element to system memory. The desired result is obtained by summing the two resultants, $s_1$ and $s_2$. Thus, overall, the parallel computing solution requires a total computational time equal to that of two sums and the storing of two elements. All else being equal, the computational time required by the serial solution is 1.5 times that of the parallel solution. Therefore, when compared to serial computing, parallel computing can lead to computational results being obtained at a quicker rate. Note that the actual speedup of a parallel program over the corresponding serial program is limited by the proportion of the program that can be executed in parallel. This relationship is known as Amdahl’s Law [66].
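The arithmetic of the four-number example, and the dependence of speedup on the parallel proportion of a program, can be sketched in a few lines of C. This is a minimal illustration and not code from the original work: the function names are hypothetical, the round count assumes there are always enough threads to add every pair at once, and the speedup function uses the standard statement of Amdahl's Law, 1 / ((1 - p) + p / n), for a parallel fraction p and n workers.

```c
#include <assert.h>

/* Number of additions performed one after another when summing n numbers
   serially (three additions for the four counting numbers). */
unsigned serial_steps(unsigned n)
{
    assert(n >= 1);
    return n - 1;
}

/* Number of rounds of simultaneous pairwise additions needed to sum n
   numbers. Assumption: enough threads exist to add every pair in a round. */
unsigned parallel_rounds(unsigned n)
{
    unsigned rounds = 0;
    assert(n >= 1);
    while (n > 1) {
        n = (n + 1) / 2;   /* each round halves the number of partial sums */
        rounds++;
    }
    return rounds;
}

/* Amdahl's Law: overall speedup when a fraction p of the run time is
   spread across n workers and the remaining (1 - p) stays serial. */
double amdahl_speedup(double p, unsigned n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For the example in the text, `serial_steps(4)` gives 3 and `parallel_rounds(4)` gives 2, which reproduces the ratio of 1.5. For m = 1024 elements the same idealized count is 1023 serial additions against 10 parallel rounds.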
Parallel computing is not a novel notion and has been employed for many years, mainly in high performance computing (e.g., computer clusters and supercomputers), but interest has grown recently at the personal computing level due to physical constraints (e.g., heat dissipation and electricity consumption) of microprocessors (CPUs) [67]. These constraints essentially prevent further frequency scaling (i.e., raising the clock frequency of a microprocessor to increase its speed). In fact, the computer industry has accepted that future performance increases in CPUs must largely come from increasing the number of cores within the CPU, rather than making a single core go faster [67]. Indeed, to circumvent these physical constraints, CPU manufacturers, such as Intel and Advanced Micro Devices (AMD), have recently (mid 2006) developed the multi-core CPU for the personal computer. One can envision each core of a multi-core CPU as a single lane upon a multi-lane highway: its assigned computations are performed independently of the other cores, which allows for uninterrupted computational flow from and to system memory. Thus, all else being equal (e.g., CPU clock speed, memory speed, etc.), the multi-core CPU comprised of c cores is capable of performing up to c times as many computations per unit time as the single-core CPU of yesteryear. Unlocking the full capabilities of the multi-core CPU reduces to parallel computing. That is, the personal computer user streams specially written programming code to the multi-core CPU, thereby activating the cores within said CPU. In brief, the adaptation of parallel computing upon the personal computer consists of two essential components: a multi-core CPU (or, as we will encounter within §1.4.2, a manycore GPU); and specialized programming code. Without the latter, the multi-core CPUs of the future are no more useful than the single-core CPU of yesteryear. Therefore, parallel computing is indeed the programming paradigm of the future for high performance computing upon the personal computer.
1.4.2 Parallel Computing upon the NVIDIA Manycore GPU
Since many personal computers possess a GPU which is independent of the CPU, there are essentially two competing ways – hardware specific – in which to program in parallel upon the personal computer, either by way of programming specifically to: the multi-core CPU; or, the manycore GPU. Here, we motivate the utility of the manycore GPU over the multi-core CPU as the specific hardware utilized for parallel computing upon the personal computer. In order to do this, let us first briefly outline the required components for parallel computing upon the personal computer:
1. A computer warehousing at least one of a multi-core CPU or manycore GPU;
2. Ability for the user to program within a high-level programming language (e.g., C, C++, FORTRAN);
3. A specialized toolkit – computer hardware (i.e., CPU or GPU) and programming language specific – which provides the user a set of extensions (to harness the parallel computing nature of the hardware) to the high-level programming language; and
4. A compiler capable of compiling the specialized parallel programming code, where parallel programming code is defined as any code written through the collaboration between (2) and (3) above.
Henceforth, any references to CPU and GPU are synonymous with multi-core CPU and [NVIDIA] manycore GPU, respectively.
There is an array of reasons justifying programming in parallel upon the GPU over the CPU. First, whereas the CPU is currently (as of December 2011) limited to six cores (Intel Westmere/Gulftown processors), the GPU can contain upwards of 1024 cores (NVIDIA GeForce GTX 590). This surplus in core units over the CPU, in and of itself, makes the GPU the more attractive resource for parallel computing upon the personal computer. Moreover, even with hyper-threading support (each processing core being able to concurrently process multiple threads), the Westmere/Gulftown CPUs are merely capable of processing twelve (12) threads (i.e., operations) concurrently [68]. On the other hand, each of the sixteen (16) multiprocessors upon the NVIDIA GeForce GTX 580 GPU can concurrently process 1536 threads, so that the maximum number of active threads concurrently processed upon this GPU is 16 × 1536 = 24 576 [68, 69].
Second, the Compute Unified Device Architecture (CUDA) toolkit of the NVIDIA corporation (a worldwide leader in graphics card manufacturing), designed for parallel computing upon NVIDIA GPUs, is provided free of charge and is readily downloadable from the NVIDIA website.6 Moreover, the CUDA toolkit contains the aforementioned required parallel components (3) (programming language extensions) and (4) (compiler), thereby providing: a consolidated means by which to obtain these two components; maximum compatibility between the parallel programming code and the compiler utilized to compile said code; and maximum compatibility with its targeted computer hardware. In contrast, a toolkit for parallel programming upon the CPU is either obtained through a third party (relative to the CPU manufacturer), such as Open Multi-Processing (OpenMP) or Open Computing Language (OpenCL), or is essentially not free of charge. In utilizing a third-party toolkit, one introduces the potential for incompatibility among the parallel programming code, the compiler, and the targeted computer hardware, since the toolkit is geared toward several possible hardware profiles and the compiler is ‘third-party’ to the toolkit. As of December 2011, although CPU manufacturer Intel has released several toolkits (e.g., the Intel Parallel Studio Suite software), there is a fee associated with obtaining the software, of which the minimum MSRP is $799.7
Third, the computational speed of the GPU is substantially greater than that of the CPU. As of May 2011, the computational ability of the fastest NVIDIA GPU (NVIDIA GeForce GTX 580 GPU) was over 1.5 teraflops (one teraflop (TFLOP) = one trillion floating point operations per second) [69], whereas, at the same point in time, the computational ability of the fastest CPU (Intel Westmere CPU) was less than 13% of that of this GPU [69]. Fourth, the memory bandwidth (the quantity of information that can be moved per unit time) of the GPU is much greater than that of the CPU. As of May 2011, the memory bandwidth of the GPU (∼ 195 gigabytes per second) was about 450% greater than that of the fastest CPU (NVIDIA GeForce GTX 580 GPU versus the Intel Westmere CPU) [69].
Finally, the computational power of the GPU is readily scalable. Whereas the top-end motherboards for personal computers offer support for a single CPU, many of these motherboards provide multiple graphics card expansion slots. This implies that one can install multiple GPUs upon these motherboards, thereby scaling the computational power of the GPU over the CPU by a factor essentially equal to the number of GPUs warehoused within the personal computer (see §2.6 for an illustration of this notion). In particular, the ASUS P6T7 WS SuperComputer motherboard8 supports up to four NVIDIA GeForce GTX 580 GPU graphics cards, providing upwards of six teraflops (1.5 TFLOPs for each GPU) of GPU computing performance.
1.4.3 The NVIDIA CUDA Programming Model
In November 2006, the NVIDIA corporation introduced their Compute Unified Device Architecture (CUDA), “A general purpose parallel computing architecture that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU” [69, 70]. Here, we interface CUDA with the C programming language, which is called CUDA C programming [69]. CUDA C programming is heterogeneous computing, insofar as it involves running code on two different platforms concurrently, each embedded within the same personal computer system: a host system with a CPU; and one or more devices (frequently graphics adapter cards) with CUDA-enabled NVIDIA GPUs. This is accomplished by way of the CUDA data processing flow:
1. Copy data from host memory to device (known also as global) memory;
2. Host instructs the device to process data;
3. The device executes in parallel upon its cores; and
4. The results are copied from device memory to host memory.
7 Retrieved from http://software.intel.com/en-us/articles/buy-or-renew/, December 30, 2011.
8 Retrieved from http://usa.asus.com/product.aspx?P_ID=9ca8hJfGz483noLk&templete=2,
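The four steps of the data processing flow can be mimicked on the CPU alone, which may help fix the order of operations before any GPU code is written. The sketch below is only an analogy under stated assumptions: `run_flow` and `double_elements` are hypothetical names, `malloc` and `memcpy` stand in for the device allocation and the host/device copies, and a plain loop stands in for the threads of a kernel. No CUDA call is made.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a kernel: doubles every element. On a GPU each
   element would be handled by its own thread; here a loop plays that role. */
static void double_elements(unsigned int *d_data, size_t m)
{
    for (size_t j = 0; j < m; j++)
        d_data[j] *= 2u;
}

/* CPU-only walk through the four-step flow. Returns 0 on success and -1 if
   the stand-in "device" buffer cannot be allocated. */
int run_flow(unsigned int *h_data, size_t m)
{
    assert(h_data != NULL && m > 0);
    size_t bytes = m * sizeof(unsigned int);

    /* Step 1: allocate the "device" buffer and copy host -> device. */
    unsigned int *d_data = malloc(bytes);
    if (d_data == NULL)
        return -1;
    memcpy(d_data, h_data, bytes);

    /* Steps 2 and 3: the host launches the work and the device executes it. */
    double_elements(d_data, m);

    /* Step 4: copy the results device -> host, then release the buffer. */
    memcpy(h_data, d_data, bytes);
    free(d_data);
    return 0;
}
```

Calling `run_flow(v, 4)` on a host array holding 1, 2, 3, 4 leaves 2, 4, 6, 8 in that array. In CUDA C the two copies are explicit transfers between separate memories, which is why steps 1 and 4 exist at all.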
At the core of CUDA are three key abstractions (a hierarchy of thread groups, shared memory, and barrier synchronization), which are exposed to the programmer simply as a minimal set of language extensions.
CUDA extends the C language by allowing the user to write C [device] functions, known as kernels. Whereas a regular C function is executed once when invoked, a kernel is executed N times upon the device, in parallel, by N different CUDA threads. In other words, a single kernel call of N threads is analogous to simultaneously executing N iterations of a [solely serial based] C function. Threads are organized (i.e., grouped), at the host level, into a grid of thread blocks. Threads are indexed and identified by the device through the threadIdx CUDA resource control variable, while blocks are indexed by way of the blockIdx CUDA resource control variable. At the simplest level, the within-block thread index is one-dimensional (it can have at most three dimensions), in which case threads are identified by the CUDA resource control variable threadIdx.x (the ‘.x’ references the first dimension of the threadIdx control variable). Similarly, the simplest within-grid block index is one-dimensional (again with a maximum of three dimensions), in which case blocks are identified by the CUDA resource control variable blockIdx.x. Both of the following quantities are assigned by the user, at the host level, at the time of kernel execution. The number of one-dimensional thread blocks of the CUDA grid is referenced within the device by way of the CUDA resource control variable gridDim.x; the number of one-dimensional threads per thread block (maximum value of 1024 upon the NVIDIA GeForce GTX 470 GPU, the GPU used by this author) is referenced within the device by way of the CUDA resource control variable blockDim.x. Table 1.1 displays a CUDA grid of gridDim.x = B one-dimensional thread blocks, each block comprised of blockDim.x = T one-dimensional threads.
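To make the indexing concrete, the following C sketch emulates a one-dimensional grid on the CPU. It is an illustration only: the nested loops replace the hardware scheduling of blocks and threads, the function names are hypothetical, and the combination blockIdx.x × blockDim.x + threadIdx.x is the customary way of giving each thread a unique position in a one-dimensional grid rather than something taken from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Unique position of a thread within a one-dimensional grid, built from the
   quantities CUDA exposes as blockIdx.x, blockDim.x and threadIdx.x. */
unsigned global_index(unsigned block_idx, unsigned block_dim, unsigned thread_idx)
{
    assert(thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}

/* CPU emulation of a grid of B blocks with T threads each: every emulated
   thread writes its own global index. out must hold B * T entries. */
void emulate_grid(unsigned *out, unsigned B, unsigned T)
{
    assert(out != NULL);
    for (unsigned b = 0; b < B; b++) {        /* plays the role of blockIdx.x  */
        for (unsigned t = 0; t < T; t++) {    /* plays the role of threadIdx.x */
            unsigned i = global_index(b, T, t);
            out[i] = i;
        }
    }
}
```

With B = 3 and T = 4, the last thread of the last block (block index 2, thread index 3) receives global index 11, so the twelve emulated threads cover positions 0 through 11 exactly once.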
Table 1.1: A CUDA Grid of B Thread Blocks, Each Block Comprised of T Threads.
Grid of gridDim.x × blockDim.x = B × T Threads
Thread Block 1 (blockIdx.x = 0)    · · ·    Thread Block B (blockIdx.x = B − 1)
Thread ID (threadIdx.x)            · · ·    Thread ID (threadIdx.x)
0 · · · T − 1                      · · ·    0 · · · T − 1

Thread blocks are required to execute independently: it must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores as depicted by Table 1.1, enabling programmers to write code that scales with the number of cores. Threads within a block can cooperate by sharing data through a medium called shared memory, and the user can place barrier synchronization points within the kernel to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the CUDA __syncthreads() intrinsic function; __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed. Shared memory is expected to be much faster than global device memory – “any opportunity to replace global memory accesses by shared memory accesses should therefore be exploited” [69].
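One common way in which threads of a block cooperate through shared memory is a pairwise (tree) reduction, where a barrier separates successive rounds. The C sketch below emulates that pattern serially. It is a sketch under stated assumptions and not necessarily the approach developed later in this work: `block_reduce` is a hypothetical name, the array `s_data` stands in for a shared memory object, the inner loop stands in for the threads of one block, and the block size T is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Sums the T entries of s_data in place by repeated pairwise addition and
   returns the total. T must be a power of two (an assumption of this sketch). */
unsigned int block_reduce(unsigned int *s_data, size_t T)
{
    assert(s_data != NULL);
    assert(T > 0 && (T & (T - 1)) == 0);

    for (size_t stride = T / 2; stride > 0; stride /= 2) {
        /* On a GPU, threads 0 .. stride-1 would perform these additions
           simultaneously; the loop below emulates them one by one. Reads come
           from [stride, 2*stride) and writes go to [0, stride), so the order
           of the emulated threads does not matter within a round. */
        for (size_t t = 0; t < stride; t++)
            s_data[t] += s_data[t + stride];

        /* In CUDA C a __syncthreads() barrier would sit here, so that every
           thread finishes this round before any thread starts the next. */
    }
    return s_data[0];
}
```

For the four counting numbers 1, 2, 3, 4 the first round leaves the partial sums 4 and 6, and the second round yields 10. Without the barrier between rounds, a thread could read a partial sum that another thread has not yet written.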
The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors. A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a large number of threads, it employs a unique architecture called Single-Instruction, Multiple-Thread (SIMT).
The SIMT unit within the multiprocessor creates, manages, schedules, and executes threads in groups of thirty-two (32) parallel threads called warps. Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently. The term warp originates from weaving, the first parallel thread technology [68]. When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps that get scheduled by a warp scheduler for execution. The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. For further details, the reader is encouraged to review the document at http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf.
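The partition of a one-dimensional block into warps follows directly from the rule of consecutive thread IDs. The helper functions below are hypothetical names introduced for illustration, and the term "lane" for a thread's position within its warp is likewise an assumption of this sketch rather than terminology used above.

```c
#include <assert.h>

#define WARP_SIZE 32u   /* threads per warp, as stated in the text */

/* Number of warps needed to hold a block of T threads; the last warp is
   only partly filled when T is not a multiple of 32. */
unsigned warps_per_block(unsigned T)
{
    assert(T > 0);
    return (T + WARP_SIZE - 1u) / WARP_SIZE;
}

/* Warp to which a thread of a one-dimensional block belongs
   (threads 0..31 form warp 0, threads 32..63 form warp 1, and so on). */
unsigned warp_of_thread(unsigned thread_idx)
{
    return thread_idx / WARP_SIZE;
}

/* Position of the thread within its warp, from 0 to 31. */
unsigned lane_of_thread(unsigned thread_idx)
{
    return thread_idx % WARP_SIZE;
}
```

A block of T = 512 threads is therefore split into 16 warps, while a block of 33 threads still occupies 2 warps, the second of which holds a single thread.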
1.4.4 Example
As a simple example of an arithmetic problem which can be solved in a parallel manner within the CUDA C programming environment, consider summing over the elements contained within the SNP profile for the ith study participant, $g_i$, where $g_i$ and its corresponding elements, $g_{ji}$, $j = 1, \ldots, m$, are defined within §2.2.1, for some $i = 1, \ldots, n$, and where for notational clarity and simplicity of explanation we assume that $m = 2^{10} = 1024$. We denote the resultant of this sum by $s_i$. It is,
$$s_i = \sum_{j=1}^{m} g_{ji}. \tag{1.4}$$
To compute this sum – in serial within a high-level programming language – we could follow the procedure of Algorithm 1.1.
Algorithm 1.1 Serial Sum
$s_i \leftarrow 0$. {Initialize the value of $s_i$ to zero}.
for $j = 1$ to $m$ do
    $s_i \leftarrow s_i + g_{ji}$. {Increment the value of $s_i$ by that of $g_{ji}$}.
end for
To carry out the recipe outlined in Algorithm 1.1 within the C programming language environment, we could invoke Code Snippet 1.1. For this code note that: during each of the m iterations of the for loop, the resultant s (i.e., $s_i$) is incremented by the value g[j] (i.e., $g_{(j+1)i}$, $j = 0, \ldots, m - 1$); and the (k + 1)st iteration of the loop does not begin until the kth iteration has completed, $k = 1, \ldots, m - 1$. Thus, a total of m arithmetic operations are performed at m distinct points in time. This code is an example of a particular scan operation, called sequential scan [71, 72].
Code Snippet 1.1.
s = 0;                      /* initialize the running sum */
for (j = 0; j < m; j++)
    s += g[j];              /* add the (j+1)st element to the running sum */
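Code Snippet 1.1 leaves the declarations of s, j, m, and g implicit. A self-contained version might look as follows; the function name `serial_sum` and the choice of unsigned int for the elements are assumptions made for illustration, and the values used in the usage note are placeholders rather than data from this study.

```c
#include <assert.h>
#include <stddef.h>

/* Serial sum of the m elements of g, following Algorithm 1.1: the additions
   are performed one after another, so m additions take m steps. */
unsigned int serial_sum(const unsigned int *g, size_t m)
{
    unsigned int s = 0;               /* running sum, initialized to zero */
    assert(g != NULL || m == 0);
    for (size_t j = 0; j < m; j++)
        s += g[j];                    /* iteration j+1 waits for iteration j */
    return s;
}
```

For instance, applied to an array holding the placeholder values 0, 1, 2, 1, the function returns 4.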
On the other hand, to compute the sum (1.4) in a parallel manner within the CUDA C programming environment, we could follow the procedure of Algorithm 1.2.
Algorithm 1.2 Parallel Sum
1. (Host) Copy the elements $g_{ji}$, $j = 1, \ldots, m$, from a host memory object to a device memory object, as follows. Suppose the elements $g_{ji}$, $j = 1, \ldots, m$, reside within the host memory object (vector) h_data (of data type, say, unsigned int). Here, for a given memory object, we use the prefixes h_, d_, and s_ to reference host memory, device memory, and shared memory, respectively. Now, a CUDA kernel can only access device or shared memory objects, and cannot directly access the elements within a host memory object (e.g., h_data). So, to proceed, we must: create (or, allocate) a device memory object, say d_data, which will warehouse the elements of h_data; and copy the elements of h_data to d_data. To carry out these respective tasks, we invoke the following two lines of code
cudaMalloc((void **) &d_data, m * sizeof(unsigned int));
cudaMemcpy(d_data, h_data, m * sizeof(unsigned int), cudaMemcpyHostToDevice);
2. (Host) Invoke a kernel comprised of one block (B = 1) and T = 512 threads, as follows. We first note that per [69], a kernel is defined using the __global__ declaration specifier, where the return data type is required to be void. Next, note that our kernel requires two parameter specifications: the device object d_data, so that the kernel can access and operate upon the corresponding elements within this object; and a device object, say d_result, which will warehouse the value of (1.4). Overall, our kernel declaration, whose name is SNP_Add, is