CUDA Implementation - Exploiting multiple levels of parallelism of Convergent Cross Mapping

NVIDIA released the CUDA language in 2008 [37], which is an extension of C. Briefly, the Single Program Multiple Data (SPMD) code is written using a CUDA kernel function, the data to be operated on are copied from CPU RAM to the global memory of the device, and the C program running on the CPU initiates the data-parallel computation via a kernel function call. A GPU kernel function contains the code that will be executed simultaneously by the GPU processors, and CUDA uses the function type qualifier global to declare that a function is a GPU kernel function at compile time. As such, the implementation covered here was written using the CUDA API and was composed of three kernels (CUDA functions) in the CCM which are executed on the GPU device.

• The first kernel calculates the pairwise Euclidean distance matrix of sizeT×T containing the distances among theT lagged points in reconstructed shadow manifold and the same lagged points, where each point actually is anE-dimensional lagged vector.

• The second kernel performs the radix sort given the distances between any query point and all T E-dimensional reference points, where such distances serve as the output of the first kernel.

• The third kernel computes the Pearson’s correlation coefficient between twoR-length sequences (predict sequence and corresponding target sequence), where each sequence containsE+ 1 nearest neighbours generated by the kNN search step.

3.3.1 Pairwise Distance Calculation Kernel

IN (T x E) OUT (T x T) BlockDim*bx BlockDim*by BlockDim*bx BlockDim*by

MX Pairwise Distance Matrix

Figure 3.4: The parallel algorithm for pairwise distance calculation. Each block computes one sub- matrix of OUT, and the threads work on one pair of aligned sub-matrices of IN, as is illustrated as the figure.

The kernel input, aT×Ematrix (reconstructed shadow manifold) in which each row is theE-dimensional lagged point, is stored at device memory for SP to access. The kernel computes the pairwise distances among these points (different pair of rows) and produces a T ×T symmetric matrix as output, which is stored at device memory too.

The CUDA version for pairwise Euclidean distance algorithm, as shown in Figure 3.4, uses one thread for one entry in the OUT matrix, which means there are T2 _{threads. The threads are organized into} BlockDim×BlockDimtwo-dimensional blocks, which will be run on the GPU cores. And these blocks are organized into the _BlockDimT × T

BlockDim sub-matrix format. Generally,BlockDim is supposed to be 16 for

Maxwell or more advanced GPU architectures, while it is supposed to be 4 for Kepler or older architecture. A thread orients itself through its block and thread indices in the following way:

bx=blockIdx.x;by=blockIdx.y;

tx=threadIdx.x;ty=threadIdx.y.

(3.2)

With the above coordinate system, a thread is responsible for calculating the entry in the matrixOUT

at rowBlockDim∗by+tyand columnBlockDim∗bx+tx. Let us assume that a thread needs to calculate the entry (i, j) in theOUT matrix. It will first load the two sub-matrices of IN anchored by the variable

y and x at the corresponding upper left corners. The synchronization function syncthreads() needs to be called for imposing a barrier before accumulating its own partial Euclidean distance in a temp variable. Then the threads have to be synchronized again before processing the next pair of sub-matrices inIN. These procedures continue to be executed until all blocks finish the calculation on the corresponding entry.

3.3.2 Radix Sort Operation Kernel

3 2 3 1 2 1 1 2 0 5

Count Table

0 3 5 8 9 11 12 13 15 15

Offset Table

Preﬁx Scan Step On GPU (SIMD)

0 3 2 3 1 2 1 1 2 0

0 3 5 5 4 3 3 2 3 2

0 3 5 8 9 8 7 5 6 4

0 3 5 8 9 11 12 13 15 12

Figure 3.5: The parallel prefix scan algorithm. The interval is changed from 1 to intervallog2n.

The radix sort applied on each chunk includes three stages: count, prefix scan, and reorder. For the counting stage, the input is an array of keys, and the output is a matrix of counts of each value in the input. The straightforward implementation is taking advantage of shared memory with an atomic increment operation on the counter of keys for each block. Then the count matrix for the entire input represents the summation of the individual tables in each block. Threads can be used to sum up all the count values and write them to the global memory. As for the prefix scan stage, the traditional sequential scan algorithm is poorly suited to GPUs because it does not take advantage of GPU data parallelism. The parallel version of the prefix scan is based on the algorithm presented by Hillis and Steele [27] and demonstrated for GPUs by Harris [25]. Figure 3.5 illustrates this operation. Although the parallel prefix scan performsO(nlogn) addition operations, it takesO(logn) time complexity to finish which can be a huge improvement in performance. The last stage is to write the data back to the appropriate location in global memory. For example, if the size of the block is 16×16 = 256, the GPU takes two steps to finish the reordering process. Firstly, block reads 256 keys and sorts locally. The system then writes them back to global memory. The corresponding global index address of value nis calculated by:

ci=i−lon+gonb (3.3)

whereiis the local start index in the block,lonis the local offset, andgon_b is the global offset of the valuen

processed by block indexb. The above stage will be repeated fordchunks (c1,c2, ...,dd) from least significant

bits to the next significant bits, and so on. In the kernel, the Thrust library [3]thrust::sort by key is used to implement the sort operation as it contains the CUDA code of the radix sort function. At a general level, Thrust is a productivity-oriented library for CUDA, as it is an analog of C++ standard template library.

3.3.3 Pearson’s Correlation Coefficient Kernel

As shown in Figure 3.6, Pearson’s correlation coefficient kernel takes two sub-matrices and outputs the corresponding value in the buffer of the output matrix.

X (R x [E+1]) BlockDim*bx+tx

Y (R x [E+1]) _{OUT (R x 1)}

Figure 3.6: The parallel algorithm forRsamples, (E+ 1)-dimension vectors, of Pearson’s correlation coefficient calculation. Each thread computes one row ofOUT, which takes one pair of the same row of XandY.

As applied here, the Pearson’s correlation coefficient measures the linear correlation between two sequences

x and y. In order to achieve the parallelism on GPU, R realizations in CCM of predicted values x and corresponding y, each of them is in E+ 1 dimension, form the matrix-like input X and Y, with the size assuming to be R×(E+ 1). A single thread on CUDA works on the input (Xi and Yi, which are the

same row) to produce the correspondingρin the output matrix OU T. A similar coordinate system to that used in the parallel pairwise distances calculation can be applied in this algorithm. In CUDA, the thread

i is identified by BlockDim×bx+tx. It only processes the corresponding sequences (row) Xi andYi and

generates the ρi in the OU T matrix at positioni.

In document Exploiting multiple levels of parallelism of Convergent Cross Mapping (Page 35-38)