
Project INF — BigData

Roberto Fontanarosa†, Tobias Rupp, and Steffen Hirschmann

Figure 1: Plot of the learned function from the “checker board” data set.

Abstract— Prediction and forecasting have become very important in modern society. Regression analysis enables predictions based on given data. This paper focuses on regression analysis on spatially adaptive sparse grids using the existing toolbox SG++. It is implemented on graphics cards using NVIDIA's CUDA technology. Several big data sets and the results obtained by the authors' implementation are presented. Finally, the results are compared to existing implementations.

Index Terms—Sparse Grids, Regression Analysis, Big Data, NVIDIA CUDA

1 INTRODUCTION

Today almost all information is stored using computers. In recent years data sets have been getting bigger and are being collected because huge disk space has become affordable. These large data sets are called “big data” and they are produced almost everywhere, to name only a few sources: medicine, astrophysics, banks and online shops.

Big Data can be collected for many purposes as well. One is to learn from the data about future events, so-called prediction. This can be done via regression analysis. Since this is quite time consuming, this paper addresses regression analysis on “sparse grids”.

1.1 Regression Analysis

“Regression analysis is a statistical tool for the investigation of relationships between variables.” [13] The investigator usually seeks to ascertain the causal effect of one variable upon another, for example the change of the weather when the wind changes direction. Independent variables represent the input, while dependent variables represent the output. The result of a regression analysis is a regression function y = f(x), where x is an independent variable and y the dependent variable.

After a function has been learned, one must ensure its quality. Because this testing should never be done with the learning data, separate data has to be provided for testing. To this end the initial data set is split into two parts. Typically the ratio between the training and the testing data is 2 : 1.

To actually do a regression analysis on a sparse grid, the following equation has to be solved according to [11]:

    (1/M · B B^T + λC) α = 1/M · B y        (1)

where λ is the regularisation parameter, ϕ_i are the basis functions and the coefficient vector α of f_N is the solution. The N × N matrix C with c_ij = ∫ ∇ϕ_i(x) · ∇ϕ_j(x) dx stems from the smoothness term; the N × M and M × N matrices B and B^T with b_ij = ϕ_i(x_j) and the vector y of the target values y_i stem from the error term [11].

† Roberto Fontanarosa (Mat.-Nr. 2661256), Tobias Rupp (Mat.-Nr. 2658638), Steffen Hirschmann (Mat.-Nr. 2658913), {fontanro, rupptl, hirschsn}@studi.informatik.uni-stuttgart.de

1.2 Practical use

The practical use is — as stated above — typically prediction and forecasting. But regression analysis is also used to understand which of the independent and dependent variables are related.

Prediction and forecasting are applied widely. Most people get to see obvious applications like weather forecasts. But these techniques are also in use behind sophisticated product proposals or classifications in astrophysics. One could even use them in medical diagnosis.

2 RELATED WORK

SG++ is a toolbox that allows the use of spatially adaptive sparse grids without great expense. It is flexible and does not require the vast initial overhead that otherwise has to be spent on implementing sparse grids and the corresponding algorithms. ”To be able to deal with different kinds of problems in a spatially adaptive way - ranging from interpolation and quadrature via the solution of differential equations to regression, classification, and more - a main motivation behind the development and all considerations was to create a toolbox which can be used in a very flexible and modular way by different users in different applications.”

SG++ is capable of doing regression analysis on both the CPU and the GPU. CPU implementations may use traditional recursive sparse grid algorithms or the iterative ones described in [1]. They can be parallelized using OpenMP or MPI. Also, Heinecke provided an implementation which was specially adapted to Intel CPU architectures. There is also a module which performs the calculations using OpenCL, so SG++ is already capable of calculating on the GPU.

Since OpenCL is not able to utilize modern NVIDIA graphics cards to their maximum capacity, we took the effort to implement regression analysis using NVIDIA's CUDA technology.


3 COMPUTE UNIFIED DEVICE ARCHITECTURE

The Compute Unified Device Architecture (“CUDA”) is a parallel computing architecture from NVIDIA. It allows calculations to be performed on NVIDIA graphics cards. Since the architecture of graphics cards is very different from that of CPUs, CUDA allows huge speedups in several applications, particularly highly parallelizable ones. Compare [3].

CUDA has been successfully used in astrophysics, computational biology, fluid dynamics simulations and many more fields. It is available for most of the modern NVIDIA graphics cards from the GeForce, ION, Quadro and Tesla series. For a list of all CUDA capable devices see [5].

3.1 Concepts

CUDA is an extension to the C programming language. It adds three main abstractions: thread hierarchy, shared memory and synchronization [3].

3.1.1 Thread Hierarchy

Every CUDA thread executes a so-called CUDA kernel. This is a special function which is executed on the device and called from the host. In this context, device means the graphics card and host the CPU. Threads are gathered in blocks. Threads from the same block are scheduled together, reside on the same processor core and share this core's memory. Currently, a block may contain up to 1024 threads [3].

However, the threads in one block may not cover the whole problem. Consequently, blocks are arranged in a so-called grid. Both the block size and the grid size can be up to three-dimensional, and their values must be specified at kernel launch time. Which values give the best performance depends on the actual calculations, the amount of memory that needs to be shared, and the device used.
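As an illustration only (not code from SG++), the following minimal sketch shows how a kernel and its block and grid sizes are specified at launch time; the kernel scale and all sizes are invented for this example.

#include <cuda_runtime.h>

// A trivial kernel: one thread scales one array element.
__global__ void scale(float *data, float factor, int n) {
    // the global index is composed of the block index and the thread index within the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *dData;
    cudaMalloc(&dData, n * sizeof(float));          // allocate global memory on the device

    dim3 block(256);                                // threads per block (at most 1024)
    dim3 grid((n + block.x - 1) / block.x);         // enough blocks to cover the whole problem
    scale<<<grid, block>>>(dData, 2.0f, n);         // block and grid size are given at launch time

    cudaDeviceSynchronize();
    cudaFree(dData);
    return 0;
}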

3.1.2 Memory Hierarchy

A CUDA capable graphics card provides different kinds of memory. The longest access times occur at the highest level, where the global memory resides. It is shared throughout the whole device and can be accessed from the host via copy methods. Basically, every program that wants to calculate something on the device must first copy its data to the global memory.

Memory shared by a block is called shared memory. It is a low level memory which can be accessed very fast.

Graphics cards with a compute capability of 2.0 or higher also have caches (like CPUs) [3]. At the lowest level, the same on-chip memory provides shared memory as well as level 1 cache, and one may set a preference between the two.
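This preference can be set per kernel via the runtime API; a minimal sketch (the kernel name is invented):

#include <cuda_runtime.h>

__global__ void someKernel() { /* ... */ }

int main() {
    // Prefer a larger shared memory partition for this kernel;
    // cudaFuncCachePreferL1 would request the opposite split.
    cudaFuncSetCacheConfig(someKernel, cudaFuncCachePreferShared);
    someKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}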

There are also other kinds of memory on a graphics card which we do not use, namely constant memory, texture memory and surface memory. We did not use these types of memory because we explicitly load data into shared memory anyway, so there would have been no advantage, since they are all located in global memory. Besides this, we needed neither the byte addressing of surface memory nor the filtering support which texture memory offers.

4 PRESENTATION OF THE PRODUCT

As already mentioned, a program which wants to do a regression analysis on sparse grids has to solve equation 1. Matrix B is formed by evaluating every basis function of the sparse grid at every data point.

For a given grid point g with level l_g and index i_g, the tensor product basis function ϕ_g is evaluated according to the following rule, where x_m is the data point and d the dimensionality of the problem:

    ϕ_g(x_m) = ∏_{k=1}^{d} max(1 − |2^{l_{g,k}} · x_{m,k} − i_{g,k}|, 0)        (2)
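As an illustration, a device function evaluating the product in equation 2 for one grid point and one data point might look as follows; this is a sketch, not SG++ code, and it assumes that the level array already holds the precomputed values 2^{l_{g,k}}:

// Evaluates phi_g(x_m) = prod_k max(1 - |2^(l_gk) * x_mk - i_gk|, 0), cf. equation 2.
// "level" is assumed to hold the precomputed values 2^(l_gk) for this grid point.
__device__ float evalBasis(const float *level, const float *index,
                           const float *x, int dims) {
    float phi = 1.0f;
    for (int k = 0; k < dims; ++k) {
        float t = 1.0f - fabsf(level[k] * x[k] - index[k]);
        phi *= fmaxf(t, 0.0f);   // the 1D hat function is zero outside its support
    }
    return phi;
}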

Figure 2: Important part of the architecture of SG++. LearnerVectorizedIdentity instantiates a DMSystemMatrixVectorizedIdentity, which uses an OperationMultipleEvalIterativeX to provide the mult and multTranspose functions. These operation classes typically rely on so-called kernels to implement the mult functions.

4.1 Functionality from SG++

SG++ provides classes for creating sparse grids and refining them. It implements the abstract learning process for regression analysis. Since our product will be included in SG++, we relied on it as much as possible to minimize redundancies.

The abstract learning process implemented in SG++ uses the BiCGSTAB solver to solve the system of linear equations (1). Like all CG-type algorithms, it therefore only needs a matrix multiplication. For parallel applications, this multiplication with the system matrix B B^T + λC is implemented in a class called DMSystemMatrixVectorizedIdentity. The name indicates that the identity matrix serves as C. This class, in turn, instantiates an OperationMultipleEvalIterative which implements the routines for multiplication with B and B^T. A general overview of the important SG++ classes is given in Figure 2.

4.2 The Authors’ Implementation

Since we used all the existing SG++ infrastructure, all we had left to do was defining an OperationMultipleEval which implements the multiplications using CUDA. For this purpose we designed a simple layer architecture. At the top layer there is the OperationMultipleEvalIterativeCUDA. It implements the basic functionality which is required by SG++, like measuring the execution time. It also instantiates a class from the mid layer, called CUDAKernelWrapper. Basically, this class does memory management. After the KernelWrapper has prepared the device, it executes the respective function from the lowest layer, called CUDAKernels. This module defines the CUDA kernels which do the actual calculation of u = Bv and v = B^T α with b_ij = ϕ_i(x_j) as defined in equation 2.

An overview of how our classes are integrated in SG++ is given in Figure 3. All the classes and modules have been implemented in single and double precision according to the architecture of SG++.

In the following, we will describe these layers or rather classes more thoroughly.

4.2.1 OperationMultipleEvalIterativeCUDA (“OMI”)

This class represents the interface between SG++ and the authors' implementation. Because it is derived from the general OperationMultipleEval classes, it inherits the required multiplication methods from SG++. Basically, it instantiates a KernelWrapper object and forwards all multiplication method calls to it. Moreover, as mentioned above, it implements a simple time measurement.

Figure 3: Integration of the authors' classes into SG++ (right). One can easily spot the layered architecture.

4.2.2 CUDAKernelWrapper

The CUDAKernelWrapper has to upload all the data needed by the kernel to the device. This includes the grid and the data set, as well as allocating memory for the result. For this purpose, the KernelWrapper has to handle the case that the data set is too big to fit entirely into the device memory and thus, if necessary, has to split up the data set. After preparing the device memory, the KernelWrapper invokes the kernel with pointers to the allocated memory and their respective sizes. After the kernel execution, the result is downloaded into main memory and returned to the caller. Moreover, the KernelWrapper has to handle the calculation for surplus grid points which cannot be covered by the kernels because of the chosen block and grid sizes. This surplus calculation is CPU based and the implementation follows the general algorithms described in subsection 4.2.3. For the sake of performance, most of the outer loops in these algorithms are parallelized using OpenMP.
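A strongly simplified sketch of this flow for the multTranspose-like case (one result value per data point) is shown below; it is not the actual CUDAKernelWrapper code, and the chunk size, names and the omitted kernel launch are assumptions:

#include <cuda_runtime.h>
#include <cstddef>

// Upload the grid once, process the data set in chunks that fit into device
// memory, and download the partial results chunk by chunk.
void multTransposeFlowSketch(const float *grid, size_t gridBytes,
                             const float *data, size_t numData, size_t dims,
                             float *result) {
    float *dGrid, *dData, *dResult;
    cudaMalloc(&dGrid, gridBytes);
    cudaMemcpy(dGrid, grid, gridBytes, cudaMemcpyHostToDevice);

    const size_t chunk = 1 << 20;                  // invented chunk size (data points per upload)
    cudaMalloc(&dData, chunk * dims * sizeof(float));
    cudaMalloc(&dResult, chunk * sizeof(float));

    for (size_t off = 0; off < numData; off += chunk) {
        size_t n = (numData - off < chunk) ? (numData - off) : chunk;
        cudaMemcpy(dData, data + off * dims, n * dims * sizeof(float),
                   cudaMemcpyHostToDevice);
        // the actual CUDA kernel would be launched here, e.g.
        // multTransposeKernel<<<gridSize, blockSize>>>(dGrid, dData, dResult, n);
        cudaMemcpy(result + off, dResult, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }
    cudaFree(dGrid);
    cudaFree(dData);
    cudaFree(dResult);
}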

4.2.3 CUDAKernels

As mentioned above, the kernels are the lowest level in the layered architecture. They have to implement the actual multiplication with B and B^T, where b_ij = ϕ_i(x_j). The evaluation of the basis functions ϕ is defined in equation 2.

The grid points' levels and indices are provided by SG++ as matrices in R^{g×d}, where g is the number of grid points and d the number of dimensions. The data set is a matrix in R^{m×d}, where m is the number of data points. Result and source vectors are just simple arrays.

The implementations of mult and multTranspose are adapted from Heinecke [1]. The basic algorithm for the mult function with linear basis functions is shown in Algorithm 1; multTranspose is shown in Algorithm 2. Here, l denotes the matrix of the grid's levels, i the matrix of the grid's indices and x the data set.

As one may see, the innermost loop is the evaluation of the basis function. We also implemented modified basis functions which can handle boundary values. These require a slight change of the basis function evaluation and lead to a case differentiation in the inner loop if the function is adjacent to the boundary. All the inner basis functions are evaluated in the same way as the linear basis functions.

Algorithm 1 Pseudo code for function mult. Calculates u = Bv. See [1].
for g ← 0..g_max do
    u_g ← 0
    for m ← 0..m_max do
        temp ← v_m
        for d ← 0..d_max do
            temp ← temp · max(1 − |l_{g,d} · x_{m,d} − i_{g,d}|, 0)
        end for
        u_g ← u_g + temp
    end for
end for

Algorithm 2 Pseudo code for function multTranspose. Calculates v = B^T α. See [1].
for m ← 0..m_max do
    v_m ← 0
    for g ← 0..g_max do
        temp ← α_g
        for d ← 0..d_max do
            temp ← temp · max(1 − |l_{g,d} · x_{m,d} − i_{g,d}|, 0)
        end for
        v_m ← v_m + temp
    end for
end for

These algorithms are converted to CUDA kernels in a quite obvious way: the outermost for loop is left out and m_max or g_max threads, respectively, are launched, each with the loop variable as its thread index.

The kernels make use of several CUDA techniques to achieve good performance. First of all, the kernels are implemented as templates with the dimensionality as template argument. This enables the compiler to unroll the innermost loop, because it has a constant trip count. To force the NVIDIA CUDA compiler to unroll loops with a constant trip count, #pragma unroll directives have been added in front of them.
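A condensed sketch of such a template kernel for multTranspose (linear basis functions, no boundary) is given below. It is an illustration in the spirit of Algorithm 2, not the authors' actual CUDAKernels code; DIMS is the template parameter that turns the trip count of the innermost loop into a compile-time constant, and the level matrix is assumed to hold the precomputed values 2^l:

// One thread per data point m. Since DIMS is known at compile time, the
// innermost loop has a constant trip count and can be fully unrolled.
template <int DIMS>
__global__ void multTransposeSketch(const float *level,   // g x DIMS, holds 2^l
                                    const float *index,   // g x DIMS
                                    const float *data,    // m x DIMS
                                    const float *alpha,   // length g
                                    float *v,             // length m (output)
                                    int numGrid, int numData) {
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= numData) return;

    // keep this thread's data point in registers
    float x[DIMS];
    #pragma unroll
    for (int d = 0; d < DIMS; ++d)
        x[d] = data[m * DIMS + d];

    float res = 0.0f;
    for (int g = 0; g < numGrid; ++g) {
        float temp = alpha[g];
        #pragma unroll
        for (int d = 0; d < DIMS; ++d) {
            float t = 1.0f - fabsf(level[g * DIMS + d] * x[d] - index[g * DIMS + d]);
            temp *= fmaxf(t, 0.0f);
        }
        res += temp;
    }
    v[m] = res;
}

// Possible launch with the block size of 640 mentioned later in this section:
// multTransposeSketch<5><<<(numData + 639) / 640, 640>>>(dLevel, dIndex, dData, dAlpha, dV, numGrid, numData);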

Another benefit of template kernels is that a thread can keep its associated grid or data points in registers, since the register count must be determined at compile time.

The algorithms require each thread to access the same data or grid point at the same time. To ensure maximum performance, the kernel first loads this data into shared memory, from where all threads then read it.

In most cases the grid size is smaller than the data set size. If this ratio gets too large, performance will suffer. To take care of this problem, the mult kernel, which iterates over all data set points, can be launched with a block size in the y direction bigger than 1. The mult kernel then splits the data set evenly between these y threads. Afterwards they accumulate their results using the atomicAdd instruction. Atomic instructions are well known to give bad performance, but as the time is roughly reduced by a factor of 1/y if the device has enough resources, the overall performance is better. The kernels also allow this accumulation to take place in shared memory rather than in global memory.
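A fragment illustrating this accumulation idea (single precision, shared memory accumulation, invented names; not the actual kernel) could look like this:

// Fragment of a mult-like kernel: all threads with the same threadIdx.x work on the
// same grid point g, but each y thread sums over its own share of the data points.
// The partial sums are merged with atomicAdd in shared memory and written out once.
template <int BLOCK_X>
__global__ void multYThreadSketch(const float *v, float *u,
                                  int numGrid, int numData) {
    __shared__ float acc[BLOCK_X];
    int g = blockIdx.x * BLOCK_X + threadIdx.x;       // grid point handled by this column
    if (threadIdx.y == 0) acc[threadIdx.x] = 0.0f;
    __syncthreads();

    float partial = 0.0f;
    for (int m = threadIdx.y; m < numData; m += blockDim.y) {
        // the real kernel would multiply v[m] with the basis function value phi_g(x_m)
        partial += v[m];
    }
    atomicAdd(&acc[threadIdx.x], partial);            // merge the y threads' partial sums
    __syncthreads();

    if (threadIdx.y == 0 && g < numGrid)
        u[g] = acc[threadIdx.x];
}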

CUDA supports floating point atomic additions in hardware only for single precision. Self-coded atomic operations, which do not rely directly on hardware synchronization, lead to bad performance. So we decided to support this y thread concept only for single precision.

Kernel Calls The CUDAKernels module also provides sample kernel calls which are prepared to handle a dimensionality of up to 100 using template kernels. A call with a greater number of dimensions will not be able to use template kernels. The prespecified block size is 640, since this gave the best performance in the tests we carried out. The number of y threads used for a specific kernel call is determined automatically at run time.
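The dispatch from the run-time dimensionality to the compiled template instances can be pictured as follows (illustrative only, reusing the multTransposeSketch template from above; the real module prepares instances for up to 100 dimensions):

// Maps the run-time dimensionality to a compile-time template instantiation.
// Only a few cases are shown here.
void launchMultTransposeSketch(int dims, dim3 grid, dim3 block,
                               const float *level, const float *index,
                               const float *data, const float *alpha,
                               float *v, int numGrid, int numData) {
    switch (dims) {
    case 2: multTransposeSketch<2><<<grid, block>>>(level, index, data, alpha, v, numGrid, numData); break;
    case 3: multTransposeSketch<3><<<grid, block>>>(level, index, data, alpha, v, numGrid, numData); break;
    case 5: multTransposeSketch<5><<<grid, block>>>(level, index, data, alpha, v, numGrid, numData); break;
    // ... further cases up to the supported maximum ...
    default:
        // a non-template fallback kernel would be launched here
        break;
    }
}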

5 KERNEL BENCHMARK

In order to ensure the functionality of the kernel, it has been tested heavily. Later, we used these test programs to perform benchmarks on the kernel without any other piece of software than the test drivers and the kernels themselves. The results are presented in this section.

5.1 Configuration

The kernel has been tested on several NVIDIA graphics cards, namely a GeForce 560Ti, a GeForce GTX680, a Tesla K10 and a Tesla K20. On the K10 only one of the two available GPUs has been used. Ubuntu 12.04 LTS (kernel 3.2.0-37) served as operating system, using the NVIDIA Linux graphics driver 304.64 for the 560Ti, GTX680 and K10. The K20 was installed in a Red Hat Linux machine with kernel 2.6.32-279 and NVIDIA Linux graphics driver 310.19. All cards were installed on a PCIe 2.0 slot. CUDA compilation tools, release 5.0, V0.2.1221 were used to compile the executables for all the benchmarks with the compiler flags -O3 -use_fast_math -arch=sm_30 -DFORCE_UNROLL. Since the 560Ti only supports compute capability 2.0, we used -arch=sm_20 for its benchmarks. For benchmarks with a dimensionality higher than two, we additionally set the -DUSE_SHARED flag.

The benchmarks have been conducted in single and double precision. As a reference, we also ran a CPU based kernel, which was implemented according to Heinecke [1], just like the CPU calculations in the KernelWrapper. It has been shared-memory parallelized with OpenMP, using all available cores of an Intel Core i7-2600 (@ 3.4 GHz). This processor is based on the Sandy Bridge architecture and has four physical cores. Besides those, it uses Intel's simultaneous multithreading implementation Hyper-Threading Technology (HTT) [2]. This makes a total of eight parallel OpenMP tasks. The CPU based calculation was hosted on the Ubuntu machine named above. GCC 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) has been used with the following compiler flags: -fopenmp -mfpmath=sse -msse3 -O3.

Random numbers in [0, 1) created by rand() of the glibc served as input for the data set and the grid’s levels and indices.

Every shown result has been averaged over ten tests. The speedup S = T_CPU / T_GPU is shown in parentheses, rounded to two decimal places. The times are as exact as they can be measured using sys/time.h. Here, only the results of the linear no-boundary version are presented. The versions with modified boundary functions perform a bit worse, but the ratio between the different cards is very similar.
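A sketch of how such a wall-clock measurement around a kernel call might be taken with sys/time.h (the kernel name is hypothetical; the synchronization is needed because kernel launches are asynchronous):

#include <sys/time.h>
#include <cuda_runtime.h>

// current wall-clock time in seconds
static double wallTime() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// usage around a (hypothetical) kernel call:
// double t0 = wallTime();
// someKernel<<<gridSize, blockSize>>>(/* ... */);
// cudaDeviceSynchronize();                 // wait for the kernel to finish
// double tGPU = wallTime() - t0;
// double speedup = tCPU / tGPU;            // S = T_CPU / T_GPU as shown in the tables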

The following abbreviations will be used for the sake of brevity: ds (data set size), gs (grid size), dims (number of dimensions), sp (single precision), dp (double precision).

5.2 Results

5.2.1 Dimensional comparison

We fixed the data set and grid size and varied the number of dimensions. Calculations have been carried out using single precision. The results are shown in table 1.

Remarks

• The 560Ti has too few registers, so it was not able to calculate the 21 dimensions using 640 threads per block. Reducing the number of threads per block would have solved this problem, but we didn’t want to vary them.

5.2.2 Data Set Size Comparison

In this benchmark, the dimensions were fixed and the data set and grid size were varied. These calculations were also carried out using single precision. Table 2 shows the results.

5.2.3 Single-double precision comparison

This final benchmark compares the single precision performance versus the double precision performance of the cards. We chose a moderate problem size for this benchmark. The results are shown in table 3.

5.3 Conclusion

As one can easily see, the card performing best in our conducted benchmarks is the GTX680. It has a clock rate of 1006 MHz [7]. The Tesla K10 has a clock rate of 745 MHz [4]. Since the GTX680 and the Tesla K10 both use the Kepler architecture, this explains most of the difference. Also, we do not explicitly make use of shader model 3.x's innovations. Implicitly, through -arch=sm_30, the compiler may use some of the benefits discussed in [3], like more resident blocks and threads per multiprocessor or twice as many registers per multiprocessor.

The 560Ti runs at 822 MHz [6]. This is approximately 15% slower than the clock frequency of the GTX680. However, the GTX680 performs up to 1.86 times faster than the 560Ti in our benchmarks. Besides the shader clock frequency, the GTX680 has a higher memory clock frequency and a newer architecture. According to NVIDIA, the peak performance of the GTX680 is 2.44 times higher than that of the 560Ti (under optimal conditions; only fused multiply-add instructions). Since we include data transfers, this number is not realistic for our benchmarks. Besides this, our instructions are far from being pure multiply-add instructions; we even have to call max and abs functions.

In single precision the K20 is as good as the GTX680. The more interesting part of this card is the double precision performance. As one can see, its quotient T_sp/T_dp is almost comparable to that of the CPU. This is a tremendously higher double precision performance than that of any other tested card.

5.3.1 Performance

Finally, we ran performance tests. The tests were carried out on different configurations, some with multiple cards or GPUs. We used test sizes where the kernels roughly reach their best overall performance. Therefore, ds must equal gs since, again, both kernels were executed in sequence. For configurations with multiple cards, the distribution of the problem between the cards was based on the data set: an evenly distributed part of the data set was associated with each card or GPU. Each then calculated its respective intermediate result and from this its final result. Afterwards the multiple vectors were added together using CUDA again.


Figure 4: See Table 1 for description. All times in seconds.

             dims = 1              dims = 11             dims = 21
i7-2600      7.86s (1)             51.35s (1)            1min 40.3s (1)
560Ti        0.294924s (26.65)     1.221051s (42.05)     -
GTX680       0.172549s (45.55)     0.818410s (62.74)     1.348612s (74.37)
K10 (single) 0.259565s (30.28)     1.215397s (42.25)     2.017326s (49.72)
K20          0.393801s (19.95)     0.858973s (59.78)     1.579738s (63.49)

Table 1: Variation of dims; parameters were: sp, ds = 102400, gs = 30720.

Figure 5: See Table 2 for description. All times in seconds.

             ds = gs = 12800       ds = gs = 128000          ds = gs = 1280000
i7-2600      1.309966s (1)         2min 14.043222s (1)       3h 47min 31.705827s (1)
560Ti        0.043141s (30.36)     3.482182s (38.49)         5min 52.770630s (38.7)
GTX680       0.024311s (53.88)     1.913565s (70.05)         3min 9.091099s (72.2)
K10 (single) 0.035734s (36.66)     2.868357s (46.73)         4min 38.845643s (48.96)
K20          0.037600s (34.84)     2.049184s (65.41)         3min 11.137712s (71.42)

Table 2: Variation of ds and gs; parameters were: sp, dims = 5.

To accumulate the results of two cards we simply transferred one of the results to the other card using cudaMemcpyPeer. Then this card added the two results together.
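A sketch of this two-card accumulation (device ids, sizes and the add kernel are assumptions; not the actual code):

#include <cuda_runtime.h>

// Adds the partial result of device 1 to the partial result residing on device 0.
__global__ void addVectors(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] += src[i];
}

void accumulateTwoCards(float *d0Result, int dev0,
                        const float *d1Result, int dev1, int n) {
    cudaSetDevice(dev0);
    cudaDeviceEnablePeerAccess(dev1, 0);          // allow direct device-to-device copies

    float *d0Buffer;
    cudaMalloc(&d0Buffer, n * sizeof(float));
    // copy device 1's partial result over to device 0
    cudaMemcpyPeer(d0Buffer, dev0, d1Result, dev1, n * sizeof(float));

    addVectors<<<(n + 255) / 256, 256>>>(d0Result, d0Buffer, n);
    cudaDeviceSynchronize();
    cudaFree(d0Buffer);
}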

The timings we measured include — as above — all memory transfers and, for multiple cards or GPUs, the accumulation of the result. Table 4 shows the achieved effective floating point operations per second for single precision. These do not count the operations which are really executed by the GPU but rather the operations needed for the abstract algorithms as shown in Algorithms 1 and 2.

Configuration    eff. GFLOPS
560Ti            (300)
K10 (1 GPU)      420
K10 (2 GPUs)     800
GTX680           620
2x GTX680        1200

Table 4: Performance overview at gs = ds = 128000, dims = 20. For multiple cards or GPUs, ds was multiplied by the total number of cards. On the 560Ti, only 16 dimensions were used (see above).

One can see that the use of two cards does not double the performance. This is most likely due to the fact that the two-card benchmark had to accumulate the two result parts after the actual kernel execution.

For a comparison to the OpenCL version built into SG++, see section 7.

6 TESTS

Our product has been tested with different kinds of big data sets to ensure that it works correctly with every kind of data set.

6.1 Data Sets

To compare our results to others, we first tested the checker board data set and the five dimensional astrophysical data set (“DR5”). The “DR5” data set is a real-world data set that contains photometric data. Regression analysis on it allows astrophysicists to predict the spectroscopic red shift of galaxies. The checker board data set has a size of 30,000 and three dimensions. DR5 has 431,908 instances and five dimensions. Both data sets are included within SG++.

As already mentioned above, we tested our product with a lot of different data sets that vary in their size, number of dimensions and number of attributes. We wanted to cover every kind of data set, so that we can be confident our product works correctly regardless of the data set's size or dimensionality.

Afterwards we processed the “Contraceptive Method Choice” (CMC) data set. It is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The “Contraceptive Method Choice” data set has 1473 instances and ten attributes. The goal is to predict the current contraceptive method choice of a non-pregnant woman based on her demographic and socio-economic characteristics [10]. After that we sampled the “Spambase” data set, which has a size of 4601 and 57 dimensions. It contains attributes of emails. Its goal is to find out whether an email is spam or not [9].

The “Poker Hand” data set was the next one we tested. It has 11 dimensions and a size of 1,025,010, which makes it one of the biggest data sets we tested. The goal is to find out if a poker hand is a Straight Flush, Royal Flush, two pair, etc.


Figure 6: See Table 3 for description. All times in seconds.

             sp                        dp                        T_dp/T_sp
i7-2600      2min 40.414001s (1)       2min 49.805392s (1)       1.06
560Ti        3.804234s (42.17)         16.934206s (10.03)        4.45
GTX680       2.599252s (61.72)         13.633006s (12.46)        5.24
K10 (single) 3.619410s (44.32)         20.696951s (8.2)          5.72
K20          2.470339s (64.93)         3.875494s (43.82)         1.57

Table 3: Single vs. double precision benchmark. Parameters were: ds = 102400, gs = 102400, dims = 10.

The data consists of 10 numbers indicating the suit (hearts, spades, diamonds or clubs) and the value (2, 3, 4, etc.) of 5 cards, followed by an 11th number that indicates the poker hand (Straight Flush, Royal Flush, two pair, etc.) [12].

The liver disorders data set has 6 dimensions, which are medical attributes of persons. The objective is to classify the 681 instances according to whether the person suffers from a liver disorder or not.

The objective of the skin segmentation data set is to segment people into groups by their skin color [8]. Therefore it has 4 attributes (the RGB values of the skin colors and a group). It has 245,057 instances.

6.2 Configuration

We did the regression analysis or classification, respectively, on the GTX680 with the following parameters:

parameter                        value
start level                      3
λ                                1e-6
CG max. iterations               250
CG eps.                          1e-4
#refinements                     6
refinement threshold             0.0
#points refined                  100
CG max. iterations (first steps) 20
CG eps. (first steps)            1e-1

Table 5: Test parameters for the NativeClassifyBenchmark of SG++. For all tests the respective data set has been split into two thirds training instances and one third testing instances.

When there was a big imbalance between grid size and data set size (number of training instances) and the data set had only very few dimensions, we chose to use y threads rather than shared memory for single precision. The results concerned in the next section are marked with an asterisk.

6.3 Performance

Since we did not vary the parameters for any data set, the standard linear grid was sometimes not able to cover all the features of a particular data set. Hence, the mean squared error of the regression was too big to call it successful. We put the results concerned in parentheses.

Be aware that the resulting timings of this benchmark also depend on SG++, since we relied on it for creating and refining the sparse grid and for the abstract learning process. For timings regarding only the authors' CUDA part, see section 5.

The results are shown in table 6.

6.3.1 Remarks

Considering the results and the size of the poker hand data set, it obviously needs quite exact calculations, so that double precision has an advantage over single precision, leading to a less refined grid and fewer CG iterations.

7 CONCLUSIONS

Heinecke achieved with his OpenCL implementation, for the DR5 data set using single precision and a linear grid, an execution time of 740 seconds. He performed these tests on an NVIDIA GTX 470. Since we did not have this particular card for testing purposes, one has to compare them by a rule-of-thumb estimation. The NVIDIA GTX 470 has a peak performance of about 1088 GFLOPS, the NVIDIA GTX680 has around 3090 GFLOPS. So the card Heinecke used was approximately three times slower in peak performance. If we divide his result by three, a comparison should be more equitable. The quotient is 740/3 ≈ 246 seconds. This is still higher than our achieved 189 seconds. The conclusion might be that CUDA is better suited for doing heavy computations on NVIDIA graphics cards.

CUDA Features Since we directly relied on CUDA, we could make use of CUDA specific features:

• CUDA streams — we are able to upload/download and compute on different data simultaneously (a minimal sketch follows after this list).

• Pre-compiled kernels — this can actually be a disadvantage or lead to more work — see the template kernel we had to use in order to unroll.

• Explicit exploitation of the memory hierarchy — we made use of shared memory to share the same data among all threads in a block as fast as possible.

• Explicit memory management — by the use of cudaMalloc() and cudaFree() one is more flexible in memory management compared to OpenCL’s buffer object, especially when it comes to working with streams.

• Atomic-operations — they allow for rapid merging of results of different threads. (Although they are not the best choice most of the time.)
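As a minimal sketch of the streams point above (buffer names, chunking and the omitted kernel are invented; truly asynchronous copies additionally require page-locked host memory):

#include <cuda_runtime.h>
#include <cstddef>

// Two streams process independent chunks: while one chunk is being computed,
// the next one can already be transferred.
void pipelineSketch(const float *hIn, float *hOut,
                    float *dIn[2], float *dOut[2],
                    int numChunks, size_t chunkSize) {
    cudaStream_t streams[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&streams[i]);
    size_t chunkBytes = chunkSize * sizeof(float);

    for (int c = 0; c < numChunks; ++c) {
        int s = c % 2;
        cudaMemcpyAsync(dIn[s], hIn + c * chunkSize, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        // the actual kernel would be launched into the same stream here, e.g.
        // processChunk<<<gridSize, blockSize, 0, streams[s]>>>(dIn[s], dOut[s], chunkSize);
        cudaMemcpyAsync(hOut + c * chunkSize, dOut[s], chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}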

The Implementation We also implemented the kernels a bit differently, which might be relevant for the measurements:

We used restricted pointers for all pointers to device memory in the kernels (restricted pointers are a promise not to point to aliased memory, so the compiler is able to optimize more aggressively). The OpenCL kernels are only compiled with strict aliasing rules (-fstrict-aliasing), which is a less limiting variant of restricted pointers.
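In the kernels this looks roughly like the following signature (simplified, illustrative names):

// __restrict__ promises the compiler that the pointers do not alias each other,
// which allows loads to be reordered and cached more aggressively.
__global__ void multKernelSketch(const float *__restrict__ level,
                                 const float *__restrict__ index,
                                 const float *__restrict__ data,
                                 const float *__restrict__ source,
                                 float *__restrict__ result,
                                 int numGrid, int numData) {
    // ... body as in the earlier sketches ...
}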


precision  grid type  Checker Board  DR5       CMC        Poker Hand  Spambase     Liver Disorders  Skin Segmentation
sp         linear     5.96707s*      189.045s  0.091428s  41.1256s    (3.98101s)   1.26726s         (1.62102s)
sp         mod        7.62147s*      302.329s  0.506413s  114.398s    64.535s      2.96597s         7.27053s
dp         linear     8.10462s       210.813s  0.055357s  21.4762s    (7.81965s)   2.55708s         (1.40641s)
dp         mod        13.5172s       574.58s   1.28509s   366.136s    148.695s     7.67746s         8.53089s

Table 6: Timings of the conducted data set regressions and classifications, respectively. For a description of the data sets see section 6.1, for the configuration see section 6.2 and for performance remarks see section 6.3.

Also, we implemented the possibility to use y threads: if there is a strong disproportion between data set size and grid size, several threads can be used to compute the same grid point.

The kernels are implemented as templates, which ensures the unrolling of the innermost loop at compile time. The OpenCL implementation does not have to rely on this feature since it is compiled at run-time and so the innermost loop can always be unrolled; but this has to be done at run-time.

In general it can be assumed that specific CUDA commands are better optimized than the corresponding OpenCL commands, which must be more general, and furthermore that NVIDIA's CUDA drivers are of better quality than their OpenCL drivers. Sadly, using CUDA means a loss of portability. As stated above, run-time compiled code can be a huge benefit when it comes to creating code that is well suited for a specific task which can only be determined at run-time. There may be many more benefits of using OpenCL, but the authors are not very familiar with OpenCL. As a matter of fact, the stakeholders of a project should ponder which technique is better suited for their particular project.

REFERENCES

[1] Alexander Heinecke and Dirk Pflüger. Emerging architectures enable to boost massively parallel data mining using adaptive sparse grids. pages 14–16, 2011.
[2] Intel Corporation. Intel Core i7-2600 Processor. Website, March 2013. http://ark.intel.com/de/products/52213.
[3] NVIDIA Corporation. NVIDIA CUDA C Programming Guide, 2012. Version 4.2.
[4] NVIDIA Corporation. Tesla K10 GPU Accelerator - Board Specification, November 2012. http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf.
[5] NVIDIA Corporation. A Supervised Machine Learning Algorithm for Arrhythmia Analysis. Website, March 2013. http://archive.ics.uci.edu/ml/datasets/Arrhythmia.
[6] NVIDIA Corporation. GeForce GTX 560 Ti and GeForce GTX 550 Ti. Website, March 2013. http://www.nvidia.com/object/product-geforce-gtx-560ti-gtx-550ti-us.html.
[7] NVIDIA Corporation. NVIDIA GeForce GTX 680. Website, March 2013. http://www.nvidia.in/object/geforce-gtx-680-in.html#pdpContent=2.
[8] Rajen B. Bhatt et al. Efficient skin region segmentation using low complexity fuzzy decision tree model. In IEEE-INDICON, pages 1–4, Ahmedabad, India, 2009.
[9] M. Hopkins, E. Reeber, G. Forman, and J. Suermondt. Spambase data set. http://archive.ics.uci.edu/ml/datasets/Spambase.
[10] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. In Machine Learning, 1999.
[11] Dirk Pflüger. Spatially adaptive sparse grids for high-dimensional problems. pages 103–108, 2010.
[12] R. Cattral, F. Oppacher, and D. Deugo. Evolutionary Data Mining with Automatic Rule Generalization. Recent Advances in Computers, Computing and Communications. This was a slightly different dataset that had more classes, and was considerably more difficult.
[13] Alan O. Sykes. An introduction to regression analysis. http://www.law.uchicago.edu/files/files/20.Sykes_.Regression.pdf.
