4.4 Discussion
As discussed in the introduction, more specifically in chapter 2, mini-batch approaches are not new in the clustering community and encountered a great success when applied to standard k-means [40]. In his work, Sculley showed how a mini-batch Stochastic Gradient Descent (SGD) procedure converges faster than regular GD. However he proposed to set the size of mini-batches to a rather small value, namely ≈ 103, and to fix an a-priori number of iterations for the algorithm. Our suggestion here is quite different, indeed the number of iterations is by construction equal to the number of mini-batches B in order to exploit the entire dataset. Moreover, a major difference with the SGD procedure proposed by Sculley is here represented by the inner loop. We actually believe that iterating each mini-batch up to convergence can lead to a better minimization of the cost function and to a less noisy procedure. To prove this point in the experimental section the reader can find a comparison between the here proposed algorithm and the mini-batch SGD procedure proposed by Sculley.
We stress also the fact that our parallelization approach is rather differ-ent when compared to parallel patch clustering algorithms [42] as they were discussed in the introductory chapter. Indeed, we don’t parallelize across mini-batches assigning one mini-batch per node. Instead, we parallelize the iterations within each mini-batch thus allowing the algorithm to better cope with large sample size. In Fig.4.3 one can better appreciate such difference in a scenario where the available computational resources are limited with respect to the size of K. In such realistic case of application it is evident how our algorithm provides a better approximation to the single-batch re-sult. Indeed the DKK algorithm, being able to distribute the Ki partial kernel matrix over the entire network of nodes can cope with significantly larger mini-batches of the data per iteration.
Chapter 4. Distributed Kernel K-means
Figure 4.3: Graphical representation of the two different mini-batch par-allelization approaches of Patch clustering and DKK. A system of 4 nodes with limited amount of memory per node Rn (represented as the area of the central gray squares) is taken into consideration with respect to a kernel ma-trix K (in blue) that requires 16Rnavailable memory. In order to cope with such matrix the Patch algorithm requires a subdivision of the data into 4 small mini-batches and runs in a single trivially parallel iteration. The pro-posed DKK algorithm when dealing with the same matrix is able to gather a global results in two iterations, requiring however just 2 larger mini-batches.
The quality of the global clustering result is expected to be higher for the DKK algorithm since it take into consideration N22 kernel elements whereas the patch algorithm take into consideration a smaller fraction of them i.e.
N2 4
56
Chapter 5
GPU Accelerated DKK for Clustering MD Data
In the previous chapter we introduced a novel clustering engine, namely Dis-tributed Kernel K-means (DKK). We showed how starting from a proper reformulation of the original kernel k-means algorithm is possible to devise an efficient distribution strategy that together with the proposed two-fold approximation is able to reduce dramatically the computational burden and the memory footprint of exact kernel methods. It is worth observing how-ever that, even if the number of kernel evaluations in the proposed DKK algorithm is cut down to N × sNB, the performances still depend heavily on the actual implementation of such kernel matrix evaluation step. This observation is particularly relevant if one desires to perform clustering over molecular dynamics trajectories, indeed as we discussed in Chapter 2 such scenario require a similarity measure based on Minimum Root Min Square Displacement e.g. gaussian kernel in the form of:
K(xi, xj) = e−
RMSD(xi,xj )2 σ2
The evaluation step of such kind of complex kernel matrix or sub-matrix can represents a major bottleneck in the proposed algorithm affecting the overall performances.
In this chapter we tackle the problem proposing an accelerated version of the DKK algorithm which is able to run on state-of-the-art heterogeneous computational platforms. At first we discuss an offload acceleration strategy carefully designed to exploit the iterative nature of the mini-batch algorithm.
Even though such acceleration strategy can be implemented in principle for all sort of offload accelerators (general purpose GPUs (gpGPUs) as well as Intel Many Integrated Core architectures) we focus the attention in the second part of the chapter on the design of an actual CUDA implementation for nVIDIA Graphic Processing Units (GPUs). The reason for this choice has to be found in both the popularity of such accelerators on the market and
Chapter 5. GPU Accelerated DKK for Clustering MD Data
on the great control that CUDA offers to the developer allowing low-level memory optimizations.
5.1 Offload acceleration strategy
In the following section we discuss how the mini-batch structure of the algo-rithm can be exploited in order to design an effective acceleration strategy.
We will consider an offload acceleration model where the host processor and the target device have separate memory address spaces and communicate via a bus with limited bandwidth (e.g. PCIe) with respect to the processor-memory standard bus.
As already discussed the bottleneck of the computation in real appli-cation scenarios is usually the kernel matrix evaluatoin. The evaluation of such large kernel matrix or sub-matrix perfectly fits the massively parallel architecture of nowadays accelerators, therefore it is a reasonable choice to offload that portion of the computation. One of the key elements for an ef-ficient acceleration scheme is the overlapping in time between the host and the target workload [63].
There is no hope to find an overlapping scheme within the inner loop of the algorithm since the entire procedure depends on the kernel matrix elements to be completed, therefore we concentrate our efforts on the outer loop. Each i-th iteration depends on the previous ones, in order to initialize the set of labels ui. This is what prevents the algorithm to be trivially parallel forcing to run just one mini-batch per time. However if one considers the first two steps of each outer loop iteration i.e. mini-batch fetch Xi and kernel matrix evaluation Ki it is clear that they can be performed independently for each i. We exploit this feature, instructing the target device to compute the kernel matrix K(i+1)while the host processor executes the inner loop of the algorithm on the i-th mini-batch.
The offload procedure is detailed in Fig. 5.1. The overall performance gain for such acceleration strategy however heavily depends on the acceler-ator side implementation of the kernel matrix evaluation. For this reason the following section deals with the details of a CUDA implementation for the Root Mean Square Deviation (RMSD) kernel matrix evaluation.