Implementation parameters - Design and HPC implementation of unsupervised Kernel methods in the

Figure 5.4: Example showing how to implement a rotational index to avoid shared memory bank conflicts while computing the summations described in Eqs. 5.9, 5.10 and 5.11. We recall that superscripts refer to frame indices whereas subscripts refer to atom indices within a frame. (left) A straightforward implementation of the summation causes all the threads to work on the same atom at each iteration. This causes shared memory bank conflicts that strongly affect performance. (right) A simple rotational atom index is used so that each thread works on a different atom while computing the summation: ridx = (n + tIdx.x + tIdx.y) % B.
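The rotation shown in the right panel can be checked with a few lines of Python. This is only a minimal sketch, not the thesis code: the helper names `rotated_index` and `atoms_per_step` and the block size B = 4 are illustrative assumptions, and GPU threads are emulated as (tIdx.x, tIdx.y) pairs. The index is taken as (n + tIdx.x + tIdx.y) % B, as used in Algorithm 3.

```python
def rotated_index(n, tx, ty, B):
    """Rotational atom index r = (n + tIdx.x + tIdx.y) % B."""
    return (n + tx + ty) % B


def atoms_per_step(B, ty):
    """For one row of threads (fixed tIdx.y), list the atom index
    read by each thread tIdx.x = 0..B-1 at every step n = 0..B-1."""
    return [[rotated_index(n, tx, ty, B) for tx in range(B)] for n in range(B)]


B = 4  # illustrative block size (assumption, not a value from the text)
for n, row in enumerate(atoms_per_step(B, ty=0)):
    print(f"n={n}: atoms read by tIdx.x=0..{B - 1} -> {row}")
```

At every step the B threads of a row read B distinct atoms, and over B consecutive steps each thread still visits every atom exactly once, so the accumulated sums are the same as in the straightforward loop.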

5.3 Implementation parameters

A final remark concerns the implementation parameters, i.e. f and B. Tuning these parameters to the hardware is crucial in order to reach peak performance. Of the two, f is the simpler to choose: looking at the hardware specifications, one can verify the existence of a vectorized load instruction and choose f accordingly (e.g. if the hardware provides a 64-bit vectorized load instruction, one can safely set f = 2). To choose the block size B properly, we instead suggest looking at two hardware properties: the available shared memory M and the number of threads scheduled together W (i.e. the warp size for NVIDIA GPUs). We want the number of threads per block B² to be an integer multiple of W, and we would like to reduce the amount of shared memory required by each block in order to increase the relative occupancy of the multiprocessors:

B² = iW, i > 1 (5.12)

4 × 6f B² << M (5.13)

For instance, in the experimental section discussed later, where an NVIDIA GTX 980 board is used, we set f = 2, since an 8-byte vectorized load instruction is available, and B = 8, which requires 3 kB of shared memory per block out of the 96 kB available.
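The two constraints can be checked numerically for the quoted values. The sketch below is illustrative only: the helper names are hypothetical, the warp size W = 32 and the 4-byte element size are assumptions, and since the text gives no threshold for "much smaller than M", the code simply reports the fraction of shared memory used.

```python
def shared_bytes_per_block(f, B, bytes_per_value=4, n_buffers=6):
    """Left-hand side of Eq. 5.13: 4 x 6 f B^2 bytes.
    The six buffers are ix, iy, iz, jx, jy, jz; 4 bytes per value is an assumption."""
    return bytes_per_value * n_buffers * f * B * B


def block_is_warp_multiple(B, W=32):
    """Eq. 5.12: the B^2 threads of a block should be an integer multiple of W.
    W = 32 is assumed here."""
    return (B * B) % W == 0


f, B = 2, 8        # values quoted in the text for the GTX 980
M = 96 * 1024      # available shared memory in bytes, as quoted in the text
used = shared_bytes_per_block(f, B)
print(used, used / M, block_is_warp_multiple(B))
```

With f = 2 and B = 8 this gives 3072 bytes, i.e. the 3 kB quoted above, which is about 3% of the 96 kB available.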

Chapter 5. GPU Accelerated DKK for Clustering MD Data

5.4 Discussion

We close this chapter by noting that NVIDIA GPUs are by far the most common many-core accelerators on the market and that all major Molecular Dynamics (MD) simulation software tools [65, 66] now offer a CUDA-accelerated version of their code. It should therefore be clear how important it is for post-processing tools to run natively on GPU-accelerated architectures. The proposed DKK algorithm, together with the accelerated implementation presented in this chapter, fulfills this requirement.

It is also worth stressing that, to the best of our knowledge, the low-level MD data structure optimization proposed here is novel and represents an advancement with respect to other proposed CUDA RMSD implementations, e.g. [67].


Algorithm 3: Gaussian kernel evaluation with minimum RMSD metric, pseudocode for a many-core implementation.

input: dataset X, number of atoms N_a, block size B, vector type size f, block index bIdx, thread index tIdx

accumulators: S_xx, S_xy, S_xz, S_yx, S_yy, S_yz, S_zx, S_zy, S_zz, G_i, G_j

shared memory variables: buffers for frames [iB, (i + 1)B − 1]: ix, iy, iz; buffers for frames [jB, (j + 1)B − 1]: jx, jy, jz

output: kernel matrix K

tID = tIdx.y*B + tIdx.x
offi = bIdx.y * 3BN_a/f
offj = bIdx.x * 3BN_a/f
for r ← 0, N_a/(fB) do
    ix[tID] = X[offi + tID]
    iy[tID] = X[offi + B*B + tID]
    iz[tID] = X[offi + 2*B*B + tID]
    load in the same way jx, jy, jz
    offi += B*B*3
    offj += B*B*3
    sync block threads
    for n ← 0, fB − 1 do
        r = (n + tIdx.x + tIdx.y) % B
        S_xx += ix[tIdx.y*B + r] * jx[tIdx.x*B + r]
        update in the same way S_xy, S_xz, S_yx, S_yy, S_yz, S_zx, S_zy, S_zz
        G_i = G_i + ix[tIdx.y*B + r]² + iy[tIdx.y*B + r]² + iz[tIdx.y*B + r]²
        G_j = G_j + jx[tIdx.x*B + r]² + jy[tIdx.x*B + r]² + jz[tIdx.x*B + r]²
    end
    sync block threads
end
compute c0, c1, c2, c3, c4 according to Eq. 5.4
for iterations ← 0, MAX_ITERATIONS do
    λ -= (c4*λ⁴ + c3*λ³ + c2*λ² + c1*λ + c0) / (4*c4*λ³ + 3*c3*λ² + 2*c2*λ + c1)
end
msd = (G_i + G_j − 2*λ) / N_a
K[bIdx.y*B + tIdx.y, bIdx.x*B + tIdx.x] = exp(−msd/σ²)
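The last steps of Algorithm 3 (Newton iterations on the quartic polynomial, then the Gaussian kernel value) can be sketched in plain Python. This is a hedged illustration rather than the GPU implementation. The function names are hypothetical, and the coefficients c0..c4 are taken as given because Eq. 5.4 is not reproduced here. The starting value of λ is not stated in this section, so starting from an upper bound on the largest root is an assumption.

```python
import math


def newton_largest_root(c, lam0, max_iterations=50):
    """Newton iterations on c4*l^4 + c3*l^3 + c2*l^2 + c1*l + c0, starting from lam0.
    If lam0 lies above all (real) roots, the iterates approach the largest root."""
    c0, c1, c2, c3, c4 = c
    lam = lam0
    for _ in range(max_iterations):
        p = (((c4 * lam + c3) * lam + c2) * lam + c1) * lam + c0   # polynomial, Horner form
        dp = ((4 * c4 * lam + 3 * c3) * lam + 2 * c2) * lam + c1   # its derivative
        if dp == 0.0:
            break  # flat tangent: stop rather than divide by zero
        lam -= p / dp
    return lam


def gaussian_kernel(G_i, G_j, lam, n_atoms, sigma):
    """Kernel entry exp(-msd / sigma^2) with msd = (G_i + G_j - 2*lam) / N_a."""
    msd = (G_i + G_j - 2.0 * lam) / n_atoms
    return math.exp(-msd / sigma ** 2)


# Toy polynomial (l-1)(l-2)(l-3)(l-4), chosen only because its largest root, 4, is known.
coeffs = (24.0, -50.0, 35.0, -10.0, 1.0)   # (c0, c1, c2, c3, c4)
root = newton_largest_root(coeffs, lam0=10.0)
print(root, gaussian_kernel(5.0, 5.0, 5.0, n_atoms=10, sigma=1.0))
```

The toy coefficients have no physical meaning; they only make the convergence of the iteration easy to verify. When λ equals (G_i + G_j)/2 the mean squared deviation vanishes and the kernel entry is 1.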

Chapter 6

A Principal Paths Finding Algorithm in Kernel Space

The previous two chapters introduced two rather technological advancements in the field of Unsupervised Learning (UL) for Molecular Dynamics (MD). Both the Distributed Kernel K-means (DKK) algorithm and its accelerated version were developed following the requirements of clustering problems applied to MD datasets, i.e. large datasets that cannot be embedded in a vector space and that require an expensive distance matrix evaluation. In Chapter 3 a further connection between MD and UL was highlighted: there we showed how MD not only pushes the technological aspects of UL, as an increasingly demanding domain of application, but is also able to inspire entirely new UL principles.

In this respect, in the same chapter, starting from the concept of Minimum Free Energy Path (MFEP), we introduced that of Principal Paths in Data Space as a natural solution to the problem of inferring a smooth path that connects a starting sample to an ending one while locally passing through the middle of the data.

Hereafter we show how a kernel-based method can be derived to approximate such paths. We introduce a novel regularized cost function, obtained from the standard kernel k-means cost by adding a 1D topology imposed as a set of harmonic restraints, together with fixed boundary conditions (i.e. fixed starting and ending cluster centers). The minimization of this cost naturally leads to an Expectation Maximization (EM) algorithm that will be discussed both in the original space and in a kernel space. From an algorithmic standpoint, one may think of a Principal Path as a regularized path, with fixed boundaries, attracted towards the local center of mass of the data.

As will become clear, the quality and the smoothness of the solution found by this algorithm are governed by the regularization parameter. Informally, we can think of this central parameter as the one that controls the trade-off between the smoothness of the path and how closely it passes through the data. A large portion of the chapter is therefore dedicated to the model selection phase for this parameter. More specifically, a Bayesian maximal evidence principle that allows blindfolded in-sample model selection is derived.

Throughout the chapter the following notation will be used:

• N is the number of samples.

• N_C is the number of cluster prototypes.

• φ(·) : R^d → R^d′ is the possibly non-linear transformation mapping the d-dimensional input space into a d′-dimensional transformed one.

• X is the N × d matrix of samples x_i arranged in a row-wise fashion, i.e. X_i,· = x_i.

• K is the N × N kernel matrix defined as K_i,j = ⟨φ(x_i), φ(x_j)⟩.

• W is the N_C × d′ matrix of cluster prototypes w_i arranged in a row-wise fashion.

• |w_i| represents the cardinality of the i-th cluster.

• u_i ∈ [1, N_C] is the label associated with the i-th sample x_i.

• w₀ and w_N_C₊₁ represent the boundary conditions of the algorithm, i.e. the starting and ending points of the inferred path.
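To make the notation concrete, the following sketch builds the kernel matrix K for a tiny dataset. The explicit feature map `phi` (the identity here, which yields the linear kernel) and the helper name `kernel_matrix` are illustrative assumptions. In a kernel method K is normally evaluated through a kernel function, without forming φ explicitly.

```python
def phi(x):
    """Hypothetical feature map; the identity gives the linear kernel."""
    return list(x)


def kernel_matrix(X):
    """K[i][j] = <phi(x_i), phi(x_j)> for samples stored as the rows of X."""
    F = [phi(x) for x in X]
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in F] for fi in F]


X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # N = 3 samples, d = 2
K = kernel_matrix(X)
for row in K:
    print(row)
```

The resulting matrix is N × N and symmetric, as expected for an inner-product kernel.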

6.1 The cost function

We now give a formal definition of the newly defined principal path learning problem by introducing a regularized cost function to be optimized. As usual in the context of regularized functional optimization we write a cost having the following form:

Ω(W, u, X, γ, λ) = γΩ_X(W, u, X) + λΩ_W(W) (6.1)

More specifically, the primal problem that we aim to optimize in order to infer a smooth transition path from the starting point w₀ to the ending point w_N_C₊₁ is the following:

As anticipated, this functional represents a regularized version of the standard kernel k-means cost already discussed in Chapter 2. Here the first and last clusters, namely w₀ and w_N_C₊₁, are kept fixed as boundary conditions, and a 1D topology is forced by the proposed quadratic regularization cost Ω_W, which introduces a set of harmonic restraints connecting subsequent cluster prototypes (w_i₊₁, w_i).

Figure 6.1: An example of index permutation Q to initialize a 1D topology, i.e. an N_C-segment curve. The initial cluster centers can be picked with standard k-means initialization algorithms such as k-means++, while the permutation Q can be derived by some rationale, such as a shortest path algorithm on top of a fully connected topology.

The resulting one-dimensional topology is assumed to be set beforehand by means of a simple permutation Q : Z⁺_N_C → Z⁺_N_C of the cluster indices as detailed in Fig.6.1. For the sake of convenience in the following, without loss of generality we will assume Q to be the identity so that Q(i) = i.

We stress that, as in the case of kernel k-means, Ω is a non-convex function with respect to W, mainly because of the hard assignment of the latent labels, i.e. δ(u_i, j). The two hyperparameters γ and λ regulate the trade-off between data fitting and smoothness of the inferred path, as shown in Fig. 6.2(a). It is worth noting that, for the sole purpose of optimization, only the ratio s = λ/γ is relevant, since γ is a simple scaling factor. Hereafter we will refer to s as the regularization parameter; the reason for keeping γ and λ separate during the derivation will become clear later, when a Bayesian model selection framework is introduced.
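The role of the ratio s can be illustrated with a small numerical sketch. The cost form used below is an assumption inferred from the description: a k-means data term plus harmonic restraints between consecutive prototypes, with the 1/2 factors taken from Lemma 1. It is written in input space, with φ as the identity, and the function names and toy data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def path_cost(W, u, X, gamma, lam, w_start, w_end):
    """Assumed form of the cost:
    gamma/2 * sum_i |x_i - w_{u_i}|^2 + lam/2 * sum_i |w_{i+1} - w_i|^2,
    with w_0 = w_start and w_{N_C+1} = w_end kept fixed."""
    chain = [w_start] + list(W) + [w_end]            # w_0, w_1..w_NC, w_{NC+1}
    data_term = sum(sq_dist(x, W[label - 1]) for x, label in zip(X, u))  # labels are 1-based
    smooth_term = sum(sq_dist(chain[i + 1], chain[i]) for i in range(len(chain) - 1))
    return 0.5 * gamma * data_term + 0.5 * lam * smooth_term


X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]   # toy samples
W = [[1.0, 0.5]]                           # a single free prototype (N_C = 1)
u = [1, 1, 1]                              # every sample assigned to it
base = path_cost(W, u, X, 1.0, 2.0, [0.0, 0.0], [2.0, 0.0])
scaled = path_cost(W, u, X, 2.0, 4.0, [0.0, 0.0], [2.0, 0.0])
print(base, scaled)
```

Multiplying γ and λ by the same factor only rescales the cost, so the minimizing W is unchanged and depends on the two hyperparameters only through s = λ/γ.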

Lemma 1. The primal cost function has the following compact trace formulation:

Ω(W, u, X, γ, λ) = (γ/2) Tr(W^T A_X(u) W − 2C^T W + D_X) + (λ/2) Tr(W^T A_W W − 2B^T W + D_W) (6.3)

Proof. Eq. 6.3 is derived by construction, having defined the following matrices:

• The N_C × d′ boundary condition matrix:

B_i,· =

• The N_C × N_C hessian matrix of the standard k-means cost function i.e. A_Xi,j =

In document Design and HPC implementation of unsupervised Kernel methods in the context of molecular dynamics (Page 65-73)