7.5.2 Kullback-Leibler Divergence Implementation

Unlike those for the Euclidean distance, the update rules for the divergence do not depend as heavily on matrix multiplications. Thus, in the case of the multiplicative rules, four kernels are required (SumW, SumH, UpdateH_MD and UpdateW_MD).
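For reference, the multiplicative update rules for the Kullback-Leibler divergence are written out below in the standard Lee and Seung form. It is assumed here that they match the rules (7.6) and (7.7) cited in this section, which are not reproduced in this excerpt. The two denominators are the sums that the SumW and SumH kernels compute.

```latex
H_{a\mu} \leftarrow H_{a\mu}\,
  \frac{\sum_{i} W_{ia}\, V_{i\mu} / (WH)_{i\mu}}{\sum_{k} W_{ka}}
\qquad
W_{ia} \leftarrow W_{ia}\,
  \frac{\sum_{\mu} H_{a\mu}\, V_{i\mu} / (WH)_{i\mu}}{\sum_{\nu} H_{a\nu}}
```

Each numerator is a sum of element-wise ratios rather than a plain matrix product, which is why these rules lean less on matrix multiplication than the Euclidean ones.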

The SumW kernel calculates $\sum_k W_{ka}$ for each column $a$ of W and puts the result in a vector of dimension $r$, see (7.6). Similarly, SumH calculates $\sum_\nu H_{a\nu}$ for each row $a$ of H, placing the result in a vector also of dimension $r$, see (7.7). Listing 7.4 presents the code of the SumW kernel, which uses the reduction process described earlier in Section 2.5 (page 32). Note that, for simplicity, the code presented here assumes that W is stored in column-major order. However, the GPUMLib code available for download can be configured to support both row-major and column-major orders.
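As a sequential point of comparison, the two sums can be sketched in plain C++ on the host. This is not GPUMLib code: the function names, the column-major layout of H, the lower bound applied in `sumRowsOfH` and the value of `kSmallValue` (a stand-in for `SMALL_VALUE_TO_ADD_DENOMINATOR`) are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical lower bound standing in for SMALL_VALUE_TO_ADD_DENOMINATOR,
// whose actual value is not given in the text.
constexpr float kSmallValue = 1e-30f;

// Column sums of W (d x r, column-major): sums[a] = sum over k of W[k][a].
std::vector<float> sumColumnsOfW(const std::vector<float>& W,
                                 std::size_t d, std::size_t r) {
    std::vector<float> sums(r, 0.0f);
    for (std::size_t a = 0; a < r; ++a) {
        for (std::size_t k = 0; k < d; ++k) sums[a] += W[d * a + k];
        sums[a] = std::max(sums[a], kSmallValue);  // safe to divide by later
    }
    return sums;
}

// Row sums of H (r x m, assumed column-major): sums[a] = sum over v of H[a][v].
std::vector<float> sumRowsOfH(const std::vector<float>& H,
                              std::size_t r, std::size_t m) {
    std::vector<float> sums(r, 0.0f);
    for (std::size_t v = 0; v < m; ++v)
        for (std::size_t a = 0; a < r; ++a) sums[a] += H[r * v + a];
    for (float& s : sums) s = std::max(s, kSmallValue);
    return sums;
}
```

Both functions return a vector with one entry per factor, matching the dimension $r$ of the vectors filled by the kernels.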

The kernels UpdateH_MD and UpdateW_MD update all the elements of H and W, respectively. Both kernels work in a similar manner, so we focus on the inner workings of the UpdateH_MD kernel.

138 7 Non-Negative Matrix Factorization (NMF)

Listing 7.4 One of the CUDA kernels (SumW) used to implement the NMF algorithm for the multiplicative update rules, considering the Kullback-Leibler divergence.

template <int blockSize>
__global__ void SumW(cudafloat * W, int d, cudafloat * sumW) {
    extern __shared__ cudafloat w[];

    // Each thread accumulates a strided share of column blockIdx.x of W.
    w[threadIdx.x] = CUDA_VALUE(0.0);
    for (int k = threadIdx.x; k < d; k += blockSize) {
        w[threadIdx.x] += W[d * blockIdx.x + k];
    }
    __syncthreads();

    // Tree reduction of the partial sums held in shared memory.
    if (blockSize >= 1024) {
        if (threadIdx.x < 512) w[threadIdx.x] += w[threadIdx.x + 512];
        __syncthreads();
    }
    if (blockSize >= 512) {
        if (threadIdx.x < 256) w[threadIdx.x] += w[threadIdx.x + 256];
        __syncthreads();
    }
    if (blockSize >= 256) {
        if (threadIdx.x < 128) w[threadIdx.x] += w[threadIdx.x + 128];
        __syncthreads();
    }
    if (blockSize >= 128) {
        if (threadIdx.x < 64) w[threadIdx.x] += w[threadIdx.x + 64];
        __syncthreads();
    }

    // The remaining steps are carried out by the first 32 threads.
    if (threadIdx.x < 32) {
        volatile cudafloat * _w = w;
        if (blockSize >= 64) _w[threadIdx.x] += _w[threadIdx.x + 32];
        if (blockSize >= 32) _w[threadIdx.x] += _w[threadIdx.x + 16];
        if (blockSize >= 16) _w[threadIdx.x] += _w[threadIdx.x + 8];
        if (blockSize >= 8) _w[threadIdx.x] += _w[threadIdx.x + 4];
        if (blockSize >= 4) _w[threadIdx.x] += _w[threadIdx.x + 2];
        if (blockSize >= 2) _w[threadIdx.x] += _w[threadIdx.x + 1];

        // Thread 0 stores the column sum, bounded below by a small constant.
        if (threadIdx.x == 0) {
            cudafloat sum = w[0];
            if (sum < SMALL_VALUE_TO_ADD_DENOMINATOR) {
                sum = SMALL_VALUE_TO_ADD_DENOMINATOR;
            }
            sumW[blockIdx.x] = sum;
        }
    }
}
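The two phases of the listing (a strided accumulation per thread, followed by a tree reduction that halves the number of active threads at each step) can be followed more easily in a sequential emulation. The sketch below is plain host C++ written for illustration only; `blockReduceEmulated` is a hypothetical name, and the loop over `t` stands in for the threads of one block.

```cpp
#include <cstddef>
#include <vector>

// Sequential emulation of one thread block summing the values of a column.
// blockSize is assumed to be a power of two, as the kernel's unrolled
// reduction steps imply.
float blockReduceEmulated(const std::vector<float>& column,
                          std::size_t blockSize) {
    std::vector<float> shared(blockSize, 0.0f);  // plays the role of w[]

    // Phase 1: "thread" t accumulates elements t, t + blockSize, ...
    for (std::size_t t = 0; t < blockSize; ++t)
        for (std::size_t k = t; k < column.size(); k += blockSize)
            shared[t] += column[k];

    // Phase 2: tree reduction; each pass halves the number of active threads.
    for (std::size_t stride = blockSize / 2; stride > 0; stride /= 2)
        for (std::size_t t = 0; t < stride; ++t)
            shared[t] += shared[t + stride];

    return shared[0];  // the value thread 0 would write to sumW[blockIdx.x]
}
```

On the GPU the inner loops over `t` run concurrently, so a column of `d` elements needs roughly `d / blockSize` accumulation steps plus `log2(blockSize)` reduction steps.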


To update a given element $H_{a\mu}$, we need to access all the elements in column $a$ of W and all the elements in column $\mu$ of both V and WH, as shown in Figure 7.6. Hence, the CUDA thread assigned to update $H_{a\mu}$ needs to access the same elements of V and WH as the threads assigned to process the elements $H_{i\mu}$ ($i \neq a$). Similarly, it needs to access the same elements of W as the threads processing the elements $H_{aj}$ ($j \neq \mu$).
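The access pattern described above can be made concrete with a sequential sketch of the update of a single element. It is written in plain host C++, not taken from GPUMLib, and it assumes the standard multiplicative rule for the divergence, column-major storage for every matrix and strictly positive entries in WH; `updatedH` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Assumed layout (column-major): W is d x r, H is r x m, V and WH are d x m.
// sumW[a] holds the sum of column a of W, already bounded away from zero.
float updatedH(const std::vector<float>& W, const std::vector<float>& H,
               const std::vector<float>& V, const std::vector<float>& WH,
               const std::vector<float>& sumW,
               std::size_t d, std::size_t r, std::size_t a, std::size_t mu) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < d; ++i) {
        // Reads column a of W and column mu of both V and WH.
        acc += W[d * a + i] * V[d * mu + i] / WH[d * mu + i];
    }
    return H[r * mu + a] * acc / sumW[a];
}
```

Calls that share `mu` read the same column of V and WH, and calls that share `a` read the same column of W. This overlap is what the kernel exploits by caching the data in shared memory.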

The rationale behind organizing the threads into blocks is to share as much information as possible among the threads within a block. This substantially improves the kernel performance, since (as we said before) accessing the shared memory is significantly faster than accessing the global device memory. Given the amount of shared memory available per block in our devices (see Table A.2, page 202), we found that we could store pieces of at least 32×32 elements of the matrices W and $(V)_{ij}/(WH)_{ij}$. Thus, ideally our kernel should be executed in blocks of 32×32 = 1024 threads. However, the devices available at the time (a GeForce 8600 GT and a GeForce GTX 280) supported a maximum of 512 threads per block (see Tables A.2 (page 202) and 2.1 (page 22)). To solve this problem and create a kernel that is able to run on any device, while maximizing the amount of information shared, each block contains 32×16 = 512 threads (blockDim.x = 32, blockDim.y = 16). However, each thread gathers two elements of W, V and WH instead of one, and updates two elements of H (observe Figure 7.6). Therefore, although each block contains only 512 threads, 1024 elements are updated. Because twice as much data is shared per block, this strategy improves the speedups obtained.
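One way to picture how 512 threads cover a tile of 1024 elements is sketched below in plain host C++. The pairing used here (thread `(tx, ty)` handles rows `ty` and `ty + 16` of column `tx`) is an assumption made for illustration; the text does not state which two elements each thread processes, and `tileCoverage` is a hypothetical helper.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kTile = 32;       // tile edge: 32 x 32 elements
constexpr std::size_t kBlockDimX = 32;  // blockDim.x given in the text
constexpr std::size_t kBlockDimY = 16;  // blockDim.y given in the text

// Counts how often each tile element is visited when every thread (tx, ty)
// handles rows ty and ty + 16 of column tx (assumed pairing).
std::array<int, kTile * kTile> tileCoverage() {
    std::array<int, kTile * kTile> visits{};
    for (std::size_t ty = 0; ty < kBlockDimY; ++ty) {
        for (std::size_t tx = 0; tx < kBlockDimX; ++tx) {
            visits[ty * kTile + tx] += 1;                 // first element
            visits[(ty + kBlockDimY) * kTile + tx] += 1;  // second element
        }
    }
    return visits;
}
```

Under this mapping every element of the 32×32 tile is visited exactly once, so the block does the work of 1024 threads while staying within the limit of 512 threads per block.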

The additive update rules for the divergence require only two kernels (UpdateH_AD and UpdateW_AD), which are similar to UpdateH_MD.
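The additive rules themselves are not reproduced in this excerpt. As a point of reference only, the gradient-descent form given by Lee and Seung for the divergence is shown below, with $\eta_{a\mu}$ a step size; the exact form implemented by UpdateH_AD may differ.

```latex
H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu}
  \left[ \sum_{i} W_{ia}\, \frac{V_{i\mu}}{(WH)_{i\mu}} - \sum_{i} W_{ia} \right]
```

Choosing $\eta_{a\mu} = H_{a\mu} / \sum_i W_{ia}$ turns this expression into the multiplicative rule, which shows how closely the two families of updates are related.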

7.6 Results and Discussion

7.6.1 Experimental Setup

We conducted all the NMF-related experiments in the face recognition domain. Face recognition has many potential applications in distinct areas, such as military, law enforcement, anti-terrorism, commercial and human-computer interaction [238]. Over the past decades, face recognition has become an increasingly important area, attracting researchers from pattern recognition, neural networks, image processing, computer vision, machine learning and psychology, among others [238, 260]. However, it remains a very challenging and complex problem, because the appearance of individuals is affected by numerous factors (e.g. illumination conditions, facial expressions, usage of glasses) and current systems are still no match for the human perception system [260]. A detailed survey of existing techniques and methods for face recognition can be found in Zhao et al. [260]. Typically, solving this problem involves several phases: (i) segmentation of the faces, (ii) extraction of relevant features from the face regions, (iii) recognition and (iv) verification [260]. However, in this work, we concentrate on the last phases, leaving out the segmentation phase. Accordingly, instead of relying on handcrafted features, we use the NMF algorithm to (ii) extract features directly from the raw image data. These are then used to (iii) create and (iv) validate a face recognition model, using the process described in Section 7.3.

Fig. 7.6 Processing carried out, for each element $H_{a\mu}$, by the UpdateH_MD kernel

Altogether, our testbed experiments used three different databases: the Center for Biological and Computational Learning (CBCL), Yale and AT&T databases. These are described in Section A.4 (see pages 210, 215 and 207, respectively).

The CBCL face database #1 was used specifically to test and validate the GPU parallel implementations of the NMF algorithm. The tests were conducted using the 2,429 face images of the training dataset. The matrix containing the face images was created by placing one image per column. Thus, in this case, matrix V is composed of 361 rows (19×19 pixels) and 2,429 columns (samples).
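Placing one image per column can be sketched as follows in plain host C++. The function name `buildV` and the column-major storage of V are assumptions made for illustration; each image is taken to be already flattened into a vector of pixel values.

```cpp
#include <cstddef>
#include <vector>

// Packs flattened images into a column-major matrix V in which column j
// holds image j. All images are assumed to have the same number of pixels.
std::vector<float> buildV(const std::vector<std::vector<float>>& images) {
    std::vector<float> V;
    if (!images.empty()) V.reserve(images.size() * images.front().size());
    for (const auto& image : images) {
        // In column-major order, appending an image appends one column.
        V.insert(V.end(), image.begin(), image.end());
    }
    return V;
}
```

With the 2,429 CBCL training images of 19×19 pixels, the result holds 361 × 2,429 values, and pixel `i` of image `j` is found at index `361 * j + i`.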

Aside from testing and validating the GPU parallel implementations of the NMF algorithm, the remaining two databases (Yale and AT&T) were also used to evaluate the effectiveness of the classification method presented in Section 7.3, as well as the performance of the SSNMF method described earlier (see Section 7.4). To this end, the leave-one-out-per-class cross-validation method was used. Thus, in the case of the Yale database, the training matrix, $V_{train}$, is composed of 4,096 rows and 150 columns, while the test matrix, $V_{test}$, is composed of 4,096 rows and 15 columns. Similarly, in the case of the AT&T (ORL) database, $V_{train}$ is composed of 10,304 (112×92) rows and 360 columns and $V_{test}$ is composed of 10,304 rows and 40 columns.
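The column counts above follow from holding out one image of every person in each fold. A minimal sketch of one such fold is given below in plain host C++. The indexing scheme (image `j` of person `p` has the global index `p * imagesPerPerson + j`) and the function name are hypothetical, and the exact fold construction used in the experiments is not described in this section.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Builds one leave-one-out-per-class fold: image `held` of every person
// goes to the test set, all the other images go to the training set.
// Returns {training indices, test indices}.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
leaveOneOutPerClass(std::size_t persons, std::size_t imagesPerPerson,
                    std::size_t held) {
    std::vector<std::size_t> train, test;
    for (std::size_t p = 0; p < persons; ++p)
        for (std::size_t j = 0; j < imagesPerPerson; ++j)
            (j == held ? test : train).push_back(p * imagesPerPerson + j);
    return {train, test};
}
```

For 15 persons with 11 images each, a fold contains 150 training and 15 test columns; for 40 persons with 10 images each, it contains 360 and 40.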

In addition, the Yale face database was also used to further test and validate the ATS (described in Section 3.4). Accordingly, we used the process described in Section 7.3 to combine the NMF algorithm with the MBP algorithm. Moreover, in this particular experiment, we decided to use hold-out validation instead of the leave-one-out-per-class cross-validation method, so that we could train more networks using the ATS. Hence, in order to build the training dataset, we randomly selected 8 images of each person (corresponding to approximately 3/4 of the database images). Consequently, the remaining 3 images per person, encompassing approximately 1/4 of the images, were used to create the test dataset. Thus, in this case the training matrix, $V_{train}$, is composed of 4,096 rows and 120 columns, while the test matrix, $V_{test}$, is composed of 4,096 rows and 45 columns.
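The hold-out split described above can be sketched as a per-person random selection. The code below is plain host C++ written for illustration: the function name, the indexing scheme (image `j` of person `p` has index `p * imagesPerPerson + j`) and the use of a seeded `std::mt19937` are assumptions, since the text does not say how the random selection was made.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// For every person, randomly picks `trainPerPerson` images for training and
// leaves the remaining ones for testing.
// Returns {training indices, test indices}.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
holdOutPerPerson(std::size_t persons, std::size_t imagesPerPerson,
                 std::size_t trainPerPerson, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<std::size_t> train, test;
    std::vector<std::size_t> order(imagesPerPerson);
    for (std::size_t p = 0; p < persons; ++p) {
        std::iota(order.begin(), order.end(), std::size_t{0});
        std::shuffle(order.begin(), order.end(), rng);  // random image order
        for (std::size_t j = 0; j < imagesPerPerson; ++j)
            (j < trainPerPerson ? train : test)
                .push_back(p * imagesPerPerson + order[j]);
    }
    return {train, test};
}
```

Calling `holdOutPerPerson(15, 11, 8, seed)` yields 120 training and 45 test indices, matching the column counts of the two matrices.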

With the exception of the experiments conducted to determine the GPU implementations’ speedups, the Euclidean distance implementation with the multiplicative update rules of the NMF algorithm was used since, as we shall see in the next section, this is the fastest implementation.

Before running the experiments, histogram equalization was applied to the images of the datasets in order to reduce the influence of the surrounding illumination (see Section A.6). Moreover, all the tests were performed using computer system 2 (see Table A.1, page 202).
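For readers unfamiliar with the preprocessing step, a textbook form of histogram equalization for 8-bit greyscale images is sketched below in plain host C++. It is an illustration only: the procedure described in Section A.6 may differ in its details, and `equalizeHistogram` is a hypothetical name.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Remaps grey levels so that their cumulative distribution becomes
// approximately linear over the range 0..255.
std::vector<std::uint8_t> equalizeHistogram(
        const std::vector<std::uint8_t>& image) {
    std::array<std::size_t, 256> cdf{};
    for (std::uint8_t v : image) ++cdf[v];                       // histogram
    for (std::size_t g = 1; g < 256; ++g) cdf[g] += cdf[g - 1];  // cumulative

    // Smallest non-zero cumulative count (darkest grey level present).
    std::size_t cdfMin = 0;
    for (std::size_t g = 0; g < 256 && cdfMin == 0; ++g) cdfMin = cdf[g];

    const std::size_t denom = image.size() - cdfMin;
    if (denom == 0) return image;  // empty or constant image: nothing to do

    std::vector<std::uint8_t> out(image.size());
    for (std::size_t p = 0; p < image.size(); ++p) {
        const std::size_t num = (cdf[image[p]] - cdfMin) * 255;
        out[p] = static_cast<std::uint8_t>((num + denom / 2) / denom);
    }
    return out;
}
```

The darkest level present maps to 0 and the brightest to 255, so images taken under different lighting conditions end up using the same intensity range.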

From: Machine Learning for Adaptive Many Core Machines: A Practical Approach (Studies in Big Data), 2015 edition, pages 148-152