
5.1.2 Host entry point for permutation kernels

Finally, we need a host function that serves as the entry point for initializing the data and invoking the GPU kernels. Algorithm 5.3 presents a host function that initializes the data and then chooses a suitable GPU kernel for computing the stride permutation. We assume that grids and thread blocks are one-dimensional.


Algorithm 5.3 HostGeneralStridePermutation($\vec{X}$, $\vec{Y}$, $K$, $N$, $k$, $s$, $r$, $b$)

input:

- two positive integers $k$ and $r$ as specified in the introduction,
- a positive integer $K$ representing the stride of the permutation,
- a positive integer $N$,
- a positive integer $s$ representing the size of shared memory for each thread block,
- a positive integer $b$ representing the size of a 1D thread block,
- a vector $\vec{X}$ having $N$ elements of $\mathbb{Z}/p\mathbb{Z}$ with $p = r^k + 1$, thus storing $N \times k$ machine-words, viewed as the row-major layout of the transposition of a matrix $M_0$ with $N$ rows and $k$ columns.

output:

- a vector $\vec{Y}$ having $N$ elements of $\mathbb{Z}/p\mathbb{Z}$ with $p = r^k + 1$, thus storing $N \times k$ machine-words, viewed as the row-major layout of the transposition of a matrix $M_1$ with $N$ rows and $k$ columns, storing the result of the stride permutation such that $\vec{Y} := L^N_K(\vec{X})$.

▷ In either case, initialize a 1D grid of dimension N/b
if b < K then
    KernelBasePermutationMultipleBlocks<<<N/b, b>>>($\vec{X}$, $\vec{Y}$, K, N, k, s, r)
else if b = k then
    KernelBasePermutationSingleBlock<<<N/b, b>>>($\vec{X}$, $\vec{Y}$, K, N, k, s, r)
end if
return ▷ End of kernel
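As a point of reference for what the kernels selected by Algorithm 5.3 are expected to produce, the following Python sketch computes a stride permutation on the CPU. It assumes the convention that $L^N_K$ gathers entries at stride $K$, that is, $y[iJ + j] = x[jK + i]$ with $N = KJ$, and it reads the layout described above as "word $c$ of element $n$ is stored at index $cN + n$". The function names are hypothetical and are not taken from the CUDA implementation.

```python
def stride_permutation(x, K):
    """Return L^N_K(x) for a flat list x of length N = K * J.

    Assumed convention: y[i*J + j] = x[j*K + i] for 0 <= i < K, 0 <= j < J.
    """
    N = len(x)
    if N % K != 0:
        raise ValueError("K must divide the vector length")
    J = N // K
    return [x[j * K + i] for i in range(K) for j in range(J)]


def permute_big_elements(X, K, N, k):
    """Permute N elements of k machine-words each (hypothetical helper).

    X stores k rows of N words (the transposed N x k matrix in row-major
    order), so the same permutation is applied to every row of words.
    """
    if len(X) != N * k:
        raise ValueError("X must hold N * k machine-words")
    Y = []
    for c in range(k):
        row = X[c * N:(c + 1) * N]
        Y.extend(stride_permutation(row, K))
    return Y


if __name__ == "__main__":
    # Six elements of two words each, permuted with stride K = 2.
    print(permute_big_elements(list(range(12)), K=2, N=6, k=2))
```

Because every row of words is permuted in the same way, a GPU kernel can treat the $k$ rows independently, which is one plausible reason for storing the data in this transposed layout.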

5.2 Profiling results

This section presents the profiling results for the CUDA implementations of Algorithms 5.1 and 5.2.

Figure 5.1 shows the profiling results for computing $L^{KJ}_K$ with $K = 256$ and $J = 4096$. For thread blocks of size $b = 256$ and shared memory holding $s = 2^{12}$ digits, each of the size of a machine-word, this implementation assigns 8 thread blocks to each stride permutation.

Figure 5.2 shows the profiling results for the implementation that assigns a single thread block to computing the permutation $L^{KJ}_K$, with $K = 16$ and $J = 2^{16}$.

The profiling results are measured for the following metrics:

1. achieved occupancy,
2. the total number of issued instructions per cycle (IPC),
3. instruction replay overhead,
4. throughput of loading data from global memory,
5. throughput of storing data to global memory, and
6. the efficiency, as a percentage, of global memory accesses.

Finally, the profiling data were collected on an NVIDIA GeForce GTX 760M card (hardware specifications are given in Appendix B).

Invocations  Metric Name           Metric Description               Min         Max         Avg
Device "GeForce GTX 760M (0)"
Kernel: kernel_permutation_256_general_permutated_v0(__int64, __int64*, __int64*, __int64*)
1            achieved_occupancy    Achieved Occupancy               0.124812    0.124812    0.124812
1            ipc                   Executed IPC                     0.085657    0.085657    0.085657
1            inst_replay_overhead  Instruction Replay Overhead      0.710638    0.710638    0.710638
1            gst_throughput        Global Store Throughput          3.9570GB/s  3.9570GB/s  3.9570GB/s
1            gld_throughput        Global Load Throughput           3.9609GB/s  3.9609GB/s  3.9609GB/s
1            gld_efficiency        Global Memory Load Efficiency    99.93%      99.93%      99.93%
1            gst_efficiency        Global Memory Store Efficiency   100.00%     100.00%     100.00%

Figure 5.1: Profiling results for the stride permutation $L^{KJ}_K$ for $K = 256$ and $J = 4096$.

Invocations  Metric Name           Metric Description               Min         Max         Avg
Device "GeForce GTX 760M (0)"
Kernel: kernel_permutation_16_permutated(__int64*, __int64*)
1            achieved_occupancy    Achieved Occupancy               0.248948    0.248948    0.248948
1            ipc                   Executed IPC                     0.087653    0.087653    0.087653
1            inst_replay_overhead  Instruction Replay Overhead      1.915645    1.915645    1.915645
1            gst_throughput        Global Store Throughput          3.9794GB/s  3.9794GB/s  3.9794GB/s
1            gld_throughput        Global Load Throughput           3.9872GB/s  3.9872GB/s  3.9872GB/s
1            gld_efficiency        Global Memory Load Efficiency    99.85%      99.85%      99.85%
1            gst_efficiency        Global Memory Store Efficiency   100.00%     100.00%     100.00%

Chapter 6 Big Prime Field FFT on GPUs

In this chapter, we explain how to compute the FFT of vectors of elements of $\mathbb{Z}/p\mathbb{Z}$ on GPUs. First, in Section 6.1, we briefly review the Cooley-Tukey FFT algorithm. Then, in Section 6.2, we explain an algorithm for computing the multiplication by twiddle factors on GPUs. In Section 6.3, we explain how, by using the six-step recursive FFT, we can compute the FFT through a base-case formula that is faster in practice. Next, in Section 6.4, we explain how to compute the FFT of vectors of any length over $\mathbb{Z}/p\mathbb{Z}$. Finally, in Section 6.5, we present profiling results for the CUDA implementation of the algorithms of this chapter.

6.1 Cooley-Tukey FFT

As we explained in Chapter 2, for computing the FFT of a vector of $N = KJ$ elements of $\mathbb{Z}/p\mathbb{Z}$, and for $\omega^N = 1$, the Cooley-Tukey FFT algorithm factorizes the computation in the following way:

$$\mathrm{DFT}_N = (\mathrm{DFT}_K \otimes I_J)\, D_{K,J}\, (I_K \otimes \mathrm{DFT}_J)\, L^N_K.$$

In this notation, $D_{K,J}$ represents the multiplication by the powers of $\omega$. Moreover, the diagonal twiddle matrix $D_{K,J}$ is defined as

$$D_{K,J} = \bigoplus_{j=0}^{K-1} \mathrm{diag}\left(1, \omega^{j}, \ldots, \omega^{j(J-1)}\right).$$
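The Cooley-Tukey factorization can be checked numerically on a toy instance. The Python sketch below applies the four factors from right to left and compares the result with a direct DFT. It makes several assumptions for illustration only: the small prime $p = 17$ with $\omega = 2$ (a primitive 8th root of unity modulo 17) stands in for the big prime field, $L^N_K$ is taken to gather entries at stride $K$, and all function names are hypothetical.

```python
P = 17  # toy prime (assumption); the text works with p = r^k + 1


def dft(x, w):
    """Direct O(n^2) DFT over Z/PZ with root of unity w."""
    n = len(x)
    return [sum(x[j] * pow(w, i * j, P) for j in range(n)) % P
            for i in range(n)]


def stride_permutation(x, K):
    """L^N_K under the assumed convention y[i*J + j] = x[j*K + i]."""
    J = len(x) // K
    return [x[j * K + i] for i in range(K) for j in range(J)]


def cooley_tukey(x, K, w):
    """Apply (DFT_K (x) I_J) D_{K,J} (I_K (x) DFT_J) L^N_K to x."""
    N = len(x)
    J = N // K
    y = stride_permutation(x, K)                       # L^N_K
    w_J = pow(w, K, P)                                 # root of order J
    y = [v for i in range(K)                           # I_K (x) DFT_J
         for v in dft(y[i * J:(i + 1) * J], w_J)]
    y = [y[i * J + b] * pow(w, i * b, P) % P           # twiddles D_{K,J}
         for i in range(K) for b in range(J)]
    w_K = pow(w, J, P)                                 # root of order K
    out = [0] * N
    for b in range(J):                                 # DFT_K (x) I_J
        column = dft([y[i * J + b] for i in range(K)], w_K)
        for a in range(K):
            out[a * J + b] = column[a]
    return out


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    print(cooley_tukey(x, K=2, w=2) == dft(x, 2))
```

The last step, $\mathrm{DFT}_K \otimes I_J$, reads its inputs at stride $J$, which is the kind of memory access pattern that is awkward on a GPU.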

In practice, the Cooley-Tukey FFT algorithm is not a suitable choice for implementation on GPUs, mostly because of the way it accesses memory. Therefore, we need an equivalent formula that is better suited to the structure of GPUs, that is, one that can efficiently exploit the block parallelism of GPUs. In terms of tensor notation, block parallelism is realized by tensor products of the form $I_J \otimes \mathrm{DFT}_K$, so we should convert our computations to that form. For this purpose, we use the six-step recursive FFT algorithm [10], which is expressed in the following way:

$$\mathrm{DFT}_N = L^N_K\, (I_J \otimes \mathrm{DFT}_K)\, L^N_J\, D_{K,J}\, (I_K \otimes \mathrm{DFT}_J)\, L^N_K.$$

With this formula, we can further expand the left factor $I_J \otimes \mathrm{DFT}_K$ to reduce all computations to a base-case $\mathrm{DFT}_K$. Accordingly, an efficient implementation of $\mathrm{DFT}_K$ yields a high-performance implementation of the FFT.
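The six-step formula can be sanity-checked in the same way as any other factorization of $\mathrm{DFT}_N$. The following Python sketch applies its six factors from right to left, so that every DFT acts on contiguous blocks, and compares the result with a direct DFT. As before, this is only an illustration under stated assumptions: the toy prime $p = 17$ with $\omega = 2$ replaces the big prime field, the stride permutation is taken to gather entries at the given stride, and the function names are hypothetical.

```python
P = 17  # toy prime (assumption); the text works with p = r^k + 1


def dft(x, w):
    """Direct O(n^2) DFT over Z/PZ with root of unity w."""
    n = len(x)
    return [sum(x[j] * pow(w, i * j, P) for j in range(n)) % P
            for i in range(n)]


def stride_permutation(x, K):
    """L^N_K under the assumed convention y[i*J + j] = x[j*K + i]."""
    J = len(x) // K
    return [x[j * K + i] for i in range(K) for j in range(J)]


def block_dft(x, size, w):
    """I_m (x) DFT_size: one DFT per contiguous block of length size."""
    return [v for start in range(0, len(x), size)
            for v in dft(x[start:start + size], w)]


def six_step(x, K, w):
    """L^N_K (I_J (x) DFT_K) L^N_J D_{K,J} (I_K (x) DFT_J) L^N_K."""
    N = len(x)
    J = N // K
    y = stride_permutation(x, K)                   # step 1: L^N_K
    y = block_dft(y, J, pow(w, K, P))              # step 2: I_K (x) DFT_J
    y = [y[i * J + b] * pow(w, i * b, P) % P       # step 3: D_{K,J}
         for i in range(K) for b in range(J)]
    y = stride_permutation(y, J)                   # step 4: L^N_J
    y = block_dft(y, K, pow(w, J, P))              # step 5: I_J (x) DFT_K
    return stride_permutation(y, K)                # step 6: L^N_K


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    print(six_step(x, K=2, w=2) == dft(x, 2))
```

In this form, steps 2 and 5 consist of independent DFTs on contiguous blocks, which is the shape that maps naturally onto one thread block per sub-transform, while the strided accesses are isolated in the three permutations.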

In document Fast Fourier Transforms over Prime Fields of Large Characteristic and their Implementation on Graphics Processing Units (Page 78-82)