5.1.2 Host entry point for permutation kernels
Finally, we need a host function that serves as the entry point for initializing the data and invoking the GPU kernels. Algorithm 5.3 presents a host function that initializes the data and then chooses a suitable GPU kernel for computing the stride permutation. We assume that grids and thread blocks are one-dimensional.
Algorithm 5.3 HostGeneralStridePermutation($\vec{X}$, $\vec{Y}$, $K$, $N$, $k$, $s$, $r$, $b$)
input:
- two positive integers $k$ and $r$ as specified in the introduction,
- a positive integer $K$ representing the stride of the permutation,
- a positive integer $N$,
- a positive integer $s$ representing the size of shared memory for each thread block,
- a positive integer $b$ representing the size of a 1D thread block,
- a vector $\vec{X}$ having $N$ elements of $\mathbb{Z}/p\mathbb{Z}$ with $p = r^k + 1$, thus storing $N \times k$ machine-words, viewed as the row-major layout of the transposition of a matrix $M_0$ with $N$ rows and $k$ columns.
output:
- a vector $\vec{Y}$ having $N$ elements of $\mathbb{Z}/p\mathbb{Z}$ with $p = r^k + 1$, thus storing $N \times k$ machine-words, viewed as the row-major layout of the transposition of a matrix $M_1$ with $N$ rows and $k$ columns, storing the result of the stride permutation such that $\vec{Y} := L^N_K(\vec{X})$.

▷ In either case, initializing 1D grid of dimension N/b
if b < K then
    KernelBasePermutationMultipleBlocks<<<N/b, b>>>($\vec{X}$, $\vec{Y}$, K, N, k, s, r)
else if b = k then
    KernelBasePermutationSingleBlock<<<N/b, b>>>($\vec{X}$, $\vec{Y}$, K, N, k, s, r)
end if
return ▷ End of Kernel
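To make the data layout and the permutation concrete, the following is a minimal sequential reference sketch in Python; it is not the CUDA implementation. It assumes the convention in which $L^N_K$ reads its input at stride $K$, that is, output element $aJ + b$ equals input element $bK + a$ with $J = N/K$, $0 \le a < K$ and $0 \le b < J$. This convention is an assumption and should be checked against the definition of the stride permutation used in this work. The function names are hypothetical. Digit $d$ of element $i$ is stored at position $dN + i$, which matches the $k \times N$ row-major layout described in Algorithm 5.3.

```python
def stride_permutation_index(i, K, N):
    """Destination index of element i under L^N_K.

    Assumed convention (not taken from the source): the element at input
    position b*K + a moves to output position a*J + b, where J = N // K.
    """
    J = N // K
    b, a = divmod(i, K)  # i = b*K + a with 0 <= a < K
    return a * J + b


def reference_stride_permutation(X, K, N, k):
    """Sequential reference for Y := L^N_K(X).

    X is a flat list of N*k machine-words; digit d of element i sits at
    index d*N + i (row-major layout of a k-by-N matrix).
    """
    assert N % K == 0 and len(X) == N * k
    Y = [0] * (N * k)
    for d in range(k):        # each row holds one digit of every element
        for i in range(N):    # permute the element index within the row
            Y[d * N + stride_permutation_index(i, K, N)] = X[d * N + i]
    return Y
```

For instance, with $N = 6$, $K = 2$ and $k = 1$, the input `[0, 1, 2, 3, 4, 5]` is mapped to `[0, 2, 4, 1, 3, 5]`: the elements at even positions come first, followed by those at odd positions.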
5.2 Profiling results
In this section, we present the profiling results for the CUDA implementations of Algorithms 5.1 and 5.2.
Figure 5.1 shows the profiling results for computing $L^{KJ}_K$ with $K = 256$ and $J = 4096$. For thread blocks of size $b = 256$ and shared memory of size $s = 2^{12}$ digits, each the size of a machine-word, this implementation assigns 8 thread blocks to each stride permutation.
Figure 5.2 shows the profiling results for the implementation that assigns one thread block to the permutation $L^{KJ}_K$, with $K = 16$ and $J = 2^{16}$.
The profiling results are measured for the following metrics:
1. the achieved occupancy,
2. the total number of issued instructions per cycle (IPC),
3. the instruction replay overhead,
4. the throughput of loading data from global memory,
5. the throughput of storing data to global memory, and
6. the efficiency percentage for accessing global memory.
As a final note, we collected the profiling data on an NVIDIA GeForce GTX 760M card (its hardware specifications are given in Appendix B).
Invocations  Metric Name           Metric Description               Min          Max          Avg
Device "GeForce GTX 760M (0)"
  Kernel: kernel_permutation_256_general_permutated_v0(__int64, __int64*, __int64*, __int64*)
          1  achieved_occupancy    Achieved Occupancy               0.124812     0.124812     0.124812
          1  ipc                   Executed IPC                     0.085657     0.085657     0.085657
          1  inst_replay_overhead  Instruction Replay Overhead      0.710638     0.710638     0.710638
          1  gst_throughput        Global Store Throughput          3.9570 GB/s  3.9570 GB/s  3.9570 GB/s
          1  gld_throughput        Global Load Throughput           3.9609 GB/s  3.9609 GB/s  3.9609 GB/s
          1  gld_efficiency        Global Memory Load Efficiency    99.93%       99.93%       99.93%
          1  gst_efficiency        Global Memory Store Efficiency   100.00%      100.00%      100.00%
Figure 5.1: Profiling results for the stride permutation $L^{KJ}_K$ for $K = 256$ and $J = 4096$.
Invocations  Metric Name           Metric Description               Min          Max          Avg
Device "GeForce GTX 760M (0)"
  Kernel: kernel_permutation_16_permutated(__int64*, __int64*)
          1  achieved_occupancy    Achieved Occupancy               0.248948     0.248948     0.248948
          1  ipc                   Executed IPC                     0.087653     0.087653     0.087653
          1  inst_replay_overhead  Instruction Replay Overhead      1.915645     1.915645     1.915645
          1  gst_throughput        Global Store Throughput          3.9794 GB/s  3.9794 GB/s  3.9794 GB/s
          1  gld_throughput        Global Load Throughput           3.9872 GB/s  3.9872 GB/s  3.9872 GB/s
          1  gld_efficiency        Global Memory Load Efficiency    99.85%       99.85%       99.85%
          1  gst_efficiency        Global Memory Store Efficiency   100.00%      100.00%      100.00%
Chapter 6
Big Prime Field FFT on GPUs
In this chapter, we explain how to compute the FFT of vectors of elements of $\mathbb{Z}/p\mathbb{Z}$ on GPUs. First, in Section 6.1, we briefly review the Cooley-Tukey FFT algorithm. Then, in Section 6.2, we explain an algorithm for computing the multiplication by twiddle factors on GPUs. In Section 6.3, we explain how, by using the six-step recursive FFT, we can compute the FFT through a base-case formula that is faster in practice. Next, in Section 6.4, we explain how to compute the FFT for vectors of any length over $\mathbb{Z}/p\mathbb{Z}$. Finally, in Section 6.5, we present profiling results for the CUDA implementation of the algorithms of this chapter.
6.1 Cooley-Tukey FFT
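As we explained in Chapter 2, for computing the FFT of a vector of $N = KJ$ elements of $\mathbb{Z}/p\mathbb{Z}$, and for $\omega$ with $\omega^N = 1$, the Cooley-Tukey FFT algorithm factorizes the computation in the following way:
$$\mathrm{DFT}_N = (\mathrm{DFT}_K \otimes I_J)\, D_{K,J}\, (I_K \otimes \mathrm{DFT}_J)\, L^N_K.$$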
In this notation, $D_{K,J}$ represents the multiplication by the powers of $\omega$. The diagonal twiddle matrix $D_{K,J}$ is defined as
$$D_{K,J} = \bigoplus_{j=0}^{K-1} \mathrm{diag}\left(1, \omega^{j}, \ldots, \omega^{j(J-1)}\right).$$
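The factorization can be checked numerically on a small example. The following Python sketch is a sequential reference over a toy prime field, not a GPU implementation. It assumes that the $j$-th diagonal block of $D_{K,J}$ holds $1, \omega^{j}, \ldots, \omega^{j(J-1)}$, and that $L^N_K$ reads its input at stride $K$ (output element $aJ + b$ equals input element $bK + a$). The stride convention, the function names, and the test field $\mathbb{Z}/17\mathbb{Z}$ are assumptions made for illustration.

```python
def dft(x, w, p):
    """Naive DFT over Z/pZ: X[k] = sum_m w^(k*m) * x[m] mod p."""
    n = len(x)
    return [sum(pow(w, k * m, p) * x[m] for m in range(n)) % p
            for k in range(n)]


def stride_perm(x, K):
    """y = L^N_K x under the assumed convention y[a*J + b] = x[b*K + a]."""
    J = len(x) // K
    return [x[b * K + a] for a in range(K) for b in range(J)]


def twiddle_diagonal(K, J, w, p):
    """Diagonal of D_{K,J}: block j holds 1, w^j, ..., w^(j*(J-1))."""
    return [pow(w, j * i, p) for j in range(K) for i in range(J)]


def cooley_tukey(x, K, w, p):
    """Apply (DFT_K (x) I_J) D_{K,J} (I_K (x) DFT_J) L^N_K, right to left."""
    N = len(x)
    J = N // K
    y = stride_perm(x, K)                                  # L^N_K
    wJ = pow(w, K, p)                                      # root of order J
    y = [v for a in range(K)                               # I_K (x) DFT_J
         for v in dft(y[a * J:(a + 1) * J], wJ, p)]
    d = twiddle_diagonal(K, J, w, p)
    y = [(di * yi) % p for di, yi in zip(d, y)]            # D_{K,J}
    wK = pow(w, J, p)                                      # root of order K
    out = [0] * N
    for b in range(J):                                     # DFT_K (x) I_J
        col = dft(y[b::J], wK, p)                          # stride-J slice
        for a in range(K):
            out[a * J + b] = col[a]
    return out
```

With $p = 17$ and $\omega = 2$, which has order 8 modulo 17, `cooley_tukey(x, K, 2, 17)` agrees with `dft(x, 2, 17)` for vectors of length 8 and $K \in \{2, 4\}$. The strided slices in the last step illustrate the memory access pattern of the factor $\mathrm{DFT}_K \otimes I_J$.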
In practice, the Cooley-Tukey FFT algorithm is not a suitable choice for implementation on GPUs, mostly because of its memory access pattern. Therefore, we need an equivalent formula that is better suited to the structure of GPUs, that is, one that can efficiently exploit the block parallelism of GPUs. In terms of tensor notation, block parallelism is realized by tensor products of the form $I_J \otimes \mathrm{DFT}_K$, so we should convert our computations to this form. For this purpose, we use the six-step recursive FFT algorithm [10], which is expressed in the following way:
$$\mathrm{DFT}_N = L^N_K\, (I_J \otimes \mathrm{DFT}_K)\, L^N_J\, D_{K,J}\, (I_K \otimes \mathrm{DFT}_J)\, L^N_K.$$
With this formula, we can further expand the factor $I_J \otimes \mathrm{DFT}_K$ to reduce all computations to a base-case $\mathrm{DFT}_K$. Accordingly, an efficient implementation of $\mathrm{DFT}_K$ yields a high-performance implementation of the FFT.
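As an illustration, the six factors of this formula can be applied one after another in a short sequential Python sketch and compared against a naive DFT. This is a reference model under stated assumptions, not the GPU implementation. It assumes that $L^N_K$ reads its input at stride $K$ (output element $aJ + b$ equals input element $bK + a$), and that block $j$ of $D_{K,J}$ holds $1, \omega^{j}, \ldots, \omega^{j(J-1)}$. The function names and the toy field $\mathbb{Z}/17\mathbb{Z}$ are hypothetical choices.

```python
def dft(x, w, p):
    """Naive DFT over Z/pZ: X[k] = sum_m w^(k*m) * x[m] mod p."""
    n = len(x)
    return [sum(pow(w, k * m, p) * x[m] for m in range(n)) % p
            for k in range(n)]


def stride_perm(x, K):
    """y = L^N_K x under the assumed convention y[a*J + b] = x[b*K + a]."""
    J = len(x) // K
    return [x[b * K + a] for a in range(K) for b in range(J)]


def block_dft(x, size, w, p):
    """I_m (x) DFT_size: an independent DFT on each contiguous block."""
    out = []
    for start in range(0, len(x), size):
        out.extend(dft(x[start:start + size], w, p))
    return out


def six_step_fft(x, K, w, p):
    """Apply L^N_K (I_J (x) DFT_K) L^N_J D_{K,J} (I_K (x) DFT_J) L^N_K."""
    N = len(x)
    J = N // K
    y = stride_perm(x, K)                        # step 1: L^N_K
    y = block_dft(y, J, pow(w, K, p), p)         # step 2: I_K (x) DFT_J
    y = [pow(w, j * i, p) * y[j * J + i] % p     # step 3: D_{K,J}
         for j in range(K) for i in range(J)]
    y = stride_perm(y, J)                        # step 4: L^N_J
    y = block_dft(y, K, pow(w, J, p), p)         # step 5: I_J (x) DFT_K
    return stride_perm(y, K)                     # step 6: L^N_K
```

With $p = 17$ and $\omega = 3$, which has order 16 modulo 17, `six_step_fft(x, 4, 3, 17)` agrees with `dft(x, 3, 17)` for vectors of length 16. Both DFT steps operate on contiguous blocks, and each call inside `block_dft` is independent of the others. This independence is what a GPU implementation can map onto thread blocks.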