2018 International Conference on Computer, Communication and Network Technology (CCNT 2018) ISBN: 978-1-60595-561-2
MR-Butterfly—A fast Fourier Transform Algorithm Based on MapReduce
Yu YU
Institute of Ocean Instruments and Metrology, Qilu University of Technology, Shandong Academy of Sciences, Qingdao 266100, China
Keywords: MapReduce model, FFT, Cooley-Tukey algorithm.
Abstract. Based on an analysis of the butterfly computation structure of the Cooley-Tukey algorithm, this paper proposes MR-Butterfly, a fast Fourier transform algorithm for large data sets. The algorithm exploits the fact that the butterfly computation structure of the fast Fourier transform can be determined in advance, so that the complex multiplications and complex additions of the Cooley-Tukey butterfly computation units can be processed in batches. The MR-Butterfly algorithm does not need to deal with synchronization and communication. It has a simple structure and is robust, general, and extensible. Experiments on large test data sets verify the validity of the algorithm.
Introduction
The fast Fourier transform (FFT) plays an extremely important role in scientific computing. In [1], terabit-scale integer multiplication is implemented using the Schönhage-Strassen algorithm and the MapReduce model. The Schönhage-Strassen algorithm is an integer multiplication algorithm based on the FFT. In [2], the discrete Fourier transform is used to reduce the dimensionality of large-scale time series data, and on that basis a similarity search method suitable for large-scale time series databases is proposed.
The Cooley-Tukey algorithm, first proposed in [3], has become the most commonly used FFT algorithm. In [4], the performance of the Cooley-Tukey algorithm on the IBM Cyclops many-core architecture is analyzed using PAR-R2-FFT, a parallel radix-2 decimation-in-time Fourier transform algorithm.
For an N-point FFT, the PAR-R2-FFT algorithm is composed of $\log_2 N$ stages of computing operations. The butterfly units are evenly distributed among the computing units for concurrent processing. During the calculation, pre-stored butterfly shuffling and butterfly coefficients are used, and the intermediate results of the algorithm are shared through the global shared memory. In [5], the proposed FFT algorithm consists of three steps. Step one bit-reverses the indices of the data sequence and divides the sequence into p blocks; the N/p data items of each block are input sequentially into the processing units PE[0] to PE[p-1], where p is the total number of processing units and N is the length of the data sequence. Steps two and three perform the actual FFT: step two corresponds to the sequentially executed part and step three to the part executed in parallel. The sequential execution section refers to the first $\log_2(N/p)$ stages of the N-point FFT, in which each execution unit computes the FFT of the N/p data items stored in its local memory. The parallel execution section represents the remaining $\log_2 p$ stages of the N-point FFT. The index distance
MR-Butterfly
The MR-Butterfly algorithm is based on the MapReduce model of cloud computing. Parallelizing the Cooley-Tukey butterfly operations makes the algorithm suitable for fast discrete Fourier transforms of massive data. The core idea is to exploit the fact that the butterfly computation structure of the fast Fourier transform can be determined in advance, and to perform centralized batch processing of the complex multiplications and complex additions in each Cooley-Tukey butterfly computation unit. Compared with traditional parallel fast Fourier transform algorithms, the advantage of MR-Butterfly is that it does not need to deal with synchronization and communication. The algorithm has a simple structure, is robust and scalable, and generalizes well across data volumes.
The core of the MR-Butterfly algorithm is the calculation of the butterfly shuffle coefficient $S_N^r$ and the calculation of the butterfly coefficient $W_N^r$. The following are detailed descriptions of $S_N^r$ and $W_N^r$.
Calculation of $S_N^r$
In the MR-Butterfly algorithm, this calculation can be abstracted as a solution function:
$$R[2] = \mathrm{SNR}(i, j, N) \tag{1}$$
The array $R$ is used to store the result of the transformation, namely the key $i$ and $i + a_{ij} \times 2^j$, where $i$ indicates the input number, $j$ indicates the current stage, and $N$ stands for the length of the sequence.
Solving the function $\mathrm{SNR}(i, j, N)$ directly saves the storage space of the matrix $T_{N \times \log_2 N}$ and is more suitable for parallel programs. In the MR-Butterfly algorithm, the direct calculation of $\mathrm{SNR}(i, j, N)$ is used.
Figure 1. Matrix division, panels (a) and (b).
As depicted in Figure 1-(a), each column of the matrix $T_{N \times \log_2 N}$ is divided into two areas. The black region has a transform matrix coefficient of -1. Therefore, solving the function $\mathrm{SNR}(i, j, N)$ is equivalent to determining whether input $i$ of stage $j$ lies in the black area of column $j$ of the matrix $T_{N \times \log_2 N}$.
The specific steps for solving the function $\mathrm{SNR}(i, j, N)$ directly are as follows:
a. In the case where $j = 0$, if $i$ is an even number, then $R[0] = i$, $R[1] = i + 1$. Otherwise, $R[0] = i - 1$, $R[1] = i$. Return $R$.
b. Let $m = 2^{j+1}$ and $k = [i/m]$, and calculate the relative position $rp$: if $i \le k \times m$, then $rp = k \times m - i - 1$; otherwise $rp = (k + 1) \times m - i - 1$.
c. Calculation of $R$:
If $rp < 0$, then $R[0] = i$, $R[1] = i + 2^j$, return $R$.
If $rp \le 2^j - 1$, then $R[0] = i - 2^j$, $R[1] = i$, return $R$. Otherwise, $R[0] = i$, $R[1] = i + 2^j$, return $R$.
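Steps a to c can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `snr` is hypothetical, and the bracket in $k = [i/m]$ is assumed to mean floor division.

```python
def snr(i, j, n):
    """Return R = [R[0], R[1]], the two input indices of the butterfly
    that input i belongs to at stage j (steps a-c of SNR(i, j, N)).

    n is the sequence length N; it is kept only to mirror the signature
    in the text and is not needed by the steps themselves.
    """
    # Step a: at stage 0 the partners are adjacent indices.
    if j == 0:
        return [i, i + 1] if i % 2 == 0 else [i - 1, i]

    # Step b: relative position of i inside its block of size m = 2^(j+1).
    m = 2 ** (j + 1)
    k = i // m  # assumption: [i/m] is read as floor division
    if i <= k * m:
        rp = k * m - i - 1
    else:
        rp = (k + 1) * m - i - 1

    # Step c: decide whether i is the first or the second butterfly input.
    half = 2 ** j
    if rp < 0:
        return [i, i + half]
    if rp <= half - 1:
        return [i - half, i]
    return [i, i + half]
```

Under these assumptions both members of a pair map to the same result, for example `snr(1, 1, 8)` and `snr(3, 1, 8)` both give `[1, 3]`, so the result can serve as a grouping key that each unit computes without consulting any other unit.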
Calculation of $W_N^r$
In the radix-2 decimation-in-time Cooley-Tukey algorithm, the transformation matrix of the butterfly operation structure corresponding to $W_N^r$ is as follows:
$$M_{N \times \log_2 N} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 2 \\ 0 & 2 & 3 \end{bmatrix}$$
Figure 2. Matrix $W_N^r$.
In the MR-Butterfly algorithm, this calculation can be abstracted as a function:
$$C[2] = \mathrm{WNR}(i, j, N) \tag{2}$$
Array $C$ is used to store the transformed results $W_N^r$ and $-W_N^r$, where $r$ indicates the relative position of the input serial number $i$, $j$ represents the current level, and $N$ represents the length of the sequence.
The calculation of $W_N^r$ follows the same procedure as that of $S_N^r$. The matrix $M_{N \times \log_2 N}$ is divided as in Figure 1-(b). The specific solution steps of the function $\mathrm{WNR}(i, j, N)$ are as follows:
a. In the case where $j = 0$, if $i$ is an even number, then $C[0] = 1$, $C[1] = 1$. Otherwise, $C[0] = 1$, $C[1] = -1$. Return $C$.
b. Let $m = 2^{j+1}$ and $k = [i/m]$, and calculate the relative position $rp$: if $i \le k \times m$, then $rp = k \times m - i - 1$; otherwise $rp = (k + 1) \times m - i - 1$.
c. Calculation of $C$:
If $rp < 0$, then $C[0] = 1$, $C[1] = 1$, return $C$.
If $rp \le 2^j - 1$, then let $w = \log_2 N$, $l = 2^j$, and $rV = 2^{w-j-1} \times (l - rp - 1)$. Then $real = \cos(-2 \times PI \times rV / N)$, $im = \sin(-2 \times PI \times rV / N)$, $C[0] = real + im \times \mathrm{i}$, $C[1] = -real - im \times \mathrm{i}$, with $\mathrm{i} = \sqrt{-1}$, return $C$.
If $rp > 2^j - 1$, then $C[0] = 1$, $C[1] = 1$, return $C$.
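The WNR steps can be sketched in the same way. Again this is only an illustration under stated assumptions: the names `relative_position`, `twiddle_exponent` and `wnr` are hypothetical, $[i/m]$ is read as floor division, and $N$ is assumed to be a power of two.

```python
import cmath


def relative_position(i, j):
    """Step b: relative position rp of input i at stage j."""
    m = 2 ** (j + 1)
    k = i // m  # assumption: [i/m] is read as floor division
    if i <= k * m:
        return k * m - i - 1
    return (k + 1) * m - i - 1


def twiddle_exponent(i, j, n):
    """Exponent rV of W_N used for input i at stage j (0 where C = [1, 1])."""
    if j == 0:
        return 0
    rp = relative_position(i, j)
    if rp < 0 or rp > 2 ** j - 1:
        return 0
    w = n.bit_length() - 1  # w = log2(N), assuming N is a power of two
    return 2 ** (w - j - 1) * (2 ** j - rp - 1)


def wnr(i, j, n):
    """Return C = [C[0], C[1]] following steps a-c of WNR(i, j, N)."""
    # Step a: stage 0 uses only the coefficients +1 and -1.
    if j == 0:
        return [1, 1] if i % 2 == 0 else [1, -1]
    rp = relative_position(i, j)
    # Step c, first and last cases: both coefficients are 1.
    if rp < 0 or rp > 2 ** j - 1:
        return [1, 1]
    # Step c, middle case: cos(-2*pi*rV/N) + i*sin(-2*pi*rV/N) and its negation.
    r_v = twiddle_exponent(i, j, n)
    c0 = cmath.exp(-2j * cmath.pi * r_v / n)
    return [c0, -c0]
```

For $N = 8$, evaluating `twiddle_exponent(i, j, 8)` for $i = 0, \dots, 7$ and $j = 0, 1, 2$ gives the rows (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 2, 0), (0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 2, 3), which match the entries of the matrix $M_{N \times \log_2 N}$ given in the text.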
From the above description, it can be seen that each computing unit can independently calculate $S_N^r$ and $W_N^r$ once the parameters $(i, j, N)$ are known, so the MR-Butterfly algorithm supports parallel processing.
Experiments
In the experiments, two sets of test data are used to verify the MR-Butterfly algorithm. The numbers of test data items are $2^{17}, 2^{18}, 2^{19}, 2^{20}, 2^{21}, 2^{22}, 2^{23}, 2^{24}$.
The experimental environment is: 3 PCs, each with an Intel dual-core 2.4 GHz processor, 2 GB memory and a 250 GB hard disk; operating system Ubuntu 12.04 LTS; Hadoop installation package, version 1.1.2.
For real fast Fourier transforms, the speed can be normalized and expressed in FLOPS (floating-point operations per second). FLOPS is defined as follows:
$$\mathrm{FLOPS} = \frac{2.5\,N \log_2 N}{\Delta_{FFT}}$$
where $\Delta_{FFT}$ represents the time in seconds for performing a Fast Fourier Transform.
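As a small worked illustration of this definition, the normalized speed can be computed as below. The function name `fft_flops` and the timing value in the usage note are hypothetical and are not measurements from the experiments.

```python
import math


def fft_flops(n, seconds):
    """Normalized speed 2.5 * N * log2(N) / delta_FFT for a real FFT.

    n is the number of points N; seconds is the measured time delta_FFT.
    """
    if n <= 0 or seconds <= 0:
        raise ValueError("n and seconds must be positive")
    return 2.5 * n * math.log2(n) / seconds
```

For instance, a hypothetical run with $N = 2^{17}$ that takes 1 second corresponds to $2.5 \times 131072 \times 17 = 5570560$ FLOPS under this definition.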
Figure 3. Operation execution time. (a) Computation time (s) versus Log2(problem size); (b) computation time (s) of the Combine, Add, Sort and Shuffle operations versus Log2(problem size).
Figure 3-(a) shows the relationship between the problem size and the execution time of the MR-Butterfly algorithm. The execution time increases rapidly with the problem size. As shown in Figure 3-(b), the execution time of every constituent operation of the algorithm increases with the problem size. However, the sort execution time grows noticeably faster than that of the other operations, which suggests that the sort operation is the main bottleneck of the MR-Butterfly algorithm. In future studies, the MR-Butterfly algorithm will be validated on large-scale clusters.
Conclusion
In this paper, based on an analysis of the butterfly computation structure of the radix-2 decimation-in-time Cooley-Tukey algorithm, a fast Fourier transform algorithm, MR-Butterfly, suitable for big data is proposed. The algorithm exploits the fact that the butterfly computation structure of the fast Fourier transform can be determined in advance, and the complex multiplications and complex additions in each Cooley-Tukey butterfly computation unit are processed in batches. The MR-Butterfly algorithm does not need to deal with synchronization and communication, and its structure is simple, robust, general, and scalable. The experiments use test data of large volume to verify the algorithm.
Acknowledgement
This work is supported by the National Key R&D Program of China under Grant 2017YFC1405600, the National Science Foundation for Young Scientists of China under Grant 41706101, and the Qingdao City Southern District Science and Technology Development Fund under Grant 2016-2-012-ZH.
References
[1] Sze T W. Schönhage-Strassen algorithm with MapReduce for multiplying terabit integers[C]//Proceedings of the 2011 International Workshop on Symbolic-Numeric Computation. ACM, 2012: 54-62.
[2] Agrawal R, Faloutsos C, Swami A. Efficient similarity search in sequence databases[J]. Foundations of data organization and algorithms, 1993: 69-84.
[3] Cooley J W, Tukey J W. An algorithm for the machine calculation of complex Fourier series[J].
[4] Chen L, Gao G R. Performance analysis of cooley-tukey fft algorithms for a many-core architecture[C]//Proceedings of the 2010 Spring Simulation Multiconference. Society for Computer Simulation International, 2010: 81.
[5] Bahn J H, Yang J, Bagherzadeh N. Parallel FFT algorithms on network-on-chips[C]//Information Technology: New Generations, 2008. ITNG 2008. Fifth International Conference on. IEEE, 2008: 1087-1093.
[6] Aarnio T. Parallel data processing with MapReduce[C]//TKK T-110.5190, Seminar on