Performance Tuning of a CFD Code on the Earth Simulator

By Ken’ichi ITAKURA,* Atsuya UNO,* Mitsuo YOKOKAWA,† Minoru SAITO,‡ Takashi ISHIHARA§ and Yukio KANEDA§

*Earth Simulator Center, Japan Marine Science and Technology Center

†National Institute of Advanced Industrial Science and Technology

‡NEC Informatec Systems, Ltd.

§Nagoya University

ABSTRACT

High-resolution direct numerical simulations (DNSs) of incompressible turbulence with numbers of grid points up to 2048³ have been executed on the Earth Simulator (ES). The DNSs are based on the Fourier spectral method, so that the equation for mass conservation is accurately solved. In DNSs based on the spectral method, most of the computation time is consumed in calculating the three-dimensional (3D) Fast Fourier Transform (FFT). In this paper, we tuned the 3D-FFT algorithm for the Earth Simulator and have achieved DNS at 16.4TFLOPS on 2048³ grid points.

KEYWORDS
Earth Simulator, Performance Tuning, CFD (Computational Fluid Dynamics), DNS (Direct Numerical Simulation), FFT (Fast Fourier Transform)

1. INTRODUCTION

Direct numerical simulation (DNS) of turbulence provides us with detailed data on turbulence that are free from experimental uncertainty. DNS is therefore a powerful means not only of finding directly applicable solutions to problems in practical application areas that involve turbulent phenomena, but also of advancing our understanding of turbulence itself: the last outstanding unsolved problem of classical physics, and a phenomenon seen in many areas of societal impact.

The Earth Simulator (ES)[1-3] provides a unique opportunity in these respects. On the ES, we have recently achieved DNS of incompressible turbulence under periodic boundary conditions by a spectral method on 2048³ grid points. Being based on the spectral method, our DNS accurately satisfies the law of mass conservation; this cannot be achieved by a conventional DNS based on a finite-difference scheme. Such accuracy is of crucial importance in the study of turbulence, and in particular for the resolution of small eddies.

The computational speed for DNSs was measured by using up to 512 processor nodes (PNs) of the ES to simulate runs with different numbers of grid points. The best sustained performance of 16.4TFLOPS was achieved in a DNS on 2048³ grid points.

2. SPECTRAL METHODS FOR DNS

In DNS by a spectral method[4], most of the CPU time is consumed in the evaluation of the nonlinear terms in the Navier-Stokes (NS) equations, which are expressed as convolutional sums in wave-vector space. As is well known, such a sum can be efficiently evaluated by using an FFT. In order to achieve high efficiency in terms of both computational speed and the number of modes retained in the calculation, we use the so-called phase-shift method, in which a convolutional sum is obtained for a shifted grid system as well as for the original grid system. This method allows us to keep all of the wave-vector modes which satisfy k < K_C = √2 N/3, where N is the number of discretized grid points in each direction of the Cartesian coordinates, thus greatly increasing the number of retained modes. The nonlinear term can be evaluated without aliasing error by 18 real 3D-FFTs[4].

Our code, based on the standard fourth-order Runge-Kutta method for time advancement, was written in Fortran 90, and 25N³ main dependent variables are used in the program. The total number of lines of code, excluding comment lines, is about 3,000. The required amount of memory for N = 2048 and double-precision data was thus estimated to be about 1.56TB.
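The memory estimate above follows from simple arithmetic; a quick sketch (assuming, as the text states, 25N³ double-precision dependent variables):

```python
# Reproduce the memory estimate: 25*N^3 main dependent variables,
# each an 8-byte double, for N = 2048.
N = 2048
n_vars = 25       # main dependent variables (from the text)
bytes_per = 8     # double precision

total_bytes = n_vars * N**3 * bytes_per
print(total_bytes / 2**40)  # -> 1.5625, i.e. about 1.56TB
```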

3. PARALLELIZATION

Since the 3D-FFT accounts for more than 90% of the computational cost of executing the code for a DNS of turbulence by the Fourier spectral method, the most crucial factor in maintaining high levels of performance in the DNS of turbulence is the efficient execution of this calculation. In particular, in a parallel implementation, the 3D-FFT has a global data dependence because of its requirement for global summation over PNs, so data is frequently moved among the PNs during this computation.

Vector processing is capable of efficiently handling the 3D-FFT as decomposed along any of the three axes. An effective approach is to apply the domain decomposition method to assign the calculations for respective sets of several 2D planes to individual PNs, and then to have each PN execute the 2D-FFT for its assigned slab by vector processing and automatic parallelization. The FFT in the remaining direction is then performed after transposition of the 3D array data. Domain decomposition in the k_3 direction in wave-vector space and in the y direction in physical space was implemented in the code.

We achieved a high-performance FFT by implementing the following ideas and methods on the ES.

3.1 Data Allocation

Let n_d be the number of PNs, and let us consider the 3D-FFT for N³ real-valued data, which we will call u. In wave-vector space, the Fourier transform û of the real u is divided into a real part û_R and an imaginary part û_I, each of which is of size N³/2. These data are divided into n_d data sets, each of which is allocated to the global memory region (GMR) of a PN, where the size of each data set is (N+1) × N × (N/2/n_d). Similarly, the real data of u are divided into n_d data sets of size (N+1) × (N/n_d) × N, each of which is allocated to the GMR of the corresponding PN. Here the symbol n_i in n_1 × n_2 × n_3 denotes the data length along the i-th axis, and we set the length along the first axis to (N+1), so as to speed up the memory throughput by avoiding memory-bank conflict.
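The per-PN slab shapes described above can be sketched as follows (a toy calculation only; N and n_d match the largest run, and the first-axis padding to N+1 is the bank-conflict measure from the text):

```python
# Per-PN slab shapes for the decomposition described above.
N, n_d = 2048, 512

# wave-vector space: each of u-hat_R, u-hat_I is split along the 3rd axis
spectral_shape = (N + 1, N, N // 2 // n_d)
# physical space: the real data u are split along the 2nd axis
physical_shape = (N + 1, N // n_d, N)

print(spectral_shape)  # -> (2049, 2048, 2)
print(physical_shape)  # -> (2049, 4, 2048)
```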

3.2 Automatic Parallelization

For efficiently performing the N × (N/2/n_d) 1D-FFTs of length N along the first axis, the data along the second axis are divided up equally among the 8 APs of the given PN. This division can in principle be achieved by the automatic parallelization provided by the ES. In practice, however, we found that the compiler's automatic parallelization alone did not exploit the benefits of parallel execution, so we manually inserted parallelization directives before the target do-loops, directing the compiler to parallelize them.

3.3 Radix-4 FFT

Though the peak performance of an AP of the ES is 8GFLOPS, the bandwidth between an AP and the memory system is 32GB/s. This means that only one double-precision floating-point datum can be supplied for every two possible double-precision floating-point operations. The ratio of the number of memory accesses to the number of floating-point operations for the radix-2 FFT is 1; the memory system is thus incapable of supplying sufficient data to the processors.

This bottleneck of memory access in the kernel loop of a radix-2 FFT degrades the sustained performance of the overall task. Thus, to calculate the 1D-FFT efficiently on the ES, the radix-4 FFT must replace the radix-2 FFT to the extent possible: the kernel loop of the radix-4 FFT has a lower ratio of memory accesses to floating-point operations, so the radix-4 FFT better fits the ES.
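The advantage of mixing radix-4 stages into the transform can be illustrated with the operation-count formula quoted in Section 4 (a sketch; the formula N(5p + 8.5q) for N = 2^p · 4^q is from reference [5]):

```python
# Flop count of a length-N 1D-FFT built from one optional radix-2
# stage (p = 0 or 1) and q radix-4 stages, with N = 2^p * 4^q.
import math

def fft_ops(N):
    e = int(math.log2(N))
    p, q = e % 2, e // 2   # at most one radix-2 stage (when log2(N) is odd)
    return N * (5 * p + 8.5 * q)

N = 2048
print(fft_ops(N))             # mixed radix-2/4 count
print(5 * N * math.log2(N))   # pure radix-2 count, for comparison
```

For N = 2048 the mixed count is 97280 operations versus 112640 for pure radix-2, and on the ES the reduced memory traffic matters even more than the reduced flop count.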

3.4 Remote Memory Access

Before performing the 1D-FFT along the 3rd axis, we need to transpose the data from the domain decomposition along the 2nd axis to the one along the 3rd axis. The remote memory access (RMA) function is capable of handling the data transfer required for this transposition. RMA directly transfers data from the GMR of one PN of the ES to the GMR of a pre-assigned PN. It is then not necessary to make copies of the data, i.e., data are copied neither to the MPI library nor to the communications buffer region of the OS. In a single cycle of RMA transfer, N × (N/n_d) × (N/2/n_d) data are transferred from each of the n_d PNs to the other PNs. The data transposition can be completed with (n_d - 1) such RMA-transfer operations, after which N × (N/n_d) × (N/2) data will have been stored at each target PN. The 1D-FFT for the 3rd axis is then executed by dividing the do-loop along the 1st axis so as to apply automatic parallel processing.
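The transfer accounting in this subsection can be checked with a small sketch (illustrative only; n_d is kept small here so the numbers stay readable):

```python
# Accounting for the RMA transposition: each PN sends one block of
# N*(N/n_d)*(N/2/n_d) elements to each of the other (n_d - 1) PNs,
# and afterwards holds a full N x (N/n_d) x (N/2) slab.
N, n_d = 2048, 4   # small n_d for illustration

block = N * (N // n_d) * (N // 2 // n_d)  # elements per RMA transfer
steps = n_d - 1                           # RMA-transfer operations per PN
kept = block                              # the block the PN already holds

# Received blocks plus the local block fill the target slab exactly:
assert steps * block + kept == N * (N // n_d) * (N // 2)
print(steps, block)
```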

4. PERFORMANCE OF PARALLEL COMPUTATION

We have measured the sustained performance of both the double-precision and single-precision versions by changing the number N³ of grid points, setting N to 128, 256, 512, 1024, and 2048. The corresponding numbers of PNs taken up on the ES are 64, 128, 256, and 512; see Table I for the correspondences between the number of PNs and the number of grid points which we tested.

The computation time of the Runge-Kutta integration was measured by using the MPI function MPI_Wtime. The times taken in initialization and I/O processing are excluded from the measurement because these values are negligible in comparison with the cost of the Runge-Kutta integration, which increases with the number of time steps.

The number of floating-point operations in the measurement range, which is needed to calculate the sustained performance, is obtained by an analytical method. This is based on the fact that the number of operations in a 1D-FFT combining radix-2 and radix-4 stages is N(5p + 8.5q) for data length N, where N = 2^p · 4^q and p = 0 or 1 according to the value of N[5]. Considering that the DNS code has 72 real 3D-FFTs per time step and summing the number of operations over the measurement range by hand, we obtain the following analytical expressions for the number of operations:

459N³ log₂N + (288 + 16π)N³, if p = 0,    (1)

459N³ log₂N + (369 + 16π)N³, if p = 1.    (2)

The results of these expressions can be used as reference values for the number of operations when comparing the sustained performance of the code.
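For N = 2048, log₂N = 11 is odd, so p = 1 and expression (2) applies; it can be evaluated directly, and dividing by the measured rate gives a rough arithmetic time per step (a sketch of the formulas above, not a reported measurement):

```python
# Evaluate the per-time-step operation count from expressions (1)/(2).
import math

def ops_per_step(N):
    p = int(math.log2(N)) % 2      # p = 1 when log2(N) is odd
    c = 288 if p == 0 else 369
    return 459 * N**3 * math.log2(N) + (c + 16 * math.pi) * N**3

total = ops_per_step(2048)
print(f"{total:.3e} operations/step")              # about 4.7e13
print(f"{total / 16.4e12:.2f} s at 16.4 TFLOPS")   # rough time per step
```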

Table I shows the effective performance results. N³ and n_d in the table indicate the numbers of grid points and PNs used in each run, respectively. The number n_p of APs in each PN is fixed at 8. In the measurement runs, the data allocated to the global memory region of each PN are transferred by MPI_Put, and the tasks in each PN are divided up among the APs. The maximum sustained performance of 16.4TFLOPS was obtained for the problem size of 2048³ with 512 PNs and single-precision data.

Table I shows that the sustained performance is better for single precision than for double precision for larger values of N and/or smaller values of n_d. Although the number of floating-point operations in the code is the same regardless of the precision, only half as much data is transferred in the transposition for the 3D-FFT in the single-precision case.
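The halved transfer volume can be made concrete (a sketch using the largest run's parameters and the per-transfer element count from Section 3.4):

```python
# Bytes moved per RMA transfer in the 3D-FFT transposition:
# the element count is the same, but single precision halves the bytes.
N, n_d = 2048, 512
elems = N * (N // n_d) * (N // 2 // n_d)  # elements per transfer

print(elems * 8)  # double precision: 131072 bytes
print(elems * 4)  # single precision:  65536 bytes
```

Note how small each message is at n_d = 512, which is exactly the regime where, as discussed next, MPI_Put latency starts to dominate the transfer time.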

Table II shows the ratio of data transfer time to the whole execution time as the number N³ of grid points and the number n_d of PNs are changed. For constant N, the ratio becomes larger as n_d becomes larger; for constant n_d, the ratio becomes smaller as N becomes larger. The MPI_Put function shows good performance for larger data sizes, so latency accounts for only a small part of the data transfer time on the ES in that regime. On the other hand, the latency of the MPI_Put function dominates the data transfer time when the message size is small. Therefore, the smaller the message residing in each PN becomes, the larger the ratio of data transfer time to computation time becomes.

Figure 1 shows the computation time of one time step of the Runge-Kutta integration in double precision for 1024³ grid points, divided into data transfer time and other time, as the number of APs per PN is changed to 1, 2, 4, and 8. The computation time decreases almost linearly with respect to the number of APs and the number of PNs, because the ratio of data transfer time to computation time is not very significant in these cases.

Figure 2 shows the computation time for three versions of Trans7 as the number of PNs and the precision of the floating-point data are changed. The versions differ in their 3D-FFT implementation: one uses a radix-2 FFT with transposition in the global memory area (Case 1), another uses a radix-4 FFT with transposition in the local memory area (Case 2), and the third uses a radix-4 FFT with transposition in the global memory area (Case 3). Case 3 shows the best performance of the three. Although the use of the global memory area reduces the computation time, the implementation using the radix-4 FFT is found to be much more effective in reducing the time. It should be mentioned that users should pay attention to the ratio of the number of memory accesses to the number of floating-point operations in kernel loops when developing programs for the ES. It is also clear that the data transfer time for single precision is smaller than that for double precision, as stated.

Table I  Performance in TFLOPS of the computations with double [single] precision arithmetic, based on the analytical expressions for the numbers of operations.

N³ \ n_d       512            256            128            64
2048³     14.6 [16.4]     7.4 [8.4]
1024³     12.2 [12.1]     6.7 [7.7]      3.5 [4.0]      1.8 [2.1]
512³                      4.4 [4.3]      3.0 [3.3]      1.7 [1.9]
256³                                     1.4 [1.3]      1.1 [1.2]
128³                                                    0.3 [0.3]

Table II  The ratio (%) of data transfer time to whole time with double [single] precision arithmetic.

N³ \ n_d       512            256            128            64
2048³     11.3 [7.1]     11.0 [6.4]
1024³     15.6 [15.2]    13.7 [8.8]     12.5 [7.8]     11.5 [6.8]
512³                     19.8 [19.8]    14.8 [11.2]    12.9 [8.4]
256³                                    22.6 [23.0]    16.5 [14.6]
128³                                                   21.7 [21.5]

5. SUMMARY

For the DNS of turbulence, the Fourier spectral method has the advantage of accuracy, particularly in terms of solving the Poisson equation, which represents mass conservation and needs to be solved accurately for a good resolution of small eddies. However, the method requires frequent execution of the 3D-FFT, the computation of which requires global data transfer. In order to achieve high performance in the DNS of turbulence on the basis of the spectral method, efficient execution of the 3D-FFT is therefore of crucial importance.

By implementing new methods for the 3D-FFT on the ES, we have accomplished high-performance DNS of turbulence on the basis of the Fourier spectral method. This is probably the world's first DNS of incompressible turbulence with 2048³ grid points. The DNS provides valuable data for the study of the universal features of turbulence at large Reynolds number. The number of degrees of freedom, 4 × N³ (4 is the number of degrees of freedom (u1, u2, u3, p) per grid point, and N³ is the number of grid points), is about 3.4 × 10¹⁰ in the DNS with N = 2048, and the sustained speed is 16.4TFLOPS. To the authors' knowledge, these values are the highest for any simulation so far carried out on the basis of spectral methods in any field of science and technology.

[Figure 1: stacked bars of time (sec) per step, split into data transfer time and calculation time, for 1, 2, 4, and 8 CPUs per PN on 64, 128, 256, and 512 nodes.]

Fig. 1 Performance tuning results (1): N = 1024.

[Figure 2: stacked bars of time (sec), split into data transfer time and calculation time, comparing the radix-2, noGM (local-memory radix-4), and best (global-memory radix-4) versions in double and single precision on 64, 128, 256, and 512 nodes.]


Ken’ichi ITAKURA received his B.E. and M.E. degrees in information engineering from the University of Tsukuba in 1993 and 1995, respectively. He received his Ph.D. degree in information science, also from the University of Tsukuba, in 1999. He worked at the Center for Computational Physics, University of Tsukuba as a Research Associate in 1999 and 2000. He was also engaged in the research and development of the Earth Simulator. He now works at the Earth Simulator Center.

Dr. Itakura is a member of the IEEE Computer Society and the Information Processing Society of Japan.

Atsuya UNO received his B.E. and M.E. degrees in information engineering from the University of Tsukuba in 1995 and 1997, respectively. He received his Ph.D. degree in information science, also from the University of Tsukuba, in 2000. He was engaged in the Earth Simulator Research and Development Center for two years from 2000. He now works at the Earth Simulator Center.

ACKNOWLEDGMENTS

The authors would like to thank Dr. Tetsuya Sato, director-general of the Earth Simulator Center, for his warm encouragement of this study.

REFERENCES

[1] M. Yokokawa, “Present Status of Development of the Earth Simulator,” Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA’01), pp.93-99, Jan. 18-19, 2001.

[2] T. Sato, S. Kitawaki and M. Yokokawa, “Earth Simulator Running,” Proceedings of ISC 2002, (CD-ROM), June 20-22, 2002.

[3] M. Yokokawa, K. Itakura, et al., “16.4-Tflops Direct Nu-merical Simulation of Turbulence by a Fourier Spectral Method on the Earth Simulator,” Proceedings of SC2002, http://www.sc2002.org/paperpdfs/pap.pap273.pdf, Nov. 16-22, 2002.

[4] C. Canuto, M. Y. Hussaini, et al., “Spectral Methods in Fluid Dynamics,” Springer-Verlag, 1988.

[5] C. Van Loan, “Computational Frameworks for the Fast Fourier Transform,” SIAM, 1992.

* * *

Received November 8, 2002

Mitsuo YOKOKAWA received his B.Sc. and M.Sc. degrees in mathematics from the University of Tsukuba in 1982 and 1984, respectively. He also received his D.Eng. degree in computer engineering from the University of Tsukuba in 1991. He joined the Japan Atomic Energy Research Institute in 1984 and was engaged in high-performance computation in nuclear engineering. He was engaged in the development of the Earth Simulator from 1996 to 2002. He joined the National Institute of Advanced Industrial Science and Technology (AIST) in 2002 and is Deputy Director of the Grid Technology Research Center of AIST. He was a visiting researcher at the Theory Center of Cornell University in the US from 1994 to 1995.

Dr. Yokokawa is a member of the Information Processing Society of Japan and the Japan Society for Industrial and Applied Mathematics.

Minoru SAITO received his B.Sc. degree in physics from Yamagata University in 1983. He now works at NEC Informatec Systems, Ltd. He was on loan to the Earth Simulator Research and Development Center from August 1999 to March 2002.


Takashi ISHIHARA received his B.Sc. degree in physics and M.E. degree in applied physics from Nagoya University in 1989 and 1991, respectively. He received his D.E. degree in applied physics from Nagoya University in 1994. He was a research associate at the Department of Mathematics at Toyama University from 1994 to 1997. He has been an assistant professor at the Graduate School of Engineering of Nagoya University since 1997.

Dr. Ishihara is a member of the Physical Society of Japan and of the Japan Society of Fluid Mechanics.

Yukio KANEDA received his B.Sc. and M.Sc. degrees in physics from the University of Tokyo in 1971 and 1973, respectively. He received his D.Sc. degree in physics, also from the University of Tokyo, in 1976. He has worked at the Faculty of Science as a research associate, and at the Faculty of Engineering, Nagoya University, as a research associate, an associate professor, and a professor. He has also worked at the Graduate School of Polymathematics, Nagoya University, as a professor, and is currently a professor in the Department of Computational Science and Engineering at the Graduate School of Engineering, Nagoya University. His main interest is in fluid mechanics, including computational fluid mechanics and turbulence.

Dr. Kaneda is a member of the Physical Society of Japan, the Japan Society of Fluid Mechanics, and the Japan Society for Industrial and Applied Mathematics.
