COMPONENTWISE ERROR ANALYSIS FOR FFT'S WITH APPLICATIONS TO FAST HELMHOLTZ SOLVERS

M. ARIOLI^1, H. MUNTHE-KAAS^2, AND L. VALDETTARO^3

Abstract. We analyze the stability of the Cooley-Tukey algorithm for the Fast Fourier Transform of order $n = 2^k$ and of its inverse by using componentwise error analysis.

We prove that the components of the roundoff errors are linearly related to the result in exact arithmetic. We describe the structure of the error matrix and we give optimal bounds for the total error in infinity norm and in $L^2$ norm.

The theoretical upper bounds are based on a `worst case' analysis where all the rounding errors work in the same direction. We show by means of a statistical error analysis that in realistic cases the max-norm error grows asymptotically like the logarithm of the sequence length times the machine precision. Finally, we use the previous results to introduce tight upper bounds on the algorithmic error for some of the classical fast Helmholtz equation solvers based on the Fast Fourier Transform and for some algorithms used in the study of turbulence.

1. Introduction.

Let $F_n$ be the Fourier matrix of order $n$, i.e.

$$(F_n)_{j,k} = e^{2\pi i jk/n}, \qquad 0 \le j,k < n .$$

The one dimensional discrete Fourier transform is defined as the matrix vector product

$$\hat{x} = F_n x .$$

The one dimensional inverse transform is given as

$$x = \frac{1}{n} F_n^{*} \hat{x} ,$$

and the $d$-dimensional transform as

$$\hat{x} = (F_{n_d} \otimes F_{n_{d-1}} \otimes \cdots \otimes F_{n_1})\, x ,$$

where $\otimes$ denotes the matrix tensor product (see e.g. [3, 10] for its definition and properties).

We henceforth assume that all $n_j$ are powers of two. The Cooley-Tukey Fast Fourier Transform (FFT) ([4, 5]) is based on the following factorization of $F_n$, with $n = 2^k$:

$$B_n F_n = D_k (I_{n/2} \otimes F_2)\, D_{k-1} (I_{n/4} \otimes F_2 \otimes I_2) \cdots D_2 (I_2 \otimes F_2 \otimes I_{n/4})\, D_1 (F_2 \otimes I_{n/2}) \qquad (1)$$

where $B_n$ is a permutation matrix described below, $D_j$ are diagonal matrices and $F_2$ is the order 2 Fourier matrix, i.e.

$$F_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} .$$

^1 Istituto di Analisi Numerica, Consiglio Nazionale delle Ricerche, via Abbiategrasso 209, 27100 Pavia, Italy.

^2 Institute for Informatics, University of Bergen, 5020 Bergen, Norway.

The diagonal entries of $D_j$ are all of the form

$$(D_j)_{k,k} = e^{2\pi i \theta^j_k}$$

where the $\theta^j_k$ are real. The matrices $D_j$ are called twiddle matrices and the numbers $e^{2\pi i \theta^j_k}$ twiddle factors.

The permutation matrix $B_n$ performs a bit reversal permutation of a vector, i.e. it swaps two elements $x(j)$ and $x(k)$ whenever the binary representation of $k$ equals the binary representation of $j$ written backwards. Note that $B_n$ is its own inverse.
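The factorization and the bit reversal permutation described above can be illustrated with a minimal pure-Python sketch. It is not the authors' code: the function names are hypothetical, the twiddle factors are obtained by direct calls to the library exponential, and the bit reversal is applied before the butterfly stages (the decimation-in-time ordering), which is a standard variant of Equation (1) rather than a literal transcription of it. The kernel uses the sign convention $e^{+2\pi i jk/n}$ of the definition of $F_n$.

```python
import cmath

def bit_reverse_permute(x):
    """Return B_n x: entry j moves to the index whose k-bit binary form is j reversed."""
    n = len(x)
    k = n.bit_length() - 1  # n = 2**k is assumed
    out = list(x)
    for j in range(n):
        r = int(format(j, f"0{k}b")[::-1], 2) if k else 0
        out[r] = x[j]
    return out

def fft_radix2(x):
    """Radix-2 FFT for (F_n)_{jk} = exp(+2*pi*i*j*k/n); len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = bit_reverse_permute([complex(v) for v in x])
    size = 2
    while size <= n:  # log2(n) butterfly stages
        half = size // 2
        for start in range(0, n, size):
            for t in range(half):
                w = cmath.exp(2j * cmath.pi * t / size)  # twiddle factor, direct library call
                u = a[start + t]
                v = w * a[start + t + half]
                a[start + t] = u + v          # 2 by 2 block F_2 applied to (u, v)
                a[start + t + half] = u - v
        size *= 2
    return a

def direct_dft(x):
    """O(n^2) matrix-vector product F_n x, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(2j * cmath.pi * j * m / n) for m in range(n))
            for j in range(n)]
```

Comparing `fft_radix2(x)` with `direct_dft(x)` on a short vector is a quick sanity check; applying `bit_reverse_permute` twice returns the original vector, in line with the remark that $B_n$ is its own inverse.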

In the multidimensional case, the Cooley-Tukey factorization can be written

$$(B_{n_d} F_{n_d}) \otimes \cdots \otimes (B_{n_1} F_{n_1}) = D_k (I_{n/2} \otimes F_2)\, D_{k-1} (I_{n/4} \otimes F_2 \otimes I_2) \cdots D_2 (I_2 \otimes F_2 \otimes I_{n/4})\, D_1 (F_2 \otimes I_{n/2})$$

where $n = n_d n_{d-1} \cdots n_1$, and the $B_{n_i}$ are bit reversal matrices. The only differences between this equation and Equation (1) are the values of the twiddle factors. (For details see [11]). Thus, from the point of view of roundoff error analysis, we can concentrate on the one dimensional case, and the results also apply to the multidimensional case. (A very careful analysis could use the fact that in the multidimensional case more twiddle factors are equal to 1, so that the constants in the error term can be improved slightly. This is however not of practical interest.)

The roundoff error analysis of the FFT has been studied from several different points of view and for an extended bibliographical survey we refer to chapter 2 of [3] and to the more recent chapter 1 of [10]. We refer to [10] for the general framework on FFT.

The motivation of our study comes from the large discrepancy between the very small errors observed in numerical experiments and the much larger upper bounds on the errors that would be expected from simple arguments. Indeed $y = F_N x$ is obtained in $\log_2 N$ steps of the form $x^{(k)} \leftarrow H x^{(k-1)}$. Because $\|F_2\|_\infty$ is 2, at each step the local error grows like 2 times machine precision times $\|x^{(k-1)}\|_\infty$, where $x^{(k-1)}$ is the computed value entering step $k$.

Simple computations ([3], Theorem 2.5.2, page 59) show that the local factor 2 implies that the final bound on the error grows with a factor $n \log_2 n$. If we want a tighter bound on the error we must delay the use of the norms and first compute a componentwise expression of the global error.

In Section 2 we present the results of the componentwise error analysis applied to the Cooley-Tukey algorithm for the FFT and its inverse, assuming that the computations are done using a rounding arithmetic. From the results proved in [13, 3], we assume that the twiddle factors are computed by direct call to the library functions so that they are the rounded values of the exact results. Other techniques such as Forward and Logarithmic Recursion, Repeated Multiplications, Subvector Scaling and Recursive Bisection give less accurate algorithms ([13, 3]). We give new upper bounds for the roundoff error in the output signal measured in various different norms. In particular, our upper bound on the max norm of the error is a substantial improvement of the one we know in the literature (see [5, 3, 10]). We want to underline here the importance of measuring the error in max-norm for general applications, including the ones in turbulence theory which we study in Section 4.

In Section 3 we describe statistical properties of the algorithmic error under some hypotheses on the statistical distribution of the rounding errors. The results of that section, describing the expected value of the maximum roundoff error relative to the infinity norm of the output signal for any input signal, produce a more sound and general statistical analysis compared with those in the papers by Weinstein, by Kaneko and Liu and by Oppenheim and Weinstein ([16, 8, 12]). In [16] the error analysis is done assuming a white noise input signal. The results are given in terms of the mean squared output error rather than the expectation value of the maximum error. In [12] the results of [16] are reviewed and some experimental investigations are carried out on very simple sinusoidal signals. Finally in [8] a bound is given on the total mean square error relative to the $L^2$-norm of the output signal. We conclude Section 3 by verifying, with numerical experiments, the statistical error bounds that we have derived. The experiments have been run on several computers with different arithmetics, such as Alliant-FX80, CRAY2, Risc workstations.

In Section 4 we analyze the consequences of the previous result on the simulations of developed turbulence using spectral methods.

Finally, in Section 5 we apply our results to the roundoff error analysis of fast solvers for the Helmholtz equation.

As a shorthand notation we introduce

$$H_j = D_j (I_{2^{j-1}} \otimes F_2 \otimes I_{2^{k-j}}) \qquad (2)$$

and write Equation (1) as:

$$B_{2^k} F_{2^k} = \prod_{j=1}^{k} H_j \equiv H_k \cdots H_1 . \qquad (3)$$

After a symmetric permutation, $H_j$ can be written as a block diagonal matrix with 2 by 2 blocks.

It is useful to know that by truncating this product, one can obtain the transforms of order $2^l$ with $l \le k$:

$$I_{2^{k-l}} \otimes (B_{2^l} F_{2^l}) = \prod_{j=k-l+1}^{k} H_j . \qquad (4)$$

In the following, if $u$ and $v$ are vectors of entries $u_i$ and $v_i$, and $Q$ and $P$ are matrices of entries $q_{ij}$ and $p_{ij}$, then $|u|$ is the vector of entries $|u_i|$ and $|Q|$ is the matrix of entries $|q_{ij}|$. The relation $u \le v$ means $u_i \le v_i$ for all $i$, and $Q \le P$ means $q_{ij} \le p_{ij}$ for all $i$ and $j$.

Finally, by inspecting the pattern of the matrices, it follows that

$$|H_j|\,|H_k| = |H_j H_k| , \qquad j \ne k . \qquad (5)$$

2. Error analysis of the FFT.

Let $fl(\cdot)$ denote the result of a floating point computation. We assume that the arithmetic of the computer satisfies the following for real arithmetic:

$$fl(a \circ b) = (a \circ b)(1 + \delta) , \qquad |\delta| < \varepsilon \qquad (6)$$

where $\varepsilon$ is the machine precision and $\circ$ is one of $+, -, \times, /$. To a great extent modern computers have arithmetic that satisfies the assumption (6), with the exception of CRAY computers. For complex arithmetic (6) leads to the following error bounds for the basic operations:

Lemma 2.1. Complex floating point arithmetic is bounded by:

$$fl(zw) = (zw)(1 + \delta) , \qquad |\delta| < (1 + \sqrt{2})\,\varepsilon$$

$$fl(z + w) = (z + w)(1 + \delta) , \qquad |\delta| < \varepsilon$$

where $\varepsilon$ is the machine precision. For proof of the Lemma, see [3].
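The real-arithmetic model (6) can be checked on a given machine with a short experiment. The sketch below is an illustration only: it assumes IEEE double precision with round to nearest, takes the machine precision in (6) to be the unit roundoff $2^{-53}$ (an assumption, since the text does not fix a value), and uses exact rational arithmetic from the Python standard library as the reference. The helper names are hypothetical.

```python
from fractions import Fraction

# Assumption: IEEE double precision, round to nearest, so the unit roundoff is 2**-53.
UNIT_ROUNDOFF = Fraction(1, 2 ** 53)

def relative_error(computed, exact):
    """|computed - exact| / |exact| in exact rational arithmetic (0 if exact == 0)."""
    if exact == 0:
        return Fraction(0)
    return abs(Fraction(computed) - exact) / abs(exact)

def rounding_errors(a, b):
    """Relative rounding error |delta| of each basic operation on the floats a and b."""
    fa, fb = Fraction(a), Fraction(b)  # floats convert to Fractions exactly
    return {
        "+": relative_error(a + b, fa + fb),
        "-": relative_error(a - b, fa - fb),
        "*": relative_error(a * b, fa * fb),
        "/": relative_error(a / b, fa / fb),
    }

if __name__ == "__main__":
    for name, delta in rounding_errors(0.1, 0.3).items():
        # Print |delta| in units of the unit roundoff; model (6) asks for a value below 1.
        print(name, float(delta / UNIT_ROUNDOFF))
```

The check ignores overflow and underflow, for which a bound of the form (6) does not hold.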

This leads to the following bound for multiplication with the elementary matrices $H_j$ in equation (2):

Lemma 2.2.

$$fl(H_j x) = (I + \Delta_j) H_j x$$

where $\Delta_j$ is a diagonal matrix satisfying:

$$|\Delta_j| < (3 + \sqrt{2})\,\varepsilon\, I + O(\varepsilon^2)$$

Proof: After a symmetric permutation, $H_j$ can be written as a block diagonal matrix with blocks of the form:

$$\begin{pmatrix} \omega_1 & 0 \\ 0 & \omega_2 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$

We assume that the twiddle factors $\omega_i$ are computed exactly and then rounded, i.e.

$$fl(\omega) = \omega (1 + \delta) , \qquad \text{where } |\delta| < \varepsilon .$$

Using Lemma 2.1 we obtain this bound.

Lemma 2.2 yields the following result for the componentwise errors in the FFT:

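Theorem 2.3. The error in the radix 2 Cooley-Tukey FFT of size $n = 2^k$ satisfies

$$F_n x - fl(F_n x) = G F_n x + O(\varepsilon^2)$$

where

$$G = B_n \left( \sum_{j=1}^{k} H_k H_{k-1} \cdots H_{j+1}\, \Delta_j\, H_{j+1}^{-1} \cdots H_{k-1}^{-1} H_k^{-1} \right) B_n$$

and $B_n$ is the bit-reversal permutation.

Proof: From Equation (3) and Lemma 2.2 we get:

$$B_n fl(F_n x) = (I + \Delta_k) H_k \cdots (I + \Delta_1) H_1 x$$

$$= H_k \cdots H_1 x + \sum_{j=1}^{k} H_k H_{k-1} \cdots H_{j+1}\, \Delta_j\, H_j H_{j-1} \cdots H_1 x + O(\varepsilon^2)$$

$$= B_n F_n x + \sum_{j=1}^{k} H_k \cdots H_{j+1}\, \Delta_j\, H_{j+1}^{-1} \cdots H_k^{-1} B_n F_n x + O(\varepsilon^2)$$

Since $B_n$ is its own inverse, we obtain the result (the overall sign is absorbed into the unknown matrices $\Delta_j$).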

Interestingly, the error matrix G has a rather regular structure, being essentially the sum of block diagonal matrices with Toeplitz circulant diagonal blocks:

Theorem 2.4. Let

$$C_j = H_k H_{k-1} \cdots H_{j+1}\, \Delta_j\, H_{j+1}^{-1} \cdots H_{k-1}^{-1} H_k^{-1} ,$$

let $m = 2^{k-j}$, and let $B_m$ be the bit reversal permutation of order $m$. Then

$$(I_{n/m} \otimes B_m)\, C_j\, (I_{n/m} \otimes B_m)$$

is a block diagonal matrix with $m$ by $m$ Toeplitz circulant matrices on the diagonal.

Before we prove this theorem, we recall the following well known relationship between Toeplitz circulant matrices and discrete Fourier transforms:

Lemma 2.5. Let

$$\Delta = \mathrm{diag}(\delta_0, \delta_1, \dots, \delta_{n-1}) .$$

Then

$$F_n \Delta F_n^{-1} = \mathrm{circ}(\eta_0, \eta_1, \dots, \eta_{n-1})$$

where the numbers $\eta_j$ are the inverse Fourier transform of the numbers $\delta_k$, i.e.

$$\eta_j = \frac{1}{n} \sum_{k=0}^{n-1} \delta_k\, e^{-2\pi i jk/n}$$
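The relationship between diagonal matrices and circulant matrices stated in Lemma 2.5 can be verified numerically for a small order. The following sketch is only an illustration with hypothetical helper names. It builds $F_n\,\mathrm{diag}(d)\,F_n^{-1}$ explicitly, using $F_n^{-1} = \overline{F_n}/n$, and takes $\mathrm{circ}(\cdot)$ to list the first row of the circulant matrix, which is the reading consistent with the sign of the exponent in the lemma (an assumption about the notation).

```python
import cmath

def fourier_matrix(n):
    """(F_n)_{jk} = exp(+2*pi*i*j*k/n)."""
    return [[cmath.exp(2j * cmath.pi * j * k / n) for k in range(n)] for j in range(n)]

def conjugate_by_fourier(diag):
    """Return F_n diag(d) F_n^{-1} as a dense matrix, with F_n^{-1} = conj(F_n)/n."""
    n = len(diag)
    f = fourier_matrix(n)
    return [[sum(f[j][k] * diag[k] * f[l][k].conjugate() for k in range(n)) / n
             for l in range(n)] for j in range(n)]

def inverse_transform(diag):
    """eta_j = (1/n) * sum_k d_k * exp(-2*pi*i*j*k/n)."""
    n = len(diag)
    return [sum(diag[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)) / n
            for j in range(n)]

def max_deviation_from_circulant(diag):
    """Largest |M[j][l] - eta[(l - j) mod n]|; zero up to rounding if the lemma holds."""
    n = len(diag)
    m = conjugate_by_fourier(diag)
    eta = inverse_transform(diag)
    return max(abs(m[j][l] - eta[(l - j) % n]) for j in range(n) for l in range(n))
```

For any short list of complex numbers, `max_deviation_from_circulant` should return a value at the level of rounding error.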

Proof of Theorem 2.4: If we block the diagonal matrix $\Delta_j$ in blocks $\Delta_j^l$ of size $m$, we find from Equation (4) that the blocks of

$$(I_{n/m} \otimes B_m)\, C_j\, (I_{n/m} \otimes B_m)$$

are given as $F_m \Delta_j^l F_m^{-1}$.

We now have the necessary results for finding the infinity norm relative error of the computed $\hat{x}$:

Theorem 2.6. The infinity norm of the error matrix $G$ is bounded by

$$\|G\|_\infty < 10.7 \sqrt{n}\, \varepsilon + O(\varepsilon^2) .$$

Thus the relative error of the computed $\hat{x}$ is bounded by

$$\rho_\infty = \|fl(F_n x) - \hat{x}\|_\infty / \|\hat{x}\|_\infty < 10.7 \sqrt{n}\, \varepsilon + O(\varepsilon^2) .$$

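Proof: From the Plancherel identity $\|x\|_2 = \frac{1}{\sqrt{n}} \|\hat{x}\|_2$ we get:

$$\|C_j\|_\infty = \sum_{l=0}^{m-1} |\eta_l| \le \sqrt{m} \left( \sum_{l=0}^{m-1} |\eta_l|^2 \right)^{1/2} = \left( \sum_{l=0}^{m-1} |\delta_l|^2 \right)^{1/2} \le \sqrt{m}\,(3 + \sqrt{2})\,\varepsilon + O(\varepsilon^2) ,$$

where the numbers $\eta_j$ are the inverse Fourier transform of the numbers $\delta_k$. Thus

$$\|G\|_\infty \le \sum_{j=1}^{k} \|C_j\|_\infty \le (3 + \sqrt{2})\,\varepsilon\,\left(1 + \sqrt{2} + \sqrt{4} + \cdots + \sqrt{n/2}\right) + O(\varepsilon^2) = \frac{(3 + \sqrt{2})(\sqrt{n} - 1)}{\sqrt{2} - 1}\,\varepsilon + O(\varepsilon^2) < 10.7 \sqrt{n}\, \varepsilon + O(\varepsilon^2) .$$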
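The constant 10.7 in Theorem 2.6 comes from the geometric sum $1 + \sqrt{2} + \sqrt{4} + \cdots + \sqrt{n/2} = (\sqrt{n} - 1)/(\sqrt{2} - 1)$ and from the per-stage factor $3 + \sqrt{2}$ of Lemma 2.2. A short check of this arithmetic, with hypothetical function names, is:

```python
import math

def geometric_sqrt_sum(k):
    """1 + sqrt(2) + sqrt(4) + ... + sqrt(n/2) for n = 2**k."""
    return sum(math.sqrt(2 ** j) for j in range(k))

def closed_form(k):
    """(sqrt(n) - 1) / (sqrt(2) - 1) for n = 2**k."""
    return (math.sqrt(2 ** k) - 1.0) / (math.sqrt(2.0) - 1.0)

# Leading constant of the bound: (3 + sqrt(2)) / (sqrt(2) - 1) = 5 + 4*sqrt(2).
WORST_CASE_CONSTANT = (3.0 + math.sqrt(2.0)) / (math.sqrt(2.0) - 1.0)

if __name__ == "__main__":
    for k in (4, 10, 20):
        print(k, geometric_sqrt_sum(k), closed_form(k))
    print(WORST_CASE_CONSTANT)
```

The constant equals $5 + 4\sqrt{2}$, which lies just below 10.7, so the rounded value quoted in the theorem is a valid upper bound.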

Theorem 2.7. The 2-norm of the error matrix $G$ is bounded by

$$\|G\|_2 < 10.7 \log_2 n\; \varepsilon + O(\varepsilon^2) .$$

Thus the relative error of the computed $\hat{x}$ is bounded by

$$\|fl(F_n x) - \hat{x}\|_2 / \|\hat{x}\|_2 < 10.7 \log_2 n\; \varepsilon + O(\varepsilon^2) .$$

Proof: The proof follows from the orthogonality properties of $F_n$ and Theorem 2.3.

Finally, the following theorem gives the componentwise error bounds.

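Theorem 2.8.

$$|(fl(F_n x) - \hat{x})_j| < 10.7 \log_2 n\; \varepsilon\, \|F_n x\|_1 , \qquad j = 1, \dots, n \qquad (7)$$

$$|(fl(F_n x) - \hat{x})_j| < 10.7 \log_2 n\; \varepsilon\, \|x\|_1 , \qquad j = 1, \dots, n \qquad (8)$$

$$|(fl(F_n x) - \hat{x})_j| < 10.7 \sqrt{n}\; \varepsilon\, \|F_n x\|_\infty , \qquad j = 1, \dots, n \qquad (9)$$

Proof: Formula (7) follows from Theorem 2.7 and from $\|F_n x\|_2 \le \|F_n x\|_1$.

Formula (9) is the componentwise version of Theorem 2.6.

Formula (8) follows from (5) and the relation proved in Theorem 2.3:

$$B_n (fl(F_n x) - \hat{x}) = \sum_{j=1}^{k} H_k H_{k-1} \cdots H_{j+1}\, \Delta_j\, H_j H_{j-1} \cdots H_1 x + O(\varepsilon^2)$$

For the inverse FFT, we have results similar to those of Theorems 2.3, 2.4, 2.6, 2.7 and 2.8. In particular, we have

$$\frac{1}{n} F_n^{*} x - fl\!\left(\frac{1}{n} F_n^{*} x\right) = \tilde{G} \left(\frac{1}{n} F_n^{*} x\right) + O(\varepsilon^2)$$

with

$$\|\tilde{G}\|_\infty < 10.7 \sqrt{n}\, \varepsilon + O(\varepsilon^2) .$$

Moreover, the matrix $\tilde{G}$ satisfies properties similar to those satisfied by $G$.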

3. Statistical error analysis.

The above analysis is based on a `worst case' analysis where all rounding errors work in the same direction. In this section we will show that if we assume a statistical distribution of rounding errors, the relative error in the answers will grow as $\log_2(n)$ rather than the $\sqrt{n}$ bound derived above.

Theorem 2.3 gives an exact expression for the roundoff error, expressed in terms of unknown matrices $\Delta_j$ of the form

$$\Delta = \mathrm{diag}(\delta_0, \delta_1, \dots, \delta_{n-1}) .$$

In this section we make the following assumption about the statistical distribution of the complex numbers $\delta_i$:

Assumption 1. The numbers $\delta_i$ are independent complex stochastic variables where the real and imaginary parts are binormally distributed with mean 0 and variance $\sigma^2$, i.e. the probability density of the real and imaginary parts is given by:

$$f(\Re(\delta_i), \Im(\delta_i)) = \mathrm{BiNo}(0, \sigma^2) = \frac{1}{2\pi\sigma^2}\, e^{-|\delta_i|^2 / (2\sigma^2)} .$$

This distribution is not realistic for the arithmetic model (6), since, in the statistical distribution, the rounding errors $\delta_i$ range in $[-\infty, \infty]$. A more realistic probability distribution would be a uniform distribution in a disk, i.e. the probability density

$$g(\Re(\delta_i), \Im(\delta_i)) = \begin{cases} \dfrac{1}{\pi u^2} & \text{for } |\delta_i|^2 \le u^2 \\ 0 & \text{otherwise} \end{cases}$$

where $u$ is the machine precision.

Due to the Central Limit Theorem, we can however expect that results derived from the distributions f and g will behave similarly for large n.

The right choice for the variance $\sigma^2$ in our model may be debated. A reasonable choice is

$$\sigma^2 = 8u^2/(9\pi)$$

which will give the same value for $E[|\delta_i|]$ for both distributions $f$ and $g$. We are however more interested in how the relative error scales for large $n$ than in the exact value of the constant of the error expression, so for simplicity we choose the larger value

$$\sigma^2 = u^2 .$$
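The matching of the two first moments can be made explicit. Under the binormal density $f$ the modulus $|\delta_i|$ is Rayleigh distributed with mean $\sigma\sqrt{\pi/2}$, while under the uniform density $g$ on a disk of radius $u$ the mean modulus is $2u/3$; equating the two gives $\sigma^2 = 8u^2/(9\pi)$. The sketch below, with hypothetical function names and with $u$ set to the double precision unit roundoff purely as an example value, checks this arithmetic.

```python
import math

def rayleigh_mean(sigma):
    """E|delta| when real and imaginary parts are independent N(0, sigma^2)."""
    return sigma * math.sqrt(math.pi / 2.0)

def uniform_disk_mean(u):
    """E|delta| for a uniform density on the disk |delta| <= u: integral of r * 2r/u^2 dr."""
    return 2.0 * u / 3.0

def matching_sigma(u):
    """The sigma for which both distributions give the same E|delta|."""
    return u * math.sqrt(8.0 / (9.0 * math.pi))

if __name__ == "__main__":
    u = 2.0 ** -53  # example value only (assumption): double precision unit roundoff
    s = matching_sigma(u)
    print(rayleigh_mean(s), uniform_disk_mean(u), s / u)
```

Since $8/(9\pi) < 1$, the simplified choice $\sigma^2 = u^2$ is indeed the larger of the two variances.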

The following two Lemmas are based on elementary properties of the normal distribution:

Lemma 3.1. Let $\delta$ be a complex stochastic variable with distribution

$$\delta \sim \mathrm{BiNo}(0, \sigma^2)$$

and let $z$ be a complex number. Then $z\delta$ has a distribution

$$z\delta \sim \mathrm{BiNo}(0, \sigma^2 |z|^2)$$

Lemma 3.2. Let $\delta_1$ and $\delta_2$ be independent stochastic variables with distributions

$$\delta_{1,2} \sim \mathrm{BiNo}(0, \sigma_{1,2}^2)$$

Then $\delta = \delta_1 + \delta_2$ has the probability distribution

$$\delta \sim \mathrm{BiNo}(0, \sigma_1^2 + \sigma_2^2)$$

From these and from the orthogonality properties of $F_n$ we derive ([1]):

Lemma 3.3. Let $\delta = \{\delta_l\}_{l=0}^{n-1}$ be a stochastic vector where the components are independent complex stochastic variables with distributions

$$\delta_l \sim \mathrm{BiNo}(0, \sigma^2) .$$

Then the inverse FFT of $\delta$ is a stochastic vector $\eta$ where the components are independent with distributions

$$\eta_l \sim \mathrm{BiNo}(0, \sigma^2/n) .$$

From Theorem 2.3 we see that the relative error

$$\rho_\infty = \frac{\|\hat{x} - fl(F_n x)\|_\infty}{\|\hat{x}\|_\infty}$$

is largest when all the components of $\hat{x}$ are of the same size, and in the following we therefore assume that

$$|\hat{x}_i| = 1 \quad \text{for all } i . \qquad (10)$$

Let $C_j$ be the elementary matrices of Theorem 2.4. From Theorem 2.4 and Lemmas 3.1-3.3 we find that the components of $y = C_j \hat{x}$ are distributed:

$$y_i \sim \mathrm{BiNo}(0, \sigma^2) .$$

Since there are $\log_2(n)$ such terms in the error expression we get:

Lemma 3.4. Let $y = G\hat{x}$, where $|\hat{x}_i| = 1$ and $G$ is given in Theorem 2.3. Then the components of $y$ are independently distributed

$$y_i \sim \mathrm{BiNo}(0, \sigma^2 \log_2(n)) .$$

Let $z = \max_{i=0}^{n-1} |y_i| = \|y\|_\infty$. It remains to compute $E[z]$. It is well known that if $x_1$ and $x_2$ are real normally distributed stochastic variables with mean 0 and variance 1, then $Y = \sqrt{x_1^2 + x_2^2}$ is Rayleigh distributed, i.e.

$$Y \sim y\, e^{-y^2/2} , \qquad y > 0 .$$

Thus

$$P(Y \le y) = 1 - e^{-y^2/2} .$$

Let $Z = \max_{i=0}^{n-1} Y_i$. Then

$$P(Z \le z) = (1 - e^{-z^2/2})^n .$$

We will show that this function approaches a step function as $n \to \infty$:

Lemma 3.5. For large $n$ we have

$$F_n(y) = (1 - e^{-y^2/2})^n \to \Theta_{\sqrt{2\ln(n)}}(y)$$

where $\Theta_z(y)$ is the step function

$$\Theta_z(y) = \begin{cases} 0 & \text{if } y < z \\ 1 & \text{otherwise} \end{cases}$$

The width of the step in $F_n$ decreases as $O(\ln(n)^{-1/2})$.

Proof: Fix the function at a value $0 < c < 1$: $(1 - e^{-y^2/2})^n = c$. Solving for $y$, we get:

$$y(c) = \sqrt{-2\ln(1 - c^{1/n})}$$

Assuming $n$ large, we have $1 - c^{1/n} \approx -\ln(c)/n$, and we find by Taylor expansion:

$$y(c) = \sqrt{-2\ln(1 - c^{1/n})} \approx \sqrt{2\ln(n) - 2\ln(-\ln(c))} \approx \sqrt{2\ln(n)} - \frac{\ln(-\ln(c))}{\sqrt{2\ln(n)}} = \sqrt{2\ln(n)} + O(\ln(n)^{-1/2})$$

Fixing two function values $c_1$ and $c_2$ we see that

$$y(c_1) - y(c_2) = O(\ln(n)^{-1/2}) \to 0 \quad \text{when } n \to \infty$$

Thus $F_n(y)$ approaches a step function $\Theta_z(y)$ with a step at $z = \sqrt{2\ln(n)}$, and the width of the step in $F_n(y)$ decreases as $O(\ln(n)^{-1/2})$.
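The behaviour described in Lemma 3.5 can be tabulated directly from the closed form $y(c) = \sqrt{-2\ln(1 - c^{1/n})}$. The following sketch, with hypothetical function names, evaluates the location of the step at $c = 1/2$ and a measure of its width, taken here as $y(0.9) - y(0.1)$ (the two levels are an arbitrary choice made for the illustration).

```python
import math

def quantile_of_max(c, n):
    """Solve (1 - exp(-y^2/2))^n = c for y, i.e. y(c) = sqrt(-2 ln(1 - c^(1/n)))."""
    tail = -math.expm1(math.log(c) / n)  # 1 - c^(1/n), computed without cancellation
    return math.sqrt(-2.0 * math.log(tail))

def step_width(n, low=0.1, high=0.9):
    """Distance between the points where F_n reaches the levels low and high."""
    return quantile_of_max(high, n) - quantile_of_max(low, n)

if __name__ == "__main__":
    for k in (10, 20, 40, 80):
        n = 2 ** k
        print(k, quantile_of_max(0.5, n), math.sqrt(2.0 * math.log(n)), step_width(n))
```

The width shrinks only like $\ln(n)^{-1/2}$, so the step sharpens slowly and the asymptotic regime is reached for fairly large $n$.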

Thus the discussion in this section can be summarized in the following theorem, which is the stochastic analogue of Theorem 2.6:

Theorem 3.6. Under Assumption 1, where $\sigma = \varepsilon$, the relative error in the computed results satisfies

$$E[\rho_\infty] = E\left[\, \|fl(F_n x) - \hat{x}\|_\infty / \|\hat{x}\|_\infty \right] \le \varepsilon \sqrt{\frac{2}{\ln(2)}}\, \ln(n) + O(\varepsilon)$$

where $E[z]$ denotes the expected value of the stochastic variable $z$. Furthermore:

$$\mathrm{Var}[\rho_\infty] \to 0 \quad \text{when } n \to \infty .$$

Proof: It is evident that if a stochastic variable $X$ satisfies

$$P(X \le x) = \Theta_{x_0}(x)$$

then

$$E(X) = x_0 \quad \text{and} \quad \mathrm{Var}(X) = 0 .$$

Now let $y$ be as in Lemma 3.4 and let $z = \|y\|_\infty$. From Lemma 3.5 we find that

$$E\left[ \frac{z}{\sigma \sqrt{\log_2(n)}} \right] = \sqrt{2\ln(n)} + O(\ln(n)^{-1/2})$$

Thus

$$E[z] = \varepsilon \sqrt{\frac{2}{\ln(2)}}\, \ln(n) + O(\varepsilon)$$

This is derived under the assumption in Equation (10) about $\hat{x}$. For general $\hat{x}$, the relative error becomes smaller.

The derivation is based on Taylor expansions, and is valid for `sufficiently' large $n$. To check the validity of the derivation we have compared the bound in Theorem 3.6 with the correct value for the statistical expectation, computed by numerical integration.

The statistical error bounds were based on Assumption 1. We have done several numerical experiments to check if this assumption is realistic and if the $\log_2(n)$ behaviour is also seen in practice. The numerical computations were done on Alliant FX80, IBM-RS6000 and CRAY2 starting from a complex signal having random uncorrelated real and imaginary parts distributed uniformly between $-1$ and $1$; the direct transform of the signal was performed both in double (Alliant FX80 and IBM-RS6000) and quadruple precision (CRAY2 and IBM-RS6000). The infinity norm error of the double precision FFT was then computed as:

$$\rho_\infty = \frac{\|F^d_n x - F^q_n x\|_\infty}{\|F^q_n x\|_\infty}$$

(We label with $F^d$ the double precision FFT and with $F^q$ the quadruple precision FFT.)

In Fig. 1 we plot the ratio between $\rho_\infty$ and the $\sqrt{n}$ bound, derived in the `worst case' analysis, vs $\log_2 n$. We clearly see that this bound overestimates the actual error. The fit with the $\log_2 n$ law, shown in Fig. 2 where we plot $\rho_\infty / \log_2 n$ vs $\log_2 n$, is clearly much better.

When all the computations were performed using the CRAY2 and the IBM-RS6000, we obtained results similar to those displayed in figures 1 and 2.
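An experiment in the same spirit can be run with the Python standard library alone. Quadruple precision is not available there, so the sketch below does not reproduce the measurement described above: as a stated substitute it transforms a random signal forward and back in double precision and reports the round trip error, normalized both by $\log_2 n$ and by $\sqrt{n}$ times the unit roundoff. The function names, the seed, and the choice of $2^{-53}$ as unit roundoff are assumptions made for the illustration.

```python
import cmath
import math
import random

U = 2.0 ** -53  # unit roundoff of IEEE double precision (assumption about the platform)

def fft_plus(x):
    """Radix-2 FFT with kernel exp(+2*pi*i*j*k/n); len(x) must be a power of two."""
    n = len(x)
    k = n.bit_length() - 1
    a = [0j] * n
    for j in range(n):  # bit reversal permutation
        r = 0
        for b in range(k):
            if (j >> b) & 1:
                r |= 1 << (k - 1 - b)
        a[r] = complex(x[j])
    size = 2
    while size <= n:
        half = size // 2
        twiddles = [cmath.exp(2j * cmath.pi * t / size) for t in range(half)]
        for start in range(0, n, size):
            for t in range(half):
                u = a[start + t]
                v = twiddles[t] * a[start + t + half]
                a[start + t] = u + v
                a[start + t + half] = u - v
        size *= 2
    return a

def inverse_fft(y):
    """(1/n) F_n^* y, obtained from the forward routine by conjugation."""
    n = len(y)
    return [v.conjugate() / n for v in fft_plus([w.conjugate() for w in y])]

def round_trip_error(log2_n, seed=0):
    """Infinity norm relative error of inverse_fft(fft_plus(x)) for a random signal x."""
    rng = random.Random(seed)
    n = 2 ** log2_n
    x = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    back = inverse_fft(fft_plus(x))
    return max(abs(p - q) for p, q in zip(back, x)) / max(abs(q) for q in x)

if __name__ == "__main__":
    for k in range(4, 13, 2):
        err = round_trip_error(k)
        # Columns: log2(n), error, error/(log2(n) u), error/(sqrt(n) u)
        print(k, err, err / (k * U), err / (math.sqrt(2 ** k) * U))
```

Because both the forward and the inverse transform contribute rounding errors, the round trip figure is only a proxy for the quantity plotted in the figures; a reference computed in higher precision would be needed to measure the forward error itself.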

4. Application to Turbulence.

Turbulent flows contain a large range of length scales. The largest scale is determined by the geometric dimension of the fluid, the smallest scale is the one at which the molecular viscosity is effective. The intermediate scales, which are not directly affected by the energy injection and dissipation, display an energy spectrum exhibiting a universal scaling behaviour:

$$E(k) = k^{-n}$$

For three dimensional turbulence this spectrum is close to the well known Kolmogorov spectrum with $n = 5/3$, while for bidimensional turbulence the exponent is close to $n = 3$ ([9]).

In order to simulate turbulent flows effectively, a numerical method with high accuracy is needed. It is generally agreed that spectral methods are quite successful for this problem ([6, 2]). In these methods, the unknown function $u$ is expanded in terms of an infinite sequence of orthogonal functions $\phi_k$: $u = \sum_{k=-\infty}^{\infty} \hat{u}_k \phi_k$.

The numerical implementation of spectral methods requires a routine that computes the coefficients of the expansion, starting from the value of the function on a discretized grid, and the reciprocal routine which gives the value of the function on the mesh points from the spectral coefficients. It is required that these routines do not pollute the spectrum, i.e. that all the coefficients of the expansion are computed with a good accuracy. We want to study here the effects of the roundoff error on such a transformation.

Fig. 1. Computed infinity norm error of the double precision FFT of a random signal: $\rho_\infty / \sqrt{n}$ is plotted versus $\log_2 n$.

Fig. 2. Computed infinity norm error of the double precision FFT of a random signal: $\rho_\infty / \log_2 n$ is plotted versus $\log_2 n$.

We will restrict ourselves to the Fourier spectral method, which uses a Fourier series on a regular grid (e.g. $u = \sum_{k=-\infty}^{\infty} \hat{u}_k e^{ikx}$ in 1 dimension). The coefficients of the series, $\hat{u}_k$, are obtained by computing the $d$-dimensional FFT of the mesh points ($d$ is the dimensionality of the space); conversely, the values at the mesh points are found by performing the inverse FFT of the coefficients of the Fourier series.

In order to be consistent with turbulence data we consider a signal with random phases and with the following spectrum:

$$E(k) = k^{-n} e^{-\nu k}$$

where $n$ represents the scaling behaviour of the inertial region and $\nu$ is the viscous cutoff. The actual choice of the value of $n$ and $\nu$ for the experiment is not very important: the relevant thing is that there is a strong difference in magnitude between the large and the small scales. We do the numerical experiment in the following way: we start with the signal in spectral space, we perform the inverse FFT of this signal in quadruple precision, and we go back to spectral space using a double precision FFT. Since the errors arising from the computations in extended precision are negligible, the difference between the original signal and the result of the computation is precisely the error of the double precision FFT.

From the error analysis of Section 2 we expect that the absolute errors will be distributed among all the components; this is because the error matrix $G$ of Theorem 2.3 mixes up all the components.

In Fig. 3 we show the difference between the original signal and the result of the computation. We see that indeed the absolute errors are distributed throughout the spectrum. As a consequence, the relative errors on the high harmonics are quite strong: from Fig. 4 we see that all the components which are smaller by a factor of $10^{-16}$ than the largest components are completely polluted by the double precision FFT.

Current three-dimensional computations of turbulent flows do not display such a strong difference in magnitude between the large and the small scales: the largest available computations ([15]) display spectra that vary by 6 orders of magnitude. The situation is however different for two-dimensional computations, where higher resolutions are reachable. We present in Fig. 5 a typical profile, in spectral space, of the nonlinear term of the Navier-Stokes equation (courtesy M. Manzini). This is the term which must be transformed to spectral space at each time step. The figure is obtained by fixing one of the two wavenumbers and plotting the spectral coefficients of the nonlinear term as a function of the second wavenumber. We see that the last components are smaller by a factor of $10^{-10}$ than the largest components; therefore they are computed with a relative precision of only $10^{10} \varepsilon \approx 10^{-5}$ ($\varepsilon$ is the machine precision).

5. Error analysis of fast Helmholtz solvers based on FFT.

One of the major areas of application of the FFT is in the solution of the discretization of the Helmholtz problem on a square or a cube domain $\Omega$

$$\nabla^2 u + \lambda u = f$$

with $u = g$ on $\partial\Omega$. For simplicity we will go into details only for the 2D case.

Fig. 3. Roundoff error distribution in the computation of the double precision FFT of a signal having random phase and spectrum $k^{-3} e^{-.00232k}$. The absolute error is plotted versus $k$.

Fig. 4. Double precision FFT of a signal having random phase and spectrum $k^{-3} e^{-.00232k}$, versus $k$. Note the complete pollution of the spectrum for $k \ge 5 \cdot 10^3$.

Fig. 5. Plot of the real part of the spectral coefficients of the nonlinear term of the Navier-Stokes equation coming from two-dimensional turbulence simulations, as a function of the horizontal wavenumber $k$.

We will consider two kinds of discretization, central finite differences and the spectral method, and three algorithms for solving the corresponding discrete problem. In particular, we will give bounds for the roundoff error when we use the FFT for solving the discretized problems.

The finite difference discretization can be easily expressed in terms of the tridiagonal matrix of order $N$

$$T = \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix} ,$$

where $N = 2^p$ is the number of points of discretization in one direction. Then the matrix $A$, discretizing the Helmholtz operator on a square, can be expressed as

$$A = I_N \otimes T + T \otimes I_N + \lambda\, I_N \otimes I_N .$$

The eigenvector matrix $P$ of the matrix $T$ is the imaginary part of a minor of $F_N$ normalized by the factor $(2/(N+1))^{1/2}$ (see [14]). The corresponding eigenvalue matrix $D$ is diagonal with entries

$$d_i = -2 + 2\cos(i\pi/(N+1)) .$$

Moreover ([14]), the matrix $P \otimes P$ is the eigenvector matrix of $A$ and the corresponding eigenvalue matrix is

$$I_N \otimes D + D \otimes I_N + \lambda\, I_N \otimes I_N .$$

The product of a vector $y$ by $P$ is the sine transform of $y$. In both cases, the products $Py$ and $(P \otimes P)z$, where $y$ is an $N$-vector and $z$ is an $N^2$-vector, can be performed by the FFT algorithm using the $F_{2N}$ matrix.

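Then, from the results of Section 2, we have that

$$fl(Py) = Py + G_1 y \quad \text{and} \quad fl((P \otimes P)z) = (P \otimes P)z + G_2 z$$

with

$$\|G_1\|_\infty \le \gamma \sqrt{N}\, \varepsilon + O(\varepsilon^2) \quad \text{and} \quad \|G_2\|_\infty \le \gamma N \varepsilon + O(\varepsilon^2)$$

where $\gamma \approx 10.7$.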

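Two of the most common algorithms for solving the discrete problem

$$Ax = b$$

are based on the possibility of diagonalizing $A$ and $T$ by $P \otimes P$ and $P$, and can be described as follows.

Algorithm 1.
Step 1: $y = (P \otimes P)\, b$
Step 2: $u = (I_N \otimes D + D \otimes I_N + \lambda\, I_N \otimes I_N)^{-1} y$
Step 3: $x = (P \otimes P)^T u$

Algorithm 2.
Step 1: $y = (P \otimes I_N)\, b$
Step 2: Solve $V u = (I_N \otimes T + D \otimes I_N + \lambda\, I_N \otimes I_N)\, u = y$
Step 3: $x = (P \otimes I_N)^T u$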
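The three steps of Algorithm 1 can be sketched for a small grid as follows. This is an illustration under stated assumptions, not the authors' implementation: the products with $P$ are done as dense matrix products instead of FFT-based sine transforms, the right hand side and the solution are stored as $N \times N$ arrays so that $(P \otimes P)$ acts as $B \mapsto P B P$, and the diagonal entries are taken to be the eigenvalues $2 - 2\cos(i\pi/(N+1))$ of the matrix $T = \mathrm{tridiag}(-1, 2, -1)$ as displayed above (a sign convention chosen so that the sketch solves $Ax = b$ for that $T$). All function names are hypothetical.

```python
import math

def sine_matrix(n):
    """P_{ij} = sqrt(2/(n+1)) * sin(i*j*pi/(n+1)), i, j = 1..n (symmetric, orthogonal)."""
    s = math.sqrt(2.0 / (n + 1))
    return [[s * math.sin((i + 1) * (j + 1) * math.pi / (n + 1)) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Dense matrix product, used here in place of a fast sine transform."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def solve_algorithm1(rhs, lam):
    """Solve (I x T + T x I + lam I) x = b on an n by n grid, T = tridiag(-1, 2, -1)."""
    n = len(rhs)
    p = sine_matrix(n)
    # Assumption: eigenvalues of T as displayed in the text, i.e. 2 - 2 cos(i pi/(n+1)).
    d = [2.0 - 2.0 * math.cos((i + 1) * math.pi / (n + 1)) for i in range(n)]
    y = matmul(matmul(p, rhs), p)                     # Step 1: y = (P x P) b
    u = [[y[i][j] / (d[i] + d[j] + lam) for j in range(n)]
         for i in range(n)]                           # Step 2: diagonal solve
    return matmul(matmul(p, u), p)                    # Step 3: x = (P x P)^T u, P symmetric

def apply_operator(x, lam):
    """Apply I x T + T x I + lam I to a grid function (five point stencil)."""
    n = len(x)
    def at(i, j):
        return x[i][j] if 0 <= i < n and 0 <= j < n else 0.0
    return [[(4.0 + lam) * x[i][j] - at(i - 1, j) - at(i + 1, j) - at(i, j - 1) - at(i, j + 1)
             for j in range(n)] for i in range(n)]
```

Applying `apply_operator` to the result of `solve_algorithm1` and comparing with the right hand side gives the residual; for a well conditioned case such as `lam = 1.0` it should be at the level of rounding error.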

Since the matrix $V$ is a positive definite tridiagonal matrix, the system $Vu = y$ is solved by the Cholesky algorithm and the computed solution $\bar{u}$ is the exact solution of the system ([7])

$$(V + \delta V)\, \bar{u} = y , \qquad |\delta V| \le 6\varepsilon\, |V| . \qquad (11)$$

If we assume that we have a periodic boundary condition for the Helmholtz equation, the problem can be discretized by truncating the Fourier series. For simplicity, we will assume $g = 0$ in the following. If we discretize the space variable in the Fourier nodes

$$x_j = 2\pi j/(N+1) , \qquad y_k = 2\pi k/(N+1) , \qquad j, k = 1, \dots, N$$

then the Fourier coefficients $\tilde{u}_{jk}$ and $\tilde{f}_{jk}$ of the functions $u$ and $f$ are obtained by applying an inverse FFT to the values $u(x_j, y_k)$ and $f(x_j, y_k)$. In [2], it is proved that, using this property, the solution of a Helmholtz equation can be obtained by solving the system

$$Sx = b$$

with

$$S = (B_N F_N \otimes B_N F_N)^{-1}\, D_F\, (B_N F_N \otimes B_N F_N) ,$$

where we denote by $D_F$ the diagonal matrix of entries

$$(D_F)_{j,k} = -(j^2 + k^2) + \lambda$$

Algorithm 3.
Step 1: $y = (B_N F_N \otimes B_N F_N)^{-1} b$
Step 2: $u = (D_F)^{-1} y$
Step 3: $x = (B_N F_N \otimes B_N F_N)\, u$

We first establish a technical lemma that will be used in the proof of the main results stated in Theorem 5.2.

From the algebraic point of view, the three algorithms that we have described can be seen as three specific cases of the following sequence of matrix operations

$$x = W^{-1} U^{-1} W b , \qquad (12)$$

where $W$ is a matrix related to $F_N$ and $U$ is either a tridiagonal or a diagonal positive definite matrix. Taking into account formula (11) we can also assume that

$$|\delta U| \le c\,\varepsilon\, |U| , \qquad (13)$$

with $c$ independent of $N$. Moreover, we can assume without loss of generality that $W^{-1} = W^T$.

The following result gives a general expression of the error $\delta x$ affecting the computed value $\bar{x}$ of $x$ in (12).

Lemma 5.1. Let

$$y = fl(Wb) = Wb + \Gamma W b , \qquad u = fl(W^T v) = W^T v + \tilde{\Gamma} W^T v ,$$

and assume (13). Then

$$\delta x = [\tilde{\Gamma} - W^T U^{-1} \delta U\, W + W^T U^{-1} \Gamma U W]\, x + O(\varepsilon^2) .$$

Proof: From (11) we have

$$z = fl(U^{-1} y) = (U + \delta U)^{-1} y .$$

The result is then obtained by expanding the right hand side of

$$\bar{x} = fl(W^T fl(U^{-1} fl(Wb))) ,$$

by linearizing the errors and by the fact that

$$(U + \delta U)^{-1} = (I - U^{-1} \delta U)\, U^{-1} + O(\varepsilon^2) .$$

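We are now ready to state the general expression of the error $\delta x$ affecting the computed value $\bar{x}$ of $x$ in (12).

Theorem 5.2. Let $\bar{x}^{(1)}$, $\bar{x}^{(2)}$ and $\bar{x}^{(3)}$ be the computed values of the solution $x$ obtained by Algorithm 1, Algorithm 2 and Algorithm 3 respectively. Then

$$\frac{\|\bar{x}^{(1)} - x\|_\infty}{\|x\|_\infty} \le c_1 N \varepsilon\, [2 + \kappa_\infty(A)] + O(\varepsilon^2) , \qquad (14)$$

$$\frac{\|\bar{x}^{(2)} - x\|_\infty}{\|x\|_\infty} \le c_2 \sqrt{N}\, \varepsilon\, [1 + \kappa_\infty(A) + (8 + |\lambda|)\, \|A^{-1}\|_\infty] + O(\varepsilon^2) , \qquad (15)$$

$$\frac{\|\bar{x}^{(3)} - x\|_\infty}{\|x\|_\infty} \le c_3 N \varepsilon\, [2 + \kappa_\infty(S)] + O(\varepsilon^2) , \qquad (16)$$

with $c_1$, $c_2$ and $c_3$ constant and

$$\kappa_\infty(A) = \|A^{-1}\|_\infty \|A\|_\infty , \qquad \kappa_\infty(S) = \|S^{-1}\|_\infty \|S\|_\infty .$$

Moreover, for the Poisson equation ($\lambda = 0$) we have

$$\frac{\|\bar{x}^{(1)} - x\|_\infty}{\|x\|_\infty} \le O(\varepsilon N^3) , \qquad (17)$$

$$\frac{\|\bar{x}^{(2)} - x\|_\infty}{\|x\|_\infty} \le O(\varepsilon \sqrt{N} N^2) , \qquad (18)$$

$$\frac{\|\bar{x}^{(3)} - x\|_\infty}{\|x\|_\infty} \le O(\varepsilon N^3 \log_2^2 N) . \qquad (19)$$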

Remark. Formulas (17-19) are to our knowledge the first bounds given on the roundoff error for the Poisson problem. Our bounds take into account the novelty of the results of section 1. When using finite differences it is known that the condition number $\kappa_\infty(A)$ does not depend on the dimensionality of the problem. This property does not hold when the FFT algorithm is used, as is shown in Appendix A. As a consequence, the roundoff error (19) has a different expression for 1D and 3D problems (the behaviour would be respectively $\varepsilon N^{5/2} \log_2 N$ and $\varepsilon N^{9/2} \log_2 N$ in 1D and 3D).

We note also that all these bounds are based on the upper bound worst case analysis of section 2. Using the statistical analysis of section 3 the bounds (17-18) drop to $O(\varepsilon N^2 \log_2^2 N)$, and the bound (19) becomes: $O(\varepsilon N^2 \log_2^2 N)$ in 1D, $O(\varepsilon N^2 \log_2^3 N)$ in 2D and $O(\varepsilon N^3 \log_2^2 N)$ in 3D.

For the Helmholtz problem (formulas (14-16)), the condition numbers $\kappa_\infty(A)$ and $\kappa_\infty(S)$ depend heavily on the value of $\lambda$. We can only say that for positive $\lambda$ the

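Proof: For Algorithm 1 and Algorithm 3 we can write the expression of $\delta U$ in (13) as $\delta U = \Theta D$, with $D$ the corresponding diagonal matrix of each algorithm and $|\Theta| \le \varepsilon$. Then, from Lemma 5.1 we have

$$\delta x = [\tilde{\Gamma} - W^T \Theta W + A^{-1} W^T \Gamma W A]\, x + O(\varepsilon^2) .$$

Using the infinity norm we obtain

$$\|\delta x\|_\infty \le [\, \|\tilde{\Gamma}\|_\infty + \|W^T \Theta W\|_\infty + \kappa_\infty(A)\, \|W^T \Gamma W\|_\infty ]\, \|x\|_\infty + O(\varepsilon^2) ,$$

where $\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty$.

Formulae (14) and (16) follow directly from the fact that

$$\|W^T \Gamma W\|_\infty \le 10.7\, N \varepsilon + O(\varepsilon^2)$$

(the proof of this is totally equivalent to the one of Theorem 2.6), and from the property

$$\|\tilde{\Gamma}\|_\infty \le 10.7\, N \varepsilon + O(\varepsilon^2) .$$

For proving (15) it is necessary to remember that the structure of $\delta U$ is block diagonal with diagonal blocks of size $N$ which are tridiagonal.

From Lemma 5.1 we have

$$\|\delta x\|_\infty \le [\, \|\tilde{\Gamma}\|_\infty + \kappa_\infty(A)\, \|W^T \Gamma W\|_\infty + \|W^T U^{-1} \delta U\, W\|_\infty ]\, \|x\|_\infty + O(\varepsilon^2) .$$

The first two terms, since $\tilde{\Gamma}$ and $\Gamma$ are block diagonal with diagonal blocks of size $N$, can be bounded as follows

$$\|\tilde{\Gamma}\|_\infty \le 10.7 \sqrt{N}\, \varepsilon + O(\varepsilon^2) \quad \text{and} \quad \|\Gamma\|_\infty \le 10.7 \sqrt{N}\, \varepsilon + O(\varepsilon^2) .$$

For the third term we have

$$\|W^T U^{-1} \delta U\, W\|_\infty \le \|A^{-1}\|_\infty\, \|(P \otimes I_N)\, \delta U\, (P \otimes I_N)^T\|_\infty .$$

Because there exists a permutation matrix $Q$ such that $Q (P \otimes I_N) Q^T = I_N \otimes P$ and $Q\, \delta U\, Q^T = \widetilde{\delta U}$, with $\widetilde{\delta U}$ having a block tridiagonal structure with diagonal blocks, we have

$$\|(P \otimes I_N)\, \delta U\, (P \otimes I_N)^T\|_\infty = \|(I_N \otimes P)\, \widetilde{\delta U}\, (I_N \otimes P)^T\|_\infty .$$

The value of $\|(I_N \otimes P)\, \widetilde{\delta U}\, (I_N \otimes P)^T\|_\infty$ is attained on one block of $N$ lines of the matrix. From this we derive

$$\|(I_N \otimes P)\, \widetilde{\delta U}\, (I_N \otimes P)^T\|_\infty \le \sqrt{N}\, \|(I_N \otimes P)\, \widetilde{\delta U}\, (I_N \otimes P)^T\|_2 \le \sqrt{N}\, \|\delta U\|_2$$

and then

$$\|\delta U\|_2 \le \|\, |\delta U|\, \|_2 \le 6\varepsilon\, \|\, |U|\, \|_2 + O(\varepsilon^2) \le 6\varepsilon\, (8 + |\lambda|) + O(\varepsilon^2) .$$

Finally, (17) and (18) follow from ([14])

$$\kappa_\infty(A) \le c N^2$$

and (19) follows from the following result established in Appendix A:

$$\kappa_\infty(S) \le c N^2 \log_2^2 N .$$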

6. APPENDIX A.

In this appendix we show that the condition number in infinity norm of the matrix $S$ of Theorem 5.2 approaches, for $\lambda = 0$, the value $2N^2 (\log_2^2 N)/\pi$ in the limit of large $N$. Since both $S$ and $S^{-1}$ are Toeplitz circulant matrices, their infinity norms are equal to the $L^1$ norm of their rows. In particular, for $S^{-1}$ we have

$$\|S^{-1}\|_\infty = \frac{1}{N^2} \sum_{p=0,\,q=0}^{N-1} \left| \sum_{j=0,\,k=0}^{N-1} \frac{1}{(j+1)^2 + (k+1)^2}\, e^{-2\pi i (pj + qk)/N} \right| .$$

It is easy to prove that

$$\|S^{-1}\|_\infty \le \log_2 N .$$

The matrix $S_1 = F_N \Lambda F_N^{-1}$, with $\Lambda$ the diagonal matrix of entries $\Lambda_k = k^2$, is the one dimensional equivalent of $S$ and

$$\|S\|_\infty \le 2\, \|S_1\|_\infty .$$

Since $S_1$ is a Toeplitz circulant matrix, its infinity norm is equal to the $L^1$ norm of its rows. We have thus:

$$\|S_1\|_\infty = \frac{1}{N} \sum_{j=0}^{N-1} \left| \sum_{k=0}^{N-1} (k+1)^2 e^{-2\pi i jk/N} \right|$$

From the well known property

$$\sum_{k=0}^{n} \sin kx = \frac{\sin\left(\frac{n+1}{2} x\right) \sin\left(\frac{nx}{2}\right)}{\sin\left(\frac{x}{2}\right)} , \qquad \sum_{k=0}^{n} \cos kx = \frac{\sin\left(\frac{n+1}{2} x\right) \cos\left(\frac{nx}{2}\right)}{\sin\left(\frac{x}{2}\right)}$$

the inner sum can be evaluated exactly: jjS 1 jj 1= N(2N + 1) 6 + 12 N?1 X j=1 q 1 + (N2 ?1)sin 2x j sin2x j (xj j N ) We rewrite this expression in the following form:

jjS 1 jj 1= N(2N + 1) 6 + q (N2 ?1) 2 N?1 X j=1 1 sinxj s 1 +_(N2 1 ?1)sin 2x j

and we expand the square root in Taylor series around 1:

s 1 + _(N2 1 ?1)sin 2x j = 1 X i=0 ci h (N2 ?1) i ?i sin?2ix j

($c_i$ are the coefficients of the Taylor expansion of $\sqrt{1+x}$). We have then:

$$\|S_1\|_\infty = \frac{N(2N+1)}{6} + \frac{\sqrt{N^2-1}}{2} \sum_{j=1}^{N-1} \sum_{i=0}^{\infty} c_i \left[(N^2-1)\right]^{-i} \sin^{-2i-1} x_j .$$

We interchange the inner and the outer sum:

$$\|S_1\|_\infty = \frac{N(2N+1)}{6} + \frac{\sqrt{N^2-1}}{2} \sum_{i=0}^{\infty} c_i \left[(N^2-1)\right]^{-i} \sum_{j=1}^{N-1} \sin^{-2i-1} x_j . \tag{20}$$
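The rewriting that leads to (20) is purely algebraic and can be checked numerically. The sketch below compares the $j$-sum with the square root kept in closed form against the interchanged double series truncated after a fixed number of terms. The truncation length `terms=40` and the test sizes are arbitrary choices of ours, and the coefficients are taken as $c_i = \binom{1/2}{i}$.

```python
import math

def closed_form_part(N):
    """(1/2) sum_j sqrt(1 + (N^2-1) sin^2 x_j) / sin^2 x_j, with x_j = pi j / N."""
    a = N * N - 1
    total = 0.0
    for j in range(1, N):
        s = math.sin(math.pi * j / N)
        total += math.sqrt(1.0 + a * s * s) / (s * s)
    return 0.5 * total

def sqrt_coeffs(terms):
    """Taylor coefficients of sqrt(1 + x) about x = 0: c_i = binom(1/2, i)."""
    c = [1.0]
    for i in range(terms - 1):
        c.append(c[-1] * (0.5 - i) / (i + 1))
    return c

def series_part(N, terms=40):
    """Same quantity from the interchanged sums of (20), truncated after `terms` terms."""
    a = float(N * N - 1)
    total = 0.0
    for i, ci in enumerate(sqrt_coeffs(terms)):
        inner = sum(math.sin(math.pi * j / N) ** (-2 * i - 1) for j in range(1, N))
        total += ci * a ** (-i) * inner
    return 0.5 * math.sqrt(a) * total

checks = {N: (closed_form_part(N), series_part(N)) for N in (4, 16, 64)}
for N, (closed, series) in checks.items():
    print(N, closed, series)
```

The expansion variable $1/((N^2-1)\sin^2 x_j)$ stays well below one for the sizes tried here, so the truncated series converges quickly.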

We will show that the inner sum is dominated by the contribution $i = 0$ in the limit of large $N$, and that this contribution is of order $2N(\log_2 N)/\pi + O(N)$.

We begin by proving that the contribution for $i \neq 0$ is of order $O(N)$:

Lemma 6.1.

$$\frac{1}{\sin x} \le \frac{\pi}{2x} \qquad \forall x \in \left]0, \tfrac{\pi}{2}\right[ .$$

From this we derive:

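$$\begin{aligned}
\sum_{i=1}^{\infty} c_i \left[(N^2-1)\right]^{-i} \sum_{j=1}^{N-1} \sin^{-2i-1}\frac{\pi j}{N}
&\simeq 2 \sum_{i=1}^{\infty} c_i \left[(N^2-1)\right]^{-i} \sum_{j=1}^{(N-1)/2} \sin^{-2i-1}\frac{\pi j}{N} \\
&\le 2 \sum_{i=1}^{\infty} \left(\frac{N}{2}\right)^{2i+1} c_i \left[(N^2-1)\right]^{-i} \sum_{j=1}^{(N-1)/2} \frac{1}{j^{2i+1}} \\
&\le 2C \sum_{i=1}^{\infty} \left(\frac{N}{2}\right)^{2i+1} c_i \left[(N^2-1)\right]^{-i} \\
&\simeq 2CN \sum_{i=1}^{\infty} \frac{c_i}{2^{2i+1}} = O(N)
\end{aligned}$$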

($C$ is a constant such that $\sum_{j=1}^{\infty} 1/j^{2i+1} < C$ $\forall i \ge 1$).
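Both ingredients of this estimate can be probed numerically. The sketch below first samples the inequality of Lemma 6.1 on a grid, then evaluates the contribution of the terms $i \ge 1$ divided by $N$. For the second part it uses the fact that $\sum_{i \ge 1} c_i y^i = \sqrt{1+y} - 1$, which is our shortcut for avoiding the infinite series; the grid size and the values of $N$ are arbitrary.

```python
import math

def lemma_6_1_holds(samples=20000):
    """Check 1/sin(x) <= pi/(2x) on a uniform grid inside ]0, pi/2[."""
    for s in range(1, samples):
        x = 0.5 * math.pi * s / samples
        if 1.0 / math.sin(x) > (math.pi / (2.0 * x)) * (1.0 + 1e-12):
            return False
    return True

def higher_order_part(N):
    """Sum over i >= 1 of c_i (N^2-1)^(-i) sum_j sin^(-2i-1)(pi j / N).

    The i-sum is resummed as sqrt(1 + y) - 1 with y = 1 / ((N^2-1) sin^2 x_j).
    """
    a = N * N - 1
    total = 0.0
    for j in range(1, N):
        s = math.sin(math.pi * j / N)
        total += (math.sqrt(1.0 + 1.0 / (a * s * s)) - 1.0) / s
    return total

ratios = {N: higher_order_part(N) / N for N in (16, 64, 256, 1024)}
print("Lemma 6.1 on the grid:", lemma_6_1_holds())
for N, r in ratios.items():
    print(N, r)
```

If the contribution is $O(N)$, the printed ratios should remain bounded as $N$ grows.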

We show now that the contribution for $i = 0$ is of order $2N(\log_2 N)/\pi$.

Lemma 6.2.

$$\frac{1}{x} \le \frac{1}{\sin x} \le \frac{1}{x} + x \qquad \forall x \in \left]0, \tfrac{\pi}{2}\right] .$$

From this we derive:

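$$\sum_{j=1}^{(N-1)/2} \frac{N}{\pi j} \le \sum_{j=1}^{(N-1)/2} \frac{1}{\sin\frac{\pi j}{N}} \le \sum_{j=1}^{(N-1)/2} \left( \frac{N}{\pi j} + \frac{\pi j}{N} \right)$$

$$\frac{N}{\pi} \log_2 \frac{N}{2} \le \sum_{j=1}^{(N-1)/2} \frac{1}{\sin\frac{\pi j}{N}} \le \frac{N}{\pi} \log_2 \frac{N}{2} + 2(N+1)$$

In the limit $N \to \infty$ we have:

$$\sum_{j=1}^{N-1} \frac{1}{\sin\frac{\pi j}{N}} \simeq 2 \sum_{j=1}^{(N-1)/2} \frac{1}{\sin\frac{\pi j}{N}} \to \frac{2N \log_2 N}{\pi} + O(N) \tag{21}$$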
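The $N \log N$ growth stated in (21) can be observed without fixing the constant or the base of the logarithm. The sketch below samples the inequalities of Lemma 6.2 on a grid and then measures how much the sum divided by $N$ increases each time $N$ doubles. For a quantity of the form $aN\log N + O(N)$ this increment tends to a constant. The grid size, the values of $N$ and the function names are our choices.

```python
import math

def lemma_6_2_holds(samples=20000):
    """Check 1/x <= 1/sin(x) <= 1/x + x on a uniform grid in ]0, pi/2]."""
    for s in range(1, samples + 1):
        x = 0.5 * math.pi * s / samples
        inv_sin = 1.0 / math.sin(x)
        lower_ok = 1.0 / x <= inv_sin * (1.0 + 1e-12)
        upper_ok = inv_sin <= (1.0 / x + x) * (1.0 + 1e-12)
        if not (lower_ok and upper_ok):
            return False
    return True

def csc_sum(N):
    """sum_{j=1}^{N-1} 1 / sin(pi j / N)."""
    return sum(1.0 / math.sin(math.pi * j / N) for j in range(1, N))

def doubling_increment(N):
    """Change of csc_sum(N) / N when N is doubled."""
    return csc_sum(2 * N) / (2 * N) - csc_sum(N) / N

increments = [doubling_increment(N) for N in (64, 256, 1024)]
print("Lemma 6.2 on the grid:", lemma_6_2_holds())
print("increments per doubling:", increments)
```

Nearly equal increments across the three sizes are what the logarithmic factor in (21) predicts.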

From Equations (20) and (21) the infinity norm of the matrix $S_1$ in the limit of large $N$ can be evaluated as:

$$\|S_1\|_\infty \approx \frac{\sqrt{N^2-1}}{2}\, \frac{2N \log_2 N}{\pi} \approx \frac{N^2 \log_2 N}{\pi} .$$

It is interesting to note that $S_1$ is the spectral matrix in 1-D corresponding to the discretization of the second derivative. Its condition number is

$$\kappa_\infty(S_1) \approx N^2 \log_2 N .$$

Analogously, denoting by $S_3$ the $N^3 \times N^3$ spectral matrix corresponding to the 3-D discretization of the Laplace operator we have

$$\kappa_\infty(S_3) \approx N^3 \log_2 N .$$

The proof is similar to the one given for $S$.

It is relevant to note that the infinity norm condition number of the matrices coming from the finite difference approximation of the Laplace operator behaves as $O(N^2)$, whereas that of the spectral approximation matrices $S_1$, $S$ and $S_3$ increases from $O(N^2 \log_2 N)$ to $O(N^3 \log_2 N)$.

[1] D. Brillinger, Time Series. Data Analysis and Theory, Holt, Rinehart and Winston, 1975.

[2] C. Canuto, M. Y. Hussaini, A. Quarteroni, and T. A. Zang, Spectral Methods in Fluid Dynamics, Springer-Verlag New York Inc., New York, 1988.

[3] C. Y. Chu, The Fast Fourier Transform on Hypercube Parallel Computers, Technical Report 87-882, Department of Computer Science, Cornell University, 1987.

[4] J. W. Cooley and J. W. Tukey, An Algorithm for the Machine Calculation of Complex Fourier Series, Math. Comp., 19 (1965), pp. 297-301.

[5] W. M. Gentleman and G. Sande, Fast Fourier Transforms - For Fun and Profit, in 1966 Fall Joint Computer Conference, AFIPS Conf. Proceedings, vol. 29, Spartan, Washington, D.C., 1966, pp. 563-578.

[6] D. Gottlieb and S. A. Orszag, Numerical Analysis of Spectral Methods: Theory and Applications, SIAM-CBMS, Philadelphia, 1977.

[7] N. J. Higham, Bounding the Error in Gaussian Elimination for Tridiagonal Systems, SIAM J. Matrix Anal. Appl., 11 (1990), pp. 521-530.

[8] T. Kaneko and B. Liu, Accumulation of Round-Off Error in Fast Fourier Transforms, J. ACM, 17 (1970), pp. 637-654.

[9] R. H. Kraichnan, Inertial Ranges in Two-Dimensional Turbulence, The Physics of Fluids, 10 (1967), p. 1417.

[10] C. Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM, Philadelphia, 1992.

[11] H. Munthe-Kaas, Super Parallel FFT's, Rep. no. 52 1991, Dept. of Informatics, University of Bergen, Norway, 1991.

[12] A. V. Oppenheim and C. J. Weinstein, Effects of Finite Register Length in Digital Filtering and the Fast Fourier Transform, Proceedings of the IEEE, vol. 60, 1972, pp. 957-976.

[13] G. U. Ramos, Roundoff Error Analysis of the Fast Fourier Transform, Math. Comp., 25 (1971), pp. 757-768.

[14] P. N. Swarztrauber, Fast Poisson Solvers, vol. 24, MAA Studies in Numerical Analysis, 1977.

[15] A. Vincent and M. Meneguzzi, The Spatial Structure and Statistical Properties of Homogeneous Turbulence, J. Fluid Mech., 225 (1991), p. 1.

[16] C. J. Weinstein, Roundoff Noise in Floating Point Fast Fourier Transform Computation, IEEE Trans. Audio and Electroacoustics, AU-17 (1969), pp. 209-215.