Fast multiplication of multiple-precision integers

(1)

RIT Scholar Works

Theses

Thesis/Dissertation Collections

1991

Fast multiplication of multiple-precision integers

Sonja Benz

Follow this and additional works at:

http://scholarworks.rit.edu/theses

This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contactritscholarworks@rit.edu.

Recommended Citation

(2)

FAST MULTIPLICATION

OF

MULTIPLE-PRECISION INTEGERS

by Sonja Benz

A thesis, submitted to

The Faculty of the Department of Computer Science, in partial fulfillment of the requirements for the degree of

Master of Science in Computer Science.

Approved by:

Professor S. Radziszowski

Professor P. Anderson

Professor A. Kitchen

(3)

Jl;f

~f ~'(

'O~1-JihvJJ.

r-f.R -

fl'eo;noVL-

A~8~J-d

,

ltei-76-

ALH he!

g-vrJ1.ht

Ao

At

~

/;Jo,.lJ

(J

c.

e

Me

l/Vv(J

n'

a

f

f-lt-t''h

I-~IOl-v

Llh

rOf1j

au

~JuS"l ~

-"'~

Hfru

dJ..J-c/,

Ovt

{'VI/v?

t

-n-tr Ct'Ot.

e

1211

AJ

w4

()/~

~<f.i

/t'Lo

J

N.f'-o

d...u (,

e

n;;

D;-

~:1--t

fa

tvf:-

r

~

(4)

Abstract

1 Introduction

o

1

2 Theoretical Background 4

2.1 Preliminary Remarks. 4

2.2 Terminology and Notation. 5

2.3 Classical n2 Multiplication Algorithm 7

2.4 KARATSUBA's Algorithm . 8

2.5 Fast Convolution Algorithms 9

2.5.1 General Idea . . . 9

2.5.2 Discrete Fourier Transform 9

2.5.3 Application of the Fourier Transform to Convolution 11

2.5.4 Fast Fourier Transform . . . . 14

2.5.5 Mod m Fast Fourier Transform 18

2.5.6 Modular Arithmetic . . . 22

2.5.7 Ordinary Fast Convolution Algorithm 24

2.5.8 SCHONHAGE-STRASSEN's Algorithm (1971) 27

2.5.9 Modified SCHONHAGE-STRASSEN Algorithm (1982) 37

(5)

3.1 _Preliminary Remarks 41

3.2 Fundamental Elements 42

3.2.1 Hardware and Software 42

3.2.2 Number Representation 43

3.2.3 _MemoryManagement 45

3.2.4 User Interface 46

3.2.5 Time Measurement 46

3.2.6 Conditional Compilation 48

3.3 I iplementation ofthe Multiplication Algorithms 49

'5.1 Classical Algorithm 49

3.3.2 KARATSUBA's Algorithm 51

3.3.3 Basic ElementsofFast Convolution Algorithms .... 53

3.3.4 _Ordinary Fast Convolution Algorithm 59

3.3.5 SCHONHAGE-STRASSEN's Algorithm ₍₁₉₇₁₎ .... 60

3.3.6 SCHONHAGE-STRASSEN's Algorithm ₍₁₉₈₂₎ .... 61

4 Results and Discussion 63

4.1 _Preliminary Remarks 63

4.2 KARATSUBA's Algorithm The Influence of the Depth of

Recursion 64

4.3 PerformanceoftheAlgorithms 67

4.4 Related Studies 80

4.5 Conclusions 81

(6)

Multiple-precision multiplication algorithms are offundamental interest for

both theoretical and practical reasons. The conventional method requires

0(n2) bit operations whereas thefastest known multiplication algorithm is oforder0(n _log n _{log log} n). Thepricethat has tobepaidfortheincrease in speed is a much more sophisticated _theory and _programming code.

This workpresents an extensive _study ofthebest knownmultiple-precision

multiplication algorithms. Different algorithms areimplemented in _C, their performance is analyzed in detail and compared to each other. The break

even _points, which are essentialforthe selection ofthefastest algorithm for

(7)

Introduction

The multiplication _operation, which except for addition is the most ele

mentary arithmetic _{operation, is} of fundamental interest for both _purely

theoretical and practicalreasons. Thewell-known classical method achieves 0(n2) bit _complexity 1. While the performance ofthis method is satisfac toryforsmall_numbers, theapplicationfor large numbersisrestricted _byan unfavorable asymptotic behavior.

Thecommon andintuitiveopinionthat theschool method can'tbe improved

significantly was disproved in 1962 _by A. _KARATSUBA, who suggested a

faster algorithmwith_complexity _0(nlo&3) _[2, _p.62] _[15,p.278]. Heobtained thisimprovement _by_employingsimple algebraicidentitiesandarecursivedi

vide and conquer approach. _By 1966 A. L. TOOM and S. A. COOK discov ered another _{asymptotically} fast algorithm with _complexity _0(n2aVlog2n)

for some constant a _[15, p.286]. _They took advantage of the similarity of

integer and polynomial arithmetic andinterpreted multiplication ofintegers as multiplication of polynomials.

The best knownupperbound formultiplication is 0(n _log n _{log log} _n) and is achieved_byan algorithm ofA. SCHONHAGEandV. STRASSEN _[32] [5].

SCHONHAGE and STRASSENused theconnectionbetweenmultiplication

of natural numbers and polynomial multiplication and alsoemployedthefast Fourier transform.

This upper bound applies to_Turing_machines, i.e. toaserialmachinemodel

with a bounded central high speed _memory and alimited word size. This

1See section 2.2for thedefinition of "orderO"

(8)

machine isbelieved tobe a realistic model ofstandard computers. Attempts toestimate_complexityofintegermultiplication must refertoa specific com

putational model. For a random access machine _(RAM), whose registers

are capable of_holding integers of_{arbitrary size, the} upper bound drops to

O(nlogn) [15, p.295]. A pointer machine(also called "storagemodification

machine"

or _"linking _automata") can _carry out an n-bit multiplication in

0(n) steps _[30] _[15, p.295].

It is a _{strange phenomenon} _that _although

asymptotically fast algorithms have been known for _decades, the multiple-precision software still uses the

standard procedures _[28], R. P. _BRENT, for _example, who wrote an ex tensive Fortranmultiple-precision floating-point arithmetic _package, argued that he used the classical _0(n2) method because "it seems difficult to im plement fasteralgorithms ina machineindependent manner in Fortran" [7]. There seems tobe ageneral consensus that theSCHONHAGE-STRASSEN

algorithm is "hard to implement" _[5] [29]. BORODIN even states: "How

ever, it is not at all clear that the method could ever be practically im

plemented"

[4]. _Actually, the SCHONHAGE-STRASSEN algorithm was

originally interpreted as a theoretical result without _any practical impact

[31] [29]. _{Most practically} orientedpapers and books recommend the use of a simplified version ofthe algorithm _{[4] [19]} [5].

The _study ofmultiplication algorithms _{undoubtedly is} _{interesting by} itself

and for _purely scientific reasons. On the other _hand, there is also a great

demandfor fastmultiple-precision algorithmsina vast range of applications.

In addition to typical applications _{like primality} tests _{[21] [22] [24]} and high-precision computations of constants _{[27] [8] [5]} _[11],the propagation of

progessive cryptological methods _{[13] [26]} has increased the interest in fast multiplication algorithms. _Furthermore,integermultiplication mod (2+₁₎ for valueslike n _800, n= 64000or even n = 512000is of practicalinterest

for the application of afast method to compute _{approximately} the zeros of

complex polynomials _[28] [31]. Otherapplications can be found in _{[6] [3]} [5].

This work presents an extensive _study ofthebest known multiple-precision

multiplicationalgorithms. Differentalgorithms- the

classical school_method,

the KARATSUBA algorithm and variations ofthe SCHONHAGE-STRAS

SENalgorithm- _areimplementedinthe

programminglanguage C andtested on an INTEL 80386 machine. The performance ofthe multiplication algo

rithmsis analyzed in _detail, andthebreak even points are determined. The

simplicity ofthe classical method produces_only asmall overhead when im

(9)

"small"

multiple-precision numbers. The more sophisticated multiplication

(10)

Theoretical

Background

2.1

Preliminary

Remarks

This chapter develops the theoretical concepts of five multiplication algo rithms,

the classical multiplication algorithm

the KARATSUBA algorithm

the_{ordinary fast} convolution algorithm

theoriginal SCHONHAGE-STRASSEN algorithm ₍₁₉₇₁₎

the modified SCHONHAGE-STRASSEN algorithm (1982).

Thereby the _ordinary fast convolution algorithm and the SCHONHAGE-STRASSEN algorithms from 1971 and 1982 are so-called "fast convolution

algorithms".

Section 2.2 introduces fundamental_terminologyandnotation. Insection 2.3 the classical, and in section 2.4 theKARATSUBA multiplication algorithm is developed.

(11)

2.5.2 and _2.5.4) and the convolution theorem (section 2.5.3). Common to all ofthe threefast convolution algorithmsisthe application ofthe "mod m fast Fourier transfrom"

(section _2.5.5) which works in a finite field or _ring Zm ofintegerswith arithmetic carriedout modulo anintegerrn. In_addition, the _{SCHONHAGE-STRASSEN} algorithm from 1971 employs the Chinese remainder theorem which is introduced in section 2.5.6. _Finally, sections 2.5.7, 2.5.8 and 2.5.9 _develop the concrete individual algorithms, the fast

ordinary convolution _algorithm, the SCHONHAGE-STRASSEN algorithm (1971) and the modified SCHONHAGE-STRASSEN algorithm from _1982, respectively.

2.2

Terminology

and Notation

Fundamental _terminology and notation will be introduced here. When we

analyze and compare multiplicationalgorithms wetalkabouttimecomplex

ity. In particular, we will focus onthe asymptotic _{time comlexity,} since the size of the problems that can be solved is determined _by the asymptotic behavior of a specific algorithm. If an algorithm processes inputs of size n

in time c n2

for some constant _c, for _example, the algorithm is said to be

0(n2), read "order n2".

Definition 2.1 (Order _O) A nonnegativefunction g(n) is said to be 0(f(n)) (read order_f(n)) _ifthere exists a constant c such that

g(n) < cf(n)

for all but some finite (possibly _empty) set _of nonnegative values _ofn [2, p.2].

Itis importanttodistinguish between _{complexity in}bit operationsand com

plexity inarithmetic operations. For_example,multiplication oftwon-bit in tegerscanbe done in₀₍₁₎arithmetic operationsbutin0(n _log n _{log log} _n) bit operations. The bitwise model takes _{into account,} that the time of an arithmetic operation depends on the size of theoperands and incorporates the so called logarithmic cost 1 _[2] of operands. All variables are assumed to-have the values 0 or _1, i.e. _they are bits and the operations used are

(12)

logical rather than arithmetic. This model is useful if we talk about basic

operations, such asthearithmetic_ones,which are primitive inother models. Unless stated _otherwise, we will talk about bit complexity here.

Appropriaterepresentationsfor largeintegers havetobe chosen. _Bydefault

all integers in this paper will be nonnegative. A positive integer can be

written in radix B positional _notation, with

nB-\

u= ]T

UiB\ 0<ui< B, (2.1)

i=0

where ub is thenumberofdigits ,-. Well-known standard examples arethe binary, octal, decimal and hexadecimal radix B representations ofintegers. We will writefor a radix B number u

u = (ub_i,...,ui,u0)b- _(2-2)

Occasionally, we will _employ vector notation

u = _{[u0,ui,...,MB-i]r,} _(2.3)

with u _being alength _ns column vector.

Note,that when wetalked about n-bitintegers and_0(n2)bit complexitywe should have used the _terminology n2-bit integers and _0(71?.) bit _complexity instead. Often B _^ 2 and_{thus ni} _^ _ng and we _carefullyhavetodistinguish

between n-i and ns- _{Nevertheless,} oftenthe _meaningof_ni is clear fromthe

context and we will omit the index andjust write n.

In section 2.5 it will be shown that the critical problemin _finding fast mul

tiplication algorithms is tospeed _up the convolution ofdigits oftwofactors

u and v.

Definition 2.1 _{(Convolution)} Let R bean_arbitrarycommutative _ringR. Assume u = [rj,

> ^n-i]T

and v = [vo, ...,

vn-i]T

are two column vectors

with _Ui,Vi G R. Then theconvolution _ofu and_v, denoted_uv, is the vector

c = [c0,

...,c2n_i]T, where

n-1

j=0

(m = _Vi = ₀

(13)

Thus

c0 = _uQv0

Ci = _u0^i ₊ _UiVo

C2 = _U0V2₊ _Ul^l ₊ _U2V0 ,

C2n-1 = 0.

2.3 Classical n2

Multiplication Algorithm

The crucial point in multiple-precision arithmetic operations for comput ers is that _long numbers _may be written in radix B _notation, where B = 2word atze _for

a given word size of a computer. The term n-digit integerthen means _{any integer less} than Bn. Multiple-precision operations can be con structedon_topoffundamentaloperationsforsingle digitswhich are supplied in formofsingle- _or _{double-precision} _operations

by standard computers.

Algorithm C _{(Classical multiplication)}

Given nonnegative integers _{(u_i,} ...,wo)b and (uTO-i,...,V)b> this al

gorithm forms their radix-_i? _product

(wm+n^i,...,wo)b- (The conven tionalpencil-and-paper methodis based on _calculatingthe partial products (un-i, ...,uo)b x Vj first, for 0 < j < m, and then adding these products together with appropriate _scaling _factors; but in a computer it is best to do the addition _concurrently with _{the multiplication,} as described in this

algorithm.) According to [15, p._253] the outline is as follows:

Cl _{[Initialize.]} Set _tuo, ,wm+n-i autozero- Set j

< _0. (If

w0,...,wm+n-i

were not cleared tozero the steps below would set

(u;m+n_i,...,u;o)B *~ (un-l,,Uo)B^('"m-l,,Vo)B+(wm+n-i,...,Wo)B.

This more general operation is sometimes _useful.)

C2 [Zeromultiplier?] If Vj = _0,goto _{step C6.} (This test saves agood deal oftime ifthereis a reasonable chancethat _Vj is _zero, but otherwise it may be omitted without _affecting the_validity ofthe _algorithm.)

C3 [Initialize _i.] Set i

(14)

C4 _[Multiply and_{add.] Set}t <-

,-x_{v, + wi+j +}_k; thenset _wi+J *- _t

modB and A; <- _[</BJ.

(Here the "carry" k will always be in the range

0 < k < _B; seebelow).

C5 _[Loop on _i.] Increase i _by one. Now if i < _n, go back to step C4; otherwise set _w,+j < _k.

C6 _[Loop on _j.] Increase _j _by one. Now if _j < m, go back _{to step} _C2; otherwise the algorithm terminates.

The critical points for an efficient implementation ofthe algorithm are the twoinequalities

0 < t < _B2, 0 < k< _B,

since _they determine the size ofthe register needed for the calculation on

step C4. By induction, and for k < B at the start of _C4, it can be proved

that

ti,-x v{+_wi+j + k < (B

-1) x (B

-1)+ (B

-1) + (B

-1) = B2 - 1< B2.

2.4 KARATSUBA's Algorithm

The KARATSUBA algorithminitsoriginalformis described in _[2,p.62]. A similar but more simplified method canbe foundin_{[15, 4.3.3]}andis adopted

in this paper.

Let u and v be two 2n-bit numbers _{(u2n-\,U2n-2i}5^0)2 and

(^2n-\-,V2n-2i1^0)2. Ifwe partition u and v intotwohalves, we can write u = TUX ₊ _U0, v = TVX ₊ _V0,

where _U\ = _(u2n-i,

,un)2 is the "most significant half of the num

ber u and _Uo = (wn-i,

,^0)2 the "least significant half; similarly V\ =

(v2n-\,>vn)2 and Vo = (vn-i,,^0)2- Now the product uv can be com puted as

uv = (22n

+2n)U1V1 +2n(Ux

-/70)(Vq

-V\)+(2n +1)U0V0. Thus theproblem of_{multiplying 2n-bit}numbersisreducedto threemultipli cations ofn-bit _{numbers, namely} _{UiV\, (U\} Uo)(Vq Vi), UqVq, plus some simple _shifting and _adding operations. _By _applying the multiplication for mula _{recursively to} evaluate the three products _(U\V\, _(U\

(15)

2.5 _{Fast Convolution} _Algorithms

2.5.1 _General _Idea

Thecrucial point inmultiple-precision multiplicationis itsclose_relationship

topolynomial multiplication. Theconventional positional notationfor inte

gers is essentially a polynomial-based representation. If u = (un-i,

, uo)b

is aradix B_integer, then u representsthevalue ofits associatedpolynomial

n-1 u(B) =

Y^

UiB*

0 < u, < B.

i=0

Let us consider an example. Let B = ₁₀ and u = _314, v 293. Then

u(B) =

3J32

+ IB+4= 300 + 10 + 4

v(B) = 2B2 _{+ 95 + 3}= _{200 + 90}₊3.

The integer product w = u v 92002 can also be obtained _by _multiply ing the two polynomials _u(B) and _v(B), _{getting w(B)} = _u(B)v(B), and

evaluating w(10).

w(B) = 64 ₊293 ₊ 2652_{+ 395 + 12}

u>(10) = 92002.

Thus, the critical problemis tofind fast algorithms tospeed _up the convo

lution ofdigits 2 of the twofactors u und v from whichthe result is recon

structed, calledfast convolutionagorithms. Fastconvolution algorithms are always based on someform offast Fouriertransform.

2.5.2 Discrete Fourier Transform

The Fourier transform requires the existence of a so-called primitive n-th

root _ofunity. The definition of a primitive n-th root of_unity is givenfor an arbitrary ring R, although thediscrete Fourier transform will be defined in a field Fin this section 3 .

2Seesection 2.2 for the_introductory definition of _{"convolution'',} and section 2.5.3 for

moredetails.

,3Forsimplicity both the discrete Fourier transform and the fast Fourier transform (section _2.5.4)willbe defined inafield F However,thedefinitionmaybe extended toan

(16)

Definition 2.1 (Primitive n-th root of_unity) Given a commutative

ring (R,_+,-,0,1), an element_un _ofR such that

1. un = ₁

2. u*

? 1, forl<k<n

is said to be aprimitive n-th root of unity.

Equivalently we_say _{that un} _isoforder n in R.

Example 2.1

1. In Z we have for n = _2,

U2 = 1.

2. In _Z5 we have for n = _4,_w4 = 2.

3. In C we have for an_arbitrary _{n, un}

-e2ln, i = /-!

The Fourier transform will be defined in an _arbitrary field F. Let u =

[u0,u1:..., un_i]T

be alength n _(column) vector with u,-G F.

Definition 2.2 (Discrete Fourier transform _(DFT)) Assume _{that u>n} is aprimitive n-th root _{of unity in} afieldF. Then the vector

FT(u) = [fQ,f1,...,fn_1]T

with

n-1

fi =

Y.

UJUn' fr { = _0. -n

~ 1 J=0

is called the discrete Fourier _{transform of}u.

Lemma 2.1 (Inverse discrete Fourier _transform) Let un be a primi

tive n-th root _{of unity in} afield F. Then the vector

FT-\u) = [f0J1,...Jn_1]T

with

n-1

fi = n_1

Y

uiun13-, for i = _0,...,n - 1 i=o

is called the inverse Fouriertransform _ofn.

The interested reader can find the proof that _FT~l(FT(u)) = u in [25]. Thefast Fourier transform _(FFT)is just an efficient method of_calculating

(17)

2.5.3 _Application ofthe Fourier Transform to Convolution

In 2.5.1 we _already mentioned that the conventional positional notationfor integers is _essentially a polynomial-based representation. Given a polyno

mialofdegree _(n-1)

n-1

p(x)

-Y

Ui2;t'

i=0

p(x) can be _uniquely represented in _{two ways,} either by alist ofits coeffi cients _uo,...,un_x or by alist ofits values at n distinct points xQ,...,xn-\. The process of_calculating the values at different pointsis called _evaluation, the inverse process of_finding thecoefficient representation of a polynomial given its values at points _x0>--j^n-i is called interpolation (fig. 2.1).

Figure 2.1: Evaluation andinterpolation of polynomials.

n coefficients of_p(x)

evaluation interpolation

values of_p{x) at ndistinct points

Assume zo,ii,xn-i are values atu>,u>^, ...,cj-1,withu>n being aprim

itive n-th root of_{unity, then} one can_easily seethat adiscrete Fourier trans

form can be viewed as atransformfrom the representation of a polynomial by its coefficients to therepresentation _by its values:

[l,^-1,^2,^'3,...,^"-1'] X [u0,Ul,W2,U3,...,-l]T =

0 + l(^) +

U2U)2

+ U3(uknf +...+ Un-!^)"-1.

Likewise, theinverse Fouriertransformisequivalent to_{interpolating}a_poly nomial given its values at the n-th roots of unity.

(18)

section 2.2. _Actually, _computing the convolutionoftwo vectors is the same

as _computing _the_product _of_two_polynomials. _Let

n-1 n-1

p(x) =

^^

uix'

and _q(x) =

^

VjXJ i=0

be two (n

-1) degree polynomials. The product is the (2n

-2) degree polynomial

n₁ /n 1

P(x)q(x) =

YsiYvw-ir1

t=0 S'=o '

with Ui = _Vi = _{0 if i} _< ₀ _or _i > n. Observe that the coefficients of the

product polynomial are _exactly the coordinates of the convolution of the

coefficient vectors [u0, and [vq,...,un_i]T, with C2-i = 0.

Since such a product polynomial can be _uniquely represented _by its values

at 2n points, a new method _{for multiplying} the n-coefficients polynomials

p(x) and _q(x) is possible. We evaluate both _p(x) and _q(x) at 2n selected

points, then multiply together the corresponding values ofthe polynomials at _{these points,} _forming 2n products. The polynomial that _uniquely fits

these 2n values is the desired product _p(x)q(x) (fig. _2.2) 4.

Figure 2.2: Principle ofthe convolution theorem.

In coefficients of p(x) and _q(x)

evalutation (FFT)

valuesof

p(x) and _q(x) at In points

conventional arithmetic

pairwisescalar

multiplication

-*- 2n _coefficients _of p{x)q(x)

interpolation

(FFT"1)

values of

p(x)q(x)

at 2n points

4The figure illustrates theconvolution theorem _(2.1) where thefirst n coefficients of_p

(19)

Theorem 2.1 (Convolution _Theorem)

Let u = [uo,ui,...,un_a,0,...,0]T

and v = [v0,_vu. . _.,u-i,0,. . be column vectors _oflength 2n. Then

uv =

FT-x(FT(u) FT(v)) [2, p.255].

Since the convolution oftwo vectors oflength n is a vector oflength _2n, a "padding"

with zerosis required to achieve two 2n-length vectors. We can

avoid this "padding" with zerosif we use theso-called "wrapped" convolu

tion.

Definition 2.2 (Negative wrapped _convolution)

Let u = _[uq,

Ui,. . and v = [o,ui,. . _., un-i]T

be column vectors

of length n. The negative wrapped convolution _of u and v is the vector d = _[do,_d\,. . .

,

dn-\]T

, where

i n 1

di =

YU3Vi-l

~

Y

ujvn+i-j-j=0 j=i+l

The negative wrapped convolution is used _by the

SCHONHAGE-STRASSEN algorithm (section 2.5.8 and 2.5.9).

Theorem 2.2 (Convolution theorem for negative wrapped _convolution)

Let u = [uq, u\, . .

.,

un-i]T

and v = [vo,

Vi,. . .

,

vn-i]T

be column vectors _of

length n. Assume ip2

= un. Furthermore let d = [do,di,. . . be the

negative wrapped convolution _ofu and v and

u

-[u0,i)Ui,...,i)n~lun-i\T

v =

[u0,V^i,---ffln-V-i]T-Then

d= FT-\FT(u)- _FT(v)).

An outline of the proof can be found in [2, p._257], the details are in [17,

(20)

2.5.4 Fast _Fourier _Transform

The discrete Fourier transform of a length n vector u can be computed in

0(n ₎ time if we assume that arithmetic operations on _arbitrary elements of a vector u require one _step each. On the other _hand, for n

-2k

the fast Fourier transform achieves 0(n_log_n) arithmetic operations, which _asymp

totically is the best known method.

In section 2.5.2 we called the vector _FT(u) = [/0,/i,. . . ,

/n-i]T

with _/, =

Hj=oujwn tflediscrete Fouriertransformof u. Weobservedinsection2.5.3 that _computing the i-th component _/; ofthe discrete Fourier transform is equivalent to _evaluating the polynomial _p(u{n),

PK) = _u0₊

uiui + u2u2ni

+ ...+un_iJnn-^.

To attain a fast algorithm for _performing the Fourier transform we _apply a divide and conquer method and split the problem into two _subproblems, that is we replace the problem of_evaluating an n-th degree polynomial _by the problem of evaluationtwo ^-th degree polynomials [18]. We write_p(u*n) as the sum ofits

odd-and even-power terms:

PK) = _(o₊

2W2i +

---+ Un_2o;in-2)1) ---+(Ul^+ +

---+ nn_14n-1)t) and get PK) = Ejt1 "rf + EJU1 =

S(U2J)

Wnt("2n%

(*)

where s and t are polynomials ofdegree |. _By _using the next two lemmas this equation can be simplified significantly.

Lemma 2.1 Let n be even and_un be a primitive n-th root _of unity. Then

uj2

is a primitive \-th root _{of unity}

Proof. We will prove that _{corresponding} todefinition 2.1 u2

is aprimitive

^-th root of unity. Since (ui2)* = w

= _1,ui2

is a |-th root of unity. But

(u2)3

^ 1 for 0 < _j < _|, otherwise we would have u>k

= ₁ _with ₀ _< _k _<

2\

in"contradiction to u)n being a primitive n-th root of unity. Thus u2 is a

(21)

Figure 2.3: Repeated _splitting of a polynomialin its even and odd parts.

splitting

Lemma 2.2 Let n be _even, letu>n be a primitive n-th root _ofunity. Then

n

k+% k

forO<k<

Proof. First we note that

*+f k _f

But (cJtI)2 =u>

= _1, and (u2 )2 = 1 in afieldimplies u>n = 1. Since _wn is

n n

aprimitive n-th rootof_unity and 1 <

f

< _n,_{uZ -fi} 1. Hence Url = -1 and

wn 2 =

u;-w^ = -Jl.

This is the desired result. ?

Applying lemma 2.2to _(*) wehave

p(Jn) = _siu*) _+Jnt(^)

p(un+>) =

(22)

since

Jn

* =

-uln and (w+?)2 = (-Jn)2 =J%. The problem of_evaluating the polynomial

p(uixn) at n points reduces to the problem of_evaluating two polynomials s andt at _| points and _combining theresult withthe cost of_\ scalar multiplications and n additions. For n = 2k therepeated application

of the _splitting algorithm reduces after _log2n halvings to n 1-point trans forms, which are trivial and of no cost at all (figure 2.3). To analyze the arithmetical _complexity ofthe fast Fourier transform let _T(n) be the time

required to perform an n-point Fourier transform _by the above method. Then

T(n) =

2T(f) +n(t+

tf)

=

2T(f) +cn

where_iis the timerequiredfora_{multiplication,} and_t isthe timerequired for an addition. Since _T(l) = _0, _the recurrence can be solved to

T(n) = c n _log2 n

= 0(n _log n).

The complete fast Fourier transformagorithm is given below [19, p.298].

Algorithm FFT (FastFourier transform _(FFT))

Input.

Primitive n-th root of_unity _un, for some n = 2k.

Polynomial a(x) =

Y17=o a'x'i giyen _by its coefficients ai.

Output. Vector_FT(a) = [f0,. . .

, fn_i}T

where /_,- =

J2?=o

i<-Method. FFT procedure (fig. 2.4).

To compute the inverse fast Fourier transform we _modify the procedure

slightly by replacing u> by

u~k

and _dividing _/, _by n.

Most scientific applications use an iterative version of the FFT procedure because of its slightly enhanced speed. Furthermore, _by _using the origi

nal data space to store the _temporary results ofthe _combining process the

storage requirements are kept within linear bound. _{That way} the original

(23)

Figure 2.4: Recursive FFT procedure

procedure _FFT(n, a(x), u>n, fa[])

if (n= ₁ ₎ _then do begin /* end _ofrecursion */ fa = _a0

end

else do begin /*split in even and odd terms */

<X) =

E?=0 fl2i+l^

/*

recursive calls */

FFT(n, b(x), Jn, fb[]) FFT(n, c(x), Ji, fc[])

/*

combine */

for (k ₀₎ until(n ₁₎ do begin

fa = _fbk ₊

Jn

X _fCk

fak+n = _fbk ~

Un X _fCk

end

(24)

computation andthe_subsequently required_unscramblingoftheoutputdata

for n = _{8. At}

each level of computation _{the array}of n data is overwritten

by the new results. _Note, it is not _necessary that the unscrambling is per

formed at the end ofthe FFT procedure. _Analogously it can be applied to

the input data [10]. Since in _many applications the forward transform is

followed _by the inverse_transform, theappropriate choice ofFFT algorithm

may effectively avoid the_reordering ofthedata _[34] [9].

Figure 2.5: In-place computation and _unscrambling oftheoutput data.

scrambled unscrambled

output _array output _array

o """* 'o

i

fflffl

combining

2.5.5 Mod m Fast Fourier Transform

In section 2.5.2 the discrete Fourier transform was defined in an _arbitrary

field F. If we use the field of the complex numbers some disadvantages

have to be taken into consideration: Complex numbers must be approxi

mated by finite precision _{numbers, thus} introducing round-off errors. Both

multiplication operation in general and the calculation of powers _u'n are

time-consuming, so often the _uln are calculated once and then stored in a

(25)

of integers with artihmetic carried out modulo an integer m _instead, and

performing an FFT-like _transform, called mod m fast Fourier transform or number _{theoretic transform} _(NTT) [23]. _Employing a number theoretic transform makes it possible toget rid of round-off_{errors, to} replace multi

plications _by shifts and _additions, and to avoid the storage ofthe complex

values _uffl _Naturally this requires the choice of appropiate values for the

parameters _n, modulus m and n-th root of_unity un.

Note, that _Zm is a _{ring in general,} but it is also a field for m prime. The

Fourier transform and its inverse were shown to be valid both in the field

and in _{the ring} _Zm [25]. The conditions _concerning the primitive n-th root of_unity un (definition _2.1) in _Zm become equivalent to the congruences

w"

=

l(mod-'), for all _i, 1 < i < I _(2.4)

u>n ^ l(modpi), for all _k, 1 < k < _n, and some _i,1 < i < I _(2.5)

with

m =p[i-p?-...-pTl'

[16]. _(2.6)

Next, the _necessary and sufficient condition for the existence of a number theoretic transform oflength n will be derived. _Thereby, EULER's totient

function _{<p(m) is} the number ofintegers in _Zm that are _relatively prime to

m.

Theorem 2.1 (EULER's _theorem) _{For every} a and m such that the

greatest common divisor ofa and m is _1, denoted gcd(a,m) = _1,

av(m) _{mod m} = 1 [12, p.42].

For rn a_prime,_y(m) = m-_1. _If_m_is _composite_and_has _prime _{factorization}

m =PJ1 pT22 . . .

pY, then <p(m) is given by

<r(rn) =

l[pr1(pl-l)

[12,P-41]. _(2.7) t'=i

EULER's theorem implies that a _only can bea primitive n-th root of_unity

y(m) .

^

if n divides <>(m), denoted n\tp(m), since (a n )n mod m = av(Tn> _mod_m.

By applying the same line of_reasoning toequation _2.4, un is an n-th root of_{unity only} if_n|v?(ptr'), with _</>(ptr') =

p?'1

(pi

-1) (equation _2.7), i.e.

(26)

Furthermore, since not _{only the} field _Zm but also the_ring _Zm is taken into

condideration it must be ensured for the inverse Fourier transform that n

has aninverseelement _in.Zm. Hence n must be _relatively prime to the _p,'s,

and we obtain

n\(pi

-1), for all i.

Finally, we definethe _necessary and sufficient condition for theexistence of

number theoretic transforms oflength n. Theorem 2.2 states _exactly what the possible transformlengths for agivenmodulus are.

Theorem 2.2 (Existence of n-th Degree _NTT) Let Zm be a ring,_of

integers modulo _m, m = pJ1 pT22 . . .

p\l

. A number theoretic transform (NTT) oflength n exists in _Zm _ifand _only _if

n\gcd((pi-l),(P2-l),...,(Pl-l)) [16}{1).

We are interested especially in NTT's which have fast implementations.

There are threerequirements for fast NTT's which haveto be considered:

1. n = 2k. This is an obvious consequence ofthe algorithmforthe FFT.

2. Multiplication _by powers ofun should be simple and fast, _preferably

shifts, which can be achieved if_u>n is a powerof2.

3. Arithmetic mod m should be fast.

The _following _corollary to the theorem 2.2 specifies the conditions for the

existenceoflength n number theoretic _transforms,for n= _2k, thus_meeting

requirement 1.

Corollary 2.1 (Existence of n = 2k _NTT's) Let _Zm be a _{ring of inte}

gers modulo _m, m = pj1 -pj2

. . . p]'

The length n 2k NTTexists in _Zm

ifand _only_ifit ispossible to write allprimes

p,-(i = _{1, 2,} .. .

,I) in theform

Pi = 9i2h' + 1,

where gi is an odd number and _hi > n.

Proof. It follows from theorem 2.2 that primes _p: (i =

1,2,...,/) must be greater than two for the length n > 1 transform to exist. _Every prime

number that is greaterthan two can always be uniquely written as

Pi = gi2hi

(27)

where ft,- > ₁ and _#, an odd number. The greatest common divisor of

numbers _p; - 1 (i ₌

1, 2,,/) can be written as

(p1-l,p2-l,...,p,-l) =

c-2d

where c is thegreatest common divisorofnumbers _gi, and d is the smallest

of the exponents ft,. It then follows from theorem 2.2 that length n = 2k

NTT's exist _exactly when 2k divides _2d, which is true if all exponents

ft,-conform to ft,- _>

n [16].D

Let us call primes oftheform p, gi2h'

+ 1 Fourierprimes. Examples of

Fourier primes are someofthe Fermat numbers Ft, where

Ft =

22' + _1,

which are used _by a number theoretic _transform, called Fermat number

transform. SCHONGAGEand STRASSEN adopted in theirmultiplication

algorithm, version _1971, theFermat number_transform, whereas the version

from 1982 works with moduli which are composed of Fourier primes [32]

[28].

Originally Fermat conjectured that all Fermat numbers are _primes, but it

seems that _only _Fo through _F4 are prime and all others are composite. The

first seven values are:

F0 = 3

Fi = 5

F2 = ₁₇

F3 = ₂₅₇

F4 = 65537

F5 = _{4 294 967 297}= 641 x6 700 417

F6 = 274 177 X 67 280 421 310 721 ~ _1.84

X 1019[1].

Eachprime factorof a compositeFermat number_Ft is oftheformk2t+2

+ 1

[1]. To _simplify matters we will concentrate on Fo through F4 which are

prime andhavea maximumlength ntransformof_max(n) =

<p(Ft) = _Ft 1.

max(n) canbeachieved _by _{choosing n,}a primitive n-th root of_unity,tobe

3,-but requirement 2 demands that un is a power oftwo. Now, for u> = 2

we get n = _2t+1, since w>odft =

(2)2'+1

mod _Ft = (22')2 mod_Ft =

(-1)2

(28)

analogous way. In the case of composite numbers _Ft we also get n = 2t+1

for un = 2 [1].

In requirement 3 we asked for simple arithmetic modular operations. The

conditions to attain fast operations are not _very restrictive and not _only Fermat numbers fulfill them. For an _arbitrary number m = xl

+ 1, if we

write a number a in radix xl

notation as a sequence ofb blocks of/ digits, then amodm can be calculated _by _alternately _adding and _subtracting the

b blocks of/ digits.

Lemma 2.1 (Fast _computingof _{modulo) Let} m = xl

+ 1 and let a = Ei=o aix''> where 0 < ai < xl

foreach i. Then

6-1

a =

Yiai(~^y

mo(*

m-1=0

Proof. Observe that xl

= -1 mod m [2].D

Example 2.2

Let x = _2,1 = 2 and m = 22₊ 1. Consider a = _(101100)2 = ₍₂₃₀₎₄ = _44.

Here ao = _0, _ai = _3, and a2 = _2. We compute _ao _ai ₊ _a2 = ₁ and then

add m tofind that a= ₄ _mod_5.

2.5.6 Modular Arithmetic

Modular arithmetic isused indifferent fastconvolution algorithms. Theun

derlying idea is tohaveseveral "moduli" _{mi, m2,} . . .

,mr which are pairwise

relatively prime, and instead of _doing calculations _directly with an integer

u (0 < u _{< mi} _m2 . . . mT) we work with

"residues"

umod m,-. Let us

denote

ui = u mod _mi, _u2 u mod _m2. . .

It is trivial to compute _{u\, n2,}. . .

,uT andit is important to note that there is no information lost at all by _performing this computation. The Chi

nese remainder _{theorem states, that} we can recompute_u, given its residues

Ul,...,UT.

Theorem 2.1 (Chinese remainder _theorem) Let _mi,m2,...,mr be

positive integers that are _relativelyprime in pairs, _i.e.,

(29)

Letm =

mi m2 . . . mr andletui,...,uT beintegers. Then there is

exactly one integeru that satisfies the conditions

0<u<m,and u =

uj (mod m^) for 1 <j < r[15,p.270].

Proof. For each _j, 1 < _j _{< r,} _gcd(mj,_^-) = _{1. Therefore} _there exists

aninverse element _yj such that _(^-) _yj mod _m3 = _{1. Since}

mj and ^- are

relatively prime,by EULER's theorem (theorem _2.1) we_may take

y. ₌

(-I^KM. mj

Furthermore,(^-)yjmod m; = ₀for_i

^ j because m,is afactor of(%*-).

Let

u=

j

^(

)yjUj I mod m. _(2.8)

Then u is a solution of umod _mj =

Uj (1 < j < _{r) because}

A

(m\

u mod _mj = I

J yjUj mod_mj =

Uj

[15, p.270] [19, p.254][12,p.47] . ?

Unfortunately,calculating u_{corresponding}toformulae2.8requires boththe

evaluation ofEULER's (^-function and the_handling of quitelargenumbers,

since _%-yjitj could be _verylarge. In KNUTH _[15, _p.274] and LIPSON _[19, p._254] thedescriptionofasuperior algorithmcanbe found. Weinclude such

an algorithm for the special case of r = _2, since _{that way} it is employed _by

the algorithmof SCHONHAGE and STRASSEN [32].

Let ui = u mod_mi, _n2 = u mod_m2and_{gcd(mi, m2)} = _1. Weare_looking

for the solutions to

u =

Ui + a _mi _(2.9)

u =

u2 + /3-m2 _(2.10)

Note that ui+ami =

n2 mod _m2 ifami =

w2-ui mod m2. Since _(mi,m2) are _relatively prime we have a inverse element

mf1

with

mimj"1

(30)

m2. _Thus, ifwe choose a = _(u2

i)

mj"1

mod _{m2 then}

u = _ui ₊

ami

= _ui ₊ (u2 _ui)m{"1mi mod m2

=

ui +(u2

-ui)lmodm2

= _u2 mod _m2,

so that

u = _ui ₊

ami = _Ui ₊[((u2

-Ui) mj"1) mod _m2] _mi _(2-11)

satisfies equations 2.9 and 2.10 _[19, p.254].

Example 2.3

mi = _7,

m2 = _9,_Ui = _6,_u2 = _3, _then

u = 6 (mod 7)

u = ₃ _(mod 9)

and

mj"1

mod9 = 7-1 mod9 = _4, since 4-7 = 1 mod 9. _Finally u =

6+ [(3

-6) 4mod_9] 7 = 48 [19, _p.255].

Equation 2.11 is used _by the SCHONHAGE-STRASSEN algorithm [32].

TheChinese remaindertheorem isalso usedin afast convolution _algorithm,

called "three prime

algorithm"

(see also section 2.5.7). Here three small

primes _p,,where thep,'s are smallerthan theword size of_{the computer,} are

chosentoperformthefast Fouriertransformandits inverse in _ZPi. Single- or

double precision operations can be _used,_{resulting in fast Fourier} transform

procedures, but the resulting coefficients have to be reconstructed _by the

Chinese remainder theorem.

2.5.7 _Ordinary Fast Convolution Algorithm

Simple fast convolution algorithms work _analogously to the principles de

veloped in section 2.5.3. There are two additional problems _involved, which

we will discuss now.

The first problem is how to find "admissible"

triples _{(wnfl,rifl,m)} 5 , such

that ns has a multiplicative inverse and _unB is a primitive ng-th root of

5Recall the _terminology introducedin section 2.2. We have todistinguish betweenni

(31)

umty in the _ring of integers modulo _m, so we can _apply the fast Fourier

transformand theconvolutiontheorem _[2,p.274]. Wewantto set aside this

question for a moment.

The second problem is how to split two large n2-bit numbers u and v into

coefficients, needed in the polynomial-based representation and multiplica tion algorithms. Imagine a large n2-bit integer _u, where _{corresponding} to

the convolution theorem (theorem 2.1)the

^

high order bits are zeros. We

may think of_it, for _example, as ahuge _string ofbitswhich hasto be stored

and handled in some proper form. _Now, an obvious solution is to split the

number into _ng blocks of _bits, each block _fitting the size of a byte or a

computer word. Furthermore we would like to define the blocks to be our

coefficients ofthe polynomial u(B). Fig. 2.6 shows an extension offig. 2.2

including the processof_splittingtwo n2-bit numbers u and vinto ub blocks

of / _bits, where _n2 = _ns _I, and _{reconstructing} the _resulting _long integer

product from the single coefficients. Note that the dashed fine indicates a logical relation which is not a part ofthe calculation procedure. We

per-Figure 2.6: Scheme of_ordinary convolution.

split u and v

into

ub blocks of/ bits

.

evalutate _u(x)v(x) at radix B

(B=2<) i

ns coefficients of

u(x) and _v(x)

|. _h

conventional arithmetic

ub coefficients of u(x)v(x)

evalutation

(FFT)

interpolation

(FFT-1)

values of u(x) and _v(x)

at ub points

\

values of u(x)v(x) at ns points pairwise scalar

(32)

form the _Fourier _{transform in}

Zm, and the bit-size of the blocks has to be at least as large as [log2(m ₊ 1)]. _Furthermore, m should be large enough

to preserve the elements

c,-oftheconvolution. Since

ng₁

ng1

u =

u,-2'*', v=

Vi2li, (2.12)

i'=0 ,=0

where 0 <

u,-,u,-< 2l for 0 < i <

*f

and

u,-= _v{ = 0 for i > ^f, _the elements

c,-ofthe convolution are

ns-i

c =

YI

uivi-ii 0 < i < ns _(2.13)

j=o

(u,- = u,- ₌

0 if i < 0 or i> _{^-). From}_this _expression _{it is} obvious that for

0 _{< Ui,Vi} < 2'

Thus

0<c,<^.22'. _(2.14)

m > ^- 221

> _max(ci) _(2.15)

is the _resulting condition for m. Fig. 2.7 illustrates the principal structure

of a coefficient block. Itwould beattractivetohavecoefficient blocks which

Figure 2.7: Scheme ofcoefficient block

I bits of original number

0 0 0

(21 + _log₂ n _g₎ bits

optional extension

are smaller or equal to the word size ofa computer (or _eventually halfthe

word _size),since then _only single ordouble precision operations are needed

to perform the fast Fourier transform and its inverse. But then welimit m

to m < word size and_{unfortunately,} as a_{result, the} permitted sizefor large

(33)

Example 2.4

Take m = ₁₂₇ 224 ₊

1, where m is thelargest Fourier prime less than 231

[19, p.306]. _{Then max(ns)} 224

(section _2.5.5) and since ns is a power of twowe can _apply afast Fourier transformalgorithm. _Using m > ^22' and n2 ^l 6 the_following examples reflect therestriction on thesize oflarge

numbers.

n2 / nB m > 208 13 24 230 2 816 11 28 230 36 864 9 212 230 458 752 7 216 230

5 106

5 220 230

(Note, the restriction on the size ofthe numbers is more _{annoying for} m <

216).

For larger sized numbers weeither have torelax the restriction on the size

of _m, _allowing m > word size or we have to choose a different approach

to the problem. In _[25] _[19, p._310] _[4] the Chinese remainder theorem is used _by the so-called "three prime

algorithm"

toachieve afast convolution

algorithm which works for still larger numbers although the m,'s are still

less or equal to the word size (see also section 2.5.6).

Thefirst_problem, _findingadmissible triples _(wB,_ng,_{m) is}not _trivial, there is no systematic _way of_determining "best" choices [1]. One must use in tuition, insight and some searching. Often m is selected and the _resulting

possible ns and_unfl are then examined. Examples for unrestricted sizes of

m are given in [20] [28].

Thefinal illustration in Fig. 2.8 uses atriple _(wnfl,nB,_m) = (220,_8,₂₈₀₊₁₎ [28]. A large number u with 256 _bits, the 128 _leading bits filled with _zeros, is subdivided into 8 blocks of 32 bits each

(^

= 32). Each block of 32 bits is embedded into a81-bit _string, as a number in Z2so+1. Then the fast

convolution algorithm can beperformed as indicated in Fig. 2.6.

2.5.8 SCHONHAGE-STRASSEN's Algorithm

(1971)

Now weturn toa more sophisticatedfast convolution _{algorithm, namely} the

SCHONHAGE-STRASSEN multiplication algorithm in its original version

6

(34)

Figure 2.8: Division of a large number into blocks of bits (ordinary fast

convolution algorithm).

izp.

0

ZJD

i i

-J 32

1 0

1o

|o

|

0 nl

....! 0^

oU

-81

-n i

r^

from 1971 _[32] [2]. This algorithm achieves 0(n _log n _{log log} _n) bit com

plexity and is the _{asymptotically} fastest known multiplication algorithm.

Similarly to the_ordinaryconvolution algorithminsection_2.5.7,we split two n2-bit numbers u and v into n# blocks of I bits. To _simplify _matters, we

restrict _{n2 to}be a power of_two,

n2 = 2* _(2.16)

For _n2 _^ 2k

weadd an appropriate number of_leading _zeros, thus_increasing

the constant factor of the bit complexity. _Furthermore, we calculate the

product of two n2-bit integers u and v mod (2"2 + 1). To obtain the exact

product _uv, wemust add_leadingO'sand_multiply theincreased 2n2-bit inte

gers u and v mod(22"2 + 1), again_increasing thecomplexity by a constant

factor. Inthe_followinglet u and vbe n2-bitintegerswhich we_multiply mod

(2"2 ₊ _1). _Observe,that2n2

requires n2+lbits for its_binary representation. For u = 2n2 or v = 22 the multiplication reduces toa special simple case:

1) u = 2"2, _v^2n\

2) u^2"2, v = _2n7

3) u = _2n2, u = _2n2,

uv= (2712 + 1

-v) (mod2"2 + ₁₎ or

uv= (2"2 + 1

-u) (mod 2"2 + ₁₎ or

(35)

Otherwise the n2-bit numbers u and v are split into ns blocks

of / _bits,

ns = _22,

nB = ₂ _2,

/ = _2,

/ = _2^,

if fc is even,

if k is _odd,

if k is even,

ifk_{is odd,}

with ri2 = _tib 1. We _write _{u and v} _in _radix _B 2l

u = u^^-^'

+ .-.+ u^'

+ uo

u = ns_12<nB-1)'

+ . . . + tM2'

+ t;0.

Further parameters ofthe admissible triple _(unB,nB,m) (section _2.5.7) are

41

wB = 2ns _and

m =

22' + 1.

Table 2.1 summarizes the parameters. Theorem 2.2 _{states, that} a number

theoretic transform oflength ns exits in _Zm_, for m =p^1

p2 . . . p[', if

and _{only if}

nB\ gcd((px

-l),(p2-l), 1)). _(2.17)

It can_easily be shown that the parametersin table2.1 meet this condition.

We will do it for k _even, k odd is analogous. _First, observe that the m;'s

areFermat numbers _Ft, with

Ft = 22'

+ 1.

Table 2.1: Parameters for SCHONHAGE-STRASSEN _algorithm, version

1971

/ nB _Wns m

k: even

k

22 2* 24

k2

2(2"^) _{+ i}

k: odd

fc+1 2 2

fc-i

2T 28

k+3

(36)

We must distinguish between t < _4, where _Ft is prime, and t > 4 where Ft

is composed of primes ofthe form _p = k2t+2 ₊ ₁ (section 2.5.5). For t < 4 and _Ft prime equation 2.17 reduces to the condition nB\(Ft

-1) which is

true for _ub = 22 _and _Ft = m = 22 ₊ _1, since

\

< 2^. For t > 4 we have _nB|(2t+2) or (2^)1(2^ +₂₎ since < + 3.

So _far, given the parameters oftable _2.1, the convolution theorem can be applied. _Furthermore, the three requirements for a fast number theoretic

transform (section _2.5.5) are realized: Since nB is a power of two a fast

FFT like algorithm can be used. The multiplications by powers of_wfl are

replaced _by simple shifts since _unfi is a power of_two, and we can build a

fast arithmetic package for m = 22'_{+ 1} _{by doing} the calculations for blocks of radix 22'.

Theproduct uv isequal to the convolution ofthecoefficients ofu and v and

is given _by

uv=

u2s-22(2"*-2>'

+. . . + _ui

2l

+yo (2-18)

where

Vl =

Y,

U3vi-J, ^l< 2ns- (2-19) 3=0

(For_j < 0orj > nB-l, Uj =

Vj = 0. _2/2B-i = 0 is includedfor_symmetry

only.) Applying the convolution theorem to compute the product uv (fig.

2.2), the pairwise multiplication ofthe Fourier transformsof u and v would require _2nB multiplications. We can reduce the number ofmultiplications to ub by usingthenegative wrapped convolution (definition 2.2). Rewriting equation 2.18 gives

uv = +...

ynB2^1 + yn,-^*-1* + yu^*^1 + ...+ * = (y2nB-22^-2)l + 2nBl + (ynB-i2(ni'-1)' + ynB-22(n*-2)< + ...+ ,).

Since _nB / =

n2 wehave

2ns' ₌

-1 mod(2"2 + 1), and _by lemma 2.1 (fast

computing of_modulo) we get

, = ₊ .

..ynB) +

(ynB-i2("fl-1);

+ + Ito) (mod2"2 + 1).

=

(wnB-^nB-x)l

+ +

;i2'

+ wQ) (mod2"2 + 1),

with Wi = _yi-ynB+i, 0<i<nB,

(37)

where the u>,'s are the coordinates ofthenegative wrapped convolution.

There is _{one nuisance more which adds some clumsiness} _{to this} _multiplica

tion algorithm. Recall that m hasto be atleast as large as theelements of

the _{convolution, u>,,}

m > _max(wi) _(section _2.5.7).

Theproduct oftwo/-bit numbersisless than22'

and yi and _ynB+i are sums

ofi+1and _nB-(i+l) such_products, respectively. Thereforeu;,- ₌

yi-yUB+i is in the range of

-(nB - 1

-i)221

<Wi<(i+ 1)22', (2.20)

and _Wi can have at most _nB - 22'

different values. On the other _hand,

m _{< nB} 22 _, since m = 22' ₊ _1, _and _{we can't} _recover _the _precise _integer

product of u and vjust_by _applyingtheconvolution theoremonce. Ifwe use

the Chinese remainder theorem (theorem _2.1) and compute the u>;'s _twice,

once modulo 22'+ 1 and once modulo _nB,we can recover all possible nB22/

values ofu>,-. Take

w[ = _Wi mod ns

w'{ = _w{ mod (22'_{+ 1).}

Because _nB is a power of two and (22' ₊ ₁₎ is odd, _nB and (22' + ₁₎ are

relatively prime, and we can make use of the Chinese remainder _theorem,

namely by equation 2.11 wehave

w{ =

w'l + [((w'i

-<)

(22'

+ _I)"1)mod _nB] (221 ₊1). _(2.21)

Since nB is a power of2 and nB < 22', nB|(22/) and (22'+l)_1 mod_nB = _1.

Thus

Wi =

w'{ + [(w'i

-w'f) mod_nB]

(22'

+ 1). (2.22)

Fig. 2.9 illustrates the basic structureofthe algorithm. For _{simplicity, the}

recursive characterofthe algorithm is not illustrated here.

The product uv can be computed from the tffls _by _adding _{the nB} u;,'s

together with appropriate shifts. To calculate the u>,'s mod nB(22' + ₁₎

we compute the tffls _twice, once mod nB and once mod 22' ₊ 1. Then

we compute _{the u>,} modnB(22'

+ _{1) by} employing the Chinese remainder

(38)

Figure 2.9: Structure ofSCHONHAGE-STRASSEN algorithm

r>2bit numbers u, v

Split into n_g blocks of I bits

get coefficients of u(2' ), v(2')

Calculate elements of negative wrapped

convolution _Wj mod ng(2^'

+1)

Calculate:

w'i=

W; mod (221 ₊₁₎

Calculate:

Wj'= _w

j mod ng

Apply Chinese remainder theorem

to _Wj'and _{wj ",}

get _Wj mod nB(221

+1)

(39)

We _already developed the method _{for calculating} _u>", _namely the negative

wrapped convolution _by means ofthefast Fourier transformfor the

param-21 eters oftable 2.1. With xj) = wnB

, ^ is a 2nB-th root of_unity, and so the

negative wrapped convolution of

[uo,i>ux,...^nB-lunB_i]

and _[u0,#,-,-_ffl"5-1^-!]

is

K,tpui,...,f"1^.!] mod (22'

+ 1).

The w"

= _Wi _mod(22' ₊

1) can be obtained _{by dividing by} ip*

which are

simple shift operations since _ip is apower oftwo.

The methodto calculate Wi modnB is different. Take

u[ = _Ui mod_nB

v\ = _Vi mod nB.

The convolution elements _y\, where

nB \

y\ =

Y

u'iv'i-3' 0 < i < _2nB

-1, 0 <

u;,u,-< nB

!=0

and

y;<nB-221og2ns

require at most _31og2_nB bits for their representation. We form twolonger

numbers u and v as shown in fig. 2.10 with

u=

Y

<231og2nfl, v=

Y

u,'231oS2nfl

'=0 i=0

andget the product uv_by _{employing any}multiphcation algorithm for large

numbers. _Note, _{that nB} is much smaller than 22'. From the result _uv,

2nB-l

uv=

Y

_y'i2^^n^1,

!=0

we can _easily pick out the _y\ and calculate _w\ = (y'{

-y'nB+i) mod nB.

(40)

Figure 2.10: Illustration ofthe numbers _u, v and their product uv, used in

computation of

u;,-mod_nB

"B-1 u\

u1

LM. io_oiAjo-olIi

^/

l

^

2log2nB log2nB

nd-1

I0--0

1 u

Ll[0-0ll|

y 2nB-1 _y _i _y'o

1

-"-T3loq,nQ

r-"-Algorithm SI (SCHONHAGE-STRASSEN integer multiplication algo

rithm _(1971))

Input. n2-bit integers u and _v, where _n2 = _2k.

Output, uvmod (2"2 + 1).

Method. If _n2 _{is small, multiply} u and v modulo (2"2 ₊ ₁₎ _using an _ap

propriate fast algorithm. For large _n2 use this algorithm with the

parameters _nB, _/, _wB and m of table 2.1. Split u and v into _nB

blocks of/ bits and express

no-l ns-l

u =

Y

u'2'''

v=Yvi2H

for 0< u,-,u,<2'.

t=0 i=0

1. Compute the Fouriertransform modulo m = 22'₊ ₁ _of

[u0,tpui,...,^nB-1unfl_i]T, [vo,rpvu...,

VB~lvnB_i]T

(41)

2. Compute the coordinatewise product ofthe Fourier transforms of

step 1,mod _m, _by_using algorithmSi recursively. (Thesituation in which afactoris 22'

is handled as a special _{easy case,} see page

28.)

3. Compute theinverse Fourier transformofthevectorof_step _2,mod

m,whichis _[w0,_i/>wu...,VB~lwnB-i]T

mod(22/₊1). Compute

w'f = _w{ _mod (22/₊

1) bymultiplying ip{Wi by i>~{ mod (22/_{+ 1).}

4. _{Compute w\} _W{mod nB as follows:

Compute u\ = _m _mod

nB, v- =

u,-mod _nB,for 0 < i < nB. Construct u =

Y!iio~l

u[-2(31o82"b)?{) = Y^iQ~lv^3lo^nB)i

bystringingtheu and_v\togetherwith2_log2nB _intervening

O's (fig. 2.10).

Computeuv =

E?=o_1u^3'"52"*')'

usingan appropriatefast multiplication algorithm.

Compute w'{ = (y\

-y'nB+i) mod_nB,for 0 < i < nB. 5. Compute

u;,-=

wf+ [(m*

-w'i) mod _nB] (22'+ 1). 6. Compute E"fo_1

wi2''

mod(2"2 + 1). This is the desired result.

Steps 1 through 3 compute w"

= _Wi mod(22' ₊ _1), _step 4 computes w[ = _Wi mod _nB, step 5 recovers

w,- _from u;,- _and w"

by using the Chinese remainder theorem and the last _step evaluates the product mod (22 + ₁₎ with radix 2'.

Example 2.5

For_simplicity we_apply algorithm Si _only once anddo not exploit itsrecur

sive character. Let

u = (1011011)2= _(91)io, v = (10001111)2 = (143)i0.

To get the exact result uv we must add leading zeros and compute uvmod (216

+ 1). The parameters are

n2 = _16, _nB = _4, 1 =_4, _w4 = _16, m = 28

+ 1 = _257,

and we split u and v into 4 blocks of4 bits each:

u = 0000 0000 0101 1011

(42)

In positional notation we have

u = 0-212 ₊ 0-28 ₊ 5-24

+ 11 = [11,5,0,0]T,

v = 0-212 ₊ 0-28

+ 8-24 + 15 = _[15,8,0,_0]T.

Step 1 With tj> = 4 (if)is _the_8-th _{root of}

unity) wehave

ity = [1-11,4-5,0,0]T= _{[11,20,0,0]T,}

tty = [1-15,4-8,0,0]T= _{[15,32,0,0]T.}

To calculate the Fourier transformmodulo 257 we use

u

= _1,

u\ = _16,

u\ =

-1, _ul = -16

and obtain

FT(u^) = [31,74,248,205]Tmod257

FT(v^) = _[47,_{13, 240,}17]T_mod _257.

Step 2

FT(u^) FT(v^) = [31 _47,₇₄ _13,₂₄₈

-240,205

17]T

mod 257

= [172,191, 153, 144]T

mod257.

Step 3 The inverse transform requires multiplication _by

u~B and _by rig1.

But _u~B = <^>nB~'

mod m because uj'nBu^-'

= _\ _mod _m _and n^1

=

24'_p_{mod m}

, for nB = 2P. Since u~B and

nsx

are powers of 2 the

multiplication reduces to simple shifts. We use

u4 = _1,

tffl1 = u34 = -16, uj2=u2 = _-1, uf =

u\ = _16,

V> = _1,

i>-1

= i,7 = _21\ V_2 = V6 = _212, ^-3 = ^5 =

210

and n^1

= 214 and have FT~l(FT(u^) FT(v^)) = [165,_{138, 126,}0]T

u>0'

= 165 ffl0 = 165 mod257

w'{ = 138 ffl-1 = 163 mod257

w'{ = 126 ffl-2 = ₄₀ _mod₂₅₇

(43)

Step 5

Step 4

u0 = _3,

ui = _1,

u',: = 0 if t > _1, vQ = _3,

v{ = ₀ _if _t _> ₀

u = ₀ ₁₀₀₀₀

11, v = ₀ ₀₀₀₀₀ ₁₁

uv = ₀ _{00 00} ₁₁ _{00 10} 01

j/o = (00 ₁₀ _01)a = _(9)io,

uj = (00 ₀₀ ₁₁₎₂ = _(3)i0, _y{ = 0 if i> _l,

w'0 = ₉_mod₄= ₁

u>i = 3mod4 = 3

wl = _0, _if _t _> ₁

w0 = _{165 +}

[(1-165)mod_4]-257= 165₊ 0= ₁₆₅

wi = ₁₆₃₊[(3

-163) mod_4] 257= ₁₆₃

u>2 = 40

w3 = ₀

Step 6

uv= ₀ 212₊₄₀ 28_{+ 163} 24_{+ 165} 2 = _{13 013}

Since 91 143 = ₁₃ _013, _the _{result checks!}

2.5.9 Modified SCHONHAGE-STRASSEN Algorithm

(1982)

A. SCHONHAGE published an improved version of the

SCHONHAGE-STRASSEN multiplication algorithm in 1982 [28]. In his introduction he

admitted, that the multiplication methodin its original version was "some

what

clumsy"

[28]. _By choosing different parameters,theapplication ofthe Chinese remainder theorem can be avoided and the algorithm reduces to

step 1, 2, 3 and 6 ofthe original algorithm SI. The simplified version also

achieves 0(n log n log log _n) bit _complexityand seems tobemorefeasible

especially for numbers of moderatelength.

Imagine two n2-bit numbers u and v. In the original version we restricted

n2 to a power of_two, n2 = 2k. Now the formula for _n2 is

n2 =

(44)

Note,that_{this representation, if}_possible,isunique. Tocomputetheproduct uv mod(2"2 ₊₁₎ wesplit u and v intonB blocksof/ _bits, n2 = _nB _I, with

nB = 2LU+1

I = u2^~\

Further parameters ofthe admissible triple _(ws,nB,m) are

m

u. s

w. nB

ri/+l-1_q

= 21 2 _i-^ _jf _fc is _even,

= _2W-4, _if k _is odd.

The parametersare summarized in table 2.2.

Observe that m is not restricted to Fermat numbers Ft but rather is com

posed of Fourier primes. Since

[|

_J + 1 < _[|] ₊ 1 we can _employ a fast

Fourier transformoflength nB = 2L2J+1.

Fortunately, wedon't havetoapply the Chineseremainder theorem. Let u;,

be the elements of _{the convolution, then} we will show that m >