Faster Multiplication by the “Comba” Method

5.2 Multiplication

5.2.2 Faster Multiplication by the “Comba” Method

One of the huge drawbacks of the “baseline” algorithms is that at the $O(n^2)$ level the carry must be computed and propagated upwards. This makes the nested loop very sequential and hard to unroll and implement in parallel. The “Comba” [4] method is named after Paul G. Comba, little known in cryptographic venues, who described a method of implementing fast multipliers that do not require nested carry fix-up operations. As an interesting aside, it seems that Paul Barrett describes a similar technique in his 1986 paper [6], written five years before.

At the heart of the Comba technique is again the long-hand algorithm, except in this case a slight twist is placed on how the columns of the result are produced. In the standard long-hand algorithm, rows of products are produced and then added together to form the result. In the baseline algorithm, the columns are added together after each iteration to get the result instantaneously.

In the Comba algorithm, the columns of the result are produced entirely independently of each other; that is, at the $O(n^2)$ level a simple multiplication and addition step is performed. The carries of the columns are propagated after the nested loop to reduce the amount of work required. Succinctly, the first step of the algorithm is to compute the product vector $\vec{x}$ as follows:

$$\vec{x}_n = \sum_{i+j=n} a_i b_j, \quad \forall n \in \{0, 1, 2, \ldots, i+j\} \qquad (5.1)$$

where $\vec{x}_n$ is the $n$'th column of the output vector. Consider Figure 5.3, which computes the vector $\vec{x}$ for the multiplication of 576 and 241.

                                          5              7              6   First Input
×                                         2              4              1   Second Input
                                    1·5 = 5        1·7 = 7        1·6 = 6   First pass
                    4·5 = 20   4·7 + 5 = 33   4·6 + 7 = 31              6   Second pass
     2·5 = 10  2·7 + 20 = 34  2·6 + 33 = 45             31              6   Third pass
           10             34             45             31              6   Final Result

Figure 5.3: Comba Multiplication Diagram

At this point the vector $\vec{x} = \langle 10, 34, 45, 31, 6 \rangle$ is the result of the first step of the Comba multiplier. Now the columns must be fixed by propagating the carry upwards. The resultant vector will have one extra dimension over the input vector, which is equivalent to adding a leading zero digit (Figure 5.4).

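Algorithm Comba Fixup.

Input. Vector $\vec{x}$ of dimension $k$

Output. Vector $\vec{x}$ such that the carries have been propagated.

1. for $n$ from 0 to $k - 1$ do

  1.1 $\vec{x}_{n+1} \leftarrow \vec{x}_{n+1} + \lfloor \vec{x}_n / \beta \rfloor$

  1.2 $\vec{x}_n \leftarrow \vec{x}_n \pmod{\beta}$

2. Return($\vec{x}$).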

Figure 5.4: Algorithm Comba Fixup

With that algorithm and $k = 5$ and $\beta = 10$ the vector $\vec{x} = \langle 1, 3, 8, 8, 1, 6 \rangle$ is produced. In this case, $241 \cdot 576$ is in fact 138816 and the procedure succeeded. If the algorithm is correct and, as will be demonstrated shortly, more efficient than the baseline algorithm, why not simply always use this algorithm?

Column Weight.

At the nested $O(n^2)$ level the Comba method adds the product of two single precision variables to each column of the output independently. A serious obstacle arises if the carry is lost due to lack of precision before the algorithm has a chance to fix the carries. For example, in the multiplication of two three-digit numbers, the third column of output will be the sum of three single precision multiplications. If the accumulator for the output digits cannot hold values as large as $3 \cdot (\beta - 1)^2$, then an overflow can occur and the carry information will be lost. For any $m$ and $n$ digit inputs the maximum weight of any column is $\min(m, n)$, which is fairly obvious.

The maximum number of terms in any column of a product is known as the “column weight” and strictly governs when the algorithm can be used. Recall that a double precision type has $\alpha$ bits of resolution and a single precision digit has $\lg(\beta)$ bits of precision. Given these two quantities we must not violate:

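$$k \cdot (\beta - 1)^2 < 2^{\alpha} \qquad (5.2)$$

which reduces to

$$k \cdot \left(\beta^2 - 2\beta + 1\right) < 2^{\alpha} \qquad (5.3)$$

Let $\rho = \lg(\beta)$ represent the number of bits in a single precision digit. By further re-arrangement of the equation the final solution is found.

$$k < \frac{2^{\alpha}}{2^{2\rho} - 2^{\rho+1} + 1} \qquad (5.4)$$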

The defaults for LibTomMath are $\beta = 2^{28}$ and $\alpha = 64$ (so that $2^{\alpha} = 2^{64}$), which means that $k$ is bounded by $k < 257$. In this configuration, the smaller input may not have more than 256 digits if the Comba method is to be used. This is quite satisfactory for most applications, since 256 digits would allow for numbers in the range of $0 \le x < 2^{7168}$, which is much larger than most public key cryptographic algorithms require.

Algorithm fast_s_mp_mul_digs.

Input. mp_int $a$, mp_int $b$ and an integer $digs$

Output. $c \leftarrow |a| \cdot |b| \pmod{\beta^{digs}}$.

Place an array of MP_WARRAY single precision digits named $W$ on the stack.

1. If $c.alloc < digs$ then grow $c$ to $digs$ digits. (mp_grow)

2. If step 1 failed return(MP_MEM).

3. $pa \leftarrow$ MIN($digs$, $a.used + b.used$)

4. $\hat{W} \leftarrow 0$

5. for $ix$ from 0 to $pa - 1$ do

  5.1 $ty \leftarrow$ MIN($b.used - 1$, $ix$)

  5.2 $tx \leftarrow ix - ty$

  5.3 $iy \leftarrow$ MIN($a.used - tx$, $ty + 1$)

  5.4 for $iz$ from 0 to $iy - 1$ do

    5.4.1 $\hat{W} \leftarrow \hat{W} + a_{tx+iz} b_{ty-iz}$

  5.5 $W_{ix} \leftarrow \hat{W} \pmod{\beta}$

  5.6 $\hat{W} \leftarrow \lfloor \hat{W} / \beta \rfloor$

6. $oldused \leftarrow c.used$

7. $c.used \leftarrow digs$

8. for $ix$ from 0 to $pa$ do

  8.1 $c_{ix} \leftarrow W_{ix}$

9. for $ix$ from $pa + 1$ to $oldused - 1$ do

  9.1 $c_{ix} \leftarrow 0$

10. Clamp $c$.

11. Return MP_OKAY.

Figure 5.5: Algorithm fast_s_mp_mul_digs

Algorithm fast_s_mp_mul_digs. This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision (Figure 5.5).

The outer loop of this algorithm is more complicated than that of the baseline multiplier. This is because on the inside of the loop we want to produce one column per pass. This allows the accumulator $\hat{W}$ to be placed in CPU registers and reduces the memory bandwidth to two mp_digit reads per iteration.

The $ty$ variable is set to the minimum of $ix$ and the index of the last digit of $b$. That way, if $a$ has more digits than $b$, this will be limited to $b.used - 1$. The $tx$ variable is set to the distance by which $ix$ exceeds $ty$, which is zero until $ix$ passes $b.used - 1$. This is used for the immediately subsequent statement where we find $iy$.

The variable $iy$ is the minimum number of digits we can read from either $a$ or $b$ before running out. Computing one column at a time means we have to scan one integer upwards and the other downwards. $a$ starts at $tx$ and $b$ starts at $ty$. In each pass we are producing the $ix$'th output column and we note that $tx + ty = ix$. As we move $tx$ upwards, we have to move $ty$ downwards so the equality remains valid. The $iy$ variable is the number of iterations until $tx \ge a.used$ or $ty < 0$ occurs.

After every inner pass we store the low-order digit of the accumulator into $W_{ix}$ and then propagate the carry of the accumulator into the next round by dividing $\hat{W}$ by $\beta$.

To measure the benefits of the Comba method over the baseline method, consider the number of operations that are required. If the cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$, then a baseline multiplication would require $O((p + q)n^2)$ time to multiply two $n$-digit numbers. The Comba method requires only $O(pn^2 + qn)$ time; however, in practice the speed increase is actually much more. With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply and addition operations in the nested loop in parallel.

File: bn_fast_s_mp_mul_digs.c
018   /* Fast (comba) multiplier
019    *
020    * This is the fast column-array [comba] multiplier.  It is
021    * designed to compute the columns of the product first
022    * then handle the carries afterwards.  This has the effect
023    * of making the nested loops that compute the columns very
024    * simple and schedulable on super-scalar processors.
025    *
026    * This has been modified to produce a variable number of
027    * digits of output so if say only a half-product is required
028    * you don't have to compute the upper half (a feature
029    * required for fast Barrett reduction).
030    *
031    * Based on Algorithm 14.12 on pp.595 of HAC.
032    *
033    */
034   int fast_s_mp_mul_digs (mp_int * a, mp_int * b, mp_int * c, int digs)
035   {
036     int     olduse, res, pa, ix, iz;
037     mp_digit W[MP_WARRAY];
038     register mp_word  _W;
039
040     /* grow the destination as required */
041     if (c->alloc < digs) {
042       if ((res = mp_grow (c, digs)) != MP_OKAY) {
043         return res;
044       }
045     }
046
047     /* number of output digits to produce */
048     pa = MIN(digs, a->used + b->used);
049
050     /* clear the carry */
051     _W = 0;
052     for (ix = 0; ix < pa; ix++) {
053       int      tx, ty;
054       int      iy;
055       mp_digit *tmpx, *tmpy;
056
057       /* get offsets into the two bignums */
058       ty = MIN(b->used-1, ix);
059       tx = ix - ty;
060
061       /* setup temp aliases */
062       tmpx = a->dp + tx;
063       tmpy = b->dp + ty;
064
065       /* this is the number of times the loop will iterate, essentially
066          while (tx++ < a->used && ty-- >= 0) { ... }
067        */
068       iy = MIN(a->used-tx, ty+1);
069
070       /* execute loop */
071       for (iz = 0; iz < iy; ++iz) {
072         _W += ((mp_word)*tmpx++)*((mp_word)*tmpy--);
073       }
074
075       /* store term */
076       W[ix] = ((mp_digit)_W) & MP_MASK;
077
078       /* make next carry */
079       _W = _W >> ((mp_word)DIGIT_BIT);
080     }
081
082     /* setup dest */
083     olduse  = c->used;
084     c->used = pa;
085
086     {
087       register mp_digit *tmpc;
088       tmpc = c->dp;
089       for (ix = 0; ix < pa+1; ix++) {
090         /* now extract the previous digit [below the carry] */
091         *tmpc++ = W[ix];
092       }
093
094       /* clear unused digits [that existed in the old copy of c] */
095       for (; ix < olduse; ix++) {
096         *tmpc++ = 0;
097       }
098     }
099     mp_clamp (c);
100     return MP_OKAY;
101   }
102

As per the pseudo-code we first calculate pa (line 48) as the number of digits to output. Next, we begin the outer loop to produce the individual columns of the product. We use the two aliases tmpx and tmpy (lines 62, 63) to point inside the two multiplicands quickly.

The inner loop (lines 71 to 73) of this implementation is where the trade-off comes into play. Originally, this Comba implementation was “row-major,” which means it added to each of the columns in each pass. After the outer loop it would then fix the carries. This was very fast, except it had an annoying drawback. You had to read an mp_word and two mp_digits and write one mp_word per iteration. On processors such as the Athlon XP and P4 this did not matter much since the cache bandwidth is very high and it can keep the ALU fed with data. It did, however, matter on older and embedded CPUs where cache is often slower and often does not exist. This new algorithm only performs two reads per iteration under the assumption that the compiler has aliased $\hat{W}$ to a CPU register.

After the inner loop we store the current accumulator in $W$ and shift $\hat{W}$ (lines 76, 79) to forward it as a carry for the next pass. After the outer loop we use the final carry (line 76) as the last digit of the product.
