

5.3 Squaring

5.3.2 Faster Squaring by the “Comba” Method

A major drawback of the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop. Squaring has an additional drawback: it must also double the product inside the inner loop. As with multiplication, the Comba technique can be used to eliminate these performance hazards.

The first obvious solution is to make an array of mp_words that will hold all the columns. This will indeed eliminate all of the carry propagation operations from the inner loop. However, the inner product must still be doubled $O(n^2)$ times. The solution stems from the simple fact that $2a + 2b + 2c = 2(a + b + c)$. That is, the sum of all of the double products is equal to double the sum of all the products. For example, $ab + ba + ac + ca = 2ab + 2ac = 2(ab + ac)$.

However, we cannot simply double all the columns, since the squares appear only once per row. The most practical solution is to have two mp_word arrays. One array will hold the squares, and the other will hold the double products. With both arrays, the doubling and carry propagation can be moved to an $O(n)$ work level outside the $O(n^2)$ level. In this case, we have an even simpler solution in mind.
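As a small worked illustration (not taken from the original text), consider a hypothetical three-digit input $a = a_0 + a_1\beta + a_2\beta^2$. Expanding the square column by column shows that every cross product appears twice, while each square term appears once and only in an even column:

```latex
\begin{aligned}
(a_0 + a_1\beta + a_2\beta^2)^2
  &= a_0^2
   + (a_0a_1 + a_1a_0)\,\beta
   + (a_0a_2 + a_1^2 + a_2a_0)\,\beta^2
   + (a_1a_2 + a_2a_1)\,\beta^3
   + a_2^2\,\beta^4 \\
  &= a_0^2
   + 2(a_0a_1)\,\beta
   + \bigl(2(a_0a_2) + a_1^2\bigr)\,\beta^2
   + 2(a_1a_2)\,\beta^3
   + a_2^2\,\beta^4
\end{aligned}
```

Doubling a whole column would also double $a_1^2$ in the $\beta^2$ column, which is why the square terms have to be kept apart from the doubled sum.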

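Algorithm fast_s_mp_sqr.

Input. mp_int $a$

Output. $b \leftarrow a^2$

Place an array of MP_WARRAY mp_digits named $W$ on the stack.
1. If $b.alloc < 2 \cdot a.used + 1$ then grow $b$ to $2 \cdot a.used + 1$ digits. (mp_grow)
2. If step 1 failed return(MP_MEM).

3. $pa \leftarrow 2 \cdot a.used$
4. $\hat{W}_1 \leftarrow 0$
5. for $ix$ from 0 to $pa - 1$ do
  5.1 $\hat{W} \leftarrow 0$
  5.2 $ty \leftarrow \mathrm{MIN}(a.used - 1, ix)$
  5.3 $tx \leftarrow ix - ty$
  5.4 $iy \leftarrow \mathrm{MIN}(a.used - tx, ty + 1)$
  5.5 $iy \leftarrow \mathrm{MIN}(iy, \lfloor (ty - tx + 1)/2 \rfloor)$
  5.6 for $iz$ from 0 to $iy - 1$ do
    5.6.1 $\hat{W} \leftarrow \hat{W} + a_{tx+iz} a_{ty-iz}$
  5.7 $\hat{W} \leftarrow 2 \cdot \hat{W} + \hat{W}_1$
  5.8 if $ix$ is even then
    5.8.1 $\hat{W} \leftarrow \hat{W} + (a_{\lfloor ix/2 \rfloor})^2$
  5.9 $W_{ix} \leftarrow \hat{W} \pmod{\beta}$
  5.10 $\hat{W}_1 \leftarrow \lfloor \hat{W}/\beta \rfloor$
6. $oldused \leftarrow b.used$
7. $b.used \leftarrow 2 \cdot a.used$
8. for $ix$ from 0 to $pa - 1$ do
  8.1 $b_{ix} \leftarrow W_{ix}$

9. for $ix$ from $pa$ to $oldused - 1$ do
  9.1 $b_{ix} \leftarrow 0$

10. Clamp excess digits from $b$. (mp_clamp)
11. Return(MP_OKAY).

Figure 5.13: Algorithm fast_s_mp_sqr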
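The column loop of step 5 can be sketched on plain arrays, without the mp_int type. This is a minimal sketch under stated assumptions, not the library routine: the names `sketch_comba_sqr`, `SKETCH_DIGIT_BIT`, `SKETCH_MASK`, and `SKETCH_MAX_DIGITS` are hypothetical, digits are assumed to be 28 bits wide and stored in `uint32_t`, and the accumulator is assumed to be a `uint64_t`. The cap of 16 input digits is chosen only so the 64-bit accumulator clearly cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the library's digit configuration. */
#define SKETCH_DIGIT_BIT  28
#define SKETCH_MASK       ((((uint32_t)1) << SKETCH_DIGIT_BIT) - 1)
#define SKETCH_MAX_DIGITS 16  /* assumed cap so the accumulator cannot overflow */

static int sketch_min_int(int x, int y) { return x < y ? x : y; }

/* Computes out = a * a, where a has n digits (least significant first)
 * and out has room for 2 * n digits. */
static void sketch_comba_sqr(const uint32_t *a, int n, uint32_t *out)
{
    uint64_t carry = 0;           /* plays the role of W1 in the figure */
    int pa = 2 * n;

    assert(n >= 1 && n <= SKETCH_MAX_DIGITS);

    for (int ix = 0; ix < pa; ix++) {
        uint64_t acc = 0;         /* plays the role of the accumulator */

        /* steps 5.2 to 5.5: offsets and iteration count for this column */
        int ty = sketch_min_int(n - 1, ix);
        int tx = ix - ty;
        int iy = sketch_min_int(n - tx, ty + 1);
        iy = sketch_min_int(iy, (ty - tx + 1) / 2);

        /* step 5.6: sum each distinct cross product once */
        for (int iz = 0; iz < iy; iz++) {
            acc += (uint64_t)a[tx + iz] * (uint64_t)a[ty - iz];
        }

        /* step 5.7: double the sum, then add the previous carry */
        acc = acc + acc + carry;

        /* step 5.8: even columns also contain one square term */
        if ((ix & 1) == 0) {
            acc += (uint64_t)a[ix >> 1] * (uint64_t)a[ix >> 1];
        }

        /* steps 5.9 and 5.10: store the digit, keep the rest as carry */
        out[ix] = (uint32_t)(acc & SKETCH_MASK);
        carry = acc >> SKETCH_DIGIT_BIT;
    }
}
```

For the digits {5, 7, 11} no carries occur, so the output digits are simply the column sums 25, 70, 159, 154, 121, and 0.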

Algorithm fast_s_mp_sqr. This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm s_mp_sqr when the number of input digits is less than MP_WARRAY and less than $\delta/2$. This algorithm is very similar to the Comba multiplier, except for a few key differences we shall make note of (Figure 5.13).

First, we have separate accumulator and carry variables, $\hat{W}$ and $\hat{W}_1$, respectively. This is because the inner loop products are to be doubled: if we had added the previous carry into the accumulator before doubling, we would be doubling too much. Next, we perform an additional MIN condition on $iy$ (step 5.5) to prevent overlapping digits. For example, $a_3 \cdot a_5$ is equal to $a_5 \cdot a_3$. In the multiplication case both products would be computed, since the loop runs for as long as $5 < a.used$ and $3 \ge 0$ are maintained. Since we double the sum of the products just outside the inner loop, we have to avoid computing both here. This is also a good thing, since we perform fewer multiplications and the routine ends up being faster.
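The effect of the extra MIN condition can be checked by counting inner-loop iterations per column. The following is a small sketch with hypothetical helper names (`sketch_inner_count`, `sketch_total_products`); it only reproduces the index arithmetic of steps 5.2 to 5.5 and performs no multiplications.

```c
#include <assert.h>

static int sketch_min(int x, int y) { return x < y ? x : y; }

/* Inner-loop iterations for column ix of an input with `used` digits.
 * squaring == 0: the count a Comba multiplier would use for a * a.
 * squaring != 0: additionally applies the halving bound of step 5.5. */
static int sketch_inner_count(int used, int ix, int squaring)
{
    assert(used >= 1 && ix >= 0 && ix < 2 * used);

    int ty = sketch_min(used - 1, ix);
    int tx = ix - ty;
    int iy = sketch_min(used - tx, ty + 1);

    if (squaring) {
        iy = sketch_min(iy, (ty - tx + 1) / 2);
    }
    return iy;
}

/* Total inner-loop products over all 2 * used columns. */
static int sketch_total_products(int used, int squaring)
{
    int total = 0;
    for (int ix = 0; ix < 2 * used; ix++) {
        total += sketch_inner_count(used, ix, squaring);
    }
    return total;
}
```

For an eight-digit input, column 8 takes seven iterations in the multiplication case but only three when squaring: the pairs (1,7), (2,6), and (3,5). The pair (4,4) is the square term handled outside the loop, and the mirrored pairs are covered by the doubling. Over all columns the inner loop performs 64 products for multiplication and 28 for squaring, not counting the eight square terms added outside the loop.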

The last difference is the addition of the “square” term outside the inner loop (step 5.8). We add in the square only to even outputs, and it is the square of the term at the $\lfloor ix/2 \rfloor$ position.

File: bn_fast_s_mp_sqr.c

018   /* the gist of squaring...
019    * you do like mult except the offset of the tmpx [one that
020    * starts closer to zero] can't equal the offset of tmpy.
021    * So basically you set up iy like before then you min it with
022    * (ty-tx) so that it never happens.  You double all those
023    * you add in the inner loop
024
025   After that loop you do the squares and add them in.
026   */
027
028   int fast_s_mp_sqr (mp_int * a, mp_int * b)
029   {
030     int       olduse, res, pa, ix, iz;
031     mp_digit  W[MP_WARRAY], *tmpx;
032     mp_word   W1;
033
034     /* grow the destination as required */
035     pa = a->used + a->used;
036     if (b->alloc < pa) {
037       if ((res = mp_grow (b, pa)) != MP_OKAY) {
038         return res;
039       }
040     }
041
042     /* number of output digits to produce */
043     W1 = 0;
044     for (ix = 0; ix < pa; ix++) {
045         int      tx, ty, iy;
046         mp_word  _W;
047         mp_digit *tmpy;
048
049         /* clear counter */
050         _W = 0;
051
052         /* get offsets into the two bignums */
053         ty = MIN(a->used-1, ix);
054         tx = ix - ty;
055
056         /* setup temp aliases */
057         tmpx = a->dp + tx;
058         tmpy = a->dp + ty;
059
060         /* this is the number of times the loop will iterate, essentially
061            while (tx++ < a->used && ty-- >= 0) { ... }
062          */
063         iy = MIN(a->used-tx, ty+1);
064
065         /* now for squaring tx can never equal ty
066          * we halve the distance since they approach at a rate of 2x
067          * and we have to round because odd cases need to be executed
068          */
069         iy = MIN(iy, (ty-tx+1)>>1);
070
071         /* execute loop */
072         for (iz = 0; iz < iy; iz++) {
073            _W += ((mp_word)*tmpx++)*((mp_word)*tmpy--);
074         }
075
076         /* double the inner product and add carry */
077         _W = _W + _W + W1;
078
079         /* even columns have the square term in them */
080         if ((ix&1) == 0) {
081            _W += ((mp_word)a->dp[ix>>1])*((mp_word)a->dp[ix>>1]);
082         }
083
084         /* store it */
085         W[ix] = (mp_digit)(_W & MP_MASK);
086
087         /* make next carry */
088         W1 = _W >> ((mp_word)DIGIT_BIT);
089     }
090
091     /* setup dest */
092     olduse  = b->used;
093     b->used = a->used+a->used;
094
095     {
096       mp_digit *tmpb;
097       tmpb = b->dp;
098       for (ix = 0; ix < pa; ix++) {
099         *tmpb++ = W[ix] & MP_MASK;
100       }
101
102       /* clear unused digits [that existed in the old copy of c] */
103       for (; ix < olduse; ix++) {
104         *tmpb++ = 0;
105       }
106     }
107     mp_clamp (b);
108     return MP_OKAY;
109   }

This implementation is essentially a copy of Comba multiplication with the appropriate changes added to make it faster for the special case of squaring. The innermost loop (lines 72 to 74) computes the products the same way the multiplication routine does. The sum of the products is doubled separately (line 77) outside the innermost loop. The square term is added if ix is even (lines 80 to 82), indicating a column with a square term.
