5.3 Squaring
5.3.5 Karatsuba Squaring
Let $f(x) = ax + b$ represent the polynomial basis representation of a number to square. Let $h(x) = (f(x))^2$ represent the square of the polynomial. The Karatsuba equation can be modified to square a number with the following equation.
$$h(x) = a^2x^2 + \left((a+b)^2 - (a^2+b^2)\right)x + b^2 \tag{5.7}$$
Upon closer inspection, this equation only requires the calculation of three half-sized squares: $a^2$, $b^2$, and $(a+b)^2$. As in Karatsuba multiplication, this algorithm can be applied recursively on the input and will achieve an asymptotic running time of $O(n^{\lg 3})$.
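The identity can be checked on a machine word. The following is a minimal sketch, assuming a 32-bit input split into two 16-bit halves (so $x = 2^{16}$); the function name `karatsuba_sqr32` is hypothetical and is not part of the library.

```c
#include <stdint.h>

/* Square a 32-bit value with three half-size squarings, following
 * h(x) = a^2 x^2 + ((a+b)^2 - (a^2 + b^2)) x + b^2 with x = 2^16.
 * karatsuba_sqr32 is a hypothetical name used only for this sketch. */
static uint64_t karatsuba_sqr32(uint32_t v)
{
    uint64_t a = v >> 16;       /* upper half */
    uint64_t b = v & 0xFFFFu;   /* lower half */

    uint64_t a2  = a * a;               /* first half-size square     */
    uint64_t b2  = b * b;               /* second half-size square    */
    uint64_t ab2 = (a + b) * (a + b);   /* third square, below 2^34   */

    /* Middle coefficient: equals 2ab without ever computing a*b. */
    uint64_t mid = ab2 - (a2 + b2);

    return (a2 << 32) + (mid << 16) + b2;
}
```

Every intermediate value fits in 64 bits, so the result equals the direct product `(uint64_t)v * v` for any 32-bit `v`.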
If the asymptotic times of Karatsuba squaring and multiplication are the same, why not simply use the multiplication algorithm instead? The answer arises from the cutoff point for squaring. As in multiplication, there exists a cutoff point at which the time required for a Comba–based squaring and a Karatsuba–based squaring meet. Due to the overhead inherent in the Karatsuba method, the cutoff point is fairly high. For example, on an AMD Athlon XP processor with $\beta = 2^{28}$, the cutoff point is around 127 digits.
Consider squaring a 200–digit number with this technique. It will be split into two 100–digit halves that are subsequently squared. The 100–digit halves will not be squared using Karatsuba, but instead using the faster Comba–based squaring algorithm. If Karatsuba multiplication were used instead, the 100–digit numbers would be squared with a slower Comba–based multiplication.
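The 200-digit case can be traced with a small counting routine. This is a sketch only: `count_base_squarings` is a hypothetical helper, and it assumes the sum x1 + x0 has the same digit count as the larger half (a possible carry digit is ignored).

```c
/* Count the base-case (Comba-style) squarings performed when a number of
 * `digits` digits is squared by recursive Karatsuba with the given cutoff.
 * Assumption for this sketch: x1 + x0 is treated as having as many digits
 * as the larger half. */
static int count_base_squarings(int digits, int cutoff)
{
    int B, hi;

    if (digits < cutoff || digits < 2) {
        return 1;               /* below the cutoff: one base-case squaring */
    }
    B  = digits / 2;            /* size of the low half  */
    hi = digits - B;            /* size of the high half */

    return count_base_squarings(B, cutoff)      /* x0^2        */
         + count_base_squarings(hi, cutoff)     /* x1^2        */
         + count_base_squarings(hi, cutoff);    /* (x1 + x0)^2 */
}
```

With a cutoff of 127 digits, a 200-digit input yields three base-case squarings of roughly 100 digits each, and a 400-digit input yields nine.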
Algorithm mp_karatsuba_sqr.
Input. mp_int a
Output. b ← a^2

1. Initialize the following temporary mp_ints: x0, x1, t1, t2, x0x0, and x1x1.
2. If any of the initializations on step 1 failed return(MP_MEM).

Split the input. e.g. a = x1·β^B + x0
3. B ← ⌊a.used/2⌋
4. x0 ← a (mod β^B) (mp_mod_2d)
5. x1 ← ⌊a/β^B⌋ (mp_rshd)

Calculate the three squares.
6. x0x0 ← x0^2 (mp_sqr)
7. x1x1 ← x1^2
8. t1 ← x1 + x0 (s_mp_add)
9. t1 ← t1^2

Compute the middle term.
10. t2 ← x0x0 + x1x1 (s_mp_add)
11. t1 ← t1 − t2

Compute final product.
12. t1 ← t1·β^B (mp_lshd)
13. x1x1 ← x1x1·β^(2B)
14. t1 ← t1 + x0x0
15. b ← t1 + x1x1
16. Return(MP_OKAY).

Figure 5.14: Algorithm mp_karatsuba_sqr
Algorithm mp_karatsuba_sqr. This algorithm computes the square of an input a using the Karatsuba technique. It is very similar to the Karatsuba–based multiplication algorithm with the exception that the three half-size multiplications have been replaced with three half-size squarings (Figure 5.14).
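As a worked illustration of steps 3 through 15, assume for readability a decimal radix $\beta = 10$ (the library itself uses a power of two) and the input $a = 1234$, so that a.used = 4 and $B = 2$. The subscripted names below correspond to the temporaries x0, x1, x0x0, x1x1, t1, and t2 of the figure.

```latex
\begin{align*}
x_0 &= 1234 \bmod 10^2 = 34, \qquad x_1 = \lfloor 1234 / 10^2 \rfloor = 12 \\
x_0x_0 &= 34^2 = 1156, \qquad x_1x_1 = 12^2 = 144 \\
t_1 &= (12 + 34)^2 = 46^2 = 2116 \\
t_2 &= 1156 + 144 = 1300 \\
t_1 &= 2116 - 1300 = 816 = 2 \cdot 12 \cdot 34 \\
b &= 144 \cdot 10^4 + 816 \cdot 10^2 + 1156 = 1522756 = 1234^2
\end{align*}
```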
The radix point for squaring is placed exactly in the middle of the digits when the input has an even number of digits; otherwise, it is placed just below the middle, since $B = \lfloor a.used/2 \rfloor$. Steps 3, 4, and 5 compute the two halves required using B as the radix point. The first two squares in steps 6 and 7 are straightforward, while the last square is of a more compact form.
By expanding $(x_1 + x_0)^2$, the $x_1^2$ and $x_0^2$ terms in the middle disappear; that is, $(x_1 + x_0)^2 - (x_1^2 + x_0^2) = 2 \cdot x_0 \cdot x_1$. Now if 5n single precision additions and a squaring of n digits are faster than multiplying two n-digit numbers and doubling, then this method is faster. Assuming no further recursions occur, the difference can be estimated with the following inequality.
Let p represent the cost of a single precision addition and q the cost of a single precision multiplication, both in terms of time[4].
$$5pn + \frac{q(n^2+n)}{2} \le pn + qn^2 \tag{5.8}$$
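For example, on an AMD Athlon XP processor, $p = \frac{1}{3}$ and $q = 6$. This implies that the following inequality should hold.

$$\begin{aligned}
\frac{5n}{3} + 3n^2 + 3n &< \frac{n}{3} + 6n^2 \\
\frac{5}{3} + 3n + 3 &< \frac{1}{3} + 6n \\
\frac{13}{9} &< n
\end{aligned}$$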
This results in a cutoff point around n = 2. As a consequence, it is actually faster to compute the middle term the “long way” on processors where multiplication is substantially slower[5] than simpler operations such as addition.
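The estimate can be reproduced numerically. The sketch below evaluates both sides of the inequality for increasing n and reports the first n at which the squaring form is strictly cheaper; the function names are hypothetical, and the costs p and q are whatever the caller supplies.

```c
/* Cost of forming the middle term from one squaring plus additions:
 * 5pn + q(n^2 + n)/2. */
static double cost_via_square(double p, double q, int n)
{
    return 5.0 * p * n + q * ((double)n * n + n) / 2.0;
}

/* Cost of multiplying the two n-digit halves directly: pn + qn^2. */
static double cost_via_multiply(double p, double q, int n)
{
    return p * n + q * (double)n * n;
}

/* Smallest n in [1, limit] for which the squaring form is strictly
 * cheaper; returns 0 if no such n exists in that range. */
static int middle_term_cutoff(double p, double q, int limit)
{
    int n;

    for (n = 1; n <= limit; n++) {
        if (cost_via_square(p, q, n) < cost_via_multiply(p, q, n)) {
            return n;
        }
    }
    return 0;
}
```

Calling `middle_term_cutoff(1.0 / 3.0, 6.0, 100)` with the values of p and q quoted for the Athlon gives 2. With equal costs (p = q = 1) the cutoff moves up to 10, which shows how strongly the result depends on the ratio between the two operations.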
[4] Or machine clock cycles.
[5] On the Athlon there is a 1:17 ratio between clock cycles for addition and multiplication. On the Intel P4 processor this ratio is 1:29, making this method even more beneficial. The only common exception is the ARMv4 processor, which has a ratio of 1:7.

File: bn_mp_karatsuba_sqr.c
018   /* Karatsuba squaring, computes b = a*a using three
019    * half size squarings
020    *
021    * See comments of karatsuba_mul for details. It
022    * is essentially the same algorithm but merely
023    * tuned to perform recursive squarings.
024    */
025   int mp_karatsuba_sqr (mp_int * a, mp_int * b)
026   {
027     mp_int x0, x1, t1, t2, x0x0, x1x1;
028     int B, err;
029
030     err = MP_MEM;
031
032     /* min # of digits */
033     B = a->used;
034
035     /* now divide in two */
036     B = B >> 1;
037
038     /* init copy all the temps */
039     if (mp_init_size (&x0, B) != MP_OKAY)
040       goto ERR;
041     if (mp_init_size (&x1, a->used - B) != MP_OKAY)
042       goto X0;
043
044     /* init temps */
045     if (mp_init_size (&t1, a->used * 2) != MP_OKAY)
046       goto X1;
047     if (mp_init_size (&t2, a->used * 2) != MP_OKAY)
048       goto T1;
049     if (mp_init_size (&x0x0, B * 2) != MP_OKAY)
050       goto T2;
051     if (mp_init_size (&x1x1, (a->used - B) * 2) != MP_OKAY)
052       goto X0X0;
053
054     {
055       register int x;
056       register mp_digit *dst, *src;
057
058       src = a->dp;
059
060       /* now shift the digits */
061       dst = x0.dp;
062       for (x = 0; x < B; x++) {
063         *dst++ = *src++;
064       }
065
066       dst = x1.dp;
067       for (x = B; x < a->used; x++) {
068         *dst++ = *src++;
069       }
070     }
071
072     x0.used = B;
073     x1.used = a->used - B;
074
075     mp_clamp (&x0);
076
077     /* now calc the products x0*x0 and x1*x1 */
078     if (mp_sqr (&x0, &x0x0) != MP_OKAY)
079       goto X1X1;           /* x0x0 = x0*x0 */
080     if (mp_sqr (&x1, &x1x1) != MP_OKAY)
081       goto X1X1;           /* x1x1 = x1*x1 */
082
083     /* now calc (x1+x0)**2 */
084     if (s_mp_add (&x1, &x0, &t1) != MP_OKAY)
085       goto X1X1;           /* t1 = x1 + x0 */
086     if (mp_sqr (&t1, &t1) != MP_OKAY)
087       goto X1X1;           /* t1 = (x1 + x0) * (x1 + x0) */
088
089     /* add x0x0 and x1x1 */
090     if (s_mp_add (&x0x0, &x1x1, &t2) != MP_OKAY)
091       goto X1X1;           /* t2 = x0x0 + x1x1 */
092     if (s_mp_sub (&t1, &t2, &t1) != MP_OKAY)
093       goto X1X1;           /* t1 = (x1+x0)**2 - (x0x0 + x1x1) */
094
095     /* shift by B */
096     if (mp_lshd (&t1, B) != MP_OKAY)
097       goto X1X1;           /* t1 = ((x1+x0)**2 - (x0x0 + x1x1))<<B */
098     if (mp_lshd (&x1x1, B * 2) != MP_OKAY)
099       goto X1X1;           /* x1x1 = x1x1 << 2*B */
100
101     if (mp_add (&x0x0, &t1, &t1) != MP_OKAY)
102       goto X1X1;           /* t1 = x0x0 + t1 */
103     if (mp_add (&t1, &x1x1, b) != MP_OKAY)
104       goto X1X1;           /* b = t1 + x1x1 */
105
106     err = MP_OKAY;
107
108   X1X1:mp_clear (&x1x1);
109   X0X0:mp_clear (&x0x0);
110   T2:mp_clear (&t2);
111   T1:mp_clear (&t1);
112   X1:mp_clear (&x1);
113   X0:mp_clear (&x0);
114   ERR:
115     return err;
116   }
117
This implementation is largely based on the implementation of algorithm mp_karatsuba_mul. It uses the same inline style to copy and shift the input into the two halves. The loop from line 54 to line 70 has been modified since only one input exists. The used count of both x0 and x1 is fixed up, and x0 is clamped before the calculations begin. At this point, x1 and x0 are valid equivalents to the respective halves as if mp_rshd and mp_mod_2d had been used.
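The inline split can be sketched on plain arrays. The code below is an illustration only: `digit_t` is a hypothetical stand-in for mp_digit, `split_halves` is not a library function, and digits are assumed to be stored least significant first, as in the dp array of the listing.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t digit_t;   /* hypothetical stand-in for mp_digit */

/* Copy the low B digits of src into x0 and the remaining digits into x1,
 * mirroring the two inline loops of the listing. Returns the used count
 * of x0 after dropping leading zero digits (the job mp_clamp performs).
 * The used count of x1 is simply used - B, because the top digit of a
 * clamped input is nonzero. The caller supplies room for B digits in x0
 * and used - B digits in x1. */
static size_t split_halves(const digit_t *src, size_t used, size_t B,
                           digit_t *x0, digit_t *x1)
{
    size_t i, x0_used;

    for (i = 0; i < B; i++) {
        x0[i] = src[i];             /* x0 = a mod beta^B */
    }
    for (i = B; i < used; i++) {
        x1[i - B] = src[i];         /* x1 = floor(a / beta^B) */
    }

    x0_used = B;
    while (x0_used > 0 && x0[x0_used - 1] == 0) {
        x0_used--;                  /* clamp: drop leading zero digits */
    }
    return x0_used;
}
```

For a five-digit input with B = 5 >> 1 = 2, x0 receives the two low digits and x1 the three high digits. Only x0 needs clamping, since the low half may end in zero digits while the high half inherits the nonzero top digit of the input.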
By inlining the copy and shift operations, the cutoff point for Karatsuba squaring can be lowered. On the Athlon, the cutoff point is exactly at the point where Comba squaring can no longer be used (128 digits). On slower processors such as the Intel P4, it is actually below the Comba limit (at 110 digits).
This routine uses the same error trap coding style as mp_karatsuba_mul. As the temporary variables are initialized, each failure jumps to a label that clears only the variables already initialized, so later initializations branch to traps higher up in the list. If the algorithm completes without error, the error code is set to MP_OKAY and execution falls through all of the mp_clear calls normally.
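The error trap style can be shown in isolation. The following is a minimal sketch using plain heap buffers instead of mp_int values; `demo_alloc`, `demo_free`, `error_trap_demo`, and the `live_buffers` counter are all hypothetical and exist only to make each trap observable.

```c
#include <stdlib.h>

static int live_buffers = 0;    /* bookkeeping for the demonstration only */

/* Hypothetical allocator that can be told to fail, so every trap can be
 * exercised. */
static void *demo_alloc(int should_fail)
{
    void *p = should_fail ? NULL : malloc(16);
    if (p != NULL) {
        live_buffers++;
    }
    return p;
}

static void demo_free(void *p)
{
    free(p);
    live_buffers--;
}

/* Acquire three buffers in order. fail_at = 1..3 forces that acquisition
 * to fail, 4 simulates a failure during the computation, 0 means success.
 * Returns 0 on success and -1 on error. */
static int error_trap_demo(int fail_at)
{
    void *a = NULL, *b = NULL, *c = NULL;
    int err = -1;               /* assume failure, like err = MP_MEM */

    if ((a = demo_alloc(fail_at == 1)) == NULL) goto ERR;
    if ((b = demo_alloc(fail_at == 2)) == NULL) goto TRAP_A;
    if ((c = demo_alloc(fail_at == 3)) == NULL) goto TRAP_B;

    if (fail_at == 4) goto TRAP_C;  /* computation failed: free everything */

    err = 0;                    /* success falls through the same traps */

TRAP_C: demo_free(c);
TRAP_B: demo_free(b);
TRAP_A: demo_free(a);
ERR:
    return err;
}
```

Each label releases exactly the resources acquired before the failing step, and the success path reuses the same cleanup code by falling through the labels from the top.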