• No results found

Performing point halving efficiently

3.6 Point multiplication using halving

3.6.2 Performing point halving efficiently

Point halving requires a field multiplication and three main steps: (i) computing the trace of t; (ii) solving the quadratic equation (3.42); and (iii) computing a square root. In a normal basis, field elements are represented in terms of a basis of the form

{β,β2,...,β2m−1

}. The trace of an elementc=ciβ2

i

=(c0,c1,...,cm−1)is given by

Tr(c)=ci. The square root computation is a left rotation:√c=(c1,...,cm−1,c0).

Squaring is a right rotation, and x2+x=ccan be solved bitwise. These operations are expected to be inexpensive relative to field multiplication. However, field multi- plication in software for normal basis representations is very slow in comparison to multiplication with a polynomial basis. Conversion between polynomial and normal bases at each halving appears unlikely to give a competitive method, even if signifi- cant storage is used. For these reasons, we restrict our discussion to computations in a polynomial basis representation.

Computing the trace

Letc=mi=01cizi∈F2m, withci∈ {0,1}, represented as the vectorc=(cm−1,...,c0).

A primitive method for computing Tr(c)uses the definition of trace, requiringm−1 field squarings andm−1 field additions. A much more efficient method makes use of the property that the trace is linear:

Tr(c)=Tr m−1 i=0 cizi = m−1 i=0 ciTr(zi).

The values Tr(zi)may be precomputed, allowing the trace of an element to be found efficiently, especially if Tr(zi)=0 for mosti.

Example 3.82(computing traces of elements inF2163) Consider F

2163 with reduction

polynomial f(z)=z163+z7+z6+z3+1. A routine calculation shows that Tr(zi)=1 if and only if i ∈ {0,157}. As examples, Tr(z160+z46)=0, Tr(z157+z46)=1, and Tr(z157+z46+1)=0.

Solving the quadratic equation

The first step of point halving seeks a solutionx of a quadratic equation of the form

x2+x =c over F2m. The time performance of this step is crucial in obtaining an

efficient point halving.

Definition 3.83 Let m be an odd integer. Thehalf-trace function H :F2m →F2m is

defined by

H(c)=

(m−1)/2

i=0 c22i.

Lemma 3.84(properties of the half-trace function) Letmbe an odd integer. (i) H(c+d)=H(c)+H(d)for allc,d∈F2m.

(ii) H(c)is a solution of the equationx2+x=c+Tr(c). (iii) H(c)=H(c2)+c+Tr(c)for allc∈F2m.

Letc=mi=01cizi ∈F2m with Tr(c)=0; in particular, H(c)is a solution ofx2+

x=c. A simple method for finding H(c)directly from the definition requiresm−1 squarings and(m−1)/2 additions. If storage for{H(zi):0i<m}is available, then Lemma 3.84(i) may be applied to obtain

H(c)=H m−1 i=0 cizi = m−1 i=0 ciH(zi).

However, this requires storage formfield elements, and the associated method requires an average ofm/2 field additions.

Lemma 3.84 can be used to significantly reduce the storage required as well as the time needed to solve the quadratic equation. The basic strategy is to write H(c)= H(c)+swherechas fewer nonzero coefficients thanc. For eveni, note that

H(zi)=H(zi/2)+zi/2+Tr(zi).

Algorithm 3.85 is based on this observation, eliminating storage ofH(zi)for all even

i. Precomputation builds a table of(m−1)/2 field elementsH(zi)for oddi, and the algorithm is expected to have approximatelym/4 field additions at step 4. The terms involving Tr(zi)andH(1)have been discarded, since it suffices to produce a solution

s∈ {H(c),H(c)+1}ofx2+x=c.

Algorithm 3.85Solvex2+x=c(basic version)

INPUT:c=mi=01cizi ∈F2m wheremis odd and Tr(c)=0.

OUTPUT: A solutionsofx2+x=c.

1. PrecomputeH(zi)for oddi, 1≤im−2. 2. s←0.

3. Fori from(m−1)/2 downto 1 do

3.1 Ifc2i =1 then do:cc+zi,ss+zi. 4. ss+ (m1)/2 i=1 c2i−1H(z2i−1). 5. Return(s).

Further improvements are possible by use of Lemma 3.84 together with the reduction polynomial f(z). Letibe odd, and define jandsby

The basic idea is to apply Lemma 3.84(iii) jtimes, obtaining

H(zi)=H(z2ji)+z2j−1i+ ··· +z4i+z2i+zi+jTr(zi). (3.43) Let f(z)=zm+r(z), wherer(z)=zb+···+zb1+1 and 0<b1<···<b<m. Then

H(z2ji)=H(zsr(z))=H(zs+b)+H(zs+b−1)+ ··· +H(zs+b1)+H(zs).

Thus, storage forH(zi)may be exchanged for storage ofH(zs+e)fore∈ {0,b1,...,b}

(some of which may be further reduced). The amount of storage reduction is limited by dependencies among elementsH(zi).

If degr <m/2, the strategy can be applied in an especially straightforward fashion to eliminate some of the storage forH(zi)in Algorithm 3.85. Form/2<i<m−degr,

H(zi)=H(z2i)+zi+Tr(zi)

=H(r(z)z2im)+zi+Tr(zi)

=H(z2im+b+ ···+z2im+b1+z2im)+zi+Tr(zi).

Since 2im+degr<i, the reduction may be applied to eliminate storage ofH(zi)for oddi,m/2<i<m−degr. If degr is small, Algorithm 3.86 requires approximately

m/4 elements of storage.

Algorithm 3.86Solvex2+x=c

INPUT:c=im=01cizi∈F2m wheremis odd and Tr(c)=0, and reduction polynomial f(z)=zm+r(z).

OUTPUT: A solutionsofx2+x=c.

1. PrecomputeH(zi)foriI0∪I1, whereI0andI1consist of the odd integers in [1,(m−1)/2]and[m−degr,m−2], respectively.

2. s←0.

3. For each oddi((m−1)/2,m−degr), processed in decreasing order, do: 3.1 Ifci =1 then do:cc+z2im+b+ ··· +z2im,ss+zi.

4. Fori from(m−1)/2 downto 1 do:

4.1 Ifc2i =1 then do:cc+zi,ss+zi.

5. ss+

iI0∪I1

ciH(zi). 6. Return(s).

The technique may also reduce the time required for solving the quadratic equation, since the cost of reducing each H(zi)may be less than the cost of adding a precom- puted value ofH(zi)to the accumulator. Elimination of the even terms (step 4) can be implemented efficiently. Processing odd terms (as in step 3) is more involved, but will be less expensive than a field addition if only a few words must be updated.

Example 3.87 (Algorithm 3.86 for the fieldF2163) ConsiderF2163 with reduction poly-

nomial f(z)=z163+z7+z6+z3+1. Step 3 of Algorithm 3.86 begins withi =155. By Lemma 3.84,

H(z155)=H(z310)+z155+Tr(z155)

=H(z147z163)+z155

=H(z147(z7+z6+z3+1))+z155.

Ifc155=1, thenz154+z153+z150+z147is added toc, andz155is added tos. In this

fashion, storage for H(zi)is eliminated fori ∈ {83,85,...,155}, the odd integers in

((m−1)/2,m−degr).

Algorithm 3.86 uses 44 field elements of precomputation. While this is roughly half that required by the basic algorithm, it is not minimal. For example, storage forH(z51)

may be eliminated, since

H(z51)=H(z102)+z51+Tr(z51)

=H(z204)+z102+z51+Tr(z102)+Tr(z51)

=H(z163z41)+z102+z51

=H(z48+z47+z44+z41)+z102+z51

which corresponds to equation (3.43) with j=2. The same technique eliminates stor- age forH(zi),i∈ {51,49,...,41}. Similarly, if (3.43) is applied withi=21 and j=3, then

H(z21)=H(z12+z11+z8+z5)+z84+z42+z21.

Note that the odd exponents 11 and 5 are less than 21, and hence storage for H(z21)

may be eliminated.

In summary, the use of (3.43) with j ∈ {1,2,3} eliminates storage for odd val- ues ofi ∈ {21,41,...,51,83,...,155}, and a corresponding algorithm for solving the quadratic equation requires 37 elements of precomputation. Further reductions are pos- sible, but there are some complications since the formula for H(zi)involvesH(zj)for

j>i. As an example,

H(z23)=H(z28+z27+z24+z21)+z92+z46+z23

and storage for H(z23) may be exchanged for storage on H(z27). These strategies reduce the precomputation to 30 field elements, significantly less than the 44 used in Algorithm 3.86. In fact, use of

zn=z157+n+zn+1+zn−3+zn−6

together with the previous techniques reduces the storage to 21 field elementsH(zi)for

this final reduction comes at a somewhat higher cost in required code compared with the 30-element version.

Experimentally, the algorithm for solving the quadratic equation (with 21 or 30 ele- ments of precomputation) requires approximately 2/3 the time of a field multiplication. Special care should be given to branch misprediction factors (§5.1.4) as this algorithm performs many bit tests.

Computing square roots inFFF

2m

The basic method for computing√c, wherec∈F2m, is based on the little theorem of

Fermat:c2m=c. Then√ccan be computed as√c=c2m−1, requiringm−1 squarings. A more efficient method is obtained from the observation that√ccan be expressed in terms of the square root of the elementz. Letc=mi=01cizi ∈F2m,ci∈ {0,1}. Since

squaring is a linear operation inF2m, the square root ofccan be written as √ c= m−1 i=0 cizi 2m−1 = m−1 i=0 ci(z2 m−1 )i. Splittingcinto even and odd powers, we have

c= (m−1)/2 i=0 c2i(z2 m−1 )2i+ (m−3)/2 i=0 c2i+1(z2 m−1 )2i+1 = (m−1)/2 i=0 c2izi+ (m−3)/2 i=0 c2i+1z2 m−1 zi = ieven ciz i 2+√z iodd ciz i−1 2 .

This reveals an efficient method for computing√c: extract the two half-length vectors

ceven=(cm−1,...,c4,c2,c0)andcodd=(cm−2,...,c5,c3,c1)from c(assumingm is

odd), perform a field multiplication of codd of length m/2 with the precomputed

value√z, and finally add this result withceven. The computation is expected to require

approximately half the time of a field multiplication.

In the case that the reduction polynomial f is a trinomial, the computation of√c

can be further accelerated by the observation that an efficient formula for√z can be derived directly from f. Let f(z)=zm+zk+1 be an irreducible trinomial of degree

m, wherem>2 is prime.

Consider the case thatkis odd. Note that 1≡zm+zk (mod f(z)). Then multiplying byzand taking the square root, we get

zzm+21+z k+1

Thus, the product √z·codd requires two shift-left operations and one modular

reduction.

Now supposek is even. Observe that zm zk+1 (mod f(z)). Then dividing by

zm−1and taking the square root, we get

zzm−21(z k

2+1) (mod f(z)).

In order to computezs modulo f(z), wheres= m21, one can use the congruences

ztzkt+zmt (mod f(z))for 1≤tk for writing zs as a sum of few pos- itive powers of z. Hence, the product√z·codd can be performed with few shift-left

operations and one modular reduction.

Example 3.88(square roots inF2409) The reduction polynomial for the NIST recom-

mended finite fieldF2409is the trinomial f(z)=z409+z87+1. Then, the new formula

for computing the square root ofc∈F2409is

c=ceven+z205·codd+z44·codd mod f(z).

Example 3.89(square roots inF2233) The reduction polynomial for the NIST recom-

mended finite fieldF2233 is the trinomial f(z)=z233+z74+1. Sincek=74 is even,

we have √z =z−116·(z37+1) mod f(z). Note that z−74≡ 1+z159 (mod f(z))

and z−42 ≡ z32+z191 (mod f(z)). Then one gets that z−116 ≡ z32+z117+z191 (mod f(z)). Hence, the new method for computing the square root ofc∈F2233 is

c=ceven+(z32+z117+z191)(z37+1)·codd mod f(z).

Compared to the standard method of computing square roots, the proposed technique eliminates the need of storage and replaces the required field multiplication by a faster operation. Experimentally, finding a root in Example 3.89 requires roughly 1/8 the time of a field multiplication.

3.6.3

Point multiplication

Halve-and-add variants of the point multiplication methods discussed in §3.3 replace most point doublings with halvings. Depending on the application, it may be necessary to convert a given integerk=(kt−1,...,k0)2for use with halving-based methods. Ifk

is defined by

kkt1/2t−1+ ··· +k2/22+k1/2+k0 (modn)

then k P=ti=10ki/2iP; i.e.,(kt1,...,k0) is used by halving-based methods. This can be generalized to width-wNAF.

Lemma 3.90 Letti=0ki2i be thew-NAF representation of 2t−1k modn. Then kt−1 i=0 kt1i 2i +2k t (modn).

Proof: We have 2t−1kti=0ki2i (modn). Sincenis prime, the congruence can be divided by 2t−1to obtain kt i=0 ki 2t−1−it−1 i=0 kt1i 2i +2k t (modn).

Algorithm 3.91 presents a right-to-left version of the halve-and-add method with the input 2t−1kmodnrepresented inw-NAF. Point halving occurs on the input P rather than on accumulators. Note that the coefficientkt is handled separately in step 2 as it corresponds to the special term 2kt ink. The expected running time is approximately

(step 4 cost)+(t/(w+1)−2w−2)A+t H (3.44) where H denotes a point halving and A is the cost of a point addition when one of the inputs is in λ-representation. If projective coordinates are used for Qi, then the additions in steps 3.1 and 3.2 are mixed-coordinate. Step 4 may be performed by con- version of Qi to affine (with cost I+(5·2w−2−3)M if inverses are obtained by a simultaneous method), and then the sum is obtained by interleaving with appropriate signed-digit representations of the odd multipliersi. The cost of step 4 for 2≤w≤5 is approximatelyw−2 point doublings and 0, 2, 6, or 16 point additions, respectively.3 Algorithm 3.91Halve-and-addw-NAF (right-to-left) point multiplication

INPUT: Window widthw, NAFw(2t−1kmodn)=ti=0ki2i,PG. OUTPUT:k P. (Note:k=k0/2t−1+ ··· +kt2/2+kt1+2kt modn.)

1. SetQi←∞foriI = {1,3,...,2w−1−1}. 2. Ifkt=1 thenQ1=2P.

3. Fori fromt−1 downto 0 do: 3.1 Ifki>0 thenQk iQki+P. 3.2 Ifki<0 thenQk iQkiP. 3.3 PP/2. 4. QiIi Qi. 5. Return(Q).

3Knuth suggests calculatingQ

iQi+Qi+2forifrom 2w−1−3 to 1, and then the result is given by

Consider the casew=2. The expected running time of Algorithm 3.91 is then ap- proximately(1/3)t A+t H. If affine coordinates are used, then a point halving costs approximately 2M, while a point addition costs 2M+V since theλ-representation of

Pmust be converted to affine with one field multiplication. It follows that the field op- eration count with affine coordinates is approximately(8/3)t M+(1/3)t V. However, ifQis stored in projective coordinates, then a point addition requires 9M. The field op- eration count of a mixed-coordinate Algorithm 3.91 withw=2 is then approximately 5t M+(2M+I).

Algorithm 3.92 is a left-to-right method. Point halving occurs on the accumula- tor Q, whence projective coordinates cannot be used. The expected running time is approximately

(D+(2w−2−1)A)+(t/(w+1)A+t H). (3.45) Algorithm 3.92Halve-and-addw-NAF (left-to-right) point multiplication

INPUT: Window widthw, NAFw(2t−1kmodn)=ti=0ki2i,PG. OUTPUT:k P. (Note:k=k0/2t−1+ ··· +kt2/2+kt1+2kt modn.)

1. ComputePi=i P, fori ∈ {1,3,5,...,2w−1−1}. 2. Q←∞. 3. Fori from 0 tot−1 do 3.1 QQ/2. 3.2 Ifki>0 thenQQ+Pk i. 3.3 Ifki<0 thenQQPk i. 4. Ifkt=1 thenQQ+2P. 5. Return(Q).

Analysis

In comparison to methods based on doubling, point halving looks best when I/M is small and k P is to be computed for P not known in advance. In applications, the operations k P andk P+l Q with P known in advance are also of interest, and this section provides comparative results. The concrete example used is the NIST random curve overF2163 (§A.2.2), although the general conclusions apply more widely.

Example 3.93(double-and-add vs. halve-and-add) Table 3.11 provides an operation count comparison between double-and-add and halve-and-add methods for the NIST random curve overF2163. For the field operations, the assumption is thatI/M=8 and

that a field division has costI+M.

The basic NAF halving method is expected to outperform the w-NAF doubling methods. However, the halving method has 46 field elements of precomputation. In contrast, Algorithm 3.36 withw=4 (which runs in approximately the same time as withw=5) requires only six field elements of extra storage.

Storage Point Field operations (H=2M,I/M=8) Method (field elts) operations affine projective

NAF, doubling (Algorithm 3.36) 0 163D+54A 217(M+V)=2173 1089M+I=1097 NAF, halving (Algorithm 3.91) 46 163H+54A 435M+54V= 924 817M+I= 825 5-NAF, doubling (Algorithm 3.36) 14 [D+7A]+163D+27A 198(M+V)=1982 879M+8V+I= 959 4-NAF, halving (Algorithm 3.91) 55 [3D+6A]+163H+30A — 671M+2I= 687 5-NAF, halving (Algorithm 3.92) 60 [D+7A]+163H+27A 388M+35V= 705 —

Table 3.11. Point and field operation counts for point multiplication for the NIST random curve overF2163. Halving uses 30 field elements of precomputation in the solve routine, and 16 el-

ements for square root. A=A+M, the cost of a point addition when one of the inputs is in

λ-representation. Field operation counts assume that a divisionV costsI+M.

The left-to-rightw-NAF halving method requires that the accumulator be in affine coordinates, and point additions have cost 2M+V (since a conversion from λ- representation is required). For sufficiently large I/M, the right-to-left algorithm will be preferred; in the example, Algorithm 3.91 with w = 2 will outperform Algorithm 3.92 at roughlyI/M=11.

For point multiplicationk P where Pis not known in advance, the example case in Table 3.11 predicts that use of halving gives roughly 25% improvement over a similar method based on doubling, whenI/M=8.

The comparison is unbalanced in terms of storage required, since halving was per- mitted 46 field elements of precomputation in the solve and square root routines. The amount of storage in square root can be reduced at tolerable cost to halving; significant storage (e.g., 21–30 elements) for the solve routine appears to be essential. It should be noted, however, that the storage for the solve and square root routines is per field. In addition to the routines specific to halving, most of the support for methods based on doubling will be required, giving some code expansion.

Random curves vs. Koblitz curves The τ-adic methods on Koblitz curves (§3.4) share strategy with halving in the sense that point doubling is replaced by a less- expensive operation. In the Koblitz curve case, the replacement is the Frobenius map

τ:(x,y)(x2,y2), an inexpensive operation compared to field multiplication. Point multiplication on Koblitz curves usingτ-adic methods will be faster than those based on halving, with approximate cost fork Pgiven by

2w−2−1+ t

w+1

A+t·(cost ofτ)

when using a width-w τ-adic NAF in Algorithm 3.70. To compare with Table 3.11, assume that mixed coordinates are used,w=5, and that field squaring has approximate

cost M/6. In this case, the operation count is approximately 379M, significantly less than the 687M required by the halving method.

Known point vs. unknown point In the case thatPis known in advance (e.g., signa- ture generation in ECDSA) and storage is available for precomputation, halving loses some of its performance advantages. For our case, and for relatively modest amounts of storage, the single-table comb method (Algorithm 3.44) is among the fastest and can be used to obtain meaningful operation count comparisons. The operation counts for

k Pusing methods based on doubling and halving are approximately