Custom Hardware Acceleration - Cryptographic coprocessors for embedded systems

Algorithm                           Main loop     192 bit      256 bit      521 bit
                                                  Time (ms)    Time (ms)    Time (ms)
D&A (Alg. 6)                        {16M, 18A}    1060         1976         20330
ML co-Z (Alg. 8)                    {16M, 21A}    1083         2033         21220
Joye I (Alg. 11)                    {16M, 21A}    1085         2036         21286
Joye II (Alg. 12)                   {16M, 27A}    1058         1981         20157
ML (X,Z) (Alg. 9) (Alg. 26 & 33)    {17M, 14A}    1134         2141         22355
ML (X,Z) (Alg. 9) (Alg. 27 & 33)    {16M, 14A}    1056         1980         20683
ML (X,Z) (Alg. 9) (Alg. 28 & 34)    {15M, 13A}    991          1876         19299
ML (X,Y) (Alg. 13)                  {14M, 39A}    935          1753         17726
SD (X,Y) (Alg. 14)                  {14M, 28A}    924          1737         17668

Table 3.2: Software implementation results on a Microblaze processor.

3.6 Custom Hardware Acceleration

The previous section presented baseline results for a software-based design. Because FPGAs are configurable, they offer many options for increasing the performance of a design. The Microblaze processor supports the addition of custom instructions through an FSL bus, as described in Section 2.9.2. This section analyses the use of a custom multiply instruction for the Microblaze.

In Section 3.2, various ECC algorithms were introduced that remove the need to perform finite field inversions during the main loop of the scalar multiplication algorithm. The main loop then consists of additions, subtractions, and multiplications over $\mathbb{F}_q$, of which the multiplications are the most computationally intensive. When deciding which operation to implement in custom hardware, it therefore makes sense to offload the multiplication, as it has the greatest impact on the computation time of the ECC algorithms. This can be seen in Table 3.2, where the algorithm with the fewest multiplications, SD (X,Y) (Alg. 14), has the best performance.

3.6.1 Montgomery Multiplication in Hardware

To implement the Montgomery multiplication in hardware, a modified version of the algorithm is used. From [124], we know that when it is used as part of the computation of Q = kP in ECC algorithms, the conditional subtraction at the end of the Montgomery multiplication algorithm is not required. Algorithm 16 shows how the Montgomery multiplication is performed without a conditional subtraction. The algorithm is more suited to a hardware implementation than Algorithm 15.

Algorithm 16 Montgomery multiplication.

Input: $A' = \sum_{i=0}^{l} a'_i 2^i$, $B' = \sum_{i=0}^{l} b'_i 2^i$, $q$
Output: $R' = A' \cdot B' \cdot 2^{-(l+2)} \pmod{q}$
1: $R' = 0$, $a'_{l+1} = b'_{l+1} = 0$
2: for $i = 0$ to $l + 1$ do
3:     $t_i = R'_{i-1} + b'_i A' \pmod{2}$
4:     $R'_i = (R'_{i-1} + t_i q + b'_i A') / 2$
5: end for
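As a rough software model of Algorithm 16, the following Python sketch processes one bit of B′ per loop iteration and omits the final conditional subtraction. The names `mont_mul_no_sub` and `to_montgomery` are illustrative and do not come from the source, and the NIST P-192 prime is used only as an assumed example modulus. The sketch assumes q is odd, q < 2^l, and both inputs are below 2q; under those assumptions the output also stays below 2q.

```python
def mont_mul_no_sub(a, b, q, l):
    """Bit-serial Montgomery product a*b*2^-(l+2) mod q, no final subtraction.

    Assumes q is odd, q < 2**l and 0 <= a, b < 2*q; the result is then < 2*q.
    """
    r = 0
    for i in range(l + 2):                  # i = 0 .. l+1, as in Algorithm 16
        b_i = (b >> i) & 1                  # current bit of B'
        t_i = (r + b_i * a) & 1             # parity bit that makes the sum even
        r = (r + t_i * q + b_i * a) >> 1    # exact division by 2
    return r


def to_montgomery(x, q, l):
    """Map x to x * 2^(l+2) mod q (hypothetical helper for this demo)."""
    return (x << (l + 2)) % q


# Demo modulus: the NIST P-192 prime (an assumption; the text only says "192 bit").
Q192 = 2**192 - 2**64 - 1
L192 = 192

x, y = 0x1234567890ABCDEF, 0xFEDCBA0987654321
xm = to_montgomery(x, Q192, L192)
ym = to_montgomery(y, Q192, L192)
zm = mont_mul_no_sub(xm, ym, Q192, L192)        # x*y in Montgomery form, < 2q
z = mont_mul_no_sub(zm, 1, Q192, L192) % Q192   # leave Montgomery form, reduce once
```

Multiplying by 1 removes the factor 2^(l+2), so `z` equals `x * y % Q192`; only this last step performs a full reduction, while intermediate products are allowed to stay in the range [0, 2q).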

A circuit for performing a Montgomery multiplication is shown in Figure 3.2. The design requires two (l + 2)-bit full adders, which form the critical path of the design.

[Block diagram. Labels: A′, B′, q, R′_{i−1}, R′_i; bus widths l + 2; shift.]

Figure 3.2: Montgomery multiplier.

3.6.2 Instruction Set Extension Results

The Montgomery multiplier was connected via an FSL bus to the Microblaze processor, as shown in Figure 3.3. The clock signal for the multiplier is supplied by the Microblaze through the FSL bus; the multiplier therefore runs at the same frequency as the Microblaze. In this setup, the multiplier forms the critical path in the design and hence determines the clock frequency for the entire system. For the 192 bit design, the system clock frequency was set to 100 MHz; for the 256 and 521 bit designs, it was set to 75 MHz. A pipeline register was added to the multiplier for the 521 bit implementation in order to reduce its critical path, which doubles the number of clock cycles needed to perform a multiplication. The FPGA area usage for each complete system and for the multipliers alone is shown in Table 3.3. A timer and a debug module were also included in the design, to measure computation times and to provide command line output.
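To give a feel for how clock frequency and the extra pipeline stage interact, the following Python sketch estimates the latency of one hardware multiplication. It assumes one iteration of Algorithm 16 per clock cycle (l + 2 cycles in total), which the text does not state explicitly, and it ignores the time needed to move operands over the FSL bus. The function names are illustrative.

```python
def mult_cycles(l, pipelined=False):
    """Clock cycles for one multiplication.

    Assumption: one loop iteration of Algorithm 16 per cycle, i.e. l + 2 cycles;
    the pipeline register doubles the count, as described for the 521 bit design.
    """
    cycles = l + 2
    return 2 * cycles if pipelined else cycles


def mult_time_us(l, f_mhz, pipelined=False):
    """Estimated multiplication latency in microseconds at f_mhz MHz."""
    return mult_cycles(l, pipelined) / f_mhz


# (field size in bits, clock in MHz, pipelined?) taken from the text above.
CONFIGS = [(192, 100.0, False), (256, 75.0, False), (521, 75.0, True)]

for bits, f_mhz, pipelined in CONFIGS:
    print(f"{bits} bit: {mult_cycles(bits, pipelined)} cycles, "
          f"{mult_time_us(bits, f_mhz, pipelined):.2f} us")
```

Under these assumptions the estimate covers the multiplier datapath only; bus transfers and software overhead on the Microblaze would add to the time seen by the processor.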

[Block diagram. Labels: FPGA chip, Microblaze, BRAM, DDR2 RAM, memory controller, xps timer, custom peripherals, PLB bus, FSL bus, 192/256/521 bit Montgomery Multiplier.]

Figure 3.3: Microblaze with hardware multiplier.

Table 3.4 shows the results obtained from the Microblaze with hardware acceleration. The timings cover a full computation of kP, including all conversions to and from affine coordinates and any precomputations required by each algorithm. The results assume that the point P is unknown, so no values that could be precomputed and stored in RAM are used. Comparing Table 3.4 with Table 3.2, it can be seen that the hardware multiplier reduces the computation time of the different algorithms by 89-94% on average. The large reduction can be attributed to the fact that the hardware multiplier performs both the multiplication and the Montgomery modular reduction, which are more time consuming than modular additions.

Design                Area        BRAM         DDR2 RAM    DSP48E    Freq.
                      (Slices)                                       (MHz)
Microblaze 192 bit    3145        65 × 36 k    256 MB      3         100
Microblaze 256 bit    3454        65 × 36 k    256 MB      3         75
Microblaze 521 bit    3466        65 × 36 k    256 MB      3         75
192 bit mult          334         0            0           0         100
256 bit mult          499         0            0           0         75
521 bit mult          904         0            0           0         75

Table 3.3: Microblaze FPGA resource usage.
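The comparison between Table 3.2 and Table 3.4 can be recomputed with a short Python sketch. The timings are copied from the two tables; the dictionary layout and function names are illustrative, and the exact percentages depend on how the averaging is done, so they need not coincide with the range quoted in the text.

```python
FIELD_SIZES = (192, 256, 521)

# Times in ms for 192, 256 and 521 bit, copied from Table 3.2 (software only).
SOFTWARE_MS = {
    "D&A (Alg. 6)": (1060, 1976, 20330),
    "ML co-Z (Alg. 8)": (1083, 2033, 21220),
    "Joye I (Alg. 11)": (1085, 2036, 21286),
    "Joye II (Alg. 12)": (1058, 1981, 20157),
    "ML (X,Z) (Alg. 26 & 33)": (1134, 2141, 22355),
    "ML (X,Z) (Alg. 27 & 33)": (1056, 1980, 20683),
    "ML (X,Z) (Alg. 28 & 34)": (991, 1876, 19299),
    "ML (X,Y) (Alg. 13)": (935, 1753, 17726),
    "SD (X,Y) (Alg. 14)": (924, 1737, 17668),
}

# Times in ms copied from Table 3.4 (with the hardware Montgomery multiplier).
HARDWARE_MS = {
    "D&A (Alg. 6)": (94, 228, 1534),
    "ML co-Z (Alg. 8)": (113, 296, 2207),
    "Joye I (Alg. 11)": (114, 297, 2221),
    "Joye II (Alg. 12)": (96, 252, 1686),
    "ML (X,Z) (Alg. 26 & 33)": (74, 203, 1176),
    "ML (X,Z) (Alg. 27 & 33)": (71, 166, 1175),
    "ML (X,Z) (Alg. 28 & 34)": (71, 185, 1172),
    "ML (X,Y) (Alg. 13)": (97, 250, 1638),
    "SD (X,Y) (Alg. 14)": (87, 225, 1506),
}


def reduction_percent(sw_ms, hw_ms):
    """Percentage by which the hardware time undercuts the software time."""
    return 100.0 * (sw_ms - hw_ms) / sw_ms


def reductions(name):
    """Reductions for one algorithm, one value per field size."""
    return [reduction_percent(s, h)
            for s, h in zip(SOFTWARE_MS[name], HARDWARE_MS[name])]


for name in SOFTWARE_MS:
    cells = ", ".join(f"{bits} bit: {r:.1f}%"
                      for bits, r in zip(FIELD_SIZES, reductions(name)))
    print(f"{name:26s} {cells}")
```

Printing the per-cell values rather than a single average makes it easy to see which algorithms and field sizes benefit most from the multiplier.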

In the software implementation results in Table 3.2, SD (X,Y) (Alg. 14) was the fastest algorithm, because it requires the fewest multiplications. With the addition of the hardware Montgomery multiplier, the ML (X,Z) (Alg. 9) (Alg. 27 & 33) and ML (X,Z) (Alg. 9) (Alg. 28 & 34) algorithms perform best. These algorithms require two and one extra multiplications in the main loop, respectively, compared with SD (X,Y) (Alg. 14) (16M and 15M against 14M); however, the number of additions is reduced from 28A to 14A and 13A. When implemented in software, the multiplication is the dominant factor in the computation time and additions account for only a small percentage of it. The inclusion of the hardware multiplier reduces the computation times to the point where the additions also have a noticeable impact on performance, thus changing the ranking of the algorithms.
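The change in ranking can be illustrated with a simple weighted operation count. The Python sketch below uses the main loop counts from Tables 3.2 and 3.4, but the relative costs of a multiplication and an addition (`SOFTWARE_LIKE` and `HARDWARE_LIKE`) are hypothetical values chosen for illustration, not measurements. The model also ignores coordinate conversions, precomputation and bus overhead, so it does not reproduce the full ordering in Table 3.4.

```python
# (multiplications, additions) in the main loop, from Tables 3.2 and 3.4.
MAIN_LOOP_OPS = {
    "D&A (Alg. 6)": (16, 18),
    "ML co-Z (Alg. 8)": (16, 21),
    "Joye I (Alg. 11)": (16, 21),
    "Joye II (Alg. 12)": (16, 27),
    "ML (X,Z) (Alg. 26 & 33)": (17, 14),
    "ML (X,Z) (Alg. 27 & 33)": (16, 14),
    "ML (X,Z) (Alg. 28 & 34)": (15, 13),
    "ML (X,Y) (Alg. 13)": (14, 39),
    "SD (X,Y) (Alg. 14)": (14, 28),
}

# Hypothetical cost of one multiplication, in units of one addition.
SOFTWARE_LIKE = 100.0   # assumption: multiplication dominates in software
HARDWARE_LIKE = 2.0     # assumption: multiplier makes M nearly as cheap as A


def loop_cost(mults, adds, mul_cost, add_cost=1.0):
    """Weighted operation count for one pass through the main loop."""
    return mults * mul_cost + adds * add_cost


def ranking(mul_cost):
    """Algorithm names sorted from cheapest to most expensive main loop."""
    return sorted(MAIN_LOOP_OPS,
                  key=lambda name: loop_cost(*MAIN_LOOP_OPS[name], mul_cost))


print("multiplication-dominated:", ranking(SOFTWARE_LIKE)[0])
print("cheap multiplication:    ", ranking(HARDWARE_LIKE)[0])
```

With an expensive multiplication the 14M algorithms come out ahead, while with a cheap multiplication the lower addition count of the ML (X,Z) variants decides the ranking, which mirrors the qualitative argument above.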

Algorithm                           Main loop     192 bit      256 bit      521 bit
                                                  Time (ms)    Time (ms)    Time (ms)
D&A (Alg. 6)                        {16M, 18A}    94           228          1534
ML co-Z (Alg. 8)                    {16M, 21A}    113          296          2207
Joye I (Alg. 11)                    {16M, 21A}    114          297          2221
Joye II (Alg. 12)                   {16M, 27A}    96           252          1686
ML (X,Z) (Alg. 9) (Alg. 26 & 33)    {17M, 14A}    74           203          1176
ML (X,Z) (Alg. 9) (Alg. 27 & 33)    {16M, 14A}    71           166          1175
ML (X,Z) (Alg. 9) (Alg. 28 & 34)    {15M, 13A}    71           185          1172
ML (X,Y) (Alg. 13)                  {14M, 39A}    97           250          1638
SD (X,Y) (Alg. 14)                  {14M, 28A}    87           225          1506

Table 3.4: Microblaze with Montgomery multiplier results.¹

¹ It should be noted that the times given in this table are average computation times, averaged over 1000 executions of each algorithm. As the results are from a software implementation, the computation time of each operation can vary slightly between different executions.
