

4.1.3 WG hardware implementations

The first member of the WG stream cipher family to be implemented in hardware was the eSTREAM candidate WG-29 [7]. For $\mathbb{F}_{2^{29}}$, a type II optimal normal basis exists, which allows efficient field arithmetic. In [111], properties of the trace function were found that allowed the elimination of two multipliers. By switching to a polynomial basis representation of field elements, the same group later improved their WG-29 implementation results in [112]. The same paper also reports efficient polynomial basis implementations of WG-16.

An implementation of the lightweight WG stream cipher WG-5, targeting passive RFIDs, was reported in [68]. Selected metrics of the implementation results are shown in the upper part of Table 4.2, omitting the power metric and the optimality scores derived from the power results. A notable feature of this work is its parameter selection: the defining polynomial of $\mathbb{F}_{2^5}$, the characteristic polynomial of the LFSR and the decimation value were chosen not only for the resulting cryptographic properties but also to minimize the hardware cost. Based on ASIC implementation results at the chosen frequency of 100 kHz, WG-5 outperforms the ciphers it was compared to, including Grain and Trivium.

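| WG cipher | Source | ASIC technology | dec | Area [GE] | Throughput [kbps] | Radix | $T/A^2$ |
|---|---|---|---|---|---|---|---|
| WG-5 † | [68] | 130nm | 1 | 1229 | 100 | 1 | 66.2 |
| | | | 11 | 1235 | 100 | 1 | 65.6 |
| | | | 1 | 1350 | 200 | 2 | 110 |
| | | | 11 | 1360 | 200 | 2 | 108 |

| WG cipher | Source | ASIC technology | Architecture | Area [GE] | Speed [MHz] | Radix | $T/A$ |
|---|---|---|---|---|---|---|---|
| WG-8 ‡ | [113] | 65nm | CA | 1786 | 500 | 1 | 0.28 |
| | | | CA | 3942 | 610 | 11 | 1.70 |
| | | | TF1 | 7523 | 229 | 1 | 0.03 |
| | | | TF1 | 42762 | 122 | 11 | 0.03 |
| | | | TF2 | 3162 | 260 | 1 | 0.08 |
| | | | TF2 | 22668 | 205 | 11 | 0.10 |
| | | | TF3 | 2981 | 254 | 1 | 0.08 |
| | | | TF3 | 19882 | 205 | 11 | 0.11 |

† WG-5 with 80-bit key and IV, and LFSR of length 32

‡ WG-8 with 80-bit key and IV, decimation exponent 19, and LFSR of length 20

Table 4.2: Post-PAR CMOS implementation results for WG-5 and WG-8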

Another instance of the lightweight WG stream ciphers, WG-8, was reported in [113]. Selected metrics of the implementation results are shown in Table 4.2, again omitting the power metric and the optimality scores derived from the power results. The paper explores four different hardware architectures. The first is a constant array based design, denoted “CA” in Table 4.2, with one array holding the WGP values and one array holding the WGT values. Two tower constructions $\mathbb{F}_{(2^4)^2}$ were then implemented using different defining polynomials for the first extension, denoted “TF1” and “TF2” in Table 4.2. One of them used a polynomial basis for $\mathbb{F}_{2^4}$ with table look-up based field arithmetic, and the other a type I optimal normal basis, yielding efficient field arithmetic. The fourth design, denoted “TF3” in Table 4.2, used the tower construction $\mathbb{F}_{((2^2)^2)^2}$ with a normal basis representation of elements at each level of the tower, similar to the work in [8]. FPGA and ASIC implementation results were given for the 1-bit and 11-bit output versions of all four designs. Since the cipher is small enough, the best results were achieved by the table look-up based CA design.
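The constant array idea can be illustrated with a minimal sketch: every output of a Boolean function over $\mathbb{F}_{2^8}$ is precomputed once, so that evaluation becomes a single array access instead of field arithmetic. The sketch below is not the WG-8 design from [113]. The defining polynomial `POLY` is an assumption chosen for illustration, and `toy_transform` (the trace of $x^{19}$) is only a stand-in for the real WG transformation; the exponent 19 is the decimation exponent quoted in the footnote of Table 4.2.

```python
M = 8
POLY = 0x11D   # x^8 + x^4 + x^3 + x^2 + 1 (assumed, not taken from [113])
DEC = 19       # decimation exponent quoted for WG-8 in Table 4.2

def gf_mul(a, b):
    """Multiply two elements of F_{2^M} in polynomial basis (shift-and-add)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):   # degree reached M: reduce by the defining polynomial
            a ^= POLY
    return r

def gf_pow(x, e):
    """Raise x to the power e by square-and-multiply."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, x)
        x = gf_mul(x, x)
        e >>= 1
    return result

def trace(x):
    """Tr(x) = x + x^2 + x^4 + ... + x^(2^(M-1)); the result is 0 or 1."""
    t, y = 0, x
    for _ in range(M):
        t ^= y
        y = gf_mul(y, y)
    return t

def toy_transform(x):
    """Hypothetical stand-in for a WG transformation: Tr(x^DEC)."""
    return trace(gf_pow(x, DEC))

# The "constant array": all 2^M outputs computed once, ahead of time.
CONST_ARRAY = [toy_transform(x) for x in range(1 << M)]

def lookup(x):
    """Evaluate the function with one array access and no field arithmetic."""
    return CONST_ARRAY[x]
```

In hardware the array becomes a fixed block of combinational logic with $2^m$ entries, which is why this approach pays off only for small $m$.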

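| Source | Field construction | LFSR polynomial | Perf. goal | Modules used | Speed [GHz] | Area [kGE] | $o_1$ | $o_2$ |
|---|---|---|---|---|---|---|---|---|
| [115] | $\mathbb{F}_{(((2^2)^2)^2)^2}$ | $\ell_1(x)$ | f, o1 | L1 d5 s7 c8 | 2.17 | 22.5 | 9.6 | 4.3 |
| [115] | $\mathbb{F}_{(((2^2)^2)^2)^2}$ | $\ell_1(x)$ | A, o2 | L0 d1 s2 c2 | 0.88 | 10.9 | 8.1 | 7.4 |
| [115] | $\mathbb{F}_{(((2^2)^2)^2)^2}$ | $\ell_2(x)$ | f, o1 | N1 d5 s7 c8 | 2.13 | 18.0 | 11.8 | 6.6 |
| [115] | $\mathbb{F}_{(((2^2)^2)^2)^2}$ | $\ell_2(x)$ | A, o2 | N0 d1 s2 c2 | 0.93 | 11.5 | 8.1 | 7.0 |
| [115] | $\mathbb{F}_{(2^4)^4}$ | $\ell_3(x)$ | f | C2 d9 s13 c12 | 2.44 | 26.3 | 9.27 | 3.5 |
| [115] | $\mathbb{F}_{(2^4)^4}$ | $\ell_3(x)$ | A, o1, o2 | C1 d7 s11 c10 | 1.79 | 17.0 | 10.5 | 6.2 |
| [115] | $\mathbb{F}_{(2^4)^4}$ | $\ell_4(x)$ | f | T2 d9 s13 c12 | 2.38 | 27.0 | 8.8 | 3.3 |
| [115] | $\mathbb{F}_{(2^4)^4}$ | $\ell_4(x)$ | A, o1, o2 | T0 d7 s11 c10 | 2.08 | 20.8 | 10.0 | 4.8 |
| [8] | | | | M16/I8level | 0.55 | 12.0 | 4.6 | 3.8 |

For $\ell_1$, $\ell_2$, $\ell_3$, $\ell_4$ see Section 20.4.

Table 4.3: Post-PAR CMOS 65nm implementation results for the WG(16, 32) keystream generators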

WG-16 was studied in [8, 9, 111, 112, 114] and is intended for use in confidentiality and integrity algorithms in mobile communications, such as 4G-LTE networks [114]. As such, WG-16 has stronger security requirements than WG-5 and WG-8 and uses a 128-bit key and IV; it is not a lightweight instance. A new implementation of WG-16 [115] uses two different tower field constructions, $\mathbb{F}_{(((2^2)^2)^2)^2}$ and $\mathbb{F}_{(2^4)^4}$, and three new LFSR polynomials $\ell_2$, $\ell_3$ and $\ell_4$ in addition to the original polynomial $\ell_1$ used in [8, 111, 112, 114]. As shown in Table 4.3, the highest frequency WG(16, 32) keystream generator obtained for the 65 nm ASIC library reached a clock speed of 2.44 GHz at 26.3 kGE, and the smallest keystream generator a clock speed of 0.88 GHz at 10.9 kGE. Modules were synthesized for different performance goals: highest frequency $f$, smallest area $A$, and best optimality scores $o_1 = T \cdot f / A$ and $o_2 = T \cdot f / A^2$, where $T$ is the throughput in parcels per cycle. The highest frequency FPGA implementation on a Xilinx Spartan 6 reached a clock speed of 256 MHz using 631 slices, for the LFSR using $\ell_3$ and the $\mathbb{F}_{(2^4)^4}$ tower field construction.
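As a rough check, the two optimality scores can be recomputed from the speed and area columns of Table 4.3, reading them as $o_1 = T \cdot f / A$ and $o_2 = T \cdot f / A^2$. The sketch below assumes $T = 1$ for these modules and assumes display scale factors of 100 and 1000; neither is stated in the text, both are inferred from the table entries, and the function and constant names are hypothetical.

```python
O1_SCALE = 100    # assumed display scaling for o1 (inferred from Table 4.3)
O2_SCALE = 1000   # assumed display scaling for o2 (inferred from Table 4.3)

def optimality_scores(freq_ghz, area_kge, throughput=1.0):
    """Return (o1, o2) for a module, scaled as the table appears to be.

    freq_ghz   -- clock speed f in GHz
    area_kge   -- area A in kGE
    throughput -- T in parcels per cycle (assumed to be 1 here)
    """
    o1 = throughput * freq_ghz / area_kge
    o2 = throughput * freq_ghz / area_kge ** 2
    return o1 * O1_SCALE, o2 * O2_SCALE

# Two rows of Table 4.3: modules L1 d5 s7 c8 and L0 d1 s2 c2.
fast = optimality_scores(2.17, 22.5)
small = optimality_scores(0.88, 10.9)
```

Under these assumptions the two rows round to (9.6, 4.3) and (8.1, 7.4), which agrees with the table. The second score penalizes area quadratically, so the small module wins on $o_2$ even though the fast module wins on $o_1$.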

The authors of [112] used three different implementation strategies, a standard, a pipelined and a serialized design, coupled with two different multipliers. The results of [115] are compared to the implementations from [112] that use the Karatsuba multiplier [116]; these were both smaller and faster than the alternative and are the only ones reported in Table 4.4. The 65 nm ASIC results presented in [112] were obtained pre-PAR, so for a fair comparison Table 4.4 reports the pre-PAR results for the modules from [115] that use $\ell_1(x)$. L1d5s7c8 shows a speedup of 1.82 at a moderate 27% increase in area compared to the pipelined architecture from [112], and L0d1s2c2 a significant speedup of 5.75 at a 22% area increase compared to the standard architecture from [112]. The smallest (serialized) design from [112] takes only 51% of the area of L0d1s2c2, the smaller of the two implementations from [115]. However, the serialized architecture has a throughput of only 1/16 bit per clock cycle and, as pointed out in [117], even a lightweight stream cipher should have a throughput of at least 1 bit per clock cycle; WG-16 is not a lightweight stream cipher. Compared to the pre-PAR ASIC results for the ZUC implementation reported in [118], L1d5s7c8 reaches the same frequency at a slightly higher area but has significantly lower optimality scores due to its lower throughput.
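The speedup and area ratios quoted above can be recomputed from the speed and area columns of Table 4.4. The following is a minimal sketch; the dictionary keys and the helper `compare` are hypothetical names introduced for illustration.

```python
# (speed [GHz], area [kGE]) copied from Table 4.4
DESIGNS = {
    "L1d5s7c8": (2.50, 13.5),        # [115]
    "L0d1s2c2": (1.11, 9.81),        # [115]
    "PB standard": (0.19, 8.06),     # [112]
    "PB pipelined": (1.37, 10.6),    # [112]
    "PB serialized": (0.71, 5.03),   # [112]
}

def compare(design, reference):
    """Return (speedup, area ratio) of `design` relative to `reference`."""
    f_new, a_new = DESIGNS[design]
    f_ref, a_ref = DESIGNS[reference]
    return f_new / f_ref, a_new / a_ref

fast_vs_pipelined = compare("L1d5s7c8", "PB pipelined")
small_vs_standard = compare("L0d1s2c2", "PB standard")
serial_vs_small = compare("PB serialized", "L0d1s2c2")
```

With the rounded table entries, the first comparison gives a speedup of about 1.82 at an area ratio of about 1.27, and the serialized design comes out at about 0.51 of the area of L0d1s2c2. The second comparison gives an area ratio of about 1.22 and a speedup of roughly 5.8 rather than 5.75; the small difference presumably comes from the rounding of the speed column.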

An ongoing project, WG-lite Design Space Exploration [119], is exploring hardware implementations of WG ciphers defined over small finite fields $\mathbb{F}_{2^m}$, where $m = 5, 7, 8, 10, 11, 13, 14, 16$. From previous work on WG-5, WG-8, WG-16 and WG-29, it is known that for small instances, such as WG-5 and WG-8, constant array implementations yield a smaller area than implementations using discrete components, such as multipliers and exponentiations. One objective of the project is to find the threshold at which constant array implementations become larger than discrete component implementations. Implementation results for the WG permutation and transformation modules using polynomial bases, implemented using (a) discrete components and (b) constant arrays, are reported in [120]. A portion of these results is summarized in Table 4.5, which covers the non-decimated WG permutation modules using a polynomial basis over all existing primitive polynomials for the given finite field $\mathbb{F}_{2^m}$. The results are presented as the minimum and maximum area, followed by the mean and standard deviation. They show the tipping point between discrete component and constant array implementations at $\mathbb{F}_{2^{10}}$: constant arrays are still smaller for WG-10, while discrete components are smaller from WG-11 onward. Future work for the WG-lite

| Cipher and source | Basis used | Design used | Speed [GHz] | Area [kGE] | $o_1$ | $o_2$ |
|---|---|---|---|---|---|---|
| WG(16, 32) from [115], using $\mathbb{F}_{(((2^2)^2)^2)^2}$ and $\ell_1(x)$ | TFB | L1 d5 s7 c8 | 2.50 | 13.5 | 1.9 | 1.4 |
| | TFB | L0 d1 s2 c2 | 1.11 | 9.81 | 1.1 | 1.2 |
| WG(16, 32) from [112] | PB | standard | 0.19 | 8.06 | 0.2 | 0.3 |
| | PB | pipelined | 1.37 | 10.6 | 1.3 | 1.0 |
| | PB | serialized ∗ | 0.71 | 5.03 | 0.2 | 0.5 |
| Other ciphers: ZUC † [118] | | | 2.50 | 12.5 | 64.0 | 5.12 |

∗ throughput $T = 1/16$ bit per clock cycle

† throughput $T = 32$ bit per clock cycle, TSMC 65nm ASIC

Table 4.4: Pre-PAR CMOS 65nm implementation results for WG(16, 32) keystream generators

| Cipher | Discrete components: min | max | mean | sd | Constant array: min | max | mean | sd |
|---|---|---|---|---|---|---|---|---|
| WG-5 | 426 | 473 | 446 | 21 | 46 | 57 | 53 | 4 |
| WG-7 | 1157 | 1390 | 1316 | 66 | 247 | 268 | 257 | 6 |
| WG-8 | 1937 | 2078 | 2035 | 42 | 508 | 574 | 542 | 24 |
| WG-10 | 2520 | 3018 | 2928 | 87 | 2173 | 2257 | 2211 | 18 |
| WG-11 | 3134 | 3674 | 3567 | 81 | 5104 | 5470 | 5282 | 79 |
| WG-13 | 5642 | 6366 | 6170 | 99 | 18163 | 19731 | 18985 | 289 |
| WG-14 | 7136 | 8102 | 7885 | 116 | 35299 | 38164 | 37338 | 414 |
| WG-16 | 11052 | 12302 | 12036 | 145 | 127171 | 137624 | 135513 | 1596 |

Table 4.5: Pre-PAR CMOS 65nm implementation results for WGP modules with decimation 1: comparison of discrete component and constant array implementations from [120]
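The tipping point can be read off Table 4.5 mechanically. The sketch below uses the mean area columns only and searches for the smallest field size at which the constant array implementation becomes larger than the discrete component one; the names `MEAN_AREA` and `crossover` are hypothetical.

```python
# m -> (mean area with discrete components, mean area with constant arrays),
# copied from the "mean" columns of Table 4.5
MEAN_AREA = {
    5: (446, 53),
    7: (1316, 257),
    8: (2035, 542),
    10: (2928, 2211),
    11: (3567, 5282),
    13: (6170, 18985),
    14: (7885, 37338),
    16: (12036, 135513),
}

def crossover(table):
    """Return the smallest m whose constant array is larger than the
    discrete component implementation, or None if there is no such m."""
    for m in sorted(table):
        discrete, const_array = table[m]
        if const_array > discrete:
            return m
    return None

first_m_favoring_discrete = crossover(MEAN_AREA)
```

For the data in Table 4.5 the function returns 11, so $m = 10$ is the largest listed field size for which the constant array is still the smaller option. Using the minimum or maximum columns instead of the mean leads to the same answer, since the two area ranges do not overlap at $m = 10$ or $m = 11$.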