The implementation and performance/cost/power analysis of the network security accelerator on SoC applications


Ruei-Ting Gu

Department of Computer Science and Engineering, National Sun Yat-Sen University

Kaohsiung, Taiwan 804 [email protected]

Kuo-Huang Chung

Kaohsiung, Taiwan 804

[email protected]

Ing-Jer Huang

Kaohsiung, Taiwan 804 [email protected]

Abstract

The Internet is increasingly woven into daily life: more and more applications depend on it, and network security is becoming correspondingly more important. DES and AES are the most popular standards used in network security, and many protocols include them. Because enciphering is computation-intensive, hardware accelerators can be used to offload the CPU. We present an analysis of performance, cost, and power that is helpful in the early design stage of a network SoC. To obtain these measurements, we implement DES and AES hardware accelerators.

The experimental results show that only 2979 gates (DES) and 74345 gates (AES) are needed to run more than 3000 times faster than software, which makes the accelerators a worthwhile investment. The DES and AES accelerators run at 83 MHz and 70 MHz, respectively, with a 3.3 V core voltage, and add 12.86% and 64.16% to the power consumption. However, if DES accounts for more than 11.4% of the whole program, or AES for more than 39.1%, the system energy consumption goes down because the computing time is reduced.

Key words: network security, hardware accelerator, performance/cost/power analysis, network SoC chip.

1 Introduction

With the development of the Internet, more and more applications are tightly connected to it, and network security is becoming more and more important.

It is impossible to prevent someone from intercepting network packets, especially when the interception is deliberate. We still have to protect our data, so the only option is to encipher it so that whoever obtains the data cannot read it.

For enciphering, DES (Data Encryption Standard) [1] and AES (Advanced Encryption Standard) [2] are the most popular standards and are used in many protocols such as 802.11i and WAP. However, the computation is too demanding for a low-cost SoC, so a hardware accelerator is usually needed to offload the CPU. Some accelerators already exist, and we want to know their performance, cost, and power consumption, because this information is helpful in the early design stage of a network-security-specific SoC. To measure these quantities we implement DES and AES hardware accelerators. We compare their performance with a software implementation and analyze how much the performance improves, how much the hardware costs, and how much power it consumes.

The rest of this paper is organized as follows. Section 2 covers related work: it introduces the DES and AES standards, how they process data, and some existing hardware accelerators. Section 3 describes our hardware platform implementation used for the experiments and analysis. Section 4 presents the experimental results and analysis, and Section 5 concludes.

2 Related work

In this section we look at encryption for network data security. DES and AES are the most popular encryption standards, offering high speed and enough strength to protect information. Both are symmetric ciphers, which allows fast computation. Although a symmetric cipher may not offer the same strength as an asymmetric cipher such as RSA [3], using keys of 64 bits or more improves its security. Symmetric ciphers are also much easier to implement in hardware as accelerators.

2.1 DES algorithm

The DES cryptographic algorithm is used to protect civilian satellite communications, gateway servers, set-top boxes, Virtual Private Networks (VPN), video transmissions, and numerous other data transfer applications. The algorithm is designed to encipher and decipher blocks of data consisting of 64 bits under control of a 64-bit key.

Deciphering must be accomplished by using the same key as for enciphering, but with the schedule of addressing the key bits altered so that the deciphering process is the reverse of the enciphering process.

As CPU speeds keep rising (now over 3 GHz), DES alone may not be strong enough to protect data, which led to TDEA (Triple Data Encryption Algorithm, also called Triple-DES). TDEA applies DES three times with three keys: it enciphers the data with the 1st key, deciphers the result with the 2nd key, and enciphers again with the 3rd key. To decipher TDEA data, we decipher with the 3rd key first, then encipher with the 2nd key, and finally decipher with the 1st key to recover the original data.
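The encipher-decipher-encipher ordering described above can be sketched as follows. This is only an illustration of the key ordering: `toy_encipher` and `toy_decipher` are hypothetical stand-ins (an XOR followed by a rotation), not DES, and all function names are assumptions made for this sketch.

```python
MASK64 = (1 << 64) - 1  # TDEA works on 64-bit blocks, like DES


def toy_encipher(block, key):
    """Hypothetical invertible 64-bit block operation standing in for DES encipher."""
    x = (block ^ key) & MASK64
    return ((x << 7) | (x >> 57)) & MASK64  # rotate left by 7 bits


def toy_decipher(block, key):
    """Inverse of toy_encipher: rotate right by 7 bits, then XOR with the key."""
    x = ((block >> 7) | (block << 57)) & MASK64
    return x ^ key


def tdea_encipher(block, k1, k2, k3):
    # Encipher with key 1, decipher with key 2, encipher with key 3.
    return toy_encipher(toy_decipher(toy_encipher(block, k1), k2), k3)


def tdea_decipher(block, k1, k2, k3):
    # Reverse order: decipher with key 3, encipher with key 2, decipher with key 1.
    return toy_decipher(toy_encipher(toy_decipher(block, k3), k2), k1)


plaintext = 0x0123456789ABCDEF
keys = (0x1111111111111111, 0x2222222222222222, 0x3333333333333333)
ciphertext = tdea_encipher(plaintext, *keys)
recovered = tdea_decipher(ciphertext, *keys)
```

With the middle step being a decipher operation, setting all three keys equal collapses the construction to a single encipher pass, which is why this ordering stays compatible with single-key equipment.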

2.2 AES algorithm

The AES standard specifies the Rijndael algorithm ([2] and [6]), a symmetric block cipher that can process data blocks of 128 bits, using cipher keys with lengths of 128, 192, and 256 bits. It is more complex than DES but provides much more security.

The algorithm may be used with the three different key lengths indicated above, and therefore these different “flavors” may be referred to as “AES-128”, “AES-192”, and “AES-256”. At the beginning of the Cipher or Inverse Cipher, the input array, in, is copied to the State, which is then processed by the AES functions AddRoundKey(), SubBytes(), ShiftRows(), and MixColumns().

The core computation in each function uses Exclusive-OR and multiplication operations in the finite field. There is prior research on hardware implementations of these operations: C.C. Wang et al. [7] designed a VLSI architecture for computing multiplications and inverses, A.V. Dinh et al. [8] implemented a low-latency architecture for computing multiplicative inverses and divisions, and M.H. Jing et al. [9] designed a fast inverse module for AES.
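To make the finite-field operations concrete, the following software sketch multiplies and inverts bytes in GF(2^8) using the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) from FIPS 197 [2]. It is a bit-serial reference model, not the hardware architectures of [7], [8], or [9]; the function names `xtime`, `gf_mul`, and `gf_inv` are chosen for this sketch.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, the AES reduction polynomial


def xtime(b):
    """Multiply a field element by x (i.e. by 0x02) and reduce modulo AES_POLY."""
    b <<= 1
    if b & 0x100:
        b ^= AES_POLY
    return b & 0xFF


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8); addition in the field is XOR."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse computed as a^254 (with 0 mapped to 0, as in the S-box)."""
    result = 1
    power = a
    exponent = 254
    while exponent:
        if exponent & 1:
            result = gf_mul(result, power)
        power = gf_mul(power, power)
        exponent >>= 1
    return result if a else 0
```

The inverse is the expensive part of SubBytes(), which is why the cited works focus on fast inverse circuits and why a table lookup is a common alternative.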

3 Hardware accelerator implementation

3.1 Hardware acceleration platform

To integrate the hardware accelerators into an SoC, we chose an ARM-based CPU platform and follow the ARM7TDMI coprocessor interface [10] for maximum compliance. Figure 1 is the block diagram of the integrated system and shows the I/O pins of the coprocessors connected to the ARM7TDMI. The nCPI, CPA, and CPB pins of each coprocessor connect to the ARM7TDMI's nCPI, CPA, and CPB pins, respectively, and all components connect to the system bus (which may be AMBA).

Figure 1 - Integrated system (block diagram: ARM7TDMI core, CoP0 (DES), CoP1 (AES), and memory attached to the system bus; coprocessor signals nCPI and CPA/CPB)

CoP0 is coprocessor 0, the DES accelerator, and CoP1 is coprocessor 1, the AES accelerator.

3.2 The design of DES hardware accelerator

The core of the DES computation is 16 iterations of

$$L_n = R_{n-1} \qquad (1)$$

$$R_n = L_{n-1} \oplus f(R_{n-1}, K_n) \qquad (2)$$

so we considered two ways to implement it:

1. Fully parallel design:

We can duplicate the hardware circuit for equations (1) and (2) 16 times and finish the computation in 1 cycle.

2. Sequential design:

We can use a single copy of the hardware circuit and run it for 16 rounds to finish the computation.

The first option is the fastest, since it finishes the computation in a single cycle. However, its cost is too high for the benefit gained, and the chain of 16 round circuits may become the critical path of the whole system and slow down the system clock. For this reason, we chose the second option for our hardware accelerator.

Figure 2 shows the architecture of our design; each 64-bit data block takes 16 cycles. Figure 3 shows the control flow of the DES program. Figure 4 shows the whole DES hardware design, which has two parts: the Key Schedule produces the 16 round keys, and the cipher datapath then uses the round keys to encipher or decipher.
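The sequential design can be modeled in software as one round function applied 16 times, as in equations (1) and (2). The sketch below is a behavioral illustration only: `toy_f` is a hypothetical placeholder for the real DES f function, the initial and inverse initial permutations are omitted, and the round keys are arbitrary values rather than the output of the DES key schedule.

```python
MASK32 = 0xFFFFFFFF


def toy_f(right, round_key):
    """Hypothetical stand-in for the DES f function (not the real expansion/S-box/permutation)."""
    return ((right * 0x9E3779B1) ^ round_key) & MASK32


def feistel_round(left, right, round_key):
    """One round: L_n = R_{n-1}, R_n = L_{n-1} XOR f(R_{n-1}, K_n)."""
    return right, left ^ toy_f(right, round_key)


def feistel(block64, round_keys):
    """Run one round per key (one hardware cycle each in the sequential design)."""
    left = (block64 >> 32) & MASK32
    right = block64 & MASK32
    for key in round_keys:
        left, right = feistel_round(left, right, key)
    # The halves are swapped at the output, so the same circuit can decipher.
    return (right << 32) | left


def encipher(block64, round_keys):
    return feistel(block64, round_keys)


def decipher(block64, round_keys):
    # Deciphering reuses the same rounds with the round keys in reverse order.
    return feistel(block64, round_keys[::-1])


round_keys = [(0x0F1E2D3C + 0x01010101 * i) & MASK32 for i in range(16)]
block = 0x0123456789ABCDEF
assert decipher(encipher(block, round_keys), round_keys) == block
```

Because only the key order changes between enciphering and deciphering, a single round circuit plus a key schedule that can run in either direction is sufficient, which is consistent with the two-part design described above.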

Figure 2 - DES iteration architecture (INPUT, INITIAL PERMUTATION, one round computing Ln = Rn-1 and Rn = Ln-1 ⊕ f(Rn-1, Kn) with round key Kn, INVERSE INITIAL PERMUTATION, OUTPUT)

Figure 3 - The flow chart of DES software program (Start; enter input data and input key; key schedule; encryption/decryption with counter increment; repeat until counter > 16; get result)


Figure 4 - DES algorithm block diagram (Data Input, KeyIn, IP, F, Key Schedule, Round counter, Encryption select, IIP, Output Data)

3.3 The design of AES hardware accelerator

The architecture of the AES hardware can also be divided into two parts: the first part performs the key expansion, and the second uses the expanded keys to encipher or decipher. In this hardware we followed Jing, M.H. [9] to implement the finite-field multiplier, and used table lookup to implement the S-box, which is used heavily in the SubBytes() and SubWord() functions. Figure 5 shows the block diagram of the AES hardware accelerator architecture.

The core computation of AES is the key expansion, which is also complex. Because AES keys come in different lengths (AES-128, AES-192, and AES-256), we use 4 multiplexers to control the data path, as Figure 6 shows. The 1st mux passes the left input for an AES-128 key and the right input otherwise. The 2nd and 3rd muxes pass the upper and left inputs, respectively, for AES-192, and the lower and right inputs otherwise. The 4th mux selects the upper input for AES-128, the middle input for AES-192, and the bottom input for AES-256.
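As a reminder of what the key-length selection has to accommodate, the sketch below derives the FIPS 197 [2] parameters for each key length: the number of 32-bit key words Nk, the number of rounds Nr, and the number of 32-bit round-key words the key expansion must produce. The function name `aes_parameters` is an assumption made for this sketch; it describes the standard, not the multiplexer circuit itself.

```python
def aes_parameters(key_bits):
    """Return the FIPS 197 parameters for a given AES key length in bits."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key length must be 128, 192, or 256 bits")
    nb = 4                  # the State always has 4 columns (128-bit block)
    nk = key_bits // 32     # key length in 32-bit words
    nr = nk + 6             # number of rounds: 10, 12, or 14
    return {"Nk": nk, "Nr": nr, "round_key_words": nb * (nr + 1)}


for bits in (128, 192, 256):
    print(f"AES-{bits}: {aes_parameters(bits)}")
```

The differing Nk is the reason one key expansion datapath needs selectable routing: the same Rot/Sub/Rcon logic is applied at a different word spacing for each key length.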

Figure 5 - The block diagram of complete circuit of AES (Key Expansion with keyIn, keyRound, and keyLength; Encryption Module; Decryption Module; Input Data, round, Encryption select, Data Output)

Figure 6 - AES integrated parallel Key Expansion (inputs KeyIn 0 to KeyIn 7; Rot, Sub, and Rcon operations; multiplexers 1 to 4)

4 Experimental results and analysis

4.1 DES/AES hardware accelerator performance

We compared the hardware performance with pure software solutions written in C and compiled with ADS 1.2 (the ARM C compiler); the results are shown in Table 1. We enciphered packets of three different data sizes with DES, AES-128, AES-192, and AES-256, and counted the cycles needed to finish each job. The hardware is about 3397 to 3992 times faster than the software solutions.
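The improvement figures follow directly from the cycle counts in Table 1 as software cycles divided by hardware cycles. The short sketch below recomputes them; the dictionary layout and the function name `speedup` are choices made for this illustration, while the cycle counts are copied from Table 1.

```python
# (data size in bytes, algorithm) -> (hardware cycles, software cycles), from Table 1
CYCLES = {
    (48, "DES"): (384, 1434880),
    (48, "AES-128"): (140, 475641),
    (48, "AES-192"): (164, 574179),
    (48, "AES-256"): (188, 673189),
    (100, "DES"): (800, 2989334),
    (100, "AES-128"): (270, 990919),
    (100, "AES-192"): (316, 1196208),
    (100, "AES-256"): (364, 1402478),
    (1500, "DES"): (12000, 44840010),
    (1500, "AES-128"): (3770, 14863785),
    (1500, "AES-192"): (4520, 17943120),
    (1500, "AES-256"): (5270, 21037170),
}


def speedup(hw_cycles, sw_cycles):
    """Performance improvement as the ratio of software to hardware cycle counts."""
    return sw_cycles / hw_cycles


speedups = {key: speedup(hw, sw) for key, (hw, sw) in CYCLES.items()}

for (size, algorithm), ratio in sorted(speedups.items()):
    print(f"{size:>5}-byte {algorithm:<8} {ratio:8.1f}x")
```

The recomputed ratios agree with the reported improvements to within one unit; the small differences come from rounding in the reported values.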


4.2 Cost and Power consumption

Our hardware accelerators are implemented in Verilog RTL and synthesized with Synopsys Design Analyzer™ using the TSMC 0.35 µm process. The DES and AES results are shown in Table 2 and Table 3, respectively.

The area (gate count) of the DES module is 2978.86 and the cycle time is 12.02 ns. Compared with the Inventra™ DES Encryption core [11], a commercial product, our design is slower by 2.06 ns but smaller by about 1000 gates (25%); the comparison is shown in Table 4. The area of the AES module is 74345 gates, the cycle time is 14.27 ns, and the maximum throughput is 897 Mbps. Compared with the Ocean Logic™ (OL) [12] AES module (Table 5), our design is about 3.63 times faster in cycle count.

However, because the OL AES module uses a 0.18 µm ASIC process and runs at 200 MHz, its throughput is about the same as ours.
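The throughput figures can be checked from the cycle time and the cycles per block. The sketch below assumes that throughput is the block size in bits divided by the time per block (cycles per block times cycle time); this relation is our reading of the reported numbers rather than a formula stated in the text, and the function name `throughput_mbps` is hypothetical.

```python
def throughput_mbps(bits_per_block, cycles_per_block, cycle_time_ns):
    """Throughput in Mbps: bits per block divided by the time to process one block."""
    time_per_block_ns = cycles_per_block * cycle_time_ns
    return bits_per_block / time_per_block_ns * 1000.0  # bits/ns = Gbps, so scale to Mbps


# AES: 128-bit block, 14.27 ns cycle time, 10/12/14 cipher cycles per block
aes_throughput = {
    "AES-128": throughput_mbps(128, 10, 14.27),
    "AES-192": throughput_mbps(128, 12, 14.27),
    "AES-256": throughput_mbps(128, 14, 14.27),
}

# DES: 64-bit block, 12.02 ns cycle time, 16 cycles per block
des_throughput = throughput_mbps(64, 16, 12.02)

for name, value in aes_throughput.items():
    print(f"{name}: {value:.2f} Mbps")
print(f"DES: {des_throughput:.0f} Mbps")
```

Under this assumption the AES-128 case gives the 897 Mbps maximum quoted above, and the longer key lengths are slower only because they need more rounds per block.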

We also implemented an ARM-compatible CPU core for the platform; its power consumption is 31.56 mW.

The CPU+DES module, running at 83 MHz with a 3.3 V core voltage, consumes 35.62 mW.

The CPU+AES module, running at 70 MHz with a 3.3 V core voltage, consumes 51.81 mW. The DES and AES modules thus add 12.86% and 64.16% to the power consumption, respectively. The performance/cost ratio tells us that spending 2978/74345 gates and a little power to gain 3000 times faster performance is a worthwhile and wise choice.

Adding the accelerators to the SoC system does add some power consumption. From the hardware point of view more power is consumed, but from the task point of view the accelerators shorten the computing time greatly and may save more energy than they consume. For example, if the hardware accelerator needs 1 second to finish a DES/AES task and the software solution needs up to 3000 seconds, the energy consumption of the system with the accelerator is definitely smaller than that of the system without it.

According to Amdahl's Law:

$$\text{ExecutionTime}_{\text{after improvement}} = \frac{\text{ExecutionTime}_{\text{affected by improvement}}}{\text{Amount of improvement}} + \text{ExecutionTime}_{\text{unaffected}}$$

and the Energy = power * cycle count * cycle time.

It tells us that if the ratio of DES over the whole program exceeds 11.4%, the system energy consumption will go down. For AES the ratio is 39.096%. Table 6 shows the details of each module.
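One way to arrive at these break-even ratios is sketched below. It assumes that a fraction f of the software execution time is cipher work sped up by a factor S, that the system draws P_cpu without the accelerator and P_acc with it for the whole run, and that energy is power times time. Setting P_acc * ((1 - f) + f / S) = P_cpu gives f = (1 - P_cpu / P_acc) / (1 - 1 / S). The derivation, the function name `break_even_fraction`, and the choice of speedups (3736 for DES and 3397 for AES-128, from Table 1) are our assumptions; they reproduce the Table 6 values closely.

```python
def break_even_fraction(p_cpu_mw, p_with_acc_mw, speedup):
    """Smallest fraction of cipher work for which the accelerated system uses less energy.

    Derived from P_acc * ((1 - f) + f / S) = P_cpu, i.e. Amdahl's Law combined
    with energy = power * time.
    """
    power_ratio = p_with_acc_mw / p_cpu_mw
    return (1.0 - 1.0 / power_ratio) / (1.0 - 1.0 / speedup)


P_CPU = 31.56       # mW, ARM-compatible core alone
P_CPU_DES = 35.62   # mW, core plus DES module
P_CPU_AES = 51.81   # mW, core plus AES module

des_ratio = break_even_fraction(P_CPU, P_CPU_DES, 3736)
aes128_ratio = break_even_fraction(P_CPU, P_CPU_AES, 3397)

print(f"ARM + DES:     {des_ratio:.3%}")
print(f"ARM + AES-128: {aes128_ratio:.3%}")
```

Because the speedup is in the thousands, the term 1 / S is negligible and the break-even ratio is set almost entirely by the added power, which is why the three AES key lengths give nearly identical values.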

Table 1 - Hardware/software performance comparison

Data size    Algorithm   Hardware (cycle)   Software (cycle)   Performance improvement
48-byte      DES         384                1434880            3736
48-byte      AES-128     140                475641             3397
48-byte      AES-192     164                574179             3501
48-byte      AES-256     188                673189             3580
100-byte     DES         800                2989334            3736
100-byte     AES-128     270                990919             3670
100-byte     AES-192     316                1196208            3785
100-byte     AES-256     364                1402478            3853
1500-byte    DES         12000              44840010           3736
1500-byte    AES-128     3770               14863785           3942
1500-byte    AES-192     4520               17943120           3969
1500-byte    AES-256     5270               21037170           3992

Table 2 - DES module synthesis results with TSMC 0.35 µm process

DES module synthesis results (Synopsys Design Analyzer, TSMC 0.35 µm)
Gate count                                     2978.86
Power (mW), voltage 3.3 V, frequency 83 MHz    3.8528
Cycle time (ns)                                12.02
Cycles / block                                 16
Max throughput (Mbps)                          333

Table 3 - AES module synthesis results with TSMC 0.35 µm process

AES module synthesis results (TSMC 0.35 µm)
Gate count                                     74345
Cycle time (ns)                                14.27
Power (mW), voltage 3.3 V, frequency 70 MHz    20.2344

                                          AES-128   AES-192   AES-256
KE (key expansion) cycles                 20        16        14
Cipher cycles                             10        12        14
KE throughput (Mbps)                      448.49    804.93    1281.41
Cipher throughput (Mbps)                  896.99    747.49    640.70

Table 4 - Comparison with the Inventra™ DES core

                      Gate count   Cycle time (ns)
Our DES core          2978         12.06
Inventra™ DES core    4000         10

Table 5 - Cycle count comparison with the OL module

                                   AES-128   AES-192   AES-256
OL_KEXP_ED (cycle)                 44        52        60
OL_AES_ED (cycle)                  44        52        60
Our Key Expansion (cycle)          20        16        14
Our AES (cycle)                    10        12        14
Improvement (OL / our design)      2.93      3.71      4.29
Average improvement                3.63

Table 6 - Minimal ratio of feature related computation

Module           Minimal ratio of feature related computation
ARM + DES        11.401%
ARM + AES-128    39.096%
ARM + AES-192    39.095%
ARM + AES-256    39.095%

5 Conclusion

We present performance/cost/power information for network security hardware accelerators on an SoC. To measure the data, we implemented DES and AES modules using the ARM7TDMI coprocessor interface on an ARM-based SoC. Spending a small hardware cost to gain much higher performance (over 3000 times faster) is a worthwhile investment. The DES and AES accelerators run at 83 MHz and 70 MHz, respectively, with a 3.3 V core voltage, and add 12.86% and 64.16% to the power consumption. However, if DES accounts for more than 11.4% of the whole program, or AES for more than 39.1%, the system energy consumption goes down because the computing time is reduced.

6 References

[1] NIST, Data Encryption Standard, FIPS PUB 46-3, October 25, 1999.

[2] NIST, Advanced Encryption Standard, FIPS PUB 197, November 26, 2001.

[3] R.L. Rivest, A. Shamir, and L. Adleman, “A Method for Obtaining Digital Signatures and Public-Key Cryptosystems”, Communications of the ACM, 21 (2), pp. 120-126, February 1978.

[4] http://www.xilinx.com/

[5] http://www.xilinx.com/bvdocs/appnotes/xapp270.pdf

[6] Joan Daemen, Vincent Rijmen, “AES Proposal: Rijndael”, document Version 2, May 9, 1999.

[7] C.C. Wang, T.K. Truong, H.M. Shao, L.J. Deutsch, J.K. Omura, and I.S. Reed, “VLSI architecture for computing multiplications and inverse in GF(2^m)”, IEEE Transactions on Computer, Volume C-34, No. 6, August 1985.

[8] A.V. Dinh, R.J. Bolton, and R. Mason, “A Low Latency Architecture for Computing Multiplicative Inverses and Divisions in GF(2^m)”, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Volume 48, No. 8, August 2001.

[9] M.H. Jing, Y.H. Chen, Y.T. Chang, and C.H. Hsu, “The design of a fast inverse module in AES”, International Conferences on Info-tech and Info-net, 2001, Volume 3, pp. 298-303.

[10] ARM7TDMI Data Sheet.

[11] Inventra™, DES-core, DES Encryption core, http://www.mentorg.com/inventra/

[12] Ocean Logic™, OL_AES AES Core family Rev 1.4, http://www.ocean-logic.com/pub/OL_AES.pdf