The implementation and performance/cost/power analysis of the network security accelerator on SoC applications
Ruei-Ting Gu
Department of Computer Science and Engineering, National Sun Yat-Sen University
Kaohsiung, Taiwan 804 [email protected]
Kuo-Huang Chung
Department of Computer Science and Engineering, National Sun Yat-Sen University
Kaohsiung, Taiwan 804
Ing-Jer Huang
Department of Computer Science and Engineering, National Sun Yat-Sen University
Kaohsiung, Taiwan 804 [email protected]
Abstract
The Internet reaches ever deeper into daily life: more and more of the applications around us depend on it, and network security is becoming more and more important. DES and AES are the most popular standards used in network security, and many protocols include them. Because enciphering is computation-intensive, hardware accelerators can be used to offload the CPU. We present an analysis of performance, cost, and power that is helpful at the early design stage of a network SoC. To obtain the data for this analysis, we implement DES and AES hardware accelerators.
The experimental results show that the DES and AES accelerators need only 2979 and 74345 gates, respectively, to run more than 3000 times faster than software, which makes them a very worthwhile investment. The DES and AES accelerators run at 83 and 70 MHz with a 3.3 V core voltage and add 12.86% and 64.16% to the power consumption, respectively. However, if DES accounts for more than 11.4% of the whole program, or AES for more than 39.1%, the system energy consumption goes down because the computing time is reduced.
Key words: network security, hardware accelerator, performance/cost/power analysis, network SoC chip.
1 Introduction
With the development of the Internet, more and more of the applications around us are tightly connected to it, and network security is becoming more and more important.
It is impossible to prevent someone from intercepting network packets, especially someone who does so on purpose. We still have to protect our data, so the only option is to encipher it, so that whoever obtains the data cannot read it.
For enciphering, DES (Data Encryption Standard) [1] and AES (Advanced Encryption Standard) [2] are the most popular standards and are used in many protocols, such as 802.11i and WAP. However, the computation is too heavy for a low-cost SoC, so a hardware accelerator is usually needed to offload the CPU. Some accelerators already exist, and we want to know their performance, cost, and power consumption, because this information is helpful at the early design stage of a network-security-specific SoC. To measure performance, cost, and power consumption, we implement DES and AES hardware accelerators. We compare their performance with a software implementation and analyze how much the performance improves, how much the accelerators cost, and how much power they consume.
The rest of this paper is organized as follows. Section 2 covers related work: it introduces the DES and AES standards, how they process data, and some existing hardware accelerators. Section 3 describes our hardware platform implementation for the experiments and analysis. Section 4 presents the experimental results and analysis, and Section 5 concludes.
2 Related work
In this section we look at encryption for network data security. DES and AES are the most popular encryption standards, offering high speed and enough strength to protect information. Both are symmetric ciphers, which allows fast computation. Although a symmetric cipher may not be as strong as an asymmetric one such as RSA [3], these standards use keys of 64 bits or more to improve security. A symmetric cipher is also much easier to implement in hardware as an accelerator.
2.1 DES algorithm
The DES cryptographic algorithm is used to protect civilian satellite communications, gateway servers, set-top boxes, Virtual Private Networks (VPN), video transmissions, and numerous other data transfer applications. It is designed to encipher and decipher blocks of data consisting of 64 bits under control of a 64-bit key (56 bits of the key are used by the algorithm; the other 8 are parity bits).
Deciphering must be accomplished by using the same key as for enciphering, but with the schedule of addressing the key bits altered so that the deciphering process is the reverse of the enciphering process.
Now that CPUs are faster and faster (currently over 3 GHz), DES may not be strong enough to protect data, which led to TDEA (Triple Data Encryption Algorithm, also called Triple-DES). TDEA applies DES three times with three keys: it enciphers the data with the 1st key, deciphers the result with the 2nd key, and enciphers again with the 3rd key. To decipher TDEA data, we decipher with the 3rd key first, then encipher with the 2nd key, and finally decipher with the 1st key to recover the original data.
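The encipher-decipher-encipher ordering described above can be sketched as follows. This is only an illustration: `toy_encipher` and `toy_decipher` are hypothetical stand-ins for DES (a simple invertible XOR-and-rotate, not the real algorithm), chosen so that the ordering of the three keys is easy to follow.

```python
# Toy 64-bit block cipher standing in for DES (NOT DES itself): it only
# needs to be invertible so that the E-D-E ordering can be demonstrated.
MASK64 = (1 << 64) - 1


def toy_encipher(block: int, key: int) -> int:
    """Hypothetical stand-in for DES encipherment of one 64-bit block."""
    x = (block ^ key) & MASK64
    return ((x << 13) | (x >> 51)) & MASK64        # rotate left by 13


def toy_decipher(block: int, key: int) -> int:
    """Inverse of toy_encipher."""
    x = ((block >> 13) | (block << 51)) & MASK64   # rotate right by 13
    return (x ^ key) & MASK64


def tdea_encipher(block: int, k1: int, k2: int, k3: int) -> int:
    # Encipher with key 1, decipher with key 2, encipher with key 3.
    return toy_encipher(toy_decipher(toy_encipher(block, k1), k2), k3)


def tdea_decipher(block: int, k1: int, k2: int, k3: int) -> int:
    # Undo the steps in reverse order: decipher with key 3,
    # encipher with key 2, decipher with key 1.
    return toy_decipher(toy_encipher(toy_decipher(block, k3), k2), k1)
```

One consequence of this ordering is that setting all three keys equal reduces TDEA to a single encipherment, because the middle decipher step cancels the first encipher step.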
2.2 AES algorithm
The AES standard specifies the Rijndael algorithm ([2] and [6]), a symmetric block cipher that can process data blocks of 128 bits using cipher keys with lengths of 128, 192, and 256 bits. It is more complex than DES but provides much more security.
The algorithm may be used with the three different key lengths indicated above, and therefore these different “flavors” may be referred to as “AES-128”, “AES-192”, and “AES-256”. At the beginning of the Cipher or Inverse Cipher, the input array, in, is copied to the State, which then passes through the AES processing functions: AddRoundKey(), SubBytes(), ShiftRows(), and MixColumns().
The core computation in each function uses exclusive-OR and multiplication operations in the finite field. There is existing research on hardware implementations: C.C. Wang [7] designed a VLSI architecture for computing multiplications and inverses, A.V. Dinh [8] implemented a low-latency architecture for computing multiplicative inverses and divisions, and M.H. Jing [9] designed a fast inverse module for AES.
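To make the finite-field operations concrete, the following is a minimal software sketch of multiplication and inversion in GF(2^8) with the AES reduction polynomial from FIPS 197. The function names `gf_mul` and `gf_inv` are ours, and the inverse is computed by plain repeated multiplication; this is not the architecture of [7], [8], or [9], which are designed to be much faster in hardware.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, the AES reduction polynomial


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) using shift-and-XOR."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is exclusive-OR
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY        # reduce when the degree reaches 8
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse computed as a^254 (0 maps to 0)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result
```

For example, `gf_mul(0x57, 0x83)` gives `0xC1`, the worked example in FIPS 197.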
3 Hardware accelerator implementation
3.1 Hardware acceleration platform
To integrate the hardware accelerators into an SoC, we chose an ARM-based CPU platform and follow the ARM7TDMI coprocessor interface [10] for maximum compliance. Figure 1 is the block diagram of the integrated system and shows the I/O pins of the coprocessors connected to the ARM7TDMI. The nCPI, CPA, and CPB pins of each coprocessor connect to the nCPI, CPA, and CPB pins of the ARM7TDMI, respectively, and all components connect to the system bus (which may be AMBA).
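[Figure 1 block diagram: ARM7TDMI Core, CoP0 (DES), CoP1 (AES), and Memory attached to the System Bus; nCPI and CPA/CPB signals between the core and the coprocessors]
Figure 1 - Integrated system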
CoP0 is coprocessor 0, the DES accelerator, and CoP1 is coprocessor 1, the AES accelerator.
3.2 The design of DES hardware accelerator
The core of the DES computation is 16 iterations of
$L_n = R_{n-1}$   (1)
$R_n = L_{n-1} \oplus f(R_{n-1}, K_n)$   (2)
We therefore considered two different ways to implement it:
1. Fully parallel design:
We can duplicate the hardware circuit for $L_n = R_{n-1}$ and $R_n = L_{n-1} \oplus f(R_{n-1}, K_n)$ 16 times and finish the computation in 1 cycle.
2. Sequential design:
We can use a single hardware circuit and run it for 16 rounds to finish the computation.
The first design is the fastest, because it finishes the computation in a single cycle. However, its cost is too high for the benefit gained, and it may become the critical path of the whole system and slow the system down. For this reason, we chose the second design for our hardware accelerator.
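The sequential design can be pictured in software as one round function applied in a loop, once per round key. The sketch below is illustrative only: `toy_f` is a hypothetical round function (real DES uses an expansion, S-boxes, and a permutation), and the initial and inverse initial permutations are omitted. It does show why deciphering only needs the same datapath with the round keys in reverse order.

```python
MASK32 = 0xFFFFFFFF


def toy_f(r: int, k: int) -> int:
    """Hypothetical round function standing in for the DES f function."""
    x = (r + k) & MASK32
    return ((x << 5) | (x >> 27)) & MASK32


def feistel(block: int, round_keys) -> int:
    """Process one 64-bit block, reusing a single round datapath."""
    left, right = block >> 32, block & MASK32
    for k in round_keys:                              # one "cycle" per round
        left, right = right, left ^ toy_f(right, k)   # equations (1) and (2)
    return (right << 32) | left                       # final swap of the halves


def encipher(block: int, round_keys) -> int:
    return feistel(block, round_keys)


def decipher(block: int, round_keys) -> int:
    # Same datapath, round keys applied in reverse order.
    return feistel(block, list(reversed(round_keys)))
```

With 16 round keys the loop body runs 16 times per block, which corresponds to the 16 cycles per block of the sequential hardware.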
Figure 2 shows the architecture of our design; each 64-bit data block takes 16 cycles. Figure 3 shows the flow chart of the DES program. Figure 4 shows the whole DES hardware design, which has two parts: the Key Schedule first produces the 16 round keys, one for each round, and the data is then enciphered or deciphered according to the round keys.
[Figure 2 diagram labels: INPUT, INITIAL PERMUTATION, 1 round (Ln = Rn-1, Rn = Ln-1 ⊕ f(Rn-1, Kn), Kn), INVERSE INITIAL PERMUTATION, OUTPUT]
Figure 2 - DES iteration architecture
[Figure 3 flow chart labels: Start; Enter Input Data / Input Key; Key Schedule; Encryption/Decryption; Counter ++; Counter > 16 (NO / Yes); Get Result]
Figure 3 - The flow chart of DES software program
[Figure 4 block diagram labels: Input Data, KeyIn, IP, Key Schedule, F, [1:32], [33:64], Round, Round == 0, Encryption, IIP, Output Data]
Figure 4 - DES algorithm block diagram
3.3 The design of AES hardware accelerator
The AES hardware architecture can also be divided into two parts: the first part performs the key expansion, and the second uses the expanded keys to encipher or decipher. In this hardware we followed M.H. Jing [9] to implement the finite-field multiplier, and used table lookup to implement the S-box, which is used heavily in the SubBytes() and SubWord() functions. Figure 5 shows the block diagram of the AES hardware accelerator architecture.
The key expansion is a core computation of AES and is also complex. Because AES supports several key lengths (AES-128, AES-192, and AES-256), we use 4 multiplexers to control the data path, as Figure 6 shows. The 1st mux passes the left input for the AES-128 key length and the right input otherwise. The 2nd and 3rd muxes pass the upper and left inputs, respectively, for AES-192, and the lower and right inputs otherwise. The 4th mux chooses the upper input for AES-128, the middle input for AES-192, and the bottom input for AES-256.
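For reference, the key expansion that this data path implements can be sketched in software following the FIPS 197 specification [2]. The sketch below is a plain word-by-word reference, not our parallel hardware design: the branches on the key length `nk` play the role that the multiplexers play in hardware. The helper names (`gf_mul`, `sbox`, `expand_key`) are ours, and the S-box is computed from the field inverse and affine map only to keep the example self-contained; the hardware uses a lookup table.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def sbox(x: int) -> int:
    """S-box entry: field inverse followed by the AES affine map."""
    inv = 1
    for _ in range(254):               # x^254 is the inverse; 0 maps to 0
        inv = gf_mul(inv, x)
    out = inv
    for shift in range(1, 5):          # XOR of the byte rotated left by 1..4
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out ^ 0x63


SBOX = [sbox(x) for x in range(256)]   # table-lookup form


def sub_word(w):
    return [SBOX[b] for b in w]


def rot_word(w):
    return w[1:] + w[:1]


def expand_key(key: bytes):
    """Return the round-key words (hex strings) for a 16-, 24-, or 32-byte key."""
    nk = len(key) // 4                 # 4, 6, or 8 words
    nr = nk + 6                        # 10, 12, or 14 rounds
    words = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 1
    for i in range(nk, 4 * (nr + 1)):
        temp = words[i - 1]
        if i % nk == 0:
            temp = sub_word(rot_word(temp))
            temp[0] ^= rcon            # round constant on the first byte
            rcon = gf_mul(rcon, 2)
        elif nk > 6 and i % nk == 4:   # extra SubWord only for AES-256
            temp = sub_word(temp)
        words.append([a ^ b for a, b in zip(words[i - nk], temp)])
    return [bytes(w).hex() for w in words]
```

Calling `expand_key` with a 16-, 24-, or 32-byte key yields 44, 52, or 60 words, that is, one 128-bit round key for the initial AddRoundKey() plus one per round.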
[Figure 6 diagram labels: KeyIn 0 to KeyIn 7, Rot, Sub, Sub, Rcon, multiplexers 1 to 4]
Figure 6 - AES integrated parallel Key Expansion
[Figure 5 block diagram labels: keyIn, keyLength, Key Expansion, keyRound, round, Input Data, Encryption Module, Decryption Module, Encryption, Data Output]
Figure 5 - The block diagram of complete circuit of AES
4 Experiment result and analysis
4.1 DES/AES hardware accelerator performance
We compared the hardware performance with pure software solutions written in C and compiled with ADS 1.2 (the ARM C compiler); the results are shown in Table 1. We enciphered data packets of 3 different sizes with DES, AES-128, AES-192, and AES-256, and counted the cycles needed to finish each job. The hardware is about 3397 to 3992 times faster than the software solutions.
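The performance improvement is the ratio of software cycles to hardware cycles. As a small worked example, the following sketch recomputes it for the 48-byte case using the cycle counts from Table 1; the variable and function names are ours.

```python
# (hardware cycles, software cycles) for the 48-byte case, copied from Table 1
cycles_48_byte = {
    "DES":     (384, 1434880),
    "AES-128": (140, 475641),
    "AES-192": (164, 574179),
    "AES-256": (188, 673189),
}


def speedup(hw_cycles: int, sw_cycles: int) -> float:
    """Performance improvement = software cycles / hardware cycles."""
    return sw_cycles / hw_cycles


speedups = {name: speedup(hw, sw) for name, (hw, sw) in cycles_48_byte.items()}
```

Truncated to integers, these ratios reproduce the 48-byte improvement figures of Table 1 (3736 for DES and 3397, 3501, and 3580 for the three AES variants).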
4.2 Cost and Power consumption
Our hardware accelerators are implemented in Verilog RTL code and synthesized with Synopsys Design Analyzer™ using the TSMC 0.35 µm process. The DES and AES results are shown in Table 2 and Table 3, respectively.
The area (gate count) of the DES module is 2978.86 gates and the cycle time is 12.02 ns. Compared with the Inventra™ DES Encryption core [11], a commercial product, our design is 2.06 ns slower per cycle but about 1000 gates (25%) smaller. The comparison is shown in Table 4. The area of the AES module is 74345 gates, the cycle time is 14.27 ns, and the maximum throughput is 897 Mbps. Compared with the Ocean Logic™ (OL) [12] AES module (Table 5), our design is about 3.63 times faster in cycle count.
However, because the OL AES module uses a 0.18 µm ASIC process and runs at 200 MHz, its throughput is about the same as ours.
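The throughput figures follow from the block size, the cycles per block, and the cycle time. The short sketch below shows the arithmetic, assuming throughput = block bits / (cycles per block × cycle time); the function name is ours, and the inputs are the cycle counts and cycle times reported for our modules.

```python
def throughput_mbps(block_bits: int, cycles_per_block: int, cycle_time_ns: float) -> float:
    """Throughput in Mbit/s for one block processed every cycles_per_block cycles."""
    # bits per nanosecond equals Gbit/s; multiply by 1000 for Mbit/s
    return block_bits / (cycles_per_block * cycle_time_ns) * 1000.0


des = throughput_mbps(64, 16, 12.02)       # 64-bit block, 16 cycles per block
aes128 = throughput_mbps(128, 10, 14.27)   # 128-bit block, 10 cipher cycles
aes192 = throughput_mbps(128, 12, 14.27)
aes256 = throughput_mbps(128, 14, 14.27)
```

These evaluate to roughly 333 Mbps for DES and roughly 897, 747, and 641 Mbps for AES-128, AES-192, and AES-256, in line with the maximum and cipher throughput values in the synthesis results.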
We also implemented an ARM-compatible CPU core for the platform; its power consumption is 31.56 mW.
The CPU+DES module, running at 83 MHz with a 3.3 V core voltage, consumes 35.62 mW.
The CPU+AES module, running at 70 MHz with a 3.3 V core voltage, consumes 51.81 mW. The DES and AES modules thus add 12.86% and 64.16% to the power consumption, respectively. The performance/cost ratio tells us that spending 2978/74345 gates and a little power to gain a 3000-fold speedup is a worthwhile and wise choice.
Although the accelerators add some power consumption to the SoC system, so that from the hardware point of view more power is consumed, from the task point of view they shorten the computing time so much that they may save more energy than they consume. For example, if the hardware accelerator needs 1 second to finish a DES/AES task and the software solution needs up to 3000 seconds, the energy consumption of the system with the accelerator is definitely smaller than that of the system without it.
According to Amdahl's Law:
$$\text{ExecutionTime}_{\text{after improvement}} = \frac{\text{ExecutionTime}_{\text{affected by improvement}}}{\text{Amount of improvement}} + \text{ExecutionTime}_{\text{unaffected}}$$
and Energy = power * cycle count * cycle time. Together these tell us that if DES accounts for more than 11.4% of the whole program, the system energy consumption goes down. For AES the ratio is 39.096%. Table 6 shows the details for each module.
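The break-even ratios can be reproduced with a short calculation. The sketch below rests on our own assumptions, which the text does not state explicitly: the software-only system draws the CPU power for the whole run, the accelerated system draws the combined CPU+accelerator power for the whole (shorter) run, and the speedups are the DES and AES-128 improvement figures for 100-byte data from Table 1. The function name is ours.

```python
def break_even_fraction(p_cpu_mw: float, p_cpu_acc_mw: float, speedup: float) -> float:
    """Smallest accelerated fraction f at which the accelerated system saves energy.

    Assumed model (T is the software-only run time):
        E_sw = P_cpu * T
        E_hw = P_cpu_acc * T * ((1 - f) + f / speedup)   # Amdahl's Law for time
    Setting E_hw = E_sw and solving for f gives the expression below.
    """
    return (1.0 - p_cpu_mw / p_cpu_acc_mw) / (1.0 - 1.0 / speedup)


P_CPU = 31.56                                            # mW, CPU core alone
des_f = break_even_fraction(P_CPU, 35.62, 3736)          # CPU+DES at 35.62 mW
aes128_f = break_even_fraction(P_CPU, 51.81, 3670)       # CPU+AES at 51.81 mW
```

Under these assumptions `des_f` is about 0.114 and `aes128_f` is about 0.391, which agrees with the ratios listed in Table 6. Because the speedup is in the thousands, the result is dominated by the power ratio and is almost insensitive to the exact speedup used.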
Data size    Algorithm   Hardware (cycle)   Software (cycle)   Performance improvement
48-byte      DES         384                1434880            3736
48-byte      AES-128     140                475641             3397
48-byte      AES-192     164                574179             3501
48-byte      AES-256     188                673189             3580
100-byte     DES         800                2989334            3736
100-byte     AES-128     270                990919             3670
100-byte     AES-192     316                1196208            3785
100-byte     AES-256     364                1402478            3853
1500-byte    DES         12000              44840010           3736
1500-byte    AES-128     3770               14863785           3942
1500-byte    AES-192     4520               17943120           3969
1500-byte    AES-256     5270               21037170           3992
Table 1 - The hardware/software performance comparing result
DES module synthesis results (Synopsys Design Analyzer, TSMC 0.35 µm)
Gate count                                        2978.86
Power (mW) (Voltage: 3.3 V, Frequency: 83 MHz)    3.8528
Cycle time (ns)                                   12.02
Cycle / block                                     16
MAX Throughput (Mbps)                             333
Table 2 - DES module synthesis results with TSMC 0.35µm process

AES module synthesis results (TSMC 0.35 µm)
Gate count                                        74345
Cycle time (ns)                                   14.27
Power (mW) (Voltage: 3.3 V, Frequency: 70 MHz)    20.2344

                            AES-128   AES-192   AES-256
KE cycles                   20        16        14
Cipher cycles               10        12        14
KE throughput (Mbps)        448.49    804.93    1281.41
Cipher throughput (Mbps)    896.99    747.49    640.70
Table 3 - AES module synthesis results with TSMC 0.35µm process
                      Gate count   Cycle time (ns)
Our DES core          2978         12.06
Inventra™ DES core    4000         10
Table 4 - Compare with Inventra™ DES core

                                 AES-128   AES-192   AES-256
OL_KEXP_ED (cycle)               44        52        60
OL_AES_ED (cycle)                44        52        60
Our Key Expansion (cycle)        20        16        14
Our AES (cycle)                  10        12        14
Improvement (OL / Our design)    2.93      3.71      4.29
Average Improvement              3.63
Table 5 - Compare the cycle count with OL module

Module            Minimal ratio of feature related computation
ARM + DES         11.401%
ARM + AES-128     39.096%
ARM + AES-192     39.095%
ARM + AES-256     39.095%
Table 6 - Minimal ratio of feature related computation
5 Conclusion
We present performance, cost, and power information for network security hardware accelerators on an SoC. To measure the data, we implemented DES and AES modules that use the ARM7TDMI coprocessor interface on an ARM-based SoC. Spending a small cost to gain much more performance (over 3000 times faster) is a worthwhile investment. The DES and AES accelerators run at 83 and 70 MHz with a 3.3 V core voltage and add 12.86% and 64.16% to the power consumption, respectively. However, if DES accounts for more than 11.4% of the whole program, or AES for more than 39.1%, the system energy consumption goes down because the computing time is reduced.
6 References
[1] NIST, Data Encryption Standard, FIPS PUB 46-3, October 25, 1999
[2] NIST, Advanced Encryption Standard, FIPS PUB 197, November 26, 2001
[3] R. Rivest, A. Shamir, and L. Adleman, “A Method for Obtaining Digital Signatures and Public-Key Cryptosystems”, Communications of the ACM, 21 (2), pp. 120-126, February 1978
[4] http://www.xilinx.com/
[5] http://www.xilinx.com/bvdocs/appnotes/xapp270.pdf
[6] Joan Daemen, Vincent Rijmen, “AES Proposal: Rijndael”, document Version 2, May 9, 1999
[7] C.C. Wang, T.K. Truong, H.M. Shao, L.J. Deutsch, J.K. Omura, and I.S. Reed, “VLSI architecture for computing multiplications and inverse in GF(2^m)”, IEEE Transactions on Computers, Volume C-34, No. 6, August 1985
[8] A.V. Dinh, R.J. Bolton, and R. Mason, “A Low Latency Architecture for Computing Multiplicative Inverses and Divisions in GF(2^m)”, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Volume 48, No. 8, August 2001
[9] M.H. Jing, Y.H. Chen, Y.T. Chang, and C.H. Hsu, “The design of a fast inverse module in AES”, International Conferences on Info-tech and Info-net, 2001, Volume 3, pp. 298-303
[10] ARM7TDMI Data Sheet
[11] Inventra™, DES-core, DES Encryption core, http://www.mentorg.com/inventra/
[12] Ocean Logic™, OL_AES AES Core family Rev 1.4, http://www.ocean-logic.com/pub/OL_AES.pdf