HIGH THROUGHPUT EVALUATION OF SHA-1 IMPLEMENTATION
USING UNFOLDING TRANSFORMATION
Shamsiah Binti Suhaili1 and Takahiro Watanabe2
1Faculty of Engineering, Universiti Malaysia Sarawak, Kota Samarahan, Sarawak, Malaysia
2Graduate School of Information, Production and Systems, Waseda University, Hibikino, Wakamatsu-ku, Kitakyushu-shi, Fukuoka, Japan
E-Mail: [email protected]
ABSTRACT
Hash functions are widely used in protocol schemes. In this paper, the design of the SHA-1 hash function in Verilog HDL on an FPGA is studied to optimise both hardware resources and performance. The design was successfully synthesised and implemented using Altera Quartus II on an Arria II GX device (EP2AGX45DF29C4). Two designs are proposed, namely SHA-1 and SHA-1 unfolding. The maximum frequency of the SHA-1 design is 274.2 MHz, which is higher than that of the SHA-1 unfolding design at only 174.73 MHz. However, because unfolding halves the number of cycles per block, the SHA-1 unfolding design achieves the higher throughput of 2236.54 Mbps. Both designs also provide a small area implementation on Arria II, requiring only 423 and 548 combinational ALUTs and 693 and 907 total registers, respectively.
Keywords: maximum frequency, FPGA, HDL, SHA-1.
INTRODUCTION
Implementing a hash function on reconfigurable hardware is a practical solution for embedded systems, and the results depend on the structure of the FPGA's reconfigurable logic. In other words, an FPGA can improve a design in terms of power, speed and area. FPGAs offer several benefits for cryptographic hash functions: they are small, incur low development cost, and have high speed and fine memory; they are also highly flexible, allowing frequent hardware modification, a short time to market, and easy experimental testing and verification. An FPGA tends to be an excellent choice for implementing algorithms, but it has the disadvantage of high power consumption. Therefore, to apply high-speed cryptographic solutions on reconfigurable hardware, further research on high-speed and small-area implementation is needed.
A hash function is a transformation that takes a variable-length input message M and returns a fixed-size output called the hash value [1,2,3]. There are many hash functions, such as MD5, SHA-224, SHA-256, SHA-384 and SHA-512. The purpose of this paper is to analyse the structure of the SHA-1 hash function on reconfigurable hardware and to obtain a small area implementation as well as a high maximum frequency. In short, the design must balance maximum frequency against area. High performance is important because it improves the throughput of the design, and current systems require fast implementations. The motivation of this research is to study the structure of the SHA-1 hash function, as it is important for applications such as Message Authentication Codes (MAC) [1]. The SHA-1 algorithm has been designed carefully at every stage of its inner structure using Verilog. There are many studies of FPGA-based SHA-1 implementation [4-12].
However, some of these designs leave room for further improvement. In this paper, the Altera Arria II GX device EP2AGX45DF29C4, targeted with Quartus II, is chosen for both the SHA-1 and SHA-1 unfolding implementations because it has the potential to provide high performance. The paper is organised as follows: Section II describes the SHA-1 algorithm; Section III briefly explains the unfolding algorithm; Section IV contains the performance evaluation; and Section V concludes the paper.
SHA-1 ALGORITHM
The input to the Secure Hash Algorithm (SHA-1) must be shorter than 2^64 bits; the message is processed sequentially in 512-bit blocks and produces a 160-bit message digest. The SHA-1 algorithm is divided into two parts: pre-processing and hash computation. The non-linear function of SHA-1 operates on three 32-bit words B, C and D, with a logical function sequence from f0 to f79.
PRE-PROCESSING
Table-1. Buffer Initialisation of SHA-1.
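Pre-processing, as specified in the Secure Hash Standard [2], pads the message to a multiple of 512 bits and sets the five initial buffer values. The following is a minimal Python sketch of that step, not the Verilog used in this work; the names `sha1_pad` and `H_INIT` are hypothetical, and the constants are taken from the standard on the assumption that they match Table-1.

```python
import struct

# Initial buffer values H0..H4 from the Secure Hash Standard
# (assumed to be the values listed in Table-1).
H_INIT = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)


def sha1_pad(message: bytes) -> bytes:
    """Pad a byte message to a multiple of 64 bytes (512 bits).

    Appends a single 1 bit (0x80), then zero bytes, then the original
    message length in bits as a 64-bit big-endian integer.
    """
    bit_length = len(message) * 8
    padded = message + b"\x80"
    # Zero-fill so that 8 bytes remain in the last block for the length.
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    return padded + struct.pack(">Q", bit_length)


def split_blocks(padded: bytes):
    """Split the padded message into 512-bit (64-byte) blocks."""
    return [padded[i:i + 64] for i in range(0, len(padded), 64)]


blocks = split_blocks(sha1_pad(b"abc"))
print(len(blocks), len(blocks[0]))
```

A 55-byte message still fits in one block, while a 56-byte message needs a second block because the length field no longer fits.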
HASH COMPUTATION
SHA-1 hash computation processes the padded message using a message schedule of 80 32-bit words, W0, W1, ..., W79, one per step. Equation (1) gives the compression function of SHA-1 for inputs A, B, C, D and E. The symbol << denotes a circular left shift (rotation) of the register by the given number of bit positions. T includes Wt and Kt, where Wt is the expanded message word of round t and Kt is the round constant of round t.
Figure-1. SHA-1 Compression function.
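$$
\begin{aligned}
T &= (A \ll 5) + F_t(B, C, D) + E + W_t + K_t \\
E &= D, \quad D = C, \quad C = (B \ll 30), \quad B = A, \quad A = T
\end{aligned}
\tag{1}
$$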
The 32-bit message schedule words $W_t$ are taken directly from the message input for $t < 16$. The remaining values of $W_t$, where $t \geq 16$, are derived using Equation (2).

$$
W_t = ROTL^{1}\left(W_{t-3} \oplus W_{t-8} \oplus W_{t-14} \oplus W_{t-16}\right) \tag{2}
$$
After the five working variables A, B, C, D and E are initialised with the buffer values H0, H1, H2, H3 and H4 in pre-processing, the hash computation uses the constant Kt and the round function for 0 ≤ t ≤ 79, as shown in Table-2 and Table-3, to process the message. The symbols ∧, ¬ and ⊕ in the non-linear functions of the four SHA-1 rounds in Table-3 represent the logical AND, NOT and XOR operations respectively. After the four rounds, which together consist of 80 steps, the final step adds the initial values to the last output hash.
Table-2. Constant Kt.
Table-3. Round function.
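The 80-step hash computation described above can be sketched in Python as follows. This is an illustrative software model under stated assumptions, not the Verilog design of this paper: the constants and round functions are taken from the Secure Hash Standard [2] and are assumed to match Table-1, Table-2 and Table-3, and the names `rotl`, `f`, `message_schedule` and `compress` are hypothetical.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF
# Values from the Secure Hash Standard (assumed to match Table-1 and Table-2).
H_INIT = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)


def rotl(x, n):
    """Circular left shift of the 32-bit word x by n positions."""
    return ((x << n) | (x >> (32 - n))) & MASK


def f(t, b, c, d):
    """Round function f_t built from AND, NOT and XOR."""
    if t < 20:
        return (b & c) ^ (~b & d)
    if 40 <= t < 60:
        return (b & c) ^ (b & d) ^ (c & d)
    return b ^ c ^ d  # steps 20..39 and 60..79


def message_schedule(block):
    """Expand one 64-byte block into the 80 words W_0..W_79."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w


def compress(h, block):
    """Apply the 80 steps to one block and add the result to h."""
    w = message_schedule(block)
    a, b, c, d, e = h
    for t in range(80):
        temp = (rotl(a, 5) + f(t, b, c, d) + e + w[t] + K[t // 20]) & MASK
        a, b, c, d, e = temp, a, rotl(b, 30), c, d
    return tuple((x + y) & MASK for x, y in zip(h, (a, b, c, d, e)))


# One padded block for the 24-bit message b"abc".
block = b"abc" + b"\x80" + b"\x00" * 52 + struct.pack(">Q", 24)
digest = struct.pack(">5I", *compress(H_INIT, block))
print(digest.hex() == hashlib.sha1(b"abc").hexdigest())
```

The final comparison against Python's `hashlib` is only a sanity check that the sketch follows the standard.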
UNFOLDING ALGORITHM
The unfolding algorithm is a technique used in DSP applications to obtain a new program that performs more than one iteration of the original program. The unfolding factor J is the number of iterations of the original program performed by one iteration of the unfolded program. The rules of the unfolding algorithm are as follows [4]:
1. For each node $U$ in the original data flow graph (DFG), draw the $J$ nodes $U_0, U_1, \ldots, U_{J-1}$.
2. For each edge $U \to V$ with $w$ delays in the original DFG, draw the $J$ edges $U_i \to V_{(i+w)\%J}$ with $\lfloor (i+w)/J \rfloor$ delays for $i = 0, 1, \ldots, J-1$.
In order to explain the structure of unfolding algorithm, one example of DSP program is shown in Figure-2.
$$
y(n) = a\,y(n-9) + x(n)
$$
Figure-2. The original DSP program [4].
A DFG can be constructed from the original DSP program in Figure-2 by replacing the input and output ports with nodes A and B, while the addition and multiplication are represented by nodes C and D respectively, as shown in Figure-3.
Figure-3. The 2-unfolded DFG [4].
Based on the first rule of the unfolding algorithm, there are 8 nodes for $i = 0, 1$, namely $A_0$, $B_0$, $C_0$, $D_0$, $A_1$, $B_1$, $C_1$ and $D_1$. The second step of the unfolding algorithm is to connect each edge $U \to V$ of the DSP program. An edge $U \to V$ with no delay is divided into two edges, $U_0 \to V_0$ and $U_1 \to V_1$. The edge $C \to D$ with 9 delays becomes $C_0 \to D_{(0+9)\%2}$ with $\lfloor (0+9)/2 \rfloor$ delays and $C_1 \to D_{(1+9)\%2}$ with $\lfloor (1+9)/2 \rfloor$ delays. Finally, the 2-unfolded DFG is created with $C_0 \to D_1$ with 4 delays and $C_1 \to D_0$ with 5 delays.
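The two unfolding rules can be checked mechanically. The Python sketch below applies them to the example DFG. The function name `unfold` and the edge-list representation are hypothetical. Only the edge from C to D with 9 delays is stated in the text, so the three zero-delay edges are an assumption based on the program y(n) = a y(n-9) + x(n).

```python
def unfold(edges, J):
    """Unfold a DFG by factor J.

    edges: list of (u, v, w) tuples, meaning an edge u -> v with w delays.
    Returns a list of (u_i, v_j, delays) tuples for the J-unfolded DFG.
    """
    unfolded = []
    for u, v, w in edges:
        for i in range(J):
            # Rule 2: U_i -> V_{(i + w) % J} with floor((i + w) / J) delays.
            unfolded.append((f"{u}{i}", f"{v}{(i + w) % J}", (i + w) // J))
    return unfolded


# A = input, B = output, C = adder, D = multiplier.
# Zero-delay edges are assumed; only C -> D with 9 delays is given in the text.
edges = [("A", "C", 0), ("C", "B", 0), ("C", "D", 9), ("D", "C", 0)]
result = unfold(edges, 2)
for edge in result:
    print(edge)
```

Unfolding preserves the total number of delays on each original edge: the 9 delays on the edge from C to D are split into 4 and 5.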
SHA-1 UNFOLDING ALGORITHM
The proposed SHA-1 unfolding algorithm with factor 2 is shown in Figure-4. It consists of two non-linear functions, each with three inputs, two circular left shifts by 30 (of A and B), and two circular left shifts by 5 (of A and Temp). In this figure there are 8 addition operations, which are performed in parallel during execution. Thus, the critical path of the design has only four additions. In other words, two hash operations are executed per cycle. This reduces the number of cycles needed to obtain the final output hash from 80 to 40. Hence, the unfolding transformation can increase the throughput of the SHA-1 hash function.
Figure-4. SHA-1 Unfolding compression function.
The outputs of the SHA-1 unfolding algorithm are shown in the following equations. $ROTL^{a}(b)$ represents a circular left shift (left rotation) of $b$ by $a$ positions to the left, and $func_t(p, q, r)$ means the non-linear function at time $t$ for three inputs $p$, $q$ and $r$.

$$
\begin{aligned}
Temp &= ROTL^{5}(A_t) + func_t(B_t, C_t, D_t) + E_t + W_t + K_t \\
A_{t+2} &= ROTL^{5}\{Temp\} + func_{t+1}\left(A_t, ROTL^{30}(B_t), C_t\right) + D_t + W_{t+1} + K_{t+1} \\
B_{t+2} &= Temp \\
C_{t+2} &= ROTL^{30}(A_t) \\
D_{t+2} &= ROTL^{30}(B_t) \\
E_{t+2} &= C_t
\end{aligned}
\tag{4}
$$
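As a software check of the unfolded equations, the Python sketch below performs two SHA-1 steps per loop iteration, so that 40 iterations replace the usual 80. It is a behavioural model only and says nothing about the timing of the hardware design. The constants and round functions come from the Secure Hash Standard [2] and are assumed to match Table-1 to Table-3, and all function names are hypothetical.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF
# Values from the Secure Hash Standard (assumed to match Table-1 and Table-2).
H_INIT = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)


def rotl(x, n):
    """Circular left shift of the 32-bit word x by n positions."""
    return ((x << n) | (x >> (32 - n))) & MASK


def func(t, p, q, r):
    """Non-linear function at step t for inputs p, q and r."""
    if t < 20:
        return (p & q) ^ (~p & r)
    if 40 <= t < 60:
        return (p & q) ^ (p & r) ^ (q & r)
    return p ^ q ^ r


def message_schedule(block):
    """Expand one 64-byte block into the 80 words W_0..W_79."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w


def compress_unfolded(h, block):
    """2-unfolded compression: each iteration covers steps t and t + 1."""
    w = message_schedule(block)
    a, b, c, d, e = h
    iterations = 0
    for t in range(0, 80, 2):
        temp = (rotl(a, 5) + func(t, b, c, d) + e + w[t] + K[t // 20]) & MASK
        a_next = (rotl(temp, 5) + func(t + 1, a, rotl(b, 30), c)
                  + d + w[t + 1] + K[(t + 1) // 20]) & MASK
        # Right-hand side uses the values from before this iteration.
        a, b, c, d, e = a_next, temp, rotl(a, 30), rotl(b, 30), c
        iterations += 1
    out = tuple((x + y) & MASK for x, y in zip(h, (a, b, c, d, e)))
    return out, iterations


# One padded block for the 24-bit message b"abc".
block = b"abc" + b"\x80" + b"\x00" * 52 + struct.pack(">Q", 24)
state, iterations = compress_unfolded(H_INIT, block)
digest = struct.pack(">5I", *state)
print(iterations, digest.hex() == hashlib.sha1(b"abc").hexdigest())
```

Because steps t and t + 1 always fall within the same round of 20 steps when t is even, both use the same round function and constant in any one iteration.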
PERFORMANCE EVALUATIONS
Implementation is carried out to evaluate the performance of the design [5-13]. All the results are presented in Table-4. The proposed SHA-1 and SHA-1 unfolding designs use only 423 and 548 combinational ALUTs respectively, and the total number of registers increases from 693 to 907 in the SHA-1 unfolding design. Comparisons of area and speed depend on the FPGA device family, so the designer needs to choose an appropriate device in order to reduce logic utilisation and increase performance. The total estimated power dissipation decreases from 625.86 mW to 456.2 mW in the SHA-1 unfolding design. The table shows that the throughput of the SHA-1 unfolding design increases significantly even though its maximum frequency is only 174.73 MHz: it reaches about 2236.54 Mbps, which is higher than the 1754.88 Mbps of the SHA-1 design. The throughput is calculated using the following formula, where the block size is 512 bits.
$$
\text{Throughput} = \frac{\text{Block size} \times \text{Frequency}}{\text{Latency}} \tag{5}
$$
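The reported throughput figures can be reproduced from the throughput formula. The short Python sketch below assumes that the latency is 80 cycles for the SHA-1 design and 40 cycles for the SHA-1 unfolding design, as implied by the cycle counts given earlier; the function name `throughput_mbps` is hypothetical.

```python
def throughput_mbps(fmax_mhz, latency_cycles, block_bits=512):
    """Throughput in Mbps: block size times frequency divided by latency."""
    return block_bits * fmax_mhz / latency_cycles


# Latencies of 80 and 40 cycles are assumptions based on the step counts.
sha1 = throughput_mbps(274.2, 80)
sha1_unfolded = throughput_mbps(174.73, 40)
print(round(sha1, 2), round(sha1_unfolded, 2))
```

Although unfolding lowers the maximum frequency by roughly a third, halving the latency more than compensates, which is why the unfolded design has the higher throughput.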
Table-4. FPGA-based SHA-1 implementation.
RESULTS ANALYSIS
There are several other published FPGA-based implementations of SHA-1. In this paper, designs targeting two FPGA vendors, Xilinx and Altera, are listed in order to compare area in terms of combinational ALUTs, logic elements, slices and total registers. Since the SHA-1 implementations do not target the same device, the comparison of area and speed must be read in light of each target device. In other words, the designer can choose a device that provides a high-performance implementation. Table-5 shows the SHA-1 area implementation on different FPGA device families. Where the authors of previous papers did not state the latency, we assume the latency of normal SHA-1 operation.
Table-5. Area implementation of SHA-1 and SHA-1 unfolding.
The proposed SHA-1 design provides the highest maximum frequency, which is 274.15 MHz, with a throughput of 1754.56 Mbps.
Table-6. Maximum frequency of SHA-1.
Table-7 shows the maximum frequency (fMax) of SHA-1 unfolding designs. In this table, the proposed SHA-1 unfolding design obtains the highest maximum frequency, 174.73 MHz, compared with the other SHA-1 unfolding designs, which leads to a high throughput. The designs of L. Jiang et al. [10] and J. Kim et al. [14] provide high throughput because they are pipelined. However, these designs use a large area. As Table-5 shows, these authors use a large number of combinational ALUTs, about 33764 and 1649, compared with the proposed SHA-1 unfolding design, which uses only 548 combinational ALUTs.
Table-7. Maximum frequency of SHA-1 unfolding.
CONCLUSIONS
The SHA-1 unfolding architecture was successfully synthesised and implemented on the Altera Arria II device EP2AGX45DF29C4 using Verilog HDL. The maximum frequency of the design is 174.73 MHz, while the area utilisation in terms of combinational ALUTs and total registers is 548 and 907 respectively. The maximum frequency of the SHA-1 design reflects the critical path of the design. To obtain a high-performance design, not only speed but also area needs to be taken into account. Other methodologies or techniques can be applied to increase the maximum frequency as well as the throughput of the design. A high-performance, efficient design combines small area, high maximum frequency and low estimated power consumption; this in turn leads to high throughput.
ACKNOWLEDGEMENTS
This work is supported by Universiti Malaysia Sarawak (UNIMAS).
REFERENCES
[1] Beale Q. Dang. 2011. Draft NIST Special Publication 800 – 107. Recommendation for Applications using Approved Hash Algorithm, Computer Security Division, Information Technology Laboratory.
[2] Federal Information Processing Standards. Secure Hash Standard (SHS), FIPS PUB 180-3. 2008. Information Technology Laboratory National Institute of Standards and Technology Gaithersburg.
[3] F. R. Henriquez, N.A. Saqib, A. D. Perez, C. K. Koc. 2006. Cryptographic Algorithms on Reconfigurable Hardware, Springer series on Signal and Communication.
[4] K.K.Parhi. 1999. VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, Inc. 119-140.
[5] K. Jarvinen. 2004. Design and Implementation of a SHA-1 Hash Module on FPGAs. Helsinki University of Technology Signal Processing Laboratory.
[6] Y.K.Kang, D.W.Kim, T.W.Kwon, J.R.Choi. 2002. An Efficient Implementation of Hash Function Processor for IPSEC. Proceedings 2002. IEEE Asia-Pacific Conference on ASIC. pp. 93-96.
[8] Diez, J.M., Bojanic S., Stanimirovic Lj., Carreras C., Nieto-Taladriz O. 2002. Hash Algorithms for Cryptographic Protocols: FPGA Implementations. Proceeding of 10th Telecommunications forum TELFOR’2002, Belgrade, Yugoslavia.
[9] D. Zibin, Z. Ning. 2003. FPGA Implementation of SHA-1 Algorithm, ASIC 2003. Proceedings 5th International Conference. 2: 1321–1324.
[10] L. Jiang, Y. Wang, Q. Zhao,Y. Shao, X. Zhao. 2009. Ultra High Throughput Architectures for SHA-1 Hash Algorithm on FPGA, Computational Intelligence and Software Engineering, CiSE 2009, International Conference, Wuhan. pp. 1-4.
[11] N. Sklavos, E. Alexopoulos and O. Koufopavlou. 2003. Networking Data Integrity: High Speed Architectures and Hardware Implementations. The International Arab Journal of Information Technology. 1(0).
[12] Y. K. Lee, H. Chan, I. Verbauwhede. 2006. Throughput Optimized SHA-1 Architecture Using Unfolding Transformation. Application-specific Systems, Architectures and Processors (ASAP’06). pp. 354-359.
[13] J. Hoon Lee, S. Choon Kim, Y. Jun Song. 2011. High-Speed FPGA Implementation of the SHA-1 Hash Function. IEICE Trans. Fundamentals, E94-A(9)
[14] J. Kim, H. Lee, Y. Won. 2012. Design for High Throughput SHA-1 Hash Function on FPGA. Fourth International Conference on Ubiquitous and Future