The use of the global communications network to support new electronic applications is growing. Many applications provided over the global communications network involve the exchange of security-sensitive information between different entities. Often, the communicating entities are located at different places around the globe. This demands the deployment of mechanisms that provide secure communications channels between these entities. For this purpose, many of today's electronic applications use cryptographic algorithms to maintain security. Cryptographic algorithms provide a set of primitives for achieving different security goals such as confidentiality, data integrity, authenticity, and non-repudiation. In general, two main categories of cryptographic algorithms can be used to accomplish any of these security goals, namely, asymmetric key algorithms and symmetric key algorithms. The security of asymmetric key algorithms is based on the hardness of the underlying computational problems, which usually incur a large overhead in space and time. On the other hand, the security of symmetric key algorithms is based on non-linear transformations and permutations, which allow more efficient implementations than the asymmetric key ones. Therefore, it is common to use asymmetric key algorithms for key exchange, while their symmetric key counterparts are deployed to secure the communications sessions. This thesis focuses on finding efficient hardware implementations for symmetric key cryptosystems targeting mobile communications and resource-constrained applications.

Next, C. Kyrkou et al. [78] extended their previous work on accelerating the cascaded SVM classifier [76] by proposing an optimized hybrid architecture that combines the proposed hardware reduction method with a novel response evaluation method. The response evaluation process uses a neural network (NN) model to classify the responses of the preceding simple stages in the cascade so that samples can be removed before the final, more complex classification stage, which improves classification speed. In addition, the architecture employs local binary pattern (LBP) descriptors for feature extraction prior to the final stage in order to improve detection accuracy. The architecture was implemented on a Spartan-6 FPGA (replacing the device used previously), targeting embedded face detection on 800x600 images, a higher resolution than in their previous work and in other hardware implementations. The implemented hybrid architecture achieved real-time processing at 40 fps with 80% detection accuracy, as well as 25% and 20% reductions in area and peak power respectively, with only a 1% reduction in classification accuracy. Compared to their previous work, however, lower figures seem to have been achieved for both area and accuracy. This is because they evaluate a large test set of higher-resolution images, targeting real-time processing of online video classification as an embedded benchmark application.

4.2.2 Group 3: SVM-based Applications Implementations

This group was constructed to demonstrate the use of SVM classification in a wide range of applications, in which the research papers focus more on the implementation of the main application than on the implementation of the classifier itself, as in group two. Different hardware architectures have been introduced in the literature for implementing algorithms that include a classification task and target particular applications. This group consists of 13 papers that introduce research on different applications that deal with images.

In our hardware implementation, we performed a series of OpenMP experiments on a multi-core CPU architecture. The experiments ran on a desktop with a 4-core, 8-thread, 4th-generation Intel i7 CPU and a Linux operating system. Because the number of IEEE power system test cases is limited, we use some of the Matrix Market benchmarks, almost all of which are real-valued, plus our case9241pegase example, which is complex-valued. We first implemented a naive C version and then exploited thread-level parallelism by parallelizing the outer for-loop and applying a reduction to the inner loop. We also ran the test cases with several open-source GPU codes, namely CUSP [9], clSpMV and yaSpMV. Both double-precision and single-precision floating-point data types were measured. The results are shown in Figure 4.11. Compared to the naive C implementation, OpenMP parallelization on the 4-core, 4th-generation Intel i7 architecture generally gave a speedup of up to 1.5 to 3 times, while the NVIDIA GeForce GTX 960 GPU architecture gave a speedup of up to 3 to 5 times for the COO sparse format. yaSpMV generally gave the best performance, about 15 times faster than the naive C implementation. With the single-precision floating-point data type, performance was 1.2 to 1.5 times better than with double precision. For our IEEE power system test case, specifically case9241pegase, the GPU did not outperform the multi-core architecture. This is likely because our OpenMP code expands the complex arithmetic into plain real-number operations, whereas the open-source GPU implementation calls the complex class under namespace std, which may add some overhead.
In addition, since neither clSpMV nor yaSpMV supports double-precision floating-point or complex arithmetic, some data are missing from Figure 4.11. Since we are dealing with complex arithmetic, we will evaluate how complex arithmetic performs in yaSpMV, which generally gives the best performance in most test cases.
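The idea of expanding complex arithmetic into plain real-number operations can be illustrated with a small COO sparse matrix-vector product. The following is a minimal sequential sketch in Python, not the OpenMP or GPU code measured above; the function name, the split real/imaginary storage layout, and the 2x2 example matrix are all assumptions made for illustration.

```python
def spmv_coo_complex(n_rows, rows, cols, val_re, val_im, x_re, x_im):
    """Compute y = A @ x for a COO matrix stored as split real/imaginary arrays."""
    y_re = [0.0] * n_rows
    y_im = [0.0] * n_rows
    for r, c, a, b in zip(rows, cols, val_re, val_im):
        # (a + bi)(p + qi) = (ap - bq) + (aq + bp)i, using real operations only.
        p, q = x_re[c], x_im[c]
        y_re[r] += a * p - b * q
        y_im[r] += a * q + b * p
    return y_re, y_im


# Hypothetical example: A = [[1+2i, 0], [3i, 4]], x = [1+1i, 2-1i].
rows = [0, 1, 1]
cols = [0, 0, 1]
val_re = [1.0, 0.0, 4.0]
val_im = [2.0, 3.0, 0.0]
y_re, y_im = spmv_coo_complex(2, rows, cols, val_re, val_im,
                              [1.0, 2.0], [1.0, -1.0])
print(y_re, y_im)
```

Each nonzero costs four real multiplications and four real additions (two to form the product, two to accumulate it), with no complex-number object involved. That is the kind of overhead difference the paragraph above speculates about.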

This section compares our DF hardware performance with various state-of-the-art designs. The design requires fewer transposition buffers and fewer gates than [7], which used a similar design approach (pipeline stages and a 1-D filtering architecture). Although this design requires more processing cycles than that of Tobajas et al. [8], it has a lower gate count and lower transposition buffer usage. That is because [8] proposed a double filter with two identical filtering units, as opposed to our 1-D filtering strategy. Moreover, the proposed HESPA, an intelligent edge skip processing approach, can achieve as few as 100 cycles per MB in the best case, which outperforms even the 2-D architecture in [8] (110 cycles). The design consumes 19.8K gates at a clock frequency of 200 MHz in a 0.18 μm standard cell library. The hardware cost of the proposed scheme is very competitive compared with other state-of-the-art designs in the literature that use a 1-D filtering architecture.

During each cycle of the transfer state, the data from the registers is inverted. That means that during each cycle, some data lines are pulled high. However, the long routes across the Logic Locked design are charged only when the register outputs change from logic zero to logic one. Therefore, the difference in power consumption should be visible every other cycle during the transfer state. This is why the hardware is designed so that the data bits to be inverted are written from the host computer, where they can be set to different proportions of active bits. This also keeps the synthesis tool from removing the logic as unnecessary. The data is read back out for the same reason and to verify the inversion.
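The every-other-cycle effect described above can be checked by counting zero-to-one transitions when a register is inverted repeatedly. This is a minimal sketch under assumptions: the 8-bit width, the initial value, and the function names are hypothetical, and the rising-edge count is only a stand-in for the charging activity on the long routes.

```python
def rising_edges(prev, curr, width):
    """Count the bits that change from 0 to 1 between two register values."""
    mask = (1 << width) - 1
    return bin(~prev & curr & mask).count("1")


def transfer_profile(initial, width, cycles):
    """Rising-edge count per cycle when the register is inverted every cycle."""
    mask = (1 << width) - 1
    value = initial & mask
    profile = []
    for _ in range(cycles):
        inverted = ~value & mask
        profile.append(rising_edges(value, inverted, width))
        value = inverted
    return profile


# Hypothetical 8-bit register with 2 of its 8 bits active.
profile = transfer_profile(0b10100000, 8, 4)
print(profile)
```

With 2 active bits out of 8, the count alternates between 6 and 2 rising edges. With exactly half of the bits active, the two phases are identical, which is why varying the proportion of active bits changes the size of the observable difference.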

In the survey presented in [1], the round two candidates of the CAESAR competition were categorized into five families on the basis of their base constructions: block cipher-based, stream cipher-based, key-less permutations, hash-function-based and dedicated schemes. AEAD ciphers based on block ciphers allow block-level parallelism when using the underlying block cipher, as in the Offset Code Book mode (OCB) [30, 29, 25], the Synthetic Counter-in-Tweak mode (SCT) [27] and the Offset Two-Round mode (OTR) [26]. An important aspect of the study of AEAD schemes is the evaluation of their hardware performance, which clearly needs more effort. So far, nearly all candidates have been supported with a basic hardware implementation [14]. However, the implementations target various platforms and different interfaces. Furthermore, several designs offer unique advantages on some platforms, e.g., Field Programmable Gate Arrays (FPGAs). However, FPGA boards are mainly used to verify a design with the help of programmable gates; they do not provide the actual performance metrics of the design, which can only be obtained by implementing it as an Application-Specific Integrated Circuit (ASIC). In addition, all the available hardware implementations of the CAESAR competition candidates on the ATHENa hardware evaluation website [14] are fully sequential, i.e., all previous blocks have to be finished before processing of a new block can start. These implementations do not take full advantage of the specific characteristics of the schemes based on the aforementioned modes.

In this paper, we take another look at the linear regression side-channel attack (LRA) under the Hamming weight/Hamming distance model. We find that LRA has significant advantages over CPA in many general cases. We study two typical cases: recovering keys from XOR operation leakage, and chosen-plaintext attacks on block ciphers in T-table software or round-based hardware implementations. For the first case, with 128-bit leakage, we achieve as much as a 400% improvement compared with CPA. Furthermore, the computational complexity is only O(1). For the second case, we show that LRA is extremely powerful, as it can overcome an unknown constant mask. We believe that this characteristic of linear regression provides a feasible attack method that could be used in many other cases. Experiments on AES are also given, which verify the efficiency of the two typical cases in practice.
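For context, the CPA baseline that the abstract compares against can be sketched on XOR leakage under the Hamming weight model. This sketch does not implement LRA and is not the authors' experiment: it uses noise-free simulated traces, a single key byte, and a hypothetical secret value, all of which are assumptions for illustration.

```python
def hw(x):
    """Hamming weight of a non-negative integer."""
    return bin(x).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def cpa_recover_key(plaintexts, traces):
    """Return the key byte whose predicted HW(p ^ k) correlates best with the traces."""
    return max(range(256),
               key=lambda k: pearson([hw(p ^ k) for p in plaintexts], traces))


# Noise-free simulated leakage for a hypothetical secret byte.
SECRET = 0x3C
plaintexts = list(range(256))
traces = [hw(p ^ SECRET) for p in plaintexts]
recovered = cpa_recover_key(plaintexts, traces)
print(hex(recovered))
```

Note that the loop tries all 256 key guesses per byte. The O(1) complexity claimed for LRA in the XOR case refers to avoiding this kind of exhaustive guessing, which the sketch does not attempt to reproduce.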

Hardware implementations. As recalled in the previous section, Bloem et al. [9] provide a tool for proving probing security of masked implementations in the ISW model with glitches. While this tool benefits from the new treatment of physical defaults, it faces efficiency issues and cannot handle classical higher-order examples. Recently, Bloem, Iusupov, Krenn, and Mangard [10] provided some technical optimizations based on an earlier version of this paper (using our same tool), but these are still restricted to proofs of probing security. Namely, proven implementations cannot be safely composed to achieve larger secure ones. The work of Faust et al. follows the alternative approach of proving the strong non-interference of some basic gadgets with glitches, which allows composing circuits at arbitrary orders (but less efficiently) [19].

architectural designs depending on the required performance. In the literature there are systolic implementations where speed is the most important requirement [19, 20], serial designs where low resource consumption is preferred [1, 21], and a meet-in-the-middle (semiparallel) approach where both criteria are balanced [8]. In this study, the Jacobi algorithm was implemented using the Xilinx Vivado HLS tool to explore different hardware implementations and compare them in order to select the most efficient one in terms of execution time and FPGA resources used.

In this paper, we have presented Simeck, a new family of lightweight block ciphers. Simeck is very suitable for resource-constrained devices, such as passive RFID tags and wireless sensor networks. We have provided an extensive exploration of different hardware architectures in order to balance area, throughput, and power consumption for Simon and Simeck in both CMOS 130nm and CMOS 65nm technologies. We have shown that it is possible to design a smaller cipher than Simon in terms of area and power consumption. Moreover, we have improved the hardware implementations of Simon given in the original paper. In addition, the similarities between Simon/Speck and Simeck give us an idea of the actual security offered by Simeck. Even though the round function of Simeck is quite simple, it is iterated a sufficient number of times to provide adequate security against most known attacks. In conclusion, all of the instances in the Simeck family can meet the area, power consumption, and throughput requirements of passive RFID tags, and they are promising candidates for resource-constrained devices.
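To make the "quite simple" round function concrete, the sketch below shows one Simeck Feistel round and its inverse. The round function is recalled from the published Simeck specification rather than stated in this excerpt, so treat it as an assumption to check against the paper. The key schedule is omitted, and the round keys used here are hypothetical placeholders, not real Simeck round keys.

```python
def rotl(x, r, n):
    """Rotate the n-bit word x left by r positions."""
    mask = (1 << n) - 1
    return ((x << r) | (x >> (n - r))) & mask


def f(x, n):
    """Round function: (x AND (x <<< 5)) XOR (x <<< 1)."""
    return (x & rotl(x, 5, n)) ^ rotl(x, 1, n)


def simeck_round(left, right, k, n=16):
    """One Feistel round: (l, r) -> (r ^ f(l) ^ k, l)."""
    return right ^ f(left, n) ^ k, left


def simeck_round_inv(left, right, k, n=16):
    """Inverse of simeck_round."""
    return right, left ^ f(right, n) ^ k


# Hypothetical round keys; the real key schedule is not reproduced here.
round_keys = [0x0123, 0x4567, 0x89AB]
state = (0x6565, 0x6877)
for k in round_keys:
    state = simeck_round(*state, k)
for k in reversed(round_keys):
    state = simeck_round_inv(*state, k)
print(tuple(hex(w) for w in state))
```

Each round needs only one AND, two rotations (free wiring in hardware), and three XORs per word. This is consistent with the small area figures discussed above.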

that an unprotected MAC-Keccak hardware implementation is vulnerable to side-channel attacks. This attack method is very different from previous attacks on software implementations, because software implementations execute all the steps serially and the intermediate results (e.g., the compression step in θ and the output of θ) can be used to recover the key bits directly. In hardware implementations, all five steps of a single round of MAC-Keccak are executed in one clock cycle, and there is no register activity directly related to the intermediate key-dependent states of the linear operation output.

Motivated by the above applications, many lightweight pseudorandom number generators (PRNGs) have been devised in recent years, such as LAMED [16], Melia-Segui et al. [13], Warbler [11], J3Gen [14], and AKARI1B [12]. LAMED [16] is designed based on registers, an arithmetic logic unit (ALU), XOR and modular operations. Melia-Segui et al.'s PRNG [13] and J3Gen [14] rely on the security of linear feedback shift registers (LFSRs) and a truly random number generator (TRNG). Warbler [11] is designed by using the properties of nonlinear feedback shift registers (NLFSRs) and the WG-5 transformation modules. The estimated areas of these four PRNGs are all below 2000 GEs, the maximum area limit for resource-constrained applications [2,9]. However, there have been no actual hardware implementations of them until now. AKARI1B [12] is designed based on the T-function and a non-linear filter function, and it was synthesized using the UMC Faraday 90nm technology.

Abstract: Having ciphers that provide confidentiality and authenticity, that are fast in software and efficient in hardware, these are the goals of the CAESAR authenticated encryption competition. In this paper, the promising CAESAR candidate Ascon is implemented in hardware and optimized for different typical applications to fully explore Ascon's design space. Thus, we are able to present hardware implementations of Ascon suitable for RFID tags, Wireless Sensor Nodes, Embedded Systems, and applications that need maximum performance. For instance, we show that an Ascon implementation with a single unrolled round transformation is only 7 kGE large, but can process up to 5.5 Gbit/sec of data (0.75 cycles/byte), which is already enough to encrypt a Gigabit Ethernet connection. Besides, Ascon is not only fast and small, it can also be easily protected against DPA attacks. A threshold implementation of Ascon just requires about 8 kGE of chip area, which is only 3.1 times larger than the unprotected low-area optimized implementation.
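The relation between the throughput and cycles-per-byte figures quoted above can be worked through numerically. The abstract does not state a clock frequency, so the value computed below is only what the two quoted numbers would imply if throughput equals clock rate divided by cycles per byte, times 8 bits. The function names are hypothetical.

```python
def throughput_gbit_per_s(clock_hz, cycles_per_byte):
    """Throughput in Gbit/s for a given clock rate and cycles-per-byte cost."""
    return clock_hz / cycles_per_byte * 8 / 1e9


def required_clock_hz(target_gbit_per_s, cycles_per_byte):
    """Clock rate needed to sustain a target throughput."""
    return target_gbit_per_s * 1e9 / 8 * cycles_per_byte


# Clock implied by 5.5 Gbit/s at 0.75 cycles/byte (an inference, not a quoted figure).
implied_clock = required_clock_hz(5.5, 0.75)
# Clock needed to keep up with a 1 Gbit/s Ethernet link at the same cost.
gige_clock = required_clock_hz(1.0, 0.75)
print(implied_clock / 1e6, "MHz;", gige_clock / 1e6, "MHz")
```

Under that assumption the quoted figures correspond to a clock of roughly 516 MHz, while a Gigabit Ethernet link would need only about 94 MHz. This is consistent with the claim that the design is already fast enough for that use.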

Gura, N., Shantz, S. C., Eberle, H., Gupta, S., Gupta, V., Finchelstein, D., Goupy, E. and Stebila, D. (2002) An End-to-End Systems Approach to Elliptic Curve Cryptography. Proceedings of the Fourth International Workshop on Cryptographic Hardware and Embedded Systems. August 13-15. Heidelberg: Springer Verlag, 349-365.

hardware and just a bit permutation in software, which does not influence the probing security. With the input and output constraints for our synthesis tool, we also ensure that the mask encoding for each byte is the same, and we can thus safely compose these modules without creating flaws in the first-order probing model. However, we note that this composition argument holds only for first-order implementations, for which a probing attacker is restricted to a single probe. This means that multivariate probes are of no concern and thus probes occur only in a single submodule. The tables in Appendix A show that the reuse of randomness has no influence on the output distributions of cascaded gates, as long as the mask encoding is done with precision. Our synthesis tool creates implementations which, by construction, ensure that the mask encoding is fixed at the inputs and outputs of submodules. Our submodules have been formally verified for these encodings. Therefore, combined with the fact that probes can only be placed on a single submodule, this ensures that the entire AES implementation is first-order secure.

The overall hardware architecture for SWIFFT is shown in Figure 5. We opted for the BlockRAM-based implementation of the MAC multipliers (the first approach), which has one cycle of latency. The input message is read in n-bit blocks over m consecutive cycles. The input multiplexers implement the multiplication of the input polynomial coefficient bits by $\omega^i$. The output of the multiplexer is already represented in the diminished-one format. The FFT starts processing one n-bit block of the input message in each clock cycle and has a latency of $\log_2(n)$ cycles. The MAC accumulates the Fourier coefficients to generate the final hash value. The overall execution time of SWIFFT is $\log_2(n) + m + 2$ cycles. The hash value obtained is represented in the diminished-one format; converting it back to the normal representation is covered in the next section.

5 Hardware architecture for the SWIFFTX hash function
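As a quick check of the SWIFFT execution time of log2(n) + m + 2 cycles quoted in the preceding paragraph, the formula can be evaluated directly. The excerpt does not give values for n and m; the values n = 64 and m = 16 used below are the commonly cited SWIFFT parameters and are an assumption here, as is the function name.

```python
def swifft_cycles(n, m):
    """Total cycles: log2(n) for FFT latency, m for the input blocks, plus 2."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two")
    log2_n = n.bit_length() - 1  # exact integer log2 for a power of two
    return log2_n + m + 2


# Assumed parameters n = 64, m = 16 (not stated in the excerpt).
cycles = swifft_cycles(64, 16)
print(cycles)
```

With these assumed parameters the total is 6 + 16 + 2 = 24 cycles, so reading the m input blocks dominates the FFT latency.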

Multivariate signatures belong to Multivariate-Quadratic-Equations Public Key Cryptography (MPKC), which is resistant to quantum computer attacks. Compared with RSA and ECC, multivariate signature implementations still need to be sped up. This paper proposes a high-speed hardware architecture for signature generation in a multivariate scheme. The main computations in multivariate signature generation are additions, multiplications, inversions, and solving systems of linear equations (LSEs) in a finite field. We therefore improve finite field multiplication by using a composite field expression and design a finite field inversion based on binary trees. In addition, we improve the solving of LSEs in a finite field with a variant of Gauss-Jordan elimination and use XOR gates to compute additions. We implement the high-speed hardware architecture with the above improvements on an Altera Stratix Field-Programmable Gate Array (FPGA); it takes only 90 clock cycles and 0.9 μs to generate a multivariate signature. The comparison shows that the hardware architecture is much faster than other implementations.
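The combination of Gauss-Jordan elimination with XOR-based addition can be illustrated in the simplest binary field. The sketch below solves a linear system over GF(2), where row addition is a single XOR and no multiplication or inversion is needed. This is a deliberate simplification and not the variant algorithm of the paper, which presumably works in a larger field where pivots must also be inverted and rows scaled. The bit-packed row encoding and the example system are assumptions.

```python
def solve_gf2(rows, n):
    """Solve A x = b over GF(2) by Gauss-Jordan elimination.

    rows: list of n ints; bit j (j < n) of a row is the coefficient of x_j,
          and bit n is the right-hand side of that equation.
    Returns the solution as a list of n bits, or None if A is singular.
    """
    rows = list(rows)
    for col in range(n):
        # Find a row at or below the diagonal with a 1 in this column.
        pivot = next((r for r in range(col, n) if (rows[r] >> col) & 1), None)
        if pivot is None:
            return None
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Clear this column in every other row; adding rows over GF(2) is XOR.
        for r in range(n):
            if r != col and (rows[r] >> col) & 1:
                rows[r] ^= rows[col]
    return [(rows[i] >> n) & 1 for i in range(n)]


# Hypothetical system: x0 + x1 = 1, x1 + x2 = 0, x0 + x1 + x2 = 1.
system = [0b1011, 0b0110, 0b1111]
solution = solve_gf2(system, 3)
print(solution)
```

Because each row is one machine word, clearing a column in a row costs a single XOR. In hardware the same step maps to a bank of XOR gates operating on all columns in parallel.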

She has experience in technical and project management from her 15 years of service at the Centre for Integrated Information System (CIIS), the core IT centre of UiTM. She worked in network project implementation and management for 13 years, as project manager for the development of the university smartcard applications and system for three years, and in database design and structure for a year. She joined academia in January 2009 and has published 12 refereed international proceedings papers (4 as main author and 8 as co-author) in the area of computer networking and engineering. She is actively doing research in the area of computer engineering and has completed 3 research projects. Her research interests include Network Traffic Management, Network Security, Contactless Smartcard Applications, MIFARE Technology Applications, Protective Management System, E-Content Management and Development, and Web-based application development.

By determining each student's ESL competence in advance, by means of an independent system (e.g., CELT), we are able to provide DEUCE with a rating for each subject's competence. Against these measures, DEUCE engages with students, applies the tests, and determines its own measure of each student's performance. Thereafter, DEUCE evaluates the criteria and the implementations against the predetermined rating for each student. In this fashion, DEUCE is able to discriminate across the criteria and their individual implementations. The three principal domain-specific ingredients that contribute to DEUCE's operation are represented in Figure 1, below.

With growing business complexity, IT implementations have also become increasingly complex. Today's IT implementations need to cater to users across multiple geographies, support complex business requirements, manage large data volumes, and be available 24x7 with zero or minimal downtime. This has posed many challenges for the architects who design these architectures.