2.2 DPA countermeasures
2.2.3 Software countermeasure
Chapter 2. Related work
The algorithm implementation might be adapted to take advantage of the architecture-level countermeasures presented in the previous section. The efficiency of non-deterministic processors as a countermeasure depends on how much of the instruction path can be parallelized. It requires an implementation with few dependencies among instructions. The modification can be done automatically or by the programmer.
Moreover, even if no countermeasures are applied in the underlying hardware, the software implementation can be modified to complicate SCA. In [90] the behavior of DPL is emulated in software. Software countermeasures can be classified as time-oriented hiding techniques and amplitude-oriented hiding techniques, including masking techniques.
Software implementations, or countermeasures introduced automatically by the compiler, must take into account that any branch that depends on the key or on sensitive data is a potential leakage for a visual inspection SPA. The compiler controls the final implementation that is executed (assembly instructions, hardware resources used such as the floating point unit or the vector processor), so it is critical for software countermeasures. In [52] the authors evaluate a proof-of-concept prototype to study how back-end compilation and optimizations affect the information leaked by an x86 processor under timing attacks. Their conclusion is that optimizations introduce asymmetries that can be used by attackers to obtain sensitive data with SCA, so they must be disabled. However, there are two drawbacks in this approach. Some optimizations reduce the information leaked by reducing the number of times sensitive data is manipulated. Moreover, not using compiler optimizations implies a waste of compiler resources.
On the other hand, if there is no deviation in the execution, sensitive data might always be processed at the same instant, providing a perfect alignment to perform any kind of DSCA.
In [171] authors evaluate software implementations of AES using random insertion of dummy instructions, masking and shuffling.
Time delay
According to [113], the complexity of a DSCA attack (expressed as the number of power consumption traces required) grows quadratically with the standard deviation of the instant when the intermediate value is calculated. The growth is linear in case integration techniques are used in the attack, such as the “sliding window” introduced in [48].
Time delay countermeasures introduce random delays in the execution flow, with a variance as large as possible and no dependence on sensitive data. A first approach would be to introduce a Random Process Interrupt (RPI). A software implementation of RPI could rely on an internal hardware interrupt generated randomly. With more or longer RPIs, the efficiency of the integration technique is reduced, and so is the performance of the implementation. A second approach would be to insert random delays at different points. The random delays are implemented with simple loops with a random trip count.
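As a minimal sketch of the second approach, random delays with a random trip count could be placed between the rounds of an algorithm as follows. The names `random_delay`, `protected_rounds` and `round_fn` are hypothetical and do not come from the cited works, and a real implementation would be written at a much lower level than Python.

```python
import secrets


def random_delay(max_trips: int) -> int:
    """Run a dummy loop with a random trip count and return that count."""
    trips = secrets.randbelow(max_trips + 1)  # uniform on [0, max_trips]
    counter = 0
    for _ in range(trips):
        counter += 1  # dummy work that does not depend on sensitive data
    return trips


def protected_rounds(state: int, round_fn, n_rounds: int, max_trips: int = 32) -> int:
    """Apply round_fn n_rounds times, inserting a random delay before each round."""
    for r in range(n_rounds):
        random_delay(max_trips)
        state = round_fn(state, r)
    return state
```

The delays change only the instant at which each round is executed, not its result, so the output is the same as that of the unprotected loop.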
In [179] they present the problem of uniformly distributing the random delay values, which reduces the standard deviation of the cumulative delay as the number of RPI increases. Therefore, the effectiveness of the random delay countermeasure is reduced as it is applied more times. They propose a new distribution of the random delay value that solves the problem while keeping high values of the standard deviation.
In [53], they propose a new technique, called “Floating Mean”, that improves the distribution of the cumulative delay. This method is parameterized by two constants that define the maximum delay a and the execution width b. Each execution has two phases:
1. Range selection. Choose a number m uniformly at random on [0, a − b].
2. Individual delay. Choose each individual delay value uniformly at random on [m, m + b].
The mean of an individual selection is m + b/2, although this mean varies between different range selections.
If the efficiency ratio of individual delays is σ_d/µ_d, the efficiency of the cumulative sum distribution of previous proposals decreases like Θ(1/√N), where N is the number of delays, and therefore tends to 0 when N tends to ∞. With the “Floating Mean” technique, the efficiency evolves like Θ(1).
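The two phases of “Floating Mean” and the behavior of the efficiency ratio can be illustrated with a small simulation. This is only a sketch under assumed parameters (a = 255, b = 50, N = 100 delays per execution); the function names and parameter values are chosen here for illustration and are not taken from [53].

```python
import random
import statistics


def plain_uniform_delays(rng, a, n):
    """n individual delays, each uniform on [0, a] (plain uniform approach)."""
    return [rng.randint(0, a) for _ in range(n)]


def floating_mean_delays(rng, a, b, n):
    """Floating Mean: choose m once per execution, then n delays on [m, m + b]."""
    m = rng.randint(0, a - b)  # phase 1: range selection
    return [rng.randint(m, m + b) for _ in range(n)]  # phase 2: individual delays


def efficiency(totals):
    """Ratio sigma/mu of the cumulative delay over many executions."""
    return statistics.pstdev(totals) / statistics.mean(totals)


def compare(a=255, b=50, n=100, executions=2000, seed=1):
    """Return (plain efficiency, floating-mean efficiency) of the cumulative delay."""
    rng = random.Random(seed)
    plain = [sum(plain_uniform_delays(rng, a, n)) for _ in range(executions)]
    floating = [sum(floating_mean_delays(rng, a, b, n)) for _ in range(executions)]
    return efficiency(plain), efficiency(floating)
```

With the plain uniform distribution the individual delays average out over the N delays of an execution, whereas with “Floating Mean” the shared offset m does not, so the second ratio returned by `compare()` should be clearly larger than the first.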
In [54] they improve the method, whose distribution is not flat, by varying the range width of each execution using an extra parameter k.
In these implementations, when it is possible to extract a pattern to detect the execution of dummy instructions, a preprocessing technique could be applied to remove them from the trace and reduce the efficiency of the countermeasure. The “elastic alignment” technique [180], used for architecture-level countermeasures, could also be used to remove the patterns associated with delays. Random delay insertion techniques are a low-effort countermeasure that complicates an attack by requiring an expert attacker to perform preprocessing, with the drawback of a performance overhead.
In Section 5.3 we implement time delay using automatically generated implementations. Our solution avoids the pattern detection of dummy instructions, as there are no dummy instructions. It turns a vulnerability, the data-dependent execution flow of a software implementation, into a countermeasure against SCA by randomly switching between different implementations.
The solution can be extended to insert dummy calls to useful functions. In [102] the authors evaluate the efficiency of inserting random dummy round executions in AES. Distinguishing the power trace of a dummy round execution from that of a productive one seems complex. However, with a detailed observation of the traces, they distinguish productive rounds from dummy ones, which allows them to discard the latter and reduces the efficiency of the countermeasure. The differences were related to the decision of whether or not to perform the next phase (if it is the last round). This approach, in spite of the overhead introduced, seems interesting if applied very carefully.
Dummy executions should match exactly the pattern of real ones. In Section 6.2.2 we protect a vulnerable implementation with partially dummy rounds. Each round generates a random amount of useful data. Instead of selecting between useful and dummy rounds, we propose selecting among a set of 5 types of execution rounds. The proposed solution generates an identical instruction flow for each kind of round.
Masking
Random masking is a countermeasure against first-order power analysis presented in [48]. It consists of modifying the algorithm to avoid the direct manipulation of sensitive data with known values, handling intermediate computations in a probabilistic form to defeat statistical correlation. Masking consists of representing the sensitive data word by two or more shares (as in secret sharing), where the sum or the XOR of the shares is equal to the intended value of the word. The sum is used in arithmetic masking, while the XOR is used in Boolean masking.
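The representation in shares can be sketched as follows for 8-bit words. The word size and all function names are assumptions made for this illustration; they are not taken from [48].

```python
import secrets

WORD_BITS = 8  # assumed word size for the illustration
WORD_MASK = (1 << WORD_BITS) - 1


def boolean_mask(x):
    """Two Boolean shares (x ^ r, r): their XOR equals x."""
    r = secrets.randbits(WORD_BITS)
    return x ^ r, r


def arithmetic_mask(x):
    """Two arithmetic shares (x - r, r): their sum modulo 2^WORD_BITS equals x."""
    r = secrets.randbits(WORD_BITS)
    return (x - r) & WORD_MASK, r


def boolean_shares(x, n):
    """n Boolean shares whose XOR equals x (n - 1 random shares plus one derived)."""
    shares = [secrets.randbits(WORD_BITS) for _ in range(n - 1)]
    last = x
    for s in shares:
        last ^= s
    return shares + [last]


def boolean_unmask(shares):
    """Recombine Boolean shares by XOR."""
    value = 0
    for s in shares:
        value ^= s
    return value


def arithmetic_unmask(shares):
    """Recombine arithmetic shares by modular addition."""
    return sum(shares) & WORD_MASK
```

Each share taken alone is uniformly distributed, so only the combination of all shares reveals the sensitive word.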
The fundamental hypothesis of DSCA, the existence of an intermediate value that depends on the secret key, no longer holds. The operations are performed on the masked value, and either the result is unmasked at the end or the operations are adapted to the applied mask, using transformed LUTs. The number of times a mask is reused is configurable. However, when LUTs are adapted to a concrete mask, it should not be greatly reduced, since the tables must be recomputed for every new mask.
Masking is certainly the most intensively used approach to protect power-sensitive cryptographic software, as data randomization usually works well in practice, even when hardware countermeasures are not available.
It was first applied to DES in [79], as DES was the target of the first DPA. The masking proposed consisted of two shares, or simple masking, which can be attacked with second-order attacks. In [79] the authors evaluate the effect of the operations of the algorithm on the modified data, and how to recover the desired result (how to unmask). The DES algorithm includes linear operations (permutation, bit expansion, XOR) and one non-linear operation. For the former, it is easy to calculate the masked value by applying the operations to both shares (the masked value and the mask). For the latter, the transformation using an S-box, they propose to use 2 new S-boxes adapted to the mask.
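The difference between linear and non-linear operations under Boolean masking can be sketched on a toy 4-bit example. The S-box below is an arbitrary 4-bit permutation, not a DES S-box, and the code uses the common table-recomputation variant, which is not necessarily the two-S-box construction of [79]. All names are hypothetical.

```python
import secrets

# Arbitrary 4-bit permutation used as a toy non-linear table (not a DES S-box).
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def rotl4(x, k):
    """Rotate a 4-bit value left by k positions (a linear operation over XOR)."""
    return ((x << k) | (x >> (4 - k))) & 0xF


def masked_linear(masked, mask, k=1):
    """Linear step: apply the same operation to the masked value and to the mask."""
    return rotl4(masked, k), rotl4(mask, k)


def recompute_sbox(m_in, m_out):
    """Build a table T such that T[x ^ m_in] == SBOX[x] ^ m_out for every x."""
    table = [0] * 16
    for x in range(16):
        table[x ^ m_in] = SBOX[x] ^ m_out
    return table


def masked_sbox(masked, m_in):
    """Non-linear step: look up a table adapted to the input and output masks."""
    m_out = secrets.randbits(4)
    table = recompute_sbox(m_in, m_out)
    return table[masked], m_out
```

In both steps the unmasked value never appears as an intermediate result; XORing the two returned shares gives the result of the unprotected operation.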
The application of masking to AES candidates is presented in [124] as a countermeasure. The application of masking to Rijndael, the selected candidate, is more efficient in terms of code size, memory usage and executed cycles than the other candidates.
Boolean masking is applied when Boolean operations are involved, while arithmetic masking is applied when the algorithm includes arithmetic operations. If an encryption algorithm includes both kinds of operations, a method to transform from one to the other is required. This method was originally presented in [48], although improved methods are available, including patents such as US 7 334 133 or WO 2013128036 A1.
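As an illustration of such a transformation, the following sketch converts a Boolean-masked 8-bit word into an arithmetic-masked one using a well-known construction usually attributed to Goubin. It is shown only as an example; whether it corresponds to the method of [48] or to the cited patents is not claimed here. The word size K and the function name are assumptions.

```python
import secrets

K = 8  # assumed word size in bits
MOD_MASK = (1 << K) - 1


def boolean_to_arithmetic(x_masked, r):
    """Given x_masked = x ^ r, return a such that (a + r) mod 2^K == x.

    The unmasked value x is never computed: every intermediate value is
    combined with the fresh random value gamma or with the mask r.
    """
    gamma = secrets.randbits(K)
    t = x_masked ^ gamma
    t = (t - gamma) & MOD_MASK
    t ^= x_masked
    gamma ^= r
    a = x_masked ^ gamma
    a = (a - gamma) & MOD_MASK
    a ^= t
    return a
```

The construction relies on the fact that, for a fixed masked word, the map from the mask r to (x_masked ^ r) - r is affine over XOR, so it can be evaluated on randomized inputs and recombined.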
In Section 2.1.2, Higher-Order DPA is presented as a way to attack implementations protected with masking. These attacks perform a statistical analysis on more intermediate values, including every instant where the shares are manipulated.
Schemes with more than two shares have been proposed since the first notion of Higher-Order DPA (HO-DPA). In [153] the authors present a method to implement a k-share masked implementation of AES. Moreover, they evaluate the efficiency of their proposal and formally prove the security of the resulting implementations against attacks of a lower order. The complexity of a practical attack increases with the order. For their third-order masking proposal, the reported cost is 400% in code size, 320% in memory usage and 15600% in executed cycles.
The security of masked implementations of encryption algorithms is formally proven in [144] using the “only computation leaks information” model.
We have also seen masking applied at logic-level in Section 2.2.1, including the Threshold Implementation techniques.
Bitslice execution
Encryption algorithms typically repeat simple operations several times. Software implementations execute the different operations sequentially inside a loop. However, in order to obtain a fast software implementation, the use of any available hardware unit has been explored to perform as many operations in parallel as possible.
In 1997, Biham introduced a non-conventional way to implement algorithms using a scalar processor as a Single Instruction Multiple Data (SIMD) machine. It was first introduced in [27] for the DES algorithm. It involves breaking down the algorithm into logical bit operations so that N parallel operations are possible on a single N-bit microprocessor, achieving high throughput when operating with binary operations inside a register. Biham translates the DES S-boxes into logic gate circuits (binary operations) and performs the 8 S-box operations in parallel on a 64-bit processor. These implementations are especially efficient when the cipher does not use all the power of the machine instructions, or when the word size of the processor is much greater than the word size of the operands used by the encryption algorithm.
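The principle can be sketched on a toy 3-bit non-linear function instead of a DES S-box. The function, the helper names and the use of Python integers as wide registers are assumptions made for this illustration. Bit j of every input is gathered into slice j, and the function is then evaluated on all inputs at once using only bitwise operations.

```python
def to_bitslices(values, width):
    """Transpose inputs: slice j holds bit j of every input, input i at bit i."""
    slices = [0] * width
    for i, v in enumerate(values):
        for j in range(width):
            slices[j] |= ((v >> j) & 1) << i
    return slices


def from_bitslices(slices, count):
    """Inverse transposition: rebuild the count individual values."""
    values = [0] * count
    for j, s in enumerate(slices):
        for i in range(count):
            values[i] |= ((s >> i) & 1) << j
    return values


def toy_sbox_scalar(x):
    """Toy 3-bit map: y_i = x_i ^ (~x_(i+1) & x_(i+2)), indices modulo 3."""
    bits = [(x >> i) & 1 for i in range(3)]
    out = 0
    for i in range(3):
        out |= (bits[i] ^ ((bits[(i + 1) % 3] ^ 1) & bits[(i + 2) % 3])) << i
    return out


def toy_sbox_bitsliced(slices, count):
    """Same map expressed as logic gates, evaluated on count inputs in parallel."""
    lane_mask = (1 << count) - 1  # keeps the complement within the used lanes
    x0, x1, x2 = slices
    return [
        x0 ^ (~x1 & x2 & lane_mask),
        x1 ^ (~x2 & x0 & lane_mask),
        x2 ^ (~x0 & x1 & lane_mask),
    ]
```

The bitsliced version executes the same sequence of bitwise operations regardless of the input values; the cost is the transposition of the data before and after the computation.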
In 2009, the fastest implementation of AES on an Intel machine at the time was presented in [96], using bitslice implementations that execute parallel operations in the XMM registers of the Streaming SIMD Extensions (SSE). They process eight 16-byte AES blocks in parallel. Their solution reduced the cost from 18 cycles/byte to 7.59 cycles/byte.
Transforming the input data to fit the bitslice implementation introduces a performance overhead. The process mixes bits from different registers to generate new variables. These operations on individual bits are very inefficient. In [80] the authors presented CRISP, a Cryptographic Reduced Instruction Set Computing (RISC) processor. It has a 6-address instruction that can be used to select multiple output registers. It includes a LUT functional unit with two instructions: one for loading a LUT configuration and one for using it. The last modification is the definition of the GRP and SHIFT PAIR instructions, which combined provide efficient mechanisms to perform any n-bit permutation.
The bitslice implementation is inherently immune to timing attacks, since the execution time of these instructions is independent of the input values. Mounting a DSCA on this implementation is complex because the same operation is performed simultaneously on 8 different inputs. If the input data is prepared so that 8 identical consecutive blocks are encrypted, this resistance disappears. In [162] it is shown that bitslice AES implementations in embedded systems, without vector units, might not be resistant against differential SCA.
implementation for FSR-based algorithms using their ANF.
Compiler Assisted Countermeasures
Automatic insertion of countermeasures at compiler level is one of the main goals of security engineers. In [14] the authors present the European CACE project and its preliminary results. The main objective of the CACE project is to provide engineers with a toolbox that allows them to develop robust implementations of cryptographic algorithms. They propose a new language, CAO, with a new compiler that provides high-level instructions that are compiled to well-proven machine implementations. One result of the CACE project has been a set of certified shared libraries with algorithm implementations.
Compiler optimizations are powerful tools for automatically analyzing and transforming final software implementations. Developers might introduce annotations to help the compiler make correct decisions. Moss et al. [132] use a compiler derived from CAO to transform Haskell software implementations of AES into ARM assembly code that masks the sensitive data. Sensitive data is labeled by the developer. Using dataflow analysis, they know which mask has already been applied to the input data of the S-box transformations and automatically adapt the S-box values. As they state, it is a first step towards automated masking.
Similar solutions are CompCert [110], a formally verified C compiler developed and proven using the Coq proof assistant, and the EasyCrypt framework [15]. Almeida et al. [8] combine both frameworks to generate certified computer-aided secure implementations that take into account existing and known attacks. In their solution, the compiler checks semantic preservation of the generated machine code as well as behavior preservation. With this solution, they guarantee that the compiler does not introduce information leakage when transforming the code from a C-like language to assembly code. The behavior preservation must be checked by the compiler front-end, the optimization passes and the back-end in order to succeed.
The code analyzed and used in the above-mentioned compiler proposals must not include references to unsecured dynamically linked libraries. These libraries, if not protected, could leak information about the manipulated data that is not checked by the compiler. If the security evaluation of an implementation is done empirically by performing attacks on a real target, the effects of linked libraries are taken into account. This is the framework of this Ph.D. thesis, as described in Chapter 3.
Timing attacks exploit execution paths with different lengths where the branch depends on sensitive data or secret keys. In [52] the authors find that traditional optimization passes introduce more data-dependent branches, so they propose disabling compiler optimizations. The difference between paths can be due to variable-latency instructions. In [50] they propose different Low-Level Virtual Machine (LLVM) transformation passes which automatically compensate the different paths with new instructions (NOP, for example). The static analysis is not always possible, so in many cases the security level achieved is low. Moreover, the overhead is high and the portability is low. They conclude that side-channel-aware hardware support is needed at compiler level.
Compiler analysis tools can be used to assess whether any particular leakage in any particular computational phase is statistically dependent on the secret data and statistically independent of any random information used to protect the implementation. In [19] they present a methodology, using LLVM, to detect sensitive operations using a Data Flow Graph (DFG). They analyze the statistical dependence of the detected intermediate values on an input variable by formulating the problem as a satisfiability query (SAT). SAT problems are efficiently applied in logic synthesis, looking for ’X’ values to simplify logic functions.
In [17] a first step towards automatically applying countermeasures is presented and tested on an 8-bit AVR processor. The manipulation of sensitive data is detected by the compiler, and random precharging is introduced by loading a random value into the target register. As a result, the Hamming Distance differs between executions with the same intermediate value. In [18], they add new countermeasures to the compiler, including automatic Boolean masking, with unmasking at the output sink of the basic block. One benefit of compiler-applied countermeasures compared to other software countermeasures is the locality of their application, as the compiler protects the minimum number of instruction instances needed to achieve a desired level of security. In [2] they propose similar methods to automatically apply masking to sensitive instructions. These proposals need to be revised, as stated in [13], because the automatic insertion of masking countermeasures introduces side effects not known by the compiler, which leave them vulnerable to attacks.
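The effect of random precharging on the Hamming Distance can be illustrated with a simple leakage model. The model (leakage proportional to the number of bit flips in an 8-bit register) and the function names are assumptions made for this sketch; it does not reproduce the compiler pass of [17].

```python
import random


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def load_plain(register, value):
    """Leakage proxy without protection: bit flips of the transition register -> value."""
    return hamming_distance(register, value)


def load_with_precharge(register, value, rng, width=8):
    """Overwrite the register with a random value first, then load the real value.

    Returns the bit flips of the final transition (random precharge -> value).
    The first transition (register -> precharge) depends only on the random value
    and the previous register content, so it is not returned here.
    """
    precharge = rng.getrandbits(width)
    return hamming_distance(precharge, value)
```

Without precharging, repeating the same transition always yields the same Hamming Distance. With precharging, the value observed for the same intermediate value changes from one execution to the next.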
In Section 2.2.2, some architecture countermeasures are presented that shuffle the available instructions, introducing non-determinism in the execution. They take advantage of the availability of instructions with no data dependency. In [3] the authors propose to include different instruction schedules of critical functions at software level (increasing the code size). At run time, the branch is selected randomly. It is the software equivalent of the above-mentioned countermeasures. Following the random branch concept, they propose in [4] a similar idea where the different options are derived not only from different scheduling, but also from different equivalent assembly code. The back-end traditionally must choose among the available alternatives for implementing a C instruction. They propose to implement multiple alternatives and select randomly among them.
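The idea of selecting randomly among functionally equivalent code can be sketched at source level as follows. The four variants of an XOR operation and the names used are hypothetical; in [3] and [4] the alternatives are produced at the level of instruction scheduling and assembly code, not written by hand.

```python
import secrets


def xor_v0(a, b):
    return a ^ b


def xor_v1(a, b):
    return (a | b) & ~(a & b)


def xor_v2(a, b):
    return (a & ~b) | (~a & b)


def xor_v3(a, b):
    return (a | b) - (a & b)


# Functionally equivalent alternatives for the same operation.
VARIANTS = (xor_v0, xor_v1, xor_v2, xor_v3)


def randomized_xor(a, b):
    """Pick one of the equivalent implementations at random on every call."""
    return secrets.choice(VARIANTS)(a, b)
```

Every call returns the same result, but the sequence of operations executed to obtain it changes randomly, at the price of a larger code size.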
In Chapter 5 we evaluate the effect of different compiler optimization sequences on software implementations of FSR-based algorithms. In Section 5.3 we propose to randomly switch the code that is executed. Both changing the instruction scheduling and generating equivalent functionality for a block rely on the compiler back-end. Instead of using the compiler back-end to generate the alternatives, we propose to apply different compiler optimization sequences to generate different implementations.