internal register, and operated only inside the co-processor. As the power traces of the SORU2 execution unit are much smaller, the signal-to-noise ratio would be much lower, and any statistical analysis would require many more traces to succeed.
4.4.2 Non-determinism
The multiple SORU2 configuration contexts can be preloaded before the cipher algorithm starts executing, and SORU2 can change the active configuration every clock cycle.
The most straightforward way to take advantage of the proposed architecture against SCA is to generate multiple SORU2 implementations of the program loops and switch randomly between them at run time. Non-deterministic changes between functionally equivalent implementations of loop bodies would significantly increase the noise level at no cost for the embedded system, making an attack based on power analysis very difficult or even impossible. This is similar to the Path Swapping architectural countermeasure described in Section 2.2.2.
To apply this technique to SORU2 we have added only two minor modifications:
1. A configuration ID inside SORU2 is now divided into ID and mask. All the configuration IDs that share the same value when XORed with their mask are considered to be equivalent.
2. One bit in the SORU2 configuration register indicates whether the non-deterministic execution mode is active. When active, the sequencer activates a random configuration, among the equivalent ones, in each iteration of a SIMD operation.
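The two modifications above can be sketched in software as follows. This is only an illustrative model, assuming the literal reading that two configurations are equivalent when `id ^ mask` gives the same value; the function names, the `(id, mask)` tuple layout and the use of Python's `secrets` module as the random source are assumptions made for the example, not part of the SORU2 design.

```python
import secrets
from collections import defaultdict

def equivalence_classes(configs):
    """Group configuration IDs by the value of id XOR mask.

    configs is a list of (config_id, mask) pairs (hypothetical layout).
    """
    classes = defaultdict(list)
    for config_id, mask in configs:
        classes[config_id ^ mask].append(config_id)
    return dict(classes)

def pick_configuration(active_id, active_mask, classes, nondeterministic):
    """Model of the sequencer choice made on each SIMD iteration."""
    if not nondeterministic:
        # Mode bit cleared: always run the configuration that was requested.
        return active_id
    # Mode bit set: pick any member of the same equivalence class at random.
    return secrets.choice(classes[active_id ^ active_mask])

# Three functionally equivalent loop bodies (class 4) and one unrelated one.
configs = [(0b0101, 0b0001), (0b0110, 0b0010), (0b0111, 0b0011), (0b1000, 0b0000)]
classes = equivalence_classes(configs)
schedule = [pick_configuration(0b0101, 0b0001, classes, True) for _ in range(8)]
print(schedule)  # a random sequence drawn from the IDs 5, 6 and 7
```

In hardware the selection would be made by the sequencer on each iteration; the model only shows how the ID and mask fields define the set of candidates.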
Non-determinism in the BRU internal configuration adds security on top of the secured logic already proposed.
Just as the BRU configuration can be switched between equivalent configurations that generate the same output for a given input, the pipeline schedule can be configured with different instruction flows that generate the same final output for a given input, although the intermediate outputs of the BRUs differ.
4.5 Results
We have implemented a SystemC RTL simulator of the proposed architecture in order to evaluate not only the performance and energy consumption of the hardware architecture, but also the quality of the compilation process. The main processor is simulated with an instruction-level simulator based on LLVM.
We employ a very simple power model for the SORU2 execution unit, considering only the internal registers of the pipeline. We consider this a worst-case scenario, as we have neglected many sources of power consumption that would reduce the SNR even further in real devices. We apply basic CPA using the Hamming Distance model. DCPA is not required as there are no “ghost peaks”.
Chapter 4. Countermeasure proposal I: reconfigurable co-processor
(a) Result (b) Resistance
Figure 4.4: CPA on SORU: basic KeeLoq configuration
Moreover, the acquired power traces are perfectly aligned, as they have been obtained by simulation, and therefore we can consider this also as a worst-case scenario, because in practice, trace alignment is one of the most difficult phases for a CPA attack.
We have implemented a basic configuration for a KeeLoq implementation that uses 2 BRUs of SORU2: the first BRU obtains the key bit and the second BRU obtains the index from the state and performs the XOR operations. After 529 clock cycles the output data is available in the SORU register files.
Figure 4.4a shows that this implementation is completely vulnerable to CPA. It shows the result of CPA on the first 100 rounds of KeeLoq execution, targeting round 8. The resulting correlation factor for the correct key guess is 1: as the state is available as the output of BRU 2 and the attacker and simulator models match exactly, the correlation is maximal. This result confirms the first SCA attack on KeeLoq published in [65]. Figure 4.4b shows that the correct key guess has the maximum correlation coefficient and stands out from the other key candidates using very few traces.
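The reason why the correlation reaches 1 can be reproduced with a small software model. The sketch below is not the simulator used in this work: it assumes noise-free traces consisting of a single sample (the Hamming distance of the state register update in round 8), an attacker who guesses the first 8 key bits, and the publicly known KeeLoq round function with NLF constant 0x3A5C742E. The key value, the number of traces and all function names are illustrative assumptions.

```python
import random

NLF = 0x3A5C742E  # KeeLoq non-linear function stored as a 32-entry bit table

def bit(x, n):
    return (x >> n) & 1

def keeloq_round(state, key_bit):
    """One KeeLoq encryption round on a 32-bit state."""
    idx = (bit(state, 1) | bit(state, 9) << 1 | bit(state, 20) << 2
           | bit(state, 26) << 3 | bit(state, 31) << 4)
    new = bit(NLF, idx) ^ bit(state, 16) ^ bit(state, 0) ^ key_bit
    return (state >> 1) | (new << 31)

def hd_at_round(plaintext, key, target_round):
    """Hamming distance of the state register update in the target round."""
    state = plaintext
    for r in range(target_round - 1):
        state = keeloq_round(state, bit(key, r))
    nxt = keeloq_round(state, bit(key, target_round - 1))
    return bin(state ^ nxt).count("1")

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def cpa(plaintexts, traces, target_round=8):
    """Correlate the HD prediction of every partial-key guess with the traces."""
    scores = {}
    for guess in range(1 << target_round):
        model = [hd_at_round(p, guess, target_round) for p in plaintexts]
        scores[guess] = pearson(model, traces)
    return scores

rng = random.Random(1)
secret_key = 0x5CEC6701B79FD949          # arbitrary 64-bit key for the example
plaintexts = [rng.getrandbits(32) for _ in range(100)]
traces = [hd_at_round(p, secret_key, 8) for p in plaintexts]  # noise-free "power"
scores = cpa(plaintexts, traces)
correct = secret_key & 0xFF              # the 8 key bits used in rounds 1 to 8
print(f"correlation of the correct guess: {scores[correct]:.6f}")
```

Because the attacker's prediction and the simulated leakage are computed by the same model, the correct guess reproduces the trace exactly, which is the situation described for Figure 4.4a. Any noise source or model mismatch in a real device would lower this value.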
However, as there are 2 vacant BRUs, we can configure the idle BRUs to perform operations, even other KeeLoq operations, in order to introduce noise. Figure 4.5 shows the evaluation of the two possible candidates. We conclude that it is better to perform extra KeeLoq operations than random operations.
Figure 4.6 depicts the resistance against CPA of a system performing 3 simultaneous KeeLoq operations with the same key (BRU 1) on different input data. Surprisingly, the result is worse than performing just one extra KeeLoq operation.
In order to increase the correlation with incorrect key candidates we decided to perform KeeLoq with alternative keys, on the same input data. We implemented a SORU configuration that executes 2 SORU operations: BRUs 1 and 3 provide the key bit while BRUs 2 and 4 execute the XOR operations.
Figure 4.7a shows that the correlation factor is higher. However, the difference between key candidates is reduced, and the two keys used appear together. The correct key guess and the dummy key stand out after approximately 120 traces, but the two of them have very similar correlation factors.
(a) Random (b) Extra KeeLoq (same key, different data)
Figure 4.5: Resistance against CPA on SORU: fill with random or extra KeeLoq (same key)
(a) Result from BRU 1,2 (b) Result from BRU 3,4
Figure 4.7: Resistance against CPA on SORU: x2 KeeLoq configuration
Additionally, we have implemented KeeLoq with configurations that are not based on the basic configuration presented above. When several rounds are calculated simultaneously, the only unknown bit is bit 31. We propose a configuration that calculates 4 rounds simultaneously for the two possible values of bit 31. We call it the “Eager” x4 KeeLoq implementation, as both possible values are calculated. The BRU selects the real value using a multiplexer and simple logic between the two candidates. The output of the BRU is not exactly the state: the Most Significant Bits (MSBs) contain the bits calculated for a high value of bit 31, and the Least Significant Bits (LSBs) contain the bits calculated for a low value of bit 31. In order to have access to every needed bit of the state, including those that are not in the output, the output is always stored in a register of the SORU register file. This register is loaded into the BRU, where it acts as the SELF output delayed by one cycle. The BRU can reconstruct the real state of the KeeLoq round by mixing inputs 1 and 3 of the BRU (input 2 is the output of the previous BRU, which includes the bits of the key). Figure 4.8 depicts the evolution of the registers of BRU2 using this configuration, indicating the calculated state, which does not appear as an output until the last round.
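The eager idea can be checked with a short software model. The sketch below is not the BRU configuration itself (it does not reproduce the MSB/LSB packing of the output or the register file traffic); it only assumes the publicly known KeeLoq round function and shows that four rounds can be derived from the state at the start of the step, because every tap except bit 31 is already known, and bit 31 is resolved by a multiplexer between two precomputed candidates. All names are illustrative.

```python
import random

NLF = 0x3A5C742E  # KeeLoq non-linear function stored as a 32-entry bit table

def bit(x, n):
    return (x >> n) & 1

def keeloq_round(state, key_bit):
    """Reference implementation: one round per call."""
    idx = (bit(state, 1) | bit(state, 9) << 1 | bit(state, 20) << 2
           | bit(state, 26) << 3 | bit(state, 31) << 4)
    new = bit(NLF, idx) ^ bit(state, 16) ^ bit(state, 0) ^ key_bit
    return (state >> 1) | (new << 31)

def eager_x4(state, key_bits):
    """Four rounds in one step, reading only the state at the start of the step."""
    prev = bit(state, 31)          # bit 31 is known for the first round only
    out = state >> 4
    for j, k in enumerate(key_bits):
        # In round j every tap n reads bit n + j of the original state.
        low = (bit(state, 1 + j) | bit(state, 9 + j) << 1
               | bit(state, 20 + j) << 2 | bit(state, 26 + j) << 3)
        linear = bit(state, 16 + j) ^ bit(state, j) ^ k
        cand0 = bit(NLF, low) ^ linear        # result if bit 31 were 0
        cand1 = bit(NLF, low | 16) ^ linear   # result if bit 31 were 1
        prev = cand1 if prev else cand0       # multiplexer on the real bit 31
        out |= prev << (28 + j)
    return out

def encrypt_basic(block, key):
    for r in range(528):
        block = keeloq_round(block, bit(key, r % 64))
    return block

def encrypt_eager(block, key):
    for step in range(528 // 4):              # 132 steps instead of 528 rounds
        ks = [bit(key, (4 * step + j) % 64) for j in range(4)]
        block = eager_x4(block, ks)
    return block

rng = random.Random(0)
key, block = rng.getrandbits(64), rng.getrandbits(32)
print(encrypt_basic(block, key) == encrypt_eager(block, key))
```

Both candidates are always computed, so the switching activity no longer follows the single-round Hamming distance that the standard attack model predicts.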
Figure 4.9a shows that this implementation drastically reduces the calculated correlation factor and hides the leaked information. However, adapting the CPA to target the transitions that actually take place in the implementation, which span four rounds per clock cycle, yields the result shown in Figure 4.9b.
This configuration leaves 2 BRUs idle. The available configurations can be combined, executing KeeLoq operations on the same input data with different keys. The basic KeeLoq configuration, executed in BRUs 3 and 4, is still vulnerable. The “Eager” implementation is not vulnerable anymore. Moreover, as the basic KeeLoq operation is vulnerable, the attacker might think the attack is successful, although the key obtained belongs to a dummy KeeLoq operation. Figure 4.10 shows the resistance against SCA of the implementation when the attack uses the standard HD model and the x4 HD model.
[Figure 4.8 content: evolution of Reg 1 (SELF) and Reg 3 (R3)]
Figure 4.8: SORU: x4 KeeLoq register evolution
(a) Using standard HD model (b) Using adapted HD model
(a) Using standard HD model (b) Using adapted HD model
Figure 4.10: Resistance against CPA on SORU: x4 combined KeeLoq configuration
During one execution of basic KeeLoq, 4 executions of “Eager” x4 KeeLoq can be performed, since the latter calculates four rounds per clock cycle. This combination provides five different executions, of which only one is the correct one.
4.6 Conclusions
We have designed a dynamically reconfigurable co-processor that is especially easy to target from the point of view of the compiler. This is an important feature when the application requires dynamic reconfiguration of the available resources. In particular, it is well suited for enhancing attack resistance in embedded systems.
The experimental results show at least three orders of magnitude improvement in DPA resistance for a custom KeeLoq implementation, using only static techniques. The performance results on a DSP benchmark show a mean speedup of 4 while only doubling the total area when connected to a simple RISC processor, and also a significant reduction in instruction fetches. These results support this design as a better alternative for exploiting instruction-level parallelism in embedded systems, especially when compared with superscalar processors, VLIW processors, or multiprocessors, as well as providing resources for securing sensitive algorithms.
The advantages become more obvious when considering energy consumption in battery-powered sensor nodes. As opposed to typical FPGA devices, where interconnect power consumption is dominant, SORU2 units force a predefined pipelined dataflow, significantly reducing the length of interconnections and the load capacitance. The loss in flexibility is largely compensated by the ease of mapping complete loops into a single SORU2 instruction, enabling the use of dynamic context-aware optimization techniques.
Moreover, our SORU2-based approach to avoid side-channel attacks allows a high degree of decoupling between the application development and the security-aware implementation, taking into account architecture and run-time issues.
Although a non-security-aware KeeLoq implementation on SORU2 is completely vulnerable, we have presented a set of KeeLoq configurations that can be set up from the main controller to make the algorithm more robust against SCA, with no speed overhead compared to the basic KeeLoq configuration.
Chapter 5
Countermeasure proposal II: existing compiler optimization
This chapter describes a methodology and framework proposal to evaluate the effect of existing compiler optimizations on the resistance against SCA of encryption software implementations.
The framework is used to evaluate the effect of optimization passes on the resistance against SCA of KeeLoq software implementations.
We propose a scheme to use a combination of optimization passes to increase the security of software implementations against SCA.
5.1 Introduction
The compiler controls the final implementation executed by the device, including the assembly instructions and the hardware resources used (floating point unit, vector processor).
Software countermeasures require less investment than developing new hardware countermeasures. However, they demand effort from developers to create countermeasures for each algorithm, and the result is unpredictable because a compiler maps high-level implementations (which may be secure) to final machine code and can introduce leakages in the process. The compiler is therefore critical for software countermeasures. The compiling process of LLVM, similar to that of any compiler, is depicted in Figure 3.5 of Section 3.2. For low-end devices it is divided into three stages: front-end, optimizer and back-end. All three transform the source code and might introduce exploitable leakage sources, depending on the selected target.
In Section 2.2.3 we describe proposals to use compiler optimizations to automatically check the resistance against SCA or to automatically introduce countermeasures to protect the implementation. However, as stated in [13], side-effects not known by the compiler might be introduced, reducing the effect of the countermeasures or the confidence in the analysis.
Moreover, even if the optimizer takes the side-effects into account, the countermeasures must be applied at the back-end (as proposed in [19]), as the translation from IR to assembly language might introduce exploitable side-effects. It might generate data-dependent branches with variations in execution time, which can be exploited with timing analysis or SPA, or it might process sensitive information always at the same instant, which is exploitable using differential SCA. Additionally, the effect of an optimization pass that automatically introduces countermeasures, such as software precharging or Boolean masking, can be removed by a subsequent optimization pass, as the operations added are not relevant for the final result.
The authors in [52] evaluate a proof-of-concept prototype to study how back-end compilation and optimizations affect the information leaked by an x86 processor under timing attacks. Their conclusion is that optimizations introduce asymmetries that attackers can use to obtain sensitive data with SPA, so they must be disabled. However, FSR-based algorithms are typically used in low-end embedded devices, where optimizing performance is important to get the most out of the hardware.
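The last point can be illustrated with a toy example. The sketch below is not an LLVM pass and does not model any real optimizer: `simplify_xor_chain` is a hypothetical stand-in for the kind of algebraic simplification (`a ^ a = 0`) that optimizers apply, and the operand names are chosen only for the example. It shows why Boolean masking is fragile: the mask does not change the final result, so a simplification that preserves the result is free to remove it.

```python
from collections import Counter

def masked_xor(data, key, mask):
    """Boolean-masked version of data ^ key."""
    masked = data ^ mask   # the intermediate value no longer equals data
    mixed = masked ^ key   # sensitive operation performed on masked data
    return mixed ^ mask    # unmask to obtain the final result

def simplify_xor_chain(operands):
    """Toy simplification: operands appearing an even number of times cancel."""
    counts = Counter(operands)
    return [name for name, n in counts.items() if n % 2 == 1]

# The mask has no influence on the final value ...
print(masked_xor(0xA5, 0x3C, 0x5F) == 0xA5 ^ 0x3C)

# ... so a result-preserving simplification removes it from the expression.
print(simplify_xor_chain(["data", "mask", "key", "mask"]))
```

After the simplification the unmasked intermediate `data ^ key` is computed directly, which is exactly the value the countermeasure was meant to hide.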
In this chapter, we prepare a proof-of-concept scenario to evaluate the effect of compiler optimizations on the resistance against SCA and we propose a strategy to generate more robust implementations using available compiler optimizations.