Conclusions - Securing implementations of feedback-shift-register-based ciphers using compiler

Implementing the system using a non-deterministic processor or with secured logic would be a good countermeasure for high-volume systems. However, it might not be suitable for already deployed systems, and the cost difference between these solutions and the low-power devices typically used today is high.

In this chapter we have presented several existing proposals for masking countermeasures at different levels. Automatic insertion of masking has been proposed for hardware countermeasures, at the logic level, and for software countermeasures, at the compiler level. However, implementations secured with masking are vulnerable to higher-order DPA.

Countermeasures in the time dimension include shuffling operations or inserting dummy operations [179]. The sequential nature of the algorithm limits what can be achieved with randomized execution, although these limits widen when it is combined with dummy instruction insertion. These countermeasures must behave differently in every execution; otherwise, a CPA would succeed. The authors of [114] note that these countermeasures do not provide a high level of protection against SCA. Windowing reduces the effect of shuffling, and the efficiency of dummy instructions can be reduced with elastic alignment [180].

In [102], the authors evaluate the efficiency of inserting random dummy round executions in AES. Through detailed observation of the traces, they distinguish productive rounds from dummy ones, which allows them to discard the latter and reduces the efficiency of the countermeasure. The differences were related to the decision of whether or not to perform the next phase (that is, whether the current round is the last one).

We focus our work on creating new hiding countermeasures by randomizing the execution of the algorithm using resources that are not exclusively dedicated to the encryption. We present novel ideas that are more resistant to both SPA and DSCA. When using dummy executions, we avoid any difference between them and the real executions.

Chapter 3

Methodology

An analysis framework is required in order to evaluate the impact of the countermeasures proposed in this Ph.D. thesis. Using the same framework with different implementations is mandatory for comparing resistance against SCA. The absolute result of an attack cannot be compared with external results obtained with other attacks, targets or leakage models.

A unified framework to compare countermeasures and attackers was proposed in [165] with five steps.

1. Define the implementation: in software it would be the machine binary code used.

2. Define the target: it is the device executing the algorithm and the SNR of the traces measured from the execution of an implementation on the target.

3. Evaluate the information: estimate the amount of exploitable information contained in the traces through a concrete template-like attack. They use MIA for this purpose.

4. Define the adversary: define the setup of the attack, including the specification of the implementation of the steps required: model estimation, measurement, pre-processing and alignment, and statistical analysis.

5. Evaluate the security: obtain the success rate of the side-channel key recovery adversary algorithm to establish how successfully an adversary can exploit leakage in a practical attack.

We propose an analysis framework that implements steps 1, 2, 4 and 5 in order to evaluate the security of our proposals. We avoid the estimation of the information because it is not needed for the security evaluation step.
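As a rough illustration of the security evaluation in step 5, the success rate can be estimated as the fraction of repeated attack experiments in which the correct key appears among the best-ranked candidates. The function name `success_rate`, the `order` parameter and the list of ranks below are assumptions made for this sketch; they are not taken from [165].

```python
def success_rate(key_ranks, order=1):
    """Fraction of attack experiments in which the correct key is ranked
    within the first `order` candidates (rank 1 is the best candidate)."""
    if not key_ranks:
        raise ValueError("at least one experiment is required")
    hits = sum(1 for rank in key_ranks if rank <= order)
    return hits / len(key_ranks)


# Hypothetical ranks of the correct key in five repeated attacks.
ranks = [1, 1, 3, 1, 2]
first_order = success_rate(ranks)            # 3 of the 5 attacks succeed
second_order = success_rate(ranks, order=2)  # 4 of the 5 attacks succeed
```

Repeating the attack over independent sets of traces is what turns a single attack outcome into a rate that can be compared between implementations.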

Figure 3.1 depicts the analysis framework proposed for the SCA resistance evaluation. The evaluation process begins with the selection of the software implementations of an encryption algorithm that are going to be analyzed. Each software implementation is compiled to the machine code of the target device. The framework includes the possibility of not defining the concrete target and using the compiler's own intermediate representation, if it can be interpreted and simulated. The preliminary evaluation could be done using this approach, although effects introduced by the target device would not be taken into account.

Figure 3.1: Proposed analysis framework for SCA resistance (blocks: algorithm implementations, compiler, optimizer, assembly, analysis, keys, data, device, simulator, leakage model, traces, pre-processing, multiple SCA, SCA report)

The compiler applies a customizable sequence of optimization passes to the implementation. The analysis framework enables the evaluation of one algorithm implementation with multiple optimization sequences, as each sequence might change the final machine implementation. During the compiler stage, the framework compares the implementations that result from applying the different optimization sequences in order to remove identical results from the evaluation process.
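The removal of identical results can be sketched as follows. This is only an illustration under stated assumptions: it treats two results as identical when their assembly text matches exactly, and the function name `unique_implementations`, the pass sequences and the assembly listings are hypothetical placeholders rather than output of the framework.

```python
import hashlib


def unique_implementations(assembly_by_sequence):
    """Group optimization sequences by the assembly text they produce and
    keep one representative sequence per distinct result."""
    representatives = {}
    for sequence, assembly in assembly_by_sequence.items():
        digest = hashlib.sha256(assembly.encode("utf-8")).hexdigest()
        # setdefault keeps the first sequence seen for each distinct output.
        representatives.setdefault(digest, sequence)
    return sorted(representatives.values())


# Hypothetical pass sequences and placeholder assembly listings.
results = {
    ("mem2reg",): "mov r12, r13\nret",
    ("mem2reg", "instcombine"): "mov r12, r13\nret",
    ("mem2reg", "loop-unroll"): "mov r12, r13\nadd r13, r13\nret",
}
kept = unique_implementations(results)
```

Only the representatives in `kept` would need to go through trace generation and SCA, which matters because every additional implementation multiplies the number of executions or simulations.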

The result is a set of assembly language programs to be evaluated, each with the corresponding analysis performed at compiler level. The set of binary implementations of the algorithm is used to generate input traces for the SCA. There are two alternatives for the generation of traces: simulation and execution.

The simulation can be done by interpreting the intermediate representation of the compiler, as stated above, or by using a simulator of a concrete target device. Both cases require assumptions about the information leaked during the execution, provided by a Leakage Model. On the one hand, a more accurate simulation provides information closer to reality. On the other hand, more accurate simulations generate more information, which needs to be manipulated in order to perform an attack. Applying differential SCA on simpler simulations requires less time and fewer resources. However, some side-channel effects are ignored by the model and cannot be exploited.
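A leakage model of this kind can be sketched in a few lines. The example below shows the two models used later in this chapter, Hamming Weight and Hamming Distance, applied to a sequence of values written to one register. The function names and the sample values are assumptions chosen for illustration; they do not describe the simulator used in the experiments.

```python
def hamming_weight(value):
    """Number of bits set to 1 in a non-negative integer."""
    return bin(value).count("1")


def hamming_distance(previous, current):
    """Number of bit positions that differ between two register values."""
    return hamming_weight(previous ^ current)


def simulate_leakage(values, model="hw", initial=0):
    """Map a sequence of processed values to one leakage sample each."""
    samples = []
    previous = initial
    for value in values:
        if model == "hw":
            samples.append(hamming_weight(value))
        elif model == "hd":
            samples.append(hamming_distance(previous, value))
        else:
            raise ValueError("model must be 'hw' or 'hd'")
        previous = value
    return samples


# Hypothetical 16-bit values written to one register in consecutive cycles.
values = [0x00FF, 0x0F0F, 0xFFFF]
hw_trace = simulate_leakage(values, model="hw")  # [8, 8, 16]
hd_trace = simulate_leakage(values, model="hd")  # [8, 8, 8]
```

The two traces differ for the same data, which shows why the choice of model has to match the kind of implementation being evaluated.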

A real device can be used to obtain the traces. An experimental setup includes the target and the instruments needed to measure the side-channel leakage. Considering PAA, the target must be adapted to the measurement setup, removing the capacitors of its power supply when possible and introducing a resistor in the power supply path (it is typically introduced in the ground line). An oscilloscope measures the voltage drop on the resistor, which depends on the current sunk by the target device. The execution of the algorithm with different keys and data must be done automatically, and it must be synchronized with the oscilloscope to capture the region of interest of the power trace selecting the trigger carefully. Therefore, a debugger with breakpoint management and capability to write in a concrete data memory address is required to synchronize oscilloscope and execution, load a selected input data and wait for the oscilloscope to store the measurements.

Obtaining traces from a real device requires a more complex experimental setup and provides noisy traces with a very high time resolution (a large amount of data). The noise has different sources: functional units that consume power and are not related to the encryption algorithm, electrical noise in the resistor, or the connection of the probe to the platform. The raw data is very difficult to use as input of an SCA, so it must be pre-processed to be exploited correctly. Pre-processing includes the alignment of traces, the use of filters, the integration of the power trace over an established period (a clock cycle), peak extraction or normalization. Pre-processing, like simulation, prepares data for an SCA, although it might remove side effects that could be exploited by other SCA.
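Three of the pre-processing steps mentioned above (integration over a clock cycle, peak extraction and normalization) can be sketched as follows. This is a minimal sketch that assumes the traces are already aligned and that the number of samples per clock cycle is known and constant; the function names and the sample trace are hypothetical.

```python
def integrate_per_cycle(trace, samples_per_cycle):
    """Sum the samples of each complete clock cycle into one value."""
    usable = len(trace) - len(trace) % samples_per_cycle
    return [sum(trace[i:i + samples_per_cycle])
            for i in range(0, usable, samples_per_cycle)]


def peaks_per_cycle(trace, samples_per_cycle):
    """Keep only the largest sample of each complete clock cycle."""
    usable = len(trace) - len(trace) % samples_per_cycle
    return [max(trace[i:i + samples_per_cycle])
            for i in range(0, usable, samples_per_cycle)]


def normalize(trace):
    """Shift and scale a trace to zero mean and unit standard deviation."""
    mean = sum(trace) / len(trace)
    variance = sum((x - mean) ** 2 for x in trace) / len(trace)
    if variance == 0:
        return [0.0 for _ in trace]
    std = variance ** 0.5
    return [(x - mean) / std for x in trace]


# Hypothetical raw trace: 4 samples per cycle, 2 complete cycles, 1 leftover.
raw = [1, 5, 2, 0, 3, 9, 4, 0, 7]
integrated = integrate_per_cycle(raw, 4)  # [8, 16]
peaks = peaks_per_cycle(raw, 4)           # [5, 9]
```

Both reductions leave one value per clock cycle, which is the granularity at which the leakage models of this chapter make their predictions.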

The framework defines an interface to interchangeably call the simulator or the device. The experiments require coordination software, which has been developed using Expect, a tool for automating interactive command-line applications. Both options must be supported, with two different scripts that interact with the simulator and the debugger respectively.

The analysis done by the compiler and the traces are inputs of an SCA tool suite. It generates a report about the security level of the algorithm against the different SCA evaluated and about the performance features, including code size and execution time. For SCA evaluation it is mandatory to check at least timing and differential SCA. The results of the attacks are written to a report which, in the feedback stage, provides information for the modification of the compiler, optimizer or source code. An aggregation function can be developed in the feedback stage to automatically select the most appropriate sequences of optimizations and high-level implementation to be used. The aggregation function requires the definition of a metric on the features to be considered in the selection: performance, timing leakage, resistance against DSCA, code size, etc. It is mandatory that the aggregation function is not just a linear combination of the resulting values, as there are security conditions that should never be accepted, even if the metrics for the other features show very good results. We leave the definition of an appropriate aggregation function for future research, although we provide metrics for the individual features.
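One possible shape for such a non-linear aggregation function is a weighted cost combined with vetoes, so that an unacceptable security result rejects an implementation regardless of its other metrics. The sketch below is purely hypothetical: the metric names, weights, thresholds and candidate values are invented for illustration, and the definition of an appropriate function remains open.

```python
def aggregate(metrics, weights, vetoes):
    """Score an implementation; lower is better.

    A veto rejects the implementation outright, so a weak security property
    cannot be compensated by good values in the other features.
    """
    for name, is_unacceptable in vetoes.items():
        if is_unacceptable(metrics[name]):
            return float("inf")
    return sum(weights[name] * metrics[name] for name in weights)


# Hypothetical metrics for two optimization sequences.
candidates = {
    "seq_a": {"cycles": 9000, "code_size": 300,
              "timing_leak": 1, "traces_to_break": 5000},
    "seq_b": {"cycles": 12000, "code_size": 350,
              "timing_leak": 0, "traces_to_break": 8000},
}
weights = {"cycles": 1.0, "code_size": 2.0}
vetoes = {
    "timing_leak": lambda leak: leak > 0,        # any timing leakage rejects
    "traces_to_break": lambda n: n < 1000,       # too easy to attack with CPA
}
best = min(candidates,
           key=lambda name: aggregate(candidates[name], weights, vetoes))
```

Here `seq_a` is faster and smaller, but it is rejected because of its timing leakage, so `seq_b` is selected.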


This is the analysis framework designed for this Ph.D. thesis. The following sections describe the concrete elements used for the different processes of the framework. The FSR-based algorithm used in the experiments is KeeLoq. The compiler and optimizer selected is LLVM. The target device is the MSP430, which can be both executed and simulated. The SCA resistance is evaluated with timing analysis and CPA. The leakage model used is the Hamming Weight model for software implementations and the Hamming Distance model for hardware implementations. The models are used both in the simulator and by the SCA attacks.

The experiments presented in this Ph.D. thesis use the simulator of the MSP430 instead of a real device. Statistical attacks typically require thousands of executions on a real device, even on simple devices such as the MSP430 [61]. Measuring a real device would therefore be too slow for obtaining feedback about the SCA resistance of a large number of implementations on the same device (applying different combinations of optimization passes).

Using the simulator provides some advantages when compared to implementations in real devices:

• The setup is very simple, avoiding the use of oscilloscopes or EM probes

• Traces are perfectly synchronized and there is no clock deviation

• Time used for setup is drastically reduced (no oscilloscope, no binary loader, no trigger, etc.)

• Traces are expected to be better correlated with information leaked, as the effects of other peripherals are eliminated

• The model of the SCA matches precisely the model of the traces. It is consistent with the recommendations proposed in [165] about a very high-skilled adversary.

3.1 KeeLoq

KeeLoq is a solution patented by Microchip that has been used to implement RKE since the mid-1990s, providing access and security to systems using wireless communications. It is the most popular of such systems in Europe and the US for garage door openers, with multiple users authorized by the system, and it is also widely used in automotive applications. It is a unidirectional transmission system. The user carries a transmitter and sends a message that is validated by the receiver to provide the requested functionality.

The basic features of KeeLoq Unidirectional Transmission are:

• 66-bit transmission length (32-bit hop code, 34-bit fixed code)

• 2 to 5 status bits

• Multiple functions per transmitter (up to 15)

• Transparent synchronization

Figure 3.2: KeeLoq packet structure. Retrieved from [177]

Figure 3.3: KeeLoq structure: (a) Encryption, (b) Decryption. Retrieved from [187]

Figure 3.2 provides information about the packet structure that is sent for every user request. The 32-bit hop code was originally encrypted from input data using an NLFSR-based encryption algorithm. It provides more than 4 billion code combinations.

The original encryption algorithm is also known as the KeeLoq encryption algorithm. It is an NLFSR-based block encryption algorithm with a 64-bit symmetric key and 32-bit input and output blocks. Figure 3.3 depicts the structure of the encryption and decryption round. The round is performed 528 times to complete the process.
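The round structure can be sketched in software as follows. The block size, key size and number of rounds come from the text above. The tap positions and the non-linear function constant `0x3A5C742E` are not given in this chapter; they are taken from publicly available descriptions of KeeLoq and should be read as assumptions, as should the function names. This is an illustrative sketch, not the manufacturer's implementation.

```python
KEELOQ_NLF = 0x3A5C742E  # assumed truth table of the 5-input non-linear function
ROUNDS = 528


def _bit(value, position):
    """Extract one bit of an integer."""
    return (value >> position) & 1


def _nlf(state, a, b, c, d, e):
    """Evaluate the non-linear function on five state bits (a is the index LSB)."""
    index = (_bit(state, a) | _bit(state, b) << 1 | _bit(state, c) << 2
             | _bit(state, d) << 3 | _bit(state, e) << 4)
    return _bit(KEELOQ_NLF, index)


def keeloq_encrypt(block, key):
    """Encrypt a 32-bit block with a 64-bit key by shifting right 528 times."""
    state = block & 0xFFFFFFFF
    for r in range(ROUNDS):
        feedback = (_bit(state, 0) ^ _bit(state, 16) ^ _bit(key, r & 63)
                    ^ _nlf(state, 1, 9, 20, 26, 31))
        state = (state >> 1) | (feedback << 31)
    return state


def keeloq_decrypt(block, key):
    """Invert keeloq_encrypt by shifting left and using the key bits in reverse."""
    state = block & 0xFFFFFFFF
    for r in range(ROUNDS):
        feedback = (_bit(state, 31) ^ _bit(state, 15) ^ _bit(key, (15 - r) & 63)
                    ^ _nlf(state, 0, 8, 19, 25, 30))
        state = ((state << 1) & 0xFFFFFFFF) | feedback
    return state


# Hypothetical key and block, used only to check that decryption inverts encryption.
key = 0x0123456789ABCDEF
block = 0xCAFEBABE
recovered = keeloq_decrypt(keeloq_encrypt(block, key), key)
```

Since 528 = 8 × 64 + 16, the last encryption round uses key bit 15, which is why the decryption loop in this sketch starts at that bit and walks backwards.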

Although the system behaves like a stream cipher, this behavior is achieved by using a block cipher to encrypt a 16-bit counter that is synchronized between transmitter and receiver. It was hardware oriented in conception, although a software implementation is provided by the manufacturer and is widely used in industry. Recently, Microchip provided new solutions with the same scheme, using different block encryption algorithms, including XTEA and AES.

The 64-bit key is not assigned arbitrarily. It is obtained by encrypting the serial number of the device using the KeeLoq encryption algorithm and a “Manufacturer key”. The transmitter stores the serial number and the generated key, while the receiver also stores the “Manufacturer key” to insert new transmitters in the system. Figure 3.4 depicts the process of deriving the key of a transmitter during the learning process.

Figure 3.4: KeeLoq learning mechanism. Retrieved from [177]

The KeeLoq encryption algorithm is known to be vulnerable to logic attacks where only input and output data might be known [91, 1]. Additionally, SCA methods have been proven to be successful in guessing the key used by the algorithm in real devices with KeeLoq code hopping [65, 138] in both hardware and software implementations, as it has been described in Section 2.3.

An attack on a single device might provide information about the symmetric key used by a concrete user. However, the attacks presented in [138, 97] obtain the manufacturer key by attacking the software implementation of the receiver in the learning process.

The first SCA applied to KeeLoq was published in [65]. The authors describe how secret keys can be revealed in practice from the power consumption of KeeLoq implementations in hardware and software. The first step is to simulate the power consumption of a hardware implementation, using the Hamming Distance (HD) model, and attack it using CPA. The intermediate value selected is that of round 6, with 2^6 key candidates. This shows the potential of the attack under perfect conditions, a best-case scenario from the attacker's point of view. The correlation factor obtained for the correct key is 1, and its curve stands out from those of the other candidate keys.
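A simulated attack of this kind can be sketched as follows. This is a minimal sketch under several assumptions that are not taken from [65]: the leakage sample is the Hamming Distance between the state register before and after round 6, the simulation is noise-free, and the tap positions and non-linear function constant come from public descriptions of KeeLoq. All names, the secret key bits and the number of traces are hypothetical.

```python
import random

KEELOQ_NLF = 0x3A5C742E  # assumed truth table of the non-linear function


def bit(value, position):
    return (value >> position) & 1


def hd_leakage(block, key_bits, rounds=6):
    """Hamming Distance between the register before and after the last round."""
    state = previous = block
    for r in range(rounds):
        index = (bit(state, 1) | bit(state, 9) << 1 | bit(state, 20) << 2
                 | bit(state, 26) << 3 | bit(state, 31) << 4)
        feedback = (bit(state, 0) ^ bit(state, 16) ^ bit(key_bits, r)
                    ^ bit(KEELOQ_NLF, index))
        previous, state = state, (state >> 1) | (feedback << 31)
    return bin(previous ^ state).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def cpa(blocks, trace, rounds=6):
    """Correlate the trace with the prediction of each of the 2^rounds candidates."""
    scores = {}
    for guess in range(1 << rounds):
        predictions = [hd_leakage(block, guess, rounds) for block in blocks]
        scores[guess] = pearson(predictions, trace)
    return scores


rng = random.Random(1)
secret_bits = 0b101101  # hypothetical key bits used in rounds 1 to 6
blocks = [rng.getrandbits(32) for _ in range(500)]
trace = [hd_leakage(block, secret_bits) for block in blocks]  # noise-free
scores = cpa(blocks, trace)
best_guess = max(scores, key=scores.get)
```

Because the simulated trace and the prediction for the correct candidate are generated by the same model, their correlation is 1 up to rounding, which reflects the best-case scenario described above.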

The practical attack is performed on a real device without knowledge of the implementation, although the algorithm is known. Using SPA visual inspection, the authors locate the interesting section of the power traces, the 528 rounds. They align the traces and extract peaks from the signal for every clock cycle. With the resulting reduced set of samples, the CPA is performed using the HD model for the hardware implementation and the HW model for the software implementation. CPA on the software implementation requires 10000 traces, while 10 are enough for the hardware implementation.

The software implementation used contains data-dependent execution time for each round of a KeeLoq decryption, so an accumulated delay complicates the CPA as the number of executed rounds increases. However, as we have seen in Section 2.1.1, an SPA exploits the data-dependent instruction paths. In [97], an SPA is performed characterizing the time needed by the controller to complete a round. All the implementations evaluated had a relation between the time elapsed and one of the state bits. In order to detect the beginning and the end of each round, they correlated windows of the trace to detect patterns between rounds. Using this method it is possible to break KeeLoq using SPA with only one sample.

Figure 3.5: LLVM compilation flow (programming language: C/C++, Obj-C, Java, Fortran, ...; front-end: LLVM-GCC, Clang; LLVM bitcode with optimizer passes; back-end; target machine code: X86, ARM, PowerPC, ...; JIT/interpreter execution: X86, PowerPC, ARM)

The impact of the attacks on real KeeLoq devices and the widespread use of the software implementation of this encryption algorithm make it a suitable target for the countermeasure proposals of this Ph.D. thesis.

3.2 LLVM

LLVM (Low Level Virtual Machine) was first presented in [106]. It is an optimization-centered compiler. Figure 3.5 depicts the process of compiling with LLVM, where front-end and back-end are decoupled. The front-end compiles from a high-level language (C, Java) to an Intermediate Representation (IR). The IR in LLVM is LLVM bitcode, and it is executable in the Virtual Machine with lli, the LLVM interpreter. Optimization passes are applied to the LLVM bitcode one after the other to generate an optimized bitcode. The optimized code can be executed with lli or compiled to generate assembly code for a target, including low-cost and low-power processors such as MSP430, ARM7TDMI or PIC16.

The main advantages of using LLVM are its modularity, the tools and templates available, and the JIT execution using lli. GCC is an interesting alternative. However, the source code of GCC is cumbersome to manipulate as far as optimizations and code generation are concerned. LLVM provides templates and instructions to implement optimization passes affecting Basic Blocks, Functions or Loops. The templates include the automatic execution of required optimization passes that perform a preliminary analysis of the code, in case the required optimization has not been applied previously or its output has been invalidated by other optimization passes.

LLVM provides tools for traversing Control Flow Graphs (CFGs) and the Directed Acyclic Graphs (DAGs) with the instruction set information of Basic Blocks. LLVM also includes a mechanism, called intrinsics, to extend the basic functions of the IR language: intrinsics are easy to define and are treated as available functions. Adding new intrinsics does not require changing all of the transformations in LLVM. LLVM has intrinsics for several purposes, including variable argument handling, stack management or standard C library calls.

The possibility of using the interpreter is equivalent to using a high-level simulator of the algorithm implementation. With a few modifications, the LLVM interpreter generates power consumption traces for Arithmetic Logic Unit (ALU) and memory access operations, that is, the information leaked from data and address buses and registers.

This solution provides some advantages when compared to a concrete target simulator:

• The measurements are even faster

• Traces are expected to be more correlated with the information leaked: since the number of registers available is unbounded, there is no data-dependent timing deviation

• This is the worst case from the defender point of view. A small improvement in this scenario might be a large one in real targets.

A solution based on the LLVM interpreter model does not take into account effects introduced by the back-end, which might change the implementation. This is an advantage when evaluating optimization passes without a concrete target, as only the effects of the optimizations are taken into account. However, when a concrete target is specified, the back-end effects are critical. This is especially important considering the typical microcontrollers used in WSN, which are 8-bit (PIC16, PIC18, ATMEGA) or 16-bit (MSP430), while LLVM is a 32-bit virtual machine. In the case of timing or template attacks, this solution is completely invalid, as the back-end effect on the implementation is critical.

3.3 MSP430

According to the automatic analysis framework described in this Chapter, the target device should have a simulator, there should be mechanisms to automatically control real devices and it should be supported by the compiler chosen for optimization analysis, which is LLVM.

The selected algorithm, KeeLoq, and other FSR-based encryption algorithms are used in embedded systems. The core of these systems is typically a low-end Microcontroller Unit (MCU), with a 4-bit, 8-bit or 16-bit CPU and a small amount of RAM and FLASH memory. Their memory hierarchy includes two levels: registers and RAM.
