Evaluation of optimization passes - Securing implementations of feedback-shift-register-based c

In order to evaluate the resistance of a software implementation of an algorithm, we use the framework described in Chapter 3. A binary program is generated from the software implementation; a simulator executes the binary program and generates power traces; the traces are then used to mount SCA. The number of traces required by a DSCA indicates the resistance of the implementation against DSCA.

There are many optimization passes available. LLVM includes two kinds of passes which might have an effect on the final machine code generated: transform and analysis. Other compilers have equivalent optimizations available, although the names might be different. All transform passes mutate the program in some way. Their input is an IR of the program together with its associated metadata, which holds information about properties of the program. According to the goal of the optimization pass, the IR and the metadata are used to detect blocks that can be modified, transform the program and remove the affected metadata. Examples of transform passes are mem2reg, which reduces access to memory by using registers, loop-unroll, which removes the loop structure of short loops by replicating the loop body, and instcombine, which combines instructions to form fewer, simple instructions.

Analysis passes do not modify the code. They compute information that other passes can employ or that can be used for debugging purposes, and they generate metadata with this information. Some transform passes require a sequence of analysis passes before executing: the optimizer invokes them automatically when the analysis pass is not explicitly invoked. Examples of analysis passes are da, which is used to detect dependencies in memory accesses, loops, which is used to identify natural loops, and scalar-evolution, which is used to analyze and categorize scalar expressions in loops in order to recognize general induction variables.

On the one hand, a transform pass might generate different optimized code depending on the optimization passes invoked before it. On the other hand, the effect of an optimization pass might be undone by subsequent transform passes. Therefore, evaluating individual optimization passes in isolation is not an appropriate way to assess the effect of compiler optimizations on the resistance against SCA. Evaluating sequences of optimization passes is required.

The LLVM optimizer has 61 transform passes and 46 analysis passes, which combined provide a vast number of optimization sequences. The binary program resulting from each sequence is considered a different software implementation. Evaluating every resulting software implementation using the framework described above is unaffordable. We need to extract some preliminary results from small combinations of optimization passes. This reduced set of optimization sequences provides a first idea of the effect of compiler optimization on SCA resistance.
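To give a feeling for how quickly the search space grows, the following sketch counts ordered sequences of distinct transform passes up to a given length. It is only an illustration under simplifying assumptions: it ignores the analysis passes and does not allow a pass to be repeated, so the real number of sequences is larger. The function name count_sequences is hypothetical.

```python
from math import perm


def count_sequences(n_passes: int, max_len: int) -> int:
    """Count ordered sequences of 1..max_len distinct passes out of n_passes.

    ASSUMPTION: passes are not repeated within a sequence, which makes
    this a lower bound on the number of possible sequences.
    """
    return sum(perm(n_passes, k) for k in range(1, max_len + 1))


if __name__ == "__main__":
    # 61 is the number of transform passes quoted in the text.
    for max_len in (1, 2, 3, 5):
        print(max_len, count_sequences(61, max_len))
```

With sequences of at most three passes there are already 219661 candidates, each of which would need its own set of simulated traces.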

We use the standard compile optimizations of LLVM, a sequence of 66 optimization passes, some of them repeated. The standard compile optimization sequence is typically used to optimize the output from the C front-end. We evaluate which of its subsequences offers the best resistance against SCA.

The concrete sequence of optimizations used is obtained with the command

llvm-as < /dev/null | \
opt -std-compile-opts -disable-output -debug-pass=Arguments

The generation of binaries starts by creating an unoptimized LLVM bitcode output (bitcode0). We propose an algorithm to generate subsequences from the original sequence, with two configurable parameters: direction and method. The direction parameter indicates whether the algorithm starts from an empty sequence and adds passes from the original sequence to build the subsequences (positive), or begins with the original sequence and removes passes from it (negative). The method parameter indicates whether each subsequence is calculated from the previous subsequence (accumulative) or as if it were the first optimization pass found (individual). Appendix A gives more details on the algorithms and the optimization passes involved.
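The four parameter combinations can be sketched as follows. This is one possible reading of the description above, not the algorithm of Appendix A: the function name generate_subsequences is hypothetical, and the order in which passes are removed in the negative accumulative case (from the front of the sequence) is an assumption.

```python
from typing import List


def generate_subsequences(passes: List[str], direction: str, method: str) -> List[List[str]]:
    """Return candidate subsequences of `passes` for one parameter combination.

    direction: "positive" (build up from the empty sequence) or
               "negative" (strip passes from the full sequence).
    method:    "accumulative" (step i builds on step i-1) or
               "individual" (step i is computed on its own).
    """
    n = len(passes)
    if direction == "positive" and method == "accumulative":
        # Step i applies the first i passes.
        return [passes[:i] for i in range(1, n + 1)]
    if direction == "positive" and method == "individual":
        # Step i applies only the i-th pass.
        return [[p] for p in passes]
    if direction == "negative" and method == "accumulative":
        # ASSUMPTION: one more pass is dropped from the front at each step.
        return [passes[i:] for i in range(1, n + 1)]
    if direction == "negative" and method == "individual":
        # Step i applies the full sequence without its i-th pass.
        return [passes[:i] + passes[i + 1:] for i in range(n)]
    raise ValueError(f"unknown parameters: {direction!r}, {method!r}")


if __name__ == "__main__":
    demo = ["mem2reg", "instcombine", "gvn"]
    for d in ("positive", "negative"):
        for m in ("accumulative", "individual"):
            print(d, m, generate_subsequences(demo, d, m))
```

Each combination yields a number of subsequences that grows linearly with the length of the original sequence, which keeps the number of experiments manageable.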

Figure 5.1 depicts the process followed in the generation of optimized bitcode for the experiment with the parameters positive accumulative.

In the positive accumulative algorithm, the first optimization pass is applied to the original non-optimized bitcode generating an optimized output (bitcode1). A second pass is applied to bitcode1 to generate bitcode2, and so forth, creating a set of N + 1 bitcodes, N of them corresponding to the output of the N optimization passes and another one to the bitcode generated by the front-end without optimizations.

It is important to understand that the i-th bitcode is the result of successively applying the first i optimization passes to the first unoptimized bitcode output, and not the result of applying the i-th optimization alone.
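The chaining described above can be sketched as a list of opt invocations, each one reading the previous bitcode and writing the next. The sketch only builds the command lines and does not run them. The file names (bitcode0.bc, bitcode1.bc, ...) and the function name chained_opt_commands are hypothetical, and the flag style simply follows the opt command quoted earlier in this section.

```python
from typing import List


def chained_opt_commands(passes: List[str], prefix: str = "bitcode") -> List[List[str]]:
    """Build one opt command per pass for the positive accumulative case.

    Command i reads <prefix>{i-1}.bc and writes <prefix>{i}.bc, so the i-th
    output accumulates the effect of the first i passes.
    ASSUMPTION: file naming is illustrative only.
    """
    commands = []
    for i, name in enumerate(passes, start=1):
        commands.append(
            ["opt", f"-{name}", f"{prefix}{i - 1}.bc", "-o", f"{prefix}{i}.bc"]
        )
    return commands


if __name__ == "__main__":
    for cmd in chained_opt_commands(["mem2reg", "instcombine", "gvn"]):
        print(" ".join(cmd))
```

For N passes this yields N commands and, together with the unoptimized bitcode0, the N + 1 bitcodes mentioned in the text.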

Chapter 5. Countermeasure proposal II: existing compiler optimization

[Figure 5.1 diagram. Box labels: Algorithm (C language); Front-End (LLVM-GCC); LLVM bitcode0; opt Pass1, opt Pass2, …, opt PassN; LLVM bitcode1, LLVM bitcode2, …, LLVM bitcodeN; CFG filtering.]

Figure 5.1: Optimized bitcode generation process: positive accumulative

Some of the N + 1 outputs might be equal: the analysis passes do not modify the bitcode, and a transform pass might not be able to optimize the input bitcode. The set of N + 1 binaries is filtered in order to avoid the evaluation of identical binaries: those with a CFG identical to that of an already generated bitcode are detected, the relationship is recorded, and the duplicate bitcode is not evaluated.

The algorithm is repeated covering the four possible combinations of the two parameters, comparing the CFG with previous subsequences to avoid recalculation of SCA experiments.
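The filtering step can be sketched as a simple de-duplication keyed on a CFG fingerprint. How the CFG is extracted and compared is not covered here: the sketch assumes that each bitcode has already been reduced to a canonical string (a hypothetical fingerprint), and the function name filter_by_cfg is also hypothetical.

```python
from typing import Dict, List, Tuple


def filter_by_cfg(fingerprints: Dict[str, str]) -> Tuple[List[str], Dict[str, str]]:
    """Split bitcodes into those to evaluate and those already covered.

    `fingerprints` maps a bitcode name to a canonical CFG string
    (ASSUMPTION: such a fingerprint is available). Returns the names to
    evaluate, in generation order, and a map from each skipped name to
    the earlier name with the same CFG.
    """
    first_seen: Dict[str, str] = {}
    to_evaluate: List[str] = []
    equivalent_to: Dict[str, str] = {}
    for name, cfg in fingerprints.items():
        if cfg in first_seen:
            # Same CFG as an earlier bitcode: record the relationship only.
            equivalent_to[name] = first_seen[cfg]
        else:
            first_seen[cfg] = name
            to_evaluate.append(name)
    return to_evaluate, equivalent_to


if __name__ == "__main__":
    demo = {"bitcode0": "A", "bitcode1": "B", "bitcode2": "B", "bitcode3": "C"}
    print(filter_by_cfg(demo))
```

Keeping the equivalence map, rather than just discarding duplicates, is what later allows a result to be attributed to every subsequence that produced the same code.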

We evaluate subsequences of optimizations applied to a KeeLoq software implementation using this algorithm. Section 3.1 presents the KeeLoq algorithm and the published successful SCA on hardware and software implementations. The attacked software target was a LUT-based implementation running on an 8-bit PIC microcontroller, with the data word (the state) stored in a register. This corresponds to the KeeLoq decryption routines written in C, following Microchip’s application note TB041 [93].

The algorithm generates 51 implementations after CFG filtering. After 100 executions of each implementation, the algorithm detects 12 different implementations. The naming of the generated implementations, and of the subsequences used to build them, follows the pattern o_XXX_[pn][ai]. XXX corresponds to the order in which the algorithm generated the subsequence. [pn][ai] indicates the parameters of the algorithm that generated the subsequence. However, the same implementation can be produced by several subsequences that generate equivalent machine code. The framework saves the relationship between sequences that generate equivalent machine code in order to draw conclusions about the effect of a concrete optimization pass. Table 5.1 shows the performance evaluation of the implementations. Table A.1 in Appendix A shows the performance of the 51 implementations.
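The naming pattern can be decoded mechanically. The sketch below assumes that p and n stand for the positive and negative directions and that a and i stand for the accumulative and individual methods, which matches the parameter names used in this section; the helper parse_name is hypothetical.

```python
import re

# o_XXX_[pn][ai]: three-digit generation order, direction, method.
NAME_RE = re.compile(r"^o_(\d{3})_([pn])([ai])$")

DIRECTIONS = {"p": "positive", "n": "negative"}
METHODS = {"a": "accumulative", "i": "individual"}


def parse_name(name: str) -> dict:
    """Decode an implementation name such as 'o_011_ni'.

    ASSUMPTION: the letters map to the parameter names as in DIRECTIONS
    and METHODS above.
    """
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not an implementation name: {name!r}")
    order, direction, method = match.groups()
    return {
        "order": int(order),
        "direction": DIRECTIONS[direction],
        "method": METHODS[method],
    }


if __name__ == "__main__":
    print(parse_name("o_011_ni"))
```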

We evaluate the resistance against SCA of the different groups. We first evaluate the information leakage through timing analysis. We detect from the previous executions that there are variations in the execution time depending on the data executed. Table 5.2 shows the data dependent execution paths of the different implementations.


Implementation  Exec. cycles    Code size  Register  Memory
                min     max     (bytes)    usage     static  dynamic
o_018_pa        33233   34147   200        10        0       60
o_011_ni        36066   36931   220        11        0       62
o_012_ni        42850   43849   244        11        0       62
o_026_pa        43499   44632   252        11        0       62
o_018_ni        49287   50621   276        10        0       70
o_047_pi        61458   63690   312        7         0       70
o_009_pa        62529   65610   300        7         0       70
o_034_pi        62543   65490   316        7         0       70
o_011_pa        63067   65880   312        7         0       70
o_018_na        67091   69055   324        7         0       70
o_034_na        67685   69783   328        7         0       70
o_047_na        70379   73326   340        7         0       70

Table 5.1: Performance evaluation
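As a coarse indicator of how much the execution time varies between runs, the spread between the minimum and maximum cycle counts of Table 5.1 can be computed directly. This is only a rough proxy and not the timing analysis of this section; the values are copied from the table, and the function name cycle_spread is hypothetical.

```python
# (min, max) execution cycles per implementation, copied from Table 5.1.
CYCLES = {
    "o_018_pa": (33233, 34147),
    "o_011_ni": (36066, 36931),
    "o_012_ni": (42850, 43849),
    "o_026_pa": (43499, 44632),
    "o_018_ni": (49287, 50621),
    "o_047_pi": (61458, 63690),
    "o_009_pa": (62529, 65610),
    "o_034_pi": (62543, 65490),
    "o_011_pa": (63067, 65880),
    "o_018_na": (67091, 69055),
    "o_034_na": (67685, 69783),
    "o_047_na": (70379, 73326),
}


def cycle_spread(cycles: dict) -> dict:
    """Return max - min execution cycles for each implementation."""
    return {name: high - low for name, (low, high) in cycles.items()}


if __name__ == "__main__":
    spread = cycle_spread(CYCLES)
    for name in sorted(spread, key=spread.get):
        print(f"{name}: {spread[name]} cycles")
```

Sorted this way, the five fastest implementations also show the five smallest spreads.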

Implementation  bit 31  bit 26  bit 20  bit 9  bit 1  bit 32
o_018_pa        -2      1       5       3      2      2
o_011_ni        -2      1       5       3      5      -2
o_012_ni        0       1       5       3      5      0
o_026_pa        0       1       5       3      5      2
o_018_ni        0       1       5       3      5      5
o_047_pi        2       4       8       6      5      7
o_009_pa        2       7       11      9      8      7
o_034_pi        2       7       11      9      8      5
o_011_pa        -2      7       11      9      8      5
o_018_na        0       4       8       6      5      5
o_034_na        0       4       8       6      5      7
o_047_na        0       7       11      9      8      7


Figure 5.2: Distribution function of starting instant for round 9

timing leakage, although the differences between implementations are not high. Some optimizations (the fastest ones) generate lower differences. The best optimization sequence for this software implementation of KeeLoq, according to the timing leakage, would be o_012_ni, which is equivalent to applying the whole std-compile-opts sequence.

The timing deviation has an effect similar to the random delay countermeasure described in Section 2.2.3. The implementations with greater data-dependent timing variation are more misaligned. While a completely aligned implementation would show peaks of high correlation coefficient at the instants when the intermediate (or similar) data is manipulated, the correlation coefficient obtained for a misaligned implementation is spread into a Gaussian shape. The peak correlation coefficient decreases as the misalignment increases, and the misalignment grows with the number of rounds executed, because the timing deviations of successive rounds accumulate.
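The effect of misalignment on the correlation peak can be reproduced with a toy simulation. The sketch below is not the simulator of Chapter 3: it assumes a Hamming-weight leakage of a random byte at a single sample, Gaussian noise, and a uniformly distributed delay, and all names and parameter values are illustrative.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def peak_correlation(max_delay, n_traces=500, trace_len=64, seed=1):
    """Highest |correlation| over all samples for a given maximum delay.

    ASSUMPTIONS: the leak is the Hamming weight of a random byte, it occurs
    at sample 8 plus a uniform random delay in [0, max_delay], and every
    sample carries Gaussian noise with standard deviation 0.5.
    """
    leak_at = 8
    if leak_at + max_delay >= trace_len:
        raise ValueError("max_delay does not fit in the trace")
    rng = random.Random(seed)
    values, traces = [], []
    for _ in range(n_traces):
        value = bin(rng.randrange(256)).count("1")
        delay = rng.randint(0, max_delay)
        trace = [rng.gauss(0.0, 0.5) for _ in range(trace_len)]
        trace[leak_at + delay] += value
        values.append(value)
        traces.append(trace)
    return max(
        abs(pearson(values, [trace[s] for trace in traces]))
        for s in range(trace_len)
    )


if __name__ == "__main__":
    for max_delay in (0, 5, 20):
        print(max_delay, round(peak_correlation(max_delay), 3))
```

In this toy model the aligned case gives a peak close to 1, and the peak drops as the delay range widens, which is the behaviour the paragraph above describes.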

Figure 5.2 shows the distribution function of the time each of the generated implementation groups takes to start the execution of round 9, where the target intermediate value is manipulated.

[Figure 5.3 panels: (a) Global: 2000 traces; (b) Zoom: 2000 traces. Axes: correlation coefficient (ρ) versus clock cycles; panel (a) also plots the time distribution.]

Figure 5.3: CPA on the o_009_pa implementation

the target intermediate value. A completely aligned implementation would have peaks with lower values in the rounds near the attack target round. However, Figure 5.2 shows that there is an accumulated misalignment. As we stated above, the misalignment reduces the peak value of the correlation coefficient, and the peak can fall below the value obtained for previous rounds.

Figure 5.3 depicts the problem applied to the o_009_pa implementation. The correlation factor calculated for each wrong key guess is represented in grey lines, while the correct one is the black line. The distribution function associated with o_009_pa from Figure 5.2 is plotted as a black dashed line. According to the distribution function, the target intermediate value can be manipulated at any instant between the blue dashed lines.

The maximum correlation coefficient over the whole trace is not obtained in the zone of interest but at the beginning of the execution, when there is no misalignment yet, although the value of the state there does not correspond to round 8. Figure 5.4a depicts the resistance against a CPA covering the whole trace. The implementation seems to be very resistant. Figure 5.4b shows the resistance against CPA between the blue dashed lines of Figure 5.3a. Figure 5.3b shows that there are three groups of state value manipulations. The first one is mainly formed by the previous state manipulation, and its top key guess differs from the correct key guess by one bit. The top key guess of the other two is the correct key guess, although the correlation factor is lower. Considering the zoomed CPA, the best implementation is o_011_ni. The results are available in the annex.

Figure 5.5 depicts the results of DCPA against the o_009_pa implementation and its resistance. The results of the implementation groups are available in the annex.

Figure 5.6 depicts the success of DCPA against the 12 groups of generated machine code depending on the number of samples used. We use our proposed DCPA instead of CPA, as stated in Section 3.4.

Figure 5.6a is a matrix showing whether the maximum correlation coefficient ρ of the CPA corresponds to the correct key guess, depending on the samples used. Figure 5.6b is a matrix showing whether the correlation coefficient of the correct key guess is among the top ten correlation coefficients depending on the samples used. The combination of the two of them summarizes the resistance against DCPA of the implementations generated.

[Figure 5.4 panels: (a) Global; (b) Zoom. Axes: correlation coefficient (ρ) versus number of traces.]

Figure 5.4: Resistance to CPA of the o_009_pa implementation

[Figure 5.5 panels: (a) Result, correlation coefficient difference (Δρ) versus clock cycles; (b) Resistance, correlation coefficient difference (Δρ) versus number of traces.]

[Figure 5.6 panels: (a) Correct key top first; (b) Correct key among top ten.]

Figure 5.6: Summary of successful DCPA against optimized programs
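The two criteria summarized in Figure 5.6 can be expressed as a rank computation over the per-key-guess scores. The sketch below assumes that a single score per key guess is already available (for example the maximum correlation coefficient over the attacked window); the function names are hypothetical, and ties are resolved in favour of the correct key.

```python
from typing import Dict


def rank_of_correct_key(scores: Dict[int, float], correct_key: int) -> int:
    """Return the rank of the correct key guess, where 1 is the best score.

    ASSUMPTION: `scores` maps each key guess to one score, higher is better.
    """
    target = scores[correct_key]
    return 1 + sum(1 for value in scores.values() if value > target)


def summarize(scores: Dict[int, float], correct_key: int) -> Dict[str, bool]:
    """Evaluate the two success criteria for one attack outcome."""
    rank = rank_of_correct_key(scores, correct_key)
    return {"top_first": rank == 1, "top_ten": rank <= 10}


if __name__ == "__main__":
    demo_scores = {guess: guess / 100 for guess in range(16)}
    print(summarize(demo_scores, correct_key=15))
    print(summarize(demo_scores, correct_key=10))
    print(summarize(demo_scores, correct_key=3))
```

Evaluating both criteria for an increasing number of traces gives one row of each matrix per implementation.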

We conclude from Figure 5.6 that the most robust implementation against DCPA is o_034_pi, which corresponds to the loop-rotate pass. However, it cannot be considered a secure implementation, and it is one of the least efficient implementations. It is similar to the implementation o_047_pi, which is also one of the most resistant implementations according to Figure 5.6a but is inefficient compared to other implementations. o_047_pi results from applying gvn (Global Value Numbering), which eliminates redundant loads and instructions. From Figure 5.6b we conclude that the implementation o_011_ni leaks less information than o_047_pi. It is obtained by removing the first application of the instcombine optimization pass from the std-compile-opts sequence. The result is one of the most resistant implementations and one of the fastest. However, removing any of the other instcombine applications has no effect on the optimized implementation.

The most efficient implementation is o_018_pa, which ends with a scalar replacement of aggregates optimization pass (scalarrepl-ssa). However, it is one of the most vulnerable to DCPA. Adding optimization passes to the subsequence o_018_pa maintains the same performance until an instcombine optimization pass is applied, which corresponds to o_026_pa.

We conclude from the analysis that there is no generic optimization sequence that makes an insecure implementation secure. Some optimization sequences provide more robust implementations than others, so an analysis to select the best optimization sequence is required. We confirm that analyzing an individual optimization pass does not provide enough information: the instcombine optimization pass appears 5 times in the optimization sequence, and its effects on the performance and on the DCPA resistance vary from one application to another. The sequences must be analyzed.

The resistance of the implementations against CPA corresponds to the data-dependent timing deviation introduced by the implementation. However, this same deviation makes the implementation more sensitive to SPA that exploits the dependency of the execution time on the input data.

In document Securing implementations of feedback-shift-register-based ciphers using compiler optimizations and co-processors (Page 102-110)