Cache based Timing Attacks on
Embedded Systems
Malte Wienecke
Monday 20th July, 2009
Master Thesis
Ruhr-Universität Bochum
Chair for Embedded Security
Prof. Dr.-Ing. Christof Paar
Statement
I hereby declare that the work presented in this thesis is my own work and that to the best of my knowledge it is original, except where indicated by references to other authors.
I hereby declare that I have written my master's thesis independently, that I have used no sources or aids other than those indicated, and that I have marked all quotations as such.
Contents
1 Introduction
1.1 Motivation
1.2 Organization of this Thesis
1.3 Notations and Conventions
1.3.1 Representation of Numbers of Different Bases
1.3.2 Notation to Address Parts of Bit Values
2 Side Channel Attacks
2.1 Power Analysis
2.2 Timing Analysis
2.3 Branch Prediction Analysis
2.4 Analysis of the Cache Behavior
2.4.1 Functionality of the Cache
2.4.2 Cache Collision Attacks
2.4.3 Countermeasures
3 The Advanced Encryption Standard
3.1 Mathematical Preliminaries
3.1.1 Addition +
3.1.2 Multiplication •
3.1.3 Polynomials with Coefficients in GF(2^8)
3.2 Functions of the AES
3.2.1 SubBytes Transformation
3.2.2 ShiftRows Transformation
3.2.3 MixColumns Transformation
3.2.4 AddRoundkey Transformation
3.2.5 Key Generation
3.3 AES Implementations
3.3.1 8-Bit Straightforward Implementation
3.3.2 32-Bit Transformation Table Implementation
4 The Attack Setup
4.1 Used Hardware
4.1.1 SBC2440-II Board
4.1.2 Pentium 4 PC
4.2 Measuring the Encryption Time
4.3 Cleaning the Cache
4.4 The Attack Scenario
5 Attacks
5.1 Final Round Attack
5.1.1 Theoretical Description
5.1.2 Online Phase
5.1.3 Offline Phase
5.1.4 Results
5.2 Expanded Final Round Attack
5.2.1 Theoretical Description
5.2.2 Online Phase
5.2.3 Offline Phase
5.2.4 Results
5.3 Expanded Second Round Attack
5.3.1 Theoretical Description
5.3.2 Online Phase
5.3.3 Offline Phase
5.3.4 Results
5.4 Pair Encryption Collision Attack
5.4.1 Theoretical Description
5.4.2 Online Phase
5.4.3 Offline Phase
5.4.4 Results
5.5 Improved Pair Encryption Collision Attack
5.5.1 Theoretical Description
5.5.2 Online Phase
5.5.3 Offline Phase
5.5.4 Results
6 Conclusion
6.1 Future Work
A Appendix
A.1 AES S-Box
1 Introduction
This introductory chapter motivates the use of cryptography on embedded devices and explains the risk of side channel attacks, which leads to the aim of this thesis. Additionally, the organization of the thesis and the notations used are described.
1.1 Motivation
In the last few years the need for embedded systems has grown enormously. They can be found in almost every electronic device, such as mobile phones, navigation systems, PDAs, chip cards, and even coffee machines. They offer almost unlimited possibilities: one can check emails, play games, place orders, and pay bills from almost anywhere in the world. Along with the fields of application, the need for security on embedded devices has arisen. These security requirements are, for example, the protection of sensitive personal data or the assurance of integrity and confidentiality of the communication channels. These demands can be met using cryptography. Cryptographic applications can, among other things, be used to encrypt sensitive data or communication channels, but the security of such applications is not only based on the underlying cipher. It also depends on the implementation and on the executing device. Unfortunately, these implementations and devices often tend to leak sensitive information of the cipher, i.e., secret key information. These information leaks are called Side Channels and are based on the physical characteristics of the particular implementation on the device, such as execution time, power consumption, or electromagnetic emission. An adversary can exploit these side channels to extract the information necessary for the reconstruction of parts of, or even the entire, secret key of the cipher.
One of these side channels is based on the behavior of the cache. The cache is a small memory located inside the microprocessor. The processor handles all the instructions and data of the programs and the operating system. The data itself is transferred from the hard drive to the main memory. The problem is that the memory is not as fast as the CPU. To prevent unpleasant waiting time, the cache comes into play. Its storage capacity is small compared to the main memory, but the cache can be accessed at the same clock rate as the CPU. Therefore, the cache is used to bridge the time gap between the main memory and the CPU by buffering the data that is likely to be processed next. If the processor needs data that is already stored in the cache, a so-called cache hit, it can be accessed very fast. If, however, the data is not located in the cache, the data must be fetched from the memory. This case is called a cache miss and takes more time than a cache hit. The resulting timing differences leak secret information, which an adversary can extract in order to attack the cryptographic primitive.
This thesis analyzes the information leakage caused by the timing differences of the caching behavior of an embedded device. As target cipher the widespread Advanced Encryption Standard (AES) is used and its complete secret key is extracted with the attacks presented in this thesis.
1.2 Organization of this Thesis
This thesis is divided into six chapters. Chapter 2 gives a brief introduction to the topic of side channel attacks. The most common side channel attacks are presented: power analysis and timing analysis. Further, the branch prediction analysis and the analysis of the cache behavior, which is the subject of this thesis, are described. For the latter, the functionality of the cache and the cache collision attacks are presented. Additionally, possible countermeasures that prevent the use of the cache as a side channel are described.
The next chapter introduces the Advanced Encryption Standard (AES). In the mathematical preliminaries the background of finite fields is presented in order to understand the underlying mathematics of the AES. Afterwards, the round functions and the key generation algorithm are presented. Furthermore, the 8-bit and the 32-bit implementation are described and the cache behavior of these implementations is analyzed. The analysis illustrates the occurrence of cache hits and cache misses during one encryption.
In Chapter 4 the attack model used for the attack is described. Here, the participating parties are presented with the corresponding hardware used in this thesis. Also, the methods to measure the encryption time and to clean the cache are discussed. Moreover, the assumptions made for the attacker and her abilities are described.
The next chapter presents the five attacks. Each section contains a brief theoretical description of the information leakage and of the online and offline phase. At the end of each section the results of the attack are presented.
In the conclusion of the last chapter the results of the attacks are compared to each other and possible further work is presented.
1.3 Notations and Conventions
This section describes two notations used in this thesis. They are presented because different methods exist to describe the same circumstances.
1.3.1 Representation of Numbers of Different Bases
In this thesis the base $b$ of a number $x$ is indicated by a subscript behind the number:

$$x_{(b)}$$

If no base is mentioned and no additional information exists, the number $x$ is a decimal number. The following example illustrates the use of the notation:

$$181 = \mathrm{B5}_{(16)} = 10110101_{(2)}$$
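The notation can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `to_base_notation` is hypothetical and not part of the thesis.

```python
# The same value written in the three bases used in the example above.
value = 181

# Python literals: the prefix 0x marks base 16, the prefix 0b marks base 2.
assert value == 0xB5 == 0b10110101

# int(text, base) parses a digit string in the given base.
assert int("B5", 16) == 181
assert int("10110101", 2) == 181


def to_base_notation(x: int, base: int) -> str:
    """Format x as 'digits(base)' to mimic the subscript notation (hypothetical helper)."""
    digits = {2: format(x, "b"), 10: format(x, "d"), 16: format(x, "X")}[base]
    return f"{digits}({base})"


print(to_base_notation(value, 16))  # B5(16)
print(to_base_notation(value, 2))   # 10110101(2)
```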
1.3.2 Notation to Address Parts of Bit Values
Sometimes it is necessary to address specific bits of a value. These bits are usually the most significant bits of the value. In this thesis the following notation is used to address the most significant bits of a value $x$, where $n$ specifies the number of bits that are considered:

$$\langle x \rangle_n$$

The number of considered bits $n$ can be omitted if it has been stated before.
An example for this notation is:
2 Side Channel Attacks
Nowadays the security of a cryptographic application is not only based on the underlying cipher. It also depends on the implementation and on the executing device. Unfortunately, these implementations and devices often tend to leak sensitive information of the cipher, i.e., secret key information. These information leaks are called Side Channels and are based on the physical characteristics of the particular implementation, such as execution time, power consumption, or electromagnetic emission. These are also the typical and most discussed side channels. An adversary can use these side channels to extract the information necessary for the reconstruction of parts of, or even the entire, secret key of the cipher.
2.1 Power Analysis
The analysis of the power consumption, or power analysis, was presented in the late 1990s by P. C. Kocher et al. [KJJ98, KJJ99]. In general there are two main approaches to extract sensitive data from power consumption traces. The Simple Power Analysis (SPA) needs only a few traces and extracts, for example, the type and length of functions or their order of appearance. In some cases even secret key information can be obtained with the SPA. Unfortunately, the SPA requires in most cases precise knowledge of the implementation.
The other approach is the Differential Power Analysis (DPA), which works on a large sample of power traces. This makes it possible to extract information even from traces with a lot of noise.
2.2 Timing Analysis
The analysis of the execution time was also introduced by P. C. Kocher [Koc96], in 1996. He described the theoretical approach of exploiting timing measurements to obtain the entire secret key of Diffie-Hellman, RSA, DSS, and other cryptographic systems. An attacker can take advantage of knowledge of the exact implementation of the system and of the small variations in the processing time. In particular, attacks based on the Square-And-Multiply exponentiation and the Montgomery multiplication were discussed.
In 1998 J.-F. Dhem et al. [DKL+00] improved Kocher's ideas and developed an [...] timing measurements the entire 512-bit key was extracted from the signing algorithm of the smart card.
Three years later W. Schindler et al. [SKQ01] presented a further improvement of the timing attack. With only 5,000 measurements and with very limited knowledge of implementation details, the entire 512-bit key could be obtained from an RSA implementation that uses the Montgomery multiplication. This improved timing attack relies on more complex statistical methods.
The basic idea behind these attacks is shown in the following example:
An attacker intends to exploit the multiplication in the Square-And-Multiply algorithm described in Algorithm 2.1. The execution time of the multiplication is constant, but if the result of the multiplication is greater than the modulus s, an additional reduction is performed.
Algorithm 2.1 Square-And-Multiply algorithm

Input: $m$, $s$ and $K = (k_n k_{n-1} \ldots k_0)_{(2)}$
Output: $x \equiv m^K \bmod s$

    x ⇐ m
    for i = (n − 1) to 0 do
        x ⇐ x^2 mod s
        if (k_i == 1) then
            x ⇐ (x · m) mod s
        end if
    end for
    return x
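Algorithm 2.1 can be written down as a short Python sketch. The function names and the bit-list representation of the key are assumptions made for illustration. As in the algorithm, the leading key bit $k_n$ is taken to be 1, which is why the accumulator starts at $m$.

```python
from typing import List, Sequence


def square_and_multiply(m: int, key_bits: Sequence[int], s: int) -> int:
    """Left-to-right square-and-multiply following Algorithm 2.1.

    key_bits holds (k_n, k_{n-1}, ..., k_0); the leading bit k_n must be 1.
    """
    if not key_bits or key_bits[0] != 1:
        raise ValueError("the most significant key bit k_n must be 1")
    x = m % s                      # x <= m (reduced once, assuming m may exceed s)
    for k_i in key_bits[1:]:       # i = n-1 down to 0
        x = (x * x) % s            # square in every iteration
        if k_i == 1:
            x = (x * m) % s        # multiply only for set key bits
    return x


def bits_of(k: int) -> List[int]:
    """Binary digits of k, most significant bit first (hypothetical helper)."""
    return [int(bit) for bit in bin(k)[2:]]


# The result agrees with Python's built-in modular exponentiation.
assert square_and_multiply(7, bits_of(181), 1009) == pow(7, 181, 1009)
```

The data-dependent branch on `k_i` is exactly the operation the timing attack described in this section targets.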
First of all, a large sample of plaintexts $P$ and their encryption times $t$ must be captured. Afterwards, the attacker divides the key into known and unknown bits: $K = k_{\text{known}} \,\|\, k_{\text{unknown}}$. The goal is to convert the unknown bits into known bits step by step, starting with the key bit $k_{n-1}$.

She assumes that this particular bit is set, and an oracle $O((p_i, t_i))$ divides the samples into two sets $S_0$ and $S_1$. For each plaintext the oracle computes the Square-And-Multiply algorithm for all known key bits. If a reduction is performed after the multiplication based on the guessed bit, the sample $(p_i, t_i)$ is added to the set $S_0$. Otherwise it is added to the set $S_1$:

$$O : (p_i, t_i) \in \begin{cases} S_0 & \text{if a reduction is necessary,} \\ S_1 & \text{if no reduction is necessary.} \end{cases} \tag{2.1}$$

By calculating the mean times $M(S_0)$ and $M(S_1)$ of the two sets, the attacker can decide whether the guessed bit is indeed 1.

If the assumption about the guessed bit is correct, the oracle divides the samples correctly. Hence, the mean time of $S_0$ should be slightly greater than that of $S_1$.

If, however, the assumption is not correct, the complete decision criterion was wrong and the multiplication did not occur in the actual computation of the samples. Hence, both sets are formed randomly and their mean times are almost equal.

Using this strategy the attacker can rebuild the entire key. However, the attacker must have exact knowledge of the implementation.
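The decision step of this example can be sketched as follows. The sketch only covers the partitioning and the comparison of the mean times. The predicate `needs_reduction`, the threshold, and the toy samples are hypothetical; in a real attack the predicate would re-run the Square-And-Multiply algorithm for the known key bits and the guessed bit.

```python
from statistics import mean
from typing import Callable, List, Tuple

Sample = Tuple[int, float]  # (plaintext p_i, measured time t_i)


def split_by_oracle(samples: List[Sample],
                    needs_reduction: Callable[[int], bool]) -> Tuple[List[Sample], List[Sample]]:
    """Partition the samples into S0 (reduction predicted) and S1 (no reduction)."""
    s0 = [(p, t) for p, t in samples if needs_reduction(p)]
    s1 = [(p, t) for p, t in samples if not needs_reduction(p)]
    return s0, s1


def guess_is_plausible(samples: List[Sample],
                       needs_reduction: Callable[[int], bool],
                       threshold: float) -> bool:
    """Accept the guessed bit if M(S0) exceeds M(S1) by more than the threshold."""
    s0, s1 = split_by_oracle(samples, needs_reduction)
    if not s0 or not s1:
        raise ValueError("both sets must contain at least one sample")
    gap = mean([t for _, t in s0]) - mean([t for _, t in s1])
    return gap > threshold


# Toy data (made up): even plaintexts take about two time units longer.
samples = [(1, 10.0), (2, 12.0), (3, 10.5), (4, 12.5)]
print(guess_is_plausible(samples, lambda p: p % 2 == 0, threshold=1.0))  # True
print(guess_is_plausible(samples, lambda p: p <= 2, threshold=1.0))      # False
```

With a predicate that matches the real behavior the two means separate; with an unrelated predicate the gap stays below the threshold, which mirrors the two cases described above.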
In 2003 D. Brumley et al. [BB03] presented the first remote timing attack. This attack exploits an unprotected RSA implementation of OpenSSL over a local network. Here a prime factor of the RSA modulus n can be reconstructed by analyzing the encryption time of two chosen plaintexts. O. Acıiçmez et al. [ASK05] improved the efficiency of this attack, so that fewer measurements are needed to cope with the communication noise when analyzing the timing behavior over the local network.
2.3 Branch Prediction Analysis
Another side channel was presented in 2006 by O. Acıiçmez et al. [AKS07b, AKS07a]. Here the information leakage is based on the timing differences produced by the branch prediction feature of modern CPUs. The branch prediction unit (BPU) tries to predict the outcome of conditional branches, and the instructions of the predicted branch are loaded into the instruction pipeline of the CPU. If the prediction of the BPU is correct, the execution can continue without delay. In case of a misprediction, the pipeline is filled with the wrong instructions and the correct branch must be loaded. This slows down the execution by a few extra clock cycles. An attacker can take advantage of this delay by recalculating the prediction behavior. Additionally, the BPU can be manipulated by running an unprivileged process simultaneously on the same processor as the target process, which is computing a cryptographic algorithm.
2.4 Analysis of the Cache Behavior
In 1998 J. Kelsey et al. [KSWH98] proposed to use the cache behavior of modern processors as a side channel against ciphers with large lookup tables such as S-Boxes. This proposal was substantiated by D. Page [Pag02] in 2002, who described and simulated a theoretical attack on DES. The first real implementation of such an attack was developed by Y. Tsunoo et al. [TSS+03] against DES and Triple-DES.
In order to understand the procedures of the cache attacks, the functionality of the cache is summarized in the following section. Thereafter, the side channel is described in more detail followed by a presentation of possible countermeasures to prevent such attacks.
2.4.1 Functionality of the Cache
The core of a modern computer or mobile device is the processor or microprocessor. It processes all the instructions and data of the programs and the operating system. The data itself is transferred from the hard drive to the memory. The problem is that the memory is not as fast as the CPU. To prevent unpleasant waiting time, a small but fast memory is added to the processor, the so-called cache. Its storage capacity is small compared to the main memory, but the cache can be accessed at the same clock rate as the CPU. The higher performance is achieved by a design that differs from that of the main memory.
The cache is a Static Random Access Memory (SRAM). In this technique one bit and its inverse are stored with two cross-coupled inverters. The information stored in the cache does not need to be refreshed, which is indicated by the word static.
Main memory, in contrast, uses Dynamic Random Access Memory (DRAM), where one bit is stored in a capacitor. All stored information must be refreshed periodically, because the capacitors lose electric charge over time. Because of its more complex design, SRAM needs more space to store the same amount of data and is more expensive. On the other hand, it provides a higher performance than DRAM. As a compromise, a small cache is added to the CPU to bridge the time gap between the main memory and the CPU by buffering the data that is likely to be processed next.
The cache itself is arranged in $2^l$ cache lines. Each of these lines can hold $2^b$ bytes. This leads to a total cache size of $2^{b+l}$ bytes. Additionally, the tag-RAM is located in the cache. This memory stores address information for every cache line entry.
The basic operating mode of the cache is very simple. If the CPU needs data, it checks the cache first. For this operation the part of the address which represents the cache line, the so-called tag, is compared with the values in the tag-RAM. If the comparison is successful, the data is found in the cache. This is a so-called cache hit, and the data is processed by the CPU without accessing the main memory. In the other case, called a cache miss, the data is fetched from the memory and stored in the cache. An entire cache line is always fetched from the memory. Compared to a cache hit, more clock cycles are needed until the CPU can process the data.
There are several techniques to refine the basic operating mode in order to improve the ratio of cache hits to cache misses.
One is the direct-mapped cache, sketched in Figure 2.1. In this mode each block of data from the main memory can be stored in only one specific cache line. This allows a very simple and fast check of whether the data is cached, but the ratio of cache misses is very high.
Figure 2.1: Visualization of a direct-mapped cache

The opposite approach is the fully associative cache, in which data from the main memory can be stored in any cache line, as shown in Figure 2.2. To determine whether the needed data is cached, all entries must be checked. This takes a long time compared to the direct-mapped cache, but has the advantage that the number of cache misses is very low.
Figure 2.2: Visualization of a fully associative cache
A combination of the advantages of both models is the n-way set associative cache. An entry can be stored in n possible cache lines. These cache lines are combined into one cache set, see Figure 2.3. This leads to a tradeoff between cache hit time and cache miss ratio.
Figure 2.3: Visualization of an n-way set associative cache
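The three cache organizations can be compared with a small simulator. The following Python sketch models a cache with $2^l$ lines of $2^b$ bytes. The class name and the least-recently-used replacement policy are assumptions of this sketch, since the text does not specify a replacement policy. With `ways=1` the model behaves like a direct-mapped cache, and with `ways` equal to the number of lines it behaves like a fully associative cache.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy n-way set associative cache with 2**l lines of 2**b bytes each."""

    def __init__(self, l: int, b: int, ways: int):
        lines = 1 << l
        if ways < 1 or lines % ways != 0:
            raise ValueError("the number of lines must be a multiple of ways")
        self.b = b
        self.ways = ways
        self.num_sets = lines // ways
        # One ordered dict per set; the keys are the tags of the cached lines.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, address: int) -> bool:
        """Return True on a cache hit and False on a cache miss."""
        block = address >> self.b         # drop the byte offset within a line
        index = block % self.num_sets     # set the block is mapped to
        tag = block // self.num_sets      # identifies the block within its set
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)        # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)     # evict the least recently used line (assumed policy)
        lines[tag] = True                 # an entire line is fetched on a miss
        return False


# Direct-mapped: addresses 0x00 and 0x40 compete for the same line.
direct = SetAssociativeCache(l=2, b=4, ways=1)
print([direct.access(a) for a in (0x00, 0x0F, 0x40, 0x00)])   # [False, True, False, False]

# 2-way set associative: both blocks fit into the same set.
two_way = SetAssociativeCache(l=2, b=4, ways=2)
print([two_way.access(a) for a in (0x00, 0x0F, 0x40, 0x00)])  # [False, True, False, True]
```

The second access hits in both configurations because 0x0F lies in the cache line that was fetched for 0x00.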
2.4.2 Cache Collision Attacks
The main goal of a cache attack is to detect internal collisions. One definition of internal collisions is given by K. Schramm et al. [SLFP04]. They occur if different input data is processed by the same function $f$ of a cryptographic primitive to equal output values. Normally, collisions within just one encryption are observed. How to take advantage of such a collision is best shown by an example. Let the colliding function $f$ be a key addition that combines the plaintext value $p_i$ with the corresponding key value $k_i$ using an XOR addition:

$$p_i \oplus k_i = x_i \tag{2.2}$$

If two different input values $p_1, p_2$ collide, i.e., the outputs of the key addition are equal, it is possible to extract information about the key. This is achieved by setting up a simple equation:

$$p_1 \oplus k_1 = p_2 \oplus k_2. \tag{2.3}$$

The result is a clear relationship between the two secret key values, since the plaintext values are known:

$$k_1 \oplus k_2 = p_1 \oplus p_2. \tag{2.4}$$
This relationship can be used in an exhaustive key search. Instead of iterating over both key values, the adversary has to find only one value and the other is given by the relationship.
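The effect of such a relationship on an exhaustive search can be illustrated with a short Python sketch for byte values. The key and plaintext values are made up for illustration, and the function name is hypothetical.

```python
def key_relation(p1: int, p2: int) -> int:
    """For colliding inputs, k1 XOR k2 equals p1 XOR p2 (Equation 2.4)."""
    return p1 ^ p2


# Hypothetical secret key bytes and a pair of colliding plaintext bytes.
k1, k2 = 0x3C, 0xA7
p1 = 0x51
p2 = p1 ^ k1 ^ k2                 # chosen so that p2 XOR k2 == p1 XOR k1
assert p1 ^ k1 == p2 ^ k2         # the key additions collide

delta = key_relation(p1, p2)
assert delta == k1 ^ k2           # the attacker learns k1 XOR k2

# Exhaustive search: guess k1 only, k2 follows from the relationship.
candidates = [(guess, guess ^ delta) for guess in range(256)]
assert (k1, k2) in candidates
print(len(candidates))            # 256 candidate pairs instead of 65536
```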
The problem is the detection of such collisions. One way to solve this problem is to take advantage of the cache. Here, the results of the colliding function $f$ are further processed by a lookup table $T$. Adding the lookup table to the example of Equation 2.2 changes it to:

$$T(p_i \oplus k_i) = c_i. \tag{2.5}$$
In case of an internal collision of $f$, two values $p_1, p_2$ are processed by the key addition into equal values $x_1, x_2$. Looking up the first value $x_1$ in the table $T$ fetches its data from the memory into the cache. This leads to a cache hit while processing the second value $x_2$. This situation is called a direct collision.
Since an entire cache line is always filled, a cache hit is evoked not only if the lookup values $x_1$ and $x_2$ are equal. It is sufficient if the values are located in the same cache line. Under the assumption that the tables are stored aligned in the cache, i.e., each table starts at the beginning of a cache line, a cache hit occurs if all bits of $x_1$ and $x_2$ are equal when ignoring the $\log_2 \gamma$ least significant bits, where $\gamma$ stands for the number of table elements in one cache line. This kind of collision is called a cache line collision.
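The cache line collision criterion translates into a simple bit operation. In the following Python sketch the function names are hypothetical and the value of 16 table elements per cache line is only an example, not a property of the hardware used in this thesis. The sketch assumes aligned tables and a power-of-two number of elements per line.

```python
def cache_line_of(x: int, gamma: int) -> int:
    """Drop the log2(gamma) least significant bits of the table index x."""
    if gamma <= 0 or gamma & (gamma - 1) != 0:
        raise ValueError("gamma must be a power of two")
    shift = gamma.bit_length() - 1   # log2(gamma) for a power of two
    return x >> shift


def is_cache_line_collision(x1: int, x2: int, gamma: int) -> bool:
    """True if both lookups fall into the same cache line of an aligned table."""
    return cache_line_of(x1, gamma) == cache_line_of(x2, gamma)


GAMMA = 16  # example value: 16 table elements per cache line

print(is_cache_line_collision(0x53, 0x5C, GAMMA))  # True, only the low 4 bits differ
print(is_cache_line_collision(0x53, 0x63, GAMMA))  # False, different cache lines
print(is_cache_line_collision(0x53, 0x53, GAMMA))  # True, a direct collision is a special case
```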
In general, cache attacks are divided into three approaches to detect direct collisions and cache line collisions: trace driven, access driven, and time driven attacks.
In trace driven attacks the adversary has the ability to observe every single memory and cache access. Therefore, she knows when and where a collision occurs [Pag02].
Access driven attacks provide the information which sets of the cache are accessed by the cryptographic process. For this purpose the cache is first filled with data of the attacker. After the encryption she checks which of her data is still present in the cache [OST06].
The last approach, time driven attacks, is analyzed in this thesis. Information is obtained from the influence of cache hits and cache misses on the execution time. Therefore this approach is also called cache based timing attack. In this case the attacker can only capture the total execution time of the encryption and has to evaluate it statistically to extract the information. Therefore a much higher number of samples is needed compared to trace driven attacks.
In 2005 D. J. Bernstein [Ber05] presented a cache timing attack against AES. The attack is based on the assumption that the execution time of the AES depends on its input. For the attack an identical reference machine with the same implementation as the target is used to study the timing behavior. For this purpose a known key is chosen, for example all zeros, and a large sample of random plaintexts is encrypted on the reference device. Based on the samples, the mean time and variance are calculated for every possible value of each plaintext byte. Afterwards, the correlations between the different positions are calculated. The same analysis steps are performed for the sample of the target device. The attacker compares the results and tries to find similarities in order to extract the secret key information of the target system.
This led to other attacks that exploit the data cache to attack the AES. In 2006 J. Bonneau et al. [BM06] presented, among others, the attacks described in Sections 5.1 and 5.2. One year later O. Acıiçmez et al. [ASK07] introduced the remote attack discussed in Section 5.3.
2.4.3 Countermeasures
There are several countermeasures to prevent the loss of information through the cache side channel. But one must bear in mind that the cache is designed to achieve a better performance and to speed up the applications. Some of these countermeasures therefore represent a trade-off between security and performance.
Constant Time Implementation
One countermeasure is the constant time implementation. This means that the execution time depends neither on the secret key nor on the input data. In the case of the AES this can be achieved by a fully unrolled implementation in which the table lookups are replaced by the mathematical computations.
Another constant time countermeasure is the bitslice implementation of the AES presented by M. Matsui et al. [MN07]. It does not use any lookup tables with key-dependent or data-dependent addresses, i.e., no information can leak through the cache side channel. The implementation is, however, designed for the Intel Core2 processor, where it provides a very high performance.
Cache Warming
To reduce or prevent the leakage of information it is also possible to warm up the cache, as mentioned by D. Page [Pag02]. This means that the lookup tables, or parts of them, are loaded into the cache before the encryption. In the case of full warming all cache misses are avoided and no information can leak. This costs performance, because the implementation must ensure that the entire data is stored in the cache. Even then security cannot be guaranteed: due to the multitasking ability of modern processors, other processes can evict parts of the cache or even the entire cache.
With random warming, random parts of the lookup data are loaded into the cache. This randomizes the cache hits. Statistically, this procedure can be considered as additional noise, which an attacker can suppress by collecting additional measurements.
Bucketing
B. Köpf et al. presented their countermeasure as a 'provably secure and efficient countermeasure against timing attacks' [KD09]. This is achieved by a combination of input blinding and bucketing. The blinding randomizes the input of the cryptographic device. The bucketing divides the distribution of the execution time into intervals and returns the result of each encryption at the end of the corresponding interval. These intervals are called buckets. By adjusting the number and size of the buckets, a trade-off between the provable security and the performance of the cryptographic device is possible.
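The bucketing step alone can be sketched in a few lines of Python. The sketch assumes buckets of equal size and execution times measured in whole clock cycles; both are simplifications made here for illustration and not details taken from [KD09]. Input blinding is not modeled.

```python
def release_time(t: int, bucket_size: int) -> int:
    """Delay a result computed after t cycles until the end of its bucket."""
    if t <= 0 or bucket_size <= 0:
        raise ValueError("t and bucket_size must be positive")
    # Ceiling division: number of buckets needed to cover t cycles.
    return -(-t // bucket_size) * bucket_size


# With buckets of 100 cycles, different execution times become indistinguishable.
print([release_time(t, 100) for t in (101, 150, 199, 200)])  # [200, 200, 200, 200]
print(release_time(201, 100))                                 # 300
```

Larger buckets hide more of the timing variation but delay more results, which is the trade-off between security and performance mentioned above.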
3 The Advanced Encryption Standard
The Advanced Encryption Standard (AES) is a symmetric block cipher standardized by the National Institute of Standards and Technology (NIST). In 1997 the NIST requested proposals for a new encryption standard [Nat97] in order to offer a more secure alternative to the Data Encryption Standard (DES), which had been standardized in 1976. As minimal requirements, the proposals had to be symmetric block ciphers with a block length of 128 bits. Additionally, the algorithms had to support key lengths of 128, 192, and 256 bits. Furthermore, the use and distribution of the AES was not to be restricted by patents or other regulations of any kind.
During the First AES Candidate Conference (AES1) on August 20, 1998, fifteen proposals were announced. These candidates were analyzed by the international cryptographic community and the results were presented at the Second AES Candidate Conference (AES2) in March 1999. Based on these results, the NIST announced the five finalist designs in August 1999: MARS, RC6, Rijndael, Serpent, and Twofish. After further evaluation the NIST declared Rijndael as the new AES in October 2000 [NBB+00]. One year later, on November 26, 2001, the AES was officially approved and standardized in the U.S. Federal Information Processing Standards Publication 197 (FIPS-197) [Nat01].
The algorithm behind the AES, Rijndael [DR99, Nat01], is a symmetric block cipher designed by the Belgian cryptographers Joan Daemen and Vincent Rijmen, whose last names gave the cipher its name. It is based on a previous design called Square, which was published in 1997 [DKR97]. Rijndael originally supports block lengths of 128, 192, and 256 bits and key lengths of 128, 192, and 256 bits. In the standardization process, however, only the variants listed in Table 3.1 were specified.
Figure 3.1 shows the schematic layout of the AES. The plaintext P is processed in n rounds into the ciphertext C. The number of rounds depends on the size of the key K, as shown in Table 3.1. Each round, except the last one, is divided into four round functions: SubBytes, ShiftRows, MixColumns, and AddRoundkey. In the last round the MixColumns function is omitted.
Table 3.1: AES standards with key size, block size, and number of rounds

Standard   Key size (bit)   Block size (bit)   Number of rounds
AES-128    128              128                10
AES-192    192              128                12
AES-256    256              128                14

Figure 3.1: Schematic layout of the AES

In the following section the mathematical background of the AES is presented, followed by descriptions of the round functions used for the encryption and of the key generation process. Additionally, an 8-bit and a 32-bit implementation are described. The chapter finishes with an analysis of the cache behavior of the presented implementations.
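The round structure can be summarized in a short Python sketch that only lists the order of the round functions and performs no encryption. The function name is hypothetical. The initial key addition with the round key $k_0$ before the first round follows the layout of Figure 3.1.

```python
from typing import List

# Number of rounds per key size in bits, as listed in Table 3.1.
ROUNDS = {128: 10, 192: 12, 256: 14}


def round_functions(key_bits: int) -> List[List[str]]:
    """Return the sequence of round functions applied for the given key size."""
    n = ROUNDS[key_bits]
    schedule = [["AddRoundKey"]]                 # initial key addition with k_0
    for r in range(1, n + 1):
        functions = ["SubBytes", "ShiftRows", "MixColumns", "AddRoundKey"]
        if r == n:
            functions.remove("MixColumns")       # the last round has no MixColumns
        schedule.append(functions)
    return schedule


schedule = round_functions(128)
print(len(schedule) - 1)   # 10 rounds after the initial key addition
print(schedule[-1])        # ['SubBytes', 'ShiftRows', 'AddRoundKey']
```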
3.1 Mathematical Preliminaries
The AES algorithm is based on several operations in the finite field $GF(2^8)$. In general, finite fields are defined as fields with a finite number of elements. The number of elements is called the order $p^n$, where $p$ is the prime characteristic of the field and $n$ a positive integer exponent. These fields are also called Galois fields or $GF(p^n)$.

If $n = 1$, the order is prime and the field $GF(p)$ is the ring of integers modulo $p$.

For an exponent $n > 1$ the elements of the finite field $GF(p^n)$ can be represented as polynomials over the field $GF(p)$ of degree less than $n$. An example for this representation is the field $GF(2^8)$, whose elements can be represented as polynomials with coefficients modulo 2:

$$1x^7 + 0x^6 + 1x^5 + 0x^4 + 1x^3 + 0x^2 + 1x + 1 = x^7 + x^5 + x^3 + x + 1$$

To simplify the representation, the elements of $GF(2^n)$ are often described by binary values. The above polynomial can be represented as $10101011_{(2)}$.
3.1.1 Addition +
The addition in fields $GF(p)$ is the same operation as in the ring of integers modulo the characteristic $p$. For instance, the operation $5 + 7$ in $GF(11)$ is described as $(5 + 7) \bmod 11 \equiv 1$.

The addition in fields $GF(p^n)$, for $n > 1$, is performed as a coefficient-wise addition over $GF(p)$. For example, the field $GF(2^8)$ used by the AES has the characteristic 2. Based on this characteristic the addition can be described as an addition modulo 2 or as an XOR operation $\oplus$:

Binary: $10101011 + 01011010 = 10101011 \oplus 01011010 = 11110001$

Polynomial: $(x^7 + x^5 + x^3 + x + 1) + (x^6 + x^4 + x^3 + x) = x^7 + x^6 + x^5 + x^4 + 1$
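The equivalence of the binary and the polynomial view can be reproduced with a small Python sketch. The helper names are hypothetical.

```python
def poly_str(a: int) -> str:
    """Render a byte as a polynomial over GF(2), highest degree first."""
    terms = []
    for i in range(7, -1, -1):
        if (a >> i) & 1:
            terms.append("1" if i == 0 else "x" if i == 1 else f"x^{i}")
    return " + ".join(terms) if terms else "0"


def gf_add(a: int, b: int) -> int:
    """Addition in GF(2^8): coefficient-wise addition modulo 2, i.e., XOR."""
    return a ^ b


a, b = 0b10101011, 0b01011010
total = gf_add(a, b)
print(format(total, "08b"))   # 11110001
print(poly_str(a))            # x^7 + x^5 + x^3 + x + 1
print(poly_str(total))        # x^7 + x^6 + x^5 + x^4 + 1
```

Since every element is its own additive inverse, `gf_add(total, b)` returns `a` again.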
3.1.2 Multiplication •
Like the addition, the multiplication in finite fields $GF(p)$ is the same as in the ring of integers modulo the characteristic $p$.
In fields $GF(p^n)$, for an exponent $n > 1$, the multiplication is described as a multiplication modulo an irreducible polynomial. This polynomial is the reduction polynomial used to define the finite field. In other words, it is a multiplication followed by a division by the irreducible polynomial. The remainder is the desired product.
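For example, the AES uses the following irreducible reduction polynomial for its finite field $GF(2^8)$:

$$r(x) = x^8 + x^4 + x^3 + x + 1. \tag{3.1}$$

This leads to a representation of $GF(2^8)$ in which the multiplication of two byte values, e.g., $171 = \mathrm{AB}_{(16)} = 10101011_{(2)}$ and $90 = \mathrm{5A}_{(16)} = 01011010_{(2)}$, is described as:

$$\begin{aligned}
(x^7 + x^5 + x^3 + x + 1) \bullet (x^6 + x^4 + x^3 + x) &= (x^{13} + x^{11} + x^9 + x^7 + x^6) \oplus (x^{11} + x^9 + x^7 + x^5 + x^4) \\
&\quad \oplus (x^{10} + x^8 + x^6 + x^4 + x^3) \oplus (x^8 + x^6 + x^4 + x^2 + x) \\
&= x^{13} + x^{10} + x^6 + x^5 + x^4 + x^3 + x^2 + x \\
&\equiv x^6 + x^4 + x^3 + x^2 + x + 1 \mod (x^8 + x^4 + x^3 + x + 1)
\end{aligned} \tag{3.2}$$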
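The multiplication followed by the reduction can be sketched in Python as below. The function name is hypothetical. The sketch first builds the unreduced product by shifting and XORing and then reduces it with $r(x)$, which corresponds to the two steps described above. The check values 57 • 83 = C1 and 57 • 13 = FE (hexadecimal) are the examples given in FIPS-197.

```python
AES_POLY = 0x11B  # r(x) = x^8 + x^4 + x^3 + x + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo the AES reduction polynomial."""
    # Step 1: carry-less multiplication, one shifted copy of a per set bit of b.
    product = 0
    for i in range(8):
        if (b >> i) & 1:
            product ^= a << i
    # Step 2: reduce the product (degree at most 14) to a degree below 8.
    for bit in range(14, 7, -1):
        if (product >> bit) & 1:
            product ^= AES_POLY << (bit - 8)
    return product


# Examples from FIPS-197.
assert gf_mul(0x57, 0x83) == 0xC1
assert gf_mul(0x57, 0x13) == 0xFE

# The byte values 171 and 90 used in the text.
print(format(gf_mul(0xAB, 0x5A), "08b"))  # 01011111, i.e., x^6 + x^4 + x^3 + x^2 + x + 1
```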
3.1.3 Polynomials with Coefficients in $GF(2^8)$
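In the MixColumns function of the AES the column is interpreted as a polynomial with coefficients in $GF(2^8)$, i.e., the column is a polynomial $s(x) = s_3x^3 + s_2x^2 + s_1x + s_0$ of degree lower than 4, where $s_i$ for $0 \le i < 4$ are elements of $GF(2^8)$.
The addition of two polynomials $b(x)$ and $s(x)$ can be described as:
$$\begin{aligned}
b(x) + s(x) &= (b_3x^3 + b_2x^2 + b_1x + b_0) + (s_3x^3 + s_2x^2 + s_1x + s_0) \\
&= (b_3 \oplus s_3)x^3 + (b_2 \oplus s_2)x^2 + (b_1 \oplus s_1)x + (b_0 \oplus s_0)
\end{aligned} \tag{3.3}$$
To define a multiplication of these polynomials another reduction polynomial $l(x)$ is used to guarantee that the degree of the result stays below 4. In case of AES the reduction polynomial is set to $l(x) = x^4 + 1$. This polynomial is not irreducible, i.e., not all polynomials have an inverse element for the multiplication modulo $l(x)$. Only polynomials that cannot be divided by $(x + 1)$ have an inverse.
The multiplication of two polynomials $b(x)$ and $s(x)$ modulo $l(x)$, written with the symbol $\otimes$, can be described as:
$$\begin{aligned}
c(x) = b(x) \otimes s(x) &= (b_3x^3 + b_2x^2 + b_1x + b_0)(s_3x^3 + s_2x^2 + s_1x + s_0) \\
&= (b_3 \bullet s_3)x^6 + (b_3 \bullet s_2 \oplus b_2 \bullet s_3)x^5 + \cdots \\
&\equiv c_3x^3 + c_2x^2 + c_1x + c_0 \mod l(x)
\end{aligned} \tag{3.4}$$
where
$$c_0 = b_0 \bullet s_0 \oplus b_3 \bullet s_1 \oplus b_2 \bullet s_2 \oplus b_1 \bullet s_3 \tag{3.5}$$
$$c_1 = b_1 \bullet s_0 \oplus b_0 \bullet s_1 \oplus b_3 \bullet s_2 \oplus b_2 \bullet s_3 \tag{3.6}$$
$$c_2 = b_2 \bullet s_0 \oplus b_1 \bullet s_1 \oplus b_0 \bullet s_2 \oplus b_3 \bullet s_3 \tag{3.7}$$
$$c_3 = b_3 \bullet s_0 \oplus b_2 \bullet s_1 \oplus b_1 \bullet s_2 \oplus b_0 \bullet s_3. \tag{3.8}$$
This is based on the reduction with the polynomial $l(x)$, i.e., $x^i \bmod (x^4 + 1) \equiv x^{i \bmod 4}$.
In order to simplify the multiplication further, for multiplications with the same polynomial $b(x)$, it is possible to convert the aforementioned results into the following matrix multiplication:
$$\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix} = \begin{pmatrix} b_0 & b_3 & b_2 & b_1 \\ b_1 & b_0 & b_3 & b_2 \\ b_2 & b_1 & b_0 & b_3 \\ b_3 & b_2 & b_1 & b_0 \end{pmatrix} \begin{pmatrix} s_0 \\ s_1 \\ s_2 \\ s_3 \end{pmatrix} \tag{3.9}$$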
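The multiplication modulo $x^4 + 1$ can be illustrated with a short Python sketch. The names `gf_mul` and `poly_mul_mod4` are assumptions made for this example; coefficients are stored lowest degree first, i.e., as `[b0, b1, b2, b3]`.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def poly_mul_mod4(b, s):
    """Multiply two four-term polynomials modulo l(x) = x^4 + 1.

    The exponent of each partial product is reduced with
    x^i mod (x^4 + 1) = x^(i mod 4).
    """
    c = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            c[(i + j) % 4] ^= gf_mul(b[i], s[j])
    return c


# a(x) = 03 x^3 + 01 x^2 + 01 x + 02 and its inverse modulo x^4 + 1,
# 0B x^3 + 0D x^2 + 09 x + 0E (the inverse is taken from the AES standard).
a = [0x02, 0x01, 0x01, 0x03]
a_inv = [0x0E, 0x09, 0x0D, 0x0B]
print(poly_mul_mod4(a, a_inv))
```

The product of a polynomial with its inverse is the constant polynomial 1, which serves as a quick consistency check of the index arithmetic in Equations 3.5 to 3.8.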
3.2 Functions of the AES
The first functions described in this section are the round functions. These functions process the plaintext into the ciphertext. The input data of the functions is arranged in a two-dimensional byte array. This arrangement is called the state and is formed by four rows of four bytes. Each byte in the state has an identifier $s_{r,c}$ where $r$ indicates its row and $c$ its column, with $0 \le r < 4$ and $0 \le c < 4$. The transcription of the plaintext $P = \{p_0, \dots, p_{15}\}$ into the state is presented in the following equation:
$$P = \begin{pmatrix} p_0 & p_4 & p_8 & p_{12} \\ p_1 & p_5 & p_9 & p_{13} \\ p_2 & p_6 & p_{10} & p_{14} \\ p_3 & p_7 & p_{11} & p_{15} \end{pmatrix} = \begin{pmatrix} s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\ s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} \\ s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} \\ s_{3,0} & s_{3,1} & s_{3,2} & s_{3,3} \end{pmatrix} \tag{3.10}$$
After the round functions the key generation algorithm is presented. It describes the procedure to create the different round keys used during the encryption.
3.2.1 SubBytes Transformation
The first function in a round is the SubBytes transformation. It is a non-linear substitution that operates on every byte separately. The SubBytes function actually consists of two transformations. The first is the calculation of the multiplicative inverse in the finite field $GF(2^8)$, where the zero element is mapped to itself. The second is an affine transformation over $GF(2)$:
$$x'_i = x_i \oplus x_{(i+4) \bmod 8} \oplus x_{(i+5) \bmod 8} \oplus x_{(i+6) \bmod 8} \oplus x_{(i+7) \bmod 8} \oplus c_i \tag{3.11}$$
It processes a byte $x = x_7, \dots, x_0$ to $x'$, where $x_i$ are the corresponding bits of the byte, for $0 \le i < 8$. The constant $c$ used in the equation has the value $63_{(16)} = 01100011_{(2)}$.
In software these two transformations are often combined into a substitution table, the so-called S-Box, denoted $S$ in equations. The S-Box, see Appendix A.1, works on the two nibbles of the input byte. The more significant nibble indicates the row and the less significant nibble the column of the substitution table to determine the result of the substitution.
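The two steps of SubBytes can be combined into a table with a short Python sketch. The helper names (`gf_mul`, `gf_inv`, `affine`, `SBOX`) are assumptions for this illustration, and the inverse is computed naively as $a^{254}$, which is slow but easy to follow.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8); the zero element maps to itself."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):  # a^254 = a^(-1), since a^255 = 1 for a != 0
        result = gf_mul(result, a)
    return result


def affine(x, c=0x63):
    """Affine transformation over GF(2) as in Equation 3.11."""
    y = 0
    for i in range(8):
        bit = (x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
        bit ^= (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (c >> i)
        y |= (bit & 1) << i
    return y


SBOX = [affine(gf_inv(x)) for x in range(256)]

# Table view: the high nibble selects the row, the low nibble the column.
x = 0x53
row, col = x >> 4, x & 0x0F
print(hex(SBOX[16 * row + col]))
```

For the input `0x53` the lookup yields `0xed`, the example value given in the AES standard.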
3.2.2 ShiftRows Transformation
The ShiftRows transformation is the second round function. It performs a cyclic shift on the rows of the state. The shift moves a byte to a lower position in the row. If a byte is shifted out of the row, it is placed at the highest position, as presented in Figure 3.2. The number of shifted bytes for each row depends on its index; hence, the first row with index 0 is not affected by this transformation. This operation can be described as follows:
$$s'_{r,c} = s_{r,(c+r) \bmod 4} \quad \text{for } 0 \le r < 4 \text{ and } 0 \le c < 4 \tag{3.12}$$
Figure 3.2: ShiftRows cyclically shifts the last three rows in the state [Nat01]
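The cyclic row shift can be sketched in Python as follows. The state is modelled as a list of four rows, `state[r][c]`, which is an assumption of this example and not necessarily the memory layout of a real implementation.

```python
def shift_rows(state):
    """Rotate row r of the state by r positions towards lower column indices."""
    return [[state[r][(c + r) % 4] for c in range(4)] for r in range(4)]


# Each entry encodes its original position as 10 * r + c.
state = [[10 * r + c for c in range(4)] for r in range(4)]

for row in shift_rows(state):
    print(row)
```

Row 0 stays unchanged, while in row 1 the byte from column 0 wraps around to column 3, in line with Figure 3.2.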
3.2.3 MixColumns Transformation
The third round function affects the columns of the state (see Figure 3.3). Each column is interpreted as a four-term polynomial:
$$s_c(x) = s_{3,c}x^3 + s_{2,c}x^2 + s_{1,c}x + s_{0,c} \quad \text{for } 0 \le c < 4 \tag{3.13}$$
The coefficients $s_{r,c}$ are treated as elements in $GF(2^8)$, for $0 \le r, c < 4$. The transformation multiplies these polynomials modulo $(x^4 + 1)$ with a fixed polynomial $a(x) = 3x^3 + 1x^2 + 1x + 2$:
$$s'_c(x) = a(x) \otimes s_c(x) \bmod (x^4 + 1). \tag{3.14}$$
As presented in Section 3.1.3, this multiplication can be transformed into the following matrix multiplication:
$$\begin{pmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{pmatrix} = \begin{pmatrix} 2 & 3 & 1 & 1 \\ 1 & 2 & 3 & 1 \\ 1 & 1 & 2 & 3 \\ 3 & 1 & 1 & 2 \end{pmatrix} \begin{pmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{pmatrix} \quad \text{for } 0 \le c < 4 \tag{3.15}$$
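The matrix form of MixColumns for a single column can be sketched in Python as shown below. The function names are assumptions for this illustration; the input column is a commonly used MixColumns test vector.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


MIX = [
    [2, 3, 1, 1],
    [1, 2, 3, 1],
    [1, 1, 2, 3],
    [3, 1, 1, 2],
]


def mix_column(column):
    """Apply the MixColumns matrix to one state column [s0, s1, s2, s3]."""
    out = []
    for row in MIX:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gf_mul(coeff, byte)
        out.append(acc)
    return out


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

A column whose four bytes are identical is mapped to itself, because the coefficients of every matrix row add up to $2 \oplus 3 \oplus 1 \oplus 1 = 1$.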
3.2.4 AddRoundkey Transformation
The last round function is an addition of the state columns with the corresponding round key. The keys are provided by the key scheduling algorithm as four four-byte long words. Every one of them is added by an XOR operation to a column of the state. Figure 3.4 describes this transformation. The result is the input of the next round or the outgoing ciphertext.
Figure 3.3: MixColumns operates on the state column-by-column [Nat01]
Figure 3.4: AddRoundKey XORs each column of the state with a word from the key schedule [Nat01]
3.2.5 Key Generation
The key scheduling algorithm expands the secret key and generates the round keys. For each round and for the initial key addition one round key is necessary, i.e., eleven round keys in total for the AES-128 standard. Each of these keys consists of four four-byte long words. Figure 3.5 visualizes the generation process of these key words $w_i$ ($0 \le i < 44$) for the AES-128 standard. The secret key is required as initialization. It consists of the four words $w_0$ to $w_3$. To expand these words to all the round keys the algorithm processes the data recursively. Each new key word $w_i$ for $i > 3$ is an XOR combination of the key words $w_{(i-1)}$ and $w_{(i-4)}$.
However, if $(i \bmod 4 \equiv 0)$, the key word $w_i$ is computed in a more complex way, i.e., the word $w_{(i-1)}$ is transformed in three steps.
The first step is a cyclic rotation $R$. Each byte of the input word $w_{in} = \{a_0, a_1, a_2, a_3\}$ is shifted by one position to generate the output word $w_{out} = \{a_1, a_2, a_3, a_0\}$.
The next step consists of the transformation described in Section 3.2.1. Here the S-Box $S$ is used to substitute each byte of the input word to generate the output word.
The last step is an addition of a round constant $rc_j$. The constant is an element of the finite field $GF(2^8)$ and is defined as $rc_j = x^{j-1} \bmod r(x)$ for $1 \le j \le 10$, where $x$ corresponds to the byte $02_{(16)}$.
Figure 3.5: The key generation algorithm for the AES-128 standard
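After this step the output is combined with the word $w_{(i-4)}$. This leads to the following equation to generate $w_i$ for $3 < i < 44$:
$$w_i = \begin{cases} S(R(w_{(i-1)})) \oplus rc_{(i/4)} \oplus w_{(i-4)} & \text{for } (i \bmod 4 \equiv 0) \\ w_{(i-1)} \oplus w_{(i-4)} & \text{for } (i \bmod 4 \not\equiv 0). \end{cases} \tag{3.16}$$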
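The recursion of Equation 3.16 can be sketched in Python for AES-128. All names (`gf_mul`, `build_sbox`, `expand_key_128`) are assumptions for this illustration. The S-Box is derived from its mathematical definition, and the round constant is XORed into the first byte of the word, as specified by the AES standard. The example key is the AES-128 example key of the standard.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def build_sbox():
    """Derive the S-Box: field inverse followed by the affine transformation."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):  # x^254 is the inverse of x
                inv = gf_mul(inv, x)
        y = 0
        for i in range(8):
            bit = (inv >> i) ^ (inv >> ((i + 4) % 8)) ^ (inv >> ((i + 5) % 8))
            bit ^= (inv >> ((i + 6) % 8)) ^ (inv >> ((i + 7) % 8)) ^ (0x63 >> i)
            y |= (bit & 1) << i
        sbox.append(y)
    return sbox


SBOX = build_sbox()


def expand_key_128(key):
    """Return the 44 key words w_0 .. w_43 of AES-128 as 4-byte objects."""
    w = [bytes(key[4 * i:4 * i + 4]) for i in range(4)]
    rc = 0x01  # rc_1
    for i in range(4, 44):
        temp = list(w[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]      # rotation R
            temp = [SBOX[b] for b in temp]  # substitution S
            temp[0] ^= rc                   # round constant rc_(i/4)
            rc = gf_mul(rc, 0x02)           # next constant rc_(i/4 + 1)
        w.append(bytes(t ^ p for t, p in zip(temp, w[i - 4])))
    return w


words = expand_key_128(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(words[4].hex(), words[43].hex())
```

The eleven round keys are the word groups `words[4 * j:4 * j + 4]` for $0 \le j \le 10$.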
3.3 AES Implementations
In order to construct the AES in software several different implementations can be used. Two of the most common implementations are presented in this section, the 8-bit straightforward implementation and the 32-bit transformation table implementation.
3.3.1 8-Bit Straightforward Implementation
The straightforward 8-bit implementation, as presented in [DR02], is used, for example, for smart cards and other 8-bit processors. The round functions can be programmed by implementing the different steps, except for the SubBytes transformation, which is implemented using the S-Box as described in Section 3.2.1. To improve the efficiency of the multiplication of a variable in $GF(2^8)$ with a constant used in the MixColumns function, the operation is replaced with a repeated multiplication by the constant $02_{(16)}$. This multiplication of $02_{(16)}$ is
3.3.2 32-Bit Transformation Table Implementation
For 32-bit platforms [DR02] presents a common method to implement the AES. The SubBytes and MixColumns functions are combined into transformation tables, or T-Tables. This technique uses five T-Tables, one for the last round without the MixColumns operation and four tables for the remaining nine rounds. With these tables the computation of one round only requires sixteen table lookups. The trick behind this method is the construction of the tables. To explain it, one round of the encryption of a column $c$ is examined more closely.
The first round function is the SubBytes transformation. It substitutes each input byte with the S-Box into the values $s'$:
$$\begin{pmatrix} S[s_{0,c}] \\ S[s_{1,c}] \\ S[s_{2,c}] \\ S[s_{3,c}] \end{pmatrix} = \begin{pmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{pmatrix} \tag{3.17}$$
By performing the next transformation, the ShiftRows function, the values of $s'$ are rotated to a lower position in the corresponding row according to their row index. In order to get an entire output column $s''$, the input bytes are chosen as described in the following equation:
$$\begin{pmatrix} S[s_{0,c}] \\ S[s_{1,(c+1) \bmod 4}] \\ S[s_{2,(c+2) \bmod 4}] \\ S[s_{3,(c+3) \bmod 4}] \end{pmatrix} = \begin{pmatrix} s''_{0,c} \\ s''_{1,c} \\ s''_{2,c} \\ s''_{3,c} \end{pmatrix} \tag{3.18}$$
In the following MixColumns function the output values $s''$ are transformed by the matrix multiplication presented in Section 3.2.3. This leads to the following equation:
$$\begin{pmatrix} 2 & 3 & 1 & 1 \\ 1 & 2 & 3 & 1 \\ 1 & 1 & 2 & 3 \\ 3 & 1 & 1 & 2 \end{pmatrix} \begin{pmatrix} S[s_{0,c}] \\ S[s_{1,(c+1) \bmod 4}] \\ S[s_{2,(c+2) \bmod 4}] \\ S[s_{3,(c+3) \bmod 4}] \end{pmatrix} = \begin{pmatrix} s'''_{0,c} \\ s'''_{1,c} \\ s'''_{2,c} \\ s'''_{3,c} \end{pmatrix} \tag{3.19}$$
The round finishes by adding the round key. It combines the output column $s'''$ with the corresponding key word $k = \{k_0 \dots k_3\}$ using an XOR addition. The result is the output $s^{out}$ of the complete round:
$$\begin{pmatrix} 2 & 3 & 1 & 1 \\ 1 & 2 & 3 & 1 \\ 1 & 1 & 2 & 3 \\ 3 & 1 & 1 & 2 \end{pmatrix} \begin{pmatrix} S[s_{0,c}] \\ S[s_{1,(c+1) \bmod 4}] \\ S[s_{2,(c+2) \bmod 4}] \\ S[s_{3,(c+3) \bmod 4}] \end{pmatrix} \oplus \begin{pmatrix} k_0 \\ k_1 \\ k_2 \\ k_3 \end{pmatrix} = \begin{pmatrix} s^{out}_{0,c} \\ s^{out}_{1,c} \\ s^{out}_{2,c} \\ s^{out}_{3,c} \end{pmatrix} \tag{3.20}$$
Equation 3.20 also represents an entire round of the encryption. In the following equations the transformation of each output byte is described separately:
$$s^{out}_{0,c} = (2 \bullet S[s_{0,c}]) \oplus (3 \bullet S[s_{1,(c+1) \bmod 4}]) \oplus (1 \bullet S[s_{2,(c+2) \bmod 4}]) \oplus (1 \bullet S[s_{3,(c+3) \bmod 4}]) \oplus k_0 \tag{3.21}$$
$$s^{out}_{1,c} = (1 \bullet S[s_{0,c}]) \oplus (2 \bullet S[s_{1,(c+1) \bmod 4}]) \oplus (3 \bullet S[s_{2,(c+2) \bmod 4}]) \oplus (1 \bullet S[s_{3,(c+3) \bmod 4}]) \oplus k_1 \tag{3.22}$$
$$s^{out}_{2,c} = (1 \bullet S[s_{0,c}]) \oplus (1 \bullet S[s_{1,(c+1) \bmod 4}]) \oplus (2 \bullet S[s_{2,(c+2) \bmod 4}]) \oplus (3 \bullet S[s_{3,(c+3) \bmod 4}]) \oplus k_2 \tag{3.23}$$
$$s^{out}_{3,c} = (3 \bullet S[s_{0,c}]) \oplus (1 \bullet S[s_{1,(c+1) \bmod 4}]) \oplus (1 \bullet S[s_{2,(c+2) \bmod 4}]) \oplus (2 \bullet S[s_{3,(c+3) \bmod 4}]) \oplus k_3 \tag{3.24}$$
Here every input value is used four times, once for each output byte. Based on this observation the construction of the T-Tables is performed. For each row of the state one table is created with 256 four-byte entries. These tables are calculated in the following way:
$$T_0[i] = \{ (2 \bullet S[i]) \,\|\, S[i] \,\|\, S[i] \,\|\, (3 \bullet S[i]) \} \tag{3.25}$$
$$T_1[i] = \{ (3 \bullet S[i]) \,\|\, (2 \bullet S[i]) \,\|\, S[i] \,\|\, S[i] \} \tag{3.26}$$
$$T_2[i] = \{ S[i] \,\|\, (3 \bullet S[i]) \,\|\, (2 \bullet S[i]) \,\|\, S[i] \} \tag{3.27}$$
$$T_3[i] = \{ S[i] \,\|\, S[i] \,\|\, (3 \bullet S[i]) \,\|\, (2 \bullet S[i]) \} \tag{3.28}$$
for $0 \le i < 256$.
How one column is processed with the T-Tables is visualized in Figure 3.6.
Figure 3.6: Demonstration of the functionality of the T-Table implementation for the data $s$ of column $c$ with the corresponding round key word $k = \{k_i\}$ for $0 \le i < 4$
The computation of the last round only requires one table $T_4$, because the MixColumns transformation is missing. This table implements just the S-Box of the SubBytes transformation.
With this implementation of the T-Tables the computation of one AES round is performed with just 16 table lookups, i.e., 160 table lookups for an entire encryption of one block.
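The equivalence between the T-Table lookups and the separate round functions can be checked with a Python sketch. All names are assumptions for this illustration; table entries are packed into 32-bit integers with the first byte of Equations 3.25 to 3.28 in the most significant position, which is one possible convention and not necessarily the one used by a given library.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def build_sbox():
    """Derive the S-Box: field inverse followed by the affine transformation."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):
                inv = gf_mul(inv, x)
        y = 0
        for i in range(8):
            bit = (inv >> i) ^ (inv >> ((i + 4) % 8)) ^ (inv >> ((i + 5) % 8))
            bit ^= (inv >> ((i + 6) % 8)) ^ (inv >> ((i + 7) % 8)) ^ (0x63 >> i)
            y |= (bit & 1) << i
        sbox.append(y)
    return sbox


def word(b0, b1, b2, b3):
    """Pack four bytes into one 32-bit word, b0 most significant."""
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3


SBOX = build_sbox()
T0 = [word(gf_mul(2, s), s, s, gf_mul(3, s)) for s in SBOX]
T1 = [word(gf_mul(3, s), gf_mul(2, s), s, s) for s in SBOX]
T2 = [word(s, gf_mul(3, s), gf_mul(2, s), s) for s in SBOX]
T3 = [word(s, s, gf_mul(3, s), gf_mul(2, s)) for s in SBOX]


def round_column_ttable(state, c, k):
    """One output column of a round using four table lookups."""
    return (T0[state[0][c]] ^ T1[state[1][(c + 1) % 4]]
            ^ T2[state[2][(c + 2) % 4]] ^ T3[state[3][(c + 3) % 4]] ^ k)


def round_column_direct(state, c, k):
    """The same column via SubBytes, ShiftRows, MixColumns, AddRoundKey."""
    s = [SBOX[state[r][(c + r) % 4]] for r in range(4)]
    out = [
        gf_mul(2, s[0]) ^ gf_mul(3, s[1]) ^ s[2] ^ s[3],
        s[0] ^ gf_mul(2, s[1]) ^ gf_mul(3, s[2]) ^ s[3],
        s[0] ^ s[1] ^ gf_mul(2, s[2]) ^ gf_mul(3, s[3]),
        gf_mul(3, s[0]) ^ s[1] ^ s[2] ^ gf_mul(2, s[3]),
    ]
    return word(*out) ^ k


# Arbitrary example state and key word (hypothetical values).
state = [[(17 * r + 29 * c + 5) % 256 for c in range(4)] for r in range(4)]
key_word = 0x0F1E2D3C
print(hex(round_column_ttable(state, 0, key_word)))
```

Each of the four tables holds 256 four-byte entries, i.e., 1 KB, which is the lookup space that becomes relevant for the cache analysis.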
3.4 Cache Behavior of the AES
Cache based timing attacks take advantage of the information leakages produced by the cache behavior of the implementation. This behavior is based on accessing lookup tables. Therefore, the table lookups of the implementations presented in Section 3.3 are analyzed.
In the straightforward implementation of the AES the only lookup table is the S-Box (Appendix A.1). It consists of 256 one-byte entries. During one encryption the table is accessed 160 times. In Figure 3.7 the S-Box calls of a random encryption are traced and visualized. This visualization marks the bytes of the encryption where a direct collision or a cache line collision occurs. Noticeable in this trace is the fact that all table lookups in the last rounds cause collisions. This leads to the assumption that all entries of the S-Box are loaded into the cache.
Figure 3.7: Direct and cache line collisions during one random encryption using the S-Box implementation. One cache line is formed by 8 S-Box entries. (Axes: rounds 1 to 10 against state bytes 0 to 15.)
The 32-bit transformation table implementation uses five different lookup tables. Each of these tables has 256 four-byte entries. For the first nine rounds of the encryption four tables are needed, the tables $T_i$ for $0 \le i < 4$. In one round each of these tables is accessed four times, i.e., 36 lookups per table in the nine rounds. For the computation of the last round the remaining table $T_4$ is used and accessed 16 times. This leads to 160 table lookups in total. Figure 3.8 visualizes the cache collisions during a random encryption using the 32-bit implementation. Here, fewer collisions occur, since the 160 table lookups are divided among five tables.
All these collisions influence the execution time of the encryption, since a cache hit or collision is faster than a cache miss. If, however, the entire tables are stored in the cache, the encryption time should be almost constant. Figure 3.9 shows a histogram of the number of collisions over 1,000,000 random encryptions. It shows that the number of cache hits approximately follows a normal distribution.
Figure 3.8: Direct and cache line collisions during one random encryption using the 32-bit transformation table implementation. One cache line is formed by 8 table entries. (Axes: rounds 1 to 10 against state bytes 0 to 15; lookups are grouped by T-Table 0 to T-Table 4.)
In order to get further information about the cache behavior it is useful to know how often a cache hit occurs at a certain point of the encryption. This probability is depicted in Figure 3.10. The information is based on the analysis of 1,000,000 random encryptions. For example, a cache hit occurs with a probability of 36% while processing the 8th byte in round 4 of the encryption.
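The order of magnitude of these probabilities can be reproduced with a simple model. The sketch below is not the measurement procedure of the thesis; it assumes that the table indices are independent and uniformly distributed, that each table spans 256 / 8 = 32 cache lines, and that the table is not cached before the first lookup. The function name is hypothetical.

```python
def expected_hit_probability(previous_lookups, lines_per_table=32):
    """Probability that a lookup hits a line loaded by an earlier lookup.

    Assumes independent, uniformly distributed indices (a modelling
    assumption): the lookup misses only if none of the previous lookups
    into the same table touched the same cache line.
    """
    miss = (1.0 - 1.0 / lines_per_table) ** previous_lookups
    return 1.0 - miss


# Byte 8 in round 4 is the third lookup into its table in that round.
# Three earlier rounds contribute 12 lookups, so 14 lookups precede it.
print(round(100 * expected_hit_probability(14)))

# Last lookup into one of T0..T3 (35 predecessors) and into T4 (15 predecessors).
print(round(100 * expected_hit_probability(35)))
print(round(100 * expected_hit_probability(15)))
```

Under these assumptions the model gives about 36% for the lookup mentioned above, which agrees with the reported value.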
The analysis of the cache behavior leads to the conclusion that the transformation table implementation has a greater vulnerability to cache attacks than the straightforward implementation. This is based on the different tables, which offer a larger lookup space. Especially the last round is vulnerable, because it is processed with only one table, which is not accessed as much as the other tables.
Figure 3.9: Histogram of the number of cache hits during the encryption (x-axis: number of cache hits, 40 to 80; y-axis: number of encryptions, in units of $10^4$)
Figure 3.10: Probability of a cache hit during an encryption using the 32-bit transformation table implementation. The probability is displayed as a percentage. One cache line is formed by 8 table entries. (Plotted values: for each of T-Table 0 to T-Table 3 the probability rises from 0% at the first lookup in round 1 to 67% at the last lookup in round 9; for T-Table 4 in round 10 it rises from 0% to 38% over the 16 lookups.)
4 The Attack Setup
Nowadays, a lot of embedded devices can access the internet. The connection is commonly established via the local network using the Ethernet interface. Over this communication channel a lot of information is transferred. This communication model is adopted to create the attack model for this thesis. The attack model describes which parties take part in the attacks and how they interact.
The basic attack model used in this thesis is formed by two parties that communicate over a local network, the server and the client. They form a simple client-server architecture, like the one shown in Figure 4.1. This model is used for remote attacks, because the communication and interactions are carried out over the local network.
Figure 4.1: The simple client-server architecture of the attacks (Legend: $P$ plaintext, $C$ ciphertext, $K$ secret key; the server computes $C = AES_K[P]$.)
The first party is the client. In this thesis, the client is the adversary. She has the intention to break the cryptographic primitive of the server by recovering the entire secret key. Therefore, the client generates plaintexts and sends them to the server over the Ethernet.
The second party is the server, which is the target in this model. It provides the service to encrypt data for other parties, using the AES implementation of the crypto library of OpenSSL [Ope] in version 0.9.8k. Therefore, the server receives the plaintexts from the local network and encrypts them with the AES using the secret key. After the encryption the server sends the ciphertext back to the sender of the plaintext, in this case, back to the client.
The hardware used to represent the server and the client is described in the following section. Here, particular attention is paid to the cache of the hardware.
In order to extract the information to reconstruct the secret key, the attacker needs additional side channel information. The information leak used in this thesis is the total encryption time. In Section 4.2 two approaches to measure the encryption time are presented. Afterwards, two methods of cleaning the cache are described.
Section 4.4 presents the actual attack scenario and describes the assumptions made for the attacks described in this thesis.
4.1 Used Hardware
Two hardware devices are used in this thesis. The important device is the embedded board that is used to represent the target server. The adversary uses a common PC, which is used as client and for further analyses.
4.1.1 SBC2440-II Board
The hardware used as target is the Embest SBC2440-II board. It provides a Samsung S3C2440A [Sam] 16/32-bit RISC processor developed for low-power and high-performance applications such as mobile devices. It can be clocked at up to 400 MHz. The CPU provides a lot of functionality to handle the board's interfaces, such as an LCD controller. Additionally, the microprocessor contains an ARM920T [ARM] core designed by ARM Limited.
The microprocessor has a Harvard memory architecture with separate data and instruction cache. The caches have a size of 16KB each. They are divided into 512 cache lines of 8 four-byte words. They are arranged in 64-way set-associative caches with each having eight sets of 64 cache lines. An entry can be located in just one set, but in this set it can be stored in any of the set’s cache lines. The addressing of the caches is visualized in Figure 4.2. The bits 7 to 5 of the address define the set where the entry is located. The cache line itself can be determined by comparing the tag of the address, bits 31 to 8, with the tags stored in the cache. The bits 4 to 2 specify the word in the cache line and the bytes in a word can be addressed with the bits 1 to 0.
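The address layout described above can be made concrete with a short Python sketch. The function name and the example address are assumptions for this illustration.

```python
def split_address(addr):
    """Split a 32-bit address into the cache fields described above."""
    return {
        "tag": (addr >> 8) & 0xFFFFFF,  # bits 31..8, compared with stored tags
        "set": (addr >> 5) & 0x7,       # bits 7..5, selects one of 8 sets
        "word": (addr >> 2) & 0x7,      # bits 4..2, word within the cache line
        "byte": addr & 0x3,             # bits 1..0, byte within the word
    }


fields = split_address(0x000012A7)  # hypothetical address
print(fields)

# Consistency check of the cache geometry: 8 sets x 64 lines x 32 bytes.
LINE_BYTES = 8 * 4
CACHE_BYTES = 8 * 64 * LINE_BYTES
print(CACHE_BYTES)
```

With 32-byte lines, one line holds 8 four-byte T-Table entries, so a 1 KB table occupies 32 lines, assuming it is aligned to a line boundary.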
The ARM920T core also implements a Memory Management Unit (MMU) to translate the virtual addresses to the physical addresses of the memory. The MMU handles the permission checks for the instruction and data addresses used by the microprocessor.
The processor can handle two types of instruction sets: first, the 32-bit ARM instruction set, and second, the compressed 16-bit Thumb instruction set. The Thumb instruction set is used to trade off between high performance and high code density. The compression is made at the expense of the versatile functions of the 32-bit ARM instruction set.
Figure 4.2: The logical model of the addressing of the data or instruction cache. [ARM]
The Embest SBC2440-II board operates with an open source ARM-Linux with a 2.6.13 Linux kernel. Alternatively, Microsoft Windows CE can be installed.
The board itself is fabricated by the Embest Info & Tech Corporation and is classified as a single board computer. In Figure 4.3 the board is visualized. It provides a lot of additional features which are not used in the course of this thesis, e.g., audio in/output, LCD interface, and camera interface. They are useful for the development and testing of portable devices like PDAs. To communicate with the board it provides, amongst other features, an Ethernet interface and a serial port. The single board computer is equipped with 64 MB each of SDRAM and NAND flash.
4.1.2 Pentium 4 PC
As client a PC equipped with a Pentium 4 processor [Int04] is used. On the same computer the analysis and reference computations are made. It is clocked with a frequency of 2.4 GHz. The processor offers a cache architecture with an 8 KB level 1 cache and a 512 KB level 2 cache. They are arranged in a 4-way set associative cache and each cache line has the size of 64 bytes. Additionally, the PC offers 512 MB of memory.
Figure 4.3: Overview of the Embest SBC2440-II board [She]
As operating system an Ubuntu 8.10 Linux distribution is used with a 2.6.27 kernel.
4.2 Measuring the Encryption Time
The total encryption time is the basis for the analysis to extract the secret key of the server. There are two approaches to measure the encryption time. The first is shown in Figure 4.4. The time is measured on the client side. The adversary sends the plaintext $P$ to the server and takes the time until the answer with the ciphertext $C$ arrives. The resulting encryption time $t_e$ is stored with the plaintext and the ciphertext.
The other approach measures the time on the server side, shown in Figure 4.5. The client sends the plaintext $P$ to the server, which encrypts it. Additionally, the server measures the time of the encryption process. The ciphertext $C$ and the encryption time $t_e$ are sent back to the client.
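The server-side measurement can be sketched in Python as follows. This is only an illustration of the idea, not the measurement code of the thesis: `timed_encrypt` and `dummy_encrypt` are hypothetical names, and `dummy_encrypt` is a placeholder that stands in for the real AES call.

```python
import time


def dummy_encrypt(block):
    """Placeholder for the AES encryption (NOT a real cipher)."""
    return bytes(b ^ 0xFF for b in block)


def timed_encrypt(encrypt, plaintext):
    """Encrypt one block and return (t_e in microseconds, ciphertext).

    Only the encryption call itself lies between the two timestamps,
    so network latency is not part of the measured time.
    """
    start = time.perf_counter_ns()
    ciphertext = encrypt(plaintext)
    elapsed_ns = time.perf_counter_ns() - start
    return elapsed_ns / 1000.0, ciphertext


t_e, c = timed_encrypt(dummy_encrypt, bytes(16))
print(t_e, c.hex())
```

In the client-side variant the two timestamps would instead enclose the send and receive calls, so the measured value would also contain the communication time.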
The approach that measures the time on the client side has the advantage that the attacker needs no access to the server and all her actions are limited to the client side. This is typical for remote attacks. On the other hand, the communication time between client and server is also part of the measured time in this approach. Averaged over 1,000,000 measurements, an encryption takes almost 75 microseconds and the corresponding communication almost 560 microseconds. The communication time is added to the encryption time as noise. In order to remove the noise extra measurements have to be made. Therefore, this thesis uses the method that measures only the pure encryption time on the server side. Hence, there is less noise in the measurements and fewer samples are needed.
Figure 4.4: Visualization of the measurement of the encryption time on the client side (Legend: $P$ plaintext, $C$ ciphertext, $K$ secret key, $t_e$ encryption time; the client obtains $t_e$ from measureTime().)
Figure 4.5: Visualization of the measurement of the encryption time on the server side (Legend: $P$ plaintext, $C$ ciphertext, $K$ secret key, $t_e$ encryption time; the server returns $(t_e, C) = AES_K[P]$.)
4.3 Cleaning the Cache
For the attacks described in this thesis it is necessary to make sure that no data of the previous encryption is in the cache. Therefore, the cache is cleaned before each encryption. This can be done with two different methods.
The first method is presented in Figure 4.6. Here, the encryption process itself provides the functionality of removing its data from the cache. The attacker triggers the cache cleaning by sending the trigger message $cct$ to the server.
Figure 4.6: Visualization of the triggered cache cleaning method (Legend: $P$ plaintext, $C$ ciphertext, $K$ secret key, $cct$ clean cache trigger; the server calls cleanCache() on receiving $cct$.)
The second approach is based on the fact that servers are multitasking systems. Different processes handle a lot of requests at the same time by splitting up the computation time. These simultaneous processes use the same cache. This leads to a random eviction of cache entries. In this method the workload of the server is simulated by a jamming process, as shown in Figure 4.7. This process runs simultaneously with the encryption process and cleans the cache randomly, i.e., the data of the encryption process is randomly removed from the cache, too.
Figure 4.7: Visualization of the cache cleaning method with a simultaneous jam
In this thesis, the first method, the so-called inner process method, is used, because too many measurements are needed for a successful attack using the other method. During an analysis more than 40 times as many measurements are needed if a jamming process is used.
4.4 The Attack Scenario
The attack scenario describes the abilities of the different parties that take part in the attacks, and what assumptions are made for them.
In Figure 4.8 the attack scenario for the attacks is presented. The first party is the client of the adversary which generates the plaintext. This generation is either random or the attacker can choose which plaintext she sends to the target. This depends on the actual attack, which is carried out. Additionally, the attacker can decide when to clean the cache. To do so, she sends a trigger message to the server, whereupon the cache is flushed.
Figure 4.8: Visualization of the attack scenario. The encryption time is measured on the server side and triggered cache cleaning is used. (Legend: $P$ plaintext, $C$ ciphertext, $K$ secret key, $cct$ clean cache trigger, $t_e$ encryption time.)
The second party is the target server. It has the ability to measure the time of the encryption. This time is taken in microseconds and sent back to the client with the ciphertext. Additionally, the assumption is made that only the client interacts with the server. Hence, there are no competing processes running on the target, except for the operating system.