**KUL - COSIC** **ECRYPT Summer School - 1** **Albena, May 2011 **

### AES and other secret key implementations

**Ingrid Verbauwhede**

### ingrid.verbauwhede-at-esat.kuleuven.be

### K.U.Leuven, ESAT- SCD - COSIC

### Computer Security and Industrial Cryptography

### Acknowledgements:

**Current and former Ph.D. students**

**at UCLA and K.U.Leuven**

**Outline & Goal**

### • Crypto engineering for secret key algorithms

### – Design parameters

### – DES

### – Modes of operation

### – AES


**Design Parameters**

### Embedded security:

### Area, delay, power, energy

**Crypto engineering everywhere**

### • Continuum between software and hardware

### – ASIC (microcode) – FPGA – fully programmable processor

### Everything is always connected everywhere


**Embedded Security**

### NEED BOTH

### • Efficient, light-weight Implementation

### – Within power, area, timing budgets

### – Public key: 1024 bits RSA on 8 bit μC and 100 μW

### – Public key on a passive RFID tag

### • Trustworthy implementation

### – Resistant to attacks

### – Active attacks: probing, power glitches, JTAG scan chain

### – Passive attacks: side channel attacks, including power, timing and electromagnetic leaks

**Cost definition**

### • Area

### • Time

### • Power, Energy

### • Physical Security


**Design parameters**

### • Speed or throughput:

### – HW: Gbits/sec or Mbits/sec/slice

### – SW: Cycles/byte, independent of clock frequency

### • Area:

### – HW: mm² (gate or transistor count)

### – SW: memory footprint

### • Power or energy consumption:

### – Power (Watts) for cooling or transmission (RFID)

### – Energy (Joule): battery operated devices

### • Security: difficult to measure, but we want it

### – Entropy, leakage functions?

### – Measurements until disclosure?

**Throughput: Real-time**

### • Extremely high throughput (Radar or fiber optics)

### – One operator (= hardware unit, e.g. adder, shifter, register) for each operation (= algorithmic, e.g. addition, multiplication, delay)

### – clock frequency = sample frequency

### • Most designs: time multiplexing

### – clock frequency > sample frequency (each operator is reused for several operations per sample)


**SW: cycles per byte**

### • “independent” of clock frequency or machine

### • Size of packet matters

### • “match” of algorithm to architecture

### [Figure: cycles/byte versus message size (8, 64 and 4096 bytes), with a 40 cycles/byte level marked. Source: http://bench.cr.yp.to/results-sha3.html]

**Power density problem**

### • Intel S. Borkar power density problem


**Low Energy: battery capacity**

### • Rabaey slide battery capacity

**One AAA battery: 1300 to 5000 Joule**

**Power and Energy are not the same!**

### • Power = P = I x V (current x voltage) (= Watt)

### – instantaneous

### – Typically checked for cooling or for peak performance

### • Energy = Power x execution time (= Joule)

### – Battery content is expressed in Joules

### – Gives an idea of how many Joules it takes to get the job done
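The difference between the two quantities can be made concrete with a small sketch. The 1300 Joule battery figure comes from the previous slide; the 100 mW load and the 2 ms execution time are hypothetical values chosen only for illustration.

```python
def energy_joules(power_watts: float, seconds: float) -> float:
    """Energy = Power x execution time."""
    return power_watts * seconds


def battery_lifetime_seconds(battery_joules: float, power_watts: float) -> float:
    """How long a battery sustains a constant power draw."""
    return battery_joules / power_watts


# Hypothetical operation: 100 mW drawn during 2 ms.
per_operation = energy_joules(0.100, 0.002)

# Number of such operations that fit in a 1300 J battery (lower AAA figure).
operations = 1300 / per_operation

print(f"{per_operation:.6f} J per operation, about {operations:.0f} operations")
```

Halving the execution time at the same power halves the energy per operation, while the peak power (and hence the cooling requirement) stays the same.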


**Heat and parallelism**

### [Figure: a single processor P with memory M (capacitance C), compared with four parallel units M/4, P/4 of capacitance C/4 each]

**Power (Heat):**

$$P_{mono} = C V^2 f \quad \text{(Watt)}$$

$$4 \cdot \frac{C}{4} \cdot V^2 \cdot \frac{f}{4} = \frac{P_{mono}}{4}$$

**but since f ~ V can be even** $P_{mono}/4^3$

**Reduce power = reduce WASTE !!**

**TREND: MULTI-CORE!!**
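A minimal numerical sketch of the scaling argument above. It assumes the idealized model of the slide (dynamic power only, and supply voltage scaling linearly with frequency); the capacitance, voltage and frequency values are arbitrary placeholders.

```python
import math


def dynamic_power(c: float, v: float, f: float) -> float:
    """Dynamic power P = C * V^2 * f."""
    return c * v ** 2 * f


def parallel_power(c: float, v: float, f: float, n: int, scale_voltage: bool = False) -> float:
    """n units, each with capacitance C/n, each running at f/n.

    With scale_voltage=True the supply is also lowered to V/n,
    following the idealized assumption f ~ V.
    """
    v_unit = v / n if scale_voltage else v
    return n * dynamic_power(c / n, v_unit, f / n)


# Placeholder values: 1 nF switched capacitance, 1 V, 100 MHz.
p_mono = dynamic_power(1e-9, 1.0, 100e6)
p_four = parallel_power(1e-9, 1.0, 100e6, 4)
p_four_scaled = parallel_power(1e-9, 1.0, 100e6, 4, scale_voltage=True)

print(p_mono / p_four, p_mono / p_four_scaled)
```

The two ratios correspond to the factors 4 and 4^3 on the slide. Real designs gain less, because voltage cannot be lowered indefinitely and leakage is ignored here.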


**Logic Design Activities**

### • Logic and FSM synthesis

### – State minim., coding

### – Multilevel Logic Optimisation

### • Technology Mapping

### – Functions to library cells

### – Minimal Area for given delay

### • Timing Verification

### – Estimate wiring load C

### – Critical logic path

### • Layout

### – P&R, C extraction from wiring ...

### [Figure: design flow from VHDL through logic synthesis (Synopsys) to timing: delay versus area trade-off (aoi and ff cells), logic depth (2, 6...) versus #literals, extraction -> timing -> timing closure]

**Standard Cell Layout**

### [Figure: std. cell place & route (RT-module) with std. cells arranged in cell rows separated by routing channels]


**Standard Cell Zoom In**

### [Figure: standard cell layout between the vdd and vss rails]

**Module Generation**

**For data-path operators: structure is in bit-slices**

**Computer generated layout as function of wordlength**

**Compact, predictable IP**

### [Figure: bit-sliced module with power lines and with instruction and clock lines]


**Standard Cell and Module**

**Courtesy: J. Van Campenhout RUG**

### [Figure: chip layout with a datapath module next to standard cell random logic]

**Start with easy one:**

**Block cipher - DES**


**Symmetric key: DES**

### • DES = Data Encryption Standard

### • FIPS Standard 46 effective in July 1977: US government standard for sensitive but unclassified data

### • Re-affirmed in 1983, 1988, 1993, 1999 (FIPS 46-3)

### • July 26, 2004: FIPS 46-3 is withdrawn: use TDEA or AES

### • TDEA = Triple DES encryption algorithm – NIST 800-67

### [Figure: DES block with 64-bit plaintext (Pi) input, 64-bit ciphertext (Ci) output and 64-bit key input]

**Key = 56 bits + 8 parity bits**

**TDEA**

### • Triple DES Encryption Algorithm, NIST Spec. Pub. 800-67 (May 2004)

### • Three Key options:

### – K1, K2, K3 different

### – K1=K3, K2 different

### – K1=K2=K3, backward compatible with single DES

### • two-key triple DES: until 2009

### • three-key triple DES: until 2030

### [Figure: plaintext (Pi, 64) -> DES (K1) -> DES-1 (K2) -> DES (K3) -> ciphertext (Ci, 64); each key input is 64 bits]

**DES = Feistel cipher**

### [Figure: encryption round i and decryption round i; both use the same function f and round key $K_i$]

Encryption round i: $L_i = R_{i-1}$, $R_i = L_{i-1} \oplus f(R_{i-1}, K_i)$

Decryption round i: $R_{i-1} = L_i$, $L_{i-1} = R_i \oplus f(L_i, K_i)$

### • DES has 16 rounds + initial and final permutation

### • Basic cipher structure is Feistel cipher

### – other examples of Feistel: FEAL, Kasumi

### • **Hardware: encryption = decryption! (different for AES)**
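The property "encryption = decryption" can be illustrated with a toy Feistel network. This is a sketch, not DES: the round function `f` below is a hypothetical stand-in built from SHA-256, and the round keys are arbitrary 32-bit integers. The point is only that `f` is never inverted and that decryption is the same routine with the round keys in reverse order.

```python
import hashlib

MASK32 = 0xFFFFFFFF


def f(right: int, key: int) -> int:
    """Toy round function (NOT the DES f): it never needs an inverse."""
    data = right.to_bytes(4, "big") + key.to_bytes(4, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")


def feistel(block: int, round_keys: list) -> int:
    """Run a 64-bit block through one Feistel round per round key."""
    left, right = block >> 32, block & MASK32
    for k in round_keys:
        # L_i = R_{i-1}, R_i = L_{i-1} xor f(R_{i-1}, K_i)
        left, right = right, left ^ f(right, k)
    # Undo the last swap so that the same routine also decrypts.
    return (right << 32) | left


def encrypt(block: int, round_keys: list) -> int:
    return feistel(block, round_keys)


def decrypt(block: int, round_keys: list) -> int:
    # Same datapath, round keys applied in reverse order.
    return feistel(block, round_keys[::-1])


keys = list(range(1, 17))          # 16 hypothetical round keys
plaintext = 0x0123456789ABCDEF
assert decrypt(encrypt(plaintext, keys), keys) == plaintext
```

In hardware this means one datapath serves both directions; only the order in which the key schedule delivers the round keys changes.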

**DES- f function**

### [Figure: $R_{i-1}$ (32) -> Expansion E (48) -> + $K_i$ (48) -> S1 S2 S3 S4 S5 S6 S7 S8 -> Permutation P (32) -> $f(R_{i-1}, K_i)$ (32)]

### • Expansion E: 32b-to-48b permutation (wiring & bit duplication)

### • input of S-boxes: 8x6b

### • Si = 6b-to-4b non linear substitution (ROM or logic based look up table)

### • output of S-boxes: 8x4b

### • Permutation P: 32b-to-32b permutation (wiring)

### • **Because of Feistel: no need for $f^{-1}$ (different for AES)**

**DES Key schedule**

### [Figure: initial key K (64) -> PC1 -> C and D registers (56) -> PC2 -> round key $K_i$ (48)]

### • PC1: permute and drop 8 bits

### • C&D: rotate left 1 or 2 bits each round

### • DECRYPTION: rotate right

### • PC2: permute and select 48 output bits

### • **C&D left/right shift registers: encryption & decryption HW**
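The left/right shift register behaviour can be sketched as follows. The per-round left-shift amounts are those of the DES standard (FIPS 46-3); PC1 and PC2 are left out, so this only tracks the contents of one 28-bit register (C or D). The register value used below is an arbitrary placeholder.

```python
# Left rotations per round from the DES standard; they sum to 28.
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]
MASK28 = (1 << 28) - 1


def rotl28(x: int, n: int) -> int:
    return ((x << n) | (x >> (28 - n))) & MASK28


def rotr28(x: int, n: int) -> int:
    return ((x >> n) | (x << (28 - n))) & MASK28


def encryption_states(c0: int) -> list:
    """Register contents used for rounds 1..16 when encrypting."""
    states, c = [], c0
    for s in SHIFTS:
        c = rotl28(c, s)
        states.append(c)
    return states


def decryption_states(c0: int) -> list:
    """Register contents for rounds 16..1, produced by rotating right.

    After 16 rounds the register has rotated by 28 bits in total, so the
    round-16 state equals the initial state and no first shift is needed.
    """
    states, c = [], c0
    for s in [0] + SHIFTS[:0:-1]:
        c = rotr28(c, s)
        states.append(c)
    return states


c0 = 0x0F0F0F1  # placeholder register contents
assert decryption_states(c0) == encryption_states(c0)[::-1]
```

Because the rotations sum to the register width, the same key register can be reused for decryption without reloading, which is what makes the on-the-fly schedule cheap in hardware.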

**Key Schedule**

### Two options:

### • On the “fly” = just in time processing (typical for hardware)

### • Pre-compute and store

### [Figure: on the fly: Key Schedule -> BC; pre-compute and store: Key Schedule -> Memory -> BC]

**Key schedule on the fly**

### • The cost of fast key context switching:

### • Example for IPSEC router

### – one 128 bit key = 1408 bits round keys (10 rounds + initial key)

### – half of internet packets are only 64 bytes in length (512 bits)

### [Figure: context bandwidth (Gbps, 0 to 10) versus record size (bytes, 10 to 10^5, logarithmic) for ARC4, AES and 3DES, with data at 1 Gbps. Source: J. Goodman]

**BANDWIDTH PROBLEM !**

**Modes of operation**


**Design method**

### • Advice: include modes of operation into hardware IP module or co-processor:

### - increases the complexity somewhat: more control or instructions are needed

### + CLEAN security partitioning

### + reduces communication overhead and traffic

**Modes of operation: ECB**

### • ECB = Electronic code book

### • cipher blocks are independent, thus insertion or deletion of blocks can go undetected

### • block cipher does not hide data patterns

### • BC = block cipher (e.g. 3DES or AES)

### [Figure: Message M -> BC -> Ciphertext C -> BC-1 -> Plaintext M]

**Modes of operation: CBC**

### • CBC = Cipher block chaining

### • error in Ci: propagation over 2 blocks ($R_i$ and $R_{i+1}$)

### • loss of block synchronization = fatal

### • error in Pi: propagation for the remaining blocks

### • mostly used with encryption-only for MAC generation

### [Figure: sender: $C_i = BC(P_i \oplus C_{i-1})$; receiver: $R_i = BC^{-1}(C_i) \oplus C_{i-1}$]

**Modes of operation: CBC-MAC**

### • Cipher block chaining – Message Authentication Code

### • Initialization Vector: IV = $C_{-1}$

### • Feedback inhibits pipelining

### [Figure: sender and receiver both compute $C_i = BC(P_i \oplus C_{i-1})$, so only the encryption direction BC is needed]

**Pipelining?**

### • Due to feedback pipeline remains empty

### • Worse for triple DES

### [Figure: CBC with single DES: $P_i \oplus C_{i-1}$ passes through 16 rounds (16 x) before $C_i$ is fed back. CBC with triple DES: 48 rounds (48 x) per block: 16 DES(K1), 16 DES-1(K2), 16 DES(K3)]

**Modes of operation: counter**

### • Converts block cipher into stream cipher

### • no feedback: pipelining is possible

### • crucial to choose non-repeating counter functions, e.g. LFSR

### • crucial to choose counter IV’s that are UNIQUE

### [Figure: sender: $y_i = BC(cntr_i)$, $C_i = P_i \oplus y_i$; receiver: $y_i = BC(cntr_i)$, $P_i = C_i \oplus y_i$]
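A minimal sketch of counter mode, showing that both sides use only the encryption direction of the block cipher and that every keystream block can be computed independently (which is what allows pipelining). The function `bc` is a hypothetical stand-in built from SHA-256, not AES or 3DES, and the 8-byte nonce plus 8-byte counter layout is an assumption made for this example.

```python
import hashlib

BLOCK = 16  # block size in bytes


def bc(key: bytes, block: bytes) -> bytes:
    """Stand-in for the block cipher encryption BC (NOT a real cipher)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def ctr_keystream_block(key: bytes, nonce: bytes, i: int) -> bytes:
    """y_i = BC(cntr_i), with cntr_i = nonce || i (assumed layout)."""
    return bc(key, nonce + i.to_bytes(8, "big"))


def ctr_xcrypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """C_i = P_i xor y_i; the same routine encrypts and decrypts."""
    out = bytearray()
    for offset in range(0, len(data), BLOCK):
        y = ctr_keystream_block(key, nonce, offset // BLOCK)
        chunk = data[offset:offset + BLOCK]
        out += bytes(p ^ k for p, k in zip(chunk, y))
    return bytes(out)


key = b"k" * 16
nonce = b"n" * 8          # must be UNIQUE per message under the same key
message = b"counter mode needs only the encryption direction"
ciphertext = ctr_xcrypt(key, nonce, message)
assert ctr_xcrypt(key, nonce, ciphertext) == message
```

Since no block depends on the previous ciphertext, the loop iterations could be issued back to back into a pipelined block cipher core.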


**Modes of operation: OCB**

### • Offset code book – proprietary

### • (used to be) popular because option in 802.11 WLAN

### • can be pipelined

### • need encryption & decryption logic (problem for AES)

### [Figure: sender: $C_i = BC(P_i \oplus Z_i) \oplus Z_i$; receiver: $P_i = BC^{-1}(C_i \oplus Z_i) \oplus Z_i$; both sides also process extras and a checksum]

**CCM (Counter + CBC MAC) mode**

### • Encryption & MAC creation (802.11 WLAN)

### MIC = Message Integrity Check is same as MAC

### [Figure: CCM on an 802.11 frame. Clear text frame fields: FC, Dur, A1, A2, A3, SC, A4, QC, PC, Data, MIC. CBC-MAC: a chain of AES_E(K) blocks, starting from an IV block (Flag, Nonce, Dlen) and running over Hlen and the 0 padded header and data, produces the MIC. Counter mode: AES_E(K) applied to the counter preloads Pl(0), Pl(1), Pl(2), …, Pl(C) (Flag, Nonce, Cnt) encrypts Data and MIC. The transmitted encrypted frame carries the same fields followed by FCS.]

**Block cipher modes of operation**

| Mode | Sender | Receiver | Pipelining | Notes |
| --- | --- | --- | --- | --- |
| ECB | Enc | Dec | Yes | not used |
| CBC | Enc | Dec | No | |
| CBC-MAC | Enc | Enc | No | Message authentication |
| CFB | Enc | Enc | No | |
| OFB | Enc | Enc | No | |
| CTR | Enc | Enc | Yes | |
| CCM | 2Enc | 2Enc | Only CTR | Privacy & MAC |
| OCB | Enc | Dec | Yes | Privacy & MAC |

### • Conclusion: most practical applications ONLY use encryption. Important for:

### – AES because encryption is more efficient than decryption (non-Feistel)

### – area constraint applications (e.g. IEEE 802.11)

**Block cipher AES**


**AES: Byte substitution**

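### • Byte substitution: each byte individual

### • 16 *identical* Sboxes

### • Area - time trade-off: HW multiplexing

### • 32 for Rijndael

$$\begin{pmatrix} a_0 & a_4 & a_8 & a_{12} \\ a_1 & a_5 & a_9 & a_{13} \\ a_2 & a_6 & a_{10} & a_{14} \\ a_3 & a_7 & a_{11} & a_{15} \end{pmatrix} \rightarrow \begin{pmatrix} b_0 & b_4 & b_8 & b_{12} \\ b_1 & b_5 & b_9 & b_{13} \\ b_2 & b_6 & b_{10} & b_{14} \\ b_3 & b_7 & b_{11} & b_{15} \end{pmatrix}$$

### [Figure: each byte $a_i$ passes through the $GF(2^8)^{-1}$ inversion followed by Permute to give $b_i$]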

**AES: Shiftrow**

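### • Shiftrow: circularly rotate each row of state array

### • Easy wiring

$$\begin{pmatrix} a_0 & a_4 & a_8 & a_{12} \\ a_1 & a_5 & a_9 & a_{13} \\ a_2 & a_6 & a_{10} & a_{14} \\ a_3 & a_7 & a_{11} & a_{15} \end{pmatrix} \xrightarrow{\text{Shiftrow}} \begin{pmatrix} a_0 & a_4 & a_8 & a_{12} \\ a_5 & a_9 & a_{13} & a_1 \\ a_{10} & a_{14} & a_2 & a_6 \\ a_{15} & a_3 & a_7 & a_{11} \end{pmatrix}$$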
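The rotation can be written as a pure index permutation, which is why it costs only wiring in hardware. The sketch below assumes the state is stored as a flat list of 16 bytes in the column-major order a0 … a15 used on the slide (element a[row + 4*col]).

```python
def shift_rows(state: list) -> list:
    """Rotate row r of the 4x4 state left by r positions."""
    out = [0] * 16
    for row in range(4):
        for col in range(4):
            out[row + 4 * col] = state[row + 4 * ((col + row) % 4)]
    return out


def inv_shift_rows(state: list) -> list:
    """Rotate row r of the 4x4 state right by r positions."""
    out = [0] * 16
    for row in range(4):
        for col in range(4):
            out[row + 4 * ((col + row) % 4)] = state[row + 4 * col]
    return out


state = list(range(16))            # a_i = i, so the output shows the indices
shifted = shift_rows(state)
rows = [[shifted[row + 4 * col] for col in range(4)] for row in range(4)]
for r in rows:
    print(r)
```

With a_i = i the printed rows list the source index of every output byte, matching the right-hand matrix above.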


**AES: mix column**

### • matrix multiplication of state array columns

### – multiply with constant entries

$$\begin{pmatrix} b_i \\ b_{i+1} \\ b_{i+2} \\ b_{i+3} \end{pmatrix} = \begin{pmatrix} 2 & 3 & 1 & 1 \\ 1 & 2 & 3 & 1 \\ 1 & 1 & 2 & 3 \\ 3 & 1 & 1 & 2 \end{pmatrix} \begin{pmatrix} a_i \\ a_{i+1} \\ a_{i+2} \\ a_{i+3} \end{pmatrix}$$

### 2 x: $(b_7 \ldots b_0) = (a_6\, a_5\, a_4\, a_3\, a_2\, a_1\, a_0\, 0) + (0\, 0\, 0\, a_7\, a_7\, 0\, a_7\, a_7)$

### 3 x: $(b_7 \ldots b_0) = (a_7\, a_6\, a_5\, a_4\, a_3\, a_2\, a_1\, a_0) + (a_6\, a_5\, a_4\, a_3\, a_2\, a_1\, a_0\, 0) + (0\, 0\, 0\, a_7\, a_7\, 0\, a_7\, a_7)$

**Mix column - encryption**

### [Figure: datapath for one 8-bit input B: "<< 1" with the carry selecting 0 or G(x) = 00011011 gives GF(B x 2); GF(B x 2) + GF(B) gives GF(B x 3); GF(B x 1) is B itself. Four such units for the bytes a, b, c, d are combined with + according to the matrix 02 03 01 01 / 01 02 03 01 / 01 01 02 03 / 03 01 01 02]

**Mix Column Operation is $GF(2^8)$ Linear Transform**
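The shift-and-conditional-add structure above translates directly into a few lines of code. This is a sketch of one column of the encryption MixColumn; the function names are chosen for this example, and the input column used at the end is a commonly quoted test column.

```python
def xtime(b: int) -> int:
    """Multiply by 2 in GF(2^8): shift left, add G(x) = 0x1B on carry."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B  # clears the carry bit and adds 00011011
    return b


def mul3(b: int) -> int:
    """Multiply by 3 in GF(2^8): (2 x b) + b."""
    return xtime(b) ^ b


def mix_column(col: list) -> list:
    """Multiply one state column by the matrix 2 3 1 1 / 1 2 3 1 / 1 1 2 3 / 3 1 1 2."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


column = [0xDB, 0x13, 0x53, 0x45]
print([hex(b) for b in mix_column(column)])
```

Only XORs and one conditional XOR per `xtime` are needed, which is why the encryption MixColumn is cheap in both hardware and software.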


**Key schedule**

### • Unit is 32 bit words, W[i] = 32 bit = 1 column

### • 4 different operations

### – Input key

### – W[i-Nk] ^ W[i-1]

### – W[i-Nk] ^ ByteSub(RotByte(W[i-1])) ^ Rcon[i/Nk]

### – W[i-Nk] ^ ByteSub(W[i-1])

### • One round key is four W[i]’s

### [Figure: word-by-word layout of the round keys, numbered 0 to 10 for a 128b key, 0 to 12 for a 192b key and 0 to 14 for a 256b key, with each word marked by the operation that produces it]

**Key schedule**

### • Encryption key

### – HW: on the fly: round key[i] = function (round key[i-1])

### – SW: pre-compute and store in context (176, 208, 240 Bytes)

### • Decryption key

### – encryption key in reverse order

### – **BUT** need final round words to start

### 128 bit encrypt (from Initial Round Words towards Final Round Words)

### …

### W[16] = A1 = f(D0)^A0

### W[17] = B1 = f(D0)^A0^B0

### W[18] = C1 = f(D0)^A0^B0^C0

### W[19] = D1 = f(D0)^A0^B0^C0^D0

### …

### 128 bit decrypt (from **Final Round Words** back towards Initial Round Words)

### …

### W[15] = D0 = C1^D1

### W[14] = C0 = B1^C1

### W[13] = B0 = A1^B1

### W[12] = A0 = f(C1^D1)^A1

### …
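The forward and backward recurrences can be checked with a short sketch of the 128-bit key schedule running on the fly in both directions. It assumes the standard AES definitions: `f` is ByteSub(RotByte(·)) plus the round constant, and the S-box is generated from the field inversion and affine map rather than typed in as a table. The helper names are chosen for this example; the key is the example key of the AES standard.

```python
def gmul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def build_sbox() -> list:
    """S-box = affine map applied to the GF(2^8) inverse (0 maps to 0)."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def f(word: int, rcon: int) -> int:
    """ByteSub(RotByte(word)) ^ Rcon for one 32-bit word."""
    word = ((word << 8) | (word >> 24)) & 0xFFFFFFFF
    sub = 0
    for shift in (24, 16, 8, 0):
        sub |= SBOX[(word >> shift) & 0xFF] << shift
    return sub ^ (rcon << 24)


def next_round_key(rk: list, rcon: int) -> list:
    """Encryption direction: A1 = f(D0)^A0, B1 = A1^B0, C1 = B1^C0, D1 = C1^D0."""
    a0, b0, c0, d0 = rk
    a1 = f(d0, rcon) ^ a0
    b1 = a1 ^ b0
    c1 = b1 ^ c0
    return [a1, b1, c1, c1 ^ d0]


def prev_round_key(rk: list, rcon: int) -> list:
    """Decryption direction: D0 = C1^D1, C0 = B1^C1, B0 = A1^B1, A0 = f(D0)^A1."""
    a1, b1, c1, d1 = rk
    d0 = c1 ^ d1
    return [f(d0, rcon) ^ a1, a1 ^ b1, b1 ^ c1, d0]


key = [0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C]
rk = key
for rcon in RCON:                  # walk forward to the final round words
    rk = next_round_key(rk, rcon)
final_round_key = rk
for rcon in reversed(RCON):        # walk back to the initial round words
    rk = prev_round_key(rk, rcon)
assert rk == key
```

A decryption core therefore either runs the forward schedule once to reach the final round words before the first block, or is handed those words directly.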


**Combined AES architecture**

### • AES is *not* Feistel

### • Every operation has its inverse

### [Figure: **Encryption**: Input Data -> KeyAdd (K0) -> rounds of ByteSub, ShiftRow, MixColumn, KeyAdd (Ki) -> final round ByteSub, ShiftRow, KeyAdd (KNr) -> Output Data]

### [Figure: **Decryption**: Input Data -> KeyAdd-1 (KNr), ShiftRow-1, ByteSub-1 -> rounds of KeyAdd-1 (Ki), MixColumn-1, ShiftRow-1, ByteSub-1 -> KeyAdd-1 (K0) -> Output Data]

**Decryption datapath**

### Reorganize:

### • Key addition is its own inverse: KeyAdd-1 becomes KeyAdd

### • switch shiftrow and bytesub

### • Flip upside down

### [Figure: resulting **Decryption** datapath, blocks in order: Input Data -> KeyAdd (KNr), ShiftRow-1, ByteSub-1, KeyAdd (Ki), MixColumn-1, ShiftRow-1, ByteSub-1, KeyAdd (K0) -> Output Data]

**Combined architecture**

### [Figure: **Encryption** and **Decryption** datapaths side by side. Encryption, blocks in order: KeyAdd (K0), ByteSub, ShiftRow, MixColumn, KeyAdd (Ki), ByteSub, ShiftRow, KeyAdd (KNr). Decryption, blocks in order: KeyAdd (KNr), ShiftRow-1, ByteSub-1, KeyAdd (Ki), MixColumn-1, ShiftRow-1, ByteSub-1, KeyAdd (K0).]

### • Does not completely follow the Rijndael proposal (which suggests switching KeyAdd and MixColumn), because that requires an InvMixColumn to be applied to the roundkey.

**Combined architecture**

### • Provides encryption & decryption

### • Key addition is its own inverse

### [Figure: **Combined** datapath from Input Data to Output Data: ByteSub, ShiftRow, MixColumn and KeyAdd blocks connected through 0/1 multiplexers controlled by enc, enc·first, first and regSel]

**Sub modules**

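### [Figure: **ByteSub**: In -> perm-1 -> $GF(2^8)^{-1}$ -> perm -> Out, with 0/1 multiplexers controlled by enc selecting perm-1 or perm]

### [Figure: **ShiftRow**: outputs Out[3:0], Out[7:4], Out[11:8], Out[15:12] are wired to In[3:0], In[4,7,6,5] or In[6,5,4,7], In[11:8], In[12,15,14,13] or In[14,13,12,15], with 0/1 multiplexers controlled by enc]

**MixColumn**

$$Outcol = \begin{pmatrix} 2 & 3 & 1 & 1 \\ 1 & 2 & 3 & 1 \\ 1 & 1 & 2 & 3 \\ 3 & 1 & 1 & 2 \end{pmatrix} \cdot Incol + \overline{enc} \cdot \begin{pmatrix} c & 8 & c & 8 \\ 8 & c & 8 & c \\ c & 8 & c & 8 \\ 8 & c & 8 & c \end{pmatrix} \cdot Incol$$

### (entries c and 8 are hexadecimal; the second term is added only for decryption)

**Reason that decryption is slower**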
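The shared MixColumn can be checked numerically: adding the extra matrix (entries 0xC and 0x8) to the encryption matrix gives the inverse MixColumn matrix 0E 0B 0D 09 / 09 0E 0B 0D / 0D 09 0E 0B / 0B 0D 09 0E. The sketch below is a software model of that idea, not the hardware datapath itself; the function names and the `enc` flag are chosen for this example.

```python
def gmul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


MIX = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]
EXTRA = [[0xC, 0x8, 0xC, 0x8], [0x8, 0xC, 0x8, 0xC],
         [0xC, 0x8, 0xC, 0x8], [0x8, 0xC, 0x8, 0xC]]


def mat_vec(matrix: list, col: list) -> list:
    """Matrix-vector product over GF(2^8); addition is XOR."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gmul(coeff, byte)
        out.append(acc)
    return out


def mix_column(col: list, enc: bool) -> list:
    """Encryption path, plus the extra term when decrypting."""
    out = mat_vec(MIX, col)
    if not enc:
        extra = mat_vec(EXTRA, col)
        out = [o ^ e for o, e in zip(out, extra)]
    return out


column = [0xDB, 0x13, 0x53, 0x45]
mixed = mix_column(column, enc=True)
assert mix_column(mixed, enc=False) == column
```

The decryption result goes through both terms, so its critical path is longer than the encryption path, which matches the remark on the slide.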

**Sbox optimization**

### • $GF(2^8)^{-1}$ requires large look-up tables

### • Map to isomorphic fields, $GF((2^4)^2)$ or $GF(((2^2)^2)^2)$, and invert there

### • smaller but slower!

### [Figure: inversion $A^{-1}$ in $GF(2^8)$ computed via $GF(2^4)$: A is mapped from $GF(2^8)$, split into $a_h$ and $a_l$, inverted using $GF(2^4)$ operations, and mapped back to $GF(2^8)$]

**Sbox experiment**

### • 0.18 μm CMOS, Synopsys experiment

### • size of 1 Sbox, push for area or for speed

### [Figure: area (gates, 0 to 2000) versus latency (ns, 0 to 8) for the direct implementation in $GF(2^8)$ and for Wolkerstorfer's $GF((2^4)^2)$]

**Compact SBOX**

### • $GF(((2^2)^2)^2)$ instead of $GF(2^8)$

### • Reduces the gate count to only 280 gates!

### [size depends on the choice of ]

### [Figure: inversion in $GF(2^8)$ decomposed via $GF((2^4)^2)$ into $GF(((2^2)^2)^2)$, built from 4-bit and 8-bit wide blocks with squaring (^2) and inversion (^-1)]

**Ballpark numbers**

### • 1 gate = 2-input NAND gate = 4 transistors

### • Sbox size:

### – $GF(2^8)^{-1}$ = 650 to 700 gates

### – $GF((2^4)^2)^{-1}$ = 400 gates ([Wol02])

### – $GF(((2^2)^2)^2)^{-1}$ = 280 gates ([Sat01][Men05])

### – but 50 to 100% slower

### • AES core encryption only: 20K to 25K gates

### – 128 bit data, 128 bit key

### – key schedule on the fly

### – 1 clock cycle per round

### • AES core for encryption and decryption: 40K gates

### – 128 bit data, 128 bit key

### – precompute and store round keys: 128x11 bits SRAM

### – 1 clock cycle per round

### – **savings in combining logic, losses in multiplexers!**

**Extensions to Rijndael**

### • Original Rijndael submission

### – 128, 192 or 256 bits data (limited to 128bit in AES standard)

### – 128, 192 or 256 bits key (unchanged in AES standard)

### • Tricky part in key schedule: 2 loops in parallel

### [Figure: Rijndael datapath with m-bit data (m = 128, 192 or 256) and k-bit key (k = 128, 192 or 256): plaintext -> keyadd -> N-1 rounds of substitution, shiftrow, mixcolumn, keyadd -> final round of substitution, shiftrow, keyadd. In parallel, the keyschedule iterates the k-bit key for L rounds and delivers an m-bit roundkey.]

**Key schedule for Rijndael**

### [Figure: two timing diagrams showing how the Roundkey is assembled from the Iterated Key per clock cycle using keysched1 and keysched2 steps: one for data=128, key=192 and one for data=192, key=128]

**Key schedule for Rijndael**

### • Key schedule on the fly: one clockcycle per round

### [Figure: block diagram. Key Schedule: seed, keysched1, keysched2, iteratedkey (k bits), roundkey assembly and controller, producing an m-bit roundkey. Datapath: encrypt block taking data and roundkey (m bits), with labels P, C and N.]

### • Rijndael

### • Enc + Dec

### • 0.18um CMOS

### • Standard cells

### • AES, 2nd generation

### • Regular & WDDL based implementation

### • Standard cells


**SW implementations**

### Illustrate with DES

### • Option 1: operation by operation

### • Option 2: table look-up

### • Option 3: bit-slices

**SW: DES- f function**

### [Figure: $R_{i-1}$ (32) -> Expansion E (48) -> + $K_i$ (48) -> S1 S2 S3 S4 S5 S6 S7 S8 -> Permutation P (32) -> $f(R_{i-1}, K_i)$ (32)]

### • Expansion E: 32b-to-48b perm/expansion

### • Wordwise EXOR: efficient

### • Sbox: Look-up Table

### • Combined with permutation P: one Look-Up-Table

**Software SBOX**

Christof Paar, MEAD Course „Cryptographic Engineering“, Sept. 2009

**Bit Slicing: Alternative Data Representation**

### • Introduced by Biham, 1997

### • each register contains 1 bit of, e.g., 32 blocks

### • pipelining (=parallelization) of n encryptions

### • CPU can be viewed as 16/32/64 one-bit parallel processors

### • CPU acts as SIMD (Single-instruction multiple-data) processor

### [Figure: Block 1, Block 2, …, Block n (64 bits each) are transposed into 64 registers of n bits (n=16, 32, 64): the first register holds bit 1 of block 1 … block n, the second holds bit 2 of block 1 … block n, …, the last holds bit 64 of block 1 … block n]

**Encryption with Bit Slicing (1): Permutations**

### [Figure: the registers holding bit 1, bit 2, …, bit 64 of blocks 1 … n appear in a different order after the permutation]

### Bit permutation is realized by re-ordering of registers (in practice: re-ordering of pointers)

**Encryption with Bit Slicing (2): S-Box**

### [Figure: S-box with inputs a, b, c, d, e, f and outputs $O_1$, $O_2$, $O_3$, $O_4$]

### Each output can be expressed as Boolean function (i.e., function on bits)

### $O_1 = f_1(a, b, c, d, e, f)$

### …

### $O_4 = f_4(a, b, c, d, e, f)$

### On average, each S-box requires about 100 gates.

**Encryption with Bit Slicing (2): S-Box**

### [Figure: registers holding bit 1, bit 2, …, bit 48 of blocks 1 … n feed the S-box (inputs a, b, c, d, e, f; outputs $O_1$ … $O_4$)]

### Ex: Reg 8 AND Reg 17 OR Reg 44 NOR …

### Idea: Compute S-box $S_i$ for many blocks in parallel!

### i.e., realize functions

### $O_1 = f_1(a, b, c, d, e, f)$

### …

### $O_4 = f_4(a, b, c, d, e, f)$

### for 64 (or 32 or 16) blocks of data in parallel!
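The data layout and the gate-by-gate evaluation can be sketched with Python integers playing the role of registers. The Boolean function below is a hypothetical three-input example, not a DES S-box output; a real bit-sliced S-box would chain roughly 100 such register operations.

```python
def to_bitsliced(blocks: list, width: int) -> list:
    """slices[j] holds bit j of every block; bit i of slices[j] belongs to block i."""
    slices = [0] * width
    for i, block in enumerate(blocks):
        for j in range(width):
            slices[j] |= ((block >> j) & 1) << i
    return slices


def from_bitsliced(slices: list, n: int) -> list:
    """Inverse transposition: rebuild the n blocks from the bit slices."""
    blocks = [0] * n
    for j, s in enumerate(slices):
        for i in range(n):
            blocks[i] |= ((s >> i) & 1) << j
    return blocks


def toy_output_bit(a: int, b: int, c: int) -> int:
    """Hypothetical Boolean function O = (a AND b) XOR c, one gate per operation."""
    return (a & b) ^ c


blocks = list(range(8))                 # eight 3-bit "blocks"
slices = to_bitsliced(blocks, 3)

# Two register operations evaluate the function for all eight blocks at once.
result = toy_output_bit(slices[0], slices[1], slices[2])

for i, block in enumerate(blocks):
    a, b, c = block & 1, (block >> 1) & 1, (block >> 2) & 1
    assert (result >> i) & 1 == (a & b) ^ c
```

The conversion into and out of the sliced representation is the overhead counted as "conversion of I/O data" in the performance figures; the speed-up comes from every logic instruction working on n blocks simultaneously.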


**DES Bit Slicing: Performance on 64 bit CPU**

### • 8 S-Boxes, 64 blocks parallel: 100 x 8 = 800 instructions

### • total (incl. load/store & conversion of I/O data): 19000 instr. / 64 blocks, i.e. 300 instr. / block, i.e. 4-5 instr. / bit

### • Throughput on 300 MHz Alpha

### – bit sliced: 137 Mbit/sec

### – optimized non-bit sliced: 46 Mbit/sec

### ONLY COMPATIBLE WITH PIPELINE TYPE MODES OF OPERATION!!

### Further reading: [Biham 97]

**Throughput – Energy numbers**

| AES 128bit key, 128bit data | Throughput | Power | Figure of Merit (Gb/s/W) |
| --- | --- | --- | --- |
| 0.18 μm CMOS | 3.84 Gbits/sec | 350 mW | 11 (1/1) |
| FPGA [1] | 1.32 Gbit/sec | 490 mW | 2.7 (1/4) |
| ASM StrongARM [2] | | | |
| Asm Pentium III [3] | 648 Mbits/sec | 41.4 W | 0.015 (1/800) |
| C Emb. Sparc [4] | 133 Kbits/sec | 120 mW | 0.0011 (1/10.000) |
| Java [5] Emb. Sparc | 450 bits/sec | 120 mW | 0.0000037 (1/3.000.000) |

[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator

[2] Dag Arne Osvik: 544 cycles AES – ECB on StrongArm SA-1110

[3] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet

[4] gcc, 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 u CMOS

[5] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 u CMOS
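The figure of merit is throughput divided by power, which is the same as Gbit per Joule. The sketch below recomputes it from the throughput and power values listed for each platform; the pairing of power values with platforms is an assumption of this example, chosen so that the recomputed figures agree with the listed ones. The StrongARM entry is left out because its numbers are not available here.

```python
def figure_of_merit(throughput_bits_per_s: float, power_watts: float) -> float:
    """Gb/s per Watt, i.e. Gbit processed per Joule."""
    return throughput_bits_per_s / 1e9 / power_watts


# (throughput in bits/sec, power in Watt); pairing assumed as explained above.
platforms = {
    "0.18 um CMOS": (3.84e9, 0.350),
    "FPGA": (1.32e9, 0.490),
    "Asm Pentium III": (648e6, 41.4),
    "C Emb. Sparc": (133e3, 0.120),
    "Java Emb. Sparc": (450.0, 0.120),
}

merits = {name: figure_of_merit(t, p) for name, (t, p) in platforms.items()}
best = merits["0.18 um CMOS"]

for name, merit in merits.items():
    print(f"{name:16s} {merit:.7f} Gb/s/W  (1/{best / merit:.0f})")
```

The ratio in the last column shows the energy-efficiency gap: the interpreted Java implementation needs on the order of three million times more energy per bit than the dedicated CMOS core.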
