Designing Optimal Implementations of Linear Layers (Full Version)


Ruoxin Zhao^{1,2}, Baofeng Wu^{1}, Rui Zhang^{1,2} and Qian Zhang^{1}

^{1} State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences
[email protected]

^{2} University of Chinese Academy of Sciences

Abstract. The linear layer is a fundamental primitive in many areas of information technology. For information security, the performance of a linear layer depends on two aspects: diffusion ability and implementation cost, where the latter is usually measured by the number of XORs required to implement it. For many years, linear layers have been implemented by computing the co-ordinates of the output independently. With this method, costs are determined only by the matrices representing the linear layers. However, we note that the implementation cost of a given linear layer depends not only on its matrix but also on the implementation method. So, in this paper, we focus on another implementation method: modifying input vectors to output step by step. This method uses fewer XORs than previous methods do and makes the implementation cost of every linear layer the same as that of its inverse. With the new implementation method, we first clarify the measurement of implementation cost and the optimal implementation procedure of invertible linear layers. Here, “optimal” means using the fewest XORs. Then, to find the optimal implementation procedure of a given invertible linear layer, we construct a graph-theoretical model and reduce the problem to the shortest path problem in graph theory. Next, we construct a new “double-direction” algorithm that uses less storage and makes the search for a shortest path more efficient in a regular graph. However, this algorithm is not practical enough for heavyweight invertible linear layers because of its high space/time complexity. So, we present another algorithm for finding efficient implementations of invertible linear layers. The advantages of this algorithm are its low complexity and high practicality. With this algorithm, we design highly efficient implementations of the linear layers of AES. To handle more general linear layers, we finally construct a practical algorithm for finding efficient implementations of singular linear layers.

Keywords: Linear Layer · XOR Count · Equivalence Relation · Regular Graph · The Shortest Path

1 Introduction


implementation efficiency. In general, these two aspects often contradict. In other words, the higher its diffusion ability is, the lower its implementation efficiency is. So, finding proper tradeoffs is a challenge for designers.

Efficient hardware implementation has drawn more and more attention in recent years. Implementation efficiency on hardware depends on several factors, including running time, area, and power consumption. In most cases, these factors constrain one another. So, “optimal implementation” has different meanings under different circumstances. For example, there are many lightweight devices, such as a variety of sensors, in the Internet of Things (IoT). Consequently, circuit area naturally becomes the primary factor when designing them.

Essentially, linear layers are $\mathbb{F}_q$-linear transformations over $(\mathbb{F}_q)^n$. We only consider $q$ that are powers of 2, since these cover almost all cases in computer science. According to the correspondence between linear transformations and matrices, every $\mathbb{F}_q$-linear layer can be represented by a matrix over $\mathbb{F}_q$, and this representation is unique under a fixed basis of $(\mathbb{F}_q)^n$. Some linear layers are just matrices over $\mathbb{F}_2$, such as the candidates presented in [10, 13, 8]. They are very convenient for implementation. However, there are also many linear layers over extension fields of $\mathbb{F}_2$, such as the candidates presented in [4, 1, 9, 12]. A typical example of this type is the linear layers used in AES. They are $(4\times 4)$ matrices over $\mathbb{F}_{2^8}$. Note that multiplying by a fixed element in $\mathbb{F}_{2^m}$ is actually an $\mathbb{F}_2$-linear transformation over $(\mathbb{F}_2)^m$ and can be represented by an $(m\times m)$ matrix over $\mathbb{F}_2$. So, an $\mathbb{F}_{2^m}$-linear layer over $(\mathbb{F}_{2^m})^n$ is essentially an $\mathbb{F}_2$-linear transformation over $(\mathbb{F}_2)^{mn}$ and can be represented by an $(mn\times mn)$ matrix over $\mathbb{F}_2$. Hence, matrices over $\mathbb{F}_2$ are generic forms for linear layers.
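To make the passage from a field element to a binary matrix concrete, here is a minimal sketch that builds the $(m\times m)$ matrix of multiplication by a fixed element. It assumes $m=8$, the polynomial basis, and the AES reduction polynomial $x^8+x^4+x^3+x+1$ (0x11B); the helper names `gf_mul`, `mult_matrix` and `apply` are hypothetical and chosen only for illustration.

```python
def gf_mul(a, b, m=8, poly=0x11B):
    """Multiply a and b in GF(2^m) = F_2[x]/(poly); 0x11B is the AES polynomial (assumed)."""
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> m:          # degree reached m: reduce modulo poly
            a ^= poly
    return result


def mult_matrix(c, m=8, poly=0x11B):
    """Binary m x m matrix of x -> c*x; column j is the image of the basis element x^j."""
    images = [gf_mul(c, 1 << j, m, poly) for j in range(m)]
    return [[(images[j] >> i) & 1 for j in range(m)] for i in range(m)]


def apply(matrix, x):
    """Matrix-vector product over F_2, with vectors packed as integers (bit i = co-ordinate i)."""
    y = 0
    for i, row in enumerate(matrix):
        bit = 0
        for j, entry in enumerate(row):
            bit ^= entry & (x >> j) & 1
        y |= bit << i
    return y


M2 = mult_matrix(0x02)
hamming_weight = sum(map(sum, M2))
```

Under these assumptions the matrix of the element 0x02 has Hamming weight 11, so computing its eight output co-ordinates independently takes 11 − 8 = 3 XORs.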

Because every linear layer can be represented by a matrix over $\mathbb{F}_2$ and every element in $\mathbb{F}_{2^m}$ is represented by a bit string in computers, the number of bit-operations required for implementing linear layers naturally becomes the measurement of their implementation cost. In this work, we concentrate on how to implement a given linear layer with as few XORs as possible.

1.1 Related Work

As we mentioned above, the implementation efficiency of a linear layer is measured by the number of XORs needed to implement it. For an invertible $n\times n$ matrix $L$ over $\mathbb{F}_2$ and an input column vector $X$, $LX$ is usually implemented by computing the co-ordinates independently. For example, the first co-ordinate of $LX$ is computed as the product of the first row of $L$ and $X$. Consequently, the number of XORs of $L$ is naturally considered to be the difference between the Hamming weight of $L$ and its order $n$. However, if a linear layer $L'$ is an invertible $n\times n$ matrix over $\mathbb{F}_{2^m}$, counting its number of XORs is a little more complicated. In [12], the authors investigated this case in detail. They formally defined the XOR counts of elements in $\mathbb{F}_{2^m}$, which are closely related to the computational pattern over $\mathbb{F}_{2^m}$ or, more specifically, to the basis of $\mathbb{F}_{2^m}$. Then, the XOR count of $L'$ is the sum of the XOR counts of all entries of $L'$ plus $mn(n-1)$. In fact, according to our statement above, $L'$ can alternatively be represented by an $(mn\times mn)$ matrix $L''$ over $\mathbb{F}_2$ once we fix an $\mathbb{F}_2$-basis of $\mathbb{F}_{2^m}$. From this viewpoint, the XOR count of $L'$ defined in [12] is exactly the number of XORs needed to implement the $(mn\times mn)$ matrix $L''$, namely, the difference between the Hamming weight of $L''$ and its order $mn$. Later, some papers ([11, 9, 8]) about linear layers adopted essentially the same measurement of implementation cost as in [12]. Here we point out that all those papers share the same essence: computing the co-ordinates of the outputs of a linear layer independently. Thus, the implementation cost of linear layers in those papers depends only on their matrices over $\mathbb{F}_2$ once bases are fixed.

In recent years, a new notion of implementation cost of linear layers was presented by Jean, Peyrin, and Sim ([7]). Later, Beierle, Kranz, and Leander ([1]) presented a measurement of implementation cost of multiplication with a fixed element in finite field


layers. In brief, their new measurement originally came from an implementation method different from previous ones. And this new one can indeed save XORs in comparison with previous methods.

1.2 Our Contributions

In this work, our goal is to find the optimal implementation for any given linear layer. Here, “optimal” means using the fewest XORs. To attain the goal, we investigate the relation between implementation cost and implementation procedure. From the investigation, we present a generic measurement for the implementation cost of linear layers. After that, we construct a graph-theoretical model and reduce the main problem to the shortest path problem in graph theory. Then, we construct an algorithm that is well suited to the particular shortest path problem related to invertible linear layers. However, this algorithm for finding optimal implementations of linear layers is not practical enough because of its high space/time complexity. So, we construct another practical algorithm with very low space/time complexity for finding efficient implementations of invertible linear layers, although we cannot guarantee that it necessarily gives optimal implementations. To handle more general linear layers, we finally present a highly practical algorithm for finding efficient implementations of singular linear layers.

In Section 3, we first investigate the effect of different implementation procedures on the implementation cost of invertible linear layers. Briefly, we compare two implementation strategies: computing the co-ordinates of output vectors independently, and modifying the input vectors to output step by step (MIOSS). The former is straightforward and has been used for years. However, the latter requires fewer XORs. Therefore, we focus on the latter. From the investigation, we present a reasonable measurement of the implementation cost of an invertible linear layer $L$: the minimum number of additive elementary matrices in a factorization of $L$ of the form $P_1A_1\cdots P_sA_sP_{s+1}$, where every $P_i$ is a permutation matrix and every $A_j$ is an additive elementary matrix. Then, we give some properties that make our measurement more flexible.

In Section 4, we first give an equivalence relation over all invertible linear layers of order $n$ over $\mathbb{F}_2$. Based on this equivalence relation, we construct a graph whose vertices are all the equivalence classes. Then we show that finding an optimal implementation of a given invertible linear layer $L$ is essentially the same as finding a shortest path between the vertex containing $L$ and the vertex containing the identity matrix $I_n$.

In Section 5, our main goal is to solve the shortest path problem in the graph defined in Section 4. Although there already exist “single-direction” methods (Dijkstra’s algorithm, for example) for the shortest path problem, we abandon them and construct a “double-direction” algorithm. That is because the graph under consideration is a regular graph, and our double-direction algorithm uses less storage and makes the search for a shortest path more efficient in a regular graph. With our algorithm, we perform experiments on some linear layers and obtain good results.

Although the algorithm in Section 5 can give us optimal implementations, it is not practical enough for heavyweight invertible linear layers because of its high space/time complexity. To solve this problem, we present another algorithm in Section 6. An important advantage of this algorithm is that its space/time complexity is very low. On the other hand, we have to admit that it does not necessarily give us optimal implementations of linear layers. As an application, we investigate the linear layers of AES with the algorithm and get implementations much more efficient than previous ones.


2 Preliminary

2.1 Notations

In this paper, $\mathbb{F}_q$ or $GF(q)$ denotes the finite field of $q$ elements. $\mathcal{M}_{m\times n}(R)$ denotes the set of all $(m\times n)$ matrices over a ring $R$, $\mathcal{IM}_{n\times n}(R)$ denotes the set of all $(n\times n)$ invertible matrices over $R$, $\mathcal{PM}_{n\times n}(R)$ denotes the set of all $(n\times n)$ permutation matrices over $R$, $\mathcal{EEM}_{n\times n}(R)$ denotes the set of all $(n\times n)$ exchanging elementary matrices over $R$, $\mathcal{MEM}_{n\times n}(R)$ denotes the set of all $(n\times n)$ multiplicative elementary matrices over $R$, and $\mathcal{AEM}_{n\times n}(R)$ denotes the set of all $(n\times n)$ additive elementary matrices over $R$. For a matrix $A$, $A^T$ denotes the transpose of $A$ and $W_H(A)$ denotes the Hamming weight (the number of nonzero entries) of $A$. $E_{i,j}$ denotes the matrix whose $(i,j)$-th entry is 1 and whose other entries are 0. $I_n$ denotes the $(n\times n)$ identity matrix. For a set $S$, $\#S$ or $|S|$ denotes the cardinality of $S$.

2.2 Some Basic Facts about Linear Algebra

In this subsection, we review some basic facts about linear algebra. For more details, refer to [6].

To begin with, let us discuss elementary operations on matrices. For every matrix $L\in\mathcal{M}_{m\times n}(F)$, where $F$ is a field, there are three types of elementary row operations on it: row exchanging, row multiplication, and row addition. Row exchanging means exchanging two rows of $L$. Row multiplication means multiplying every entry of a row by an invertible element of $F$. Row addition means adding a scalar multiple of one row, with the scalar taken from $F$, to another row. In addition to elementary row operations, there are also three corresponding types of elementary column operations. All the elementary operations on matrices are invertible.

Corresponding to elementary operations, there are three types of elementary matrices: exchanging elementary matrices, multiplicative elementary matrices, and additive elementary matrices. An $(n\times n)$ exchanging elementary matrix is the matrix obtained by exchanging two rows of $I_n$. An $(n\times n)$ multiplicative elementary matrix is the matrix obtained from $I_n$ by substituting an invertible element of $F$ for a 1 on the diagonal of $I_n$. An $(n\times n)$ additive elementary matrix $A_{i,j}$ is the matrix $I_n+aE_{i,j}$, where $a\in F$ and $i\neq j$. For any matrix $L\in\mathcal{M}_{m\times n}(F)$, left-multiplying $L$ by an $(m\times m)$ elementary matrix applies the corresponding elementary row operation to it, while right-multiplying $L$ by an $(n\times n)$ elementary matrix applies the corresponding elementary column operation to it. A useful fact is that every elementary matrix is invertible. The inverse of an exchanging elementary matrix is just itself, and the inverse of an additive elementary matrix $A_{i,j}=I_n+aE_{i,j}$ is $I_n-aE_{i,j}$.

Another type of matrix important for this paper is the permutation matrix. An $n\times n$ permutation matrix is a matrix obtained from $I_n$ by permuting its rows. It is trivial that a matrix $P\in\mathcal{M}_{n\times n}(F)$ is a permutation matrix if and only if it is invertible and all its entries are zero except $n$ entries equal to 1. For any matrix $L\in\mathcal{M}_{m\times n}(F)$, left-multiplying $L$ by an $(m\times m)$ permutation matrix permutes its rows, while right-multiplying $L$ by an $(n\times n)$ permutation matrix permutes its columns. In this paper, we sometimes write a permutation matrix $P\in\mathcal{M}_{n\times n}(F)$ as a column $(\rho(1),\cdots,\rho(n))^T$, where $\rho$ is a permutation over the set $N=\{1,\cdots,n\}$ and $\rho(i)$ is the column index of the nonzero entry in the $i$-th row of $P$ for $i=1,\cdots,n$. For example, we write the permutation matrix

$$\begin{pmatrix} 0&0&1&0\\ 0&0&0&1\\ 0&1&0&0\\ 1&0&0&0 \end{pmatrix} = \begin{pmatrix} 3\\ 4\\ 2\\ 1 \end{pmatrix} = \begin{pmatrix} \rho(1)\\ \rho(2)\\ \rho(3)\\ \rho(4) \end{pmatrix}.$$

With this notation, if $P=(\rho(1),\cdots,\rho(n))^T$, then $P^{-1}=(\rho^{-1}(1),\cdots,\rho^{-1}(n))^T$ is also a permutation matrix, and the $i$-th row of $PL$ is the $\rho(i)$-th row of $L$.

2.3 Some Basic Facts about Graph Theory

In this subsection, we review some basic facts about graph theory. For more details, refer to [2,5].

A graph $G$ is composed of two types of objects. It has a set $V$ of elements called vertices (or nodes) and a set of unordered pairs of vertices called edges. We denote the graph whose vertex set is $V$ and whose edge set is $E$ by $G=(V,E)$. The cardinality of the vertex set $V$ is called the order of the graph $G$. If $\alpha=\{x,y\}$ is an edge of $G$, we say that $\alpha$ joins $x$ and $y$, that $x$ and $y$ are adjacent, and that $x$ and $\alpha$ are incident. The degree of a vertex $x$ is the number of edges that are incident with $x$ and is denoted by $\deg(x)$. If the degree of every vertex of the graph $G$ is $r$, we say $G$ is an $r$-regular graph.

In a graph $G=(V,E)$, a sequence of $m$ edges of the form

$$\{x_0,x_1\},\{x_1,x_2\},\cdots,\{x_{m-1},x_m\}$$

is called a walk of length $m$. We also denote it by $x_0-x_1-\cdots-x_m$. The walk $x_0-x_1-\cdots-x_m$ is closed (open) if $x_0=x_m$ ($x_0\neq x_m$). If a walk has distinct edges, we call it a trail. In addition, if a trail has distinct vertices, we call it a path. A closed path is called a cycle. The distance between two vertices $x$ and $y$ is the shortest length of the walks joining them and is denoted by $d(x,y)$. It is clear that a walk of length $d(x,y)$ joining $x$ and $y$ is a path. We say two vertices $x$ and $y$ are connected if there exists a walk joining them, and we say a graph $G$ is connected if every pair of vertices of $G$ is connected.

2.4 Linear Layers

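In general, an $F$-linear layer is an $F$-linear transformation over $F^n$ for a field $F$. In this paper, we merely focus on fields of characteristic 2 because they are widely used in computer science and telecommunication.

Suppose $L$ is an $\mathbb{F}_q$-linear layer over $(\mathbb{F}_q)^n$, where $q=2^m$. According to the correspondence between linear transformations and matrices, $L$ can be uniquely represented by a matrix in $\mathcal{M}_{n\times n}(\mathbb{F}_q)$ under a given $\mathbb{F}_q$-basis of $(\mathbb{F}_q)^n$. Let

$$L=\begin{pmatrix} \alpha_{1,1} & \alpha_{1,2} & \cdots & \alpha_{1,n}\\ \alpha_{2,1} & \alpha_{2,2} & \cdots & \alpha_{2,n}\\ \cdots & \cdots & \cdots & \cdots\\ \alpha_{n,1} & \alpha_{n,2} & \cdots & \alpha_{n,n} \end{pmatrix}\in\mathcal{M}_{n\times n}(\mathbb{F}_q).$$

Then for every input column $X=(x_1,x_2,\cdots,x_n)^T\in(\mathbb{F}_q)^n$, the output is $Y=LX=(y_1,y_2,\cdots,y_n)^T\in(\mathbb{F}_q)^n$, where $y_i=\sum_{j=1}^{n}\alpha_{i,j}x_j$ for $i=1,\cdots,n$. Note that multiplication by a fixed element in $\mathbb{F}_q$ is an $\mathbb{F}_2$-linear transformation over $\mathbb{F}_q$ and can be uniquely represented by a matrix in $\mathcal{M}_{m\times m}(\mathbb{F}_2)$ under a given $\mathbb{F}_2$-basis of $\mathbb{F}_q$. Therefore, $L$ is essentially an $\mathbb{F}_2$-linear transformation over $\mathbb{F}_2^{mn}$ and can be represented by a matrix

$$L=\begin{pmatrix} L_{1,1} & L_{1,2} & \cdots & L_{1,n}\\ L_{2,1} & L_{2,2} & \cdots & L_{2,n}\\ \cdots & \cdots & \cdots & \cdots\\ L_{n,1} & L_{n,2} & \cdots & L_{n,n} \end{pmatrix}\in\mathcal{M}_{mn\times mn}(\mathbb{F}_2),$$

where $L_{i,j}\in\mathcal{M}_{m\times m}(\mathbb{F}_2)$ for $i,j=1,\cdots,n$. From this viewpoint, every $\mathbb{F}_q$-linear layer is essentially an $\mathbb{F}_2$-linear transformation and can be represented by a matrix over $\mathbb{F}_2$. That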


3 Measurement of Implementation Costs of Linear Layers

In this section, we clarify how to measure the implementation costs of linear layers. As we mentioned above, considering invertible linear layers over $GF(2)$ suffices.

In general, the implementation cost of a given linear layer $L$ over $GF(2)$ is measured by the number of additions (XORs) required to compute the output vector $LX$ for an input $X$. In [12], the authors formally presented a measurement of the number of XORs (XOR-count) of elements in $GF(2^m)$ as well as of whole diffusion layers. Later, some papers ([11, 9, 8]) adopted that measurement. We know that multiplication by a given element in $GF(2^m)$ can be represented by an $\mathbb{F}_2$-linear transformation over $\mathbb{F}_2^m$, namely, a matrix $L$ in $\mathcal{M}_{m\times m}(\mathbb{F}_2)$. From this viewpoint, their XOR-count is exactly $W_H(L)-m$, since the co-ordinates of the output vector are computed independently. In fact, this notion has been used for years. However, the implementation cost of a given linear layer depends not only on the linear layer itself but also on the way in which we implement it. To show the effect of different implementation methods on implementation costs, we first give Example 1.

Example 1. Suppose

$$L=\begin{pmatrix} 1&1&1&0\\ 0&1&0&0\\ 1&1&1&1\\ 1&1&0&0 \end{pmatrix}\in\mathcal{IM}_{4\times 4}(\mathbb{F}_2)$$

is a linear transformation over $\mathbb{F}_2^4$. Then for every input vector $X=(x_1,x_2,x_3,x_4)^T\in GF(2)^4$, the corresponding output is

$$Y=LX=\begin{pmatrix} x_1+x_2+x_3\\ x_2\\ x_1+x_2+x_3+x_4\\ x_1+x_2 \end{pmatrix}.$$

We may compute $Y$ with a common method that has been used for years. That is, we compute the co-ordinates of $Y$ independently. In this case, it requires $W_H(L)-4=6$ additions (XORs) over $GF(2)$. However, we can adopt another method to compute $Y$: modifying the input $X$ to the output $LX$ step by step, as in Figure 1.

Figure 1: Modifying input to output step by step (MIOSS)

In Figure 1, we compute $LX$ through a series of operations on the co-ordinates of $X$. These operations can be easily implemented on programmable hardware (ASIC or FPGA, for instance). Step (1) adds $x_2$ to $x_1$ and leaves $x_2,x_3,x_4$ unchanged. So, its delay is $T_X$. Likewise, step (2) and step (3) each cost $T_X$ delay. Note that step (2) requires only $T_X$ delay but not $2T_X$

Figure 2: Implementing Example 1 by MIOSS in one clock cycle

register. Step (4) does not cost any delay because it is performed merely by twisting wires. Although we illustrate the procedure in four steps, it can actually be implemented in one clock cycle, as in Figure 2. Whether it takes 4 clock cycles or 1 clock cycle, the whole procedure of the second method uses 3 XORs, far fewer than the 6 XORs of the first method. In matrix form, we can express the second method as $LX=PA_{4,3}A_{3,1}A_{1,2}X$, where each $A_{i,j}=I_4+E_{i,j}$ and $P=(\rho(1),\rho(2),\rho(3),\rho(4))^T=(3,2,4,1)^T$.
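The two methods of Example 1 can be compared with a short sketch. It assumes that the three XOR steps correspond to $A_{1,2}$, $A_{3,1}$, $A_{4,3}$ in the order given by the factorization above; the function names `coordinatewise` and `mioss` are hypothetical.

```python
from itertools import product

L = [[1, 1, 1, 0],
     [0, 1, 0, 0],
     [1, 1, 1, 1],
     [1, 1, 0, 0]]


def coordinatewise(matrix, x):
    """Compute each output co-ordinate independently; return (output, XORs used)."""
    y, xors = [], 0
    for row in matrix:
        terms = [xi for entry, xi in zip(row, x) if entry]
        acc = terms[0]
        for t in terms[1:]:
            acc ^= t
            xors += 1
        y.append(acc)
    return y, xors


def mioss(x):
    """Modify the input in place with three XORs, then rewire; return (output, XORs used)."""
    x = list(x)
    x[0] ^= x[1]            # A_{1,2}: x1 <- x1 + x2
    x[2] ^= x[0]            # A_{3,1}: x3 <- x3 + x1
    x[3] ^= x[2]            # A_{4,3}: x4 <- x4 + x3
    rho = (3, 2, 4, 1)      # P: output co-ordinate i is wire rho(i), no XOR needed
    return [x[r - 1] for r in rho], 3


# Evaluate both methods on all 16 inputs.
results = [(coordinatewise(L, x), mioss(x)) for x in product((0, 1), repeat=4)]
```

Both functions return the same output on every input, with XOR counts of 6 and 3 respectively.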

Computing every co-ordinate independently is indeed easy to implement and has been used for years. However, according to Example 1, it is not necessarily the optimal implementation. Here, “optimal” means “using the fewest XORs”. To find the optimal implementations of linear layers, we must take implementation procedures into account. Let us first give a lemma.

Lemma 1. Suppose $L\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$. Then there exists a series of matrices $P_i\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$ and $A_j\in\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$ such that $L=P_1A_1P_2A_2\cdots P_sA_sP_{s+1}$.

Proof. As we know, $L$ can be transformed to its equivalent standard form through a series of elementary row/column operations. The equivalent standard form of $L$ is $I_n\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$ because $L$ is invertible. By the invertibility of elementary operations, $I_n$ can be transformed to $L$ through a series of elementary row/column operations. So, $L$ is equal to the product of a series of elementary matrices. Note that the only multiplicative elementary matrix over $GF(2)$ is $I_n$, which can be omitted from the product. Besides, a product of exchanging elementary matrices of order $n$ is a permutation matrix. Thus, $L$ can be written in the form $P_1A_1P_2A_2\cdots P_sA_sP_{s+1}$, where $P_i\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$ for $i=1,\cdots,s+1$ and $A_j\in\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$ for $j=1,\cdots,s$.
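A factorization of the kind guaranteed by Lemma 1 can be obtained by plain Gauss-Jordan elimination. The sketch below is only an illustration of existence, not a method from this paper and not an optimal one: it reduces $L$ to a permutation matrix $P$ by row additions, which gives $L=A_1\cdots A_tP$. The names `factor_by_elimination` and `rebuild` are hypothetical, and indices are 0-based.

```python
def factor_by_elimination(matrix):
    """Reduce an invertible 0/1 matrix to a permutation matrix by row additions.

    Returns (ops, perm): ops is a list of 0-based pairs (r, s) meaning
    "row r += row s", applied in order; perm is the resulting permutation matrix.
    """
    n = len(matrix)
    m = [row[:] for row in matrix]
    ops, used = [], set()
    for col in range(n):
        # An unused pivot row always exists when the matrix is invertible.
        pivot = next(r for r in range(n) if r not in used and m[r][col])
        used.add(pivot)
        for r in range(n):
            if r != pivot and m[r][col]:
                m[r] = [a ^ b for a, b in zip(m[r], m[pivot])]
                ops.append((r, pivot))
    return ops, m


def rebuild(ops, perm):
    """Undo the elimination: L = A_1 ... A_t P, where A_k is the k-th row addition."""
    m = [row[:] for row in perm]
    for r, s in reversed(ops):
        m[r] = [a ^ b for a, b in zip(m[r], m[s])]
    return m


L = [[1, 1, 1, 0], [0, 1, 0, 0], [1, 1, 1, 1], [1, 1, 0, 0]]
ops, perm = factor_by_elimination(L)
```

For the matrix of Example 1 this sketch uses four row additions, one more than the three XORs of the MIOSS procedure, which illustrates that elimination alone need not give the fewest XORs.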

In accordance with Lemma 1, for every linear layer $L\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$ and input vector $X=(x_1,\cdots,x_n)^T\in\mathbb{F}_2^n$, the output vector is $LX=P_1A_1P_2A_2\cdots P_sA_sP_{s+1}X$, which means we can get the output vector through a series of co-ordinate permutations and additive elementary operations. Conversely, every modification procedure from $X$ to $LX$ can be expressed in the form $P_1A_1P_2A_2\cdots P_sA_sP_{s+1}X$, as in Example 1. Note that in the implementation of $P_1A_1P_2A_2\cdots P_sA_sP_{s+1}X$, left-multiplying by any $P_i$ costs 0 XORs while left-multiplying by any $A_j$ costs 1 XOR. In detail, for a column vector $X=(x_1,\cdots,x_n)^T$ in $\mathbb{F}_2^n$ and a permutation matrix $P=(\rho(1),\cdots,\rho(n))^T$, where $\rho(i)$ denotes the column index of the nonzero entry in the $i$-th row of $P$,

$$PX=(x_{\rho(1)},x_{\rho(2)},\cdots,x_{\rho(n)})^T.$$

For a column vector $X=(x_1,\cdots,x_n)^T$ in $\mathbb{F}_2^n$ and an additive elementary matrix $A_{i,j}=E_{i,j}+I_n$ in $\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$, left-multiplying $X$ by $A_{i,j}$ adds the $j$-th row of $X$ to its $i$-th row. Therefore, the factorization $P_1A_1P_2A_2\cdots P_sA_sP_{s+1}$ of $L$ in Lemma 1 indicates the number of XORs used for its implementation. Now we can give the following definition.


written in the form $P_1A_1P_2A_2\cdots P_sA_sP_{s+1}$, where every $P_i\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$ and every $A_j\in\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$. And we call an implementation procedure $LX=P_1A_1P_2A_2\cdots P_rA_rP_{r+1}X$ containing $\operatorname{Min-XOR-Count}(L)$ additive elementary matrices a minimum-XOR-implementation of $L$.

In addition to Definition1, we have other ways to describe the optimal implementation procedures of linear layers. To clarify those ways, we need the following lemma.

Lemma 2. Suppose $P=(\rho(1),\cdots,\rho(n))^T\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$ and $A_{r,s}=I_n+E_{r,s}\in\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$. Then $PA_{r,s}=A_{\rho^{-1}(r),\rho^{-1}(s)}P$, where $A_{\rho^{-1}(r),\rho^{-1}(s)}=I_n+E_{\rho^{-1}(r),\rho^{-1}(s)}\in\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$.

Proof. As we mentioned before, left-multiplying by $P$ permutes the rows of $A_{r,s}$. In detail, the $i$-th row of $PA_{r,s}$ is the $\rho(i)$-th row of $A_{r,s}$. So, the $\rho^{-1}(r)$-th row of $PA_{r,s}$ is the $r$-th row of $A_{r,s}$, and the $\rho^{-1}(s)$-th row of $PA_{r,s}$ is the $s$-th row of $A_{r,s}$. Obviously, $PA_{r,s}$ becomes $P$ if we add its $\rho^{-1}(s)$-th row to its $\rho^{-1}(r)$-th row. Thus, $A_{\rho^{-1}(r),\rho^{-1}(s)}PA_{r,s}=P$. Then we get $PA_{r,s}=(A_{\rho^{-1}(r),\rho^{-1}(s)})^{-1}P=A_{\rho^{-1}(r),\rho^{-1}(s)}P$, since $(A_{\rho^{-1}(r),\rho^{-1}(s)})^{-1}=A_{\rho^{-1}(r),\rho^{-1}(s)}$.
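The commutation rule of Lemma 2 can be checked exhaustively for small $n$. The following sketch does so with explicit 0/1 matrices; the helper names `perm_matrix`, `additive`, `matmul` and `lemma2_holds` are hypothetical, and indices are 1-based as in the text.

```python
from itertools import permutations


def perm_matrix(rho):
    """Permutation matrix in the column notation: row i has its 1 in column rho(i)."""
    n = len(rho)
    return [[int(rho[i] == j + 1) for j in range(n)] for i in range(n)]


def additive(n, r, s):
    """A_{r,s} = I_n + E_{r,s} over F_2, with 1-based indices r != s."""
    return [[int(i == j or (i + 1 == r and j + 1 == s)) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Matrix product over F_2."""
    return [[sum(x & y for x, y in zip(row, col)) % 2 for col in zip(*b)]
            for row in a]


def lemma2_holds(n):
    """Check P A_{r,s} = A_{rho^-1(r), rho^-1(s)} P for every P and every r != s."""
    for rho in permutations(range(1, n + 1)):
        inv = {image: i + 1 for i, image in enumerate(rho)}   # rho^{-1}
        p = perm_matrix(rho)
        for r in range(1, n + 1):
            for s in range(1, n + 1):
                if r == s:
                    continue
                left = matmul(p, additive(n, r, s))
                right = matmul(additive(n, inv[r], inv[s]), p)
                if left != right:
                    return False
    return True
```

Calling `lemma2_holds(4)` runs through all 24 permutation matrices and all 12 additive elementary matrices of order 4.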

Combining Lemma1 and Lemma2, we get the next lemma easily. We omit its proof because it is really trivial.

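Lemma 3. Every $L\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$ can be factorized in the form $L=\prod_{i=1}^{s}B_i$, where every $B_i$ is either in $\mathcal{PM}_{n\times n}(\mathbb{F}_2)$ or in $\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$.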

We call every factorization of $L$ of the form in Lemma 3 a P-AE factorization of it. Obviously, the factorization in Lemma 1 is a P-AE factorization. Conversely, every P-AE factorization can be written in the form of Lemma 1, since the identity matrix is also a permutation matrix and the product of permutation matrices in $\mathcal{PM}_{n\times n}(\mathbb{F}_2)$ is a permutation matrix too. Therefore, the minimum XOR count of a given linear layer $L\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$ is exactly equal to the minimum number of additive elementary matrices in the P-AE factorizations of $L$. To summarize this paragraph, we present the following theorem.

Theorem 1. For every linear layer $L\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$, $\operatorname{Min-XOR-Count}(L)$ is equal to the minimum number of additive elementary matrices in the P-AE factorizations of $L$.

In [1], the authors measured the lowest implementation cost (called XOR count in [1]) of a linear layer $L\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$ by the minimum number $t$ such that $L$ can be written as $L=P\prod_{i=1}^{t}A_i$, where $P\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$ and every $A_i\in\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$. It is easy to see that, by Lemma 2, every factorization of the form $L=P\prod_{i=1}^{t}A_i$ corresponds to a factorization of the form in Definition 1. So, our definition of Min-XOR-Count does not contradict theirs. We shall indicate the advantage of our definition later.

Because a linear layer is often used bidirectionally (for example, in encryption and decryption), we also need to consider its inverse.

Theorem 2. For every linear layer $L\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$, $\operatorname{Min-XOR-Count}(L^{-1})$ is equal to $\operatorname{Min-XOR-Count}(L)$. Moreover, if a minimum-XOR-implementation of $L$ is $P_1A_1P_2A_2\cdots P_rA_rP_{r+1}$, then $P_{r+1}^{-1}A_rP_r^{-1}\cdots A_1P_1^{-1}$ is a minimum-XOR-implementation of $L^{-1}$.

Proof. Suppose a factorization $L=Q_1B_1\cdots Q_sB_sQ_{s+1}$ is a minimum-XOR-implementation of $L$, where each $Q_i\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$ and each $B_j\in\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$. Then we have a factorization $L^{-1}=Q_{s+1}^{-1}B_s^{-1}Q_s^{-1}\cdots B_1^{-1}Q_1^{-1}$ of $L^{-1}$. Note that the inverse of a permutation matrix is a permutation matrix too, and the inverse of each additive elementary matrix $B_j$ is just $B_j$ itself. Thus,

$$\operatorname{Min-XOR-Count}(L^{-1})\le\operatorname{Min-XOR-Count}(L).$$

Likewise, we can get

$$\operatorname{Min-XOR-Count}(L^{-1})\ge\operatorname{Min-XOR-Count}(L).$$

So, $\operatorname{Min-XOR-Count}(L^{-1})$ is equal to $\operatorname{Min-XOR-Count}(L)$. The second part of the theorem is trivial.
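The second part of Theorem 2 says that an implementation of $L^{-1}$ is obtained by reading an implementation of $L$ backwards. A minimal sketch follows, using the MIOSS procedure of Example 1 as the forward program; the step encoding (`"xor"`, `"perm"`) and the names `run` and `invert` are hypothetical.

```python
from itertools import product


def run(program, x):
    """Execute steps on a vector: ("xor", r, s) means x_r ^= x_s, ("perm", rho) rewires."""
    x = list(x)
    for step in program:
        if step[0] == "xor":
            _, r, s = step
            x[r - 1] ^= x[s - 1]
        else:
            rho = step[1]
            x = [x[k - 1] for k in rho]      # new co-ordinate i is old co-ordinate rho(i)
    return x


def invert(program):
    """Reverse the steps; XOR steps are self-inverse, permutations are inverted."""
    inverse = []
    for step in reversed(program):
        if step[0] == "xor":
            inverse.append(step)
        else:
            rho = step[1]
            rho_inv = tuple(rho.index(k) + 1 for k in range(1, len(rho) + 1))
            inverse.append(("perm", rho_inv))
    return inverse


# L X = P A_{4,3} A_{3,1} A_{1,2} X with P = (3, 2, 4, 1)^T, rightmost factor first.
forward = [("xor", 1, 2), ("xor", 3, 1), ("xor", 4, 3), ("perm", (3, 2, 4, 1))]
backward = invert(forward)
```

The backward program contains exactly as many XOR steps as the forward one, and running it after the forward program restores every input.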

At the end of this section, we point out that if a linear layer $L'$ is a power of another linear layer $L$, for example $L'=L^5$, then a minimum-XOR-implementation of $L'$ is at least as good as five successive minimum-XOR-implementations of $L$. That is because five successive minimum-XOR-implementations of $L$ merely form one implementation procedure of $L'$, which cannot perform better than a minimum-XOR-implementation of $L'$. Consequently, if $L'=L^r$, then $\operatorname{Min-XOR-Count}(L')\le r\cdot\operatorname{Min-XOR-Count}(L)$.

4 A Graph-Theoretical Model

After clarifying the measurement of the lowest implementation cost of linear layers, we are confronted by two problems:

1. How to calculate the Min-XOR-Count of a given linear layer $L\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$;

2. How to get a minimum-XOR-implementation of a given linear layer $L\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$.

Obviously, the first problem becomes trivial once we solve the second one. In this section, we handle the two problems with a graph-theoretical model.

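First of all, we introduce a relation over $\mathcal{IM}_{n\times n}(\mathbb{F}_2)$. We say two matrices $B,C\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$ are row-permutation-equivalent (denoted by $B\sim_{RP}C$) if there exists a $Q\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$ such that $B=QC$. For every $B\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$, $B\sim_{RP}B$ since $B=I_nB$. If $B\sim_{RP}C$, then $C\sim_{RP}B$ because $C=Q^{-1}B$ and $Q^{-1}\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$ too. If there exist $Q_1,Q_2\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$ such that $B=Q_1C$ and $C=Q_2D$, then $B=Q_1Q_2D$ and $Q_1Q_2\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$. Therefore, “$\sim_{RP}$” is an equivalence relation over $\mathcal{IM}_{n\times n}(\mathbb{F}_2)$. For every $B\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$, let $[B]_{RP}$ denote the equivalence class containing $B$ under “$\sim_{RP}$”. It is easy to see that for every $B\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$, $[B]_{RP}=\{QB\mid Q\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)\}$, and $Q_1B\neq Q_2B$ if $Q_1\neq Q_2$.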

Next, we define a graph that depends on row-permutation-equivalence over $\mathcal{IM}_{n\times n}(\mathbb{F}_2)$.

Definition 2. For a positive integer $n$, let $G(n)=(V,E)$ be a graph whose vertex set consists of all the equivalence classes under the row-permutation-equivalence relation over $\mathcal{IM}_{n\times n}(\mathbb{F}_2)$. Two vertices $[B]_{RP}$ and $[C]_{RP}$ are adjacent if there exist $B'\in[B]_{RP}$, $C'\in[C]_{RP}$ and $A\in\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$ such that $B'=AC'$.

Now let us show some useful properties of the graph in Definition2.

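Theorem 3. Let $G(n)=(V,E)$ be the graph described in Definition 2 and $[B]_{RP}$ be a vertex of $G(n)$. Then $G(n)$ is an $(n^2-n)$-regular graph, $[A_1B]_{RP}\neq[A_2B]_{RP}$ for distinct $A_1,A_2\in\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$, and $[AB]_{RP}$ runs over all the vertices adjacent to $[B]_{RP}$ when $A$ runs over $\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$.

Proof. If $[A_1B]_{RP}=[A_2B]_{RP}$, then $A_1B=PA_2B$ for some $P\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$. Consequently, we get $A_1=PA_2$ by right-multiplying both sides of $A_1B=PA_2B$ by $B^{-1}$. We assert that $P$ must be $I_n$. Otherwise, some entries on the diagonal of $PA_2$ would be 0 and $PA_2$ could not be equal to $A_1$. Then $A_1=A_2$. So, $[A_1B]_{RP}\neq[A_2B]_{RP}$ for distinct $A_1,A_2\in\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$.

On the other hand, if a vertex $[C]_{RP}$ is adjacent to $[B]_{RP}$, there exist $B'\in[B]_{RP}$, $C'\in[C]_{RP}$ and $A\in\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$ such that $C'=AB'$, where $B'=P_1B$ for some $P_1\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$. Consequently, there exist $P_2\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$ and $A'\in\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$ such that $C'=AB'=AP_1B=P_2A'B$ according to Lemma 2. So, $[C]_{RP}=[C']_{RP}=[P_2A'B]_{RP}=[A'B]_{RP}$. That shows $[AB]_{RP}$ runs over all the vertices adjacent to $[B]_{RP}$ when $A$ runs over $\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$.

Summarizing the two preceding paragraphs, the degree of every vertex of $G(n)$ is the cardinality of $\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$, namely, $n^2-n$.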
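For small $n$ the regularity of $G(n)$ can be confirmed by enumeration. The sketch below assumes one particular encoding that is not from the paper: a vertex is stored as the set of rows of any representative (each row packed into an integer), which is unchanged by row permutations. The names `neighbours` and `all_vertices` are hypothetical.

```python
def neighbours(rows):
    """Vertices adjacent to the class whose representatives have this set of rows.

    One edge replaces a row v by v ^ u for another row u, i.e. a single row addition.
    """
    return {(rows - {v}) | {v ^ u} for v in rows for u in rows if u != v}


def all_vertices(n):
    """Collect every vertex of G(n) by breadth-first search from the class of I_n."""
    start = frozenset(1 << k for k in range(n))
    seen, frontier = {start}, [start]
    while frontier:
        following = []
        for vertex in frontier:
            for other in neighbours(vertex):
                if other not in seen:
                    seen.add(other)
                    following.append(other)
        frontier = following
    return seen


# For each n, the set of vertex degrees that occur in G(n).
degrees = {n: {len(neighbours(v)) for v in all_vertices(n)} for n in (2, 3, 4)}
```

The search starts from $[I_n]_{RP}$ and reaches every class because, by Lemma 1, every invertible matrix is a product of permutation and additive elementary matrices.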

Finally, we present a significant theorem that explains why we set up the graph-theoretical model.

Theorem 4. Let $G(n)=(V,E)$ be the graph described in Definition 2 and $L\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$. Then the minimum XOR-count of $L$ is equal to the distance between the vertices $[L]_{RP}$ and $[I_n]_{RP}$ in $G(n)$.

Proof. For every factorization of $L$ of the form $P_1A_1P_2A_2\cdots P_sA_sP_{s+1}$, where every $P_i\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$ and every $A_j\in\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$, there exists a path

$$[L]_{RP}=[P_1A_1P_2A_2\cdots P_sA_sP_{s+1}]_{RP}-[P_2A_2\cdots P_sA_sP_{s+1}]_{RP}-\cdots-[P_sA_sP_{s+1}]_{RP}-[P_{s+1}]_{RP}=[I_n]_{RP}$$

between $[L]_{RP}$ and $[I_n]_{RP}$ of length $s$.

On the other hand, for every path

$$[L]_{RP}-[L_{r-1}]_{RP}-\cdots-[L_2]_{RP}-[L_1]_{RP}-[I_n]_{RP}$$

of length $r$, according to Theorem 3, there must be $L=Q_{r-1}A_{r-1}L_{r-1}$, $L_j=Q_{j-1}A_{j-1}L_{j-1}$ for $j=2,\cdots,r-1$, and $L_1=Q_0A_0I_n$ for some $Q_i\in\mathcal{PM}_{n\times n}(\mathbb{F}_2)$ and $A_i\in\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$, $i=0,1,\cdots,r-1$. Consequently, $L=Q_{r-1}A_{r-1}\cdots Q_1A_1Q_0A_0I_n$, which is a factorization of $L$ of the form in Lemma 1.

Therefore, the minimum XOR-count of $L$ is equal to the minimum length of paths between $[L]_{RP}$ and $[I_n]_{RP}$, namely, their distance.

From the proof of Theorem 4, we can easily see that a shortest path between $[L]_{RP}$ and $[I_n]_{RP}$ indicates a minimum-XOR-implementation of $L$.

5 Searching for Optimal Implementations of Invertible Linear Layers

In this section, we discuss how to get a minimum-XOR-implementation of a given linear layer $L\in\mathcal{IM}_{n\times n}(\mathbb{F}_2)$. In accordance with the preceding section, this question is equivalent to finding a shortest path between $[L]_{RP}$ and $[I_n]_{RP}$.

Let us first sketch our strategy. Suppose $G=(V,E)$ is a connected graph and $a,b\in V$. To find a shortest path between $a$ and $b$, we let $\mathcal{A}_0=\{a\}$, $\mathcal{B}_0=\{b\}$ and check whether $a=b$. If $a=b$, we do not have to do anything. If $a\neq b$, we construct a set $\mathcal{A}_1$ consisting of the vertices of $G$ that are adjacent to $a$. In other words, $\mathcal{A}_1$ consists of the vertices having distance 1 from $a$. Then we check whether there exists $x\in\mathcal{A}_1$ such that $x=b$. If there is, we get a shortest path $a-b$ between $a$ and $b$. If there is not, we construct a set $\mathcal{B}_1$ consisting of the vertices of $G$ that are adjacent to $b$. In other words, $\mathcal{B}_1$ consists of the vertices having distance 1 from $b$. Then we check whether there exist $x\in\mathcal{A}_1$ and $y\in\mathcal{B}_1$ such that $x=y$. If there are, we get a shortest path $a-a_1^*-b$ between $a$ and $b$, where $a_1^*\in\mathcal{A}_1$. If there are not, we continue the procedure above: check and move forwards


$x\in\mathcal{A}_{k+1}$ and $y\in\mathcal{B}_{k+1}$ (or $y\in\mathcal{B}_k$) for some $k$ such that $x=y$. As a result, we find a shortest path

$$a-a_1^*-\cdots-a_k^*-a_{k+1}^*-b_k^*-\cdots-b_1^*-b\quad(\text{or } a-a_1^*-\cdots-a_k^*-b_k^*-\cdots-b_1^*-b)$$

between $a$ and $b$, where $a_i^*\in\mathcal{A}_i$, $b_i^*\in\mathcal{B}_i$, and every pair $a_i^*,a_{i+1}^*$ and every pair $b_j^*,b_{j+1}^*$ are adjacent. We formally describe the above procedure in Algorithm 1 with pseudocode.

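Algorithm 1 Double-Direction Search for a Shortest Path

Require: a graph $G=(V,E)$, $a,b\in V$.
Ensure: a shortest path between $a$ and $b$.

$\mathcal{A}_0\leftarrow\{a\}$, $\mathcal{B}_0\leftarrow\{b\}$, $i\leftarrow 0$, $j\leftarrow 0$, $link\leftarrow 0$;
while $link=0$ do
    if there exist $a_i^*\in\mathcal{A}_i$ and $b_j^*\in\mathcal{B}_j$ such that $a_i^*=b_j^*$ then
        $link\leftarrow 1$;
        continue;
    else
        $i\leftarrow i+1$;
        construct a set $\mathcal{A}_i$ consisting of the vertices adjacent to some vertex in $\mathcal{A}_{i-1}$ and not in $\bigcup_{k=0}^{i-1}\mathcal{A}_k$;
    end if
    if there exist $a_i^*\in\mathcal{A}_i$ and $b_j^*\in\mathcal{B}_j$ such that $a_i^*=b_j^*$ then
        $link\leftarrow 1$;
        continue;
    else
        $j\leftarrow j+1$;
        construct a set $\mathcal{B}_j$ consisting of the vertices adjacent to some vertex in $\mathcal{B}_{j-1}$ and not in $\bigcup_{k=0}^{j-1}\mathcal{B}_k$;
    end if
end while
return the path $a_0^*-\cdots-a_i^*-b_{j-1}^*-\cdots-b_0^*$ such that every pair $a_k^*,a_{k+1}^*$ and every pair $b_l^*,b_{l+1}^*$ are adjacent;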
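The pseudocode of Algorithm 1 can be sketched in Python as follows. This is one possible rendering, not the authors' implementation: it assumes the graph is given as a dictionary of adjacency lists, records a predecessor for every visited vertex so that the path can be read back, and additionally returns `None` when one side can no longer grow (a case the pseudocode does not cover). The names `double_direction_search` and `_join` are hypothetical.

```python
def double_direction_search(adjacent, a, b):
    """Return a shortest path from a to b as a list of vertices, or None if none exists.

    `adjacent` maps each vertex to an iterable of its neighbours (undirected graph).
    """
    if a == b:
        return [a]
    parents = ({a: None}, {b: None})     # visited vertices with predecessors, per side
    fronts = [{a}, {b}]                  # current level sets A_i and B_j
    side = 0                             # 0: grow from a, 1: grow from b
    while fronts[0] and fronts[1]:
        grown = set()
        for x in fronts[side]:
            for y in adjacent[x]:
                if y not in parents[side]:
                    parents[side][y] = x
                    grown.add(y)
        fronts[side] = grown
        meeting = fronts[0] & fronts[1]
        if meeting:
            return _join(next(iter(meeting)), *parents)
        side ^= 1                        # alternate between the two directions
    return None


def _join(middle, parents_a, parents_b):
    """Concatenate the half-path a..middle with the half-path middle..b."""
    path, x = [], middle
    while x is not None:
        path.append(x)
        x = parents_a[x]
    path.reverse()
    x = parents_b[middle]
    while x is not None:
        path.append(x)
        x = parents_b[x]
    return path


# A 6-cycle 0-1-2-3-4-5-0 with one chord {0, 3}, plus an isolated vertex 6.
graph = {0: [1, 5, 3], 1: [0, 2], 2: [1, 3], 3: [2, 4, 0],
         4: [3, 5], 5: [4, 0], 6: []}
```

Only the two current level sets are compared at each step, which mirrors the test "$a_i^*=b_j^*$" in the pseudocode; the dictionaries of visited vertices play the role of $\bigcup_k\mathcal{A}_k$ and $\bigcup_k\mathcal{B}_k$.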

Lemma 4. Suppose $G=(V,E)$ is a graph and $a,b\in V$. Then we get a shortest path between $a$ and $b$ with Algorithm 1.

Proof. Assume there is a path $a-c_1-\cdots-c_{r-1}-b$ shorter than the output of Algorithm 1. Then $c_k\in\mathcal{A}_k$ for $k=1,\cdots,\lceil\frac{r-1}{2}\rceil$, and $c_{r-l}\in\mathcal{B}_l$ for $l=1,\cdots,\lfloor\frac{r-1}{2}\rfloor$. Consequently, there exist $a_k^*\in\mathcal{A}_k$ and $b_l^*\in\mathcal{B}_l$ such that $a_k^*=b_l^*$ for some $k<i$ or some $l<j$, which contradicts Algorithm 1.

Algorithm 1 is a generic method for any graph. Now we use it to handle our main target: a minimum-XOR implementation of a given linear layer $L \in \mathcal{IM}_{n\times n}(\mathbb{F}_2)$. For this target, we present Algorithm 2 and omit the proof of its correctness because it follows directly from Algorithm 1. Here, we just explain Algorithm 2 as follows.

• We choose an element in every equivalence class to represent it. For example, $L_i^*$ represents the class $[L_i^*]_{RP}$. Hence, we determine whether $[L_i]_{RP} = [B_j]_{RP}$ by checking whether $L_i \sim_{RP} B_j$.

• When we need to determine whether two invertible matrices $L_i$ and $B_j$ are row-permutation-equivalent, we check the Hamming weight of $L_i B_j^{-1}$, since $L_i \sim_{RP} B_j$ if and only if $W_H(L_i B_j^{-1}) = n$. This method can be implemented easily when $B$ is a product of some additive elementary matrices. More explicitly, if $B = A_1 \cdots A_s$, then $B^{-1} = A_s \cdots A_1$, because every additive elementary matrix over $\mathbb{F}_2$ is its own inverse.
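The equivalence test in the second bullet can be sketched as follows. This is an illustrative sketch under assumptions: matrices are plain lists of 0/1 rows, both inputs are invertible, $A_{r,s}$ is taken to be $I_n + E_{r,s}$ with 1-indexed $r \neq s$, and all function names (`additive_elementary`, `rp_equivalent`, and so on) are hypothetical.

```python
def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]


def additive_elementary(n, r, s):
    """A_{r,s} = I_n + E_{r,s} over F_2 (1-indexed, r != s)."""
    A = identity(n)
    A[r - 1][s - 1] = 1
    return A


def matmul(X, Y):
    """Matrix product over F_2."""
    n = len(X)
    return [[sum(X[i][k] & Y[k][j] for k in range(n)) % 2 for j in range(n)]
            for i in range(n)]


def hamming_weight(M):
    return sum(map(sum, M))


def rp_equivalent(L, b_factors):
    """Decide L ~RP B for B = A_1 ... A_s given as a list of (r, s) index pairs.

    Uses B^{-1} = A_s ... A_1 and the criterion W_H(L B^{-1}) = n.
    """
    n = len(L)
    prod = L
    for r, s in reversed(b_factors):      # multiply by A_s first, A_1 last
        prod = matmul(prod, additive_elementary(n, r, s))
    return hamming_weight(prod) == n


factors = [(1, 2), (2, 3)]                # B = A_{1,2} A_{2,3}
B = matmul(additive_elementary(3, 1, 2), additive_elementary(3, 2, 3))
P = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]     # a permutation matrix
print(rp_equivalent(matmul(P, B), factors))   # P*B is a row permutation of B
print(rp_equivalent(identity(3), factors))    # I_3 is not
```

No explicit matrix inversion is needed: the inverse is obtained by reversing the list of factors.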


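Algorithm 2 Finding a Minimum-XOR-Implementation of an Invertible Linear Layer

Require: a positive integer $n$, a matrix $L \in \mathcal{IM}_{n\times n}(\mathbb{F}_2)$.
Ensure: a minimum-XOR-implementation of $L$.

$\mathcal{L}_0 \leftarrow \{L\}$, $\mathcal{B}_0 \leftarrow \{I_n\}$, $i \leftarrow 0$, $j \leftarrow 0$, $link \leftarrow 0$;
while $link = 0$ do
    if there exist $L_i^* \in \mathcal{L}_i$ and $B_j^* \in \mathcal{B}_j$ such that $L_i^* \sim_{RP} B_j^*$ then
        $link \leftarrow 1$; continue;
    else
        $i \leftarrow i + 1$;
        construct a set $\mathcal{L}_i$ consisting of the matrices having the form $A_{r,s} L_{i-1}$ and not row-permutation-equivalent to any matrix in $\bigcup_{k=0}^{i-1} \mathcal{L}_k$, where $A_{r,s}$ runs over $\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$ and $L_{i-1}$ runs over $\mathcal{L}_{i-1}$;
    end if
    if there exist $L_i^* \in \mathcal{L}_i$ and $B_j^* \in \mathcal{B}_j$ such that $L_i^* \sim_{RP} B_j^*$ then
        $link \leftarrow 1$; continue;
    else
        $j \leftarrow j + 1$;
        construct a set $\mathcal{B}_j$ consisting of the matrices having the form $A_{r,s} B_{j-1}$ and not row-permutation-equivalent to any matrix in $\bigcup_{k=0}^{j-1} \mathcal{B}_k$, where $A_{r,s}$ runs over $\mathcal{AEM}_{n\times n}(\mathbb{F}_2)$ and $B_{j-1}$ runs over $\mathcal{B}_{j-1}$;
    end if
end while
return the path $L_0^* - \cdots - L_i^* - B_{j-1}^* - \cdots - B_0^*$ such that every pair $L_k^*, L_{k+1}^*$ and every pair $B_l^*, B_{l+1}^*$ are adjacent;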

double-direction search. Let us show what happens if we handle our main problem with Dijkstra's algorithm, using the same notation as in Algorithm 1. We start from vertex $a$ to find a shortest path between $a$ and $b$. With Dijkstra's algorithm, we need to construct a series of sets $\mathcal{A}_i$ consisting of the vertices whose distance to $a$ is $i$, for $i = 1, 2, \cdots$. According to Theorem 3, $|\mathcal{A}_1| = n^2 - n$ and $|\mathcal{A}_i| = (n^2 - n)(n^2 - n - 1)^{i-1}$ for $i \geq 2$ in the worst case. If the distance between $a$ and $b$ is $s \geq 2$, the algorithm will not terminate until $\mathcal{A}_s$ is constructed. The cardinality of $\mathcal{A}_i$ grows too fast and occupies too much space. That is why we abandon single-direction strategies and adopt a double-direction strategy: moving forwards step by step from $a$ and $b$ alternately. With a double-direction strategy, we save a lot of memory and make the search for a shortest path more efficient. For instance, suppose the distance between $a$ and $b$ is $s = 2t$. With our method, we need to construct $\mathcal{A}_i$ and $\mathcal{B}_i$ for $i = 1, \cdots, t$, where $|\mathcal{A}_1| = |\mathcal{B}_1| = n^2 - n$, $|\mathcal{A}_2| = |\mathcal{B}_2| = (n^2 - n)(n^2 - n - 1)$, $\cdots$, $|\mathcal{A}_t| = |\mathcal{B}_t| = (n^2 - n)(n^2 - n - 1)^{t-1}$ in the worst case. With a single-direction strategy, however, we have to construct $\mathcal{A}_i$ for $i = 1, \cdots, 2t$, where $|\mathcal{A}_1| = n^2 - n$, $|\mathcal{A}_2| = (n^2 - n)(n^2 - n - 1)$, $\cdots$, $|\mathcal{A}_t| = (n^2 - n)(n^2 - n - 1)^{t-1}$, $|\mathcal{A}_{t+1}| = (n^2 - n)(n^2 - n - 1)^{t}$, $\cdots$, $|\mathcal{A}_{2t}| = (n^2 - n)(n^2 - n - 1)^{2t-1}$.
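To make the gap between the two strategies concrete, the worst-case layer sizes quoted above can simply be added up. The sketch below is only an illustration of that arithmetic; the function names `layer_size` and `stored_vertices` are hypothetical, and the distance $s$ is assumed to be even, as in the text.

```python
def layer_size(n, i):
    """Worst-case size of the i-th layer: (n^2 - n)(n^2 - n - 1)^(i - 1)."""
    d = n * n - n
    return d * (d - 1) ** (i - 1)


def stored_vertices(n, s, double_direction):
    """Worst-case number of vertices constructed for distance s (s even)."""
    if double_direction:
        t = s // 2
        # layers A_1..A_t and B_1..B_t have the same worst-case sizes
        return 2 * sum(layer_size(n, i) for i in range(1, t + 1))
    return sum(layer_size(n, i) for i in range(1, s + 1))


for n, s in [(5, 6), (8, 6)]:
    single = stored_vertices(n, s, double_direction=False)
    double = stored_vertices(n, s, double_direction=True)
    print(f"n={n}, s={s}: single={single}, double={double}")
```

The single-direction total is dominated by its last layer, whose exponent is about twice that of the largest layer in the double-direction search.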

5.1 Experimental Results

We searched for minimum-XOR implementations of many linear layers and list some of them in Appendix A. The optimal implementation procedure of each one is like the one shown in Example 1. Meanwhile, we list the minimum XOR count and the difference between the Hamming weight and the order of each linear layer in Table 1, where the former indicates the implementation cost of our strategy and the latter indicates the cost of computing co-ordinates of output independently. $L_1, \cdots, L_{10}$ in Table 1 are matrices in $\mathcal{M}_{5\times 5}(\mathbb{F}_2)$, and $L_{11}, \cdots, L_{20}$ are matrices in $\mathcal{M}_{6\times 6}(\mathbb{F}_2)$. We do not choose $6\times 6$ matrices with large Hamming weights because of the hardware limitations of the PC we use. According to the experimental results, we save approximately 55.2% of the XORs for implementing $L_1, \cdots, L_{10}$ in comparison with computing co-ordinates of output independently, and the corresponding percentage for implementing $L_{11}, \cdots, L_{20}$ is 20.8%.

Table 1: Implementation Cost of Linear Layers

Linear Layer      L1    L2    L3    L4    L5    L6    L7    L8    L9    L10
Min-XOR-Count     6     7     6     5     6     6     6     6     6     6
W_H(L_i) − n      14    14    14    13    13    14    13    13    13    13

Linear Layer      L11   L12   L13   L14   L15   L16   L17   L18   L19   L20
Min-XOR-Count     6     6     6     5     6     7     6     7     6     6
W_H(L_i) − n      8     7     8     7     7     8     8     8     8     8
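The two percentages can be reproduced from Table 1. The sketch below assumes that the saving is computed on the totals of each group of ten matrices; this reading is an assumption, chosen because it reproduces the quoted figures, and the helper name `saving_percent` is hypothetical.

```python
# Rows of Table 1: minimum XOR counts and W_H(L_i) - n.
min_xor_5x5 = [6, 7, 6, 5, 6, 6, 6, 6, 6, 6]
indep_5x5 = [14, 14, 14, 13, 13, 14, 13, 13, 13, 13]
min_xor_6x6 = [6, 6, 6, 5, 6, 7, 6, 7, 6, 6]
indep_6x6 = [8, 7, 8, 7, 7, 8, 8, 8, 8, 8]


def saving_percent(new_costs, old_costs):
    """Percentage of XORs saved, computed on the group totals."""
    return 100 * (1 - sum(new_costs) / sum(old_costs))


print(round(saving_percent(min_xor_5x5, indep_5x5), 1))  # L1..L10
print(round(saving_percent(min_xor_6x6, indep_6x6), 1))  # L11..L20
```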

6 A More Practical Strategy for Efficient Implementations of Invertible Linear Layers

Although we present an algorithm to search for an optimal implementation of a given linear layer, its space/time complexity skyrockets as the order and the minimum XOR count of the given linear layer increase. For example, when a linear layer $L$ is in $\mathcal{IM}_{8\times 8}(\mathbb{F}_2)$, the time complexity of Algorithm 2 is $56^r$, where $r = \text{Min-XOR-Count}(L)$. If $r = 15$ (not very large), then the time complexity of searching for the minimum XOR implementation of $L$ will be $56^{15} \approx 2^{87}$.
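As a quick check of this estimate (a worked step added for illustration): for $n = 8$ the base is $n^2 - n = 56$, the worst-case size of the first layer given by Theorem 3, and converting to a power of two gives

```latex
\[
56^{15} \;=\; 2^{15 \log_2 56} \;\approx\; 2^{15 \times 5.807} \;\approx\; 2^{87.1},
\]
```

which agrees with the stated $2^{87}$.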

In this section, to avoid high computational complexity of searching for optimal implementations of linear layers, we switch to other efficient implementations of them. Our aim is still looking for a P-AE factorization of a given linear layer, but it is not necessarily a minimum XOR implementation of it.

As we know, every matrix $M \in \mathcal{IM}_{n\times n}(\mathbb{F}_2)$ can be transformed to a diagonal matrix with the same rank as $M$ via a series of row/column exchanges and row/column additions. This diagonal matrix must be $I_n$ because the rank of $M$ is $n$. According to Lemma 2, $M$ can be transformed to a permutation matrix $P \in \mathcal{PM}_{n\times n}(\mathbb{F}_2)$ via a series of row/column additions. In matrix form, there exist additive elementary matrices $R_1, \cdots, R_r$ and $C_1, \cdots, C_s$ such that $R_r \cdots R_1 M C_1 \cdots C_s = P$. Consequently, we get a P-AE factorization $M = R_1 \cdots R_r P C_s \cdots C_1$. This indicates that left-multiplying an input column $X \in \mathbb{F}_2^n$ by $M$ with this P-AE factorization uses $r + s$ XORs. If each row/column addition reduces the Hamming weight of the matrix by 1, the implementation cost of this MIOSS procedure will be equal to that of computing co-ordinates of output of $M$ independently, namely $W_H(M) - n$. Therefore, if we want an MIOSS procedure better than computing co-ordinates of output of $M$ independently, the guideline is to reduce the Hamming weight of the given linear layer as much as possible with each additive row/column elementary operation in $R_r \cdots R_1 M C_1 \cdots C_s = P$. To attain this goal, we present Algorithm 3.

In Algorithm 3, $r$ is a variable recording the number of additive elementary operations, and $RUB$ is assigned the difference between the Hamming weight of the original matrix $M$ and its order $n$. If $r$ exceeds $RUB$, it is unnecessary to let the program proceed because its output MIOSS procedure will not be better than computing co-ordinates of output of the linear layer independently. In each while loop, Algorithm 3 looks for the additive elementary operation that reduces the Hamming weight of $M$ most among all additive row and column elementary operations and applies it to $M$. If the algorithm finally displays "Success", then we get a series of additive elementary operations and a permutation matrix. In the form of matrix multiplication, we get $R_r \cdots R_1 M C_1 \cdots C_s = P$, where each $R_i$ and each $C_j$ is an additive elementary matrix and $P$ is a permutation matrix.


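Algorithm 3 Finding an Efficient Implementation of an Invertible Linear Layer

Require: a positive integer $n$, a matrix $M \in \mathcal{IM}_{n\times n}(\mathbb{F}_2)$;
Ensure: a series of additive elementary operations and a permutation matrix in $\mathcal{PM}_{n\times n}(\mathbb{F}_2)$;

$r \leftarrow 0$, $RUB \leftarrow W_H(M) - n$;
while $W_H(M) > n$ and $r \leq RUB$ do
    let $\alpha_i$ denote the $i$-th row of $M$ and $\beta_i$ denote the $i$-th column of $M$ for $i = 1, \cdots, n$, $weightdecrease \leftarrow W_H(\alpha_1) - W_H(\alpha_1 + \alpha_2)$, $AE \leftarrow (R, 2, 1)$;
    for $1 \leq i < j \leq n$ do
        compute $\alpha_i + \alpha_j$;
        if $W_H(\alpha_i) - W_H(\alpha_i + \alpha_j) > weightdecrease$ then
            $weightdecrease \leftarrow W_H(\alpha_i) - W_H(\alpha_i + \alpha_j)$, $AE \leftarrow (R, j, i)$;
        end if
        if $W_H(\alpha_j) - W_H(\alpha_i + \alpha_j) > weightdecrease$ then
            $weightdecrease \leftarrow W_H(\alpha_j) - W_H(\alpha_i + \alpha_j)$, $AE \leftarrow (R, i, j)$;
        end if
    end for
    for $1 \leq i < j \leq n$ do
        compute $\beta_i + \beta_j$;
        if $W_H(\beta_i) - W_H(\beta_i + \beta_j) > weightdecrease$ then
            $weightdecrease \leftarrow W_H(\beta_i) - W_H(\beta_i + \beta_j)$, $AE \leftarrow (C, j, i)$;
        end if
        if $W_H(\beta_j) - W_H(\beta_i + \beta_j) > weightdecrease$ then
            $weightdecrease \leftarrow W_H(\beta_j) - W_H(\beta_i + \beta_j)$, $AE \leftarrow (C, i, j)$;
        end if
    end for
    if $AE = (R, i, j)$ then
        add $\alpha_i$ to $\alpha_j$, output $AE$, $r \leftarrow r + 1$;
    else
        add $\beta_i$ to $\beta_j$, output $AE$, $r \leftarrow r + 1$;
    end if
end while
if $W_H(M) > n$ then
    print "Fail.";
else
    print "Success via $r$ steps.", output $M$;
end if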

One important advantage of Algorithm 3 is that its time/space complexity is much lower than that of Algorithm 2. For a linear layer $M \in \mathcal{IM}_{n\times n}(\mathbb{F}_2)$, if Algorithm 3 runs $k$ while loops, then its time complexity is $k(n^2 - n)$. This is far smaller than $(n^2 - n)^k$, the time complexity of Algorithm 2 with the same number of loops. Therefore, Algorithm 3 is substantially more practical than Algorithm 2. Nevertheless, we have to admit that Algorithm 3 does not necessarily give us an optimal implementation of the input linear layer even when it succeeds.
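The greedy step of Algorithm 3 can be sketched in Python as follows. This is a minimal sketch, not the authors' program: the matrix is a list of 0/1 rows, the function name `greedy_reduce` is hypothetical, and ties are broken by a fixed scan order (rows before columns), which may differ from the order used in the original experiments.

```python
def greedy_reduce(M):
    """Greedily reduce an invertible 0/1 matrix towards a permutation matrix.

    Returns (success, ops, final), where each op (kind, i, j) means
    "add row/column i to row/column j" (1-indexed), as in Algorithm 3.
    """
    M = [row[:] for row in M]
    n = len(M)
    rub = sum(map(sum, M)) - n                 # RUB = W_H(M) - n
    ops = []
    while sum(map(sum, M)) > n and len(ops) <= rub:
        cols = [list(c) for c in zip(*M)]
        best = None                            # (decrease, kind, src, dst)
        for kind, vecs in (("R", M), ("C", cols)):
            for src in range(n):
                for dst in range(n):
                    if src == dst:
                        continue
                    new_w = sum(a ^ b for a, b in zip(vecs[src], vecs[dst]))
                    dec = sum(vecs[dst]) - new_w
                    if best is None or dec > best[0]:
                        best = (dec, kind, src, dst)
        _, kind, src, dst = best
        if kind == "R":                        # add row src to row dst
            M[dst] = [a ^ b for a, b in zip(M[dst], M[src])]
        else:                                  # add column src to column dst
            for row in M:
                row[dst] ^= row[src]
        ops.append((kind, src + 1, dst + 1))
    return sum(map(sum, M)) == n, ops, M


ok, ops, final = greedy_reduce([[1, 1, 0], [0, 1, 1], [0, 0, 1]])
print(ok, ops, final)
```

Each loop scans all $n^2 - n$ ordered pairs of rows and all $n^2 - n$ ordered pairs of columns once, which matches the per-loop cost discussed above up to a constant factor.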

It is well known that the implementation costs of the inverses of many linear layers are higher than those of the linear layers themselves (the linear layer of AES, for example). As we mentioned before, besides using fewer XORs, another advantage of MIOSS is that every invertible linear layer and its inverse are implemented with the same cost. This advantage also holds for the contents of this section. More specifically, if we get an efficient implementation $M = R_1 \cdots R_r P C_s \cdots C_1$ of a given linear layer $M \in \mathcal{IM}_{n\times n}(\mathbb{F}_2)$ by Algorithm 3, then we immediately obtain an implementation of $M^{-1}$: $M^{-1} = C_1 \cdots C_s P^{-1} R_r \cdots R_1$. Obviously, $M^{-1}$ has the same implementation cost as $M$ does. In fact, we recommend a better


6.1 New Efficient Implementations of the Linear Layers of AES

As an application of Algorithm 3, we investigate the linear layers of AES. The linear layer of AES consists of two phases: ShiftRows and MixColumns. We ignore ShiftRows since it is just a permutation of the co-ordinates of input vectors; MixColumns is the part we are concerned with. The Hamming weights of the matrices of encryption MixColumns and decryption MixColumns are 184 and 472, respectively. So, if we implement them by computing co-ordinates of output independently, they cost 184 − 32 = 152 XORs and 472 − 32 = 440 XORs, respectively. However, with Algorithm 3, we get an efficient implementation of encryption MixColumns with 132 XORs and an efficient implementation of decryption MixColumns with 228 XORs. In other words, we save 13.16% of the XORs of computing co-ordinates of output of MixColumns independently and 48.18% of the XORs of computing co-ordinates of output of the inverse MixColumns independently. We show the design of these efficient implementations below.

Let $MC$ denote the matrix of encryption MixColumns of AES. Then

\[
MC = \begin{pmatrix}
M(\mathtt{0x02}) & M(\mathtt{0x03}) & M(\mathtt{0x01}) & M(\mathtt{0x01}) \\
M(\mathtt{0x01}) & M(\mathtt{0x02}) & M(\mathtt{0x03}) & M(\mathtt{0x01}) \\
M(\mathtt{0x01}) & M(\mathtt{0x01}) & M(\mathtt{0x02}) & M(\mathtt{0x03}) \\
M(\mathtt{0x03}) & M(\mathtt{0x01}) & M(\mathtt{0x01}) & M(\mathtt{0x02})
\end{pmatrix}
\]

is a matrix in $\mathcal{M}_{32\times 32}(\mathbb{F}_2)$, where

\[
M(\mathtt{0x02}) = \begin{pmatrix}
0&1&0&0&0&0&0&0 \\
0&0&1&0&0&0&0&0 \\
0&0&0&1&0&0&0&0 \\
1&0&0&0&1&0&0&0 \\
1&0&0&0&0&1&0&0 \\
0&0&0&0&0&0&1&0 \\
1&0&0&0&0&0&0&1 \\
1&0&0&0&0&0&0&0
\end{pmatrix},
\quad
M(\mathtt{0x03}) = \begin{pmatrix}
1&1&0&0&0&0&0&0 \\
0&1&1&0&0&0&0&0 \\
0&0&1&1&0&0&0&0 \\
1&0&0&1&1&0&0&0 \\
1&0&0&0&1&1&0&0 \\
0&0&0&0&0&1&1&0 \\
1&0&0&0&0&0&1&1 \\
1&0&0&0&0&0&0&1
\end{pmatrix},
\]

\[
M(\mathtt{0x01}) = \begin{pmatrix}
1&0&0&0&0&0&0&0 \\
0&1&0&0&0&0&0&0 \\
0&0&1&0&0&0&0&0 \\
0&0&0&1&0&0&0&0 \\
0&0&0&0&1&0&0&0 \\
0&0&0&0&0&1&0&0 \\
0&0&0&0&0&0&1&0 \\
0&0&0&0&0&0&0&1
\end{pmatrix}.
\]

Meanwhile, the matrix of decryption MixColumns of AES is

\[
MC^{-1} = \begin{pmatrix}
M(\mathtt{0x0e}) & M(\mathtt{0x0b}) & M(\mathtt{0x0d}) & M(\mathtt{0x09}) \\
M(\mathtt{0x09}) & M(\mathtt{0x0e}) & M(\mathtt{0x0b}) & M(\mathtt{0x0d}) \\
M(\mathtt{0x0d}) & M(\mathtt{0x09}) & M(\mathtt{0x0e}) & M(\mathtt{0x0b}) \\
M(\mathtt{0x0b}) & M(\mathtt{0x0d}) & M(\mathtt{0x09}) & M(\mathtt{0x0e})
\end{pmatrix},
\]

where

\[
M(\mathtt{0x0e}) = \begin{pmatrix}
0&1&1&1&0&0&0&0 \\
1&0&1&1&1&0&0&0 \\
0&1&0&1&1&1&0&0 \\
0&0&1&0&1&1&1&0 \\
0&1&1&0&0&1&1&1 \\
0&1&0&0&0&0&1&1 \\
0&0&1&0&0&0&0&1 \\
1&1&1&0&0&0&0&0
\end{pmatrix},
\quad
M(\mathtt{0x0b}) = \begin{pmatrix}
1&1&0&1&0&0&0&0 \\
1&1&1&0&1&0&0&0 \\
1&1&1&1&0&1&0&0 \\
1&1&1&1&1&0&1&0 \\
0&0&1&0&1&1&0&1 \\
1&1&0&0&0&1&1&0 \\
1&1&1&0&0&0&1&1 \\
1&0&1&0&0&0&0&1
\end{pmatrix},
\]

\[
M(\mathtt{0x0d}) = \begin{pmatrix}
1&0&1&1&0&0&0&0 \\
1&1&0&1&1&0&0&0 \\
0&1&1&0&1&1&0&0 \\
1&0&1&1&0&1&1&0 \\
1&1&1&0&1&0&1&1 \\
0&1&0&0&0&1&0&1 \\
1&0&1&0&0&0&1&0 \\
0&1&1&0&0&0&0&1
\end{pmatrix},
\quad
M(\mathtt{0x09}) = \begin{pmatrix}
1&0&0&1&0&0&0&0 \\
1&1&0&0&1&0&0&0 \\
1&1&1&0&0&1&0&0 \\
0&1&1&1&0&0&1&0 \\
1&0&1&0&1&0&0&1 \\
1&1&0&0&0&1&0&0 \\
0&1&1&0&0&0&1&0 \\
0&0&1&0&0&0&0&1
\end{pmatrix}.
\]

We have $W_H(MC) = 184$ and $W_H(MC^{-1}) = 472$.
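These Hamming weights can be recomputed without writing out the $32\times 32$ matrices. The sketch below relies on two facts: AES multiplies in $\mathrm{GF}(2^8)$ modulo $x^8 + x^4 + x^3 + x + 1$, and the columns of the binary matrix of multiplication by a constant $c$ are the bit vectors of $c \cdot x^j$ for $j = 0, \cdots, 7$. The function names `gf_mul`, `block_weight` and `circulant_weight` are hypothetical.

```python
def gf_mul(a, b):
    """Multiply a and b in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B          # reduce by the AES polynomial
        b >>= 1
    return result


def block_weight(c):
    """Hamming weight of the 8x8 binary matrix M(c) of multiplication by c."""
    return sum(bin(gf_mul(c, 1 << j)).count("1") for j in range(8))


def circulant_weight(first_row):
    """Weight of a 4x4 block-circulant matrix: every block row has the same blocks."""
    return 4 * sum(block_weight(c) for c in first_row)


w_mc = circulant_weight([0x02, 0x03, 0x01, 0x01])
w_inv = circulant_weight([0x0E, 0x0B, 0x0D, 0x09])
print(w_mc, w_inv)              # Hamming weights of MC and MC^{-1}
print(w_mc - 32, w_inv - 32)    # XOR counts when outputs are computed independently
```

The same helper gives the weights of the individual blocks, for example `block_weight(0x03)` for M(0x03).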

We apply Algorithm 3 to $MC$ and $MC^{-1}$ but get nothing useful. Therefore, we use a trick on them: divide and conquer.

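Firstly, let us consider $MC$. We divide $MC$ into 4 blocks:

\[
MC = \begin{pmatrix} M_1 & M_2 \\ M_2 & M_1 \end{pmatrix},
\]

where

\[
M_1 = \begin{pmatrix} M(\mathtt{0x02}) & M(\mathtt{0x03}) \\ M(\mathtt{0x01}) & M(\mathtt{0x02}) \end{pmatrix}
\quad \text{and} \quad
M_2 = \begin{pmatrix} M(\mathtt{0x01}) & M(\mathtt{0x01}) \\ M(\mathtt{0x03}) & M(\mathtt{0x01}) \end{pmatrix}.
\]

We easily get $W_H(M_1) = 49$ and $W_H(M_2) = 43$. Meanwhile, we let the input vector $X = (X_1, X_2, X_3, X_4)^T$, where each $X_i \in \mathbb{F}_2^8$. Hereby, the output vector can be computed as

\[
MC \cdot X = \begin{pmatrix} M_1 & M_2 \\ M_2 & M_1 \end{pmatrix} (X_1, X_2, X_3, X_4)^T
= \begin{pmatrix} M_1 (X_1, X_2)^T + M_2 (X_3, X_4)^T \\ M_2 (X_1, X_2)^T + M_1 (X_3, X_4)^T \end{pmatrix}. \tag{1}
\]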

Then the implementation cost of $MC$ is twice that of $M_1$ plus twice that of $M_2$ plus 32. We apply Algorithm 3 to $M_1$ and $M_1^{-1}$ and obtain an MIOSS implementation of $M_1$ with 46 XORs. But $W_H(M_1) = 49$, so we can compute the co-ordinates of $M_1 (X_1, X_2)^T$ independently with 49 − 16 = 33 XORs. So, such an MIOSS implementation of $M_1$ is not a good option. However, we may apply divide and conquer once again to $M_1$. We apply Algorithm 3 to M(0x02) and M(0x02)$^{-1}$ and obtain an MIOSS implementation of M(0x02) with 3 XORs. This MIOSS implementation of M(0x02) costs the same number of XORs as computing co-ordinates of the output vector independently, so we implement M(0x02) by computing co-ordinates of the output vector independently. Next, we apply Algorithm 3 to M(0x03) and M(0x03)$^{-1}$ and obtain an MIOSS implementation of M(0x03) with 9 XORs as follows.

\[
M(\mathtt{0x03}) = A_{7,8} A_{6,7} A_{5,6} A_{4,5} A_{3,4} A_{2,3} A_{1,2} A_{8,1} P_{\mathtt{0x03}} A_{5,1}, \tag{2}
\]

where each $A_{i,j} = E_{i,j} + I_8$ and $P_{\mathtt{0x03}} = I_8$. This MIOSS implementation uses fewer XORs than computing co-ordinates of output of M(0x03) independently, because the latter uses 11 XORs. Implementing M(0x02) by computing co-ordinates of output independently and implementing M(0x03) by the MIOSS method in Equation (2), we can implement $M_1$ by divide and conquer with 3 × 2 + 9 + 16 = 31 XORs. This is less than the 33 XORs of implementing $M_1$ by computing co-ordinates of output independently. So, we choose divide and conquer to implement $M_1$ as follows:

\[
M_1 (X_1, X_2)^T = \begin{pmatrix} M(\mathtt{0x02}) X_1^T + M(\mathtt{0x03}) X_2^T \\ M(\mathtt{0x01}) X_1^T + M(\mathtt{0x02}) X_2^T \end{pmatrix}, \tag{3}
\]

where left-multiplying by M(0x02) is implemented by computing co-ordinates of output independently, and left-multiplying by M(0x03) is implemented by the MIOSS method in Equation (2). Similarly, we investigate the costs of different implementation methods of $M_2$. We apply Algorithm 3 to $M_2$ and $M_2^{-1}$ and obtain an MIOSS implementation of $M_2$ with 19 XORs as follows.

\[
\begin{aligned}
M_2 = {} & A_{9,1} A_{10,2} A_{11,3} A_{12,4} A_{13,5} A_{14,6} A_{15,7} \\
& A_{16,8} A_{1,16} A_{2,9} A_{3,10} A_{4,11} A_{7,14} A_{12,16} \\
& A_{5,12} A_{13,16} A_{6,13} A_{15,16} A_{8,15} P_{M_2},
\end{aligned} \tag{4}
\]

where each $A_{i,j} = E_{i,j} + I_{16}$ and $P_{M_2}$ is described in Table 2. This MIOSS implementation of $M_2$ is better than computing co-ordinates of output of $M_2$ independently, because the latter costs 43 − 16 = 27 XORs. In addition, implementing $M_2$ by divide and conquer costs 9 + 16 = 25 XORs, which is also larger than 19. So, we choose the MIOSS procedure in Equation (4) to implement $M_2$. To summarize this paragraph, we choose the divide and conquer procedure in Equation (1) to implement $MC$, the divide and conquer procedure in Equation (3) to implement $M_1$, the MIOSS procedure in Equation (2) to implement M(0x03), and the MIOSS procedure in Equation (4) to implement $M_2$, and finally obtain an efficient implementation of $MC$ with 31 × 2 + 19 × 2 + 32 = 132 XORs. It saves 13.16% of the XORs of computing co-ordinates of output of $MC$ independently.
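Equations (2) and (4) can be checked mechanically by running them as in-place XOR programs. The sketch below is an illustration under assumptions: $E_{i,j}$ is taken to be the matrix unit with a 1 in row $i$ and column $j$, so that $A_{i,j}$ adds coordinate $j$ to coordinate $i$; the permutation of Table 2 is read as "output $i$ takes input $\rho_{M_2}(i)$"; and the names `apply_factorization`, `matches`, `EQ2` and `EQ4` are hypothetical.

```python
def apply_factorization(x, left, perm, right):
    """Evaluate (A_left[0] ... A_left[-1] * P * A_right[0] ... A_right[-1]) x.

    Each pair (i, j) stands for A_{i,j} (1-indexed); perm[i-1] is the column
    of the 1 in row i of P. Returns the output and the number of XORs used.
    """
    x, xors = list(x), 0
    for i, j in reversed(right):          # the rightmost factor acts first
        x[i - 1] ^= x[j - 1]
        xors += 1
    x = [x[p - 1] for p in perm]          # permutation: wiring only, no XOR
    for i, j in reversed(left):
        x[i - 1] ^= x[j - 1]
        xors += 1
    return x, xors


def matches(M, fact):
    """A linear map is fixed by its action on a basis: compare column by column."""
    n = len(M)
    for k in range(n):
        e = [int(t == k) for t in range(n)]
        if apply_factorization(e, **fact)[0] != [row[k] for row in M]:
            return False
    return True


M03 = [[int(c) for c in row] for row in (
    "11000000", "01100000", "00110000", "10011000",
    "10001100", "00000110", "10000011", "10000001")]
I8 = [[int(i == j) for j in range(8)] for i in range(8)]
# M2 = [[M(0x01), M(0x01)], [M(0x03), M(0x01)]] as a 16x16 matrix
M2 = [I8[i] + I8[i] for i in range(8)] + [M03[i] + I8[i] for i in range(8)]

EQ2 = dict(left=[(7, 8), (6, 7), (5, 6), (4, 5), (3, 4), (2, 3), (1, 2), (8, 1)],
           perm=list(range(1, 9)), right=[(5, 1)])
EQ4 = dict(left=[(9, 1), (10, 2), (11, 3), (12, 4), (13, 5), (14, 6), (15, 7),
                 (16, 8), (1, 16), (2, 9), (3, 10), (4, 11), (7, 14), (12, 16),
                 (5, 12), (13, 16), (6, 13), (15, 16), (8, 15)],
           perm=list(range(9, 17)) + list(range(2, 9)) + [1], right=[])

xors_m03 = apply_factorization([0] * 8, **EQ2)[1]
xors_m2 = apply_factorization([0] * 16, **EQ4)[1]
cost_m1 = 2 * 3 + xors_m03 + 16           # Equation (3): two M(0x02), one M(0x03), 16 additions
cost_mc = 2 * cost_m1 + 2 * xors_m2 + 32  # Equation (1)
print(matches(M03, EQ2), matches(M2, EQ4), cost_m1, cost_mc)
print(round(100 * (1 - cost_mc / 152), 2))  # saving relative to 152 XORs
```

Under these conventions both factorizations reproduce the intended matrices, and the XOR counts add up to the totals stated in the text.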

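Table 2: The Permutation Matrix $P_{M_2}$ in the MIOSS Implementation of $M_2$

i             1    2    3    4    5    6    7    8
ρ_{M2}(i)     9    10   11   12   13   14   15   16

i             9    10   11   12   13   14   15   16
ρ_{M2}(i)     2    3    4    5    6    7    8    1

• ρ_{M2}(i) is the column index of the nonzero entry in the i-th row of $P_{M_2}$.

Secondly, let us consider $MC^{-1}$. Similar to the case of $MC$, we divide $MC^{-1}$ into 4 blocks:

\[
MC^{-1} = \begin{pmatrix} M_3 & M_4 \\ M_4 & M_3 \end{pmatrix},
\]

where

\[
M_3 = \begin{pmatrix} M(\mathtt{0x0e}) & M(\mathtt{0x0b}) \\ M(\mathtt{0x09}) & M(\mathtt{0x0e}) \end{pmatrix}
\quad \text{and} \quad
M_4 = \begin{pmatrix} M(\mathtt{0x0d}) & M(\mathtt{0x09}) \\ M(\mathtt{0x0b}) & M(\mathtt{0x0d}) \end{pmatrix}.
\]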

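We easily get $W_H(M_3) = 115$ and $W_H(M_4) = 121$. Meanwhile, we let the input vector $Y = (Y_1, Y_2, Y_3, Y_4)^T$, where each $Y_i \in \mathbb{F}_2^8$. Hereby, the output vector can be computed as

\[
MC^{-1} \cdot Y = \begin{pmatrix} M_3 & M_4 \\ M_4 & M_3 \end{pmatrix} (Y_1, Y_2, Y_3, Y_4)^T
= \begin{pmatrix} M_3 (Y_1, Y_2)^T + M_4 (Y_3, Y_4)^T \\ M_4 (Y_1, Y_2)^T + M_3 (Y_3, Y_4)^T \end{pmatrix}. \tag{5}
\]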

Then the implementation cost of $MC^{-1}$ is twice that of $M_3$ plus twice that of $M_4$ plus 32. For the convenience of later investigation, let us first consider the implementation costs of M(0x0e), M(0x0b), M(0x0d), and M(0x09). We apply Algorithm 3 to M(0x0e) and M(0x0e)$^{-1}$ and obtain an MIOSS implementation of M(0x0e) with 15 XORs. This MIOSS implementation of M(0x0e) uses fewer XORs than computing co-ordinates of output of M(0x0e) independently, because the latter uses 28 − 8 = 20 XORs. Thus, we choose the MIOSS procedure to implement M(0x0e). Likewise, we choose an MIOSS procedure with 15 XORs to implement M(0x0b), an MIOSS procedure with 15 XORs to implement M(0x0d), and an MIOSS procedure with 11 XORs to implement M(0x09). Then, we apply Algorithm 3 to $M_3$ and $M_3^{-1}$ and obtain an MIOSS implementation of $M_3$ with 52 XORs as follows.

\[
\begin{aligned}
M_3 = {} & A_{3,1} A_{5,15} A_{13,7} A_{7,16} A_{4,2} A_{2,1} A_{10,2} \\
& A_{11,2} A_{11,3} A_{1,9} A_{8,1} A_{6,16} A_{12,6} A_{6,15} \\
& A_{6,7} A_{14,7} A_{1,10} A_{15,1} A_{3,6} A_{5,6} A_{13,6} \\
& A_{7,8} A_{4,7} A_{15,8} A_{4,15} A_{9,11} A_{13,10} \\
& A_{16,10} A_{14,11} A_{16,11} A_{2,1} A_{2,3} A_{2,5} \\
& A_{3,1} A_{3,14} A_{13,3} A_{9,13} A_{1,14} A_{5,1} A_{5,16} \\
& A_{12,5} A_{9,12} A_{2,9} A_{16,2} A_{1,16} P_{M_3} A_{2,6} \\
& A_{12,6} A_{12,4} A_{7,1} A_{2,1} A_{11,12} A_{3,11},
\end{aligned} \tag{6}
\]

where each $A_{i,j} = E_{i,j} + I_{16}$ and $P_{M_3}$ is described in Table 3. This MIOSS implementation