Haute Ecole d’Ingénierie et de Gestion du Canton de Vaud
Implementation of a post-quantum algorithm
Author:
Florent Piller
July 2019
Contents
1 Introduction
1.1 General Context
1.2 Post-quantum cryptography
1.3 NIST contest
2 Choices made for the implementation
2.1 Programming language
2.2 Algorithm
2.2.1 Category of the algorithms
2.2.2 Round5
3 Algorithms
3.1 Round 5
3.1.1 The General Learning With Rounding Problem
3.1.2 Error-correcting code
3.1.3 Algorithms
4 Implementation
4.1 Round 5
4.1.1 Additions to the reference implementation
4.1.2 Problems encountered during the implementation
4.1.3 Optimization
4.1.4 Comparison with C implementations
4.1.5 Future improvements
4.1.6 Code
5 Conclusion
6 Authenticity
A Results of the optimized profiling
B Result of the KAT and the speedtest
Abstract
With the growth in power of quantum computers, post-quantum algorithms will probably become the next security standards for digital communications. This is one of the reasons why the National Institute of Standards and Technology launched a contest in 2017 in order to find the most secure post-quantum algorithms.
Round5 is one of the proposals for this contest, and the main goal of this work was to implement it in as optimized a way as possible. In this paper, you will find the difficulties I encountered in obtaining a usable Scala version, as well as all the optimization steps I took to make the implementation as fast as possible.
The final result is a program that is on par with the reference implementation of the Round5 submission but not as fast as the optimized version that came along with the Round5 proposal.
1 Introduction
1.1 General Context
Quantum computing. Quantum cryptography appeared in the 1970s thanks to the work of Wiesner and Brassard. Wiesner's idea was to use quantum mechanics to encode two messages in conjugate observables. The property of these observables was that if one of the messages was read, the other was destroyed.
These observables, today called qubits, are the basic unit of quantum computing and work as follows: until it is measured, a qubit is in a state between 0 and 1, written as a weighted sum $\alpha|0\rangle + \beta|1\rangle$ in which $|\alpha|^2$ is the probability of measuring 0 and $|\beta|^2$ the probability of measuring 1.
The interesting part is that quantum physics allows quantum superposition: adding two or more quantum states gives another valid quantum state. This means that several qubits taken together carry more information. For example, the joint state of two qubits is $\alpha|00\rangle + \beta|01\rangle + \gamma|10\rangle + \delta|11\rangle$.
As said before, qubits are the basic information storage of quantum computers. To give an equivalent in terms of bits, the state of an n-qubit quantum computer is described by $2^n$ amplitudes, so it can be seen as storing as much as $2^n$ bits of information.
That is the reason why the progress made in recent years in building more powerful quantum computers is dangerous for our digital security. Numerous cryptographic algorithms rely on mathematical problems that are hard to solve on today's computers, such as factoring large numbers or finding the discrete logarithm of a given number.
With quantum computing, these problems can be solved in polynomial time using Shor’s algorithm [1], which means that if quantum computers become powerful enough, the current algorithms will no longer be secure. In fact, Dr. Michele Mosca of the University of Waterloo said about RSA, one of the most famous security algorithms: ”I estimate a 1/7 chance of breaking RSA-2048 by 2026 and a 1/2 chance by 2031” [2].
But solutions to maintain a decent level of security already exist, in two known cryptographic fields:
• Quantum cryptography
• Post-quantum cryptography
Quantum cryptography uses quantum physical properties to provide security. The problem with this field is that it requires quantum hardware of its own to be secure against quantum attacks.
Post-quantum cryptography relies on mathematical problems that are believed to be hard to solve even for a quantum computer. The interesting point about this field is that these problems can be computed by normal computers and used in our current environment. This is the reason why the post-quantum field will probably provide the algorithms that replace current security algorithms like RSA.
1.2 Post-quantum cryptography
By definition, post-quantum cryptography is cryptography that resists attacks using both classical and quantum computers. This field consists mainly of five families [3]:
• Hash-based cryptography
• Code-based cryptography
• Lattice-based cryptography
• Multivariate-based cryptography
• Isogeny-based cryptography
Hash-based cryptography. This family of algorithms uses hash functions as its algorithmic primitive. A hash function is a function that, given an input of arbitrary length, outputs data of a fixed size.
The security of hash-based cryptography rests on the following three properties of the hash function:
• It is computationally infeasible to find two different messages with the same hash (collision resistance)
• Given a hash, it is computationally infeasible to find the message that gave this hash (first preimage resistance)
• Given a message and its hash, it is computationally infeasible to find another message that has the same hash (second preimage resistance)
As of today, the security of hash functions depends on the length $n$ of the output. One can find a collision in $O(2^{n/2})$ operations by brute force, due to the birthday paradox, whereas preimages can only be attacked by brute force with complexity $O(2^{n})$, because the probability that another message has the same hash is $1/2^{n}$.
With quantum computing there are a few changes, but the complexity of the attacks is still of the form $O(2^{x})$. For example, with Grover’s algorithm, a first preimage can be found with only $O(2^{n/2})$ operations [4].
This means that the security of hash functions depends on the size of the output even against quantum computers. The only thing we must do is adjust the output length to match the desired security. In this example, just doubling the size of the output gives the same level of security as we currently have.
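The arithmetic behind the "doubling" argument can be sketched as follows. This is only an illustration for an ideal hash function with an n-bit output; the object and method names are hypothetical and do not come from the Round5 code.

```scala
object HashSecurity {
  // All values are log2 of the number of hash evaluations needed by a
  // generic attack on an ideal hash function with an n-bit output.
  // (Illustrative helper names, not part of any library.)

  // Birthday bound: a collision is expected after about 2^(n/2) evaluations.
  def classicalCollisionBits(n: Int): Int = n / 2

  // Brute-force preimage search on a classical computer: about 2^n evaluations.
  def classicalPreimageBits(n: Int): Int = n

  // Grover's algorithm gives a quadratic speed-up: about 2^(n/2) evaluations.
  def groverPreimageBits(n: Int): Int = n / 2
}
```

With these definitions, `HashSecurity.groverPreimageBits(512)` equals `HashSecurity.classicalPreimageBits(256)`: a 512-bit output gives the same preimage security against Grover's algorithm as a 256-bit output gives against classical brute force.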
Code-based cryptography. As its name indicates, code-based cryptography relies on error-correcting codes to provide security.
The principle is that you deliberately introduce errors in some part of your computation; with the error-correcting code, the receiver of your ciphertext can correct them and recover the right plaintext.
Lattice-based cryptography. This term designates the cryptosystems that are based on lattices.
A lattice is a discrete subgroup of $\mathbb{R}^n$, or equivalently the set $L(b_1, \dots, b_n)$ of all linear combinations $\sum x_i b_i$ where $x_i \in \mathbb{Z}$ and the $b_i$ are linearly independent [5]. It can be represented as regularly spaced dots in an infinite space, as you can see in the following figure.
Figure 1: Representation of a 2d lattice, https://en.wikipedia.org/wiki/Post-Quantum_Cryptography_Standardization
However, in cryptography, lattices are used through the finite abelian group $\mathbb{Z}^d / L$, where $L$ is the lattice [5].
The security of lattice-based cryptography relies mainly on the following problems:
• The Shortest Vector Problem
• The Closest Vector Problem
• The Shortest Independent Vector Problem
The Shortest Vector Problem is the following: given a lattice L, find the shortest non-zero vector in L.
The Closest Vector Problem is a bit different: given a lattice L and a vector v, find the point of L closest to v.
Lastly, the Shortest Independent Vector Problem is defined as: given a lattice $L \in \mathbb{Z}^{n \times n}$, find n linearly independent vectors in L that minimize the norm of the matrix $V = [v_0, \dots, v_{n-1}]$.
Currently, no known algorithm solves any of these problems efficiently.
One of the best-known algorithms is Lenstra–Lenstra–Lovász (LLL), but the complexity of attacks using LLL is $2^{\Theta(n \log\log n / \log n)}$ [6], so these attacks are not feasible.
Multivariate cryptography. As described in Multivariate Public Key Cryptography: ”A multivariate public key cryptosystem has a set of (usually) quadratic polynomials over a finite field as a public map” [7].
Its security relies on the hardness of solving systems of nonlinear equations over a finite field, a problem that is NP-hard.
Isogeny-based cryptography. This field uses isogenies between (usually supersingular) elliptic curves as its main tool to construct algorithms.
As of today, no known algorithm can find an isogeny between two elliptic curves in polynomial time, which is the basis for the security of isogeny-based algorithms.
1.3 NIST contest
The National Institute of Standards and Technology launched a contest in early 2017 in order to find a family of quantum resistant cryptosystems for digital signatures, public-key encryption and key-establishment algorithms.
The first round ended in November 2017 with sixty-nine proposals and in January 2019, the second round was announced with twenty-six contenders remaining. All these algorithms are summarized in the following figure:
Figure 2: Second round submission according to the cryptographic field, https://en.wikipedia.org/wiki/Lattice_(group)
As shown in the figure above, the split is seventeen PKE/KEM schemes for only nine signature schemes (DSA). Furthermore, four of the contestants are mergers of round-one proposals.
The main point of this paper is to take one of these algorithms and to implement it in an optimized fashion.
2 Choices made for the implementation
2.1 Programming language
First, the programming language had to be chosen from a fixed pool: the languages I already know. I decided not to try a new language because of the limited time available for this work, which left me with the following options:
• C, C++
• Java
• JavaScript
• Scala
• Python
With this list made, the first option was C or C++. Indeed, these languages allow direct memory access, which facilitates low-level operations on bytes.
Unfortunately, nearly all the submissions for the contest came with the mandatory implementation done in C. Furthermore, many of them already proposed an optimized version of their algorithm. Since the main point of this work is the implementation and optimization of an algorithm, that work was already done for the C programming language.
JavaScript and Python were not chosen because I considered these languages too high-level to allow an optimization relevant enough for this work.
So the choice was between Java and Scala. Both have good points, but in the end I selected Scala for the following reasons:
• It runs on the JVM, so we can use Java libraries and the code can run on every computer that has Java installed.
• Java's new policy of charging for Java SE updates makes Scala more interesting: I believe the use of Scala will grow in the future, and with it the need for good libraries.
• I felt more confident using Scala than Java.
2.2 Algorithm
After reading all the submissions for the contest, I had to choose the algorithm I wanted to implement. In order to make a wise choice, I divided the task in two:
• The choice of the algorithm category
• The choice of the algorithm
The first part allowed me to better understand the contest environment and to choose a coherent pool of algorithms, while the second part allowed me to compare the algorithms with one another and to find the most promising one to implement.
2.2.1 Category of the algorithms
In order to make a choice, I looked for a categorization of the submissions according to their respective cryptographic fields.
As we can see in the summary table of the contest algorithms (Figure 2), out of the six categories, the lattice-based and code-based categories gather most of the submissions. It seemed reasonable to choose between those two.
My final choice was lattice-based cryptography. This was motivated by the fact that lattice-based cryptography offers a trade-off between public key size and ciphertext size, compared to code-based cryptography, where the ciphertext is small but the public key is bigger [8]. Furthermore, the size of the public key in code-based cryptography can be problematic, for example because it facilitates DoS attacks or produces packets that become too big for TLS [9].
Finally, with lattices, I still had the option of implementing a signature algorithm.
2.2.2 Round5
For my first implementation, I wanted an encryption algorithm. I settled on Round5 for the following reasons:
• NTRU and NewHope already had a Java implementation in the BouncyCastle library. This means any Scala developer can use the Java implementation in their code, and I wanted to implement an algorithm that had not yet been made available for Scala.
• The Round5 documentation seemed very well made, with about 100 pages. In addition to the theoretical specification, there is a full description of a C implementation of the algorithm.
• The Round5 implementation is very flexible. Indeed, with a single implementation, a user can choose between all the security levels required by the NIST contest. This means less maintenance work and more flexibility when used in production.
• After the merger of Round2 and HILA5, some decryption problems were found and Round5 had to change some parts of the implementation and the documentation. This means the algorithm has already been checked in more detail than others, so there is a lower probability of finding errors in the new Round5.
• Round5 and Saber were the only algorithms to use Learning With Rounding, which makes both of them stand out compared to the usual algorithms based on Learning With Errors.
• The algorithm was made by employees of different companies, including Philips and Cisco. If Round5 obtains a good result in the contest, I believe this can make a difference in the popularity of the algorithm compared to algorithms that are only developed at universities.
3 Algorithms
3.1 Round 5
Round 5 is a proposal for the second round of the NIST post-quantum contest that results from a merger of two round-one contestants, HILA5 and Round 2.
It consists of two algorithms, r5_cpa_kem for the key encapsulation mechanism and r5_cca_pke for the public key encryption scheme.
These algorithms are based on the General Learning With Rounding (GLWR) problem. More precisely, the user has the choice between the Learning With Rounding problem and the Ring Learning With Rounding problem, depending on a parameter of the algorithm.
Furthermore, Round 5 uses in some cases an error-correcting code based on HILA5 in order to reduce the decryption failure probability, leading to smaller key sizes and a faster program.
3.1.1 The General Learning With Rounding Problem
In Round 5, the GLWR problem can be separated into two cases.
Learning With Rounding. It is used when the parameter n = 1. This means we have only one coefficient per polynomial and the GLWR problem is a Learning With Rounding problem.
Based on the definition found in Learning with Rounding, Revisited, the LWR problem is: ”[..] We released a deterministically rounded version [of sample] $\langle a, s \rangle \in \mathbb{Z}_q$. In particular, for some $p < q$, we divide up the elements of $\mathbb{Z}_q$ into $p$ contiguous intervals of roughly $q/p$ elements each and define the rounding function $\lfloor \cdot \rfloor_p : \mathbb{Z}_q \to \mathbb{Z}_p$ that maps $x \in \mathbb{Z}_q$ into the index of the interval that $x$ belongs to. The LWR assumption states that:
$(A, \lfloor A \cdot s \rfloor_p)$ is computationally indistinguishable from $(A, \lfloor u \rfloor_p)$” [10]
Ring Learning With Rounding. If the parameter n > 1, then the GLWR problem is an instance of RLWR. Ring Learning With Rounding is based on the same assumptions as LWR, except that the samples are not elements of $\mathbb{Z}_q$ but elements of a polynomial ring over $\mathbb{Z}_q$.
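The rounding function from the quoted definition can be sketched in a few lines. This is a minimal illustration of the definition above, not the rounding used in Round5 itself (whose specification adds its own rounding constants); the object and method names are hypothetical.

```scala
object LwrRounding {
  // Maps x in Z_q to the index (in Z_p) of the interval of about q/p
  // consecutive elements that contains x, i.e. floor(x * p / q).
  def roundToP(x: Int, q: Int, p: Int): Int = {
    require(p < q && 0 <= x && x < q, "expects p < q and x in [0, q)")
    ((x.toLong * p) / q).toInt
  }

  // Inner product <a, s> reduced into [0, q); secret entries may be negative.
  def innerProduct(a: Seq[Int], s: Seq[Int], q: Int): Int =
    a.zip(s).foldLeft(0L) { case (acc, (ai, si)) =>
      Math.floorMod(acc + ai.toLong * si, q.toLong)
    }.toInt

  // One LWR sample: the rounded inner product.
  def lwrSample(a: Seq[Int], s: Seq[Int], q: Int, p: Int): Int =
    roundToP(innerProduct(a, s, q), q, p)
}
```

For example, with q = 16 and p = 4, the inner product of (3, 5) and (1, -1) is -2, which is 14 modulo 16, and 14 falls into the interval with index 3.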
3.1.2 Error-correcting code
Depending on the value of the parameter f, Round 5 uses an error-correcting code to correct up to 5 errors during a transmission. The code used is XEf and is defined in section 2.4.1 of the official documentation [11].
3.1.3 Algorithms
Parameters. The following table lists all the symbols used by the Round 5 algorithms:
Figure 3: Symbols for Round 5, [11] p.15
The following pages present the algorithms used by Round 5.
r5_cpa_pke. These algorithms are the internal building blocks used in both r5_cpa_kem and r5_cca_pke.
Figure 4: r5_cpa_pke algorithms, [11] p.17
Figure 5: r5_cpa_pke algorithms, [11] p.17
Figure 6: r5_cpa_pke algorithms, [11] p.17
r5_cpa_kem. These next algorithms form an IND-CPA secure key encapsulation mechanism.
Figure 7: r5_cpa_pke algorithms, [11] p.18
Figure 8: r5_cpa_pke algorithms, [11] p.18
Figure 9: r5_cpa_pke algorithms, [11] p.18
r5_cca_kem. These are the internal building blocks for the r5_cca_pke algorithms.
Figure 10: r5_cpa_pke algorithms, [11] p.19
Figure 11: r5_cpa_pke algorithms, [11] p.19
Figure 12: r5_cpa_pke algorithms, [11] p.19
r5_cca_pke. The following algorithms form an IND-CCA secure public key encryption scheme.
Figure 13: r5_cpa_pke algorithms, [11] p.19
Figure 14: r5_cpa_pke algorithms, [11] p.19
Figure 15: r5_cpa_pke algorithms, [11] p.19
4 Implementation
4.1 Round 5
4.1.1 Additions to the reference implementation
The BitString class. In order to represent polynomial coefficients as specified in the documentation, I had to create a BitString class. BitStrings in Round5 are strings in which each bit can be accessed individually, and they are represented in little-endian order. This means, for example, that the value 13 is seen as ”1011” as a BitString.
For the implementation of this class, I chose a list of Booleans to match the zeros and ones of a BitString. A Boolean conceptually holds one bit and maps easily to a character of the string (true=1, false=0). I expected this to save memory compared to a String, which is a sequence of 16-bit chars.
This class comes with a companion object that allows any Int to be transformed into a BitString.
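The little-endian representation described above can be sketched as follows. This is a minimal illustration and not the BitString class of this project; the object and method names are hypothetical.

```scala
object BitStringSketch {
  // Little-endian bit list: index 0 holds the least significant bit.
  def fromInt(value: Int, width: Int): List[Boolean] =
    List.tabulate(width)(i => ((value >> i) & 1) == 1)

  // Inverse conversion: sets bit i of the result when bits(i) is true.
  def toInt(bits: List[Boolean]): Int =
    bits.zipWithIndex.foldLeft(0) { case (acc, (bit, i)) =>
      if (bit) acc | (1 << i) else acc
    }

  // Textual form, in the same order as the list (true=1, false=0).
  def show(bits: List[Boolean]): String =
    bits.map(b => if (b) '1' else '0').mkString
}
```

`BitStringSketch.show(BitStringSketch.fromInt(13, 4))` yields "1011", the example given above, and `toInt` recovers 13 from that list.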
The Polynomial class. In the reference implementation, all the matrices are filled with polynomial coefficients, and the polynomials themselves are defined only by one parameter, the maximal degree of the polynomial.
This means polynomials are just a sequence of numbers in a row of a matrix.
Scala being a more object-oriented language than C, I decided to create a class in order to have actual polynomials.
This class stores the coefficients of the polynomial, the ring to which the polynomial belongs ($x^{n+1} - 1$ or $(x^{n+1} - 1)/(x - 1)$) and the modulus of the coefficients.
4.1.2 Problems encountered during the implementation
Differences between the versions of the documentation. The first few weeks of implementation were based on the latest draft of the Round5 documentation, dated December 2018. On 30 March 2019, the official version for the second round was released and I had to redo parts of my code because some small clarifications had been added to the documentation.
Documentation for C language. The main documentation for the algorithm uses types like bit strings that are not available in Scala. Furthermore, the types used in the paper are all little-endian, so I had to create a BitString class where I originally thought I could use the Short type to store the 16-bit numbers.
The other big difference is that in Scala we have structures such as arrays to store objects, where in C you have pointers. So it took me quite some time to understand how to modify the functions to use data structures of variable dimensions instead of a pointer.
The BitString class. I made a BitString class, but it turned out to be too slow for the purpose of the program. Every time I needed the value of a BitString, I had to go through the whole list of Booleans to map them to zeros and ones, reverse the list to get big-endian order, and then parse it as an Integer. This operation had to be done twice for an addition or a multiplication of two BitStrings, and then the reverse conversion was needed to store the Integer value as a BitString.
Since all the coefficients of the polynomials were BitStrings, each multiplication of two matrices required millions of these operations, and the time needed was far too long.
To give an example, roughly two million operations took 10 minutes, compared to a few milliseconds in the C implementation. This made me decide to change the polynomial coefficients to Chars, which in Scala are unsigned 16-bit values, and to serialize the polynomial coefficients to BitStrings only when needed.
By doing so, I could do in a few seconds the work previously done in 10 minutes, but I needed to change many of my methods to take Chars instead of BitStrings.
Representation of negative bytes. For the packing and unpacking methods, where I needed to pack bytes, I did not know that a negative byte is widened to a sign-extended Integer. This means that when I called toBinaryString on a Byte, I got the 32 bits of an Integer and not the 8 bits of a Byte. This left me with a BitString that was too big after packing and a wrong result when unpacking.
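The effect can be reproduced with `java.lang.Integer.toBinaryString`. The sketch below is an assumption about how the issue can be avoided (masking with 0xFF before converting), not the exact code of this project; the object and method names are hypothetical.

```scala
object ByteBits {
  // Widening a negative Byte to Int sign-extends it, so the binary string
  // of a negative byte has 32 digits instead of 8.
  def naive(b: Byte): String = Integer.toBinaryString(b.toInt)

  // Masking with 0xFF keeps only the low 8 bits; the result is then padded
  // on the left with zeros so that it always has exactly 8 digits.
  def eightBits(b: Byte): String = {
    val s = Integer.toBinaryString(b & 0xFF)
    ("0" * (8 - s.length)) + s
  }
}
```

For the byte -1, `naive` returns a string of 32 ones while `eightBits` returns "11111111"; for the byte 5, `eightBits` returns "00000101".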
Differences between the documentation and the reference implementation. During debugging, I noticed some differences between the theoretical description in the reference manual and the reference implementation in C.
The first one concerned the function permutation_tau_2 called in create_A. In the documentation, the method used in permutation_tau_2 is drgb_sample_2 with range q, but in the C files the range is len_tau_2. This gave me wrong values in my matrix A for parameter tau = 2.
The lack of clarity on the choice of the ring parameter. Round5 relies on two different rings, depending on the parameters of the application.
The ring is either
$\Phi_{n+1}(x) = \frac{x^{n+1} - 1}{x - 1}$
or
$N_{n+1}(x) = x^{n+1} - 1$
The only difference between the rings is in the function mult_matrix. If the ring is $\Phi_{n+1}(x)$, polynomials must be lifted before the multiplication and then unlifted, while in the ring $N_{n+1}(x)$ we just multiply both polynomials.
In the Technical Specification of Reference Implementation chapter of the official documentation, it is not explained when to choose $N_{n+1}(x)$ over $\Phi_{n+1}(x)$.
In the official implementation, the default value was $\Phi_{n+1}(x)$ for the key generation, so I created all my polynomials in this ring by default.
The problem was that the ring $N_{n+1}(x)$ is used for ciphertext computation and decryption, and only if the Round5 variant uses error correction.
This means I only saw my mistake when I debugged the 5d variants of Round5, and it took me quite some time to understand that the fault was not in the error-correcting code but in the choice of the ring.
Scala functions not working as expected. The function getBytes of String did not do what I expected. In my case, I had a String of 16 Chars, each of them holding only an 8-bit value. I therefore expected getBytes to return a 16-byte array, but the result was the following:
Figure 16: getBytes vs map
This means I had to replace all my getBytes calls by map(_.toByte).
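The difference can be illustrated with a small sketch. It assumes the UTF-8 charset, passed explicitly here; the charset actually in effect on the machine used for this work is not stated, so the figure above may show different bytes. The object name is hypothetical.

```scala
import java.nio.charset.StandardCharsets

object GetBytesVsMap {
  // A String of 16 Chars, each holding the 8-bit value 0xE9.
  val s: String = (0xE9).toChar.toString * 16

  // getBytes encodes the characters with a charset; in UTF-8 every Char
  // above 0x7F takes two bytes, so the array is longer than the String.
  val encoded: Array[Byte] = s.getBytes(StandardCharsets.UTF_8)

  // Mapping each Char to a Byte keeps exactly one byte per character.
  val truncated: Array[Byte] = s.toCharArray.map(_.toByte)
}
```

Here `encoded` has 32 elements while `truncated` has 16, one per Char.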
4.1.3 Optimization
Result of the unoptimized program. In order to measure the improvement of my implementation, I decided to use the time needed to pass a KAT (Known Answer Test). This choice was made because it is a strong indicator of the time the algorithm takes. Each KAT run executes the full Round5 program either 75 or 100 times with different inputs, so no cached values can skew the measured time.
As for the algorithms, I will use all the Round5 variants for NIST level 1.
These correspond to the smallest parameter sizes, which minimizes the time the KAT tests take.
The results of the unoptimized implementation are in the following table:
Algorithm        Time to pass a KAT [ms]
R5N1_1KEM_0d     468480
R5ND_1KEM_0d     7941
R5ND_1KEM_5d     6333
R5N1_1PKE_0d     749999
R5ND_1PKE_0d     10155
R5ND_1PKE_5d     9962
Figure 17: Time to pass the KAT test for the unoptimized program
As shown in the table above, the KAT runs are very slow: the R5N1 variants take more than 7 minutes when they should only take a few seconds, so there was a lot of work to be done.
Profiler. In order to optimize my implementation, I had to know the execution flow of my application and which functions were slow. For that, there are tools called profilers that analyze a program at runtime and provide data about it. In my case, the data were the time spent in each function over the entire run and the percentage of the total execution time it represented.
I had the choice between a few profilers for the JVM and selected two of them.
The first one was JProfiler. It was the most recommended during my research, but the problem with this profiler was the overhead it adds to the program. As an example, the R5N1_1KEM_0d variant takes about six seconds to run, but with the profiler on it takes about 20 minutes. This is the reason why I chose another, more lightweight profiler for the beginning of the optimization.
My second choice was YourKit. It was also strongly recommended, and the program ran much faster than with JProfiler. The problem with YourKit is that it did not show data such as the number of times a method is called, so I used it only for the first part of the profiling and switched to JProfiler when I needed more details.
Hardware. Since the running time of the application can change with the hardware, these are the specifications of the machine I used for the profiling:
• i7-7740X @ 4.3 GHz
• 16 GB RAM
• Windows 10 Education
Results of the unoptimized profiling. The following images are the results of the first profiling of the Scala implementation.
Figure 18: R5ND_1PKE_0d keygen
Figure 19: R5ND_1PKE_0d decryption
Pack_ct and pack_pk. The problem with pack_ct and pack_pk was that my BitString class used a List of Booleans. In Scala, List is an immutable class, which means that every time I append an element to a List, a new List is created with the added element at the end, and I then have to replace the reference to the previous List with the new one.
Furthermore, the append operation on a List has O(n) complexity [].
My decision was to keep List in my BitString class and to use the ListBuffer class to build the list of Booleans. This class is mutable, so the list can grow easily, and the append and prepend operations are O(1).
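The two ways of building the list can be compared with the following sketch. The bit pattern and the names are only illustrative and are not taken from the packing code of this project.

```scala
import scala.collection.mutable.ListBuffer

object BuildBits {
  // Appending to an immutable List copies the existing elements each time,
  // so building a list of n elements this way costs O(n^2) in total.
  def withList(n: Int): List[Boolean] = {
    var bits = List.empty[Boolean]
    for (i <- 0 until n) bits = bits :+ (i % 2 == 0)
    bits
  }

  // ListBuffer appends in constant time; toList converts once at the end.
  def withBuffer(n: Int): List[Boolean] = {
    val buf = ListBuffer.empty[Boolean]
    for (i <- 0 until n) {
      val bit = i % 2 == 0
      buf += bit
    }
    buf.toList
  }
}
```

Both methods return the same List; only the cost of building it differs.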
BitString class. Once this was done, I profiled the R5ND_1PKE_0d variant of Round5 again and got the following results:
Figure 20: R5ND_1PKE_0d after adding ListBuffer
As the results show, the functions BitStringToString and toByteArray of my BitString class took more than 50% of the decapsulate method, and in total those two methods used 55% of the total time of the application.
Looking in more depth with the profiler, the problem was the drop method called by the slice function.
Figure 21: R5ND_1PKE_0d slice issues
It was really surprising to me because slice is a native function of the Scala language. This means the problem was not the method I used but the algorithm in which I used it.
Indeed, slice(n, n+1) is the same as drop(n) followed by take(1), and since both of these methods are implemented with a while loop, the number of operations needed to get the sublist is always n+1.
Both of my methods use the same slicing pattern, which is the following:
for i ← 0 until n do
    list.slice(i * const, (i + 1) * const)
end
This means that every time I call slice, the take part only goes through a fixed number of elements, but the drop part grows by const elements at each iteration. The complexity of such an algorithm is $n \cdot const + \sum_{i=0}^{n} i \cdot const$, which is $O(n \cdot const + const \cdot n(n+1)/2)$ and that equals $O(n^2)$.
So the decision I made was to implement a tail-recursive method for both toByteArray and BitStringToString, with the following pseudocode as its core:
def loop(xs ← List[Boolean], acc ← ListBuffer[Boolean], result ← ListBuffer[Any], wanted_length ← Int):
    if xs is not Nil:
        if length of acc equals wanted_length:
            loop(tail xs, new acc containing head xs, add acc to result, wanted_length)
        else:
            loop(tail xs, add head xs to acc, result, wanted_length)
    else:
        return (add acc to result).toList
With this function, the list is traversed only once, which is O(n), and the tail and head operations on xs as well as the additions to acc and result all take constant time. This means the algorithm has a total complexity of O(n). Since it is tail-recursive, I could also add the @tailrec annotation, which makes the compiler check that the recursion is compiled into a loop.
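A runnable version of this idea could look like the sketch below. It is a reconstruction under assumptions, not the code of this project: it only splits the list into chunks, whereas toByteArray and BitStringToString also convert each chunk, and the names are hypothetical.

```scala
import scala.annotation.tailrec
import scala.collection.mutable.ListBuffer

object Chunking {
  // Splits bits into consecutive chunks of `size` elements in one pass.
  def chunks(bits: List[Boolean], size: Int): List[List[Boolean]] = {
    require(size > 0, "chunk size must be positive")

    @tailrec
    def loop(xs: List[Boolean],
             acc: ListBuffer[Boolean],
             result: ListBuffer[List[Boolean]]): List[List[Boolean]] =
      xs match {
        case Nil =>
          // Flush the last (possibly shorter) chunk.
          if (acc.nonEmpty) result += acc.toList
          result.toList
        case head :: tail if acc.length == size =>
          // Current chunk is full: store it and start a new one with head.
          result += acc.toList
          loop(tail, ListBuffer(head), result)
        case head :: tail =>
          acc += head
          loop(tail, acc, result)
      }

    loop(bits, ListBuffer.empty, ListBuffer.empty)
  }
}
```

The standard library method `grouped` produces the same chunks, so `Chunking.chunks(bits, 8)` equals `bits.grouped(8).toList`; the explicit loop is shown only because it mirrors the pseudocode above.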
Unpack_ct and unpack_pk. I noticed that the methods unpack_ct and unpack_pk had the same issue as toByteArray, so I made the exact same changes for these two.
Mult_matrix. Once again, I profiled the R5ND_1PKE_0d variant and got the results below:
Figure 22: R5ND_1PKE_0d after the addition of tail recursion
As shown above, the decryption was more than two times faster and the time used by the BitString methods went from more than 20% to only 10%. The next target for optimization was the encrypt and decrypt functions, and more precisely the mult_matrix method, as we can see in the following results for the encrypt method:
Figure 23: R5ND_1PKE_0d encrypt function after the addition of tail recursion
The mult_matrix function takes about 50% of the total time of the encrypt function and is also a big part of the keygen method. In total, it represents about 23% of the total time of the application.
The only part worth optimizing is the mult_poly_ntru method, the other functions taking close to no time to run, as shown below:
Figure 24: R5ND_1PKE_0d details of mult_matrix
Round5 has two different approaches to mult_poly_ntru, depending on the size of the polynomials.
The first case is when the length of the polynomials is one. The mult_poly_ntru of polynomials a and b is then computed as follows:
We lift a:
$a = \mathrm{lift}(a_0) = \{-a_0, 0\}$
Then we multiply it by b:
$c = \{-a_0, 0\} \times \{b_0, 0\} = \{-a_0 b_0, 0\}$
And we unlift:
$c = \mathrm{unlift}(\{-a_0 b_0, 0\}) = -a_0 b_0$
This means that mult_poly_ntru is equal to $-a \times b$, which is just a normal coefficient multiplication between two matrices.
That is the reason why the mult_poly_ntru of mult_matrix can be replaced by:
$\mathrm{mult\_matrix}(A, B) = A_{lift\_coeff} \times B_{coeff}$
where $A_{lift\_coeff}$ is the matrix of all the coefficients of the polynomials in matrix A multiplied by −1 and $B_{coeff}$ is the matrix of all the coefficients of the polynomials in matrix B.
The benefit of creating such matrices is that I could use linear algebra libraries that are optimized for matrix multiplication. I decided to use the MTJ-N library because it was the fastest implementation for multiplication of matrices of size bigger than 20 [12].
The second case is when the polynomials are longer than one. In that context, mult_poly_ntru multiplies two polynomials, and the algorithm I use is the following:
for i ← 0 until length of a do
    for j ← 0 until length of b do
        poly((i + j) modulo length of a) += a(i) * b(j) modulo q
    end
end
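The double loop above can be written in Scala as in the following sketch. It assumes that both polynomials have the same length N, that the coefficients are non-negative and that the product is reduced modulo $x^N - 1$ (the index wraps around); the lifting and unlifting steps are left out, and the names are hypothetical.

```scala
object PolyMul {
  // Schoolbook product of a and b, with indices taken modulo the length
  // of the polynomials and coefficients reduced modulo q.
  def multCyclic(a: Array[Int], b: Array[Int], q: Int): Array[Int] = {
    require(a.length == b.length, "polynomials must have the same length")
    val n = a.length
    // Long accumulators avoid overflow of the intermediate products.
    val c = new Array[Long](n)
    for (i <- 0 until n; j <- 0 until n) {
      val k = (i + j) % n
      c(k) = (c(k) + a(i).toLong * b(j)) % q
    }
    c.map(_.toInt)
  }
}
```

For example, multiplying the coefficient arrays (1, 2, 3) and (4, 5, 6) with q = 16 gives (15, 15, 12).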
The complexity of such algorithm is O(N2). The problem is that I didn’t find a way to optimize this part with a faster algorithm. I found two different algorithm that are in O(N log(N )) and both being not usable for my case.
The first one was Number-theoretic transform (NTT), but the requirement for the usage of this algorithm is that the polynomials are in Zqnwhere n is a power of two and q ≡ 1 mod 2n. In Round5, q < 2n for all ring variants and never equals to one so the requirement is never reached.
The second one was the Fast Fourier Transform (FFT), which can be used to calculate the product of two polynomials with the following algorithm [13]:
def mult_poly(a ← coefficients of polynomial a, b ← coefficients of polynomial b):
    FFT_A ← FFT(a)
    FFT_B ← FFT(b)
    for i ← 0 until length of a do
        FFT_C(i) = FFT_A(i) * FFT_B(i)
    end
    Result = IFFT(1/length of a * FFT_C)
I tried to use the Apache Commons library to implement this algorithm, but it did not work with Round5, although it worked in a small test in a Scala worksheet. The problem is probably the one explained by eh9 on Stack Overflow [14]: because my modulus is a power of two, not all of my elements are multiplicatively invertible. This probably leads the FFT to return a wrong result.
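One way to see the invertibility obstacle is to search for modular inverses by brute force. The sketch below is an illustration only, with assumed names (InvertibleCheck, inverseMod); it is not part of the Round5 code and is far too slow for real use.

```scala
object InvertibleCheck {
  // Hypothetical helper: returns the smallest x in [1, q) such that
  // a * x is congruent to 1 modulo q, or None when a has no inverse.
  def inverseMod(a: Int, q: Int): Option[Int] =
    (1 until q).find(x => (a.toLong * x) % q == 1)
}
```

Modulo 16, the odd value 3 has the inverse 11 (3 * 11 = 33 = 2 * 16 + 1), whereas the even value 2 has none, because every multiple of 2 stays even modulo a power of two. In particular, a step such as the division by the length of a in the algorithm above has no exact counterpart modulo q when that length is even.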
This means I could only optimize mult_matrix for the R5N1 variants of Round5.
Micro-optimization. Once at this point, I ran another profiling and could not find another big optimization to make in the application. As you can see in the profiling results in Appendix A, all functions except mult_matrix spend most of their time in O(1) operations such as prepend or append, which means I now had to find micro-optimizations in order to reduce the execution time.
I found the following three improvements:
• Replace a % n by a & (n-1). This is possible because in our case n is always a power of two, and it makes the modulo operation about twice as fast [15].
• In the DRBG sample methods, I replaced the operation
BigInt(array.reverse).toChar
by
(((array(1).toInt << 8) & 0xFF00) + (array(0) & 0xFF))
This allows me to create a Char from a byte array without having to reverse it and without creating an intermediate object.
• I replaced all the toChar calls on intermediate calculations with a 0xFF mask in order to avoid function calls.
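The first two replacements can be compared side by side in a small sketch. The object and method names (MicroOpt, modPow2, charViaBigInt, charViaShift) are illustrative and do not come from the implementation; a two-byte little-endian array is assumed.

```scala
object MicroOpt {
  // a % n rewritten as a mask; valid only when n is a power of two and a >= 0.
  def modPow2(a: Int, n: Int): Int = a & (n - 1)

  // Original form: reverses the array and goes through a BigInt object.
  def charViaBigInt(array: Array[Byte]): Char = BigInt(array.reverse).toChar

  // Replacement: combines the two little-endian bytes with shifts and masks,
  // without reversing the array or allocating an intermediate object.
  def charViaShift(array: Array[Byte]): Char =
    (((array(1).toInt << 8) & 0xFF00) + (array(0) & 0xFF)).toChar
}
```

Both Char conversions agree even when the bytes are negative, because the masks discard the sign extension introduced by toInt. Note that for a negative a the mask returns a value in [0, n), whereas the % operator on the JVM would return a negative remainder.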
Results of the optimization. The following figure shows the improvements of the algorithm through the process of optimization.
Figure 25: Comparison table between the different versions of the Scala program
As we can see, the optimization levels reached for R5ND and R5N1 are very different. Indeed, the optimized implementation is now substantially faster for R5N1, but only 3 times faster for R5ND.
This comes from the fact that I could not find an optimization for the polynomial multiplication in R5ND, whereas I managed to use a near-optimal library for the matrix multiplication in R5N1.
4.1.4 Comparison with C implementations.
In this part, I will compare the Scala implementation I made with the reference and optimized implementations proposed by the Round5 team.
Hardware. All the benchmarks were done with the following hardware:
• i7-7500U @ 2.7 GHz
• 8 GB DDR4 RAM, 2633 MHz
• Linux Mint 19.1
I chose to compare the variants of Round5 depending on the algorithm used, whether it is ring-based or not, and finally whether it uses an error-correcting code.
R5N1 KEM. The first comparison is between the non-ring KEM variants.
Figure 26: Comparison between R5N1 KEM variants
As we can see in the figure above, the Scala implementation I made is five to ten times faster than the reference implementation but ten to twenty times slower than the optimized one. Furthermore, I noticed several interesting points.
Even if Scala is five to six times faster than the C reference, the decapsulate part is always at least three times slower.
In Scala, the time used for the encapsulation is about twice the time of the key generation, while in C either they take the same time or the encapsulation is faster than the key generation.
R5ND KEM. This part is about the ring-based KEM without error-correcting code.
Figure 27: Comparison between R5ND KEM variants
I noticed that the Scala implementation takes about the same time as, or is slower than, both C implementations. The interesting points of this graph are the following:
• As the size of the parameters grows, the difference between the reference and the Scala implementation shrinks: from taking 1.5 times longer with small parameters to being 20% faster with the largest parameter sizes.
• Compared to the R5N1 variants, the proportion of time spent in each function is the same in Scala and in C.
R5ND KEM 5d. This paragraph concerns the ring-based variants of Round5 with error-correcting code.
Figure 28: Comparison between R5ND KEM 5d variants
Compared to both previous KEM variants, Scala is always slower than C here. As in R5ND, the time gap shrinks as the size of the parameters increases, but Scala remains at best 20% slower than the reference implementation.
This is due to the decapsulate method, which always takes twice the time of the keygen function when both should take the same time.
R5N1 PKE. With this part, the algorithm used switches to the PKE, and more precisely to the non-ring variant.
Figure 29: Comparison between R5N1 PKE variants
The results of R5N1 PKE are very similar to those of R5N1 KEM. Indeed, Scala is five times faster than the reference C implementation but twenty times slower than the optimized C.
Furthermore, the problem of the decapsulate method taking too much time is also present. This time, however, the encapsulation is also slower than it should be if we refer to C. Indeed, the three functions should take the same time, but on average the keygen takes half the time of the other methods.
R5ND PKE. This comparison concerns the ring-based PKE without error-correcting code.
Figure 30: Comparison between R5ND PKE variants
We can see that the results of R5ND PKE look the same as those of R5ND KEM. Indeed, in all the variants, Scala is slower than C. The time gap goes from ten to twenty percent slower than the reference implementation to twenty to thirty times slower than the optimized program.
However, the percentage of time spent in each function is the same in Scala and in C.
R5ND PKE 5d. The last comparison is about the ring-based PKE with error-correcting code.
Figure 31: Comparison between R5ND PKE 5d variants
As the graph shows, the R5ND PKE 5d variants behave the same as R5ND PKE. All the Scala variants are slower than C, but the percentage of time used by each function is pretty much the same.
General comparison. As the last comparison, I wanted to have a summary of the performance of Scala for each category of Round5 algorithm.
The results are in the following table.
Category       Scala (ms)   C ref (ms)   C opti (ms)   Scala / C ref   Scala / C opti
R5N1 KEM       358.86       3204.94      30.05         0.11            11.94
R5N1 PKE       424.34       1532.35      17.34         0.28            24.47
R5ND KEM       20.09        17.49        0.85          1.15            23.65
R5ND PKE       41.04        32.27        0.95          1.27            43.24
R5ND KEM 5d    24.26        16.07        0.66          1.51            36.64
R5ND PKE 5d    41.23        28.23        0.89          1.46            46.09
Figure 32: Mean time to run and comparison between Scala and C implementations
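The two ratio columns of Figure 32 follow directly from the three timing columns. As a small worked check, the sketch below recomputes them for three of the categories; the object name Ratios and its members are illustrative, and the timings are copied from the figure.

```scala
object Ratios {
  // Mean running times in ms (Scala, C reference, C optimized) from Figure 32.
  val timings: Map[String, (Double, Double, Double)] = Map(
    "R5N1 KEM" -> (358.86, 3204.94, 30.05),
    "R5N1 PKE" -> (424.34, 1532.35, 17.34),
    "R5ND KEM" -> (20.09, 17.49, 0.85)
  )

  // Returns (Scala / C ref, Scala / C opti) for one category.
  def ratios(category: String): (Double, Double) = {
    val (scalaMs, cRefMs, cOptMs) = timings(category)
    (scalaMs / cRefMs, scalaMs / cOptMs)
  }
}
```

A ratio below 1 means that Scala is faster than the C implementation it is compared with, which in this table only happens against the reference implementation of the R5N1 variants.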
As already shown in the other comparisons, the Scala implementation cannot compete with the optimized C implementation. Indeed, the performance gap is too large and can be explained by the following factors:
• The object-oriented design of the Scala program, which brings an overhead of operations. The C program only uses uint16_t pointers, whereas the Scala program often needs to switch from Bytes to Chars and from arrays of Char to arrays of Polynomial, leading to the creation of numerous intermediate objects.
• The little-endianness of the program, which forced the usage of a BitString class in order to pack and unpack the data correctly, while in C the usage of inline functions allows the compiler to get rid of part of the overhead from the little-endianness.
• An optimization pushed further in C than in Scala.
However, it is impossible to be as good as the optimized C implementation on the JVM. I noticed that it takes more time for the R5ND PKE 0d function to make its calls to the Cipher.getInstance method than to run an entire optimized C implementation (see Appendix A, Figure 4).
With this in mind, I think the performance of the Scala implementation is not as bad as it seems in the above table. The most time used by a ring variant is 57 ms on average, which makes it usable in any program.
The issue comes from the R5N1 variants, which can take up to 750 ms for R5N1 5PKE 0d. In this case, it is wiser to only use the small parameters, which only take 67 ms for the KEM and 200 ms for the PKE.
4.1.5 Future improvements.
In order to improve the Scala implementation of Round5, the following two points have to be addressed:
• Change the algorithm of mult_poly_ntru for the ring-based variants. The function is the same as in the unoptimized version and has a complexity of O(N^2).
• Find the reason why the decapsulate and encapsulate functions take proportionally more time than in C.
4.1.6 Code
All the code can be found on https://github.com/shinopill/TB
5 Conclusion
The implementation proposed in this work fulfilled a good number of the objectives, the final implementation being sufficiently optimized for common usage.
However, because of the difficulties encountered during the basic implementation of the algorithm, the remaining time was not sufficient to reach an optimization on par with the one proposed by the Round5 team.
That being said, the improvement between the basic Scala implementation and the optimized version remains significant, the latter being up to fifty times faster than the first version. That is why I believe I was able to propose a quality implementation of the Round5 algorithm.
6 Authenticity
I, Florent Piller, hereby declare that I am the legitimate and sole author of this document and that no sources outside of those cited in the bibliography were used to write it.
Date Signature
References
[1] “Shor’s algorithm,” page version ID: 894865823. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Shor%27s_algorithm&oldid=894865823
[2] M. Mosca, “Cybersecurity in an era with quantum computers: will we be ready?” [Online]. Available: https://eprint.iacr.org/2015/1075.pdf
[3] P. Schwabe, “The transition to post-quantum cryptography,” p. 72. [Online]. Available: https://cryptojedi.org/peter/data/nancy-20180219.pdf
[4] “Grover’s algorithm,” page version ID: 897237747. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Grover%27s_algorithm&oldid=897237747
[5] P. Nguy, “Lattice-based cryptography,” p. 95.
[6] E. L. Antonsen, “Lattice-based cryptography - a comparative description and analysis of proposed schemes.” [Online]. Available: https://pdfs.semanticscholar.org/47ad/e44f3b8beb64042ca99b2532cff5b84d8727.pdf
[7] J. Ding and B.-Y. Yang, “Multivariate public key cryptography,” in Post-Quantum Cryptography, D. J. Bernstein, J. Buchmann, and E. Dahmen, Eds. Springer Berlin Heidelberg, pp. 193–241. [Online]. Available: http://link.springer.com/10.1007/978-3-540-88702-7_6
[8] T. Lange, “Code-based cryptography.” [Online]. Available: https://www.hyperelliptic.org/tanja/vortraege/aim-qc.pdf
[9] D. J. Bernstein and T. Lange, “The year in post-quantum crypto.” [Online]. Available: https://www.hyperelliptic.org/tanja/vortraege/20181228-pqc.pdf
[10] J. Alwen, S. Krenn, K. Pietrzak, and D. Wichs, “Learning with rounding, revisited,” in Advances in Cryptology – CRYPTO 2013, R. Canetti and J. A. Garay, Eds. Springer Berlin Heidelberg, vol. 8042, pp. 57–74. [Online]. Available: http://link.springer.com/10.1007/978-3-642-40041-4_4
[11] H. Baan, S. Bhattacharya, S. Fluhrer, O. Garcia-Morchon, R. Player, R. Rietman, M.-J. O. Saarinen, J. L. Torre-Arce, and Z. Zhang, “Round5: KEM and PKE based on (ring) learning with rounding,” Thursday 28th March, 2019, p. 153.
[12] Runtime: i5-3570k QuadCore. [Online]. Available: https://lessthanoptimal.github.io/Java-Matrix-Benchmark/runtime/2019_02_i53570/
[13] cs.cmu.edu. Multiplying polynomials, fast Fourier transform (FFT). [Online]. Available: http://www.cs.cmu.edu/afs/cs/academic/class/15451-s10/www/lectures/lect0423.txt
[14] math - multiplication using FFT in integer rings. [Online]. Available: https://stackoverflow.com/questions/10243885/multiplication-using-fft-in-integer-rings
[15] Modulo operator performance impact. [Online]. Available: https://lustforge.com/2016/05/08/modulo-operator-performance-impact/
A Results of the optimized profiling
The following figures are the results of the optimized profiling, showing every function that takes more than 0.9% of the running time.
Figure 1: R5ND 1PKE 0d encrypt
Figure 2: R5ND 1PKE 0d decrypt
Figure 3: R5ND 1PKE 0d keygen
Figure 4: R5ND 1PKE 0d RNG
B Result of the KAT and the speedtest