
Code-based Cryptography:
Implementing the McEliece Scheme on Reconfigurable Hardware
Stefan Heyse

May 31, 2009

Diploma Thesis
Ruhr-University Bochum

Faculty of Electrical Engineering and Information


Technology
Chair for Embedded Security
Prof. Dr.-Ing. Christof Paar
Dr.-Ing. Tim Güneysu
Typeset on October 14, 2009 at 12:09.

Abstract

Most advanced security systems rely on public-key schemes based either on the
factorization or discrete logarithm problem. Since both problems are known to be
closely related, a major breakthrough in cryptanalysis tackling one of those prob-
lems could render a large set of cryptosystems completely useless. The McEliece
public-key scheme is based on the alternative security assumption that decoding
unknown linear, binary codes is NP-complete. In this work, we investigate the efficient implementation of the McEliece scheme on reconfigurable hardware, which was, to date, considered a challenge due to the required storage of its large keys. To
the best of our knowledge, this is the first time that the McEliece encryption scheme
is implemented on a Xilinx Spartan-3 FPGA.

Zusammenfassung

Most modern security systems are based on public-key schemes that rely either on the factorization problem or on the discrete logarithm problem. Since these two problems are known to be closely related, a major breakthrough in cryptanalysis solving one of them could render a large number of cryptosystems completely useless. The McEliece public-key scheme is based on the alternative assumption that decoding unknown linear codes is NP-complete. In this work, we investigate the efficient implementation of the McEliece scheme on reconfigurable hardware, which, due to the required storage of the large keys, has been a challenge to date. To the best of our knowledge, this is the first time that the McEliece scheme has been implemented on a Xilinx Spartan-3 FPGA.

FOR MANDY AND EMELIE



Erklärung/Statement

I hereby declare that I have written my diploma thesis myself, have used no sources or aids other than those stated, and have marked all quotations as such.

I hereby declare that the work presented in this thesis is my own work and that
to the best of my knowledge it is original except where indicated by references to
other authors.

Bochum, October 14, 2009


Stefan Heyse
Contents
1. Introduction 3
1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2. Existing Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3. Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4. Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2. The McEliece Crypto System 7


2.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2. Key Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3. Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4. Decryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5. Reducing Memory Requirements . . . . . . . . . . . . . . . . . . . . . . 9
2.6. Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6.1. Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6.2. Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7. Side Channel Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3. Introduction to Coding Theory 15


3.1. Codes over Finite Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2. Goppa Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3. Parity Check Matrix of Goppa Codes . . . . . . . . . . . . . . . . . . . 16
3.4. Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5. Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.6. Solving the Key Equation . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.7. Extracting Roots of the Error Locator Polynomial . . . . . . . . . . . . 20

4. Reconfigurable Hardware 23
4.1. Introducing FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.1. Interfacing the FPGA . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1.2. Buffers in Dedicated Hardware . . . . . . . . . . . . . . . . . . . 24
4.1.3. Secure Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2. Field Arithmetic in Hardware . . . . . . . . . . . . . . . . . . . . . . . . 26

5. Designing for Area-Time-Efficiency 29


5.1. Extended Euclidean Algorithm . . . . . . . . . . . . . . . . . . . . . . . 29
5.1.1. Multiplication Component . . . . . . . . . . . . . . . . . . . . . 31

5.1.2. Component for Degree Extraction . . . . . . . . . . . . . . . . . 31


5.1.3. Component for Coefficient Division . . . . . . . . . . . . . . . . 32
5.1.4. Component for Polynomial Multiplication and Shifting . . . . 33
5.2. Computation of Polynomial Square Roots . . . . . . . . . . . . . . . . . 34
5.3. Computation of Polynomial Squares . . . . . . . . . . . . . . . . . . . . 35

6. Implementation 37
6.1. Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.1.1. Reading the Public Key . . . . . . . . . . . . . . . . . . . . . . . 38
6.1.2. Encrypting a message . . . . . . . . . . . . . . . . . . . . . . . . 39
6.2. Decryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2.1. Inverting the Permutation P . . . . . . . . . . . . . . . . . . . . . 42
6.3. Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3.1. Computing the Square Root of T(z)+z . . . . . . . . . . . . . . . 45
6.3.2. Solving the Key Equation . . . . . . . . . . . . . . . . . . . . . . 46
6.3.3. Computing the Error Locator Polynomial Sigma . . . . . . . . . 46
6.3.4. Searching Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.4. Reverting the Substitution S . . . . . . . . . . . . . . . . . . . . . . . . . 48

7. Results 51
8. Discussion 55
8.1. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
8.2. Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
8.3. Outlook for Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . 56

A. Tables 57
B. Magma Functions 59
C. VHDL Code Snippets 65
D. Bibliography 69
E. List of Figures 73
F. List of Tables 75
G. Listings 77
H. List of Algorithms 79
1. Introduction
1.1. Motivation

The advanced properties of public-key cryptosystems are required for many crypto-
graphic issues, such as key establishment between parties and digital signatures. In
this context, RSA, ElGamal, and later ECC have evolved as the most popular choices and form the foundation for virtually all practical security protocols and implementations with requirements for public-key cryptography. However, these cryptosystems
rely on two primitive security assumptions, namely the factoring problem (FP) and
the discrete logarithm problem (DLP), which are also known to be closely related.
With a significant breakthrough in cryptanalysis or a major improvement of the
best known attacks on these problems (i.e., the Number Field Sieve or Index Calculus),
a large number of recently employed cryptosystems may turn out to be insecure
overnight. Already the existence of a quantum computer that can provide compu-
tations on a few thousand qubits would render FP and DLP-based cryptography
useless. Though quantum computers of that dimension have not been reported to
be built yet, we already want to encourage a larger diversification of cryptographic
primitives in future public-key systems. However, to be accepted as real alternatives
to conventional systems like RSA and ECC, such security primitives need to support
efficient implementations with a comparable level of security on recent computing
platforms. For example, one promising alternative is the family of public-key schemes based on Multivariate Quadratic (MQ) polynomials, for which hardware implementations were proposed at CHES 2008 [9]. In this work, we demonstrate the efficient implementation of another public-key cryptosystem proposed by Robert J. McEliece in
1978 that is based on coding theory [35]. The McEliece cryptosystem incorporates
a linear error-correcting code (namely a Goppa code) which is hidden as a gen-
eral linear code. For Goppa codes, fast decoding algorithms exist when the code is
known, but decoding codewords without knowledge of the coding scheme is proven
NP-complete [3]. Contrary to DLP- and FP-based systems, this makes the scheme also suitable for the post-quantum era, since it is expected to remain unbroken when appropriately chosen security parameters are used [6].
The vast majority¹ of today's computing platforms are embedded systems. Only a few years ago, most of these devices could only provide a few hundred bytes
¹ Already in 2001, 98% of the microprocessors in world-wide production were assembled in embedded platforms.

of RAM and ROM, which was a tight restriction for application (and security) designers. Thus, the McEliece scheme was regarded as impracticable on such small and embedded systems due to the large size of the private and public keys. For example, for an 80-bit security level, the public key is 437.75 KByte and the secret key is 377 KByte in size.
But nowadays, recent families of microcontrollers provide several hundred KByte of Flash-ROM.
of Flash-ROM. Moreover, recent off-the-shelf hardware such as FPGAs also contain
dedicated memory blocks and Flash memories that support on-chip storage of up to
a few megabits of data. In particular, these memories could be used, e.g., to store
the keys of the McEliece cryptosystem.

While a microcontroller implementation already exists [16], this work presents the first implementation of the McEliece cryptosystem on a Xilinx Spartan-3AN 1400 FPGA, which is suitable for many embedded system applications. To the best
of our knowledge, no other implementations for the McEliece scheme have been
proposed targeting embedded platforms. Fundamental operations for McEliece are
based on encoding and decoding binary linear codes in binary extension fields that,
in particular, can be implemented very efficiently in dedicated hardware. Unlike FP
and DLP-based cryptosystems, operations on binary codes do not require computationally expensive multi-precision integer arithmetic, which is beneficial for small computing platforms.

On quantum computers, algorithms with polynomial complexity exist that break the security assumption in both cases [41]. McEliece is based on a proven NP-complete problem and is therefore not affected by the computational power of quantum computers. This is the primary reason why McEliece is an interesting candidate for post-quantum cryptography.

Finally, the rising market of pervasive computing, in which mobile phones, cars, and even white goods communicate with each other, creates a new kind of cryptography: lightweight cryptography. In the context of lightweight cryptography it is important for crypto systems to perform fast on systems with limited memory, CPU resources, and battery capacity. Therefore, many symmetric and asymmetric algorithms like AES, DES, RSA, and ECC were ported to microcontrollers, and even new algorithms like PRESENT [10] were developed especially for embedded systems.

For these reasons, it is particularly interesting to see how McEliece performs on embedded systems in comparison to commonly used public-key schemes like RSA and ECC.

1.2. Existing Implementations

There exist only a few McEliece software implementations [39, 40] for 32-bit architectures. The implementation in [39] is written in pure i386 assembler; it encrypts at 6 kbit/s and decrypts at 1.7 kbit/s on a 16 MHz i386 CPU. A C implementation for 32-bit architectures exists in [40]. The code is neither formatted nor does it contain a single line of comment. Nevertheless, it was used in the open-source P2P software Freenet and Entropy [19, 18]. Due to the poor documentation, we are unable to give performance results for this code. An FPGA implementation of the original McEliece cryptosystem appears not to exist; only in [8] is a signature scheme derived from McEliece implemented on an FPGA.
To the best of our knowledge, this work presents the first implementation of the McEliece public-key scheme for embedded systems.

1.3. Goals

The goal of this thesis is a proof-of-concept implementation of McEliece for reconfigurable hardware. The primary target is to solve the memory problem for the large public and private key data and to fit all components into a low-cost FPGA of the Spartan-3 class. Simultaneously, we want to achieve a high performance to allow the McEliece scheme to become a competitor among post-quantum cryptosystems on embedded systems. For encryption, the large public key has to be stored in a memory area with fast access; for decryption, a way has to be found to reduce the private key size, because the target FPGA has only limited memory available.

1.4. Outline

First, the growing interest in post-quantum cryptography is motivated, and it is explained why embedded implementations in particular are necessary. Next, an introduction to the McEliece scheme is given. Subsequently, the basic concepts of error-correcting codes, especially Goppa codes, are presented. Afterwards, an introduction to FPGAs is given, and it is shown how to perform finite field arithmetic on them. Then the encryption implementation is explained with respect to the properties of the target platform. Next, the implementation of decryption on the second target platform is described. Finally, the results are presented and options for further improvement are outlined.
2. The McEliece Crypto System
In this chapter, an introduction to McEliece is given, and the algorithms for generating the key pair and for encrypting and decrypting a message are presented. Afterwards, currently known weaknesses and actual attacks on the McEliece scheme are presented.

2.1. Overview

The McEliece crypto system consists of three algorithms, which will be explained in the following sections: a key generation algorithm that produces a public/private key pair, an encryption algorithm, and a decryption algorithm. The public key is a hidden generator matrix G of a binary linear code of length n and dimension k capable of correcting up to t errors. R. McEliece suggested using classical binary Goppa codes. The generator matrix of this code is hidden using a random k × k binary non-singular substitution matrix S and a random n × n permutation matrix P. The matrix Ĝ = S ⋅ G ⋅ P and the error-correcting capability t form the public key. G itself together with the matrices S and P forms the secret key.
To encrypt a message, it is transformed into a codeword of the underlying code and an error vector e with Hamming weight wH(e) ≤ t is added. Without knowledge of the specific code used, the errors cannot be corrected and therefore the original message cannot be recovered. The owner of the secret information can reverse the transformations of G and use the decoding algorithm of the code to correct the errors and recover the original message.

c′ = m ⋅ Ĝ = m ⋅ S ⋅ G ⋅ P  →  c = c′ + e = m ⋅ S ⋅ G ⋅ P + e
(2.1)
→  c ⋅ P⁻¹ = m ⋅ S ⋅ G ⋅ P ⋅ P⁻¹ + e ⋅ P⁻¹ = m ⋅ S ⋅ G + e ⋅ P⁻¹

The decoding algorithm can still be used to correct the t errors and obtain m′ = m ⋅ S, because wH(e) = wH(e ⋅ P⁻¹) ≤ t still holds. To finally recover the original message m, a multiplication with S⁻¹ is performed, which yields m′ ⋅ S⁻¹ = m ⋅ S ⋅ S⁻¹ = m.

2.2. Key Generation

Basically, the parameters of the McEliece crypto system are the parameters of the Goppa code used. After choosing the underlying Galois field GF(2^m) and the error-correcting capability t of the code, all other parameters depend on these two values. The original parameters suggested by McEliece in [35] are m = 10, t = 50, but in the Handbook of Applied Cryptography, the authors noted that t = 38 is a better choice with regard to the computational complexity of the algorithm while not reducing the security level. Nowadays, these parameters only give 2^60-bit security [6] compared to a symmetric cipher. To achieve an appropriate level of 2^80-bit security, at least the parameters n = 2^11 = 2048, t = 27 and k = n − m ⋅ t = 1751 must be chosen. The key generation then proceeds as follows:

Algorithm 1 Key Generation Algorithm for McEliece Scheme


Input: Security Parameters m, t
Output: K pub , Ksec
1: n ← 2m , k ← n − m ⋅ t
2: C ← random binary (n, k)-linear code C capable of correcting t errors
3: G ← k × n generator matrix for the code C
4: S ← random k × k binary non-singular matrix
5: P ← random n × n permutation matrix
6: Ĝ ← k × n matrix S × G × P
7: return Public Key( Ĝ, t); Private Key(S, G, P).

Note that in Section 2.4 only the matrices P⁻¹ and S⁻¹ are used. Hence, it is possible to precompute these inverse matrices, and the private (decryption) key is then redefined as (S⁻¹, G, P⁻¹). The permutation matrix P is very sparse: in every row and every column exactly one single 1 occurs. This fact is used in the implementation to save space when storing P.
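The storage saving can be illustrated with a short sketch (Python for illustration; the variable names are ours, not from the implementation): instead of an n × n bit matrix, P is kept as an array of n indices of m bits each.

```python
import random

def apply_perm(x, p):
    """Multiply the row vector x by the permutation matrix P,
    where p[i] is the column of the single 1 in row i of P."""
    y = [0] * len(x)
    for i, pi in enumerate(p):
        y[pi] = x[i]
    return y

def invert_perm(p):
    """Index representation of P^-1."""
    inv = [0] * len(p)
    for i, pi in enumerate(p):
        inv[pi] = i
    return inv

n, m = 2048, 11
random.seed(42)
p = list(range(n))
random.shuffle(p)

# multiplying by P and then by P^-1 returns the original vector
x = [random.randint(0, 1) for _ in range(n)]
assert apply_perm(apply_perm(x, p), invert_perm(p)) == x

full_matrix_bits = n * n   # 4,194,304 bits for the full matrix
index_array_bits = n * m   # 22,528 bits as an index array
```

For the parameters n = 2048, m = 11, the index array needs n ⋅ m = 22,528 bits instead of n² ≈ 4 Mbit for the full matrix.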

2.3. Encryption

Suppose Bob wishes to send a message m to Alice, whose public key is (Ĝ, t). McEliece encrypts a k-bit message into an n-bit ciphertext. With the actual parameters, this results in a 2048-bit ciphertext for a 1751-bit message, an expansion factor of n/k ≈ 1.17.
2.4 Decryption 9

Algorithm 2 McEliece Message Encryption


Input: m, K pub = (Ĝ, t)
Output: Ciphertext c
1: Encode the message m as a binary string of length k
2: c′ ← m ⋅ Ĝ
3: Generate a random n-bit error vector z containing at most t ones
4: c ← c′ + z
5: return c

2.4. Decryption

To decrypt the ciphertext and recover the message, Alice has to perform the following steps:

Algorithm 3 McEliece Message Decryption


Input: c, Ksec = ( P−1 , G, S−1 )
Output: Plaintext m
1: ĉ ← c ⋅ P−1
2: Use a decoding algorithm for the code C to decode ĉ to m̂ = m ⋅ S
3: m ← m̂ ⋅ S−1
4: return m
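Algorithms 1 to 3 can be traced end-to-end on a toy code. The following sketch (Python, our own illustration, not part of the thesis implementation) replaces the Goppa code by the tiny [7,4] Hamming code with t = 1, which is far too small to be secure but follows exactly the S ⋅ G ⋅ P construction and the decrypt-then-unscramble steps described above.

```python
import random

def vecmat(v, M):
    """Row vector times matrix over GF(2)."""
    return [sum(v[i] & M[i][j] for i in range(len(v))) % 2
            for j in range(len(M[0]))]

def matmul(A, B):
    return [vecmat(row, B) for row in A]

def inv_gf2(M):
    """Gauss-Jordan inversion over GF(2); raises StopIteration if M is singular."""
    n = len(M)
    A = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        piv = next(r for r in range(col, n) if A[r][col])
        A[col], A[piv] = A[piv], A[col]
        for r in range(n):
            if r != col and A[r][col]:
                A[r] = [a ^ b for a, b in zip(A[r], A[col])]
    return [row[n:] for row in A]

# Systematic generator matrix G and parity check matrix H of the [7,4] Hamming
# code; column j of H is the binary expansion of j+1, so a nonzero syndrome
# directly names the error position (t = 1).
G = [[1,0,0,0,0,1,1],
     [0,1,0,0,1,0,1],
     [0,0,1,0,1,1,0],
     [0,0,0,1,1,1,1]]
H = [[0,0,0,1,1,1,1],
     [0,1,1,0,0,1,1],
     [1,0,1,0,1,0,1]]

random.seed(7)

# --- Key generation (Algorithm 1): non-singular S, permutation P as indices.
while True:
    S = [[random.randint(0, 1) for _ in range(4)] for _ in range(4)]
    try:
        S_inv = inv_gf2(S)
        break
    except StopIteration:
        pass                        # singular: try another S
perm = list(range(7))
random.shuffle(perm)
SG = matmul(S, G)
G_hat = [[row[perm[j]] for j in range(7)] for row in SG]  # public key, with t = 1

# --- Encryption (Algorithm 2): c = m * G_hat + e with wt(e) <= t.
m = [1, 0, 1, 1]
e = [0] * 7
e[random.randrange(7)] = 1
c = [a ^ b for a, b in zip(vecmat(m, G_hat), e)]

# --- Decryption (Algorithm 3): undo P, correct the error, undo S.
inv_perm = [0] * 7
for j, pj in enumerate(perm):
    inv_perm[pj] = j
c_hat = [c[inv_perm[i]] for i in range(7)]               # c * P^-1
syn = [sum(H[r][j] & c_hat[j] for j in range(7)) % 2 for r in range(3)]
pos = 4 * syn[0] + 2 * syn[1] + syn[2]
if pos:
    c_hat[pos - 1] ^= 1                                  # correct the single error
m_rec = vecmat(c_hat[:4], S_inv)                         # (m*S) * S^-1 = m
```

Because G is systematic, the first k = 4 corrected bits equal m ⋅ S, exactly as exploited in Section 2.5; multiplying by S⁻¹ then recovers m.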

2.5. Reducing Memory Requirements

To make McEliece-based cryptosystems more practical (i.e., to reduce the key sizes), there is ongoing research into replacing the code with one that can be represented in a more compact way. Examples for such alternative representations are quasi-cyclic codes [20] or low-density parity-check codes [37], but these have also been broken again [].
Using a naive approach, in which the code is constructed from the set of all elements in F_{2^m} in lexicographical order and both matrices S, P are totally random, the public key Ĝ = S × G × P becomes a random k × n matrix. However, since P is a sparse permutation matrix with only a single 1 in each row and column, it is more efficient to store only the positions of the 1s, resulting in an array of n ⋅ m bits.
Another trick to reduce the public key size is to convert Ĝ to systematic form { I_k ∣ Q }, where I_k is the k × k identity matrix. Then, only the k × (n − k) matrix Q is published [17]. But to achieve the same security level, the message has to be converted and additional operations have to be computed to avoid weaknesses due to the known structure of Q [17].
In the last step of decoding (Algorithm 4), the k message bits have to be extracted out of the n (corrected) ciphertext bits. Usually, this is done by a mapping matrix iG with G × iG = I_k. But if G is in systematic form, this step can be omitted, since the first k bits of the corrected ciphertext correspond to the message bits. Unfortunately, G and Ĝ cannot both be systematic at the same time, since then Ĝ = { I_k ∣ Q̂ } = S × { I_k ∣ Q } × P and S would be the identity matrix, which is inappropriate for use as part of the secret key.
For reduction of the secret key size, we chose to generate the large scrambling matrix S⁻¹ on-the-fly using a cryptographic pseudo-random number generator (CPRNG) and a seed. During key generation, it must be ensured that the seed does not generate a singular matrix S⁻¹. A random binary k × k matrix is invertible with probability ∏_{i=1}^{k} (1 − 2⁻ⁱ), which quickly approaches about 29%. Depending on the target platform and available cryptographic accelerators, there are different options to implement such a CPRNG (e.g., AES in counter mode or a hash-based PRNG) on embedded platforms. However, to prevent the security margin from being reduced, the complexity of attacking the output sequence of the CPRNG should not be significantly lower than that of breaking the considered McEliece system with a static scrambling matrix S. Note, however, that the secrecy of S⁻¹ is not required for hiding the secret Goppa polynomial g(z) [17]; the matrix S has no cryptographic function in hiding it. Today, there is no way to recover H with the knowledge of only G ⋅ P.
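The invertibility probability can be checked numerically. The sketch below (Python, illustrative; the parameters k and the trial count are ours) computes the exact product formula for k = 20 and compares it against a seeded Monte Carlo estimate using rank computation over GF(2).

```python
import random

def rank_gf2(rows):
    """Rank over GF(2); each matrix row is packed into an integer."""
    rows = list(rows)
    rank = 0
    while rows:
        row = rows.pop()
        if row == 0:
            continue
        rank += 1
        msb = row.bit_length() - 1
        # eliminate the leading bit of this row from all remaining rows
        rows = [r ^ row if (r >> msb) & 1 else r for r in rows]
    return rank

k = 20
# Exact: fraction of invertible k x k binary matrices, prod_{i=1..k}(1 - 2^-i).
exact = 1.0
for i in range(1, k + 1):
    exact *= 1.0 - 2.0 ** -i

# Monte Carlo estimate with a fixed seed.
random.seed(0)
trials = 2000
hits = sum(rank_gf2([random.getrandbits(k) for _ in range(k)]) == k
           for _ in range(trials))
estimate = hits / trials
```

Both values land near 0.289; since the product converges very quickly, the probability is practically the same for the k = 1751 used later, so on average three to four seeds have to be tried during key generation.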

2.6. Security

In the past, many researchers have attempted to break the McEliece scheme [33, 32, 2], but none was successful in the general case.

2.6.1. Weaknesses
At Crypto '97, Berson [7] showed that McEliece has two weaknesses: it fails when encrypting the same message twice and when encrypting a message that has a known relation to another message. In the first case, assume that c1 = m ⋅ Ĝ + e1 and c2 = m ⋅ Ĝ + e2 with e1 ≠ e2 are sent, which leads to c1 + c2 = e1 + e2.
Now compute two sets L0, L1, where L0 contains all positions l where c1 + c2 is zero and L1 contains the positions l where c1 + c2 is one. Since the two errors e1, e2 are chosen independently and the transmitted messages are identical, it follows that

for a position l ∈ L0 most probably neither c1(l) nor c2(l) is modified by an error, while for a position l ∈ L1 exactly one of c1(l) and c2(l) is modified by an error. For the parameters suggested by McEliece (n = 1024, k = 524, t = 50), the probability that e1(l) = e2(l) = 1 for an l ∈ L0 is only about 0.0024; for most l ∈ L0, e1(l) = e2(l) = 0 instead. The probability p_i that exactly i positions are changed by e1 and e2 at the same time is:

p_i = Pr(∣{l : e1(l) = 1} ∩ {l : e2(l) = 1}∣ = i) = C(50, i) ⋅ C(974, 50 − i) / C(1024, 50)    (2.2)

where C(n, k) denotes the binomial coefficient.

Therefore, the expected cardinality of L1 is:

E(∣L1∣) = ∑_{i=0}^{50} (100 − 2 ⋅ i) ⋅ p_i ≈ 95.1    (2.3)

For example, with the cardinality ∣L1∣ = 94 it follows that ∣L0∣ = 930 and only 3 entries of L0 are affected by an error. The probability to select 524 unmodified positions from L0 is

C(927, 524) / C(930, 524) ≈ 0.0828    (2.4)

Therefore, after about 12 trials an unmodified codeword is selected, which can then be decoded with the help of the public generator matrix.
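These numbers can be reproduced directly from Equations (2.2) to (2.4). A small sketch (Python, illustrative) using exact binomial coefficients:

```python
from math import comb

n, t = 1024, 50

# Eq. (2.2): probability that e1 and e2 overlap in exactly i positions.
p = [comb(t, i) * comb(n - t, t - i) / comb(n, t) for i in range(t + 1)]

# Eq. (2.3): expected number of positions where c1 + c2 is one.
E_L1 = sum((2 * t - 2 * i) * p[i] for i in range(t + 1))

# Eq. (2.4): chance of picking 524 unmodified positions from L0
# for the example |L1| = 94 (so |L0| = 930, 3 of them erroneous).
pick = comb(927, 524) / comb(930, 524)
```

This reproduces E(∣L1∣) ≈ 95.1 and a selection probability of ≈ 0.0828, i.e., about 1/0.0828 ≈ 12 trials on average.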
In the second case, where the two messages have a known linear relation, the sum of the two ciphertexts becomes:

c1 + c2 = m1 ⋅ Ĝ + m2 ⋅ Ĝ + e1 + e2    (2.5)

Due to the known relation, (m1 + m2) ⋅ Ĝ can be computed and subsequently c1 + c2 + (m1 + m2) ⋅ Ĝ = e1 + e2. Now proceed as in the first case, using c1 + c2 + (m1 + m2) ⋅ Ĝ instead of c1 + c2. Remark that this attack does not reveal the secret key.
There are several ways to make McEliece resistant against these weaknesses. Most of them scramble or randomize the messages to destroy any relationship between two dependent messages [17].

2.6.2. Attacks
According to [17], there is no simple rule for choosing t with respect to n. One should try to make an attack as difficult as possible using the best known attacks.
A recent paper [5] by Bernstein, Lange, and Peters introduces an improved attack on McEliece with support of Bernstein's list decoding algorithm [4] for binary Goppa codes. List decoding can correct approximately n − √(n ⋅ (n − 2t − 2)) errors in a length-n classical irreducible degree-t binary Goppa code, while the best previously known algorithm, by Patterson [38], can correct up to t errors. This attack reduces the binary work factor to break the original McEliece scheme with a (1024, 524) Goppa code and t = 50 to 2^60.55 bit operations. Table 2.1 summarizes the parameters suggested by [5] for specific security levels:

Table 2.1.: Security of McEliece Depending on Parameters

Security Level        Parameters (n, k, t), errors added   Size K_pub in KBits   Size K_sec (G(z), P, S) in KBits
Short-term (60 bit)   (1024, 644, 38), 38                  644                   (0.38, 10, 405)
Mid-term (80 bit)     (2048, 1751, 27), 27                 3,502                 (0.30, 22, 2994)
Long-term (256 bit)   (6624, 5129, 115), 117               33,178                (1.47, 104, 25690)

For keys limited to 2^16, 2^17, 2^18, 2^19, 2^20 bytes, the authors propose Goppa codes of lengths 1744, 2480, 3408, 4624, 6960 and degrees 35, 45, 67, 95, 119, respectively, with 36, 46, 68, 97, 121 errors added by the sender. These codes achieve security levels of 2^84.88, 2^107.41, 2^147.94, 2^191.18, 2^266.94 against the attack described by the researchers.
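The gain of list decoding over Patterson's algorithm can be checked with the radius formula quoted above (Python, illustrative):

```python
import math

def list_decoding_radius(n, t):
    """Approximate number of correctable errors, n - sqrt(n*(n - 2t - 2))."""
    return n - math.sqrt(n * (n - 2 * t - 2))

# Original McEliece parameters: list decoding reaches ~52 errors vs. t = 50.
n, t = 1024, 50
radius = list_decoding_radius(n, t)
```

For the original (1024, 524) code, list decoding reaches about 52 errors instead of t = 50, which is why the parameter sets above let the sender add slightly more errors than the code's designed t.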

2.7. Side Channel Attacks

The susceptibility of the McEliece cryptosystem to side channel attacks has not been extensively studied yet. This is probably due to the low number of practical systems employing the McEliece cryptosystem. However, embedded systems can always be subject to passive attacks such as timing analysis [30] and power/EM analysis [34]. In [42], a successful timing attack on the Patterson algorithm was demonstrated. The attack does not recover the key, but reveals the error vector z and hence allows for efficient decryption of the message c. This implementation is not susceptible to this attack due to unconditional instruction execution, e.g., the implementation will not terminate after a certain number of errors has been corrected.

For a key recovery attack on McEliece, the adversary needs to recover the secret substitution and permutation matrices S and P, and the Goppa code itself, either represented by G or by the Goppa polynomial g(z) and the support ℒ. A feasible SPA is possible to recover ĉ: for each bit set in ĉ, our implementation performs one iteration of the extended Euclidean algorithm (EEA). Assuming the execution of the EEA to be visible in the power trace, ĉ can easily be recovered. The attacker is then able to recover the whole permutation matrix P with fewer than 100 chosen ciphertexts. This powerful attack can easily be prevented by processing the bits of ĉ in a random order. Without recovering P, power attacks on inner operations of McEliece are aggravated by the unknown input. Classical timing analysis, as described in [30], seems more realistic, as many parts of the algorithm, such as the EEA, the permutation, and the modulo reduction, exhibit a data-dependent runtime. Yet, no effective timing attack has been reported so far, and simple countermeasures are available [31].
An attacker knowing P could recover the generator polynomial G(z) and thereby break the implementation [17]. As a countermeasure, we randomize the execution of step 1 of Algorithm 4, since an SPA could otherwise recover ĉ and consequently P⁻¹. Differential EM/power attacks and timing attacks are impeded by the permutation and scrambling operations (P and S), which obfuscate all internal states, and finally by the large key size. Yet, template-like attacks [12] might be feasible if no further protection is applied.
3. Introduction to Coding Theory
For this chapter it is assumed that the reader is familiar with the theory of finite fields and with algebraic coding theory. A good introduction to finite fields for engineers can be found in [36]. The following definitions are taken from [45] and define the crucial building blocks for Goppa codes.

3.1. Codes over Finite Fields

Because the McEliece scheme builds on hard problems from coding theory, a short introduction to this field, based on [17], is given.

Definition 3.1.1 An (n, k)-code C over a finite field F is a k-dimensional linear subspace of the vector space F^n. We call C an (n, k, d)-code if the minimum distance is d = min_{x,y∈C, x≠y} dist(x, y), where dist denotes a distance function, e.g., the Hamming distance. The distance wt(x) := dist(0, x) of an x ∈ F^n to the null-vector is called the weight of x.

Definition 3.1.2 The matrix G ∈ F^{k×n} is a generator matrix for the (n, k)-code C over F if the rows of G span C over F. The matrix H ∈ F^{(n−k)×n} is called a parity check matrix for the code C if C is the right kernel of H, i.e., H ⋅ c^T = 0 for all c ∈ C. The code generated by H is called the dual code of C and is denoted by C^⊥.

By multiplying a message with the generator matrix, a codeword is formed which contains redundant information. This information can be used to recover the original message even if some errors occurred, for example during transmission of the codeword over a radio channel.

3.2. Goppa Codes

Introduced by V. D. Goppa in [21], Goppa codes are a generalization of BCH and RS codes. Good decoding algorithms exist, e.g., [38] and [44].

Definition 3.2.1 (Goppa polynomial, syndrome, binary Goppa codes). Let m and t be positive integers and let

g(z) = ∑_{i=0}^{t} g_i ⋅ z^i ∈ F_{2^m}[z]    (3.1)

be a monic polynomial of degree t, called the Goppa polynomial, and

ℒ = { α_0, . . . , α_{n−1} },  α_i ∈ F_{2^m}    (3.2)

a tuple of n distinct elements, called the support, such that

g(α_j) ≠ 0  for all 0 ≤ j ≤ n − 1.    (3.3)

For any vector c = (c_0, . . . , c_{n−1}) ∈ F^n, define the syndrome of c by

𝒮_c(z) = − ∑_{i=0}^{n−1} (c_i / g(α_i)) ⋅ (g(z) − g(α_i)) / (z − α_i)   mod g(z)    (3.4)

The binary Goppa code Γ(ℒ, g(z)) over F_2 is the set of all c = (c_0, . . . , c_{n−1}) ∈ F_2^n such that the identity

𝒮_c(z) = 0    (3.5)

holds in the polynomial ring F_{2^m}[z], or, equivalently,

𝒮_c(z) = ∑_{i=0}^{n−1} c_i / (z − α_i) ≡ 0   mod g(z)    (3.6)

If g(z) is irreducible over F_{2^m}, then Γ(ℒ, g(z)) is called an irreducible binary Goppa code.
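All computations in these definitions take place in F_{2^m}. As a small illustration of this arithmetic (not taken from the thesis implementation, which targets a larger field), the following Python sketch multiplies and inverts elements of F_{2^4} with the reduction polynomial z^4 + z + 1:

```python
M = 4
POLY = 0b10011  # z^4 + z + 1, irreducible over F_2

def gf_mul(a, b):
    """Carry-less shift-and-add multiplication with modular reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):      # reduce as soon as degree m is reached
            a ^= POLY
    return r

def gf_pow(a, e):
    """Square-and-multiply exponentiation."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def gf_inv(a):
    """a^(2^m - 2) = a^-1 in the multiplicative group of order 2^m - 1."""
    return gf_pow(a, 2 ** M - 2)
```

The same shift-and-add structure maps directly to XOR gates and shift registers, which is why binary-field arithmetic is so cheap in dedicated hardware (cf. Section 4.2).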

3.3. Parity Check Matrix of Goppa Codes

Recall Equation (3.4). From there it follows that every bit of the received codeword is multiplied with

(g(z) − g(α_i)) / (g(α_i) ⋅ (z − α_i))    (3.7)

Writing the Goppa polynomial as g(z) = g_s ⋅ z^s + g_{s−1} ⋅ z^{s−1} + ⋅ ⋅ ⋅ + g_0 (with s = t), one can construct the parity check matrix H as

    ⎡ g_s/g(α_0)                    g_s/g(α_1)                    ⋅⋅⋅  g_s/g(α_{n−1})                    ⎤
H = ⎢ (g_{s−1}+g_s⋅α_0)/g(α_0)      (g_{s−1}+g_s⋅α_1)/g(α_1)      ⋅⋅⋅  (g_{s−1}+g_s⋅α_{n−1})/g(α_{n−1})  ⎥    (3.8)
    ⎢ ⋮                             ⋮                                  ⋮                                 ⎥
    ⎣ (g_1+g_2⋅α_0+⋅⋅⋅+g_s⋅α_0^{s−1})/g(α_0)   ⋅⋅⋅   (g_1+g_2⋅α_{n−1}+⋅⋅⋅+g_s⋅α_{n−1}^{s−1})/g(α_{n−1}) ⎦

This can be simplified to

    ⎡ g_s      0    ⋅⋅⋅  0   ⎤   ⎡ 1/g(α_0)           1/g(α_1)           ⋅⋅⋅  1/g(α_{n−1})           ⎤
H = ⎢ g_{s−1}  g_s  ⋅⋅⋅  0   ⎥ ∗ ⎢ α_0/g(α_0)         α_1/g(α_1)         ⋅⋅⋅  α_{n−1}/g(α_{n−1})     ⎥    (3.9)
    ⎢ ⋮        ⋮         ⋮   ⎥   ⎢ ⋮                  ⋮                       ⋮                      ⎥
    ⎣ g_1      g_2  ⋅⋅⋅  g_s ⎦   ⎣ α_0^{s−1}/g(α_0)   α_1^{s−1}/g(α_1)   ⋅⋅⋅  α_{n−1}^{s−1}/g(α_{n−1}) ⎦

where the first matrix has a non-zero determinant; hence the second matrix Ĥ is an equivalent parity check matrix with a simpler structure. By applying the Gaussian algorithm to the second matrix Ĥ, one can bring it to systematic form ( I_{n−k} ∣ Q ), where I_{n−k} is the (n−k) × (n−k) identity matrix. Note that whenever a column swap is performed, the corresponding elements of the support ℒ have to be swapped as well.
From the systematic parity check matrix ( I_{n−k} ∣ Q ), the systematic generator matrix G can now be derived as ( Q^T ∣ I_k ), since over F_2, ( I_{n−k} ∣ Q ) ⋅ ( Q^T ∣ I_k )^T = Q + Q = 0.
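The column-swapping Gaussian reduction and the derivation of G can be sketched generically over GF(2) (Python, illustrative; the example matrix is a parity check matrix of the small [7,4] Hamming code, not of a Goppa code):

```python
def systematize(H):
    """Bring H to the form (I | Q) by row operations and column swaps over GF(2).
    Returns the reduced matrix and the column permutation (= support swaps)."""
    H = [row[:] for row in H]
    r, n = len(H), len(H[0])
    colperm = list(range(n))
    for i in range(r):
        # find a pivot in rows >= i, columns >= i
        ri, ci = next((rr, cc) for cc in range(i, n)
                      for rr in range(i, r) if H[rr][cc])
        H[i], H[ri] = H[ri], H[i]
        if ci != i:            # column swap => swap the support elements too
            for row in H:
                row[i], row[ci] = row[ci], row[i]
            colperm[i], colperm[ci] = colperm[ci], colperm[i]
        for rr in range(r):
            if rr != i and H[rr][i]:
                H[rr] = [a ^ b for a, b in zip(H[rr], H[i])]
    return H, colperm

H = [[0,0,0,1,1,1,1],
     [0,1,1,0,0,1,1],
     [1,0,1,0,1,0,1]]
Hs, colperm = systematize(H)            # Hs = (I_{n-k} | Q)
r, n = len(Hs), len(Hs[0])
k = n - r
Q = [row[r:] for row in Hs]
# G = (Q^T | I_k): then Hs * G^T = Q + Q = 0 over GF(2)
G = [[Q[j][i] for j in range(r)] + [int(i == j) for j in range(k)]
     for i in range(k)]
ok = all(sum(Hs[row][j] & g[j] for j in range(n)) % 2 == 0
         for row in range(r) for g in G)
```

The final check `ok` confirms that every row of the derived G is orthogonal to every row of the systematic parity check matrix, i.e., G generates the code defined by Hs.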

3.4. Encoding

To encode a message m into a codeword c, represent the message m as a binary string of length k and multiply it with the k × n matrix G. If G is in systematic form (I_k ∣ P), where I_k is the k × k identity matrix, one only has to multiply m with the k × (n − k) matrix P and append the result to m. This trick cannot be used in the context of McEliece without special measures (see Section 2.5), because the public generator matrix Ĝ is generally not in systematic form, due to the multiplication with two random matrices.
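The systematic encoding trick can be sketched as follows (a toy [7,4] code stands in for the real parameters n = 2048, k = 1751; P and the test message are made up for illustration):

```python
# Systematic encoding over GF(2): c = (m, m*P) for G = (I_k | P).
k = 4
P = [[1, 1, 0],          # k x (n-k) redundancy part of G = (I_k | P)
     [1, 0, 1],
     [0, 1, 1],
     [1, 1, 1]]

def encode(m):
    # parity = m * P over GF(2), appended to the message itself
    parity = [sum(m[i] * P[i][j] for i in range(k)) % 2
              for j in range(len(P[0]))]
    return list(m) + parity

c = encode([1, 0, 1, 1])   # first k bits of c are the message itself
```

Because the first k bits of c equal m, recovering the message from an error-free systematic codeword is mere truncation; the public Ĝ in McEliece deliberately hides exactly this structure.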

3.5. Decoding

However, decoding such a codeword r on the receiver’s side with a (possibly) ad-
ditive error vector e is far more complex. For decoding, we use Patterson’s algo-
rithm [38] with improvements from [43].
Since r = c + e ≡ e mod G (z) holds, the syndrome Syn(z) of a received codeword
can be obtained from Equation (3.6) by
    Syn(z) = Σ_(α ∈ GF(2^m)) rα/(z − α) ≡ Σ_(α ∈ GF(2^m)) eα/(z − α)  mod G(z)      (3.10)

To finally recover e, we need to solve the key equation σ(z) ⋅ Syn(z) ≡ ω (z)
mod G (z), where σ(z) denotes a corresponding error-locator polynomial and ω (z)
denotes an error-weight polynomial. Note that it can be shown that ω(z) = σ′(z), the formal derivative of the error locator. By splitting σ(z) into even and odd polynomial parts σ(z) = a(z)² + z ⋅ b(z)², we finally determine the following equation
which needs to be solved to determine error positions:

Syn(z)(a(z)2 + z ⋅ b(z)2 ) ≡ b(z)2 mod G (z) (3.11)

To solve Equation (3.11) for a given codeword r, the following steps have to be
performed:

1. From the received codeword r compute the syndrome Syn(z) according to


Equation (3.10). This can also be done using simple table-lookups.
2. Compute an inverse polynomial T(z) with T(z) ⋅ Syn(z) ≡ 1 mod G(z) (or provide a corresponding table). It follows that (T(z) + z) ⋅ b(z)² ≡ a(z)² mod G(z).
3. There is a simple case: if T(z) = z, then a(z) = 0, so that b(z)² ≡ z ⋅ b(z)² ⋅ Syn(z) mod G(z) ⇒ 1 ≡ z ⋅ Syn(z) mod G(z), which directly leads to σ(z) = z.
   Otherwise, if T(z) ≠ z, compute a square root R(z) of the polynomial T(z) + z, i.e., R(z)² ≡ T(z) + z mod G(z). Based on an observation by Huber [24] we can
compute the square root R(z) by:

R(z) = T0 (z) + w(z) ⋅ T1 (z) (3.12)

where T0(z), T1(z) are the even and odd parts of T(z) + z, satisfying T(z) + z = T0(z)² + z ⋅ T1(z)², and w(z)² ≡ z mod G(z), which can be precomputed for
every given code. We can then determine solutions a(z), b(z) satisfying

a(z) = b(z) ⋅ R(z) mod G (z). (3.13)

with a modified Euclidean algorithm (see Section 3.6). Finally, we use the identified a(z), b(z) to construct the error-locator polynomial σ(z) = a(z)² + z ⋅ b(z)².
4. The roots of σ(z) denote the positions of error bits (see Section 3.7). If σ(α^i) ≡ 0 mod G(z), with α being a generator of GF(2^11), there was an error at position i in the received codeword that can be corrected by bit-flipping.

This decoding process, as required in Step 2 of Algorithm 3 for message decryption, is finally summarized in Algorithm 4:

Algorithm 4 Decoding Goppa Codes


Input: Received codeword r with up to t errors, inverse generator matrix iG
Output: Recovered message m̂
1: Compute syndrome Syn(z) for codeword r
2: T(z) ← Syn(z)⁻¹
3: if T(z) = z then
4:   σ(z) ← z
5: else
6:   R(z) ← √(T(z) + z)
7:   Compute a(z) and b(z) with a(z) ≡ b(z) ⋅ R(z) mod G(z)
8:   σ(z) ← a(z)² + z ⋅ b(z)²
9: end if
10: Determine roots of σ(z) and correct errors in r, which results in r̂
11: m̂ ← r̂ ⋅ iG {map r̂ to m̂}
12: return m̂

3.6. Solving the Key Equation

To solve the equation a(z) = b(z) ⋅ R(z) mod G(z) with the extended Euclidean algorithm, observe the following: from σ(z) = a(z)² + z ⋅ b(z)² and deg(σ(z)) ≤ deg(G(z)) it follows that deg(a(z)) ≤ deg(G(z))/2 and deg(b(z)) ≤ (deg(G(z)) − 1)/2. During the iterations of the extended Euclidean algorithm

    r−2(z) = 1 ⋅ R(z) + 0 ⋅ G(z)
    r−1(z) = 0 ⋅ R(z) + 1 ⋅ G(z)
       ⋮
    rk(z) = uk(z) ⋅ R(z) + vk(z) ⋅ G(z)
       ⋮
    gcd(R(z), G(z)) = uL(z) ⋅ R(z) + vL(z) ⋅ G(z)
    0 = uL+1(z) ⋅ R(z) + vL+1(z) ⋅ G(z)

the following holds:

    ri = ri−2 − ai ⋅ ri−1,  where ai = ⌊ri−2 / ri−1⌋ is the polynomial quotient      (3.14)

and

deg(ri ) < deg(ri −1 ) (3.15)


In addition:

deg(ui (z)) + deg(ri −1 (z)) = deg(G (z)) (3.16)


It follows that deg(r(z)) decreases steadily after starting with deg(G(z)), and deg(u(z)) increases after starting with zero. Using this, one can see that there is a unique point in the computation of the EEA where both polynomials are just below their bounds. The EEA can be stopped when for the first time deg(r(z)) < deg(G(z))/2 and deg(u(z)) ≤ (deg(G(z)) − 1)/2. These results are the needed polynomials a(z) = r(z) and b(z) = u(z). From a(z) and b(z), construct σ(z) = a(z)² + z ⋅ b(z)².
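The stopping rule can be sketched as follows, with polynomials over GF(2) encoded as integer bit masks for brevity (the actual decoder works with coefficients in GF(2^11); G below is an arbitrary small irreducible polynomial, not the Goppa polynomial):

```python
# Sketch of the EEA with early stopping; bit i of an int is the
# coefficient of z^i, so XOR is polynomial addition over GF(2).

def deg(p):
    return p.bit_length() - 1          # deg(0) = -1

def pmul(a, b):                         # carry-less multiplication
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def pmod(a, b):
    while deg(a) >= deg(b):
        a ^= b << (deg(a) - deg(b))
    return a

def eea_stop(R, G, dstop):
    # invariant: r = u*R + v*G, hence r = u*R (mod G) for both rows
    r0, u0, r1, u1 = G, 0, R, 1
    while deg(r1) >= dstop:
        sh = deg(r0) - deg(r1)          # one elimination step r0 -= z^sh * r1
        r0 ^= r1 << sh
        u0 ^= u1 << sh
        if deg(r0) < deg(r1):
            r0, r1, u0, u1 = r1, r0, u1, u0
    return r1, u1                       # (a(z), b(z))

G = 0b10011                             # z^4 + z + 1, irreducible over GF(2)
R = 0b1101                              # z^3 + z^2 + 1
a, b = eea_stop(R, G, deg(G) // 2)
assert pmod(pmul(b, R), G) == a         # a(z) = b(z) * R(z) mod G(z)
```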

3.7. Extracting Roots of the Error Locator Polynomial

The roots of σ(z) indicate the error positions. If σ(α^i) ≡ 0 mod G(z)¹, there was an error ei = 1 at ci in the received bit string, which can be corrected by just flipping the bit. The roots can be found by evaluating σ(α^i) for every i. A more sophisticated method is the Chien search [13]. To get all roots α^i of σ(z), the following holds for all elements except the zero element:
σ(α^i)     = σs⋅(α^i)^s + σs−1⋅(α^i)^(s−1) + ⋯ + σ1⋅α^i + σ0
           = λs,i + λs−1,i + ⋯ + λ1,i + λ0,i
σ(α^(i+1)) = σs⋅(α^(i+1))^s + σs−1⋅(α^(i+1))^(s−1) + ⋯ + σ1⋅α^(i+1) + σ0
           = σs⋅(α^i)^s⋅α^s + σs−1⋅(α^i)^(s−1)⋅α^(s−1) + ⋯ + σ1⋅α^i⋅α + σ0
           = λs,i⋅α^s + λs−1,i⋅α^(s−1) + ⋯ + λ1,i⋅α + λ0,i
           = λs,i+1 + λs−1,i+1 + ⋯ + λ1,i+1 + λ0,i+1

In other words, one may define σ(α^i) as the sum of a set {λj,i ∣ 0 ≤ j ≤ s}, from which the next set of coefficients may be derived thus:

    λj,i+1 = λj,i ⋅ α^j      (3.17)
¹ Note that α is a generator of GF(2^11)

Start at i = 0 with λj,0 = σj and iterate through every value of i, i.e., over all nonzero field elements α^i. If at any iteration the sum

    Σ_(j=0…s) λj,i = 0      (3.18)

then σ(α^i) = 0 and α^i is a root. This method is more efficient than the brute-force method mentioned before.
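The Chien iteration can be sketched as follows, over the toy field GF(2^4) with p(x) = x^4 + x + 1 instead of GF(2^11), so that the trace stays short; σ(z) is built with the known roots α³ and α⁷, which the search must rediscover:

```python
# Chien search sketch over GF(2^4), p(x) = x^4 + x + 1 (primitive).
M, POLY, N = 4, 0b10011, 15            # N = 2^M - 1

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << M):
            a ^= POLY
        b >>= 1
    return r

antilog = [0] * N
x = 1
for i in range(N):
    antilog[i] = x
    x = gf_mul(x, 2)                   # alpha = 0b10

# sigma(z) = (z - alpha^3)(z - alpha^7), coefficients low degree first
r1, r2 = antilog[3], antilog[7]
sigma = [gf_mul(r1, r2), r1 ^ r2, 1]

# Chien iteration: lambda_{j,i+1} = lambda_{j,i} * alpha^j
lam = list(sigma)                      # lambda_{j,0} = sigma_j
roots = []
for i in range(N):
    if lam[0] ^ lam[1] ^ lam[2] == 0:  # sum == 0  <=>  sigma(alpha^i) == 0
        roots.append(i)
    for j in range(len(lam)):
        lam[j] = gf_mul(lam[j], antilog[j])

assert roots == [3, 7]
```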
If the generator matrix G is in standard form, just take the first k bits of the n-bit codeword c to retrieve the original message m. To get a generator matrix in this form, reorder the elements in the support of the code until G is in standard form. Note that in the computation of the syndrome and the error correction, the i-th element in c and σ(z), respectively, corresponds with the i-th element in the support and not with α^i. If a matrix not in standard form is used, a mapping matrix, which selects the right k bits from c, has to be found. This matrix iG must satisfy G ⋅ iG = IDk, where IDk is the k × k identity matrix. To get iG do the following:

Algorithm 5 Getting a Mapping Matrix from Codewords to Messages


Input: G
1: Select randomly k columns of G
2: Test if the resulting k × k matrix sub is invertible
3: If not, go to Step 1
4: Else, insert the rows of sub⁻¹ into an n × k matrix at the positions corresponding to the k chosen columns and fill the remaining rows with zeros
5: return iG

For the Magma algorithm computing iG see Appendix B. A multiplication of a valid codeword c with this matrix iG computes the corresponding message m, because

    c ⋅ iG = (m ⋅ G) ⋅ iG = m ⋅ IDk = m      (3.19)
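Algorithm 5 can be sketched in Python as follows (a toy systematic 4 × 7 generator matrix replaces the real 1751 × 2048 one, and the k columns are fixed rather than random so that the sketch is deterministic; all names are illustrative):

```python
# Sketch of Algorithm 5: build iG with G * iG = I_k over GF(2).
k, n = 4, 7
G = [[1, 0, 0, 0, 1, 1, 0],
     [0, 1, 0, 0, 1, 0, 1],
     [0, 0, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]

def inverse_gf2(A):
    """Gauss-Jordan inversion over GF(2); returns None if singular."""
    m = len(A)
    W = [row[:] + [1 if i == j else 0 for j in range(m)]
         for i, row in enumerate(A)]
    for col in range(m):
        piv = next((r for r in range(col, m) if W[r][col]), None)
        if piv is None:
            return None
        W[col], W[piv] = W[piv], W[col]
        for r in range(m):
            if r != col and W[r][col]:
                W[r] = [(x ^ y) for x, y in zip(W[r], W[col])]
    return [row[m:] for row in W]

cols = [1, 2, 4, 6]                    # chosen (here: invertible) columns
sub = [[G[i][c] for c in cols] for i in range(k)]
sub_inv = inverse_gf2(sub)
assert sub_inv is not None             # otherwise retry with new columns

# insert the rows of sub^-1 at the chosen column positions, zero elsewhere
iG = [[0] * k for _ in range(n)]
for r, c in enumerate(cols):
    iG[c] = sub_inv[r]

def matmul_gf2(A, B):
    return [[sum(a[t] * B[t][j] for t in range(len(B))) % 2
             for j in range(len(B[0]))] for a in A]

I_k = [[1 if i == j else 0 for j in range(k)] for i in range(k)]
assert matmul_gf2(G, iG) == I_k
```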

Now all prerequisites to start with implementing the McEliece public key scheme
are given.
4. Reconfigurable Hardware
This chapter introduces FPGAs and presents the necessary additional hardware.
Furthermore, field arithmetic in hardware is introduced.

4.1. Introducing FPGAs

FPGA stands for Field Programmable Gate Array. An FPGA consists of a large number of look-up tables (LUTs), each of which can generate any logic combination of four inputs and one output, and of basic storage elements based on flip-flops.

Figure 4.1.: 4-Input LUT with FF

Two LUTs and two FFs are packed together into a slice, and four slices into a Configurable Logic Block (CLB). Between the CLBs lies a programmable switch matrix that can connect the inputs and outputs of the CLBs. How LUTs and CLBs are configured is defined in a vendor-specific binary file, the bitstream. Most modern FPGAs also contain dedicated hardware like multipliers, clock managers, and configurable block RAM.

Figure 4.2.: Simplified Overview over an FPGA [27]

The process of building a design for an FPGA consists of several steps, which are depicted in Figure 4.3.

Figure 4.3.: VHDL Design Flow [28]

After writing the VHDL code in an editor, it is translated to a netlist. This process is called synthesis; for this implementation the tools XST and Synplify are chosen from the many available. Based on the netlist, the correct behavior of the design can be verified using a simulation tool, in our case ModelSim. Both steps are completely hardware independent. The next step is mapping and translating the netlist into logic resources and special resources offered by the target platform. Due to this hardware dependency, this and the following steps need to know the exact target hardware. The final step, place-and-route (PAR), then tries to find an optimal placement for the single logic blocks and connects them over the switching matrix. The output of PAR can now be converted into a bitstream file and loaded into a flash memory on the FPGA board.
The FPGA contains a logic block that can read this flash memory and configure the FPGA accordingly. On most FPGA boards this memory is located outside the FPGA chip and can therefore be accessed by anyone. To protect the content of the bitstream, which may include intellectual property (IP) cores or, as in our case, secret key material, the bitstream can be stored encrypted. The FPGA boot-up logic then has to decrypt the bitstream before configuring the FPGA. Some special FPGAs, for example the Spartan3-AN series, contain a large on-die flash memory, which can only be accessed by opening the chip physically. To keep the secret key material confidential, the bitstream file has to be protected by one of the two methods mentioned above.

4.1.1. Interfacing the FPGA


Aside from the algorithmic part of the design, a way has to be found to get data into the FPGA and, after computation, to read the data back. For this implementation we stay with a standard UART interface, even though interfaces with higher bandwidth (Ethernet, USB, PCI-Express) are possible. The used UART is derived from an existing source for the PicoBlaze published in Xilinx application note 223 [11]. A wrapper component was developed that provides all necessary ports.

Figure 4.4.: The UART component

The input en_16_x_baud should be pulsed HIGH for one clock cycle duration only, and at a rate 16 times (or approximately 16 times, due to oversampling) faster than the rate at which the serial data transmission takes place. The receive and transmit components each contain a 16 byte buffer. Transmission is started as soon as a byte is written to the buffer; writing to the transmit buffer while it is full has no effect. When the receive buffer is full, all subsequent write attempts from the host system are ignored, which will cause data loss.

4.1.2. Buffers in Dedicated Hardware


For some data, a structure is required that holds values for later processing. For example, ciphertext, plaintext, bytes sent to and received from the UART, and also precomputed values have to be stored inside the FPGA. Instead of building this storage from a huge number of registers, both target FPGAs for this implementation provide dedicated RAM. This block RAM (BRAM) is organized in blocks of 18 Kbit and can be accessed at the maximum frequency of the FPGA. Each of the BRAMs is true dual-ported, allowing independent read and write access at the same time. With the Xilinx vendor tool CoreGen, each block can be configured as RAM or ROM in single- or dual-port configuration. One can also select the required width and depth of the BRAM. CoreGen then generates a VHDL instantiation template that can be used in the VHDL code. Every type of BRAM can be initialized with predefined content via the CoreGen wizard. It uses a .coe file, which can contain the memory content in binary, decimal or hexadecimal form. This is used to transfer the precomputed tables and the constant polynomial w into the FPGA. The Goppa polynomial G is not stored in a ROM, because we need to access it in its whole width at once. This polynomial is stored as a large register. Because it is constant and only needed in the extended Euclidean algorithm, it will be resolved to fixed connections to VCC and GND by the synthesis tool. The Spartan3-200 incorporates 12 BRAMs and the Spartan3-2000 incorporates 40 BRAMs, allowing up to 216 or 720 Kbit of data to be stored, respectively.

4.1.3. Secure Storage


All data that is required to configure the FPGA and also constant values from VHDL
code, like ROM and RAM initialization values, has to be stored on the FPGA board.
In normal circumstances the target platform contains a flash memory chip which
holds the bitstream file. During boot-up, the FPGA reads this file and configure
himself and initialize the BRAM cells accordingly. But this flash has a standard
interface and can therefore be read by anyone. To protect the private key data, there
exist two ways. The first way, which is only possible at newer FPGAs [47], is to store
the bitstream file encrypted. The FPGA contains a hardware decryption module
and a user defined secret key. During boot-up, the bitstream file is decrypted inside
the FPGA and then the normal configuration takes place. Our target FPGAs are
not capable of decrypting bitstream files. But the Spartan3-AN family of FPGAs
contain a large on-die flash memory. Assuming that physical opening the chip is
hard, this memory can be accepted as secure on chip storage. If the designer does
not connect the internal flash to the outside world, then the flash cannot be read
from the I/O pins. Additionally, there exist some security features which can be

Table 4.1.: Bitstream Generator Security Level Settings

Security Level   Description
None             Default. Unrestricted access to all configuration and Readback functions.
Level 1          Disable all Readback functions from both the SelectMAP and JTAG ports (external pins). Readback via the ICAP is allowed.
Level 2          Disable all Readback operations on all ports.
Level 3          Disable all configuration and Readback functions from all configuration and JTAG ports. The only command (in terms of Readback and configuration) that can be issued and executed in Level 3 is REBOOT. This erases the configuration of the device. It has the same function as asserting the PROG_B pin on the device, except that it is done from within the device.

configured during bitstream generation. Table 4.1 summarizes the different security levels provided by Spartan3 FPGAs [26].

4.2. Field Arithmetic in Hardware

Analyzing McEliece encryption and decryption algorithms (cf. Section 3.2), the fol-
lowing arithmetic components are required supporting computations in GF(2m ): a
multiplier, a squaring unit, calculation of square roots, and an inverter. Furthermore,
a binary matrix multiplier for encryption and a permutation element for step 2 in
Algorithm 2 are required. Many arithmetic operations in McEliece can be replaced
by table lookups to significantly accelerate computations at the cost of additional
memory. Our primary goal is area and memory efficiency to fit the large keys and
required lookup-tables into the limited on-chip memories of our embedded target
platform.
Arithmetic operations in the underlying field GF(211 ) can be performed efficiently
with a combination of polynomial and exponential representation. In registers, we
store the coefficients of a value a ∈ GF(211 ) using a polynomial basis with natural
order. Given a = a10⋅α^10 + a9⋅α^9 + a8⋅α^8 + ⋯ + a0⋅α^0, the coefficient ai ∈ GF(2) is determined by bit i of an 11 bit standard logic vector, where bit 0 denotes the least significant bit. In this representation, addition is fast just by performing an exclusive-or


operation on two 11 bit standard logic vectors. For more complex operations, such
as multiplication, squaring, inversion and root extraction, an exponential represen-
tation is more suitable. Since every element in GF(211 ) can be written as a power of
some primitive element α, all elements in the finite field can also be represented by
α^i with i ∈ Z_(2^m−1). Multiplication and squaring can then be performed by adding the exponents of the factors over Z_(2^m−1), such as

    c = a ⋅ b = α^i ⋅ α^j = α^(i+j),  a, b ∈ GF(2^11), 0 ≤ i, j ≤ 2^m − 2.      (4.1)

The inverse of a value d ∈ GF(2^11) in exponential representation d = α^i can be obtained from a single subtraction in the exponent, d⁻¹ = α^(2^11 − 1 − i), with a subsequent table-lookup. Root extraction, i.e., given a value a = α^i determining r = α^(i/2), is simple when i is even and can be performed by a simple right shift of the index i. For odd values of i, perform m − 1 = 10 left shifts, each followed by a reduction modulo 2^11 − 1. To allow for efficient conversion between the two representations, we
employ two precomputed tables (so-called log and antilog tables) that allow fast conversion between polynomial and exponential representation. Each table consists of 2048 11-bit values that are stored in two of the block RAM cells (BRAM) of the FPGA. For multiplication, squaring, inversion, and root extraction, the operands are transformed on-the-fly to exponential representation and reverted to the polynomial basis after finishing the operation. To reduce routing delay, every arithmetic unit can access its own LUT that is placed close to the logic block.
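The log/antilog arithmetic described above can be sketched in software as follows (a bitwise multiplication serves as reference to build the tables; table sizes and the reduction polynomial match the thesis parameters):

```python
# Sketch of GF(2^11) arithmetic via log/antilog tables,
# p(alpha) = alpha^11 + alpha^2 + 1.

M, POLY, N = 11, (1 << 11) | 0b101, (1 << 11) - 1   # N = 2047

def gf_mul_bitwise(a, b):              # reference multiplication
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << M):
            a ^= POLY
        b >>= 1
    return r

# build antilog (exp -> poly) and log (poly -> exp) tables
antilog, log = [0] * N, [0] * (1 << M)
x = 1
for i in range(N):
    antilog[i], log[x] = x, i
    x = gf_mul_bitwise(x, 2)           # multiply by the generator alpha

def gf_mul(a, b):                      # via exponent addition mod 2^m - 1
    if a == 0 or b == 0:
        return 0
    return antilog[(log[a] + log[b]) % N]

def gf_inv(a):                         # d^-1 = alpha^(2^11 - 1 - i)
    return antilog[(N - log[a]) % N]

def gf_sqrt(a):                        # halve the exponent; odd i: add N first
    if a == 0:
        return 0
    i = log[a]
    return antilog[i // 2 if i % 2 == 0 else (i + N) // 2]
```

gf_sqrt makes an odd exponent even by adding 2^11 − 1 before halving, which yields the same result as the shift-and-reduce procedure described above.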
5. Designing for Area-Time-Efficiency
Most components of an algorithm can be implemented to finish as fast as possible, but then they need a lot of logic resources. On the other hand, a component can be implemented resource-efficiently, but it will then most likely consume more time to complete its task. Table 5.1 gives an estimate of how often specific parts of the McEliece algorithms are executed on average during one encryption or decryption with the parameters m = 11, t = 27.
From this table one can easily see that poly_EEA is the most important, most time-critical and largest component in the whole design and is thus worth a few more details.

5.1. Extended Euclidean Algorithm

In Algorithm 6 the extended Euclidean algorithm, adapted to polynomials over GF(2^m), is summarized again.

Algorithm 6 Extended Euclid over GF(2^11) with Stop Value

Input: G(z) irreducible ∈ F_(2^11)[z], x(z) ∈ F_(2^11)[z] with degree(x) < degree(G)
Output: x(z)⁻¹ mod G(z)
1: A ← G(z), B ← x(z), v ← 1, u ← 0
2: while degree(A) ≥ stop do
3:   q ← lc(A) / lc(B)  {lc(X) is the leading coefficient of X}
4:   A ← A − q ⋅ z^k ⋅ B  {k = degree(A) − degree(B)}
5:   u ← u + q ⋅ z^k ⋅ v
6:   if degree(A) ≤ degree(B) then
7:     A ↔ B  {swap A and B}
8:     u ↔ v  {swap u and v}
9:   end if
10: end while
11: return u(z)/A0

We identified the following components:



Table 5.1.: Execution Count for Crucial Parts

Part                            Count
Encryption
  UART receive                  219
  UART send                     256
  (1751 × 2048) Matrix MUL      1
  (8 × 8) Submatrix MUL         56,064
  Error Distribution            2
  PRNG*                         7
Decryption
  UART receive                  256
  UART send                     219
  Permutation                   1
  Polynomial MUL                1
  Polynomial SQRT               1
  Polynomial SQR                2
  EEA⁺                          1024 + 2
    EEA.getDegree               2 ⋅ 27 ⋅ (1024 + 2) ≈ 55,500
    EEA.shiftPoly               2 ⋅ 27 ⋅ (1024 + 2) ≈ 55,500
    EEA.MulCoeff                2 ⋅ 27 ⋅ 27 ⋅ (1024 + 2) ≈ 1,500,000
    EEA.DivCoeff                2 ⋅ 27 ⋅ (1024 + 2) ≈ 55,500
  (2048 × 2048) Matrix MUL      1
  (8 × 8) Submatrix MUL         65,536
  PRNG⁻                         65,536

* One PRNG run generates 4 error positions.
⁺ Assuming that half of the ciphertext bits are one.
⁻ One PRNG run generates one (8 × 8) submatrix.

1. get_degree: a component that determines the degree and the leading coefficient (lc) of a polynomial
2. gf_div: a component that divides two field elements
3. gf_mul: a component that multiplies two field elements

4. poly_mul_shift_add: a component that computes X = X + q ⋅ z^k ⋅ Y

We decided to implement the extended Euclidean algorithm in a fully parallel way. In other words, we try to compute nearly every intermediate result in one clock cycle, which means that the design works on a complete polynomial at once. Only in places that occur rarely during a decoding run do we compute the results in a serial, coefficient-wise way to save resources.
If all computations were done in a serial, coefficient-wise way, this would result in about 30 times more clock cycles. Although a serial design is smaller and should achieve a higher clock frequency, it cannot reach 30 times the frequency of the parallel design: a Spartan-3 FPGA can run at only 300 MHz, so the serial design cannot outperform the parallel one as long as the parallel design operates at 10 MHz or more.

5.1.1. Multiplication Component


Due to the large number of required multiplications, we choose the fastest possible design for this operation. Instead of performing the multiplication with the table look-up method mentioned in Section 4.2, or implementing school-book or Karatsuba-like methods, we completely unroll the multiplication into a hardwired tree of XORs and ANDs. This tree is derived from the formal multiplication of two field elements and includes the modulo reduction mod p(α) = α^11 + α^2 + 1, which is a defining polynomial for GF(2^11). Using this, a complete multiplication finishes in one clock cycle. See Appendix C for the complete multiplication code.
After optimization by the synthesis tool, the multiplier consists of an 11 bit register for the result and a number of XORs, as shown in Table A.2 in Appendix A. Overall, one multiplier consumes 89 four-input LUTs, but computes the product in one clock cycle.
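What the synthesis tool unrolls can be modelled bit by bit: in the sketch below, the double loop corresponds to the AND gates of the formal product and the reduction loop to the XOR folding with p(α) = α^11 + α^2 + 1. This is a software model of the combinational tree, not the generated netlist:

```python
# Bit-level model of the unrolled GF(2^11) multiplier.
M = 11
POLY_LOW = 0b101                       # alpha^11 = alpha^2 + 1

def gf_mul_unrolled(a, b):
    # formal multiplication: partial-product bits up to degree 20
    prod = [0] * (2 * M - 1)
    for i in range(M):
        for j in range(M):
            prod[i + j] ^= (a >> i) & (b >> j) & 1   # the AND gates
    # reduction: alpha^d = alpha^(d-9) + alpha^(d-11) for d >= 11
    for d in range(2 * M - 2, M - 1, -1):
        if prod[d]:
            prod[d] = 0
            prod[d - M + 2] ^= 1       # the XOR gates of the reduction
            prod[d - M] ^= 1
    return sum(bit << i for i, bit in enumerate(prod[:M]))

product = gf_mul_unrolled(0b10, 0b10)  # alpha * alpha = alpha^2 = 0b100
```

Since p(α) is primitive, repeatedly multiplying α by itself walks through all 2047 nonzero field elements, which is a convenient self-check of the reduction logic.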

5.1.2. Component for Degree Extraction


To allow get_degree to complete as fast as possible, first every coefficient of the input polynomial is checked for being zero, as shown in Listing 5.1.

entity gf_compare is PORT(
    coeff_in : in STD_LOGIC_VECTOR(MCE_M-1 downto 0);
    equal    : out std_logic
);
end gf_compare;

architecture structural of gf_compare is
begin
    equal <= '0' when (coeff_in = GF_ZERO) else '1';
end structural;

Listing 5.1: Architecture for gf_compare

Using gf_compare for all 27 coefficients, we construct a std_logic_vector(26 downto 0) that contains a 0 if the coefficient is zero and a 1 otherwise. This construction is shown in Listing 5.2.

gen_comp : for I in 0 to 26 generate
    C_compare : gf_compare port map(
        coeff_in => poly_in(I),
        equal    => internal_degree(I));
end generate gen_comp;

Listing 5.2: Instantiation of gf_compare

These components are unclocked and consist only of combinatorial logic. When the start signal is driven high, degree and lc are derived from this vector in one clock cycle.

if internal_degree(26) = '1' then
    degree <= "11010";
    lc     <= poly_in(26);
elsif internal_degree(25) = '1' then
    degree <= "11001";
    lc     <= poly_in(25);
    ...
elsif internal_degree(0) = '1' then
    degree <= "00000";
    lc     <= poly_in(0);
else
    report "Zero polynomial detected!!!" severity error;
end if;

Listing 5.3: Determining the Degree of a Polynomial

5.1.3. Component for Coecient Division


gf_div is likewise designed for throughput. As mentioned in Section 4.2, it makes use of two precomputed tables in BRAM. gf_div transforms both input coefficients into exponential representation at once with a table look-up (TLU) in a dual-port ROM (poly2exp_dualport). The exponents are then subtracted (exp_diff). This difference may become negative. To avoid an extra clock cycle for adding the modulus to bring the difference back into the positive range, we extend the exp2poly table to the negative numbers as exp2polymod. Negative numbers are represented in two's complement form in hardware, so addresses 0 to 2047 contain the standard exponent-to-polynomial mapping. Addresses 2048 and 2049 correspond, due to the two's complement form, to the exponents −2048 and −2047, which cannot occur because the maximum exponent range is 0 … 2046; these addresses are filled with a dummy value. The rest of the address space contains the same values in the same order as at the beginning of the table. By this, the modulo reduction is avoided, saving about 55,600 clock cycles overall. Remark that the TLU method is chosen because this design finishes in 3 clock cycles, whereas a dedicated arithmetic unit computing the division would need at least 11 clock cycles, assuming that one clock cycle is needed per bit operation. Figure 5.1 shows the complete division component. The look-up tables and the subtractor are highlighted in red. The surrounding signals form the controlling state machine.

Figure 5.1.: The Complete Coefficient Divider
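The doubled exp2polymod table can be sketched as follows (field setup as in Section 4.2; the 12-bit AND models the two's-complement output of the subtractor, and all helper names are illustrative):

```python
# Sketch of the gf_div table trick for GF(2^11), p = alpha^11 + alpha^2 + 1.
M, N = 11, (1 << 11) - 1

def gf_mul(a, b):                      # bitwise reference multiplication
    poly = (1 << M) | 0b101
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << M):
            a ^= poly
        b >>= 1
    return r

antilog, log = [0] * N, [0] * (1 << M)
x = 1
for i in range(N):
    antilog[i], log[x] = x, i
    x = gf_mul(x, 2)

# exp2polymod: 4096 entries addressed by the raw 12-bit difference.
# Addresses 0..2047 hold the normal mapping; the upper half repeats the
# values so that negative two's-complement differences resolve directly.
exp2polymod = [0] * 4096
for addr in range(4096):
    d = addr if addr < 2048 else addr - 4096    # two's-complement value
    exp2polymod[addr] = antilog[d % N]          # dummy where d cannot occur

def gf_div(a, b):
    diff = (log[a] - log[b]) & 0xFFF            # 12-bit subtractor output
    return exp2polymod[diff]                    # no extra reduction cycle
```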

5.1.4. Component for Polynomial Multiplication and Shifting


The poly_mul_shift_add component is also designed with priority on high performance. Therefore, the multiplication of the input polynomial with the coefficient is completely unrolled. We use 27 multipliers in parallel, each of which consists of a tree of ANDs and XORs. Each of these multiplier blocks is connected to the fixed coefficient q and one of the polynomial coefficients, as shown in Figure 5.2.

Figure 5.2.: Overview of the Polynomial Multiplier

After determining the shift value k = deg(A) − deg(B) as the difference of the degrees of the two polynomials A(z), B(z), the shift by k coefficients is accomplished by a large, 297-bit wide 27-to-1 multiplexer to allow minimum runtime.

with k select
    poly_out <= poly_in when "00000",
                poly_in(25 downto 0) & GF_ZERO when "00001",
                poly_in(24 downto 0) & GF_ZERO & GF_ZERO when "00010",
                ...
                (others => "00000000000") when others;

Listing 5.4: k-width Shifter

To allow maximum throughput, we decided to instantiate the poly_mul_shift_add component twice, allowing lines 4 and 5 of Algorithm 6 to run in parallel. For the same reason, the component get_degree is instantiated twice, as required in line 3 of Algorithm 6.
The final normalization is also handled by the poly_mul_shift_add component. This works since in normal mode poly_mul_shift_add computes

    u(z) = u(z) + q ⋅ z^k ⋅ v(z)      (5.1)

Remember that q is the result of the division of the leading coefficients of two polynomials. By setting u(z) = 0, q = A0⁻¹, k = 0 and v(z) = u(z), this equation results in

    u(z) = 0 + (1/A0) ⋅ z^0 ⋅ u(z) = u(z)/A0      (5.2)

which is the required normalization. This method avoids the implementation of an extra normalization component and saves slices without increasing the required clock cycles, because this step can in any case be computed after the last computation that involves poly_mul_shift_add.
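Reusing the same unit for normalization can be sketched as follows, over the toy field GF(2^4) instead of GF(2^11) (poly_mul_shift_add and the sample values are illustrative; the brute-force gf_inv is a stand-in for the table lookup):

```python
# Sketch: X <- X + q * z^k * Y over GF(2^4); setting X = 0, q = 1/c, k = 0
# reuses the unit to divide a polynomial by a scalar c, i.e., the final
# normalization step of the EEA.
M, POLY = 4, 0b10011

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << M):
            a ^= POLY
        b >>= 1
    return r

def gf_inv(a):
    # brute force; the hardware uses the log/antilog tables instead
    return next(c for c in range(1, 1 << M) if gf_mul(a, c) == 1)

def poly_mul_shift_add(X, q, k, Y):
    # coefficient lists, lowest degree first, fixed register length
    out = list(X)
    for i, y in enumerate(Y):
        if i + k < len(out):
            out[i + k] ^= gf_mul(q, y)
    return out

# normal mode: u(z) <- u(z) + q * z^k * v(z)
u = poly_mul_shift_add([1, 2, 3, 0], 5, 1, [4, 6, 0, 0])

# normalization mode: U(z)/c via X = 0, q = c^-1, k = 0
U, c = [3, 0, 7, 5], 6
norm = poly_mul_shift_add([0, 0, 0, 0], gf_inv(c), 0, U)
assert all(gf_mul(norm[i], c) == U[i] for i in range(len(U)))
```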

5.2. Computation of Polynomial Square Roots

The next required computation, according to Algorithm 4 line 6, is taking the square root of T(z) + z. As mentioned in Section 4.2, instead of using a matrix to compute the square root of the coefficients, we choose the method proposed by K. Huber [24]. For this and a later purpose, we need a component which can split polynomials into their odd and even parts so that T(z) + z = T0(z)² + z ⋅ T1(z)². Splitting a polynomial means taking the square root of all coefficients and assigning the square roots from even positions to one result polynomial and those from odd positions to the other. Taking square roots of coefficients occurs only in this component and nowhere else in the whole algorithm. Therefore, we choose a method that does not consume many slices, but use the already introduced TLU method.
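The splitting step and the identity T(z) = T0(z)² + z ⋅ T1(z)² can be sketched as follows (toy field GF(2^4) again; T is an arbitrary test polynomial, and the coefficient square root uses the exponent-halving rule from Section 4.2):

```python
# Sketch: splitting a polynomial into rooted even/odd parts over GF(2^4).
M, POLY, N = 4, 0b10011, 15

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << M):
            a ^= POLY
        b >>= 1
    return r

antilog, log = [0] * N, [0] * (1 << M)
x = 1
for i in range(N):
    antilog[i], log[x] = x, i
    x = gf_mul(x, 2)

def gf_sqrt(a):
    if a == 0:
        return 0
    i = log[a]
    return antilog[i // 2 if i % 2 == 0 else (i + N) // 2]

def split(T):
    """T0 from even positions, T1 from odd positions, coefficients rooted."""
    T0 = [gf_sqrt(coeff) for coeff in T[0::2]]
    T1 = [gf_sqrt(coeff) for coeff in T[1::2]]
    return T0, T1

def poly_mul(A, B):
    out = [0] * (len(A) + len(B) - 1)
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            out[i + j] ^= gf_mul(a, b)
    return out

T = [9, 4, 13, 1, 6]                   # arbitrary test polynomial
T0, T1 = split(T)
# reconstruct: T0(z)^2 + z * T1(z)^2 must give T back
recon = [0] * len(T)
for i, coeff in enumerate(poly_mul(T0, T0)):
    recon[i] ^= coeff
for i, coeff in enumerate(poly_mul(T1, T1)):
    recon[i + 1] ^= coeff
assert recon == T
```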

5.3. Computation of Polynomial Squares

The last required field operation is the computation of a polynomial square in Algorithm 4 at line 8. Like the square root computation, this operation is required only rarely during one decryption. Due to this low utilization, we decided to implement it in a space-saving way. The square is computed coefficient by coefficient, thus requiring only a single squaring unit. This squaring unit is simply the unrolled multiplier from Section 5.1.1 with both inputs wired together.
6. Implementation
Based on the decisions made in Chapter 4, this chapter discusses the implementation
of encryption and decryption.

6.1. Encryption

In this section, the implementation of the matrix multiplication and the generation of the random error vector, plus its distribution among the ciphertext, is presented. Remember that encryption is only c = m ⋅ Ĝ + e. But first, the target platform is presented.
Because of the lower logic requirements for encryption, and since the encryption key is public and does not require confidential storage, the encryption routine is implemented on a low-cost Spartan-3 FPGA, namely a XC3S200. This device is part of a Spartan-3 development board manufactured by Digilent [14]. Aside from the FPGA, this board additionally provides:
1. On-board 2 Mbit Platform Flash (XCF02S)
2. 8 slide switches, 4 pushbuttons, 9 LEDs, and a 4-digit seven-segment display
3. Serial port, VGA port, and PS/2 mouse/keyboard port
4. Three 40-pin expansion connectors
5. Three high-current voltage regulators (3.3V, 2.5V, and 1.2V)
6. 1 Mbyte on-board 10 ns SRAM (256K × 32)
Figure 6.2 shows a summary of the design. The single blocks will be explained in the following.
The implementation consists of two parts, which can be selected via two of the sliding switches. The first part (selected with sw1 = 1) reads the public key from the UART and stores it in external SRAM. The second part (selected with sw0 = 1) reads the message from the UART and encrypts it with the public key.

6.1.1. Reading the Public Key


The top-level component, called toplevel_encrypt, controls the UART interface, reads the external switches and writes data to the SRAM. This component also controls

(a) Board (b) Block Overview

Figure 6.1.: Spartan3-200 Development Board

Figure 6.2.: Block Overview for Encryption

the real encryption component mce_encrypt. When a reset occurs, all buffers are
cleared, SRAM is disabled by driving SRAMce1, SRAMce2, SRAMoe high and the
state machine is set to idle state. In the moment when the UART indicates a received
byte, the FSM switches to read_in state, the SRAM is enabled and its address bus set
to all zero. Now, a byte is read into the lower 8 bits of an 32 bit register and the
register is shifted up by 8 bits. After reading 4 byte now, a 32 bit word is complete
which is written to SRAM. Afterwards, the address counter is incremented. This

procedure is repeated until all 448,512 bytes of Ĝ are written to SRAM. Then the
FSM returns to the idle state, which is indicated by driving an LED on the board.
Now another public key could be read or, more usefully, the mode can be switched to
encryption by setting sw1 to 0 and sw0 to 1.
To send data from the PC to the FPGA, the open source terminal program Hterm [23]
is used. Hterm can send bytes to the FPGA either from the keyboard or from a file.
The structure of the file containing the data is simply one byte in hexadecimal form
(without the leading 0x prefix) per line.
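The file format can be produced with a few lines of code. The following Python sketch is not part of the thesis; it merely illustrates the described one-hex-byte-per-line layout (the file name in the usage line is hypothetical):

```python
def to_hterm_lines(data: bytes) -> str:
    """Render a byte string in the Hterm file format described above:
    one byte per line, two hex digits, without a leading '0x'."""
    return "".join(f"{b:02X}\n" for b in data)

# hypothetical usage: dump a public-key byte stream to "Ghat.txt"
# open("Ghat.txt", "w").write(to_hterm_lines(key_bytes))
```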
During development, debug methods were integrated that remain in the final
design to allow verification of the current state and the behavior of some important
signals. The development board contains four seven-segment displays which share
a common data bus with signals (a, b, c, d, e, f, g, DP) and are individually selected
by four separate anode control lines. The function of the signals can be seen in
Figure 6.3.

Figure 6.3.: 7 Segment Display

The common data bus is connected to a component that converts a four bit input
character to the bit pattern that is necessary to display the character in hexadecimal
form. The conversion procedure is taken from the reference manual of the board and


Figure 6.4.: Seven Segment Driver Component

is given in Appendix A.1. The decimal dot (data bus pin DP) in the display can be
misused to display data wider than 16 bit. For example, the SRAM address is 18 bit
wide, and therefore the two leading bits are displayed as dots in two of the segments.
Four of these components are connected in parallel to a signal called disp, which
itself can be connected to various internal signals. Which signal is displayed is
selected via the sliding switches sw2 to sw4; Table 6.1 shows the available debug
modes and how they are selected. Switch sw5 controls whether the status bits of the
UART or the control bits for the SRAM are displayed on the seven LEDs led0 to led6.
Signal led7 is reserved and hardwired to indicate completion of the transfer of Ĝ to
the SRAM.

Table 6.1.: Function of the Debug Switches

Signal on Seven Segment Display    SW2  SW3  SW4
SRAMaddress in encrypt state        1    0    x
SRAMaddress in public key mode      0    0    x
SRAMdata low word                   x    0    0
SRAMdata high word                  x    1    0

6.1.2. Encrypting a message


When the encryption mode is entered by setting sw0 = 1, all buffers are cleared,
the SRAM is disabled, and the state machine is set to the idle state. Now every byte
received from the UART is directly fed through to the mce_encrypt component, which
is depicted in Figure 6.5. This component contains the matrix multiplier and the


Figure 6.5.: Interface of the mce_encrypt Component

PRNG plus the controlling state machine.



First, every byte is read into a buffer in BRAM. Because this buffer is implemented
as a BRAM of 18 Kbit, of which only 1752 bits are needed, we decided to additionally
place the cipher buffer of 2048 bits into the same BRAM block. The interface of
this combined buffer is shown in Figure 6.6.


Figure 6.6.: Buffer for Plaintext and Ciphertext

After all input bytes are buffered, the FSM changes to the multiplication state.
First, the row and column counters (RowCnt, ColCnt) and the SRAM address are set to
zero. Note that RowCnt and ColCnt actually count blocks of (8 × 8) submatrices instead
of single rows and columns. Due to the (1751 × 2048) dimension of the matrix, this
leads to a range of 0 . . . 218 for RowCnt and 0 . . . 255 for ColCnt. The partial products
from the submatrix multiplications are summed up in a register c_work, which will
contain the resulting ciphertext when the multiplication is finished. First, the last
temporary ciphertext byte c_work is read from BRAM. In the next step, the plaintext
byte for this row is read out of BRAM and stored in a register m_work; additionally,
the 64 bit submatrix of Ĝ is read from SRAM into G_work prior to incrementing
the SRAM address. The partial product of this block is computed with:
gen_part_prod : for i in 0 to 7 loop
    part_prod(i) <= (M_work(0) AND G_work(8*i + 0)) xor
                    (M_work(1) AND G_work(8*i + 1)) xor
                    (M_work(2) AND G_work(8*i + 2)) xor
                    (M_work(3) AND G_work(8*i + 3)) xor
                    (M_work(4) AND G_work(8*i + 4)) xor
                    (M_work(5) AND G_work(8*i + 5)) xor
                    (M_work(6) AND G_work(8*i + 6)) xor
                    (M_work(7) AND G_work(8*i + 7));
end loop gen_part_prod;

Listing 6.1: Computation of Partial Product

and the result is added to c_work. After writing c_work back to the buffer, the FSM
switches to a check state. In this state, ColCnt is incremented until the current row is
finished. ColCnt also addresses the current c_work, which saves an extra address
register. When a new row is started, the next m_work is read in. When the multiplication
is complete, the FSM changes to the error distribution state.
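The block multiplication of Listing 6.1 can be modeled in software. The following Python sketch is not part of the thesis design; the interpretation of the 64 G_work bits (bit 8·i + j as row j of output bit i) is an assumption taken from the indexing in the listing:

```python
def submatrix_mul(m_byte: int, g_block: int) -> int:
    """Software model of Listing 6.1: multiply an 8-bit message chunk with an
    8x8 submatrix of Ghat over GF(2). Output bit i is the XOR (GF(2) sum) of
    m_byte bit j AND g_block bit (8*i + j) -- the assumed bit layout."""
    out = 0
    for i in range(8):
        acc = 0
        for j in range(8):
            acc ^= ((m_byte >> j) & 1) & ((g_block >> (8 * i + j)) & 1)
        out |= acc << i
    return out
```

In the hardware, this result (part_prod) is XORed into c_work, and RowCnt/ColCnt step through the 219 × 256 blocks of the full (1751 × 2048) matrix.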

Generation and Distribution of the Errors


To distribute random errors among the ciphertext, the bit addresses of the error
positions are generated on the fly by a fast and small PRNG based on the PRESENT
block cipher [10]. Figure 6.7 shows how PRESENT is embedded into the PRNG. For real

(a) Interface of the PRNG (b) Interface of PRESENT

Figure 6.7.: Parts of the PRNG

security, this should be replaced by a TRNG or at least a CSPRNG, but a hardware
design of this type of RNG is far beyond the scope of this thesis. The PRNG is
seeded by a fixed 80 bit IV. Bits 0 to 63 are used as plaintext, and the whole IV is
used as key. The produced ciphertext is the random output and also the plaintext
for the next iteration. The key is scheduled by rotating it one bit to the left and
XORing a five bit counter into the lowest bits after the rotation. Figure 6.8 gives an
overview of the round function and the key derivation. Due to a codeword length of
2048 bits, eleven bits are required to index each bit position. From one 64 bit random
word, four error positions can be derived by splitting the word into four 16 bit words
and using the lower eleven bits of each as index. Therefore, seven runs of PRESENT are
required to generate the 27 errors. Because the ciphertext is stored byte-wise, the
upper eight bits of an error position select the byte address of c_work in the buffer
and the lower three select a bit of this byte. The bit error is induced by toggling
(0 → 1, 1 → 0) this bit, and afterwards the whole byte is written back to the buffer.
After 27 errors are distributed, mce_encrypt signals completion to the top level
component by driving the signal valid high. Now the content of the ciphertext buffer
is fed to the UART, and mce_encrypt and top_level return to the idle state. Finally,
the encryption is complete, and either a new message can be encrypted or the FPGA
can be switched over to initialization mode and a new matrix Ĝ can be read in.
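The position extraction and bit toggling described above can be sketched as follows. This Python model is not part of the thesis; in particular, the order in which the four 16 bit lanes are taken from the 64 bit word is an assumption:

```python
def error_positions(rand64: int) -> list[int]:
    """Split a 64-bit PRNG word into four 16-bit words (lane order assumed
    low-to-high) and keep the lower 11 bits of each as a bit index into
    the 2048-bit codeword."""
    return [(rand64 >> (16 * k)) & 0x7FF for k in range(4)]

def inject_error(cipher: bytearray, pos: int) -> None:
    """Toggle one codeword bit: the upper 8 bits of pos select the byte
    address, the lower 3 bits select the bit inside that byte."""
    cipher[pos >> 3] ^= 1 << (pos & 7)
```

Seven such 64 bit words yield 28 candidate positions, of which 27 are used for the error vector.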

Figure 6.8.: Construction of PRNG from PRESENT

6.2. Decryption

Due to the polynomial arithmetic involved in decoding the Goppa code, decryption
requires far more logic and storage than encryption. For this reason, decryption
is implemented on a larger Spartan-3 FPGA, namely a Xilinx XC3S2000. Originally,
it was planned to store the large matrix S in the internal flash of an XC3S1400AN,
but it turned out that reading data from flash is only possible bitwise at 50 MHz
over an SPI interface and would therefore become a bottleneck. Hence, we developed
a method to generate S on the fly inside the FPGA, so that flash is only required for
storing the bitstream. The McEliece implementation (decryption configuration) for
the Spartan-3 2000 FPGA is depicted in Figure 6.9. The top_level component is nearly
the same as for encryption, except for the SRAM control and the debugging
information, which was adapted to the new development board. The board used for
decryption is a HW-AFX-SP3-2000 from NuHorizons [46].

As for encryption, the ciphertext is first read via the UART and fed into mce_decrypt.
Figure 6.11 shows the important components and registers of mce_decrypt. The several
parts will be explained in the following sections. The mce_decrypt component
stores the ciphertext in a BRAM buffer, which is shared between plaintext
and ciphertext. The address buses of this buffer use the same address register,
except for the highest bit address[8], which selects either the ciphertext buffer or
the plaintext buffer. See Figure 6.12 for the complete interface.

Figure 6.9.: McEliece implementation on Spartan-3 2000 FPGA

Figure 6.10.: The XC3S2000 Development Board

6.2.1. Inverting the Permutation P


The first step in decryption is to revert the permutation P. As mentioned in
Section 2.5, the inverse permutation matrix P−1 is not stored as a matrix but as an
array of 11 bit indices into the ciphertext. P−1 is 2048 ⋅ 11 = 22,528 bits large and
stored in a ROM. Because one BRAM can only hold 18 Kbits, this ROM is built from two
BRAMs with a simple 11 bit address bus. The input and output buses are depicted
in Figure 6.13.
The indices of the permuted ciphertext are now read consecutively from this ROM
as 11 bit values perm_address. The value at address i indicates from which received
ciphertext bit j the permuted bit i is taken. Because the received ciphertext is stored
byte-wise, the upper 8 bits select the byte address of YandMbuff and the lower three
the corresponding bit of the byte. Listing C in the appendix should make the process
of permutation clear.
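The index-array form of P−1 can be illustrated with a short software model. This Python sketch is not the thesis code; it only mirrors the byte/bit addressing scheme described above:

```python
def apply_inverse_permutation(cipher: bytes, perm: list[int]) -> bytearray:
    """Software model of the P^-1 step: perm[i] is the index j of the received
    ciphertext bit that becomes permuted bit i. Bits are stored byte-wise, so
    the upper bits of an index address the byte and the lower 3 bits the bit."""
    out = bytearray(len(cipher))
    for i, j in enumerate(perm):
        bit = (cipher[j >> 3] >> (j & 7)) & 1
        out[i >> 3] |= bit << (i & 7)
    return out
```

In hardware, the same lookup runs eight times per output byte, one ROM read per bit.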
After eight iterations, one byte of the permuted ciphertext has become available
and is input to the goppa_decode component. This component can immediately
start computing a partial syndrome while, concurrently, the next permuted byte is
generated. Both components communicate via a simple handshake protocol with
Figure 6.11.: Overview of mce_decrypt Component


Figure 6.12.: Buffer for Ciphertext and Plaintext


Figure 6.13.: ROM containing the Inverse Permutation Matrix

two signals, byte_ready and need_byte, which indicate whether mce_decrypt has
completed a byte and whether goppa_decode has finished the previous byte. After the
permutation is complete, the FSM of mce_decrypt waits until goppa_decode has finished
decoding the permuted ciphertext. The implementation of the decoding will now be
explained based on Algorithm 4 and the decisions made in Section 5.1.

6.3. Decoding

The goppa_decode component is the crucial part of the McEliece hardware
implementation. It operates on large polynomials of up to 28 ⋅ 11 = 308 bits in length.
Operations on these polynomials involve assigning specific values or shifting and
manipulating individual coefficients. This requires large multiplexers, shift registers,
or massively parallel arithmetic units. For details on how often each operation
is performed, refer to Chapter 5.
Figure 6.14 shows the interface of this component. As mentioned in Section 6.2.1,
the ciphertext is read in byte by byte, and the decoded message is written out in the
same byte-wise manner. Figure 6.15 shows the different parts of the decoding
component.


Figure 6.14.: Interface of goppa-decode Component

The different steps of decoding a codeword are controlled by an FSM, which
enables the involved components with a start signal and awaits completion, indicated
by the signal valid going high. These two signals are used in every component
except those that require only a single clock cycle. Figure 6.16 shows the FSM of the
goppa_decode component. When goppa_decode has accepted a byte, it is written to a
buffer cipher_BRAM_buff, because the ciphertext is needed again for the error
correction step after decoding. Then the current byte is scanned bit-wise. If a 1 is
found, the corresponding trailing coefficient of the polynomial that has to be inverted
(see Equation (3.6)) is looked up in the SList ROM. This polynomial is then fed into
the poly_EEA component (see Section 5.1 for details). We recognized that a polynomial
of degree 27 occurs only once during each run of the EEA, namely in the first
iteration, when polynomial A(z) is set to the Goppa polynomial. To save the slices
occupied by these components, the finite state machine takes care of this special case.
By resolving it with the signal first_run and manually setting the degree of A(z) to 27
and lc(A(z)) = 1 without computing them, all components can be reduced in size and
only have to operate on polynomials of maximum degree 26. This saves 11 FFs, a
complete multiplier, and one multiplexer stage in each (sub)component, plus one
clock cycle per EEA run. The resulting FSM is depicted in Figure 6.17.

In the control state checkab, the degree and the leading coefficient of both
polynomials A and B are computed, and the break condition deg(A) = stopdegree is
checked. For the normal EEA, stopdegree is zero, but this component is also used to
solve the key equation (see Equation (3.13)). In that case, the EEA must be stopped
when deg(A) reaches ⌊deg(g(z))/2⌋ = ⌊27/2⌋ = 13 for the first time. Assigning
the correct stopdegree is handled by the finite state machine of the goppa_decode
component (see Figure 6.16). In reduce1, the difference k of the degrees of A and B
and the quotient lc(A)/lc(B) are computed. State reduce2 then waits for the completion
of the two poly_mul_shift_add components and returns to checkab. When the break
condition deg(A) = stopdegree becomes true, the state switches to normalize if
stopdegree = 0, or to ready otherwise. After normalization the ready state is entered,
in which the FSM waits until the signal start drops to zero. Then the EEA is finished
and returns to idle state.

Figure 6.15.: Overview of goppa-decode Component

Figure 6.16.: FSM of the goppa_decode Component
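The EEA with a configurable stop degree can be modeled in software. The following Python sketch is not the thesis implementation: the list-based polynomial representation and the reduction polynomial x^11 + x^2 + 1 for GF(2^11) are assumptions made for illustration only.

```python
M = 11
RED = 0x805  # x^11 + x^2 + 1, an ASSUMED irreducible polynomial for GF(2^11)

def gf_mul(a: int, b: int) -> int:
    # carry-less shift-and-add multiplication with on-the-fly reduction
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= RED
    return r

def gf_inv(a: int) -> int:
    # inversion via Fermat: a^(2^11 - 2), square-and-multiply
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

# polynomials over GF(2^11) as coefficient lists, index = power of z
def deg(p):
    return max((i for i, c in enumerate(p) if c), default=-1)

def padd(p, q):
    n = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0) ^ (q[i] if i < len(q) else 0)
            for i in range(n)]

def pmul(p, q):
    r = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        if a:
            for j, b in enumerate(q):
                r[i + j] ^= gf_mul(a, b)
    return r

def poly_eea(a, b, stop_degree=0):
    """Run the EEA on a, b and stop as soon as deg(remainder) <= stop_degree.
    Returns (r, u, v) with r = a*u + b*v; stop_degree = 13 models the
    key-equation mode of the poly_EEA component."""
    r0, r1 = list(a), list(b)
    u0, u1 = [1], [0]
    v0, v1 = [0], [1]
    while deg(r1) > stop_degree:
        while deg(r0) >= deg(r1):
            # cancel the leading coefficient of r0 with a shifted multiple of r1
            f = gf_mul(r0[deg(r0)], gf_inv(r1[deg(r1)]))
            shift = [0] * (deg(r0) - deg(r1)) + [f]
            r0, u0, v0 = (padd(r0, pmul(shift, r1)),
                          padd(u0, pmul(shift, u1)),
                          padd(v0, pmul(shift, v1)))
        r0, r1, u0, u1, v0, v1 = r1, r0, u1, u0, v1, v0
    return r1, u1, v1
```

For coprime inputs and stop_degree = 0, the returned remainder is a nonzero constant, matching the normal mode of the component; a nonzero stop degree reproduces the early exit used to solve the key equation.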

When the complete syndrome is computed, Syn(z) is fed into poly_EEA again to
compute T(z), and z is added. In the first version of the implementation, all
polynomials were kept in a std_logic_vector(297 downto 0), but it turned out that the
synthesis tool is unable to handle such large vectors. For example, in a simple XOR,
both input coefficients were placed in one corner of the FPGA and the output far away
in another corner, leading to a large routing delay (nearly 90% of the overall delay).
So the description of polynomials was changed to type PolyArray_t is array(26 downto 0)
of std_logic_vector(10 downto 0). Now the synthesis tool seems to be able to identify
the regular structure and to place corresponding coefficients closer together. This
leads to a higher achievable frequency compared to the original version.

Figure 6.17.: FSM of the EEA Component

6.3.1. Computing the Square Root of T(z)+z


The next component handles line 6 of Algorithm 4, which is a polynomial square
root. To compute the square root of T(z) + z, T(z) + z is first split into an odd and an
even part (see Equation (3.12)). In the VHDL code, the odd part T0(z) is called
poly_T0 and the even part T1(z) is called poly_T1. This step is outsourced into an
extra component called poly_split.
This component scans the input polynomial coefficient-wise and computes the
square root of each coefficient. Depending on the origin of the coefficient (i.e., odd
or even index), the square root is appended to the odd or even result polynomial.
Note that, by construction, the resulting polynomials can only have up to 14
coefficients, and thus only 14 ⋅ 11 = 154 bits are reserved for storing each polynomial.
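The coefficient-wise square root relies on squaring being a field automorphism of GF(2^m), so sqrt(a) = a^(2^(m-1)). The following Python sketch is an illustration only; the reduction polynomial x^11 + x^2 + 1 is an assumption, not taken from the thesis:

```python
M = 11
RED = 0x805  # ASSUMED reduction polynomial x^11 + x^2 + 1 for GF(2^11)

def gf_mul(a: int, b: int) -> int:
    # carry-less shift-and-add multiplication with on-the-fly reduction
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= RED
    return r

def gf_sqrt(a: int) -> int:
    # squaring is an automorphism, so sqrt(a) = a^(2^(M-1)):
    # squaring M-1 more times walks the Frobenius orbit back to the root
    for _ in range(M - 1):
        a = gf_mul(a, a)
    return a

def split_and_sqrt(T):
    """Model of poly_split: split the coefficient list of T(z) + z into even-
    and odd-indexed coefficients and take the square root of each one."""
    even = [gf_sqrt(c) for c in T[0::2]]
    odd = [gf_sqrt(c) for c in T[1::2]]
    return even, odd
```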
Figure 6.18.: Interface of the poly-split Component

After getting the odd and even parts of T(z) + z, its square root is computed
according to Equation (3.12). This step requires the only polynomial multiplication,
for which a dedicated component is designed.


Figure 6.19.: Interface of the poly-mulWT1 Component

This component is iteratively fed with the coefficients of the polynomial w (recall
Equation (3.12)), which are read from a constant ROM in BRAM, and with the
polynomial poly_T1. After each iteration the intermediate results are summed up.
Then poly_T1 is shifted to the left and reduced if necessary. For the reduction, the
multiplier is reused: if poly_T1 reaches degree 27, the Goppa polynomial is multiplied
with the leading coefficient of poly_T1 and the result is added to poly_T1. Thus the
leading coefficient of poly_T1 is set to zero, which yields the required reduction. The
output value is called poly_WxT1, because it is computed from w(z) ⋅ T1(z). We
deliberately did not choose a design that multiplies a polynomial in one clock cycle,
in order to save slices. Since this component is used only once per decryption, the
slower performance is negligible.

6.3.2. Solving the Key Equation


After the computation of R(z) completes, it is fed to the poly_EEA component,
which now runs in a special mode. Via the input value stop_EEA, one can decide
whether poly_EEA runs in normal mode (stop_EEA = "00000") or in solve-key-equation
mode (stop_EEA ≠ "00000"). In this mode, poly_EEA stops the algorithm when the


Figure 6.20.: Interface of the poly-mul Component

degree of A(z) drops below this stop value, and the normalization step is omitted. In
this mode the second output polynomial of poly_EEA is required, and again slices are
saved by reserving registers for only 14 coefficients, because the second polynomial
cannot grow larger than that. See Section 3.6 for the reason for this behavior. These
two polynomials are then fed to the comp_sigma component.

6.3.3. Computing the Error Locator Polynomial Sigma



Figure 6.21.: Interface of the comp-sigma Component

This component simultaneously squares the coefficients of both input polynomials
by assigning the same value to both inputs of the multiplier described in
Section 5.1.1, and assigns the results to the correct positions of poly_sigma, which is
a 308 bit std_logic_vector. Afterwards, poly_sigma is fed to the component
chien_search, which searches for all roots of σ(z) and indicates via the signals valid
and ready whether one root or all roots have been found, respectively.

6.3.4. Searching Roots


For searching the roots of the error locator polynomial, an extra component was
designed. It reads the whole polynomial poly_sigma at once and outputs the found
roots in the polynomial representation of GF(2^m). The port of this component is
shown in Figure 6.22.


Figure 6.22.: Interface of the chien_search Component

Remember that the Chien search initializes an array with the coefficients of
poly_sigma and updates this array according to Equation (3.17). If, at any time, the
sum over the array elements equals zero, a root has been found. The Chien array in
this component is stored in an array of std_logic_vectors to make the synthesis tool
implement it as RAM.

type array_t is array(27 downto 0) of std_logic_vector(10 downto 0);


signal chien_array: array_t;

This array is initialized with the coefficients of poly_sigma and then iteratively
updated. After each update, the sum over all array elements is computed, and if the
sum becomes zero, the currently tested field element is passed on to the goppa_decode
component. To avoid a timing-based side channel attack, as mentioned in [30], the
iterations are not stopped once the maximum number of possible roots has been found.
Instead, the component continues until all field elements have been checked.
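The incremental update of the Chien array can be sketched in software. This Python model is an illustration, not the thesis code; the choice of the element 2 (i.e., x) as the generator alpha and of the primitive polynomial x^11 + x^2 + 1 are assumptions:

```python
M = 11
RED = 0x805  # ASSUMED primitive polynomial x^11 + x^2 + 1 for GF(2^11)

def gf_mul(a: int, b: int) -> int:
    # carry-less shift-and-add multiplication with on-the-fly reduction
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= RED
    return r

def chien_search(sigma):
    """Test every nonzero field element alpha^e as a root of sigma(z).
    Instead of re-evaluating the polynomial, array entry i is multiplied by
    alpha^i in every step, mirroring the update of Equation (3.17). As in
    the hardware, the loop never stops early."""
    n = len(sigma)
    step, a = [], 1
    for _ in range(n):
        step.append(a)        # step[i] = alpha^i with alpha = 2 (the element x)
        a = gf_mul(a, 2)
    arr = list(sigma)         # chien_array, initialized with the coefficients
    roots, cur = [], 1
    for _ in range(1, 1 << M):            # all 2^11 - 1 nonzero elements
        arr = [gf_mul(arr[i], step[i]) for i in range(n)]
        cur = gf_mul(cur, 2)              # cur = alpha^e, element under test
        s = 0
        for x in arr:
            s ^= x                        # sum over the array = sigma(alpha^e)
        if s == 0:
            roots.append(cur)
    return roots
```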
Each time a new root is reported to goppa_decode, this root (in polynomial
representation) is used as an address into another ROM, InvSList. The 11 bit value
located at this address is the ciphertext bit position corresponding to the root.
Similar to the permutation operation, the upper eight bits address a byte in the
ciphertext RAM cipher_BRAM_buff and the lower three bits select a bit of this byte.
Due to the properties described in Section 3.3, each root corresponds to the position
of an error bit. This bit is now corrected (by toggling it) and the whole byte is written
back to cipher_BRAM_buff. Because checking roots in chien_search always takes
more time than correcting a bit error, no synchronization is needed, and goppa_decode
can correct the error and afterwards wait for the next root.
When chien_search indicates completion, goppa_decode is also finished. Then the
transfer of the corrected ciphertext back to mce_decrypt can be started in a byte-wise
manner.

6.4. Reverting the Substitution S

As mentioned in Section 2.5, we do not store the inverse scrambling matrix S−1,
but generate it on the fly with a fast and small PRNG based on the PRESENT block
cipher [10]. Instead of storing the matrix as a whole, only a small 80 bit IV is stored
in the configuration bit stream. Refer to Section 6.1.2 for details of the PRNG used.
This IV is used as seed for PRESENT: bits 0 to 63 are used as plaintext, and the
whole IV is used as key. The produced ciphertext is the random output and also
the plaintext for the next iteration. Overall, ⌈1751²/64⌉ = 47,907 runs of PRESENT
are required to generate S−1. The outputs are interpreted as 8 × 8 submatrices whose
single bits are written column-wise. For example, if the random word is r0, ⋅⋅⋅, r63,
then the corresponding submatrix, say S0,0, is
       ⎡ r0  r8   ⋅⋅⋅  r56 ⎤
       ⎢ r1  r9   ⋅⋅⋅  r57 ⎥
S0,0 = ⎢ ..  ..        ..  ⎥                                              (6.1)
       ⎣ r7  r15  ⋅⋅⋅  r63 ⎦
Actually, the complete matrix S−1 is built row-wise from these submatrices. The goal
of this method is to minimize memory accesses for reading and writing message
bytes of m̂. This way, every byte of m̂ has to be read only once, and when one row
of submatrices is completed, this byte is not required anymore in the later process.
Note that the virtually generated matrix is one bit larger in each dimension than
necessary. The additional bit in each row is removed manually, and the additional
column is handled by padding a 0 bit to m̂, which does not affect the computed result.
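The column-wise bit layout of Equation (6.1) can be stated compactly in software. This Python sketch is for illustration only and is not part of the thesis design:

```python
def submatrix_from_word(rand64: int):
    """Model of Equation (6.1): interpret a 64-bit PRNG word as an 8x8
    submatrix of S^-1, filled column-wise. Element (row, col) is bit
    8*col + row, i.e. bit 0 -> row 0 of column 0, bit 8 -> row 0 of column 1."""
    return [[(rand64 >> (8 * col + row)) & 1 for col in range(8)]
            for row in range(8)]
```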
To allow maximum throughput, the PRNG has its own clock domain clk_prng. Due
to its simple and regular structure, the PRNG can be clocked at up to double the
frequency of the rest of the design. Nevertheless, it turned out that the multiplication
procedure processes one random word in four clock cycles (five clock cycles if a
column increment occurs) and then waits about 14 clock cycles until the PRNG has
generated the next random word.
Now the matrix multiplication reads the intermediate message byte mi,j and the
message byte m̂i from YandMbuff and multiplies m̂i with Si,j. The code used here is
the same as shown in Listing 6.1. The result is added to mi,j; afterwards mi,j is
written back to YandMbuff and mi,j+1 is read. Figure 6.23 shows a timing diagram of
the multiplication routine.
When the multiplication is finished, the decryption is also complete. Now the
content of YandMbuff is fed byte by byte to the UART and sent to the host PC.


Figure 6.23.: Timing Diagram of the Matrix Multiplication

After sending out the last byte the state machine switches back to idle state and is
ready for the next decryption.
7. Results
We now present the results of the McEliece implementation providing 80 bit security
(n = 2048, k = 1751, t = 27) on Xilinx Spartan-3 FPGAs. We report performance
figures for a Xilinx Spartan-3 XC3S200-5 FPGA performing encryption and a Xilinx
Spartan-3 XC3S2000-5 FPGA performing decryption, based on results obtained using
Xilinx ISE 10.1. The resource requirements of the FPGA implementations after
place-and-route (PAR) are shown in Table 7.1.

Table 7.1.: Implementation results of the McEliece scheme with n = 2048, k =
1751, t = 27 on a Spartan-3 FPGA after PAR

                        Resource        With UART and DEBUG   Without UART and DEBUG   Available
Encryption (XC3S200)    Slices          796 (41%)             694 (36%)                1,920
                        LUTs            1,082 (28%)           870 (22%)                3,840
                        FFs             1,101 (28%)           915 (23%)                3,840
                        BRAMs           1 (8%)                1 (8%)                   12
                        Flash Memory*   4,644 Kbits           -                        16,896 Kbits
Decryption (XC3S2000)   Slices          12,977 (63%)          12,443 (60%)             20,480
                        LUTs            17,974 (43%)          18,637 (45%)             40,960
                        FFs             9,985 (24%)           9,894 (23%)              40,960
                        BRAMs           22 (55%)              22 (55%)                 40
                        Flash Memory    11,218 (100%)         4,644 Kbits              16,896 Kbits
* For holding the bitstream file.

Table 7.2 summarizes the clock cycles needed for every part of the encryption and
decryption routines of the FPGA implementation.
In our FPGA design, the CPRNG based on the PRESENT block cipher that generates
S−1 turns out to be a bottleneck of our implementation, since the matrix generation
does not meet the performance of the matrix multiplication. Since designing
CPRNGs for FPGAs is complex, this is not in the scope of this thesis. Hence, Table 7.2
gives estimates for the case that an ideal PRNG (which does not incur any wait cycles
due to throughput limitations) is available. This PRNG must be about

Table 7.2.: Performance of McEliece implementations with n = 2048, k = 1751, t = 27
on the Spartan-3 FPGAs.

             Aspect                           Spartan3-200     Spartan3-2000
             Maximum frequency                150 MHz          80 MHz
Encryption   Load Ĝ into SRAM                 180 sec+         -
             Encrypt c' = m ⋅ Ĝ               336,606 cycles   -
             Inject errors                    147 cycles       -
Decryption   Undo permutation c ⋅ P−1         -                combined with Syn(z)
             Determine Syn(z)                 -                360,184 cycles
             Compute T = Syn(z)−1             -                625 cycles
             Compute T + z                    -                487 cycles
             Solve Equation (3.13) with EEA   -                312 cycles
             Correct errors                   -                312,328 cycles
             Undo scrambling m̂ ⋅ S−1          -                1,035,684/188,306* cycles
+ Limited by the UART baud rate of 19,200 baud.
* This figure is an estimate assuming that an ideal PRNG for the generation of S−1 would be
available.

5.5 times faster than the one used by us.

The public-key cryptosystems RSA-1024 and ECC-P160 are assumed1 to achieve a
roughly similar margin of 80 bit symmetric security [15]. We finally compare our
results to published implementations of these systems that target similar platforms
(i.e., Xilinx Spartan-3 FPGAs). Note that the figures for ECC are obtained from the
ECDSA signature scheme.
Note that all throughput figures are based on the number of plaintext bits processed
by each system and do not take any message expansion in the ciphertext into
account. If our target board for encryption were equipped with DDR2-333, we could
double the performance. This would result in 1.07 ms/op and a throughput of
1,626,517 bits/sec. Note, however, that an additional DDR2 memory controller
consumes a significant amount of the logic resources of the Spartan-3 device, but
still fits into our target FPGA.
¹ According to [15], RSA-1248 actually corresponds to 80 bit symmetric security. However, no
implementation results for embedded systems are available for this key size.

Table 7.3.: Comparison of our McEliece designs with single-core ECC and RSA im-
plementations for 80 bit security.

Method                   Platform             Time (ms/op)     Throughput (bits/sec)
McEliece encryption      Spartan-3 200-5      2.24             779,948
McEliece decryption      Spartan-3 2000-5     21.61/10.82+     81,023/161,829+
ECC-P160 [22]            Spartan-3 1000-4     5.1              31,200
RSA-1024 random [25]     Spartan-3E 1500-5    51               20,275
+ These are estimates assuming that an ideal PRNG for generating S−1 would be used.
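The time and throughput figures for our designs can largely be reproduced from the cycle counts in Table 7.2 and the clock frequencies. A short Python cross-check (the small remaining gap for decryption is control overhead not itemized in Table 7.2):

```python
k = 1751                              # plaintext bits per operation

# Encryption on the Spartan3-200 at 150 MHz (cycle counts from Table 7.2)
enc_cycles = 336_606 + 147            # matrix multiplication + error injection
t_enc = enc_cycles / 150e6            # seconds per operation
tp_enc = k / t_enc                    # plaintext bits per second
print(f"encryption: {t_enc * 1e3:.3f} ms/op, {tp_enc:.0f} bits/sec")
# ≈ 2.245 ms/op, ≈ 779,948 bits/sec (Table 7.3: 2.24 ms, 779,948)

# Decryption on the Spartan3-2000 at 80 MHz (PRESENT-based CPRNG)
dec_cycles = 360_184 + 625 + 487 + 312 + 312_328 + 1_035_684
dec_ms = dec_cycles / 80e6 * 1e3
print(f"decryption: {dec_ms:.2f} ms/op")
# ≈ 21.4 ms/op; Table 7.3 lists 21.61 ms, the difference being
# control overhead not itemized in Table 7.2
```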

We were unable to find performance figures for RSA encryption on a comparable
Spartan-3 FPGA. RSA encryption can use a small public exponent and should
therefore be faster than with a random exponent. For another important post-
quantum candidate, NTRU [29], we were also unable to find an encryption
implementation for FPGAs of the Spartan-3 class. A general overview of NTRU on
constrained devices is given in [ntru2].
We also tried to put both encryption and decryption onto a single FPGA. Due to
the high resource utilization of the decryption routine this was only possible on a
Spartan3AN-1400. This device has a large internal flash memory and can thus store
the private key in this protected area (see Section 4.1.3). The Spartan3AN family is
also capable of multi-booting from different bitstream files in the flash. Therefore,
both bitstreams for encryption and decryption can be put into the internal flash and
the FPGA can reconfigure itself at run-time. Because reconfiguration is slow (about
50 ms)², this is only practicable if it is not executed too often. The encryption func-
tion achieves the same performance level as on the Spartan3-200 FPGA, while de-
cryption runs about 5 MHz slower. The reason is the lower number of slices in the
Spartan3AN-1400 (11,264 instead of 20,480), which leads to longer critical paths
after PAR and therefore to a lower maximum frequency.
To evaluate how the performance increases when using high-performance FPGAs,
the design was also implemented on Virtex-4 and Virtex-5 FPGAs. Table 7.4 sum-
marizes the resource requirements, including the UART and the same debug com-
ponents as for the Spartan-3 FPGAs.
This would result in a throughput of 1.3 MBits/sec for encryption on a Virtex4 and
² T_BITLOAD = (bitstream length in bits) / ((clock frequency in Hz) × (configuration port width in
bits)) [1]
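As an illustration of footnote 2, the formula can be evaluated for assumed parameters. The bitstream length, configuration clock and port width below are illustrative placeholders, not the exact Spartan3AN-1400 values; they merely show how a configuration time in the tens of milliseconds arises:

```python
def t_bitload(bitstream_bits, clock_hz, port_width_bits):
    """Configuration time according to Xilinx XAPP457 [1]:
    T_BITLOAD = length / (frequency * port width)."""
    return bitstream_bits / (clock_hz * port_width_bits)

# Hypothetical example values, NOT the exact Spartan3AN-1400 parameters:
# a 4,644 KBit bitstream loaded over a 4-bit-wide port at 25 MHz
t = t_bitload(4_644 * 1024, 25e6, 4)
print(f"{t * 1e3:.0f} ms")  # lands in the tens-of-milliseconds range
```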

Table 7.4.: Implementation results of the McEliece scheme with n = 2048, k =
1751, t = 27 on Virtex FPGAs after PAR.

               Resource            Encryption       Decryption       Available
Virtex4-lx40   Maximum frequency   250 MHz          140 MHz
               Slices              1,056 ( 5%)      13,162 (72%)     18,432
               LUTs                1,204 ( 3%)      19,343 (52%)     36,864
               FFs                 1,158 ( 3%)      10,861 (29%)     36,864
               BRAMs               1 ( 1%)          22 (22%)         96
               Flash Memory*       1,497 KBits      1,497 KBits      –

Virtex5-lx50   Maximum frequency   350 MHz          150 MHz
               Slices              369 ( 5%)        4,896 (68%)      7,200
               LUTs                944 ( 3%)        13,227 (46%)     28,800
               FFs                 973 ( 3%)        10,051 (34%)     28,800
               BRAMs               1 ( 2%)          12 (25%)         48
               Flash Memory*       1,533 KBits      1,533 KBits      –
* For holding the bitstream file.

284 KBit/sec for decryption. The Virtex5 can achieve 1.82 MBits/sec for encryption
and 304 KBit/sec for decryption.
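These Virtex figures follow from the same cycle counts as in Table 7.2, scaled by the respective clock frequencies; the decryption count uses the ideal-PRNG estimate. A short cross-check in Python:

```python
k = 1751                              # plaintext bits per operation
enc_cycles = 336_606 + 147            # encryption cycles (Table 7.2)
# decryption cycles with the ideal-PRNG estimate for undoing the scrambling
dec_cycles = 360_184 + 625 + 487 + 312 + 312_328 + 188_306

enc_v4 = k * 250e6 / enc_cycles       # Virtex4 encryption, bits/sec
dec_v4 = k * 140e6 / dec_cycles       # Virtex4 decryption, bits/sec
enc_v5 = k * 350e6 / enc_cycles       # Virtex5 encryption, bits/sec
dec_v5 = k * 150e6 / dec_cycles       # Virtex5 decryption, bits/sec
print(f"Virtex4: {enc_v4 / 1e6:.2f} MBit/s enc, {dec_v4 / 1e3:.0f} KBit/s dec")
print(f"Virtex5: {enc_v5 / 1e6:.2f} MBit/s enc, {dec_v5 / 1e3:.0f} KBit/s dec")
# ≈ 1.3 MBit/s and 284 KBit/s (Virtex4), ≈ 1.82 MBit/s and 305 KBit/s (Virtex5)
```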
8. Discussion
Finally, we summarize the findings of this thesis. Open issues and aspects not
considered here are stated in Section 8.3 to point out directions for future research.

8.1. Conclusions

We have implemented the McEliece cryptosystem with special attention to the
memory limitations found in all modern FPGAs. This implementation encrypts
2.5 times faster than ECC and 23 times faster than RSA; decryption is about two
times slower than ECC, but 5 times faster than RSA. Taking the output rate into
account, this implementation encrypts about 25 times faster than ECC-160 with
respect to the plaintext size and decrypts at a five times higher rate. In comparison
to RSA-1024, it achieves 38 times the output rate in encryption; decryption is eight
times faster than RSA-1024. Additionally, we developed a method to significantly
reduce the secret key size, which is important since secret key material has to be
stored in a protected area.

Although the underlying algorithms consist only of matrix multiplication, polyno-
mial field arithmetic and the extended Euclidean algorithm, we were unable to
reach the performance we expected, owing to the large polynomials and the huge
amount of data to be stored and processed. Encryption requires wider RAMs to
read in as much data as possible at once. Decryption would benefit from faster
PRNGs and shorter polynomials. When the size of the underlying field is increased,
fewer errors are required for the same level of security. The polynomials can then
have a lower degree and thus a shorter total bit length. This should result in more
effective routing, which accounts for about 80% of the delay in our implementation.

Nevertheless, we believe that with growing memories in embedded systems and
with ongoing research and optimization, McEliece can evolve into a suitable,
quantum-computer-resistant alternative to RSA and ECC, which have been studied
extensively for years.

8.2. Problems

Depending on the parameters chosen for PAR, the maximum achievable frequency
is in the range of 70 to 80 MHz on the XC3S2000 FPGA. The critical path always lies
in the EEA, between the get_degree and poly_mul_shift_add components. Inserting a
pipeline flip-flop stage into one of these components would double the number of
clock cycles required without doubling the frequency. About 80% of the signal
delay is due to routing. We found that synthesis had problems placing logically
connected parts close together on the chip. For example, the log/antilog BRAM was
placed in one corner of the chip, while the logic working on its output was placed
in the opposite corner. This issue could not be corrected by manual placement due
to other constraints; often, signal paths would become longer, resulting in even
worse critical paths.

8.3. Outlook for Further Work

As mentioned in Section 2.5, some research has been done on reducing the public
key size. These approaches reduce the public key size at the cost of additional
computations during encryption and decryption. For our purposes, however,
reducing the secret key size is more important and should be investigated further.
The impact of different field sizes at the same security level is also still open for
further research. Especially for implementations that perform the field arithmetic
with table lookups, it can be an advantage if the tables fit exactly into the memory
and have a word size that is a multiple of the register size of the target.
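For our field GF(2^11) with reduction polynomial x^11 + x^2 + 1, these table sizes are easy to quantify. The following Python sketch builds log/antilog tables under the assumption that x generates the multiplicative group (which holds for this primitive trinomial) and multiplies via two lookups and one modular addition:

```python
POLY, DEG = 0b100000000101, 11   # p(x) = x^11 + x^2 + 1

# antilog[i] = x^i (2047 entries of 11 bits each, about 22 KBit per table);
# log is the inverse map over the 2^11 field elements
antilog, log = [0] * (2**DEG - 1), [0] * 2**DEG
a = 1
for i in range(2**DEG - 1):
    antilog[i], log[a] = a, i
    a <<= 1                      # multiply by x ...
    if a >> DEG:
        a ^= POLY                # ... and reduce modulo p(x)

def gf_mul_log(x, y):
    """Multiply in GF(2^11) with two table lookups and one modular add."""
    if x == 0 or y == 0:
        return 0
    return antilog[(log[x] + log[y]) % (2**DEG - 1)]

# x * x^10 = x^11 ≡ x^2 + 1 (mod p)
assert gf_mul_log(0b10, 1 << 10) == 0b101
```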
Furthermore, the ongoing research on replacing Goppa codes with other, more com-
pactly representable codes should be intensified. Note that all replacements for the
Goppa codes proposed so far have been broken.
In a future implementation, other ways to compute the parity check matrix should
be tested. Finally, we plan to implement the Niederreiter scheme, which is closely
related to McEliece, to compare the efficiency of both systems at the same level of
security.
A. Tables
Table A.1.: Display Characters and Resulting Control Values
Character    Control Value (abcdefg)

0 (0000001)
1 (1001111)
2 (0010010)
3 (0000110)
4 (1001100)
5 (0100100)
6 (0100000)
7 (0001111)
8 (0000000)
9 (0000100)
A (0001000)
B (1100000)
C (0110001)
D (1000010)
E (0110000)
F (0111000)
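The control values in Table A.1 are active low: a 0 turns the corresponding segment (a through g) on. A small Python sanity check of this encoding:

```python
# Active-low 7-segment control values (a,b,c,d,e,f,g) from Table A.1
SEG = {
    '0': "0000001", '1': "1001111", '2': "0010010", '3': "0000110",
    '4': "1001100", '5': "0100100", '6': "0100000", '7': "0001111",
    '8': "0000000", '9': "0000100", 'A': "0001000", 'B': "1100000",
    'C': "0110001", 'D': "1000010", 'E': "0110000", 'F': "0111000",
}

def lit_segments(ch):
    """Return the set of segment names that are ON (active low: bit = '0')."""
    return {name for name, bit in zip("abcdefg", SEG[ch]) if bit == '0'}

assert lit_segments('8') == set("abcdefg")   # '8' lights all seven segments
assert lit_segments('1') == {'b', 'c'}       # '1' lights only the right-hand bars
```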

Table A.2.: Number of XORs for Multiplication.


Type of XOR Count

XOR2 2
XOR3 2
XOR4 1
XOR5 2
XOR6 2
XOR7 2
XOR8 2
XOR9 2
XOR10 1
XOR11 2
XOR12 2
B. Magma Functions

Listing B.1: Construction of the Mapping Matrix


function image(Ghat)
    n := Ncols(Ghat);
    k := Nrows(Ghat);
    count := 1;
    repeat
        sub := ColumnSubmatrix(Ghat, count, k);
        count := count + 1;
    until (IsUnit(sub) or (count eq (n-k)));
    count := count - 1;
    sub := sub^-1;
    iG := Zero(KMatrixSpace(GF(2), n, k));
    InsertBlock(~iG, sub, count, 1);
    return iG;
end function;

Listing B.2: Present Block Cipher in Magma


F := FiniteField(2);
plainV := VectorSpace(F, 64);
keyV := VectorSpace(F, 80);

// REMARK: Magma indexes from (1 to n) while C indexes from (n-1 downto 0)
function sbox(nibble)
    out := case<nibble |
        [0,0,0,0]: [0,0,1,1],
        [1,0,0,0]: [1,0,1,0],
        [0,1,0,0]: [0,1,1,0],
        [1,1,0,0]: [1,1,0,1],
        [0,0,1,0]: [1,0,0,1],
        [1,0,1,0]: [0,0,0,0],
        [0,1,1,0]: [0,1,0,1],
        [1,1,1,0]: [1,0,1,1],
        [0,0,0,1]: [1,1,0,0],
        [1,0,0,1]: [0,1,1,1],
        [0,1,0,1]: [1,1,1,1],
        [1,1,0,1]: [0,0,0,1],
        [0,0,1,1]: [0,0,1,0],
        [1,0,1,1]: [1,1,1,0],
        [0,1,1,1]: [1,0,0,0],
        [1,1,1,1]: [0,1,0,0],
        default: [0,0,0,0]>;
    return out;
end function;

function perm(data)
    out := Zero(plainV);
    for i in [0..63] do
        out[((i mod 4)*16 + Floor(i/4)) + 1] := data[i+1];
    end for;
    return out;
end function;

function present(plain, key)
    data := plain;
    for count in [1..31] do
        keyseq := Eltseq(key);
        roundkey := plainV ! keyseq[17..80];
        print "Roundkey:", Reverse(Eltseq(roundkey));
        // add roundkey
        data := data + roundkey;
        dataseq := Eltseq(data);
        // sbox layer
        for i in [0..15] do
            subst_data := sbox(dataseq[4*i+1..4*i+4]);
            for j in [1..4] do
                data[4*i+j] := subst_data[j];
            end for;
        end for;
        // permutation layer
        data := perm(data);
        // compute the next roundkey: rotate, sbox on the top nibble,
        // then add the round counter
        key := Rotate(key, 61);
        keyseq := Eltseq(key);
        key_nibble := keyseq[77..80];
        subst_key := sbox(key_nibble);
        for j in [1..4] do
            key[76+j] := subst_key[j];
        end for;
        tmp := count;
        if (tmp ge 16) then tmp := tmp - 16; key[20] := key[20] + 1; end if;
        if (tmp ge 8)  then tmp := tmp - 8;  key[19] := key[19] + 1; end if;
        if (tmp ge 4)  then tmp := tmp - 4;  key[18] := key[18] + 1; end if;
        if (tmp ge 2)  then tmp := tmp - 2;  key[17] := key[17] + 1; end if;
        if (tmp ge 1)  then tmp := tmp - 1;  key[16] := key[16] + 1; end if;
    end for;
    // final key addition
    keyseq := Eltseq(key);
    roundkey := plainV ! keyseq[17..80];
    print "final roundkey:", Reverse(Eltseq(roundkey));
    data := data + roundkey;
    return data;
end function;
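To validate Listing B.2 against the published PRESENT-80 test vectors, an independent reference implementation is useful. The following Python sketch is not a transliteration of the Magma code (which works with Magma's 1-based, LSB-first bit sequences) but the usual integer formulation of the cipher; it reproduces the all-zero test vector from the PRESENT paper [10]:

```python
SBOX = [0xC, 5, 6, 0xB, 9, 0, 0xA, 0xD, 3, 0xE, 0xF, 8, 4, 7, 1, 2]

def present80(plain, key):
    """PRESENT-80 encryption: 31 rounds plus a final key addition."""
    state = plain
    for rnd in range(1, 32):
        state ^= key >> 16                       # addRoundKey: leftmost 64 key bits
        state = sum(SBOX[(state >> 4*i) & 0xF] << 4*i for i in range(16))
        perm = 0                                 # pLayer: bit i -> 16*i mod 63
        for i in range(64):
            perm |= ((state >> i) & 1) << (63 if i == 63 else (16 * i) % 63)
        state = perm
        key = ((key << 61) | (key >> 19)) & (2**80 - 1)      # rotate left by 61
        key = (SBOX[key >> 76] << 76) | (key & (2**76 - 1))  # S-box on top nibble
        key ^= rnd << 15                         # XOR round counter into k19..k15
    return state ^ (key >> 16)

# Published test vector: all-zero plaintext and all-zero key
assert present80(0, 0) == 0x5579C1387B228445
```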

Listing B.3: PRNG based on Present in Magma


load "local/present.mag";

function init_prng(IV)
    key := IV;
    IVseq := Eltseq(IV);
    plain := plainV ! IVseq[1..64];
    count := 0;
    return <plain, key, count>;
end function;

function prng(state)
    plain := state[1];
    key := state[2];
    count := state[3];
    plain := present(plain, key);
    // derive the next key: rotate and add the call counter
    key := Rotate(key, 1);
    tmp := count;
    if (tmp ge 16) then tmp := tmp - 16; key[5] := key[5] + 1; end if;
    if (tmp ge 8)  then tmp := tmp - 8;  key[4] := key[4] + 1; end if;
    if (tmp ge 4)  then tmp := tmp - 4;  key[3] := key[3] + 1; end if;
    if (tmp ge 2)  then tmp := tmp - 2;  key[2] := key[2] + 1; end if;
    if (tmp ge 1)  then tmp := tmp - 1;  key[1] := key[1] + 1; end if;
    count := (count + 1) mod 2^5;
    return <plain, key, count>;
end function;

Listing B.4: Generation and Testing for the Substitution Matrix


function gen_S(dim)
    versuch := 1;
    count := 0;
    BlockBound := Ceiling(dim/8);
    BitBound := 8 * BlockBound;
    repeat
        RowCnt := 0;
        ColCnt := 0;
        BlockRowCnt := 0;
        BlockColCnt := 0;
        printf "Versuch %o\n", versuch;
        IV := Random(keyV);
        state := init_prng(IV);
        state := prng(state);
        versuch := versuch + 1;
        S := Zero(KMatrixSpace(F, BitBound, BitBound));
        // fill S with 8x8 blocks, each block column by column
        for i in [1..BitBound^2] do
            count := count + 1;
            S[8*BlockRowCnt+RowCnt+1][8*BlockColCnt+ColCnt+1] := state[1][count];
            // reload prng
            if (count eq 64) then
                state := prng(state);
                count := 0;
            end if;
            if (RowCnt lt 7) then
                RowCnt := RowCnt + 1;
            else
                RowCnt := 0;
                if (ColCnt lt 7) then
                    ColCnt := ColCnt + 1;
                else
                    ColCnt := 0;
                    if (BlockColCnt lt BlockBound - 1) then
                        BlockColCnt := BlockColCnt + 1;
                    else
                        BlockColCnt := 0;
                        BlockRowCnt := BlockRowCnt + 1;
                        print "Now working in BlockRow", BlockRowCnt;
                    end if;
                end if;
            end if;
        end for;
    until IsUnit(Submatrix(S, 1, 1, dim, dim));
    return Submatrix(S, 1, 1, dim, dim), S, IV;
end function;

function build_S(IV, dim)
    BlockBound := Ceiling(dim/8);
    BitBound := 8 * BlockBound;
    count := 0;
    RowCnt := 0;
    ColCnt := 0;
    BlockRowCnt := 0;
    BlockColCnt := 0;
    state := init_prng(IV);
    state := prng(state);
    S := Zero(KMatrixSpace(F, BitBound, BitBound));
    // fill S with 8x8 blocks, each block column by column
    for i in [1..BitBound^2] do
        count := count + 1;
        S[8*BlockRowCnt+RowCnt+1][8*BlockColCnt+ColCnt+1] := state[1][count];
        // reload prng
        if (count eq 64) then
            state := prng(state);
            count := 0;
        end if;
        if (RowCnt lt 7) then
            RowCnt := RowCnt + 1;
        else
            RowCnt := 0;
            if (ColCnt lt 7) then
                ColCnt := ColCnt + 1;
            else
                ColCnt := 0;
                if (BlockColCnt lt BlockBound - 1) then
                    BlockColCnt := BlockColCnt + 1;
                else
                    BlockColCnt := 0;
                    BlockRowCnt := BlockRowCnt + 1;
                    print "Now working in BlockRow", BlockRowCnt;
                end if;
            end if;
        end if;
    end for;
    print IsUnit(Submatrix(S, 1, 1, dim, dim));
    return Submatrix(S, 1, 1, dim, dim);
end function;
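The repeat loop in gen_S draws fresh random matrices until an invertible one is found. For a uniformly random k x k matrix over GF(2), the probability of invertibility is the product of (1 - 2^-i) for i = 1..k, which converges quickly to about 0.289, so roughly 3.5 attempts ("Versuch") are needed on average. A quick numerical check:

```python
# Probability that a random k x k binary matrix over GF(2) is invertible:
# prod_{i=1..k} (1 - 2^-i); already fully converged for moderate k
p = 1.0
for i in range(1, 64):
    p *= 1 - 2.0**-i
print(f"P(invertible) ≈ {p:.4f}, expected attempts ≈ {1/p:.1f}")
# ≈ 0.2888, so ≈ 3.5 attempts on average
```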
C. VHDL Code Snippets
Listing C.1: Complete Unrolled Field Multiplication
c(10) <= (a(0) AND b(10)) XOR (a(1) AND b(9)) XOR (a(2) AND b(8)) XOR
         (a(3) AND b(7)) XOR (a(4) AND b(6)) XOR (a(5) AND b(5)) XOR
         (a(6) AND b(4)) XOR (a(7) AND b(3)) XOR (a(8) AND b(2)) XOR
         (a(9) AND b(1)) XOR (a(9) AND b(10)) XOR (a(10) AND b(0)) XOR
         (a(10) AND b(9));

c(9) <= (a(0) AND b(9)) XOR (a(1) AND b(8)) XOR (a(2) AND b(7)) XOR
        (a(3) AND b(6)) XOR (a(4) AND b(5)) XOR (a(5) AND b(4)) XOR
        (a(6) AND b(3)) XOR (a(7) AND b(2)) XOR (a(8) AND b(1)) XOR
        (a(8) AND b(10)) XOR (a(9) AND b(0)) XOR (a(9) AND b(9)) XOR
        (a(10) AND b(8)) XOR (a(10) AND b(10));

c(8) <= (a(0) AND b(8)) XOR (a(1) AND b(7)) XOR (a(2) AND b(6)) XOR
        (a(3) AND b(5)) XOR (a(4) AND b(4)) XOR (a(5) AND b(3)) XOR
        (a(6) AND b(2)) XOR (a(7) AND b(1)) XOR (a(7) AND b(10)) XOR
        (a(8) AND b(0)) XOR (a(8) AND b(9)) XOR (a(9) AND b(8)) XOR
        (a(9) AND b(10)) XOR (a(10) AND b(7)) XOR (a(10) AND b(9));

c(7) <= (a(0) AND b(7)) XOR (a(1) AND b(6)) XOR (a(2) AND b(5)) XOR
        (a(3) AND b(4)) XOR (a(4) AND b(3)) XOR (a(5) AND b(2)) XOR
        (a(6) AND b(1)) XOR (a(6) AND b(10)) XOR (a(7) AND b(0)) XOR
        (a(7) AND b(9)) XOR (a(8) AND b(8)) XOR (a(8) AND b(10)) XOR
        (a(9) AND b(7)) XOR (a(9) AND b(9)) XOR (a(10) AND b(6)) XOR
        (a(10) AND b(8));

c(6) <= (a(0) AND b(6)) XOR (a(1) AND b(5)) XOR (a(2) AND b(4)) XOR
        (a(3) AND b(3)) XOR (a(4) AND b(2)) XOR (a(5) AND b(1)) XOR
        (a(5) AND b(10)) XOR (a(6) AND b(0)) XOR (a(6) AND b(9)) XOR
        (a(7) AND b(8)) XOR (a(7) AND b(10)) XOR (a(8) AND b(7)) XOR
        (a(8) AND b(9)) XOR (a(9) AND b(6)) XOR (a(9) AND b(8)) XOR
        (a(10) AND b(5)) XOR (a(10) AND b(7));

c(5) <= (a(0) AND b(5)) XOR (a(1) AND b(4)) XOR (a(2) AND b(3)) XOR
        (a(3) AND b(2)) XOR (a(4) AND b(1)) XOR (a(4) AND b(10)) XOR
        (a(5) AND b(0)) XOR (a(5) AND b(9)) XOR (a(6) AND b(8)) XOR
        (a(6) AND b(10)) XOR (a(7) AND b(7)) XOR (a(7) AND b(9)) XOR
        (a(8) AND b(6)) XOR (a(8) AND b(8)) XOR (a(9) AND b(5)) XOR
        (a(9) AND b(7)) XOR (a(10) AND b(4)) XOR (a(10) AND b(6));

c(4) <= (a(0) AND b(4)) XOR (a(1) AND b(3)) XOR (a(2) AND b(2)) XOR
        (a(3) AND b(1)) XOR (a(3) AND b(10)) XOR (a(4) AND b(0)) XOR
        (a(4) AND b(9)) XOR (a(5) AND b(8)) XOR (a(5) AND b(10)) XOR
        (a(6) AND b(7)) XOR (a(6) AND b(9)) XOR (a(7) AND b(6)) XOR
        (a(7) AND b(8)) XOR (a(8) AND b(5)) XOR (a(8) AND b(7)) XOR
        (a(9) AND b(4)) XOR (a(9) AND b(6)) XOR (a(10) AND b(3)) XOR
        (a(10) AND b(5));

c(3) <= (a(0) AND b(3)) XOR (a(1) AND b(2)) XOR (a(2) AND b(1)) XOR
        (a(2) AND b(10)) XOR (a(3) AND b(0)) XOR (a(3) AND b(9)) XOR
        (a(4) AND b(8)) XOR (a(4) AND b(10)) XOR (a(5) AND b(7)) XOR
        (a(5) AND b(9)) XOR (a(6) AND b(6)) XOR (a(6) AND b(8)) XOR
        (a(7) AND b(5)) XOR (a(7) AND b(7)) XOR (a(8) AND b(4)) XOR
        (a(8) AND b(6)) XOR (a(9) AND b(3)) XOR (a(9) AND b(5)) XOR
        (a(10) AND b(2)) XOR (a(10) AND b(4));

c(2) <= (a(0) AND b(2)) XOR (a(1) AND b(1)) XOR (a(1) AND b(10)) XOR
        (a(2) AND b(0)) XOR (a(2) AND b(9)) XOR (a(3) AND b(8)) XOR
        (a(3) AND b(10)) XOR (a(4) AND b(7)) XOR (a(4) AND b(9)) XOR
        (a(5) AND b(6)) XOR (a(5) AND b(8)) XOR (a(6) AND b(5)) XOR
        (a(6) AND b(7)) XOR (a(7) AND b(4)) XOR (a(7) AND b(6)) XOR
        (a(8) AND b(3)) XOR (a(8) AND b(5)) XOR (a(9) AND b(2)) XOR
        (a(9) AND b(4)) XOR (a(10) AND b(1)) XOR (a(10) AND b(3)) XOR
        (a(10) AND b(10));

c(1) <= (a(0) AND b(1)) XOR (a(1) AND b(0)) XOR (a(2) AND b(10)) XOR
        (a(3) AND b(9)) XOR (a(4) AND b(8)) XOR (a(5) AND b(7)) XOR
        (a(6) AND b(6)) XOR (a(7) AND b(5)) XOR (a(8) AND b(4)) XOR
        (a(9) AND b(3)) XOR (a(10) AND b(2));

c(0) <= (a(0) AND b(0)) XOR (a(1) AND b(10)) XOR (a(2) AND b(9)) XOR
        (a(3) AND b(8)) XOR (a(4) AND b(7)) XOR (a(5) AND b(6)) XOR
        (a(6) AND b(5)) XOR (a(7) AND b(4)) XOR (a(8) AND b(3)) XOR
        (a(9) AND b(2)) XOR (a(10) AND b(1)) XOR (a(10) AND b(10));
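The unrolled equations implement multiplication in GF(2^11) modulo the trinomial p(x) = x^11 + x^2 + 1; this can be read off the reduction terms, e.g. the a(10) AND b(10) contributions to c(9), c(2) and c(0) correspond to x^20 = x^9 + x^2 + 1 (mod p(x)). A shift-and-add sketch in Python for cross-checking the hardware formulas:

```python
POLY, DEG = 0b100000000101, 11    # p(x) = x^11 + x^2 + 1

def gf2m_mul(a, b):
    """Shift-and-add multiplication in GF(2^11), reducing modulo p(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a               # conditionally accumulate the partial product
        b >>= 1
        a <<= 1                  # multiply a by x ...
        if (a >> DEG) & 1:
            a ^= POLY            # ... and reduce modulo p(x)
    return r

# x * x^10 = x^11 ≡ x^2 + 1 (mod p)
assert gf2m_mul(0b10, 1 << 10) == 0b101
# x^10 * x^10 = x^20 ≡ x^9 + x^2 + 1 (mod p)
assert gf2m_mul(1 << 10, 1 << 10) == (1 << 9) | 0b101
```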

Listing C.2: Performing the Permutation


WHEN perm2 =>
    -- put the next bit into perm_byte
    case perm_number(2 downto 0) is
        WHEN "000" =>
            perm_byte <= perm_byte(1 to 7) & Y_byte_r(0);
        ...
        WHEN "111" =>
            perm_byte <= perm_byte(1 to 7) & Y_byte_r(7);
    end case;
    perm_addr <= perm_addr + 1;
    state <= perm3;
WHEN perm3 =>
    if (perm_addr(2 downto 0) = "000") then
        -- feed the next byte to the goppa decoder
        start_goppa <= '1';
        cipher_byte <= perm_byte;
        byte_ready <= '1';
        if (need_byte = '1') then       -- goppa ready to take
            wait4take <= '1';
        else                            -- don't need, or taken?
            if (wait4take = '1') then   -- taken, continue building the next byte
                wait4take <= '0';
                byte_ready <= '0';
                state <= perm1;
            -- else don't need, waiting
            end if;
        end if;
    else                                -- continue building the current byte
        byte_ready <= '0';
        state <= perm1;
    end if;
D. Bibliography
[1] Xilinx Inc. Powering and Configuring Spartan-3 Generation FPGAs. Application Note
XAPP457. http://www.xilinx.com/support/documentation/application_notes/xapp457.pdf.
[2] C. M. Adams and H. Meijer. Security-Related Comments Regarding McEliece's
Public-Key Cryptosystem, pages 224–228. Springer-Verlag, 1988.
[3] E. R. Berlekamp, R. J. McEliece, and H. C. A. van Tilborg. On the inherent in-
tractability of certain coding problems. IEEE Transactions on Information Theory,
24(3):384–386, 1978.
[4] D. J. Bernstein. List decoding for binary codes. Technical report, University of
Illinois at Chicago, 2008. http://cr.yp.to/codes/goppalist-20081107.pdf.
[5] D. J. Bernstein, T. Lange, and C. Peters. Attacking and defending the McEliece
cryptosystem. Cryptology ePrint Archive, Report 2008/318, 2008.
http://cr.yp.to/codes/mceliece-20080807.pdf.
[6] D. J. Bernstein, T. Lange, and C. Peters. Attacking and defending the McEliece
cryptosystem. In Proceedings of the International Workshop on Post-Quantum
Cryptography – PQCrypto '08, volume 5299 of LNCS, pages 31–46, Berlin,
Heidelberg, 2008. Springer-Verlag.
[7] T. Berson. Failure of the McEliece public-key cryptosystem under message-resend and
related-message attack, pages 213–220. Springer-Verlag, 1997.
[8] J.-L. Beuchat, N. Sendrier, A. Tisserand, and G. Villard. FPGA Implementation
of a Recently Published Signature Scheme. Technical report, INRIA – Institut
National de Recherche en Informatique et en Automatique, 2004.
http://hal.archives-ouvertes.fr/docs/00/07/70/45/PDF/RR-5158.pdf.
[9] A. Bogdanov, T. Eisenbarth, A. Rupp, and C. Wolf. Time-Area Optimized
Public-Key Engines: MQ-Cryptosystems as Replacement for Elliptic Curves?
In Proceedings of the Workshop on Cryptographic Hardware and Embedded Systems –
CHES 2008, volume 5154 of LNCS, pages 45–61. Springer-Verlag, 2008.
[10] A. Bogdanov, L. R. Knudsen, G. Leander, C. Paar, A. Poschmann, M. J. B.
Robshaw, Y. Seurin, and C. Vikkelsoe. PRESENT: An Ultra-Lightweight Block
Cipher. In Proceedings of the Workshop on Cryptographic Hardware and Embedded
Systems – CHES 2007, volume 4727 of LNCS, pages 450–466. Springer-Verlag, 2007.
[11] K. Chapman. 200 MHz UART with Internal 16-Byte Buffer. Xilinx Inc., July 2001.
[12] S. Chari, J. R. Rao, and P. Rohatgi. Template Attacks. In Proceedings of the
Workshop on Cryptographic Hardware and Embedded Systems – CHES 2002, volume
2523 of LNCS, pages 13–28. Springer-Verlag, 2002.
[13] R. Chien. Cyclic decoding procedures for Bose-Chaudhuri-Hocquenghem
codes. IEEE Transactions on Information Theory, IT-10(10):357–363, 1964.
[14] Digilent Inc. http://www.digilent.com.
[15] ECRYPT. Yearly report on algorithms and keysizes (2007–2008). Technical
report, D.SPA.28 Rev. 1.1, July 2008.
http://www.ecrypt.eu.org/documents/D.SPA.10-1.1.pdf.
[16] T. Eisenbarth, T. Güneysu, S. Heyse, and C. Paar. MicroEliece: McEliece for
Embedded Systems. Accepted at CHES 2009.
[17] D. Engelbert, R. Overbeck, and A. Schmidt. A summary of McEliece-type cryp-
tosystems and their security, May 2006. http://eprint.iacr.org/2006/162.ps.
[18] Entropy. http://entropy.stop1984.com.
[19] Freenet. http://freenetproject.org.
[20] P. Gaborit. Shorter keys for code based cryptography. In Proceedings of the
Workshop on Codes and Cryptography, pages 81–91, 2005.
[21] V. Goppa. A new class of linear error-correcting codes. Problemy Peredachi
Informatsii, 6(3):24–30, 1970. Original in Russian.
[22] T. Güneysu, C. Paar, and J. Pelzl. Special-purpose hardware for solving the
elliptic curve discrete logarithm problem. ACM Transactions on Reconfigurable
Technology and Systems (TRETS), 1(2):1–21, 2008.
[23] T. Hammer. HTerm – a terminal program for Windows and Linux.
www.der-hammer.info/terminal/.
[24] K. Huber. Note on decoding binary Goppa codes. Electronics Letters, 32:102–103,
1996.
[25] Helion Technology Inc. Modular Exponentiation Core Family for Xilinx FPGA.
Data Sheet, October 2008.
http://www.heliontech.com/downloads/modexp_xilinx_datasheet.pdf.
[26] Xilinx Inc. Security Solutions Using Spartan-3 Generation FPGAs.
http://www.xilinx.com/support/documentation/white_papers/wp266.pdf.
[27] Xilinx Inc. Spartan-3AN FPGA Family Data Sheet.
http://www.xilinx.com/support/documentation/data_sheets/ds706.pdf.
[28] Xilinx Inc. Xilinx ISE 10.1 Design Suite Software Manuals and Help – PDF Collection.
http://toolbox.xilinx.com/docsan/xilinx10/books/manuals.pdf.
[29] J. Hoffstein, J. Pipher, and J. H. Silverman. NTRU: A Ring-Based Public Key
Cryptosystem. Algorithmic Number Theory (ANTS III), 1423:267–288, 1998.
http://www.ntru.com/cryptolab/pdf/ANTS97.pdf.
[30] P. C. Kocher. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS,
and Other Systems. In Advances in Cryptology – CRYPTO '96, volume 1109 of
LNCS, pages 104–113. Springer-Verlag, 1996.
[31] B. Köpf and M. Dürmuth. A Provably Secure and Efficient Countermeasure
Against Timing Attacks. Cryptology ePrint Archive, Report 2009/089, 2009.
http://eprint.iacr.org/2009/089.pdf.
[32] V. Korzhik and A. Turkin. Cryptanalysis of McEliece's Public-Key Cryptosystem,
pages 68–70. Springer-Verlag Berlin/Heidelberg, 1991.
[33] P. Lee and E. Brickell. An Observation on the Security of McEliece's Public-Key
Cryptosystem, pages 275–280. Springer-Verlag New York, Inc., 1988.
[34] S. Mangard, E. Oswald, and T. Popp. Power Analysis Attacks: Revealing the Secrets
of Smart Cards. Springer-Verlag, 2007.
[35] R. J. McEliece. A public-key cryptosystem based on algebraic coding theory.
Deep Space Network Progress Report, 44:114–116, January 1978.
[36] R. J. McEliece. Finite Fields for Computer Scientists and Engineers. Kluwer Aca-
demic Publishers, 1987.
[37] C. Monico, J. Rosenthal, and A. Shokrollahi. Using low density parity check
codes in the McEliece cryptosystem. In IEEE International Symposium on Informa-
tion Theory, page 215, 2000.
[38] N. Patterson. The algebraic decoding of Goppa codes. IEEE Transactions on
Information Theory, 21:203–207, 1975.
[39] B. Preneel, A. Bosselaers, R. Govaerts, and J. Vandewalle. A Software Imple-
mentation of the McEliece Public-Key Cryptosystem. In Proceedings of the 13th
Symposium on Information Theory in the Benelux, Werkgemeenschap voor Informatie-
en Communicatietheorie, pages 119–126. Springer-Verlag, 1992.
[40] Prometheus. Implementation of the McEliece Cryptosystem for 32-bit micropro-
cessors (C source). http://www.eccpage.com/goppacode.c, 2009.
[41] P. W. Shor. Polynomial-time algorithms for prime factorization and discrete
logarithms on a quantum computer. SIAM Journal on Computing, 26(5):1484–1509,
1997. quant-ph/9508027.
[42] F. Strenzke, E. Tews, H. Molter, R. Overbeck, and A. Shoufan. Side Channels
in the McEliece PKC. In Proceedings of the 2nd International Workshop on Post-
Quantum Cryptography, pages 216–229. Springer-Verlag, 2008.
[43] Y. Sugiyama, M. Kasahara, S. Hirasawa, and T. Namekawa. A Method for
Solving Key Equation for Decoding Goppa Codes. Information and Control,
27:87–99, 1975.
[44] Y. Sugiyama, M. Kasahara, S. Hirasawa, and T. Namekawa. An erasures-and-
errors decoding algorithm for Goppa codes. IEEE Transactions on Information
Theory, 22:238–241, 1976.
[45] H. C. van Tilborg. Fundamentals of Cryptology. Kluwer Academic Publishers,
2000.
[46] Nu Horizons Electronics Corp. http://www.nuhorizons.com.
[47] Xilinx Inc. IP Security in FPGAs. White Paper WP261.
http://www.xilinx.com/support/documentation/white_papers/wp261.pdf.
E. List of Figures
4.1. 4-Input LUT with FF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2. Simplified Overview over an FPGA [27] . . . . . . . . . . . . . . . . . . 23
4.3. VHDL Designflow [28] . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4. The UART component . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5.1. The Complete Coefficient Divider . . . . . . . . . . . . . . . . . . . . . 33


5.2. Overview of the Polynomial Multiplier . . . . . . . . . . . . . . . . . . 33

6.1. Spartan3-200 Development Board . . . . . . . . . . . . . . . . . . . . . . 37


6.2. Block Overview for Encryption . . . . . . . . . . . . . . . . . . . . . . . 37
6.3. 7 Segment Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.4. Seven Segment Driver Component . . . . . . . . . . . . . . . . . . . . . 39
6.5. Interface of the mce_encrypt Component . . . . . . . . . . . . . . . . . . 39
6.6. Buffer for Plaintext and Ciphertext . . . . . . . . . . . . . . . . . . . . . 40
6.7. Parts of the PRNG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.8. Construction of PRNG from PRESENT . . . . . . . . . . . . . . . . . . 41
6.9. McEliece implementation on Spartan-3 2000 FPGA . . . . . . . . . . . 42
6.10. The XC3S2000 Development Board . . . . . . . . . . . . . . . . . . . . . 42
6.11. Overview of mce_decrypt Component . . . . . . . . . . . . . . . . . . . 42
6.12. Buffer for Ciphertext and Plaintext . . . . . . . . . . . . . . . . . . . . . 43
6.13. ROM containing the Inverse Permutation Matrix . . . . . . . . . . . . . 43
6.14. Interface of goppa-decode Component . . . . . . . . . . . . . . . . . . . . 43
6.15. Overview of goppa-decode Component . . . . . . . . . . . . . . . . . . . 44
6.16. FSM of the goppa_decode Component . . . . . . . . . . . . . . . . . . . 44
6.17. FSM of the EEA Component . . . . . . . . . . . . . . . . . . . . . . . . 44
6.18. Interface of the poly-split Component . . . . . . . . . . . . . . . . . . . 45
6.19. Interface of the poly-mulWT1 Component . . . . . . . . . . . . . . . . 45
6.20. Interface of the poly-mul Component . . . . . . . . . . . . . . . . . . . 46
6.21. Interface of the comp-sigma Component . . . . . . . . . . . . . . . . . 46
6.22. Interface of the chien_search Component . . . . . . . . . . . . . . . . . . 47
6.23. Timing Diagram of the Matrix Multiplication . . . . . . . . . . . . . . . 49
F. List of Tables
2.1. Security of McEliece Depending on Parameters . . . . . . . . . . . . . 12

4.1. Bitstream Generator Security Level Settings . . . . . . . . . . . . . . . . 26

5.1. Execution Count for Crucial Parts. . . . . . . . . . . . . . . . . . . . . . 30

6.1. Function of the Debug Switches . . . . . . . . . . . . . . . . . . . . . . 39

7.1. Implementation results of the McEliece scheme with n = 2048, k =
     1751, t = 27 on a Spartan-3 FPGA after PAR . . . . . . . . . . . . . . . 51
7.2. Performance of McEliece implementations with n = 2048, k = 1751, t =
27 on the Spartan-3 FPGA. . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.3. Comparison of our McEliece designs with single-core ECC and RSA
implementations for 80 bit security. . . . . . . . . . . . . . . . . . . . . . 53
7.4. Implementation results of the McEliece scheme with n = 2048, k =
     1751, t = 27 on Virtex FPGAs after PAR. . . . . . . . . . . . . . . . . . . 54

A.1. Display Characters and Resulting Control Values . . . . . . . . . . . . 57


A.2. Number of XORs for Multiplication. . . . . . . . . . . . . . . . . . . . . 58
G. Listings
5.1. Architecture for gf_compare . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2. Instantiation of gf_compare . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3. Determining the Degree of a Polynomial . . . . . . . . . . . . . . . . . 32
5.4. k-width Shifter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

6.1. Computation of Partial Product . . . . . . . . . . . . . . . . . . . . . . . 40

B.1. Construction of the Mapping Matrix . . . . . . . . . . . . . . . . . . . . 59


B.2. Present Block Cipher in Magma . . . . . . . . . . . . . . . . . . . . . . 59
B.3. PRNG based on Present in Magma . . . . . . . . . . . . . . . . . . . . . 61
B.4. Generation and Testing for the Substitution Matrix . . . . . . . . . . . 62

C.1. Complete Unrolled Field Multiplication . . . . . . . . . . . . . . . . . . 65


C.2. Performing the Permutation . . . . . . . . . . . . . . . . . . . . . . . . . 66
H. List of Algorithms
1. Key Generation Algorithm for McEliece Scheme . . . . . . . . . . . . . 8
2. McEliece Message Encryption . . . . . . . . . . . . . . . . . . . . . . . 9
3. McEliece Message Decryption . . . . . . . . . . . . . . . . . . . . . . . . 9

4. Decoding Goppa Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


5. Getting a Mapping Matrix from Codewords to Messages . . . . . . . . 21

6. Extended Euclid over GF(2^11) with Stop Value . . . . . . . . . . . . . 29