
HDL Design Laboratory

Report

Hardware Implementation of
An Encryption Algorithm

Daniel Thangaraj Stanley


MSCE / WS 2013
Matrikel-Nr: 03636685
User ID: ga39por
February 01, 2013

CONTENTS
I. Answers for all the marked questions
II. Delay Estimation
III. Comparison of Different Implementations
IV. Simulation Plots with all Inputs and Outputs
V. Screenshots using Ideatester

Answers for all the marked questions


1) Preparation Question 1:
The variable $X_i^{(1)}$ is the input to the hardware that is to be
encrypted, i.e. the input to the first round. Similarly, $Y_i^{(9)}$
is the output of the encryption algorithm, i.e. the output of the
output transformation, where $i \in \{1,2,3,4\}$.
The input $X_i^{(r+1)}$ of round $r+1$ is fed by the output of the
previous round $r$, i.e. $Y_i^{(r)}$, where $i \in \{1,2,3,4\}$.
2) Preparation Question 2:
If either $X_i$ or $Z_i$ is 0, the corresponding input, a or b, is
set to $2^n$, which requires n+1 bits; otherwise a and b are set to
the values of $X_i$ and $Z_i$ with a zero prepended as the MSB to
extend them to n+1 bits.
n+1 bits are needed to represent a and b in the range
$\{1, \dots, 2^n\}$. In the worst case $ab = 2^{2n}$, so 2n+1 bits
are needed to represent the product of a and b.
The value of n for the modulo-multiplier in the IDEA algorithm
is 16.
In the modulo $2^n$ operation, the n LSBs are kept and the upper n+1
bits are masked out. Thus n bits are needed for the result of
$(ab \bmod 2^n)$.
The simple bit operation equivalent to $(ab \,\mathrm{div}\, 2^n)$ is
a right shift by n bits: bits 2n down to n of ab are kept and, after
the division, occupy positions n down to 0. n+1 bits are needed to
store the result.
If $(ab \bmod 2^n) = (ab \,\mathrm{div}\, 2^n)$, the result of
$(ab \bmod (2^n+1))$ is zero. The condition implies that bits 2n-1
down to n of ab are identical to bits n-1 down to 0 (from the
previous answers) and that bit 2n is 0. ab can then be written as
$ab = 2^n C + C = (2^n+1)C$, where
$C = (ab \bmod 2^n) = (ab \,\mathrm{div}\, 2^n)$.
For $(ab \bmod (2^n+1))$ to be zero, a or b must be a multiple of
$2^n+1$, because $2^n+1$ is a prime number for n = 16. This case is
therefore not possible: a and b lie in the range
$\{1, \dots, 2^n\}$, which contains no multiple of $2^n+1$.
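
The bit operations above can be combined into a small software model of the modulo-multiplier. The following is only a sketch: it assumes the commonly used low-high scheme (subtract the upper part from the lower part and add $2^n+1$ if the difference is negative), which the report does not spell out, and the names mul_mod and reference are hypothetical.

```python
def mul_mod(x: int, z: int, n: int = 16) -> int:
    """Multiply modulo 2**n + 1; an operand value of 0 stands for 2**n."""
    mask = (1 << n) - 1
    a = x if x != 0 else (1 << n)   # 0 encodes 2**n, which needs n + 1 bits
    b = z if z != 0 else (1 << n)
    ab = a * b                      # at most 2**(2n), i.e. 2n + 1 bits
    lo = ab & mask                  # ab mod 2**n: keep the n LSBs
    hi = ab >> n                    # ab div 2**n: right shift by n bits
    r = lo - hi                     # valid because 2**n = -1 (mod 2**n + 1)
    if r < 0:                       # assumed correction step (low-high scheme)
        r += (1 << n) + 1
    return r & mask                 # a result of 2**n is encoded as 0 again


def reference(x: int, z: int, n: int = 16) -> int:
    """Direct computation with Python integers, used only for comparison."""
    a = x if x != 0 else (1 << n)
    b = z if z != 0 else (1 << n)
    return (a * b) % ((1 << n) + 1) & ((1 << n) - 1)
```

Because lo and hi can never be equal for operands in the stated range, the difference is never zero, so the encoded result 0 always means $2^n$.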
3) Preparation Question 3:
If there are two modulo-multipliers, two adders and unlimited
XOR modules, then the round can be split up as follows.

Fig: Round module split using two modulo-multipliers and two adders

For this design we need two partial steps, each using the two
modulo-multipliers and the two adders together with some XORs.
Registers have to be inserted at the outputs of the
modulo-multipliers, adders and XORs of each partial step.
The design for the datapath with two modulo-multipliers and two
adders can be found below.

Fig: Design for the datapath with 2 modulo-multipliers and two adders

DELAY ESTIMATION
4) Preparation Question 4:
Since a SLICE can perform two XOR operations, to perform a 16
bit XOR operation we require 8 SLICES.
5) Preparation Question 5:
Since the XOR operation is performed only by the F/G function
generators, we consider the propagation delay of the function
generators alone. The function generators within a SLICE and the
SLICES themselves operate in parallel, so a 16-bit XOR operation
takes 3 ns. For the same reason the delay is independent of the
input width.
6) Preparation Question 6:
The adder used here is a ripple-carry adder. It consists of a sum
logic and a carry logic, which can be written as
$S_n = X_n \oplus Y_n \oplus C_n$
$C_{n+1} = X_n Y_n + (X_n + Y_n) C_n$
The XOR operation is performed by the function generators and
the carry bit is calculated by the carry logic inside the SLICE.
Thus each SLICE implements two full adders. At time t = 0 all input
signals are valid, so the sum of the LSB is available after 3 ns and
its carry bit after 1 ns. The sum of the next bit is therefore
available after 4 ns and, continuing in the same way, the 16-bit sum
takes 18 ns. The longest delay occurs when the LSB generates a carry
and all other bits propagate it.
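
The sum and carry equations and the delay estimate can be checked with a short bit-level model. This is a minimal sketch, not the lab's VHDL: the function names are hypothetical, and the timing values (3 ns for a function generator, 1 ns per carry stage) are the ones assumed in the answer above.

```python
def full_adder(x: int, y: int, c: int) -> tuple[int, int]:
    """One bit position: S = X xor Y xor C, C_next = XY + (X + Y)C."""
    s = x ^ y ^ c
    c_next = (x & y) | ((x | y) & c)
    return s, c_next


def ripple_add(a: int, b: int, width: int = 16) -> tuple[int, int]:
    """Add two width-bit numbers bit by bit; returns (sum, carry_out)."""
    carry = 0
    total = 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


def ripple_delay_ns(width: int = 16, t_lut: int = 3, t_carry: int = 1) -> int:
    """Worst case: the carry ripples through width - 1 stages, then one sum."""
    return (width - 1) * t_carry + t_lut
```

The worst-case pattern named above corresponds, for example, to a = 0xFFFF and b = 1: bit 0 generates a carry and every higher bit propagates it.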
7) Preparation Question 7:
One SLICE contains two function generators. Each function
generator can be programmed as a 2:1 MUX. The SLICE also has a
built-in 2:1 MUX at the output of the function generators.
Combined, these function as a 4:1 MUX for one bit, so a 16-bit 4:1
MUX requires 16 SLICES. The propagation delay is that of the
function generator alone, i.e. 3 ns, since the delay of the
output MUX is neglected.
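
The composition of a 4:1 MUX from two function generators and the output MUX can be illustrated with a behavioural sketch. The function names are hypothetical, and the assignment of the select bits to the two stages is an assumption made for illustration.

```python
def mux2(sel: int, d0: int, d1: int) -> int:
    """2:1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def mux4(s1: int, s0: int, d0: int, d1: int, d2: int, d3: int) -> int:
    """4:1 multiplexer built from three 2:1 multiplexers."""
    low = mux2(s0, d0, d1)      # first function generator
    high = mux2(s0, d2, d3)     # second function generator
    return mux2(s1, low, high)  # built-in MUX at the SLICE output


def slices_for_mux4(width: int = 16) -> int:
    """One SLICE per bit, as estimated in the answer above."""
    return width
```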
8) Preparation Question 8:
The time taken for the entire encryption is obtained from the
delays of the round module and the output transformation block.
As shown in the figure, the longest propagation path in one round
includes 3 modulo-multipliers, 2 adders and 2 XORs, and the longest
path in the output transformation block consists of one
modulo-multiplier only.
From previous calculations, we have
Delay of XOR: 3 ns
Delay of adder: 10 ns
Delay of modulo-multiplier: 22 ns
Thus the delay of the longest path in a round is
3(22) + 2(10) + 2(3) = 92 ns.
Since there are 8 round modules followed by an output
transformation, the total delay is 8(92) + 22 = 758 ns.
The number of encryptions per second is given by 1/delay, i.e.
1/758 ns = 1319261 encryptions per second.
Yes, it is possible to start a new calculation before the current
encryption has finished by inserting additional registers between
the round modules. On each clock pulse the registered outputs of
every round are passed to the next round module, while the first
round receives the next set of inputs from the input registers,
provided the key value remains the same.
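
The arithmetic of this estimate can be reproduced in a few lines. This is only a sketch using the component delays quoted above; the constant and variable names are hypothetical.

```python
# Component delays in ns, taken from the answer above.
T_XOR, T_ADD, T_MUL = 3, 10, 22
ROUNDS = 8

# Longest path in one round: 3 modulo-multipliers, 2 adders, 2 XORs.
t_round = 3 * T_MUL + 2 * T_ADD + 2 * T_XOR

# 8 rounds followed by the output transformation (one modulo-multiplier).
t_total = ROUNDS * t_round + T_MUL

# Throughput of the purely combinational implementation.
encryptions_per_second = 1e9 / t_total

print(t_round, t_total, int(encryptions_per_second))
```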
9) Preparation Question 9:
The shortest clock cycle corresponds to the delay of the longest
combinational logic path. For RCS I this path includes the round
module, a 16-bit 2:1 MUX and the register. The delay of the round
module is known from the previous answer.
Time delay = $t_{round} + t_{reg} + t_{mux}$ = 92 + 4 + 3 = 99 ns.
Thus the shortest clock cycle is 99 ns.
The number of clock cycles required to complete one
encryption is 8.

10) Preparation Question 10:

The shortest clock cycle is found by calculating the delay of the
longest combinational logic path. In the clocked round the longest
path runs from register R5 to R8. It includes one multiplier, one
adder, two multiplexers and the register.
$t_{long} = t_{mul} + t_{adder} + 2 t_{mux} + t_{setup}$ = 22 + 10 + 2(3) + 4 = 42 ns.
Thus the shortest clock cycle for the clocked round is 42 ns.
For this path the multiplexer select signal is S = 10.
The register at the end of this path is R8.

Fig: Longest path in the clocked round is the path from the register R5 to R8.

11) Preparation Question 11:

The shortest clock cycle for RCS II (complete encryption)
corresponds to the delay of the longest path, which runs from
register R8 to the outer register R3. It includes one multiplier,
one adder, two MUXs, one XOR and one register.

$t_{long} = t_{mul} + t_{adder} + 2 t_{mux} + t_{xor} + t_{setup}$ = 22 + 10 + 2(3) + 3 + 4 = 45 ns

The shortest clock cycle for RCS II = 45 ns
Thus the minimal clock cycle for RCS II is longer than the
shortest clock cycle of the clocked round. The longer cycle is
caused by the additional delay of the XOR operation. The trace of
the longest path is given below.

Fig: Longest path from register R8 to the outer register R3
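
The three clock-period estimates from Preparation Questions 9 to 11 follow the same pattern: sum the delays of the components on the longest path. A minimal sketch, using the delay values quoted in the answers and hypothetical names:

```python
# Delays in ns as used in the answers above.
T_XOR, T_ADD, T_MUL, T_MUX, T_REG, T_ROUND = 3, 10, 22, 3, 4, 92

# RCS I: complete round module, one 2:1 MUX and the register.
t_clk_rcs1 = T_ROUND + T_REG + T_MUX

# Clocked round (R5 to R8): multiplier, adder, two MUXs, register setup.
t_clk_round = T_MUL + T_ADD + 2 * T_MUX + T_REG

# RCS II (R8 to outer R3): the same path plus one XOR.
t_clk_rcs2 = t_clk_round + T_XOR

print(t_clk_rcs1, t_clk_round, t_clk_rcs2)
```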

12) Preparation Question 12:

The total number of clock cycles needed for a complete
encryption with RCS II is the sum of the clock cycles for the 8
rounds and the output transformation.
Each round takes 8 clock cycles and the output transformation
block takes 6 clock cycles.
Clock cycles = 8(8) + 6 = 70 clock cycles. The output is available
at the next rising edge of the clock. Thus the number of clock
cycles required for a complete encryption with RCS II is 71.
Assuming that the start signal is active at rising edge number 0,
the init signal is set at the 0th, 8th, 16th, 24th, 32nd, 40th,
48th, 56th and 64th rising edges of the clock.
The result signal is set at the 7th, 15th, 23rd, 31st, 39th, 47th,
55th, 63rd and 69th rising edges of the clock.
The ready signal is set at the 70th rising edge of the clock.
From Question 11 we know the shortest clock cycle to be 45 ns.
Thus the time taken for a complete encryption with RCS II is
71 * 45 = 3195 ns.
Therefore the number of encryptions per second is 1/3195 ns =
312989 encryptions/second.
14) Preparation Question 14:
For a complete encryption with RCS II+, each round takes 5 clock
cycles and the output transformation takes 4 clock cycles.
Number of clock cycles for a complete encryption = 5(8) + 4 = 44
clock cycles. The output is available at the next rising edge of
the clock. Thus it requires 45 clock cycles.
Time taken to complete the encryption = 45 * 45 ns = 2025 ns.
The number of encryptions per second is 1/T = 1/2025 ns = 493827.2
encryptions per second.
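
The cycle counts and throughput figures for RCS II and RCS II+ can be reproduced with one helper. This is only a sketch of the arithmetic in Preparation Questions 12 and 14; the function name and its parameters are hypothetical.

```python
def encryption_time_ns(cycles_per_round: int, cycles_trafo: int,
                       t_clk_ns: int, rounds: int = 8) -> int:
    """Time for one encryption, including the extra cycle for the output."""
    cycles = rounds * cycles_per_round + cycles_trafo + 1
    return cycles * t_clk_ns


t_rcs2 = encryption_time_ns(8, 6, 45)       # RCS II: 71 cycles of 45 ns
t_rcs2plus = encryption_time_ns(5, 4, 45)   # RCS II+: 45 cycles of 45 ns

rate_rcs2 = 1e9 / t_rcs2
rate_rcs2plus = 1e9 / t_rcs2plus

print(t_rcs2, t_rcs2plus)
```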

Comparison of Different Implementations


Implementation               1         2          3         4         5
Encryptions/second           1319261   10869565   1265822   312989    493827.2
Required area in SLICES      2458      2714       478       293       293
Required input/output pins   192/64    193/64     194/65    194/65    194/65
Efficiency                   536.7     4005       2648.2    1068.2    1685.417

1) Direct Implementation:
The number of encryptions per second is as calculated in the
previous answers.
From the LUT table we know that the number of LUTs for a modulo
multiplier is 106, thus the number of SLICES is 53.
The number of SLICES for a XOR is 8, for an adder 8 and for a
modulo multiplier 53. A round module contains 6 XORs, 4 adders
and 4 multipliers.
Number of SLICES per round = 8(6) + 8(4) + 53(4) = 292 SLICES. For
8 rounds it is 8(292) = 2336 SLICES.
Similarly, the output transformation takes 2 multipliers and 2
adders. Therefore its number of SLICES = 2(53) + 2(8) = 122
SLICES.
The total number of SLICES for the direct implementation =
2336 + 122 = 2458 SLICES.
There are 4 inputs of 16 bits each, thus 16 * 4 = 64 pins, and a
128-bit key input, thus 64 + 128 = 192 input pins.
There are 4 outputs of 16 bits each, thus 16 * 4 = 64 output pins.
Efficiency is the number of encryptions per second divided by the
required SLICES = 1319261/2458 = 536.721.
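
The area estimate for the direct implementation is a weighted sum of component counts. A minimal sketch with the SLICE costs quoted above; the dictionary and function names are hypothetical.

```python
# SLICES per 16-bit component, as stated in the text.
SLICES = {"xor": 8, "adder": 8, "mul": 53}


def area(xors: int = 0, adders: int = 0, muls: int = 0) -> int:
    """Total SLICES for a block with the given component counts."""
    return xors * SLICES["xor"] + adders * SLICES["adder"] + muls * SLICES["mul"]


round_area = area(xors=6, adders=4, muls=4)   # one round module
trafo_area = area(adders=2, muls=2)           # output transformation
direct_area = 8 * round_area + trafo_area     # 8 rounds unrolled

print(round_area, trafo_area, direct_area)
```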
2) Direct implementation with simultaneous calculation of several
encryptions:
In this case we place registers between the round modules to
store the output of each round before it is forwarded to the next
round. This ensures that all values propagate to the next round
at the same time instant, which allows a new set of values to be
applied to round 1. The longest propagation path is now the 92 ns
of a single round, so a new set of inputs can be applied every
92 ns. With this we can calculate
(1/92 ns) = 10869565 encryptions/sec.
The number of SLICES increases because there are 4 additional
registers at the output of each round and each register uses 8
SLICES. Therefore 8(4)(8) + 2458 = 256 + 2458 = 2714 SLICES.
The number of input/output pins is the same as in the previous
case, plus one extra input pin for the clock of the registers.
Hence 193 input pins and 64 output pins.
Efficiency is the number of encryptions per second divided by the
required SLICES = 10869565/2714 = 4004.998
3) Hardware Oriented Implementation Resource Constrained
Scheduling I
Since the single round module is used 8 times, its resources are
reused. The round module needs 6(8) + 4(8) + 4(53) = 292
SLICES. The output transformation uses 122 SLICES. There are 4
16-bit registers needing 8 SLICES each and 4 2:1 MUXs needing 8
SLICES each, which requires 4(8) + 4(8) = 64 SLICES.
The total number of SLICES = 292 + 122 + 64 = 478 SLICES.
The number of input/output pins is the same as in the previous
case, plus an extra start input pin and a ready output signal in
addition to the clock. Hence 194 input pins and 65 output pins.
Time taken for a complete encryption is given by
$T = 8(t_{round} + t_{setup}) + t_{trafo}$ = 8(92 + 4) + 22 = 790 ns.
Number of encryptions per second is 1/T = 1/790 ns = 1265822
Efficiency is 1265822/478 = 2648.2
4) Hardware Oriented Implementation Resource Constrained
Scheduling II
The number of encryptions per second is 312989.
In the datapath we use 1 adder, 1 multiplier, 5 XORs, 8
registers and 4 4:1 MUXs that use 16 SLICES each, which gives
1(8) + 1(53) + 5(8) + 4(16) + 8(8) = 229 SLICES. In addition there
are 4 external registers and 4 2:1 MUXs, which gives a total of
4(8) + 4(8) + 229 = 293 SLICES.
The number of input/output pins is the same as in the previous
case: 194 input pins and 65 output pins.
Efficiency is 312989/293 = 1068.221
5) Hardware Oriented Implementation Resource Constrained
Scheduling II+
From Question 14, the number of encryptions per second =
493827.2.
The number of SLICES is the same as in the previous case, 293
SLICES.
The number of input/output pins is the same as in the previous
case: 194 input pins and 65 output pins.
Efficiency is 493827.2/293 = 1685.417
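
The efficiency figures of all five implementations use the same definition, encryptions per second divided by required SLICES. The following sketch recomputes them from the throughput and area values derived in this section; the table layout and the names are hypothetical.

```python
# (encryptions per second, required SLICES) for each implementation.
IMPLEMENTATIONS = {
    "1 direct":           (1319261, 2458),
    "2 direct pipelined": (10869565, 2714),
    "3 RCS I":            (1265822, 478),
    "4 RCS II":           (312989, 293),
    "5 RCS II+":          (493827.2, 293),
}


def efficiency(rate: float, slices: int) -> float:
    """Encryptions per second per SLICE."""
    return rate / slices


results = {name: efficiency(*values) for name, values in IMPLEMENTATIONS.items()}
best = max(results, key=results.get)

for name, value in results.items():
    print(f"{name:20s} {value:10.3f}")
```

Under these numbers the pipelined direct implementation has the highest efficiency, even though it also has the largest area.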

SIMULATION PLOTS WITH ALL INPUTS AND OUTPUTS
1) Direct Implementation

2) RCS I
a. Control Path

b. Result

3) RCS II
a. Control Logic

b. Control logic and its extension

c. Round Counter

d. Result

4) RCS II+

SCREENSHOTS USING IDEATESTER


1) Final Testing of RCS1

2) Final Testing of RCS2

3) Final Testing of RCS2plus