
Second International Conference on Computer and Network Technology

A Compact 8-bit AES Crypto-Processor


F. Haghighizadeh, H. Attarzadeh, M. Sharifkhani
Electrical Engineering Department, Sharif University of Technology, Tehran, Iran
haghighizadeh@ee.sharif.edu
Abstract: The Advanced Encryption Standard (AES) has received significant interest over the past decade due to its performance and security level. In this paper, we propose a compact 8-bit AES crypto-processor for area-constrained and low-power applications where both encryption and decryption are needed. The cycle count of the design is the lowest among previously reported 8-bit AES architectures, and the throughput is 203 Mbps. The AES core occupies 5.6k gates in a 0.18 µm standard-cell CMOS technology. The power consumption of the core is 49 µW/MHz at 128 MHz, which is the minimum power reported thus far.

Keywords: AES, Encryption, ASIC, Digital Architecture

I. INTRODUCTION

In 2001, the Rijndael algorithm was selected as the Advanced Encryption Standard (AES) [1] due to its combination of security, performance, efficiency, ease of implementation and flexibility. These features make AES the first choice in many applications such as wireless LAN and smart cards. In practice, a hardware implementation is more secure and faster than a software implementation. In order to use AES in low-end applications, the hardware implementation must be inexpensive and low power. In this paper, we present a low-area, low-power 8-bit ASIC AES crypto-processor for AES-128, which covers both encryption and decryption and has the minimum cycle count for an iterated 8-bit processor.

Many FPGA [2], [3] and ASIC [4], [5] implementations of AES have been reported to date. Most of them use 128-bit or 32-bit processors. Although these architectures provide higher throughput, they are not suitable for low-cost and low-power devices. The minimum area and power can be achieved by an 8-bit architecture. Interest in 8-bit architectures has emerged since 2005 [6], [7]. The design in [7] is a low-area, low-power architecture that performs both encryption and decryption, but throughput is of minor importance in that design and its cycle count limits its application. The implementation proposed in [6] outperforms [7] in terms of throughput; however, it does not support decryption.

The remainder of this paper is organized as follows. Section 2 gives a brief introduction to the AES algorithm. Section 3, the main focus of the paper, proposes the architecture of the AES core. Section 4 analyzes the proposed design and compares it with previous relevant works. Section 5 then concludes this paper.

II. THE AES ALGORITHM

The AES algorithm [1] is a symmetric block cipher that processes data blocks of 128 bits using a cipher key of length 128, 192 or 256 bits. Each data block consists of a 4×4 array of bytes called the state, on which the basic operations of the AES algorithm are performed. The AES encryption and modified decryption procedure is shown in Fig. 1. After an initial round key addition, a round function consisting of four different transformations, SubBytes( ), ShiftRows( ), MixColumns( ) and AddRoundKey( ), is applied to the data block (i.e., the state array). The round function is performed iteratively 10, 12 or 14 times, depending on the key length. The four transformations are described briefly as follows [3].

SubBytes( ) is a nonlinear byte substitution that operates independently on each byte of the state using a substitution table (S-Box). SubBytes( ) consists of a multiplicative inversion over GF(2^8) followed by an affine transformation. This operation can be implemented either by computing the substitution [8] or by using a look-up table to store the substitution values [9].

ShiftRows( ), or byte permutation, is a circular shifting operation on the rows of the state with different offsets. The offset equals the row index: the first row is not shifted at all, while the last row is shifted 3 bytes to the left.

MixColumns( )

Figure 1. AES encryption and modified decryption algorithm

978-0-7695-4042-9/10 $26.00 2010 IEEE DOI 10.1109/ICCNT.2010.50


mixes the bytes in each column by multiplying the column by a fixed polynomial modulo x^4 + 1. It interprets a column as a polynomial of degree less than 4 with coefficients in GF(2^8); the state bytes are the coefficients of the polynomial.

AddRoundKey( ) is a 128-bit XOR operation, which adds a round key to the state in each iteration. The round keys are generated during the key expansion phase.

The decryption procedure of the AES is basically the inverse of each transformation (InvSubBytes( ), InvShiftRows( ) and InvMixColumns( )) applied in the reverse order. However, the order of InvSubBytes( ) and InvShiftRows( ) is interchangeable. Also, the order of InvMixColumns( ) and AddRoundKey( ) can be swapped simply by applying InvMixColumns( ) to the original round key. These substitutions provide a similar structure for both encryption and decryption. This similarity allows a high level of resource sharing in the hardware implementation, which adds to the efficiency.

III. AES CORE ARCHITECTURE

The high-level architecture of the design is shown in Fig. 2. All the connections and registers are 8 bits wide. The following sections describe each part individually. The AES core data path is presented in Section III-F.

A. AddRoundKey

This is the simplest operation in the AES. It is done by XORing two bytes, bit by bit. The first Nr-1 AddRoundKey operations (Nr is the number of rounds) are done by XOR1 and the last one is performed by XOR2.

B. (Inv)SubByte

SubByte( ) and InvSubByte( ) are the most critical parts of the AES algorithm in terms of computational complexity. Conventionally, either the entries of the S-Box and inverse S-Box are stored in LUTs [9], or a hard-wired multiplicative inverter over GF(2^8) is used together with an affine transform circuit [8]. A LUT does not result in optimal area in ASIC technologies, so the hard-wired circuit solution is used in this paper. The affine transform circuit can be implemented with a few XOR gates. However, a dedicated inverter has a high area overhead. Mapping the elements of GF(2^8) into smaller subfields saves significant area for the multiplicative inverter over GF(2^8). We used the inverter proposed in [8] in our AES core. This design breaks GF(2^8) down into GF((2^4)^2) and then into GF(((2^2)^2)^2). The hardware required to perform both SubByte( ) and InvSubByte( ) includes isomorphic and inverse isomorphic functions, affine and inverse affine transform parts, two multiplexers and one multiplicative inversion unit. We used two S-Boxes in our design, one in the round unit and another in the key expansion unit. The key expansion unit needs only SubByte( ), so we save area there by omitting the two multiplexers and the inverse affine transform circuit.

C. (Inv)MixColumn

MixColumns( ) operates on a column and mixes its bytes. In an 8-bit architecture, where one byte comes out of the ShiftRows unit in each clock cycle, we need an architecture that collects four successive bytes, performs MixColumns in four clock cycles and produces one byte at a time. To achieve this goal we used the MixColumn multiplier and the parallel-to-serial converter depicted in Fig. 3. Data are fed to the unit byte by byte and the intermediate results are held in four registers. Because the same multiplier coefficients are used for each row of a column, only in a cyclically shifted order, a 32-bit part of the MixColumns operation can be performed by adding and cyclically shifting the intermediate results in the unit. While the first byte of a column (bytes 0, 4, 8 and 12) is being input, the contents of the registers are masked to zero with the en signal. Upon completion, the 32-bit output is fed to the parallel-to-serial converter.
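The accumulate-and-rotate idea described above can be modelled in software. The following is a minimal behavioural sketch in Python, not the register-level wiring of Fig. 3. The function names (xtime, mixcolumn_serial) and the particular coefficient ordering (1, 1, 3, 2) combined with an upward rotation are assumptions, chosen so that four accumulate steps reproduce the standard MixColumns result.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b


def mixcolumn_serial(column):
    """Byte-serial MixColumns model: one input byte is consumed per step.

    Four accumulator registers are rotated by one position and XORed with
    fixed multiples (1, 1, 3, 2) of the incoming byte. After four steps the
    registers hold the four output bytes of the column.
    """
    regs = [0, 0, 0, 0]
    for index, byte in enumerate(column):
        if index == 0:
            # Models the 'en' mask: registers are cleared for the first byte.
            regs = [0, 0, 0, 0]
        times2 = xtime(byte)
        times3 = times2 ^ byte
        products = [byte, byte, times3, times2]
        rotated = regs[1:] + regs[:1]          # cyclic shift by one register
        regs = [r ^ p for r, p in zip(rotated, products)]
    return regs


if __name__ == "__main__":
    result = mixcolumn_serial([0xDB, 0x13, 0x53, 0x45])
    print(" ".join(f"{b:02x}" for b in result))
```

Only one multiply-by-2 and one multiply-by-3 are evaluated per input byte in this model, which mirrors the reason the serial structure is small: the coefficient set is fixed and the rotation supplies the row-dependent ordering.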
The MixColumn multiplier performs a complete MixColumns operation in 16 cycles, in parallel with the rest of the operations of the AES core [6]. In the MixColumns operation we need to multiply a byte by 1, 2 and 3, and in InvMixColumns the coefficients are 9, B, D and E (hexadecimal). Coefficients 3, 9, B, D and E can be written as follows, where + denotes addition in GF(2^8), i.e., XOR:

3 = 1 + 2
9 = 1 + 8
B = 1 + 2 + 8
D = 1 + 4 + 8
E = 2 + 4 + 8
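The decomposition of the coefficients into powers of two can be checked with a short script. This is a minimal sketch under stated assumptions: the helper names (xtime, gf_mul_ref, mul_by_coeff, inv_mix_column) are hypothetical, and the dictionary-based selection merely stands in for the multiplexer and XOR network of the hardware in [10].

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b


def gf_mul_ref(a, b):
    """Reference shift-and-add multiplication in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def mul_by_coeff(byte, coeff):
    """Multiply by a 4-bit coefficient using a chain of three doublings.

    The multiples 2x, 4x and 8x come from three cascaded xtime stages;
    the coefficient bits select which multiples are XORed together.
    """
    x2 = xtime(byte)
    x4 = xtime(x2)
    x8 = xtime(x4)
    multiples = {0x1: byte, 0x2: x2, 0x4: x4, 0x8: x8}
    result = 0
    for bit, value in multiples.items():
        if coeff & bit:
            result ^= value
    return result


def inv_mix_column(column):
    """InvMixColumns on one column; first matrix row is (E, B, D, 9)."""
    coeffs = [0xE, 0xB, 0xD, 0x9]
    out = []
    for row in range(4):
        acc = 0
        for i in range(4):
            acc ^= mul_by_coeff(column[i], coeffs[(i - row) % 4])
        out.append(acc)
    return out


if __name__ == "__main__":
    for coeff in (0x1, 0x2, 0x3, 0x9, 0xB, 0xD, 0xE):
        assert all(mul_by_coeff(b, coeff) == gf_mul_ref(b, coeff)
                   for b in range(256))
    print("decomposition matches reference multiplication")
```

Because every coefficient is a sum of at most three of the multiples 1x, 2x, 4x and 8x, one shared doubling chain serves both MixColumns and InvMixColumns.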

Figure 2. High level architecture of the AES crypto-processor

Figure 3. MixColumn Multiplier


This decomposition lets us implement the multiplications with just three multiply-by-2 blocks, one multiplexer and 5 XOR gates [10]. Our design needs one (Inv)MixColumns unit for the round unit and one InvMixColumns unit for the key expansion unit.

D. (Inv)ShiftRows

Unlike (Inv)MixColumns, which operates on columns, (Inv)ShiftRows operates on rows. Since the design is 8 bits wide and has to generate one byte of ciphertext per clock cycle, and since (Inv)MixColumns directly follows (Inv)ShiftRows, a special architecture is needed to rearrange the rows of the state block and send the bytes to (Inv)MixColumns in the right column-wise order. We used the design in [11] for both left and right shifting. This architecture, which is shown in Fig. 4, has 12 registers for partial storage of the state bytes. The remaining four bytes are processed and held in the other data path registers of the AES core [6]. By choosing the multiplexer select signals in the proper order [11], left and right shifting is performed by reordering the state bytes while they are shifted through the unit.

E. Key Expansion

There are two approaches to providing the round keys. One is to pre-compute and store all the round keys, and the other is to produce them on the fly. The first approach is not optimal in terms of area, since it requires 11×16 bytes to store the round keys. The second approach produces the next block of the round key from the previous one. As the round keys remain the same while the key is not changed, the first solution is more energy-efficient in the long run. Since one of our most important goals is to minimize the area, in the trade-off between area and power we preferred the second solution. However, our design is still low power. The key expansion unit is improved significantly over the architecture presented in [6].

The proposed design needs to provide the proper on-the-fly round keys during decryption. The round keys for decryption are the same as for encryption but in the reverse order. However, producing the round keys on the fly in this order needs another algorithm, which is applied to the last block of the expanded round keys [12]. In the first 11 rounds of the decryption, no output comes out and the key expansion unit only expands the initial key to generate the last block. When the last block is prepared, the decryption process begins and it takes 11 rounds. So, decryption takes twice as many clock cycles as encryption.

The key expansion unit operates on 4-byte words and produces the next key block. It iterates this function Nr times. One round of the key expansion algorithm for encryption and for decryption is shown in Fig. 4.a and 4.b, respectively [6], [1], [12]. The first column of roundkey[i] is formed by XORing the same column of roundkey[i-1] with a transformation of the previous word or words. This transformation XORs Rcon(i) with the SubByte( ) of the rotated last word in encryption, and with the SubByte( ) of the rotated XOR of the last two words in decryption. Rcon(i) is a round-dependent constant which is zero for all the bytes of a key block except the first one. The other columns of roundkey[i] are formed by XORing the same columns of roundkey[i-1] with an appropriate word. This word is the previous (just computed) word of roundkey[i] for encryption and the preceding word of roundkey[i-1] for decryption.

The architecture for this algorithm is depicted in Fig. 4.c. The InvMixColumns part is used for decryption. As described in Section 2, to achieve a similar structure for both encryption and decryption we need to apply an InvMixColumns transformation to the round keys in decryption. The buffer keeps the input of InvMixColumns unchanged during encryption and thus saves power. The rk_last_out output is used for AddRoundKey during the final round. The other output, rk_int_out, is used for AddRoundKey during the first Nr-1 rounds.

Since a 32-bit MixColumns or InvMixColumns operation takes four clock cycles, the key expansion unit output is delayed by four clock cycles in encryption relative to decryption.
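The forward and reverse on-the-fly round key steps described above can be sketched as follows for AES-128. This is an illustrative software model only: the function names are hypothetical, the S-box is computed by brute-force inversion rather than by the composite-field circuit of [8], and the InvMixColumns applied to the decryption round keys in the hardware is left out. The key in the usage example is the AES-128 example key from FIPS-197 [1].

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def sbox_entry(x):
    """S-box value: multiplicative inverse (x^254), then the affine map."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    out = 0x63
    for shift in range(5):
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out


SBOX = [sbox_entry(x) for x in range(256)]
RCON = [0x01]
for _ in range(9):
    RCON.append(xtime(RCON[-1]))


def xor_words(a, b):
    return [x ^ y for x, y in zip(a, b)]


def g(word, rcon):
    """Rotate the word by one byte, apply SubByte, add Rcon to byte 0."""
    out = [SBOX[b] for b in word[1:] + word[:1]]
    out[0] ^= rcon
    return out


def next_round_key(key, rcon):
    """Encryption direction: roundkey[i] from roundkey[i-1]."""
    w = [list(key[4 * j:4 * j + 4]) for j in range(4)]
    new = [xor_words(w[0], g(w[3], rcon))]
    for j in range(1, 4):
        new.append(xor_words(w[j], new[j - 1]))   # previous new word
    return bytes(b for word in new for b in word)


def prev_round_key(key, rcon):
    """Decryption direction: recover the preceding round key."""
    w = [list(key[4 * j:4 * j + 4]) for j in range(4)]
    old = [None] * 4
    for j in (3, 2, 1):
        old[j] = xor_words(w[j], w[j - 1])        # preceding word, same key
    old[0] = xor_words(w[0], g(old[3], rcon))     # uses w[3] XOR w[2]
    return bytes(b for word in old for b in word)


def last_round_key(cipher_key):
    key = cipher_key
    for rcon in RCON:
        key = next_round_key(key, rcon)
    return key


if __name__ == "__main__":
    k0 = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
    k10 = last_round_key(k0)
    key = k10
    for rcon in reversed(RCON):
        key = prev_round_key(key, rcon)
    assert key == k0
    print(k10.hex())
```

Only one 16-byte key block is held at any time in either direction, which is the storage saving that motivates the on-the-fly approach; the cost is that decryption must first run the forward expansion to reach the last round key.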

Figure 4. Key expansion: (a) one round of the key expansion algorithm for encryption, (b) one round of the key expansion algorithm for decryption, (c) key expansion data path


F. Data Path

Fig. 5 shows the AES core data path. In encryption, the plaintext and the encryption key are simultaneously loaded into the core one byte at a time through the input ports data_in and key_in. In decryption, 160 clock cycles are needed to produce the first byte of the last round key before the actual decryption begins. Once the decryption process begins, the ciphertext is loaded into the core and the last round key is used as the input key. The encryption process takes 176 clock cycles including loading and unloading, and the decryption takes 336 clock cycles including the key generation. Since data is stored in the registers of the architecture, a new plaintext and key can be loaded into the core during the unloading of the ciphertext. This feature improves the performance, as there are no dead cycles between the processed blocks. So with consecutive loading, the effective number of clock cycles per block is 160 for encryption and 320 for decryption.

IV. RESULTS AND COMPARISON

The AES architecture is described in Verilog HDL at the register-transfer level. The RTL was synthesized to the gate level using a 0.18 µm, 1.8 V, standard-cell CMOS technology. Advanced industry-standard CAD tools were used to perform simulation and synthesis. The results are presented in Table I. All the results in Table I are given for typical operating conditions (1.8 V and 25 °C). The reported area is the number of gate equivalents (i.e., the total area divided by the area of a NAND gate in the technology used). The power has been measured from the node switching activities in the gate-level circuit with typical test vectors as stimulus. Fig. 6 depicts important signals for the last two rounds of the operation. Table I also compares our results with previous 8-bit ASIC AES cores.

In order to achieve the minimum area, we chose an 8-bit architecture, which is the smallest useful word size. Implementing an 8-bit architecture not only decreases the number of required units such as registers, (Inv)SubByte and (Inv)MixColumn, but also reduces the bit width of all the buses and of units like multiplexers. These reductions save a lot of area, so the design is smaller than most of the already published AES hardware architectures. Our AES core includes 5.6k gates, which is more than [6] and [7]. However, we have added decryption compared with [6], so the larger area is justifiable. Also, we have achieved nearly 1/6 of the cycle count of [7] with about 2k more gates. Registers are the largest part of the design, followed by (Inv)MixColumn and (Inv)SubByte.

The power consumption of an 8-bit architecture is much less than that of comparable 32-bit and 128-bit processors. Also, the outputs of the registers and multiplexers in the architecture change only when they have to be updated. This method has two advantages. First, changing the register outputs exactly when needed saves the power consumed in them, and it decreases the effective clock frequency, to which the dynamic power is proportional. Second, the inputs of the combinational logic parts do not change when their output is not needed. This decreases the switching activity of the circuit, which is another important factor in dynamic power. Considering the larger area, different technology and higher voltage, the power consumption of our core is still at the same level as [6] and [7]. The most power-consuming part of the architecture is (Inv)SubByte. It consists of about 600 gates and creates a high amount of switching activity, which increases the power consumed in this block and in (Inv)MixColumn, which is at the output of (Inv)SubByte.

Since different parts of our design work in parallel, the throughput of the circuit is high relative to its area. The cycle count of the design for encryption is 160, which is equal to [6] and is the lowest for an iterated 8-bit AES implementation. Also, as mentioned above, the cycle count of our design is about 1/6 of that of [7]. The critical path begins in the control unit, passes through the SubByte and InvMixColumn in the key expansion unit and ends at the Data_out register.
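The reported throughput figures follow from the block size, the clock frequency and the effective cycle count per block. The short calculation below is a sketch of that arithmetic; the function names are hypothetical, and the decryption throughput and absolute power values it can produce are derived here for illustration and are not figures reported in the paper.

```python
BLOCK_BITS = 128  # AES block size in bits


def throughput_mbps(freq_mhz, cycles_per_block):
    """Throughput in Mbps for back-to-back blocks at the given clock."""
    return BLOCK_BITS * freq_mhz / cycles_per_block


def power_mw(uw_per_mhz, freq_mhz):
    """Absolute power in mW from a normalized uW/MHz figure."""
    return uw_per_mhz * freq_mhz / 1000.0


if __name__ == "__main__":
    # Encryption, speed-optimized synthesis: 254 MHz, 160 cycles per block.
    print(round(throughput_mbps(254, 160), 1))
    # Decryption at the same clock with 320 effective cycles (derived value).
    print(round(throughput_mbps(254, 320), 1))
    # Power-optimized synthesis: 49 uW/MHz at 128 MHz (derived value).
    print(round(power_mw(49, 128), 3))
```

The same formula with 80 MHz and 1032 cycles gives roughly 10 Mbps, which is consistent with the throughput listed for [7].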

Figure 5. AES core data path


Table I. Results and comparison with previous 8-bit implementations

Implementation  Process  Voltage  Area      Power     Cycles per block  Max. freq.  Throughput  Decryption
                [µm]     [V]      [kgates]  [µW/MHz]  (for encryption)  [MHz]       [Mbps]
Ours (area)     0.18     1.8      5.6       57        160               130         104         yes
Ours (power)    0.18     1.8      5.8       49        160               128         102         yes
Ours (speed)    0.18     1.8      6.4       89        160               254         203         yes
[6] (area)      0.13     1.2      3.1       37        160               152         121         no
[6] (power)     0.13     1.2      3.2       30        160               130         104         no
[6] (speed)     0.13     1.2      3.9       62        160               290         232         no
[7]             0.35     3.3      3.4       n/a       1032              80          10          yes
[7]             0.35     1.5      3.4       45        1032              n/a         n/a         yes

Figure 6. Waveforms for the last two rounds of the operation

V. CONCLUSION

The paper presents a compact 8-bit AES crypto-processor for low-cost and low-power applications. The proposed architecture supports both encryption and decryption at minimal hardware cost thanks to the resource sharing and time multiplexing realized in the architecture. The proposed architecture is synthesized in 0.18 µm CMOS technology and simulated at gate level to measure the speed and power consumption. The proposed architecture consumes a mere 49 µW/MHz at 128 MHz in 0.18 µm CMOS technology, which outperforms the previously reported schemes.

REFERENCES

[1] National Institute of Standards and Technology (NIST), "Advanced Encryption Standard (AES)," FIPS-197, 2001.
[2] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," Proc. 5th Int. Workshop on Cryptographic Hardware and Embedded Systems (CHES 03), pp. 313-319, Sept. 2003.
[3] H. Li and J. Li, "A high performance sub-pipelined architecture for AES," Proc. Int. Conf. Computer Design (ICCD 05), pp. 491-496, 2005.
[4] A. Hodjat and I. Verbauwhede, "Minimum area cost for a 30 to 70 Gbits/s AES processor," Proc. IEEE Computer Society Annual Symp. on VLSI Emerging Trends in VLSI System Design (ISVLSI 04), pp. 103-112, 2006.
[5] C. P. Su et al., "A configurable AES processor for enhanced security," Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC 05), pp. 361-366, 2005.
[6] P. Hämäläinen et al., "Design and implementation of low-area and low-power AES encryption hardware core," Proc. Euromicro Conf. on Digital System Design (DSD 06), pp. 443-453, 2006.
[7] M. Feldhofer, J. Wolkerstorfer, and V. Rijmen, "AES implementation on a grain of sand," IEE Proc. Inf. Secur., pp. 13-20, 2005.
[8] S. Chantarawong et al., "An architecture for S-Box computation in the AES," pp. 157-162, Jan. 2004.
[9] P. Bulens, "Implementation of the AES-128 on Virtex-5 FPGAs," Int. Conf. on Cryptology in Africa, Casablanca, Morocco (AFRICACRYPT 08), June 2008.
[10] H. Li and Z. Friggstad, "An efficient architecture for the AES mix columns operation," IEEE Int. Symp. Circuits and Systems (ISCAS 05), vol. 5, pp. 4637-4640, May 2005.
[11] T. Järvinen et al., "Efficient byte permutation realization for compact AES implementations," Proc. 13th European Signal Processing Conf. (EUSIPCO 05), Sept. 2008.
[12] J. Daemen and V. Rijmen, The Design of Rijndael: AES, the Advanced Encryption Standard, ISBN 3-540-42580-2, Springer-Verlag, Berlin Heidelberg New York, pp. 56-59.
