Outline of GPU-based Encryption Acceleration


11/16/14 10:55 PM

Abstract

Salman Ul Haq, Jawad Masood, Aamir Majeed, Usman Aziz

10/11/2011

Source: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/bulk-encryption-on-gpus/

This article covers the implementation and optimization of the Advanced Encryption Standard (AES) on AMD GPUs using OpenCL, fine-tuned for bulk encryption applications. Reliable encryption schemes are needed to ensure the information security of individuals, organizations and governments by protecting against potential threats. One such scheme is AES-based bulk encryption, which builds on the Rijndael algorithm, a symmetric block cipher with 128-bit, 192-bit and 256-bit cipher keys. OpenCL allows you to tap into the huge parallel processing power of GPUs for data-parallel computing applications. This article begins by exploring the AES algorithm, focusing on a parallel breakdown of the problem and explaining suitable indexing schemes. This is followed by GPU-specific optimization strategies, such as using local memory, and their relation to the required memory bandwidth and computational intensity. We finish the article by examining the final benchmarks, which show the acceleration achieved using AMD GPUs.

Introduction

Information security is becoming increasingly important given the ever-increasing

number of new applications in the public and private domain. There is a continuing

trend to secure data in all of its uses, ranging from its live communication to archived

data storage. The unauthorized access to intercepted transmissions can result in the

compromise of sensitive and vital information. Data managers around the world are,

thus, facing an interesting dilemma: how to store data securely while still being able to

access it quickly. Encryption is an effective solution for protecting valuable data assets

against such attacks.

Encryption

Encryption is the process of transforming information referred to as plain-text into an

unintelligible code called cipher-text, using a secret key and an algorithm generally referred to as the cipher [1]. The cipher-text (encrypted data) can be decoded back into its

original form using the same cipher algorithm and the secret key. In this process, critical

information can be protected from hackers, competitors and others who would use the

information for malicious intent.

Common uses for encryption technology are found in the static archiving of large

amounts of sensitive data, as well as its communication over the local area network

(LAN) or across an Internet gateway in the case of Wide Area Networks (WANs) or Virtual Private Networks (VPNs). Similar applications can also be abundantly found in the

telecommunications industry and other proprietary setups dealing with data protection

issues.

Bulk Encryption

Bulk encryption provides safe and effective methods for protecting data against compromise and theft, both while it is stored and while it is transmitted.

Bulk encryption technology provides a method to encrypt large amounts of data during

transmission or storage. The amount of information that must be encrypted, however,

simultaneously leads to very large response times. Currently, the processing power requirements for bulk encryption are being met by hardware extensions in the form of

cryptographic accelerators [2]. There exists the potential to use the parallel processing

power of a GPU as a co-processor in a similar role that existing hardware cryptographic

solutions play.

Most modern encryption algorithms or ciphers can be categorized in one of the following ways:

1. Whether the same key is used for both encryption and decryption (symmetric key

algorithms), or if a different key is used for each (asymmetric key algorithms). The

use of a symmetric algorithm requires only a single key for both the encryption

and decryption process. Encryption schemes that do not involve the use of a key

are far less secure and subject to compromise. In fact, anyone who is in possession


of the decryption algorithm can decipher any transmission written with that particular algorithm.

2. Whether they work on blocks of symbols of a fixed size (block ciphers), or on a continuous stream of symbols (stream ciphers). A block cipher is a symmetric key cipher operating on fixed-length groups of bits, called blocks, with an unvarying

transformation. A block cipher encryption algorithm might take (for example) a

128-bit block of plain-text as input, and output a corresponding 128-bit block of

cipher-text. The exact transformation is controlled using a second input called the

secret key. Decryption is similar; the decryption algorithm takes, in this example, a

128-bit block of cipher text together with the secret key, and yields the original 128-bit block of plain-text. A message longer than the block size (128 bits in the above

example) can still be encrypted with a block cipher by breaking the message into

blocks and encrypting each block individually. Since all pure block ciphers have

independent workloads, they are the ideal candidates for parallel implementation.

The Advanced Encryption Standard (AES) is a symmetric-key encryption standard approved by NSA for top secret information and is adopted by the U.S. government. The

standard was adopted from a larger collection originally published as Rijndael [5]. The

Rijndael cipher was developed by two Belgian cryptographers, Joan Daemen and Vincent Rijmen, and submitted by them to the AES selection process. AES is based on a

design principle known as a substitution permutation network. The standard comprises

three block ciphers: AES-128, AES-192 and AES-256. Each of these ciphers has a 128-bit

block size, with key sizes of 128, 192 and 256 bits, respectively. The AES ciphers have

been analyzed extensively and are now used worldwide, as was the case with its

predecessor, the Data Encryption Standard (DES) [5].

AES was selected due to the level of security it offers and its well documented implementation and optimization techniques [6]. Furthermore, AES is very efficient in terms

of both time and memory requirements. The block ciphers have high computation intensity and independent workloads (apply the same steps to different blocks of plain text),

so acceleration using a GPU is the next logical step.

AES Algorithm


In this section, we will provide a brief overview of the AES algorithm and the working

of its major constituent computations.

The AES block cipher operates on a 4×4 array of bytes (128 bits), termed the state. For

the AES algorithm, the size of the input block, the output block and the state is 128 bits.

This is represented by Nb = 4, which reflects the number of 32-bit words (number of columns) in the state array. The permissible lengths of the Cipher Key, K, are 128, 192, and

256 bits. The key length is represented by Nk = 4, 6, or 8, which reflects, again, the number of 32-bit words (number of columns) in the Cipher Key array [6].

The state is encrypted or decrypted by applying byte-oriented transformations for a

specific number of rounds. The number of rounds to be performed is dependent on the

key size. The number of rounds is represented by Nr, where Nr = 10 when Nk = 4, Nr =

12 when Nk = 6, and Nr = 14 when Nk = 8 [6].
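
For the three permitted key sizes, the mapping from key length to round count reduces to Nr = Nk + 6. A minimal C sketch of that arithmetic follows; the function name aes_num_rounds is illustrative and does not come from the article's code.

```c
/* Hypothetical helper (not from the article): returns the number of AES
 * rounds Nr for a key length given in bits, or -1 for an unsupported size. */
int aes_num_rounds(int key_bits)
{
    if (key_bits != 128 && key_bits != 192 && key_bits != 256)
        return -1;
    int nk = key_bits / 32;   /* Nk: number of 32-bit words in the cipher key */
    return nk + 6;            /* Nr = 10, 12 or 14 for Nk = 4, 6 or 8 */
}
```

A kernel that receives the key length as an argument could derive its loop bound the same way.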

The AES algorithm specifies both cipher and its inverse for the complete

encrypt-decrypt cycle. The Forward Cipher takes plain-text as input along with the

cipher-key and its output is the encrypted data or cipher-text. The Inverse Cipher takes

this cipher-text as input and decrypts it back to plain-text using the same cipher-key

used for encryption.

The AES algorithm consists of the following phases:

1. Key Expansion. Round keys are derived from the cipher key using the Rijndael key schedule.

2. Initial Round. AddRoundKey: each byte of the state is combined with the round key using a bitwise XOR operation.

3. Middle Rounds. For round = 1 to Nr-1, repeatedly perform the following transformations:

   1. SubBytes: a non-linear substitution step where each byte is replaced with another according to a lookup table.
   2. ShiftRows: a transposition step where each row of the state is shifted cyclically a certain number of steps.
   3. MixColumns: a mixing operation which operates on the columns of the state, combining the four bytes in each column.
   4. AddRoundKey: same as described above.

4. Final Round (no MixColumns)

   1. SubBytes: same as described above.
   2. ShiftRows: same as described above.
   3. AddRoundKey: same as described above.

AES Transformations

For both the Forward and Inverse Cipher, the AES algorithm uses a round function that is composed of four byte-oriented transformations:

1. SubBytes Transformation. The SubBytes transformation is a non-linear byte substitution that operates independently on each byte of the state. In this step, each byte of the state array is updated using an 8-bit substitution box, the Rijndael S-box. This operation provides the non-linearity in the cipher and helps avoid attacks based upon algebraic manipulation. The S-box is derived by combining the multiplicative inverse over GF(2^8), known to have good non-linearity properties, with an invertible affine transformation. The complete S-box table is displayed below in Figure 2 [6].

2. AddRoundKey Transformation. In the AddRoundKey transformation, a Round Key is added to the state by a simple bitwise XOR operation. A Round Key is derived for each round from the cipher key using the Rijndael key schedule. Each Round Key is the same size as the state. AddRoundKey can be mathematically represented as follows [6]:

$$[s'_{0,c}, s'_{1,c}, s'_{2,c}, s'_{3,c}] = [s_{0,c}, s_{1,c}, s_{2,c}, s_{3,c}] \oplus [w_{round \cdot Nb + c}] \quad \text{for } 0 \le c < Nb$$

where [w_i] are the expanded key words, and round is a value in the range 0 ≤ round ≤ Nr. In the Cipher, the initial Round Key addition occurs when round = 0, prior to the first application of the round function.

3. ShiftRows Transformation. The ShiftRows transformation operates on the rows of the state and cyclically shifts the bytes in each row by a different offset (number of bytes). For AES, the first row is left unchanged. Each byte of the second row is shifted one byte to the left. Similarly, the third and fourth rows are shifted by offsets of two and three bytes respectively. In this way, each column of the output state of the ShiftRows step is composed of bytes from each column of the input state. Specifically, the ShiftRows transformation proceeds as follows [6]:

$$s'_{r,c} = s_{r,\,(c + shift(r, Nb)) \bmod Nb} \quad \text{for } 0 < r < 4 \text{ and } 0 \le c < Nb$$

where the shift value shift(r, Nb) depends on the row number, r, as follows:

$$shift(1, 4) = 1; \quad shift(2, 4) = 2; \quad shift(3, 4) = 3$$

4. MixColumns Transformation. In the MixColumns transformation, the four bytes of each column of the state are combined using an invertible linear transformation. Each column is treated as a polynomial over GF(2^8) and is multiplied by the coefficient polynomial $c(x) = 3x^3 + x^2 + x + 2$ (modulo $x^4 + 1$). The coefficients are displayed in their hexadecimal equivalent of the binary representation of bit polynomials from GF(2^8). This transformation, in conjunction with the ShiftRows transformation, provides diffusion in the original message, spreading out any non-uniform patterns.
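
The S-box derivation mentioned under SubBytes (multiplicative inverse in GF(2^8) followed by an affine transformation) can be sketched in plain C. This is a host-side illustration under stated assumptions, not the article's code: the function names are hypothetical, and the inverse is computed by brute-force exponentiation for clarity rather than speed.

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo the AES polynomial
 * x^8 + x^4 + x^3 + x + 1 (0x11b). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;
        uint8_t carry = a & 0x80;
        a = (uint8_t)(a << 1);
        if (carry)
            a ^= 0x1b;
        b >>= 1;
    }
    return p;
}

/* Multiplicative inverse in GF(2^8), computed as a^254; 0 maps to 0. */
static uint8_t gf_inv(uint8_t a)
{
    uint8_t r = 1;
    for (int i = 0; i < 254; i++)
        r = gf_mul(r, a);
    return r;
}

/* Rotate an 8-bit value left by n bits (1 <= n <= 7). */
static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* One S-box entry: inverse, then the affine transformation with constant 0x63. */
uint8_t aes_sbox_entry(uint8_t x)
{
    uint8_t b = gf_inv(x);
    return (uint8_t)(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63);
}

/* Fill a 256-entry lookup table, such as one a kernel could index into. */
void aes_build_sbox(uint8_t sbox[256])
{
    for (int i = 0; i < 256; i++)
        sbox[i] = aes_sbox_entry((uint8_t)i);
}
```

In practice the table is usually precomputed once and shipped as a constant, which is what a lookup-table based SubBytes relies on.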

The AES Transformations discussed above are the Forward Transformations used by the Forward Cipher for encryption of the plain-text. The Inverse Cipher uses the Inverse Transformations for decryption of cipher-text back to plain-text. AddRoundKey remains the same for both the Forward and Inverse Cipher, because XOR with the same round key is its own inverse. The SubBytes Transformation is replaced by the InvSubBytes Transformation, which takes the substitution values from the Inverse S-box table. Likewise, ShiftRows and MixColumns are replaced by InvShiftRows, which shifts each row in the opposite direction, and InvMixColumns, which multiplies each column by the inverse coefficient polynomial [6].

Key expansion takes the input key (cipher key) of 128, 192 or 256 bits and produces an expanded key for use in the rounds of subsequent stages. The expanded key's size is related to the number of rounds to be performed, since one 128-bit round key is needed for the initial round plus one for each of the Nr rounds. For 128-bit keys, there are 10 rounds and the expanded key size is 1408 bits. For 192 and 256-bit keys, the number of rounds increases to 12 and 14 respectively, with an overall expanded key size of 1664 and 1920 bits. During each round, a different portion of the expanded key is used in the AddRoundKey step.
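
As an illustration of the key expansion step, here is a minimal host-side C sketch that follows the key schedule in [6]. It is not the article's implementation: the names are hypothetical, the S-box entries are computed on the fly instead of being read from a table, and the expanded key is returned as a byte array, whereas the article's kernels index keysched as 32-bit words.

```c
#include <stdint.h>
#include <string.h>

/* Multiply by x in GF(2^8), reducing by the AES polynomial. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* General multiplication in GF(2^8). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return p;
}

/* S-box entry computed on the fly: inverse (x^254), then the affine map. */
static uint8_t sbox(uint8_t x)
{
    uint8_t inv = 1;
    for (int i = 0; i < 254; i++)
        inv = gf_mul(inv, x);
    uint8_t s = inv;
    for (int n = 1; n <= 4; n++)
        s ^= (uint8_t)((inv << n) | (inv >> (8 - n)));
    return (uint8_t)(s ^ 0x63);
}

/* Expand a cipher key of nk 32-bit words (nk = 4, 6 or 8) into
 * 16 * (nk + 7) bytes of round keys, i.e. Nb * (Nr + 1) words. */
void aes_key_expand(const uint8_t *key, int nk, uint8_t *w)
{
    int total_words = 4 * (nk + 6 + 1);
    uint8_t rcon = 0x01;

    memcpy(w, key, (size_t)(4 * nk));
    for (int i = nk; i < total_words; i++) {
        uint8_t t[4];
        memcpy(t, w + 4 * (i - 1), 4);
        if (i % nk == 0) {
            /* RotWord, SubWord, then XOR the round constant. */
            uint8_t t0 = t[0];
            t[0] = sbox(t[1]) ^ rcon;
            t[1] = sbox(t[2]);
            t[2] = sbox(t[3]);
            t[3] = sbox(t0);
            rcon = xtime(rcon);
        } else if (nk > 6 && i % nk == 4) {
            /* Extra SubWord step used only for 256-bit keys. */
            for (int j = 0; j < 4; j++)
                t[j] = sbox(t[j]);
        }
        for (int j = 0; j < 4; j++)
            w[4 * i + j] = w[4 * (i - nk) + j] ^ t[j];
    }
}
```

The output buffer sizes of 176, 208 and 240 bytes correspond to the 1408, 1664 and 1920 bits quoted above.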

Modes of Operation

A block cipher by itself allows encryption of a single data block of size equal to the

cipher's block size. Modes of operation enable the repeated and secure use of a block

cipher, on multiple data blocks, under a single key [7]. When targeting a variable-length

message, the data must first be partitioned into separate cipher blocks. Typically, the

last block must also be extended to match the cipher's block length using a suitable

padding scheme. A mode of operation describes the process of encrypting each of these

data blocks, and generally uses randomization based on an additional input value, often

called an initialization vector.
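
The article does not name a padding scheme. As one common choice, the sketch below assumes PKCS#7 padding, in which every padding byte stores the number of bytes added; the function names are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define AES_BLOCK_BYTES 16

/* Length after PKCS#7 padding: between 1 and 16 bytes are always added,
 * so the result is the next multiple of the block size strictly above len. */
size_t pkcs7_padded_len(size_t len)
{
    return (len / AES_BLOCK_BYTES + 1) * AES_BLOCK_BYTES;
}

/* Copy msg into out (which must hold pkcs7_padded_len(len) bytes) and
 * append the padding. Returns the padded length. */
size_t pkcs7_pad(const uint8_t *msg, size_t len, uint8_t *out)
{
    size_t total = pkcs7_padded_len(len);
    uint8_t pad = (uint8_t)(total - len);
    memcpy(out, msg, len);
    memset(out + len, pad, pad);
    return total;
}
```

With the data padded this way, the input length is always a whole number of 16-byte blocks, so every work-item receives a complete state block.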

There are different modes under which encryption can take place; some modes are inherently more secure and some lend themselves more to parallelism. The following table lists various modes of operation along with their inherent level of parallelism [7]. For details on the modes of operation, see the References section [7].

Mode of Operation              Parallelism
Electronic codebook (ECB)      High
Counter (CTR)                  High
Cipher-block chaining (CBC)    Low
Cipher feedback (CFB)          Low
Output feedback (OFB)          Low

The ECB mode is the most parallel of these modes. The message is divided into blocks, each block is encrypted with an identical key, and there is no serial dependence between the blocks. The advantage of ECB mode is the extensive parallelism, which scales well to the GPU architecture. The disadvantage is that identical plain-text blocks are encrypted into identical cipher-text blocks; thus, it does not hide data patterns well, and the large-scale structures in the plain-text are preserved [7].

In the Counter (CTR) mode, the large scale structures that may have been present in the

original plain-text are diminished. Thus, the cipher-text blocks obtained by encrypting

two identical plain-text blocks using CTR mode are completely different. This provides a better security level than the ECB mode [7]. We have implemented the ECB mode of operation, which is not only parallel but can also be easily extended to CTR.
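
To illustrate why an ECB-style block kernel extends naturally to CTR, the sketch below builds CTR on top of any single-block forward cipher. The function-pointer signature, the 8-byte nonce layout and the toy stand-in cipher are assumptions made for this example; they are not taken from the article, and the toy cipher is not secure.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed signature for a single-block forward cipher (16 bytes in, 16 out). */
typedef void (*block_encrypt_fn)(const uint8_t in[16], uint8_t out[16], const void *ctx);

/* CTR mode: block i is XORed with E(nonce || i). Each block depends only on
 * its own index, so every loop iteration could map to one GPU work-item.
 * Running the same function on the cipher-text decrypts it. */
void ctr_xor(block_encrypt_fn encrypt, const void *ctx, const uint8_t nonce[8],
             const uint8_t *in, uint8_t *out, size_t len)
{
    for (size_t blk = 0; blk * 16 < len; blk++) {
        uint8_t counter[16], keystream[16];
        memcpy(counter, nonce, 8);
        for (int j = 0; j < 8; j++)   /* big-endian block index */
            counter[8 + j] = (uint8_t)((uint64_t)blk >> (8 * (7 - j)));
        encrypt(counter, keystream, ctx);
        for (size_t j = 0; j < 16 && blk * 16 + j < len; j++)
            out[blk * 16 + j] = in[blk * 16 + j] ^ keystream[j];
    }
}

/* Toy stand-in for AES, used only to exercise ctr_xor. NOT a secure cipher. */
void toy_block_encrypt(const uint8_t in[16], uint8_t out[16], const void *ctx)
{
    const uint8_t *key = (const uint8_t *)ctx;
    for (int j = 0; j < 16; j++)
        out[j] = (uint8_t)(in[j] ^ key[j] ^ (uint8_t)(17 * j));
}
```

Only the forward cipher is needed in this construction, for both encryption and decryption, and the per-block independence is the same property that makes ECB map well onto the GPU.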

GPUs are massively parallel devices. The SPMD architecture allows GPUs to perform

the same operation on multiple data sets simultaneously. AES is based on a block cipher

algorithm that operates on 128 bit data chunks independently. This provides a high level of Data Parallelism as the same operation is performed on each state block, with no

dependencies in between (provided that the AES is being used in the ECB mode). So,

multiple blocks can be encrypted simultaneously, which yields the large number of parallel threads that GPUs handle well. As the data size increases so does the level of parallelism, thus making GPUs more efficient for bulk encryption. The only serial operation in AES is the key expansion, which provides the round keys to be used in subsequent rounds. However, given its serial nature and the fact that it is just a one-time operation, key expansion can be safely moved to the CPU Host Code for better performance. The figure explaining the parallel nature of AES is included in the Design Approach section.

Now that we have a fair understanding of the AES encryption algorithm and its

different ingredients, such as key-expansion and AES-Transformations, we are ready to

start an implementation using OpenCL specific to an AMD GPU. In the remaining

part of this article, we will discuss our design approach in detail, including an efficient

indexing scheme for handling input and output data, and device functions for various

AES-Transformations. Later on, we will investigate memory optimization strategies,

such as using local memory, constant cache and coalesced memory accesses along with

their impact on throughput performance and memory bandwidth.

Design Approach

In this section we will discuss the level of parallelism we want to exploit, the portion of

code that should be ported to GPU, and the Host-Device work division.

In the current approach, we will exploit parallelism only on the block level without changing the original algorithm. (Breaking down the algorithm further can optimize the results, though that discussion is outside the scope of this article.) Each work-item will take one state block as input and convert it to cipher-text. This implies that the Global Work-Size is directly proportional to the exploited parallelism. Encryption of a single 128-bit state block remains serial; however, we use loop unrolling to optimize the code. Another serial operation in AES is the key expansion, which provides the

round keys to be used in subsequent rounds. However, keeping in view its serial nature

and the fact that it is just a one-time operation, key-expansion can be safely moved to

the CPU Host Code for better performance. Figure 3 below explains the parallelism in

AES as well as our design approach.
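
Because each work-item handles one 16-byte state block, the global work size follows directly from the input length. The host-side C sketch below shows that arithmetic; the function names and the rounding to a multiple of the work-group size are assumptions of this example rather than details given in the article.

```c
#include <stddef.h>

/* One work-item per 16-byte state block (assumes the data is already padded). */
size_t aes_num_blocks(size_t data_bytes)
{
    return data_bytes / 16;
}

/* Round the block count up to a multiple of the work-group size, since
 * OpenCL 1.1 requires the global size to be divisible by the local size
 * when a local size is specified. */
size_t aes_global_work_size(size_t data_bytes, size_t local_size)
{
    size_t blocks = aes_num_blocks(data_bytes);
    return ((blocks + local_size - 1) / local_size) * local_size;
}
```

When the global size is rounded up like this, the kernel would need a guard so that work-items whose index is at or beyond the real block count do nothing.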

In this section we will discuss the device functions that implement the AES-Transformations in OpenCL. We will also list sample code for implementing each of these transformations in OpenCL.

In the SubBytes transformation, each state value is replaced by the S-box entry whose index equals that state value. For example, if S[1,1] = {53}, then the substitution value from the S-box is found at the intersection of the row with index 5 and the column with index 3. This process is explained in Figure 4 below.

A sample code for SubBytes transformation is listed next:

    // SubBytes: replace every state byte with its S-box entry.
    for (i = 0; i < 16; i++)
    {
        x = i & 0x03;   // row index r
        y = i >> 2;     // column index c
        state[4*x + y] = gpu_AES_sbox[state[4*x + y]];
    }

In the ShiftRows transformation, the bytes in the last three rows of the state are cyclically shifted over different numbers of bytes. The first row, r = 0, is not shifted. Each byte

of the second row is shifted one byte to the left. Similarly, the third and fourth rows are

shifted by offsets of two and three bytes respectively. This has the effect of moving bytes to lower positions in the row, while the lowest bytes wrap around into the top of the row. Figure 5 below explains the ShiftRows transformation. Here S represents the state array and S' is the n_state array.

The code for ShiftRows transformation is listed below. The shift rows transformation

uses both n_state and the state buffers as it is not an in-place transform:

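    // ShiftRows: row x is rotated left by x bytes, written to n_state.
    for (i = 0; i < 16; i++)
    {
        x = i & 0x03;   // row index r
        y = i >> 2;     // column index c
        n_state[4*x + y] = state[4*x + ((y + x) & 0x03)];
    }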

In the AddRoundKey transformation, a Round Key is added to the state by a simple bitwise XOR operation. Each Round Key consists of Nb words from the expanded key obtained from the Key-Expansion function, described earlier. Figure 6 depicts the AddRoundKey Transformation.

The sample code for AddRoundKey transformation is listed next:

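    // AddRoundKey: XOR each state byte with the matching round-key byte.
    for (i = 0; i < 16; i++)
    {
        x = i & 0x03;   // row index r, selects the byte within the key word
        y = i >> 2;     // column index c, selects the key word
        state[4*x + y] = state[4*x + y] ^
            ((keysched[y] & (0xff << (x*8))) >> (x*8));
    }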

In the MixColumns transformation, each column is treated as a polynomial over GF(2^8) and is multiplied modulo $x^4 + 1$ by the coefficient polynomial a(x) [6] shown here:

$$a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$$

The MixColumns transformation updates each column of the state using a matrix multiplication, as explained by the following equation [6]:

$$\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix} \quad \text{for } 0 \le c < Nb$$

Figure 7 below explains the MixColumns transformation.
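
No sample code for MixColumns survives in this copy of the article, so the following is a stand-in sketch in plain C rather than the authors' kernel code. It assumes the same state layout as the other samples (the byte at row r, column c lives at state[4*r + c]) and uses a hypothetical helper, xtime, for multiplication by {02} in GF(2^8).

```c
#include <stdint.h>

/* Multiply by {02} in GF(2^8), reducing by the AES polynomial 0x11b. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* MixColumns on a state stored as state[4*row + column].
 * Multiplication by {03} is computed as xtime(v) ^ v. */
void mix_columns(uint8_t state[16])
{
    for (int c = 0; c < 4; c++) {
        uint8_t s0 = state[c];
        uint8_t s1 = state[4 + c];
        uint8_t s2 = state[8 + c];
        uint8_t s3 = state[12 + c];

        state[c]      = (uint8_t)(xtime(s0) ^ (xtime(s1) ^ s1) ^ s2 ^ s3);
        state[4 + c]  = (uint8_t)(s0 ^ xtime(s1) ^ (xtime(s2) ^ s2) ^ s3);
        state[8 + c]  = (uint8_t)(s0 ^ s1 ^ xtime(s2) ^ (xtime(s3) ^ s3));
        state[12 + c] = (uint8_t)((xtime(s0) ^ s0) ^ s1 ^ s2 ^ xtime(s3));
    }
}
```

Each output line corresponds to one row of the matrix above, so the four columns can be processed independently of one another.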

By now we have explained how the various AES Transformations are implemented in OpenCL, and we are ready to discuss the kernel code for both the Forward and Inverse AES Ciphers.

Forward Cipher

We explain the working of our kernel by considering the simplest case, where a single work-item operates on one 128-bit state block. The kernel arguments are the input and output buffers, the AES fixed-table buffers and the expanded key buffer. In addition, a key-length parameter, passed to the kernel as an argument, adds the flexibility to use all three allowed key sizes (128, 192 and 256 bits). The number of rounds to be performed is calculated from the key length. The 128-bit state is copied from the global plain-text buffer into registers for computation. Two blocks of state size are created in the register file, as not all the AES-Transformations can be performed in-place. The input is copied to the state block in registers using a special access pattern to allow coalescing (more on this later). Forward AES-Transformations are applied to the state block as described by the AES flow graph. The resulting cipher-text block is copied back to the global cipher-text buffer using the same indexing scheme that was followed while copying plain-text to the state.

Inverse Cipher

In this section we will discuss the major changes required to convert the Forward Cipher into the Inverse Cipher for the decryption process. The Inverse Cipher essentially runs the Forward Cipher in reverse order. The AES transformations used in the Inverse Cipher are the inverse versions of the previously discussed forward transforms.

The Inverse Cipher incorporates minor changes in the transformations, the order of execution and the required AES tables. For example, the InvSubBytes transform, which is the inverse of the SubBytes transform, requires the Inverse S-box table instead of the S-box table. The code for InvSubBytes is shown here:

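    // InvSubBytes: replace every state byte with its inverse S-box entry.
    for (i = 0; i < 16; i++)
    {
        x = i & 0x03;   // row index r
        y = i >> 2;     // column index c
        state[4*x + y] = gpu_AES_isbox[state[4*x + y]];
    }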

Of the remaining transformations, AddRoundKey is unchanged because XOR is its own inverse, while ShiftRows and MixColumns are replaced by their inverses, InvShiftRows and InvMixColumns [6]. The order in which these transforms are applied is also different from the Forward Cipher. Figure 8 displays the flow-graph for the Inverse Cipher.
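
As an illustration of how an inverse transformation relates to its forward counterpart, the sketch below pairs the article's ShiftRows indexing with an InvShiftRows that simply swaps the source and destination indices. It is plain C with hypothetical function names and is not taken from the article's kernels.

```c
#include <stdint.h>

/* Forward ShiftRows with the article's layout, state[4*row + column]:
 * row x is rotated left by x bytes. */
void shift_rows(const uint8_t state[16], uint8_t n_state[16])
{
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;   /* row index */
        int y = i >> 2;     /* column index */
        n_state[4 * x + y] = state[4 * x + ((y + x) & 0x03)];
    }
}

/* InvShiftRows: row x is rotated right by x bytes, undoing shift_rows. */
void inv_shift_rows(const uint8_t state[16], uint8_t n_state[16])
{
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;
        int y = i >> 2;
        n_state[4 * x + ((y + x) & 0x03)] = state[4 * x + y];
    }
}
```

Applying shift_rows followed by inv_shift_rows returns the original state, which is the property the Inverse Cipher relies on.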

Indexing Schemes

Now we will examine the input and output indexing schemes for a simple AES kernel

with a single work-item in detail. We will then explain what needs to be added to run

the kernel with multiple work-groups and larger work-group sizes.

At the beginning of the Forward or the Inverse Cipher, the input array is copied to the

state array according to the following convention [6]:

S[r, c] = In[r + 4c]   for 0 ≤ r < 4 and 0 ≤ c < Nb

where Nb = 4 in our case. This is the column-major access pattern depicted in Figure 9 below. The code for the column-major access pattern for input is listed here:

    // Load one 16-byte block into the state in column-major order.
    for (i = 0; i < 16; i++)
    {
        x = i & 0x03;   // row index r
        y = i >> 2;     // column index c
        state[4*x + y] = gpu_input[i];
    }

At the end of the Forward or the Inverse Cipher, the state array is copied to the output

array as follows:

Out[r + 4c] = S[r, c]   for 0 ≤ r < 4 and 0 ≤ c < Nb

    // Copy the state back to the output buffer in column-major order.
    for (i = 0; i < 16; i++)
    {
        x = i & 0x03;   // row index r
        y = i >> 2;     // column index c
        gpu_output[i] = state[4*x + y];
    }

Generalizing this indexing scheme to accommodate more threads requires some mechanism for identifying which thread is being executed. A new variable named idx is introduced, which holds the Global Id of each thread as queried from the OpenCL runtime (for example, via get_global_id(0)). Now, the input array will be copied to the state array as follows:

    // Each thread loads its own 16-byte block, offset by 16*idx.
    for (i = 0; i < 16; i++)
    {
        x = i & 0x03;   // row index r
        y = i >> 2;     // column index c
        state[4*x + y] = gpu_input[i + 16*idx];
    }

The net offset for each thread is 16*idx, as each thread handles 16 elements (Bytes) of the

input array. The same holds for writing the data to the output array after completion of

Encryption or Decryption Process:

    // Each thread stores its own 16-byte block, offset by 16*idx.
    for (i = 0; i < 16; i++)
    {
        x = i & 0x03;   // row index r
        y = i >> 2;     // column index c
        gpu_output[i + 16*idx] = state[4*x + y];
    }

Memory Optimizations

Here we will discuss the drawbacks in the basic implementation described above and

suggest improvements to overcome these.

In the basic implementation we have used only the Global memory available on the GPU. Remember, Global memory has the lowest bandwidth of the memory spaces available on the GPU, and accesses to it incur long latency. Another drawback is the heavy resource usage per thread, as all the calculations take place in the register file, which limits the number of parallel threads and degrades performance. The possible memory optimizations include the use of local and constant memory. The significant performance gains from these optimizations are included in the results; however, a detailed discussion is outside the scope of this article.

We have also investigated the key expansion on the GPU using local memory. In this

approach the original/unexpanded cipher key is passed directly to the GPU as a kernel

argument. As the local memory is consistent only across a single work-group [8], we

need to expand the key for each work-group separately. An array named keysched is

created in the local memory having size equal to the size of expanded key. The first

work-item (tid=0) of each work-group copies the cipher-key into the keysched and calls

the key expansion function. The expanded key in the keysched array is then accessible

by all the work-items in that work-group. A barrier is placed in the kernel after the key

expansion function. This is required to make sure that no thread proceeds with the

AES-Transformations before the key expansion is complete. The incurred overhead is small, as the key expansion is not compute intensive, and it is offset by the much faster access to the expanded key in local memory. Figure 10 below explains the key expansion process at the work-group level. Here tid is the local-id

of each work-item within a work-group. A condition (tid==0) is evaluated to direct only

the first thread of each work-group to the key expansion procedure. The rest of the

threads wait at the barrier until the first thread hits the barrier after key expansion is

complete. All the threads are now ready to proceed with the AES-Transformations.


Performance Results

Performance tests were carried out on two different machines, both running a 64-bit version of the Windows 7 operating system and AMD APP SDK v2.3 with OpenCL 1.1 support. The kernel execution times have been measured using the AMD APP Profiler v2.1. The hardware details for both systems are described below.

                TEST SYSTEM 1               TEST SYSTEM 2
CPU             Intel Core i7 930 2.80GHz   Intel Core i3 370M 2.40GHz
GPU             ATI Radeon HD 5870          ATI Mobility Radeon HD 5650
MEMORY (RAM)    8GB DDR3                    4GB DDR3

Due to the inherent parallelism in the AES algorithm, it shows better performance gains

for large data sizes, which are suited for bulk encryption. In the benchmarks, we validated this through performance results taken on various input sizes. The results also

show the impact of various optimization techniques applied to the standard implementation to further increase the performance, especially by reducing global memory calls

and moving more data to constant and local caches.

Figure 11 shows a performance comparison of various AES kernels. The graph has been


plotted with input size on the horizontal axis (Mega Bytes) and the kernel execution

time (milliseconds) for 256-Bit AES on the vertical axis.

Figure 12 shows performance comparison of various hardware for fully optimized AES

kernels. The benchmarks for Core i7 CPU are obtained using 8 threads in OpenCL.


Figure 13 displays the AES speedup chart. These speedups are measured against the Core i7 CPU running 8 OpenCL threads. We have achieved up to a 16x speedup at larger input sizes.

Conclusion

The results illustrated in this article demonstrate the viability of implementing the AES algorithm on AMD GPUs, which show considerable speedups compared to the current generation Intel processor or commodity graphics cards. We have obtained a speedup of up to 16 times with the ATI Radeon HD 5870 GPU, while the ATI Mobility Radeon HD 5650 GPU shows a speedup of up to 3 times.

References

[1] http://en.wikipedia.org/wiki/Encryption, viewed 20 March, 2011.

[2] Owen Harrison and John Waldron, AES Encryption Implementation and Analysis on Commodity Graphics Processing Units, 2007.

[3] http://en.wikipedia.org/wiki/Cipher, viewed 20 March, 2011.

[4] http://en.wikipedia.org/wiki/Block_cipher, viewed 20 March, 2011.

[5] http://en.wikipedia.org/wiki/Advanced_Encryption_Standard, viewed 20 March, 2011.

[6] Announcing the ADVANCED ENCRYPTION STANDARD (AES), Federal Information Processing Standards Publication 197, November 26, 2001.

[7] http://en.wikipedia.org/wiki/Modes_of_operation, viewed 20 March, 2011.

[8] ATI Stream Computing OpenCL Programming Guide, Ch-4 OpenCL performance and optimization, June 2010.

GLOSSARY

Forward Cipher: Series of transformations that converts plain-text to cipher-text using the Cipher Key.

Cipher Key: Secret, cryptographic key that is used by the Key Expansion routine to generate a set of Round Keys.

Cipher-Text: Data output from the Forward Cipher or input to the Inverse Cipher.

Inverse Cipher: Series of transformations that converts cipher-text to plain-text using the Cipher Key.

Plain-Text: The data to be encrypted; data input to the Forward Cipher or output from the Inverse Cipher.

Round Key: Round keys are values derived from the Cipher Key using the Key Expansion routine; they are applied to the State in the Forward and Inverse Cipher.

Host: A standard CPU device running the main operating system.

Compute Device: A GPU or CPU device that provides the processing power for OpenCL. In our case, a GPU device.

Host Code: A C/C++ program executing on the Host to set up the OpenCL resources and invoke the Kernel Code.

Kernel Code: The parallel executable code that is executed on the Compute Device, also called the Device Function.

Global Memory: The DRAM video memory available on graphics cards.

Local Memory: High-bandwidth on-chip memory available to each compute unit.
