Bulk Encryption on GPUs - AMD

11/16/14 10:55 PM

Abstract
Salman Ul Haq, Jawad Masood, Aamir Majeed, Usman Aziz
10/11/2011
This article covers the implementation and optimization of the Advanced Encryption
Standard (AES) on AMD GPUs using OpenCL, which is fine-tuned for bulk encryption applications. Reliable encryption schemes are needed to ensure the information
security of individuals, organizations and governments by protecting against potential
threats. One particular scheme is the AES algorithm-based bulk encryption technique,
which is based upon the Rijndael algorithm, a symmetric block cipher with 128-bit, 192-bit and 256-bit cipher keys. OpenCL also allows you to tap into the huge parallel processing power of GPUs for data parallel computing applications. This article begins by
exploring the AES algorithm, focusing on a parallel breakdown of the problem and explaining suitable indexing schemes. This is followed by GPU-specific optimization
strategies, such as using local memory, covering their relation to the memory bandwidth and computational intensity that is required. We finish the article by examining
the final benchmarks that signify the acceleration achieved using AMD GPUs.

Introduction
Information security is becoming increasingly important given the ever-increasing
number of new applications in the public and private domain. There is a continuing
trend to secure data in all of its uses, ranging from its live communication to archived
data storage. The unauthorized access to intercepted transmissions can result in the
compromise of sensitive and vital information. Data managers around the world are,
thus, facing an interesting dilemma: how to store data securely while still being able to
access it quickly. Encryption is an effective solution for protecting valuable data assets
against such attacks.

Encryption
Encryption is the process of transforming information referred to as plain-text into an
unintelligible code called cipher-text, using a secret key and an algorithm generally
referred to as the cipher [1]. The cipher-text (encrypted data) can be decoded back into its
original form using the same cipher algorithm and the secret key. In this process, critical
information can be protected from hackers, competitors and others who would use the
information for malicious intent.
Common uses for encryption technology are found in the static archiving of large
amounts of sensitive data, as well as its communication over the local area network
(LAN) or across an Internet gateway in the case of Wide Area Networks (WANs) or Virtual Private Networks (VPNs). Similar applications can also be abundantly found in the
telecommunications industry and other proprietary setups dealing with data protection
issues.

Bulk Encryption
Bulk encryption provides safe and effective methods for protecting data from
compromise and theft. This can be achieved through the secured storage and
transmission of bulk data.
Bulk encryption technology provides a method to encrypt large amounts of data during
transmission or storage. The amount of information that must be encrypted, however,
simultaneously leads to very large response times. Currently, the processing power requirements for bulk encryption are being met by hardware extensions in the form of
cryptographic accelerators [2]. There exists the potential to use the parallel processing
power of a GPU as a co-processor in a similar role that existing hardware cryptographic
solutions play.

Bulk Encryption Methods


Most modern encryption algorithms or ciphers can be categorized in one of the following ways:
1. Whether the same key is used for both encryption and decryption (symmetric key
algorithms), or if a different key is used for each (asymmetric key algorithms). The
use of a symmetric algorithm requires only a single key for both the encryption
and decryption process. Encryption schemes that do not involve the use of a key
are far less secure and subject to compromise. In fact, anyone who is in possession
of the decryption algorithm can decipher any transmission written with that particular algorithm.
2. Whether they work on blocks of symbols of a fixed size (block ciphers), or on a continuous stream of symbols (stream ciphers). A block cipher is a symmetric key cipher operating on fixed-length groups of bits, called blocks, with an unvarying
transformation. A block cipher encryption algorithm might take (for example) a
128-bit block of plain-text as input, and output a corresponding 128-bit block of
cipher-text. The exact transformation is controlled using a second input called the
secret key. Decryption is similar; the decryption algorithm takes, in this example, a
128-bit block of cipher-text together with the secret key, and yields the original 128-bit block of plain-text. A message longer than the block size (128 bits in the above
example) can still be encrypted with a block cipher by breaking the message into
blocks and encrypting each block individually. Since all pure block ciphers have
independent workloads, they are the ideal candidates for parallel implementation.

Advanced Encryption Standard


The Advanced Encryption Standard (AES) is a symmetric-key encryption standard adopted by the U.S. government and approved by the NSA for top secret information. The
standard was adopted from a larger collection originally published as Rijndael [5]. The
Rijndael cipher was developed by two Belgian cryptographers, Joan Daemen and Vincent Rijmen, and submitted by them to the AES selection process. AES is based on a
design principle known as a substitution permutation network. The standard comprises
three block ciphers: AES-128, AES-192 and AES-256. Each of these ciphers has a 128-bit
block size, with key sizes of 128, 192 and 256 bits, respectively. The AES ciphers have
been analyzed extensively and are now used worldwide, as was the case with their
predecessor, the Data Encryption Standard (DES) [5].
AES was selected due to the level of security it offers and its well documented implementation and optimization techniques [6]. Furthermore, AES is very efficient in terms
of both time and memory requirements. The block ciphers have high computation intensity and independent workloads (apply the same steps to different blocks of plain text),
so acceleration using a GPU is the next logical step.

AES Algorithm
In this section, we will provide a brief overview of the AES algorithm and the working
of its major constituent computations.
The AES block cipher operates on a 4×4 array of bytes (128 bits), termed the state. For
the AES algorithm, the size of the input block, the output block and the state is 128 bits.
This is represented by Nb = 4, which reflects the number of 32-bit words (number of columns) in the state array. The permissible lengths of the Cipher Key, K, are 128, 192, and
256 bits. The key length is represented by Nk = 4, 6, or 8, which reflects, again, the number of 32-bit words (number of columns) in the Cipher Key array [6].
The state is encrypted or decrypted by applying byte-oriented transformations for a
specific number of rounds. The number of rounds to be performed is dependent on the
key size. The number of rounds is represented by Nr, where Nr = 10 when Nk = 4, Nr =
12 when Nk = 6, and Nr = 14 when Nk = 8 [6].
The AES algorithm specifies both cipher and its inverse for the complete
encrypt-decrypt cycle. The Forward Cipher takes plain-text as input along with the
cipher-key and its output is the encrypted data or cipher-text. The Inverse Cipher takes
this cipher-text as input and decrypts it back to plain-text using the same cipher-key
used for encryption.
The AES algorithm consists of the following phases:
1. Key Expansion. Round keys are derived from the cipher key using Rijndael's key
schedule.
2. Initial Round. AddRoundKey: each byte of the state is combined with the
round key using a bit-wise XOR operation.
3. Middle Rounds. For round = 1 to Nr-1, repeatedly perform the following transformations:
   1. SubBytes: a non-linear substitution step where each byte is
   replaced with another according to a lookup table.
   2. ShiftRows: a transposition step where each row of the state is
   shifted cyclically a certain number of steps.
   3. MixColumns: a mixing operation which operates on the columns
   of the state, combining the four bytes in each column.
   4. AddRoundKey: same as described above.
4. Final Round (no MixColumns)
   1. SubBytes: same as described above.
   2. ShiftRows: same as described above.
   3. AddRoundKey: same as described above.

Figure 1: AES Forward Cipher Flow Graph

AES Transformations
For both the Forward and Inverse Cipher, the AES algorithm uses a round function that
is composed of four different byte-oriented transformations [6]:


1. SubBytes Transformation. The SubBytes transformation is a non-linear byte
substitution that operates independently on each byte of the state. In this step,
each byte of the state array is updated using an 8-bit substitution box, the Rijndael S-box. This operation provides the non-linearity in the cipher and helps
avoid attacks based upon algebraic manipulation. The S-box used is derived
by combining the multiplicative inverse over GF(2^8), known to have good
non-linearity properties, with an invertible affine transformation. The
complete S-box table is displayed below in Figure 2 [6].

Figure 2: AES S-BOX


2. AddRoundKey Transformation. In the AddRoundKey transformation, a
Round Key is added to the state by a simple bitwise XOR operation. A Round Key is derived for each round from the cipher key using Rijndael's key
schedule. Each Round Key is the same size as the state. AddRoundKey can be
mathematically represented as follows [6]:

$$[s'_{0,c},\, s'_{1,c},\, s'_{2,c},\, s'_{3,c}] = [s_{0,c},\, s_{1,c},\, s_{2,c},\, s_{3,c}] \oplus [w_{round \cdot Nb + c}] \quad \text{for } 0 \le c < Nb$$

where [wi] are the Expanded key words, and round is a value in the range 0 ≤ round
≤ Nr. In the Cipher, the initial Round Key addition occurs when round = 0,
prior to the first application of the round function.
3. ShiftRows Transformation. The ShiftRows transformation operates on the
rows of the state and cyclically shifts the bytes in each row by a different
offset (number of bytes). For AES, the first row is left unchanged.
Each byte of the second row is shifted one byte to the left. Similarly, the third
and fourth rows are shifted by offsets of two and three bytes respectively. In
this way, each column of the output state of the ShiftRows step is composed
of bytes from each column of the input state. Specifically, the ShiftRows transformation proceeds as follows [6]:

$$s'_{r,c} = s_{r,\,(c + shift(r, Nb)) \bmod Nb} \quad \text{for } 0 < r < 4 \text{ and } 0 \le c < Nb$$

where the shift value shift(r,Nb) depends on the row number, r, as follows:

shift(1,4) = 1; shift(2,4) = 2; shift(3,4) = 3.

4. MixColumns Transformation. In the MixColumns transformation, the four bytes of
each column of the state are combined using an invertible linear transformation.
Each column is treated as a polynomial over GF(2^8) and is multiplied by the coefficient polynomial c(x) = 3x^3 + x^2 + x + 2 (modulo x^4 + 1). The coefficients are displayed
in their hexadecimal equivalent of the binary representation of bit polynomials
from GF(2^8). This transformation, in conjunction with the ShiftRows
transformation, provides diffusion in the original message, spreading out any non-uniform patterns.
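
The S-box construction described under the SubBytes transformation (multiplicative inverse in GF(2^8) followed by an affine transformation) can be sketched in a few lines of C. This is a minimal illustration rather than the table-based code used in the kernels; the helper names gf_mul, gf_inv and aes_sbox_entry are hypothetical, and the inverse is computed by plain repeated multiplication for clarity, not speed.

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo the AES polynomial
   x^8 + x^4 + x^3 + x + 1 (reduction constant 0x1b). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry)
            a ^= 0x1b;
        b >>= 1;
    }
    return p;
}

/* Multiplicative inverse: a^254 equals a^-1 because a^255 = 1 for every
   non-zero a. The value 0 has no inverse and maps to 0 by convention. */
static uint8_t gf_inv(uint8_t a)
{
    uint8_t r = 1;
    for (int i = 0; i < 254; i++)
        r = gf_mul(r, a);
    return r;
}

/* Rotate an 8-bit value left by n bits (0 < n < 8). */
static uint8_t rotl8(uint8_t v, int n)
{
    return (uint8_t)((v << n) | (v >> (8 - n)));
}

/* One S-box entry: inverse, then the affine transformation with
   constant 0x63. (Hypothetical helper name.) */
uint8_t aes_sbox_entry(uint8_t x)
{
    uint8_t b = gf_inv(x);
    return (uint8_t)(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63);
}
```

Calling aes_sbox_entry for every value from 0 to 255 would fill a table equivalent to the one in Figure 2; in practice the table is precomputed once and only looked up at run time.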
The AES Transformations discussed above are the Forward Transformations used by
the Forward Cipher for encryption of the plain-text. The Inverse Cipher uses the Inverse
Transformations for decryption of cipher-text back to plain-text. Only the AddRoundKey
Transformation is the same for both the Forward and Inverse Cipher, because XOR is its own inverse. For the Inverse Cipher, the SubBytes Transformation is replaced by the InvSubBytes Transformation, which takes the substitution values from the Inverse S-box
table; ShiftRows is replaced by InvShiftRows, which shifts the rows cyclically to the right; and MixColumns is replaced by InvMixColumns, which multiplies each column by the inverse of the coefficient polynomial [6].
Key expansion takes the input key (cipher key) of 128, 192 or 256 bits and produces an
expanded key for use in the rounds of subsequent stages. The expanded key's size is related to the number of rounds to be performed, since one 128-bit Round Key is needed for the initial round and for each round after it. For 128-bit keys, there are 10 rounds
and the expanded key size is 1408 bits. For 192 and 256-bit keys, the number of rounds
increases to 12 and 14 respectively, with an overall expanded key size of 1664
and 1920 bits. During each round, a different portion of the expanded key is used in the
AddRoundKey step.
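
The round counts and expanded key sizes quoted above follow from two small formulas: Nr = Nk + 6 (which matches the Nk/Nr pairs listed earlier) and an expanded key of Nb * (Nr + 1) 32-bit words. The sketch below simply evaluates them; the function names are hypothetical and chosen for illustration.

```c
/* Number of rounds for a cipher key of key_bits bits (128, 192 or 256).
   Nk = key_bits / 32 and Nr = Nk + 6. */
int aes_rounds(int key_bits)
{
    return key_bits / 32 + 6;
}

/* Expanded key size in bits: Nb * (Nr + 1) words of 32 bits, with Nb = 4.
   One 128-bit Round Key for the initial round plus one per round. */
int aes_expanded_key_bits(int key_bits)
{
    const int nb = 4;
    return nb * (aes_rounds(key_bits) + 1) * 32;
}
```

For a 128-bit key this gives 4 * 11 * 32 = 1408 bits, in agreement with the figure stated in the text.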

Modes of Operation
A block cipher by itself allows encryption of a single data block of size equal to the
cipher's block size. Modes of operation enable the repeated and secure use of a block
cipher, on multiple data blocks, under a single key [7]. When targeting a variable-length
message, the data must first be partitioned into separate cipher blocks. Typically, the
last block must also be extended to match the cipher's block length using a suitable
padding scheme. A mode of operation describes the process of encrypting each of these
data blocks, and generally uses randomization based on an additional input value, often
called an initialization vector.
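
The article does not say which padding scheme was used. As one common possibility, the sketch below applies PKCS#7 padding, where every padding byte holds the number of bytes added; the function name pkcs7_pad and the constant AES_BLOCK_BYTES are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define AES_BLOCK_BYTES 16

/* Copy msg into out and append PKCS#7 padding so that the result is a
   whole number of 16-byte blocks. out must have room for len + 16 bytes.
   Returns the padded length. (Assumed scheme, not taken from the article.) */
size_t pkcs7_pad(const uint8_t *msg, size_t len, uint8_t *out)
{
    /* Always 1..16: a message that already fills its last block
       receives one extra block of padding. */
    size_t pad = AES_BLOCK_BYTES - (len % AES_BLOCK_BYTES);

    memcpy(out, msg, len);
    memset(out + len, (int)pad, pad);
    return len + pad;
}
```

Because the padded length is always a multiple of 16, it can be divided by 16 to obtain the number of independent blocks handed to the cipher.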
There are different modes under which encryption can take place, where some modes
are inherently more secure and some lend themselves more to parallelism. The following table lists various modes of operation along with their inherent level of parallelism
[7]. For details on the modes of operation, see the References section [7].
Mode of Operation              Parallelism
Electronic codebook (ECB)      High
Counter (CTR)                  High
Cipher-block chaining (CBC)    Low
Cipher feedback (CFB)          Low
Output feedback (OFB)          Low

The ECB mode is the most parallel of these. The message is
divided into blocks, each block is encrypted with the same key, and there is no
serial dependence between the blocks. The advantage of ECB mode is the extensive parallelism, which scales well to the GPU architecture. The disadvantage of this method is
that identical plain-text blocks are encrypted into identical cipher-text blocks; thus, it
does not hide data patterns well and the large scale structures in the plain-text are preserved [7].
In the Counter (CTR) mode, the large scale structures that may have been present in the
original plain-text are diminished: each block is combined with the encryption of a distinct counter value, so the cipher-text blocks obtained by encrypting
two identical plain-text blocks using CTR mode are different. This provides
a better security level as compared to the ECB mode [7]. We have implemented the ECB
mode of operation, which is not only parallel but can also be easily extended to CTR.
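
To illustrate why an ECB implementation extends naturally to CTR, the sketch below runs both modes over the same per-block loop. The block function toy_block_encrypt is a deliberately trivial stand-in, NOT AES and not secure; it exists only so the example is self-contained. The 8-byte nonce plus 8-byte big-endian block index layout of the counter is also an assumption for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Stand-in for the AES forward cipher. NOT AES and NOT secure: it only
   gives the two modes a keyed, deterministic block function to call. */
static void toy_block_encrypt(const uint8_t in[BLOCK], const uint8_t key[BLOCK],
                              uint8_t out[BLOCK])
{
    for (int i = 0; i < BLOCK; i++)
        out[i] = (uint8_t)((in[i] ^ key[i]) + 7 * i + 1);
}

/* ECB: every block goes through the block function on its own.
   Each iteration of the loop corresponds to one GPU work-item. */
void ecb_encrypt(const uint8_t *pt, uint8_t *ct, size_t nblocks,
                 const uint8_t key[BLOCK])
{
    for (size_t b = 0; b < nblocks; b++)
        toy_block_encrypt(pt + BLOCK * b, key, ct + BLOCK * b);
}

/* CTR: block b encrypts the counter (nonce, b) and XORs the result into
   the data. Iterations are still independent, and applying the function
   a second time with the same key and nonce decrypts. */
void ctr_crypt(const uint8_t *in, uint8_t *out, size_t nblocks,
               const uint8_t key[BLOCK], const uint8_t nonce[8])
{
    for (size_t b = 0; b < nblocks; b++) {
        uint8_t counter[BLOCK], keystream[BLOCK];

        memcpy(counter, nonce, 8);
        for (int i = 0; i < 8; i++)   /* big-endian block index */
            counter[8 + i] = (uint8_t)((uint64_t)b >> (8 * (7 - i)));

        toy_block_encrypt(counter, key, keystream);
        for (int i = 0; i < BLOCK; i++)
            out[BLOCK * b + i] = in[BLOCK * b + i] ^ keystream[i];
    }
}
```

With two identical plain-text blocks, ecb_encrypt produces two identical cipher-text blocks, while ctr_crypt produces different ones because the counters differ. Neither loop carries state from one block to the next, which is the property the GPU implementation relies on.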

Exploiting Parallelism in AES


GPUs are massively parallel devices. The SPMD (single program, multiple data) architecture allows GPUs to perform
the same operation on multiple data sets simultaneously. AES is based on a block cipher
algorithm that operates on 128-bit data chunks independently. This provides a high level of data parallelism, as the same operation is performed on each state block with no
dependencies in between (provided that AES is being used in the ECB mode). So,
multiple blocks can be encrypted simultaneously, which yields the large number of parallel threads that
GPUs need to run efficiently. As the data size increases so does the level of parallelism, thus making
GPUs more efficient for bulk encryption. The only serial operation in AES is the key expansion, which provides the round keys to be used in subsequent rounds. However,
keeping in view its serial nature and the fact that it is just a one-time operation, key expansion can be safely moved to the CPU Host Code for better performance. The figure explaining the parallel nature of AES is included in the Design Approach section.

OpenCL Implementation of AES


Now that we have a fair understanding of the AES encryption algorithm and its
different ingredients, such as key-expansion and AES-Transformations, we are ready to
start an implementation using OpenCL specific to an AMD GPU. In the remaining
part of this article, we will discuss our design approach in detail, including an efficient
indexing scheme for handling input and output data, and device functions for various
AES-Transformations. Later on, we will investigate memory optimization strategies,
such as using local memory, constant cache and coalesced memory accesses along with
their impact on throughput performance and memory bandwidth.

Design Approach
In this section we will discuss the level of parallelism we want to exploit, the portion of
code that should be ported to GPU, and the Host-Device work division.
In the current approach, we will exploit parallelism only on the block level without
changing the original algorithm. (Breaking the algorithm down further can optimize the results, though that discussion is outside the scope of this article.) Each work-item will
take one state block as input and convert it to cipher-text. This implies that the Global
Work-Size is directly proportional to the exploited parallelism. Encryption of a single
128-bit state block will remain serial; however, we will use loop unrolling to optimize
the code. Another serial operation in AES is the key expansion, which provides the
round keys to be used in subsequent rounds. However, keeping in view its serial nature
and the fact that it is just a one-time operation, key expansion can be safely moved to
the CPU Host Code for better performance. Figure 3 below explains the parallelism in
AES as well as our design approach.

Figure 3: Parallel AES Breakdown

Algorithm Phases & GPU Kernels


In this section we will discuss the device functions that implement the AES-Transformations in OpenCL. We will also list the sample codes for implementing each of these
transformations in OpenCL.
In the SubBytes transformation, each state value is replaced by the S-box entry whose index equals that state value. For example, if S[1,1] = {53}, then the substitution value is found at the intersection of the S-box row with index 5 and the column with index 3, which is {ed} [6]. This process is explained in Figure 4 below.

Figure 4: AES SubBytes Transformation


A sample code for SubBytes transformation is listed next:
for(i = 0; i < 16; i++)
{
    x = i & 0x03;   /* row index */
    y = i >> 2;     /* column index */
    state[4*x + y] = gpu_AES_sbox[state[4*x + y]];   /* byte value selects the S-box entry */
}

In the ShiftRows transformation, the bytes in the last three rows of the state are cyclically shifted over different numbers of bytes. The first row, r = 0, is not shifted. Each byte
of the second row is shifted one byte to the left. Similarly, the third and fourth rows are
shifted by offsets of two and three bytes respectively. This has the effect of moving bytes
to lower positions in the row, while the lowest bytes wrap around into the top of
the row. Figure 5 below explains the ShiftRows transformation. Here S represents the
state array and S' is the n_state array.
The code for the ShiftRows transformation is listed below. It
uses both the n_state and the state buffers, as it is not an in-place transform:
for(i = 0; i < 16; i++)
{
    x = i & 0x03;   /* row index, also the shift offset for that row */
    y = i >> 2;     /* column index */
    n_state[4*x + y] = state[4*x + ((y+x) & 0x03)];
}

Figure 5: AES ShiftRows Transformation


In the AddRoundKey transformation, a Round Key is added to the state by a simple bitwise
XOR operation. Each Round Key consists of Nb
words from the expanded key obtained from the Key Expansion function, described
earlier. Figure 6 depicts the AddRoundKey transformation.
The sample code for the AddRoundKey transformation is listed next:
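for(i = 0; i < 16; i++)
{
    x = i & 0x03;   /* row index: selects the byte within the key word */
    y = i >> 2;     /* column index: selects the 32-bit key word */
    state[4*x + y] = state[4*x + y] ^
        ((keysched[y] & (0xffu << (x*8))) >> (x*8));
}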

Figure 6: AES AddRoundKey Transformation


In the MixColumns transformation, each column is treated as a polynomial over GF(2^8)
and is multiplied modulo x^4 + 1 by the coefficient polynomial a(x) [6] shown here:

$$a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$$

The MixColumns transformation updates each column of the state
using a matrix multiplication, as explained by the following equation [6]:

$$\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix} \quad \text{for } 0 \le c < Nb$$

Figure 7 below explains the MixColumns transformation.

Figure 7: AES MixColumns Transformation
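
No listing for MixColumns is given alongside the other transformations, so the following is a minimal sketch of how it could look, assuming the same state[4*row + col] layout used by the other listings. The function names xtime and mix_columns are hypothetical. Multiplication by {02} is a left shift with a conditional reduction, and multiplication by {03} is that result XORed with the original byte.

```c
#include <stdint.h>

/* Multiply by {02} in GF(2^8): shift left and reduce with 0x1b when the
   top bit was set. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* MixColumns on a state stored as state[4*row + col]. Each column is
   multiplied by the fixed matrix with rows (02 03 01 01), (01 02 03 01),
   (01 01 02 03) and (03 01 01 02). */
void mix_columns(uint8_t state[16])
{
    for (int c = 0; c < 4; c++) {
        uint8_t s0 = state[c];
        uint8_t s1 = state[4 + c];
        uint8_t s2 = state[8 + c];
        uint8_t s3 = state[12 + c];

        state[c]      = (uint8_t)(xtime(s0) ^ (xtime(s1) ^ s1) ^ s2 ^ s3);
        state[4 + c]  = (uint8_t)(s0 ^ xtime(s1) ^ (xtime(s2) ^ s2) ^ s3);
        state[8 + c]  = (uint8_t)(s0 ^ s1 ^ xtime(s2) ^ (xtime(s3) ^ s3));
        state[12 + c] = (uint8_t)((xtime(s0) ^ s0) ^ s1 ^ s2 ^ xtime(s3));
    }
}
```

The four input bytes of a column are read into temporaries first, so the update can be done in place without a second state buffer.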

Kernel Execution Stages


By now we have explained how the various AES transformations are implemented in
OpenCL, and we are ready to discuss the kernel code for both the Forward and Inverse
AES Ciphers.
Forward Cipher
We explain the working of our kernel by considering the simplest case, where we have a
single work-item operating on a 128-bit state block. The kernel arguments are the
input and output buffers, the AES fixed table buffers and the expanded key buffer. In addition, the
key-length parameter, also passed to the kernel as an argument, adds the flexibility to use all three allowed key sizes: 128, 192 and
256 bits. The number of rounds to
be performed is calculated based upon the key length. The 128-bit state is copied from the global
plain-text buffer into registers for computing. Two blocks of state size are created in
the register file, as not all the AES transformations can be performed in place. The input
is copied to the state block in registers using a special access pattern to allow coalescing
(more on this later). The forward AES transformations are applied to the state block as
described by the AES flow graph. The resulting cipher-text block is copied back to the
global cipher-text buffer using the same indexing scheme that was followed while copying plain-text to the state.
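
As a rough end-to-end illustration of the flow just described, the following plain C sketch encrypts one block with a 128-bit key. It is not the authors' kernel: it fixes the key size at 128 bits, builds the S-box at start-up instead of receiving it as a table buffer, and stores the expanded key as bytes rather than 32-bit words. All function names are hypothetical. The state layout state[4*row + col] and the two buffers (state, n_state) follow the article.

```c
#include <stdint.h>
#include <string.h>

static uint8_t sbox[256];

/* Multiply by {02} in GF(2^8). */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* General GF(2^8) multiply, used only to build the S-box. */
static uint8_t gmul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (; b; b >>= 1, a = xtime(a))
        if (b & 1)
            p ^= a;
    return p;
}

/* Build the S-box: multiplicative inverse, then the affine transform. */
void aes_init_sbox(void)
{
    for (int x = 0; x < 256; x++) {
        uint8_t inv = 0;
        for (int y = 1; y < 256; y++)
            if (gmul((uint8_t)x, (uint8_t)y) == 1)
                inv = (uint8_t)y;
        uint8_t s = inv;
        for (int k = 1; k <= 4; k++)
            s ^= (uint8_t)((inv << k) | (inv >> (8 - k)));
        sbox[x] = s ^ 0x63;
    }
}

/* Host side: expand a 128-bit key into 11 Round Keys (176 bytes). */
void aes128_expand_key(const uint8_t key[16], uint8_t rk[176])
{
    uint8_t rcon = 0x01;
    memcpy(rk, key, 16);
    for (int i = 16; i < 176; i += 4) {
        uint8_t t[4] = { rk[i - 4], rk[i - 3], rk[i - 2], rk[i - 1] };
        if (i % 16 == 0) {              /* first word of each Round Key */
            uint8_t t0 = t[0];          /* RotWord, SubWord, then Rcon */
            t[0] = sbox[t[1]] ^ rcon;
            t[1] = sbox[t[2]];
            t[2] = sbox[t[3]];
            t[3] = sbox[t0];
            rcon = xtime(rcon);
        }
        for (int j = 0; j < 4; j++)
            rk[i + j] = rk[i - 16 + j] ^ t[j];
    }
}

/* Device side (the work of one work-item): encrypt one 16-byte block. */
void aes128_encrypt_block(const uint8_t in[16], const uint8_t rk[176], uint8_t out[16])
{
    uint8_t state[16], n_state[16];
    for (int i = 0; i < 16; i++)        /* load and initial AddRoundKey */
        state[4 * (i & 3) + (i >> 2)] = in[i] ^ rk[i];
    for (int round = 1; round <= 10; round++) {
        for (int i = 0; i < 16; i++) {  /* SubBytes and ShiftRows */
            int x = i & 3, y = i >> 2;
            n_state[4 * x + y] = sbox[state[4 * x + ((y + x) & 3)]];
        }
        for (int c = 0; c < 4 && round < 10; c++) {   /* MixColumns, skipped in the final round */
            uint8_t s0 = n_state[c], s1 = n_state[4 + c], s2 = n_state[8 + c], s3 = n_state[12 + c];
            n_state[c]      = xtime(s0) ^ xtime(s1) ^ s1 ^ s2 ^ s3;
            n_state[4 + c]  = s0 ^ xtime(s1) ^ xtime(s2) ^ s2 ^ s3;
            n_state[8 + c]  = s0 ^ s1 ^ xtime(s2) ^ xtime(s3) ^ s3;
            n_state[12 + c] = xtime(s0) ^ s0 ^ s1 ^ s2 ^ xtime(s3);
        }
        for (int i = 0; i < 16; i++)    /* AddRoundKey */
            state[4 * (i & 3) + (i >> 2)] = n_state[4 * (i & 3) + (i >> 2)] ^ rk[16 * round + i];
    }
    for (int i = 0; i < 16; i++)        /* store with the same indexing */
        out[i] = state[4 * (i & 3) + (i >> 2)];
}
```

A caller would run aes_init_sbox and aes128_expand_key once on the host, then call aes128_encrypt_block for each 16-byte block; on the GPU each such call corresponds to one work-item.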
Inverse Cipher
In this section we will discuss the major changes required to convert the Forward Cipher into the Inverse Cipher for the decryption process. The Inverse Cipher essentially runs
the Forward Cipher in reverse order. The AES transformations
used in the Inverse Cipher are the inverse versions of the previously discussed forward transforms.
The Inverse Cipher incorporates minor changes in the transformations, the order of execution and the required AES tables. For example, the InvSubBytes transform, which is
the inverse of the SubBytes transform, requires the Inverse S-box table instead of the S-box
table. The code for InvSubBytes is shown here:
for(i = 0; i < 16; i++)
{
    x = i & 0x03;   /* row index */
    y = i >> 2;     /* column index */
    state[4*x + y] = gpu_AES_isbox[state[4*x + y]];   /* inverse S-box lookup */
}

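Of the other transformations, AddRoundKey remains the same, because XOR is its own inverse. ShiftRows
and MixColumns are replaced by InvShiftRows, which shifts the rows cyclically to the right, and InvMixColumns, which multiplies each column by the inverse of the coefficient polynomial [6]. The order in which these transforms are applied is also different from the Forward
Cipher. Figure 8 displays the flow-graph for the Inverse Cipher.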

Figure 8: AES Inverse Cipher Flow-Graph
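
In FIPS-197 [6] the Inverse Cipher undoes ShiftRows with InvShiftRows, which rotates each row to the right by its row index. The sketch below shows both directions using the article's state[4*row + col] layout and its two-buffer convention; the function names are hypothetical, and the pair is included only to show that one undoes the other.

```c
#include <stdint.h>

/* Forward ShiftRows: row x is rotated left by x positions. */
void shift_rows(const uint8_t state[16], uint8_t n_state[16])
{
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;   /* row index */
        int y = i >> 2;     /* column index */
        n_state[4 * x + y] = state[4 * x + ((y + x) & 0x03)];
    }
}

/* InvShiftRows: row x is rotated right by x positions. Adding 4 before
   subtracting keeps the index non-negative before masking. */
void inv_shift_rows(const uint8_t state[16], uint8_t n_state[16])
{
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;
        int y = i >> 2;
        n_state[4 * x + y] = state[4 * x + ((y + 4 - x) & 0x03)];
    }
}
```

Applying shift_rows and then inv_shift_rows returns the original state, which is the property the decryption path depends on.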

Indexing Schemes
Now we will examine the input and output indexing schemes for a simple AES kernel
with a single work-item in detail. We will then explain what needs to be added to run
the kernel with multiple work-groups and larger work-group sizes. The described indexing scheme applies to both the Forward and Inverse Cipher.


At the beginning of the Forward or the Inverse Cipher, the input array is copied to the
state array according to the following convention [6]:
S[r, c] = In[r + 4c]    for 0 ≤ r < 4 and 0 ≤ c < Nb

where Nb = 4 for our case. This is the column major access pattern as depicted in Figure
9 below. The code for column major access pattern for input is listed here:
for(i = 0; i < 16; i++)
{
    x = i & 0x03;   /* row index r */
    y = i >> 2;     /* column index c */
    state[4*x + y] = gpu_input[i];
}

Figure 9: AES Data Patterns


At the end of the Forward or the Inverse Cipher, the state array is copied to the output
array as follows:
Out[r + 4c] = S[r, c]    for 0 ≤ r < 4 and 0 ≤ c < Nb

The code for the output indexing pattern is listed here:
for(i = 0; i < 16; i++)
{
    x = i & 0x03;   /* row index r */
    y = i >> 2;     /* column index c */
    gpu_output[i] = state[4*x + y];
}

Generalizing this indexing scheme to accommodate more threads requires some mechanism for identifying which thread is being executed. A new variable named idx is introduced, which holds the Global Id of the thread as returned by the OpenCL runtime. Now, the
input array will be copied to the state array as follows:
for(i = 0; i < 16; i++)
{
    x = i & 0x03;
    y = i >> 2;
    state[4*x + y] = gpu_input[i + 16*idx];   /* 16*idx is this thread's byte offset */
}

The net offset for each thread is 16*idx, as each thread handles 16 elements (Bytes) of the
input array. The same holds for writing the data to the output array after completion of
Encryption or Decryption Process:
for(i = 0; i < 16; i++)
{
    x = i & 0x03;
    y = i >> 2;
    gpu_output[i + 16*idx] = state[4*x + y];
}
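
The relationship between the input size, the global work size and the per-thread offset can be checked on the host with a small stand-in for the NDRange. In the sketch below the loop over idx plays the role of the work-items (in the kernel, idx would come from get_global_id(0)); the function names are hypothetical, and the AES rounds themselves are left out so that only the indexing is exercised.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of work-items needed: one per 16-byte block. */
size_t aes_block_count(size_t input_bytes)
{
    return input_bytes / 16;
}

/* Copy block idx of the input into a state laid out as state[4*row + col]. */
void load_state(const uint8_t *gpu_input, size_t idx, uint8_t state[16])
{
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;   /* row */
        int y = i >> 2;     /* column */
        state[4 * x + y] = gpu_input[i + 16 * idx];
    }
}

/* Copy the state back to block idx of the output with the same indexing. */
void store_state(const uint8_t state[16], size_t idx, uint8_t *gpu_output)
{
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;
        int y = i >> 2;
        gpu_output[i + 16 * idx] = state[4 * x + y];
    }
}

/* Host-side stand-in for launching nblocks work-items. */
void copy_all_blocks(const uint8_t *in, uint8_t *out, size_t nblocks)
{
    for (size_t idx = 0; idx < nblocks; idx++) {
        uint8_t state[16];
        load_state(in, idx, state);
        /* the AES transformations would be applied to state here */
        store_state(state, idx, out);
    }
}
```

Since loading and storing use the same mapping, running copy_all_blocks without any transformation must reproduce the input exactly, which makes it a convenient sanity check before the cipher rounds are added.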

Memory Optimizations
Here we will discuss the drawbacks in the basic implementation described above and
suggest improvements to overcome these.
In the basic implementation we have used only the Global memory available on the
GPU. Remember, Global memory has the lowest bandwidth and the longest access latency of the
memory spaces available on the GPU. Another drawback is the heavy resource usage per thread, as
all the calculations take place in the register file, thus limiting the number of parallel
threads and degrading performance. The possible memory optimizations include the
use of local and constant memory. The performance gains obtained
with these optimizations are included in the results; however, a detailed discussion is outside the
scope of this article.

Key Expansion using Local Memory


We have also investigated key expansion on the GPU using local memory. In this
approach the original, unexpanded cipher key is passed directly to the GPU as a kernel
argument. As local memory is consistent only across a single work-group [8], we
need to expand the key for each work-group separately. An array named keysched is
created in local memory, with size equal to the size of the expanded key. The first
work-item (tid == 0) of each work-group copies the cipher key into keysched and calls
the key expansion function. The expanded key in the keysched array is then accessible
to all the work-items in that work-group. A barrier is placed in the kernel after the key
expansion function. This is required to make sure that no thread proceeds with the
AES transformations before the key expansion is complete. The incurred overhead is
small, as the key expansion is not compute intensive; it is also
offset by much faster access to the expanded key in local memory. Figure 10 below explains the key expansion process at the work-group level. Here tid is the local id
of each work-item within a work-group. The condition (tid == 0) is evaluated to direct only
the first thread of each work-group to the key expansion procedure. The rest of the
threads wait at the barrier until the first thread reaches it after key expansion is
complete. All the threads are then ready to proceed with the AES transformations.

Figure 10: Key Expansion using Local Memory

Performance Results
Performance tests were carried out on two different machines, both running a 64-bit
version of the Windows 7 operating system and AMD APP SDK v2.3 with OpenCL 1.1
support. The kernel execution times have been measured using the AMD APP Profiler
v2.1. The hardware details for both systems are described below.

SPECIFICATIONS   TEST SYSTEM 1               TEST SYSTEM 2
CPU              Intel Core i7 930 2.80GHz   Intel Core i3 370M 2.40GHz
GPU              ATI Radeon HD 5870          ATI Mobility Radeon HD 5650
MEMORY (RAM)     8GB DDR3                    4GB DDR3

Due to the inherent parallelism in the AES algorithm, it shows better performance gains
for large data sizes, which are suited for bulk encryption. In the benchmarks, we validated this through performance results taken on various input sizes. The results also
show the impact of various optimization techniques applied to the standard implementation to further increase the performance, especially by reducing global memory calls
and moving more data to constant and local caches.
Figure 11 shows a performance comparison of various AES kernels. The graph has been
plotted with input size on the horizontal axis (megabytes) and the kernel execution
time (milliseconds) for 256-bit AES on the vertical axis.

Figure 11: Performance comparison of AES kernels.


Figure 12 shows a performance comparison of various hardware for the fully optimized AES
kernels. The benchmarks for the Core i7 CPU were obtained using 8 threads in OpenCL.

Figure 12: Hardware for AES kernels


Figure 13 displays the AES speedup chart. These speedups are measured against the Core i7 CPU
running 8 threads in OpenCL. We have achieved up to a 16x speedup at larger input
sizes.

Figure 13: AES speedup chart

Conclusion
The results illustrated in this article demonstrate the viability of implementing the AES algorithm on AMD GPUs, which show considerable speedups compared to the current
generation Intel processor or commodity graphics cards. We have obtained a speedup
of up to 16 times with the ATI Radeon HD 5870 GPU, while the ATI Mobility
Radeon HD 5650 GPU shows a speedup of up to 3 times.

References
[1] http://en.wikipedia.org/wiki/Encryption viewed 20 March, 2011.
[2] AES Encryption Implementation and Analysis on Commodity Graphics Processing
Units, Owen Harrison and John Waldron, 2007.
[3] http://en.wikipedia.org/wiki/Cipher viewed 20 March, 2011.
[4] http://en.wikipedia.org/wiki/Block_cipher viewed 20 March, 2011.
[5] http://en.wikipedia.org/wiki/Advanced_Encryption_Standard viewed 20 March, 2011.
[6] Announcing the ADVANCED ENCRYPTION STANDARD (AES), Federal Information Processing Standards Publication 197, November 26, 2001.
[7] http://en.wikipedia.org/wiki/Modes_of_operation viewed 20 March, 2011.
[8] ATI Stream Computing OpenCL Programming Guide, Ch-4 OpenCL performance and optimization, June 2010.

GLOSSARY
Forward Cipher: Series of transformations that converts plain-text to cipher-text using
the Cipher Key.
Cipher Key: Secret cryptographic key that is used by the Key Expansion routine to generate a set of Round Keys.
Cipher-Text: Data output from the Cipher or input to the Inverse Cipher.
Inverse Cipher: Series of transformations that converts cipher-text to plain-text using the
Cipher Key.
Plain-Text: The data to be encrypted; data input to the Forward Cipher or output from
the Inverse Cipher.
Round Key: Values derived from the Cipher Key using the Key Expansion routine; they are applied to the State in the Forward and Inverse Cipher.
Host: A standard CPU device running the main operating system.
Compute Device: A GPU or CPU device that provides the processing power for
OpenCL. In our case, a GPU device.
Host Code: A C/C++ program executing on the Host to set up the OpenCL resources
and invoke the Kernel Code.
Kernel Code: The parallel executable code which is executed on the Compute Device,
also called a Device Function.
Global Memory: The DRAM video memory available on graphics cards.
Local Memory: High-bandwidth on-chip memory available to each compute unit.