Description and porting of 128-bit AES encryption to GPUs

Written by Damien Dubuc, on 06 February 2018

The continuation of these posts revolves around an application of data encryption using AES (Advanced Encryption Standard).

The reasons for this choice are based on its relevance - highly utilized in its field - on the natural but non-trivial parallelism that characterizes it, and finally, on the fact that it solely relies on logical and integer operations.

Regarding the latter point, this choice is based on the fact that the performance of current FPGA floating-point operations is a step below that of other architectures. While Altera continues to improve its floating-point calculation IPs with each FPGA generation, it is currently preferable to rely on logical and integer operations to get a more accurate idea of the differences between these two architectures and consider performance comparisons.

What is AES 128-bit?

Data encryption by AES has become the new standard for data encryption since the early 2000s; it is currently the most widely used and secure. It is based on knowledge of a unique key, used for both encryption and decryption of data, and consists of a series of logical and integer operations to be applied to the entire text.

In its basic version, the algorithm takes as input a 128-bit block (16 bytes) and a 128-bit key (other key sizes are possible and slightly alter the implementation). To encrypt a message of a given length, it must be divided into packets of 128 bits (called "states"), on which identical encryption will be performed: the application thus exhibits a natural parallelism, with the granularity being the "state," containing 16 characters.

It is this version of the algorithm that we are using. Since encryption and decryption are symmetrical, we focus solely on the first phase, described below.

Encrypting a state (a set of 16 characters, 128 bits in size here) boils down to applying a set of 4 operations, repeated over 10 encryption rounds. These four operations are named subbytes, addroundkey, shiftrows, and mixcolumns, described below, shamelessly taken from the pretty wikipedia page.

The pseudocode is as follows:

key :initial key (16 characters)

roundkeys: extended key, for the 10 rounds

sbox: Rindjael substitution box required by AES (lookup table, 256 characters)

state: working array (4x4), initially containing the 16 characters to be encrypted.

new_state: another working array (4x4), will contain the 16 encrypted characters at the end.

subbytes substitutes all characters of the state with others (in-place), through the sbox lookup table, which is part of the AES method: it is predetermined.

shiftrows shifts the elements of row n of the state to the left by n positions (the first row being row 0).

mixcolumns is the equivalent of a 4-size dot product for each character, which is therefore replaced by a linear combination (predefined, denoted c(x) in the diagram) of itself and the other characters in its column.

addroundkey applies a different part of the extended key (generated from the base key) at each round, simply through an XOR operation, to each character.

The set of data dependencies through the different steps, as well as the position of the characters in the state during the sequence of operations, is given by the following figure.

It essentially shows the need for the state to be fully up-to-date before moving from one operation to another, and thus the inability to return to a finer granularity of parallelism: mixcolumns, in particular, requires 4 elements updated per output, but from one round to another, these 4 elements occupy different positions and themselves depend on other elements of the preceding mixcolumns.

Implementation on GPU

This application is very naturally implemented on a GPU, where threads / workitems are executed in a scheduled manner using SIMT instructions, by half-warps of 16 threads. Since the instructions of the 16 threads each processing a character of a given state are identical and executed simultaneously, each function (subbytes, etc.) is embarrassingly parallel.

The two working arrays state and state_new contain only 16 unsigned characters, and each of their indices will be accessed multiple times by different threads: they will therefore be allocated in shared memory (shared / __local).

The sbox is specific to AES (thus predetermined), weighs only 256 bytes, and will be read in an unstructured and multiple manner by all threads.

Finally, the roundkeys are calculated from the base key, but these calculations are neither heavy nor parallelizable: this table, weighing 176 bytes, can be calculated once on the CPU and then sent to the GPU. This table will also be read in a read-only manner by all threads.

We decide to initially place these tables in shared memory, although constant memory can sometimes be an alternative to consider.

Essentially, for a block / workgroup, the kernel boils down to:

  • Copying a state (portion of plaintext), the sbox, and the roundkeys into shared memory (448 bytes total)
  • Chaining operations only on tables in shared memory
  • Copying the encrypted state into the output array, in global memory (16 bytes)

Accesses to global memory - one read and one write per state - are efficient, as successive threads access contiguous addresses.

Since warps consist of 32 threads, it is not efficient to only use a block / workgroup size of 16 threads: the GPU code has been adapted for any size (of the form 16*n), encrypting n states in parallel, with 2 per warp, to reduce the number of scheduled blocks and fully utilize the warps. This also leads to a reduction in shared memory copies of the sbox and roundkeys lookup tables, now reused n times.

Under these reasonable optimizations, which require only a few hours for implementation and verification, the OpenCL kernel providing the encryption of a text of size 2^25, roughly about 33 million characters, takes around 480 ms on an Nvidia Tesla C2075 GPU (Fermi architecture). We keep in mind this qualitative relationship between optimization time and performance obtained before moving on to the same exercise on FPGA. Further, more detailed optimizations, significantly transforming the kernel, as well as the assistance of the NVidia profiler, allow the execution time to be reduced to around 60 ms with an additional 2 man-days of work. However, we started from the natural formulation of AES (at 480 ms) to avoid biasing the FPGA development because we are good people.

In the next post, we will tackle the implementation of a first FPGA design. We will discuss the issues encountered and the solutions that allowed us to achieve a first design while obtaining calculation performance that seemed correct.