Exploring the Single Work-Item design

Written by Damien Dubuc, on 06 February 2018

Last time we discussed the differences between the execution models of FPGA and GPU. On GPUs, single-workitem kernels (with only one workitem per workgroup) would return poor performance and utilize only a small portion of the architecture.

On the other hand, the pipeline parallelism of FPGA is ideal for this kind of design, as Altera recommends it when there are data dependencies.

Therefore, we start by taking control of the Altera SDK and the OpenCL kernel development and optimization cycle for FPGA through the exercise of data encryption using AES with a single-workitem approach.

First Attempt: Back to Loops

As soon as there are data dependencies or necessary synchronizations between workitems, Altera encourages designing a single-workitem kernel, potentially more suitable for the FPGA architecture. This means reverting to a design where a single workitem performs the work of an entire workgroup, practically implying a possible return to loops.

Whether it's a single-workitem design or a design with larger workgroups and SIMD operations, the generated pipelines essentially translate to very similar instruction sequences. Therefore, it is reasonable to think that understanding and optimizing the pipeline of a single-workitem design could potentially provide a better foundation for more elaborate designs (such as a vectorized design).

This is an opportunity for us to start exploring different parameters that influence execution time and FPGA resource occupancy, and to start using all of the Altera tools. Therefore, we rewrite the 4 functions subbytes, shiftrows, mixcolumns, and addroundkey with a loop over 16 characters, so that a single workitem is responsible for performing these operations on an entire state.

We also specify a required workgroup size of 1 workitem by preceding the definition of our kernel with the attribute line:


This first approach has the following characteristics (numbers obtained through files generated by the aoc compiler and the timeline visualized under aocl):

  SW base
Execution Time 10 684 ms
LE (% used) 13
FF (% used) 10
RAM (% used) 23.5
DSP (% used) 0.2

It is worth noting that this initial implementation follows Altera's recommendations, favoring better performance, especially on the following points:

  • use of "restrict" qualifiers for disjoint array pointers
  • memory alignment of text arrays to be transferred from the host (significant difference in execution time!)

The profiler invoked with aocl reveals that there is no stalling in our design, and the instruction occupancy is low (many are below 20%). This aligns with the fact that the performance of this design seems poor, but it does not help us understand why. However, it shows something interesting: reading/writing the states in global memory is efficient at 100% (maximum burst rate corresponding to simultaneous memory transactions of 16 elements); while reading the sbox and roundkeys is only at a burst rate of 2. This is unexpected since these memory requests are similar to the first one, which works very well. Indeed, contiguous addresses are accessed in global memory (256 and 16*11 elements depending on the table).

Much later, we will notice that although there are no differences in the kernel itself, the two previous arrays were not aligned in memory on the host. This could explain the poor performance of these loops, but could not be verified since we no longer had access to the FPGA board on the remote machine provided by Bittware.

Aocl has no information to report on instructions involving modulo or integer division, which should be costly.

Finally, we note that the board initialization time has a cost: it must be configured in the way dictated by the .aocx design file passed to it during runtime. This cost of 190 ms for our kernel represents only a fraction of the execution time for now. But one can wonder what its weight will be once the kernel is optimized, with its execution time potentially below.

FPGA Optimization Basics

Loop unrolling is the most "obvious" optimization on FPGA. To extract more parallelism on the pipeline stages, there is nothing like unrolling parts of them: this should greatly improve the performance of our FPGA design.

Consider a set of data passing through a loop with n iterations, where the pipeline generated for each iteration would be of depth k. Without unrolling, we would have up to k data processed in parallel in a pipeline representing one iteration of the loop, and these same data would be reinjected n times in total. Meanwhile, the next data is blocked, causing "stalling" (poor data flow). By unrolling the loop over n iterations, we would have a pipeline of depth n*k processing as much data in parallel, without stalling, at the cost of increased hardware usage.

This loop unrolling is done using a pragma at the beginning of the loop, indicating the unrolling factor p (ranging from 1 to n).

#pragma unroll p
for (i=0 ; i<n ; i++) {...}

However, the impact of loop unrolling on the execution time of an FPGA application is not so easy to understand and has already been the subject of research in the past (especially for nested loops). Our single-workitem kernel exhibits up to 3 levels of nested loops, presented approximately as follows:

[in kernel :: loop 3]
for (round=0 ; round <9 ; round ++)
. [in mixcolumns :: loop 2]
. for (workitem=0 ; workitem<16 ; workitem++)
. [in gmul :: loop 1]
. for (bit=0 ; bit<8 ; bit ++)
. (...)

where gmul is the multiplication in the field G(2^8), whose structure is used by AES.

Note that the gmul operation could be tabulated, being a binary operation on characters (65k possible inputs). We chose to initially compare the two architectures by asking them to go through the calculation, based on logical shift operations essentially.

This configuration already presents a lot of possible combinations of loop unrolling (partial or total unrolling, at a single or multiple levels of nested loops). Each attempt requiring costly compilation time, we cannot explore them randomly. The amount of calls to the gmul operation suggests that improving the pipeline generated for it could have a significant impact on the execution time of our application.

We arbitrarily test 3 designs, unroll1, unroll2, and unroll3, unrolling the loops respectively to level 1, 2, or 3 (for example, unroll2 unrolls loops 1 and 2), each with total unrolling. We obtain the following results:

  SW base SW unroll1 SW unroll2 SW unroll3
Execution Time 10 684 ms 2212 ms 2295 ms 4189 ms
LE (% used) 13 12.2 13.5 35.4
FF (% used) 10 8.2 8.5 21
RAM (% used) 23.5 21.7 25 67
DSP (% used) 0.2 0.2 0 0

This initial study shows that loop unrolling can have a very significant impact on an FPGA application: unroll1 and unroll2 designs present a speedup of almost x5, with little difference in resource usage. It also shows that it is not easy to predict the impact of loop unrolling on the application as a whole: although the unroll3 design unrolled the loops at all levels, it is the slowest and the most expensive in terms of FPGA resource usage. Conversely, the unroll1 design is the fastest but also less resource-intensive than our initial design, which is counterintuitive in this context of loop unrolling.

The profiler still does not show any stalling in these designs, which is a bit strange. However, it is observed that the instruction occupancy of unroll1 is around 60% for most of them, compared to less than 20% for the initial design, a meager 4% for unroll2, and less than 1% for unroll3. Considering the very close performances of unroll1 and unroll2, this difference in the "instruction occupancy" metric is unexpected.

Since it only uses about 20% at most of the components of each category present on the hardware, one naturally wonders about the scalability of these numbers with respect to the number of copies of this pipeline that we can fit on the board.

Optimizing FPGA Resource Usage to Scale

After some trials, we have arrived at a design with much better efficiency than the initial design. It is probably improvable, but we already wonder what performance we can achieve by using all the hardware by replicating the associated instruction pipeline. Scalability is an important factor in HPC, and studying it now is necessary to have a more precise idea of how resource usage will guide the general optimization process. Replication is implemented using the following line, positioned just before the kernel definition: attribute(compute_units(k)), where k is an explicit integer. This allows us to go up to k = 16 compute units (a number determined empirically) and visualize the data.

We notice that resource usage does not grow linearly with the number of pipeline replicas, which is somewhat unexpected since the Altera Guide mentioned that all logic is replicated. So it's not easy to predict how many compute units we can use at maximum... and we will have to compile multiple times (joy).

In this graph, we can see that the limiting factor of the design is RAM: the design cannot really scale beyond 16 cu because the amount of memory blocks available on the FPGA is almost exhausted (92% used). On the other hand, the other categories of resources on the card are only used to about 50% and do not pose a scalability problem.

A second graph presents the execution times of the kernels (in ms), as well as the speedup achieved compared to the 1cu kernel.

The design seems to have good scalability since with 16 replicas of the pipeline, we obtain a speedup of almost 10 and bring the kernel execution time down to 228 ms. This imperfect scalability is explained by I/O contention phenomena because all pipelines share the bandwidth to global memory. The aocl profiler informs us that the average burst rate for reading/writing the states is 1, and that the efficiency of these transactions is only 25% (how is this last figure calculated?). Strangely, the burst rate is still 2 for the sbox and the roundkeys...

The logical next step in optimizing this single-workitem design would be to try to reduce the FPGA resource usage of "RAM", which is here the limiting factor of scalability. The idea would be to use constant memory instead of shared memory for look-up tables (sbox, roundkeys). Not having to copy them from global memory should reduce I/O contention, but what effect will it have on access performance during calculations and the scalability of the design?

Limited by time, we want to focus on vectorized kernels that should offer designs closer to the GPU execution model but also better performance for less resource consumption (i.e., better "efficiency"). We have not been able to explore all the designs envisaged so far (partial or mixed loop unrollings, constant memory...). Although some avenues have been abandoned, some will be taken up again in the context of the vectorized kernel. This first step raises an essential question: "How can we effectively explore all possible variants of a kernel with such a large number of parameters to play with? And to what extent have we not been able to take advantage of the Altera SDK to go further in understanding the weaknesses of our initial designs?"

In particular, we have not been able to extract concrete information from the occupation and stall metric readings, and we have not been able to interpret performance differences on certain accesses to global memory, even with 1 cu.

The next post will tell our adventures with the multi-workitem design and highlight certain limits of the SDK...