Vectorization on FPGA

HPC
Written by Damien Dubuc, on 06 February 2018

After our first experience with the development and optimization of a single-workitem AES encryption design, we continue the adventure with a multi-workitem kernel that we wish to vectorize.

We will see what difficulties were encountered during this exercise, and then how constant memory could help us scale a little further.

The AES application lends itself to SIMT (Single Instruction, Multiple Threads) execution because all characters of the state can be updated simultaneously by following identical instructions. This is particularly efficient on GPUs, whose architecture exposes parallelism in a way quite different from FPGAs.

On FPGA, it is also desirable to achieve a SIMT/vectorized design, for two reasons:

  • It is closer to the GPU execution model, which is the one we want to draw an analogy with. Having an idea of the cost-to-performance ratio of this type of design (which is our main motivation) is the central point of our study.

  • It is less costly in terms of board utilization to build a pipeline with vector instructions of size n than to replicate a pipeline n times, and this should reduce the effects of I/O contention of the pipelines by a factor of n. Assuming a similar execution speed, we thus have a better efficiency of the design, which we can try to scale up.

We could therefore hope to produce a design that more closely resembles typical GPU code - presumably requiring less porting effort, and better reflecting the performance potentially achievable with the FPGA.

Implementation Constraints

Vector operations are implemented by adding __attribute__ lines preceding the kernel definition, where the workgroup size and the number of compute units are also specified.

The size of vector operations must divide the number of workitems present in the workgroup; the following example asks the compiler to generate a kernel whose workgroups are in one dimension and of size 32 workitems, with vector operations operating on 4 elements at a time:

__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(32,1,1)))
__kernel void vectoradd(…)

Altera's documentation indicates that the width of a SIMD instruction is limited to 16, which allows a natural implementation of AES that manipulates entire 16-byte states at once.
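For illustration, a width-16 configuration could look like the sketch below (the kernel name and argument list are illustrative, not our actual code): one workgroup of 16 workitems handles one 16-byte state, and the whole group advances through the pipeline as a single vector instruction.

// Hypothetical width-16 configuration: one workgroup of 16 workitems per 16-byte state
__attribute__((num_simd_work_items(16)))
__attribute__((reqd_work_group_size(16,1,1)))
__kernel void aes_encrypt(__global uchar *states_d,
                          __global uchar *sbox_d,
                          __global uchar *roundkeys_d)
{
    int idx = get_local_id(0);   // 0..15: the byte of the state this workitem owns
    // ... AES rounds operating on byte idx of the current state ...
}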

The attribute line __attribute__((num_simd_work_items(16))) tells the aoc compiler that we want to generate a pipeline of width 16; however, this does not mean that it will succeed. The issue has been known for many years: automatic code vectorization by a compiler is difficult and often requires assistance.

In our case, the first compilation attempts are unsuccessful; the compiler informs us that it is unable to vectorize the kernel - at all - with the following message:

Compiler Warning: Kernel Vectorization: branching is thread ID dependent ... cannot vectorize.

We then realize that certain habits acquired from GPU programming will pose a problem here. On GPU, branching due to a conditional instruction only implies a serialization of instructions: within a warp, the different paths taken are executed in turn with SIMT instructions on the threads concerned. On FPGA, this appears to be prohibitive from a vectorization standpoint.

From this perspective, several control flow instructions can be considered ambiguous as soon as they refer to the workitem index (below idx). With a little research, we can guess the culprits:

if (idx < 16)  (…)

for (int i=idx; i<256; i+=16)
    sbox_loc[i] = sbox_d[i];

These lines of code are quite ordinary and very present in GPU programming but will prevent the vectorization of the FPGA kernel here. While the first one is not a problem to work around here, the second one is much more annoying: aoc is not able to determine, with a forced workgroup size of 16, that the loop above gives perfectly identical work to all workitems and can therefore be vectorized.

This means that we will have to rewrite the loop to show it:

for (int i=0; i<16; i++)
    sbox_loc[idx_l+16*i] = sbox_d[idx_g+16*i];

While the fix here is simple (though annoying), the problem is deeper: if aoc cannot handle this kind of case, then we can doubt its ability to vectorize code whose loops cannot be rewritten, for example loops depending on a parameter not known when the OpenCL kernel is compiled (which happens offline, separately from the host code). That is:

  • a loop whose step depends on the number of workgroups (a grid-size stride, for example) will not be vectorizable;

  • nor will a loop whose termination criterion depends on a value unknown at compile time (such as the size of an input array) - see the sketch below.
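For instance, the following loop (variable and function names are illustrative) combines both cases: it is the classic grid-stride idiom, whose stride get_global_size(0) and bound n are only known at runtime, so aoc cannot prove that every workitem of a workgroup does identical work.

// Illustrative grid-stride loop: neither the stride nor the bound is known
// when aoc compiles the kernel offline, so it cannot be vectorized
for (int i = get_global_id(0); i < n; i += get_global_size(0))
    out_d[i] = process(in_d[i]);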

After some quick adjustments (and several hours of unsuccessful compilation), the compiler agrees to vectorize our kernel. However, it points out that some read and write operations could not be vectorized:

Compiler Warning: Vectorized kernel contains loads/stores that cannot be vectorized. This might reduce performance.

We would have liked to know which ones! The impact of one or two non-vectorized instructions could slow down the entire pipeline and more or less nullify the rest of the vectorization efforts.

On CPU, everything you vectorize without destroying the initial code is an immediate gain: your total execution time is the sum of the times of each portion of the code and you have effectively reduced it. On FPGA, all data advances in a pipelined manner, whose latency is indeed governed by something akin to the sum of the latencies of each portion of code but whose throughput is given by the slowest portion... And the more data you have to process, the more throughput matters. Ouch.
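To put a rough formula on it (with generic notations, not figures from our design): for a pipeline of latency L cycles, an initiation interval II imposed by its slowest portion, N data items to process and a clock frequency f, the total time is approximately T ≈ (L + (N−1) × II) / f. As soon as N is large, the (N−1) × II term dominates: the slowest portion sets the pace, no matter how fast the rest of the pipeline is.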

                 W16 base
Execution Time   1488 ms
LE (% used)      31
FF (% used)      23
RAM (% used)     63
DSP (% used)     2

The efficiency of this design is well below that of the single-workitem kernel with 16 cu: its execution time is roughly 6 times longer. We immediately hypothesize that the non-vectorized read/write operations significantly slow down the rest of the application.

The two loops reading the look-up tables sbox and roundkeys from global memory have a burst rate of 2, while the one reading the states is at 16 (identical behavior to our single-workitem kernel). Given that there is no visible difference between these 3 loops in the kernel compilation and that we have also rewritten them to avoid causing problems for the compiler, the problem lies elsewhere. And there seems to be only one good candidate: the sbox substitution step, with its indirect memory access (state[i] = sbox[state[i]]). Although non-contiguous accesses to shared memory have limited impact on GPU, we have no real idea of the consequences of this situation for our FPGA design.

Using constant memory for these look-up tables would in any case let us get rid of these two poorly performing loops. On GPU, constant memory has several specificities that must be known to get better performance out of it than out of global memory. On FPGA, we do not know of any particular characteristics.

Finally, the current vectorized pipeline is not replicable: it consumes too many RAM resources. An attempt to compile with 2 cu ends, after several hours, with the final compilation error: Error: Kernel doesn't fit

We believe that a notable change in memory usage is needed for this design to have any hope of being replicated, and we hope that constant memory can help us.

Constant Memory for Further Scaling?

The use of constant memory has not yet been addressed in our experiments. In the Altera documentation, this memory cache is described as ideal for look-up tables, with a very high cache-hit rate. Unlike on GPU architectures, where the size of this memory area is fixed, on FPGA it is adjustable at compile time. By default it is 16 kB, and the user can adjust its size with a compilation flag:

--const-cache-bytes N

where N is its size in bytes (and must be a power of 2). We use it to store the sbox and roundkeys tables, which gives N = 512. In addition, the tables in question must be passed as kernel arguments preceded by the keyword __constant.
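Concretely, the declaration could look something like the sketch below (kernel and argument names are illustrative, as is the file name in the compile command):

// Hypothetical signature: the look-up tables are read directly from the constant cache
__attribute__((num_simd_work_items(16)))
__attribute__((reqd_work_group_size(16,1,1)))
__kernel void aes_encrypt(__global   uchar *states_d,
                          __constant uchar *sbox_d,        // 256 bytes
                          __constant uchar *roundkeys_d)   // expanded round keys
{
    // ...
}

// compiled with, e.g.: aoc --const-cache-bytes 512 aes.cl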

The use of this constant memory allows us to do without copies of look-up tables in local memory, an approach favored on GPU.

It is said that on GPU, if threads of the same warp want to access different addresses in constant memory, these requests are serialized and the memory transaction could then be less efficient than a grouped read in global memory. On GPU, constant memory has been designed to ideally broadcast a single value to an entire warp in a single request. None of the Altera documents (Programming Guide & Best Practices) mention any particularities of FPGA constant memory, which probably allows grouped accesses (coalesced).

In our case, all threads access different values. For the roundkey, it is 16 successive values in memory; for the sbox, it is 16 values whose distribution is unknown. Even if constant memory could allow grouped accesses for the elements of the roundkey, it is difficult to see how we could vectorize the accesses to the sbox. One solution would be to no longer use the sbox as a look-up table, but to retrieve its values by calculation, since it is predetermined.
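To illustrate that last idea (a naive sketch, not something from our implementation): every S-box entry is the affine transform of the multiplicative inverse in GF(2^8), so a workitem could in principle recompute sbox[state[i]] on the fly instead of reading a table. Written naively, as below, it is far too expensive, but it shows that the table really is given by a formula.

// Carry-less multiplication modulo the AES polynomial x^8 + x^4 + x^3 + x + 1
uchar gf_mul(uchar a, uchar b)
{
    uchar p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) p ^= a;
        uchar carry = a & 0x80;
        a <<= 1;
        if (carry) a ^= 0x1b;
        b >>= 1;
    }
    return p;
}

// S-box value computed instead of looked up:
// multiplicative inverse in GF(2^8) (x^254, with 0 mapped to 0), then affine transform
uchar aes_sbox(uchar x)
{
    uchar inv = x;
    for (int i = 0; i < 253; i++)   // inv = x^254
        inv = gf_mul(inv, x);

    uchar r = inv, t = inv;
    for (int i = 0; i < 4; i++) {   // r = inv ^ rotl(inv,1) ^ ... ^ rotl(inv,4)
        t = (uchar)((t << 1) | (t >> 7));
        r ^= t;
    }
    return r ^ 0x63;
}

Whether aoc would turn such a fixed function into efficient dedicated logic is another question.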

Beyond a gain in execution time, we are also curious to see how this change affects board utilization. By using the constant memory cache (which is part of global memory) instead of storing all our data in local memory, we should reduce the memory block usage of our kernel. Even if this modification leads to an immediate loss of performance, it could allow us to scale to multiple compute units. After compilation, we obtain the following figures:

                 W16 base   W16 constant   W16 constant 2cu
Execution Time   1488 ms    1299 ms        745 ms
LE (% used)      31         21             35
FF (% used)      23         20             39.5
RAM (% used)     63         44             85
DSP (% used)     2          1.5            3

This design using constant memory gives a shorter execution time and reduces the amount of RAM used by about 30%, which allows us to scale up to 2 cu.

However, the compiler still warns us that there are non-vectorizable load/store operations and that, consequently, the performance of our kernel is not optimal (all signs point to the sbox). We also notice that it is not easy to maximize board utilization by replicating a base pipeline whose own resource usage is high: the level of granularity is not the same as with a single-workitem kernel.

The general feeling from this work is that it is not easy to use SIMD operations on FPGA with a kernel whose data accesses and control flow are non-trivial. Many habits and coding patterns that pose no problem on GPU (CUDA/OpenCL) become serious issues on FPGA that can prevent kernel vectorization without being explicitly identified. Vectorization is particularly sensitive to thread-dependent branching, values unknown at kernel compilation, and grouped memory accesses, which are not always achievable. In conclusion, after a number of trials and efforts over several days, we have not been able to completely vectorize our kernel, and we end up with a design about 3x slower than the single-workitem one (which itself could probably have been improved with constant memory). This experience suggests that a SIMD design is only worthwhile if it can be carried through completely. A comprehensive compilation report - as promised in the Altera documentation - would have been very helpful, as this phase requires a good understanding of both the architecture and the code.

This post concludes our user experience feedback on the Altera (now Intel) OpenCL SDK. We will return in a final post to share with you our thoughts and perspectives on FPGA in the world of scientific computing and HPC.