Towards the first FPGA Single Work-item design

HPC
Written by Damien Dubuc, on 06 February 2018

Today, we are going to talk a little about the subtleties that mean an FPGA... just can't work like a GPU, even if you really want it to.

There are differences in their execution models, but there are also aspects of the Altera (sorry, I mean Intel!) OpenCL FPGA SDK that limit such attempts: you first need to convince the compiler that your cause is just and commendable. All of these observations guide the choice of our first FPGA design approach.

It's important to understand that once we have a non-trivial application, i.e., one that performs a certain number of operations on input data, the most natural and powerful degree of parallelism extracted by the FPGA is dictated by the length of the instruction pipeline generated by the design. A performant application on FPGA will likely be one that enables "smooth flow" of data through this pipeline.

We can deduce that significant parallelism can already be extracted from sequential code, as long as it contains computation loops over a data set. The elements are then processed in parallel, not in a SIMD (Single Instruction Multiple Data) model but rather in a way akin to MIMD (Multiple Instructions Multiple Data): at a given cycle, each stage of the pipeline executes one instruction on one piece of data, and from one stage to the next, neither the data nor the instruction is the same. Thus, an efficient sequential OpenCL kernel with a few directives can be a good starting point, whose logic can be duplicated later (vectorization over the "width" of the pipeline, or replication of the entire pipeline).
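
To make this concrete, here is a minimal sketch of what such a starting point could look like (the kernel name, arguments and operation are purely illustrative): a sequential kernel with a single loop, which the offline compiler can turn into a deep pipeline.

// Single work-item style: one loop, no get_global_id(), no barriers.
// In the best case each iteration enters the pipeline one cycle after
// the previous one (initiation interval of 1).
__kernel void saxpy_seq(__global const float *restrict x,
                        __global float *restrict y,
                        const float alpha,
                        const unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}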

Multi-workitem design and SIMD operations: an equivalent to the GPU model?

A "single workitem" kernel runs with a single workitem per workgroup, as opposed to a "multi-workitem" kernel. If you have already worked with GPUs, then you will naturally want to use multi-workitem kernels. Why? Because having kernels with only one workitem (or "thread") means using only a tiny part of the architecture, which then becomes less performant than your CPU. On an FPGA, one could imagine that workgroups and workitems are lined up in a pipeline of instructions, so the idea of having a single-workitem design doesn't seem particularly shocking if you have plenty of workgroups running.

It is even possible to use SIMD operations on an FPGA and thus propose vector designs that apply a given operation to several workitems of the same workgroup simultaneously, through a necessarily multi-workitem design. The Altera guide mentions that the maximum size of a vector operation is 16, which corresponds to half a GPU warp. One might then wonder to what extent the FPGA comes close to the GPU execution model in this case.
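
With the Altera SDK, such a vector design is typically requested through kernel attributes. A hedged sketch (the kernel body is illustrative; num_simd_work_items and reqd_work_group_size are the SDK's attributes):

// Ask the offline compiler to issue 16 workitems per clock cycle (the maximum
// vector width mentioned above). This requires a fixed workgroup size that is
// a multiple of the SIMD width.
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(16)))
__kernel void vec_add(__global const float *restrict a,
                      __global const float *restrict b,
                      __global float *restrict c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}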

On a GPU, a workgroup is executed on a multiprocessor (Streaming Multiprocessor, or SM, in NVIDIA terminology) and the instructions of its workitems are executed in groups of 16 or 32 threads ("warps") using SIMD instructions. These warps are scheduled on the different cores of the multiprocessor in question, and as long as the SM finds warps whose instructions are ready to execute, even if that means switching from one warp to another to hide a synchronization or a memory request, the architecture is used efficiently.

However, the synchronization problem on FPGA is more severe. In particular, the compiler states that:

  • if a barrier/synchronization is required, then a workgroup can contain at most 256 workitems;
  • if the compiler estimates that the workitems may reach a barrier out of order, then only 2 workgroups can be in the pipeline simultaneously.

These restrictions could prove catastrophic for performance and strongly discourage the use of designs that do not align with the architecture.

Consider two successive functions applied to an entire workgroup of size n*k, where the output of the first (an array of n*k values) is the input of the second, and where each workitem must modify a unique entry of this array. With a vector operation size of k, the workgroup then spreads over n successive stages of the pipeline.

Synchronizing all the workitems would require the first function to have been applied to all n*k values of the workgroup before the second can be applied to any of them. Assuming one cycle per pipeline stage, this introduces a delay of (n-1) cycles between the execution of the first and the second function, possibly at the cost of extra resources to implement these (n-1) "waiting" stages.
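
In OpenCL terms, the scenario looks roughly like the following hedged sketch (the functions, the sizes and the use of local memory are illustrative); the barrier is what forces the workgroup, spread over n pipeline stages, to regroup:

#define WG_SIZE 256   // n*k, illustrative value (also the ceiling allowed with a barrier)

__attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
__kernel void two_functions(__global float *restrict data)
{
    __local float tmp[WG_SIZE];
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tmp[lid] = 2.0f * data[gid];                    // "function 1": each workitem writes one entry

    barrier(CLK_LOCAL_MEM_FENCE);                   // all n*k entries must exist before continuing

    data[gid] = tmp[WG_SIZE - 1 - lid] + 1.0f;      // "function 2": reads an entry produced by
                                                    // another workitem, hence the dependency
}

With k = 16 and n*k = 256, for instance, the workgroup spans n = 16 stages and the barrier translates into roughly 15 cycles of waiting per workgroup, plus the resources needed to implement those stages.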

In general, any retro-dependence or complexity of this type is bad for the flow of data through a pipeline of instructions. Altera strongly discourages designs exhibiting this kind of behavior and recommends using a single-workitem kernel instead. This is where the fundamental difference between FPGA and GPU approaches lies.

"But I want to vectorize stuff!"

Yes: vectorization is good. We rediscover this every week in articles published on LinkedIn, and it is what manufacturers and profiling-software vendors keep reminding us of. From the "dumb" CPU to the GPU, by way of Intel's MIC, they all highlight the benefits of vectorization: it's in the accompanying manual, and it's in the magic formula for computing peak performance.

In reality, the point is not that we absolutely need single-workitem kernels, but that we should ideally come up with a design in which the data flowing through the pipeline has no retro-dependencies. In particular, if in the example above with our functions 1 and 2 we end up with a vector operation size equal to the size of the executed workgroup, we win: the workgroup spans only a single stage of the pipeline and advances in lockstep, without data retro-dependencies. But is this easy to implement with OpenCL and this SDK?
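
To make the comparison explicit, here is the previous sketch reworked under that assumption (still illustrative): the workgroup holds exactly 16 workitems and the vector operation has the same width, so the whole group lives in a single pipeline stage.

__attribute__((reqd_work_group_size(16, 1, 1)))
__attribute__((num_simd_work_items(16)))
__kernel void two_functions_vec(__global float *restrict data)
{
    __local float tmp[16];
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tmp[lid] = 2.0f * data[gid];          // "function 1"
    barrier(CLK_LOCAL_MEM_FENCE);         // the whole group sits in one stage: no (n-1) waiting stages
    data[gid] = tmp[15 - lid] + 1.0f;     // "function 2"
}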

Indeed, the compiler does not seem to be able to generate a vectorized design as soon as there is ambiguity about the workload of each thread (i.e., the instructions performed depend on the thread index). This problem matters because we will naturally run into it whenever we parallelize work over the iterations of a loop, for example one traversing the indices of an input array in our kernel.

On a GPU, branching due to a conditional instruction implies "only" a serialization of the execution of the different branches: in a "divergent" warp, the branches taken during the run are executed in turn, using SIMT instructions on the threads concerned. On an FPGA, this is apparently prohibitive from a vectorization point of view, as we will see in practice in a future post.

In particular, questions may arise about instructions like:

for (i = idx; i < ARRAY_SIZE; i += 16) (…)

for (i = idx; i < 4096; i += grid_size) (…)

where idx is the index of a workitem, ARRAY_SIZE a variable holding the size of an input array, and grid_size a stride typically used to distribute a large number of iterations cyclically over a smaller number of workitems.

These instructions typically appear when all workitems must work on an array whose size may not be known at compile time and could exceed the number of workitems used (or usable).
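
For reference, a hedged sketch of the kind of kernel in which such loops appear (the kernel name and the loop body are illustrative; ARRAY_SIZE is assumed here to be passed as a kernel argument):

__kernel void strided(__global float *restrict data, const unsigned ARRAY_SIZE)
{
    size_t idx = get_global_id(0);
    size_t grid_size = get_global_size(0);

    // Each workitem handles indices idx, idx + grid_size, idx + 2*grid_size, ...
    // The number of iterations therefore depends on idx and on values unknown
    // at kernel compile time.
    for (size_t i = idx; i < ARRAY_SIZE; i += grid_size)
        data[i] = 2.0f * data[i];
}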

Since the OpenCL kernel is compiled offline, separately from the host code, how can the compiler know at this stage that all workitems will have the same number of iterations to process, when it knows neither the number of workitems nor the number of workgroups? And how can we help it?
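
One possible way to help it, which is an assumption on our part rather than a recipe from the documentation, is to make the trip count identical for every workitem by fixing the bound and the stride at kernel compile time, for example through preprocessor macros:

// Hedged sketch: ARRAY_SIZE and GRID_SIZE are now compile-time constants (they
// could be overridden with -D definitions when invoking the offline compiler),
// so every workitem runs exactly ARRAY_SIZE / GRID_SIZE iterations, provided
// the NDRange uses GRID_SIZE workitems and ARRAY_SIZE is a multiple of GRID_SIZE.
#ifndef ARRAY_SIZE
#define ARRAY_SIZE 4096
#endif
#ifndef GRID_SIZE
#define GRID_SIZE 256
#endif

__kernel void strided_fixed(__global float *restrict data)
{
    size_t idx = get_global_id(0);
    for (size_t i = idx; i < ARRAY_SIZE; i += GRID_SIZE)
        data[i] = 2.0f * data[i];
}

Whether this is enough to convince the compiler is something that has to be checked in practice.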

All of these reflections show that the FPGA cannot replicate the GPU execution model: it is important to understand what makes them different. Two types of designs seem to be particularly interesting: single-workitem and vectorized. It seems natural to start with a single-workitem design to get a handle on the basics of FPGA development through this SDK before iterating on a vectorized design.

We will continue in the next post with the optimization of this Single Work-Item design.