From CUDA on GPU to OpenCL on FPGA

HPC
Written by Damien Dubuc, on 06 February 2018

We tried to put ourselves in the shoes of one of our clients who wants to explore the possibilities of FPGAs while coming from the GPU world.

We approached it as follows: we ported a GPU application, whose compute kernels were originally written in CUDA, to an FPGA application whose kernels are written in OpenCL. The goal was to identify the difficulties encountered during development, quantify the effort and time required for the different phases, and, to a reasonable extent, look into optimizations and achievable performance.

Bittware is a company specializing in FPGAs, offering solutions based on Altera or Xilinx products. They gave us access to a remote machine equipped with an Altera Stratix V FPGA card, along with a license for Altera's AOCL compilation and profiling tools. These tools are now packaged as the Intel FPGA SDK for OpenCL, following Intel's acquisition of Altera.

Transition from CUDA to OpenCL on GPU

In both CUDA and OpenCL, two sides need to be considered: the host side is the CPU code that drives the GPU card, while the device side explicitly describes the compute code to be accelerated on that card.

Although the host-side changes are somewhat heavy compared to CUDA and not very sexy at first glance, OpenCL's generality allows you to reuse most of that code in all your future developments, regardless of the device.

Host side

Regarding the host-side code, OpenCL can be daunting due to its verbosity. Indeed, since it is intended for programming heterogeneous architectures in general, the programmer must go through a fairly elaborate ritual to identify the available devices, select the one to use, and point to the source code of the compute kernel to compile, all through OpenCL-specific objects.

Behind this verbosity lie the generality of the language and perhaps a lack of polish that could have made the interface more user-friendly. Note that the C++ API is lighter than the C API, which is the one we used here. With few exceptions, you can lift this setup from existing OpenCL examples and end up with something reusable, regardless of the device.
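
As an illustration, here is a minimal sketch of this setup ritual with the C API (error handling omitted; the .cl file name and kernel name are placeholders):

    /* Minimal OpenCL host-side setup sketch (C API); error checks omitted. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_int err;

        /* 1. Identify a platform and a device (here, the first GPU found). */
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, NULL);
        cl_device_id device;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        /* 2. Create a context and a command queue for that device. */
        cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

        /* 3. Load the kernel source as a plain character string. */
        FILE *f = fopen("kernels.cl", "r");               /* placeholder file name */
        fseek(f, 0, SEEK_END);
        size_t size = (size_t)ftell(f);
        rewind(f);
        char *source = malloc(size + 1);
        fread(source, 1, size, f);
        source[size] = '\0';
        fclose(f);

        /* 4. Build the program and extract the kernel by name. */
        cl_program program = clCreateProgramWithSource(context, 1,
                                 (const char **)&source, NULL, &err);
        clBuildProgram(program, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(program, "my_kernel", &err);  /* placeholder name */

        /* ... create buffers, set arguments, enqueue the kernel, read back results ... */
        return 0;
    }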

On the host side, many functions have a CUDA equivalent, such as memory allocation and alignment on the device in global or constant memory, or the configuration of work items and workgroups (threads and blocks, respectively, in CUDA terminology). The terminology changes slightly, but it remains transparent to anyone familiar with the GPU; a short sketch after the table below illustrates some of these equivalents. However, some memory qualifiers are misleading; the following table gives the equivalence of some of the main keywords:

OpenCL terminology           CUDA terminology
Work item                    Thread
Workgroup                    Thread block
Compute unit                 SM (streaming multiprocessor)
Global / constant memory     Global / constant memory
Local memory                 Shared memory
Private memory               Local memory
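
As a rough illustration of these host-side equivalences (reusing the context, queue, kernel, and err objects from the sketch above; h_a and n are hypothetical), a CUDA allocation, copy, and launch translate along these lines:

    /* CUDA:   cudaMalloc(&d_a, n * sizeof(float));                              */
    cl_mem d_a = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, &err);

    /* CUDA:   cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);  */
    clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0, n * sizeof(float),
                         h_a, 0, NULL, NULL);

    /* CUDA:   my_kernel<<<n / 256, 256>>>(d_a, n);
     * OpenCL: the global size is the total number of work items,
     *         the local size is the workgroup size.
     * (Kernel arguments must have been set beforehand; see below.)              */
    size_t global_size = n;
    size_t local_size  = 256;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);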

A notable change concerns the setup and launch of the OpenCL kernel, written in a .cl file that, unlike the .cu files compiled by nvcc in CUDA, is not compiled with the rest of the application. The developer must indicate in the host code the name of the .cl file, as well as the name of the kernel to use inside it. This kernel is not compiled with the rest of the code because it is handled as a character string: any error in it only shows up at runtime. Finally, the arguments of a kernel must be set one by one, with an instruction specifying their value and their position in the argument list.
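
Concretely, argument binding and the runtime-only error reporting look like this in the C API (program, device, kernel, d_a and n reused from the sketches above):

    /* Each argument is bound individually, by its position in the argument list. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);   /* argument 0 */
    clSetKernelArg(kernel, 1, sizeof(int),    &n);     /* argument 1 */

    /* Since the kernel source is just a string, build errors only surface at
     * run time; the build log has to be queried explicitly to see them. */
    if (clBuildProgram(program, 1, &device, NULL, NULL, NULL) != CL_SUCCESS) {
        char log[16384];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, NULL);
        fprintf(stderr, "%s\n", log);
    }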

Device side

On the device side, however, the transition from CUDA to OpenCL is very simple. The qualifiers designating the different types of memory (global, read-only, shared...) carry over directly, and the functions for retrieving the local or global index of a work item are even simpler than those offered by CUDA. It is very quick to get an OpenCL equivalent of a CUDA kernel, provided it does not use advanced features specific to NVIDIA's architecture or ecosystem (dynamic parallelism, etc.). Even without prior experience, with the help of a cheat sheet of language keywords, the effort can be measured in minutes.
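
For example, a trivial vector-add kernel (hypothetical) differs from its CUDA counterpart only in its qualifiers and index functions:

    /* OpenCL version of a simple CUDA kernel.
     * CUDA would write:  int i = blockIdx.x * blockDim.x + threadIdx.x;  */
    __kernel void vec_add(__global float *a,
                          __global const float *b,
                          const int n)
    {
        int i = get_global_id(0);   /* global index of the work item */
        if (i < n)
            a[i] += b[i];
    }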

Ultimately, if you already know the main differences with CUDA and start from an existing OpenCL example, you can in a few hours (mainly spent on the host side) develop a functional OpenCL GPU application on your first try.

And finally, on to the FPGA

This transition involves modifications on both the host and device sides; but this time, it is the OpenCL kernel that will require the most experimentation, thinking time, and above all, compilation time. Several hours will be needed to produce a first functional FPGA application: optimization is left out here.

Device side

The OpenCL kernel is compiled separately by the Altera SDK's aoc compiler (well, Intel's now). This compiler performs significant optimization and translation work into a hardware description language (HDL) to produce the design used during the application run: a .aocx binary that can weigh ~100 MB or more. This phase can take several hours and prove to be a bottleneck in the development or optimization process. Indeed, since the design is generated from this single file, any change to it requires recompiling and generating a new design. Note that this compilation is done without knowledge of the rest of the host code and, of course, of the runtime parameters.
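
For reference, the offline compilation step looks roughly like the following; exact flags depend on the SDK version, and the board name is a placeholder:

    # Full hardware compilation: several hours of synthesis and place-and-route,
    # producing the kernels.aocx design loaded by the host application.
    aoc kernels.cl -o kernels.aocx -board=<board_name>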

The work to be done on the OpenCL kernel is at two levels:

  • Adding simple pragmas and attributes to indicate the sizes/dimensions of workgroups, and possibly the number of pipelines to be replicated, loop unrolling, and the size of vector operations if performance is already being considered.
  • Changes to the initial code to highlight possibilities for vectorization by the compiler or to return to a workgroup composed of a single work item (performing the equivalent work of an entire workgroup, with a return to loops).

The first point is a very quick task in itself, but the number of parameter combinations is large, and each one requires compiling a new binary, which is costly in time. Ultimately, the search for good performance calls for careful thinking about how the hardware is used, both in terms of resource usage ("occupation") and of instruction pipeline performance.

The attributes inserted before the kernel specify, respectively (see the sketch after this list):

  • The number of pipelines generated, referred to as OpenCL compute units
  • The size of vector operations (here, 16 elements)
  • The required workgroup size (here, its 1st dimension is 16, and the other two are 1)
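
On the kernel signature, these three attributes might look like the following sketch (the values are the ones quoted above and purely illustrative):

    __attribute__((num_compute_units(2)))            /* number of replicated pipelines   */
    __attribute__((num_simd_work_items(16)))         /* 16-wide vector (SIMD) operations */
    __attribute__((reqd_work_group_size(16, 1, 1)))  /* required workgroup size          */
    __kernel void my_kernel(__global float *restrict a,
                            __global const float *restrict b)
    {
        int i = get_global_id(0);
        a[i] += b[i];
    }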

Loop unrolling pragmas, on the other hand, need to be inserted into the code and can specify the unrolling factor (which can be 1 if you absolutely want to prevent the compiler from unrolling anything). Their effectiveness and parameterization can be the subject of a study in their own right.
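
A sketch of the unrolling pragma on a hypothetical inner loop (a, b, and N are placeholders):

    float sum = 0.0f;
    #pragma unroll 4   /* unroll by a factor of 4; "#pragma unroll 1" disables unrolling */
    for (int k = 0; k < N; ++k)
        sum += a[k] * b[k];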

It follows that any change to these settings can generate a design very different from the previous one, and therefore requires a new compilation, which will be described in more detail in the next section.

Not so simple after all...

The second part requires a certain understanding of the FPGA architecture, particularly how workgroups/workitems are scheduled and executed, and highlights the fact that a kernel suitable for the GPU is not always suitable for the FPGA, as it stands.

For applications that are not "embarrassingly parallel", exhibiting data dependencies or synchronizations between work items, it will probably be necessary to switch, temporarily or permanently, to "single work item" kernels. This means going back to loops to generate a pipeline free of data dependencies within a workgroup: concretely, it shifts the granularity of the application from the work item to the workgroup, to get the most out of the pipeline.
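
As a sketch, a reduction that a GPU kernel would spread over the work items of a workgroup can be rewritten as a single work-item kernel with an explicit loop (hypothetical example, launched as a single work item on the host side):

    /* Single work-item version: one kernel instance loops over all the data,
     * letting the compiler build a deep pipeline across loop iterations
     * instead of scheduling many work items. */
    __kernel void sum_single_work_item(__global const float *restrict in,
                                       __global float *restrict out,
                                       const int n)
    {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i)   /* the loop replaces the work items */
            acc += in[i];
        *out = acc;
    }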

For multi-work-item designs with vector (SIMD) operations, the problem is rather to make the compiler understand that it can vectorize certain operations. Indeed, the design is compiled without knowledge of the rest of the host code (since it is compiled separately by aoc) or of the runtime parameters. Unlike CPU code, where compute loops can be versioned and the "right" version picked at runtime, the FPGA design corresponds to a fixed pipeline during execution. So you have to convince the compiler that your work items/threads all execute the same instructions, that your work loops fall just right...
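
In practice this often means keeping the control flow identical across work items and choosing sizes that fall on the SIMD width. A hypothetical sketch, assuming the host pads the data so the guard on the global index can be dropped:

    /* SIMD-friendly version: no work-item-dependent branch in the body, so the
     * 16 vector lanes stay in lockstep (the host guarantees padded sizes). */
    __attribute__((num_simd_work_items(16)))
    __attribute__((reqd_work_group_size(64, 1, 1)))
    __kernel void saxpy_simd(__global float *restrict y,
                             __global const float *restrict x,
                             const float alpha)
    {
        int i = get_global_id(0);
        y[i] = alpha * x[i] + y[i];
    }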

Host side

The amount of work is negligible, thanks to the generality of the OpenCL language. The FPGA is a device like any other, and all you have to do is pass the path and name of the design (the .aocx binary) when creating the OpenCL program object, instead of the path of the .cl file as in the GPU application. However, you will need to check that the execution configuration of the kernel (workgroup sizes) matches what was specified in the .aocx binary linked to the kernel being used.
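
Concretely, the .aocx image is loaded with clCreateProgramWithBinary instead of clCreateProgramWithSource, along these lines (file name hypothetical, error checks omitted, context/device/err reused from the earlier sketches):

    /* Load the precompiled FPGA design instead of the .cl source. */
    FILE *f = fopen("kernels.aocx", "rb");
    fseek(f, 0, SEEK_END);
    size_t size = (size_t)ftell(f);
    rewind(f);
    unsigned char *binary = malloc(size);
    fread(binary, 1, size, f);
    fclose(f);

    cl_int binary_status;
    cl_program program = clCreateProgramWithBinary(context, 1, &device, &size,
                             (const unsigned char **)&binary, &binary_status, &err);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "my_kernel", &err);  /* placeholder name */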

To switch to the FPGA, the vast majority of the work is on the device side. It involves thinking about your application in terms of a pipeline in order to find an appropriate formulation. Going back to a single work item offers a default solution, simple enough to implement, which consists of returning to loops. Finding a correct vectorized kernel for a non-trivial application is more complex and can already be seen as a step towards design optimization. Adding pragmas and attribute lines to the OpenCL code to characterize the design is extremely quick, but each compilation of the resulting design takes several hours. Finally, the changes on the host side are very few, fixed, and reusable: they can be measured in minutes.

We will continue in the next post with a presentation of the development tools provided with the Altera SDK, as a prelude to a practical case of porting an application to the FPGA.