Perspectives on the use of FPGAs at Aneo

HPC
Written by Damien Dubuc, on 06 February 2018

Here are some reflections and perspectives that we draw from the study.

The FPGA execution model differs fundamentally from the GPU execution model. Porting a kernel from one architecture to the other is not straightforward and may require significant work, especially in the optimization phases.
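To make the contrast concrete, here is a minimal sketch of the same vector addition written both ways: the NDRange form favoured on GPU (one work-item per element) and the single work-item form favoured on FPGA, where the offline compiler turns the loop into a deep pipeline. The kernel names are ours, for illustration only.

    // GPU style: massively parallel NDRange kernel, one work-item per element.
    __kernel void vadd_ndrange(__global const float *a,
                               __global const float *b,
                               __global float *c)
    {
        int i = get_global_id(0);
        c[i] = a[i] + b[i];
    }

    // FPGA style: single work-item kernel; the offline compiler pipelines
    // the loop so that, ideally, one iteration enters the datapath per cycle.
    __kernel void vadd_single_wi(__global const float * restrict a,
                                 __global const float * restrict b,
                                 __global float * restrict c,
                                 int n)
    {
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }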

On GPU, the current methodology and tools allow gradual convergence towards ever better results by targeting successive bottlenecks: we are rather well guided. On FPGA, iterating on a new design is costly, and the sheer number of possible optimizations (differing in nature and in parameters) poses a challenge. This difficulty was compounded by the fact that we were discovering the environment, and our lack of experience did not yet let us take full advantage of the SDK tools; deriving an iterative optimization methodology in this context is hard. On GPU, the equivalent of 3 man-days for 3 experts seems enough to get close to implementations that make the best use of the hardware. On FPGA, the equivalent of 20 man-days, discovery included, only allowed us to explore part of the possibilities, sometimes guided by arbitrary decisions.

We have left many paths and variants unexplored, faced with a costly iterative compilation process and probably unable to guide ourselves as well as we could have, the amount of information produced by the Altera SDK being beyond our current skills. In particular, we never obtained a compilation report as described in Altera's documentation (feedback on the success or failure of loop unrolling, on the vectorization of certain instructions, etc.), which would have been very helpful.
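For reference, the kind of source-level directive such a report is meant to comment on looks like the sketch below (the kernel and the unroll factor are hypothetical); the report should say whether the unrolling was honoured and whether the resulting memory accesses were coalesced into wider ones.

    __kernel void scale(__global const float * restrict in,
                        __global float * restrict out,
                        int n)
    {
        // The compilation report should tell us whether this unroll by 4
        // succeeded and whether the four loads were merged into a single
        // wider access.
        #pragma unroll 4
        for (int i = 0; i < n; ++i)
            out[i] = 2.0f * in[i];
    }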

After all these tests, and armed with a better understanding of the architecture, the desire is to keep iterating on designs, extracting as much information as possible from the profiler. We also want to try other, more elaborate directions involving functionalities not covered in this document, such as expressing the application as several single work-item or vectorized kernels linked by data channels/pipes (sketched below), or targeting FPGA clusters... But this effort must imperatively be put at the service of a methodology, so that we can navigate efficiently through the space of possible optimizations; the risk is otherwise to confine ourselves to a reduced set of optimizations and to overlook more general avenues.
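As an illustration of that direction, here is a minimal sketch of two single work-item kernels linked by a channel, as exposed by the Altera SDK (the extension and function names are those of the Altera toolchain; later Intel FPGA SDKs rename them, and the channel depth here is arbitrary):

    #pragma OPENCL EXTENSION cl_altera_channels : enable

    // FIFO linking two kernels that run concurrently on the FPGA fabric.
    channel float ch __attribute__((depth(64)));

    __kernel void producer(__global const float * restrict in, int n)
    {
        for (int i = 0; i < n; ++i)
            write_channel_altera(ch, 2.0f * in[i]);   // pipeline stage 1
    }

    __kernel void consumer(__global float * restrict out, int n)
    {
        for (int i = 0; i < n; ++i)
            out[i] = read_channel_altera(ch) + 1.0f;  // stage 2, overlapped
    }

The point of the channel is that intermediate data never transits through global memory: the two stages overlap like a macro-pipeline.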

We regret not having been able to obtain more help from Bittware and Altera: this had an obvious impact on our time budget. Altera's documentation is extensive, but without a methodology and a good understanding of the SDK tools it remains difficult to navigate. Bittware very often referred us to Altera, and the answers to our questions, often essential ones, came from users on Altera's forum.

Beyond these points, the FPGA execution model gives us a better idea of the type of applications suited to this architecture. Conversely, it raises the following questions, relevant to many actors in scientific and parallel computing:

  • Are the limits in logic and memory resources prohibitive for large-scale applications? In particular, can iterative applications whose number of iterations is high or unknown (convergence/termination criterion) make efficient use of the FPGA?
  • For parallel applications with non-trivial control flow (which can make vectorization complex on FPGA), to what extent can we compete with GPU implementations, which naturally map to SIMT execution? Is it worth the investment in time and money?

What about the performance of floating-point applications on recent FPGA cards? It would be particularly interesting to look at parallel applications such as Monte Carlo pricing of non-trivial financial products, optimization searches, or the solution of PDE problems in time and space.

Furthermore, what are the prospects for increasing the width of SIMD operations? Is vectorization truly a crucial point in FPGA programming?
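To fix ideas, kernel-level vectorization on FPGA is requested explicitly through attributes rather than obtained implicitly from the SIMT model. A minimal sketch, with an illustrative work-group size and SIMD factor:

    // Ask the compiler for an 8-wide datapath: work-items are issued in
    // groups of 8 and their memory accesses coalesced into wider ones,
    // provided control flow does not diverge between them.
    __attribute__((reqd_work_group_size(64, 1, 1)))
    __attribute__((num_simd_work_items(8)))
    __kernel void vadd_simd(__global const float * restrict a,
                            __global const float * restrict b,
                            __global float * restrict c)
    {
        int i = get_global_id(0);
        c[i] = a[i] + b[i];
    }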

Finally, we restricted ourselves to the efficiency of the kernel alone, but we would have liked to look more closely at the host-device I/O aspect (already a decisive factor in whether using a GPU is worthwhile). More generally, we would like to study the impact of the choice of architecture on an entire application, perhaps even one composed of several calls to different kernels. Indeed, data transfers and device initialization time play an increasingly important role as the kernel itself becomes more efficient.
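As a starting point for that broader study, standard OpenCL profiling events already allow transfers and kernel execution to be timed separately on the host side. A minimal C sketch, assuming an already-built queue, kernel and buffers (all names hypothetical) and a queue created with CL_QUEUE_PROFILING_ENABLE:

    #include <stdio.h>
    #include <CL/cl.h>

    /* Elapsed time of a profiled OpenCL event, in milliseconds. */
    static double event_ms(cl_event ev)
    {
        cl_ulong t0, t1;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof t0, &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof t1, &t1, NULL);
        return (double)(t1 - t0) * 1e-6;
    }

    /* Time host->device copy, kernel and device->host copy separately. */
    static void profile_round_trip(cl_command_queue q, cl_kernel k,
                                   cl_mem d_in, cl_mem d_out,
                                   const float *h_in, float *h_out,
                                   size_t bytes, size_t gsize)
    {
        cl_event ev_w, ev_k, ev_r;
        clEnqueueWriteBuffer(q, d_in, CL_FALSE, 0, bytes, h_in, 0, NULL, &ev_w);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, NULL, 1, &ev_w, &ev_k);
        clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, bytes, h_out, 1, &ev_k, &ev_r);
        printf("H2D %.3f ms | kernel %.3f ms | D2H %.3f ms\n",
               event_ms(ev_w), event_ms(ev_k), event_ms(ev_r));
    }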