I've written a fairly simple kernel that multiplies multi-limb operands. I've written two versions: an ndrange version and a task version. Both work fine, but the performance of the task version is significantly better.The design is quite small, and I calculated that it only uses maybe 10% of the PCIe bandwidth. I'd like, therefore, to put ~8 copies of this kernel in the FPGA instead of just one, so that I can have 8x the parallelism.With the ndrange kernel, this appears dead easy: I just put the attribute num_compute_units(8) at the top of my kernel, and my clEnqueueNDRangeKernel just works like a charm, divvying the work up among the compute units the same way it does among different devices.With the task kernel, however, only one of my compute units ever gets driven. All of my attempts to cause better behavior by adjusting global or local work group size or max or required size only cause weird and undesirable behavior. All the documentation suggests that any attempt I might make to query work item ids or use attribute-driven SIMD vectorization will result in my kernel being an ndrange, and thus non-pipelined, kernel.I want my kernel pipelined, as the performance is much better. I just want, in one FPGA, eight copies that get fed 1/8th of the data each.What's the right way to do this??(I'm using version 19.4 and 20.1.)

If you want an example of a working high-performance code with autorun kernels, you can take a look at this repository:https://github.com/zohourih/Diffusion_FPGA

I think you are referring to external memory bandwidth and not PCI-E bandwidth because PCI-E bandwidth is determined by the physical features of the PCI-E connection on your FPGA board and motherboard (number of lanes and PCI-E version) and its effective throughput is determined by multiple factors such as the size of your data transfer and the efficiency of the PCI-E driver; these are not really factors that can be controlled by the programmer.Assuming that you refer to external memory bandwidth, then your problem has a simple solution: you need to use loop unrolling to vectorize your single work-item kernel. Loop unrolling will not only increase the amount of computation that is done by your kernel per cycle, it will also lead to consecutive memory accesses in your loop being coalesced into larger accesses by the compiler which will result in better utilization of the external memory bandwidth. Loop unrolling in single work-item kernels gives a behavior similar to the SIMD attribute in NDRange kernels.

Since this is the forum thread I started, I'll consolidate my 'progress' and followup question here. I followed your advice in your reply here: https://forums.intel.com/s/question/0D70P000006i6SySAI I now do have a version of my code working, which has a non-autorun kernel that fetches the data out of memory, an autorun kernel that operates on it, and another non-autorun kernel that writes the results back to memory, all interconnected with Intel channels and access with blocking reads and writes.However, when I try to add a num_compute_units attribute to my kernels (and add in the necessary code to make use of the compute unit IDs), I get this error:Initializing OpenCLPlatform: Intel(R) FPGA SDK for OpenCL(TM)Using 1 device(s) EmulatorDevice : Emulated DeviceUsing emulator, adding '_em' to output filenameBinary filename = my_autorun_emUsing AOCX: my_autorun_em.aocxLaunching for device 0 (4 elements)Hey, I'm comp_id 0I'm operating with NUM_LIMBS = 4 and NUM_COMPUTE_UNITS = 1.about to read first operandterminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_allocAbortedI'm guessing my other kernels aren't relevant since they haven't even been started yet. I've attached the code snippet of my autorun kernel I'm guessing is relevant. (Pardon the .c extension, amazingly, we can't upload .cl files here!) Is it obvious what I'm doing wrong?I am using legacy emulation, by the way. When I try fast emulation, I get a seg fault.

Well, things have gotten weirder.The error I posted above is coming from legacy emulation on my local compute server using aoc 20.1.We don't have fast emulation working yet on our local server.On the dev cloud, where can use fast emulation, I'm uniformly getting a seg fault, as I described above. In addition, I noticed today that one of my kernels reports a compute ID of -1 a few print statements before the seg fault.However, also on the dev cloud, if I use legacy emulation with num_compute_units=1, it works. My program runs and declares PASS on valid answers.If I run on the dev cloud with num_compute_units=2, legacy emulation mostly works, but it hangs without completing. Fast emulation seg faults as before.So anyhow. I'm going to hunt for more clues.

I am not sure if this is an artifact in your snippet but it seems you are reading from and writing to channels with the same ID as the compute unit (albeit with different channel names). One would typically read from a previous compute unit and write to a following one, you should not read from and write to the same channel ID in the same compute unit. Another potential pitfall is channel ordering; the compiler will freely re-order channel operations and if there is a cycle of channels in your design, you can potentially run into a dead-lock unless you enforce channel re-ordering using barriers as described in Intel's documentation. Finally, it seems v20.1 is quite problematic based on reports from you and other people in the forum, you might want to consider switching to v19.4 on your local machine and see if you would run into the same problems. Assuming that you can create a minimal example that reproduces the issue and post it here, it will be easier to find potential issues in the code.

OpenCL: What's best practice for replicating a task kernel in-FPGA?

11 Replies

HRZ
Frequent Contributor
5 years ago
I think you are referring to external memory bandwidth and not PCI-E bandwidth because PCI-E bandwidth is determined by the physical features of the PCI-E connection on your FPGA board and motherboard (number of lanes and PCI-E version) and its effective throughput is determined by multiple factors such as the size of your data transfer and the efficiency of the PCI-E driver; these are not really factors that can be controlled by the programmer.
Assuming that you refer to external memory bandwidth, then your problem has a simple solution: you need to use loop unrolling to vectorize your single work-item kernel. Loop unrolling will not only increase the amount of computation that is done by your kernel per cycle, it will also lead to consecutive memory accesses in your loop being coalesced into larger accesses by the compiler which will result in better utilization of the external memory bandwidth. Loop unrolling in single work-item kernels gives a behavior similar to the SIMD attribute in NDRange kernels.
- JSchr20
  Occasional Contributor
  5 years ago
  Since this is the forum thread I started, I'll consolidate my 'progress' and followup question here. I followed your advice in your reply here: https://forums.intel.com/s/question/0D70P000006i6SySAI
  I now do have a version of my code working, which has a non-autorun kernel that fetches the data out of memory, an autorun kernel that operates on it, and another non-autorun kernel that writes the results back to memory, all interconnected with Intel channels and access with blocking reads and writes.
  However, when I try to add a num_compute_units attribute to my kernels (and add in the necessary code to make use of the compute unit IDs), I get this error:
  Initializing OpenCL
  Platform: Intel(R) FPGA SDK for OpenCL(TM)
  Using 1 device(s)
  EmulatorDevice : Emulated Device
  Using emulator, adding '_em' to output filename
  Binary filename = my_autorun_em
  Using AOCX: my_autorun_em.aocx
  Launching for device 0 (4 elements)
  Hey, I'm comp_id 0
  I'm operating with NUM_LIMBS = 4 and NUM_COMPUTE_UNITS = 1.
  about to read first operand
  terminate called after throwing an instance of 'std::bad_alloc'
  what(): std::bad_alloc
  Aborted
  I'm guessing my other kernels aren't relevant since they haven't even been started yet. I've attached the code snippet of my autorun kernel I'm guessing is relevant. (Pardon the .c extension, amazingly, we can't upload .cl files here!) Is it obvious what I'm doing wrong?
  I am using legacy emulation, by the way. When I try fast emulation, I get a seg fault.
  multi_compute_unit_autorun_snippet.cl.c1 KB
- JSchr20
  Occasional Contributor
  5 years ago
  Well, things have gotten weirder.
  The error I posted above is coming from legacy emulation on my local compute server using aoc 20.1.
  We don't have fast emulation working yet on our local server.
  On the dev cloud, where can use fast emulation, I'm uniformly getting a seg fault, as I described above. In addition, I noticed today that one of my kernels reports a compute ID of -1 a few print statements before the seg fault.
  However, also on the dev cloud, if I use legacy emulation with num_compute_units=1, it works. My program runs and declares PASS on valid answers.
  If I run on the dev cloud with num_compute_units=2, legacy emulation mostly works, but it hangs without completing. Fast emulation seg faults as before.
  So anyhow. I'm going to hunt for more clues.
  - HRZ
    Frequent Contributor
    5 years ago
    I am not sure if this is an artifact in your snippet but it seems you are reading from and writing to channels with the same ID as the compute unit (albeit with different channel names). One would typically read from a previous compute unit and write to a following one, you should not read from and write to the same channel ID in the same compute unit. Another potential pitfall is channel ordering; the compiler will freely re-order channel operations and if there is a cycle of channels in your design, you can potentially run into a dead-lock unless you enforce channel re-ordering using barriers as described in Intel's documentation. Finally, it seems v20.1 is quite problematic based on reports from you and other people in the forum, you might want to consider switching to v19.4 on your local machine and see if you would run into the same problems. Assuming that you can create a minimal example that reproduces the issue and post it here, it will be easier to find potential issues in the code.
AnilErinch_A_Intel
Frequent Contributor
5 years ago
Hi ,
Would recommend to try out the code in v19.4 as mentioned in previous answer .
Please refer to the programming guide for some more clarification regarding recommended practices.
Please adhere to the statements in section
5.4.4. Restrictions in the Implementation of Intel FPGA SDK for OpenCL Channels Extension
like the one below.
"Performance of a kernel that has multiple accesses (reads or writes) to the same channel might be poor."
Also please refer to the following thread for some rules about multiple channel writes.
https://forums.intel.com/s/question/0D50P00003yyMFISA2/compiler-error-multiple-channel-write-sites
For implementing fence also please refer to
https://www.intel.com/content/dam/altera-www/global/en_US/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf
Please let us know , once you have rewritten the code following the guidelines.
Thanks and Regards
Anil
JSchr20
Occasional Contributor
5 years ago
I still don't have this code working, but I'm stuck at a "higher level", at least. I now have only one autorun kernel with num_compute_units > 1, and the ndrange kernels on either end that feed it data and gather results are singletons that distribute the appropriate slices of the incoming (now num_compute_units times as big) work items to the various channels (or gather them at the other end). This code now works for few work items, but not for many. Mem and channel fences haven't helped. Simulation and hardware builds fail without useful messages. If I exhaust my debug avenues or if I find a solution, I'll post here again.
- HRZ
  Frequent Contributor
  5 years ago
  If you want an example of a working high-performance code with autorun kernels, you can take a look at this repository:
  https://github.com/zohourih/Diffusion_FPGA
  - JSchr20
    Occasional Contributor
    5 years ago
    Thank you very much! I will check it out.
AnilErinch_A_Intel
Frequent Contributor
5 years ago
Hi ,
Glad to see that the issue is resolved, and the discussions are encouraging.
Thanks and Regards
Anil

Forum Discussion

OpenCL: What's best practice for replicating a task kernel in-FPGA?

11 Replies

Recent Discussions

Agilex 7 FPGA Starter Kit with oneAPI Toolkit flow not detected over PCIe

MCTP over PCIe VDM routing to PMCI in OFS N6000 FIM configuration and datapath clarification

HLS Compiler 24.1 error - aocl-clang.exe - dll entry point not found

Error faced while executing on Agilex FPGA board....

AI Suite System Throughput Issue