Forum Discussion

Occasional Contributor

6 years ago

OpenCL: What's best practice for replicating a task kernel in-FPGA?

I've written a fairly simple kernel that multiplies multi-limb operands. I've written two versions: an ndrange version and a task version. Both work fine, but the performance of the task version is s...

HRZ

Frequent Contributor

6 years ago

I think you are referring to external memory bandwidth rather than PCI-E bandwidth. PCI-E bandwidth is determined by the physical features of the PCI-E connection between your FPGA board and motherboard (number of lanes and PCI-E version), and its effective throughput depends on multiple factors such as the size of your data transfers and the efficiency of the PCI-E driver. These are not really factors that the programmer can control.

Assuming that you mean external memory bandwidth, your problem has a simple solution: use loop unrolling to vectorize your single work-item kernel. Loop unrolling not only increases the amount of computation your kernel does per cycle; it also lets the compiler coalesce consecutive memory accesses in the loop into larger accesses, which results in better utilization of the external memory bandwidth. Loop unrolling in single work-item kernels behaves much like the SIMD attribute in NDRange kernels.
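As a rough illustration of the unrolling advice, here is a minimal sketch of a single work-item kernel with an unrolled loop. The kernel name `vec_add`, its arguments, and the unroll factor of 8 are assumptions made for the example; it is a plain element-wise add, not the multi-limb multiply from the question. The `#ifndef __OPENCL_VERSION__` shim is also an assumption, there only so the same source can be built as ordinary C for a quick host-side check.

```c
/* Host-side shim (assumption, for testing only): lets this kernel
 * also build as plain C, where the OpenCL qualifiers do not exist. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

/* Hypothetical element-wise kernel used only to show the pattern. */
__kernel void vec_add(__global const int *restrict a,
                      __global const int *restrict b,
                      __global int *restrict c,
                      int n)
{
    /* The kernel never queries work-item IDs, so it is written as a
     * single work-item (task) kernel. The factor 8 is an assumed value;
     * it should be tuned to the width of the external memory interface. */
    #pragma unroll 8
    for (int i = 0; i < n; i++) {
        /* Consecutive indices give the compiler the chance to merge
         * the unrolled accesses into wider memory requests. */
        c[i] = a[i] + b[i];
    }
}
```

The accesses here are consecutive in `i`, which is what makes coalescing possible; a loop with strided or data-dependent indices would not necessarily benefit in the same way.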

JSchr20

Occasional Contributor

6 years ago

Well, things have gotten weirder.

The error I posted above is coming from legacy emulation on my local compute server using aoc 20.1.

We don't have fast emulation working yet on our local server.

On the dev cloud, where I can use fast emulation, I'm uniformly getting a seg fault, as I described above. In addition, I noticed today that one of my kernels reports a compute ID of -1 a few print statements before the seg fault.

However, also on the dev cloud, if I use legacy emulation with num_compute_units=1, it works. My program runs and declares PASS on valid answers.

If I run on the dev cloud with num_compute_units=2, legacy emulation mostly works, but it hangs without completing. Fast emulation seg faults as before.

So anyhow. I'm going to hunt for more clues.

HRZ

Frequent Contributor

6 years ago

I am not sure whether this is an artifact of your snippet, but it seems you are reading from and writing to channels with the same ID as the compute unit (albeit with different channel names). One would typically read from the previous compute unit and write to the following one; you should not read from and write to the same channel ID in the same compute unit.

Another potential pitfall is channel ordering. The compiler will freely reorder channel operations, and if there is a cycle of channels in your design, you can run into a deadlock unless you enforce the order of channel operations using barriers, as described in Intel's documentation.

Finally, v20.1 seems quite problematic based on reports from you and other people in the forum. You might want to switch to v19.4 on your local machine and see whether you run into the same problems. If you can create a minimal example that reproduces the issue and post it here, it will be easier to find potential issues in the code.
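The neighbour-to-neighbour channel pattern described above could look roughly like the sketch below, which is modelled on the chained compute unit pattern from Intel's channel documentation. The kernel names, the channel array `link`, and `NUM_CU` are hypothetical; the sketch assumes the `cl_intel_channels` extension and autorun kernels, and it has not been checked against the code from this thread.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define NUM_CU 2                      /* assumed number of compute units */

/* One more channel than compute units: link[0] feeds the first unit,
 * link[NUM_CU] carries the result out of the last one. */
channel int link[NUM_CU + 1];

/* Hypothetical feeder: pushes input data into the head of the chain. */
__attribute__((max_global_work_dim(0)))
__kernel void feeder(__global const int *restrict in, int n)
{
    for (int i = 0; i < n; i++)
        write_channel_intel(link[0], in[i]);
}

/* Replicated stage: unit k reads link[k] and writes link[k + 1],
 * so no unit reads and writes the same channel ID. */
__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
__attribute__((num_compute_units(NUM_CU)))
__kernel void stage(void)
{
    const int id = get_compute_id(0);
    int v = read_channel_intel(link[id]);
    /* Fence between channel calls; only needed where the relative
     * order of channel operations matters, e.g. in cyclic designs. */
    mem_fence(CLK_CHANNEL_MEM_FENCE);
    write_channel_intel(link[id + 1], v + 1);
}

/* Hypothetical drain: collects results from the tail of the chain. */
__attribute__((max_global_work_dim(0)))
__kernel void drain(__global int *restrict out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = read_channel_intel(link[NUM_CU]);
}
```

This chain has no cycle, so the fence is not strictly required here; it is included only to show where such a call would go.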
