Forum Discussion
I think you are referring to external memory bandwidth rather than PCI-E bandwidth. PCI-E bandwidth is determined by the physical characteristics of the PCI-E connection between your FPGA board and motherboard (number of lanes and PCI-E version), and its effective throughput depends on factors such as the size of your data transfers and the efficiency of the PCI-E driver. These are not really factors that the programmer can control.
Assuming you mean external memory bandwidth, your problem has a simple solution: use loop unrolling to vectorize your single work-item kernel. Loop unrolling not only increases the amount of computation your kernel performs per cycle, it also lets the compiler coalesce consecutive memory accesses in the loop into wider accesses, which results in better utilization of the external memory bandwidth. Loop unrolling in single work-item kernels behaves similarly to the SIMD attribute (num_simd_work_items) in NDRange kernels.
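As a rough illustration of the unrolling described above, here is a minimal sketch of a single work-item kernel with a partially unrolled loop. The kernel name vec_add, the element-wise addition, and the unroll factor of 8 are assumptions chosen for illustration, not code from this thread. The small shim at the top is also my own addition, so that the same source compiles as plain C on a host for a quick functional check.

```c
/* Host-only shim (an assumption for illustration): lets this file compile as
   plain C. An OpenCL C compiler defines __OPENCL_VERSION__ and skips it. */
#ifndef __OPENCL_VERSION__
#include <assert.h>   /* host-side checks can use assert() */
#define __kernel
#define __global
#endif

/* Single work-item kernel: no get_global_id(), just one loop over the data.
   The loop is unrolled by an assumed factor of 8, so each iteration of the
   unrolled loop touches 8 consecutive elements of a, b and c. Consecutive
   accesses like these are what the compiler can coalesce into wider
   external memory accesses. */
__kernel void vec_add(__global const float *restrict a,
                      __global const float *restrict b,
                      __global float *restrict c,
                      const int n)
{
    #pragma unroll 8
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

The restrict qualifiers tell the compiler that the buffers do not alias, which it generally needs to know before it can merge the accesses. The right unroll factor depends on your data type and the width of your board's memory interface, so treat 8 as a starting point and check the compiler report.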
Since this is the forum thread I started, I'll consolidate my 'progress' and follow-up question here. I followed the advice in your reply here: https://forums.intel.com/s/question/0D70P000006i6SySAI
I now have a version of my code working. It has a non-autorun kernel that fetches the data from memory, an autorun kernel that operates on it, and another non-autorun kernel that writes the results back to memory, all interconnected with Intel channels and accessed with blocking reads and writes.
However, when I try to add a num_compute_units attribute to my kernels (and add in the necessary code to make use of the compute unit IDs), I get this error:
Initializing OpenCL
Platform: Intel(R) FPGA SDK for OpenCL(TM)
Using 1 device(s)
EmulatorDevice : Emulated Device
Using emulator, adding '_em' to output filename
Binary filename = my_autorun_em
Using AOCX: my_autorun_em.aocx
Launching for device 0 (4 elements)
Hey, I'm comp_id 0
I'm operating with NUM_LIMBS = 4 and NUM_COMPUTE_UNITS = 1.
about to read first operand
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted
I'm guessing my other kernels aren't relevant since they haven't even been started yet. I've attached the snippet of my autorun kernel that I'm guessing is relevant. (Pardon the .c extension; amazingly, we can't upload .cl files here!) Is it obvious what I'm doing wrong?
I am using legacy emulation, by the way. When I try fast emulation, I get a segmentation fault.