Forum Discussion
Please consider the attached image. It shows the profile report for two Intel FPGA OpenCL SDK 17.1 Bittware 385A designs.
In both, data is transferred between two kernels using external channels: 'krnlA_send' sends to 'krnlB_recv' and 'krnlB_send' sends to 'krnlA_recv'.
On the left, you have 'float4' (128b) external channels. On the right, 'float8' (256b) external channels. Both reach almost the same fmax: 308 MHz (float4) and 294.55 MHz (float8).
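For context, the channel setup looks roughly like the following minimal sketch (not the actual design: the io() port names are placeholders, not necessarily the entries of the 385A BSP's board_spec.xml, and the kernel bodies are reduced to plain copy loops; the float8 variant only swaps the vector type):

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

// External (I/O) channels; "kernel_output_ch0"/"kernel_input_ch0" are
// placeholder port names used only for this sketch.
channel float4 ch_a_to_b __attribute__((io("kernel_output_ch0")));  // krnlA -> krnlB
channel float4 ch_b_to_a __attribute__((io("kernel_input_ch0")));   // krnlB -> krnlA

__kernel void krnlA_send(__global const float4 *restrict src, uint n) {
    for (uint i = 0; i < n; i++)
        write_channel_intel(ch_a_to_b, src[i]);   // one 128b word per cycle
}

__kernel void krnlA_recv(__global float4 *restrict dst, uint n) {
    for (uint i = 0; i < n; i++)
        dst[i] = read_channel_intel(ch_b_to_a);   // one 128b word per cycle
}
```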
As you can see, the bandwidth measurement (last column) shows the same values in both cases: 4900+ MByte/s, very close to the peak performance of 5000 MByte/s (40 Gbit/s). So, there is no issue on this point.
The main difference is in the 'Stall%' and 'Occupancy%' columns.
In the 'float4' case, there is no stall, i.e. we get full occupancy. This means there are no bubbles in the pipeline: every clock cycle is actively used for computation.
On the other hand, for 'float8', 'Stall%' is 47%. This is expected: given a bandwidth of 40 Gbit/s, an Intel channel whose datatype is 256b wide can transfer data at most at 156 MHz.
Considering that the actual frequency of the design is 294 MHz, the achievable 'Occupancy%' is at most (156 MHz / 294 MHz) ≈ 53%.
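Spelled out explicitly, the bound is just the arithmetic above (a minimal host-side C check, values taken from the numbers quoted in this post):

```c
#include <stdio.h>

/* Quick check of the occupancy bound for the float8 case:
   40 Gbit/s link, 256-bit channel, 294 MHz kernel fmax. */
int main(void) {
    const double link_bw_bits = 40e9;    /* 40 Gbit/s external link         */
    const double channel_bits = 256.0;   /* float8 external channel width   */
    const double f_kernel_hz  = 294e6;   /* reported kernel fmax            */

    double f_channel_max = link_bw_bits / channel_bits;  /* ~156.25 MHz     */
    double occupancy_max = f_channel_max / f_kernel_hz;  /* ~0.53, i.e. 53% */

    printf("channel-limited rate: %.2f MHz\n", f_channel_max / 1e6);
    printf("max occupancy:        %.0f %%\n", occupancy_max * 100.0);
    return 0;
}
```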
Sorry for the long preamble, but it is necessary to share my point of view with you.
Defining channels with a smaller width lets me achieve a higher 'Occupancy%' when I don't need/want to process 256b (over the external channels) per clock cycle.
This behavior was correctly implemented in the Intel FPGA OpenCL SDK 17.1 BSP for Bittware 385A.
If I'm forced to process 256b, the resulting 'Stall%' will propagate along the pipeline and will affect the other computations/memory accesses.
Moreover, for complex designs, widening the data path to 256b is not always feasible in terms of area utilization.
I know that I could write workarounds in my code to overcome these issues, but this would introduce drawbacks that were not present previously.
Thank you,
Paolo
- PGorl1 (New Contributor), 6 years ago, posted a file.
- HRZ (Frequent Contributor), 6 years ago:
> If I'm forced to process 256b, the resulting 'Stall%' will propagate along the pipeline and will affect the other computations/memory accesses.
This is a perfectly valid point to explain why "padding" the data to match the physical width is not a good idea. You will be essentially limiting your kernel "throughput" to the throughput of the I/O channel, even if you don't need to fully utilize the channel throughput.
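For example, a hypothetical "padded" sender (not Paolo's actual code; the io() port name is again a placeholder) would look like the sketch below, and its write rate, hence the issue rate of the whole loop and anything fed by it, is capped by the 256b channel:

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float8 ch_a_to_b __attribute__((io("kernel_output_ch0")));  // placeholder port name

__kernel void krnlA_send_padded(__global const float4 *restrict src, uint n) {
    for (uint i = 0; i < n; i++) {
        float8 word = (float8)(0.0f);
        word.lo = src[i];                        // useful 128b
        // word.hi stays zero: wasted 128b of padding
        write_channel_intel(ch_a_to_b, word);    // capped at the channel's
                                                 // ~156 MHz rate; the stall
                                                 // back-pressures the rest
                                                 // of the pipeline
    }
}
```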