I am looking at my report generated by compiling my opencl kernel for intel fpga . So in that report I am see average write burst and read burst measured as 1 but optimal possible is 16. I believe having more read burst will help improving the design but not sure how to do it. Any suggestions will be really helpful

Unless you post your kernel code, it would be difficult to recommend ways to improve the memory throughput just based on profiling results.

Okay , I am sharing the part where I am reading from the memory and writing to the memory. If from that you can give me some suggestions it will be helpful. I can share full code if you need though , just sharing this part as code is very lengthy

Can you also post your kernel header/variable declarations? I am not sure which buffer is local and which is global. Note that the system viewer tab in the HTML report provides invaluable details regarding the memory ports and their length which can help considerably in debugging memory performance.

Here are my header declarations

Unless I am missing something here, you should get one 512-bit read port for input and one 512-bit write port for output and one 128-bit port for mask. Assuming that your board has two banks of DDR3/4 memory, this should give you relatively good memory throughput. Can you check the system viewer tab in the HTML report to make sure the compiler is correctly coalescing the accesses? You can also archive the "reports" folder and post it here and I will take a look at it.P.S. Can you post your board model?

Intel opencl Dynamic profiler report | Altera Community

19 Replies

HRZ
Frequent Contributor
6 years ago
If the compiler report says the loop is pipelined, then index calculation is also pipelined.
70-80% is optimal for realistic cases based on my own testing. 85-90% is for perfect cases (one 512-bit access per memory bank with interleaving disabled).
If you do not see much difference in operating frequency without profiling, then performance difference will also be minimal. The profiler does not change loop II as far as I have seen and hence, the only factor causing performance difference between enabling profiling or not should be operating frequency.
Previously you had 14x 512-bit accesses which was excessive. One bank of DDR should saturate with a total of 512 bits read/written per iteration but the memory controller/interface is far from optimal. I have experienced that oversubscribing the memory interface (by 2x and not 14x) can sometimes improve the performance a little bit, but certainly not as much as it would increase resource usage.
ADua0
Occasional Contributor
6 years ago
Thanks for the reply, So for my case I am restricting to 8 read and write ports as I am getting better performance from that. But for writing I am not able to achieve high bandwidth , I am getting very poor writing performance with memory bandwidth in few 100's MB/s but for reading I am getting around 12GB/s of bandwidth, is there any optimization technique you can suggest to improve upon my writing bandwidth?
Also for reading although I get that 12GB/s of bandwidth but profiler gives the description of 60% efficient , do you know what that could mean?
HRZ
Frequent Contributor
6 years ago
Check this part of the guide where it is explained how to interpret the information obtained from the profiler:
https://www.intel.com/content/www/us/en/programmable/documentation/mwh1391807516407.html#ewa1399325588941
ADua0
Occasional Contributor
6 years ago
My question aligns with this topic only, so that's why I am asking here. For my OpenCl design i have channel between 2 kernels. So for read and write channel I see in the report generated that my write cycle starts is 40 and read cycle is 4 that leads to stalling at read end and it is affecting the performance as it stalls because of channel is 50%. But I am reading same number of time as writing is , so it should be balanced as such. Do you have any suggestions on how to improve on the end?
HRZ
Frequent Contributor
6 years ago
Channels are pretty much never a source of bottleneck. Channel stalls pretty much always are stalls propagated from the top of the pipeline, very likely global memory operations. Since the operations are pipelined, it doesn't matter what the start cycle of the operation is since the start latency will be amortized if the pipleine is kept busy for long enough. Stall rate for channel reads and writes do not need to be balanced. In fact, they definitely won't be balanced if the source of the stall is somewhere else. e.g. if the source of the stall is globall memory, channels writes will never stall sine the channel never becomes full, but channel reads will stall since the channel is faster than globabl memory and it becomes empty frequently, waiting for data to arrive from global memory. The behavior you are seeing is completely normal. The document I posted above also describes how to find the source of stalls in a kernel.

Forum Discussion

Intel opencl Dynamic profiler report

19 Replies

Recent Discussions

AI Suite System Throughput Issue

Agilex 7 I-Series "aocl diagnose acl0" error following OFS

HLS Compiler 24.1 error - aocl-clang.exe - dll entry point not found

How Do I get the License for HLS?

Deprecation Notice for FPGA Support Package for oneAPI DPC++/C++. What is the alternative?