Forum Discussion

ADua0
Occasional Contributor
6 years ago

Intel OpenCL Dynamic Profiler report

I am looking at the report generated by compiling my OpenCL kernel for an Intel FPGA. In that report, the average write burst and read burst are measured as 1, while the optimal possible value is 16. I believe larger read bursts would help improve the design, but I am not sure how to achieve them. Any suggestions would be really helpful.

19 Replies

  • HRZ
    Frequent Contributor

    If the compiler report says the loop is pipelined, then index calculation is also pipelined.

    70-80% is optimal for realistic cases based on my own testing. 85-90% is for perfect cases (one 512-bit access per memory bank with interleaving disabled).

    If you do not see much difference in operating frequency without profiling, then performance difference will also be minimal. The profiler does not change loop II as far as I have seen and hence, the only factor causing performance difference between enabling profiling or not should be operating frequency.
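To put a rough number on the point above: if profiling leaves the loop II unchanged, the cycle count stays the same and the runtime scales only with the operating frequency. The sketch below is a minimal model under that assumption. The trip count, latency, and both frequencies are hypothetical values and not figures from this thread, and the model ignores memory stalls.

```python
def pipelined_runtime_s(trip_count, ii, latency_cycles, fmax_mhz):
    """Rough runtime of a pipelined loop: pipeline fill latency plus
    one new iteration every II cycles, divided by the clock frequency."""
    cycles = latency_cycles + ii * trip_count
    return cycles / (fmax_mhz * 1e6)

# Hypothetical numbers, chosen only for illustration.
TRIP_COUNT = 100_000_000
II = 1            # assumed identical with and without profiling
LATENCY = 400     # pipeline depth in cycles, negligible for long loops

without_profiling = pipelined_runtime_s(TRIP_COUNT, II, LATENCY, fmax_mhz=250.0)
with_profiling = pipelined_runtime_s(TRIP_COUNT, II, LATENCY, fmax_mhz=240.0)

# Same cycle count in both cases, so the slowdown is just the frequency ratio.
slowdown = with_profiling / without_profiling
print(f"without profiling: {without_profiling:.4f} s")
print(f"with profiling:    {with_profiling:.4f} s")
print(f"slowdown:          {slowdown:.3f}x")
```

Under this model a 4% drop in operating frequency gives roughly a 4% longer runtime, and nothing else changes.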

    Previously you had 14x 512-bit accesses which was excessive. One bank of DDR should saturate with a total of 512 bits read/written per iteration but the memory controller/interface is far from optimal. I have experienced that oversubscribing the memory interface (by 2x and not 14x) can sometimes improve the performance a little bit, but certainly not as much as it would increase resource usage.
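The arithmetic behind "one bank saturates with 512 bits per iteration" can be sketched as follows. The memory type (one DDR4-2400 bank with a 64-bit bus) and the 300 MHz kernel clock are assumptions picked for illustration, as is the simplification that all accesses hit the same bank. Substitute the values for your own board.

```python
def bandwidth_gb_s(bits_per_transfer, transfers_per_second):
    """Bandwidth in GB/s (10^9 bytes/s) for a given width and transfer rate."""
    return bits_per_transfer / 8 * transfers_per_second / 1e9

# Assumption: one DDR4-2400 bank with a 64-bit data bus (2400 mega-transfers/s).
bank_peak = bandwidth_gb_s(64, 2400e6)

# Assumption: the kernel runs at 300 MHz and issues one 512-bit access per cycle.
one_access = bandwidth_gb_s(512, 300e6)

# The 70-80% range quoted above, applied to the assumed peak.
realistic = (0.70 * bank_peak, 0.80 * bank_peak)

for n_accesses in (1, 2, 14):
    demand = n_accesses * one_access
    print(f"{n_accesses:2d} x 512-bit: demand {demand:6.1f} GB/s, "
          f"{demand / bank_peak:4.1f}x the bank's peak")
print(f"realistic achievable range: {realistic[0]:.2f} to {realistic[1]:.2f} GB/s")
```

With these assumed numbers a single 512-bit access per cycle already requests the bank's full theoretical bandwidth, so 14 such accesses request about 14 times what the bank can deliver.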

  • ADua0
    Occasional Contributor

    Thanks for the reply. In my case I am restricting the design to 8 read and write ports, as I get better performance that way. However, I am not able to achieve high write bandwidth: I get very poor write performance, with memory bandwidth of a few hundred MB/s, whereas for reading I get around 12 GB/s. Is there any optimization technique you can suggest to improve my write bandwidth?

    Also, although I get around 12 GB/s of read bandwidth, the profiler describes it as 60% efficient. Do you know what that could mean?

  • ADua0
    Occasional Contributor

    My question is related to this topic, which is why I am asking here. My OpenCL design has a channel between 2 kernels. In the generated report, the start cycle of the channel write is 40 and the start cycle of the channel read is 4. This leads to stalling at the read end, and it affects performance because the channel read stalls 50% of the time. However, I read from the channel the same number of times as I write to it, so the two should be balanced. Do you have any suggestions on how to improve this?

  • HRZ
    Frequent Contributor

    Channels are pretty much never a source of bottleneck. Channel stalls are pretty much always stalls propagated from the top of the pipeline, very likely from global memory operations. Since the operations are pipelined, it doesn't matter what the start cycle of an operation is, because the start latency will be amortized if the pipeline is kept busy for long enough. The stall rates for channel reads and writes do not need to be balanced. In fact, they definitely won't be balanced if the source of the stall is somewhere else. For example, if the source of the stall is global memory, channel writes will never stall since the channel never becomes full, but channel reads will stall since the channel is faster than global memory and frequently becomes empty, waiting for data to arrive from global memory. The behavior you are seeing is completely normal. The document I posted above also describes how to find the source of stalls in a kernel.
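The asymmetry described above can be reproduced with a toy cycle-by-cycle model. This is only a sketch under simplifying assumptions, not how the profiler or the hardware actually counts stalls: global memory delivers one word every `mem_period` cycles, the consumer kernel tries to read the channel every cycle, and the channel is a FIFO of depth `depth`. All names and numbers are hypothetical.

```python
from collections import deque

def simulate(cycles, mem_period, depth):
    """Return (write_stall_rate, read_stall_rate) for a channel fed by a
    producer that only has data when global memory delivers it."""
    fifo = deque()
    pending = False  # producer holds a word it has not yet pushed to the channel
    write_attempts = write_stalls = 0
    read_attempts = read_stalls = 0

    for cycle in range(cycles):
        # Consumer kernel: attempts one channel read every cycle.
        read_attempts += 1
        if fifo:
            fifo.popleft()
        else:
            read_stalls += 1  # channel empty, still waiting on memory

        # Global memory hands the producer one word every mem_period cycles.
        if cycle % mem_period == 0 and not pending:
            pending = True

        # Producer kernel: attempts a channel write whenever it holds a word.
        if pending:
            write_attempts += 1
            if len(fifo) < depth:
                fifo.append(cycle)
                pending = False
            else:
                write_stalls += 1  # channel full, retry next cycle

    return write_stalls / write_attempts, read_stalls / read_attempts

# Memory supplies data at half the rate the consumer could accept it.
write_rate, read_rate = simulate(cycles=10_000, mem_period=2, depth=8)
print(f"channel write stall rate: {write_rate:.0%}")
print(f"channel read stall rate:  {read_rate:.0%}")
```

In this model the write side never stalls and the read side stalls on half of its attempts, even though the number of successful reads and writes is the same. The stall is reported at the channel read but originates at the memory side.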