Forum Discussion
I have calculated the memory throughput, it turns out I am able to utilize 50% bandwidth which is measured . for example, in my case I am achieving 12GB/s of bandwidth and measuring the throughput I get ~ 50% efficiency which in fact aligns with memory efficiency I get from the profiler. But theoretical external memory bandwidth possible is 19GB/s of which I am able to utilize only 30%. I am not sure how to improve on that factor, I have 8 read and 8 write ports so there should be minimal memory contention?
Do you have any suggestions?
Also when i keep the indexing simpler like accessing smaller data , I get small stall rate but if my indexing is bit complex, stall rate increasing although each time I am accessing new data. For example data[ a* b+c] has much less stall rate as compared to data[ a* b + c*d + e*f + g], do you know any was to improve that because for my kernel I need a bit complex indexing?
- HRZ6 years ago
Frequent Contributor
The complexity of indexing should have no effect since index caluclation is pipelined. What can make a large difference is alignment; the starting address of the wide accesses must be at least 256-bit aligned or else effective throughput will be reduced. The difference you are seeing could also be due to difference in operating frequency of your kernels. What is the operating frequency with/without profiling enabled?
I have measured the memory controller efficiency to be around 85-90% (i.e. peak practical throughput for you will be 16-17 GB/s). In your case, since your board has only one memory bank, full performance can be achieved if your kernel has *only one 512-bit access* (either read or write) but such kernel will have little practical usecase. Two 256-bit accesses (plus another for mask) will always be slower than that. You can try increasing the vector size to 16 which might help increasing memory throughout (but also stall rate) by oversubscribing the memory interface, especially if your kernel is running below 266 MHz.