Forum Discussion
Altera_Forum
Honored Contributor
12 years ago

Thanks for your response.
That is not a typo in my post: it really is 10 thousand percent utilization, which is why I was wondering how that happens with "only" 4 MB of shared memory. But your tip that the compiler may try to create hardware that can pipeline multiple work-groups seems to be correct. When I added __attribute__((task)) to my kernel, the utilization went down to 177%, which is plausible (given that BRAM is used for other things too, as you said). The utilization also goes down to 337% (that seems to be the lowest for an NDRange kernel) if I set a larger work-group size, say 64x64. For a small but greater-than-one work-group size the utilization is still way up. I don't know for sure, but the extra overhead between the 177% for a task and the 337% for a 64x64 work-group may just come from the extra hardware needed for scheduling work-items.

So it seems that your assumption is correct: the compiler tries to generate hardware that can have multiple work-groups in flight and therefore must instantiate multiple copies of local memory.

Please let me know if you have any other thoughts on this issue. E.g., could our analysis be wrong and the cause be something else? Is there a pragma that tells the compiler to compile hardware for only one work-group and not duplicate shared memory? Or can it be considered a compiler bug to compile for multiple in-flight work-groups even though that runs out of local memory?

Thanks,
Christoph
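For reference, a minimal sketch of the two variants discussed above, written in OpenCL C for the Altera SDK for OpenCL. The kernel names, sizes, and bodies are illustrative only, and attribute support varies by SDK version (later versions infer single work-item kernels automatically); reqd_work_group_size is a standard OpenCL attribute that lets the compiler size local memory and scheduling hardware for a known work-group shape:

```c
// Single work-item ("task") kernel: only one work-group is ever in
// flight, so local memory is not replicated.
__attribute__((task))
__kernel void accumulate_task(__global const float *in,
                              __global float *out,
                              int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)   // the compiler pipelines this loop
        sum += in[i];
    *out = sum;
}

// NDRange kernel: fixing the work-group size lets the compiler build
// hardware for exactly one 64x64 work-group shape rather than a
// worst-case configuration.
__attribute__((reqd_work_group_size(64, 64, 1)))
__kernel void tile_kernel(__global const float *in,
                          __global float *out)
{
    __local float tile[64][64];   // one copy per in-flight work-group
    size_t lx = get_local_id(0), ly = get_local_id(1);
    size_t gx = get_global_id(0), gy = get_global_id(1);
    tile[ly][lx] = in[gy * get_global_size(0) + gx];
    barrier(CLK_LOCAL_MEM_FENCE);  // barriers are what force the
                                   // compiler to keep whole
                                   // work-groups in flight
    out[gy * get_global_size(0) + gx] = tile[lx][ly];
}
```

Whether an attribute exists to cap the number of in-flight work-groups per compute unit (as opposed to capping compute units via num_compute_units) is exactly the open question here.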