Hi, BadOmen.
Thank you for your reply!
--- Quote Start ---
Without seeing the algorithm I can't say for certain it's the __private memory causing this, but it's a possibility.
--- Quote End ---
If that's the reason, would it help to use local memory instead of private memory? I understand that private variables are stored in registers, but how is local memory actually implemented? Does it also use FFs to store variables?
--- Quote Start ---
Your method of debugging the issue is sound; just keep in mind that if you break up your kernel into pieces, the footprints of those pieces will not necessarily add up to the footprint of the kernel as a whole.
--- Quote End ---
I'll keep trying. Does the sum of the pieces tend to be bigger or smaller than the footprint of the kernel as a whole?
--- Quote Start ---
Are you using a fixed work-group size, or do you know how large your work-group size will need to be? The compiler assumes a work-group size of 256, so if you don't need one that large, or you know it's going to be a fixed size, you can specify attributes to let the compiler know this. Often the compiler will create smaller hardware with hints like these.
--- Quote End ---
I didn't specify a work-group size or use the local work-item ID in my kernel, so I assume the work-group size was 256. I followed your advice and set the attribute as
__attribute__((reqd_work_group_size(1,1,1)))
However, the estimated logic utilization doesn't change. Can you tell me how this attribute affects the hardware usage? It doesn't replicate compute units or do anything like that, as num_compute_units and num_simd_work_items do, right?
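To show where I put the attribute, here is a minimal sketch. The kernel name, arguments and body are hypothetical placeholders, not my real kernel; only the attribute line is what I actually added.

```c
// Hypothetical kernel used only to show the attribute placement.
__kernel
__attribute__((reqd_work_group_size(1,1,1)))  // each work-group holds exactly one work-item
void scale(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);   // only the global ID is used, no local ID
    out[gid] = 2.0f * in[gid];
}
```

As far as I understand the OpenCL spec, the host then has to enqueue the kernel with a local work size of (1,1,1) as well, otherwise the enqueue fails.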
Besides, I've also tried the following methods:
1. resource-driven optimization (-O3)
2. __attribute__((num_share_resources(16)))
Neither of these worked. Maybe the source really does require that many resources.
Thanks for your reply again.