Altera_Forum
Honored Contributor
12 years ago
Slowdown when increasing the number of compute units
Hi,
I've recently started experimenting with increasing the number of compute units for a particular kernel. I've previously had some good results using manual vectorisation to reduce runtime. However, when I increase the number of compute units using __attribute__((num_compute_units(2))), the kernel gets about 30% slower.

If the kernel were bottlenecked on global memory bandwidth, I was expecting performance to stay the same at worst (is this a valid assumption?). However, I don't think it is fully saturating global memory bandwidth, because increasing the manual vectorisation level further still improves performance. I'd therefore be grateful for any help with what might be causing the slowdown.

Looking at the logic utilisation of both kernels, the one with 2 compute units does use more logic: 54% vs 39%. I also notice that when querying the device, the maximum number of compute units (CL_DEVICE_MAX_COMPUTE_UNITS) is returned as 1. The device in question is a pcie385n_d5 from Nallatech.

My questions:

1. Does the maximum number of compute units reported by the device refer to the same thing as "num_compute_units"?
2. Is there any way to confirm that the compiler is actually creating more compute units?
3. Could the work-group size that I launch the kernel with have an effect on performance, e.g. is there a minimum size I should specify for this sort of setup? I'm wondering if this could affect how the 2 compute units access memory, e.g. by causing bank conflicts or similar.
4. Could specifying larger work-group sizes help?

Many thanks
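
To make the work-group question concrete, here is a minimal host-side sketch in plain C. It relies only on the OpenCL execution model, in which each work-group runs entirely on one compute unit, so a launch with fewer work-groups than compute units cannot keep them all busy. The function names, the kernel name in the comment, and the sizes used below are hypothetical and are not taken from the actual code; this is a sanity check on the launch configuration, not an explanation of the slowdown.

```c
#include <stddef.h>

/* Kernel side (OpenCL C, compiled offline), shown only as a comment.
 * The kernel name and arguments are hypothetical:
 *
 *   __attribute__((num_compute_units(2)))
 *   __kernel void my_kernel(__global const float *in, __global float *out);
 */

/* Number of work-groups in a 1-D NDRange.
 * Returns 0 if local_size is 0 or does not divide global_size evenly. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    if (local_size == 0 || global_size % local_size != 0)
        return 0;
    return global_size / local_size;
}

/* Upper bound on how many compute units can be busy at the same time,
 * assuming each work-group is dispatched whole to a single compute unit. */
size_t max_busy_compute_units(size_t work_groups, size_t compute_units)
{
    return work_groups < compute_units ? work_groups : compute_units;
}
```

For example, a global size of 1024 with a local size of 1024 gives a single work-group, so at most one of the two compute units could be doing work; a local size of 256 gives four work-groups, enough for both to be occupied.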