Altera_Forum
Honored Contributor
7 years ago
Replication of single work item kernel to increase the performance
Hi Everyone,
I have a single work-item kernel that uses few resources on the board. I now want to replicate the same kernel 2 or 4 times on the board to improve performance. The kernel has a for loop that runs 2400 times, and I want to divide that loop across two or four compute units so that each CU runs 1200 or 600 iterations.

NOTE: I can't use an NDRange kernel because of dependencies in my loop.

I have explored the following options from the Intel programming guide.

1. num_compute_units: I increased the number of compute units from 1 to 2 for the single work-item kernel. Resource usage increased, but there was no improvement in performance. Later I found a forum post saying that "a single work-group kernel (i.e. no local_id in the kernel) will not at all benefit from num_compute_units, which is probably the reason why the original poster could not achieve any performance improvement." Link: https://www.alteraforum.com/forum/showthread.php?t=51783&highlight=num_compute_units

2. But the Altera programming guide says "You can replicate your single work-item OpenCL kernel by including the num_compute_units(X,Y,Z) kernel attribute".

3. The other option would be to use get_compute_id(), but it requires the kernel to be an autorun kernel, and I would need to use channels for that, which would again increase resource utilization. The programming guide says "to create compute units that are slightly different from one another but share a lot of common code, call the get_compute_id()". In my case all compute units would be identical.

Can anyone please help me with how I can improve my single work-item kernel's performance, by replication or in any other way?

Thanks
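For readers following the thread, the split described in the post (2400 iterations divided into 2 x 1200 or 4 x 600) can be sketched as plain index arithmetic in C. This is only an illustration and rests on an assumption the post does not confirm: that the loop-carried dependencies stay inside each chunk, so the chunks can run independently. The names `chunk_t` and `chunk_for_replica` are hypothetical and do not come from the Intel or Altera SDK.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: half-open iteration range [begin, end)
   assigned to one replica of the kernel. */
typedef struct {
    size_t begin;
    size_t end;
} chunk_t;

/* Split `total` loop iterations across `replicas` copies as evenly as
   possible. When the division is not exact, the first (total % replicas)
   replicas each take one extra iteration.
   Assumption: iterations in different chunks do not depend on each other. */
static chunk_t chunk_for_replica(size_t total, size_t replicas, size_t id)
{
    assert(replicas > 0 && id < replicas);

    size_t base  = total / replicas;   /* iterations every replica gets */
    size_t extra = total % replicas;   /* leftover iterations to spread */

    chunk_t c;
    c.begin = id * base + (id < extra ? id : extra);
    c.end   = c.begin + base + (id < extra ? 1 : 0);
    return c;
}
```

With 2400 iterations and 2 replicas, replica 0 would get [0, 1200) and replica 1 would get [1200, 2400). Each range could then be handed to its own copy of the kernel, for example as a start index and a count passed as kernel arguments. Whether that actually runs the copies concurrently depends on how the host launches them, which the quoted guide text does not settle.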