Forum Discussion
Altera_Forum
Honored Contributor
8 years ago --- Quote Start --- Are you talking about num_compute_units for NDRange kernels or single work-item kernels? num_compute_units for NDRange kernels works in a fully automatic manner and does not require any user intervention other than adding the attribute to the kernel header. The compiler will automatically replicate the pipeline in this case, allowing multiple work-groups to be scheduled in parallel. This obviously comes at the cost of higher area usage and higher memory bandwidth utilization. If memory bandwidth is saturated, using num_compute_units will actually reduce performance due to extra memory contention. --- Quote End --- Thx for the reply. I meant from single work-item to NDRange. Original kernel process all pixels in one for loop, then I tried spiting the loop in half and launch two copies in parallel. By using NDRange I have to call get_global_id and later I found out that this will cause latency compare to not using it at all. I guess it's not a big deal when the kernels are complex and this 10ms doesn't cause a bottle neck, mimicking GPU programming on FPGA just won't pay off...