Altera_Forum
Honored Contributor
10 years ago
Example showing multiple compute units providing speedup
Hi,
Is there an example of a kernel that performs better with multiple compute units? I have experimented with various kernels, the simplest being an element-wise multiplication over an array of double-precision floats. So far I haven’t found any kernel whose execution time actually improves with __attribute__((num_compute_units(N))). My profiler tells me that global memory bandwidth drops sharply once I use this OpenCL attribute, even though my access patterns aren’t complex.

Does Altera have a working example of a kernel that benefits from multiple compute units? Has anyone gotten it to work before? I would appreciate any feedback or examples that show a performance benefit from __attribute__((num_compute_units(N))).

This is an example of what I tried to run on AOCL. With a single compute unit, this code’s global memory bandwidth is 2200 MB/s. With two compute units, it drops to 480 MB/s. Why is there such a vast difference?

    __attribute__((num_compute_units(CU)))
    __kernel void vector_add(__global double * restrict x)
    {
        // get index of the work item
        int id = get_global_id(0);
        x[id] = x[id] * x[id];
    }

I have tried the SIMD option, which improves performance, but I want to test the performance of multiple compute units, which so far has been unsuccessful. I would appreciate a simple, straightforward working example to build on and investigate.
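For anyone trying to reproduce this, two things are worth checking on the host side before comparing compute-unit counts. Work-groups are the unit that gets dispatched to compute units, so a launch with only one work-group cannot occupy a second compute unit. The kernel above also moves one double in and one double out per multiply, and all compute units share the same global memory, so it may be memory-bound. The C helpers below are a minimal sketch of those two checks, not part of AOCL: the function names are hypothetical, and `groups_per_cu` is a rule-of-thumb parameter assumed for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work-groups in a 1-D NDRange launch.
 * Assumes global_size is a multiple of local_size. */
static size_t work_group_count(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Hypothetical helper: nonzero when the launch has at least
 * groups_per_cu work-groups for every compute unit.
 * groups_per_cu is an assumed rule-of-thumb value, not an AOCL setting. */
static int keeps_compute_units_busy(size_t global_size, size_t local_size,
                                    unsigned num_cu, unsigned groups_per_cu)
{
    size_t needed = (size_t)num_cu * groups_per_cu;
    return work_group_count(global_size, local_size) >= needed;
}

/* Global memory bytes moved per work-item by x[id] = x[id] * x[id]:
 * one double read plus one double written. */
static size_t bytes_per_work_item(void)
{
    return 2 * sizeof(double);
}
```

For example, 1,048,576 work-items with a local size of 256 give 4096 work-groups, which is plenty for two compute units. A launch whose global size equals its local size gives a single work-group, so only one compute unit would have work to do.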