Forum Discussion
Altera_Forum
Honored Contributor
8 years ago

--- Quote Start ---
Hi,

Is there an example of a kernel that performs better with multiple compute units? I have experimented with various kernels; the simplest is a floating-point multiplication over an array of doubles. I have not found any kernel whose execution time actually improves with __attribute__((num_compute_units(N))). My profiler tells me that global memory bandwidth is severely affected once I use this OpenCL attribute, even though my access patterns are not complex.

Does Altera have a working example of a kernel that benefits from multiple compute units? Has anyone gotten it to work before? I would appreciate any feedback or examples that show performance benefits from __attribute__((num_compute_units(N))).

This is an example of what I tried to run on AOCL. With a single compute unit, this code's global memory bandwidth is 2200 MB/s. With two compute units, it drops to 480 MB/s. Why is there such a vast difference?

    __attribute__((num_compute_units(CU)))
    __kernel void vector_add(__global double * restrict x)
    {
        // get index of the work item
        int id = get_global_id(0);
        x[id] = x[id] * x[id];
    }

I have tried the SIMD option, which improves performance, but I want to test the performance of multiple compute units, which thus far has been unsuccessful. I would appreciate a simple and straightforward working example to build on and investigate.
--- Quote End ---

To get a performance increase from the compute unit attribute, it is important to size the work-groups properly. The hardware scheduler dispatches whole work-groups to compute units, so the NDRange must contain enough work-groups to keep every compute unit busy. The Altera SDK for OpenCL Best Practices Guide (https://www.altera.com/en_us/pdfs/literature/hb/opencl-sdk/aocl-best-practices-guide.pdf) discusses at length how to use the compute unit attribute.

Even so, the improvement you get from SIMD vectorization is generally greater than what you get from replicating compute units. One key reason is increased memory contention: all compute units compete for the same global memory bandwidth.
On the other hand, according to the manual, the SIMD attribute (num_simd_work_items) also allows the offline compiler to coalesce memory accesses. This could be a key reason for the performance increase you saw with SIMD.
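To make the work-group sizing point concrete, here is a minimal host-side sketch in plain C. It is not taken from the Altera documentation: the function names and the example sizes are made up for illustration, and it only checks the arithmetic of a 1-D launch. It assumes the usual OpenCL 1.x rule that the global size must be a multiple of the local size, and the AOCL rule that the required work-group size must be evenly divisible by the num_simd_work_items value.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Kernel side, shown only as a comment (OpenCL C, illustrative values):
 *
 *   __attribute__((reqd_work_group_size(64, 1, 1)))
 *   __attribute__((num_simd_work_items(4)))
 *   __kernel void vector_add(__global double * restrict x) { ... }
 */

/* Number of work-groups in a 1-D NDRange.
 * Returns 0 when local_size does not divide global_size evenly,
 * which OpenCL 1.x would reject at enqueue time. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0); /* a zero-sized work-group is meaningless */
    if (global_size % local_size != 0)
        return 0;
    return global_size / local_size;
}

/* Returns 1 if the SIMD width divides the work-group size evenly. */
int simd_width_ok(size_t local_size, size_t simd_width)
{
    return simd_width != 0 && local_size % simd_width == 0;
}

/* Returns 1 if there is at least one work-group per compute unit.
 * This is only a necessary condition for keeping all units busy;
 * it says nothing about memory contention between them. */
int enough_groups_for_cus(size_t global_size, size_t local_size,
                          size_t num_cus)
{
    size_t groups = num_work_groups(global_size, local_size);
    return num_cus != 0 && groups >= num_cus;
}
```

For example, with a global size of 1024 and a local size of 1024 there is only one work-group, so enough_groups_for_cus(1024, 1024, 2) returns 0 and the second compute unit would sit idle. With a local size of 64 there are 16 work-groups and the check passes. Passing this check does not guarantee a speedup, since the compute units still share global memory bandwidth.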