Forum Discussion
Altera_Forum
Honored Contributor
11 years ago

--- Quote Start ---
Thank you! I'll try. What if I have 2d work groups instead of for loops, where each thread copies 1 item from the global memory to local memory? Would the compiler automatically merge memory accesses? Is using "num_simd_work_items" the only way to optimize the kernel?
--- Quote End ---

"num_simd_work_items" is an effective way to optimize kernels. It is essentially similar to unrolling loops: hardware resources are replicated to increase throughput.

There are two types of merging (i.e. coalescing):

1) Compile-time coalescing performed by the compiler: The compiler detects load/store instructions in the kernel that access consecutive (local or global) memory addresses and merges them. This may increase fmax because it simplifies the design (fewer load/store instructions), and it may increase throughput because fewer memory requests are sent.

2) Dynamic coalescing performed on the FPGA: This is when the same "global" load/store instruction sends requests to consecutive memory addresses; the hardware merges these requests before they are sent to memory, which increases throughput.

When you unroll loops or use "num_simd_work_items", you can take advantage of both #1 and #2. If you do not, you get only #2, and only for global accesses.
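
To make the 2D work-group case a bit more concrete, here is a rough host-side sketch in plain C. It is a toy model and not the actual hardware logic. It lists the element indices that a work-group would load if work-item (lx, ly) reads one element from a row-major buffer, then counts how many memory requests remain if requests to directly consecutive indices are merged. The function names, the `max_burst` limit, and the assumption that work-items issue their loads with dimension 0 varying fastest are all mine, for illustration only.

```c
#include <stddef.h>

/* Toy model (an assumption, not the real coalescing hardware): walk the
   element indices issued by one load instruction and open a new memory
   request whenever an index does not directly follow the previous one,
   or when the current request already holds max_burst elements. */
size_t count_requests(const size_t *idx, size_t n, size_t max_burst)
{
    size_t requests = 0, run = 0;
    for (size_t i = 0; i < n; ++i) {
        int contiguous = (i > 0) && (idx[i] == idx[i - 1] + 1);
        if (!contiguous || run == max_burst) {
            ++requests;   /* start a new request */
            run = 0;
        }
        ++run;            /* this element joins the current request */
    }
    return requests;
}

/* Element indices read by a tile_w x tile_h work-group whose work-item
   (lx, ly) loads src[(y0 + ly) * width + (x0 + lx)]. The visiting order
   (lx fastest) is an assumed issue order, chosen for illustration. */
size_t tile_indices(size_t *out, size_t x0, size_t y0,
                    size_t tile_w, size_t tile_h, size_t width)
{
    size_t n = 0;
    for (size_t ly = 0; ly < tile_h; ++ly)
        for (size_t lx = 0; lx < tile_w; ++lx)
            out[n++] = (y0 + ly) * width + (x0 + lx);
    return n;
}
```

In this model, a 4x4 tile taken from a buffer that is 8 elements wide produces four runs of consecutive indices (one per row), so 16 loads collapse into 4 requests. If the tile is as wide as the buffer, all 16 indices are consecutive and could be merged into a single request. The practical takeaway is to index global memory so that neighbouring work-items in dimension 0 touch neighbouring addresses.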