Forum Discussion
Altera_Forum
Honored Contributor
11 years ago
Thank you!
If I have a kernel with a 1-dimensional work-group that needs to cache a 2D data block from global to local memory (using a for loop), should I use the for loop index as the row index and the work-item ID as the column index, or the other way around? The data block is stored in row-major form. I intend to use unroll and SIMD to increase throughput, but I am not sure whether it is more effective to merge memory accesses indexed by the loop index or by the work-item ID.

BTW: Regarding compilation, is it OK to compile aocl kernels on Windows but execute them on Linux? Are there any disadvantages to executing kernels compiled on a different OS?

Also, given enough DRAM, is it possible to compile multiple kernels on the same workstation (Linux) at the same time, with each compilation started in a separate "screen" session? My workstation has a 6-core, 12-thread processor, but the compiler only uses more than 3 cores during the timing analysis stage, so I am trying to see if there is a way to save some time.

--- Quote Start ---
"num_simd_work_items" is an effective way of optimizing kernels. It is essentially similar to unrolling loops: hardware resources are replicated to increase throughput.

There are two types of merging (i.e. coalescing):

1) Compile-time coalescing performed by the compiler: This is when the compiler detects that there are consecutive (local or global) load/store instructions in the kernel and merges them. This may increase fmax because it simplifies the design (fewer load/store instructions), and it may increase throughput because fewer memory requests are sent.

2) Dynamic coalescing performed on the FPGA: This is when the same "global" load/store instruction sends consecutive memory requests; these requests are merged by the hardware before they are sent to memory, which increases throughput.

When you unroll loops or use "num_simd_work_items", you can take advantage of both #1 and #2. If you do not, only #2 applies, and only to the global accesses.
--- Quote End ---
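
To make the two indexing options in my question concrete, here is a minimal host-side sketch in plain C (not kernel code). It only shows which index ends up with unit stride in a row-major block; it does not settle which option the compiler coalesces better. The function names and the `width` parameter are hypothetical, chosen just for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Flat offset of element (row, col) in a row-major block
 * that is `width` elements wide. */
size_t row_major_index(size_t row, size_t col, size_t width)
{
    assert(col < width); /* column must stay inside the row */
    return row * width + col;
}

/* Option A: loop index selects the row, work-item ID selects the column.
 * Neighbouring work-items touch neighbouring addresses (stride 1);
 * neighbouring loop iterations are `width` elements apart. */
size_t index_loop_row(size_t loop_i, size_t work_id, size_t width)
{
    return row_major_index(loop_i, work_id, width);
}

/* Option B: work-item ID selects the row, loop index selects the column.
 * Neighbouring loop iterations touch neighbouring addresses (stride 1);
 * neighbouring work-items are `width` elements apart. */
size_t index_loop_col(size_t loop_i, size_t work_id, size_t width)
{
    return row_major_index(work_id, loop_i, width);
}
```

With a block 8 elements wide, option A gives a stride of 1 between adjacent work-items and 8 between adjacent loop iterations; option B is the reverse. So the question comes down to whether consecutive addresses should line up with the SIMD lanes (work-items) or with the unrolled loop iterations.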