Forum Discussion
Altera_Forum
Honored Contributor
12 years ago

If a single compute unit was bottlenecked by global memory, adding a second compute unit can actually reduce memory efficiency because of the access pattern between the two units. Each compute unit duplicates the load/store units that access memory, and SDRAM operates most efficiently when accessed sequentially, so two compute units accessing different regions of memory produce a less sequential access pattern overall. Vectorizing the kernel may improve performance simply because the narrow accesses become coalesced into wider ones.
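As a rough illustration of the coalescing point, here is a hypothetical kernel (names and signature are made up for this sketch) in scalar form and in a manually vectorized float4 form:

```
// Scalar version: each work-item issues one narrow 4-byte load and store.
__kernel void scale_scalar(__global const float *in,
                           __global float *out,
                           float factor) {
    size_t i = get_global_id(0);
    out[i] = in[i] * factor;
}

// Manually vectorized version: each work-item issues one 16-byte
// access, so the same data moves in fewer, wider, more sequential
// bursts to SDRAM.
__kernel void scale_vec4(__global const float4 *in,
                         __global float4 *out,
                         float factor) {
    size_t i = get_global_id(0);
    out[i] = in[i] * factor;
}
```

Note the host code would also change for the vectorized version: the global size shrinks by a factor of 4 since each work-item now processes four elements.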
I have not queried the FPGA to determine things like the maximum number of compute units, so I'm not sure whether returning 1 is the expected behavior when you instruct the compiler to create 2. There is no minimum work-group size, but if you don't launch the hardware with a large number of work-items you may run into performance issues (hard to tell without seeing the kernel). The maximum work-group size defaults to 256; if you need a different maximum size (or a fixed size), there are attributes you can set for those. Instead of manually vectorizing the kernel, have you tried the num_simd_work_items attribute? It vectorizes the kernel for you, so you don't have to manually change all your data to vector types. There will still be cases where manual vectorization is ideal, but while you are prototyping I recommend giving it a shot since it lets you change the vector size quickly.
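A sketch of how those attributes look on a hypothetical kernel (the kernel itself is made up; the attributes are the ones discussed above). Note that num_simd_work_items has to be paired with reqd_work_group_size, and the required work-group size must be a multiple of the SIMD width:

```
// Compiler vectorizes by 4 automatically; no vector types needed
// in the kernel body or the host data layout.
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void scale(__global const float *in,
                    __global float *out,
                    float factor) {
    size_t i = get_global_id(0);
    out[i] = in[i] * factor;
}

// If the work-group size varies but never exceeds some bound,
// override the default maximum of 256 instead:
// __attribute__((max_work_group_size(512)))
```

Changing the vector width during prototyping is then a one-number edit to num_simd_work_items (keeping the work-group size a multiple of it), rather than rewriting the kernel with float4/float8 types.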