Forum Discussion
Altera_Forum
Honored Contributor
11 years ago

This is an interesting and deep question. The right answer may depend on many factors, but mainly on the size of your kernel, the size of your buffers, and your SIMD parameter. I will try to give some insights.
For global loads, where N and M are constants:

Case-G1:

    for (i = 0; i < N; i++) { val = A[i * M + get_global_id(0)]; }

If you unroll this loop 4 times, you will get 4 super-efficient loads (e.g. A[4 * M + gid]), regardless of the SIMD parameter. They will be super-efficient because consecutive work-items access consecutive addresses in each load, an access pattern that FPGAs excel at. However, if you unroll the loop a large number of times, you may start slowing down your kernel because of the large number of loads.

Case-G2:

    for (i = 0; i < M; i++) { val = A[get_global_id(0) * M + i]; }

If you unroll this loop, the consecutive accesses will be coalesced, and you will get wide, somewhat efficient loads. These loads will not be as efficient as the loads in Case-G1; however, you will have fewer of them because of the coalescing, which may be advantageous. On the other hand, if you use a large SIMD parameter, you will start increasing the number of loads again, because each SIMD lane will have its own load.

For local stores:

Case-L1:

    for (i = 0; i < N; i++) { B[get_local_id(0)] = val; }
When you unroll the loop and use a SIMD parameter, each store will be as wide as the SIMD parameter. The larger the SIMD parameter is, the wider and more efficient the accesses will be.
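As a concrete (untested) sketch of how a Case-L1-style store with SIMD vectorization might look in Altera/Intel FPGA OpenCL: the `num_simd_work_items` and `reqd_work_group_size` attributes and `#pragma unroll` are the standard SDK mechanisms, but the kernel name, work-group size, and buffers below are made up for illustration.

```
// Hypothetical kernel mirroring Case-L1; sizes chosen for illustration.
__attribute__((reqd_work_group_size(64, 1, 1)))  // required when using SIMD
__attribute__((num_simd_work_items(8)))          // 8 SIMD lanes
__kernel void case_l1(__global const float *restrict A,
                      __global float *restrict C) {
    __local float B[64];
    float val = A[get_global_id(0)];

    #pragma unroll
    for (int i = 0; i < 4; i++)
        B[get_local_id(0)] = val;   // with SIMD=8, each store spans the 8
                                    // consecutive local addresses written
                                    // by the 8 lanes

    barrier(CLK_LOCAL_MEM_FENCE);
    C[get_global_id(0)] = B[get_local_id(0)];
}
```

Note that the Intel FPGA SDK requires `reqd_work_group_size` to be set (and divisible by the SIMD width) whenever `num_simd_work_items` is used.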
Case-L2:

    for (i = 0; i < N; i++) { B[get_local_id(0)] = val; }

When you unroll the loop and use a SIMD parameter, the store width will be determined by the compiler; however, it can potentially be very wide (because of unrolling), and wider than in Case-L1.