--- Quote Start ---
- About the floating-point accumulator: the compiler doesn't complain as it does for those examples in the "best practices guide", so I think it is implemented as single-cycle, as specifically explained in programming guide section 1.6.12 (single-cycle FP accumulator; I use an Arria 10 device).
--- Quote End ---
Yes, if you are targeting Arria 10, you can take advantage of single-cycle floating-point accumulation, and your code is indeed constructed in a way that takes advantage of this feature. I believe I compiled your code against Stratix V, which is why I got an II of 16 when unrolling the inner loop. If you get an II of one after unrolling the inner loop, that is good and you can go ahead with the unrolling without needing the shift register optimization.
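For reference, this is a minimal sketch of the pattern in question (kernel and variable names are hypothetical, not from your code): the `acc += ...` feedback on a single variable is what the compiler maps to Arria 10's single-cycle floating-point accumulator, so the loop can keep an II of one even when unrolled. The shim at the top just lets the same OpenCL-C sketch compile as plain C for illustration.

```c
/* Shim so this OpenCL-C sketch also compiles as plain C for illustration */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

/* Hypothetical single work-item kernel showing the accumulation pattern */
__kernel void dot_acc(__global const float *a,
                      __global const float *b,
                      __global float *result,
                      const int n)
{
    float acc = 0.0f;          /* single accumulator feeding back into itself */
    #pragma unroll 8           /* unrolling widens the datapath; on Arria 10
                                  the FP accumulator keeps the loop at II = 1 */
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];    /* acc = acc + x: the accumulator pattern */
    *result = acc;
}
```

On devices without the hardware accumulator (e.g. Stratix V), this same unrolled pattern is what produces the high II, which is why the shift register optimization is needed there.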
--- Quote Start ---
1- In a task, I want to achieve maximum memory bandwidth from a single port for a specific global array.
I guess that in the ideal case (burst, cache, ......, but without coalescing), if I get one 32-bit value of that variable per cycle, I am done, right?
(Unfortunately, the indices of consecutive accesses are random (dynamic indexing), so I think I cannot have coalesced accesses. To avoid confusion, I will open a new thread about the code.)
--- Quote End ---
Actually, no, you need a much larger access. I am not sure which Arria 10 board you are using, but the one I use has two banks of DDR4 memory running at 2133 MHz. Each bank is connected to the FPGA through a 64-bit bus (72-bit with ECC), and the memory controller on the FPGA runs at 266 MHz. Hence, if you want to saturate the memory bandwidth using one access, taking into account the operating frequency difference between the memory controller and the memory itself, you need an access size of:
(memory frequency / controller frequency) * number of banks * width of memory port = (2133 / 266) * 2 * 64 = 1024 bits = 32 floats --> unroll factor of 32 on the memory access loop
Of course, this assumes you only have one access in your kernel. If you have more accesses, you should divide the bandwidth between the different accesses and adjust the unroll factors accordingly. You can also consider "sacrificing" your less important accesses by giving them a lower unroll factor. Doing one 32-bit access per cycle will only give you 1/32 of the memory bandwidth. Having multiple 32-bit accesses per cycle will give you some limited performance improvement, but you will never reach the peak, or anywhere close to it, due to contention on the memory bus.
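The arithmetic above can be packed into a small helper (a sketch; the function name and parameters are mine, and the numbers are the board parameters quoted above, not universal constants):

```c
/* Back-of-the-envelope sizing of the single global memory access needed
 * to saturate bandwidth. Returns the access size in bits. */
int required_access_bits(int mem_mhz,    /* memory data rate, e.g. 2133 */
                         int ctrl_mhz,   /* FPGA memory controller clock */
                         int banks,      /* number of independent banks  */
                         int port_bits)  /* data bus width per bank      */
{
    return (mem_mhz / ctrl_mhz) * banks * port_bits;
}
```

With the numbers above, `required_access_bits(2133, 266, 2, 64)` gives 1024 bits, i.e. 32 floats, hence the unroll factor of 32 on the memory access loop.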
--- Quote Start ---
2- In the same task (no ND-range), at another level, I want to have parallel access to memory through more than one port (for that specific variable), to saturate the whole global memory bandwidth. I effectively want to implement multiple compute units, but inside one task. Does it make sense to unroll the outer loop for this purpose? Do you have any other suggestions? Let's assume coalesced access will not work, due to the random memory accesses.
--- Quote End ---
Yes, unrolling the outer loop will create a multiple-compute-unit-like design; however, you will likely not get much of a performance improvement. The memory controller on the FPGA is extremely inefficient for random memory accesses, and due to the lack of a proper cache hierarchy, there is very little you can do about them. The same issue more or less exists on GPUs; however, GPUs have a far more advanced memory controller, far higher memory bandwidth, and a cache hierarchy.
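As a sketch of what I mean (hypothetical kernel and names; the shim again lets the OpenCL-C compile as plain C): unrolling the outer loop replicates the entire inner pipeline, so several copies issue load requests to global memory concurrently, much like multiple compute units inside one task.

```c
/* Shim so this OpenCL-C sketch also compiles as plain C for illustration */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

/* Hypothetical gather kernel: idx holds random (dynamic) indices, so the
 * loads cannot be coalesced. Unrolling the OUTER loop replicates the whole
 * inner pipeline, creating several concurrent load units. */
__kernel void gather_rows(__global const float *data,
                          __global const int *idx,
                          __global float *out,
                          const int rows, const int cols)
{
    #pragma unroll 4               /* 4 replicated pipelines / memory ports */
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++)
            acc += data[idx[r * cols + c]];  /* random access per element */
        out[r] = acc;
    }
}
```

Even with the replicated ports, the random `idx` pattern means each request touches a different DRAM row, so the replicas mostly end up contending on the memory bus rather than adding bandwidth.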
If you are interested in understanding the memory performance, you can try porting the following repository for FPGAs:
https://github.com/uob-hpc/babelstream
Then you can play around with different SIMD factors for NDRange kernels and unroll factors for single work-item kernels to see how the memory performance varies. I ported an older version of this repository for the same purpose a while back, which you can use (though it has no documentation):
https://github.com/zohourih/gpu-stream