loop-unrolling and memory access performance

Honored Contributor

8 years ago

Hello,

Thank you very much for following this thread. I am learning things that I could not find in the Altera manuals, so I went back and checked the manuals again.

- About the floating-point accumulator: the compiler does not complain the way it does for the examples in the "Best Practices Guide", so I think it is implemented as a single-cycle accumulator, as explained in the Programming Guide, section 1.6.12 (single-cycle FP accumulator; I am using an Arria 10 device).

Let me divide my question into two parts.

1- In a task, I want to get the maximum memory bandwidth from a single port for a specific global-memory array.

I guess that in the ideal case (with bursts, cache, and so on, but without coalescing), getting one 32-bit value of that array per cycle is the best I can do. Is that right?

(Unfortunately, the indices of consecutive accesses are random, i.e. dynamically computed, so I do not think the accesses can be coalesced. To avoid confusion, I will open a new thread about the code.)
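To put a number on the ideal case in part 1, here is a rough sketch in plain C. It only does the arithmetic for "one word per cycle from one port". The function names, the kernel clock, and the peak bandwidth are hypothetical inputs chosen for illustration, not values taken from a compiler report or a board datasheet.

```c
/* Bytes per second delivered by a single port that returns one word
 * on every kernel clock cycle.
 * kernel_clock_hz and word_bits are illustrative inputs, not measurements. */
double port_bandwidth_bytes_per_s(double kernel_clock_hz, unsigned word_bits)
{
    return kernel_clock_hz * (double)(word_bits / 8u);
}

/* Fraction of a given peak global-memory bandwidth that such a port uses.
 * peak_bytes_per_s must come from the actual board; it is not assumed here. */
double port_utilization(double kernel_clock_hz, unsigned word_bits,
                        double peak_bytes_per_s)
{
    return port_bandwidth_bytes_per_s(kernel_clock_hz, word_bits)
           / peak_bytes_per_s;
}
```

For example, with an assumed 250 MHz kernel clock, `port_bandwidth_bytes_per_s(250e6, 32)` gives 1e9 bytes per second. Against a hypothetical peak of 16e9 bytes per second, `port_utilization` returns 0.0625, which is why a single non-coalesced port is unlikely to saturate global memory on its own.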

2- In the same task (not an NDRange kernel), at another level, I want to access memory in parallel through more than one port for that same array, in order to saturate the whole global memory bandwidth. In effect, I want to implement something like multiple compute units, but inside one task. Does it make sense to unroll the outer loop for this purpose? Do you have any other suggestions? Let's assume that coalesced access will not work because the memory accesses are random.
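The idea in part 2 could be sketched as follows, in plain C rather than OpenCL C so that it runs anywhere. This is only an illustration under stated assumptions: the loop nest, the names `row_gather_sums`, `LANES`, `data`, and `idx`, and the row-major index layout are all hypothetical, since the real code is not shown in this thread. The outer loop is stepped by `LANES` and the copies are merged into the inner loop body (often called unroll-and-jam), so that each lane issues its own independent random read.

```c
#include <stddef.h>

#define LANES 4  /* hypothetical unroll factor: number of parallel reads per step */

/* For each row r, computes out[r] = sum over c of data[idx[r * cols + c]].
 * idx holds arbitrary (random) positions into data, so reads cannot be coalesced. */
void row_gather_sums(const float *data, const unsigned *idx,
                     size_t rows, size_t cols, float *out)
{
    /* Outer loop, advanced LANES rows at a time (manual unrolling). */
    for (size_t r0 = 0; r0 < rows; r0 += LANES) {
        float acc[LANES] = {0.0f};

        /* Inner loop stays rolled; an HLS compiler would pipeline it. */
        for (size_t c = 0; c < cols; ++c) {
            /* In an OpenCL kernel this lane loop would carry `#pragma unroll`,
             * so that every lane becomes a separate load in hardware. */
            for (size_t lane = 0; lane < LANES; ++lane) {
                size_t r = r0 + lane;
                if (r < rows) {
                    acc[lane] += data[idx[r * cols + c]];
                }
            }
        }

        /* Write back one result per row handled in this step. */
        for (size_t lane = 0; lane < LANES; ++lane) {
            if (r0 + lane < rows) {
                out[r0 + lane] = acc[lane];
            }
        }
    }
}
```

Whether the compiler actually gives each lane its own memory port, and whether the memory interface can sustain that many random reads, is an assumption that would have to be checked in the compiler report for the real kernel.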

Thanks
