Altera_Forum
7 years ago
Hello,
Thank you very much for following this thread. I am learning things here that I could not find in the Altera manuals, so I went back and checked them again.

About the floating-point accumulator: the compiler does not complain the way it does for the examples in the Best Practices Guide, so I think it is implemented as a single-cycle accumulator, as described in section 1.6.12 of the Programming Guide (single-cycle FP accumulator; I am using an Arria 10 device).

Let me divide my question into two parts.

1- In a task, I want the maximum memory bandwidth from a single port for one specific global array. I guess that in the ideal case (burst, cache, ..., but without coalescing), if I get one 32-bit value of that array per cycle, I am done. Right? (Unfortunately the indices of consecutive accesses are random, i.e. dynamic indexing, so I think I cannot get coalesced accesses. To avoid confusing you, I will open a new thread about the code.)

2- In the same task (not NDRange), at another level, I want parallel access to memory through more than one port (for that same array), in order to saturate the whole global memory bandwidth. In effect I want to implement multiple compute units, but inside one task. Does it make sense to unroll the outer loop for this purpose? Do you have any other suggestion? Let's assume coalesced access will not work because of the random memory accesses.

Thanks
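To make question 2 concrete, here is a minimal sketch in plain C (not my real kernel) of the loop shape I have in mind. The names `LANES`, `gather_sum_single` and `gather_sum_lanes` are hypothetical, and the data is assumed to hold small integer values so both versions give exactly the same float result. Whether the offline compiler really turns each unrolled read into its own memory port is exactly what I am unsure about, so treat that as an assumption, not a fact.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* hypothetical number of parallel read sites */

/* Baseline: one read site, one randomly indexed load per iteration. */
float gather_sum_single(const float *data, const unsigned *idx, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += data[idx[i]];
    return acc;
}

/* Strip-mined version: the outer loop steps by LANES and the inner loop
 * is the part that would be fully unrolled, giving LANES independent
 * loads and LANES independent accumulators per outer iteration. */
float gather_sum_lanes(const float *data, const unsigned *idx, size_t n)
{
    assert(n % LANES == 0); /* sketch only handles whole groups */

    float acc[LANES] = {0.0f};
    for (size_t base = 0; base < n; base += LANES) {
        /* In the OpenCL kernel, "#pragma unroll" would go on this loop.
         * Assumption: each unrolled load then gets its own load unit. */
        for (int lane = 0; lane < LANES; ++lane)
            acc[lane] += data[idx[base + lane]];
    }

    /* Combine the per-lane partial sums once, after the main loop. */
    float total = 0.0f;
    for (int lane = 0; lane < LANES; ++lane)
        total += acc[lane];
    return total;
}
```

The separate accumulators are there so that the lanes do not all depend on one running sum. With general floating-point data the two functions can differ in the last bits, because the additions happen in a different order.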