Forum Discussion
Altera_Forum
Honored Contributor
8 years ago

The count of 40 reads that the compiler reports is correct and does not have the implication you think it does. Your kernel will turn into one single pipeline, and there will be 40 read ports going from the pipeline to the "output" buffer. It does not matter in which part of the code those reads are located; the compiler will make sure all reads can be satisfied in parallel to avoid any stalls in the pipeline. Note that all of these ports will be reported in the same part of the report; you should not expect to see 32 ports in one part and another 8 ports in another part of it.
Regarding outputs being sent to the channel before being fully accumulated: this is not possible. The compiler will ensure the loop on "t" fully finishes before the loop on the output channels starts. If you are seeing different output than what you expect, the problem is somewhere else. Have you tried checking what output you receive in the emulator and debugging by printing the intermediate values?

A few tips:

1- Be very careful with using uint loop variables; if you compare "uint" with "int", you could get very different behavior compared to what you expect. Specifically, for the loop on "t", "loop_cnt" must also be uint for correct behavior.

2- You might have to initialize the output buffer; depending on how the compiler actually unrolls the loops, the accumulation line might result in undefined behavior if the buffer is not initialized.

3- You should probably consider changing the order of your "h" and "w" loops, or transposing your "output" buffer, so that when you unroll the loops, the unrolled accesses to the buffer can be coalesced, resulting in wider but fewer read and write ports. With your current implementation, if you keep increasing the size of the output buffer to the point that it has to be implemented using Block RAMs, you will get a really large replication factor from the compiler to support all the non-coalesced read and write ports.
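To make tips 1 and 3 a bit more concrete, here is a minimal sketch in plain C (not your kernel; the same integer conversion rules apply in OpenCL C). The function names are hypothetical, and the row-major "output[H][W]" layout in flat_index is only an assumption about how your buffer is declared.

```c
#include <stdio.h>

/* Tip 1: when "uint t" is compared with "int loop_cnt", the int operand is
 * converted to unsigned first. The cast below spells out what the compiler
 * does implicitly for "t < loop_cnt". */
static int mixed_less_than(unsigned int t, int loop_cnt) {
    return t < (unsigned int)loop_cnt; /* -1 becomes UINT_MAX here */
}

/* The comparison most people have in mind: both sides widened to a signed
 * type that can hold every value of either operand. */
static int signed_less_than(unsigned int t, int loop_cnt) {
    return (long long)t < (long long)loop_cnt;
}

/* Tip 3: flat address of output[h][w], ASSUMING a row-major buffer that is
 * "width" elements wide. */
static unsigned int flat_index(unsigned int h, unsigned int w,
                               unsigned int width) {
    return h * width + w;
}

/* Prints the comparison results and the address stride between neighbouring
 * iterations when unrolling over "w" versus over "h". */
static void print_demo(void) {
    printf("mixed:  0 < -1 is %d\n", mixed_less_than(0u, -1));
    printf("signed: 0 < -1 is %d\n", signed_less_than(0u, -1));
    printf("stride over w: %u\n", flat_index(2, 4, 8) - flat_index(2, 3, 8));
    printf("stride over h: %u\n", flat_index(3, 3, 8) - flat_index(2, 3, 8));
}
```

With a negative "loop_cnt", the mixed comparison stays true, so a loop like "for (uint t = 0; t < loop_cnt; t++)" would run far longer than intended instead of not running at all. For the buffer, unrolling over the index with stride 1 gives consecutive addresses that can be coalesced into one wide access, while unrolling over the index with stride "width" gives scattered addresses that each need their own port.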