Altera_Forum
Honored Contributor
8 years ago
If you take a look at the report, the reason is pretty obvious. The compiler is being stupid and splitting your read access into 8x 512-bit simple accesses plus 3x 32-bit prefetching accesses (no idea what those are for), instead of inferring a single 4096-bit coalesced access like it does for the write. Because of this, you have 12 ports going to memory instead of 2. This configuration causes a huge amount of contention on the memory bus and significantly reduces your memory performance.
If you add the volatile qualifier to your input (__global volatile float *restrict bottom), you should also get a single 4096-bit access for the read, which will likely let you get close to peak performance. Also, since the devkit only has one memory bank, you should be able to reach full bandwidth with a total access size of 512 bits (read + write), so a vector size of 8 or 16 should be enough in your case.
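To make the numbers above easy to check, here is a small plain-C sketch of the arithmetic. The helper names (access_bits, total_bits, port_count) are made up for illustration and are not part of any SDK, and it assumes 32-bit floats with the same vector size on the read and the write side.

```c
#include <assert.h>

/* Hypothetical helper: width in bits of one access moving `vec_floats` 32-bit floats. */
unsigned access_bits(unsigned vec_floats) {
    return vec_floats * 32u;
}

/* Hypothetical helper: combined read + write width per iteration,
 * assuming both sides use the same vector size. */
unsigned total_bits(unsigned vec_floats) {
    return 2u * access_bits(vec_floats);
}

/* Hypothetical helper: number of memory ports the kernel ends up with. */
unsigned port_count(unsigned wide_reads, unsigned prefetch_reads, unsigned writes) {
    return wide_reads + prefetch_reads + writes;
}

/* Re-checks the figures quoted from the report. */
void check_report_numbers(void) {
    /* 8 wide reads + 3 prefetch reads + 1 write = 12 ports. */
    assert(port_count(8, 3, 1) == 12);
    /* One coalesced read + one coalesced write = 2 ports. */
    assert(port_count(1, 0, 1) == 2);
    /* The 8 split reads together cover the same 4096 bits as one coalesced access. */
    assert(8u * 512u == 4096u);
    /* Vector size 8 gives 256-bit read + 256-bit write = 512 bits total. */
    assert(total_bits(8) == 512);
}
```

With a vector size of 16, total_bits gives 1024 bits, so 16 already goes past the 512-bit target, while 8 hits it exactly. A 4096-bit access corresponds to 128 floats per access.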