Forum Discussion

New Contributor

6 years ago

Solved

Load/store cannot be vectorised - local memory

Hello, I'm having some trouble with local memory and SIMD in a matrix transpose kernel I'm adapting from GPU. The code: #define TILE_DIM 4 __attribute__((reqd_work_group_size(TILE_DIM, TILE_DIM,...

HRZ
6 years ago
That compiler warning is misleading and does not always point to an actual problem in your code. According to the report, both the load from and the store to global memory are coalesced into 128-bit accesses, which indicates correct vectorization. The local buffer "tile" is replicated 28 times to provide fully parallel, stall-free accesses. A factor of 4 comes from your code having 4 non-coalescable reads on line 28 and one coalescable write on line 19: each Block RAM has two ports, writes are connected to all replicas while each read is connected to only one, so 4 reads and one write need 4 replicas. The buffer is replicated a further 7 times to support 7 work-groups running concurrently in the same compute unit, giving 4 × 7 = 28; this latter replication factor is a compiler decision that cannot be overridden by the user. All in all, there is nothing wrong with your code and I would say you can safely ignore the warning.
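
As a rough sanity check of the replication arithmetic in this post, here is a minimal sketch in plain C. It assumes the simplified port model described above (two ports per Block RAM, one taken by the write and one serving a single read) and is not the compiler's actual algorithm; the function names are hypothetical.

```c
#include <assert.h>

/* Simplified model (an assumption, not the compiler's real algorithm):
 * each Block RAM replica has two ports. One port takes the write, which is
 * broadcast to every replica; the other port serves exactly one read.
 * So the number of replicas equals the number of non-coalesced reads. */
int replicas_for_reads(int num_reads) {
    assert(num_reads >= 0);
    return num_reads;
}

/* The whole set of replicas is duplicated once per work-group that the
 * compiler schedules concurrently in the same compute unit. */
int total_replication(int num_reads, int concurrent_work_groups) {
    assert(concurrent_work_groups > 0);
    return replicas_for_reads(num_reads) * concurrent_work_groups;
}
```

With the numbers from the report (4 reads, 7 concurrent work-groups), `total_replication(4, 7)` gives the replication factor of 28 mentioned above.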

GRodr25

New Contributor

6 years ago

Yes, sure. I attached the report to this message. I really appreciate your help.

Attachment: reports.tar.gz (1.8 MB)

HRZ

Frequent Contributor

6 years ago

I made a small mistake: my Arria 10 compile with v16.1.2 did indeed give the same results as your report, but I didn't pay proper attention to the numbers and thought the results were the same as my compile with v19.4 on Stratix 10. On Arria 10, the compiler chooses a "bankwidth" of 64 bits, resulting in four reads and two 64-bit writes which, coupled with double-pumping, gives a replication factor of two for parallel accesses. It can be forced to 4 reads and one write by adding "__attribute__((memory, bankwidth(4*TILE_DIM)))", but that seems to slightly increase area usage, which is likely why the compiler opts for a bankwidth of 64 bits. On Stratix 10, however, Block RAM double-pumping is not supported for some magical reason, and hence the compiler has to choose a bankwidth of 128 bits to reduce the number of writes to one; otherwise it would be impossible to have stall-free accesses with two writes without double-pumping. In this case the replication factor will be what I mentioned in my first post.
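
To make the interaction between bank width and double-pumping concrete, here is a small sketch in plain C. It rests on assumptions that go beyond what the post states: a Block RAM is modeled with 2 ports, or 4 when double-pumped; every write occupies one port on every replica; the remaining ports serve reads; and `bankwidth` is given in bytes, so `4*TILE_DIM` with `TILE_DIM` 4 is 16 bytes (128 bits). The function names are hypothetical and the real compiler heuristics are more involved.

```c
#include <assert.h>

/* Number of bank-wide writes needed to cover one coalesced store.
 * Example: a 16-byte (128-bit) store into 8-byte (64-bit) banks needs 2. */
int writes_per_store(int store_bytes, int bankwidth_bytes) {
    assert(store_bytes > 0 && bankwidth_bytes > 0);
    return (store_bytes + bankwidth_bytes - 1) / bankwidth_bytes;
}

/* Replicas needed so that all reads and writes proceed without stalling.
 * Assumed port budget: 2 ports per Block RAM, 4 when double-pumped.
 * Writes go to every replica, so they consume ports on each one; the ports
 * left over serve reads. Returns -1 when the writes leave no read port,
 * i.e. stall-free access is impossible under this model. */
int replicas_needed(int reads, int writes, int double_pumped) {
    assert(reads >= 0 && writes >= 0);
    int ports = double_pumped ? 4 : 2;
    int read_ports = ports - writes;
    if (read_ports <= 0) {
        return -1;
    }
    return (reads + read_ports - 1) / read_ports; /* ceiling division */
}
```

Under these assumptions, the Arria 10 case (64-bit banks, so two writes, with double-pumping) gives `replicas_needed(4, 2, 1) == 2`. The Stratix 10 case (128-bit banks, one write, no double-pumping) gives `replicas_needed(4, 1, 0) == 4`. Two writes without double-pumping returns -1, which matches the point that stall-free access would be impossible there.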
