Why does one single load/store consume much RAMs?

Question

Hi,

when I design OpenCL and run aocl, it reports that one single load consume 13 RAMs and one single store consume 16 RAMs.

e.g

__kernel void top_kernel(__global restrict volatile int *a, __global restrict volatile int *b, __global restrict volatile int *c) {

int i;

for (i = 0; i < 10000; ++i)

c = a + b[i]; // it will consume 13x2 + 16 RAMs

}

altera_forum · Answer

The RAMs are used as FIFO buffers to minimize the negative effect of stalls caused by off-chip memory accesses which can have variable latency. There are also some extra RAMs used per off-chip memory access as a private cache, but since your kernel arguments have been defined as volatile, that cache will not be used.

altera_forum · Answer

--- Quote Start ---

The RAMs are used as FIFO buffers to minimize the negative effect of stalls caused by off-chip memory accesses which can have variable latency. There are also some extra RAMs used per off-chip memory access as a private cache, but since your kernel arguments have been defined as volatile, that cache will not be used.

--- Quote End ---

Thanks for your reply.

Can we disable or decrease the RAM usage?

altera_forum · Answer

Other than disabling the cache using volatile, there is nothing else that can be done. Is there a specific reason you want to do this? Removing those FIFOs will have a very large negative impact on performance.

altera_forum · Answer

--- Quote Start ---

Other than disabling the cache using volatile, there is nothing else that can be done. Is there a specific reason you want to do this? Removing those FIFOs will have a very large negative impact on performance.

--- Quote End ---

I also saw that using volatile can disable the cache.

I want to apply coarse grained parallel on external memory access, e.g.

# pragma unroll

for (i = 0; i < 64; ++i)

for (j = 0; j < 1000; ++j)

... = a[i * 1000 + j];

It will consume a lot of RAMs (64 * 16 RAMs) while total of RAMs is about 2700 RAMs in arria10.

After that, we have little optimization space because of lack of RAMs.

Thanks.

altera_forum · Answer

Considering the very low memory bandwidth on current FPGA boards and the very high overhead of contention for off-chip memory accesses, you should actually avoid having parallel memory accesses and instead, unroll your memory accesses in a way that they will be coalesced into bigger ones, to minimize the number of ports to external memory. These ports, as you have noticed, waste a lot of space on the FPGA.

In your code example, you are unrolling the i loop, while the memory accesses are not contiguous over the i dimension and hence, you get 64 memory ports. This, apart from very high area usage, will lower your memory bandwidth to near-zero due to constant contention between all those ports. However, if you partially unroll the j loop 64 times, since the accesses are contiguous, you will get a few large coalesced ports with very low area overhead, and you will get very close to theoretical memory bandwidth.

Forum Discussion

Why does one single load/store consume much RAMs?

6 Replies

Recent Discussions

Failed to run ip-setup-simulation:

Compile option not saved (reversed to default)

How to fix Error(23782): Failed to find an expected report

SSLC Login Issue – "You need to enroll" loop after OTP verification

Free Licence for Max+PlusII