Forum Discussion

Altera_Forum
Honored Contributor
8 years ago

Why does a single load/store consume so many RAMs?

Hi,

when I design an OpenCL kernel and run aocl, the report says that a single load consumes 13 RAMs and a single store consumes 16 RAMs.

e.g.:

__kernel void top_kernel(__global volatile int * restrict a,
                         __global volatile int * restrict b,
                         __global volatile int * restrict c) {
    int i;
    for (i = 0; i < 10000; ++i)
        c[i] = a[i] + b[i]; // consumes 13 x 2 RAMs (two loads) + 16 RAMs (one store)
}

6 Replies

  • Altera_Forum (Honored Contributor)

    The RAMs are used as FIFO buffers to minimize the negative effect of stalls caused by off-chip memory accesses which can have variable latency. There are also some extra RAMs used per off-chip memory access as a private cache, but since your kernel arguments have been defined as volatile, that cache will not be used.
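    A minimal sketch of the difference (kernel and variable names are illustrative, not from the thread): the only change between the two variants is the volatile qualifier, which removes the private cache while leaving the latency-hiding FIFOs in place.

    ```c
    // Sketch: with plain global pointers, the compiler may instantiate a
    // private cache per off-chip access, on top of the latency-hiding FIFO.
    __kernel void copy_cached(__global const int * restrict src,
                              __global int * restrict dst) {
        for (int i = 0; i < 10000; ++i)
            dst[i] = src[i];
    }

    // Sketch: volatile tells the compiler the data may change behind its
    // back, so the private cache is dropped; the FIFO RAMs (the 13 per
    // load / 16 per store reported above) remain.
    __kernel void copy_uncached(__global volatile int * restrict src,
                                __global volatile int * restrict dst) {
        for (int i = 0; i < 10000; ++i)
            dst[i] = src[i];
    }
    ```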

  • Altera_Forum (Honored Contributor)

    Thanks for your reply.

    Can we disable or decrease the RAM usage?
  • Altera_Forum (Honored Contributor)

    Other than disabling the cache using volatile, there is nothing else that can be done. Is there a specific reason you want to do this? Removing those FIFOs will have a very large negative impact on performance.

  • Altera_Forum (Honored Contributor)

    Yes, I had also read that using volatile disables the cache.

    I want to apply coarse-grained parallelism to the external memory accesses, e.g.:

    #pragma unroll
    for (i = 0; i < 64; ++i)
        for (j = 0; j < 1000; ++j)
            ... = a[i * 1000 + j];

    This consumes a lot of RAMs (64 * 16 RAMs), while an Arria 10 only has about 2700 RAM blocks in total.

    After that, there is little room left for other optimizations because of the lack of RAMs.

    Thanks.
  • Altera_Forum (Honored Contributor)

    Considering the very low memory bandwidth on current FPGA boards and the very high overhead of contention for off-chip memory accesses, you should actually avoid having parallel memory accesses and instead, unroll your memory accesses in a way that they will be coalesced into bigger ones, to minimize the number of ports to external memory. These ports, as you have noticed, waste a lot of space on the FPGA.

    In your code example, you are unrolling the i loop, while the memory accesses are not contiguous over the i dimension and hence, you get 64 memory ports. This, apart from very high area usage, will lower your memory bandwidth to near-zero due to constant contention between all those ports. However, if you partially unroll the j loop 64 times, since the accesses are contiguous, you will get a few large coalesced ports with very low area overhead, and you will get very close to theoretical memory bandwidth.
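    A concrete sketch of that suggestion (the kernel name and the reduction on the left-hand side are illustrative, since the original snippet elides them; the loop bounds come from the code above). Unrolling the inner j loop keeps consecutive unrolled iterations on consecutive addresses, so the compiler can coalesce them into a few wide ports, whereas unrolling i produces 64 strided ports.

    ```c
    __kernel void sum_rows(__global const int * restrict a,
                           __global int * restrict out) {
        int acc = 0;
        for (int i = 0; i < 64; ++i) {
            // Partial unroll over the contiguous dimension: the 64 unrolled
            // accesses touch consecutive addresses and can be coalesced into
            // one wide off-chip access. (Coalescing works best when the trip
            // count is a multiple of the unroll factor; 1000 is not, so an
            // unroll factor of 8 or a padded bound of 1024 would coalesce
            // more cleanly.)
            #pragma unroll 64
            for (int j = 0; j < 1000; ++j)
                acc += a[i * 1000 + j];
        }
        out[0] = acc;
    }
    ```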
  • Altera_Forum (Honored Contributor)

    I totally agree with your suggestion. I have tried it and it works.

    Thanks.