Forum Discussion

Altera_Forum (Honored Contributor)
12 years ago

Size and persistence of __local memory

Hi,

I was trying to understand the implementation of __local memory better and have two questions.

1. Max size of local memory

-------------------------------

If I am not completely wrong, an OpenCL __local variable is placed in BRAM, and a Stratix V should have about 7 MB of that. So I was trying to allocate a large matrix like so:

__kernel
__attribute__((reqd_work_group_size(1, 1, 1)))
void foo(...) {
    __local char matrix[2048][2048];
    // read and write matrix here, otherwise it will be optimized away
}

(I also tried declaring matrix as __local in the parameter list with __attribute__((local_mem_size(2048 * 2048))) added; I also tried allocating it as a one-dimensional array; it all has the same result.)

If I compile this with aoc -c kernel.cl --report --estimate-throughput, I get the following two lines in the report:

# of RAMs (local mem) / compute unit : 262144

... more ...

; Memory blocks ; 10257% ;

The matrix should take 4 MB of space and should fit, I think, but the report says I'm 10,000 percent over.

Can anybody explain to me what's happening? Am I missing something? Is the report just assuming 16 kB as 100% (it seems so) while synthesis would actually succeed? Is there any other limit I am not aware of?

2. Persistence of __local memory

-------------------------------------

I was wondering whether data stored in __local memory survives kernel invocations. Assume I have a 1-compute-unit, 1-dimensional kernel that in its first invocation initializes the above __local matrix. Will the data still be available and valid the next time the kernel runs? I know that I cannot rely on this according to the OpenCL specification, and on a GPU that makes a lot of sense, because I don't know what other kernels may have run on the given compute unit. But here I know that no other kernel runs, and there is only one compute unit, dedicated to this kernel, so I assume that nobody overwrites the BRAM "just for the fun of it". I was wondering how bad this assumption really is, or whether it is outright wrong to rely on it (in the current version; it's OK if that may "change in the future").

Thanks in advance,

Christoph

7 Replies

  • Altera_Forum

    Unless your post has a typo, the memory block utilization is way above 100%. The number of RAMs being requested is 262144, and the device you are using probably only has around 2,500 RAM blocks. I suspect what is happening is that you are requesting a work-group size of 1,1,1 and the compiler is attempting to create hardware that can have multiple work-groups in flight. As a result, 4 MB would be needed for each work-group in flight, which adds up to a lot of local memory. One other thing to keep in mind is that on-chip RAM blocks are used for things other than just __local buffers as well.

    __local buffers are not preserved between kernel invocations. They are not cleared out between invocations, but the compiler also makes no attempt to preserve them, so even if the data happens to persist I wouldn't rely on that behavior.
  • Altera_Forum

    Thanks for your response.

    That is not a typo in my post; it is indeed 10 thousand percent utilization, which is why I was wondering how that happens with "only" 4 MB of shared memory.

    But your tip that the compiler may try to create hardware that can pipeline multiple work-groups seems to be correct. When I added __attribute__((task)) to my kernel, the utilization went down to 177%, which I can believe (given that BRAM is used for other things too, as you said).

    The utilization also goes down to 337% (that seems to be the lowest) if I set a larger work-group size (say 64x64). For a small but greater-than-1 work-group size the utilization is still way up. I don't know for sure, but the difference between the 177% for a task and the 337% for a 64x64 work-group may just come from the extra hardware needed for scheduling work-items.

    So it seems that your assumption is correct and the compiler tries to generate hardware that can have multiple work groups in flight and thus must have multiple local memories.
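    For reference, the task variant I compiled looks roughly like this (a sketch; the signature and loop body are placeholders standing in for my real access pattern):

```c
__kernel
__attribute__((task))
void foo(__global char *restrict out) {
    __local char matrix[2048][2048];
    // With a single work-item in flight, only one copy of the
    // __local memory needs to be instantiated.
    for (int i = 0; i < 2048; i++)
        for (int j = 0; j < 2048; j++)
            matrix[i][j] = (char)(i ^ j);
    out[0] = matrix[0][0];  // keep the matrix from being optimized away
}
```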

    Please let me know if you have any other thoughts on this issue. For example, could our analysis be wrong and the cause something else? Is there a pragma that tells the compiler to build hardware for only one work-group and not duplicate the shared memory? Or could compiling for multiple in-flight work-groups be considered a compiler bug when doing so runs out of local memory?

    Thanks

    Christoph
  • Altera_Forum

    The reason for the drop when using a task-based kernel is that it is a single work-item execution unit at that point (NDRange = 1). In other words, there is only a single __local memory (and as a task you might as well use __private memory in that case, since there are no other work-items in flight that need visibility into that memory).

    In the case of a bigger work-group, what is happening is that you have fewer work-groups in flight in the compute unit, because each work-group now has more than one work-item. Remember that __local memory is shared by all the work-items in a work-group; with a work-group size of 1, it takes lots of work-groups to keep the compute unit filled, and as a result lots of __local memories as well.

    At the moment there isn't a way to trim back the number of work-groups that get scheduled into the compute unit, which would help reduce your memory footprint. I don't really consider it a bug, because the compute unit is a deep pipeline; if it didn't keep many tiny work-groups in flight like you are seeing, the performance would drop. That said, I do see the need for the ability to trade off throughput in exchange for a smaller hardware footprint, and my colleagues have had discussions about this, so we are already taking it into consideration.

    Without seeing the kernel I'm not sure whether this would be a good recommendation, but using a task instead would be one way to control the hardware blowup. A task, or "single work-item execution" kernel, typically contains loops that iterate over the problem set that a normal NDRange kernel would cover using multiple work-items. So unlike NDRange kernels, where you normally flatten out the loops of an algorithm and replace them with parallel work-items, tasks typically resemble algorithms that still contain loops, the way you would write them for a host processor. Tasks on Altera FPGAs are efficient if there are no loop-carried dependencies in your algorithm; if such dependencies exist, the compiler outputs a message telling you the compute unit will be underutilized. If you take a look at the OpenCL optimization guide up on altera.com you'll see examples of what I'm talking about. If no dependencies exist, the compiler should be able to generate a compute unit that executes a loop iteration every clock cycle, assuming there is enough memory bandwidth available.
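    As an illustration of the difference (a hedged sketch, not the thread's actual kernel; names and signatures are made up): an NDRange kernel flattens the loop into work-items, while the task form keeps the loop and lets the compiler pipeline its iterations:

```c
// NDRange style: one work-item per element, no explicit loop.
__kernel void scale_nd(__global float *restrict data, float factor) {
    size_t gid = get_global_id(0);
    data[gid] *= factor;
}

// Task (single work-item) style: the loop stays in the kernel; with no
// loop-carried dependency, the compiler can pipeline one iteration per cycle.
__kernel void scale_task(__global float *restrict data, float factor, int n) {
    for (int i = 0; i < n; i++)
        data[i] *= factor;
}
```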
  • Altera_Forum

    Thanks for the quick and informative response! That all makes perfect sense and I think I understand much better now.

    I will experiment with __constant parameters a bit; maybe they are more appropriate for what I wanted to do anyway (load a big matrix only once and use it many times). Let's see how much constant memory I can get out of my system (using the --const-cache-bytes <N> aoc compiler flag). Otherwise I can always declare my kernel to be a task.
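    The __constant variant I have in mind would look something like this (a sketch; the kernel name and signature are placeholders, and whether the constant cache can actually be grown to 4 MB via --const-cache-bytes is an open question):

```c
__kernel void use_matrix(__constant char *restrict matrix,
                         __global char *restrict out) {
    size_t gid = get_global_id(0);
    // matrix is read-only; once its contents are in the constant cache,
    // repeated reads are served on-chip rather than from global memory.
    out[gid] = matrix[gid % (2048 * 2048)];
}
```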
  • Altera_Forum

    Keep in mind that if your kernel attempts to read from __constant memory and the contents are not already cached, the latency to fetch the data from global memory will be much higher. In general I only use __constant memory if I know the entire buffer will fit into the constant cache.

  • Altera_Forum

    Sorry to bring this thread up again, but is there any way to limit the number of work-groups in flight created by the compiler? If the compiler attempts to create hardware to support too many work-groups in flight, it may waste a lot of resources.

    --- Quote Start ---

    Unless your post has a typo the memory block utilization is way above 100% The number of RAMs being requested is 262144 and the device you are using probably only has around ~2500 RAM blocks. I suspect what is happening is you are requesting a work-group size of 1,1,1, and the compiler is attempting to create hardware that can have multiple work-groups in flight. As a result 4MB would be needed for each work-group in flight so that would be a lot of local memory. One other thing to keep in mind is that on-chip RAM blocks are used for other things than just the __local buffers as well.

    __local buffers are not preserved between kernel invocations. They are not cleared out between invocations but the compiler also makes no attempt preserve them either so if the data happens to persist I wouldn't rely on that behavior.

    --- Quote End ---

  • Altera_Forum

    I haven't worked on OpenCL in a while, so take this with a grain of salt since my information might be old. Last time I checked, the compiler optimizes for throughput, so if it scaled back the number of work-items in flight you'd end up with bubbles in the compute unit where some hardware sits idle, and the overall performance would suffer. Sometimes you can restructure your kernel to have a more optimal footprint, but when I was looking for savings I often looked at changing my kernel to be a single-threaded task kernel. Since tasks are single-threaded, there isn't a bunch of buffering hardware created to keep the compute unit full of work; instead, the compiler finds the parallelism in your algorithm and pipelines it accordingly.

    I recommend taking a look at the single-threaded task kernel documentation to see if your algorithm fits that programming model better. Not only does it often result in hardware savings, but you also benefit from simpler synchronization (only one work-item), and the body of the kernel typically ends up looking very similar to what you would run on a processor. Tasks are also important if you are going to stream data in and out of your kernel, since NDRange kernels do not offer much in terms of predefined ordering: imagine having millions of work-items popping a FIFO; it would be difficult to know which word ended up with which work-item, whereas a task can do this in a loop, which implies ordering.
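    A task reading a stream in order might use Altera's channels extension along these lines (a sketch; the channel name is made up, and the extension pragma and built-in names depend on the SDK version):

```c
#pragma OPENCL EXTENSION cl_altera_channels : enable

channel float in_stream;

__kernel void consume(__global float *restrict out, int n) {
    // A single work-item pops the FIFO in a loop, so word i is
    // unambiguously iteration i: ordering is implied by the loop.
    for (int i = 0; i < n; i++)
        out[i] = read_channel_altera(in_stream);
}
```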