Forum Discussion
Altera_Forum
Honored Contributor
10 years ago
I haven't worked on OpenCL in a while, so take this with a grain of salt since my information might be old. Last time I checked, the compiler optimizes for throughput, so if it scaled back the number of work-items in flight you'd end up with bubbles in the compute unit where some hardware sits idle and overall performance suffers. Sometimes you can restructure your kernel to have a more optimal footprint, but when I was looking for savings I often looked at changing my kernel to a single-threaded task kernel. Since tasks are single-threaded, the compiler doesn't create a bunch of buffering hardware to keep the compute unit full of work; instead it finds parallelism in your algorithm and pipelines it accordingly.
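To make the contrast concrete, here is a minimal sketch (the kernel names and the vector-scale workload are made up for illustration): an NDRange kernel where each work-item handles one element, next to an equivalent single-work-item task kernel where the compiler pipelines the loop iterations instead of buffering many work-items.

```c
// Hypothetical NDRange kernel: one work-item per element. The compiler
// generates buffering hardware to keep the pipeline full of work-items.
__kernel void scale_ndrange(__global const float *in,
                            __global float *out,
                            float factor)
{
    int gid = get_global_id(0);
    out[gid] = in[gid] * factor;
}

// Equivalent single-work-item (task) kernel: one work-item runs the whole
// loop, and the compiler pipelines successive loop iterations instead.
__kernel void scale_task(__global const float *in,
                         __global float *out,
                         float factor,
                         int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] * factor;
}
```

Notice the task version reads almost exactly like the loop you would write for a CPU, which is part of the appeal.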
I recommend taking a look at the single-threaded task kernel documentation to see if your algorithm fits that programming model better. Not only does it often result in hardware savings, but you also get the benefit of simpler synchronization (only one work-item), and the body of the kernel typically ends up looking very similar to what you would run on a processor. Tasks are also important if you are going to stream data in and out of your kernel, since NDRange kernels don't offer much in terms of predefined ordering. Imagine having millions of work-items popping a FIFO: it would be difficult to know which word ended up with which work-item, whereas a task can do this in a loop, which implies ordering.
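For the streaming case, a single-work-item kernel popping a FIFO might look like the sketch below. I'm assuming the Altera channels extension here (the extension pragma, the channel name, and the kernel are illustrative, and the exact intrinsic names may differ across SDK versions):

```c
#pragma OPENCL EXTENSION cl_altera_channels : enable

// FIFO feeding this kernel, e.g. written by a producer kernel.
channel float in_stream;

// Single-work-item consumer: iteration i always receives the i-th word
// from the FIFO, so the loop order implies the data order.
__kernel void consume(__global float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = read_channel_altera(in_stream);
}
```

With millions of NDRange work-items reading the same channel there is no such guarantee about which work-item gets which word.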