Forum Discussion
Using the term "local memory" for single work-item kernels can be misleading: there are no threads running in parallel that could share data through local memory, and the "__local" qualifier does not make any difference in this kernel type, either. For single work-item kernels, any data that is not in external memory is implemented as buffers/FIFOs/RAMs that use FPGA on-chip memory resources (registers and Block RAMs). This includes all variables, channels, etc.
Autorun kernels do NOT have an interface to host or external memory; hence, even if you pass a global memory pointer to an autorun kernel, you will not be able to read from the global buffer in the autorun kernel since it is not connected to the memory interface.
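To illustrate, here is a minimal sketch of what an autorun kernel typically looks like, assuming the Intel FPGA SDK for OpenCL channels extension (older releases use the cl_altera_channels extension and the _altera call suffix instead). The channel names in_ch and out_ch and the kernel name scale are made up for the example. Note that the kernel takes no arguments at all; everything goes in and out through channels:

```c
// Sketch only: in_ch, out_ch and scale are hypothetical names.
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float in_ch;   // fed by some other kernel that reads global memory
channel float out_ch;  // drained by some other kernel that writes global memory

__attribute__((max_global_work_dim(0)))  // single work-item kernel
__attribute__((autorun))                 // starts on its own, no host launch
__kernel void scale(void) {
    // No kernel arguments: data arrives and leaves only through channels.
    while (1) {
        float v = read_channel_intel(in_ch);
        write_channel_intel(out_ch, v * 2.0f);
    }
}
```

Any access to a global buffer has to live in a regular (non-autorun) kernel at either end of the channels.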
I am not sure how you are judging that your kernel is slow. Judging by your code snippet, your bottleneck is likely the external memory transfers rather than the channel transfers, and the kernel will likely run at the same speed even if you remove the channels.
P.S. There is no need to load data from global memory into a separate variable and then write it into the channel. You can write directly from global memory to the channel.
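Something along these lines should work, again assuming the cl_intel_channels extension; data_ch, producer, consumer, src, dst and n are placeholder names, not taken from your code:

```c
// Sketch only: all names here are hypothetical.
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float data_ch;

__attribute__((max_global_work_dim(0)))
__kernel void producer(__global const float* restrict src, int n) {
    for (int i = 0; i < n; i++) {
        // The global memory load is passed straight to the channel write;
        // no intermediate variable is needed.
        write_channel_intel(data_ch, src[i]);
    }
}

__attribute__((max_global_work_dim(0)))
__kernel void consumer(__global float* restrict dst, int n) {
    for (int i = 0; i < n; i++) {
        // Likewise, the channel read can be stored directly to global memory.
        dst[i] = read_channel_intel(data_ch);
    }
}
```

This mostly just makes the code shorter; the temporary variable is unlikely to be what limits performance.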