Based on your log, 309 RAMs are used by the BSP, 103 by the channel, and 643 and 83 by the two memory loads.
You cannot change or reduce the amount used by the BSP.
The channel depth you have requested is zero, but the compiler has decided that a depth of 4096 is better for you, hence the high RAM requirement. Channel depth is one of the things that the compiler regularly overestimates, yet there is no way for the user to override it.
The RAMs used for the external memory loads are mostly used for the private cache. You can reduce this amount by adding the "volatile" tag to your __global "coef" buffer. The cache can help a lot if your code accesses the same addresses repeatedly, but if it doesn't, the cache will be useless and just waste RAMs. Some RAMs will still be used for the access even with the volatile tag, because the compiler tries to hide the latency of the memory accesses by putting buffers between the kernel and the memory interface.
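
As a rough sketch of where the "volatile" tag goes, assuming a single work-item kernel: only the buffer name "coef" comes from your post. The kernel name apply_coef, the in/out buffers, the length n and the float type are made up for illustration, so adapt them to your actual kernel. The two #define lines are only there so the snippet also builds as plain C for a quick check; an OpenCL compiler defines __OPENCL_VERSION__ and skips them.

```c
/* Fallback so this sketch also compiles as plain C (host-side sanity check).
   An OpenCL compiler defines __OPENCL_VERSION__ and ignores these. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

/* Hypothetical kernel: only "coef" is taken from the original question.
   "volatile" on the coef pointer asks the compiler not to build a private
   cache for loads from this buffer. */
__kernel void apply_coef(__global volatile const float *restrict coef,
                         __global const float *restrict in,
                         __global float *restrict out,
                         int n)
{
    for (int i = 0; i < n; i++) {
        /* Each coef[i] is read once, so a cache would not help here anyway. */
        out[i] = in[i] * coef[i];
    }
}
```

After recompiling, the RAM count for that load should go down in the area report, though it should not be expected to reach zero because of the latency-hiding buffers mentioned above.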