Hi Ken,
In most C programs, most data transfers go to and from the stack, where a data cache helps a lot.
If you have a stream of data to process, maybe you can control the streaming yourself and always read the next word from the same address (not cached and not in SDRAM, of course). You could also implement custom instructions to access the stream, which would be even faster. It is a pity, of course, that the data cache/SDRAM controller does not read a whole line of data (then you would pay the cache-miss penalty only on the first word of each cache line). You could also do a DMA transfer to an internal SRAM block and use that as a "cache" for further processing.
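To make the "same address" idea concrete, here is a minimal sketch in C. The function name stream_sum and the idea of summing the words are only placeholders for your real processing, and the port address would come from your own system. Note that volatile only stops the compiler from dropping or merging the reads; bypassing the data cache still needs whatever uncached access mechanism your processor provides.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reads n words from one fixed address (a stream
 * port) and accumulates them. The sum stands in for real processing. */
static uint32_t stream_sum(volatile const uint32_t *port, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += *port;   /* volatile: every iteration performs a real read */
    }
    return sum;
}
```

The hardware behind the port has to advance the stream by one word on every read for this to work.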
You also mentioned a 16-bit LUT, I think. If it is a smooth transfer function, it may be possible to reduce its size (to, let's say, 256 points) and interpolate in between (in FPGA "hardware"). Then you can put the LUT into an internal SRAM block. If you implement this with, e.g., custom instructions, you could achieve about 2 clocks per look-up. If you implement it in a clever way, the resource usage (LCs and RAM blocks) should not be too high, I think.
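As a software model of what the interpolating look-up would compute, here is a rough sketch in C. It assumes linear interpolation, a 16-bit input whose upper 8 bits select the table point and whose lower 8 bits are the fraction, and 16-bit table entries; the names lut and lut_interp are made up for illustration. With exactly 256 points there is no upper neighbour for the last point, so the last segment is simply held flat here.

```c
#include <stdint.h>

#define LUT_POINTS 256

/* Hypothetical reduced table, to be filled from the full transfer function. */
static uint16_t lut[LUT_POINTS];

/* Linear interpolation between two neighbouring table points. */
static uint16_t lut_interp(uint16_t x)
{
    uint32_t idx  = x >> 8;      /* upper 8 bits: table index        */
    uint32_t frac = x & 0xFFu;   /* lower 8 bits: position in segment */
    /* Clamp the upper neighbour so the last segment stays flat. */
    uint32_t next = (idx + 1u < LUT_POINTS) ? idx + 1u : idx;
    uint32_t y0 = lut[idx];
    uint32_t y1 = lut[next];
    /* Weighted average; the maximum of 65535 * 256 fits in 32 bits. */
    return (uint16_t)((y0 * (256u - frac) + y1 * frac) >> 8);
}
```

In hardware this is one RAM read of two neighbouring entries, one small multiplier and an adder, which is why it maps well onto a custom instruction.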
Regards,
Thomas