Hi Jan:
TCM is on-chip memory, so the size will be limited.
Is your SDRAM just 16 bits wide as well, or is it wider?
If you are going to, say, a 64-bit interface 16 bits at a time, it will be more efficient to pack the data into 64-bit words and then use the DMA to transfer 64 bits at a time.
It's something to look into at least.
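As a rough sketch of the packing idea, here is some plain C that is not tied to any particular DMA API. The function name pack16_to_64, the lane order (first sample in the low 16 bits), and the zero padding of a short final word are all my assumptions, so adjust them to whatever your interface expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: packs 16-bit samples into 64-bit words, four per word.
 * Assumption: sample 0 lands in bits 15..0, sample 1 in bits 31..16, and so on.
 * A final partial word is zero-padded.
 * Returns the number of 64-bit words written to dst. */
size_t pack16_to_64(const uint16_t *src, size_t n_samples, uint64_t *dst)
{
    assert(src != NULL && dst != NULL);

    size_t n_words = (n_samples + 3) / 4;   /* round up to whole 64-bit words */

    for (size_t w = 0; w < n_words; w++) {
        uint64_t word = 0;
        for (size_t lane = 0; lane < 4; lane++) {
            size_t i = w * 4 + lane;
            if (i < n_samples) {
                /* widen before shifting so the upper lanes are not lost */
                word |= (uint64_t)src[i] << (16 * lane);
            }
        }
        dst[w] = word;
    }
    return n_words;
}
```

The packed buffer is then what you would hand to the DMA, so each transfer moves a full 64-bit word instead of a single 16-bit sample.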
One thing to consider: if you use a different clock for the SDRAM than for the rest of your logic, there will be a latency penalty as well, since that requires clock-crossing logic to prevent metastability.
Pete