Uncached memory accesses to Avalon slaves are also slower than accesses to cache (and to tightly coupled memory, which is a bit like a pre-filled cache!).
Since internal memory is dual ported, you can give the cpu 'tightly coupled' access and allow other masters to access via the Avalon 'bus'.
Give the cpu 'data' access to its own code.
This works best if you can separate out the code (without any data) into a separate memory area. The Altera-provided linker script doesn't do this, and their gcc4 build embeds data (jump tables) directly in the code segment!
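As a rough sketch of keeping a routine in its own code-only area: the section name .tcm_code, the TCM_CODE macro and the function scale_sample below are all made up for illustration. The real section name depends on what the memory is called in your system, and you would still need a linker script that maps that section onto the tightly coupled memory. The __nios2__ check is assumed to be how the target compiler identifies itself. GCC also has a -fno-jump-tables option that may stop it emitting jump tables, but whether the Altera gcc4 build honours it is not something checked here.

```c
#include <stdint.h>

/* Hypothetical section name: substitute the one your linker script
 * maps onto the tightly coupled instruction memory. */
#if defined(__nios2__)
#define TCM_CODE __attribute__((section(".tcm_code")))
#else
#define TCM_CODE /* host build: no special placement */
#endif

/* Written as an if-chain rather than a dense switch, so the compiler
 * has less reason to emit a jump table (data) alongside the code. */
TCM_CODE uint32_t scale_sample(uint32_t sample, uint32_t mode)
{
    if (mode == 0) return sample;
    if (mode == 1) return sample << 1;
    if (mode == 2) return sample << 2;
    return 0; /* unknown mode */
}
```

It is worth checking the disassembly or the linker map afterwards to confirm that nothing except instructions ended up in that area.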
You'll still need an instruction cache if you want to use any of the Altera boot code or JTAG debug: they don't export both ports of the relevant memory areas, so the code they contain can't be 'tightly coupled'.