Hi Ken,
You are right: if you are using the /f core, even internal RAM accesses that miss the cache are slow. As Jesse pointed out, using the /s core would remove a few clock cycles of delay, because it has no data cache and therefore no cache check is needed. Still, you cannot achieve 1 or 2 cycles, even with internal RAM.
Using custom instructions would be the fastest way, but also the more complicated one. Achieving 1-cycle accesses will be tricky, but it is possible for streaming data. For random accesses you need a second cycle, because the internal SRAM is always registered (at least on Cyclone devices).
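To make the streaming versus random difference concrete, here is a rough back-of-the-envelope model in C. It is only a sketch under my own assumptions: the function names are made up for illustration, and it assumes a registered memory with a fixed read latency (1 cycle here) and no other stalls. It is not a model of any particular core or custom-instruction interface.

```c
#include <assert.h>

/*
 * Hypothetical model, not taken from any vendor documentation:
 * a registered on-chip SRAM delivers data `latency` cycles after
 * the address is presented (latency = 1 for a single register stage).
 */

/* Streaming: addresses are known in advance, so a new address can be
 * issued every cycle and the latency is paid only once at the start. */
unsigned long streaming_cycles(unsigned long n, unsigned long latency)
{
    assert(latency >= 1);
    return n == 0 ? 0 : n + latency;
}

/* Random: each access waits for its own data before the next address
 * is issued, so every access pays one issue cycle plus the latency. */
unsigned long random_cycles(unsigned long n, unsigned long latency)
{
    assert(latency >= 1);
    return n * (1 + latency);
}
```

With a latency of 1, reading 1000 words costs 1001 cycles in the streaming case (close to 1 cycle per word) and 2000 cycles in the random case (2 cycles per word), which is the second cycle mentioned above.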
If you would like me to design it for you, I would be happy to hear from you by e-mail.
Regards,
Thomas