Ah, you are correct. This whole time I thought the small memtest used the DMA. If you took the memtest code, removed the flash testing, and switched the software over to the small printf routines, you could probably get it to fit in an 8 KB on-chip memory.
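To give a rough idea of how little code the RAM-only part needs, here is a minimal sketch. It is not the actual memtest source, just two generic checks (a walking-ones data bus test and a fill/verify pass), and the function names are made up for illustration. Reporting is left to the caller; on a Nios II HAL system I would expect something like alt_printf from sys/alt_stdio.h to be much smaller than the full printf, but treat that as an assumption to check against your BSP settings.

```c
#include <assert.h>   /* asserts compile out with -DNDEBUG to save space */
#include <stddef.h>
#include <stdint.h>

/* Walking-ones test on a single location.
 * Returns 0 if every bit reads back correctly, otherwise the first
 * pattern that failed (which points at the stuck or shorted data line). */
uint32_t test_data_bus(volatile uint32_t *addr)
{
    assert(addr != NULL);
    for (uint32_t pattern = 1; pattern != 0; pattern <<= 1) {
        *addr = pattern;
        if (*addr != pattern)
            return pattern;
    }
    return 0;
}

/* Fill the region with an index-derived value, then read it all back.
 * Returns the index of the first mismatching word, or `words` if the
 * whole region verified. */
size_t test_fill_verify(volatile uint32_t *base, size_t words)
{
    assert(base != NULL && words > 0);
    for (size_t i = 0; i < words; i++)
        base[i] = (uint32_t)i ^ 0xA5A5A5A5u;
    for (size_t i = 0; i < words; i++) {
        if (base[i] != ((uint32_t)i ^ 0xA5A5A5A5u))
            return i;
    }
    return words;
}
```

On the target you would call these with the SDRAM base address and size (in 32-bit words) from your system header, making sure the code and stack live in the on-chip memory and not in the SDRAM under test.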
If your VGA controller has a streaming input then maybe something like this would work well for you:
http://www.alterawiki.com/wiki/modular_sgdma_video_frame_buffer

Instead of phase shifting the SDRAM clock, I would recommend writing .sdc constraints for the SDRAM interface. The fitter will then place the SDRAM controller logic so that it meets the off-chip timing. You will need to read the SDRAM data sheet to find its timing parameters so that you can key them into the constraints. If you are new to TimeQuest this might take a while to learn, so what I refer to as the "lick your finger and hold it up to the wind" method (a phase-shifted clock) may be the quickest solution for you.
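If you do go the .sdc route, the arithmetic behind the constraints is simple. Here is a small helper sketch that turns data sheet numbers into set_input_delay / set_output_delay values. I am assuming the delays are referenced to a generated clock defined on the FPGA pin that drives the SDRAM clock, and that you know (or estimate) the board trace delays. The struct and function names, and the clock and port names you pass in, are all hypothetical; the timing values must come from your own SDRAM data sheet.

```c
#include <assert.h>
#include <stdio.h>

/* SDRAM data sheet timing, in ns. */
struct sdram_timing {
    double t_ac;  /* max clock-to-data-out (access time) */
    double t_oh;  /* min data-out hold after clock       */
    double t_ds;  /* data-in setup before clock          */
    double t_dh;  /* data-in hold after clock            */
};

/* Board trace delays, in ns (FPGA clock pin to SDRAM, and data traces). */
struct board_delay {
    double clk_min, clk_max;
    double data_min, data_max;
};

struct io_delay {
    double in_max, in_min;    /* for set_input_delay  -max / -min */
    double out_max, out_min;  /* for set_output_delay -max / -min */
};

struct io_delay sdram_io_delay(struct sdram_timing t, struct board_delay b)
{
    struct io_delay d;
    assert(b.clk_min <= b.clk_max && b.data_min <= b.data_max);
    /* Reads: clock travels to the SDRAM, data comes back to the FPGA. */
    d.in_max  = b.clk_max + t.t_ac + b.data_max;
    d.in_min  = b.clk_min + t.t_oh + b.data_min;
    /* Writes: data must meet setup/hold at the SDRAM relative to its clock. */
    d.out_max = b.data_max - b.clk_min + t.t_ds;
    d.out_min = b.data_min - b.clk_max - t.t_dh;
    return d;
}

/* Emit the four SDC lines; `clk` and `ports` are whatever names your design uses. */
void print_sdc(FILE *out, struct io_delay d, const char *clk, const char *ports)
{
    fprintf(out, "set_input_delay  -clock %s -max %.3f [get_ports {%s}]\n", clk, d.in_max, ports);
    fprintf(out, "set_input_delay  -clock %s -min %.3f [get_ports {%s}]\n", clk, d.in_min, ports);
    fprintf(out, "set_output_delay -clock %s -max %.3f [get_ports {%s}]\n", clk, d.out_max, ports);
    fprintf(out, "set_output_delay -clock %s -min %.3f [get_ports {%s}]\n", clk, d.out_min, ports);
}
```

You would still need the create_clock / create_generated_clock definitions in the .sdc file, and the address and control pins need output delays as well as the data pins. You can of course do the sums by hand; the point is just that each constraint is a data sheet number plus or minus the trace delays.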