Unfortunately, from a hardware perspective there is not much you can do to reduce the latency from the HPS into the FPGA.
I suspect what you need is a kernel driver that talks to your hardware directly. I think /dev/mem maintains a copy of the data and moves it to/from the destination, which adds an extra copy operation. Keep in mind I'm a hardware engineer, so I could be completely wrong. Your driver would mmap the region and provide APIs for accessing the hardware. I would search online for material on how to write a Linux device driver; this isn't specific to Altera SoCs, and there is a lot of material on the web. You might also find quite a bit of information on RocketBoards, for example:
https://rocketboards.org/foswiki/view/documentation/ws3developingdriversforalterasoclinux
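
To make the "mmap the region and provide APIs for accessing the hardware" idea a bit more concrete, here is a minimal sketch of what such an access layer might look like. It is written as userspace C rather than as a kernel driver, purely to show the shape of the API. Every name in it (struct reg_window, reg_window_open, reg_read32, reg_write32, reg_window_close) is made up for illustration. The path you map and the offset of your FPGA peripheral are board specific and are not given here, so treat them as assumptions you would fill in for your own design.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical handle for one mapped register window. */
struct reg_window {
    volatile uint32_t *base; /* start of the mapping */
    size_t span;             /* mapped length in bytes */
};

/* Map `span` bytes of `path` starting at `offset` (must be page aligned).
 * Returns 0 on success, -1 on failure. */
int reg_window_open(struct reg_window *w, const char *path,
                    off_t offset, size_t span)
{
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;

    void *p = mmap(NULL, span, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    w->base = (volatile uint32_t *)p;
    w->span = span;
    return 0;
}

/* Write one 32-bit register at a word-aligned byte offset. */
void reg_write32(struct reg_window *w, size_t byte_off, uint32_t value)
{
    assert(byte_off % 4 == 0 && byte_off + 4 <= w->span);
    w->base[byte_off / 4] = value;
}

/* Read one 32-bit register at a word-aligned byte offset. */
uint32_t reg_read32(const struct reg_window *w, size_t byte_off)
{
    assert(byte_off % 4 == 0 && byte_off + 4 <= w->span);
    return w->base[byte_off / 4];
}

/* Unmap the window and clear the handle. */
void reg_window_close(struct reg_window *w)
{
    munmap((void *)w->base, w->span);
    w->base = NULL;
    w->span = 0;
}
```

The point of the design is that the mapping is set up once and each register access afterwards is a single load or store through a volatile pointer, with no system call per access. A real kernel driver would expose the same region through its own mmap handler instead, and the Linux driver material linked above is the place to look for how that part is done.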