Why are you trying to reduce the delay based on simulation results? Are you getting negative slack in static timing analysis? If the delay causes a problem in simulation but static timing analysis reports positive slack, then the design may not be properly constrained for timing. Constrain it properly first, then see whether the Fitter improves the delay.
Without testing your code in Quartus, I'm not sure whether the read portion of your "if" statement describes two stages of registers, which would allow one of them to sit at the RAM block output. For the following, I'll assume that only the RAM inputs are being registered inside the RAM block.
If the read delay is too long because of the portion of the delay inside the RAM block, you can reduce it by using the optional output registers in the RAM block. That will of course add a clock cycle of pipeline delay to the read.
With a quick check I didn't find an example of RTL to describe a RAM with output registers, but there might be an example in the Quartus handbook at Volume 1, Section II, Chapter 6 "Recommended HDL Coding Styles".
I suspect it would be enough to describe an ordinary register in the RTL that the RAM output feeds directly. If "Auto Packed Registers" in the "More Fitter Settings" dialog box is set to anything other than "Off", the Fitter will likely implement the RTL registers using the output registers inside the RAM block even if synthesis implements the registers outside the RAM block.
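As a rough illustration of the "ordinary register fed directly by the RAM output" idea, here is a minimal Verilog sketch. It is an assumption on my part, not a verified Quartus template: the module name, port names, and parameters are hypothetical, the choice of Verilog over VHDL is arbitrary, and it has not been compiled in Quartus. Whether the second register actually ends up in the RAM block's output register still depends on the device, on synthesis, and on the Fitter setting mentioned above.

```verilog
// Sketch only: all names and widths here are hypothetical.
module ram_with_output_reg
#(parameter DATA_WIDTH = 8, parameter ADDR_WIDTH = 6)
(
    input                       clk,
    input                       we,
    input      [ADDR_WIDTH-1:0] write_addr,
    input      [ADDR_WIDTH-1:0] read_addr,
    input      [DATA_WIDTH-1:0] data,
    output reg [DATA_WIDTH-1:0] q
);
    // Memory array that synthesis is expected to map to a RAM block.
    reg [DATA_WIDTH-1:0] ram [0:2**ADDR_WIDTH-1];

    // Intermediate read data; nothing else may use this signal, so that
    // the register on q is fed directly by the RAM output.
    reg [DATA_WIDTH-1:0] ram_q;

    always @(posedge clk) begin
        if (we)
            ram[write_addr] <= data;

        // Stage 1: synchronous read of the memory array.
        ram_q <= ram[read_addr];

        // Stage 2: ordinary register on the read data. This is the
        // candidate for the RAM block's optional output register, and
        // it is what adds the extra clock cycle of read latency.
        q <= ram_q;
    end
endmodule
```

With this description the read data appears on q two clock edges after read_addr is presented, instead of one. After compiling, the Fitter reports should show whether the stage 2 register was placed inside the RAM block or left in ordinary logic.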