What he meant is that the delay you need is proportional to the length of the traces between the FPGA and the RAM. So if your custom board has shorter traces the delay is smaller, and if they are longer the delay is larger (it's the electrical signal propagation delay). So in short I would call it a propagation delay and not really a phase shift, because the delay itself is not clock dependent.
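To put rough numbers on that, here is a minimal sketch that estimates the one-way trace delay from the trace length and then expresses it as a phase shift at a few clock rates. The effective dielectric constant of 4.0 (typical for FR-4) and the 50 mm trace length are assumptions for illustration, not values from your board, and the function names are made up for this example.

```python
import math

C_MM_PER_NS = 299.792458  # speed of light in vacuum, in mm per ns


def trace_delay_ns(length_mm, er_eff=4.0):
    """Estimate one-way propagation delay of a PCB trace.

    er_eff is the effective dielectric constant seen by the signal.
    4.0 is an assumed ballpark for FR-4; check your board stackup.
    """
    return length_mm * math.sqrt(er_eff) / C_MM_PER_NS


def delay_as_phase_deg(delay_ns, clock_mhz):
    """Express a fixed time delay as degrees of one clock period."""
    period_ns = 1000.0 / clock_mhz
    return 360.0 * delay_ns / period_ns


if __name__ == "__main__":
    delay = trace_delay_ns(50.0)  # assumed 50 mm trace
    print(f"trace delay: {delay:.3f} ns")
    for clock_mhz in (50, 100, 120):
        phase = delay_as_phase_deg(delay, clock_mhz)
        print(f"{clock_mhz} MHz -> {phase:.1f} degrees")
```

The delay in nanoseconds stays the same at every clock rate; only the number of degrees changes, which is why it makes more sense to think of it as a propagation delay.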
So to measure this delay on the oscilloscope, put one probe as close as possible to the pin on the FPGA and another as close as possible to the pin on the RAM, then measure the time between the two signals on the scope (if you are running at high clock rates you will need a pretty fast scope to do this accurately). Also keep in mind that this is just the trace delay. If you wanted even more accuracy you would also need the delay from the NIOS to that trace, but that's not going to stay consistent between hardware compiles so I wouldn't bother.
But in the end, if it were me, I would probably just have done trial and error, since getting at those traces would require either scraping off the solder mask or soldering directly to vias (and if you know what those are you know to avoid doing that at all costs hehe). If you have no errors then you have pretty much found the required delay, and since that's a fixed parameter in your design you will not be able to get any more out of it. If you were able to get the Cyclone up to 120MHz then I would stay there (I can barely get my Stratix 1S10 much further than 125MHz and I use a completely on-chip design, so you are in pretty good shape).
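If you do go the trial and error route, one way to make it a bit more systematic is to try several delay settings, note which ones pass your memory test, and then sit in the middle of the passing range. The sketch below only does the bookkeeping part. The pass/fail results are made-up numbers, the function name is hypothetical, and running the actual memory test on the board for each setting is up to you.

```python
def widest_passing_window(results):
    """Find the longest run of consecutive passing settings.

    results: list of (phase_deg, passed) tuples, sorted by phase_deg.
    Returns (start, end, center) of the widest passing run, or None
    if nothing passed. Does not handle a window that wraps around.
    """
    best = None
    run_start = None
    for phase, passed in results:
        if not passed:
            run_start = None  # a failure ends the current run
            continue
        if run_start is None:
            run_start = phase
        if best is None or (phase - run_start) > (best[1] - best[0]):
            best = (run_start, phase)
    if best is None:
        return None
    return best[0], best[1], (best[0] + best[1]) / 2.0


if __name__ == "__main__":
    # Hypothetical sweep results: (phase shift in degrees, test passed?)
    sweep = [(-90, False), (-72, True), (-54, True),
             (-36, True), (-18, False), (0, False)]
    print(widest_passing_window(sweep))
```

Picking the center of the window rather than the first setting that works leaves some margin on both sides.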
Cheers