I'm not saying it's not understood; I'm saying most users who complain don't understand the complexities (e.g. a decent number of people on this forum are probably in college). You obviously do...
I'm not sure how you want to remove the two-cycle latency (removing it obviously makes the full path slower, which is a problem if the design doesn't already meet timing with those registers in there). Speed, area and latency are further trade-offs the FIFO has to deal with.
The latencies are pretty well documented:
http://www.altera.com/literature/ug/ug_fifo.pdf#page=14
My earlier point was not to rely on rdusedw to determine whether the FIFO is empty, as rdusedw is not intended for that. But since you know it has a latency of two, it's probably easier to decode that: double-register rdreq in parallel, and if rdusedw is at 2 and there were rdreqs on both of the last two cycles, assume it is empty. It's more complicated than that, but just an idea.
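To make the idea concrete, here is a rough behavioral model in Python (not HDL, and not anything from the Altera guide). The class name EmptyPredictor and the step() method are made up for illustration. It assumes rdusedw lags reads by exactly two cycles, and it generalizes "rdusedw is at 2 with two rdreqs" to "rdusedw is no larger than the number of reads still in flight". Writes landing in the same window are ignored, so treat it as a sketch only.

```python
class EmptyPredictor:
    """Hypothetical model: call step() once per read-clock cycle."""

    def __init__(self):
        # Two-stage delay line for rdreq (the "double-register" part).
        self.rdreq_d1 = False  # rdreq from one cycle ago
        self.rdreq_d2 = False  # rdreq from two cycles ago

    def step(self, rdusedw, rdreq):
        """Return True if the FIFO should be assumed empty this cycle.

        rdusedw: the (stale) word count currently reported by the FIFO.
        rdreq:   whether a read is being requested this cycle.
        """
        # Assumption: reads from the last two cycles are not yet
        # reflected in rdusedw, so count them as still in flight.
        in_flight = int(self.rdreq_d1) + int(self.rdreq_d2)
        assume_empty = rdusedw <= in_flight

        # Shift the delay line for the next cycle.
        self.rdreq_d2 = self.rdreq_d1
        self.rdreq_d1 = rdreq
        return assume_empty


if __name__ == "__main__":
    p = EmptyPredictor()
    # rdusedw stays at 2 while two back-to-back reads are issued.
    for rdusedw, rdreq in [(2, True), (2, True), (2, False)]:
        print(rdusedw, rdreq, p.step(rdusedw, rdreq))
```

In the third cycle the model sees rdusedw at 2 with two reads in the delay line and flags the FIFO as empty. Ignoring writes errs on the safe side here: at worst it reports empty a little early, never a read from an empty FIFO.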