The answer turned out to be surprisingly simple. Make sure the physical address that the driver passes to the FPGA accelerator is ORed with 0x80000000 (that is, addr | 0x80000000). This puts the address in the ACP range, so the access is routed through the SCU. Otherwise, the address bypasses the SCU.
This also allows one to test performance with and without the ACP by simply changing the address at runtime.
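As a rough sketch of the address adjustment described above, the driver side could look something like the following. The helper name accel_bus_addr and the use_acp flag are hypothetical and not taken from any real driver. The only value that comes from this post is 0x80000000. The sketch also assumes the original physical address is below 0x80000000, so the bit is not already set.

```c
#include <assert.h>
#include <stdint.h>

/* Bit that moves an FPGA-to-HPS address into the ACP range (value from the post). */
#define ACP_ADDR_BIT 0x80000000u

/*
 * Hypothetical helper: returns the address to hand to the accelerator.
 *   use_acp != 0 -> address is ORed with 0x80000000 (goes through the ACP/SCU)
 *   use_acp == 0 -> address is passed unchanged (bypasses the SCU)
 * Assumption: phys is below 0x80000000, so the bit is not already set.
 */
static inline uint32_t accel_bus_addr(uint32_t phys, int use_acp)
{
    assert((phys & ACP_ADDR_BIT) == 0u);
    return use_acp ? (phys | ACP_ADDR_BIT) : phys;
}
```

Switching use_acp at runtime gives the with/without-ACP comparison mentioned above without rebuilding the FPGA design.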
Thanks to LegUp for the solution:
http://janders.eecg.toronto.edu/pdfs/euc14.pdf
Note: cache attributes set by hand in the top level HDL after Qsys compile are:
.f2h_ARCACHE (4'hf)
.f2h_ARPROT (3'h0)
.f2h_ARUSER (5'h1f)
.f2h_AWCACHE (4'hf)
.f2h_AWPROT (3'h0)
.f2h_AWUSER (5'h1f)
The ACDS version is 14.0.