This thread totally slipped my mind. Sincere apologies for the very late reply. I was also pretty hesitant to post my solution, because I think my problem was simply caused by my ignorance of how PCIe devices and the IP work.
Anyway, there were two reasons why it didn't work for me:
(1) I was using a PCIe extension cable to connect the device
(2) I didn't successfully recompile the reference design with the version of Quartus I use
After I plugged the FPGA directly into the PCIe slot, regenerated (updated) the Qsys file, and successfully recompiled the entire design with the new version of Quartus, I was able to run the reference design at 5.X GB/s or more.
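In case it helps anyone debugging a similar setup: one possible way to check whether a cable or riser is hurting the link is to compare the negotiated PCIe link speed and width against the device's advertised maximum. The sketch below assumes a Linux host, where the kernel exposes these values as sysfs attributes under /sys/bus/pci/devices/. Whether the extension cable actually caused a degraded link in my case is an assumption, not something I measured, and the function names here are just illustrative.

```python
from pathlib import Path

# Standard Linux sysfs location for PCI devices (assumes a Linux host).
SYSFS_PCI = Path("/sys/bus/pci/devices")

LINK_ATTRS = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)


def read_link_status(device_dir):
    """Read the link attributes from one PCI device directory.

    Returns a dict mapping attribute name to its stripped text,
    or None for any attribute file that is not present.
    """
    device_dir = Path(device_dir)
    status = {}
    for name in LINK_ATTRS:
        attr = device_dir / name
        status[name] = attr.read_text().strip() if attr.exists() else None
    return status


def link_is_degraded(status):
    """Compare negotiated link values against the device's maximum.

    Returns True if speed or width is below the maximum, False if both
    match, and None if any attribute is missing (cannot tell).
    """
    if any(status[name] is None for name in LINK_ATTRS):
        return None
    return (
        status["current_link_speed"] != status["max_link_speed"]
        or status["current_link_width"] != status["max_link_width"]
    )
```

To use it, call something like `link_is_degraded(read_link_status(SYSFS_PCI / "0000:01:00.0"))`, where the address is a placeholder for the FPGA's own bus:device.function as shown by lspci. Note that the max_* values describe what the device advertises, so a slot that is itself narrower or slower than the card will also show up as a mismatch.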
Hope this helps.