Here's an update of my findings after a round of messing with DDR2 configurations.
(Of all the configurations I tried, the following two were the only ones that affected performance.)
First, though this may be obvious to DDR2 experts and experienced HW designers, reducing the Mode Register 0 Burst Length from 8 to 4 halves the performance (50% read BW utilization becomes 25%, and 95% write BW utilization becomes 50%). So I set it back to 8.
Second, I disabled reordering (i.e. unchecked "Enable Reordering" in the Controller Settings). This gave me near max read performance when using burst. For non-burst read operations, I get 50% utilization.
# [110483758] [DWR=000]: Reading data 00001ffd00001ffc @ c003fe (BRC=3/0/3f8 ) burst 6
# [110485633] [DWR=000]: Reading data 00001fff00001ffe @ c003ff (BRC=3/0/3f8 ) burst 7
# 1024 write operations using burst during 1043 AFI_CLK cycles, utilization of 98.17%
# 1024 read operations without burst during 2132 AFI_CLK cycles, utilization of 48. 3%
# 1024 read operations using burst during 1067 AFI_CLK cycles, utilization of 95.97%
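The utilization figures in the log above are consistent with the number of operations divided by the number of AFI_CLK cycles, truncated to two decimals. Here is a minimal Python sketch under that assumption; the function name is hypothetical and not part of the testbench. With this definition, the non-burst read line works out to 48.03%.

```python
def utilization_percent(operations: int, cycles: int) -> float:
    """Return operations per cycle as a percentage, truncated to two decimals.

    Assumption: utilization = operations / AFI_CLK cycles, which matches
    the figures printed in the simulation log.
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    # Integer arithmetic truncates (rather than rounds) to two decimals.
    return (operations * 10000 // cycles) / 100


# Operation and cycle counts taken from the simulation log.
runs = [
    ("write, burst", 1024, 1043),
    ("read, no burst", 1024, 2132),
    ("read, burst", 1024, 1067),
]

for label, ops, cycles in runs:
    print(f"{label}: {utilization_percent(ops, cycles):.2f}%")
```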
This was done in simulation, so I will have to try it on the real hardware.
Is it normal for read performance to drop when reordering is enabled?
I couldn't find anything about performance drops when enabling reordering in the "DDR2 and DDR3 SDRAM Controllers with UniPHY User Guide". Having the data reordering feature would be nice, since my hardware does not always do sequential reads/writes. The user guide also says that data reordering allows maximum efficiency, so I am wondering whether I am doing something wrong to get poor sequential read performance.