The modular SGDMA design assumes that something (the host or the DMA itself) will shovel multiple descriptors into it. If you send it one descriptor at a time, you are using it like a standard DMA. There shouldn't be much difference in overhead, since you are either stuffing descriptors into the FIFO inside the modular SGDMA dispatcher or placing them in memory and letting the SGDMA go fetch them. The second approach is actually more overhead, since you have to maintain a linked list in memory, and adding to the list while the SGDMA is operating isn't trivial.
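To make the "shovel multiple descriptors into it" idea concrete, here is a rough software model of splitting one large transfer into several descriptors and pushing them into a descriptor FIFO. Everything in it is hypothetical and for illustration only: `desc_t`, `desc_fifo_t`, `fifo_push`, `queue_transfer` and the depth of 8 are made-up names and values, not the dispatcher's real descriptor format or driver API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical descriptor: the real dispatcher's layout is different. */
typedef struct {
    uint32_t read_addr;
    uint32_t write_addr;
    uint32_t length;      /* bytes */
} desc_t;

#define FIFO_DEPTH 8u     /* assumed descriptor FIFO depth */

/* Software stand-in for the descriptor FIFO inside the dispatcher. */
typedef struct {
    desc_t   slots[FIFO_DEPTH];
    unsigned head;        /* index of the oldest queued descriptor */
    unsigned count;       /* descriptors currently queued */
} desc_fifo_t;

/* Push one descriptor. Returns 0 on success, -1 if the FIFO is full. */
static int fifo_push(desc_fifo_t *f, desc_t d)
{
    if (f->count == FIFO_DEPTH)
        return -1;
    f->slots[(f->head + f->count) % FIFO_DEPTH] = d;
    f->count++;
    return 0;
}

/* Split a transfer into chunks of at most max_len bytes and queue as many
 * descriptors as fit. Returns the number of bytes actually queued, so the
 * caller can come back later with the remainder. */
static uint32_t queue_transfer(desc_fifo_t *f, uint32_t src, uint32_t dst,
                               uint32_t total, uint32_t max_len)
{
    uint32_t queued = 0;

    assert(max_len > 0);
    while (queued < total) {
        uint32_t chunk = total - queued;
        if (chunk > max_len)
            chunk = max_len;

        desc_t d = { src + queued, dst + queued, chunk };
        if (fifo_push(f, d) != 0)
            break;        /* FIFO full: stop and report progress */
        queued += chunk;
    }
    return queued;
}
```

With a zero-initialized FIFO, queuing 10000 bytes in chunks of up to 4096 produces three descriptors (4096, 4096 and 1808 bytes). Note there is no list in memory to maintain here, which is the overhead difference described above.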
Neither the SGDMA nor the modular SGDMA is capable of posting reads for a descriptor while the read data for the previous descriptor is still trickling in. The modular SGDMA will have this added for "Full word access only" mode, and I doubt the regular SGDMA will ever get this feature. It will allow the DMA to hide the latency between transfers for sequential descriptors, which is handy for high-latency links like PCI, PCIe, SRIO, etc.
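A back-of-the-envelope model shows why posting reads early matters on a high-latency link. This is only a sketch under simplifying assumptions of my own: data returns at one word per cycle once it starts flowing, the read latency is a fixed number of cycles, and the function names are made up for illustration. Real numbers will depend on the fabric and the slave.

```c
#include <assert.h>
#include <stdint.h>

/* Reads for the next descriptor are not posted until all data for the
 * previous one has returned, so the read latency is paid per descriptor. */
static uint64_t cycles_serialized(uint32_t n_desc, uint32_t words_per_desc,
                                  uint32_t read_latency)
{
    assert(words_per_desc > 0);
    return (uint64_t)n_desc * ((uint64_t)read_latency + words_per_desc);
}

/* Reads for the next descriptor are posted while earlier data is still
 * arriving, so the latency is paid once and data then streams back to back. */
static uint64_t cycles_overlapped(uint32_t n_desc, uint32_t words_per_desc,
                                  uint32_t read_latency)
{
    assert(words_per_desc > 0);
    if (n_desc == 0)
        return 0;
    return (uint64_t)read_latency + (uint64_t)n_desc * words_per_desc;
}
```

For example, 16 descriptors of 256 words each with a 100-cycle read latency cost 5696 cycles in the serialized model and 4196 cycles in the overlapped one. The gap grows with the latency and with the number of (small) descriptors.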
I suspect the inefficiency you are seeing is due to the high burst count you selected. If you chose a FIFO depth that is only 2x the maximum burst count, I could see this being very inefficient (simulate to find out why). The only difference between a burst transfer and a non-burst transfer is that the arbiter is locked for the entire burst, so other masters have to wait. Bursting is meant for interfaces like SDRAM, PCI, PCIe, SRIO, etc. Since on-chip memories don't support bursting, a burst adapter is inserted automatically for you, and it chops the bursts of 1024 into bursts of 1 (i.e. non-bursting). RapidIO, if I remember correctly, uses a max burst count of 32, so for your testing I recommend a burst count of 32 and a master FIFO depth of 4x that or greater.
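The two numbers above (what the burst adapter does to a burst of 1024, and the 4x FIFO depth rule of thumb) can be checked with a couple of small helpers. This is just arithmetic, not anything the tools generate; `adapter_sub_bursts` and `fifo_depth_ok` are hypothetical names, and the model assumes the adapter simply splits a master burst into pieces no larger than the slave's maximum burst count.

```c
#include <assert.h>
#include <stdint.h>

/* Number of sub-bursts produced when a master burst of master_burst beats
 * targets a slave that accepts at most slave_max_burst beats per burst.
 * A non-bursting slave such as on-chip memory has slave_max_burst == 1. */
static uint32_t adapter_sub_bursts(uint32_t master_burst,
                                   uint32_t slave_max_burst)
{
    assert(slave_max_burst > 0);
    return master_burst / slave_max_burst
         + (master_burst % slave_max_burst != 0 ? 1u : 0u);
}

/* Rule of thumb from the post: master FIFO depth of at least 4x the
 * maximum burst count. Returns 1 if the configuration satisfies it. */
static int fifo_depth_ok(uint32_t fifo_depth, uint32_t max_burst)
{
    return (uint64_t)fifo_depth >= 4u * (uint64_t)max_burst;
}
```

A burst of 1024 into on-chip memory comes out as 1024 single-beat transfers, while a burst of 32 into a slave that accepts 32 passes through as one burst. A FIFO depth of 2048 with a burst count of 1024 (the 2x case) fails the check, and a depth of 128 with a burst count of 32 passes it.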