Turn off "Force burst alignment". Most likely the DMA is posting a lot of bursts of 1, which is fairly inefficient.
The burst alignment feature is meant for SDRAM, which has a concept of a wrapping burst. Since on-chip memory and SRIO don't use wrapping bursts, it's best to disable it so that the DMA can post full bursts from the beginning. The large burst sizes are not helping either, since the write master can't start writing until it has enough data to complete a full burst (it doesn't start early because bursting locks the arbiter, which could lead to system performance problems). That means the burst reads are posted, the data trickles through the read data FIFO, enters the write data FIFO, and only when there is enough data in the write FIFO does the write burst begin. This is the initial overhead of the DMA. If the host can't keep it fed with descriptors fast enough, that initial overhead will be paid multiple times.
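To get a feel for how that startup overhead adds up, here is a toy model in Python. It is only a sketch under my own assumptions: one word moves per clock once data is flowing, the write master waits for a full burst's worth of words before starting, and that wait is paid once per descriptor when the host can't keep the DMA fed. The function name, the parameters, and the example numbers (1024-word descriptors, burst of 128, read latency of 4) are made up for illustration and don't come from the DMA documentation.

```python
def copy_cycles(total_words, words_per_descriptor, burst_words,
                read_latency=0, back_to_back=False):
    """Toy estimate of clock cycles for a copy (not a model of real hardware).

    Assumption: the write master waits until burst_words words are in its
    FIFO before starting. That fill delay (plus read_latency) is paid once
    if descriptors arrive back to back, otherwise once per descriptor.
    """
    descriptors = -(-total_words // words_per_descriptor)  # ceiling division
    fill = read_latency + burst_words
    startups = 1 if back_to_back else descriptors
    return total_words + startups * fill


total = 8192  # 32 kB of 32-bit words
starved = copy_cycles(total, 1024, 128, read_latency=4)
fed = copy_cycles(total, 1024, 128, read_latency=4, back_to_back=True)
print(starved, fed)  # 9248 8324
```

In this model a smaller burst size shrinks the per-descriptor penalty, which matches the point above that large bursts don't help here. A simulation will tell you what the real numbers are.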
The best on-chip RAM to on-chip RAM performance will be with bursting disabled. With bursting disabled, a data width of 32 bits, and a transfer length of 32 kB, I suspect the transfer should take approximately 8200 clock cycles (32 kB is 8192 32-bit words, so roughly one word per clock plus a little startup latency). I have been able to get around 95% efficiency out of SDRAM when copying data to and from the same memory. As the transfer size increases I have seen 97% utilization out of SDRAM, at which point the limit is more the memory than the DMA.
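As a rough check on the 8200-cycle figure, the arithmetic can be written out in a few lines of Python. This is just a sketch: it assumes 1 kB = 1024 bytes and an ideal rate of one word per clock, and the function names and the `startup_cycles` parameter are made up for illustration.

```python
def ideal_transfer_cycles(length_bytes, data_width_bits, startup_cycles=0):
    """Cycles to move length_bytes at one word per clock, plus a fixed startup cost."""
    bytes_per_word = data_width_bits // 8
    words = -(-length_bytes // bytes_per_word)  # ceiling division
    return words + startup_cycles


def efficiency(ideal_cycles, measured_cycles):
    """Fraction of the measured time that was spent moving data."""
    return ideal_cycles / measured_cycles


cycles = ideal_transfer_cycles(32 * 1024, 32)
print(cycles)  # 8192
```

If you count cycles in simulation or with a hardware timer, dividing the ideal count by the measured count gives an efficiency number you can compare against the 95% to 97% I mentioned for SDRAM.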
Normally when I'm trying to figure out efficiency problems I just simulate the transfer. When you see what is happening in the fabric, memories, and DMA it usually becomes very clear what the problem is.