You are welcome.
By the way, if you are not already doing so, you might be able to get even more speed by using the "early done" feature of the DMA. A typical DMA transfer starts with the dispatcher telling the read and write masters what to do. The reads start, and once enough data has been read in and passed over to the write master, the write master starts writing to memory. At the end of the transfer the read master stalls while it waits for the last read data to return. Only when that last word arrives does it tell the dispatcher that it's done and ready for the next descriptor (the write master will still be writing out data at that point). With early done enabled in the descriptor control field, the read master signals to the dispatcher that it's done as soon as it issues its last read, without waiting for the read data to return.
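As a rough sketch of what enabling it looks like in software, here is how the control word might be built. The bit positions (24 for early done, 31 for go) and the names DESC_CTRL_EARLY_DONE_ENABLE, DESC_CTRL_GO and make_desc_control are my assumptions for illustration, not taken from your design, so check them against the descriptor format in the documentation for your DMA version.

```c
#include <stdint.h>

/* Assumed bit positions in the descriptor control field. Verify these
   against the register map of your DMA version before relying on them. */
#define DESC_CTRL_EARLY_DONE_ENABLE (UINT32_C(1) << 24) /* assumption */
#define DESC_CTRL_GO                (UINT32_C(1) << 31) /* assumption */

/* Hypothetical helper: build a descriptor control word from the caller's
   other flags, always setting "go" and optionally setting early done. */
static uint32_t make_desc_control(uint32_t base_flags, int early_done)
{
    uint32_t control = base_flags | DESC_CTRL_GO;
    if (early_done) {
        control |= DESC_CTRL_EARLY_DONE_ENABLE;
    }
    return control;
}
```

You would then write the returned value into the control field of each descriptor you hand to the dispatcher, for example make_desc_control(0, 1) for a plain transfer with early done turned on.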
Over PCIe, a read typically has more than 100 cycles of latency, so the early done bit lets you hide that latency: the read master starts issuing the reads for the next descriptor before the read data from the previous descriptor arrives. Instead of 100+ cycle gaps between transfers, you'll see only a couple of cycles. This of course assumes you have multiple descriptors already written into the dispatcher.
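To put a rough number on the savings, here is a back-of-the-envelope model, not a measurement. It assumes every boundary between two queued descriptors costs the same fixed number of idle cycles; total_gap_cycles is a made-up helper name.

```c
#include <stdint.h>

/* Hypothetical model: total idle cycles across a chain of back-to-back
   descriptors, assuming each boundary between two descriptors costs the
   same fixed gap. A chain of N descriptors has N - 1 boundaries. */
static uint64_t total_gap_cycles(uint32_t num_descriptors,
                                 uint32_t gap_per_boundary)
{
    if (num_descriptors < 2) {
        return 0; /* a single descriptor has no gap to hide */
    }
    return (uint64_t)(num_descriptors - 1) * gap_per_boundary;
}
```

With 64 queued descriptors, total_gap_cycles(64, 100) gives 6300 idle cycles for a 100 cycle read latency, while total_gap_cycles(64, 2) gives 126 if the gap shrinks to a couple of cycles. The real figures depend on your system, so treat these as illustrative only.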