The SGDMA is a bit more complex, but isn't that difficult to use.
The main difference is that the DMA operation to execute is stored in memory, in a structure called a descriptor, instead of in the DMA registers. If you have a look at the driver made by Altera, you will find functions to build a descriptor and another function to start a transfer. Just be sure, when you design your SOPC system, that the descriptor read and write ports are connected to the memory you will put the descriptors in.
Descriptors can be chained, i.e. each descriptor can point to another descriptor describing the next operation to perform. This is how you can run a long sequence of transfers very efficiently without using the CPU.
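To make the chaining idea concrete, here is a small host-side model in plain C. It is only a sketch: `model_desc` and `model_run_chain` are names I made up, the structure is not the real Altera descriptor layout, and a `memcpy` loop stands in for the hardware. The real core also has its own way of marking the end of a chain, so check the driver for that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified descriptor. This is NOT the Altera SGDMA layout;
 * it only keeps the fields needed to show how a chain works. */
typedef struct model_desc {
    const uint8_t *src;        /* where to read from */
    uint8_t *dst;              /* where to write to */
    size_t length;             /* bytes to move for this descriptor */
    struct model_desc *next;   /* next operation, or NULL to end the chain */
} model_desc;

/* Software stand-in for the DMA engine: follows the chain from the first
 * descriptor, performs each copy, and returns the total bytes moved. */
size_t model_run_chain(const model_desc *d)
{
    size_t total = 0;
    while (d != NULL) {
        assert(d->src != NULL && d->dst != NULL);
        memcpy(d->dst, d->src, d->length);
        total += d->length;
        d = d->next;           /* no CPU decision needed between operations */
    }
    return total;
}
```

You would build the descriptors once, link them through `next`, and hand only the first one to the engine. That is the whole point of the chain: one start, many operations.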
As I said, each operation is limited to 65536 bytes transferred. The quick-and-dirty solution to transfer 32MB would be to create a chain of 512 descriptors and launch the SGDMA on it.
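The arithmetic behind that descriptor count can be sketched like this. The helper names `desc_count` and `desc_length` are mine, not part of any driver, and the per-descriptor limit is passed in as a parameter so you can plug in whatever maximum your version of the core really accepts.

```c
#include <assert.h>
#include <stddef.h>

/* Number of descriptors needed to move `total` bytes when one descriptor
 * can carry at most `max_chunk` bytes (rounds up). */
size_t desc_count(size_t total, size_t max_chunk)
{
    assert(max_chunk > 0);
    return (total + max_chunk - 1) / max_chunk;
}

/* Length of descriptor number `index` (0-based) in that chain.
 * Every descriptor is full except possibly the last one. */
size_t desc_length(size_t total, size_t max_chunk, size_t index)
{
    assert(max_chunk > 0);
    size_t offset = index * max_chunk;
    assert(offset < total);    /* index must be inside the chain */
    size_t remaining = total - offset;
    return remaining < max_chunk ? remaining : max_chunk;
}
```

With 32MB taken as 32 * 1024 * 1024 bytes, `desc_count(32u * 1024 * 1024, 65536)` gives 512. If the length field in your core turns out to hold at most 65535, the same call with 65535 gives 513, so it is worth checking the documentation before hard-coding the chain size.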
A more elegant solution would be to use a circular buffer of 4-5 descriptors, have the SGDMA raise an interrupt each time it finishes processing a descriptor, and write an interrupt handler that adds a new descriptor to the chain. By staying ahead of the SGDMA by 2 descriptors you'll manage to keep it busy almost 100% of the time. If you are familiar with Nios II interrupt handlers, I think such a solution would be doable in 60 to 80 hours.
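The bookkeeping for the circular buffer could look roughly like the following. Again this is a host-side model and not driver code: `ring_state`, `ring_start` and `ring_on_desc_done` are hypothetical names, the `owned_by_hw` flag is my simplified way of tracking which descriptors are still queued, and calling `ring_on_desc_done` by hand stands in for the per-descriptor interrupt. Addresses are left out to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4   /* descriptors in the circular buffer */
#define LEAD      2   /* descriptors kept queued beyond the one in flight */

typedef struct {
    size_t length;       /* bytes this descriptor will move */
    bool   owned_by_hw;  /* true while queued for, or used by, the engine */
} ring_desc;

typedef struct {
    ring_desc slot[RING_SIZE];
    size_t head;       /* next slot the software fills */
    size_t tail;       /* slot the engine is working on */
    size_t to_queue;   /* bytes not yet placed in a descriptor */
    size_t done;       /* bytes the engine has completed */
    size_t max_chunk;  /* per-descriptor byte limit */
} ring_state;

/* Fill the next free slot with the next chunk, if there is one. */
static bool ring_queue_next(ring_state *r)
{
    ring_desc *d = &r->slot[r->head];
    if (r->to_queue == 0 || d->owned_by_hw)
        return false;
    d->length = r->to_queue < r->max_chunk ? r->to_queue : r->max_chunk;
    d->owned_by_hw = true;
    r->to_queue -= d->length;
    r->head = (r->head + 1) % RING_SIZE;
    return true;
}

/* Prime the ring: one descriptor in flight plus LEAD more behind it. */
void ring_start(ring_state *r, size_t total, size_t max_chunk)
{
    assert(max_chunk > 0 && LEAD + 1 <= RING_SIZE);
    *r = (ring_state){ .to_queue = total, .max_chunk = max_chunk };
    for (int i = 0; i < LEAD + 1; i++)
        ring_queue_next(r);
}

/* Models the completion interrupt for the descriptor at `tail`:
 * retire it, then top the ring up again. Returns false if nothing
 * was pending, i.e. the whole transfer has finished. */
bool ring_on_desc_done(ring_state *r)
{
    ring_desc *d = &r->slot[r->tail];
    if (!d->owned_by_hw)
        return false;
    d->owned_by_hw = false;
    r->done += d->length;
    r->tail = (r->tail + 1) % RING_SIZE;
    ring_queue_next(r);
    return true;
}
```

In a real handler the refill step would build an actual descriptor and link it into the chain, but the head/tail logic stays the same. Keeping `LEAD` at 2 means the engine always has its next descriptor ready while the handler is still running.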