--- Quote Start ---
if anybody has hints or examples - would be very appreciated!
--- Quote End ---
1) Look at the generated assembly code.
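You'll likely find that the CPU performs a single read (perhaps 64 bits wide), followed by a single write, and repeats that pair for every word. Each access is a separate bus transaction across the bridge, which immediately explains why the transfer is slow.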
2) Use SignalTap II to probe "something".
For example, if you cannot probe the bridge, probe the destination memory interface (Avalon-MM bus signals).
3) Read the documentation regarding DMA controllers, and then test it.
I have not used the HPS system, but I would assume they have DMA controllers, or allow a DMA controller in the FPGA fabric to access the HPS system buses.
One of the first things I do when determining whether a processor is suitable for a project is to test the DMA controller(s) to ensure my bus transfer requirements are met. A CPU generally cannot generate burst transactions, so timing memcpy() is not a good performance test.
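To make point 1 concrete, here is a minimal C sketch of the kind of copy loop a CPU ends up executing, plus a crude timer around it. The names copy_words() and copy_throughput_mb_s() are made up for illustration, and it runs on ordinary buffers; on real hardware you would pass pointers into the mapped bridge region instead, which is not shown here. Compile it and inspect the assembly for the loop to see the individual loads and stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: copy n_words 64-bit words, one element at a time.
 * The volatile qualifiers stop the compiler from merging, vectorising, or
 * dropping the accesses, so every element is read and written individually.
 * How each 64-bit access is emitted (one instruction or two) depends on
 * the target CPU. */
static void copy_words(volatile uint64_t *dst,
                       const volatile uint64_t *src,
                       size_t n_words)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n_words; i++) {
        dst[i] = src[i];
    }
}

/* Hypothetical helper: time one copy and return the throughput in MB/s
 * (decimal megabytes). Returns 0.0 if the elapsed time is too short to
 * measure. timespec_get() is standard C11; on Linux a monotonic clock
 * would be a better choice for real measurements. */
static double copy_throughput_mb_s(volatile uint64_t *dst,
                                   const volatile uint64_t *src,
                                   size_t n_words)
{
    struct timespec t0, t1;

    timespec_get(&t0, TIME_UTC);
    copy_words(dst, src, n_words);
    timespec_get(&t1, TIME_UTC);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0) {
        return 0.0;
    }
    return (double)(n_words * sizeof(uint64_t)) / secs / 1e6;
}
```

Calling copy_throughput_mb_s(dst, src, n) on a large buffer gives a number you can compare against what a DMA controller achieves over the same path. Treat it as a baseline for the CPU-driven case only, not as a measure of what the bus can do.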
Cheers,
Dave