I got back to working on that issue yesterday, and it seems that moving the heap, the stack, and the code to their respective tightly coupled on-chip memories (TCM) resolves the contention on the bus.
One small question related to that whole issue: my algorithm relies heavily on memory accesses, and I measured with a performance counter that it takes ~30 000 000 clock cycles. By a naive calculation this should translate to roughly 300 ms, but measuring with an interval timer I now see ~1200 ms, about four times as long. Does anyone have a good guess why the two methods differ that much? Or even better, does anyone have good advice on how to increase the utilization of the NIOS?
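For reference, here is the naive calculation as a small C sketch. Note that 300 ms for 30 000 000 cycles implies a 100 MHz clock; that frequency is an assumption inferred from my own numbers, and the helper names are just made up for illustration. The second helper turns the question around and asks what clock rate would make the cycle count match the timer reading.

```c
#include <stdint.h>

/* Convert a cycle count to milliseconds for a given clock frequency (Hz). */
static double cycles_to_ms(uint64_t cycles, uint64_t clock_hz)
{
    return (double)cycles * 1000.0 / (double)clock_hz;
}

/* Clock frequency (Hz) that would make `cycles` last `elapsed_ms`. */
static double implied_clock_hz(uint64_t cycles, double elapsed_ms)
{
    return (double)cycles / (elapsed_ms / 1000.0);
}

/* ASSUMPTION: 100 MHz system clock, inferred from 30e6 cycles ~ 300 ms. */
static const uint64_t ASSUMED_CLOCK_HZ = 100000000ULL;
```

With these, `cycles_to_ms(30000000ULL, ASSUMED_CLOCK_HZ)` gives the 300 ms estimate, while `implied_clock_hz(30000000ULL, 1200.0)` gives 25 MHz. So one thing I should probably check is whether the performance counter and the interval timer are really clocked the way I assumed, and whether both measurements cover the same section of code.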
Thanks for all the help so far.