Thank you for your explanation. However, I still have some questions.
When I ran some experiments at the assembly level, I got some strange numbers.
When I read from an address that is already in the d-cache, it takes one clock cycle. On the other hand, when I read from an address that is not in the d-cache, it takes much more than one extra cycle to access the on-chip memory on the Avalon bus. Even when I instruct Qsys not to add any registers in between to improve fmax, the result still does not make any sense.
So my question is: how many cycles does the CPU take to read from on-chip memory that is connected through the Avalon bus, in each of these cases?
1. using normal asm(ldw)
2. using IO asm(ldwio)
Thank you again for your support.