Forum Discussion

Altera_Forum
Honored Contributor
14 years ago

IORD takes 20 cycles??

Hello,

We are having an IORD performance problem with a Nios II on a Cyclone III 3C25F324 FPGA.

In our design the CPU runs at a 100 MHz clock, and we have some custom FPGA blocks. We need to read the registers of these FPGA blocks as fast as possible to meet the performance requirements of the main loop. However, each IORD takes about 20 cycles.

Is this the normal performance of the Nios II? Do you have any ideas for speeding up IORD and IOWR?

Thank you all,

5 Replies

  • Altera_Forum
    Honored Contributor

    Check that your Avalon slave is 32 bits wide and clocked from exactly the same clock as the Nios CPU.

    With the /f core you should be able to get an uncached memory read from an Avalon slave in 3 cycles (I'm not sure it is possible to do better), plus the 2 instruction 'result delay'.

    Cycles to an M9K block behind a clock crossing bridge take 10 clocks.

    The fastest way to access peripherals is actually through the 'custom instruction' interface. Assuming everything is synchronised to the CPU clock, unclocked status reads are single cycle with no result delay (just a mux); clocked instructions can be single cycle but are subject to the result delay.
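
    To put the cycle counts in this thread into wall-clock terms, here is a minimal sketch that assumes the 100 MHz CPU clock from the original question. The helper names cycles_to_ns and reads_per_budget are illustrative and not part of any Altera API.

    ```c
    #include <stdint.h>

    /* Convert a cycle count to nanoseconds for a CPU clock given in MHz.
     * One cycle lasts 1000 / clock_mhz nanoseconds. */
    static inline uint32_t cycles_to_ns(uint32_t cycles, uint32_t clock_mhz)
    {
        return (cycles * 1000u) / clock_mhz;
    }

    /* Number of back-to-back register reads that fit in a time budget.
     * Assumes each read costs at least 1 ns (true for the figures below). */
    static inline uint32_t reads_per_budget(uint32_t budget_ns,
                                            uint32_t cycles_per_read,
                                            uint32_t clock_mhz)
    {
        return budget_ns / cycles_to_ns(cycles_per_read, clock_mhz);
    }
    ```

    With these helpers, the observed 20-cycle IORD is 200 ns, the 10-clock access through a clock crossing bridge is 100 ns, and the 3-cycle read plus 2-cycle result delay is 50 ns. In a 1 µs slice of the main loop that is 5 reads at 20 cycles each versus 20 reads at 5 cycles each.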
  • Altera_Forum
    Honored Contributor

    First of all, thank you for the quick reply.

    I am a software guy who is new to the Nios CPU and a complete novice with FPGAs.

    But as far as I know:

    * Yes, the FPGA blocks are running at 50 MHz (slower than the CPU).

    * I will check with the FPGA designer whether the Avalon slave is 32 bits wide and whether we are using M9K blocks.

    As you mentioned, we used a few custom instructions and they are pretty fast.

    I wonder if we can read the registers of the custom FPGA blocks through another custom instruction? Would such a design cause a conflict in the master-slave relationships?
  • Altera_Forum
    Honored Contributor

    If your fpga blocks are running at 50MHz then you'll have a clock crossing bridge - which will slow things down a lot.

    I'm not a hardware expert, but it might be possible to run the Avalon slave at 100MHz and use a synchronous 50MHz clock (eg from a divide by 2) for the rest of the logic in order to avoid the clock crossing bridge (or, maybe, synchronise below the Avalon slave interface).

    One thing I haven't determined is whether the 'readra' and 'readrb' bits of a custom instruction actually have any effect on the pipeline stall waiting for writes to the 'rA' and 'rB' registers.

    My suspicion is that (except for call and jmpi) the cpu always stalls on 'rA' and stalls on 'rB' if the low 2 bits of the instruction differ. Checking the bits of the custom instruction would take far too much logic.

    The effect is that every instruction has the 32bit instruction word, and the two register values from rA and rB available as inputs!

    I did a custom instruction for byte/bitswap that used the 'B' field to select the required operation.
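
    As a rough software model of that kind of instruction, the sketch below selects the operation from a 5-bit field, matching the width of the 'B' field in the instruction word. The operation codes and the names swap_op, swap_bytes32, swap_bits32 and swap_model are assumptions made for illustration; the actual encoding used in the hardware is not given here.

    ```c
    #include <stdint.h>

    /* Hypothetical operation codes; the real encoding is not stated. */
    enum swap_op {
        SWAP_NONE  = 0, /* pass the value through unchanged */
        SWAP_BYTES = 1, /* reverse the four bytes (endian swap) */
        SWAP_BITS  = 2  /* reverse all 32 bits */
    };

    /* Reverse the byte order of a 32-bit word. */
    static uint32_t swap_bytes32(uint32_t v)
    {
        return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
               ((v << 8) & 0x00FF0000u) | (v << 24);
    }

    /* Reverse the bit order: swap bits within each byte, then swap bytes. */
    static uint32_t swap_bits32(uint32_t v)
    {
        v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);
        v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);
        v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);
        return swap_bytes32(v);
    }

    /* Model of the instruction: 'a' is the rA value, 'op_field' stands in
     * for the 5-bit B field used as an operation selector. */
    static uint32_t swap_model(uint32_t a, unsigned op_field)
    {
        switch (op_field & 0x1Fu) {
        case SWAP_BYTES: return swap_bytes32(a);
        case SWAP_BITS:  return swap_bits32(a);
        default:         return a;
        }
    }
    ```

    In hardware each case is just a different wiring of the input bits feeding a mux, which is why a combinational custom instruction of this kind can complete in a single cycle.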
  • Altera_Forum
    Honored Contributor

    Because of another problem we haven't been able to work on these issues yet. We will return to them in a few days, and I will share the results of your suggested improvements.

    Currently we are using a very simple single-clock custom instruction, but we will implement more complex ones. I guess we will then see whether rA/rB stalls :) I hope we can come up with a solution.

    Thanks again..
  • Altera_Forum
    Honored Contributor

    I wrote a C program that analysed the object listing (generated by gcc) and calculated the actual execution time for the code.

    I then tweaked the C source to remove mis-predicted branches and stalls following memory reads (etc) - made easier because I arranged for the code to have no non-inlined function calls.

    To squeeze the last clock cycle out you need to:

    1) mark conditionals with __builtin_expect() to select the 'fall through' path

    2) put dummy asm instructions in otherwise empty parts of conditionals so that gcc will generate a forwards jump (to the asm contents) and then jump backwards

    3) use asm volatile("#gcc_membar, line " STR(__LINE__) "\n" ::: "memory") at various places to control which memory values gcc has cached in registers (can force reads early and force writes to avoid local variables)

    4) build a better gcc (see the wiki, gcc4 seems worse!) so that structures can be put into the 'small memory' area.

    5) get altera to tell you how to disable the dynamic branch predictor.

    6) don't use volatile for 8 or 16bit items (gcc masks/sign extends them after doing the correct memory read).

    I did manage to get my code to execute at the calculated rate.

    In particular I needed to minimise the worst-case code path, not the common one!
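
    Points 1 and 3 of the list above can be sketched roughly as follows for GCC. The macro names LIKELY, UNLIKELY and GCC_MEMBAR, the struct counters type and the process_sample function are all made up for this illustration. The barrier uses an empty asm string for portability; the variant in point 3 additionally embeds a "#gcc_membar, line N" comment so that each barrier is visible in the generated listing.

    ```c
    #include <stdint.h>

    /* Branch hints: tell gcc which outcome to lay out as the fall-through path. */
    #define LIKELY(x)   __builtin_expect(!!(x), 1)
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    /* Compiler-only barrier: gcc must write back cached memory values before
     * this point and reload them afterwards. It emits no instruction. */
    #define GCC_MEMBAR() __asm__ volatile("" ::: "memory")

    /* Hypothetical state, used only for this example. */
    struct counters {
        uint32_t events;
        uint32_t errors;
    };

    static uint32_t process_sample(struct counters *c, uint32_t sample)
    {
        /* The error case is hinted as unlikely, so the normal case falls through. */
        if (UNLIKELY(sample == 0xFFFFFFFFu)) {
            c->errors++;
            return 0;
        }
        c->events++;
        GCC_MEMBAR(); /* force the store to c->events to be issued here */
        return sample + 1u;
    }
    ```

    Which side of each conditional to hint depends on the path being optimised. When the goal is the worst-case path rather than the common one, the hint should favour the branch that lies on that worst-case path.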