Forum Discussion

Altera_Forum
21 years ago

NIOS II Custom Instruction access to CPU Regs

Hi,

I'm looking for a way to access the internal CPU registers via a custom instruction. I didn't find anything about this in the NIOS II documentation.

My problem is the following:

The interrupt response of the NIOS II is very slow

(NIOS "e-core" at 50MHz reaches the IRQ0 handler after about 12µs;

NIOS "f-core" at 50MHz reaches the IRQ0 handler after about 3µs)

so I'd like to speed up the interrupt service routines from Altera. To do this, I'd like to save the CPU registers into shadow registers via a custom instruction. Is this possible?

Thanks for your help,

6 Replies

  • Altera_Forum

    Custom instructions are coupled to the ALU, so the best you could do is save a couple of registers (2-3 actually). Custom instructions are meant for receiving 2-3 inputs and returning a single result to the NIOS core. So in short: no, this would not be possible.

    I'm curious, what is the hard deadline that your ISR needs to meet? Also, what memory are you using for your code and ISR code? You would want your "fast" code to be in fast memory (on-chip memory), so I would look into that. Also, if you can run your NIOS core faster, you may want to do that as well (if you are using Stratix you can easily clear 100MHz).

    Good luck
  • Altera_Forum

    Hi,

    I placed the service routine in on-chip memory, but the result is the same. I'm wondering why it takes so long, because I counted the assembler instructions for alt_irq_entry.h and alt_irq_handler and calculated an interrupt response time of about 1.5µs (target system is a NIOS II f core at 50MHz). I'd like to know why my system needs so much more time. Do you have any experience with interrupt service request times on NIOS II? We would just like to know how much time an interrupt needs on a NIOS II system.

    Thanks
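
    To compare the figures quoted so far on one scale, it helps to express them in clock cycles. The following is a minimal sketch in plain C (not NIOS-specific); the function name latency_ns_to_cycles is made up for illustration. At 50MHz it gives 600 cycles for the 12µs e-core measurement, 150 cycles for the 3µs f-core measurement, and 75 cycles for the 1.5µs estimate.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical helper: convert a latency in nanoseconds to CPU clock
     * cycles at the given clock frequency, rounded to the nearest cycle. */
    static uint32_t latency_ns_to_cycles(uint32_t latency_ns, uint32_t clock_hz)
    {
        assert(clock_hz > 0);
        /* cycles = time * frequency; 64-bit intermediate avoids overflow. */
        uint64_t scaled = (uint64_t)latency_ns * clock_hz;
        return (uint32_t)((scaled + 500000000u) / 1000000000u);
    }
    ```

    For example, latency_ns_to_cycles(3000, 50000000) returns 150, so the measured f-core latency is about twice the 75 cycles obtained by counting instructions.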
  • Altera_Forum

    The interrupt routine is slow because most of the time the code itself doesn't come out of the instruction cache. And if it does, that is certainly not the worst-case timing.

    Storing and restoring the registers is also slow. It can be faster if a data cache is available: because interrupt routines are generally small and use small data sets, the register contents will often still be in the data cache. Do not think internal RAM is a solution for everything. I did some small benchmarks, and it seems to take 3 or more cycles to read a value and around 2 to store one. So it is not single cycle (if I'm wrong, please let me know).

    Also keep in mind: if interrupts are not re-enabled, only the first interrupt that comes in will be handled fast. The priority mechanism does not work in this case (at least not for worst-case timing).

    So some tips:

    1. Write it in assembler (including the calculation of the offset into the function table and the call to the function).

    2. Figure out when it is possible to re-enable interrupts. It takes some time to re-enable and disable them, but it can save a lot on worst-case latency.

    3. Use an inner loop (between register saving and restoring) to handle lower-priority interrupts that became active while the higher-priority interrupt was being handled. (This means less overhead for saving and restoring, while high-priority interrupts still get a chance to come in.)

    4. Be careful with the fast core and the data cache. Check what can be cached and what cannot. There also seems to be a strange issue when re-enabling interrupts on the fast core. If you ever see strange effects that go away when the data cache is disabled, let me know; then I know I'm not the only one with this issue, and maybe Altera will listen.

    Stefaan.
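
    Tip 3 can be sketched roughly as follows. This is a host-runnable simulation in C, not the Altera HAL code: handler_table, context_table, simulated_pending, read_pending, dispatch_pending and demo_handler are all hypothetical names, and the pending bits are a plain variable standing in for the CPU's pending-interrupt register. It assumes a lower IRQ number means higher priority and that every handler acknowledges (clears) its own interrupt.

    ```c
    #include <assert.h>
    #include <stdint.h>

    #define NUM_IRQS 32

    typedef void (*irq_handler_t)(void *context);

    /* Hypothetical handler table, filled at registration time. */
    static irq_handler_t handler_table[NUM_IRQS];
    static void *context_table[NUM_IRQS];

    /* Simulated stand-in for the hardware pending-interrupt bits. */
    static volatile uint32_t simulated_pending;

    static uint32_t read_pending(void) { return simulated_pending; }

    /* Called once, after the registers have been saved. Services the
     * lowest-numbered pending IRQ, then re-reads the pending bits, so
     * interrupts that arrive meanwhile are handled without another
     * save/restore. Returns the number of handlers that were run. */
    static unsigned dispatch_pending(void)
    {
        unsigned handled = 0;
        uint32_t pending;

        while ((pending = read_pending()) != 0) {
            unsigned irq = 0;
            while ((pending & 1u) == 0) {   /* find lowest set bit */
                pending >>= 1;
                irq++;
            }
            assert(handler_table[irq] != 0);
            handler_table[irq](context_table[irq]);
            handled++;
        }
        return handled;
    }

    /* Demo handler: clears its own pending bit and logs the service order.
     * IRQ 3 also raises IRQ 1 to mimic an interrupt arriving mid-handler. */
    static unsigned service_log[8];
    static unsigned service_count;

    static void demo_handler(void *context)
    {
        unsigned irq = *(unsigned *)context;
        simulated_pending &= ~(1u << irq);
        if (irq == 3)
            simulated_pending |= 1u << 1;
        if (service_count < 8)
            service_log[service_count++] = irq;
    }
    ```

    With IRQs 3 and 5 pending on entry, this services 3, then the newly raised 1, then 5, all inside a single save/restore pair. A real version would read the pending register of the CPU instead of a variable and, as in tip 1, would probably be written in assembler.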
  • Altera_Forum

    Hi Stefaan,

    Thanks for your information. That seems strange. I thought that with the F core it would take only 1 cycle to store and load data from internal RAM. In the SOPC designer the internal RAM is created with no wait states, so I thought I only had to look at the "fundamental slave read transfers" of the Avalon bus, which means 1 cycle. The same goes for the write transfer (Avalon Bus Specification Reference Manual, pages 27 and 37). Did I misunderstand something?

    Holger
  • Altera_Forum

    Holger,

    Look at the processor documentation, page 190 of the reference handbook (PDF page number).

    Load and store with an Avalon transfer: >1 cycle, so it could be 100 without violating the spec. So even with a single-cycle RAM access, and no other peripherals accessing the bus at that moment (DMA transfers, instruction loads, ...), it is more than one cycle according to the spec. I don't think Altera would put '>' in the documentation if there were a chance of '>='.

    Load and store without an Avalon transfer: = 1 cycle. So if the cache contains the data, it should take 1 cycle; this has nothing to do with internal or external RAM. When an interrupt occurs, the caches will normally not contain the addresses where you want to store the registers. And even if they do contain them (e.g. when you have just left a function with a big stack frame, and the addresses on the stack are reused for the interrupt stack frame), you cannot count on that for worst-case values.

    Stefaan
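
    To see how much the per-access cycle count matters for the register save and restore, here is a small worked calculation in C. The function name save_restore_cycles and the choice of 16 saved registers are assumptions for illustration only; the 2-cycle store and 3-cycle load are the benchmark figures mentioned earlier in this thread.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical estimate: cycles spent storing num_regs registers on
     * ISR entry and loading them back on exit, given a per-access cost. */
    static uint32_t save_restore_cycles(uint32_t num_regs,
                                        uint32_t store_cycles,
                                        uint32_t load_cycles)
    {
        assert(num_regs <= 32);  /* NIOS II has 32 general-purpose registers */
        return num_regs * (store_cycles + load_cycles);
    }
    ```

    With 16 registers, save_restore_cycles(16, 2, 3) gives 80 cycles (1.6µs at 50MHz), compared with 32 cycles if every access really took a single cycle. The gap grows further if the accesses have to wait for the bus.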
  • Altera_Forum

    Given everything Avalon does for you, don't expect one clock cycle. Just remember that most of the cores, if not all, are on that Avalon bus. As was mentioned before, when you service an interrupt the state of the CPU must be saved, which is a lot of data (control registers and data registers).

    ots_dev, the internal memory is used for the cache, but it is not accessed over the Avalon bus. Using the NIOS II pipeline, they are effectively able to get the data out in one clock cycle. When accessing memory that you have added to your system (internal, SRAM, SDRAM, etc.), your transfers occur across the Avalon bus (which may not be free, so waiting may occur). The reason why I suggested internal memory is that you do not want to be transferring to a slow storage device in your ISR (you can have as low a latency as you want, but if your ISR is slow there is no point really).

    I think for now the best you can do is implement this in assembly to cut the latency down. I haven't done any assembly on NIOS II yet, but on some other processors I have done things like skipping the save of the register set (setting aside registers that I never touch elsewhere for use in the ISR). But in most cases, if I want a latency lower than what you are seeing, I just do the algorithm in hardware and leave the processor out of the loop.

    Good luck