Let me clear up a few things.
There are two types of custom instructions:
- combinatorial
- multi-cycle
The term register is ambiguous because you can't tell if I'm talking about one
of the CPU registers in the register file or just a basic storage device.
Let me use flop for the latter.
The combinatorial instructions can't write any flops and can't stall the CPU.
We don't allow them to write any flops because the Nios II/s and Nios II/f might
speculatively execute an instruction (mainly due to branch mispredictions) but
then kill it later in the pipeline. Notice that I didn't say combinatorial instructions
can't read from flops or can't read from external inputs.
If you really have some data available for the CPU to read that is guaranteed to
always be present (i.e. never needs to stall), I don't see why you can't use
a combinatorial custom instruction to read it. Writing it is not allowed because
of the speculative execution issue.
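To make the speculative execution issue concrete, here's a rough toy model in Python. It is not the real Nios II pipeline: the class ToyCustomLogic, its two operations, and the run() helper are all made up for illustration. The only thing it tries to show is that a killed instruction can throw away its result but can't undo a flop write that already happened.

```python
# Toy model only: names and behavior are assumptions, not Nios II internals.

class ToyCustomLogic:
    """Hypothetical custom-instruction logic with one internal flop."""

    def __init__(self):
        self.flop = 0  # internal storage element, not a CPU register

    def read_only(self, a, b):
        # Fine as a combinatorial instruction: it reads the flop
        # but stores nothing.
        return a + b + self.flop

    def accumulate(self, a, b):
        # Not fine as a combinatorial instruction: the flop is
        # written as a side effect of executing.
        self.flop += a
        return self.flop


def run(logic, op, instructions):
    """Execute (a, b, killed) tuples in order; return committed results."""
    committed = []
    for a, b, killed in instructions:
        result = op(logic, a, b)      # executes early, possibly speculatively
        if not killed:
            committed.append(result)  # a killed result is simply dropped,
                                      # but any flop write has already happened
    return committed


# The middle instruction is speculative and gets killed.
program = [(1, 0, False), (10, 0, True), (2, 0, False)]

bad = ToyCustomLogic()
bad_results = run(bad, ToyCustomLogic.accumulate, program)

good = ToyCustomLogic()
good_results = run(good, ToyCustomLogic.read_only, program)
```

With accumulate, the killed instruction still adds 10 to the flop, so the last committed result is 13 instead of 3. With read_only, the flop is untouched and killing the instruction has no lasting effect.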
The multi-cycle custom instructions execute later in the pipeline so are never
speculatively executed. This allows them to read or write any flop or external
values. They can also stall the pipeline as needed. These instructions always
stall the pipeline for at least one cycle to avoid slow paths from the custom
instruction into the pipeline (stall logic and register write data).
As you've noticed, if you set up your custom instruction to take N cycles
to execute, the pipeline always adds one more cycle to this. The pipeline
registers the result data provided by your custom instruction before muxing
it with the other sources of register write data (e.g. ALU, load data, multiply result, etc.).
You are allowed to have a multi-cycle custom instruction with N=1
although I seem to remember a bogus error/warning from the custom instruction
wizard in SOPC Builder if you try to do this. This should be fixed in Nios II 1.1.
So, in summary, here's the best you can do:
- Reading data from flop/external input that is always ready:
Implementation: Combinatorial custom instruction
Performance: 1 cycle per instruction
- Reading data from flop/external input that might not always be ready:
Implementation: variable-latency multi-cycle custom instruction
Performance: Number of cycles of latency + 1 more
- Writing data to flop/external output:
Implementation: fixed-latency multi-cycle custom instruction set up for one cycle
Performance: 2 cycles per instruction
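
The summary above can be written down as a small lookup. This is just a sketch of the rules in this post: the function name best_custom_instruction and its parameters are hypothetical, and the cycle counts are the "N + 1" figures described above, not measurements.

```python
def best_custom_instruction(access, always_ready=True, latency=1):
    """Return (implementation, cycles per instruction) per the summary.

    access:       "read" or "write" (labels invented for this sketch)
    always_ready: for reads, True if the data never needs a stall
    latency:      for reads that may stall, cycles the custom logic takes
    """
    if access == "read" and always_ready:
        # Never stalls and writes nothing, so combinatorial is allowed.
        return ("combinatorial", 1)
    if access == "read":
        # May need to stall, so it has to be multi-cycle; the pipeline
        # adds one cycle on top of the instruction's own latency.
        return ("variable-latency multi-cycle", latency + 1)
    if access == "write":
        # Writes must not be speculative, so multi-cycle with N=1,
        # plus the one extra cycle the pipeline always adds.
        return ("fixed-latency multi-cycle, N=1", 1 + 1)
    raise ValueError("access must be 'read' or 'write'")
```

For example, a read that takes 3 cycles to become ready comes out as a variable-latency multi-cycle instruction costing 4 cycles.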
Be careful if you have a multi-cycle custom instruction writing a flop/external output
and a combinatorial custom instruction reading the same data. The combinatorial custom
instruction won't read the latest value because it executes in an earlier pipeline stage
than the multi-cycle custom instruction. If you have this situation, you should also
use a multi-cycle custom instruction (with a fixed latency of 1 cycle) to read values.
This will cut down your read performance to 2 cycles per instruction.
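
Here's a toy timing model of that stale-read hazard. The stage offsets EARLY_STAGE and LATE_STAGE are assumptions picked only so that combinatorial instructions act earlier than multi-cycle ones; they are not the actual Nios II stage numbers, and the model ignores stalls.

```python
EARLY_STAGE = 0  # assumed offset where combinatorial instructions execute
LATE_STAGE = 2   # assumed later offset where multi-cycle instructions execute


def simulate(program):
    """Return the values seen by reads, in program order.

    program: list of (kind, op, value), issued one per cycle.
             kind is "comb" or "multi"; op is "write" or "read".
    """
    events = []
    for issue_cycle, (kind, op, value) in enumerate(program):
        stage = EARLY_STAGE if kind == "comb" else LATE_STAGE
        # The instruction touches the flop when it reaches its stage.
        events.append((issue_cycle + stage, issue_cycle, op, value))

    flop = 0
    reads = {}
    for _, issue_cycle, op, value in sorted(events):
        if op == "write":
            flop = value
        else:
            reads[issue_cycle] = flop
    return [reads[i] for i in sorted(reads)]


# Multi-cycle write followed by a combinatorial read: the read samples
# the flop before the write lands.
stale = simulate([("multi", "write", 7), ("comb", "read", None)])

# Same sequence with a multi-cycle read: both act at the same stage,
# so program order is preserved.
fresh = simulate([("multi", "write", 7), ("multi", "read", None)])
```

In this model the combinatorial read returns the old value 0, while the multi-cycle read returns 7.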
I hope this helps!