--- Quote Start ---
In your scenario, the two operands (32 bits at a time) have to be clocked in for 32 clock cycles into the on-chip memory so that the BIGNUM module can read the operands from it. So there will be some latency in getting the operand values.
--- Quote End ---
Correct. It's kind of like a cache, and the "cache miss" penalty is quite large. It only makes sense to use the memory and keep the operands around if you think you will be using them more than once.
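To put rough numbers on that penalty, here is a small C sketch. It assumes an ideal bus that moves one beat per clock with no wait states; the function names are made up for illustration and are not part of any Altera API.

```c
#include <assert.h>

/* Clocks needed to move one operand across a bus, assuming one beat per
 * clock and no wait states (an idealization). */
unsigned load_cycles(unsigned operand_bits, unsigned bus_bits)
{
    assert(bus_bits > 0);
    return (operand_bits + bus_bits - 1) / bus_bits;  /* ceiling division */
}

/* Load cost charged to each operation when an operand that has been
 * loaded once is then used `uses` times before being replaced. */
double amortized_load_cycles(unsigned operand_bits, unsigned bus_bits,
                             unsigned uses)
{
    assert(uses > 0);
    return (double)load_cycles(operand_bits, bus_bits) / uses;
}
```

With a 1024-bit operand on a 32-bit bus, load_cycles(1024, 32) is 32. If the operand is used four times, the cost per use drops to 8 clocks; if it is used once, you pay the full 32.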
--- Quote Start ---
However, I see a small issue with the BIGNUM module being a custom instruction. The custom instruction is an instruction which we implement for the NIOS II processor, right? So is it possible for us to modify the data path for a custom instruction like this, i.e., reading from the on-chip memory rather than from the NIOS II data bus?
--- Quote End ---
My suggestion is to implement a single new IP component which has two interfaces: an Avalon-MM master and a custom instruction interface. The Avalon-MM master is for the data path (1024-bit operands), and the custom instruction is for the control path (opcodes).
--- Quote Start ---
Also, the output from a custom instruction goes to the NIOS II (which is again a 32-bit data path; I am not sure it can return 1024 data bits, and we might have to wait for 32 clocks), and unfortunately a custom instruction has only one output signal port. For an adder there will be two signal outputs, the SUM and the CARRY. For the above reasons I had thought of coming up with an IP-core-like design, which is more flexible.
--- Quote End ---
It's your component and you can do whatever you like to meet your needs, but in my diagram I was assuming that BIGNUM would write the output back to the memory rather than return it through the custom instruction interface. In other words, the NIOS tells the BIGNUM where to put the result.
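As a rough illustration of that split, here is a plain C software model, not HDL and not a real Nios II API. Everything in it is an assumption made for the sketch: the opcode values, the bignum_ci function standing in for the custom instruction, the word offsets passed in the two 32-bit inputs, and little-endian word order. The 1024-bit sum goes to memory, and only the carry comes back through the single 32-bit result port.

```c
#include <assert.h>
#include <stdint.h>

#define BIGNUM_WORDS 32u                 /* 1024 bits = 32 x 32-bit words */
#define MEM_WORDS    (4u * BIGNUM_WORDS) /* hypothetical on-chip memory size */

/* Hypothetical opcodes carried by the custom instruction. */
enum { OP_SET_DEST = 0, OP_ADD = 1 };

/* Stand-in for the on-chip memory reached through the Avalon-MM master. */
uint32_t onchip_mem[MEM_WORDS];

/* Word offset where results are written; set with OP_SET_DEST. */
uint32_t dest_off;

/* Model of the custom instruction. The two 32-bit inputs carry word
 * offsets into the memory, never the operands themselves. The 32-bit
 * return value carries only status: here, the carry out of the add. */
uint32_t bignum_ci(int opcode, uint32_t dataa, uint32_t datab)
{
    if (opcode == OP_SET_DEST) {         /* datab is ignored for this opcode */
        assert(dataa + BIGNUM_WORDS <= MEM_WORDS);
        dest_off = dataa;
        return 0;
    }

    assert(opcode == OP_ADD);
    assert(dataa + BIGNUM_WORDS <= MEM_WORDS);
    assert(datab + BIGNUM_WORDS <= MEM_WORDS);

    uint64_t carry = 0;
    for (uint32_t i = 0; i < BIGNUM_WORDS; i++) {
        /* Word 0 is the least significant word (assumed ordering). */
        uint64_t sum = (uint64_t)onchip_mem[dataa + i]
                     + onchip_mem[datab + i] + carry;
        onchip_mem[dest_off + i] = (uint32_t)sum;  /* SUM goes to memory */
        carry = sum >> 32;
    }
    return (uint32_t)carry;              /* CARRY returns to the CPU */
}
```

From the software side this would look like bignum_ci(OP_SET_DEST, 64, 0) followed by carry = bignum_ci(OP_ADD, 0, 32). On real hardware the call would go through whatever custom instruction macro or built-in your toolchain generates rather than a C function.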
--- Quote Start ---
I am really interested in the SGDMA module (thanks for pointing out the ST interface!), using which I think I can clock in up to 256 data bits in a clock cycle. So if I need to clock in a text (that has to be encrypted) or an operand that is 1024 bits, I think I can do it in just 4 clock cycles (more efficient than using 32 clocks). Please correct me if I am going wrong.
--- Quote End ---
In theory, it can only go as fast as the slowest interface it is connected to. If you're DMA'ing from a 32-bit SDRAM, the 256-bit DMA will only emit a word on 1/8th of the clocks, since it has to accumulate eight 32-bit words for each 256-bit word.
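The arithmetic can be sketched in C as below. This is an idealized model that assumes one beat per clock on every interface and ignores SDRAM refresh, burst setup, and arbitration; the function names are hypothetical.

```c
#include <assert.h>

/* Clocks the DMA needs to assemble one output word when its source is
 * narrower than its output (ideal case, one source beat per clock). */
unsigned clocks_per_wide_word(unsigned src_bits, unsigned dma_bits)
{
    assert(src_bits > 0 && dma_bits > 0);
    if (src_bits >= dma_bits)
        return 1;
    return (dma_bits + src_bits - 1) / src_bits;   /* ceiling division */
}

/* Clocks to move `total_bits` end to end: the narrowest interface in the
 * chain sets the rate. */
unsigned transfer_clocks(unsigned total_bits, unsigned src_bits,
                         unsigned dma_bits)
{
    assert(src_bits > 0 && dma_bits > 0);
    unsigned limit = src_bits < dma_bits ? src_bits : dma_bits;
    return (total_bits + limit - 1) / limit;
}
```

For a 32-bit source feeding a 256-bit DMA, clocks_per_wide_word(32, 256) is 8 and transfer_clocks(1024, 32, 256) is 32, the same as without the wide DMA. You only get down to 4 clocks if the source is itself 256 bits wide.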