Hello Ted,
Thank you for explaining the design! I think I understand what you have been doing with the dual-port on-chip memory and how it connects to the NIOS II processor. Your design works almost the same way as the one explained by Daixiwen. However, I do see a gain in the 'read' performance.
In your scenario, the two operands have to be clocked into the on-chip memory over the 32-bit data path, which takes 32 clock cycles, before the BIGNUM module can read them from it. So there will be some latency in getting the operand values.
However, I see a small issue with the BIGNUM module being a custom instruction. A custom instruction is an instruction that we implement for the NIOS II processor, right? So is it possible for us to modify the data path of a custom instruction like this, i.e. to read from the on-chip memory rather than from the NIOS II data bus? Also, the output of a custom instruction goes back to the NIOS II (which is again a 32-bit data path; I am not sure it can return 1024 data bits, and we might have to wait for 32 clocks), and unfortunately a custom instruction has only one output signal port. An adder has two output signals, the SUM and the CARRY. For these reasons I had thought of coming up with an IP-core-like design, which is more flexible.
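To make the SUM/CARRY point concrete, here is a small software model in C of a 1024-bit add done 32 bits at a time. This is only a sketch of the arithmetic, not the BIGNUM hardware itself: the name `bignum_add`, the constant `BIGNUM_WORDS` and the word order (word 0 is the least significant) are my own assumptions. It shows that the result has two parts, a 1024-bit sum and a 1-bit carry, which does not fit a single 32-bit result port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIGNUM_WORDS 32  /* 1024 bits / 32 bits per word (hypothetical name) */

/* Software model only: adds a and b word by word into sum and
   returns the final carry-out (0 or 1). Word 0 is least significant. */
uint32_t bignum_add(const uint32_t a[BIGNUM_WORDS],
                    const uint32_t b[BIGNUM_WORDS],
                    uint32_t sum[BIGNUM_WORDS])
{
    assert(a != NULL && b != NULL && sum != NULL);

    uint32_t carry = 0;
    for (size_t i = 0; i < BIGNUM_WORDS; i++) {
        /* 64-bit temporary keeps the carry out of the 32-bit word add */
        uint64_t t = (uint64_t)a[i] + b[i] + carry;
        sum[i] = (uint32_t)t;          /* low 32 bits are the SUM word  */
        carry  = (uint32_t)(t >> 32);  /* bit 32 is the CARRY to pass on */
    }
    return carry;
}
```

The caller gets the sum through the output array and the carry as the return value, so there are two separate outputs here, just as in the hardware adder.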
I am really interested in the SGDMA module (thanks for pointing out the ST interface!), with which I think I can clock in up to 256 data bits per clock cycle. So if I need to clock in a text (that has to be encrypted) or an operand that is 1024 bits wide, I think I can do it in just 4 clock cycles (more efficient than using 32 clocks). Please correct me if I am going wrong.
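The cycle counts I am quoting come from a simple ceiling division, sketched below in C. The helper name `transfer_cycles` is made up for illustration, and it assumes one bus-width transfer per clock with no setup latency, descriptor fetch or back-pressure, so a real transfer will likely take somewhat longer.

```c
#include <assert.h>
#include <stdint.h>

/* Clocks needed to move total_bits over a bus that carries bus_bits
   per clock. Ideal case only: one transfer every clock, no stalls. */
uint32_t transfer_cycles(uint32_t total_bits, uint32_t bus_bits)
{
    assert(bus_bits > 0);
    return (total_bits + bus_bits - 1) / bus_bits;  /* ceiling division */
}
```

With these assumptions, `transfer_cycles(1024, 32)` gives 32 and `transfer_cycles(1024, 256)` gives 4, which is where my 32-clock and 4-clock figures come from.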
Thank You,
Akhil