Hello Ted,
The operand_A and operand_B paths to on_chip memory are only a tentative design. I could instead implement an RSA algorithm that generates the two prime numbers (operand_A and operand_B) internally; the RSA IP core can take care of that. I plan to keep the data path with a future enhancement in mind: BIGDIGITS IP should be an IP core that any application able to supply two input operands can use.
And as you pointed out, that might be a bottleneck. However, I do not need to clock in operand_A from the body of a nested loop. In fact, I plan to clock in the data before the algorithm and its loops execute. Also, please note that clocking in 1024 bits of data at 32 bits per cycle takes 32 clock cycles.
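To make the arithmetic concrete, here is a rough software model in C of loading one operand ahead of the algorithm. It is only a sketch: the names operand_reg_t and load_operand are made up for illustration and are not part of the BIGDIGITS IP interface, and each loop iteration simply stands in for one 32-bit transfer per clock cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OPERAND_BITS  1024u
#define WORD_BITS     32u
#define OPERAND_WORDS (OPERAND_BITS / WORD_BITS) /* 1024 / 32 = 32 words */

/* Hypothetical stand-in for the operand storage inside the IP core. */
typedef struct {
    uint32_t words[OPERAND_WORDS];
} operand_reg_t;

/*
 * Copies one 1024-bit operand into the modelled register, one 32-bit
 * word per iteration, and returns the number of transfers performed.
 * Each transfer models one clock cycle (an assumption of this sketch).
 */
unsigned load_operand(operand_reg_t *reg, const uint32_t *src)
{
    assert(reg != NULL && src != NULL);

    unsigned cycles = 0;
    for (size_t i = 0; i < OPERAND_WORDS; ++i) {
        reg->words[i] = src[i]; /* one 32-bit word per modelled cycle */
        ++cycles;
    }
    return cycles;
}
```

Calling load_operand once per operand before the main loops start keeps the transfer cost at 32 modelled cycles per operand, paid up front rather than inside the nested loops.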
(Also, please cross-check that my understanding of the custom instruction is correct, i.e., the way it passes the addresses to the BIGDIGITS IP core.)
Thank you,
Akhil