--- Quote Start ---
This thread has almost (but not quite) gone full circle. :p
With pure software optimization, I guess you would be looking at an ideal goal that boiled down to (64) loads for the two operands, (32) add/subtract/whatever, and (32) store of the result. Let's call it 128 instructions.
If you are not an HDL person, and this ideal performance is more than adequate for where you need to be, you can attack your problem with just software and at least have a chance of achieving your desired performance.
--- Quote End ---
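For reference, the instruction count in the quote can be sketched in C. This is only an illustration, assuming 1024-bit operands stored as 32 little-endian 32-bit words; the names `BIGNUM_WORDS` and `bignum_add_1024` are hypothetical and not taken from any existing code in this thread. Each loop iteration does two loads, one add, and one store, which gives the ideal 64 + 32 + 32 = 128 figure; carry handling and loop overhead come on top of that.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: a 1024-bit number is held as 32 words of 32 bits each,
   least significant word first. */
#define BIGNUM_WORDS 32

/* Hypothetical helper: r = a + b over 1024 bits.
   Returns the carry out of the most significant word (0 or 1). */
uint32_t bignum_add_1024(uint32_t r[BIGNUM_WORDS],
                         const uint32_t a[BIGNUM_WORDS],
                         const uint32_t b[BIGNUM_WORDS])
{
    uint32_t carry = 0;
    for (size_t i = 0; i < BIGNUM_WORDS; i++) {
        /* Two loads (a[i], b[i]) and one add per word; the 64-bit
           temporary keeps the carry out of the 32-bit addition. */
        uint64_t sum = (uint64_t)a[i] + b[i] + carry;
        r[i] = (uint32_t)sum;           /* one store per word */
        carry = (uint32_t)(sum >> 32);  /* propagate to the next word */
    }
    return carry;
}
```

A 1024-bit hardware adder would do the same work in one wide operation, which is the comparison I am after.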
Akhil>> Please note that this is an academic thesis project, so I plan to implement hardware modules for RSA and BIGNUM that could assist an application developer (who may program in C/C++ with the NIOS II SBT). It is not the other way around, which would make a circle as you pointed out :P. I have example code that implements RSA in software. I was wondering whether there is any means of converting the software implementation to an HDL RTL design and comparing the performance of the two. Also, please remember that my design diagram is tentative, not finalized. As of now I plan to generate the primes for RSA inside my RSA IP core, so I may not have to load the NIOS II CPU with that task. The communication between RSA <-> BIGNUM should also be okay (I guess?), since I plan to make both data buses 1024 bits wide (and the ALUs as well).
--- Quote Start ---
The shortcoming of C2H in this specific example is that I believe it is structured as an optimization tool for 32-bit operations, and there is no way to communicate to it that you want to specify 1024-bit operations.
--- Quote End ---
Akhil>> C2H has been discontinued for QSYS and, as you pointed out, it is specific to NIOS II, which is a 32-bit architecture. So I will probably not use it to convert my C code into a hardware accelerator.
--- Quote Start ---
Back to the current topic of software optimization: I think what you want to do is use the profiler to identify your high execution count / execution time functions, and then use techniques like those dsl has described to zero in on the big bottlenecks. As far as acquiring timing data goes, I like the Performance Counter IP block.
--- Quote End ---
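As a rough sketch of the timing step suggested above, something like the following could bracket a candidate function and report how long it takes. It uses the standard C `clock()` as a stand-in so it runs on any host; on the actual target, the Performance Counter core would take its place. The names `time_section` and `dummy_workload` are hypothetical, made up for this example.

```c
#include <time.h>

/* Hypothetical helper: calls fn(arg) `iterations` times and returns the
   elapsed CPU time in seconds, measured with the standard clock(). */
double time_section(void (*fn)(void *), void *arg, unsigned iterations)
{
    clock_t start = clock();
    for (unsigned i = 0; i < iterations; i++) {
        fn(arg);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Placeholder workload: increments the counter passed in through arg.
   In real use this would be the suspected bottleneck, for example a
   bignum multiply or a modular exponentiation step. */
void dummy_workload(void *arg)
{
    unsigned long *counter = (unsigned long *)arg;
    (*counter)++;
}
```

Running each suspect function through a harness like this, with enough iterations to get a measurable time, should show which ones are worth moving into hardware.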
Akhil>> I am interested in profiling my code to find the bottlenecks. However, the challenge is to convert my C code to hardware efficiently once the bottlenecks are identified. Please advise me if you know of an efficient tool for this, or whether I have to hand-code everything.
Best,
Akhil Kalathungal