This thread has almost (but not quite) come full circle. :p
I think you can solve this problem ("I need good bignum performance") in any number of ways, including software optimization using traditional techniques like the ones dsl has described.
But backing up to the original post in this thread: you already know that this is a 32-bit processor, so to operate on 1024-bit data you need at least 32x of each operation.
With pure software optimization, the ideal goal boils down to 64 loads for the two operands, 32 add/subtract/whatever operations, and 32 stores of the result. Call it 128 instructions.
If you are not an HDL person, and this ideal performance is at or better than where you need to be, you can attack your problem with software alone and at least have a chance of achieving your desired performance.
The selling point for C2H is that if you are not an HDL person and this performance is inadequate, then MAYBE you can point and click (i.e., lower cost and faster time to market) and use FPGA resources to reach the performance you need. The shortcoming of C2H in this specific example is that, as I understand it, it is structured as an optimization tool for 32-bit operations, and there is no way to tell it that you want 1024-bit operations.
Of course if you are an HDL person, you will think the software person is nuts for doing any of this: you know up front that you want a 1024-bit wide ALU, so go ahead and create one.
Back to the current topic of software optimization: I think what you want to do is use the profiler to identify your high-execution-count / long-execution-time functions, and then apply techniques like those dsl described to zero in on the big bottlenecks. For acquiring timing data, I like the Performance Counter IP block.