Hello, I am interested in developing some instructions which can do the bignum operations (like adding two operands of 1024 bit) and hence by plan to implement a simple Full Adder like this: {carry,sum} = OP_A + OP_B; I am a beginner in Altera and realized that I can develop an IP component with Avalon MM slave interface which can talk with the NIOS II processor. I was wondering how to give the bignum values as the operand from the NIOS II processor (master) to the IP component (slave) from the application code? I see only these macros in the generated 'io.h' file: # define IOWR_32DIRECT(BASE, OFFSET, DATA) io_write((BASE_APB_ADDR) + __AVL_TO_APB((alt_u32)((BASE) + (OFFSET))), DATA, (BASE) + (OFFSET)) # define IORD_32DIRECT(BASE, OFFSET) io_read((BASE_APB_ADDR) + __AVL_TO_APB((alt_u32)((BASE) + (OFFSET))), (BASE) + (OFFSET)) I guess these are the 32 bits write and read instructions. So, do I get to clock in the 1024 bits as a single instruction? Or do I have to wait for 32 clock cycles (sampling the 32 bits in a clock cycle, do for 32 clocks. Please say a no to this..) ! Hope that question is clear and someone can respond. Really appreciate it. Thank You, Akhil

Hello, Can someone be of assistance to the above question please? That would be a big help. If the question is not clear I can explain it again. Any inputs are appreciated. Thank You, Akhil

Hello, I gentle bounce here. Someone please help me. Thank You, Akhil

The Nios II CPU uses a 32 bit data bus, and a 32-bit data path internally, so it won't be able to process a 1024-bit word with a single instruction. You will have to divide your 1024-bit word in 32*32-bit words when accessing your IP component.

Hello Daixiwen, In the Altera user manual for Avalon interface (sopc builder, not qsys), I have seen the MM interface section. There the 'readdata' and 'writedata' ports can have a data width up to 1024 bits. So does that mean the NIOS II processor (master) can write or read 1024 bit data to the slave component in 32 clock cycles, 32 bit word every clock cycle? Earlier I thought that since the data width is 1024, it might be possible to clock in/out that much data every clock! (I am dumb :( ). And does this also mean that I can have an accumulator (kind of register) in my hardware IP module which keeps sampling the write data for the entire 32 clock cycles? And for read data, which reads the 1024 bits read register, 32 bits every clock cycle, and does the same for 32 clock cycles. I sense a low throughput for performance here (but, with a proper functionality). Please correct me if my assertion is wrong. Thank You, Akhil

You are right that you can have an Avalon MM bus with a 1024-bit data bus size, and if you connect a master and a slave that both have a 1024-bit data bus size, they will be able to transfer 1024 bits on each clock cycle (see, you aren't dumb ;) ). You need to be careful when using such a bus size, as you increase considerably the resources used on the FPGA. The problem in your case is that the CPU itself is 32 bit. So even if you connect it to a component with a 1024 bit data bus, the CPU will only be able to read or write 32 bits at a time (using byte enables). If you have a lot of operations to perform on those big words, an idea could be to implement your IP as a full ALU with a bank of 1024 bit registers. Then the CPU would only need to transfer the actual values at the beginning and end of the algorithm, but for all the intermediary steps it would only need to send instructions, and the values itself would stay in your IP component. Of course it depends entirely on what you want to do and how many intermediary values you need.

New IP development | Altera Community

48 Replies

Altera_Forum
Honored Contributor
12 years ago
I'm not sure I'd use the profiler, it's statistical nature (usually based on timer ticks) means it is of limited use.
You can probably guess which parts of the code take time, implement a TSC instruction and use it to count the number of cycles taken to get through the code sections. Use the elapsed time to generate a histogram of how long each section takes.
I have these defines:
#define STAMP_SET(stamp) (stamp = SYS_CLK_COUNT()) # define STAMP_COUNT(array, stamp, factor) do { unsigned int new_stamp, bucket, count; new_stamp = SYS_CLK_COUNT(); bucket = ((new_stamp - stamp) & ((nelem(array) - 1) << (factor))) >> (factor); count = array + 1; stamp = new_stamp; array = count; } while (0)

which I use to generate 64-entry histograms of the execution time of code blocks.
I've a comment that says the above costs about 10 clocks - I can't remember if that includes the Avalon read to get the cycle count.

Since my code runs from tightly coupled instruction/data memory and the few Avalon xfers are usually uncontended the clock counts I get match those I calculate from the object code. SDRAM accesses perterb things somewhat (but I don't have many of those).

To get the code to run fast(er):
1) Ensure functions that can be inlined are inlined, try to get everything inlined (if code space permits).
Mostly this reduces register pressure.
2) Try to use global register variable(s) to access static data (ie put it all in a single 'struct'). This generates slightly better code (even after my patches to gcc) than using 'small data'. With care you can use %gp as the global register.
Without this the compiler will need a register to reference each global - and I've seen it have two registers pointing to the same global.
3) Ensure your C doesn't force the compiler to keep re-reading variables from memory (eg because a write via a 'char *' might overwrite the same location).
4) Avoid having too many live values in a function, gcc will create virtual registers and then spill them to stack. Sometimes it is necessary to force gcc to write the register values out to memory (asm volatile ("":::memory) is your friend here).
5) Avoid read delay stalls. gcc hasn't been told about these properly. Sometimes you need to force a read early.
6) Find out how to disable the dynamic branch predictor, and set all the branches with correct static prediction.
7) I could carry on ....

I got considerable speedup from the above - and after I thought I'd made a good jod of the code!
Altera_Forum
Honored Contributor
12 years ago
This thread is almost ( but not quite ) gone full circle.:p

I think you can solve this problem ("I need good bignum performance") any number of ways, including software optimization using traditional techniques like what dsl has described.

But backing up to the original post in this thread: you already know that this is a 32-bit processor and to operate on (1024-bit) data you need [at least] 32x of each operation.
With pure software optimization, I guess you would be looking at an ideal goal that boiled down to (64) loads for the two operands, (32) add/subtract/whatever, and (32) store of the result. Let's call it 128 instructions.

If you are not a HDL person, and this ideal performance is more than adequate of where you need to be, you can attack your problem with just software and at least have a chance of achieving your desired performance.

The selling point for C2H is that if you are not a HDL person, and this performance is inadequate, then MAYBE you can point and click (aka lower cost and faster time to market) and use FPGA resources to achieve your performance needed. The shortcoming of C2H in this specific example is that I believe it is structured as an optimization tool for 32-bit operations, and there is no way to communicate to it that you want to specify 1024-bit operations.

Of course if you are an HDL person, you will think the software person is nuts for doing any of this: you know up front that you want a 1024-bit wide ALU, so go ahead and create one.

Back to the current topic of software optimization: I think what you want to do is use the profiler to identify your high execution count / execution time functions, and then techniques like dsl has written to zero in on the big bottlenecks. As far as acquiring timing data, I like the Performance Counter IP block.
Altera_Forum
Honored Contributor
12 years ago
My problem with the performance counter block is that it is a 16bit slave - so it takes a lot of accesses to do anything.
A 32bit version would be much more friendly.

The lack of 'add with carry' could well be a problem, but one of the cmp instructions would do the job.
So code like:
sum = a + b + carry; carry = sum < a;

is amost right (needs an extra check for b[i] + carry == 0).
So 5 instructions (+ memory access) including 1 branch that can be staticly predicted 'not taken'.
With care and a combinatorial custom instruction (or 2) could be 3 instruction (I think).
Altera_Forum
Honored Contributor
12 years ago
--- Quote Start ---
My problem with the performance counter block is that it is a 16bit slave - so it takes a lot of accesses to do anything.

--- Quote End ---

I'm not sure about the history of the block, but at least in the files sitting on my hard drive right now, it is a 32-bit slave.

--- Quote Start ---

With care and a combinatorial custom instruction (or 2) could be 3 instruction (I think).

--- Quote End ---

I'm not following you: I see two loads, one store, and one math function (4 instructions, or 128 total).
I'm not intimately familiar with the instruction set: does NIOS have a double-load or add-and-store instruction? (how did you compress it down to 3?)
Altera_Forum
Honored Contributor
12 years ago
I wasn't counting the load and store instructiions....
Altera_Forum
Honored Contributor
12 years ago
--- Quote Start ---
This thread is almost ( but not quite ) gone full circle.:p

With pure software optimization, I guess you would be looking at an ideal goal that boiled down to (64) loads for the two operands, (32) add/subtract/whatever, and (32) store of the result. Let's call it 128 instructions.

If you are not a HDL person, and this ideal performance is more than adequate of where you need to be, you can attack your problem with just software and at least have a chance of achieving your desired performance.
--- Quote End ---

Akhil>> Please note that this is an academic thesis project and hence by I plan to implement hardware modules for RSA and BIGNUM which could assist an application developer (who may program in a NIOS II SBT, in C/C++). It is not the other way around, which can make a circle as you pointed out :P. I have an example code with me which does the software implementation for RSA. I was wondering any means of converting the software implementation to an HDL RTL design and comparing the performances. And please remember my tentative design diagram, that is not a finalized one. As of now I plan to generate the primes for RSA inside my RSA IP core and hence by I may not have to load the NIOS II CPU for that. Also the communication between the RSA <-> BIGNUM should be okay (I guess?) since I plan to make both the data buses as 1024 bits (even the ALU s).

--- Quote Start ---
The shortcoming of C2H in this specific example is that I believe it is structured as an optimization tool for 32-bit operations, and there is no way to communicate to it that you want to specify 1024-bit operations.
--- Quote End ---

Akhil>> The C2H has been discontinued for QSYS and as you pointed out, that is specific for NIOS II, which supports a 32-bit architecture. So I may not be using this to convert my C to a hardware accelerator.
--- Quote End ---

--- Quote Start ---
Back to the current topic of software optimization: I think what you want to do is use the profiler to identify your high execution count / execution time functions, and then techniques like dsl has written to zero in on the big bottlenecks. As far as acquiring timing data, I like the Performance Counter IP block.
--- Quote End ---

Akhil >> I am interested to profile my code for finding out the bottlenecks. However the challenge here is to convert my C code to a hardware efficiently after identifying the bottlenecks in the code. Please advice me if you know an efficient tool or if I have to hand-code everything.

Best,
Akhil Kalathungal
Altera_Forum
Honored Contributor
12 years ago
--- Quote Start ---
I was wondering any means of converting the software implementation to an HDL RTL design and comparing the performances.
--- Quote End ---

http://en.wikipedia.org/wiki/c_to_hdl

It looks like today there is some free options (NISC, c-to-verilog.com, ROCCC, ...) as well as big name commercial (Mentor) to choose from. As far as Altera's offering goes, I think the situation is that it "works" it is just not being further developed. If you are willing to jump through some hoops with current releases in order to use it, you can.

I understand this is not where you are aiming, but once you wrap your hands around using tools like this, it seems like a useful topic of for a paper/wiki might be a brief summary with 2013 view point of the output you get from variety of tools using your standard well-known RSA math as the test case. I think the challenge would be making any of them understand that you are working with 1024-bit operands and not arrays of 32-bit words, for example.

Here is one relatively current paper (obviously, biased toward one tool):
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5474060

You may get additional / more useful feedback from this forum with a new thread with different topic soliciting opinions, since there may be people with relevant first hand knowledge of current offerings.
Altera_Forum
Honored Contributor
12 years ago
First of all I'm guilty of not reading this entire thread but are you essentially trying to perform a vector operation (same operating across multiple data in memory)? If so and assuming the algorithm isn't particularly complicated a DMA engine with your transform hardware in the middle should suffice. For example if you wanted to perform a vector addition of two blocks in memory you just need this:

read master --> adder transform block --> write master

If you don't want to use a DMA but need memory masters to do this you could take a look at the read and write masters in the Qsys tutorial and just shove a custom controller on them. That's an overly simple algorithm and chances are it doesn't offer much in terms of speedup over a processor since you are performing two reads and one write for ever computation in the vector (i.e. memory limited). You realistically get a speedup when the algorithm is data independent or feed forward so that you can perform a bunch of calculations in a wide or pipelined data path.

A lot of the C to HDL products out there have gone the way of the dinosaur since it's pretty difficult for compilers to find parallelism in sequential C code. As far as I know the only commercial player left in that space is ImpulseC.

Forum Discussion

New IP development

48 Replies

Recent Discussions

Access to RLC data for Agilex5 IBIS Models

Configurable transceiver enable

Agilex 5 dual simplex fitting

JESD240B - No license

Interface LVDS to Gigabit transceivers