Forum Discussion

Altera_Forum's avatar
Altera_Forum
Icon for Honored Contributor rankHonored Contributor
13 years ago

New IP development

Hello,

I am interested in developing some instructions which can do the bignum operations (like adding two operands of 1024 bit)

and hence by plan to implement a simple Full Adder like this: {carry,sum} = OP_A + OP_B;

I am a beginner in Altera and realized that I can develop an IP component with Avalon MM slave interface which can talk with the NIOS II processor.

I was wondering how to give the bignum values as the operand from the NIOS II processor (master) to the IP component (slave) from the application code?

I see only these macros in the generated 'io.h' file: # define IOWR_32DIRECT(BASE, OFFSET, DATA)

io_write((BASE_APB_ADDR) + __AVL_TO_APB((alt_u32)((BASE) + (OFFSET))), DATA, (BASE) + (OFFSET))

# define IORD_32DIRECT(BASE, OFFSET)

io_read((BASE_APB_ADDR) + __AVL_TO_APB((alt_u32)((BASE) + (OFFSET))), (BASE) + (OFFSET))

I guess these are the 32 bits write and read instructions. So, do I get to clock in the 1024 bits as a single instruction?

Or do I have to wait for 32 clock cycles (sampling the 32 bits in a clock cycle, do for 32 clocks. Please say a no to this..) !

Hope that question is clear and someone can respond. Really appreciate it.

Thank You,

Akhil

48 Replies

  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    I'm not sure I'd use the profiler, it's statistical nature (usually based on timer ticks) means it is of limited use.

    You can probably guess which parts of the code take time, implement a TSC instruction and use it to count the number of cycles taken to get through the code sections. Use the elapsed time to generate a histogram of how long each section takes.

    I have these defines:

    #define STAMP_SET(stamp) (stamp = SYS_CLK_COUNT())
    # define STAMP_COUNT(array, stamp, factor) 
        do { 
            unsigned int new_stamp, bucket, count; 
            new_stamp = SYS_CLK_COUNT(); 
            bucket = ((new_stamp - stamp) & ((nelem(array) - 1) << (factor))) >> (factor); 
            count = array + 1; 
            stamp = new_stamp; 
            array = count; 
        } while (0)

    which I use to generate 64-entry histograms of the execution time of code blocks.

    I've a comment that says the above costs about 10 clocks - I can't remember if that includes the Avalon read to get the cycle count.

    Since my code runs from tightly coupled instruction/data memory and the few Avalon xfers are usually uncontended the clock counts I get match those I calculate from the object code. SDRAM accesses perterb things somewhat (but I don't have many of those).

    To get the code to run fast(er):

    1) Ensure functions that can be inlined are inlined, try to get everything inlined (if code space permits).

    Mostly this reduces register pressure.

    2) Try to use global register variable(s) to access static data (ie put it all in a single 'struct'). This generates slightly better code (even after my patches to gcc) than using 'small data'. With care you can use %gp as the global register.

    Without this the compiler will need a register to reference each global - and I've seen it have two registers pointing to the same global.

    3) Ensure your C doesn't force the compiler to keep re-reading variables from memory (eg because a write via a 'char *' might overwrite the same location).

    4) Avoid having too many live values in a function, gcc will create virtual registers and then spill them to stack. Sometimes it is necessary to force gcc to write the register values out to memory (asm volatile ("":::memory) is your friend here).

    5) Avoid read delay stalls. gcc hasn't been told about these properly. Sometimes you need to force a read early.

    6) Find out how to disable the dynamic branch predictor, and set all the branches with correct static prediction.

    7) I could carry on ....

    I got considerable speedup from the above - and after I thought I'd made a good jod of the code!
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    This thread is almost ( but not quite ) gone full circle.:p

    I think you can solve this problem ("I need good bignum performance") any number of ways, including software optimization using traditional techniques like what dsl has described.

    But backing up to the original post in this thread: you already know that this is a 32-bit processor and to operate on (1024-bit) data you need [at least] 32x of each operation.

    With pure software optimization, I guess you would be looking at an ideal goal that boiled down to (64) loads for the two operands, (32) add/subtract/whatever, and (32) store of the result. Let's call it 128 instructions.

    If you are not a HDL person, and this ideal performance is more than adequate of where you need to be, you can attack your problem with just software and at least have a chance of achieving your desired performance.

    The selling point for C2H is that if you are not a HDL person, and this performance is inadequate, then MAYBE you can point and click (aka lower cost and faster time to market) and use FPGA resources to achieve your performance needed. The shortcoming of C2H in this specific example is that I believe it is structured as an optimization tool for 32-bit operations, and there is no way to communicate to it that you want to specify 1024-bit operations.

    Of course if you are an HDL person, you will think the software person is nuts for doing any of this: you know up front that you want a 1024-bit wide ALU, so go ahead and create one.

    Back to the current topic of software optimization: I think what you want to do is use the profiler to identify your high execution count / execution time functions, and then techniques like dsl has written to zero in on the big bottlenecks. As far as acquiring timing data, I like the Performance Counter IP block.
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    My problem with the performance counter block is that it is a 16bit slave - so it takes a lot of accesses to do anything.

    A 32bit version would be much more friendly.

    The lack of 'add with carry' could well be a problem, but one of the cmp instructions would do the job.

    So code like:

            sum = a + b + carry;
            carry = sum < a;

    is amost right (needs an extra check for b[i] + carry == 0).

    So 5 instructions (+ memory access) including 1 branch that can be staticly predicted 'not taken'.

    With care and a combinatorial custom instruction (or 2) could be 3 instruction (I think).
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    --- Quote Start ---

    My problem with the performance counter block is that it is a 16bit slave - so it takes a lot of accesses to do anything.

    --- Quote End ---

    I'm not sure about the history of the block, but at least in the files sitting on my hard drive right now, it is a 32-bit slave.

    --- Quote Start ---

    With care and a combinatorial custom instruction (or 2) could be 3 instruction (I think).

    --- Quote End ---

    I'm not following you: I see two loads, one store, and one math function (4 instructions, or 128 total).

    I'm not intimately familiar with the instruction set: does NIOS have a double-load or add-and-store instruction? (how did you compress it down to 3?)
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    --- Quote Start ---

    This thread is almost ( but not quite ) gone full circle.:p

    With pure software optimization, I guess you would be looking at an ideal goal that boiled down to (64) loads for the two operands, (32) add/subtract/whatever, and (32) store of the result. Let's call it 128 instructions.

    If you are not a HDL person, and this ideal performance is more than adequate of where you need to be, you can attack your problem with just software and at least have a chance of achieving your desired performance.

    --- Quote End ---

    Akhil>> Please note that this is an academic thesis project and hence by I plan to implement hardware modules for RSA and BIGNUM which could assist an application developer (who may program in a NIOS II SBT, in C/C++). It is not the other way around, which can make a circle as you pointed out :P. I have an example code with me which does the software implementation for RSA. I was wondering any means of converting the software implementation to an HDL RTL design and comparing the performances. And please remember my tentative design diagram, that is not a finalized one. As of now I plan to generate the primes for RSA inside my RSA IP core and hence by I may not have to load the NIOS II CPU for that. Also the communication between the RSA <-> BIGNUM should be okay (I guess?) since I plan to make both the data buses as 1024 bits (even the ALU s).

    --- Quote Start ---

    The shortcoming of C2H in this specific example is that I believe it is structured as an optimization tool for 32-bit operations, and there is no way to communicate to it that you want to specify 1024-bit operations.

    --- Quote End ---

    Akhil>> The C2H has been discontinued for QSYS and as you pointed out, that is specific for NIOS II, which supports a 32-bit architecture. So I may not be using this to convert my C to a hardware accelerator.

    --- Quote End ---

    --- Quote Start ---

    Back to the current topic of software optimization: I think what you want to do is use the profiler to identify your high execution count / execution time functions, and then techniques like dsl has written to zero in on the big bottlenecks. As far as acquiring timing data, I like the Performance Counter IP block.

    --- Quote End ---

    Akhil >> I am interested to profile my code for finding out the bottlenecks. However the challenge here is to convert my C code to a hardware efficiently after identifying the bottlenecks in the code. Please advice me if you know an efficient tool or if I have to hand-code everything.

    Best,

    Akhil Kalathungal
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    --- Quote Start ---

    I was wondering any means of converting the software implementation to an HDL RTL design and comparing the performances.

    --- Quote End ---

    http://en.wikipedia.org/wiki/c_to_hdl

    It looks like today there is some free options (NISC, c-to-verilog.com, ROCCC, ...) as well as big name commercial (Mentor) to choose from. As far as Altera's offering goes, I think the situation is that it "works" it is just not being further developed. If you are willing to jump through some hoops with current releases in order to use it, you can.

    I understand this is not where you are aiming, but once you wrap your hands around using tools like this, it seems like a useful topic of for a paper/wiki might be a brief summary with 2013 view point of the output you get from variety of tools using your standard well-known RSA math as the test case. I think the challenge would be making any of them understand that you are working with 1024-bit operands and not arrays of 32-bit words, for example.

    Here is one relatively current paper (obviously, biased toward one tool):

    http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5474060

    You may get additional / more useful feedback from this forum with a new thread with different topic soliciting opinions, since there may be people with relevant first hand knowledge of current offerings.
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    First of all I'm guilty of not reading this entire thread but are you essentially trying to perform a vector operation (same operating across multiple data in memory)? If so and assuming the algorithm isn't particularly complicated a DMA engine with your transform hardware in the middle should suffice. For example if you wanted to perform a vector addition of two blocks in memory you just need this:

    read master --> adder transform block --> write master

    If you don't want to use a DMA but need memory masters to do this you could take a look at the read and write masters in the Qsys tutorial and just shove a custom controller on them. That's an overly simple algorithm and chances are it doesn't offer much in terms of speedup over a processor since you are performing two reads and one write for ever computation in the vector (i.e. memory limited). You realistically get a speedup when the algorithm is data independent or feed forward so that you can perform a bunch of calculations in a wide or pipelined data path.

    A lot of the C to HDL products out there have gone the way of the dinosaur since it's pretty difficult for compilers to find parallelism in sequential C code. As far as I know the only commercial player left in that space is ImpulseC.