Hello, I am in charge of designing a system that does a lot of complex sin/cos/* operations, and I need them to be fast. I am using a NIOS2-F at 100 MHz in a Cyclone IV. I've enabled hardware multiply and divide on the NIOS2-F and I am using 64K data and instruction caches. My software is compiled with these flags:

CFLAGS = -Wall -DNOCRYPT -mhw-div -mhw-mul -mcustom-fpu-cfg=60-1 -mcustom-fpu-cfg=60-2 -O3

I tried to use the custom FPU instruction in QSys, but performance is slower with the custom FPU instruction attached to the NIOS2-F (why?). Is there anything else I can do? Any suggestions?

Depending on what you are doing, you might be able to reduce the amount of maths required!
- If you only need limited precision, use a lookup table and linear interpolation (and maybe a single Newton-Raphson step).
- If you are generating values in sequence (e.g. tone generation), use the identity: sin(a+b) = 2sin(a)cos(b) - sin(a-b)
- Use fixed-point maths (in integers) rather than floating point.

Thanks for your help, but I can't do these math optimizations :S Maybe if I change the GCC version I can get a better result? I am using version 3.4.6.

Changing the version of gcc won't make much difference. It might be worth looking at the generated code (either pass '-S -fverbose-asm' to gcc and look at the generated .s file, or run 'objdump -d' on the object/program file). Also remember that (IIRC) the FP opcodes only do 'float', not 'double', and you don't want to be converting between float and double either.

I will take a look into that. DSL, why do I get worse performance when I add the floating-point unit custom instruction to the NIOS2 in QSys? It doesn't make sense to me. Does the NIOS2 already have those custom instructions built in? Also, what should I be looking for in the objdump output? (I have never used this command :eek:) This is part of my meter_run (main function):

    4e0c: 39000017 ldw r4,0(r7)
    4e10: 31800104 addi r6,r6,4
    4e14: 10c7ff32 custom 252,r3,r2,r3
    4e18: 1105ff32 custom 252,r2,r2,r4
    4e1c: 40d1ff72 custom 253,r8,r8,r3
    4e20: 4893ffb2 custom 254,r9,r9,r2
    4e24: 39c00104 addi r7,r7,4
    4e28: 29400904 addi r5,r5,36
    4e2c: 32bff51e bne r6,r10,4e04 <meter_run+0xb4>
    4e30: 73800044 addi r14,r14,1
    4e34: 5a400015 stw r9,0(r11)
    4e38: 62000015 stw r8,0(r12)
    4e3c: 6b400104 addi r13,r13,4
    4e40: 63000104 addi r12,r12,4
    4e44: 5ac00104 addi r11,r11,4
    4e48: 73ffe31e bne r14,r15,4dd8 <meter_run+0x88>
    4e4c: d8c00217 ldw r3,8(sp)
    4e50: d1e09117 ldw r7,-32188(gp)
    4e54: d2e09017 ldw r11,-32192(gp)
    4e58: d9000617 ldw r4,24(sp)
    4e5c: da400117 ldw r9,4(sp)
    4e60: da800517 ldw r10,20(sp)
    4e64: d9400317 ldw r5,12(sp)
    4e68: d2209317 ldw r8,-32180(gp)
    4e6c: d3209217 ldw r12,-32184(gp)
    4e70: 1ac5ff32 custom 252,r2,r3,r11
    4e74: d9800717 ldw r6,28(sp)
    4e78: 19c7ff32 custom 252,r3,r3,r7
    4e7c: 1a47ff72 custom 253,r3,r3,r9
    4e80: 390fff32 custom 252,r7,r7,r4
    4e84: 1285ff72 custom 253,r2,r2,r10
    4e88: 22c9ff32 custom 252,r4,r4,r11
    4e8c: 1907ffb2 custom 254,r3,r3,r4
    4e90: 11c5ff72 custom 253,r2,r2,r7
    4e94: 2b09ff32 custom 252,r4,r5,r12
    4e98: 2a0bff32 custom 252,r5,r5,r8
    4e9c: 1947ff72 custom 253,r3,r3,r5
    4ea0: 1105ff72 custom 253,r2,r2,r4
    4ea4: 4191ff32 custom 252,r8,r8,r6
    4ea8: 330dff32 custom 252,r6,r6,r12
    4eac: 19a9ffb2 custom 254,r20,r3,r6
    4eb0: 1227ff72 custom 253,r19,r2,r8

I guess it is using custom instructions, right? (Even though I didn't add the custom instruction in QSys.)

The sine/cosine implementations in the IEEE variant of newlib use lookup tables to compute the result, so I wouldn't expect adding the FPU to make any difference at all. If you want a faster implementation, you either need to look for software optimizations or implement the sine/cosine in hardware and bolt it onto the CPU as a custom instruction. There are compiler flags you can pass in to tell the tools to use the custom instruction implementation instead of the software library. Optimizations you can look at are Taylor series, CORDIC, etc. Some may work well in software with the FPU and some would make more sense offloaded into hardware. There are others you can look at if you want to trade off accuracy, or if your inputs are bounded to the point where you can use lookup tables efficiently. If you are seeing those 'custom' opcodes without a custom instruction in the system, then I would think you have the wrong code linked in. It could be that the code is being linked in but never executed, but that would surprise me.
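As a rough illustration of the Taylor-series option in software, the sketch below evaluates sine using only float multiplies, adds and subtracts, which are the kind of operations an FPU custom instruction can take over. The function name sin_poly, the number of terms (up to x^9) and the accepted input range are assumptions for this example; truncating the series there leaves an error of a few parts per million near pi/2, and a fitted (minimax) polynomial would do better for the same cost.

```c
#include <assert.h>

/* Float-only sine: range reduction followed by a Taylor polynomial
 * x - x^3/3! + x^5/5! - x^7/7! + x^9/9!, evaluated in Horner form. */
float sin_poly(float x)
{
    const float two_pi = 6.2831853f;
    const float inv_two_pi = 0.15915494f;
    const float pi = 3.1415927f;
    const float half_pi = 1.5707963f;
    float x2;
    int k;

    /* Assumed input range; keeps the int conversion below well defined
     * and limits the precision lost in the range reduction. */
    assert(x > -1.0e4f && x < 1.0e4f);

    /* Subtract the nearest multiple of 2*pi: x is now in [-pi, pi]. */
    k = (int)(x * inv_two_pi + (x < 0.0f ? -0.5f : 0.5f));
    x -= (float)k * two_pi;

    /* Fold into [-pi/2, pi/2] using sin(pi - x) = sin(x). */
    if (x > half_pi)
        x = pi - x;
    else if (x < -half_pi)
        x = -pi - x;

    x2 = x * x;
    return x * (1.0f + x2 * (-1.0f / 6.0f + x2 * (1.0f / 120.0f
             + x2 * (-1.0f / 5040.0f + x2 * (1.0f / 362880.0f)))));
}
```

The constant divisions are folded by the compiler, so at run time this is five multiplies for the polynomial plus the range reduction.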

NIOS2 HW & DIV Instructions + FPU | Altera Community

28 Replies

Altera_Forum
Honored Contributor
14 years ago
Presumably there is a separate compiler option that can be turned back off? (without doing a compiler rebuild).
Even if that does mean you have to explicitly mark constants as 'float'.

You might also need to set another option to let the compiler do certain FP arithmetic itself - instead of generating code to do it. The issue here is that the compiler might not generate exactly the same bit-pattern as the target.
Altera_Forum
Honored Contributor
14 years ago
Aprado - your last example can be optimised away again ....
Altera_Forum
Honored Contributor
14 years ago
--- Quote Start ---
Aprado - your last example can be optimised away again ....
--- Quote End ---

Yup, however I compiled without the -O flag, took a look at the ASM, and it is calling a custom instruction to do the float operation.
Altera_Forum
Honored Contributor
14 years ago
Hmmm... the code Altera added to gcc to default FP constants to 'float' is truly borked.
The option is normally selectable as -f[no-]single-precision-constant. However, rather than set this when the -mcustom-fpu-cfg option is seen, it is done at the end of option processing so cannot be turned off from the command line.
Worse still, the generation of FP custom instructions can be enabled by a pragma - this will also force single precision constants from then on!

Seems tempting to rebuild the compiler with the assignment "flag_single_precision_constant = 1;" moved into the argument saving code (if not deleted completely).
Altera_Forum
Honored Contributor
14 years ago
That's correct, I looked around as well and couldn't find a clean way to disable having constants treated as single precision.

One way is to use the suffix 'l' for doubles (strictly, 'l' makes the constant a long double) and 'f' for floats, but that can be a pain if you have a lot of constants scattered around in the code.
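A quick way to see what the suffixes do is to check the size of each kind of literal. The sketch below uses hypothetical helper names (check_literal_types, plain_literal_size); it only relies on standard C, where an 'f' suffix always gives a float and an 'L' suffix always gives a long double. The size of an unsuffixed literal is the interesting one: it is normally sizeof(double), but a compiler that forces single precision constants may report sizeof(float) instead.

```c
#include <assert.h>
#include <stddef.h>

/* Suffixed literals have a fixed type whatever the compiler options. */
void check_literal_types(void)
{
    assert(sizeof(0.1f) == sizeof(float));
    assert(sizeof(0.1L) == sizeof(long double));
}

/* Unsuffixed literal: double in standard C, but possibly float when the
 * compiler is forcing single precision constants. */
size_t plain_literal_size(void)
{
    return sizeof(0.1);
}
```

Printing plain_literal_size() on the target is a cheap way to confirm how the toolchain is treating unsuffixed constants.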

The way I would probably do it would be to generate the FPU, feed the HDL to the component editor, then pass the appropriate flags to the compiler for +, -, *, / without using the 60-1 or 60-2 flags. I have never done this since I typically use "YAFPU" that I posted over in the alterawiki since it has more operators in it.

Aprado, I have not used the configurable FPU before but I think I know which one it is. When using that one, if you have to pass in compiler flags for each floating point operator, then you probably don't need to worry about constants being treated as single precision values.
Altera_Forum
Honored Contributor
14 years ago
Yes, I need to pass in compiler flags for each floating point operator.
That's great news then. Is the YAFPU faster than the configurable one? I will try it.

Thanks for the help BadOmen and DSL.
Altera_Forum
Honored Contributor
14 years ago
I'm not sure which one is faster; the latency counts are visible somewhere in the Verilog, I think. Also, as a heads up, YAFPU is old as dirt and is .ptf based, so I'm not sure if it still works with the tools. One of these days I'll give it a facelift and add double precision support.
Altera_Forum
Honored Contributor
14 years ago
Unless something very obscure goes on, it is only the -mcustom-fpu-cfg option that forces single precision constants.
The C functions I found are the same horrid ones that an ARM system I used many years ago ended up using - they are very slow at the best of times [1].
For an embedded system you might get away with:
- No NaN, infinity or -0
- non-IEEE rounding (maybe just truncate)
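To give an idea of how much simpler a software float routine becomes under those two relaxations, here is a sketch of a single precision multiply. It is not the routine the poster wrote; the name fmul_trunc and the policy choices (denormals and underflow flush to +0, overflow saturates to the largest finite value, NaN and infinity inputs are not recognised) are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t float_bits(float f)
{
    uint32_t u;
    assert(sizeof f == sizeof u);
    memcpy(&u, &f, sizeof u);
    return u;
}

static float bits_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Simplified multiply: truncating rounding, no NaN/infinity/-0 handling. */
float fmul_trunc(float fa, float fb)
{
    uint32_t a = float_bits(fa);
    uint32_t b = float_bits(fb);
    uint32_t sign = (a ^ b) & 0x80000000u;
    int32_t ea = (int32_t)((a >> 23) & 0xFFu);
    int32_t eb = (int32_t)((b >> 23) & 0xFFu);
    uint64_t prod;
    int32_t e;

    if (ea == 0 || eb == 0)
        return 0.0f;                    /* zero or denormal input: +0 */

    /* 24-bit mantissas with the implicit leading 1 restored. */
    prod = (uint64_t)((a & 0x7FFFFFu) | 0x800000u)
         * ((b & 0x7FFFFFu) | 0x800000u);
    e = ea + eb - 127;

    /* The product lies in [2^46, 2^48); normalise to 24 bits by dropping
     * the low bits, which is the truncating rounding. */
    if (prod & ((uint64_t)1 << 47)) {
        prod >>= 24;
        e += 1;
    } else {
        prod >>= 23;
    }

    if (e <= 0)
        return 0.0f;                    /* underflow: flush to +0 */
    if (e >= 255)
        return bits_float(sign | 0x7F7FFFFFu);  /* overflow: saturate */

    return bits_float(sign | ((uint32_t)e << 23)
                           | ((uint32_t)prod & 0x7FFFFFu));
}
```

For example, fmul_trunc(1.5f, 1.5f) gives exactly 2.25f; results that are not exactly representable come out truncated toward zero rather than rounded to nearest.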

It is also a shame that Altera didn't think through the custom instruction interface a little further.
- Separate opcode for FP
- Allow an instruction to disable interrupts before the following instruction. This would allow, for example, a 64-bit result to be recovered.

[1] I spent a week or so writing them in ARM asm - not that hard.
