--- Quote Start ---
originally posted by timbr@Nov 8 2004, 09:27 AM
hi to everyone who wants a fast multiply now,
even without adding a hardware multiplier, it is possible to gain some performance on the multiply. i disassembled the multiply routine (__mulsi3):
__mulsi3:
    addi    sp,sp,-8            <- useless, can be removed
    stw     fp,4(sp)            <- useless, can be removed
    mov     r3,zero
    mov     fp,sp               <- useless, can be removed
    beq     r4,zero,mul_30
mul_14:
    andi    r2,r4,1
    cmpeq   r2,r2,zero
    srli    r4,r4,1
    bne     r2,zero,mul_28
    add     r3,r3,r5
mul_28:
    slli    r5,r5,1
    bne     r4,zero,mul_14
mul_30:
    mov     r2,r3
    ldw     fp,4(sp)            <- useless, can be removed
    addi    sp,sp,8             <- useless, can be removed
    ret
this routine can be optimized a lot. the most important optimization is removing the stack frame code, which drops 5 instructions without any problem. together with two minor optimizations, this results in the following function:
__mulsi3:
    mov     r2,zero
    beq     r4,zero,mul_30
mul_14:
    andi    r3,r4,1
    srli    r4,r4,1
    beq     r3,zero,mul_28
    add     r2,r2,r5
mul_28:
    slli    r5,r5,1
    bne     r4,zero,mul_14
mul_30:
    ret
--- Quote End ---
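For anyone who finds the assembly hard to follow, here is a rough C sketch of what the quoted optimized loop seems to be doing (plain shift-and-add). The function names mul_shift_add and mul_shift_add_selfcheck are made up for illustration; this is not the actual libgcc source for __mulsi3, just my reading of the disassembly, with the matching instruction noted next to each line.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add multiply, mirroring the optimized loop in the quote.
   Hypothetical name; the real library routine is __mulsi3.
   a plays the role of r4, b of r5, result of r2. */
uint32_t mul_shift_add(uint32_t a, uint32_t b)
{
    uint32_t result = 0;      /* mov  r2,zero */
    while (a != 0) {          /* beq r4,zero,mul_30 / bne r4,zero,mul_14 */
        if (a & 1u)           /* andi r3,r4,1 ; beq r3,zero,mul_28 */
            result += b;      /* add  r2,r2,r5 */
        a >>= 1;              /* srli r4,r4,1 */
        b <<= 1;              /* slli r5,r5,1 */
    }
    return result;            /* ret (result already in r2) */
}

/* Small self-check. Unsigned arithmetic wraps modulo 2^32, so the low
   32 bits also come out right when the operands are read as signed. */
void mul_shift_add_selfcheck(void)
{
    assert(mul_shift_add(6u, 7u) == 42u);
    assert(mul_shift_add(0u, 123u) == 0u);
    assert(mul_shift_add((uint32_t)-3, 5u) == (uint32_t)-15);
}
```

Comparing the two listings, the two minor optimizations appear to be accumulating directly in r2 (which drops the final mov) and branching with beq on the andi result (which drops the cmpeq). The loop runs once per bit of the first operand up to its highest set bit, so small first operands finish quickly.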
I think you should get much the same result as the hand-edited routine, at least for the stack-frame part, by compiling the libraries with the -fomit-frame-pointer compiler option.
Has anyone done this already? Is it hard for someone who has practically no experience in porting or building the GNU toolchain?
Stefaan