Finally we have our HW-based multiplier.
The trick for telling the compiler to use our custom multiplier was to put a "__mulsi3()" routine somewhere in the project. It seems to override the built-in library function of the same name. This method only works if we are using alt_main, which is fine for us because we don't use the HAL. Perhaps someone with deeper knowledge of the compiler can clarify this.
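As a rough illustration of the override, here is a minimal sketch assuming a memory-mapped peripheral. The register layout, the base address and the helper name hw_mul32 are all made up for this example. It also assumes that the peripheral holds off the read until the product is valid, so no polling is done in software.

```c
#include <stdint.h>

/* Hypothetical register layout; the real register map is not given here. */
typedef struct {
    volatile uint32_t op_a;      /* write: first operand */
    volatile uint32_t op_b;      /* write: second operand */
    volatile uint32_t result_lo; /* read: low 32 bits of the product */
} hw_mul_regs_t;

/* Hypothetical base address; use the one from your own system. */
#define HW_MUL_BASE 0x80001000u

/* Write both operands, then read the product back.
 * Assumes the peripheral stalls the read until the result is ready. */
static uint32_t hw_mul32(hw_mul_regs_t *regs, uint32_t a, uint32_t b)
{
    regs->op_a = a;
    regs->op_b = b;
    return regs->result_lo;
}

/* Same name and signature as the libgcc helper for 32-bit multiply.
 * This routine must not use '*' itself, or it could end up calling itself. */
int __mulsi3(int a, int b)
{
    hw_mul_regs_t *regs = (hw_mul_regs_t *)(uintptr_t)HW_MUL_BASE;
    return (int)hw_mul32(regs, (uint32_t)a, (uint32_t)b);
}
```

The low 32 bits of a product are the same for signed and unsigned operands, so one unsigned register read is enough here. If the CPU has a data cache, the register accesses would also have to bypass it.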
The HW is a peripheral (not a custom instruction) implementing an asynchronous multiplier. The result takes 3 clock cycles at 60 MHz on a Cyclone with speed grade 8. The multiplier unit is defined as multi-cycle in Quartus, hence no impact on Fmax. A nice side effect of this implementation is that we can get a 64-bit result with the same hardware and in the same time.
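To show how the 64-bit result could be picked up in software, here is a small sketch. It assumes the upper half of the product sits in a second result register and that the multiplier is unsigned. Neither is stated above, and the register names and the helper hw_mul64 are hypothetical.

```c
#include <stdint.h>

/* Hypothetical register layout with a second register for the upper half. */
typedef struct {
    volatile uint32_t op_a;      /* write: first operand */
    volatile uint32_t op_b;      /* write: second operand */
    volatile uint32_t result_lo; /* read: bits 31..0 of the product */
    volatile uint32_t result_hi; /* read: bits 63..32 of the product */
} hw_mul64_regs_t;

/* Full 32 x 32 -> 64 bit product (assumed unsigned) from one HW operation. */
static uint64_t hw_mul64(hw_mul64_regs_t *regs, uint32_t a, uint32_t b)
{
    regs->op_a = a;
    regs->op_b = b;
    uint32_t lo = regs->result_lo;
    uint32_t hi = regs->result_hi;
    return ((uint64_t)hi << 32) | lo;
}
```

The operands are written only once; the second read just fetches the other half of the product that the hardware has already computed.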
We know that normally everything should be designed strictly synchronous, but sometimes it's necessary to take a different approach.
BTW: with the same method we implemented a 32/32 divider that takes less than 12 cycles. And we get the quotient and the remainder at the same time.
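A divider peripheral of this kind could be read in a similar way. The sketch below is only a guess at the software side: the register names, the struct types and the helper hw_divmod32 are hypothetical, unsigned operands are assumed, and the read is again assumed to stall until the result is valid.

```c
#include <stdint.h>

/* Hypothetical register layout for the divider peripheral. */
typedef struct {
    volatile uint32_t dividend;  /* write */
    volatile uint32_t divisor;   /* write */
    volatile uint32_t quotient;  /* read */
    volatile uint32_t remainder; /* read */
} hw_div_regs_t;

typedef struct {
    uint32_t quot;
    uint32_t rem;
} hw_divmod_t;

/* One HW operation yields both quotient and remainder (assumed unsigned). */
static hw_divmod_t hw_divmod32(hw_div_regs_t *regs, uint32_t n, uint32_t d)
{
    hw_divmod_t r;
    regs->dividend = n;
    regs->divisor = d;
    r.quot = regs->quotient;
    r.rem = regs->remainder;
    return r;
}
```

Returning both values from one call saves a second HW operation in code that needs n / d and n % d together.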
Chris