--- Quote Start ---
The processor manual for that chip advertises 3 cycles for floating point multiply-accumulate, which it looks like your inner kernel requires two of them and at 250MHz is considerable power.
If you are actually realizing performance significantly worse than that, then your question becomes an STM32 tools / optimization discussion not appropriate for this forum although many people here may be knowledgeable about it.
If you are certain you have already maxed out the chip and definitely require a co-processor in an FPGA, then you will need to say some more about your proposed system architecture and requirements.
--- Quote End ---
Memory-to-memory FMAC is more than good enough for my task. The problem is timing: I need to check flags, the clock state and so on, and that polling is slow and kills performance.
The idea is simple.
The processor will generate a start pulse (50 ns long).
The FPGA will start the detector with the same pulse, wait 7 delay cycles (pipelined ADC), and after that multiply each ADC value by the corresponding floating-point constant and add the products together. That would generate the real and imaginary parts at that exact frequency.
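So the C code would look like this:
while (i < 128)
{
    while (GPIOC->IDR < 32766);  // is the FIFO empty?
    CLK_LOW;
    CLK_HIGH;                    // clock the next sample out of the FIFO
    k = GPIOB->IDR;              // ADC value from the FIFO
    real += k * cosinusas[i];    // one cosine coefficient per sample
    imag -= k * sinusas[i];      // one sine coefficient per sample
    i++;
}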
After that, I would output real and imag on 2 x 32-bit ports and do the complex math on the STM32F4. Or maybe it is possible to get the phase value from this?
C code for a fast atan2 is here:
dspguru.com/dsp/tricks/fixed-point-atan2-with-self-normalization
If that is possible, I could implement the complete PIC controller inside the Cyclone FPGA.