Try a fully pipelined architecture that multiplies 8 pairs of input elements and adds the products to produce one partial_sum(n) every clock.
This requires 8 altfp_mult and 7 altfp_add_sub, chained in a 4-stage pipeline (1 multiplier stage followed by 3 adder-tree levels).
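As a rough illustration of that first stage, here is a minimal behavioural sketch in Python. It is not HDL and not cycle-accurate; the function name `partial_sum` and the use of plain Python floats are my own assumptions. It only shows why 8 multipliers need 4 + 2 + 1 = 7 adders arranged in 3 levels.

```python
def partial_sum(a, b):
    """Behavioural model of stage 1: 8 multiplies, then a 3-level adder tree.

    Hypothetical helper for illustration only; in hardware each '*' would be
    one multiplier instance and each '+' one adder instance.
    """
    assert len(a) == 8 and len(b) == 8
    # Pipeline stage 1: 8 multipliers working in parallel.
    level = [x * y for x, y in zip(a, b)]
    adders = 0
    # Pipeline stages 2..4: pairwise adds, 8 -> 4 -> 2 -> 1 values.
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        adders += len(level)  # one adder per value produced at this level
    return level[0], adders


if __name__ == "__main__":
    value, adders = partial_sum([1.0] * 8, [float(i) for i in range(8)])
    print(value, adders)
```

In the real design each level is registered, so a new set of 8 pairs can enter every clock while earlier sets are still moving through the tree.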
Then add each sequence of 7 partial results to obtain an output element.
You should be able to pipeline this part as well, with 3 more altfp_add_sub chained in a 3-stage pipeline:
- Every 2 clocks, partial_sum1(n) = partial_sum(n) + partial_sum(n-1).
- Every 4 clocks, partial_sum2(n) = partial_sum1(n) + partial_sum1(n-2).
- Every 8 clocks, element(n) = partial_sum2(n) + partial_sum2(n-4).
Because this last part actually needs 8 operands, the first part needs a "pause" cycle after each sequence of 7 partial sums, in which its output is zero.
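To make the grouping concrete, here is a small behavioural sketch in Python of the second part. It is only a model of the arithmetic, not of the timing; the function name `reduce_groups` is hypothetical, and it assumes the partial sums arrive as a flat list whose length is a multiple of 7.

```python
def reduce_groups(partial_sums):
    """Pad each run of 7 partial sums with one zero, then reduce 8 -> 4 -> 2 -> 1.

    Hypothetical helper for illustration; each list comprehension stands in
    for one time-shared adder in the 3-stage accumulation pipeline.
    """
    assert len(partial_sums) % 7 == 0
    elements = []
    for g in range(0, len(partial_sums), 7):
        # "Pause" cycle: the first part outputs zero in the 8th slot.
        slots = list(partial_sums[g:g + 7]) + [0.0]
        # First adder: one result every 2 clocks.
        s1 = [slots[i] + slots[i + 1] for i in range(0, 8, 2)]
        # Second adder: one result every 4 clocks.
        s2 = [s1[i] + s1[i + 1] for i in range(0, 4, 2)]
        # Third adder: one output element every 8 clocks.
        elements.append(s2[0] + s2[1])
    return elements


if __name__ == "__main__":
    print(reduce_groups([1.0] * 7 + [2.0] * 7))
```

The zero slot is what lets a power-of-two reduction handle 7 operands without any special-case logic in the adders.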
This should be able to calculate one complete result every 8*56 clocks.
Latency should be 8*56 clocks plus the latency of 1 altfp_mult, 3 altfp_add_sub (adder tree) and 3 more altfp_add_sub (final accumulation).
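The latency estimate above can be written as a one-line formula. The sketch below is in Python; the function name `total_latency` is hypothetical, and the latency values in the usage line are placeholders I picked for illustration, since the real numbers depend on how the floating-point cores are configured.

```python
def total_latency(mult_latency, add_latency, stream_clocks=8 * 56):
    """Clocks from first input to last output, following the estimate above.

    mult_latency and add_latency are the per-core pipeline depths in clocks
    (assumed inputs, to be taken from your own core configuration).
    """
    adder_tree = 3 * add_latency    # 3 levels after the multipliers
    accumulation = 3 * add_latency  # 3 chained adders in the second part
    return stream_clocks + mult_latency + adder_tree + accumulation


if __name__ == "__main__":
    # Placeholder latencies only, not values taken from any datasheet.
    print(total_latency(mult_latency=5, add_latency=7))
```

With those placeholder values the core latencies add 47 clocks on top of the 448 streaming clocks, so the streaming time dominates.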
PS: I'm assuming altfp_add_sub and altfp_mult have a throughput of one result per clock, like their integer counterparts. If that's not true, then this doesn't work.