Altera_Forum
Honored Contributor
16 years ago
Try a fully pipelined architecture that multiplies 8 input element pairs and adds the results to get a partial_sum(n) every clock.
This requires 8 altfp_mult and 7 altfp_add_sub, chained in a 4 stage pipeline (1 multiplier stage followed by a 3 level adder tree). Then add each sequence of 7 partial results to obtain an output element. You should be able to pipeline this part as well, with 3 more altfp_add_sub chained in a 3 stage pipeline.

- Every 2 clocks, partial_sum1(n) = partial_sum(n) + partial_sum(n-1)
- Every 4 clocks, partial_sum2(n) = partial_sum1(n) + partial_sum(n-3)
- Every 8 clocks, element(n) = partial_sum2(n) + partial_sum2(n-7)

Because this last part actually needs 8 operands, the first part needs a "pause" cycle after each sequence of 7 partial sums, in which its output is zero.

This should be able to calculate one complete result every 8*56 clocks. Latency should be 8*56 clocks + 1 altfp_mult + 3*altfp_add_sub + 3*altfp_add_sub.

PS: I'm assuming altfp_add_sub and altfp_mult have 1 cycle throughput, like the integer counterparts. If that's not true, then this doesn't work.
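To check the data flow, here is a rough behavioral model in Python. It is not HDL and it ignores all pipeline latencies. It assumes each output element is a 56 term dot product (7 partial sums of 8 products each), and it reads the three accumulation steps as a pairwise reduction over 8 slots (the 7 partial sums plus the zero "pause" slot). The function names `partial_sum`, `pairwise` and `element` are made up for this sketch.

```python
def partial_sum(a8, b8):
    """First part: 8 multipliers feeding a 3 level adder tree (7 adders)."""
    level = [x * y for x, y in zip(a8, b8)]  # stands in for 8 altfp_mult
    while len(level) > 1:                    # 4 + 2 + 1 = 7 altfp_add_sub
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def pairwise(stream):
    """One accumulation stage: add consecutive pairs, halving the data rate."""
    return [stream[i] + stream[i + 1] for i in range(0, len(stream), 2)]


def element(a56, b56):
    """Second part: reduce 7 partial sums plus one zero slot to one element."""
    # One partial sum per clock for 7 clocks, then the zero "pause" slot.
    slots = [partial_sum(a56[i:i + 8], b56[i:i + 8]) for i in range(0, 56, 8)]
    slots.append(0.0)
    s1 = pairwise(slots)    # 4 values, one every 2 clocks
    s2 = pairwise(s1)       # 2 values, one every 4 clocks
    (out,) = pairwise(s2)   # 1 value, one every 8 clocks
    return out


# Small integer-valued floats, so every intermediate sum is exact.
a = [float(i) for i in range(56)]
b = [float(56 - i) for i in range(56)]
reference = sum(x * y for x, y in zip(a, b))
result = element(a, b)
print(result == reference)
```

With real floating point data the tree adds the terms in a different order than a sequential loop would, so the low bits of the result may differ from a straightforward software reference.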