Floating Point Matrix Multiplication

Honored Contributor

16 years ago

Try a fully pipelined architecture that multiplies 8 pairs of input elements and adds the products to produce one partial_sum(n) every clock.

This requires 8 altfp_mult and 7 altfp_add_sub instances, chained in a 4-stage pipeline: one multiplier stage followed by a 3-level adder tree (4 + 2 + 1 adders).
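As a rough behavioral sketch of that first part (not HDL, and ignoring pipeline registers), the Python below models what one clock's worth of data goes through. The function name `partial_sum` and the returned adder count are illustrative assumptions, not anything from the megafunctions themselves.

```python
def partial_sum(a, b):
    """Behavioral model of the first part for one clock of input data.

    a, b: the 8 element pairs fed in on this clock (assumed width of 8).
    Returns (sum of products, number of adders used in the tree).
    """
    assert len(a) == 8 and len(b) == 8
    # Stage 1: 8 multipliers working in parallel.
    level = [x * y for x, y in zip(a, b)]
    adders = 0
    # Stages 2 to 4: pairwise adder tree using 4, then 2, then 1 adders.
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        adders += len(level)
    return level[0], adders
```

The adder count comes out as 7, matching the 4 + 2 + 1 tree behind the multiplier stage.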

Then add each sequence of 7 partial results to obtain an output element.

You should be able to pipeline this part as well, with 3 more altfp_add_sub chained in a 3-stage pipeline.

- Every 2 clocks, partial_sum1(n) = partial_sum(n) + partial_sum(n-1).

- Every 4 clocks, partial_sum2(n) = partial_sum1(n) + partial_sum1(n-2).

- Every 8 clocks, element(n) = partial_sum2(n) + partial_sum2(n-4).

Here n counts first-stage clocks, and the adder latency is ignored in the indices.

Because this last part actually needs 8 operands, the first part needs a "pause" cycle after each sequence of 7 partial sums, during which its output is zero.
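To check the idea of padding 7 partial sums with one zero, here is a small behavioral sketch in Python. It assumes each output element is a dot product of 56 element pairs (7 groups of 8), which is my reading of the numbers above; the name `output_element` is hypothetical.

```python
def output_element(row, col):
    """Model one output element: 7 partial sums plus one zero 'pause' slot,
    reduced by three levels of pairwise adds (8 -> 4 -> 2 -> 1)."""
    assert len(row) == 56 and len(col) == 56  # assumed dot-product length
    # First part: one partial sum per clock, each covering 8 element pairs.
    sums = [sum(x * y for x, y in zip(row[i:i + 8], col[i:i + 8]))
            for i in range(0, 56, 8)]
    sums.append(0.0)  # pause cycle: the first part outputs zero
    # Second part: three adder stages, each pairing adjacent values.
    for _ in range(3):
        sums = [sums[i] + sums[i + 1] for i in range(0, len(sums), 2)]
    return sums[0]
```

The zero slot only rounds the group up to a power of two so the three adder stages pair up cleanly; it does not change the sum.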

This should be able to calculate one complete result every 8*56 clocks.

Latency should be 8*56 clocks + the latency of 1 altfp_mult + 3 altfp_add_sub (first part) + 3 altfp_add_sub (second part).
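The same estimate as a small Python helper, in case it is handy for trying different core settings. The per-core latencies are parameters because they depend on how the megafunctions are configured; the function and parameter names are assumptions for illustration.

```python
def total_latency(mul_latency, add_latency, issue_clocks=8 * 56):
    """Estimated clocks from first input to the complete result.

    mul_latency: clocks through one altfp_mult (depends on configuration).
    add_latency: clocks through one altfp_add_sub (depends on configuration).
    issue_clocks: clocks spent feeding data, 8*56 as estimated above.
    """
    first_part = mul_latency + 3 * add_latency   # multiplier + 3-level adder tree
    second_part = 3 * add_latency                # 3 chained accumulation adders
    return issue_clocks + first_part + second_part
```

For example, `total_latency(1, 1)` models the idealized case where every core takes a single clock.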

PS: I'm assuming altfp_add_sub and altfp_mult can accept new operands every clock (1-cycle throughput), like their integer counterparts. If that's not true, then this doesn't work.
