One row: 56 mults, 55 adds.
All 56 rows: 3136 mults, 3080 adds.
All 56 rows every 6 us: 523 Mmult/s and 513 Madd/s.
With a 5 ns clock, that becomes 2.6 mults/clock and 2.6 adds/clock.
Quite feasible.
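For anyone who wants to check the numbers, here is a minimal sketch of the arithmetic in Python. It assumes the figures describe a 56x56 matrix multiplied by a 56-element vector (the post does not say so explicitly), and the variable names are only for illustration.

```python
# Required arithmetic rate, assuming a 56x56 matrix times a 56-element vector
# (an assumption; the post only gives the per-row operation counts).
N = 56            # rows, and terms per row
PERIOD_NS = 6000  # one full result every 6 us
CLOCK_NS = 5      # 5 ns clock (200 MHz)

mults_per_row = N          # 56 products per row
adds_per_row = N - 1       # 55 additions to sum 56 products
total_mults = N * mults_per_row
total_adds = N * adds_per_row

# Operations per microsecond is the same number as millions of operations per second.
mult_rate_mops = total_mults / (PERIOD_NS / 1000)
add_rate_mops = total_adds / (PERIOD_NS / 1000)

clocks_per_period = PERIOD_NS // CLOCK_NS
mults_per_clock = total_mults / clocks_per_period
adds_per_clock = total_adds / clocks_per_period

print(f"{total_mults} mults, {total_adds} adds per result")
print(f"{mult_rate_mops:.0f} Mmult/s, {add_rate_mops:.0f} Madd/s")
print(f"{mults_per_clock:.2f} mults/clock, {adds_per_clock:.2f} adds/clock")
```

Both per-clock figures round to 2.6, so three pipelined multipliers and three pipelined adders would in principle cover the required rate, ignoring any scheduling overhead.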
Victor,
Looks like you're mixing throughput with latency.
An altfp_mul has a latency of 5 cycles but a throughput of one result every cycle. That means it takes 5 cycles to multiply a pair of numbers, but you can feed it a new pair of numbers every cycle. The same goes for altfp_add_sub.
Since matrix multiplication has lots of independent operations, this can be exploited.
The architecture I suggested performs 8 mults/clock and more than 7 adds/clock and manages one complete result every 448 cycles (2.2 us), with a latency of 495 cycles (about 2.5 us).
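To make the throughput vs. latency point concrete, here is a rough Python model of a fully pipelined unit. It is only a sketch: the function names are made up for illustration, and it assumes the operations are independent and that a new one can be issued every cycle, as described above for altfp_mul.

```python
def pipelined_cycles(n_ops, latency, interval=1):
    """Cycles until the last of n_ops independent results leaves a pipelined unit.

    The first result appears after `latency` cycles; each later result
    follows `interval` cycles after the previous one.
    """
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * interval


def unpipelined_cycles(n_ops, latency):
    """Cycles if each operation had to finish before the next one starts."""
    return n_ops * latency


def cycles_to_us(cycles, clock_ns=5):
    """Convert a cycle count to microseconds for a given clock period."""
    return cycles * clock_ns / 1000


MUL_LATENCY = 5  # cycles, the altfp_mul latency quoted in the post

# The 56 products of one row, fed into a single multiplier back to back.
print(pipelined_cycles(56, MUL_LATENCY))    # 5 + 55 = 60 cycles
print(unpipelined_cycles(56, MUL_LATENCY))  # 56 * 5 = 280 cycles

# The cycle counts quoted for the suggested architecture, at a 5 ns clock.
print(cycles_to_us(448))  # 2.24 us per result
print(cycles_to_us(495))  # 2.475 us latency
```

The latency is paid once per burst of independent operations, not once per operation, which is why the per-clock rate is set by the number of units rather than by their pipeline depth.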