Forum Discussion

Altera_Forum
16 years ago

Floating Point Matrix Multiplication

I want to multiply a 56x56 matrix with a 56x1 matrix in floating point.

There is an altfp_matrix_mult megafunction, but it does not compute this quickly enough.

I am looking for ideas on how to implement this in floating point. Please let me know if you have any suggestions. Thanks.

18 Replies

  • Altera_Forum

    They have minimum pipeline delays of 7 and 5 cycles, respectively, and more if the highest clock frequency is targeted.

  • Altera_Forum

    That will take 5 + 7×3 (for the first 8 multiplications and 7 additions) + 5 + 7 + 7 + 7 = 52 cycles for one row.

    For 56 rows: 56 × 52 = 2912 cycles = 14,560 ns (at a 5 ns clock).

    That is slower than I need.
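    The estimate above can be reproduced in Python as a sanity check. This is only a restatement of the post's arithmetic: the 5 and 7 cycle latencies are the figures quoted in the thread, the 5 ns clock period is inferred from 2912 cycles = 14560 ns, and every operation is charged its full latency in sequence, with no overlap between operations.

```python
# Latency-bound estimate: each stage waits for the previous result,
# so pipeline latencies add up serially.
MUL_LATENCY = 5   # cycles, multiplier latency quoted in the thread
ADD_LATENCY = 7   # cycles, adder latency quoted in the thread
CLOCK_NS = 5      # assumed clock period, inferred from 2912 cycles = 14560 ns
ROWS = 56

def serial_cycles_per_row() -> int:
    # 5 + 7*3 for the first 8 multiplications and 7 additions (3-level adder tree),
    # followed by 5 + 7 + 7 + 7 as written in the post
    return (MUL_LATENCY + 3 * ADD_LATENCY) + (MUL_LATENCY + 3 * ADD_LATENCY)

cycles_per_row = serial_cycles_per_row()
total_cycles = ROWS * cycles_per_row
total_ns = total_cycles * CLOCK_NS
print(cycles_per_row, total_cycles, total_ns)  # 52 2912 14560
```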
  • Altera_Forum

    Project feasibility can be determined from the required GFLOP/s and the available (or allotted) multiplier resources, before thinking about architecture.
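    As a sketch of that check, with symbols introduced here rather than taken from the thread (n for the matrix size, T for the time allowed per result vector, t_clk for the clock period), the operation counts for an n×n by n×1 product and the resulting lower bounds on the number of multipliers U_mul and adders U_add, each accepting one operation per clock, are:

```latex
\begin{aligned}
N_{\mathrm{mul}} &= n^2, \qquad N_{\mathrm{add}} = n(n-1), \\
\text{FLOP/s} &= \frac{N_{\mathrm{mul}} + N_{\mathrm{add}}}{T} = \frac{2n^2 - n}{T}, \\
U_{\mathrm{mul}} &\ge \left\lceil \frac{N_{\mathrm{mul}}\, t_{\mathrm{clk}}}{T} \right\rceil, \qquad
U_{\mathrm{add}} \ge \left\lceil \frac{N_{\mathrm{add}}\, t_{\mathrm{clk}}}{T} \right\rceil
\end{aligned}
```

    For n = 56 and T = 6 µs, this is 2·56² − 56 = 6216 FLOPs per result, or about 1.04 GFLOP/s.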

  • Altera_Forum

    One row: 56 mults, 55 adds.

    All 56 rows: 3136 mults, 3080 adds.

    All 56 rows every 6 µs: 523 Mmult/s and 513 Madd/s.

    With a 5 ns clock, that becomes 2.6 mults/clock and 2.6 adds/clock.

    Quite feasible.
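    The same budget calculation as a short Python sketch. The 6 µs budget and 5 ns clock are the figures from the post; the unit counts are only a lower bound that assumes each multiplier or adder is fully pipelined (one operation accepted per clock) and ignores the extra structure needed to reduce the sums.

```python
import math

N = 56              # matrix dimension (56x56 times 56x1)
BUDGET_NS = 6_000   # 6 us per result vector, as in the post
CLOCK_NS = 5        # 5 ns clock, as in the post

mults = N * N        # 56 multiplications per row, 56 rows
adds = N * (N - 1)   # 55 additions per row, 56 rows

mult_rate = mults / (BUDGET_NS * 1e-9)   # multiplications per second
add_rate = adds / (BUDGET_NS * 1e-9)     # additions per second

mults_per_clock = mults * CLOCK_NS / BUDGET_NS
adds_per_clock = adds * CLOCK_NS / BUDGET_NS

# Lower bound on fully pipelined units (each accepts one operation per clock)
mul_units = math.ceil(mults_per_clock)
add_units = math.ceil(adds_per_clock)

print(mults, adds)                                          # 3136 3080
print(round(mult_rate / 1e6), round(add_rate / 1e6))        # 523 513
print(round(mults_per_clock, 2), round(adds_per_clock, 2))  # 2.61 2.57
print(mul_units, add_units)                                 # 3 3
```

    With these inputs the addition rate is about 513 M/s (3080 adds in 6 µs), slightly below the multiplication rate, and both come to roughly 2.6 operations per clock.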

    Victor,

    it looks like you're confusing throughput with latency.

    An altfp_mul has a latency of 5 cycles but a throughput of one result per cycle. This means it takes 5 cycles to multiply a pair of numbers, but you can feed it a new pair every cycle. The same goes for altfp_add_sub.

    Since matrix multiplication consists of many independent operations, this pipelining can be exploited.
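    A small Python model of the difference between latency and throughput. It assumes independent operations and a unit that either accepts a new operand pair every cycle (pipelined) or is only reused once the previous result is available (the latency-bound view); stalls and routing are ignored. The function name is made up for this sketch.

```python
def last_result_cycle(num_ops: int, latency: int, pipelined: bool) -> int:
    """Cycle at which the last of num_ops independent operations completes.

    An operation issued at cycle t has its result ready at t + latency.
    A pipelined unit accepts a new operand pair every cycle; a latency-bound
    schedule waits for each result before issuing the next operation.
    """
    issue = 0
    ready = 0
    for _ in range(num_ops):
        ready = issue + latency
        issue = issue + 1 if pipelined else ready
    return ready

MUL_LATENCY = 5  # altfp_mul latency quoted in the thread

print(last_result_cycle(56, MUL_LATENCY, pipelined=False))  # 280
print(last_result_cycle(56, MUL_LATENCY, pipelined=True))   # 60
```

    For 56 multiplications on one 5-cycle multiplier, the pipelined schedule finishes after latency + (N − 1) = 60 cycles instead of 56 × 5 = 280.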

    The architecture I suggested performs 8 mults/clock and more than 7 adds/clock, and produces one complete result every 448 cycles (2.2 µs) with a latency of 495 cycles (about 2.5 µs).
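    The architecture referred to above is not quoted in this thread, so the following is only a hypothetical software model of one way to keep 8 multiplications per clock busy: each row is split into chunks of 8 (56 = 7 × 8), the 8 products of a chunk are summed by a 3-level adder tree, and the chunk sums are accumulated. It is meant as a reference for checking hardware results, not as a description of the suggested design; LANES, adder_tree and matvec_blocked are names invented for this sketch.

```python
import math
import random

LANES = 8  # hypothetical: 8 multiplications issued per clock

def adder_tree(values):
    """Pairwise reduction, like a log2(LANES)-level tree of adders (3 levels for 8)."""
    while len(values) > 1:
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]

def matvec_blocked(a, x):
    """y = A x, summing each row in chunks of LANES products."""
    n = len(x)
    assert n % LANES == 0, "sketch assumes n is a multiple of LANES (56 = 7 * 8)"
    y = []
    for row in a:
        acc = 0.0
        for start in range(0, n, LANES):
            # The products and tree sum of one chunk do not depend on other chunks;
            # only the running accumulation into acc does.
            products = [row[j] * x[j] for j in range(start, start + LANES)]
            acc += adder_tree(products)
        y.append(acc)
    return y

# Check against a straightforward left-to-right dot product
random.seed(0)
n = 56
a = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
y = matvec_blocked(a, x)
ref = [sum(row[j] * x[j] for j in range(n)) for row in a]
print(all(math.isclose(p, q, rel_tol=1e-9, abs_tol=1e-9) for p, q in zip(y, ref)))  # True
```

    Because floating-point addition is not associative, a result summed in this tree order can differ in the last bits from a plain left-to-right sum, so hardware output should be compared with a tolerance rather than for exact equality.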
  • Altera_Forum

    I did not realize that you can send the next data input without waiting for the prior operation to finish. Thanks for pointing it out.

  • Altera_Forum

    Hello there,

    I would like to ask whether you got the same error as I did. Whenever I try to compile a single matrix_multiplier, this error appears:

    library "work" does not contain primary unit "hcc_package"

    Do I have to import this package? I've been looking for it but can't find it. Where is it?

    Regards,

    Juan