I want to multiply a 56x56 matrix by a 56x1 matrix in floating point. There is an altfp_matrix_mult megafunction, but it does not compute this quickly enough. I am looking for ideas on how to implement this in floating point. Please let me know if you have any ideas. Thanks.

Floating Point Matrix Multiplication | Altera Community

18 Replies

Altera_Forum
Honored Contributor
16 years ago
Is latency really a factor? What's the application for this?

FPGAs really don't like doing floating point. I would recommend trying to convert it to fixed point, as that hugely reduces the latency and logic requirements. If you really have to do it in floating point, you're stuck with long latency and large resource requirements.
Altera_Forum
Honored Contributor
16 years ago
The altfp_matrix_mult handbook gives an overview of the required FPGA resources versus GFLOPS throughput. You can hardly expect to achieve a better result with a different FP design, so it basically answers the question of whether the intended design is feasible at all.

If the achievable GFLOPS figure isn't an issue, but altfp_matrix_mult doesn't fit the design structure, then it can make sense to think about a different FP design.
Altera_Forum
Honored Contributor
16 years ago
Are you resource starved? What happens if you change the calculation to 56 dot products, i.e. each row of the 56x56 matrix multiplied by the same 56-element vector?
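A minimal software sketch of that restructuring, to make the data flow concrete. It is plain Python, not HDL, and the test data and function names are made up for illustration:

```python
def dot(row, vec):
    """Dot product of one matrix row with the vector."""
    return sum(a * b for a, b in zip(row, vec))


def matvec_by_rows(matrix, vec):
    """Matrix-vector product computed as one independent dot product per row."""
    return [dot(row, vec) for row in matrix]


N = 56
# Made-up test data: A[i][j] = i + j, x = all ones.
A = [[float(i + j) for j in range(N)] for i in range(N)]
x = [1.0] * N
y = matvec_by_rows(A, x)
```

Each of the 56 dot products reads the same vector and a different row, so they can be scheduled in parallel, one after another, or anything in between.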
Altera_Forum
Honored Contributor
16 years ago
--- Quote Start ---
Is latency really a factor? What's the application for this?

FPGAs really don't like doing floating point. I would recommend trying to convert it to fixed point, as that hugely reduces the latency and logic requirements. If you really have to do it in floating point, you're stuck with long latency and large resource requirements.
--- Quote End ---

I want to first see if this can be done in floating point before looking at fixed point.
The result needs to be available every 6 us. This includes the time to load the matrices, or at least the smaller one. The bigger one does not change frequently.

Yes, I want to do it with the least amount of resources.

I am targeting a Stratix III, so if I do 56 multiplications in parallel, it will use up 224 of the 288 multipliers (about 78%) just for this.
Altera_Forum
Honored Contributor
16 years ago
Since 56 in parallel takes too many resources, split the large matrix into groups of rows - enough groups to meet the latency requirement. For example, use 7 matrix multipliers, each processing 8 rows, then recollect the seven 8x1 results.
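As a rough illustration of the row-block split (a Python sketch of the arithmetic only; the function names and test data are hypothetical, and in hardware each block would be a separate multiplier instance rather than a loop iteration):

```python
def matvec_block(rows, vec):
    """Multiply one group of rows by the vector (one 'matrix mult' unit)."""
    return [sum(a * b for a, b in zip(row, vec)) for row in rows]


def matvec_row_blocks(matrix, vec, rows_per_block=8):
    """Split the matrix into row groups, process each, and recollect the results."""
    result = []
    for start in range(0, len(matrix), rows_per_block):
        block = matrix[start:start + rows_per_block]
        result.extend(matvec_block(block, vec))  # each block yields an 8x1 piece
    return result


N = 56
A = [[float((i * 3 + j) % 7) for j in range(N)] for i in range(N)]  # made-up data
x = [float(j % 5) for j in range(N)]

blocked = matvec_row_blocks(A, x, rows_per_block=8)   # 7 blocks of 8 rows
direct = matvec_block(A, x)                            # all 56 rows at once
num_blocks = N // 8
```

Because the row groups are independent, the number of blocks is a free parameter for trading multipliers against time per result.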
Altera_Forum
Honored Contributor
16 years ago
I generated the megafunction with Quartus II 9.0:
Columns AA: 64 (it does not allow 56)
Rows AA: 56
Columns BB: 1
Vector size: 16
Block size: 2

Synthesis gave me:
64 multipliers (22%)
14% logic
7% memory

When I simulate it, the time period between successive results is 18 us
(it takes nearly 6 us just to output the result).

Is it possible to input and output data faster in order to get a higher throughput?

Even with a vector size of 32 it takes the same amount of time to compute, and ~3 us to output the result.

Do you know how many GFLOPS I can expect from this design?
I believe I am getting 0.38 GFLOPS (the benchmarks list designs with 4 - 55 GFLOPS on Stratix III). How is it computed?
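One common convention, assumed here rather than taken from the handbook, is to count one multiply and one add per matrix element and divide by the time between results. The thread does not say how the 0.38 figure or the benchmark numbers were derived, so this Python sketch only shows the order of magnitude:

```python
def gflops(rows, cols, period_s):
    """Throughput estimate: 2 floating-point ops (multiply + add) per matrix element."""
    flops_per_result = 2 * rows * cols
    return flops_per_result / period_s / 1e9


measured_period = 18e-6   # time between successive results, from the simulation
required_period = 6e-6    # the stated requirement

useful = gflops(56, 56, measured_period)      # counting only the real 56x56 data
padded = gflops(56, 64, measured_period)      # counting the padded 64 columns too
needed = gflops(56, 56, required_period)      # what a 6 us result period implies
```

Under this convention the measured design lands between roughly 0.35 and 0.40 GFLOPS depending on whether the padding columns are counted, and a 6 us result period needs a little over 1 GFLOPS.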
Altera_Forum
Honored Contributor
16 years ago
I have uploaded the waveform showing the time between two output elements: 20 clocks at a clock period of 5 ns, i.e. 100 ns per element.
Altera_Forum
Honored Contributor
16 years ago
Looking at the altfp_matrix_mult handbook, I fear it's not designed to perform a fast multiplication with an [m,1] vector "matrix". All examples are [n,m] x [m,n]. Hopefully it's at least giving correct results. You most likely need to implement an optimized algorithm yourself.
Altera_Forum
Honored Contributor
16 years ago
You could also try rearranging the matrix and vector and check whether the implementation works on rows faster than on columns. Try using BB transpose for AA and AA transpose for BB. Mathematically, the result = AA * BB = (BB' * AA')'. See http://en.wikipedia.org/wiki/matrix_multiplication#common_properties. Depending on how your matrices are stored, transposes can be "free" in FPGAs.
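The identity is easy to sanity-check in software before changing the hardware. A small Python sketch with made-up 3x3 and 3x1 data (the helper names are hypothetical):

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


AA = [[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0],
      [7.0, 8.0, 9.0]]
BB = [[1.0], [0.0], [2.0]]   # column vector stored as a 3x1 matrix

direct = matmul(AA, BB)                                        # AA * BB
swapped = transpose(matmul(transpose(BB), transpose(AA)))     # (BB' * AA')'
```

Both forms give the same 3x1 result, so the only question is which operand order the core streams faster.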
Altera_Forum
Honored Contributor
16 years ago
Try a fully pipelined architecture that multiplies 8 pairs of input elements and adds the products to give one partial_sum(n) every clock.
This requires 8 altfp_mult and 7 altfp_add_sub, chained in a 4-stage pipeline.

Then add each sequence of 7 partial results to obtain an output element.
You should be able to pipeline this part as well, with 3 more altfp_add_sub chained in a 3-stage pipeline.
- Every 2 clocks, partial_sum1(n) = partial_sum(n) + partial_sum(n-1);
- Every 4 clocks, partial_sum2(n) = partial_sum1(n) + partial_sum(n-3).
- Every 8 clocks, element(n) = partial_sum2(n) + partial_sum2(n-7)

Because this last part actually needs 8 operands, the first part needs a "pause" cycle after each sequence of 7 partial sums, in which its output is zero.

This should be able to calculate one complete result every 8*56 clocks.
Latency should be 8*56 clocks + 1 altfp_mult + 3*altfp_add_sub + 3*altfp_add_sub.

PS: I'm assuming altfp_add_sub and altfp_mult have a throughput of one result per cycle, like their integer counterparts. If that's not true, then this doesn't work.
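A functional model of this scheme in Python may help check the arithmetic before writing HDL. It is only a sketch under assumptions: it is not cycle-accurate, it ignores the internal latency of the floating-point cores, all names are hypothetical, and the 5 ns clock is borrowed from the simulation mentioned earlier in the thread rather than known for this design.

```python
LANES = 8    # multipliers in the first stage
N = 56       # row length; 56 = 7 groups of 8


def tree_sum(values):
    """Pairwise adder tree; the number of inputs must be a power of two."""
    level = list(values)
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def row_element(row, vec):
    """One output element: 7 partial sums plus a zero 'pause' slot, then 3 add levels."""
    partial = []
    for start in range(0, N, LANES):
        products = [row[start + k] * vec[start + k] for k in range(LANES)]
        partial.append(tree_sum(products))   # first stage: 8 multiplies, 7 adds
    partial.append(0.0)                      # pause cycle: the 8th operand is zero
    return tree_sum(partial)                 # second stage: 8 operands, 3 add levels


def matvec(matrix, vec):
    return [row_element(row, vec) for row in matrix]


A = [[float((i + 2 * j) % 9) for j in range(N)] for i in range(N)]  # made-up data
x = [float(j % 4) for j in range(N)]
y = matvec(A, x)

clocks_per_result = 8 * N                    # 8 slots per row, 56 rows
result_period_ns = clocks_per_result * 5     # assuming a 5 ns clock
```

With those assumptions the result period comes to 448 clocks, or 2240 ns, which would fit within the 6 us requirement if the cores really accept one operation per clock.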
