Forum Discussion

New Contributor

1 year ago

Solved

SIMD using DSP in Stratix 10

Hello all, I am new to Stratix 10. Do Altera's DSP blocks have a SIMD feature like Xilinx's? By SIMD I mean that one DSP performs four additions in parallel. I checked the DSP features in the Verilog template but...

Hdl

KennyT_altera
Super Contributor
1 year ago
Unlike Xilinx DSP48E1 slices, Altera DSP blocks are primarily optimized for multiply-accumulate (MAC) operations. SIMD-style packed adders are not exposed as directly, or as flexibly, as they are in Xilinx's DSP48E1 slices.

Can you see if the code below helps?

Verilog RTL: 4×12-bit Parallel Adder (Using ALM Carry Chains)

module parallel_4x12b_adder (
    input  [11:0] a0, a1, a2, a3,         // 4× 12-bit inputs
    input  [11:0] b0, b1, b2, b3,         // 4× 12-bit inputs
    output [12:0] sum0, sum1, sum2, sum3  // 4× 13-bit outputs to handle carry
);
    assign sum0 = a0 + b0;
    assign sum1 = a1 + b1;
    assign sum2 = a2 + b2;
    assign sum3 = a3 + b3;
endmodule

Explanation:

Each assign statement infers an independent adder, which the Quartus synthesizer maps onto the dedicated carry chains in the ALMs.

Outputs are 13 bits wide so the carry out of each 12-bit addition is kept rather than truncated.

Very efficient, with no wasted LUTs.
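The same pattern can be written once for any number of lanes. The sketch below is a hypothetical generalization of parallel_4x12b_adder, not part of the original answer: the module name parallel_adder, the LANES and WIDTH parameters, and the flattened bus layout are all assumptions made for illustration. It should infer the same kind of independent adders, though how Quartus maps them has not been verified here. A small self-checking testbench is included.

```verilog
// Hypothetical parameterized version of parallel_4x12b_adder.
// LANES, WIDTH and the flattened buses are illustrative choices.
module parallel_adder #(
    parameter LANES = 4,
    parameter WIDTH = 12
) (
    input  [LANES*WIDTH-1:0]     a,    // lane i occupies bits [i*WIDTH +: WIDTH]
    input  [LANES*WIDTH-1:0]     b,
    output [LANES*(WIDTH+1)-1:0] sum   // one extra bit per lane keeps the carry
);
    genvar i;
    generate
        for (i = 0; i < LANES; i = i + 1) begin : lane
            // Zero-extend both operands by one bit so the carry out of the
            // WIDTH-bit addition lands in the top bit of this lane's result.
            assign sum[i*(WIDTH+1) +: (WIDTH+1)] =
                {1'b0, a[i*WIDTH +: WIDTH]} + {1'b0, b[i*WIDTH +: WIDTH]};
        end
    endgenerate
endmodule

// Minimal self-checking testbench (simulation only).
module parallel_adder_tb;
    reg  [47:0] a, b;
    wire [51:0] sum;

    parallel_adder #(.LANES(4), .WIDTH(12)) dut (.a(a), .b(b), .sum(sum));

    initial begin
        // Lane 3 = 1, lane 2 = 2, lane 1 = 3, lane 0 = 0xFFF (forces a carry).
        a = {12'd1, 12'd2, 12'd3, 12'hFFF};
        b = {12'd1, 12'd2, 12'd3, 12'h001};
        #1;
        if (sum[12:0]  !== 13'h1000 ||   // lane 0: carry kept in bit 12
            sum[25:13] !== 13'd6    ||   // lane 1
            sum[38:26] !== 13'd4    ||   // lane 2
            sum[51:39] !== 13'd2)        // lane 3
            $display("FAIL: sum = %h", sum);
        else
            $display("PASS");
        $finish;
    end
endmodule
```

Because each lane has its own 13-bit result, a carry in one lane can never disturb a neighbouring lane, which is the property the packed 48-bit add lacks.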

If You Want to Pack into a 48-bit DSP-friendly Add (More Complex)

If you really wanted to try packing them and using a DSP block (assuming 48-bit add support, e.g., Arria 10/Stratix 10), here’s a conceptual version:

module packed_4x12b_adder (
    input  [47:0] a_packed,   // 4× 12-bit packed inputs
    input  [47:0] b_packed,   // 4× 12-bit packed inputs
    output [47:0] sum_packed  // 4× 12-bit packed results (a carry out of one lane spills into the next!)
);
    // One plain 48-bit add: nothing stops carries at the 12-bit lane boundaries.
    assign sum_packed = a_packed + b_packed;
endmodule

Notes:

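You'd need to align each 12-bit value on a 12-bit boundary in the 48-bit word (lane 0 in bits 11:0, lane 1 in bits 23:12, and so on).

If a lane's sum exceeds 12 bits, its carry propagates into the least significant bit of the next lane and corrupts that lane's result.

Post-add masking and saturation may be needed.

This is risky because Intel DSP blocks don't natively split a wide add into SIMD lanes the way Xilinx DSP slices do. Quartus might implement it in ALM logic anyway.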
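The cross-lane carry described in the notes above can be blocked at the logic level with a masking trick: clear the top bit of every lane before the wide add, so no lane can produce a carry out, then restore each top bit with an XOR. The following is a minimal sketch and not something from the original answer. The module name packed_4x12b_adder_lane_safe, the MSB_MASK constant, and the testbench are hypothetical. Each lane wraps modulo 2^12, so saturation would still have to be added separately, and nothing here guarantees that Quartus places the add in a DSP block.

```verilog
// Lane-safe variant of the packed add (illustrative sketch; names are hypothetical).
module packed_4x12b_adder_lane_safe (
    input  [47:0] a_packed,   // lane i occupies bits [12*i +: 12]
    input  [47:0] b_packed,
    output [47:0] sum_packed  // each lane holds (a + b) mod 2^12
);
    // Bit 11 (the MSB) of every 12-bit lane.
    localparam [47:0] MSB_MASK = 48'h800_800_800_800;

    // With the lane MSBs cleared, each lane's partial sum is at most
    // 0x7FF + 0x7FF = 0xFFE, so no carry crosses a lane boundary.
    wire [47:0] low_sum = (a_packed & ~MSB_MASK) + (b_packed & ~MSB_MASK);

    // Bit 11 of each lane of low_sum is the carry into the MSB position;
    // XOR with the original MSBs gives the true MSB of that lane's sum.
    assign sum_packed = low_sum ^ ((a_packed ^ b_packed) & MSB_MASK);
endmodule

// Minimal self-checking testbench (simulation only).
module packed_adder_tb;
    reg  [47:0] a, b;
    wire [47:0] naive = a + b;   // plain 48-bit add, as in packed_4x12b_adder
    wire [47:0] safe;

    packed_4x12b_adder_lane_safe dut (
        .a_packed(a), .b_packed(b), .sum_packed(safe)
    );

    initial begin
        a = 48'h000_000_000_FFF;  // lane 0 = 0xFFF, other lanes 0
        b = 48'h000_000_000_001;  // lane 0 = 0x001, other lanes 0
        #1;
        // naive: lane 0's carry leaks into lane 1, which then reads 0x001.
        // safe : lane 0 wraps to 0x000 and lane 1 stays 0x000.
        $display("naive lane1=%h  safe lane1=%h", naive[23:12], safe[23:12]);
        if (naive[23:12] !== 12'h001 || safe !== 48'h0)
            $display("FAIL");
        else
            $display("PASS");
        $finish;
    end
endmodule
```

The extra AND and XOR stages cost logic of their own, so this mainly illustrates why the plain packed add is unsafe. It does not change the recommendation to use independent 12-bit adders.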

Recommendation

✔ Use the first version. Quartus will map those independent 12-bit additions onto ALMs using fast carry chains, which is highly efficient and wastes no LUTs.

✔ Avoid the packed 48-bit add unless you're certain the device and toolchain will optimize it safely into a DSP block.

Let me know if the above helps to some extent. If not, we may have to leave this for a future enhancement in Quartus.
