I'm tuning my logic to meet timing. I have a long continuous assignment statement that outputs a signal based on the current state in my FSM. The implementation below results in a long mux chain. Is there an alternative coding style that results in smaller path delay?

assign tx_tlp_dword_offset =
		(curstate == HANDLE_BAR1_READ_1_STATE) ? 7'h4 :
		(curstate == HANDLE_BAR1_READ_2_STATE) ? 7'h0 :
		(curstate == HANDLE_BAR1_READ_3_STATE) ? 7'h1 :
		(curstate == HANDLE_BAR1_READ_4_STATE) ? 7'h2 :
		(curstate == HANDLE_BAR1_READ_5_STATE) ? 7'h5 :
		(curstate == HANDLE_BAR1_READ_6_STATE) ? 7'h5 :
		(curstate == H2D_DMA_INIT_MEMRD_DW0_STATE) ? 7'h0 :
		(curstate == H2D_DMA_INIT_MEMRD_DW1_STATE) ? 7'h1 :
		(curstate == H2D_DMA_INIT_MEMRD_DW2_STATE) ? 7'h2 :
		(curstate == H2D_DMA_INIT_MEMRD_DW3_STATE) ? 7'h3 :
		(curstate == H2D_DMA_SEND_MEMRD_TLP_STATE) ? 7'h3 :
		(curstate == H2D_DMA_SEND_MEMRD_TLP2_STATE) ? 7'h3 :
		(curstate == D2H_DMA_INIT_MEMWR_DW0_STATE) ? 7'h0 :
		(curstate == D2H_DMA_INIT_MEMWR_DW1_STATE) ? 7'h1 :
		(curstate == D2H_DMA_INIT_MEMWR_DW2_STATE) ? 7'h2 :
		(curstate == D2H_DMA_INIT_MEMWR_PL_STATE) ? reg_tx_tlp_dword_offset :
		(curstate == D2H_DMA_SEND_MEMWR_TLP_STATE) ? reg_tx_tlp_dword_offset :
		(curstate == D2H_DMA_SEND_MEMWR_TLP2_STATE) ? reg_tx_tlp_dword_offset :
		7'h0;

I have attached the TimeQuest path information as well as the output from the RTL viewer.

I'm no VHDL expert, but maybe use single-bit (one-hot) values for each of the 'curstate' values so that each line becomes: ((curstate & XXX) ? 7'hx : 7'h0) | ... which is simple logic. Dunno if that ?: gets optimised away (easy to do).
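A minimal sketch of the one-hot idea above, assuming curstate is re-encoded so that each state owns exactly one bit. The module name, the bit positions, and the reduced three-state list are hypothetical and only illustrate the AND-OR structure; they are not taken from the original design.

```verilog
// Hypothetical sketch: one-hot state vector with an AND-OR output decode.
module onehot_offset_sketch (
    input  wire [2:0] curstate,                 // one-hot: exactly one bit set
    input  wire [6:0] reg_tx_tlp_dword_offset,
    output wire [6:0] tx_tlp_dword_offset
);
    // Assumed bit positions (not from the original design).
    localparam HANDLE_BAR1_READ_1_BIT    = 0;
    localparam HANDLE_BAR1_READ_5_BIT    = 1;
    localparam D2H_DMA_INIT_MEMWR_PL_BIT = 2;

    // Each term is gated by a single state bit, then all terms are OR-ed.
    // States that should output 7'h0 need no term at all.
    assign tx_tlp_dword_offset =
        ({7{curstate[HANDLE_BAR1_READ_1_BIT]}}    & 7'h4) |
        ({7{curstate[HANDLE_BAR1_READ_5_BIT]}}    & 7'h5) |
        ({7{curstate[D2H_DMA_INIT_MEMWR_PL_BIT]}} & reg_tx_tlp_dword_offset);
endmodule
```

This only behaves like the original ?: chain if the state vector really is one-hot; if two bits were ever set at once, the outputs would be OR-ed together.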

Use a case statement instead of "if-then-else". "if-then-else" generates a priority encoder and a case statement does not. Hope that helps /Boris

Well, I can't comment exactly, but I have a doubt. Do you really need a priority encoder? Can you do it in a case statement or maybe a state machine?

I can't see the priority encoder point for the present code. As long as the IF conditions are mutually exclusive (of course they are!) there's no difference between the IF THEN ELSE chain and a case construct. Both can be expected to end up in the same gate level netlist. The actual effort depends however on the chosen state encoding.
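For comparison, the same kind of decode written as a case statement might look like the sketch below. The 5-bit state width, the encoding values, and the module wrapper are assumptions, and only a few of the original states are shown. Since the conditions are mutually exclusive, the synthesized result can be expected to match the ?: chain, as argued above.

```verilog
// Hypothetical sketch: case-based decode of the FSM state.
module case_offset_sketch (
    input  wire [4:0] curstate,
    input  wire [6:0] reg_tx_tlp_dword_offset,
    output reg  [6:0] tx_tlp_dword_offset
);
    // Assumed encodings (not from the original design).
    localparam [4:0] HANDLE_BAR1_READ_1_STATE     = 5'd1;
    localparam [4:0] HANDLE_BAR1_READ_3_STATE     = 5'd3;
    localparam [4:0] H2D_DMA_INIT_MEMRD_DW1_STATE = 5'd8;
    localparam [4:0] D2H_DMA_INIT_MEMWR_PL_STATE  = 5'd16;

    always @(*) begin
        case (curstate)
            HANDLE_BAR1_READ_1_STATE:     tx_tlp_dword_offset = 7'h4;
            // Several states sharing one output value are listed together.
            HANDLE_BAR1_READ_3_STATE,
            H2D_DMA_INIT_MEMRD_DW1_STATE: tx_tlp_dword_offset = 7'h1;
            D2H_DMA_INIT_MEMWR_PL_STATE:  tx_tlp_dword_offset = reg_tx_tlp_dword_offset;
            // The default covers every state that outputs 7'h0 and avoids a latch.
            default:                      tx_tlp_dword_offset = 7'h0;
        endcase
    end
endmodule
```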

If you examine the TQ diagram you see that the code shown in post 1 is only half the data path and second that 66% of the data path is Interconnect delay. The code shown only takes two hops, although it is a lot of typed text it is a rather simple 'mux' and the compiler can optimize heavily (given that the 'mux' inputs are mostly constants). The high ratio of interconnect makes me think your FPGA is getting full? Can I suggest to put a pipeline register between the resp. outputs of main_controller_inst and the input of tx_tlp_buffer_inst?
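The pipeline register suggested above could be sketched as follows. The clock name clk, the _q suffix, and the module wrapper are assumptions; in the real design the register would sit between the main_controller_inst output and the tx_tlp_buffer_inst input.

```verilog
// Hypothetical sketch: register the decoded offset to split the long path.
module offset_pipeline_sketch (
    input  wire       clk,                    // assumed clock name
    input  wire [6:0] tx_tlp_dword_offset,    // combinational decode output
    output reg  [6:0] tx_tlp_dword_offset_q   // registered copy for the buffer
);
    always @(posedge clk)
        tx_tlp_dword_offset_q <= tx_tlp_dword_offset;
endmodule
```

The registered value arrives one clock cycle later, so any control or data signals consumed alongside it would need the same one-cycle delay.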

Coding style to minimize combinational path delay?

20 Replies

Altera_Forum
Honored Contributor
13 years ago
very interesting..getting to learn a lot here...thank you all
Altera_Forum
Honored Contributor
13 years ago
--- Quote Start ---
...would get into a habit of just using case statements when you don't need priority...

--- Quote End ---

I just want to point out, in contrast to multiple comments in this thread, that case statements are inherently supposed to have priority. They are not supposed to execute all branches in parallel. Don't take my word for it, though; take it from someone who trains Verilog professionally:

http://sutherland-hdl.com/online_verilog_ref_guide/vlog_ref_top.html

"Compares the net, register or literal value to each case and executes the statement or statement group associated with the first matching case."

I have found this to be a point of contention across different tools. If you want to make sure that the branches of a case statement execute in parallel, look into the "unique" key word from SystemVerilog.
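The first-match behaviour quoted above only becomes visible when case items overlap. A small illustrative sketch (the module and signal names are hypothetical):

```verilog
// Hypothetical sketch: overlapping casez items are resolved in source order.
module case_priority_sketch (
    input  wire [1:0] sel,
    output reg  [1:0] y
);
    always @(*) begin
        casez (sel)
            2'b1?:   y = 2'd2;  // matches 2'b10 and 2'b11
            2'b11:   y = 2'd3;  // never selected: 2'b11 already matched above
            2'b01:   y = 2'd1;
            default: y = 2'd0;
        endcase
    end
endmodule
```

When every item is a distinct constant, as with FSM state values, no two items can match at once and the ordering has no effect on the result.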
Altera_Forum
Honored Contributor
13 years ago
--- Quote Start ---
I just want to point out, in contrast to multiple comments in this thread, that case statements are inherently supposed to have priority. They are not supposed to execute all branches in parallel. Don't take my word for it, though; take it from someone who trains Verilog professionally:

http://sutherland-hdl.com/online_verilog_ref_guide/vlog_ref_top.html

"Compares the net, register or literal value to each case and executes the statement or statement group associated with the first matching case."

I have found this to be a point of contention across different tools. If you want to make sure that the branches of a case statement execute in parallel, look into the "unique" key word from SystemVerilog.
--- Quote End ---

You are absolutely correct, that was a bit of a blanket statement on my part. I have never coded using a case statement that could potentially have priority like you mentioned so it slipped my mind.
Altera_Forum
Honored Contributor
13 years ago
--- Quote Start ---
I just want to point out, in contrast to multiple comments in this thread, that case statements are inherently supposed to have priority. They are not supposed to execute all branches in parallel.
--- Quote End ---

Yes, there are incorrect assumptions about the evaluation of case constructs. My point was that there's no room for priority in the present code, so the "if..then..else" chain and the case construct should be expected to end up in the same gate-level netlist anyway.

It should be added that "if then else" and a regular case construct (no parallel case) evaluate the code in the same way.

Altera_Forum
Honored Contributor
13 years ago
I wonder if grouping the select lines helps since it reduces the number of inputs...

(I omitted the specific conditions that chose 7'h0 as an input since that will be taken care of by the default condition)

assign tx_tlp_dword_offset =
		(curstate == HANDLE_BAR1_READ_1_STATE) ? 7'h4 :
		(curstate == HANDLE_BAR1_READ_3_STATE ||
		 curstate == H2D_DMA_INIT_MEMRD_DW1_STATE ||
		 curstate == D2H_DMA_INIT_MEMWR_DW1_STATE) ? 7'h1 :
		(curstate == HANDLE_BAR1_READ_4_STATE ||
		 curstate == H2D_DMA_INIT_MEMRD_DW2_STATE ||
		 curstate == D2H_DMA_INIT_MEMWR_DW2_STATE) ? 7'h2 :
		(curstate == HANDLE_BAR1_READ_5_STATE ||
		 curstate == HANDLE_BAR1_READ_6_STATE) ? 7'h5 :
		(curstate == H2D_DMA_INIT_MEMRD_DW3_STATE ||
		 curstate == H2D_DMA_SEND_MEMRD_TLP_STATE ||
		 curstate == H2D_DMA_SEND_MEMRD_TLP2_STATE) ? 7'h3 :
		(curstate == D2H_DMA_INIT_MEMWR_PL_STATE) ? reg_tx_tlp_dword_offset :
		(curstate == D2H_DMA_SEND_MEMWR_TLP_STATE) ? reg_tx_tlp_dword_offset :
		(curstate == D2H_DMA_SEND_MEMWR_TLP2_STATE) ? reg_tx_tlp_dword_offset :
		7'h0;

Altera_Forum
Honored Contributor
13 years ago
--- Quote Start ---
I wonder if grouping the select lines helps since it reduces the number of inputs...
--- Quote End ---

The expression for each bit of tx_tlp_dword_offset will undergo logic minimization during synthesis, thus I won't expect an effect of reordering or grouping on logic element usage.

Different ways of state encoding matter, in contrast.
Altera_Forum
Honored Contributor
13 years ago
Thanks all for your suggestions. I ended up cutting my clock frequency in half, giving me much more head room. The Cyclone IV is simply too slow (i.e. too much combinational path delays). In case I can't meet my data throughput target (border-line now) then I will have to double the width of the critical data path in my design.
Altera_Forum
Honored Contributor
13 years ago
--- Quote Start ---
Thanks all for your suggestions. I ended up cutting my clock frequency in half, giving me much more head room. The Cyclone IV is simply too slow (i.e. too much combinational path delays). In case I can't meet my data throughput target (border-line now) then I will have to double the width of the critical data path in my design.
--- Quote End ---

The Cyclone IV isn't that slow. I have a design with 200 MHz and 150 MHz (among others) clock frequencies in an EP4CE40F23C7N device. I have similar muxes in the 150 MHz domain (switching constants for multipliers). Proper pipelining is key - divide and conquer!
Altera_Forum
Honored Contributor
13 years ago
Unfortunately, it is too slow for running my design at 125 MHz. Perhaps that is not Cyclone IV specific. Anything can be pipelined, but it will fragment an otherwise straightforward design into a jumble of flip-flops that is impossible to understand. Pipelining is suitable for some designs where it makes architectural sense but, unfortunately, it did not make sense in my design - lowering the clock frequency to 62.5 MHz and possibly widening certain data buses from 8 to 16 (or 32) bits made more sense in my case.

By the way; How does a Cyclone IV compare to, say, an Arria II when it comes to combinational path delays across the device? Is the Arria II 'faster' and, if so, why?
Altera_Forum
Honored Contributor
13 years ago
Arria II is built using the 'High Speed' process, just like the one Stratix used. However, Arria V uses the 'Low Power' process, as Cyclone does. Yes, Arria II will run faster than a Cyclone IV.
