Forum Discussion

Altera_Forum's avatar
Altera_Forum
Icon for Honored Contributor rankHonored Contributor
14 years ago

Design is too slow

I'm trying to build a frame alignment component for an OTN (Optical Transport Network) with Quartus II on a Cyclone II FPGA and my problem is that I cannot reach the desired frequency.

Obviously I have to do some pipelining but I do not know what and where to pipeline. The code follows in VHDL. As you will see the code consists of two processes, a clocked process and a combinatorical one. Any kind of help would be mostly appreciated.


library ieee;use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;
entity framer is
generic (DATA_WIDTH :integer := 256);
port (clk		: in std_logic;
		reset		: in std_logic;
		input		: in std_logic_vector(255 downto 0);
		output	: out std_logic_vector(255 downto 0));
end framer;
architecture mixed of framer is
type state_type is (Frame_Search, Prove_In_Sync, In_Frame_Sync, Prove_Out_Sync);
signal next_state, current_state : state_type;
--signal cyclecounter : std_logic_vector(8 downto 0);
signal cyclecounter : integer range 510 downto 0;
signal cycle_inc : std_logic;
signal error_inc : std_logic;
--signal errorcount : std_logic_vector(3 downto 0);
signal errorcount : integer range 4 downto 0;
signal w_en : std_logic;
signal FAOOF : std_logic;
signal index, prev_index : natural;
signal input_reg1,input_reg2 : std_logic_vector(DATA_WIDTH-1 downto 0);
signal output_reg : std_logic_vector(DATA_WIDTH-1 downto 0);
signal joint : std_logic_vector(2*DATA_WIDTH-1 downto 0);
signal sync_regB : std_logic_vector(31 downto 0);
signal sync_regS : std_logic_vector(23 downto 0);
constant Fr_find : std_logic_vector(31 downto 0) := X"F6F6_2828";
constant Fr_lose : std_logic_vector(23 downto 0) := X"F62828";
begin
state_update : process (clk,reset)
begin
	if reset = '1' then
		current_state <= Frame_Search after 1 ns; --default state upon reset
		cyclecounter <= 0 after 1 ns;
		errorcount <= 0 after 1 ns;
	elsif rising_edge(clk) then
		current_state <= next_state; --update state
		input_reg1 <= input after 1 ns;
		input_reg2 <= input_reg1 after 1 ns;
		prev_index <= index;
		if w_en = '1' then
			for i in joint'left-8 downto DATA_WIDTH loop
				if i = index then
					output_reg <= joint(i + 8 downto i + 8 - DATA_WIDTH + 1) after 1 ns;
				end if;
			end loop;
		end if;
		if FAOOF = '0' then
			output <= output_reg;
		else
			output <= (others=>'0');
		end if;
		if cycle_inc = '1' then
			cyclecounter <= cyclecounter + 1 after 1 ns;
		else
			cyclecounter <= 0 after 1 ns;
		end if;
		if cyclecounter = 509 then
			if error_inc = '1' then
				errorcount <= errorcount + 1 after 1 ns;
			else
				errorcount <= 0 after 1 ns;
			end if;
		end if;
	end if;
end process;
joint <= input_reg2 & input_reg1;
state_machine : process (current_state, joint, output_reg, cycle_inc, error_inc, index, prev_index, cyclecounter, errorcount)
begin
	case current_state is
		
		when Frame_Search =>
			FAOOF <= '1';
			index <= 0;
			cycle_inc <= '0';--counter reset
			error_inc <= '0';--counter reset
			w_en <= '0';
			next_state <= current_state;
			for j in joint'left-8 downto DATA_WIDTH loop
				if joint(j downto j - (Fr_find'length-1)) = Fr_find then
					w_en <= '1';
					index <= j;
					next_state <= Prove_In_Sync;
				end if;
			end loop;
		
		when Prove_In_Sync =>
			w_en <= '1';
			FAOOF <= '1';
			index <= prev_index;
			cycle_inc <= '1';
			error_inc <= '0';
			next_state <= current_state;
			if cyclecounter = 510 then
				cycle_inc <= '0';--counter reset
				--for k in joint'left-8 downto DATA_WIDTH loop
					--if k = index then
						--if joint(k downto k - (Fr_find'length-1)) = Fr_find then
						if output_reg(output_reg'left - 8 downto output_reg'left - 8 - (Fr_find'length-1)) = Fr_find then
							next_state <= In_Frame_Sync;
							FAOOF <= '0';
							--cycle_inc <= '0';--counter reset
							--error_inc <= '0';--counter reset
						else
							next_state <= Frame_Search;
						end if;
					--end if;
				--end loop;
			else
				cycle_inc <= '1';
			end if;
		
		when In_Frame_Sync =>
		w_en <= '1';
		FAOOF <= '0';
		index <= prev_index;
		cycle_inc <= '1';
		error_inc <= '0';
		next_state <= current_state;
		if cyclecounter = 509 then
			cycle_inc <= '0';--counter reset
			--for l in joint'left-8 downto DATA_WIDTH loop
				--if l = index then
					--if joint(l - 8 downto (l - 8) - (Fr_lose'length - 1)) = Fr_lose then
					if output_reg(output_reg'left - 16 downto output_reg'left - 16 - (Fr_lose'length-1)) = Fr_lose then
						next_state <= current_state;
					else
						error_inc <= '1';
						next_state <= Prove_Out_Sync;
					end if;
				--end if;
			--end loop;
		else
			cycle_inc <= '1';
		end if;	
		
		when Prove_Out_Sync=>
		w_en <= '1';
		FAOOF <= '0';
		index <= prev_index;
		cycle_inc <= '1';
		error_inc <= '1';
		next_state <= current_state;
		if cyclecounter = 509 then
			cycle_inc <= '0';--counter reset
			--for m in joint'left-8 downto DATA_WIDTH loop
				--if m = index then
					--if joint(m - 8 downto (m - 8) - (Fr_lose'length - 1)) = Fr_lose then
					if output_reg(output_reg'left - 16 downto output_reg'left- 16 - (Fr_lose'length-1)) = Fr_lose then
						next_state <= In_Frame_Sync;
						--cycle_inc <= '0';--counter reset
						error_inc <= '0';--counter reset
					else
						error_inc <= '1';
						if errorcount = 4 then
							error_inc <= '0';--counter reset
							next_state <= Frame_Search;
						else
							next_state <= current_state;
						end if;
					end if;
				--end if;
			--end loop;
		else
			cycle_inc <= '1';
		end if;
	end case;
end process;
end mixed;

6 Replies

  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    Your second process is heavily combinatorial and is unlikely to pass timing. I personally use one process state machine always. That way I declare one state signal of type say s0,s1,s2,s3,s4, ...etc. and assign next state as follows: if...state <= s0, if... state <= s1 and so on. The change of state is always on the clock edge so are any conditions and assignments. To keep functionality you will need to redesign for latency effect.

  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    --- Quote Start ---

    Your second process is heavily combinatorial and is unlikely to pass timing. I personally use one process state machine always. That way I declare one state signal of type say s0,s1,s2,s3,s4, ...etc. and assign next state as follows: if...state <= s0, if... state <= s1 and so on. The change of state is always on the clock edge so are any conditions and assignments. To keep functionality you will need to redesign for latency effect.

    --- Quote End ---

    Using Timing Analysis I found that the critical path is inside the clock process where i store the result in the output register.

    
            if w_en = '1' then            
                for i in joint'left-8 downto DATA_WIDTH loop
                    if i = index then
                        output_reg <= joint(i + 8 downto i + 8 - DATA_WIDTH + 1) after 1 ns;
                    end if;
                end loop;
            end if;
    

    This design requires to have 2 processes at least to work properly and comply with the ITU-T G.709, G798 OTN specifications.
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    Does the spec actually say in the specification "You MUST use two processes" or is it just that you cant get it to work with a single process state machine? I would be very surprised if it was the former.

    The problem comes, I assume, because of the massive long combinatorial path of index, that forms part of the write-enable input on the output_reg register. index is muxed based on a comparator and then muxed again and anded with w_en.

    What you havent said is what the target FMax is and what the actual FMax is. You may have to just accept that your design isnt good enough or re-design it with more pipelining.
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    kaz:

    --- Quote Start ---

    Your second process is heavily combinatorial and is unlikely to pass timing. I personally use one process state machine always. That way I declare one state signal of type say s0,s1,s2,s3,s4, ...etc. and assign next state as follows: if...state <= s0, if... state <= s1 and so on. The change of state is always on the clock edge so are any conditions and assignments. To keep functionality you will need to redesign for latency effect.

    --- Quote End ---

    tricky:

    --- Quote Start ---

    Does the spec actually say in the specification "You MUST use two processes" or is it just that you cant get it to work with a single process state machine? I would be very surprised if it was the former.

    --- Quote End ---

    This shows a bias against two-process state-machines, claiming one-process states-machines are better which is without any proof or justification. A two-process state-machine does perform equally as a one-process state-machine. A two-process state machine allows you to generate additional combinatorial outputs, without having to resort to a additional combinatorial process to decode these outputs, or having to "redesign for latency effect".

    The correct advice is to look at the critical paths in TimeQuest, as hargeo effectively did, and rework the logic.
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    The idea of one or more process per se is irrelevant. You may use as many as you want. The issue is avoiding long comb paths.

    The timing does not refer to process boundaries but to paths.

    In your case I realised "joint" is assigned to output_register from input through reg1 and reg2 directly and in this case yes you are right but it could be the long comb paths are contributing to routing difficulty.

    edit:

    note that output_register is assigned conditional on wr_en and FA00F which both come from depth of three levels of if statements in the second process. That is causing your violation I believe.
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    By making a few changes like using std_logic_vectors with less bits where possible, prefering ranges that are multiple of 2 but still using 2 processes, I managed to increase the Fmax dramatically.

    
                   if w_en = '1' then
    			for step in 1 downto 0 loop
    				for i in DATA_WIDTH/2-1 downto 0 loop
    					if i = prev_index(6 downto 0) then
    						temp_reg1 <= joint(i + DATA_WIDTH downto i + 1) after 1 ns;
    						temp_reg2 <= joint(i + DATA_WIDTH*3/2 downto i + DATA_WIDTH/2 + 1) after 1 ns;
    					end if;
    				end loop;
    			end loop;
    			if prev_index(7) = '0' then
    				output_reg <= temp_reg1 after 1 ns;
    			else
    				output_reg <= temp_reg2 after 1 ns;
    			end if;
    		end if;
    

    I hope that this pipelining is appropriate. Could anyone indicate a correction or improvement?