Forum Discussion

Altera_Forum
Honored Contributor
13 years ago

Cache controller and read-during-write behavior

Hello all,

I'm in the process of writing a simple cache controller with write-back behavior: when the processor writes to the cache, the controller first snoops the appropriate tag and flags. If the tag and valid flag indicate a hit, the write data goes straight into the cache; if the tag indicates a miss and the line is dirty, the controller first flushes the cache line to memory and then loads the requested line so the write can proceed as a hit.

This all sounds good, but I've run into a problem I can't seem to shake. As described above, a write first needs to check the cache line's tag/flags. Since the block RAMs on all devices are synchronous, there is a one-cycle delay before the tag and flags come back, which implies the controller will always be at best 50% efficient (one cycle to snoop the tag/flags, then one to write the data).

My first solution was to read the tag/flags and write the data on the same clock cycle, and then, on the next cycle, if the tag/flags indicated a miss, rewind the operation using some skid buffers before flushing/loading the line as usual. This, however, requires the block RAM to return the old data on read-during-write operations, and I can't find any dual-port RAM option on my target device (Stratix V) that supports that setting - it always returns the new data. Trying to force old-data read-during-write behavior via inference just makes the tools implement the RAM in logic, which is unacceptable. Does anyone have any suggestions? Thanks.
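For what it's worth, a rough sketch of the "write now, rewind on a miss" idea might look like the following. All signal names here are made up for illustration (this is not code from a real design), and it assumes the data RAM can return the old contents on a read-during-write, which is exactly the behavior in question:

```vhdl
-- Hypothetical skid-buffer sketch: the write is committed to the data RAM in
-- the same cycle as the tag lookup, and a skid register keeps enough state to
-- undo the write one cycle later if the tag check reports a miss.
process(clk)
begin
    if rising_edge(clk) then
        -- cycle N: speculative write and tag read issued together
        skid_addr  <= cpu_addr;
        skid_data  <= ram_q;          -- old contents, needs OLD_DATA behavior
        skid_valid <= cpu_we;

        -- cycle N+1: tag/flags are now visible
        if skid_valid = '1' and tag_hit = '0' then
            -- miss: restore the clobbered word, then flush/fill as usual
            restore_we   <= '1';
            restore_addr <= skid_addr;
            restore_data <= skid_data;
        else
            restore_we <= '0';
        end if;
    end if;
end process;
```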

5 Replies

  • Altera_Forum
    Honored Contributor

    You are probably getting a random mix of old and new values for each bit. The timings might be such that the new value always wins!

    I guess reads aren't a problem - you can just discard the data.

    For writes you may have to add a 'store buffer' so they can be processed asynchronously (and pipelined).
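A one-entry store buffer along those lines might look like this; the names are made up, and a real design would probably want a small FIFO plus load forwarding out of the buffer:

```vhdl
-- Minimal one-entry store buffer sketch: the CPU's write is accepted
-- immediately and drained into the cache RAM once the tag check allows it.
-- Purely illustrative signal names.
process(clk)
begin
    if rising_edge(clk) then
        ram_we <= '0';
        if cpu_we = '1' and buf_full = '0' then      -- accept the store
            buf_addr <= cpu_addr;
            buf_data <= cpu_data;
            buf_full <= '1';
        elsif buf_full = '1' and tag_hit = '1' then  -- drain on a hit
            ram_we   <= '1';
            buf_full <= '0';
        end if;
    end if;
end process;
```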
  • Altera_Forum
    Honored Contributor

    From what I can see, the only way to get the old memory value on a read-during-write in Stratix V M20K blocks is to use simple dual-port mode (one read address, one write address, one clock). My implementation would have been considerably easier if I were allowed a true dual-port, dual-clock RAM. Instead I'll have to mux between the processor and lower memory when accessing the RAM, which should be fun for timing analysis. I still welcome any more elegant solutions.
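For reference, that mode can be requested explicitly through an altsyncram instantiation rather than relying on inference. The generic and port names below are from the altera_mf library, but treat the widths and ram_block_type as placeholder values and double-check against your Quartus version - Quartus should error out if the chosen family/mode combination doesn't actually support OLD_DATA:

```vhdl
library altera_mf;
use altera_mf.altera_mf_components.all;

-- Simple dual-port M20K with old-data read-during-write across ports.
ram_inst : altsyncram
    generic map (
        operation_mode                     => "DUAL_PORT",
        width_a                            => 32,   -- placeholder widths
        widthad_a                          => 9,
        width_b                            => 32,
        widthad_b                          => 9,
        read_during_write_mode_mixed_ports => "OLD_DATA",
        ram_block_type                     => "M20K"
    )
    port map (
        clock0    => clk,
        wren_a    => we,
        address_a => wr_addr,
        data_a    => wr_data,
        address_b => rd_addr,
        q_b       => rd_data
    );
```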

  • Altera_Forum
    Honored Contributor

    I've just been looking at some signaltap traces of bus cycles for M9K on ArriaII.

    With 'OLD_DATA' enabled (and a single clock), a read on port s1 during a write on s1 returns the old data, but a read of the same address on s2 returns the new data.

    So to read the old data during a write you'd need to put the write address onto both the s1 and s2 address inputs.

    This might be what you've already discovered.
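Assuming the OLD_DATA same-port behavior described above, the trick amounts to muxing the write address onto the second port during writes (illustrative names only):

```vhdl
-- During a write, point both ports at the write address; with OLD_DATA the
-- s1-side q output then carries the overwritten contents one cycle later,
-- while s2 serves normal reads the rest of the time.
addr_s1 <= wr_addr;
addr_s2 <= wr_addr when we = '1' else rd_addr;
```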
  • Altera_Forum
    Honored Contributor

    That is what I expected, as the following code indicates:

    wr_addr_int <= to_integer(unsigned(wr_addr));
    rd_addr_int <= to_integer(unsigned(rd_addr));

    process(clk)
    begin
        if rising_edge(clk) then
            if we = '1' then
                for i in 0 to DATA_BYTE_WIDTH-1 loop
                    ram(wr_addr_int)(i) <= data(8*(i+1)-1 downto 8*i);
                end loop;
            end if;
            -- signal assignment semantics: q_int picks up the old contents
            -- on a same-address read-during-write
            q_int <= ram(rd_addr_int);
        end if;
    end process;

    UNPACK : for i in 0 to DATA_BYTE_WIDTH-1 generate
        q(8*(i+1)-1 downto 8*i) <= q_int(i);
    end generate;

    Thank you for verifying that for me; I'm just about ready to start my verification efforts.
  • Altera_Forum
    Honored Contributor

    Something else I've discovered.

    There is a one-clock stall when a read from tightly coupled data memory immediately follows a write to the same memory block.

    Basically the write can only be committed once it is known to be required - and that decision takes a clock.

    The read, by contrast, is done unconditionally, i.e. regardless of the opcode byte or of which memory block the high-order address bits actually reference.

    The same delay may affect data cache operations.