Forum Discussion

Altera_Forum
17 years ago

Implementing registers on a high-speed interface

Hi all

I'm doing a design which includes a highly configurable piece of logic, and I need a register interface to configure it. The link to the CPU is a high speed parallel interface.

Also I don't want register accesses to always just store and regurgitate values - in some cases I need a register write to perform some simple action, like resetting a counter or flushing a buffer.

In previous designs I've used logic along the lines of:

IF mclk'event AND mclk = '1' THEN
    IF (read_condition) THEN
        CASE address IS
            WHEN 0 =>
                result <= device_version;
            WHEN 1 =>
                result <= interesting_parameter;
            WHEN 2 =>
                result <= interrupt_outstanding AND NOT interrupt_acknowledged;
            WHEN 3 =>
                result <= something_else;
            .
            .
        END CASE;
    ELSE
        CASE address IS
            WHEN 0 =>
                NULL; -- version register is read-only
            WHEN 1 =>
                interesting_parameter <= cpu_data;
            WHEN 2 =>
                interrupt_acknowledged <= interrupt_acknowledged OR cpu_data;
            WHEN 3 =>
                something_else <= cpu_data;
            .
            .
        END CASE;
    END IF;
END IF;

This works absolutely fine, but it makes for rather a lot of logic to be evaluated on every clock, which limits the speed at which it can run.

In my current design, however, I need the interface to run fast for other reasons, so I need to find a way to speed up access to the registers.

On writes, I guess I could store address/data pairs in a dcfifo and read / process them more slowly, but this would create a new clock domain which isn't ideal.

Is there a 'standard' way to implement a register interface of this type that doesn't rely on a slow clock? The nature of the CPU interface is such that cycle latency isn't a problem provided the clock speed remains high.

Thanks :)

Andy.

6 Replies

  • Altera_Forum

    Hello,

    generally, an FPGA can present any common peripheral interface to a processor. I've used this with a lot of processors. There are, however, some options that should be considered carefully, because both the interface performance and the timing closure of the design partitions connected to the interface may depend strongly on them.

    A basic condition is the type of processor interface. You just say it's high-speed parallel, which is not very specific. For a discussion of the detailed problems, you should tell us a bit more.

    As you already mentioned, the timing problems are mainly in read accesses, because write accesses can be pipelined without a reply in many cases. I think you should try to use access types that are supported by the processor. So if it doesn't support pipelined access (which would be rather unlikely), you either have to deliver the data as fast as necessary or use wait cycles (if the processor supports these at all).

    On the FPGA side, dual-port RAM would be a fast mechanism for creating an interface, and also very attractive in terms of resource usage. However, it does not fit all needs: as a disadvantage, register contents are only sequentially accessible from the FPGA side. Using different word widths at the two ports may help to optimize data access.
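
    A minimal sketch of the dual-port RAM idea, assuming a single clock and a write-only CPU port. The entity name reg_ram, the port names and the generic defaults are all hypothetical. Whether a tool maps this onto block RAM, and what a read returns when it hits the address being written, depends on the device and the synthesis settings.

    ```vhdl
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity reg_ram is
      generic (
        ADDR_WIDTH : positive := 7;   -- assumed: room for 128 registers
        DATA_WIDTH : positive := 32   -- assumed register width
      );
      port (
        clk       : in  std_logic;
        -- CPU side: write port
        cpu_we    : in  std_logic;
        cpu_addr  : in  unsigned(ADDR_WIDTH - 1 downto 0);
        cpu_data  : in  std_logic_vector(DATA_WIDTH - 1 downto 0);
        -- FPGA side: read port, one word per clock (the "sequential" limit)
        fpga_addr : in  unsigned(ADDR_WIDTH - 1 downto 0);
        fpga_q    : out std_logic_vector(DATA_WIDTH - 1 downto 0)
      );
    end entity reg_ram;

    architecture rtl of reg_ram is
      type ram_t is array (0 to 2**ADDR_WIDTH - 1)
        of std_logic_vector(DATA_WIDTH - 1 downto 0);
      signal ram : ram_t;
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if cpu_we = '1' then
            ram(to_integer(cpu_addr)) <= cpu_data;
          end if;
          -- Registered read: the design logic sees only the word it addresses.
          fpga_q <= ram(to_integer(fpga_addr));
        end if;
      end process;
    end architecture rtl;
    ```

    The design logic can only fetch one word per clock through fpga_addr, which is the restriction mentioned above. Registers that must all be visible at once still have to live in logic.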

    When creating registers in FPGA logic, it must be accepted that a considerable part of the FPGA resources is consumed; I have several thousand registers related to processor interfaces in some designs. If the application demands this highly configurable piece of logic, an FPGA can supply it.

    Regards,

    Frank
  • Altera_Forum

    OK, a little more information, then :)

    The interface is packet based, not just the usual memory-mapped I/O controlled by the CPU. Packets containing a series of instructions (i.e. register reads or writes) come into the FPGA on consecutive clock cycles. The FPGA, in turn, must generate a similarly formatted packet in response, both as an acknowledgement that the initial packet was received, and to return the results of any reads.

    The time from receiving the control packet to delivering the response really doesn't matter, I can have as many pipeline stages as I like. I have no shortage of resources either, and in fact, I do have a FIFO between the packet interface and the register file anyway.

    However, the problem I have is, I think, more general.

    What I'm struggling with, is simply how to pipeline something like the CASE structure above, so that I can run the register interface off the same high speed clock as the packet interface. If I have, say, 100 registers, all of them readable through the same port, then ultimately there has to be a 100:1 multiplexer implemented somehow. The issue is how to break it down across multiple clock cycles, so that the amount of logic that needs to be evaluated per clock is minimised.

    Any tips please?
  • Altera_Forum

    Hello,

    thank you for revealing the interface technique. It should actually be flexible enough for the intended purpose.

    Regarding the question raised, there could be a misunderstanding. I think the logic effort doesn't come from the case structure; it comes from the necessity to select the individual source and target of the data. When you are using, e.g., writable registers, the FPGA compiler will construct parallel select logic and a datapath for each register. This is necessary anyway and doesn't depend on how you code the selection. You can write a case structure or a for loop; it will end up in basically the same (or similar) logic.

    When you process the instructions from the FIFO, that can do the pipelining: one instruction per clock cycle (or more clock cycles, if a more complex action is necessary). But the selection could be in a case structure without disadvantage. I usually have arrays of register data and boolean functions that describe properties such as readability. But this also ends up in the same logic.

    I suggest trying the case structure and monitoring the related resource usage. Internal RAM for the I/O space would reduce the logic consumption but, as I mentioned, restrict accessibility from the internal side. Sometimes it's possible to access the data sequentially in the code; then RAM could be a suitable storage location.
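
    The "array of register data plus boolean property functions" style could look roughly like the sketch below. It is only an illustration: the entity name reg_file, the ports wr_en and rd_en, the generic NUM_REGS and the function is_writable are hypothetical, and the rule that register 0 is read-only is borrowed from the version register in the first post. As said above, it should synthesize to much the same logic as the case structure.

    ```vhdl
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity reg_file is
      generic (
        NUM_REGS : positive := 4   -- hypothetical register count
      );
      port (
        clk      : in  std_logic;
        wr_en    : in  std_logic;
        rd_en    : in  std_logic;
        address  : in  unsigned(7 downto 0);
        cpu_data : in  std_logic_vector(31 downto 0);
        result   : out std_logic_vector(31 downto 0)
      );
    end entity reg_file;

    architecture rtl of reg_file is
      type reg_array_t is array (0 to NUM_REGS - 1)
        of std_logic_vector(31 downto 0);
      signal regs : reg_array_t := (others => (others => '0'));

      -- Property function: register 0 is treated as read-only in this sketch.
      function is_writable (idx : natural) return boolean is
      begin
        return idx /= 0;
      end function;
    begin
      process (clk)
        variable idx : natural;
      begin
        if rising_edge(clk) then
          idx := to_integer(address);
          if idx < NUM_REGS then          -- ignore out-of-range addresses
            if wr_en = '1' and is_writable(idx) then
              regs(idx) <= cpu_data;
            end if;
            if rd_en = '1' then
              result <= regs(idx);
            end if;
          end if;
        end if;
      end process;
    end architecture rtl;
    ```

    Further properties (readable, write-one-to-clear and so on) would be added as more functions of the index, which keeps the address map in one place.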

    Regards,

    Frank
  • Altera_Forum

    Sounds like you need to break up the read multiplexer. This is probably where you are taking the hit in fmax. You could break down a 100:1 mux into four 25:1 muxes and one 4:1 mux.

    Example of taking a 2-bit address and breaking the mux into two 2:1 muxes in the first stage and one 2:1 mux in the second stage:

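    IF clk'event AND clk = '1' THEN
        -- Delay the read strobe and address so they line up with stage 2
        read_condition_d1 <= read_condition;
        address_d1 <= address;

        -- Stage 1: two 2:1 muxes, both selected by address bit 1
        IF (read_condition) THEN
            CASE address(1) IS
                WHEN '0' =>
                    result_a <= device_version;
                    result_b <= interesting_parameter;
                WHEN '1' =>
                    result_a <= interrupt_outstanding AND NOT interrupt_acknowledged;
                    result_b <= something_else;
                WHEN OTHERS =>
                    NULL; -- required if address is a std_logic_vector
            END CASE;
        END IF;

        -- Stage 2: one 2:1 mux, selected by the delayed address bit 0
        IF (read_condition_d1) THEN
            CASE address_d1(0) IS
                WHEN '0' =>
                    result <= result_a;
                WHEN '1' =>
                    result <= result_b;
                WHEN OTHERS =>
                    NULL;
            END CASE;
        END IF;
    END IF;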

    This will deliver the register content a clock cycle later but will increase fmax.

    Just an idea, may not be the best way or even work in your situation.
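
    The 100:1 case could be sketched the same way, with four 25:1 muxes feeding a registered 4:1 mux. Everything named here is an assumption for illustration: the entity read_mux_2stage, the generics, the flattened regs input (register n occupies bits (n+1)*DATA_WIDTH-1 downto n*DATA_WIDTH), and an address that arrives already split into group_sel and reg_sel, so that no divide by 25 is needed.

    ```vhdl
    library ieee;
    use ieee.std_logic_1164.all;

    entity read_mux_2stage is
      generic (
        GROUPS     : positive := 4;   -- width of the second-stage mux
        GROUP_SIZE : positive := 25;  -- width of each first-stage mux
        DATA_WIDTH : positive := 32   -- assumed register width
      );
      port (
        clk       : in  std_logic;
        rd_en     : in  std_logic;
        group_sel : in  natural range 0 to GROUPS - 1;
        reg_sel   : in  natural range 0 to GROUP_SIZE - 1;
        regs      : in  std_logic_vector(GROUPS * GROUP_SIZE * DATA_WIDTH - 1 downto 0);
        rd_valid  : out std_logic;
        result    : out std_logic_vector(DATA_WIDTH - 1 downto 0)
      );
    end entity read_mux_2stage;

    architecture rtl of read_mux_2stage is
      type word_array_t is array (0 to GROUPS - 1)
        of std_logic_vector(DATA_WIDTH - 1 downto 0);
      signal stage1       : word_array_t;
      signal group_sel_d1 : natural range 0 to GROUPS - 1;
      signal rd_en_d1     : std_logic := '0';
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          -- Stage 1: one GROUP_SIZE:1 mux per group, all working in parallel.
          for g in 0 to GROUPS - 1 loop
            for r in 0 to GROUP_SIZE - 1 loop
              if reg_sel = r then
                stage1(g) <= regs((g * GROUP_SIZE + r + 1) * DATA_WIDTH - 1
                                  downto (g * GROUP_SIZE + r) * DATA_WIDTH);
              end if;
            end loop;
          end loop;
          group_sel_d1 <= group_sel;   -- keep the select aligned with stage1
          rd_en_d1     <= rd_en;

          -- Stage 2: GROUPS:1 mux on the registered stage-1 outputs.
          rd_valid <= rd_en_d1;
          if rd_en_d1 = '1' then
            result <= stage1(group_sel_d1);
          end if;
        end if;
      end process;
    end architecture rtl;
    ```

    In this sketch result and rd_valid appear two clocks after rd_en, one clock later than a single registered mux. If 25:1 is still too slow, the first stage can be split again in the same manner.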
  • Altera_Forum

    Hello,

    --- Quote Start ---

    Sounds like you need to break up the read multiplexer

    --- Quote End ---

    would be an option, if fmax is actually an issue. As I understand it, this hasn't been established yet.

    Also, the intended operating frequency hasn't been stated.

    I think that, with a large number of registers, the obvious choice is to operate the registers in the clock domain that accesses the data. This way, the clock speed for the register side of the interface isn't prescribed by the interface speed but by the main design. With complex logic, arithmetic, or whatever, an appropriate main clock speed could be 50 rather than 100 MHz. At this speed, a lot of selection logic and multiplexers can be operated.

    The other option would be to run the registers at the interface speed, but then the data must be double buffered and synchronized to the main clock domain.

    Or use a dual port RAM, but have only limited (sequential) access to the data.

    Regards,

    Frank
  • Altera_Forum

    Thanks guys :)

    gmpstr: That's the kind of thing I was after - I just had a mental block on how to break down the mux. I may well give your suggestion a go, it looks like a good solution.

    :D