Hi freinds, I want to address a on-chip memory in the Cyclone IV FPGA via the HIP PCIe from my computer, so booth the HIP and the Memory are interconnected by the Avalon-MM interface. my question is: how can i address the memory from the PC ?? please if you can help me thanx

If you follow the example design, it’s no big issue. The PC will usually run a sensible operating system which needs a driver for your card. All memory accesses should happen over this driver. And the driver is in charge of doing all the PCI housework like identifying and claiming the device as well as enabling the BAR of the memory in your card. Once the BAR is assigned, the driver is able to access the memory. Linux admins/users can also access the memory BAR from the /sys/bus/pci tree easily without the need for additional drivers. The only problem is: What performance do you expect from your memory accesses from the PC side? Each memory read access will take between 0.5 and 2 us, depending on the CPU/chipset architecture and speed, and at 100% CPU load you typically reach not more than a couple of MBs per second. Write accesses from the CPU to FPGA memory are just a little bit more efficient, maybe a factor of 2 or 3. And write combining as well as read data prefetching might save you another bit of performance, but this only works with so-called prefetchable memory BARs. If your bandwidth requirements are low, then this is no problem, but at higher rates you will need a DMA structure. With this, not the PC CPU is moving data into or out of your FPGA memory but the FPGA itself will shift blocks of data between main PC memory and FPGA memory. This requires a way more sophisticated architecture of both the FPGA design and the driver architecture, and it typically has a higher latency than the PIO approach you envision. The CDMA example design from Altera will help you towards this goal. – Matthias

You may also be able to initiate a dma transfer from the host (PCIe master) side. The DMA controller really needs to be coupled closely enough to the PCIe master interface to allow it to do long PCIe transfers (eg 128 bytes). I don't know which hardware component can do that on a typical x86 pc, nor the os (driver) api calls necessary to setup such transfers. I've only done it on one of the small ppc processor under linux - and I had write my own driver for the dma device. With extreme care you might manage to use cache-line read/write (probably 64 bytes) to improve transfer performance.

There are some PCIe Switches that contain DMA engines, e.g. from plx technologies (http://www.plxtech.com/products/expresslane/switches). The problem with ready-to-use DMA controllers is that they rarely match the driver’s or device’s specific needs for continuous transfers to the last bit, so a result is typically one interrupt per block transfer instead of one (or even zero) per a long sequence of transfers. And it’s another guy to appear on the PCIe party with already too many of them: CPU/driver, northbridge/memory, switches/flow control, device and application interface. The DMA engine placed into your PCIe endpoint design will always be there in the way designed, but the system DMA engine will only show occasional presence. And, of course, there is also the option to do inter I/O transfers (peer-to-peer) with even tighter bounding between the devices. I think i2o (http://en.wikipedia.org/wiki/i2o) was then an attempt – as expensive as unsuccessful – to establishing a standard for the device interface which should have resulted in a more abstract CPU&#8596;Device and Device&#8596;Device communication. FWIW, I think both options are beyond the first steps for casamar to go … – Matthias

Thanks for your insights Matthias. I have a new Stratix IV GX based PXI Express (PXI Express = PCIe + additional PXI backplane clock/timing strobes) board design, which uses the Altera PCIe Hard IP and DDR3 SDRAM controller. My initial test configuration uses an Avalon MM interface from the PCIe IP to the DDR3 controller. I am set up for 512 MBytes at BAR_1_0, 64 bit pre-fetchable. I am using Jungo WinDriver ALTERA_BlockReadWrite() commands to move data between PXI Express board memory and my host PC. Your explanation of read/write performance is consistent with my results with this set-up: 512 MByte block writes take ~10sec, and block reads take over 3 minutes! I can probably live with the block write performance for now, but have to improve the read performance. Are there any changes I can make short of implementing a DMA solution? Is there a way to make the host PC side pipeline multiple read requests, and not wait for the round trip return of each read before proceeding onto the next read request, for example? Any advice would be greatly appreciated. Best Regards, Ron

addressing on-chip memory(cyclone IV) from PC via PCIe

19 Replies

Altera_Forum
Honored Contributor
13 years ago
--- Quote Start ---
I would like to keep this as simple as possible, and avoid using DMA.

--- Quote End ---

Ron,

Sorry about the bad news, but that is the wrong approach. It’s like: »I want to keep all my 8 CPU cores busy, and I set the target of doing it with a single-tasking application.«

As long as the CPU initiates each (byte, word) transfer, you are at the performance you stated. DMA is a beast, but get used to it sooner or later, it will pay off. And it’s not *that* hard to do.

You can approach DMA in multiple levels of efficiency. The first and most important aspect is to let the CPU do something while the hardware does the transfers. Without descriptor-based pipelining, this approach might not bring the absolute highest bus performance (in MB/s) but it will reduce CPU load already significantly and let it do other things while single blocks are transferred. In this solution, the driver would initiate a single DMA block transfer (in either direction, so device→memory or memory→device) and wait for an interrupt, issued by the device, when it’s done (or an error occurs). The larger the memory block is, the less frequent will the data exchange starve from finalizing the one transfer and setting up the next, in addition to lesser CPU context switches and interrupt calls.

Say, you have a 32 MiB large window of kernel memory, cut in two halves, alternatingly filled with data by the device and then used by the application. Filling up 16 MiB of data at best takes the device about 70 to 100 ms at gen1x1, so you will get max. 15 interrupts per second indicating the buffer full condition, which is already low. But this doesn’t work as efficiently when you want to transfer only, say, 4 KiB per turnaround. And with higher lane width and PCIe speed, the device will already be done in maybe 1 or 2 ms. You could raise the memory size to work around this kind of speed issue then.

So, doing such a simple one-shot DMA is quite easy. The DMA write accesses are easier to do than DMA reads as you don’t have to bother with tags, completion credits and any kind of completion ordering. First, check your max_payload_size setting and adhere to that for your transfers. Mind all the other PCIe write request rules like 4k boundaries, correct byte enables for short requests, too, and do the right byte shifting if the buffer is not aligned (or assure it is, in your driver). Then do your transfers and issue an interrupt when done. The read requests are more tricky and have another level of performance-vs.-implementation-complexity tradeoff: You either do interleaved reads, or you don’t.

Going without interleaving reads makes the code rather easy: Issue a read request with at most max_read_request_size bytes (and, of course, properly aligned to 4k boundaries, mind max. completion credits in your receive hard IP), wait for all completion to arrive (or expect a timeout properly!), and once the request is complete, issue a request for the next part right away. In this mode you don’t have to bother about the tag as you only use a single tag at all. Once the whole block is done, just issue the next interrupt indicating »DMA ready«.

For interleaved reads, you have to keep track of different read requests as the completions are allowed to overtake each other (note: parts of a read request always come in-order, but completions to different read requests might use the fast lane of some northbridge-internal data highway). And tracking the completion credit, tag IDs and timeouts might also look more complicated than for a strictly ordered implementation.

Only the next step would be a descriptor-based DMA solution. It is targeted towards the next level of performance: The CPU assigns (ordered) jobs in one or more system-RAM-based tables, typically separate for read and write DMA, and the DMA controller autonomously does the transfers while the CPU can assign new tasks or work on already finished tasks. Interrupts are then only exchanged when a queue cannot be serviced from one side anymore because of a full or empty condition in a table. But not only the DMA controller changes but also the driver architecture. Look up terms like »bottom half« to get an impression how this is supported in the operating systems.

--- Quote Start ---
I suspect that the discrepancy between read and write performance has to do with the controller not pipelining read requests, and that it is waiting for a round-trip read completion before moving on to the next read request. I'm hoping there is some way to pipeline multiple read requests to bring up block read efficiency.
--- Quote End ---

No, there is no way, for good reason. The CPU has to adhere to the (strict) PCI ordering rules, so when a machine instruction indicates a move from the device memory BAR, this CPU has to wait on the result, no matter what. The CPU cannot know that you are not actually needing the data right away and there wouldn’t be any side effects from doing other work while the returned data is still in flight. This can only be worked around by a DMA engine, device-local or system-global.

– Matthias
Altera_Forum
Honored Contributor
13 years ago
--- Quote Start ---

The CPU has to adhere to the (strict) PCI ordering rules, so when a machine instruction indicates a move from the device memory BAR, this CPU has to wait on the result, no matter what. The CPU cannot know that you are not actually needing the data right away and there wouldn’t be any side effects from doing other work while the returned data is still in flight. This can only be worked around by a DMA engine, device-local or system-global.

– Matthias
--- Quote End ---

That is a non-sequitor. A modern cpu will execute other instructions following a PCIe memory read provided they aren't dependant on the value being read. Memory reads can be re-ordered, so other locations can be read. Any PCIe transfers must be sequenced, but that doesn't affect other operations.

The 'problem' is that PCIe reads to the Altera fpga are very slow (I don't know if this is typical of PCIe slaves - I've not timed any others), so the cpu quickly runs out of instructions it can execute before the PCIe read completes.

A single PCIe transfer (which is 2 (maybe more) hdlc frames) can usually contain upto 128 bytes. The transfer time is largely independant of the transfer length, and IIRC is of the order of 1-2us. This may be shorter than the interrupt latency and process reschedule.

So although it may be necessary to use a DMA controller to generate the long transfer, it can make sense to synchronously wait for completion by polling the 'dma done' bit. Splitting the 'setup' from the 'wait for complete' will allow overlapping within the driver (eg processing the previous block, or getting the next block ready) without the overhead and complexity of a fully asynchronous DMA.
Altera_Forum
Honored Contributor
13 years ago
Thanks for the feedback Matthias and dsl.

I guess I'll have to implement a DMA solution.

Best Regards,
Ron
Altera_Forum
Honored Contributor
13 years ago
Hi Ron,

--- Quote Start ---

I guess I'll have to implement a DMA solution.

--- Quote End ---
Don't worry, you're not alone. I'm in the process of evaluating each of the PCIe cores. The Qsys PCIe core is not particularly useful. The PCIe-to-Avalon-MM bridge should really have inbound translation window remapping, and should have a built in DMA controller. It also has some timing issues:

http://www.alteraforum.com/forum/showthread.php?t=35678

Next I'll be looking at the MegaWizard design flow.

matthias, do you know of an existing DMA controller that supports scatter-gather lists on either the Avalon-MM side or PCIe side, with scatter-gather entries described by 64-bit PCIe addresses, Avalon-MM addresses, etc?

If I can't find what I want, I'll just write one (a statement that is much easier said than done).

Cheers,
Dave
Altera_Forum
Honored Contributor
13 years ago
Hi Dave,

Sorry I have no practical experience with ready-made DMA engines. The one I’m using is self-made. Recently I got aware of northwest logic (http://nwlogic.com/)’s dma backend core (http://nwlogic.com/docs/dma_back-end_core.pdf) but I intend to stay with my current solution.

Writing DMA engines is not *that* hard, at least not for DMA receive (device→main memory) engines. There are only occasional descriptor table reads which can happen with single non-interleaved DMA reads and an easy timeout logic, but all other accesses are uncomplicated DMA writes. Only the other direction, DMA transmit (main memory→device) requires significant thinking about tags, completion reordering and request-selective handling of the mysterious completion timeout mechanism. On the driver side, there is not much difference between the two data transfer directions.

The advantage of writing your own DMA engine is that you can make it fit your application and driver operation best. For example, one design might supply the data in a FIFO while another supplies it in a memory-like structure, and there are quite some different needs for the descriptor structure and the data blocks in terms of size and regarding stream/block orientation. And, by designing your own core, you learn a lot about PCI and PCIe.

The downside: If you can’t wrap your head around the transaction ordering stuff in the FPGA or in the driver or in between (interrupts, transaction table locking and updates), you will have a hard time debugging all the issues out of your design. Some mechanisms require early correct integration, and if not done so, you might have to rewrite your design completely once you understood the issue. Finally, I’d recommend to make the DMA engine design together with the driver development by one person – there is so much that looks logical in a DMA engine design which cannot be supported efficiently by the driver, and vice versa, that there should be a short iteration loop between those two tasks to implement.

– Matthias
Altera_Forum
Honored Contributor
13 years ago
One more note.

--- Quote Start ---
That is a non-sequitor. A modern cpu will execute other instructions following a PCIe memory read provided they aren't dependant on the value being read. Memory reads can be re-ordered, so other locations can be read. Any PCIe transfers must be sequenced, but that doesn't affect other operations.
--- Quote End ---

The CPU doesn’t know which memory locations are dependent on the PCI memory read data, i.e. they were updated by the device per DMA write just before the MMIO read was issued by the driver. For such dependent data, the old memory content shouldn’t be used. To avoid this race to bite you, you should have your best knowledge of memory barrier enforcement at hand when writing the drivers. Or you run a CPU architecture that simply does not do any related reordering (http://en.wikipedia.org/wiki/memory_ordering). For example, according to the table shown in the link, AMD64 does just »Store reordered after Loads« leading to a very strict and consistent timing compared to the assembly instructions, so it limits efficiency in return to more stability of drivers written by developers not aware of memory barriers. Nevertheless, I wouldn’t suggest to rely on the CPU architecture to write bad drivers. Always place proper explicit memory barriers into the code. Linux provides macros that typically resolve to nothing (or at most a volatile memory declaration avoiding compile-time reordering) if the code is compiled on one of the stricter architectures. Always suppose your driver might once be compiled for the Alpha. Remember: »and then there’s the alpha« (http://lxr.linux.no/#linux+v3.3.4/documentation/memory-barriers.txt).

But as soon as a memory barrier (http://en.wikipedia.org/wiki/memory_barrier) is in place – either explicitly given as a compiler/assembler directive or implicitly by the CPU architecture – any PCI access will actually stall the CPU. Even if it can process some five or ten instructions more that really just handle independent register content, this accounts as background noise facing multi-gigahertz CPUs and a PCI read latency of 0.5 to 2 μs.

Let me quote a document of the pci-sig (http://www.pcisig.com/developers/main/training_materials/get_document?doc_id=00941b570381863f8cc97850d46c0597e919a34b) on that, see page 10. It is from 2005 and addresses mostly PCI and PCI-X, but the reasons count even more with modern, highly switched PCIe system architectures.

--- Quote Start ---
• Why are MMIO Loads so bad
– Processor stalls or has to do a context switch waiting for the MMIO Load Reply Data
– MMIO Load Reply Data takes a long time due to PCI ordering rules
– MMIO Load Request have to push MMIO store data
– MMIO Load Reply Data have to push DMA store data

--- Quote End ---
So this means: With more MMIO Loads the CPU can do less. But also: With more DMA store data and MMIO Writes flying around, MMIO Loads take longer to finish, as the PCI ordering rules force some operations to be pushed ahead of the request or even the response.

BTW, pages 25 and 26 show the reasons for having these ordering rules. And the affect reads and writes by the device, i.e. DMA accesses, too, but usually a DMA controller can handle reads and writes in parallel, so they don’t suffer that much from this pushing behavior. But it’s always better to set no-snoop and relaxed ordering for any request suited for that.

Finally, PCI Express 3.0 adds some options to loosen these ordering requirements a little bit to allow higher efficient data exchange with less transaction coupling between independent DMA channels. They call it “loose transaction ordering” in their faq (http://www.pcisig.com/news_room/faqs/pcie3.0_faq/).
Altera_Forum
Honored Contributor
13 years ago
Hi Matthias,

--- Quote Start ---

Sorry I have no practical experience with ready-made DMA engines. The one I’m using is self-made. Recently I got aware of northwest logic (http://nwlogic.com/)’s dma backend core (http://nwlogic.com/docs/dma_back-end_core.pdf) but I intend to stay with my current solution.

--- Quote End ---
Thanks. You strike me as knowing what you are talking about with regards to PCIe, so I'll take this as encouragement to write my own as well.

How general-purpose is the one you have developed? Is it something you want to share? If I write my own, I'll use the PLX PCI bridges, and the PowerQUICC III bridges as reference designs and implement something analogous to their interfaces. I'll then create a tutorial and post the code to the wiki.

--- Quote Start ---

Writing DMA engines is not *that* hard, at least not for DMA receive (device→main memory) engines. There are only occasional descriptor table reads which can happen with single non-interleaved DMA reads and an easy timeout logic, but all other accesses are uncomplicated DMA writes. Only the other direction, DMA transmit (main memory→device) requires significant thinking about tags, completion reordering and request-selective handling of the mysterious completion timeout mechanism. On the driver side, there is not much difference between the two data transfer directions.

The advantage of writing your own DMA engine is that you can make it fit your application and driver operation best. For example, one design might supply the data in a FIFO while another supplies it in a memory-like structure, and there are quite some different needs for the descriptor structure and the data blocks in terms of size and regarding stream/block orientation. And, by designing your own core, you learn a lot about PCI and PCIe.

The downside: If you can’t wrap your head around the transaction ordering stuff in the FPGA or in the driver or in between (interrupts, transaction table locking and updates), you will have a hard time debugging all the issues out of your design. Some mechanisms require early correct integration, and if not done so, you might have to rewrite your design completely once you understood the issue. Finally, I’d recommend to make the DMA engine design together with the driver development by one person – there is so much that looks logical in a DMA engine design which cannot be supported efficiently by the driver, and vice versa, that there should be a short iteration loop between those two tasks to implement.

--- Quote End ---
I took the advice you gave me earlier in the year, and have a copy of the PCIe specification. I think I can wrap my head around it ok.

One more question; the Altera PCIe BFM is essentially read-only - they say this much in their PCIe webinar - do you have any recommendations for a PCIe BFM? A commercial BFM is fine. I'd just like to see what is out there, and whether its worth my time making the Altera BFM friendlier, or using an external vendor's core.

Thanks!

Cheers,
Dave
Altera_Forum
Honored Contributor
13 years ago
Hi Dave,

I’m sorry I cannot share my core with you, I have done it as part of my business work. And, BTW, you might not like our coding style which is based on Jiri Gaisler’s structured design method (http://www.gaisler.com/doc/structdes.pdf), and goes even further (we use bit, boolean and (ranged) integers instead of std_logic/std_logic_vector).

And I have to admit that I don’t have any experience with BFMs. I relied on my experience in reviewing VHDL code and keeping the code at least without a chance to hangup the system. Debugging was either a collection of a lot of printk()’s in the Linux driver, together with a couple of debug registers readably by MMIO accesses, or to output some simple internal signals to a couple of LEDs. And, yes, the coding style helped a lot.

– Matthias
Altera_Forum
Honored Contributor
13 years ago
Hi Matthias,

--- Quote Start ---

I’m sorry I cannot share my core with you, I have done it as part of my business work.

--- Quote End ---
No problem, that is completely understandable.

--- Quote Start ---

And, BTW, you might not like our coding style which is based on Jiri Gaisler’s structured design method (http://www.gaisler.com/doc/structdes.pdf), and goes even further (we use bit, boolean and (ranged) integers instead of std_logic/std_logic_vector).

--- Quote End ---
I take a fairly neutral approach to coding style. I call it the "when in Rome" style, i.e., I code in the style of the rest of the design, eg., for Linux drivers I use Linux style, for uC/OS-II I use Labrosse's style, and if I was working with you, I'd use the Leon/Gaisler style. Its much easier to review code written using a uniform coding style.

--- Quote Start ---

And I have to admit that I don’t have any experience with BFMs. I relied on my experience in reviewing VHDL code and keeping the code at least without a chance to hangup the system. Debugging was either a collection of a lot of printk()’s in the Linux driver, together with a couple of debug registers readably by MMIO accesses, or to output some simple internal signals to a couple of LEDs. And, yes, the coding style helped a lot.

--- Quote End ---
Ok, when I get something working, I'll post it. You might decide you like BFMs. Once I have a working design, I'll code a Gaisler style version and see if I like it too :)

Cheers,
Dave

Forum Discussion

addressing on-chip memory(cyclone IV) from PC via PCIe

19 Replies

Recent Discussions

Agilex 7 slew rate reconfiguration

Agilex-7 AXI MCDMA for PCIe hang

Constraints not being picked for DCFIFO

Can't generate F-Tile Ethernet Hard IP Design Example

MAX10 TSE reference design