Forum Discussion
19 Replies
- Altera_Forum
Honored Contributor
If you follow the example design, it’s no big issue. The PC will usually run a sensible operating system that needs a driver for your card, and all memory accesses should go through this driver. The driver is in charge of all the PCI housekeeping: identifying and claiming the device, and enabling the BAR for the memory on your card. Once the BAR is assigned, the driver can access the memory. On Linux, admins/users can also access the memory BAR easily through the /sys/bus/pci tree, without any additional driver.
The only question is: what performance do you expect from your memory accesses on the PC side? Each memory read access takes between 0.5 and 2 us, depending on the CPU/chipset architecture and speed, so at 100% CPU load you typically reach no more than a couple of MB per second. Write accesses from the CPU to FPGA memory are a little more efficient, maybe by a factor of 2 or 3, and write combining as well as read-data prefetching might buy you a bit more, but that only works with so-called prefetchable memory BARs. If your bandwidth requirements are low, this is no problem, but at higher rates you will need a DMA structure. With DMA, it is not the PC’s CPU that moves data into or out of your FPGA memory; the FPGA itself shifts blocks of data between PC main memory and FPGA memory. This requires a far more sophisticated architecture on both the FPGA and driver side, and it typically has higher latency than the PIO approach you envision. The CDMA example design from Altera will help you towards this goal. – Matthias - Altera_Forum
Honored Contributor
Thanks a lot
- Altera_Forum
Honored Contributor
You may also be able to initiate a DMA transfer from the host (PCIe master) side. The DMA controller really needs to be coupled closely enough to the PCIe master interface to allow it to do long PCIe transfers (e.g. 128 bytes).
I don't know which hardware component can do that on a typical x86 PC, nor the OS (driver) API calls necessary to set up such transfers. I've only done it on one of the small PowerPC processors under Linux – and I had to write my own driver for the DMA device. With extreme care you might manage to use cache-line reads/writes (probably 64 bytes) to improve transfer performance. - Altera_Forum
Honored Contributor
There are some PCIe switches that contain DMA engines, e.g. from PLX Technology (http://www.plxtech.com/products/expresslane/switches). The problem with ready-to-use DMA controllers is that they rarely match the driver’s or device’s specific needs for continuous transfers down to the last bit, so the result is typically one interrupt per block transfer instead of one (or even zero) per long sequence of transfers. And it’s yet another guest at a PCIe party that already has too many of them: CPU/driver, northbridge/memory, switches/flow control, device and application interface. A DMA engine placed into your PCIe endpoint design will always behave the way you designed it, but a system DMA engine will only show occasional presence.
And, of course, there is also the option of inter-I/O (peer-to-peer) transfers, with even tighter coupling between the devices. I think I2O (http://en.wikipedia.org/wiki/i2o) was an attempt – as expensive as it was unsuccessful – to establish a standard for the device interface, which should have resulted in more abstract CPU↔Device and Device↔Device communication. FWIW, I think both options are beyond the first steps for casamar to go … – Matthias - Altera_Forum
Honored Contributor
Thanks for your insights Matthias.
I have a new Stratix IV GX based PXI Express (PXI Express = PCIe + additional PXI backplane clock/timing strobes) board design, which uses the Altera PCIe Hard IP and DDR3 SDRAM controller. My initial test configuration uses an Avalon-MM interface from the PCIe IP to the DDR3 controller. I am set up for 512 MBytes at BAR_1_0, 64-bit prefetchable. I am using Jungo WinDriver ALTERA_BlockReadWrite() calls to move data between PXI Express board memory and my host PC. Your explanation of read/write performance is consistent with my results with this setup: 512 MByte block writes take ~10 s, and block reads take over 3 minutes! I can probably live with the block write performance for now, but I have to improve the read performance. Are there any changes I can make short of implementing a DMA solution? Is there a way to make the host PC side pipeline multiple read requests, rather than wait for the round-trip return of each read before proceeding to the next read request, for example? Any advice would be greatly appreciated. Best Regards, Ron - Altera_Forum
Honored Contributor
Hi Ron,
--- Quote Start --- I have a new Stratix IV GX based PXI Express (PXI Express = PCIe + additional PXI backplane clock/timing strobes) board design, which uses the Altera PCIe Hard IP and DDR3 SDRAM controller. My initial test configuration uses an Avalon MM interface from the PCIe IP to the DDR3 controller. I am set up for 512 MBytes at BAR_1_0, 64 bit pre-fetchable. --- Quote End --- Is this going to be a product, or something only you use in-house? The reason I ask is that your BAR0 region is pretty large. Your board would probably stop some systems from booting. For example, I have an HP EliteBook with 16GB of RAM and an ExpressCard to PCIe motherboard setup (from OneStopSystems). If I use the Stratix IV GX Development Kit and the MegaWizard flow example with a 256MB BAR0, the machine will not boot. If I reduce the BAR0 to something more typical of a PCIe device (~1MB), it boots fine. I don't have a PCI analyzer on hand, but I suspect that the BIOS starts up with the x86 processor in 32-bit mode and simply runs out of address space to map the devices. I also have systems with PCI boards, up to 18 boards in a single cPCI chassis. If the BAR regions on each board are made too large, the 32-bit x86 host CPU runs out of address mappings. One key aspect when designing a PCI or PCIe device is to keep the BAR regions as small as you can. The BAR regions only need to be big enough for the system-slot CPU (the CPU with the PCIe root complex) to access registers on the board. Those registers are typically mailboxes and interrupts for CPU-to-CPU communications, and DMA controller control registers. On-board DMA controllers are typically used to move large volumes of data between the board and the system-slot CPU. I've been looking at the Qsys PCIe examples, and the Qsys PCIe bridge fails in this regard. 
It should have an integrated DMA controller capable of implementing scatter-gather DMA lists, with an Avalon-MM address field, a 64-bit PCIe address field, a direction bit, and data length. I haven't looked at all of the MegaWizard and AlteraWiki examples, but I suspect you really need the same kind of bridge. Cheers, Dave - Altera_Forum
Honored Contributor
Hi Dave,
This is an in-house application. Yes, I've had issues with the BAR size as well. My PXIe board actually has 1 GByte of DDR3-SDRAM, and Jungo WinDriver is unable to open my PCIe device for the full 1 GByte (although it did correctly read the BAR settings), so I am configuring the FPGA for only 512 MBytes, which works. (My PC has Win7, 32 bits. I am putting together another PC with Win7 64 bits to try to fix that problem.) The Altera PCIe User's Guide indicates that for the Qsys flow (which I am using), 32-bit BARs are only usable for "non-prefetchable memory". I figured that "prefetchable" would be the most efficient option for SDRAM, hence my 64-bit prefetchable BAR selection. As an experiment, I also set up for 32 bits non-prefetchable, 512 MBytes, and was unable to boot the PC at all. I would like to keep this as simple as possible, and avoid using DMA. I suspect that the discrepancy between read and write performance has to do with the controller not pipelining read requests, and that it is waiting for a round-trip read completion before moving on to the next read request. I'm hoping there is some way to pipeline multiple read requests to bring up block read efficiency. AN431 shows a Qsys design with BAR_1_0 access to DDR3, and BAR_2 access to the mSGDMA IP. Failing a simpler solution, I'll play around with mSGDMA ... Regards, Ron - Altera_Forum
Honored Contributor
--- Quote Start --- This is an in-house application. --- Quote End --- That gives you a little more flexibility for just getting it to work then :) --- Quote Start --- Yes, I've had issues with the BAR size as well. My PXIe board actually has 1GByte of DDR3-SDRAM, and Jungo WinDriver is unable to open my PCIe device for the full 1GBytes (although it did correctly read the BAR settings), so I am configuring the FPGA for only 512 MBytes, which works. (My PC has Win7, 32 bits. I am putting together another PC with Win7 64 bits to try to fix that problem.) --- Quote End --- Another way to deal with this is to have an incoming PCIe translation window. For example, let's say you were restricted to a 128MB BAR0 window; you could then move that window anywhere within the 1GB RAM by setting a base-address register in, say, BAR2. Unfortunately, the Qsys PCIe core does not support this type of dynamic address translation. --- Quote Start --- The Altera PCIe User's Guide indicates that for the Qsys flow (which I am using), 32 bit BARs are only usable for "non-prefetchable memory". I figured that "prefetchable" would be the most efficient option for SDRAM, hence my 64 bit prefetchable BAR selection. --- Quote End --- I don't think it makes any difference to the performance for PCIe (it did for PCI). --- Quote Start --- I suspect that the discrepancy between read and write performance has to do with the controller not pipelining read requests, and that it is waiting for a round-trip read completion before moving on to the next read request. I'm hoping there is some way to pipeline multiple read requests to bring up block read efficiency. --- Quote End --- Have you tried getting the PCIe BFM working? (It only exists in v11.0) Cheers, Dave - Altera_Forum
Honored Contributor
--- Quote Start --- Have you tried getting the PCIe BFM working? --- Quote End --- No, went straight to hardware. I played around with the Stratix IV GX eval board, first. Then I designed my PXIe board. I am using it in Geotech and National Instruments PXIe backplanes/racks, and use a National PCIe-PXIe bridge back to the PC. -Ron - Altera_Forum
Honored Contributor
It might be worth trying to read/write from one of the SIMD (SSE3) registers – it is just possible such requests will generate a single PCIe transfer.
You'll need to check whether Windows allows kernel code to use them – if not, you'll have to disable pre-emption and save/restore the register. It is rather a shame that the Altera PCIe block doesn't contain a DMA engine (a simple single-transfer one would suffice – scatter-gather could easily be built on top). For Nios code, spinning while waiting for completion (in code) would be fine – allowing some overlap but without the cost of termination interrupts. - Altera_Forum