If you go for DMA (initiated from within your FPGA), you will most likely not need a huge BAR, just enough for the system CPU (the one with the root complex) to access all your DMA registers. Our project gets by with a single 128-byte BAR.
At the beginning of the project, though, PIO accesses might let you quickly mock up your communication and get something up and running, as PIO transfers are much more straightforward to program than a DMA engine and its driver.
Nevertheless, PIO accesses come at a *huge* speed penalty, especially when the system CPU reads from your endpoint FPGA. Writes are a bit faster, but still unacceptably slow for register-like non-cached BARs.
And worst of all: all the time the CPU spends waiting for PCIe reads to finish is *lost*. That means: if the system takes, say, 1 us (microsecond) to complete your read access to the FPGA, the CPU does not execute a *single* instruction during that microsecond, because it has to keep the system in a consistent state. Only DMA with a minimum of PIO accesses lets PCIe run at its full rate. In our project with a PCIe x1 link, we saw about 1 MByte/s at 100% CPU load in PIO mode (32-bit single accesses to a non-burstable BAR) and 190 MByte/s at 3% CPU load in DMA mode. Masters of PCIe register interface design don’t need a *single* PCIe read during normal operation (they typically read only at initialization time), further reducing CPU load.